## Task 3
### Course: D7058E, Text Mining
### Group: Group 5 (Muhammad Raees Khan, Karl Larsson, Jonas Tollofsen, Christián Ulehla)<hr>

Task 3
1. Get any other dataset besides the IMDB and evaluate how much similarity exists between the first sentence and the immediate 19 sentences (regardless of label) by using both sentence and corpus BLEU.
2. Discuss the disadvantages of using the BLEU score.

BLEU does not account for synonyms. Because it only measures how many N-grams a candidate shares with the reference, a synonym that does not appear in the reference is penalized just like a wrong word.
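The synonym penalty can be made concrete with a minimal sketch. It assumes a single reference and uses a hypothetical helper, `clipped_precision`, that computes only the clipped N-gram precision at the core of BLEU (no brevity penalty, no smoothing). The example sentences are made up for illustration.

```python
from collections import Counter

def clipped_precision(candidate, reference, n=1):
    """Clipped n-gram precision of a candidate against one reference (hypothetical helper)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # each candidate n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

reference = "the movie was great".split()
exact = "the movie was great".split()
synonym = "the film was great".split()  # same meaning, one word swapped

p_exact = clipped_precision(exact, reference)                  # 1.0
p_synonym = clipped_precision(synonym, reference)              # 0.75
p_synonym_bigram = clipped_precision(synonym, reference, n=2)  # 1/3
print(p_exact, p_synonym, p_synonym_bigram)
```

A single swapped word costs one unigram but two of the three bigrams, so the penalty grows with the N-gram order.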

BLEU is most commonly reported at corpus level. A drawback of this is that the score only describes the generated text as a whole: the N-gram counts are pooled over all sentences, so some individual sentences may be very poor and others excellent without the single score revealing it.
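A small sketch of this pooling effect, under simplifying assumptions: only clipped unigram precision is computed (not full BLEU), the helper `unigram_counts` is hypothetical, and the two sentence pairs are invented for illustration.

```python
from collections import Counter

def unigram_counts(candidate, reference):
    """Return (clipped unigram matches, candidate length) for one sentence pair."""
    cand, ref = Counter(candidate), Counter(reference)
    matched = sum(min(count, ref[token]) for token, count in cand.items())
    return matched, len(candidate)

# (candidate, reference) pairs: one long perfect sentence, one short useless one
pairs = [
    ("the battery lasts all day and charges fast".split(),
     "the battery lasts all day and charges fast".split()),
    ("bad phone".split(), "works very well".split()),
]

counts = [unigram_counts(candidate, reference) for candidate, reference in pairs]

# sentence level: every sentence gets its own score
per_sentence = [matched / total for matched, total in counts]

# corpus level: matches and lengths are summed first, then divided once
pooled = sum(m for m, _ in counts) / sum(t for _, t in counts)

print(per_sentence)  # [1.0, 0.0]
print(pooled)        # 0.8
```

The pooled value of 0.8 looks respectable even though one of the two sentences has no overlap with its reference at all.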

When humans interpret language, we mainly try to understand what the speaker actually means. BLEU cannot take meaning into account; it only counts matching N-grams. Two sentences can therefore share most of their N-grams yet carry different meanings, and BLEU will still rate them as similar. For example, "I hate this book so much" and "I love this book so much" differ in a single word and would be considered fairly similar by BLEU. There are ways to mitigate this, such as weighting uncommon words more heavily than common ones, but this introduces other problems. For example, a generated text that replaces a very uncommon word with a synonym would be penalized heavily even though it carries the same meaning.
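The "hate"/"love" example can be checked with a short sketch. It is a simplified BLEU, assuming one reference, uniform weights over 1- to 4-grams and no smoothing; `ngrams` and `simple_bleu` are hypothetical helper names, not NLTK functions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    """Uniformly weighted BLEU for one candidate and one reference, no smoothing."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        precisions.append(matched / total)
    # brevity penalty: only candidates shorter than the reference are punished
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "i hate this book so much".split()
candidate = "i love this book so much".split()

score = simple_bleu(candidate, reference)
print(round(score, 3))  # 0.537
```

The precisions are 5/6, 3/5, 2/4 and 1/3, so the score stays above 0.5 although the two sentences express opposite sentiments.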

Additionally, BLEU can struggle with languages other than English, for example morphologically rich languages in which the same word takes different forms to express different meanings. As mentioned above, BLEU compares words rather than meaning, so it struggles to take this into account: two forms of the same word simply count as a mismatch.

Because BLEU relies on tokens and N-grams to compute the score, it is difficult to use on unsegmented languages, which do not separate words with the white space we normally use for tokenization. A greedy segmentation algorithm can be used, but there is a chance that it generates the wrong tokens, which in turn gives a faulty BLEU score.
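To illustrate how greedy segmentation can go wrong, here is a minimal sketch of forward maximum matching, one common greedy strategy (an assumption, since the text does not name a specific algorithm). English with the spaces removed stands in for an unsegmented language; the function name `greedy_segment` and the toy vocabulary are hypothetical.

```python
def greedy_segment(text, vocabulary):
    """Forward maximum matching: at each position take the longest known word."""
    max_len = max(len(word) for word in vocabulary)
    tokens, pos = [], 0
    while pos < len(text):
        # try the longest possible piece first and shrink until a word matches;
        # an unknown single character is emitted as its own token
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in vocabulary or length == 1:
                tokens.append(piece)
                pos += length
                break
    return tokens

vocabulary = {"the", "theta", "table", "bled", "down", "own", "there"}
intended = ["the", "table", "down", "there"]

tokens = greedy_segment("thetabledownthere", vocabulary)
print(tokens)  # ['theta', 'bled', 'own', 'there']

# only one of the four produced tokens also occurs in the intended segmentation
shared = [token for token in tokens if token in intended]
print(shared)  # ['there']
```

Every produced token is a valid vocabulary word, yet three of the four are wrong, so any N-gram comparison against a correctly segmented reference would score this sentence poorly for reasons unrelated to its content.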

There are, however, also advantages. BLEU is computationally very cheap. There are more accurate ways to evaluate the output of a model, but BLEU can at least give us a "feeling" for how good our model is. It should not be the sole metric used to evaluate the performance of a model; we should either use a different metric or, preferably, combine BLEU with other metrics.


Grading questions (in addition to any relevant ones not listed below):

1. Run your code, go through it and explain what it does.
2. Is there any part of the code that may be optimized (made better)?
3. What other challenges can this solution be adapted for?
4. Would you like to discuss the time & space complexities of parts of your solution (optional)?

In [None]:
#if the code doesn't work, you need to upload the file from the dataset here
#dataset with sentences => https://www.kaggle.com/datasets/rahulin05/sentiment-labelled-sentences-data-set
#subfile: amazon_cells_labelled.txt => changed .txt into .csv
from google.colab import files
uploaded = files.upload()

Saving amazon_cells_labelled.csv to amazon_cells_labelled.csv


# Imports

In [None]:
import pandas as pd
import spacy
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
from tabulate import tabulate
import io

# Preprocessing

In [None]:
#Load the English pipeline language model once and assign it to a variable
nlp = spacy.load("en_core_web_sm")
#the same pipeline is used for tokenization, so reuse it instead of loading the model twice
tokenizer = nlp

#Being able to print full column content, not truncating the results
pd.set_option('display.max_colwidth', 110)

#read file with specific delimiter and only the column with sentences
df = pd.read_csv(io.BytesIO(uploaded['amazon_cells_labelled.csv']), sep=';', header=None, usecols=[0])
#rename sentences column
df.rename(columns={0: 'sentence'}, inplace=True)
#sample 100 sentences out of the dataset
#df = df.sample(n=100) 
df = df.head(100)
    
#to lowercase
df['sentence'] = df['sentence'].str.lower()
#remove line endings (raw strings avoid invalid-escape warnings in the regex patterns)
df = df.replace(r'[.?!]+$', '', regex=True)
#remove quotes
df = df.replace(r'"+', '', regex=True)
#replace sentence breaks with spaces
df = df.replace(r'[.,]+', ' ', regex=True)
#replace spaces with one space
df.sentence = df.sentence.replace(r'\s+', ' ', regex=True)

#removes label for sentiment that's present at the end of the sentence
df.sentence = df.sentence.str[:-1]

#create tokens column: one list of token strings per sentence
#assigning the whole column at once avoids chained assignment (df['tokens'][index] = ...),
#which pandas may apply to a temporary copy instead of df
df['tokens'] = [[token.text for token in tokenizer(sentence)] for sentence in df['sentence']]


# BLEU

In [None]:
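#iterate over rows and get BLEU of current row according to next 19 rows
#the last 19 rows have fewer than 19 following sentences, so they are left empty for now
chencherry = SmoothingFunction()
sentence_scores = []
for index in range(len(df)):
    if index >= len(df) - 19:
        sentence_scores.append('')
        continue
    #the 19 sentences that follow the current one act as references
    references = df['tokens'].iloc[index+1:index+20].tolist()
    sentence_scores.append(sentence_bleu(references, df['tokens'].iloc[index], weights=(0.25,0.25,0.15,0.35), smoothing_function=chencherry.method1))
df['sentence_bleu'] = sentence_scores

#do a corpus bleu for the whole dataset: the first 50 rows are references, the rest are candidates
#corpus_bleu expects one list of references per candidate, so each reference is wrapped in its own list
#(passing the bare token lists would make NLTK treat every word as a reference made of characters)
list_of_references = [[tokens] for tokens in df['tokens'].iloc[:50]]
candidates = df['tokens'].iloc[50:].tolist()
corpus_scores = [''] * len(df)
corpus_scores[0] = corpus_bleu(list_of_references, candidates, weights=(0.5,0.25,0.15,0.1), smoothing_function=chencherry.method1)
df['corpus_bleu'] = corpus_scores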

print(tabulate(df, headers='keys', tablefmt='psql'))

+----+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+-----------------------+
|    | sentence                                                                                                                                            | tokens                                                                                                                                                                                                                                                | sentence_bleu        | corpus_bleu           |
|----+------------------------------------------------------------------------------------------