## Task 6 - Find the average number of negations in each category
**For this task, I loop over each category of tweets and check every token for negation words, negating prefixes, or the negating suffix. Each match is added to a variable named ```negation_count```, which I divide by the number of tweets in the category at the end. I store the result in a third column that I add to the dataset.**  

Response time : 10s  


In [1]:
import pandas as pd
import nltk

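# Standalone negation words. Note that 'no one' spans two tokens, so it can never
# match a single token; its 'no' part is still counted on its own.
negation_words = {'no', 'not', 'neither', 'never', 'no one', 'nobody', 'none', 'nor', 'nothing', 'nowhere'}
# Negating prefixes and suffix. A plain startswith/endswith check also matches
# non-negated words such as "into" or "bless", so the count is an upper bound.
prefixes = {'un', 'im', 'in', 'il', 'ir', 'dis'}
suffix = 'less'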

def find_average_number_of_negation_in_each_category(file_path, file_name):
    average_number_of_negation_in_each_category = []
    df = pd.read_csv("{}/{}.csv".format(file_path, file_name))
    for tweets in df["Concatenated_Tweets"]:
        negation_count = 0
        separated_tweets = tweets.replace("[", "").replace("]", "").split(',')
        for tweet in separated_tweets:
            # Lowercase tokens so that capitalized negations ("Not", "Never") also match
            tokens = [token.lower() for token in nltk.word_tokenize(tweet)]
            # Count each token at most once, even if it matches several rules (e.g. "unless")
            negation_count += sum(
                1 for token in tokens
                if token in negation_words
                or any(token.startswith(prefix) for prefix in prefixes)
                or token.endswith(suffix)
            )
        average_number_of_negation_in_each_category.append(round(negation_count/len(separated_tweets), 3))
    df["Average_Number_Of_Negation"]=average_number_of_negation_in_each_category
    return df
            
    

test_with_negation = find_average_number_of_negation_in_each_category('categories','test')
train_with_negation = find_average_number_of_negation_in_each_category('categories','train')
val_with_negation = find_average_number_of_negation_in_each_category('categories','val')
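
The counting rule described for this task can be tried on a few hand-written sentences. The sketch below is a simplified stand-in, not the cell above: it assumes a regex tokenizer instead of ```nltk.word_tokenize```, drops the two-word entry 'no one', counts each token at most once, and uses hypothetical helper names (```count_negations```, ```average_negations```) and made-up sample tweets.

```python
import re

NEGATION_WORDS = {"no", "not", "neither", "never", "nobody", "none", "nor", "nothing", "nowhere"}
PREFIXES = ("un", "im", "in", "il", "ir", "dis")
SUFFIX = "less"


def count_negations(text):
    """Count tokens that are negation words or carry a listed prefix/suffix."""
    # Assumption: a simple regex tokenizer replaces nltk.word_tokenize here.
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(
        1
        for token in tokens
        if token in NEGATION_WORDS
        or token.startswith(PREFIXES)
        or token.endswith(SUFFIX)
    )


def average_negations(tweets):
    """Average number of negation matches per tweet, rounded to 3 decimals."""
    if not tweets:
        return 0.0
    return round(sum(count_negations(tweet) for tweet in tweets) / len(tweets), 3)


# Hypothetical sample tweets, not taken from the dataset.
sample = ["i am not unhappy", "this is hopeless", "i feel fine"]
print(average_negations(sample))

# The prefix rule over-counts: "interested" and "in" both start with "in".
print(count_negations("i am interested in this"))
```

The second call shows the main limitation of prefix matching: words that merely start with a listed prefix are counted as negations, so the averages are best read as an upper bound.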

## Task 7 - Training an SVM classifier (with n-gram TF-IDF vectorization)
**I vectorized the features of the train set and the test set with character 2- and 3-grams (limited to 500 features), created the model ```svm_classifier```, and fitted it on the train set. Then I predicted the labels of the test set, computed the F1-score of each category, and displayed the scores by category.**  

Response time : 7m  


In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.svm import SVC

# import the python file
import functions_and_variables as fs

def read_txt_data_and_split(file_path, file_name, should_preprocess=False):
    df = pd.read_csv("{}/{}.txt".format(file_path, file_name), sep=";", header=None)
    df.columns=["Tweet", "Emotion"]
    
    # preprocess data if the optionally passed parameter is set to True
    if should_preprocess:
        # apply the preprocessing to every tweet and join the resulting tokens back into one string
        df["Tweet"] = df["Tweet"].apply(lambda tweet: " ".join(fs.preprocess(tweet, True)[0]))
    return df["Tweet"], df["Emotion"]

X_train, y_train = read_txt_data_and_split("data", "train", True)
X_test, y_test = read_txt_data_and_split("data", "test", True)
print("Data imported")

tfidf_vectorizer = TfidfVectorizer(ngram_range=(2,3), analyzer='char', max_features=500)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
print("Vectorization done")

svm_classifier = SVC()
print("Blank model created")

svm_classifier.fit(X_train_tfidf, y_train)
print("Model trained")

y_pred = svm_classifier.predict(X_test_tfidf)
print("Prediction done")

f1_svm = f1_score(y_test, y_pred, average=None)
category_names = svm_classifier.classes_

for category, score in zip(category_names, f1_svm):
    print(f"F1-score for {category}: {score}")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/franzi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Data imported
Vectorization done
Blank model created
Model trained
Prediction done
F1-score for anger: 0.5505376344086022
F1-score for fear: 0.5891472868217054
F1-score for joy: 0.7242848447961048
F1-score for love: 0.28571428571428575
F1-score for sadness: 0.7054963084495488
F1-score for surprise: 0.23684210526315788
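
To make the vectorizer settings above more concrete, here is a minimal pure-Python sketch of the kind of features that ```analyzer='char'``` with ```ngram_range=(2,3)``` is built on: every substring of length 2 or 3, spaces included. The function name ```char_ngrams``` is hypothetical, and the sketch only lists the n-grams; it does not compute TF-IDF weights or apply the ```max_features=500``` limit.

```python
def char_ngrams(text, n_min=2, n_max=3):
    """Return all character n-grams of length n_min..n_max, shortest first."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


# Spaces are part of the n-grams, so word boundaries become features too.
print(char_ngrams("not ok"))
```

Because the features are substrings rather than whole words, a negation such as "not" contributes the same n-grams wherever it appears, including inside longer words.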


## Task 8 - Training a classifier (with bag-of-words TF-IDF vectorization)
**I vectorized the features of the train set and the test set, created the model ```random_forest```, and fitted it on the train set. Then I predicted the labels of the test set, computed the F1-score of each category, and displayed the scores by category.**  

Response time : 1m  


In [3]:
from sklearn.ensemble import RandomForestClassifier

X_train, y_train = read_txt_data_and_split("data", "train")
X_test, y_test = read_txt_data_and_split("data", "test")
print("Data imported")

tfidf_vectorizer = TfidfVectorizer(max_features=500)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
print("Vectorization done")

random_forest = RandomForestClassifier()
print("Blank model created")

random_forest.fit(X_train_tfidf, y_train)
print("Model trained")

y_pred = random_forest.predict(X_test_tfidf)
print("Prediction done")

f1_forest = f1_score(y_test, y_pred, average=None)
category_names = random_forest.classes_

for category, score in zip(category_names, f1_forest):
    print(f"F1-score for {category}: {score}")

Data imported
Vectorization done
Blank model created
Model trained
Prediction done
F1-score for anger: 0.39999999999999997
F1-score for fear: 0.5155807365439095
F1-score for joy: 0.635593220338983
F1-score for love: 0.3228699551569507
F1-score for sadness: 0.5606425702811244
F1-score for surprise: 0.358974358974359
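
Both classifiers are scored with ```f1_score(..., average=None)```, which returns one F1-score per category instead of a single number. The sketch below recomputes such per-class scores by hand to show what each printed value means. The function name ```per_class_f1``` and the toy labels are hypothetical and are not taken from the dataset.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1-score}, treating each label in turn as the positive class."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


# Hypothetical toy labels for illustration only.
y_true = ["joy", "joy", "sadness", "anger", "sadness", "joy"]
y_pred = ["joy", "sadness", "sadness", "anger", "joy", "joy"]

for label, score in per_class_f1(y_true, y_pred).items():
    print(f"F1-score for {label}: {score:.3f}")
```

A category with few test examples, such as surprise or love here, can swing a lot from a handful of errors, which is one reason to look at the scores per category rather than at a single average.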


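**Now, let's try ```SVC()``` again with the bag-of-words features and compare the results. In the output, the first score on each line comes from the character n-gram model of Task 7 and the second from the bag-of-words model. Note that the bag-of-words features were built from the tweets without preprocessing, so the two runs differ in more than the vectorization.**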

In [4]:
svm_classifier = SVC()
print("Blank model created")

svm_classifier.fit(X_train_tfidf, y_train)
print("Model trained")

y_pred = svm_classifier.predict(X_test_tfidf)
print("Prediction done")

f1_svm_2 = f1_score(y_test, y_pred, average=None)
category_names = svm_classifier.classes_

for category, score_svm_1, score_svm_2 in zip(category_names, f1_svm, f1_svm_2):
    print(f"F1-score for {category}: {score_svm_1}, {score_svm_2}")

Blank model created
Model trained
Prediction done
F1-score for anger: 0.5505376344086022, 0.3723404255319148
F1-score for fear: 0.5891472868217054, 0.5098039215686275
F1-score for joy: 0.7242848447961048, 0.6522001205545509
F1-score for love: 0.28571428571428575, 0.2898550724637681
F1-score for sadness: 0.7054963084495488, 0.5875190258751902
F1-score for surprise: 0.23684210526315788, 0.29885057471264365
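
As a rough way to summarize the comparison printed above, the sketch below averages the per-category scores of each run (a macro average) and lists the categories where the second run scores higher. The numbers are copied from the output above and rounded to four decimals. The variable names are hypothetical and not part of the notebook.

```python
categories = ["anger", "fear", "joy", "love", "sadness", "surprise"]

# Scores copied from the printed comparison, rounded to four decimals.
f1_char_ngram = [0.5505, 0.5891, 0.7243, 0.2857, 0.7055, 0.2368]
f1_bag_of_words = [0.3723, 0.5098, 0.6522, 0.2899, 0.5875, 0.2989]

# Macro average: the unweighted mean of the per-category F1-scores.
macro_char_ngram = sum(f1_char_ngram) / len(f1_char_ngram)
macro_bag_of_words = sum(f1_bag_of_words) / len(f1_bag_of_words)

# Categories where the bag-of-words run beats the character n-gram run.
improved = [
    category
    for category, first, second in zip(categories, f1_char_ngram, f1_bag_of_words)
    if second > first
]

print(f"Macro F1, character n-grams: {macro_char_ngram:.3f}")
print(f"Macro F1, bag of words:      {macro_bag_of_words:.3f}")
print(f"Higher with bag of words:    {improved}")
```

On these numbers, only love and surprise score higher with the bag-of-words features, while the four larger categories score higher with character n-grams.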
