## Task 6 - Find average number of negations in each category
**For this, I loop over each category of tweets and check every token for negation words, negative prefixes, or the negative suffix. Each match is added to a variable named ```negation_count```, which I divide by the number of tweets in the category at the end. The resulting averages are stored in a third column that I add to the dataset.**  

Response time : 18s  


In [1]:
import pandas as pd
import nltk

# Negation cues: standalone words, common negative prefixes, and the "-less" suffix.
# Note: 'no one' never matches a single token, because the tokenizer splits it into 'no' and 'one'.
negation_words = {'no', 'not', 'neither', 'never', 'no one', 'nobody', 'none', 'nor', 'nothing', 'nowhere'}
prefixes = {'un', 'im', 'in', 'il', 'ir', 'dis'}
suffix = 'less'

def find_average_number_of_negation_in_each_category(file_path, file_name):
    average_number_of_negation_in_each_category = []
    df = pd.read_csv("{}/{}.csv".format(file_path, file_name))
    # Each row holds all tweets of one category as a bracketed, comma-separated string.
    for tweets in df["Concatenated_Tweets"]:
        negation_count = 0
        separated_tweets = tweets.replace("[", "").replace("]", "").split(',')
        for tweet in separated_tweets:
            tokens = nltk.word_tokenize(tweet)
            # Each rule is counted separately, so a token such as "unless" is counted twice.
            negation_count += sum(1 for token in tokens if token in negation_words)
            negation_count += sum(1 for token in tokens if any(token.startswith(prefix) for prefix in prefixes))
            negation_count += sum(1 for token in tokens if token.endswith(suffix))
        # Average number of negation cues per tweet in this category.
        average_number_of_negation_in_each_category.append(round(negation_count/len(separated_tweets), 3))
    df["Average_Number_Of_Negation"]=average_number_of_negation_in_each_category
    return df
            
    

test_with_negation = find_average_number_of_negation_in_each_category('categories','test')
train_with_negation = find_average_number_of_negation_in_each_category('categories','train')
val_with_negation = find_average_number_of_negation_in_each_category('categories','val')
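
The counting rules can be checked on a few hand-written tweets. The block below is only a minimal sketch: `count_negations`, `average_negations` and `toy_tweets` are hypothetical names, and `str.split` stands in for `nltk.word_tokenize` so that the example runs without any download. It applies the same three rules (word, prefix, suffix) and the same per-tweet average.

```python
# Same cue lists as in the cell above ('no one' is left out: it can never match one token).
NEGATION_WORDS = {"no", "not", "neither", "never", "nobody", "none", "nor", "nothing", "nowhere"}
PREFIXES = ("un", "im", "in", "il", "ir", "dis")
SUFFIX = "less"

def count_negations(tokens):
    """Apply the three rules independently; a token matching two rules counts twice."""
    words = sum(1 for t in tokens if t in NEGATION_WORDS)
    prefixed = sum(1 for t in tokens if t.startswith(PREFIXES))
    suffixed = sum(1 for t in tokens if t.endswith(SUFFIX))
    return words + prefixed + suffixed

def average_negations(tweets):
    """Average number of negation cues per tweet (whitespace tokenization, an assumption)."""
    total = sum(count_negations(tweet.split()) for tweet in tweets)
    return round(total / len(tweets), 3)

toy_tweets = ["i am not unhappy", "this is hopeless", "i feel fine in here"]
print(average_negations(toy_tweets))   # 4 cues over 3 tweets -> 1.333
print(count_negations(["unless"]))     # matches both "un" and "less" -> 2
```

The third toy tweet shows a limitation of plain prefix matching: the preposition "in" starts with the prefix "in" and is therefore counted as a negation, as would words like "interesting" or "important".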

## Task 7 - Training an SVM classifier (with n-gram TF-IDF vectorization)
**I vectorized the features of the train set and the test set with TF-IDF over character n-grams, created the model ```svm_classifier```, and fitted it on the train set. Then I predicted the labels of the test set, computed the F1-score of each category, and displayed the scores by category.**  

Response time : 11m  
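
To make the vectorization step more concrete, here is a rough sketch of the features produced by `analyzer='char'` with `ngram_range=(2,3)`. The helper `char_ngrams` is a hypothetical name and only approximates what scikit-learn does internally (lowercasing, whitespace normalization, then a sliding window); the real vectorizer additionally weights these n-grams with TF-IDF and, with `max_features=500`, keeps only the 500 most frequent ones.

```python
def char_ngrams(text, min_n=2, max_n=3):
    """Return every character n-gram of `text` with min_n <= n <= max_n."""
    # Lowercase and collapse runs of whitespace into single spaces.
    text = " ".join(text.lower().split())
    return [
        text[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(text) - n + 1)
    ]

print(char_ngrams("not ok"))
# ['no', 'ot', 't ', ' o', 'ok', 'not', 'ot ', 't o', ' ok']
```

Because the n-grams are at most three characters long, a whole word such as "never" is never seen as a single feature, only as overlapping fragments.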


In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.svm import SVC

def read_txt_data_and_split(file_path, file_name):
    df = pd.read_csv("{}/{}.txt".format(file_path, file_name), sep=";", header=None)
    df.columns=["Tweet", "Emotion"]
    return df["Tweet"], df["Emotion"]

X_train, y_train = read_txt_data_and_split("data", "train")
X_test, y_test = read_txt_data_and_split("data", "test")
print("Data imported")


tfidf_vectorizer = TfidfVectorizer(ngram_range=(2,3), analyzer='char', max_features=500)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
print("Vectorization done")

svm_classifier = SVC()
print("Blank model created")

svm_classifier.fit(X_train_tfidf, y_train)
print("Model trained")

y_pred = svm_classifier.predict(X_test_tfidf)
print("Prediction done")

f1 = f1_score(y_test, y_pred, average=None)
print(f1)

Data imported
Vectorization done
Blank model created
Model trained
Prediction done
[0.43309002 0.49562682 0.68173706 0.21649485 0.63790447 0.10958904]


In [3]:
category_names = svm_classifier.classes_

for category, score in zip(category_names, f1):
    print(f"F1-score for {category}: {score}")

F1-score for anger: 0.4330900243309002
F1-score for fear: 0.49562682215743437
F1-score for joy: 0.6817370612730518
F1-score for love: 0.21649484536082475
F1-score for sadness: 0.6379044684129429
F1-score for surprise: 0.10958904109589043
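
As a quick way to summarize the six scores above, the sketch below computes their unweighted mean (the macro-averaged F1) and picks out the weakest category. The dictionary `reported_f1` is a hypothetical name that simply copies the printed values; in the notebook itself the same numbers are available in `f1` and `category_names`.

```python
# Per-category F1-scores copied from the output above.
reported_f1 = {
    "anger": 0.43309002,
    "fear": 0.49562682,
    "joy": 0.68173706,
    "love": 0.21649485,
    "sadness": 0.63790447,
    "surprise": 0.10958904,
}

# Macro average: every category has the same weight, whatever its size.
macro_f1 = sum(reported_f1.values()) / len(reported_f1)
weakest = min(reported_f1, key=reported_f1.get)

print(f"Macro-averaged F1: {macro_f1:.3f}")  # Macro-averaged F1: 0.429
print(f"Weakest category: {weakest}")        # Weakest category: surprise
```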
