We want to test the quality of data augmentation on classification performance. For this purpose, consider the classes with small sample size (categories Disgust and Anticipation), identify from Kaggle or elsewhere labeled dataset of the same categories And add it to these two categories. Test the SVM classifier in Task 7 and discuss whether the overall performance can be increased or not. 

The dataset doesn't contain categories Disgust and Anticipation. So here I use surprise and love. 

The new dataset is not yet preprocessed. So we have to remove stopwords, and perform stemming.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import random
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import *
from nltk import word_tokenize
import string

def preprocess_to_sentence(sentence):
    stop_words = list(set(stopwords.words('english')))
    stop_words.extend(['im', 'ive', 'id', 'dont','cant', 'wont'])
    words = word_tokenize(sentence)
    words = [word.lower() for word in words if word not in string.punctuation and word not in stop_words]
    stemmer = PorterStemmer()
    words = [stemmer.stem(word) for word in words]
    preprocessed_sentence = ' '.join(words)
    
    return preprocessed_sentence

For each of dataset, we select 100 samples that in the categories love and surprise.

In [2]:

def get_sample_data_for_categories(file_name,categories_to_extract):
    data = pd.read_csv('categories/{}.csv'.format(file_name))
    sample_size = 100 / len(categories_to_extract)
    
    df = pd.DataFrame()
    for category in categories_to_extract:
        index = data[data['Category'] == category].index[0]
        tweets = data['Concatenated_Tweets'][index].split(',')
        if len(tweets) >= int(sample_size):
            sampled_tweets = random.sample(tweets, int(sample_size))
        else:
            sampled_tweets = tweets
        
        for tweet in sampled_tweets:
            df = pd.concat([df, pd.DataFrame([{'category': category, 'tweet': tweet}])])
    return df

def get_sample_data_for_categories_addition(categories_to_extract):
    data = pd.read_csv('data/{}.csv'.format('text_emotion'))
    data['pre_process'] = data['content'].apply(preprocess_to_sentence)
    data = data.dropna(subset=['pre_process'])
    sample_size = 100
    sampled_data = data[data['sentiment'].isin(categories_to_extract)].sample(n=sample_size)
    selected_columns = sampled_data[['sentiment', 'pre_process']]

    selected_columns = selected_columns.rename(columns={'sentiment': 'category', 'pre_process': 'tweet'})
    return selected_columns
        
categories_to_extract = ['love', 'surprise']
sample_data = get_sample_data_for_categories('complete_preprocessed',categories_to_extract)
sample_data_addition = get_sample_data_for_categories_addition(categories_to_extract)


data = pd.concat([sample_data, sample_data_addition], ignore_index=True)
shuffled_df = data.sample(frac=1, random_state=42).reset_index(drop=True)

shuffled_df

Unnamed: 0,category,tweet
0,surprise,'feel like need emphas impress color'
1,love,'feel love chang modern mother'
2,love,'know feel like legitem like someon somehow g...
3,surprise,happi star war day ohhhh ... i get may 4th lov...
4,love,curious_jo ... i like freedom enigma give
...,...,...
195,surprise,warn tweet rid bike dangero waaaaaaa crash
196,love,'much lighter feel extrem passion life ye'
197,surprise,'feel impress happi thing like use'
198,surprise,today interest ...


In [3]:
import os
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import f1_score

Vectorizedd the features and split the dataset in to training and testing set. Then, we created svm model to fitted it. The result will be used to calculate the f1-score and compare it with the result of the Task 7.

In [4]:
tfidf_vectorizer = TfidfVectorizer(max_features=1000)
x = tfidf_vectorizer.fit_transform(data['tweet'])

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(data['category'])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

svm_classifier = SVC(kernel='linear', C=1.0, random_state=42)
svm_classifier.fit(x_train, y_train)

y_pred = svm_classifier.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=label_encoder.classes_, output_dict=True)
f1 = f1_score(y_test, y_pred, average=None)


print("Accuracy:", accuracy)

df_report = pd.DataFrame(report).transpose()
df_report.to_csv('results/classification_report.csv', index=True)
print(df_report)


Accuracy: 0.65
              precision    recall  f1-score  support
love           0.533333  1.000000  0.695652    16.00
surprise       1.000000  0.416667  0.588235    24.00
accuracy       0.650000  0.650000  0.650000     0.65
macro avg      0.766667  0.708333  0.641944    40.00
weighted avg   0.813333  0.650000  0.631202    40.00


Compared to task 7, we can see that the F1 score has improved.<br>
F1 score before the data augmentation:<br>
F1-score for surprise: 0.23684<br>
F1-score for love: 0.26794<br>

F1 score after the data augmentation:<br>
F1-score for surprise: 0.588235<br>
F1-score for love: 0.695652<br>

