## TF-IDF exercise

In [1]:
#import pandas library
import pandas as pd

#read the dataset with name "Emotion_classify_Data.csv" and store it in a variable df
df = pd.read_csv("../Datasets/Emotion_classify_Data.csv")

#print the shape of dataframe
print(df.shape)

#print top 5 rows
print(df.head())

(5937, 2)
                                             Comment Emotion
0  i seriously hate one subject to death but now ...    fear
1                 im so full of life i feel appalled   anger
2  i sit here to write i start to dig out my feel...    fear
3  ive been really angry with r and i feel like a...     joy
4  i feel suspicious if there is no one outside l...    fear


In [4]:
#check the distribution of Emotion
df.Emotion.value_counts()

Emotion  count
anger     2000
joy       2000
fear      1937


In [5]:
#Add the new column "Emotion_num" which gives a unique number to each of these Emotions
#joy --> 0, fear --> 1, anger --> 2
df['Emotion_num'] = df['Emotion'].map({
    'joy' : 0,
    'fear': 1,
    'anger': 2
})

#checking the results by printing top 5 rows
df.head()

                                             Comment Emotion  Emotion_num
0  i seriously hate one subject to death but now ...    fear            1
1                 im so full of life i feel appalled   anger            2
2  i sit here to write i start to dig out my feel...    fear            1
3  ive been really angry with r and i feel like a...     joy            0
4  i feel suspicious if there is no one outside l...    fear            1
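One thing worth checking after a dictionary-based mapping: `Series.map` returns NaN for any label that is missing from the dictionary. The sketch below uses a small made-up frame (`toy` and `label_map` are illustrative names, not part of the notebook) to show a quick sanity check.

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical rows).
toy = pd.DataFrame({"Emotion": ["joy", "fear", "anger", "joy"]})

label_map = {"joy": 0, "fear": 1, "anger": 2}
toy["Emotion_num"] = toy["Emotion"].map(label_map)

# Series.map yields NaN for any label missing from the dict,
# so counting NaNs catches typos or unexpected classes early.
unmapped = int(toy["Emotion_num"].isna().sum())
print("unmapped labels:", unmapped)
```

The same check could be run on `df['Emotion_num']` right after the mapping cell.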


### Modelling without Pre-processing Text data

In [13]:
#import train-test split
from sklearn.model_selection import train_test_split

#Do the train-test split with a test size of 20%
#Note: use random_state=2022 and stratified sampling
X_train, X_test, y_train, y_test = train_test_split(
    df.Comment,
    df.Emotion_num,
    test_size=0.2,
    random_state=2022,
    stratify=df.Emotion_num # keep the same class proportions in the train and test sets
)

In [15]:
#print the shapes of X_train and X_test
print(X_train.shape)
print(X_test.shape)

(4749,)
(1188,)
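To make the effect of `stratify` concrete, here is a minimal sketch on made-up, deliberately imbalanced labels (the data and variable names are assumptions for illustration only). With stratification, the train and test sets keep the same class mix as the full data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, deliberately imbalanced labels (hypothetical data): 60 / 30 / 10.
labels = pd.Series([0] * 60 + [1] * 30 + [2] * 10)
texts = pd.Series([f"comment {i}" for i in range(len(labels))])

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=2022, stratify=labels
)

# Class shares in each split, ordered by class label.
train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(train_share.round(2).tolist())
print(test_share.round(2).tolist())
```

In the notebook itself, the same effect shows up in the support column of the classification reports: the 1188 test rows split into 400, 388 and 400 per class, mirroring the 2000 / 1937 / 2000 distribution of the full dataset.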


Import the required libraries.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

#### Attempt 1 :

Using the sklearn Pipeline class, create a classification pipeline to classify the data.

Note:

- Use CountVectorizer with only trigrams.
- Use RandomForest as the classifier.
- Print the classification report.

In [18]:
#1. create a pipeline object
model = Pipeline([
     ('CV',CountVectorizer(ngram_range = (3,3))),
     ('RF', RandomForestClassifier())
])

#2. fit with X_train and y_train
model.fit(X_train, y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred = model.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.60      0.27      0.37       400
           1       0.37      0.81      0.51       388
           2       0.54      0.21      0.30       400

    accuracy                           0.43      1188
   macro avg       0.50      0.43      0.39      1188
weighted avg       0.50      0.43      0.39      1188
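A plausible reason for the weak trigram-only result is sparsity: an exact three-word sequence from the training set rarely reappears in unseen text, so many test comments end up with few or no active features. The sketch below illustrates this on made-up sentences (not taken from the dataset) by counting how many features of an unseen sentence match the fitted vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentences, not taken from the dataset.
train_docs = ["i feel so happy today", "i feel really scared now"]
unseen_doc = ["today i feel happy"]

# Note: CountVectorizer's default tokenizer drops one-character tokens such as "i".
matches = {}
for ngram_range in [(1, 2), (3, 3)]:
    vec = CountVectorizer(ngram_range=ngram_range)
    vec.fit(train_docs)
    # Number of non-zero features the unseen sentence shares with the vocabulary.
    matches[ngram_range] = vec.transform(unseen_doc).nnz
    print(ngram_range, "vocabulary size:", len(vec.vocabulary_),
          "matched features:", matches[ngram_range])
```

With unigrams and bigrams the unseen sentence still activates its individual words, while with trigrams only it activates nothing, leaving the classifier with an all-zero row.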



#### Attempt 2 :

Using the sklearn Pipeline class, create a classification pipeline to classify the data.

Note:

- Use CountVectorizer with both unigrams and bigrams.
- Use Multinomial Naive Bayes as the classifier.
- Print the classification report.

In [19]:
#1. create a pipeline object
model = Pipeline([
     ('CV',CountVectorizer(ngram_range = (1,2))),
     ('MNB', MultinomialNB())
])

#2. fit with X_train and y_train
model.fit(X_train, y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred = model.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.86      0.87       400
           1       0.87      0.83      0.85       388
           2       0.83      0.88      0.85       400

    accuracy                           0.86      1188
   macro avg       0.86      0.86      0.86      1188
weighted avg       0.86      0.86      0.86      1188



#### Attempt 3 :

Using the sklearn Pipeline class, create a classification pipeline to classify the data.

Note:

- Use CountVectorizer with both unigrams and bigrams.
- Use RandomForest as the classifier.
- Print the classification report.

In [20]:
#1. create a pipeline object
model = Pipeline([
     ('CV', CountVectorizer(ngram_range = (1,2))),
     ('RF', RandomForestClassifier())
])

#2. fit with X_train and y_train
model.fit(X_train, y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred = model.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.97      0.91       400
           1       0.94      0.88      0.91       388
           2       0.94      0.86      0.90       400

    accuracy                           0.91      1188
   macro avg       0.91      0.91      0.91      1188
weighted avg       0.91      0.91      0.91      1188



#### Attempt 4 :

Using the sklearn Pipeline class, create a classification pipeline to classify the data.

Note:

- Use the TF-IDF vectorizer to convert the text into features.
- Use RandomForest as the classifier.
- Print the classification report.

In [21]:
#1. create a pipeline object
model = Pipeline([
     ('TF-IDF', TfidfVectorizer()),
     ('RF', RandomForestClassifier())
])

#2. fit with X_train and y_train
model.fit(X_train, y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred = model.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.95      0.90       400
           1       0.92      0.90      0.91       388
           2       0.95      0.87      0.91       400

    accuracy                           0.91      1188
   macro avg       0.91      0.91      0.91      1188
weighted avg       0.91      0.91      0.91      1188
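For reference, with its default settings `TfidfVectorizer` multiplies raw term counts by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then L2-normalises each row. The sketch below recomputes this by hand on a made-up corpus and compares it with the library output; the corpus and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["i feel happy", "i feel scared", "happy happy day"]  # hypothetical toy corpus

# Raw term counts (term frequency) per document.
tf = CountVectorizer().fit_transform(docs).toarray().astype(float)

n_docs = tf.shape[0]
df_t = (tf > 0).sum(axis=0)                   # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df_t)) + 1   # smoothed idf, the scikit-learn default

manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalise each row

tfidf = TfidfVectorizer()
auto = tfidf.fit_transform(docs).toarray()
print(np.allclose(manual, auto))
```

Terms that occur in many documents get a lower idf and therefore less weight than rare terms, which is the main difference from plain counts.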



### Use text pre-processing to remove stop words and punctuation and apply lemmatization

In [23]:
import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm")


#use this utility function to get the preprocessed text data
def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)

    return " ".join(filtered_tokens)

In [24]:
# create a new column "preprocessed_Comment" and use the utility function above to get the clean data
# this will take some time, please be patient
df['preprocessed_Comment'] = df['Comment'].apply(preprocess)

Build a model with pre-processed text

In [25]:
#Do the train-test split with a test size of 20%, random_state=2022 and stratified sampling
#Note: Use the preprocessed_Comment column
X_train, X_test, y_train, y_test = train_test_split(
    df.preprocessed_Comment,
    df.Emotion_num,
    test_size = 0.2,
    random_state = 2022,
    stratify = df.Emotion_num
)

#### Attempt 1 :

Using the sklearn Pipeline class, create a classification pipeline to classify the data.

Note:

- Use CountVectorizer with both unigrams and bigrams.
- Use RandomForest as the classifier.
- Print the classification report.

In [26]:
#1. create a pipeline object
model = Pipeline([
     ('CV', CountVectorizer(ngram_range = (1,2))),
     ('RF', RandomForestClassifier())
])

#2. fit with X_train and y_train
model.fit(X_train, y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred = model.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.95      0.95       400
           1       0.93      0.91      0.92       388
           2       0.92      0.93      0.93       400

    accuracy                           0.93      1188
   macro avg       0.93      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188



#### Attempt 2 :

Using the sklearn Pipeline class, create a classification pipeline to classify the data.

Note:

- Use the TF-IDF vectorizer to convert the text into features.
- Use RandomForest as the classifier.
- Print the classification report.

In [27]:
#1. create a pipeline object
model = Pipeline([
     ('TF-IDF', TfidfVectorizer()),
     ('RF', RandomForestClassifier())
])

#2. fit with X_train and y_train
model.fit(X_train, y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred = model.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.96      0.94       400
           1       0.94      0.91      0.92       388
           2       0.92      0.92      0.92       400

    accuracy                           0.93      1188
   macro avg       0.93      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188



### Own Final Observations

**Without pre-processing:**

**Attempt 1**, using CountVectorizer with only trigrams
and RandomForest as the classifier.
- Gives very poor performance (macro F1 of 0.39, accuracy of 0.43).

**Attempt 2**, using CountVectorizer with both unigrams and bigrams
and Multinomial Naive Bayes as the classifier.
- Gives much better performance than Attempt 1 (macro F1 of 0.86 versus 0.39).
- Switching to Multinomial Naive Bayes together with unigrams and bigrams improves the performance; because both changed at once, this comparison alone does not show how much each change contributed.

**Attempt 3**, using CountVectorizer with both unigrams and bigrams
and RandomForest as the classifier.
- Gives good performance, with per-class F1 scores of 0.90-0.91.
- Switching back to RandomForest while keeping unigrams and bigrams improves on Attempt 2 (accuracy of 0.91 versus 0.86).

**Attempt 4**, using the TF-IDF vectorizer to convert the text into features
and RandomForest as the classifier.
- Gives similar performance to Attempt 3.
- Switching from CountVectorizer with unigrams and bigrams to the TF-IDF vectorizer with unigrams only (default ngram_range=(1, 1)) still gives similar performance.

**With pre-processing:**

- Pre-processing the text takes a lot of time because of the size of the dataset (5937 comments).

**Attempt 1**, using CountVectorizer with both unigrams and bigrams
and RandomForest as the classifier.
- Improves on Attempt 3 without pre-processing (macro F1 of 0.93 versus 0.91).
- The model performs better when the Comment column is pre-processed first.

**Attempt 2**, using the TF-IDF vectorizer to convert the text into features
and RandomForest as the classifier.
- Gives similar performance to Attempt 1.
- Pre-processing the Comment column first improves the F1 score by around 0.02 for both this attempt and Attempt 1.


### Final Observations

As part of this exercise we trained models with Multinomial Naive Bayes and Random Forest, two algorithms that are widely used and often give good results on text classification problems.

As machine learning algorithms do not work on text data directly, we need to convert the text into numeric vectors and feed those to the models during training. For this purpose, we used Bag of Words (unigrams, bigrams, n-grams) and TF-IDF text representations.

Key Findings

- Using only higher-order n-grams hurt performance sharply: with the same Random Forest classifier, trigrams alone reached 0.43 accuracy, against 0.91 for unigrams plus bigrams.

- Results improved after pre-processing the data: with Random Forest, accuracy and macro F1 rose from 0.91 to 0.93.

- TF-IDF and Bag of Words performed equally well on metrics such as recall and F1-score.

- Random Forest outperformed Multinomial Naive Bayes on the same unigram and bigram features (0.91 versus 0.86 accuracy).
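The comparisons behind these findings could also be collected in a single loop instead of separate cells. The sketch below is one possible way to do that, shown on a tiny made-up corpus so that it runs on its own; the data, the `configs` dictionary, and the fixed `random_state=0` are assumptions (the notebook does not set a random state for the classifier), and the scores it prints say nothing about the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data; in the notebook these would be X_train, y_train, X_test, y_test.
train_texts = ["feel happy today", "so happy and glad", "feel scared tonight",
               "scared and afraid", "feel angry now", "so angry and mad"]
train_labels = [0, 0, 1, 1, 2, 2]
test_texts = ["happy and glad today", "afraid and scared", "mad and angry"]
test_labels = [0, 1, 2]

# One entry per attempt: (vectorizer, classifier).
configs = {
    "CV(3,3) + RF": (CountVectorizer(ngram_range=(3, 3)), RandomForestClassifier(random_state=0)),
    "CV(1,2) + MNB": (CountVectorizer(ngram_range=(1, 2)), MultinomialNB()),
    "CV(1,2) + RF": (CountVectorizer(ngram_range=(1, 2)), RandomForestClassifier(random_state=0)),
    "TF-IDF + RF": (TfidfVectorizer(), RandomForestClassifier(random_state=0)),
}

scores = {}
for name, (vectorizer, classifier) in configs.items():
    model = Pipeline([("vec", vectorizer), ("clf", classifier)])
    model.fit(train_texts, train_labels)
    preds = model.predict(test_texts)
    # Macro F1 averages the per-class F1 scores with equal weight.
    scores[name] = f1_score(test_labels, preds, average="macro")

for name, score in scores.items():
    print(f"{name}: macro F1 = {score:.2f}")
```

Keeping every configuration in one dictionary makes it easier to change one factor at a time (vectorizer, n-gram range, or classifier) and compare like with like.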



---

