<a href="https://colab.research.google.com/github/newtonxp/Natural_language_processing/blob/main/tf_idf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **TF-IDF**

- Humans 👦 show different emotions depending on the situation and communicate them through facial expressions or words.

- On social media platforms such as Twitter and Instagram, many people comment on a particular event or scenario, and these comments may express feelings such as sadness, happiness, joy, sarcasm, fear, and many others.

- Given a comment, we are going to use classical NLP techniques to classify which emotion it expresses.

- We are going to use text representation techniques such as Bag of Words, n-grams, and TF-IDF, and apply different classification algorithms.

### **About Data: Emotion Detection**

Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp


- The data consists of two columns:
    - Comment
    - Emotion
- Comment holds the statements or messages about a particular event or situation.

- Emotion tells whether the given comment expresses fear 😨, anger 😡, or joy 😂.

- As there are 3 classes, this is a **multi-class classification** problem.

In [None]:
#import pandas library
import pandas as pd


#read the dataset with name "Emotion_classify_Data.csv" and store it in a variable df
df = pd.read_csv("Emotion_classify_Data.csv")

#print the shape of dataframe
print(df.shape)

#print top 5 rows
df.head()

(5937, 2)


Unnamed: 0,Comment,Emotion
0,i seriously hate one subject to death but now ...,fear
1,im so full of life i feel appalled,anger
2,i sit here to write i start to dig out my feel...,fear
3,ive been really angry with r and i feel like a...,joy
4,i feel suspicious if there is no one outside l...,fear


In [None]:
#check the distribution of Emotion
df.Emotion.value_counts()

anger    2000
joy      2000
fear     1937
Name: Emotion, dtype: int64

In [None]:
#Add the new column "Emotion_nums" which gives a unique number to each of these Emotions
df["Emotion_nums"] = df['Emotion'].map({
    'joy': 0,
    'fear': 1,
    'anger': 2
})

#checking the results by printing top 5 rows
df.head()

Unnamed: 0,Comment,Emotion,Emotion_nums
0,i seriously hate one subject to death but now ...,fear,1
1,im so full of life i feel appalled,anger,2
2,i sit here to write i start to dig out my feel...,fear,1
3,ive been really angry with r and i feel like a...,joy,0
4,i feel suspicious if there is no one outside l...,fear,1
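
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a misspelled or unexpected label would pass silently. Below is a small sanity check, sketched on a hypothetical toy frame. The label "Joy" with a capital J is planted on purpose and does not come from the dataset.

```python
import pandas as pd

# Same mapping as in the cell above
emotion_to_num = {"joy": 0, "fear": 1, "anger": 2}

# Hypothetical toy frame; "Joy" (capitalised) is planted to show an unmapped label
toy = pd.DataFrame({"Emotion": ["joy", "fear", "anger", "Joy"]})
toy["Emotion_nums"] = toy["Emotion"].map(emotion_to_num)

# Rows whose label was not found in the dictionary end up as NaN
unmapped = toy[toy["Emotion_nums"].isna()]
print(unmapped)
```

On the real data, the same check would be `df[df["Emotion_nums"].isna()]`, which should come back empty.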


### **Modelling without Pre-processing Text data**

In [None]:
#import train-test split
from sklearn.model_selection import train_test_split

#Do the 'train-test' splitting with test size of 20%
#Note: Give Random state 0 and also do the stratify sampling on df.Emotion_nums
x_train, x_test, y_train, y_test = train_test_split(df.Comment, df.Emotion_nums, test_size=0.2, random_state=0, stratify=df.Emotion_nums)


In [None]:
#print the shapes of x_train and x_test
print(x_train.shape)
print(x_test.shape)


(4749,)
(1188,)


In [None]:
y_train.value_counts()

2    1600
0    1600
1    1549
Name: Emotion_nums, dtype: int64
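
The training counts above (1600, 1600, 1549) are about 80% of the full counts (2000, 2000, 1937), which is what `stratify` is meant to achieve. Here is a minimal sketch of the same effect on a hypothetical, deliberately imbalanced toy frame. The column names `text` and `label` are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a 3:2 class imbalance (30 of class 0, 20 of class 1)
toy = pd.DataFrame({
    "text": [f"comment {i}" for i in range(50)],
    "label": [0] * 30 + [1] * 20,
})

# stratify keeps the class proportions the same in the train and test parts
train_text, test_text, train_label, test_label = train_test_split(
    toy.text, toy.label, test_size=0.2, random_state=0, stratify=toy.label
)

print(train_label.value_counts(normalize=True).sort_index())
print(test_label.value_counts(normalize=True).sort_index())
```

Both parts keep the 60/40 split of the original labels.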


**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

- using CountVectorizer with only trigrams.
- using **RandomForest** as the classifier.
- printing the classification report.


In [None]:
#import CountVectorizer, RandomForest, pipeline, classification_report from sklearn
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# create a pipeline object
clf = Pipeline([
    ('Count Vectorizer', CountVectorizer(ngram_range=(3, 3))),
    ('RF Classifier', RandomForestClassifier())
])


# fit with x_train and y_train
clf.fit(x_train, y_train)


# get the predictions for x_test and store them in y_pred
y_pred = clf.predict(x_test)


# print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.58      0.21      0.31       400
           1       0.40      0.81      0.53       388
           2       0.55      0.34      0.42       400

    accuracy                           0.45      1188
   macro avg       0.51      0.45      0.42      1188
weighted avg       0.51      0.45      0.42      1188
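
One plausible reason for the weak trigram-only scores (an assumption, not verified on this dataset) is that exact three-word sequences seen in training rarely reappear in new comments, so many test comments end up with almost no matching features. The toy sketch below, on made-up sentences, shows the mechanism.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences for illustration only
train_toy = ["i feel so happy today", "i feel really scared tonight"]
unseen = ["today i feel happy"]

matched = {}
for ngram_range in [(1, 1), (3, 3)]:
    vec = CountVectorizer(ngram_range=ngram_range).fit(train_toy)
    # nnz = number of vocabulary entries that actually occur in the unseen text
    matched[ngram_range] = vec.transform(unseen).nnz
    print(ngram_range, "vocabulary size:", len(vec.vocabulary_),
          "features matched in unseen text:", matched[ngram_range])
```

With unigrams the unseen sentence still shares words with the training text, while with trigrams only it matches nothing and becomes an all-zero vector.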




**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

- using CountVectorizer with its default settings (unigrams only), as in the code below.
- using **Multinomial Naive Bayes** as the classifier.
- printing the classification report.


In [None]:
#import MultinomialNB from sklearn
from sklearn.naive_bayes import MultinomialNB


# create a pipeline object
# note: CountVectorizer() defaults to ngram_range=(1, 1), i.e. unigrams only
clf = Pipeline([
    ('Count Vectorizer', CountVectorizer()),
    ('NB Classifier', MultinomialNB())
])


# fit with x_train and y_train
clf.fit(x_train, y_train)


# get the predictions for x_test and store them in y_pred
y_pred = clf.predict(x_test)

# print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.85      0.88       400
           1       0.86      0.89      0.87       388
           2       0.86      0.89      0.88       400

    accuracy                           0.88      1188
   macro avg       0.88      0.88      0.88      1188
weighted avg       0.88      0.88      0.88      1188
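
The report above lists the classes as 0, 1, and 2. Passing `target_names` to `classification_report` prints the emotion names instead. This is a sketch with hypothetical labels and predictions; in the notebook you would pass `y_test` and `y_pred`.

```python
from sklearn.metrics import classification_report

# Names in the order of the Emotion_nums encoding: 0 = joy, 1 = fear, 2 = anger
label_names = ["joy", "fear", "anger"]

# Hypothetical true and predicted labels, for illustration only
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

# labels fixes the class order so that it lines up with target_names
report = classification_report(
    y_true_toy, y_pred_toy, labels=[0, 1, 2], target_names=label_names
)
print(report)
```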




**Attempt 3** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

- using CountVectorizer with both unigrams and bigrams.
- using **RandomForest** as the classifier.
- printing the classification report.


In [None]:
# create a pipeline object
clf = Pipeline([
    ('Count Vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('RF Classifier', RandomForestClassifier())
])


# fit with x_train and y_train
clf.fit(x_train, y_train)


# get the predictions for x_test and store them in y_pred
y_pred = clf.predict(x_test)


# print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.95      0.90       400
           1       0.95      0.85      0.90       388
           2       0.91      0.90      0.90       400

    accuracy                           0.90      1188
   macro avg       0.90      0.90      0.90      1188
weighted avg       0.90      0.90      0.90      1188




**Attempt 4** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- using **TF-IDF vectorizer** for representing the text.
- using **RandomForest** as the classifier.
- printing the classification report.


In [None]:
#import TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer


# create a pipeline object
clf = Pipeline([
    ('Tf-Idf Vectorizer', TfidfVectorizer()),
    ('RF Classifier', RandomForestClassifier())
])


# fit with x_train and y_train
clf.fit(x_train, y_train)


# get the predictions for x_test and store them in y_pred
y_pred = clf.predict(x_test)


# print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.93      0.90       400
           1       0.93      0.87      0.89       388
           2       0.91      0.90      0.91       400

    accuracy                           0.90      1188
   macro avg       0.90      0.90      0.90      1188
weighted avg       0.90      0.90      0.90      1188



### **Use text pre-processing to remove stop words and punctuation, and apply lemmatization**

In [None]:
import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm")


def preprocess(text):
    # remove stop words and punctuation, then lemmatize the remaining tokens
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)

    return " ".join(filtered_tokens)

In [None]:
# create a new column "preprocessed_comment" and use the utility function above to get the clean data
df['preprocessed_comment'] = df['Comment'].apply(preprocess)


**Build a model with pre-processed text**

In [None]:
#Do the 'train-test' splitting with test size of 20% with random state of 0 and stratify sampling on df.Emotion_nums
x_train, x_test, y_train, y_test = train_test_split(df.preprocessed_comment, df.Emotion_nums, test_size=0.2, random_state=0, stratify=df.Emotion_nums)

**Let's check the scores with our best model so far**
- Random Forest

**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

- using CountVectorizer with both unigrams and bigrams.
- using **RandomForest** as the classifier.
- printing the classification report.


In [None]:
# create a pipeline object
clf = Pipeline([
    ('Count Vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('RF Classifier', RandomForestClassifier())
])


# fit with x_train and y_train
clf.fit(x_train, y_train)


# get the predictions for x_test and store them in y_pred
y_pred = clf.predict(x_test)


# print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.95      0.94       400
           1       0.95      0.88      0.92       388
           2       0.92      0.94      0.93       400

    accuracy                           0.93      1188
   macro avg       0.93      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188




**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

- using **TF-IDF vectorizer** for representing the pre-processed text.
- using **RandomForest** as the classifier.
- printing the classification report.


In [None]:
# create a pipeline object
clf = Pipeline([
    ('Tf-Idf Vectorizer', TfidfVectorizer()),
    ('RF Classifier', RandomForestClassifier())
])


# fit with x_train and y_train
clf.fit(x_train, y_train)


# get the predictions for x_test and store them in y_pred
y_pred = clf.predict(x_test)


# print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.91      0.95      0.93       400
           1       0.93      0.91      0.92       388
           2       0.93      0.92      0.92       400

    accuracy                           0.93      1188
   macro avg       0.93      0.92      0.92      1188
weighted avg       0.93      0.93      0.93      1188



## **Observations**

- We used two popular machine learning algorithms, Multinomial Naive Bayes and Random Forest, which are widely used for text classification. Since machine learning algorithms cannot process raw text directly, we used text representation techniques such as Bag of Words (unigrams, bigrams, n-grams) and TF-IDF to convert the text into numeric vectors.

- Using only trigrams gave far weaker results (accuracy 0.45) than combining unigrams and bigrams (accuracy 0.90), so longer n-grams on their own did not help here. Pre-processing the text improved Random Forest accuracy from 0.90 to 0.93. Bag of Words and TF-IDF performed comparably, with nearly identical recall and F1-scores both before and after pre-processing. On the raw text, Random Forest (accuracy 0.90) slightly outperformed Multinomial Naive Bayes (accuracy 0.88).
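
The comparisons above were made by re-running one pipeline at a time. One way to line the candidates up side by side is sketched below, under stated assumptions: the tiny corpus is made up, so the printed scores say nothing about the real dataset, and `random_state=0` is added here only to make the toy run repeatable.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data (0 = joy, 1 = fear, 2 = anger); replace with x_train, y_train, etc.
train_texts = [
    "i am so happy and glad", "what a joyful lovely day",
    "i am scared and afraid", "this is terrifying and frightening",
    "i am angry and furious", "this makes me mad and annoyed",
]
train_labels = [0, 0, 1, 1, 2, 2]
test_texts = ["so happy and joyful", "afraid and scared", "furious and mad"]
test_labels = [0, 1, 2]

# One named pipeline per combination of vectorizer and classifier
candidates = {
    "BoW (1,2) + NB": Pipeline([
        ("vectorizer", CountVectorizer(ngram_range=(1, 2))),
        ("classifier", MultinomialNB()),
    ]),
    "BoW (1,2) + RF": Pipeline([
        ("vectorizer", CountVectorizer(ngram_range=(1, 2))),
        ("classifier", RandomForestClassifier(random_state=0)),
    ]),
    "TF-IDF + RF": Pipeline([
        ("vectorizer", TfidfVectorizer()),
        ("classifier", RandomForestClassifier(random_state=0)),
    ]),
}

# Fit each pipeline on the same split and record its test accuracy
scores = {}
for name, pipeline in candidates.items():
    pipeline.fit(train_texts, train_labels)
    scores[name] = accuracy_score(test_labels, pipeline.predict(test_texts))

for name, score in scores.items():
    print(f"{name:<16} accuracy = {score:.2f}")
```

Keeping every candidate on the same train-test split makes the numbers directly comparable.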
