# Muthu Palaniappan M - 21011101079 NLP LAB EX -2

### Pipeline

1. **Data Gathering**
   - Source: Kaggle - https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset
   - Labels: Neutral, Negative, Positive

2. **Data Preprocessing**
   - Removing punctuation and other irrelevant symbols.
   - Tokenization.
   - Removing stop words.
   - Stemming and lemmatization to reduce words to their base form (both are demonstrated in this notebook; the lemmatized tokens are used downstream).

3. **Feature Extraction**
   - Convert the processed text into numerical features that serve as input to the machine learning models.
   - Bag of Words.
   - TF-IDF.
   - Skip-gram.
   - CBOW.

4. **Model Choice**
   - Naive Bayes classifier.
   - Decision tree with maximum depth 5.

5. **Model Training**
   - Split the dataset into training and test sets (80/20).
   - Train each model on the training split.

6. **Model Evaluation**
   - Metrics: accuracy, precision, recall, and F1 score.

### Importing Packages

In [1]:
import numpy as np
import pandas as pd
import nltk
import gensim
import string as st
import re
from nltk.corpus import stopwords
from nltk import PorterStemmer
from nltk import WordNetLemmatizer
from wordcloud import WordCloud, ImageColorGenerator
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models.word2vec import Word2Vec
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

### Loading the Data

In [2]:
data = pd.read_csv("train.csv",encoding='unicode_escape')
data.head()

Unnamed: 0,textID,text,selected_text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,morning,0-20,Afghanistan,38928346,652860.0,60
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,noon,21-30,Albania,2877797,27400.0,105
2,088c60f138,my boss is bullying me...,bullying me,negative,night,31-45,Algeria,43851044,2381740.0,18
3,9642c003ef,what interview! leave me alone,leave me alone,negative,morning,46-60,Andorra,77265,470.0,164
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,noon,60-70,Angola,32866272,1246700.0,26


In [3]:
data = data[['selected_text','sentiment']].copy()  # copy so later in-place edits do not act on a view
data.head()

Unnamed: 0,selected_text,sentiment
0,"I`d have responded, if I were going",neutral
1,Sooo SAD,negative
2,bullying me,negative
3,leave me alone,negative
4,"Sons of ****,",negative


In [4]:
data.rename(columns={"selected_text":"text"},inplace=True)
data.head()

Unnamed: 0,text,sentiment
0,"I`d have responded, if I were going",neutral
1,Sooo SAD,negative
2,bullying me,negative
3,leave me alone,negative
4,"Sons of ****,",negative


### Pre-Processing

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27481 entries, 0 to 27480
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       27480 non-null  object
 1   sentiment  27481 non-null  object
dtypes: object(2)
memory usage: 429.5+ KB


#### 1. Removing Punctuation

In [6]:
def remove_punctuation(text):
    removed_text = ""
    for char in str(text):
        if char not in st.punctuation:
            removed_text+=char
    return removed_text

In [7]:
data['removed_punc'] = data['text'].apply(remove_punctuation)

In [8]:
print(f"Before: {data['text'][0]}\nAfter: {data['removed_punc'][0]}")

Before: I`d have responded, if I were going
After: Id have responded if I were going
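
The character-by-character loop above can also be written with `str.translate`, which deletes every punctuation character in a single pass. This is only a sketch of an equivalent approach; the helper name `remove_punctuation_fast` is hypothetical and not part of the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_fast(text):
    """Hypothetical helper: strip ASCII punctuation from text in one pass."""
    return str(text).translate(PUNCT_TABLE)

print(remove_punctuation_fast("I`d have responded, if I were going"))
```

Like the original function, this only removes the characters listed in `string.punctuation`; digits and non-ASCII symbols are left in place.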


#### 2. Tokenization

In [9]:
def convert_tokens(text):
    # Lowercase the text, then split on runs of whitespace.
    text = str(text).lower()
    tokens = re.split(r"\s+", text)  # raw string avoids an invalid-escape warning
    return tokens

In [10]:
data['Tokens'] = data['removed_punc'].apply(convert_tokens)

In [11]:
print(f"Tokens: {data['Tokens'][0]}")

Tokens: ['id', 'have', 'responded', 'if', 'i', 'were', 'going']
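
One detail of splitting with `re.split` is worth seeing on a small input: if the text starts or ends with whitespace (which can happen once punctuation has been stripped), the result contains empty-string tokens, whereas `str.split()` with no arguments discards them. The sample string below is an illustrative assumption.

```python
import re

text = "sons of "  # trailing space, e.g. left behind after punctuation was removed

regex_tokens = re.split(r"\s+", text)  # keeps an empty token at the end
plain_tokens = text.split()            # drops leading/trailing empties

print(regex_tokens)
print(plain_tokens)
```

Empty tokens of this kind survive stop-word removal and show up later as stray spaces in the joined text.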


#### 3. Stopwords Removal

In [12]:
STOP_WORDS = set(stopwords.words("english"))  # build once; set membership is O(1)

def remove_stopwords(tokens):
    return [token for token in tokens if token not in STOP_WORDS]

In [13]:
data['removed_stopwords_tokens'] = data['Tokens'].apply(remove_stopwords)

In [14]:
print(f"Before: {data['Tokens'][0]}\nAfter: {data['removed_stopwords_tokens'][0]}")

Before: ['id', 'have', 'responded', 'if', 'i', 'were', 'going']
After: ['id', 'responded', 'going']


#### 4. Stemming

In [15]:
def stem_tokens(tokens):
    ps = PorterStemmer()
    tokens = [ps.stem(tok) for tok in tokens]
    return tokens

In [16]:
data['stemming_tokens'] = data['removed_stopwords_tokens'].apply(stem_tokens)

In [17]:
print(f"Before: {data['removed_stopwords_tokens'][0]}\nAfter: {data['stemming_tokens'][0]}")

Before: ['id', 'responded', 'going']
After: ['id', 'respond', 'go']


#### 5. Lemmatization

In [18]:
def lema_tokens(tokens):
    word_net = WordNetLemmatizer()
    tokens = [word_net.lemmatize(tok) for tok in tokens]
    return tokens

In [19]:
data['lemma_tokens'] = data['removed_stopwords_tokens'].apply(lema_tokens)

In [20]:
print(f"Before: {data['removed_stopwords_tokens'][0]}\nAfter: {data['lemma_tokens'][0]}")

Before: ['id', 'responded', 'going']
After: ['id', 'responded', 'going']


#### 6. Return Pre-Processed Text

In [21]:
def return_sequence(tokens):
    return " ".join(tokens)

In [22]:
data['pre_processed_text'] = data['lemma_tokens'].apply(return_sequence)

In [23]:
print(f"Before: {data['lemma_tokens'][0]}\nAfter: {data['pre_processed_text'][0]}")

Before: ['id', 'responded', 'going']
After: id responded going


In [24]:
data.dropna(inplace=True)

### Feature Representation

#### Bag of Words

In [25]:
cv = CountVectorizer()
count_matrix = cv.fit_transform(data['pre_processed_text'].values.tolist())

In [26]:
count_matrix.toarray().shape

(27480, 17424)

##### Impact on BoW
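- BoW ignores word order and context, so it cannot capture semantic meaning, which can lead to misclassification.
- It treats each word as an independent feature and ignores relationships between words.
- Out-of-vocabulary (OOV) problem: words that were not seen when the vectorizer was fitted are ignored at prediction time.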
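
The limitations listed above can be seen in a tiny hand-rolled bag of words. This is a simplified sketch, not scikit-learn's implementation: the functions `fit_vocabulary` and `bow_vector` and the toy corpus are assumptions made for illustration, and unlike `CountVectorizer`'s default tokenizer this version keeps single-character tokens.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct token in the training docs to a column index."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}

def bow_vector(doc, vocab):
    """Count in-vocabulary tokens; tokens unseen during fitting are dropped."""
    counts = Counter(tok for tok in doc.split() if tok in vocab)
    return [counts[tok] for tok in vocab]

train_docs = ["leave alone", "sooo sad", "fun fun"]  # toy corpus, not the real data
vocab = fit_vocabulary(train_docs)

# Word order is lost: both orderings produce the same vector.
same = bow_vector("sad sooo", vocab) == bow_vector("sooo sad", vocab)
# "terrible" was never seen during fitting, so it contributes nothing.
oov_vector = bow_vector("terrible fun", vocab)

print(same)
print(oov_vector)
```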

#### Tf-IDF

In [29]:
data['pre_processed_text'].values.tolist()

['id responded going',
 'sooo sad',
 'bullying',
 'leave alone',
 'son ',
 'httpwwwdothebouncycomsmf shameless plugging best ranger forum earth',
 'fun',
 'soooo high',
 '',
 'wow u became cooler',
 'much love hopeful reckon chance minimal p im never gonna get cake stuff',
 'like',
 'dangerously',
 'lost',
 'test test lg env2',
 'uh oh sunburned',
 'sigh',
 'sick',
 'onna',
 'he',
 'oh marly im sorry hope find soon 3 3',
 'interesting',
 'cleaning house family comming later today',
 'gotta restart computer thought win7 supposed put end constant rebootiness',
 'see wat mean bout foll0w friidays called lose f0llowers friday smh',
 'free fillin app ipod fun im addicted',
 'im sorry',
 'internet',
 'fun',
 'power back working',
 'quiteheavenly',
 'hope',
 'well much unhappy 10 minute',
 'funny',
 'ahhh slept game im gonna try best watch tomorrow though hope play army',
 'thats end tear fear',
 'miss',
 'case wonder really busy today coming adding ton new blog update stay tuned',
 'soooooo 

In [82]:
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(data['pre_processed_text'].values.tolist())

In [83]:
tfidf_array = tfidf_matrix.toarray()

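##### Impact on TF-IDF
- TF-IDF addresses some BoW limitations by down-weighting words that appear in many documents and up-weighting rarer, more distinctive ones, but it still does not capture word relationships or semantics.
- Out-of-vocabulary (OOV) problem: words that were not seen when the vectorizer was fitted are ignored at prediction time.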
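
To make the weighting concrete, here is a small sketch of the computation `TfidfVectorizer` performs with its default settings: raw term counts, a smoothed inverse document frequency $\ln\frac{1+n}{1+\mathrm{df}} + 1$, and L2 normalisation of each row. The function `tfidf_rows` and the three toy documents are assumptions for illustration; tokenisation is a plain whitespace split rather than scikit-learn's token pattern.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocab, rows) with smoothed-IDF, L2-normalised TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard against empty docs
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["fun day", "sad day", "fun fun"])
for row in rows:
    print([round(v, 3) for v in row])
```

In the second document, "sad" occurs in only one document and therefore receives a larger weight than "day", which occurs in two.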

#### Continuous Bag of Words (CBOW)

In [29]:
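# Caution: each item passed here is a plain string, so gensim iterates over its
# characters and learns character vectors (get_mean_vector below does the same).
# For word-level embeddings, pass lists of tokens, e.g. data['lemma_tokens'].tolist().
cbow = Word2Vec(data['pre_processed_text'].values.tolist(), vector_size=100, window=5, min_count=2, sg=0)
vocab = cbow.wv.index_to_key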

def get_mean_vector(model, sentence):
    words = [word for word in sentence if word in vocab]
    if len(words) >= 1:
        return np.mean(model.wv[words], axis=0)
    return np.zeros((100,))

cbow_array = []
for sentence in data['pre_processed_text'].values.tolist():
    cbow_array.append(get_mean_vector(cbow, sentence))

In [30]:
cbow_array = np.array(cbow_array)
cbow_array.shape

(27480, 100)

#### Skipgram

In [31]:
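# Caution: as with CBOW, the inputs are plain strings, so the model is trained on
# characters rather than words. Pass token lists for word-level embeddings.
sg = Word2Vec(data['pre_processed_text'].values.tolist(), vector_size=100, window=5, min_count=2, sg=1)
vocab = sg.wv.index_to_key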

def get_mean_vector(model, sentence):
    words = [word for word in sentence if word in vocab]
    if len(words) >= 1:
        return np.mean(model.wv[words], axis=0)
    return np.zeros((100,))

sg_array = []
for sentence in data['pre_processed_text'].values.tolist():
    sg_array.append(get_mean_vector(sg, sentence))

In [32]:
sg_array = np.array(sg_array)
sg_array.shape

(27480, 100)

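##### Impact on Word2Vec
- CBOW and Skip-gram learn dense vectors from the contexts in which tokens appear, so tokens used in similar contexts end up with similar vectors. This lets them capture semantic relationships that BoW and TF-IDF miss and can reduce misclassification tied to word meaning.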
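
The sentence vectors used in this section are averages of token vectors. The sketch below shows that mean pooling step on made-up two-dimensional vectors, and why it matters whether the sentence is passed as a list of tokens or as a plain string: iterating over a string yields characters, not words. The `toy_vectors` dictionary and the `mean_vector` helper are hypothetical and stand in for a trained model.

```python
# Made-up 2-dimensional vectors; real Word2Vec vectors are learned from data.
toy_vectors = {
    "sad": [1.0, 0.0],
    "fun": [0.0, 1.0],
    "s": [0.25, 0.75],  # a single character that happens to be in the vocabulary
}

def mean_vector(tokens, vectors, dim=2):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

sentence = "sad fun"
word_level = mean_vector(sentence.split(), toy_vectors)  # iterates over words
char_level = mean_vector(sentence, toy_vectors)          # iterates over characters

print(word_level)
print(char_level)
```

Mean pooling also discards word order, so two sentences with the same tokens in a different order receive the same vector.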

### Feature Engineering

In [33]:
lb = LabelEncoder()
data['sentiment'] = lb.fit_transform(data['sentiment'])
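
`LabelEncoder` assigns integer codes to the classes in sorted order, which is why the reports further down use 0, 1 and 2 for negative, neutral and positive. A minimal pure-Python sketch of that mapping, with a hypothetical helper `encode_labels`:

```python
def encode_labels(labels):
    """Assign integer codes to labels in sorted order, as LabelEncoder does."""
    classes = sorted(set(labels))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping

codes, mapping = encode_labels(["neutral", "negative", "negative", "positive"])
print(mapping)
print(codes)
```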

In [34]:
y = data['sentiment']

In [35]:
x_train_bow, x_test_bow, y_train_bow, y_test_bow = train_test_split(count_matrix, y, test_size=0.2, random_state=9)

In [36]:
x_train_tfidf, x_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(tfidf_array, y, test_size=0.2, random_state=9)

In [37]:
x_train_cbow, x_test_cbow, y_train_cbow, y_test_cbow = train_test_split(cbow_array, y, test_size=0.2, random_state=9)

In [38]:
x_train_skg, x_test_skg, y_train_skg, y_test_skg = train_test_split(sg_array, y, test_size=0.2, random_state=9)

In [39]:
print("Bag of Words (BoW) Shapes:")
print("x_train_bow shape:", x_train_bow.shape)
print("x_test_bow shape:", x_test_bow.shape)
print("y_train_bow shape:", y_train_bow.shape)
print("y_test_bow shape:", y_test_bow.shape)
print("=======================")
print("\nTF-IDF Shapes:")
print("x_train_tfidf shape:", x_train_tfidf.shape)
print("x_test_tfidf shape:", x_test_tfidf.shape)
print("y_train_tfidf shape:", y_train_tfidf.shape)
print("y_test_tfidf shape:", y_test_tfidf.shape)
print("=========================")
print("\nContinuous Bag of Words (CBOW) Shapes:")
print("x_train_cbow shape:", x_train_cbow.shape)
print("x_test_cbow shape:", x_test_cbow.shape)
print("y_train_cbow shape:", y_train_cbow.shape)
print("y_test_cbow shape:", y_test_cbow.shape)
print("========================")
print("\nSkip-Gram Shapes:")
print("x_train_skg shape:", x_train_skg.shape)
print("x_test_skg shape:", x_test_skg.shape)
print("y_train_skg shape:", y_train_skg.shape)
print("y_test_skg shape:", y_test_skg.shape)


Bag of Words (BoW) Shapes:
x_train_bow shape: (21984, 17424)
x_test_bow shape: (5496, 17424)
y_train_bow shape: (21984,)
y_test_bow shape: (5496,)

TF-IDF Shapes:
x_train_tfidf shape: (21984, 17424)
x_test_tfidf shape: (5496, 17424)
y_train_tfidf shape: (21984,)
y_test_tfidf shape: (5496,)

Continuous Bag of Words (CBOW) Shapes:
x_train_cbow shape: (21984, 100)
x_test_cbow shape: (5496, 100)
y_train_cbow shape: (21984,)
y_test_cbow shape: (5496,)

Skip-Gram Shapes:
x_train_skg shape: (21984, 100)
x_test_skg shape: (5496, 100)
y_train_skg shape: (21984,)
y_test_skg shape: (5496,)


### Model Building

In [45]:
def train_and_evaluate_decision_tree(x_train, x_test, y_train, y_test, representation):
    
    dtclassifier = DecisionTreeClassifier(random_state=9,max_depth=5)
    dtclassifier.fit(x_train, y_train)
    y_pred = dtclassifier.predict(x_test)

    print(f"\nMetrics for {representation}:")
    print(f"Model Score: {dtclassifier.score(x_train,y_train)}")  # accuracy on the training split
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))

In [90]:
def train_and_evaluate_navie_bayes(x_train, x_test, y_train, y_test, representation):
    
    nbclassifier = MultinomialNB()
    nbclassifier.fit(x_train, y_train)
    y_pred = nbclassifier.predict(x_test)

    print(f"\nMetrics for {representation}:")
    print(f"Model Score: {nbclassifier.score(x_train,y_train)}")  # accuracy on the training split
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
    
    return nbclassifier

In [46]:
train_and_evaluate_decision_tree(x_train_bow, x_test_bow, y_train_bow, y_test_bow, "BoW")


Metrics for BoW:
Model Score: 0.4932678311499272
Accuracy: 0.4798034934497817
Classification Report:
               precision    recall  f1-score   support

           0       0.60      0.00      0.00      1548
           1       0.43      0.94      0.59      2182
           2       0.79      0.33      0.46      1766

    accuracy                           0.48      5496
   macro avg       0.61      0.42      0.35      5496
weighted avg       0.59      0.48      0.39      5496



In [50]:
train_and_evaluate_decision_tree(x_train_tfidf, x_test_tfidf, y_train_tfidf, y_test_tfidf, "TF-IDF")


Metrics for TF-IDF:
Model Score: 0.5020469432314411
Accuracy: 0.4885371179039301
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.00      0.00      1548
           1       0.44      0.99      0.61      2182
           2       0.91      0.30      0.45      1766

    accuracy                           0.49      5496
   macro avg       0.78      0.43      0.35      5496
weighted avg       0.75      0.49      0.39      5496



In [48]:
train_and_evaluate_decision_tree(x_train_cbow, x_test_cbow, y_train_cbow, y_test_cbow, "CBOW")


Metrics for CBOW:
Model Score: 0.6676219068413392
Accuracy: 0.6519286754002911
Classification Report:
               precision    recall  f1-score   support

           0       0.61      0.52      0.56      1548
           1       0.67      0.80      0.73      2182
           2       0.66      0.58      0.62      1766

    accuracy                           0.65      5496
   macro avg       0.65      0.63      0.64      5496
weighted avg       0.65      0.65      0.65      5496



In [49]:
train_and_evaluate_decision_tree(x_train_skg, x_test_skg, y_train_skg, y_test_skg, "Skip-Gram")


Metrics for Skip-Gram:
Model Score: 0.6605258369723436
Accuracy: 0.6486535662299855
Classification Report:
               precision    recall  f1-score   support

           0       0.60      0.55      0.58      1548
           1       0.62      0.83      0.71      2182
           2       0.78      0.51      0.62      1766

    accuracy                           0.65      5496
   macro avg       0.67      0.63      0.63      5496
weighted avg       0.67      0.65      0.64      5496



In [91]:
nbc_1 = train_and_evaluate_navie_bayes(x_train_bow, x_test_bow, y_train_bow, y_test_bow, "BoW")


Metrics for BoW:
Model Score: 0.8563045851528385
Accuracy: 0.7530931586608443
Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.61      0.69      1548
           1       0.71      0.81      0.76      2182
           2       0.78      0.80      0.79      1766

    accuracy                           0.75      5496
   macro avg       0.76      0.74      0.75      5496
weighted avg       0.76      0.75      0.75      5496



In [92]:
nbc_2 = train_and_evaluate_navie_bayes(x_train_tfidf, x_test_tfidf, y_train_tfidf, y_test_tfidf, "Tf-IDF")


Metrics for Tf-IDF:
Model Score: 0.8605349344978166
Accuracy: 0.774745269286754
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.59      0.70      1548
           1       0.71      0.89      0.79      2182
           2       0.82      0.79      0.81      1766

    accuracy                           0.77      5496
   macro avg       0.80      0.76      0.77      5496
weighted avg       0.79      0.77      0.77      5496
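
Some rows in the decision-tree reports above pair a very high precision with a recall of 0.00. That combination appears when a model almost never predicts a class but is right on the few occasions it does. The sketch below computes the three per-class metrics from scratch on made-up labels; `per_class_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, using 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: class 0 is predicted only once, and that prediction is correct.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 1, 1, 2, 1]

precision, recall, f1 = per_class_metrics(y_true, y_pred, label=0)
print(precision, recall, f1)
```

Here precision for class 0 is perfect while recall is low, because three of the four true class-0 examples were assigned to class 1.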



### Prediction

In [104]:
texts = [
    "What is not to like about this product.",
    "Not bad.",
    "Not an issue.",
    "Not buggy.",
    "Not happy.",
    "Not user-friendly.",
    "Not good.",
    "Is it any good?",
    "I do not dislike horror movies.",
    "Disliking horror movies is not uncommon.",
    "Sometimes I really hate the show.",
    "I love having to wait two months for the next series to come out!",
    "The final episode was surprising with a terrible twist at the end.",
    "The film was easy to watch but I would not recommend it to my friends.",
    "I LOL’d at the end of the cake scene"
]

In [106]:
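from gensim.utils import simple_preprocess  # tokenizer used below

for text in texts:
    # Lowercase and tokenize the raw text, then rebuild a space-separated string.
    preprocessed_text = " ".join(simple_preprocess(text))
    transformed_text = tfidf.transform([preprocessed_text]).toarray()
    # Note: nbc_1 was fit on BoW counts; nbc_2 is the classifier fit on TF-IDF features.
    prediction = nbc_1.predict(transformed_text)[0]

    # LabelEncoder sorts classes alphabetically: 0 = negative, 1 = neutral, 2 = positive.
    if prediction == 0:
        print(f"{text}: Negative")
    elif prediction == 1:
        print(f"{text}: Neutral")
    elif prediction == 2:
        print(f"{text}: Positive")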

What is not to like about this product.: Positive
Not bad.: Negative
Not an issue.: Negative
Not buggy.: Neutral
Not happy.: Positive
Not user-friendly.: Positive
Not good.: Positive
Is it any good?: Positive
I do not dislike horror movies.: Negative
Disliking horror movies is not uncommon.: Negative
Sometimes I really hate the show.: Neutral
I love having to wait two months for the next series to come out!: Neutral
The final episode was surprising with a terrible twist at the end.: Neutral
The film was easy to watch but I would not recommend it to my friends.: Neutral
I LOL’d at the end of the cake scene: Neutral
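
Several of the negated sentences above are labelled as if the negation were absent. One likely contributor is the preprocessing: NLTK's English stop-word list includes "not", so the word was removed before the vectorizers were fitted and never entered the vocabulary. The sketch below reproduces that effect with a small hard-coded subset of the stop-word list; the subset and the `strip_stopwords` helper are assumptions for illustration.

```python
# A small subset of NLTK's English stop words, hard-coded so the sketch has no dependencies.
STOP_WORDS = {"not", "is", "it", "the", "i", "do", "an", "to", "what", "about", "this"}

def strip_stopwords(text):
    """Lowercase, drop full stops, split on whitespace, and remove stop words."""
    tokens = text.lower().replace(".", "").split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

examples = ["Not good.", "Not happy.", "I do not dislike horror movies."]
for text in examples:
    print(text, "->", strip_stopwords(text))
```

Once "not" is gone, "Not good." and "good" look identical to a bag-of-words or TF-IDF model. Keeping negation words out of the stop-word list, or adding bigram features, are common ways to preserve this signal.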
