<a href="https://colab.research.google.com/github/IshaSarangi/Edureka_Notes/blob/main/Edureka_Sentiment_Analysis_Demo_Sep25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://colab.research.google.com/drive/1sd6t7tg7nbmwoWelOe0IkyEzihsj3taw?usp=sharing

Lexicon + Rule BasedSentiment Analysis

VADER (Valence Aware Dictionary for Sentiment Reasoning)

In [1]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [2]:
from nltk.sentiment import SentimentIntensityAnalyzer

#Initialize VADER
sia = SentimentIntensityAnalyzer()

In [3]:
#Sample text
sample_reviews = [
    "The place was fantastic! Loved the vibe and the food.",
    "I hated the food. It was bland and overpriced.",
    "It was okay, nothing special.",
    "Fast service, clean environment, and tasty meals!",
    "Not worth the price. Wouldn't recommend."
]

In [4]:
#get sentiment scores
for text in sample_reviews:
    sentiment_scores = sia.polarity_scores(text)
    print(f"Text: {text}")
    print(f"Sentiment Scores: {sentiment_scores}")

#if coumpound score > 0.05 --> positive
#if compound score < 0.05 --> negative
#if compound score = 0.05 --> neutral

Text: The place was fantastic! Loved the vibe and the food.
Sentiment Scores: {'neg': 0.0, 'neu': 0.507, 'pos': 0.493, 'compound': 0.8313}
Text: I hated the food. It was bland and overpriced.
Sentiment Scores: {'neg': 0.375, 'neu': 0.625, 'pos': 0.0, 'compound': -0.6369}
Text: It was okay, nothing special.
Sentiment Scores: {'neg': 0.315, 'neu': 0.419, 'pos': 0.265, 'compound': -0.092}
Text: Fast service, clean environment, and tasty meals!
Sentiment Scores: {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.4574}
Text: Not worth the price. Wouldn't recommend.
Sentiment Scores: {'neg': 0.486, 'neu': 0.514, 'pos': 0.0, 'compound': -0.4168}


**Sentiment Analysis Uisng ML Algorithm:  
NB,SVM**

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re

#NLTK libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#ML libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline

#download NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [6]:
#Load dataset
train_data = pd.read_csv("/content/train.csv", encoding = 'latin1')
test_data = pd.read_csv("/content/test.csv", encoding = 'latin1')

print(train_data.head())
print(test_data.head())

       textID                                               text  \
0  cb774db0d1                I`d have responded, if I were going   
1  549e992a42      Sooo SAD I will miss you here in San Diego!!!   
2  088c60f138                          my boss is bullying me...   
3  9642c003ef                     what interview! leave me alone   
4  358bd9e861   Sons of ****, why couldn`t they put them on t...   

                         selected_text sentiment Time of Tweet Age of User  \
0  I`d have responded, if I were going   neutral       morning        0-20   
1                             Sooo SAD  negative          noon       21-30   
2                          bullying me  negative         night       31-45   
3                       leave me alone  negative       morning       46-60   
4                        Sons of ****,  negative          noon       60-70   

       Country  Population -2020  Land Area (Km²)  Density (P/Km²)  
0  Afghanistan          38928346         652860.0    

In [7]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27481 entries, 0 to 27480
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   textID            27481 non-null  object 
 1   text              27480 non-null  object 
 2   selected_text     27480 non-null  object 
 3   sentiment         27481 non-null  object 
 4   Time of Tweet     27481 non-null  object 
 5   Age of User       27481 non-null  object 
 6   Country           27481 non-null  object 
 7   Population -2020  27481 non-null  int64  
 8   Land Area (Km²)   27481 non-null  float64
 9   Density (P/Km²)   27481 non-null  int64  
dtypes: float64(1), int64(2), object(7)
memory usage: 2.1+ MB


In [8]:
#Unique Sentiment Classes
print(train_data['sentiment'].value_counts())

sentiment
neutral     11118
positive     8582
negative     7781
Name: count, dtype: int64


In [9]:
#Text Preprocessing
def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)
    text = re.sub(f"[{re.escape(string.punctuation)}]", '', text)
    text = re.sub(r'\d+', '', text)
    return text

In [10]:
#Apply cleaning on the input text
train_data['text'] = train_data['text'].apply(clean_text)
test_data['text'] = test_data['text'].apply(clean_text)

In [11]:
print(train_data['text'])

0                        id have responded if i were going
1               sooo sad i will miss you here in san diego
2                                   my boss is bullying me
3                            what interview leave me alone
4         sons of  why couldnt they put them on the rel...
                               ...                        
27476     wish we could come see u on denver  husband l...
27477     ive wondered about rake to  the client has ma...
27478     yay good for both of you enjoy the break  you...
27479                                but it was worth it  
27480       all this flirting going on  the atg smiles ...
Name: text, Length: 27481, dtype: object


In [12]:
#Text Preprocessing: Tokenize, Stopwords Removal and Lemmatization
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
  words = nltk.word_tokenize(text)
  words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
  return ''.join(words)

train_data['text'] = train_data['text'].apply(preprocess_text)
test_data['text'] = test_data['text'].apply(preprocess_text)

In [13]:
print(train_data['text'])

0                                         idrespondedgoing
1                                      sooosadmisssandiego
2                                              bosbullying
3                                      interviewleavealone
4                        soncouldntputreleasealreadybought
                               ...                        
27476      wishcouldcomeseeudenverhusbandlostjobcantafford
27477    ivewonderedrakeclientmadeclearnetdontforcedevs...
27478    yaygoodenjoybreakprobablyneedhecticweekendtake...
27479                                                worth
27480                          flirtinggoingatgsmileyayhug
Name: text, Length: 27481, dtype: object


In [14]:
#Train Test split
x = train_data['text']
y = train_data['sentiment']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)


In [15]:
#Vectorize data: TFIDF
tfidf_vectorizer = TfidfVectorizer(max_features = 5000)
x_train_tfidf = tfidf_vectorizer.fit_transform(x_train)
x_test_tfidf = tfidf_vectorizer.transform(x_test)

###Model Training

In [16]:
#Model 1: Naive Bayes Classifier
nb = MultinomialNB()
nb.fit(x_train_tfidf, y_train)
y_pred_nb = nb.predict(x_test_tfidf)

In [17]:
#Model 2: Random Forest Classifier
rf = RandomForestClassifier(n_estimators = 100, random_state = 42)
rf.fit(x_train_tfidf, y_train)
y_pred_rf = rf.predict(x_test_tfidf)

In [18]:
#Model 3: SVM Model
svm = SVC(kernel = 'linear')
svm.fit(x_train_tfidf, y_train)
y_pred_svm = svm.predict(x_test_tfidf)

###Evaluate ML Models

In [19]:
def evaluate_model(model, x_test, y_test):
    print(f"\nModel: {model}")
    y_pred = model.predict(x_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy}")
    report = classification_report(y_test, y_pred)
    print("Classification Report:\n", report)

In [20]:
evaluate_model(nb, x_test_tfidf, y_test)


Model: MultinomialNB()
Accuracy: 0.4147716936510824
Classification Report:
               precision    recall  f1-score   support

    negative       0.79      0.01      0.02      1562
     neutral       0.41      1.00      0.58      2230
    positive       0.95      0.02      0.05      1705

    accuracy                           0.41      5497
   macro avg       0.72      0.34      0.22      5497
weighted avg       0.69      0.41      0.26      5497



In [21]:
#Predict sentiment using trained models
def predict_sentiment(text, model):
    text = clean_text(text)
    text = preprocess_text(text)
    text_tfidf = tfidf_vectorizer.transform([text])
    sentiment = model.predict(text_tfidf)[0]
    return sentiment

In [22]:
print(predict_sentiment(["I love this product! It's amazing"], svm))

neutral


In [23]:
print(predict_sentiment(["I hate this product. It's terrible"], svm))

neutral


Shorter way: Sklearn pipelines

Dataset: Yelp customer reviews dataset



In [24]:
from sklearn.pipeline import Pipeline

In [25]:
#Load data
df = pd.read_csv('/content/yelp.csv')
print(df.head())

              business_id        date               review_id  stars  \
0  9yKzy9PApeiPPOUJEtnvkg  2011-01-26  fWKvX83p0-ka4JS3dc6E5A      5   
1  ZRJwVLyzEJq1VAihDhYiow  2011-07-27  IjZ33sJrzXqU-0X6U8NwyA      5   
2  6oRAC4uyJCsJl1X0WZpVSA  2012-06-14  IESLBzqUCLdSzSqm0eCSxQ      4   
3  _1QQZuf4zZOyFCvXc0o6Vg  2010-05-27  G-WvGaISbqqaMHlNnByodA      5   
4  6ozycU1RpktNG2-1BroVtw  2012-01-05  1uJFq2r5QfJG_6ExMRCaGw      5   

                                                text    type  \
0  My wife took me here on my birthday for breakf...  review   
1  I have no idea why some people give bad review...  review   
2  love the gyro plate. Rice is so good and I als...  review   
3  Rosie, Dakota, and I LOVE Chaparral Dog Park!!...  review   
4  General Manager Scott Petello is a good egg!!!...  review   

                  user_id  cool  useful  funny  
0  rLtl8ZkDX5vH5nAx9C3q5Q     2       5      0  
1  0a2KyEL0d3Yb1V6aivbIuQ     0       0      0  
2  0hT2KtfLiobPvh6cDC8JQg     0    

In [26]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   business_id  10000 non-null  object
 1   date         10000 non-null  object
 2   review_id    10000 non-null  object
 3   stars        10000 non-null  int64 
 4   text         10000 non-null  object
 5   type         10000 non-null  object
 6   user_id      10000 non-null  object
 7   cool         10000 non-null  int64 
 8   useful       10000 non-null  int64 
 9   funny        10000 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 781.4+ KB
None


###Text Labelling:

Stars 1-2 --> 'Negative'

Stars 4-5 --> 'Positive'

Stars 3 --> 'Neutral' (Commonly dropped)

In [27]:
df=df[df['stars']!=3]
df['sentiment']=df['stars'].apply(lambda x: 'Positive' if x>3 else 'Negative')

In [28]:
print(df['sentiment'])

0       Positive
1       Positive
2       Positive
3       Positive
4       Positive
          ...   
9994    Positive
9996    Positive
9997    Positive
9998    Negative
9999    Positive
Name: sentiment, Length: 8539, dtype: object


In [29]:
x=df['text']
y=df['sentiment']

#Train Test Split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)

In [30]:
#Create pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', RandomForestClassifier())
])

In [31]:
#Train the model in the pipeline
pipeline.fit(x_train, y_train)

In [32]:
#Predict Sentiment for sample Reviews
sample_reviews = [
    "The place was fantastic! Loved the vibe and the food.",
    "I hated the food. It was bland and overpriced.",
    "It was okay, nothing special.",
    "Fast service, clean environment, and tasty meals!",
    "Not worth the price. Wouldn't recommend."
]

pred=pipeline.predict(sample_reviews)

print("\nSample sentiment predictions:")

for review,sentiment in zip(sample_reviews,pred):
  print(f"Review: {review}\nPredicted Sentiment: {sentiment}\n")


Sample sentiment predictions:
Review: The place was fantastic! Loved the vibe and the food.
Predicted Sentiment: Positive

Review: I hated the food. It was bland and overpriced.
Predicted Sentiment: Positive

Review: It was okay, nothing special.
Predicted Sentiment: Positive

Review: Fast service, clean environment, and tasty meals!
Predicted Sentiment: Positive

Review: Not worth the price. Wouldn't recommend.
Predicted Sentiment: Positive



###Sentiment Analysis using Deep Neural Networks

Keras, LSTM, Embedding Layer, Tokenizer

In [33]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.utils import to_categorical

In [37]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 8539 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   business_id  8539 non-null   object
 1   date         8539 non-null   object
 2   review_id    8539 non-null   object
 3   stars        8539 non-null   int64 
 4   text         8539 non-null   object
 5   type         8539 non-null   object
 6   user_id      8539 non-null   object
 7   cool         8539 non-null   int64 
 8   useful       8539 non-null   int64 
 9   funny        8539 non-null   int64 
 10  sentiment    8539 non-null   object
dtypes: int64(4), object(7)
memory usage: 800.5+ KB
None


In [39]:
#Prepare Tokenizer
max_words = 10000 #only keeps the top 1000 most frequent words
max_len = 100

tokenizer = Tokenizer(num_words=max_words, oov_token='<OOV>')
tokenizer.fit_on_texts(df['text'])

In [41]:
#Text to sequences and padding
#Pad sequences makes all the sequences of same length

sequences = tokenizer.texts_to_sequences(df['text']) #used preprocessed text from ML part
print(sequences)
#show tokens and index using tokenizer
print(tokenizer.word_index)

# for word, index in tokenizer.word_index.items():
#     print(f"{word}:{index}")

[[14, 482, 236, 38, 43, 20, 14, 710, 11, 245, 3, 10, 8, 233, 2, 1333, 8, 251, 65, 130, 588, 306, 4745, 45, 2431, 62, 1766, 1851, 54, 432, 8, 233, 3, 54, 30, 589, 616, 20, 2, 2598, 363, 562, 506, 10, 380, 36, 2, 27, 4145, 50, 102, 616, 26, 2, 1598, 17, 47, 43, 2, 119, 88, 691, 5, 2223, 3, 47, 45, 2013, 2780, 10, 8, 1719, 3, 794, 2, 85, 80, 160, 24, 78, 102, 164, 19, 69, 368, 581, 48, 45, 1347, 3, 2599, 81, 137, 53, 17, 109, 10, 10, 8, 150, 153, 165, 20, 2, 100, 563, 233, 4, 24, 2, 463, 2224, 3823, 663, 1628, 3155, 3, 10, 8, 210, 3, 118, 10, 142, 15, 129, 746, 7, 45, 1, 224, 15, 8, 150, 3, 10, 480, 130, 2, 172, 1254, 10, 8, 2, 85, 765, 80, 160, 24, 667, 4, 171, 163, 6, 51, 66], [4, 22, 72, 723, 287, 57, 105, 181, 191, 357, 55, 16, 27, 10, 823, 6, 601, 17, 17, 60, 656, 319, 19, 25, 283, 1, 55, 174, 13, 45, 381, 2082, 35, 25, 173, 105, 36, 13, 12, 134, 846, 14, 279, 3, 4, 589, 28, 55, 144, 571, 1304, 16, 565, 605, 10, 8, 102, 777, 67, 90, 4, 277, 11, 5, 605, 658, 3, 277, 21, 64, 22, 6, 163

In [42]:
print(len(sequences))
x = pad_sequences(sequences, maxlen = max_len)
print(x)
print(len(x))

8539
[[  47   43    2 ...    6   51   66]
 [  81    2 2781 ...   57 1524 1430]
 [   0    0    0 ...   45 1474  250]
 ...
 [  25  881   29 ...    7    2  452]
 [   4  460  591 ... 6302    5  110]
 [   0    0    0 ...   20    2  500]]
8539


In [43]:
#Labels
print(df['sentiment'])
df['sentiment'] = df['sentiment'].map({'Positive': 1, 'Negative': 0})
print(df['sentiment'])

0       Positive
1       Positive
2       Positive
3       Positive
4       Positive
          ...   
9994    Positive
9996    Positive
9997    Positive
9998    Negative
9999    Positive
Name: sentiment, Length: 8539, dtype: object
0       1
1       1
2       1
3       1
4       1
       ..
9994    1
9996    1
9997    1
9998    0
9999    1
Name: sentiment, Length: 8539, dtype: int64


In [44]:
y = np.array(df['sentiment'].values)
print(y)

[1 1 1 ... 1 0 1]


In [45]:
#Train Test Split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

In [46]:
#Build Model: Embedding + LSTM
model = Sequential()
model.add(Embedding(input_dim = max_words, output_dim = 64, input_length = max_len))
#model.add(GRU(64, dropout = 0.2, recurrent_dropout = 0.2)) ---> TIY
model.add(LSTM(64, dropout = 0.2, recurrent_dropout = 0.2))
model.add(Dense(1, activation = 'sigmoid'))



In [47]:
#Compile and Train
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
model.fit(x_train, y_train, validation_data = (x_test, y_test), epochs = 10, batch_size = 32, verbose = 1)

Epoch 1/10
[1m214/214[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m84s[0m 345ms/step - accuracy: 0.8018 - loss: 0.5044 - val_accuracy: 0.8712 - val_loss: 0.3149
Epoch 2/10
[1m214/214[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m103s[0m 468ms/step - accuracy: 0.9017 - loss: 0.2475 - val_accuracy: 0.8759 - val_loss: 0.3205
Epoch 3/10
[1m214/214[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m107s[0m 500ms/step - accuracy: 0.9365 - loss: 0.1707 - val_accuracy: 0.8718 - val_loss: 0.3175
Epoch 4/10
[1m214/214[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m107s[0m 336ms/step - accuracy: 0.9604 - loss: 0.1123 - val_accuracy: 0.8782 - val_loss: 0.3435
Epoch 5/10
[1m214/214[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m75s[0m 353ms/step - accuracy: 0.9742 - loss: 0.0770 - val_accuracy: 0.8852 - val_loss: 0.4228
Epoch 6/10
[1m214/214[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m82s[0m 353ms/step - accuracy: 0.9810 - loss: 0.0562 - val_accuracy: 0.8864 - val_loss: 0.4331
Epoch 7

<keras.src.callbacks.history.History at 0x7edb49ed00b0>

In [48]:
#Sample predictions
print(sample_reviews)

['The place was fantastic! Loved the vibe and the food.', 'I hated the food. It was bland and overpriced.', 'It was okay, nothing special.', 'Fast service, clean environment, and tasty meals!', "Not worth the price. Wouldn't recommend."]


In [49]:
sample_seq = tokenizer.texts_to_sequences(sample_reviews)
sample_pad = pad_sequences(sample_seq, maxlen = max_len)

predictions = model.predict(sample_pad)

print('\nSample Predictions: ')
for review, prediction in zip(sample_reviews, predictions):
    sentiment = 'Positive' if prediction > 0.5 else 'Negative'
    print(f"Review: {review}\nPredicted Sentiment: {sentiment}\n")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 974ms/step

Sample Predictions: 
Review: The place was fantastic! Loved the vibe and the food.
Predicted Sentiment: Positive

Review: I hated the food. It was bland and overpriced.
Predicted Sentiment: Negative

Review: It was okay, nothing special.
Predicted Sentiment: Negative

Review: Fast service, clean environment, and tasty meals!
Predicted Sentiment: Positive

Review: Not worth the price. Wouldn't recommend.
Predicted Sentiment: Negative

