## Day 47- DIY Solution
**Q1. Problem Statement: Sentiment Analysis**<br>
Write a Python program that reads the mood_data.txt (provided on LMS). The following are the given tasks, that has to be taken into
consideration while constructing the solution.<br>
Here dataset contains two columns where one is our target (“emotion” has 6
different categories) and another is the independent variable (“Text” contains
data in form of sentences).
1. Load the mobile mood_data.txt data into a DataFrame
2. Generate tokens and remove punctuations, stop words and lower all rows
3. Join all the tokens as they were before and store them in a new column named
“cleaned_text”
4. Now remove all single characters, extra space, and special characters and
store processed data in a new column named “processed_text”
5. Create a final DataFrame containing dependent variable(emotion) and
processed text
6. Extract independent variables (Xs) and dependent variables (Ys) into separate
data objects
7. Generate tokens and do vectorization

8. Build a model with Multinomial Naive Bayes, Random Forest, Random Forest (Entr
opy), SVM and compare their accuracy


In [1]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**Step-1:** Importing Libraries.

In [2]:
# Load the required libraries from Python
# Make sure all the libraries have been download else download using nltk.download command
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk 

**Step-2:** Loading sample data set into dataframe.

In [4]:
df_train = pd.read_csv('mood_data.txt', names=['Text', 'Emotion'], sep=';') # load the dataset onto the google colab file section

In [5]:
df_train.shape

(16000, 2)

In [6]:
df_train.head()

Unnamed: 0,Text,Emotion
0,i didnt feel humiliated,sadness
1,i can go from feeling so hopeless to so damned...,sadness
2,im grabbing a minute to post i feel greedy wrong,anger
3,i am ever feeling nostalgic about the fireplac...,love
4,i am feeling grouchy,anger


**Step-3:** Generating tokens and remove punctuations, stop words and converting all rows to lower case.

In [7]:
# Load the required libraries for cleaning
import string,re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [8]:
# Create a function to generate cleaned data from raw text
def clean_text(mood):
    mood = word_tokenize(mood) # Create tokens
    mood= " ".join(mood) # Join tokens
    mood = [char for char in mood if char not in string.punctuation] # Remove punctuations
    mood = ''.join(mood) # Join the leters
    mood = [word for word in mood.split() if mood.lower() not in stopwords.words('english')] # Remove common english words (I, you, we,...)
    return " ".join(mood)

**Step-4:** Storing new data in cleaned_text column.

In [9]:
# Apply the function to 'text' to clean it
# Add cleaned data as a separate column to the DataFrame
df_train['cleaned_text'] = df_train['Text'].apply(clean_text)
df_train

Unnamed: 0,Text,Emotion,cleaned_text
0,i didnt feel humiliated,sadness,i didnt feel humiliated
1,i can go from feeling so hopeless to so damned...,sadness,i can go from feeling so hopeless to so damned...
2,im grabbing a minute to post i feel greedy wrong,anger,im grabbing a minute to post i feel greedy wrong
3,i am ever feeling nostalgic about the fireplac...,love,i am ever feeling nostalgic about the fireplac...
4,i am feeling grouchy,anger,i am feeling grouchy
...,...,...,...
15995,i just had a very brief time in the beanbag an...,sadness,i just had a very brief time in the beanbag an...
15996,i am now turning and i feel pathetic that i am...,sadness,i am now turning and i feel pathetic that i am...
15997,i feel strong and good overall,joy,i feel strong and good overall
15998,i feel like this was such a rude comment and i...,anger,i feel like this was such a rude comment and i...


In [10]:
df_train["cleaned_text"].head()

0                              i didnt feel humiliated
1    i can go from feeling so hopeless to so damned...
2     im grabbing a minute to post i feel greedy wrong
3    i am ever feeling nostalgic about the fireplac...
4                                 i am feeling grouchy
Name: cleaned_text, dtype: object

**Step-4:** Removing special charachters,extra space,and convert into lower case

In [11]:
features = df_train['cleaned_text']
processed_features = []

for sentence in range(0, len(features)):
    # Remove all the special characters
    processed_feature = re.sub(r'\W', ' ', str(features[sentence]))
    
    # Remove single characters appearing in the text except the start
    processed_feature= re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)
    
    # Remove single characters appearing at the start
    processed_feature = re.sub(r'\^[a-zA-Z]\s+', ' ', processed_feature) 
    
    # Substitute multiple spaces with a single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)
    
    
    # Convert to lowercase
    processed_feature = processed_feature.lower()

    processed_features.append(processed_feature)

In [12]:
# Print first five values of processed data
processed_features[:5]

['i didnt feel humiliated',
 'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
 'im grabbing minute to post feel greedy wrong',
 'i am ever feeling nostalgic about the fireplace will know that it is still on the property',
 'i am feeling grouchy']

**Step-5:** Saving the above processed data into processed_text column

In [13]:
# Add the processed data as a separate column to the DataFrame

df_train['processed_text'] = processed_features
df_train

Unnamed: 0,Text,Emotion,cleaned_text,processed_text
0,i didnt feel humiliated,sadness,i didnt feel humiliated,i didnt feel humiliated
1,i can go from feeling so hopeless to so damned...,sadness,i can go from feeling so hopeless to so damned...,i can go from feeling so hopeless to so damned...
2,im grabbing a minute to post i feel greedy wrong,anger,im grabbing a minute to post i feel greedy wrong,im grabbing minute to post feel greedy wrong
3,i am ever feeling nostalgic about the fireplac...,love,i am ever feeling nostalgic about the fireplac...,i am ever feeling nostalgic about the fireplac...
4,i am feeling grouchy,anger,i am feeling grouchy,i am feeling grouchy
...,...,...,...,...
15995,i just had a very brief time in the beanbag an...,sadness,i just had a very brief time in the beanbag an...,i just had very brief time in the beanbag and ...
15996,i am now turning and i feel pathetic that i am...,sadness,i am now turning and i feel pathetic that i am...,i am now turning and feel pathetic that am sti...
15997,i feel strong and good overall,joy,i feel strong and good overall,i feel strong and good overall
15998,i feel like this was such a rude comment and i...,anger,i feel like this was such a rude comment and i...,i feel like this was such rude comment and im ...


**Step-6:** Extracting processed_text and Emotion then creating final dataframe.

In [14]:
final_df = df_train[["processed_text","Emotion"]]
final_df

Unnamed: 0,processed_text,Emotion
0,i didnt feel humiliated,sadness
1,i can go from feeling so hopeless to so damned...,sadness
2,im grabbing minute to post feel greedy wrong,anger
3,i am ever feeling nostalgic about the fireplac...,love
4,i am feeling grouchy,anger
...,...,...
15995,i just had very brief time in the beanbag and ...,sadness
15996,i am now turning and feel pathetic that am sti...,sadness
15997,i feel strong and good overall,joy
15998,i feel like this was such rude comment and im ...,anger


**Step-7:** Generating tokens and doing vectorization

In [15]:
# Tokenize the text using TweetTokenizer from NLTK

from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

In [16]:
# Function to generate tokens using TweetTokenizer
def tokenize(text): 
    tk = TweetTokenizer()
    return tk.tokenize(text)

vectorizer = CountVectorizer(analyzer = 'word',tokenizer = tokenize,lowercase = True,ngram_range=(1, 1))

In [17]:
# Generate unique words from the processed data by applying Count Vectorizer along with TweetTokenizer
count= vectorizer.fit_transform(final_df['processed_text'])



In [18]:
# What is the shape of the data- Count vectorizer provides information about unique words present in data
count.shape

(16000, 15206)

In [19]:
# Load the libraries required for performing classification

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score

**Step-8:** Spliting the data into training and testing data sets

In [20]:
# Use processed data as independent variable and polarity as dependent variable

X = final_df['processed_text'].values
y = final_df['Emotion'].values

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=100, test_size=0.3)

**Step-9:** Doing vectorization for training and testing data

In [21]:
# Extract features using TFIDF Vectorizer

vectorizer = TfidfVectorizer(max_features=1000)
X_train_idf = vectorizer.fit_transform(X_train)
X_test_idf = vectorizer.transform(X_test)

In [24]:
# Print idf values
df_idf = pd.DataFrame(vectorizer.idf_, index=vectorizer.get_feature_names_out(),columns=["idf_weights"])
# Sort ascending
df_idf.sort_values(by=['idf_weights'],ascending = False).head()

Unnamed: 0,idf_weights
blah,7.758809
chest,7.684701
pregnant,7.433387
computer,7.379319
dream,7.379319


**Step-10:** Model building(generate asked model) and model evaluation

In [25]:
# Perform Multinomial Naive Bayes Classification
# Apply MultinomialNB on training data
mnb = MultinomialNB()
mnb.fit(X_train_idf, y_train)

In [26]:
# Predict polarity by fitting the model to testing data
pred_mnb = mnb.predict(X_test_idf)

# Calculate accuracy of predicted values
acc = accuracy_score(y_test, pred_mnb)


results = pd.DataFrame([['Multinomial Naive Bayes', acc]],
               columns = ['Model', 'Accuracy'])

print(results)

                     Model  Accuracy
0  Multinomial Naive Bayes  0.740625


In [27]:
# Perform Random Forest classification on the processed data and compare the accuracy score of both these models

# Random Forest Classifier with 'gini'

from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier()
clf_rf.fit(X_train_idf, y_train)

# Predict using testing data
y_pred_rf = clf_rf.predict(X_test_idf)

# Calculate accuracy
acc = accuracy_score(y_test, y_pred_rf)

model_results = pd.DataFrame([['Random Forest(Gini)', acc]],
               columns = ['Model', 'Accuracy'])

results = results.append(model_results, ignore_index = True)
print(results)

                     Model  Accuracy
0  Multinomial Naive Bayes  0.740625
1      Random Forest(Gini)  0.831667


  results = results.append(model_results, ignore_index = True)


In [28]:
# Random Forest Classifier with 'entropy'

from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier(criterion='entropy')
clf_rf.fit(X_train_idf, y_train)

# Predict using testing data
y_pred_rf = clf_rf.predict(X_test_idf)

# Calculate accuracy
acc = accuracy_score(y_test, y_pred_rf)

model_results = pd.DataFrame([['Random Forest(Entropy)', acc]],
               columns = ['Model', 'Accuracy'])

results = results.append(model_results, ignore_index = True)


  results = results.append(model_results, ignore_index = True)


In [29]:
#svm model
from sklearn.svm import SVC
clf_svc = SVC()
clf_rf.fit(X_train_idf, y_train)

# Predict using testing data
y_pred_rf = clf_rf.predict(X_test_idf)

# Calculate accuracy
acc = accuracy_score(y_test, y_pred_rf)

model_results = pd.DataFrame([['SVC by SVM ', acc]],
               columns = ['Model', 'Accuracy'])

results = results.append(model_results, ignore_index = True)
print(results)

                     Model  Accuracy
0  Multinomial Naive Bayes  0.740625
1      Random Forest(Gini)  0.831667
2   Random Forest(Entropy)  0.809167
3              SVC by SVM   0.811667


  results = results.append(model_results, ignore_index = True)


In [30]:
# Display confusion matrix for Random Forest

confusion_matrix(y_test,y_pred_rf) ### Confusion matrix for Random Forest

array([[ 499,   15,   65,    4,   28,    0],
       [  23,  448,   92,    5,   30,   17],
       [  18,   11, 1465,   47,   52,   10],
       [   4,    3,   83,  280,    3,    0],
       [  39,   39,  225,   14, 1083,    7],
       [   0,   39,   25,    0,    6,  121]], dtype=int64)

**Conclusion** : Random forrest classifier has performed the best.