# Sentiment analysis on social media comments

Online platforms such as Twitter, Goodreads, and Amazon are full of people's opinions on products and services. Organizations can use this information to improve their offerings, but the volume of data is far too large to read and analyze by hand. Sentiment analysis addresses this problem: it automatically estimates whether an opinion is positive, negative, or neutral, so that businesses can make data-driven decisions.

*In this notebook, I train a machine learning model to classify social media comments as positive, negative, or neutral.*

## Data wrangling

In [1]:
# First, we load the two CSV files from the Kaggle dataset "Twitter and Reddit Sentimental analysis Dataset"
import pandas as pd
df1=pd.read_csv('Reddit_Data.csv')
df2=pd.read_csv('Twitter_Data.csv')

In [2]:
# We preview the first rows of each dataframe
df1.head()

Unnamed: 0,clean_comment,category
0,family mormon have never tried explain them t...,1
1,buddhism has very much lot compatible with chr...,1
2,seriously don say thing first all they won get...,-1
3,what you have learned yours and only yours wha...,0
4,for your own benefit you may want read living ...,1


In [3]:
df2.head()

Unnamed: 0,clean_text,category
0,when modi promised “minimum government maximum...,-1.0
1,talk all the nonsense and continue all the dra...,0.0
2,what did just say vote for modi welcome bjp t...,1.0
3,asking his supporters prefix chowkidar their n...,1.0
4,answer who among these the most powerful world...,1.0


In [4]:
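# Here, we give the text columns a common name and concatenate both dataframes into one, to simplify the modeling steps that follow

df1.rename(columns={'clean_comment':'Text'},inplace=True)
df2.rename(columns={'clean_text':'Text'},inplace=True)
# ignore_index=True gives the combined dataframe a fresh 0..n-1 index instead of repeating each file's row numbers
data = pd.concat([df1, df2], ignore_index=True)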
data.head()

Unnamed: 0,Text,category
0,family mormon have never tried explain them t...,1.0
1,buddhism has very much lot compatible with chr...,1.0
2,seriously don say thing first all they won get...,-1.0
3,what you have learned yours and only yours wha...,0.0
4,for your own benefit you may want read living ...,1.0
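
The effect of the concatenation can be checked on a toy example. The sketch below uses two made-up dataframes (not the Kaggle files) and a hypothetical `source` column, which is not part of this notebook, to keep track of where each row came from. It also shows one plausible reason why `category` appears as `1.0` rather than `1` in the combined output: when an integer column is concatenated with a float column, pandas stores the result as floats.

```python
import pandas as pd

# Toy stand-ins for the Reddit and Twitter dataframes (made-up rows).
reddit = pd.DataFrame({"Text": ["good post", "bad take"], "category": [1, -1]})
twitter = pd.DataFrame({"Text": ["neutral tweet"], "category": [0.0]})

# "source" is a hypothetical helper column, added only for illustration.
reddit["source"] = "reddit"
twitter["source"] = "twitter"

# Stack the rows; ignore_index=True renumbers them 0..n-1.
combined = pd.concat([reddit, twitter], ignore_index=True)

print(combined)
print(combined["category"].dtype)
```

Without `ignore_index=True`, each row keeps the row number it had in its original dataframe, so index values repeat in the combined result.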


In [5]:
# At this point, we check whether the dataframe contains any NaN values
data.isnull().values.any()

True

In [6]:
# The dataframe contains NaN values, so we count them per column
data.isnull().sum()

Text        104
category      7
dtype: int64

In [7]:
# We drop the rows that contain NaN values
data.dropna(inplace=True)


In [8]:
data.isnull().values.any()

False
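
The same check-and-drop sequence can be verified on a small made-up dataframe. This is a sketch with toy values, not the notebook's data. It passes `subset` to spell out which columns are checked, whereas the plain `dropna()` call used above checks every column.

```python
import numpy as np
import pandas as pd

# Toy dataframe with one missing text and one missing label.
toy = pd.DataFrame({
    "Text": ["nice", None, "awful", "meh"],
    "category": [1.0, 0.0, np.nan, 0.0],
})

# Count the missing values in each column.
missing_per_column = toy.isnull().sum()

# Drop every row that lacks a text or a label, then renumber the rows.
cleaned = toy.dropna(subset=["Text", "category"]).reset_index(drop=True)

print(missing_per_column)
print(len(toy), "rows before,", len(cleaned), "rows after")
```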

In [9]:
# The dataframe is now clean and ready to be used for training a machine learning model

## Machine learning model

In [10]:
# Now we pre-process the text and build a bag-of-words representation with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(stop_words='english',ngram_range = (1,1),tokenizer = token.tokenize)
text_counts = cv.fit_transform(data['Text'])
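
To see what the bag-of-words matrix contains, the sketch below vectorizes two made-up sentences. It assumes that passing `token_pattern=r"[a-zA-Z0-9]+"` selects the same tokens as the `RegexpTokenizer` above, which keeps the example free of the NLTK dependency. The names `toy_cv`, `vocab`, and `dense` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents.
docs = [
    "The battery is great, really great",
    "The battery died fast",
]

# Unigram counts with English stop words removed; token_pattern keeps
# runs of letters and digits, like the RegexpTokenizer pattern above.
toy_cv = CountVectorizer(
    stop_words="english",
    ngram_range=(1, 1),
    token_pattern=r"[a-zA-Z0-9]+",
)
toy_counts = toy_cv.fit_transform(docs)

# Column order of the matrix: vocabulary words sorted by column index.
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
dense = toy_counts.toarray()

print(vocab)
print(dense)
```

Each row is one document and each column one vocabulary word. Stop words such as "the" get no column, and the cell for "great" in the first row holds 2 because the word occurs twice.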

In [11]:
# To train and evaluate a machine learning model, we split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(text_counts,data['category'], test_size=0.25, random_state=5)

# Then we train a multinomial Naive Bayes classifier on the training set
from sklearn.naive_bayes import MultinomialNB
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
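
In the cells above, the vectorizer is fitted on the full dataset before the split, so its vocabulary also contains words that occur only in test comments. One possible alternative, shown here as a sketch and not as this notebook's method, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so that the vocabulary is learned from the training texts alone. The texts and labels below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up comments: 1 = positive, -1 = negative, 0 = neutral.
texts = [
    "love this phone", "wonderful service today",
    "really happy with it", "great value and quality",
    "hate this update", "terrible support team",
    "awful battery life", "worst purchase ever",
    "package arrived monday", "meeting starts at noon",
    "store opens at nine", "report has ten pages",
]
labels = [1, 1, 1, 1, -1, -1, -1, -1, 0, 0, 0, 0]

# Split the raw texts first, with the same proportions as above.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=5
)

# The pipeline fits the vectorizer on the training texts only,
# then trains the classifier on the resulting counts.
model = make_pipeline(
    CountVectorizer(stop_words="english", token_pattern=r"[a-zA-Z0-9]+"),
    MultinomialNB(),
)
model.fit(X_tr, y_tr)

# The fitted pipeline accepts raw strings directly.
vocab = model.named_steps["countvectorizer"].vocabulary_
print(sorted(vocab))
print(model.predict(["love this phone"]))
```

With twelve toy comments the predictions themselves carry no meaning. The point is only that new text can be passed to `model.predict` without a separate `transform` step.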

In [12]:
# We compute the model's accuracy on the test set
MNB.score(X_test, Y_test)

0.6988207075754548
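
An accuracy of about 0.70 is easier to judge against a simple baseline. The sketch below computes the accuracy of a hypothetical classifier that always predicts the most frequent class, using the test-set class counts that the classification report later in this notebook lists under `support`. The baseline itself is an illustration and is not computed anywhere in the notebook.

```python
# Test-set class counts, copied from the "support" column of the
# classification report in this notebook.
support = {-1.0: 10931, 0.0: 17039, 1.0: 22060}
total = sum(support.values())

# A classifier that always predicts the most frequent class is right
# exactly as often as that class occurs.
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

# Accuracy reported by MNB.score above.
model_accuracy = 0.6988207075754548

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"gain over baseline: {model_accuracy - baseline_accuracy:.3f}")
```

The positive class makes up roughly 44% of the test set, so the Naive Bayes model improves on this baseline by about 26 percentage points.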

In [13]:
# Finally, we predict the labels of the test set
predicted=MNB.predict(X_test)

In [14]:
# To better understand the model, we print the per-class classification report
import matplotlib.pyplot as plt
import numpy
from sklearn import metrics

print(metrics.classification_report(Y_test, predicted))

              precision    recall  f1-score   support

        -1.0       0.71      0.54      0.62     10931
         0.0       0.83      0.57      0.68     17039
         1.0       0.64      0.87      0.74     22060

   micro avg       0.70      0.70      0.70     50030
   macro avg       0.73      0.66      0.68     50030
weighted avg       0.72      0.70      0.69     50030



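Precision is highest for neutral comments (0.83), followed by negative comments (0.71), and lowest for positive comments (0.64). Recall shows the opposite pattern: the model recovers 87% of the positive comments but only 54% of the negative and 57% of the neutral ones, which suggests that it predicts the positive class too often.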
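
The `macro avg` and `weighted avg` rows of the report can be reproduced from the per-class rows. The sketch below copies the rounded per-class values from the report above, so the results agree with the report only up to rounding. The helper names `macro_avg` and `weighted_avg` are illustrative and are not scikit-learn functions.

```python
# Per-class values copied (already rounded) from the classification report.
report = {
    -1.0: {"precision": 0.71, "recall": 0.54, "support": 10931},
    0.0: {"precision": 0.83, "recall": 0.57, "support": 17039},
    1.0: {"precision": 0.64, "recall": 0.87, "support": 22060},
}
total = sum(row["support"] for row in report.values())


def macro_avg(metric):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_avg(metric):
    """Mean over classes, weighted by each class's number of test samples."""
    return sum(row[metric] * row["support"] for row in report.values()) / total


for metric in ("precision", "recall"):
    print(f"{metric}: macro={macro_avg(metric):.2f}, "
          f"weighted={weighted_avg(metric):.2f}")
```

The weighted average sits closer to the positive-class values because that class has the most test samples, while the macro average treats the three classes equally.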