# A first notebook: sentiment analysis on the first 100k entries of the Yelp database

This notebook closely follows the instructions given by **Natassha Selvaraj** on [Medium](https://medium.com/towards-data-science/a-beginners-guide-to-sentiment-analysis-in-python-95e354ea84f6).

In [None]:
import sys
# add the parent folder to the module search path (for this session only, system variables stay unchanged)
sys.path.append("..")

# importing all needed libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

# first, we have to import Vader
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# set the random seed
RSEED = 42

In [None]:
# load the first 100k lines of the review file into a dataframe

df = pd.read_csv('../data/review_1819_eng.csv')

## Next step is to generate wordclouds

First we will start with all reviews and then split the data into positive and negative reviews and compare the corresponding clouds.

In [None]:
# initialize the stopword list:
stopwords = nltk.corpus.stopwords.words('english')

# extend the stopword list with uninformative words that dominated the first few clouds
additional_stopwords = ['one', 'go', 'also', 'would', 'get', 'got']
stopwords.extend(additional_stopwords)

# join all review texts into one string
text = " ".join(df.text)

# generate the wordcloud, excluding the stopwords
wordcloud = WordCloud(stopwords=stopwords).generate(text)

In [None]:
# plot the wordcloud

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

It can be seen that a large part of the reviews concerns restaurants and bars (food, drink, ordered, etc.).
It is also noticeable that far more reviews are positive than negative, as shown by words such as good, great, love and amazing.

## Now we have to classify the reviews into positive and negative reviews

To do this, all reviews with fewer than 3 stars will be classified as negative and all reviews with more than 3 stars as positive. As 3 stars is a neutral rating, we will drop these reviews.

In [None]:
# remove all 3-star reviews
# assign the positive (+1) and negative (-1) classes to reviews above or below 3 stars in a new feature called sentiment

# .copy() makes df_sentiment an independent dataframe and avoids pandas' SettingWithCopyWarning
df_sentiment = df[df['stars'] != 3].copy()
df_sentiment['sentiment'] = df_sentiment['stars'].apply(lambda rating : +1 if rating > 3 else -1)

# look at the head of the new dataframe showing the new feature

df_sentiment.head()

Next we build word clouds for the positive and the negative reviews.
To do so, we have to split the dataframe into a positive and a negative dataframe.

In [None]:
# split df in positive and negative df

df_pos = df_sentiment[df_sentiment['sentiment'] == 1]
df_neg = df_sentiment[df_sentiment['sentiment'] == -1]

In [None]:
# generate the positive wordclouds and plot them

pos = " ".join(text for text in df_pos.text)
wordcloud_pos = WordCloud(stopwords=stopwords).generate(pos)

plt.imshow(wordcloud_pos, interpolation='bilinear')
plt.axis('off')
plt.show()

In [None]:
# generate the negative wordclouds and plot them

neg = " ".join(text for text in df_neg.text)
wordcloud_neg = WordCloud(stopwords=stopwords).generate(neg)

plt.imshow(wordcloud_neg, interpolation='bilinear')
plt.axis('off')
plt.show()

These word clouds give little indication of the star rating of the reviews.

## Having done a first set of EDA, we can now train our first sentiment analysis model

Before vectorizing the words, we have to do some data cleaning.

We will remove a fixed set of punctuation characters and line breaks.

In [None]:
# define the function for text cleaning: every character in this list is removed
punctuation = ['"', '(', ')', '-', '$', ',', '+', "'", "\n", "\r"]

def clean_text(text):   
    cleaned_text = "".join(u for u in text if u not in punctuation)
    return cleaned_text
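
As a quick check of what the cleaning step does, the sketch below applies the same `clean_text` logic to a made-up review (the sample string is an assumption, not taken from the Yelp data). Only the listed characters are removed, so periods and exclamation marks survive, and dropping `\n` without a replacement joins the neighbouring words.

```python
# same character list and logic as the cleaning function in this notebook
punctuation = ['"', '(', ')', '-', '$', ',', '+', "'", "\n", "\r"]

def clean_text(text):
    return "".join(u for u in text if u not in punctuation)

# made-up example review, for illustration only
sample = 'The "best" pizza (ever), only $5!\nWould come again.'
cleaned = clean_text(sample)
print(cleaned)  # The best pizza ever only 5!Would come again.
```

If joined words such as `5!Would` turn out to matter, replacing line breaks with a space instead of deleting them would be one possible adjustment.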

In [None]:
# apply function to df
df_sentiment['text'] = df_sentiment['text'].apply(clean_text)


Now we have to split the data into a training and a test set.

In [None]:
# split data into feature and target 
X = df_sentiment['text']
y = df_sentiment['sentiment']

# split data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RSEED)
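
When no `test_size` is given, `train_test_split` holds out 25 % of the rows for testing. Since the word cloud suggested that positive reviews dominate, passing `stratify=y` is one option to keep the class ratio the same in both parts. The sketch below shows this on made-up labels (`X_toy` and `y_toy` are hypothetical names); it is not the split used in this notebook.

```python
from sklearn.model_selection import train_test_split

RSEED = 42

# made-up, imbalanced toy data: 80 positive and 20 negative labels
X_toy = [f"review {i}" for i in range(100)]
y_toy = [1] * 80 + [-1] * 20

# default test_size is 0.25; stratify=y_toy preserves the 80/20 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=RSEED, stratify=y_toy
)

print(len(X_tr), len(X_te))             # sizes of the training and test parts
print(y_te.count(1), y_te.count(-1))    # class counts in the test part
```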

Now we have to vectorize the words.
We will use the TF-IDF vectorizer.

In [None]:
# initialize vectorizer
vectorizer = TfidfVectorizer()

# fit and transform the text
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
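
To make the fit/transform distinction concrete, here is a minimal sketch on a made-up mini corpus (the texts and the `*_toy` names are assumptions for illustration). The vocabulary and IDF weights are learned from the training texts only, so a word that appears only in the test texts gets no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up mini corpus, for illustration only
train_texts = ["great food great service", "terrible food", "great place"]
test_texts = ["great pizza"]  # "pizza" never occurs in the training texts

toy_vectorizer = TfidfVectorizer()

# learn vocabulary and IDF weights from the training texts only
X_train_toy = toy_vectorizer.fit_transform(train_texts)
# reuse the learned vocabulary; unseen words are ignored
X_test_toy = toy_vectorizer.transform(test_texts)

print(sorted(toy_vectorizer.vocabulary_))    # learned vocabulary
print(X_train_toy.shape, X_test_toy.shape)   # same number of columns
```

Fitting on the training set alone keeps information from the test reviews out of the features, which is why the notebook calls `fit_transform` on `X_train` but only `transform` on `X_test`.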

Now we can train the model

In [None]:
# initialize the model
logreg = LogisticRegression()

# fit the model
logreg.fit(X_train, y_train)

In [None]:
# make predictions
y_pred = logreg.predict(X_test)

In [None]:
# plot the confusion matrix (true labels first: rows are true labels, columns are predictions)
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='g')

In [None]:
# show the classification report (true labels first, predictions second)
print(classification_report(y_test, y_pred))
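
scikit-learn documents both `confusion_matrix` and `classification_report` as taking the true labels first and the predictions second. The order does not affect accuracy, but it decides which axis of the matrix holds the true labels and which values are reported as precision and recall. A small sketch on made-up labels (the `*_toy` names are hypothetical):

```python
from sklearn.metrics import confusion_matrix

# made-up labels, for illustration only
y_true_toy = [1, 1, 1, 1, -1, -1]
y_pred_toy = [1, 1, 1, -1, -1, -1]

# documented order: true labels first, predictions second
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[-1, 1])
# swapping the arguments transposes the matrix
cm_swapped = confusion_matrix(y_pred_toy, y_true_toy, labels=[-1, 1])

print(cm)
print(cm_swapped)
```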

### This model predicts the sentiment of a review with an accuracy of 97 %

# Sentiment Analysis using VADER

In [None]:
# create the SentimentIntensityAnalyzer object once instead of on every call
sid_obj = SentimentIntensityAnalyzer()

# calculate the negative, neutral, positive and compound scores, plus a verbal evaluation
def sentiment_vader(sentence):

    sentiment_dict = sid_obj.polarity_scores(sentence)
    negative = sentiment_dict['neg']
    neutral = sentiment_dict['neu']
    positive = sentiment_dict['pos']
    compound = sentiment_dict['compound']

    # the compound score decides the verbal evaluation
    if compound >= 0.05:
        overall_sentiment = "Positive"

    elif compound <= -0.05:
        overall_sentiment = "Negative"

    else:
        overall_sentiment = "Neutral"

    return negative, neutral, positive, compound, overall_sentiment
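
The thresholding step can be looked at in isolation. The sketch below uses a hypothetical helper, `label_from_compound`, that applies the same ±0.05 cut-offs to made-up compound scores; it does not call VADER itself.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a verbal label using symmetric cut-offs."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# made-up compound scores, including both boundary values
for score in (0.8, 0.05, 0.0, -0.05, -0.6):
    print(score, label_from_compound(score))
```

Scores exactly at +0.05 or -0.05 count as Positive or Negative, so only values strictly between the two cut-offs are labelled Neutral.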

In [None]:
# look at a random sample of neutral ratings

df[df.stars == 3].sample(10)

In [None]:
# get an insight into these reviews

x = 1440001
print(df.text[x])
print(df.stars[x])
sentiment_vader(df.text[x])