# **Sentiment Analysis**

**What is sentiment analysis?** 

In simple terms, sentiment analysis is the process of mining data such as opinions, reviews or sentences to predict the emotion they express. It relies on natural language processing (NLP), a branch of computer science concerned with giving computers the ability to understand text and spoken words in much the same way human beings can. Sentiment analysis typically classifies text into three classes: “Positive”, “Negative” or “Neutral”. It analyzes the data and labels the ‘better’ and ‘worse’ sentiments as positive and negative respectively.

Sentiment analysis is helpful in a variety of applications; here it is used to understand real customer feedback based on comments and reviews.

---
To carry out this analysis, I tried to answer several questions and to check whether my assumptions were right.

From the modeling point of view, different approaches were used:
- an approach using the NLTK library (Natural Language Toolkit - https://www.nltk.org/) with the VADER model;
- several Machine Learning models (KNN, Naive Bayes, Logistic Regression, SVM, Random Forest, XGBoost) along with pre-trained Deep Learning models (such as Hugging Face's RoBERTa);
- extra: ready-made pipelines that make sentiment analysis quick and easy (these will be useful for the Streamlit sentiment analyzer web app).
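
Whichever model produces the score, the final step is the same three-way labelling. The sketch below assumes a VADER-style compound score in [-1, 1] and the commonly used ±0.05 cut-offs; the function name `label_from_compound` and the example scores are illustrative and not taken from this project.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound polarity score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# With NLTK, the score would typically come from
# SentimentIntensityAnalyzer().polarity_scores(text)["compound"]
# (this needs the 'vader_lexicon' resource); here the scores are made up.
for score in (0.62, -0.40, 0.01):
    print(score, label_from_compound(score))
```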

---

This is the analyzed product: 
- Product: https://www.amazon.co.uk/PreSonus-3-5-inch-High-Definition-Active-Monitors/dp/B075QVMBT9/ref=cm_cr_arp_d_product_top?ie=UTF8
- Reviews: https://www.amazon.co.uk/product-reviews/B075QVMBT9/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&sortBy=recent&pageNumber=1

## Importing Dependencies

In [29]:
import matplotlib.pyplot as plt
from matplotlib import style
from datetime import datetime
import pandas as pd
import numpy as np
import seaborn as sns
import json
from wordcloud import WordCloud,STOPWORDS

import nltk
from nltk.stem  import WordNetLemmatizer

In [31]:
pd.set_option("display.max_colwidth", None)   # full option name; show complete review texts
plt.style.use('ggplot')

## Loading Data

After the data scraping/mining step I ended up with a JSON file, which needs to be converted into a pandas DataFrame to simplify the analysis.

This is the purpose of the **json_2_pandas** function: it takes as input the path of the JSON file, then opens it in read (`'r'`) mode and loads the data. After loading, I iterate through the object to extract the review dates, titles, ratings and contents, and add them to a dictionary.

A second function, **format_date**, uses the datetime module to convert the dates into a format that is easier for pandas to handle.

In [32]:
def format_date(css_date):
    # keep only the last three tokens (day, month name, year) and join them into one string
    date_string = " ".join(css_date.split()[-3:])

    # parse the '%d %B %Y' date (e.g. '7 May 2023') and re-emit it as YYYY-MM-DD
    date_object = datetime.strptime(date_string, '%d %B %Y')
    formatted_date = date_object.strftime('%Y-%m-%d')
    return formatted_date
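
As a quick check of the date handling, the sketch below re-declares `format_date` in compact form and runs it on a made-up header string. The exact wording of the scraped 'place and date' field is an assumption; only its last three tokens (day, month name, year) matter. Note that `%B` is locale-dependent and is assumed here to parse English month names.

```python
from datetime import datetime


def format_date(css_date):
    # last three whitespace-separated tokens: day, month name, year
    date_string = " ".join(css_date.split()[-3:])
    return datetime.strptime(date_string, "%d %B %Y").strftime("%Y-%m-%d")


# Assumed shape of the scraped field; the real text may differ.
raw = "Reviewed in the United Kingdom on 7 May 2023"
print(format_date(raw))  # 2023-05-07
```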

In [33]:
def json_2_pandas(json_path):
    with open(json_path, 'r') as json_file:
        data = json.load(json_file)
    
    reviews = {"Date": [],
               "Title": [],
               "Rating": [],
               "Content": []}

    for page in data:
        if len(page) != 0:          # if there are reviews in that page list
            for review in page:
                # append to the lists in the dictionary the desired elements
                reviews['Date'].append(format_date(review['place and date']))
                reviews['Title'].append(review["title"])
                reviews['Rating'].append(int(review["rating"][:1]))
                reviews['Content'].append(review["body"])
    
    reviews = pd.DataFrame.from_dict(reviews)
    return reviews
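
For reference, the loop in `json_2_pandas` implies a nested layout: a list of pages, each page being a list of review dictionaries. The sketch below builds a small made-up example of that layout and flattens it with the standard library only. The key names are the ones read by the function; the values, including the wording of the rating string, are assumptions.

```python
import json

# Hypothetical scrape result: a list of pages, each page a list of review dicts.
raw = json.dumps([
    [
        {"place and date": "Reviewed in the United Kingdom on 7 May 2023",
         "title": "Great deal.", "rating": "5.0 out of 5 stars", "body": ""},
        {"place and date": "Reviewed in the United Kingdom on 14 February 2020",
         "title": "Good quality", "rating": "3.0 out of 5 stars", "body": ""},
    ],
    [],  # a page without reviews contributes nothing
])

data = json.loads(raw)

# Flatten pages -> reviews; the first character of the rating string is the star count.
rows = [
    {"Title": review["title"],
     "Rating": int(review["rating"][:1]),
     "Content": review["body"]}
    for page in data
    for review in page
]
print(rows)
```

Iterating over an empty page simply yields nothing, so the explicit length check is a readability choice rather than a requirement.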

Converting the .json file into a pandas DataFrame.

In [34]:
# raw string, so the backslashes in the Windows path are not interpreted as escape sequences
path = r'G:\Il mio Drive\MAGISTRALE\IT Coding\Project\Sentiment-Analysis-on-Amazon-product-reviews\Data\B075QVMBT9_reviews.json'
df = json_2_pandas(json_path = path)

In [35]:
# to have all the reviews in a .csv file
df.to_csv('reviews.csv')

In [36]:
print('Before correction: ', df.iloc[2,1])
df.iloc[2,1] = "Its a beauty"           # can correct it right away since I saw it
print('After correction: ', df.iloc[2,1])

Before correction:  Its a beuaty
After correction:  Its a beauty


In [37]:
print(df.iloc[16]['Content'])

Bought these speakers to use for my new gaming pc and they haven’t disappointed. Well packaged, look and sound great. I often listen to music such as House so wanted speakers that could also have good bass and they would perform. Brilliant for such a low price.


In [38]:
print(df.iloc[13]['Content'])

Great set of speakers. Good quality sound. Easy connections. Very happy 🙂👍


As we can see from these examples, the first one contains a typo, which is to be expected in user-written reviews. The second example shows that long reviews have been scraped properly, and the third shows that reviews can also contain emoji.

But there could be some missing values: let's check.

In [39]:
df[df['Content'].str.len() == 0]

| | Date | Title | Rating | Content |
|---|---|---|---|---|
| 12 | 2023-05-07 | Great deal. | 5 | |
| 53 | 2023-03-04 | Impressive | 5 | |
| 78 | 2022-12-29 | Loud, and very clear audio | 5 | |
| 167 | 2022-06-24 | Awesome! General balanced sound! | 5 | |
| 291 | 2021-10-29 | great job presonus | 5 | |
| 296 | 2021-10-17 | Amazing | 5 | |
| 387 | 2021-04-12 | The best speakers I have ever owned | 5 | |
| 563 | 2020-05-07 | Amazing Sound. Best I have ever heard | 5 | |
| 601 | 2020-02-14 | Good quality | 3 | |


Some reviews have no content, but the title and the rating still give us sentiment information. In fact, we can already say that these empty reviews are all highly positive, apart from the last one (601), which is fairly neutral: its title notes the good quality of the product but nothing more.
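
Note that `str.len() == 0` only catches empty strings. If the scraper ever produced `None` or whitespace-only bodies, they would slip through. A slightly more defensive check, sketched here on a made-up frame (`toy` is not the project data), might look like this:

```python
import pandas as pd

# Made-up frame with an empty, a whitespace-only and a missing body.
toy = pd.DataFrame({
    "Title": ["Great deal.", "Amazing", "ok", "fine"],
    "Content": ["", "   ", None, "works well"],
})

# Treat missing values as empty, strip whitespace, then compare with ''.
is_blank = toy["Content"].fillna("").str.strip().eq("")
print(toy[is_blank])
```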

## Data Cleaning and PreProcessing

### DATE

Quick date manipulation to obtain 3 columns Day, Month and Year:

In [40]:
# splitting day, month and year in 3 separate columns
date = df['Date'].str.split("-", n=2, expand=True)      # splitting all the values in the column at most 2 times
df['Year'] = date[0].astype(int)
df['Month'] = date[1].astype(int)
df['Day'] = date[2].astype(int)
df = df.drop(['Date'], axis=1)
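
An alternative sketch, assuming the dates are already ISO strings as produced by `format_date`: parse the column once with `pd.to_datetime` and read the parts through the `.dt` accessor. The toy frame below is illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({"Date": ["2023-05-07", "2020-02-14"]})

# Parse once, then extract the components as integer columns.
parsed = pd.to_datetime(toy["Date"], format="%Y-%m-%d")
toy["Year"] = parsed.dt.year
toy["Month"] = parsed.dt.month
toy["Day"] = parsed.dt.day
toy = toy.drop(columns=["Date"])
print(toy)
```

Parsing also validates the values, so a malformed date raises an error instead of silently producing odd integers.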

### STOP WORDS

Creating a column without stop words. Stop words are words that do not impact the overall sentiment of a review. However, the default NLTK stop word list contains words like not, hasn't and wouldn't, which actually convey a negative sentiment. Removing them would end up contradicting the target variable (sentiment). So I have curated a list of stop words that leaves out most of these negations and their alternatives.

Obviously these are just some of the many stop words, but I'll use them as a reference for some EDA.
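
To see why negations matter, compare what happens to a short sentence under a generic list that contains "not" and under a curated list that does not. Both lists and the helper `remove_stop_words` are tiny illustrative stand-ins, not the lists used in this notebook.

```python
def remove_stop_words(text, stop_words):
    """Lower-case the text and drop every whitespace-separated stop word."""
    return " ".join(w for w in text.lower().split() if w not in stop_words)


generic = {"the", "is", "not", "this", "a"}  # toy list that includes a negation
curated = generic - {"not"}                  # same list with the negation kept out

text = "This speaker is not good"
print(remove_stop_words(text, generic))  # speaker good
print(remove_stop_words(text, curated))  # speaker not good
```

With the generic list, a negative statement turns into one that reads as positive.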

In [41]:
# convert everything to lower case first, to maximize stop word matching (among other things)
df['Title'] = df['Title'].str.lower()
df['Content'] = df['Content'].str.lower()

In [42]:
# stop_words = stopwords.words('english')
# print(stop_words)
stop_words = ['i', 'me', "needn't", "don't", "isn", 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y']
new_stopwords = ["would","shall","could","might"]
stop_words.extend(new_stopwords)
len(stop_words)

145

In [43]:
# create a new column with the review content, using a list comprehension to drop the words in the stop_words list
df['Clean Content'] = df['Content'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
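
One limitation of splitting on whitespace is that a token with attached punctuation (for example `them.` at the end of a sentence) never matches its stop word, so a few stop words survive in the cleaned text. A possible refinement, sketched here with a hypothetical helper and a tiny stop word list, compares each token with its surrounding ASCII punctuation stripped:

```python
import string


def remove_stop_words_punct_aware(text, stop_words):
    """Drop tokens whose punctuation-stripped form is a stop word."""
    kept = []
    for token in text.split():
        core = token.strip(string.punctuation)  # 'them.' -> 'them'
        if core not in stop_words:
            kept.append(token)
    return " ".join(kept)


toy_stop_words = {"i", "them", "the", "is"}  # tiny illustrative list
text = "i highly recommend them. the bass is great!"

naive = " ".join(w for w in text.split() if w not in toy_stop_words)
print(naive)                                                # highly recommend them. bass great!
print(remove_stop_words_punct_aware(text, toy_stop_words))  # highly recommend bass great!
```

`string.punctuation` covers ASCII characters only, so typographic apostrophes such as the one in `i’ve` are left untouched.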

### DIY SENTIMENT

Lastly, I will write a function that derives a new 'Sentiment' column from the customer rating.

In [44]:
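def rating_2_sentiment(rating):
    # map a 1-5 star rating to a sentiment label
    if rating == 3:
        sentiment = 'Neutral'
    elif rating in (4, 5):
        sentiment = 'Positive'
    elif rating in (1, 2):
        sentiment = 'Negative'
    else:
        # fail loudly instead of returning an unbound local variable
        raise ValueError(f'unexpected rating: {rating!r}')
    return sentiment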

In [45]:
# apply function to Rating column
df['Sentiment'] = df['Rating'].apply(rating_2_sentiment)
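
The same rule can also be written as a lookup table, which keeps the rating-to-label convention in one place. This is a sketch of an alternative, not the notebook's implementation; `RATING_TO_SENTIMENT` and `rating_to_sentiment` are illustrative names.

```python
RATING_TO_SENTIMENT = {
    1: "Negative",
    2: "Negative",
    3: "Neutral",
    4: "Positive",
    5: "Positive",
}


def rating_to_sentiment(rating):
    """Return the sentiment label for a 1-5 star rating."""
    try:
        return RATING_TO_SENTIMENT[rating]
    except KeyError:
        raise ValueError(f"unexpected rating: {rating!r}") from None


print([rating_to_sentiment(r) for r in (5, 3, 1)])  # ['Positive', 'Neutral', 'Negative']
```

With pandas, the dictionary can be passed directly to `Series.map`, for example `df['Rating'].map(RATING_TO_SENTIMENT)`.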

### RE-ORGANIZE DATASET

Putting the title and the cleaned content of each review together in a new column, for completeness. This is useful not only for evaluating the sentiment, but also for having some review text for the observations without review content.

In [46]:
# always join with '. ', even though some titles already end with punctuation
df['Review'] = df['Title'] + '. ' + df['Clean Content']
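
If the doubled punctuation (for example `amazing!. ` or `great deal.. `) ever becomes a problem, the join could be made punctuation-aware. The helper below is a hypothetical sketch; `join_title_and_content` is not part of the notebook.

```python
def join_title_and_content(title, content):
    """Join title and content, adding '. ' only when the title lacks end punctuation."""
    title = title.strip()
    content = content.strip()
    if not content:
        return title
    if not title:
        return content
    separator = " " if title.endswith((".", "!", "?")) else ". "
    return title + separator + content


print(join_title_and_content("very pleased", "great product"))
print(join_title_and_content("sound quality is fudging amazing!", "speakers 100% accurate"))
print(join_title_and_content("great deal.", ""))
```

It could be applied row by row, for instance with `df.apply(lambda r: join_title_and_content(r['Title'], r['Clean Content']), axis=1)`.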

Let's now visualize the complete DataFrame.

In [47]:
# reorganize DataFrame
df = df[['Title', 'Content', 'Clean Content', 'Review', 'Rating', 'Sentiment', 'Year', 'Month', 'Day']]
df.sample(5)

| | Title | Content | Clean Content | Review | Rating | Sentiment | Year | Month | Day |
|---|---|---|---|---|---|---|---|---|---|
| 260 | amazing | amazing | amazing | amazing. amazing | 5 | Positive | 2021 | 12 | 22 |
| 82 | astounding sound quality | these are by far the best speakers i have ever used on my pc - absolutely stunning sound quality, very clean and not overpowering bass. for the first time i have the eq setting on neutral to just enjoy the pure clarity that these offer...perfect for music, videos and gaming! | far best speakers ever used pc - absolutely stunning sound quality, clean not overpowering bass. first time eq setting neutral enjoy pure clarity offer...perfect music, videos gaming! | astounding sound quality. far best speakers ever used pc - absolutely stunning sound quality, clean not overpowering bass. first time eq setting neutral enjoy pure clarity offer...perfect music, videos gaming! | 5 | Positive | 2022 | 12 | 18 |
| 218 | very pleased | great product | great product | very pleased. great product | 5 | Positive | 2022 | 2 | 27 |
| 577 | sound quality is fudging amazing! | these speakers are 100% accurate and is worth the price. the bass is unbelievably amazing and its good for music production and more! deffo recommend this prduct. | speakers 100% accurate worth price. bass unbelievably amazing good music production more! deffo recommend prduct. | sound quality is fudging amazing!. speakers 100% accurate worth price. bass unbelievably amazing good music production more! deffo recommend prduct. | 5 | Positive | 2020 | 4 | 15 |
| 393 | for something so small, they fill your room with quality sound. | presonus fits really well in my bedroom. i’ve fitted them either side of my chimney breast for true stereo and they sound amazing even without a subwoofer...i highly recommend them. | presonus fits really well bedroom. i’ve fitted either side chimney breast true stereo sound amazing even without subwoofer...i highly recommend them. | for something so small, they fill your room with quality sound.. presonus fits really well bedroom. i’ve fitted either side chimney breast true stereo sound amazing even without subwoofer...i highly recommend them. | 5 | Positive | 2021 | 4 | 4 |


In [48]:
df.shape

(740, 9)

We end up with a pandas DataFrame of 740 reviews, each with a date, title, rating, comment and a DIY sentiment column. Let's jump into the analysis.

In [49]:
df.to_csv('clean_reviews.csv') 