 ##### Sentiment analysis is the interpretation and classification of text-based data. The goal is to assign each data point to a class that represents its sentiment (positive, negative, etc.). Sentiment analysis focuses on the polarity, emotions, and intentions of authors. Classic sentiment analysis consists of the following steps: preprocessing, feature extraction, training, and classification.

### Importing data from a SQLite database

In [1323]:
import pandas as pd

In [1324]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [1325]:
import sqlite3

In [1326]:
con= sqlite3.connect(r"C:\Users\dines\Downloads\database.sqlite")

In [1327]:
df = pd.read_sql_query('SELECT * FROM Reviews', con)

In [1328]:
df.shape

(568454, 10)
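
To illustrate the loading step in isolation, here is a minimal sketch that builds a small in-memory SQLite database and reads it back with `pd.read_sql_query`. The three-row `Reviews` table and its values are made up for the example; only the table name and a few column names mirror the notebook.

```python
import sqlite3
import pandas as pd

# In-memory database standing in for database.sqlite (toy data, not the real reviews).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Summary TEXT)")
con.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?)",
    [(1, 5, "Good Quality Dog Food"), (2, 1, "Not as Advertised"), (3, 4, "Great taffy")],
)
con.commit()

# Same call pattern as the notebook: run a SELECT and get a DataFrame back.
toy_df = pd.read_sql_query("SELECT * FROM Reviews", con)
con.close()  # release the connection once the data is in memory

print(toy_df.shape)
```

Closing the connection after the read is optional in a short notebook session, but it avoids holding the database file open.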

In [1329]:
df.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [1330]:
df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [1331]:
df['HelpfulnessNumerator']>df['HelpfulnessDenominator'] ##invalid row

0         False
1         False
2         False
3         False
4         False
          ...  
568449    False
568450    False
568451    False
568452    False
568453    False
Length: 568454, dtype: bool

In [1332]:
df[df['HelpfulnessNumerator']>df['HelpfulnessDenominator']]

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
44736,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...
64421,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...


In [1333]:
df_valid=df[df['HelpfulnessNumerator']<=df['HelpfulnessDenominator']]

In [1334]:
df_valid.shape

(568452, 10)
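
The validity check above can be sketched on a toy frame. The values below are invented; the idea is that a review cannot have more "helpful" votes than total votes, so such rows are treated as invalid.

```python
import pandas as pd

toy = pd.DataFrame({
    "HelpfulnessNumerator":   [1, 3, 0, 3],
    "HelpfulnessDenominator": [1, 2, 0, 3],
})

# A row is invalid when more people found it helpful than voted at all.
invalid_mask = toy["HelpfulnessNumerator"] > toy["HelpfulnessDenominator"]

print(int(invalid_mask.sum()))   # number of invalid rows
toy_valid = toy[~invalid_mask]   # for integer columns, same as numerator <= denominator
```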

In [1335]:
df_valid.duplicated(['UserId','ProfileName','Time','Text'])

0         False
1         False
2         False
3         False
4         False
          ...  
568449    False
568450    False
568451    False
568452    False
568453    False
Length: 568452, dtype: bool

### Performing Sentiment Analysis on Data

In [1336]:
df_valid[df_valid.duplicated(['UserId','ProfileName','Time','Text'])]

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
29,30,B0001PB9FY,A3HDKO7OW0QNK4,Canadian Fan,1,1,5,1107820800,The Best Hot Sauce in the World,I don't know if it's the cactus or the tequila...
574,575,B000G6RYNE,A3PJZ8TU8FDQ1K,Jared Castle,2,2,5,1231718400,"One bite and you'll become a ""chippoisseur""","I'm addicted to salty and tangy flavors, so wh..."
1973,1974,B0017165OG,A2EPNS38TTLZYN,tedebear,0,0,3,1312675200,Pok Chops,The pork chops from Omaha Steaks were very tas...
2309,2310,B0001VWE0M,AQM74O8Z4FMS0,Sunshine,0,0,2,1127606400,Below standard,Too much of the white pith on this orange peel...
2323,2324,B0001VWE0C,AQM74O8Z4FMS0,Sunshine,0,0,2,1127606400,Below standard,Too much of the white pith on this orange peel...
...,...,...,...,...,...,...,...,...,...,...
568409,568410,B0018CLWM4,A2PE0AGWV6OPL7,Dark Water Mermaid,3,3,5,1309651200,Quality & affordable food,I was very pleased with the ingredient quality...
568410,568411,B0018CLWM4,A88HLWDCU57WG,R28,2,2,5,1332979200,litter box,My main reason for the five star review has to...
568411,568412,B0018CLWM4,AUX1HSY8FX55S,DAW,1,1,5,1319500800,Happy Camper,I bought this to try on two registered Maine C...
568412,568413,B0018CLWM4,AVZ2OZ479Q9E8,Ai Ling Chow,0,0,5,1336435200,Two Siberians like it!,When we brought home two 3-month-old purebred ...


In [1337]:
data=df_valid.drop_duplicates(subset=['UserId','ProfileName','Time','Text']).copy()  # explicit copy so later column assignments do not trigger SettingWithCopyWarning

In [1338]:
data.shape

(393931, 10)
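
A small sketch of how `duplicated` and `drop_duplicates` behave on a key subset may help here. The toy rows are made up: the same user posts the same text at the same time under two product IDs, which is the pattern visible in the duplicated rows above.

```python
import pandas as pd

toy = pd.DataFrame({
    "ProductId": ["P1", "P2", "P3"],
    "UserId":    ["U1", "U1", "U2"],
    "Time":      [100, 100, 200],
    "Text":      ["same text", "same text", "other text"],
})

key = ["UserId", "Time", "Text"]
dupes = toy[toy.duplicated(key)]            # only the later repeats are flagged
deduped = toy.drop_duplicates(subset=key)   # keep="first" is the default
print(len(dupes), len(deduped))
```

Because `keep="first"` is the default, the first occurrence of each key survives and every later repeat is dropped.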

In [1339]:
data.dtypes

Id                         int64
ProductId                 object
UserId                    object
ProfileName               object
HelpfulnessNumerator       int64
HelpfulnessDenominator     int64
Score                      int64
Time                       int64
Summary                   object
Text                      object
dtype: object

In [1340]:
data['Time']=pd.to_datetime(data['Time'],unit='s')
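
As a quick cross-check of what `unit='s'` means, the sketch below converts one raw `Time` value with the standard library only. It assumes the values are Unix timestamps in seconds, which is what the conversion above also assumes.

```python
from datetime import datetime, timezone

raw_time = 1303862400  # 'Time' value of the first review shown above
as_utc = datetime.fromtimestamp(raw_time, tz=timezone.utc)
print(as_utc.date())
```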

In [1341]:
import warnings
warnings.filterwarnings('ignore')

 ### What is sentiment analysis?
    Sentiment analysis is the computational task of automatically determining what feelings a writer is expressing in text.
    Some examples of applications for sentiment analysis include:

    1. Analyzing the social media discussion around a certain topic
    2. Evaluating survey responses
    3. Determining whether product reviews are positive or negative

    Sentiment analysis is not perfect, and it cannot tell you why a writer feels a certain way. However, it is useful for quickly summarizing some qualities of text, especially when there is too much text for a human reader to analyze. For this project, the goal is to classify food reviews based on the customers' text.
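
To make the idea of a polarity score concrete, here is a deliberately tiny lexicon-based scorer. It is not TextBlob's implementation; the lexicon, its values, and the function name `toy_polarity` are all made up for illustration.

```python
# Hypothetical mini-lexicon: word -> polarity in [-1, 1]. Values are made up.
TOY_LEXICON = {"good": 0.7, "great": 0.8, "love": 0.5, "bad": -0.7, "awful": -1.0}

def toy_polarity(text):
    """Average the polarity of the words found in the lexicon; 0.0 if none match."""
    words = text.lower().split()
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

for summary in ["Great taffy", "Awful bad product", "Cough Medicine"]:
    print(summary, "->", toy_polarity(summary))
```

A text with no lexicon words scores 0.0, which shows one limitation of this kind of approach: neutral wording and unknown wording look the same.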

In [1342]:
!pip install TextBlob




In [1343]:
from textblob import TextBlob


In [1344]:
text=df['Summary'][0]

In [1345]:
text

'Good Quality Dog Food'

In [1346]:
TextBlob(text).sentiment.polarity

0.7

In [1347]:
polarity=[]

for i in df['Summary']:
    try:
        polarity.append(TextBlob(i).sentiment.polarity)
    except Exception:
        # e.g. a missing (non-string) summary; fall back to neutral polarity
        polarity.append(0)

In [1348]:
len(polarity)

568454
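
The loop above can also be written with an explicit type check instead of a try/except. The sketch below takes the scorer as a parameter so that it runs without TextBlob installed; `safe_polarities` and `fake_scorer` are hypothetical names, and in the notebook the scorer would presumably be `lambda t: TextBlob(t).sentiment.polarity`.

```python
def safe_polarities(texts, scorer, default=0.0):
    """Apply `scorer` to every text, falling back to `default` for non-strings."""
    result = []
    for text in texts:
        if isinstance(text, str):
            result.append(scorer(text))
        else:
            result.append(default)   # e.g. None / NaN from a missing Summary
    return result

# Stand-in scorer so the sketch runs without TextBlob installed.
fake_scorer = lambda text: 1.0 if "good" in text.lower() else 0.0
scores = safe_polarities(["Good Quality Dog Food", None, "Pok Chops"], fake_scorer)
print(scores)
```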

In [1349]:
data= df.copy()

In [1350]:
data['polarity']=polarity

In [1351]:
data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,polarity
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,0.7
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,0.0
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,0.0
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...,0.0
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...,0.8


### Perform EDA for positive sentences

In [1352]:
data_positive=data[data['polarity']>0]

In [1353]:
data_positive.shape

(331665, 11)
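
The positive subset uses `polarity > 0` as its cut-off. A minimal sketch of the same thresholding as a three-way label follows; the function name and the choice to treat exactly 0 as neutral are assumptions for illustration.

```python
def polarity_label(polarity):
    """Map a polarity score to a coarse sentiment label."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

labels = [polarity_label(p) for p in [0.7, 0.0, -0.3, 0.8]]
print(labels)
```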

In [1354]:
!pip install wordcloud



In [1456]:
from wordcloud import WordCloud,STOPWORDS

In [1457]:
stopwords=set(STOPWORDS)

In [1547]:
total_text=(' '.join(data_positive['Summary']))  # join with a space so adjacent summaries do not fuse into one word

In [1548]:
#total_text

In [1549]:
import re

In [1550]:
total_text=re.sub('[^a-zA-Z]',' ',total_text)

In [1551]:
total_text=re.sub(' +',' ',total_text)

In [1552]:
total_text[0:10000]

In [1553]:
wordcloud= WordCloud(width=1000,height=500,stopwords=stopwords).generate(total_text)
plt.figure(figsize=(15,5))
plt.imshow(wordcloud)
plt.axis('off')  # hide the pixel axes, which carry no meaning for a word cloud
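
A word cloud is essentially a picture of word frequencies, so the same information can be inspected as plain counts. The sketch below applies the notebook's two regex cleaning steps to a few invented summaries and counts words with `collections.Counter`; the tiny stop-word set and the lower-casing step are additions for the example.

```python
import re
from collections import Counter

summaries = ["Great taffy", "Good Quality Dog Food", "Great price, good food!"]  # toy input
stop = {"the", "a", "and"}  # tiny hypothetical stop-word set

text = " ".join(summaries)               # the space keeps summaries from fusing together
text = re.sub("[^a-zA-Z]", " ", text)    # same cleaning steps as in the notebook
text = re.sub(" +", " ", text).lower()   # lower-casing is an extra step added here

counts = Counter(w for w in text.split() if w not in stop)
print(counts.most_common(3))
```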

### Perform EDA for negative sentences

In [1554]:
#data_negative=data[data['polarity']<0]

In [1555]:
#data_negative.shape

In [1556]:
#total_text2=(' '.join(data_negative['Summary']))

In [1557]:
#total_text2=re.sub('[^a-zA-Z]',' ',total_text2)

In [1558]:
#total_text2=re.sub(' +',' ',total_text2)

In [1559]:
#wordcloud2= WordCloud(width=1000,height=500,stopwords=stopwords).generate(total_text2)
#plt.figure(figsize=(15,5))
#plt.imshow(wordcloud2)
#plt.axis('off')


## Analysing which customers Amazon should recommend more products to

#### Amazon can recommend more products only to customers who are likely to buy more, or who have a better conversion rate, so let's prepare the data for this problem statement.



In [1560]:
#df['UserId'].nunique()

In [1561]:
#df.head()

In [1562]:
#raw=df.groupby('UserId').agg({'Summary':'count','Text':'count','Score':'mean','ProductId':'count'}).sort_values(by='Text',ascending=False)

In [1563]:
#raw

In [1564]:
#raw.columns=['no_of_summary','num_text','avg_score','no_of_prod_purchased']
#raw

In [1565]:
#user_10=raw.index[0:10]

In [1566]:
#num_10=raw['no_of_prod_purchased'][0:10]

In [1567]:
#plt.bar(user_10,num_10,label='most recommended user')
#plt.xlabel('UserId')
#plt.ylabel('no_of_prod_purchased')
#plt.xticks(rotation='vertical')

#### These are the top 10 users. We can recommend more products to these user IDs, since there is a high probability that they will buy more.
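
The per-user aggregation can be sketched on toy data with named aggregation, which sets the output column names in the same step. The data and the column names `num_reviews` and `avg_score` are made up. Note that a count of reviews is only a proxy for the number of products purchased.

```python
import pandas as pd

toy = pd.DataFrame({
    "UserId":    ["U1", "U1", "U1", "U2", "U3", "U3"],
    "ProductId": ["P1", "P2", "P3", "P1", "P2", "P4"],
    "Score":     [5, 4, 3, 1, 5, 5],
})

per_user = (
    toy.groupby("UserId")
       .agg(num_reviews=("ProductId", "count"), avg_score=("Score", "mean"))
       .sort_values("num_reviews", ascending=False)
)
print(per_user.head(2))   # the most active reviewers in the toy data
```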




### Which products have a good number of reviews?

In [1568]:
#data['ProductId'].nunique()

In [1569]:
#prod_count=data['ProductId'].value_counts().to_frame()

In [1570]:
#prod_count

In [1571]:
#prod_count[prod_count['ProductId']>500]

In [1572]:
#freq_prod_ids=prod_count[prod_count['ProductId']>500].index

In [1573]:
#data['ProductId'].isin(freq_prod_ids)

In [1574]:
#freq_prod_df=data[data['ProductId'].isin(freq_prod_ids)]

In [1575]:
#freq_prod_df

In [1576]:
#freq_prod_df.columns

In [1577]:
#sns.countplot(y='ProductId', data=freq_prod_df, hue='Score')

### Is there any difference between the behaviour of frequent and infrequent users?

In [1578]:
#x=data['UserId'].value_counts()

In [1579]:
#x

In [1580]:
#data['viewer_type']=data['UserId'].apply(lambda user: 'Frequent' if x[user]>50 else 'Not Frequent')

In [1581]:
#data.head(5)

In [1582]:
#not_freq_viewer=data[data['viewer_type']=='Not Frequent']
#freq_viewer=data[data['viewer_type']=='Frequent']

In [1583]:
#freq_viewer['Score'].value_counts()/len(freq_viewer)*100

In [1584]:
#not_freq_viewer['Score'].value_counts()/len(not_freq_viewer)*100

In [1585]:
#freq_viewer['Score'].value_counts().plot(kind='bar')

In [1586]:
#not_freq_viewer['Score'].value_counts().plot(kind='bar')

### The distribution of ratings among frequent reviewers is similar to that of all reviews. 
### However, we can see that frequent reviewers give fewer 5-star reviews and fewer 1-star reviews.
### Frequent users appear to be more discerning, in the sense that they give fewer extreme reviews than infrequent reviewers.
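
The frequent versus infrequent comparison can be sketched without a per-row lambda by mapping each `UserId` to its review count. The data and the lowered threshold below are invented so that the toy frame contains a frequent user.

```python
import pandas as pd

toy = pd.DataFrame({
    "UserId": ["U1"] * 4 + ["U2", "U3"],
    "Score":  [4, 3, 4, 5, 5, 1],
})
THRESHOLD = 3  # the notebook uses 50; lowered so the toy data has a frequent user

review_counts = toy["UserId"].value_counts()
is_frequent = toy["UserId"].map(review_counts) > THRESHOLD
toy["viewer_type"] = is_frequent.map({True: "Frequent", False: "Not Frequent"})

# Share of each score (in percent) within each group.
shares = toy.groupby("viewer_type")["Score"].value_counts(normalize=True) * 100
print(shares)
```

Comparing shares rather than raw counts matters here, because the two groups differ greatly in size.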

## Are frequent users more verbose?

In [1587]:
#data['Text'][0]

In [1588]:
#type(data['Text'][0])

In [1589]:
#type(data['Text'][0].split(' '))

In [1590]:
#len(data['Text'][0].split(' '))

In [1591]:
#def calculate_len(text):
#    return len(text.split(' '))

In [1592]:
#data['Text_lenght']=data['Text'].apply(calculate_len)

In [1593]:
#not_freq_data=data[data['viewer_type']=='Not Frequent']
#freq_data=data[data['viewer_type']=='Frequent']

In [1594]:
#not_freq_data

In [1595]:
#fig=plt.figure()
#ax1=fig.add_subplot(121)
#ax1.boxplot(freq_data['Text_lenght'])
#ax1.set_xlabel('frequency of frequent reviewers')

#ax2=fig.add_subplot(122)
#ax2.boxplot(not_freq_data['Text_lenght'])
#ax2.set_xlabel('frequency of not frequent reviewers')


#### The distributions of word counts for frequent and infrequent reviewers show that
#### infrequent reviewers have a large number of reviews with a low word count.
#### On the other hand, the largest concentration of word counts is higher for frequent reviewers than for infrequent reviewers.
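
One detail worth knowing when counting words: `split(' ')` and `split()` are not equivalent. The sketch below uses an invented review with a double space to show the difference.

```python
review = "Great  taffy at a great price"  # note the double space after "Great"

naive_count = len(review.split(" "))   # splitting on a single space yields an empty token
word_count = len(review.split())       # splits on any run of whitespace

print(naive_count, word_count)
```

Splitting on a single space can therefore slightly overstate word counts for reviews with repeated spaces.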


### Analyse the length of comments: do customers give lengthy comments or short ones?


In [1596]:
#final=df[0:2000]

In [1597]:
#final.head()

In [1598]:
#final.isnull().sum()

In [1599]:
#final.duplicated().sum()

In [1600]:
#len(final['Text'][0].split(' '))

In [1601]:
#def calc_length(text):
#    return len(text.split(' '))

In [1602]:
#final['Text_length']=final['Text'].apply(calc_length)

In [1603]:
#import plotly.express as px

In [1604]:
#px.box(final,y='Text_length')

#### Conclusion
    Almost 50 percent of users appear to limit their feedback to 50 words, whereas only a few users give lengthy feedback.
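
A statement like this can be checked numerically as well as read off a box plot. The sketch below computes the median and the share of short reviews for a made-up list of word counts; the numbers are illustrative and are not results from the dataset.

```python
from statistics import median

word_counts = [12, 35, 48, 50, 51, 75, 140, 400]  # made-up review lengths in words

share_short = sum(n <= 50 for n in word_counts) / len(word_counts)
print(f"median length: {median(word_counts)} words")
print(f"share with at most 50 words: {share_short:.0%}")
```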

### Analysing score

In [1605]:
#sns.countplot(x=final['Score'])

### Analysing behavior of customers

### Text preprocessing

In [1606]:
#final['Text'][0]

In [1607]:
#final['Text']=final['Text'].str.lower()

In [1608]:
#data=final['Text'][164]

In [1609]:
#punctuation= '''@!#$%^*()<:;{}?/[]'''
#data= final['Text'][164]
#no_punc =''
#for char in data:
#    if char not in punctuation:
#        no_punc=no_punc+char
#no_punc
#    

In [1610]:
#import string
#punctuations=string.punctuation

#def remove_punc(review):
#    no_punc =''
#    for char in review:
#        if char not in punctuations:
#            no_punc=no_punc+char
#    return no_punc
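
The character-by-character loop works, but the same result can be obtained with `str.translate`. This is a minimal sketch; `remove_punc_fast` is a hypothetical name, and it only handles the ASCII punctuation in `string.punctuation`.

```python
import string

# Translation table that deletes every character in string.punctuation.
_DROP_PUNCT = str.maketrans("", "", string.punctuation)

def remove_punc_fast(review):
    """Return `review` with ASCII punctuation removed (hypothetical helper name)."""
    return review.translate(_DROP_PUNCT)

print(remove_punc_fast("Great taffy, great price!!! <br /> 10/10"))
```

Note that stripping punctuation from an HTML tag such as `<br />` leaves a stray `br` token behind, which is presumably why the notebook removes `' br '` in a later step.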

In [1611]:
#final['Text']=final['Text'].apply(remove_punc)

In [1612]:
#final.head()


In [1613]:
#import nltk
#from nltk.corpus import stopwords

In [1614]:
#data=final['Text'][164]

In [1615]:
#data

In [1616]:
#words=[word for word in data.split(' ') if word not in set(stopwords.words('english'))]

#joined=' '
#for wd in words:
#    joined=joined+wd
#    joined=joined+' '

#joined

In [1617]:
#def remove_stopword(review):
#    stop_words=set(stopwords.words('english'))   # build the set once per call, not once per word
#    return ' '.join([word for word in review.split(' ') if word not in stop_words])
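
Stop-word removal is a set-membership test per word, so building the set once keeps it cheap. The sketch below uses a small hypothetical stop-word set so that it runs without NLTK; in the notebook the set would come from `stopwords.words('english')`.

```python
# Small hypothetical stop-word set; the notebook uses stopwords.words('english') from NLTK.
STOP_WORDS = frozenset({"i", "the", "a", "of", "is", "and", "have", "to"})

def remove_stopword_fast(review, stop_words=STOP_WORDS):
    """Drop stop words from a lower-cased, whitespace-separated review."""
    return " ".join(word for word in review.split() if word not in stop_words)

print(remove_stopword_fast("i have bought several of the vitality canned dog food"))
```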

In [1618]:
#final['Text']=final['Text'].apply(remove_stopword)

In [1619]:
#final['Text'][45]

In [1620]:
#final['Text'].str.contains('http').sum()

In [1621]:
#pd.set_option('display.max_row',2000)
#final['Text'].str.contains('http')


In [1622]:
#review=final['Text'][21]

In [1623]:
#review

In [1624]:
#import re

In [1625]:
#url_pattern=re.compile(r'href|http.\w+')
#url_pattern.sub(r'',review)

In [1626]:
#def remove_url(review):
#    url_pattern=re.compile(r'href|http.\w+')
#    return url_pattern.sub(r'',review)
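
The pattern above is applied after punctuation has been stripped, so a URL has already collapsed into a single token that starts with `http`. If URLs were removed from raw text instead, a pattern along the following lines might be used. The regex and the name `remove_url_raw` are assumptions for illustration, not the notebook's method.

```python
import re

# Assumed pattern for raw text: a scheme followed by any run of non-whitespace.
RAW_URL = re.compile(r"https?://\S+")

def remove_url_raw(review):
    """Strip http/https URLs from text that still contains punctuation."""
    return RAW_URL.sub("", review)

print(remove_url_raw("see http://www.example.com/item for details"))
```

Removing URLs before punctuation is stripped keeps URL fragments from turning into long pseudo-words.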

In [1627]:
#final['Text']=final['Text'].apply(remove_url)

In [1628]:
#final['Text'][34]

In [1629]:
#final['Text'][34].replace(' br ','')

In [1630]:
#final['Text']=final['Text'].str.replace(' br ',' ',regex=False)   # assign the result back; str.replace returns a new value
    

In [1631]:
#final.head()

In [1632]:
#comment_words=' '.join(final['Text'])

In [1633]:
#stopwords=set(STOPWORDS)

In [1634]:
#wordcloud3= WordCloud(width=1000,height=500,stopwords=stopwords).generate(comment_words)
#plt.figure(figsize=(15,5))
#plt.imshow(wordcloud3)
#plt.axis('off')
