# CAPSTONE PROJECT DSFT8 - Digital music

In [None]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

## &#10148;Problem Statement </br> 
### <div class="alert alert-info">Thomas, a global market analyst, wishes to develop an automated system to analyze and monitor an enormous number of reviews. By monitoring the entire review history of products, he wishes to analyze tone, language, keywords, and trends over time to provide valuable insights that increase the success rate of existing and new products and marketing campaigns.</div>

## Introduction

Every day we come across various products, and on digital platforms we swipe through hundreds of choices within a single category, which makes selection tedious for the customer. This is where reviews come in: customers who have already bought a product leave a rating and describe their experience. Ratings can be sorted easily to judge whether a product is good or bad, but written reviews must be read line by line to tell whether they convey a positive or negative sense. In the era of artificial intelligence, this has become easier with Natural Language Processing (NLP).

## Table of contents
 - 1.PREPROCESSING AND CLEANING
 - 2.BUSINESS INSIGHTS AND VISUALIZATION 
 - 3.SENTIMENT ANALYSIS
 - 4.TEXT CLASSIFICATION
 - 5.TIME SERIES ANALYSIS 
 - 6.CLUSTERING
 - 7.PRODUCT RECOMMENDATION
 - 8.PREDICTION OF NEXT PURCHASE DAY
 - 9.CONCLUSION


## &#10148; Required Libraries</br>

- Importing the required libraries for the project

In [None]:
import json                                        # to work with json file
import pandas as pd                                # to work with dataframes
import numpy as np                                 # to work with numpy arrays
import gzip                                        # to extract work file from zip file
import nltk                                        # working with nlp algorithms
from nltk.sentiment import SentimentIntensityAnalyzer  # To predict the sentiments based on the text
from tqdm.notebook import tqdm                     # library for adding progress bar
import sklearn                                     # to working with machine learning algorithms
from sklearn.linear_model import LogisticRegression  # Classification algorithm
from sklearn.feature_extraction.text import TfidfVectorizer # To convert text to numerical based on tfidf score
from nltk.corpus import stopwords                  # to detect stopwords
import re                                          # To remove the unwanted text
from sklearn.metrics import classification_report  # Classification report
from sklearn.metrics import accuracy_score         # evaluation metric
from sklearn.metrics import f1_score               # evaluation metric
from sklearn.metrics import recall_score           # evaluation metric
from sklearn.metrics import precision_score        # evaluation metric
from sklearn.model_selection import train_test_split # train test split
import time                                        # to check the processing time
from sklearn.preprocessing import LabelEncoder     # To convert categorical to numerical
import warnings
warnings.filterwarnings('ignore')                  # To ignore the warnings
from sklearn.model_selection import StratifiedKFold # Splitting
from sklearn.naive_bayes import MultinomialNB       # Naive bayes algorithm
import matplotlib.pyplot as plt                     # Visualization tool
import seaborn as sns                               # Visualization tool
from statsmodels.tsa.seasonal import seasonal_decompose            # Time series components
from statsmodels.tsa.stattools import adfuller                      # To find the stationarity of the data
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf       # To plot ACF and PACF plots
from statsmodels.tsa.arima.model import ARIMA                       # To build the ARIMA model
from sklearn.metrics import mean_squared_error                      # To check the mean squared error
from statsmodels.tsa.statespace.sarimax import SARIMAX              # To build the SARIMAX model
from sklearn.neighbors import NearestNeighbors                      # Nearest-neighbour search
from sklearn.metrics.pairwise import cosine_similarity              # Cosine similarity between vectors
import scipy.sparse
from scipy.sparse import csr_matrix                                 # Compressed sparse row matrix
from scipy.sparse.linalg import svds                                # Truncated SVD for sparse matrices
from sklearn.preprocessing import MinMaxScaler, StandardScaler      # For scaling the data
from sklearn.cluster import KMeans                                  # For cluster formation
from sklearn.feature_extraction.text import CountVectorizer         # For vectorisation
from wordcloud import WordCloud, STOPWORDS                          # For word clouds
from sklearn import metrics                                         # Evaluation metrics
from datetime import datetime, timedelta, date
from sklearn.metrics import confusion_matrix                        # Table of actual vs. predicted class counts
from sklearn.metrics import ConfusionMatrixDisplay                  # Confusion matrix plot (scikit-learn >= 1.0)
try:
    from sklearn.metrics import plot_confusion_matrix               # Older plotting helper, removed in scikit-learn 1.2
except ImportError:
    plot_confusion_matrix = None

## &#10148; Converting file from json to dataframe</br>

- The `gzip` module provides the `GzipFile` class, as well as the `open()`, `compress()` and `decompress()` convenience functions.

- The `yield` keyword turns a function into a generator: it hands back one value at a time and resumes where it left off on the next request, instead of returning everything at once like `return`.
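As a minimal sketch of the two ideas above, the block below reads a gzip-compressed file lazily with a generator. The helper name `iter_records` and the two sample records are made up for illustration, and the sketch assumes each line is strict JSON, which may not hold for every file in this project.

```python
import gzip
import io
import json

def iter_records(fileobj):
    """Yield one parsed record per non-empty line of a gzip-compressed stream."""
    with gzip.open(fileobj, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)

# Build a tiny in-memory .json.gz so the sketch runs without any file on disk.
raw = b'{"asin": "A1", "overall": 5.0}\n{"asin": "B2", "overall": 2.0}\n'
buffer = io.BytesIO(gzip.compress(raw))

records = iter_records(buffer)   # nothing is read yet: generators are lazy
first = next(records)            # reads and parses only the first line
rest = list(records)             # consumes the remaining lines

print(first)
print(rest)
```

Because the generator produces one record at a time, the whole file never has to sit in memory as raw text.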

In [None]:
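import ast                         # safe parsing of Python-literal records

def parse(path):                   # Generator over the records of a gzip-compressed file
  with gzip.open(path, 'rb') as g: # opens the compressed file and closes it when done
    for l in g:
      yield ast.literal_eval(l.decode('utf-8'))   # parses one record per line without executing arbitrary code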

def getDF(path):                   # Creating Function getDF
  i = 0
  df = {}                          # Creating empty dictionary
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')         # .from_dict creates DataFrame object from dictionary by columns or by index allowing dtype specification.

df = getDF('/content/gdrive/MyDrive/Digital music/meta_Digital_Music.json.gz')

## &#10148; Importing the Data</br>

In [None]:
# Checking Shape of the dataset
df.shape

In [None]:
# Checking top 5 rows of the data set
df.head()

In [None]:
# Reading the ratings file and assigning column names
columns=['userId', 'productId', 'ratings','timestamp']
df3 = pd.read_csv("/content/gdrive/MyDrive/Digital music/ratings_Digital_Music.csv", names=columns)

In [None]:
# Checking shape of df3
df3.shape

In [None]:
# importing the data
df1 = getDF('/content/gdrive/MyDrive/Digital music/reviews_Digital_Music.json.gz')
df1.head()

In [None]:
# Checking shape of df1
df1.shape

In [None]:
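# NOTE: this assignment aligns on the row index, so it assumes df3 rows are in the same order as df1
df1['userID'] = df3['userId']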

In [None]:
df1.head()

In [None]:
# importing the data
df2 = getDF('/content/gdrive/MyDrive/Digital music/reviews_Digital_Music_5.json.gz')
df2.head()

In [None]:
# Checking shape
df2.shape

In [None]:
# Feature Selection for data1
data1 = df1[['asin', 'reviewText','reviewerName', 'overall', 'unixReviewTime', 'reviewTime', 'userID']]

In [None]:
# Feature Selection for data2
data2 = df[['asin', 'title', 'categories', 'price', 'brand']]
data2.head()

In [None]:
# Merging the data set
H_data = pd.merge(data1, data2, on = 'asin')
H_data.head()

## &#10148; Data Exploration</br>

In [None]:
# Checking Shape of dataset
H_data.shape

In [None]:
# Checking description
H_data.describe()

In [None]:
# Checking information of dataset
H_data.info()

## **Dataset Details**
#### This file has reviewer ID, user ID, reviewer name, review text, helpful votes, summary (obtained from the review text), overall rating on a scale of 5, and review time

#### Description of columns in the file:

reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B

asin - ID of the product, e.g. 0000013714

reviewerName - name of the reviewer

reviewText - text of the review

overall - rating of the product

summary - summary of the review

unixReviewTime - time of the review (unix time)

reviewTime - time of the review (raw)

## &#10148; Data Preprocessing</br>

In [None]:
# Removing the duplicates
H_data.drop_duplicates(["reviewText","asin","reviewerName"], keep = "last", inplace = True)

In [None]:
# Checking null values
(H_data.isnull().sum()*100)/H_data.shape[0]

In [None]:
# Imputing 'Unknown' in brand column
H_data['brand'] = H_data['brand'].fillna('Unknown')

In [None]:
H_data.drop(['title'], axis = 1, inplace = True)

In [None]:
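# Filling missing prices by linear interpolation; each pass fills at most 5 consecutive gaps,
# so repeating it lets longer runs of missing values fill in progressively
for i in range(50):
  H_data['price'] = H_data['price'].interpolate(method = 'linear', limit = 5)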

In [None]:
H_data.isnull().sum()

In [None]:
# Dropping remaining null values
H_data.dropna(inplace = True)

In [None]:
H_data.isnull().sum()

## &#10148; Data cleaning</br>

- Clean text is human language rearranged into a format that machine models can understand. Text cleaning can be performed using simple Python code that eliminates stopwords, removes unicode words, and simplifies complex words to their root form.

In [None]:
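# Creating cleaning function
def cleaning(text):
    text = re.sub(r"[^0-9A-Za-z\-]+", " ", text)   # keep only letters, digits and hyphens
    text = re.sub(r"(?<!\w)\d+", "", text)          # drop digit runs that start a token (e.g. standalone numbers)
    text = re.sub(r"-(?!\w)", "", text)             # drop hyphens not followed by a word character
    text = " ".join(text.split())                   # collapse repeated whitespace
    text = text.lower()
    return text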

In [None]:
# Calling the cleaning function for reviewText column
H_data["reviewText"] = H_data["reviewText"].apply(cleaning)

In [None]:
# Checking Information
H_data.info()

In [None]:
# converting the data type of reviewTime with date type
H_data['reviewTime'] = pd.to_datetime(H_data['reviewTime'])

## &#10148; Sentiment Analysis</br>

## What is sentiment analysis?

- Sentiment analysis is a text analysis method that detects polarity (e.g. a positive or negative opinion) within the text, whether a whole document, paragraph, sentence, or clause.
- Sentiment analysis aims to measure the attitudes, sentiments, evaluations, and emotions of a speaker/writer based on the computational treatment of subjectivity in a text.



### Creating 'sentiment' column
This is an important preprocessing step: we derive the outcome column (the sentiment of the review) from the overall score. A score greater than 3 is taken as positive, a score less than 3 as negative, and a score equal to 3 as neutral.
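The rule above can be sketched as a small function. The name `rating_to_sentiment` and the sample ratings are illustrative and not part of the notebook; the labels match the ones used in the next cell.

```python
def rating_to_sentiment(rating):
    """Map an overall rating to a sentiment label using 3 as the neutral point."""
    if rating > 3:
        return "Pos"
    if rating == 3:
        return "Neutral"
    return "Neg"

ratings = [5.0, 3.0, 1.0, 4.0, 2.0]   # made-up sample ratings
labels = [rating_to_sentiment(r) for r in ratings]
print(labels)
```

On a dataframe, the same function could be applied in one step with `H_data['overall'].map(rating_to_sentiment)`.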

In [None]:
# Assigning the Positive, Negative and Neutral sentiment on the basis of the overall column
a=[]
for x in H_data['overall']: 
  if x>3:
    x='Pos'
    a.append(x)
  elif x==3:
    x='Neutral'
    a.append(x)
  else:
    x='Neg'
    a.append(x)

In [None]:
H_data['Pros_cons']=a   # stored as 'Pros_cons', the column name the later cells read


#### VADER
- VADER (Valence Aware Dictionary and sEntiment Reasoner) is a model used for text sentiment analysis that is sensitive to both polarity (positive/negative) and intensity (strength) of emotion. It is available in the NLTK package and can be applied directly to unlabeled text data.
- VADER sentiment analysis relies on a dictionary that maps lexical features to emotion intensities known as sentiment scores. The sentiment score of a text is obtained by summing the intensity of each word in the text.
- For example, words like ‘love’, ‘enjoy’, ‘happy’ and ‘like’ all convey a positive sentiment. VADER also understands the basic context of these words, such as “did not love” being a negative statement, and the emphasis of capitalization and punctuation, such as “ENJOY”.
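To make the "sum of word intensities" idea concrete, here is a toy sketch. The lexicon values are invented for illustration and are not VADER's real scores; the normalisation `x / sqrt(x^2 + 15)` follows, to the best of my reading, the form used in the NLTK implementation. Unlike the real analyzer, this sketch ignores negation, capitalization and punctuation.

```python
import math

# Hypothetical mini-lexicon: the valences below are made up for this example.
TOY_LEXICON = {
    "love": 3.0, "enjoy": 2.0, "happy": 2.5, "like": 1.5,
    "worst": -3.0, "waste": -2.0,
}

def toy_compound(text, alpha=15):
    """Sum the valence of each known word, then squash the total into (-1, 1)."""
    total = sum(TOY_LEXICON.get(word, 0.0) for word in text.lower().split())
    return total / math.sqrt(total * total + alpha)

def label_from_compound(compound, threshold=0.05):
    """Turn a compound score into a class using symmetric thresholds."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neutral"

for sentence in ["I love this song", "worst waste of money", "it is a song"]:
    score = toy_compound(sentence)
    print(sentence, round(score, 3), label_from_compound(score))
```

The ±0.05 thresholds are the same ones this notebook applies to the `compound` column further down.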

In [None]:
# downloading the vader lexicon
nltk.download('vader_lexicon')

In [None]:
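# Getting the polarity of reviewText
sia = SentimentIntensityAnalyzer()          # build the analyzer once instead of once per row
res2 = {}
for i, row in tqdm(H_data.iterrows(), total=len(H_data)):
    text = row['reviewText']
    res2[i] = sia.polarity_scores(text)     # keyed by the row's index so the scores line up with H_data in the concat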

In [None]:
# Transposing the dataframe
j = pd.DataFrame(res2).T
j

In [None]:
# concatenating the main data and the polarity scores
M_data = pd.concat([H_data, j], axis = 1)

In [None]:
# Dropping the null values
M_data.dropna(inplace = True)

In [None]:
# Creating the Class column based on compound column
M_data.insert(0, 'Class', np.nan)
M_data.loc[M_data['compound']>=0.05, 'Class'] = 'pos'
M_data.loc[M_data['compound']<=-0.05, 'Class'] = 'neg'
M_data.loc[((M_data['compound'] > -0.05) & (M_data['compound'] < 0.05)), 'Class'] = 'neutral'

In [None]:
# extracting the year and month of reviewTime into separate columns
M_data['year'] = pd.DatetimeIndex(M_data['reviewTime']).year
M_data['month'] = pd.DatetimeIndex(M_data['reviewTime']).month

In [None]:
# saving the data as a CSV file
M_data.to_csv("M_datafinal2.csv")

In [None]:
# installing the googletrans library
!pip install googletrans==3.1.0a0

In [None]:
# Importing the GoogleTrans library
from googletrans import Translator
translator = Translator()

In [None]:
text1 = '''
A Római Birodalom (latinul Imperium Romanum) az ókori Róma által létrehozott 
államalakulat volt a Földközi-tenger medencéjében
'''

text2 = '''
Vysoké Tatry sú najvyššie pohorie na Slovensku a v Poľsku a sú zároveň jediným 
horstvom v týchto štátoch s alpským charakterom. 
'''
a = [text1, text2]
a = pd.DataFrame({'col':a})

In [None]:
# Creating a loop to check if the language is english or not, if not translating it into english
for i in range(len(a.iloc[:, 0])):
  dt = translator.detect(a.iloc[i, 0])
  if  dt.lang != 'en':                      # detect() returns an object; its .lang attribute holds the language code
    a.iloc[i, 0] = translator.translate(a.iloc[i, 0],dest='en').text

## &#10148; Text classification</br>
- Text classification, also known as text tagging or text categorization, is the process of categorizing text into organized groups. By using Natural Language Processing (NLP), text classifiers can automatically analyze text and then assign a set of pre-defined tags or categories based on its content.

In [None]:
# importing the data set which we have created
df = pd.read_csv('/content/gdrive/MyDrive/CSV files/M_datafinal2.csv')
df.head()

In [None]:
# Converting the datatype of reviewTime to Date type
df["reviewTime"] = pd.to_datetime(df["reviewTime"])

In [None]:
# checking null values
df.isnull().sum()

In [None]:
# dropping the null values
df.dropna(inplace = True)

In [None]:
# dropping the 'Unnamed: 0' column
df.drop('Unnamed: 0', axis = 1, inplace = True)

In [None]:
# slicing the data
df1 = df.iloc[:100, :]
df1.head()

### Remove text-Stop words
General NLTK stop words include words like not, hasn't and wouldn't, which actually convey a negative sentiment. Removing them would contradict the target variable (sentiment). So I have curated the stop words to include only those that carry no negative sentiment and have no negative alternatives.
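One way such a curated list could be built is sketched below. The sample stop words are a small hand-written stand-in for a full list, and the helper names `keep_negations` and `remove_stop_words` are hypothetical.

```python
# Small hand-written sample standing in for a full stop-word list (assumption).
SAMPLE_STOP_WORDS = [
    "the", "a", "is", "this", "and",
    "not", "no", "nor", "hasn't", "wouldn't", "don't",
]

NEGATIONS = {"no", "nor", "not"}

def keep_negations(stop_words):
    """Return the stop words with negation words and n't contractions taken out."""
    return [w for w in stop_words if w not in NEGATIONS and not w.endswith("n't")]

def remove_stop_words(text, stop_words):
    """Drop every token of the text that appears in the stop-word list."""
    stop_set = set(stop_words)
    return " ".join(w for w in text.split() if w not in stop_set)

curated = keep_negations(SAMPLE_STOP_WORDS)
default_result = remove_stop_words("this song is not good", SAMPLE_STOP_WORDS)
curated_result = remove_stop_words("this song is not good", curated)

print(default_result)   # the negation is lost
print(curated_result)   # the negation survives
```

With the unfiltered list the review reads as praise, while the curated list preserves the word that flips its meaning.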

In [None]:
# Getting stop words
nltk.download('stopwords')

stop_words = stopwords.words("english")

In [None]:
# removing stop words from reviewText
stop_set = set(stop_words)   # set membership is much faster than scanning a list
df['reviewText'] = df['reviewText'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_set]))

In [None]:
# For WordCloud
wc_stopwords = set(STOPWORDS)              # separate name so the nltk `stopwords` import is not shadowed
def word_cloud(data, title):
    wordcloud = WordCloud(
    background_color = "black",
    max_font_size = 40,
    max_words = 200,
    stopwords = wc_stopwords,
    scale = 3).generate(" ".join(data))    # use the reviews passed in, not the whole dataframe
    fig = plt.figure(figsize = (15, 15))
    plt.axis("off")
    if title: 
        fig.suptitle(title, fontsize=15)
        fig.subplots_adjust(top=2.25)
    plt.imshow(wordcloud)
    plt.show()

In [None]:
df.columns

In [None]:
neg=df[df["Pros_cons"] == "Neg"]["reviewText"]
pos=df[df["Pros_cons"] == "Pos"]["reviewText"]
neu=df[df["Pros_cons"] == "Neutral"]["reviewText"]

In [None]:
word_cloud(pos, "Most Repeated words in positive reviews")
word_cloud(neg, "Most Repeated words in negative reviews")
word_cloud(neu, "Most Repeated words in neutral reviews")

### <div class="alert alert-info">Interpretation
**- The plots above show the most frequently used words in the positive, negative and neutral reviews.**</div>

In [None]:
# Getting BIGRAM
def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2),stop_words='english').fit(corpus)   # converting a text documents to a matrix of token counts.      
    bag_of_words = vec.transform(corpus)                                         # Transforming the corpus into numbers
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]     # it provides a dictionary with the mapping of the word item index 
    words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

In [None]:
# create a function for bigram plots
def n_gram_plot(data,title,color):                            # Creating n_gram_plot function
    x=[x[0] for x in data]
    y=[x[1] for x in data]
    sns.barplot(x=y, y=x, color=color)      # counts on the x-axis, bigrams on the y-axis
    plt.title('{} Reviews Bigrams'.format(title),fontsize=15)
    plt.yticks(rotation=0,fontsize=15)

common_words_good = get_top_n_bigram(pos, 10)                  # Calling get_top_n_bigram for pos columns
common_words_neutral = get_top_n_bigram(neu, 10)               # Calling get_top_n_bigram for neu columns
common_words_bad = get_top_n_bigram(neg, 10)                   # Calling get_top_n_bigram for neg columns

# bigram plot using function above
plt.figure(figsize=(15,10))
# good reviews bigrams
plt.subplot(151)
n_gram_plot(common_words_good,'Good','green')                  # Calling n_gram_plot for pos 
#============================================= 
#neutral reviews bigrams
plt.subplot(153)
n_gram_plot(common_words_neutral,'Neutral','yellow')           # Calling n_gram_plot for neu
#============================================= 
#bad reviews bigrams
plt.subplot(155)
n_gram_plot(common_words_bad,'Bad','red')                      # Calling n_gram_plot for neg
plt.show()

### <div class="alert alert-info">Interpretation
- **The plots above show the most frequently occurring bigrams in the review text.**</div>

In [None]:
X = df['reviewText']
Y = df['Pros_cons']

In [None]:
Y.value_counts()

In [None]:
Y = LabelEncoder().fit_transform(Y)
Y

In [None]:
# Getting unique values and converting it into array
unique, counts = np.unique(Y, return_counts=True)
print(np.asarray((unique, counts)).T)

In [None]:
# Splitting the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

In [None]:
unique, counts = np.unique(Y_train, return_counts=True)
print(np.asarray((unique, counts)).T)

In [None]:
%%time
# Applying TFIDF Vectorizer
vectorizer = TfidfVectorizer(sublinear_tf = True, max_df = 0.1, min_df = 1,
                             use_idf = True, smooth_idf = True)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

## Model selection
Let's consider several classification algorithms and perform the model selection process.

#### &#10148; Logistic regression</br>

In [None]:
%%time
# Making and Fitting the Model
model = LogisticRegression(multi_class = 'ovr').fit(X_train, Y_train)
y_pred = model.predict(X_test)

In [None]:
# Making unique Values and converting the values in array 
unique, counts = np.unique(y_pred, return_counts=True)
print(np.asarray((unique, counts)).T)

In [None]:
%%time
print(classification_report(Y_test, y_pred, target_names = ['neg', 'neu', 'pos']))

In [None]:
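from sklearn.metrics import ConfusionMatrixDisplay   # replaces plot_confusion_matrix, removed in scikit-learn 1.2
fig, ax = plt.subplots(figsize=(10, 10))
ConfusionMatrixDisplay.from_estimator(model, X_test, Y_test, cmap=plt.cm.Blues, display_labels = ['Negative','Neutral','Positive'], ax = ax)
plt.show()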

#### &#10148; Sample Illustration</br>

In [None]:
a = ['Nice song to here', 'worst song and waste of money', 'Good song but quality is not good']
a1 = vectorizer.transform(a)

In [None]:
fo = model.predict(a1)
fo

In [None]:
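label_names = np.array(['Negative', 'Neutral', 'Positive'])   # same order as the encoded classes (Neg, Neutral, Pos)
s = pd.DataFrame({"Random_review":a, "Predictions": label_names[fo]})   # show the model's own predictions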
s

#### &#10148; Naive bayes classifier</br>

In [None]:
%%time
# Making and Fitting the model
model1 = MultinomialNB().fit(X_train, Y_train)
y_pred1 = model1.predict(X_test)

In [None]:
print(classification_report(Y_test, y_pred1, target_names = ['neg', 'neu', 'pos']))

In [None]:
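from sklearn.metrics import ConfusionMatrixDisplay   # replaces plot_confusion_matrix, removed in scikit-learn 1.2
fig, ax = plt.subplots(figsize=(10, 10))
ConfusionMatrixDisplay.from_estimator(model1, X_test, Y_test, cmap=plt.cm.Blues, display_labels = ['Negative','Neutral','Positive'], ax = ax)
plt.show()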

## &#10148; Time series analysis</br>
- Time series analysis is a statistical technique that deals with time series data and trend analysis. Time series data consists of observations measured or collected at regular time intervals.

In [None]:
%%time
plt.figure(figsize = (15, 8))
plt.title('CDF Of Sentiments Across Our Reviews',fontsize=19,fontweight='bold')
sns.kdeplot(df['neg'],bw_method=0.1,cumulative=True)   # bw_method replaces the deprecated bw argument
sns.kdeplot(df['neu'],bw_method=0.1,cumulative=True)
sns.kdeplot(df['pos'],bw_method=0.1,cumulative=True)
plt.xlabel('Sentiment Value',fontsize=19)
plt.legend(['neg', 'neutral', 'pos'])
plt.show()

### <div class="alert alert-info">Interpretation
- **The dominant sentiment is neutral: oddly, most reviews lean neutral rather than clearly positive or negative.**</div>

In [None]:
df1 = df[['neg', 'pos', 'reviewTime']]

In [None]:
# Setting the index as Date
df1 = df1.set_index('reviewTime')
df1.head()

In [None]:
# Resampling the data to weekly totals
df1 = df1.resample('W').sum()
df1.shape

In [None]:
%%time
# Seasonal Decompose For Positive Reviews
decomposition=seasonal_decompose(df1['pos'], period=52)
d_trend=decomposition.trend
d_seasonal=decomposition.seasonal
d_residual=decomposition.resid

plt.figure(figsize=(30,20))   # one figure; the four stacked panels are added below

plt.subplot(411)
plt.plot(df1['pos'],label='Original')
plt.legend(loc='best')
plt.title('Pos_actual', fontsize = 20)

plt.subplot(412)
plt.plot(d_trend,label='Trend')
plt.legend(loc='best')
plt.title('Pos_trend', fontsize = 20)

plt.subplot(413)
plt.plot(d_seasonal,label='Seasonal')
plt.legend(loc='best')
plt.title('Pos_seasonal', fontsize = 20)

plt.subplot(414)
plt.plot(d_residual,label='Residual')
plt.legend(loc='best')
plt.title('Pos_residual', fontsize = 20)

### <div class="alert alert-info">Interpretation
- **The plots above show the time series decomposition for positive reviews**
- **The first plot is the actual weekly series**
- **The second plot shows the trend component, which rises over time**
- **The third plot shows the seasonal component, a pattern that repeats over the period**
- **The last plot shows the residual, the irregular variation left after removing trend and seasonality**</div>

In [None]:
%%time
# Seasonal Decompose For Negative Reviews
decomposition=seasonal_decompose(df1['neg'], period=52)
d_trend=decomposition.trend
d_seasonal=decomposition.seasonal
d_residual=decomposition.resid

plt.figure(figsize=(30,20))   # one figure; the four stacked panels are added below

plt.subplot(411)
plt.plot(df1['neg'],label='Original')
plt.legend(loc='best')
plt.title('neg_actual', fontsize = 20)

plt.subplot(412)
plt.plot(d_trend,label='Trend')
plt.legend(loc='best')
plt.title('neg_trend', fontsize = 20)

plt.subplot(413)
plt.plot(d_seasonal,label='Seasonal')
plt.legend(loc='best')
plt.title('neg_seasonal', fontsize = 20)

plt.subplot(414)
plt.plot(d_residual,label='Residual')
plt.legend(loc='best')
plt.title('neg_residual', fontsize = 20)

### <div class="alert alert-info">Interpretation
- **The plots above show the time series decomposition for negative reviews**
- **The first plot is the actual weekly series**
- **The second plot shows the trend component, which follows an upward trend**
- **The third plot shows the seasonal component, a pattern that repeats over the period**
- **The last plot shows the residual, the irregular variation left after removing trend and seasonality**</div>

In [None]:
plt.figure(figsize=(15,12))
plt.subplot(211)
plot_acf(df1['pos'], ax=plt.gca(), lags = 52)    # weekly series, so 52 lags cover one year
plt.subplot(212)
plot_pacf(df1['pos'], ax=plt.gca(), lags = 52)
plt.show()

In [None]:
plt.figure(figsize=(15,12))
plt.subplot(211)
plot_acf(df1['neg'], ax=plt.gca(), lags = 52)    # weekly series, so 52 lags cover one year
plt.subplot(212)
plot_pacf(df1['neg'], ax=plt.gca(), lags = 52)
plt.show()

In [None]:
plt.figure(figsize=(25, 6))
plt.subplot(1, 2, 1)
sns.boxplot(x=df.year, y=df.pos)     # keyword arguments are required by recent seaborn versions

plt.subplot(1, 2, 2)
sns.boxplot(x=df.month, y=df.pos)
plt.show()

### <div class="alert alert-info">Interpretation
**- The box plots show how the positive scores are distributed across years and months; the median and the spread stay almost the same over the period.**</div>

In [None]:
plt.figure(figsize=(25, 6))
sns.countplot(x=df.year)

In [None]:
plt.figure(figsize=(25, 6))
sns.countplot(x=df.month)

In [None]:
# Creating function to check stationarity
def checkstationary(df):
    pvalue = adfuller(df)[1]
    if pvalue < 0.05:
        ret = 'Pvalue:{}. Data is stationary, Proceed to model building'.format(pvalue)
    else:
        ret = 'Pvalue:{}.Data is not stationary, make data stationary'.format(pvalue)
    return ret

In [None]:
# Checking Stationarity of Negative Sentiment Column
checkstationary(df1['neg'])

### <div class="alert alert-info">Interpretation
**- The augmented Dickey-Fuller test shows that the data is not stationary, so we apply differencing (d = 1) when building the model.**</div>

In [None]:
# Checking Stationarity of Positive Sentiment Column
checkstationary(df1['pos'])

### <div class="alert alert-info">Interpretation
**- The augmented Dickey-Fuller test shows that the data is not stationary, so we apply differencing (d = 1) when building the model.**</div>
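First-order differencing (d = 1) replaces each value with its change from the previous value, which removes a steady trend. The sketch below shows the idea on a made-up weekly series; the helper name `difference` is illustrative.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

weekly = [10, 12, 15, 19, 24]          # made-up values with a rising trend
once = difference(weekly)              # what d = 1 does
twice = difference(once)               # what d = 2 would do

print(once)
print(twice)
```

The seasonal differencing used by SARIMA (D = 1 with a period of 52) is the same operation with `lag=52`. On a pandas series the equivalent call is `Series.diff()`.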

In [None]:
# Splitting the data
split = int(0.95 * len(df1))
train = df1.iloc[:split]
test = df1.iloc[split:]
print("Train = {}, Test = {}".format(len(train), len(test)))

In [None]:
# Creating function to score one (p, d, q)(P, D, Q) combination by RMSE on the test split
def sarima_model(p,d,q,P,D,Q,col='pos'):          # col: sentiment series to fit (the default 'pos' is an assumption)
    sm1=SARIMAX(train[col],order=(p,d,q),seasonal_order=(P,D,Q,52)).fit()   # SARIMAX needs a single series
    f1=sm1.forecast(len(test))
    actual=list(test[col])
    predicted=list(f1)
    RMSE=round(np.sqrt(mean_squared_error(actual,predicted)),3)
    return RMSE,actual,predicted

In [None]:
p=[0,1, 2]
d=1
q=[0,1, 2]
p1=[]
q1=[]
rmse1=[]
P=[0,1, 2]
Q=[0,1, 2]
D=1
P1=[]
Q1=[]
for i in range(len(p)):
    for j in range(len(q)):
        for k in range(len(P)):
            for l in range(len(Q)):
                p1.append(p[i])
                q1.append(q[j])
                P1.append(P[k])
                Q1.append(Q[l])
                rmse1.append(sarima_model(p[i],d,q[j],P[k],D,Q[l])[0])

In [None]:
val2 = pd.DataFrame(zip(p1,q1,P1,Q1,rmse1),columns=['p','q','P','Q','RMSE'])
val2.sort_values(by='RMSE').head(1)

In [None]:
# Creating function for sarima model for negative sentiment
def SARMA1(df):
    model2 = SARIMAX(train['neg'],order=(1, 1, 2),seasonal_order=(1,1,2,52)).fit()
    print('Summary : S')
    print('past_predictions : past')
    print('future_predictions : future')
    select = input('Enter your required information: ')
    summary  = model2.summary()
    pred1 = model2.predict()
    forecast1 = model2.forecast(len(test['neg'])+20)
    if select == 'S':
        return summary
    elif select == 'past':
        return pred1
    else:
        return forecast1

In [None]:
# Creating function for sarima model for positive sentiment
def SARMA2(df):
    model2 = SARIMAX(train['pos'],order=(1, 1, 2),seasonal_order=(1,1,2,52)).fit()
    print('Summary : S')
    print('past_predictions : past')
    print('future_predictions : future')
    select = input('Enter your required information: ')
    summary  = model2.summary()
    pred1 = model2.predict()
    forecast1 = model2.forecast(len(test['pos'])+20)
    if select == 'S':
        return summary
    elif select == 'past':
        return pred1
    else:
        return forecast1

In [None]:
train1 = SARMA1(train['neg'])

In [None]:
train2 = SARMA2(train['pos'])

In [None]:
plt.figure(figsize=(30,10))
plt.title('Actual vs forecast')
plt.plot(train['neg'],marker = '.', label = 'neg', color = 'red')
plt.plot(train['pos'],marker = '.', label = 'pos', color = 'g')
plt.plot(train1,marker = '.', label = 'neg_forecast', color = 'b')
plt.plot(train2,marker = '.', label = 'pos_forecast', color = 'b')
plt.legend()

In [None]:
# Evaluation using RMSE
pos_rmse = np.sqrt(mean_squared_error(test['pos'], train2[:-20]))
neg_rmse = np.sqrt(mean_squared_error(test['neg'], train1[:-20]))   # compare the negative forecast with the negative actuals

In [None]:
res = pd.DataFrame({'Sentiments':['Pos', 'neg'], 'RMSE':[pos_rmse, neg_rmse]})
res

### <div class="alert alert-info">Interpretation
**- Both the visualization and the forecast error values indicate that the SARIMA model gives good forecasts. Positive and negative reviews both increase over the period, but positive reviews increase at a higher rate than negative reviews.**</div>

## &#10148; Clustering</br>
- Cluster analysis is the grouping of objects such that objects in the same cluster are more similar to each other than they are to objects in another cluster. The classification into clusters is done using criteria such as smallest distances, density of data points, graphs, or various statistical distributions.
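The "smallest distances" criterion can be sketched with a plain version of Lloyd's algorithm, the idea behind k-means. This is an illustration in pure Python with caller-supplied starting centroids and made-up points, not the scikit-learn implementation used in this notebook.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm on 2-D points with caller-supplied starting centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)   # keep an empty cluster's centroid in place
        if new_centroids == centroids:      # nothing moved, so the result is stable
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of made-up points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])

print(labels)
print(centers)
```

Each pass alternates between assigning points to the closest centre and moving each centre to the mean of its points, which is why scaling the features first matters: an unscaled column with large values would dominate the distance.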

In [None]:
df.head()

In [None]:
X1 = df[['price', 'unixReviewTime']]

In [None]:
%%time
# Scaling the data (each column is standardised to zero mean and unit variance)
Scaler = StandardScaler()
X1 = pd.DataFrame(Scaler.fit_transform(X1), columns = X1.columns, index = X1.index)

In [None]:
%%time
X1 = X1.values
distortion = []
for i in range(2, 10):
    kmeans = KMeans(n_clusters = i).fit(X1)
    distortion.append(kmeans.inertia_)
plt.figure(figsize = (15, 5))
plt.plot(range(2, 10), distortion)
plt.grid(True)

In [None]:
%%time
# Making the model and fitting it
model1 = KMeans(n_clusters = 5, random_state = 10)
pred = model1.fit_predict(X1)      # fits once and returns the cluster label of each row

In [None]:
df.columns

In [None]:
plt.figure(figsize=(15,8))
sns.scatterplot(x=X1[pred==0,0] ,y=X1[pred==0,1] ,s=100,label="q_target")
sns.scatterplot(x=X1[pred==1,0] ,y=X1[pred==1,1],s=100,label="Sec_target")
sns.scatterplot(x=X1[pred==2,0] ,y=X1[pred==2,1] ,s=100,label="standard")
sns.scatterplot(x=X1[pred==3,0] ,y=X1[pred==3,1],s=100,label="tert_target")   # labels match the cluster mapping used below
sns.scatterplot(x=X1[pred==4,0] ,y=X1[pred==4,1],s=100,label="target")
#sns.scatterplot(x=kmeans.cluster_centers_[:,0] ,y= kmeans.cluster_centers_[:,1] ,s=300,label="center")
plt.title("clusters of customers")
plt.xlabel("Price")
plt.ylabel("Unixreviewtime")


### <div class="alert alert-info">Interpretation
- **1. This graph illustrates the relation between the review time and the price of the product; each cluster groups reviews of particular products.**

- **2. The purple cluster is the target cluster, which holds the excellent and the worst reviews, that is, the 5-rated and 1-rated reviews. As the price of the product increases, the review time also increases; from this we can interpret that the product is either very good or very bad, since people tend to give instant reviews when they strongly like or dislike a product.**

- **3. As the review time increases, the product price also increases, from which we can interpret that customers take more time to review higher-priced products.**</div>

In [None]:
clus = df.copy()

In [None]:
# Assigning the clusters: map each numeric cluster id to its name
cluster_names = {0: "q_target", 1: "Sec_target", 2: "standard", 3: "tert_target", 4: "target"}
clus["clusters1"] = [cluster_names[c] for c in pred]


In [None]:
d1=clus[(clus["clusters1"]=='target')]
d1["overall"].value_counts()

In [None]:
d2=clus[(clus["clusters1"]=='Sec_target')]
d2["overall"].value_counts()

In [None]:
d3=clus[(clus["clusters1"]=='standard')]
d3["overall"].value_counts()

In [None]:
d4=clus[(clus["clusters1"]=='tert_target')]
d4["overall"].value_counts()

In [None]:
d5=clus[(clus["clusters1"]=='q_target')]
d5["overall"].value_counts()

In [None]:
improve=clus[(clus["clusters1"]=='target') & (clus["overall"]<3)]

In [None]:
a=clus[(clus["clusters1"]=='target')]

In [None]:
# Top 10 products to improve: most reviews rated below 3 in the target cluster
improve['asin'].value_counts()[0:10]

In [None]:
improve1=clus[(clus["clusters1"]=='target') & (clus["overall"]==1)]
improve1['asin'].value_counts()[0:10]

In [None]:
improve2=clus[(clus["clusters1"]=='target') & (clus["overall"]==5)]
improve2['asin'].value_counts()[0:10]

## &#10148; Customer segmentation</br>
- We can’t treat every customer the same way, with the same content, channel, and importance; they will find another option that understands them better.
- Customers who use your platform have different needs and different profiles. You should adapt your actions accordingly.
- You can segment in many different ways depending on what you are trying to achieve. If you want to increase the retention rate, you can segment customers based on their similarities.
- There are also some very common and useful segmentation methods. We are going to apply one of them to our business: RFM.
- **1. Recency: How recently customers made their purchase.**
- **2. Frequency: For simplicity, we’ll count the number of times each customer made a purchase.**
- **3. Monetary: How much money they spent in total.**
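
The three RFM measures above can be sketched on a small made-up table. This is only an illustration under stated assumptions: the `toy` data is hypothetical, its column names mirror the notebook's `df`, and each review is treated as one purchase.

```python
import pandas as pd

# Hypothetical toy review log; values are invented for illustration only
toy = pd.DataFrame({
    "userID": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "reviewTime": pd.to_datetime([
        "2018-01-01", "2018-03-01", "2018-02-15",
        "2018-01-10", "2018-02-20", "2018-03-05",
    ]),
    "price": [9.99, 5.00, 12.50, 3.00, 4.00, 8.00],
})

# Reference date: the latest review in the data
snapshot = toy["reviewTime"].max()

rfm = toy.groupby("userID").agg(
    # Recency: days between the reference date and the user's last review
    Recency=("reviewTime", lambda s: (snapshot - s.max()).days),
    # Frequency: number of reviews (used as a stand-in for purchases)
    Frequency=("reviewTime", "count"),
    # Monetary: total price of the reviewed products
    Monetary=("price", "sum"),
).reset_index()

print(rfm)
```

A single `groupby(...).agg(...)` produces all three columns at once; the notebook instead builds them step by step and merges them, which gives the same kind of table.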

In [None]:
df.columns

In [None]:
CS_df = pd.DataFrame(df['userID'].unique())
CS_df.columns = ['userID']

In [None]:
Max_purchase = df.groupby('userID').reviewTime.max().reset_index()
Max_purchase.columns = ['userID','MaxPurchaseDate']

#### &#10148; Recency</br>

In [None]:
Max_purchase['Recency'] = (Max_purchase['MaxPurchaseDate'].max() - Max_purchase['MaxPurchaseDate']).dt.days

In [None]:
CS_df = pd.merge(CS_df, Max_purchase[['userID','Recency']], on='userID')
CS_df.head()

#### &#10148; Frequency</br>

In [None]:
tx_frequency = df.groupby('userID').reviewTime.count().reset_index()
tx_frequency.columns = ['userID','Frequency']

In [None]:
CS_df = pd.merge(CS_df, tx_frequency, on='userID')

#### &#10148; Revenue</br>

In [None]:
tx_revenue = df.groupby('userID').price.sum().reset_index()

In [None]:
CS_df = pd.merge(CS_df, tx_revenue, on='userID')

In [None]:
CS_df.head()

In [None]:
CS_df.isnull().sum()

#### &#10148; K_means</br>

In [None]:
a = CS_df.select_dtypes(exclude = 'object')
b = CS_df.select_dtypes(include = 'object')

In [None]:
a.columns

In [None]:
CS_df

In [None]:
CS_df1 = CS_df.copy()

In [None]:
CS_df1.head()

In [None]:
%%time
# Scaling the data (each numeric column is standardised separately)
Scaler = StandardScaler()
for i in a.columns:
    CS_df1[i] = Scaler.fit_transform(CS_df[[i]]).ravel()

In [None]:
%%time
# Getting optimum cluster number
X = CS_df1.drop(['userID'], axis = 1).values
distortion = []
for i in range(2, 10):
    # Fixed random_state so the elbow curve is reproducible
    kmeans = KMeans(n_clusters = i, random_state = 10).fit(X)
    distortion.append(kmeans.inertia_)

In [None]:
plt.figure(figsize = (15, 5))
plt.plot(range(2, 10), distortion)
plt.grid(True)

### <div class="alert alert-info">Interpretation
**- From the above elbow curve we can take k as 3, because the curve bends most sharply (the elbow) at k = 3**</div>
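
Reading the elbow by eye can be cross-checked numerically. The sketch below is one possible heuristic, not the notebook's method: it picks the k where the inertia curve bends most, measured by the largest second difference. The helper `elbow_k` and the `toy_inertia` values are hypothetical.

```python
import numpy as np

def elbow_k(k_values, inertias):
    """Return the k at which the inertia curve bends most (largest second difference)."""
    inertias = np.asarray(inertias, dtype=float)
    # Second difference at each interior point: f(k-1) - 2*f(k) + f(k+1)
    second_diff = inertias[:-2] - 2 * inertias[1:-1] + inertias[2:]
    # +1 because the first interior point is k_values[1]
    return k_values[int(np.argmax(second_diff)) + 1]

ks = list(range(2, 10))
# Made-up inertia values shaped like a typical elbow curve
toy_inertia = [1000, 400, 330, 290, 260, 240, 225, 215]
print(elbow_k(ks, toy_inertia))
```

In the notebook, the same function could be called with `list(range(2, 10))` and `distortion`. The heuristic is sensitive to noise in the curve, so it is best treated as a sanity check on the visual reading.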

In [None]:
%%time
# Creating the model and fitting it
model = KMeans(n_clusters = 3, random_state = 10)
pred = model.fit_predict(X)

In [None]:
CS_df.columns

In [None]:
# Name the segments (cluster 0, 1, 2 in that order) so the legend is built from the data
segment_names = ['Good customers', 'Unsatisfied customers', 'Loyal customers']
segments = np.array(segment_names)[pred]
color1 = ["red", "blue", "yellow"]
plt.figure(figsize = (30, 8))
plt.subplot(1, 2, 1)
sns.scatterplot(x = CS_df['Recency'], y = CS_df['price'], s = 70, hue = segments, hue_order = segment_names, palette = color1)
plt.legend(title = "Customer segmentation")
plt.title('recency v/s price', fontsize = 15)

plt.subplot(1, 2, 2)
sns.scatterplot(x = CS_df['Recency'], y = CS_df['Frequency'], s = 70, hue = segments, hue_order = segment_names, palette = color1)
plt.legend(title = "Customer segmentation")
plt.title('recency v/s frequency', fontsize = 15)
plt.show()

### <div class="alert alert-info">Interpretation
**- From the above clustering result we can clearly see three types of clusters**
- 1. Good customers: They visit the site more frequently and their revenue is good.
- 2. Unsatisfied customers: They have not visited the site for a long time, so we can assume that they are not satisfied with the service.
- 3. Loyal customers: They visit frequently and also generate higher revenue than the good customers. </div>

## &#10148; Amazon recommendation system</br>
### What Can Recommendation Systems Solve?
- It helps the consumer to find the best product.
- It helps websites to increase user engagement.
- It makes the content more personalized.
- It helps websites to find the most relevant product for the consumer.
- It helps item providers deliver their items to the right users.

In [None]:
df.columns

In [None]:
df3 = df[['userID', 'asin', 'overall']]

In [None]:
# Reassign instead of renaming in place, which avoids a SettingWithCopyWarning on the slice of df
df3 = df3.rename(columns = {'asin':'productId', 'overall': 'ratings'})

In [None]:
df3.head()

In [None]:
df3.describe()

In [None]:
df4=df3.iloc[:1000005,0:]

In [None]:
df4.isnull().sum()

In [None]:
plt.figure(figsize = (15, 8))
sns.countplot(x = df4['ratings'])
plt.show()

In [None]:
print("\nTotal no of ratings :",df4.shape[0])
print("Total No of Users   :", len(np.unique(df4.userID)))
print("Total No of products  :", len(np.unique(df4.productId)))

In [None]:
top_rating = df4.groupby(by='userID')['ratings'].count().sort_values(ascending=False)[:10]
print('Top 10 users based on ratings: \n',top_rating)

In [None]:
new_df=df4.groupby("productId").filter(lambda x:x['ratings'].count() >=50)
new_df

In [None]:
new_df1=new_df.head(10000)

ratings_matrix = new_df1.pivot_table(values='ratings', index='productId', columns='userID', fill_value=0)
ratings_matrix.head()

In [None]:
print('Shape of the pivot table: ', ratings_matrix.shape)

In [None]:
X = ratings_matrix

In [None]:
new_df.head()

In [None]:
X.shape

In [None]:
%%time
from sklearn.decomposition import TruncatedSVD       # used for dimensionality reduction
SVD = TruncatedSVD(n_components=5)
decomposed_matrix = SVD.fit_transform(X)
decomposed_matrix.shape

In [None]:
%%time
correlation_matrix = np.corrcoef(decomposed_matrix)        # Return Pearson product-moment correlation coefficients.
correlation_matrix.shape

The Pearson product-moment correlation coefficient (or Pearson correlation coefficient) measures the strength of the linear association between two variables. It ranges from -1 (perfect negative association) to 1 (perfect positive association).
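
As a small illustration of what `np.corrcoef` computes, the sketch below evaluates the Pearson coefficient by hand and compares it with NumPy's result. The helper `pearson` and the two sample vectors are hypothetical and are not part of the notebook's pipeline.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()   # centre each variable on its mean
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Made-up vectors that rise together almost linearly
a_vec = [1.0, 2.0, 3.0, 4.0]
b_vec = [2.0, 4.1, 5.9, 8.2]

r_manual = pearson(a_vec, b_vec)
r_numpy = np.corrcoef(a_vec, b_vec)[0, 1]
print(r_manual, r_numpy)
```

When `np.corrcoef` is given a 2-D array, as in the cell above, it treats each row as one variable, so entry `[i, j]` is the correlation between the decomposed feature vectors of products `i` and `j`.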

In [None]:
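def recommend(s):
  # Row position of product s in the ratings matrix
  h = X.index.get_loc(s)
  correlation_product_ID = correlation_matrix[h]
  # Products whose correlation with s exceeds the 0.05 threshold
  Recommend = list(X.index[correlation_product_ID > 0.05])
  # Drop the queried product itself (its self-correlation is 1)
  Recommend.remove(s)
  # Print the first five matches (in index order, not ranked by correlation)
  print(Recommend[0:5])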

In [None]:
s = 'B0000002ME'

In [None]:
recommend(s)

### <div class="alert alert-info">Interpretation
**- The recommendation system above uses the correlation matrix to recommend products related to the selected product. This helps customers find related products and can generate good revenue for the company as well**</div>
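
One possible variation, sketched here on toy data, is to rank the candidates by correlation so that the most related products come first. The function `recommend_ranked`, the product ids, and the feature values are all hypothetical; the 0.05 threshold is carried over from the notebook.

```python
import numpy as np
import pandas as pd

def recommend_ranked(product_id, index, corr, top_n=5, threshold=0.05):
    """Return up to top_n product ids, most correlated with product_id first."""
    pos = index.get_loc(product_id)
    # Correlation of every product with the queried one, minus the product itself
    scores = pd.Series(corr[pos], index=index).drop(product_id)
    scores = scores[scores > threshold]
    return scores.sort_values(ascending=False).head(top_n).index.tolist()

# Made-up product ids and low-dimensional feature vectors
toy_index = pd.Index(["A", "B", "C", "D"])
toy_features = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.5],
    [3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0],
])
toy_corr = np.corrcoef(toy_features)

print(recommend_ranked("A", toy_index, toy_corr))
```

With the notebook's objects, the call would look like `recommend_ranked(s, X.index, correlation_matrix)`, assuming `X` and `correlation_matrix` are built as in the cells above.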

## Conclusion
**- EDA**
- The count of reviews increases over time
- Revenue increases over time

**- SENTIMENT ANALYSIS**
- The model is able to detect all the languages and translate them to English
- The model is able to automate sentiment predictions

**- CLUSTERING**
- The model is able to segregate top and bottom products
- The model is able to create segments based on customer perceptions

**- PRODUCT RECOMMENDATION**
- The model is able to recommend related products based on customer purchases
- The model is able to forecast the future trend of the sentiments