# Amazon Alexa Review Ratings

This dataset consists of nearly 3,000 Amazon customer reviews (input text), along with the star rating, review date, product variant, and feedback for various Amazon Alexa products such as the Alexa Echo, Echo Dot, and Alexa Firestick. It is intended for learning how to train machine learning models for sentiment analysis.

What can we do with this Data?

We can use this data to analyze Amazon's Alexa products: discovering insights from consumer reviews and building Machine Learning models on top of them. In particular, we can train models for sentiment analysis and analyze the customer reviews (how many are positive, and how many are negative?).

Data source: www.kaggle.com/sid321axn/amazon-alexa-reviews

# Importing the libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Importing the dataset

In [None]:
ds = pd.read_csv('amazon_alexa.tsv', sep = '\t')

In [None]:
ds.keys()

In [None]:
ds.head()

# Visualising the dataset

In [None]:
total_feedback = len(ds['rating'])
positive_feedback = len(ds[ds['feedback'] == 1])
negative_feedback = len(ds[ds['feedback'] == 0])
rating_1 = len(ds[ds['rating'] == 1])
rating_2 = len(ds[ds['rating'] == 2])
rating_3 = len(ds[ds['rating'] == 3])
rating_4 = len(ds[ds['rating'] == 4])
rating_5 = len(ds[ds['rating'] == 5])

In [None]:
print('Total feedback = ', total_feedback)
print('Positive feedback = ', positive_feedback)
print('Negative feedback = ', negative_feedback)
print('Rating 1 = ', rating_1)
print('Rating 2 = ', rating_2)
print('Rating 3 = ', rating_3)
print('Rating 4 = ', rating_4)
print('Rating 5 = ', rating_5)
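The per-value counts above can also be obtained in one call each with `value_counts()`. The following is a minimal sketch on a small, made-up frame (the `toy` data is hypothetical and only mimics the `rating` and `feedback` columns), not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'rating' and 'feedback' columns
toy = pd.DataFrame({
    'rating':   [5, 4, 1, 5, 2, 5],
    'feedback': [1, 1, 0, 1, 0, 1],
})

# Count how often each distinct value occurs
feedback_counts = toy['feedback'].value_counts()
# sort_index() orders the result by rating value rather than by frequency
rating_counts = toy['rating'].value_counts().sort_index()

print(feedback_counts)
print(rating_counts)
```

On the real dataset the same calls would be `ds['feedback'].value_counts()` and `ds['rating'].value_counts().sort_index()`.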

In [None]:
positive = ds[ds['feedback'] == 1]

In [None]:
negative = ds[ds['feedback'] == 0]

In [None]:
len(positive)

In [None]:
len(negative)

In [None]:
len(ds)

In [None]:
sns.countplot(x = 'feedback', data = ds)

In [None]:
sns.countplot(x = 'rating', data = ds)

In [None]:
plt.figure(figsize = (40,15))
sns.barplot(x = 'variation', y='rating', data=ds, palette = 'deep')

# Taking care of missing data

In [None]:
# We observe no missing data

sns.heatmap(ds.isnull(), yticklabels = False, cbar = False, cmap = 'Blues')

# Data Preparation

In [None]:
# Let's drop the date and rating columns

ds = ds.drop(['date', 'rating'],axis=1)

In [None]:
ds

# Encoding Categorical Variables

In [None]:
# pd.get_dummies one-hot encodes the column (similar to LabelEncoder + OneHotEncoder);
# drop_first = True drops one category to avoid the dummy variable trap.

variation_dummies = pd.get_dummies(ds['variation'], drop_first = True)

In [None]:
variation_dummies

In [None]:
# First let's drop the original 'variation' column

ds.drop(['variation'], axis=1, inplace=True)

In [None]:
# Now let's add the encoded column

ds = pd.concat([ds, variation_dummies], axis=1)
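To make the encoding steps above concrete, here is a minimal sketch on a hypothetical four-row frame (the colour names and values are made up for illustration). It shows that `drop_first = True` removes the first category in sorted order, so a row belonging to that category is represented by all-zero dummy columns.

```python
import pandas as pd

# Hypothetical toy data: three distinct variations
toy = pd.DataFrame({
    'variation': ['Black', 'White', 'Charcoal', 'Black'],
    'feedback':  [1, 0, 1, 1],
})

# One-hot encode; the first category in sorted order ('Black') is dropped
dummies = pd.get_dummies(toy['variation'], drop_first=True)

# Replace the original column with its encoded columns
encoded = pd.concat([toy.drop(columns=['variation']), dummies], axis=1)

print(encoded)
```

With three categories only two dummy columns remain, which is what avoids the perfectly collinear set of columns known as the dummy variable trap.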

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
alexa_countvectorizer = vectorizer.fit_transform(ds['verified_reviews'])

In [None]:
alexa_countvectorizer.shape

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())

In [None]:
print(alexa_countvectorizer.toarray())  

In [None]:
# First let's drop the raw text column and build a DataFrame of word counts

ds.drop(['verified_reviews'], axis=1, inplace=True)
reviews = pd.DataFrame(alexa_countvectorizer.toarray())

In [None]:
# Concatenate them together

ds = pd.concat([ds, reviews], axis=1)

In [None]:
ds

In [None]:
# Separating the features (X) from the target label (y); 'feedback' is the first column

X = ds.iloc[:, 1:].values
y = ds.iloc[:, 0].values

In [None]:
X.shape

In [None]:
y.shape

# Splitting the dataset into the training set and test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling - Not Required
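One reason scaling is generally unnecessary for tree-based models such as a random forest is that each tree splits on thresholds, so multiplying a feature by a positive constant does not change which samples fall on each side of a split. The sketch below checks this on hypothetical synthetic count data (the data, the labelling rule, and the factor of 4 are all assumptions chosen for illustration; 4 is used because multiplying by a power of two is exact in floating point).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical count-like features and a simple made-up label rule
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 6, size=(200, 5)).astype(float)
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 5).astype(int)
X_new = rng.integers(0, 6, size=(50, 5)).astype(float)

# Same model and seed, trained once on the raw and once on the rescaled features
raw_model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)
scaled_model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy * 4, y_toy)

pred_raw = raw_model.predict(X_new)
pred_scaled = scaled_model.predict(X_new * 4)

# The two sets of predictions agree sample for sample
print(np.array_equal(pred_raw, pred_scaled))
```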

# Fitting the Random Forest Classifier to the dataset

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators = 500, criterion = 'entropy')
rfc.fit(X_train, y_train)

In [None]:
# Predicting the test set results

y_pred = rfc.predict(X_test)

# Model Evaluation - Confusion Matrix and K-Fold Cross Validation

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
cm = confusion_matrix(y_test, y_pred)
print(cm)

In [None]:
print(classification_report(y_test, y_pred))
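To show how the confusion matrix relates to the precision and recall figures in the classification report, here is a minimal sketch on hypothetical labels (the `y_true_toy` and `y_pred_toy` values are made up). For binary labels, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (1 = positive feedback)
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_toy = [0, 1, 0, 1, 1, 1, 0, 1]

# Unpack the 2x2 matrix row by row
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Precision: of everything predicted positive, how much really was positive
precision = tp / (tp + fp)
# Recall: of everything truly positive, how much was found
recall = tp / (tp + fn)

print(tn, fp, fn, tp)
print(precision, precision_score(y_true_toy, y_pred_toy))
print(recall, recall_score(y_true_toy, y_pred_toy))
```

On an imbalanced dataset, looking at these per-class figures is usually more informative than accuracy alone.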

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = rfc, X = X_train, y = y_train, cv = 10)
mean_accuracy = accuracies.mean()
std_accuracy = accuracies.std()

In [None]:
print(mean_accuracy)
print(std_accuracy)
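As a rough illustration of what the 10-fold cross-validation above returns, the sketch below runs `cross_val_score` on hypothetical synthetic data generated with `make_classification` (the data and the smaller forest size are assumptions made to keep the example fast). With an integer `cv` and a classifier, scikit-learn uses stratified folds, and one accuracy score is returned per fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic classification problem
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

toy_rfc = RandomForestClassifier(n_estimators=50, criterion='entropy', random_state=0)

# Train and score the model 10 times, each time holding out a different fold
scores = cross_val_score(estimator=toy_rfc, X=X_toy, y=y_toy, cv=10)

print(len(scores))    # one score per fold
print(scores.mean())  # average accuracy across folds
print(scores.std())   # spread across folds; a small value suggests stable performance
```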