### LSE Data Analytics Online Career Accelerator 

# DA301:  Advanced Analytics for Organisational Impact

## Assignment template

### Scenario
You are a data analyst working for Turtle Games, a game manufacturer and retailer. They manufacture and sell their own products, along with sourcing and selling products manufactured by other companies. Their product range includes books, board games, video games and toys. They have a global customer base and have a business objective of improving overall sales performance by utilising customer trends. In particular, Turtle Games wants to understand: 
- how customers accumulate loyalty points (Week 1)
- how useful are remuneration and spending scores data (Week 2)
- can social data (e.g. customer reviews) be used in marketing campaigns (Week 3)
- what is the impact on sales per product (Week 4)
- the reliability of the data (e.g. normal distribution, Skewness, Kurtosis) (Week 5)
- if there is any possible relationship(s) in sales between North America, Europe, and global sales (Week 6).

# Week 1 assignment: Linear regression using Python
The marketing department of Turtle Games prefers Python for data analysis. As you are fluent in Python, they asked you to assist with data analysis of social media data. The marketing department wants to better understand how users accumulate loyalty points. Therefore, you need to investigate the possible relationships between the loyalty points, age, remuneration, and spending scores. Note that you will use this data set in future modules as well and it is, therefore, strongly encouraged to first clean the data as per provided guidelines and then save a copy of the clean data for future use.

## Instructions
1. Load and explore the data.
    1. Create a new DataFrame (e.g. reviews).
    2. Sense-check the DataFrame.
    3. Determine if there are any missing values in the DataFrame.
    4. Create a summary of the descriptive statistics.
2. Remove redundant columns (`language` and `platform`).
3. Change column headings to names that are easier to reference (e.g. `renumeration` and `spending_score`).
4. Save a copy of the clean DataFrame as a CSV file. Import the file to sense-check.
5. Use linear regression and the `statsmodels` functions to evaluate possible linear relationships between loyalty points and age/renumeration/spending scores to determine whether these can be used to predict the loyalty points.
    1. Specify the independent and dependent variables.
    2. Create the OLS model.
    3. Extract the estimated parameters, standard errors, and predicted values.
    4. Generate the regression table based on the X coefficient and constant values.
    5. Plot the linear regression and add a regression line.
6. Include your insights and observations.

## 1. Load and explore the data

In [None]:
# Import necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm 
from statsmodels.formula.api import ols

In [None]:
# Load the CSV file(s) as reviews.
reviews = pd.read_csv('turtle_reviews.csv')

# View the DataFrame.
reviews.head()

In [None]:
# Sense check the data and identify column data types
reviews.info()

- There are no null values. 
- The target variable for linear regression is in column 4: loyalty_points.
- The review and summary columns are object data types but are not categorical variables. 

In [None]:
# Confirm that there are no null values. 
reviews.isna().sum()

In [None]:
# Descriptive statistics.
reviews.describe()

To check for erroneous values, the numerical data is visualised, e.g. to see whether there are negative entries or numbers outside a realistic range that need to be investigated. 

In [None]:
# Visualise numerical column: remuneration
sns.histplot(data=reviews, x='remuneration (k£)')

In [None]:
# Visualise numerical column: age
sns.histplot(data=reviews, x='age')

In [None]:
# Visualise numerical column: spending_score
sns.histplot(data=reviews, x='spending_score (1-100)')

In [None]:
# Visualise numerical column: loyalty_points
sns.histplot(data=reviews, x='loyalty_points')

All of the numerical columns seem within acceptable ranges, without any obviously incorrect entries.
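
The visual check above can be complemented with a programmatic range check. The following is a minimal sketch: the sample values and the bounds are assumptions chosen for illustration, not figures taken from the Turtle Games data.

```python
import pandas as pd

# Hypothetical sample with the same column names as turtle_reviews.csv.
sample = pd.DataFrame({
    'age': [18, 45, 67],
    'remuneration (k£)': [12.3, 48.1, 112.3],
    'spending_score (1-100)': [6, 50, 99],
    'loyalty_points': [25, 1276, 6847],
})

# Assumed plausible bounds per column (inclusive); adjust to the business rules.
bounds = {
    'age': (0, 120),
    'remuneration (k£)': (0, 1000),
    'spending_score (1-100)': (1, 100),
    'loyalty_points': (0, float('inf')),
}

# Count the rows that fall outside the bounds for each column.
out_of_range = {
    col: int((~sample[col].between(low, high)).sum())
    for col, (low, high) in bounds.items()
}
print(out_of_range)
```

A count of zero for every column supports the conclusion drawn from the histograms; any non-zero count points to rows that need investigating.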

## 2. Drop columns

I will drop the 'platform' column because all of the reviews in the data were scraped from the Turtle Games website. I will also drop the 'language' column because all of the reviews are in English. 

In [None]:
# Drop unnecessary columns.
reviews_clean = reviews.drop(columns=['language', 'platform'])

# View column names.
reviews_clean.columns

## 3. Rename columns

I will remove special characters from 'remuneration (k£)' and 'spending_score (1-100)' to make the data easier to work with.

In [None]:
# Rename the column headers.
reviews_clean.rename(columns={'remuneration (k£)':'renumeration','spending_score (1-100)': 'spending_score'}, inplace=True)

# View column names.
reviews_clean.columns

## 4. Save the DataFrame as a CSV file

In [None]:
# Create a CSV file as output.
reviews_clean.to_csv('reviews_clean.csv')

In [None]:
# Import new CSV file with Pandas.
reviews_final = pd.read_csv('reviews_clean.csv')

# View DataFrame.
reviews_final.head()

In [None]:
# View the column names.
reviews_final.columns

The new column 'Unnamed: 0' is the DataFrame index, which `to_csv` writes to the file by default (passing `index=False` would omit it).

In [None]:
# View 'Unnamed: 0'
reviews_final['Unnamed: 0']

In [None]:
# Drop Unnamed: 0
reviews_final = reviews_final.drop(columns = 'Unnamed: 0')

# View the columns in the dataframe
reviews_final.columns
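
As a side note on where 'Unnamed: 0' comes from, here is a small sketch using an in-memory buffer instead of a file. The `demo` frame is a made-up example; it only illustrates the default behaviour of `to_csv` and the effect of `index=False`.

```python
import io
import pandas as pd

demo = pd.DataFrame({'age': [18, 23], 'loyalty_points': [210, 524]})

# Default behaviour: the index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(demo.to_csv()))

# index=False leaves the index out, so no extra column appears on re-import.
without_index = pd.read_csv(io.StringIO(demo.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```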

## 5. Linear regression

A simple linear regression model is fit to each of the variables spending score, remuneration and age in turn, to determine whether they are good predictors of the number of loyalty points a customer has. 

### 5a) spending vs loyalty

In [None]:
# Independent variable.
x = reviews_final['spending_score']

# Dependent variable.
y = reviews_final['loyalty_points']

# Check the data for linearity.
plt.ylabel('Loyalty Points')
plt.xlabel('Spending Score')
plt.scatter(x, y)

In [None]:
# Use OLS to find the line of best fit. 

# Create the formula to pass through the OLS methods.
f = 'y ~ x'
reviews_OLS = ols(f, data = reviews_final).fit()

In [None]:
# Extract the estimated parameters.
print('Parameters: ', reviews_OLS.params)
print()

# Extract the standard errors.
print('Standard errors: ', reviews_OLS.bse)
print()

# Extract the predicted values.
print('Predicted values: ', reviews_OLS.predict())

In [None]:
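# Set the constant (intercept) and the x coefficient (slope).
# .iloc selects by position; params is a Series labelled 'Intercept' and 'x'.
c = reviews_OLS.params.iloc[0]
m = reviews_OLS.params.iloc[1]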

# Print the regression table
reviews_OLS.summary()

Key results for the model:
- R-Squared (Coefficient of determination) = 0.452. 
    - 45.2% of the variation in loyalty points can be explained by spending score. 
- The F-statistic probability is very close to zero. This means that there is sufficient evidence to conclude, even at a significance level as small as 0.01, that the *model fits the data better than a model with zero predictor variables.* 

Key results for the variable:
- The intercept has a high p-value for the two-sided t-test. At a significance level of 0.05, the null hypothesis that the true intercept is zero cannot be rejected. The intercept is not statistically significant.
- The independent variable has a p-value that rounds to zero for the two-sided t-test. At a significance level of 0.05, the null hypothesis that the true value of the coefficient is zero can be rejected. There is sufficient evidence of a *statistically significant relationship* between spending score and the number of loyalty points that a customer has. Significance alone does not measure how strong the relationship is; that is indicated by the R-squared value. 
- The x coefficient is 33.0617. This means that for a one-unit increase in customer spending score, we can expect the number of loyalty points the customer has to increase by 33.0617 points, on average. 
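
The slope of a simple linear regression describes an additive change per unit of x, not a multiplicative one. A short derivation, using the fitted form with intercept c and slope m = 33.0617 from the regression table:

```latex
\hat{y}(x) = c + m x, \qquad
\hat{y}(x+1) - \hat{y}(x) = \bigl(c + m(x+1)\bigr) - \bigl(c + m x\bigr) = m = 33.0617
```

The intercept cancels, so the expected difference between two customers whose spending scores differ by one unit is m loyalty points, whatever the starting score.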

In [None]:
# Sense-check that the intercept and coefficient are set correctly. 
print(c, " ", m)

In [None]:
# Create a linear equation and plot the regression model
y_pred = m*x + c

y_pred

In [None]:
# Plot the original data using a scatterplot
plt.scatter(x, y)

# Plot the regression line
plt.plot(x, y_pred, color='red')

# Set plot details
plt.title("Linear Regression of Loyalty Points on Spending Score")
plt.ylabel('Loyalty Points')
plt.xlabel('Spending Score')

It can be seen from the above plot that the model fits the data reasonably well at low spending scores, but not at high spending scores. The distance between the predicted values and the actual values increases as spending score increases, which suggests that the residual variance is not constant. 
This corresponds with the moderate R-squared value, which indicated that only roughly 45% of the variability in loyalty points can be explained by customer spending score. Considering that the F-test indicated that the model is statistically significant, I conclude that there is a relationship between customer spending score and loyalty points. Other variables may, however, explain loyalty points better. The explanatory accuracy of the model may also be increased by adding variables to the model.
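
The widening gap between predicted and actual values can be checked numerically by comparing the residual spread at low and high values of x. The sketch below uses synthetic data whose noise grows with x. The data and the `np.polyfit` fit are assumptions for illustration and are not the notebook's OLS model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption, not the Turtle Games data): noise grows with x.
x = rng.uniform(1, 100, size=500)
y = 30 * x + rng.normal(0, 1, size=500) * 5 * x

# Fit a straight line and compute the residuals.
m, c = np.polyfit(x, y, deg=1)
residuals = y - (m * x + c)

# Compare the residual spread for low and high values of x.
low = residuals[x <= np.median(x)].std()
high = residuals[x > np.median(x)].std()
print(f'Residual std (low x): {low:.1f}, (high x): {high:.1f}')
```

A clearly larger spread in the upper half is the pattern described above. Applying the same comparison to the residuals of the fitted model would quantify it for the real data.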

### 5b) renumeration vs loyalty

In [None]:
# Independent variable.
x = reviews_final['renumeration']

# Dependent variable.
y = reviews_final['loyalty_points']

# Check the data for linearity.
plt.ylabel('Loyalty Points')
plt.xlabel('Remuneration in 1000s')
plt.scatter(x, y)

The scatterplot looks similar to the plot for spending score vs loyalty points. I expect that the regression model will yield similar results as the previous model, due to the data being distributed similarly. 

In [None]:
# Use OLS to find the line of best fit. 

# Create the formula to pass through the OLS methods.
f = 'y ~ x'
reviews_OLS = ols(f, data = reviews_final).fit()

In [None]:
# Extract the estimated parameters.
print('Parameters: ', reviews_OLS.params)
print()

# Extract the standard errors.
print('Standard errors: ', reviews_OLS.bse)
print()

# Extract the predicted values.
print('Predicted values: ', reviews_OLS.predict())

In [None]:
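# Set the constant (intercept) and the x coefficient (slope).
# .iloc selects by position; params is a Series labelled 'Intercept' and 'x'.
c = reviews_OLS.params.iloc[0]
m = reviews_OLS.params.iloc[1]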

# Print the regression table
reviews_OLS.summary()

Key results for the model:
- R-Squared (Coefficient of determination) = 0.38. 
    - 38% of the variation in loyalty points can be explained by remuneration. 
- The F-statistic probability is very close to zero. This means that there is sufficient evidence to conclude, even at a significance level as small as 0.01, that the *model fits the data better than a model with zero predictor variables.* There is therefore a relationship between remuneration and loyalty points. 

Key results for the variable:
- The intercept has a high p-value for the two-sided t-test. At a significance level of 0.05, the null hypothesis that the true intercept is zero cannot be rejected. The intercept is not statistically significant.
- The independent variable has a p-value that rounds to zero for the two-sided t-test. At a significance level of 0.05, the null hypothesis that the true value of the coefficient is zero can be rejected. There is sufficient evidence of a *statistically significant relationship* between remuneration and the number of loyalty points that a customer has. 
- The x coefficient is 34.1878. This means that for a one-unit increase in remuneration (one thousand pounds), we can expect the number of loyalty points the customer has to increase by 34.1878 points, on average. 

In [None]:
# Sense-check that the intercept and coefficient are set correctly. 
print(c, " ", m)

In [None]:
# Create a linear equation and plot the regression model
y_pred = m*x + c

y_pred

In [None]:
# Plot the original data using a scatterplot
plt.scatter(x, y)

# Plot the regression line
plt.plot(x, y_pred, color='red')

# Set plot details
plt.title("Linear Regression of Loyalty Points on Remuneration")
plt.ylabel('Loyalty Points')
plt.xlabel('Remuneration in 1000s')

Similar to the model fit using spending score, this model fits the data reasonably well for salaries between zero and close to sixty thousand pounds. The distance between the predicted values and the actual data increases for salaries of 60 000 pounds and above. 
This may explain why the key results for the overall model appear to conflict with the key results for the coefficient. The low R-squared value for the model, combined with high significance for the coefficient of x, may be because the model does not predict loyalty points for high-earning customers as well as it predicts loyalty points for low-earning customers. 

Considering that the F-test indicated that the model is statistically significant, I conclude that there is a relationship between remuneration and loyalty points. Other variables may, however, explain loyalty points better. The explanatory accuracy of the model may also be increased by adding variables to the model.

### 5c) age vs loyalty

In [None]:
# Independent variable.
x = reviews_final['age']

# Dependent variable.
y = reviews_final['loyalty_points']

# Check the data for linearity.
plt.ylabel('Loyalty Points')
plt.xlabel('Age')
plt.scatter(x, y)

This data is distributed very differently to the previous two variables. It also seems to be grouped according to age ranges. The data values within each age range seem to be similarly distributed. 

In [None]:
# Use OLS to find the line of best fit. 

# Create the formula to pass through the OLS methods.
f = 'y ~ x'
reviews_OLS = ols(f, data = reviews_final).fit()

In [None]:
# Extract the estimated parameters.
print('Parameters: ', reviews_OLS.params)
print()

# Extract the standard errors.
print('Standard errors: ', reviews_OLS.bse)
print()

# Extract the predicted values.
print('Predicted values: ', reviews_OLS.predict())

In [None]:
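# Set the constant (intercept) and the x coefficient (slope).
# .iloc selects by position; params is a Series labelled 'Intercept' and 'x'.
c = reviews_OLS.params.iloc[0]
m = reviews_OLS.params.iloc[1]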

# Print the regression table
reviews_OLS.summary()

Key results for the model:

- R-Squared (Coefficient of determination) = 0.002.
    - 0.2% of the variation in loyalty points can be explained by age.
- The F-statistic probability is 0.0577. At a significance level of 0.05, there is not sufficient evidence to reject the null hypothesis that a model with no predictor variables fits the data as well as this one. There is no statistically significant explanatory relationship between age and loyalty points. 

Key results for the variable:
- The intercept has a p-value that rounds to zero for the two-sided t-test. The intercept is statistically significant. The F-test suggests that this model fits the data no better than a model with no predictor variables, so the fitted line is close to a constant, which is consistent with a statistically significant intercept. 
- The independent variable has a p-value = 0.058 for the two-sided t-test. At a significance level of 0.05, the null hypothesis that the true value of the coefficient is zero cannot be rejected. There is not sufficient evidence to indicate that there is a relationship between age and the number of loyalty points that a customer has.


In [None]:
# Sense check the intercept and coefficient are set correctly. 
print(c, " ", m)

In [None]:
# Create a linear equation and plot the regression model
y_pred = m*x + c

y_pred

In [None]:
# Plot the original data using a scatterplot
plt.scatter(x, y)

# Plot the regression line
plt.plot(x, y_pred, color='red')

# Set plot details
plt.title("Linear Regression of Loyalty Points on Age")
plt.ylabel('Loyalty Points')
plt.xlabel('Age')

Age is not a good predictor of a customer's loyalty points accumulation. The above plot shows that the data is not clustered close to the regression line. This shows what the key results for the model above indicated: that the model is not a good fit for the data. 

### Multiple linear regression with Spending Score and Renumeration

In [None]:
# Import LinearRegression
from sklearn import linear_model
from sklearn.linear_model import LinearRegression

# Define the dependent variable
y = reviews_final['loyalty_points']

# Define the independent variables
X = reviews_final[['renumeration', 'spending_score']]

In [None]:
# Fit the model
mlr = linear_model.LinearRegression()
mlr.fit(X,y)

In [None]:
# Print the R-squared value.
print("R-squared: ", mlr.score(X,y))  

# Print the intercept.
print("Intercept: ", mlr.intercept_) 

# Print the coefficients.
print("Coefficients:")  

# Map a similar index of multiple containers (to be used as a single entity).
list(zip(X, mlr.coef_))  

This multiple linear regression has a high R^2 value. Roughly 83% of variation in loyalty points can be explained by variation in renumeration and spending score. 
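
For reference, the R-squared value reported by `score` is 1 - SS_res / SS_tot, and an adjusted version penalises the number of predictors. The sketch below reproduces both by hand with NumPy on synthetic data. The data and coefficients are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic predictors and response (assumed values for illustration only).
X = rng.uniform(0, 100, size=(200, 2))
y = 5 + 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(0, 10, size=200)

# Add a column of ones for the intercept and solve the least-squares problem.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# R-squared = 1 - SS_res / SS_tot.
ss_res = np.sum((y - A @ coef) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Adjusted R-squared penalises the number of predictors (p = 2 here).
n, p = X.shape
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(r_squared, 3), round(adj_r_squared, 3))
```

Because R-squared never decreases when a predictor is added, the adjusted value is the fairer figure when comparing the two-variable model with the single-variable models.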

## 6. Observations and insights

***Your observations here...***






Questions for next time:
- could renumeration and spending score be used in a multiple linear regression to more accurately explain how loyalty points are accumulated? 
    
    
Summary: 
- Age cannot be used to explain loyalty points. 
- Remuneration and spending score are each moderately good at predicting loyalty points. The F-tests for both models indicate that there is a significant relationship between these variables and how customers accumulate loyalty points. 

# 

# Week 2 assignment: Clustering with *k*-means using Python

The marketing department also wants to better understand the usefulness of renumeration and spending scores but do not know where to begin. You are tasked to identify groups within the customer base that can be used to target specific market segments. Use *k*-means clustering to identify the optimal number of clusters and then apply and plot the data using the created segments.

## Instructions
1. Prepare the data for clustering. 
    1. Import the CSV file you have prepared in Week 1.
    2. Create a new DataFrame (e.g. `df2`) containing the `renumeration` and `spending_score` columns.
    3. Explore the new DataFrame. 
2. Plot the renumeration versus spending score.
    1. Create a scatterplot.
    2. Create a pairplot.
3. Use the Silhouette and Elbow methods to determine the optimal number of clusters for *k*-means clustering.
    1. Plot both methods and explain how you determine the number of clusters to use.
    2. Add titles and legends to the plot.
4. Evaluate the usefulness of at least three values for *k* based on insights from the Elbow and Silhouette methods.
    1. Plot the predicted *k*-means.
    2. Explain which value might give you the best clustering.
5. Fit a final model using your selected value for *k*.
    1. Justify your selection and comment on the respective cluster sizes of your final solution.
    2. Check the number of observations per predicted class.
6. Plot the clusters and interpret the model.

## 1. Load and explore the data

In [None]:
# Import necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics import accuracy_score
from scipy.spatial.distance import cdist

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Sense check the reviews_final dataframe. 
reviews_final.head()

In [None]:
# Load the CSV file(s) as df2.
df2 = pd.read_csv('reviews_clean.csv')
df2 = df2.drop(columns=['Unnamed: 0'])

# View DataFrame.
df2.columns

In [None]:
# Sense check against reviews_final.
reviews_final.shape

In [None]:
# Keep only the columns needed for clustering (copy to avoid modifying a view of df2).
clustering_reviews = df2[['renumeration', 'spending_score']].copy()

# View DataFrame.
print(clustering_reviews.shape)

clustering_reviews.head()

In [None]:
# Check for null values and value counts.
print(clustering_reviews.info())

In [None]:
# Descriptive statistics.
clustering_reviews.describe()

- Remuneration is the total income per customer per year, recorded in thousands of pounds (k£).
    - Customers have a mean income of 48 079.06 pounds. 
    - The standard deviation is 23 123.98 pounds, so individual incomes vary widely around the mean. 
- Spending score is the score (1-100) assigned to a customer based on their spending habits. 
    - The average spending score of customers is 50. 
    - The standard deviation is 26 points. 

## 2. Plot

- Are there any visible clusters, correlations and outliers?
- Are there any existing relationships between spending score and renumeration?

In [None]:
# Create a scatterplot with Seaborn.
sns.scatterplot(x='renumeration', y='spending_score', data=clustering_reviews).set_title('A scatterplot showing the relationship between spending score and renumeration')

- No clear linear positive or linear negative relationship. 
- There are roughly five areas where data values are loosely clustered. 

In [None]:
# Create a pairplot with Seaborn.
# set_palette applies the palette; color_palette only returns it.
sns.set_palette('Paired')
sns.pairplot(clustering_reviews, diag_kind='kde')

- Both spending score and renumeration are multi-modal, as shown by the distinct peaks in the kernel density graphs. 
- This implies that there may be multiple groups within the data. 

## 3. Elbow and silhouette methods

### 1. The Elbow Method

In [None]:
# Import the K-Means class
import sklearn
from sklearn.cluster import KMeans

# Create a list to store the cluster sizes
cs = []

# Use a loop with a range from 1 - 10 to test cluster sizes
for i in range(1,11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(clustering_reviews)
    cs.append(kmeans.inertia_)
    
# Plot the SSE for each model against the number of clusters
plt.plot(range(1,11), cs, marker='o')
plt.title("The Elbow Method")
plt.xlabel("Number of clusters")
plt.ylabel("SSE (Total)")

plt.show()

- This graph indicates that the optimum number of clusters for this model is k = 5. 
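
The SSE plotted above is the model's `inertia_`: the sum of squared distances from each point to its nearest cluster centre. A minimal sketch of that calculation with NumPy, using toy points and centroids that are assumptions for illustration:

```python
import numpy as np

# Toy points and centroids (assumed values for illustration only).
points = np.array([[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]])
centroids = np.array([[1.0, 1.5], [8.5, 8.0]])

# Squared Euclidean distance from every point to every centroid.
sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# Each point contributes the squared distance to its nearest centroid.
sse = sq_dist.min(axis=1).sum()
print(sse)
```

SSE always falls as k increases, which is why the elbow (the point where the decrease levels off) is used rather than the minimum.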

### 2. The Silhouette Method

In [None]:
# Import silhouette_score class from sklearn.
from sklearn.metrics import silhouette_score

# Create an empty list to store the different cluster sizes.
sil = []
kmax=10

for k in range(2, kmax + 1):
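    # Fix n_init and random_state so that the silhouette scores are reproducible.
    kmeans_s = KMeans(n_clusters = k, n_init = 10, random_state = 0).fit(clustering_reviews)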
    labels = kmeans_s.labels_
    sil.append(silhouette_score(clustering_reviews, labels, metric = 'euclidean'))
    
# Plot the silhouette scores against number of clusters. 
plt.plot(range(2, kmax+1), sil, marker='o')

plt.title("The Silhouette Method")
plt.xlabel("The number of clusters")
plt.ylabel("Sil")

plt.show()

The optimum number of clusters is the value of k with the highest average silhouette score. The peak here is at 5, indicating that k = 5 is the optimum number of clusters. 

To evaluate whether five is indeed the optimum number of clusters, I will also test values on either side of it: k = 4, 6 and 7. 
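
The silhouette score used above compares, for each point, the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b), as (b - a) / max(a, b). The sketch below computes it by hand on toy data. The points and labels are assumptions, and the code does not handle single-point clusters.

```python
import numpy as np

# Toy data with two well-separated groups (assumed values for illustration only).
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])

# Pairwise Euclidean distances between all points.
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

scores = []
for i in range(len(points)):
    same = labels == labels[i]
    same[i] = False  # exclude the point itself
    # a: mean distance to the other points in the same cluster.
    a = dist[i, same].mean()
    # b: lowest mean distance to the points of any other cluster.
    b = min(dist[i, labels == k].mean() for k in set(labels) if k != labels[i])
    scores.append((b - a) / max(a, b))

mean_silhouette = float(np.mean(scores))
print(round(mean_silhouette, 3))
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.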

## 4. Evaluate k-means model at different values of *k*

### Evaluate the model for k = 4

In [None]:
# Create a k-means object with 4 clusters. 
kmeans = KMeans(n_clusters=4, max_iter=15000, init='k-means++', random_state=0)
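# Fit on the two feature columns only, so that labels added by an earlier run are not used as a feature.
kmeans.fit(clustering_reviews[['renumeration', 'spending_score']])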
clusters = kmeans.labels_

# Add the cluster value to the data frame. 
clustering_reviews['K-Means Predicted'] = clusters

# Plot the predicted values.
sns.pairplot(clustering_reviews, hue='K-Means Predicted', diag_kind='kde')

K = 4 clusters

- There are three distinct clusters with little overlap. 
- Cluster zero does look bigger than the other clusters. Within-cluster variation may be larger than necessary. 
- The kernel density estimates for cluster zero are quite wide for both remuneration and spending score. This indicates large within-cluster variation, as the data is spread out over a long range. 
- The goal of clustering is for within-cluster variation to be small. Therefore it would be beneficial to have a larger number of clusters. 

In [None]:
# Create a k-means object with 5 clusters. 
kmeans = KMeans(n_clusters=5, max_iter=15000, init='k-means++', random_state=0)
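# Fit on the two feature columns only, so that labels from the previous fit are not used as a feature.
kmeans.fit(clustering_reviews[['renumeration', 'spending_score']])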
clusters = kmeans.labels_

# Add the cluster value to the data frame. 
clustering_reviews['K-Means Predicted'] = clusters

# Plot the predicted values.
sns.pairplot(clustering_reviews, hue='K-Means Predicted', diag_kind='kde')

- Adding one additional cluster changed the distributions of spending score and renumeration.
- Group zero no longer has such a long right-ward tail, implying that the in-cluster variation is smaller. 
- The clusters themselves are evenly defined. 

In [None]:
# Create a k-means object with 6 clusters. 
kmeans = KMeans(n_clusters=6, max_iter=15000, init='k-means++', random_state=0)
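# Fit on the two feature columns only, so that labels from the previous fit are not used as a feature.
kmeans.fit(clustering_reviews[['renumeration', 'spending_score']])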
clusters = kmeans.labels_

# Add the cluster value to the data frame. 
clustering_reviews['K-Means Predicted'] = clusters

# Plot the predicted values.
sns.pairplot(clustering_reviews, hue='K-Means Predicted', diag_kind='kde')

- Adding an additional cluster did not change the variation of any of the clusters by a large amount. 

In [None]:
# Create a k-means object with 7 clusters. 
kmeans = KMeans(n_clusters=7, max_iter=15000, init='k-means++', random_state=0)
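# Fit on the two feature columns only, so that labels from the previous fit are not used as a feature.
kmeans.fit(clustering_reviews[['renumeration', 'spending_score']])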
clusters = kmeans.labels_

# Add the cluster value to the data frame. 
clustering_reviews['K-Means Predicted'] = clusters

# Plot the predicted values.
sns.pairplot(clustering_reviews, hue='K-Means Predicted', diag_kind='kde')

- Once again, adding an additional cluster did not reduce the dispersion of the data points from their cluster centres. 
- Looking at the KDEs for renumeration and spending_score, the standard deviation for each cluster looks quite even across all clusters. 
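
The cells above repeat the same fit for k = 4 to 7. A compact sketch of the same comparison is shown below, keeping the labels in a separate frame so that they never become an input to the next fit. The uniform random features are a synthetic stand-in (an assumption), and `labels_by_k` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the two clustering features (assumed values).
features = pd.DataFrame({
    'renumeration': rng.uniform(12, 112, size=300),
    'spending_score': rng.uniform(1, 99, size=300),
})

# Keep the labels in a separate frame so they never leak into the next fit.
labels_by_k = pd.DataFrame(index=features.index)
for k in (4, 5, 6, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    labels_by_k[f'k={k}'] = model.labels_

# Number of distinct clusters found for each value of k.
print(labels_by_k.nunique())
```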

## 5. Fit final model and justify your choice

From the above analysis, it is clear that k = 5 is the optimal choice of clusters. It is the largest number of clusters that significantly reduces the dispersion between the data points and the centroid of their clusters. 

In [None]:
# Apply the final model with k =5. 
kmeans = KMeans(n_clusters=5, max_iter=15000, init='k-means++', random_state=0)
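# Fit on the two feature columns only, so that labels from the previous fit are not used as a feature.
kmeans.fit(clustering_reviews[['renumeration', 'spending_score']])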
clusters = kmeans.labels_

# Add the cluster value to the data frame. 
clustering_reviews['K-Means Predicted'] = clusters

# Plot the predicted values.
sns.pairplot(clustering_reviews, hue='K-Means Predicted', diag_kind='kde')

In [None]:
# Check the number of observations per predicted class.
clustering_reviews['K-Means Predicted'].value_counts()

- Cluster zero has the largest number of data points. 
- The other four clusters contain close to 300 data points each. 

## 6. Plot and interpret the clusters

In [None]:
# Visualising the clusters.
sns.set(rc = {'figure.figsize':(12, 8)})

sns.scatterplot(x='renumeration' , 
                y ='spending_score',
                data=clustering_reviews , hue='K-Means Predicted', palette='colorblind').set_title(
                'Renumeration vs Spending', fontsize=20)

In [None]:
# View the dataframe
clustering_reviews

In order to identify what market segments these clusters identify, I will merge the original data frame with the cluster values and study the other variables that make these groups unique. 

In [None]:
# Join clusters to main data for context. 
df2['K-Means Predicted'] = clusters
df2

In [None]:
# How many females and males are in each cluster? 
gender_clusters = pd.concat([df2, pd.get_dummies(df2['gender'])], axis=1)

In [None]:
gender_clusters.columns
gender_clusters.groupby(['K-Means Predicted'])[['Female', 'Male']].sum()

In [None]:
# What is the average age and income per cluster? 
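# Select several columns with a list (double brackets); tuple-style selection fails in pandas 2.x.
gender_clusters.groupby(['K-Means Predicted'])[['age', 'renumeration', 'spending_score']].mean()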

In [None]:
# What is the education level of each cluster? 
education_clusters = pd.concat([df2, pd.get_dummies(df2['education'])], axis=1)

education_clusters.columns

In [None]:
education_clusters.groupby(['K-Means Predicted'])[['Basic', 'PhD', 'diploma','graduate','postgraduate']].sum()

In [None]:
# Revisit the visualisation of the clusters
sns.set(rc = {'figure.figsize':(12, 8)})

sns.scatterplot(x='renumeration' , 
                y ='spending_score',
                data=clustering_reviews , hue='K-Means Predicted', palette='colorblind').set_title(
                'Renumeration vs Spending', fontsize=20)

## 7. Discuss: Insights and observations

***Your observations here...***

- Customers can be split into five clusters based on spending score and renumeration. 
- I then merged the clustering results to get a clearer picture of who belonged to which clusters. 

*What characterizes customer segments?* (Average remuneration is shown in pounds.)

- Cluster 0:
    - has the highest proportion of graduates (approx. 45%)
    - the av. remuneration is £44 418.79 
    - the av. age is 42
    - the av. spending score is approx. 50
    - largest cluster
    
- Cluster 1:
    - graduates also make up the greatest proportion of customers in this cluster
    - av. remuneration is £73 240.28
    - av. age is 36
    - av. spending score is 82
    - these are high-spend, high-value customers

- Cluster 2: 
    - people holding PhDs, graduates and postgraduates make up similar proportions of the cluster
    - av. remuneration is £74 831.21
    - av. spending score is 17.42
    - low-spend customers with high incomes
    
- Cluster 3:
    - graduates are the largest group, at approx. 42%
    - av. age is 43
    - av. remuneration is £20 424.35
    - av. spending score is 19.76
    
- Cluster 4: 
    - graduates are the largest group, followed by PhD holders
    - av. age is 31
    - av. remuneration is £20 353.68
    - av. spending score is 79.42
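
The education proportions quoted per cluster can be read directly from a row-normalised cross-tabulation rather than worked out from the raw counts. A minimal sketch with made-up labels (the `demo` frame is hypothetical, not the real cluster assignment):

```python
import pandas as pd

# Hypothetical cluster labels and education levels (not the real data).
demo = pd.DataFrame({
    'K-Means Predicted': [0, 0, 0, 0, 1, 1],
    'education': ['graduate', 'graduate', 'PhD', 'diploma', 'graduate', 'PhD'],
})

# Row-normalised cross-tabulation: each row sums to 1.
shares = pd.crosstab(demo['K-Means Predicted'], demo['education'], normalize='index')
print(shares.round(2))
```

Applied to `df2`, the same call with `df2['K-Means Predicted']` and `df2['education']` would give the share of each education level within each cluster.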
    
 
*Questions for next time*
- What is the average age of the data as a whole?
- Are there outliers in renumeration, spending score and age that could skew the results?

# 

# Week 3 assignment: NLP using Python
Customer reviews were downloaded from the website of Turtle Games. This data will be used to steer the marketing department on how to approach future campaigns. Therefore, the marketing department asked you to identify the 15 most common words used in online product reviews. They also want to have a list of the top 20 positive and negative reviews received from the website. Therefore, you need to apply NLP on the data set.

## Instructions
1. Load and explore the data. 
    1. Sense-check the DataFrame.
    2. You only need to retain the `review` and `summary` columns.
    3. Determine if there are any missing values.
2. Prepare the data for NLP
    1. Change to lower case and join the elements in each of the columns respectively (`review` and `summary`).
    2. Replace punctuation in each of the columns respectively (`review` and `summary`).
    3. Drop duplicates in both columns (`review` and `summary`).
3. Tokenise and create wordclouds for the respective columns (separately).
    1. Create a copy of the DataFrame.
    2. Apply tokenisation on both columns.
    3. Create and plot a wordcloud image.
4. Frequency distribution and polarity.
    1. Create frequency distribution.
    2. Remove alphanumeric characters and stopwords.
    3. Create wordcloud without stopwords.
    4. Identify 15 most common words and polarity.
5. Review polarity and sentiment.
    1. Plot histograms of polarity (use 15 bins) for both columns.
    2. Review the sentiment scores for the respective columns.
6. Identify and print the top 20 positive and negative reviews and summaries respectively.
7. Include your insights and observations.

## 1. Load and explore the data

In [None]:
# Import all the necessary packages.
import pandas as pd
import numpy as np
import nltk 
import os 
import matplotlib.pyplot as plt

# nltk.download ('punkt').
# nltk.download ('stopwords').

from wordcloud import WordCloud
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from textblob import TextBlob
from scipy.stats import norm

# Import Counter.
from collections import Counter

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the data set as df3.
df3 = pd.read_csv('reviews_clean.csv')

# View DataFrame.
df3

In [None]:
# Explore data set.
df3.info()

In [None]:
# Keep necessary columns. Drop unnecessary columns.
df4 = df3[['review', 'summary']].copy()

# View DataFrame.
df4

In [None]:
# Determine if there are any missing values.
df4.isna().sum()

There are no null values, and 2000 reviews and summaries. 

## 2. Prepare the data for NLP
### 2a) Change to lower case and join the elements in each of the columns respectively (review and summary)

In [None]:
# Review: Change all to lower case and join with a space.
df4['review'] = df4['review'].apply(lambda text: " ".join(word.lower() for word in text.split()))

# Preview the result
print(df4['review'].shape)
df4['review'].head()

In [None]:
# Summary: Change all to lower case and join with a space.
df4['summary'] = df4['summary'].apply(lambda text: " ".join(word.lower() for word in text.split()))

# Preview the result
print(df4['summary'].shape)
df4['summary'].head()

### 2b) Replace punctuation in each of the columns respectively (review and summary)

In [None]:
# Remove all punctuation from the review column.
df4['review'] = df4['review'].str.replace(r'[^\w\s]', '', regex=True)

# View output.
df4['review']

In [None]:
# Remove all punctuation from the summary column.
df4['summary'] = df4['summary'].str.replace(r'[^\w\s]', '', regex=True)

# View output.
df4['summary']
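
The punctuation step only works if the pattern is treated as a regular expression. In recent pandas versions `Series.str.replace` treats the pattern as a literal string unless `regex=True` is passed, so it is worth checking the behaviour on a small example. A minimal sketch, using a made-up `sample` Series rather than the Turtle Games data:

```python
import pandas as pd

# Hypothetical sample text; not taken from the Turtle Games data set.
sample = pd.Series(["great game!!! five stars.", "didn't like it :("])

# regex=True makes pandas treat the pattern as a regular expression:
# remove every character that is neither a word character nor whitespace.
cleaned = sample.str.replace(r'[^\w\s]', '', regex=True)

print(cleaned.tolist())
```

Note that `\w` also matches the underscore, so underscores are not removed by this pattern.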

### 2c) Drop duplicates in both columns

In [None]:
# Check for duplicates in both columns
print(df4.review.duplicated().sum())
print(df4.summary.duplicated().sum())
print(df4.duplicated().sum())

In [None]:
# Drop duplicates based on both columns (complete observations as duplicates).
df4_clean = df4.drop_duplicates()
# Reset the index without keeping the old index as an extra column.
df4_clean = df4_clean.reset_index(drop=True)

# View DataFrame.
df4_clean

In [None]:
df4_clean['review'][94]

In [None]:
# Check for duplicates again
print(df4_clean.review.duplicated().sum())
print(df4_clean.summary.duplicated().sum())
print(df4_clean.duplicated().sum())

In [None]:
df4_clean.shape

Some individual reviews and some individual summaries are still repeated, but no complete observations (review and summary together) are duplicated. 
1,961 rows remain. 
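
The three counts differ because `duplicated()` on a single column flags repeated text in that column only, whereas `duplicated()` on the whole DataFrame flags rows in which every column repeats. A small illustration with a made-up `toy` DataFrame (not the assignment data):

```python
import pandas as pd

# Hypothetical data: the first and last rows are identical in full.
toy = pd.DataFrame({
    'review':  ['fun game', 'fun game', 'too hard', 'fun game'],
    'summary': ['five stars', 'great', 'five stars', 'five stars'],
})

dup_review = toy['review'].duplicated().sum()    # repeats in 'review' only
dup_summary = toy['summary'].duplicated().sum()  # repeats in 'summary' only
dup_rows = toy.duplicated().sum()                # rows repeated in full

# Dropping complete duplicates still leaves repeated values per column.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(dup_review, dup_summary, dup_rows, len(deduped))
```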

## 3. Tokenise and create wordclouds

In [None]:
# Create new DataFrame (copy DataFrame).
df5 = df4_clean.copy()

# View DataFrame.
df5

In [None]:
# Apply tokenisation to both columns.
df5['review_tokens'] = df5['review'].apply(word_tokenize)
df5['summary_tokens'] = df5['summary'].apply(word_tokenize)

# View DataFrame.
df5.head()

In [None]:
# Review: Create a word cloud.

# Define an empty list of tokens.
review_tokens = []

# Loop over every row and add its tokens to the list.
for i in range(df5.shape[0]):
    review_tokens = review_tokens + df5['review_tokens'][i]

# View the list
review_tokens

In [None]:
# Create a string with the words
review_tokens_string = ' '

for value in review_tokens:
    # Add each token word to the string
    review_tokens_string = review_tokens_string + value + " "
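
Concatenating lists and strings inside a loop works, but it creates a new object on every iteration. An equivalent and usually faster alternative, sketched here on hypothetical token lists, is to flatten with `itertools.chain` and build the string with `str.join`:

```python
from itertools import chain

# Hypothetical tokenised rows standing in for df5['review_tokens'].
token_rows = [['great', 'game'], ['fun', 'for', 'kids'], []]

# Flatten the list of lists into one list of tokens.
all_tokens = list(chain.from_iterable(token_rows))

# Join the tokens into one space-separated string for WordCloud.generate().
tokens_string = ' '.join(all_tokens)

print(all_tokens)
print(tokens_string)
```

The result differs from the loop version only in its leading and trailing spaces, which do not affect the word cloud.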

In [None]:
# Set a colour palette
sns.set(color_codes=True)

# Create a WordCloud object
word_cloud = WordCloud(width = 1600, height = 900, 
                       background_color='white',
                       colormap='plasma',
                       stopwords=set(),  # Empty set: keep every word, including stop words.
                       min_font_size=10).generate(review_tokens_string)

In [None]:
# Review: Plot the WordCloud image.                     
plt.figure(figsize = (16, 9), facecolor = None) 
plt.imshow(word_cloud) 
plt.axis('off') 
plt.tight_layout(pad = 0) 
plt.show()

In [None]:
# Summary: Create a word cloud.

# Define an empty list of tokens.
summary_tokens = []

# Loop over every row and add its tokens to the list.
for i in range(df5.shape[0]):
    summary_tokens = summary_tokens + df5['summary_tokens'][i]

# View the list
summary_tokens

In [None]:
# Create a string with the words
summary_tokens_string = ' '

for value in summary_tokens:
    # Add each token word to the string
    summary_tokens_string = summary_tokens_string + value + " "

In [None]:
# Set a colour palette
sns.set(color_codes=True)

# Create a WordCloud object
word_cloud2 = WordCloud(width = 1600, height = 900, 
                       background_color='white',
                       colormap='plasma',
                       stopwords=set(),  # Empty set: keep every word, including stop words.
                       min_font_size=10).generate(summary_tokens_string)

In [None]:
# Summary: Plot the WordCloud image.
plt.figure(figsize = (16, 9), facecolor = None) 
plt.imshow(word_cloud2) 
plt.axis('off') 
plt.tight_layout(pad = 0) 
plt.show()

In the word clouds above, words like 'and', 'the' and 'to' are among the most common. These are stop words and need to be removed so that the most common meaningful words stand out. 

## 4. Frequency distribution and polarity
### 4a) Create frequency distribution

In [None]:
# Determine the frequency distribution.

# Reviews
fdist_reviews = FreqDist(review_tokens)

# Summary
fdist_summary = FreqDist(summary_tokens)

# Preview the distribution: reviews
fdist_reviews

In [None]:
# Preview the distribution: summary
fdist_summary

In [None]:
# Create a DataFrame from the reviews and summary frequency distribution
from collections import Counter

counts_reviews = pd.DataFrame(Counter(review_tokens).most_common(15), 
                             columns=['Word', 'Frequency']).set_index('Word')

counts_summary = pd.DataFrame(Counter(summary_tokens).most_common(15), 
                             columns=['Word', 'Frequency']).set_index('Word')

# Show the DataFrame: Reviews
counts_reviews

In [None]:
# Show the DataFrame: Summary
counts_summary

In [None]:
# Plot reviews distribution
# Set the plot type.
ax = counts_reviews.plot(kind='barh', figsize=(16, 9), fontsize=12,
                 colormap ='plasma')

# Set the labels.
ax.set_xlabel('Count', fontsize=12)
ax.set_ylabel('Word', fontsize=12)
ax.set_title("Top 15 most common words in online reviews for Turtle Games",
             fontsize=20)

# Draw the bar labels.
for i in ax.patches:
    ax.text(i.get_width()+.41, i.get_y()+.1, str(round((i.get_width()), 2)),
            fontsize=12, color='red')

In [None]:
# Plot summary distribution
# Set the plot type.
ax = counts_summary.plot(kind='barh', figsize=(16, 9), fontsize=12,
                 colormap ='viridis')

# Set the labels.
ax.set_xlabel('Count', fontsize=12)
ax.set_ylabel('Word', fontsize=12)
ax.set_title("Top 15 most common words in online review summaries for Turtle Games",
             fontsize=20)

# Draw the bar labels.
for i in ax.patches:
    ax.text(i.get_width()+.41, i.get_y()+.1, str(round((i.get_width()), 2)),
            fontsize=12, color='red')

### 4b) Remove alphanumeric characters and stopwords

In [None]:
# Check for review tokens that contain non-alphanumeric characters (not real English words).
tokens1 = [word for word in review_tokens if not word.isalnum()]

tokens1

In [None]:
# Keep only alphanumeric tokens, dropping anything that is not a real word, in reviews.
review_tokens1 = [word for word in review_tokens if word.isalnum()] 

review_tokens1

In [None]:
# Check for summary tokens that contain non-alphanumeric characters (not real English words).
tokens2 = [word for word in summary_tokens if not word.isalnum()]

tokens2

There are no non-alphanumeric tokens to remove from the summary tokens. Stop words can now be removed from both reviews and summaries. 
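
`str.isalnum()` returns `True` only when every character in a token is a letter or digit, so filtering on it drops tokens with leftover symbols while keeping pure numbers. `str.isalpha()` is stricter and drops numbers too. A quick check on made-up tokens:

```python
# Hypothetical tokens, including an underscore that survives the [^\w\s] pattern.
tokens = ['game', '5', 'kids', '_', '2nd', 'board_game']

alnum_only = [t for t in tokens if t.isalnum()]    # letters and digits only
alpha_only = [t for t in tokens if t.isalpha()]    # letters only
rejected = [t for t in tokens if not t.isalnum()]  # what isalnum() filters out

print(alnum_only)
print(alpha_only)
print(rejected)
```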

In [None]:
# Remove all the stopwords from reviews

# Download the stop word list.
nltk.download('stopwords')
from nltk.corpus import stopwords

# Create a set of English stop words.
english_stopwords = set(stopwords.words('english'))

In [None]:
# Create a filtered list of tokens without stop words.
review_tokens_clean = [x for x in review_tokens1 if x.lower() not in english_stopwords]

# Define an empty string variable.
reviews_tokens_string2 = ''

for value in review_tokens_clean:
    # Add each filtered token word to the string.
    reviews_tokens_string2 = reviews_tokens_string2 + value + ' '
    
reviews_tokens_string2

In [None]:
# Remove all the stopwords from summaries

# Create a filtered list of tokens without stop words.
summary_tokens_clean = [x for x in summary_tokens if x.lower() not in english_stopwords]

# Define an empty string variable.
summary_tokens_string2 = ''

for value in summary_tokens_clean:
    # Add each filtered token word to the string.
    summary_tokens_string2 = summary_tokens_string2 + value + ' '
    
summary_tokens_string2

### 4c) Create wordcloud without stopwords

In [None]:
# Create a wordcloud without stop words: Reviews
wordcloud_reviews = WordCloud(width = 1600, height = 900, 
                background_color ='white', 
                colormap='plasma', 
                min_font_size = 10).generate(reviews_tokens_string2) 

# Plot the WordCloud image.                        
plt.figure(figsize = (16, 9), facecolor = None) 
plt.imshow(wordcloud_reviews) 
plt.axis('off') 
plt.tight_layout(pad = 0) 
plt.show()

In [None]:
# Create a wordcloud without stop words: Summary
wordcloud_summary = WordCloud(width = 1600, height = 900, 
                background_color ='white', 
                colormap='plasma', 
                min_font_size = 10).generate(summary_tokens_string2) 

# Plot the WordCloud image.                        
plt.figure(figsize = (16, 9), facecolor = None) 
plt.imshow(wordcloud_summary) 
plt.axis('off') 
plt.tight_layout(pad = 0) 
plt.show()

### 4d) Identify 15 most common words and polarity

In [None]:
# Determine the frequency distribution.

# Reviews
fdist_reviews = FreqDist(review_tokens_clean)

# Summary
fdist_summary = FreqDist(summary_tokens_clean)

# Preview the distribution: reviews
fdist_reviews

In [None]:
# Preview the distribution: summary
fdist_summary

In [None]:
# Create a DataFrame from the reviews and summary frequency distribution
from collections import Counter

counts_reviews = pd.DataFrame(Counter(review_tokens_clean).most_common(15), 
                             columns=['Word', 'Frequency']).set_index('Word')

counts_summary = pd.DataFrame(Counter(summary_tokens_clean).most_common(15), 
                             columns=['Word', 'Frequency']).set_index('Word')

# Show the DataFrame: Reviews
counts_reviews

In [None]:
# View the dataframe: summary
counts_summary

In [None]:
# Plot reviews distribution
# Set the plot type.
ax = counts_reviews.plot(kind='barh', figsize=(16, 9), fontsize=12,
                 colormap ='plasma')

# Set the labels.
ax.set_xlabel('Count', fontsize=12)
ax.set_ylabel('Word', fontsize=12)
ax.set_title("Top 15 most common words in online reviews for Turtle Games",
             fontsize=20)

# Draw the bar labels.
for i in ax.patches:
    ax.text(i.get_width()+.41, i.get_y()+.1, str(round((i.get_width()), 2)),
            fontsize=12, color='red')

In [None]:
# Plot summary distribution
# Set the plot type.
ax = counts_summary.plot(kind='barh', figsize=(16, 9), fontsize=12,
                 colormap ='viridis')

# Set the labels.
ax.set_xlabel('Count', fontsize=12)
ax.set_ylabel('Word', fontsize=12)
ax.set_title("Top 15 most common words in online review summaries for Turtle Games",
             fontsize=20)

# Draw the bar labels.
for i in ax.patches:
    ax.text(i.get_width()+.41, i.get_y()+.1, str(round((i.get_width()), 2)),
            fontsize=12, color='red')

## 5. Review polarity and sentiment: Plot histograms of polarity (use 15 bins) and sentiment scores for the respective columns.

In [None]:
# Provided function.
def generate_polarity(comment):
    '''Extract polarity score (-1 to +1) for each comment'''
    return TextBlob(comment).sentiment[0]

In [None]:
# Download Vader lexicon
nltk.download('vader_lexicon')

In [None]:
# Import SentimentAnalyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Create a variable to store the SIA
darth_vader = SentimentIntensityAnalyzer()

In [None]:
df5['review_tokens'][100]

In [None]:
reviews_value_tokens = df5['review_tokens'].to_list()
summary_value_tokens = df5['summary_tokens'].to_list()

In [None]:
# Remove stop words and non-alphabetic tokens from reviews.
filtered_review_tokens = [
    [t.lower() for t in tokens if t.lower() not in english_stopwords and t.isalpha()]
    for tokens in reviews_value_tokens]

In [None]:
# Sense check 
filtered_review_tokens[100]

In [None]:
# Remove stop words and non-alphabetic tokens from summaries.
filtered_summary_tokens = [
    [t.lower() for t in tokens if t.lower() not in english_stopwords and t.isalpha()]
    for tokens in summary_value_tokens]

In [None]:
filtered_summary_tokens[100]

In [None]:
# Determine polarity of reviews. 
polarityscores_reviews_tokens =\
{" ".join(tokens) : darth_vader.polarity_scores(" ".join(tokens)) for tokens in filtered_review_tokens}

# View output.
polarityscores_reviews_tokens

In [None]:
# Create a DataFrame using the dictionary of polarity scores
polarity_scores_reviews = pd.DataFrame(polarityscores_reviews_tokens).T

polarity_scores_reviews.head()

In [None]:
# Determine polarity of summaries. 
polarityscores_summary_tokens =\
{" ".join(tokens) : darth_vader.polarity_scores(" ".join(tokens)) for tokens in filtered_summary_tokens}

# View output.
polarityscores_summary_tokens

In [None]:
# Create a DataFrame using the dictionary of polarity scores
polarity_scores_summary = pd.DataFrame(polarityscores_summary_tokens).T

polarity_scores_summary.head()
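
Keying the dictionary on the joined text means that rows with identical cleaned text collapse into a single entry, so the resulting DataFrame can have fewer rows than `df5`. If one score per row is needed, one option is to score each row and build the DataFrame from a list of records. The sketch below uses a stand-in `toy_scorer` function (an assumption for illustration, in place of `darth_vader.polarity_scores`) so that it runs without the VADER lexicon:

```python
import pandas as pd

def toy_scorer(text):
    """Stand-in for a real sentiment scorer; returns a dict shaped like VADER's."""
    positive = {'great', 'fun'}
    negative = {'boring', 'broken'}
    words = text.split()
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    total = max(len(words), 1)
    return {'pos': pos / total, 'neg': neg / total,
            'compound': (pos - neg) / total}

# Hypothetical filtered token lists; the first and third rows are identical.
token_rows = [['great', 'game'], ['boring', 'broken'], ['great', 'game']]

# Dictionary keyed on text: duplicate texts overwrite each other.
as_dict = {' '.join(t): toy_scorer(' '.join(t)) for t in token_rows}

# One record per row: duplicates are preserved and row order is kept.
as_rows = pd.DataFrame([toy_scorer(' '.join(t)) for t in token_rows])
as_rows['text'] = [' '.join(t) for t in token_rows]

print(len(as_dict), len(as_rows))
```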

- I have interpreted sentiment as the overall sentiment of the reviews and summaries. This will be measured using compound scores. 
- To measure polarity, I will look at the positive, negative and neutral polarity of each review and summary.
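
The compound score is a single normalised value between -1 and +1. The VADER authors suggest treating scores of 0.05 and above as positive, -0.05 and below as negative, and anything in between as neutral. A minimal sketch of that labelling, with the helper name `label_compound` chosen here for illustration:

```python
def label_compound(score, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if score >= threshold:
        return 'positive'
    if score <= -threshold:
        return 'negative'
    return 'neutral'

# Hypothetical compound scores.
labels = [label_compound(s) for s in (0.62, 0.0, -0.31, 0.05)]
print(labels)
```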

In [None]:
# Review: Create a histogram plot with bins = 15.
# Histogram of polarity


# Histogram of sentiment score
sns.histplot(data=polarity_scores_reviews['compound'], bins=15).set(title='The Distribution of Reviews Sentiment')

In [None]:
# Summary: Create a histogram plot with bins = 15.
# Histogram of polarity


# Histogram of sentiment score
sns.histplot(data=polarity_scores_summary['compound'], bins=15).set(title='The Distribution of Summary Sentiment')
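
The polarity histograms could be built from the provided `generate_polarity` function by scoring each text and binning the scores. The sketch below is one possible approach rather than the notebook's own method: `polarity_bins` is a hypothetical helper, and a toy scorer stands in for `generate_polarity` so that the block runs without TextBlob. It fixes the bin range to the polarity scale of -1 to +1.

```python
import numpy as np

def polarity_bins(texts, scorer, bins=15):
    """Score each text and count the scores in equal-width bins over [-1, 1]."""
    scores = np.array([scorer(t) for t in texts], dtype=float)
    counts, edges = np.histogram(scores, bins=bins, range=(-1, 1))
    return scores, counts, edges

def toy_polarity(text):
    """Toy stand-in for generate_polarity(): +1, -1 or 0 by keyword."""
    if 'good' in text:
        return 1.0
    if 'bad' in text:
        return -1.0
    return 0.0

# Hypothetical texts.
texts = ['good game', 'bad rules', 'arrived on time', 'good value']
scores, counts, edges = polarity_bins(texts, toy_polarity)
print(counts)
```

With the real data, `df5['review'].apply(generate_polarity)` would supply the scores, and `plt.hist(scores, bins=15, range=(-1, 1))` would draw the same bins.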

## 6. Identify top 20 positive and negative reviews and summaries respectively

In [None]:
# Top 20 positive summaries.
summary_pos = polarity_scores_summary.sort_values(['pos'], ascending=False).reset_index()

# View output.
summary_pos.loc[0:19]

In [None]:
# Top 20 negative summaries.
summary_neg = polarity_scores_summary.sort_values(['neg'], ascending=False).reset_index()

# View output.
summary_neg.loc[0:19]

In [None]:
# Top 20 positive reviews.
reviews_pos = polarity_scores_reviews.sort_values(['pos'], ascending=False).reset_index()

# View output.
reviews_pos.loc[0:19]

In [None]:
# Top 20 negative reviews.
reviews_neg = polarity_scores_reviews.sort_values(['neg'], ascending=False).reset_index()

# View output.
reviews_neg.loc[0:19]
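
Sorting on `pos` or `neg` ranks texts by the proportion of positive or negative words, which can favour very short texts (a one-word review such as "great" scores `pos = 1.0`). An alternative worth comparing is to rank by the compound score with `nlargest` and `nsmallest`. A sketch on a made-up scores table:

```python
import pandas as pd

# Hypothetical scores table shaped like the VADER output used above.
scores = pd.DataFrame(
    {'neg': [0.0, 0.6, 0.1, 0.0],
     'pos': [1.0, 0.0, 0.4, 0.2],
     'compound': [0.62, -0.80, 0.91, 0.10]},
    index=['great', 'boring broken game', 'love fun family game', 'ok game'],
)

top_positive = scores.nlargest(2, 'compound')   # highest compound first
top_negative = scores.nsmallest(2, 'compound')  # lowest compound first

print(top_positive.index.tolist())
print(top_negative.index.tolist())
```

For the assignment data, the first argument would be 20 rather than 2.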

## 7. Discuss: Insights and observations

***Your observations here...***
