# Analysis of Chocolate Bar Rating dataset

Author: Lynn Miller

Version: 1.0

Date: 18-May-2018

Description

1. Exploratory data analysis of the chocolate bar rating dataset
2. Feature engineering
3. Build model to predict chocolate bar ratings

In [2]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.stats.api as sms
import random
import math
from scipy.stats import pearsonr
from scipy.stats import ttest_ind
from scipy.stats import chi2_contingency
from statsmodels.formula.api import ols
from sklearn import tree
from sklearn.tree import _tree
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from pprint import pprint

%matplotlib inline
plt.style.use('ggplot')

Read the data into a Pandas data frame and display some rows

In [5]:
chocolate = pd.read_csv("../input/chocolate-bar-ratings/flavors_of_cacao.csv")
chocolate.head()

## Exploratory Data Analysis

Let's explore this data to see what we can find out about it. As we may want to clean the data as we go, I'll make a copy of it, so I can refer back to the original data if necessary.

In [6]:
chocolateOriginal = chocolate.copy()
chocolateOriginal.head()

These column names are rather cumbersome, so let's simplify them

In [7]:
chocolate.columns = ["Company","SpecificOrigin","Ref","ReviewDate","CocoaPercent","Location","Rating","BeanType","BroadOrigin"]

### Basic Descriptive Statistics

Firstly, check the data types of these columns

In [8]:
chocolate.dtypes

Let's have a look at the column statistics ...

In [9]:
chocolate.describe()

So the review date is a year between 2006 and 2017, the references are integers between 5 and 1952, and the rating has a mean of 3.19 and a median of 3.25.

And the string columns ...

In [10]:
chocolate.describe(exclude=[np.number])

One immediately obvious issue is that we have several string (categorical) columns with high cardinality. Many machine learning algorithms require categorical variables to be converted to a set of binary variables (one for each value after the first), or do this implicitly. This would add almost 1700 variables to the dataset. As we only have 1795 observations, we will almost certainly overfit our model if we use all these variables as they are.
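As a rough illustration of where that variable count comes from, the sketch below counts the binary columns a one-hot encoding would create. The `toy` dataframe and its values are made up for the example; only `pd.get_dummies` and the "one for each value after the first" rule come from the discussion above.

```python
import pandas as pd

# Made-up miniature of the dataset: two categorical columns and one numeric column
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "A"],
    "BeanType": ["Criollo", "Trinitario", "Criollo", "Blend"],
    "CocoaPercent": [0.70, 0.72, 0.65, 0.80],
})

categorical = toy.select_dtypes(exclude="number")

# One binary variable per value after the first, summed over the categorical columns
expected_dummies = int((categorical.nunique() - 1).sum())

# get_dummies with drop_first=True builds exactly those columns
encoded = pd.get_dummies(toy, drop_first=True)
n_dummy_columns = encoded.shape[1] - 1  # subtract the numeric CocoaPercent column

print(expected_dummies, n_dummy_columns)
```

Running the same count on the real string columns is one way to check the "almost 1700" estimate.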

So, let's explore the data to see what feature engineering can be done.

#### Chocolate Ratings

In [11]:
fig, ax = plt.subplots(1,1,figsize=(11, 5))
g1 = sns.distplot(chocolate.Rating, kde=False, bins=np.arange(1, 5.5, 0.25),
                  hist_kws={"rwidth": 0.9, "align": "left", "alpha": 1.0})
plt.suptitle("Chocolate Ratings", fontsize=16)
fig.show()

The ratings are slightly skewed to the left, with a number of low-rated chocolates. There is a slight bi-modality with peaks at 3 and 3.5, but this may simply indicate a reluctance by reviewers to use the quarter-point ratings (1.75 and 2.25 are similarly lower than would be expected), rather than two distinct groupings.

Which are the best chocolates?

In [12]:
chocolate[chocolate["Rating"] == 5]

And which are the worst chocolates?

In [13]:
chocolate[chocolate["Rating"] <= 1]

### Feature Engineering

#### Cocoa Percentage

The Cocoa Percentage is a string ... let's convert it to a number and as it's a percentage we'll also convert it to a proportion

In [14]:
chocolate.CocoaPercent = chocolate.CocoaPercent.apply(lambda x : float(x.rstrip("%"))/100.0)
chocolate.describe()

So the cocoa percentage ranges from 42% to 100%, with a mean of 71.7% and median 70%. How are the values distributed?

In [15]:
g = chocolate.CocoaPercent.hist(bins=58, figsize=(18,9))
plt.title("Chocolate - Cocoa Percent")
plt.show()

The most common value is 70%, and quite a few samples have 72% and 75% cocoa. The distribution is not symmetric and has quite long tails (below 55% and above 90%).

#### Review Date

Count the number of reviews for each year and plot the distribution

In [16]:
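# Print the counts explicitly, as only the last expression in a cell is displayed
print(chocolate.ReviewDate.value_counts().sort_index())
g = chocolate.ReviewDate.hist(bins=12, figsize=(18,9))
plt.title("Chocolate - Review Date")
plt.show()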

The number of reviews per year increases most years until 2016. The low number of reviews in 2017 may indicate that data collection stopped early in 2017.

#### Ref

The ref values run from 5 to 1952, and with 1795 reviews in total it is not obvious whether this is a unique field. How many unique values are there and how often is each one used?

In [17]:
ref = chocolate.Ref.copy()
print("Number of unique ref values: " + str(ref.nunique()))
print("\nCount of ref values by number of occurrences")
# Count how often each ref occurs, then count how many refs share each frequency
print(ref.value_counts().value_counts().sort_index())

There are 440 unique refs. Each ref occurs between one and ten times (for example, 8 refs occur just once, 311 occur four times and 1 occurs ten times).
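The "count of counts" idea can be seen on a handful of made-up ref values. This is only a sketch: the numbers are invented, and chaining `value_counts()` twice is one possible way to get the frequency table.

```python
import pandas as pd

# Made-up ref values: 10 occurs three times, 12 twice, 11 and 13 once each
ref = pd.Series([10, 10, 10, 11, 12, 12, 13], name="Ref")

# ref value -> number of times it occurs
occurrences = ref.value_counts()
# number of occurrences -> how many ref values occur that often
freq_of_freq = occurrences.value_counts().sort_index()

print(freq_of_freq)
```

Here two refs occur once, one occurs twice and one occurs three times, which is the same kind of summary as the one quoted above for the full dataset.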

The column metadata for this dataset (https://www.kaggle.com/rtatman/chocolate-bar-ratings/data) says this is linked to when the data is entered in the database, so it is likely to be closely related to the review date. Let's compare these.

In [18]:
ref = chocolate.groupby('ReviewDate').Ref
ref = pd.concat([ref.count(), ref.min(), ref.max(), ref.nunique()], axis=1).reset_index()
ref.columns = ['ReviewDate','count','minRef','maxRef','uniqueRef']
g = chocolate.boxplot(column="Ref", by="ReviewDate", figsize=(18,12))
ref

The samples with the same ref are generally for the same year, with the occasional ref spanning two years (e.g. 464 spans 2009 and 2010 and 1928 spans 2016 and 2017).

#### Company

There are 416 different chocolate companies in the dataset, which is too many to use with only 1795 samples. Is there any way to combine any of the companies?

Let's look at the top 50 companies by number of samples:

In [19]:
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
g = plt.subplots(figsize=(18, 9))
g = sns.barplot(x=chocolate.Company.value_counts().index[0:50], y=chocolate.Company.value_counts()[0:50], palette="Blues_d")
g.set_xticklabels(g.get_xticklabels(), rotation=90)
plt.title("Chocolate Companies")
plt.show()

There are a few that could be related. For instance, according to the metadata, in values like "Hotel Chocolat (Copperneur)" the maker is "Copperneur", so we could extract the maker and assume the company is the maker where none is given.

In [20]:
# Raw string avoids the invalid-escape warning for "\("
chocolate["Company"] = chocolate.Company.apply(lambda x: re.split(r"\(|aka ",x.rstrip(")"))[-1])
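To see what this split does, here is the same expression wrapped in a small function and applied to three strings. The function name `extract_maker` is hypothetical. The first string is the example quoted above, and the other two are made up.

```python
import re

def extract_maker(company):
    """Return the text after '(' or 'aka ', or the whole name if neither is present."""
    return re.split(r"\(|aka ", company.rstrip(")"))[-1]

# The first name is quoted from the text above; the other two are made up for illustration
examples = ["Hotel Chocolat (Copperneur)", "Maker One aka Maker Two", "Plain Maker"]
makers = [extract_maker(name) for name in examples]
print(makers)
```

Names without a bracket or "aka" pass through unchanged, which is the "assume the company is the maker" case.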

And now we'll only keep the ones with more than 20 observations - set the others to "other"

In [21]:
company = chocolate[["Company","Ref"]].groupby("Company").count().reset_index() #sort_values(by="Ref", ascending=False).reset_index()
company["newCompany"] = company.apply(lambda x: x.Company if x.Ref > 20 else "other", axis=1)
company

In [22]:
chocolate = chocolate.merge(company[["Company","newCompany"]], how="left", on="Company")

In [23]:
fig,(ax1,ax2) = plt.subplots(2, 1, sharex=True)

g1 = sns.countplot(x=chocolate.newCompany, order=chocolate.newCompany.value_counts().index, palette="Blues_d", ax=ax1)
g1.set_xticklabels('')
g1.set_xlabel('')
g1.set_ylabel('')
g1.set_ylim(1400,1460)

g2 = sns.countplot(x=chocolate.newCompany, order=chocolate.newCompany.value_counts().index, palette="Blues_d", ax=ax2)
g2.set_xticklabels(g2.get_xticklabels(), rotation=90)
g2.set_ylim(0,60)

plt.suptitle("Chocolate Companies", fontsize=16)
plt.subplots_adjust(hspace=0.2)

#### Location

There are 60 locations in the dataset. How are the observations distributed across these locations?

In [24]:
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
g = plt.subplots(figsize=(18, 9))
g = sns.countplot(x=chocolate.Location, order=chocolate.Location.value_counts().index, palette="Blues_d")
g.set_xticklabels(g.get_xticklabels(), rotation=90)
plt.title("Company Locations")
plt.show()

The largest category is the U.S.A., another nine countries have at least 30 observations, and the remaining 50 have fewer than 30. Let's combine these smaller locations into a single category.

In [25]:
locs = chocolate.Location.value_counts()
chocolate["LocName"] = chocolate.Location.apply(lambda x: "Other" if x in locs[locs < 30].index else x)
#locs[locs > 30].index
chocolate.LocName.value_counts()
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
g = plt.subplots(figsize=(18, 9))
g = sns.countplot(x=chocolate.LocName, order=chocolate.LocName.value_counts().index, palette="Blues_d")
g.set_xticklabels(g.get_xticklabels(), rotation=90)
plt.title("Company Locations")
plt.show()

The new location column (currently LocName) has 11 categories, with the "Other" category having just under 400 observations.
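The same "lump the rare categories" step is used for both companies and locations, so it can be written once as a helper. The sketch below is one possible version; `lump_rare` is a hypothetical name, the data is made up, and the threshold of 3 stands in for the 30 used above.

```python
import pandas as pd

def lump_rare(values, min_count, other="Other"):
    """Replace categories seen fewer than min_count times with a single label."""
    counts = values.map(values.value_counts())
    return values.where(counts >= min_count, other)

# Made-up locations; the threshold of 3 stands in for the 30 used above
toy_locations = pd.Series(["U.S.A."] * 5 + ["France"] * 3 + ["Peru"] * 2 + ["Fiji"])
lumped = lump_rare(toy_locations, min_count=3)
print(lumped.value_counts())
```

Mapping each row to its category count and using `where` avoids a row-by-row `apply`, though on a dataset this small the difference is cosmetic.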

#### Bean Type

Plot the BeanType distribution 

In [26]:
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})

g = plt.subplots(figsize=(18, 9))
g = sns.countplot(x=chocolate.BeanType, order=chocolate.BeanType.value_counts().index, palette="YlOrBr_r")
g.set_xticklabels(g.get_xticklabels(), rotation=90)
plt.title("Chocolate Bean Types")
plt.show()

The largest group has no category name, then there are the Trinitario, Criollo, and Forastero cocoa beans, which according to various websites make up nearly all the cocoa bean varieties grown. There is the blend category, plus several categories that specify two types of beans, plus some that appear to be identifying a sub-type.

To simplify this, create a new Bean Type feature (called BeanT for now) as follows:
- Convert nans and other empty strings to "not-specified"
- Convert categories with two bean types to "Blend". As there are quite a few Criollo blends (in particular Criollo/Trinitario) keep these separate from the other blends in case this is significant.
- Remove any sub-type.

In [27]:
chocolate['BeanT'] = chocolate.BeanType.fillna('not-specified').replace('\xa0', 'not-specified').apply(
    lambda x : ("Blend-Criollo" if "Criollo" in re.split(r" |,|\)",str(x)) else "Blend") if any(
        word in x for word in ['Blend',',']) else x).apply(lambda x : (x.split()[0]))
chocolate.describe(exclude=[np.number])

So that has reduced the number of bean types to 12.
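The chained lambdas are compact but hard to read, so here are the three rules again as a plain function. This is a sketch of the same logic rather than the code the notebook runs. `simplify_bean_type` is a hypothetical name and the sample values are illustrative.

```python
import re
import numpy as np

def simplify_bean_type(value):
    """Apply the three rules above to a single raw BeanType value."""
    # Rule 1: missing values and blank placeholders such as the non-breaking space
    if not isinstance(value, str) or value.strip() == "":
        return "not-specified"
    # Rule 2: anything naming a blend or listing several beans
    if "Blend" in value or "," in value:
        words = re.split(r" |,|\)", value)
        return "Blend-Criollo" if "Criollo" in words else "Blend"
    # Rule 3: drop any sub-type by keeping the first word
    return value.split()[0]

samples = [np.nan, "\xa0", "Criollo, Trinitario", "Trinitario, Forastero",
           "Forastero (Nacional)", "Criollo (Porcelana)", "Blend"]
simplified = [simplify_bean_type(s) for s in samples]
print(simplified)
```

Written this way, each rule can be checked on its own against a few hand-picked values before it is applied to the whole column.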

What bean types do we have now?

In [28]:
chocolate.groupby('BeanT').BeanT.count()

We have the three main bean types, the two blends and the "not-specified" category, plus six very small categories. Let's combine all the small categories together, as individually they do not have enough samples.

In [29]:
chocolate['BeanT'] = chocolate['BeanT'].apply(
    lambda x: "Other" if x in ["Amazon","Beniano","CCN51","EET","Matina","Nacional"] else x)

fig, ax = plt.subplots(1,1,figsize=(11, 5))
g1 = sns.countplot(x=chocolate.BeanT, palette="YlOrBr_r")
g1.set_xticklabels(g1.get_xticklabels(), rotation=90)
plt.suptitle("Chocolate Bean Types and Blend", fontsize=16)
fig.show()

In [30]:
chocolate.groupby('BeanT').BeanT.count()

#### Specific Origin and Broad Origin

Specific Origin has over 1000 distinct values, which is clearly too many to work with. Can we ignore this and just use the Broad Origin?

Let's have a closer look at the two most common values - Madagascar for Specific Origin and Venezuela for Broad Origin.

BroadOrigin has 100 values, with the following distribution:

In [31]:
#chocolate.groupby('SpecificOrigin').SpecificOrigin.count().sort_values(ascending=False).head(10)
print(chocolate.groupby('BroadOrigin').BroadOrigin.count().sort_index())

These are mainly countries, but the data is messy: there are misspellings and multiple abbreviations for the same country, as well as regions, continents, and entries that list several countries.

Many categories have a very small number of records, and these will need combining to provide useful categories. An obvious way to combine categories with multiple countries is to assume most of the beans are from the first country and ignore the other countries. We'll also fix the spelling mistakes and expand the abbreviations.

In [32]:
# regex=True is needed for the 'Dom.*' and 'Ven.*' patterns (recent pandas treats them as literals by default)
chocolate["Origin"] = chocolate.BroadOrigin.fillna('not specified').replace(
    '\xa0', 'not specified').str.replace('Dom.*','Dominican Republic', regex=True).str.replace(
    'Ven.*','Venezuela', regex=True).apply(
    lambda x: re.split(r',|\(|/|&|-',str(x))[0].rstrip().replace('Cost ','Costa ').replace('DR','Dominican Republic').replace(
        'Tobago','Trinidad').replace('Trinidad','Trinidad and Tobago').replace("Carribean","Caribbean"))
print(chocolate.groupby('Origin').Origin.count().sort_index())

That's fixed some and shown a few more ... let's fix these 

In [33]:
chocolate["Origin"] = chocolate.Origin.apply(
    lambda x: x.replace('Gre.','Grenada').replace('Guat.','Guatemala').replace("Hawaii","United States of America").replace(
        'Mad.','Madagascar').replace('PNG','Papua New Guinea').replace('Principe','Sao Tome').replace(
        'Sao Tome','Sao Tome and Principe'))
print(chocolate.groupby('Origin').Origin.count().sort_index())

Ok - that looks better. But we still have some categories with very small numbers of records. If we assume that chocolate beans from countries close together are similar, then combining countries by region and/or continent should be useful.

To help do this, I'll use another Kaggle dataset: the countryContinent.csv dataset from "https://www.kaggle.com/statchaitya/country-to-continent/data"

In [34]:
countriesRaw = pd.read_csv("../input/country-to-continent/countryContinent.csv", encoding='iso-8859-1') 
countriesRaw

We'll tidy this list up to strip off the extraneous parts of the country name (e.g. so we have "Venezuela" instead of "Venezuela (Bolivarian Republic of)").

In [35]:
# Work on a copy so the assignment below does not trigger SettingWithCopyWarning
countries = countriesRaw[["country","sub_region","continent"]].copy()
countries["country"] = countries.country.apply(lambda x: re.split(r"\(|,",x)[0].rstrip())
countries = countries.drop_duplicates()
countries

Some more updates to the Origin attribute, to match the country and sub-region names used in the Countries dataset

In [36]:
chocolate["Origin"] = chocolate.Origin.apply(
    lambda x: x.replace("St.","Saint").replace("Vietnam","Viet Nam").replace("Burma","Myanmar").replace(
        "Ivory Coast","Côte d'Ivoire").replace("West","Western").replace(" and S. "," "))
print(chocolate.groupby('Origin').Origin.count().sort_index())

Now we can merge the chocolate and countries dataframes to set a sub_region for each country. Then we'll list the ones that haven't matched:

In [37]:
chocolate = chocolate.merge(countries[["country","sub_region"]], how="left", left_on="Origin", right_on="country")
chocolate[chocolate.country.isnull()].groupby("Origin").Origin.count().sort_index()
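An alternative way to list the rows that failed to match is the `indicator` option of `merge`, which labels each row with where it came from. The two small dataframes below are made up. This sketches the technique and is not the notebook's own code.

```python
import pandas as pd

# Made-up stand-ins for the chocolate and countries dataframes
choc = pd.DataFrame({"Origin": ["Peru", "Ghana", "Hawaii", "South America"]})
lookup = pd.DataFrame({
    "country": ["Peru", "Ghana"],
    "sub_region": ["South America", "Western Africa"],
})

merged = choc.merge(lookup, how="left", left_on="Origin", right_on="country", indicator=True)

# Rows labelled 'left_only' found no partner in the lookup table
unmatched = merged.loc[merged["_merge"] == "left_only", "Origin"].tolist()
print(unmatched)
```

Checking for a null `country`, as above, gives the same rows. The indicator column is mainly useful when the lookup table could itself contain nulls.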

We'll manually fix up Hawaii. For the others that didn't match, we'll set the sub_region to the Origin. We'll also set the country for the "not specified" rows.

In [38]:
chocolate.loc[chocolate.Origin=="Hawaii","country"] = "United States of America"
chocolate.loc[chocolate.Origin=="Hawaii","sub_region"] = "Northern America"

chocolate.loc[chocolate.country.isnull(),"sub_region"] = chocolate.loc[chocolate.country.isnull(),"Origin"]
chocolate.loc[chocolate.country.isnull(),"country"] = "--"

Now we'll use the sub_region to find the continents ... and fix up "Africa"

In [39]:
regions = countries[["sub_region","continent"]].drop_duplicates()
chocolate = chocolate.merge(regions, how="left", on="sub_region")
chocolate.loc[chocolate.Origin=='Africa',"continent"] = 'Africa'
chocolate.continent = chocolate.continent.replace(np.nan,"other")

In [40]:
print(chocolate[["continent","sub_region","country","Origin"]].groupby(["continent","sub_region","country"]).count())

Next we'll do the rollups by setting all the small country categories to the sub_region

In [41]:
chocCounts = chocolate[["Origin","Ref"]].groupby(["Origin"]).count()
chocCounts.columns = ["countryCount"]
chocRollup = chocolate.merge(chocCounts, how="left", left_on="Origin", right_index=True)[["Origin","sub_region","countryCount"]]
chocolate.Origin = chocRollup.apply(lambda x: x.sub_region if x.countryCount < 28 else x.Origin, axis=1)
print(chocolate[["continent","sub_region","country","Origin"]].groupby(["continent","sub_region","Origin"]).count())

And repeat that to set the ones that are still small to the continent

In [42]:
chocCounts = chocolate[["Origin","Ref"]].groupby(["Origin"]).count()
chocCounts.columns = ["countryCount"]
chocRollup = chocolate.merge(chocCounts, how="left", left_on="Origin", right_index=True)[["Origin","continent","countryCount"]]
chocolate.Origin = chocRollup.apply(lambda x: x.continent if x.countryCount < 28 else x.Origin, axis=1)
print(chocolate[["continent","country","Origin"]].groupby(["continent","Origin"]).count())
#print(chocolate[["continent","sub_region","country","Origin"]].groupby(["continent","sub_region","Origin"]).count())

This looks good. The only problem is that just 5 records have rolled up to "Americas", so we need to find an appropriate category to merge them with.

In [43]:
print(chocolate.loc[chocolate.Origin.str.contains("America"),["Origin","BroadOrigin","country"]].groupby(["Origin","BroadOrigin"]).count())

So some of our "Central America" records have the original BroadOrigin of "Central and S. America", so this looks like a good candidate. Combine "Americas" and "Central America" into one category and display the final Origin values.

In [44]:
chocolate.loc[chocolate.Origin.isin(["Americas","Central America"]),"Origin"] = "Central and South America"
print(chocolate[["continent","country","Origin"]].groupby(["continent","Origin"]).count())
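The two roll-up passes used for Origin (country to sub-region, then sub-region to continent) follow the same pattern, which can be sketched as a reusable helper. The `roll_up` function and the toy data are hypothetical, and the threshold of 2 stands in for the 28 used above.

```python
import pandas as pd

def roll_up(frame, level, parent, min_count):
    """Replace values of `level` that occur fewer than min_count times with `parent`."""
    counts = frame.groupby(level)[level].transform("size")
    return frame[level].where(counts >= min_count, frame[parent])

# Made-up origins: Togo, Benin and Fiji are too small to keep as separate categories
toy = pd.DataFrame({
    "Origin": ["Peru"] * 3 + ["Ghana"] * 3 + ["Togo", "Benin", "Fiji"],
    "sub_region": ["South America"] * 3 + ["Western Africa"] * 5 + ["Melanesia"],
    "continent": ["Americas"] * 3 + ["Africa"] * 5 + ["Oceania"],
})

toy["Origin"] = roll_up(toy, "Origin", "sub_region", min_count=2)  # country -> sub-region
toy["Origin"] = roll_up(toy, "Origin", "continent", min_count=2)   # sub-region -> continent
print(toy["Origin"].value_counts())
```

As in the real data, a category can still be small after both passes (here the single "Oceania" row), so a final manual merge may be needed.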

##### SpecificOrigin

In [45]:
chocolate.SpecificOrigin.describe()

In [46]:
origin = chocolate[['SpecificOrigin', 'Ref']].groupby(['SpecificOrigin']).count().reset_index()
origin[origin.Ref >= 20]

There are 1039 SpecificOrigin values, out of 1795 records. Most only occur once and only 5 values have at least 20 observations. There doesn't seem to be much point in doing anything with these, especially as we have the BroadOrigin.

#### Re-Organise Dataframe

Display the current dataframe, original data and new features

In [47]:
chocolate.head()

Re-organise dataframe, keeping just the attributes we want
- Drop Company and replace with newCompany
- Drop SpecificOrigin as this has too many values
- Drop Ref and ReviewDate (if we're going to predict the rating for new types of chocolate we presumably won't have these attributes)
- Keep CocoaPercent
- Drop Location and replace with LocName
- Keep Rating
- Drop BeanType and replace with BeanT
- Drop BroadOrigin and replace with Origin

In [48]:
chocolate=chocolate.loc[:,["Rating", "CocoaPercent", "newCompany", "LocName", "BeanT", "Origin"]]
chocolate.columns = ["Rating","CocoaPercent","Company","Location","BeanType","Origin"]
chocolate.dtypes

In [49]:
chocolate.head()

### Influences on Rating

#### CocoaPercent

Anecdotally, good chocolate is associated with higher percentage of cocoa. Is this correct?

In [50]:
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
g1 = sns.lmplot(x="CocoaPercent", y="Rating", data=chocolate, y_jitter=0.2, x_jitter=0.01)
plt.title("Chocolate: Cocoa Percentage vs Rating")
plt.show()

That assumption appears to be wrong - the fitted line slopes downward, suggesting a negative association between cocoa percent and rating. We can check this using Pearson's correlation test.

In [52]:
pearsonr(chocolate.CocoaPercent, chocolate.Rating)

The negative correlation is weak (r = -0.16), but it is statistically significant (p = 2.12e-12).
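A weak correlation can still be highly significant when there are many observations. The sketch below shows this on synthetic data. The slope, noise level and sample size are arbitrary assumptions and are not fitted to the chocolate data.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Synthetic data only: a weak downward trend buried in noise
cocoa = rng.uniform(0.4, 1.0, size=1000)
rating = 3.5 - 0.5 * cocoa + rng.normal(0.0, 0.45, size=1000)

# pearsonr returns the correlation coefficient and the two-sided p-value
r, p_value = pearsonr(cocoa, rating)
print("r = {:.3f}, p = {:.2e}".format(r, p_value))
```

The p-value says the association is unlikely to be zero, but not that it is strong. The size of r is what indicates how much of the rating the cocoa percentage can explain.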

##### Finding 1:

A higher Cocoa percent is associated with a lower rating

#### Company

Run an ANOVA test to see if the ratings vary by company

In [53]:
comp_lm = ols('Rating ~ Company', data=chocolate).fit()
print(comp_lm.params)
print(sm.stats.anova_lm(comp_lm, typ=2))

The ANOVA test shows there is a significant variation in the ratings between companies. Visualise the differences using a Boxplot, and order the boxplots by the mean rating.

In [54]:
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
g = plt.subplots(figsize=(18, 9))
g = sns.boxplot(x=chocolate.Company, y=chocolate.Rating, palette="YlOrBr_r",
                order=chocolate[["Company","Rating"]].groupby("Company").mean().sort_values("Rating", ascending=False).index)
g.set_xticklabels(g.get_xticklabels(), rotation=90)
plt.title("Chocolate: Company vs Rating")
plt.show()
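For a single categorical factor, the ANOVA tests in this section are one-way F-tests, and a quick cross-check is possible with `scipy.stats.f_oneway`. The sketch below uses synthetic ratings for three hypothetical companies. It is not the statsmodels call used in the notebook.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)

# Synthetic ratings for three hypothetical companies; only company_c has a different mean
company_a = rng.normal(3.0, 0.4, size=100)
company_b = rng.normal(3.0, 0.4, size=100)
company_c = rng.normal(3.5, 0.4, size=100)

# One-way ANOVA: tests whether all group means are equal
f_stat, p_value = f_oneway(company_a, company_b, company_c)
print("F = {:.2f}, p = {:.2e}".format(f_stat, p_value))
```

A small p-value only says that at least one group mean differs. The boxplots are what show which groups differ.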

#### Location

Run an ANOVA test to see if the ratings vary by location

In [55]:
loc_lm = ols('Rating ~ Location', data=chocolate).fit()
print(loc_lm.params)
print(sm.stats.anova_lm(loc_lm, typ=2))

The ANOVA test shows there is a significant variation in the ratings between locations. Visualise the differences using a Boxplot, and order the boxplots by the mean rating.

In [56]:
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
g = plt.subplots(figsize=(18, 9))
g = sns.boxplot(x=chocolate.Location, y=chocolate.Rating, palette="YlOrBr_r",
                order=chocolate[["Location","Rating"]].groupby("Location").mean().sort_values("Rating", ascending=False).index)
g.set_xticklabels(g.get_xticklabels(), rotation=90)
plt.title("Chocolate: Location vs Rating")
plt.show()

So the best chocolate was made by an Italian company, but Australian chocolates have the highest mean rating. But all countries produced at least one chocolate with a rating of 4.

#### Bean Type

Run an ANOVA test to see if the ratings vary by bean type

In [57]:
bean_lm = ols('Rating ~ BeanType', data=chocolate).fit()
print(bean_lm.params)
print(sm.stats.anova_lm(bean_lm, typ=2))

The ANOVA test shows there is a significant variation in the ratings between bean types. Visualise the differences using a Boxplot, and order the boxplots by the mean rating.

In [58]:
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
g = plt.subplots(figsize=(18, 9))
g = sns.boxplot(x=chocolate.BeanType, y=chocolate.Rating, palette="YlOrBr_r",
                order=chocolate[["BeanType","Rating"]].groupby("BeanType").mean().sort_values("Rating", ascending=False).index)
g.set_xticklabels(g.get_xticklabels(), rotation=90)
plt.title("Chocolate: Bean Types vs Rating")
plt.show()

#### Origin

Run an ANOVA test to see if the ratings vary by origin

In [59]:
orig_lm = ols('Rating ~ Origin', data=chocolate).fit()
print(orig_lm.params)
print(sm.stats.anova_lm(orig_lm, typ=2))

Again there is a significant difference here. Where do the best beans come from?

In [60]:
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
g = plt.subplots(figsize=(18, 9))
g = sns.boxplot(x=chocolate.Origin, y=chocolate.Rating, palette="YlOrBr_r",
                order=chocolate[["Origin","Rating"]].groupby("Origin").mean().sort_values("Rating", ascending=False).index)
g.set_xticklabels(g.get_xticklabels(), rotation=90)
plt.title("Chocolate: Bean Origin vs Rating")
plt.show()

So the beans in the best chocolate came from Venezuela (and one of unknown origin). Chocolate made from Guatemalan beans had the highest average (mean) rating and beans from Viet Nam had the highest median rating, but a significant number of Venezuelan beans rated well (the upper quartile is 3.75).

### Dependencies between Attributes

In [61]:
print("Contingency Tests for Categorical attributes")
print("Company and Location: {}".format(chi2_contingency(pd.crosstab(chocolate.Company,chocolate.Location))[1]))
print("Company and BeanType: {}".format(chi2_contingency(pd.crosstab(chocolate.Company,chocolate.BeanType))[1]))
print("Company and Origin: {}".format(chi2_contingency(pd.crosstab(chocolate.Company,chocolate.Origin))[1]))
print("Location and BeanType: {}".format(chi2_contingency(pd.crosstab(chocolate.Location,chocolate.BeanType))[1]))
print("Location and Origin: {}".format(chi2_contingency(pd.crosstab(chocolate.Location,chocolate.Origin))[1]))
print("BeanType and Origin: {}".format(chi2_contingency(pd.crosstab(chocolate.BeanType,chocolate.Origin))[1]))

These are all significant, so we reject the hypothesis of independence: each pair of categorical attributes is associated.
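When reading these p-values, note that the null hypothesis of the chi-square contingency test is that the two attributes are independent, so a small p-value is evidence of an association. The sketch below contrasts a dependent and an independent pair, using made-up data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: in `linked`, bean type depends strongly on location; in `unlinked` it does not
linked = pd.DataFrame({
    "Location": ["X"] * 40 + ["Y"] * 40,
    "BeanType": ["Criollo"] * 30 + ["Blend"] * 10 + ["Criollo"] * 10 + ["Blend"] * 30,
})
unlinked = pd.DataFrame({
    "Location": ["X"] * 40 + ["Y"] * 40,
    "BeanType": (["Criollo"] * 20 + ["Blend"] * 20) * 2,
})

# Element [1] of the result is the p-value, as in the cell above
p_linked = chi2_contingency(pd.crosstab(linked.Location, linked.BeanType))[1]
p_unlinked = chi2_contingency(pd.crosstab(unlinked.Location, unlinked.BeanType))[1]
print(p_linked, p_unlinked)
```

The linked pair gives a p-value far below 0.05, while the unlinked pair gives a p-value well above it.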

In [62]:
print("ANOVA tests between CocoaPercent and categorical attributes")
print(sm.stats.anova_lm(ols('CocoaPercent ~ Company', data=chocolate).fit(), typ=2))
print(sm.stats.anova_lm(ols('CocoaPercent ~ Location', data=chocolate).fit(), typ=2))
print(sm.stats.anova_lm(ols('CocoaPercent ~ BeanType', data=chocolate).fit(), typ=2))
print(sm.stats.anova_lm(ols('CocoaPercent ~ Origin', data=chocolate).fit(), typ=2))

These are all significant, except for CocoaPercent/BeanType.

*ToDo: Investigate relationship between CocoaPercent and BeanType*

## Predict Rating

### Create Test and Training sets

Randomly select 20% of the data and set aside as the test data

In [63]:
random.seed(12345)
testSize = len(chocolate) // 5
testIndices = random.sample(range(len(chocolate)),testSize)
testIndices.sort()
chocTest = chocolate.iloc[testIndices,]
print("Test data set has {} observations and {} attributes".format(chocTest.shape[0],chocTest.shape[1]))

The rest of the data is used to train the models

In [64]:
chocTrain = chocolate.drop(testIndices)
print("Training data set has {} observations and {} attributes".format(chocTrain.shape[0],chocTrain.shape[1]))
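The same 80/20 split could also be made with scikit-learn's `train_test_split`, which is imported at the top of the notebook but not used. A minimal sketch on a made-up dataframe:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame standing in for the chocolate dataframe
toy = pd.DataFrame({"Rating": [3.0, 3.5, 2.75, 3.25, 4.0, 2.5, 3.0, 3.75, 3.5, 3.25],
                    "CocoaPercent": [0.70, 0.72, 0.85, 0.70, 0.65, 1.00, 0.75, 0.70, 0.68, 0.72]})

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=12345)
print(len(toy_train), len(toy_test))
```

This would select different rows from the manual sampling above, so any results quoted later in the notebook would change slightly if it were swapped in.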

Many of the models expect all the attributes to be numeric, so convert the categorical features to dummy variables

In [65]:
trainX = pd.get_dummies(chocTrain.iloc[:,1:])
trainY = chocTrain.Rating
print("Training data set has {} observations and {} attributes".format(trainX.shape[0],trainX.shape[1]))
testX = pd.get_dummies(chocTest.iloc[:,1:])
testY = chocTest.Rating
print("Test data set has {} observations and {} attributes".format(testX.shape[0],testX.shape[1]))
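One thing to watch when encoding the two sets separately: if a category happens to appear in only one of them, the dummy columns will not line up. Printing both shapes, as above, is a quick check. A possible safeguard, sketched here on made-up data, is to reindex the test columns to match the training columns.

```python
import pandas as pd

# Made-up split in which the category "C" happens to be missing from the test rows
train = pd.DataFrame({"Company": ["A", "B", "C", "A"], "CocoaPercent": [0.70, 0.72, 0.65, 0.80]})
test = pd.DataFrame({"Company": ["B", "A"], "CocoaPercent": [0.75, 0.70]})

train_x = pd.get_dummies(train)
# Reindexing to the training columns adds Company_C (all zeros) and fixes the column order
test_x = pd.get_dummies(test).reindex(columns=train_x.columns, fill_value=0)

print(list(train_x.columns))
print(list(test_x.columns))
```

With the rare categories already merged into "Other", the mismatch is unlikely here, but the reindex costs nothing when the columns already agree.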

### Linear Regression

Fit a linear regression model using ols

In [66]:
olsModel = ols('Rating ~ CocoaPercent + BeanType + Origin + Location + Company', data=chocTrain).fit()
print(olsModel.params)

### Linear Regression with Ridge Regularisation

I'm using Bayesian ridge regression, which estimates the regularisation strength from the data, so there is no need to set up a grid of values to search over.

In [67]:
reg = linear_model.BayesianRidge()
reg.fit(trainX,trainY)
reg.coef_

Quickly check a few training predictions to see how they compare

In [68]:
lrResults = pd.DataFrame(trainY[0:10])
lrResults["Ols"] = np.round(olsModel.predict(chocTrain.iloc[0:10])*4)/4
lrResults["Reg"] = np.round(reg.predict(trainX.iloc[0:10])*4)/4
lrResults

This doesn't look great, so try a few other methods. I'll evaluate these properly using the test data later.

### Decision Tree

In [69]:
dtrModel = tree.DecisionTreeRegressor(max_depth=5)
dtrModel.fit(trainX,trainY)

Function to display the decision tree (from a KDNuggets post by Matthew Mayo, https://www.kdnuggets.com/2017/05/simplifying-decision-tree-interpretation-decision-rules-python.html)

In [70]:
def tree_to_code(tree, feature_names):

    '''
    Outputs a decision tree model as a Python function
    
    Parameters:
    -----------
    tree: decision tree model
        The decision tree to represent as a function
    feature_names: list
        The feature names of the dataset used for building the decision tree
    '''

    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    print("def tree({}):".format(", ".join(feature_names)))

    def recurse(node, depth):
        indent = "  " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            print("{}if {} <= {}:".format(indent, name, threshold))
            recurse(tree_.children_left[node], depth + 1)
            print("{}else:  # if {} > {}".format(indent, name, threshold))
            recurse(tree_.children_right[node], depth + 1)
        else:
            print("{}return {}".format(indent, tree_.value[node]))

    recurse(0, 1)

Display the decision tree

In [71]:
tree_to_code(dtrModel,trainX.columns)

Can we improve this using a grid search to tune the model parameters?

In [72]:
random.seed(8765)

# Create the parameter grid for the decision tree search
param_grid = {
    'max_depth': [8, 10, 12],
    'max_features': [8, 9, 10],
    'min_samples_leaf': [2, 4, 6, 8, 10],
    'min_samples_split': [2, 4, 6, 8, 10, 12, 14, 16],
    'splitter': ['best', 'random']
}
# Create a base model (random_state makes the random splitter and feature sampling repeatable)
dtr = tree.DecisionTreeRegressor(max_depth=5, random_state=8765)
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = dtr, param_grid = param_grid, 
                           cv = 10, n_jobs = -1, verbose = 2)
grid_search.fit(trainX, trainY)
bestDtr = grid_search.best_estimator_
grid_search.best_params_

Again, display the decision tree

In [73]:
tree_to_code(bestDtr,trainX.columns)

And a quick comparison of the results

In [74]:
dtResults = pd.DataFrame(trainY[0:20])
dtResults["First"] = np.round(dtrModel.predict(trainX.iloc[0:20])*4)/4
dtResults["Tuned"] = np.round(bestDtr.predict(trainX.iloc[0:20])*4)/4
dtResults

There's not a lot of difference between the two models, and they only match the observed rating in a few cases.

### Random Forest

Can we improve on the decision tree using a random forest?

To select the random forest parameters, I'll use a randomised search to get rough estimates, then refine these using a grid search. This is based on the technique and code given in a blog by William Koehrsen (https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)

Note: this cell takes a while to run (>20 minutes)

In [75]:
random.seed(2468)

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split (1.0 means all features; 'auto' is no longer accepted by recent scikit-learn)
max_features = [1.0, 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 100, num = 10)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

# First create the base model to tune
rf = RandomForestRegressor()
# Random search of parameters, using 10 fold cross validation, 
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100,
                               cv = 10, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(trainX, trainY)
bestrf_random = rf_random.best_estimator_
rf_random.best_params_

In [76]:
random.seed(2468)
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [15, 20, 25],
    'max_features': [6, 8, 10],
    'min_samples_leaf': [2],
    'min_samples_split': [10],
    'n_estimators': [800, 1000, 1200]
}
# Create a base model
rf = RandomForestRegressor()
# Instantiate the grid search model
rf_grid = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 10, n_jobs = -1, verbose = 2)
rf_grid.fit(trainX, trainY)
bestrf_grid = rf_grid.best_estimator_
rf_grid.best_params_
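The two-stage search (a random search over a wide range, then a grid search around the winner) can be tried in a few seconds on synthetic data. This is a minimal sketch: the dataset, the parameter ranges and the way the narrow grid is built around the stage-1 result are all assumptions made for illustration, not the settings used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic regression data, small enough to search in seconds
X, y = make_regression(n_samples=120, n_features=6, noise=10.0, random_state=0)

# Stage 1: sample a few combinations from a wide range of values
wide = {"n_estimators": [10, 20, 40], "max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 2, 4]}
coarse = RandomizedSearchCV(RandomForestRegressor(random_state=0), wide,
                            n_iter=5, cv=3, random_state=0)
coarse.fit(X, y)

# Stage 2: exhaustive search over a narrow grid around the stage-1 winner
best_leaf = coarse.best_params_["min_samples_leaf"]
narrow = {"n_estimators": [coarse.best_params_["n_estimators"]],
          "max_depth": [coarse.best_params_["max_depth"]],
          "min_samples_leaf": sorted({max(1, best_leaf - 1), best_leaf, best_leaf + 1})}
fine = GridSearchCV(RandomForestRegressor(random_state=0), narrow, cv=3)
fine.fit(X, y)
print(fine.best_params_)
```

Passing `random_state` to the estimator, rather than calling `random.seed`, is what makes a run like this repeatable, since scikit-learn does not draw from Python's `random` module.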

How does this compare to our decision trees?

In [77]:
dtResults["RF"] = np.round(bestrf_grid.predict(trainX.iloc[0:20])*4)/4
dtResults

This may have helped a little. Again the real test will come later, when I evaluate the models using the test data.

### SVM

Finally, I'll try a support vector machine (SVM) model, again using cross-validation to select the parameters before fitting the training data.

Note: This cell takes over 10 mins to run.

In [78]:
random.seed(97531)
param_grid = {
    'C': [0.01, 0.1, 1.0],
    'epsilon': [0.01, 0.1, 1.0],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'degree': [2, 3, 4],
    'gamma': [0.01, 0.1, 1],
    'coef0': [-1, 0, 1]
}
svr = SVR()
svmGrid = GridSearchCV(estimator = svr, param_grid = param_grid, 
                           cv = 10, n_jobs = -1, verbose = 2)
svmGrid.fit(trainX, trainY)
best_svr = svmGrid.best_estimator_
svmGrid.best_params_

### Test the Models

#### Baseline using Mean Rating

A baseline prediction is to simply use the mean rating of the training data, rounded to the nearest 0.25, as the prediction for each test case.

In [79]:
meanTrain = np.round(trainY.mean()*4)/4
print("Baseline prediction using training mean\nRMSE: {:5.3}".format(math.sqrt(((testY - meanTrain) ** 2).mean())))

So a model has to have RMSE less than 0.494 to be better than guessing
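The rounding and RMSE arithmetic behind this baseline can be checked by hand on a few numbers. In the sketch below the four test ratings are made up and `round_to_quarter` is a hypothetical helper. 3.19 is the overall mean rating quoted earlier, used here as a stand-in for the training mean.

```python
import math
import numpy as np

def round_to_quarter(x):
    """Round to the nearest multiple of 0.25, the step used by the ratings."""
    return np.round(np.asarray(x) * 4) / 4

def rmse(predict, labels):
    """Root mean squared error between predictions and observed values."""
    return math.sqrt(((np.asarray(predict) - np.asarray(labels)) ** 2).mean())

# 3.19 * 4 = 12.76, which rounds to 13, giving a baseline of 3.25
baseline = round_to_quarter(3.19)
toy_test = np.array([3.0, 3.5, 2.75, 3.75])  # made-up ratings
baseline_rmse = rmse(baseline, toy_test)
print(baseline, baseline_rmse)
```

The errors here are 0.25, 0.25, 0.5 and 0.5, so the mean squared error is 0.15625 and the RMSE is about 0.395.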

In [80]:
testResults = pd.DataFrame(["ols","reg","dtr","dtr_tuned","rf","svm"])
testResults.columns = ["model"]
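# Initialise as floats so the RMSE values assigned later keep the column dtype
testResults["test"] = [0.0,0.0,0.0,0.0,0.0,0.0]
testResults["train"] = [0.0,0.0,0.0,0.0,0.0,0.0]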
testResults

In [81]:
def rmse(predict, labels):
    return math.sqrt(((predict-labels)**2).mean())

In [82]:
testResults.loc[0,"test"] = rmse(np.round(olsModel.predict(chocTest)*4)/4, testY)
testResults.loc[0,"train"] = rmse(np.round(olsModel.predict(chocTrain)*4)/4, trainY)
testResults.loc[1,"test"] = rmse(np.round(reg.predict(testX)*4)/4, testY)
testResults.loc[1,"train"] = rmse(np.round(reg.predict(trainX)*4)/4, trainY)
testResults.loc[2,"test"] = rmse(np.round(dtrModel.predict(testX)*4)/4, testY)
testResults.loc[2,"train"] = rmse(np.round(dtrModel.predict(trainX)*4)/4, trainY)
testResults.loc[3,"test"] = rmse(np.round(bestDtr.predict(testX)*4)/4, testY)
testResults.loc[3,"train"] = rmse(np.round(bestDtr.predict(trainX)*4)/4, trainY)
testResults.loc[4,"test"] = rmse(np.round(bestrf_grid.predict(testX)*4)/4, testY)
testResults.loc[4,"train"] = rmse(np.round(bestrf_grid.predict(trainX)*4)/4, trainY)
testResults.loc[5,"test"] = rmse(np.round(best_svr.predict(testX)*4)/4, testY)
testResults.loc[5,"train"] = rmse(np.round(best_svr.predict(trainX)*4)/4, trainY)
testResults

All the models give a slightly better result on the training data than the test data, and so overfit slightly. The random forest model gives the best result on the test data, with an RMSE of 0.443. The default model (guess the mean rating) had an RMSE of 0.494, so we've improved the predictions by about 10%.

### The Predictions

Compare the predicted and actual ratings for the test data ...

In [84]:
results = pd.DataFrame(testY)
results["Predict"] = np.round(bestrf_grid.predict(testX)*4)/4
results["Error"] = np.abs(results.Rating - results.Predict)
results

... and summarise the result by the error.

In [85]:
results[['Error', 'Predict']].groupby(['Error']).count().reset_index()


The model predicted 77 ratings exactly (about 21%), another 133 (37%) were out by 0.25 and 99 (28%) were out by 0.5. Only 13 (less than 4%) were out by a whole rating or more.