# Introduction to Scikit-Learn - Answers

You have been contracted by Google to predict which apps are going to be successful.

Using the Google Play app store dataset from the previous exercise, you will predict the number of times an app is installed.

## Loading the dataset

In [1]:
# Import the pandas and NumPy libraries
import pandas as pd
import numpy as np

In [2]:
# Load the updated Google Play app store dataset saved from the previous exercise
df = pd.read_pickle('data.pkl')
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19000000.0,"10,000+",Free,0.0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14000000.0,"500,000+",Free,0.0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8700000.0,"5,000,000+",Free,0.0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25000000.0,"50,000,000+",Free,0.0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2800000.0,"100,000+",Free,0.0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


## Preprocessing

Selecting the columns that we will include in the model

In [3]:
# Select the features that we want to keep
df = df.loc[:, ['Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Price', 'Content Rating']]

Some preliminary data cleaning

In [4]:
# Drop missing values
df = df.dropna().reset_index()

# Convert the 'Reviews' column to float
df.loc[:, 'Reviews'] = df.loc[:, 'Reviews'].astype('float')

# Remove the '+' signs and thousands separators from the 'Installs' column and convert it to float
# (regex=False treats '+' as a literal character rather than a regex quantifier)
df.loc[:, 'Installs'] = df.loc[:, 'Installs'].str.replace('+', '', regex=False)
df.loc[:, 'Installs'] = df.loc[:, 'Installs'].str.replace(',', '', regex=False).astype('float')


Binning the Size variable into three groups: small, medium, large (0, 1, 2)

In [5]:
# Import KBinsDiscretizer from sklearn (to install sklearn: conda install scikit-learn)
from sklearn.preprocessing import KBinsDiscretizer

# Instantiate the estimator
kbd = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')

# Fit the estimator to the data
data = df.loc[:, 'Size'].values.reshape(-1, 1)
kbd.fit(data)

# Transform the data
data_binned = kbd.transform(data)

# Transform the binned data to a DataFrame and join it to df
SizeBins = pd.DataFrame(data_binned, columns=['Size Bins'])
df = df.join(SizeBins).drop(columns=['Size', 'index'])
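
To see what the quantile strategy does, here is a minimal sketch on a handful of made-up app sizes (the values and the names `sizes` and `kbd_demo` are illustrative assumptions, not part of the dataset). With `strategy='quantile'`, the bin edges are chosen so that each bin receives roughly the same number of observations, and the learned edges can be inspected through the `bin_edges_` attribute.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical app sizes in bytes (illustrative values only)
sizes = np.array([1e6, 2e6, 3e6, 10e6, 20e6, 30e6, 50e6, 80e6, 100e6]).reshape(-1, 1)

# Same settings as in the cell above: three ordinal bins with quantile-based edges
kbd_demo = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
bins = kbd_demo.fit_transform(sizes).ravel()

# Edges learned for the single feature (n_bins + 1 values)
print(kbd_demo.bin_edges_[0])

# Number of observations that fell into each bin
print(np.bincount(bins.astype(int)))
```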

Encoding categorical features as a one-hot numeric array. This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models.

In [6]:
# Convert categorical variables ('Category' and 'Content Rating') into dummy variables (one-hot encoding)
df_num = pd.get_dummies(df, columns=['Category', 'Content Rating'])

In [7]:
# Check that you have the same number of rows but more columns due to one-hot encoding
print('DataFrame shape before one-hot encoding: {}'.format(df.shape))
print('DataFrame shape after one-hot encoding: {}'.format(df_num.shape))

DataFrame shape before one-hot encoding: (7729, 7)
DataFrame shape after one-hot encoding: (7729, 44)
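
The jump from 7 to 44 columns follows from how one-hot encoding works: each encoded column is replaced by one indicator column per distinct value. The toy example below illustrates this on a hypothetical two-column DataFrame (`toy` and its values are invented for illustration).

```python
import pandas as pd

# Hypothetical mini-dataset with one categorical and one numeric column
toy = pd.DataFrame({
    'Category': ['GAME', 'TOOLS', 'GAME'],
    'Rating': [4.5, 3.9, 4.1],
})

# 'Category' is replaced by one indicator column per distinct value
toy_num = pd.get_dummies(toy, columns=['Category'])

print(toy_num.columns.tolist())
print(toy.shape, toy_num.shape)
```

Here the row count is unchanged and the single `Category` column becomes two indicator columns, so the frame grows from 2 to 3 columns.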


Creating the features and target variables

In [8]:
# Create arrays for features and target variable
y = df_num['Installs']
X = df_num.drop(columns=['Installs'])

Centering and scaling data (i.e. zero mean and unit variance)

In [9]:
# Import scale from sklearn
from sklearn.preprocessing import scale

# Scale the features:
X_scaled = scale(X)

# Print the mean and standard deviation of the scaled features
print("Mean of Scaled Features: {}".format(np.mean(X_scaled))) 
print("Standard Deviation of Scaled Features: {}".format(np.std(X_scaled)))

Mean of Scaled Features: 7.05524956749521e-19
Standard Deviation of Scaled Features: 1.0
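
The check above averages over the whole array. Since `scale` standardizes each column separately, a per-feature check with `axis=0` is arguably more informative. The sketch below uses randomly generated data (`X_demo` is an assumption for illustration) with very different column means and spreads.

```python
import numpy as np
from sklearn.preprocessing import scale

# Synthetic features with different means and spreads (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[10.0, -5.0, 100.0], scale=[1.0, 20.0, 0.5], size=(200, 3))

# Standardize each column to zero mean and unit variance
X_demo_scaled = scale(X_demo)

# Per-column statistics: every mean should be ~0 and every standard deviation ~1
print(X_demo_scaled.mean(axis=0))
print(X_demo_scaled.std(axis=0))
```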


Splitting data into training and test sets using sklearn's train_test_split

In [13]:
# Import train_test_split from sklearn
from sklearn.model_selection import train_test_split

# Create training and test sets: 70/30 split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
X_train.shape
# y_train.shape

(5410, 43)
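
As a small illustration of what `train_test_split` returns, the sketch below splits a hypothetical 10-row array (`X_demo` and `y_demo` are made up so that each target can be traced back to its feature row). It shows that the rows are shuffled, that features and targets stay aligned, and that `test_size=0.3` reserves 30% of the rows for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: row i of X_demo is [2*i, 2*i + 1] and its target is i
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Same call pattern as above: 70/30 split with a fixed seed for reproducibility
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

print(X_tr.shape, X_te.shape)

# Features and targets remain aligned after shuffling
print(np.array_equal(X_tr[:, 0], 2 * y_tr))
```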

## Model Training

All supervised estimators in sklearn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y.
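
The fit/predict pattern can be sketched on a tiny synthetic problem before applying it to the app data. The data below is an assumption for illustration: it follows y = 2x + 1 exactly, so a linear model should recover the relationship.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2x + 1 (illustrative only)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 2.0 * X_demo.ravel() + 1.0

# fit(X, y) learns the model parameters from labeled observations
model = LinearRegression()
model.fit(X_demo, y_demo)

# predict(X) returns predictions for new, unlabeled observations
pred = model.predict(np.array([[4.0]]))
print(pred)  # close to 9.0, since 2 * 4 + 1 = 9
```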

### Linear Regression

In [11]:
# Import LinearRegression from sklearn
from sklearn.linear_model import LinearRegression

# Instantiate the regressor
reg_lin = LinearRegression()

# Fit the regressor to the training data
reg_lin.fit(X_train, y_train)

# Predict on the test data
y_pred_lin = reg_lin.predict(X_test)

### Random Forest

In [12]:
# Import RandomForestRegressor from sklearn
from sklearn.ensemble import RandomForestRegressor

# Instantiate the regressor with default value for hyperparameters
reg_forest = RandomForestRegressor(random_state=42)

# Fit the regressor to the training data
reg_forest.fit(X_train, y_train)

# Predict on the test data
y_pred_forest = reg_forest.predict(X_test)

## Model Tuning

### Hyperparameter tuning with GridSearchCV

GridSearchCV performs an exhaustive search over specified hyperparameter values for an estimator. The hyperparameters of the estimator are optimized by cross-validated grid-search over a parameter grid.

In [13]:
# Import GridSearchCV from sklearn
from sklearn.model_selection import GridSearchCV

# Setup the hyperparameter grid
param_grid = {'n_estimators': [50, 100, 150],
              'max_depth': [10, 20, 30]}

# Instantiate the regressor
reg_rf = RandomForestRegressor(random_state=42)

# Instantiate the GridSearchCV object
reg_cv = GridSearchCV(reg_rf, param_grid, cv=2)

# Fit the regressor to the training data
reg_cv.fit(X_train, y_train)

# Print the tuned hyperparameters
print("Tuned Hyperparameters: {}".format(reg_cv.best_params_)) 

# Predict on the test data
y_pred_cv = reg_cv.predict(X_test)

Tuned Hyperparameters: {'max_depth': 20, 'n_estimators': 150}
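
Beyond `best_params_`, a fitted GridSearchCV object keeps the cross-validated score of every combination in `cv_results_`, which helps judge how much the hyperparameters actually matter. The sketch below is a scaled-down version of the search above on synthetic data (the dataset from `make_regression` and the smaller grid are assumptions chosen to keep it fast).

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (illustrative only)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# A 2 x 2 grid yields 4 candidate models, each evaluated with 2-fold cross-validation
grid = {'n_estimators': [10, 20], 'max_depth': [2, 5]}
search = GridSearchCV(RandomForestRegressor(random_state=0), grid, cv=2)
search.fit(X_demo, y_demo)

# One row per hyperparameter combination, with its mean cross-validated R^2
results = pd.DataFrame(search.cv_results_)
print(results[['param_n_estimators', 'param_max_depth', 'mean_test_score', 'rank_test_score']])

# The best combination is the one with the highest mean cross-validated score
print(search.best_params_, search.best_score_)
```

If the scores in `mean_test_score` are nearly identical across rows, the grid is unlikely to yield a meaningful gain over the defaults.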


## Model Evaluation

Let's compare the coefficient of determination R^2 across the models. Every fitted sklearn regressor has a score(X, y) method that returns R^2 on the data it is given.


In [14]:
# Compare R^2 of the three models
print("R^2 of Linear Regression: {}".format(reg_lin.score(X_test, y_test)))
print("R^2 of Random Forest: {}".format(reg_forest.score(X_test, y_test)))
print("R^2 of Random Forest with GridSearchCV: {}".format(reg_cv.score(X_test, y_test)))

R^2 of Linear Regression: 0.24038261989970522
R^2 of Random Forest: 0.9183885575782863
R^2 of Random Forest with GridSearchCV: 0.9157273103591018
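
For reference, R^2 is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below computes it by hand on a few made-up numbers (`y_true` and `y_hat` are illustrative assumptions) and checks the result against sklearn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions (illustrative only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# Residual sum of squares: squared errors of the predictions
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares: squared deviations from the mean of the true values
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_hat))
```

A value of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean.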


Now, let's compare the models' Root Mean Square Error by using sklearn's mean_squared_error function

In [15]:
# Import mean_squared_error from sklearn
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Compare RMSE of the three models
rmse_lin = np.sqrt(mean_squared_error(y_test, y_pred_lin))
rmse_forest = np.sqrt(mean_squared_error(y_test, y_pred_forest))
rmse_cv = np.sqrt(mean_squared_error(y_test, y_pred_cv))
print("RMSE of Linear Regression: {}".format(rmse_lin))
print("RMSE of Random Forest: {}".format(rmse_forest))
print("RMSE of Random Forest with GridSearchCV: {}".format(rmse_cv))

RMSE of Linear Regression: 48277003.91561354
RMSE of Random Forest: 15824077.09956474
RMSE of Random Forest with GridSearchCV: 16080009.126974687
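
RMSE is expressed in the units of the target (here, installs) and weights large errors more heavily than the mean absolute error (MAE), which is imported above but not used. The sketch below compares the two on made-up numbers (`y_true` and `y_hat` are illustrative assumptions).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Hypothetical true values and predictions (illustrative only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# RMSE: square root of the average squared error
rmse = np.sqrt(mean_squared_error(y_true, y_hat))

# MAE: average absolute error
mae = mean_absolute_error(y_true, y_hat)

print(rmse, mae)
```

RMSE is never smaller than MAE, and the gap widens when a few predictions are far off, which is worth keeping in mind with a heavily skewed target such as install counts.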



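As you can see, both random forests far outperform linear regression (R^2 of about 0.92 versus 0.24). On this test set, however, the forest tuned with GridSearchCV scores marginally below the default forest (R^2 of 0.9157 versus 0.9184, RMSE of about 16.1 million versus 15.8 million installs), so this small grid with cv=2 did not improve on the default hyperparameters.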