In [23]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('Advertising.csv')

In [3]:
df

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9
...,...,...,...,...
195,38.2,3.7,13.8,7.6
196,94.2,4.9,8.1,9.7
197,177.0,9.3,6.4,12.8
198,283.6,42.0,66.2,25.5


In [4]:
df.describe()

Unnamed: 0,TV,radio,newspaper,sales
count,200.0,200.0,200.0,200.0
mean,147.0425,23.264,30.554,14.0225
std,85.854236,14.846809,21.778621,5.217457
min,0.7,0.0,0.3,1.6
25%,74.375,9.975,12.75,10.375
50%,149.75,22.9,25.75,12.9
75%,218.825,36.525,45.1,17.4
max,296.4,49.6,114.0,27.0


What we would like to be able to do is this: given an upcoming advertising campaign with a certain spend on TV, radio and newspaper, predict what sales we should expect for that particular spending.

We've already visualized this data many times, so we won't go into visualization here.

## Data preparation

In [5]:
X = df.drop('sales',axis=1)
y = df['sales']

In [6]:
X.head()

Unnamed: 0,TV,radio,newspaper
0,230.1,37.8,69.2
1,44.5,39.3,45.1
2,17.2,45.9,69.3
3,151.5,41.3,58.5
4,180.8,10.8,58.4


In [7]:
y.head()

0    22.1
1    10.4
2     9.3
3    18.5
4    12.9
Name: sales, dtype: float64

### Train|Validation|Holdout_test split
70%|15%|15%

1st split - 70%|30% (train | everything else)

2nd split - 50%|50% of the 30% set from the previous step (validation | holdout test)

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
# 1st split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3 , random_state=101)

In [10]:
# 2nd split
X_validation, X_holdout_test, y_validation, y_holdout_test = train_test_split(X_test,y_test, test_size=0.5 , random_state=101)

In [12]:
X_train.shape

(140, 3)

In [13]:
X_validation.shape

(30, 3)

In [14]:
X_holdout_test.shape

(30, 3)
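The two-step split above can be wrapped in a small helper. This is only a sketch: the function name `train_val_holdout_split` and the synthetic `X_demo`/`y_demo` arrays are assumptions made for illustration, standing in for the advertising data so the cell runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def train_val_holdout_split(X, y, val_size=0.15, holdout_size=0.15, random_state=101):
    """Split X, y into train / validation / holdout sets in two steps."""
    # Step 1: carve off everything that is not training data.
    rest_size = val_size + holdout_size
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=rest_size, random_state=random_state
    )
    # Step 2: divide the remainder between validation and holdout.
    holdout_fraction = holdout_size / rest_size
    X_val, X_hold, y_val, y_hold = train_test_split(
        X_rest, y_rest, test_size=holdout_fraction, random_state=random_state
    )
    return X_train, X_val, X_hold, y_train, y_val, y_hold


# Synthetic stand-in with the same shape as the advertising features (200 rows, 3 columns).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 300, size=(200, 3))
y_demo = rng.uniform(1, 27, size=200)

splits = train_val_holdout_split(X_demo, y_demo)
print([part.shape for part in splits[:3]])
```

With 200 rows this should reproduce the 140 / 30 / 30 shapes shown above.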

**We could also think about scaling the data, depending on which algorithm we'll be using.**

**For algorithms that are sensitive to feature scale, such as regularized linear regression, it is a good idea to scale the data.**

**But we'll be using a Random Forest Regressor, which is tree-based, so scaling is not a big deal here.**

# Model Training

In [15]:
from sklearn.ensemble import RandomForestRegressor

In [16]:
model = RandomForestRegressor(n_estimators=3, random_state=101)

We are purposely choosing a small number of estimators here so that we can see the effect of hyperparameter tuning later.

In [17]:
model.fit(X_train,y_train)

RandomForestRegressor(n_estimators=3, random_state=101)

### Evaluation

In [18]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [19]:
validation_predictions = model.predict(X_validation)

In [20]:
validation_predictions

array([14.43333333,  6.46666667,  5.1       , 15.16666667, 11.4       ,
        9.96666667, 11.63333333, 12.1       , 19.        ,  7.03333333,
       12.43333333, 21.9       , 13.13333333,  7.2       , 11.7       ,
        7.56666667, 14.26666667, 12.6       , 11.16666667,  7.9       ,
       12.93333333, 21.2       , 19.66666667, 15.76666667, 15.9       ,
       25.06666667, 20.4       ,  9.83333333, 14.56666667, 19.66666667])

In [21]:
mean_absolute_error(y_validation, validation_predictions) # MAE, compare against the mean of sales

0.853333333333333

In [24]:
np.sqrt(mean_squared_error(y_validation, validation_predictions)) # RMSE, compare against the std of sales

1.1031268688998959
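To make explicit what the two metrics measure, here is a minimal hand-written sketch of both. The helper names `mae` and `rmse` and the three-value example are made up for illustration; in the notebook itself we keep using the scikit-learn functions.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors, in the units of y."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: like MAE, but large errors are penalised more."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Made-up values purely for illustration.
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]
print(mae(y_true, y_pred))   # 1.0
print(rmse(y_true, y_pred))  # square root of 5/3
```

Because the errors are squared before averaging, the single error of 2 pulls the RMSE above the MAE.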

We can now compare this MAE with the mean of our dataset and the RMSE with the standard deviation of our dataset.

In [25]:
df.describe()['sales'] # MAE-0.85, RMSE-1.10

count    200.000000
mean      14.022500
std        5.217457
min        1.600000
25%       10.375000
50%       12.900000
75%       17.400000
max       27.000000
Name: sales, dtype: float64
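The comparison can also be written as two ratios. The numbers below are copied from the outputs above; the variable names are just for illustration, and the "better than predicting the mean" reading is a rough rule of thumb, since the standard deviation here is for the whole dataset rather than the validation rows.

```python
# Values copied from the outputs above.
mae_val, rmse_val = 0.853333333333333, 1.1031268688998959
sales_mean, sales_std = 14.0225, 5.217457

# Typical error as a fraction of average sales.
relative_mae = mae_val / sales_mean
# Roughly: a ratio well below 1 suggests the model beats always predicting the mean.
rmse_vs_std = rmse_val / sales_std

print(f"MAE is {relative_mae:.1%} of mean sales")
print(f"RMSE is {rmse_vs_std:.1%} of the sales standard deviation")
```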

Let's imagine we are not satisfied with these results and want to reduce the error further. We can then perform hyperparameter tuning.

In [29]:
model = RandomForestRegressor(n_estimators=30, random_state=101)
model.fit(X_train, y_train)

RandomForestRegressor(n_estimators=30, random_state=101)

In [30]:
validation_predictions2 = model.predict(X_validation)

In [31]:
# MAE
mean_absolute_error(y_validation, validation_predictions2) # compare against the mean of sales

0.6575555555555552

In [32]:
np.sqrt(mean_squared_error(y_validation, validation_predictions2)) # RMSE, compare against the std of sales

0.8542009478215644

So we see an improvement on the validation set after moving to 30 estimators: MAE drops from about 0.85 to 0.66 and RMSE from about 1.10 to 0.85.
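Instead of changing `n_estimators` by hand, the same comparison can be run in a loop over several candidate values, scoring each one on the validation set only. This is a sketch on synthetic stand-in data (an assumption, so the cell runs without `Advertising.csv`); the candidate list `[3, 10, 30, 100]` is arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): three spend columns and a noisy linear target.
rng = np.random.default_rng(101)
X_demo = rng.uniform(0, 300, size=(200, 3))
y_demo = 0.05 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(0, 1, size=200)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

results = {}
for n in [3, 10, 30, 100]:
    candidate = RandomForestRegressor(n_estimators=n, random_state=101)
    candidate.fit(X_tr, y_tr)
    preds = candidate.predict(X_val)
    # Store (MAE, RMSE) on the validation set for this setting.
    results[n] = (
        mean_absolute_error(y_val, preds),
        np.sqrt(mean_squared_error(y_val, preds)),
    )

for n, (mae_n, rmse_n) in results.items():
    print(f"n_estimators={n:>3}  MAE={mae_n:.3f}  RMSE={rmse_n:.3f}")
```

The holdout set is deliberately not touched inside this loop; it is kept for the single final check.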

In [33]:
df.describe()['sales']

count    200.000000
mean      14.022500
std        5.217457
min        1.600000
25%       10.375000
50%       12.900000
75%       17.400000
max       27.000000
Name: sales, dtype: float64

Suppose we are satisfied with this performance. We now want to truly test the model's performance on the holdout test set.

# Final performance on the Holdout set

In [34]:
holdout_predictions = model.predict(X_holdout_test)

In [35]:
# MAE
mean_absolute_error(y_holdout_test, holdout_predictions) # compare against the mean of sales

0.5937777777777775

In [36]:
np.sqrt(mean_squared_error(y_holdout_test, holdout_predictions)) # RMSE, compare against the std of sales

0.745323693040418

#### We are performing even better on the holdout set.
What does this mean? Taking the validation and holdout scores together as a rough range (each set has only 30 rows):
1. We should expect an MAE of roughly 0.59-0.66 on new data.
2. We should expect an RMSE of roughly 0.75-0.85 on new data.

### Remember: once we have tested on the holdout set, we should not go back and tune the model again.

# Setting up the final model

In [37]:
final_model = RandomForestRegressor(n_estimators=30, random_state=101)

In [38]:
final_model.fit(X,y)

RandomForestRegressor(n_estimators=30, random_state=101)

If you previously scaled your data, then you have to fit the scaler again on the full X and transform X before calling final_model.fit.
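One way to avoid forgetting the rescaling step is to bundle the scaler and the model into a single scikit-learn pipeline, so both are refit together. This notebook does not scale anything, so the sketch below is purely illustrative: `X_full`, `y_full` and `final_pipeline` are assumed names, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the full feature matrix X and target y (assumption).
rng = np.random.default_rng(0)
X_full = rng.uniform(0, 300, size=(200, 3))
y_full = 0.05 * X_full[:, 0] + 0.1 * X_full[:, 1] + rng.normal(0, 1, size=200)

# The scaler is refit on all rows every time the pipeline is fit,
# so the final model and its scaling always come from the same data.
final_pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=30, random_state=101),
)
final_pipeline.fit(X_full, y_full)

# New rows are scaled with the stored statistics before prediction.
print(final_pipeline.predict(X_full[:2]))
```

A fitted pipeline is a single object, so it can be saved and loaded the same way as a bare model.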

# Saving the model

In [1]:
import joblib 

In [40]:
joblib.dump(final_model,'final_model.pkl')

['final_model.pkl']

You can save almost any Python object as a pickle file this way, which is very useful.

#### It is often a really good idea, especially when the model will sit behind an API handling POST and GET requests, to save the feature column names as well, as a list.

In [41]:
X.columns

Index(['TV', 'radio', 'newspaper'], dtype='object')

In [42]:
list(X.columns)

['TV', 'radio', 'newspaper']

This is something we may need when dealing with JSON data or just setting up a DataFrame in the future.

We could technically make it work without saving the column names, but having them helps us a lot.

In [43]:
joblib.dump(list(X.columns),'col_names.pkl')

['col_names.pkl']
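As a small illustration of why the saved list helps with JSON data: incoming fields can arrive in any order, and the column list tells us how to arrange them for the model. The helper name `payload_to_row` and the example request body are hypothetical; the column list is written out here instead of loaded so the cell runs on its own.

```python
import json

# Column order as saved above; in practice this would come from joblib.load('col_names.pkl').
col_names = ['TV', 'radio', 'newspaper']


def payload_to_row(raw_json, columns):
    """Turn a JSON object into a feature row ordered like the training columns."""
    payload = json.loads(raw_json)
    missing = [c for c in columns if c not in payload]
    if missing:
        raise ValueError(f"missing features: {missing}")
    return [float(payload[c]) for c in columns]


# Keys arrive in a different order than the model expects.
request_body = '{"newspaper": 69.2, "TV": 230.1, "radio": 37.8}'
print(payload_to_row(request_body, col_names))  # [230.1, 37.8, 69.2]
```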

# Loading the model

Let's confirm everything worked by loading the model.

In [44]:
new_columns = joblib.load('col_names.pkl')

In [45]:
new_columns

['TV', 'radio', 'newspaper']

In [46]:
loaded_model = joblib.load('final_model.pkl')

In [49]:
# Predict any new data
loaded_model.predict([[230.1,37.8,69.2]])

array([21.99])
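A slightly stricter check is to confirm that a loaded model gives the same predictions as the one that was saved, and to pass new rows as a DataFrame with the saved column names so the feature names match what the model was fit on. The sketch below uses synthetic data and a temporary file (both assumptions) rather than the notebook's `final_model.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption) with the same column names as the notebook.
cols = ['TV', 'radio', 'newspaper']
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.uniform(0, 300, size=(50, 3)), columns=cols)
y_demo = 0.05 * X_demo['TV'] + 0.1 * X_demo['radio']

model_demo = RandomForestRegressor(n_estimators=30, random_state=101).fit(X_demo, y_demo)

# Save to a throwaway location and load it straight back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_model.pkl')
    joblib.dump(model_demo, path)
    restored = joblib.load(path)

# Wrapping the new row in a DataFrame keeps the feature names the model was fit with.
campaign = pd.DataFrame([[230.1, 37.8, 69.2]], columns=cols)
same = np.allclose(model_demo.predict(campaign), restored.predict(campaign))
print(same)
```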

So this is really useful for us, especially if we're able to code in Python.

But eventually you would want this to be accessible to someone who doesn't know how to load things with joblib.load in Python.

And that's where we need to convert this into an API.

We are almost at the stage where we can hand this off to somebody and say: here's the API, or here's the persisted model.

We close that gap by using Flask.

That way, any web developer who knows how to make POST and GET requests to an API can connect to this model.
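As a rough preview of what that could look like, here is a minimal Flask sketch with a single POST endpoint. Everything specific in it is an assumption: the `/predict` route name, the JSON field layout, and the small model trained on synthetic data, which stands in for `joblib.load('final_model.pkl')` so that the cell is self-contained.

```python
import numpy as np
import pandas as pd
from flask import Flask, jsonify, request
from sklearn.ensemble import RandomForestRegressor

# In a real service these two objects would come from
# joblib.load('final_model.pkl') and joblib.load('col_names.pkl').
# A tiny model on synthetic data (assumption) keeps the sketch self-contained.
col_names = ['TV', 'radio', 'newspaper']
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.uniform(0, 300, size=(50, 3)), columns=col_names)
model = RandomForestRegressor(n_estimators=10, random_state=101)
model.fit(X_demo, 0.05 * X_demo['TV'] + 0.1 * X_demo['radio'])

app = Flask(__name__)


@app.route('/predict', methods=['POST'])  # '/predict' is a hypothetical route name
def predict():
    payload = request.get_json(silent=True) or {}
    missing = [c for c in col_names if c not in payload]
    if missing:
        return jsonify({'error': f'missing features: {missing}'}), 400
    # Order the incoming values like the training columns.
    row = pd.DataFrame([[payload[c] for c in col_names]], columns=col_names)
    return jsonify({'sales': float(model.predict(row)[0])})


# Exercise the endpoint in-process with Flask's test client (no server needed).
client = app.test_client()
response = client.post(
    '/predict', json={'TV': 230.1, 'radio': 37.8, 'newspaper': 69.2}
)
print(response.status_code, response.get_json())
```

The saved column names do the real work here: they let the endpoint reject incomplete requests and put the values in the order the model expects.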