# Demo: Scikit-learn

# <u> Problem Statement </u>
You have been provided with a dataset that contains the costs of advertising on different media channels and the corresponding sales of XYZ firm. Evaluate the dataset to:
<ol>
    <li>Find the features or media channels used by the firm. </li>
    <li>Find the sales figures for each channel.</li>
    <li>Create a model to predict the sales outcome. </li>
    <li>Split it into training and testing datasets for the model.</li>
    <li>Calculate the mean squared error (MSE).</li>
</ol>
   

In [49]:
# Import the required libraries
import pandas as pd

In [50]:
# Import the advertising dataset.
# The first column of the file already holds a row index, so index_col = 0
# tells pandas to use that column as the DataFrame index.
df_adv_data = pd.read_csv('Advertising.csv', index_col = 0)
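
To see what `index_col = 0` changes without needing the real file, the sketch below reads a small in-memory CSV twice. The three data rows are taken from the `head()` output of this notebook. The `csv_text` variable, the use of `io.StringIO`, and the assumption that the label column has an empty header in the file are illustrative, not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt laid out like Advertising.csv is assumed to be:
# the first column has no header and holds the row labels.
csv_text = """,TV Ad Budget ($),Radio Ad Budget ($),Newspaper Ad Budget ($),Sales ($)
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
"""

# Without index_col, the label column is read as an ordinary data column
# and pandas names it 'Unnamed: 0'.
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the DataFrame index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.shape)        # one extra column for the labels
print(df_indexed.shape)        # only the four data columns remain
print(list(df_indexed.index))  # the labels are now the index
```

Using the label column as the index keeps it out of the feature matrix later on, which is likely why the notebook sets it here.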

In [51]:
# view top 5 records
df_adv_data.head()

Unnamed: 0,TV Ad Budget ($),Radio Ad Budget ($),Newspaper Ad Budget ($),Sales ($)
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9


Sales figure for each channel

In [52]:
# view the dataset size (total number of elements)
df_adv_data.size

800

In [53]:
# view the shape of the dataset
df_adv_data.shape

(200, 4)
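
The two outputs above are related: `size` is the number of rows times the number of columns, which is why a (200, 4) frame reports 800. A minimal sketch of that relationship, using a made-up 3 x 2 frame (the `df` here is hypothetical and unrelated to the advertising data):

```python
import pandas as pd

# Hypothetical 3 x 2 frame used only to illustrate the relationship.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple
print(df.shape)
print(df.size)             # total number of cells
print(df.size == n_rows * n_cols)
```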

In [54]:
# view the columns of the dataset
df_adv_data.columns

Index(['TV Ad Budget ($)', 'Radio Ad Budget ($)', 'Newspaper Ad Budget ($)',
       'Sales ($)'],
      dtype='object')

This means that there are three features and one response in the dataset.

'TV Ad Budget ($)', 'Radio Ad Budget ($)', 'Newspaper Ad Budget ($)' ==> features (media channels used by the firm)<br>
'Sales ($)' ==> response (sales)

In [55]:
# create a feature object from the three media channel columns
X_feature = df_adv_data[['Newspaper Ad Budget ($)', 'Radio Ad Budget ($)', 'TV Ad Budget ($)']]

In [56]:
# view feature object
X_feature.head()

Unnamed: 0,Newspaper Ad Budget ($),Radio Ad Budget ($),TV Ad Budget ($)
1,69.2,37.8,230.1
2,45.1,39.3,44.5
3,69.3,45.9,17.2
4,58.5,41.3,151.5
5,58.4,10.8,180.8


In [57]:
# Create target object from sales column which is a response in the dataset
Y_target = df_adv_data[['Sales ($)']]

In [58]:
# view the target object
Y_target.head()

Unnamed: 0,Sales ($)
1,22.1
2,10.4
3,9.3
4,18.5
5,12.9


In [59]:
# view the feature object shape
X_feature.shape

(200, 3)

In [60]:
# view target object shape
Y_target.shape

(200, 1)

You can see that all 200 observations are present in both variables. The feature object contains three columns, and the response object contains only one column.
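
The target has shape (200, 1) rather than (200,) because it was selected with double brackets, which returns a DataFrame. The sketch below contrasts the two selection styles on a tiny made-up frame; the `df`, `y_frame`, and `y_series` names are hypothetical.

```python
import pandas as pd

# Hypothetical three-row frame with the same column names as the dataset.
df = pd.DataFrame({
    "TV Ad Budget ($)": [230.1, 44.5, 17.2],
    "Sales ($)": [22.1, 10.4, 9.3],
})

y_frame = df[["Sales ($)"]]  # list of column names -> 2-D DataFrame
y_series = df["Sales ($)"]   # single column name  -> 1-D Series

print(type(y_frame).__name__, y_frame.shape)
print(type(y_series).__name__, y_series.shape)
```

Either form can be passed to scikit-learn as a regression target. With the 2-D form, the fitted `intercept_` and `coef_` come back with an extra dimension, which matches the bracketed output printed later in this notebook.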

In [65]:
# import the train_test_split function
# and split the data into training and testing sets;
# by default 75% of the rows go to training and 25% to testing
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X_feature, Y_target, random_state = 1)


# split as training and testing dataset

In [66]:
#view shape of train and test sets for both feature and response
print(x_train.shape)
print(y_test.shape)
print(x_test.shape)
print(y_train.shape)

(150, 3)
(50, 1)
(50, 3)
(150, 1)
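
The 150/50 split above comes from the default behaviour of `train_test_split`. As a hedged sketch on made-up arrays (the `X` and `y` below are synthetic, not the advertising data), the default can be compared with an explicit `test_size`, and `random_state` can be shown to make the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 200 made-up observations with 2 features each, for illustration only.
X = np.arange(400).reshape(200, 2)
y = np.arange(200)

# No test_size given: 25% of the rows are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Explicit test_size: hold out 20% of the rows instead.
X_tr20, X_te20, y_tr20, y_te20 = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Reusing the same random_state reproduces the same split.
X_tr_again, X_te_again, _, _ = train_test_split(X, y, random_state=1)

print(X_tr.shape, X_te.shape)
print(X_tr20.shape, X_te20.shape)
print(np.array_equal(X_te, X_te_again))
```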


In [67]:
# import the linear regression model

from sklearn.linear_model import LinearRegression

# create an instance of the estimator and fit it on the training data
linreg = LinearRegression()
linreg.fit(x_train, y_train)
# the fitted linear regression model can now predict the sales outcome for new data

In [68]:
# print the intercept and coefficients of the fitted linear model using its intercept_ and coef_ attributes
print(linreg.intercept_)
print(linreg.coef_)

[2.87696662]
[[0.00345046 0.17915812 0.04656457]]
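
The coefficients are listed in the same order as the feature columns, so here they correspond to Newspaper, Radio, and TV in that order. A prediction is the intercept plus the coefficient-weighted sum of the features. The sketch below checks that on made-up data that follows a known linear rule; `X`, `y`, `model`, and `manual` are hypothetical names, and the data is not the advertising dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 1 + 2*x1 + 3*x2 exactly.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

model = LinearRegression().fit(X, y)

# Rebuild the predictions by hand: intercept + X . coefficients.
manual = model.intercept_ + X @ model.coef_

print(model.intercept_, model.coef_)
print(np.allclose(manual, model.predict(X)))
```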


In [69]:
# Now it's time to predict the response using the test dataset.
# This is done by calling the model's predict method and passing the test data to it.
y_pred = linreg.predict(x_test)
y_pred

array([[21.70910292],
       [16.41055243],
       [ 7.60955058],
       [17.80769552],
       [18.6146359 ],
       [23.83573998],
       [16.32488681],
       [13.43225536],
       [ 9.17173403],
       [17.333853  ],
       [14.44479482],
       [ 9.83511973],
       [17.18797614],
       [16.73086831],
       [15.05529391],
       [15.61434433],
       [12.42541574],
       [17.17716376],
       [11.08827566],
       [18.00537501],
       [ 9.28438889],
       [12.98458458],
       [ 8.79950614],
       [10.42382499],
       [11.3846456 ],
       [14.98082512],
       [ 9.78853268],
       [19.39643187],
       [18.18099936],
       [17.12807566],
       [21.54670213],
       [14.69809481],
       [16.24641438],
       [12.32114579],
       [19.92422501],
       [15.32498602],
       [13.88726522],
       [10.03162255],
       [20.93105915],
       [ 7.44936831],
       [ 3.64695761],
       [ 7.22020178],
       [ 5.9962782 ],
       [18.43381853],
       [ 8.39408045],
       [14

You can see that it returns the predicted values in an array. You can always verify them against the actual values to check how accurate your model is.
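
One simple way to do that check is to put actual and predicted values side by side. This is a minimal sketch with made-up numbers; the `actual`, `predicted`, and `comparison` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical actual and predicted sales, for illustration only.
actual = np.array([22.1, 10.4, 9.3, 18.5])
predicted = np.array([21.7, 11.0, 9.0, 17.8])

# One row per observation, with the prediction error in the last column.
comparison = pd.DataFrame({"actual": actual, "predicted": predicted})
comparison["error"] = comparison["actual"] - comparison["predicted"]

print(comparison)
```

In this notebook the same table could presumably be built from `y_test.values.ravel()` and `y_pred.ravel()`, since both are two-dimensional with a single column.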

A better way to test your model's accuracy is to calculate its mean squared error (MSE), or its square root, the root mean squared error (RMSE).<br>
You must first import the NumPy library and the metrics module from scikit-learn.

In [72]:
# import the required libraries for calculating the MSE (mean squared error)
from sklearn import metrics
import numpy as np
# apply the built-in mean_squared_error function of metrics to the test response and the predictions, then take the square root with NumPy's sqrt function

In [74]:
# Calculate the root mean squared error (RMSE), i.e. the square root of the MSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

1.4046514230328935


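This gives you the root mean squared error (RMSE) of the model on the test data, expressed in the same units as the sales column; squaring it gives the MSE. You can use it to judge the accuracy of the model: the lower the value, the closer the predictions are to the actual sales.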
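
The cell above wraps `metrics.mean_squared_error` in `np.sqrt`, so the printed number is the square root of the MSE. To make the two quantities concrete, the sketch below computes both on three made-up values and checks the library result against the formula written out by hand. `y_true` and `y_hat` are hypothetical arrays.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted values; the errors are 1, 0 and -2.
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# MSE: the mean of the squared errors.
mse = metrics.mean_squared_error(y_true, y_hat)

# RMSE: the square root of the MSE, in the same units as the target.
rmse = np.sqrt(mse)

# The same MSE written out by hand.
mse_manual = np.mean((y_true - y_hat) ** 2)

print(mse, rmse)
print(np.isclose(mse, mse_manual))
```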