Problem Statement

You have been provided with a dataset that contains the costs of advertising on different media channels and the corresponding sales of XYZ firm. Evaluate the dataset to:

1. Find the features or media channels used by the firm
2. Find the sales figures for each channel 
3. Create a model to predict the sales outcome
4. Split the dataset into training and testing datasets for the model
5. Calculate the mean squared error (MSE)

In [1]:
# Import the required libraries
import pandas as pd

In [2]:
# Import the advertising dataset as pandas dataframe
adv_df = pd.read_csv("datasets/Advertising.csv", index_col=0)
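
As a small illustration of what `index_col=0` does in the call above, the sketch below reads a tiny in-memory CSV whose first column is an unnamed row counter. The inline data and the shortened column names are made up for the example; they only mimic the layout that Advertising.csv appears to have.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the first (unnamed) column is a row counter.
csv_text = ",TV,Radio,Sales\n1,230.1,37.8,22.1\n2,44.5,39.3,10.4\n"

# Without index_col, the counter is read as a regular column named "Unnamed: 0".
raw_df = pd.read_csv(io.StringIO(csv_text))
print(list(raw_df.columns))

# With index_col=0, the counter becomes the row index instead of a data column.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed_df.columns))
print(list(indexed_df.index))
```

Reading the counter as the index keeps it out of the feature columns, so it cannot be mistaken for a media channel later on.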

In [3]:
# view top 5 records
adv_df.head()

Unnamed: 0,TV Ad Budget ($),Radio Ad Budget ($),Newspaper Ad Budget ($),Sales ($)
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9


In [5]:
# view size of the data
adv_df.size

800

In [6]:
# view shape
adv_df.shape

(200, 4)

In [7]:
# view columns of the dataset
adv_df.columns

Index(['TV Ad Budget ($)', 'Radio Ad Budget ($)', 'Newspaper Ad Budget ($)',
       'Sales ($)'],
      dtype='object')

Thus we have three feature columns (the media channels used by the firm) and one response column (sales).

In [8]:
# create a feature object from the columns
feature_obj = adv_df[["TV Ad Budget ($)", "Radio Ad Budget ($)", "Newspaper Ad Budget ($)"]]

In [9]:
# view feature object
feature_obj.head()

Unnamed: 0,TV Ad Budget ($),Radio Ad Budget ($),Newspaper Ad Budget ($)
1,230.1,37.8,69.2
2,44.5,39.3,45.1
3,17.2,45.9,69.3
4,151.5,41.3,58.5
5,180.8,10.8,58.4


In [10]:
# create a target object from the sales column
target_obj = adv_df[['Sales ($)']]

In [11]:
target_obj.head()

Unnamed: 0,Sales ($)
1,22.1
2,10.4
3,9.3
4,18.5
5,12.9


One way to check that every row and column was captured is to compare the shapes of the two objects

In [12]:
print(feature_obj.shape)
print(target_obj.shape)

(200, 3)
(200, 1)
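
The target shape is (200, 1) rather than (200,) because the column was selected with double brackets. A minimal sketch of the difference, using a hypothetical three-row frame in place of adv_df:

```python
import pandas as pd

# Hypothetical three-row frame standing in for adv_df.
demo_df = pd.DataFrame({"TV": [230.1, 44.5, 17.2], "Sales": [22.1, 10.4, 9.3]})

target_frame = demo_df[["Sales"]]   # list of column names -> DataFrame
target_series = demo_df["Sales"]    # single column name -> Series

print(target_frame.shape)   # two-dimensional: (rows, 1)
print(target_series.shape)  # one-dimensional: (rows,)
```

Either form can be passed to scikit-learn as a target. The two-dimensional form is the reason the fitted intercept and coefficients later print inside an extra pair of brackets.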


Split the data into training and testing sets
By default, train_test_split uses 75% of the rows for training
and holds out the remaining 25% for testing

In [16]:
# import libraries
from sklearn.model_selection import train_test_split

In [17]:
f_train, f_test, t_train, t_test = train_test_split(feature_obj, target_obj, random_state=1)

In [18]:
print(f_train.shape)
print(f_test.shape)
print(t_train.shape)
print(t_test.shape)

(150, 3)
(50, 3)
(150, 1)
(50, 1)
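
The 150/50 split comes from the default behaviour of `train_test_split` when no `test_size` is given. The sketch below reproduces those shapes on synthetic arrays (not the advertising data) and shows how an explicit `test_size` changes them; the 0.2 value is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with the same row count as the advertising data.
features = np.arange(600).reshape(200, 3)
target = np.arange(200).reshape(200, 1)

# Default split: test_size is not given, so 25% of the rows are held out.
f_tr, f_te, t_tr, t_te = train_test_split(features, target, random_state=1)
print(f_tr.shape, f_te.shape)

# Explicit split: hold out 20% of the rows instead.
f_tr20, f_te20, t_tr20, t_te20 = train_test_split(
    features, target, test_size=0.2, random_state=1
)
print(f_tr20.shape, f_te20.shape)
```

Fixing `random_state` makes the shuffle repeatable, so the same rows land in the test set on every run.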


Create a linear regression model that predicts the sales outcome for new data

In [19]:
from sklearn.linear_model import LinearRegression

In [20]:
# create an instance of the LinearRegression estimator
linreg = LinearRegression()

In [21]:
# fit training data to it
linreg.fit(f_train, t_train)

In [22]:
# print the intercept and the coefficients (TV, Radio, Newspaper)
print(linreg.intercept_)
print(linreg.coef_)

[2.87696662]
[[0.04656457 0.17915812 0.00345046]]
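
To make these numbers concrete, the sketch below applies them by hand to the first row shown by `adv_df.head()`. It assumes the coefficients are in the same order as the feature columns (TV, Radio, Newspaper) and uses the rounded values exactly as printed, so the result can differ slightly from what `linreg.predict` would return.

```python
import numpy as np

# Intercept and coefficients copied from the printed output (rounded as shown).
intercept = 2.87696662
coefs = np.array([0.04656457, 0.17915812, 0.00345046])  # TV, Radio, Newspaper

# Budgets from the first row of adv_df.head().
budgets = np.array([230.1, 37.8, 69.2])

# Linear regression prediction: intercept + sum(coefficient * budget).
predicted_sales = intercept + coefs @ budgets
print(round(predicted_sales, 2))
```

The recorded sales for that row are 22.1, so the gap between the two values is the model's error on that single observation. The Radio coefficient is the largest of the three and the Newspaper coefficient is close to zero, which suggests that, in this fit, newspaper spend adds little to predicted sales.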


In [29]:
# prediction
t_pred = linreg.predict(f_test)
print(t_pred)

[[21.70910292]
 [16.41055243]
 [ 7.60955058]
 [17.80769552]
 [18.6146359 ]
 [23.83573998]
 [16.32488681]
 [13.43225536]
 [ 9.17173403]
 [17.333853  ]
 [14.44479482]
 [ 9.83511973]
 [17.18797614]
 [16.73086831]
 [15.05529391]
 [15.61434433]
 [12.42541574]
 [17.17716376]
 [11.08827566]
 [18.00537501]
 [ 9.28438889]
 [12.98458458]
 [ 8.79950614]
 [10.42382499]
 [11.3846456 ]
 [14.98082512]
 [ 9.78853268]
 [19.39643187]
 [18.18099936]
 [17.12807566]
 [21.54670213]
 [14.69809481]
 [16.24641438]
 [12.32114579]
 [19.92422501]
 [15.32498602]
 [13.88726522]
 [10.03162255]
 [20.93105915]
 [ 7.44936831]
 [ 3.64695761]
 [ 7.22020178]
 [ 5.9962782 ]
 [18.43381853]
 [ 8.39408045]
 [14.08371047]
 [15.02195699]
 [20.35836418]
 [20.57036347]
 [19.60636679]]


Now evaluate the model with the root mean squared error (RMSE), which is the square root of the mean squared error (MSE)

In [31]:
# import libraries
from sklearn import metrics
import numpy as np

In [32]:
# calculating root mean squared error
np.sqrt(metrics.mean_squared_error(t_test, t_pred))

1.404651423032896
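
The value above is the RMSE, not the MSE itself. A minimal sketch of how the two relate, computed by hand with NumPy on made-up numbers that are not taken from the dataset:

```python
import numpy as np

# Hypothetical actual and predicted values, not taken from the dataset.
actual = np.array([10.0, 12.0, 15.0, 20.0])
predicted = np.array([11.0, 12.0, 13.0, 22.0])

# MSE: average of the squared residuals.
mse = np.mean((actual - predicted) ** 2)

# RMSE: square root of the MSE, expressed in the same units as the target.
rmse = np.sqrt(mse)

print(mse, rmse)
```

By the same relationship, squaring the RMSE of about 1.40 reported above gives an MSE of roughly 1.97 for this model. RMSE is often the easier figure to read because it is in the same units as the sales column.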