# Future Sales Prediction with Machine Learning

## Introduction

The objective of this analysis is to understand the relationships between the variables in the dataset and to build a linear regression model that predicts how many units will be sold for a given amount of dollars spent on each of the advertising channels.

The dataset at hand shows the number of units sold along with the advertising costs incurred. The advertising channels used to sell the product are TV, Radio and Newspaper. 

 - TV: amount spent on TV advertising, in dollars;
 - Radio: amount spent on Radio advertising, in dollars;
 - Newspaper: amount spent on Newspaper advertising, in dollars;
 - Sales: number of units sold.

The analysis proceeds through the following steps:

 - Importing Libraries
 - Cleaning Data
 - Visualization
 - Correlation Test
 - Linear Regression Model
 - Testing the Regression Model

## Importing Libraries

We start by importing the packages needed for the analysis.

Scikit-learn (sklearn) is a machine learning toolkit for Python. It provides tools for the main steps of the machine learning process, such as splitting data, training a model and evaluating it.

Pandas is a data analysis and manipulation library built around DataFrames.

NumPy offers comprehensive mathematical functions, random number generators, linear algebra routines, Fourier transforms, and more.

Plotly is an open-source graphing library.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import plotly.express as px
import plotly.graph_objects as go
from sklearn.metrics import mean_absolute_error, mean_squared_error

## Importing CSV file

We read the CSV file and assign it to the variable "data", which is then used for all transformations and analysis.

In [4]:
data = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/advertising.csv")
print(data.head())

      TV  Radio  Newspaper  Sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3   12.0
3  151.5   41.3       58.5   16.5
4  180.8   10.8       58.4   17.9


## Exploring the Dataset

In [38]:
data

,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,12.0
3,151.5,41.3,58.5,16.5
4,180.8,10.8,58.4,17.9
...,...,...,...,...
195,38.2,3.7,13.8,7.6
196,94.2,4.9,8.1,14.0
197,177.0,9.3,6.4,14.8
198,283.6,42.0,66.2,25.5


In [6]:
data.tail()

,TV,Radio,Newspaper,Sales
195,38.2,3.7,13.8,7.6
196,94.2,4.9,8.1,14.0
197,177.0,9.3,6.4,14.8
198,283.6,42.0,66.2,25.5
199,232.1,8.6,8.7,18.4


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TV         200 non-null    float64
 1   Radio      200 non-null    float64
 2   Newspaper  200 non-null    float64
 3   Sales      200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB


In [9]:
round(data.describe(),2)

,TV,Radio,Newspaper,Sales
count,200.0,200.0,200.0,200.0
mean,147.04,23.26,30.55,15.13
std,85.85,14.85,21.78,5.28
min,0.7,0.0,0.3,1.6
25%,74.38,9.98,12.75,11.0
50%,149.75,22.9,25.75,16.0
75%,218.82,36.52,45.1,19.05
max,296.4,49.6,114.0,27.0


## Scanning the Dataset for Null Values

We check the dataset for null values and, if any are present, decide how to handle them. In this instance there are no null values, so we can move on to checking for outliers and visualizing the relationships between the variables.

In [31]:
print(data.isnull().sum())

TV           0
Radio        0
Newspaper    0
Sales        0
dtype: int64
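
Had the check above reported missing values, two common options would be to drop the affected rows or to fill the gaps with a column statistic. The sketch below illustrates both on a small made-up DataFrame; the `toy` frame and its gaps are hypothetical and are not part of the advertising dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; the real advertising dataset has none.
toy = pd.DataFrame({
    "TV": [230.1, np.nan, 17.2, 151.5],
    "Sales": [22.1, 10.4, np.nan, 16.5],
})

# Count the nulls per column, as done for the real dataset above.
print(toy.isnull().sum())

# Option 1: drop every row that contains at least one null value.
dropped = toy.dropna()
print(len(dropped), "rows remain after dropping")

# Option 2: fill each null with the median of its own column.
filled = toy.fillna(toy.median())
print(filled)
```

Dropping rows is the simpler choice when only a few rows are affected, while filling keeps the sample size intact at the cost of introducing estimated values.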


## Scanning the Dataset for Outliers

As the box plot for TV shows, there are no outliers in that column.

In [16]:
figure1 = px.box(data, y = "TV")
figure1.show()

As the box plot for Radio shows, there are no outliers in that column either.

In [17]:
figure2 = px.box(data, y = "Radio")
figure2.show()

For Newspaper, however, there are outliers: the box plot shows two. It is good practice to check the DataFrame for any other outliers and to note the difference between the first outlier value and the upper fence for that column (here, Newspaper).

In [18]:
figure3 = px.box(data, y = "Newspaper")
figure3.show()

There are only two outliers, as the code below confirms by listing every Newspaper value above 100.

In [21]:
outlier = data[(data["Newspaper"] > 100)]
print(outlier["Newspaper"])

16     114.0
101    100.9
Name: Newspaper, dtype: float64
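
The upper fence of a box plot is conventionally taken as Q3 + 1.5 × IQR, where IQR = Q3 − Q1. As a minimal sketch, the helper below (`iqr_fences` is a name introduced here for illustration) computes both fences with pandas, and the Newspaper quartiles reported by `data.describe()` are then plugged in by hand. Plotly may use a slightly different quartile method for its box plots, so the fence drawn there can differ a little.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return (lower_fence, upper_fence) using the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Tiny check of the helper on made-up numbers.
demo_lower, demo_upper = iqr_fences(pd.Series([1, 2, 3, 4, 5]))
print(demo_lower, demo_upper)

# Newspaper quartiles copied from the describe() output (25% and 75% rows).
q1, q3 = 12.75, 45.1
upper_fence = q3 + 1.5 * (q3 - q1)
print(round(upper_fence, 3))          # about 93.6

# Distance of the smaller outlier (100.9) from the upper fence.
print(round(100.9 - upper_fence, 3))  # about 7.3
```

On the real data, `iqr_fences(data["Newspaper"])` would give the fences directly, and filtering on `data["Newspaper"] > upper_fence` avoids hard-coding the threshold of 100.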


The difference between the first outlier and the upper fence is not large enough to warrant any action on the dataset. As such, we will leave these values in the dataset and continue with our analysis.

In [26]:
data.sort_values(by = "Newspaper", ascending = False)

,TV,Radio,Newspaper,Sales
16,67.8,36.6,114.0,12.5
101,296.4,36.3,100.9,23.8
75,16.9,43.7,89.4,8.7
165,234.5,3.4,84.8,16.9
118,125.7,36.9,79.2,15.9
...,...,...,...,...
42,293.6,27.7,1.8,20.7
139,184.9,43.9,1.7,20.7
8,8.6,2.1,1.0,4.8
65,69.0,9.3,0.9,11.3


## Visualizations

**1. Observe the relationship between amount spent on TV advertising and the number of units sold (Sales)**

In [27]:
viz1 = px.scatter(data_frame = data, x = "Sales", y = "TV",
                  size = "TV", trendline = "ols")
viz1.show()

**2. Observe the relationship between amount spent on Radio advertising and the number of units sold (Sales)**

In [28]:
viz2 = px.scatter(data_frame = data, x = "Sales", y = "Radio",
                 size = "Radio", trendline = "ols")
viz2.show()

**3. Observe the relationship between amount spent on Newspaper advertising and the number of units sold (Sales)**

In [29]:
viz3 = px.scatter(data_frame = data, x = "Sales", y = "Newspaper",
                 size = "Newspaper", trendline = "ols")
viz3.show()

In [35]:
data["Newspaper"].sum()

6110.799999999999

Analysis:

From the visualizations, it is clear that higher spending on TV advertisements is associated with more units sold. For the other two graphs, the effect is not as clearly visible. As such, let's conduct a correlation analysis to further understand the relationships between these numerical variables.

## Correlation Test

In [30]:
correlation = data.corr()
print(correlation["Sales"].sort_values(ascending=False))

Sales        1.000000
TV           0.901208
Radio        0.349631
Newspaper    0.157960
Name: Sales, dtype: float64


In [33]:
# Correlation Heatmap

corrfig = px.imshow(correlation)
corrfig.show()

Analysis:
    
The correlation table shows a strong positive correlation between TV and Sales (0.90), while Radio (0.35) and Newspaper (0.16) are only weakly correlated with Sales. This reinforces what we observed in the visualizations.
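
By default, `data.corr()` computes the Pearson correlation coefficient, i.e. the covariance of two columns divided by the product of their standard deviations. As a rough sketch of that calculation, the function below (`pearson` is a hypothetical helper) reproduces it with NumPy on the five rows shown by `data.head()`. With so few rows, the result is not expected to match the value obtained on the full dataset.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation: covariance over the product of standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )

# First five rows of the dataset, copied from the data.head() output.
tv = [230.1, 44.5, 17.2, 151.5, 180.8]
sales = [22.1, 10.4, 12.0, 16.5, 17.9]

r_manual = pearson(tv, sales)
r_numpy = np.corrcoef(tv, sales)[0, 1]  # NumPy's built-in equivalent
print(r_manual, r_numpy)
```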

## Linear Regression

To fit a linear regression model on our dataset, we first separate the data into independent and dependent variables. In this case, x holds the independent variables (TV, Radio and Newspaper) and y holds the dependent variable (Sales).

In [15]:
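x = np.array(data.drop(columns = ["Sales"]))
y = np.array(data["Sales"])

Note: the column is dropped by keyword (columns = ["Sales"]) rather than with a positional axis argument, because pandas warns that all arguments of DataFrame.drop except 'labels' will be keyword-only in a future version.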



We then split the data into training and test sets: each of x and y is split into a training portion and a testing portion.

The reason for this approach is that we want to fit the linear regression model on the training data and then check how well it performs on data it has not seen before (the test data), which stands in for real-world data.

To do this, the "train_test_split" function from sklearn is used. x represents the independent features (TV, Radio, Newspaper) and y represents the dependent feature (Sales). We use 20% of the dataset for testing and the remaining 80% for training.

In [30]:
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.2, random_state = 42)
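
With 200 rows and test_size = 0.2, the split should yield 160 training rows and 40 test rows, and fixing random_state makes the split reproducible. The sketch below checks both points on placeholder arrays with the same shapes as x and y (`x_demo` and `y_demo` are stand-ins, not the real data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays shaped like the real x (200 x 3) and y (200,).
x_demo = np.arange(600, dtype=float).reshape(200, 3)
y_demo = np.arange(200, dtype=float)

xtr, xte, ytr, yte = train_test_split(
    x_demo, y_demo, test_size = 0.2, random_state = 42
)
print(xtr.shape, xte.shape)  # (160, 3) (40, 3)

# Repeating the call with the same random_state selects the same rows.
xtr2, xte2, ytr2, yte2 = train_test_split(
    x_demo, y_demo, test_size = 0.2, random_state = 42
)
print(np.array_equal(xte, xte2))  # True
```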

The "fit" method trains the model on the training data after the model has been initialized.

In [31]:
model = LinearRegression()
model.fit(xtrain, ytrain)


Exploring the coefficients, we can make the following interpretations, each holding spending on the other two channels constant:

 - For every additional dollar spent on TV advertising, we expect the number of units sold to increase by about 0.05.
 - For every additional dollar spent on Radio advertising, we expect the number of units sold to increase by about 0.10.
 - For every additional dollar spent on Newspaper advertising, we expect the number of units sold to increase by about 0.004.

In [34]:
print(model.coef_)

[0.05450927 0.10094536 0.00433665]


The intercept shows that if no dollars were spent on advertising, the model predicts that about 4.7 units would be sold.

In [33]:
print(model.intercept_)

4.714126402214134
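
Taken together, the coefficients and the intercept define the fitted equation Sales ≈ 4.714 + 0.0545 × TV + 0.1009 × Radio + 0.0043 × Newspaper. As a quick illustration, the sketch below applies it by hand to the first row of the dataset (TV = 230.1, Radio = 37.8, Newspaper = 69.2, actual Sales = 22.1). All numbers are copied from the outputs in this notebook; whether that row ended up in the training or the test split is not known here.

```python
import numpy as np

# Values copied from print(model.coef_) and print(model.intercept_).
coef = np.array([0.05450927, 0.10094536, 0.00433665])
intercept = 4.714126402214134

# First row of the dataset: TV, Radio, Newspaper spend.
first_row = np.array([230.1, 37.8, 69.2])

# A linear regression prediction is the intercept plus the dot product
# of the features with the coefficients.
predicted = intercept + first_row @ coef
print(round(float(predicted), 2))  # about 21.37, versus 22.1 actually sold
```

This is the same arithmetic that `model.predict` performs for each row.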


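model.score returns the coefficient of determination (R²) of the predictions, here computed on the test set. A value of about 0.91 means the model explains roughly 91% of the variance in Sales for the held-out data.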

In [32]:
print(model.score(xtest, ytest))

0.9059011844150826
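
For reference, the coefficient of determination is defined as R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below computes it by hand and compares the result with sklearn's `r2_score`, using five Actual/Predicted pairs taken from the comparison table in the "Test the model" section. On only five rows, the value will differ from the score obtained on the full test set.

```python
import numpy as np
from sklearn.metrics import r2_score

# Five Actual/Predicted pairs copied from the comparison table.
actual = np.array([16.9, 22.4, 21.4, 7.3, 24.7])
predicted = np.array([17.034772, 20.409740, 23.723989, 9.272785, 21.682719])

ss_res = np.sum((actual - predicted) ** 2)      # unexplained variation
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(actual, predicted))
```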


### Test the model

In [35]:
y_pred = model.predict(xtest)

In [67]:
comparison = pd.DataFrame({"Actual": ytest, "Predicted": y_pred})
comparison["Percentage Difference"] = (comparison["Predicted"] - comparison["Actual"]) / comparison["Actual"]
print(comparison)

    Actual  Predicted  Percentage Difference
0     16.9  17.034772               0.007975
1     22.4  20.409740              -0.088851
2     21.4  23.723989               0.108598
3      7.3   9.272785               0.270245
4     24.7  21.682719              -0.122157
5     12.6  12.569402              -0.002428
6     22.3  21.081195              -0.054655
7      8.4   8.690350               0.034566
8     16.5  17.237013               0.044667
9     16.1  16.666575               0.035191
10    11.0   8.923965              -0.188730
11     8.7   8.481734              -0.025088
12    16.9  18.207512               0.077368
13     5.3   8.067507               0.522171
14    10.3  12.645510               0.227719
15    16.7  14.931628              -0.105891
16     5.5   8.128146               0.477845
17    16.6  17.898766               0.078239
18    11.3  11.008806              -0.025769
19    18.9  20.478328               0.083509
20    19.7  20.806318               0.056158
21    12.5

In [66]:
datatype = comparison.dtypes
datatype

Actual       float64
Predicted    float64
dtype: object

In [55]:
mae = mean_absolute_error(ytest, y_pred)
mse = mean_squared_error(ytest, y_pred)
rmse = np.sqrt(mse)

In [59]:
print(f'Mean absolute error: {mae:.2f}')


Mean absolute error: 1.27


In [60]:
print(f'Mean squared error: {mse:.2f}')


Mean squared error: 2.91


In [61]:
print(f'Root mean squared error: {rmse:.2f}')


Root mean squared error: 1.71
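
The three error metrics above are simple averages over the prediction errors: MAE is the mean of the absolute errors, MSE is the mean of the squared errors, and RMSE is the square root of MSE (here, the square root of 2.91 is about 1.71). As a small sketch, the code below computes them with NumPy on the first five rows of the comparison table and checks the results against sklearn. The numbers it prints apply to those five rows only, not to the full test set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# First five Actual/Predicted pairs copied from the comparison table.
actual = np.array([16.9, 22.4, 21.4, 7.3, 24.7])
predicted = np.array([17.034772, 20.409740, 23.723989, 9.272785, 21.682719])

errors = actual - predicted
mae_manual = np.mean(np.abs(errors))   # mean absolute error
mse_manual = np.mean(errors ** 2)      # mean squared error
rmse_manual = np.sqrt(mse_manual)      # root mean squared error

print(mae_manual, mean_absolute_error(actual, predicted))
print(mse_manual, mean_squared_error(actual, predicted))
print(rmse_manual)
```

Because RMSE is expressed in the same unit as Sales, it is often the easiest of the three to read: on the full test set, the predictions are off by roughly 1.7 units on a root-mean-square basis.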
