# Sales Prediction

The aim is to build a model that predicts sales from the amount spent on marketing across three platforms: TV, radio, and newspaper.

In [8]:
#Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [9]:
#Reading the dataset
dataset = pd.read_csv("advertising.csv")

In [10]:
dataset.head()

Unnamed: 0,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,12.0
3,151.5,41.3,58.5,16.5
4,180.8,10.8,58.4,17.9


# Data Pre-Processing

In [11]:
dataset.shape

**1. Checking for missing values**

In [12]:
dataset.isna().sum()

**Conclusion:** The dataset does not have missing values

**2. Checking for duplicate rows**

In [13]:
dataset.duplicated().any()

**Conclusion:** There are no duplicate rows present in the dataset

**3. Checking for outliers**

In [14]:
fig, axs = plt.subplots(3, figsize = (5,5))
# Pass each column as x explicitly so every box is drawn horizontally
sns.boxplot(x = dataset['TV'], ax = axs[0])
sns.boxplot(x = dataset['Newspaper'], ax = axs[1])
sns.boxplot(x = dataset['Radio'], ax = axs[2])
plt.tight_layout()

**Conclusion:** The dataset contains no extreme outliers
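
The visual boxplot check can also be expressed numerically. The sketch below applies the common 1.5 × IQR rule, which is the same rule a boxplot uses by default to draw its whiskers. The function name `iqr_outliers` and the sample values are hypothetical and are not taken from advertising.csv; plain Python is used so the arithmetic stays visible.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical spend figures, chosen so that two values fall outside the fences
sample_spend = [10.8, 37.8, 39.3, 41.3, 45.9, 48.0, 52.5, 300.0]
outliers = iqr_outliers(sample_spend)
print(outliers)
```

In the notebook, the same function could be called on `dataset['TV'].tolist()` and the other two columns; an empty list would support the conclusion above.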

# Exploratory Data Analysis

**1. Distribution of the target variable**

In [15]:
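# distplot is deprecated in recent seaborn releases; histplot with kde=True gives a comparable histogram with a density curve
sns.histplot(dataset['Sales'], kde=True);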

**Conclusion:** Sales is approximately normally distributed

**2. How Sales are related with other variables**

In [16]:
sns.pairplot(dataset, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', height=4, aspect=1, kind='scatter')
plt.show()

**Conclusion:** TV is strongly, positively, and linearly correlated with the target variable, whereas the Newspaper feature appears to be uncorrelated

**3. Heatmap**

In [17]:
sns.heatmap(dataset.corr(), annot = True)
plt.show()

**Conclusion:** TV is the feature most strongly correlated with Sales; its correlation coefficient of 0.9 is very close to 1
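
Each cell of the heatmap is a Pearson correlation coefficient. As a rough sketch of what `dataset.corr()` computes for one pair of columns, the helper below (a hypothetical `pearson_r`, written in plain Python) is applied to the five rows shown by `dataset.head()`. Five rows are far too few to reproduce the heatmap value; they only illustrate the calculation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# TV and Sales values from the five rows printed by dataset.head()
tv = [230.1, 44.5, 17.2, 151.5, 180.8]
sales = [22.1, 10.4, 12.0, 16.5, 17.9]
r_head = pearson_r(tv, sales)
print(round(r_head, 3))
```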

# Model Building

Linear Regression is a useful tool for predicting a quantitative response.

Prediction using:
    1. Simple Linear Regression
    2. Multiple Linear Regression

**1. Simple Linear Regression**

Simple linear regression has only one x and one y variable. It is an approach for predicting a quantitative response using a single feature.

It establishes the relationship between two variables using a straight line. Linear regression attempts to draw a line that comes closest to the data by finding the slope and intercept that define the line and minimize regression errors.

**Formula:** Y = β0 + β1X + e

    Y = Dependent variable / Target variable
    β0 = Intercept of the regression line 
    β1 = Slope of the regression line, which tells whether the line is increasing or decreasing
    X = Independent variable / Predictor variable
    e = Error
    
**Equation:** Sales = β0 + β1 * TV
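
For a single predictor, the least-squares slope and intercept have a closed form: β1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and β0 = ȳ − β1·x̄. The sketch below implements that formula in plain Python. The function name `fit_simple_ols` and the toy points are assumptions made for illustration; the notebook itself relies on scikit-learn's `LinearRegression`.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical points lying exactly on y = 2 + 3x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 5.0, 8.0, 11.0]
b0, b1 = fit_simple_ols(xs, ys)
print(b0, b1)
```

Because the toy points sit exactly on a line, the recovered intercept and slope match the line that generated them; on real data the fitted line only minimises the squared error.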

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [19]:
#Setting the value for X and Y
x = dataset[['TV']]
y = dataset['Sales']

In [20]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 100)

In [21]:
slr= LinearRegression()  
slr.fit(x_train, y_train)

In [22]:
#Printing the model coefficients
print('Intercept: ', slr.intercept_)
print('Coefficient:', slr.coef_)

In [23]:
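# Build the equation from the fitted values instead of hard-coding them
print('Regression Equation: Sales = {:.3f} + {:.3f} * TV'.format(slr.intercept_, slr.coef_[0]))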

In [24]:
#Line of best fit
plt.scatter(x_train, y_train)
plt.plot(x_train, slr.predict(x_train), 'r')
plt.show()

In [25]:
#Predictions for the test set and the training set
y_pred_slr= slr.predict(x_test)  
y_pred_train_slr= slr.predict(x_train)  

In [26]:
print("Prediction for test set: {}".format(y_pred_slr))

In [27]:
#Actual value and the predicted value
slr_diff = pd.DataFrame({'Actual value': y_test, 'Predicted value': y_pred_slr})
slr_diff

In [28]:
#Predict for any value
slr.predict([[56]])

**Conclusion:** For a TV spend of 56, the model predicts Sales of about 10.003

In [29]:
# print the R-squared value (as a percentage), computed on the full dataset
print('R squared value of the model: {:.2f}'.format(slr.score(x,y)*100))

**Conclusion:** The model explains 81.10% of the variance in Sales
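
R-squared is defined as 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below computes it by hand on hypothetical numbers (not taken from the dataset) to show what `score` reports.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical actual and predicted values, for illustration only
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 11.0, 15.0, 15.0]
r2 = r_squared(actual, predicted)
print(r2)
```

A model that always predicts the mean of the actual values scores 0, and a model that reproduces every value scores 1.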

In [30]:
# Error metrics on the test set: 0 means a perfect fit, so lower is better
meanAbErr = metrics.mean_absolute_error(y_test, y_pred_slr)
meanSqErr = metrics.mean_squared_error(y_test, y_pred_slr)
rootMeanSqErr = np.sqrt(meanSqErr)

print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)
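
The three metrics differ only in how they aggregate the residuals: MAE averages their absolute values, MSE averages their squares, and RMSE takes the square root of MSE so that it is back in the units of Sales. A minimal sketch on hypothetical numbers, with `regression_errors` as an illustrative helper name:

```python
import math

def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for paired actual and predicted values."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    rmse = math.sqrt(mse)
    return mae, mse, rmse

# Hypothetical values; one large miss (16 vs 13) dominates MSE and RMSE
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 11.0, 15.0, 13.0]
mae, mse, rmse = regression_errors(actual, predicted)
print(mae, mse, rmse)
```

Because squaring weights large residuals more heavily, RMSE is never smaller than MAE, and a wide gap between the two suggests a few large errors.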

**2. Multiple Linear Regression**

Multiple linear regression has one y and two or more x variables. It extends simple linear regression by using more than one predictor variable to predict the response variable.

Multiple Linear Regression is one of the important regression algorithms which models the linear relationship between a single dependent continuous variable and more than one independent variable.

Assumptions for Multiple Linear Regression:
    1. A linear relationship should exist between the Target and predictor variables.
    2. The regression residuals must be normally distributed.
    3. MLR assumes little or no multicollinearity (correlation among the independent variables) in the data.
    
**Formula:** Y = β0 + β1X1 + β2X2 + β3X3 + ... + βnXn + e

    Y = Dependent variable / Target variable
    β0 = Intercept of the regression line 
    β1, β2,..βn = Coefficients (slopes) of the predictors; each gives the change in Y for a one-unit increase in that predictor while the others are held fixed
    X1, X2,..Xn = Independent variables / Predictor variables
    e = Error
    
**Equation:** Sales = β0 + (β1 * TV) + (β2 * Radio) + (β3 * Newspaper)
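
With several predictors, the least-squares coefficients solve the normal equations (XᵀX)β = Xᵀy, where X is the feature matrix with a leading column of ones for the intercept. The sketch below solves them in plain Python with Gaussian elimination. The helper names and the toy data are hypothetical, and the code assumes XᵀX is invertible; the notebook itself uses scikit-learn's `LinearRegression`.

```python
def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution, from the last unknown to the first
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def fit_multiple_ols(rows, ys):
    """Return [β0, β1, ..., βn] that minimise the sum of squared errors."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 models the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
betas = fit_multiple_ols(rows, ys)
print([round(b, 6) for b in betas])
```

Strongly correlated predictors make XᵀX close to singular, which is one practical reason the multicollinearity assumption matters.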

In [31]:
#Setting the value for X and Y
x = dataset[['TV', 'Radio', 'Newspaper']]
y = dataset['Sales']

In [32]:
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.3, random_state=100)  

In [33]:
mlr= LinearRegression()  
mlr.fit(x_train, y_train) 

In [34]:
#Printing the model coefficients
print(mlr.intercept_)
# pair the feature names with the coefficients
list(zip(x.columns, mlr.coef_))

In [35]:
#Predictions for the test set and the training set
y_pred_mlr= mlr.predict(x_test)  
y_pred_train_mlr= mlr.predict(x_train)  

In [36]:
print("Prediction for test set: {}".format(y_pred_mlr))

In [37]:
#Actual value and the predicted value
mlr_diff = pd.DataFrame({'Actual value': y_test, 'Predicted value': y_pred_mlr})
mlr_diff

In [38]:
#Predict for any value
mlr.predict([[56, 55, 67]])

**Conclusion:** For spends of 56 on TV, 55 on Radio, and 67 on Newspaper, the model predicts Sales of about 13.82

In [39]:
# print the R-squared value (as a percentage), computed on the full dataset
print('R squared value of the model: {:.2f}'.format(mlr.score(x,y)*100))

**Conclusion:** The multiple regression model explains 90.21% of the variance in Sales
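
Plain R-squared never decreases when predictors are added, so comparing the one-feature and three-feature models on it alone favours the larger model. Adjusted R-squared, 1 − (1 − R²)(n − 1) / (n − p − 1), penalises the number of predictors p. The sketch below applies it to the two R-squared values reported in this notebook. The row count `n_rows = 200` is an assumption, since the output of `dataset.shape` is not shown here.

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R²: penalises R² for the number of predictors used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

n_rows = 200  # hypothetical; replace with dataset.shape[0] in the notebook

adj_slr = adjusted_r_squared(0.8110, n_rows, 1)  # simple model: TV only
adj_mlr = adjusted_r_squared(0.9021, n_rows, 3)  # TV, Radio, Newspaper

print('Adjusted R squared (SLR): {:.4f}'.format(adj_slr))
print('Adjusted R squared (MLR): {:.4f}'.format(adj_mlr))
```

Under that assumed row count the multiple regression model still comes out ahead after the penalty, which would support preferring it over the TV-only model.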

In [40]:
# 0 means the model is perfect. Therefore the value should be as close to 0 as possible
meanAbErr = metrics.mean_absolute_error(y_test, y_pred_mlr)
meanSqErr = metrics.mean_squared_error(y_test, y_pred_mlr)
rootMeanSqErr = np.sqrt(metrics.mean_squared_error(y_test, y_pred_mlr))

print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)