# **SALES PREDICTION** 

### **Problem Statement**

Build a model which predicts sales based on the money spent on different platforms for marketing.

### Data
Use the advertising dataset from ISLR, which records the money spent on `TV`, `Radio`, and `Newspaper` advertising alongside the resulting `Sales`, and analyse how advertising spend relates to sales.

In this notebook, we'll build a **Linear Regression model, a Decision Tree Regression model, and a Random Forest Regression model** to predict `Sales` from the advertising-spend variables.

## Reading and Understanding the Data

In [None]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

# Import the numpy and pandas package

import numpy as np
import pandas as pd

# Data Visualisation
import matplotlib.pyplot as plt 
import seaborn as sns

# **DATA LOADING**

The data is loaded from the `advertising.csv` file.

In [None]:
# pd.read_csv already returns a DataFrame, so no extra wrapping is needed
advertising = pd.read_csv("../input/advertising.csv")
advertising.head()

# **Exploratory Data Analysis (EDA)**
Exploratory Data Analysis is a critical step in the data analysis process: it examines and visualizes a dataset to understand its main characteristics, patterns, and relationships. The primary goals are to gain insights, detect anomalies, and inform the modeling choices that follow.

# **1. Summarize Data:**

This involves calculating descriptive statistics such as **mean, median, mode, standard deviation,** and other relevant measures to understand the central tendency and spread of the data.

In [None]:
advertising.shape

In [None]:
advertising.columns   # 4 columns in total: ['TV', 'Radio', 'Newspaper', 'Sales']

In [None]:
advertising.info()   # all 4 columns have a numerical datatype

In [None]:
advertising.describe()
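`describe()` reports the mean, standard deviation, and quartiles (its 50% row is the median), but not the mode. As a rough sketch, the statistics named above can be gathered into one table. The `demo` frame and its values are hypothetical stand-ins for `advertising`, chosen only so the arithmetic is easy to check by hand.

```python
import pandas as pd

# Hypothetical stand-in for the advertising data (illustrative values only)
demo = pd.DataFrame({
    "TV": [10.0, 20.0, 20.0, 40.0, 60.0],
    "Sales": [5.0, 7.0, 7.0, 9.0, 12.0],
})

# One row per column, one column per statistic
summary = pd.DataFrame({
    "mean": demo.mean(),
    "median": demo.median(),
    "mode": demo.mode().iloc[0],  # first mode if a column has several
    "std": demo.std(),            # sample standard deviation (ddof=1)
})
print(summary)
```

For continuous spend data the mode is often uninformative because most values occur only once, so the mean, median, and standard deviation usually carry most of the insight.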

# **2. Visualisation and Insights**

In [None]:
# Percentage of null values in each column
advertising.isnull().sum()*100/advertising.shape[0]
# There are no null values in the dataset, so no imputation is needed.

In [None]:
# Outlier Analysis
fig, axs = plt.subplots(3, figsize=(5, 5))
# Pass the column as x= explicitly so the call behaves the same across seaborn versions
sns.boxplot(x=advertising['TV'], ax=axs[0])
sns.boxplot(x=advertising['Newspaper'], ax=axs[1])
sns.boxplot(x=advertising['Radio'], ax=axs[2])
plt.tight_layout()

In [None]:
# There are no considerable outliers present in the data.
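The boxplot reading can be backed by a count. The sketch below applies the usual 1.5 × IQR whisker rule, which is the default boxplot convention; the helper name `count_iqr_outliers` and the demo values are hypothetical, not part of the original notebook.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Hypothetical values: six similar points and one obvious outlier
demo = pd.Series([10, 12, 11, 13, 12, 11, 100])
print(count_iqr_outliers(demo))
```

On the real data, something like `advertising.apply(count_iqr_outliers)` would give one count per column.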

#### Sales (Target Variable)

In [None]:
sns.boxplot(x=advertising['Sales'])
plt.show()

In [None]:
# Let's see how Sales are related with other variables using scatter plot.
sns.pairplot(advertising, x_vars=['TV', 'Newspaper', 'Radio'], y_vars='Sales', height=4, aspect=1, kind='scatter')
plt.show()

In [None]:
# Let's see the correlation between different variables.
sns.heatmap(advertising.corr(), cmap="YlGnBu", annot = True)
plt.show()

As the pairplot and the heatmap show, `TV` appears to be the variable most strongly correlated with `Sales`. Rather than restricting the models to `TV` alone, the sections below use all three spend variables (`TV`, `Radio`, `Newspaper`) as features.
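The same ranking can be read off numerically instead of from the heatmap colours. This is a small sketch on a hypothetical `demo` frame (made-up values, only two predictors); on the real data the equivalent call would use `advertising` in place of `demo`.

```python
import pandas as pd

# Hypothetical stand-in data: TV tracks Sales closely, Radio only loosely
demo = pd.DataFrame({
    "TV": [10, 20, 30, 40, 50],
    "Radio": [5, 3, 8, 1, 9],
    "Sales": [12, 21, 33, 39, 52],
})

# Pearson correlation of every predictor with Sales, strongest first
corr_with_sales = (
    demo.corr()["Sales"]
    .drop("Sales")
    .sort_values(ascending=False)
)
print(corr_with_sales)
```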

# **Splitting the Dataset into Train and Test Sets**

In [None]:
X = advertising[['TV','Radio','Newspaper']]
y = advertising['Sales']

Next, split the features `X` and the target `y` into training and testing sets using `train_test_split` from `sklearn.model_selection`. A common choice is to keep 70% of the data for training and the remaining 30% for testing; fixing `random_state` makes the split reproducible.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 100)

In [None]:
# Let's now take a look at the train dataset

X_train.head()

In [None]:
y_train.head()
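To make the effect of the split concrete, here is a minimal sketch on a hypothetical 10-row array (the `X_demo` and `y_demo` names and values are illustrative only). It checks that a 70/30 split yields 7 training and 3 test rows and that the same `random_state` reproduces the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, test_size=0.3, random_state=100
)
# Repeating the call with the same seed gives the same partition
X_tr_again, _, _, _ = train_test_split(
    X_demo, y_demo, train_size=0.7, test_size=0.3, random_state=100
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_tr, X_tr_again))
```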

# **Train the model:**

Hyperparameter tuning, also known as hyperparameter optimization, is the process of finding the set of hyperparameters that gives a machine learning model its best performance on a given dataset.

Two common hyperparameter tuning techniques in scikit-learn are:

- `GridSearchCV`, which evaluates every combination in a parameter grid using cross-validation.
- `RandomizedSearchCV`, which evaluates a fixed number of randomly sampled combinations.
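This notebook uses `GridSearchCV` throughout. For comparison, the sketch below shows roughly how `RandomizedSearchCV` could be applied to the same kind of grid. The synthetic `X_demo`/`y_demo` data, the choice of `n_iter=5`, and the seeds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the advertising dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(60, 3))
y_demo = 0.05 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(0, 0.5, size=60)

param_distributions = {
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Try only 5 of the 36 possible combinations, each with 3-fold cross-validation
search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Random search trades exhaustiveness for speed, which tends to matter more as the grid grows; on grids as small as the ones in this notebook, the exhaustive search is still cheap.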

# **1. Linear Regression Model**



In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

In [None]:
# Create a Linear Regression model
baseline_model = LinearRegression()

# Hyperparameter tuning
# The `normalize` option was removed from LinearRegression in scikit-learn 1.2,
# so the grid searches over `fit_intercept` instead.
param_grid = {'fit_intercept': [True, False]}  # You can add more hyperparameters to tune
grid_search = GridSearchCV(baseline_model, param_grid, cv=3)
grid_search.fit(X_train, y_train)

# Get the best Linear Regression model with the best parameters
best_model_1 = grid_search.best_estimator_

# GridSearchCV already refits the best model on the full training data
# (refit=True by default), so this call is redundant but harmless
best_model_1.fit(X_train, y_train)

# Make predictions on the test data
y_pred_1 = best_model_1.predict(X_test)



# **2. Decision Tree Regression Model**

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Create a Decision Tree Regressor
baseline_model = DecisionTreeRegressor()

# Hyperparameter tuning
param_grid = {'max_depth': [None, 5, 10, 20],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [1, 2, 4]}
grid_search = GridSearchCV(baseline_model, param_grid, cv=3)
grid_search.fit(X_train, y_train)

# Get the best Decision Tree Regressor model with the best parameters
best_model_2 = grid_search.best_estimator_

# Fit the best model on the training data
best_model_2.fit(X_train, y_train)

# Make predictions on the test data
y_pred_2 = best_model_2.predict(X_test)


# **3. Random Forest Regression Model**

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Create a Random Forest Regressor
baseline_model = RandomForestRegressor()

# Hyperparameter tuning
param_grid = {'n_estimators': [100, 200, 300],
              'max_depth': [None, 5, 10, 20],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [1, 2, 4]}
grid_search = GridSearchCV(baseline_model, param_grid, cv=3)
grid_search.fit(X_train, y_train)

# Get the best Random Forest Regressor model with the best parameters
best_model_3 = grid_search.best_estimator_

# Fit the best model on the training data
best_model_3.fit(X_train, y_train)

# Make predictions on the test data
y_pred_3 = best_model_3.predict(X_test)




## Model Evaluation

To compare the performance of different regression models (Linear Regression, Decision Tree, Random Forest) and determine which one is best for your specific dataset, you can evaluate each model using appropriate metrics such as **Mean Squared Error (MSE)**, **Root Mean Squared Error (RMSE)**, **Mean Absolute Error (MAE)**, and **R-squared**. After evaluating each model, you can compare their performance metrics to decide which one performs the best.
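Since the same four metrics are computed for every model, they can be wrapped in one helper. The function name `regression_report` and the toy values below are hypothetical; the sketch only illustrates how the metrics relate to each other.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def regression_report(y_true, y_pred):
    """Return MSE, RMSE, MAE and R-squared for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # same units as the target
        "MAE": mean_absolute_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred),
    }

# Toy example: two of the four predictions are off by exactly 1
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.0, 5.0, 8.0, 9.0]
report = regression_report(y_true, y_hat)
print(report)
```

In the notebook, calling `regression_report(y_test, y_pred_1)` and so on would produce the same figures as the per-model cells that follow.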

**LinearRegression Model Evaluation**

In [None]:
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred_1)
print("Mean Squared Error:", mse)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
print("Root Mean Squared Error:", rmse)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred_1)
print("Mean Absolute Error:", mae)

# Calculate R-squared (Coefficient of Determination)
r2 = r2_score(y_test, y_pred_1)
print("R-squared:", r2)


**DecisionTreeRegression Model Evaluation**

In [None]:
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred_2)
print("Mean Squared Error:", mse)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
print("Root Mean Squared Error:", rmse)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred_2)
print("Mean Absolute Error:", mae)

# Calculate R-squared (Coefficient of Determination)
r2 = r2_score(y_test, y_pred_2)
print("R-squared:", r2)

**RandomForestRegression Model Evaluation**

In [None]:
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred_3)
print("Mean Squared Error:", mse)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
print("Root Mean Squared Error:", rmse)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred_3)
print("Mean Absolute Error:", mae)

# Calculate R-squared (Coefficient of Determination)
r2 = r2_score(y_test, y_pred_3)
print("R-squared:", r2)

## Model Comparison

| Model             | RMSE   | MAE    | R-squared |
|-------------------|--------|--------|-----------|
| Linear Regression | 1.6236 | 1.2278 | 0.8656    |
| Decision Tree     | 1.6269 | 1.3483 | 0.8650    |
| Random Forest     | 1.1671 | 0.9643 | 0.9306    |
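A table like this can also be ranked programmatically rather than by eye. The sketch below loads the figures reported above (rounded to four decimals) into a DataFrame and sorts by RMSE; the variable names are illustrative.

```python
import pandas as pd

# Metrics copied from the comparison table, rounded to four decimals
results = pd.DataFrame(
    {
        "RMSE": [1.6236, 1.6269, 1.1671],
        "MAE": [1.2278, 1.3483, 0.9643],
        "R-squared": [0.8656, 0.8650, 0.9306],
    },
    index=["Linear Regression", "Decision Tree", "Random Forest"],
)

# Lower RMSE is better, so the first row after sorting is the best model
ranked = results.sort_values("RMSE")
best_model_name = ranked.index[0]
print(ranked)
print("Best by RMSE:", best_model_name)
```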

### Summary

Based on the evaluation metrics, the Random Forest model outperforms the other two on RMSE, MAE, and R-squared. It has the lowest RMSE (about 1.17) and MAE (about 0.96), indicating better predictive accuracy, and the highest R-squared (about 0.93), meaning it explains roughly 93% of the variance in `Sales` on the test set. Linear Regression and the Decision Tree perform almost identically to each other (RMSE about 1.62, R-squared about 0.87), with the Decision Tree showing a slightly higher MAE.

The Random Forest model is recommended for this project due to its strong predictive performance and ability to handle complex relationships within the data. Because these figures come from a single 70/30 split and the tree-based models were created without a fixed `random_state`, the exact numbers may vary between runs.
