# Sales Prediction
Introduction:

The Sales Price Prediction project aims to utilize linear regression and extra tree regressor, two machine learning models, to forecast the sales of a product based on various advertising channels. The dataset used for this project is sourced from Kaggle, a popular platform for data science and machine learning. It contains four columns: TV, Radio, Newspaper, and Sales, representing the advertising expenditure for TV, radio, and newspaper channels, as well as the corresponding sales figures.

The objective of this project is to develop a predictive model that can estimate the sales based on the advertising investments in each channel. By analyzing historical data and understanding the relationships between advertising channels and sales, businesses can optimize their advertising strategies and make informed decisions to maximize their revenue.

This sales price prediction project uses Python and its machine learning libraries to preprocess the Kaggle dataset, train a linear regression model, and evaluate its performance. With the trained model, businesses can make sales predictions based on planned advertising expenditures, optimizing resource allocation and advertising budgets.

The output of this project will be a sales prediction model that provides insights into the expected sales based on the investments in TV, radio, and newspaper advertising channels. By leveraging machine learning techniques and utilizing the Kaggle dataset, businesses can make data-driven decisions and improve their sales forecasting accuracy, ultimately maximizing their profitability and growth.

## Linear Regression

Linear regression is a powerful algorithm used for predicting continuous target variables based on input features. It assumes a linear relationship between the independent variables and the target variable, aiming to find the best-fitting line that minimizes the difference between the predicted and actual values. By analyzing this relationship, businesses can gain valuable insights into the factors influencing sales prices and make informed decisions about their advertising strategies. The algorithm learns the coefficients associated with each feature from the training data, allowing it to make accurate predictions on new data. Linear regression is widely used due to its simplicity, interpretability, and effectiveness in various domains, making it an essential tool for sales price prediction and regression tasks.


In [1]:
#importing required libraries
import numpy as np # Importing numpy library for numerical computations
import pandas as pd # Import pandas library for data manipulation and analysis

In [2]:
df=pd.read_csv('Advertising.csv') # Reading the dataset into a dataframe using the pandas library
df.head() # displays the first 5 rows of the DataFrame

Unnamed: 0.1,Unnamed: 0,TV,Radio,Newspaper,Sales
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9


In [3]:
df=df.drop(columns=["Unnamed: 0"]) # dropping useless column
df.head() # displays the first 5 rows of the DataFrame

Unnamed: 0,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


In [4]:
df.shape # returns the dimensions (rows, columns) of the DataFrame.

(200, 4)

In [5]:
df.info() # provides a concise summary of the DataFrame's structure and content.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TV         200 non-null    float64
 1   Radio      200 non-null    float64
 2   Newspaper  200 non-null    float64
 3   Sales      200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB


In [6]:
df.describe() # generates summary statistics of the DataFrame's numerical columns

Unnamed: 0,TV,Radio,Newspaper,Sales
count,200.0,200.0,200.0,200.0
mean,147.0425,23.264,30.554,14.0225
std,85.854236,14.846809,21.778621,5.217457
min,0.7,0.0,0.3,1.6
25%,74.375,9.975,12.75,10.375
50%,149.75,22.9,25.75,12.9
75%,218.825,36.525,45.1,17.4
max,296.4,49.6,114.0,27.0


In [7]:
# Assigning the feature matrix and target variable
x= df[['TV','Radio','Newspaper']]
y= df['Sales']

In [8]:
y # Display target variable

0      22.1
1      10.4
2       9.3
3      18.5
4      12.9
       ... 
195     7.6
196     9.7
197    12.8
198    25.5
199    13.4
Name: Sales, Length: 200, dtype: float64

In [9]:
x # Display feature matrix

Unnamed: 0,TV,Radio,Newspaper
0,230.1,37.8,69.2
1,44.5,39.3,45.1
2,17.2,45.9,69.3
3,151.5,41.3,58.5
4,180.8,10.8,58.4
...,...,...,...
195,38.2,3.7,13.8
196,94.2,4.9,8.1
197,177.0,9.3,6.4
198,283.6,42.0,66.2


In [10]:
# Utilizing the `train_test_split` function from the `sklearn.model_selection` module to split the data into training and testing sets.
# Assigning the variables `x_train`, `x_test`, `x_train`, and `x_test` to the results of the `train_test_split` function.
#`X` represents the feature matrix, and `x` represents the target variable.
# Setting the `test_size` parameter to 0.20, indicating that we want to allocate 30% of the data for testing, while using the remaining 80% for training.
# Additionally, setting the `random_state` parameter to 43 to ensure the same random split is obtained every time the code is run, providing reproducibility of the results.

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=43)

In [11]:
from sklearn.preprocessing import StandardScaler  # Importing the StandardScaler class from sklearn.preprocessing module.
Sc = StandardScaler()  # Creating an instance of the StandardScaler class.
x_train_scaled=Sc.fit_transform(x_train)  # Scaling/transforming the X_train data using the fit_transform method of StandardScaler.
x_test_scaled=Sc.fit_transform(x_test)  # Scaling/transforming the X_test data using the fit_transform method of StandardScaler.

In [12]:
from sklearn.linear_model import LinearRegression # Importing the `LinearRegression` class from the `sklearn.linear_model` module
lr=LinearRegression() # Creating an instance of the linear regression model using the `LinearRegression` class.
lr.fit(x_train_scaled,y_train)# Fitting the linear regression model to the scaled training data

In [13]:
y_pred=lr.predict(x_test_scaled) # Using the trained linear regression model (lr) to predict the target variable

In [20]:
# Lets evaluate the model for its accuracy using various metrics such as RMSE and R-Squared
from sklearn import metrics # Importing the metrics module from the sklearn library

print('MAE:',metrics.mean_absolute_error(y_pred,y_test))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_pred,y_test)))
print('R-Squared',metrics.r2_score(y_pred,y_test))


MAE: 1.1972402071291417
RMSE: 1.4589611400948739
R-Squared 0.9179879912884453


## Extra Trees Regressor Model

Since we have achieved a 92% accuracy with the linear regression model, we can further enhance our predictions by incorporating the Extra Trees Regressor model. This ensemble algorithm, known for its randomized approach, offers the potential for even greater accuracy in our sales price prediction project. By leveraging the strengths of the Extra Trees Regressor, such as its ability to handle a large number of decision trees and its randomized splitting thresholds, we can potentially improve the accuracy and performance of our predictions. Let's explore the implementation of the Extra Trees Regressor model and evaluate its effectiveness in achieving higher accuracy in our sales price prediction task.

In [23]:
!pip install -q autoviz # Installing the `autoviz` library, which provides automated visualization of data.
!pip install -q -U --pre pycaret # Installing the latest pre-release version of the `pycaret` library, a Python library for automating machine learning workflows.
from pycaret.regression import * # Importing the necessary modules and functions from the `pycaret.regression` module for regression tasks.

In [None]:
s = setup(data = df, target = 'Sales', session_id=123)  # Setting up the machine learning environment

In [None]:
compare_models() # comparing different models

In [26]:
et = create_model('et') # Creating an Extra Trees Regressor model using the `create_model()` function from the `pycaret.regression` module.

Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,0.3637,0.3478,0.5897,0.9842,0.0373,0.0286
1,0.6276,1.1538,1.0741,0.9629,0.2399,0.2001
2,0.4551,0.4245,0.6515,0.9864,0.0389,0.0328
3,0.4281,0.3163,0.5624,0.9733,0.0404,0.0326
4,0.4111,0.2964,0.5444,0.976,0.0422,0.0323
5,0.4338,0.2968,0.5448,0.9871,0.0384,0.0359
6,0.4693,0.3032,0.5506,0.9909,0.0468,0.0407
7,0.4261,0.2542,0.5042,0.9869,0.0565,0.0431
8,0.4567,0.3008,0.5485,0.991,0.0603,0.0414
9,0.5528,0.4052,0.6366,0.9902,0.0377,0.0365


Processing:   0%|          | 0/4 [00:00<?, ?it/s]

In [27]:
et = finalize_model(et) # Finalizing the Extra Trees Regressor model by applying the finalize_model() function from the pycaret.regression module to the et model
et

In [28]:
preds = predict_model(et)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Extra Trees Regressor,0.0,0.0,0.0,1.0,0.0,0.0


Obtaining an R-squared score of 1 indicates that the Extra Trees Regressor model is performing extremely well and has achieved a perfect fit to the training data. An R-squared score of 1 means that the model can explain 100% accurate.

## Conclusion

In conclusion, our sales price prediction project utilized linear regression to develop a machine learning model for forecasting product sales based on advertising factors. The dataset included columns for "TV", "Radio", "Newspaper", and "Sales".

Through data preprocessing, model training, and evaluation, we successfully created a linear regression model capable of predicting sales based on advertising expenditures. The model achieved a high accuracy of 92%, indicating its effectiveness in capturing the relationships between advertising channels and sales.

Furthermore, we explored the use of the Extra Trees Regressor model to potentially improve accuracy. This ensemble-based algorithm showed exceptional performance, achieving a perfect fit to the training data with an R-squared score of 1. However, it is important to consider potential overfitting and assess the model's generalization on unseen data.

The project's outcomes provide businesses with valuable insights for estimating future sales and optimizing advertising budgets. By leveraging the trained models, businesses can make informed decisions on resource allocation and advertising strategies.

Overall, our project demonstrates the significance of machine learning in sales prediction and highlights the effectiveness of linear regression and the Extra Trees Regressor algorithm. Future work may involve exploring additional algorithms, incorporating more features, and conducting further evaluations to enhance prediction accuracy and practical application in sales forecasting.