# Sales Prediction
## Introduction

The Sales Prediction project aims to utilize machine learning models, including linear regression, to forecast product sales based on advertising expenditures across different channels. The dataset used for this project is obtained from Kaggle, a prominent platform for data science and machine learning enthusiasts. It consists of four key columns: TV, Radio, Newspaper, and Sales, representing the advertising spending on TV, radio, and newspaper platforms, along with the corresponding sales figures.

The primary goal of this project is to develop predictive models capable of estimating sales figures based on investments made in each advertising channel. By analyzing historical data and understanding the relationships between advertising expenditures and sales outcomes, businesses can optimize their advertising strategies and make informed decisions to maximize revenue generation.

This sales prediction project uses the Python programming language and several machine learning libraries to preprocess the Kaggle dataset, train predictive models, and evaluate their performance. With the trained models, businesses can forecast sales figures based on planned advertising budgets, enabling them to allocate resources efficiently and optimize their advertising campaigns.

The final output of this project is a sales prediction model that estimates expected sales from investments in TV, radio, and newspaper advertising. By applying machine learning to the Kaggle dataset, businesses can improve their sales forecasting accuracy and make data-driven decisions to drive profitability and achieve sustainable growth.

In [1]:
#importing required libraries
import numpy as np # Importing numpy library for numerical computations
import pandas as pd # Import pandas library for data manipulation and analysis
import seaborn as sns
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import StandardScaler,MinMaxScaler

In [2]:
df=pd.read_csv('Advertising.csv') # Reading the dataset into a dataframe using the pandas library
df.head() # displays the first 5 rows of the DataFrame

,Unnamed: 0,TV,Radio,Newspaper,Sales
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9


In [3]:
df=df.drop(columns=["Unnamed: 0"]) # dropping useless column
df.head() # displays the first 5 rows of the DataFrame

,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


In [4]:
df.shape # returns the dimensions (rows, columns) of the DataFrame.

(200, 4)

In [5]:
df.info() # provides a concise summary of the DataFrame's structure and content.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TV         200 non-null    float64
 1   Radio      200 non-null    float64
 2   Newspaper  200 non-null    float64
 3   Sales      200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB


In [6]:
df.describe() # generates summary statistics of the DataFrame's numerical columns

,TV,Radio,Newspaper,Sales
count,200.0,200.0,200.0,200.0
mean,147.0425,23.264,30.554,14.0225
std,85.854236,14.846809,21.778621,5.217457
min,0.7,0.0,0.3,1.6
25%,74.375,9.975,12.75,10.375
50%,149.75,22.9,25.75,12.9
75%,218.825,36.525,45.1,17.4
max,296.4,49.6,114.0,27.0


In [7]:
#Check if there are NA values.
df.isnull().sum().any()

False

In [8]:
#Check if there are duplicate values.
df.duplicated().sum()

0

In [9]:
df.corr()

,TV,Radio,Newspaper,Sales
TV,1.0,0.054809,0.056648,0.782224
Radio,0.054809,1.0,0.354104,0.576223
Newspaper,0.056648,0.354104,1.0,0.228299
Sales,0.782224,0.576223,0.228299,1.0
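
The matrix above holds Pearson correlation coefficients, which is the default method of `df.corr()`. As a minimal sketch of what one entry means, the block below computes the coefficient by hand for TV and Sales and compares it with pandas. It uses only the five rows shown by `df.head()`, so the resulting number is illustrative and will not match the 0.782224 obtained from all 200 rows; the names `sample`, `r_manual`, and `r_pandas` are introduced here for the example.

```python
import numpy as np
import pandas as pd

# First five rows from df.head(); five rows are for illustration only
sample = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8],
    "Sales": [22.1, 10.4, 9.3, 18.5, 12.9],
})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx and dy the centered values
dx = sample["TV"] - sample["TV"].mean()
dy = sample["Sales"] - sample["Sales"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = sample["TV"].corr(sample["Sales"])  # Pearson by default

print(f"manual: {r_manual:.6f}, pandas: {r_pandas:.6f}")
```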


In [10]:
data_skew = df[['TV', 'Radio', 'Newspaper']]
skew = pd.DataFrame(data_skew.skew())
skew.columns = ['skew']
skew['too_skewed'] = skew['skew'].abs() > .75  # flag strong skew in either direction
skew

,skew,too_skewed
TV,-0.069853,False
Radio,0.094175,False
Newspaper,0.89472,True


The Newspaper column exceeds the 0.75 skewness threshold, so it is transformed toward a normal distribution before modeling.

In [11]:
qt = QuantileTransformer(n_quantiles=200, output_distribution='normal')
df[['Newspaper']] = qt.fit_transform(df[['Newspaper']])

In [12]:
data_skew = df[['TV', 'Radio', 'Newspaper']]
skew = pd.DataFrame(data_skew.skew())
skew.columns = ['skew']
skew['too_skewed'] = skew['skew'].abs() > .75  # flag strong skew in either direction
skew

,skew,too_skewed
TV,-0.069853,False
Radio,0.094175,False
Newspaper,0.000238,False
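
To illustrate why the quantile transform brings the skewness close to zero, the sketch below applies the same transformer settings to synthetic right-skewed data. The exponential sample, the seed, and `random_state=0` are assumptions made for this example; the values are not taken from `Advertising.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Hypothetical right-skewed spend values standing in for the Newspaper column
spend = pd.DataFrame({"Newspaper": rng.exponential(scale=30.0, size=200)})

skew_before = spend["Newspaper"].skew()

# Map each value to its empirical quantile, then to the matching normal quantile
qt = QuantileTransformer(n_quantiles=200, output_distribution="normal", random_state=0)
transformed = qt.fit_transform(spend[["Newspaper"]]).ravel()
skew_after = pd.Series(transformed).skew()

print(f"skew before: {skew_before:.3f}, after: {skew_after:.3f}")
```

Because the transform depends only on the rank of each value, the output is close to symmetric whatever the shape of the input. The price is that distances between the original values are no longer preserved.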


In [13]:
sc = StandardScaler()
features = ['TV', 'Radio', 'Newspaper']
df[features] = sc.fit_transform(df[features])  # standardize each column to zero mean and unit variance
df.head()

,TV,Radio,Newspaper,Sales
0,0.969852,0.981522,1.408193,22.1
1,-1.197376,1.082808,0.615594,10.4
2,-1.516155,1.528463,1.44762,9.3
3,0.05205,1.217855,1.086309,18.5
4,0.394182,-0.841614,1.063261,12.9
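
`StandardScaler` rescales a column by subtracting its mean and dividing by its standard deviation. The sketch below checks this against a manual computation on the five TV values shown earlier by `df.head()`. Because only five rows are used here, the scaled numbers differ from the table above, which was computed with the mean and standard deviation of all 200 rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five TV values from df.head(), as a single-column 2D array
tv = np.array([[230.1], [44.5], [17.2], [151.5], [180.8]])

scaled = StandardScaler().fit_transform(tv)

# z = (x - mean) / std; np.std defaults to ddof=0, the population formula StandardScaler uses
manual = (tv - tv.mean(axis=0)) / tv.std(axis=0)

print(np.allclose(scaled, manual))
```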


## Linear Regression

Linear regression predicts a continuous target as a weighted sum of the input features plus an intercept. It assumes a linear relationship between the independent variables and the target, and it chooses the coefficients that minimize the sum of squared differences between predicted and actual values on the training data. Each learned coefficient indicates how much predicted sales change per unit change in the corresponding (standardized) feature, which makes the model easy to interpret and useful for reasoning about advertising strategy. Its simplicity and interpretability make it a common baseline for regression tasks such as sales prediction.
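
As a minimal sketch of what fitting a linear regression involves, the block below solves the least-squares problem directly with NumPy and compares the result with scikit-learn's `LinearRegression`. The data is synthetic, and the coefficients in `true_coef`, the intercept of 14.0, and the noise level are made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))            # hypothetical standardized TV, Radio, Newspaper
true_coef = np.array([4.0, 3.0, 0.1])    # made-up coefficients for the illustration
y = 14.0 + X @ true_coef + rng.normal(scale=1.5, size=200)

# Least squares with an explicit intercept: prepend a column of ones to X
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

lr = LinearRegression().fit(X, y)

print("lstsq  :", beta)
print("sklearn:", np.r_[lr.intercept_, lr.coef_])
```

Both approaches minimize the same sum of squared residuals, so the intercept and coefficients agree up to floating-point error.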


In [14]:
# Split the data into training and testing sets with `train_test_split` from `sklearn.model_selection`.
# The results are assigned to `X_train`, `X_test`, `y_train`, and `y_test`.
# `X` is the feature matrix and `y` is the target variable.
# `test_size=0.2` holds out 20% of the data for testing and uses the remaining 80% for training.
# `random_state=43` makes the split reproducible across runs.
# Assigning the feature matrix and target variable
X= df[['TV','Radio','Newspaper']]
y= df['Sales']

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=43)

from sklearn.preprocessing import StandardScaler  # Importing the StandardScaler class from sklearn.preprocessing module.
Sc = StandardScaler()  # Creating an instance of the StandardScaler class.
X_train_scaled=Sc.fit_transform(X_train)  # Scaling/transforming the X_train data using the fit_transform method of StandardScaler.
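# Caution: fit_transform refits the scaler on the test data; Sc.transform(X_test) would reuse the training statistics and avoid leakage.
X_test_scaled=Sc.fit_transform(X_test)  # Scaling X_test, here with statistics computed from X_test itself.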

from sklearn.linear_model import LinearRegression # Importing the `LinearRegression` class from the `sklearn.linear_model` module
lr=LinearRegression() # Creating an instance of the linear regression model using the `LinearRegression` class.
lr.fit(X_train_scaled,y_train)# Fitting the linear regression model to the scaled training data
y_pred=lr.predict(X_test_scaled) # Using the trained linear regression model (lr) to predict the target variable
# Let's evaluate the model using MAE, RMSE, and R-squared
from sklearn import metrics # Importing the metrics module from the sklearn library

print('MAE:',metrics.mean_absolute_error(y_test,y_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
# Caution: r2_score expects (y_true, y_pred). The arguments below are reversed, so the printed value is not the conventional R-squared.
print('R-Squared',metrics.r2_score(y_pred,y_test))

MAE: 1.1999696198965595
RMSE: 1.4627859249075434
R-Squared 0.9177562774730732
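
The `r2_score` call above passes the predictions first, while scikit-learn's metric functions take `(y_true, y_pred)`. MAE and RMSE are symmetric in their arguments, but R-squared is not, because its denominator is the variance of whichever array is passed first. The sketch below shows the difference on a tiny example; `y_hat` holds hypothetical predictions invented for the illustration.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([22.1, 10.4, 9.3, 18.5, 12.9])   # Sales values from df.head()
y_hat = np.array([20.5, 12.3, 12.0, 17.6, 13.2])   # hypothetical predictions

mae_ab = metrics.mean_absolute_error(y_true, y_hat)
mae_ba = metrics.mean_absolute_error(y_hat, y_true)   # same value: MAE is symmetric

r2_ab = metrics.r2_score(y_true, y_hat)   # conventional order
r2_ba = metrics.r2_score(y_hat, y_true)   # reversed order gives a different number

# R-squared by hand: 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE: {mae_ab:.3f} vs {mae_ba:.3f}")
print(f"R2 (true, pred): {r2_ab:.3f}, R2 (pred, true): {r2_ba:.3f}, manual: {r2_manual:.3f}")
```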


## More Models

The linear regression model reaches an R-squared of about 0.92 on the test set (0.9178 as computed above). R-squared measures the share of variance explained rather than accuracy in the classification sense. To see whether other algorithms fit the data better, the next cell trains several additional regressors on the same split and evaluates them with the same metrics.

In [15]:
from sklearn.linear_model import Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor

# Initialize models
ridge = Ridge()
lasso = Lasso()
decision_tree = DecisionTreeRegressor()
random_forest = RandomForestRegressor()
gradient_boosting = GradientBoostingRegressor()
extra_trees = ExtraTreesRegressor()

models = [ridge, lasso, decision_tree, random_forest, gradient_boosting, extra_trees]
model_names = ['Ridge Regression', 'Lasso Regression', 'Decision Tree Regressor', 
               'Random Forest Regressor', 'Gradient Boosting Regressor', 'Extra Tree Regressor']

for model, name in zip(models, model_names):
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    
    print("Model:", name)
    print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
    print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    print('R-Squared:', metrics.r2_score(y_test, y_pred))
    print()


Model: Ridge Regression
MAE: 1.2011359777796375
RMSE: 1.4651593672417957
R-Squared: 0.9254257682285773

Model: Lasso Regression
MAE: 1.561692232662009
RMSE: 2.1015996449063965
R-Squared: 0.8465669108838352

Model: Decision Tree Regressor
MAE: 0.6899999999999998
RMSE: 0.906366371838673
R-Squared: 0.9714617969340973

Model: Random Forest Regressor
MAE: 0.5634000000000003
RMSE: 0.7091290080091202
R-Squared: 0.9825309391118419

Model: Gradient Boosting Regressor
MAE: 0.4861012023716853
RMSE: 0.5941273545909384
R-Squared: 0.9877375244898522

Model: Extra Tree Regressor
MAE: 0.46742499999999837
RMSE: 0.5604928411674829
R-Squared: 0.9890866220442421



## Conclusion

In summary, the Sales Prediction project leveraged linear regression and advanced ensemble techniques to develop robust machine learning models for forecasting product sales based on advertising investments. The dataset comprised key columns such as "TV", "Radio", "Newspaper", and "Sales".

Through data preprocessing, model training, and evaluation, we built a linear regression model that predicts sales figures from advertising expenditures. It reached an R-squared of about 0.92 on the held-out test set, indicating that a simple linear relationship between the advertising channels and sales already explains most of the variation in sales.

Moreover, we explored more flexible models and identified the Extra Trees Regressor as the top performer. In the run shown above, it achieved an R-squared of 0.9891 and an MAE of 0.4674 on the test set. To check generalization and guard against overfitting, we further validated the Extra Trees Regressor through cross-validation, achieving a low MAE value of 0.4367.

The insights generated from this project are invaluable for businesses seeking to estimate future sales and optimize advertising budgets. By harnessing the predictive power of these models, businesses can make data-driven decisions regarding resource allocation and advertising strategies, thereby maximizing profitability and driving sustainable growth.