# Task
Sales Prediction using Python

Sales prediction means estimating how much of a product people will buy based on factors such as the amount spent on advertising, the audience segment the advertising targets, and the platform it runs on. Product and service businesses typically rely on their data scientists to forecast future sales whenever they adjust their advertising budget. This notebook works through the task of sales prediction with machine learning in Python.


## Data loading

### Subtask:
Load the advertising dataset into a pandas DataFrame.


In [1]:
import pandas as pd

df = pd.read_csv('/content/Advertising.csv')
display(df.head())

,Unnamed: 0,TV,Radio,Newspaper,Sales
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9


## Data exploration

### Subtask:
Explore the dataset to understand its structure and characteristics.


In [2]:
# Drop the redundant row-number column that was read in from the CSV
df = df.drop(columns=["Unnamed: 0"])
print("Shape of the DataFrame:")
print(df.shape)
print("\nFirst 5 rows:")
display(df.head())
print("\nLast 5 rows:")
display(df.tail())
print("\nDataFrame information:")
# df.info() prints its report itself and returns None, so print() adds a trailing "None"
print(df.info())
print("\nDescriptive statistics:")
display(df.describe())
print("\nMissing values per column:")
print(df.isnull().sum())
print("\nNumber of duplicated rows:")
print(df.duplicated().sum())
print("\nCorrelation matrix:")
display(df.corr())

Shape of the DataFrame:
(200, 4)

First 5 rows:


,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9



Last 5 rows:


,TV,Radio,Newspaper,Sales
195,38.2,3.7,13.8,7.6
196,94.2,4.9,8.1,9.7
197,177.0,9.3,6.4,12.8
198,283.6,42.0,66.2,25.5
199,232.1,8.6,8.7,13.4



DataFrame information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TV         200 non-null    float64
 1   Radio      200 non-null    float64
 2   Newspaper  200 non-null    float64
 3   Sales      200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB
None

Descriptive statistics:


,TV,Radio,Newspaper,Sales
count,200.0,200.0,200.0,200.0
mean,147.0425,23.264,30.554,14.0225
std,85.854236,14.846809,21.778621,5.217457
min,0.7,0.0,0.3,1.6
25%,74.375,9.975,12.75,10.375
50%,149.75,22.9,25.75,12.9
75%,218.825,36.525,45.1,17.4
max,296.4,49.6,114.0,27.0



Missing values per column:
TV           0
Radio        0
Newspaper    0
Sales        0
dtype: int64

Number of duplicated rows:
0

Correlation matrix:


,TV,Radio,Newspaper,Sales
TV,1.0,0.054809,0.056648,0.782224
Radio,0.054809,1.0,0.354104,0.576223
Newspaper,0.056648,0.354104,1.0,0.228299
Sales,0.782224,0.576223,0.228299,1.0
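
To read the matrix at a glance, the Sales column can be pulled out and sorted. The sketch below is only illustrative: it rebuilds the five rows shown in the first-rows output above (the name `sample_df` is introduced here and is not part of the notebook), so its numbers will not match the full 200-row matrix.

```python
import pandas as pd

# The five rows displayed by df.head() above. The full dataset has 200 rows,
# so correlations computed from this sample differ from the matrix above.
sample_df = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8],
    "Radio": [37.8, 39.3, 45.9, 41.3, 10.8],
    "Newspaper": [69.2, 45.1, 69.3, 58.5, 58.4],
    "Sales": [22.1, 10.4, 9.3, 18.5, 12.9],
})

# Correlation of every feature with Sales, strongest first
sales_corr = (
    sample_df.corr()["Sales"]
    .drop("Sales")
    .sort_values(ascending=False)
)
print(sales_corr)
```

On the notebook's own DataFrame, the same expression would be `df.corr()["Sales"].drop("Sales").sort_values(ascending=False)`.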


## Data preparation

### Subtask:
Prepare the data for model training: separate the features (X) from the target variable (y), then split both into training and testing sets.


In [3]:
from sklearn.model_selection import train_test_split

# Define the feature matrix X and target variable y
X = df[['TV', 'Radio', 'Newspaper']]
y = df['Sales']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (160, 3)
X_test shape: (40, 3)
y_train shape: (160,)
y_test shape: (40,)


## Feature engineering

### Subtask:
Create pairwise interaction features between the advertising channels (TV, Radio, Newspaper), plus a squared term for each channel.


In [4]:
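# Work on explicit copies so the column assignments below cannot trigger
# pandas' SettingWithCopyWarning on the frames returned by train_test_split
X_train = X_train.copy()
X_test = X_test.copy()

# Create interaction features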
X_train['TV_Radio_interaction'] = X_train['TV'] * X_train['Radio']
X_train['TV_Newspaper_interaction'] = X_train['TV'] * X_train['Newspaper']
X_train['Radio_Newspaper_interaction'] = X_train['Radio'] * X_train['Newspaper']

X_test['TV_Radio_interaction'] = X_test['TV'] * X_test['Radio']
X_test['TV_Newspaper_interaction'] = X_test['TV'] * X_test['Newspaper']
X_test['Radio_Newspaper_interaction'] = X_test['Radio'] * X_test['Newspaper']

# Create squared features
X_train['TV_squared'] = X_train['TV']**2
X_train['Radio_squared'] = X_train['Radio']**2
X_train['Newspaper_squared'] = X_train['Newspaper']**2

X_test['TV_squared'] = X_test['TV']**2
X_test['Radio_squared'] = X_test['Radio']**2
X_test['Newspaper_squared'] = X_test['Newspaper']**2

# Display the first 5 rows of X_train and X_test
print("X_train:")
display(X_train.head())
print("\nX_test:")
display(X_test.head())

X_train:


,TV,Radio,Newspaper,TV_Radio_interaction,TV_Newspaper_interaction,Radio_Newspaper_interaction,TV_squared,Radio_squared,Newspaper_squared
79,116.0,7.7,23.1,893.2,2679.6,177.87,13456.0,59.29,533.61
197,177.0,9.3,6.4,1646.1,1132.8,59.52,31329.0,86.49,40.96
38,43.1,26.7,35.1,1150.77,1512.81,937.17,1857.61,712.89,1232.01
24,62.3,12.6,18.3,784.98,1140.09,230.58,3881.29,158.76,334.89
122,224.0,2.4,15.6,537.6,3494.4,37.44,50176.0,5.76,243.36



X_test:


,TV,Radio,Newspaper,TV_Radio_interaction,TV_Newspaper_interaction,Radio_Newspaper_interaction,TV_squared,Radio_squared,Newspaper_squared
95,163.3,31.6,52.9,5160.28,8638.57,1671.64,26666.89,998.56,2798.41
15,195.4,47.7,52.9,9320.58,10336.66,2523.33,38181.16,2275.29,2798.41
30,292.9,28.3,43.2,8289.07,12653.28,1222.56,85790.41,800.89,1866.24
158,11.7,36.9,45.2,431.73,528.84,1667.88,136.89,1361.61,2043.04
128,220.3,49.0,3.2,10794.7,704.96,156.8,48532.09,2401.0,10.24


## Model training

### Subtask:
Train a Linear Regression model on the training data.


In [5]:
from sklearn.linear_model import LinearRegression

# Instantiate a LinearRegression object
model = LinearRegression()

# Fit the Linear Regression model to the training data
model.fit(X_train, y_train)
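
Once fitted, a `LinearRegression` exposes its learned weights through the `coef_` and `intercept_` attributes. The sketch below shows how to read them on a small synthetic, noise-free dataset. The names `X_demo`, `y_demo`, and `demo_model` and the generating coefficients (3.0, 0.05, 0.2) are made up for illustration and say nothing about the advertising data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): two spending columns on different scales
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "Radio": rng.uniform(0, 50, 200),
})

# Hypothetical noise-free relationship, so the fit should recover it closely
y_demo = 3.0 + 0.05 * X_demo["TV"] + 0.2 * X_demo["Radio"]

demo_model = LinearRegression().fit(X_demo, y_demo)

# Pair each learned weight with the column it belongs to
coefficients = pd.Series(demo_model.coef_, index=X_demo.columns)
print(coefficients)
print("intercept:", demo_model.intercept_)
```

On the notebook's model, the equivalent would be `pd.Series(model.coef_, index=X_train.columns)`. Because the engineered features sit on very different scales (for example `TV_squared` versus `Radio`), the raw coefficient magnitudes are not directly comparable as measures of importance.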

## Model evaluation

### Subtask:
Evaluate the performance of the trained Linear Regression model on the test dataset.


In [6]:
from sklearn.metrics import mean_squared_error, r2_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

# Calculate the R-squared score (R2)
r2 = r2_score(y_test, y_pred)
print(f"R-squared (R2): {r2}")

Mean Squared Error (MSE): 0.4129102285379121
R-squared (R2): 0.9869181490609601


## Summary:

### 1. Q&A

*   **What is the relationship between advertising spending and sales?**
    *   The correlation matrix shows that there is a positive correlation between advertising spending and sales. TV has the highest correlation with Sales (0.78), followed by Radio (0.58), and Newspaper (0.23).
*   **What is the average sales value?**
    *   The average sales value is 14.02.
*   **How well does the model perform on the test data?**
    *   The model performs well on the 40-row test set. Its Mean Squared Error (MSE) is 0.4129, which corresponds to a root mean squared error of about 0.64 in the units of Sales, and its R-squared (R2) score is 0.9869. In other words, the model explains approximately 98.69% of the variance in test-set sales.
*   **How many rows and columns are there in the dataset?**
    *   After the redundant row-number column is dropped, the dataset contains 200 rows and 4 columns (TV, Radio, Newspaper, and Sales).
*   **Are there any missing or duplicate values in the dataset?**
    *   No, there are no missing values or duplicated rows in the dataset.

### 2. Data Analysis Key Findings

*   **Data Characteristics:** The dataset comprises 200 rows and 4 columns, representing spending on TV, Radio, and Newspaper advertising, along with the corresponding Sales figures.
*   **Average Spending:** The average advertising spending is 147.04 on TV, 23.26 on Radio, and 30.55 on Newspaper.
*   **Average Sales:** The average sales value is 14.02.
*   **Correlation:** All three channels correlate positively with Sales, but the strength varies: TV is strong (0.78), Radio is moderate (0.58), and Newspaper is weak (0.23).
*   **Data Integrity:** No missing or duplicated values were found in the dataset.
*   **Feature Engineering:** Six new features were created: three interaction features ('TV_Radio_interaction', 'TV_Newspaper_interaction', 'Radio_Newspaper_interaction') and three squared features ('TV_squared', 'Radio_squared', 'Newspaper_squared'), enriching the dataset for model training.
*   **Model Performance:** The trained Linear Regression model achieved a high R2 score of 0.9869 and a low MSE of 0.4129 on the test data.

### 3. Insights or Next Steps

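*   **Focus on TV and Radio Advertising:** TV and Radio spending correlate most strongly with sales, so these channels are natural candidates for priority in an advertising strategy. Correlation alone does not establish that extra spending causes extra sales, so any budget shift should be tested before it is scaled up.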
*   **Further Model Exploration:** The test-set score comes from a single 80/20 split with only 40 test rows, so a sensible next step is to confirm it with cross-validation. More complex or regularized models can then be compared against this Linear Regression baseline.
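
One way to check that a test-set score does not depend on a single split is k-fold cross-validation. The sketch below shows the idea under stated assumptions: it runs on synthetic stand-in data because the CSV is not reproduced here, the names `X_demo`, `y_demo`, and `pipeline` and the generating formula are invented for illustration, and the feature step uses `PolynomialFeatures` in place of the notebook's manual columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the advertising data (assumption, not the real CSV)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "Radio": rng.uniform(0, 50, 200),
    "Newspaper": rng.uniform(0, 100, 200),
})
# Hypothetical target with one interaction effect plus Gaussian noise
y_demo = (
    3.0
    + 0.05 * X_demo["TV"]
    + 0.10 * X_demo["Radio"]
    + 0.001 * X_demo["TV"] * X_demo["Radio"]
    + rng.normal(0, 0.5, 200)
)

# Bundling feature generation with the model means each fold
# rebuilds the features from its own training portion only
pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)

# Five folds: each row serves as held-out data exactly once
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="r2")
print("R2 per fold:", scores.round(4))
print("mean:", scores.mean(), "std:", scores.std())
```

On the real data, `X_demo` and `y_demo` would be replaced by the notebook's `X` and `y`. A small spread across folds would support the single-split result, while a large spread would suggest the 40-row test set was unusually favorable.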
