<a href="https://www.kaggle.com/code/damanjeetkaur/marketing-sales-data-evaluate-linear-regression?scriptVersionId=197513903" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Evaluate simple linear regression

## Introduction

We will use simple linear regression to explore the relationship between two continuous variables. 
We will perform a complete simple linear regression analysis, which includes creating and fitting a model, checking model assumptions, analyzing model performance, interpreting model coefficients, and communicating results to stakeholders.

## Step 1: Imports

### Import packages

In [None]:
# Import pandas, pyplot from matplotlib, and seaborn.

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

### Import the statsmodels module and the ols function

Import the `statsmodels.api` Python module using its common abbreviation, `sm`, along with the `ols()` function from `statsmodels.formula.api`.

In [None]:

import statsmodels.api as sm
from statsmodels.formula.api import ols

### Load the dataset

In [None]:

data = pd.read_csv("/kaggle/input/dummy-advertising-and-sales-data/Dummy Data HSS.csv")

# Display the first five rows.
data.head()


## Step 2: Data exploration

### Familiarize yourself with the data's features

We'll start with an exploratory data analysis.

The features in the data are:
* TV promotion budget (in millions of dollars)
* Social media promotion budget (in millions of dollars)
* Radio promotion budget (in millions of dollars)
* Sales (in millions of dollars)

Each row corresponds to an independent marketing promotion where the business invests in `TV`, `Social_Media`, and `Radio` promotions to increase `Sales`.

In [None]:
# Display the shape of the data.
data.shape

### Explore the independent variables

There are three continuous independent variables: `TV`, `Radio`, and `Social_Media`.
To understand how heavily the business invests in each promotion type, use `describe()` to generate descriptive statistics for these three variables.

In [None]:
# Generate descriptive statistics for the numeric columns.
data.describe()

### Explore the dependent variable

Before fitting the model, we'll make sure the `Sales` for each promotion (i.e., row) is present. 
If the `Sales` in a row is missing, that row isn't of much value to the simple linear regression model.

Display the percentage of missing values in the `Sales` column in the DataFrame `data`.

In [None]:
# Calculate the proportion of missing values in the Sales column.
missing_sales = data.Sales.isna().mean()

# Convert missing_sales from a proportion to a percentage and round to 2 decimal places.
missing_sales = round(missing_sales * 100, 2)

# Display the result.
print(f"Percentage of missing values in the Sales column: {missing_sales}%")

### Remove the missing data

Remove all rows in the data from which `Sales` is missing.

In [None]:
# Subset the data to include rows where Sales is present.

data = data.dropna(subset=['Sales'], axis=0)

# Confirm the shape of the data after dropping the rows.
data.shape

### Visualize the sales distribution

Create a histogram to visualize the distribution of `Sales`.

In [None]:
# Create a histogram of the Sales.
sns.histplot(data['Sales'])

# Add a title
plt.title("Distribution of Sales")
plt.show()


## Step 3: Model building

Now, we'll create a pairplot to visualize the relationships between pairs of variables in the data. 
We'll use this to visually determine which variable has the strongest linear relationship with `Sales`. 
This will help to select the X variable for the simple linear regression.

In [None]:
# Create a pairplot of the data.

sns.pairplot(data)
plt.show()

We'll select `TV` as the X variable for the linear regression model because it shows the strongest linear relationship with `Sales`.
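One way to back up the visual choice with a number is to compute the Pearson correlation of each candidate variable with `Sales`. The sketch below uses a small synthetic DataFrame (`demo`, with made-up values and assumed column names) instead of the Kaggle file, so it only illustrates the call pattern and says nothing about the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only; these values are NOT from the Kaggle dataset.
rng = np.random.default_rng(0)
n = 200
tv = rng.uniform(10, 100, n)
radio = rng.uniform(0, 50, n)
social_media = rng.uniform(0, 15, n)
# Sales is constructed to depend strongly on TV and only weakly on Radio.
sales = 3.5 * tv + 0.5 * radio + rng.normal(0, 5, n)

demo = pd.DataFrame(
    {"TV": tv, "Radio": radio, "Social_Media": social_media, "Sales": sales}
)

# Pearson correlation of each predictor with Sales, largest first.
corr_with_sales = demo.corr()["Sales"].drop("Sales").sort_values(ascending=False)
print(corr_with_sales)

# The predictor with the largest absolute correlation is a natural X candidate.
best_x = corr_with_sales.abs().idxmax()
print("Strongest linear relationship with Sales:", best_x)
```

The same `corr()` call could be applied to `data`. If the real DataFrame contains non-numeric columns, selecting the numeric ones first (for example with `data.select_dtypes("number")`) avoids errors in recent pandas versions.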

### Build and fit the model

In [None]:
# Define the OLS formula.

ols_formula = "Sales ~ TV"

# Create an OLS model.

OLS = ols(formula=ols_formula, data=data)

# Fit the model.

model = OLS.fit()

# Save the results summary.

results = model.summary()

# Display the model results.

print(results)

### Check model assumptions

To justify using simple linear regression, check that the four linear regression assumptions are not violated.
These assumptions are:

* Linearity
* Independent Observations
* Normality
* Homoscedasticity

### Model assumption: Linearity

The linearity assumption requires a linear relationship between the independent and dependent variables. 
We'll check this assumption by creating a scatterplot comparing the selected X variable (`TV`) with the dependent variable (`Sales`).

In [None]:
# Create a scatterplot comparing X and Sales (Y).

sns.scatterplot(x=data['TV'], y=data['Sales'])
plt.show()


**QUESTION:** Is the linearity assumption met?

Yes. The scatterplot shows a clear linear relationship between `TV` and `Sales`.

### Model assumption: Independence

The **independent observation assumption** states that each observation in the dataset is independent. Because each marketing promotion (i.e., row) is independent of the others, the independence assumption is not violated.

### Model assumption: Normality

The normality assumption states that the errors are normally distributed.

We'll create two plots to check this assumption:

* **Plot 1**: Histogram of the residuals
* **Plot 2**: Q-Q plot of the residuals

In [None]:
# Calculate the residuals.

residuals = model.resid


# Create a histogram with the residuals.
figure = sns.histplot(residuals)


# Set the x label of the residual plot.
figure.set_xlabel("residuals")

# Set the title of the residual plot.
figure.set_title("Histogram of residuals")
plt.show()

# Create a Q-Q plot of the residuals.
figure2 = sm.qqplot(residuals, line='s')
plt.show()


### Model assumption: Homoscedasticity

The **homoscedasticity (constant variance) assumption** is that the residuals have a constant variance for all values of `X`.

We'll check this by creating a scatterplot with the fitted values and residuals. 
Add a line at $y = 0$ to visualize the variance of residuals above and below $y = 0$.

In [None]:
# Create a scatterplot with the fitted values from the model and the residuals.

fitted_value = model.predict()
figure = sns.scatterplot(x=fitted_value, y=residuals)

# Set the x-axis label.
figure.set_xlabel("Fitted values")
# Set the y-axis label.
figure.set_ylabel("Residuals")
# Set the title.
figure.set_title("Fitted values vs. residuals")
# Add a line at y = 0 to visualize the variance of residuals above and below 0.
figure.axhline(0)

# Show the plot.
plt.show()

**QUESTION:** Is the homoscedasticity assumption met?

Yes. The residuals are spread evenly above and below $y = 0$ across the range of fitted values and show no clear pattern.

## Step 4: Results and evaluation

### Display the OLS regression results

If the linear regression assumptions are met, we can interpret the model results accurately.

In [None]:
# Display the model results.

print(results)

### Measure the uncertainty of the coefficient estimates

Model coefficients are estimates, so each one carries some uncertainty. A p-value and a $95\%$ confidence interval are reported with each coefficient to quantify the uncertainty in that estimate.
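As a minimal sketch of where these quantities live on a fitted statsmodels result, the `params`, `pvalues`, and `conf_int()` members expose the coefficient estimates, their p-values, and their confidence intervals. The example fits the same `Sales ~ TV` formula to synthetic data (`demo`, built with an assumed slope of 3.5) so that it runs without the Kaggle file. The numbers it prints are not results from the actual dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical data for illustration only; NOT the Kaggle dataset.
rng = np.random.default_rng(42)
tv = rng.uniform(10, 100, 300)
sales = 3.5 * tv + rng.normal(0, 5, 300)  # assumed slope of 3.5, intercept of 0
demo = pd.DataFrame({"TV": tv, "Sales": sales})

demo_model = ols(formula="Sales ~ TV", data=demo).fit()

# Point estimates for the intercept and the TV slope.
print(demo_model.params)

# p-values for the null hypothesis that each coefficient equals zero.
print(demo_model.pvalues)

# 95% confidence intervals (alpha=0.05); columns 0 and 1 hold the lower and upper bounds.
ci = demo_model.conf_int(alpha=0.05)
print(ci)

tv_low, tv_high = ci.loc["TV", 0], ci.loc["TV", 1]
print(f"TV slope: {demo_model.params['TV']:.3f} (95% CI {tv_low:.3f} to {tv_high:.3f})")
```

With the notebook's fitted `model`, the same members (`model.params`, `model.pvalues`, `model.conf_int()`) would apply.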
