<a href="https://colab.research.google.com/github/Krishanu-Saha/data-science/blob/main/Regression_analysis_on_Nairobi_Transport_demand.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - 



##### **Project Type**    - Regression
##### **Contribution**    - Individual


# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied. We are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact? 
Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business, and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from datetime import datetime
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.graphics.regressionplots import influence_plot
import statsmodels.formula.api as smf
from statsmodels.graphics.regressionplots import plot_regress_exog
from statsmodels.graphics.regressionplots import plot_leverage_resid2



### Dataset Loading

In [None]:
# Load Dataset
sales_df = pd.read_csv('/content/drive/MyDrive/Almabetter /project/REGRESSION/Rossmann Stores Data.csv')
stores_df = pd.read_csv('/content/drive/MyDrive/Almabetter /project/REGRESSION/store.csv')

### Dataset First View

In [None]:
# Dataset First Look
sales_df.head()

In [None]:
stores_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
sales_df.shape

In [None]:
stores_df.shape

The sales dataset contains 1,017,209 rows and 9 columns, whereas the stores dataset contains 1,115 rows and 10 columns.

### Dataset Information

In [None]:
sales_df.info()

In [None]:
stores_df.info()

#### Duplicate Values

In [None]:
#Number of duplicated data
len(stores_df[stores_df.duplicated()])

In [None]:
len(sales_df[sales_df.duplicated()])

We have zero duplicate rows, which is a good sign.
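The shape, duplicate, and missing-value checks in this section can be bundled into one helper. This is a minimal sketch on a made-up frame; `quality_report` and the `toy` data are illustrative names and values, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def quality_report(frame: pd.DataFrame) -> dict:
    """Summarize shape, duplicate rows, and missing values (hypothetical helper)."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_by_column": frame.isnull().sum().to_dict(),
    }

# Made-up data: one repeated row and one missing distance
toy = pd.DataFrame({
    "Store": [1, 2, 2, 3],
    "CompetitionDistance": [570.0, 14130.0, 14130.0, np.nan],
})
report = quality_report(toy)
print(report)
```

Calling `quality_report(sales_df)` and `quality_report(stores_df)` would give the same information as the separate cells in one place.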

#### Missing Values/Null Values

In [None]:
#Checking null values for every column
stores_df.isnull().sum()

In [None]:
sales_df.isnull().sum()

### Handling Missing and Null Values


In [None]:
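# Fill missing competition distances with the median value
# (assigning the result back avoids the chained in-place fillna, which newer pandas versions warn about)
stores_df['CompetitionDistance'] = stores_df['CompetitionDistance'].fillna(stores_df['CompetitionDistance'].median())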
     

In [None]:
# Fill competition-open-since month and year with the most frequent value (the mode) of each column
stores_df['CompetitionOpenSinceMonth'] = stores_df['CompetitionOpenSinceMonth'].fillna(stores_df['CompetitionOpenSinceMonth'].mode()[0])
stores_df['CompetitionOpenSinceYear'] = stores_df['CompetitionOpenSinceYear'].fillna(stores_df['CompetitionOpenSinceYear'].mode()[0])

In [None]:
# Impute the NaN values of the Promo2-related columns with 0
stores_df['Promo2SinceWeek'] = stores_df['Promo2SinceWeek'].fillna(0)
stores_df['Promo2SinceYear'] = stores_df['Promo2SinceYear'].fillna(0)
stores_df['PromoInterval'] = stores_df['PromoInterval'].fillna(0)
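The three imputation choices above (median, mode, and a zero placeholder) can be checked on a small made-up frame. This is a sketch only; `toy_stores` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up store data with one gap per strategy
toy_stores = pd.DataFrame({
    "CompetitionDistance": [100.0, 300.0, np.nan, 5000.0],
    "CompetitionOpenSinceYear": [2010.0, 2010.0, np.nan, 2013.0],
    "Promo2SinceYear": [2011.0, np.nan, np.nan, 2014.0],
})

# Median: not pulled upward by the single very large distance
toy_stores["CompetitionDistance"] = toy_stores["CompetitionDistance"].fillna(
    toy_stores["CompetitionDistance"].median()
)
# Mode: the most frequent opening year
toy_stores["CompetitionOpenSinceYear"] = toy_stores["CompetitionOpenSinceYear"].fillna(
    toy_stores["CompetitionOpenSinceYear"].mode()[0]
)
# Zero: a placeholder value for the Promo2 start year
toy_stores["Promo2SinceYear"] = toy_stores["Promo2SinceYear"].fillna(0)

print(toy_stores)
```

The median of the three known distances is 300, which is far less affected by the 5000 m value than the mean would be.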
     

In [None]:
#merge the datasets on stores data
df = sales_df.merge(right=stores_df, on="Store", how="left")
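A left merge like the one above silently changes the row count if the right-hand table has repeated keys. The sketch below shows how the `validate` and `indicator` options of `DataFrame.merge` can guard against that; the toy frames and their values are made up.

```python
import pandas as pd

# Made-up sales rows (many per store) and store metadata (one row per store)
toy_sales = pd.DataFrame({"Store": [1, 1, 2, 3], "Sales": [100, 200, 300, 400]})
toy_stores = pd.DataFrame({"Store": [1, 2, 3], "StoreType": ["c", "a", "a"]})

merged = toy_sales.merge(
    toy_stores,
    on="Store",
    how="left",
    validate="many_to_one",  # raises MergeError if a store appears twice on the right
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Sales rows whose store has no metadata would show up as 'left_only'
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```

With these checks, the merged frame is guaranteed to have exactly as many rows as the sales table.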
     

In [None]:
#first five rows of the merged dataset
df.head()
     

In [None]:

#shape of the dataframe
df.shape

In [None]:
#datatypes
df.info()

We need to change the datatypes of certain columns: Date and StateHoliday.

In [None]:
df['StateHoliday'].unique()

We have to convert these values to zero or one as appropriate.

### Feature engineering

In [None]:
# Collapse the holiday types a, b and c into a single flag: 1 = state holiday, 0 = no holiday
df['StateHoliday'] = df['StateHoliday'].replace({'0':0,'a':1,'b':1,'c':1})
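The same binary flag can be built without a lookup table. This is a minimal sketch that assumes the raw column may mix the string `'0'` with the integer `0`; the `raw` series is made up.

```python
import pandas as pd

# Made-up values mixing the string '0', the integer 0, and holiday codes
raw = pd.Series(["0", 0, "a", "b", "c"], name="StateHoliday")

# Cast everything to string so '0' and 0 collapse to one label,
# then flag any other code as a holiday
is_holiday = raw.astype(str).ne("0").astype(int)
print(is_holiday.tolist())
```

This form also yields a true integer column, whereas a dictionary-based replacement can leave the column with an object dtype.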

In [None]:
#Converting Date column into datetime datatype.
df['Date'] = pd.to_datetime(df['Date'])

In [None]:

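#creating features from the date
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
# .dt.weekofyear was removed in pandas 2.0; isocalendar().week gives the same ISO week number
df['WeekOfYear'] = df['Date'].dt.isocalendar().week.astype(int)
df['DayOfYear'] = df['Date'].dt.dayofyear
years = df['Year'].unique()
years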
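The date-derived features can be verified on two hand-picked dates. A small sketch, with the `dates` series chosen only for illustration:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2015-07-31", "2013-01-01"]))

features = pd.DataFrame({
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    # ISO week number; isocalendar() is available from pandas 1.1 onward
    "WeekOfYear": dates.dt.isocalendar().week.astype(int),
    "DayOfYear": dates.dt.dayofyear,
})
print(features)
```

Note that ISO weeks can assign the first days of January to week 52 or 53 of the previous year, so `WeekOfYear` is not always monotonic within a calendar year.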

### What did you know about your dataset?

The merged dataset consists of 1,017,209 rows and 22 columns. The target variable of our analysis is the 'Sales' column. The dataset has no duplicate rows and, after imputation, no missing values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
columns = list(df.columns)
columns

In [None]:
# Dataset Describe
df.describe()

In [None]:
df.info()

### Variables Description 

Store: An integer value representing the unique identifier for each store in the dataset.

DayOfWeek: An integer value representing the day of the week (1-7) when the sale was recorded.

Date: A date value representing the date when the sale was recorded.

Sales: A numerical value representing the amount of sales in a given day for a particular store.

Customers: An integer value representing the number of customers who made purchases on a particular day at a particular store.

Open: A binary value (0 or 1) indicating whether a store was open or closed on a given day.

Promo: A binary value (0 or 1) indicating whether a store was running a promotional offer on a given day.

StateHoliday: A categorical variable indicating the type of state holiday (if any) on a given day.

SchoolHoliday: A binary value (0 or 1) indicating whether a school holiday was on a given day.

StoreType: A categorical variable indicating the type of store.

Assortment: A categorical variable indicating the level of assortment (i.e., range of products) offered by a store.

CompetitionDistance: A numerical value representing the distance (in meters) to the nearest competitor store.

CompetitionOpenSinceMonth: An integer value representing the month when the nearest competitor store opened.

CompetitionOpenSinceYear: An integer value representing the year when the nearest competitor store opened.

Promo2: A binary value (0 or 1) indicating whether a store is participating in a continuous promotional offer (i.e., Promo2).

Promo2SinceWeek: An integer value representing the week when the store started participating in the continuous promotional offer (i.e., Promo2).

Promo2SinceYear: An integer value representing the year when the store started participating in the continuous promotional offer (i.e., Promo2).

PromoInterval: A categorical variable indicating the interval of continuous promotional offers, if any.

Year: An integer value representing the year of the recorded sale.

Month: An integer value representing the month of the recorded sale.

WeekOfYear: An integer value representing the week of the year when the sale was recorded.

DayOfYear: An integer value representing the day of the year when the sale was recorded.





### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

From this observation, it seems that StateHoliday, StoreType, Assortment and PromoInterval are categorical columns, but the StateHoliday column needs further investigation.

## 3. ***Data Wrangling***

### UNIVARIATE ANALYSIS

In [None]:
# Drop all rows where "Open" is equal to zero
# (copy so that later column assignments on `data` do not trigger SettingWithCopyWarning)
data = df[df['Open'] != 0].copy()

In [None]:
# Calculate the counts of each store type
store_counts = stores_df['StoreType'].value_counts()

# Create a pie chart using the store type counts
plt.figure(figsize = (12,8))
plt.pie(store_counts, labels=store_counts.index, autopct='%1.1f%%')

# Set the title of the plot
plt.title('Store Types')

# Display the plot
plt.show()

*   Store type a: 54%
*   Store type b: 1.5%
*   Store type c: 13.3%
*   Store type d: 31.2%

In [None]:
# Calculate the counts of each assortment type
store_counts = stores_df['Assortment'].value_counts()

# Create a pie chart using the assortment type counts
plt.figure(figsize = (12,8))
plt.pie(store_counts, labels=store_counts.index, autopct='%1.1f%%')

# Set the title of the plot
plt.title('Assortment Types')

# Display the plot
plt.show()

*   Assortment type a: 53.2%
*   Assortment type b: 0.8%
*   Assortment type c: 46.0%


### CUSTOMER ANALYSIS

In [None]:
data.info()

In [None]:


# Create a large figure for the scatter plot
plt.figure(figsize = (20,20))

# Scatter plot of daily sales against daily customer counts
plt.scatter(x = df['Customers'],y = df['Sales'])
plt.title('Sales vs. Customers')
plt.xlabel('Customers')
plt.ylabel('Sales')

# Show the plot
plt.show()

As we can observe, there is a direct (positive) relationship between customers and sales: more customers means more sales, as expected.

In [None]:
# Bar plot of the mean number of customers for each Promo value (sns.barplot shows the mean by default)
sns.barplot(x = 'Promo', y = 'Customers',data = df)
plt.title('Mean Customers by Promo')
plt.xlabel('Promo')
plt.ylabel('Customers')
plt.show()

Stores that run a promotion have more customers on average than those that do not.

In [None]:
# Bar plot of the mean number of customers for each StateHoliday value
sns.barplot(x = 'StateHoliday', y = 'Customers',data = df)
plt.title('Mean Customers by StateHoliday')
plt.xlabel('StateHoliday')
plt.ylabel('Customers')
plt.show()

Mean customer counts appear to be almost double on state holidays.

In [None]:
# Bar plot of the mean number of customers for each SchoolHoliday value
sns.barplot(x = 'SchoolHoliday', y = 'Customers',data = df)
plt.title('Mean Customers by SchoolHoliday')
plt.xlabel('SchoolHoliday')
plt.ylabel('Customers')
plt.show()

Mean customer counts are about the same on school holidays as on regular school days.

In [None]:
# Bar plot of the mean number of customers for each StoreType value
sns.barplot(x = 'StoreType', y = 'Customers',data = df)
plt.title('Mean Customers by StoreType')
plt.xlabel('StoreType')
plt.ylabel('Customers')
plt.show()

Mean customers per store are highest for store type b. This may indicate that type b stores are in demand, or, since the pie chart shows that type b stores are few in number, that their customers are concentrated in fewer stores.

In [None]:
# Bar plot of the mean number of customers for each Assortment value
sns.barplot(x = 'Assortment', y = 'Customers',data = df)
plt.title('Mean Customers by Assortment')
plt.xlabel('Assortment')
plt.ylabel('Customers')
plt.show()

Stores with assortment type b have the highest mean number of customers, which suggests they are in high demand.

In [None]:
# Define the list of columns to plot
col_count = ['DayOfWeek',  'Promo', 'StateHoliday', 'SchoolHoliday',
             'StoreType', 'Assortment', 'CompetitionOpenSinceMonth',
             'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek',
             'Promo2SinceYear', 'PromoInterval', 'Year', 'Month',
             'WeekOfYear']

# Create bar plots of the mean sales for each value in each column
for col in col_count:
    plt.figure(figsize=(12,6))
    data.groupby(col)['Sales'].mean().plot(kind='bar')
    plt.title('Mean Sales vs. ' + col)
    plt.xlabel(col)
    plt.ylabel('Mean Sales')
    plt.show()


**Graph 1 : Mean Sales vs DayOfWeek** Mean sales are highest on day 1 and day 7 and lowest on day 6. This may indicate that people wait for the holiday to go shopping.

**Graph 2 : Mean Sales vs Promo** Mean sales are higher on days when stores run a promotion.

**Graph 3 : Mean Sales vs StateHoliday** Mean sales are higher on state holidays.

**Graph 4 : Mean Sales vs SchoolHoliday** There isn't much difference in mean sales with respect to school holidays.

**Graph 5 : Mean Sales vs StoreType** Mean sales are highest for store type b, which could translate into higher profit.

**Graph 6 : Mean Sales vs Assortment** Mean sales are higher in stores with assortment type b.

**Graph 7 : Mean Sales vs CompetitionOpenSinceMonth** There isn't much variation in mean sales with CompetitionOpenSinceMonth.

**Graph 8 : Mean Sales vs CompetitionOpenSinceYear** Mean sales gradually decrease as the competitor's opening year becomes more recent.

**Graph 9 : Mean Sales vs Promo2** Stores that take part in the continuous promotion (Promo2) seem to have lower mean sales.

**Graph 10 : Mean Sales vs Promo2SinceWeek** There seems to be no pattern between Promo2SinceWeek and mean sales.

**Graph 11 : Mean Sales vs Promo2SinceYear** Mean sales gradually decrease from 2009 to 2013, show a slight bump in 2014, and decrease again in 2015.

**Graph 12 : Mean Sales vs PromoInterval** Mean sales are higher for stores whose promotion cycle starts in January (the Jan, Apr, Jul, Oct interval).

**Graph 13 : Mean Sales vs Year** Mean sales have increased over the years.

**Graph 14 : Mean Sales vs Month** Mean sales are higher towards the end of the year, mainly in Oct, Nov and Dec.

In [None]:
#Let's check the relationship between store type, assortment levels and sales
# (x, y and hue all come from `data`, so their rows stay aligned)
sns.barplot(data=data, x='StoreType', y='Sales', hue='Assortment')

From the bar graph above, we can observe that assortment type b is only available in store type b.

In store type b, assortment c yields the highest mean sales, so it could be profitable if offered in larger quantities.

In [None]:
# sns.factorplot was renamed to catplot; kind='point' reproduces the old default
sns.catplot(data = df, x ="Month", y = "Sales",
               col = 'StoreType' ,
               hue ='Promo',
               row = "Year",
               kind = 'point'
             )

Every store type follows the same kind of trend throughout the year. Sales generally increase towards the end of the year.

In [None]:
# Convert the Year and Month columns into a single date column
data['Date'] = pd.to_datetime(data['Year'].astype(str) + '-' + data['Month'].astype(str) + '-1')

# Group the data by year and month and compute the total sales for each group
sales_by_year_month = df.groupby(['Year', 'Month'])['Sales'].sum()

# Create a figure and axis for the plot
fig, ax = plt.subplots(figsize=(10, 5))

# Loop through each year and plot the monthly sales as a line on the same axis
for year in sales_by_year_month.index.levels[0]:
    sales_by_month = sales_by_year_month.loc[year]
    ax.plot(sales_by_month.index, sales_by_month.values, label=str(year))

# Add axis labels and a legend to the plot
ax.set_xlabel('Month')
ax.set_ylabel('Total Sales')
ax.set_title('Monthly Sales Over Time by Year')
ax.legend()

# Display the plot
plt.show()

We can expect an increase in sales in the period from month 10 to month 12.

In [None]:
data.columns


In [None]:
num_columns = [ 'DayOfWeek', 'Sales', 'Customers',  'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Year', 'Month', 'WeekOfYear',  'DayOfYear']
        
 # Create a correlation matrix of the numerical columns
correlation_matrix = data[num_columns].corr()

# Create the heatmap
plt.figure(figsize=(10,5))
sns.heatmap(correlation_matrix, annot=True)

# Set the title of the plot
plt.title('Correlation Heatmap')

# Display the plot
plt.show()      
     

### HYPOTHESIS TESTING 

**1) The effect of promotions on sales: We are testing whether there is a significant difference in sales between days when there is a promotion versus days when there is no promotion.**

Null hypothesis: There is no significant difference in sales between days with and without a promotion.

Alternative hypothesis: Days with a promotion have significantly higher sales than days without a promotion.

In [None]:
df = data.copy()

In [None]:
import scipy.stats as stats

# Filter the data to include only days with promotions and non-promotions
promo_sales = df[df['Promo']==1]['Sales']
nonpromo_sales = df[df['Promo']==0]['Sales']

# Test for difference in means using two-sample t-test
t, p = stats.ttest_ind(promo_sales, nonpromo_sales, equal_var=False)
print('t-value: {:.2f}, p-value: {:.4f}'.format(t, p))
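The alternative hypothesis stated above is directional (promotion days have higher sales), while `ttest_ind` is two-sided by default. The sketch below shows a one-sided Welch test on synthetic data; it assumes SciPy 1.6 or later for the `alternative` argument, and the two samples are randomly generated, not taken from the Rossmann data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic daily sales, for illustration only
promo = rng.normal(loc=8000, scale=2000, size=500)
no_promo = rng.normal(loc=6000, scale=2000, size=500)

# Welch's t-test (unequal variances), testing mean(promo) > mean(no_promo)
t_stat, p_one_sided = stats.ttest_ind(
    promo, no_promo, equal_var=False, alternative="greater"
)
print(f"t = {t_stat:.2f}, one-sided p = {p_one_sided:.4g}")
```

A small p-value here supports the directional claim, whereas the two-sided test only says that the means differ.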

**2) The effect of competition on sales: We are testing whether there is a significant correlation between the distance to the nearest competitor and sales.**

Null hypothesis: There is no significant relationship between competition distance and store sales.

Alternative hypothesis: Stores located closer to competitors have significantly lower sales than stores located farther away.

In [None]:
from scipy.stats import pearsonr

# Calculate the Pearson correlation coefficient and p-value between sales and competition distance
corr, p = pearsonr(df['Sales'], df['CompetitionDistance'])
print('Correlation coefficient: {:.2f}, p-value: {:.4f}'.format(corr, p))

**3) The effect of store type on sales: We are testing whether there is a significant difference in sales between different types of stores.**

Null hypothesis: There is no significant difference in sales between different store types.

Alternative hypothesis: Some store types have significantly higher sales than others.

In [None]:
from scipy.stats import f_oneway

# Split sales into one group per store type (a, b, c and d)
store_a_sales = df[df['StoreType']=='a']['Sales']
store_b_sales = df[df['StoreType']=='b']['Sales']
store_c_sales = df[df['StoreType']=='c']['Sales']
store_d_sales = df[df['StoreType']=='d']['Sales']

# Test for difference in means using one-way ANOVA
f, p = f_oneway(store_a_sales, store_b_sales, store_c_sales, store_d_sales)
print('F-value: {:.2f}, p-value: {:.4f}'.format(f, p))
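Building the ANOVA groups with `groupby` makes sure every store type is included without listing them by hand. A minimal sketch on synthetic data; the `toy` frame and its group means are invented for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
# Synthetic sales for four store types; type b is given a higher mean on purpose
toy = pd.DataFrame({
    "StoreType": np.repeat(["a", "b", "c", "d"], 200),
    "Sales": np.concatenate([
        rng.normal(7000, 1500, 200),
        rng.normal(10000, 1500, 200),
        rng.normal(7000, 1500, 200),
        rng.normal(6800, 1500, 200),
    ]),
})

# One array of sales per store type, whatever types are present
groups = [g["Sales"].to_numpy() for _, g in toy.groupby("StoreType")]
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```

A significant F-value only says that at least one group mean differs; it does not identify which store type stands out.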

### What all manipulations have you done and insights you found?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

## 4. Feature Manipulation & Selection

### feature manipulation

In [None]:
# Encode your categorical columns
df.columns

In [None]:
df1 = df.copy()

In [None]:
# Convert CompetitionOpenSinceMonth column to integer data type
df1['CompetitionOpenSinceMonth'] = df1['CompetitionOpenSinceMonth'].astype(int)

# Convert CompetitionOpenSinceYear column to integer data type
df1['CompetitionOpenSinceYear'] = df1['CompetitionOpenSinceYear'].astype(int)

In [None]:
#changing promo2 features into meaningful inputs
#combining promo2 to total months
df1['Promo2Open'] = (df1['Year'] - df1['Promo2SinceYear'])*12 + (df1['WeekOfYear'] - df1['Promo2SinceWeek'])*0.230137

#correcting the neg values
df1['Promo2Open'] = df1['Promo2Open'].apply(lambda x:0 if x < 0 else x)*df1['Promo2']

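#creating a feature for promo interval and checking if promo2 was running in the sale month
def promo2running(row):
  month_dict = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun', 7:'Jul', 8:'Aug', 9:'Sept', 10:'Oct', 11:'Nov', 12:'Dec'}
  try:
    # Work on the row passed in by apply(), not on the whole dataframe;
    # PromoInterval is a string such as 'Jan,Apr,Jul,Oct' (or 0 where it was imputed)
    months = row['PromoInterval'].split(',')
    if row['Month'] and month_dict[row['Month']] in months:
      return 1
    else:
      return 0
  except Exception:
    return 0

#Applying row-wise; stored in its own column so the Promo2Open months computed above are not overwritten
df1['Promo2Running'] = df1.apply(promo2running,axis=1)*df1['Promo2']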

#Dropping unecessary columns
df1.drop(['Promo2SinceYear','Promo2SinceWeek'],axis=1,inplace=True)
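The month check can be tried out on a toy frame. This is a minimal sketch, assuming `PromoInterval` is a comma-separated string of month abbreviations (or `0` where it was imputed); `promo2_running`, `MONTH_ABBR`, `Promo2Running` and the toy rows are illustrative names and values.

```python
import pandas as pd

MONTH_ABBR = {1: "Jan", 2: "Feb", 3: "Mar", 4: "Apr", 5: "May", 6: "Jun",
              7: "Jul", 8: "Aug", 9: "Sept", 10: "Oct", 11: "Nov", 12: "Dec"}

def promo2_running(row) -> int:
    """Return 1 if the row's month is one of the months listed in its PromoInterval."""
    interval = row["PromoInterval"]
    if not isinstance(interval, str):
        return 0  # imputed placeholder, no interval available
    return int(MONTH_ABBR[row["Month"]] in interval.split(","))

# Made-up rows: the last store does not take part in Promo2
toy = pd.DataFrame({
    "Month": [1, 2, 9, 7],
    "Promo2": [1, 1, 1, 0],
    "PromoInterval": ["Jan,Apr,Jul,Oct", "Jan,Apr,Jul,Oct", "Mar,Jun,Sept,Dec", 0],
})
toy["Promo2Running"] = toy.apply(promo2_running, axis=1) * toy["Promo2"]
print(toy)
```

The function reads from the `row` argument that `apply(..., axis=1)` passes in. Reading from the full dataframe inside the function would raise an exception on every row.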

In [None]:
# Define the bin edges and labels
bin_edges = [0, 500, 1500, 3000, 5000, np.inf]
bin_labels = ['near', 'medium', 'far', 'very far', 'extreme']

# Create the CompetitionDistanceGroup column
df1['CompetitionDistanceGroup'] = pd.cut(df1['CompetitionDistance'], bins=bin_edges, labels=bin_labels)

# Show the first 5 rows of the new column
print(df1[['CompetitionDistance', 'CompetitionDistanceGroup']].head())
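The behaviour of `pd.cut` at the bin edges is easy to check on a few hand-picked distances. A small sketch using the same edges and labels; the `distances` values are made up.

```python
import numpy as np
import pandas as pd

bin_edges = [0, 500, 1500, 3000, 5000, np.inf]
bin_labels = ["near", "medium", "far", "very far", "extreme"]

# Made-up distances, including two values right at a bin edge
distances = pd.Series([20, 500, 501, 4000, 75860])
groups = pd.cut(distances, bins=bin_edges, labels=bin_labels)
print(groups.tolist())

# Bins are right-inclusive by default, so a distance of exactly 0 gets no group
zero_group = pd.cut(pd.Series([0.0]), bins=bin_edges, labels=bin_labels)
print(zero_group.isna().all())
```

If a distance of 0 can occur, passing `include_lowest=True` places it in the first bin instead of leaving it missing.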

In [None]:
# Create 'AvgSalesPerCustomer': sales per customer for each store-day (0/0 gives NaN when a store had no customers)
df1['AvgSalesPerCustomer'] = df1['Sales'] / df1['Customers']

In [None]:
# Fill missing values in the AvgSalesPerCustomer column with the mean
df1['AvgSalesPerCustomer'] = df1['AvgSalesPerCustomer'].fillna(df1['AvgSalesPerCustomer'].mean())
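Days with zero customers make the ratio undefined. The sketch below shows one way to handle that explicitly before filling with the mean; the `toy` frame is made up.

```python
import numpy as np
import pandas as pd

# Made-up store-days; the second one had no customers and no sales
toy = pd.DataFrame({"Sales": [6000.0, 0.0, 4500.0], "Customers": [600, 0, 500]})

# Turn zero customers into NaN so the division never produces inf
ratio = toy["Sales"] / toy["Customers"].replace(0, np.nan)

# Fill the undefined ratios with the mean of the defined ones
ratio = ratio.fillna(ratio.mean())
print(ratio.tolist())
```

Because this ratio is computed from `Sales`, it would not be available for days whose sales are not yet known, which is worth keeping in mind when using it as a model feature.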

### ENCODING

In [None]:
# Creating variable which stores feature names.
X_features = [ 'Store', 'DayOfWeek', 'Sales', 'Customers', 'Open', 'Promo',
       'StateHoliday', 'SchoolHoliday', 'StoreType', 'Assortment',
        'CompetitionOpenSinceMonth','CompetitionOpenSinceYear', 'Promo2', 'PromoInterval', 'Year', 'Month',
       'WeekOfYear', 'DayOfYear', 'Promo2Open', 'CompetitionDistanceGroup',
       'AvgSalesPerCustomer'
          ]

In [None]:
# Define the categorical features to be one-hot encoded
categorical_features = ['StoreType', 'Assortment', 'PromoInterval', 'CompetitionDistanceGroup']

# Use Pandas get_dummies() function to perform one-hot encoding on the selected categorical features
encoded_df = pd.get_dummies(df1[X_features], columns=categorical_features, drop_first=True)

# The encoded_df DataFrame now has one-hot encoded columns for each of the selected categorical features
encoded_df.head()
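The effect of `drop_first=True` is easiest to see on a tiny frame. A minimal sketch with made-up values; `dtype=int` is an optional choice here so that the dummy columns are 0/1 integers.

```python
import pandas as pd

# Made-up frame with one categorical column
toy = pd.DataFrame({"Sales": [100, 200, 300, 400], "StoreType": ["a", "b", "c", "d"]})

encoded = pd.get_dummies(toy, columns=["StoreType"], drop_first=True, dtype=int)
print(encoded)
```

The first category (`a`) gets no column of its own and is represented by all dummy columns being zero. This avoids the perfect collinearity that a full set of dummies would have with the intercept.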

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

In [None]:
df2 = encoded_df.copy()

In [None]:
#Storing feature names in index variable.
index = ['Store', 'DayOfWeek','Customers', 'Open', 'Promo',
       'StateHoliday', 'SchoolHoliday', 'CompetitionOpenSinceMonth',
       'CompetitionOpenSinceYear', 'Promo2', 'Year', 'Month', 'WeekOfYear',
       'DayOfYear', 'Promo2Open', 'AvgSalesPerCustomer', 'StoreType_b',
       'StoreType_c', 'StoreType_d', 'Assortment_b', 'Assortment_c',
       'PromoInterval_Feb,May,Aug,Nov', 'PromoInterval_Jan,Apr,Jul,Oct',
       'PromoInterval_Mar,Jun,Sept,Dec', 'CompetitionDistanceGroup_medium',
       'CompetitionDistanceGroup_far', 'CompetitionDistanceGroup_very far',
       'CompetitionDistanceGroup_extreme']

In [None]:
# Add a constant term to the feature matrix for the intercept
# (built from the encoded feature columns; cast to float so statsmodels gets a numeric matrix)
X = sm.add_constant(df2[index].astype(float))

# Set the target variable
Y = df2['Sales']

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)


In [None]:
# Fit OLS regression model
model_1 = sm.OLS(y_train, x_train).fit()

# Print summary of model results
model_1.summary2()

#### 2. Feature Selection

**HANDLING MULTI-COLLINEARITY**

In [None]:
def get_vif_factors(X):
  # Cast to float so boolean or integer dummy columns do not produce an object array
  X_matrix = X.to_numpy(dtype=float)
  vif = [ variance_inflation_factor(X_matrix,i) for i in range(X.shape[1])]
  vif_factors = pd.DataFrame()
  vif_factors['column'] = X.columns
  vif_factors['vif'] = vif

  return vif_factors

In [None]:
vif_factors = get_vif_factors(df2[index])
vif_factors
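The variance inflation factor of feature j is 1 / (1 - R_j²), where R_j² comes from regressing feature j on all other features. The sketch below computes it by hand with NumPy on synthetic data; `vif_by_hand` is an illustrative helper and includes an intercept in each auxiliary regression. Note that statsmodels' `variance_inflation_factor` regresses each column on the others exactly as passed, so its values only match this definition when the matrix already contains a constant column.

```python
import numpy as np

def vif_by_hand(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of a

vifs = vif_by_hand(np.column_stack([a, b, c]))
print(vifs.round(1))
```

The two nearly identical columns get a large VIF, while the independent column stays close to 1.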

**CHECKING CORRELATION OF COLUMNS WITH LARGE VIFs**

In [None]:

columns_with_large_vif =['PromoInterval_Feb,May,Aug,Nov','PromoInterval_Jan,Apr,Jul,Oct','PromoInterval_Mar,Jun,Sept,Dec']

Next, we plot the correlation heatmap for the features with a VIF greater than 4.

In [None]:
plt.figure(figsize = (12,10))
sns.heatmap(df2[columns_with_large_vif].corr(),annot = True)
plt.title(" Heatmap depicting correlation between features")

In [None]:
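# Keep open-store rows only (copy so that later assignments to df3 do not raise SettingWithCopyWarning)
df3 = df2[df2['Open']==1].copy()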

In [None]:
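# Rebuild the design matrix from the open-store rows, then split into train and test sets
x = sm.add_constant(df3[index].astype(float))
y = df3['Sales']
x_train, x_test,y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)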

In [None]:
# Fit the second OLS model on the open-store training data
model_2 = sm.OLS(y_train,x_train).fit()
model_2.summary2()

##### What all feature selection methods have you used  and why?

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

**RESIDUAL ANALYSIS**

Test for Normality of Residuals(P-P plot)

In [None]:
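def draw_pp_plot(model,title):
  probplot = sm.ProbPlot(model.resid)
  # Draw on an explicit axes; otherwise ppplot opens its own figure and leaves this one empty
  fig, ax = plt.subplots(figsize = (8,6))
  probplot.ppplot(line='45', ax=ax)
  ax.set_title(title)
  plt.show()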

In [None]:
draw_pp_plot(model_2,"Normal P-P Plot of Regression Standardized Residuals")

Residual Plot for Homoscedasticity and Model Specification 

In [None]:
def get_standardized_values(vals):
  return (vals - vals.mean())/vals.std()

In [None]:
def plot_resid_fitted(fitted,resid,title):
  plt.scatter(get_standardized_values(fitted),get_standardized_values(resid))
  plt.title(title)
  plt.xlabel("Standardized predicted values")
  plt.ylabel("Standardized residuals")
  plt.show()

In [None]:
plot_resid_fitted(model_2.fittedvalues,model_2.resid,"Residual Plot")

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?


In [None]:
# Apply a square-root transform to the target
y_train = np.sqrt(y_train)

In [None]:
model_3 = sm.OLS(y_train,x_train).fit()


In [None]:
model_3.summary2()

In [None]:
draw_pp_plot(model_3,"Normal P-P Plot of Regression Standardized Residuals")

In [None]:
df3['Sales'] = np.sqrt(df3['Sales'])

In [None]:
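# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(x=df3['Sales'], kde=True, stat='density')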

###  Handling Outliers

### **OUTLIER DETECTION**

In [None]:
from scipy.stats import zscore 

In [None]:
df4 = df3.copy()

In [None]:
df4['zscore'] = zscore(df4['Sales'])

In [None]:
df4[(df4['zscore']>3.0) | (df4['zscore']<-3)]

In [None]:
df_no_outliers = df4[~((df4['zscore']>3.0) | (df4['zscore']<-3))]

In [None]:
df5 = df_no_outliers.drop('zscore', axis=1)
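The z-score rule used above can be seen in isolation on synthetic numbers. A minimal sketch; the `values` array is randomly generated with one planted outlier.

```python
import numpy as np
from scipy.stats import zscore

rng = np.random.default_rng(0)
# Synthetic values around 80 plus one planted extreme value
values = np.append(rng.normal(loc=80, scale=10, size=1000), [250.0])

z = zscore(values)
# Keep observations within three standard deviations of the mean
kept = values[np.abs(z) <= 3]
print(len(values), len(kept))
```

Because the mean and standard deviation are themselves affected by extreme values, this rule works best when outliers are rare.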


In [None]:
df5.columns

### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler 

In [None]:
index =['Store', 'DayOfWeek', 'Customers', 'Open', 'Promo',
       'StateHoliday', 'SchoolHoliday', 'CompetitionOpenSinceMonth',
       'CompetitionOpenSinceYear', 'Promo2', 'Year', 'Month', 'WeekOfYear',
       'DayOfYear', 'Promo2Open', 'AvgSalesPerCustomer', 'StoreType_b',
       'StoreType_c', 'StoreType_d', 'Assortment_b', 'Assortment_c',
       'PromoInterval_Feb,May,Aug,Nov', 'PromoInterval_Jan,Apr,Jul,Oct',
       'PromoInterval_Mar,Jun,Sept,Dec', 'CompetitionDistanceGroup_medium',
       'CompetitionDistanceGroup_far', 'CompetitionDistanceGroup_very far',
       'CompetitionDistanceGroup_extreme']

In [None]:
#Initializing the StandardScaler
X_scaler = StandardScaler()
#Standardize all the feature columns
X_scaled = X_scaler.fit_transform(df5[index])

#Standardize Y explicitly by subtracting the mean and dividing by the standard deviation
Y = (df5['Sales']-df5['Sales'].mean())/df5['Sales'].std()
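The cell above standardizes the features with `StandardScaler` and the target with pandas' `.std()`. The two use slightly different standard deviations, which the sketch below makes visible on a made-up series.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# StandardScaler divides by the population standard deviation (ddof=0)
scaled_sklearn = StandardScaler().fit_transform(x.to_frame()).ravel()

# pandas .std() defaults to the sample standard deviation (ddof=1)
scaled_pandas = ((x - x.mean()) / x.std()).to_numpy()
scaled_population = ((x - x.mean()) / x.std(ddof=0)).to_numpy()

print(np.allclose(scaled_sklearn, scaled_population))
print(np.allclose(scaled_sklearn, scaled_pandas))
```

With around a million rows the difference is negligible, but it explains small mismatches when the two approaches are compared on small samples.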

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
x_train,x_test,y_train,y_test = train_test_split(X_scaled,Y,test_size = 0.2,random_state = 42)

##### What data splitting ratio have you used and why? 

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def calculate_metrics(model_name, model, x_train, x_test, y_train, y_test):
   
    # Make predictions on the training and test sets
    y_train_pred = model.predict(x_train)
    y_test_pred = model.predict(x_test)

    # Calculate the metrics
    train_mae = mean_absolute_error(y_train, y_train_pred)
    train_mse = mean_squared_error(y_train, y_train_pred)
    train_rmse = np.sqrt(train_mse)  # the squared=False argument is not accepted by recent scikit-learn releases
    train_r2 = r2_score(y_train, y_train_pred)
    n = len(y_train)
    k = x_train.shape[1]  # number of independent variables
    train_adj_r2 = 1 - ((1 - train_r2) * (n - 1) / (n - k - 1))

    test_mae = mean_absolute_error(y_test, y_test_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    test_rmse = np.sqrt(test_mse)  # the squared=False argument is not accepted by recent scikit-learn releases
    test_r2 = r2_score(y_test, y_test_pred)
    n = len(y_test)
    k = x_test.shape[1]  # number of independent variables
    test_adj_r2 = 1 - ((1 - test_r2) * (n - 1) / (n - k - 1))

    data = {
        'Model_Name': [model_name],
        'Train_MAE': [train_mae],
        'Train_MSE': [train_mse],
        'Train_RMSE': [train_rmse],
        'Train_R2': [train_r2],
        'Train_Adj_R2': [train_adj_r2],
        'Test_MAE': [test_mae],
        'Test_MSE': [test_mse],
        'Test_RMSE': [test_rmse],
        'Test_R2': [test_r2],
        'Test_Adj_R2': [test_adj_r2]
    }
    df = pd.DataFrame(data)
    return df
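The adjusted R² used in `calculate_metrics` penalizes R² for the number of features. A small worked sketch of the same formula; `adjusted_r2` is an illustrative helper and the inputs are made-up numbers.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Made-up example: R^2 of 0.90 from 101 samples and 10 features
example = adjusted_r2(r2=0.90, n_samples=101, n_features=10)
print(round(example, 4))
```

With very large samples and a few dozen features, the correction factor is close to 1, so adjusted R² and R² are nearly identical.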


In [None]:
from sklearn.linear_model import Ridge, Lasso

In [None]:
ridge = Ridge()
ridge.fit(x_train,y_train)

In [None]:
ridge.coef_

**Calculate RMSE and R2 score**

In [None]:
metrics_1 = calculate_metrics('ridge regression',ridge,x_train, x_test, y_train, y_test)

In [None]:
metrics_1

Based on these evaluation metrics, the ridge regression model performs well on both the training and test sets.

The RMSE values of 0.263 for the training set and 0.261 for the test set mean that the root-mean-square prediction error is about 0.26 units on both sets. Because the target is the standardized square root of sales, these units are standard deviations of the transformed target rather than sales amounts. Lower RMSE values indicate better model performance.

The R-squared (R²) score indicates how well the model fits the data. It is at most 1, with higher values indicating a better fit. The R² scores of 0.931 for the training set and 0.932 for the test set suggest that the model explains a large proportion of the variance in the target variable on both sets.

Overall, the RMSE and R² scores suggest that the ridge regression model makes accurate predictions on both sets, and the near-identical training and test scores show no sign of overfitting.

In [None]:
y_pred = ridge.predict(x_test)

In [None]:
# Create distribution plot
plt.figure(figsize=(10,8))
sns.kdeplot(y_test, label='Actual Values')
sns.kdeplot(y_pred, label='Predicted Values')
plt.xlabel('Target Value')
plt.ylabel('Density')
plt.title('Distribution Plot')
plt.legend()
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import mean_absolute_error, make_scorer

In [None]:
ridge = Ridge()
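# Only `alpha` is tuned: the `normalize` option was removed from Ridge in
# scikit-learn 1.2, so features should be scaled before fitting instead.
params = {
    'alpha': [0.001, 0.01, 0.1, 1, 10, 100]
}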

In [None]:
ridge_cv = GridSearchCV(ridge, params, cv=5)
ridge_cv.fit(x_train, y_train)

In [None]:
ridge_best = Ridge(**ridge_cv.best_params_)
ridge_best.fit(x_train, y_train)

In [None]:
metrics_2 = calculate_metrics('Ridge Regression (tuned)', ridge_best, x_train, x_test, y_train, y_test)

In [None]:
metrics_2

With 5-fold cross-validation, the mean RMSE was 0.26 with a standard deviation of 0.05. This suggests that the model's performance is consistent across different subsets of the data, since the cross-validated RMSE values are close to the RMSE obtained on the train and test sets.
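
Note that `GridSearchCV` scores regressors with R² unless a `scoring` argument is passed, so a per-fold RMSE has to be requested explicitly. The sketch below shows one way to obtain the mean and standard deviation with `cross_val_score`. It runs on synthetic stand-in data (`X_demo`, `y_demo` are assumptions); in the notebook, `x_train` and `y_train` would take their place.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption), used only so the example is runnable.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.3, size=200)

# scikit-learn maximizes scores, so RMSE is exposed as a negative value.
neg_rmse = cross_val_score(
    Ridge(alpha=1.0),
    X_demo,
    y_demo,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
fold_rmse = -neg_rmse  # flip the sign back to a positive RMSE per fold

print("Per-fold RMSE:", np.round(fold_rmse, 3))
print(f"Mean RMSE: {fold_rmse.mean():.3f}, std: {fold_rmse.std():.3f}")
```

A small standard deviation relative to the mean indicates that no single fold dominates the error.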

After cross-validation and hyperparameter tuning, the ridge regression model still performs well on both the training and test sets.

The metrics match those of the untuned model: the RMSE is 0.263 on the training set and 0.261 on the test set, and the R² is 0.931 and 0.932 respectively. The tuned model therefore still explains a large proportion of the variance in the target variable and shows no sign of overfitting to the training data.

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the Random Forest model
#rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model on the training data
#rf_model.fit(x_train, y_train)

In [None]:
# Metrics recorded from an earlier Random Forest run (the fitting code above is commented out)
data = {
    "Model_Name": 'RandomForestRegressor',
    "Train_MAE": 0.000619,
    "Train_MSE": 0.000005,
    "Train_RMSE": 0.002225,
    "Train_R2": 0.999995,
    "Train_Adj_R2": 0.999995,
    "Test_MAE": 0.001475,
    "Test_MSE": 0.000041,
    "Test_RMSE": 0.006436,
    "Test_R2": 0.999958,
    "Test_Adj_R2": 0.999908
}

In [None]:
metrics_3 = pd.DataFrame(data,index=[0])

In [None]:
metrics_3

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
#Now we can create a RandomForestRegressor object and define a set of hyperparameters to tune
#rf = RandomForestRegressor(random_state=42)
#params = {
#    'n_estimators': [10, 50, 100],
#    'max_depth': [5, 10, None],
#    'max_features': ['sqrt', 'log2', 0.5]
#}

#We'll use GridSearchCV to search over these hyperparameters to find the best set of hyperparameters. We'll specify the number of folds for cross-validation using the cv parameter
#rf_cv = GridSearchCV(rf, params, cv=5)
#rf_cv.fit(x_train, y_train)

#Now we can use the best hyperparameters to train a RandomForestRegressor model on the entire training set
#rf_best = RandomForestRegressor(**rf_cv.best_params_, random_state=42)
#rf_best.fit(x_train, y_train)

#metrics_4 = calculate_metrics("RandomForest_cross_validated",rf_best,x_train, x_test, y_train, y_test)

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
lasso = Lasso()
lasso.fit(x_train,y_train)

In [None]:
lasso.coef_

In [None]:
metrics_5 = calculate_metrics('Lasso regression',lasso,x_train, x_test, y_train, y_test)

In [None]:
metrics_5

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
lasso = Lasso()
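# Only `alpha` is tuned: the `normalize` option was removed from Lasso in
# scikit-learn 1.2, so features should be scaled before fitting instead.
params = {
    'alpha': [0.001, 0.01, 0.1, 1, 10, 100]
}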

In [None]:
lasso_cv = GridSearchCV(lasso, params, cv=5)
lasso_cv.fit(x_train, y_train)

In [None]:
lasso_best = Lasso(**lasso_cv.best_params_)
lasso_best.fit(x_train, y_train)

In [None]:
metrics_6 = calculate_metrics('Lasso Regression (tuned)', lasso_best, x_train, x_test, y_train, y_test)

In [None]:
metrics_6

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

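The explained model is linear: the fitted ridge regression is passed to SHAP's `LinearExplainer`, and the coefficients of an ordinary linear regression are tabulated afterwards.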


In [None]:
!pip install shap==0.40.0
import shap 
import graphviz
sns.set_style('darkgrid') 

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Initialize the explainability tool
explainer = shap.LinearExplainer(ridge, x_train)

# Calculate the SHAP values for the test data
shap_values = explainer.shap_values(x_test)

# Plot the summary plot to show the feature importance
shap.summary_plot(shap_values, x_test, feature_names=index)


In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
linreg = LinearRegression()
linreg.fit(x_train,y_train)

In [None]:
#The dataframe has two columns to store feature name and the corresponding coefficient values 
columns_coef_df = pd.DataFrame({'columns': df5[index].columns,'coef':linreg.coef_})

#Sorting the features by coefficient values in descending order 
sorted_coef_vals = columns_coef_df.sort_values('coef',ascending = False)

Answer Here.

In [None]:
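# Combine the tuned Ridge, Random Forest, and tuned Lasso metrics into one table with a clean row index
score_df = pd.concat([metrics_2, metrics_3, metrics_6], ignore_index=True)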

In [None]:
score_df

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***