# **Project Name**    - Retail Sales Prediction (Regression)


---



---




##### **Project Type**    - Regression
##### **Contribution**    - Individual


# **Project Summary -**

The project aims to forecast daily sales for Rossmann drug stores using historical sales data. Rossmann operates over 3000 stores in seven European countries, and store managers currently predict sales up to six weeks in advance. However, the accuracy of these predictions varies widely due to factors like promotions, competition, holidays, seasonality, and store-specific circumstances. To improve forecasting accuracy, historical sales data from 1115 Rossmann stores is provided.

The problem is defined as developing a data science solution that accurately predicts the "Sales" column for the test set. By leveraging machine learning techniques, the project seeks to create a model that outperforms individual store managers' predictions. The model should consider various factors such as promotions, competition, holidays, seasonality, and locality to generate accurate sales forecasts.

To implement the project, several steps will be followed. Initially, the provided data will undergo preprocessing to handle missing values, outliers, and inconsistencies. Categorical variables will be transformed into numerical representations, and relevant features will be extracted. Next, feature selection techniques will be applied to identify the most influential columns that significantly impact sales.

Exploratory Data Analysis (EDA) will provide insights into relationships between variables, uncovering patterns, trends, and seasonalities. This analysis will be visualized to gain a better understanding of the dataset.

For model development, the dataset will be split into training and testing sets. Different regression algorithms, such as Linear Regression, Random Forest, and XGBoost, will be trained on the data. Hyperparameter tuning will be performed to optimize the selected model's performance. Model evaluation will utilize appropriate metrics like Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) to assess the accuracy of the sales forecasts.
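
As a minimal sketch of the two metrics named above, the following computes MAE and RMSE with NumPy. The `mae` and `rmse` helpers and the sales figures are hypothetical and serve only to illustrate the formulas; they are not taken from the Rossmann data.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: the average magnitude of the forecast errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root Mean Squared Error: squares the errors first, so large misses weigh more."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical daily sales and forecasts (illustrative values only)
actual = [5000, 6200, 4800, 7100]
predicted = [5200, 6000, 5000, 6500]

mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)
print("MAE :", mae_value)
print("RMSE:", rmse_value)
```

Because RMSE squares each error before averaging, the single 600-unit miss in this example pulls RMSE above MAE. Reporting both gives a sense of typical error and of sensitivity to large misses.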

The trained model will be applied to the test set to predict sales for each store. The accuracy of these predictions will be evaluated, ensuring the model performs well on unseen data.

The dataset's columns play crucial roles in forecasting sales. The "Store" column represents the unique identifier for each store, capturing store-specific factors and variations. "DayOfWeek" helps capture weekly patterns and trends, while "Date" enables analysis of seasonality and long-term trends. The "Sales" column serves as the target variable for forecasting. The "Customers" column indicates the number of customers, which can significantly impact sales. "Open" identifies whether a store was open or closed, directly affecting sales. "Promo" represents whether a store was running promotions, an influential factor for sales. "StateHoliday" identifies state holidays, potentially impacting sales. Lastly, "SchoolHoliday" indicates school holidays, affecting sales patterns.

By leveraging the provided dataset and applying advanced data science techniques, this project aims to improve the accuracy of sales forecasts for Rossmann drug stores. This will enable better planning and decision-making for store managers and ultimately drive better business outcomes for Rossmann.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**The problem is to create a predictive model that can forecast the sales for Rossmann drug stores. The accuracy of the predictions should be improved compared to the current approach of individual store managers. The model should take into account various factors such as promotions, competition, holidays, seasonality, and locality to generate accurate sales forecasts.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that the following format is answered for each and every algorithm.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import numpy as np # NumPy is used for scientific computing and provides functions for efficient array operations, linear algebra, and mathematical calculations.
import pandas as pd # Pandas is used for data manipulation and analysis. It provides data structures and functions to work with structured data, such as data frames.
import math # The standard-library math module provides basic mathematical functions. It is imported directly because recent NumPy releases no longer expose a math alias.
from scipy.stats import * # The scipy.stats module provides a wide range of statistical functions and distributions for statistical analysis and hypothesis testing.
from numpy import loadtxt # The loadtxt function from NumPy is used to load data from a text file into an array or variables.

from sklearn.preprocessing import MinMaxScaler # MinMaxScaler is used for scaling numerical features to a specific range, typically between 0 and 1, to ensure that all features have a similar scale.
from sklearn.model_selection import train_test_split # train_test_split is used to split the dataset into training and testing sets for model evaluation and validation.
from sklearn.linear_model import LinearRegression # LinearRegression is used to perform linear regression analysis and build linear regression models.
from sklearn.metrics import r2_score # r2_score is used to calculate the coefficient of determination (R-squared) to evaluate the performance of regression models.
from sklearn.metrics import mean_squared_error # mean_squared_error is used to calculate the mean squared error (MSE) to measure the performance of regression models.


import matplotlib.pyplot as plt # Matplotlib is a plotting library used to create visualizations and graphs. The %matplotlib inline command is used in Jupyter Notebook to display plots inline.
%matplotlib inline

import seaborn as sns # Seaborn is a data visualization library based on Matplotlib. It provides a high-level interface for creating informative and visually appealing statistical graphics.

from sklearn.linear_model import Ridge, RidgeCV # Ridge and Lasso are regularization techniques used in linear regression to reduce overfitting.
from sklearn.linear_model import Lasso, LassoCV # RidgeCV and LassoCV are versions of Ridge and Lasso with built-in cross-validation for hyperparameter tuning.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler # StandardScaler is used to standardize numerical features by removing the mean and scaling to unit variance.
from imblearn.over_sampling import SMOTE # SMOTE (Synthetic Minority Over-sampling Technique) is used for oversampling the minority class in imbalanced datasets to address class imbalance issues.
from sklearn.linear_model import LogisticRegression # LogisticRegression is used for logistic regression analysis and building logistic regression models for classification tasks.
from sklearn.ensemble import RandomForestClassifier # RandomForestClassifier is an ensemble learning method that combines multiple decision trees to build a classification model.
from sklearn.metrics import accuracy_score, confusion_matrix # accuracy_score is used to calculate the accuracy of classification models. confusion_matrix is used to compute the confusion matrix to evaluate classification model performance.
from sklearn import metrics # metrics provides various metrics for model evaluation.
from sklearn.metrics import roc_curve # roc_curve is used to plot the receiver operating characteristic (ROC) curve for binary classification models.
from sklearn.model_selection import GridSearchCV # GridSearchCV is used for hyperparameter tuning by exhaustively searching the specified parameter values. 
from sklearn.model_selection import RepeatedStratifiedKFold # RepeatedStratifiedKFold is a cross-validation strategy that ensures stratification and repeated sampling of data during model evaluation.
from xgboost import XGBClassifier # XGBClassifier is an implementation of the XGBoost algorithm for classification tasks.
from xgboost import XGBRFClassifier # XGBRFClassifier is an implementation of the XGBoost algorithm for random forest-based classification tasks.
from sklearn.tree import export_graphviz # export_graphviz is used to export decision tree models in Graphviz format for visualization.

import warnings
warnings.filterwarnings('ignore') # The warnings module is used to manage warning messages. The filterwarnings function is used to ignore warnings during code execution.


### Dataset Loading

In [None]:
data = "/content/Rossmann_Stores_Data.csv"

In [None]:
# Loading Dataset
# encoding="ISO-8859-1" tells pandas how to decode the bytes of the file, which avoids decoding errors when the CSV was not saved as UTF-8.
df = pd.read_csv(data, encoding="ISO-8859-1")

### Dataset First View

In [None]:
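# Dataset First Look
# A notebook cell only renders its last expression, so display() is used to show both the first and last 30 rows.
display(df.head(30))
display(df.tail(30))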

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

# Flag rows that are exact duplicates of an earlier row
duplicates = df.duplicated()

# Check for duplicates
if duplicates.any():
    print("There are duplicates in the dataset.")
else:
    print("There are no duplicates in the dataset.")

# Count the duplicate rows
print('\nDuplicates:\n', duplicates.sum())


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

print(df.isnull().sum())

In [None]:
# Visualizing the missing values

# Create a heatmap of missing/null values in the DataFrame
sns.heatmap(df.isnull(), cmap='coolwarm')

# Show the plot
plt.show()

# create a bar chart of the null values in the dataframe
df.isnull().sum().plot(kind='bar')

# Show the plot
plt.show()

### What did you know about your dataset?

**Generally!!!**

This dataset contains historical sales data for Rossmann drug stores. The goal is to develop a data science solution that improves upon the accuracy of store managers' predictions and provides more reliable sales forecasts. By leveraging machine learning techniques and considering various factors like promotions, competition, holidays, seasonality, and locality, the project aims to generate accurate sales forecasts.


**Technically!!!**

The dataset has 1017209 rows and 9 columns in total.

The data has no duplicate values.

The data has no NULL values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns) # This prints all the columns present in the dataset

In [None]:
# Dataset Describe
df.describe(include='all') # This will include all columns of the DataFrame, and provide the  basic statistical properties like mean, standard deviation, minimum and maximum values, and quartiles.

### Variables Description 

The dataset used in this project contains historical sales data for 1115 Rossmann drug stores. 

It includes the following columns:


**Store:**  Unique identifier for each Rossmann store.

**DayOfWeek:** Numeric representation (1-7) of the day of the week (Monday-Sunday).

**Date:** The specific date of the entry.

**Sales:** The total sales for a particular store on a given day (target variable).

**Customers:** The number of customers visiting a store on a particular day.

**Open:** Binary indicator (1 or 0) representing whether the store was open or closed on a particular day.

**Promo:** Binary indicator (1 or 0) representing whether a store was running a promotion on a particular day.

**StateHoliday:** Categorical variable indicating whether a particular day is a state holiday (a, b, c) or not (0).

**SchoolHoliday:** Binary indicator (1 or 0) representing whether a particular day is a school holiday or not.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
    print(f"No. of unique values in {i} is {df[i].nunique()}.")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Copying the dataset to a new variable "df1"

df1 = df.copy(deep = True)

#### 1. Handling Missing Values.



In [None]:
print(df1.isnull().sum())

In [None]:
df1 = df1.dropna()

In [None]:
print(df1.isnull().sum())

##### As there are no missing values in this dataset, we can move further.

#### 2. Outlier Detection and Treatment: 

##### Identify outliers in the dataset that may impact the analysis or model performance. Decide whether to remove outliers or transform them using appropriate techniques like Winsorization or logarithmic transformation.

In [None]:
# Finding min and max of the Sales column before outlier treatment
print("MAX Sales Before Outlier Treatment")
print(df1['Sales'].max())
print("____________________________________________________")
print("MIN Sales Before Outlier Treatment")
print(df1['Sales'].min())

In [None]:
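# Outlier detection and treatment for the Sales column
from scipy.stats.mstats import winsorize # winsorize is provided by scipy.stats.mstats, not by scipy.stats itself

def handle_outliers(df1):
    # Apply Winsorization: clip the lowest 5% and highest 5% of Sales to the nearest remaining value
    df1['Sales'] = np.asarray(winsorize(df1['Sales'].to_numpy(), limits=[0.05, 0.05]))

    return df1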

In [None]:
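# handle_outliers() is only defined above and has not been called on df1, so these values match the ones printed before
print("MAX Sales after Outlier Treatment")
print(df1['Sales'].max())
print("____________________________________________________")
print("MIN Sales after Outlier Treatment")
print(df1['Sales'].min())

##### The minimum and maximum Sales values are unchanged because `handle_outliers` has not been applied to `df1`. Matching values do not, by themselves, show that the dataset has no outliers.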
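
To illustrate what Winsorization and a simple outlier check do, the sketch below works on a small hypothetical array rather than the Rossmann data. The values, the 10% limits, and the 1.5 × IQR rule are assumptions chosen for illustration, not settings used elsewhere in this notebook.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical sales values with one extreme value at each end
sales = np.array([0, 4800, 5000, 5100, 5300, 5500, 5600, 5800, 6000, 40000])

# Winsorization: replace the lowest 10% and highest 10% of values
# with the nearest remaining value instead of dropping them
clipped = np.asarray(winsorize(sales, limits=[0.1, 0.1]))

# IQR rule: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(sales, [25, 75])
iqr = q3 - q1
outlier_mask = (sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)

print("Original min/max  :", sales.min(), sales.max())
print("Winsorized min/max:", clipped.min(), clipped.max())
print("Flagged by IQR    :", sales[outlier_mask])
```

The IQR check only flags candidates; whether a flagged value is an error or a genuine busy day is a judgment that depends on the business context.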

#### 3. Encoding Categorical Variables: Encode categorical variables such as "StateHoliday" into numerical representations that can be understood by machine learning algorithms. 

##### Label encoding can be used depending on the nature of the variable and the algorithm being used.

In [None]:
print(df1.StateHoliday)

In [None]:
# Helper function to fetch the unique values present in a column.

def get_unique_values(df1, column_name):
    """Return an array of the unique values in the given column, in order of first appearance."""
    unique_values = df1[column_name].unique()
    return unique_values

In [None]:
# Calling the function by specifying the name of the column whose unique values we want to fetch.

unique_names = get_unique_values(df1, 'StateHoliday')

# Print the unique values
print(unique_names)

In [None]:
from sklearn.preprocessing import LabelEncoder

# Convert 'StateHoliday' column to string type so that all values share one type before encoding
df1['StateHoliday'] = df1['StateHoliday'].astype(str)
# Parse 'Date' with pd.to_datetime(), convert each date to its ordinal number with toordinal(), then cast the resulting integers to float
df1['Date'] = pd.to_datetime(df1['Date']).apply(lambda x: x.toordinal()).astype(float)

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Encode the 'StateHoliday' column
df1['StateHoliday'] = label_encoder.fit_transform(df1['StateHoliday'])

# Print the updated df1 DataFrame
print(df1.head())


In [None]:

# Calling the function by specifying the name of the column whose unique values we want to fetch.
unique_names= get_unique_values(df1, 'StateHoliday')

# Print the unique values
print(unique_names)


#### Label encoding with `LabelEncoder` is done successfully
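
As a small sketch of how label encoding differs from one-hot encoding, the example below encodes a hypothetical sample of "StateHoliday" values both ways. The sample Series is made up for illustration; `pd.get_dummies` is shown only as an alternative and is not what the cell above uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample using the category codes described for StateHoliday
holidays = pd.Series(['0', 'a', 'b', '0', 'c'], name='StateHoliday')

# Label encoding: one integer per category, assigned in sorted order of the labels
label_encoded = LabelEncoder().fit_transform(holidays)

# One-hot encoding: one binary column per category, with no implied order
one_hot = pd.get_dummies(holidays, prefix='StateHoliday', dtype=int)

print(label_encoded)
print(one_hot)
```

Label encoding keeps a single column, which tree-based models usually handle well. Linear models may read the integers as an ordered scale, in which case one-hot columns are often the safer choice.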

#### 4. Feature Scaling: Perform feature scaling on numerical variables to ensure that they are on a similar scale. Common scaling techniques include standardization (subtracting mean and dividing by standard deviation) or normalization (scaling to a specific range).



In [None]:
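def perform_feature_scaling(df1):
    # Standardize the Sales column (zero mean, unit variance) using StandardScaler
    scaler = StandardScaler()
    # fit_transform expects a 2D input; ravel() flattens the result back to one value per row
    df1['Sales'] = scaler.fit_transform(df1[['Sales']]).ravel()

    return df1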
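
The difference between standardization and normalization can be seen on a tiny example. The sketch below applies scikit-learn's `StandardScaler` and `MinMaxScaler` to a hypothetical column of five values; the numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature column (scikit-learn scalers expect a 2D array)
values = np.array([[1000.0], [2000.0], [3000.0], [4000.0], [5000.0]])

# Standardization: subtract the mean, then divide by the standard deviation
standardized = StandardScaler().fit_transform(values)

# Normalization: rescale linearly so the minimum maps to 0 and the maximum to 1
normalized = MinMaxScaler().fit_transform(values)

print("Standardized:", standardized.ravel())
print("Normalized  :", normalized.ravel())
```

In a modeling workflow, the scaler is normally fitted on the training split only and then reused to transform the test split, so that no information from the test data leaks into training.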

#### 5. Handling Date and Time Variables: Extract relevant information from the "Date" column, such as day, month, year, or day of the week, which can capture seasonal and temporal patterns. Additional features like lagged variables (previous day's sales, etc.) can also be created.

In [None]:
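def handle_date_variables(df1):
    # 'Date' is stored as ordinal floats after the encoding step, so convert it back to
    # datetimes first; if it still holds date strings, parse them directly
    if pd.api.types.is_numeric_dtype(df1['Date']):
        epoch_ordinal = pd.Timestamp('1970-01-01').toordinal()
        dates = pd.to_datetime(df1['Date'] - epoch_ordinal, unit='D')
    else:
        dates = pd.to_datetime(df1['Date'])

    # Extract day, month, and year from the Date column
    df1['Day'] = dates.dt.day
    df1['Month'] = dates.dt.month
    df1['Year'] = dates.dt.year

    # Create a column for day of the week (1 = Monday, ..., 7 = Sunday)
    df1['DayOfWeek'] = dates.dt.dayofweek + 1

    return df1

print(df1['DayOfWeek'])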

In [None]:
print(df1.dtypes)
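
The lagged variables mentioned in the heading above can be sketched as follows. The `toy` DataFrame and the `Sales_lag1` column name are hypothetical; the point is that the lag is computed within each store, so one store's sales never leak into another's.

```python
import pandas as pd

# Hypothetical data: two stores, three consecutive days each
toy = pd.DataFrame({
    'Store': [1, 1, 1, 2, 2, 2],
    'Date': pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03'] * 2),
    'Sales': [100, 120, 90, 200, 210, 190],
})

# Sort so that shifting by one row means "the previous day of the same store"
toy = toy.sort_values(['Store', 'Date'])

# Calendar features extracted from the date
toy['Month'] = toy['Date'].dt.month
toy['DayOfWeek'] = toy['Date'].dt.dayofweek + 1  # 1 = Monday, ..., 7 = Sunday

# Lag feature: previous day's sales within each store (NaN on each store's first day)
toy['Sales_lag1'] = toy.groupby('Store')['Sales'].shift(1)

print(toy)
```

The first row of each store has no previous day, so its lag is missing. Those rows need to be dropped or filled before fitting a model that cannot handle missing values.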


#### 6. Data Aggregation and Grouping: Explore the possibility of aggregating the data at different levels (e.g., store level, week level) to derive meaningful insights and potentially reduce dimensionality.

In [None]:
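# Step 6: Data Aggregation and Grouping (Example: Weekly Sales)
def aggregate_weekly_sales(df1):
    weekly = df1.copy()
    # 'Date' is stored as ordinal floats after the encoding step; convert it back to datetimes
    if pd.api.types.is_numeric_dtype(weekly['Date']):
        weekly['Date'] = pd.to_datetime(weekly['Date'] - pd.Timestamp('1970-01-01').toordinal(), unit='D')
    else:
        weekly['Date'] = pd.to_datetime(weekly['Date'])
    # Aggregate sales and customer counts into weekly totals (weeks ending on Monday)
    weekly = weekly.resample('W-MON', on='Date')[['Sales', 'Customers']].sum().reset_index()

    return weekly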
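
To show how weekly resampling groups daily rows, the sketch below aggregates 14 hypothetical days of sales into weeks ending on Monday. The `toy` data is made up for illustration.

```python
import pandas as pd

# Hypothetical daily sales for 14 consecutive days, starting on Tuesday 2015-01-06
dates = pd.date_range('2015-01-06', periods=14, freq='D')
toy = pd.DataFrame({'Date': dates, 'Sales': range(1, 15)})

# 'W-MON' builds weekly bins that end on (and are labeled by) a Monday
weekly = toy.resample('W-MON', on='Date')['Sales'].sum()

print(weekly)
```

Each output row is labeled with the Monday that closes the week, so the first bin covers Tuesday 6 January through Monday 12 January. A different anchor, such as `'W-SUN'`, would shift the bin boundaries accordingly.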

#### 7. Handling Skewed Variables: If any variables exhibit significant skewness, applying appropriate transformations (such as logarithmic or Box-Cox transformation) may help achieve a more normal distribution and improve model performance.

In [None]:
def perform_log_transformation(df1):
    # Apply logarithmic transformation to Sales column
    df1['Sales'] = np.log1p(df1['Sales'])
    
    return df1
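
The effect of the log transformation can be checked on a small example. The sketch below uses a hypothetical, strongly right-skewed series that includes a zero, as would occur on a day when a store is closed; it compares skewness before and after `np.log1p` and shows that `np.expm1` reverses the transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sales values, including a zero for a closed day
sales = pd.Series([0, 100, 200, 300, 400, 500, 10000], dtype=float)

# log1p computes log(1 + x), so zero sales map to zero instead of minus infinity
log_sales = np.log1p(sales)

# expm1 is the inverse of log1p; use it to turn predictions back into sales units
restored = np.expm1(log_sales)

print("Skewness before:", sales.skew())
print("Skewness after :", log_sales.skew())
```

If a model is trained on the log-transformed target, its predictions must be passed through `np.expm1` before metrics such as MAE or RMSE are reported in the original sales units.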

### What all manipulations have you done and insights you found?

**1: Handling Missing Values**

In this step, missing values in the dataset were checked with `isnull().sum()`, and `dropna()` was used to drop any rows containing them. Because the dataset has no missing values, no rows were removed.

**Insights:** Confirming that the data is complete means that every record can be used for analysis and modeling without imputation.


**2: Outlier Detection and Treatment**

Outliers can significantly impact the analysis and modeling process. In this step, a `handle_outliers` function was defined to treat outliers in the "Sales" column using the Winsorization technique. Winsorization replaces extreme values with less extreme values based on predefined limits (here, the lowest and highest 5% of values).

**Insights:** By handling outliers, we can mitigate their influence on statistical measures and model performance. This step helps ensure that extreme sales values do not disproportionately affect the forecasting process.

**3: Encoding Categorical Variables**

Categorical variables like "StateHoliday" need to be converted into numerical representations for machine learning algorithms to process. Label encoding was performed on the "StateHoliday" column with scikit-learn's `LabelEncoder`, which maps each category to an integer. The "Date" column was also converted to a numeric ordinal value.

**Insights:** Encoding categorical variables allows us to incorporate them into our models. Label encoding keeps a single column but implies an order among the holiday types; one-hot encoding is an alternative that avoids this ordinality assumption at the cost of extra columns.

**4: Feature Scaling**

A `perform_feature_scaling` function was defined to standardize the "Sales" column using the StandardScaler from scikit-learn. Standardization transforms the data to have zero mean and unit variance.

**Insights:** Scaling numerical variables is crucial to ensure that all features contribute equally to the model. It helps prevent bias due to the magnitude differences between variables and facilitates the convergence of certain machine learning algorithms.

**5: Handling Date and Time Variables**

Date and time variables, such as "Date," can provide valuable insights into seasonality and temporal patterns. In this step, a `handle_date_variables` function was defined to extract additional features from the "Date" column: day, month, year, and day of the week.

**Insights:** By extracting specific components from the date, we can capture trends related to different time periods. Day of the week, month, or year could potentially influence sales, and these features can be utilized in modeling to improve forecasting accuracy.

**6: Data Aggregation and Grouping**

In this step, an `aggregate_weekly_sales` function was defined to aggregate the dataset on a weekly basis using the resample function. Aggregating the data at a higher level, such as weekly sales, helps to analyze long-term trends and reduces the number of rows in the dataset.

**Insights:** Aggregating the data can reveal higher-level patterns and provide a more holistic view of sales trends. Weekly sales allow for analysis at a broader scale, enabling the identification of seasonal patterns and fluctuations.

**7: Handling Skewed Variables (Log Transformation)**

In this step, a `perform_log_transformation` function was defined to apply a logarithmic transformation (log1p) to the "Sales" column. Log transformations help to address positively skewed distributions and achieve a more symmetric distribution.

**Insights:** Skewed variables can introduce bias and violate assumptions of certain statistical models. The log transformation reduces the impact of extreme values, making the distribution more symmetrical and improving the performance of models that assume normality.

**These manipulations contribute to preparing the data for analysis and modeling, ensuring that it is in a suitable format and quality for accurate sales forecasting. The insights gained from these steps provide a deeper understanding of the dataset and help in uncovering patterns and trends that influence sales.**

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 

#### Bar Plot

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt

# Bar plot of average sales by day of the week
average_sales_by_day = df1.groupby('DayOfWeek')['Sales'].mean()
plt.bar(average_sales_by_day.index, average_sales_by_day.values)
plt.xlabel('Day of the Week')
plt.ylabel('Average Sales')
plt.title('Average Sales by Day of the Week')
plt.show()


##### 1. Why did you pick the specific chart?

 A bar plot is suitable for comparing the average sales across different days of the week.


##### 2. What is/are the insight(s) found from the chart?

The chart helps identify any day-to-day variations in sales. For example, if Mondays have lower average sales compared to other days, it could indicate the need for targeted promotions or incentives to drive sales on that day.


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Understanding the variations in sales by day of the week can inform business decisions such as staffing, inventory management, and promotional strategies to optimize sales on specific days.

#### Chart - 2
#### Line Plot

In [None]:
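# Chart - 2 visualization code
# Line plot of total daily sales over time
# df1['Date'] holds ordinal floats after encoding, so the dates are parsed from the original df (same row index)
dates = pd.to_datetime(df.loc[df1.index, 'Date'])
daily_sales = df1.groupby(dates)['Sales'].sum()
plt.plot(daily_sales.index, daily_sales.values)
plt.xlabel('Date')
plt.ylabel('Total Daily Sales')
plt.title('Sales Over Time')
plt.xticks(rotation=45)
plt.show()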


##### 1. Why did you pick the specific chart?

A line plot is ideal for visualizing trends and patterns in sales over time.


##### 2. What is/are the insight(s) found from the chart?

The line plot showcases the overall sales trend and helps identify seasonality, upward or downward trends, and any significant spikes or drops in sales.


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Recognizing the sales patterns over time enables better resource allocation, inventory management, and the ability to plan marketing campaigns and promotions aligned with peak sales periods.

#### Chart - 3 
#### Histogram

In [None]:
# Chart - 3 visualization code
# Histogram of sales distribution
plt.hist(df1['Sales'], bins=20)
plt.xlabel('Sales')
plt.ylabel('Frequency')
plt.title('Distribution of Sales')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram provides insights into the distribution of sales values and highlights the frequency of occurrence within different ranges.


##### 2. What is/are the insight(s) found from the chart?

The histogram shows the shape of the sales distribution, whether it is skewed, normally distributed, or has any significant outliers.


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the sales distribution aids in setting realistic sales targets, identifying potential outliers or anomalies, and making informed decisions regarding pricing strategies or promotions.

#### Chart - 4
#### Scatter Plot

In [None]:
# Chart - 4 visualization code
# Scatter plot of sales and customers
plt.scatter(df1['Sales'], df1['Customers'])
plt.xlabel('Sales')
plt.ylabel('Customers')
plt.title('Sales vs Customers')
plt.show()


##### 1. Why did you pick the specific chart?

 A scatter plot helps visualize the relationship between sales and the number of customers.


##### 2. What is/are the insight(s) found from the chart?

The scatter plot shows if there is a positive correlation between sales and customer count, indicating that as sales increase, the number of customers tends to increase as well.
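
The strength of the relationship seen in a scatter plot can be quantified with the Pearson correlation coefficient. The sketch below uses hypothetical customer and sales figures, not values from the dataset, and also derives the average spend per customer.

```python
import numpy as np

# Hypothetical daily figures for one store (illustrative values only)
customers = np.array([400, 550, 600, 750, 900])
sales = np.array([3800, 5200, 5900, 7000, 8800])

# Pearson correlation: +1 is a perfect positive linear relationship, 0 is none
r = np.corrcoef(customers, sales)[0, 1]

# Average spend per customer separates "more customers" from "bigger baskets"
avg_basket = sales / customers

print("Pearson r:", round(float(r), 4))
print("Average spend per customer:", np.round(avg_basket, 2))
```

A high correlation here only says that sales and customer counts move together. Note also that customer counts are not known in advance for future dates, so they cannot be used directly as a feature when forecasting.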


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the relationship between sales and customers can inform marketing strategies, customer acquisition efforts, and customer retention initiatives. It helps identify whether an increase in sales is due to higher customer volume or higher average spending per customer.

#### Chart - 5
#### Box Plot

In [None]:
# Chart - 5 visualization code
# Box plot of sales by day of the week
plt.boxplot([df1[df1['DayOfWeek'] == i]['Sales'] for i in range(1, 8)])
plt.xlabel('Day of the Week')
plt.ylabel('Sales')
plt.title('Sales Distribution by Day of the Week')
plt.xticks(range(1, 8), ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
plt.show()


##### 1. Why did you pick the specific chart?

 A box plot helps visualize the distribution of sales for each day of the week and identify any outliers or variability.


##### 2. What is/are the insight(s) found from the chart?

The box plot shows the median, quartiles, and any potential outliers for each day of the week. It helps identify if certain days consistently have higher or lower sales and the overall variability in sales by day.
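
The quantities a box plot draws can also be computed directly. The sketch below uses a hypothetical `toy` DataFrame with two days of the week and derives the quartiles, the IQR, and the conventional upper whisker limit of Q3 + 1.5 × IQR; the column names in `summary` are chosen for illustration.

```python
import pandas as pd

# Hypothetical sales for a Monday (1) and a Sunday (7), five observations each
toy = pd.DataFrame({
    'DayOfWeek': [1, 1, 1, 1, 1, 7, 7, 7, 7, 7],
    'Sales': [4000, 5000, 6000, 7000, 20000, 0, 0, 0, 0, 3000],
})

# Quartiles per day of the week, one row per day
summary = toy.groupby('DayOfWeek')['Sales'].quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ['Q1', 'Median', 'Q3']

# Points above Q3 + 1.5 * IQR are drawn as individual outlier markers
summary['IQR'] = summary['Q3'] - summary['Q1']
summary['UpperWhiskerLimit'] = summary['Q3'] + 1.5 * summary['IQR']

print(summary)
```

In this toy data the Sunday box collapses to zero because most observations are zero, which is the pattern to expect on days when most stores are closed.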


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Understanding the sales distribution by day of the week can inform staffing decisions, promotional strategies, and the allocation of resources based on the demand patterns on different days.

#### Chart - 6
#### Pie Chart

In [None]:
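# Chart - 6 visualization code
# Pie chart of promotional days
# sort_index() orders the counts as Promo = 0, then Promo = 1, so the labels below always match the slices
promo_counts = df1['Promo'].value_counts().sort_index()
plt.pie(promo_counts, labels=['No Promo', 'Promo'], autopct='%1.1f%%')
plt.title('Proportion of Promotional Days')
plt.show()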


##### 1. Why did you pick the specific chart?

A pie chart effectively showcases the proportion or distribution of promotional days.


##### 2. What is/are the insight(s) found from the chart?

The pie chart indicates the percentage of days with promotions versus days without promotions. It shows how often promotions run, not their effect on sales.


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the proportion of promotional days gives context for promotional planning and budget allocation. Assessing the effectiveness of promotions requires comparing sales on promotional and non-promotional days, which this chart alone does not show.

#### Chart - 7
#### Stacked Bar chart

In [None]:
# Chart - 7 visualization code
# Stacked bar chart of sales by day of the week and promo status
sales_by_day_and_promo = df1.groupby(['DayOfWeek', 'Promo'])['Sales'].sum().unstack()
sales_by_day_and_promo.plot(kind='bar', stacked=True)
plt.xlabel('Day of the Week')
plt.ylabel('Sales')
plt.title('Sales by Day of the Week and Promo Status')
plt.xticks(rotation=0)
plt.show()


##### 1. Why did you pick the specific chart?

A stacked bar chart helps compare the sales on different days of the week based on the promotional status.


##### 2. What is/are the insight(s) found from the chart?

The chart showcases the contribution of promotions to overall sales on each day of the week. It helps identify whether promotions have a more significant impact on certain days compared to others.
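
The contribution of promotions on each day can also be expressed as a share of that day's total sales. The sketch below does this on a hypothetical `toy` DataFrame; the `promo_share` name and the figures are illustrative only.

```python
import pandas as pd

# Hypothetical sales records; day 6 has no promotional sales at all
toy = pd.DataFrame({
    'DayOfWeek': [1, 1, 2, 2, 6, 6],
    'Promo':     [0, 1, 0, 1, 0, 0],
    'Sales':     [4000, 6000, 3000, 5000, 2000, 2500],
})

# Same grouping as the stacked bar chart; fill_value=0 covers days without promo rows
totals = toy.groupby(['DayOfWeek', 'Promo'])['Sales'].sum().unstack(fill_value=0)

# Share of each day's total sales that came from promotional records
promo_share = totals[1] / totals.sum(axis=1)

print(totals)
print(promo_share)
```

Summed sales depend on how many promo and non-promo records exist for each day, so comparing mean sales per record is a useful complement when judging the effect of promotions.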


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the interplay between promotional activities, day of the week, and sales can inform promotional planning, staffing, and inventory management strategies. It helps allocate resources effectively to maximize the impact of promotions.

#### Chart - 8
#### Violin Plot

In [None]:
# Chart - 8 visualization code
# Violin plot of sales by day of the week
sns.violinplot(x=df1['DayOfWeek'], y=df1['Sales'])
plt.xlabel('Day of the Week')
plt.ylabel('Sales')
plt.title('Sales Distribution by Day of the Week')
plt.show()


##### 1. Why did you pick the specific chart?

 A violin plot combines a box plot and a kernel density plot to showcase the distribution and density of sales by day of the week.


##### 2. What is/are the insight(s) found from the chart?

The violin plot provides insights into the distribution of sales on different days of the week. It helps visualize the median, quartiles, density, and potential outliers.


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the distribution of sales by day of the week aids in identifying any consistent patterns or deviations, allowing for better planning and resource allocation.

#### Chart - 9
#### Area Chart

In [None]:
# Chart - 9 visualization code
# Area chart of cumulative sales over time
# Dates are parsed from the original df because df1['Date'] holds ordinal floats after encoding
dates = pd.to_datetime(df.loc[df1.index, 'Date'])
cumulative_sales = df1.groupby(dates)['Sales'].sum().cumsum()
plt.fill_between(cumulative_sales.index, cumulative_sales.values, alpha=0.5)
plt.xlabel('Date')
plt.ylabel('Cumulative Sales')
plt.title('Cumulative Sales Over Time')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

An area chart helps visualize the cumulative sales over time, emphasizing the overall trend and magnitude of sales.


##### 2. What is/are the insight(s) found from the chart?

The area chart showcases the growth of cumulative sales over time, providing insights into the overall sales performance and capturing any significant shifts or periods of accelerated growth.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Analyzing the cumulative sales trend assists in evaluating the effectiveness of business strategies, identifying periods of high growth, and making informed projections for future sales.

#### Chart - 10
#### Boxen Chart

In [None]:
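# Chart - 10 visualization code
# Boxen plot of sales by month
# Group by calendar month (1-12) rather than by each individual date
sales_month = pd.to_datetime(df1['Date']).dt.month
sns.boxenplot(x=sales_month, y=df1['Sales'])
plt.xlabel('Month')
plt.ylabel('Sales')
plt.title('Sales Distribution by Month')
plt.show()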


##### 1. Why did you pick the specific chart?

A boxen plot, also known as a letter-value plot, provides a more detailed view of the distribution of sales by month.


##### 2. What is/are the insight(s) found from the chart?

The boxen plot showcases the quartiles, median, and the distribution of sales within each month. It helps identify any seasonal patterns or differences in sales across different months.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the sales distribution by month helps identify peak sales months, plan inventory management, and align marketing campaigns to leverage seasonal demand.

#### Chart - 11
#### Scatter Plot with regression line

In [None]:
# Chart - 11 visualization code
import seaborn as sns

# Scatter plot with regression line for sales and customers
sns.regplot(x=df1['Customers'], y=df1['Sales'], scatter_kws={'alpha':0.3})
plt.xlabel('Number of Customers')
plt.ylabel('Sales')
plt.title('Relationship between Sales and Customers')
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot with a regression line helps visualize the relationship between the number of customers and sales.


##### 2. What is/are the insight(s) found from the chart?

The scatter plot shows the distribution of data points and the regression line, which indicates the general trend between sales and the number of customers. It helps identify whether there is a positive correlation between these variables.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the relationship between the number of customers and sales can assist in forecasting sales based on customer traffic, evaluating the effectiveness of marketing campaigns in driving customer visits, and optimizing staffing levels to meet customer demand.

#### Chart - 12
#### Box plot with violin plot overlay

In [None]:
# Chart - 12 visualization code
# Box plot with violin plot overlay for sales by day of the week
# Draw the violins first, then narrow boxes on top so the boxes are not hidden
sns.violinplot(x=df1['DayOfWeek'], y=df1['Sales'], inner=None, color='lightgray')
sns.boxplot(x=df1['DayOfWeek'], y=df1['Sales'], width=0.2)
plt.xlabel('Day of the Week')
plt.ylabel('Sales')
plt.title('Sales Distribution by Day of the Week')
plt.show()


##### 1. Why did you pick the specific chart?

Combining a box plot and violin plot helps provide a comprehensive view of the distribution and variability of sales by day of the week.


##### 2. What is/are the insight(s) found from the chart?

The chart showcases the quartiles, median, and distribution of sales for each day of the week, allowing for easy comparison and identification of any outliers or variability.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the sales distribution by day of the week aids in identifying consistent patterns, detecting potential anomalies, and optimizing staffing and promotional strategies based on demand patterns.

#### Chart - 13
#### Grouped Bar Chart

In [None]:
# Chart - 13 visualization code
# Grouped bar chart of average sales by day of the week and promo status
average_sales_by_day_and_promo = df1.groupby(['DayOfWeek', 'Promo'])['Sales'].mean().unstack()
average_sales_by_day_and_promo.plot(kind='bar')
plt.xlabel('Day of the Week')
plt.ylabel('Average Sales')
plt.title('Average Sales by Day of the Week and Promo Status')
plt.xticks(rotation=0)
plt.show()


##### 1. Why did you pick the specific chart?

A grouped bar chart effectively compares the average sales on different days of the week based on the promotional status.


##### 2. What is/are the insight(s) found from the chart?

The grouped bar chart showcases the average sales for each day of the week, distinguishing between promotional and non-promotional days. It helps identify the impact of promotions on sales on different days.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Analyzing the average sales by day of the week and promo status assists in determining the effectiveness of promotions, identifying days with higher sales potential, and optimizing promotional planning and resources to maximize sales.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Chart - 14 visualization code: Correlation Heatmap
import seaborn as sns

# Heatmap of correlations between the numeric variables
# numeric_only=True (pandas >= 1.5) skips non-numeric columns such as Date
correlation_matrix = df1.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

A heatmap is an effective way to visualize the correlation between different variables in the dataset.


##### 2. What is/are the insight(s) found from the chart?

Insights: The heatmap allows us to identify correlations between variables, such as the relationship between sales and other factors like customers, promotions, school holidays, etc. Positive or negative correlations can indicate the impact of these factors on sales.



Business impact: Understanding the correlations between variables helps identify key drivers of sales and enables data-driven decision-making. It assists in focusing on factors that have the most significant impact on sales and optimizing resources accordingly.
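To turn the heatmap into a ranked list of candidate sales drivers, the "Sales" column of the correlation matrix can be sorted by absolute value. The sketch below uses synthetic data; the frame `toy`, its coefficients, and the noise level are assumptions made only for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
customers = rng.integers(200, 1500, size=n)
promo = rng.integers(0, 2, size=n)

# Hypothetical data: Sales is built from Customers and Promo plus noise.
toy = pd.DataFrame({
    "Customers": customers,
    "Promo": promo,
    "SchoolHoliday": rng.integers(0, 2, size=n),
    "Sales": 8 * customers + 1000 * promo + rng.normal(0, 500, size=n),
})

# Correlation of every feature with Sales, strongest (by magnitude) first.
corr_with_sales = (
    toy.corr()["Sales"]
    .drop("Sales")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_sales)
```

Correlation only captures linear association, so a low value does not rule out a nonlinear effect.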

#### Chart - 15 - Pair Plot 

In [None]:
# Pair Plot visualization code
sns.pairplot(df1[['Sales', 'Customers', 'Promo', 'SchoolHoliday']])
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot allows for the visualization of the relationships between multiple variables simultaneously.


##### 2. What is/are the insight(s) found from the chart?

Insights: The pair plot shows the scatter plots and histograms for the selected variables, enabling the examination of pairwise relationships. It helps identify correlations, outliers, and any potential nonlinear relationships.


Business impact: The pair plot facilitates the identification of significant variables that influence sales, allowing for more targeted strategies and resource allocation based on their impact.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

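The charts above suggest three statements to test:

1. Sales differ between promotional and non-promotional days.

2. Average sales differ between weekends and weekdays.

3. The number of customers and sales are correlated.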

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Hypothesis: There is a significant difference in sales between promotional and non-promotional days.

Null Hypothesis (H0): There is no significant difference in sales between promotional and non-promotional days.


Alternative Hypothesis (H1): There is a significant difference in sales between promotional and non-promotional days.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import scipy.stats as stats

promo_sales = df1[df1['Promo'] == 1]['Sales']
non_promo_sales = df1[df1['Promo'] == 0]['Sales']

# Perform t-test for independent samples
t_stat, p_value = stats.ttest_ind(promo_sales, non_promo_sales)

alpha = 0.05  # significance level

if p_value < alpha:
    print("Reject null hypothesis")
    print("There is a significant difference in sales between promotional and non-promotional days.")
else:
    print("Fail to reject null hypothesis")
    print("There is no significant difference in sales between promotional and non-promotional days.")


##### Which statistical test have you done to obtain P-Value?

Difference in sales between promotional and non-promotional days:

Statistical Test: Independent samples t-test

Reason: The independent samples t-test is suitable for comparing the means of two independent groups (promotional and non-promotional days).

##### Why did you choose the specific statistical test?

It assesses whether the difference in means between the two groups is statistically significant.


Conclusion: Based on the t-test, if the p-value is less than the significance level (alpha), we reject the null hypothesis. If the p-value is greater than alpha, we fail to reject the null hypothesis. The result will provide a final conclusion regarding the difference in sales between promotional and non-promotional days.
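The default `ttest_ind` assumes equal variances in both groups, and a p-value alone does not say how large the difference is. As a hedged sketch on synthetic samples (the group means, spreads, and sizes below are invented), Welch's variant (`equal_var=False`) and Cohen's d can be computed like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical samples standing in for promo and non-promo daily sales.
promo_sales = rng.normal(loc=8000, scale=2500, size=400)
non_promo_sales = rng.normal(loc=6000, scale=1500, size=600)

# Welch's t-test: does not assume the two groups share the same variance.
t_stat, p_value = stats.ttest_ind(promo_sales, non_promo_sales, equal_var=False)

# Cohen's d: difference in means expressed in pooled standard deviations.
n1, n2 = len(promo_sales), len(non_promo_sales)
pooled_sd = np.sqrt(
    ((n1 - 1) * promo_sales.var(ddof=1) + (n2 - 1) * non_promo_sales.var(ddof=1))
    / (n1 + n2 - 2)
)
cohens_d = (promo_sales.mean() - non_promo_sales.mean()) / pooled_sd

print(f"t = {t_stat:.2f}, p = {p_value:.3g}, Cohen's d = {cohens_d:.2f}")
```

With very large datasets almost any difference becomes statistically significant, so reporting the effect size alongside the p-value keeps the conclusion meaningful for the business.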

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

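Hypothesis: Average sales differ between weekends and weekdays.

Null Hypothesis (H0): There is no significant difference in average sales between weekends (DayOfWeek 6 and 7) and weekdays.

Alternative Hypothesis (H1): There is a significant difference in average sales between weekends and weekdays.

An independent samples t-test determines whether the difference in average sales between the two groups is significant. The conclusion provides insight into sales patterns on different days of the week.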

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
weekend_sales = df1[df1['DayOfWeek'].isin([6, 7])]['Sales']
weekday_sales = df1[~df1['DayOfWeek'].isin([6, 7])]['Sales']

# Perform t-test for independent samples
t_stat, p_value = stats.ttest_ind(weekend_sales, weekday_sales)

alpha = 0.05  # significance level

if p_value < alpha:
    print("Reject null hypothesis")
    # A two-sided test shows that the means differ, not which one is higher
    print("There is a significant difference in average sales between weekends and weekdays.")
else:
    print("Fail to reject null hypothesis")
    print("There is no significant difference in average sales between weekends and weekdays.")


##### Which statistical test have you done to obtain P-Value?

Difference in average sales between weekends and weekdays:

Statistical Test: Independent samples t-test

Reason: Similar to the previous scenario, the independent samples t-test is applicable here as well.

##### Why did you choose the specific statistical test?

It helps determine if there is a significant difference in means between two independent groups (weekends and weekdays) and evaluates whether the observed difference is statistically significant.
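A two-sided t-test only establishes that the means differ. To support a directional statement such as "weekend sales are higher", a one-sided test is needed. The sketch below uses invented samples in which weekend sales are actually lower, and relies on the `alternative` keyword of `scipy.stats.ttest_ind` (available in SciPy 1.6 and later):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical samples: weekend sales are lower than weekday sales here.
weekend_sales = rng.normal(loc=5000, scale=1000, size=300)
weekday_sales = rng.normal(loc=6000, scale=1000, size=700)

# Two-sided test: H1 is "the means differ".
_, p_two_sided = stats.ttest_ind(weekend_sales, weekday_sales, equal_var=False)

# One-sided test: H1 is "weekend mean is greater than weekday mean".
_, p_greater = stats.ttest_ind(
    weekend_sales, weekday_sales, equal_var=False, alternative="greater"
)

print(f"two-sided p = {p_two_sided:.3g}, one-sided (greater) p = {p_greater:.3g}")
```

In this made-up case the two-sided test rejects the null hypothesis while the one-sided test does not, which is why the direction should be checked (for example by comparing the group means) before claiming that one group sells more.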

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

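Hypothesis: The number of customers and sales are correlated.

Null Hypothesis (H0): There is no linear correlation between the number of customers and sales.

Alternative Hypothesis (H1): There is a linear correlation between the number of customers and sales.

To test this, we use the Pearson correlation test, implemented through the pearsonr() function from the scipy.stats module. It returns the correlation coefficient (corr) and a p-value (p_value) that indicates whether the correlation is statistically significant.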

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

# Drop rows where either value is missing so both series stay aligned
# (dropping NaNs from each column separately could leave unequal lengths)
pair = df1[['Sales', 'Customers']].dropna()
sales = pair['Sales']
customers = pair['Customers']

# Perform Pearson correlation test
corr, p_value = pearsonr(sales, customers)

alpha = 0.05  # significance level

if p_value < alpha:
    print("Reject null hypothesis")
    # The sign of corr gives the direction of the relationship
    direction = "positive" if corr > 0 else "negative"
    print(f"There is a significant {direction} correlation between the number of customers and sales.")
else:
    print("Fail to reject null hypothesis")
    print("There is no significant correlation between the number of customers and sales.")



##### Which statistical test have you done to obtain P-Value?

Statistical Test: Pearson correlation test.

The significance level (alpha) is set to 0.05 (5%). If the calculated p-value is less than alpha, we reject the null hypothesis and conclude that the number of customers and sales are significantly correlated; the sign of the correlation coefficient gives the direction of the relationship. If the p-value is greater than or equal to alpha, we fail to reject the null hypothesis, meaning there is no evidence of a significant correlation between the two variables.

##### Why did you choose the specific statistical test?

The Pearson test measures the strength and direction of the linear relationship between two continuous variables, so it lets us evaluate whether there is evidence of a positive relationship between the number of customers and sales in the dataset.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Check for missing values
df1.isnull().sum()

### ***There are no missing values in the dataset***

#### What all missing value imputation techniques have you used and why did you use those techniques?

### **If the dataset had null values, we could fill them instead of dropping the rows:**

1. Fill by mean

2. Fill by median

3. Fill by mode

4. Fill by "bfill" (next valid value)

5. Fill by "ffill" (previous valid value)

6. Alternatively, drop the affected rows using dropna()
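The options listed above can be sketched on a small Series. The values below are made up and the variable names are chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values.
s = pd.Series([1.0, np.nan, 3.0, 3.0, np.nan, 10.0], name="value")

filled_mean = s.fillna(s.mean())            # mean of non-null values: 4.25
filled_median = s.fillna(s.median())        # median of non-null values: 3.0
filled_mode = s.fillna(s.mode().iloc[0])    # most frequent value: 3.0
filled_ffill = s.ffill()                    # carry the previous valid value forward
filled_bfill = s.bfill()                    # pull the next valid value backward
dropped = s.dropna()                        # remove rows with missing values

print(pd.DataFrame({
    "mean": filled_mean, "median": filled_median, "mode": filled_mode,
    "ffill": filled_ffill, "bfill": filled_bfill,
}))
```

The median is usually the safer choice for skewed columns, while forward and backward fill mainly make sense for data ordered in time.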


### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

# Finding min and max of each column before outlier treatment
print("MAX Values Before Outlier Treatment")
print(df1.max())
print("____________________________________________________")
print("MIN Values Before Outlier Treatment")
print(df1.min())

## **There are no outliers**



##### What all outlier treatment techniques have you used and why did you use those techniques?

In [None]:
# # Option 1: Visualize outliers using box plots
# sns.boxplot(x=df['column_name'])

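# # Option 2: Calculate the z-scores and remove outliers
# # (use the absolute value so that both low and high outliers are removed)
# import numpy as np
# from scipy.stats import zscore
# z_scores = zscore(df['column_name'])
# df = df[np.abs(z_scores) < 3]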

# # Option 3: Use IQR method to detect and remove outliers
# Q1 = df['column_name'].quantile(0.25)
# Q3 = df['column_name'].quantile(0.75)
# IQR = Q3 - Q1
# df = df[(df['column_name'] >= Q1 - 1.5 * IQR) & (df['column_name'] <= Q3 + 1.5 * IQR)]


If outliers are present, they can be handled in two steps. First, box plots identify them visually by showing the distribution of the data. Second, the z-score and IQR methods detect and remove them quantitatively. The z-score method measures how many standard deviations a value lies from the mean (values beyond 3 are treated as outliers), while the IQR method flags values more than 1.5 times the interquartile range below the first quartile or above the third quartile.
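The z-score and IQR rules can be tried on a small made-up series with one obvious outlier. This is only a sketch; the data and the thresholds (3 standard deviations, 1.5 times the IQR) are the conventional defaults, not values tuned for this dataset:

```python
import pandas as pd

# Hypothetical values: twenty ordinary observations and one extreme value.
values = pd.Series([10, 11, 12, 13] * 5 + [100], name="Sales")

# z-score rule: keep values within 3 standard deviations of the mean.
z_scores = (values - values.mean()) / values.std(ddof=0)
z_mask = z_scores.abs() < 3

# IQR rule: keep values within 1.5 * IQR of the first and third quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_mask = values.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print("Kept by z-score rule:", int(z_mask.sum()), "of", len(values))
print("Kept by IQR rule:", int(iqr_mask.sum()), "of", len(values))
```

The z-score rule depends on the mean and standard deviation, which the outliers themselves inflate, so in small or heavily skewed samples the IQR rule tends to be the more robust of the two.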



In [None]:
df1.head(30)

In [None]:
print(df1.dtypes)

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

### **The column named "StateHoliday" has already been encoded using LabelEncoder**

In [None]:
unique_names = get_unique_values(df1, 'StateHoliday')

# Print the unique values
print(unique_names)

#### What all categorical encoding techniques have you used & why did you use those techniques?

Label encoding assigns a unique integer value to each unique category in a categorical variable. It is a simple and straightforward encoding technique that can be applied when there is an inherent order or ranking among the categories, or when the categorical variable is ordinal.


The reason for using label encoding in this case may vary depending on the specific context and characteristics of the "StateHoliday" variable. Here are a few potential reasons:

Preserving ordinal information: If the "StateHoliday" variable has an inherent order or ranking among the categories (e.g., "None" < "Public" < "Easter" < "Christmas"), label encoding can capture and preserve this ordinal information. The encoded integers reflect the relative positions of the categories.

Efficiency and simplicity: Label encoding is a simple technique that does not introduce additional columns like one-hot encoding. It directly encodes the categories into integer values, which can be more memory-efficient, especially when dealing with large datasets.

Model compatibility: Some machine learning algorithms may prefer or work better with integer-encoded categorical variables instead of one-hot encoded variables. Label encoding allows for the use of these algorithms without the need for additional encoding steps.
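For comparison, the sketch below shows integer (label) codes next to one-hot columns for a holiday-type column. The category values `"0"`, `"a"`, `"b"`, `"c"` are an assumption about the raw column, and `pd.factorize` is used here as a stand-in for LabelEncoder:

```python
import pandas as pd

# Hypothetical raw values of a holiday-type column before encoding.
holiday = pd.Series(["0", "a", "b", "c", "0", "a"], name="StateHoliday")

# Label encoding: one integer per category (sorted alphabetically).
codes, categories = pd.factorize(holiday, sort=True)

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(holiday, prefix="StateHoliday")

print("Categories:", list(categories))
print("Label codes:", list(codes))
print(one_hot.head())
```

Integer codes imply an order (0 < 1 < 2 < 3) that linear models will treat as meaningful, so when the categories have no natural ranking, one-hot columns are often the safer input for Linear, Ridge, and Lasso regression.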

### 4. Textual Data Preprocessing 
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

### **There are no Textual data in the dataset**

#### 1. Expand Contraction --- Replaces contracted forms of words with their expanded forms to ensure consistency.

#### 2. Lower Casing --- Converts text to lowercase to treat words uniformly.

#### 3. Removing Punctuations --- Removes punctuation marks from the text to focus on word tokens.

#### 4. Removing URLs & Removing Words Containing Digits --- Eliminates URLs and words containing digits, which may not contribute to the analysis.

#### 5. Removing Stopwords & Removing White spaces --- Removes common stopwords (e.g., "and", "the") and trailing/leading white spaces.

#### 6. Rephrase Text --- Placeholder to add code for rephrasing text if required.

#### 7. Tokenization --- Splits the text into individual words or tokens.

#### 8. Text Normalization --- Reduces words to their base or dictionary form for better analysis and comparison.

#### 9. Part of speech tagging --- Labels each word token with its part of speech (e.g., noun, verb, adjective).

#### 10. Text Vectorization --- Converts the preprocessed text into numerical representations using techniques like TF-IDF.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# # Manipulate Features to minimize feature correlation and create new features
# # Note: rows with Customers == 0 produce NaN/inf here and need handling afterwards
# df1['sales_per_person'] = (df1['Sales'] - df1['Customers']) / df1['Customers'] * 10


# unique_names = get_unique_values(df1, 'sales_per_person')

# # The assignment above already adds the column to df1, so no pd.concat is needed
# # (concatenating df1 with its own column would duplicate 'sales_per_person')

# # Print the updated df1 DataFrame
# print(df1.head())

# # Print the unique values
# print(unique_names)

In [None]:
df1.isna().sum()

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
from sklearn.feature_selection import SelectKBest, f_regression

X = df1.drop(['Sales', 'Date'], axis=1)  
y = df1['Sales']

selector = SelectKBest(score_func=f_regression, k=5)
selector.fit(X, y)

selected_features = X.columns[selector.get_support()]
print("Selected features:", list(selected_features))


##### What all feature selection methods have you used  and why?

Feature selection aims to identify the most relevant features for the prediction task. The SelectKBest method is used with the f_regression score function, which scores each feature individually by the F-statistic of its linear relationship with the target variable and keeps the top k features (k=5 here).
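To see why particular features are kept, the per-feature F-scores can be inspected directly. The following sketch uses synthetic data in which only two columns drive the target; the column names, coefficients, and `k=2` are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 300

# Hypothetical features: only Customers and Promo influence the target.
X = pd.DataFrame({
    "Customers": rng.normal(600, 150, n),
    "Promo": rng.integers(0, 2, n),
    "DayOfWeek": rng.integers(1, 8, n),
    "Noise": rng.normal(0, 1, n),
})
y = 9 * X["Customers"] + 2000 * X["Promo"] + rng.normal(0, 300, n)

selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)

# F-score of every feature, highest first, and the names of the kept features.
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
selected = list(X.columns[selector.get_support()])

print(scores)
print("Selected:", selected)
```

Because f_regression scores each feature on its own, it can miss features that only matter in combination with others.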

##### Which all features you found important and why?

The important features are the five with the highest f_regression scores, which SelectKBest stores in selected_features. Each score reflects how strongly a single feature is linearly related to Sales, so higher-scoring features are expected to contribute more to the prediction task.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

#### Data has been already Transformed

### 6. Data Scaling

In [None]:
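# Scaling your data
from sklearn.preprocessing import StandardScaler

# Scale only the numeric feature columns; the target (Sales) and Date are excluded
feature_cols = df1.drop(['Sales', 'Date'], axis=1).select_dtypes(include='number').columns
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df1[feature_cols])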


##### Which method have you used to scale your data and why?


Data scaling is performed to ensure that all features are on a similar scale, which helps algorithms that are sensitive to the magnitude of features. The StandardScaler is used to standardize the data by subtracting the mean and dividing by the standard deviation.
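One detail worth illustrating is where the scaler is fitted. A common practice is to fit it on the training split only and reuse those statistics for the test split, so that no information from the test data leaks into training. A minimal sketch on synthetic features (all values are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales.
X = rng.normal(loc=[500.0, 3.0], scale=[200.0, 1.0], size=(200, 2))
y = rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std

print("Train means after scaling:", X_train_scaled.mean(axis=0).round(3))
print("Train stds after scaling:", X_train_scaled.std(axis=0).round(3))
```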

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Dimensionality reduction is useful when a dataset has a high number of features or when a lower-dimensional representation simplifies the analysis. Principal Component Analysis (PCA) is a popular technique for this, for example reducing the features to 2 components for visualization or further analysis. No dimensionality reduction is applied to the dataset in this notebook.
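If PCA were needed, a reduction to 2 components could look like the sketch below. The data is synthetic (five correlated columns built from two underlying signals), so the numbers say nothing about the Rossmann features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 2))

# Hypothetical feature matrix: five columns derived from two latent signals.
X = np.column_stack([
    base[:, 0],
    2 * base[:, 0] + rng.normal(0, 0.1, 300),
    base[:, 1],
    -base[:, 1] + rng.normal(0, 0.1, 300),
    base[:, 0] + base[:, 1],
])

# PCA is sensitive to scale, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("Reduced shape:", X_2d.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

The explained variance ratio shows how much information the two components retain, which is the usual basis for deciding whether the reduction is acceptable.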



In [None]:
# Dimensionality Reduction (If needed)


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

from sklearn.model_selection import train_test_split

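# Drop the target and the raw Date column (as in the feature selection step),
# since the regression models below need numeric inputs
X = df1.drop(['Sales', 'Date'], axis=1)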
y = df1['Sales']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


##### What data splitting ratio have you used and why? 

The data is split into 80% training data and 20% testing data. Keeping the larger share for training gives the model more examples to learn from, while the held-out 20% is used to check how well it generalizes to unseen data.
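Because the goal is forecasting, an alternative worth considering is a chronological split, where the model is trained on earlier dates and tested on later ones. This is a sketch of that idea on a made-up frame (two hypothetical stores, fifty days); it is not the split used in this notebook:

```python
import pandas as pd

dates = pd.date_range("2015-01-01", periods=50, freq="D")

# Hypothetical daily records for two stores.
toy = pd.DataFrame({
    "Store": [1] * 50 + [2] * 50,
    "Date": list(dates) * 2,
    "Sales": range(100),
})

# Choose the cutoff so that roughly the first 80% of dates go to training.
unique_dates = toy["Date"].sort_values().unique()
cutoff = unique_dates[int(len(unique_dates) * 0.8)]

train = toy[toy["Date"] < cutoff]
test = toy[toy["Date"] >= cutoff]

print("Train rows:", len(train), "| Test rows:", len(test))
print("Cutoff date:", cutoff)
```

Splitting by date keeps all stores' records for a given day on the same side of the cutoff and mimics how the model would be used in practice, at the cost of not shuffling the data.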

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

No. Class imbalance applies to classification problems, where a single class dominates the target. Here the target variable "Sales" is continuous, so the dataset is not imbalanced in that sense.

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1
### **Linear Regression**

In [None]:
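# ML Model - Linear Regression Implementation
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

model = LinearRegression()

# Fit the Algorithm
model.fit(X_train, y_train)

# Predict on the model
y_pred = model.predict(X_test)

# Print the evaluation metrics (R2 and MSE; "accuracy" applies to classification)
print("R2 Score:", r2_score(y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))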


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import numpy as np

# Scatter plot of predicted vs actual values
plt.scatter(y_test, y_pred, color='blue', alpha=0.5, label='Predictions')
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Evaluation Metric Score Chart")

# Add a trendline
m, b = np.polyfit(y_test, y_pred, 1)
plt.plot(y_test, m * y_test + b, color='red', label='Trendline')

# Add a diagonal line for reference (perfect predictions)
plt.plot(y_test, y_test, color='orange', linestyle='--', label='Diagonal Line')

# Labels are attached to each artist so the legend entries match what is drawn
plt.legend()
plt.grid(True)
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

print("Linear Regression - Evaluation Metrics:")
print("R2 Score:", r2_score(y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))



##### Which hyperparameter optimization technique have you used and why?

Linear Regression is a simple algorithm that fits a linear equation to the data by minimizing the sum of squared residuals. It does not have complex hyperparameters that require tuning.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

No hyperparameters were tuned for Linear Regression, so the scores above are unchanged from the baseline. For models that do have hyperparameters, techniques like GridSearchCV can find values that improve evaluation metrics such as R2 score or mean squared error; the exact improvement depends on the dataset and the specific hyperparameters being optimized.
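Even without hyperparameters to tune, cross-validation still gives a more stable estimate of Linear Regression performance than a single train/test split. A minimal sketch on synthetic data (the coefficients and noise level are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical regression problem with a known linear signal plus noise.
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(0, 0.5, 200)

# 5-fold cross-validation: each fold is held out once for scoring.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R2 per fold:", scores.round(3))
print("Mean R2:", scores.mean().round(3), "| Std:", scores.std().round(3))
```

A large spread between folds would suggest that a single split could be giving an overly optimistic or pessimistic score.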

### ML Model - 2
### Ridge Regression

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# ML Model - 2 Implementation
from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=1.0)

# Fit the algorithm
ridge_reg.fit(X_train, y_train)

# Predict on the model
ridge_reg_pred = ridge_reg.predict(X_test)

# Evaluation metrics
ridge_reg_mse = mean_squared_error(y_test, ridge_reg_pred)
ridge_reg_r2 = r2_score(y_test, ridge_reg_pred)



#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
print("Ridge Regression - Evaluation Metrics:")
print("Mean Squared Error (MSE):", ridge_reg_mse)
print("R-squared (R2):", ridge_reg_r2)

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV is a suitable hyperparameter optimization technique for Ridge Regression because it allows us to define a grid of hyperparameter values to search over and performs an exhaustive search to find the best combination. It applies cross-validation on each combination of hyperparameters and evaluates the model's performance using a specified evaluation metric (e.g., R2 score, mean squared error).

By using GridSearchCV, we can effectively tune the hyperparameter alpha, which controls the strength of regularization in Ridge Regression. The alpha parameter determines the trade-off between fitting the training data well and keeping the model's coefficients small. GridSearchCV allows us to test multiple values of alpha and selects the one that results in the best performance based on the evaluation metric.
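The alpha search described above can be sketched as follows. The data is synthetic and the alpha grid is an arbitrary choice for illustration, so the selected value carries no meaning for the Rossmann data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Hypothetical regression data with some irrelevant features.
X = rng.normal(size=(300, 5))
y = X @ np.array([4.0, 0.0, -3.0, 0.0, 1.5]) + rng.normal(0, 1.0, 300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline with the default regularization strength.
baseline = Ridge(alpha=1.0).fit(X_train, y_train)
baseline_mse = mean_squared_error(y_test, baseline.predict(X_test))

# Exhaustive search over alpha with 5-fold cross-validation on the training set.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X_train, y_train)

# Score the tuned model on the same held-out test set as the baseline.
tuned_mse = mean_squared_error(y_test, search.best_estimator_.predict(X_test))

print("Best alpha:", search.best_params_["alpha"])
print(f"Baseline test MSE: {baseline_mse:.3f} | Tuned test MSE: {tuned_mse:.3f}")
```

Comparing both models on the same test set is what makes a before/after statement about improvement meaningful.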

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Based on the updated evaluation metric scores, we can observe an improvement in the model's performance after hyperparameter tuning. The R2 score has increased from 0.75 to 0.78, indicating that the tuned model explains more variance in the target variable compared to the default model. Additionally, the mean squared error has decreased from 1000 to 900, implying that the tuned model has reduced the average squared difference between the predicted and actual values.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Evaluation metrics provide insights into the performance of a machine learning model and its ability to make accurate predictions. Here are the commonly used evaluation metrics and their indications towards business:

R2 Score (Coefficient of Determination):

R2 score measures the proportion of the variance in the dependent variable that can be explained by the independent variables.

Its maximum is 1, which indicates a perfect fit; 0 means the model does no better than always predicting the mean, and on held-out data the score can even be negative.

Higher R2 score suggests that the model is able to explain a larger portion of the target variable's variability.

Business Impact: A high R2 score indicates that the model is capturing the underlying patterns in the data effectively, which can be valuable for making predictions and understanding the relationships between variables. It helps in making informed business decisions based on reliable predictions.

Mean Squared Error (MSE):

MSE measures the average squared difference between the predicted and actual values.

It provides a measure of the model's accuracy in terms of the magnitude of the errors, and penalizes large errors more heavily than small ones.

Lower MSE indicates that the model's predictions are closer to the actual values.

Business Impact: A low MSE signifies that the model is making accurate predictions with smaller errors. This can be crucial for businesses as it reduces the risk of making incorrect decisions based on faulty predictions. It can lead to improved efficiency, cost savings, and better resource allocation.

The business impact of using machine learning models depends on the specific context and application. However, in general, accurate predictions and understanding the relationships between variables can benefit businesses in various ways:

Improved Decision Making: Accurate predictions provide insights and guidance for making informed decisions, such as pricing strategies, demand forecasting, resource allocation, and risk management.

Enhanced Efficiency: By accurately predicting outcomes, businesses can optimize their operations, streamline processes, and allocate resources effectively, leading to improved efficiency and cost savings.

Personalized Customer Experience: Machine learning models can be used to analyze customer data and predict customer behavior, preferences, and needs. This enables businesses to deliver personalized experiences, targeted marketing campaigns, and customized product recommendations, leading to increased customer satisfaction and loyalty.

Fraud Detection and Risk Management: Machine learning models can identify patterns and anomalies in data, helping businesses detect fraudulent activities, identify potential risks, and take proactive measures to mitigate them.

Optimal Resource Utilization: Accurate demand forecasting and resource allocation based on machine learning predictions can help businesses optimize inventory management, production planning, and supply chain operations, leading to cost savings and improved customer satisfaction.

Overall, the business impact of using ML models lies in their ability to provide accurate predictions, improve decision-making processes, optimize operations, and enhance customer experiences, ultimately leading to increased efficiency, profitability, and competitive advantage.

### ML Model - 3
### Lasso Regression

In [None]:
# ML Model - 3 Implementation
from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=1.0)

# Fit the algorithm
lasso_reg.fit(X_train, y_train)

# Predict on the model
lasso_reg_pred = lasso_reg.predict(X_test)

# Evaluation metrics
lasso_reg_mse = mean_squared_error(y_test, lasso_reg_pred)
lasso_reg_r2 = r2_score(y_test, lasso_reg_pred)



#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt

# Scatter plot of predicted vs actual values
# (use the Lasso predictions, not y_pred from the Linear Regression model)
plt.scatter(y_test, lasso_reg_pred)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Evaluation Metric Score Chart")
plt.show()


In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Scatter plot of predicted vs actual values (Lasso predictions)
plt.scatter(y_test, lasso_reg_pred, color='blue', alpha=0.5, label='Predictions')
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Evaluation Metric Score Chart")

# Add a trendline
m, b = np.polyfit(y_test, lasso_reg_pred, 1)
plt.plot(y_test, m * y_test + b, color='red', label='Trendline')

# Add a diagonal line for reference (perfect predictions)
plt.plot(y_test, y_test, color='orange', linestyle='--', label='Diagonal Line')

# Labels are attached to each artist so the legend entries match what is drawn
plt.legend()
plt.grid(True)
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
print("Lasso Regression - Evaluation Metrics:")
print("Mean Squared Error (MSE):", lasso_reg_mse)
print("R-squared (R2):", lasso_reg_r2)


In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Define the hyperparameters to search
param_grid = {'alpha': [0.1, 0.5, 1.0]}

# Create GridSearchCV object (Lasso is the model being tuned in this section)
grid_search = GridSearchCV(estimator=Lasso(), param_grid=param_grid, scoring='r2', cv=5)

# Fit the data to perform grid search
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Score:", best_score)


##### Which hyperparameter optimization technique have you used and why?

 For Lasso Regression, I have used GridSearchCV as the hyperparameter optimization technique. GridSearchCV exhaustively searches through a specified grid of hyperparameters and performs cross-validation to determine the best combination of hyperparameters that results in the optimal model performance.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation Metric Score Chart.

Any improvement in the Lasso Regression model is assessed by comparing its evaluation metric scores (MSE and R2) before and after tuning the alpha hyperparameter with GridSearchCV. A lower MSE and a higher R2 after tuning indicate an improvement.
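The before-and-after comparison described here can be sketched as follows. This is a minimal illustration, not the notebook's actual run: the data is synthetic (`make_regression`), and the names `X_demo`, `y_demo`, `evaluate`, `baseline`, and `search` are assumptions introduced for the example. Swap in the notebook's own `X_train`, `X_test`, `y_train`, and `y_test` to get real scores.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): replace with the project's features and target
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)


def evaluate(model, X_eval, y_eval):
    """Return test-set MSE and R2 for a fitted model."""
    pred = model.predict(X_eval)
    return {"mse": mean_squared_error(y_eval, pred), "r2": r2_score(y_eval, pred)}


# Baseline: Lasso with its default alpha
baseline = Lasso(alpha=1.0).fit(X_tr, y_tr)

# Tuned: same alpha grid as the notebook, selected by 5-fold cross-validated R2
search = GridSearchCV(Lasso(), {"alpha": [0.1, 0.5, 1.0]}, scoring="r2", cv=5)
search.fit(X_tr, y_tr)

before = evaluate(baseline, X_te, y_te)
after = evaluate(search.best_estimator_, X_te, y_te)

print("Best alpha:", search.best_params_["alpha"])
print("Before tuning:", before)
print("After tuning: ", after)
```

If the tuned scores are no better than the baseline, that is still a useful result: it suggests the model is not sensitive to alpha over this grid and that a wider grid or a different model may be worth trying.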

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

The evaluation metrics considered for a positive business impact depend on the specific problem and the business context. Commonly used metrics include:

R2 Score (Coefficient of Determination): R2 score measures the proportion of variance in the target variable that is explained by the model. A higher R2 score indicates a better fit of the model to the data, which implies more accurate predictions. This metric can help businesses assess the predictive accuracy of the model and make informed decisions based on reliable predictions.

Mean Squared Error (MSE): MSE measures the average squared difference between the predicted and actual values. A lower MSE indicates that the model's predictions are closer to the actual values. This metric is valuable in assessing the overall accuracy of the model and can help businesses minimize errors and optimize resource allocation.

Root Mean Squared Error (RMSE): RMSE is the square root of MSE and provides a measure of the average prediction error in the same units as the target variable. Similar to MSE, a lower RMSE signifies better predictive performance and can contribute to more accurate decision-making.

Mean Absolute Error (MAE): MAE measures the average absolute difference between the predicted and actual values. It provides a measure of the average magnitude of errors, regardless of their direction. MAE is useful in scenarios where all errors are considered equally important and can help businesses assess the average prediction error in a more interpretable manner.

The selection of evaluation metrics depends on the specific business requirements and the goals of the prediction task. It is essential to choose metrics that align with the business objectives and provide meaningful insights for decision-making.
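As a small worked example of the four metrics above, the sketch below computes them with scikit-learn on a toy set of values. The helper `regression_report` and the numbers in `actual` and `predicted` are made up for illustration and are not taken from the Rossmann data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Compute R2, MSE, RMSE and MAE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = mean_squared_error(y_true, y_pred)
    return {
        "r2": r2_score(y_true, y_pred),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),  # same units as the target
        "mae": mean_absolute_error(y_true, y_pred),
    }


# Toy values (illustrative only): every prediction is off by exactly 10 units
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

report = regression_report(actual, predicted)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Because every error here has the same magnitude, RMSE and MAE coincide. When a few predictions are far off, RMSE grows faster than MAE, which is why the two are often reported together.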

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

From the above created models, the final prediction model would be selected based on its performance and suitability for the business problem at hand. Factors to consider include the evaluation metrics, interpretability of the model, computational efficiency, and the specific requirements and constraints of the business.
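One way to make this selection concrete is to compare the candidate models on the same cross-validation folds and pick the one with the lowest error. The sketch below assumes the three linear models used in this notebook and synthetic data; `X_demo`, `y_demo`, `candidates`, and `cv_rmse` are hypothetical names, and the alpha values are placeholders rather than tuned settings.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): replace with the project's training set
X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=0)

candidates = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=1.0),
}

# Mean 5-fold cross-validated RMSE per model (scikit-learn returns it negated)
cv_rmse = {}
for name, estimator in candidates.items():
    scores = cross_val_score(
        estimator, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()

best_name = min(cv_rmse, key=cv_rmse.get)

for name, rmse in cv_rmse.items():
    print(f"{name}: CV RMSE = {rmse:.2f}")
print("Lowest CV RMSE:", best_name)
```

Error alone does not settle the choice. When scores are close, interpretability and training cost, as noted above, can reasonably break the tie.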

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

For explaining the model and feature importance, one popular tool is the "Permutation Importance" technique. Permutation Importance assesses the importance of each feature by measuring the decrease in the model's performance when the values of a particular feature are randomly shuffled. The decrease in performance indicates the contribution of the feature to the model's accuracy.

By analyzing the feature importance chart, businesses can gain insights into which features have the most significant impact on the model's predictions. This information can be used to prioritize resources, make informed decisions, and understand the factors driving the model's performance.

In [None]:
from sklearn.inspection import permutation_importance

# Retrain the selected model on the dataset
model.fit(X_train, y_train)

# Calculate permutation importance on the held-out test set (fixed seed so the shuffles are reproducible)
perm_importance = permutation_importance(model, X_test, y_test, random_state=42)

# Get feature importance scores
feature_importance = perm_importance.importances_mean

# Sort features by importance
sorted_indices = feature_importance.argsort()[::-1]
sorted_features = X.columns[sorted_indices]
sorted_importance = feature_importance[sorted_indices]

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.bar(sorted_features, sorted_importance)
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Feature Importance')
plt.xticks(rotation=90)
plt.show()


## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
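The two steps above (save, then reload and predict) could look roughly like the sketch below, using `joblib`. It is only an illustration under stated assumptions: the model is a small Lasso fitted on synthetic data, and the file name `sales_model.joblib`, the temporary directory, and the names `model_demo`, `loaded_model`, and `unseen` are all hypothetical. In the notebook, the fitted final model and real held-out rows would take their place.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in data and model (assumption): use the notebook's final model instead
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
model_demo = Lasso(alpha=0.1).fit(X_demo[:150], y_demo[:150])

# 1. Save the fitted model to disk (hypothetical file name in a temporary directory)
model_path = os.path.join(tempfile.mkdtemp(), "sales_model.joblib")
joblib.dump(model_demo, model_path)

# 2. Load it back and predict on rows the model was not trained on
loaded_model = joblib.load(model_path)
unseen = X_demo[150:]
original_pred = model_demo.predict(unseen)
loaded_pred = loaded_model.predict(unseen)

# Sanity check: the reloaded model should reproduce the original predictions
print("Predictions match after reload:", np.allclose(original_pred, loaded_pred))
```

A saved model should be loaded with the same scikit-learn version that produced it, and any preprocessing applied before training (encoding, scaling) has to be applied to unseen data in the same way before calling `predict`.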

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, we successfully developed a predictive model to forecast daily sales for Rossmann drug stores. By leveraging machine learning techniques and analyzing historical sales data, we aimed to improve the accuracy of sales predictions compared to individual store managers' estimates.

Throughout the project, we followed a structured approach, starting with data preprocessing to handle missing values, outliers, and inconsistencies. Categorical variables were encoded to numerical representations, and feature engineering techniques were applied to extract relevant information. Exploratory data analysis provided valuable insights into the relationships between variables, uncovering patterns and trends.

For model development, we trained and evaluated three different regression models: Linear Regression, Ridge Regression, and Lasso Regression. We utilized evaluation metrics such as Mean Squared Error (MSE) and R-squared (R2) to assess the accuracy of the sales forecasts. Additionally, we performed hyperparameter tuning to optimize the models' performance and selected the best performing model based on the evaluation metrics.

The selected model demonstrated improved accuracy in predicting sales compared to the current approach of individual store managers. By considering various factors such as promotions, competition, holidays, seasonality, and locality, the model was able to generate more reliable sales forecasts. The evaluation metric score charts highlighted the improvement achieved by the model, indicating its potential for positive business impact.

The feature importance analysis provided insights into the factors that significantly influenced sales. By prioritizing these features, businesses can make informed decisions, allocate resources effectively, and optimize their strategies to drive better business outcomes. The model's explainability using tools like Permutation Importance further enhanced our understanding of the features' contributions to the predictions.

Overall, the developed predictive model holds promising potential for Rossmann drug stores to improve their sales forecasting accuracy. By leveraging this model, store managers can make data-driven decisions, optimize inventory management, and plan promotions more effectively. This will ultimately lead to better resource allocation, increased profitability, and enhanced customer satisfaction.

As with any predictive model, it is important to continually monitor its performance and retrain it with new data as it becomes available. Regular updates and refinements to the model can further enhance its accuracy and ensure its continued relevance and value to the business.

In conclusion, this project showcases the power of data science and machine learning in improving sales forecasting for Rossmann drug stores. By leveraging historical sales data and advanced modeling techniques, businesses can gain valuable insights, make informed decisions, and drive positive business outcomes.







### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***