<a href="https://colab.research.google.com/github/Nayan3101/Capstone-project-Regression---Retail-Sales-Prediction/blob/main/Nayan_Capstone_Project_M6_Retail_Sales_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Regression - Retail Sales Prediction**



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Team Member 1 -** - NAYAN SURYAWANSHI


# **Project Summary -**

# **GitHub Link -**

https://github.com/Nayan3101

# **Problem Statement**


**The retail industry, exemplified by Rossmann's 3,000 drug stores across 7 European countries, faces the recurring challenge of accurately predicting daily sales up to six weeks in advance. The accuracy of these sales forecasts is significantly affected by various dynamic factors, including promotional activities, competition, school and state holidays, seasonality, and geographical location. Currently, individual Rossmann store managers rely on their unique insights and circumstances to make these predictions, resulting in a wide range of forecasting accuracy.**

**In this machine learning capstone project, the primary objective is to develop a robust regression model for Retail Sales Prediction. The project leverages historical sales data from 1,115 Rossmann stores to forecast the "Sales" column in the test set. This predictive model should take into account the intricate interplay of sales-influencing variables, adapt to temporal and geographical variations, and accommodate the occasional store closures due to refurbishment.**

**The successful completion of this project will have far-reaching implications. By delivering a reliable sales prediction model, it will not only enhance the overall accuracy of sales forecasts for Rossmann but also streamline the decision-making process for store managers. This model is expected to optimize inventory management, allocation of resources, and promotion planning, ultimately contributing to the company's profitability and customer satisfaction. Furthermore, it will provide a data-driven framework to improve sales predictions, leading to better informed business decisions and more efficient store operations.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import missingno as msno
import matplotlib
import matplotlib.pylab as pylab

%matplotlib inline
matplotlib.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 8,6

import math
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import LassoLars
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import r2_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import ElasticNet

### Dataset Loading

In [None]:
# mounting drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
url1 = '/content/drive/MyDrive/almabetter/store.csv'
data1 = pd.read_csv(url1)

In [None]:
url2 = '/content/drive/MyDrive/almabetter/Rossmann Stores Data.csv'
data2 = pd.read_csv(url2)

### Dataset First View

In [None]:
# Dataset First Look
data1

In [None]:
data2

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data1.shape

In [None]:
data2.shape

### Dataset Information

In [None]:
# Dataset Info
data1.info()

In [None]:
data2.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data1.duplicated().sum()

In [None]:
data2.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data1.isnull().sum()

In [None]:
# Missing Values/Null Values Count for data2
data2.isnull().sum()

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data1.columns

In [None]:
data2.columns

In [None]:
# Dataset Describe
data1.describe()

In [None]:
data2.describe()

### Variables Description

**DATA1**
  1. **Store:** unique identifier for each Rossmann store.

  2. **StoreType:** category of the store (e.g., A, B, C, or D).

  3. **Assortment:** describe the level of products available in the store.

  4. **CompetitionDistance:** distance to the nearest competitor store.

  5. **CompetitionOpenSinceMonth**: specify the month when the nearest competitor store opened.

  6. **CompetitionOpenSinceYear:** year when the nearest competitor store opened.

  7. **Promo2:** This binary column (0 or 1) indicates whether Promo2 is active for the store.

  8. **Promo2SinceWeek:** calendar week when promo2 started for the store.

  9. **Promo2SinceYear:** year when promo2 started.

  10. **PromoInterval:** describe the intervals during which promo2 runs.

**DATA2**
  1. **Store:** unique identifier for each Rossmann store.

  2. **DayOfWeek:** indicates the day of the week for each data point, ranging from 1 (Monday) to 7 (Sunday). It helps capture weekly sales patterns.

  3. **Date:** This column contains the date of the sales data point, providing a time reference for each observation. It is crucial for analyzing time series data and seasonality.

  4. **Sales:** represents the daily sales of the store on a particular date.

  5. **Customers:** number of customers who visited the store.

  6. **Open:** This binary column indicates whether the store was open (1) or closed (0) on the respective date.

  7. **Promo:** This binary column (0 or 1) may represent whether there was a promotion on the given date.

  8. **StateHoliday:** indicate whether the date corresponds to a state holiday.

  9. **SchoolHoliday:** This binary column (0 or 1) indicate whether there was a school holiday on the given date.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_value_d1 = {col: data1[col].unique() for col in data1.columns}
unique_value_d1



In [None]:
unique_count_d1 = {col: len(data1[col].unique()) for col in data1.columns}
unique_count_d1

In [None]:
unique_value_d2 = {col: data2[col].unique() for col in data2.columns}
unique_value_d2

In [None]:
unique_count_d2 = {col: len(data2[col].unique()) for col in data2.columns}
unique_count_d2

## 3. ***Data Wrangling***

### Data Wrangling Code

**Since there are no null values in data2, we only need to handle the null values in data1**

In [None]:
# Count nulls once and reuse the result for both the bars and their labels
missing_counts = data1.isnull().sum()

plt.figure(figsize=(8, 6))
ax = missing_counts.plot(kind='bar', color='skyblue')
plt.title('Missing Values Bar Chart')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
# Annotate each bar with its missing-value count (the plot is drawn only once)
for i, count in enumerate(missing_counts):
    ax.text(i, count + 1, str(count), ha='center', va='bottom')

plt.show()

**To replace the null values we first need to know the data type of each column in data1, so that each one can be filled with its mean, median, or mode as appropriate**
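As a rough illustration of choosing a fill value by data type, the sketch below fills numeric columns with the median and text columns with the mode. The `toy_store` frame and the `impute_by_dtype` helper are hypothetical and not part of this notebook, which goes on to use the mean and fixed fill values instead.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the store table; the values are invented for illustration.
toy_store = pd.DataFrame({
    "CompetitionDistance": [570.0, np.nan, 14130.0, 620.0],
    "PromoInterval": ["Jan,Apr,Jul,Oct", None, "Jan,Apr,Jul,Oct", None],
})


def impute_by_dtype(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric nulls set to the median and other nulls to the mode."""
    filled = frame.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            # The median is less sensitive to extreme distances than the mean
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            # mode() ignores nulls; take the first value if there is a tie
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled


toy_filled = impute_by_dtype(toy_store)
print(toy_filled)
```

Whether the median, the mean, or a sentinel value is the better choice depends on how skewed the column is and on what a missing entry actually means.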

In [None]:
data1.dtypes

In [None]:
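# Assign the result back; inplace=True on a column selection may not update data1 in recent pandas
data1['CompetitionDistance'] = data1['CompetitionDistance'].fillna(data1['CompetitionDistance'].mean())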


**'CompetitionOpenSinceMonth' & 'CompetitionOpenSinceYear'**
Both columns record the month and year in which the nearest competitor store opened. A null value is taken here to mean that no competitor opening is recorded for the store, so we will fill the null values with "0".

In [None]:
data1['CompetitionOpenSinceMonth'] = data1['CompetitionOpenSinceMonth'].fillna(0)
data1['CompetitionOpenSinceYear'] = data1['CompetitionOpenSinceYear'].fillna(0)

**'Promo2SinceWeek' & 'Promo2SinceYear'**
These two columns give the calendar week and year in which the store started Promo2. Empty rows are taken to mean that the store has no Promo2 start recorded, so we will fill the null values with "0" and mark the missing 'PromoInterval' entries as 'NoPromo'.

In [None]:
data1['Promo2SinceWeek'] = data1['Promo2SinceWeek'].fillna(0)
data1['Promo2SinceYear'] = data1['Promo2SinceYear'].fillna(0)
data1['PromoInterval'] = data1['PromoInterval'].fillna('NoPromo')

In [None]:
has_null_values = data1.isnull().any().any()

if has_null_values:
    print("There are still null values in the data.")
else:
    print("There are no null values in the data.")

In [None]:
data1.isnull().sum()

In [None]:
data2.isnull().sum()

**Now we can clearly see that both datasets have 0 null values**

In data2 the "Date" column is stored with the object (string) dtype, so we need to convert it to a datetime dtype.

In [None]:
data2['Date'] = pd.to_datetime(data2['Date'],format='%Y-%m-%d')
print(data2['Date'].dtype)
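The conversion can be tried on a few hand-written strings first. This is a minimal sketch on invented data; `errors="coerce"` is an optional safeguard assumed here (the cell above does not use it) that turns unparseable entries into `NaT` instead of raising.

```python
import pandas as pd

# Invented sample: two valid dates and one malformed entry
toy_dates = pd.Series(["2015-07-31", "2015-07-30", "not a date"])

# With errors="coerce", anything that does not match the format becomes NaT
parsed = pd.to_datetime(toy_dates, format="%Y-%m-%d", errors="coerce")

print(parsed.dtype)
print("Unparseable entries:", parsed.isna().sum())
```

Counting the `NaT` values after conversion is a cheap way to confirm that every row was parsed as intended.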

In [None]:
data2.dtypes

In [None]:
data1.dtypes

**NOW WE WILL MERGE BOTH DATA1 AND DATA2 FOR VISUALIZATION AND ANALYSIS WORK**

In [None]:
# Merging both datasets
df = data2.merge(right=data1, on='Store',how='left')
df
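A left merge like the one above should keep exactly one output row per sales record. The sketch below, on invented frames named `toy_sales` and `toy_stores`, shows one way this could be checked: `validate="many_to_one"` asks pandas to raise if a store appears more than once in the right-hand table.

```python
import pandas as pd

# Invented daily sales (many rows per store) and store attributes (one row per store)
toy_sales = pd.DataFrame({"Store": [1, 1, 2], "Sales": [5263, 5020, 6064]})
toy_stores = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

# validate="many_to_one" raises MergeError if 'Store' is duplicated in toy_stores
merged = toy_sales.merge(toy_stores, on="Store", how="left", validate="many_to_one")

# A left merge against a unique key must not change the number of rows
rows_preserved = len(merged) == len(toy_sales)

# Sales rows whose store has no match would show up as nulls in the store columns
unmatched = merged["StoreType"].isna().sum()

print(rows_preserved, unmatched)
```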

In [None]:
df.info()

**Now we will separately extract the year, month, week of year, and day name from the "Date" column**

In [None]:
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['WeekOfYear'] = df['Date'].dt.isocalendar().week.astype(int)  # .dt.weekofyear was removed in pandas 2.0
df['DayName'] = df['Date'].dt.day_name()
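The `Series.dt.weekofyear` accessor was removed in pandas 2.0, so on current versions the week number comes from `dt.isocalendar()`. A small sketch on two invented dates follows; the second date is chosen to show that ISO week 1 can begin in the previous calendar year.

```python
import pandas as pd

toy = pd.DataFrame({"Date": pd.to_datetime(["2015-07-31", "2014-12-29"])})

toy["Year"] = toy["Date"].dt.year
toy["Month"] = toy["Date"].dt.month
# isocalendar() returns a frame with 'year', 'week' and 'day' columns (ISO 8601)
toy["WeekOfYear"] = toy["Date"].dt.isocalendar().week.astype(int)
toy["DayName"] = toy["Date"].dt.day_name()

print(toy)
```

Because 2014-12-29 belongs to ISO week 1 of 2015, grouping by `Year` and `WeekOfYear` together can mix weeks at year boundaries; the `year` column returned by `isocalendar()` avoids this if it matters for the analysis.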

In [None]:
df

In [None]:
df.copy()

In [None]:
df.columns

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

**<h2>STORE CLOSURE PATTERN**

In [None]:
closure_count = df.groupby(['DayOfWeek','Open'])['Open'].count().unstack()
closure_count

In [None]:
closure_count.plot(kind='bar',figsize=(15,5),color=['violet','purple'],fontsize=8)
plt.title('''STORE CLOSURE PATTERN
(0 = Closed & 1 = Open)''',fontsize = 12,color='purple')
plt.xlabel('WEEK DAY NUMBER',fontsize = 10,color='purple')
plt.ylabel('COUNT',fontsize = 10,color='purple')

plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar graph for visualizing the store closure pattern because it is an effective way to display categorical data (day of the week) against a quantitative variable (number of store closures). Bar graphs make it easy to compare and highlight differences in closure patterns across different days of the week.

##### 2. What is/are the insight(s) found from the chart?

The primary insight from the chart is that a significant number of stores are closed on Sundays. This is evident from the relatively low bar of open for Sunday compared to the bars for other days of the week. The data shows a clear trend of store closures, indicating that Sunday is the day when many stores choose to remain closed.
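The same pattern can be expressed as a percentage of closed store-days per weekday. The sketch below uses an invented `toy` frame with the same `DayOfWeek` and `Open` columns; the numbers are illustrative only and are not results from the Rossmann data.

```python
import pandas as pd

# Invented records: weekday number and whether the store was open (1) or closed (0)
toy = pd.DataFrame({
    "DayOfWeek": [1, 1, 6, 6, 7, 7, 7, 7],
    "Open":      [1, 1, 1, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the open rate, so 1 - mean is the closure rate
closed_share = (1 - toy.groupby("DayOfWeek")["Open"].mean()) * 100

print(closed_share.round(1))
```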

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


The insight that stores are predominantly closed on Sundays can have both positive and negative implications for a business, depending on the context:

**Positive Impact:**

  1. **Cost Savings:** Stores that close on Sundays can potentially reduce operating costs such as labor, utilities, and maintenance.
  2. **Employee Well-Being:** It may provide employees with a regular day off, contributing to better work-life balance and job satisfaction.
  3. **Resource Allocation:** Knowing that Sundays are slower business days, retailers can optimize staff scheduling and resource allocation.

**Negative Impact:**

  1. **Missed Sales Opportunities:** Closing on Sundays may lead to missed sales opportunities, as some customers prefer to shop on weekends. This could impact overall revenue.
  2. **Competitive Disadvantage:** If competitors remain open on Sundays, a store's closure on that day could put them at a competitive disadvantage.
  3. **Customer Inconvenience:** Customers who rely on Sunday shopping might find it inconvenient if the store is closed, potentially leading to dissatisfaction.

#### Chart - 2

**<h2>SALES AFFECTED BY SCHOOL HOLIDAY**

In [None]:
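# Share of daily records with (1) and without (0) a school holiday
labels = 'Not-Affected' , 'Affected'
# sort_index() keeps the order 0, 1 so the slices always match the labels
sizes = df.SchoolHoliday.value_counts().sort_index()
colors = ['olivedrab', 'silver']
explode = (0.1, 0.0)
# Create the figure up front; the original referenced an undefined `fig`
fig = plt.figure(figsize=(5, 5))
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=180)
plt.axis('equal')
plt.title("Sales Affected by Schoolholiday",fontsize=20)
plt.show()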

##### 1. Why did you pick the specific chart?

I chose a pie chart to visualize how the daily records split between school holiday and non-school holiday days. A pie chart is a suitable choice when you want to show the proportion of a whole (all store-day records) that falls into different categories (affected and unaffected by school holidays). It provides a clear and easy-to-understand representation of the percentage breakdown.

##### 2. What is/are the insight(s) found from the chart?

The primary insight from the chart is that approximately 17.9% of the store-day records fall on school holidays, while the remaining 82.1% do not. Because the chart counts records rather than summing the Sales column, it shows that most trading days, and therefore most sales opportunities, occur outside school holidays. It does not by itself measure how much a school holiday changes sales.
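Because `value_counts()` counts rows, a pie chart built from it shows the share of daily records rather than the share of sales revenue. The sketch below, on an invented `toy` frame, shows how the two shares can differ and how average sales per group could be compared directly.

```python
import pandas as pd

# Invented records: four ordinary days (one with the store closed) and one school holiday
toy = pd.DataFrame({
    "SchoolHoliday": [0, 0, 0, 0, 1],
    "Sales": [5000, 6000, 5500, 0, 7000],
})

# Share of rows in each group (what a value_counts() pie chart displays)
record_share = toy["SchoolHoliday"].value_counts(normalize=True) * 100

# Share of total sales contributed by each group
sales_share = toy.groupby("SchoolHoliday")["Sales"].sum() / toy["Sales"].sum() * 100

# Average daily sales in each group
mean_sales = toy.groupby("SchoolHoliday")["Sales"].mean()

print(record_share.round(1), sales_share.round(1), mean_sales, sep="\n\n")
```

In this toy example the holiday group is 20% of the records but a larger share of the sales, which is the kind of difference a record-count chart cannot reveal.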

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impacts may include:

  1. Optimizing staffing and resource allocation for non-school holiday periods, which account for most trading days.
  2. Tailoring marketing and promotions to the non-school holiday periods that make up the majority of the calendar.
  3. Offering special promotions or events during school holidays to potentially boost sales during those periods.

Negative impacts:

  The insight itself does not necessarily lead to negative growth. However, how the business responds to this insight is crucial. If the business does not adapt its strategies to account for the seasonality of school holidays and non-school holiday periods, it might miss opportunities to maximize sales during both types of periods. The negative impact would come from failing to adjust operations and marketing efforts accordingly.

#### Chart - 3

**<h2> TYPE OF STORES**

In [None]:
# Chart - 3 visualization code
store_type_count = df['StoreType'].value_counts()
store_type_count

In [None]:
store_type_count.plot(kind = 'bar',color = 'coral',fontsize = 8, figsize=(10,5))
plt.title('STORE TYPE COUNT')
plt.xlabel('STORE')
plt.ylabel('COUNT')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart to visualize the counts of each store type because it's an effective way to compare categorical data. Bar charts make it easy to understand the distribution of different store types in the dataset.

##### 2. What is/are the insight(s) found from the chart?

The primary insight from the chart is that store type "a" has the highest count, followed by store type "d," then "c," and lastly "b." Since the counts are taken over the merged daily records, this means store type "a" contributes the most rows to the dataset, while store type "b" contributes the fewest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

positive business impact:

  1. **Resource Allocation:** Understanding the prevalence of each store type can help allocate resources, staff, and inventory more effectively.
  2. **Marketing Strategies:** Tailoring marketing and product strategies to the most common store types can lead to increased sales.
  3. **Expansion and Growth:** Identifying the most and least common store types can inform decisions about opening new stores or optimizing existing ones.

negative business impact:

  The insight about the distribution of store types itself is unlikely to lead to negative growth. However, how the business interprets and acts on this insight is critical. Potential negative impacts could arise if the business fails to adapt its strategies based on the prevalence of each store type. For example:

  Neglecting less common store types (e.g., type "b") may result in underperforming stores, missed opportunities, and potential negative growth.
  
  Overinvesting in the most common store type (e.g., type "a") without considering market dynamics may lead to inefficiencies.

#### Chart - 4

**<h2>AVERAGE SALES WITH AND WITHOUT PROMO**

In [None]:
promo_sales = df[df['Promo'] == 1]['Sales']
no_promo_sales = df[df['Promo'] == 0]['Sales']

labels = ['With Promo', 'Without Promo']
values = [promo_sales.mean(), no_promo_sales.mean()]

plt.figure(figsize=(6, 4))
bars = plt.bar(labels, values, color=['wheat', 'Khaki'], width=0.4)
plt.title('Average Sales with and without Promo')
plt.ylabel('Average Sales')
plt.grid(axis='y', linestyle='--', alpha=0.6)

# Add sales values on top of the bars
for bar, value in zip(bars, values):
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height(), round(value, 2), ha='center', va='bottom')

plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart for comparing the average sales with and without the Promo indicator because it provides a clear and concise way to visualize and compare these two values side by side.

##### 2. What is/are the insight(s) found from the chart?

The primary insight from the chart is the significant difference in average sales between the two categories. "With Promo" has a notably higher average sale of 7991.15 compared to "Without Promo," which has a lower average sale of 4406.05. This suggests that running promotions positively influences sales.
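The gap between the two averages can also be stated as a relative uplift. The arithmetic below uses the two averages quoted above; the variable names are illustrative only.

```python
# Average daily sales quoted in the chart annotations above
avg_with_promo = 7991.15
avg_without_promo = 4406.05

# Absolute and relative difference between promo and non-promo days
absolute_uplift = avg_with_promo - avg_without_promo
relative_uplift_pct = absolute_uplift / avg_without_promo * 100

print(f"Absolute uplift: {absolute_uplift:.2f}")
print(f"Relative uplift: {relative_uplift_pct:.1f}%")
```

This works out to roughly 81% higher average sales on promo days. It is a descriptive comparison only: it does not control for weekday, store type, or closed days, so it should not be read as the causal effect of a promotion.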

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**positive impact:**

  1. Increasing the frequency and effectiveness of promotions during strategic periods to boost sales.
  2. Tailoring promotional strategies to different store types or customer segments for maximum impact.
  3. Optimizing inventory and staff resources to handle increased sales during promotional periods.

**negative impact:**
  
  The insight itself, which indicates that promotions drive higher sales, is generally positive. However, the potential negative impact could arise if promotions are not planned and executed strategically:

  Over-reliance on promotions without considering their profitability and sustainability could lead to lower margins and negative long-term growth.

  Promotions may attract one-time customers who only shop during sale events, potentially reducing customer loyalty during non-promotional periods.

#### Chart - 5

**<h2>AVERAGE SALES BY VARIOUS STORES**

In [None]:
# Chart - 5 visualization code
average_sales_by_store_type = df.groupby('StoreType')['Sales'].mean().sort_values()
average_sales_by_store_type

In [None]:
plt.figure(figsize=(10, 6))
bars = average_sales_by_store_type.plot(kind='bar', color='saddlebrown')
plt.title('Average Sales by Store Type')
plt.xlabel('Store Type')
plt.ylabel('Average Sales')
plt.grid(axis='y', linestyle='--', alpha=0.6)

# Add sales values on top of the bars
for bar, value in zip(bars.patches, average_sales_by_store_type):
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height(), round(value, 2), ha='center', va='bottom')

plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart to visualize the average sales by store type because it provides an effective way to compare the sales performance of different store types. Bar charts are suitable for displaying and comparing quantitative data for distinct categories.

##### 2. What is/are the insight(s) found from the chart?

The primary insights from the chart are as follows:

  1. Store type "b" has the highest average sales among all store types.
  2. Store type "a" follows with the second-highest average sales.
  3. Store type "c" has lower average sales than "b" and "a."
  4. Store type "d" has the lowest average sales among all store types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

positive business impact:

  1. Resource Allocation: Allocating more resources to store types "b" and "a" to maximize sales potential.
  2. Marketing Strategies: Developing tailored marketing strategies for store types based on their sales performance.
  3. Expansion and Growth: Considering whether to open more stores of successful types ("b" and "a") and reevaluating the performance of store type "d."


#### Chart - 6

**<h2>AVERAGE SALES BY ASSORTMENT**

In [None]:
# Chart - 6 visualization code
average_sales_by_assortment = df.groupby('Assortment')['Sales'].mean().sort_values()
average_sales_by_assortment

In [None]:
plt.figure(figsize=(10, 6))
average_sales_by_assortment.plot(kind='bar', color='olivedrab')
plt.title('Average Sales by Assortment')
plt.xlabel('Assortment')
plt.ylabel('Average Sales')
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart to visualize the average sales by assortment because it provides an effective way to compare the sales performance of different assortment levels. Bar charts are suitable for displaying and comparing quantitative data for distinct categories, making it easy to see the differences.

##### 2. What is/are the insight(s) found from the chart?

The primary insights from the chart are as follows:

  1. Assortment level "b" has the highest average sales among all assortment levels.
  2. Assortment level "c" follows with the second-highest average sales.
  3. Assortment level "a" has the lowest average sales among all assortment levels.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**positive business impacts:**

  1. **Product Strategy:** Focusing on assortment level "b" to maximize sales and potentially expand product offerings in this category.
  2. **Pricing and Promotion Strategies:** Tailoring pricing and promotion strategies to match the performance of each assortment level.
  3. **Inventory Management:** Optimizing inventory based on the sales performance of different assortment levels.

**negative business impacts:**

  1. Neglecting assortment level "a" without considering strategies for improvement may lead to underutilized inventory and missed sales.
  2. Overinvesting in assortment level "b" without monitoring market trends and consumer preferences could result in inefficiencies.

#### Chart - 7

**<h2>SALES VS CUSTOMER CHART**

In [None]:
# Chart - 7 visualization code
x = df['Customers']
y = df['Sales']

plt.figure(figsize=(10, 6))
plt.scatter(x, y, alpha=0.5)
plt.title('Sales vs. Customers')
plt.xlabel('Number of Customers')
plt.ylabel('Sales')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a scatter plot because it is a common way to visualize the relationship between two continuous variables (Sales and Customers). It is effective for understanding the dispersion and distribution of data points and for identifying patterns or trends.

##### 2. What is/are the insight(s) found from the chart?

The primary insight from the scatter plot is the general relationship between Sales and Customers:

  1. As the number of Customers increases, Sales tend to increase as well.
  2. There is a positive correlation between the two variables, indicating that higher customer footfall is associated with higher sales.
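The strength of a relationship like this can be summarized with the Pearson correlation coefficient. The sketch below runs on an invented `toy` frame, so the resulting value is illustrative and is not a result from the Rossmann data.

```python
import pandas as pd

# Invented daily figures for a handful of store-days
toy = pd.DataFrame({
    "Customers": [400, 550, 620, 800, 1000],
    "Sales": [3900, 5100, 6300, 7400, 9800],
})

# Pearson's r ranges from -1 to 1; values near 1 indicate a strong positive linear relationship
pearson_r = toy["Customers"].corr(toy["Sales"])

# Average spend per customer is a related, easily interpreted ratio
sales_per_customer = toy["Sales"].sum() / toy["Customers"].sum()

print(f"Pearson r: {pearson_r:.3f}")
print(f"Sales per customer: {sales_per_customer:.2f}")
```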

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

positive business impact:

  1. **Staffing and Resource Allocation:** Businesses can better allocate staffing and resources to handle the expected sales volume based on the number of customers.
  2. **Marketing Strategies:** Understanding the correlation can help tailor marketing strategies to attract more customers and increase sales.
  3. **Sales Predictions:** It enables businesses to make more accurate sales forecasts based on expected customer traffic.

negative business impact:

The insights themselves do not inherently lead to negative growth. However, potential negative impacts could occur if the business does not adapt its strategies based on the observed correlation:

  1. Focusing solely on increasing customer traffic without considering factors like pricing, product quality, or customer experience may not lead to sustained sales growth.
  2. Overstaffing during periods of low customer traffic can result in inefficient resource allocation.

#### Chart - 8

**<h2>AVERAGE YEAR SALES**

In [None]:
average_sales_by_year = df.groupby('Year')['Sales'].mean()
average_sales_by_year

In [None]:
plt.figure(figsize=(10, 6))
bars = average_sales_by_year.plot(kind='bar', color='maroon',width=0.4)
plt.title('AVERAGE YEAR SALES')
plt.xlabel('YEAR')
plt.ylabel('Average Sales')
plt.grid(axis='y', linestyle='--', alpha=0.6)

# Add sales values on top of the bars
for bar, value in zip(bars.patches, average_sales_by_year):
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height(), round(value, 2), ha='center', va='bottom')

plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is a suitable choice for comparing average sales across different years. Bar charts effectively display and compare quantitative data for distinct categories, in this case, the years.

##### 2. What is/are the insight(s) found from the chart?

The primary insights from the bar chart are as follows:

  1. Average sales increased over the three years, with the highest average sales occurring in 2015, followed by 2014 and 2013.
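Year-over-year growth can be quantified with `pct_change()`. The yearly averages in the sketch below are invented placeholders, not the values from the chart.

```python
import pandas as pd

# Placeholder yearly averages, indexed by year (not the notebook's actual results)
toy_yearly = pd.Series([5000.0, 5250.0, 5355.0], index=[2013, 2014, 2015], name="Sales")

# Percentage change relative to the previous year; the first year has no predecessor
yoy_pct = toy_yearly.pct_change() * 100

print(yoy_pct.round(2))
```

When a year is only partially covered by the data, its average reflects only the months observed, so comparisons across years are more reliable when restricted to the same months.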

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**positive business impacts:**

  1. **Strategic Planning:** Businesses can consider factors that contributed to the increase in sales from 2013 to 2015 and plan future strategies accordingly.
  2. **Resource Allocation:** Allocating resources based on the sales trend, such as expanding during high-sales years and optimizing operations during lower-sales years.
  3. **Market Segmentation:** Tailoring marketing and promotional efforts based on the understanding of which years saw the highest sales.


#### Chart - 9

In [None]:
average_sales_by_day = df.groupby('DayOfWeek')['Sales'].mean().sort_index()
average_sales_by_day

In [None]:
day_labels = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

plt.figure(figsize=(10, 6))
bar = average_sales_by_day.plot(kind='bar', color='olivedrab')
plt.title('Average Sales by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Sales')
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.xticks(range(7), day_labels, rotation=45)  # Assign day labels

plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart to visualize the sales by day of the week because a bar chart is well-suited for comparing quantitative values (sales) across distinct categories (days of the week). It provides a clear representation of how sales vary from one day to another.

##### 2. What is/are the insight(s) found from the chart?

The primary insights from the bar chart are as follows:

  1. Sales are highest on day 1 (Monday) and gradually decrease throughout the week.
  2. There is a significant drop in sales on day 7 (Sunday) compared to other days of the week.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**positive business impact:**

  1. **Staffing and Resource Allocation:** Businesses can allocate resources and staff based on the expected sales volume for each day.
  2. **Promotions and Marketing:** Adjusting marketing and promotion strategies to boost sales on days with lower performance can lead to positive impacts.
  3. **Inventory Management:** Optimizing inventory management based on sales trends for different days of the week.

**negative business impact:**

  1. Failing to adjust staffing and resources to match the expected sales volume for each day can result in inefficiencies.
  2. Not adapting marketing and promotional efforts for lower-performing days may lead to missed growth opportunities.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing through code and statistical tests to reach a final conclusion about each statement.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
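
As a minimal sketch of how a p-value could be obtained, the cell below runs Welch's two-sample t-test with `scipy.stats.ttest_ind`. The data are synthetic stand-ins, not the Rossmann data, and the hypothesis (mean sales differ between promotion and non-promotion days) and the 0.05 significance level are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (NOT the Rossmann data): daily sales with and without a promotion.
rng = np.random.default_rng(seed=0)
sales_promo = rng.normal(loc=8000, scale=1500, size=200)
sales_no_promo = rng.normal(loc=6000, scale=1500, size=200)

# H0: mean sales are equal on promo and non-promo days.
# H1: mean sales differ.
# Welch's t-test (equal_var=False) does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(sales_promo, sales_no_promo, equal_var=False)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, reject H0: {reject_null}")
```

With real data, the two arrays would be replaced by the `Sales` values of the two groups being compared.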

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
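
One possible approach, sketched on a made-up frame: fill a skewed numeric column with its median and fill a "since" column with 0. The column names `CompetitionDistance` and `Promo2SinceWeek` are assumed to match the store table, the values are invented, and treating a missing "since" value as non-participation is an assumption that should be checked against the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are assumed, values are made up.
store = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "CompetitionDistance": [1270.0, np.nan, 14130.0, 620.0],
    "Promo2SinceWeek": [np.nan, 13.0, 14.0, np.nan],
})

# Median imputation is less sensitive to a long right tail than mean imputation.
median_distance = store["CompetitionDistance"].median()
store["CompetitionDistance"] = store["CompetitionDistance"].fillna(median_distance)

# Assumption: a missing "since" value means the store is not in the promotion, so fill with 0.
store["Promo2SinceWeek"] = store["Promo2SinceWeek"].fillna(0)

# Count the remaining missing values per column.
print(store.isna().sum())
```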

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
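
A minimal sketch of one common treatment, capping values outside the 1.5 × IQR fences, is shown below. The sales figures are invented, and the 1.5 multiplier is the conventional default rather than a value derived from this dataset.

```python
import pandas as pd

# Made-up sales values with one extreme point.
sales = pd.Series([5200, 4800, 6100, 5500, 5900, 30000], name="Sales")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap (winsorize) instead of dropping rows, so no day of history is lost.
sales_capped = sales.clip(lower=lower, upper=upper)
print(sales_capped.tolist())
```

Capping keeps the row count unchanged, which matters when each row is a calendar day. Dropping rows is the alternative when the extreme values are known to be data errors.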

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
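
As an illustration, one-hot encoding with `pandas.get_dummies` could look like the sketch below. `StoreType` and `Assortment` are assumed to be low-cardinality categorical columns, and the rows are placeholders.

```python
import pandas as pd

# Toy rows; column names are assumed, values are placeholders.
df = pd.DataFrame({
    "StoreType": ["a", "b", "c", "a"],
    "Assortment": ["a", "c", "a", "b"],
    "Sales": [5263, 6064, 8314, 13995],
})

# One-hot encode; drop_first removes one level per column to avoid
# perfectly collinear dummy columns in linear models.
encoded = pd.get_dummies(df, columns=["StoreType", "Assortment"], drop_first=True)
print(encoded.columns.tolist())
```

Tree-based models do not need `drop_first`, so the flag can be omitted if only such models are used.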

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words that contain digits

In [None]:
# Remove URLs & remove words that contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
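
Since the problem statement names seasonality as a driver of sales, one way to create new features is to derive calendar fields from a date column. The sketch below assumes a `Date` column exists. The derived names (`Year`, `Month`, `WeekOfYear`, `IsWeekend`) are illustrative choices.

```python
import pandas as pd

# Three placeholder dates standing in for the real Date column.
df = pd.DataFrame({"Date": pd.to_datetime(["2015-07-31", "2015-01-01", "2014-12-25"])})

# Calendar features that let a model pick up yearly and weekly patterns.
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["WeekOfYear"] = df["Date"].dt.isocalendar().week.astype(int)  # ISO week number
df["IsWeekend"] = (df["Date"].dt.dayofweek >= 5).astype(int)     # Saturday=5, Sunday=6

print(df)
```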

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
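
If the target turns out to be right-skewed, a log transform is one option. The sketch below uses `numpy.log1p` on invented values and shows the inverse with `numpy.expm1`. Whether the real `Sales` column needs this should be checked from its distribution first.

```python
import numpy as np

# Made-up daily sales, including a zero for a closed day.
sales = np.array([0.0, 3500.0, 5200.0, 7800.0, 41000.0])

# log1p handles zeros (log(1 + 0) = 0) and compresses the right tail.
sales_log = np.log1p(sales)

# expm1 inverts the transform, which is needed to report predictions in the original units.
sales_back = np.expm1(sales_log)

print(sales_log.round(3))
```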

### 6. Data Scaling

In [None]:
# Scaling your data
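
A minimal sketch of standardization with scikit-learn's `StandardScaler`, assuming the data have already been split. The feature values are placeholders. The point being illustrated is that the scaler is fitted on the training rows only and then reused on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrices (two numeric features).
X_train = np.array([[1270.0, 555.0], [570.0, 625.0], [14130.0, 821.0]])
X_test = np.array([[620.0, 1498.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

Fitting the scaler on the full dataset would leak test-set statistics into training.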

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
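
Because the task is to forecast up to six weeks ahead, one reasonable option is a time-ordered holdout rather than a random split. The sketch below holds out the last 42 days of a toy daily series. The `Date` column name and the 42-day horizon are assumptions taken from the problem statement, not a confirmed choice for this project.

```python
import pandas as pd

# Toy daily history for one store (values are placeholders).
df = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=100, freq="D"),
    "Sales": range(100),
})

# Hold out the final 42 days (six weeks) as the validation period.
cutoff = df["Date"].max() - pd.Timedelta(days=42)
train = df[df["Date"] <= cutoff]
valid = df[df["Date"] > cutoff]

print(len(train), len(valid))
```

A random split via `sklearn.model_selection.train_test_split` is simpler, but it lets the model see days that come after the ones it is evaluated on.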

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
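
The three steps above (implement, fit, predict) can be sketched with a linear regression baseline. The data below are synthetic and stand in for the engineered feature matrix. The choice of `LinearRegression` and of RMSE and R² as metrics is an assumption for illustration, not the project's confirmed model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the real features and Sales target.
rng = np.random.default_rng(seed=42)
X = rng.normal(size=(500, 3))
y = 5000 + X @ np.array([1200.0, -300.0, 450.0]) + rng.normal(scale=200.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)      # Fit the algorithm
y_pred = model.predict(X_test)   # Predict on the model

# Evaluation metrics on the held-out rows.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.1f}, R^2: {r2:.3f}")
```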

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
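
As a hedged sketch of grid search with cross-validation, the cell below tunes the `alpha` of a `Ridge` model with `GridSearchCV`, using `TimeSeriesSplit` so that each validation fold follows its training fold in row order. The data are synthetic, and the model, parameter grid, and scoring choice are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic data; rows are assumed to be in chronological order.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

# TimeSeriesSplit keeps every validation fold after its training fold.
cv = TimeSeriesSplit(n_splits=5)
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the negated RMSE
    cv=cv,
)
search.fit(X, y)  # Fit the algorithm across all grid points and folds

best_alpha = search.best_params_["alpha"]
best_rmse = -search.best_score_   # flip the sign back to a positive RMSE
y_pred = search.predict(X[-10:])  # Predict with the refitted best model

print(f"best alpha: {best_alpha}, CV RMSE: {best_rmse:.3f}")
```

`RandomizedSearchCV` has the same interface and is usually cheaper when the grid is large.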

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
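
The save and load steps can be sketched as a round trip with the standard-library `pickle` module. The model here is a tiny placeholder fitted on made-up numbers, and `best_model.pkl` is a hypothetical filename written to a temporary directory. In the notebook, the fitted final model and a persistent path would take their place.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder model fitted on made-up data (y = 2x).
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_model.pkl")  # hypothetical filename

    # Save the file.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load the file again.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Sanity check: the loaded model should reproduce the original predictions on unseen data.
X_unseen = np.array([[5.0]])
same = np.allclose(model.predict(X_unseen), loaded.predict(X_unseen))
print("predictions match:", same)
```

Pickle files should only be loaded from trusted sources and with the same scikit-learn version that produced them.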

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***