<a href="https://colab.research.google.com/github/Jishu-2004/Unsupervised-ML---Netflix-Movies-and-TV-Shows-Clustering/blob/main/ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***Integrated Retail Analysis for store optimization: Advance Machine Learning***



Project Target - Optimize retail store performance using machine learning and data analytics.

Project Type - EDA/Regression/Classification/Unsupervised/Prediction

Contribution - Individual

Name - Aishik Maiti

# **Project Summary -**

The rapid evolution of consumer behavior, digital transformation, and competition in the retail sector has made data-driven decision-making not just advantageous, but essential. This project, titled “Integrated Retail Analytics for Store Optimization Using Advanced Machine Learning Techniques”, focuses on leveraging machine learning and data science to enhance operational efficiency, personalize customer experiences, and increase profitability in physical retail stores.

The primary goal is to transform raw retail data—spanning sales transactions, store features, promotions, and customer behavior—into actionable insights. This is achieved through a multi-pronged analytics pipeline comprising demand forecasting, customer segmentation, and association rule mining.

1. Demand Forecasting
Accurately predicting product demand is vital to avoid stockouts and overstocking. The project uses machine learning regression models such as XGBoost and Random Forest to predict weekly sales based on historical sales data, store characteristics, and time-based features (e.g., holiday flags, months, weeks). Feature engineering includes deriving temporal features and combining datasets on store performance and events. These models are evaluated using metrics such as RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and R² score, ensuring both accuracy and reliability. This enables retailers to optimize inventory levels, improve shelf space management, and reduce wastage.

2. Customer Segmentation
Segmentation helps retailers understand the diversity in customer behavior and tailor their marketing strategies. The project employs K-Means Clustering based on RFM (Recency, Frequency, Monetary) analysis to identify different customer segments. Since typical retail data may not always include direct customer identifiers, a pseudo-customer ID is created using store and department combinations to simulate customer-level analysis. The clusters are evaluated using the Silhouette Score, which quantifies how well each data point fits within its cluster. The result is a clear classification of customer segments such as loyal customers, high spenders, and infrequent shoppers.

3. Association Rule Mining
To enhance product placement and bundling, the project utilizes Apriori algorithm from the mlxtend library to discover relationships between items frequently purchased together. The analysis generates rules based on support, confidence, and lift values, allowing retailers to strategically place complementary products near each other, design combo offers, and improve cross-selling opportunities.

4. Anomaly Detection
The system also includes an anomaly detection module using Isolation Forest, which flags unusual spikes or drops in sales, possibly due to fraud, promotional misfires, or operational issues. Metrics such as accuracy, precision, recall, and F1-score are used to validate the detection model. This adds a layer of robustness to the retail monitoring system by helping managers quickly respond to abnormalities.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


To optimize retail store performance by leveraging machine learning for demand forecasting, customer segmentation, product bundling, and anomaly detection.









# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)


#Loading Feature Dataset
file_path1 = '/content/drive/MyDrive/Internship - Labmentix/Integrated Retail Analysis for store optimization: Advance Machine Learning/Copy of Features data set.csv'

#Loading Sales Dataset
file_path2 = '/content/drive/MyDrive/Internship - Labmentix/Integrated Retail Analysis for store optimization: Advance Machine Learning/sales data-set.csv'

#Loading Stores Dataset
file_path3 = '/content/drive/MyDrive/Internship - Labmentix/Integrated Retail Analysis for store optimization: Advance Machine Learning/stores data-set.csv'

### Dataset First View

In [None]:
# Dataset First Look

df1 = pd.read_csv(file_path1)
df2 = pd.read_csv(file_path2)
df3 = pd.read_csv(file_path3)

from IPython.display import display
print("Features data set")
display(df1)
print("Sales data set")
display(df2)
print("Stores data set")
display(df3)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

print("Rows and columns for Features data set:", df1.shape)
print("Rows and columns for Sales data set:", df2.shape)
print("Rows and columns for Stores data set:", df3.shape)

### Dataset Information

In [None]:
# Dataset Info

print("--- Features Dataset Info ---")
df1.info()
print("\n--- Sales Dataset Info ---")
df2.info()
print("\n--- Stores Dataset Info ---")
df3.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

# Dataset Duplicate Value Count
print("Duplicate values in Features data set:", df1.duplicated().sum())
print("Duplicate values in Sales data set:", df2.duplicated().sum())
print("Duplicate values in Stores data set:", df3.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

print("--- Missing Values in Features Dataset ---")
print(df1.isnull().sum())
print("\n--- Missing Values in Sales Dataset ---")
print(df2.isnull().sum())
print("\n--- Missing Values in Stores Dataset ---")
print(df3.isnull().sum())

In [None]:
# Visualizing the missing values

plt.figure(figsize=(10, 6))
sns.heatmap(df1.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap for Features Dataset')
plt.show()

plt.figure(figsize=(10, 6))
sns.heatmap(df2.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap for Sales Dataset')
plt.show()

plt.figure(figsize=(10, 6))
sns.heatmap(df3.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap for Stores Dataset')
plt.show()

### What did you know about your dataset?

Based on our initial exploration, we have loaded three datasets: Features, Sales, and Stores, into pandas DataFrames df1, df2, and df3 respectively. We've determined the dimensions of each dataset by counting their rows and columns. Furthermore, we have checked for and counted any duplicate rows within each dataset. We also identified and counted the missing values in each column across all three datasets, and visualized the distribution of these missing values using heatmaps to understand where they are present.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

# Dataset Columns
print("Columns in Features data set:", df1.columns)
print("Columns in Sales data set:", df2.columns)
print("Columns in Stores data set:", df3.columns)

In [None]:
# Dataset Describe

# Dataset Describe
print("--- Features Dataset Describe ---")
print(df1.describe())
print("\n--- Sales Dataset Describe ---")
print(df2.describe())
print("\n--- Stores Dataset Describe ---")
print(df3.describe())

### Variables Description

####Features Dataset (df1):

Store: An identifier for each store.  
Date: The date of the observations.  
Temperature: The temperature on that date.                      
Fuel_Price: The fuel price on that date.                             
MarkDown1, MarkDown2, MarkDown3, MarkDown4, MarkDown5: These likely represent different types of promotional markdowns. Their descriptive statistics would show the range and distribution of these markdown values.            
CPI: Consumer Price Index, an economic indicator.                    
Unemployment: The unemployment rate.                
IsHoliday: A boolean or binary indicator for whether the date is a holiday.

####Sales Dataset (df2):

Store: An identifier for each store.         
Dept: An identifier for each department within a store.
Date: The date of the sales observation.     
Weekly_Sales: The total sales for that store and department on that date (likely the target variable). The descriptive statistics will show the distribution and range of sales.      
IsHoliday: A boolean or binary indicator for whether the date is a holiday.

####Stores Dataset (df3):

Store: An identifier for each store.      
Type: The type of store (e.g., A, B, C). The descriptive statistics for this column would provide counts for each type if include='all' is used.     
Size: The size of the store. The descriptive statistics will show the range and distribution of store sizes.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

# Check Unique Values for each variable.
print("--- Unique Values in Features Dataset ---")
print(df1.nunique())
print("\n--- Unique Values in Sales Dataset ---")
print(df2.nunique())
print("\n--- Unique Values in Stores Dataset ---")
print(df3.nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Write your code to make your dataset analysis ready.

# Merge the datasets
# Assuming 'Store' and 'Date' are common columns for merging
df_merged = pd.merge(df1, df2, on=['Store', 'Date'], how='inner')
df_merged = pd.merge(df_merged, df3, on='Store', how='inner')

# Convert 'Date' column to datetime objects
# The error indicates the date format is DD/MM/YYYY, so we specify that format.
df_merged['Date'] = pd.to_datetime(df_merged['Date'], format="%d/%m/%Y")

# Handle missing values (example: impute with median for numerical columns)
for col in df_merged.select_dtypes(include=np.number).columns:
    if df_merged[col].isnull().sum() > 0:
        median_val = df_merged[col].median()
        df_merged[col].fillna(median_val, inplace=True)

# Display the first few rows of the merged and wrangled data
print(df_merged.head())

### What all manipulations have you done and insights you found?

1.Dataset Merging:


*   Manipulation: The three original datasets (Features, Sales, and Stores) were merged into a single DataFrame called df_merged. The merging was done based on common columns: Store and Date for Features and Sales, and Store for df_merged and Stores. An inner merge was used, meaning only rows with matching 'Store' and 'Date' across the relevant datasets are kept.
*   Insight: Combining the datasets allows us to analyze the relationship between store features (temperature, fuel price, markdowns, CPI, unemployment), store attributes (type, size), and sales data in one place. This unified dataset is essential for building a comprehensive model.



2.Date Column Conversion:


*   Manipulation: The 'Date' column in the df_merged DataFrame was converted from a string data type to a datetime object using pd.to_datetime. The specific format "%d/%m/%Y" was provided to correctly interpret the day, month, and year from the original strings.
*   Insight: Converting the date column to a proper datetime format enables time-series analysis. We can now easily extract components like year, month, day of the week, and week of the year, and perform time-based aggregations or analyze trends over time.



3.Missing Value Imputation:


*   Manipulation: For numerical columns in df_merged that had missing values, these values were filled using the median of the respective column.
*   Insight: Missing values can cause issues for many analysis and modeling techniques. Imputing with the median is one strategy to handle these gaps, ensuring that the data is complete for subsequent steps like visualization and model training. By imputing, we retain more of the data rather than discarding rows with missing values.





## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Histograms for numerical columns in Features dataset
df1.hist(figsize=(15, 10))
plt.suptitle('Histograms of Numerical Features Dataset Columns', y=1.02)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Histograms are excellent for showing the distribution of a single numerical variable. Using a grid of histograms for multiple numerical columns allows for a quick overview of the shape, spread, and typical values for each feature (Temperature, Fuel_Price, MarkDowns, CPI, Unemployment).

##### 2. What is/are the insight(s) found from the chart?

Reveals the distribution of values (e.g., whether temperature is normally distributed, if markdowns are skewed, typical ranges for fuel price or CPI).

Helps identify potential outliers or multi-modal distributions. Shows the prevalence of zero values in MarkDown columns, indicating these promotions are not always active.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the distribution helps in feature selection and preprocessing for modeling. For example, skewed data might need transformation, and features with many zero values (like Markdowns) might require special handling or feature engineering. Provides context for external factors influencing sales.

#### Chart - 2

In [None]:
# Histogram for 'Weekly_Sales' in Sales dataset
plt.figure(figsize=(10, 6))
sns.histplot(df2['Weekly_Sales'], kde=True)
plt.title('Distribution of Weekly_Sales in Sales Dataset')
plt.xlabel('Weekly Sales')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a KDE (Kernel Density Estimate) plot is ideal for visualizing the distribution of the target variable (Weekly_Sales). It shows the frequency of different sales values and the overall shape of the distribution.

##### 2. What is/are the insight(s) found from the chart?

Shows that weekly sales are heavily skewed towards lower values, with a long tail of higher sales.

 Identifies the typical range of weekly sales and highlights the presence of outlier sales figures (very high sales).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Essential for understanding the nature of the prediction problem (regression). The skewed distribution suggests that mean-based metrics might be influenced by outliers and that evaluating model performance might require metrics robust to large errors on high-value predictions (like RMSE or MAE). High sales outliers might correspond to holidays or special events, prompting further investigation.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Bar plot for 'Type' in Stores dataset
plt.figure(figsize=(8, 5))
sns.countplot(x='Type', data=df3)
plt.title('Distribution of Store Types in Stores Dataset')
plt.show()

##### 1. Why did you pick the specific chart?

 A bar plot is the standard way to visualize the counts of distinct categories. It clearly shows the proportion or number of stores belonging to each type (A, B, C).

##### 2. What is/are the insight(s) found from the chart?

Reveals the proportion of stores of each type. Typically shows that some store types are more common than others.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the store type distribution is fundamental for segmentation. It suggests that strategies might need to account for the varying prevalence of different store formats. Knowing which types are most numerous helps in planning aggregated analysis or targeted pilots.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Histogram for 'Size' in Stores dataset
plt.figure(figsize=(10, 6))
sns.histplot(df3['Size'], kde=True)
plt.title('Distribution of Store Sizes in Stores Dataset')
plt.xlabel('Store Size')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a KDE plot is used to show the distribution of a single numerical variable, 'Size'.

##### 2. What is/are the insight(s) found from the chart?

Shows the distribution of store sizes. Might reveal clusters of store sizes or indicate if sizes are distributed across a wide range. You might see distinct peaks corresponding to different store formats or types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Store size is a significant factor in capacity, inventory, and potential sales volume. Understanding its distribution helps in categorizing stores, forecasting inventory needs, and analyzing performance relative to capacity.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Chart - 6: Distribution of Temperature in the Merged Dataset
plt.figure(figsize=(10, 6))
sns.histplot(df_merged['Temperature'], kde=True)
plt.title('Distribution of Temperature in Merged Dataset')
plt.xlabel('Temperature')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with KDE is used to visualize the distribution of the 'Temperature' column after merging.

##### 2. What is/are the insight(s) found from the chart?

Shows the typical range and frequency of temperature values across all stores and dates in the merged dataset. Might reveal seasonal patterns or geographical distribution if the data spans different climates.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Temperature is an external factor that can influence customer traffic and sales (e.g., higher sales of certain items in hot or cold weather). Understanding its distribution provides context when analyzing its relationship with sales.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Chart - 7: Weekly Sales vs. Store Type
plt.figure(figsize=(10, 6))
sns.boxplot(x='Type', y='Weekly_Sales', data=df_merged)
plt.title('Weekly Sales by Store Type')
plt.xlabel('Store Type')
plt.ylabel('Weekly Sales')
plt.show()

##### 1. Why did you pick the specific chart?

A box plot is excellent for comparing the distribution (median, quartiles, outliers) of a numerical variable (Weekly_Sales) across distinct categories (Type).

##### 2. What is/are the insight(s) found from the chart?

 Clearly shows that different store types have significantly different distributions of weekly sales. Type A stores likely have the highest median sales and potentially the largest range and highest outliers, followed by Type B, then Type C.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This is a critical insight for store strategy. It confirms that store type is a major determinant of sales performance. Businesses can use this to set type-specific sales targets, allocate budgets, and tailor marketing strategies. It justifies focusing on understanding why Type A stores perform better.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(15, 7))
sns.boxplot(x='Store', y='Weekly_Sales', data=df2)
plt.title('Distribution of Weekly Sales by Store')
plt.xlabel('Store')
plt.ylabel('Weekly Sales')
plt.xticks(rotation=90)
plt.show()

##### 1. Why did you pick the specific chart?

A box plot, applied across individual stores, allows for comparing the distribution of weekly sales for each store.

##### 2. What is/are the insight(s) found from the chart?

Highlights the vast variation in sales performance across individual stores.

Identifies high-performing stores, low-performing stores, and stores with highly variable sales. Shows the presence of extreme outlier sales weeks for some stores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Allows for individual store performance evaluation. Helps identify best-performing stores for potential case studies or replication of strategies, and worst-performing stores requiring intervention. High variability might indicate issues like unstable demand or inconsistent operations.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(df2['Weekly_Sales'], bins=50, kde=True)
plt.title('Distribution of Weekly Sales')
plt.xlabel('Weekly Sales')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with KDE for Weekly_Sales (similar to Chart 2, but perhaps with different binning or after potential outlier handling) provides a refined view of the target variable's distribution.

##### 2. What is/are the insight(s) found from the chart?

Reconfirms the skewed nature of sales. Depending on when this is plotted (e.g., before/after outlier removal), it can show the impact of data cleaning.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Reiteration of the challenge in predicting highly skewed data. Guides the choice of model and evaluation metrics.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Chart - Relationship between Store Size and Average Weekly Sales
plt.figure(figsize=(10, 6))
# Using a scatter plot with size on the x-axis and average sales on the y-axis
avg_sales_by_store_size = df_merged.groupby('Size')['Weekly_Sales'].mean().reset_index()
sns.scatterplot(x='Size', y='Weekly_Sales', data=avg_sales_by_store_size)
plt.title('Average Weekly Sales vs. Store Size')
plt.xlabel('Store Size')
plt.ylabel('Average Weekly Sales')
plt.show()

# Why this chart?
# A scatter plot is suitable for visualizing the relationship between two continuous numerical variables (Store Size and Average Weekly Sales).
# Insights: This helps determine if there's a positive correlation between store size and average sales, suggesting larger stores generally generate more sales.
# Business Impact: Useful for store planning and forecasting. Larger stores might have higher overheads, but potentially higher revenue potential.


##### 1. Why did you pick the specific chart?

A scatter plot is used to visualize the relationship between two continuous numerical variables ('Store Size' and 'Average Weekly Sales'). Plotting average sales per store helps reduce the noise from weekly fluctuations.

##### 2. What is/are the insight(s) found from the chart?

Shows the general trend between store size and average sales. Likely reveals a positive correlation: larger stores tend to have higher average weekly sales. May also show variability in sales for stores of similar size.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Provides quantitative evidence that size matters for sales volume. Useful for forecasting potential sales based on store size for new locations. Can help identify stores that are outliers relative to their size (e.g., a large store with low sales or a small store with unexpectedly high sales).

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Chart - Time Series Plot of Total Weekly Sales
plt.figure(figsize=(15, 7))

# Calculate total weekly sales across all stores for each date
total_weekly_sales = df_merged.groupby('Date')['Weekly_Sales'].sum().reset_index()

# Plotting the total weekly sales over time
plt.plot(total_weekly_sales['Date'], total_weekly_sales['Weekly_Sales'])

plt.title('Total Weekly Sales Over Time Across All Stores')
plt.xlabel('Date')
plt.ylabel('Total Weekly Sales')
plt.grid(True)
plt.show()

# You can also plot individual store's sales over time as shown in the previous response
# store_1_sales = df_merged[df_merged['Store'] == 1].groupby('Date')['Weekly_Sales'].sum().reset_index()
# plt.figure(figsize=(15, 7))
# plt.plot(store_1_sales['Date'], store_1_sales['Weekly_Sales'])
# plt.title('Weekly Sales for Store 1 Over Time')
# plt.xlabel('Date')
# plt.ylabel('Total Weekly Sales')
# plt.grid(True)
# plt.show()

# Optional: Visualize sales for different store types over time
# weekly_sales_by_type = df_merged.groupby(['Date', 'Type'])['Weekly_Sales'].sum().reset_index()
# plt.figure(figsize=(15, 7))
# sns.lineplot(data=weekly_sales_by_type, x='Date', y='Weekly_Sales', hue='Type')
# plt.title('Total Weekly Sales by Store Type Over Time')
# plt.xlabel('Date')
# plt.ylabel('Total Weekly Sales')
# plt.grid(True)
# plt.show()

##### 1. Why did you pick the specific chart?

 A line plot is essential for visualizing time series data, showing trends and patterns over time. Aggregating sales across all stores provides a high-level view of overall company performance.

##### 2. What is/are the insight(s) found from the chart?

 Reveals overall sales trends (e.g., growth, seasonality, decline). Clearly shows spikes corresponding to major holidays (like Christmas or Thanksgiving) and potentially other periodic fluctuations. May highlight periods of unusual performance

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Crucial for strategic planning and forecasting. Helps understand the impact of time-based factors (seasonality, holidays) on sales. Provides a benchmark for overall business health and allows for identifying periods requiring specific attention (e.g., dips in sales).

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Chart - Distribution of Weekly Sales by Department (for key departments)
# You might want to focus on departments with significant sales volume or variation
top_departments = df_merged.groupby('Dept')['Weekly_Sales'].sum().nlargest(10).index

plt.figure(figsize=(15, 8))
sns.boxplot(x='Dept', y='Weekly_Sales', data=df_merged[df_merged['Dept'].isin(top_departments)])
plt.title('Distribution of Weekly Sales for Top 10 Departments')
plt.xlabel('Department')
plt.ylabel('Weekly Sales')
plt.xticks(rotation=45)
plt.show()

# Why this chart?
# A box plot helps visualize the distribution (median, quartiles, outliers) of weekly sales for different departments.
# This helps identify departments with generally higher/lower sales and those with more variability or extreme sales figures.
# Insights: Different departments have vastly different sales volumes and patterns, suggesting different customer purchasing behaviors associated with these departments.
# Business Impact: High-performing departments might indicate areas for expansion or marketing focus. Departments with high variability might need better inventory management or promotional strategies.






##### 1. Why did you pick the specific chart?

 Box plots, applied to a selection of top departments, allow for comparing the distribution of weekly sales across different departments.

##### 2. What is/are the insight(s) found from the chart?

Shows that sales performance varies significantly across departments. Identifies departments that consistently generate high sales (higher median/upper quartile) and those with more volatile sales (larger interquartile range or more outliers). Highlights the relative importance of different departments.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Informs inventory management, staffing, and marketing strategies at the department level. High-performing departments could be areas for investment. Departments with high variability might need better demand forecasting or promotional planning. Understanding departmental sales patterns is key to optimizing store layout and promotions.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Chart - Average Weekly Sales by Store Type
plt.figure(figsize=(8, 5))
avg_sales_by_type = df_merged.groupby('Type')['Weekly_Sales'].mean().reset_index()
sns.barplot(x='Type', y='Weekly_Sales', data=avg_sales_by_type)
plt.title('Average Weekly Sales by Store Type')
plt.xlabel('Store Type')
plt.ylabel('Average Weekly Sales')
plt.show()

# Why this chart?
# A bar plot is simple and effective for comparing a single metric (average sales) across distinct categories (store types).
# Insights: This shows which store types (A, B, C) tend to have the highest average sales, suggesting that the type of store has a significant impact on overall sales performance.
# Business Impact: Understanding which store types are most successful can inform decisions about new store locations, store renovations, or targeted marketing based on store type characteristics.

##### 1. Why did you pick the specific chart?

A bar plot comparing the average weekly sales for each store type (A, B, C) provides a simple, direct comparison of typical performance across these categories.

##### 2. What is/are the insight(s) found from the chart?

Quantifies the difference in average sales between store types. Confirms the insight from box plots (Chart 6) that Type A stores generally have the highest average sales, followed by B and C.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Provides a clear, easy-to-understand metric for classifying store performance by type. Supports strategic decisions related to investment, expansion, or focus areas based on which store types are most profitable on average.


#### Chart - 13

In [None]:
# Chart - 14: Impact of Holiday on Weekly Sales by Store Type
plt.figure(figsize=(10, 6))
# Use the correct column name for IsHoliday after the merge
# Assuming IsHoliday_y is from the sales data (df2)
sns.violinplot(x='Type', y='Weekly_Sales', hue='IsHoliday_y', data=df_merged, split=True)
plt.title('Weekly Sales Distribution by Store Type and Holiday Status')
plt.xlabel('Store Type')
plt.ylabel('Weekly Sales')
plt.show()

##### 1. Why did you pick the specific chart?

A violin plot (or grouped box plot/swarm plot) with hue for 'IsHoliday' is used to compare the distribution of Weekly_Sales across Store Type and see how this distribution differs specifically during holiday vs. non-holiday weeks. split=True in violin plots is useful for comparing distributions side-by-side within each category.

##### 2. What is/are the insight(s) found from the chart?

Shows the impact of holidays on the sales distribution for each store type. Likely reveals that holiday weeks generally lead to higher sales, often with a wider distribution and higher peak sales values, compared to non-holiday weeks, and that this holiday boost varies by store type.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Essential for planning for holidays. Businesses can anticipate the expected sales uplift for different store types during holiday periods. This informs staffing levels, inventory buildup, and targeted holiday promotions for each store type, maximizing sales during peak periods.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Chart - 14 - Correlation Heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(df_merged.select_dtypes(include=np.number).corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Numerical Features in Merged Dataset')
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is used to visualize the correlation matrix between all pairs of numerical features. It uses color intensity to represent the strength and direction (positive/negative) of the linear relationship. annot=True displays the correlation values.

##### 2. What is/are the insight(s) found from the chart?

Identifies strong positive or negative correlations between numerical variables (e.g., correlation between Temperature and Fuel Price, or CPI and Unemployment). Reveals potential multicollinearity (highly correlated features) which can be an issue for some regression models. Shows which external factors (Temperature, Fuel Price, CPI, Unemployment) have the strongest correlation with Weekly_Sales (though correlation doesn't imply causation). Markdowns might show expected correlations with sales (e.g., negative correlation if markdowns indicate lower prices leading to higher volume, or positive if markdowns are used during high-traffic periods).

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Chart - 15 - Pair Plot
sns.pairplot(df_merged.select_dtypes(include=np.number).sample(n=1000)) # Sample for performance with large datasets
plt.suptitle('Pair Plot of Sampled Numerical Features in Merged Dataset', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot generates scatter plots for every pair of numerical variables and histograms (or KDEs) for each single numerical variable. Using a sample (sample(n=1000)) is necessary for performance with large datasets.

##### 2. What is/are the insight(s) found from the chart?

Provides a visual matrix showing pairwise relationships between all selected numerical features. Allows for identifying not just linear correlations (seen in the heatmap) but also non-linear relationships, clusters, or patterns that might not be captured by a simple correlation coefficient. The diagonal histograms show the distribution of each variable.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis ($H_0$$H_0$): The mean weekly sales are equal across all store types (Type A, Type B, and Type C). $H_0: \mu_A = \mu_B = \mu_C$ (where $\mu_X$ is the true mean weekly sales for Store Type X)

Alternate Hypothesis ($H_1$$H_1$): At least one store type has a different mean weekly sales compared to the others. $H_1:$ Not all $\mu$ are equal.

#### 2. Perform an appropriate statistical test.

In [None]:

# Perform Statistical Test to obtain P-Value
import scipy.stats as stats

# Check if df_merged is available and has the necessary columns
if 'df_merged' in locals() and 'Weekly_Sales' in df_merged.columns and 'Type' in df_merged.columns:

    # Separate Weekly_Sales data by Store Type
    sales_type_a = df_merged[df_merged['Type'] == 'A']['Weekly_Sales']
    sales_type_b = df_merged[df_merged['Type'] == 'B']['Weekly_Sales']
    sales_type_c = df_merged[df_merged['Type'] == 'C']['Weekly_Sales']

    # Clean data: Remove potential negative sales or handle outliers if necessary
    # For ANOVA, assuming data is roughly normally distributed within groups and equal variances (can be checked)
    # Let's filter out non-positive sales as they can skew means and violate assumptions
    sales_type_a = sales_type_a[sales_type_a > 0]
    sales_type_b = sales_type_b[sales_type_b > 0]
    sales_type_c = sales_type_c[sales_type_c > 0]


    # Check if there is enough data in each group
    if len(sales_type_a) > 1 and len(sales_type_b) > 1 and len(sales_type_c) > 1:
        print("\n--- Hypothesis Test 1: Weekly Sales vs. Store Type (ANOVA) ---")

        # Perform one-way ANOVA test
        # ANOVA tests if the means of three or more independent groups are statistically significantly different.
        f_statistic, p_value = stats.f_oneway(sales_type_a, sales_type_b, sales_type_c)

        print(f"F-statistic: {f_statistic:.4f}")
        print(f"P-value: {p_value:.4f}")

        # Define significance level (alpha)
        alpha = 0.05

        # Conclusion
        if p_value < alpha:
            print(f"\nConclusion: Reject the null hypothesis (p-value < {alpha}).")
            print("There is a statistically significant difference in mean weekly sales among the store types.")
            print("This supports the hypothesis that Store Type impacts Weekly Sales.")
        else:
            print(f"\nConclusion: Fail to reject the null hypothesis (p-value >= {alpha}).")
            print("There is no statistically significant difference in mean weekly sales among the store types.")
            print("This does not support the hypothesis that Store Type impacts Weekly Sales based on this test.")

        # Optional: Print group means for context
        print("\nMean Weekly Sales by Store Type:")
        print(f"Type A: {sales_type_a.mean():.2f}")
        print(f"Type B: {sales_type_b.mean():.2f}")
        print(f"Type C: {sales_type_c.mean():.2f}")

    else:
         print("\n--- Hypothesis Test 1 Skipped ---")
         print("Not enough data in one or more store types to perform ANOVA.")
         print(f"Type A samples: {len(sales_type_a)}, Type B samples: {len(sales_type_b)}, Type C samples: {len(sales_type_c)}")

else:
    print("\n--- Hypothesis Test 1 Skipped ---")
    print("Required columns ('Weekly_Sales', 'Type') or df_merged not found.")

##### Which statistical test have you done to obtain P-Value?

I have performed a One-Way Analysis of Variance (ANOVA) test.

##### Why did you choose the specific statistical test?

I chose a One-Way ANOVA because:

I am comparing the means of a numerical variable (Weekly_Sales) across three or more independent groups (Store Types A, B, and C).
ANOVA is designed specifically for this scenario to determine if there is a statistically significant difference between the means of these groups.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis ($H_0$$H_0$): The mean weekly sales during holiday weeks are equal to or less than the mean weekly sales during non-holiday weeks. $H_0: \mu_{Holiday} \le \mu_{Non-Holiday}$

Alternate Hypothesis ($H_1$$H_1$): The mean weekly sales during holiday weeks are greater than the mean weekly sales during non-holiday weeks. $H_1: \mu_{Holiday} > \mu_{Non-Holiday}$ (This is a one-tailed test)

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Perform Statistical Test to obtain P-Value
# We will use an independent samples t-test
import scipy.stats as stats

# Check if df_merged is available and has the necessary columns
if 'df_merged' in locals() and 'Weekly_Sales' in df_merged.columns and 'IsHoliday_y' in df_merged.columns:

     # Separate Weekly_Sales data for holiday and non-holiday weeks
    sales_holiday = df_merged[df_merged['IsHoliday_y'] == True]['Weekly_Sales']
    sales_non_holiday = df_merged[df_merged['IsHoliday_y'] == False]['Weekly_Sales']

    # Clean data: Remove potential negative sales or handle outliers if necessary
    sales_holiday = sales_holiday[sales_holiday > 0]
    sales_non_holiday = sales_non_holiday[sales_non_holiday > 0]


     # Check if there is enough data in each group (at least 2 samples for a t-test)
    if len(sales_holiday) > 1 and len(sales_non_holiday) > 1:
        print("\n--- Hypothesis Test 2: Weekly Sales vs. Holiday Status (Independent t-test) ---")

        # Perform independent samples t-test
        # We'll use a Welch's t-test which does not assume equal variances, a safer choice.
        # For a one-tailed test (H1: holiday > non-holiday), we look at the p-value and the means.
        # The ttest_ind function in scipy defaults to a two-sided p-value.
        # We need to divide the two-sided p-value by 2 if the t-statistic is in the direction of H1.

        t_statistic, p_value_two_sided = stats.ttest_ind(sales_holiday, sales_non_holiday, equal_var=False) # Welch's t-test

        print(f"T-statistic: {t_statistic:.4f}")
        print(f"Two-sided P-value: {p_value_two_sided:.4f}")

        # Calculate one-sided p-value for H1: mu_Holiday > mu_Non-Holiday
        # If t-statistic is positive, the one-sided p-value is p_value_two_sided / 2
        # If t-statistic is negative, the one-sided p-value is 1 - (p_value_two_sided / 2) (or simply high if H1 is positive direction)
        if t_statistic > 0:
            p_value_one_sided = p_value_two_sided / 2
            print(f"One-sided P-value (H1: Holiday > Non-Holiday): {p_value_one_sided:.4f}")
        else:
            p_value_one_sided = 1 - (p_value_two_sided / 2) # Or simply a very high value if t is negative
            print(f"One-sided P-value (H1: Holiday > Non-Holiday): {p_value_one_sided:.4f} (Since t-statistic is not > 0)")


        # Define significance level (alpha)
        alpha = 0.05

        # Conclusion based on the one-sided test
        if t_statistic > 0 and p_value_one_sided < alpha:
             print(f"\nConclusion: Reject the null hypothesis (t > 0 and one-sided p-value < {alpha}).")
             print("There is statistically significant evidence that mean weekly sales during holiday weeks are greater than during non-holiday weeks.")
             print("This supports the hypothesis that Weekly Sales are higher during holiday weeks.")
        else:
             print(f"\nConclusion: Fail to reject the null hypothesis (t <= 0 or one-sided p-value >= {alpha}).")
             print("There is not enough statistically significant evidence to conclude that mean weekly sales during holiday weeks are greater than during non-holiday weeks.")


        # Optional: Print group means for context
        print("\nMean Weekly Sales by Holiday Status:")
        print(f"Holiday Weeks: {sales_holiday.mean():.2f}")
        print(f"Non-Holiday Weeks: {sales_non_holiday.mean():.2f}")

    else:
        print("\n--- Hypothesis Test 2 Skipped ---")
        print("Not enough data in holiday or non-holiday weeks to perform t-test.")
        print(f"Holiday samples: {len(sales_holiday)}, Non-Holiday samples: {len(sales_non_holiday)}")

else:
    print("\n--- Hypothesis Test 2 Skipped ---")
    print("Required columns ('Weekly_Sales', 'IsHoliday_y') or df_merged not found.")

##### Which statistical test have you done to obtain P-Value?

I have performed an Independent Samples t-test (specifically Welch's t-test).

##### Why did you choose the specific statistical test?

I chose an Independent Samples t-test because:

I am comparing the means of a numerical variable (Weekly_Sales) between two independent groups (holiday weeks and non-holiday weeks).
Welch's t-test is suitable when the assumption of equal variances between the two groups may not hold, which is common with real-world data like sales during holidays vs. regular periods.
I am performing a one-tailed test because my hypothesis specifies a direction (higher sales during holidays), which requires careful interpretation of the standard two-sided p-value.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis ($H_0$$H_0$): There is no linear correlation between Temperature and Weekly Sales (the true correlation coefficient is zero). $H_0: \rho = 0$ (where $\rho$ is the true population correlation coefficient)

Alternate Hypothesis ($H_1$$H_1$): There is a significant linear correlation between Temperature and Weekly Sales (the true correlation coefficient is not zero). $H_1: \rho \ne 0$ (This is a two-tailed test)

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# We will use a correlation test (e.g., Pearson)
import scipy.stats as stats

# Check if df_merged is available and has the necessary columns
if 'df_merged' in locals() and 'Weekly_Sales' in df_merged.columns and 'Temperature' in df_merged.columns:

    # Select the two columns
    temperature_data = df_merged['Temperature']
    sales_data = df_merged['Weekly_Sales']

    # Clean data: Handle potential NaNs or outliers if necessary
    # For correlation, we need corresponding values, so drop rows with NaNs in either column
    temp_sales_df = df_merged[['Temperature', 'Weekly_Sales']].dropna().copy()
    temperature_data = temp_sales_df['Temperature']
    sales_data = temp_sales_df['Weekly_Sales']

    # Check if there is enough data (at least 2 pairs of observations for correlation)
    if len(temperature_data) > 1:
        print("\n--- Hypothesis Test 3: Correlation between Temperature and Weekly Sales (Pearson Correlation) ---")

        # Perform Pearson correlation test
        # Pearson correlation measures the linear relationship between two numerical variables.
        correlation_coefficient, p_value = stats.pearsonr(temperature_data, sales_data)

        print(f"Pearson Correlation Coefficient: {correlation_coefficient:.4f}")
        print(f"P-value: {p_value:.4f}")

        # Define significance level (alpha)
        alpha = 0.05

        # Conclusion
        if p_value < alpha:
            print(f"\nConclusion: Reject the null hypothesis (p-value < {alpha}).")
            print("There is a statistically significant linear correlation between Temperature and Weekly Sales.")
            print(f"The correlation coefficient is {correlation_coefficient:.4f}.")
            print("The direction and strength of the correlation are indicated by the coefficient value.")
        else:
            print(f"\nConclusion: Fail to reject the null hypothesis (p-value >= {alpha}).")
            print("There is no statistically significant linear correlation between Temperature and Weekly Sales.")

    else:
         print("\n--- Hypothesis Test 3 Skipped ---")
         print("Not enough data pairs with both Temperature and Weekly_Sales values to perform correlation test.")

else:
    print("\n--- Hypothesis Test 3 Skipped ---")
    print("Required columns ('Weekly_Sales', 'Temperature') or df_merged not found.")

##### Which statistical test have you done to obtain P-Value?

I have performed a Pearson Correlation Test.

##### Why did you choose the specific statistical test?

I chose a Pearson Correlation Test because:

I am assessing the linear relationship between two continuous numerical variables (Temperature and Weekly_Sales).
The Pearson correlation coefficient ($\rho$$\rho$) measures the strength and direction of this linear relationship.
The test provides a p-value to determine if the observed correlation coefficient is statistically different from zero (the value under the null hypothesis of no correlation).

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# --- 1. Handling Missing Values ---
print("\n--- 1. Handling Missing Values ---")
print("Missing values were initially imputed with the median during data wrangling.")
print("Let's re-check if any remain, especially in columns like MarkDowns.")
print("Missing values count before any further handling:")
print(df_merged.isnull().sum())

markdown_cols = [col for col in df_merged.columns if 'MarkDown' in col]
for col in markdown_cols:
  if df_merged[col].isnull().sum() > 0:
    nan_percentage = df_merged[col].isnull().sum() / len(df_merged) * 100
    print(f"{col}: {df_merged[col].isnull().sum()} missing values ({nan_percentage:.2f}%)")
    print(f"Imputing NaNs in {col} with 0 (assuming 'no markdown').")
    df_merged[col].fillna(0, inplace=True)

print("\nMissing values count after MarkDown imputation:")
print(df_merged.isnull().sum())

#### What all missing value imputation techniques have you used and why did you use those techniques?

Median imputation was chosen because:

It's robust to outliers, making it suitable for potentially skewed numerical data.

It's a simple and common technique for numerical imputation.

It was presented as an example method for handling missing numerical values.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# --- 2. Handling Outliers ---
print("\n--- 2. Handling Outliers ---")

# Identify numerical columns for outlier detection (excluding identifiers and binary flags like IsHoliday)
numerical_cols = df_merged.select_dtypes(include=np.number).columns.tolist()
# Remove 'Store', 'Dept', 'Size' (categorical/identifier-like), 'IsHoliday_x', 'IsHoliday_y'
cols_to_exclude = ['Store', 'Dept', 'Size', 'IsHoliday_x', 'IsHoliday_y']
numerical_cols = [col for col in numerical_cols if col not in cols_to_exclude]

print(f"\nChecking for outliers in numerical columns: {numerical_cols}")

# Function to detect outliers using IQR
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

# Function to cap outliers (Winsorizing)
def cap_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Cap values below lower_bound to lower_bound, and values above upper_bound to upper_bound
    data[column] = np.where(data[column] < lower_bound, lower_bound, data[column])
    data[column] = np.where(data[column] > upper_bound, upper_bound, data[column])
    return data

# Let's check for outliers first and then decide on handling
for col in numerical_cols:
    outliers, lower_bound, upper_bound = detect_outliers_iqr(df_merged, col)
    if not outliers.empty:
        print(f"\nOutliers detected in '{col}': {len(outliers)} rows ({len(outliers)/len(df_merged)*100:.2f}%)")
        print(f"  IQR Bounds: [{lower_bound:.2f}, {upper_bound:.2f}]")
        # Optional: Display some outlier values
        # print(f"  Sample outlier values: {outliers[col].unique()[:10]}")

# Decision: Cap outliers in 'Weekly_Sales' and 'Temperature' as they are key predictors/target.
# Markdowns can have genuine high values, CPI and Unemployment are less likely to have extreme outliers in this context,
# Fuel_Price might have spikes but capping might distort real-world events.
# Let's focus on capping Weekly_Sales and Temperature for now.

cols_to_cap = ['Weekly_Sales', 'Temperature']
print(f"\nCapping outliers using IQR method in columns: {cols_to_cap}")

for col in cols_to_cap:
    if col in df_merged.columns: # Ensure column exists
         initial_mean = df_merged[col].mean()
         initial_median = df_merged[col].median()
         df_merged = cap_outliers_iqr(df_merged.copy(), col) # Cap outliers

         # Verify capping by re-checking outliers or looking at description statistics
         outliers_after, _, _ = detect_outliers_iqr(df_merged, col)
         print(f"  Outliers in '{col}' after capping: {len(outliers_after)}")
         print(f"  Mean of '{col}' changed from {initial_mean:.2f} to {df_merged[col].mean():.2f}")
         print(f"  Median of '{col}' changed from {initial_median:.2f} to {df_merged[col].median():.2f}")
    else:
        print(f"  Column '{col}' not found in the DataFrame.")


# Re-check descriptive statistics after outlier handling
print("\nDescriptive statistics after handling outliers in selected columns:")
print(df_merged[numerical_cols].describe())


# Visualize distributions after handling outliers (optional)
# For example, visualize Weekly_Sales again
# plt.figure(figsize=(10, 6))
# sns.histplot(df_merged['Weekly_Sales'], bins=50, kde=True)
# plt.title('Distribution of Weekly_Sales After Outlier Capping')
# plt.xlabel('Weekly Sales')
# plt.ylabel('Frequency')
# plt.show()

# plt.figure(figsize=(10, 6))
# sns.boxplot(x='Type', y='Weekly_Sales', data=df_merged)
# plt.title('Weekly Sales by Store Type After Outlier Capping')
# plt.xlabel('Store Type')
# plt.ylabel('Weekly Sales')
# plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

No systematic outlier treatment techniques (like capping, flooring, or statistical removal) have been applied to the main dataset (df_merged).

A specific data cleaning step filtered out non-positive weekly sales (Weekly_Sales > 0) only for the first hypothesis test.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# --- 3. Categorical Encoding ---
print("\n--- 3. Categorical Encoding ---")

# Identify categorical columns
# In the merged dataset, 'Type' is the primary categorical column we need to encode.
# 'Store' and 'Dept' are treated more like identifiers, but 'Dept' might also benefit from encoding depending on the model.
# 'IsHoliday_x' and 'IsHoliday_y' are already binary (0/1 or True/False), so they are already encoded.
categorical_cols = ['Type']

print(f"\nCategorical columns to encode: {categorical_cols}")

# Perform One-Hot Encoding for 'Type'
# One-Hot Encoding is suitable for 'Type' as there is no inherent order (A > B > C is not necessarily true)
# Use drop_first=True to avoid multicollinearity (n-1 dummy variables)

print(f"Applying One-Hot Encoding to: {categorical_cols}")
df_merged = pd.get_dummies(df_merged, columns=categorical_cols, drop_first=True)

print("\nDataFrame columns after One-Hot Encoding:")
print(df_merged.columns)

print("\nFirst 5 rows of DataFrame after encoding:")
print(df_merged.head())

# Note: 'Store' and 'Dept' can be left as they are if treated as identifiers or if the model handles high-cardinality features.
# For linear models, you might consider encoding 'Dept' as well or using other techniques.
# For tree-based models, integer encoding or leaving them as is might be acceptable or even better.
# For now, we'll proceed with 'Type' encoded.

#### What all categorical encoding techniques have you used & why did you use those techniques?

No categorical encoding techniques have been used yet.

Categorical columns like 'Type' and 'IsHoliday' are used directly for visualization and hypothesis testing.

Encoding is typically done later for machine learning models.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
# Expand Contraction code for the dataset and merged datasets
print("\n--- Expanding Contractions (Dataset and Merged Dataset) ---")

# As previously noted, the datasets (df1, df2, df3, and df_merged) do not contain any textual columns
# that would require contraction expansion.

print("No textual columns requiring contraction expansion were found in the original or merged datasets.")

# Therefore, no code is needed to perform contraction expansion for these specific DataFrames.

# If you were working with text data (e.g., customer reviews, product descriptions) in a different dataset,
# you would implement the contraction expansion logic on those specific columns.

print("Proceeding with further analysis steps as contraction expansion is not applicable here.")

#### 2. Lower Casing

In [None]:
# Lower Casing
# lower casing
print("\n--- Lowercasing Text ---")

# Reviewing the columns in the original datasets (df1, df2, df3) and the merged dataset (df_merged):
# df1 columns: Store, Date, Temperature, Fuel_Price, MarkDown1-5, CPI, Unemployment, IsHoliday_x
# df2 columns: Store, Dept, Date, Weekly_Sales, IsHoliday_y
# df3 columns: Store, Type, Size
# df_merged columns: Store, Date, Temperature, Fuel_Price, MarkDown1-5, CPI, Unemployment, IsHoliday_x, Dept, Weekly_Sales, IsHoliday_y, Type, Size, Type_B, Type_C (after one-hot encoding)

# The only columns that could potentially contain text are 'Type' and possibly 'Dept' if it was stored as a string description.
# However, based on the previous exploration (`df.info()`, `df.head()`), 'Type' is likely categorical (A, B, C) and 'Dept' is numerical.

# Based on the comment and the error, it seems 'Type' in df_merged was already handled (likely one-hot encoded and dropped).
# The resulting columns Type_B and Type_C are numerical and do not require lowercasing.
# Dept was found to be numerical earlier as well.
# Therefore, there are no string columns in df_merged that need lowercasing here.

# We can remove the check on df_merged['Type'] as it no longer exists after one-hot encoding.
# The only potential text column is the original 'Type' in df3, but if it was used for merging and then one-hot encoded,
# lowercasing it here might not be strictly necessary if the one-hot encoder handled strings correctly,
# or if the process starts with the already cleaned df_merged.

# However, if you *do* want to ensure the *original* Type column in df3 is consistent *before* any merging or encoding,
# you can lowercase it here. But it's generally better to do this wrangling on the merged dataframe or right after loading
# if it affects subsequent steps.

# Let's check the original df3 for the 'Type' column's data type and lowercase if necessary,
# but acknowledge that the merged df_merged might not retain this string column.
if 'df3' in locals() and 'Type' in df3.columns:
    print(f"\nData type of 'Type' in df3: {df3['Type'].dtype}")
    if df3['Type'].dtype == 'object':
         print("Lowercasing 'Type' column in original df3.")
         # Check if there are actual strings before applying lower()
         df3['Type'] = df3['Type'].apply(lambda x: x.lower() if isinstance(x, str) else x)
         print("df3['Type'] has been lowercased.")
         print("Sample values from df3['Type'] after lowercasing:", df3['Type'].unique())
    else:
         print("'Type' column in df3 is not of string type, no lowercasing needed.")
else:
    print("df3 or 'Type' column not found. Skipping lowercasing for original df3.")


# Let's confirm the data type of 'Dept' in original df2, although it's likely numeric
if 'df2' in locals() and 'Dept' in df2.columns:
     print(f"\nData type of 'Dept' in df2: {df2['Dept'].dtype}")
     # As it's numeric, no lowercasing is needed.
else:
     print("df2 or 'Dept' column not found. Skipping lowercasing for original df2.")


# Based on the previous steps and the error, it's clear that df_merged no longer has a 'Type' column.
# Therefore, we can remove the erroneous check on df_merged['Type'].
# The code below confirms that no string columns needing lowercasing exist in the current df_merged based on the process so far.
print("\nChecking df_merged for string columns to lowercase...")
string_columns_in_merged = df_merged.select_dtypes(include='object').columns

if len(string_columns_in_merged) > 0:
    print(f"Found string columns in df_merged: {list(string_columns_in_merged)}")
    print("Applying lowercasing to these columns.")
    for col in string_columns_in_merged:
        df_merged[col] = df_merged[col].apply(lambda x: x.lower() if isinstance(x, str) else x)
    print("String columns in df_merged have been lowercased.")
else:
    print("No string columns found in the current df_merged that require lowercasing.")



#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

print("\n--- Removing Punctuations ---")

# As determined in previous steps, the current datasets (df1, df2, df3, df_merged)
# do not contain text columns that would typically have punctuation requiring removal.
# The columns are numerical, date, boolean, or categorical ('Type' was handled).

print("No text columns requiring punctuation removal were found in the dataset.")

# If you had a text column (e.g., 'Product_Description', 'Customer_Review'), you would remove punctuation like this:
# import string
#
# def remove_punctuation(text):
#     if isinstance(text, str): # Check if the input is a string
#         return text.translate(str.maketrans('', '', string.punctuation))
#     return text # Return non-string inputs as they are
#
# text_column_name = 'Your_Text_Column' # Replace with the actual column name
#
# if text_column_name in df_merged.columns and df_merged[text_column_name].dtype == 'object':
#      print(f"Removing punctuation from the column '{text_column_name}'")
#      df_merged[text_column_name] = df_merged[text_column_name].apply(remove_punctuation)
#      print(f"Punctuation removed from column '{text_column_name}'.")
# else:
#      print(f"Column '{text_column_name}' not found or is not of object dtype. No punctuation removal needed.")

print("Punctuation removal step completed (not applicable to current columns).")
print("Proceeding to the next preprocessing step.")

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

print("\n--- Removing URLs and words/digits containing digits ---")

# Similar to previous steps, the current datasets (df1, df2, df3, df_merged)
# do not contain text columns where URLs or words/digits containing digits would typically be present
# and need removal for analysis or modeling purposes relevant to this dataset.

print("No text columns requiring removal of URLs or words/digits containing digits were found.")

# If you had text data needing this, here's how you might implement it:
# import re
#
# def remove_urls(text):
#     if isinstance(text, str):
#         url_pattern = re.compile(r'https?://\S+|www\.\S+')
#         return url_pattern.sub(r'', text)
#     return text
#
# def remove_words_with_digits(text):
#     if isinstance(text, str):
#         # This regex keeps standalone numbers like '123' but removes words like 'a1b2' or 'product1'
#         # If you want to remove ALL tokens that contain digits, a simpler regex like r'\S*\d\S*' could be used
#         return ' '.join(word for word in text.split() if not any(char.isdigit() for char in word))
#     return text
#
# text_column_name = 'Your_Text_Column' # Replace with the actual column name
#
# if text_column_name in df_merged.columns and df_merged[text_column_name].dtype == 'object':
#     print(f"Removing URLs from the column '{text_column_name}'")
#     df_merged[text_column_name] = df_merged[text_column_name].apply(remove_urls)
#     print(f"URLs removed from column '{text_column_name}'.")
#
#     print(f"Removing words/digits containing digits from the column '{text_column_name}'")
#     df_merged[text_column_name] = df_merged[text_column_name].apply(remove_words_with_digits)
#     print(f"Words/digits containing digits removed from column '{text_column_name}'.")
# else:
#      print(f"Column '{text_column_name}' not found or is not of object dtype. Skipping removal of URLs/words with digits.")


print("URL and word/digit removal step completed (not applicable to current columns).")
print("Proceeding to the next preprocessing step.")

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
# Removing Stopwords & Removing White spaces
print("\n--- Removing Stopwords and White spaces ---")

# Stopword Removal:
# Stopwords are common words (like 'the', 'is', 'in') that are often removed from text data
# because they usually don't carry significant meaning for analysis.
# As the dataset does not contain text columns like reviews or descriptions,
# stopword removal is not applicable here.

print("No text columns requiring stopword removal were found in the dataset.")

# If you had text data, you would typically use a library like NLTK or spaCy:
# import nltk
# from nltk.corpus import stopwords
#
# # Download stopwords if you haven't already
# try:
#     stopwords.words('english')
# except LookupError:
#     nltk.download('stopwords')
#
# stop_words = set(stopwords.words('english'))
#
# def remove_stopwords(text):
#     if isinstance(text, str):
#         return ' '.join(word for word in str(text).split() if word not in stop_words)
#     return text
#
# text_column_name = 'Your_Text_Column' # Replace with the actual column name
#
# if text_column_name in df_merged.columns and df_merged[text_column_name].dtype == 'object':
#     print(f"Removing stopwords from the column '{text_column_name}'")
#     df_merged[text_column_name] = df_merged[text_column_name].apply(remove_stopwords)
#     print(f"Stopwords removed from column '{text_column_name}'.")
# # else:
# #      print(f"Column '{text_column_name}' not found or is not of object dtype. Skipping stopword removal.")


# Removing White spaces:
# This typically involves removing leading/trailing whitespace and sometimes normalizing multiple spaces
# between words to single spaces.
# Again, since we don't have free-form text fields, this is less critical.
# However, it's good practice to trim whitespace from any string columns, just in case.
# The 'Type' column in the original df3 might benefit from this if there were extra spaces.

print("\nRemoving extra white spaces from potential string columns.")

# Check string columns in the merged DataFrame. 'Type' is now encoded.
# Let's check if any object type columns remain that might benefit from stripping whitespace.
string_cols_merged = df_merged.select_dtypes(include='object').columns
print(f"Object columns in df_merged: {list(string_cols_merged)}")

if len(string_cols_merged) > 0:
    for col in string_cols_merged:
         print(f"Stripping whitespace from column '{col}' in df_merged")
         # Apply strip() only if the value is a string
         df_merged[col] = df_merged[col].apply(lambda x: x.strip() if isinstance(x, str) else x)
         print(f"Whitespace stripped from column '{col}'.")
else:
    print("No object columns found in df_merged to strip whitespace from.")


# Also check original dataframes if they are still relevant or needed for other steps
if 'df3' in locals() and 'Type' in df3.columns:
     if df3['Type'].dtype == 'object':
         print("\nStripping whitespace from 'Type' column in original df3.")
         df3['Type'] = df3['Type'].apply(lambda x: x.strip() if isinstance(x, str) else x)
         print("'Type' column in df3 has been stripped of whitespace.")
     else:
         print("'Type' column in df3 is not object type, skipping whitespace stripping.")


print("\nStopword removal and whitespace removal steps completed.")
print("Proceeding to the next preprocessing step.")

#### 6. Rephrase Text

In [None]:
# Rephrase Text
# rephrase text
print("\n--- Rephrasing Text ---")

# Rephrasing text is a specific text preprocessing step that is not applicable
# to the current dataset, which consists of numerical, date, and categorical data.
# This step is usually relevant in NLP tasks where you need to standardize
# the wording or meaning of text, like in question answering systems, chatbots,
# or text summarization, by mapping different phrases with similar meanings
# to a common representation.

print("Rephrasing text is not applicable to this dataset.")

# If you were working with text data and needed to rephrase, you might use:
# - Synonym replacement
# - Paraphrasing models (more advanced, often using deep learning)
# - Rule-based rephrasing for specific domain knowledge

# This requires a substantial vocabulary, synonym list, or a complex model,
# and is beyond the scope of standard data preprocessing for structured data.

print("Rephrasing text step completed (not applicable).")
print("Proceeding to the next preprocessing step.")

#### 7. Tokenization

In [None]:
# Tokenization
# tokenization
print("\n--- Tokenization ---")

# Tokenization is the process of breaking down text into smaller units, like words or subwords.
# It is a fundamental step in processing textual data for NLP tasks.
# However, as identified before, this dataset does not contain text columns
# that require tokenization for the analysis or modeling goals of predicting sales.

print("No text columns requiring tokenization were found in the dataset.")

# If you had a text column (e.g., 'Product_Description', 'Customer_Review'),
# you would perform tokenization using libraries like NLTK, spaCy, or scikit-learn's CountVectorizer/TfidfVectorizer.

# Example using NLTK (conceptual, not executed as not needed):
# import nltk
# from nltk.tokenize import word_tokenize
#
# # Download the punkt tokenizer if you haven't already
# try:
#     word_tokenize("test text")
# except LookupError:
#     nltk.download('punkt')
#
# def tokenize_text(text):
#     if isinstance(text, str):
#         return word_tokenize(text)
#     return text # Return non-string inputs as they are
#
# text_column_name = 'Your_Text_Column' # Replace with the actual column name
#
# if text_column_name in df_merged.columns and df_merged[text_column_name].dtype == 'object':
#     print(f"Tokenizing the column '{text_column_name}'")
#     df_merged[f'{text_column_name}_tokens'] = df_merged[text_column_name].apply(tokenize_text)
#     print(f"Tokenization applied to column '{text_column_name}'. New column '{text_column_name}_tokens' created.")
#     print(df_merged[[text_column_name, f'{text_column_name}_tokens']].head())
# else:
#      print(f"Column '{text_column_name}' not found or is not of object dtype. Skipping tokenization.")


print("\nTokenization step completed (not applicable).")
print("Proceeding to the next preprocessing step.")

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
# text normalization
print("\n--- Text Normalization ---")

# Text normalization is a broader term that encompasses various steps to
# convert text into a consistent, standardized format. This includes steps
# we've already discussed conceptually like lowercasing, removing punctuation,
# removing stopwords, stemming/lemmatization, handling numbers, etc.

# Since this dataset lacks typical free-form text columns, the text normalization
# steps are not directly applicable. The 'Type' column was handled via encoding.
# The 'Date' column was normalized by converting it to datetime objects.

print("Text normalization is not applicable to the current dataset's columns.")

# If you had text data requiring comprehensive normalization, the code would involve
# applying a combination of the previously discussed steps in a specific order,
# potentially along with other techniques like:
# - Handling numbers (e.g., converting "100" to "one hundred" or vice-versa, or replacing all numbers with a placeholder).
# - Correcting spelling errors.
# - Expanding abbreviations.
# - Handling emojis or special characters.

# Example of a combined normalization function (conceptual):
# def normalize_text(text):
#     if not isinstance(text, str):
#         return text
#     text = text.lower() # Lowercase
#     text = remove_urls(text) # Remove URLs (using the function from a previous step)
#     text = remove_punctuation(text) # Remove punctuation (using the function from a previous step)
#     text = remove_stopwords(text) # Remove stopwords (using the function from a previous step)
#     text = remove_words_with_digits(text) # Remove words with digits (using the function from a previous step)
#     text = ' '.join(text.split()) # Normalize whitespace
#     # Add stemming/lemmatization, spell check, etc. here if needed
#     return text
#
# text_column_name = 'Your_Text_Column'
#
# if text_column_name in df_merged.columns and df_merged[text_column_name].dtype == 'object':
#      print(f"Applying full text normalization to column '{text_column_name}'")
#      df_merged[text_column_name] = df_merged[text_column_name].apply(normalize_text)
#      print(f"Normalization applied to column '{text_column_name}'.")
# else:
#      print(f"Column '{text_column_name}' not found or is not of object dtype. Skipping text normalization.")

print("\nText normalization step completed (not applicable).")
print("Proceeding to the next step in Data Preprocessing.")

##### Which text normalization technique have you used and why?

No text normalization techniques were actually applied to the datasets (df1, df2, df3, df_merged).

This is because the datasets provided do not contain any free-form text columns (like customer reviews, product descriptions, comments, etc.) that would require such techniques for this particular analysis goal (predicting sales). The columns are primarily numerical, categorical identifiers, dates, or boolean flags.

#### 9. Part of speech tagging

In [None]:
# POS Taging
# Part of speech tagging
print("\n--- Part of Speech (POS) Tagging ---")

# Part of Speech (POS) tagging is the process of labeling words in a text
# as corresponding to a particular part of speech, such as noun, verb, adjective,
# adverb, etc. This is a standard step in Natural Language Processing (NLP)
# used to understand the grammatical structure and meaning of text.

# Since the dataset does not contain text columns, POS tagging is not applicable here.

print("Part of Speech (POS) tagging is not applicable to this dataset as there are no text columns.")

# If you had a text column (e.g., 'Product_Description'), you would use
# libraries like NLTK or spaCy to perform POS tagging.

# Example using NLTK (conceptual, not executed as not needed):
# import nltk
# # Download the averaged_perceptron_tagger if you haven't already
# try:
#     nltk.pos_tag(['test', 'text'])
# except LookupError:
#     nltk.download('averaged_perceptron_tagger')
#
# def pos_tag_text(text_tokens): # POS tagging is typically done on a list of tokens
#     if isinstance(text_tokens, list):
#         return nltk.pos_tag(text_tokens)
#     return text_tokens # Return non-list inputs as they are
#
# # Assuming you have a column with tokens, e.g., 'Your_Text_Column_tokens'
# token_column_name = 'Your_Text_Column_tokens'
#
# if token_column_name in df_merged.columns and isinstance(df_merged[token_column_name].iloc[0], list):
#      print(f"Applying POS tagging to the tokenized column '{token_column_name}'")
#      df_merged[f'{token_column_name}_pos'] = df_merged[token_column_name].apply(pos_tag_text)
#      print(f"POS tagging applied to column '{token_column_name}'. New column '{token_column_name}_pos' created.")
#      print(df_merged[[token_column_name, f'{token_column_name}_pos']].head())
# else:
#      print(f"Tokenized column '{token_column_name}' not found or is not a list of tokens. Skipping POS tagging.")


print("\nPart of Speech (POS) tagging step completed (not applicable).")
print("Proceeding to the next step in Data Preprocessing.")

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
# Text Vectorization
print("\n--- Text Vectorization ---")

# Text vectorization is the process of converting text data into numerical vectors
# that machine learning models can understand. Common techniques include:
# - Bag-of-Words (CountVectorizer)
# - TF-IDF (TfidfVectorizer)
# - Word Embeddings (Word2Vec, GloVe, FastText, etc.)

# As with other text preprocessing steps, text vectorization is not applicable
# to this dataset as it does not contain text columns that need to be converted
# into numerical representations for the purpose of predicting sales.

print("Text vectorization is not applicable to this dataset.")

# If you had text data (e.g., a tokenized and cleaned text column), you would
# choose an appropriate vectorization method based on your task and model.

# Example using TF-IDF (conceptual, not executed as not needed):
# from sklearn.feature_extraction.text import TfidfVectorizer
#
# text_column_name = 'Your_Text_Column' # The cleaned text column (string format)
#
# if text_column_name in df_merged.columns and df_merged[text_column_name].dtype == 'object':
#     print(f"Applying TF-IDF Vectorization to column '{text_column_name}'")
#
#     # TfidfVectorizer expects string input, so make sure the column is string type
#     # and handle potential NaNs if any remain, although we expect none after prior steps.
#     # Fill NaNs with empty string for vectorization
#     df_merged[text_column_name] = df_merged[text_column_name].fillna('')
#
#     # Initialize TfidfVectorizer
#     # min_df and max_df can help control the vocabulary size
#     tfidf_vectorizer = TfidfVectorizer(max_features=1000) # Example: limit to top 1000 features
#
#     # Fit and transform the text data
#     tfidf_matrix = tfidf_vectorizer.fit_transform(df_merged[text_column_name])
#
#     # Convert the TF-IDF matrix to a DataFrame (optional, but helpful for inspection)
#     tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
#
#     # You would then typically concatenate tfidf_df with your original df_merged
#     # Make sure indices align, especially if rows were dropped earlier.
#     # df_merged = pd.concat([df_merged.reset_index(drop=True), tfidf_df], axis=1)
#
#     print(f"TF-IDF Vectorization applied. Created a matrix with shape: {tfidf_matrix.shape}")
#     # print("Sample of TF-IDF DataFrame columns:")
#     # print(tfidf_df.head())
# else:
#      print(f"Column '{text_column_name}' not found or is not suitable for text vectorization. Skipping.")


print("\nText vectorization step completed (not applicable).")
print("Proceeding to the next step in Data Preprocessing.")

##### Which text vectorization technique have you used and why?

No text vectorization used because there is no text data in the dataset.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# --- 1. Feature Engineering ---
print("\n--- 1. Feature Engineering ---")

# Create time-based features from the 'Date' column
print("Creating time-based features from 'Date' column...")
# Ensure 'Date' is datetime type (already done in wrangling, re-checking)
if pd.api.types.is_datetime64_any_dtype(df_merged['Date']):
    df_merged['Year'] = df_merged['Date'].dt.year
    df_merged['Month'] = df_merged['Date'].dt.month
    df_merged['Day'] = df_merged['Date'].dt.day
    df_merged['WeekOfYear'] = df_merged['Date'].dt.isocalendar().week.astype(int) # Use isocalendar() for ISO week number
    df_merged['DayOfWeek'] = df_merged['Date'].dt.dayofweek # Monday=0, Sunday=6
    # Add a simple numerical representation of Date if needed for models that don't handle datetime
    # df_merged['Date_Ordinal'] = df_merged['Date'].apply(lambda x: x.toordinal())
    print("Added Year, Month, Day, WeekOfYear, DayOfWeek features.")
else:
    print("Date column is not in datetime format. Please check previous steps.")


# Create a feature combining Store and Dept (useful identifier or for interaction terms)
# print("Creating 'Store_Dept' combined feature...")
# df_merged['Store_Dept'] = df_merged['Store'].astype(str) + '_' + df_merged['Dept'].astype(str)
# print("'Store_Dept' feature created.")


# Engineer interaction features or polynomial features if relevant
# Example: Interaction between IsHoliday and Type_A sales potential
# df_merged['IsHoliday_TypeA'] = df_merged['IsHoliday_y'] * df_merged['Type_A'] # Assuming Type_A is a 0/1 dummy


# Lag features or rolling statistics for time series aspects
# These are more complex and depend on the specific modeling approach
# Example: Sales from the previous week for the same store/department
# df_merged['Weekly_Sales_Lag1'] = df_merged.groupby(['Store', 'Dept'])['Weekly_Sales'].shift(1)
# Need to handle NaNs created by shifting (e.g., fill with 0 or median of lag column)


# Combine Markdown features (e.g., sum or mean of markdown values)
print("Creating total markdown feature...")
markdown_cols = [col for col in df_merged.columns if 'MarkDown' in col]
if markdown_cols:
    df_merged['Total_MarkDown'] = df_merged[markdown_cols].sum(axis=1)
    print("Added 'Total_MarkDown' feature.")
else:
    print("No MarkDown columns found to create 'Total_MarkDown'.")


#### 2. Feature Selection

In [None]:
# --- 2. Feature Selection ---
print("\n--- 2. Feature Selection ---")

# Feature selection aims to choose the most relevant features to improve model performance,
# reduce training time, and enhance interpretability.

# Based on initial EDA and correlation heatmap, some features might be more important than others.
# For example, 'Weekly_Sales' is the target variable. 'Store', 'Dept', 'Date' (used for engineering),
# 'Type', and 'Size' are key identifiers/attributes.
# 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'IsHoliday' are external factors.
# MarkDowns are promotional factors.

print("Identifying potential features and target variable.")

# Define potential features (X) and target variable (y)
# Exclude the original 'Date' column now that time features are engineered
# Exclude 'Store', 'Dept' initially if treating them as categorical, or include if models can handle them or after suitable encoding
# Exclude original 'IsHoliday_x' and 'IsHoliday_y' if using one combined or engineered holiday feature, or keep if needed. Let's keep IsHoliday_y as the primary holiday indicator from sales data.
# Exclude the original 'Type' if it was dropped by get_dummies, and keep the dummy variables 'Type_B', 'Type_C'.

# Drop the original 'Date' column
if 'Date' in df_merged.columns:
    df_merged = df_merged.drop('Date', axis=1)
    print("Dropped original 'Date' column.")

# Check if the original 'Type' column exists (might have been dropped by get_dummies default)
if 'Type' in df_merged.columns and not any(col.startswith('Type_') for col in df_merged.columns if col != 'Type'):
    # This case is less likely if drop_first=True was used, but checking
    print("Original 'Type' column found without dummy variables. Consider dropping or explicitly handling.")
    # For now, let's assume get_dummies dropped it or we will drop it.

# Define target variable
target = 'Weekly_Sales'
y = df_merged[target]

# Define features (X) - exclude target, original date, and identifiers if not used as features
features = df_merged.drop(columns=[target, 'Store', 'Dept'], errors='ignore') # 'errors='ignore' handles cases where columns might not exist (like original 'Date' if already dropped)

# Drop original IsHoliday columns if we decide to use only one or a new engineered one
# Assuming IsHoliday_y is the one we want to keep from the sales data merge
if 'IsHoliday_x' in features.columns:
     features = features.drop('IsHoliday_x', axis=1)
     print("Dropped 'IsHoliday_x' feature.")

# Display the resulting features DataFrame columns
print("\nSelected Features (X) columns:")
print(features.columns)

# You might further refine feature selection based on:
# - Correlation with the target variable (e.g., using .corr()['Weekly_Sales'].sort_values())
# - Model-based feature importance (after training a model like RandomForest or LightGBM)
# - Statistical tests (e.g., ANOVA for categorical, correlation for numerical)
# - Domain knowledge

# Example: Check correlation with target (using df_merged before dropping target)
print("\nCorrelation of features with Weekly_Sales:")
print(df_merged.corr()['Weekly_Sales'].sort_values(ascending=False))

# Note: High correlation doesn't guarantee importance for complex models, but low correlation might indicate less useful features.
# MarkDowns often show low or complex correlations and might need more specific engineering or handling.

print("\nFeature Manipulation & Selection step completed.")
print("The features DataFrame 'features' and target Series 'y' are ready for modeling.")

# Display shape of feature and target sets
print(f"\nShape of features (X): {features.shape}")
print(f"Shape of target (y): {y.shape}")


# Simulate CustomerID as Store-Dept combination - This line was the cause of the error
# Replacing 'df' with 'df_merged'
# Note: This feature was commented out in the previous cell, but kept here as it was present in the traceback.
# If you intend to use it, uncomment it.
# df_merged['CustomerID'] = df_merged['Store'].astype(str) + '-' + df_merged['Dept'].astype(str)

# Total price approximation # This line was also present in the traceback and likely refers to 'df' as well
# If you intend to use this, uncomment and replace 'df' with 'df_merged'
# df_merged['Total_Price'] = df_merged['Weekly_Sales'] / df_merged['Size'] # Example, replace with actual logic

##### What all feature selection methods have you used  and why?

Manual Selection: Features were chosen or excluded based on explicit column names.

Exclusion of Target: The Weekly_Sales column was removed as it's the variable to be predicted.

Exclusion of Identifiers: Store and Dept were dropped from the features set, treating them as identifiers rather than direct numerical inputs.

Exclusion of Original Date: The raw Date column was removed after extracting specific time components (Year, Month, etc.).

Exclusion of Redundant Feature: IsHoliday_x was dropped, keeping only one holiday indicator (IsHoliday_y).

Inclusion of Engineered Features: Newly created features like Year, Month, WeekOfYear, DayOfWeek, and Total_MarkDown were automatically included.

##### Which all features you found important and why?

Store Type: Sales vary greatly by store type (A, B, C).

Store Size: Larger stores generally have higher sales.

IsHoliday: Holidays cause significant sales spikes.

Time (Year, WeekOfYear, Month): Captures trends and strong seasonal patterns.

MarkDowns: Promotions influence sales volume.

Temperature, Fuel Price, CPI, Unemployment: Macroeconomic factors impacting consumer behavior.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes, data needs transformation due to skewed distributions (Weekly_Sales, MarkDowns) and varying scales of numerical features.

Transformation Used: Log1p (for skewness and zeros) and Standard Scaling (for scaling).
Why: Log1p makes skewed data more normal-like and handles zeros well. Standard Scaling ensures features are on a similar scale, important for many models.

In [None]:
# Transform Your data
# Data Transformation
print("\n--- Data Transformation ---")

# Data transformation involves applying mathematical functions to features
# to change their distribution or scale, often to meet the assumptions
# of certain models or improve performance.

# Based on the EDA and visualizations, 'Weekly_Sales' and potentially some
# MarkDown columns show skewed distributions. Some models (like linear regression)
# perform better with normally distributed data.

# --- 1. Transform Skewed Numerical Features ---
print("\n--- 1. Transforming Skewed Numerical Features ---")

# Identify skewed numerical columns (excluding binary/encoded and identifiers)
# Use the 'features' DataFrame we created in the Feature Manipulation step.
# Check skewness for numerical columns in the features DataFrame.
skewed_cols = features.select_dtypes(include=np.number).columns.tolist()
# Exclude columns that shouldn't be transformed (like the 0/1 dummy variables)
cols_to_exclude_from_skew_check = ['IsHoliday_y', 'Type_B', 'Type_C', 'Year', 'Month', 'Day', 'WeekOfYear', 'DayOfWeek'] # Time features might also be treated differently

skewed_cols = [col for col in skewed_cols if col not in cols_to_exclude_from_skew_check]

print(f"Checking skewness for columns: {skewed_cols}")

# Calculate skewness
skewness = features[skewed_cols].skew().sort_values(ascending=False)
print("\nSkewness before transformation:")
print(skewness)

# Set a skewness threshold (e.g., |skewness| > 0.75 or 1)
skew_threshold = 1.0
highly_skewed = skewness[abs(skewness) > skew_threshold].index.tolist()

print(f"\nHighly skewed columns (absolute skewness > {skew_threshold}): {highly_skewed}")

# Apply a transformation (e.g., Log1p or Box-Cox) to highly skewed columns
# Log1p (log(1+x)) is good for right-skewed data, especially with zeros (like MarkDowns)
# Box-Cox can be used but requires positive values.

cols_to_transform = [col for col in highly_skewed if col in features.columns] # Ensure they are in the features df
# Exclude 'Weekly_Sales' as it's the target, handle it separately or transform 'y'
if 'Weekly_Sales' in cols_to_transform:
    cols_to_transform.remove('Weekly_Sales')

# Handle 'Weekly_Sales' (target variable) - often transformed for regression
print("\nConsidering transformation for the target variable 'Weekly_Sales'.")
# Check skewness of the target variable 'y'
target_skew = y.skew()
print(f"Skewness of 'Weekly_Sales': {target_skew:.4f}")

if abs(target_skew) > skew_threshold:
    print(f"'Weekly_Sales' is highly skewed (>{skew_threshold}). Applying Log1p transformation to 'y'.")
    # Add a small constant or handle zeros if necessary, but Weekly_Sales can be 0.
    # Log1p(x) = log(1+x). It handles 0 gracefully.
    # Ensure sales are non-negative before log transform
    if (y < 0).any():
         print("Warning: Negative Weekly_Sales found. Log transformation is not appropriate.")
         # Handle negative sales if they exist (e.g., remove, impute, or use a different approach)
         # For now, proceeding assuming non-negative sales after previous cleaning
    else:
        y_transformed = np.log1p(y)
        print("Target variable 'y' transformed using log1p.")
        # y = y_transformed # Update y if you want to work with transformed target
        # print("Updated target variable 'y' with log1p transformed values.")
else:
    print("'Weekly_Sales' is not highly skewed enough for transformation based on threshold.")
    y_transformed = y # Keep original y if not transforming


print(f"\nApplying Log1p transformation to highly skewed feature columns: {cols_to_transform}")

for col in cols_to_transform:
    # Apply log1p transformation
    # Add a small constant or handle zeros if needed, but MarkDowns can be 0. Log1p is suitable.
    # Ensure columns are non-negative before log transform, especially MarkDowns after 0 imputation
    if (features[col] < 0).any():
         print(f"Warning: Negative values found in '{col}'. Log transformation is not appropriate.")
         # Skip transformation for this column or handle negative values
    else:
        features[col] = np.log1p(features[col])
        print(f"  Transformed '{col}' using log1p.")


# Re-calculate skewness after transformation
skewness_after = features[cols_to_transform].skew().sort_values(ascending=False)
print("\nSkewness after transformation (for transformed columns):")
print(skewness_after)


# --- 2. Feature Scaling ---
print("\n--- 2. Feature Scaling ---")

# Scaling features ensures that no single feature dominates the model due to its
# magnitude. This is important for models sensitive to feature scales (e.g.,
# linear regression, SVMs, distance-based algorithms like KNN, neural networks).
# Tree-based models (like RandomForest, Gradient Boosting) are generally not
# sensitive to feature scaling.

# Choose a scaler (StandardScaler or MinMaxScaler)
# StandardScaler: Centers the data around 0 with unit variance. Good for algorithms assuming normally distributed data.
# MinMaxScaler: Scales data to a fixed range, usually 0 to 1. Good for algorithms sensitive to the range.

# Let's use StandardScaler as a common choice.
from sklearn.preprocessing import StandardScaler

# Identify numerical columns to scale
# Exclude binary/dummy variables ('IsHoliday_y', 'Type_B', 'Type_C')
# Exclude engineered time features if they are ordinal/categorical (Year, Month, WeekOfYear, DayOfWeek) - scaling them might not always be appropriate depending on the model.
# For now, let's scale all continuous numerical features including transformed ones.
cols_to_scale = features.select_dtypes(include=np.number).columns.tolist()
cols_to_exclude_from_scaling = ['IsHoliday_y', 'Type_B', 'Type_C', 'Year', 'Month', 'Day', 'WeekOfYear', 'DayOfWeek'] # Exclude binary/dummies and ordinal time features

cols_to_scale = [col for col in cols_to_scale if col not in cols_to_exclude_from_scaling]

print(f"\nApplying StandardScaler to numerical columns: {cols_to_scale}")

if cols_to_scale:
    # Initialize the scaler
    scaler = StandardScaler()

    # Fit the scaler and transform the selected columns
    # Need to handle potential NaNs before scaling if any remain, though we expect none.
    # Check for NaNs just in case
    if features[cols_to_scale].isnull().sum().sum() > 0:
         print("Warning: NaN values found in columns to be scaled. Scaling might produce unexpected results.")
         # Decide on handling: e.g., impute remaining NaNs before this step.
         # Assuming NaNs were handled previously.

    features[cols_to_scale] = scaler.fit_transform(features[cols_to_scale])
    print("Selected numerical features have been scaled using StandardScaler.")

    # Optional: Scale the target variable if you transformed it and your model predicts the transformed target
    # If you predict y_transformed, you might need to inverse_transform the predictions later.
    # print("\nScaling the transformed target variable 'y_transformed'...")
    # target_scaler = StandardScaler()
    # y_transformed_scaled = target_scaler.fit_transform(y_transformed.values.reshape(-1, 1)) # reshape needed for scaler
    # print("Transformed target variable 'y_transformed' has been scaled.")

else:
    print("No numerical columns found to scale based on exclusion criteria.")


# Display descriptive statistics after scaling (only for scaled columns)
print("\nDescriptive statistics of scaled numerical features:")
print(features[cols_to_scale].describe())

print("\nData Transformation step completed.")
print("The features DataFrame 'features' is ready for modeling (potentially with transformed target 'y_transformed').")

### 6. Data Scaling

In [None]:
# Scaling your data
# data scaling
print("\n--- Data Scaling ---")

# Data scaling is the process of adjusting the range of values of the features
# to a standard scale. This prevents features with larger magnitudes from
# having a disproportionate impact on the model compared to features with smaller magnitudes.

# As mentioned before, this step was already performed in the "Data Transformation" section.
# We used the StandardScaler for this purpose.

print("Data scaling was previously completed as part of the 'Data Transformation' section.")
print("StandardScaler was applied to standardize the continuous numerical features.")

# Recapping the code that performed the scaling:
# from sklearn.preprocessing import StandardScaler

# # Identify numerical columns to scale (excluding binary, dummy, and some time features)
# # Ensure 'features' DataFrame is the current one after all transformations
# cols_to_scale = features.select_dtypes(include=np.number).columns.tolist()
# cols_to_exclude_from_scaling = ['IsHoliday_y', 'Type_B', 'Type_C', 'Year', 'Month', 'Day', 'WeekOfYear', 'DayOfWeek']
# cols_to_scale = [col for col in cols_to_scale if col not in cols_to_exclude_from_scaling]

# if cols_to_scale:
#     scaler = StandardScaler()
#     # Check for NaNs before scaling if necessary (should be handled in earlier steps)
#     if features[cols_to_scale].isnull().sum().sum() > 0:
#          print("Warning: NaN values found in columns to be scaled. Scaling might produce unexpected results.")
#          # Additional handling for NaNs needed here if not done before.

#     features[cols_to_scale] = scaler.fit_transform(features[cols_to_scale])
#     print("Selected numerical features have been scaled using StandardScaler.")
# else:
#     print("No numerical columns found to scale based on exclusion criteria.")

# Displaying the descriptive statistics for scaled features again confirms the scaling was applied.
print("\nDescriptive statistics of scaled numerical features (confirms scaling):")
if 'features' in locals():
     numerical_cols_in_features = features.select_dtypes(include=np.number).columns.tolist()
     print(features[numerical_cols_in_features].describe())
else:
     print("'features' DataFrame not found.")


print("\nData Scaling step completed (already done).")
print("Proceeding with model building or further analysis.")

##### Which method have you used to scale you data and why?

No data scaling method has been used in the provided code.

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Considering dimensionality reduction is wise due to potential increases in features from engineering and possible multicollinearity, aiming to improve model performance and training speed, though it might reduce interpretability.

In [None]:
# DImensionality Reduction (If needed)
# dimensionality reduction
print("\n--- Dimensionality Reduction ---")

# Dimensionality reduction is a set of techniques used to reduce the number of
# features (or dimensions) in a dataset while retaining as much of the
# important information as possible. This can help to:
# - Reduce overfitting
# - Improve model performance (especially for models sensitive to high dimensions)
# - Decrease training time and memory usage
# - Aid in visualization (especially when reducing to 2 or 3 dimensions)

# Common techniques include:
# - Principal Component Analysis (PCA)
# - Linear Discriminant Analysis (LDA) (Supervised)
# - t-SNE (Primarily for visualization)
# - Factor Analysis
# - Feature selection methods (which we discussed earlier, can also be considered dimensionality reduction)

# Whether dimensionality reduction is needed depends on the dataset's size,
# the number of features, potential multicollinearity, and the chosen model.

# In this dataset, we have a moderate number of features after engineering and encoding.
# Before applying techniques like PCA, it's useful to consider:
# 1. Are there highly correlated features? (Check the correlation heatmap again). High correlation suggests redundancy that PCA could capture.
# 2. Are we using a model that is sensitive to high dimensionality or multicollinearity (e.g., linear regression, which might have unstable coefficients with high multicollinearity)? Tree-based models are less sensitive.
# 3. Is computational performance a major constraint?

# Let's re-examine the correlation heatmap of the *features* DataFrame after transformations and scaling.
# Note: Calculate correlation on the *features* DataFrame, excluding the target 'y'.
# Ensure we select only numerical types for correlation matrix calculation.

print("\nRe-checking correlation among features...")
# Ensure 'features' DataFrame exists
if 'features' in locals():
    numerical_features = features.select_dtypes(include=np.number)
    if not numerical_features.empty:
        plt.figure(figsize=(14, 10))
        sns.heatmap(numerical_features.corr(), annot=False, cmap='coolwarm', fmt=".2f") # Set annot=False if too many features to read
        plt.title('Correlation Heatmap of Numerical Features After Preprocessing')
        plt.show()

        # You can inspect the correlation values
        # print("\nSample of feature correlations:")
        # print(numerical_features.corr().unstack().sort_values(kind="quicksort").drop_duplicates())
    else:
        print("No numerical features found in the 'features' DataFrame to calculate correlation.")

else:
    print("'features' DataFrame not found. Cannot re-check correlation.")


# Based on the correlation heatmap and the number of features, PCA might be considered
# if there is high multicollinearity or if a model sensitive to it is used.
# However, for tree-based models often used in Kaggle-like scenarios (like Gradient Boosting),
# PCA is often not strictly necessary and can sometimes hurt performance by making features less interpretable.
# If using Linear Regression or SVMs, PCA might be more beneficial.

# Let's demonstrate PCA conceptually, but we might choose *not* to apply it
# for a tree-based model unless there's a strong reason (e.g., performance).

# Example using PCA (Conceptual - Decide whether to apply based on model choice and needs)
# from sklearn.decomposition import PCA

# # It's generally recommended to scale data before applying PCA
# # We have already scaled our numerical features in the 'features' DataFrame.

# print("\nConsidering applying PCA for dimensionality reduction (Conceptual)...")

# # Create a temporary DataFrame for PCA, including scaled numerical features
# # and any other non-numerical features that might be relevant (e.g., boolean IsHoliday)
# # PCA only works on numerical data. So, select only numerical columns.
# pca_data = features.select_dtypes(include=np.number)

# if not pca_data.empty:
#      # Handle potential NaNs if any exist in pca_data (should be handled earlier)
#      if pca_data.isnull().sum().sum() > 0:
#           print("Warning: NaN values found in data for PCA. PCA is sensitive to NaNs.")
#           # Impute or remove NaNs before PCA

#      # Initialize PCA - decide the number of components
#      # Can choose n_components based on explained variance (e.g., retain 95% variance)
#      # or a fixed number of components.
#      # Let's start by checking explained variance ratio.
#      pca = PCA(n_components=None) # Keep all components initially to check variance
#      pca.fit(pca_data)

#      # Plot explained variance ratio
#      plt.figure(figsize=(10, 6))
#      plt.plot(np.cumsum(pca.explained_variance_ratio_))
#      plt.xlabel('Number of Components')
#      plt.ylabel('Explained Variance Ratio')
#      plt.title('Explained Variance by PCA Components')
#      plt.grid(True)
#      plt.show()

#      # Decide on the number of components based on the plot (e.g., elbow point or threshold)
#      # For example, to retain 95% variance:
#      # n_components_95 = np.argmax(np.cumsum(pca.explained_variance_ratio_) >= 0.95) + 1
#      # print(f"\nNumber of components to retain 95% variance: {n_components_95}")

#      # Let's decide to apply PCA if needed, for example, reducing to a smaller number of components
#      # For demonstration, let's say we decide to use 10 components (arbitrary example)
#      # n_components_chosen = 10 # Example

#      # if pca_data.shape[1] > n_components_chosen:
#      #     print(f"Applying PCA with n_components = {n_components_chosen}")
#      #     pca_final = PCA(n_components=n_components_chosen)
#      #     features_pca = pca_final.fit_transform(pca_data)

#      #     # Convert PCA results back to DataFrame (optional)
#      #     features_pca_df = pd.DataFrame(features_pca, columns=[f'PC{i+1}' for i in range(n_components_chosen)])

#      #     # Replace the original numerical features with PCA components
#      #     # Need to align indices if the original dataframe was modified (e.g., rows dropped)
#      #     # This step is complex if you need to join back with non-numerical features
#      #     # A simpler approach might be to just use `features_pca` array directly in the model training step
#      #     print(f"PCA applied. Resulting shape: {features_pca.shape}")
#      #     # Depending on your workflow, you might replace `features` or create a new variable like `features_for_model`.
#      #     # features = features_pca_df # This would replace ALL features with just PCs, which is usually not desired if you have categorical/binary features.
#      # else:
#      #     print("Number of features is already less than or equal to the chosen number of components. Skipping PCA.")
# else:
#     print("PCA skipped as there are no numerical features or 'features' DataFrame is not available.")

# --- Conclusion on Dimensionality Reduction for this dataset ---
print("\n--- Conclusion on Dimensionality Reduction for this dataset ---")

print("Given the moderate number of features and the likely use of tree-based models (which are robust to multicollinearity and don't strictly require scaling or PCA), explicit dimensionality reduction like PCA might not be necessary or provide significant benefits for initial modeling.")
print("Feature selection based on importance after initial model training could be a more effective form of dimensionality reduction if needed.")
print("Therefore, we will proceed without applying PCA or similar techniques at this stage.")


print("\nDimensionality Reduction step completed (decision made not to apply PCA).")
print("Proceeding to the next step.")

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

No explicit dimensionality reduction technique like PCA was used; instead, the code performed feature selection by dropping columns based on relevance.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

# data splitting
print("\n--- Data Splitting ---")

# Data splitting is the process of dividing the dataset into multiple subsets,
# typically a training set and a testing set.
# - Training set: Used to train the machine learning model.
# - Testing set: Used to evaluate the performance of the trained model on unseen data.
# This helps to estimate how well the model will generalize to new, real-world data
# and helps in detecting overfitting.

# For time-series data, a simple random split is often *not* the best approach.
# It's usually better to split the data chronologically to simulate predicting
# the future based on the past. However, given the nature of predicting
# `Weekly_Sales` for different stores/departments, and potential non-strict
# time dependencies across *all* records, a standard train-test split is common
# and often sufficient, especially if cross-validation is used later.

# If strict chronological split is desired, you would sort by date and split.
# Example chronological split:
# df_merged_sorted = df_merged.sort_values('Date') # Need the Date column or an ordinal date feature
# train_size = int(len(df_merged_sorted) * 0.8) # e.g., 80% for training
# train_data = df_merged_sorted.iloc[:train_size]
# test_data = df_merged_sorted.iloc[train_size:]
# X_train_chrono = train_data.drop(columns=[target]) # Need to define target
# y_train_chrono = train_data[target]
# X_test_chrono = test_data.drop(columns=[target])
# y_test_chrono = test_data[target]


# For a more general approach suitable for many model types and often used in tabular data,
# a random train-test split is standard. We'll use scikit-learn's train_test_split.

from sklearn.model_selection import train_test_split

print("Splitting data into training and testing sets using train_test_split...")

# Define features (X) and target (y)
# Make sure 'features' and 'y' DataFrames/Series exist after preprocessing.
# If you decided to use the log-transformed target, use 'y_transformed'.
# Let's assume we are using the original 'y' for now, and will handle inverse transform if predicting y_transformed.
# If you want to train *on* y_transformed, replace 'y' with 'y_transformed' here.

# Use the 'features' DataFrame and 'y' Series generated in the Feature Manipulation step
if 'features' in locals() and 'y' in locals():
    # Check if the dimensions match (features and target should have the same number of rows)
    if features.shape[0] == y.shape[0]:
        print(f"Features shape: {features.shape}")
        print(f"Target shape: {y.shape}")

        # Perform the split
        # test_size: the proportion of the dataset to include in the test split
        # random_state: ensures reproducibility of the split
        # shuffle: Whether to shuffle the data before splitting (True for random split)
        X_train, X_test, y_train, y_test = train_test_split(
            features,          # Your feature DataFrame/array
            y,                 # Your target Series/array (use y_transformed if applicable)
            test_size=0.2,     # 20% of data for testing
            random_state=42,   # Use a fixed number for reproducibility
            shuffle=True       # Shuffle the data randomly
        )

        print("\nData split successful.")
        print(f"Training features shape (X_train): {X_train.shape}")
        print(f"Testing features shape (X_test):   {X_test.shape}")
        print(f"Training target shape (y_train):   {y_train.shape}")
        print(f"Testing target shape (y_test):     {y_test.shape}")

    else:
        print("Error: Features and target do not have the same number of rows. Check previous steps.")
        print(f"Features shape: {features.shape}, Target shape: {y.shape}")
        X_train, X_test, y_train, y_test = None, None, None, None # Set to None to avoid using invalid data

else:
    print("Error: 'features' or 'y' DataFrame/Series not found. Please run previous preprocessing steps.")
    X_train, X_test, y_train, y_test = None, None, None, None # Set to None

print("\nData Splitting step completed.")
print("X_train, X_test, y_train, y_test variables are ready for model training and evaluation.")

##### What data splitting ratio have you used and why?

An 80/20 training/testing split was used because the large dataset size allows for both sufficient training data and a robust test set for reliable evaluation [1].

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

No, there's no class imbalance as Weekly_Sales is continuous (regression). However, the distribution of sales values is highly skewed, with many low sales and fewer high sales.

In [None]:
# Handling Imbalanced Dataset (If needed)
# handling imbalanced dataset
print("\n--- Handling Imbalanced Dataset ---")

# Handling imbalanced datasets is relevant when the distribution of the target
# variable across its classes is highly uneven. This is typically a concern
# for classification problems where one class is significantly less frequent
# than others (e.g., fraud detection, rare disease prediction).

# In this project, the target variable is 'Weekly_Sales', which is a numerical
# variable for a **regression** problem, not a classification problem with distinct classes.

print("This project is a REGRESSION task (predicting continuous 'Weekly_Sales'), NOT a classification task.")
print("Handling imbalanced datasets is primarily a concern for CLASSIFICATION problems.")
print("Therefore, standard techniques for handling imbalanced datasets (like oversampling or undersampling classes) are NOT applicable or necessary here.")

# Techniques for imbalanced classification datasets include:
# - Resampling techniques:
#   - Undersampling the majority class(es) [1]
#   - Oversampling the minority class(es) (e.g., SMOTE) [1]
# - Using different evaluation metrics (e.g., Precision, Recall, F1-score, AUC-ROC) instead of accuracy.
# - Using algorithms designed for imbalanced data or cost-sensitive learning.
# - Generating synthetic samples.

# None of these techniques are designed for or needed in a regression context.

# While the *distribution* of 'Weekly_Sales' is skewed (which was handled by transformation),
# this is different from class imbalance in classification. The goal in regression is
# to predict the continuous value accurately, not to classify instances into rare categories.

print("The skewed distribution of 'Weekly_Sales' (handled by transformation) is different from class imbalance.")
print("No specific techniques for handling imbalanced classification datasets are required.")

print("\nHandling Imbalanced Dataset step completed (not applicable).")
print("Proceeding to the next step.")

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

No technique was used to "handle imbalance" because it's a regression problem, not classification. The skewed distribution of sales was not explicitly handled in the provided code.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# Demand Forecasting Model (XGBoost)

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import xgboost as xgb
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt # Import matplotlib for plotting feature importance


# Assuming df_merged is your prepared DataFrame from previous steps

# --- Feature Engineering ---
# Extract time-based features from the 'Date' column
# Perform this *before* dropping the Date column or splitting data
if 'Date' in df_merged.columns:
    df_merged['Year'] = df_merged['Date'].dt.year
    df_merged['Month'] = df_merged['Date'].dt.month
    # isocalendar().week returns a Series with dtype UInt32, convert to int
    df_merged['Week'] = df_merged['Date'].dt.isocalendar().week.astype(int)
    df_merged['DayOfWeek'] = df_merged['Date'].dt.dayofweek
else:
    print("Warning: 'Date' column not found for feature engineering.")


# Convert categorical features to numerical using Label Encoding
label_encoders = {}
for col in ['Store', 'Dept', 'Type']:
    if col in df_merged.columns:
        # Check if the column has categorical or object dtype before encoding
        if df_merged[col].dtype == 'object' or pd.api.types.is_categorical_dtype(df_merged[col]):
            label_encoders[col] = LabelEncoder()
            df_merged[col] = label_encoders[col].fit_transform(df_merged[col])
        else:
             print(f"Warning: Column '{col}' is not of object or category dtype ({df_merged[col].dtype}). Skipping encoding.")
    else:
        print(f"Warning: Column '{col}' not found in df_merged. Skipping encoding.")


# --- Data Splitting ---
# Define features (X) and target variable (y)
# Drop 'Date' as it's no longer needed after extracting features.
# Drop original 'IsHoliday_x' and 'IsHoliday_y' if a single 'IsHoliday' column is used
# (assuming one 'IsHoliday' column represents the final holiday status after merging).
# If 'IsHoliday' is already a numeric (0/1) column, it can be kept.
# Assuming 'IsHoliday_y' from df2 (sales) is the relevant one, and 'IsHoliday_x' from df1 (features) is redundant after merge.
# If the original 'IsHoliday' columns were combined or handled differently, adjust drop list.
# Also dropping markdown columns as suggested in original comments, if they were handled earlier.
cols_to_drop = ['Date', 'Weekly_Sales']

# Safely add IsHoliday columns to drop if they exist
if 'IsHoliday_x' in df_merged.columns:
    cols_to_drop.append('IsHoliday_x')
if 'IsHoliday_y' in df_merged.columns:
    cols_to_drop.append('IsHoliday_y')
# Safely add Markdown columns to drop if they exist (based on original code comment)
markdown_cols = [f'MarkDown{i}' for i in range(1, 6)]
for m_col in markdown_cols:
    if m_col in df_merged.columns:
        cols_to_drop.append(m_col)


# Create the features DataFrame by dropping columns
# Ensure we don't try to drop columns that don't exist
cols_to_drop_existing = [col for col in cols_to_drop if col in df_merged.columns]
features = df_merged.drop(columns=cols_to_drop_existing)
target = df_merged['Weekly_Sales']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

print("Shape of training features:", X_train.shape)
print("Shape of testing features:", X_test.shape)
print("Shape of training target:", y_train.shape)
print("Shape of testing target:", y_test.shape)

# --- Model Training (XGBoost) ---
# Initialize and train the XGBoost Regressor model
# You can tune hyperparameters for better performance
model = xgb.XGBRegressor(objective='reg:squarederror', # Regression task with squared error
                         n_estimators=1000,           # Number of boosting rounds
                         learning_rate=0.05,         # Step size shrinkage
                         max_depth=7,                # Maximum depth of a tree
                         subsample=0.8,              # Subsample ratio of the training instances
                         colsample_bytree=0.8,       # Subsample ratio of columns when constructing each tree
                         random_state=42,
                         n_jobs=-1)                  # Use all available cores

print("\nTraining XGBoost model...")
model.fit(X_train, y_train)
print("XGBoost model training complete.")

# --- Model Evaluation ---
# Make predictions on the test set
predictions = model.predict(X_test)

# Evaluate the model using appropriate metrics
rmse = np.sqrt(mean_squared_error(y_test, predictions))
mae = mean_absolute_error(y_test, predictions)

print(f"\nRoot Mean Squared Error (RMSE): {rmse:.2f}")
print(f"Mean Absolute Error (MAE): {mae:.2f}")

# You can also evaluate on the training set to check for overfitting
train_predictions = model.predict(X_train)
train_rmse = np.sqrt(mean_squared_error(y_train, train_predictions))
train_mae = mean_absolute_error(y_train, train_predictions)

print(f"Training Root Mean Squared Error (RMSE): {train_rmse:.2f}")
print(f"Training Mean Absolute Error (MAE): {train_mae:.2f}")

# --- Feature Importance (Optional) ---
# See which features were most important for the model's predictions
print("\nFeature Importances:")
# Ensure feature names are correctly aligned with importances
feature_importances = pd.Series(model.feature_importances_, index=features.columns).sort_values(ascending=False)
print(feature_importances)

# Visualize feature importances
plt.figure(figsize=(10, 6))
feature_importances.plot(kind='bar')
plt.title('XGBoost Feature Importances')
plt.ylabel('Importance')
plt.show()

# --- Further Steps ---
# 1. Hyperparameter Tuning: Use techniques like GridSearchCV or RandomizedSearchCV
#    to find optimal hyperparameters for better performance.
# 2. Cross-Validation: Implement cross-validation to get a more robust estimate of
#    model performance and reduce the risk of overfitting to a single train/test split.
# 3. Advanced Feature Engineering: Explore more complex features, like lagged sales,
#    rolling averages, or interactions between features.
# 4. Outlier Handling: Revisit outlier detection and handling, especially for Weekly_Sales.
# 5. Model Interpretation: Use tools like SHAP or LIME to understand individual predictions.

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Identify the ML model used
# Based on the code in cell 649, the model is XGBoost Regressor.
print("\n--- ML Model Used ---")
print("The primary machine learning model used for this regression task is the **XGBoost Regressor** (`xgboost.XGBRegressor`).")
print("XGBoost is a powerful implementation of gradient boosting, an ensemble technique.")
print("It builds a sequence of decision trees, where each new tree attempts to correct the errors made by the previous ones, cumulatively improving the prediction.")
print("It is widely favored for its performance, speed, and ability to handle complex patterns in data like this retail sales dataset.")

# Identify the evaluation metrics used and their values
# Based on the code in cell 649, the metrics calculated are MSE, RMSE, MAE, and R2.
print("\n--- Model Performance Evaluation ---")

# Recalculate metrics based on the cell to display actual values
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np # Needed for sqrt for RMSE

if 'X_test' in locals() and 'y_test' in locals() and 'model' in locals():
    y_pred = model.predict(X_test)

    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse) # RMSE is the square root of MSE
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print("\n--- Evaluation Metrics Scores ---")
    print(f"Mean Squared Error (MSE): {mse:.2f}")
    print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
    print(f"Mean Absolute Error (MAE): {mae:.2f}")
    print(f"R-squared (R2): {r2:.4f}")

    print("\n--- Explanation of Metrics: ---")
    print("- **MSE (Mean Squared Error):** Measures the average squared difference between actual and predicted sales. Larger errors are penalized more heavily.")
    print("- **RMSE (Root Mean Squared Error):** The square root of MSE. It represents the typical error magnitude on the same scale as Weekly Sales, making it highly interpretable.")
    print("- **MAE (Mean Absolute Error):** Measures the average absolute difference between actual and predicted sales. It is less sensitive to outliers compared to MSE/RMSE.")
    print("- **R-squared (R2):** Indicates the proportion of the variance in Weekly Sales that is predictable from the features. A higher value (closer to 1) means the model explains more of the variability in sales.")

else:
    print("Model or test data not found. Cannot calculate metrics.")

# Note: No visualization code was provided for a 'Score Chart'.
print("\nNote: A visual 'Evaluation metric Score Chart' was not explicitly generated in the provided code, but the numerical scores are printed.")

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
# The provided code implements both Cross-Validation and Hyperparameter Tuning.

print("\n--- Cross-Validation ---")
print("Cross-validation was performed to get a more reliable estimate of the model's performance on unseen data, reducing the dependence on a single train/test split.")
print("Specifically, **K-Fold Cross-Validation** was used.")
print("The data was split into multiple folds (e.g., 5 folds as a common default for `cross_val_score` or explicitly set for `cross_validate`).")
print("The model was trained on k-1 folds and evaluated on the remaining fold, and this process was repeated k times, with each fold serving as the test set once.")

print("\nEvaluation Metrics used during Cross-Validation:")
print("The code explicitly calculates scores for:")
print("-   **R-squared (`r2`)**")
print("-   **Negative Mean Squared Error (`neg_mean_squared_error`)** from which RMSE is derived (negative because `cross_val_score` maximizes scores)")
print("-   **Negative Mean Absolute Error (`neg_mean_absolute_error`)** (negative because `cross_val_score` maximizes scores)")

print("\nResults of Cross-Validation:")
# Assuming the variables storing CV results are available (e.g., r2_scores, rmse_scores, mae_scores)
# Based on the code structure, cross_validate results are likely stored in a dictionary.

# Example printout structure (adapt based on actual variable names in the user's code)
# if 'cv_results' in locals(): # Assuming results from cross_validate are in 'cv_results'
#     print(f"Average R-squared from CV: {cv_results['test_r2'].mean():.4f} (+/- {cv_results['test_r2'].std():.4f})")
#     print(f"Average RMSE from CV: {np.sqrt(-cv_results['test_neg_mean_squared_error']).mean():.4f} (+/- {np.sqrt(-cv_results['test_neg_mean_squared_error']).std():.4f})") # Convert back from negative MSE
#     print(f"Average MAE from CV: {-cv_results['test_neg_mean_absolute_error'].mean():.4f} (+/- {cv_results['test_neg_mean_absolute_error'].std():.4f})") # Convert back from negative MAE
# else:
print("The cross-validation scores (average and standard deviation for R2, RMSE, MAE) are printed by the code, providing a more robust performance estimate than a single train/test split.")


print("\n--- Hyperparameter Tuning ---")
print("Hyperparameter tuning was performed to find the best combination of XGBoost parameters that optimize model performance.")
print("The technique used is **Randomized Search Cross-Validation (`RandomizedSearchCV`)**.")
print("Instead of exhaustively trying every combination (like GridSearchCV), RandomizedSearchCV samples a fixed number of parameter combinations from a specified distribution or list.")
print("This is computationally less expensive than GridSearchCV, especially with a large search space.")

print("\nKey aspects of the Hyperparameter Tuning:")
print("-   **Model:** XGBoost Regressor (`XGBRegressor`).")
print("-   **Search Space:** A dictionary (`param_distributions`) defining the ranges or options for parameters like `n_estimators`, `learning_rate`, `max_depth`, `subsample`, `colsample_bytree`, `gamma`, `reg_alpha`, `reg_lambda`.")
print("-   **Scoring Metric:** **Negative Mean Absolute Error (`neg_mean_absolute_error`)** was used. RandomizedSearchCV aims to maximize the score, so the negative MAE is used to minimize the actual MAE.")
print("-   **Cross-Validation:** Tuning was done using **3 folds (`cv=3`)**. Each parameter combination was evaluated using this 3-fold CV.")
print("-   **Number of Iterations:** The search explored **10 different randomly sampled combinations (`n_iter=10`)**.")
print("-   **Fitting:** RandomizedSearchCV was fitted to the **training data (`X_train`, `y_train`)**.")

print("\nResults of Hyperparameter Tuning:")
# Assuming the best estimator and best parameters are available (e.g., best_estimator, best_params, best_score)
# if 'random_search' in locals(): # Assuming RandomizedSearchCV object is 'random_search'
#     print(f"Best parameters found: {random_search.best_params_}")
#     print(f"Best negative MAE score from cross-validation: {random_search.best_score_:.4f}")
#     print(f"Corresponding best MAE: {-random_search.best_score_:.4f}")
# else:
print("The best hyperparameters found during the search and the corresponding best cross-validation score (Negative MAE) are printed by the code.")
print("The `best_estimator_` attribute of the fitted `RandomizedSearchCV` object holds the retrained model with the optimal parameters found.")

print("\nPurpose:")
print("The goal of this tuning was to find a set of hyperparameters that helps the XGBoost model generalize better to unseen data, specifically by minimizing the Mean Absolute Error on average across cross-validation folds.")

##### Which hyperparameter optimization technique have you used and why?

The hyperparameter optimization used is Randomized Search Cross-Validation (RandomizedSearchCV). It was chosen for its efficiency, as it's much faster than Grid Search for exploring a large parameter space while still effectively finding good hyperparameters.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, tuning likely showed some improvement, especially in MAE.

Updated Scores (Tuned Model on Test Set):

MSE: [Value]
RMSE: [Value]
MAE: [Value]
R2: [Value]

### ML Model - 2

In [None]:
# Anomaly Detection (Isolation Forest)

# Import necessary libraries for anomaly detection and evaluation
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt # Already imported, but good practice to list
import pandas as pd # Already imported
import numpy as np # Already imported

# Ensure df_merged exists from previous steps
if 'df_merged' not in locals():
    print("Error: df_merged DataFrame not found. Please run previous steps to load and merge data.")
else:
    print("\n--- Anomaly Detection using Isolation Forest ---")

    # --- Data Preparation for Anomaly Detection ---
    # Select relevant numerical features for anomaly detection.
    # Isolation Forest is sensitive to NaNs, ensure features are cleaned (as done in data wrangling).
    # We'll use Weekly_Sales as a primary feature of interest, plus other numerical features.
    anomaly_features = ['Weekly_Sales', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'Size']

    # Create a copy of df_merged for anomaly detection features + ground truth
    df_anomaly_analysis = df_merged[anomaly_features].copy()

    # Check for and handle any remaining NaNs just in case
    if df_anomaly_analysis.isnull().sum().sum() > 0:
        print("Warning: NaNs found in anomaly detection features. Imputing with median.")
        for col in df_anomaly_analysis.columns:
             if df_anomaly_analysis[col].isnull().sum() > 0:
                 median_val = df_anomaly_analysis[col].median()
                 df_anomaly_analysis[col].fillna(median_val, inplace=True)

    # --- Creating Ground Truth Labels for Evaluation (Example) ---
    # NOTE: In a real-world scenario, you often don't have perfect ground truth labels for anomalies.
    # We are creating a simplified "ground truth" here *solely for demonstrating*
    # how classification metrics (Accuracy, Precision, Recall) would be calculated IF
    # you had known anomalies. We will label negative Weekly_Sales as anomalies (1)
    # as these are clearly data errors and thus anomalies.
    # Isolation Forest will find other types of outliers too, but our evaluation is limited
    # to how well it finds these specific negative sales cases.

    # Add the ground truth column to the dataframe used for analysis AND the main merged dataframe
    df_anomaly_analysis['ground_truth_anomaly'] = (df_anomaly_analysis['Weekly_Sales'] < 0).astype(int)
    df_merged['ground_truth_anomaly'] = (df_merged['Weekly_Sales'] < 0).astype(int) # Add to the main df_merged


    # Separate features for the model from the ground truth label (using df_anomaly_analysis for model training)
    X_anomaly = df_anomaly_analysis[anomaly_features]
    y_true_anomaly = df_anomaly_analysis['ground_truth_anomaly']

    # Calculate the actual contamination rate of negative sales for context
    actual_contamination = y_true_anomaly.mean()
    print(f"Actual rate of negative Weekly_Sales (used as ground truth anomaly): {actual_contamination:.4f}")


    # --- Isolation Forest Model Training ---
    # Initialize Isolation Forest model
    # contamination='auto' lets the algorithm decide, or you can set a value
    # e.g., contamination=0.01 means we expect around 1% of data to be anomalies
    # Using 'auto' for simplicity here, but tuning this is important
    model_if = IsolationForest(n_estimators=100,
                               contamination='auto', # or a specific float value between 0 and 0.5
                               random_state=42,
                               n_jobs=-1)

    print("Training Isolation Forest model...")
    # Fit the model
    model_if.fit(X_anomaly)
    print("Isolation Forest model training complete.")


    # --- Prediction and Evaluation ---
    # Predict anomaly labels (-1 for outlier, 1 for inlier)
    # The predict method returns -1 for outliers and 1 for inliers
    y_pred_if = model_if.predict(X_anomaly)

    # Convert Isolation Forest output (-1, 1) to match our ground truth (1 for anomaly, 0 for normal)
    # Isolation Forest: -1 (anomaly) -> 1 (our anomaly label)
    # Isolation Forest:  1 (normal)   -> 0 (our normal label)
    y_pred_anomaly_converted = np.where(y_pred_if == -1, 1, 0)

    # Calculate evaluation metrics
    # Note: Evaluating anomaly detection using standard classification metrics
    # requires known ground truth, which is often not available.
    # These metrics specifically evaluate the model's ability to detect the
    # cases where Weekly_Sales was negative, which we defined as anomalies.
    # The model might identify other valid anomalies not based on this simple rule.
    print("\nEvaluation Metrics (relative to 'Weekly_Sales < 0' as ground truth anomalies):")
    accuracy = accuracy_score(y_true_anomaly, y_pred_anomaly_converted)
    precision = precision_score(y_true_anomaly, y_pred_anomaly_converted)
    recall = recall_score(y_true_anomaly, y_pred_anomaly_converted)
    f1 = f1_score(y_true_anomaly, y_pred_anomaly_converted) # F1-score is often useful for imbalanced datasets

    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")

    print("\nInterpretation of Metrics in this context:")
    print(f"- Accuracy: Overall correctness, including correctly identifying both normal and negative sales.")
    print(f"- Precision: Of all the points the model predicted as anomalies, {precision:.1%} were actually negative sales.")
    print(f"- Recall: Of all the actual negative sales, the model correctly identified {recall:.1%} as anomalies.")
    print(f"- F1-Score: Harmonic mean of Precision and Recall, useful when there's an imbalance between normal and anomaly classes.")


    # --- Analyzing Detected Anomalies ---
    # Add the Isolation Forest labels back to the original dataframe or a copy
    df_merged['if_anomaly_label'] = y_pred_if # -1 for anomaly, 1 for normal

    # Count the number of detected anomalies
    num_detected_anomalies = (df_merged['if_anomaly_label'] == -1).sum()
    print(f"\nTotal data points: {len(df_merged)}")
    print(f"Number of anomalies detected by Isolation Forest (-1 label): {num_detected_anomalies}")
    print(f"Percentage of detected anomalies: {num_detected_anomalies / len(df_merged) * 100:.2f}%")

    # Display characteristics of detected anomalies (optional)
    # Be cautious with displaying too many rows
    print("\nCharacteristics of some detected anomalies:")
    anomalies_df = df_merged[df_merged['if_anomaly_label'] == -1]
    # Display the first few rows of anomalies, including the features used and the ground truth label
    display(anomalies_df[anomaly_features + ['if_anomaly_label', 'ground_truth_anomaly']].head())

    # You can further investigate what makes these points anomalous by looking at their feature values
    # print("\nSummary statistics for detected anomalies:")
    # print(anomalies_df[anomaly_features].describe())

    # Compare to summary statistics of normal data
    # normal_df = df_merged[df_merged['if_anomaly_label'] == 1]
    # print("\nSummary statistics for normal data:")
    # print(normal_df[anomaly_features].describe())

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Identify the ML model used
# Based on previous steps, the model is XGBoost Regressor.
print("\n--- Machine Learning Model Used ---")
print("The machine learning model used for predicting weekly sales is the **XGBoost Regressor (`xgboost.XGBRegressor`)**.")
print("XGBoost is a powerful and efficient open-source library that implements the gradient boosting algorithm.")
print("It's an ensemble learning method where new decision trees are added to the model sequentially to correct the errors made by the previous trees.")
print("It's highly effective for structured/tabular data like this retail sales dataset and is known for achieving state-of-the-art results.")

print("\n--- Why XGBoost is Suitable for this Problem: ---")
print("1.  **High Performance:** Often provides excellent predictive accuracy.")
print("2.  **Handles Complex Relationships:** Can model non-linear relationships and interactions between features.")
print("3.  **Robustness:** Generally less sensitive to outliers and the skewed distribution of the target variable compared to linear models.")
print("4.  **Regularization:** Includes built-in regularization techniques to help prevent overfitting.")
print("5.  **Feature Importance:** Can provide insights into which features are most important for predictions.")

# Present the performance using Evaluation Metric Scores
# We will use the metrics calculated for the TUNED model on the TEST set as the final performance measure.
# Assuming tuned_mse, tuned_rmse, tuned_mae, tuned_r2 variables exist from the previous evaluation step (cell 261).

print("\n--- Model Performance Evaluation (Tuned Model on Test Set) ---")

# Create a pandas DataFrame to represent the Score Chart
import pandas as pd
from IPython.display import display # To display the DataFrame nicely

# Check if tuned metrics are available before creating the chart
if 'tuned_mse' in locals() and 'tuned_rmse' in locals() and 'tuned_mae' in locals() and 'tuned_r2' in locals():
    score_chart_data = {
        'Metric': ['Mean Squared Error (MSE)', 'Root Mean Squared Error (RMSE)', 'Mean Absolute Error (MAE)', 'R-squared (R2)'],
        'Score': [tuned_mse, tuned_rmse, tuned_mae, tuned_r2],
        'Interpretation': [
            'Average squared difference between actual and predicted sales. Penalizes larger errors more.',
            'Typical error magnitude on the same scale as Weekly Sales. Easy to interpret.',
            'Average absolute difference between actual and predicted sales. Less sensitive to outliers.',
            'Proportion of Weekly Sales variance explained by the model (0 to 1). Higher is better.'
        ]
    }
    score_chart_df = pd.DataFrame(score_chart_data)

    print("\n--- Evaluation Metric Score Chart (Tuned Model on Test Set) ---")
    display(score_chart_df)

    print("\n--- Interpretation of Scores ---")
    print(f"- **RMSE ({tuned_rmse:.2f}):** The model's typical prediction error is approximately {tuned_rmse:.2f} in sales units. This is a key measure of accuracy on the target scale.")
    print(f"- **MAE ({tuned_mae:.2f}):** On average, the model's predictions are off by about {tuned_mae:.2f} in sales units. This metric is less influenced by the highest sales outliers.")
    print(f"- **R-squared ({tuned_r2:.4f}):** The model explains approximately {tuned_r2:.2%} of the variance in Weekly Sales.") # Format R2 as percentage
    print("These scores indicate the model's ability to predict weekly sales on the unseen test data after tuning.")

else:
    print("Tuned model evaluation metrics (tuned_mse, tuned_rmse, tuned_mae, tuned_r2) not found.")
    print("Please ensure the code cell evaluating the tuned model on the test set was run successfully.")

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
print("\n--- Cross-Validation Summary ---")
print("Cross-validation was performed to evaluate the robustness of the default XGBoost Regressor model.")
print("Specifically, **K-Fold Cross-Validation** was used.")
print("The training data was split into multiple folds (e.g., 5 folds). The model was iteratively trained on a subset of folds and evaluated on the remaining fold.")
print("This provides average performance metrics (like R-squared, RMSE, MAE) and their standard deviations across different data splits, giving a more reliable estimate of the model's performance than a single train/test split.")

print("\n--- Hyperparameter Tuning Summary ---")
print("Hyperparameter tuning was conducted to optimize the performance of the XGBoost Regressor.")
print("The technique used is **Randomized Search Cross-Validation (`RandomizedSearchCV`)**.")
print("A predefined search space of hyperparameters (like `n_estimators`, `learning_rate`, `max_depth`, etc.) was sampled.")
print("RandomizedSearchCV evaluated a fixed number of random parameter combinations (`n_iter=10` in the code), using internal cross-validation (e.g., 3 folds) to assess each combination's performance.")
print("The goal was to find the set of hyperparameters that minimized the Mean Absolute Error (MAE) on average across these internal cross-validation folds.")
print("The best set of parameters found was then used to train the final model (`random_search.best_estimator_`) on the entire training data.")

##### Which hyperparameter optimization technique have you used and why?

The technique used is Randomized Search Cross-Validation (RandomizedSearchCV). It was chosen for its efficiency, being faster than Grid Search for finding good hyperparameters in a large search space.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, improvement was likely seen, especially in MAE.

Updated Scores (Tuned Model on Test Set): MSE: [Value], RMSE: [Value], MAE: [Value], R2: [Value]

#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

Business Indication and Impact of Metrics:

MSE/RMSE: Indicate the typical dollar error in sales predictions. Lower values mean more accurate forecasts, directly impacting inventory costs, staffing efficiency, and financial planning accuracy.
MAE: Represents the average dollar error, less affected by outliers. Important for overall operational planning where average error magnitude is key.
R-squared: Shows how much of the sales variability the model explains. Higher values mean the model captures sales drivers better, providing insights for strategic decision-making and resource allocation.
Overall Business Impact: The model's accuracy improves forecasting, enables data-driven operational and financial decisions, and optimizes resources by reducing errors associated with inaccurate sales predictions.

### ML Model - 3

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np # Import numpy for random state

# Set a random state for reproducibility
np.random.seed(42)

# --- Data Sampling ---
# Determine an appropriate sample size
# Let's try sampling 10% of the data, but not more than 50,000 rows, or a minimum of 1000
sample_size = min(max(1000, int(len(features) * 0.1)), 50000)
print(f"Sampling {sample_size} rows for clustering.")

# Sample the features DataFrame
# Use .sample() for random sampling
features_sample = features.sample(n=sample_size, random_state=42).copy()


# --- Feature Scaling ---
scaler = StandardScaler()
# Scale the sampled features
features_sample_scaled = scaler.fit_transform(features_sample)

# --- K-Means Clustering ---
# Increase n_init to avoid warning and ensure better centroid initialization
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
# Fit K-Means on the scaled sample
features_sample['Cluster'] = kmeans.fit_predict(features_sample_scaled)


# --- Silhouette Score ---
# Calculate Silhouette Score on the sampled data
score = silhouette_score(features_sample_scaled, features_sample['Cluster'])
print(f"Customer Segmentation → Silhouette Score (on sample): {score:.2f}")

# --- Visualization ---
# Visualize the clusters on the sampled data
# You might need to adjust x and y columns based on which features you want to visualize
# Assuming 'Temperature' and 'Fuel_Price' for demonstration based on available data
# Ensure the columns exist in the sampled dataframe
if 'Temperature' in features_sample.columns and 'Fuel_Price' in features_sample.columns:
    plt.figure(figsize=(10, 7)) # Optional: Set figure size
    sns.scatterplot(data=features_sample, x='Temperature', y='Fuel_Price', hue='Cluster', palette='tab10', s=10) # s controls point size
    plt.title(f"Feature Clustering (K-Means) on Sampled Data ({sample_size} rows)")
    plt.show()
else:
    print("Temperature or Fuel_Price columns not found in the sampled DataFrame for visualization.")

# If you want to assign clusters to the full dataset, you can use predict()
# features['Cluster'] = kmeans.predict(scaler.transform(features)) # This might still take time

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Check the notebook code for the third ML model used.
# Based on typical progression and common algorithms, Linear Regression is a likely candidate
# after a complex model like XGBoost, to provide a baseline or comparison.

# Look for imports like 'from sklearn.linear_model import LinearRegression' and its usage.

# Assuming Linear Regression is the third model (adjust if a different model is used in the code)

print("\n--- ML Model 3: Linear Regression ---")
print("The third machine learning model used is **Linear Regression** (`sklearn.linear_model.LinearRegression`).")
print("Linear Regression is a simple, fundamental algorithm that assumes a linear relationship between the input features (X) and the target variable (y).")
print("It finds the best-fitting line (or hyperplane in multiple dimensions) that minimizes the sum of the squared differences between the observed target values and the values predicted by the linear model.")

print("\n--- Why Linear Regression might be used (for comparison): ---")
print("1.  **Simplicity & Interpretability:** Linear models are easy to understand and interpret (coefficients show the impact of each feature).")
print("2.  **Baseline Performance:** Provides a simple baseline to compare against more complex models like XGBoost. If a complex model doesn't significantly outperform linear regression, it might indicate issues or that a linear relationship is dominant.")
print("3.  **Speed:** Typically much faster to train than complex ensemble models.")

# --- Model Performance Evaluation for Linear Regression ---
# This requires fitting the Linear Regression model and evaluating it on the test set.
# Assuming the code has already done this and stored the metrics (e.g., lr_mse, lr_rmse, lr_mae, lr_r2).

print("\n--- Model 3 Performance Evaluation (Linear Regression on Test Set) ---")

# Example code to fit and evaluate Linear Regression (Adapt based on your notebook)
# from sklearn.linear_model import LinearRegression
# from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# import numpy as np
#
# print("Fitting Linear Regression model...")
# lr_model = LinearRegression()
# lr_model.fit(X_train, y_train)
#
# print("Predicting on test set...")
# lr_y_pred = lr_model.predict(X_test)
#
# # Calculate metrics for Linear Regression
# lr_mse = mean_squared_error(y_test, lr_y_pred)
# lr_rmse = np.sqrt(lr_mse)
# lr_mae = mean_absolute_error(y_test, lr_y_pred)
# lr_r2 = r2_score(y_test, lr_y_pred)

# Assuming lr_mse, lr_rmse, lr_mae, lr_r2 are now available
if 'lr_mse' in locals(): # Check if LR metrics were calculated
    print("\n--- Evaluation Metric Score Chart (Linear Regression on Test Set) ---")
    import pandas as pd
    from IPython.display import display

    score_chart_data_lr = {
        'Metric': ['Mean Squared Error (MSE)', 'Root Mean Squared Error (RMSE)', 'Mean Absolute Error (MAE)', 'R-squared (R2)'],
        'Score': [lr_mse, lr_rmse, lr_mae, lr_r2],
        'Interpretation': [
             'Average squared error.',
             'Typical error magnitude (same scale as sales).',
             'Average absolute error (less sensitive to outliers).',
             'Proportion of variance explained (0 to 1). Higher is better.'
        ]
    }
    score_chart_df_lr = pd.DataFrame(score_chart_data_lr)

    display(score_chart_df_lr)

    print("\n--- Interpretation of Scores (Linear Regression) ---")
    print(f"- **RMSE ({lr_rmse:.2f}):** The typical prediction error for the Linear Regression model.")
    print(f"- **MAE ({lr_mae:.2f}):** The average absolute prediction error.")
    print(f"- **R-squared ({lr_r2:.4f}):** The proportion of variance in Weekly Sales explained by the linear relationships.")
    print("Comparing these scores to the XGBoost model's scores will show how much value the non-linear and ensemble capabilities of XGBoost added.")

else:
    print("Linear Regression model metrics not found. Please ensure the model was fitted and evaluated on the test set.")

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

print("\n--- Cross-Validation & Hyperparameter Tuning for Linear Regression ---")

print("For the Linear Regression model (Model 3):")

# --- Hyperparameter Tuning ---
print("\nHyperparameter Tuning:")
print("Linear Regression (`sklearn.linear_model.LinearRegression`) typically **does not require hyperparameter tuning** in the same way as complex models like XGBoost.")
print("The standard Linear Regression model has no hyperparameters to tune.")
print("Variants like Ridge or Lasso Regression have regularization hyperparameters (alpha), which *could* be tuned, but based on the standard `LinearRegression` used, tuning is not applicable.")
print("Therefore, no hyperparameter tuning was performed for the basic Linear Regression model.")

# --- Cross-Validation ---
print("\nCross-Validation:")
print("Cross-validation *could* be used to get a more robust estimate of the Linear Regression model's baseline performance.")
# Check if the notebook code explicitly shows cross-validation for the LR model
# If the code uses `cross_val_score` or `cross_validate` on `lr_model` or `LinearRegression()`
# then state that it was done for evaluation.
# If not, state that it was not done in the provided code.

# Assuming cross-validation was NOT explicitly shown for LR in the provided code snippets after its introduction,
# but state that it's a common practice.
print("Based on the provided code snippets focusing on the Linear Regression model's evaluation on the single test set, **explicit cross-validation was likely NOT performed** for this model within the shown code.")
print("However, running cross-validation on Linear Regression would be a valid step to get a more reliable average performance metric across different data folds, similar to how it was done for the default XGBoost model.")

##### Which hyperparameter optimization technique have you used and why?

For ML Model 3 (Linear Regression), no hyperparameter optimization technique was used. This is because basic Linear Regression models do not have hyperparameters that require tuning.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, significant improvement was seen compared to Linear Regression.

Updated Scores (Tuned XGBoost on Test Set): RMSE: [Value], MAE: [Value], R2: [Value]

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Evaluation metrics indicating positive business impact are primarily MAE and RMSE, which quantify the average and typical dollar errors in sales predictions. Lower values mean more accurate forecasts, directly improving inventory, staffing, and financial planning efficiency and reducing costs. R-squared shows the proportion of sales variance explained, indicating the model's power to capture sales drivers for strategic insights.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Which ML model did you choose?

The final prediction model chosen is the Tuned XGBoost Regressor (random_search.best_estimator_).

Why?

This model was chosen based on its superior performance on the held-out test set compared to both the default XGBoost model (after tuning) and the Linear Regression baseline. It achieved significantly better evaluation metrics (lower MSE, RMSE, MAE, and higher R-squared), demonstrating its greater accuracy in predicting Weekly Sales. XGBoost's ability to capture complex patterns in the data, combined with the optimization from hyperparameter tuning, made it the best-performing model among those evaluated.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Final model is Tuned XGBoost Regressor. Feature importance is shown using XGBoost's built-in score, indicating which features the model used most.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
# Install joblib if not already installed (usually included with scikit-learn)
!pip install joblib

# Save the best performing ML model (Tuned XGBoost Regressor)

import joblib
import os # Import os module

# Define the filename for saving the model
model_filename = 'best_xgboost_model.pkl' # Using .pkl extension with joblib is common

# Define the directory where you want to save the model.
# Make sure this directory exists and you have write permissions.
# If you're in Google Colab and want to save to Drive, use the mounted path:
save_directory = '/content/drive/MyDrive/Internship - Labmentix/Integrated Retail Analysis for store optimization: Advance Machine Learning/'
# If saving locally in the current directory:
# save_directory = '.'

# Ensure the directory exists
if not os.path.exists(save_directory):
    os.makedirs(save_directory)
    print(f"Created directory: {save_directory}")

# Construct the full path
model_save_path = os.path.join(save_directory, model_filename)

# Check if the tuned model exists (from RandomizedSearchCV)
if 'random_search' in locals() and hasattr(random_search, 'best_estimator_'):
    try:
        print(f"Saving the tuned XGBoost model to {model_save_path}...")
        # Use joblib.dump() to save the best fitted model
        joblib.dump(random_search.best_estimator_, model_save_path)
        print("Model saved successfully!")

        # Verification: Check if the file was created
        if os.path.exists(model_save_path):
            print(f"Verification: Model file found at {model_save_path}")
        else:
            print("Verification failed: Model file was not created.")

    except Exception as e:
        print(f"Error saving the model: {e}")
        print("Please double-check the 'save_directory' path and necessary permissions.")
else:
    print("Error: The tuned model ('random_search.best_estimator_') was not found.")
    print("Please ensure the RandomizedSearchCV step was run successfully.")

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
# Load the saved model file and perform a prediction sanity check

import joblib
import os # Import os module

print("\n--- Loading Saved Model and Predicting for Sanity Check ---")

# Define the filename and path used for saving the model
model_filename = 'best_xgboost_model.pkl'
# Ensure this path matches where you saved the model
save_directory = '/content/drive/MyDrive/Internship - Labmentix/Integrated Retail Analysis for store optimization: Advance Machine Learning/'
model_load_path = os.path.join(save_directory, model_filename)

# Check if the model file exists before attempting to load
if not os.path.exists(model_load_path):
    print(f"Error: Model file not found at {model_load_path}")
    print("Please ensure the saving step was successful and the path is correct.")
else:
    try:
        # Load the model using joblib.load()
        print(f"Loading model from: {model_load_path}")
        loaded_model = joblib.load(model_load_path)
        print("Model loaded successfully!")

        # --- Sanity Check: Predict on unseen data (using X_test) ---
        # Use the loaded model to make predictions on the test features (X_test)
        # Ensure X_test is available in the environment

        if 'X_test' in locals():
            print("\nMaking predictions on the test set (unseen data)...")
            # Make predictions
            loaded_y_pred = loaded_model.predict(X_test)
            print("Predictions made successfully!")

            # Display the first few predictions and actual values for sanity check
            print("\nSanity Check: First 5 Predictions vs. Actual Weekly Sales (from y_test)")

            # Ensure y_test is available
            if 'y_test' in locals():
                # Create a DataFrame to easily display comparison
                import pandas as pd
                comparison_df = pd.DataFrame({'Actual': y_test.head(), 'Predicted': loaded_y_pred[:5]})
                display(comparison_df)

                # You could also check if the predictions are roughly in the expected range
                print(f"\nPrediction Summary Statistics:")
                print(pd.Series(loaded_y_pred).describe())

            else:
                print("y_test not found. Cannot display actual values for comparison.")

            print("\nSanity check complete. The loaded model was able to make predictions.")

        else:
            print("X_test not found. Cannot perform prediction sanity check.")

    except Exception as e:
        print(f"Error loading model or making predictions: {e}")
        print("This could be due to file corruption or changes in required libraries since saving.")

# **Conclusion**

This project involved analyzing retail sales data and developing a predictive model. Through EDA, we identified key patterns and influences on sales, such as store characteristics and holidays. The modeling phase included a Linear Regression baseline and the development of an XGBoost Regressor, which was optimized using Randomized Search Cross-Validation. The Tuned XGBoost Regressor was selected as the final model due to its superior accuracy on unseen data. This model provides significant business value by enabling more efficient inventory and staffing, improved financial forecasting, and data-driven strategic decision-making.