# **Introduction to Data Science**
# **DS-2001**
# Project


Solution designed by:

Muddassir Asghar - i23-2577

M. Abdullah Ali - i23-2523

Introduction:
Imtiaz Mall, a renowned department store chain, is experiencing declining sales and a significant
number of non-recurring customers in its electronics section. To address this challenge, you, the
newly appointed Senior Data Scientist, have been tasked with conducting a comprehensive
analysis of the electronics section data and developing data-driven strategies for customer
retention and sales growth. This project focuses on the initial steps of this analysis, specifically
exploring the data through various techniques.

Before we begin, the necessary libraries are imported.

## **Program prerequisites**

In [None]:
# Import the libraries used throughout the notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.cluster import KMeans

# statsmodels is used for the VIF calculation and the regression diagnostics
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

## **Module 1: Data Acquisition and Preprocessing:**

### 1. Data Loading:

The data specific to the project requirements is loaded into the program.

In [None]:
df = pd.read_json('electronics.json')
df.head(10)

In [None]:
df.columns

In [None]:
df.info()

In [None]:
int_columns = [
    "Age",
    "Purchase_Amount",
    "Average_Spending_Per_Purchase",
    "Purchase_Frequency_Per_Month",
    "Brand_Affinity_Score",
    "Month",
    "Year",
]

for col in int_columns:
    df[col] = pd.to_numeric(df[col], errors="coerce").astype("Int64")

df = df.replace('', np.nan)
df = df.replace('Hidden', np.nan)

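# Note: astype(bool) maps NaN and every non-empty string to True, so this assumes the
# column holds genuine True/False values with no missing entries
df["Will_Purchase_Next_Month"] = df["Will_Purchase_Next_Month"].astype(bool)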

def convert_to_days(date):
    if isinstance(date, str) and date != "Hidden":
        try:
            return datetime.strptime(date, "%Y-%m-%d").toordinal()
        except ValueError:
            return None 
    return None 

# Convert each purchase date to an ordinal day number; missing or invalid dates become <NA>
df["Days"] = df["Purchase_Date"].apply(convert_to_days).astype("Int64")
int_columns.append('Days')


print(df.info())

In [None]:
df.describe()

In [None]:
df.describe(include="object")

### 2. Data Cleaning:

o Identify and handle missing values using appropriate techniques like
mean/median imputation or dropping rows/columns with excessive missingness.
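
As a rough illustration of choosing between the two imputation strategies, the sketch below fills a column with its median when the column is strongly skewed and with its mean otherwise. The toy values and the skewness cut-off of 1.0 are assumptions made for this example; they are not taken from `electronics.json`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the project dataset)
toy_missing = pd.DataFrame({
    "Age": [25, 31, np.nan, 40, 36, 29],
    "Purchase_Amount": [120, 90, 110, np.nan, 100, 2500],
})

SKEW_THRESHOLD = 1.0  # assumed cut-off for "strongly skewed"

for col in toy_missing.columns:
    skew = toy_missing[col].skew()
    # The median is less sensitive to extreme values, so prefer it for skewed columns
    if abs(skew) > SKEW_THRESHOLD:
        fill_value = toy_missing[col].median()
    else:
        fill_value = toy_missing[col].mean()
    toy_missing[col] = toy_missing[col].fillna(fill_value)

print(toy_missing)
```

In this toy frame, `Age` is roughly symmetric and receives its mean, while the single very large `Purchase_Amount` pushes that column toward median imputation.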

In [None]:
df.isnull().sum()

In [None]:
df.isnull().any(axis=1).sum()

In [None]:
df.head(10)

In [None]:
def int_cols_plot(df, numerical_columns):
    for column in numerical_columns:
        plt.figure(figsize=(16, 6))
        
        # Plot 1: Histogram with KDE (left panel of a 1x2 grid)
        plt.subplot(1, 2, 1)
        sns.histplot(df[column].dropna(), kde=True, bins=30, color='blue', alpha=0.6)
        plt.axvline(df[column].mean(), color='red', linestyle='dashed', linewidth=1, label=f"Mean: {df[column].mean():.2f}")
        plt.axvline(df[column].median(), color='green', linestyle='dashed', linewidth=1, label=f"Median: {df[column].median():.2f}")
        plt.title(f"Distribution of {column}")
        plt.legend()
        
        # Plot 2: Boxplot (right panel)
        plt.subplot(1, 2, 2)
        sns.boxplot(x=df[column], color='orange')
        plt.title(f"Boxplot of {column}")
        
        plt.tight_layout()
        plt.show()

In [None]:
int_cols_plot(df, int_columns)

In [None]:
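# Impute missing values in the numeric columns: the mean for some columns and the
# median (which is more robust to extreme values) for the spending columns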
mean_impute_columns = [
    "Age", "Purchase_Frequency_Per_Month", "Brand_Affinity_Score", "Days"
]

median_impute_columns = [
    "Purchase_Amount", "Average_Spending_Per_Purchase"
]

log_transform_columns = [
    "Average_Spending_Per_Purchase"
]

for col in mean_impute_columns:
    mean = int(df[col].mean())
    df[col] = df[col].fillna(mean)

for col in median_impute_columns:
    median = int(df[col].median())
    df[col] = df[col].fillna(median)

In [None]:
# Handling missing values in Categorical columns by replacing NaN values with mode
categorical_columns = ['Gender', 'Income_Level', 'Brand', 'Product_Category_Preferences']

for col in categorical_columns:
    mode_value = df[col].mode()[0]
    df[col] = df[col].apply(lambda x: mode_value if pd.isna(x) or x == "Hidden" else x)

# Fill missing values in 'Purchase_Date' with the most frequent valid date.
# The mode is computed once instead of once per row.
valid_dates = df["Purchase_Date"].dropna()
mode_date = valid_dates[valid_dates != "Hidden"].mode()[0]
df["Purchase_Date"] = df["Purchase_Date"].replace("Hidden", np.nan).fillna(mode_date)

In [None]:
columns_to_drop = ['Customer_ID', 'Address', 'Transaction_ID', 'Product_ID']
df.drop(columns=columns_to_drop, inplace=True)

# Purchase_Date_Column = df.drop(['Purchase_Date'], axis=1)
# df = df.drop(['Purchase_Date'], axis=1)

In [None]:
# Mapping months to seasons
month_to_season = {
    1: 'Winter', 2: 'Winter', 3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall', 11: 'Fall', 12: 'Winter'
}

# Mode of the 'Season' column, used as a fallback
mode_season = df['Season'].mode()[0]

# Fill missing 'Season' values from 'Month' where possible, otherwise with the mode.
# pd.notnull is used because a missing value may be NaN, None, or pd.NA.
df['Season'] = df.apply(
    lambda row: row['Season'] if pd.notnull(row['Season']) and row['Season'] != 'Hidden' else (
        month_to_season.get(row['Month'], mode_season) if pd.notnull(row['Month']) else mode_season
    ),
    axis=1
)

# Handling NaN or 'Hidden' in 'Month' column using the season-to-month mapping
season_to_month = {
    'Winter': 1,  # Default to January for Winter
    'Spring': 4,  # Default to April for Spring
    'Summer': 7,  # Default to July for Summer
    'Fall': 10    # Default to October for Fall
}

# Fill missing 'Month' values from 'Season'. pd.notnull is used because missing entries
# in the nullable Int64 column are pd.NA, which an `is not None` check does not catch.
df['Month'] = df.apply(
    lambda row: row['Month'] if pd.notnull(row['Month']) else season_to_month.get(row['Season'], None),
    axis=1
)

# Convert 'Month' to integer, handling missing values gracefully
df['Month'] = pd.to_numeric(df['Month'], errors='coerce').astype('Int64')

# Fill missing 'Year' values with the mode ('Hidden' was already coerced to <NA> by pd.to_numeric)
year_mode = df['Year'].mode()[0]
df['Year'] = df['Year'].fillna(year_mode)

# Check if any NaN values remain in 'Year' and 'Month' columns
print(df[['Year', 'Month']].isnull().sum())


In [None]:
df['Month'].isnull().sum()

In [None]:
df.head(10)

In [None]:
df.describe()

In [None]:
df.info()

o Analyze outliers and determine whether to retain or remove them based on their
impact on the analysis.
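
To judge whether outliers should be retained or removed, it can help to count how many rows fall outside the usual 1.5 × IQR fences before dropping anything. The sketch below does this for a single toy column; the values are invented for illustration.

```python
import pandas as pd

# Toy purchase amounts (hypothetical); 95 is an obvious extreme value
amounts = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences and report how much data they represent
is_outlier = (amounts < lower) | (amounts > upper)
outlier_share = is_outlier.mean()

print(f"Fences: [{lower}, {upper}]")
print(f"Outliers: {amounts[is_outlier].tolist()} ({outlier_share:.1%} of rows)")
```

A small share of clearly implausible values is usually safe to drop, whereas a large share suggests the "outliers" are part of the real distribution and may be better kept or capped.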

In [None]:
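def remove_outliers(df, columns):
    """Drop rows that fall outside the 1.5 * IQR fences of each numeric column."""
    for column in columns:
        # Skip non-numeric columns (strings and booleans)
        if df[column].dtype in ['object', 'bool']:
            continue

        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        # Rows are filtered column by column, so later fences use the already-reduced data
        df = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

    return df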

c = df.columns
df = remove_outliers(df, c[:-1])
df

o Address inconsistencies in data format and encoding.

In [None]:
categorical_columns = ['Gender', 'Income_Level', 'Brand', 'Product_Category_Preferences', 'Season']

for col in categorical_columns:
    print(f"{col}: {df[col].unique()}")

df_encoded = pd.get_dummies(df, columns=categorical_columns, drop_first=False)



# Columns to drop based on the first unique values
columns_to_drop = {
    'Gender': 'Other',
    'Income_Level': 'Low',
    'Brand': 'Brand_A',
    'Product_Category_Preferences': 'Low',
    'Season': 'Winter'
}

# Drop the first dummy column for each categorical variable
for col, first_value in columns_to_drop.items():
    col_to_drop = f"{col}_{first_value}"  # Construct the column name to drop
    if col_to_drop in df_encoded.columns:
        df_encoded = df_encoded.drop(columns=[col_to_drop])

# Convert boolean columns (the dummies and the boolean target) to 0/1 integers
bool_cols = df_encoded.select_dtypes(include='bool').columns
df_encoded[bool_cols] = df_encoded[bool_cols].astype('int64')

# Show the columns after dropping
print(df_encoded.columns)

In [None]:
# # Encoding Non-Numerical data by mapping them via a dictionary
# genders = df['Gender'].unique()
# gender_map = {gender: idx for idx, gender in enumerate(genders)}
# df['Gender'] = df['Gender'].map(gender_map)

# income = df['Income_Level'].unique()
# income_map = {inc: idx for idx, inc in enumerate(income)}
# df['Income_Level'] = df['Income_Level'].map(income_map)

# # prod_cat = df['Product_Category'].unique()
# # prod_cat_map = {prod: idx for idx, prod in enumerate(prod_cat)}
# # df['Product_Category'] = df['Product_Category'].map(prod_cat_map)

# brands = df['Brand'].unique()
# brand_map = {brand: idx for idx, brand in enumerate(brands)}
# df['Brand'] = df['Brand'].map(brand_map)

# prod_cat_pref = df['Product_Category_Preferences'].unique()
# prod_cat_pref_map = {prod_cat_pref: idx for idx, prod_cat_pref in enumerate(prod_cat_pref)}
# df['Product_Category_Preferences'] = df['Product_Category_Preferences'].map(prod_cat_pref_map)

# seasons = df['Season'].unique()
# season_map = {season: idx for idx, season in enumerate(seasons)}
# df['Season'] = df['Season'].map(season_map)

# df_encoded = df.copy()

# print(df[['Gender', 'Income_Level', 'Brand', 'Product_Category_Preferences', 'Season']])

In [None]:
df_encoded.info()

In [None]:
numeric_columns = df.select_dtypes(include=['number']).columns
for column in numeric_columns:
    plt.figure(figsize=(8, 6))
    sns.boxplot(x=df[column])
    plt.title(f"Box and Whiskers Plot for {column}", fontsize=16)
    plt.tight_layout()
    plt.show()



In [None]:
df.head()

### 3. Data Transformation:

o Create new features that provide deeper insights into customer behavior, such
as:

▪ Average spending per purchase

In [None]:
df_encoded

▪ Purchase frequency per month

▪ Brand affinity score (based on product brand preferences)

▪ Product category preferences (e.g., TVs, smartphones, laptops)
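
The dataset used here already ships these features as columns, so the following is only a sketch of how they could be derived from a raw transaction log. The toy transactions are invented, and two definitions are assumptions made for this example: purchase frequency is counted per month in which the customer was active, and the brand affinity score is taken as the share of a customer's purchases that went to their most-purchased brand.

```python
import pandas as pd

# Hypothetical transaction log (not from electronics.json)
tx = pd.DataFrame({
    "Customer_ID": ["C1", "C1", "C1", "C2", "C2"],
    "Purchase_Date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-11", "2024-01-09", "2024-03-02"]
    ),
    "Brand": ["Brand_A", "Brand_A", "Brand_B", "Brand_C", "Brand_C"],
    "Purchase_Amount": [100, 300, 200, 50, 150],
})

def top_brand_share(brands):
    """Share of purchases that went to the most frequently bought brand (assumed definition)."""
    return brands.value_counts(normalize=True).iloc[0]

features = tx.groupby("Customer_ID").agg(
    Average_Spending_Per_Purchase=("Purchase_Amount", "mean"),
    Purchases=("Purchase_Amount", "size"),
    Active_Months=("Purchase_Date", lambda d: d.dt.to_period("M").nunique()),
    Brand_Affinity_Score=("Brand", top_brand_share),
)
features["Purchase_Frequency_Per_Month"] = features["Purchases"] / features["Active_Months"]

print(features)
```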

o Standardize or normalize numeric features to ensure they contribute equally to
the given algorithms.
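
Standardization replaces each value with its z-score, z = (x − mean) / std, where `StandardScaler` uses the population standard deviation (`ddof=0`). The sketch below reproduces that arithmetic with NumPy on invented numbers. It also computes the statistics on a training portion only and reuses them for the test portion, which is one common way to avoid leaking test information; this differs from the notebook, which fits the scaler on the full dataset.

```python
import numpy as np

# Hypothetical feature values split into train and test portions
train = np.array([10.0, 20.0, 30.0, 40.0])
test = np.array([25.0, 50.0])

mu = train.mean()
sigma = train.std()  # np.std defaults to ddof=0 (population standard deviation)

z_train = (train - mu) / sigma
z_test = (test - mu) / sigma  # reuse the training statistics

print("train z-scores:", z_train.round(3))
print("test z-scores: ", z_test.round(3))
```

After the transformation the training values have mean 0 and standard deviation 1, so features measured on very different scales contribute comparably to distance-based methods such as K-Means.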

In [None]:
# Standardize every int64/float64 column of df_encoded. Besides the original numeric
# features, this can also include the 0/1 dummy columns and the 0/1 target.
columns_to_standardize = df_encoded.select_dtypes(include=['int64', 'float64']).columns
scaler = StandardScaler()

Scaled_df = df_encoded.copy()
Scaled_df[columns_to_standardize] = scaler.fit_transform(Scaled_df[columns_to_standardize])

KMeans_df = Scaled_df.copy()
MLRM_df = Scaled_df.copy()

# MLRM_df = df_encoded.copy()

Scaled_df

In [None]:
Scaled_df.describe()

In [None]:
numeric_columns = Scaled_df.select_dtypes(include=['number']).columns
int_cols_plot(Scaled_df, numeric_columns)


## **Module 2: Exploratory Data Analysis (EDA):**

### 1. Univariate Analysis:

o Analyze the distribution of key features like customer age, purchase amount,
and purchase frequency using histograms, boxplots, and descriptive statistics.

o Identify potential skewness or outliers in the data.
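
A minimal sketch of the descriptive side of this step is shown below. It builds a small summary table of mean, median, and skewness per column; a mean well above the median together with a large positive skew points to a right-skewed distribution. The data is synthetic (a normal `Age` and a log-normal `Purchase_Amount`), generated only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: Age is symmetric, Purchase_Amount is right-skewed by construction
rng = np.random.default_rng(0)
eda_toy = pd.DataFrame({
    "Age": rng.normal(35, 8, 500),
    "Purchase_Amount": rng.lognormal(mean=5, sigma=1, size=500),
})

summary = pd.DataFrame({
    "mean": eda_toy.mean(),
    "median": eda_toy.median(),
    "skew": eda_toy.skew(),
})
summary["mean_minus_median"] = summary["mean"] - summary["median"]

print(summary.round(2))
```

On the project data, the same table can be built from `df` and read alongside the histograms and boxplots produced by `int_cols_plot`.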

### 2. Bivariate Analysis:

o Utilize scatterplots and heatmaps to explore relationships between different
features, such as purchase amount vs. income level, brand affinity vs. product
category, and purchase frequency vs. age.

o Investigate the presence of correlations and identify any impactful
relationships.
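
The sketch below illustrates two simple bivariate summaries on invented data: a Pearson correlation matrix for numeric pairs, and a group mean for a numeric feature against a categorical one. The column names mirror the dataset, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical customers (not from electronics.json)
bi_toy = pd.DataFrame({
    "Age": [22, 35, 47, 51, 29, 63],
    "Purchase_Frequency_Per_Month": [6, 5, 3, 3, 5, 1],
    "Purchase_Amount": [120, 340, 560, 610, 200, 720],
    "Income_Level": ["Low", "Medium", "High", "High", "Low", "High"],
})

# Numeric vs. numeric: Pearson correlation matrix
corr = bi_toy.select_dtypes("number").corr()

# Numeric vs. categorical: mean purchase amount per income level
amount_by_income = bi_toy.groupby("Income_Level")["Purchase_Amount"].mean()

print(corr.round(2))
print(amount_by_income)
```

The `corr` table can be passed to `sns.heatmap(corr, annot=True)`, and any pair of columns to `sns.scatterplot`, to obtain the plots this step asks for.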

### 3. Temporal Analysis:

o Analyze trends in customer behavior over time, including changes in
purchase frequency, average spending, and product preferences.

o Identify seasonal variations or any significant shifts in customer behavior
patterns.
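
One way to approach this is to aggregate purchases by `Year` and `Month` and by `Season`, then look at the change from one period to the next. The sketch below does so on a handful of invented rows; the column names follow the dataset, the values do not.

```python
import pandas as pd

# Hypothetical purchases (not from electronics.json)
sales = pd.DataFrame({
    "Year": [2023, 2023, 2023, 2023, 2023, 2023],
    "Month": [1, 1, 4, 7, 7, 10],
    "Season": ["Winter", "Winter", "Spring", "Summer", "Summer", "Fall"],
    "Purchase_Amount": [100, 200, 150, 400, 300, 250],
})

# Number of purchases and average spending per calendar month
monthly = sales.groupby(["Year", "Month"])["Purchase_Amount"].agg(["count", "mean"])
monthly["mean_change"] = monthly["mean"].diff()  # shift relative to the previous period

# Average spending per season
by_season = sales.groupby("Season")["Purchase_Amount"].mean()

print(monthly)
print("Season with highest average spending:", by_season.idxmax())
```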

## **Module 3: Regression and Decision Tree Analysis:**

### **A. Linear Regression Analysis:**

### 1. Problem Definition:

• Predict the average spending per purchase based on customer demographics and
purchase history.

In [None]:
MLRM_df

In [None]:
df_encoded

In [None]:
# Assuming df is your DataFrame and X_columns is the list of column names
def calculate_vif(df, X_columns):
    """
    Calculate the Variance Inflation Factor (VIF) for a list of predictor variables.
    
    Parameters:
        df (pd.DataFrame): The DataFrame containing the variables.
        X_columns (list): The list of column names to calculate VIF for.
    
    Returns:
        pd.DataFrame: A DataFrame with variables and their corresponding VIF scores.
    """
    # Creating a new DataFrame with only the selected columns
    X = df[X_columns]
    
    # Adding a constant column to the predictors to account for the intercept
    X = sm.add_constant(X)
    
    # Calculating VIF for each feature
    vif_data = pd.DataFrame()
    vif_data["Variable"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    
    # Dropping the constant column from the result
    return vif_data[vif_data["Variable"] != "const"]

In [None]:
# Example usage
import statsmodels.api as sm

vif_result = calculate_vif(MLRM_df, ['Age', 'Gender_Female', 'Gender_Male',
             'Income_Level_High', 'Income_Level_Medium',
             'Purchase_Amount', 'Purchase_Frequency_Per_Month', 
             'Product_Category_Preferences_High', 
             'Product_Category_Preferences_Medium'
            ])
print(vif_result)

In [None]:
# Predictors: demographics, purchase frequency, and the product-category preference dummies.
# Purchase_Amount, which is included in the VIF check above, is not used in this model.
X = MLRM_df[["Age", "Purchase_Frequency_Per_Month", "Income_Level_High", "Income_Level_Medium","Gender_Female","Gender_Male"] +
         [col for col in MLRM_df.columns if col.startswith("Product_Category_") and col != 'Product_Category_Preferences']]

y = MLRM_df['Average_Spending_Per_Purchase']

# Split data into train and test sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the linear regression model
model = LinearRegression().fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print("R^2 Score:", r2)

# Mean Absolute Error and Mean Squared Error (to assess model performance)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)

In [None]:
# Plotting predicted vs actual
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue', alpha=0.6, label='Predicted vs Actual')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--', label='Perfect Fit (y = x)')

# Customize the plot
plt.title('Predicted vs Actual Spending (Linear Regression)', fontsize=14)
plt.xlabel('Actual Spending', fontsize=12)
plt.ylabel('Predicted Spending', fontsize=12)
plt.legend()
plt.grid(True)
plt.show()

In [None]:
import statsmodels.api as sm
from scipy import stats

def validate_mlr_assumptions(df, independent_vars, dependent_var):
    """
    Validates the assumptions of Multiple Linear Regression Model (MLRM).
    
    Parameters:
    - df: pandas DataFrame containing the data.
    - independent_vars: list of strings representing independent variable column names.
    - dependent_var: string representing the dependent variable column name.
    
    Assumptions checked:
    1. Linearity
    2. Homoscedasticity
    3. Independence of Errors
    4. Multicollinearity
    5. Normality of residuals
    """
    # Define independent (X) and dependent (y) variables
    X = df[independent_vars]
    y = df[dependent_var]

    # Add constant to the model (for intercept)
    X = sm.add_constant(X)

    # 1. Linearity: Check linearity by visualizing residuals vs. fitted values
    model = sm.OLS(y, X).fit()
    y_pred = model.fittedvalues
    residuals = model.resid

    plt.figure(figsize=(12, 6))

    # Plot: Residuals vs Fitted
    plt.subplot(121)
    sns.residplot(x=y_pred, y=residuals, lowess=True, line_kws={'color': 'red'}, scatter_kws={'alpha': 0.5})
    plt.title('Residuals vs Fitted')
    plt.xlabel('Fitted values')
    plt.ylabel('Residuals')
    
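    # Plot: Q-Q plot for normality of residuals
    # line='s' scales the reference line to the residuals' mean and standard deviation,
    # which suits residuals that have not been standardized
    plt.subplot(122)
    sm.qqplot(residuals, line='s', ax=plt.gca())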
    plt.title('Q-Q plot for residuals')
    
    plt.tight_layout()
    plt.show()

    # 2. Homoscedasticity: Check if residuals have constant variance
    # If residuals increase or decrease systematically with fitted values, heteroscedasticity may be present.
    plt.figure(figsize=(6, 6))
    plt.scatter(y_pred, residuals, alpha=0.5)
    plt.axhline(0, color='red', linestyle='--')
    plt.title('Residuals vs Fitted (Homoscedasticity Check)')
    plt.xlabel('Fitted values')
    plt.ylabel('Residuals')
    plt.show()

    # 3. Independence of errors: Check Durbin-Watson statistic
    dw_stat = sm.stats.durbin_watson(residuals)
    print(f"Durbin-Watson Statistic: {dw_stat:.4f}")
    # Durbin-Watson statistic should be between 1.5 and 2.5 for independence of errors.
    if dw_stat < 1.5 or dw_stat > 2.5:
        print("Warning: Durbin-Watson statistic suggests potential autocorrelation in residuals.")

    # 4. Multicollinearity: Check Variance Inflation Factor (VIF)
    vif_data = pd.DataFrame()
    vif_data["feature"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    print("\nVariance Inflation Factor (VIF) for each feature:")
    print(vif_data)

    # High VIF (>10) indicates multicollinearity problems

    # 5. Check for Normality of residuals (Shapiro-Wilk test)
    stat, p_value = stats.shapiro(residuals)
    print(f"\nShapiro-Wilk Test for normality: stat={stat:.4f}, p-value={p_value:.4f}")
    if p_value < 0.05:
        print("Warning: Residuals do not follow a normal distribution.")
    
    # Print the model summary for further details
    print(model.summary())

In [None]:
validate_mlr_assumptions(MLRM_df, ['Age', 'Gender_Female', 'Gender_Male',
             'Income_Level_High', 'Income_Level_Medium',
             'Purchase_Amount', 'Purchase_Frequency_Per_Month',
             'Product_Category_Preferences_High',
             'Product_Category_Preferences_Medium'
            ], 'Average_Spending_Per_Purchase')

### 2. Model Building:

• Preprocess the data by selecting relevant numerical and categorical variables (e.g.,
income level, product category, age).

• Split the dataset into training and testing sets.

### 3. Implementation:

• Train a linear regression model using the training data.

• Evaluate the model using metrics such as Mean Absolute Error (MAE), Mean
Squared Error (MSE), and R-squared.
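
For reference, the three metrics can be computed by hand as follows. The numbers are invented for illustration and are not results from the notebook's model.

```python
import numpy as np

# Hypothetical actual and predicted values
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.5, 5.0, 8.0, 8.5])

errors = y_actual - y_predicted

mae_manual = np.mean(np.abs(errors))   # average absolute error
mse_manual = np.mean(errors ** 2)      # average squared error, penalizes large misses more

ss_res = np.sum(errors ** 2)                          # unexplained variation
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)    # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(f"MAE={mae_manual}, MSE={mse_manual}, R^2={r2_manual}")
```

MAE and MSE are expressed in the units of the target (and its square), so on standardized data they are in standard-deviation units; R² is unitless and compares the model against always predicting the mean.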

### 4. Visualization:

• Plot the predicted vs. actual values for the test dataset.

• Include regression lines for better interpretability.

### **B. Decision Tree Analysis:**

### 1. Problem Definition:

• Classify whether a customer will make a purchase in the next month (use a binary
target variable).

### 2. Model Building:

• Engineer a binary target variable (e.g., 1 = purchase made, 0 = no purchase).

• Use features like purchase frequency, spending history, and product preferences.

### 3. Implementation:

• Train a decision tree classifier and use criteria such as Gini Impurity or
Entropy.
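
Both criteria measure how mixed the class labels in a node are. As a small worked sketch with invented labels, a node holding three "purchase" customers and one "no purchase" customer has Gini impurity 1 − (0.75² + 0.25²) = 0.375, while a pure node scores 0 under either criterion. The helper names below are hypothetical and not part of scikit-learn.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

mixed_node = [1, 1, 1, 0]   # hypothetical node: 3 purchases, 1 no purchase
pure_node = [1, 1, 1, 1]    # hypothetical node: purchases only

print("Gini (mixed):", gini_impurity(mixed_node))
print("Entropy (mixed):", round(entropy(mixed_node), 4))
print("Gini (pure):", gini_impurity(pure_node))
```

The tree picks, at each split, the feature and threshold that reduce the chosen impurity the most; in `DecisionTreeClassifier` this is selected with `criterion='gini'` or `criterion='entropy'`.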

In [None]:
X = df_encoded[['Purchase_Frequency_Per_Month', 'Average_Spending_Per_Purchase', 
                'Purchase_Amount', 'Product_Category_Preferences_High',
                'Product_Category_Preferences_Medium',
                'Brand_Affinity_Score', 'Age', 
                'Season_Fall', 'Season_Summer']]

y = df_encoded['Will_Purchase_Next_Month']

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree Classifier
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predictions
y_pred = clf.predict(X_test)

• Evaluate the model using metrics such as Accuracy, Precision, Recall, and F1
Score.
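
The four metrics all derive from the confusion-matrix counts. The sketch below computes them by hand for a short, invented set of labels, treating 1 ("purchase") as the positive class.

```python
# Hypothetical true labels and predictions (1 = purchase, 0 = no purchase)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_hat = [1, 0, 1, 0, 1, 1, 1, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # correctly predicted purchases
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted purchase, none happened
fn = sum(t == 1 and p == 0 for t, p in pairs)  # missed purchases
tn = sum(t == 0 and p == 0 for t, p in pairs)  # correctly predicted non-purchases

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)   # of predicted purchases, how many were real
recall = tp / (tp + fn)      # of real purchases, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

When the classes are imbalanced, accuracy alone can look good while recall for the minority class is poor, which is why precision, recall, and F1 are reported alongside it.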

In [None]:
# Evaluate model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Feature Importance
feature_importance = pd.DataFrame({'Feature': X.columns, 'Importance': clf.feature_importances_})
feature_importance = feature_importance.sort_values(by='Importance', ascending=False)
print("\nFeature Importance:\n", feature_importance)


### 4. Visualization:

In [None]:
plt.figure(figsize=(12, 8))
# feature_names is passed as a plain list, which every scikit-learn version accepts
plot_tree(clf, feature_names=list(X.columns), class_names=['No Purchase', 'Purchase'], filled=True)
plt.show()

• Plot the decision tree.

• Highlight important features that influence the decision.

## **Module 4: Clustering Analysis:**

(Hint: Remove the predicted label and then apply K-Means Clustering)

In [None]:
X = KMeans_df.drop(columns=['Will_Purchase_Next_Month', 'Purchase_Date'])

### 1. Define the number of clusters(k):

In [None]:
inertia = []
range_k = range(1, 15)

for k in range_k:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

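inertia_diff = np.diff(inertia)
inertia_second_diff = np.diff(inertia_diff)

# Elbow heuristic: pick the k where the second difference of the inertia curve is largest,
# i.e. where the curve bends most sharply. Entry i of the second difference is centered
# on k = i + 2 because range_k starts at 1.
optimal_k = np.argmax(inertia_second_diff) + 2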


plt.figure(figsize=(8, 6))
plt.plot(range_k, inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Sum of Squared Distances (Inertia)')
plt.show()

print(f"The optimal number of clusters (k) is: {optimal_k}")
print(inertia)

• Analyze the elbow plot to determine the optimal number of clusters based on the
sum of squared distances within each cluster.
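
One common heuristic for reading the elbow numerically is to take the k at which the second difference of the inertia curve is largest, since that is where the curve flattens most abruptly. The sketch below applies it to an invented inertia list; it is a rough aid, and the plot should still be inspected because the heuristic can be misled by noisy curves.

```python
import numpy as np

# Hypothetical inertia values for k = 1..6 (not results from the notebook)
ks = np.arange(1, 7)
toy_inertia = np.array([1000.0, 600.0, 300.0, 250.0, 220.0, 200.0])

first_diff = np.diff(toy_inertia)   # drop in inertia from k to k+1 (negative values)
second_diff = np.diff(first_diff)   # how quickly that drop shrinks

# second_diff[i] is centered on ks[i + 1]
elbow_k = ks[np.argmax(second_diff) + 1]

print("Second differences:", second_diff)
print("Elbow at k =", elbow_k)
```

Here the inertia falls steeply up to k = 3 and only slowly afterwards, so the largest second difference sits at k = 3.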

### 2. Apply K-Means Clustering:

In [None]:
# k is fixed manually here, overriding the value estimated in the previous cell
optimal_k = 8

kmeans = KMeans(n_clusters=optimal_k, random_state=42)
clusters = kmeans.fit_predict(X)

# Add the cluster labels to the original dataframe
df_encoded['Cluster'] = clusters

# Display the cluster centers (mean of each feature for each cluster)
cluster_centers = pd.DataFrame(kmeans.cluster_centers_, columns=X.columns)
print("Cluster Centers (mean of features per cluster):")
cluster_centers

• Implement K-means with the chosen k value to segment customers into distinct
clusters based on their purchase behavior and preferences.

### 3. Analyze cluster characteristics:

In [None]:
cluster_analysis = df_encoded.groupby('Cluster').agg({
    'Purchase_Amount': 'mean',
    'Average_Spending_Per_Purchase': 'mean',
    'Purchase_Frequency_Per_Month': 'mean',
    'Brand_Affinity_Score': 'mean',
    'Product_Category_Preferences_High': 'mean',
    'Product_Category_Preferences_Medium': 'mean',
    'Season_Fall': 'mean',
    'Season_Summer': 'mean',
}).reset_index()

print("Cluster Characteristics:")
cluster_analysis 

• Investigate key features of each cluster, such as average purchase amount, brand
affinity and product category preferences.

In [None]:
from sklearn.decomposition import PCA

# Perform PCA to reduce to 2D
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Plot the clusters in 2D
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df_encoded['Cluster'], cmap='viridis', s=50)
plt.title('K-Means Clusters (Projected in 2D)')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Cluster')
plt.show()


• Identify significant differences and similarities between clusters.

## **Module 5: Comparison and Conclusion:**

### 1. Compare the predictive performance of the regression, decision tree, and K-Means Clustering models.

• Discuss strengths, limitations, and real-world applicability in the context of
customer behavior analysis.

### 2. Provide actionable recommendations for the electronics section based on the results.