### Dataset Description
The Mall Customers Dataset provides data on 200 individuals who visit a mall, including demographic information, annual income, and spending habits. This dataset is useful for exploratory data analysis, customer segmentation, and clustering tasks (e.g., K-means clustering).

- `CustomerID`: A unique identifier for each customer (integer).
- `Genre`: The gender of the customer (Male/Female).
- `Age`: The age of the customer (integer).
- `Annual Income (k$)`: Annual income of the customer in thousands of dollars (integer).
- `Spending Score (1-100)`: A score assigned by the mall based on customer behavior and spending patterns (integer).
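
The column descriptions above can be turned into a quick schema check before any analysis. The following is a minimal sketch, not part of the original notebook: the helper name `schema_problems` is hypothetical, and `sample` is a tiny illustrative frame that stands in for the real CSV.

```python
import pandas as pd

# Expected columns, taken from the dataset description.
EXPECTED_COLUMNS = [
    "CustomerID",
    "Genre",
    "Age",
    "Annual Income (k$)",
    "Spending Score (1-100)",
]
NUMERIC_COLUMNS = [c for c in EXPECTED_COLUMNS if c != "Genre"]


def schema_problems(frame: pd.DataFrame) -> list[str]:
    """Return readable schema problems; an empty list means none were found."""
    problems = []
    for col in EXPECTED_COLUMNS:
        if col not in frame.columns:
            problems.append(f"missing column: {col}")
    for col in NUMERIC_COLUMNS:
        if col in frame.columns and not pd.api.types.is_integer_dtype(frame[col]):
            problems.append(f"non-integer column: {col}")
    if "Genre" in frame.columns:
        unexpected = set(frame["Genre"].dropna().unique()) - {"Male", "Female"}
        if unexpected:
            problems.append(f"unexpected Genre values: {sorted(unexpected)}")
    return problems


# Illustrative rows only (assumption: the real file follows the same layout).
sample = pd.DataFrame(
    {
        "CustomerID": [1, 2, 3, 4, 5],
        "Genre": ["Male", "Male", "Female", "Female", "Female"],
        "Age": [19, 21, 20, 23, 31],
        "Annual Income (k$)": [15, 15, 16, 16, 17],
        "Spending Score (1-100)": [39, 81, 6, 77, 40],
    }
)

print(schema_problems(sample))
```

Calling `schema_problems(df)` on the loaded dataset would flag renamed columns or unexpected types early, before they surface as harder-to-read errors in later cells.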

#### Link to Dataset
- [Kaggle | Mall Customer Segmentation](https://www.kaggle.com/datasets/abdallahwagih/mall-customers-segmentation/data)

### Problem Statement

### Approach to Solution
1. Data Validation
    - 1.1 Check Missing Values
    - 1.2 Check Duplicated Entries
2. Exploratory Data Analysis
    - 2.1 Dataset Info
    - 2.2 Key Variables Distribution
    - 2.3 Correlation Analysis
3. Dimensionality Reduction
    - 3.1 PCA
4. Data Preprocessing
    - 4.1 Data Pipeline
        - 4.1.1 Standardization
        - 4.1.2 Encoding Categorical
5. Modelling
    - 5.1 Multiple Linear Regression
        - 5.1.1 Model Training
        - 5.1.2 Hyperparameter Tuning
    - 5.2 K-Nearest Neighbors (KNN)
        - 5.2.1 Model Training
        - 5.2.2 Hyperparameter Tuning
    - 5.3 Train-Test Split
6. Model Evaluation
    - 6.1 Multiple Linear Regression Performance
    - 6.2 KNN Performance
    - 6.3 Model Comparison
7. Research Questions
    - 7.1 Most Influential Variables
    - 7.2 Comparison of Model Performance
    - 7.3 Impact of Data Preprocessing
    - 7.4 Insights from EDA

### Prerequisite Dependencies

In [12]:
!pip install --quiet pandas numpy matplotlib seaborn scikit-learn jupyter notebook

### 1. Data Validation

In [1]:
# print the current working directory (on Windows, `cd` with no arguments does this)
!cd

C:\Users\visha\OneDrive\Documents\GitHub\Data-Science-Stuff\Mall_Customer_Segmentation


In [9]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os


# path to the dataset, relative to the notebook's working directory
path_dataset = 'dataset/Mall_Customers.csv'

# load the dataset; re-raise if the file is missing so that later cells
# do not fail with a NameError on `df`
try:
    # Check if the file exists (relative paths resolve against the working directory)
    if not os.path.exists(path_dataset):
        raise FileNotFoundError(f"File not found at {path_dataset}.")
    # Read the dataset
    df = pd.read_csv(path_dataset)
except FileNotFoundError:
    print(f"File not found at {path_dataset}. Please check the path and try again.")
    raise

In [10]:
# first 5 rows of the dataset
df.head()

|   | CustomerID | Genre | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |


In [11]:
# missing values
df.isnull().sum()

CustomerID                0
Genre                     0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64

In [12]:
# check duplicates
df.duplicated().sum()

np.int64(0)
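
Because `CustomerID` is unique per row, a full-row `duplicated()` check cannot detect the same customer attributes recorded under two different IDs. The sketch below, which is an addition and not part of the original notebook, shows two complementary checks on a small illustrative frame; the repeated row with `CustomerID` 6 is hypothetical and exists only to demonstrate the difference.

```python
import pandas as pd

# Illustrative rows only; the real checks would run on `df`.
sample = pd.DataFrame(
    {
        "CustomerID": [1, 2, 3, 4, 5],
        "Genre": ["Male", "Male", "Female", "Female", "Female"],
        "Age": [19, 21, 20, 23, 31],
        "Annual Income (k$)": [15, 15, 16, 16, 17],
        "Spending Score (1-100)": [39, 81, 6, 77, 40],
    }
)

# Hypothetical case: the first customer's attributes repeated under a new ID.
with_repeat = pd.concat(
    [sample, sample.iloc[[0]].assign(CustomerID=6)], ignore_index=True
)

# Full-row check: the differing ID hides the repeat.
full_row_duplicates = int(with_repeat.duplicated().sum())

# The identifier itself should never repeat.
ids_unique = bool(with_repeat["CustomerID"].is_unique)

# Ignoring the ID exposes rows with identical attributes.
attribute_duplicates = int(
    with_repeat.drop(columns="CustomerID").duplicated().sum()
)

print(full_row_duplicates, ids_unique, attribute_duplicates)
```

Identical attributes do not prove that two rows describe the same person, so any rows flagged this way are candidates for review rather than automatic removal.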

### 2. Exploratory Data Analysis

### 3. Dimensionality Reduction

### 4. Data Preprocessing
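
Steps 4.1.1 (Standardization) and 4.1.2 (Encoding Categorical) from the outline could be combined in a single scikit-learn `ColumnTransformer`. This is a minimal sketch under stated assumptions, not the notebook's actual pipeline: it treats all three numeric columns as features, drops `CustomerID`, and runs on a small illustrative frame instead of the loaded `df`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative rows only; in the notebook this would be `df`.
sample = pd.DataFrame(
    {
        "CustomerID": [1, 2, 3, 4, 5],
        "Genre": ["Male", "Male", "Female", "Female", "Female"],
        "Age": [19, 21, 20, 23, 31],
        "Annual Income (k$)": [15, 15, 16, 16, 17],
        "Spending Score (1-100)": [39, 81, 6, 77, 40],
    }
)

# Assumption: which columns are features depends on the modelling target.
numeric_features = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
categorical_features = ["Genre"]

preprocessor = ColumnTransformer(
    transformers=[
        # 4.1.1: rescale each numeric column to zero mean and unit variance
        ("scale", StandardScaler(), numeric_features),
        # 4.1.2: encode the two-level Genre column as a single 0/1 indicator
        ("encode", OneHotEncoder(drop="if_binary"), categorical_features),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

# Columns not listed above (here, CustomerID) are dropped by default.
features = preprocessor.fit_transform(sample)
print(features.shape)
```

When a train-test split is used (step 5.3), fitting the transformer on the training portion only and then calling `transform` on the test portion avoids leaking test-set statistics into the scaler.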

### 5. Modelling

### 6. Model Evaluation

### 7. Research Questions