**Title**: **"Data Cleaning & Preprocessing Report – Mall Customer Segmentation"**

**Name - Chinmay Sawant**

**Dataset Used: Mall_Customers.csv (Kaggle)**

**Tools: Python (Pandas)**

_______________________________________________________________________________________________________________________________________________________________________________________________________________________________________


# **Aim:**
To clean and preprocess the Mall Customer Segmentation dataset, ensuring it is free from missing values, duplicates, and inconsistencies, making it suitable for exploratory analysis or machine learning models.

# **Procedure:**
1. Load and Inspect Data



In [1]:
import pandas as pd
df = pd.read_csv('/content/Mall_Customers.csv')  # Load raw data
print("Initial shape:", df.shape)       # Check rows/columns
print("\nData types:\n", df.dtypes)     # Verify column types

Initial shape: (200, 5)

Data types:
 CustomerID                 int64
Gender                    object
Age                        int64
Annual Income (k$)         int64
Spending Score (1-100)     int64
dtype: object


2. Check for Missing Values

In [2]:
print("\nMissing values per column:")
print(df.isnull().sum())  # Count nulls in each column


Missing values per column:
CustomerID                0
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64


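Action:

No nulls were found, so no imputation was needed. Had nulls existed, numeric columns (e.g., Age) would be filled with the median and categorical columns (e.g., Gender) with 'Unknown'.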
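
The cell for this step (In [3]) does not appear in this report. A minimal sketch of the imputation described in the conclusion (median for numeric columns, 'Unknown' for categorical ones) might look like the following. The helper name `fill_missing` and the toy data are assumptions for illustration, not the original code or values from Mall_Customers.csv.

```python
import pandas as pd

def fill_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with nulls filled: median for numeric columns, 'Unknown' otherwise."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna("Unknown")
    return out

# Toy data for illustration only (hypothetical values)
sample = pd.DataFrame({
    "Gender": ["Male", None, "Female"],
    "Age": [20.0, None, 40.0],
})
filled = fill_missing(sample)
print(filled)
```

Working on a copy keeps the raw frame available for comparison; on the actual dataset this would leave the data unchanged, since no nulls were found.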

3. Remove Duplicates

In [4]:
print("\nDuplicates found:", df.duplicated().sum())
df.drop_duplicates(inplace=True)  # Remove duplicates, if any exist


Duplicates found: 0
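
Note that `df.duplicated()` only flags rows that match in every column. As a possible extra check (not part of the original procedure), duplicates could also be counted on the ID column alone. The sketch below uses made-up rows to show the difference.

```python
import pandas as pd

# Toy data for illustration only: rows 0 and 2 share an ID but differ in Age
toy = pd.DataFrame({"CustomerID": [1, 2, 1], "Age": [19, 21, 20]})

# Full-row comparison misses the repeated ID because Age differs
full_row_dupes = int(toy.duplicated().sum())

# Restricting the comparison to the ID column catches it
id_dupes = int(toy.duplicated(subset=["CustomerID"]).sum())

# Keep the first occurrence of each ID
deduped = toy.drop_duplicates(subset=["CustomerID"], keep="first")
print(full_row_dupes, id_dupes, len(deduped))
```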


4. Standardize Columns

In [5]:
# Rename columns (spaces → underscores, lowercase)
df.columns = [col.lower().replace(' ', '_') for col in df.columns]
print("\nNew columns:", df.columns.tolist())

# Standardize text (e.g., Gender: 'Male' → 'male')
df['gender'] = df['gender'].str.lower()


New columns: ['customerid', 'gender', 'age', 'annual_income_(k$)', 'spending_score_(1-100)']
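
The renamed columns still contain parentheses, `$`, and a hyphen, which rules out attribute-style access such as `df.age` for those columns. If plain snake_case names are preferred, one possible alternative (an assumption, not what the report's code does) is to collapse every run of non-alphanumeric characters into a single underscore. The helper name `to_snake_case` is hypothetical.

```python
import re

def to_snake_case(name: str) -> str:
    """Lowercase the name and replace runs of non-alphanumeric characters with '_'."""
    cleaned = re.sub(r"[^0-9a-z]+", "_", name.lower())
    return cleaned.strip("_")

# Column names as they appear in the raw dataset
raw_columns = [
    "CustomerID",
    "Gender",
    "Age",
    "Annual Income (k$)",
    "Spending Score (1-100)",
]
new_columns = [to_snake_case(col) for col in raw_columns]
print(new_columns)
```

The trade-off is that the unit hint `(k$)` is reduced to a bare `k`, so the unit may need to be documented elsewhere.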


5. Fix Data Types

In [6]:
df['customerid'] = df['customerid'].astype(str)  # ID → string
print("\nUpdated dtypes:\n", df.dtypes)


Updated dtypes:
 customerid                object
gender                    object
age                        int64
annual_income_(k$)         int64
spending_score_(1-100)     int64
dtype: object


6. Save Cleaned Data

In [7]:
df.to_csv('cleaned_mall_customers.csv', index=False)
print("\nCleaned data saved! Final shape:", df.shape)


Cleaned data saved! Final shape: (200, 5)
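
One caveat when reloading the saved file: CSV does not store column types, so `pd.read_csv` will infer the ID column as integers again unless a dtype is given. The sketch below demonstrates this with an in-memory buffer and a toy frame (illustrative values, not the real data).

```python
import io
import pandas as pd

# Toy frame standing in for the cleaned data (hypothetical values)
cleaned = pd.DataFrame({"customerid": ["1", "2"], "age": [19, 21]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Default read: pandas infers the ID column as integers
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Passing an explicit dtype keeps the IDs as strings
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={"customerid": str})

print(default_read["customerid"].dtype)
print(typed_read["customerid"].tolist())
```

When loading cleaned_mall_customers.csv for later analysis, the same `dtype` argument would preserve the string IDs from step 5.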


# **Conclusion:**

This data cleaning task prepared the Mall Customer Segmentation dataset for analysis by checking for and addressing the following issues:

**Checked for Missing Values**
*   No missing values were found in any of the five columns, so no imputation was needed.
*   Had nulls existed, numeric columns (e.g., Age) would have been filled with the median to limit the influence of outliers, and categorical columns (e.g., Gender) with 'Unknown' to preserve records.

**Checked for Duplicates**
*   No duplicate rows were found; the drop_duplicates call left all 200 rows unchanged.

**Standardized Formats**
*   Column names were lowercased and spaces were replaced with underscores; Gender values were lowercased for consistency.
*   CustomerID was converted from integer to string, since it is an identifier rather than a quantity.


**Delivered a Clean Dataset**
*   Saved as cleaned_mall_customers.csv, ready for EDA or machine learning.