# 🧹 Task 1: Data Cleaning and Preprocessing

**Dataset:** Synthetic data modeled on Kaggle's Mall Customer Segmentation Data

**Objective:** Clean and preprocess a raw dataset by handling missing values, duplicates, inconsistent formats, and incorrect data types using Python and Pandas.

## 1️⃣ Import Libraries

In [None]:
import pandas as pd
import numpy as np

## 2️⃣ Load and Explore the Dataset

In [None]:

# For this task, we generate a synthetic dataset similar to Kaggle's Mall Customer Data
np.random.seed(42)
n = 200
df = pd.DataFrame({
    'CustomerID': np.arange(1, n+1),
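    # An object array keeps np.nan as a real missing value; a plain list of strings would turn it into the text 'nan'
    'Gender': np.random.choice(np.array(['Male','Female','FEMALE','male','Other',np.nan], dtype=object), n),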
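    # Numeric columns are stored as float so NaN can be assigned to them below
    'Age': np.random.randint(18, 70, n).astype(float),
    'Annual Income (k$)': np.random.randint(15, 150, n).astype(float),
    # randint excludes the upper bound, so 101 allows scores up to 100
    'Spending Score (1-100)': np.random.randint(1, 101, n).astype(float)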
})
# Add intentional issues: duplicates + missing values
df.loc[5:7, 'Age'] = np.nan
df.loc[10:12, 'Annual Income (k$)'] = np.nan
df.loc[15, 'Spending Score (1-100)'] = np.nan
df = pd.concat([df, df.iloc[0:3]], ignore_index=True)  # re-append the first 3 rows as duplicates
df.head()


## 3️⃣ Check Missing Values and Duplicates

In [None]:
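# Count missing values per column
print(df.isnull().sum())

# Count fully duplicated rows
print('Duplicate rows:', df.duplicated().sum())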
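The same check can be packaged as a small report. This is a minimal sketch, assuming a hypothetical helper named `quality_report` and a tiny stand-in frame rather than the notebook's data; it shows missing counts, missing percentages, and dtypes side by side, plus the duplicate-row count.

```python
import numpy as np
import pandas as pd


def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts, percentages, and dtypes (hypothetical helper)."""
    missing = frame.isnull().sum()
    return pd.DataFrame({
        'missing': missing,
        'missing_pct': (missing / len(frame) * 100).round(1),
        'dtype': frame.dtypes.astype(str),
    })


# Tiny stand-in frame, not the notebook's data
sample = pd.DataFrame({
    'Gender': ['Male', None, 'female', 'Male'],
    'Age': [25.0, np.nan, 31.0, 25.0],
})
# Re-append the first row so the frame contains duplicates
sample = pd.concat([sample, sample.iloc[[0]]], ignore_index=True)

report = quality_report(sample)
n_duplicates = int(sample.duplicated().sum())
print(report)
print('Duplicate rows:', n_duplicates)
```

Reporting percentages alongside raw counts makes it easier to judge whether imputing a column is reasonable or whether too much of it is missing.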

## 4️⃣ Data Cleaning Steps

In [None]:

# Remove duplicates
df.drop_duplicates(inplace=True)

# Handle missing values
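# Assign the result back: fillna(inplace=True) on a selected column may not update df in recent pandas
df['Age'] = df['Age'].fillna(round(df['Age'].mean()))  # rounded so the later int64 cast does not truncate
df['Annual Income (k$)'] = df['Annual Income (k$)'].fillna(df['Annual Income (k$)'].mean())
df['Spending Score (1-100)'] = df['Spending Score (1-100)'].fillna(df['Spending Score (1-100)'].mean())
df['Gender'] = df['Gender'].fillna('Unknown')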

# Standardize text
df['Gender'] = df['Gender'].str.strip().str.lower().replace({
    'male':'Male','female':'Female','other':'Other','unknown':'Unknown'
})

# Rename columns
df.columns = (df.columns.str.strip().str.lower()
              .str.replace(' ', '_')
              .str.replace('(k$)', 'k', regex=False)
              .str.replace('(1-100)', '1_100', regex=False))

# Correct data types
df = df.astype({'age': 'int64', 'annual_income_k': 'float64', 'spending_score_1_100': 'float64'})

df.head()


## 5️⃣ Verify Data Quality

In [None]:
df.info()

df.describe()
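Beyond reading `info()` and `describe()` by eye, the expectations can be written as explicit checks. The following is a sketch only: `validate_cleaned` is a hypothetical helper, the allowed gender labels and the 1-100 score range are assumptions taken from the cleaning steps and column name, and the rows are made up for illustration.

```python
import pandas as pd


def validate_cleaned(frame: pd.DataFrame) -> list:
    """Collect readable problems instead of stopping at the first one (hypothetical helper)."""
    problems = []
    if frame.isnull().any().any():
        problems.append('missing values remain')
    if frame.duplicated().any():
        problems.append('duplicate rows remain')
    # Assumed label set, matching the standardized gender values
    allowed = {'Male', 'Female', 'Other', 'Unknown'}
    unexpected = set(frame['gender'].dropna()) - allowed
    if unexpected:
        problems.append(f'unexpected gender labels: {sorted(unexpected)}')
    # Assumed valid range, taken from the column name
    if not frame['spending_score_1_100'].between(1, 100).all():
        problems.append('spending score outside 1-100')
    return problems


# Stand-in rows using the cleaned column names; values are made up
good = pd.DataFrame({
    'customerid': [1, 2],
    'gender': ['Male', 'Unknown'],
    'age': [34, 51],
    'annual_income_k': [40.0, 88.0],
    'spending_score_1_100': [12.0, 77.0],
})
bad = good.copy()
bad.loc[1, 'gender'] = 'FEMALE'
bad.loc[0, 'spending_score_1_100'] = 140.0

print(validate_cleaned(good))
print(validate_cleaned(bad))
```

Returning a list of problems rather than raising on the first failure shows everything that still needs fixing in one pass.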

## 6️⃣ Export Cleaned Dataset

In [None]:

df.to_csv('mall_customers_cleaned.csv', index=False)
print('✅ Cleaned dataset saved as mall_customers_cleaned.csv')
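A quick way to confirm the export is to read the file back and compare it with what was written. This is a minimal sketch using a temporary directory and a made-up stand-in frame; only the file name is taken from the cell above.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the cleaned frame; values are made up
cleaned = pd.DataFrame({
    'customerid': [1, 2, 3],
    'gender': ['Male', 'Female', 'Unknown'],
    'age': [34, 51, 27],
    'annual_income_k': [40.0, 88.5, 61.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'mall_customers_cleaned.csv')
    cleaned.to_csv(path, index=False)   # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == cleaned.shape
same_columns = list(reloaded.columns) == list(cleaned.columns)
same_values = bool((reloaded == cleaned).all().all())
print(same_shape, same_columns, same_values)
```

The element-wise comparison works here because the stand-in frame has no missing values; NaN never compares equal to itself, so a frame with gaps would need a different check.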


## ✅ Summary of Cleaning Steps
- Removed duplicates
- Filled missing values with the column mean (numeric) and an 'Unknown' label (categorical)
- Standardized gender text values
- Renamed columns to lowercase and underscore format
- Converted and verified data types
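The steps above can be sketched as a single reusable function. This is an illustration under stated assumptions, not the notebook's own code: `clean_customers` is a hypothetical name, `str.title()` is swapped in for the explicit gender mapping, and the input rows are made up.

```python
import numpy as np
import pandas as pd


def clean_customers(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps in order and return a new frame (hypothetical helper)."""
    df = raw.drop_duplicates().copy()

    # Numeric gaps: column mean (age rounded to a whole year)
    df['Age'] = df['Age'].fillna(round(df['Age'].mean()))
    for col in ['Annual Income (k$)', 'Spending Score (1-100)']:
        df[col] = df[col].fillna(df[col].mean())

    # Categorical gaps and inconsistent casing; title() stands in for the explicit mapping
    df['Gender'] = df['Gender'].fillna('Unknown').str.strip().str.title()

    # Lowercase, underscore-separated column names
    df.columns = (df.columns.str.strip().str.lower()
                  .str.replace(' ', '_', regex=False)
                  .str.replace('(k$)', 'k', regex=False)
                  .str.replace('(1-100)', '1_100', regex=False))
    return df.astype({'age': 'int64'}).reset_index(drop=True)


# Made-up rows: one duplicate, one gap per affected column
raw = pd.DataFrame({
    'CustomerID': [1, 2, 3, 1],
    'Gender': ['male', 'FEMALE', None, 'male'],
    'Age': [20.0, np.nan, 40.0, 20.0],
    'Annual Income (k$)': [30.0, 50.0, np.nan, 30.0],
    'Spending Score (1-100)': [10.0, 90.0, 50.0, 10.0],
})
cleaned = clean_customers(raw)
print(cleaned)
```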

**Tools Used:** Python (Pandas, NumPy)

**Outcome:** A clean, structured dataset ready for analysis or modeling.