## TASK - I



#### Mall Customer Segmentation Data

### Importing Libraries

In [1]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

import zipfile

In [2]:
# Unzipping the CSV archive (absolute target path, so the read below does not depend on the working directory)
with zipfile.ZipFile('/content/Mall Customer Segmentation Data.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/unzipped_folder')


In [3]:
# Loading data
mall_data = pd.read_csv('/content/unzipped_folder/Mall_Customers.csv')

In [4]:
mall_data.sample(3)

     CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
53           54    Male   59                  43                      60
59           60    Male   53                  46                      46
153         154  Female   38                  78                      76


In [5]:
# Understanding the Dataset Shape
mall_data.shape

(200, 5)

In [6]:
# Dataset Info Summary (Column Types and Nulls)
mall_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   CustomerID              200 non-null    int64 
 1   Gender                  200 non-null    object
 2   Age                     200 non-null    int64 
 3   Annual Income (k$)      200 non-null    int64 
 4   Spending Score (1-100)  200 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 7.9+ KB


In [7]:
# Ensuring data quality by checking for fully duplicated rows (all columns identical)

mall_data.duplicated().sum()

np.int64(0)
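
Called with no arguments, `duplicated()` only flags rows that match on every column, so two rows that share a `CustomerID` but differ elsewhere would not be counted. A minimal sketch of a key-based check follows, using a small made-up frame (`toy` and its values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: customer 2 appears twice with different ages.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3],
    "Age": [19, 21, 22, 20],
})

# Full-row check: counts rows identical to an earlier row in every column.
full_row_dupes = int(toy.duplicated().sum())

# Key-based check: counts rows whose CustomerID repeats an earlier row.
id_dupes = int(toy.duplicated(subset=["CustomerID"]).sum())

print(full_row_dupes, id_dupes)  # 0 1
```

On the real data the equivalent call would be `mall_data.duplicated(subset=['CustomerID']).sum()`.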

In [8]:
# Checking for missing values in each column
mall_data.isna().sum()

CustomerID                0
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64


#### No missing values found in any column


In [9]:
mall_data.columns

Index(['CustomerID', 'Gender', 'Age', 'Annual Income (k$)',
       'Spending Score (1-100)'],
      dtype='object')

#### Cleaning and Standardizing Column Names

In [10]:
mall_data.columns = (
    mall_data.columns
    .str.strip()
    .str.lower()
    .str.replace('[^a-z0-9_]', '_', regex=True)   # replaces bad chars with _
    .str.replace('_+', '_', regex=True)           # reduces multiple __ to single _
    .str.strip('_')                               # removes leading/trailing _
)


In [11]:
mall_data.columns

Index(['customerid', 'gender', 'age', 'annual_income_k',
       'spending_score_1_100'],
      dtype='object')
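
To make each step of the renaming chain easier to follow, here is a plain-Python equivalent built on `re` rather than the pandas `.str` accessor. `clean_name` is a hypothetical helper introduced only for illustration; it applies the same five steps to one name at a time.

```python
import re

def clean_name(name: str) -> str:
    """Hypothetical helper mirroring the .str chain for a single column name."""
    name = name.strip().lower()              # trim whitespace, then lowercase
    name = re.sub(r"[^a-z0-9_]", "_", name)  # anything outside a-z, 0-9, _ becomes _
    name = re.sub(r"_+", "_", name)          # collapse runs of underscores
    return name.strip("_")                   # drop leading/trailing underscores

raw = ["CustomerID", "Gender", "Age", "Annual Income (k$)", "Spending Score (1-100)"]
cleaned = [clean_name(c) for c in raw]
print(cleaned)
# ['customerid', 'gender', 'age', 'annual_income_k', 'spending_score_1_100']
```

The order matters: lowercasing has to happen before the first substitution, because the character class only admits lowercase letters and would otherwise turn every capital into an underscore.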

In [12]:
for col in mall_data.columns:
    print(f"{col}: {mall_data[col].nunique()}")

customerid: 200
gender: 2
age: 51
annual_income_k: 64
spending_score_1_100: 84


In [13]:
mall_data.head()

   customerid  gender  age  annual_income_k  spending_score_1_100
0           1    Male   19               15                    39
1           2    Male   21               15                    81
2           3  Female   20               16                     6
3           4  Female   23               16                    77
4           5  Female   31               17                    40


#### Standardizing Categorical Text Values (Gender)

In [14]:
mall_data['gender'] = mall_data['gender'].str.strip().str.lower()
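
One way to confirm that normalization left only the expected labels is to compare the unique values against an allowed set. The sketch below uses a made-up Series; `gender`, `allowed`, and the sample values are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw values with inconsistent case and stray whitespace.
gender = pd.Series(["Male", " female", "FEMALE ", "male"])

# Same normalization as applied to mall_data['gender'].
normalized = gender.str.strip().str.lower()

# Any label outside the allowed set signals a value that needs manual review.
allowed = {"male", "female"}
unexpected = set(normalized.unique()) - allowed

print(sorted(normalized.unique()), unexpected)  # ['female', 'male'] set()
```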

In [15]:
mall_data.head()

   customerid  gender  age  annual_income_k  spending_score_1_100
0           1    male   19               15                    39
1           2    male   21               15                    81
2           3  female   20               16                     6
3           4  female   23               16                    77
4           5  female   31               17                    40


In [16]:
# Checking Data Types of Each Column
mall_data.dtypes

customerid               int64
gender                  object
age                      int64
annual_income_k          int64
spending_score_1_100     int64
dtype: object


---

### Data Cleaning and Preprocessing Summary

1. **Checked Data Dimensions**: Examined the number of rows and columns (200 rows, 5 columns) to understand the dataset size.
2. **Checked for Duplicates**: Confirmed that the dataset contains no fully duplicated rows.
3. **Checked for Missing Values**: Confirmed that no column has missing (NaN) values.
4. **Renamed Columns**: Standardized column names by:
   * Stripping leading and trailing spaces
   * Converting to lowercase
   * Replacing spaces and special characters with underscores
   * Collapsing repeated underscores and removing leading/trailing ones

5. **Standardized Categorical Values**: Standardized the 'gender' column by:
   * Removing extra spaces
   * Converting all values to lowercase

6. **Checked Data Types**: Examined the data type of each column to confirm it is appropriate for analysis (four `int64` columns and one `object` column).

---

With consistent column names and category labels, and with no duplicate rows or missing values, the dataset is ready for exploratory analysis and modeling.
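
The steps summarized above can be bundled into two small functions so they are repeatable on a fresh load. This is a sketch rather than the notebook's own code: `clean_frame`, `quality_report`, and the `sample` frame are hypothetical names and data introduced for illustration, and the helper assumes a column that becomes `gender` after renaming.

```python
import pandas as pd

def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: applies the renaming and gender normalization steps."""
    out = df.copy()  # leave the caller's frame untouched
    out.columns = (
        out.columns
        .str.strip()
        .str.lower()
        .str.replace(r"[^a-z0-9_]", "_", regex=True)  # bad chars -> _
        .str.replace(r"_+", "_", regex=True)          # collapse repeated _
        .str.strip("_")                               # trim leading/trailing _
    )
    out["gender"] = out["gender"].str.strip().str.lower()
    return out

def quality_report(df: pd.DataFrame) -> dict:
    """Hypothetical helper: the shape, duplicate, and missing-value checks in one dict."""
    return {
        "shape": df.shape,
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": int(df.isna().sum().sum()),
    }

# Made-up sample with messy headers and labels.
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    " Gender ": ["Male", " Female", "female "],
    "Annual Income (k$)": [15, 15, 16],
})

cleaned = clean_frame(sample)
report = quality_report(cleaned)
print(list(cleaned.columns))      # ['customerid', 'gender', 'annual_income_k']
print(cleaned["gender"].tolist()) # ['male', 'female', 'female']
print(report)
```

Working on a copy keeps the raw frame available for comparison, which is convenient when rerunning notebook cells out of order.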