# MALL_CUSTOMER DATA PIPELINE

### Import libraries and dataset

In [0]:
import pandas as pd

# Load the source table as a Spark DataFrame; `spark` is the SparkSession the notebook provides.
Hagital = spark.table("workspace.default.mall_customers1")

### Convert the Spark DataFrame to a pandas DataFrame named `mall`

In [0]:
mall = Hagital.toPandas()
mall.head()

|   | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |


### Show dataset columns and data types

In [0]:
mall.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   CustomerID              200 non-null    int64 
 1   Gender                  200 non-null    object
 2   Age                     200 non-null    int64 
 3   Annual Income (k$)      200 non-null    int64 
 4   Spending Score (1-100)  200 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 7.9+ KB


### Count total number of rows and columns

In [0]:
mall.shape

(200, 5)

In [0]:
mall.describe()

|   | CustomerID | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|
| count | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 100.5 | 38.85 | 60.56 | 50.2 |
| std | 57.879185 | 13.969007 | 26.264721 | 25.823522 |
| min | 1.0 | 18.0 | 15.0 | 1.0 |
| 25% | 50.75 | 28.75 | 41.5 | 34.75 |
| 50% | 100.5 | 36.0 | 61.5 | 50.0 |
| 75% | 150.25 | 49.0 | 78.0 | 73.0 |
| max | 200.0 | 70.0 | 137.0 | 99.0 |


### Check each column for null or blank values

In [0]:
mall.isnull().sum()

CustomerID                0
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64
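
Note that `isnull()` only counts missing values, so an empty or whitespace-only string in a text column such as `Gender` would not show up in the totals above. The following is a minimal sketch of a complementary check, assuming blanks should be counted separately from nulls. The helper name `count_blank_strings` and the sample data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def count_blank_strings(df: pd.DataFrame) -> dict:
    """Count empty or whitespace-only strings in each non-numeric column."""
    counts = {}
    for col in df.columns:
        series = df[col]
        if pd.api.types.is_numeric_dtype(series):
            continue  # numeric columns cannot hold blank strings
        # Drop missing values first so they are not counted as blanks
        text = series.dropna().astype(str).str.strip()
        counts[col] = int((text == "").sum())
    return counts

# Hypothetical sample with one empty and one whitespace-only Gender value
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Gender": ["Male", " ", "", "Female"],
})
blank_counts = count_blank_strings(sample)
print(blank_counts)
```

On the real data this could be called as `count_blank_strings(mall)`; a result of zero for every text column would confirm that the dataset has neither nulls nor blanks.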

### Rename the income and spending score columns

In [0]:
mall = mall.rename(columns={
  'Annual Income (k$)': 'Income(000USD)',
  'Spending Score (1-100)': 'Spendingscore'})

In [0]:
mall.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   CustomerID      200 non-null    int64 
 1   Gender          200 non-null    object
 2   Age             200 non-null    int64 
 3   Income(000USD)  200 non-null    int64 
 4   Spendingscore   200 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 7.9+ KB


In [0]:
mall.head()

|   | CustomerID | Gender | Age | Income(000USD) | Spendingscore |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |


### Convert gender values to lowercase

In [0]:
mall['Gender'] = mall['Gender'].str.lower()
mall.head()

|   | CustomerID | Gender | Age | Income(000USD) | Spendingscore |
|---|---|---|---|---|---|
| 0 | 1 | male | 19 | 15 | 39 |
| 1 | 2 | male | 21 | 15 | 81 |
| 2 | 3 | female | 20 | 16 | 6 |
| 3 | 4 | female | 23 | 16 | 77 |
| 4 | 5 | female | 31 | 17 | 40 |
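
Lowercasing alone leaves stray whitespace and unexpected labels untouched. The sketch below shows one way the step could be made stricter, assuming `male` and `female` are the only valid values (only those two appear in the preview above). The helper `normalize_gender` and the constant `ALLOWED_GENDERS` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Assumption: these are the only valid categories in the dataset
ALLOWED_GENDERS = {"male", "female"}

def normalize_gender(series: pd.Series) -> pd.Series:
    """Strip whitespace, lowercase, and reject values outside ALLOWED_GENDERS."""
    cleaned = series.str.strip().str.lower()
    unexpected = set(cleaned.dropna().unique()) - ALLOWED_GENDERS
    if unexpected:
        raise ValueError(f"Unexpected gender values: {sorted(unexpected)}")
    return cleaned

# Hypothetical sample with mixed case and padding
sample = pd.Series(["Male", " FEMALE ", "female"])
normalized = normalize_gender(sample)
print(normalized.tolist())
```

Raising on unexpected values is a design choice: it surfaces data-quality problems early instead of silently carrying them into later grouping steps.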


In [0]:
mall['Income(000USD)'].min()

15

In [0]:
mall['Income(000USD)'].max()

137

### Derive annual income in dollars and an income level category

In [0]:
def Income_level(a):
    """Map income in thousands of USD to 'Low' (<50), 'Medium' (50 to <90), or 'High' (>=90)."""
    if a < 50:
        return 'Low'
    elif a < 90:
        return 'Medium'
    else:
        return 'High'

In [0]:
mall["Annual_Income"] = mall["Income(000USD)"]*1000
mall.head()

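|   | CustomerID | Gender | Age | Income(000USD) | Spendingscore | Annual_Income |
|---|---|---|---|---|---|---|
| 0 | 1 | male | 19 | 15 | 39 | 15000 |
| 1 | 2 | male | 21 | 15 | 81 | 15000 |
| 2 | 3 | female | 20 | 16 | 6 | 16000 |
| 3 | 4 | female | 23 | 16 | 77 | 16000 |
| 4 | 5 | female | 31 | 17 | 40 | 17000 |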


In [0]:
mall['Income_level'] = mall['Income(000USD)'].apply(Income_level)
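
As an alternative to a row-wise `apply`, the same binning could be expressed with `pd.cut`. This is a sketch rather than the notebook's method: it uses the same thresholds (50 and 90, in thousands of USD), and the sample incomes are hypothetical values chosen to sit on the bin edges.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes (in thousands of USD) around the bin boundaries
incomes = pd.Series([15, 49, 50, 89, 90, 137], name="Income(000USD)")

# right=False makes each bin closed on the left: [-inf, 50), [50, 90), [90, inf)
levels = pd.cut(
    incomes,
    bins=[-np.inf, 50, 90, np.inf],
    labels=["Low", "Medium", "High"],
    right=False,
)
print(levels.tolist())
```

Unlike the `apply` version, `pd.cut` returns an ordered categorical column, which keeps the levels sorted as Low, Medium, High in later group-by output.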

In [0]:
mall = mall.drop(['Income(000USD)'], axis=1)
mall.head()

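|   | CustomerID | Gender | Age | Spendingscore | Annual_Income | Income_level |
|---|---|---|---|---|---|---|
| 0 | 1 | male | 19 | 39 | 15000 | Low |
| 1 | 2 | male | 21 | 81 | 15000 | Low |
| 2 | 3 | female | 20 | 6 | 16000 | Low |
| 3 | 4 | female | 23 | 77 | 16000 | Low |
| 4 | 5 | female | 31 | 40 | 17000 | Low |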


In [0]:
mall.tail()

|   | CustomerID | Gender | Age | Spendingscore | Annual_Income | Income_level |
|---|---|---|---|---|---|---|
| 195 | 196 | female | 35 | 79 | 120000 | High |
| 196 | 197 | female | 45 | 28 | 126000 | High |
| 197 | 198 | male | 32 | 74 | 126000 | High |
| 198 | 199 | male | 32 | 18 | 137000 | High |
| 199 | 200 | male | 30 | 83 | 137000 | High |


### Conclusion
The dataset contains 200 customers and five attributes (`CustomerID`, `Gender`, `Age`, annual income, and spending score), with no missing values in any column. For consistency, the income and spending score columns were renamed and the gender field was converted to lowercase. Customer age ranges from 18 to 70 years, with a mean of approximately 39 and a median of 36. Annual income spans \$15,000 to \$137,000, with a mean of about \$60,600 and an interquartile range of \$41,500 to \$78,000. The spending score ranges from 1 to 99, with a mean of about 50 and a standard deviation of about 26, so spending behavior varies widely across customers.

Feature engineering added an `Annual_Income` column expressed in dollars and an `Income_level` column that labels customers as Low (below \$50,000), Medium (\$50,000 to \$89,999), or High (\$90,000 and above). The wide spread in both income and spending score points to possible market segments, but this notebook does not yet test whether distinct customer clusters exist. Future analyses should compare income groups against age and spending scores to identify valuable customer profiles and guide tailored marketing strategies.
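
A comparison of income groups against age and spending score could start from a simple group-by summary. The sketch below is illustrative only: it uses the ten rows shown in the `head()` and `tail()` previews, so it contains no Medium customers and its numbers say nothing about the full dataset.

```python
import pandas as pd

# Rows copied from the head() and tail() previews; not the full 200-row table
sample = pd.DataFrame({
    "Age": [19, 21, 20, 23, 31, 35, 45, 32, 32, 30],
    "Spendingscore": [39, 81, 6, 77, 40, 79, 28, 74, 18, 83],
    "Income_level": ["Low"] * 5 + ["High"] * 5,
})

# Mean, median, and group size of age and spending score per income level
summary = (
    sample.groupby("Income_level")[["Age", "Spendingscore"]]
    .agg(["mean", "median", "count"])
)
print(summary)
```

Running the same `groupby` on `mall` would give per-level figures for all 200 customers, which could then feed a clustering or segmentation step.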