# Students Do: Understanding customers

## Instructions

You are given a dataset containing historical purchase data from 200 customers of an online store. In this activity, you will put your data preprocessing superpowers into action and add some new skills needed to start finding customer clusters.

In [45]:
# Initial imports
import pandas as pd
from pathlib import Path
file_path = Path("../Resources/shopping_data.csv")


In [46]:
# load the csv into a dataframe
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,CustomerID,Previous Shopper,Age,Annual Income,Spending Score (1-100)
0,1,Yes,52,38000,45
1,2,Yes,40,39000,57
2,3,No,57,46000,59
3,4,Yes,54,41000,51
4,5,No,55,45000,53


In [47]:
# Check the data type of each column
df.dtypes

CustomerID                 int64
Previous Shopper          object
Age                        int64
Annual Income              int64
Spending Score (1-100)     int64
dtype: object

In [48]:
# Drop the CustomerID column; it is an identifier, not a feature
df = df.drop(columns=['CustomerID'])
df

Unnamed: 0,Previous Shopper,Age,Annual Income,Spending Score (1-100)
0,Yes,52,38000,45
1,Yes,40,39000,57
2,No,57,46000,59
3,Yes,54,41000,51
4,No,55,45000,53
...,...,...,...,...
195,Yes,76,33000,66
196,Yes,74,36000,72
197,Yes,52,39000,61
198,No,55,40000,59


In [49]:
# drop all rows with any NaN and NaT values
df = df.dropna()
df

Unnamed: 0,Previous Shopper,Age,Annual Income,Spending Score (1-100)
0,Yes,52,38000,45
1,Yes,40,39000,57
2,No,57,46000,59
3,Yes,54,41000,51
4,No,55,45000,53
...,...,...,...,...
195,Yes,76,33000,66
196,Yes,74,36000,72
197,Yes,52,39000,61
198,No,55,40000,59
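
Before dropping rows, it can help to see how many missing values each column holds. The following is a minimal sketch on a small made-up DataFrame; `toy` and its values are illustrative assumptions, not the shopping data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries (not the real shopping data)
toy = pd.DataFrame({
    "Age": [52, np.nan, 57],
    "Annual Income": [38000, 39000, np.nan],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows that have no missing value in any column
toy_clean = toy.dropna()
print(len(toy), "rows before,", len(toy_clean), "rows after")
```

Comparing the row counts before and after `dropna()` is a quick way to confirm how much data the step removed.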


In [50]:
# In the case of supervised learning, we would encode the category values as numbers
class_dict = {'Yes': 1, 'No': 0}
df2 = df.replace({'Previous Shopper': class_dict})
df2.head()

Unnamed: 0,Previous Shopper,Age,Annual Income,Spending Score (1-100)
0,1,52,38000,45
1,1,40,39000,57
2,0,57,46000,59
3,1,54,41000,51
4,0,55,45000,53
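
As an alternative sketch of the same encoding, `Series.map` can apply the dictionary to a single column. The `toy` DataFrame below is a made-up stand-in for the `Previous Shopper` column, not the real data.

```python
import pandas as pd

# Hypothetical toy column mirroring the 'Previous Shopper' values
toy = pd.DataFrame({"Previous Shopper": ["Yes", "No", "Yes"]})

class_dict = {"Yes": 1, "No": 0}

# Series.map returns NaN for any value missing from the dictionary,
# which makes unexpected categories easy to spot
encoded = toy["Previous Shopper"].map(class_dict)
print(encoded.tolist())  # [1, 0, 1]
```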


In [51]:
# But since we don't actually need the Shopper column for unsupervised learning...
# Drop only the 'Previous Shopper' column; 'Spending Score (1-100)' stays as a feature
df3 = df.drop(['Previous Shopper'], axis='columns')
df3.head()

Unnamed: 0,Age,Annual Income,Spending Score (1-100)
0,52,38000,45
1,40,39000,57
2,57,46000,59
3,54,41000,51
4,55,45000,53


In [52]:
from sklearn.preprocessing import StandardScaler

# Standardize the numeric features so each has mean 0 and unit variance
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df3[['Age','Annual Income','Spending Score (1-100)']])

In [53]:
# Wrap the scaled array in a DataFrame, using 'Previous Shopper' as the row index
new_df = pd.DataFrame(scaled_data, index=df['Previous Shopper'])
new_df

Previous Shopper,0,1,2
Yes,-0.295240,-0.118424,-1.625204
Yes,-0.979855,0.051970,-0.053060
No,-0.009984,1.244733,0.208964
Yes,-0.181138,0.392760,-0.839132
No,-0.124086,1.074338,-0.577108
...,...,...,...
Yes,1.073990,-0.970398,1.126048
Yes,0.959887,-0.459214,1.912120
Yes,-0.295240,0.051970,0.470988
No,-0.124086,0.222365,0.208964
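
The scaled output above has the generic column names 0, 1 and 2 because `fit_transform` returns a plain NumPy array. The sketch below shows one possible way to keep the original labels and to check the result of scaling. It uses only the first five `Age` and `Annual Income` rows shown earlier; `toy` and `scaled_df` are illustrative names, not part of the activity.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# First five rows of two columns from the table shown earlier
toy = pd.DataFrame({
    "Age": [52, 40, 57, 54, 55],
    "Annual Income": [38000, 39000, 46000, 41000, 45000],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Passing columns= and index= keeps the original labels on the scaled values
scaled_df = pd.DataFrame(scaled, columns=toy.columns, index=toy.index)

# StandardScaler divides by the population standard deviation (ddof=0),
# so each column should have mean ~0 and standard deviation ~1
print(scaled_df.mean().round(6))
print(scaled_df.std(ddof=0).round(6))
```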


In [54]:
# Save the scaled DataFrame as a new CSV file for further use
# Note: index=False leaves the 'Previous Shopper' index out of the file
file_path = Path("../Resources/shopping_data_cleaned.csv")
new_df.to_csv(file_path, index=False)
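
To see what `index=False` does to a DataFrame whose index carries labels, here is a small round-trip sketch. It writes to an in-memory buffer instead of a file, and the values in `toy` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical scaled values with 'Previous Shopper' as the index
toy = pd.DataFrame(
    [[-0.30, -0.12], [-0.98, 0.05]],
    index=pd.Index(["Yes", "Yes"], name="Previous Shopper"),
)

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read the CSV text back to inspect what was actually saved
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())
```

The reloaded frame holds only the numeric columns. If the `Previous Shopper` labels are needed later, one option is to save with `index=True` instead.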