# Data Preprocessing

Get data -> Missing values -> Categorical data -> Data split -> Feature scaling

In [152]:
# Run as a Jupyter Notebook in VS Code
# Python 3.9.12

# Imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Get the Dataset
Get the data:<br>
Kaggle: https://www.kaggle.com

In [153]:
# read the csv file
dataSet = pd.read_csv('Customers.csv')
dataSet.head(5)

Unnamed: 0,CustomerID,Gender,Age,Annual Income ($),Spending Score (1-100),Profession,Work Experience,Family Size
0,1,Male,19.0,15000.0,39,Healthcare,1,4
1,2,Male,21.0,35000.0,81,Engineer,3,3
2,3,Female,20.0,86000.0,6,Engineer,1,1
3,4,Female,23.0,59000.0,77,Lawyer,0,2
4,5,Female,31.0,38000.0,40,Entertainment,2,6


# Missing Values
Check whether the dataset contains any missing values.

## Data Types

Link: https://yc-kuo.medium.com/ml02-na-f2072615158e

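Data types can be roughly divided into two broad categories, each with two subtypes:<br>
numeric data: continuous & discrete<br>
categorical data: ordinal & nominal<br>
A more precise classification uses four levels of measurement:<br>
numeric data: ratio level & interval level<br>
categorical data: ordinal level & nominal level<br>
The four levels are described below:
1. Ratio level: values can be classified, ordered, added and subtracted, and multiplied and divided.
For example, monthly income: a monthly income of 1,000,000 is 10 times a monthly income of 100,000. Most everyday numeric data is of this kind.

2. Interval level: values can be classified, ordered, added, and subtracted, but not multiplied or divided.
For example, air temperature: 20 °C is not twice as warm as 10 °C. This type is rare: mainly degrees Celsius, degrees Fahrenheit, and some special Likert scales.

3. Ordinal level: values can be classified and ordered.
For example, a satisfaction score on a survey with 5 options from 1 to 5.

4. Nominal level: values can only be classified.
For example, car color, which cannot be ordered.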
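
The difference between ordinal and nominal data can be made explicit in pandas. The following is a minimal sketch on a hypothetical toy frame (the column names `satisfaction` and `car_color` are illustrative and do not come from Customers.csv); it assumes pandas' `Categorical` type is an acceptable way to record the level of measurement.

```python
import pandas as pd

# Hypothetical toy data, not taken from Customers.csv.
toy = pd.DataFrame({
    "satisfaction": ["low", "high", "medium", "low"],  # ordinal
    "car_color": ["red", "blue", "red", "green"],      # nominal
})

# Ordinal: the categories carry an explicit order, so sorting and min/max are meaningful.
toy["satisfaction"] = pd.Categorical(
    toy["satisfaction"], categories=["low", "medium", "high"], ordered=True
)

# Nominal: the categories can only be told apart, not ranked.
toy["car_color"] = pd.Categorical(toy["car_color"])

# Integer codes follow the declared order: low=0, medium=1, high=2.
satisfaction_codes = toy["satisfaction"].cat.codes.tolist()

# max() is defined for the ordered column only.
highest = toy["satisfaction"].max()
color_is_ordered = bool(toy["car_color"].cat.ordered)
```

Declaring the order up front keeps later steps, such as choosing between label encoding and one-hot encoding, tied to what the column actually means.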

In [154]:
# check NaN
dataSet.isna().any()

CustomerID                False
Gender                     True
Age                        True
Annual Income ($)          True
Spending Score (1-100)    False
Profession                 True
Work Experience           False
Family Size               False
dtype: bool

In [155]:
# Missing numeric data -> fill with the column mean
# numeric data -> continuous & discrete
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit_transform returns a 2-D array of shape (n, 1); ravel() flattens it back to one column
dataSet["Annual Income ($)"] = mean_imputer.fit_transform(dataSet[["Annual Income ($)"]]).ravel()
dataSet["Age"] = mean_imputer.fit_transform(dataSet[["Age"]]).ravel()
dataSet.isna().any()

CustomerID                False
Gender                     True
Age                       False
Annual Income ($)         False
Spending Score (1-100)    False
Profession                 True
Work Experience           False
Family Size               False
dtype: bool

In [156]:
# Missing categorical data -> fill with the most frequent value
most_frequent_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
dataSet["Gender"] = most_frequent_imputer.fit_transform(dataSet[["Gender"]]).ravel()
dataSet["Profession"] = most_frequent_imputer.fit_transform(dataSet[["Profession"]]).ravel()
dataSet.isna().any()


CustomerID                False
Gender                    False
Age                       False
Annual Income ($)         False
Spending Score (1-100)    False
Profession                False
Work Experience           False
Family Size               False
dtype: bool
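
The two imputation cells above handle one column at a time. As a possible alternative, `ColumnTransformer` (already imported at the top of the notebook) can apply both strategies in a single step. This is only a sketch on a hypothetical toy frame that borrows the column names of Customers.csv; the values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy frame that mimics some column names of Customers.csv.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Annual Income ($)": [30000.0, 50000.0, np.nan],
    "Gender": ["Male", np.nan, "Male"],
    "Profession": ["Engineer", "Engineer", np.nan],
})

numeric_cols = ["Age", "Annual Income ($)"]
categorical_cols = ["Gender", "Profession"]

# One imputer per group of columns.
imputer = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), numeric_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_cols),
])

# The output columns follow the order of the transformers listed above.
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=numeric_cols + categorical_cols
)
```

Because the combined result mixes numbers and strings, the returned array has object dtype; the numeric columns may need to be cast back with `astype(float)` before further processing.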

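## Label Encoding & One-Hot Encoding
Label encoding -> converts the text of categorical data into numbers so that it can be used in calculations<br>
One-hot encoding -> a model treats numeric labels as ordered values (0 < 1 < 2 < ...). When the categories of a feature have no natural order (e.g. gender), labels such as 0, 1, 2, ... can mislead the model; one-hot encoding avoids this by giving each category its own 0/1 column.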
              

In [157]:
LabelEnc = LabelEncoder()
OnehotEnc = OneHotEncoder()

# LabelEncoder sorts the classes alphabetically: Female -> 0, Male -> 1
dataSet["Gender"] = LabelEnc.fit_transform(dataSet["Gender"])
# One-hot encode the labels; column 0 is Female and column 1 is Male
GenderEnc = pd.DataFrame(
    OnehotEnc.fit_transform(dataSet["Gender"].values.reshape(-1, 1)).toarray(),
    columns=["female", "Male"],
)
# Drop the original Gender column
dataSet_delgender = dataSet.drop(columns="Gender")
# Combine the two DataFrames
dataSet = pd.concat([dataSet_delgender, GenderEnc], axis=1)

# A shorter alternative is pd.get_dummies():
# dataSet = pd.get_dummies(dataSet, columns=["Gender"])

dataSet = pd.get_dummies(dataSet, columns=["Profession"])
dataSet


Unnamed: 0,CustomerID,Age,Annual Income ($),Spending Score (1-100),Work Experience,Family Size,female,Male,Profession_Artist,Profession_Doctor,Profession_Engineer,Profession_Entertainment,Profession_Executive,Profession_Healthcare,Profession_Homemaker,Profession_Lawyer,Profession_Marketing
0,1,19.0,15000.0,39,1,4,0.0,1.0,0,0,0,0,0,1,0,0,0
1,2,21.0,35000.0,81,3,3,0.0,1.0,0,0,1,0,0,0,0,0,0
2,3,20.0,86000.0,6,1,1,1.0,0.0,0,0,1,0,0,0,0,0,0
3,4,23.0,59000.0,77,0,2,1.0,0.0,0,0,0,0,0,0,0,1,0
4,5,31.0,38000.0,40,2,6,1.0,0.0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,1996,71.0,184387.0,40,8,7,1.0,0.0,1,0,0,0,0,0,0,0,0
1996,1997,91.0,73158.0,32,7,7,1.0,0.0,0,1,0,0,0,0,0,0,0
1997,1998,87.0,90961.0,14,9,2,0.0,1.0,0,0,0,0,0,1,0,0,0
1998,1999,77.0,182109.0,4,7,2,0.0,1.0,0,0,0,0,1,0,0,0,0


# Independent Variable and Dependent Variable

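Independent variable -> a condition or factor that can vary on its own and that influences or causes changes in other variables<br>
Dependent variable -> the target variable under study; its value can be observed and changes as the independent variables change<br>

In ML,<br>
we want to find a function that computes the dependent variable from the independent variables:<br>
function(independentV) = dependentV<br>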
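
To make the idea of "finding a function" concrete, here is a minimal sketch on hypothetical toy data generated from y = 2x + 1. `LinearRegression` is chosen purely for illustration; this notebook stops at preprocessing and does not itself train a model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data generated from y = 2x + 1.
independent = np.array([[1.0], [2.0], [3.0], [4.0]])
dependent = np.array([3.0, 5.0, 7.0, 9.0])

# fit() estimates the function from the example pairs.
model = LinearRegression().fit(independent, dependent)

# predict() applies the learned function to an unseen input.
prediction = model.predict(np.array([[5.0]]))[0]
```

Here the learned slope and intercept recover the rule that generated the data, so the prediction for x = 5 is approximately 11.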

In [158]:
IndependentV = dataSet.drop(columns=["Spending Score (1-100)"]).values
DependentV = dataSet["Spending Score (1-100)"].values

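# Splitting the Dataset
Training set -> used to train the model (everyday practice)<br>
Test set -> used to check whether the model's predictions are accurate (the exam)<br>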

In [159]:
# split dataSet
IndependentV_train, IndependentV_test, DependentV_train, DependentV_test = train_test_split(IndependentV, DependentV, test_size=0.2, random_state=0)
# IndependentV_train, IndependentV_test, DependentV_train, DependentV_test
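
What `test_size=0.2` and `random_state=0` do can be checked on a small array. The sketch below uses hypothetical toy data (10 samples, 2 features) instead of the customer data, so the resulting shapes are easy to verify by hand.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: row i of X is [2*i, 2*i + 1], and y[i] = i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 keeps 20% of the rows (2 of 10) for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The same random_state reproduces the same split.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
```

Rows are shuffled before splitting, but each feature row stays paired with its own target value.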

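# Feature Scaling
The ranges of the variables differ greatly. Without feature scaling, variables with large values can drown out the influence of variables with small values on the prediction, e.g. salary vs. age.

Standardization (z-score scaling) -> rescales each feature so that its mean becomes 0 and its standard deviation becomes 1. Only the location and scale change; the shape of the distribution stays the same, so the data do not become Gaussian.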
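
Standardization computes z = (x - mean) / std for each column. The sketch below, on hypothetical toy values for age and income, compares that formula written by hand with `StandardScaler`, which uses the population standard deviation (the same default as `np.std`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is an age, column 1 is an income.
X = np.array([[20.0, 15000.0],
              [30.0, 35000.0],
              [40.0, 86000.0]])

# Manual z-score per column: subtract the column mean, divide by the column std.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler performs the same computation column by column.
scaled = StandardScaler().fit_transform(X)
```

After scaling, both columns are on a comparable scale even though the raw incomes are roughly a thousand times larger than the ages.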

In [160]:
# Feature scaling
Scaler_X = StandardScaler()
IndependentV_train = Scaler_X.fit_transform(IndependentV_train)
IndependentV_test = Scaler_X.transform(IndependentV_test)

In [161]:
# After scaling, the mean should be approximately 0
# and the standard deviation approximately 1
print(IndependentV_train.mean())
print(IndependentV_train.std())

-6.661338147750939e-18
1.0000000000000024
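
Calling `.mean()` and `.std()` without an axis averages over every entry of the array at once. A stricter check looks at each column separately with `axis=0`. The sketch below uses hypothetical random data (not the customer data) and also illustrates why the test set is passed through `transform` rather than `fit_transform`: it is scaled with the mean and standard deviation learned from the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical random data with two features on very different scales.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[40.0, 50000.0], scale=[10.0, 20000.0], size=(80, 2))
X_test = rng.normal(loc=[40.0, 50000.0], scale=[10.0, 20000.0], size=(20, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from the training data
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

# Per-column statistics of the scaled training set.
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)

# The test set is only approximately centered, because it was scaled
# with the training statistics rather than its own.
test_means = X_test_scaled.mean(axis=0)
```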
