# 1. Getting the Dataset

Download the dataset from here: https://www.superdatascience.com/pages/machine-learning

## 1.1. About the Dataset

The dataset contains the following attributes of a company's customers and whether they have bought the company's product.

|Attribute | Description |
|---|---|
|Country | Country of residence of the customer |
|Age | Age of the customer|
|Salary| Salary of the customer |
|Purchased| Whether the customer has purchased the company's product |

We know that:
* Independent variables: Country, Age and Salary
* Dependent variable: Purchased

Machine learning models predict the dependent variable from the independent variables.

# 2. Import Libraries

Install the following essential libraries: `pip install numpy pandas matplotlib`

| Library | Purpose |
|---|---|
|__NumPy__| Numerical computation |
|__Pandas__| Manipulating tabular data |
|__Matplotlib__| MATLAB-like plotting library |

In [1]:
# Verify that the libraries are installed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 3. Importing the Dataset

## 3.1. Setting up the Working Directory
Set the working directory and the dataset filename.

In [2]:
# set absolute path to the dataset directory (working directory)
WD_PATH="C:\\Users\\sapta\\Documents\\GitHub\\MLAAS\\dev_docs\\datasets\\"

# Set the dataset filename 
FILE_NAME="data_prep_dataset.csv"

## 3.2. Load Data 

In [3]:
dataset = WD_PATH + FILE_NAME   # absolute path of the file
df_dataset = pd.read_csv(dataset) # load dataset into DataFrame
df_dataset.head() # print the first 5 samples

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
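
Building the file path by string concatenation only works when `WD_PATH` ends with a separator. As a minimal sketch of an alternative, the path can be assembled with `pathlib`. The directory `datasets` and the name `DATA_DIR` are hypothetical and not part of the original notebook:

```python
from pathlib import Path

# Hypothetical directory; replace it with the actual dataset location.
DATA_DIR = Path("datasets")
FILE_NAME = "data_prep_dataset.csv"

# The / operator joins path components with the correct separator.
dataset_path = DATA_DIR / FILE_NAME
print(dataset_path.name)    # data_prep_dataset.csv
print(dataset_path.suffix)  # .csv
```

`pd.read_csv` accepts a `Path` object directly, so `pd.read_csv(dataset_path)` can stand in for the concatenated string.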


# 4. Dependent & Independent Variable Split

## 4.1. Set the Dependent and Independent Lists
Define the lists of dependent and independent variables.

In [4]:
# Dependent and Independent variable lists 
ATTR_INDEPENDENT=['Country', 'Age', 'Salary']
ATTR_DEPENDENT=['Purchased']
ATTR_CATAGORICAL=['Country']
ATTR_NUMERIC=['Age', 'Salary']

Verify the lists against the dataset columns:

In [5]:
attributes = list(df_dataset.columns) # list of attributes in the dataset

print(f'Attributes: {attributes}')
print(f'Dependent: {ATTR_DEPENDENT}')
print(f'Independent: {ATTR_INDEPENDENT}')

Attributes: ['Country', 'Age', 'Salary', 'Purchased']
Dependent: ['Purchased']
Independent: ['Country', 'Age', 'Salary']
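
The printed lists can also be checked programmatically rather than by eye. The sketch below assumes the column names shown in the output above; the variable names `missing` and `overlap` are illustrative:

```python
# Column names as printed above; in the notebook this would be list(df_dataset.columns).
columns = ['Country', 'Age', 'Salary', 'Purchased']
ATTR_INDEPENDENT = ['Country', 'Age', 'Salary']
ATTR_DEPENDENT = ['Purchased']

# Every declared attribute must exist in the dataset.
missing = set(ATTR_INDEPENDENT + ATTR_DEPENDENT) - set(columns)
# No attribute may be both a feature and the target.
overlap = set(ATTR_INDEPENDENT) & set(ATTR_DEPENDENT)

assert not missing, f"Unknown attributes: {missing}"
assert not overlap, f"Attributes in both lists: {overlap}"
print("Attribute lists are consistent")
```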


## 4.2. Split Dependent and Independent into Matrices

In this phase, the DataFrame `df_dataset` is first split into separate DataFrames (`df_X` and `df_y`). These are later converted to NumPy format: `X` as the _Independent Variable Matrix_ and `y` as the _Dependent Variable Vector_.

In [8]:
# Independent Variables
df_X = pd.DataFrame(df_dataset, columns=ATTR_INDEPENDENT) # extract subset df of independent attr
df_X

Unnamed: 0,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,
5,France,35.0,58000.0
6,Spain,,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0


In [9]:
# Dependent Variables 
df_y = pd.DataFrame(df_dataset, columns=ATTR_DEPENDENT)
df_y

Unnamed: 0,Purchased
0,No
1,Yes
2,No
3,No
4,Yes
5,Yes
6,No
7,Yes
8,No
9,Yes
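
For illustration, the conversion from a DataFrame to a feature matrix and a target vector might look like the following sketch. It uses a three-row toy frame copied from the first rows above, and `DataFrame.to_numpy()` is one possible way to do the conversion:

```python
import pandas as pd

# Toy frame with the first three rows of the dataset shown above.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X = df[["Country", "Age", "Salary"]].to_numpy()  # 2-D independent variable matrix
y = df["Purchased"].to_numpy()                   # 1-D dependent variable vector

print(X.shape)  # (3, 3)
print(y.shape)  # (3,)
```

Because the columns mix strings and numbers, `X` has dtype `object` at this stage; it becomes numeric only after the categorical column is encoded.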


# 5. Handling Missing Data
* Removing every observation that contains missing data is NOT recommended.
* The recommended process is to replace missing data with a __statistical aggregate__ of the concerned column (e.g., mean, median, mode).

In [10]:
col_with_na = df_X.columns[df_X.isna().any()].tolist()  # columns (List) with missing value 
non_catagrical_cols_with_na=list(set(col_with_na) - set(ATTR_CATAGORICAL))
catagorical_cols_with_na = list(set(col_with_na).intersection(ATTR_CATAGORICAL))

# Replace missing data with the column mean for each NON-CATEGORICAL column with missing data.
for col in non_catagrical_cols_with_na:
    df_X[col]=df_X[col].fillna(df_X[col].mean())

# Drop rows with missing data in any CATEGORICAL column, and keep df_y aligned with df_X.
df_X.dropna(subset=catagorical_cols_with_na, inplace=True)
df_y = df_y.loc[df_X.index]

print(df_X)

   Country        Age        Salary
0   France  44.000000  72000.000000
1    Spain  27.000000  48000.000000
2  Germany  30.000000  54000.000000
3    Spain  38.000000  61000.000000
4  Germany  40.000000  63777.777778
5   France  35.000000  58000.000000
6    Spain  38.777778  52000.000000
7   France  48.000000  79000.000000
8  Germany  50.000000  83000.000000
9   France  37.000000  67000.000000
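
The mean is only one of the aggregates mentioned above. As a hedged sketch, the median could be used through scikit-learn's `SimpleImputer`. The toy values below are illustrative and not the full dataset:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative numeric frame with one missing value per column.
df = pd.DataFrame({
    "Age": [44.0, 27.0, 30.0, np.nan],
    "Salary": [72000.0, 48000.0, np.nan, 61000.0],
})

# strategy can be "mean", "median" or "most_frequent" (the mode).
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```

The median is less sensitive to outliers than the mean, which can matter for skewed columns such as salaries.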


# 6. Encoding of Categorical Attributes

In [11]:
ATTR_CATAGORICAL  # list of categorical attributes

['Country']

In [12]:
df_X[ATTR_CATAGORICAL]

Unnamed: 0,Country
0,France
1,Spain
2,Germany
3,Spain
4,Germany
5,France
6,Spain
7,France
8,Germany
9,France


## Label Encoding 

In [13]:
from sklearn.preprocessing import LabelEncoder

In [14]:
for col in ATTR_CATAGORICAL:
    le = LabelEncoder()
    df_X[col] = le.fit_transform(df_X[col])

print(df_X)

   Country        Age        Salary
0        0  44.000000  72000.000000
1        2  27.000000  48000.000000
2        1  30.000000  54000.000000
3        2  38.000000  61000.000000
4        1  40.000000  63777.777778
5        0  35.000000  58000.000000
6        2  38.777778  52000.000000
7        0  48.000000  79000.000000
8        1  50.000000  83000.000000
9        0  37.000000  67000.000000
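
To see which integer stands for which country, the fitted encoder's `classes_` attribute can be inspected. This is a small sketch on a toy list of the same three country names:

```python
from sklearn.preprocessing import LabelEncoder

countries = ["France", "Spain", "Germany", "Spain"]
le = LabelEncoder()
codes = le.fit_transform(countries)

# classes_ holds the original labels in sorted order; the position is the integer code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)  # {'France': 0, 'Germany': 1, 'Spain': 2}
```

This matches the output above, where France is encoded as 0, Germany as 1 and Spain as 2.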


In [22]:
# Build one-hot (dummy) columns from the label-encoded column
col='Country' # attribute to one-hot encode
add_col = pd.get_dummies(df_X[col], prefix=col).astype(np.int8) # one-hot encode as integers
add_col.drop(columns=list(add_col.columns)[0], inplace=True) # drop the first dummy column; it is implied by the others
print(add_col)

   Country_1  Country_2
0          0          0
1          0          1
2          1          0
3          0          1
4          1          0
5          0          0
6          0          1
7          0          0
8          1          0
9          0          0


In [23]:
# Add the dummy columns to the independent variable DataFrame (the label-encoded 'Country' column is kept as well)
df_X=df_X.join(add_col)
df_X

Unnamed: 0,Country,Age,Salary,Country_1,Country_2
0,0,44.0,72000.0,0,0
1,2,27.0,48000.0,0,1
2,1,30.0,54000.0,1,0
3,2,38.0,61000.0,0,1
4,1,40.0,63777.777778,1,0
5,0,35.0,58000.0,0,0
6,2,38.777778,52000.0,0,1
7,0,48.0,79000.0,0,0
8,1,50.0,83000.0,1,0
9,0,37.0,67000.0,0,0
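
In the result above, the label-encoded `Country` column remains next to its dummy columns, so the same information is present twice. One possible alternative, sketched here on a toy frame, is to let `pd.get_dummies` replace the column in a single step:

```python
import pandas as pd

# Toy frame with the first three rows of the dataset.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
})

# columns=[...] replaces the listed columns with their dummy columns;
# drop_first=True omits the first category, which the others imply.
encoded = pd.get_dummies(df, columns=["Country"], drop_first=True, dtype=int)
print(list(encoded.columns))  # ['Age', 'Country_Germany', 'Country_Spain']
```

With this approach no separate label-encoding pass is needed for the independent variables, and the dummy column names carry the country names instead of integer codes.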


## Label Encoding of the Dependent Variable

In [25]:
le = LabelEncoder()
y = le.fit_transform(df_y.values.ravel()) # flatten to 1-D; LabelEncoder expects a 1-D array
print(y)

[0 1 0 0 1 1 0 1 0 1]
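
The fitted encoder can also map predictions back to the original labels. A brief sketch, using the first five `Purchased` values from the dataset:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(["No", "Yes", "No", "No", "Yes"])
print(y.tolist())  # [0, 1, 0, 0, 1]

# inverse_transform converts integer codes back to the original labels.
labels = le.inverse_transform([1, 0]).tolist()
print(labels)  # ['Yes', 'No']
```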


# 7. Train-Test Split

In [26]:
from sklearn.model_selection import train_test_split

In [27]:
X=np.array(df_X) # convert from DF (df_X) to np.array (X) 
# y is already in np.array format

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [30]:
print(X_train)
print(X_test)
print(y_train)
print(y_test)

[[0.00000000e+00 4.40000000e+01 7.20000000e+04 0.00000000e+00
  0.00000000e+00]
 [0.00000000e+00 3.70000000e+01 6.70000000e+04 0.00000000e+00
  0.00000000e+00]
 [0.00000000e+00 3.50000000e+01 5.80000000e+04 0.00000000e+00
  0.00000000e+00]
 [1.00000000e+00 3.00000000e+01 5.40000000e+04 1.00000000e+00
  0.00000000e+00]
 [2.00000000e+00 3.87777778e+01 5.20000000e+04 0.00000000e+00
  1.00000000e+00]
 [1.00000000e+00 4.00000000e+01 6.37777778e+04 1.00000000e+00
  0.00000000e+00]
 [1.00000000e+00 5.00000000e+01 8.30000000e+04 1.00000000e+00
  0.00000000e+00]
 [2.00000000e+00 2.70000000e+01 4.80000000e+04 0.00000000e+00
  1.00000000e+00]]
[[2.0e+00 3.8e+01 6.1e+04 0.0e+00 1.0e+00]
 [0.0e+00 4.8e+01 7.9e+04 0.0e+00 0.0e+00]]
[0 1 1 0 0 1 0 1]
[0 1]
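
Without a fixed seed, each run of `train_test_split` produces a different split. The sketch below shows two optional parameters, `random_state` for reproducibility and `stratify` to preserve the class balance. The feature matrix here is a placeholder, and the seed value is arbitrary:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)               # placeholder feature matrix
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])   # encoded target as printed above

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(sorted(y_test.tolist()))      # [0, 1]
```

With five samples per class and a test size of two, stratification places one sample of each class in the test set.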


# 8. Feature Scaling

In [31]:
from sklearn.preprocessing import StandardScaler

In [32]:
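sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train) # fit the scaler on the training set and scale it
X_test = sc_X.fit_transform(X_test) # note: this refits the scaler on the test set; sc_X.transform(X_test) would reuse the training statistics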

In [33]:
# print values 
print(X_train)
print(X_test)
print(y_train)
print(y_test)

[[-1.12089708  0.91209148  0.8997527  -0.77459667 -0.57735027]
 [-1.12089708 -0.10493088  0.43965189 -0.77459667 -0.57735027]
 [-1.12089708 -0.3955087  -0.38852958 -0.77459667 -0.57735027]
 [ 0.16012815 -1.12195324 -0.75661023  1.29099445 -0.57735027]
 [ 1.44115338  0.15336051 -0.94065055 -0.77459667  1.73205081]
 [ 0.16012815  0.33093585  0.14314248  1.29099445 -0.57735027]
 [ 0.16012815  1.78382494  1.91197449  1.29099445 -0.57735027]
 [ 1.44115338 -1.55781997 -1.3087312  -0.77459667  1.73205081]]
[[ 1. -1. -1.  0.  1.]
 [-1.  1.  1.  0. -1.]]
[0 1 1 0 0 1 0 1]
[0 1]
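
The scaled `X_test` above contains only -1, 0 and 1 because the scaler was refitted on the two test rows. A common alternative is to fit on the training set only and reuse those statistics for the test set. A minimal sketch with illustrative single-column data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative single-feature data, not taken from the dataset.
X_train = np.array([[30.0], [40.0], [50.0]])
X_test = np.array([[40.0], [60.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from the training set
X_test_scaled = scaler.transform(X_test)        # applies the training mean and std

print(scaler.mean_)  # [40.]
print(X_test_scaled.ravel())
```

Scaled this way, the test value 40 maps to 0 because it equals the training mean, and the test set is expressed on the same scale the model was trained on.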
