# Typical Data Preparation steps

 - Getting the necessary Python libraries
 - Loading the dataset 
 - Dealing with **Missing values** & **Categorical features** 
 - Splitting the data into **Training sets** & **Testing sets**
 - Normalization of features

### Getting the necessary Python libraries

In [1]:
import numpy as np  
import pandas as pd

### Loading the dataset

In [2]:
dataset = pd.read_csv('loans.csv')  # Store the dataset in a DataFrame

In [3]:
print(dataset)

     City   Age   Revenue Approved
0  Medina  25.0   65000.0      Yes
1   Mecca  30.0   81000.0       No
2  Riyadh  33.0       NaN      Yes
3  Medina  39.0  100000.0       No
4   Mecca  28.0   91000.0      Yes
5  Riyadh   NaN   66000.0       No
6  Medina  40.0   98000.0      Yes
7   Mecca  34.0   86000.0      Yes
8  Riyadh  25.0   70000.0       No
9   Mecca  24.0   62000.0      Yes


In [4]:
# [:, :-1] selects all the rows and all the columns except the last one (the features)
X = dataset.iloc[:, :-1].values

# [:, 3] selects all the rows of column 3 (the target column, Approved)
y = dataset.iloc[:, 3].values

In [5]:
print (X)
print ()
print (y)

[['Medina' 25.0 65000.0]
 ['Mecca' 30.0 81000.0]
 ['Riyadh' 33.0 nan]
 ['Medina' 39.0 100000.0]
 ['Mecca' 28.0 91000.0]
 ['Riyadh' nan 66000.0]
 ['Medina' 40.0 98000.0]
 ['Mecca' 34.0 86000.0]
 ['Riyadh' 25.0 70000.0]
 ['Mecca' 24.0 62000.0]]

['Yes' 'No' 'Yes' 'No' 'Yes' 'No' 'Yes' 'Yes' 'No' 'Yes']
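
Selecting columns by position gives wrong results without any error if the column order in the file ever changes. A minimal sketch of the same split done by column name follows. It rebuilds the table printed above in memory so that it runs without loans.csv; the names `df`, `X_named`, and `y_named` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same ten rows as the printed dataset, built in memory instead of read from loans.csv
df = pd.DataFrame({
    'City': ['Medina', 'Mecca', 'Riyadh', 'Medina', 'Mecca',
             'Riyadh', 'Medina', 'Mecca', 'Riyadh', 'Mecca'],
    'Age': [25, 30, 33, 39, 28, np.nan, 40, 34, 25, 24],
    'Revenue': [65000, 81000, np.nan, 100000, 91000,
                66000, 98000, 86000, 70000, 62000],
    'Approved': ['Yes', 'No', 'Yes', 'No', 'Yes',
                 'No', 'Yes', 'Yes', 'No', 'Yes'],
})

# Features: every column except the target; target: the 'Approved' column
X_named = df.drop(columns='Approved').values
y_named = df['Approved'].values

print(X_named.shape)  # (10, 3)
print(y_named.shape)  # (10,)
```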


### Dealing with missing values

Rows with missing values can be dropped with the dropna method >>> df.dropna(axis=0)

Similarly, setting the axis argument to 1 drops every column that contains at least one NaN >>> df.dropna(axis=1)

Drop only the rows in which every column is NaN >>> df.dropna(how='all')

Keep only the rows that have at least 2 non-NaN values >>> df.dropna(thresh=2)

Drop only the rows where NaN appears in specific columns (here: 'C') >>> df.dropna(subset=['C'])

Note: df is the DataFrame. Each call returns a new DataFrame and leaves df unchanged unless inplace=True is passed.
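
The dropna variants listed above can be compared on a small example. The DataFrame below and its columns 'A', 'B', 'C' are made up purely for illustration; they are not the loans dataset.

```python
import numpy as np
import pandas as pd

# Toy data: row 0 is complete, row 2 is entirely NaN
df = pd.DataFrame({
    'A': [1.0, 4.0, np.nan, 7.0],
    'B': [2.0, np.nan, np.nan, np.nan],
    'C': [3.0, 6.0, np.nan, np.nan],
})

rows_any = df.dropna(axis=0)          # drop rows containing any NaN
cols_any = df.dropna(axis=1)          # drop columns containing any NaN
rows_all = df.dropna(how='all')       # drop rows where every value is NaN
rows_thresh = df.dropna(thresh=2)     # keep rows with at least 2 non-NaN values
rows_subset = df.dropna(subset=['C']) # drop rows where column 'C' is NaN

print(rows_any.index.tolist())     # [0]
print(cols_any.shape)              # (4, 0): every column has a NaN
print(rows_all.index.tolist())     # [0, 1, 3]
print(rows_thresh.index.tolist())  # [0, 1]
print(rows_subset.index.tolist())  # [0, 1]
```

On a dataset as small as the one used here, dropping rows would discard 2 of the 10 samples, which is one reason to impute the missing values instead.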

In [6]:
from sklearn.impute import SimpleImputer

In [7]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

In [8]:
imputer.fit(X[:,[1,2]])

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

In [9]:
X[:,1:3]= imputer.transform(X[:,1:3])

In [10]:
print(X)

[['Medina' 25.0 65000.0]
 ['Mecca' 30.0 81000.0]
 ['Riyadh' 33.0 79888.88888888889]
 ['Medina' 39.0 100000.0]
 ['Mecca' 28.0 91000.0]
 ['Riyadh' 30.88888888888889 66000.0]
 ['Medina' 40.0 98000.0]
 ['Mecca' 34.0 86000.0]
 ['Riyadh' 25.0 70000.0]
 ['Mecca' 24.0 62000.0]]
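
The two filled-in values can be checked by hand: with strategy='mean', each NaN is replaced by the mean of the non-missing entries in its column. The sketch below recomputes those means with NumPy on a copy of the Age and Revenue columns; the names `numeric`, `column_means`, and `imputed` are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Revenue columns from the dataset above, NaN marking the missing entries
numeric = np.array([
    [25.0, 65000.0], [30.0, 81000.0], [33.0, np.nan], [39.0, 100000.0],
    [28.0, 91000.0], [np.nan, 66000.0], [40.0, 98000.0], [34.0, 86000.0],
    [25.0, 70000.0], [24.0, 62000.0],
])

# Mean of each column, ignoring NaN: 278 / 9 for Age, 719000 / 9 for Revenue
column_means = np.nanmean(numeric, axis=0)

imputed = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(numeric)

print(column_means)
print(imputed[5, 0], imputed[2, 1])  # the two values that were NaN
```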


### Dealing with categorical variables

In [11]:
from sklearn.preprocessing import LabelEncoder 

In [12]:
labelencoder_X = LabelEncoder()  # Encodes the categorical feature (City) in X as integers

In [13]:
X[:,0] = labelencoder_X.fit_transform(X[:,0]) 

In [14]:
print(X)

[[1 25.0 65000.0]
 [0 30.0 81000.0]
 [2 33.0 79888.88888888889]
 [1 39.0 100000.0]
 [0 28.0 91000.0]
 [2 30.88888888888889 66000.0]
 [1 40.0 98000.0]
 [0 34.0 86000.0]
 [2 25.0 70000.0]
 [0 24.0 62000.0]]


In [15]:
from sklearn.preprocessing import OneHotEncoder

In [16]:
onehotencoder = OneHotEncoder(categorical_features=[0]) 

In [17]:
X = onehotencoder.fit_transform(X).toarray() 

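Note: current versions of OneHotEncoder can encode string categories directly, so the LabelEncoder step on X is not required. The categorical_features argument used above was deprecated in scikit-learn 0.20 and removed in 0.22; on newer versions, the column to encode is selected with ColumnTransformer instead.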


In [18]:
print(X)

[[0.00000000e+00 1.00000000e+00 0.00000000e+00 2.50000000e+01
  6.50000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.00000000e+01
  8.10000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.30000000e+01
  7.98888889e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 3.90000000e+01
  1.00000000e+05]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 2.80000000e+01
  9.10000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.08888889e+01
  6.60000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
  9.80000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.40000000e+01
  8.60000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 2.50000000e+01
  7.00000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 2.40000000e+01
  6.20000000e+04]]
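
On scikit-learn 0.22 and later, OneHotEncoder no longer accepts categorical_features, so the call above raises an error. A hedged sketch of one way to get the same column layout on current versions uses ColumnTransformer. The array `X_demo` holds three sample rows in the style of the dataset, and the step name 'city' is arbitrary.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three rows in the same layout as X: City (string), Age, Revenue
X_demo = np.array([
    ['Medina', 25.0, 65000.0],
    ['Mecca', 30.0, 81000.0],
    ['Riyadh', 33.0, 70000.0],
], dtype=object)

# One-hot encode column 0 and pass the other columns through unchanged.
# sparse_threshold=0 asks for a dense array rather than a sparse matrix.
ct = ColumnTransformer(
    [('city', OneHotEncoder(), [0])],
    remainder='passthrough',
    sparse_threshold=0,
)
X_encoded = ct.fit_transform(X_demo).astype(float)

# Categories are sorted alphabetically: Mecca, Medina, Riyadh
print(X_encoded)
```

The one-hot columns come first, followed by the passed-through Age and Revenue columns, which matches the layout of the output printed above. No LabelEncoder step is needed because the encoder handles the strings itself.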


In [19]:
labelencoder_y = LabelEncoder()  # Encodes the target labels in y as integers

In [20]:
y = labelencoder_y.fit_transform(y)

In [21]:
print(y)

[1 0 1 0 1 0 1 1 0 1]
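
To see which integer stands for which label, the fitted encoder exposes the classes_ attribute, and inverse_transform maps integers back to the original strings. A short sketch, with `labels` copied from the target column above and `encoder` as an illustrative name:

```python
from sklearn.preprocessing import LabelEncoder

labels = ['Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, and each position is the integer code of that label
print(encoder.classes_)                        # ['No' 'Yes'] -> 'No' is 0, 'Yes' is 1
print(encoder.inverse_transform(encoded[:3]))  # ['Yes' 'No' 'Yes']
```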


### Splitting the Data

In [22]:
from sklearn.model_selection import train_test_split 

In [23]:
# Test size = 20%, training size = 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
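
Without a fixed seed, train_test_split produces a different split on every run. The sketch below shows two optional arguments that are not used in the original notebook: random_state makes the split reproducible, and stratify keeps both classes represented in the test set. `X_demo` is placeholder feature data, and the seed 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)               # 10 placeholder samples, 2 features
y_demo = np.array([1, 0, 1, 0, 1, 0, 1, 1, 0, 1])   # encoded target from above

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Same seed, same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
print(np.array_equal(X_te, X_te2))  # True
```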

### Normalization

In [24]:
from sklearn.preprocessing import StandardScaler 

In [25]:
sc_X = StandardScaler()

In [26]:
X_train = sc_X.fit_transform(X_train) 

In [27]:
X_test = sc_X.transform(X_test)
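
The scaler is fitted on the training set only, and the test set is then transformed with the training mean and standard deviation, so no information from the test data leaks into preprocessing. A small sketch of what StandardScaler computes, using a few Age and Revenue rows from the dataset as illustrative `train` and `test` arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[25.0, 65000.0], [30.0, 81000.0],
                  [39.0, 100000.0], [28.0, 91000.0]])
test = np.array([[40.0, 98000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_)                # per-column training means
print(train_scaled.mean(axis=0))   # approximately 0 for each column
print(train_scaled.std(axis=0))    # approximately 1 for each column

# Equivalent manual computation: (x - mean) / std, with the training statistics
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, manual))  # True
```

Note that in the notebook above the scaler is applied to every column of X, so the one-hot columns are standardized along with Age and Revenue.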