## Data Preprocessing
This notebook covers the following preprocessing techniques:
- Removing and imputing missing values.
- Normalization and standardization of numerical data.
- Encoding non-numerical data.

<div class="alert alert-info">Read the dataset.</div>

<div class="alert alert-danger">Exercise 1: Write a code snippet to read the hepatitis dataset from the local drive.</div>

In [26]:
import pandas as pd
file = "hepatitis.csv"
df = pd.read_csv(file)

<div class="alert alert-info">Removing the missing values.</div>

In [27]:
# Check the percentage of missing values in each column
missing_percentage = (df.isna().sum() / len(df)) * 100
missing_percentage

age                 0.000000
sex                 0.000000
steroid             0.645161
antivirals          0.000000
fatigue             0.645161
malaise             0.645161
anorexia            0.645161
liver_big           6.451613
liver_firm          7.096774
spleen_palpable     3.225806
spiders             3.225806
ascites             3.225806
varices             3.225806
bilirubin           3.870968
alk_phosphate      18.709677
sgot                2.580645
albumin            10.322581
protime            43.225806
histology           0.000000
class               0.000000
dtype: float64

<div class="alert alert-danger">Exercise 2: Which column has the highest number of missing values?</div>

In [28]:
# Drop every column that contains at least one missing value
df1 = df.dropna(axis=1)
print(df.shape)
print(df1.shape)

(155, 20)
(155, 5)


In [29]:
# Drop every row that contains at least one missing value
df1 = df.dropna(axis=0)
print(df.shape)
print(df1.shape)

(155, 20)
(80, 20)


In [30]:
# Drop only the rows where 'protime' is missing
df1 = df.copy()
df1.dropna(subset=['protime'], axis=0, inplace=True)
df1.shape

(88, 20)

In [31]:
# Keep only the columns in which at least 90% of the values are present
df1 = df.copy()
df1.dropna(thresh=0.9*len(df), axis=1, inplace=True)
df1.shape

(155, 17)
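
With `axis=1`, the `thresh` argument is the minimum number of non-missing values a column needs in order to be kept. The following is a minimal sketch on a hypothetical toy DataFrame (the column names and values are made up for illustration and are not part of the hepatitis data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 rows, with 0, 1, and 3 missing values per column
toy = pd.DataFrame({
    "full": [1.0, 2.0, 3.0, 4.0],
    "one_missing": [1.0, np.nan, 3.0, 4.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0],
})

# Keep only the columns with at least 3 non-missing values (75% of 4 rows)
min_non_missing = 3
kept = toy.dropna(thresh=min_non_missing, axis=1)
print(list(kept.columns))
```

Here `mostly_missing` has only one non-missing value, so it is the only column dropped.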

<div class="alert alert-danger">Exercise 3: What is the outcome of running the following code?</div>

In [32]:
df2=df.copy()
df2.dropna(thresh=0.95*len(df2),axis=1,inplace=True)

<div class="alert alert-info">Impute the missing values for numeric data.</div>

In [33]:
# Mean imputation for numerical data
df1 = df.copy()
mean_value = df1['protime'].mean()
df1['protime'] = df1['protime'].fillna(mean_value)
missing_count = df1['protime'].isna().sum()
print(missing_count)

0


In [34]:
# Median imputation for numerical data
df1 = df.copy()
median_value = df1['protime'].median()
df1['protime'] = df1['protime'].fillna(median_value)
missing_count = df1['protime'].isna().sum()
print(missing_count)

0
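
Mean and median imputation can give quite different fill values when a column contains outliers. The sketch below uses a small made-up Series (not taken from the hepatitis data) to show the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry and one outlier (100.0)
values = pd.Series([10.0, 12.0, 11.0, np.nan, 100.0])

# Fill the missing entry with the mean and with the median
mean_filled = values.fillna(values.mean())
median_filled = values.fillna(values.median())

print(mean_filled.iloc[3])    # pulled upward by the outlier
print(median_filled.iloc[3])  # stays close to the typical values
```

The median is usually the safer choice for skewed columns, because a single extreme value does not move it much.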


<div class="alert alert-info">Impute the missing values for categorical data.</div>

In [35]:
# Mode imputation for categorical data
mode_value = df['steroid'].mode()[0]
df['steroid'] = df['steroid'].fillna(mode_value)
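
`Series.mode()` returns a Series rather than a single value, because several values can tie for most frequent; indexing with `[0]` picks the first of them. A minimal sketch on hypothetical categorical data:

```python
import pandas as pd

# Hypothetical categorical column with one missing entry
answers = pd.Series(["yes", "no", "yes", None])

# mode() ignores missing values and returns the most frequent value(s)
mode_value = answers.mode()[0]
filled = answers.fillna(mode_value)

print(mode_value)
print(filled.tolist())
```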


<div class="alert alert-info">Normalization of numerical features</div>

In [36]:
# First replace the missing values with the mean of the column
mean_value = df['protime'].mean()
df['protime'] = df['protime'].fillna(mean_value)

In [44]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['protime'] = scaler.fit_transform(df[['protime']])

<div class="alert alert-danger">Exercise 4: Given the code below, what are the values of "a" and "b"? (Use print to see the values.)</div>

In [39]:
a=(df['protime'].min())
b=(df['protime'].max())

In [40]:
print(a)
print(b)

0.0
1.0
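
Min-max normalization maps each value `x` to `(x - min) / (max - min)`, which is why the smallest value becomes 0 and the largest becomes 1. The sketch below checks this formula against `MinMaxScaler` on a hypothetical array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data, shaped (n_samples, 1) as scikit-learn expects
x = np.array([[2.0], [4.0], [6.0], [10.0]])

# Manual min-max normalization
manual = (x - x.min()) / (x.max() - x.min())

# The same transformation with scikit-learn
scaled = MinMaxScaler().fit_transform(x)

print(manual.ravel())
print(np.allclose(manual, scaled))
```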


<div class="alert alert-info">Standardization of numerical features</div>

In [43]:
import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['protime'] = scaler.fit_transform(df[['protime']])
# Check the mean and standard deviation (np.std uses the population formula, ddof=0)
print(round(df['protime'].mean(), 2))
std = np.std(df['protime'])
print(round(std, 2))

0.0
1.0
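
Standardization maps each value `x` to `(x - mean) / std`. `StandardScaler` uses the population standard deviation (`ddof=0`), which matches `np.std` but not the pandas default `Series.std()` (`ddof=1`). A minimal sketch on a hypothetical Series:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric column
s = pd.Series([2.0, 4.0, 6.0, 8.0])

# Manual z-score using the population standard deviation (ddof=0)
z_manual = (s - s.mean()) / s.std(ddof=0)

# The same transformation with scikit-learn
z_sklearn = StandardScaler().fit_transform(s.to_numpy().reshape(-1, 1)).ravel()

print(np.allclose(z_manual, z_sklearn))
print(round(float(z_manual.std(ddof=0)), 2))  # population std of the z-scores
print(round(float(z_manual.std()), 2))        # sample std (ddof=1) is slightly larger
```

This explains why checking the result with `Series.std()` would not give exactly 1 on a small dataset.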


<div class="alert alert-info">Handling categorical data (creating dummy variables)</div>

In [42]:
dummy_df = pd.get_dummies(df, columns=['steroid'], drop_first=True)
dummy_df.head()

Unnamed: 0,age,sex,antivirals,fatigue,malaise,anorexia,liver_big,liver_firm,spleen_palpable,spiders,ascites,varices,bilirubin,alk_phosphate,sgot,albumin,protime,histology,class,steroid_True
0,30,male,False,False,False,False,False,False,False,False,False,False,1.0,85.0,18.0,4.0,6.478146e-16,False,live,False
1,50,female,False,True,False,False,False,False,False,False,False,False,0.9,135.0,42.0,3.5,6.478146e-16,False,live,False
2,78,female,False,True,False,False,True,False,False,False,False,False,0.7,96.0,32.0,4.0,6.478146e-16,False,live,True
3,31,female,True,False,False,False,True,False,False,False,False,False,0.7,46.0,52.0,4.0,1.058919,False,live,True
4,34,female,False,False,False,False,True,False,False,False,False,False,1.0,,200.0,4.0,6.478146e-16,False,live,True
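
With `drop_first=True`, `get_dummies` omits the first category of each encoded column, since it is implied when all the remaining dummy columns are false. A small sketch on a hypothetical three-category column shows the effect:

```python
import pandas as pd

# Hypothetical column with three categories
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One dummy column per category
full = pd.get_dummies(toy, columns=["color"])

# Drop the first category (alphabetically, "blue") to avoid a redundant column
reduced = pd.get_dummies(toy, columns=["color"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```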


<div class="alert alert-info">Handling categorical data (label encoding)</div>

In [20]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['sex'] = encoder.fit_transform(df['sex'])
df.head()

Unnamed: 0,age,sex,steroid,antivirals,fatigue,malaise,anorexia,liver_big,liver_firm,spleen_palpable,spiders,ascites,varices,bilirubin,alk_phosphate,sgot,albumin,protime,histology,class
0,30,1,False,False,False,False,False,False,False,False,False,False,False,1.0,85.0,18.0,4.0,6.478146e-16,False,live
1,50,0,False,False,True,False,False,False,False,False,False,False,False,0.9,135.0,42.0,3.5,6.478146e-16,False,live
2,78,0,True,False,True,False,False,True,False,False,False,False,False,0.7,96.0,32.0,4.0,6.478146e-16,False,live
3,31,0,True,True,False,False,False,True,False,False,False,False,False,0.7,46.0,52.0,4.0,1.058919,False,live
4,34,0,True,False,False,False,False,True,False,False,False,False,False,1.0,,200.0,4.0,6.478146e-16,False,live


In [21]:
# Sample ordinal data
Cancer_risk = ['Low', 'Medium', 'High', 'Low', 'High', 'Medium', 'Low']

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the data using label encoding
# (codes follow alphabetical order: High=0, Low=1, Medium=2)
encoded_data = label_encoder.fit_transform(Cancer_risk)

print("Cancer_risk:", Cancer_risk)
print("Encoded:", encoded_data)

Cancer_risk: ['Low', 'Medium', 'High', 'Low', 'High', 'Medium', 'Low']
Encoded: [1 2 0 1 0 2 1]
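
`LabelEncoder` assigns integer codes in sorted order of the labels, which is why 'High' maps to 0 above and the natural Low < Medium < High order is lost. One possible alternative, sketched here with an ordered pandas `Categorical` (the `risk_order` list is an assumption about the intended ranking), keeps that order:

```python
import pandas as pd

cancer_risk = ['Low', 'Medium', 'High', 'Low', 'High', 'Medium', 'Low']

# Assumed ranking of the categories, from lowest to highest risk
risk_order = ['Low', 'Medium', 'High']

# An ordered categorical assigns codes by position in risk_order
ordered = pd.Categorical(cancer_risk, categories=risk_order, ordered=True)
codes = ordered.codes.tolist()

print(codes)
```

The resulting codes are Low=0, Medium=1, High=2, so numeric comparisons between codes agree with the risk ranking.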
