# Data preprocessing

## Dataset: Heart Disease Prediction
The dataset is publicly available on the Kaggle website and comes from an ongoing cardiovascular study of
residents of the town of Framingham, Massachusetts. The classification goal is to predict whether a patient
has a 10-year risk of future coronary heart disease (CHD). The dataset provides the patients’ information. It
includes over 4,000 records and 15 attributes.

### Importing Dataset with Pandas


In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv(r'/content/sample_data/framingham.csv')
df.head()

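| | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |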


In [3]:
# X holds every column except the last; the last column (TenYearCHD) is the target we want to predict.
X = df.iloc[:, :-1]
y = df.iloc[:, -1]  # y holds the last column (the target)

X = np.array(X)
y = np.array(y)

## Standardizing Data

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

Standardizing shifts and rescales each attribute so that it has a mean of 0 and a standard deviation of 1. The shape of the distribution is unchanged, so an attribute that was Gaussian to begin with becomes standard Gaussian. For this, we use the StandardScaler class.
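To make the transformation concrete, the following is a minimal sketch of the z-score computation on a toy matrix. The values and the names `X_toy`, `col_mean`, `col_std`, and `X_std` are made up for illustration. Missing entries are skipped when computing the statistics and left as NaN in the result, which mirrors how StandardScaler treats NaNs.

```python
import numpy as np

# Toy feature matrix (hypothetical values): rows are samples, columns are features.
# One entry is missing, as happens in the real dataset.
X_toy = np.array([
    [39.0, 195.0],
    [46.0, 250.0],
    [48.0, np.nan],
    [61.0, 225.0],
])

# Per-column mean and population standard deviation, ignoring NaNs.
col_mean = np.nanmean(X_toy, axis=0)
col_std = np.nanstd(X_toy, axis=0)

# z-score: subtract the column mean, then divide by the column standard deviation.
X_std = (X_toy - col_mean) / col_std

print(np.nanmean(X_std, axis=0))  # approximately 0 for every column
print(np.nanstd(X_std, axis=0))   # approximately 1 for every column
```

The missing entry stays NaN after scaling, so it still has to be handled before the data is passed to an estimator.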

In [4]:
from sklearn.preprocessing import StandardScaler

In [5]:
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
rescaledX

array([[ 1.1531919 , -1.23495068,  1.98206814, ...,  0.28629879,
         0.342704  , -0.20732048],
       [-0.86715836, -0.41825733,  0.02064407, ...,  0.71771073,
         1.59008688, -0.24906213],
       [ 1.1531919 , -0.18491638, -0.96006796, ..., -0.11324749,
        -0.0730903 , -0.49951204],
       ...,
       [-0.86715836, -0.18491638,  0.02064407, ..., -0.93194969,
         0.67533943,  0.16835438],
       [-0.86715836, -0.65159829, -0.96006796, ..., -1.62809168,
         0.84165715,         nan],
       [-0.86715836,  0.28176554,  0.02064407, ..., -1.06186351,
         0.342704  ,  1.04492906]])

In [6]:
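from sklearn.preprocessing import MinMaxScaler
# Alternative to standardization: rescale every feature to the range [0, 5].
min_max_scaler = MinMaxScaler(feature_range=(0, 5))
scale = min_max_scaler.fit_transform(X)
scale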

array([[5.        , 0.92105263, 5.        , ..., 1.38511876, 1.81818182,
        0.52259887],
       [0.        , 1.84210526, 1.66666667, ..., 1.59840039, 2.57575758,
        0.50847458],
       [5.        , 2.10526316, 0.        , ..., 1.18759089, 1.56565657,
        0.42372881],
       ...,
       [0.        , 2.10526316, 1.66666667, ..., 0.78284052, 2.02020202,
        0.64971751],
       [0.        , 1.57894737, 0.        , ..., 0.43868153, 2.12121212,
               nan],
       [0.        , 2.63157895, 1.66666667, ..., 0.71861367, 1.81818182,
        0.94632768]])
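MinMaxScaler, used in the cell above, maps each feature linearly so that its smallest value lands on the lower bound of `feature_range` and its largest value on the upper bound. A minimal sketch of that formula on a single toy column follows. The values and the names `ages`, `lo`, `hi`, and `scaled` are assumptions for illustration.

```python
import numpy as np

# Toy column (hypothetical values).
ages = np.array([32.0, 39.0, 46.0, 70.0])

# Target range, matching feature_range=(0, 5) in the cell above.
lo, hi = 0.0, 5.0

# Map the column minimum to lo and the column maximum to hi, linearly in between.
scaled = (ages - ages.min()) / (ages.max() - ages.min()) * (hi - lo) + lo

print(scaled)  # 0, about 0.92, about 1.84, 5
```

Unlike standardization, this bounds every value to a fixed interval, but a single extreme value can squeeze the rest of the column into a narrow part of that interval.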

## Imputation of missing values
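For various reasons, many real-world datasets contain missing values, often encoded as blanks, NaNs, or other placeholders. Such datasets, however, are incompatible with scikit-learn estimators, which assume that all values in an array are numerical and that all have and hold meaning.

A basic strategy is to discard entire rows or columns that contain missing values, but this comes at the price of losing data that may be valuable. A better strategy is to impute the missing values, i.e., to infer them from the known part of the data. Here we use SimpleImputer with the mean strategy, which replaces each missing value with the mean of its column.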
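As an illustration of what mean imputation does, here is a minimal sketch in plain NumPy. The toy values and the names `X_toy`, `col_mean`, and `X_filled` are assumptions for illustration, not part of the dataset.

```python
import numpy as np

# Toy matrix with two missing entries (hypothetical values).
X_toy = np.array([
    [77.0, 26.97],
    [np.nan, 28.73],
    [70.0, np.nan],
    [103.0, 28.58],
])

# Mean of each column computed from the observed (non-NaN) entries only.
col_mean = np.nanmean(X_toy, axis=0)

# Replace every NaN with the mean of its own column; keep observed values as they are.
X_filled = np.where(np.isnan(X_toy), col_mean, X_toy)

print(X_filled)
```

When the imputer is applied to standardized data, each column mean is approximately 0, so the filled entries show up as values on the order of 1e-16 rather than as a recognizable number.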

In [7]:
from sklearn.impute import SimpleImputer

In [8]:
mean_values = SimpleImputer(strategy='mean')
data1 = mean_values.fit_transform(rescaledX)
print(data1)

[[ 1.15319190e+00 -1.23495068e+00  1.98206814e+00 ...  2.86298790e-01
   3.42703997e-01 -2.07320483e-01]
 [-8.67158360e-01 -4.18257334e-01  2.06440713e-02 ...  7.17710727e-01
   1.59008688e+00 -2.49062135e-01]
 [ 1.15319190e+00 -1.84916377e-01 -9.60067961e-01 ... -1.13247493e-01
  -7.30902975e-02 -4.99512044e-01]
 ...
 [-8.67158360e-01 -1.84916377e-01  2.06440713e-02 ... -9.31949692e-01
   6.75339433e-01  1.68354381e-01]
 [-8.67158360e-01 -6.51598291e-01 -9.60067961e-01 ... -1.62809168e+00
   8.41657150e-01  1.75328727e-16]
 [-8.67158360e-01  2.81765538e-01  2.06440713e-02 ... -1.06186351e+00
   3.42703997e-01  1.04492906e+00]]


## Normalizing Data

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

In this task, we rescale each observation (row) to a length of 1 (a unit norm, L2 by default). For this, we use the normalize function; the Normalizer class performs the same operation as a transformer object.
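The row-wise nature of normalization is easy to miss, so here is a minimal sketch of L2 normalization in plain NumPy. The toy values and the names `X_toy`, `row_norms`, and `X_unit` are assumptions for illustration.

```python
import numpy as np

# Toy matrix (hypothetical values): each row is one observation.
X_toy = np.array([
    [3.0, 4.0],
    [1.0, 0.0],
    [0.5, -1.2],
])

# Euclidean (L2) length of every row, kept as a column so it broadcasts across features.
row_norms = np.linalg.norm(X_toy, axis=1, keepdims=True)

# Divide each row by its own length.
X_unit = X_toy / row_norms

print(X_unit[0])                       # first row is [0.6, 0.8]
print(np.linalg.norm(X_unit, axis=1))  # every row now has length 1
```

This sketch assumes no row is all zeros; such a row would cause a division by zero here and needs separate handling.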

In [9]:
from sklearn import preprocessing

In [10]:
normaliseX = preprocessing.normalize(data1)
normaliseX

array([[ 3.25334840e-01, -3.48400368e-01,  5.59174772e-01, ...,
         8.07697060e-02,  9.66825640e-02, -5.84885968e-02],
       [-3.40842556e-01, -1.64398921e-01,  8.11429419e-03, ...,
         2.82101137e-01,  6.24994582e-01, -9.78955847e-02],
       [ 5.11451031e-01, -8.20120842e-02, -4.25798817e-01, ...,
        -5.02262864e-02, -3.24162075e-02, -2.21538106e-01],
       ...,
       [-3.73101622e-01, -7.95617079e-02,  8.88227210e-03, ...,
        -4.00978596e-01,  2.90570038e-01,  7.24357804e-02],
       [-3.10566762e-01, -2.33365416e-01, -3.43841693e-01, ...,
        -5.83089763e-01,  3.01433681e-01,  6.27927696e-17],
       [-3.61438172e-01,  1.17442010e-01,  8.60460528e-03, ...,
        -4.42592754e-01,  1.42841621e-01,  4.35534347e-01]])

## Discretization
Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because discretization can transform the dataset of continuous attributes to one with only nominal attributes.

One-hot encoded discretized features can make a model more expressive, while maintaining interpretability. For instance, pre-processing with a discretizer can introduce nonlinearity to linear models.
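A minimal sketch of quantile-based binning for a single feature follows, roughly in the spirit of KBinsDiscretizer's quantile strategy. The toy values and the names `values`, `n_bins`, `edges`, `codes`, and `one_hot` are assumptions for illustration, and the exact edge handling inside scikit-learn may differ in details.

```python
import numpy as np

# Toy continuous feature (hypothetical values).
values = np.array([0.1, 0.4, 0.35, 0.8, 0.05, 0.6, 0.9, 0.2])
n_bins = 4

# Bin edges at evenly spaced quantiles, so each bin receives about the same number of samples.
edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))

# Assign each value an ordinal bin index in 0..n_bins-1 using the interior edges.
# side="right" places a value equal to an edge into the upper bin.
codes = np.searchsorted(edges[1:-1], values, side="right")

# Optional: one-hot encode the bin indices (one column per bin).
one_hot = np.eye(n_bins)[codes]

print(edges)
print(codes)
```

With the ordinal codes a linear model still sees a single number per feature. With the one-hot columns it can learn a separate weight per bin, which is where the added nonlinearity comes from.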

In [11]:
from sklearn.preprocessing import KBinsDiscretizer

In [12]:
dis = KBinsDiscretizer(n_bins=10, encode='ordinal')
dis.fit(normaliseX)
print(dis.n_bins)
print(dis.bin_edges_[0])
print(dis.bin_edges_[4])
print(dis.bin_edges_[9])
preProcessedX = dis.transform(normaliseX)  # dis is already fitted above, so transform is enough

10
[-0.49669172 -0.34080554 -0.30418808 -0.27441178 -0.24593426 -0.20002433
  0.22525374  0.31776493  0.3587834   0.40845888  0.5971696 ]
[-4.14119853e-01 -2.86338026e-01 -2.55805549e-01 -2.29783080e-01
 -2.04283776e-01 -1.41188374e-01 -9.30753395e-05  1.66883246e-01
  2.92253007e-01  4.06268438e-01  9.24550479e-01]
[-0.78006557 -0.3712633  -0.26001283 -0.1696345  -0.089706   -0.01625665
  0.04543237  0.12787188  0.23103746  0.37457507  0.96144485]


In [13]:
preProcessedX

array([[7., 1., 9., ..., 6., 6., 4.],
       [0., 3., 6., ..., 8., 9., 3.],
       [9., 4., 0., ..., 4., 4., 0.],
       ...,
       [0., 4., 6., ..., 0., 8., 7.],
       [1., 2., 1., ..., 0., 8., 6.],
       [0., 6., 6., ..., 0., 7., 9.]])

## Encoding

When a feature takes only a few distinct values, we may not need to keep the raw values and can use one-hot encoding instead. For k distinct values, the feature is transformed into a k-dimensional vector in which exactly one entry is 1 and the rest are 0. One way to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K encoding, also known as one-hot or dummy encoding. This type of encoding can be obtained with the OneHotEncoder, which transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them 1 and all others 0.
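The k-dimensional vector described above can be sketched by hand for a single feature. The toy values and the names `browsers`, `categories`, `codes`, and `one_hot` are assumptions for illustration.

```python
import numpy as np

# Toy categorical feature (hypothetical values) with k = 3 distinct categories.
browsers = np.array(["Safari", "Firefox", "Safari", "Chrome"])

# Sorted distinct categories, plus the index of each sample's category in that list.
categories, codes = np.unique(browsers, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories))[codes]

print(categories)  # column order of the encoding
print(one_hot)
```

Each row contains exactly one 1, in the column belonging to that sample's category.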

### OneHotEncoder

In [22]:
from sklearn.preprocessing import OneHotEncoder

In [23]:
encoder=OneHotEncoder()
s = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
encoder.fit(s)

OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='error', sparse=True)

In [24]:
encoder.transform([['female', 'from US', 'uses Safari'],['male', 'from Europe', 'uses Safari']]).toarray()

array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 0., 1.]])

### OrdinalEncoder

In [25]:
enc = preprocessing.OrdinalEncoder()
s = [['a','b','c','d'], ['e', 'f', 'g', 'h'],['i','j','k','l']]
enc.fit(s)
enc.transform([['a', 'f', 'g','l']])

array([[0., 1., 1., 2.]])
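The result above can be read as follows: each column gets its own sorted list of categories, and every value is replaced by its position in the list for its column. A minimal pure-Python sketch of that idea is given below. The helper `encode` and the names `rows` and `categories` are hypothetical and not part of scikit-learn.

```python
# Same training rows as in the cell above.
rows = [['a', 'b', 'c', 'd'], ['e', 'f', 'g', 'h'], ['i', 'j', 'k', 'l']]

# Build one sorted category list per column.
columns = list(zip(*rows))
categories = [sorted(set(col)) for col in columns]

def encode(row):
    """Replace each value by its index in the category list of its column."""
    return [cats.index(value) for cats, value in zip(categories, row)]

print(categories)
print(encode(['a', 'f', 'g', 'l']))
```

Because the integer codes carry an order, this encoding suits features whose categories are genuinely ordered. For unordered categories, one-hot encoding avoids implying a ranking.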