In [26]:
%matplotlib inline

In [139]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression

## 01. Introduction to Machine Learning

### Live Demo

In [28]:
df = pd.read_csv('data/diabetic_data.csv', na_values=['?'])

In [29]:
df.head()

We want to build a model that uses the 'readmitted' feature to tell us which treatment works best.

In [30]:
df.readmitted.unique()

In [31]:
df.readmitted.value_counts()

We see that NOT readmitted is the largest class, readmission after more than 30 days is the second, and readmission within 30 days is the smallest.
If we want, we can merge >30 and <30 into a single YES class and get a YES / NO classification.
The variable is categorical and it is the target, so we need **classification** algorithms.
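
A minimal sketch of the YES / NO merge described above. It uses a small toy Series instead of the real column, and it assumes the labels are 'NO', '>30' and '<30'; the name `readmitted_binary` is hypothetical.

```python
import pandas as pd

# Toy stand-in for df.readmitted (assumed labels: 'NO', '>30', '<30').
readmitted = pd.Series(['NO', '>30', '<30', 'NO', '>30', 'NO'])

# Anything other than 'NO' counts as a readmission.
readmitted_binary = (readmitted != 'NO').map({True: 'YES', False: 'NO'})

print(readmitted_binary.value_counts())
```

A binary target is easier to interpret, but it hides the difference between early and late readmission, so the choice depends on the question being asked.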

In [33]:
df.race.value_counts(dropna=False)

In [34]:
df.gender.value_counts(dropna=False)

In [36]:
df.age.value_counts(dropna=False)

In [39]:
# normalizing the data in %
df.age.value_counts(dropna=False) / len(df) * 100

The data is **biased**: ages above 50 account for most of the records, so the model will be less accurate for young people. The model will be *tuned* to the **most frequent** values of the variable.
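
The percentage calculation can also be written with the `normalize=True` option of `value_counts`. The sketch below uses a toy Series with made-up age brackets, not the real column.

```python
import pandas as pd

# Toy stand-in for df.age; the bracket labels are illustrative only.
age = pd.Series(['[70-80)', '[70-80)', '[60-70)', '[10-20)'])

# normalize=True returns fractions that sum to 1; multiply by 100 for percent.
share = age.value_counts(normalize=True, dropna=False) * 100

# The manual version used in the notebook, for comparison.
manual = age.value_counts(dropna=False) / len(age) * 100

print(share)
```

Both versions give the same numbers; a strongly uneven distribution is the signal to watch for.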

In [42]:
df.discharge_disposition_id.value_counts(dropna=False)

This variable has HIGH CARDINALITY: the categorical variable takes many distinct values. The cardinality (number of levels) of a category is the count of unique values in it. This is a **HIGH-CARDINALITY CATEGORY**.
We can:
1. Do nothing.
2. Merge categories on some sensible basis.
3. Drop it from the dataset, with **CAUTION**. For the demo it will be dropped so that we can create a model.
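
One possible way to merge categories (option 2) is to keep the frequent values and collapse the rare ones into a single label. This is only a sketch on toy data: the threshold `min_count` and the label 'other' are assumptions, not part of the original demo.

```python
import pandas as pd

# Toy stand-in for a high-cardinality ID column.
ids = pd.Series([1, 1, 1, 1, 3, 3, 3, 6, 6, 11, 13, 25])

min_count = 3  # hypothetical threshold; choose it for your own data
counts = ids.value_counts()
frequent = counts[counts >= min_count].index

# Keep frequent categories as strings, replace the rest with 'other'.
grouped = ids.astype(str).where(ids.isin(frequent), 'other')

print(grouped.value_counts())
```

Merging by frequency is the simplest rule; grouping by domain meaning (for example, similar discharge destinations) usually keeps more information.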

In [43]:
df.metformin.value_counts(dropna=False)

The variable above is categorical.

In [51]:
df.patient_nbr

It has type int; however, arithmetic on an ID is meaningless, so it is considered 'categorical'. The variable has **TOO HIGH ENTROPY**: it has so many distinct values that it carries little useful information for the model.

In [47]:
df.patient_nbr.nunique() / len(df)

In [48]:
df.patient_nbr.value_counts()

In [50]:
df[df.patient_nbr == 88785891]

We separate the **target**. The remaining columns will be called **attributes**.

In [53]:
attr = df.drop(columns='readmitted')

In [56]:
trgt = df['readmitted']

In [57]:
trgt

We drop variables (columns) from **attr** and **WRITE DOWN WHY WE ARE DOING THAT!!!** In our case, 'encounter_id' and 'patient_nbr' are categorical variables with high entropy and no value for our model.

In [60]:
attr = attr.drop(columns=['encounter_id', 'patient_nbr'])

In [64]:
attr.diag_1.value_counts()

As per the dataset description, we have more than 800 distinct values: **HIGH ENTROPY** -> useless. We also have NaN values:

In [70]:
len(attr[attr.diag_3.isna()])

We could remove the rows that contain NaN; however, if we drop every row with a NaN in any column, attr will be left with 0 observations. So we can only handle NaN column by column. The *WEIGHT* column will be dropped because it has too many missing values.

In [72]:
# isna().count() would return the total row count; sum() counts the missing values
attr.weight.isna().sum()

In [73]:
attr = attr.drop(columns='weight')

Removing more useless columns with too many NaNs:

In [None]:
attr = attr.drop(columns=['payer_code', 'medical_specialty'])
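
Instead of inspecting columns one by one, the share of missing values can be computed for every column at once. The sketch below runs on a toy DataFrame; the cut-off `max_missing` is a hypothetical value, not one used in the demo.

```python
import numpy as np
import pandas as pd

# Toy stand-in for attr, with different amounts of missing data per column.
toy = pd.DataFrame({
    'weight': [np.nan, np.nan, np.nan, '[75-100)'],
    'race': ['Caucasian', np.nan, 'Asian', 'Caucasian'],
    'time_in_hospital': [1, 3, 2, 4],
})

# Mean of a boolean mask = fraction of NaN values in each column.
missing_share = toy.isna().mean()

max_missing = 0.5  # hypothetical cut-off
to_drop = missing_share[missing_share > max_missing].index.tolist()

toy = toy.drop(columns=to_drop)
print(missing_share)
print(to_drop)
```

Writing the threshold down next to the drop makes the reason for removing each column easy to trace later.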

We now check the numeric variables.

In [81]:
df.time_in_hospital.hist(bins='fd')

In [82]:
df.num_medications.hist(bins='fd')

Some patients are taking more than 50 medications!! Whether that is useful depends on what we want to achieve. There is a problem with this wide range, from 0 to 80, and in other datasets the range is often many times wider. The problem arises when a number is stored in memory: it is not stored exactly, so even basic computer arithmetic produces numerical errors.

In [83]:
0.1 + 0.2

We have a rounding error:

In [93]:
10000000000000000.0 + 1 == 10000000000000000.0
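
The reason the addition is lost can be checked with `np.spacing`, which returns the distance from a float to the next representable float. This is a small illustration, not part of the original demo.

```python
import numpy as np

# Distance to the next representable float64 value.
gap_near_one = np.spacing(1.0)
gap_near_1e16 = np.spacing(1e16)

print(gap_near_one)   # a very small gap near 1
print(gap_near_1e16)  # the gap is larger than 1, so adding 1 cannot be represented
```

The gap between neighbouring floats grows with the magnitude of the number, which is why large values lose absolute precision.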

The errors are smaller when the numbers we work with are in the range [-1, 1], so we should pass the model numbers that are close to 0. Therefore we scale the data using the Z-score or other methods. Below is a Z-score example:

In [96]:
zscore = (df.num_medications - df.num_medications.mean()) / df.num_medications.std()

In [97]:
zscore.hist(bins='fd')
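
A quick sanity check of the Z-score on toy values: after scaling, the mean should be 0 and the standard deviation 1. The comparison with scikit-learn's `StandardScaler` is an addition for illustration; note that it divides by the population standard deviation (`ddof=0`), while pandas `std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for df.num_medications.
values = pd.Series([1, 5, 10, 16, 20, 68], dtype=float)

# Manual Z-score, as in the cell above (pandas std uses ddof=1).
zscore = (values - values.mean()) / values.std()
print(zscore.mean(), zscore.std())

# StandardScaler uses the population standard deviation (ddof=0).
scaled = StandardScaler().fit_transform(values.to_frame()).ravel()
manual_population = (values - values.mean()) / values.std(ddof=0)
print(np.allclose(scaled, manual_population))
```

Z-scores are centred on 0 but are not limited to [-1, 1]; an outlier such as 68 still produces a value well above 1.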

Below is min-max scaling:

In [100]:
min_max = (df.num_medications - df.num_medications.min()) / (df.num_medications.max() - df.num_medications.min())

In [102]:
min_max.hist(bins='fd')

Categorical variables must be passed to the model as numbers. We cannot pass a string to a model.

In [104]:
df.metformin

In [106]:
# one indicator column per category
pd.get_dummies(df.metformin).astype(int)

In [108]:
# passing a DataFrame prefixes the column names with 'metformin_'
pd.get_dummies(df[['metformin']]).astype(int)

We need to encode the values with a suitable encoding, such as **one-hot** or **multi** encoding.

In [109]:
attr.metformin.replace({'No':-99, 'Down': -1, 'Steady': 0, 'Up': 1})

Now it is in numbers. We must run all categorical variables through the encoding process. The model will work with the numbers without knowing that they are categories, so plain integer codes would suggest an order and distances that do not exist. In order not to confuse the model, we encode using **get_dummies**, which spreads each variable across several columns.

In [112]:
pd.get_dummies(attr)

The number of columns has increased; however, the model can work with that! We can also reduce the number of columns using **drop_first**.

In [121]:
attr = pd.get_dummies(attr, drop_first=True)
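
To see what `drop_first` does, here is a sketch on a single toy column. With the first level dropped, that category is represented by a row of zeros in the remaining columns.

```python
import pandas as pd

# Toy stand-in for one categorical attribute.
toy = pd.DataFrame({'metformin': ['No', 'Steady', 'Up', 'Down', 'No']})

# One indicator column per category.
full = pd.get_dummies(toy).astype(int)

# drop_first removes the first level (alphabetically 'Down' here).
reduced = pd.get_dummies(toy, drop_first=True).astype(int)

print(full.columns.tolist())
print(reduced.columns.tolist())
print(reduced)
```

Dropping one level removes a redundant column, because its value can always be inferred from the others.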

Now we need to scale the values (normalize). We create a MinMaxScaler and fit it to the current attributes.

In [117]:
MinMaxScaler()

In [119]:
scaler = MinMaxScaler()

In [122]:
scaler.fit(attr)

In [127]:
# original value range
scaler.data_range_

In [129]:
# transform: scale / normalize so that each column's max is 1 and min is 0
attr = scaler.transform(attr)

In [130]:
attr.max(axis=0)

In [131]:
attr.min(axis=0)
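
The separation between `fit` and `transform` matters once new data arrives: the scaler keeps the minimum and maximum it learned during `fit`. A small sketch with toy arrays (the names `train` and `new` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column: values seen when fitting, and values that arrive later.
train = np.array([[0.0], [20.0], [80.0]])
new = np.array([[40.0], [100.0]])

# fit learns the min and max of the training data only.
scaler = MinMaxScaler().fit(train)

train_scaled = scaler.transform(train)
new_scaled = scaler.transform(new)

print(train_scaled.ravel())
print(new_scaled.ravel())  # values beyond the fitted range fall outside [0, 1]
```

This is why the scaler object should be kept and reused rather than refitted on every new batch.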

The output is a NumPy array.

In [134]:
# same shape as the DataFrame before scaling
attr.shape

In [138]:
# everything is a float
attr.dtype

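The other options are OneHotEncoder and LabelEncoder. LabelEncoder works like a **REPLACE** function: it maps each category to an integer. The scaling and one-hot encoding above are **ONLY FOR THE ATTRIBUTES, NOT FOR THE TARGET**. Note that scikit-learn intends LabelEncoder for target labels rather than for input columns. A scikit-learn model needs a 2D array with ordered values: each row is an observation, each column is a feature. We use LogisticRegression and pass it the attribute array and the target column.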
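
A minimal sketch of the two scikit-learn encoders on toy data. The arrays below are stand-ins, and the target labels 'NO', '>30', '<30' are assumed.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy attribute column (2D: one row per observation) and toy target (1D).
metformin = np.array([['No'], ['Steady'], ['Up'], ['Down']])
readmitted = np.array(['NO', '>30', '<30', 'NO'])

# OneHotEncoder returns a sparse matrix by default; convert it for display.
one_hot = OneHotEncoder().fit_transform(metformin).toarray()

# LabelEncoder maps each distinct label to an integer code.
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(readmitted)

print(one_hot)
print(label_encoder.classes_, codes)
```

Unlike `pd.get_dummies`, a fitted OneHotEncoder remembers its categories, so the same columns can be produced again for new data.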

In [140]:
model = LogisticRegression()

In [142]:
model.fit(attr, trgt)

Machine learning comes down to **PASSING CORRECT DATA** and **EVALUATING THE MODEL**. It seems that the fitting did not complete successfully. However, we can still score the model:

In [144]:
model.score(attr, trgt)

The score is the **classification accuracy**. Here it is measured on the same data the model was trained on. We need to evaluate whether the score is OK for us. Otherwise we can do feature engineering, feature selection, etc. to increase the score, fine-tune the model, and so on.
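
Accuracy measured on the training data tends to be optimistic, so a common next step is to hold out part of the data for evaluation. The sketch below uses synthetic data from `make_classification` rather than the diabetes dataset. Raising `max_iter` is an assumption about why the earlier fit may have stopped early; the actual warning is not shown in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled attributes and the target.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 25% of the rows; stratify keeps the class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A higher iteration limit gives the solver more room to converge.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(train_acc, test_acc)
```

A large gap between the two scores would suggest overfitting; similar scores suggest the model generalizes about as well as it fits.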