# Health Care Dataset - INM701 Coursework
Postgraduate students: 
- Tamila Skakova
- Elbarraa Elalami

## Content:
- **[Part 1](#part1) - Importing the data set, packages used**
- **[Part 2](#part2) - Exploratory data analysis**
  - [Part 2.1](#part2.1) - Analysis of the features
  - [Part 2.2](#part2.2) - Analysis of the Target
  - [Part 2.3](#part2.3) - Statistical analysis of the dataset
- **[Part 3](#part3) - Preparing our data**
  - [Part 3.1](#part3.1) - Missing Values
  - [Part 3.2](#part3.2) - Encoding, Shuffling, Scaling
  - [Part 3.3](#part3.3) - Multicollinearity
  - [Part 3.4](#part3.4) - SMOTE Analysis
- **[Part 4](#part4) - Models**
  - [Part 4.1](#part4.1) - Score method
  - [Part 4.2](#part4.2) - KNN
  - [Part 4.3](#part4.3) - NN
  - [Part 4.4](#part4.4) - Decision Trees
  - [Part 4.5](#part4.5) - Random Forest
  - [Part 4.6](#part4.6) - Naive Bayes
- **[Part 5](#part5) - Additional Models**
  - [Part 5.1](#part5.1) - Gradient Boost Classifier
  - [Part 5.2](#part5.2) - CatBoost Classifier
  - [Part 5.3](#part5.3) - XGBoost Classifier

## Description of features in the dataframe:

| Column | Description |
|---|---|
| `case_id` | Case ID registered in the hospital |
| `Hospital_code` | Unique code for the hospital |
| `Hospital_type_code` | Unique code for the type of hospital |
| `City_Code_Hospital` | City code of the hospital |
| `Hospital_region_code` | Region code of the hospital |
| `Available Extra Rooms in Hospital` | Number of extra rooms available in the hospital |
| `Department` | Department overseeing the case |
| `Ward_Type` | Code for the ward type |
| `Ward_Facility_Code` | Code for the ward facility |
| `Bed Grade` | Condition of the bed in the ward |
| `patientid` | Unique patient ID |
| `City_Code_Patient` | City code of the patient |
| `Type of Admission` | Admission type registered by the hospital |
| `Severity of Illness` | Severity of the illness recorded at the time of admission |
| `Visitors with Patient` | Number of visitors with the patient |
| `Age` | Age of the patient |
| `Admission_Deposit` | Deposit paid at the time of admission |
| `Stay` | Length of stay of the patient (target): 11 classes ranging from 0-10 days to more than 100 days |


[Back to top](#Content:)


<a id='part1'></a>

## Part 1 -  Importing the data set, packages used

In [None]:
import os
import pandas as pd
import numpy as np
import sklearn as sk
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder 
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
import sklearn.metrics as metrics
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
import io
import requests
%matplotlib inline

In [None]:
path = "healthcare"

filename_read = os.path.join(path, "train_data.csv")
health_care = pd.read_csv(filename_read, na_values=['NA', '?'])

# print(health_care.shape)
# print(health_care.columns)

In [None]:
health_care.head(10)

[Back to top](#Content:)

<a id='part2'></a>
## Part 2- Exploratory data analysis (EDA)

[Back to top](#Content:)

<a id='part2.1'></a>
### Analysis of the features

In [None]:
#sorting by Stay for better representation in the visualisations
health_care = health_care.sort_values(by = "Stay", ascending = True)
# Before choosing the features used for prediction, we plot the data to understand how each
# feature relates to the target. We use the seaborn visualisation library and begin with
# countplot(), which shows the number of observations in each categorical bin as bars, to
# visualise the length of stay by patient age, hospital region, available extra rooms,
# type of admission, severity of illness and department.

In [None]:
#plot size
plt.figure(figsize = (15,4))
#plot title
plt.title("Age", fontdict = {'fontsize':15})
ax = sns.countplot(x = "Age", hue = 'Stay', data = health_care)

In [None]:
#plot size
plt.figure(figsize = (15,4))
#plot title
plt.title("Hospital_region_code", fontdict = {'fontsize':15})
ax = sns.countplot(x = "Hospital_region_code", hue = 'Stay', data = health_care)

In [None]:
#plot size
plt.figure(figsize = (20,4))
#plot title
plt.title("Available Extra Rooms in Hospital", fontdict = {'fontsize':15})
ax = sns.countplot(x = "Available Extra Rooms in Hospital", hue = 'Stay', data = health_care)

In [None]:
#plot size
plt.figure(figsize = (15,4))
#plot title
plt.title("Type of Admission", fontdict = {'fontsize': 15})
ax = sns.countplot(x = "Type of Admission", hue = 'Stay', data = health_care)


In [None]:
#plot size
plt.figure(figsize = (15,4))
#plot title
plt.title("Severity of Illness", fontdict = {'fontsize':15})
ax = sns.countplot(x = "Severity of Illness", hue = 'Stay', data = health_care)

In [None]:
#plot size
plt.figure(figsize = (15,8))
#plot title
plt.title("Department", fontdict = {'fontsize':15})
ax = sns.countplot(x = "Department", hue = 'Stay', data = health_care)

[Back to top](#Content:)

<a id='part2.2'></a>
### Analysis of the Target

In [None]:
## Checking target
#creating a copy of df to have an original for further manipulation
health_care_copy = health_care.copy()
target = health_care_copy['Stay']

In [None]:
## Encoding data
encoder = LabelEncoder()
target_enc = encoder.fit_transform(target)

In [None]:
#number of samples and share of the dataset for each Stay class
health_care_copy['Stay_cat'] = encoder.fit_transform(health_care_copy['Stay'])
n = len(health_care_copy['Stay_cat'])
for i in range(len(np.unique(health_care_copy['Stay_cat']))):
    k = len(health_care_copy[health_care_copy['Stay_cat']==i])
    #print(f'N {10*i} and {10*(i+1)} is : {k}')
    print(f'{i}- {k}  : {100*k/n:.2f}%')

In [None]:
#Visualisation of Target categories
fig, ax1 = plt.subplots()
labels = np.unique(health_care['Stay'])
labels[-1] = '100+'

#define Seaborn color palette to use
colors = sns.color_palette('pastel')[0:6]
plt.pie(health_care.groupby('Stay').size(), labels = labels,colors=colors, autopct='%1.1f%%', shadow=True, startangle=90)
#draw circle
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
fig.set_size_inches(20, 15.5)
# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')  
plt.tight_layout()
plt.title('Percentage of occurrence of each category')
plt.show()

In [None]:
#Histogram plot
h = health_care['Stay'].sort_values()
h[h=='More than 100 Days'] = '100+'
plt.figure(figsize = (8,8))
plt.hist(h, bins=11, label=labels, density=True, facecolor='g', alpha=0.75)

plt.xlabel('Length of stay (LOS)')
plt.ylabel('Occurrence')
plt.title('Histogram of LOS')

plt.grid(True)
plt.show()

[Back to top](#Content:)

<a id='part2.3'></a>
### Statistical analysis of the dataset

In [None]:
#statistical analysis of each feature
df_copy = health_care.copy()
df_copy = df_copy.select_dtypes(include=["int","float"])
headers = list(df_copy.columns.values)
fields = []
for field in headers:
    fields.append({
        "name":field,
        "mean":health_care[field].mean(),
        "var":health_care[field].var(),
        "sdev":health_care[field].std()
    })
for field in fields:
    print(field)

In [None]:
## functions to feed to barplot
def mean(x):
    return np.mean(x)

def median(x):
    return np.median(x)

def std(x):
    return np.std(x)

In [None]:
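## Estimate of length of stay
#each class spans 10 days, so class index i is mapped to the midpoint 10*i + 5
#(the last class, 'More than 100 Days', gets the nominal value 105)
target_estimate = 10* (target_enc) + 5
# Adding the estimate to the dataframe
df_copy['Stay Estimate'] = target_estimate
#separate frame for plotting, so the estimate never ends up among the modelling features
health_care_plot = health_care.copy()
health_care_plot['Stay Estimate'] = target_estimate
#features to plot; this selection is an assumption, as column_features was not defined in the notebook
column_features = ['Hospital_type_code', 'Hospital_region_code', 'Department', 'Ward_Type',
                   'Ward_Facility_Code', 'Type of Admission', 'Severity of Illness', 'Age']

In [None]:
# Mean Analysis
#We study the relationship between the mean of stay and each feature
for feature in column_features :
    order = sorted(health_care_plot[feature].unique())
    ax = sns.barplot(x=feature, y="Stay Estimate", data=health_care_plot, order=order, estimator=mean)
    ax.set_title(f'Means of {feature} classes') 
    plt.show();

In [None]:
#Median Analysis
for feature in column_features :
    order = sorted(health_care_plot[feature].unique())
    ax = sns.barplot(x=feature, y="Stay Estimate", data=health_care_plot, order=order, estimator=median)
    ax.set_title(f'Medians of {feature} classes') 
    plt.show();

In [None]:
#Analysis of Std
for feature in column_features :
    order = sorted(health_care_plot[feature].unique())
    ax = sns.barplot(x=feature, y="Stay Estimate", data=health_care_plot, order=order, estimator=std)
    ax.set_title(f'Stds of {feature} classes') 
    plt.show();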

[Back to top](#Content:)

<a id='part3'></a>
## Part 3 - Preparing our data

[Back to top](#Content:)

<a id='part3.1'></a>
### Missing Values

In [None]:
#checking for missing values
health_care.isnull().values.any()

In [None]:
#Checking number of NANs for each column, in order to understand how many missing values there are in a dataframe.
print("# of NaN in each columns:", health_care.isnull().sum(), sep='\n')

In [None]:
#calculates percentage of missing values in the specific feature
def perc_mv(x, y):
    perc = y.isnull().sum() / len(x) * 100
    return perc

In [None]:
print('Missing value ratios:\nBed Grade: {}\nCity_Code_Patient: {}'.format(
    perc_mv(health_care, health_care['Bed Grade']),
    perc_mv(health_care, health_care['City_Code_Patient'])))

In [None]:
#In the code cell below, we use the attribute dtype on df to retrieve the data type for each column.
print (health_care.dtypes)

In [None]:
#We map the ordinal categories of Severity of Illness and Type of Admission to integers in ascending order.
health_care['Severity of Illness'] = health_care['Severity of Illness'].map({'Minor':1, 'Moderate': 2, 'Extreme':3})
health_care['Type of Admission'] = health_care['Type of Admission'].map({'Trauma':1, 'Emergency': 2, 'Urgent':3})
# health_care

In [None]:
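#dropping identifier columns and other features not used for modelling (this also removes the two columns with missing values)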
elements_to_remove = ['case_id', 'City_Code_Hospital', 'City_Code_Patient', 'patientid', 'Bed Grade', 'Admission_Deposit']
health_care = health_care.drop(elements_to_remove, axis=1)

In [None]:
# #uniting the predictors 
# #playing with the parameters
# health_care["Stay"] = health_care["Stay"].map({'0-10':'0-20', 
#                              '11-20':'0-20', 
#                              '21-30':'21-30', 
#                              '31-40':'21-30', 
#                              '41-50':'31-40', 
#                              '51-60':'31-40',
#                              '61-70':"more than 60 Days",  
#                              '71-80':"more than 60 Days", 
#                              '81-90':"more than 60 Days", 
#                              '91 - 100':"more than 60 Days", 
#                              'More than 100 Days' :"more than 60 Days"})

# # df.iloc[np.random.permutation(len(df))]

[Back to top](#Content:)

<a id='part3.2'></a>
### Encoding, Shuffling, Scaling

In [None]:
#using LabelEncoder to transform the object (string) columns into integer codes
le = LabelEncoder()
for col in ['Hospital_type_code', 'Hospital_region_code','Ward_Type', 'Ward_Facility_Code', 'Department', 'Age', 'Stay']:
    health_care[col]= health_care[col].astype('str')
    health_care[col]= le.fit_transform(health_care[col])
print (health_care.dtypes)
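
`LabelEncoder` assigns integer codes following the sorted order of the distinct labels, so the codes are only ordinal if the string sort happens to match the natural order of the classes. The check below is a small sketch of that idea in plain Python. The spelling of the `Stay` labels is an assumption and should be compared with `health_care['Stay'].unique()`; the second list is a made-up counter-example.

```python
# Assumed spelling of the 11 Stay classes (verify against the real column).
stay_labels = [
    "0-10", "11-20", "21-30", "31-40", "41-50", "51-60",
    "61-70", "71-80", "81-90", "91-100", "More than 100 Days",
]

# Codes follow the sorted order of the labels, as LabelEncoder does.
codes = {label: code for code, label in enumerate(sorted(stay_labels))}

# True only if the code of every label equals its position in the natural order.
is_ordinal = [codes[label] for label in stay_labels] == list(range(len(stay_labels)))
print(is_ordinal)

# Hypothetical counter-example: string sorting puts "11-20" before "5-10".
bad_labels = ["5-10", "11-20"]
print(sorted(bad_labels))
```

If the check fails for a column, an explicit `map` with a hand-written dictionary, as used above for `Severity of Illness`, keeps the intended order.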

In [None]:
# # Function to hot encode the column with name : name for dataframe df
# def encode_text_dummy(df, name):
#     dummies = pd.get_dummies(df[name])
#     for x in dummies.columns:
#         dummy_name = f"{name}-{x}"
#         df[dummy_name] = dummies[x]
#     df.drop(name, axis=1, inplace=True) ## inplace to make changed on the original df

In [None]:
# ohEncoder = OneHotEncoder()

# Xe = health_care.drop(columns = ["Stay"])
# ye = health_care["Stay"]

# #hot encoding
# ## Pre processing these columns
# dummies_string_columns = ['Hospital_region_code', 'Department', 'Ward_Type', 'Ward_Facility_Code',  'Type of Admission']
# for column in dummies_string_columns :
#     encode_text_dummy(Xe, column)
    
# ## label_encoded data
# Xe['Age'] = encoder.fit_transform(Xe['Age'])
# ye = ohEncoder.fit_transform(ye)

In [None]:
# #Scaling Hot Encoded Data
# scaler = StandardScaler()
# X = X.values
# y = y
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# for i in range(X_train.shape[1]) :
#     X_train[:,i]= scaler.fit_transform(X_train[:,i].reshape(-1, 1))[:,0]
#     X_test[:,i] = scaler.transform(X_test[:,i].reshape(-1, 1))[:,0]

In [None]:
#shuffling
health_care= health_care.reindex(np.random.permutation(health_care.index))
health_care.reset_index(inplace=True, drop=True)

In [None]:
#initialisation
X = health_care.drop(columns = ["Stay"])
y = health_care["Stay"]
scaler=StandardScaler()

[Back to top](#Content:)

<a id='part3.3'></a>
### Multicollinearity

In [None]:
# Heatmap
fig, ax = plt.subplots(figsize=(10,10))        
sns.heatmap(health_care.corr(), annot=True, linewidths=.5, ax=ax)

In [None]:
#OLS
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

In [None]:
#VIF is computed on the predictors only, so the target (Stay) is excluded
X_VIF = X
data = pd.DataFrame()
data["feature"] = X_VIF.columns
data["VIF"] = [variance_inflation_factor(X_VIF.values, i) for i in range(len(X_VIF.columns))]
print(data)
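
To make the VIF figures easier to read, the sketch below computes the same quantity by hand with NumPy on synthetic data: each column is regressed on the others and VIF = 1 / (1 - R²). The function name `vif_numpy` and the data are illustrative assumptions. Unlike this sketch, which always adds an intercept, the statsmodels function regresses on the columns exactly as supplied, so the two can give different values when the design matrix has no constant column.

```python
import numpy as np

def vif_numpy(X):
    """Variance inflation factor of every column: 1 / (1 - R^2) from
    regressing that column on all other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        residuals = y - others @ coef
        r_squared = 1 - residuals.var() / y.var()
        vifs.append(1 / (1 - r_squared))
    return np.array(vifs)

# Synthetic example: c is almost a copy of a, b is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif_numpy(np.column_stack([a, b, c]))
print(vifs)
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity; here `a` and `c` should stand out while `b` stays close to 1.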

In [None]:
#Choosing the number of components
X_scaler = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaler)

In [None]:
#Plotting the Cumulative Summation of the Explained Variance
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative explained variance (fraction)')
plt.title('Dataset Explained Variance')
plt.show()

In [None]:
np.cumsum(pca.explained_variance_ratio_)
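
One way to turn the cumulative curve into a number of components is to pick the smallest count that reaches a chosen share of the variance. The sketch below does this with NumPy on synthetic data. The helper name `components_for_variance` and the 0.90 threshold are assumptions for illustration, not the criterion used in this notebook, and the demo data is deliberately left unstandardised so that the variances are known in advance.

```python
import numpy as np

def components_for_variance(X, threshold=0.90):
    """Return the smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`, plus the cumulative curve."""
    X = np.asarray(X, dtype=float)
    X_centred = X - X.mean(axis=0)  # PCA works on centred data
    singular_values = np.linalg.svd(X_centred, compute_uv=False)
    ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(ratio)
    # First index where the curve reaches the threshold, converted to a count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative

# Synthetic data: four independent features with standard deviations 3, 2, 1 and 0.1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 4)) * np.array([3.0, 2.0, 1.0, 0.1])

k, cumulative = components_for_variance(X_demo, threshold=0.90)
print(k, np.round(cumulative, 3))
```

With `pca` fitted as above, the same rule could be applied directly to `np.cumsum(pca.explained_variance_ratio_)`.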

In [None]:
# For our analysis we choose n_components = 7
pca = PCA(n_components=7)
X_pca = pca.fit_transform(X_scaler)
X_pca_with_constant = sm.add_constant(X_pca)

In [None]:
model = sm.OLS(y, X_pca_with_constant)
results = model.fit()
print(results.summary())

[Back to top](#Content:)

<a id='part3.4'></a>
### SMOTE analysis

In [None]:
health_care["Stay"].value_counts().plot.bar()

In [None]:
smote_health_care = health_care.copy()

# X = np.array(smote_health_care.loc[:, smote_health_care.columns != "Stay"])
#keep the target one-dimensional: a column vector triggers sklearn warnings and broadcasts wrongly in y_test - y_pred
y = smote_health_care["Stay"].to_numpy()

X = scaler.fit_transform(X_pca_with_constant)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_pca_with_constant, y, test_size = 0.33, random_state = 2, shuffle = True, stratify = y)

clf = LogisticRegression(solver = 'lbfgs')
oversample = SMOTE(random_state = 33)

X_train_SMOTE, y_train_SMOTE = oversample.fit_resample(X_train, y_train)

In [None]:
# observe that data has been balanced
pd.Series(y_train_SMOTE).value_counts().plot.bar()
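
SMOTE balances the classes by creating synthetic minority samples: it picks a minority point, picks one of its nearest minority neighbours, and places a new point somewhere on the segment between them. The sketch below illustrates only that interpolation step with NumPy. The function `smote_like_samples` is a simplified, hypothetical stand-in and not the imbalanced-learn implementation.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)
    # Pairwise Euclidean distances; a point is never its own neighbour.
    distances = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(distances, np.inf)
    neighbours = np.argsort(distances, axis=1)[:, :k]
    # Random base point, random neighbour of it, random position on the segment.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

# Small synthetic minority class with two features.
rng = np.random.default_rng(1)
X_min = rng.normal(size=(20, 2))
X_new = smote_like_samples(X_min, n_new=30)
print(X_new.shape)
```

Because every synthetic point lies between two existing ones, oversampling never leaves the region already covered by the minority class. Only the training split should be resampled, as is done above, so that the test split keeps the original class distribution.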

In [None]:
# fit the model
clf.fit(X_train_SMOTE, y_train_SMOTE)

# prediction for Training data
train_pred_sm = clf.predict(X_train_SMOTE)

# prediction for Testing data
test_pred_sm = clf.predict(X_test)

In [None]:
#accuracy_score expects (y_true, y_pred)
print('Accuracy score for Training Dataset = ', accuracy_score(y_train_SMOTE, train_pred_sm))
print('Accuracy score for Testing Dataset = ', accuracy_score(y_test.ravel(), test_pred_sm))


[Back to top](#Content:)


<a id='part4'></a>

## Part 4 - Models

[Back to top](#Content:)

<a id='part4.1'></a>
### Score method

In [None]:
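# The score lies between 0 and 1; a score of 1 is a perfect prediction
K = 10 # number of classes minus one, i.e. the largest possible distance between two classes

# Classes are ordinal and indexed from 0 to 10, so a prediction is penalised by how many classes it is off
def score(pred, target):
    error = (np.mean(np.abs(pred-target)))/K
    return 1 - error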
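
The difference between this score and plain accuracy is easiest to see on a toy example. The sketch below re-implements the same formula under the hypothetical name `ordinal_score` and applies it to two made-up prediction vectors that both get one label out of four right, so their accuracy is identical, while their distance to the true classes differs.

```python
import numpy as np

K = 10  # number of classes minus one (classes are indexed 0..10)

def ordinal_score(pred, target, k=K):
    """One minus the mean absolute class distance, normalised by k."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    return 1 - np.mean(np.abs(pred - target)) / k

target = np.array([0, 2, 5, 10])
near_miss = np.array([1, 2, 4, 9])   # wrong labels are one class away
far_miss = np.array([10, 2, 0, 0])   # wrong labels are far away

print(ordinal_score(target, target))     # perfect prediction
print(ordinal_score(near_miss, target))  # high score despite 25% accuracy
print(ordinal_score(far_miss, target))   # much lower score, same accuracy
```

Flattening both inputs with `ravel()` guards against a shape pitfall: subtracting an array of shape `(n,)` from one of shape `(n, 1)` broadcasts to an `(n, n)` matrix instead of an element-wise difference.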

[Back to top](#Content:)

<a id='part4.2'></a>
### KNN

In [None]:
knn = KNeighborsClassifier(n_neighbors=3) 
knn.fit(X_train, y_train) 

In [None]:
y_pred = knn.predict(X_test)

In [None]:
#accuracy
print('Accuracy: %.2f' %accuracy_score(y_test,y_pred))

In [None]:
# Model Precision
print("Precision:",precision_score(y_test, y_pred, average='micro'))

# Model Recall
print("Recall:",recall_score(y_test, y_pred, average='micro'))

In [None]:
#with Smote
knn = KNeighborsClassifier(n_neighbors=3) 
knn.fit(X_train_SMOTE, y_train_SMOTE) 

In [None]:
y_pred = knn.predict(X_test)

In [None]:
#accuracy
print('Accuracy: %.2f' %accuracy_score(y_test,y_pred))

In [None]:
# Model Precision
print("Precision:",precision_score(y_test, y_pred, average='micro'))

# Model Recall
print("Recall:",recall_score(y_test, y_pred, average='micro'))

In [None]:
# from sklearn.metrics import classification_report
# print(classification_report(y_test, y_pred))

[Back to top](#Content:)

<a id='part4.3'></a>
### NN

In [None]:
nn = Sequential()

#model.add(Dropout(0.1)) #applies to layer before ie input here
nn.add(Dense(12, input_dim=X.shape[1], activation='relu'))
nn.add(Dense(8, activation='relu'))
nn.add(Dense(1, activation='sigmoid'))
nn.add(Dropout(0.2))
nn.add(Dense(1))

In [None]:
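nn.compile(loss='mean_squared_error', optimizer='adam')
monitor = EarlyStopping(monitor='loss', min_delta=1e-4, patience=5, verbose=1, mode='auto')
#pass the early-stopping callback to fit, otherwise it has no effect
nn.fit(X_train, y_train, verbose=2, epochs=2, callbacks=[monitor])
pred = nn.predict(X_test)
#the network outputs a continuous value, so round it to the nearest class index (0..K) before scoring
y_pred = np.clip(np.rint(pred.ravel()), 0, K).astype(int)
#accuracy
print('Accuracy: %.2f' % accuracy_score(y_test.ravel(), y_pred))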

In [None]:
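# Measure RMSE error.  RMSE is common for regression.
# Stored as rmse so that the score() function defined in Part 4.1 is not overwritten.
rmse = np.sqrt(metrics.mean_squared_error(y_test.ravel(), pred.ravel()))
print("Final score (RMSE): {}".format(rmse))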
nn.summary()

In [None]:
# # compile the keras model
# model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# # fit the keras model on the dataset
# model.fit(X, y, epochs=150, batch_size=10)
# # evaluate the keras model
# accuracy = model.evaluate(X, y)
# print('Accuracy: %.2f' % (accuracy*100))

In [None]:
# Model Precision
print("Precision:",precision_score(y_test, y_pred, average='micro'))

# Model Recall
print("Recall:",recall_score(y_test, y_pred, average='micro'))

In [None]:
# from sklearn.metrics import classification_report
# print(classification_report(y_test, y_pred))

[Back to top](#Content:)

<a id='part4.4'></a>
### Decision Trees

In [None]:
tree = DecisionTreeClassifier(criterion='entropy')
tree.fit(X_train, y_train)

In [None]:
y_pred = tree.predict(X_test)
print(f'Accuracy : %.3f' % accuracy_score(y_test, y_pred))
print(f'Score : %.2f' % score(y_test.ravel(), y_pred.ravel()))

In [None]:
# Model Precision
print("Precision:",precision_score(y_test, y_pred, average='micro'))

# Model Recall
print("Recall:",recall_score(y_test, y_pred, average='micro'))

In [None]:
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(cm )
disp.plot()
plt.show();

In [None]:
cm_reduced = cm[:6,:6]

cm = confusion_matrix(y_test, y_pred);
disp = ConfusionMatrixDisplay(cm_reduced)
disp.plot();

In [None]:
#with SMOTE
tree.fit(X_train_SMOTE, y_train_SMOTE)
y_pred = tree.predict(X_test)
print(f'Accuracy : %.3f' % accuracy_score(y_test, y_pred))
print(f'Score : %.2f' % score(y_test.ravel(), y_pred.ravel()))

[Back to top](#Content:)

<a id='part4.5'></a>
### Random Forest

In [None]:
forest = RandomForestClassifier(n_estimators=10, criterion='entropy')
forest.fit(X_train, y_train)

In [None]:
y_pred = forest.predict(X_test)
print(f'Accuracy : %.3f' % accuracy_score(y_test, y_pred))
print(f'Score : %.4f' % score(y_test.ravel(), y_pred.ravel()))

In [None]:
# Model Precision
print("Precision:",precision_score(y_test, y_pred, average='micro'))

# Model Recall
print("Recall:",recall_score(y_test, y_pred, average='micro'))

In [None]:
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(cm)
disp.plot()
plt.show()

In [None]:
cm_reduced = cm[:6,:6]

cm = confusion_matrix(y_test, y_pred);
disp = ConfusionMatrixDisplay(cm_reduced)
disp.plot();

In [None]:
### Cross Validating to check some results
kf = KFold(5, shuffle=True)

for fold, (train_index, validate_index) in enumerate(kf.split(X, y), start=1):
    forest.fit(X[train_index], y[train_index].ravel())
    ytest = y[validate_index]
    y_pred = forest.predict(X[validate_index])
    print(f'Fold {fold}')
    print(f'Accuracy : %.4f' % accuracy_score(ytest, y_pred))
    print(f'Score : %.4f' % score(ytest.ravel(), y_pred.ravel()))

In [None]:
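#Trying Different num of estimators (widen the range to compare more forest sizes)
accuracy_data = []
score_data = []

for i in range(1,2):
    forest = RandomForestClassifier(n_estimators=i, criterion='entropy')
    forest.fit(X_train, y_train.ravel())
    y_pred = forest.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    #ravel y_test so the subtraction is element-wise rather than broadcast to an (n, n) matrix;
    #a separate variable name keeps the score() function from being overwritten
    forest_score = 1-np.mean(np.abs(y_test.ravel()-y_pred))/K
    print(f'Accuracy {i} estimators : %.4f' % accuracy)
    print(f'Score {i} estimators : %.4f' % forest_score)
    accuracy_data.append(accuracy)
    score_data.append(forest_score)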

In [None]:
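#x-axis derived from the number of forests actually trained, so both arrays always have the same length
nums = np.arange(1, len(accuracy_data) + 1)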
fig = plt.figure(figsize=(6,5))
plt.plot(nums, accuracy_data, c='r', label='Accuracy')
plt.plot(nums, score_data, label='Score')
plt.xlabel('Number of Trees')
plt.ylabel('Accuracy and Score')
plt.legend(loc='upper right')
plt.show();

In [None]:
#Last Model results
y_pred = forest.predict(X_test)
print(f'Accuracy : %.3f' % accuracy_score(y_test, y_pred))
#ravel y_test so the subtraction is element-wise
forest_score = 1-np.mean(np.abs(y_test.ravel()-y_pred))/K
print(f'Score : %.4f' % forest_score)

In [None]:
# Model Precision
print("Precision:",precision_score(y_test, y_pred, average='micro'))

# Model Recall
print("Recall:",recall_score(y_test, y_pred, average='micro'))

In [None]:
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(cm)
disp.plot()
plt.show()

In [None]:
cm_reduced = cm[:6,:6]

cm = confusion_matrix(y_test, y_pred);
disp = ConfusionMatrixDisplay(cm_reduced)
disp.plot();

[Back to top](#Content:)


<a id='part4.6'></a>
### Naive Bayes

In [None]:
smoothing = [1e-3, 1e-2, 1e-1, 1, 10,100]
for i in range(len(smoothing)):
    model = GaussianNB(var_smoothing=smoothing[i])
    #fit on the training split; fitting on the test split would make the evaluation below meaningless
    model.fit(X_train, y_train.ravel())
    y_pred = model.predict(X_test)
    print('var_smoothing = ',smoothing[i] )
    print(f'Accuracy : %.3f' % accuracy_score(y_test, y_pred))
    #ravel y_test so the subtraction is element-wise
    nb_score = 1-np.mean(np.abs(y_test.ravel()-y_pred))/K
    print(f'Score : %.4f' % nb_score)
    print('---------' )

In [None]:
## Keeping smoothing = 1
model = GaussianNB(var_smoothing=1)
#fit on the training split and evaluate on the held-out test split
model.fit(X_train, y_train.ravel())
y_pred = model.predict(X_test)

In [None]:
print(f'Accuracy : %.3f' % accuracy_score(y_test, y_pred))
# Model Precision
print("Precision:",precision_score(y_test, y_pred, average='micro'))

# Model Recall
print("Recall:",recall_score(y_test, y_pred, average='micro'))

In [None]:
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(cm)
disp.plot()
plt.show()

In [None]:
cm_reduced = cm[:6,:6]

cm = confusion_matrix(y_test, y_pred);
disp = ConfusionMatrixDisplay(cm_reduced)
disp.plot();

[Back to top](#Content:)


<a id='part5'></a>

## Part 5 - Additional Models

[Back to top](#Content:)


<a id='part5.1'></a>
### Gradient Boost Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
print(f'Accuracy : %.3f' % accuracy_score(y_test, y_pred))
#ravel y_test so the subtraction is element-wise
gb_score = 1-np.mean(np.abs(y_test.ravel()-y_pred))/K
print(f'Score : %.4f' % gb_score)

In [None]:
# Model Precision
print("Precision:",precision_score(y_test, y_pred, average='micro'))

# Model Recall
print("Recall:",recall_score(y_test, y_pred, average='micro'))

In [None]:
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(cm)
disp.plot()
plt.show()

In [None]:
cm_reduced = cm[:6,:6]

cm = confusion_matrix(y_test, y_pred);
disp = ConfusionMatrixDisplay(cm_reduced)
disp.plot();

[Back to top](#Content:)


<a id='part5.2'></a>
### CatBoost Classifier

In [None]:
from catboost import CatBoostClassifier
cb = CatBoostClassifier(iterations=1000)
cb.fit(X_train, y_train)

In [None]:
y_pred = cb.predict(X_test)
print(f'Accuracy : %.3f' % accuracy_score(y_test, y_pred))

In [None]:
y_pred = y_pred.flatten()
y_pred.shape

In [None]:
#ravel y_test so the difference is element-wise
y_diff = np.abs(y_test.ravel()-y_pred)

In [None]:
cb_score = 1-np.mean(y_diff)/K
print(f'Score : %.4f' % cb_score)

In [None]:
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(cm)
disp.plot()
plt.show()

In [None]:
cm_reduced = cm[:6,:6]

cm = confusion_matrix(y_test, y_pred);
disp = ConfusionMatrixDisplay(cm_reduced)
disp.plot();

[Back to top](#Content:)


<a id='part5.3'></a>
### XGBoost Classifier

In [None]:
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X_train, y_train)

In [None]:
y_pred = xgb.predict(X_test)
print(f'Accuracy : %.3f' % accuracy_score(y_test, y_pred))
#ravel y_test so the subtraction is element-wise
xgb_score = 1-np.mean(np.abs(y_test.ravel()-y_pred))/K
print(f'Score : %.4f' % xgb_score)