# Our methodology


## Data visualization 
    Loading the data
    Take a quick look at our data
    Understanding our data
    Finding the correlations 
    
## Data preparation 
    Outliers detection
    Skewness correction 
    Data splitting
    Feature scaling
    
## Modeling
    Building the model
    Evaluation with cross-validation
    
## Fine-tuning 
    Finding the best hyperparameters
    
## Performance evaluation
    Evaluate our model with the new hyperparameters
    
## Testing our model
    Evaluate the model with the test set
    

Well, let's get started!

## Data visualization

### Importing needed libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
from scipy.stats import norm
from scipy import stats
import seaborn as sns
warnings.filterwarnings('ignore')
%matplotlib inline

### Loading the data

In [None]:
data = pd.read_csv('../input/heart-disease-uci/heart.csv')

In [None]:
data.shape

### Take a quick look at our data


#### Attribute Information: 

age 

sex 

chest pain type (4 values) 

resting blood pressure 

serum cholesterol in mg/dl 

fasting blood sugar > 120 mg/dl

resting electrocardiographic results (values 0,1,2)

maximum heart rate achieved 

exercise induced angina 

oldpeak = ST depression induced by exercise relative to rest 

the slope of the peak exercise ST segment 

number of major vessels (0-3) colored by fluoroscopy 

thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

In [None]:
data.head()

### Understanding our data

In [None]:
data.info()

Well, we don't have any missing data, and all our features are numeric (integers, except oldpeak, which is a float)!

In [None]:
data.describe()

It looks like we have a few outliers here; we will deal with them later!

In [None]:
data.var()

In [None]:
# set the style before plotting so that it applies to this figure
plt.style.use('fivethirtyeight')
data.hist(bins=50, figsize=(30,25)) 
plt.show()

In [None]:
sns.countplot(x="target", data=data)
plt.show()

In [None]:
# color the age distribution by the target class (one hue per age value is hard to read)
sns.displot(data, x="age", hue="target")
plt.show()

In [None]:
sns.countplot(x="sex", data=data)
plt.xlabel('Male = 1 , female = 0')
plt.show()

In [None]:
sns.countplot(x="cp", data=data)
plt.show()

In [None]:
sns.countplot(x="fbs", data=data)
plt.show()

In [None]:
sns.countplot(x="ca", data=data)
plt.show()

In [None]:
sns.countplot(x="slope", data=data)
plt.show()

In [None]:
sns.countplot(x="restecg", data=data)
plt.show()

### Find the correlations

In [None]:
corr_matrix = data.corr()
f, ax = plt.subplots(figsize=(25, 15))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr_matrix, cmap=cmap, vmax=.5, annot=True, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()

In [None]:
High_corr = corr_matrix.nlargest(5, 'target')['target'].index
High_corr

In [None]:
corr_matrix["target"].sort_values(ascending=False)

In [None]:
new_df = data.copy()

## Data preparation

### Outliers Detection

There are various methods for detecting outliers. I am going to use the IQR method here because it works fine for me, but you can try other methods, like:

            1. Z-score method
            2. Robust Z-score
            3. IQR method
            4. Winsorization method (percentile capping)
            5. DBSCAN clustering
            6. Isolation Forest
            7. Visualizing the data
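
For comparison with the first alternative in the list, a Z-score check could look roughly like the sketch below. It is not part of this notebook's pipeline: the helper name `zscore_outliers`, the threshold of 3, and the sample readings are all assumptions made for illustration.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > threshold]

# made-up readings: twenty ordinary values and one extreme value
readings = np.array([10, 11, 12, 13] * 5 + [50], dtype=float)
flagged = zscore_outliers(readings)
print(flagged)
```

Because the mean and standard deviation are themselves pulled by extreme values, this check can miss outliers in very small samples, which is one reason to prefer a quantile-based method such as IQR.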
            
IQR stands for "interquartile range".

This method depends on two values:
    
    Q1 >> the value a quarter of the way through the sorted data. This is usually the 0.25 quantile, but I will use 0.15 to avoid deleting a lot of data.
    
    Q3 >> the value three-quarters of the way through the sorted data. This is usually the 0.75 quantile, but I will use 0.80 for the same reason.
    
How IQR works:
    First it sorts the data and finds the median.
    Then it takes the numbers below the median and finds their own median, "Q1", and takes the numbers
    above the overall median and finds their own median, "Q3".
    
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/Boxplot_vs_PDF.svg/1200px-Boxplot_vs_PDF.svg.png">

Then we take the difference between Q3 and Q1. Any value more than 1.5 times this difference below Q1 or above Q3 is treated as an outlier.
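
The limits can be sketched on a toy array. This is only an illustration, not the notebook's own helper: `iqr_fences`, its parameters, and the sample values are made up for the example, and it defaults to the textbook 0.25 and 0.75 quantiles rather than the 0.15 and 0.80 used in this notebook.

```python
import numpy as np

def iqr_fences(values, q_low=0.25, q_high=0.75, k=1.5):
    """Return (low_fence, high_fence) for a 1-D array (hypothetical helper)."""
    q1, q3 = np.quantile(values, [q_low, q_high])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# made-up sample with one obviously extreme value
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
low_fence, high_fence = iqr_fences(sample)
outliers = sample[(sample < low_fence) | (sample > high_fence)]
print(low_fence, high_fence, outliers)
```

Passing `q_low=0.15` and `q_high=0.80` pushes both fences further out, which is the trade-off made here to keep more rows.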

#### But before getting our hands dirty, let's define some functions that we will use a lot:
    "IQR" to calculate the IQR for us 
    "upper" and "lower" to fetch the rows whose values are above the upper limit or below the lower limit 
    "outliers_del" to delete them 
    "plot" to plot the data 
    "outlier_compare" to compare the data before and after deleting the outliers
    
I will write a comment for each function when creating it.

In [None]:
# This function calculates the IQR and returns the rows above the upper limit,
# the rows below the lower limit, and the two limits themselves
def IQR(column_name):
    Q1 = new_df[column_name].quantile(0.15)
    Q3 = new_df[column_name].quantile(0.80)
    iqr = Q3 - Q1  # lower-case name so it does not shadow the function name
    upper_limit = Q3 + 1.5 * iqr
    lower_limit = Q1 - 1.5 * iqr
    values_upper = new_df[new_df[column_name] > upper_limit]
    values_lower = new_df[new_df[column_name] < lower_limit]
    
    return values_upper, values_lower, upper_limit, lower_limit

In [None]:
# This function checks whether IQR returned any rows above the upper limit.
# Why zero? Because the shape will look like, for example, (2, 14), which means 2 rows contain outliers.
# If there is at least one such row, the function returns those rows.
# Note: it reads the global values_upper and upper_limit set by the latest IQR(...) call.
def upper(column_name):
    if values_upper.shape[0] > 0:
        print("Outliers above the upper limit: ")
        return new_df[new_df[column_name] > upper_limit]
    else:
        print("There are no values higher than the upper limit!")

In [None]:
# Same as above, but for values below the lower limit
def lower(column_name):
    if values_lower.shape[0] > 0:
        print("Outliers below the lower limit: ")
        return new_df[new_df[column_name] < lower_limit]
    else:
        print("There are no values lower than the lower limit!")

In [None]:
# This function deletes any rows above the upper limit or below the lower limit
def outliers_del(column_name):
    # new_df is declared global so the assignments below update the notebook-level DataFrame
    global new_df
    # keep values equal to a limit, because IQR only flags values strictly outside the limits
    new_df = new_df[new_df[column_name] <= upper_limit]
    new_df = new_df[new_df[column_name] >= lower_limit]
    print("the old data shape is :", data.shape)
    print("the new data shape is :", new_df.shape)

In [None]:
# This function draws a box plot of a column from the original data
def plot(column_name):
    plt.style.use('fivethirtyeight')
    plt.figure(figsize=(16,5))
    plt.subplot(1,2,1)
    # pass the column as x= so the box is horizontal regardless of the seaborn version
    sns.boxplot(x=data[column_name], palette="rocket")
    plt.show()

In [None]:
# This function draws the box plot before (data) and after (new_df) deleting the outliers
def outlier_compare(column_name):
    plt.style.use('fivethirtyeight')
    plt.figure(figsize=(25,15))
    plt.subplot(2,2,1)
    sns.boxplot(x=data[column_name], palette="rocket")
    plt.subplot(2,2,2)
    sns.boxplot(x=new_df[column_name], palette="rocket")
    plt.show()

In [None]:
Upper_Outliers_columns = []
Lower_Outliers_columns = []
numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
for column in new_df:
    if new_df[column].dtype in numeric_dtypes:
        values_upper, values_lower, upper_limit, lower_limit = IQR(column)
        if values_upper.shape[0] > 0:
            Upper_Outliers_columns.append(column)
        if values_lower.shape[0] > 0:
            Lower_Outliers_columns.append(column)

In [None]:
print('Columns with values above the upper limit: ', Upper_Outliers_columns)
print('Columns with values below the lower limit: ', Lower_Outliers_columns)

I will ignore fbs and thal because those are categorical columns.

#### trestbps

In [None]:
plot('trestbps')

In [None]:
values_upper, values_lower, upper_limit, lower_limit = IQR('trestbps')

In [None]:
upper('trestbps')

In [None]:
lower('trestbps')

In [None]:
outliers_del('trestbps')

In [None]:
outlier_compare('trestbps')

#### chol

In [None]:
plot('chol')

In [None]:
values_upper, values_lower, upper_limit, lower_limit = IQR('chol')

In [None]:
upper('chol')

In [None]:
lower('chol')

In [None]:
outliers_del('chol')

In [None]:
outlier_compare('chol')

#### oldpeak

In [None]:
plot('oldpeak')

In [None]:
values_upper, values_lower, upper_limit, lower_limit = IQR('oldpeak')

In [None]:
upper('oldpeak')

In [None]:
outliers_del('oldpeak')

In [None]:
outlier_compare('oldpeak')

### Skewness

Let's define some new functions here.

In [None]:
# This function plots the distribution and the probability plot of a column
def skew_plot(column_name):
    plt.style.use('fivethirtyeight')
    plt.figure(figsize=(16,9))
    plt.subplot(2,2,1)
    sns.distplot(data[column_name], fit=norm)
    plt.subplot(2,2,2)
    res = stats.probplot(data[column_name], plot=plt) 
    plt.show()

In [None]:
# calculate the skewness
def skew(column_name):
    print("Skewness: %f" % new_df[column_name].skew())

In [None]:
# Check that the column does not contain zeros or negative values (log is undefined for them)
def zeros(column_name):
    if (new_df[column_name] <= 0).any():
        print("Watch out you have zeros or negative values here!")
    else:
        print("Your column is clean!")

In [None]:
# transform the data with log 
def log(column_name):
    new_df[column_name] = np.log(new_df[column_name])

In [None]:
# transform the data with square root 
def sqrt(column_name):
    new_df[column_name] = np.sqrt(new_df[column_name])

In [None]:
# This one compares the original data with the data after edits such as correcting the skewness
def skew_compare(column_name):
    plt.figure(figsize=(20,15))
    plt.subplot(2,2,1)
    sns.distplot(data[column_name], fit=norm)
    plt.subplot(2,2,2)
    res = stats.probplot(data[column_name], plot=plt)
    plt.subplot(2,2,3)
    sns.distplot(new_df[column_name], fit=norm)
    plt.subplot(2,2,4)
    res = stats.probplot(new_df[column_name], plot=plt)
    plt.show()

In [None]:
# imported under an alias so it does not overwrite the skew() helper defined above
from scipy.stats import skew as scipy_skew

skewness_list = {}
for i in new_df:
    if new_df[i].dtype != "object":
        skewness_list[i] = scipy_skew(new_df[i])

skewness = pd.DataFrame({'Skew' :skewness_list})
plt.style.use('fivethirtyeight')
plt.figure(figsize=(15,9))
plt.xlabel('Features', fontsize=15)
plt.ylabel('Skewness', fontsize=15)
plt.bar(range(len(skewness_list)), list(skewness_list.values()), align='center')
# set the tick labels and their rotation in a single call
plt.xticks(range(len(skewness_list)), list(skewness_list.keys()), rotation=90)

plt.show()

In [None]:
skewness_list

In [None]:
skew_plot('chol')

In [None]:
zeros('chol')

In [None]:
log('chol')

In [None]:
skew_compare('chol')
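
To see why a log transform helps with a right-skewed column, here is a small sketch on synthetic data. The log-normal sample stands in for a column like chol and is purely an assumption for illustration; it is not the notebook's data.

```python
import numpy as np
from scipy import stats

# synthetic, strictly positive sample with a long right tail (made up for the example)
rng = np.random.default_rng(0)
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

skew_before = stats.skew(right_skewed)
skew_after = stats.skew(np.log(right_skewed))
print("skewness before log:", skew_before)
print("skewness after log: ", skew_after)
```

The log of a log-normal sample is normally distributed, so the skewness after the transform should sit close to zero. Real columns will rarely behave this cleanly.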

### Data splitting 

In [None]:
X = new_df.drop("target", axis=1)

In [None]:
y = new_df['target'].copy()

In [None]:
X.shape, y.shape

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.20, shuffle=True, random_state=42)

### Feature scaling or Data scaling

Machine learning algorithms that use gradient descent as an optimization technique, such as linear regression, logistic regression, and neural networks, generally work better when the data is scaled. The same goes for distance-based models such as KNN.

There are three common scalers in sklearn:

MinMaxScaler(feature_range = (0, 1)) transforms each value in the column proportionally within the range [0, 1]. Use this as the first scaler choice to transform a feature, as it preserves the shape of the distribution (no distortion).

StandardScaler() rescales each column to mean 0 and standard deviation 1, i.e., each value is standardized by subtracting the mean and dividing by the standard deviation. Use StandardScaler if you know the data distribution is normal.

If there are outliers, use RobustScaler(). Alternatively, you could remove the outliers and use either of the above two scalers (the choice depends on whether the data is normally distributed).

We deleted most of the outliers earlier, so we can use MinMaxScaler or StandardScaler.
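
As a rough illustration of how the three scalers differ, the sketch below applies each one to a toy column that contains a single large value. The toy column and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# made-up single-feature column with one large value
toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax = MinMaxScaler().fit_transform(toy)      # squeezed into [0, 1]
standard = StandardScaler().fit_transform(toy)  # mean 0, standard deviation 1
robust = RobustScaler().fit_transform(toy)      # centered on the median, scaled by the IQR

print(minmax.ravel())
print(standard.ravel())
print(robust.ravel())
```

With MinMaxScaler the large value pushes the four ordinary values close to 0, while RobustScaler keeps them spread around 0, which is why it is the usual suggestion when outliers remain.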



In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
X_train = scaler.fit_transform(X_train)

In [None]:
X_test = scaler.transform(X_test)
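
One possible refinement, which this notebook does not use: when a scaler is fitted on the whole training set before cross-validation, each validation fold has already influenced the scaling statistics. Wrapping the scaler and the model in a pipeline refits the scaler inside every fold. The sketch below uses synthetic data from `make_classification`, and the names `X_demo`, `y_demo`, `pipeline`, and `cv_demo` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# the scaler is fitted only on the training part of each split
pipeline = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
cv_demo = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv_demo)
print(pipeline_scores.mean())
```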

## Modeling and Evaluation Using Cross-Validation

In [None]:
# Classification models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
# Cross validation
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import ShuffleSplit
# Evaluation metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

In [None]:
cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)

### LogisticRegression

In [None]:
LR = LogisticRegression()
LR_scores = cross_val_score(LR, X_train, y_train, cv=cv)
LR_scores

In [None]:
LR_scores.mean()

In [None]:
LR_cv_pred = cross_val_predict(LR, X_train, y_train, cv=10)
accuracy_score(y_train, LR_cv_pred)

In [None]:
LR.fit(X_train, y_train)
LR.score(X_train, y_train)

### KNeighborsClassifier

In [None]:
KNN = KNeighborsClassifier()
KNN_scores = cross_val_score(KNN, X_train, y_train, cv=cv)
KNN_scores

In [None]:
KNN_scores.mean()

In [None]:
KNN_cv_pred = cross_val_predict(KNN, X_train, y_train, cv=10)
accuracy_score(y_train, KNN_cv_pred)

In [None]:
KNN.fit(X_train, y_train)
KNN.score(X_train, y_train)

### RandomForestClassifier

In [None]:
RF = RandomForestClassifier()
RF_scores = cross_val_score(RF, X_train, y_train, cv=cv)
RF_scores

In [None]:
RF_scores.mean()

In [None]:
RF_cv_pred = cross_val_predict(RF, X_train, y_train, cv=10)
accuracy_score(y_train, RF_cv_pred)

In [None]:
RF.fit(X_train, y_train)
RF.score(X_train, y_train)

## Fine-Tune Our Model

Now that we have trained our models and have an idea of how they perform, it is time to find the optimal hyperparameters.
One way to do that would be to fiddle with the hyperparameters manually until you find a great combination of hyperparameter values. This would be very tedious work, and you may not have time to explore many combinations.
Instead, you should get Scikit-Learn's GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with and what values to try out, and it will evaluate all the possible combinations of hyperparameter values using cross-validation.
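
A minimal sketch of what "all the possible combinations" means in practice, on synthetic data. The grid, the data, and the names `demo_grid` and `demo_search` are assumptions for illustration; they are not the grids used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# 3 values x 2 values = 6 candidate combinations, each scored with 5-fold cross-validation
demo_grid = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}
demo_search = GridSearchCV(KNeighborsClassifier(), demo_grid, cv=5, scoring='accuracy')
demo_search.fit(X_demo, y_demo)

n_candidates = len(demo_search.cv_results_['params'])
print(n_candidates, demo_search.best_params_, demo_search.best_score_)
```

When `param_grid` is a list of dicts, each dict is expanded on its own, so hyperparameters that sit in different dicts are never combined with each other.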

In [None]:
from sklearn.model_selection import GridSearchCV

#### LogisticRegression

In [None]:
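# Each dict is searched separately: the first tunes penalty and C with the default solver,
# the second tunes solver, max_iter and multi_class with the default penalty.
# Combinations a solver does not support (e.g. 'l1' with the default 'lbfgs') fail to fit
# and are scored as NaN.
param_grid = [
        {'penalty': ['l1', 'l2', 'elasticnet', 'none'], 'C': [.001, .01, .1, 1]},
        {'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'], 'max_iter': [100, 1000, 10000], 
         'multi_class': ['auto', 'ovr', 'multinomial']},
]

LR = LogisticRegression()
grid_search = GridSearchCV(LR, param_grid, cv=cv, scoring='accuracy', return_train_score=True)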
grid_search.fit(X_train, y_train)

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_estimator_

#### KNeighborsClassifier

In [None]:
param_grid = [
        {'n_neighbors': [2, 3, 4, 5, 6], 'weights': ['uniform','distance']},
        {'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'], 'leaf_size': [30, 40, 50]},
]

KNN = KNeighborsClassifier()
grid_search = GridSearchCV(KNN, param_grid, cv=cv, scoring='accuracy', return_train_score=True)
grid_search.fit(X_train, y_train)

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_estimator_

#### RandomForestClassifier

In [None]:
param_grid = [
        {'n_estimators': [100, 150, 200, 250, 300], 'criterion': ['gini','entropy']},
        {'max_depth': [1, 2, 3, 4, 5,6,7,8,9,10], 'max_features': ['auto', 'sqrt', 'log2']},
]

RF = RandomForestClassifier()
grid_search = GridSearchCV(RF, param_grid, cv=cv, scoring='accuracy', return_train_score=True)
grid_search.fit(X_train, y_train)

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_estimator_

The grid search gave me max_depth=2 the first time I ran it and max_depth=3 the second time. The random forest is randomized and no random_state is set, so the result can change between runs, but 2 works fine for me!

## Performance Measures

OK, let's see how the new hyperparameters perform.

#### LogisticRegression

In [None]:
LR = LogisticRegression(solver='saga')
LR_scores = cross_val_score(LR, X_train, y_train, cv=cv)
LR_scores

In [None]:
LR_score = LR_scores.mean()
LR_score

In [None]:
LR_cv_pred = cross_val_predict(LR, X_train, y_train, cv=10)
accuracy_score(y_train, LR_cv_pred)

In [None]:
LR.fit(X_train, y_train)
LR.score(X_train, y_train)

In [None]:
LR_pred = LR.predict(X_train)

In [None]:
confusion_matrix(y_train, LR_pred)

#### KNeighborsClassifier

In [None]:
KNN = KNeighborsClassifier(n_neighbors=6, weights='distance')
KNN_scores = cross_val_score(KNN, X_train, y_train, cv=cv)
KNN_scores

In [None]:
KNN_scores.mean()

In [None]:
KNN_cv_pred = cross_val_predict(KNN, X_train, y_train, cv=10)
accuracy_score(y_train, KNN_cv_pred)

In [None]:
KNN.fit(X_train, y_train)
KNN.score(X_train, y_train)

In [None]:
KNN_pred = KNN.predict(X_train)

In [None]:
confusion_matrix(y_train, KNN_pred)

#### RandomForestClassifier

In [None]:
RF = RandomForestClassifier(max_depth=2)
RF_scores = cross_val_score(RF, X_train, y_train, cv=cv)
RF_scores

In [None]:
RF_scores.mean()

In [None]:
RF_cv_pred = cross_val_predict(RF, X_train, y_train, cv=10)
accuracy_score(y_train, RF_cv_pred)

In [None]:
RF.fit(X_train, y_train)
RF.score(X_train, y_train)

In [None]:
RF_pred = RF.predict(X_train)

In [None]:
confusion_matrix(y_train, RF_pred)

## Evaluate Our System on the Test Set 

### Evaluation in classification is different from evaluation in regression. In classification we have a lot of metrics to evaluate our model, like:

    F1_score
    Confusion Matrix
    ROC curve
    AUC 
    Jaccard_score
    Log_loss
    
#### F1_score:- The F1 score is the harmonic mean of precision and recall. It reaches its best value at 1 and its worst at 0.

<img src="https://forums.fast.ai/uploads/default/original/3X/c/c/cca1b3ad72fc927fbf3d3690f01d2e3b5a31dd2e.png">
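
A small worked example of the F1 score on made-up labels, computing it by hand from precision and recall and comparing with sklearn. The label lists are assumptions for illustration only.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# made-up labels: 4 true positives, 1 false negative, 1 false positive, 2 true negatives
y_true_demo = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 1, 0, 1]

precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP)
recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN)
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true_demo, y_pred_demo)
print(precision, recall, f1_manual, f1_sklearn)
```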

#### Confusion Matrix:- It is a table with the 4 different combinations of predicted and actual values: true positive "TP", true negative "TN", false positive "FP", and false negative "FN". It is extremely useful for measuring Recall, Precision, Specificity, Accuracy, and most importantly AUC-ROC curves.

<img src="https://miro.medium.com/max/712/1*Z54JgbS4DUwWSknhDCvNTQ.png">
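
The four cells can be read straight out of sklearn's confusion matrix for a binary problem. The sketch below uses made-up labels, and the function is imported under the alias `sk_confusion_matrix` purely so the example does not depend on any other name in the notebook.

```python
from sklearn.metrics import confusion_matrix as sk_confusion_matrix

# made-up labels for illustration
y_true_demo = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 1, 0, 1]

# for binary labels the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = sk_confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
specificity = tn / (tn + fp)
print(tn, fp, fn, tp, accuracy, specificity)
```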

#### ROC curve:- An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

True Positive Rate (TPR), a synonym for recall: TP / (TP + FN)
 
False Positive Rate (FPR): FP / (FP + TN)

<img src="https://developers.google.com/machine-learning/crash-course/images/ROCCurve.svg" style="width:500px;height:600px;">
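
To get one point per threshold, `roc_curve` needs scores or probabilities rather than hard 0/1 predictions. A tiny sketch with made-up labels and scores (both are assumptions for illustration):

```python
from sklearn.metrics import roc_curve

# made-up true labels and predicted scores for the positive class
labels_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

# each threshold yields one (FPR, TPR) point on the curve
fpr_demo, tpr_demo, thresholds_demo = roc_curve(labels_demo, scores_demo)
for f, t in zip(fpr_demo, tpr_demo):
    print("FPR =", f, "TPR =", t)
```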
 
#### AUC:- AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve

<img src="https://developers.google.com/machine-learning/crash-course/images/AUC.svg" style="width:500px;height:600px;">

AUC provides an aggregate measure of performance across all possible classification thresholds.
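
Another way to read the AUC: it is the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below checks this on made-up labels and scores without ties; all names are assumptions for illustration.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# made-up true labels and predicted scores for the positive class
labels_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

positives = [s for s, y in zip(scores_demo, labels_demo) if y == 1]
negatives = [s for s, y in zip(scores_demo, labels_demo) if y == 0]

# count the pairs where the positive sample is ranked above the negative one
pairs = list(product(positives, negatives))
auc_by_ranking = sum(p > n for p, n in pairs) / len(pairs)
auc_sklearn = roc_auc_score(labels_demo, scores_demo)
print(auc_by_ranking, auc_sklearn)
```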


#### Jaccard_score:- The Jaccard similarity coefficient, defined as the size of the intersection divided by the size of the union of two label sets, is used to compare the set of predicted labels for a sample to the corresponding set of labels in y_true.


<img src="https://miro.medium.com/max/744/1*XiLRKr_Bo-VdgqVI-SvSQg.png" >
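
For binary labels the Jaccard score of the positive class reduces to TP / (TP + FP + FN). A quick check on made-up labels (the lists are assumptions for illustration):

```python
from sklearn.metrics import jaccard_score

# made-up labels: 4 true positives, 1 false negative, 1 false positive, 2 true negatives
y_true_demo = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 1, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true_demo, y_pred_demo))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true_demo, y_pred_demo))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true_demo, y_pred_demo))

jaccard_manual = tp / (tp + fp + fn)
jaccard_sklearn = jaccard_score(y_true_demo, y_pred_demo)
print(jaccard_manual, jaccard_sklearn)
```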


#### Log_loss:- The loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of a logistic model that returns y_pred probabilities for its training data y_true. The log loss is only defined for two or more labels.




$$L_{\log}(y, p) = -\left(y \log(p) + (1 - y) \log(1 - p)\right)$$
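
The formula above can be checked on a few made-up probabilities by averaging the per-sample loss by hand and comparing with sklearn. The labels and probabilities are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# made-up true labels and predicted probabilities of the positive class
y_demo = np.array([1, 0, 1])
p_demo = np.array([0.9, 0.2, 0.6])

# per-sample loss from the formula, then the mean over the samples
per_sample = -(y_demo * np.log(p_demo) + (1 - y_demo) * np.log(1 - p_demo))
loss_manual = per_sample.mean()
loss_sklearn = log_loss(y_demo, p_demo)
print(loss_manual, loss_sklearn)
```

Confident, correct probabilities (such as 0.9 for a true 1) contribute little to the loss, while confident mistakes are penalized heavily.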


Now let's try out some metrics.



#### LR

In [None]:
LR_pred = LR.predict(X_test)

In [None]:
accuracy_score(y_test, LR_pred)

confusion_matrix

In [None]:
# stored under a new name so the confusion_matrix function is not overwritten
LR_cm = confusion_matrix(y_test, LR_pred)
LR_cm

In [None]:
plot_confusion_matrix(LR, X_test, y_test, cmap=plt.cm.Blues)  
plt.show() 

ROC

In [None]:
# use the predicted probability of the positive class so the curve covers every threshold
LR_proba = LR.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, LR_proba)

fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c=".3")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.rcParams['font.size'] = 12
plt.title('ROC curve for heart disease classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
plt.show()

AUC

In [None]:
auc(fpr, tpr)

Jaccard_score

In [None]:
jaccard_score(y_test, LR_pred)

F1_score

In [None]:
f1_score(y_test, LR_pred)

Log_loss

In [None]:
# log loss is defined on predicted probabilities, not on hard 0/1 predictions
log_loss(y_test, LR.predict_proba(X_test))

#### KNN

In [None]:
KNN_pred = KNN.predict(X_test)

In [None]:
accuracy_score(y_test, KNN_pred)

confusion_matrix

In [None]:
plot_confusion_matrix(KNN, X_test, y_test, cmap=plt.cm.Blues)  
plt.show() 

ROC

In [None]:
# use the predicted probability of the positive class so the curve covers every threshold
KNN_proba = KNN.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, KNN_proba)

plt.style.use('fivethirtyeight')
fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c=".3")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.rcParams['font.size'] = 12
plt.title('ROC curve for heart disease classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
plt.show()

AUC

In [None]:
auc(fpr, tpr)

Jaccard_score

In [None]:
jaccard_score(y_test, KNN_pred)

F1_score

In [None]:
f1_score(y_test, KNN_pred)

Log_loss

In [None]:
# log loss is defined on predicted probabilities, not on hard 0/1 predictions
log_loss(y_test, KNN.predict_proba(X_test))

#### RF

In [None]:
RF_pred = RF.predict(X_test)

In [None]:
accuracy_score(y_test, RF_pred)

confusion_matrix

In [None]:
plot_confusion_matrix(RF, X_test, y_test, cmap=plt.cm.Blues)  
plt.show() 

ROC

In [None]:
# use the predicted probability of the positive class so the curve covers every threshold
RF_proba = RF.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, RF_proba)

fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c=".3")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.rcParams['font.size'] = 12
plt.title('ROC curve for heart disease classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
plt.show()

AUC

In [None]:
auc(fpr, tpr)

Jaccard_score

In [None]:
jaccard_score(y_test, RF_pred)

F1_score

In [None]:
f1_score(y_test, RF_pred)

Log_loss

In [None]:
# log loss is defined on predicted probabilities, not on hard 0/1 predictions
log_loss(y_test, RF.predict_proba(X_test))

Well, I tried my best here. If you have any good ideas that could improve my methodology or my code, feel free to leave a comment and let me know!