<center><h1 style = "background:black;color:white;border:0;border-radius:3px;font-family:verdana" > PREDICTING HEART ATTACK</h1></center>

<center><h1 style = "background:black;color:white;border:0;border-radius:3px;font-family:verdana" > 1 - Introduction</h1></center>

<p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs, and this dataset contains 12 features that can be used to predict mortality by heart failure.</p>

<p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.</p>

<p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.</p>

<p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">For this project, we will predict whether a patient with heart failure dies during the follow-up period, based on some risk factors. This project was developed for the 
<a href="https://www.kaggle.com/andrewmvd/heart-failure-clinical-data">Kaggle Challenge</a>.</p>

<p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">The dataset used in this project was obtained from the <a href="https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records">UCI Machine Learning Repository</a>.</p>

<h2 style = "background:black;color:white;border:0;border-radius:3px;font-family:arial"> Import Library </h2>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

import pandas as pd
import numpy as np
import seaborn as sns
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import scipy.stats
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import train_test_split
from sklearn import metrics
import plotly.express as px
from tabulate import tabulate
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

import warnings
warnings.filterwarnings("ignore")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<h2 style = "background:black;color:white;border:0;border-radius:3px;font-family:arial"> Variables description </h2>

<html>
<style>
ol.a {list-style-type: number;}
ul.b {list-style-type: circle;}
</style>

<dl>
  <dt><b>age</b>: age of the patient (years) [40 to 95]</dt>
  <dt><b>anaemia</b>: decrease of red blood cells or hemoglobin (boolean) [0 and 1]</dt>
    <dd>- Hematocrit is the percentage of red cells in your blood.</dd>
    <dd>- Normal levels are 41% to 50% for men and 36% to 48% for women.</dd>
  <dt><b>creatinine phosphokinase (CPK)</b>: level of the CPK enzyme in the blood (mcg/L) [23 to 7861]</dt>
    <dd>- CPK is an enzyme in the body that causes the phosphorylation of creatine.</dd>
    <dd>- CPK is found in the skeletal muscle, cardiac muscle, brain, bladder, stomach and colon.</dd>
    <dd>- CPK leaks into the blood when muscle tissue is damaged, so high levels of CPK indicate stress or injury to the heart or other muscles.</dd>
    <dd>- The normal CPK range is 39 – 308 U/L for males and 26 – 192 U/L for females.</dd>

  <dt><b>diabetes</b>: if the patient has diabetes (boolean) [0 and 1]</dt>
  <dt><b>ejection fraction</b>: percentage of how much blood the left ventricle pumps out with each contraction [14 to 80]</dt>
    <dd>- This indication of how well your heart is pumping out blood can help to diagnose and track heart failure</dd>
    <dd>- A normal heart’s ejection fraction may be between 50 and 70 percent.</dd>

  <dt><b>high blood pressure</b>: if the patient has hypertension (boolean) [0 and 1]</dt>
  <dt><b>platelets</b>: platelets in the blood (kiloplatelets/mL) [25,100 to 850,000]</dt>
    <dd>- Too many platelets can lead to heart attack and stroke</dd>
    <dd>- A normal platelet count ranges from 150,000 to 450,000</dd>

  <dt><b>serum creatinine</b>: level of serum creatinine in the blood (mg/dL) [0.5 to 9.4]</dt>
    <dd>- A creatinine test is a measure of how well your kidneys are performing their job of filtering waste from your blood.</dd>
    <dd>- The normal range is 0.74 to 1.35 mg/dL for men and 0.59 to 1.04 mg/dL for women.</dd>
    <dd>- If a patient has high levels of serum creatinine, it may indicate renal dysfunction</dd>
    
  <dt><b>serum sodium</b>: level of serum sodium in the blood (mEq/L) [113 to 148]</dt>
    <dd>- Hyponatremia occurs when the concentration of sodium in your blood is low. Sodium helps regulate the amount of water that's in and around your cells.</dd>
    <dd>- A normal blood sodium level is between 135 and 145 mEq/L</dd>
    <dd>- An abnormally low level of sodium in the blood might be caused by heart failure</dd>
    
  <dt><b>sex</b>: woman or man (binary) [0 and 1]</dt>
  <dt><b>smoking</b>: if the patient smokes or not (boolean) [0 and 1]</dt>
  <dt><b>time</b>: follow-up period (days) [4 to 285]</dt>
  <dt><b>[target] death event</b>: if the patient died during the follow-up period (boolean) [0 and 1]</dt>

</dl>
</html>

In [None]:
#Import dataset
df = pd.read_csv("/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

<center><h1 style = "background:black;color:white;border:0;border-radius:3px;font-family:verdana" > 2 - Exploratory Data Analysis</h1></center>

In [None]:
# Checking missing values
df.isna().sum()

In [None]:
# histogram plot for all variables
df.hist(figsize=(20,20));

In [None]:
# Data Visualization - Age x DEATH_EVENT
fig_1 = px.histogram(df, 'age', color='DEATH_EVENT', nbins=50, title='Data distribution per age')
fig_1.show()

fig_2 = px.box(df, x="DEATH_EVENT", y="age", title='Box Plot (DEATH_EVENT x Age)')
fig_2.show()

In [None]:
# Data Visualization - creatinine_phosphokinase x DEATH_EVENT
fig_3 = px.histogram(df, 'creatinine_phosphokinase', color='DEATH_EVENT', nbins=50, title='Data distribution per creatinine_phosphokinase')
fig_3.show()

fig_4 = px.box(df, x="DEATH_EVENT", y="creatinine_phosphokinase", title='Box Plot (DEATH_EVENT x creatinine_phosphokinase)')
fig_4.show()

In [None]:
# Data Visualization - ejection_fraction x DEATH_EVENT
fig_5 = px.histogram(df, 'ejection_fraction', color='DEATH_EVENT', nbins=50, title='Data distribution per ejection_fraction')
fig_5.show()

fig_6 = px.box(df, x="DEATH_EVENT", y="ejection_fraction", title='Box Plot (DEATH_EVENT x ejection_fraction)')
fig_6.show()

In [None]:
# Data Visualization - platelets x DEATH_EVENT
fig_7 = px.histogram(df, 'platelets', color='DEATH_EVENT', nbins=50, title='Data distribution per platelets')
fig_7.show()

fig_8 = px.box(df, x="DEATH_EVENT", y="platelets", title='Box Plot (DEATH_EVENT x platelets)')
fig_8.show()

In [None]:
# Data Visualization - serum_creatinine x DEATH_EVENT
fig_9 = px.histogram(df, 'serum_creatinine', color='DEATH_EVENT', nbins=50, title='Data distribution per serum_creatinine')
fig_9.show()

fig_10 = px.box(df, x="DEATH_EVENT", y="serum_creatinine", title='Box Plot (DEATH_EVENT x serum_creatinine)')
fig_10.show()

In [None]:
# Data Visualization - serum_sodium x DEATH_EVENT
fig_11 = px.histogram(df, 'serum_sodium', color='DEATH_EVENT', nbins=50, title='Data distribution per serum_sodium')
fig_11.show()

fig_12 = px.box(df, x="DEATH_EVENT", y="serum_sodium", title='Box Plot (DEATH_EVENT x serum_sodium)')
fig_12.show()

In [None]:
# Data Visualization - time x DEATH_EVENT
fig_13 = px.histogram(df, 'time', color='DEATH_EVENT', nbins=50, title='Data distribution per time')
fig_13.show()

fig_14 = px.box(df, x="DEATH_EVENT", y="time", title='Box Plot (DEATH_EVENT x time)')
fig_14.show()

In [None]:
# Count plots for all categorical variables
fig, ((axis1, axis2), (axis3, axis4), 
      (axis5, axis6)) = plt.subplots(3,2, figsize=(20,50))

sns.countplot(x='smoking', hue='DEATH_EVENT', data=df, ax=axis1)
sns.countplot(x='anaemia', hue='DEATH_EVENT', data=df, ax=axis2)
sns.countplot(x='sex', hue='DEATH_EVENT', data=df, ax=axis3)
sns.countplot(x='diabetes', hue='DEATH_EVENT', data=df, ax=axis4)
sns.countplot(x='high_blood_pressure', hue='DEATH_EVENT', data=df, ax=axis5)
axis6.axis('off');  # only five categorical variables, so hide the unused sixth panel

<h2 style = "background:black;color:white;border:0;border-radius:3px;font-family:arial">Checking if the target variable is balanced</h2>

In [None]:
# Class distribution
df.groupby('DEATH_EVENT').size()

In [None]:
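# SMOTE: oversample the minority class with synthetic samples
smote_bal = SMOTE()

X = df.iloc[:, 0:12]   # the 12 features
y = df.iloc[:, 12]     # the target, DEATH_EVENT
X_res, y_res = smote_bal.fit_resample(X, y)

# Plot the class counts after resampling
# (pass the series as x=; newer seaborn versions treat a positional argument as data)
sns.countplot(x=y_res, palette = "OrRd")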
plt.box(False)
plt.xlabel('Death No (0) / Yes (1)', fontsize = 11)
plt.ylabel('Total', fontsize = 11)
plt.title('Counting deaths\n')
plt.show()
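
SMOTE balances the classes by creating synthetic minority samples rather than duplicating rows. The sketch below is a simplified, pure-Python illustration of that interpolation idea, not the imblearn implementation; the function name `smote_like_sample` and the toy values are assumptions made for this example. Because each synthetic point is built from existing rows, resampling before the train/test split means some test rows may be interpolations of training rows.

```python
import math
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Sort the other minority points by Euclidean distance to the base point
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

# Hypothetical minority-class rows, e.g. [age, ejection_fraction]
minority_points = [[60.0, 25.0], [65.0, 30.0], [72.0, 20.0], [80.0, 35.0]]
rng = random.Random(0)
synthetic = [smote_like_sample(minority_points, k=2, rng=rng) for _ in range(3)]
for point in synthetic:
    print([round(v, 2) for v in point])
```

Every synthetic point lies on the segment between two real minority points, so it stays inside the range already covered by the minority class.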

**With our new dataset balanced, let's verify that no NAs were introduced into the dataset.**

In [None]:
df_1 = pd.concat([X_res, y_res], axis=1)
df_1.isna().sum()

In [None]:
# Separate the categorical and numerical variables
num = [name for name in df_1.columns if df_1[name].nunique() > 3]
cat = [name for name in df_1.columns if df_1[name].nunique() < 3]
df_num = df_1[num].copy()  # copy so that adding a column does not modify a view of df_1
df_num["DEATH_EVENT"] = df_1["DEATH_EVENT"]
df_cat = df_1[cat]

<center><h1 style = "background:black;color:white;border:0;border-radius:3px;font-family:verdana" > 3 - Feature Selection</h1></center>

<h2 style = "background:black;color:white;border:0;border-radius:3px;font-family:arial">Correlation between Categorical variables</h2>

In [None]:
plt.figure(figsize = (10,5))
sns.heatmap(df_cat.corr(), cmap="Blues")

<ul><li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">Considering this dataset, it seems that "anaemia" has no correlation with the target "DEATH_EVENT".</p></li></ul>
<ul><li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">Let's use another technique to analyze this correlation. Our null hypothesis is that "anaemia" and "DEATH_EVENT" are not correlated; a p-value above 0.05 means we cannot reject it.</p></li></ul>

<p style = "color:black;font-weight:200;text-indent:0px;font-size:15px"><b>Using scipy.stats.pearsonr, we can evaluate the p-value for this variable:</b></p>

In [None]:
a, b = scipy.stats.pearsonr(df_cat.DEATH_EVENT, df_cat.anaemia)
table = [['Correlation Coefficient', 'p-value'], [a, b]]
print(tabulate(table))

<ul><li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">According to the correlation table, "anaemia" has a coefficient close to 0.</p></li></ul>
<ul><li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">The p-value calculated with scipy.stats.pearsonr for "anaemia" is > 0.05, so we cannot reject the hypothesis that it is uncorrelated with "DEATH_EVENT".</p></li></ul>

<p style = "color:black;font-weight:200;text-indent:0px;font-size:15px"><b>With these two pieces of evidence, we can remove this variable.</b></p>

<h2 style = "background:black;color:white;border:0;border-radius:3px;font-family:arial">Correlation between Numerical variables</h2>

In [None]:
plt.figure(figsize = (10,5))
sns.heatmap(df_num.corr(), cmap="Blues")

<p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">Considering this dataset, it seems that "platelets" and "creatinine_phosphokinase" have no correlation with the target "DEATH_EVENT". According to the correlation table, both variables have a coefficient close to 0.</p>
<p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">Let's use another technique to analyze this correlation. Our null hypothesis is that each variable is not correlated with "DEATH_EVENT"; a p-value above 0.05 means we cannot reject it.</p>

<p style = "color:black;font-weight:200;text-indent:0px;font-size:15px"><b>Using scipy.stats.pearsonr, we can evaluate the p-value for both variables:</b></p>

In [None]:
# First row: platelets, second row: creatinine_phosphokinase
a, b = scipy.stats.pearsonr(df_num.DEATH_EVENT, df_num.platelets)
c, d = scipy.stats.pearsonr(df_num.DEATH_EVENT, df_num.creatinine_phosphokinase)
table = [['Correlation Coefficient', 'p-value'], [a, b], [c, d]]
print(tabulate(table))

<p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">As we can see, both p-values (second column) are higher than 5%. Thus, we cannot reject the null hypothesis of no correlation.</p>

<p style = "color:black;font-weight:200;text-indent:0px;font-size:15px"><b>In conclusion, neither technique shows evidence that these variables are correlated with "DEATH_EVENT", so we can remove them from our dataset.</b></p>
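
As a rough illustration of what lies behind the coefficient and the p-value, the sketch below computes Pearson's r by hand together with the t statistic used to test the null hypothesis of no linear correlation. It is a minimal pure-Python sketch on hypothetical toy data; the helper names `pearson_r` and `t_statistic` are assumptions for this example, and the p-value itself (which scipy.stats.pearsonr reports as a two-sided value) is not computed here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """t statistic for H0: no linear correlation (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical toy data: a numeric feature and a binary target
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
target = [0, 1, 0, 1, 0, 1]

r = pearson_r(feature, target)
t = t_statistic(r, len(feature))
print(f"r = {r:.3f}, t = {t:.3f}")
```

A coefficient near 0 gives a small |t|, which corresponds to a large p-value, so the null hypothesis of no correlation is not rejected.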

<h2 style = "background:black;color:white;border:0;border-radius:3px;font-family:arial">Conclusions about Feature Selection</h2>

<p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">After evaluating the charts, balancing the target variable and analyzing the correlation between the variables, some observations were made:</p>

<ul>
    <li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px"><b>Age</b> --> Most deaths occur between 60-75 years</p></li>
    <li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px"><b>CPK</b> --> Normal range is 26-308 (mcg/L). Most deaths occur between 128-582 mcg/L</p></li>
    <li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px"><b>ejection_fraction</b> --> Normal range is 50-70%. Most deaths occur under 40%</p></li>
    <li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px"><b>platelets</b> --> Normal range is 150k-450k. Most deaths occur between 200k-310k</p></li>
    <li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px"><b>serum_creatinine</b> --> Normal range is 0.6-1.35 (mg/dL). Most deaths occur between 1-2 (mg/dL)</p></li>
    <li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px"><b>serum_sodium</b> --> Normal range is 135-145 (mEq/L). Most deaths occur between 133-140 (mEq/L)</p></li>
    <li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px"><b>time</b> --> Most deaths occur within the first 100 days of follow-up</p></li>
</ul>

<p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">We can draw some conclusions from these observations:</p>

<ol>
    <li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">Older age is associated with more deaths from heart failure;</p></li>
    <li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">Most of the death cases have a "serum_sodium" value in or near the normal range (135-145 mEq/L). For this dataset, this characteristic does not seem to impact the number of deaths. Consequently, we can remove this variable;</p></li>
    <li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">We can remove "platelets", "creatinine_phosphokinase" and "anaemia" according to the correlation analysis.</p></li>
</ol>

<p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">For the train and test phase, we will remove <i>"platelets"</i>, <i>"creatinine_phosphokinase"</i>, <i>"serum_sodium"</i> and <i>"anaemia"</i>. 🤘</p>

In [None]:
df_2 = df_1.drop(columns = ["platelets","creatinine_phosphokinase","anaemia", "serum_sodium"])
df_2.head()

<center><h1 style = "background:black;color:white;border:0;border-radius:3px;font-family:verdana" > 4 - Normalizing Data</h1></center>

In [None]:
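# Separating array into input (first 8 columns) and output (last column, DEATH_EVENT) components
array = df_2.values

X = array[:,0:8]
Y = array[:,8]

# Creating normalized data
# Note: Normalizer rescales each row (sample) to unit norm; it does not scale each column
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

In [None]:
# Separating train and test data (70% train, 30% test)
# Note: the split below uses the original X; normalizedX is not used by the models that follow
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.7)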
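
To make the scaling step more concrete, the sketch below contrasts two options in plain Python: row-wise L2 normalization (what sklearn's Normalizer does) and column-wise standardization using statistics computed on the training rows only (the idea behind sklearn's StandardScaler). The helper names and the toy rows are assumptions for this example, not part of the notebook's pipeline.

```python
import math
from statistics import mean, pstdev

def l2_normalize_rows(rows):
    """Row-wise scaling: divide each sample by its own Euclidean norm."""
    scaled = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        scaled.append([v / norm if norm else 0.0 for v in row])
    return scaled

def fit_column_stats(train_rows):
    """Column-wise mean and standard deviation, computed on training rows only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    return means, stds

def standardize(rows, means, stds):
    """Apply the training statistics to any set of rows."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical rows: [age, ejection_fraction, time]
train = [[60.0, 25.0, 10.0], [75.0, 38.0, 120.0], [50.0, 60.0, 250.0]]
test = [[65.0, 30.0, 90.0]]

row_scaled = l2_normalize_rows(train)
means, stds = fit_column_stats(train)
train_std = standardize(train, means, stds)
test_std = standardize(test, means, stds)
print(row_scaled[0])
print(test_std[0])
```

With row-wise normalization the largest feature in a row (here, time) dominates the result, whereas column-wise standardization puts every feature on a comparable scale. Reusing the training statistics for the test rows keeps information from the test set out of the preprocessing.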

<center><h1 style = "background:black;color:white;border:0;border-radius:3px;font-family:verdana" > 5 - Predictive Analysis</h1></center>

<p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">Predictive analysis is the phase in which we train and evaluate models using several Machine Learning techniques. In this project, the following algorithms were used:</p>

<ul>
    <li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">Logistic Regression</p></li>
    <li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">Decision Tree</p></li>
    <li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">Naive Bayes</p></li>
    <li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">SVM</p></li>
    <li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">Gradient Boosting Classifier</p></li>
    <li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">Random Forest Classifier</p></li>
     
</ul>

<h2 style = "background:black;color:white;border:0;border-radius:3px;font-family:arial">1) Logistic Regression</h2>

In [None]:
# Creating logistic Regression object
model_v1a = LogisticRegression(solver ='liblinear', max_iter=1000)

# Training the model with the training data and checking the score
model_v1a.fit(X_train, y_train)
model_v1a.score(X_train, y_train)

# Model accuracy - training data
predict_train_a = model_v1a.predict(X_train)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_train, predict_train_a)))
print()

In [None]:
# Model accuracy - test data
predict_test_a = model_v1a.predict(X_test)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, predict_test_a)))
print()
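
For intuition about what the fitted model does at prediction time, here is a minimal sketch of the logistic function applied to a linear combination of features. The weights, bias and patient values are hypothetical, chosen only for illustration; they are not the coefficients learned by `model_v1a`.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Estimated probability of the positive class (DEATH_EVENT = 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict(features, weights, bias, threshold=0.5):
    """Class label obtained by thresholding the probability."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical coefficients for [age, ejection_fraction]; not fitted values
weights = [0.05, -0.08]
bias = -1.0
patient = [70.0, 25.0]

p = predict_proba(patient, weights, bias)
label = predict(patient, weights, bias)
print(f"probability = {p:.3f}, predicted class = {label}")
```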

In [None]:
# Tuning Hyperparameters
valores_grid = {'penalty': ['l1','l2'], 'C': [0.001,0.01,0.1,1,10,100,1000]}

# Creating model
model = LogisticRegression(solver ='liblinear', max_iter=1000)

# Creating grid
model_v1 = GridSearchCV(estimator = model, param_grid = valores_grid)
model_v1.fit(X_train, y_train)

print("Best cross-validation accuracy: %.3f%%" % (model_v1.best_score_ * 100))
print("Best Model Parameters:\n", model_v1.best_estimator_)

In [None]:
predict_test = model_v1.predict(X_test)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, predict_test)))
print()
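
The grid search above evaluates every combination of the listed values. The sketch below only enumerates those combinations to show how many candidate models are involved; `expand_grid` is a hypothetical helper, and the fold count of 5 is an assumption based on scikit-learn's default when `cv` is not passed (version 0.22 and later).

```python
from itertools import product

# Same grid as in the cell above
param_grid = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

candidates = expand_grid(param_grid)
n_folds = 5  # assumption: default number of folds when cv is not given

print(len(candidates), "candidates,", len(candidates) * n_folds, "cross-validation fits")
print(candidates[0])
```

The cost grows multiplicatively with each added parameter, which is one reason a randomized search can be preferable for larger search spaces.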

<h2 style = "background:black;color:white;border:0;border-radius:3px;font-family:arial">2) Decision Tree</h2>

In [None]:
# Creating Decision Tree object
model_v2 = tree.DecisionTreeClassifier() 

# Training the model with the training data and checking the score
model_v2.fit(X_train, y_train)
model_v2.score(X_train, y_train)

# Model accuracy - training data
predict_train_v2 = model_v2.predict(X_train)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_train, predict_train_v2)))
print()

In [None]:
# Model accuracy - test data
predict_test_v2 = model_v2.predict(X_test)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, predict_test_v2)))
print()

In [None]:
# Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 8),
              "min_samples_leaf": randint(1, 8),
              "criterion": ["gini", "entropy"]}

# Instantiate a Decision Tree classifier: dt_clf
# (named dt_clf rather than tree so the imported sklearn.tree module is not shadowed)
dt_clf = DecisionTreeClassifier()

# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(dt_clf, param_dist, cv=5)

# Fit it to the data
tree_cv.fit(X_train, y_train)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

In [None]:
predict_test_a2 = tree_cv.predict(X_test)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, predict_test_a2)))
print()
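
Unlike a grid search, the randomized search draws a fixed number of parameter settings from the given lists and distributions. The sketch below mimics that sampling with the standard library only; `sample_params` is a hypothetical helper, `random.Random.randrange(1, 8)` stands in for scipy's `randint(1, 8)` (both yield integers from 1 to 7), and `n_iter = 10` is an assumption matching RandomizedSearchCV's default.

```python
import random

def sample_params(rng):
    """Draw one hyperparameter setting, mirroring param_dist above."""
    return {
        "max_depth": rng.choice([3, None]),
        "max_features": rng.randrange(1, 8),      # integers 1..7 (upper bound excluded)
        "min_samples_leaf": rng.randrange(1, 8),  # integers 1..7 (upper bound excluded)
        "criterion": rng.choice(["gini", "entropy"]),
    }

rng = random.Random(42)
n_iter = 10  # assumption: default number of sampled settings
samples = [sample_params(rng) for _ in range(n_iter)]
for params in samples[:3]:
    print(params)
```

Since the upper bound is excluded, `max_features` never reaches 8, so no sampled tree considers all eight features at a split.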

<h2 style = "background:black;color:white;border:0;border-radius:3px;font-family:arial">3) Naive Bayes</h2>

In [None]:
# Creating GaussianNB object
model_v3 = GaussianNB()

# Training the model with the training data and checking the score
model_v3.fit(X_train, y_train)
model_v3.score(X_train, y_train)

# Model accuracy - training data
predict_train_v3 = model_v3.predict(X_train)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_train, predict_train_v3)))
print()

In [None]:
# Model accuracy - test data
predict_test_v3 = model_v3.predict(X_test)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, predict_test_v3)))
print()
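
As a rough picture of how Gaussian Naive Bayes classifies, the sketch below fits a per-class mean and variance for each feature and picks the class with the highest log posterior. It is a simplified pure-Python version on hypothetical toy rows; the function names and the small constant added to the variances are assumptions for this example and do not reproduce sklearn's GaussianNB exactly.

```python
import math
from statistics import mean, pvariance

def fit_gaussian_nb(rows, labels):
    """Estimate class priors and per-feature Gaussian parameters."""
    model = {}
    for cls in set(labels):
        cls_rows = [r for r, l in zip(rows, labels) if l == cls]
        columns = list(zip(*cls_rows))
        model[cls] = {
            "prior": len(cls_rows) / len(rows),
            "means": [mean(c) for c in columns],
            # small constant (assumption) avoids division by zero for constant features
            "vars": [pvariance(c) + 1e-9 for c in columns],
        }
    return model

def log_gaussian(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict_nb(model, row):
    """Return the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for cls, params in model.items():
        scores[cls] = math.log(params["prior"]) + sum(
            log_gaussian(x, m, v)
            for x, m, v in zip(row, params["means"], params["vars"])
        )
    return max(scores, key=scores.get)

# Hypothetical rows: [age, ejection_fraction] with a binary label
rows = [[50.0, 60.0], [55.0, 55.0], [45.0, 50.0],
        [80.0, 20.0], [85.0, 25.0], [75.0, 30.0]]
labels = [0, 0, 0, 1, 1, 1]

nb_model = fit_gaussian_nb(rows, labels)
print(predict_nb(nb_model, [78.0, 22.0]), predict_nb(nb_model, [52.0, 58.0]))
```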

<h2 style = "background:black;color:white;border:0;border-radius:3px;font-family:arial">4) Support Vector Machine</h2>

In [None]:
# Creating the model
model_v4 = svm.SVC(kernel = 'rbf')

# Grid values
C_range = np.array([50., 100., 400.])
gamma_range = np.array([0.03*0.001,0.0001,0.3*0.001])

# Hyperparameters grid
svm_param_grid = dict(gamma = gamma_range, C = C_range)

# Grid Search
model_v4_grid_search_rbf = GridSearchCV(model_v4, svm_param_grid, cv = 7)

# Training the model with the training data
model_v4_grid_search_rbf.fit(X_train, y_train)

# Best cross-validation accuracy and hyperparameters found on the training data
print(f"Training accuracy (cross-validation): {model_v4_grid_search_rbf.best_score_:.2%}")
print("")
print(f"Best hyperparameters: {model_v4_grid_search_rbf.best_params_}")

In [None]:
# Model accuracy - test data
predict_test_v4 = model_v4_grid_search_rbf.predict(X_test)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, predict_test_v4)))
print()
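
The `gamma` values searched above control how quickly the RBF kernel's similarity decays with distance. The sketch below evaluates the kernel, K(x, y) = exp(-gamma * ||x - y||^2), for two hypothetical patients and the three gamma values from the grid; the patient values are assumptions for illustration only.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Hypothetical patients described by [age, time], on their original scales
patient_a = [60.0, 30.0]
patient_b = [75.0, 200.0]

# The three gamma values used in the grid above
for gamma in (0.03 * 0.001, 0.0001, 0.3 * 0.001):
    print(f"gamma = {gamma:.5f}: K = {rbf_kernel(patient_a, patient_b, gamma):.4f}")
```

With unscaled features such as `time`, squared distances reach the tens of thousands, so only very small gamma values keep the kernel from collapsing to nearly zero for most pairs of patients.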

<h2 style = "background:black;color:white;border:0;border-radius:3px;font-family:arial">5) Gradient Boosting Classifier</h2>

In [None]:
# Creating Gradient Boosting object
model_v5 = GradientBoostingClassifier()

# Training the model with the training data and checking the score
model_v5.fit(X_train, y_train)
model_v5.score(X_train, y_train)

# Model accuracy - training data
predict_train_v5 = model_v5.predict(X_train)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_train, predict_train_v5)))
print()

In [None]:
# Model accuracy - test data
predict_test_v5 = model_v5.predict(X_test)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, predict_test_v5)))
print()

<h2 style = "background:black;color:white;border:0;border-radius:3px;font-family:arial">6) Random Forest Classifier</h2>

In [None]:
# Creating Random Forest object
model_v6 = RandomForestClassifier(n_estimators = 500)

# Training the model with the training data and checking the score
model_v6.fit(X_train, y_train)
model_v6.score(X_train, y_train)

# Model accuracy - training data
predict_train_v6 = model_v6.predict(X_train)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_train, predict_train_v6)))
print()

In [None]:
# Model accuracy - test data
predict_test_v6 = model_v6.predict(X_test)
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, predict_test_v6)))
print()

<center><h1 style = "background:black;color:white;border:0;border-radius:3px;font-family:verdana" > 6 - Conclusion</h1></center>

In this project, we evaluated the accuracy of the models built with the algorithms above. For easier comparison, the test data accuracy of each model is summarized below.

In [None]:
modelos = []

modelos.append(('LR', model_v1))
modelos.append(('DTC', model_v2))
modelos.append(('NB', model_v3))
modelos.append(('SVM', model_v4_grid_search_rbf))
modelos.append(('GBC', model_v5))
modelos.append(('RFC', model_v6))

for nome, modelo in modelos:
    predict_test = modelo.predict(X_test)
    met = metrics.accuracy_score(y_test, predict_test)
    msg = "Accuracy for " "%s: %f" % (nome, met)
    print(msg)

<ul>
    <li><p style = "color:black;font-weight:200;text-indent:0px;font-size:15px">As a final conclusion, the <b>Random Forest Classifier</b> algorithm presented the best level of accuracy (85%!!!) 😃😁💪👏</p></li>    
</ul>
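
Accuracy is the only metric reported in this notebook. As a possible complement, the sketch below derives precision and recall for the positive class from the confusion-matrix counts. The labels are hypothetical and the helper names are assumptions for this example; they are not results from the models above.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def summarize(y_true, y_pred):
    """Accuracy, precision and recall computed from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical labels: 1 = death event, 0 = survived
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

summary = summarize(y_true_demo, y_pred_demo)
print(summary)
```

Recall shows the share of actual death events that a model identifies, which accuracy alone does not reveal.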