# 🔥 Machine Learning for Beginners With 7 Different Models 💪
This notebook covers the basics of Machine Learning for beginners. We will predict whether a patient has a **Heart Attack or Not**. We will use [this dataset](https://www.kaggle.com/mehmetzahitylmaz/heartdisease2) to make predictions. We will proceed with these steps:
* Import Libraries and Datasets 🔌
* Exploring Dataset 🔍
* Splitting Numerical and Categorical Values ✂️
* Exploring Categorical Columns 🔦
* Splitting Columns for One Hot Encoding and Label Encoding ✍️
* Investigating Missing Values 🕵️‍♂️
* Bring Data Together 📖
* Split Data to Train and Test 👟
* Training our models 💪
* Compare Models Performance ⚡️⭐️
* Performance of Models and Last Words 🎯

In [None]:
# Print List of files (our train and test data) in Kaggle's file explorer
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Import Libraries and Datasets 🔌
In this section we import the libraries and load the dataset into a DataFrame. We use "heart-2.csv" (the file loaded in the code below) because it gives us a good example for preprocessing data.

In [None]:
# Importing the Pandas and NumPy libraries to manipulate our data
import pandas as pd
import numpy as np

# To preprocess our data
from sklearn.preprocessing import LabelEncoder

# To fill missing values
from sklearn.impute import SimpleImputer

# To split our data into train and test sets
from sklearn.model_selection import train_test_split

# To Visualize Data
import matplotlib.pyplot as plt
import seaborn as sns

# Models to train on our data
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# To evaluate the results
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score


# We load our data with the Pandas library
# The file is "heart-2.csv" from the heartdisease2 dataset
df = pd.read_csv("/kaggle/input/heartdisease2/heart-2.csv")


# Exploring Dataset 🔍
* In this section we will look at our DataFrame. The head() function shows the first 5 rows of our DataFrame.

In [None]:
# Show the first 5 rows of the data
df.head()

In [None]:
# Print number of rows in data
print("Rows:", len(df))

In [None]:
# Prints Summary of Numerical Data
df.describe()

In [None]:
# Prints Summary of Categorical Data
# ("np.object" was removed from recent NumPy versions, so we pass the string "object")
df.describe(include=["object"])

# Spliting Numerical and Categorical Values ✂️
We will split categorical (string) and numerical values. The select_dtypes function selects the columns of our DataFrame whose type we include, or drops the ones whose type we exclude. "object" stands for string values inside our DataFrame.

* When we include the "object" type, we select the string columns
* When we exclude the "object" type, we select the numerical columns

In [None]:
numerical_column = df.select_dtypes(exclude="object").columns.tolist()
categorical_column = df.select_dtypes(include="object").columns.tolist()
print("Numerical Columns:", numerical_column)
print("****************")
print("Categorical Columns:", categorical_column)
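
To make the include/exclude behaviour concrete, here is a small sketch on a toy DataFrame. The column names and values are made up for illustration and do not come from the heart dataset.

```python
import pandas as pd

# Toy data (hypothetical columns). The string column gets an explicit
# "object" dtype so the example behaves the same across pandas versions.
toy = pd.DataFrame({
    "age": [63, 45, 58],
    "chol": [233.0, 250.5, 199.0],
    "sex": pd.Series(["male", "female", "male"], dtype="object"),
})

# exclude="object" keeps the numerical columns
numeric_cols = toy.select_dtypes(exclude="object").columns.tolist()
# include="object" keeps the string columns
string_cols = toy.select_dtypes(include="object").columns.tolist()

print("Numerical:", numeric_cols)
print("Categorical:", string_cols)
```

Here "age" and "chol" end up in the numerical list and "sex" in the categorical one.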

# Exploring Categorical Columns 🔦
In this section we will look at our categorical (string) columns. The "unique" row of the summary shows how many distinct values each column has. We will use that information later to decide how to encode each categorical column.

In [None]:
df[categorical_column].describe()

# Spliting Columns for One Hot Encoding and Label Encoding ✍️
There are 6 categorical columns in the dataset. We will transform columns that have at most 10 and more than 2 unique values with One Hot Encoding (we define a new column for each value).

We will transform the rest of the columns with Label Encoding (we don't define new columns; instead we give a numerical value to each unique label).

* We are not using One Hot Encoding on columns that have more than 10 unique values because it would add many new columns and hurt performance badly.
* We are not using One Hot Encoding on columns with 2 unique values because it causes the "Dummy Variable" problem (the two new columns would carry the same information). You can Google it to learn more.
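
The difference between the two encodings is easier to see on a tiny example. This is only a sketch: the "chest_pain" and "smoker" columns are hypothetical and not taken from the heart dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data with one 3-value column and one 2-value column (both hypothetical)
toy = pd.DataFrame({
    "chest_pain": pd.Series(["typical", "atypical", "none", "typical"], dtype="object"),
    "smoker": pd.Series(["yes", "no", "no", "yes"], dtype="object"),
})

# One Hot Encoding: one new column per unique value
one_hot = pd.get_dummies(toy[["chest_pain"]])
print(one_hot.columns.tolist())

# drop_first=True removes one of the new columns, which is the usual
# way to avoid the "Dummy Variable" problem
one_hot_dropped = pd.get_dummies(toy[["chest_pain"]], drop_first=True)
print(one_hot_dropped.columns.tolist())

# Label Encoding: a single column of integer codes.
# Classes are sorted alphabetically, so "no" becomes 0 and "yes" becomes 1
le = LabelEncoder()
smoker_codes = le.fit_transform(toy["smoker"])
print(le.classes_.tolist(), smoker_codes.tolist())
```

One Hot Encoding turns "chest_pain" into three columns, while Label Encoding keeps "smoker" as one column of 0s and 1s.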


In [None]:
# Get column names that have at most 10 and more than 2 unique values
to_one_hot_encoding = [col for col in categorical_column if df[col].nunique() <= 10 and df[col].nunique() > 2]

# Get categorical column names that are not in "to_one_hot_encoding"
to_label_encoding = [col for col in categorical_column if col not in to_one_hot_encoding]

print("To One Hot Encoding:", to_one_hot_encoding)
print("To Label Encoding:", to_label_encoding)


# Investigating Missing Values 🕵️‍♂️
In this section we will search for missing values. If our data has missing values, it can produce errors while we train our models.


In [None]:
df.isnull().sum()

There are no missing values :) Now we can proceed to One Hot Encoding and Label Encoding
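
The notebook imports SimpleImputer but does not need it for this dataset. If a dataset did contain gaps, one possible way to fill them is sketched below. The toy values are invented, and the "mean" strategy is just one choice among several.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numerical data with two missing values (hypothetical)
toy = pd.DataFrame({
    "age": [63.0, np.nan, 58.0],
    "chol": [233.0, 250.0, np.nan],
})
print(toy.isnull().sum())

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

After imputing, the missing "age" becomes 60.5 (the mean of 63 and 58) and the missing "chol" becomes 241.5.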

# One Hot Encoding and Label Encoding 🔥

In [None]:
# We use the built-in pandas function "get_dummies()" to encode the "to_one_hot_encoding" columns
one_hot_encoded_columns = pd.get_dummies(df[to_one_hot_encoding])
one_hot_encoded_columns

In [None]:
# Label Encoding

label_encoded_columns = []
# Loop over each column
for col in to_label_encoding:
    # We define a new label encoder for each column
    le = LabelEncoder()
    # Encode our data and create a new DataFrame from it,
    # notice that we pass the column name in the "columns" argument
    column_dataframe = pd.DataFrame(le.fit_transform(df[col]), columns=[col])
    # and add the new DataFrame to the "label_encoded_columns" list
    label_encoded_columns.append(column_dataframe)

# Merge all data frames
label_encoded_columns = pd.concat(label_encoded_columns, axis=1)
label_encoded_columns

# Bring Data Together 📖
Now we will bring all the modified data into one single DataFrame, then examine our data.

In [None]:
# Copy our DataFrame to X variable
X = df.copy()

# Dropping categorical columns,
# "inplace" means X itself is modified instead of returning a new DataFrame
# Don't forget "axis=1" so that columns are dropped, not rows
X.drop(categorical_column, axis=1, inplace=True)

# Merge DataFrames
X = pd.concat([X, one_hot_encoded_columns, label_encoded_columns], axis=1)
print("All columns:", X.columns.tolist())
X

# Split Data to Train and Test 👟
We successfully completed the data manipulation. There is one more step before we train our Machine Learning models: we split the data into train and test groups. This is important because we use the test group to measure our models' performance on data they have not seen.

First we will split the data into two parts, X and y.
* X is our data without the target column
* y is our target to predict

In [None]:
# Define Y (This is the value we will predict)
y = df["target"]

# Dropping "target" from X
X.drop(["target"], axis=1, inplace=True)
X

Now it's time to split our data into train and test sets

In [None]:
# You can specify test size
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
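
The split is random, so the scores can change every time the notebook runs. Below is a sketch of two optional arguments that may help: "random_state" makes the split repeatable, and "stratify" keeps the class balance the same in both groups. The toy arrays are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 40 samples, 2 features, balanced binary target (hypothetical)
X_toy = np.arange(80).reshape(40, 2)
y_toy = np.array([0] * 20 + [1] * 20)

# random_state fixes the shuffle, stratify keeps the 50/50 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)
print("Train size:", len(X_tr), "Test size:", len(X_te))
print("Positive samples in test:", int(y_te.sum()))

# Running the split again with the same random_state gives the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)
print("Same split:", np.array_equal(X_te, X_te2))
```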

# It's Time to TRAIN!!! 💪
Now we will train several machine learning models on our data and find which is best for us. We will use **Classifiers** to predict. You should use classifiers to predict nominal values (such as "True or False" or "Category A, Category B, C ...") and regressors to predict numerical values (e.g. salaries at a company). We will use these algorithms:

* Random Forest (It's my favorite)
* Decision Tree
* Logistic Regression Classifier
* Bernoulli Naive Bayes
* Gaussian Naive Bayes
* KNN (K-Nearest Neighbors)
* XGBoost (It's newer and gives accurate predictions)

# Random Forest 🌲
Random Forest is our first algorithm to try. It takes several arguments; one of them is n_estimators, the number of trees in the forest. You can play with n_estimators to get different predictions. But don't forget that increasing it too much mostly makes training slower while the gain in accuracy gets smaller and smaller :) For more information you can see [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
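
To get a feel for n_estimators, here is a small sketch that trains forests of different sizes. It uses synthetic data from "make_classification" instead of the heart dataset, and the values 5, 50 and 200 are arbitrary choices, so the numbers only illustrate the idea.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

# Train one forest per setting and record its test accuracy
rf_scores = {}
for n in (5, 50, 200):
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    model.fit(X_tr, y_tr)
    rf_scores[n] = accuracy_score(y_te, model.predict(X_te))
    print(n, "trees ->", round(rf_scores[n], 3))
```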

In [None]:
# Define Random Forest Model
rf = RandomForestClassifier(n_estimators=100)

# We fit our model with our train data
rf.fit(X_train, y_train)

# Then predict results from X_test data
pred_rf = rf.predict(X_test)

# See the first 10 predictions and their actual values
print("Predicted:", pred_rf[0:10])
print("Actual:", y_test[0:10])

We will investigate the performance of our model later on 🧐

# Decision Tree 🌳
Decision Tree is our second algorithm. To learn more, check out the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

In [None]:
# Define Decision Tree Model
dt = DecisionTreeClassifier()
# We fit our model with our train data
dt.fit(X_train, y_train)
# Then predict results from X_test data
pred_dt = dt.predict(X_test)

# See the first 10 predictions and their actual values
print("Predicted:", pred_dt[0:10])
print("Actual:", y_test[0:10])

# Logistic Regression 📈
Now we will speed things up a little. [Sklearn docs](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)


In [None]:
# Define Logistic Regression Model
log = LogisticRegression()
# We fit our model with our train data
log.fit(X_train, y_train)
# Then predict results from X_test data
pred_log = log.predict(X_test)

# See the first 10 predictions and their actual values
print("Predicted:", pred_log[0:10])
print("Actual:", y_test[0:10])

# Bernoulli Naive Bayes
[Sklearn docs](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html)

In [None]:
# Define Bernoulli Naive Bayes Model
bnb = BernoulliNB()
# We fit our model with our train data
bnb.fit(X_train, y_train)
# Then predict results from X_test data
pred_bnb = bnb.predict(X_test)

# See the first 10 predictions and their actual values
print("Predicted:", pred_bnb[0:10])
print("Actual:", y_test[0:10])

# Gaussian Naive Bayes
[Sklearn docs](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)

In [None]:
# Define Gaussian Naive Bayes Model
gnb = GaussianNB()
# We fit our model with our train data
gnb.fit(X_train, y_train)
# Then predict results from X_test data
pred_gnb = gnb.predict(X_test)

# See the first 10 predictions and their actual values
print("Predicted:", pred_gnb[0:10])
print("Actual:", y_test[0:10])

# KNN (K-Nearest Neighbors) 🏘
KNN predicts by looking at the training samples nearest to the point we want to classify. KNN takes arguments such as "n_neighbors" and "metric". "n_neighbors" is how many nearest neighbors are used to predict the result. "metric" is the method used to calculate the distance to the neighbors. You can check [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) for more info.
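
Here is a tiny sketch of what "n_neighbors" and "metric" do, using six hand-made 2D points (hypothetical data, not the heart dataset). With the default settings, "minkowski" means the ordinary Euclidean distance, and "manhattan" is shown for comparison.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two small clusters: class 0 near the origin, class 1 near (5, 5)
X_pts = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_pts = np.array([0, 0, 0, 1, 1, 1])
queries = np.array([[1, 1], [5, 4]])

# The 3 nearest neighbors vote on the class of each query point
knn_euclid = KNeighborsClassifier(n_neighbors=3, metric="minkowski")
knn_euclid.fit(X_pts, y_pts)
knn_pred = knn_euclid.predict(queries)
print("Predictions:", knn_pred)

# Distances from the first query point to its 3 nearest neighbors
euclid_dist, _ = knn_euclid.kneighbors(queries[:1])
print("Euclidean distances:", euclid_dist[0])

# Same neighbors, but measured with the Manhattan (city block) distance
knn_manh = KNeighborsClassifier(n_neighbors=3, metric="manhattan")
knn_manh.fit(X_pts, y_pts)
manh_dist, _ = knn_manh.kneighbors(queries[:1])
print("Manhattan distances:", manh_dist[0])
```

The point (1, 1) is classified as 0 and (5, 4) as 1, because their three nearest neighbors all belong to those classes. The metric changes the distance values, and on other data it can also change which neighbors are the nearest.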

In [None]:
# Define KNN Model
knn = KNeighborsClassifier(n_neighbors=3, metric="minkowski")
# We fit our model with our train data
knn.fit(X_train, y_train)
# Then predict results from X_test data
pred_knn = knn.predict(X_test)

# See the first 10 predictions and their actual values
print("Predicted:", pred_knn[0:10])
print("Actual:", y_test[0:10])

# XGBoost 🚀
XGBoost is a library based on gradient boosting. It's very accurate and fast, but you need some knowledge to benefit from its brilliant features. In this example I will show the basic usage of XGBoost. You should check out [its own documentation](https://xgboost.readthedocs.io/en/latest/) or the [Kaggle Tutorial](https://www.kaggle.com/alexisbcook/xgboost) for more information.

In [None]:
# Define XGBoost Model
# "early_stopping_rounds=5" means that if the model doesn't improve in 5 rounds, it stops learning,
# so you can save time and don't overtrain your model.
# (Recent XGBoost versions expect this argument here instead of in fit().)
xgb = XGBClassifier(n_estimators=1000, learning_rate=0.05, early_stopping_rounds=5)
# We fit our model with our train data
xgb.fit(
    X_train, y_train,
    # We provide the test data to evaluate model performance during training
    eval_set=[(X_test, y_test)],
    verbose=False
)
# Then predict results from X_test data
pred_xgb = xgb.predict(X_test)

# See the first 10 predictions and their actual values
print("Predicted:", pred_xgb[0:10])
print("Actual:", y_test[0:10])
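
Early stopping needs some data to watch while training. Using the test set for that, as above, lets the model peek at the data we later score it on. The sketch below shows the same idea with a separate validation part cut from the training data. It is only an illustration: it uses sklearn's GradientBoostingClassifier as a stand-in for XGBoost, and synthetic data instead of the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

# validation_fraction: part of the TRAIN data held out to watch progress
# n_iter_no_change: stop when the validation score doesn't improve for 5 rounds
gb = GradientBoostingClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    validation_fraction=0.2,
    n_iter_no_change=5,
    random_state=0,
)
gb.fit(X_tr, y_tr)

# Number of boosting rounds actually used (at most 1000)
trees_used = gb.n_estimators_
gb_acc = accuracy_score(y_te, gb.predict(X_te))
print("Rounds used:", trees_used)
print("Test accuracy:", round(gb_acc, 3))
```

The test set is touched only once, at the very end, so the reported accuracy is a fairer estimate.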

# Compare Models Performance ⚡️⭐️
In this section we will compare the performance of the models we trained in the previous section. We will use "confusion_matrix" and "accuracy_score" from the sklearn library. Then we choose the best model for our data.

# Confusion Matrices 📐

In [None]:
# Confusion Matrices
# The first parameter is the actual value,
# the second parameter is the value that we predicted

# Random Forest 
cm_rf = confusion_matrix(y_test, pred_rf)
# Decision Tree
cm_dt = confusion_matrix(y_test, pred_dt)
# Logistic Regression
cm_log = confusion_matrix(y_test, pred_log)
# Bernoulli Naive Bayes
cm_bnb = confusion_matrix(y_test, pred_bnb)
# Gaussian Naive Bayes
cm_gnb = confusion_matrix(y_test, pred_gnb)
# KNN (K-Nearest Neighbors)
cm_knn = confusion_matrix(y_test, pred_knn)
# XGBoost 
cm_xgb = confusion_matrix(y_test, pred_xgb)

print("***********************")
print("Confusion Matrices")
print("***********************")
print("Random Forest:\n", cm_rf)
print("Decision Tree:\n", cm_dt)
print("Logistic Regression:\n", cm_log)
print("Bernoulli Naive Bayes:\n", cm_bnb)
print("Gaussian Naive Bayes:\n", cm_gnb)
print("KNN (K-Nearest Neighbors):\n", cm_knn)
print("XGBoost:\n", cm_xgb)
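
If the printed matrices are hard to read, this small sketch may help. It uses seven made-up labels instead of the real predictions. Rows are the actual classes and columns are the predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up actual and predicted labels (hypothetical)
y_true = [0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 1, 0, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_hat)
print(cm)

# For a binary problem the four cells are:
# true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)

# Accuracy is the share of correct predictions (the diagonal of the matrix)
acc = accuracy_score(y_true, y_hat)
print("Accuracy:", acc, "=", (tn + tp) / cm.sum())
```

Here 2 healthy and 3 sick samples are predicted correctly, 1 healthy sample is flagged by mistake and 1 sick sample is missed, so the accuracy is 5 out of 7.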

# Accuracy Scores 🥇

In [None]:
# Accuracy Scores
# The first parameter is the actual value,
# the second parameter is the value that we predicted

# Random Forest 
acc_rf = accuracy_score(y_test, pred_rf)
# Decision Tree
acc_dt = accuracy_score(y_test, pred_dt)
# Logistic Regression
acc_log = accuracy_score(y_test, pred_log)
# Bernoulli Naive Bayes
acc_bnb = accuracy_score(y_test, pred_bnb)
# Gaussian Naive Bayes
acc_gnb = accuracy_score(y_test, pred_gnb)
# KNN (K-Nearest Neighbors)
acc_knn = accuracy_score(y_test, pred_knn)
# XGBoost 
acc_xgb = accuracy_score(y_test, pred_xgb)

print("***********************")
print("Accuracy Scores")
print("***********************")
print("Random Forest:", acc_rf)
print("Decision Tree:", acc_dt)
print("Logistic Regression:", acc_log)
print("Bernoulli Naive Bayes:", acc_bnb)
print("Gaussian Naive Bayes:", acc_gnb)
print("KNN (K-Nearest Neighbors):", acc_knn)
print("XGBoost:", acc_xgb)
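
A single train/test split can be lucky or unlucky, so the ranking of the models may change between runs. The notebook imports "cross_val_score" and "LeaveOneOut" but does not use them. Below is a sketch of how cross validation could give a steadier comparison. It uses synthetic data and only three of the models to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the heart dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian Naive Bayes": GaussianNB(),
}

# cv=5 trains and tests each model 5 times on different parts of the data
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Passing "cv=LeaveOneOut()" instead of "cv=5" tests on one sample at a time. That is slower, but it can be useful on small datasets.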

# Performance of Models and Last Words 🎯
As you can see, XGBoost, Logistic Regression and Bernoulli Naive Bayes give us the best results. Anyway, feel free to comment below. Have a nice day :)

 