1. [Introduction](#1)
2. [Load and Preprocessing](#2)
3. [Supervised Models](#3)
      * [Random Forest](#4)
      * [Decision Tree](#5)
      * [K-Nearest Neighbors](#6)
      * [Support Vector Machines](#7)
      * [Naive Bayes](#8)
4. [Compare Model's Accuracy with Graph](#9)
5. [Conclusion](#10)

<a id="1" >
    
# Introduction

* Supervised learning is used for data that have labels (classes). With supervised learning, we can predict the class of our data.
 
![image.png](https://i.ibb.co/4pKMSNq/Machine-Learning-Classification-Algorithms-1280x720.jpg)

* In this picture, every food has some features (size, weight, etc.) and every food has a label (ice cream, cake, etc.). Thanks to these features, our model can find the labels.

* In supervised learning, the labels and features we provide act as a teacher for the model. For this example, we will split our data into train and test sets. We will use the train data to train our model. We can train our model with this data because we know its labels. After that we will use the test data to measure our accuracy.

* In this notebook, we will use Naive Bayes, Decision Tree, Support Vector Machines, Random Forest and K-Nearest Neighbors. Each of them will be explained before its code.

<a id="2" >
    
# Load and Preprocessing

* First, we import our libraries.

In [None]:
import os
from collections import Counter

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# List every file under the read-only Kaggle input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Load Data

In [None]:
df = pd.read_csv("/kaggle/input/glass/glass.csv")

## Check data
* We see that we don't have any null values. We have 214 rows and the feature columns are float64. Type (it will be our label) is int64.

In [None]:
df.info()

* We can see the first 5 rows of our dataframe.

In [None]:
df.head()

* Our column names

In [None]:
df.columns

* In this part, we determine our features and labels. Our labels are the types of glass. There are 6 types (1, 2, 3, 5, 6, 7). In our code the labels are called Y and the features are called X.

In [None]:
X = df.iloc[:,:-1]
Y = df.iloc[:,9]

* We can see how many samples we have for each type.

In [None]:
Counter(Y)

* Normalize Data

* In this part we normalize our data, which means every feature is rescaled to a value between 0 and 1. Thanks to normalization, all of our features share a common scale without distorting the differences in their ranges of values.

In [None]:
# Min-max scaling per column: X.min() and X.max() return one value per feature
X_n = (X - X.min()) / (X.max() - X.min())
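
* To see what min-max normalization does, here is a minimal sketch on a tiny dataframe. The `toy` dataframe and its values are made up for illustration and are not taken from glass.csv.

```python
import pandas as pd

# Toy data (made up for illustration, not taken from glass.csv)
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 30.0, 20.0]})

# DataFrame.min() and DataFrame.max() work column by column,
# so every feature is scaled with its own minimum and maximum
toy_n = (toy - toy.min()) / (toy.max() - toy.min())

print(toy_n)
```

* In each column the smallest value becomes 0 and the largest becomes 1, even though the two columns started on very different scales.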

* In this part we split our data into train and test sets. This code means 80% of our data will be used to train our model and 20% will be used to test it. Setting random_state means we get the same split on every run.

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X_n,Y,test_size=0.2,random_state=42)
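
* The idea behind a seeded split can be sketched in plain Python. `simple_split` is a hypothetical helper written only for this illustration. It is not how scikit-learn implements `train_test_split`, but it shows why a fixed seed gives the same split on every run.

```python
import random

def simple_split(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows with a fixed seed, then cut off the test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))
train_a, test_a = simple_split(rows)
train_b, test_b = simple_split(rows)

print(len(train_a), len(test_a))  # 8 2
print(test_a == test_b)           # True: same seed, same split
```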

<a id = "3" >
    
# Supervised Models

<a id = "4" >

## Random Forest

![randomforest.png](https://i.ibb.co/THj4n51/Random-Forest-Algorithm.jpg)

* Random forest is built from trees like the one in the picture. Each tree asks yes/no questions about our data for a specific value. If the answer is yes, it follows the yes branch; otherwise it follows the no branch.
* For example, our data is [1,5,6,8,10,7]. For 6, the tree asks "is your value > 2?" and the answer is yes, so it follows the yes branch. Then it asks "is your value < 5?" and the answer is no, so it follows the no branch. With these questions our figure looks like a tree, and thanks to it we can classify our data.

* Random forest uses a lot of trees and classifies our data with a voting system. Each tree votes for one class and the winner will be our class.

* We will use these lists to store the results for our graphs.

In [None]:
score_list = []
model = []
cross_val_score_list=[]

In [None]:
tuning= [100,200,300,400,500,600,700,800]
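
* The voting system described above can be sketched in a few lines. `majority_vote` and the list of votes are made up for this illustration. Note that scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities, so this hard vote is only a simplified picture.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class that received the most votes."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Five hypothetical trees vote for a glass type
votes = [1, 2, 1, 7, 1]
print(majority_vote(votes))  # 1
```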

* For every model, we first import the model's class. After that we can use it.
* For every model, we estimate the accuracy on our train data with k-fold cross validation. We split our train data into folds and use them for training and validation.
![](https://i.ibb.co/ZcdQJLq/8uEci.png)

* In this picture cv=5. It means that in each round 80% of the train data is used for training and 20% is used to measure accuracy. The mean of the rounds will be our accuracy on the train data. If we don't use it, a single accuracy value can mislead us. For example, our accuracy is 86% without k-fold cross validation. However, our accuracies are 90%, 99%, 65%, 91% and 85% with k-fold cross validation. We can say that there may be a problem with this model because it is unstable.

* For our models we will use cv=3 and check the std to understand whether our model is stable or not.

* To find the best n_estimators, we use this code. (We could use grid search cv, but we didn't use it in this notebook.) n_estimators is the number of trees that will vote to classify our data.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier 
score=[]
a=0 # best mean accuracy so far
c=0 # n_estimators that gave the best mean accuracy
s=0 # std of the folds for the best setting
for i in tuning:
    rf = RandomForestClassifier(n_estimators=i, random_state=42)
    # cross_val_score fits its own copies of rf, so no separate fit is needed
    accuracies = cross_val_score(estimator=rf,X = x_train,y = y_train,cv=3)
    score.append(np.mean(accuracies))
    if np.mean(accuracies)>a:
        a=np.mean(accuracies)
        s=np.std(accuracies)
        c=i
print("acc = ",a," best number of estimator = ",c)
print("std= ",s)

In [None]:
plt.plot(tuning,score)
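
* The k-fold idea can also be sketched by hand. `kfold_indices` is a hypothetical helper (it is not scikit-learn's `KFold`), and the fold accuracies are the example numbers from the explanation above, not results from the glass data.

```python
from statistics import mean, pstdev

def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each fold is the validation part once."""
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        stop = start + fold_size if fold < k - 1 else n_samples
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, val_idx

# 10 samples and cv=5: every round validates on 20% and trains on 80%
for train_idx, val_idx in kfold_indices(10, 5):
    print("validation:", val_idx, "train size:", len(train_idx))

# Example fold accuracies from the text
fold_scores = [0.90, 0.99, 0.65, 0.91, 0.85]
print("mean =", round(mean(fold_scores), 2))
print("std  =", round(pstdev(fold_scores), 3))
```

* The mean is still 0.86, but the std is about 0.11, which is the sign that the accuracy changes a lot from fold to fold.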

* For every model, we first create a variable that holds our model (the parameters change from model to model). After that we fit it with our train data. Then we check the test results of the model that was fitted on our train data.

In [None]:
rf = RandomForestClassifier(n_estimators=100, random_state=42) #variable = our model

rf.fit(x_train,y_train) #fit our model with our train data
print(rf.score(x_test,y_test)) #check score with our test data
model.append("Random Forest Classifier") #store our model's name for our graph
score_list.append(rf.score(x_test,y_test)) #store our test score for our graph

* We use a confusion matrix to compare our predicted and real values. For example, we have dog and cat labels, with 30 dogs and 20 cats. With a confusion matrix, we can see how many dogs we predicted correctly.
* As another example, our data has 10 dogs and 90 cats. We saw that our accuracy is 90% and we might say that it is good. However, when we use a confusion matrix we see that our model predicted cat for every sample. It means our model isn't good, because we have two classes and our predictions never contain one of them.

In [None]:
from sklearn.metrics import confusion_matrix
y_pred= rf.predict(x_test) #we use this code to predict our labels with our model
#confusion matrix part
categories = [1,2,3,5,6,7] 
#labels=categories keeps one row/column per class even if a class is missing from y_test
cm = confusion_matrix(y_test,y_pred,labels=categories)
f , ax = plt.subplots(figsize=(5,5))

sns.heatmap(cm,annot = True,linewidths =0.5,linecolor ="Red",fmt=".0f",ax=ax)
ax.set_xticklabels(categories)
ax.set_yticklabels(categories)
plt.xlabel("y_pred")
plt.ylabel("y_true")

plt.show()
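
* To show why accuracy alone can mislead us, here is a minimal sketch with 10 dogs and 90 cats and a model that always predicts the majority class. `confusion_counts` is a hypothetical helper written for this illustration, and the data is made up.

```python
def confusion_counts(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

y_true_toy = ["dog"] * 10 + ["cat"] * 90
y_pred_toy = ["cat"] * 100  # a model that always answers "cat"

cm_toy = confusion_counts(y_true_toy, y_pred_toy, labels=["dog", "cat"])
accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)

print(cm_toy)    # [[0, 10], [0, 90]]
print(accuracy)  # 0.9
```

* The accuracy is 0.9, but the first row shows that none of the 10 dogs was predicted correctly.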

<a id = "5" >

## Decision Tree

* It is like random forest, but it uses only one tree. A random forest is essentially a collection of decision trees.
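
* A single tree is just a chain of yes/no questions. As a rough sketch, the two questions from the random forest example (value > 2, then value < 5) can be written as nested `if` statements. `tiny_tree` and its branch names are made up for this illustration; a real `DecisionTreeClassifier` learns its questions from the train data.

```python
def tiny_tree(value):
    """Hand-written tree using the two yes/no questions from the example."""
    if value > 2:        # first question
        if value < 5:    # second question
            return "yes, yes"
        return "yes, no"
    return "no"

for value in [1, 5, 6, 8, 10, 7]:
    print(value, "->", tiny_tree(value))
```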

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
# cross validate the decision tree itself (dt), not the random forest
accuracies = cross_val_score(estimator=dt,X = x_train,y = y_train,cv=3)
print("acc = ", np.mean(accuracies))
print("std= ",np.std(accuracies))

In [None]:
dt.fit(x_train,y_train)
print("acc = ", dt.score(x_test,y_test))
model.append("Decision Tree Classifier")
score_list.append(dt.score(x_test,y_test))

In [None]:
y_pred= dt.predict(x_test)

categories = [1,2,3,5,6,7]
cm = confusion_matrix(y_test,y_pred,labels=categories) #one row/column per class, even if it is missing from y_test
f , ax = plt.subplots(figsize=(5,5))

sns.heatmap(cm,annot = True,linewidths =0.5,linecolor ="Red",fmt=".0f",ax=ax)
ax.set_xticklabels(categories)
ax.set_yticklabels(categories)
plt.xlabel("y_pred")
plt.ylabel("y_true")

plt.show()

<a id = "6" >
    
## K-Nearest Neighbors

![knn](https://i.ibb.co/K5Z6345/knn.png)
* K-Nearest Neighbors finds the nearest neighbors of a data point. It looks at the classes of those neighbors and assigns the most common one to our data point.
* For this model, you have to choose the number of neighbors. In this picture, n_neighbors is 3 for the small circle and 6 for the big circle.

* To find the best n_neighbors, we use this code. (We could use grid search cv, but we didn't use it in this notebook.)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
score =[]
a=0 # best mean accuracy so far
c=0 # n_neighbors that gave the best mean accuracy
s=0 # std of the folds for the best setting
for i in range(2,20):
    knn = KNeighborsClassifier(n_neighbors = i)
    accuracies = cross_val_score(estimator=knn,X = x_train,y = y_train,cv=3)
    score.append(np.mean(accuracies))
    if np.mean(accuracies)>a:
        a=np.mean(accuracies)
        s=np.std(accuracies)
        c=i
    
print("acc = ",a," best number of neighbors = ",c)
print("std= ",s)

In [None]:
plt.plot(range(2,20),score)
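
* For comparison, the grid search mentioned above could look roughly like this. This is only a sketch: the iris dataset is a stand-in so that the code runs on its own, and `X_demo`, `y_demo`, `param_grid` and `grid` are names chosen for this illustration. In the notebook you would pass x_train and y_train instead.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Iris is only a stand-in dataset so that the sketch runs on its own
X_demo, y_demo = load_iris(return_X_y=True)

# Try the same range of n_neighbors as the loop above, with cv=3
param_grid = {"n_neighbors": list(range(2, 20))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
grid.fit(X_demo, y_demo)

print("best params:", grid.best_params_)
print("best cv acc:", grid.best_score_)
```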

In [None]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(x_train,y_train)
print("acc = ", knn.score(x_test,y_test))
model.append("K-Nearest Neighbors")
score_list.append(knn.score(x_test,y_test))
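
* To make the idea of looking at the nearest neighbors concrete, here is a minimal from-scratch sketch. `knn_predict` and the toy points are made up for this illustration. It is not how `KNeighborsClassifier` is implemented, but it follows the same idea.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query point by majority vote among its k nearest points."""
    distances = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Two small groups of points with labels 1 and 2
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [1, 1, 1, 2, 2, 2]

print(knn_predict(points, labels, query=(0.5, 0.5)))  # 1
print(knn_predict(points, labels, query=(5.5, 5.5)))  # 2
```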

In [None]:
y_pred= knn.predict(x_test)

categories = [1,2,3,5,6,7]
cm = confusion_matrix(y_test,y_pred,labels=categories) #one row/column per class, even if it is missing from y_test
f , ax = plt.subplots(figsize=(5,5))

sns.heatmap(cm,annot = True,linewidths =0.5,linecolor ="Red",fmt=".0f",ax=ax)
ax.set_xticklabels(categories)
ax.set_yticklabels(categories)
plt.xlabel("y_pred")
plt.ylabel("y_true")

plt.show()

<a id = "7" >
    
## Support Vector Machines

![support](https://i.ibb.co/2SpcZ3M/SVM.jpg)

* It uses a separating line (a hyperplane) to split our data and says that everything on the left of it will be class a and everything on the right of it will be class b. By looking at p in the picture, the SVM optimizes for the best hyperplane.
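
* Deciding on which side of the hyperplane a point lies can be sketched as checking the sign of w.x + b. The weights `w`, the offset `b` and the helper `side_of_hyperplane` are hypothetical values chosen for this illustration and are not learned from our data. Also note that `SVC` uses an RBF kernel by default, so its real boundary is not a straight line in the original feature space.

```python
def side_of_hyperplane(point, w, b):
    """Return "a" or "b" depending on the sign of w.x + b."""
    value = sum(wi * xi for wi, xi in zip(w, point)) + b
    return "a" if value < 0 else "b"

# Hypothetical weights: the vertical line x1 = 2 in a 2D feature space
w, b = (1.0, 0.0), -2.0

print(side_of_hyperplane((1.0, 5.0), w, b))  # a (left of the line)
print(side_of_hyperplane((3.0, 5.0), w, b))  # b (right of the line)
```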

In [None]:
from sklearn.svm import SVC

svm = SVC(random_state=1)
accuracies = cross_val_score(estimator=svm,X = x_train,y = y_train,cv=3) # cross validate svm, not knn
print("acc = " ,np.mean(accuracies))
print("std= ",np.std(accuracies))

In [None]:

svm.fit(x_train,y_train)
print("acc = " ,svm.score(x_test,y_test))
model.append("Support Vector Machines")
score_list.append(svm.score(x_test,y_test))

In [None]:
y_pred= svm.predict(x_test)

categories = [1,2,3,5,6,7]
cm = confusion_matrix(y_test,y_pred,labels=categories) #one row/column per class, even if it is missing from y_test
f , ax = plt.subplots(figsize=(5,5))

sns.heatmap(cm,annot = True,linewidths =0.5,linecolor ="Red",fmt=".0f",ax=ax)
ax.set_xticklabels(categories)
ax.set_yticklabels(categories)
plt.xlabel("y_pred")
plt.ylabel("y_true")

plt.show()

<a id="8" >

## Naive Bayes

* It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
* The Naive Bayes formula is:
![naive](https://i.ibb.co/N25cdqS/Siniflandirma-Notlari-10-Bayes-Teoremi-Formul-e1504212839936.png)
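
* Bayes' theorem says P(A|B) = P(B|A) * P(A) / P(B). As a small worked example with made-up numbers (the `posterior` helper is hypothetical and written only for this illustration):

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Made-up numbers: 30% of samples are class A, feature B appears in
# 80% of class A samples and in 40% of all samples
p_a_given_b = posterior(prior=0.3, likelihood=0.8, evidence=0.4)
print(round(p_a_given_b, 2))  # 0.6
```

* So a sample that shows feature B has a 60% chance of belonging to class A, which is twice the prior of 30%.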

In [None]:
from sklearn.naive_bayes import GaussianNB

nb=GaussianNB()

accuracies = cross_val_score(estimator=nb,X = x_train,y = y_train,cv=3)
print("acc = ", np.mean(accuracies))
print("std= ",np.std(accuracies))

In [None]:
nb.fit(x_train,y_train)
print("acc = " ,nb.score(x_test,y_test))
model.append("Naive Bayes")
score_list.append(nb.score(x_test,y_test))

In [None]:
y_pred= nb.predict(x_test)

categories = [1,2,3,5,6,7]
cm = confusion_matrix(y_test,y_pred,labels=categories) #one row/column per class, even if it is missing from y_test
f , ax = plt.subplots(figsize=(5,5))

sns.heatmap(cm,annot = True,linewidths =0.5,linecolor ="Red",fmt=".0f",ax=ax)
ax.set_xticklabels(categories)
ax.set_yticklabels(categories)
plt.xlabel("y_pred")
plt.ylabel("y_true")

plt.show()

<a id="9" >
# Compare Model's Accuracy with Graph

In [None]:
cv_result = pd.DataFrame({"Scores":score_list, "ML Models":model})
g = sns.barplot(x="Scores", y="ML Models",data=cv_result) # recent seaborn versions need x= and y= keywords
g.set_xlabel("Test Accuracy") # the x axis holds the test scores
g.set_title("Test Scores")

<a id="10" >
       
# Conclusion

* For this data, the random forest classifier is the best choice for us. Its test accuracy is higher than the others.
* We saw the basics of every model: what they are and how we can write them in Python.
* To increase our model's accuracy, we can tune other hyperparameters with grid search cv and we can make some changes to our data. We have to try them to see whether they work or not.
* I hope you like it. If you do, don't forget to upvote.

Thank you