## Random Forest Classifier

### Importing libraries & dataset

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.metrics import classification_report, accuracy_score
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Importing the dataset
data1 = pd.read_csv(r"C:\Users\User-pc\Desktop\Final Exam\Data Set\Migraine.csv")
data1.head()

Unnamed: 0,Age,Duration,Frequency,Location,Character,Intensity,Nausea,Vomit,Phonophobia,Photophobia,...,Tinnitus,Hypoacusis,Diplopia,Defect,Ataxia,Conscience,Paresthesia,DPF,Gender,Type
0,31,2,5,North,1,3,1,1,1,1.0,...,0.0,0.0,0.0,0.0,0,0.0,0,1,Male,Migraine without aura
1,30,1,5,North,1,3,1,0,1,1.0,...,0.0,0.0,0.0,0.0,0,0.0,0,0,Male,Typical aura with migraine
2,41,2,3,North,1,2,1,0,1,1.0,...,0.0,0.0,0.0,0.0,0,0.0,0,0,Male,Typical aura with migraine
3,17,3,1,North,1,3,1,0,1,1.0,...,0.0,0.0,0.0,0.0,0,0.0,0,1,Male,Familial hemiplegic migraine
4,48,2,2,North,1,3,1,0,1,1.0,...,0.0,0.0,0.0,0.0,0,0.0,0,1,Male,Typical aura with migraine


### Getting more information about the data

In [3]:
data1.shape

(400, 25)

In [4]:
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          400 non-null    int64  
 1   Duration     400 non-null    int64  
 2   Frequency    400 non-null    int64  
 3   Location     376 non-null    object 
 4   Character    400 non-null    int64  
 5   Intensity    400 non-null    int64  
 6   Nausea       400 non-null    int64  
 7   Vomit        400 non-null    int64  
 8   Phonophobia  400 non-null    int64  
 9   Photophobia  388 non-null    float64
 10  Visual       393 non-null    float64
 11  Sensory      388 non-null    float64
 12  Dysphasia    392 non-null    float64
 13  Dysarthria   392 non-null    float64
 14  Vertigo      394 non-null    float64
 15  Tinnitus     390 non-null    float64
 16  Hypoacusis   393 non-null    float64
 17  Diplopia     393 non-null    float64
 18  Defect       389 non-null    float64
 19  Ataxia  

In [5]:
data1.isnull().sum()/len(data1)

Age            0.0000
Duration       0.0000
Frequency      0.0000
Location       0.0600
Character      0.0000
Intensity      0.0000
Nausea         0.0000
Vomit          0.0000
Phonophobia    0.0000
Photophobia    0.0300
Visual         0.0175
Sensory        0.0300
Dysphasia      0.0200
Dysarthria     0.0200
Vertigo        0.0150
Tinnitus       0.0250
Hypoacusis     0.0175
Diplopia       0.0175
Defect         0.0275
Ataxia         0.0000
Conscience     0.0275
Paresthesia    0.0000
DPF            0.0000
Gender         0.0275
Type           0.0000
dtype: float64

### Dropping rows with null values

In [6]:
data1.dropna(axis=0, inplace=True)
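
`dropna(axis=0)` discards every row that contains at least one missing value, so a handful of sparse columns can remove many rows; here the frame shrinks from 400 rows to the 297 counted in the cells that follow. A minimal sketch on a made-up frame (the values are invented, not taken from Migraine.csv) shows one way to measure that loss before committing to it:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the migraine data; all values are invented.
toy = pd.DataFrame({
    "Age": [31, 30, 41, 17],
    "Photophobia": [1.0, np.nan, 1.0, 0.0],
    "Gender": ["Male", "Male", None, "Female"],
})

rows_before = len(toy)
cleaned = toy.dropna(axis=0)            # drop rows with any missing value
rows_lost = rows_before - len(cleaned)

print(f"dropped {rows_lost} of {rows_before} rows")
```

If the share of dropped rows looks too high, imputing the sparse columns instead of dropping rows may be worth considering.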

### Label encoding

In [7]:
data1['Location'].value_counts()

North    276
South     14
West       7
Name: Location, dtype: int64

In [8]:
Location = {'North':0, 
        'South':1,
         'West':2}

# apply using map
data1['Location'] = data1['Location'].map(Location)

In [9]:
data1['Gender'].value_counts()

Male      158
Female    139
Name: Gender, dtype: int64

In [10]:
Gender = {'Male':0,'Female':1}

# apply using map
data1['Gender'] = data1['Gender'].map(Gender)

In [11]:
data1['Type'].value_counts()

Typical aura with migraine       183
Migraine without aura             45
Familial hemiplegic migraine      19
Typical aura without migraine     14
Basilar-type aura                 13
Other                             13
Sporadic hemiplegic migraine      10
Name: Type, dtype: int64

In [12]:
Type = {'Typical aura with migraine':0,'Migraine without aura':1, 'Familial hemiplegic migraine':2,
        'Typical aura without migraine':3, 'Other':4, 'Basilar-type aura':5, 'Sporadic hemiplegic migraine':6}

# apply using map
data1['Type'] = data1['Type'].map(Type)
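
`Series.map` with a dictionary returns `NaN` for any value that is missing from the dictionary, so a misspelled key would silently reintroduce nulls after the `dropna` step. A small sketch of a check that could follow each `map` call, using an invented series in which the label `East` is deliberately left out of the mapping:

```python
import pandas as pd

# Invented sample; 'East' is deliberately absent from the mapping.
location = pd.Series(["North", "South", "West", "East"])
mapping = {"North": 0, "South": 1, "West": 2}

encoded = location.map(mapping)

# Labels the mapping failed to cover come back as NaN.
unmapped = location[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty list means every label was encoded.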

In [13]:
data1.head()

Unnamed: 0,Age,Duration,Frequency,Location,Character,Intensity,Nausea,Vomit,Phonophobia,Photophobia,...,Tinnitus,Hypoacusis,Diplopia,Defect,Ataxia,Conscience,Paresthesia,DPF,Gender,Type
0,31,2,5,0,1,3,1,1,1,1.0,...,0.0,0.0,0.0,0.0,0,0.0,0,1,0,1
1,30,1,5,0,1,3,1,0,1,1.0,...,0.0,0.0,0.0,0.0,0,0.0,0,0,0,0
2,41,2,3,0,1,2,1,0,1,1.0,...,0.0,0.0,0.0,0.0,0,0.0,0,0,0,0
3,17,3,1,0,1,3,1,0,1,1.0,...,0.0,0.0,0.0,0.0,0,0.0,0,1,0,2
5,21,1,2,0,1,2,1,1,1,1.0,...,0.0,0.0,0.0,0.0,0,0.0,0,1,0,0


### Selecting Target & Feature Variable

In [14]:
# Features: the first 24 columns; target: the last column (Type)
X = data1.iloc[:, 0:24].values
Y = data1.iloc[:, -1].values

### Splitting the data into train & test data

In [15]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.20, random_state = 0)
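
With classes as small as 10 rows, a purely random 80/20 split can leave a class with very few test samples. One option, which this notebook does not use, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. A sketch on synthetic labels (the names `toy_X` and `toy_y` are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 samples of class 0 and 20 of class 1.
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)   # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.20, random_state=0, stratify=toy_y
)

# Both parts keep the 80/20 class ratio.
train_counts = np.bincount(y_tr).tolist()
test_counts = np.bincount(y_te).tolist()
print(train_counts, test_counts)
```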

### Feature scaling

In [16]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

### Creating the model & checking the accuracy of the model
### Random Forest Classifier

In [17]:
forest = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 10)
print(forest.fit(X_train, Y_train))
print('Training Accuracy:', forest.score(X_train, Y_train))
print(classification_report(Y_test, forest.predict(X_test)))
print('Accuracy:', accuracy_score(Y_test, forest.predict(X_test)))

RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=10)
Training Accuracy: 1.0
              precision    recall  f1-score   support

           0       0.87      1.00      0.93        40
           1       1.00      1.00      1.00         8
           2       0.50      0.20      0.29         5
           3       1.00      1.00      1.00         3
           4       0.00      0.00      0.00         2
           5       0.00      0.00      0.00         1
           6       1.00      1.00      1.00         1

    accuracy                           0.88        60
   macro avg       0.62      0.60      0.60        60
weighted avg       0.82      0.88      0.84        60

Accuracy: 0.8833333333333333


### K Neighbors Classifier

In [18]:
knn = KNeighborsClassifier(n_neighbors = 15, p = 2)
# KNN is a non-parametric method and n_neighbors is its main hyperparameter. A general rule of thumb is
# n_neighbors = sqrt(N), where N is the number of samples in the training dataset.
print(knn.fit(X_train, Y_train))
print('Training Accuracy:', knn.score(X_train, Y_train))
print(classification_report(Y_test, knn.predict(X_test)))
print('Accuracy:', accuracy_score(Y_test, knn.predict(X_test)))

KNeighborsClassifier(n_neighbors=15)
Training Accuracy: 0.7932489451476793
              precision    recall  f1-score   support

           0       0.77      1.00      0.87        40
           1       1.00      0.62      0.77         8
           2       0.00      0.00      0.00         5
           3       1.00      1.00      1.00         3
           4       0.00      0.00      0.00         2
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00         1

    accuracy                           0.80        60
   macro avg       0.40      0.38      0.38        60
weighted avg       0.70      0.80      0.73        60

Accuracy: 0.8
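
The sqrt(N) rule of thumb from the code comment above can be checked against this notebook's own numbers: 297 rows remain after `dropna` and 60 of them form the test set, which leaves 237 training samples. A quick sketch of the arithmetic:

```python
import math

n_rows = 297           # rows left after dropna (sum of the value_counts above)
n_test = 60            # total support in the classification reports
n_train = n_rows - n_test

k = round(math.sqrt(n_train))   # sqrt(237) is about 15.39
print(n_train, k)
```

The result agrees with the `n_neighbors = 15` used in the cell.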


### What is the accuracy using Random Forest?
### What is the accuracy using K Nearest Neighbors?

In [19]:
# Accuracy scores, rounded from the outputs of cells 17 and 18
acc_1 = 0.88
acc_2 = 0.8

results = pd.DataFrame([["Random Forest Classifier",acc_1],["K Neighbor Classifier",acc_2]],
                        columns = ["Models","Accuracy Score"]).sort_values(by='Accuracy Score',ascending=False)


results

Unnamed: 0,Models,Accuracy Score
0,Random Forest Classifier,0.88
1,K Neighbor Classifier,0.8
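
The two hard-coded scores can be traced back to the classification reports: accuracy is the number of correctly predicted test rows divided by the test-set size, and the number of correct rows per class equals recall times support. A sketch using the recall and support columns copied from the two reports (the reports round to two decimals, so each per-class count is rounded back to an integer):

```python
# (recall, support) per class, copied from the two reports above.
rf_rows = [(1.00, 40), (1.00, 8), (0.20, 5), (1.00, 3), (0.00, 2), (0.00, 1), (1.00, 1)]
knn_rows = [(1.00, 40), (0.62, 8), (0.00, 5), (1.00, 3), (0.00, 2), (0.00, 1), (0.00, 1)]

def accuracy_from_report(rows):
    """Recover accuracy as (correct predictions) / (test rows)."""
    correct = sum(round(recall * support) for recall, support in rows)
    total = sum(support for _, support in rows)
    return correct / total

print(round(accuracy_from_report(rf_rows), 2))    # 0.88
print(round(accuracy_from_report(knn_rows), 2))   # 0.8
```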


### Which metric should we concentrate on & why?
- Which?

We should concentrate on the classification report from the scikit-learn library, which lists precision, recall and F1 score for each class.

- Why?

A classification report measures the quality of the predictions made by a classification algorithm: how many predictions are correct and how many are wrong. More specifically, the counts of true positives, false positives, true negatives and false negatives are used to compute the metrics in the report.
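
One reason to look beyond accuracy in this notebook is class imbalance: 40 of the 60 test rows belong to class 0. A sketch with the test-set class counts from the reports above shows what a hypothetical model that always predicts the majority class would score:

```python
# Test-set support per class, copied from the classification reports.
support = {0: 40, 1: 8, 2: 5, 3: 3, 4: 2, 5: 1, 6: 1}

total = sum(support.values())
majority_class = max(support, key=support.get)

# A constant predictor is right exactly on the majority-class rows.
baseline_accuracy = support[majority_class] / total

# Its recall is 1.0 on the majority class and 0.0 on every other class.
baseline_recall = {c: (1.0 if c == majority_class else 0.0) for c in support}
macro_recall = sum(baseline_recall.values()) / len(support)

print(round(baseline_accuracy, 2), round(macro_recall, 2))   # 0.67 0.14
```

Such a model reaches about 0.67 accuracy while missing six of the seven classes entirely, which only the per-class rows of the report reveal.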

Precision – What percent of your positive predictions were correct?

Precision is the ability of a classifier not to label an instance positive that is actually negative. For each class it is defined as the ratio of true positives to the sum of true and false positives.

TP – True Positives

FP – False Positives

Precision – Accuracy of positive predictions.

Precision = TP/(TP + FP)

Recall – What percent of the positive cases did you catch? 

Recall is the ability of a classifier to find all positive instances. For each class it is defined as the ratio of true positives to the sum of true positives and false negatives.


FN – False Negatives


Recall: Fraction of positives that were correctly identified.

Recall = TP/(TP+FN)
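
As a worked example, class 2 in the Random Forest report has support 5, recall 0.20 and precision 0.50. Those figures imply TP = 1, FN = 4 and FP = 1; the counts are inferred from the report, not read from a confusion matrix. A minimal sketch of both formulas (the helper names are illustrative):

```python
def precision(tp, fp):
    """Share of positive predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of actual positives that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Counts inferred for class 2 of the Random Forest report (assumption).
tp, fp, fn = 1, 1, 4

print(precision(tp, fp), recall(tp, fn))   # 0.5 0.2
```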

F1 score – How well does the model balance precision and recall?

The F1 score is the harmonic mean of precision and recall, so the best score is 1.0 and the worst is 0.0. F1 scores are often lower than accuracy because they embed precision and recall in their computation. As a rule of thumb, the weighted average of F1, not global accuracy, should be used to compare classifier models.


F1 Score = 2*(Recall * Precision) / (Recall + Precision).
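
Plugging the class 2 figures from the Random Forest report (precision 0.50, recall 0.20) into this formula reproduces the reported 0.29, and averaging the per-class F1 column reproduces the `macro avg` and `weighted avg` rows. A sketch using values copied from that report (the helper name `f1_of` is illustrative):

```python
def f1_of(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_of(0.50, 0.20), 2))   # 0.29

# Per-class F1 and support from the Random Forest report.
f1_per_class = [0.93, 1.00, 0.29, 1.00, 0.00, 0.00, 1.00]
support = [40, 8, 5, 3, 2, 1, 1]

# Macro average: every class counts equally.
macro_f1 = sum(f1_per_class) / len(f1_per_class)
# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f * s for f, s in zip(f1_per_class, support)) / sum(support)

print(round(macro_f1, 2), round(weighted_f1, 2))   # 0.6 0.84
```

The gap between the two averages comes from the small classes: they pull the macro average down but carry little weight in the weighted one.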

                         - - - - - - - - X X X X X X X X - - - - - - - -