### **Random Forest and AdaBoost using an Ensemble Learning Approach**


## <span style="color : green"> Ensemble Learning </span>

# <center> Table of Contents </center>

1. Train Random Forest and AdaBoost models with different parameters and find the best parameters.
1. Display the confusion matrix (as a graph) for each model.
1. Evaluate the models using the accuracy score and the classification report.


## **Description**:
Random Forest is a machine learning algorithm used for classification and regression. It trains many decision trees on random variations of the training data and combines them into an ensemble.
Here's a simplified explanation of how the algorithm works:
1. Randomly select a subset of the data (sampling with replacement, i.e. a bootstrap sample).
1. Build a decision tree on the selected subset, considering only a random subset of the features at each split.
1. Repeat steps 1 and 2 to create multiple decision trees.
1. Combine the predictions of all the decision trees (majority vote for classification, averaging for regression) to produce the final prediction.
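The steps above can be sketched by hand with plain decision trees. This is only an illustration under stated assumptions: the data is synthetic (`make_classification` stands in for the income dataset), and the hard majority vote is a simplification, since scikit-learn's `RandomForestClassifier` averages the trees' class probabilities instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (assumption: stands in for the income dataset).
X, y = make_classification(n_samples=600, n_features=6, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a 0/1 majority vote cannot tie
trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sample, i.e. row indices drawn with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Step 2: fit a tree; max_features="sqrt" samples features at each split.
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X_tr[idx], y_tr[idx])
    trees.append(tree)  # Step 3: repeat to grow the ensemble

# Step 4: majority vote across trees (labels are 0/1 here).
votes = np.stack([t.predict(X_te) for t in trees])  # shape (n_trees, n_test)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)
forest_acc = (forest_pred == y_te).mean()
print(f"hand-rolled forest accuracy: {forest_acc:.3f}")
```

In practice `RandomForestClassifier` does all of this internally; the loop is only meant to make the four steps visible.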




In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

### **Random Forest Classifier**

In [None]:
# 1. Read the dataset and do the necessary preprocessing (impute null values, encode categorical columns as numbers)
df=pd.read_csv("income.csv")
df.isnull().sum()

age               0
fnlwgt            0
education_num     0
capital_gain      0
capital_loss      0
hours_per_week    0
income_level      0
dtype: int64

In [None]:
# 2. Choose the independent variables (X) and the dependent variable (y), then hold out 20% of the rows for testing
x=df.iloc[:,:-1]
y=df.income_level
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.20,random_state=3)

3. Create different Random Forest models using different parameters (minimum of 4).
4. Find the best parameters for the model and calculate the predicted values.
5. Display the confusion matrix and calculate the accuracy, precision, recall, F-score, etc. for each Random Forest parameter setting.
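One possible way to automate steps 3 and 4 is a cross-validated grid search. The sketch below is an assumption-laden illustration rather than the notebook's own method: it uses synthetic data, a small `n_estimators` to keep it quick, and a grid that mirrors the `criterion` and `max_features` values tried by hand in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the income dataset (assumption).
X, y = make_classification(n_samples=400, n_features=6, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=3)

# Every combination in the grid is scored with 3-fold cross-validation.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2", None],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print(search.best_params_, round(search.best_score_, 3))
y_pred = search.predict(X_te)  # predictions from the refit best estimator
```

Because the search only sees the training folds, the test set stays untouched for the final evaluation.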

In [None]:
model_1 = RandomForestClassifier(criterion='gini',max_features='sqrt')
model_1.fit(x_train,y_train)
y1_predicted = model_1.predict(x_test)
model_1.score(x_test,y_test)

0.8015149964172382

In [None]:
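# Rows of the confusion matrix are true labels, columns are predicted labels.
# The first class (index 0) is treated as the positive class.
cm_1 = confusion_matrix(y_test,y1_predicted)
acc_1=(cm_1[0][0]+cm_1[1][1])/np.sum(cm_1)
pre_1=cm_1[0][0]/(cm_1[0][0]+cm_1[1][0])
rec_1=cm_1[0][0]/(cm_1[0][0]+cm_1[0][1])
# TN/(TN+FP) is the specificity (true-negative rate), not the F1 score
spec_1=cm_1[1][1]/(cm_1[1][1]+cm_1[1][0])
met_1=pd.DataFrame([[cm_1[0][0],cm_1[0][1],cm_1[1][0],cm_1[1][1],acc_1,pre_1,rec_1,spec_1]],columns=['TP','FN','FP','TN','Accuracy','Precision','Recall','Specificity'])
met_1

|   | TP | FN | FP | TN | Accuracy | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|---|---|
| 0 | 6612 | 788 | 1151 | 1218 | 0.801515 | 0.851733 | 0.893514 | 0.514141 |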
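As a cross-check on the hand-computed metrics, the same quantities, plus the F1 score, can be obtained from scikit-learn's metric functions. The sketch below rebuilds label vectors from the four counts in the first Random Forest table; the class labels `0` and `1` are an assumption, since the actual values of `income_level` are not shown.

```python
import numpy as np
from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_recall_fscore_support)

# Counts taken from the first Random Forest table; labels 0/1 are assumed.
cm = np.array([[6612, 788],
               [1151, 1218]])
# Expand each cell (true label, predicted label) into that many samples.
y_true = np.repeat([0, 0, 1, 1], cm.ravel())
y_pred = np.repeat([0, 1, 0, 1], cm.ravel())
assert (confusion_matrix(y_true, y_pred) == cm).all()

# Per-class precision, recall, F1 and support, in label order [0, 1].
prec, rec, f1, support = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1])
print(classification_report(y_true, y_pred, digits=4))
```

The precision and recall reported for class `0` correspond to the `Precision` and `Recall` columns of the table, and the `f1-score` column gives the harmonic mean `2 * precision * recall / (precision + recall)` for each class.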


In [None]:
model_2 = RandomForestClassifier(criterion='gini',max_features='log2')
model_2.fit(x_train,y_train)
y2_predicted = model_2.predict(x_test)
model_2.score(x_test,y_test)

0.8012079025488791

In [None]:
# First class (index 0) is treated as the positive class.
cm_2 = confusion_matrix(y_test,y2_predicted)
acc_2=(cm_2[0][0]+cm_2[1][1])/np.sum(cm_2)
pre_2=cm_2[0][0]/(cm_2[0][0]+cm_2[1][0])
rec_2=cm_2[0][0]/(cm_2[0][0]+cm_2[0][1])
# TN/(TN+FP) is the specificity (true-negative rate), not the F1 score
spec_2=cm_2[1][1]/(cm_2[1][1]+cm_2[1][0])
met_2=pd.DataFrame([[cm_2[0][0],cm_2[0][1],cm_2[1][0],cm_2[1][1],acc_2,pre_2,rec_2,spec_2]],columns=['TP','FN','FP','TN','Accuracy','Precision','Recall','Specificity'])
met_2

|   | TP | FN | FP | TN | Accuracy | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|---|---|
| 0 | 6603 | 797 | 1145 | 1224 | 0.801208 | 0.85222 | 0.892297 | 0.516674 |


In [None]:
model_3 = RandomForestClassifier(criterion='entropy',max_features='sqrt')
model_3.fit(x_train,y_train)
y3_predicted = model_3.predict(x_test)
model_3.score(x_test,y_test)

0.8022315487767427

In [None]:
# First class (index 0) is treated as the positive class.
cm_3 = confusion_matrix(y_test,y3_predicted)
acc_3=(cm_3[0][0]+cm_3[1][1])/np.sum(cm_3)
pre_3=cm_3[0][0]/(cm_3[0][0]+cm_3[1][0])
rec_3=cm_3[0][0]/(cm_3[0][0]+cm_3[0][1])
# TN/(TN+FP) is the specificity (true-negative rate), not the F1 score
spec_3=cm_3[1][1]/(cm_3[1][1]+cm_3[1][0])
met_3=pd.DataFrame([[cm_3[0][0],cm_3[0][1],cm_3[1][0],cm_3[1][1],acc_3,pre_3,rec_3,spec_3]],columns=['TP','FN','FP','TN','Accuracy','Precision','Recall','Specificity'])
met_3

|   | TP | FN | FP | TN | Accuracy | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|---|---|
| 0 | 6609 | 791 | 1141 | 1228 | 0.802232 | 0.852774 | 0.893108 | 0.518362 |


In [None]:
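# 'log_loss' selects the same Shannon information gain as 'entropy' (scikit-learn 1.1+),
# so this model differs from model_3 mainly through max_features=None (all features at each split)
model_4 = RandomForestClassifier(criterion='log_loss',max_features=None)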
model_4.fit(x_train,y_train)
y4_predicted = model_4.predict(x_test)
model_4.score(x_test,y_test)

0.8158460436073293

In [None]:
# First class (index 0) is treated as the positive class.
cm_4 = confusion_matrix(y_test,y4_predicted)
acc_4=(cm_4[0][0]+cm_4[1][1])/np.sum(cm_4)
pre_4=cm_4[0][0]/(cm_4[0][0]+cm_4[1][0])
rec_4=cm_4[0][0]/(cm_4[0][0]+cm_4[0][1])
# TN/(TN+FP) is the specificity (true-negative rate), not the F1 score
spec_4=cm_4[1][1]/(cm_4[1][1]+cm_4[1][0])
met_4=pd.DataFrame([[cm_4[0][0],cm_4[0][1],cm_4[1][0],cm_4[1][1],acc_4,pre_4,rec_4,spec_4]],columns=['TP','FN','FP','TN','Accuracy','Precision','Recall','Specificity'])
met_4

|   | TP | FN | FP | TN | Accuracy | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|---|---|
| 0 | 6757 | 643 | 1156 | 1213 | 0.815846 | 0.853911 | 0.913108 | 0.51203 |


### **AdaBoost Classifier**

- Adaptive Boosting (AdaBoost) is a popular ensemble learning algorithm that combines multiple weak classifiers into a stronger one. The classifiers are trained sequentially on the same dataset; after each iteration, the weights of the misclassified data points are increased so that the next classifier concentrates on them.
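The reweighting idea can be illustrated with a hand-written version of the classic two-class (discrete) AdaBoost update. This is a minimal sketch under assumptions: synthetic data, depth-1 trees ("stumps") as the weak classifiers, and labels recoded to -1/+1. It is not meant to reproduce scikit-learn's `AdaBoostClassifier` exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (assumption), relabelled to {-1, +1}.
X, y01 = make_classification(n_samples=500, n_features=6, n_informative=4, random_state=0)
y = 2 * y01 - 1

n_rounds = 30
w = np.full(len(y), 1.0 / len(y))  # start from uniform sample weights
stumps, alphas = [], []
for _ in range(n_rounds):
    # Weak classifier: a depth-1 tree fitted on the weighted data.
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    # Weighted error, clipped to keep the logarithm finite.
    err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # vote strength of this stump
    # Misclassified points (y * pred == -1) get their weight multiplied up.
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

def boosted_predict(X_new):
    # Weighted vote of all stumps; the sign of the score gives the class.
    scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.where(scores >= 0, 1, -1)

first_acc = (stumps[0].predict(X) == y).mean()
boost_acc = (boosted_predict(X) == y).mean()
print(f"single stump: {first_acc:.3f}, boosted ensemble: {boost_acc:.3f}")
```

Both accuracies are measured on the training data, so the comparison only shows how the weighted vote improves on a single weak classifier, not how well the model generalises.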

In [None]:
# 1. Read the dataset and do the necessary preprocessing (impute null values, encode categorical columns as numbers)
df=pd.read_csv("income.csv")
df.isnull().sum()

age               0
fnlwgt            0
education_num     0
capital_gain      0
capital_loss      0
hours_per_week    0
income_level      0
dtype: int64

In [None]:
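# 2. Choose the independent variables (X) and the dependent variable (y), then hold out 20% of the rows for testing
# Note: random_state differs from the Random Forest split (3), so the two test sets are not identical
x=df.iloc[:,:-1]
y=df.income_level
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.20,random_state=105)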

3. Create different AdaBoost models using different parameters (minimum of 4).
4. Find the best parameters for the model and calculate the predicted values.
5. Display the confusion matrix and calculate the accuracy, precision, recall, F-score, etc. for each AdaBoost parameter setting.
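For AdaBoost, one inexpensive way to explore the `n_estimators` parameter is `staged_score`, which reports the accuracy after each boosting round of a single fitted model. The sketch below is illustrative only: the data is synthetic, and `learning_rate=0.5` is an arbitrary value chosen for the example rather than a setting taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the income dataset (assumption).
X, y = make_classification(n_samples=600, n_features=6, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=105)

clf = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
clf.fit(X_tr, y_tr)

# staged_score yields the accuracy after 1, 2, ..., n boosting rounds.
staged = np.array(list(clf.staged_score(X_te, y_te)))
best_n = int(staged.argmax()) + 1
print(f"best accuracy {staged.max():.3f} at n_estimators={best_n}")
```

Picking `n_estimators` from test-set scores gives an optimistic estimate; a separate validation split or cross-validation would be the more careful choice.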

In [None]:
clf_1 = AdaBoostClassifier(n_estimators=100,algorithm='SAMME',random_state=0)
clf_1.fit(x_train,y_train)
y1c_predicted = clf_1.predict(x_test)
clf_1.score(x_test,y_test)

0.8347834988228069

In [None]:
# First class (index 0) is treated as the positive class.
cm_1c = confusion_matrix(y_test,y1c_predicted)
acc_1c=(cm_1c[0][0]+cm_1c[1][1])/np.sum(cm_1c)
pre_1c=cm_1c[0][0]/(cm_1c[0][0]+cm_1c[1][0])
rec_1c=cm_1c[0][0]/(cm_1c[0][0]+cm_1c[0][1])
# TN/(TN+FP) is the specificity (true-negative rate), not the F1 score
spec_1c=cm_1c[1][1]/(cm_1c[1][1]+cm_1c[1][0])
met_1c=pd.DataFrame([[cm_1c[0][0],cm_1c[0][1],cm_1c[1][0],cm_1c[1][1],acc_1c,pre_1c,rec_1c,spec_1c]],columns=['TP','FN','FP','TN','Accuracy','Precision','Recall','Specificity'])
met_1c

|   | TP | FN | FP | TN | Accuracy | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|---|---|
| 0 | 7164 | 279 | 1335 | 991 | 0.834783 | 0.842923 | 0.962515 | 0.426053 |


In [None]:
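# Note: 'SAMME.R' is deprecated since scikit-learn 1.4 and removed in later releases,
# so this cell needs a version that still supports it
clf_2 = AdaBoostClassifier(n_estimators=100,algorithm='SAMME.R',random_state=0)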
clf_2.fit(x_train,y_train)
y2c_predicted = clf_2.predict(x_test)
clf_2.score(x_test,y_test)

0.8415395639267069

In [None]:
# First class (index 0) is treated as the positive class.
cm_2c = confusion_matrix(y_test,y2c_predicted)
acc_2c=(cm_2c[0][0]+cm_2c[1][1])/np.sum(cm_2c)
pre_2c=cm_2c[0][0]/(cm_2c[0][0]+cm_2c[1][0])
rec_2c=cm_2c[0][0]/(cm_2c[0][0]+cm_2c[0][1])
# TN/(TN+FP) is the specificity (true-negative rate), not the F1 score
spec_2c=cm_2c[1][1]/(cm_2c[1][1]+cm_2c[1][0])
met_2c=pd.DataFrame([[cm_2c[0][0],cm_2c[0][1],cm_2c[1][0],cm_2c[1][1],acc_2c,pre_2c,rec_2c,spec_2c]],columns=['TP','FN','FP','TN','Accuracy','Precision','Recall','Specificity'])
met_2c

|   | TP | FN | FP | TN | Accuracy | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|---|---|
| 0 | 7198 | 245 | 1303 | 1023 | 0.84154 | 0.846724 | 0.967083 | 0.439811 |
