
# Ensemble Learning
You should build an end-to-end machine learning pipeline using an ensemble learning model. In particular, you should do the following:
- Load the `mnist` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Build an end-to-end machine learning pipeline, including an ensemble model, such as [random forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) or [gradient boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html).
- Optimize your pipeline by cross-validating your design decisions. 
- Test the best pipeline on the test set and report various [evaluation metrics](https://scikit-learn.org/0.15/modules/model_evaluation.html).  
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.
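The steps in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the notebook's own solution: it uses scikit-learn's bundled 8x8 digits data (`load_digits`) as a stand-in for `mnist.csv`, and the step names `scale` and `forest`, the small grid, and `cv=3` are arbitrary choices made to keep the run short.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: scikit-learn's bundled digits set, not the mnist.csv file.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Chain preprocessing and the ensemble model so both are refit inside each CV fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("forest", RandomForestClassifier(random_state=0)),
])

# The "forest__" prefix routes each value to the pipeline step with that name.
param_grid = {
    "forest__n_estimators": [50, 100],
    "forest__max_depth": [5, None],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("best cross-validated accuracy:", search.best_score_)
print("best parameters:", search.best_params_)
print("held-out test accuracy:", test_accuracy)
```

Tree ensembles do not need feature scaling; the scaler is included only to show how a preprocessing step and a model are cross-validated together as one estimator.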

In [None]:
import pandas as pd
import sklearn.ensemble  # provides RandomForestClassifier and GradientBoostingClassifier
import sklearn.metrics
import sklearn.model_selection


In [None]:
df = pd.read_csv("mnist.csv")
df = df.drop(['id'],axis=1)

df.head(2)

Unnamed: 0,class,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
0,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,8,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Default split: 75% of the rows for training, 25% for testing.
df_train, df_test = sklearn.model_selection.train_test_split(df)

print("df_train size:" , df_train.shape)
print("df_test size:" , df_test.shape)

df_train size: (3000, 785)
df_test size: (1000, 785)


In [None]:
x_train = df_train.drop(["class"], axis=1)   
y_train = df_train["class"]

x_test = df_test.drop(["class"], axis=1)   
y_test = df_test["class"]

print("x_train size:", x_train.shape)
print("y_train size:", y_train.shape)
print("x_test size:", x_test.shape)
print("y_test size:", y_test.shape)

x_train size: (3000, 784)
y_train size: (3000,)
x_test size: (1000, 784)
y_test size: (1000,)


In [None]:
# Random forest hyperparameter grid (72 combinations).
parameters = {
    'n_estimators': [100, 150], 'criterion': ('gini', 'entropy'),
    'min_samples_leaf': [1, 2, 4], 'min_samples_split': [2, 4, 6], 'max_depth': [5, 7],
}
model = sklearn.ensemble.RandomForestClassifier()
# 5-fold cross-validated grid search over the training set.
clf = sklearn.model_selection.GridSearchCV(model, parameters, cv=5, n_jobs=-1, scoring="accuracy")
clf.fit(x_train, y_train)
print("The cross-validated accuracy of the random forest model = {}".format(clf.best_score_*100))
print("best parameter = {}".format(clf.best_params_))


The cross-validated accuracy of the random forest model = 90.76666666666668
best parameter = {'criterion': 'entropy', 'max_depth': 7, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 150}
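Beyond `best_score_` and `best_params_`, a fitted grid search exposes further attributes that are worth inspecting. The sketch below is illustrative only: it again assumes scikit-learn's bundled digits data in place of `mnist.csv`, and a deliberately small grid so that it runs quickly.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data: scikit-learn's bundled digits set, not the mnist.csv file.
X, y = load_digits(return_X_y=True)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [25, 50], "max_depth": [5, 7]},
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# cv_results_ holds one row per parameter combination.
results = pd.DataFrame(search.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]])

# best_estimator_ is the forest refit on all the data with the best parameters.
forest = search.best_estimator_
importances = forest.feature_importances_  # one value per input feature

# Indices of the five pixels the forest relies on most.
top_pixels = np.argsort(importances)[::-1][:5]
print("most informative pixel indices:", top_pixels)
print("number of trees:", len(forest.estimators_))
```

Looking at the full `cv_results_` table shows how close the runner-up settings are, which helps judge whether the best parameters reflect a real difference or fold-to-fold noise.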


In [None]:
y_predected_final = clf.predict(x_test)
# accuracy_score expects the true labels first, then the predictions.
Accuracy = sklearn.metrics.accuracy_score(y_test, y_predected_final)
print('The accuracy of the final random forest model using test data set is: ', Accuracy*100, "%,", "\n", "with best parameters = {}".format(clf.best_params_))
     

The accuracy of the final random forest model using test data set is:  90.10000000000001 %, 
 with best parameters = {'criterion': 'entropy', 'max_depth': 7, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 150}
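The task asks for various evaluation metrics, while the cell above reports accuracy only. A minimal sketch of a fuller report follows. It assumes scikit-learn's bundled digits data rather than `mnist.csv` and a plain random forest with fixed settings, so the numbers it prints say nothing about the results in this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)
from sklearn.model_selection import train_test_split

# Stand-in data: scikit-learn's bundled digits set, not the mnist.csv file.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# scikit-learn metrics take the true labels first, then the predictions.
acc = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
print("accuracy:", acc)
print("macro F1:", macro_f1)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall, and F1 in one table.
print(classification_report(y_test, y_pred))
```

The confusion matrix shows which digits are mistaken for which, something a single accuracy figure cannot reveal.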


In [None]:
# 'exponential' loss only supports binary targets and 'deviance' is a deprecated alias of
# 'log_loss', so only 'log_loss' is searched for this 10-class problem.
parameter = {
    'loss': ['log_loss'], 'n_estimators': [100, 150], 'criterion': ('friedman_mse', 'squared_error'),
    'min_samples_leaf': [1, 2, 4], 'min_samples_split': [2, 4, 6], 'max_depth': [5, 7],
}
models = sklearn.ensemble.GradientBoostingClassifier()
clfi = sklearn.model_selection.GridSearchCV(models, parameter, cv=5, n_jobs=2, scoring="accuracy")
clfi.fit(x_train, y_train)
print("The cross-validated accuracy of the gradient boosting model = {}".format(clfi.best_score_*100))
print("best parameter = {}".format(clfi.best_params_))

In [None]:
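# Predict with the gradient boosting search (clfi), not the random forest search (clf).
y_predected_finals = clfi.predict(x_test)
Accuracy = sklearn.metrics.accuracy_score(y_test, y_predected_finals)
print('The accuracy of the final gradient boosting model using test data set is: ', Accuracy*100)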
     