# Ensemble Learning
You should build an end-to-end machine learning pipeline using an ensemble learning model. In particular, you should do the following:
- Load the `mnist` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Build an end-to-end machine learning pipeline, including an ensemble model, such as [random forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) or [gradient boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html).
- Optimize your pipeline by cross-validating your design decisions. 
- Test the best pipeline on the test set and report various [evaluation metrics](https://scikit-learn.org/0.15/modules/model_evaluation.html).  
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.
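
The steps listed above can be sketched end to end as follows. This is a minimal illustration, not the notebook's own solution: it assumes scikit-learn's bundled 8x8 digits dataset as an offline stand-in for `mnist.csv`, and the step names `drop_constant` and `forest`, the `VarianceThreshold` preprocessing step, and the small grid are all choices made for the example.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Assumption: the bundled 8x8 digits stand in for mnist.csv so the sketch runs offline.
x_demo, y_demo = load_digits(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, stratify=y_demo, random_state=0
)

# Preprocessing and model live in one estimator, so cross-validation
# refits the preprocessing step inside every fold.
demo_pipeline = Pipeline([
    ("drop_constant", VarianceThreshold()),  # drops pixels with zero variance
    ("forest", RandomForestClassifier(random_state=0)),
])

# Hyperparameters of a pipeline step are addressed as "<step name>__<parameter>".
demo_search = GridSearchCV(
    demo_pipeline,
    {"forest__n_estimators": [10, 30]},
    scoring="accuracy",
    cv=3,
)
demo_search.fit(x_tr, y_tr)

# The held-out test set is touched only once, after model selection.
demo_test_accuracy = accuracy_score(y_te, demo_search.predict(x_te))
print("Best parameters:", demo_search.best_params_)
print("Test accuracy: {:.2f}".format(demo_test_accuracy))
```

Wrapping the model in a `Pipeline` matters most when a preprocessing step learns something from the data, because the search then never lets validation folds leak into that step.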

# Importing Libraries

In [1]:
import pandas as pd
import sklearn.metrics
import sklearn.ensemble
import sklearn.model_selection

# Loading the Dataset

In [2]:
df = pd.read_csv("mnist.csv")
df = df.set_index("id")
df.head(3)

| id | class | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | pixel9 | ... | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 | pixel784 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31953 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 34452 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 60897 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


# Splitting the Dataset into Training and Test Sets

In [3]:
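# Separate the pixel features from the digit label.
x = df.drop(columns=["class"])
y = df["class"]
# Hold out a test set (scikit-learn's default is 25% of the rows).
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y)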
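
The split above is random and unseeded, so each run produces different training and test sets. The sketch below shows one way to make a split reproducible and class-balanced, assuming a fixed `random_state` and `stratify` are acceptable for the task. It uses scikit-learn's bundled 8x8 digits as a stand-in for `mnist.csv`, and the `_demo` names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Assumption: bundled digits data instead of mnist.csv, so the sketch runs offline.
x_demo, y_demo = load_digits(return_X_y=True)

# stratify keeps each digit's share nearly equal in both subsets;
# random_state makes the split identical on every run.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Compare per-class proportions between the two subsets.
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
max_gap = float(np.abs(train_share - test_share).max())
print("Largest class-share gap between train and test: {:.4f}".format(max_gap))
```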

# Using Grid Search CV for Random Forest Classifier

In [4]:
parameters_grid = {
    "criterion": ["gini", "entropy"],
    "n_estimators": range(50, 260, 50)
}
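# 5-fold cross-validated grid search over every combination in parameters_grid.
model_1 = sklearn.model_selection.GridSearchCV(
    sklearn.ensemble.RandomForestClassifier(),
    parameters_grid, scoring="accuracy", cv=5, n_jobs=1)
model_1.fit(x_train, y_train)
# best_score_ is the mean cross-validated accuracy of the best candidate, not a test-set score.
print("Accuracy of Best Random Forest Classifier = {:.2f}".format(model_1.best_score_))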


Accuracy of Best Random Forest Classifier = 0.93


In [5]:
print("Best Hyperparameters of Best Random Forest Classifier = {}".format(model_1.best_params_))


Best Hyperparameters of Best Random Forest Classifier = {'criterion': 'gini', 'n_estimators': 250}
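
Beyond `best_score_` and `best_params_`, a fitted grid search exposes the full results table (`cv_results_`) and the refit model (`best_estimator_`), and a random forest exposes `feature_importances_`. The sketch below shows how these attributes might be inspected. It assumes the bundled 8x8 digits as a stand-in for `mnist.csv` and a deliberately small grid, and the `_demo` names are illustrative only.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Assumption: bundled digits data instead of mnist.csv, and a reduced grid for speed.
x_demo, y_demo = load_digits(return_X_y=True)
search_demo = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"criterion": ["gini", "entropy"], "n_estimators": [10, 30]},
    scoring="accuracy",
    cv=3,
)
search_demo.fit(x_demo, y_demo)

# One row per hyperparameter combination, best-ranked first.
results_demo = pd.DataFrame(search_demo.cv_results_)[
    ["param_criterion", "param_n_estimators",
     "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results_demo)

# best_estimator_ is the forest refit on all supplied data with the best parameters.
importances_demo = search_demo.best_estimator_.feature_importances_
top_pixels_demo = importances_demo.argsort()[::-1][:5]
print("Five most important pixel indices:", top_pixels_demo)
```

Looking at `std_test_score` alongside the mean helps judge whether the gap between two candidates is larger than the fold-to-fold noise.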


# Using Grid Search CV for Gradient Boosting Classifier

In [None]:
parameters_grid = {
    "learning_rate": [0.01, 0.1, 0.2],
    "n_estimators": range(50, 260, 50)
}
# Same 5-fold search as for the random forest, here over learning rate and number of trees.
model_2 = sklearn.model_selection.GridSearchCV(
    sklearn.ensemble.GradientBoostingClassifier(),
    parameters_grid, scoring="accuracy", cv=5, n_jobs=1)
model_2.fit(x_train, y_train)
# best_score_ is the mean cross-validated accuracy of the best candidate.
print("Accuracy of Best Gradient Boosting Classifier = {:.2f}".format(model_2.best_score_))
print("Best Hyperparameters of Best Gradient Boosting Classifier = {}".format(model_2.best_params_))
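
Because boosting adds trees one at a time, a single fitted `GradientBoostingClassifier` can report its predictions after every stage through `staged_predict`, which gives a cheap view of how accuracy changes with `n_estimators`. The sketch below illustrates that method under stated assumptions: it uses the bundled 8x8 digits instead of `mnist.csv`, a single validation split rather than 5-fold cross-validation, and only 40 trees. It is not a replacement for the grid search above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: bundled digits data and one validation split, chosen for speed.
x_demo, y_demo = load_digits(return_X_y=True)
x_tr, x_val, y_tr, y_val = train_test_split(
    x_demo, y_demo, stratify=y_demo, random_state=0
)

gb_demo = GradientBoostingClassifier(
    n_estimators=40, learning_rate=0.1, random_state=0
)
gb_demo.fit(x_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 40 trees.
staged_accuracy = [
    accuracy_score(y_val, stage_pred)
    for stage_pred in gb_demo.staged_predict(x_val)
]
best_n_demo = staged_accuracy.index(max(staged_accuracy)) + 1
print("Best validation accuracy reached with {} trees".format(best_n_demo))
```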

# Testing the Best Model

In [None]:
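# model_1 was refit on the whole training set with its best hyperparameters
# (GridSearchCV's default refit=True), so predict() uses that best estimator.
y_predicted = model_1.predict(x_test)
accuracy = sklearn.metrics.accuracy_score(y_test, y_predicted)
cm = sklearn.metrics.confusion_matrix(y_test, y_predicted)
# Without an `average` argument, these are arrays with one value per digit class.
precision, recall, f1, support = sklearn.metrics.precision_recall_fscore_support(y_test, y_predicted)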

print("Accuracy =", accuracy)
print("Precision =", precision)
print("Recall =", recall)
print("F1 Score =", f1)
print("Confusion matrix:\n", cm)
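
The per-class arrays printed above can be condensed into single numbers or a readable table. The sketch below shows two scikit-learn options, `average="macro"` and `classification_report`, on a tiny hand-made set of labels. The toy labels are illustrative only and are not results from this notebook.

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Toy labels (illustrative assumption), three classes with two samples each.
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]

# average="macro" takes the unweighted mean of the per-class scores.
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true_demo, y_pred_demo, average="macro"
)
print("Macro precision = {:.3f}".format(macro_p))
print("Macro recall    = {:.3f}".format(macro_r))
print("Macro F1        = {:.3f}".format(macro_f1))

# classification_report formats precision, recall, F1 and support per class.
print(classification_report(y_true_demo, y_pred_demo))
```

In the notebook itself, the same calls would take `y_test` and `y_predicted` in place of the toy lists.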
                              