## Ensemble Learning

Ensemble learning techniques aim to improve the accuracy of predictive models. Ensemble learning is a process in which multiple machine learning models (such as classifiers) are strategically constructed and combined to solve a particular problem.
The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.

### Methods of Ensemble Learning
The three most popular methods for combining the predictions from different models are:

1. Bagging: Building multiple models (typically of the same type) from different subsamples of the training dataset.
2. Boosting: Building multiple models (typically of the same type) each of which learns to fix the prediction errors of a prior model in the chain.
3. Voting: Building multiple models (typically of differing types) and combining their predictions with simple statistics (such as the mean).

We will be looking at methods 1 and 3 in this Labsheet.

##### Important Notes: 
1. You will need your previously acquired knowledge of library functions to write the code in this labsheet.
2. You are not expected to remember every function. Remember: "With great practice comes great memory" and "Google is everyone's best friend (as long as it is not the labtest)."


### Section 1: Bagging Algorithms

Bagging (aka Bootstrap Aggregation) involves taking multiple samples from your training dataset (with replacement) and training a model for each sample.

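The final output prediction is obtained by combining the predictions of all of the sub-models: averaging them for regression, and voting (or averaging the predicted class probabilities) for classification.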
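The two ingredients described above, sampling with replacement and aggregating sub-model predictions, can be sketched with NumPy alone. This is a minimal illustration, not how scikit-learn implements bagging internally; the sub-model predictions and the seed are made-up values used only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

n_samples = 10
indices = np.arange(n_samples)

# One bootstrap sample: draw n_samples row indices *with replacement*,
# so some rows appear more than once and others are left out entirely.
bootstrap_idx = rng.choice(indices, size=n_samples, replace=True)
out_of_bag = np.setdiff1d(indices, bootstrap_idx)
print("bootstrap indices:", bootstrap_idx)
print("rows left out:", out_of_bag)

# Aggregation for classification: each row holds one sub-model's
# predicted labels for three test points (made-up values).
sub_model_predictions = np.array([
    ["A", "B", "C"],
    ["A", "B", "B"],
    ["A", "C", "B"],
])

def majority_vote(predictions):
    """Return the most frequent label in each column (one column per test point)."""
    voted = []
    for column in predictions.T:
        labels, counts = np.unique(column, return_counts=True)
        voted.append(labels[np.argmax(counts)])
    return np.array(voted)

final_prediction = majority_vote(sub_model_predictions)
print("aggregated prediction:", final_prediction)
```

Each sub-model in a bagging ensemble is trained on its own bootstrap sample, and the ensemble's answer for a new point is the aggregate of the individual answers.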

The two bagging models covered in this section are as follows:

1. Bagged Decision Trees
2. Random Forest

#### 1.1 Bagged Decision Trees
Bagging performs best with algorithms that have high variance. A popular example is decision trees, often constructed without pruning.
The following example uses a Bagging Classifier with 100 decision trees.
##### Notes:
1. High variance means that your estimator (or learning algorithm) varies a lot depending on the data that you give it. If an algorithm fits the training data extremely well every time, and perturbing even a single data point changes the fitted model a lot, then the algorithm has high variance. This kind of behaviour is called __overfitting__, which is why overfitting is usually associated with high variance. It is undesirable because it means the algorithm is probably not robust to noise.
2. A decision tree has high variance because a very large tree can adjust its predictions to almost every individual input. It is sensitive to where and how it splits, so even small changes in the input values can result in a very different tree structure.
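The sensitivity described in the notes can be observed directly. The sketch below fits two unpruned trees on two bootstrap resamples of the same synthetic dataset and compares them. The dataset comes from `make_classification` and is an assumption made for illustration only; it is not the labsheet's dataset, and the names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Letter Image Recognition data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
rng = np.random.default_rng(0)

trees = []
for _ in range(2):
    # Each tree sees a slightly different resample of the same data
    idx = rng.choice(len(X_demo), size=len(X_demo), replace=True)
    tree = DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Fraction of points on which the two trees give different predictions
disagreement = np.mean(trees[0].predict(X_demo) != trees[1].predict(X_demo))
node_counts = [t.tree_.node_count for t in trees]
print("node counts:", node_counts)
print("fraction of points where the two trees disagree:", disagreement)
```

Differences in the node counts and a non-zero disagreement rate are typical signs of high variance; bagging averages over many such trees to smooth these differences out.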

##### Step 1: Import required libraries
You'll need the following libraries: 
pandas,  model_selection (available in sklearn), BaggingClassifier (available in sklearn.ensemble) and DecisionTreeClassifier (available in sklearn.tree). 

Hint: import statements can be written as-
- import x, 
- import x as y, 
- from a import b. 

Figure out how the necessary imports will be done in this case.
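As a reminder of what the three import forms look like in practice, here is a short sketch that uses standard-library modules only. The modules were picked purely for illustration; they are not the ones needed for this step.

```python
import math                   # form: import x
import collections as col     # form: import x as y
from statistics import mean   # form: from a import b

print(math.sqrt(16))          # access through the module name
print(col.Counter("letter"))  # access through the alias
print(mean([1, 2, 3]))        # access the imported name directly
```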

Write code to import these dependencies below:

In [None]:
#(1)code for step 1


##### Step 2: Reading the Dataset
Now, you need to read the .csv file of the Letter Image Recognition Data.
Remember: You can (and should for all purposes of this lab) use pandas to tinker with data whenever required.
For reading a csv file, typically the following steps are followed:
1. specify the url/file path to the dataset
2. make an array of strings that specify the column-names of the dataset (Note: this step is optional but recommended for the readability of your code).
3. use the variables from 1 and 2 to call an appropriate function in pandas to read the csv file. [Hint: this function has a very obvious name, you used it in previous labsheets and it returns a dataframe object].

You can now verify that your dataset has been read correctly by printing the first few rows (remember - head?)
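The three steps can be sketched on a tiny in-memory CSV. The data and the column names below are made up for illustration; for the actual exercise you would substitute the real path or URL and the real column names of the dataset.

```python
import io
import pandas as pd

# Step 1 stand-in: a tiny headerless CSV held in memory instead of a file path or URL
csv_text = "T,2,8\nI,5,12\nD,4,11\n"

# Step 2: column names (hypothetical names, for illustration only)
column_names = ["label", "feature_1", "feature_2"]

# Step 3: read the data; passing names tells pandas the file has no header row
demo_df = pd.read_csv(io.StringIO(csv_text), names=column_names)

print(demo_df.head())
print(demo_df.shape)
```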

Write code to read the dataset below:

##### Note: The read_csv() function reads both .csv and .txt files. The other arguments of the function remain the same.

In [None]:
#(2)code for steps 2.1,2.2 and 2.3 goes here


##### Step 3: Getting ready before actual classification

Once data has been read, the next step is always pre-processing.
In this dataset we fortunately do not require any pre-processing, but it is good practice to be well-versed with the metadata file of the dataset to ensure that you carry out the necessary processing steps before using the data.

We now store the values from the dataframe object in an array and then separate it into input attributes and target class.

In [None]:
array = df.values
# df is the name of the variable that you assigned to the object returned by read_csv.

X = array[:,1:17]
# counting from index 0, columns 1 to 16 form the input fields (the slice end, 17, is exclusive)

y = array[:,0]
# the zeroth column is the target class

### ANOTHER way to get X and y (what we were already doing in previous labsheets) ###
# X = df.iloc[:,1:]
# y = df.iloc[:,0]
# The former approach produces numpy arrays; the latter produces a pandas DataFrame (if multiple columns are selected) or a Series (if only one column is selected).

# You can use print(type(X)) and print(type(y)) to check the data types of both.

##### Step 4: Split the dataset into train and test data

The __‘train_test_split’__ function from scikit-learn's model_selection module takes the arrays to split, followed by optional settings. The first two arguments are the input and target data we separated earlier. Next, we set ‘test_size’ to 0.3. This means that 30% of all the data will be used for testing, which leaves 70% of the data as training data for the model to learn from.
Setting ‘stratify’ to y makes both the training and the test split preserve the proportion of each class in the y variable.

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3, stratify=y)
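
To see what stratification does, the sketch below splits a small imbalanced toy dataset and compares the class shares in the two parts. The toy arrays and the fixed `random_state` are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 80 of class "a" and 20 of class "b"
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array(["a"] * 80 + ["b"] * 20)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

# With stratify, both parts keep the class proportions of y_toy
print("train size:", len(y_tr), "share of 'b':", np.mean(y_tr == "b"))
print("test size:", len(y_te), "share of 'b':", np.mean(y_te == "b"))
```

Without `stratify`, the share of the minority class in each part would vary from one random split to the next.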

#### ---------------------------------------------------------------------------------------------------------------------------------------------------------

### Check accuracy of model on single decision tree

Before creating the bagging classifier, we will check the accuracy of a simple decision tree model.

In [None]:
dt = DecisionTreeClassifier()
dt.fit(X_train,y_train)

In [None]:
# CHECK THE ACCURACY on Testing data
## Direct method
print("Accuracy-direct method:",dt.score(X_test,y_test))

## Another similar method
y_pred = dt.predict(X_test)
from sklearn.metrics import accuracy_score
print("Accuracy:",accuracy_score(y_test, y_pred))

#### ---------------------------------------------------------------------------------------------------------------------------------------------------------

##### Step 5: Creating the bagging decision tree classifier

We will use 10 folds with a random seed and 100 decision trees

Note: It is a good idea to look up the return types and details in the sklearn documentation for the functions we will use, to get a better understanding of what is happening.

__Important: Before proceeding, please read this [article](https://towardsdatascience.com/why-and-how-to-cross-validate-a-model-d6424b45261f) fully.__

In [None]:
# Approach 1: Train and Test Split

seed = 7  #you can use any number, or even generate a random number if you fancy
number_of_trees = 100

dtree_1 = DecisionTreeClassifier()

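# The base estimator is passed positionally because its keyword name differs across scikit-learn versions (base_estimator in older releases, estimator in newer ones).
bagging_model_1 = BaggingClassifier(dtree_1, n_estimators=number_of_trees, random_state=seed)
bagging_model_1.fit(X_train,y_train)

# Check accuracy on testing data
print("Accuracy: ",bagging_model_1.score(X_test,y_test))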

In [None]:
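# Approach 2: K-Fold

kfold = model_selection.KFold(n_splits = 10, shuffle = True, random_state = seed)
#here n_splits=10 because we are doing 10-fold cross-validation.
#shuffle=True is needed for random_state to have an effect; recent scikit-learn versions raise an error if random_state is set without it.

dtree_2 = DecisionTreeClassifier()

# The base estimator is passed positionally because its keyword name differs across scikit-learn versions (base_estimator in older releases, estimator in newer ones).
bagging_model_2 = BaggingClassifier(dtree_2, n_estimators=number_of_trees, random_state=seed)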

results = model_selection.cross_val_score(bagging_model_2, X, y, cv=kfold)
print(results.mean())

##### Note:
1. __The cross_val_score() function performs the splits automatically, and the print() statement prints the mean of the accuracies on the held-out folds.__
2. __Remove the mean() call to see the accuracy of each of the 10 folds individually.__
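
To make the "automatic split" less of a black box, the sketch below reproduces by hand what `cross_val_score` does for each fold and checks that the per-fold accuracies match. It uses a single decision tree on a synthetic dataset and 5 folds to keep it fast; these choices, and the names `X_demo`, `y_demo` and `cv`, are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Letter Image Recognition data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual version: train on 4 folds, score on the held-out fold, repeat
manual_scores = []
for train_idx, test_idx in cv.split(X_demo):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

# Automatic version: cross_val_score runs the same loop internally
auto_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=cv
)

print("manual:", np.round(manual_scores, 3))
print("auto:  ", np.round(auto_scores, 3))
print("mean accuracy:", auto_scores.mean())
```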

You could now try changing some values in the above code and see how your model changes and try to figure out what gives best results.

#### 1.2 Random Forest
Random forest is an extension of bagged decision trees.

Samples of the training dataset are taken with replacement, but the trees are constructed in a way that reduces the correlation between individual classifiers. Specifically, rather than greedily choosing the best split point among all features, only a random subset of the features is considered for each split.
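
The idea of considering only a random subset of features at each split can be pictured with a few lines of NumPy. This is a conceptual sketch only, not scikit-learn's internal code; the feature count and subset size are example values, and `max_features_demo` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

n_features = 16         # example: total number of input columns
max_features_demo = 5   # example: size of the random subset tried at each split

candidate_sets = []
for split in range(3):
    # At every split a fresh subset of feature indices is drawn (without repeats);
    # only these features compete to provide the best split point.
    candidates = np.sort(rng.choice(n_features, size=max_features_demo, replace=False))
    candidate_sets.append(candidates)
    print(f"split {split}: candidate feature indices {candidates}")
```

Because different splits (and different trees) see different candidate features, the resulting trees are less alike than plain bagged trees, which is what reduces their correlation.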

##### Step 1: Import required libraries
You'll need the following libraries: 
pandas, model_selection (available in sklearn) and RandomForestClassifier (available in sklearn.ensemble)


In [None]:
#(3)code for step 1 here
#note: you need not re-import a library that you may have already imported


##### Step  2, 3 and 4: 
These remain the same as before, so we won't go through the trouble of writing those lines of code again.
We will re-use the variables X, y, kfold and number_of_trees.

##### Step 5: Creating the Random Forest Classifier
We use a train/test split and 10-fold cross-validation with 100 trees, setting max_features to 5 (max_features is the size of the random subset of features to consider when splitting a node).

In [None]:
# Approach 1: Train and Test Split

max_features = 5

rf_model_1 = RandomForestClassifier(n_estimators=number_of_trees, max_features=max_features)

#(4)fit the model and check accuracy(similar to what was done in bagging)
## code here


In [None]:
# Approach 2: K-Fold

rf_model_2 = RandomForestClassifier(n_estimators=number_of_trees, max_features=max_features)

#(5)check the score
## code here


### Section 2: Voting Ensemble

Voting is one of the simplest ways of combining the predictions from multiple machine learning algorithms.

It works by first creating two or more standalone models from your training dataset. A Voting Classifier can then be used to wrap your models and combine the predictions of the sub-models (by majority vote, or by averaging their predicted probabilities) when asked to make predictions for new data.

The predictions of the sub-models can be weighted, but specifying the weights for classifiers manually or even heuristically is difficult. More advanced methods can learn how to best weight the predictions from sub-models; this is called stacking (stacked generalization) and is available in scikit-learn from version 0.22 onwards as StackingClassifier.
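
The effect of weighting can be seen on a single sample. The sketch below combines the predicted class probabilities of three hypothetical sub-models in three ways: one vote per model, a plain average, and a weighted average. The probabilities and weights are made-up numbers; in `VotingClassifier` the corresponding choices are made through its `voting` and `weights` parameters.

```python
import numpy as np

classes = np.array(["A", "B"])

# Predicted class probabilities for ONE sample from three hypothetical sub-models
probas = np.array([
    [0.90, 0.10],   # model 1: very confident in A
    [0.40, 0.60],   # model 2: leans towards B
    [0.45, 0.55],   # model 3: leans towards B
])

# Hard voting: each model casts one vote for its most likely class
votes = probas.argmax(axis=1)
hard_choice = classes[np.bincount(votes, minlength=len(classes)).argmax()]

# Soft voting: average the probabilities, then pick the most likely class
soft_choice = classes[probas.mean(axis=0).argmax()]

# Weighted soft voting: models 2 and 3 count three times as much as model 1
weights = np.array([1, 3, 3])
weighted_choice = classes[np.average(probas, axis=0, weights=weights).argmax()]

print("hard vote:", hard_choice)
print("soft vote:", soft_choice)
print("weighted soft vote:", weighted_choice)
```

The three schemes can disagree on the same inputs, which is why the choice of weights matters and why it is hard to get right by hand.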

You can create a voting ensemble model for classification using the VotingClassifier class.

We will create a simple ensemble voting model using two base models: Logistic Regression and Decision Tree Classifier.
More complex voting models can be created by using more, and more complex, base classifiers.

##### Step 1: Import required libraries
You'll need the following libraries: 
pandas,  model_selection (available in sklearn), LogisticRegression (available in sklearn.linear_model), DecisionTreeClassifier (available in sklearn.tree) and VotingClassifier (available in sklearn.ensemble)

Note: do not import libraries already imported before.

In [None]:
#(6)code for step 1


##### Step  2 and 3: 
These remain the same as before, so we won't go through the trouble of writing those lines of code again.
We will re-use the variables X, y, seed and kfold.

##### Step 4: Creating the Voting Classifier

We will now create two separate base classifiers, i.e. the Decision Tree and Logistic Regression models, and combine them using the Voting Classifier.

In [None]:
# create the sub models
estimators = []
base1 = LogisticRegression()
estimators.append(('logistic', base1))
base2 = DecisionTreeClassifier()
estimators.append(('cart', base2))

# create the ensemble model
ensemble = VotingClassifier(estimators)
voting_results = model_selection.cross_val_score(ensemble, X, y, cv=kfold)
print(voting_results.mean())

#### That's All Folks!

A similar, simple modeling example explained with greater clarity. Please go through it: [reference](https://www.pluralsight.com/guides/ensemble-modeling-scikit-learn).

To know more about Bagging: [GFG](https://www.geeksforgeeks.org/ml-bagging-classifier/)

More about ensemble learning in general:
1. [Scikit-learn documentation](https://scikit-learn.org/stable/modules/ensemble.html)
2. [Comprehensive Guide](https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/)

__This Labsheet has been compiled using the following [reference](https://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/).__