#### Ensemble Learning

Ensemble learning combines several models, built with different algorithms or with different subsets of the training data, into a single model.

1.  An ensemble builds multiple models rather than one. As an analogy, suppose someone is planning to buy a car and has a specific model in mind, say a "Honda City".
2.  The buyer asks several people for feedback on that particular car.
3.  Some will say yes to the Honda City and some will say no, so there are multiple opinions to weigh.
4.  The ensemble goes through all of these opinions and, using measures such as precision and accuracy, combines them into the best possible decision.
5.  In machine learning terms, the people being asked correspond to models, and each can use a different algorithm: some KNN, some logistic regression, some SVM, and so on. This is one kind of ensemble learning.
6.  Another pattern uses the same algorithm, say logistic regression, but trains each model on a different subset of the master data set and then combines the results into a final answer.
7.  These individual models, whether they differ in algorithm or in training data, are called learners.
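
The two patterns in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library implements ensembles: the three rule functions are hypothetical stand-ins for trained learners (such as KNN, logistic regression and SVM), and the mileage figures are made up.

```python
import random
from collections import Counter

# Hypothetical learners: each looks at a car's mileage (km per litre) and
# votes "yes" or "no". They stand in for models trained with different algorithms.
def learner_a(mileage):
    return "yes" if mileage >= 15 else "no"

def learner_b(mileage):
    return "yes" if mileage >= 17 else "no"

def learner_c(mileage):
    return "yes" if mileage >= 20 else "no"

def majority_vote(votes):
    """Return the most common vote among the learners."""
    return Counter(votes).most_common(1)[0][0]

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement from the master data set."""
    return [rng.choice(rows) for _ in rows]

# Pattern 1: different learners vote on the same input.
votes = [learner(18) for learner in (learner_a, learner_b, learner_c)]
decision = majority_vote(votes)
print(votes, "->", decision)

# Pattern 2: one algorithm, but each learner would be trained on its own
# resampled copy of the master data (made-up values here).
master = [12, 14, 16, 18, 20, 22]
rng = random.Random(0)
samples = [bootstrap_sample(master, rng) for _ in range(3)]
print(samples)
```

For a regression problem such as the temperature example that follows, the learners' outputs would be averaged rather than put to a vote.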

In [5]:
# Random Forest Implementation
# Pandas is used for data manipulation
import pandas as pd
# Read in data and display first 5 rows
features = pd.read_csv('temps.csv')
features.head(5)

Unnamed: 0,year,month,day,week,temp_2,temp_1,average,actual,friend
0,2016,1,1,Fri,45,45,45.6,45,29
1,2016,1,2,Sat,44,45,45.7,44,61
2,2016,1,3,Sun,45,44,45.8,41,56
3,2016,1,4,Mon,44,41,45.9,40,53
4,2016,1,5,Tues,41,40,46.0,44,41


Following are explanations of the columns:

*   year: 2016 for all data points
*   month: number for month of the year
*   day: number for day of the month
*   week: day of the week as a character string
*   temp_2: max temperature 2 days prior
*   temp_1: max temperature 1 day prior
*   average: historical average max temperature
*   actual: max temperature measurement
*   friend: your friend’s prediction, a random number between 20 below the average and 20 above the average

In [6]:
print('The shape of our features is:', features.shape)

The shape of our features is: (348, 9)


In [7]:
# Descriptive statistics for each column
features.describe()

Unnamed: 0,year,month,day,temp_2,temp_1,average,actual,friend
count,348.0,348.0,348.0,348.0,348.0,348.0,348.0,348.0
mean,2016.0,6.477011,15.514368,62.652299,62.701149,59.760632,62.543103,60.034483
std,0.0,3.49838,8.772982,12.165398,12.120542,10.527306,11.794146,15.626179
min,2016.0,1.0,1.0,35.0,35.0,45.1,35.0,28.0
25%,2016.0,3.0,8.0,54.0,54.0,49.975,54.0,47.75
50%,2016.0,6.0,15.0,62.5,62.5,58.2,62.5,60.0
75%,2016.0,10.0,23.0,71.0,71.0,69.025,71.0,71.0
max,2016.0,12.0,31.0,117.0,117.0,77.4,92.0,95.0


In [8]:
# One-hot encode the data using pandas get_dummies
features = pd.get_dummies(features)
# Display the first 5 rows, from the sixth column onward
features.iloc[:,5:].head(5)

Unnamed: 0,average,actual,friend,week_Fri,week_Mon,week_Sat,week_Sun,week_Thurs,week_Tues,week_Wed
0,45.6,45,29,1,0,0,0,0,0,0
1,45.7,44,61,0,0,1,0,0,0,0
2,45.8,41,56,0,0,0,1,0,0,0
3,45.9,40,53,0,1,0,0,0,0,0
4,46.0,44,41,0,0,0,0,0,1,0
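
To see what the one-hot encoding step does in isolation, here is a small sketch on a made-up three-row frame (the `toy` data is hypothetical, not taken from `temps.csv`). Only the string column is expanded; numeric columns pass through unchanged. From pandas 2.0 onward the dummy columns are boolean by default, so `dtype=int` is passed here to get the 0/1 values shown above.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one string column
toy = pd.DataFrame({
    "temp_1": [45, 44, 41],
    "week": ["Fri", "Sat", "Mon"],
})

# dtype=int requests 0/1 integers instead of booleans
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each distinct value of `week` becomes its own column, and every row has a 1 in exactly one of them.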


##### Features, Targets, and Conversion to Arrays

Now we need to separate the data into features and targets. The target, also known as the label, is the value we want to predict, in this case the actual max temperature; the features are all the columns the model uses to make a prediction. We will also convert the Pandas dataframes to Numpy arrays, which is the input format used with the scikit-learn model here. (I save the column headers, which are the names of the features, to a list for later use.)

In [9]:
# Use numpy to convert to arrays
import numpy as np
# Labels are the values we want to predict
labels = np.array(features['actual'])
# Remove the labels from the features
# axis 1 refers to the columns
features = features.drop('actual', axis = 1)
# Saving feature names for later use
feature_list = list(features.columns)
# Convert to numpy array
features = np.array(features)

##### Training and Testing Sets

There is one final step of data preparation: splitting the data into training and testing sets. During training, we let the model ‘see’ the answers, in this case the actual temperature, so it can learn how to predict the temperature from the features. We expect there to be some relationship between the features and the target value, and the model’s job is to learn this relationship during training. Then, when it comes time to evaluate the model, we ask it to make predictions on a testing set where it only has access to the features (not the answers). Because we do have the actual answers for the test set, we can compare these predictions to the true values to judge how accurate the model is.

Generally, when training a model, we randomly split the data into training and testing sets so that both are representative of all data points. (If we trained on the first nine months of the year and then used the final three months for prediction, our algorithm would not perform well because it would not have seen any data from those last three months.) I am setting the random state to a fixed value (0 in the code below) so that the split is the same each time I run it.

In [10]:
# Using Scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.20, random_state = 0)

In [12]:
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (278, 14)
Training Labels Shape: (278,)
Testing Features Shape: (70, 14)
Testing Labels Shape: (70,)
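
The shapes above follow from an 80/20 split of 348 rows. The sketch below is a simplified stand-in for `train_test_split`, not scikit-learn's implementation: the helper `simple_split` is hypothetical, and it only illustrates two points, namely that the test share is rounded up (348 × 0.20 = 69.6, giving 70 test rows and 278 training rows) and that a fixed seed reproduces the same split.

```python
import math
import random

def simple_split(n_rows, test_size, seed):
    """Shuffle row indices with a fixed seed and cut them into train/test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)  # round the test share up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(348, 0.20, seed=0)
print(len(train_idx), len(test_idx))

# Same seed, same split
train_again, test_again = simple_split(348, 0.20, seed=0)
print(test_idx == test_again)
```

Because the indices are shuffled before the cut, rows from every month can land in either set, which is the point of splitting randomly rather than by date.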


In [16]:
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000)
# Train the model on training data
rf.fit(train_features, train_labels)
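
To connect the fitted forest back to the ensemble idea, the sketch below uses synthetic data (not `temps.csv`) and a much smaller forest, and checks that the forest's prediction is the average of the predictions of its individual trees, which scikit-learn exposes through the `estimators_` attribute. The data-generating formula and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (hypothetical): the target depends mostly on the first feature
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 0.8 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 1, size=200)

# A small forest keeps the example fast; random_state makes it repeatable
forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

# Each fitted tree is a learner; the forest averages their predictions
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
print(per_tree.mean(axis=0))
print(forest.predict(X[:5]))
```

The two printed rows should agree: each decision tree is one learner, trained on its own bootstrap sample, and the forest's answer is their mean.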

In [17]:
# For a regressor, score() returns R^2 (coefficient of determination) on the test set
print(rf.score(test_features, test_labels))

0.8007676260813571
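
For a regressor, `score` returns the coefficient of determination, R², rather than an accuracy percentage. A minimal sketch of that calculation on made-up numbers (the `actual` and `predicted` lists are hypothetical, not model output):

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [45, 44, 41, 40, 44]
predicted = [44, 45, 42, 41, 43]
print(r_squared(actual, predicted))

# Always predicting the mean of the actual values scores 0
mean_only = [sum(actual) / len(actual)] * len(actual)
print(r_squared(actual, mean_only))
```

Read this way, a score of about 0.80 means the forest explains roughly 80% of the variance in the test-set temperatures.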


In [18]:
# Get numerical feature importances
importances = list(rf.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the features and importances
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

Variable: temp_1               Importance: 0.49
Variable: average              Importance: 0.41
Variable: temp_2               Importance: 0.03
Variable: day                  Importance: 0.02
Variable: friend               Importance: 0.02
Variable: month                Importance: 0.01
Variable: year                 Importance: 0.0
Variable: week_Fri             Importance: 0.0
Variable: week_Mon             Importance: 0.0
Variable: week_Sat             Importance: 0.0
Variable: week_Sun             Importance: 0.0
Variable: week_Thurs           Importance: 0.0
Variable: week_Tues            Importance: 0.0
Variable: week_Wed             Importance: 0.0