# Practice Activity - Bagging
## Nick Bias 
### 4/11/22

### Try to build the best bagging-based model (this includes random forests) to predict age.
#### Libraries 

In [1]:
import numpy as np
import pandas as pd
from numpy import mean, std
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

#### Reading in Data

In [2]:
abalone_data = pd.read_csv("Data/Week3/abalone.csv")
abalone_data

Unnamed: 0,sex,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings
0,M,0.455,0.365,0.095,0.5140,0.2245,0.1010,0.1500,15
1,M,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.0700,7
2,F,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100,9
3,M,0.440,0.365,0.125,0.5160,0.2155,0.1140,0.1550,10
4,I,0.330,0.255,0.080,0.2050,0.0895,0.0395,0.0550,7
...,...,...,...,...,...,...,...,...,...
4172,F,0.565,0.450,0.165,0.8870,0.3700,0.2390,0.2490,11
4173,M,0.590,0.440,0.135,0.9660,0.4390,0.2145,0.2605,10
4174,M,0.600,0.475,0.205,1.1760,0.5255,0.2875,0.3080,9
4175,F,0.625,0.485,0.150,1.0945,0.5310,0.2610,0.2960,10


#### Descriptive Statistics

In [3]:
abalone_data[["rings"]].describe()

Unnamed: 0,rings
count,4177.0
mean,9.933684
std,3.224169
min,1.0
25%,8.0
50%,9.0
75%,11.0
max,29.0


In [4]:
abalone_data.dtypes

sex                object
length            float64
diameter          float64
height            float64
whole_weight      float64
shucked_weight    float64
viscera_weight    float64
shell_weight      float64
rings               int64
dtype: object

#### Splitting the Prediction Variable from the Dataset, Creating Dummy Variables for Sex, and Merging the Data
- X = dataset with all independent variables
- y = the dependent variable: the number of rings each abalone has (used here as the proxy for age)

In [5]:
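# target: number of rings
y = abalone_data['rings']
# numeric predictors: drop the first column (sex) and the last column (rings)
abalone_data2 = abalone_data.iloc[:,1:-1]
# one-hot encode sex into F, I, M indicator columns
abalone_sex = pd.get_dummies(abalone_data['sex'])
# join predictors and indicators on the row index
X = pd.merge(abalone_data2, abalone_sex, left_index=True, right_index=True)
X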

Unnamed: 0,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,F,I,M
0,0.455,0.365,0.095,0.5140,0.2245,0.1010,0.1500,0,0,1
1,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.0700,0,0,1
2,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100,1,0,0
3,0.440,0.365,0.125,0.5160,0.2155,0.1140,0.1550,0,0,1
4,0.330,0.255,0.080,0.2050,0.0895,0.0395,0.0550,0,1,0
...,...,...,...,...,...,...,...,...,...,...
4172,0.565,0.450,0.165,0.8870,0.3700,0.2390,0.2490,1,0,0
4173,0.590,0.440,0.135,0.9660,0.4390,0.2145,0.2605,0,0,1
4174,0.600,0.475,0.205,1.1760,0.5255,0.2875,0.3080,0,0,1
4175,0.625,0.485,0.150,1.0945,0.5310,0.2610,0.2960,1,0,0
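
As a hedged alternative to the merge above, the same split and encoding can be sketched in one step with `pd.get_dummies(..., columns=[...])`. The `toy` frame below is a hypothetical stand-in built from the first five rows shown earlier (only two numeric columns are kept for brevity), and the `sex_` column prefix is pandas' default rather than the `F`/`I`/`M` names used in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for abalone_data: values copied from the first five rows above.
toy = pd.DataFrame({
    "sex": ["M", "M", "F", "M", "I"],
    "length": [0.455, 0.350, 0.530, 0.440, 0.330],
    "shell_weight": [0.150, 0.070, 0.210, 0.155, 0.055],
    "rings": [15, 7, 9, 10, 7],
})

# Target column.
y_toy = toy["rings"]

# Drop the target, then one-hot encode `sex` in place; dtype=int gives 0/1 columns.
X_toy = pd.get_dummies(toy.drop(columns="rings"), columns=["sex"], dtype=int)

print(X_toy.columns.tolist())
```

Encoding inside the same frame avoids a separate merge step, at the cost of the prefixed column names.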


### Creating the Model

In [6]:
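# define the model: bagging applied to a random forest base learner
# (scikit-learn 1.2 renamed base_estimator to estimator; base_estimator was removed in 1.4)
model = BaggingRegressor(base_estimator=RandomForestRegressor())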

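The `base_estimator` argument sets the model that is bagged; here it is a `RandomForestRegressor`, which gives a bagged random forest regression model. Any other regressor can be passed instead to build a different kind of bagging model. Note that a random forest is already a bagged ensemble of decision trees, so this model bags an ensemble that is itself bagged.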
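
To illustrate swapping the bagged model, here is a minimal sketch on synthetic data from `make_regression` (an assumption; it is not the abalone data). The two base models, a decision tree and a k-nearest-neighbours regressor, are arbitrary choices for illustration. The base model is passed positionally so the snippet does not depend on whether the installed scikit-learn names the parameter `base_estimator` or `estimator`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem used only for demonstration.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Two illustrative base models; any scikit-learn regressor could be used here.
base_models = {
    "tree": DecisionTreeRegressor(random_state=0),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

fitted = {}
for name, base in base_models.items():
    # Each bagging ensemble trains 10 copies of the base model on bootstrap samples.
    bag = BaggingRegressor(base, n_estimators=10, random_state=0)
    fitted[name] = bag.fit(X_demo, y_demo)
    print(name, len(fitted[name].estimators_))
```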

#### Cross Validation 

In [7]:
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
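
The following sketch shows what this splitter produces: 10 folds repeated 3 times gives 30 train/test splits, so `cross_val_score` returns 30 scores. The 20-row array `X_small` is a made-up example, not the abalone data.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

cv_demo = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# Hypothetical data: 20 rows, 2 columns.
X_small = np.arange(40).reshape(20, 2)

# Each element is a (train_indices, test_indices) pair.
splits = list(cv_demo.split(X_small))

n_splits_total = len(splits)
test_sizes = {len(test_idx) for _, test_idx in splits}
print(n_splits_total, test_sizes)
```

With 20 rows and 10 folds, every test fold holds 2 rows and the remaining 18 are used for training.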

#### Scoring Metrics Available for Regression Models (scikit-learn `scoring` strings):

- explained_variance
- max_error
- neg_mean_absolute_error
- neg_mean_squared_error
- neg_root_mean_squared_error
- neg_mean_squared_log_error
- neg_median_absolute_error
- r2
- neg_mean_poisson_deviance
- neg_mean_gamma_deviance
- neg_mean_absolute_percentage_error
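
As a small sketch of how one of these strings behaves, `sklearn.metrics.get_scorer` turns a name into a callable scorer. The data and the `LinearRegression` model below are assumptions chosen only to keep the example short. The `neg_` prefix means the error is negated so that a higher score is always better.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_absolute_error

# Synthetic data: a linear signal plus noise.
rng = np.random.default_rng(0)
X_s = rng.normal(size=(50, 2))
y_s = X_s @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=50)

reg = LinearRegression().fit(X_s, y_s)

# A scorer is called as scorer(estimator, X, y).
scorer = get_scorer("neg_mean_absolute_error")
neg_mae = scorer(reg, X_s, y_s)

# The plain metric function returns the positive MAE.
mae = mean_absolute_error(y_s, reg.predict(X_s))
print(neg_mae, mae)
```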

In [8]:
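# score with negated MAE; scikit-learn negates error metrics so that higher is always better
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
# report performance (the mean is negative because of the sign convention above)
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))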

MAE: -1.511 (0.074)


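Absolute error is the magnitude of the difference between a predicted value and the actual value. Mean Absolute Error (MAE) is the average of all absolute errors, so values closer to 0 are better. The reported score is negative because the `neg_mean_absolute_error` metric returns the negated MAE; the MAE itself is 1.511, meaning the predictions are off by about 1.5 rings on average, with a standard deviation of 0.074 across the 30 cross-validation scores. For context, `rings` has a mean of 9.93 and a standard deviation of 3.22, so the average error is less than half the target's standard deviation, which suggests a reasonably good fit.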
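
The arithmetic behind MAE can be worked through by hand. In this sketch the true values are the `rings` of the first five rows shown earlier, while the predictions and the three fold scores are made-up numbers for illustration only.

```python
import numpy as np

y_true = np.array([15, 7, 9, 10, 7])
# Hypothetical predictions, not output from the model above.
y_pred = np.array([12.0, 8.0, 9.5, 10.0, 8.5])

# Absolute errors: 3.0, 1.0, 0.5, 0.0, 1.5
abs_errors = np.abs(y_true - y_pred)

# MAE = (3.0 + 1.0 + 0.5 + 0.0 + 1.5) / 5 = 1.2
mae = abs_errors.mean()

# Scores from a neg_mean_absolute_error run can be flipped back to positive MAE.
neg_scores = np.array([-1.4, -1.6, -1.5])  # hypothetical fold scores
mae_scores = -neg_scores
print(mae, mae_scores.mean())
```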

### Prediction

In [9]:
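# refit a bagging model with default settings (decision-tree base learners),
# which differs from the bagged random forest cross-validated above
model = BaggingRegressor()
# fit the model on the whole dataset
model.fit(X, y)
# make a single prediction; values follow the column order of X
row = [[0.5, 0.420, 0.15, 0.75, 0.2, 0.18, 0.05, 0, 0, 1]]
yhat = model.predict(row)
# %d truncates the float prediction to an integer
print('Prediction: %d' % yhat[0])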

Prediction: 10
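
A hedged variation on the prediction step: passing the new row as a DataFrame with the training column names lets recent scikit-learn versions check the feature names against those seen during fitting. The three-column training set below is a hypothetical miniature built from the first five rows shown earlier, and `random_state=0` is an added assumption for repeatability. The sketch also contrasts truncating the prediction with rounding it.

```python
import pandas as pd
from sklearn.ensemble import BaggingRegressor

# Hypothetical mini training set: three columns from the first five abalone rows.
X_toy = pd.DataFrame({
    "length": [0.455, 0.350, 0.530, 0.440, 0.330],
    "diameter": [0.365, 0.265, 0.420, 0.365, 0.255],
    "height": [0.095, 0.090, 0.135, 0.125, 0.080],
})
y_toy = pd.Series([15, 7, 9, 10, 7], name="rings")

toy_model = BaggingRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# New observation with the same column names as the training data.
new_row = pd.DataFrame([{"length": 0.5, "diameter": 0.42, "height": 0.15}])
yhat_toy = toy_model.predict(new_row)

print("raw:", yhat_toy[0])
print("truncated:", int(yhat_toy[0]))        # what '%d' formatting shows
print("rounded:", round(float(yhat_toy[0])))  # nearest whole number of rings
```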
