### Required Codio Assignment 20.1: Basic Aggregating of Models

This activity focuses on combining models in an ensemble to make predictions.  You will first create an ensemble on your own and then be introduced to the `VotingClassifier` from `scikit-learn` to implement these ensembles.  You will consider a classification problem and use Logistic Regression, KNN, and Support Vector Machines to build your ensemble.  

#### Index

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)
- [Problem 6](#-Problem-6)



In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import VotingClassifier

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### The Data


The data was retrieved from [Kaggle](https://www.kaggle.com/) and contains measurements from fetal cardiotocogram (CTG) exams, each classified into one of three categories:


- Normal
- Suspect
- Pathological


In [2]:
df = pd.read_csv('data/fetal.zip', compression = 'zip')

In [3]:
df.head()

Unnamed: 0,baseline value,accelerations,fetal_movement,uterine_contractions,light_decelerations,severe_decelerations,prolongued_decelerations,abnormal_short_term_variability,mean_value_of_short_term_variability,percentage_of_time_with_abnormal_long_term_variability,...,histogram_min,histogram_max,histogram_number_of_peaks,histogram_number_of_zeroes,histogram_mode,histogram_mean,histogram_median,histogram_variance,histogram_tendency,fetal_health
0,120.0,0.0,0.0,0.0,0.0,0.0,0.0,73.0,0.5,43.0,...,62.0,126.0,2.0,0.0,120.0,137.0,121.0,73.0,1.0,2.0
1,132.0,0.006,0.0,0.006,0.003,0.0,0.0,17.0,2.1,0.0,...,68.0,198.0,6.0,1.0,141.0,136.0,140.0,12.0,0.0,1.0
2,133.0,0.003,0.0,0.008,0.003,0.0,0.0,16.0,2.1,0.0,...,68.0,198.0,5.0,1.0,141.0,135.0,138.0,13.0,0.0,1.0
3,134.0,0.003,0.0,0.008,0.003,0.0,0.0,16.0,2.4,0.0,...,53.0,170.0,11.0,0.0,137.0,134.0,137.0,13.0,1.0,1.0
4,132.0,0.007,0.0,0.008,0.0,0.0,0.0,16.0,2.4,0.0,...,53.0,170.0,9.0,0.0,137.0,136.0,138.0,11.0,1.0,1.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2126 entries, 0 to 2125
Data columns (total 22 columns):
 #   Column                                                  Non-Null Count  Dtype  
---  ------                                                  --------------  -----  
 0   baseline value                                          2126 non-null   float64
 1   accelerations                                           2126 non-null   float64
 2   fetal_movement                                          2126 non-null   float64
 3   uterine_contractions                                    2126 non-null   float64
 4   light_decelerations                                     2126 non-null   float64
 5   severe_decelerations                                    2126 non-null   float64
 6   prolongued_decelerations                                2126 non-null   float64
 7   abnormal_short_term_variability                         2126 non-null   float64
 8   mean_value_of_short_term_variability  

In [5]:
df['fetal_health'].value_counts()

1.0    1655
2.0     295
3.0     176
Name: fetal_health, dtype: int64
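
The counts above show a strong class imbalance, which is worth keeping in mind when reading the accuracy scores later in this activity. As a rough point of reference, the sketch below computes the accuracy a classifier would reach by always predicting the most common class. The counts are copied from the output above; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above
counts = {1.0: 1655, 2.0: 295, 3.0: 176}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of always predicting the most frequent class
baseline_accuracy = counts[majority_class] / total

print(f"total rows: {total}")
print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Any classifier or ensemble in the problems that follow should be judged against this baseline rather than against zero.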

In [6]:
X = df.drop('fetal_health', axis = 1)
y = df['fetal_health']

In [7]:
scaler = StandardScaler()
X = scaler.fit_transform(X)

[Back to top](#-Index)

### Problem 1

#### Model Predictions

**10 Points**

Given the models below and the starter code, train each model on the scaled data and assign its predictions, as an array, to the matching key in the given dictionary.

In [8]:
models = [LogisticRegression(), KNeighborsClassifier(), SVC()]

In [10]:
### GRADED
results = {'logistic': [],
          'knn': [],
          'svc': []}
i = 0
for model in models:
    # fit the model
    pass
    # make predictions
    
    # store the predictions in results -- should end up with three arrays
    i += 1
    
### BEGIN SOLUTION
results = {'logistic': [],
          'knn': [],
          'svc': []}
# the dictionary keys are in the same order as the models list
for name, model in zip(results, models):
    model.fit(X, y)
    results[name] = model.predict(X)
    
### END SOLUTION

### ANSWER CHECK
results

{'logistic': array([2., 1., 1., ..., 2., 2., 1.]),
 'knn': array([2., 1., 1., ..., 2., 2., 1.]),
 'svc': array([2., 1., 1., ..., 2., 2., 1.])}

[Back to top](#-Index)

### Problem 2

#### Majority Vote

**10 Points**

Using your dictionary of predictions, create a DataFrame called `prediction_df` and add a column to the DataFrame named `ensemble_prediction` based on the majority vote of your predictions.

In [12]:
### GRADED
prediction_df = pd.DataFrame()
prediction_df['ensemble_prediction'] = ''
    
### BEGIN SOLUTION
prediction_df = pd.DataFrame(results)
prediction_df['ensemble_prediction'] = prediction_df.mode(axis = 1).iloc[:, 0]
### END SOLUTION

### ANSWER CHECK
prediction_df.head()

Unnamed: 0,logistic,knn,svc,ensemble_prediction
0,2.0,2.0,2.0,2.0
1,1.0,1.0,1.0,1.0
2,1.0,1.0,1.0,1.0
3,1.0,1.0,1.0,1.0
4,1.0,1.0,1.0,1.0
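
The majority vote can be checked on a small hand-made example. The sketch below uses made-up predictions (not the fetal health data) and shows one detail of this approach: when all three classifiers disagree, `mode(axis = 1)` returns every tied value in sorted order, so taking the first column selects the smallest label.

```python
import pandas as pd

# Hypothetical predictions from three classifiers for four samples
toy = pd.DataFrame({
    'logistic': [1.0, 2.0, 3.0, 1.0],
    'knn':      [1.0, 2.0, 1.0, 2.0],
    'svc':      [2.0, 2.0, 3.0, 3.0],
})

# Row-wise mode; in a three-way tie every value counts as a mode
modes = toy.mode(axis=1)

# The first column holds the smallest of the most frequent values
toy_vote = modes.iloc[:, 0]
print(toy_vote.tolist())
```

The last row is a three-way tie (1.0, 2.0, 3.0), and the vote resolves to 1.0 only because it is the smallest label, not because any classifier is trusted more.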


[Back to top](#-Index)

### Problem 3


#### Accuracy of Classifiers

**10 Points**


Create a list of accuracy scores, one for each column of `prediction_df` (the three classifiers and the ensemble).  Use this list together with the column names to create a DataFrame named `results_df` that holds the accuracy scores.  Where does your ensemble rank?


In [14]:
from sklearn.metrics import accuracy_score

In [15]:
### GRADED
accuracies = []
for col in prediction_df.columns:
    # put your answer here
    pass
    
### BEGIN SOLUTION
accuracies = []
for col in prediction_df.columns:
    accuracies.append(accuracy_score(y, prediction_df[col]))
### END SOLUTION

### ANSWER CHECK
accuracies

[0.9045155221072436, 0.9374412041392286, 0.929444967074318, 0.9270931326434619]
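
The task also asks for a `results_df` and for the rank of the ensemble. One possible way to build it is sketched below, using the column names of `prediction_df` and the accuracy values printed above. The column names `accuracy` and `rank` are assumptions made for this sketch.

```python
import pandas as pd

# Column names from prediction_df and the accuracies printed above
columns = ['logistic', 'knn', 'svc', 'ensemble_prediction']
accuracies = [0.9045155221072436, 0.9374412041392286,
              0.929444967074318, 0.9270931326434619]

results_df = pd.DataFrame({'accuracy': accuracies}, index=columns)

# Rank 1 is the most accurate classifier
results_df['rank'] = results_df['accuracy'].rank(ascending=False).astype(int)
print(results_df.sort_values('rank'))
```

With these values the ensemble ranks third, behind KNN and SVC and ahead of Logistic Regression.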

[Back to top](#-Index)

### Problem 4

#### Using the Voting Classifier

**10 Points**

Use the documentation and User Guide [here](https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier) to create a voting ensemble using the `VotingClassifier` based on the majority vote using the same three classifiers `svc`, `lgr`, and `knn`.  Assign the accuracy of the ensemble to `vote_accuracy` below.

In [17]:
### GRADED
voter = ''
vote_accuracy = ''
    
### BEGIN SOLUTION
voter = VotingClassifier([('svc', SVC()), ('lgr', LogisticRegression()), ('knn', KNeighborsClassifier())])
voter.fit(X, y)
vote_accuracy = voter.score(X, y)
### END SOLUTION

### ANSWER CHECK
vote_accuracy

0.9270931326434619
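
The accuracy above is identical to the hand-built ensemble's accuracy from Problem 3, which is what one would expect if hard voting performs the same majority vote. The sketch below checks this on synthetic data, not the fetal health data. The names `X_demo`, `y_demo`, `manual_vote`, and `voter_demo` are illustrative, and ties in the manual vote are assumed to go to the lowest class label.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data standing in for the fetal health data
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=5, n_classes=3,
                                     random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# Manual ensemble: fit each model, then take the most common label per row
fitted = [SVC(), LogisticRegression(), KNeighborsClassifier()]
preds = np.column_stack([m.fit(X_demo, y_demo).predict(X_demo) for m in fitted])
manual_vote = np.array([np.bincount(row, minlength=3).argmax() for row in preds])

# Hard-voting ensemble built from fresh copies of the same estimators
voter_demo = VotingClassifier([('svc', SVC()),
                               ('lgr', LogisticRegression()),
                               ('knn', KNeighborsClassifier())],
                              voting='hard')
voter_demo.fit(X_demo, y_demo)

# True when the two approaches agree on every sample
print(np.array_equal(manual_vote, voter_demo.predict(X_demo)))
```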

[Back to top](#-Index)

### Problem 5

#### Voting Based on Probabilities

**10 Points**

Consult the user guide and create a new ensemble that makes predictions based on the probabilities of the estimators.  **HINT**: This has to do with the `voting` parameter.  Assign the ensemble as `soft_voter` and the accuracy as `soft_accuracy`. 

In [19]:
### GRADED
soft_voter = ''
soft_accuracy = ''
    
### BEGIN SOLUTION
soft_voter = VotingClassifier([('svc', SVC(probability = True)), ('lgr', LogisticRegression()), ('knn', KNeighborsClassifier())], voting = 'soft')
soft_voter.fit(X, y)
soft_accuracy = soft_voter.score(X, y)
### END SOLUTION

### ANSWER CHECK
soft_accuracy

0.9379115710253998
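
To see how voting on probabilities can differ from a majority vote, the sketch below averages made-up class probabilities from three classifiers for a single sample. The numbers are invented for illustration; the average-then-argmax step follows the description of soft voting in the user guide.

```python
import numpy as np

# Hypothetical class probabilities for one sample (classes 0, 1, 2)
probs = np.array([
    [0.40, 0.60, 0.00],   # classifier A prefers class 1
    [0.45, 0.55, 0.00],   # classifier B prefers class 1
    [0.95, 0.05, 0.00],   # classifier C strongly prefers class 0
])

# Hard vote: each classifier casts one vote for its most likely class
hard_votes = probs.argmax(axis=1)
hard_choice = np.bincount(hard_votes, minlength=3).argmax()

# Soft vote: average the probabilities, then pick the largest average
mean_probs = probs.mean(axis=0)
soft_choice = mean_probs.argmax()

print(hard_choice, soft_choice)
```

Here two lukewarm votes for class 1 win the hard vote, while one confident classifier pulls the averaged probabilities toward class 0.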

[Back to top](#-Index)

### Problem 6

#### Using Different Weights

**10 Points**

Finally, consider weighting the classifiers differently.  Give the Logistic Regression estimator a weight of 0.5 in the majority vote, and the SVC and KNN estimators a weight of 0.25 each.  Assign the accuracy of these predictions on the data to `weighted_score`.

In [24]:
### GRADED
weighted_voter = ''
weighted_score = ''
    
### BEGIN SOLUTION
weighted_voter = VotingClassifier([('svc', SVC(probability = True)), ('lgr', LogisticRegression()), ('knn', KNeighborsClassifier())],
                                 weights=[0.25, .5, .25])
weighted_voter.fit(X, y)
weighted_score = weighted_voter.score(X, y)
### END SOLUTION

### ANSWER CHECK
weighted_score

0.9214487300094073
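
With weights of 0.25, 0.5, and 0.25, Logistic Regression alone carries as much weight as SVC and KNN combined, so a 2-to-1 disagreement becomes a tie. The sketch below counts weighted votes with `np.bincount` on made-up, integer-encoded predictions. The helper `weighted_vote` is hypothetical, and sending ties to the lowest class index is an assumption of this sketch that follows from how `argmax` works.

```python
import numpy as np

# Weights in the same order as the estimators: svc, lgr, knn
weights = np.array([0.25, 0.5, 0.25])

# Hypothetical integer-encoded predictions, one row per sample
preds = np.array([
    [0, 1, 0],   # svc and knn say 0, lgr says 1 -> weighted tie
    [2, 1, 2],   # svc and knn say 2, lgr says 1 -> weighted tie
    [1, 1, 0],   # svc and lgr agree on 1
])

def weighted_vote(row, weights, n_classes=3):
    # Sum the weights of the classifiers voting for each class
    totals = np.bincount(row, weights=weights, minlength=n_classes)
    # argmax returns the first maximum, so ties go to the lowest class
    return int(totals.argmax())

votes = [weighted_vote(row, weights) for row in preds]
print(votes)
```

In the first two rows an unweighted majority vote would pick the class that SVC and KNN agree on (0 and 2). With these weights both rows are ties, and in this sketch the result goes to the lower class index.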