<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.3: Stacking

INSTRUCTIONS:

- Read the guides and hints, then write the analysis and code needed to reach an answer and conclusion for the scenario below.
- The baseline results (minimum) are:
    - **Accuracy** = 0.9667
    - **ROC AUC**  = 0.9614
- Try to achieve better results!

## Scenario: Predicting Breast Cancer
The dataset used in this lab is popularly known as the **Wisconsin Breast Cancer** dataset. The associated task is classification.

The dataset contains _10_ features, and each instance is labelled as either **benign** or **malignant**. There are _699_ instances, with _16_ missing feature values in total. All values are numeric.

# Step 1: Define the problem or question
Identify the subject matter and the given or obvious questions that would be relevant in the field.

## Potential Questions
List the given or obvious questions.

## Actual Question
Choose the **one** question that should be answered.

# Step 2: Find the Data
## Wisconsin Breast Cancer DataSet
- **Citation Request**

    This breast cancer database was obtained from the **University of Wisconsin Hospitals**, **Madison** from **Dr. William H. Wolberg**. If you publish results when using this database, then please include this information in your acknowledgements.

- **Title**

    Wisconsin Breast Cancer Database (January 8, 1991)

- **Sources**
    - **Creator**
            Dr. William H. Wolberg (physician)
            University of Wisconsin Hospitals
            Madison, Wisconsin
            USA
    - **Donor**
            Olvi Mangasarian (mangasarian@cs.wisc.edu)
            Received by David W. Aha (aha@cs.jhu.edu)
    - **Date**
            15 July 1992

# Step 3: Read the Data
- Read the data
- Perform some basic structural cleaning to facilitate the work

In [6]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [7]:
columns = [
            'Index',
            'Clump Thickness',
            'Uniformity of Cell Size',
            'Uniformity of Cell Shape',
            'Marginal Adhesion',
            'Single Epithelial Cell Size',
            'Bare Nuclei',
            'Bland Chromatin',
            'Normal Nucleoli',
            'Mitoses',
            'Class']

df = pd.read_csv('../Data/breast-cancer-wisconsin-data-old.csv',
                 header = None,
                 names = columns,
                 usecols = columns[1:], #Use all cols but first one
                 na_values = '?' #Replace '?' with nan
                )
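
The effect of `usecols` and `na_values` can be checked without the data file. The sketch below feeds three made-up rows (the IDs and values are illustrative, not taken from the real file) through the same `read_csv` arguments; the name `sample` is hypothetical.

```python
import io
import pandas as pd

columns = ['Index', 'Clump Thickness', 'Uniformity of Cell Size',
           'Uniformity of Cell Shape', 'Marginal Adhesion',
           'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
           'Normal Nucleoli', 'Mitoses', 'Class']

# Three made-up rows in the raw file layout (no header row, '?' marks a missing value).
raw = io.StringIO(
    "1001,5,1,1,1,2,1,3,1,1,2\n"
    "1002,8,4,5,1,2,?,7,3,1,4\n"
    "1003,3,1,1,1,2,2,3,1,1,2\n"
)

sample = pd.read_csv(raw,
                     header=None,
                     names=columns,
                     usecols=columns[1:],  # drop the ID column
                     na_values='?')        # '?' becomes NaN

print(sample.shape)                        # rows, columns after dropping the ID
print(sample['Bare Nuclei'].isna().sum())  # number of '?' entries converted to NaN
print(sample['Bare Nuclei'].dtype)         # NaN forces a float column
```

Because `'?'` is converted to NaN, the affected column is read as floating point rather than as text, which matches the `float64` dtype reported for `Bare Nuclei` later in the lab.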

# Step 4: Explore and Clean the Data
- Perform some initial simple **EDA** (Exploratory Data Analysis)
- Check for
    - **Number of features**
    - **Data types**
    - **Domains, Intervals**
    - **Outliers** (are they valid or spurious data [reading or measurement errors]?)
    - **Null** (values not present or coded [as zero or empty strings])
    - **Missing Values** (coded [as zero or empty strings] or values not present)
    - **Coded content** (classes identified by numbers or codes to represent absence of data)
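
The checks in this list map onto a handful of pandas calls. The following is a minimal sketch on a small made-up frame (`demo` and its values are illustrative only); on the lab data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with one missing value and a coded class column.
demo = pd.DataFrame({
    'Clump Thickness': [5, 5, 3, 6, 4, 8],
    'Bare Nuclei': [1.0, 10.0, 2.0, 4.0, np.nan, 10.0],
    'Class': [2, 2, 2, 2, 2, 4],
})

n_rows, n_cols = demo.shape                  # number of instances and columns
dtypes = demo.dtypes                         # data type of each column
value_ranges = demo.agg(['min', 'max'])      # domain / interval of each column
missing_per_column = demo.isna().sum()       # null or missing values per column
class_counts = demo['Class'].value_counts()  # coded content and class balance

print(value_ranges)
print(missing_per_column)
print(class_counts)
```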

In [8]:
df.head()

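|   | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 1 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| 1 | 5 | 4 | 4 | 5 | 7 | 10.0 | 3 | 2 | 1 | 2 |
| 2 | 3 | 1 | 1 | 1 | 2 | 2.0 | 3 | 1 | 1 | 2 |
| 3 | 6 | 8 | 8 | 1 | 3 | 4.0 | 3 | 7 | 1 | 2 |
| 4 | 4 | 1 | 1 | 3 | 2 | 1.0 | 3 | 1 | 1 | 2 |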


# Step 5: Prepare the Data
- Deal with the data as required by the modelling technique
    - **Outliers** (remove or adjust if possible or necessary)
    - **Null** (remove or interpolate if possible or necessary)
    - **Missing Values** (remove or interpolate if possible or necessary)
    - **Coded content** (transform if possible or necessary [str to number or vice-versa])
    - **Normalisation** (if possible or necessary)
    - **Feature Engineering** (if useful or necessary)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Clump Thickness              699 non-null    int64  
 1   Uniformity of Cell Size      699 non-null    int64  
 2   Uniformity of Cell Shape     699 non-null    int64  
 3   Marginal Adhesion            699 non-null    int64  
 4   Single Epithelial Cell Size  699 non-null    int64  
 5   Bare Nuclei                  683 non-null    float64
 6   Bland Chromatin              699 non-null    int64  
 7   Normal Nucleoli              699 non-null    int64  
 8   Mitoses                      699 non-null    int64  
 9   Class                        699 non-null    int64  
dtypes: float64(1), int64(9)
memory usage: 54.7 KB


In [10]:
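#Impute nan values:

from sklearn.experimental import enable_iterative_imputer  #Required to enable IterativeImputer
from sklearn.impute import IterativeImputer

#IterativeImputer models each column with missing values from the other columns it is given.
#Only 'Bare Nuclei' is passed here, so there is nothing to regress on and the result
#is the initial (mean) imputation. Pass all feature columns to get the iterative estimate.

imputer = IterativeImputer()

df.loc[:,'Bare Nuclei'] = imputer.fit_transform(df[['Bare Nuclei']])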
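
To see how much the input columns matter to `IterativeImputer`, the sketch below compares a single-column call with a call on all columns. It uses synthetic data (`demo`, `a` and `b` are hypothetical names) in which `b` is strongly related to `a`, so it illustrates the behaviour rather than reproducing the lab result.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
a = rng.integers(1, 11, size=200).astype(float)
b = a + rng.normal(0, 0.5, size=200)  # b closely follows a
demo = pd.DataFrame({'a': a, 'b': b})
demo.loc[demo.index[:10], 'b'] = np.nan  # hide the first ten values of b

# One column only: no other feature to learn from, so the mean is used.
single = IterativeImputer(random_state=0).fit_transform(demo[['b']])

# All columns: b is predicted from a.
multi = IterativeImputer(random_state=0).fit_transform(demo)

column_mean = demo['b'].mean()
print(single[:3, 0], column_mean)  # imputed values equal the column mean
print(multi[:3, 1], a[:3])         # imputed values follow a instead
```

In a train/test workflow the imputer would normally be fitted on the training rows only, for example as the first step of a `Pipeline`.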



In [11]:
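#Separate features and target, scale, then split into train and test sets:
from sklearn.model_selection import train_test_split

#Features (X) and target (y):

X = df.drop('Class', axis = 1)
y = df['Class']

#Normalize (Scale) data (note: the scaler is fitted on all rows, before the split):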

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_X = pd.DataFrame(scaler.fit_transform(X), columns = X.columns)
            
X_train, X_test, y_train, y_test = train_test_split(scaled_X, 
                                                    y, 
                                                    test_size=0.33, 
                                                    random_state=42)
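
Fitting the scaler on every row before splitting lets test-set statistics influence the training features. One common alternative, sketched here on synthetic data from `make_classification` (a stand-in, not the lab dataset; all `*_demo` names are hypothetical), is to split first and put the scaler inside a pipeline so it is fitted on the training rows only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with nine features.
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=42)

# Split first, so the test rows stay unseen during preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.33,
                                          random_state=42)

# The pipeline fits the scaler on X_tr only, then reuses it on X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

fitted_scaler = pipe.named_steps['standardscaler']
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

On a dataset of this size the difference in scores is often small, but the pipeline form also keeps cross-validation free of the same leakage.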

# Step 6: Modelling
Refer to the Problem and Main Question.
- What are the input variables (features)?
- Is there an output variable (label)?
- If there is an output variable:
    - What is it?
    - What is its type?
- What type of Modelling is it?
    - [ ] Supervised
    - [ ] Unsupervised 
- What type of task is it?
    - [ ] Regression
    - [ ] Classification (binary) 
    - [ ] Classification (multi-class)
    - [ ] Clustering
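
The type of the output variable can also be checked programmatically. As a small sketch, scikit-learn's `type_of_target` reports how it would interpret a label vector; the example vectors below are made up.

```python
from sklearn.utils.multiclass import type_of_target

# Two distinct labels, as with a benign/malignant class column.
binary_kind = type_of_target([2, 4, 4, 2])

# More than two distinct integer labels.
multiclass_kind = type_of_target([0, 1, 2, 1])

# Non-integer floats are treated as a regression target.
continuous_kind = type_of_target([0.5, 1.7, 3.2])

print(binary_kind, multiclass_kind, continuous_kind)
# -> binary multiclass continuous
```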

In [None]:
#Will use a few different models 

# Step 7: Split the Data

Need to check for **Supervised** modelling:
- Number of known cases or observations
- Define the split in Training/Test or Training/Validation/Test and their proportions
- Check for unbalanced classes and decide whether to preserve the class proportions when splitting
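
A minimal sketch of checking class balance and preserving it across the split, using a made-up label vector with a 65/35 balance (`X_demo` and `y_demo` are hypothetical). Passing `stratify` to `train_test_split` keeps the class proportions close to equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 130 cases of class 2 and 70 of class 4.
y_demo = np.array([2] * 130 + [4] * 70)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

overall_share = np.mean(y_demo == 4)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.33,
                                          random_state=42,
                                          stratify=y_demo)  # keep class proportions

train_share = np.mean(y_tr == 4)
test_share = np.mean(y_te == 4)
print(overall_share, train_share, test_share)
```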

In [None]:
#Already done

# Step 8: Define and Fit Models

Define the model and its hyper-parameters.

Consider the parameters and hyper-parameters of each model at each (re)run and after checking the efficiency of a model against the training and test datasets.
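
One way to make the re-run loop systematic is a cross-validated grid search. The sketch below tunes a small stacked model on synthetic data; the short estimator names (`knn`, `tree`), the grid values and the data are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)

stack_demo = StackingClassifier(
    estimators=[('knn', KNeighborsClassifier()),
                ('tree', DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(),
    cv=3,
)

# Nested parameters are addressed as '<estimator name>__<parameter>'.
param_grid = {
    'knn__n_neighbors': [3, 7],
    'final_estimator__C': [0.1, 1.0],
}

search = GridSearchCV(stack_demo, param_grid, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_cv_accuracy = search.best_score_
print(best_params, best_cv_accuracy)
```

Short estimator names without spaces keep the `name__parameter` keys readable.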

### Trying a stacking classifier:

In [18]:
#Import classifier:

from sklearn.ensemble import StackingClassifier


#Import base models:

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

#Initialize stacked model:

stack = StackingClassifier(estimators = [('Log Regressor',LogisticRegression()),
                                         ('K-Nearest Neighbors',KNeighborsClassifier()),
                                         ('Support vector machine',SVC()),
                                         ('Decision Tree',DecisionTreeClassifier()),
                                        ],
                           final_estimator = None, #Default is Log Regression
                           cv = None, #Default is 5-fold cv 
                           stack_method = 'auto' #Method for base models, default is predict_proba followed by a few other options
                          )

In [20]:
#Train stacked model:

stack.fit(X_train,y_train)

#Generate predictions:

stack_predictions = stack.predict(X_test)
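
To make the mechanics of stacking concrete, the following hand-rolled sketch mirrors the general idea: base models produce out-of-fold predictions on the training set, and a final estimator is trained on those predictions. It runs on synthetic data with three base models and is a simplified illustration, not a reproduction of `StackingClassifier` internals.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.33, random_state=42)

base_models = [LogisticRegression(),
               KNeighborsClassifier(),
               DecisionTreeClassifier(random_state=0)]

# Level 0: out-of-fold positive-class probabilities, one column per base model.
# Each training row is predicted by a model that did not see it during fitting.
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method='predict_proba')[:, 1]
    for m in base_models
])

# Refit every base model on the full training set to build test-time meta-features.
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    for m in base_models
])

# Level 1: the final estimator learns how to combine the base predictions.
meta_model = LogisticRegression().fit(meta_train, y_tr)
manual_stack_accuracy = meta_model.score(meta_test, y_te)
print(manual_stack_accuracy)
```

With `stack_method = 'auto'`, each base model is queried through `predict_proba`, `decision_function` or `predict`, in that order of preference. A default `SVC()` does not expose `predict_proba`, so it contributes its `decision_function` values instead.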

# Step 9: Verify and Evaluate the Training Model
- Use the **training** data to make predictions
- Check for overfitting
- What metrics are appropriate for the modelling approach used?
- For **Supervised** models:
    - Check the **Training Results** with the **Training Predictions** during development
- Analyse, modify the parameters and hyper-parameters and repeat (within reason) until the model does not improve

In [21]:
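#plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2; use the Display classes instead
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay, RocCurveDisplay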

In [27]:
print('Training accuracy for stacked model:', stack.score(X_train,y_train))
print('Testing accuracy for stacked model:', accuracy_score(y_test,stack_predictions))

Training accuracy for stacked model: 0.9700854700854701
Testing accuracy for stacked model: 0.9653679653679653
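
The baseline also quotes ROC AUC, which needs scores rather than hard class predictions. The sketch below computes ROC AUC, a confusion matrix and a classification report on synthetic data relabelled as 2/4 (the coding documented for this dataset: 2 = benign, 4 = malignant). The model and data are stand-ins, so the numbers are not comparable with the lab baseline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X_demo, y01 = make_classification(n_samples=300, n_features=9, random_state=1)
y_demo = np.where(y01 == 1, 4, 2)  # relabel to mimic the 2/4 class coding

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33,
                                          random_state=42, stratify=y_demo)

clf = LogisticRegression().fit(X_tr, y_tr)
predictions = clf.predict(X_te)

# Use the probability column that belongs to the positive class (4).
positive_column = list(clf.classes_).index(4)
scores = clf.predict_proba(X_te)[:, positive_column]

auc = roc_auc_score(y_te, scores)
cm = confusion_matrix(y_te, predictions, labels=[2, 4])
report = classification_report(y_te, predictions, labels=[2, 4],
                               target_names=['benign', 'malignant'])
print(auc)
print(cm)
print(report)
```

The same calls should work with a fitted `StackingClassifier`, which exposes `predict_proba` when its final estimator does (the default `LogisticRegression` does).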


# Step 10: Make Predictions and Evaluate the Test Model
**NOTE**: **Do this only once you have finished improving the model**.

- Use the **test** data to make predictions
- For **Supervised** models:
    - Check the **Test Results** with the **Test Predictions**

In [29]:
#Try the individual base classifiers with default settings:
logmodel = LogisticRegression().fit(X_train,y_train)
log_predictions = logmodel.predict(X_test) #Reuse the fitted model instead of refitting

knn = KNeighborsClassifier().fit(X_train,y_train)
knn_predictions = knn.predict(X_test)

svc = SVC().fit(X_train,y_train)
svc_predictions = svc.predict(X_test)

tree = DecisionTreeClassifier().fit(X_train,y_train)
tree_predictions = tree.predict(X_test)



In [35]:
#Initialize some lists to produce a nice dataframe:

estimators_ = [stack,logmodel,knn,svc,tree]
Estimators = ['Stacked Classifier','Logistic Regression','K-Nearest neighbors','Support Vector Machine', 'Decision Tree']

#Get training scores for each model:
training_scores = [model.score(X_train,y_train) for model in estimators_]

#Get testing scores for each model:
test_scores = [accuracy_score(y_test,model.predict(X_test)) for model in estimators_]

In [56]:
#Use a new name so the dataset in df is not overwritten:
results_df = pd.DataFrame(data = {'Estimator': Estimators,
                                  'Training scores' : training_scores,
                                  'Test scores' : test_scores
                                 }
                         )
results_df

|   | Estimator | Training scores | Test scores |
|---|---|---|---|
| 0 | Stacked Classifier | 0.970085 | 0.965368 |
| 1 | Logistic Regression | 0.970085 | 0.965368 |
| 2 | K-Nearest neighbors | 0.970085 | 0.961039 |
| 3 | Support Vector Machine | 0.970085 | 0.969697 |
| 4 | Decision Tree | 1.0 | 0.943723 |
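
The test scores above differ by only a few cases on a single split, so a cross-validated comparison can give a steadier picture. This is a minimal sketch on synthetic data; the model set, the fold count and the `*_demo` names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

models = {
    'Logistic Regression': LogisticRegression(),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Support Vector Machine': SVC(),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

# The same shuffled, stratified folds are used for every model.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_summary = {}
for name, model in models.items():
    # Scaling sits inside the pipeline so each fold is scaled on its own training part.
    fold_scores = cross_val_score(make_pipeline(StandardScaler(), model),
                                  X_demo, y_demo, cv=folds, scoring='accuracy')
    cv_summary[name] = (fold_scores.mean(), fold_scores.std())

for name, (mean_acc, std_acc) in cv_summary.items():
    print(f'{name}: {mean_acc:.3f} +/- {std_acc:.3f}')
```

A training score of 1.0 next to a lower test score, as the decision tree shows in the table, is the usual sign of overfitting; the spread across folds helps judge whether the remaining differences are meaningful.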


# Step 11: Solve the Problem or Answer the Question
The results of an analysis or modelling can be used:
- As part of a product or process, so the model can make predictions when new input data is available
- As part of a report including text and charts to help understand the problem
- As input for further questions
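
For the first use case, a fitted model is typically saved and reloaded wherever predictions are needed. The sketch below uses `joblib` with a temporary directory and a small synthetic pipeline; the file name `model.joblib` and the data are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)

# Keep preprocessing and model together so new data is transformed consistently.
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'model.joblib')
    joblib.dump(model, model_path)     # save the fitted pipeline
    restored = joblib.load(model_path)  # load it back, e.g. in another process

# A "new" case must have the same nine features, in the same order.
new_case = X_demo[:1]
new_prediction = restored.predict(new_case)
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(new_prediction, same_predictions)
```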

© 2020 Institute of Data