<a href="https://colab.research.google.com/github/MonkeyWrenchGang/MGTPython/blob/main/module_7/7_1_Scikit_Learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem 


---


Excluding skin cancers, breast cancer is the most common cancer among American women. It is estimated that in 2022, approximately 30% of all new cancer diagnoses in women will be breast cancer. Analyzing breast cancer data is essential to understanding the disease and developing effective treatments. Advances in technology have made it possible to collect large amounts of data on breast cancer, including patient demographics, tumor characteristics, treatment outcomes, and genetic profiles.

Here are some facts from the [National Breast Cancer Foundation](https://www.nationalbreastcancer.org/):

- In 2022, an estimated 287,500 new cases of invasive breast cancer will be diagnosed in women in the U.S. as well as 51,400 new cases of non-invasive (in situ) breast cancer.

- 65% of breast cancer cases are diagnosed at a localized stage (there is no sign that the cancer has spread outside of the breast), for which the 5-year relative survival rate is 99%.

- An estimated 43,550 women will die from breast cancer in the U.S.

- **1 in 8** women in the United States will be diagnosed with breast cancer in their lifetime.

Let's develop some machine learning skills to detect "Malignant" cancer tumors. I'm a big fan of "recipe" approaches to machine learning (and to cookies). We'll take a look at the true "sugar-cookie" recipe for building a machine learning model:

1. train test split, partitioning our data into a 70/30 split
2. scikit-learn pipeline framework
3. model training 
4. model evaluation 


In [40]:
# --------------------------------
import warnings
warnings.filterwarnings('ignore')
# --------------------------------

import pandas as pd
import numpy as np
import matplotlib.pyplot  as plt
import seaborn as sns

# -- sklearn stuff -- 
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, recall_score, precision_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# -- sklearn ML models -- 
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

%matplotlib inline

In [41]:
cancer = pd.read_csv("https://raw.githubusercontent.com/MonkeyWrenchGang/MGTPython/main/module_6/data/BreastCancerData.csv")
cancer.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


When building models, we train on one set of data and evaluate on data the model hasn't seen before, to check that it generalizes well to new, unseen data. Generalizing to new data is why models are built!

- The `train_test_split()` function randomly shuffles the data and splits it into two sets based on a specified ratio, typically 70/30 or 80/20.

```python
from sklearn.model_selection import train_test_split
# Return order is X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
```

In [42]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(cancer,test_size=0.30)

print("train pct: {:2.2%}".format(train.shape[0]/cancer.shape[0]))
print("test  pct: {:2.2%}".format(test.shape[0]/cancer.shape[0]))

train pct: 69.95%
test  pct: 30.05%
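
Because the split above is random, the rows that land in each partition change from run to run. The following is a minimal sketch of a repeatable, class-balanced split. It assumes scikit-learn's built-in `load_breast_cancer` data as a stand-in for the CSV so that it runs without a download, and the seed value `42` is an arbitrary choice, not something from this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Built-in copy of the Wisconsin breast cancer data, used here only so the
# sketch is self-contained (assumption: a stand-in for the course CSV).
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Return order is X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,    # hold out 30% of the rows for evaluation
    random_state=42,   # arbitrary seed so the split is repeatable
    stratify=y,        # keep the class mix similar in both partitions
)

print("train rows:", X_train.shape[0])
print("test  rows:", X_test.shape[0])
print("train share of class 1: {:.3f}".format(y_train.mean()))
print("test  share of class 1: {:.3f}".format(y_test.mean()))
```

With `stratify=y`, the share of each class should come out nearly identical in the two partitions, which matters when one class is rarer than the other.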


# Supervised Learning: 


At a minimum, identify:

1. target feature - what is it you want to predict? 

2. categorical_features - Categorical features must be transformed into a numerical representation that can be used by machine learning algorithms

3. numeric_features - Numeric features sometimes / often need to be transformed so that they can be used by specific machine learning algorithms



In [43]:
# -- Split features into target, categorical and numeric lists --
# -- 1. target variable name
target = 'diagnosis'
print("-- target variable  : {}".format(target))
# -- 2. numeric variables --
numeric_features =  cancer.select_dtypes(include="number").columns
# Exclude the 'id' column
numeric_features = numeric_features.difference(['id'])
numeric_features = numeric_features.to_list() # convert from pandas Index to list.
print("-- numeric features :")
print(numeric_features)
# -- 3. categorical variables -- we don't have any categories
categorical_features = [] 
print("-- categorical features :")
print(categorical_features)

-- target variable  : diagnosis
-- numeric features :
['area_mean', 'area_se', 'area_worst', 'compactness_mean', 'compactness_se', 'compactness_worst', 'concave points_mean', 'concave points_se', 'concave points_worst', 'concavity_mean', 'concavity_se', 'concavity_worst', 'fractal_dimension_mean', 'fractal_dimension_se', 'fractal_dimension_worst', 'perimeter_mean', 'perimeter_se', 'perimeter_worst', 'radius_mean', 'radius_se', 'radius_worst', 'smoothness_mean', 'smoothness_se', 'smoothness_worst', 'symmetry_mean', 'symmetry_se', 'symmetry_worst', 'texture_mean', 'texture_se', 'texture_worst']
-- categorical features :
[]
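
This data set has no categorical columns, so the categorical list ends up empty. As a rough sketch of how the same three-way split might look when categorical columns are present, here is a tiny made-up frame; the `site` column and all of its values are hypothetical and not part of the cancer data.

```python
import pandas as pd

# Hypothetical frame, invented for illustration only.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "site": ["left", "right", "left"],      # made-up categorical column
    "radius_mean": [17.99, 11.42, 12.30],
})

toy_target = "diagnosis"

# Numeric columns, minus the identifier.
toy_numeric = df.select_dtypes(include="number").columns.difference(["id"]).to_list()

# Everything non-numeric, minus the target.
toy_categorical = df.select_dtypes(exclude="number").columns.difference([toy_target]).to_list()

print("numeric    :", toy_numeric)
print("categorical:", toy_categorical)
```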


# Create Pipeline


---

- categorical transformations
  - missing value handling
  - one-hot-encoding
- numeric transformations
  - missing value handling
  - scaling / normalizing numbers
- create pipeline 
  - bind the transformations to a
  - model algorithm
- fit the pipeline 
- finally make predictions with the pipeline
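
To make the preprocessing steps in the outline above concrete, here is a small sketch on made-up data with one categorical and one numeric column, each containing a missing value. The column names and values are hypothetical; the transformers are the same scikit-learn classes imported at the top of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one gap in each column.
toy = pd.DataFrame({
    "site":   ["left", "right", np.nan, "left"],
    "radius": [10.0, np.nan, 14.0, 12.0],
})

# Categorical branch: fill gaps with the most frequent value, then one-hot encode.
cat_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Numeric branch: fill gaps with the median, then standardize.
num_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Route each column to its branch.
toy_preprocessor = ColumnTransformer([
    ("cat", cat_steps, ["site"]),
    ("num", num_steps, ["radius"]),
])

transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)   # one column per category plus one scaled numeric column
print(transformed)
```

The missing `site` is filled with the most frequent value before encoding, and the missing `radius` is filled with the median before scaling, so no gaps reach the model.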

In [44]:
# Define the preprocessing steps for the categorical variables
cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')), # handling missing values 
    ('onehot', OneHotEncoder(handle_unknown='ignore'))    # convert categories to 1s and 0s
])

# Define the preprocessing steps for the numeric variables
num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),       # replace missing values with median 
    ('scaler', StandardScaler())                         # scale numerics 
])

# Define the preprocessor: route each column list to its transformer
preprocessor = ColumnTransformer([
    ('cat', cat_transformer, categorical_features),  # impute + one-hot encode categorical features (none in this data set)
    ('num', num_transformer, numeric_features)       # impute + standard-scale the numeric features
])
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(max_depth=10))
])

# Fit the pipeline to the training data
pipeline.fit(train[numeric_features], train[target])

# Predict on the train and test data sets
y_pred_train = pipeline.predict(train[numeric_features])
y_pred_test = pipeline.predict(test[numeric_features])
# Evaluate the performance of the model
accuracy_train = accuracy_score(train[target], y_pred_train)
accuracy_test  = accuracy_score(test[target], y_pred_test)
print("-- Accuracy --")
print('train: {:.3f}'.format(accuracy_train))
print('test : {:.3f}'.format(accuracy_test))

-- Accuracy --
train: 1.000
test : 0.936


## Beauty of Pipelines


In [39]:
dt_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(max_depth=10))
])

lr_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

rf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

nn_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', MLPClassifier())
])

models = {"decision tree":dt_pipeline,
          "logistic reg": lr_pipeline,
          "random forest": rf_pipeline, 
          "neural net": nn_pipeline}

for model in models.keys():
  print("model: {}".format(model))
  models[model].fit(train[numeric_features], train[target])
  # Predict on the train and test data sets
  y_pred_train = models[model].predict(train[numeric_features])
  y_pred_test = models[model].predict(test[numeric_features])
  
  # Evaluate the performance of the model
  accuracy_train = accuracy_score(train[target], y_pred_train)
  accuracy_test  = accuracy_score(test[target], y_pred_test)
  
  # Recall on the malignant class ('M'): share of malignant tumors the model catches
  recall_train = recall_score(train[target], y_pred_train, pos_label='M')
  recall_test  = recall_score(test[target], y_pred_test, pos_label='M')
  print("-- Accuracy --")
  print('train: {:.3f}'.format(accuracy_train))
  print('test : {:.3f}'.format(accuracy_test))
  print("-- Recall --")
  print('train: {:.3f}'.format(recall_train))
  print('test : {:.3f}'.format(recall_test))

model: decision tree
-- Accuracy --
train: 1.000
test : 0.942
-- Recall --
train: 1.000
test : 0.932
model: logistic reg
-- Accuracy --
train: 0.990
test : 0.971
-- Recall --
train: 0.971
test : 0.959
model: random forest
-- Accuracy --
train: 1.000
test : 0.971
-- Recall --
train: 1.000
test : 0.973
model: neural net
-- Accuracy --
train: 0.987
test : 0.977
-- Recall --
train: 0.971
test : 0.959
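
To see what the accuracy and recall numbers above measure, here is a small worked example on hand-made labels (the ten labels are invented for illustration and are not from the cancer data). It also computes precision with `precision_score`, which is imported at the top of the notebook but not used in the loop.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 4 malignant (M) and 6 benign (B) tumors.
y_true = ["M", "M", "M", "M", "B", "B", "B", "B", "B", "B"]
y_pred = ["M", "M", "M", "B", "B", "B", "B", "B", "B", "M"]

# Counts with 'M' treated as the positive class.
pairs = list(zip(y_true, y_pred))
tp = sum(t == "M" and p == "M" for t, p in pairs)  # malignant, caught
fn = sum(t == "M" and p == "B" for t, p in pairs)  # malignant, missed
fp = sum(t == "B" and p == "M" for t, p in pairs)  # benign, flagged as malignant

accuracy = accuracy_score(y_true, y_pred)                    # correct / total
recall = recall_score(y_true, y_pred, pos_label="M")         # tp / (tp + fn)
precision = precision_score(y_true, y_pred, pos_label="M")   # tp / (tp + fp)

print("tp={}, fn={}, fp={}".format(tp, fn, fp))
print("accuracy : {:.3f}".format(accuracy))
print("recall   : {:.3f}".format(recall))
print("precision: {:.3f}".format(precision))
```

Recall answers "of the truly malignant tumors, how many did the model catch?", which is why it is reported alongside accuracy here: a missed malignant tumor is a costlier mistake than a false alarm.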
