[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Clinical-Informatics-Interest-Group/Medicine-AI-Seminar/blob/main/session_2/IntroML.ipynb)

# Extremely Short intro to Python
### (and Jupyter Notebooks)

Python is one of the most popular languages for machine learning in both academia and industry.  
Thankfully, it is also easier to read than many other programming languages.

In [None]:
# Place your cursor in this cell and press 'shift' + 'enter'
print("Hello, medical students!")

In [None]:
# print("Hello, medical students")

In [None]:
# Nothing happened there ^ because the line started with '#'.
# Coders call this "commenting out" code. In Python, anything that follows a '#'
# on a line is not evaluated.

In [None]:
# Python does math well, too.
1 + 1

In [None]:
# If you want more information about a function from Jupyter, simply place a '?' in front
# of it and evaluate it with 'shift+enter'.
?print

# [Breast Cancer Wisconsin Dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))
"Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image."  
https://scikit-learn.org/stable/datasets/toy_dataset.html

# First let's train our own neurons
![img](./images/histology.webp)
Rakha, Emad & Reis-Filho, Jorge & Baehner, Frederick & Dabbs, David & Decker, Thomas & Eusebi, Vincenzo & Fox, Stephen & Ichihara, Shu & Jacquemier, Jocelyne & Lakhani, Sr & Palacios, José & Richardson, Andrea & Schnitt, Stuart & Schmitt, Fernando & Tan, Puay-Hoon & Tse, Gary & Badve, Sunil & Ellis, Ian. (2010). Breast cancer prognostic classification in the molecular era: the role of histological grade. Breast cancer research : BCR. 12. 207. 10.1186/bcr2607. 

In [None]:
# These import statements bring new functions for us to use.
# This saves us from having to write them ourselves.
from sklearn import datasets
import pandas as pd

# Let's use a function from 'datasets' to load the breast cancer data
# and assign it to 'tumor'
tumor = datasets.load_breast_cancer()
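
Before splitting the data, it can help to look at what was loaded. The sketch below is one way to do that; the names `n_samples`, `n_features`, `class_names`, and `class_counts` are introduced here for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn import datasets

tumor = datasets.load_breast_cancer()

# 'data' is a table of measurements: one row per aspirate, one column per feature
n_samples, n_features = tumor.data.shape

# 'target' holds one label per row; 'target_names' says what each label means
class_names = list(tumor.target_names)
class_counts = np.bincount(tumor.target)

print("Samples:", n_samples, "Features:", n_features)
for label, (name, count) in enumerate(zip(class_names, class_counts)):
    print("Label", label, "=", name, "->", count, "samples")
```

Note that label 0 is malignant and label 1 is benign, which matters later when deciding which class counts as "positive".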

In [None]:
# In order to "Hold Out" half the data for validation
# we need to write some code that allows us to separate
# it at the halfway point.

# Find the length of the data set
length = len(tumor.data)
# Halfway point
midpoint = (length // 2)
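# Start of the held-out half. A slice stops just before its end index,
# so the second half begins at 'midpoint' itself (adding 1 would skip a row).
secondhalf = midpoint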
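
Python slices include the start index and exclude the end index, which decides where the second half should begin. The toy list below is a made-up stand-in for the data set, used only to make the indexing visible.

```python
# A small stand-in for the data set (hypothetical values)
records = ["a", "b", "c", "d", "e", "f", "g"]

midpoint = len(records) // 2        # 7 // 2 == 3

first_half = records[:midpoint]     # indices 0, 1, 2
second_half = records[midpoint:]    # indices 3, 4, 5, 6

# Together the two slices cover every record exactly once
print(first_half, second_half)

# Starting one position later leaves out the record at index 'midpoint'
skipping = records[midpoint + 1:]
print(skipping)
```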

In [None]:
# Assign the data features (independent variables) to a matrix 'X'
# which the Perceptron will make predictions on. 'y' is the ground
# truth the Perceptron will compare its predictions to.
X = pd.DataFrame(tumor.data[:midpoint, :])
y = tumor.target[:midpoint]
# We'll hold some data out of training entirely so we can test the model's "real world" performance
X_validate = pd.DataFrame(tumor.data[secondhalf:, :])
y_validate = tumor.target[secondhalf:]

In [None]:
from sklearn.model_selection import train_test_split

# Here, we tell 'train_test_split' to set aside 30% of our training
# data set as X_test. X_train is the data the Perceptron will learn
# from by predicting malignant or not. X_test is the data that
# internally validates the trained algorithm.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
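
The `stratify=y` argument asks for the same mix of classes in both pieces of the split. A minimal sketch on made-up labels (70 of one class, 30 of the other) shows the effect; the toy arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 70% labelled 1 and 30% labelled 0
y = np.array([1] * 70 + [0] * 30)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# 30% of the rows go to the test set
print("Train rows:", len(X_train), "Test rows:", len(X_test))

# With stratification the share of label 1 is preserved in both pieces
print("Share of label 1 in train:", y_train.mean())
print("Share of label 1 in test: ", y_test.mean())
```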

In [None]:
# Evaluate 'X' to see the data in a nice table
X

In [None]:
# Calling the 'describe' method on X returns some useful information
X.describe()
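
The table above labels its columns 0 to 29 because the DataFrame was built without names. As an optional variation, the sketch below passes the data set's own `feature_names` as column labels; the variable `named` is introduced here and is not used elsewhere in the notebook.

```python
import pandas as pd
from sklearn import datasets

tumor = datasets.load_breast_cancer()

# Same measurements, but each column carries its feature name
named = pd.DataFrame(tumor.data, columns=tumor.feature_names)

# Summary statistics for the first three features only, to keep the output short
print(named.iloc[:, :3].describe())
```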

In [None]:
# It's important to standardize the numerical values in every feature, so that no feature is disproportionately weighted.
from sklearn.preprocessing import StandardScaler

# We should fit the 'Standard Scaler' to the data we're going to train the Perceptron on.
sc = StandardScaler()
sc.fit(X_train)
# And then standardize the data by transforming it with the fitted 'StandardScaler'
X_train_std = sc.transform(X_train)
# Now apply the same standardizing to the test and validation sets
X_test_std = sc.transform(X_test)
X_valid_std = sc.transform(X_validate)
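
Standardizing means subtracting each feature's mean and dividing by its standard deviation, with both numbers learned from the training data only. The NumPy sketch below does this by hand on made-up measurements; it is an illustration of the arithmetic, not a replacement for `StandardScaler`.

```python
import numpy as np

# Hypothetical measurements: two features on very different scales
train = np.array([[10.0, 200.0],
                  [12.0, 260.0],
                  [14.0, 320.0]])
test = np.array([[11.0, 230.0]])

# "Fit": learn the mean and standard deviation of each column from training data
mean = train.mean(axis=0)
std = train.std(axis=0)   # population standard deviation, as StandardScaler uses

# "Transform": apply the same shift and scale to every data set
train_std = (train - mean) / std
test_std = (test - mean) / std

print("Training column means:", train_std.mean(axis=0))
print("Training column stds: ", train_std.std(axis=0))
print("Standardized test row:", test_std)
```

After the transform, both training columns have mean 0 and standard deviation 1, so neither feature dominates simply because its raw numbers are larger.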

In [None]:
from sklearn.linear_model import Perceptron

# Let's call the Perceptron we're about to train 'neuron', and give it
# two initial instructions (in the form of parameters). 'eta0' is the "learning
# rate", which tells the Perceptron how much it should change its weights
# each time it gets a prediction wrong. 'random_state=1' fixes the seed used to
# shuffle the training data between passes, so the results are reproducible.
neuron = Perceptron(eta0=0.1, random_state=1)

# Next we'll tell neuron to learn from the data features 'X_train_std' by attempting to predict
# the outcome 'y_train'
neuron.fit(X_train_std, y_train)
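
To see what "learning" means here, the sketch below implements the classic perceptron update rule in plain NumPy. It is a simplified illustration, not scikit-learn's actual implementation (which, among other things, shuffles the data); the function `train_perceptron` and the toy data are made up for this example.

```python
import numpy as np

def train_perceptron(X, y, eta0=0.1, epochs=10):
    """Classic perceptron rule for labels 0 and 1 (illustrative sketch)."""
    weights = np.zeros(X.shape[1])
    bias = 0.0
    for _ in range(epochs):                 # one epoch = one pass over the data
        for xi, target in zip(X, y):
            prediction = 1 if xi @ weights + bias > 0 else 0
            error = target - prediction     # 0 when correct, +1 or -1 when wrong
            # Only wrong predictions change the weights, scaled by the learning rate
            weights += eta0 * error * xi
            bias += eta0 * error
    return weights, bias

# Hypothetical, already standardized toy data with two features
X = np.array([[-2.0, -1.0], [-1.0, -1.5], [1.0, 1.5], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

weights, bias = train_perceptron(X, y)
predictions = (X @ weights + bias > 0).astype(int)
print("Weights:", weights, "Bias:", bias)
print("Predictions:", predictions)
```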

In [None]:
# Check the accuracy by scoring 'neuron' against the test data
neuron.score(X_test_std, y_test)

In [None]:
# Let's see what happens if we only allow the Perceptron a single
# pass over the training data ('max_iter=1').
weak_neuron = Perceptron(max_iter=1, eta0=0.1, random_state=1)
weak_neuron.fit(X_train_std, y_train)
weak_neuron.score(X_test_std, y_test)
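
One way to explore this further is to compare several values of `max_iter` side by side. The sketch below is self-contained and, as a simplifying assumption, splits the whole data set rather than only the first half, so its numbers will not match the cells above exactly. It also silences the warning scikit-learn raises when training stops before converging.

```python
import warnings

from sklearn import datasets
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=1, stratify=data.target)

# Fit the scaler on the training rows only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

scores = {}
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    for n_passes in (1, 5, 50):
        model = Perceptron(max_iter=n_passes, eta0=0.1, random_state=1)
        model.fit(X_train_std, y_train)
        scores[n_passes] = model.score(X_test_std, y_test)

for n_passes, accuracy in scores.items():
    print("max_iter =", n_passes, "-> test accuracy", round(accuracy, 3))
```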

In [None]:
# Lastly, we validate the neuron with the validation
# data held out from training.
neuron.score(X_valid_std, y_validate)

# Breast Cancer Diagnosis by Fine Needle Aspirate
- Sensitivity: 74 percent (95% CI 72 to 77 percent)
- Specificity: 96 percent (95% CI 94 to 98 percent)

Wang M, He X, Chang Y, Sun G, Thabane L. "A sensitivity and specificity comparison of fine needle aspiration cytology and core needle biopsy in evaluation of suspicious breast lesions: A systematic review and meta-analysis." Breast. 2017;31:157. Epub 2016 Nov 17. 

In [None]:
from sklearn.metrics import confusion_matrix

# Explore this code to see how we can describe the
# sensitivity and specificity of our 'neuron'
y_validate_pred = neuron.predict(X_valid_std)
confusion_matrix(y_validate, y_validate_pred)

In [None]:
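# In this data set 0 means malignant and 1 means benign. Listing the labels
# as [1, 0] makes malignant the "positive" class, so sensitivity is the share
# of cancers that 'neuron' catches.
tn, fp, fn, tp = confusion_matrix(y_validate, y_validate_pred, labels=[1, 0]).ravel()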
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print('Sensitivity : ', sensitivity)
print('Specificity : ', specificity)
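
Which class counts as "positive" changes what sensitivity and specificity mean. The sketch below uses small made-up label arrays (0 = malignant, 1 = benign, matching this data set) and treats malignant as positive, which is the convention the fine needle aspirate figures above use. `recall_score` is included only as a cross-check.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 0 = malignant, 1 = benign
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

# labels=[1, 0] puts benign first and malignant second, so after ravel()
# the four counts line up with malignant as the positive class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[1, 0]).ravel()

sensitivity = tp / (tp + fn)   # malignant cases correctly flagged
specificity = tn / (tn + fp)   # benign cases correctly cleared

print("Sensitivity:", sensitivity)
print("Specificity:", round(specificity, 3))

# Cross-check: recall for label 0 is the same quantity as sensitivity here
print("Recall for malignant:", recall_score(y_true, y_pred, pos_label=0))
```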