# Model Creation for the Penguin Classifier App

This notebook shows how I created the machine learning model used in the penguin classifier app from the lectures on graphical user interfaces. My approach is deliberately simple: I use a decision tree with a fixed maximum depth and skip cross-validation. The resulting model is OK, but with cross-validation and more careful modeling decisions, I know that you can do much better!

In [13]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn import tree

In [14]:
url = 'https://raw.githubusercontent.com/PhilChodrow/PIC16A/master/datasets/palmer_penguins.csv'
penguins = pd.read_csv(url)

In [15]:
penguins['Species'] = penguins['Species'].str.split().str.get(0)

In [16]:
# .mean() gives the same result as aggregate(np.mean) without the deprecation warning in recent pandas
penguins.groupby(['Island', 'Species'])[['Body Mass (g)', 'Culmen Length (mm)']].mean()

| Island | Species | Body Mass (g) | Culmen Length (mm) |
|---|---|---|---|
| Biscoe | Adelie | 3709.659091 | 38.975 |
| Biscoe | Gentoo | 5076.01626 | 47.504878 |
| Dream | Adelie | 3688.392857 | 38.501786 |
| Dream | Chinstrap | 3733.088235 | 48.833824 |
| Torgersen | Adelie | 3706.372549 | 38.95098 |


In [17]:
train, test = train_test_split(penguins, test_size = 0.5)

In [18]:
def prep_penguins(data_df):
    """
    Prepare the penguins data set.
    First, we apply a LabelEncoder to the Species and Island columns.
    Second, we keep only the four columns used in this exercise:
    the target (Species) and three predictors (Island, Body Mass (g),
    Culmen Length (mm)).
    Third, we remove rows with NA values in any of those columns.
    Finally, we split into predictor and target variables.
    
    data_df: a row-subset of the penguins data frame
    return: X, y, the cleaned predictors (a data frame) and target (a series)
    """
    
    # copy the original df to suppress warnings
    df = data_df.copy()
    
    # apply label encoders to Species and Island columns
    le = preprocessing.LabelEncoder()
    df['Species'] = le.fit_transform(df['Species'])
    
    le = preprocessing.LabelEncoder()
    df['Island'] = le.fit_transform(df['Island'])
    
    # only need these columns
    df = df[['Species', 'Island', 'Body Mass (g)', 'Culmen Length (mm)']]
    # remove rows if they have NA in any of the needed columns
    df = df.dropna()
    
    # separate into predictor and target variables
    X = df.drop(['Species'], axis = 1)
    y = df['Species']
    
    return X, y

In [19]:
X_train, y_train = prep_penguins(train)
X_test, y_test   = prep_penguins(test)
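
One caveat about the step above: `prep_penguins` fits a fresh `LabelEncoder` on whichever subset it receives, so the integer codes for train and test agree only when every species and island shows up in both halves. A minimal sketch of one way to guard against that is to fit the encoders once on the full data and reuse them. The helper `fit_encoders` and the toy data frame are hypothetical, introduced only for illustration; they are not part of the original notebook.

```python
import pandas as pd
from sklearn import preprocessing

def fit_encoders(full_df, columns):
    """Fit one LabelEncoder per column on the full data (hypothetical helper)."""
    return {col: preprocessing.LabelEncoder().fit(full_df[col]) for col in columns}

# toy stand-in for the penguins data frame (made-up rows, illustration only)
toy = pd.DataFrame({
    "Species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "Island": ["Biscoe", "Biscoe", "Dream", "Torgersen"],
})

encoders = fit_encoders(toy, ["Species", "Island"])

# a subset that is missing some categories still receives the same integer codes
subset = toy.iloc[[1, 3]].copy()
for col, le in encoders.items():
    subset[col] = le.transform(subset[col])
```

In the notebook, the same idea would mean fitting the encoders on `penguins` before the split and passing them into the preparation function.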

In [20]:
# make the model
T = tree.DecisionTreeClassifier(max_depth = 5)
T.fit(X_train, y_train)
T.score(X_train, y_train), T.score(X_test, y_test)

(1.0, 0.9593023255813954)
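
The gap between the training score and the test score hints at some overfitting, which is where cross-validation could help. As a rough sketch (not what this notebook actually does), `cross_val_score` can be used to compare several values of `max_depth`. To keep the block runnable on its own, it uses synthetic data from `make_classification` as a stand-in; in the notebook you would pass `X_train` and `y_train` instead. The names `X_demo`, `y_demo`, `cv_scores`, and `best_depth` are assumptions made for this example.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# synthetic stand-in for X_train, y_train (assumption: 3 features, 3 classes)
X_demo, y_demo = make_classification(
    n_samples=150, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

cv_scores = {}
for depth in range(1, 11):
    T_cv = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
    # mean accuracy across 5 cross-validation folds
    cv_scores[depth] = cross_val_score(T_cv, X_demo, y_demo, cv=5).mean()

# depth with the highest mean cross-validated accuracy
best_depth = max(cv_scores, key=cv_scores.get)
```

The chosen depth would then replace the hard-coded `max_depth = 5` before fitting the final model on the training data.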

# Pickling

Here is the new part: after creating the model, we *pickle* it. Pickling serializes the fitted model to a file, so we can load it into a new Python session without downloading the data and retraining the model every time we want to use the app.

In [21]:
import pickle

# save the fitted model to disk; the with-block closes the file when done
with open("model.p", "wb") as f:
    pickle.dump(T, f)
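
For completeness, here is a sketch of the other half of the round trip: loading a pickled model in a fresh session and using it to predict. So that the block runs on its own, it trains a tiny toy tree and writes it to a temporary file; the file name `toy_model.p` and the toy training data are assumptions for illustration. In the app you would open `model.p` instead. Loading requires scikit-learn to be installed, and you should only unpickle files you trust.

```python
import os
import pickle
import tempfile

from sklearn import tree

# toy model standing in for the trained penguin classifier
toy_model = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
toy_model.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])

# write to a temporary location so nothing in the working directory is overwritten
path = os.path.join(tempfile.mkdtemp(), "toy_model.p")

with open(path, "wb") as f:
    pickle.dump(toy_model, f)

# in a new session, this is the only step needed to get the model back
with open(path, "rb") as f:
    loaded = pickle.load(f)

# the loaded model should give the same predictions as the original
new_points = [[0, 0], [2, 2]]
same = list(loaded.predict(new_points)) == list(toy_model.predict(new_points))
```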