# Using Jupyter Notebooks with Galileo

Galileo makes using Jupyter Notebooks as easy as running a Python script. This short example notebook shows that working in a Jupyter notebook on Galileo is just like working in one locally.

This example uses the famous iris data set to perform some exploratory data analysis and to train and test four classifiers: a decision tree, a random forest, a support vector classifier (SVC), and a logistic regression model.

### Exploratory Data Analysis

We begin by importing the dependencies specified when creating the mission:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import itertools
from IPython.display import IFrame

%matplotlib inline

import seaborn as sns
sns.set()  # set the plotting style

Load in the data set using the uploaded file and inspect the first five rows. The data set does not include a header for column names, so add those manually:

In [None]:
header_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

df = pd.read_csv("iris.csv", names=header_names)
df.head()
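
The effect of `names=` can be sketched on a small in-memory sample. The three rows below are a stand-in for the uploaded `iris.csv`, and the `Iris-setosa` style labels are an assumption that mirrors the UCI file; the uploaded file may label species differently.

```python
import io

import pandas as pd

header_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# Illustrative stand-in for a headerless iris.csv (assumed layout)
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# Without names=, pandas would consume the first data row as the header
sample_df = pd.read_csv(sample_csv, names=header_names)
print(sample_df.dtypes)
```

Checking `dtypes` after loading is a quick way to confirm that the four measurements were parsed as numbers and only the species column is text.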

For more information on the Iris data set and the original method Ronald Fisher used to classify the species, please watch this short video:

In [None]:
# Use the /embed/ form of the URL; YouTube watch pages refuse to load inside an iframe
IFrame("https://www.youtube.com/embed/PlrEJfvZRNo", width=560, height=315)

Data can also be loaded from online sources:

In [None]:
online_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                 names=header_names)
online_df.head()

It is also possible to use data included in loaded dependencies, such as the same iris data set included in the scikit-learn datasets module:

In [None]:
from sklearn import datasets
iris = datasets.load_iris()
# Append the integer class codes as a fifth column next to the four measurements
sk_df = pd.DataFrame(
    np.hstack((iris['data'], iris['target'].reshape(-1, 1))),
    columns=header_names
)
sk_df.head()
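
Note that the scikit-learn version stores the species as numeric codes (0, 1, 2) rather than names. If readable labels are preferred, one possible approach is sketched below; `named_df` is a hypothetical name introduced here, not part of the original notebook.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
header_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# Build the frame from the four measurement columns only
named_df = pd.DataFrame(iris['data'], columns=header_names[:-1])

# Map the integer codes to the names stored in iris['target_names']
named_df['species'] = pd.Categorical.from_codes(
    iris['target'], categories=iris['target_names']
)
print(named_df['species'].value_counts())
```

Keeping the species as a categorical column preserves the numeric codes (via `named_df['species'].cat.codes`) for the classifiers while showing names in tables and plots.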

With access to the data, we can now look at simple summary statistics to compare the three classes: *Iris setosa*, *Iris versicolor*, and *Iris virginica*.

In [None]:
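# String aggregation names avoid the deprecation warnings that newer pandas
# versions raise when NumPy functions or built-ins are passed to agg
df.groupby('species').agg(['mean', 'std', 'min', 'max'])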

Another way to gain insight into the ways the variables interact with each other is to create a pair plot, which shows pairwise relationships in the data set. 

In [None]:
pair_plot = sns.pairplot(df, hue='species')
plt.show()

The pair plot suggests that *Iris setosa* is linearly separable from the other species along a few axes but that *Iris versicolor* and *Iris virginica* might require more work to distinguish.
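
The visual impression can be checked numerically. The sketch below uses the scikit-learn copy of the data and looks only at petal length, one of the axes where the separation is visible; the variable names are introduced here for illustration.

```python
from sklearn import datasets

iris = datasets.load_iris()
# Column order: sepal length, sepal width, petal length, petal width
petal_length = iris['data'][:, 2]
target = iris['target']  # 0 = setosa, 1 = versicolor, 2 = virginica

for code, name in enumerate(iris['target_names']):
    values = petal_length[target == code]
    print(f"{name:<12} min={values.min():.1f} max={values.max():.1f}")

# Setosa is separable on this axis if its largest value is below every other flower's
setosa_max = petal_length[target == 0].max()
others_min = petal_length[target != 0].min()
print("setosa separable on petal length:", setosa_max < others_min)

# The other two species overlap if the ranges intersect
overlap = petal_length[target == 1].max() > petal_length[target == 2].min()
print("versicolor and virginica overlap on petal length:", overlap)
```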

### Classification Task

We will use four different methods for classifying the iris flowers into the three species: a decision tree, a random forest, a support vector classifier, and logistic regression. In order to visualize the resulting models' decision boundaries, we will restrict the data set to only two dimensions: petal length and petal width.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from mlxtend.plotting import plot_decision_regions

random_seed = 0

We split the data into training and testing sets to measure how well the models generalize to new data. We will withhold 20% of the data for use in testing.

In [None]:
# Keep only the two petal measurements as features
X_train, X_test, y_train, y_test = train_test_split(
    sk_df[['petal_length', 'petal_width']].values,
    sk_df['species'].values,
    test_size=0.2,
    random_state=random_seed
)
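
To confirm what the split produces, the sketch below repeats it on the scikit-learn arrays and prints the resulting sizes and class counts. The `stratify=y` variant is a hypothetical alternative, not something the original notebook uses; it keeps the class proportions the same in both splits.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris['data'][:, 2:4]  # petal length and petal width
y = iris['target']

# Same settings as above: 20% held out, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)
print("test-set class counts:", np.bincount(y_te, minlength=3))

# Hypothetical variation: stratify on the labels to balance the classes
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print("stratified test-set class counts:", np.bincount(y_te_s, minlength=3))
```

With only 30 test samples, an unstratified split can leave one species under-represented, which makes per-class results noisier.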

Instantiate the four classifiers and specify any desired parameters.

In [None]:
clf1 = DecisionTreeClassifier(random_state=random_seed)
clf2 = RandomForestClassifier(random_state=random_seed)
clf3 = SVC(random_state=random_seed, kernel='linear')
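# multi_class is omitted: recent scikit-learn releases deprecate it and
# already fit a multinomial model for multiclass targets with this solver
clf4 = LogisticRegression(random_state=random_seed,
                          max_iter=1000,
                          solver='saga'
                         )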

In [None]:
gs = gridspec.GridSpec(2, 2)

fig = plt.figure(figsize=(10,8))

labels = ['Decision Tree', 'Random Forest', 'SVC', 'Logistic Regression']
for clf, lab, grd in zip([clf1, clf2, clf3, clf4],
                         labels,
                         itertools.product([0, 1], repeat=2)):

    clf.fit(X_train, y_train)
    ax = plt.subplot(gs[grd[0], grd[1]])
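    plot_decision_regions(X=X_test, y=y_test.astype(np.int64), clf=clf, legend=2)
    plt.title(lab)

    # Replace the numeric class labels in the legend with species names;
    # the default labels are discarded so the outer `labels` list is not shadowed
    handles, _ = ax.get_legend_handles_labels()
    ax.legend(handles,
              ['Iris setosa', 'Iris versicolor', 'Iris virginica'],
              framealpha=0.3, scatterpoints=1)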
    ax.set_xlabel("Petal Length")
    ax.set_ylabel("Petal Width")
plt.tight_layout()
plt.show()

This visualization shows how each classification method's internal mechanics shape its decision boundaries. The tree-based methods split on one feature at a time, so their decision boundaries are always parallel to the feature axes, while the support vector classifier (with a linear kernel) and logistic regression produce linear boundaries.
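
The plots show where each model draws its boundaries but not how well it scores on the held-out data. A minimal sketch of scoring the four models on the 20% test split follows. It rebuilds the split from the scikit-learn arrays so that it runs on its own, and the `classifiers` and `test_accuracy` names are introduced here for illustration.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

random_seed = 0
iris = datasets.load_iris()
X = iris['data'][:, 2:4]  # petal length and petal width
y = iris['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=random_seed
)

classifiers = {
    'Decision Tree': DecisionTreeClassifier(random_state=random_seed),
    'Random Forest': RandomForestClassifier(random_state=random_seed),
    'SVC': SVC(random_state=random_seed, kernel='linear'),
    'Logistic Regression': LogisticRegression(
        random_state=random_seed, max_iter=1000, solver='saga'
    ),
}

# Fit on the training split and report mean accuracy on the test split
test_accuracy = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    test_accuracy[name] = clf.score(X_test, y_test)
    print(f"{name:<20} {test_accuracy[name]:.3f}")
```

Because the test set holds only 30 flowers, each misclassified sample shifts the accuracy by about 3.3 percentage points, so small differences between the models should be read with caution.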

### Conclusion

This short example notebook has demonstrated that a Jupyter Notebook on Galileo works just like a notebook on a local machine, with support for loading data from a local drive, an online source, or a dependency.