# Banknotes dataset using pandas

This brief Jupyter notebook uses tools _other than_ the `datascience` library to visualize the banknotes dataset from Section 17.4 in the textbook.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import graphviz
from mpl_toolkits.mplot3d import Axes3D
from sklearn import tree
path_data = 'http://personal.psu.edu/drh20/200DS/assets/data/'

The code below uses the `read_csv` function in the `pandas` library to read the `banknote.csv` dataset (the same one from Section 17.4) and then display the object.

In [None]:
banknotes = pd.read_csv(path_data + 'banknote.csv')
banknotes

The `banknotes` object created by `read_csv` is of the `DataFrame` type:

In [None]:
type(banknotes)

The `pandas` method `groupby` operates on `DataFrame` objects and splits the rows into groups according to the values in a chosen column. In our dataset, `Class` is the obvious grouping variable, so for instance we can use `groupby` to find the mean of each variable when the rows are grouped by `Class`:

In [None]:
banknotes.groupby('Class').mean()
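
To see what `groupby` followed by `mean` returns without downloading anything, here is a minimal sketch on a tiny made-up table. The name `toy` and all of its values are hypothetical and are not taken from `banknote.csv`; only the column names mirror the real dataset.

```python
import pandas as pd

# Hypothetical toy data: these values are invented for illustration
toy = pd.DataFrame({
    'WaveletVar': [1.0, 3.0, -2.0, -4.0],
    'WaveletSkew': [8.0, 6.0, -1.0, -3.0],
    'Class': [0, 0, 1, 1],
})

# One row per distinct Class value, one column per remaining variable
toy_means = toy.groupby('Class').mean()
print(toy_means)

# size() counts how many rows fall in each group
toy_counts = toy.groupby('Class').size()
print(toy_counts)
```

The result of `mean()` is itself a `DataFrame`, indexed by the values of the grouping column.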

The code that follows creates a 3-dimensional scatterplot very much like the one seen in Subsection 17.4.2.

In [None]:
fig = plt.figure(figsize=(8, 6))
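# add_subplot attaches the 3D axes to the figure; in recent matplotlib
# releases a bare Axes3D(fig) is not added automatically and the plot is empty
ax = fig.add_subplot(projection='3d')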

for grp_name, grp_idx in banknotes.groupby('Class').groups.items():
    x = banknotes.loc[grp_idx,'WaveletVar']
    y = banknotes.loc[grp_idx,'WaveletSkew']
    z = banknotes.loc[grp_idx,'WaveletCurt']
    ax.scatter(x,y,z, label=grp_name)

ax.legend(labels=['Genuine', 'Counterfeit'])
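
The loop above iterates over `groupby('Class').groups`, which behaves like a dictionary mapping each group name to the row labels belonging to that group. The sketch below shows this on a small made-up table; `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from banknote.csv
toy = pd.DataFrame({
    'WaveletVar': [1.0, -2.0, 3.0, -4.0],
    'Class': [0, 1, 0, 1],
})

# Keys are the distinct Class values; values are the matching row labels
groups = toy.groupby('Class').groups

for grp_name, grp_idx in groups.items():
    # .loc selects the rows with those labels, as in the plotting loop
    values = toy.loc[grp_idx, 'WaveletVar'].tolist()
    print(grp_name, list(grp_idx), values)
```

Because each pass through the loop sees only one group's rows, each call to `ax.scatter` draws one class in its own color.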

Here is a simple way to produce an interactive 3D plot, using the `plotly` library, that can be rotated with the mouse:

In [None]:
import plotly.express as px
fig = px.scatter_3d(banknotes, x='WaveletVar', y='WaveletSkew', z='WaveletCurt',
              color='Class')
fig.show()

# Classification and Regression Tree (CART)

CART is a classification method that we won't describe in detail in this course, but a couple of examples will suffice to show how easily it can be used via freely available software. In this case, we'll use the `sklearn` library and its `tree` module.

In [None]:
banknotes

In [11]:
predictors = banknotes.iloc[:,0:4] # These four columns will be used to predict whether counterfeit or not
response = banknotes.iloc[:,4] # response=0 for real, response=1 for counterfeit

In [12]:
# Use a classification tree as implemented in the sklearn library
CART = tree.DecisionTreeClassifier()
CART = CART.fit(predictors, response)
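
Once a tree has been fit, it can classify new rows with `predict`, and `score` reports the fraction of rows classified correctly. The following is a minimal sketch on made-up data; the names `toy_predictors`, `toy_response`, `toy_CART`, and `new_notes`, along with every number, are hypothetical. Fixing `random_state` is an assumption made here only so the sketch is reproducible.

```python
import pandas as pd
from sklearn import tree

# Hypothetical toy data: three "real" (0) and three "counterfeit" (1) notes
toy_predictors = pd.DataFrame({
    'WaveletVar': [3.0, 2.5, 4.0, -3.0, -2.0, -4.5],
    'WaveletSkew': [8.0, 6.5, 7.0, -1.0, -2.5, 0.5],
})
toy_response = pd.Series([0, 0, 0, 1, 1, 1])

toy_CART = tree.DecisionTreeClassifier(random_state=0)
toy_CART = toy_CART.fit(toy_predictors, toy_response)

# Classify two new, equally hypothetical, banknotes
new_notes = pd.DataFrame({'WaveletVar': [3.5, -3.5],
                          'WaveletSkew': [7.5, -1.5]})
toy_predictions = toy_CART.predict(new_notes)
print(toy_predictions)

# Fraction of the training rows that the tree classifies correctly
toy_accuracy = toy_CART.score(toy_predictors, toy_response)
print(toy_accuracy)
```

Note that accuracy measured on the same rows used for fitting tends to be optimistic; it says little about how the tree would perform on banknotes it has not seen.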

In [None]:
pred_names = predictors.columns.tolist()  # column names as a plain Python list
print(tree.export_text(CART, feature_names=pred_names))
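
The printed tree can be long, so it may help to summarize its size first. The sketch below, again on hypothetical toy data, uses the `get_depth` and `get_n_leaves` methods of a fitted `DecisionTreeClassifier` alongside `export_text`.

```python
import pandas as pd
from sklearn import tree

# Hypothetical toy data in which the two classes are easy to separate
toy_predictors = pd.DataFrame({
    'WaveletVar': [3.0, 2.5, 4.0, -3.0, -2.0, -4.5],
    'WaveletSkew': [8.0, 6.5, 7.0, -1.0, -2.5, 0.5],
})
toy_response = pd.Series([0, 0, 0, 1, 1, 1])

toy_CART = tree.DecisionTreeClassifier(random_state=0)
toy_CART = toy_CART.fit(toy_predictors, toy_response)

# Number of levels of splits and number of terminal nodes
toy_depth = toy_CART.get_depth()
toy_leaves = toy_CART.get_n_leaves()
print(toy_depth, toy_leaves)

# Text rendering: each indented line is a split or a predicted class
toy_text = tree.export_text(toy_CART,
                            feature_names=toy_predictors.columns.tolist())
print(toy_text)
```

A single split separates these toy classes, so the tree has one level and two leaves; the tree fit to the full banknotes data will generally be larger.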

In [None]:
dot_data = tree.export_graphviz(CART, out_file = None, 
                      feature_names = pred_names,
                      filled=True, rounded=True,  
                      special_characters=True)  
graphviz.Source(dot_data)

We can try something similar with the `wine` dataset from Section 17.4.

In [None]:
wines = pd.read_csv(path_data + 'wine.csv')
wines

For predictors, consider only two of the columns, `Alcohol` and `Flavanoids`, which are selected by position below (we could choose a different set of columns if we wanted).

In [17]:
wine_pred = wines.iloc[:,[1,7]]
wine_resp = wines.iloc[:,0]
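
Selecting columns by position with `iloc` only works if we know where each column sits. As a sketch of the alternative, the code below compares positional selection with selection by name on a small made-up table; `toy_wines` and its values are hypothetical, and its column positions differ from those in `wine.csv`.

```python
import pandas as pd

# Hypothetical toy table with fewer columns than the real wine data
toy_wines = pd.DataFrame({
    'Class': [1, 2, 3],
    'Alcohol': [14.2, 12.4, 13.1],
    'Ash': [2.4, 2.1, 2.6],
    'Flavanoids': [3.1, 2.0, 0.8],
})

# By position: all rows, columns 1 and 3 (counting from 0)
by_position = toy_wines.iloc[:, [1, 3]]

# By name: the same two columns, without needing to know their positions
by_name = toy_wines[['Alcohol', 'Flavanoids']]

same = by_position.equals(by_name)
print(same)
```

Selecting by name is usually easier to read and keeps working if the column order of the file changes.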

In [19]:
wine_CART = tree.DecisionTreeClassifier()
wine_CART = wine_CART.fit(wine_pred, wine_resp)

In [None]:
wine_names = wine_pred.columns.tolist()  # column names as a plain Python list
print(tree.export_text(wine_CART, feature_names=wine_names))

In [None]:
dot_data = tree.export_graphviz(wine_CART, out_file=None, 
                      feature_names= wine_names,
                      filled=True, rounded=True,  
                      special_characters=True)  
graphviz.Source(dot_data)  