# Demo 12 - Predictive Models: Classification

In this notebook we'll use the famous [Iris Dataset](https://archive.ics.uci.edu/ml/datasets/iris) to check out some real decision trees!  

<img src="https://raw.githubusercontent.com/nmattei/cmps6790/main/_demos/data/iris.png">

This data set has:
1. 150 instances with 4 attributes (same units, all numeric)
2. Balanced class distribution
3. No missing data

In [None]:
# clone the course repository, change to right directory, and import libraries.
# COLAB only!
%cd /content
!git clone https://github.com/nmattei/cmps3160.git
%cd /content/cmps3160/_demos

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
plt.style.use('fivethirtyeight')
# Make the fonts a little bigger in our graphs.
font = {'size'   : 20}
plt.rc('font', **font)
plt.rcParams['mathtext.fontset'] = 'cm'
plt.rcParams['pdf.fonttype'] = 42

In [None]:
# Import the data and check it out...
df_iris = pd.read_csv("./data/iris.csv")
df_iris.head()

In [None]:
df_iris.describe()

In [None]:
df_iris.groupby("species").size()

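Make a train/test split. We use a *stratified sample* here so that each species appears in the same proportion in the training and test sets; an unstratified split of a data set this small could leave one class under-represented and skew the classifier. [More info in the docs!](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)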
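As a minimal sketch of what stratification buys us, the cell below splits scikit-learn's bundled copy of Iris (an assumption, so the example runs without `./data/iris.csv`) and counts the classes on each side. The names `X_tr`, `y_tr`, `train_counts` and the fixed `random_state=0` are illustrative only.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data (50 flowers per class).
X, y = load_iris(return_X_y=True)

# Stratifying on y keeps the 1:1:1 class ratio on both sides of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())
print("train:", sorted(train_counts.items()))
print("test: ", sorted(test_counts.items()))
```

With 150 rows and `test_size=0.4`, each class should contribute 30 training rows and 20 test rows.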


In [None]:
# Split into train and test sets, stratified by species.
import sklearn
from sklearn.model_selection import train_test_split

train, test = train_test_split(df_iris,
                               test_size=0.4,
                               stratify=df_iris["species"])

In [None]:
# Check that each species is equally represented in the training set...
train.groupby("species").size()

In [None]:
test.groupby("species").size()

In [None]:
# Just for fun..
import seaborn as sns
sns.pairplot(train, hue="species", height=2, palette='colorblind')

In [None]:
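# Correlate only the numeric columns; recent pandas raises an error on the string "species" column.
corrmat = train.corr(numeric_only=True)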
sns.heatmap(corrmat, annot = True, square = True);

## Decision Tree

Now let's build a decision tree!

In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn import metrics
features = ['sepal_length','sepal_width','petal_length','petal_width']
X_train = train[features]
y_train = train.species
X_test = test[features]
y_test = test.species

In [None]:
mod_dt = DecisionTreeClassifier(max_depth = 3, random_state = 1)
mod_dt.fit(X_train,y_train)
prediction=mod_dt.predict(X_test)

In [None]:
# Check some measures... (scikit-learn metrics expect y_true first, then y_pred)
print(f"The accuracy of the Decision Tree is {metrics.accuracy_score(y_test, prediction):.3f}")
print(f"The Precision of the Decision Tree is {metrics.precision_score(y_test, prediction, average='weighted'):.3f}")
print(f"The Recall of the Decision Tree is {metrics.recall_score(y_test, prediction, average='weighted'):.3f}")

In [None]:
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(mod_dt, X_test, y_test,
                                        display_labels=mod_dt.classes_,
                                        cmap=plt.cm.Blues, normalize='all')

In [None]:
# Cooler... how much each feature contributed to the splits (the importances sum to 1).
mod_dt.feature_importances_


In [None]:
plt.figure(figsize = (10,8))
plot_tree(mod_dt, feature_names = features, class_names = mod_dt.classes_, filled = True);

The tree above uses only `petal_width` and `petal_length`, so we can plot the decision boundary in two dimensions.

<img src="https://github.com/nmattei/cmps3160/blob/master/_demos/data/boundry.png?raw=1">
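To read the split thresholds without a plot, one option is `sklearn.tree.export_text`. The sketch below refits a depth-3 tree on scikit-learn's bundled Iris data (an assumption; the notebook fits on its own training split, so the exact thresholds and the set of features used may differ). The names `tree`, `rules` and `used` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Same hyperparameters as the notebook's tree, but fit on the full bundled data.
tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(iris.data, iris.target)

# A text rendering of the tree: one indented line per split or leaf.
rules = export_text(tree, feature_names=feature_names)
print(rules)

# Features with non-zero importance are the ones the tree actually split on.
used = [name for name, imp in zip(feature_names, tree.feature_importances_) if imp > 0]
print("features used:", used)
```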

## Logistic Regression

Let's compare with Logistic Regression.

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train,y_train)
lr_prediction=lr.predict(X_test)
# scikit-learn metrics expect y_true first, then y_pred.
print(f"The accuracy of Logistic Regression is {metrics.accuracy_score(y_test, lr_prediction):.3f}")
print(f"The Precision of Logistic Regression is {metrics.precision_score(y_test, lr_prediction, average='weighted'):.3f}")
print(f"The Recall of Logistic Regression is {metrics.recall_score(y_test, lr_prediction, average='weighted'):.3f}")

### Logistic Regression coefficients

We can inspect the `coef_` attribute of the fitted LogisticRegression classifier to find the $\beta$ coefficients for each class. This is a matrix where cell $(i,j)$ holds the $\beta$ parameter for class $i$ and feature $j$.
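A small sketch of how that matrix is used, again on scikit-learn's bundled Iris data (an assumption; `clf`, `scores` and `pred` are illustrative names). Each class gets a linear score from its row of coefficients plus its intercept, and the predicted class is the one with the largest score.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# One row per class, one column per feature.
print(clf.coef_.shape)       # (3, 4)
print(clf.intercept_.shape)  # (3,)

# Class scores are a linear function of the features.
scores = X @ clf.coef_.T + clf.intercept_
print(np.allclose(scores, clf.decision_function(X)))

# The predicted class is the one with the largest score.
pred = clf.classes_[scores.argmax(axis=1)]
print((pred == clf.predict(X)).all())
```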

In [None]:
lr.coef_

In [None]:
# let's put the coefficients into a nice data frame.
pd.DataFrame(lr.coef_, columns=features, index=lr.classes_)

We can inspect the coefficients for each class for some insights into what the predictive features are. For example, `petal_length` appears to be strongly positively associated with the `virginica` class, which matches what we saw above in the pairplot.



### Decision Boundary

To visualize the decision boundary, we'll fit a new Logistic Regression classifier using two dimensions.

We'll then make a contour plot showing how the predictions change as the two features vary.

Note that in the latest version of sklearn, there is a [class](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.DecisionBoundaryDisplay.html#sklearn.inspection.DecisionBoundaryDisplay) that makes this plotting easier, but it is not available on Colab.

In [None]:
# fit a classifier using only two features
features_s = ['petal_length','petal_width']
X_train_s = train[features_s]
X_test_s = test[features_s]
lr.fit(X_train_s,y_train)
lr_prediction_s=lr.predict(X_test_s)

In [None]:
# generate a grid of points for many possible values of petal length and width.
xx, yy = np.mgrid[0:7:.01, 0:3:.01]
grid = np.c_[xx.ravel(), yy.ravel()]
grid_preds = lr.predict(grid)
label2int = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
labelints = np.array([label2int[s] for s in grid_preds])
labelints = labelints.reshape(xx.shape)

In [None]:
# plot the predicted class for each point.
f, ax = plt.subplots(figsize=(8, 6))
contour = ax.contourf(xx, yy, labelints, 25, cmap="RdBu")
sns.scatterplot(data=X_test, ax=ax, x='petal_length', y='petal_width', hue=lr_prediction_s)
plt.legend(loc='upper left')

## Text classification

In this section we walk through a light example of preparing a text dataset for classification.

The data comes from [this website](https://www.cs.cornell.edu/people/pabo/movie-review-data/) at Cornell and is from Bo Pang and Lillian Lee, A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, Proceedings of ACL 2004.

This contains 1000 positive and 1000 negative movie reviews. Our job is to classify a review as positive or negative based on the text.

In [None]:
# need to unzip the data first.
!unzip ./data/review_polarity.zip -d ./data/

In [None]:
!ls data/review_polarity/pos

In [None]:
!cat data/review_polarity/pos/cv193_5416.txt

In [None]:
import glob

# labels are based on which directory the files are in.
all_pos = list(glob.glob("./data/review_polarity/pos/*"))
all_neg = list(glob.glob("./data/review_polarity/neg/*"))
labels = np.array([1] * len(all_pos) + [0] * len(all_neg))
filenames = all_pos + all_neg

We'll use TfidfVectorizer to convert each document into a (sparse) *feature* vector.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

vec = TfidfVectorizer(input='filename', stop_words='english')
X = vec.fit_transform(filenames)
X.shape

So, we have 2000 documents and 39,659 unique words.

How big is this matrix?
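A quick back-of-the-envelope calculation, assuming every entry were stored as an 8-byte float (`float64`, which is `TfidfVectorizer`'s default dtype):

```python
n_docs, n_terms = 2000, 39_659   # the shape stated in the text
bytes_per_value = 8              # assumption: float64 entries

# A dense array stores every cell, zero or not.
dense_bytes = n_docs * n_terms * bytes_per_value
print(f"{dense_bytes:,} bytes = {dense_bytes / 1e6:.0f} MB")
# 634,544,000 bytes = 635 MB
```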

Wait, how do we store that?

dense matrix:
$$
X=
  \begin{bmatrix}
    0.1 & 2.8 & 3.2 & ... & 1.5 \\
    3.2 & 4.1 & 5.1 & ... & 2.7  \\
    ...\\
    1.4 & 3.4 & 7.5 & ... & 7.5  \\
  \end{bmatrix}
$$

sparse matrix:
$$
X=
  \begin{bmatrix}
    0.1 & 0 & 0 & ... & 1.5 \\
    0 & 0 & 0 & ... & 2.7  \\
    ...\\
    0 & 3.4 & 0 & ... & 0  \\
  \end{bmatrix}
$$

How can we store a sparse matrix more efficiently?

<br><br><br>
[CSR matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html)
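As a sketch of the idea, the cell below converts a small hypothetical matrix (values borrowed from the illustration above) to CSR and prints the three arrays it stores. Only the non-zero values are kept, along with their column positions and a pointer to where each row starts.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3x4 matrix with mostly zero entries.
dense = np.array([
    [0.1, 0.0, 0.0, 1.5],
    [0.0, 0.0, 0.0, 2.7],
    [0.0, 3.4, 0.0, 0.0],
])
sparse = csr_matrix(dense)

print(sparse.data)     # [0.1 1.5 2.7 3.4]  non-zero values, row by row
print(sparse.indices)  # [0 3 3 1]          column of each stored value
print(sparse.indptr)   # [0 2 3 4]          row i is data[indptr[i]:indptr[i+1]]
```

These are the same `.indices` and `.data` attributes inspected on `X[0]` in the cells that follow.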

In [None]:
X[0]

In [None]:
filenames[0]

In [None]:
!cat ./data/review_polarity/pos/cv839_21467.txt

In [None]:
X[0].indices

In [None]:
feature_names = np.array(vec.get_feature_names_out())
feature_names[X[0].indices]

In [None]:
X[0].data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.4,
                                                    shuffle=True, random_state=42)

In [None]:
textlr = LogisticRegression()
textlr.fit(X_train, y_train)
y_predicted = textlr.predict(X_test)
# scikit-learn metrics expect y_true first, then y_pred.
print(f"accuracy = {metrics.accuracy_score(y_test, y_predicted):.3f}")
print(f"precision = {metrics.precision_score(y_test, y_predicted):.3f}")
print(f"recall = {metrics.recall_score(y_test, y_predicted):.3f}")

In [None]:
ConfusionMatrixDisplay.from_estimator(textlr, X_test, y_test,
                                        display_labels=textlr.classes_,
                                        cmap=plt.cm.Blues, normalize='all')

In [None]:
pos_coef = pd.DataFrame(textlr.coef_[0],  index=feature_names).rename(columns={0: 'coef'})
pos_coef.sort_values('coef', ascending=False).head(20)

In [None]:
pos_coef.sort_values('coef', ascending=True).head(20)

## Titanic

Let's fit a Decision Tree classifier on the Titanic data as well.

In [None]:
df_titanic = pd.read_csv("./data/titanic.csv")
df_titanic = pd.get_dummies(df_titanic, columns=['sex'])
# Be cheeky with our NaNs: just drop rows that are missing age or fare.
df_titanic = df_titanic[(df_titanic["age"].notna()) & (df_titanic["fare"].notna())]
df_titanic.head()

In [None]:
train, test = train_test_split(df_titanic,
                               test_size=0.4,
                               stratify=df_titanic["survived"])

In [None]:
features = ["pclass", "fare", "sex_female", "age"]
X_train = train[features]
y_train = train.survived
X_test = test[features]
y_test = test.survived

In [None]:
mod_dt = DecisionTreeClassifier(max_depth = 3, random_state = 1)
mod_dt.fit(X_train,y_train)
prediction=mod_dt.predict(X_test)
# Check some measures... (scikit-learn metrics expect y_true first, then y_pred)
print(f"The accuracy of the Decision Tree is {metrics.accuracy_score(y_test, prediction):.3f}")
print(f"The Precision of the Decision Tree is {metrics.precision_score(y_test, prediction, average='weighted'):.3f}")
print(f"The Recall of the Decision Tree is {metrics.recall_score(y_test, prediction, average='weighted'):.3f}")

In [None]:
# Plot the confusion matrix, normalized over all test examples...
ConfusionMatrixDisplay.from_estimator(mod_dt, X_test, y_test,
                                        display_labels=mod_dt.classes_,
                                        cmap=plt.cm.Blues, normalize='all')

In [None]:
# Plot the precision-recall curve for the positive (survived) class...
from sklearn.metrics import PrecisionRecallDisplay
PrecisionRecallDisplay.from_estimator(mod_dt, X_test, y_test)

In [None]:
plt.figure(figsize = (15,8))
# class_names must be a list ordered like mod_dt.classes_ (0 = died, 1 = survived), not a dict.
plot_tree(mod_dt, feature_names = features, class_names = ["died", "survived"], filled = True);