## What is Classification? (with mtcars)


## Setup and Imports

In [1]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore")  # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning)  # Alternative: ignore only R runtime warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

In [2]:
!pip install plotnine

In [5]:
!pip install statsmodels

In [6]:

#from plotnine import *
import statsmodels.api as sm


In [None]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

In [None]:
%%R

# My commonly used R imports
# library() stops with an error if a package is missing; require() only warns

library('tidyverse')
library('DescTools')
library('ggrepel')

## Load the data

Read the `mtcars` dataset that ships with plotnine.

In [None]:
from plotnine.data import mtcars
mtcars.head()

## Logistic Regression 

What is the probability that a car is manual rather than automatic, given its weight and horsepower?

In other words: `am ~ wt + hp`
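
Written out, that formula corresponds to the standard logistic regression model. The symbols $p$ and $\beta_0, \beta_1, \beta_2$ are notation introduced here for illustration, not names used in the notebook; $p$ stands for the probability that `am` equals 1 (manual).

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1\,\mathrm{wt} + \beta_2\,\mathrm{hp}
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\mathrm{wt} + \beta_2\,\mathrm{hp})}}
$$

Exponentiating a coefficient gives the multiplicative change in the odds $p/(1-p)$ for a one-unit increase in that predictor, holding the other fixed.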

In [None]:
%%R 

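# Fit a logistic regression of transmission type (am) on weight and horsepower
logistic <- glm(am ~ wt + hp, data = mtcars, family = binomial(link = 'logit'))
print(summary(logistic))
# Exponentiated coefficients are odds ratios
print(exp(coef(logistic)))
# McFadden's pseudo R^2, from the DescTools package
print(PseudoR2(logistic, which = 'McFadden'))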

Oh look, a McFadden pseudo R^2 of 0.76. That's pretty good.
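
For reference, McFadden's pseudo R^2 compares the log-likelihood of the fitted model with that of an intercept-only model. Here is a minimal sketch in plain Python; the function name `mcfadden_r2` and the log-likelihood values are made up for illustration and are not taken from the fit above.

```python
def mcfadden_r2(loglik_model: float, loglik_null: float) -> float:
    """McFadden's pseudo R^2: 1 - logL(model) / logL(intercept-only model)."""
    return 1.0 - loglik_model / loglik_null

# Hypothetical log-likelihoods, chosen only to show the arithmetic
example_r2 = mcfadden_r2(loglik_model=-5.0, loglik_null=-20.0)
print(example_r2)
```

Values near 0 mean the predictors add little over the intercept-only model; values closer to 1 mean the fitted model's likelihood is much higher.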

But now, what if our goal were prediction, not inference? Suppose I don't care much about how these variables are related to one another. I just want to build a machine that categorizes cars as automatic or manual.

Visually, this is what we're doing:

In [None]:
%%R -w 750 -i mtcars

mtcars$am <- factor(mtcars$am, labels = c('Automatic', 'Manual'))

ggplot(mtcars) +
    aes(x=wt, y=hp, color=am, shape=am, label=name) +
    geom_point(size=4) + 
    geom_text_repel() +
    theme_bw() + 
    labs(
        title="Automatic vs Manual cars in mtcars",
        y = "Horsepower (hp)", x= "Weight (wt)")

What is a classification task? The idea is to train an algorithm that will draw a boundary between the two categories and accurately categorize any new data that comes in.

![](flashcards/Classification_web.png)
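
As a rough sketch of that idea, a straight-line boundary in the weight/horsepower plane can be written as a score whose sign decides the category. The function name `side_of_boundary` and its coefficients are invented for illustration; they are not fitted to mtcars.

```python
def side_of_boundary(wt, hp, intercept=10.0, wt_coef=-4.0, hp_coef=0.02):
    """Return the category for a car based on which side of a line it falls.

    The default coefficients are hypothetical, not estimates from any model.
    """
    score = intercept + wt_coef * wt + hp_coef * hp
    # score == 0 defines the boundary line itself
    return "Manual" if score > 0 else "Automatic"

# Two made-up cars with the same horsepower but different weights
print(side_of_boundary(wt=2.0, hp=100))
print(side_of_boundary(wt=4.0, hp=100))
```

Training a classifier amounts to choosing those coefficients from data instead of by hand.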

## The logistic regression is a classifier!!!

or...well...it can become one

In [None]:
%%R 

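df <- mtcars %>% mutate(
    # predict() returns log-odds by default; exponentiate to get odds
    prediction_odds = exp(predict(logistic)),
    # convert odds to the probability of being manual
    prediction_pct = prediction_odds / (1 + prediction_odds),
    # label a car as manual when that probability exceeds 0.5
    prediction = ifelse(prediction_pct > 0.5, 'Manual', 'Automatic')
)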

df %>% head()
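
The same three steps (log-odds to odds, odds to probability, probability to label) can be sketched in plain Python. The function name `classify_from_log_odds` and the input values are hypothetical; only the 0.5 threshold and the two labels come from the cell above.

```python
import math

def classify_from_log_odds(log_odds: float, threshold: float = 0.5):
    """Turn a log-odds score into a (probability, label) pair."""
    odds = math.exp(log_odds)            # log-odds -> odds
    probability = odds / (1.0 + odds)    # odds -> probability of 'Manual'
    label = "Manual" if probability > threshold else "Automatic"
    return probability, label

# Hypothetical log-odds values, not outputs of the fitted model
for score in (-2.0, 0.0, 2.0):
    print(score, classify_from_log_odds(score))
```

With a 0.5 threshold, the label is 'Manual' exactly when the log-odds are positive.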

In [None]:
%%R -o df

df %>% select(wt, hp, prediction, am) %>% head()

## How well did our classifier do?

In [None]:
pd.crosstab(df.prediction, df.am)
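
The crosstab is a confusion matrix: cells where the prediction and the actual label agree count correct predictions, and the remaining cells count mistakes. Here is a sketch of how accuracy follows from those counts, using only the standard library. The helper names and the label lists are invented for illustration and are not the mtcars results.

```python
from collections import Counter

def confusion_counts(predicted, actual):
    """Count (predicted, actual) label pairs, like the cells of a crosstab."""
    return Counter(zip(predicted, actual))

def accuracy(predicted, actual):
    """Share of rows where the predicted label matches the actual label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels, chosen only to demonstrate the calculation
predicted = ["Manual", "Manual", "Automatic", "Automatic", "Manual"]
actual = ["Manual", "Automatic", "Automatic", "Automatic", "Manual"]

print(confusion_counts(predicted, actual))
print(accuracy(predicted, actual))
```

Because the classifier here is scored on the same rows it was fitted on, accuracy computed this way tends to be optimistic for new data.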