<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Logistic Regression Practice
**Possums**

<img src="./images/pos2.jpg" style="height: 250px">

*The common brushtail possum (Trichosurus vulpecula, from the Greek for "furry tailed" and the Latin for "little fox", previously in the genus Phalangista) is a nocturnal, semi-arboreal marsupial of the family Phalangeridae, native to Australia, and the second-largest of the possums.* -[Wikipedia](https://en.wikipedia.org/wiki/Common_brushtail_possum)

In [1]:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
# Import train_test_split.
from sklearn.model_selection import train_test_split
# Import logistic regression
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

### Get the data

Read in the `possum.csv` data (located in the `data` folder).

In [2]:
possum = pd.read_csv('./data/possum.csv')

In [3]:
possum.head()

Unnamed: 0,site,pop,sex,age,head_l,skull_w,total_l,tail_l
0,1,Vic,m,8.0,94.1,60.4,89.0,36.0
1,1,Vic,f,6.0,92.5,57.6,91.5,36.5
2,1,Vic,f,6.0,94.0,60.0,95.5,39.0
3,1,Vic,f,6.0,93.2,57.1,92.0,38.0
4,1,Vic,f,2.0,91.5,56.3,85.5,36.0


In [4]:
possum.shape

(104, 8)

In [5]:
possum.value_counts()

site  pop    sex  age  head_l  skull_w  total_l  tail_l
1     Vic    f    1.0  93.1    54.8     90.5     35.5      1
5     other  m    1.0  82.5    52.3     82.0     36.5      1
6     other  f    4.0  88.7    52.0     83.0     38.0      1
                       86.0    54.0     82.0     36.5      1
                  3.0  90.0    53.8     81.5     36.0      1
                                                          ..
1     Vic    m    7.0  96.0    59.0     90.0     36.0      1
                  5.0  95.1    59.9     89.5     36.0      1
                       92.9    57.6     85.5     34.0      1
                  4.0  93.8    56.8     87.0     34.5      1
7     other  m    7.0  91.8    57.6     84.0     35.5      1
Length: 102, dtype: int64

In [6]:
possum.isnull().sum()

site       0
pop        0
sex        0
age        2
head_l     0
skull_w    0
total_l    0
tail_l     0
dtype: int64

In [8]:
possum[possum['age'].isnull()]

Unnamed: 0,site,pop,sex,age,head_l,skull_w,total_l,tail_l
43,2,Vic,m,,85.1,51.5,76.0,35.5
45,2,Vic,m,,91.4,54.4,84.0,35.0


### Preprocessing

> Check for & deal with any missing values.  
Convert categorical columns to numeric.  
Do any other preprocessing you feel is necessary.

In [9]:
# Drop the two rows with a missing `age` (the only column with nulls)
possum.dropna(inplace=True)

In [10]:
possum.shape

(102, 8)
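The row count falls from 104 to 102, which matches the two missing `age` values. A minimal sketch of the same check on a small, hypothetical frame (the `toy` data is made up for illustration), restricting `dropna` to the column known to contain nulls:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the possum data
toy = pd.DataFrame({
    "sex": ["m", "f", "m", "f"],
    "age": [8.0, np.nan, 6.0, np.nan],
    "head_l": [94.1, 92.5, 94.0, 93.2],
})

n_missing = int(toy["age"].isnull().sum())

# Drop only rows where `age` is null, then confirm the row count adds up
cleaned = toy.dropna(subset=["age"]).reset_index(drop=True)
assert len(cleaned) == len(toy) - n_missing
print(cleaned.shape)
```

Naming the column in `subset` makes the intent explicit and avoids silently dropping rows if another column later gains missing values.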

In [12]:
possum['pop'].value_counts()

other    58
Vic      44
Name: pop, dtype: int64

In [13]:
possum['pop'] = possum['pop'].map({'other':0, 'Vic':1})

In [14]:
possum['sex'].value_counts()

m    59
f    43
Name: sex, dtype: int64

In [15]:
possum['sex'] = possum['sex'].map({'m':0, 'f':1})

In [16]:
possum.head()

Unnamed: 0,site,pop,sex,age,head_l,skull_w,total_l,tail_l
0,1,1,0,8.0,94.1,60.4,89.0,36.0
1,1,1,1,6.0,92.5,57.6,91.5,36.5
2,1,1,1,6.0,94.0,60.0,95.5,39.0
3,1,1,1,6.0,93.2,57.1,92.0,38.0
4,1,1,1,2.0,91.5,56.3,85.5,36.0
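One thing to watch with `Series.map` and a dictionary: any label missing from the dictionary becomes `NaN` rather than raising an error. A small sketch with a hypothetical series (the stray `"M"` label is invented for illustration) shows how to count unmapped values:

```python
import pandas as pd

# Hypothetical labels; "M" is a stray value the mapping does not cover
sex = pd.Series(["m", "f", "f", "M"])
mapped = sex.map({"m": 0, "f": 1})

# Labels absent from the dict come back as NaN
n_unmapped = int(mapped.isnull().sum())
print(n_unmapped)
```

Running a check like this right after each `map` call is a cheap way to confirm that the encoding covered every category.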


### Modeling

> Build a logistic regression model to predict `pop`, the region of origin.  
Examine the performance of the model.

In [17]:
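# Features: every column except the target (note that this keeps `site` as a predictor)
X = possum.drop(columns=['pop'])
# Target: 1 = Vic, 0 = other
y = possum['pop']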

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=50, test_size=0.2)
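The split above does not stratify, so the class balance in the 21-row test set is left to chance. With a dataset this small, passing `stratify` is one option worth considering. A sketch on synthetic data (`X_toy` and `y_toy` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 60 of class 0 and 40 of class 1
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy keeps the class proportions equal in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=50, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```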

In [19]:
# scikit-learn applies an L2 penalty (C=1.0) by default
logreg = LogisticRegression(solver='newton-cg')

In [20]:
logreg.fit(X_train, y_train)
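The model above is fit on unscaled features. Because `LogisticRegression` is regularized by default, feature scale can influence the fitted coefficients. One possible variation, sketched on synthetic data (`X_toy` and `y_toy` are invented for illustration and are not the possum data), standardizes features inside a pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: two features on very different scales
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=[90.0, 5.0], scale=[3.0, 2.0], size=(80, 2))
y_toy = (X_toy[:, 0] > 90.0).astype(int)

# Scaling happens inside the pipeline, so it is learned from the fit data only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name
coefs = pipe.named_steps["logisticregression"].coef_
print(coefs.shape)
```

With standardized inputs, each coefficient refers to a one-standard-deviation change in its feature, which makes magnitudes easier to compare across features.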

### Interpretation & Predictions

> Interpret at least one coefficient from your model.  
> Generate predicted probabilities for your testing set.  
> Generate predictions for your testing set.

In [21]:
logreg.score(X_train, y_train)

1.0

In [23]:
logreg.score(X_test, y_test)

0.9523809523809523

In [24]:
logreg.coef_

array([[-2.06404346,  0.07571615,  0.25407208, -0.10193859, -0.39256939,
         0.18536099, -0.73087215]])

In [25]:
X.columns

Index(['site', 'sex', 'age', 'head_l', 'skull_w', 'total_l', 'tail_l'], dtype='object')

In [27]:
for col, coef in zip(X.columns, logreg.coef_[0]):
    print("IV: {} has coeff: {}".format(col, coef))

IV: site has coeff: -2.0640434631605196
IV: sex has coeff: 0.07571614688506137
IV: age has coeff: 0.25407208241496926
IV: head_l has coeff: -0.10193859442859486
IV: skull_w has coeff: -0.3925693921633521
IV: total_l has coeff: 0.1853609917454654
IV: tail_l has coeff: -0.7308721540308134
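A training accuracy of 1.0 and a `site` coefficient much larger in magnitude than the others are worth a second look. If each trapping site belongs to exactly one region, `site` would reveal `pop` almost directly. Whether that holds for this data is an assumption to verify, not something the outputs above prove. A sketch of the check on a hypothetical frame (the `toy` values are invented):

```python
import pandas as pd

# Hypothetical rows: each site appears in only one pop class
toy = pd.DataFrame({
    "site": [1, 1, 2, 3, 5, 7],
    "pop":  [1, 1, 1, 0, 0, 0],
})

# Rows = site, columns = pop, cells = row counts
table = pd.crosstab(toy["site"], toy["pop"])

# A site is "pure" when all of its rows fall in a single pop class
all_pure = bool((table > 0).sum(axis=1).eq(1).all())
print(table)
print(all_pure)
```

On the real data, the equivalent call would be `pd.crosstab(possum['site'], possum['pop'])`. If every site turns out to be pure, dropping `site` from `X` would give a fairer picture of how well the body measurements predict region.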


In [28]:
# Exponentiate the `age` coefficient to express it as an odds ratio
np.exp(0.25407)

1.2892620494526477
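One way to read this value: holding the other features fixed, each additional year of `age` multiplies the model's odds of `pop = 1` (Vic) by about 1.29, an increase of roughly 29%. The sketch below applies the same transformation to every coefficient, using values copied from the output above and rounded to four decimals:

```python
import numpy as np

# Coefficients copied from the fitted model's output, rounded to four decimals
coefs = {
    "site": -2.0640, "sex": 0.0757, "age": 0.2541, "head_l": -0.1019,
    "skull_w": -0.3926, "total_l": 0.1854, "tail_l": -0.7309,
}

# exp(coefficient) = multiplicative change in the odds per one-unit increase
odds_ratios = {name: float(np.exp(beta)) for name, beta in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

Ratios above 1 raise the odds of `pop = 1` and ratios below 1 lower them. Since the features are unscaled and the model is regularized by default, the magnitudes are not directly comparable across features.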

In [29]:
logreg.predict(X_test)

array([0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1])

In [34]:
# Compare each predicted class (printed as "X_predict") with the true label
for y_pred, y_true in zip(logreg.predict(X_test), y_test):
    print("X_predict: {}, y_test:{}".format(y_pred, y_true))

X_predict: 0, y_test:0
X_predict: 1, y_test:1
X_predict: 1, y_test:1
X_predict: 0, y_test:0
X_predict: 0, y_test:0
X_predict: 1, y_test:1
X_predict: 0, y_test:0
X_predict: 1, y_test:1
X_predict: 0, y_test:0
X_predict: 0, y_test:0
X_predict: 0, y_test:0
X_predict: 1, y_test:1
X_predict: 0, y_test:0
X_predict: 0, y_test:0
X_predict: 0, y_test:1
X_predict: 0, y_test:0
X_predict: 0, y_test:0
X_predict: 0, y_test:0
X_predict: 1, y_test:1
X_predict: 1, y_test:1
X_predict: 1, y_test:1
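The row-by-row listing can be condensed into a confusion matrix. The sketch below uses the predicted and actual labels copied from the output above, so it runs without the fitted model:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Predicted and actual labels copied from the printed comparison
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_true = np.array([0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1])

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / cm.sum()
print(cm)
print(accuracy)
```

The single error is a false negative: one Vic possum predicted as `other`. That accounts for the test accuracy of 20/21.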


In [35]:
logreg.predict_proba(X_test)

array([[9.99108978e-01, 8.91021545e-04],
       [1.05481873e-01, 8.94518127e-01],
       [2.97631961e-01, 7.02368039e-01],
       [9.87807554e-01, 1.21924463e-02],
       [9.92451167e-01, 7.54883253e-03],
       [1.25942377e-02, 9.87405762e-01],
       [8.33998038e-01, 1.66001962e-01],
       [1.85078126e-01, 8.14921874e-01],
       [9.97427534e-01, 2.57246555e-03],
       [9.98832254e-01, 1.16774596e-03],
       [9.98250901e-01, 1.74909884e-03],
       [2.15992298e-01, 7.84007702e-01],
       [9.99934291e-01, 6.57089445e-05],
       [9.57231738e-01, 4.27682624e-02],
       [7.51716842e-01, 2.48283158e-01],
       [9.99777650e-01, 2.22349751e-04],
       [9.99982743e-01, 1.72574980e-05],
       [9.95779421e-01, 4.22057858e-03],
       [4.96303413e-03, 9.95036966e-01],
       [1.65957016e-02, 9.83404298e-01],
       [2.35383490e-03, 9.97646165e-01]])
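Each row holds `[P(pop = 0), P(pop = 1)]`, and for a binary model the predicted class is whichever column is larger, which amounts to a 0.5 cutoff on the second column. A sketch using the second-column probabilities copied from the output above (rounded to five decimals):

```python
import numpy as np

# P(pop = 1) for the 21 test rows, copied from the output and rounded
p_vic = np.array([
    0.00089, 0.89452, 0.70237, 0.01219, 0.00755, 0.98741, 0.16600,
    0.81492, 0.00257, 0.00117, 0.00175, 0.78401, 0.00007, 0.04277,
    0.24828, 0.00022, 0.00002, 0.00422, 0.99504, 0.98340, 0.99765,
])

# Predict class 1 when its probability reaches the cutoff
threshold = 0.5
manual_pred = (p_vic >= threshold).astype(int)
print(manual_pred.tolist())
```

The result should match the array returned by `logreg.predict(X_test)`. Moving `threshold` trades false negatives against false positives, though any cutoff chosen by looking at the test set would need to be validated on separate data.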

In [39]:
for actual, pred, prob in zip(y_test, logreg.predict(X_test), logreg.predict_proba(X_test)):
    print('Actual: {}, Predicted: {}, prob of 0 = {}, prob of 1 = {}'.format(actual, pred, prob[0], prob[1]))

Actual: 0, Predicted: 0, prob of 0 = 0.9991089784546459, prob of 1 = 0.0008910215453541655
Actual: 1, Predicted: 1, prob of 0 = 0.1054818732871513, prob of 1 = 0.8945181267128487
Actual: 1, Predicted: 1, prob of 0 = 0.2976319607539364, prob of 1 = 0.7023680392460636
Actual: 0, Predicted: 0, prob of 0 = 0.987807553684616, prob of 1 = 0.012192446315383992
Actual: 0, Predicted: 0, prob of 0 = 0.9924511674714329, prob of 1 = 0.007548832528567054
Actual: 1, Predicted: 1, prob of 0 = 0.012594237742338033, prob of 1 = 0.987405762257662
Actual: 0, Predicted: 0, prob of 0 = 0.8339980380282994, prob of 1 = 0.16600196197170058
Actual: 1, Predicted: 1, prob of 0 = 0.1850781263165071, prob of 1 = 0.8149218736834929
Actual: 0, Predicted: 0, prob of 0 = 0.9974275344462336, prob of 1 = 0.002572465553766478
Actual: 0, Predicted: 0, prob of 0 = 0.998832254041999, prob of 1 = 0.0011677459580009192
Actual: 0, Predicted: 0, prob of 0 = 0.9982509011622057, prob of 1 = 0.0017490988377943236
Actual: 1, Predic