### Classification Warm-up

In [1]:
import pandas as pd
import numpy as np

# data visualization
import matplotlib
import seaborn as sns
import statsmodels.api as sm

%matplotlib inline

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import MinMaxScaler
from sklearn import tree
from sklearn.linear_model import LogisticRegression

from pydataset import data

Load the swiss data set with pydataset.data.

In [2]:
df = data('swiss')
df

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6
Porrentruy,76.1,35.3,9,7,90.57,26.6
Broye,83.8,70.2,16,7,92.85,23.6
Glane,92.4,67.8,14,8,97.16,24.9
Gruyere,82.4,53.3,12,7,97.67,21.0
Sarine,82.9,45.2,16,13,91.38,24.4


In [3]:
#Skim through the documentation for the dataset
data('swiss', show_doc=True)

swiss

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Swiss Fertility and Socioeconomic Indicators (1888) Data

### Description

Standardized fertility measure and socio-economic indicators for each of 47
French-speaking provinces of Switzerland at about 1888.

### Usage

    data(swiss)

### Format

A data frame with 47 observations on 6 variables, each of which is in percent,
i.e., in [0,100].

[,1] Fertility: Ig, "common standardized fertility measure"
[,2] Agriculture
[,3] Examination
[,4] Education
[,5] Catholic
[,6] Infant.Mortality: live births who live less than 1 year

All variables but 'Fertility' give proportions of the population.

### Source

Project "16P5", pages 549-551 in

Mosteller, F. and Tukey, J. W. (1977) “Data Analysis and Regression: A Second
Course in Statistics”. Addison-Wesley, Reading Mass.

indicating their source as "Data used by permission of Franice van de Walle.
Office of Population Research, Princeton Univer

In [4]:
df.head()

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6


In [5]:
df.columns

Index(['Fertility', 'Agriculture', 'Examination', 'Education', 'Catholic',
       'Infant.Mortality'],
      dtype='object')

Transform the Catholic variable into a categorical variable named is_catholic. The values should be either Catholic or Not Catholic.  Then, drop the Catholic column.

In [6]:
df = df.rename(columns={'Catholic':'is_catholic','Infant.Mortality':'infant_mortality'})
df

Unnamed: 0,Fertility,Agriculture,Examination,Education,is_catholic,infant_mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6
Porrentruy,76.1,35.3,9,7,90.57,26.6
Broye,83.8,70.2,16,7,92.85,23.6
Glane,92.4,67.8,14,8,97.16,24.9
Gruyere,82.4,53.3,12,7,97.67,21.0
Sarine,82.9,45.2,16,13,91.38,24.4


In [7]:
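# Flag a province as Catholic when at least 80% of its population is Catholic.
# The 80% cutoff is a choice made in this notebook, not part of the data set.
df['is_catholic'] = df.is_catholic >= 80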

In [8]:
df.head()

Unnamed: 0,Fertility,Agriculture,Examination,Education,is_catholic,infant_mortality
Courtelary,80.2,17.0,15,12,False,22.2
Delemont,83.1,45.1,6,9,True,22.2
Franches-Mnt,92.5,39.7,5,5,True,20.2
Moutier,85.8,36.5,12,7,False,20.3
Neuveville,76.9,43.5,17,15,False,20.6


In [9]:
for col in df.columns[df.dtypes == 'bool']:
    df[col] = df[col].map({True: 1, False: 0})

In [10]:
df.head()

Unnamed: 0,Fertility,Agriculture,Examination,Education,is_catholic,infant_mortality
Courtelary,80.2,17.0,15,12,0,22.2
Delemont,83.1,45.1,6,9,1,22.2
Franches-Mnt,92.5,39.7,5,5,1,20.2
Moutier,85.8,36.5,12,7,0,20.3
Neuveville,76.9,43.5,17,15,0,20.6
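
The exercise asks for the labels Catholic and Not Catholic, and for the original Catholic column to be dropped, whereas the cells above rename the column and encode it as 0/1. A minimal sketch of the literal version follows. It assumes the same 80% cutoff used above and works on a five-row copy of the table (the name `sample` is hypothetical) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Five rows copied from the swiss table above; Catholic is a percentage.
sample = pd.DataFrame(
    {"Catholic": [9.96, 84.84, 93.40, 33.77, 5.16]},
    index=["Courtelary", "Delemont", "Franches-Mnt", "Moutier", "Neuveville"],
)

# Assumption: the same 80% cutoff used elsewhere in this notebook.
threshold = 80

# Build the string-valued categorical column, then drop the numeric original.
sample["is_catholic"] = np.where(
    sample["Catholic"] >= threshold, "Catholic", "Not Catholic"
)
sample = sample.drop(columns="Catholic")

print(sample)
```

String labels read well in plots and reports; the 0/1 encoding used in the notebook is the more convenient form for passing the target to scikit-learn.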


Split the data into training and test data sets. We will be trying to predict whether or not a province is catholic.

In [11]:
#train, test = train_test_split(df)

In [27]:
X = df.drop(['is_catholic'],axis=1)
y = df['is_catholic']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 123, stratify=y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

X_train.head()

Unnamed: 0,Fertility,Agriculture,Examination,Education,infant_mortality
Neuchatel,64.4,17.6,35,32,23.0
Lausanne,55.7,19.4,26,28,20.2
Sarine,82.9,45.2,16,13,24.4
Entremont,69.3,84.9,7,6,19.8
Herens,77.3,89.7,5,2,18.3


In [28]:
X_train.dtypes

Fertility           float64
Agriculture         float64
Examination           int64
Education             int64
infant_mortality    float64
dtype: object

In [29]:
y_train.dtypes

dtype('int64')

Fit a logistic regression model using Agriculture and Examination. Measure the model's performance.

In [30]:
results = pd.DataFrame(dict(actual=y_train))

In [31]:
#Create the logistic regression object
# from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(C=1, class_weight={1:2}, random_state = 123, solver='saga')

In [32]:
# Note: X_train holds all five features, so this fit uses more than Agriculture and Examination.
logit.fit(X_train, y_train)

LogisticRegression(C=1, class_weight={1: 2}, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=123, solver='saga',
          tol=0.0001, verbose=0, warm_start=False)

In [33]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[ 0.06240486  0.00523436 -0.30561029 -0.03026974 -0.00925864]]
Intercept: 
 [-0.00916628]
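
To help read the printed coefficients, the sketch below pairs each value with its feature name and converts it to an odds ratio. The coefficient values are copied from the output above and the feature order is assumed to match the X_train columns. Because the features were not scaled, the ranking by magnitude is only a rough guide.

```python
import numpy as np
import pandas as pd

# Feature names in X_train column order, and the coefficients printed above.
features = ["Fertility", "Agriculture", "Examination", "Education", "infant_mortality"]
coefs = np.array([0.06240486, 0.00523436, -0.30561029, -0.03026974, -0.00925864])

# exp(coefficient) is the multiplicative change in the odds of is_catholic == 1
# for a one-unit increase in the feature, holding the others fixed.
summary = pd.DataFrame(
    {"coefficient": coefs, "odds_ratio": np.exp(coefs)},
    index=features,
)

# Order rows by absolute coefficient size, largest first.
order = np.argsort(-np.abs(coefs))
summary = summary.iloc[order]

print(summary)
```

An odds ratio below 1 (as for Examination) means higher values of that feature push the prediction toward Not Catholic.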


In [34]:
y_pred = logit.predict(X_train)

In [35]:
y_pred_proba = logit.predict_proba(X_train)

In [36]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X_train, y_train)))

Accuracy of Logistic Regression classifier on training set: 0.88


In [37]:
print(confusion_matrix(y_train, y_pred))

[[18  3]
 [ 1 10]]
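
The counts in the matrix above are enough to recompute the headline metrics by hand. The short check below relies on scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[18, 3],
               [1, 10]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # of predicted Catholic, how many really are
recall = tp / (tp + fn)           # of actual Catholic, how many were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```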


In [38]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.86      0.90        21
           1       0.77      0.91      0.83        11

   micro avg       0.88      0.88      0.88        32
   macro avg       0.86      0.88      0.87        32
weighted avg       0.89      0.88      0.88        32
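
The exercise asks for a model built on Agriculture and Examination only. A minimal sketch of restricting the fit to those two columns follows. It uses ten rows copied from the table at the top of the notebook so that it runs on its own, so its numbers are not comparable with the results above. The names `sample`, `X_two`, and `logit_two` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Ten rows copied from the swiss table shown earlier in this notebook.
sample = pd.DataFrame(
    {
        "Agriculture": [17.0, 45.1, 39.7, 36.5, 43.5, 35.3, 70.2, 67.8, 53.3, 45.2],
        "Examination": [15, 6, 5, 12, 17, 9, 16, 14, 12, 16],
        "Catholic": [9.96, 84.84, 93.40, 33.77, 5.16, 90.57, 92.85, 97.16, 97.67, 91.38],
    },
    index=["Courtelary", "Delemont", "Franches-Mnt", "Moutier", "Neuveville",
           "Porrentruy", "Broye", "Glane", "Gruyere", "Sarine"],
)

# Same 80% cutoff as the notebook, encoded as 0/1.
y = (sample["Catholic"] >= 80).astype(int)

# Select only the two requested features before fitting.
X_two = sample[["Agriculture", "Examination"]]

logit_two = LogisticRegression(max_iter=1000, random_state=123)
logit_two.fit(X_two, y)

pred_two = logit_two.predict(X_two)
print("Coefficients:", logit_two.coef_)
print("Training accuracy:", logit_two.score(X_two, y))
```

In the notebook itself the equivalent change would be to pass `X_train[['Agriculture', 'Examination']]` to `fit`, `predict`, and `score`.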



Fit a decision tree classifier using the Education and Fertility features. Measure the model's performance using accuracy, precision, and recall.

In [25]:
# Create the decision tree object.
# For classification, criterion can be 'gini' or 'entropy' (information gain); the default is 'gini'.
# Features are chosen by subsetting the columns passed to fit, not through criterion.
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=123)

In [26]:
clf.fit(X_train[['Education', 'Fertility']], y_train)
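
The exercise asks for accuracy, precision, and recall for a tree built on Education and Fertility. One way this could look is sketched below. It uses ten rows copied from the table at the top of the notebook so that it runs on its own, and the names `sample`, `X_tree`, and `tree_clf` are hypothetical. The scores are computed on the same rows used for fitting, so they describe fit rather than generalization.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier

# Ten rows copied from the swiss table shown earlier in this notebook.
sample = pd.DataFrame(
    {
        "Fertility": [80.2, 83.1, 92.5, 85.8, 76.9, 76.1, 83.8, 92.4, 82.4, 82.9],
        "Education": [12, 9, 5, 7, 15, 7, 7, 8, 7, 13],
        "Catholic": [9.96, 84.84, 93.40, 33.77, 5.16, 90.57, 92.85, 97.16, 97.67, 91.38],
    }
)

# Same 80% cutoff as the notebook, encoded as 0/1.
y = (sample["Catholic"] >= 80).astype(int)

# Feature selection happens here, by choosing columns, not via criterion.
X_tree = sample[["Education", "Fertility"]]

tree_clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=123)
tree_clf.fit(X_tree, y)
tree_pred = tree_clf.predict(X_tree)

acc = accuracy_score(y, tree_pred)
prec = precision_score(y, tree_pred)   # positive class is 1 (Catholic)
rec = recall_score(y, tree_pred)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```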

In [None]:

Fit a K Nearest Neighbors model using two features of your choice. Measure the model's performance.
Use the best model from the ones above on your test data set and evaluate the model's predictions.
Explain how/why your model is making the predictions that it is.
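
A possible starting point for the K Nearest Neighbors step is sketched below, assuming Fertility and Education as the two features and k = 3 (both are arbitrary choices made for illustration). KNN is distance-based, so the features are rescaled with MinMaxScaler, fitted on the training rows only. The ten rows are copied from the table at the top of the notebook so the block runs on its own; with so few rows the test score is illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Ten rows copied from the swiss table shown earlier in this notebook.
sample = pd.DataFrame(
    {
        "Fertility": [80.2, 83.1, 92.5, 85.8, 76.9, 76.1, 83.8, 92.4, 82.4, 82.9],
        "Education": [12, 9, 5, 7, 15, 7, 7, 8, 7, 13],
        "Catholic": [9.96, 84.84, 93.40, 33.77, 5.16, 90.57, 92.85, 97.16, 97.67, 91.38],
    }
)
y = (sample["Catholic"] >= 80).astype(int)
X = sample[["Fertility", "Education"]]

# Stratified split, mirroring the settings used earlier in the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=123, stratify=y
)

# Fit the scaler on training rows only, then apply it to both sets.
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_tr_scaled, y_tr)

test_pred = knn.predict(X_te_scaled)
test_acc = knn.score(X_te_scaled, y_te)
print("Test predictions:", test_pred)
print("Test accuracy:", test_acc)
```

Each prediction is the majority label among the three nearest training provinces in the scaled feature space, which is also the explanation for why the model predicts what it does.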

In [None]:
sns.relplot(x='is_catholic', y='Fertility', data=df)