# Model Diagnostics in Python

In this notebook, I explore some concepts from model diagnostics. I use Python and sklearn to fit a logistic regression that predicts a binary response, then assess how well the model performs and fits using a variety of metrics (`confusion_matrix`, `precision_score`, `recall_score`, `accuracy_score`).


### Concepts explored
This notebook aims to show how to:

* Create and evaluate confusion matrices
* Evaluate precision and recall rates
* Analyze true positives and false positives using edge cases

### Dataset description

The dataset contains four variables: `admit`, `gre`, `gpa`, and `prestige`:

* `admit` is a binary variable. It indicates whether a candidate was admitted into UCLA (admit = 1) or not (admit = 0).
* `gre` is the GRE score. GRE stands for Graduate Record Examination.
* `gpa` stands for Grade Point Average.
* `prestige` is the prestige of an applicant's alma mater (the school attended before applying), with 1 being the highest (most prestigious) and 4 the lowest (least prestigious).


In [1]:
#importing the necessary libraries and reading the dataset
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score
from sklearn.model_selection import train_test_split
np.random.seed(42)

df = pd.read_csv('./admissions.csv')
df.head()

| Unnamed: 0 | admit | gre | gpa | prestige |
|---|---|---|---|---|
| 0 | 0 | 380 | 3.61 | 3 |
| 1 | 1 | 660 | 3.67 | 3 |
| 2 | 1 | 800 | 4.0 | 1 |
| 3 | 1 | 640 | 3.19 | 4 |
| 4 | 0 | 520 | 2.93 | 4 |


To do:

`1.` Change prestige to dummy variable columns that are added to `df`.  
`2.` Divide my data into training and test data.  
`3.` Create the test set as 20% of the data, and use a random state of 0. (response should be the `admit` column).  

In [2]:
#changing prestige to dummy variables, creating 4 levels
df[['level1','level2','level3','level4']] = pd.get_dummies(df['prestige'])


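#creating X (explanatory variables) and y (response variable)
#level1 is dropped so that prestige 1 serves as the baseline category
#note: the 'Unnamed: 0' column read from the CSV is not dropped, so it stays in X
X = df.drop(['admit', 'prestige', 'level1'], axis=1)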
y = df['admit']


#dividing data into training and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
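
As an aside, the dummy-variable step can also be written so that the baseline level is dropped in the same call. The following is only a sketch on a tiny made-up frame (`toy` is a hypothetical name, not the admissions data), and it produces column names like `level_2` rather than the `level2` names used in this notebook:

```python
import pandas as pd

# Tiny made-up frame, used only to show the resulting column layout
toy = pd.DataFrame({"gre": [380, 660, 800, 640],
                    "prestige": [3, 2, 1, 4]})

# prefix names the columns level_1..level_4; drop_first removes level_1,
# so prestige 1 becomes the baseline category
dummies = pd.get_dummies(toy["prestige"], prefix="level", drop_first=True)
toy = toy.join(dummies)

print(toy.columns.tolist())
```

Dropping one level avoids carrying a redundant column, since the fourth dummy is fully determined by the other three.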



To do:

`1.` Use sklearn's `LogisticRegression` to fit a logistic model using `gre`, `gpa`, and 3 of the `prestige` dummy variables.  
`2.` Fit the logistic regression model without changing any of the hyperparameters.  
`3.` As a first score, obtain the confusion matrix.  

In [3]:
#instantiate logistic regression model
log_mod = LogisticRegression()

#fit the model to the training data
log_mod.fit(X_train, y_train)

#create predictions using test data
y_preds = log_mod.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
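
The warning above suggests two remedies: scaling the features or raising `max_iter`. The sketch below tries both on synthetic stand-in data, since the point is only to show the mechanics. The data, the variable names (`X_demo`, `scaled_mod`, `patient_mod`), and the `max_iter=1000` value are all assumptions for illustration, and both options depart from the to-do item about leaving the hyperparameters unchanged.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): two features on very different scales
rng = np.random.default_rng(0)
gre = rng.integers(220, 801, size=400)
gpa = rng.uniform(2.0, 4.0, size=400)
X_demo = np.column_stack([gre, gpa])
y_demo = (gpa + gre / 200 + rng.normal(0, 1, size=400) > 6).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0)

# Option 1: standardize the features before fitting
scaled_mod = make_pipeline(StandardScaler(), LogisticRegression())
scaled_mod.fit(X_tr, y_tr)

# Option 2: keep the raw features but allow the solver more iterations
patient_mod = LogisticRegression(max_iter=1000)
patient_mod.fit(X_tr, y_tr)

print(scaled_mod.predict(X_te)[:10])
print(patient_mod.predict(X_te)[:10])
```

Wrapping the scaler in a pipeline means it is fitted on the training rows only and then reused on the test rows, which keeps test information out of the fit.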


In [4]:
confusion_matrix(y_test, y_preds) 

array([[56,  0],
       [20,  4]])

#### Interpretation of Confusion Matrix

* The number of non-admits correctly identified as non-admits: `56`
* The number of admits correctly identified as admits: `4`
* The number of admits incorrectly identified as non-admits: `20`
* The number of non-admits incorrectly identified as admits: `0`
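
These four counts follow sklearn's layout for a binary confusion matrix, where rows are the actual classes and columns are the predicted classes. A minimal sketch of unpacking them, with the matrix from the output above re-entered by hand as a NumPy array:

```python
import numpy as np

# Matrix copied from the output above; rows = actual class, columns = predicted class
cm = np.array([[56, 0],
               [20, 4]])

# For binary labels 0 and 1, ravel() flattens row by row: TN, FP, FN, TP
tn, fp, fn, tp = (int(v) for v in cm.ravel())

print("true negatives (non-admits predicted as non-admits):", tn)
print("false positives (non-admits predicted as admits):", fp)
print("false negatives (admits predicted as non-admits):", fn)
print("true positives (admits predicted as admits):", tp)
```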

#### How well did my model perform on the test data?
To assess this, I will use a variety of metrics (`precision_score`, `recall_score`, `accuracy_score`).

In [5]:

print('precision score:', precision_score(y_test, y_preds))

precision score: 1.0


The precision score is the ratio of correctly predicted positive observations to all predicted positive observations. Here the model predicted 4 admits and all 4 were correct, so precision is 4 / 4 = 1.0.

In [6]:
print('recall score:', recall_score(y_test, y_preds))

recall score: 0.16666666666666666


The recall score is the share of actually admitted students that the model identified as admitted. Here that is 4 of 24, or about 0.167, so the model misses most admits.

In [7]:
print('accuracy score:', accuracy_score(y_test, y_preds))

accuracy score: 0.75


The accuracy score is the share of all cases, admitted or not, that the model classified correctly. Here that is (56 + 4) / 80 = 0.75.
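
As a cross-check, the three scores can be reproduced by hand from the confusion matrix counts reported earlier. The sketch below also adds a hypothetical baseline that is not part of the original exercise: a classifier that always predicts "not admitted", included only to put the accuracy figure in context.

```python
# Counts taken from the confusion matrix reported earlier in this notebook
tn, fp, fn, tp = 56, 0, 20, 4
total = tn + fp + fn + tp

precision = tp / (tp + fp)      # 4 / 4
recall = tp / (tp + fn)         # 4 / 24
accuracy = (tp + tn) / total    # 60 / 80

# Hypothetical baseline (assumption): always predict 0, i.e. "not admitted"
baseline_accuracy = (tn + fp) / total   # 56 / 80

print("precision:", precision)
print("recall:", recall)
print("accuracy:", accuracy)
print("always-predict-0 accuracy:", baseline_accuracy)
```

Under these counts the model's accuracy sits only slightly above the baseline, which is why recall is the more informative number here.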

*NB*: **This exercise, and the dataset used, are courtesy of the Udacity Data Analyst Nanodegree programme which I am currently part of.** 