## All you need is ♥️… And a pet!

Some of you didn't get any flowers on Valentine's Day, but worry not: Jerry has a solution for you. Get ready to adopt a pet!

<img src="img/dataset-cover.jpg" width="920">

Here we are going to build a classifier that predicts whether an animal from an animal shelter will be adopted or not. The data is in aac_intakes_outcomes.csv (available at: https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes/version/1#aac_intakes_outcomes.csv). You will be working with the following features:

1. *animal_type:* Type of animal. May be one of 'cat', 'dog', 'bird', etc.
2. *intake_year:* Year of intake
3. *intake_condition:* The intake condition of the animal. Can be one of 'normal', 'injured', 'sick', etc.
4. *intake_number:* The number of times the animal has been brought into the shelter. Values higher than 1 indicate the animal has been taken into the shelter on more than one occasion.
5. *intake_type:* The type of intake, for example, 'stray', 'owner surrender', etc.
6. *sex_upon_intake:* The gender of the animal and whether it has been spayed or neutered at the time of intake
7. *age_upon\_intake_(years):* The age of the animal upon intake represented in years
8. *time_in_shelter_days:* Numeric value denoting the number of days the animal remained at the shelter from intake to outcome.
9. *sex_upon_outcome:* The gender of the animal and whether it has been spayed or neutered at the time of outcome
10. *age_upon\_outcome_(years):* The age of the animal upon outcome represented in years
11. *outcome_type:* The outcome type. Can be one of 'adopted', 'transferred', etc.

In [19]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
from itertools import combinations 
import ast
from sklearn.linear_model import LogisticRegression
import seaborn as sn
%matplotlib inline

data_folder = './data/'

### A) Load the dataset and convert categorical features to a suitable numerical representation (use dummy-variable encoding). 
- Split the data into a training set (80%) and a test set (20%). Pair each feature vector with the corresponding label, i.e., whether the outcome_type is adoption or not. 
- Standardize the values of each feature in the data to have mean 0 and variance 1.

The use of external libraries is not permitted in part A, except for numpy and pandas. 
If you notice missing values in the imported entries, you have to deal with them. Make informed choices.
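
One way to make the choice informed is to count the missing values per column before deciding whether to drop or impute. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) and relies only on numpy and pandas.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the shelter data; values are made up for illustration.
toy = pd.DataFrame({
    'animal_type': ['Dog', 'Cat', 'Dog', 'Bird'],
    'sex_upon_intake': ['Intact Male', None, 'Spayed Female', 'Unknown'],
    'time_in_shelter_days': [1.5, 30.0, np.nan, 4.0],
})

# Count missing values per column and the share of rows with at least one gap.
missing_per_column = toy.isna().sum()
share_rows_affected = toy.isna().any(axis=1).mean()
print(missing_per_column)
print('Share of rows with at least one missing value: {:.2%}'.format(share_rows_affected))

# If only a small share of rows is affected, dropping them is a defensible choice.
toy_clean = toy.dropna().reset_index(drop=True)
```

If the affected share were large, or concentrated in one column, imputing or dropping that column might be a better option than dropping rows.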

In [20]:
columns = ['animal_type', 'intake_year', 'intake_condition', 'intake_number', 'intake_type', 'sex_upon_intake', \
          'age_upon_intake_(years)', 'time_in_shelter_days', 'sex_upon_outcome', 'age_upon_outcome_(years)', \
          'outcome_type']
original_data = pd.read_csv(data_folder+'aac_intakes_outcomes.csv', usecols=columns)

In [21]:
print('The length of the data with all rows is : {}'.format(len(original_data)))
original_data.dropna(inplace=True)
print('The length of the data without the rows with nan value is: {}'.format(len(original_data)))

The length of the data with all rows is : 79672
The length of the data without the rows with nan value is: 79661


In [35]:
categorical_features = ['animal_type', 'intake_condition', 'intake_type', 'sex_upon_intake', 'sex_upon_outcome']

data_encoded = pd.get_dummies(original_data,columns=categorical_features, drop_first=True )
data_encoded['outcome_type'] = (data_encoded['outcome_type'] == 'Adoption').astype(int)

print(data_encoded)


       outcome_type  age_upon_outcome_(years)  age_upon_intake_(years)  \
0                 0                 10.000000                10.000000   
1                 0                  7.000000                 7.000000   
2                 0                  6.000000                 6.000000   
3                 0                 10.000000                10.000000   
4                 0                 16.000000                16.000000   
...             ...                       ...                      ...   
79667             0                  0.038356                 0.038356   
79668             0                  2.000000                 2.000000   
79669             0                  1.000000                 1.000000   
79670             0                  0.821918                 0.410959   
79671             0                 10.000000                10.000000   

       intake_year  intake_number  time_in_shelter_days  animal_type_Cat  \
0             2017            1.0  

The code below can be used to split the data into training and test sets. `data_to_split` should be the clean dataframe from A).

In [None]:
data_to_split = data_encoded

def split_set(data_to_split, ratio=0.8):
    mask = np.random.rand(len(data_to_split)) < ratio
    return [data_to_split[mask].reset_index(drop=True), data_to_split[~mask].reset_index(drop=True)]

[train, test] = split_set(data_to_split)

X_train, y_train = train.drop(columns=['outcome_type']), train['outcome_type']
X_test, y_test = test.drop(columns=['outcome_type']), test['outcome_type']

# Standardize features using training set statistics
mean = X_train.mean()
std = X_train.std()
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std  

age_upon_outcome_(years)             2.142107
age_upon_intake_(years)              2.107231
intake_year                       2015.436030
intake_number                        1.127267
time_in_shelter_days                16.837285
animal_type_Cat                      0.371004
animal_type_Dog                      0.568351
animal_type_Other                    0.056233
intake_condition_Feral               0.001095
intake_condition_Injured             0.050444
intake_condition_Normal              0.878397
intake_condition_Nursing             0.024674
intake_condition_Other               0.001846
intake_condition_Pregnant            0.000610
intake_condition_Sick                0.039006
intake_type_Owner Surrender          0.188522
intake_type_Public Assist            0.062788
intake_type_Stray                    0.701860
intake_type_Wildlife                 0.043763
sex_upon_intake_Intact Male          0.317134
sex_upon_intake_Neutered Male        0.159655
sex_upon_intake_Spayed Female     
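
As a sanity check on the standardization step, the sketch below wraps it in a helper and guards against constant columns, whose standard deviation is zero. The function name `standardize` and the toy frames are hypothetical. It uses the population standard deviation (`ddof=0`) so that the training variance is exactly 1, whereas pandas' default `std()` uses `ddof=1`; on a dataset of this size the difference is negligible.

```python
import numpy as np
import pandas as pd

def standardize(train_features, test_features):
    """Scale both frames with the training mean and standard deviation (hypothetical helper)."""
    train_features = train_features.astype(float)
    test_features = test_features.astype(float)
    mean = train_features.mean()
    std = train_features.std(ddof=0)
    std = std.replace(0, 1.0)  # constant columns: avoid dividing by zero
    return (train_features - mean) / std, (test_features - mean) / std

# Toy data, made up for illustration.
toy_train = pd.DataFrame({'age': [1.0, 2.0, 3.0, 4.0], 'is_cat': [1, 0, 1, 0], 'constant': [1, 1, 1, 1]})
toy_test = pd.DataFrame({'age': [2.5], 'is_cat': [1], 'constant': [1]})

scaled_train, scaled_test = standardize(toy_train, toy_test)
print(scaled_train.mean())
print(scaled_train.std(ddof=0))
```

Only the training columns end up with mean 0 and variance 1; the test set is scaled with the training statistics, so its mean and variance will be close to, but not exactly, 0 and 1.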

### B) Train a logistic regression classifier on your training set. Logistic regression returns probabilities as predictions, so in order to arrive at a binary prediction, you need to put a threshold on the predicted probabilities. 
- For the decision threshold of 0.5, present the performance of your classifier on the test set by displaying the confusion matrix. Based on the confusion matrix, manually calculate accuracy, precision, recall, and F1-score with respect to the positive and the negative class. 

- The features in the test set must match those of the training set.

- You can use the functions below to compute and plot a confusion matrix as well as compute all relevant scores
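
On the feature-matching point: if the dummy encoding is done once on the full dataframe before splitting, the columns already agree. If the two sets were encoded separately, a category missing from the test set would produce a missing column. A minimal sketch of one way to realign them, on made-up toy frames:

```python
import pandas as pd

# Toy frames encoded separately; the test frame happens to lack the 'Bird' category.
train_raw = pd.DataFrame({'animal_type': ['Cat', 'Dog', 'Bird'], 'intake_number': [1, 2, 1]})
test_raw = pd.DataFrame({'animal_type': ['Dog', 'Cat'], 'intake_number': [1, 3]})

train_encoded = pd.get_dummies(train_raw, columns=['animal_type'], dtype=int)
test_encoded = pd.get_dummies(test_raw, columns=['animal_type'], dtype=int)

# Give the test frame exactly the training columns, in the same order;
# categories unseen in the test data become all-zero columns.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_aligned)
```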

In [24]:
def compute_confusion_matrix(true_label, prediction_proba, decision_threshold=0.5): 
    
    predict_label = (prediction_proba[:,1]>decision_threshold).astype(int)   
                                                                                                                       
    TP = np.sum(np.logical_and(predict_label==1, true_label==1))
    TN = np.sum(np.logical_and(predict_label==0, true_label==0))
    FP = np.sum(np.logical_and(predict_label==1, true_label==0))
    FN = np.sum(np.logical_and(predict_label==0, true_label==1))
    
    confusion_matrix = np.asarray([[TP, FP],
                                    [FN, TN]])
    return confusion_matrix


def plot_confusion_matrix(confusion_matrix):
    [[TP, FP],[FN, TN]] = confusion_matrix
    label = np.asarray([['TP {}'.format(TP), 'FP {}'.format(FP)],
                        ['FN {}'.format(FN), 'TN {}'.format(TN)]])
    
    # Rows are the predicted label (Yes = predicted positive), columns are the actual class
    df_cm = pd.DataFrame(confusion_matrix, index=['Yes', 'No'], columns=['Positive', 'Negative']) 
    
    return sn.heatmap(df_cm, cmap='YlOrRd', annot=label, annot_kws={"size": 16}, cbar=False, fmt='')


def compute_all_score(confusion_matrix, t=0.5):
    [[TP, FP],[FN, TN]] = confusion_matrix.astype(float)
    
    accuracy =  (TP+TN)/np.sum(confusion_matrix)
    
    precision_positive = TP/(TP+FP) if (TP+FP) !=0 else np.nan
    precision_negative = TN/(TN+FN) if (TN+FN) !=0 else np.nan
    
    recall_positive = TP/(TP+FN) if (TP+FN) !=0 else np.nan
    recall_negative = TN/(TN+FP) if (TN+FP) !=0 else np.nan

    F1_score_positive = 2 *(precision_positive*recall_positive)/(precision_positive+recall_positive) if (precision_positive+recall_positive) !=0 else np.nan
    F1_score_negative = 2 *(precision_negative*recall_negative)/(precision_negative+recall_negative) if (precision_negative+recall_negative) !=0 else np.nan

    return [t, accuracy, precision_positive, recall_positive, F1_score_positive, precision_negative, recall_negative, F1_score_negative]
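
To see how the scores follow from the four counts, here is a small worked example on made-up labels and probabilities. It mirrors the formulas in `compute_all_score` but is a standalone sketch, and it takes a 1-D vector of positive-class probabilities rather than the two-column output of `predict_proba`.

```python
import numpy as np

# Toy labels and predicted probabilities of the positive class (made up for illustration).
true_label = np.array([1, 1, 1, 0, 0, 0, 0, 1])
proba_positive = np.array([0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.55, 0.7])

predicted = (proba_positive > 0.5).astype(int)

# Here: TP=3, FP=2, FN=1, TN=2
TP = int(np.sum((predicted == 1) & (true_label == 1)))
TN = int(np.sum((predicted == 0) & (true_label == 0)))
FP = int(np.sum((predicted == 1) & (true_label == 0)))
FN = int(np.sum((predicted == 0) & (true_label == 1)))

accuracy = (TP + TN) / (TP + TN + FP + FN)            # 5/8
precision_positive = TP / (TP + FP)                   # 3/5
recall_positive = TP / (TP + FN)                      # 3/4
f1_positive = 2 * precision_positive * recall_positive / (precision_positive + recall_positive)
precision_negative = TN / (TN + FN)                   # 2/3
recall_negative = TN / (TN + FP)                      # 2/4
```

Note that the negative-class scores simply swap the roles of the two classes: TN plays the part of TP, and FN and FP trade places.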

In [25]:
#fit a cute logistic regression

In [26]:
#evaluate the predictions

In [27]:
#show confusion matrix and scores
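
A minimal sketch of the fit-and-threshold steps, run on synthetic data rather than the shelter dataset. All names ending in `_demo` are hypothetical stand-ins; in the notebook the same calls would be applied to `X_train`, `y_train` and `X_test`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the standardized features and binary labels.
X_demo = rng.normal(size=(500, 3))
y_demo = (X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_demo_train, X_demo_test = X_demo[:400], X_demo[400:]
y_demo_train, y_demo_test = y_demo[:400], y_demo[400:]

model = LogisticRegression(max_iter=1000)
model.fit(X_demo_train, y_demo_train)

# predict_proba returns one column per class; column 1 is the positive class.
proba_demo = model.predict_proba(X_demo_test)
predicted_demo = (proba_demo[:, 1] > 0.5).astype(int)
print('Test accuracy on synthetic data: {:.3f}'.format((predicted_demo == y_demo_test).mean()))
```

The two-column array returned by `predict_proba` is the shape that `compute_confusion_matrix` expects for its `prediction_proba` argument.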

### C) Vary the value of the threshold in the range from 0 to 1 and visualize the value of accuracy, precision, recall  and F1-score (with respect to both classes) as a function of the threshold.

Here we expect one plot for the accuracy and then one set of 3 plots for each of the two classes, i.e. precision, recall and F1 score for the positive class and precision, recall and F1 score for the negative class.

### Comment on the results. What do you observe?

In clinic 3, we will focus on the ROC curve and the area under it (AUROC).

In [28]:
#show cute plots here (in total 7 plots)
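
A sketch of the threshold sweep on synthetic probabilities (not the classifier's actual output). It draws only three of the seven curves: accuracy, plus precision and recall for the positive class. F1 and the negative-class curves follow the same pattern. The variable names are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Synthetic labels and positive-class probabilities standing in for classifier output.
y_sweep = rng.integers(0, 2, size=1000)
proba_sweep = np.clip(0.5 + 0.25 * (2 * y_sweep - 1) + rng.normal(scale=0.2, size=1000), 0, 1)

thresholds = np.linspace(0, 1, 101)
accuracy_curve, precision_curve, recall_curve = [], [], []
for t in thresholds:
    predicted = proba_sweep > t
    TP = np.sum(predicted & (y_sweep == 1))
    FP = np.sum(predicted & (y_sweep == 0))
    FN = np.sum(~predicted & (y_sweep == 1))
    TN = np.sum(~predicted & (y_sweep == 0))
    accuracy_curve.append((TP + TN) / len(y_sweep))
    # Precision is undefined when nothing is predicted positive.
    precision_curve.append(TP / (TP + FP) if (TP + FP) != 0 else np.nan)
    recall_curve.append(TP / (TP + FN) if (TP + FN) != 0 else np.nan)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
curves = [accuracy_curve, precision_curve, recall_curve]
names = ['Accuracy', 'Precision (positive)', 'Recall (positive)']
for ax, curve, name in zip(axes, curves, names):
    ax.plot(thresholds, curve)
    ax.set_xlabel('Decision threshold')
    ax.set_ylabel(name)
```

Raising the threshold can only remove positive predictions, so recall for the positive class never increases along the sweep, while precision is undefined once no sample is predicted positive.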

// comments here //

### D) Plot in a bar chart the coefficients of the logistic regression sorted by their contribution to the prediction.

### Interpret the results of the coefficients

In [29]:
#show the bar plot here
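
A sketch of the coefficient bar chart on synthetic data with hypothetical feature names. It assumes that "contribution" means the absolute value of the coefficient, which is only comparable across features because the inputs are standardized.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Synthetic standardized features with hypothetical names; only the first two drive the label.
feature_names = ['feature_a', 'feature_b', 'feature_c']
X_coef = pd.DataFrame(rng.normal(size=(1000, 3)), columns=feature_names)
signal = 2.0 * X_coef['feature_a'] - 1.0 * X_coef['feature_b']
y_coef = (signal + rng.normal(scale=0.5, size=1000) > 0).astype(int)

model_coef = LogisticRegression(max_iter=1000).fit(X_coef, y_coef)

# coef_ has shape (1, n_features) for a binary problem.
coefficients = pd.Series(model_coef.coef_[0], index=feature_names)

# Assumption: sort by absolute value so the most influential features come first.
order = coefficients.abs().sort_values(ascending=False).index
coefficients_sorted = coefficients.loc[order]

ax = coefficients_sorted.plot.bar()
ax.set_ylabel('Coefficient')
```

A positive coefficient raises the predicted probability of the positive class as the feature increases; a negative one lowers it.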

// comments here //