# Logistic Regression Model for Animal Outcome Prediction
Metis-Classification_Project
14JUN2022
John Tazioli

### Import Data:

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('dog.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21989 entries, 0 to 21988
Data columns (total 24 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   animal_id_outcome           21989 non-null  object 
 1   date_of_birth               21989 non-null  object 
 2   outcome_subtype             3829 non-null   object 
 3   outcome_type                21989 non-null  object 
 4   sex_upon_outcome            21989 non-null  object 
 5   age_upon_outcome_(days)     21989 non-null  int64  
 6   age_upon_outcome_(years)    21989 non-null  float64
 7   age_upon_outcome_age_group  21989 non-null  object 
 8   outcome_datetime            21989 non-null  object 
 9   outcome_number              21989 non-null  int64  
 10  dob_monthyear               21989 non-null  object 
 11  age_upon_intake             21989 non-null  object 
 12  animal_type                 21989 non-null  object 
 13  breed                       219

### Isolate Target Variable:

In [3]:
#establish target variable of animal outcome
y = df['outcome_type']
y.unique()

array(['Adoption', 'Died', 'Euthanasia'], dtype=object)

In [4]:
# replace 'Died' with 'Euthanasia', since I already filtered out the dogs that were sick or
# injured upon arrival, or that died outside the shelter
y = y.replace('Died','Euthanasia')
y.unique()

array(['Adoption', 'Euthanasia'], dtype=object)

In [5]:
#map str to int
y = y.map({'Euthanasia':0, 'Adoption':1})

### Isolate Baseline Features:
- Breed
- Age
- In shelter more than once (Y/N)

In [6]:
x = df[['breed', 'age_upon_intake_age_group', 'count', 'outcome_type']]
x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21989 entries, 0 to 21988
Data columns (total 4 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   breed                      21989 non-null  object
 1   age_upon_intake_age_group  21989 non-null  object
 2   count                      21989 non-null  int64 
 3   outcome_type               21989 non-null  object
dtypes: int64(1), object(3)
memory usage: 687.3+ KB


In [7]:
breed_rate = x.groupby('breed')['outcome_type'].count()
breed_rate.head()

breed
Affenpinscher Mix                      4
Afghan Hound/Labrador Retriever        1
Airedale Terrier Mix                   5
Airedale Terrier/Irish Terrier         1
Airedale Terrier/Labrador Retriever    1
Name: outcome_type, dtype: int64

There are 1410 breeds and no easy way to simplify or group them. As is, the breed feature is too specific to help with modeling. New baseline:

In [8]:
x_bs = df[['age_upon_intake_age_group', 'count']]
x_bs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21989 entries, 0 to 21988
Data columns (total 2 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   age_upon_intake_age_group  21989 non-null  object
 1   count                      21989 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 343.7+ KB


In [9]:
# convert age groups to 4 eras in a dog's life
# (work on an explicit copy and assign the result back, which avoids the
# SettingWithCopyWarning raised by chained inplace replace calls)
age_eras = {
    '(-0.025, 2.5]': 'puppy',
    '(2.5, 5.0]': 'adult', '(5.0, 7.5]': 'adult',
    '(7.5, 10.0]': 'senior', '(10.0, 12.5]': 'senior',
    '(12.5, 15.0]': 'late_senior', '(15.0, 17.5]': 'late_senior',
    '(17.5, 20.0]': 'late_senior',
}
x_bs = x_bs.copy()
x_bs['age_upon_intake_age_group'] = x_bs['age_upon_intake_age_group'].replace(age_eras)
x_bs.age_upon_intake_age_group.head()

0         senior
1    late_senior
2         senior
3         senior
4         senior
Name: age_upon_intake_age_group, dtype: object

In [10]:
#need dummy variables for categorical age groups
x_base = pd.concat([x_bs,
               pd.get_dummies(x_bs['age_upon_intake_age_group'], drop_first = True)],
              axis = 1)
x_base = x_base.loc[:,'count':'senior']
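
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a small made-up sample (the values below are illustrative and not taken from `dog.csv`). The first category in sorted order, `adult`, is dropped and becomes the reference level, which is why `x_base` ends with the columns `late_senior`, `puppy`, and `senior`.

```python
import pandas as pd

# Hypothetical toy sample of the relabeled age groups (not real shelter data).
ages = pd.Series(
    ["puppy", "adult", "senior", "late_senior", "adult"],
    name="age_upon_intake_age_group",
)

# drop_first=True removes the first category in sorted order ('adult'),
# so a row with all dummy columns equal to zero means 'adult'.
dummies = pd.get_dummies(ages, drop_first=True)
dummy_columns = list(dummies.columns)

print(dummy_columns)
print(dummies)
```

Dropping one level avoids encoding the same information twice, since the dropped category is implied when every remaining dummy is zero.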


### Logistic Regression

In [17]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

logreg.fit(x_base, y)

LogisticRegression()

In [18]:
from sklearn.metrics import f1_score

lr_bs_acc = logreg.score(x_base, y)

y_lr_base_pred = pd.Series(logreg.predict(x_base))

print(f"Acc: {lr_bs_acc}")
print(f"F1: {f1_score(y, y_lr_base_pred)}")

Acc: 0.9358315521397063
F1: 0.9668522564427844


In [19]:
from sklearn.metrics import confusion_matrix

y_pred = logreg.predict(x_base)
confusion_matrix(y,y_pred)

array([[    0,  1411],
       [    0, 20578]])

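### Conclusion:
The model predicted adoption for every animal and was 93.6% accurate, which simply matches the share of adoptions in the data (20,578 of 21,989). It identified none of the 1,411 euthanasia cases, so a better indicator for euthanasia is needed.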
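
The accuracy and F1 figures can be reproduced directly from the confusion matrix. The sketch below hard-codes the four counts reported above and shows that accuracy equals the adoption share while recall for euthanasia is zero. The variable names are illustrative, and the row/column layout follows scikit-learn's convention (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Baseline confusion matrix as reported above.
# Rows: true class, columns: predicted class; 0 = Euthanasia, 1 = Adoption.
cm = np.array([[0, 1411],
               [0, 20578]])

# With Adoption (1) as the positive class, ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
adoption_share = cm[1].sum() / cm.sum()   # fraction of dogs that were adopted
euth_recall = tn / (tn + fp)              # fraction of euthanasia cases identified
f1_adoption = 2 * tp / (2 * tp + fp + fn)

print(f"accuracy = {accuracy:.4f}, adoption share = {adoption_share:.4f}")
print(f"euthanasia recall = {euth_recall:.4f}, F1 (adoption) = {f1_adoption:.4f}")
```

A high F1 here reflects only the positive (adoption) class, so it does not reveal that the minority class is never predicted.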

## New Model, New Features:
1. Breed euthanasia frequency
2. Age upon arrival, in days
3. Days in shelter
4. Intake status: categorical, with dummy variables
5. Color (maybe)

### Breed (Euth Freq)
Create a column with the share of dogs euthanized, grouped by breed.

In [20]:
x = x.replace('Died','Euthanasia')

In [21]:
x['breed_freq'] = x.groupby('breed').outcome_type.transform('count')

In [22]:
x.head()

| | breed | age_upon_intake_age_group | count | outcome_type | breed_freq |
|---|---|---|---|---|---|
| 0 | Pointer Mix | (10.0, 12.5] | 1 | Adoption | 200 |
| 1 | Border Collie Mix | (15.0, 17.5] | 1 | Adoption | 366 |
| 2 | German Shepherd Mix | (10.0, 12.5] | 1 | Adoption | 964 |
| 3 | Labrador Retriever/German Shepherd | (7.5, 10.0] | 1 | Adoption | 82 |
| 4 | Chihuahua Shorthair Mix | (7.5, 10.0] | 1 | Adoption | 2414 |


In [23]:
euth_freq = x[x['outcome_type'] == 'Euthanasia'].groupby('breed').outcome_type.transform('count')

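Map `outcome_type` to Euthanasia = 1 and Adoption = 0, so that summing the column per breed counts euthanasias. Note that this is the reverse of the encoding used for `y`.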

In [30]:
x_ef = x.copy()
x_ef.outcome_type = x_ef.outcome_type.map({'Euthanasia':1, 'Adoption':0})
x_ef.head()

| | breed | age_upon_intake_age_group | count | outcome_type | breed_freq |
|---|---|---|---|---|---|
| 0 | Pointer Mix | (10.0, 12.5] | 1 | 0 | 200 |
| 1 | Border Collie Mix | (15.0, 17.5] | 1 | 0 | 366 |
| 2 | German Shepherd Mix | (10.0, 12.5] | 1 | 0 | 964 |
| 3 | Labrador Retriever/German Shepherd | (7.5, 10.0] | 1 | 0 | 82 |
| 4 | Chihuahua Shorthair Mix | (7.5, 10.0] | 1 | 0 | 2414 |


In [31]:
x_ef['euth_freq'] = x_ef.groupby('breed').outcome_type.transform('sum')
x_ef.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21989 entries, 0 to 21988
Data columns (total 6 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   breed                      21989 non-null  object
 1   age_upon_intake_age_group  21989 non-null  object
 2   count                      21989 non-null  int64 
 3   outcome_type               21989 non-null  int64 
 4   breed_freq                 21989 non-null  int64 
 5   euth_freq                  21989 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 1.0+ MB


In [32]:
x_ef['euth_rate'] = x_ef['euth_freq']/x_ef['breed_freq']
x_ef.head()

| | breed | age_upon_intake_age_group | count | outcome_type | breed_freq | euth_freq | euth_rate |
|---|---|---|---|---|---|---|---|
| 0 | Pointer Mix | (10.0, 12.5] | 1 | 0 | 200 | 3 | 0.015 |
| 1 | Border Collie Mix | (15.0, 17.5] | 1 | 0 | 366 | 14 | 0.038251 |
| 2 | German Shepherd Mix | (10.0, 12.5] | 1 | 0 | 964 | 42 | 0.043568 |
| 3 | Labrador Retriever/German Shepherd | (7.5, 10.0] | 1 | 0 | 82 | 5 | 0.060976 |
| 4 | Chihuahua Shorthair Mix | (7.5, 10.0] | 1 | 0 | 2414 | 144 | 0.059652 |
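
Because the outcome is coded 0/1, the per-breed euthanasia rate is the per-breed mean of that column, so the count, sum, and division steps can be collapsed into a single `transform('mean')`. The following is a minimal sketch on a made-up frame (the rows and the `toy` name are hypothetical, not shelter data) that checks both routes agree.

```python
import pandas as pd

# Hypothetical toy data; 1 = Euthanasia, 0 = Adoption.
toy = pd.DataFrame({
    "breed": ["Pointer Mix", "Pointer Mix", "Pit Bull Mix",
              "Pit Bull Mix", "Pit Bull Mix", "Beagle"],
    "outcome_type": [0, 1, 0, 0, 1, 1],
})

# Step-by-step version, mirroring the cells above.
grouped = toy.groupby("breed")["outcome_type"]
toy["breed_freq"] = grouped.transform("count")
toy["euth_freq"] = grouped.transform("sum")
toy["euth_rate"] = toy["euth_freq"] / toy["breed_freq"]

# One-step version: the mean of a 0/1 column is the rate.
toy["euth_rate_direct"] = grouped.transform("mean")

print(toy)
```

In this toy frame the single Beagle row gets a rate of 1.0, which illustrates that a breed seen only once can only ever have a rate of 0 or 1.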


In [47]:
x_ef[x_ef.euth_rate == 1.0]

| | breed | age_upon_intake_age_group | count | outcome_type | breed_freq | euth_freq | euth_rate |
|---|---|---|---|---|---|---|---|
| 8918 | Queensland Heeler/German Shepherd | (12.5, 15.0] | 1 | 1 | 1 | 1 | 1.0 |
| 9198 | Bulldog/Pit Bull | (2.5, 5.0] | 1 | 1 | 1 | 1 | 1.0 |
| 9546 | Rottweiler/Chow Chow | (10.0, 12.5] | 1 | 1 | 1 | 1 | 1.0 |
| 9791 | Boxer/Jack Russell Terrier | (-0.025, 2.5] | 1 | 1 | 1 | 1 | 1.0 |
| 10991 | Miniature Pinscher/Australian Cattle Dog | (-0.025, 2.5] | 1 | 1 | 1 | 1 | 1.0 |
| 11071 | Chow Chow/English Cocker Spaniel | (2.5, 5.0] | 1 | 1 | 1 | 1 | 1.0 |
| 11107 | Canaan Dog/Pit Bull | (-0.025, 2.5] | 1 | 1 | 1 | 1 | 1.0 |
| 11112 | Pit Bull/St. Bernard Smooth Coat | (-0.025, 2.5] | 1 | 1 | 1 | 1 | 1.0 |
| 11123 | Plott Hound/Australian Cattle Dog | (-0.025, 2.5] | 1 | 1 | 1 | 1 | 1.0 |
| 11157 | Pit Bull/Pit Bull | (-0.025, 2.5] | 1 | 1 | 1 | 1 | 1.0 |


### The Rest of The Features:
X is the new feature dataframe.

In [33]:
X = pd.DataFrame()
X['breed_euth_rate'] = x_ef['euth_rate']
X['age(days)'] = df['age_upon_intake_(days)']
X['time_in_shelt'] = df['time_in_shelter_days']
X['condition'] = df['intake_condition']
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21989 entries, 0 to 21988
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   breed_euth_rate  21989 non-null  float64
 1   age(days)        21989 non-null  int64  
 2   time_in_shelt    21989 non-null  float64
 3   condition        21989 non-null  object 
dtypes: float64(2), int64(1), object(1)
memory usage: 687.3+ KB


In [34]:
#need dummy variables for categorical condition
X = pd.concat([X,
               pd.get_dummies(X['condition'], drop_first = True)],
              axis = 1)
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21989 entries, 0 to 21988
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   breed_euth_rate  21989 non-null  float64
 1   age(days)        21989 non-null  int64  
 2   time_in_shelt    21989 non-null  float64
 3   condition        21989 non-null  object 
 4   Feral            21989 non-null  uint8  
 5   Injured          21989 non-null  uint8  
 6   Normal           21989 non-null  uint8  
 7   Nursing          21989 non-null  uint8  
 8   Other            21989 non-null  uint8  
 9   Pregnant         21989 non-null  uint8  
 10  Sick             21989 non-null  uint8  
dtypes: float64(2), int64(1), object(1), uint8(7)
memory usage: 837.6+ KB


In [35]:
X.drop('condition',axis=1,inplace=True)
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21989 entries, 0 to 21988
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   breed_euth_rate  21989 non-null  float64
 1   age(days)        21989 non-null  int64  
 2   time_in_shelt    21989 non-null  float64
 3   Feral            21989 non-null  uint8  
 4   Injured          21989 non-null  uint8  
 5   Normal           21989 non-null  uint8  
 6   Nursing          21989 non-null  uint8  
 7   Other            21989 non-null  uint8  
 8   Pregnant         21989 non-null  uint8  
 9   Sick             21989 non-null  uint8  
dtypes: float64(2), int64(1), uint8(7)
memory usage: 665.8 KB


## Scaler transform
The features are on very different scales (a rate between 0 and 1, ages and shelter stays in days, and 0/1 dummy variables), so a standard scaler is applied before fitting.

In [36]:
from sklearn import preprocessing
import numpy as np

scaler = preprocessing.StandardScaler().fit(X)

X_scaled = scaler.transform(X)
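
As a quick illustration of what the scaler does, the sketch below standardizes a small made-up matrix and compares the result with the formula z = (x - mean) / std computed by hand. The toy values are hypothetical; the one detail worth noting is that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: column 0 resembles a rate in [0, 1],
# column 1 resembles an age in days.
toy = np.array([[0.02, 30.0],
                [0.05, 365.0],
                [0.10, 3650.0]])

scaled = StandardScaler().fit_transform(toy)

# Same transformation by hand, column by column.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(scaled)
print(np.allclose(scaled, manual))
```

After scaling, each column has mean 0 and standard deviation 1, so no single feature dominates the fit because of its units.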

## LogReg model Fit/Assessment

In [37]:
from sklearn.linear_model import LogisticRegression

logreg_scale = LogisticRegression()

logreg_scale.fit(X_scaled, y)

LogisticRegression()

In [38]:
logreg_scale.score(X_scaled, y)

0.9438355541407067

In [39]:
from sklearn.metrics import confusion_matrix

y_pred_sc = logreg_scale.predict(X_scaled)
confusion_matrix(y,y_pred_sc)

array([[  334,  1077],
       [  158, 20420]])

In [48]:
print(f"F1: {f1_score(y, y_pred_sc)}")

F1: 0.9706476530005942
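
Accuracy and the adoption-class F1 say little about how well euthanasia is detected, so it can help to read precision and recall for class 0 straight from the confusion matrix. The sketch below hard-codes the counts reported above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix of the scaled model as reported above.
# Rows: true class, columns: predicted class; 0 = Euthanasia, 1 = Adoption.
cm = np.array([[334, 1077],
               [158, 20420]])

accuracy = np.trace(cm) / cm.sum()

# Euthanasia (class 0) treated as the class of interest.
euth_recall = cm[0, 0] / cm[0].sum()         # caught / all true euthanasia cases
euth_precision = cm[0, 0] / cm[:, 0].sum()   # caught / all predicted euthanasia

# Adoption (class 1) F1, matching the f1_score call above.
f1_adoption = 2 * cm[1, 1] / (2 * cm[1, 1] + cm[0, 1] + cm[1, 0])

print(f"accuracy = {accuracy:.4f}, F1 (adoption) = {f1_adoption:.4f}")
print(f"euthanasia recall = {euth_recall:.4f}, precision = {euth_precision:.4f}")
```

On these counts the model identifies 334 of the 1,411 euthanasia cases (about 24%), up from none in the baseline, even though overall accuracy rises by less than one percentage point.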


### Save DataFrame

In [40]:
import pickle

with open('shelter_dogs_scaler.pickle','wb') as f:
    pickle.dump(X_scaled, f)

with open('shelter_dogs.pickle','wb') as f:
    pickle.dump(X, f)

In [41]:
with open('shelter_dogs_y.pickle','wb') as f:
    pickle.dump(y, f)
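
For completeness, a pickled object is read back with `pickle.load` on a file opened in binary read mode. The sketch below round-trips a small stand-in Series through a temporary directory; `y_demo` is hypothetical and stands in for the real `y`, while the file name mirrors the one used above.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical stand-in for the real target Series.
y_demo = pd.Series([1, 0, 1], name="outcome_type")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shelter_dogs_y.pickle")

    # Write with 'wb', as in the cells above.
    with open(path, "wb") as f:
        pickle.dump(y_demo, f)

    # Read back with 'rb'.
    with open(path, "rb") as f:
        y_loaded = pickle.load(f)

print(y_loaded.equals(y_demo))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.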