# <a name="0">Machine Learning Accelerator - Tabular Data - Lecture 1</a>


## Final Project 

In this notebook, we build an ML model to predict the __Time at Center__ field of our final project dataset.

1. <a href="#1">Read the datasets</a> (Given) 
2. <a href="#2">Train a model</a> (Implement)
    * <a href="#21">Exploratory Data Analysis</a>
    * <a href="#22">Select features to build the model</a>
    * <a href="#23">Data processing</a>
    * <a href="#24">Model training</a>
3. <a href="#3">Make predictions on the test dataset</a> (Implement)
4. <a href="#4">Write the test predictions to a CSV file</a> (Given)

__Austin Animal Center Dataset__:

In this exercise, we work with pet adoption data from the __Austin Animal Center__. Two datasets cover the intake and outcome of animals: the intake data is available [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) and the outcome data [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238). 

To work with a single table, we joined the intake and outcome tables on the "Animal ID" column and created three files: training.csv, test_features.csv and y_test.csv. As with our review dataset, we excluded animals with multiple entries to the facility to keep things simple. The original datasets are available in the data/review folder: Austin_Animal_Center_Intakes.csv, Austin_Animal_Center_Outcomes.csv.
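
The join described above could look roughly like the following sketch. The two small frames are made-up stand-ins for the real intake and outcome tables, and the exact steps used to build training.csv are not shown in this notebook, so treat this as one plausible approach rather than the actual preparation script.

```python
import pandas as pd

# Toy stand-ins for the intake and outcome tables (values are made up).
intakes = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A2", "A3"],
    "Intake Type": ["Stray", "Stray", "Owner Surrender", "Public Assist"],
})
outcomes = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A2", "A3"],
    "Outcome Type": ["Adoption", "Transfer", "Adoption", "Return to Owner"],
})

# keep=False drops every row of an ID that appears more than once,
# so only animals with a single visit remain.
single_intakes = intakes.drop_duplicates(subset="Animal ID", keep=False)
single_outcomes = outcomes.drop_duplicates(subset="Animal ID", keep=False)

# An inner join keeps only animals present in both tables.
joined = single_intakes.merge(single_outcomes, on="Animal ID", how="inner")
print(joined)
```

Animal `A2` visits twice in the toy data, so it is removed before the join and only `A1` and `A3` remain.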

__Dataset schema:__ 
- __Pet ID__ - Unique ID of pet
- __Outcome Type__ - State of pet at the time of recording the outcome
- __Sex upon Outcome__ - Sex of pet at outcome
- __Name__ - Name of pet 
- __Found Location__ - Location where the pet was found before entering the center
- __Intake Type__ - Circumstances bringing the pet to the center
- __Intake Condition__ - Health condition of the pet when it entered the center
- __Pet Type__ - Type of pet
- __Sex upon Intake__ - Sex of the pet when it entered the center
- __Breed__ - Breed of pet 
- __Color__ - Color of pet 
- __Age upon Intake Days__ - Age of the pet when it entered the center (days)
- __Time at Center__ - Time at center (0 = less than 30 days; 1 = more than 30 days). This is the value to predict. 


## 1. <a name="1">Read the datasets</a> (Given)
(<a href="#0">Go to top</a>)

Let's read the datasets into DataFrames using pandas.

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
  
training_data = pd.read_csv('training.csv')
test_data = pd.read_csv('test_features.csv')

print('The shape of the training dataset is:', training_data.shape)
print('The shape of the test dataset is:', test_data.shape)
training_data

The shape of the training dataset is: (71538, 13)
The shape of the test dataset is: (23846, 12)


| | Pet ID | Outcome Type | Sex upon Outcome | Name | Found Location | Intake Type | Intake Condition | Pet Type | Sex upon Intake | Breed | Color | Age upon Intake Days | Time at Center |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A745079 | Transfer | Unknown | | 7920 Old Lockhart in Travis (TX) | Stray | Normal | Cat | Unknown | Domestic Shorthair Mix | Blue | 3 | 0 |
| 1 | A801765 | Transfer | Intact Female | | 5006 Table Top in Austin (TX) | Stray | Normal | Cat | Intact Female | Domestic Shorthair | Brown Tabby/White | 28 | 0 |
| 2 | A667965 | Transfer | Neutered Male | | 14100 Thermal Dr in Austin (TX) | Stray | Normal | Dog | Neutered Male | Chihuahua Shorthair Mix | Brown/Tan | 1825 | 0 |
| 3 | A687551 | Transfer | Intact Male | | 5811 Cedardale Dr in Austin (TX) | Stray | Normal | Cat | Intact Male | Domestic Shorthair Mix | Brown Tabby | 28 | 0 |
| 4 | A773004 | Adoption | Neutered Male | \*Boris | Highway 290 And Arterial A in Austin (TX) | Stray | Normal | Dog | Intact Male | Chihuahua Shorthair Mix | Tricolor/Cream | 365 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71533 | A705211 | Euthanasia | Neutered Male | Charlie | Austin (TX) | Public Assist | Normal | Dog | Neutered Male | St. Bernard Smooth Coat Mix | White/Red | 730 | 0 |
| 71534 | A782455 | Return to Owner | Neutered Male | Arlo | 124 West Anderson Lane in Austin (TX) | Stray | Normal | Cat | Neutered Male | Maine Coon | Brown Tabby | 1825 | 0 |
| 71535 | A757270 | Died | Spayed Female | | 3129 E 12Th St in Austin (TX) | Stray | Sick | Cat | Spayed Female | Domestic Shorthair Mix | Black | 3650 | 0 |
| 71536 | A737192 | Return to Owner | Neutered Male | Leo | 8701 Panadero Dr in Austin (TX) | Stray | Normal | Dog | Neutered Male | Miniature Poodle/Chihuahua Shorthair | White/Black | 365 | 0 |


## 2. <a name="2">Train a model</a> (Implement)
(<a href="#0">Go to top</a>)

 * <a href="#21">Exploratory Data Analysis</a>
 * <a href="#22">Select features to build the model</a>
 * <a href="#23">Data processing</a>
 * <a href="#24">Model training</a>

### 2.1 <a name="21">Exploratory Data Analysis</a> 
(<a href="#2">Go to Train a model</a>)

We look at the number of rows and columns and at some simple statistics of the dataset.

In [2]:
# Implement here
training_data.head()
training_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71538 entries, 0 to 71537
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Pet ID                71538 non-null  object
 1   Outcome Type          71533 non-null  object
 2   Sex upon Outcome      71537 non-null  object
 3   Name                  44360 non-null  object
 4   Found Location        71538 non-null  object
 5   Intake Type           71538 non-null  object
 6   Intake Condition      71538 non-null  object
 7   Pet Type              71538 non-null  object
 8   Sex upon Intake       71537 non-null  object
 9   Breed                 71538 non-null  object
 10  Color                 71538 non-null  object
 11  Age upon Intake Days  71538 non-null  int64 
 12  Time at Center        71538 non-null  int64 
dtypes: int64(2), object(11)
memory usage: 7.1+ MB
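
Beyond `info()`, two quick checks are often useful at this stage: the share of missing values per column and the balance of the target classes. The sketch below runs on a made-up toy frame, not on training.csv; only the column names are borrowed from the schema above.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names follow the dataset schema.
toy = pd.DataFrame({
    "Name": ["Boris", np.nan, np.nan, "Leo"],
    "Pet Type": ["Dog", "Cat", "Cat", "Dog"],
    "Time at Center": [0, 0, 1, 0],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Relative frequency of each target class (reveals class imbalance).
class_share = toy["Time at Center"].value_counts(normalize=True)
print(class_share)
```

On the real data, the same two calls on `training_data` would show how sparse `Name` is and how rare the positive class is.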


In [3]:
# Implement here
test_data.head()
training_data.drop(['Pet ID','Name','Found Location'], axis=1, inplace=True)
training_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71538 entries, 0 to 71537
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Outcome Type          71533 non-null  object
 1   Sex upon Outcome      71537 non-null  object
 2   Intake Type           71538 non-null  object
 3   Intake Condition      71538 non-null  object
 4   Pet Type              71538 non-null  object
 5   Sex upon Intake       71537 non-null  object
 6   Breed                 71538 non-null  object
 7   Color                 71538 non-null  object
 8   Age upon Intake Days  71538 non-null  int64 
 9   Time at Center        71538 non-null  int64 
dtypes: int64(2), object(8)
memory usage: 5.5+ MB


### 2.2 <a name="22">Select features to build the model</a> 
(<a href="#2">Go to Train a model</a>)


In [4]:
# Implement here
features = ['Pet Type','Intake Condition','Outcome Type',
           'Intake Type','Sex upon Intake','Color']

### 2.3 <a name="23">Data processing</a> 
(<a href="#2">Go to Train a model</a>)


In [5]:
# Implement here
columns = ['Pet Type','Intake Condition','Outcome Type',
           'Intake Type','Sex upon Intake']

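# Drop rows with a missing pet type (calling dropna on the selected column alone would not modify the DataFrame)
training_data.dropna(subset=['Pet Type'], inplace=True)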
training_data.drop(training_data[training_data['Pet Type'] == 'Other'].index, inplace=True)

for label in columns:
    training_data = training_data.join(pd.get_dummies(training_data[label]))
    training_data = training_data.drop(label, axis=1)
    
# training_data.drop(columns=['Unknown'], inplace=True)
training_data.head()

Unnamed: 0,Sex upon Outcome,Breed,Color,Age upon Intake Days,Time at Center,Bird,Cat,Dog,Livestock,Aged,...,Euthanasia Request,Owner Surrender,Public Assist,Stray,Wildlife,Intact Female,Intact Male,Neutered Male,Spayed Female,Unknown
0,Unknown,Domestic Shorthair Mix,Blue,3,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,1
1,Intact Female,Domestic Shorthair,Brown Tabby/White,28,0,0,1,0,0,0,...,0,0,0,1,0,1,0,0,0,0
2,Neutered Male,Chihuahua Shorthair Mix,Brown/Tan,1825,0,0,0,1,0,0,...,0,0,0,1,0,0,0,1,0,0
3,Intact Male,Domestic Shorthair Mix,Brown Tabby,28,0,0,1,0,0,0,...,0,0,0,1,0,0,1,0,0,0
4,Neutered Male,Chihuahua Shorthair Mix,Tricolor/Cream,365,0,0,0,1,0,0,...,0,0,0,1,0,0,1,0,0,0
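
Because the dummy columns above carry only the category value as their name, it is hard to tell which source column `Other` or `Unknown` came from, and two source columns sharing a category name would collide. A possible alternative, shown here on a made-up toy frame, is to let `pd.get_dummies` prefix each new column with its source column name.

```python
import pandas as pd

# Toy frame with made-up values; column names follow the dataset schema.
toy = pd.DataFrame({
    "Intake Condition": ["Normal", "Other", "Sick"],
    "Sex upon Intake": ["Unknown", "Intact Male", "Unknown"],
    "Age upon Intake Days": [3, 28, 365],
})

# Passing columns= encodes those columns in place and prefixes each
# dummy with "<source column>_"; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(
    toy, columns=["Intake Condition", "Sex upon Intake"], dtype=int
)
print(encoded.columns.tolist())
```

The resulting names, such as `Intake Condition_Other` and `Sex upon Intake_Unknown`, stay unambiguous even when categories overlap.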


In [6]:
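# numeric_only=True skips the remaining text columns (Sex upon Outcome, Breed, Color)
training_data.corr(numeric_only=True).style.background_gradient(cmap='tab20c')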

Unnamed: 0,Age upon Intake Days,Time at Center,Bird,Cat,Dog,Livestock,Aged,Behavior,Feral,Injured,Medical,Normal,Nursing,Other,Pregnant,Sick,Adoption,Died,Disposal,Euthanasia,Missing,Relocate,Return to Owner,Rto-Adopt,Transfer,Abandoned,Euthanasia Request,Owner Surrender,Public Assist,Stray,Wildlife,Intact Female,Intact Male,Neutered Male,Spayed Female,Unknown
Age upon Intake Days,1.0,-0.101255,-0.016006,-0.217463,0.219882,-0.00539,0.217188,3.7e-05,0.001237,0.057481,0.006911,-0.044824,-0.115702,0.021229,0.003024,0.047344,-0.166864,-0.017172,-0.001991,0.117197,-0.005317,0.004333,0.316207,0.036789,-0.112861,-0.008181,0.109639,0.113745,0.115966,-0.175933,-0.00764,-0.208448,-0.199698,0.358836,0.344287,-0.107674
Time at Center,-0.101255,1.0,-0.022873,0.168103,-0.164421,0.006671,-0.01295,-0.001232,-0.001649,-0.011677,-0.001588,-0.023557,0.070633,-0.005033,-0.008356,-0.00642,0.257012,-0.016147,-0.010428,-0.047188,0.022064,0.001238,-0.095103,0.00059,-0.173394,-0.005301,-0.014064,-0.037457,-0.043664,0.059992,-0.009211,0.053397,0.054087,-0.062165,-0.051062,-0.073054
Bird,-0.016006,-0.022873,1.0,-0.071852,-0.087713,-0.001113,-0.005084,-0.000309,-0.002491,0.088596,-0.001235,-0.047299,-0.015155,0.017462,-0.002095,-0.007474,-0.014366,0.022057,0.088419,0.072589,0.0064,0.119565,-0.022487,-0.005122,-0.011007,-0.002548,0.008095,-0.012842,0.059676,-0.056247,0.421771,-0.038011,-0.012593,-0.029045,-0.027186,0.189322
Cat,-0.217463,0.168103,-0.071852,1.0,-0.986877,-0.012528,-0.042038,-0.003474,0.027914,0.014084,-0.006126,-0.072364,0.085414,-0.001278,-0.009816,0.04237,-0.023852,0.054194,0.013681,0.031309,0.007035,-0.007079,-0.276098,-0.025635,0.211145,-0.009805,-0.020579,0.007464,-0.137279,0.074334,-0.030305,0.041035,-0.015035,-0.091816,-0.0754,0.180738
Dog,0.219882,-0.164421,-0.087713,-0.986877,1.0,-0.015293,0.042818,0.00352,-0.027468,-0.028106,0.006321,0.079665,-0.082811,-0.001491,0.010148,-0.04105,0.026095,-0.057602,-0.027752,-0.042769,-0.008038,-0.011997,0.279155,0.026443,-0.209013,0.010211,0.01928,-0.005315,0.127674,-0.065401,-0.036995,-0.035063,0.017135,0.096377,0.07977,-0.210877
Livestock,-0.00539,0.006671,-0.001113,-0.012528,-0.015293,1.0,-0.000886,-5.4e-05,-0.000434,-0.003246,-0.000215,0.005323,-0.002642,-0.000624,-0.000365,-0.002583,0.00061,-0.001406,-0.000517,-0.00277,-0.000319,-0.000152,0.005983,-0.000893,-0.00356,-0.000444,-0.000662,-0.003243,-0.00337,0.004878,-0.00047,0.005177,-0.003958,-0.001726,-0.00474,0.006423
Aged,0.217188,-0.01295,-0.005084,-0.042038,0.042818,-0.000886,1.0,-0.000246,-0.001983,-0.014822,-0.000983,-0.166522,-0.012065,-0.002849,-0.001668,-0.011796,-0.035011,0.003017,-0.00236,0.055109,0.008856,-0.000695,0.043627,-0.000386,-0.019367,-0.002028,0.150989,0.000655,0.001114,-0.018081,-0.002144,-0.028581,-0.029463,0.045869,0.050229,-0.009541
Behavior,3.7e-05,-0.001232,-0.000309,-0.003474,0.00352,-5.4e-05,-0.000246,1.0,-0.00012,-0.0009,-6e-05,-0.010114,-0.000733,-0.000173,-0.000101,-0.000716,0.004361,-0.00039,-0.000143,-0.000768,-8.8e-05,-4.2e-05,-0.001643,-0.000248,-0.002853,-0.000123,-0.000184,0.00876,-0.000935,-0.007223,-0.00013,-0.002899,0.005067,-0.001404,-0.001315,-0.000903
Feral,0.001237,-0.001649,-0.002491,0.027914,-0.027468,-0.000434,-0.001983,-0.00012,1.0,-0.007261,-0.000482,-0.08158,-0.005911,-0.001396,-0.000817,-0.005779,-0.016025,0.001656,-0.001156,0.001324,-0.000713,-0.000341,-0.006593,-0.001998,0.021135,-0.000994,-0.001482,-0.004657,-0.003341,0.0063,-0.001051,-0.007398,0.008051,0.000621,-0.00274,0.001379
Injured,0.057481,-0.011677,0.088596,0.014084,-0.028106,-0.003246,-0.014822,-0.0009,-0.007261,1.0,-0.003601,-0.609782,-0.044181,-0.010431,-0.006108,-0.043197,-0.076026,0.065785,0.06798,0.271262,-0.002371,0.022184,-0.02155,0.007298,-0.033754,-0.005306,0.001747,-0.059575,-0.029434,0.059013,0.126612,-0.026264,0.004064,0.019924,-0.009229,0.031913


In [12]:
import seaborn as sns
import matplotlib.pyplot as plt

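# Absolute correlation of every numeric column with the target
cor = training_data.corr(numeric_only=True)
cor_target = cor["Time at Center"].abs()
print(cor_target)
# Select features whose absolute correlation with the target exceeds 0.1
# (the target itself, with correlation 1.0, also passes this filter)
relevant_features = cor_target[cor_target > 0.1]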
print("---------------------")
print(relevant_features)

Age upon Intake Days    0.101255
Time at Center          1.000000
Bird                    0.022873
Cat                     0.168103
Dog                     0.164421
Livestock               0.006671
Aged                    0.012950
Behavior                0.001232
Feral                   0.001649
Injured                 0.011677
Medical                 0.001588
Normal                  0.023557
Nursing                 0.070633
Other                   0.005033
Pregnant                0.008356
Sick                    0.006420
Adoption                0.257012
Died                    0.016147
Disposal                0.010428
Euthanasia              0.047188
Missing                 0.022064
Relocate                0.001238
Return to Owner         0.095103
Rto-Adopt               0.000590
Transfer                0.173394
Abandoned               0.005301
Euthanasia Request      0.014064
Owner Surrender         0.037457
Public Assist           0.043664
Stray                   0.059992
Wildlife  
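
The threshold filter above also keeps the target itself, since its correlation with itself is 1.0. A small sketch of the same idea that drops the target before filtering follows; the toy values are made up and the 0.1 threshold is simply reused from the cell above.

```python
import pandas as pd

# Toy one-hot style frame with made-up values.
toy = pd.DataFrame({
    "Cat": [1, 0, 1, 0, 1, 0],
    "Adoption": [1, 1, 0, 0, 1, 0],
    "Bird": [1, 1, 0, 0, 0, 0],
    "Time at Center": [1, 0, 1, 0, 1, 0],
})
target = "Time at Center"

# Correlation of each feature with the target, target row removed.
cor_target = toy.corr(numeric_only=True)[target].drop(target).abs()

# Keep features above the threshold, strongest first.
relevant = cor_target[cor_target > 0.1].sort_values(ascending=False)
print(relevant)
```

Note that a linear correlation threshold is only a rough screening heuristic for one-hot features; it says nothing about interactions between features.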

### 2.4 <a name="24">Model training</a> 
(<a href="#2">Go to Train a model</a>)


In [31]:
# Implement here
x = training_data.drop(columns=['Breed','Sex upon Outcome','Time at Center','Color'])
y = training_data['Time at Center']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [32]:
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

classifier = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler()),
    ('estimator', KNeighborsClassifier(n_neighbors = 6))
])
classifier.fit(X_train, y_train)
# tune your parameters using the validation dataset
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score
# Use the fitted model to make predictions on the train dataset
# As the train data goes through the Pipeline, it is first imputed (with the train means),
# then scaled (with the train min/max), and finally used to make predictions
train_predictions = classifier.predict(X_train)
print('Model performance on the train set:')
print(confusion_matrix(y_train, train_predictions))
print(classification_report(y_train, train_predictions))
print("Train accuracy:", accuracy_score(y_train, train_predictions))

Model performance on the train set:
[[48199   409]
 [ 2535  2437]]
              precision    recall  f1-score   support

           0       0.95      0.99      0.97     48608
           1       0.86      0.49      0.62      4972

    accuracy                           0.95     53580
   macro avg       0.90      0.74      0.80     53580
weighted avg       0.94      0.95      0.94     53580

Train accuracy: 0.9450541246733856
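
The scores above are measured on the same rows the model was fitted on, so they are likely optimistic. A minimal sketch of tuning `n_neighbors` on a held-out validation split is shown below. It uses synthetic data from `make_classification` as a stand-in for the pet features, and the candidate values of `k` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced binary data standing in for the real features.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit one pipeline per candidate k and score it on the validation split.
val_scores = {}
for k in (3, 6, 9, 15):
    model = Pipeline([
        ("scaler", MinMaxScaler()),
        ("estimator", KNeighborsClassifier(n_neighbors=k)),
    ])
    model.fit(X_tr, y_tr)
    val_scores[k] = f1_score(y_val, model.predict(X_val), zero_division=0)

best_k = max(val_scores, key=val_scores.get)
print(val_scores, "best k:", best_k)
```

F1 on the positive class is used here instead of accuracy because, with roughly 90% of rows in class 0, accuracy alone can look high even when class 1 is mostly missed.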


In [33]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
x = training_data.drop(columns=['Breed','Sex upon Outcome','Time at Center','Color'])
y = training_data['Time at Center']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
model1 = RandomForestRegressor(n_estimators=100, random_state=42)
model1.fit(X_train, y_train)
# For a regressor, score() returns R^2 on the held-out split, not classification accuracy
model1.score(X_test, y_test)

0.419322420014836
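
Since __Time at Center__ is a binary label, a classifier is arguably a more natural fit than a regressor. The sketch below swaps in `RandomForestClassifier` on synthetic data; it is an illustration of the alternative, not a result obtained on this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data standing in for the real features.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)

# predict() returns class labels (0 or 1) rather than continuous values.
val_pred = clf.predict(X_val)
val_accuracy = accuracy_score(y_val, val_pred)
val_f1 = f1_score(y_val, val_pred, zero_division=0)
print("accuracy:", val_accuracy, "f1:", val_f1)
```

For a classifier, `clf.score(X_val, y_val)` is the mean accuracy, so it is directly comparable to the KNN accuracy reported earlier.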

## 3. <a name="3">Make predictions on the test dataset</a> (Implement)
(<a href="#0">Go to top</a>)

Use the test set to make predictions with the trained model.

In [34]:
test_data.columns

Index(['Pet ID', 'Outcome Type', 'Sex upon Outcome', 'Name', 'Found Location',
       'Intake Type', 'Intake Condition', 'Pet Type', 'Sex upon Intake',
       'Breed', 'Color', 'Age upon Intake Days'],
      dtype='object')

In [35]:
# Implement here
test_data.drop(['Pet ID','Name','Found Location'], axis=1, inplace=True)

features = ['Pet Type','Intake Condition','Outcome Type',
           'Intake Type','Sex upon Intake','Color']

columns = ['Pet Type','Intake Condition','Outcome Type',
           'Intake Type','Sex upon Intake']

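# Drop rows with a missing pet type (calling dropna on the selected column alone would not modify the DataFrame)
test_data.dropna(subset=['Pet Type'], inplace=True)
# Note: removing 'Other' pets also removes those rows from the predictions
test_data.drop(test_data[test_data['Pet Type'] == 'Other'].index, inplace=True)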

for label in columns:
    test_data = test_data.join(pd.get_dummies(test_data[label]))
    test_data = test_data.drop(label, axis=1)
    
test_data

Unnamed: 0,Sex upon Outcome,Breed,Color,Age upon Intake Days,Bird,Cat,Dog,Livestock,Aged,Behavior,...,Euthanasia Request,Owner Surrender,Public Assist,Stray,Wildlife,Intact Female,Intact Male,Neutered Male,Spayed Female,Unknown
0,Spayed Female,Labrador Retriever Mix,Black,60,0,0,1,0,0,0,...,0,0,0,1,0,1,0,0,0,0
1,Neutered Male,Boxer/Anatol Shepherd,Brown/Tricolor,60,0,0,1,0,0,0,...,0,0,0,1,0,0,1,0,0,0
2,Neutered Male,Australian Cattle Dog/Pit Bull,Black/White,3285,0,0,1,0,0,0,...,0,0,1,0,0,0,0,1,0,0
3,Spayed Female,Miniature Poodle,Gray,1825,0,0,1,0,0,0,...,0,0,0,1,0,1,0,0,0,0
4,Neutered Male,Domestic Shorthair,Blue/White,210,0,1,0,0,0,0,...,0,1,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23840,Spayed Female,Pit Bull Mix,Black/White,730,0,0,1,0,0,0,...,0,0,0,1,0,1,0,0,0,0
23841,Neutered Male,Miniature Schnauzer Mix,Tan/Gray,1460,0,0,1,0,0,0,...,0,0,0,1,0,0,0,1,0,0
23842,Neutered Male,American Pit Bull Terrier Mix,Brown,60,0,0,1,0,0,0,...,0,0,0,1,0,0,1,0,0,0
23844,Neutered Male,Pointer Mix,Black/White,730,0,0,1,0,0,0,...,0,0,0,1,0,0,1,0,0,0
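
Encoding the training and test sets separately only works if both happen to contain the same categories in the same order. One hedged way to guard against a mismatch is to reindex the encoded test frame to the training columns, as in this toy sketch with made-up values.

```python
import pandas as pd

# Toy frames with made-up values: each set has a category the other lacks.
train = pd.DataFrame({"Intake Type": ["Stray", "Wildlife", "Stray"]})
test = pd.DataFrame({"Intake Type": ["Stray", "Abandoned"]})

train_encoded = pd.get_dummies(train, columns=["Intake Type"], dtype=int)
test_encoded = pd.get_dummies(test, columns=["Intake Type"], dtype=int)

# Reindex to the training columns: categories missing from the test set
# are added as all-zero columns, unseen test categories are dropped,
# and the column order matches what the model was fitted on.
test_aligned = test_encoded.reindex(
    columns=train_encoded.columns, fill_value=0
)
print(test_aligned)
```

In this notebook the equivalent step would reindex the encoded `test_data` to the columns of the frame passed to `classifier.fit`.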


In [36]:
x = test_data.drop(columns=['Breed','Sex upon Outcome','Color'])
predictions = classifier.predict(x)
predictions = list(predictions)
print(predictions[0:20])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


## 4. <a name="4">Write the test predictions to a CSV file</a> (Given)
(<a href="#0">Go to top</a>)


In [37]:
import csv
with open('predictions.csv', 'w', newline="") as f:
    writer = csv.writer(f)  # create the writer once, outside the loop
    for item in predictions:
        writer.writerow([item])