![MLU Logo](data/MLU_Logo.png)

# <a name="0">Machine Learning Accelerator - Tabular Data - Lecture 1</a>


## Final Project 

In this notebook, we build an ML model to predict the __Time at Center__ field of our final project dataset.

1. <a href="#1">Read the dataset</a> (Given) 
2. <a href="#2">Train a model</a> (Implement)
    * <a href="#21">Exploratory Data Analysis</a>
    * <a href="#22">Select features to build the model</a>
    * <a href="#23">Data processing</a>
    * <a href="#24">Model training</a>
3. <a href="#3">Make predictions on the test dataset</a> (Implement)
4. <a href="#4">Write the test predictions to a CSV file</a> (Given)

__Austin Animal Center Dataset__:

In this exercise, we are working with pet adoption data from __Austin Animal Center__. We have two datasets that cover intake and outcome of animals. Intake data is available from [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) and outcome is from [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238). 

To work with a single table, we joined the intake and outcome tables on the "Animal ID" column and created three files: training.csv, test_features.csv, and y_test.csv. As with our review dataset, we excluded animals with multiple entries to the facility to keep things simple. The original datasets are available in the data/review folder: Austin_Animal_Center_Intakes.csv and Austin_Animal_Center_Outcomes.csv.
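
The join described above can be sketched on toy data. This is a minimal illustration, not the script that produced the project files: the two small frames and their values are made up, and only the "Animal ID" key comes from the text.

```python
import pandas as pd

# Toy stand-ins for the intake and outcome tables (values are made up).
intakes = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A2", "A3"],
    "Intake Type": ["Stray", "Stray", "Owner Surrender", "Public Assist"],
})
outcomes = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A2", "A3"],
    "Outcome Type": ["Adoption", "Transfer", "Adoption", "Return to Owner"],
})

# Keep only animals that appear exactly once in each table
# (keep=False drops every row of a duplicated ID).
single_intakes = intakes.drop_duplicates(subset="Animal ID", keep=False)
single_outcomes = outcomes.drop_duplicates(subset="Animal ID", keep=False)

# Inner join on the shared key to get one row per animal.
joined = single_intakes.merge(single_outcomes, on="Animal ID", how="inner")
print(joined)
```

Animal "A2" has two visits in the toy data, so it is removed before the join and only "A1" and "A3" remain.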

__Dataset schema:__ 
- __Pet ID__ - Unique ID of pet
- __Outcome Type__ - State of pet at the time of recording the outcome
- __Sex upon Outcome__ - Sex of pet at outcome
- __Name__ - Name of pet 
- __Found Location__ - Location where the pet was found before entering the center
- __Intake Type__ - Circumstances bringing the pet to the center
- __Intake Condition__ - Health condition of the pet when it entered the center
- __Pet Type__ - Type of pet
- __Sex upon Intake__ - Sex of the pet when it entered the center
- __Breed__ - Breed of pet
- __Color__ - Color of pet
- __Age upon Intake Days__ - Age of the pet when it entered the center (days)
- __Time at Center__ - Time at center (0 = less than 30 days; 1 = more than 30 days). This is the value to predict.


## 1. <a name="1">Read the datasets</a> (Given)
(<a href="#0">Go to top</a>)

Let's read the datasets into dataframes, using Pandas.

In [175]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")
  
tr = pd.read_csv('training.csv')
tr.columns = tr.columns.str.strip()
te = pd.read_csv('test_features.csv')
te.columns = te.columns.str.strip()

print('The shape of the training dataset is:', tr.shape)
print('The shape of the test dataset is:', te.shape)
tr

The shape of the training dataset is: (71538, 13)
The shape of the test dataset is: (23846, 12)


| | Pet ID | Outcome_Type | Sex_upon_Outcome | Name | Found_Location | Intake_Type | Intake_Condition | Pet_Type | Sex_upon_Intake | Breed | Color | Age_upon_Intake Days | Time_at_Center |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A745079 | Transfer | Unknown | | 7920 Old Lockhart in Travis (TX) | Stray | Normal | Cat | Unknown | Domestic Shorthair Mix | Blue | 3 | 0 |
| 1 | A801765 | Transfer | Intact Female | | 5006 Table Top in Austin (TX) | Stray | Normal | Cat | Intact Female | Domestic Shorthair | Brown Tabby/White | 28 | 0 |
| 2 | A667965 | Transfer | Neutered Male | | 14100 Thermal Dr in Austin (TX) | Stray | Normal | Dog | Neutered Male | Chihuahua Shorthair Mix | Brown/Tan | 1825 | 0 |
| 3 | A687551 | Transfer | Intact Male | | 5811 Cedardale Dr in Austin (TX) | Stray | Normal | Cat | Intact Male | Domestic Shorthair Mix | Brown Tabby | 28 | 0 |
| 4 | A773004 | Adoption | Neutered Male | *Boris | Highway 290 And Arterial A in Austin (TX) | Stray | Normal | Dog | Intact Male | Chihuahua Shorthair Mix | Tricolor/Cream | 365 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71533 | A705211 | Euthanasia | Neutered Male | Charlie | Austin (TX) | Public Assist | Normal | Dog | Neutered Male | St. Bernard Smooth Coat Mix | White/Red | 730 | 0 |
| 71534 | A782455 | Return to Owner | Neutered Male | Arlo | 124 West Anderson Lane in Austin (TX) | Stray | Normal | Cat | Neutered Male | Maine Coon | Brown Tabby | 1825 | 0 |
| 71535 | A757270 | Died | Spayed Female | | 3129 E 12Th St in Austin (TX) | Stray | Sick | Cat | Spayed Female | Domestic Shorthair Mix | Black | 3650 | 0 |
| 71536 | A737192 | Return to Owner | Neutered Male | Leo | 8701 Panadero Dr in Austin (TX) | Stray | Normal | Dog | Neutered Male | Miniature Poodle/Chihuahua Shorthair | White/Black | 365 | 0 |


## 2. <a name="2">Train a model</a> (Implement)
(<a href="#0">Go to top</a>)

 * <a href="#21">Exploratory Data Analysis</a>
 * <a href="#22">Select features to build the model</a>
 * <a href="#23">Data processing</a>
 * <a href="#24">Model training</a>

### 2.1 <a name="21">Exploratory Data Analysis</a> 
(<a href="#2">Go to Train a model</a>)

We look at the number of rows and columns and at some simple statistics of the dataset.
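
As a minimal sketch of these checks, the block below runs them on a tiny made-up frame (`toy` and its values are assumptions for illustration; the column names follow the training file). The same calls apply to `tr`.

```python
import pandas as pd

# Tiny made-up frame that mimics three columns of the training data.
toy = pd.DataFrame({
    "Pet_Type": ["Cat", "Dog", "Dog", "Bird"],
    "Age_upon_Intake Days": [28, 365, 1825, 90],
    "Time_at_Center": [0, 0, 1, 0],
})

# Number of rows and columns.
n_rows, n_cols = toy.shape
print("rows:", n_rows, "columns:", n_cols)

# Column types and non-null counts.
toy.info()

# Summary statistics for the numerical columns.
print(toy.describe())

# Frequency of each category in a categorical column.
print(toy["Pet_Type"].value_counts())

# Share of each target class, useful for spotting class imbalance.
class_share = toy["Time_at_Center"].value_counts(normalize=True)
print(class_share)
```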

In [None]:
def convert(name):
    """Return the sorted distinct values of tr[name] and their integer codes."""
    target = list(tr[name].value_counts().sort_index().index)
    length = list(range(len(target)))
    return target, length

def convert_s(name):
    """Reduce tr[name] to its first word, then return the sorted distinct
    values and their integer codes. Modifies tr[name] in place."""
    # Keep the text before the first '/' and then before the first space,
    # e.g. 'Brown Tabby/White' -> 'Brown'
    tr[name] = [value.split('/')[0].split(' ')[0] for value in tr[name]]
    result = list(tr[name].value_counts().sort_index().index)
    length = list(range(len(result)))
    return result, length

In [None]:
temp = []
for i in tr['Color'].value_counts().index:
    i = i.split('/')[0]
    i = i.split(' ')[0]
    temp.append(i)
print(set(temp))

In [179]:
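# Count how often each breed occurs (named breed_counts to avoid shadowing the built-in dict)
breed_counts = tr['Breed'].value_counts()
temp = []
for i in breed_counts.index:
    temp.append((i, breed_counts[i]))
print(temp)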
# drop_list = []
# for i in temp:
#     if i[1] < 1000:
#         drop_list.append(i[0])
# print(drop_list)
# tr['Color'].dropna(how='any', inplace=True)

[('Domestic Shorthair Mix', 20676), ('Domestic Shorthair', 3796), ('Pit Bull Mix', 3770), ('Chihuahua Shorthair Mix', 3708), ('Labrador Retriever Mix', 3631), ('Domestic Medium Hair Mix', 2060), ('German Shepherd Mix', 1537), ('Bat Mix', 1279), ('Domestic Longhair Mix', 1001), ('Bat', 969), ('Siamese Mix', 866), ('Australian Cattle Dog Mix', 808), ('Dachshund Mix', 623), ('Pit Bull', 569), ('Labrador Retriever', 544), ('Chihuahua Shorthair', 536), ('Border Collie Mix', 486), ('Miniature Poodle Mix', 481), ('Boxer Mix', 460), ('Domestic Medium Hair', 449), ('Raccoon Mix', 406), ('German Shepherd', 397), ('Yorkshire Terrier Mix', 395), ('Australian Shepherd Mix', 392), ('Rat Terrier Mix', 345), ('Great Pyrenees Mix', 339), ('Catahoula Mix', 336), ('Miniature Schnauzer Mix', 332), ('Chihuahua Longhair Mix', 327), ('Jack Russell Terrier Mix', 321), ('Beagle Mix', 316), ('Cairn Terrier Mix', 301), ('Siberian Husky Mix', 295), ('Shih Tzu Mix', 265), ('Staffordshire Mix', 263), ('Pointer Mix'

In [163]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")
  
tr = pd.read_csv('training.csv')
tr.columns = tr.columns.str.strip()


# Drop the identifier and free-text columns
tr.drop(['Pet ID', 'Name', 'Found_Location'], axis=1, inplace=True)

# Drop rows with a missing pet type, then rows whose pet type is 'Other'
# (calling dropna on the column alone would leave tr unchanged)
tr.dropna(subset=['Pet_Type'], inplace=True)
tr.drop(tr[tr['Pet_Type'] == 'Other'].index, inplace=True)


columns = ['Pet_Type','Outcome_Type','Intake_Condition',
           'Intake_Type','Sex_upon_Intake']
for term in columns:
    tr = tr.join(pd.get_dummies(tr[term]))
    tr = tr.drop([term], axis=1)
tr.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 66975 entries, 0 to 71537
Data columns (total 39 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Sex_upon_Outcome      66974 non-null  object
 1   Breed                 66975 non-null  object
 2   Color                 66975 non-null  object
 3   Age_upon_Intake Days  66975 non-null  int64 
 4   Time_at_Center        66975 non-null  int64 
 5   Bird                  66975 non-null  uint8 
 6   Cat                   66975 non-null  uint8 
 7   Dog                   66975 non-null  uint8 
 8   Livestock             66975 non-null  uint8 
 9   Adoption              66975 non-null  uint8 
 10  Died                  66975 non-null  uint8 
 11  Disposal              66975 non-null  uint8 
 12  Euthanasia            66975 non-null  uint8 
 13  Missing               66975 non-null  uint8 
 14  Relocate              66975 non-null  uint8 
 15  Return to Owner       66975 non-null

In [None]:
lists = ['Age_upon_Intake Days',
       'Time_at_Center', 'Bird', 'Cat', 'Dog', 'Livestock', 'Adoption', 'Died',
       'Disposal', 'Euthanasia', 'Missing', 'Relocate', 'Return to Owner',
       'Rto-Adopt', 'Transfer', 'Aged', 'Behavior', 'Feral', 'Injured',
       'Medical', 'Normal', 'Nursing', 'Other', 'Pregnant', 'Sick',
       'Abandoned', 'Euthanasia Request', 'Owner Surrender', 'Public Assist',
       'Stray', 'Wildlife', 'Intact Female', 'Intact Male', 'Neutered Male',
       'Spayed Female', 'Unknown']

In [None]:
# Drop rows with a missing value in any column
tr = tr.dropna()
columns = ['Outcome_Type', 'Sex_upon_Outcome','Intake_Type',
           'Intake_Condition', 'Pet_Type', 'Sex_upon_Intake',
       'Age_upon_Intake Days', 'Time_at_Center'] 

# Replace each distinct value with its position in the sorted list of distinct values
for i in columns:
    values, codes = convert(i)
    tr[i] = tr[i].replace(values, codes).astype(int)

# Reduce Color and Breed to their first word, then encode them the same way
# (each helper is called once, since it modifies the column in place)
for i in ['Color', 'Breed']:
    values, codes = convert_s(i)
    tr[i] = tr[i].replace(values, codes)

tr

In [None]:
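# Implement here
# head() must be printed to show up when it is not the last expression in the cell;
# info() prints its report itself and returns None
print(te.head())
te.info()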

In [None]:
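# Correlate numeric columns only; text columns such as Breed and Color cannot be correlated
tr.select_dtypes('number').corr().style.background_gradient(cmap='tab20c')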

### 2.2 <a name="22">Select features to build the model</a> 
(<a href="#2">Go to Train a model</a>)


In [None]:
# Candidate feature sets tried during selection. Only the last one is active.
# columns = ['Outcome_Type', 'Sex_upon_Outcome', 'Name','Intake_Type',
#            'Intake_Condition', 'Pet_Type', 'Sex_upon_Intake',
#            'Breed', 'Color', 'Age_upon_Intake Days', 'Time_at_Center']

# columns = ['Outcome_Type', 'Sex_upon_Outcome','Intake_Type',
#            'Intake_Condition', 'Pet_Type', 'Sex_upon_Intake',
#            'Breed', 'Color', 'Age_upon_Intake Days']

columns = ['Outcome_Type', 'Sex_upon_Outcome','Intake_Condition',
           'Intake_Type','Sex_upon_Intake','Age_upon_Intake Days']

In [164]:
# Implement here
from sklearn.model_selection import train_test_split

# Hold out 10% of the rows as a validation set (distinct from test_features.csv)
train_data, val_data = train_test_split(tr, test_size=0.1, shuffle=True, random_state=23)
X_train = train_data[['Age_upon_Intake Days', 'Bird', 'Cat',
                      'Dog', 'Livestock', 'Adoption', 'Died',
       'Disposal', 'Euthanasia', 'Missing', 'Relocate', 'Return to Owner',
       'Rto-Adopt', 'Transfer', 'Aged', 'Behavior', 'Feral', 'Injured',
       'Medical', 'Normal', 'Nursing', 'Other', 'Pregnant', 'Sick',
       'Abandoned', 'Euthanasia Request', 'Owner Surrender', 'Public Assist',
       'Stray', 'Wildlife', 'Intact Female', 'Intact Male', 'Neutered Male',
       'Spayed Female', 'Unknown']].values #Selected Columns
# X_train = train_data[
#     ['Sex_upon_Outcome','Intake_Type','Intake_Condition']
# ]
y_train = train_data['Time_at_Center'].tolist()
# numerical_features = ...

### 2.3 <a name="23">Data Processing</a> 
(<a href="#2">Go to Train a model</a>)


In [None]:
def convert_te(name):
    """Return the sorted distinct values of te[name] and their integer codes."""
    target = list(te[name].value_counts().sort_index().index)
    length = list(range(len(target)))
    return target, length

def convert_te_s(name):
    """Reduce te[name] to its first word, then return the sorted distinct
    values and their integer codes. Modifies te[name] in place."""
    # Keep the text before the first '/' and then before the first space
    te[name] = [value.split('/')[0].split(' ')[0] for value in te[name]]
    result = list(te[name].value_counts().sort_index().index)
    length = list(range(len(result)))
    return result, length

In [None]:
# Implement here
columns = ['Outcome_Type', 'Sex_upon_Outcome','Intake_Type',
           'Intake_Condition', 'Pet_Type', 'Sex_upon_Intake',
       'Age_upon_Intake_Days']

# Replace each distinct value with its position in the sorted list of distinct values
for i in columns:
    values, codes = convert_te(i)
    te[i] = te[i].replace(values, codes).astype(int)

# Reduce Color and Breed to their first word, then encode them the same way
# (each helper is called once, since it modifies the column in place)
for i in ['Color', 'Breed']:
    values, codes = convert_te_s(i)
    te[i] = te[i].replace(values, codes)

te
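
Because the codes above are derived from the test set's own distinct values, a category can receive a different integer in the test data than in the training data. A minimal sketch of one way to keep the two consistent, on made-up toy frames: learn the category list from the training column only and reuse it for both. The names `train_toy`, `test_toy`, and `encode` are hypothetical.

```python
import pandas as pd

# Made-up toy columns; the test column contains a value unseen in training.
train_toy = pd.DataFrame({"Intake_Type": ["Stray", "Wildlife", "Stray", "Public Assist"]})
test_toy = pd.DataFrame({"Intake_Type": ["Wildlife", "Stray", "Abandoned"]})

# Learn the sorted list of categories from the training data only.
categories = sorted(train_toy["Intake_Type"].dropna().unique())

def encode(series, categories):
    """Map each value to its position in `categories`; unseen or missing values become -1."""
    return pd.Categorical(series, categories=categories).codes

train_codes = encode(train_toy["Intake_Type"], categories)
test_codes = encode(test_toy["Intake_Type"], categories)

print(categories)
print(list(train_codes))
print(list(test_codes))
```

With this approach "Stray" maps to the same code in both frames, and the unseen value "Abandoned" is flagged with -1 so it can be handled explicitly.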


### 2.4 <a name="24">Model training</a> 
(<a href="#2">Go to Train a model</a>)


In [165]:
# Implement here
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

classifier = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler()),
    ('estimator', KNeighborsClassifier(n_neighbors = 3))
])

classifier.fit(X_train, y_train)

# tune your parameters using the validation dataset
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score

# Use the fitted model to make predictions on the train dataset
# The pipeline first imputes the data (with means from the train set), then scales it
# (with the min/max from the train set), and finally makes predictions
train_predictions = classifier.predict(X_train)

print('Model performance on the train set:')
print(confusion_matrix(y_train, train_predictions))
print(classification_report(y_train, train_predictions))
print("Train accuracy:", accuracy_score(y_train, train_predictions))

Model performance on the train set:
[[54100   627]
 [ 2747  2803]]
              precision    recall  f1-score   support

           0       0.95      0.99      0.97     54727
           1       0.82      0.51      0.62      5550

    accuracy                           0.94     60277
   macro avg       0.88      0.75      0.80     60277
weighted avg       0.94      0.94      0.94     60277

Train accuracy: 0.9440250841946348
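
The report above is computed on the same rows the model was fitted on, so it says little about performance on unseen data. A minimal sketch of tuning `n_neighbors` on a held-out validation split instead: the data here is synthetic (generated with `make_classification`, an assumption for illustration), the candidate values of `k` are arbitrary, and F1 on class 1 is used because the classes are imbalanced.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the project data (about 10% positives).
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.9, 0.1], random_state=23)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=23)

# Fit one pipeline per candidate k and score it on the validation split.
scores = {}
for k in (1, 3, 5, 7, 9):
    model = Pipeline([
        ("scaler", MinMaxScaler()),
        ("estimator", KNeighborsClassifier(n_neighbors=k)),
    ])
    model.fit(X_tr, y_tr)
    scores[k] = f1_score(y_val, model.predict(X_val))

# Pick the k with the highest validation F1.
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```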


## 3. <a name="3">Make predictions on the test dataset</a> (Implement)
(<a href="#0">Go to top</a>)

Use the test set to make predictions with the trained model.
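
The model expects the test features to have the same columns, in the same order, as the training features. A minimal sketch, on made-up toy frames, of aligning one-hot columns when a category is absent from the test data (`train_toy` and `test_toy` are hypothetical names):

```python
import pandas as pd

# Made-up toy data: 'Bird' appears in training but not in the test set.
train_toy = pd.DataFrame({"Pet_Type": ["Cat", "Dog", "Bird"]})
test_toy = pd.DataFrame({"Pet_Type": ["Dog", "Dog", "Cat"]})

train_dummies = pd.get_dummies(train_toy["Pet_Type"])
test_dummies = pd.get_dummies(test_toy["Pet_Type"])

# Reorder the test columns to match training and add missing ones filled with 0.
test_aligned = test_dummies.reindex(
    columns=train_dummies.columns, fill_value=0).astype(int)
print(test_aligned)
```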

In [None]:
# Implement here

# test_predictions = ...

In [None]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")
  
te = pd.read_csv('test_features.csv')
te.columns = te.columns.str.strip()
te

In [None]:
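# One-hot encode Sex_upon_Intake and prefix the resulting columns with "Intake"
sex_upon_intake_dummy = pd.get_dummies(te['Sex_upon_Intake'])
sex_upon_intake_dummy = sex_upon_intake_dummy.rename(columns={
 'Intact Male':'Intake Intact Male',
 'Intact Female':'Intake Intact Female',
 'Neutered Male':'Intake Neutered Male',
 'Spayed Female':'Intake Spayed Female'
})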

In [171]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")
  
te = pd.read_csv('test_features.csv')
te.columns = te.columns.str.strip()


te.drop(['Pet_ID'], axis=1, inplace=True) #
te.drop(['Name'], axis=1, inplace=True)
te.drop(['Found_Location'], axis=1, inplace=True)

# Note: dropping rows here means predictions are produced for fewer rows
# than test_features.csv contains
te.drop(te[te['Pet_Type'] == 'Other'].index, inplace=True)
columns = ['Pet_Type','Outcome_Type','Intake_Condition',
           'Intake_Type','Sex_upon_Intake']
for term in columns:
    te = te.join(pd.get_dummies(te[term]))
    te = te.drop([term], axis=1)
te.columns

Index(['Sex_upon_Outcome', 'Breed', 'Color', 'Age_upon_Intake_Days', 'Bird',
       'Cat', 'Dog', 'Livestock', 'Adoption', 'Died', 'Disposal', 'Euthanasia',
       'Missing', 'Relocate', 'Return to Owner', 'Rto-Adopt', 'Transfer',
       'Aged', 'Behavior', 'Feral', 'Injured', 'Medical', 'Normal', 'Nursing',
       'Other', 'Pregnant', 'Sick', 'Abandoned', 'Euthanasia Request',
       'Owner Surrender', 'Public Assist', 'Stray', 'Wildlife',
       'Intact Female', 'Intact Male', 'Neutered Male', 'Spayed Female',
       'Unknown'],
      dtype='object')

In [173]:
# Select the feature columns the model was trained on, in the same order.
# Keep te as a DataFrame and store the feature matrix separately in X_test.
feature_columns = ['Age_upon_Intake_Days', 'Bird', 'Cat',
        'Dog', 'Livestock', 'Adoption', 'Died',
       'Disposal', 'Euthanasia', 'Missing', 'Relocate', 'Return to Owner',
       'Rto-Adopt', 'Transfer', 'Aged', 'Behavior', 'Feral', 'Injured',
       'Medical', 'Normal', 'Nursing', 'Other', 'Pregnant', 'Sick',
       'Abandoned', 'Euthanasia Request', 'Owner Surrender', 'Public Assist',
       'Stray', 'Wildlife', 'Intact Female', 'Intact Male', 'Neutered Male',
       'Spayed Female', 'Unknown']
X_test = te[feature_columns].values
test_predictions = classifier.predict(X_test)

In [174]:
len(test_predictions)

22291

In [None]:
# Alternative: reindex fixes the column order and fills any dummy column
# missing from the test set with 0.
# The test file names the age column 'Age_upon_Intake_Days' (with an underscore),
# unlike the training file.
X_test = te.reindex(columns=feature_columns, fill_value=0).values
test_predictions = classifier.predict(X_test)
test_predictions

In [None]:
print(len(test_predictions))

In [None]:
test_predictions = list(test_predictions)
#print(test_predictions)
num_1 = test_predictions.count(1)
num_0 = test_predictions.count(0)
# Ratio of predicted class 1 to predicted class 0
print(num_1/num_0)

In [None]:
print(test_predictions)

In [None]:
# temp = te['Outcome_Type'].value_counts().sort_index()
# target = []
# index = []
# num0 = 0
# for term in temp.index:
#     target.append(term)
#     index.append(num0)
#     num0 += 1
# #print(target, index)

# te.replace(
#     target,
#     index,
#     inplace = True
# )
# te