<a href="https://colab.research.google.com/github/KanishkGar/AIMProjects/blob/main/week2/MLA_TAB_DAY1_FINAL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <a name="0">Machine Learning Accelerator - Tabular Data - Lecture 1</a>


## Final Project 

In this notebook, we build an ML model to predict the __Time at Center__ field of our final project dataset.

1. <a href="#1">Read the datasets</a> (Given) 
2. <a href="#2">Train a model</a> (Implement)
    * <a href="#21">Exploratory Data Analysis</a>
    * <a href="#22">Select features to build the model</a>
    * <a href="#23">Data processing</a>
    * <a href="#24">Model training</a>
3. <a href="#3">Make predictions on the test dataset</a> (Implement)
4. <a href="#4">Write the test predictions to a CSV file</a> (Given)

__Austin Animal Center Dataset__:

In this exercise, we are working with pet adoption data from __Austin Animal Center__. We have two datasets that cover intake and outcome of animals. Intake data is available from [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) and outcome is from [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238). 

To work with a single table, we joined the intake and outcome tables on the "Animal ID" column and created three files: training.csv, test_features.csv and y_test.csv. As with our review dataset, we excluded animals with multiple entries to the facility to keep things simple. The original datasets are available in the data/review folder: Austin_Animal_Center_Intakes.csv, Austin_Animal_Center_Outcomes.csv.
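
The join described above can be sketched on toy data. This is a minimal illustration, assuming the two tables share an "Animal ID" key; the toy values and the variable names (`intakes`, `outcomes`, `joined`) are made up here and are not the code that produced training.csv.

```python
import pandas as pd

# Toy stand-ins for the intake and outcome tables (values are made up).
intakes = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A2", "A3"],
    "Intake Type": ["Stray", "Stray", "Owner Surrender", "Public Assist"],
})
outcomes = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A2", "A3"],
    "Outcome Type": ["Adoption", "Transfer", "Adoption", "Return to Owner"],
})

# keep=False drops every row whose ID appears more than once,
# so only animals with a single entry remain.
single_intake = intakes.drop_duplicates(subset="Animal ID", keep=False)
single_outcome = outcomes.drop_duplicates(subset="Animal ID", keep=False)

# Inner join on the shared key gives one row per remaining animal.
joined = single_intake.merge(single_outcome, on="Animal ID", how="inner")
print(joined)
```

Animal "A2" has two entries in the toy tables, so it is left out of the joined result.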

__Dataset schema:__ 
- __Pet ID__ - Unique ID of pet
- __Outcome Type__ - State of pet at the time of recording the outcome
- __Sex upon Outcome__ - Sex of pet at outcome
- __Name__ - Name of pet 
- __Found Location__ - Location where the pet was found before it entered the center
- __Intake Type__ - Circumstances bringing the pet to the center
- __Intake Condition__ - Health condition of the pet when it entered the center
- __Pet Type__ - Type of pet
- __Sex upon Intake__ - Sex of the pet when it entered the center
- __Breed__ - Breed of pet 
- __Color__ - Color of pet 
- __Age upon Intake Days__ - Age of pet when entered the center (days)
- __Time at Center__ - Time at center (0 = less than 30 days; 1 = more than 30 days). This is the value to predict. 


In [1]:
!pip install pandas
!pip install numpy




## 1. <a name="1">Read the datasets</a> (Given)
(<a href="#0">Go to top</a>)

Let's read the datasets into dataframes, using Pandas.

In [2]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")
  
training_data = pd.read_csv('training.csv')
test_data = pd.read_csv('test_features.csv')

print('The shape of the training dataset is:', training_data.shape)
print('The shape of the test dataset is:', test_data.shape)


The shape of the training dataset is: (71538, 13)
The shape of the test dataset is: (23846, 12)


## 2. <a name="2">Train a model</a> (Implement)
(<a href="#0">Go to top</a>)

 * <a href="#21">Exploratory Data Analysis</a>
 * <a href="#22">Select features to build the model</a>
 * <a href="#23">Data processing</a>
 * <a href="#24">Model training</a>

### 2.1 <a name="21">Exploratory Data Analysis</a> 
(<a href="#2">Go to Train a model</a>)

We look at the number of rows and columns, and at some simple statistics of the dataset.

In [3]:
# Implement here

training_data.head()

|   | Pet ID | Outcome Type | Sex upon Outcome | Name | Found Location | Intake Type | Intake Condition | Pet Type | Sex upon Intake | Breed | Color | Age upon Intake Days | Time at Center |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A745079 | Transfer | Unknown |  | 7920 Old Lockhart in Travis (TX) | Stray | Normal | Cat | Unknown | Domestic Shorthair Mix | Blue | 3 | 0 |
| 1 | A801765 | Transfer | Intact Female |  | 5006 Table Top in Austin (TX) | Stray | Normal | Cat | Intact Female | Domestic Shorthair | Brown Tabby/White | 28 | 0 |
| 2 | A667965 | Transfer | Neutered Male |  | 14100 Thermal Dr in Austin (TX) | Stray | Normal | Dog | Neutered Male | Chihuahua Shorthair Mix | Brown/Tan | 1825 | 0 |
| 3 | A687551 | Transfer | Intact Male |  | 5811 Cedardale Dr in Austin (TX) | Stray | Normal | Cat | Intact Male | Domestic Shorthair Mix | Brown Tabby | 28 | 0 |
| 4 | A773004 | Adoption | Neutered Male | \*Boris | Highway 290 And Arterial A in Austin (TX) | Stray | Normal | Dog | Intact Male | Chihuahua Shorthair Mix | Tricolor/Cream | 365 | 0 |


In [4]:
# Implement here

test_data.head()

|   | Pet ID | Outcome Type | Sex upon Outcome | Name | Found Location | Intake Type | Intake Condition | Pet Type | Sex upon Intake | Breed | Color | Age upon Intake Days |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A782657 | Adoption | Spayed Female |  | 1911 Dear Run Drive in Austin (TX) | Stray | Normal | Dog | Intact Female | Labrador Retriever Mix | Black | 60 |
| 1 | A804622 | Adoption | Neutered Male |  | 702 Grand Canyon in Austin (TX) | Stray | Normal | Dog | Intact Male | Boxer/Anatol Shepherd | Brown/Tricolor | 60 |
| 2 | A786693 | Return to Owner | Neutered Male | Zeus | Austin (TX) | Public Assist | Normal | Dog | Neutered Male | Australian Cattle Dog/Pit Bull | Black/White | 3285 |
| 3 | A693330 | Adoption | Spayed Female | Hope | Levander Loop & Airport Blvd in Austin (TX) | Stray | Normal | Dog | Intact Female | Miniature Poodle | Gray | 1825 |
| 4 | A812431 | Adoption | Neutered Male |  | Austin (TX) | Owner Surrender | Injured | Cat | Intact Male | Domestic Shorthair | Blue/White | 210 |


In [5]:
print(training_data.info())
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71538 entries, 0 to 71537
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Pet ID                71538 non-null  object
 1   Outcome Type          71533 non-null  object
 2   Sex upon Outcome      71537 non-null  object
 3   Name                  44360 non-null  object
 4   Found Location        71538 non-null  object
 5   Intake Type           71538 non-null  object
 6   Intake Condition      71538 non-null  object
 7   Pet Type              71538 non-null  object
 8   Sex upon Intake       71537 non-null  object
 9   Breed                 71538 non-null  object
 10  Color                 71538 non-null  object
 11  Age upon Intake Days  71538 non-null  int64 
 12  Time at Center        71538 non-null  int64 
dtypes: int64(2), object(11)
memory usage: 7.1+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23846 entries, 0 to 23845
Data colu

In [6]:
print(training_data.describe())
test_data.describe()

       Age upon Intake Days  Time at Center
count          71538.000000    71538.000000
mean             702.701487        0.087003
std             1051.158334        0.281841
min                0.000000        0.000000
25%               30.000000        0.000000
50%              365.000000        0.000000
75%              730.000000        0.000000
max             9125.000000        1.000000


|   | Age upon Intake Days |
|---|---|
| count | 23846.0 |
| mean | 708.5143 |
| std | 1056.841712 |
| min | 0.0 |
| 25% | 30.0 |
| 50% | 365.0 |
| 75% | 730.0 |
| max | 8030.0 |
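
The `info()` output above shows that __Name__ is filled in for only 44360 of the 71538 training rows, and `describe()` gives a mean of about 0.087 for __Time at Center__, so roughly 8.7% of the training rows are class 1. Two further checks that make these points explicit are sketched below on a toy frame; the frame `toy` and its values are made up for illustration, and the same calls should apply to `training_data`.

```python
import numpy as np
import pandas as pd

# Toy frame reusing a few of the training column names (values are made up).
toy = pd.DataFrame({
    "Name": ["Boris", np.nan, np.nan, "Hope"],
    "Pet Type": ["Dog", "Cat", "Cat", "Dog"],
    "Time at Center": [0, 0, 1, 0],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Fraction of rows in each class of the target.
class_share = toy["Time at Center"].value_counts(normalize=True)
print(class_share)
```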


### 2.2 <a name="22">Select features to build the model</a> 
(<a href="#2">Go to Train a model</a>)


In [9]:
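# Implement here
# Note: the introduction names 'Time at Center' as the value to predict,
# but this cell sets 'Outcome Type' as the target, and the cells below follow that choice
model_features = training_data.columns.drop('Outcome Type')
model_target = 'Outcome Type'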

# numerical_features = ...
numerical_features_all = training_data[model_features].select_dtypes(include=np.number).columns
print('Numerical columns:',numerical_features_all)

categorical_features_all = training_data[model_features].select_dtypes(include='object').columns
print('Categorical columns:',categorical_features_all)

Numerical columns: Index(['Age upon Intake Days', 'Time at Center'], dtype='object')
Categorical columns: Index(['Sex upon Outcome', 'Intake Type', 'Intake Condition', 'Pet Type',
       'Sex upon Intake', 'Breed', 'Color'],
      dtype='object')


In [7]:
training_data.drop(labels=['Pet ID', 'Found Location'], axis=1, inplace=True)
test_data.drop(labels=['Pet ID', 'Found Location'], axis=1, inplace=True)

In [8]:
training_data.drop(labels=['Name'], axis=1, inplace=True)
test_data.drop(labels=['Name'], axis=1, inplace=True)

In [10]:
test_data.head()

|   | Outcome Type | Sex upon Outcome | Intake Type | Intake Condition | Pet Type | Sex upon Intake | Breed | Color | Age upon Intake Days |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Adoption | Spayed Female | Stray | Normal | Dog | Intact Female | Labrador Retriever Mix | Black | 60 |
| 1 | Adoption | Neutered Male | Stray | Normal | Dog | Intact Male | Boxer/Anatol Shepherd | Brown/Tricolor | 60 |
| 2 | Return to Owner | Neutered Male | Public Assist | Normal | Dog | Neutered Male | Australian Cattle Dog/Pit Bull | Black/White | 3285 |
| 3 | Adoption | Spayed Female | Stray | Normal | Dog | Intact Female | Miniature Poodle | Gray | 1825 |
| 4 | Adoption | Neutered Male | Owner Surrender | Injured | Cat | Intact Male | Domestic Shorthair | Blue/White | 210 |
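
The introduction names __Time at Center__ as the value to predict, and test_features.csv does not contain that column. One possible setup therefore keeps the target out of the feature lists. The sketch below shows this on a toy frame; `toy`, `target`, `features`, `numerical` and `categorical` are illustrative names, not variables from the cells above.

```python
import numpy as np
import pandas as pd

# Toy frame reusing a few column names from the dataset (values are made up).
toy = pd.DataFrame({
    "Pet Type": ["Dog", "Cat", "Cat"],
    "Color": ["Black", "Blue", "Gray"],
    "Age upon Intake Days": [60, 28, 365],
    "Time at Center": [0, 0, 1],
})

target = "Time at Center"

# Every column except the target is a candidate feature.
features = toy.columns.drop(target)

# Split the candidate features by dtype; the target is in neither list.
numerical = toy[features].select_dtypes(include=np.number).columns.tolist()
categorical = toy[features].select_dtypes(exclude=np.number).columns.tolist()

print("Numerical columns:", numerical)
print("Categorical columns:", categorical)
```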


### 2.3 <a name="23">Data Processing</a> 
(<a href="#2">Go to Train a model</a>)


In [11]:
# Implement here
!pip install scikit-learn
from sklearn.model_selection import train_test_split
train_data_a, test_data_a = train_test_split(training_data, test_size=0.1, shuffle=True, random_state=69)




In [12]:
print('Training set shape:', train_data_a.shape)

print('Class 0 samples in the training set:', sum(train_data_a[model_target] != 'Adoption'))
print('Class 1 samples in the training set:', sum(train_data_a[model_target] == 'Adoption'))

print('Class 0 samples in the test set:', sum(test_data_a[model_target] != 'Adoption'))
print('Class 1 samples in the test set:', sum(test_data_a[model_target] == 'Adoption'))

Training set shape: (64384, 10)
Class 0 samples in the training set: 37571
Class 1 samples in the training set: 26813
Class 0 samples in the test set: 4171
Class 1 samples in the test set: 2983
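
The split above shuffles the rows, so the class ratio in the two parts is only approximately equal. If an exact match matters, `train_test_split` accepts a `stratify` argument. The sketch below is a hedged illustration on a toy frame; `toy`, `label`, `train_part` and `holdout_part` are made-up names, and the cells in this notebook do not use stratification.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 20/80 class split (values are made up).
toy = pd.DataFrame({
    "Age upon Intake Days": range(100),
    "label": [1] * 20 + [0] * 80,
})

# stratify keeps the class proportions the same in both parts.
train_part, holdout_part = train_test_split(
    toy, test_size=0.1, shuffle=True, random_state=0, stratify=toy["label"]
)

print("Class 1 share in train:", train_part["label"].mean())
print("Class 1 share in holdout:", holdout_part["label"].mean())
```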


In [13]:
from sklearn.utils import shuffle

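# Split the rows by target: 'Adoption' is treated as class 1, everything else as class 0
class_0_no = train_data_a[train_data_a[model_target] != 'Adoption']
class_1_no = train_data_a[train_data_a[model_target] == 'Adoption']

# Class 0 is the larger group here, so drawing len(class_1_no) rows shrinks it to the size of class 1
# (replace=True means some class 0 rows can be drawn more than once)
resampled_class_0_no = class_0_no.sample(n=len(class_1_no), replace=True, random_state=42)

# Combine both classes and shuffle the rows
train_data_b = pd.concat([class_1_no, resampled_class_0_no])
train_data_b = shuffle(train_data_b)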

In [14]:
print('Training set shape:', train_data_b.shape)

print('Class 0 samples in the training set:', sum(train_data_b[model_target] != 'Adoption'))
print('Class 1 samples in the training set:', sum(train_data_b[model_target] == 'Adoption'))

Training set shape: (53626, 10)
Class 0 samples in the training set: 26813
Class 1 samples in the training set: 26813
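
The balancing cell above brings the larger group down to the size of the smaller one. The opposite approach, upsampling the smaller group with replacement until it matches the larger one, keeps every original row. The sketch below shows that alternative on a toy frame; it is an illustration under made-up data and names (`toy`, `majority`, `minority`, `balanced`), not the method used in this notebook.

```python
import pandas as pd

# Toy frame with 3 rows of class 1 and 7 rows of class 0 (values are made up).
toy = pd.DataFrame({
    "feature": range(10),
    "label": [1] * 3 + [0] * 7,
})

majority = toy[toy["label"] == 0]
minority = toy[toy["label"] == 1]

# Draw minority rows with replacement until the group is as large as the majority.
upsampled_minority = minority.sample(n=len(majority), replace=True, random_state=42)

# Combine the two groups and shuffle the rows.
balanced = pd.concat([majority, upsampled_minority]).sample(frac=1, random_state=42)
print(balanced["label"].value_counts())
```

Resampling is applied to the training part only, so that the held-out rows keep their original class ratio.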


### 2.4 <a name="24">Model training</a> 
(<a href="#2">Go to Train a model</a>)


In [15]:
# Implement here
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

### PIPELINE ###
################

# Chain the desired data transformers in a Pipeline, with an estimator at the end
# For each step, specify a name and the transformer/estimator with its parameters
classifier = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler()),
    ('estimator', KNeighborsClassifier(n_neighbors = 3))
])

# Visualize the pipeline
# This will come in handy especially when building more complex pipelines, stringing together multiple preprocessing steps
from sklearn import set_config
set_config(display='diagram')
classifier
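
The pipeline above handles numerical columns only. One common way to string together several preprocessing steps, including categorical columns, is scikit-learn's `ColumnTransformer`. The sketch below is a hedged example on a toy frame; the toy data, the `label` column, and the names `numeric_steps`, `preprocess` and `toy_classifier` are assumptions made for illustration and are not part of this notebook's model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame with one numerical and one categorical feature (values are made up).
toy = pd.DataFrame({
    "Age upon Intake Days": [30.0, 365.0, np.nan, 730.0, 60.0, 1825.0],
    "Pet Type": ["Cat", "Dog", "Dog", "Cat", "Cat", "Dog"],
    "label": [0, 1, 1, 0, 0, 1],
})
feature_columns = ["Age upon Intake Days", "Pet Type"]

# Numerical columns: impute missing values with the mean, then scale to [0, 1].
numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
])

# Route each column group to its own preprocessing.
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["Age upon Intake Days"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Pet Type"]),
])

# Preprocessing followed by the same kind of estimator as above.
toy_classifier = Pipeline([
    ("preprocess", preprocess),
    ("estimator", KNeighborsClassifier(n_neighbors=3)),
])

toy_classifier.fit(toy[feature_columns], toy["label"])
toy_predictions = toy_classifier.predict(toy[feature_columns])
print(toy_predictions)
```

`handle_unknown="ignore"` lets the encoder accept categories at prediction time that were not present during fitting.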

In [16]:
# Get train data to train the classifier
X_train = train_data_b[numerical_features_all]
y_train = train_data_b[model_target]

# Fit the classifier to training data
# As the training data passes through the Pipeline, it is first imputed, then scaled, and finally used to fit the estimator
classifier.fit(X_train, y_train)
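
Fitting stores the imputation means and the min/max values learned from the training data, and those stored values are reused whenever new data passes through the pipeline. The sketch below shows this on a tiny array with a transformer-only pipeline; `train_values`, `new_values` and `steps` are made-up names for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# One numerical column with a missing value (values are made up).
train_values = np.array([[0.0], [10.0], [np.nan], [20.0]])
new_values = np.array([[np.nan], [40.0]])

steps = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
])
steps.fit(train_values)

# Statistics learned from the training values only.
learned_mean = steps.named_steps["imputer"].statistics_
learned_min = steps.named_steps["scaler"].data_min_
learned_max = steps.named_steps["scaler"].data_max_
print(learned_mean, learned_min, learned_max)

# New values are imputed and scaled with the training statistics,
# so a value above the training maximum maps above 1.
transformed_new = steps.transform(new_values)
print(transformed_new)
```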

In [17]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score

# Use the fitted model to make predictions on the train dataset
# As the training data passes through the Pipeline, it is first imputed (with the means from the training data), then scaled (with the min/max from the training data), and finally used to make predictions
train_predictions = classifier.predict(X_train)

print('Model performance on the train set:')
print(confusion_matrix(y_train, train_predictions))
print(classification_report(y_train, train_predictions))
print("Train accuracy:", accuracy_score(y_train, train_predictions))

Model performance on the train set:
[[15999     0   244    84     0     0  5626     0  4860]
 [  181     0    89     0     0     0    91     0   160]
 [  130     0    32     0     0     0    68     0    12]
 [ 1727     0   198    11     0     0  1388     0   400]
 [   12     0     0     0     0     0     2     0     1]
 [    2     0     1     1     0     0     3     0     0]
 [ 3041     0    14    36     0     0  3336     0   202]
 [   83     0     0     7     0     0    70     0     2]
 [ 6681     0   809    31     0     0  3656     0  4336]]
                 precision    recall  f1-score   support

       Adoption       0.57      0.60      0.59     26813
           Died       0.00      0.00      0.00       521
       Disposal       0.02      0.13      0.04       242
     Euthanasia       0.06      0.00      0.01      3724
        Missing       0.00      0.00      0.00        15
       Relocate       0.00      0.00      0.00         7
Return to Owner       0.23      0.50      0.32    

## 3. <a name="3">Make predictions on the test dataset</a> (Implement)
(<a href="#0">Go to top</a>)

Use the test set to make predictions with the trained model.

In [24]:
# Implement here
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score

# Get test data to test the classifier
X_test = test_data_a[numerical_features_all]
y_test = test_data_a[model_target]

# Use the fitted model to make predictions on the test dataset
# As the test data passes through the Pipeline, it is first imputed (with the means from the training data), then scaled (with the min/max from the training data), and finally used to make predictions
test_predictions = classifier.predict(X_test)

print('Model performance on the test set:')
print(type(test_predictions))
# print(confusion_matrix(y_test, test_predictions))
# print(classification_report(y_test, test_predictions))
# print("Test accuracy:", accuracy_score(y_test, test_predictions))

# test_predictions = ...

Model performance on the test set:
<class 'numpy.ndarray'>
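
The predictions come back as a NumPy array, which can be wrapped in a DataFrame and written to disk for the last step in the outline. The sketch below is a minimal illustration: the column names `ID` and `prediction`, the toy values, and the output file name are assumptions, not the format this project requires.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy predictions and identifiers (values are made up).
toy_predictions = np.array([0, 1, 0])
toy_ids = ["A1", "A2", "A3"]

# One row per prediction, paired with its identifier.
result_df = pd.DataFrame({"ID": toy_ids, "prediction": toy_predictions})

# Write without the DataFrame index so the file has exactly two columns.
output_path = os.path.join(tempfile.gettempdir(), "toy_predictions.csv")
result_df.to_csv(output_path, index=False)

# Read the file back to confirm what was written.
round_trip = pd.read_csv(output_path)
print(round_trip)
```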
