# HackerEarth ML Project: Pet Adoption

> URL: https://www.hackerearth.com/challenges/competitive/hackerearth-machine-learning-challenge-pet-adoption/machine-learning/pet-adoption-9-5838c75b/

---

## Step 1: Import libraries and read data

In [1]:
import os
import pickle

import pandas as pd
import numpy as np
import seaborn as sns
sns.set()

import hyperopt

In [2]:
def read_data(fpath):
    """
    Read the train and test datasets and return the pandas dataframes
    """
    tr_df = pd.read_csv(f"{fpath}/train.csv", index_col="pet_id")
    te_df = pd.read_csv(f"{fpath}/test.csv", index_col="pet_id")
    return tr_df, te_df
fpath = "C:/Users/shaun/Documents/my_projects/Data-Science-and-Machine-Learning/Hackerearth Project - Pet Adoption/Dataset"

tr_df, te_df = read_data(fpath)

In [3]:
tr_df.head()

pet_id,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2,breed_category,pet_category
ANSL_69903,2016-07-10 00:00:00,2016-09-21 16:25:00,2.0,Brown Tabby,0.8,7.78,13,9,0.0,1
ANSL_66892,2013-11-21 00:00:00,2018-12-27 17:47:00,1.0,White,0.72,14.19,13,9,0.0,2
ANSL_69750,2014-09-28 00:00:00,2016-10-19 08:24:00,,Brown,0.15,40.9,15,4,2.0,4
ANSL_71623,2016-12-31 00:00:00,2019-01-25 18:30:00,1.0,White,0.62,17.82,0,1,0.0,2
ANSL_57969,2017-09-28 00:00:00,2017-11-19 09:38:00,2.0,Black,0.5,11.06,18,4,0.0,1


In [4]:
te_df.head()

pet_id,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2
ANSL_75005,2005-08-17 00:00:00,2017-09-07 15:35:00,0.0,Black,0.87,42.73,0,7
ANSL_76663,2018-11-15 00:00:00,2019-05-08 17:24:00,1.0,Orange Tabby,0.06,6.71,0,1
ANSL_58259,2012-10-11 00:00:00,2018-04-02 16:51:00,1.0,Black,0.24,41.21,0,7
ANSL_67171,2015-02-13 00:00:00,2018-04-06 07:25:00,1.0,Black,0.29,8.46,7,1
ANSL_72871,2017-01-18 00:00:00,2018-04-26 13:42:00,1.0,Brown,0.71,30.92,0,7


## Step 2: Exploratory Data Analysis

In [5]:
tr_df.describe(include='all')

statistic,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2,breed_category,pet_category
count,18834,18834,17357.0,18834,18834.0,18834.0,18834.0,18834.0,18834.0,18834.0
unique,3907,17209,,56,,,,,,
top,2017-03-20 00:00:00,2017-07-28 00:00:00,,Black,,,,,,
freq,41,17,,4620,,,,,,
mean,,,0.88339,,0.502636,27.448832,5.369598,4.577307,0.600563,1.709143
std,,,0.770434,,0.288705,13.019781,6.572366,3.517763,0.629883,0.717919
min,,,0.0,,0.0,5.0,0.0,0.0,0.0,0.0
25%,,,0.0,,0.25,16.1725,0.0,1.0,0.0,1.0
50%,,,1.0,,0.5,27.34,0.0,4.0,1.0,2.0
75%,,,1.0,,0.76,38.89,13.0,9.0,1.0,2.0


In [6]:
te_df.describe(include='all')

statistic,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2
count,8072,8072,7453.0,8072,8072.0,8072.0,8072.0,8072.0
unique,2823,7719,,54,,,,
top,2016-11-21 00:00:00,2016-11-21 00:00:00,,Black,,,,
freq,22,6,,1955,,,,
mean,,,0.886623,,0.507265,27.451163,5.254336,4.505327
std,,,0.77095,,0.289615,12.917903,6.505841,3.523568
min,,,0.0,,0.0,5.01,0.0,0.0
25%,,,0.0,,0.26,16.2775,0.0,1.0
50%,,,1.0,,0.51,27.41,0.0,4.0
75%,,,1.0,,0.76,38.48,13.0,9.0


The `condition` column has missing values in both the train and test data: its count is 17357 of 18834 rows in train and 7453 of 8072 rows in test.

In [7]:
def find_missing_values():
    """
    Check for missing values in each col and return % of missing values (if any)
    """
    # Percentage of null entries per column, computed separately for train and test
    tr_data = pd.DataFrame(tr_df.isnull().sum() * 100 / len(tr_df), columns=['% missing values'])
    te_data = pd.DataFrame(te_df.isnull().sum() * 100 / len(te_df), columns=['% missing values'])
    return tr_data, te_data

tr_missing_data, te_missing_data = find_missing_values()

In [8]:
tr_missing_data

column,% missing values
issue_date,0.0
listing_date,0.0
condition,7.8422
color_type,0.0
length(m),0.0
height(cm),0.0
X1,0.0
X2,0.0
breed_category,0.0
pet_category,0.0


In [9]:
te_missing_data

column,% missing values
issue_date,0.0
listing_date,0.0
condition,7.668484
color_type,0.0
length(m),0.0
height(cm),0.0
X1,0.0
X2,0.0


In [None]:
def statistical_analysis(cols):
    """
    Basic statistical analysis like:
    1. For cont vars display mean, median, quantiles, missing values
    2. For cont var display corr and plots with each other
    3. For cat vars we display freq of each cat
    4. For cat vars display dist of target wrt each cat value
    """
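
The stub above only lists what the analysis should cover. A minimal sketch of one way to do it follows; the helper name `summarize_columns`, its arguments, and the `toy` frame (the first four rows of the train data shown earlier) are assumptions for illustration, not the notebook's actual implementation.

```python
import pandas as pd


def summarize_columns(df, cont_cols, cat_cols, target):
    """Hypothetical helper: basic per-column summaries for EDA."""
    # Continuous columns: mean, quartiles, median and % missing values
    cont_summary = df[cont_cols].describe().T
    cont_summary["median"] = df[cont_cols].median()
    cont_summary["% missing"] = df[cont_cols].isnull().mean() * 100
    # Pairwise Pearson correlation between the continuous columns
    corr = df[cont_cols].corr()
    # Categorical columns: frequency of each category, and the
    # row-normalised distribution of the target within each category
    freqs = {c: df[c].value_counts(dropna=False) for c in cat_cols}
    target_dist = {
        c: pd.crosstab(df[c], df[target], normalize="index") for c in cat_cols
    }
    return cont_summary, corr, freqs, target_dist


# Toy frame for illustration only (first four train rows from the output above)
toy = pd.DataFrame({
    "length(m)": [0.8, 0.72, 0.15, 0.62],
    "height(cm)": [7.78, 14.19, 40.9, 17.82],
    "color_type": ["Brown Tabby", "White", "Brown", "White"],
    "breed_category": [0.0, 0.0, 2.0, 0.0],
})
cont_summary, corr, freqs, target_dist = summarize_columns(
    toy, ["length(m)", "height(cm)"], ["color_type"], "breed_category"
)
```

Returning the pieces separately, rather than printing them, keeps each summary available for plotting or further checks.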

In [10]:
tr_df['breed_category'].value_counts()

0.0    9000
1.0    8357
2.0    1477
Name: breed_category, dtype: int64

### Breed category - unique categories

<img src="./diagrams/diag1.png" height="1000" width="1300">

### Features - how do they influence the distribution of breed category

1. wrt X1:

<img src="./diagrams/diag2.png" height="600" width="1000">

2. wrt X2:

<img src="./diagrams/diag3.png" height="600" width="1000">

3. Since X1 and X2 affect breed category in a similar manner, is there a correlation between them?

- The plot does not suggest a strong relationship, although the Pearson correlation from `compute_corr('X1', 'X2')` further down is about 0.58 on both train and test data, which is moderate.

<img src="./diagrams/diag10.png" height="600" width="1000">


4. How does length affect breed? Not much.

<img src="./diagrams/diag4.png" height="600" width="1000">

5. How does height affect breed? It is slightly lower for breed = 0.

<img src="./diagrams/diag5.png" height="600" width="1000">

6. How does condition affect breed? When condition is NULL, breed is always 2.
    - Simply replace NULL with -1 and add an indicator feature, `condition_NULL`.

<img src="./diagrams/diag6.png" height="600" width="600">
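
The NULL handling proposed for `condition` could look roughly like the sketch below. The helper name `impute_condition` is an assumption, and the `toy` values are the `condition` entries from the first four train rows shown earlier.

```python
import numpy as np
import pandas as pd


def impute_condition(df):
    """Hypothetical helper: flag missing `condition` values, then fill them with -1."""
    out = df.copy()
    # Indicator feature: 1 where condition was missing, else 0.
    # It must be computed before the fill, or the information is lost.
    out["condition_NULL"] = out["condition"].isnull().astype(int)
    # Sentinel value -1 lies outside the observed range (min is 0.0)
    out["condition"] = out["condition"].fillna(-1)
    return out


# Toy frame for illustration only
toy = pd.DataFrame({"condition": [2.0, 1.0, np.nan, 1.0]})
toy_imputed = impute_condition(toy)
```

The same function would be applied to both `tr_df` and `te_df`, so the two sets stay consistent.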


In [13]:
def compute_corr(col1, col2):
    """
    Print the Pearson correlation between col1 and col2 for the train and test data
    """
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry [0][1] is the correlation
    print("Train data:", np.corrcoef(x=np.array(tr_df[col1]), y=np.array(tr_df[col2]))[0][1])
    print("Test data:", np.corrcoef(x=np.array(te_df[col1]), y=np.array(te_df[col2]))[0][1])

compute_corr('X1', 'X2')

Train data: 0.5843958932820943
Test data: 0.5918704878368073


### Any relationship between breed and pet categories?

<img src="./diagrams/diag7.png" height="600" width="1000">

- If there were a one-to-one relationship, a model built for one target would also serve the other, but that is not the case.

- So we should build a separate model for each target.
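
Besides reading it off the plot, the relationship between the two targets can be checked with a contingency table. The sketch below assumes hypothetical helper names (`category_crosstab`, `is_one_to_one`) and uses the target values from the five train rows shown earlier as toy data.

```python
import pandas as pd


def category_crosstab(df):
    # Rows: breed_category, columns: pet_category, cells: number of rows
    return pd.crosstab(df["breed_category"], df["pet_category"])


def is_one_to_one(table):
    # A one-to-one mapping needs exactly one non-zero cell
    # in every row and in every column of the table
    nonzero = table.to_numpy() > 0
    rows_ok = (nonzero.sum(axis=1) == 1).all()
    cols_ok = (nonzero.sum(axis=0) == 1).all()
    return bool(rows_ok and cols_ok)


# Toy frame for illustration only (targets of the five train rows shown above)
toy = pd.DataFrame({
    "breed_category": [0.0, 0.0, 2.0, 0.0, 0.0],
    "pet_category": [1, 2, 4, 2, 1],
})
table = category_crosstab(toy)
one_to_one = is_one_to_one(table)
```

Even in these five rows, breed 0 already appears with two different pet categories, so the mapping is not one-to-one.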

### Issue Date features exploration wrt breed type

1. Year-wise

<img src="./diagrams/diag8.png" height="600" width="1000">

2. Month-wise

- there seems to be some seasonality - maybe encode months further as seasons

<img src="./diagrams/diag9.png" height="600" width="1000">

3. Day-wise: Weekday weekend patterns?

<img src="./diagrams/diag11.png" height="600" width="1000">

- On weekends, the count for breed = 1 exceeds that for breed = 0, so a weekend indicator feature might be helpful.
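
The year, month, season and weekend ideas above could be turned into features along these lines. This is a sketch under assumptions: the helper name, the new column names, and the month-to-season mapping (meteorological seasons, northern hemisphere) are illustrative choices, not taken from the notebook.

```python
import pandas as pd

# Assumed mapping: meteorological seasons, northern hemisphere
SEASONS = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def add_issue_date_features(df):
    """Hypothetical helper: derive calendar features from `issue_date`."""
    out = df.copy()
    issue = pd.to_datetime(out["issue_date"])
    out["issue_year"] = issue.dt.year
    out["issue_month"] = issue.dt.month
    out["issue_season"] = issue.dt.month.map(SEASONS)
    # dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
    out["issue_is_weekend"] = (issue.dt.dayofweek >= 5).astype(int)
    return out


# Toy frame for illustration only (issue dates of the first two train rows)
toy = pd.DataFrame({"issue_date": ["2016-07-10 00:00:00", "2013-11-21 00:00:00"]})
toy_features = add_issue_date_features(toy)
```

Whether seasons should follow this mapping depends on where the data was collected, which the dataset does not state.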


In [None]:
def issue_listing_date_diff():
    """
    """
    return
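
The stub above is left empty. Judging only by its name, it was presumably meant to measure the gap between `issue_date` and `listing_date`. One possible sketch, with an assumed helper name and output column (`days_to_listing`):

```python
import pandas as pd


def add_issue_listing_diff(df):
    """Hypothetical helper: whole days between `issue_date` and `listing_date`."""
    out = df.copy()
    issue = pd.to_datetime(out["issue_date"])
    listing = pd.to_datetime(out["listing_date"])
    # Subtracting datetimes gives a Timedelta; .dt.days keeps the whole-day part
    out["days_to_listing"] = (listing - issue).dt.days
    return out


# Toy frame for illustration only (dates of the first train row)
toy = pd.DataFrame({
    "issue_date": ["2016-07-10 00:00:00"],
    "listing_date": ["2016-09-21 16:25:00"],
})
toy_diff = add_issue_listing_diff(toy)
```

Negative values, if any appear in the real data, would indicate a listing date earlier than the issue date and are worth inspecting before modelling.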

## Step 3: Feature engineering and Data cleaning for breed type