# Hackerearth ML Project: Pet Adoption

> URL: https://www.hackerearth.com/challenges/competitive/hackerearth-machine-learning-challenge-pet-adoption/machine-learning/pet-adoption-9-5838c75b/

---

## Step 1: Import libraries and read data

In [1]:
import os
import pickle

import pandas as pd
import numpy as np
import seaborn as sns
sns.set()

import hyperopt

In [2]:
def read_data(fpath):
    """
    Read the train and test datasets and return the pandas dataframes
    """
    tr_df = pd.read_csv(f"{fpath}/train.csv", index_col="pet_id")
    te_df = pd.read_csv(f"{fpath}/test.csv", index_col="pet_id")
    return tr_df, te_df

fpath = "C:/Users/shaun/Documents/my_projects/Data-Science-and-Machine-Learning/Hackerearth Project - Pet Adoption/Dataset"

tr_df, te_df = read_data(fpath)

In [3]:
tr_df.head()

pet_id,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2,breed_category,pet_category
ANSL_69903,2016-07-10 00:00:00,2016-09-21 16:25:00,2.0,Brown Tabby,0.8,7.78,13,9,0.0,1
ANSL_66892,2013-11-21 00:00:00,2018-12-27 17:47:00,1.0,White,0.72,14.19,13,9,0.0,2
ANSL_69750,2014-09-28 00:00:00,2016-10-19 08:24:00,,Brown,0.15,40.9,15,4,2.0,4
ANSL_71623,2016-12-31 00:00:00,2019-01-25 18:30:00,1.0,White,0.62,17.82,0,1,0.0,2
ANSL_57969,2017-09-28 00:00:00,2017-11-19 09:38:00,2.0,Black,0.5,11.06,18,4,0.0,1


In [4]:
te_df.head()

pet_id,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2
ANSL_75005,2005-08-17 00:00:00,2017-09-07 15:35:00,0.0,Black,0.87,42.73,0,7
ANSL_76663,2018-11-15 00:00:00,2019-05-08 17:24:00,1.0,Orange Tabby,0.06,6.71,0,1
ANSL_58259,2012-10-11 00:00:00,2018-04-02 16:51:00,1.0,Black,0.24,41.21,0,7
ANSL_67171,2015-02-13 00:00:00,2018-04-06 07:25:00,1.0,Black,0.29,8.46,7,1
ANSL_72871,2017-01-18 00:00:00,2018-04-26 13:42:00,1.0,Brown,0.71,30.92,0,7


## Step 2: Exploratory Data Analysis

In [5]:
tr_df.describe(include='all')

statistic,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2,breed_category,pet_category
count,18834,18834,17357.0,18834,18834.0,18834.0,18834.0,18834.0,18834.0,18834.0
unique,3907,17209,,56,,,,,,
top,2017-03-20 00:00:00,2017-07-28 00:00:00,,Black,,,,,,
freq,41,17,,4620,,,,,,
mean,,,0.88339,,0.502636,27.448832,5.369598,4.577307,0.600563,1.709143
std,,,0.770434,,0.288705,13.019781,6.572366,3.517763,0.629883,0.717919
min,,,0.0,,0.0,5.0,0.0,0.0,0.0,0.0
25%,,,0.0,,0.25,16.1725,0.0,1.0,0.0,1.0
50%,,,1.0,,0.5,27.34,0.0,4.0,1.0,2.0
75%,,,1.0,,0.76,38.89,13.0,9.0,1.0,2.0


In [6]:
te_df.describe(include='all')

statistic,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2
count,8072,8072,7453.0,8072,8072.0,8072.0,8072.0,8072.0
unique,2823,7719,,54,,,,
top,2016-11-21 00:00:00,2016-11-21 00:00:00,,Black,,,,
freq,22,6,,1955,,,,
mean,,,0.886623,,0.507265,27.451163,5.254336,4.505327
std,,,0.77095,,0.289615,12.917903,6.505841,3.523568
min,,,0.0,,0.0,5.01,0.0,0.0
25%,,,0.0,,0.26,16.2775,0.0,1.0
50%,,,1.0,,0.51,27.41,0.0,4.0
75%,,,1.0,,0.76,38.48,13.0,9.0


The `condition` column has missing values in both the train data (17357 non-null of 18834) and the test data (7453 non-null of 8072).

In [7]:
def find_missing_values():
    """
    Check for missing values in each col and return % of missing values (if any)
    """
    tr_data = pd.DataFrame(tr_df.isnull().sum() * 100 / len(tr_df), columns=['% missing values'])
    te_data = pd.DataFrame(te_df.isnull().sum() * 100 / len(te_df), columns=['% missing values'])
    return tr_data, te_data

tr_missing_data, te_missing_data = find_missing_values()

In [8]:
tr_missing_data

column,% missing values
issue_date,0.0
listing_date,0.0
condition,7.8422
color_type,0.0
length(m),0.0
height(cm),0.0
X1,0.0
X2,0.0
breed_category,0.0
pet_category,0.0


In [9]:
te_missing_data

column,% missing values
issue_date,0.0
listing_date,0.0
condition,7.668484
color_type,0.0
length(m),0.0
height(cm),0.0
X1,0.0
X2,0.0


In [None]:
def statistical_analysis(cols):
    """
    Basic statistical analysis like:
    1. For cont vars display mean, median, quantiles, missing values
    2. For cont var display corr and plots with each other
    3. For cat vars we display freq of each cat
    4. For cat vars display dist of target wrt each cat value
    """
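
The stub above only lists what the analysis should cover. A minimal sketch of those four steps could look like the following; the signature (`df`, `cont_cols`, `cat_cols`, `target`) and the toy frame are assumptions made for illustration, not the notebook's own implementation.

```python
import pandas as pd

def statistical_analysis(df, cont_cols, cat_cols, target):
    """
    Hypothetical helper covering the four items in the stub's docstring.
    """
    # 1. Continuous vars: mean, median (50%), quantiles and % missing values
    cont_summary = df[cont_cols].describe().T
    cont_summary["% missing"] = df[cont_cols].isnull().mean() * 100
    # 2. Continuous vars: pairwise Pearson correlation
    cont_corr = df[cont_cols].corr()
    # 3. Categorical vars: frequency of each category
    cat_freq = {col: df[col].value_counts(dropna=False) for col in cat_cols}
    # 4. Categorical vars: distribution of the target within each category value
    target_dist = {
        col: pd.crosstab(df[col], df[target], normalize="index") for col in cat_cols
    }
    return cont_summary, cont_corr, cat_freq, target_dist

# Toy frame built from a few rows of tr_df.head(), for illustration only
demo = pd.DataFrame({
    "length(m)": [0.8, 0.72, 0.15, 0.62],
    "height(cm)": [7.78, 14.19, 40.9, 17.82],
    "color_type": ["Brown Tabby", "White", "Brown", "White"],
    "breed_category": [0.0, 0.0, 2.0, 0.0],
})
cont_summary, cont_corr, cat_freq, target_dist = statistical_analysis(
    demo, ["length(m)", "height(cm)"], ["color_type"], "breed_category"
)
print(cont_summary)
print(target_dist["color_type"])
```

Plots are left out of the sketch; with seaborn already imported, each table could be passed to a plotting call as a follow-up.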

In [10]:
tr_df['breed_category'].value_counts()

0.0    9000
1.0    8357
2.0    1477
Name: breed_category, dtype: int64

### Breed category - unique categories

<img src="./diagrams/diag1.png" height="1000" width="1300">

### Features - how do they influence the distribution of breed category

1. wrt X1:

<img src="./diagrams/diag2.png" height="600" width="1000">

2. wrt X2:

<img src="./diagrams/diag3.png" height="600" width="1000">

3. Since X1 and X2 affect breed category in a similar manner, is there a correlation between them?

- The plot does not suggest an obvious one, although the Pearson coefficient computed below (about 0.58 on train, 0.59 on test) indicates a moderate positive correlation

<img src="./diagrams/diag10.png" height="600" width="1000">


4. How does length affect breed? Not much

<img src="./diagrams/diag4.png" height="600" width="1000">

5. How does height affect breed? Slightly lower for breed = 0

<img src="./diagrams/diag5.png" height="600" width="1000">

6. How does condition affect breed? When condition is NULL, breed is always 2
    - simply replace NULL with -1 and create an indicator feature, condition_NULL

<img src="./diagrams/diag6.png" height="600" width="600">
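
The NULL handling proposed for `condition` can be sketched as below. The helper name `encode_condition` and the toy frame are assumptions for illustration; only the -1 fill value and the `condition_NULL` feature name come from the note above.

```python
import numpy as np
import pandas as pd

def encode_condition(df):
    """
    Hypothetical helper: flag missing `condition` values, then fill them with -1.
    """
    out = df.copy()
    # Indicator must be built before filling, while the NaNs are still present
    out["condition_NULL"] = out["condition"].isnull().astype(int)
    out["condition"] = out["condition"].fillna(-1)
    return out

# Toy frame for illustration only
demo = pd.DataFrame({"condition": [2.0, 1.0, np.nan, 0.0]})
encoded = encode_condition(demo)
print(encoded)
```

The same function would be applied to both the train and test frames so that the two share identical columns.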


In [13]:
def compute_corr(col1, col2):
    """
    Print the Pearson correlation between col1 and col2 for the train and test data
    """
    print("Train data:", np.corrcoef(tr_df[col1], tr_df[col2])[0, 1])
    print("Test data:", np.corrcoef(te_df[col1], te_df[col2])[0, 1])

compute_corr('X1', 'X2')

Train data: 0.5843958932820943
Test data: 0.5918704878368073


### Any relationship bw breed and pet categories?

<img src="./diagrams/diag7.png" height="600" width="1000">

- if there had been a one-to-one relationship, then the model we build for one would have been suitable for the other, but it is not so

- so we should build separate models for each
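
One way to check the one-to-one claim numerically, rather than from the diagram alone, is a contingency table. This is a sketch on a handful of rows copied from the outputs above, not on the full training data; the `one_to_one` flag is a hypothetical name.

```python
import pandas as pd

# A few (breed_category, pet_category) pairs taken from the tables above
demo = pd.DataFrame({
    "breed_category": [0.0, 0.0, 2.0, 0.0, 0.0, 1.0, 1.0],
    "pet_category": [1, 2, 4, 2, 1, 1, 2],
})

# Rows: breed_category, columns: pet_category, cells: row counts
table = pd.crosstab(demo["breed_category"], demo["pet_category"])
print(table)

# A one-to-one mapping would leave exactly one non-zero cell per row and per column
nonzero = table > 0
one_to_one = bool((nonzero.sum(axis=1) == 1).all() and (nonzero.sum(axis=0) == 1).all())
print("one-to-one:", one_to_one)
```

On the real data the same call would be `pd.crosstab(tr_df['breed_category'], tr_df['pet_category'])`.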

### Issue Date features exploration wrt breed type

1. Year-wise

<img src="./diagrams/diag8.png" height="600" width="1000">

2. Month-wise

- there seems to be some seasonality - maybe encode months further as seasons

<img src="./diagrams/diag9.png" height="600" width="1000">

3. Day-wise: Weekday weekend patterns?

<img src="./diagrams/diag11.png" height="600" width="1000">

- On weekends breed = 1 sales exceed that of breed = 0, so might be helpful to have a feature for this
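
A weekend indicator of the kind suggested here could be derived as follows. This is a sketch on five issue dates copied from `tr_df.head()`; the name `is_weekend` is an assumption.

```python
import pandas as pd

# Issue dates copied from tr_df.head(), for illustration only
issue_dates = pd.to_datetime(pd.Series([
    "2016-07-10", "2013-11-21", "2014-09-28", "2016-12-31", "2017-09-28",
]))

# pandas numbers weekdays Monday=0 ... Sunday=6, so 5 and 6 are the weekend
weekday = issue_dates.dt.weekday
is_weekend = (weekday >= 5).astype(int)
print(pd.DataFrame({"issue_date": issue_dates, "weekday": weekday, "is_weekend": is_weekend}))
```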


## Step 3: Feature engineering and Data cleaning for breed type

### Datetime manipulations

In [14]:
tr_df['issue_date']

pet_id
ANSL_69903    2016-07-10 00:00:00
ANSL_66892    2013-11-21 00:00:00
ANSL_69750    2014-09-28 00:00:00
ANSL_71623    2016-12-31 00:00:00
ANSL_57969    2017-09-28 00:00:00
                     ...         
ANSL_51738    2017-01-26 00:00:00
ANSL_59900    2016-06-18 00:00:00
ANSL_53210    2010-07-21 00:00:00
ANSL_63468    2017-05-12 00:00:00
ANSL_73558    2011-12-13 00:00:00
Name: issue_date, Length: 18834, dtype: object

In [15]:
tr_df['listing_date']

pet_id
ANSL_69903    2016-09-21 16:25:00
ANSL_66892    2018-12-27 17:47:00
ANSL_69750    2016-10-19 08:24:00
ANSL_71623    2019-01-25 18:30:00
ANSL_57969    2017-11-19 09:38:00
                     ...         
ANSL_51738    2018-03-09 15:35:00
ANSL_59900    2017-07-09 08:37:00
ANSL_53210    2018-08-22 14:27:00
ANSL_63468    2018-02-08 14:05:00
ANSL_73558    2018-10-26 14:18:00
Name: listing_date, Length: 18834, dtype: object

In [16]:
tr_df.sample(10)

pet_id,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2,breed_category,pet_category
ANSL_67216,2017-02-12 00:00:00,2018-03-09 17:14:00,1.0,Lynx Point,0.55,18.08,0,1,0.0,1
ANSL_55802,2014-05-14 00:00:00,2017-06-08 11:36:00,1.0,White,0.27,34.17,13,9,0.0,2
ANSL_53113,2016-08-01 00:00:00,2017-08-27 19:05:00,2.0,Tortie,0.3,49.57,16,9,1.0,1
ANSL_52190,2016-01-30 00:00:00,2017-02-23 13:42:00,0.0,Blue Tabby,0.08,44.0,13,9,1.0,1
ANSL_73266,2015-03-15 00:00:00,2017-07-22 14:40:00,0.0,Black,0.63,12.9,0,1,1.0,2
ANSL_75029,2011-10-19 00:00:00,2017-04-16 11:14:00,0.0,Brown Tabby,0.82,5.77,7,1,1.0,1
ANSL_70533,2018-07-05 00:00:00,2019-01-25 15:29:00,2.0,Tan,0.81,5.83,0,7,1.0,2
ANSL_72883,2017-12-27 00:00:00,2019-02-21 15:41:00,0.0,Black,0.3,31.38,0,1,1.0,2
ANSL_71619,2012-12-01 00:00:00,2016-07-19 15:29:00,0.0,White,0.77,33.07,0,7,1.0,2
ANSL_72214,2018-10-01 00:00:00,2019-01-03 14:58:00,0.0,Torbie,0.92,16.51,0,1,1.0,1


In [18]:
tr_df['issue_date'] = pd.to_datetime(tr_df['issue_date'])
tr_df['listing_date'] = pd.to_datetime(tr_df['listing_date'])

In [20]:
tr_df.head()

pet_id,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2,breed_category,pet_category
ANSL_69903,2016-07-10,2016-09-21 16:25:00,2.0,Brown Tabby,0.8,7.78,13,9,0.0,1
ANSL_66892,2013-11-21,2018-12-27 17:47:00,1.0,White,0.72,14.19,13,9,0.0,2
ANSL_69750,2014-09-28,2016-10-19 08:24:00,,Brown,0.15,40.9,15,4,2.0,4
ANSL_71623,2016-12-31,2019-01-25 18:30:00,1.0,White,0.62,17.82,0,1,0.0,2
ANSL_57969,2017-09-28,2017-11-19 09:38:00,2.0,Black,0.5,11.06,18,4,0.0,1


In [22]:
te_df['issue_date'] = pd.to_datetime(te_df['issue_date'])
te_df['listing_date'] = pd.to_datetime(te_df['listing_date'])

In [23]:
te_df.head()

pet_id,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2
ANSL_75005,2005-08-17,2017-09-07 15:35:00,0.0,Black,0.87,42.73,0,7
ANSL_76663,2018-11-15,2019-05-08 17:24:00,1.0,Orange Tabby,0.06,6.71,0,1
ANSL_58259,2012-10-11,2018-04-02 16:51:00,1.0,Black,0.24,41.21,0,7
ANSL_67171,2015-02-13,2018-04-06 07:25:00,1.0,Black,0.29,8.46,7,1
ANSL_72871,2017-01-18,2018-04-26 13:42:00,1.0,Brown,0.71,30.92,0,7


In [21]:
tr_df.dtypes

issue_date        datetime64[ns]
listing_date      datetime64[ns]
condition                float64
color_type                object
length(m)                float64
height(cm)               float64
X1                         int64
X2                         int64
breed_category           float64
pet_category               int64
dtype: object

In [24]:
te_df.dtypes

issue_date      datetime64[ns]
listing_date    datetime64[ns]
condition              float64
color_type              object
length(m)              float64
height(cm)             float64
X1                       int64
X2                       int64
dtype: object

In [55]:
days_diff = (tr_df['listing_date'] - tr_df['issue_date']).dt.days


tr_df['days_bw_list_issue'] = days_diff

In [56]:
tr_df

pet_id,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2,breed_category,pet_category,days_bw_list_issue
ANSL_69903,2016-07-10,2016-09-21 16:25:00,2.0,Brown Tabby,0.80,7.78,13,9,0.0,1,73
ANSL_66892,2013-11-21,2018-12-27 17:47:00,1.0,White,0.72,14.19,13,9,0.0,2,1862
ANSL_69750,2014-09-28,2016-10-19 08:24:00,,Brown,0.15,40.90,15,4,2.0,4,752
ANSL_71623,2016-12-31,2019-01-25 18:30:00,1.0,White,0.62,17.82,0,1,0.0,2,755
ANSL_57969,2017-09-28,2017-11-19 09:38:00,2.0,Black,0.50,11.06,18,4,0.0,1,52
...,...,...,...,...,...,...,...,...,...,...,...
ANSL_51738,2017-01-26,2018-03-09 15:35:00,2.0,Tricolor,0.44,27.36,0,1,0.0,2,407
ANSL_59900,2016-06-18,2017-07-09 08:37:00,,Brown,0.73,14.25,15,4,2.0,4,386
ANSL_53210,2010-07-21,2018-08-22 14:27:00,0.0,Calico Point,0.99,28.13,13,9,1.0,1,2954
ANSL_63468,2017-05-12,2018-02-08 14:05:00,0.0,Tan,0.55,44.82,13,9,1.0,2,272


In [45]:
te_df.dtypes

issue_date      datetime64[ns]
listing_date    datetime64[ns]
condition              float64
color_type              object
length(m)              float64
height(cm)             float64
X1                       int64
X2                       int64
dtype: object

In [53]:
days_diff = (te_df['listing_date'] - te_df['issue_date']).dt.days

te_df['days_bw_list_issue'] = days_diff

In [54]:
te_df

pet_id,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2,days_bw_list_issue
ANSL_75005,2005-08-17,2017-09-07 15:35:00,0.0,Black,0.87,42.73,0,7,4404
ANSL_76663,2018-11-15,2019-05-08 17:24:00,1.0,Orange Tabby,0.06,6.71,0,1,174
ANSL_58259,2012-10-11,2018-04-02 16:51:00,1.0,Black,0.24,41.21,0,7,1999
ANSL_67171,2015-02-13,2018-04-06 07:25:00,1.0,Black,0.29,8.46,7,1,1148
ANSL_72871,2017-01-18,2018-04-26 13:42:00,1.0,Brown,0.71,30.92,0,7,463
...,...,...,...,...,...,...,...,...,...
ANSL_66809,2016-02-10,2017-03-10 14:56:00,2.0,Brown,0.82,36.08,13,9,394
ANSL_59041,2015-12-07,2018-02-12 00:00:00,0.0,Tan,0.49,27.54,13,9,798
ANSL_60034,2015-12-08,2017-01-04 17:19:00,0.0,Black,0.98,37.19,0,7,393
ANSL_58066,2016-06-28,2017-07-20 18:19:00,,Black,0.79,23.83,0,2,387


Are any of the day differences negative?

In [57]:
tr_df['days_bw_list_issue'].describe()

count    18834.000000
mean       855.306786
std       1096.674990
min        -76.000000
25%        119.000000
50%        392.000000
75%       1117.000000
max       8056.000000
Name: days_bw_list_issue, dtype: float64

In [58]:
tr_df.loc[tr_df['days_bw_list_issue'] < 0]

pet_id,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2,breed_category,pet_category,days_bw_list_issue
ANSL_52243,2018-01-17,2018-01-14 15:13:00,2.0,Orange Tabby,0.72,43.19,13,9,0.0,1,-3
ANSL_63737,2016-11-18,2016-09-03 17:01:00,0.0,Black,0.88,27.82,0,1,1.0,1,-76


In [70]:
te_df.loc[te_df['days_bw_list_issue'] < 0]

pet_id,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2,days_bw_list_issue


Let's delete these 2 observations from the training data

In [71]:
print (tr_df.shape)

tr_df = tr_df.loc[tr_df['days_bw_list_issue'] > 0]

print (tr_df.shape)

(18834, 11)
(18832, 11)


In [72]:
print (te_df.shape)

te_df = te_df.loc[te_df['days_bw_list_issue'] > 0]

print (te_df.shape)

(8072, 9)
(8072, 9)


In [59]:
te_df['days_bw_list_issue'].describe()

count    8072.000000
mean      856.057607
std      1103.689752
min        20.000000
25%       122.000000
50%       393.000000
75%      1116.000000
max      9154.000000
Name: days_bw_list_issue, dtype: float64

The test data has no negative values (the minimum is 20 days)

In [61]:
tr_df['days_bw_list_issue'].max()

8056

In [68]:
tr_df.loc[tr_df['days_bw_list_issue']==0]

pet_id,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2,breed_category,pet_category,days_bw_list_issue


In [69]:
te_df.loc[te_df['days_bw_list_issue']==0]

pet_id,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2,days_bw_list_issue


In [73]:
# tr_df was produced by a .loc filter above; take an explicit copy so that
# adding a column does not trigger pandas' SettingWithCopyWarning
tr_df = tr_df.copy()
tr_df['days_bw_list_issue_log2'] = np.log2(tr_df['days_bw_list_issue'])

tr_df.head()


pet_id,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2,breed_category,pet_category,days_bw_list_issue,days_bw_list_issue_log2
ANSL_69903,2016-07-10,2016-09-21 16:25:00,2.0,Brown Tabby,0.8,7.78,13,9,0.0,1,73,6.189825
ANSL_66892,2013-11-21,2018-12-27 17:47:00,1.0,White,0.72,14.19,13,9,0.0,2,1862,10.862637
ANSL_69750,2014-09-28,2016-10-19 08:24:00,,Brown,0.15,40.9,15,4,2.0,4,752,9.554589
ANSL_71623,2016-12-31,2019-01-25 18:30:00,1.0,White,0.62,17.82,0,1,0.0,2,755,9.560333
ANSL_57969,2017-09-28,2017-11-19 09:38:00,2.0,Black,0.5,11.06,18,4,0.0,1,52,5.70044


In [77]:
tr_df['days_bw_list_issue_log2'].describe()

count    18832.000000
mean         8.643302
std          1.884812
min          4.247928
25%          6.894818
50%          8.614710
75%         10.125413
max         12.975848
Name: days_bw_list_issue_log2, dtype: float64

In [76]:
tr_df.to_csv('tr_df.csv')

<img src="./diagrams/diag12.png" height="600" width="1000">


- As the plot shows, applying the log2 transform to the gap between issue and listing dates reveals a good deal about the sale of each breed type
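
To make the effect of the transform concrete: the raw gap spans from a few dozen days to several thousand (up to 8056 in train), while its log2 stays roughly between 4 and 13. The sketch below reproduces the transform on the five day counts shown in `tr_df.head()`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Day gaps copied from the days_bw_list_issue column of tr_df.head()
days = pd.Series([73, 1862, 752, 755, 52], name="days_bw_list_issue")

# log2 is only defined for positive values, which is why rows with a
# non-positive gap have to be removed before this step
log_days = np.log2(days)

print(pd.DataFrame({"days": days, "log2(days)": log_days.round(6)}))
print("raw max/min ratio:", days.max() / days.min())
print("log2 max-min gap :", log_days.max() - log_days.min())
```

A ratio between two raw values becomes a difference between their logs, which compresses the long right tail and spreads out the short gaps.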

In [78]:
te_df['days_bw_list_issue_log2'] = np.log2(te_df['days_bw_list_issue'])

te_df.head()

pet_id,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2,days_bw_list_issue,days_bw_list_issue_log2
ANSL_75005,2005-08-17,2017-09-07 15:35:00,0.0,Black,0.87,42.73,0,7,4404,12.104599
ANSL_76663,2018-11-15,2019-05-08 17:24:00,1.0,Orange Tabby,0.06,6.71,0,1,174,7.442943
ANSL_58259,2012-10-11,2018-04-02 16:51:00,1.0,Black,0.24,41.21,0,7,1999,10.965063
ANSL_67171,2015-02-13,2018-04-06 07:25:00,1.0,Black,0.29,8.46,7,1,1148,10.164907
ANSL_72871,2017-01-18,2018-04-26 13:42:00,1.0,Brown,0.71,30.92,0,7,463,8.854868


In [79]:
te_df['days_bw_list_issue_log2'].describe()

count    8072.000000
mean        8.647809
std         1.878908
min         4.321928
25%         6.930737
50%         8.618386
75%        10.124121
max        13.160187
Name: days_bw_list_issue_log2, dtype: float64

### Dates: month, year, weekday, seasons

In [80]:
def build_date_features():
    """
    Build month, year, weekday, season features
    Jan-April, May-Sep, Oct-Dec 
    """
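
The function above is left as a stub. One possible way to fill it in is sketched below, assuming the date column is already a pandas datetime and that the season bins are exactly the three ranges named in the docstring (Jan-April, May-Sep, Oct-Dec). The `df` and `col` parameters, the output column names and the integer season codes 0/1/2 are assumptions.

```python
import pandas as pd

def build_date_features(df, col="issue_date"):
    """
    Hypothetical version: add year, month, weekday and season columns for `col`.
    """
    out = df.copy()
    out[f"{col}_year"] = out[col].dt.year
    out[f"{col}_month"] = out[col].dt.month
    out[f"{col}_weekday"] = out[col].dt.weekday  # Monday=0 ... Sunday=6
    # Bins are right-inclusive: (0, 4] -> 0 (Jan-Apr), (4, 9] -> 1 (May-Sep), (9, 12] -> 2 (Oct-Dec)
    out[f"{col}_season"] = pd.cut(
        out[col].dt.month, bins=[0, 4, 9, 12], labels=[0, 1, 2]
    ).astype(int)
    return out

# Issue dates copied from the tables above, for illustration only
demo = pd.DataFrame({
    "issue_date": pd.to_datetime(["2016-07-10", "2013-11-21", "2016-12-31", "2017-02-12"])
})
features = build_date_features(demo)
print(features)
```

Calling it once per date column (`issue_date`, `listing_date`) on both frames would keep train and test aligned.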
    

In [81]:
tr_df['issue_date'].describe()

count                   18832
unique                   3907
top       2017-03-20 00:00:00
freq                       41
first     1994-12-23 00:00:00
last      2019-03-17 00:00:00
Name: issue_date, dtype: object

In [82]:
te_df['issue_date'].describe()

count                    8072
unique                   2823
top       2016-11-21 00:00:00
freq                       22
first     1993-03-03 00:00:00
last      2019-03-07 00:00:00
Name: issue_date, dtype: object

In [88]:
tr_df['issue_date'].iloc[0].weekday()

6