## The Dataset

You will use a simplified version of the PetFinder [dataset](https://www.kaggle.com/c/petfinder-adoption-prediction). The CSV contains 4,999 rows. Each row describes a pet, and each column describes an attribute. You will use this information to predict whether the pet will be adopted.

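The table below describes the columns of the original dataset, which mixes numeric and categorical features. The CSV loaded in this tutorial is a preprocessed variant: it omits the free-text `Description` column, which this tutorial does not use, and stores the label in a `target` column rather than `AdoptionSpeed`.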

Column | Description | Feature Type | Data Type
------------|--------------------|----------------------|-----------------
Type | Type of animal (Dog, Cat) | Categorical | string
Age | Age of the pet | Numerical | integer
Breed1 | Primary breed of the pet | Categorical | string
Gender | Gender of the pet | Categorical | string
Color1 | Color 1 of pet | Categorical | string
Color2 | Color 2 of pet | Categorical | string
MaturitySize | Size at maturity | Categorical | string
FurLength | Fur length | Categorical | string
Vaccinated | Pet has been vaccinated | Categorical | string
Sterilized | Pet has been sterilized | Categorical | string
Health | Health condition | Categorical | string
Fee | Adoption fee | Numerical | integer
Description | Profile write-up for this pet | Text | string
PhotoAmt | Total uploaded photos for this pet | Numerical | integer
AdoptionSpeed | Speed of adoption | Classification | integer

In [1]:
import pandas as pd

In [2]:
df=pd.read_csv("pet_adopt.csv")

In [3]:
df.head()

| | Type | Age | Breed1 | Gender | Color1 | Color2 | MaturitySize | FurLength | Vaccinated | Sterilized | Health | Fee | PhotoAmt | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cat | 3 | Tabby | Male | Black | White | Small | Short | No | No | Healthy | 100 | 1 | 1 |
| 1 | Cat | 1 | Domestic Medium Hair | Male | Black | Brown | Medium | Medium | Not Sure | Not Sure | Healthy | 0 | 2 | 1 |
| 2 | Dog | 1 | Mixed Breed | Male | Brown | White | Medium | Medium | Yes | No | Healthy | 0 | 7 | 1 |
| 3 | Dog | 4 | Mixed Breed | Female | Black | Brown | Medium | Short | Yes | No | Healthy | 150 | 8 | 1 |
| 4 | Dog | 1 | Mixed Breed | Male | Black | No Color | Medium | Short | No | No | Healthy | 0 | 3 | 1 |


In [4]:
df.describe(include='all')

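| | Type | Age | Breed1 | Gender | Color1 | Color2 | MaturitySize | FurLength | Vaccinated | Sterilized | Health | Fee | PhotoAmt | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4999 | 4999.0 | 4999 | 4999 | 4999 | 4999 | 4999 | 4999 | 4999 | 4999 | 4999 | 4999.0 | 4999.0 | 4999.0 |
| unique | 2 | | 139 | 2 | 7 | 7 | 3 | 3 | 3 | 3 | 3 | | | |
| top | Dog | | Mixed Breed | Female | Black | No Color | Medium | Short | Yes | No | Healthy | | | |
| freq | 2854 | | 2023 | 2813 | 2293 | 1726 | 3470 | 2924 | 2198 | 3238 | 4806 | | | |
| mean | | 11.766753 | | | | | | | | | | 23.203841 | 3.614123 | 0.727145 |
| std | | 19.017515 | | | | | | | | | | 76.438513 | 3.184366 | 0.445471 |
| min | | 0.0 | | | | | | | | | | 0.0 | 0.0 | 0.0 |
| 25% | | 2.0 | | | | | | | | | | 0.0 | 2.0 | 0.0 |
| 50% | | 4.0 | | | | | | | | | | 0.0 | 3.0 | 1.0 |
| 75% | | 12.0 | | | | | | | | | | 0.0 | 5.0 | 1.0 |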


In [5]:
# df.items() yields (column name, Series) pairs; print only the first one.
for item in df.items():
  print(item)
  break

('Type', 0       Cat
1       Cat
2       Dog
3       Dog
4       Dog
       ... 
4994    Cat
4995    Cat
4996    Cat
4997    Dog
4998    Dog
Name: Type, Length: 4999, dtype: object)


In [6]:
from sklearn.model_selection import train_test_split

In [7]:
# Hold out 20% of the rows for testing, then 20% of the remainder for validation.
train, test = train_test_split(df, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

3199 train examples
800 validation examples
1000 test examples
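
The counts printed above follow from applying `test_size=0.2` twice. The sketch below reproduces them with plain arithmetic, assuming the held-out share is rounded up and the remainder goes to the other split, which is consistent with the printed output. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size (hypothetical helper)."""
    n_held_out = math.ceil(n_rows * test_size)   # held-out share, rounded up
    return n_rows - n_held_out, n_held_out

n_rest, n_test = split_sizes(4999, 0.2)      # first split: everything vs. test
n_train, n_val = split_sizes(n_rest, 0.2)    # second split: train vs. validation

print(n_train, 'train examples')
print(n_val, 'validation examples')
print(n_test, 'test examples')
```

The effective proportions are therefore roughly 64% train, 16% validation, and 20% test. Because no `random_state` is passed to `train_test_split`, the rows that land in each split change from run to run even though the counts do not.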


In [11]:
train.describe(include='all')

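| | Type | Age | Breed1 | Gender | Color1 | Color2 | MaturitySize | FurLength | Vaccinated | Sterilized | Health | Fee | PhotoAmt | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3199 | 3199.0 | 3199 | 3199 | 3199 | 3199 | 3199 | 3199 | 3199 | 3199 | 3199 | 3199.0 | 3199.0 | 3199.0 |
| unique | 2 | | 120 | 2 | 7 | 7 | 3 | 3 | 3 | 3 | 3 | | | |
| top | Dog | | Mixed Breed | Female | Black | No Color | Medium | Short | Yes | No | Healthy | | | |
| freq | 1845 | | 1293 | 1808 | 1446 | 1132 | 2209 | 1896 | 1394 | 2072 | 3073 | | | |
| mean | | 12.094092 | | | | | | | | | | 23.443264 | 3.604877 | 0.725852 |
| std | | 19.514612 | | | | | | | | | | 78.43917 | 3.215842 | 0.446154 |
| min | | 0.0 | | | | | | | | | | 0.0 | 0.0 | 0.0 |
| 25% | | 2.0 | | | | | | | | | | 0.0 | 1.0 | 0.0 |
| 50% | | 4.0 | | | | | | | | | | 0.0 | 3.0 | 1.0 |
| 75% | | 12.0 | | | | | | | | | | 0.0 | 5.0 | 1.0 |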


In [8]:
import tensorflow as tf

In [9]:
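def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    # Work on a copy so the caller's DataFrame keeps its 'target' column.
    dataframe = dataframe.copy()
    labels = dataframe.pop('target')
    # Each element is a (features dict keyed by column name, label) pair.
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        # A buffer as large as the dataset gives a full shuffle.
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    # Prepare upcoming batches while the current one is being consumed.
    ds = ds.prefetch(buffer_size=tf.data.AUTOTUNE)
    return ds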

In [10]:
batch_size = 32
train_dataset = df_to_dataset(train, shuffle=True, batch_size=batch_size)
val_dataset = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_dataset = df_to_dataset(test, shuffle=False, batch_size=batch_size)
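
With `batch_size = 32`, the number of batches in each dataset follows from the split sizes printed earlier. The arithmetic sketch below assumes `batch` keeps the final partial batch, which is its behaviour when `drop_remainder` is left at its default. `batch_counts` is a hypothetical helper introduced only for illustration.

```python
import math

def batch_counts(n_examples, batch_size):
    """Return (number of batches, size of the last batch) when the remainder is kept."""
    n_batches = math.ceil(n_examples / batch_size)
    last = n_examples - (n_batches - 1) * batch_size
    return n_batches, last

for name, n in [("train", 3199), ("val", 800), ("test", 1000)]:
    n_batches, last = batch_counts(n, 32)
    print(f"{name}: {n_batches} batches, last batch has {last} examples")
```

Only the validation split divides evenly by 32; the training and test datasets each end with a shorter batch.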

- **Numerical Features**:
  - `Age` (an integer; the preprocessing code below encodes it as a category rather than normalizing it)
  - `Fee`
  - `PhotoAmt`

- **Categorical Features**:
  - `Type`
  - `Breed1`
  - `Gender`
  - `Color1`
  - `Color2`
  - `MaturitySize`
  - `FurLength`
  - `Vaccinated`
  - `Sterilized`
  - `Health`
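
The grouping above can be sanity-checked against the data itself. The sketch below uses only the standard library so that it runs without `pet_adopt.csv`: it parses the five rows shown by `df.head()` and marks a column as numerical when every value parses as an integer. On the loaded frame, `df.select_dtypes(include='number')` gives the same kind of split. `is_int` and `infer_feature_types` are hypothetical helpers.

```python
import csv
import io

# The five rows printed by df.head(), without the index column.
SAMPLE = """Type,Age,Breed1,Gender,Color1,Color2,MaturitySize,FurLength,Vaccinated,Sterilized,Health,Fee,PhotoAmt,target
Cat,3,Tabby,Male,Black,White,Small,Short,No,No,Healthy,100,1,1
Cat,1,Domestic Medium Hair,Male,Black,Brown,Medium,Medium,Not Sure,Not Sure,Healthy,0,2,1
Dog,1,Mixed Breed,Male,Brown,White,Medium,Medium,Yes,No,Healthy,0,7,1
Dog,4,Mixed Breed,Female,Black,Brown,Medium,Short,Yes,No,Healthy,150,8,1
Dog,1,Mixed Breed,Male,Black,No Color,Medium,Short,No,No,Healthy,0,3,1
"""

def is_int(text):
    """True when the text parses as an integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False

def infer_feature_types(csv_text, label="target"):
    """Split the non-label columns into (numerical, categorical) name lists."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    numerical, categorical = [], []
    for column in rows[0]:
        if column == label:
            continue  # the label is not a feature
        values = [row[column] for row in rows]
        if all(is_int(v) for v in values):
            numerical.append(column)
        else:
            categorical.append(column)
    return numerical, categorical

numerical, categorical = infer_feature_types(SAMPLE)
print("numerical:", numerical)
print("categorical:", categorical)
```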

## Setup Preprocessing Layers

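1. **Use Normalization for Numerical Features**:
   - Standardize numerical features such as `Fee` and `PhotoAmt` to zero mean and unit variance, using statistics learned from the training data. (`Age` is also an integer column, but the code below treats it as categorical.)

2. **Use StringLookup and IntegerLookup for Categorical Features**:
   - Apply `StringLookup` to string features such as `Type`, `Breed1`, and `Gender`, and `IntegerLookup` to integer features, to map each value to an integer index. `CategoryEncoding` then turns those indices into fixed-length indicator vectors that the model can consume.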
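
To make the first step concrete: a normalization layer adapted to a feature learns that feature's mean and variance, then outputs `(x - mean) / sqrt(variance)`. The plain-Python sketch below shows that arithmetic on the five `Fee` values from `df.head()`, assuming the population variance (dividing by n). It is an illustration of the formula, not the Keras implementation.

```python
import math
import statistics

fees = [100.0, 0.0, 0.0, 150.0, 0.0]    # Fee values from the five rows shown by df.head()

mean = statistics.fmean(fees)
variance = statistics.pvariance(fees)   # population variance (assumption, see lead-in)

# Shift to zero mean, then scale to unit variance.
normalized = [(x - mean) / math.sqrt(variance) for x in fees]

print("mean:", mean, "variance:", variance)
print([round(v, 3) for v in normalized])
```

After this transform the values have mean 0 and variance 1, which keeps a feature with a large range, such as `Fee`, from dominating one with a small range, such as `PhotoAmt`.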


In [12]:
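# The preprocessing layers live directly under tf.keras.layers in current
# TensorFlow releases; the older `experimental.preprocessing` path is deprecated.
from tensorflow.keras import layers

In [None]:
def get_normalization_layer(name, dataset):
  # axis=None learns one scalar mean and variance for the whole feature.
  normalizer = layers.Normalization(axis=None)
  # Keep only the named feature; labels are not needed for the statistics.
  feature_ds = dataset.map(lambda x, y: x[name])
  # Learn the mean and variance from the data.
  normalizer.adapt(feature_ds)
  return normalizer

def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
  # Map raw values to integer indices; max_tokens caps the vocabulary size.
  if dtype == 'string':
    index = layers.StringLookup(max_tokens=max_tokens)
  else:
    index = layers.IntegerLookup(max_tokens=max_tokens)

  # Learn the vocabulary from the named feature only.
  feature_ds = dataset.map(lambda x, y: x[name])
  index.adapt(feature_ds)
  # Turn each index into an indicator vector of length vocabulary_size().
  encoder = layers.CategoryEncoding(num_tokens=index.vocabulary_size())
  # Chain lookup and encoding so callers can apply both in one step.
  return lambda feature: encoder(index(feature))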
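
What `get_category_encoding_layer` produces can be imitated in plain Python. The sketch assumes the lookup layers' default settings: index 0 is reserved for out-of-vocabulary values, the remaining indices are assigned to the most frequent values first, and `max_tokens` counts the out-of-vocabulary slot. `build_vocabulary` and `one_hot` are hypothetical helpers, not Keras APIs, and the colour data is made up.

```python
from collections import Counter

def build_vocabulary(values, max_tokens=None):
    """Most frequent values first; index 0 is kept for out-of-vocabulary inputs."""
    ranked = [value for value, _ in Counter(values).most_common()]
    if max_tokens is not None:
        ranked = ranked[:max_tokens - 1]     # one slot is taken by the OOV index
    return {value: i + 1 for i, value in enumerate(ranked)}

def one_hot(value, vocabulary):
    """Indicator vector with a 1.0 at the value's index (0 if unknown)."""
    vector = [0.0] * (len(vocabulary) + 1)   # +1 for the OOV slot at index 0
    vector[vocabulary.get(value, 0)] = 1.0
    return vector

# Made-up column with distinct frequencies, so the ranking is unambiguous.
colors = ["Black"] * 5 + ["Brown"] * 3 + ["White"] * 2 + ["Golden"]
vocab = build_vocabulary(colors, max_tokens=3)

print(vocab)                     # {'Black': 1, 'Brown': 2}
print(one_hot("Brown", vocab))   # [0.0, 0.0, 1.0]
print(one_hot("Golden", vocab))  # [1.0, 0.0, 0.0] (falls into the OOV slot)
```

Under these assumptions, a small `max_tokens` keeps only the most frequent values of a feature and sends every other value to the shared out-of-vocabulary slot. For a column with many distinct values, such as `Breed1`, that discards most of the detail.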

In [None]:
all_inputs = []
encoded_features = []

# Numeric features.
for header in ['PhotoAmt', 'Fee']:
  numeric_col = tf.keras.Input(shape=(1,), name=header)
  normalization_layer = get_normalization_layer(header, train_dataset)
  encoded_numeric_col = normalization_layer(numeric_col)
  all_inputs.append(numeric_col)
  encoded_features.append(encoded_numeric_col)

# Age is an integer column treated as categorical: IntegerLookup followed by CategoryEncoding.
age_col = tf.keras.Input(shape=(1,), name='Age', dtype='int64')
encoding_layer = get_category_encoding_layer('Age', train_dataset, dtype='int64', max_tokens=5)
encoded_age_col = encoding_layer(age_col)
all_inputs.append(age_col)
encoded_features.append(encoded_age_col)

# Categorical features encoded as strings.
categorical_cols = ['Type', 'Color1', 'Color2', 'Gender', 'MaturitySize', 'FurLength',
                    'Vaccinated', 'Sterilized', 'Health', 'Breed1']

for header in categorical_cols:
  categorical_col = tf.keras.Input(shape=(1,), name=header, dtype='string')
  encoding_layer = get_category_encoding_layer(header, train_dataset, dtype='string', max_tokens=5)
  encoded_categorical_col = encoding_layer(categorical_col)
  all_inputs.append(categorical_col)
  encoded_features.append(encoded_categorical_col)
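
As a rough check on what the loops above build, the width of each encoded feature can be estimated from the `unique` row of `train.describe(include='all')`. The sketch assumes the vocabulary size is `min(max_tokens, distinct values + 1)`, where the `+ 1` is the out-of-vocabulary slot, and that `Age` has at least four distinct values in the training split (the summary table does not show this). `encoded_width` is a hypothetical helper.

```python
MAX_TOKENS = 5

# Distinct values per string column in the training split (from train.describe).
train_unique = {
    "Type": 2, "Color1": 7, "Color2": 7, "Gender": 2, "MaturitySize": 3,
    "FurLength": 3, "Vaccinated": 3, "Sterilized": 3, "Health": 3, "Breed1": 120,
}

def encoded_width(n_distinct, max_tokens=MAX_TOKENS):
    """Vector length after lookup and encoding, under the stated assumption."""
    return min(max_tokens, n_distinct + 1)

widths = {"PhotoAmt": 1, "Fee": 1}   # normalized scalars keep width 1
widths["Age"] = MAX_TOKENS           # assumption: at least four distinct ages
widths.update({name: encoded_width(n) for name, n in train_unique.items()})

total = sum(widths.values())
print(widths)
print("total encoded width:", total)
```

Under these assumptions the thirteen inputs expand to 48 values per example once the encoded features are placed side by side.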
