<a href="https://colab.research.google.com/github/Pager07/Tensorflow-Data-and-Deployment/blob/master/course%204/week%201/Feature_columns.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install scikit-learn



In [2]:
import numpy as np 
import pandas as pd 
import tensorflow as tf 

from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split


## Download the data
Pandas is a Python library with many helpful utilities for loading and working with structured data. We will use Pandas to download the dataset from a URL, and load it into a dataframe.

In [3]:
import pathlib
dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'

tf.keras.utils.get_file('petfinder_mini.zip', dataset_url,
                        extract=True, cache_dir='.')
dataframe = pd.read_csv(csv_file)

Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip


In [4]:
dataframe.head()

Unnamed: 0,Type,Age,Breed1,Gender,Color1,Color2,MaturitySize,FurLength,Vaccinated,Sterilized,Health,Fee,Description,PhotoAmt,AdoptionSpeed
0,Cat,3,Tabby,Male,Black,White,Small,Short,No,No,Healthy,100,Nibble is a 3+ month old ball of cuteness. He ...,1,2
1,Cat,1,Domestic Medium Hair,Male,Black,Brown,Medium,Medium,Not Sure,Not Sure,Healthy,0,I just found it alone yesterday near my apartm...,2,0
2,Dog,1,Mixed Breed,Male,Brown,White,Medium,Medium,Yes,No,Healthy,0,Their pregnant mother was dumped by her irresp...,7,3
3,Dog,4,Mixed Breed,Female,Black,Brown,Medium,Short,Yes,No,Healthy,150,"Good guard dog, very alert, active, obedience ...",8,2
4,Dog,1,Mixed Breed,Male,Black,No Color,Medium,Short,No,No,Healthy,0,This handsome yet cute boy is up for adoption....,3,2


## Create target variable

The task in the original dataset is to predict the speed at which a pet will be adopted (e.g., in the first week, the first month, the first three months, and so on). 

Here, we will transform this into a binary classification problem, and simply predict whether the pet was adopted, or not.

After modifying the label column, 0 will indicate the pet was not adopted, and 1 will indicate it was.

In [5]:
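# An AdoptionSpeed of 4 means the pet was not adopted, so it maps to 0; every other value maps to 1
dataframe['target'] = np.where(dataframe['AdoptionSpeed'] == 4, 0, 1)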

In [10]:
dataframe = dataframe.drop(columns=['AdoptionSpeed', 'Description'])
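
As a quick check of the label mapping, the sketch below applies the same `np.where` expression to a hypothetical array holding one of each possible `AdoptionSpeed` value (0 through 4). The array is made up for illustration and is not taken from the dataset.

```python
import numpy as np

# Hypothetical AdoptionSpeed values, one of each class (not real rows).
adoption_speed = np.array([0, 1, 2, 3, 4])

# Same expression as in the notebook: 4 (not adopted) -> 0, everything else -> 1.
target = np.where(adoption_speed == 4, 0, 1)
print(target)  # [1 1 1 1 0]
```

Only the last entry, the one with `AdoptionSpeed == 4`, receives the label 0.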


## Split the dataframe into train, validation, and test

The dataset we downloaded was a single CSV file. We will split this into train, validation, and test sets.

In [11]:
train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

7383 train examples
1846 validation examples
2308 test examples
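
Because the second split is applied to what remains after the first, the final proportions are roughly 64% train, 16% validation, and 20% test. The arithmetic below is a sketch that assumes each held-out portion is rounded up to a whole row, which reproduces the counts printed above. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_size=0.2, val_size=0.2):
    """Return (train, val, test) sizes for two successive splits."""
    n_test = math.ceil(n_rows * test_size)      # first split: hold out test rows
    n_remaining = n_rows - n_test
    n_val = math.ceil(n_remaining * val_size)   # second split: hold out validation rows
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

total = 7383 + 1846 + 2308  # row count implied by the output above
print(split_sizes(total))   # (7383, 1846, 2308)
```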


## Create an input pipeline using tf.data

Next, we will wrap the dataframes with tf.data.
-  This will enable us to use feature columns as a bridge to map from the columns in the Pandas dataframe to features used to train the model.

-  If we were working with a very large CSV file (so large that it does not fit into memory), we would use tf.data to read it from disk directly. That is not covered in this tutorial.

In [12]:
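# A utility method to create a tf.data dataset from a Pandas DataFrame
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()      # avoid mutating the caller's dataframe
  labels = dataframe.pop('target')  # separate the label column from the features
  # Each element is a (dict of feature name -> value, label) pair
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    # A buffer as large as the dataset gives a full shuffle
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds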

In [15]:
batch_size = 5
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)


## Understand the input pipeline

Now that we have created the input pipeline, let's call it to see the format of the data it returns. We have used a small batch size to keep the output readable.

In [18]:
for feature_batch, label_batch in train_ds.take(1):
  print('Every feature:',list(feature_batch.keys()))
  print('A batch of ages:', feature_batch['Age'])
  print('A batch of target:' , label_batch)


Every feature: ['Type', 'Age', 'Breed1', 'Gender', 'Color1', 'Color2', 'MaturitySize', 'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Fee', 'PhotoAmt']
A batch of ages: tf.Tensor([5 2 3 2 4], shape=(5,), dtype=int64)
A batch of target: tf.Tensor([1 1 1 1 1], shape=(5,), dtype=int64)
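
To make the batch structure concrete, here is a NumPy-only sketch of slicing a dict of columns into batches. It illustrates the idea and is not how `tf.data` is implemented; `iter_batches` and the toy columns are hypothetical.

```python
import numpy as np

def iter_batches(features, labels, batch_size):
    """Yield (dict of column -> batch array, label batch) pairs in order."""
    n_rows = len(labels)
    for start in range(0, n_rows, batch_size):
        stop = start + batch_size
        feature_batch = {name: col[start:stop] for name, col in features.items()}
        yield feature_batch, labels[start:stop]

# Toy data: two feature columns and a label column (made up for illustration).
features = {
    'Age': np.array([5, 2, 3, 2, 4, 1, 7]),
    'Type': np.array(['Cat', 'Dog', 'Dog', 'Cat', 'Dog', 'Cat', 'Dog']),
}
labels = np.array([1, 1, 1, 1, 1, 0, 1])

batches = list(iter_batches(features, labels, batch_size=5))
first_features, first_labels = batches[0]
print(list(first_features.keys()))  # ['Age', 'Type']
print(first_features['Age'])        # [5 2 3 2 4]
print(len(batches[1][1]))           # 2 (the last batch holds the leftover rows)
```

Each batch is a dict keyed by column name, paired with the labels for the same rows, which is the shape of data the feature columns consume.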


## Demonstrate several types of feature columns
TensorFlow provides many types of feature columns. In this section, we will create several types of feature columns, and demonstrate how they transform a column from the dataframe.

In [19]:
# We will use this batch to demonstrate several types of feature columns
# The batch is a dict mapping each column name to a batch of values; think of it as a small table
example_batch = next(iter(train_ds))[0]


In [21]:
# A utility method that wraps a feature column in a DenseFeatures layer,
# applies it to the example batch, and prints the transformed values
def demo(feature_column):
  feature_layer = layers.DenseFeatures(feature_column)
  print(feature_layer(example_batch).numpy())

### Numeric columns
The output of a feature column becomes the input to the model. Using the demo function defined above, we can see exactly how each column from the dataframe is transformed. A [numeric column](https://www.tensorflow.org/api_docs/python/tf/feature_column/numeric_column) is the simplest type of column. It represents real-valued features. When using this column, your model receives the column value from the dataframe unchanged.

In [22]:
# The key is the feature (column) name in the dataset; it must be spelled exactly, or the lookup will not work
photo_count = feature_column.numeric_column('PhotoAmt')
demo(photo_count)

[[3.]
 [3.]
 [5.]
 [2.]
 [4.]]


### Bucketized columns
- Often, you don't want to feed a number directly into the model, but instead split its value into different categories based on numerical ranges. 
- Consider raw data that represents a person's age. Instead of representing age as a numeric column, we could split the age into several buckets using a [bucketized column](https://www.tensorflow.org/api_docs/python/tf/feature_column/bucketized_column).

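- Notice that the one-hot values below describe which age range each row matches. With boundaries `[1, 3, 5]` there are four buckets: below 1, 1 to under 3, 3 to under 5, and 5 or above.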

In [31]:
age = feature_column.numeric_column('Age')
age_buckets = feature_column.bucketized_column(age, boundaries=[1,3,5])
demo(age_buckets)

[[0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
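
The same transformation can be sketched in plain NumPy. This is an illustration of the bucketing logic, not TensorFlow's implementation, and the ages are hypothetical values chosen so that the result matches the pattern printed above.

```python
import numpy as np

boundaries = [1, 3, 5]
# Hypothetical ages (the real batch values are not shown in the output above).
ages = np.array([2, 7, 1, 4, 5])

# Bucket index i means boundaries[i-1] <= age < boundaries[i];
# index 0 is "below 1" and index 3 is "5 or above".
bucket_ids = np.digitize(ages, boundaries)

# One-hot encode: one column per bucket (len(boundaries) + 1 buckets).
one_hot = np.eye(len(boundaries) + 1)[bucket_ids]
print(bucket_ids)  # [1 3 1 2 3]
print(one_hot)
```

Note that a value equal to a boundary (such as an age of 1 or 5) falls into the bucket that starts at that boundary.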


### Categorical columns
- In this dataset, Type is represented as a string (e.g. 'Dog', or 'Cat'). We cannot feed strings directly to a model. 
- Instead, we must first map them to numeric values (that is, tokenize them).

- The categorical vocabulary columns provide a way to represent strings as a one-hot vector (much like you have seen above with age buckets). 

- The vocabulary can be passed as a list using [categorical_column_with_vocabulary_list](https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_vocabulary_list), or loaded from a file using [categorical_column_with_vocabulary_file](https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_vocabulary_file).

In [None]:
# 'Type' is the feature name; the list is the vocabulary of expected values
animal_type = feature_column.categorical_column_with_vocabulary_list('Type', ['Cat' , 'Dog'])

animal_type_one_hot = feature_column.indicator_column(animal_type)

demo(animal_type_one_hot)
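
As a rough picture of what the vocabulary lookup followed by the indicator column produces, here is a NumPy sketch. It is not TensorFlow's code: `one_hot_encode` is a hypothetical helper, and mapping an out-of-vocabulary string (here `'Bird'`) to an all-zero row is an assumption of this sketch.

```python
import numpy as np

vocabulary = ['Cat', 'Dog']
index_of = {word: i for i, word in enumerate(vocabulary)}

def one_hot_encode(values):
    """Map each string to a one-hot row; unknown strings become all zeros."""
    encoded = np.zeros((len(values), len(vocabulary)), dtype=np.float32)
    for row, value in enumerate(values):
        if value in index_of:
            encoded[row, index_of[value]] = 1.0
    return encoded

encoded = one_hot_encode(['Cat', 'Dog', 'Dog', 'Bird'])
print(encoded)
```

The first column marks `'Cat'` and the second marks `'Dog'`, following the order of the vocabulary list.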

### Embedding columns 
Suppose instead of having just a few possible strings, we have thousands (or more) values per category. 
  - For a number of reasons, as the number of categories grows large, it becomes infeasible to train a neural network using one-hot encodings. 
  
  - We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an [embedding column](https://www.tensorflow.org/api_docs/python/tf/feature_column/embedding_column) represents that data as a lower-dimensional, dense vector in which each cell can contain any number, not just 0 or 1. 
  - The size of the embedding (8, in the example below) is a parameter that must be tuned.

Key point: using an embedding column is best when a categorical column has many possible values. We are using one here for demonstration purposes, so you have a complete example you can modify for a different dataset in the future.

One-hot encoding vs. embedding
- In a sense, they are much the same thing:
  - Both project a sample/token into a different feature space.
- Differences:
  - One-hot encoding:
    - The number of dimensions equals the number of unique tokens.
    - Values are always 0 or 1.
  - Embedding:
    - The number of dimensions is a hyperparameter.
    - Values can be any real number (including negatives, as in the output below) and are learned during training.


In [43]:
breed1 = feature_column.categorical_column_with_vocabulary_list(
    'Breed1', dataframe.Breed1.unique()
)

breed1_embedding = feature_column.embedding_column(breed1, dimension=8)
demo(breed1_embedding)

[[ 0.06940291 -0.08258729  0.2041481   0.24185796 -0.50573045  0.21611463
  -0.3276213   0.47428283]
 [ 0.06940291 -0.08258729  0.2041481   0.24185796 -0.50573045  0.21611463
  -0.3276213   0.47428283]
 [-0.02527314 -0.06092336 -0.01372539 -0.11359927 -0.44013467 -0.40439257
   0.23479083 -0.18684107]
 [-0.02527314 -0.06092336 -0.01372539 -0.11359927 -0.44013467 -0.40439257
   0.23479083 -0.18684107]
 [-0.06365097 -0.0945277  -0.4458406  -0.39117113  0.39095742 -0.06077234
  -0.14945887 -0.59262264]]
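
Rows with identical vectors in the output above share the same embedding-table row, which is what happens when two examples have the same breed. The NumPy sketch below shows the underlying idea under toy assumptions (a 4-entry vocabulary, random weights, hypothetical integer ids): an embedding is a row lookup in a table, equivalent to multiplying a one-hot vector by that table.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dimension = 4, 8  # toy sizes; the notebook uses dimension=8

# One row per category; random values stand in for learned weights.
embedding_table = rng.normal(scale=0.3, size=(vocab_size, dimension))

# Hypothetical integer ids, as produced by a vocabulary lookup.
breed_ids = np.array([2, 2, 0, 0, 3])

looked_up = embedding_table[breed_ids]   # direct row lookup
one_hot = np.eye(vocab_size)[breed_ids]  # shape (5, 4)
via_matmul = one_hot @ embedding_table   # same result through a one-hot product

print(looked_up.shape)                     # (5, 8)
print(np.allclose(looked_up, via_matmul))  # True
```

The lookup form avoids materializing the one-hot matrix, which matters when the vocabulary has thousands of entries.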
