# Feature Preprocessing

- Raw features are usually not immediately usable in a model
- First step:
    - Prepare the features. Raw inputs typically look like this:
        - User and item IDs may be strings (titles, usernames) or large, noncontiguous integers (database IDs)
        - Item descriptions (raw text)
        - Interaction timestamps (raw Unix timestamps)
    - Transform them so that they are useful for building models:
        - User and item IDs - Translated into embedding vectors
          (High-dimensional numerical representations that are adjusted during training
          to help the model predict its objective better)
        - Raw text - Tokenized (split into smaller parts such as individual words) and translated into embeddings
        - Numerical features - Normalized (so that values lie in a small interval around 0)
- TensorFlow lets us make preprocessing part of the model rather than a separate preprocessing step
    - This is convenient and ensures that preprocessing is the same during training and during serving
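
As a rough illustration of the text and numerical transformations listed above, the sketch below tokenizes a title and standardizes a few timestamps in plain Python. The helper names `tokenize` and `standardize` and the sample values are made up for this example; in TensorFlow the same steps would normally be handled by preprocessing layers inside the model.

```python
import statistics


def tokenize(text):
    """Split raw text into lowercase tokens (a deliberately naive whitespace tokenizer)."""
    return text.lower().split()


def standardize(values):
    """Rescale values to zero mean and unit variance (assumes the values are not all equal)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Sample values chosen only for illustration.
title = "One Flew Over the Cuckoo's Nest (1975)"
timestamps = [879024327, 875654590, 882075110]

tokens = tokenize(title)
normalized = standardize(timestamps)

print(tokens)
print(normalized)
```

Raw Unix timestamps are far too large to feed into a model directly, which is why they are rescaled to a small interval around 0 first.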

In [1]:
import pprint
import tensorflow_datasets as tfds



In [5]:
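# Load the MovieLens 100K ratings; each element is a dictionary of features.
ratings = tfds.load('movielens/100k-ratings', split='train')

# Print the first example as NumPy values to inspect the raw features.
for x in ratings.take(1).as_numpy_iterator():
    pprint.pprint(x)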




{'bucketized_user_age': 45.0,
 'movie_genres': array([7], dtype=int64),
 'movie_id': b'357',
 'movie_title': b"One Flew Over the Cuckoo's Nest (1975)",
 'raw_user_age': 46.0,
 'timestamp': 879024327,
 'user_gender': True,
 'user_id': b'138',
 'user_occupation_label': 4,
 'user_occupation_text': b'doctor',
 'user_rating': 4.0,
 'user_zip_code': b'53211'}


- Movie title - Useful as a movie identifier (Categorical)
- User id - Useful as a user identifier (Categorical)
- Timestamps - Model the effect of time (Continuous)
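
A small sketch of pulling these three features out of a record like the one printed above. The `example` dictionary copies a few fields from that output, and the variable names are illustrative. With `tfds`, the same selection would usually be applied to the whole dataset with `ratings.map(...)`.

```python
from datetime import datetime, timezone

# A few fields copied from the example record printed above.
example = {
    'movie_title': b"One Flew Over the Cuckoo's Nest (1975)",
    'user_id': b'138',
    'timestamp': 879024327,
}

# Categorical features arrive as byte strings; decode them to text.
movie_title = example['movie_title'].decode('utf-8')
user_id = example['user_id'].decode('utf-8')

# The timestamp is seconds since the Unix epoch, a continuous quantity.
when = datetime.fromtimestamp(example['timestamp'], tz=timezone.utc)

print(movie_title, user_id, when.isoformat())
```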


## Turning categorical features into embeddings

- A categorical feature
    - A feature that does not express a continuous quantity
    - Takes on one of a fixed set of values
- Most deep learning models represent these features by turning them into high-dimensional vectors
    - During model training, the values of that vector are adjusted to help the
    model predict its objective better
- Example:
    - Goal: predict which user will watch which movie
    - Represent each user and each movie by an embedding vector
    - Embeddings start out with random values, but are adjusted during training
    so that embeddings of users and the movies they watch end up closer together
- Taking raw categorical features and turning them into embeddings is a 2-step process:
    - Translate raw values into a range of contiguous integers
    (Normally by building a mapping that maps raw values to integers)
    - Take these integers and turn them into embeddings
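
The two steps can be sketched in plain Python as follows. This only illustrates the idea and is not TensorFlow's implementation: `build_vocabulary`, `to_index`, `embed`, `EMBEDDING_DIM`, and the sample titles are assumptions made for the example, and reserving index 0 for unknown values is a choice made here. In TensorFlow, layers such as `tf.keras.layers.StringLookup` and `tf.keras.layers.Embedding` cover these two steps.

```python
import random

EMBEDDING_DIM = 4  # Illustrative size; real models typically use larger embeddings.


def build_vocabulary(raw_values):
    """Step 1: map each distinct raw value to a contiguous integer, starting at 1."""
    distinct = sorted(set(raw_values))
    return {value: index for index, value in enumerate(distinct, start=1)}


# Sample titles for illustration.
titles = ["Toy Story (1995)", "Fargo (1996)", "Toy Story (1995)", "Alien (1979)"]
vocabulary = build_vocabulary(titles)


def to_index(value):
    """Look up a raw value; unseen values fall back to index 0."""
    return vocabulary.get(value, 0)


# Step 2: one randomly initialized vector per index, plus row 0 for unknown values.
rng = random.Random(42)
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(EMBEDDING_DIM)]
    for _ in range(len(vocabulary) + 1)
]


def embed(value):
    """Turn a raw categorical value into its embedding vector."""
    return embedding_table[to_index(value)]


print(vocabulary)
print(embed("Fargo (1996)"))
```

During training, the rows of `embedding_table` would be updated by gradient descent; in this sketch they stay at their random initial values.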