# Preprocessing
- Implemented using **Polars + custom functions inspired by Feature-Engine** for speed.
- Handled outliers with **winsorization**.
- Treated categorical features: `'decade'`, `'genres'`, `'hour'`, `'dayofweek'`.
- Applied **ordinal encoding**, fitted on the training data only.
- **Out-of-time split**:
  - Train: ~85% of data (before 2011-01-01)
  - Test: ~15% of data (on or after 2011-01-01)
- Saved preprocessed **train and test sets**.


In [3]:
import os
import polars as pl
import numpy as np
import joblib
from utils.utils import PolarsPreprocessor
from data_types import (
    impute_zeros, impute_minus_one, categorical_features,
    winsorize_variables, all_predictive_features  
)

I'll load the data with Polars and preprocess it with custom functions inspired by feature_engine.
- The data type definitions are in the data_types.py file.
- I selected the features to winsorize based on their distributions: the ones with outliers are winsorized to limit the effect of those outliers.
- I decided to treat the following features as categorical:
    - 'decade' of the movie
    - 'genres' (the combination of genres of the movie, e.g. "Drama|Fantasy|Romance")
    - 'hour' of the prediction
    - 'dayofweek' of the prediction
        - I'll process these with an ordinal encoder, **fitted on the training data only**, never on the test data.
- I grouped the features to impute with zeros and the features to impute with -1. The choice depends on the definition of each feature: for some features a null value is itself information, which -1 can represent.


I typically use Feature-Engine (https://feature-engine.trainindata.com/en/1.8.x/index.html) for feature preprocessing, but here I decided to implement the functions I need in Polars, with ChatGPT's help, to make the preprocessing faster.

In [4]:
#X = pl.scan_parquet("data.parquet")

In [5]:
data = pl.read_parquet('data/raw_data.parquet')

In [6]:
target = 'TARGET'

In [7]:
data.shape

(20000263, 69)

I will use an out-of-time train/test split, which suits this type of problem better than a random split: the model is trained on data from one time period and then evaluated on later data it has never seen, which is how it would be used in a time-dependent scenario.

- I'll use approximately 85% of the data for training.

In [8]:
anio_p85 = (
    data.select(
        pl.col("timestamp").dt.year().alias("year")
    )
    .select(
        pl.col("year").quantile(0.8529)
    )
    .item()
)

print(f"The year that represents the ~85th percentile is: {int(anio_p85)}")

The year that represents the ~85th percentile is: 2011


In [9]:
train = data.filter(data['timestamp'] < pl.lit("2011-01-01").str.to_date())
test = data.filter(data['timestamp'] >= pl.lit("2011-01-01").str.to_date())
del data

In [10]:
train = train.sort("timestamp")
test = test.sort("timestamp")

In [11]:
y_train = train[target]
X_train = train[all_predictive_features]

y_test = test[target]
X_test = test[all_predictive_features]

In [12]:
print('Train size percentage: ', train.shape[0] / (train.shape[0] + test.shape[0]))
print('Test size percentage: ', test.shape[0] / (train.shape[0] + test.shape[0]))

Train size percentage:  0.8528091355598674
Test size percentage:  0.1471908644401326


In [13]:
preprocessor = PolarsPreprocessor(
    impute_zeros=impute_zeros,
    impute_minus_one=impute_minus_one,
    categorical_features=categorical_features,
    winsorize_variables=winsorize_variables,
)

In [14]:
preprocessor.fit(X_train, y_train)

[Pipeline] .... (step 1 of 3) Processing winsor_stats
[Pipeline] .... (step 1 of 3) Processing winsor_stats, total=  4.82s
[Pipeline] .... (step 2 of 3) Processing rare_label_encoder
[Pipeline] .... (step 2 of 3) Processing rare_label_encoder, total=  6.29s
[Pipeline] .... (step 3 of 3) Processing ordinal_encoder
[Pipeline] .... (step 3 of 3) Processing ordinal_encoder, total=  3.56s


In [15]:
joblib.dump(preprocessor, 'models/preprocessor.joblib.dat')

['models/preprocessor.joblib.dat']

Let's take a look at the data before it is processed.

In [16]:
X_train.head()

prev_rating,num_prev_ratings,mean_ratings,std_ratings,max_prev_ratings,min_prev_ratings,mean_previous_target,num_prev_ratings_greater_than_4,prev_target,hour,dayofweek,year,seconds_since_last_rating,num_movie_prev_ratings,movie_mean_rating,movie_std_ratings,movie_max_prev_ratings,movie_min_prev_ratings,movie_mean_previous_target,movie_num_prev_ratings_greater_than_4,tags_count,tags_avg_sentiment,tags_std_sentiment,tags_min_sentiment,tags_max_sentiment,tags_avg_relevance,tags_std_relevance,tags_min_relevance,tags_max_relevance,total_tags_with_relevance,avg_seconds_between_rating_and_tags,tagging_duration_seconds,avg_tag_length,total_special_chars_in_tags,total_rare_symbols_in_tags,at_least_one_symbol_in_tags,porc_tags_with_relevance_score,genres,movie_year,movie_number_of_genres,decade,is_sequel,title_len,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
f32,u32,f64,f64,f32,f32,f64,i64,i8,i8,i8,i32,i64,u32,f64,f64,f32,f32,f64,i64,u32,f64,f64,f64,f64,f64,f64,f64,f64,u32,f64,i64,f64,u32,u32,bool,f64,str,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64
,,,,,,,,,11,1,1995,,,,,,,,,,,,,,,,,,,,,,,,,,"""Drama|Fantasy|Romance""",1991,3,1990,0,59,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0
,,,,,,,,,11,1,1995,,,,,,,,,,,,,,,,,,,,,,,,,,"""Comedy|Crime|Thriller""",1995,3,1990,0,10,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0
3.0,1.0,3.0,0.0,3.0,3.0,0.0,0.0,0.0,11,1,1995,0.0,,,,,,,,,,,,,,,,,,,,,,,,,"""Comedy|Crime""",1988,2,1980,0,20,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
3.0,2.0,3.0,0.0,3.0,3.0,0.0,0.0,0.0,11,1,1995,0.0,,,,,,,,,,,,,,,,,,,,,,,,,"""Mystery|Thriller""",1995,2,1990,0,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0
,,,,,,,,,0,1,1996,,,,,,,,,,,,,,,,,,,,,,,,,,"""Mystery|Sci-Fi|Thriller""",1995,3,1990,0,34,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0


Let's see how the categorical feature 'decade' looks before the preprocessing.

In [17]:
print(X_train['decade'].value_counts())

shape: (14, 2)
┌────────┬─────────┐
│ decade ┆ count   │
│ ---    ┆ ---     │
│ i64    ┆ u32     │
╞════════╪═════════╡
│ 1940   ┆ 248212  │
│ 1910   ┆ 1188    │
│ 1980   ┆ 2692953 │
│ 1930   ┆ 159952  │
│ 1920   ┆ 25126   │
│ …      ┆ …       │
│ 2000   ┆ 3701546 │
│ 1970   ┆ 1001403 │
│ 1960   ┆ 587828  │
│ 1900   ┆ 116     │
│ 1990   ┆ 8203061 │
└────────┴─────────┘


##### Final preprocessing

Note that when transforming the data (especially the test set) I'm not passing the target variable.
- This is much faster than pandas with the original Feature-Engine, I'm pretty sure about this (I tried that first and failed :P)

In [18]:
X_train = preprocessor.transform(X_train)
X_test  = preprocessor.transform(X_test)

[Pipeline] .... (step 1 of 6) Processing zero_imputer
[Pipeline] .... (step 1 of 6) Processing zero_imputer, total=  6.19s
[Pipeline] .... (step 2 of 6) Processing minus_one_imputer
[Pipeline] .... (step 2 of 6) Processing minus_one_imputer, total=  6.27s
[Pipeline] .... (step 3 of 6) Processing max_winsorizer
[Pipeline] .... (step 3 of 6) Processing max_winsorizer, total=  6.09s
[Pipeline] .... (step 4 of 6) Processing categorical_imputer
[Pipeline] .... (step 4 of 6) Processing categorical_imputer, total=  2.31s
[Pipeline] .... (step 5 of 6) Processing rare_label_encoder
[Pipeline] .... (step 5 of 6) Processing rare_label_encoder, total=  3.85s
[Pipeline] .... (step 6 of 6) Processing ordinal_encoder
[Pipeline] .... (step 6 of 6) Processing ordinal_encoder, total=  2.21s
[Pipeline] .... (step 1 of 6) Processing zero_imputer
[Pipeline] .... (step 1 of 6) Processing zero_imputer, total=  1.55s
[Pipeline] .... (step 2 of 6) Processing minus_one_imputer
[Pipeline] .... (step 2 of 6) Proc

Now let's see how it looks after the preprocessing step. The categories are ordered by their share of good reviews (the mean of the target), so the category with the lowest share gets the lowest code.

In [19]:
print(X_train['decade'].value_counts())

shape: (8, 2)
┌────────┬─────────┐
│ decade ┆ count   │
│ ---    ┆ ---     │
│ i64    ┆ u32     │
╞════════╪═════════╡
│ -1     ┆ 219183  │
│ 8      ┆ 1001403 │
│ 6      ┆ 2692953 │
│ 2      ┆ 8203061 │
│ 9      ┆ 587828  │
│ 12     ┆ 402221  │
│ 1      ┆ 3701546 │
│ 13     ┆ 248212  │
└────────┴─────────┘


This means that '2010' is the least liked decade and '1940' the most liked.
- What's your favorite movie decade?

In [20]:
train.group_by("decade").agg(
    pl.col("TARGET").mean().alias("avg_target")
).sort('avg_target')

decade,avg_target
i64,f64
2010,0.453084
2000,0.465601
1990,0.469785
1910,0.473906
,0.5
…,…
1960,0.62611
1920,0.648531
1930,0.649389
1950,0.658096


This is what the ordinal encoder does in the preprocessing step.

In [21]:
preprocessor.ordinal_mappings_['decade']

decade,ordinal
str,u32
"""2010""",0
"""2000""",1
"""1990""",2
"""1910""",3
,4
…,…
"""1960""",9
"""1920""",10
"""1930""",11
"""1950""",12


Finally, I'll save the train and test sets.

In [22]:
X_train.shape

(17056407, 63)

In [23]:
X_test.shape

(2943856, 63)

In [24]:
X_train = X_train.with_columns(
    y_train.alias("TARGET")
)

In [25]:
X_test = X_test.with_columns(
    y_test.alias("TARGET")
)

In [27]:
X_train.write_parquet("data/train_data.parquet")

In [28]:
X_test.write_parquet("data/test_data.parquet")