# Feature engineering
- **Movies features**: Binary encoding of genres, movie-level statistics (mean, std, min/max ratings).  
- **Tags features**:
  - Lowercased and deduplicated for better matching.
  - Merged with `genome_tags` and applied sentiment analysis using **TextBlob**.
  - Aggregated to `userId-movieId` level.
  - Excluded any tags after the target rating timestamp to avoid leakage.
- **Ratings features**:
  - Computed **cumulative user statistics** with temporal shift to prevent leakage.
  - Calculated **movie-level average ratings** excluding the current rating (cold-start handling).
- **Merged all features** for modeling.
- Feature engineering was implemented using Polars instead of Pandas due to its significantly better performance on large, time-ordered datasets.

- This allowed efficient computation of cumulative and shifted features while preserving temporal constraints.

In [203]:
import os
import numpy as np
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt

In [204]:
pl.Config.set_tbl_rows(100)

polars.config.Config

In [205]:
ratings = pl.read_parquet("data/rating.parquet")

In [206]:
movies = pd.read_csv('data/movie.csv')

In [207]:
tags = pl.read_csv('data/tag.csv')
genome_tags = pl.read_csv('data/genome_tags.csv')
genome_scores = pl.read_csv('data/genome_scores.csv')

### Movies dataset
First, I'll do some feature engineering on the movies dataset.
- In the EDA I found 22 duplicated titles with different movie IDs, so I'll leave them as they are.

In [208]:
movies['movie_year'] = movies['title'].str.extract(r'\((\d{4})\)')

In [209]:
movies['movie_year'] = pd.to_numeric(movies['movie_year']).astype('Int64')

In [210]:
movies['movie_number_of_genres'] = movies['genres'].str.split('|').str.len()

In [211]:
movies['movie_number_of_genres'] = np.where(
    movies['genres'] == '(no genres listed)', 0, movies['movie_number_of_genres'])

In [212]:
movies['decade'] = (movies['movie_year'] // 10 * 10)

In [213]:
# Non-capturing group, so str.contains does not warn about match groups
romanos_regex = r'\b(?:II|III|IV|V|VI|VII|VIII)\b|Part \d'

In [214]:
movies['is_sequel'] = movies['title'].str.contains(romanos_regex, case=False, regex=True).astype(int)


In [215]:
movies['title_len'] = movies['title'].str.replace(r'\s\(\d{4}\)', '', regex=True).str.len()
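
To make these title features concrete, here is a minimal sketch on three made-up titles (the `toy` frame and its contents are assumptions, not rows from `movie.csv`). It uses `expand=False` so that `str.extract` returns a Series, and a non-capturing group `(?:...)`, which avoids the warning pandas emits when a pattern passed to `str.contains` has capture groups.

```python
import pandas as pd

# Made-up titles in the "Title (YYYY)" style; the last one has no year.
toy = pd.DataFrame({"title": ["Rocky II (1979)", "Heat (1995)", "Untitled Project"]})

# Four digits inside parentheses become the release year; no match gives <NA>.
year = toy["title"].str.extract(r"\((\d{4})\)", expand=False)
toy["movie_year"] = pd.to_numeric(year).astype("Int64")

# Integer arithmetic propagates <NA>, so the decade is missing when the year is.
toy["decade"] = toy["movie_year"] // 10 * 10

# Roman numerals as whole words, or "Part <digit>", flag a likely sequel.
sequel_regex = r"\b(?:II|III|IV|V|VI|VII|VIII)\b|Part \d"
toy["is_sequel"] = (
    toy["title"].str.contains(sequel_regex, case=False, regex=True).astype(int)
)
```

Titles without a four-digit year in parentheses end up with a missing `movie_year`, and therefore a missing `decade`.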

- Sanity check

Since my train/test split date is 2011-01-01, I'll verify that every genre already appears in movies released before 2011. This is a sanity check to avoid data leakage.

In [216]:
genres_before_2011 = movies[
    movies['movie_year'] < 2011]['genres'].str.split('|').explode().unique()
all_genres = movies['genres'].str.split('|').explode().unique()
print(np.array_equal(all_genres, genres_before_2011))

True


Since every genre already appears in movies released before 2011, I can safely build binary genre features for all movies.

In [217]:
movies = pd.concat([movies, movies['genres'].str.get_dummies(sep='|')], axis=1)
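
As an illustration, the sketch below repeats the genre check and the binary encoding on a made-up three-movie frame (`toy` and its values are assumptions). It compares the genre vocabularies as sets, because `unique()` returns values in order of first appearance and an element-wise comparison such as `np.array_equal` also depends on that order.

```python
import pandas as pd

# Made-up movies; genres use the same "A|B" format as the movies dataset.
toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "movie_year": [1995, 2008, 2013],
    "genres": ["Comedy|Drama", "Drama", "Comedy"],
})

# Genre vocabulary before the split year versus the full vocabulary.
before_2011 = toy.loc[toy["movie_year"] < 2011, "genres"]
genres_before_2011 = set(before_2011.str.split("|").explode())
all_genres = set(toy["genres"].str.split("|").explode())
same_vocabulary = genres_before_2011 == all_genres

# One 0/1 column per genre, appended to the original columns.
encoded = pd.concat([toy, toy["genres"].str.get_dummies(sep="|")], axis=1)
```

If `same_vocabulary` were false, some genre column would exist only because of movies released after the split date.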

### Tags datasets
Now I'm going to build features by combining all the tag-related datasets.

- Important: to avoid data leakage, we can't use tags whose timestamp is later than the timestamp of the rating we want to predict, so I'll exclude them.

In [218]:
ratings = ratings.with_columns(
    pl.col("timestamp").cast(pl.Datetime)
)

In [219]:
tags = tags.with_columns(
    pl.col("timestamp")
    .str.to_datetime(format="%Y-%m-%d %H:%M:%S")
    .alias('timestamp_tag')
)

- Tags are at the (`userId`, `movieId`, `tag`) level: one user can add several tags to the same movie, so `tags` can have duplicates at the (`userId`, `movieId`) level, whereas `ratings` has none at that level.

In [220]:
tags = tags.join(ratings[['userId', 'movieId', 'timestamp']].rename(
    {"timestamp": "timestamp_rating"}), on=['userId', 'movieId'], how='inner')

- I'll keep only the tags created before the rating timestamp; anything later would be future information.

#### This step is critical

In [221]:
tags = tags.filter(tags['timestamp_tag'] < tags['timestamp_rating'])

In [222]:
tags.sample(10)

userId,movieId,tag,timestamp,timestamp_tag,timestamp_rating
i64,i64,str,str,datetime[μs],datetime[μs]
10573,2920,"""backstage""","""2006-04-13 23:01:31""",2006-04-13 23:01:31,2008-09-29 03:24:31
110139,84954,"""Emily Blunt""","""2011-09-17 21:40:02""",2011-09-17 21:40:02,2011-09-17 21:41:31
132187,60069,"""Below R""","""2008-08-05 11:02:47""",2008-08-05 11:02:47,2009-08-07 12:32:51
108442,5944,"""Star Trek""","""2010-07-12 18:23:27""",2010-07-12 18:23:27,2010-09-16 16:03:50
124998,86982,"""Mothra""","""2013-10-22 19:42:45""",2013-10-22 19:42:45,2014-02-15 23:07:00
37762,63082,"""compassionate""","""2012-11-22 16:34:31""",2012-11-22 16:34:31,2013-09-30 22:09:40
88738,113035,"""forest""","""2014-07-30 11:38:13""",2014-07-30 11:38:13,2014-07-30 11:38:21
40847,68237,"""space""","""2010-02-06 02:38:01""",2010-02-06 02:38:01,2010-06-21 18:05:52
88738,85020,"""mentor""","""2013-01-15 09:13:40""",2013-01-15 09:13:40,2013-06-16 04:48:48
137805,2706,"""Gross-out""","""2009-05-02 03:51:33""",2009-05-02 03:51:33,2009-05-02 03:51:36


I found that lowercasing the tags increases the number of matches with `genome_tags` and doesn't introduce duplicates.

In [223]:
tags = tags.with_columns(
    pl.col("tag").str.to_lowercase().alias("tag")
)

In [224]:
genome_tags = genome_tags.with_columns(
    pl.col("tag").str.to_lowercase().alias("tag")
)

In [None]:
tags.shape

Now, I join tags with genome_tags to add the tagId. Not all the tags are in genome_tags.

In [225]:
tags = tags.join(genome_tags, on='tag', how='left')

In [226]:
tags.head()

userId,movieId,tag,timestamp,timestamp_tag,timestamp_rating,tagId
i64,i64,str,str,datetime[μs],datetime[μs],i64
65,27866,"""new zealand""","""2011-05-09 16:05:53""",2011-05-09 16:05:53,2011-05-09 16:05:59,706.0
65,48082,"""surreal""","""2011-05-09 16:25:54""",2011-05-09 16:25:54,2011-05-09 16:26:19,995.0
65,48082,"""unusual""","""2011-05-09 16:25:59""",2011-05-09 16:25:59,2011-05-09 16:26:19,
121,80693,"""mental illness""","""2011-03-29 04:51:43""",2011-03-29 04:51:43,2011-03-30 07:54:07,645.0
121,80693,"""zach galifianakis""","""2011-03-29 04:51:51""",2011-03-29 04:51:51,2011-03-30 07:54:07,


Now that we have the movieId and tagId we can join the genome_scores.

In [227]:
tags.shape

(73979, 7)

In [228]:
tags = tags.join(genome_scores, on=['movieId', 'tagId'], how='left')

Let's do some feature engineering

In [229]:
tags = tags.with_columns([
    # (Request-time feature: it can be computed in real time)
    ((pl.col("timestamp_rating") - pl.col("timestamp_tag")).dt.total_seconds())
    .abs().alias("seconds_between_rating_and_tag"),
    
    pl.col("tag").str.len_chars().alias("tag_length"),
    pl.col("tag").str.contains(r"[^\w\s]").alias("has_special_chars"),
    pl.col("tag").str.contains(r"[^\x00-\x7f]").alias("has_rare_symbols"),
    pl.col("tag").str.contains(r"\p{S}").alias("tag_is_symbolic"),
    pl.col("relevance").is_not_null().alias("has_relevance_score")
])

I'll use the sentiment analyzer from the TextBlob library to give each tag a polarity score, since some tags show that the user didn't like the movie.

In [230]:
from textblob import TextBlob
import re

In [231]:
def clean_text(text):
    if text is None: return ""
    # 1. Lowercase everything
    text = text.lower()
    # 2. Remove anything that is not a letter or whitespace
    text = re.sub(r'[^a-z\s]', '', text)
    # 3. Strip leading and trailing whitespace
    text = text.strip()
    return text

unique_tags = (
    tags.select("tag")
    .filter(pl.col("tag").is_not_null())
    .unique()
    .with_columns(
        pl.col("tag").map_elements(clean_text, return_dtype=pl.String).alias("tag_clean")
    )
)

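This is done so that I don't score every row: tags are shared across many rows, so scoring each unique tag once saves computation. Note that the map below scores the original `tag` strings, not `tag_clean`.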

In [232]:
tag_map = {
    tag: TextBlob(tag).sentiment.polarity 
    for tag in unique_tags["tag"]
}

Map each tag's score back onto the tags dataframe.

In [233]:
tags = tags.with_columns(
    pl.col("tag").replace_strict(tag_map, default=0.0).alias("sentiment_score")
)

In [234]:
tags.filter(tags['sentiment_score'] < -0.3)['tag'].sample(10)

tag
str
"""bad acting"""
"""violent"""
"""meg fake orgasm"""
"""bad acting"""
"""depressing"""
"""boring"""
"""bad music"""
"""when travolta was thin"""
"""bad writing"""
"""depressing"""


Finally, since this tags table is at the (`userId`, `movieId`, `tag`) level, I need to aggregate it to the (`userId`, `movieId`) level, which is the level at which the rating is predicted.

In [235]:
tags.head()

userId,movieId,tag,timestamp,timestamp_tag,timestamp_rating,tagId,relevance,seconds_between_rating_and_tag,tag_length,has_special_chars,has_rare_symbols,tag_is_symbolic,has_relevance_score,sentiment_score
i64,i64,str,str,datetime[μs],datetime[μs],i64,f64,i64,u32,bool,bool,bool,bool,f64
65,27866,"""new zealand""","""2011-05-09 16:05:53""",2011-05-09 16:05:53,2011-05-09 16:05:59,706.0,0.787,6,11,False,False,False,True,0.136364
65,48082,"""surreal""","""2011-05-09 16:25:54""",2011-05-09 16:25:54,2011-05-09 16:26:19,995.0,0.98775,25,7,False,False,False,True,0.25
65,48082,"""unusual""","""2011-05-09 16:25:59""",2011-05-09 16:25:59,2011-05-09 16:26:19,,,20,7,False,False,False,False,0.2
121,80693,"""mental illness""","""2011-03-29 04:51:43""",2011-03-29 04:51:43,2011-03-30 07:54:07,645.0,0.8115,97344,14,False,False,False,True,-0.1
121,80693,"""zach galifianakis""","""2011-03-29 04:51:51""",2011-03-29 04:51:51,2011-03-30 07:54:07,,,97336,17,False,False,False,False,0.0


In [236]:
tags_features = tags.group_by(["userId", "movieId"]).agg([
    pl.len().alias("tags_count"),
    # Sentiment score
    pl.col("sentiment_score").mean().alias("tags_avg_sentiment"),
    pl.col("sentiment_score").std().alias("tags_std_sentiment"),
    pl.col("sentiment_score").min().alias("tags_min_sentiment"),
    pl.col("sentiment_score").max().alias("tags_max_sentiment"),
    # Relevance
    pl.col("relevance").mean().alias("tags_avg_relevance"),
    pl.col("relevance").std().alias("tags_std_relevance"),
    pl.col("relevance").min().alias("tags_min_relevance"),
    pl.col("relevance").max().alias("tags_max_relevance"),
    pl.col("has_relevance_score").sum().alias("total_tags_with_relevance"),
    # seconds_between_rating_and_tag (request-time feature: it can be computed in real time)
    pl.col("seconds_between_rating_and_tag").mean().alias("avg_seconds_between_rating_and_tags"),
    # Time between the first and last tag the user applied to that movie
    ((pl.col("timestamp_tag").max() - pl.col("timestamp_tag").min()).dt.total_seconds()).alias("tagging_duration_seconds"),
    # Tag style
    pl.col("tag_length").mean().alias("avg_tag_length"),
    pl.col("has_special_chars").sum().alias("total_special_chars_in_tags"),
    pl.col("has_rare_symbols").sum().alias("total_rare_symbols_in_tags"),
    pl.col("tag_is_symbolic").any().alias("at_least_one_symbol_in_tags")
])

In [237]:
tags_features.shape

(24961, 18)

Finally, one more feature: the share of tags that have a relevance score.

In [238]:
tags_features = tags_features.with_columns(
    (pl.col("total_tags_with_relevance") / pl.col('tags_count'))
      .alias("porc_tags_with_relevance_score")
)

In [239]:
tags_features.sample(10)

userId,movieId,tags_count,tags_avg_sentiment,tags_std_sentiment,tags_min_sentiment,tags_max_sentiment,tags_avg_relevance,tags_std_relevance,tags_min_relevance,tags_max_relevance,total_tags_with_relevance,avg_seconds_between_rating_and_tags,tagging_duration_seconds,avg_tag_length,total_special_chars_in_tags,total_rare_symbols_in_tags,at_least_one_symbol_in_tags,porc_tags_with_relevance_score
i64,i64,u32,f64,f64,f64,f64,f64,f64,f64,f64,u32,f64,i64,f64,u32,u32,bool,f64
35227,39234,1,0.0,,0.0,0.0,,,,,0,60938008.0,0,12.0,0,0,False,0.0
23982,81229,2,0.0,0.0,0.0,0.0,,,,,0,10.0,10,15.5,1,0,False,0.0
63781,858,1,0.0,,0.0,0.0,,,,,0,3623645.0,0,11.0,0,0,False,0.0
96296,76293,1,-0.2,,-0.2,-0.2,0.70075,,0.70075,0.70075,1,11.0,0,11.0,0,0,False,1.0
122523,109420,2,0.0,0.0,0.0,0.0,,,,,0,21701008.0,2,5.0,0,0,False,0.0
23165,35836,1,0.0,,0.0,0.0,,,,,0,3591.0,0,9.0,1,0,False,0.0
11081,1419,10,0.005,0.101242,-0.2,0.2,0.574821,0.26066,0.17875,0.9355,7,2590258.7,1532678,12.2,0,0,False,0.7
96370,40819,1,0.0,,0.0,0.0,,,,,0,4648107.0,0,11.0,0,0,False,0.0
124998,93785,2,0.0,0.0,0.0,0.0,,,,,0,42258000.0,14427593,12.5,0,0,False,0.0
123297,6096,1,0.0,,0.0,0.0,,,,,0,3512.0,0,7.0,0,0,False,0.0


### Rating features
Let's build some features from this final and most important dataset, since it contains the target variable.

In [240]:
ratings = ratings.with_columns(
    pl.col("timestamp").cast(pl.Datetime),
    (pl.col("rating") >= 4).cast(pl.Int8).alias("TARGET")
    
)

In [241]:
ratings['TARGET'].value_counts()

TARGET,count
i8,u32
1,9995410
0,10004853


Sort by user ID and timestamp. This step is essential, because the features are built with `shift` to avoid data leakage.


In [242]:
ratings = ratings.sort(["userId", "timestamp"])

In [243]:
ratings.head()

userId,movieId,rating,timestamp,TARGET
i32,i32,f32,datetime[μs],i8
1,924,3.5,2004-09-10 03:06:38,0
1,919,3.5,2004-09-10 03:07:01,0
1,2683,3.5,2004-09-10 03:07:30,0
1,1584,3.5,2004-09-10 03:07:36,0
1,1079,4.0,2004-09-10 03:07:45,1


Features are computed over the full timeline, but each row only sees interactions that happened before it, so neither future ratings nor the current rating leak into its features.

Cumulative user statistics are computed using only historical ratings available prior to each event. I explicitly apply a temporal shift to avoid target leakage and to simulate the constraints of online feature computation in production systems.

Let's build some features.

In [244]:
cum_mean = lambda x: pl.col(x).cum_sum().truediv(pl.col(x).cum_count())
cum_std = lambda x: (
    (
        (pl.col(x) ** 2).cum_sum() / pl.col(x).cum_count()
        -
        (pl.col(x).cum_sum() / pl.col(x).cum_count()) ** 2
    ).sqrt()
)
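
The `cum_std` expression relies on the identity Var(x) = E[x²] − E[x]², which gives the population standard deviation (ddof = 0), unlike Polars' `std()`, which defaults to the sample version (ddof = 1). A plain-Python sketch checks the identity against `statistics.pstdev` on four made-up ratings; `running_std` is a hypothetical helper, and the clamp to zero is an extra safeguard that is not in the expression above.

```python
import math
import statistics


def running_std(values):
    """Population std of every prefix of `values`, via E[x^2] - E[x]^2."""
    out, total, total_sq = [], 0.0, 0.0
    for n, x in enumerate(values, start=1):
        total += x
        total_sq += x * x
        variance = total_sq / n - (total / n) ** 2
        # Rounding can push a zero variance slightly below 0; clamp before sqrt.
        out.append(math.sqrt(max(variance, 0.0)))
    return out


ratings_toy = [4.0, 2.0, 3.0, 2.5]
prefix_std = running_std(ratings_toy)
reference = [
    statistics.pstdev(ratings_toy[: i + 1]) for i in range(len(ratings_toy))
]
```

Both lists agree up to floating-point rounding, and a prefix of length one has a standard deviation of 0.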

In [245]:
ratings = ratings.with_columns(
    pl.col("rating").shift(1).over("userId").alias("prev_rating"),
    pl.cum_count('rating').shift(1).over("userId").alias("num_prev_ratings"),
    # Mean and std of previous ratings, excluding the current one, since it is the one to predict
    cum_mean('rating').shift(1).over("userId").alias("mean_ratings"),
    cum_std('rating').shift(1).over("userId").alias("std_ratings"),
    pl.col("rating").cum_max().shift(1).over("userId").alias("max_prev_ratings"),
    pl.col("rating").cum_min().shift(1).over("userId").alias("min_prev_ratings"),
    # Previous ratings of 4 or higher (TARGET = 1)
    cum_mean('TARGET').shift(1).over("userId").alias("mean_previous_target"),
    pl.col("TARGET").cum_sum().shift(1).over("userId").alias("num_prev_ratings_greater_than_4"),
    pl.col("TARGET").shift(1).over("userId").alias("prev_target"),
    # Time features (request-time features: they can be computed in real time)
    pl.col("timestamp").dt.hour().alias("hour"),
    pl.col("timestamp").dt.weekday().alias("dayofweek"),
    pl.col("timestamp").dt.year().alias("year"),
    (
        pl.col("timestamp")
        - pl.col("timestamp").shift(1).over("userId")
    )
    .dt.total_seconds()
    .alias("seconds_since_last_rating")
)

Let's see how they were computed using one user

In [246]:
ratings.filter(ratings['userId'] == 20)

userId,movieId,rating,timestamp,TARGET,prev_rating,num_prev_ratings,mean_ratings,std_ratings,max_prev_ratings,min_prev_ratings,mean_previous_target,num_prev_ratings_greater_than_4,prev_target,hour,dayofweek,year,seconds_since_last_rating
i32,i32,f32,datetime[μs],i8,f32,u32,f64,f64,f32,f32,f64,i64,i8,i8,i8,i32,i64
20,1221,4.0,2005-09-12 15:38:26,1,,,,,,,,,,15,1,2005,
20,474,2.0,2005-09-12 15:38:35,0,4.0,1.0,4.0,0.0,4.0,4.0,1.0,1.0,1.0,15,1,2005,9.0
20,1961,3.0,2005-09-12 15:38:39,0,2.0,2.0,3.0,1.0,4.0,2.0,0.5,1.0,0.0,15,1,2005,4.0
20,1923,2.5,2005-09-12 15:38:43,0,3.0,3.0,3.0,0.816497,4.0,2.0,0.333333,1.0,0.0,15,1,2005,4.0
20,19,2.5,2005-09-12 15:38:50,0,2.5,4.0,2.875,0.73951,4.0,2.0,0.25,1.0,0.0,15,1,2005,7.0
20,442,2.0,2005-09-12 15:38:54,0,2.5,5.0,2.8,0.678233,4.0,2.0,0.2,1.0,0.0,15,1,2005,4.0
20,4306,3.5,2005-09-12 15:39:03,0,2.0,6.0,2.666667,0.687184,4.0,2.0,0.166667,1.0,0.0,15,1,2005,9.0
20,2987,3.0,2005-09-12 15:39:09,0,3.5,7.0,2.785714,0.699854,4.0,2.0,0.142857,1.0,0.0,15,1,2005,6.0
20,1208,4.5,2005-09-12 15:39:20,1,3.0,8.0,2.8125,0.658478,4.0,2.0,0.125,1.0,0.0,15,1,2005,11.0
20,235,3.5,2005-09-12 15:39:35,0,4.5,9.0,3.0,0.816497,4.5,2.0,0.222222,2.0,1.0,15,1,2005,15.0


As you can see, the features that rely on previous information are null for a user's first rating, which is the expected behavior. It shows that the current rating is not leaking into its own features.

#### Movie rating features
Now I'm going to calculate each movie's average rating (given by other users), making sure that the current rating is not included.

- To do that, I'll now sort by timestamp across all users.

In [247]:
ratings = ratings.sort("timestamp")

In [248]:
ratings.head()

userId,movieId,rating,timestamp,TARGET,prev_rating,num_prev_ratings,mean_ratings,std_ratings,max_prev_ratings,min_prev_ratings,mean_previous_target,num_prev_ratings_greater_than_4,prev_target,hour,dayofweek,year,seconds_since_last_rating
i32,i32,f32,datetime[μs],i8,f32,u32,f64,f64,f32,f32,f64,i64,i8,i8,i8,i32,i64
28507,1176,4.0,1995-01-09 11:46:44,1,,,,,,,,,,11,1,1995,
131160,21,3.0,1995-01-09 11:46:49,0,,,,,,,,,,11,1,1995,
131160,1079,3.0,1995-01-09 11:46:49,0,3.0,1.0,3.0,0.0,3.0,3.0,0.0,0.0,0.0,11,1,1995,0.0
131160,47,5.0,1995-01-09 11:46:49,1,3.0,2.0,3.0,0.0,3.0,3.0,0.0,0.0,0.0,11,1,1995,0.0
20821,32,5.0,1996-01-29 00:00:00,1,,,,,,,,,,0,1,1996,


In [249]:
ratings = ratings.with_columns(
    
    pl.cum_count('rating').shift(1).over("movieId").alias("num_movie_prev_ratings"),
    cum_mean("rating").shift(1).over("movieId").alias('movie_mean_rating'),
    cum_std('rating').shift(1).over("movieId").alias("movie_std_ratings"),
    pl.col("rating").cum_max().shift(1).over("movieId").alias("movie_max_prev_ratings"),
    pl.col("rating").cum_min().shift(1).over("movieId").alias("movie_min_prev_ratings"),
    # Previous ratings of 4 or higher (TARGET = 1)
    cum_mean('TARGET').shift(1).over("movieId").alias("movie_mean_previous_target"),
    pl.col("TARGET").cum_sum().shift(1).over("movieId").alias("movie_num_prev_ratings_greater_than_4")
)

In [250]:
ratings[['userId', 'movieId', 'timestamp',
         'num_movie_prev_ratings', 'movie_mean_rating', 'movie_std_ratings',
        'movie_max_prev_ratings', 'movie_min_prev_ratings',
        'movie_mean_previous_target', 'movie_num_prev_ratings_greater_than_4']].head()

userId,movieId,timestamp,num_movie_prev_ratings,movie_mean_rating,movie_std_ratings,movie_max_prev_ratings,movie_min_prev_ratings,movie_mean_previous_target,movie_num_prev_ratings_greater_than_4
i32,i32,datetime[μs],u32,f64,f64,f32,f32,f64,i64
28507,1176,1995-01-09 11:46:44,,,,,,,
131160,21,1995-01-09 11:46:49,,,,,,,
131160,1079,1995-01-09 11:46:49,,,,,,,
131160,47,1995-01-09 11:46:49,,,,,,,
20821,32,1996-01-29 00:00:00,,,,,,,


These nulls are the cold start for movies: the first rating of each movie has no history, which is the expected behavior.
- Let's look at one particular movie.

In [251]:
ratings.filter(ratings['movieId'] == 82978)[['userId', 'movieId', 'timestamp','rating',
         'num_movie_prev_ratings', 'movie_mean_rating', 'movie_std_ratings',
        'movie_max_prev_ratings', 'movie_min_prev_ratings',
        'movie_mean_previous_target', 'movie_num_prev_ratings_greater_than_4']]

userId,movieId,timestamp,rating,num_movie_prev_ratings,movie_mean_rating,movie_std_ratings,movie_max_prev_ratings,movie_min_prev_ratings,movie_mean_previous_target,movie_num_prev_ratings_greater_than_4
i32,i32,datetime[μs],f32,u32,f64,f64,f32,f32,f64,i64
131904,82978,2011-03-28 13:36:29,3.5,,,,,,,
31122,82978,2011-03-28 16:24:44,4.0,1.0,3.5,0.0,3.5,3.5,0.0,0.0
114406,82978,2011-03-28 21:35:44,3.0,2.0,3.75,0.25,4.0,3.5,0.5,1.0
86592,82978,2011-03-29 10:58:07,3.5,3.0,3.5,0.408248,4.0,3.0,0.333333,1.0
10627,82978,2011-05-07 02:32:31,4.5,4.0,3.5,0.353553,4.0,3.0,0.25,1.0
15617,82978,2011-10-15 23:33:21,3.0,5.0,3.7,0.509902,4.5,3.0,0.4,2.0
60427,82978,2011-10-21 14:19:04,4.5,6.0,3.583333,0.533594,4.5,3.0,0.333333,2.0
21163,82978,2011-10-29 14:00:47,3.5,7.0,3.714286,0.589015,4.5,3.0,0.428571,3.0
118205,82978,2011-11-17 22:20:19,3.5,8.0,3.6875,0.555512,4.5,3.0,0.375,3.0
105580,82978,2011-11-26 11:56:46,3.0,9.0,3.666667,0.527046,4.5,3.0,0.333333,3.0


In [252]:
# Restore the original ordering by user and timestamp
ratings = ratings.sort(["userId", "timestamp"])

## Merge all data sources
Finally, let's merge all the data sources and write the result to disk; a separate script will then process it and prepare it for model training.
- Check how to input the data

In [253]:
ratings.shape

(20000263, 25)

In [254]:
ratings = ratings.join(tags_features, on=['userId', 'movieId'], how='left')

In [255]:
ratings = ratings.join(pl.from_pandas(movies), on=['movieId'], how='left')

In [256]:
ratings.shape

(20000263, 69)

In [257]:
ratings.head()

userId,movieId,rating,timestamp,TARGET,prev_rating,num_prev_ratings,mean_ratings,std_ratings,max_prev_ratings,min_prev_ratings,mean_previous_target,num_prev_ratings_greater_than_4,prev_target,hour,dayofweek,year,seconds_since_last_rating,num_movie_prev_ratings,movie_mean_rating,movie_std_ratings,movie_max_prev_ratings,movie_min_prev_ratings,movie_mean_previous_target,movie_num_prev_ratings_greater_than_4,tags_count,tags_avg_sentiment,tags_std_sentiment,tags_min_sentiment,tags_max_sentiment,tags_avg_relevance,tags_std_relevance,tags_min_relevance,tags_max_relevance,total_tags_with_relevance,avg_seconds_between_rating_and_tags,tagging_duration_seconds,avg_tag_length,total_special_chars_in_tags,total_rare_symbols_in_tags,at_least_one_symbol_in_tags,porc_tags_with_relevance_score,title,genres,movie_year,movie_number_of_genres,decade,is_sequel,title_len,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
i32,i32,f32,datetime[μs],i8,f32,u32,f64,f64,f32,f32,f64,i64,i8,i8,i8,i32,i64,u32,f64,f64,f32,f32,f64,i64,u32,f64,f64,f64,f64,f64,f64,f64,f64,u32,f64,i64,f64,u32,u32,bool,f64,str,str,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64
1,924,3.5,2004-09-10 03:06:38,0,,,,,,,,,,3,5,2004,,13628,4.01152,1.049504,5.0,0.5,0.71265,9712,,,,,,,,,,,,,,,,,,"""2001: A Space Odyssey (1968)""","""Adventure|Drama|Sci-Fi""",1968,3,1960,0,21,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0
1,919,3.5,2004-09-10 03:07:01,0,3.5,1.0,3.5,0.0,3.5,3.5,0.0,0.0,0.0,3,5,2004,23.0,13345,4.132447,0.932417,5.0,0.5,0.751592,10030,,,,,,,,,,,,,,,,,,"""Wizard of Oz, The (1939)""","""Adventure|Children|Fantasy|Mus…",1939,4,1930,0,17,0,0,1,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0
1,2683,3.5,2004-09-10 03:07:30,0,3.5,2.0,3.5,0.0,3.5,3.5,0.0,0.0,0.0,3,5,2004,29.0,11355,3.325495,1.128763,5.0,0.5,0.46376,5266,,,,,,,,,,,,,,,,,,"""Austin Powers: The Spy Who Sha…","""Action|Adventure|Comedy""",1999,3,1990,0,37,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1584,3.5,2004-09-10 03:07:36,0,3.5,3.0,3.5,0.0,3.5,3.5,0.0,0.0,0.0,3,5,2004,6.0,10935,3.740238,0.991961,5.0,0.5,0.618381,6762,,,,,,,,,,,,,,,,,,"""Contact (1997)""","""Drama|Sci-Fi""",1997,2,1990,0,7,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0
1,1079,4.0,2004-09-10 03:07:45,1,3.5,4.0,3.5,0.0,3.5,3.5,0.0,0.0,0.0,3,5,2004,9.0,12265,3.920669,0.915345,5.0,0.5,0.69947,8579,,,,,,,,,,,,,,,,,,"""Fish Called Wanda, A (1988)""","""Comedy|Crime""",1988,2,1980,0,20,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0


#### Save all the raw data

In [261]:
ratings.write_parquet("data/raw_data.parquet")