# MovieLens 32M Recommendation System Project: Feature Engineering

## Overview
In this notebook, I perform **feature engineering** on the MovieLens 32M dataset to prepare it for a recommendation system. Feature engineering creates new features or transforms existing ones so that they better represent the underlying patterns in the data. The resulting dataset feeds the next phase, **model development**.

The engineered features will focus on **movie characteristics**, **genre analysis**, and **temporal trends**, which will help in capturing user preferences and improving the performance of the recommendation system.

## Objectives
- **Genre-Based Features**:
    - Create new features based on the genres each movie belongs to.
    - Develop genre interaction features to explore relationships between genres.
    - Incorporate insights from genre clusters.
    
- **Temporal Features**:
    - Extract features such as release year and trends in movie ratings over time.
    
- **Movie Popularity**:
    - Create features based on movie popularity, such as the number of ratings and rating variance.

## Steps
### Step 1: Genre-Based Features
- Create features based on the number of genres a movie belongs to.
- Generate interaction features between genres.
- Utilize the genre clusters from the exploratory data analysis (EDA) to assign cluster labels.
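
As a rough illustration of the first two bullets, the sketch below counts genres per movie and builds pairwise interaction flags from one-hot genre columns. The toy DataFrame and the `num_genres` and `<a>_x_<b>` column names are assumptions made for this example, not names used elsewhere in the notebook.

```python
from itertools import combinations

import pandas as pd

# Toy one-hot genre table (hypothetical values, three movies)
movies = pd.DataFrame(
    {
        "action": [1, 0, 1],
        "comedy": [0, 1, 1],
        "romance": [0, 1, 0],
    }
)
genre_cols = ["action", "comedy", "romance"]

# Number of genres per movie: row-wise sum of the 0/1 indicators
movies["num_genres"] = movies[genre_cols].sum(axis=1)

# Pairwise interaction: 1 only when a movie carries both genres
interactions = pd.DataFrame(
    {f"{a}_x_{b}": movies[a] * movies[b] for a, b in combinations(genre_cols, 2)}
)

print(pd.concat([movies, interactions], axis=1))
```

With 19 genres this produces 171 pairwise columns, so in practice it may be worth keeping only the pairs that occur often enough to be informative.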

### Step 2: Temporal Features
- Extract the release year or decade from the dataset.
- Calculate the trend of movie ratings over time to capture time-based behavior patterns.
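
A minimal sketch of the release-year step, assuming the year appears in parentheses at the end of the `title` string, as in MovieLens titles such as "Toy Story (1995)". The `release_year` and `decade` column names are hypothetical, and titles without a trailing year are left as missing.

```python
import pandas as pd

# Toy titles (the last one has no year on purpose)
movies = pd.DataFrame(
    {"title": ["Toy Story (1995)", "Casablanca (1942)", "Untitled Project"]}
)

# Capture a four-digit year in parentheses at the end of the title
year_text = movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)

# Nullable integer dtype keeps missing years as <NA> instead of forcing floats
movies["release_year"] = pd.to_numeric(year_text, errors="coerce").astype("Int64")

# Decade: 1995 -> 1990, 1942 -> 1940; missing years stay missing
movies["decade"] = (movies["release_year"] // 10) * 10

print(movies)
```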

### Step 3: Movie Popularity
- Compute the number of ratings for each movie as a proxy for popularity.
- Calculate the variance in ratings to capture polarization in audience reception.

## Project Status
- [x] Data Cleaning and Preprocessing
- [x] Exploratory Data Analysis (EDA)
- [ ] **Feature Engineering**
- [ ] Model Development
- [ ] Model Evaluation
- [ ] Model Deployment

---

By focusing on feature engineering, I aim to develop a robust dataset that will serve as a foundation for building an effective recommendation system in subsequent stages.

In [11]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Load the data saved from the EDA
full_movie_ddf_with_genres = pd.read_csv('../data/processed/full_movie_ddf_with_genres.csv')
clustered_genre_stats = pd.read_csv('../data/processed/clustered_genre_stats.csv')

# Check the structure of both DataFrames to ensure they loaded correctly
print("full_movie_ddf_with_genres columns:", full_movie_ddf_with_genres.columns)
print("clustered_genre_stats columns:", clustered_genre_stats.columns)

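# Preview the data. A notebook cell renders only its last expression,
# so only clustered_genre_stats.head() appears in the output;
# wrap the first call in display() to show both previews.
full_movie_ddf_with_genres.head()
clustered_genre_stats.head()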


full_movie_ddf_with_genres columns: Index(['userId', 'movieId', 'tag', 'tags_timestamp', 'title', 'genres',
       'rating', 'ratings_timestamp', 'imdbId', 'tmdbId', '(no genres listed)',
       'action', 'adventure', 'animation', 'children', 'comedy', 'crime',
       'documentary', 'drama', 'fantasy', 'film-noir', 'horror', 'imax',
       'musical', 'mystery', 'romance', 'sci-fi', 'thriller', 'war', 'western',
       'rating_date', 'year_month'],
      dtype='object')
clustered_genre_stats columns: Index(['Movie Count', 'Average Rating', 'Cluster'], dtype='object')


   Movie Count  Average Rating  Cluster
0         6300        3.626428        2
1         3816        3.670802        2
2         2458        3.826029        1
3         2468        3.664781        1
4        14035        3.657846        0


In [12]:
from sklearn.cluster import KMeans

# Step 1: Recreate genre statistics from the full_movie_ddf_with_genres
def get_genre_columns(df):
    """
    Get the list of genre columns from the DataFrame.
    Filter columns that match the predefined genre labels.
    """
    # List of known genre labels
    genre_labels = [
        'action', 'adventure', 'animation', 'children', 'comedy', 'crime',
        'documentary', 'drama', 'fantasy', 'film-noir', 'horror', 'imax',
        'musical', 'mystery', 'romance', 'sci-fi', 'thriller', 'war', 'western'
    ]
    
    # Filter columns that match the genre labels
    genre_columns = [col for col in df.columns if col in genre_labels]
    return genre_columns


# Calculate genre statistics (movie count and average rating)
def calculate_genre_statistics(df):
    genres = get_genre_columns(df)
    
    genre_stats = pd.DataFrame(index=genres, columns=['Movie Count', 'Average Rating'])
    
    for genre in genres:
        # Movie count: Unique count of movie titles in each genre
        genre_stats.loc[genre, 'Movie Count'] = df[df[genre] == 1]['title'].nunique()
        
        # Average rating: Mean rating of all movies in that genre
        genre_stats.loc[genre, 'Average Rating'] = df[df[genre] == 1]['rating'].mean()
    
    # Filter out genres with NaN values (i.e., genres with no movies or no ratings)
    genre_stats = genre_stats.dropna(subset=['Movie Count', 'Average Rating'])
    
    return genre_stats


# Perform K-Means Clustering on the genre statistics
def perform_kmeans_clustering(genre_stats, n_clusters=3):
    # Check if there are enough valid genres to cluster
    if genre_stats.empty:
        raise ValueError("No valid data available for clustering.")
    
    # Fit K-Means with the movie count and average rating
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    genre_stats['Cluster'] = kmeans.fit_predict(genre_stats[['Movie Count', 'Average Rating']])
    
    return genre_stats, kmeans

# Recreate the genre stats from the full_movie_ddf_with_genres
genre_stats = calculate_genre_statistics(full_movie_ddf_with_genres)

# Perform K-Means clustering to assign clusters to each genre
clustered_genre_stats, kmeans_model = perform_kmeans_clustering(genre_stats, n_clusters=3)

# Display the clustered genre stats
print(clustered_genre_stats)


            Movie Count Average Rating  Cluster
action             6300       3.626428        2
adventure          3816       3.670802        2
animation          2458       3.826029        1
children           2468       3.664781        1
comedy            14035       3.657846        0
crime              4615       3.848743        2
documentary        4735       3.800447        2
drama             21609       3.824769        0
fantasy            2544       3.668145        1
film-noir           344       3.933365        1
horror             5500       3.435147        2
imax                194       3.705834        1
musical            1010       3.784775        1
mystery            2611       3.815896        1
romance            6508       3.743126        2
sci-fi             3429       3.663288        1
thriller           7622        3.68992        2
war                1620       3.866885        1
western            1323       3.746815        1
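
In the output above, `Movie Count` ranges from 194 to 21,609 while `Average Rating` only ranges from about 3.4 to 3.9, so unscaled Euclidean distance is driven almost entirely by the count (the three clusters line up with low, medium, and high movie counts). The sketch below shows one possible alternative: standardizing both features before K-Means. It uses a subset of the values printed above; the `Cluster_Scaled` column name and the explicit `n_init=10` are choices made for this example, and the resulting labels will generally differ from the ones above.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Subset of the genre statistics printed above
genre_stats = pd.DataFrame(
    {
        "Movie Count": [6300, 2458, 14035, 21609, 344, 5500],
        "Average Rating": [3.626428, 3.826029, 3.657846, 3.824769, 3.933365, 3.435147],
    },
    index=["action", "animation", "comedy", "drama", "film-noir", "horror"],
)

# Cast to float, then rescale each column to zero mean and unit variance
features = genre_stats[["Movie Count", "Average Rating"]].astype(float)
scaled = StandardScaler().fit_transform(features)

# Cluster on the scaled features so both columns contribute comparably
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
genre_stats["Cluster_Scaled"] = kmeans.fit_predict(scaled)

print(genre_stats)
```

Whether scaled or unscaled clusters are more useful depends on what the clusters are meant to capture; if the goal is simply to bucket genres by size, the unscaled version already does that.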


In [13]:
# Step 2: Weighted Cluster Assignment

# We already have the clustered_genre_stats DataFrame from the previous step
# Now, map the genres in full_movie_ddf_with_genres to their respective clusters

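# Create an empty DataFrame to store the cluster weights for each row
# (full_movie_ddf_with_genres has one row per rating/tag record, not one per movie)
n_clusters = kmeans_model.n_clusters  # number of clusters fitted in the previous cell
cluster_weights = pd.DataFrame(0, index=full_movie_ddf_with_genres.index, columns=[f'Cluster {i}' for i in range(n_clusters)])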

# Map the genre to cluster based on the clustered_genre_stats
# Assuming clustered_genre_stats has an index of genres and a column 'Cluster'
for genre in clustered_genre_stats.index:
    # Get the cluster for this genre
    cluster = clustered_genre_stats.loc[genre, 'Cluster']
    
    # Add the one-hot encoded genre values (1 or 0) to the corresponding cluster column
    if genre in full_movie_ddf_with_genres.columns:
        cluster_weights[f'Cluster {cluster}'] += full_movie_ddf_with_genres[genre]

# Normalize the cluster weights so that they sum to 1 for each row
# This gives the proportion of the movie's genres that fall in each cluster;
# rows with no genre in any cluster would divide 0 by 0 and are set to 0 by fillna
cluster_weights = cluster_weights.div(cluster_weights.sum(axis=1), axis=0).fillna(0)

# Add the cluster weights to the full_movie_ddf_with_genres DataFrame
full_movie_ddf_with_genres = pd.concat([full_movie_ddf_with_genres, cluster_weights], axis=1)

# Display the updated DataFrame with cluster weights
print(full_movie_ddf_with_genres[['Cluster 0', 'Cluster 1', 'Cluster 2']].head())


   Cluster 0  Cluster 1  Cluster 2
0        0.0   0.333333   0.666667
1        1.0   0.000000   0.000000
2        0.5   0.500000   0.000000
3        0.5   0.000000   0.500000
4        0.5   0.000000   0.500000
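
The same weighting can be written as a single matrix product: a rows-by-genres indicator matrix times a genres-by-clusters membership matrix. The sketch below is an equivalent formulation on toy data, not the code used in this notebook. The genre-to-cluster mapping is taken from the clustering output above; the three toy rows and the `membership`, `counts`, and `weights` names are assumptions.

```python
import pandas as pd

# Toy one-hot genre matrix, one row per record (hypothetical values)
genres = pd.DataFrame(
    {
        "action": [1, 0, 0],
        "comedy": [0, 1, 1],
        "animation": [1, 0, 1],
        "thriller": [1, 0, 0],
    }
)

# Genre -> cluster labels, as printed in the clustering output
genre_to_cluster = pd.Series(
    {"action": 2, "comedy": 0, "animation": 1, "thriller": 2}
)

# Membership matrix: one row per genre, one 0/1 column per cluster
membership = pd.get_dummies(genre_to_cluster).astype(int)
membership.columns = [f"Cluster {c}" for c in membership.columns]

# (rows x genres) . (genres x clusters) = genre count per cluster for each row
counts = genres[membership.index].dot(membership)

# Normalize each row to proportions; rows with no genres become all zeros
weights = counts.div(counts.sum(axis=1), axis=0).fillna(0)

print(weights)
```

The first toy row has two genres in cluster 2 and one in cluster 1, giving weights of 2/3 and 1/3, the same pattern as the first row of the output above.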


In [14]:
# Step: Engineering movie popularity features

# Calculate the number of ratings per movie
full_movie_ddf_with_genres['Num_Ratings'] = full_movie_ddf_with_genres.groupby('movieId')['rating'].transform('count')

# Calculate the rating variance per movie
full_movie_ddf_with_genres['Rating_Variance'] = full_movie_ddf_with_genres.groupby('movieId')['rating'].transform('var')

# Display the new columns
print(full_movie_ddf_with_genres[['Num_Ratings', 'Rating_Variance']].head())


   Num_Ratings  Rating_Variance
0          430         0.557469
1            8         0.674107
2           12         1.111742
3            3         0.333333
4           10         0.488889
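
Because the DataFrame also carries a `tag` column, a single user's rating of a movie may appear on several rows (one per tag), which would inflate `Num_Ratings` and distort the variance. That repetition is an assumption about how the table was built, not something verified here. If it holds, one way to guard against it is to deduplicate on `(userId, movieId)` before aggregating, as in this toy sketch:

```python
import pandas as pd

# Toy data: user 1 tagged movie 10 twice, so the same rating appears on two rows
df = pd.DataFrame(
    {
        "userId": [1, 1, 2, 3],
        "movieId": [10, 10, 10, 10],
        "tag": ["funny", "pixar", "classic", "kids"],
        "rating": [5.0, 5.0, 3.0, 4.0],
    }
)

# Keep one row per (user, movie) pair so each rating is counted once
unique_ratings = df.drop_duplicates(subset=["userId", "movieId"])

# Per-movie popularity statistics on the deduplicated ratings
movie_stats = (
    unique_ratings.groupby("movieId")["rating"]
    .agg(Num_Ratings="count", Rating_Variance="var")
    .reset_index()
)

# Attach the per-movie statistics back to every row
df = df.merge(movie_stats, on="movieId", how="left")

print(df)
```

Here the deduplicated count is 3 rather than 4, and the variance is computed over the ratings 5, 3, and 4 only.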


In [26]:
# Step 1: Create the user-level dataset
full_movie_ddf_with_genres['ratings_timestamp'] = pd.to_datetime(full_movie_ddf_with_genres['ratings_timestamp'])

# Group by userId to create the user-level dataset
user_level_df = full_movie_ddf_with_genres.groupby('userId').agg(
    # Number of rows per user; 'count' tallies every row with a movieId,
    # including rows that carry no rating
    Num_Movies_Rated=('movieId', 'count'),
    
    # Average rating given by the user
    Avg_User_Rating=('rating', lambda x: round(x.mean(), 2)),
    
    # Most recent rating timestamp (recency of activity)
    Last_Rating_Timestamp=('ratings_timestamp', 'max')
).reset_index()

# The aggregated column is already datetime; this conversion is only a safeguard
user_level_df['Last_Rating_Timestamp'] = pd.to_datetime(user_level_df['Last_Rating_Timestamp'])

# Calculate 'Recency' as the number of days between the current time and the user's most recent rating.
# pd.Timestamp.now() changes on every run, so these values are not reproducible across runs.
user_level_df['Recency'] = (pd.Timestamp.now() - user_level_df['Last_Rating_Timestamp']).dt.days
# Users with no rating timestamp (NaT) are assigned a Recency of 0
user_level_df['Recency'] = user_level_df['Recency'].fillna(0).round(0).astype('Int64')

# Display the first few rows of the user-level dataset
print(user_level_df.head())


   userId  Num_Movies_Rated  Avg_User_Rating Last_Rating_Timestamp  Recency
0      22                 3             3.25   2021-05-31 17:48:43     1234
1      34                 2             4.00   2009-08-09 09:02:14     5547
2      55                 1             4.00   2011-10-22 22:19:44     4743
3      58                 7             3.71   2023-01-01 05:39:42      654
4      60                 1              NaN                   NaT        0
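
Two details of the recency calculation are worth a second look: the reference point moves with the run date, and users with no rating timestamp receive a recency of 0, which makes them look like the most recently active users. The sketch below shows one alternative, using the latest timestamp in the data as a fixed reference and leaving missing values as `<NA>`. The timestamps are copied from the output above; the choice of reference date is an assumption, not the notebook's method.

```python
import pandas as pd

# Three users from the output above; user 60 has no rating timestamp
user_level = pd.DataFrame(
    {
        "userId": [22, 58, 60],
        "Last_Rating_Timestamp": pd.to_datetime(
            ["2021-05-31 17:48:43", "2023-01-01 05:39:42", None]
        ),
    }
)

# Fixed reference point: the most recent rating in the dataset
reference_date = user_level["Last_Rating_Timestamp"].max()

# Whole days since each user's last rating; NaT yields a missing value
recency_days = (reference_date - user_level["Last_Rating_Timestamp"]).dt.days

# Nullable integer dtype keeps the missing recency as <NA> rather than 0
user_level["Recency"] = recency_days.astype("Int64")

print(user_level)
```

Keeping the missing value explicit lets the modeling stage decide how to treat users without ratings instead of silently encoding them as highly recent.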


In [28]:
# Step 1: Get the genre columns from the full_movie_ddf_with_genres DataFrame
genre_columns = get_genre_columns(full_movie_ddf_with_genres)

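# Step 2: Create a new DataFrame to hold the genre preferences.
# The mean of a one-hot genre column is the share of the user's rows that belong
# to that genre (a value between 0 and 1), not an average rating.
genre_preferences = full_movie_ddf_with_genres.groupby('userId')[genre_columns].agg('mean').reset_index()

# Rename columns. Despite the 'Avg_..._Rating' names, these columns hold genre shares;
# a true per-genre average would be the mean of 'rating' over rows where the genre flag is 1.
genre_preferences.columns = ['userId'] + [f'Avg_{genre.capitalize()}_Rating' for genre in genre_columns]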

# Step 3: Merge the genre preferences with the user-level dataset
user_level_with_genres = user_level_df.merge(genre_preferences, on='userId', how='left')

# Display the first few rows of the updated user-level dataset
print(user_level_with_genres.head())

# Save the user-level data to CSV
user_level_with_genres.to_csv('../data/processed/user_level_with_genres.csv', index=False)

   userId  Num_Movies_Rated  Avg_User_Rating Last_Rating_Timestamp  Recency  \
0      22                 3             3.25   2021-05-31 17:48:43     1234   
1      34                 2             4.00   2009-08-09 09:02:14     5547   
2      55                 1             4.00   2011-10-22 22:19:44     4743   
3      58                 7             3.71   2023-01-01 05:39:42      654   
4      60                 1              NaN                   NaT        0   

   Avg_Action_Rating  Avg_Adventure_Rating  Avg_Animation_Rating  \
0           0.333333              0.333333                   0.0   
1           0.000000              0.000000                   0.0   
2           0.000000              0.000000                   0.0   
3           0.857143              0.857143                   0.0   
4           1.000000              1.000000                   0.0   

   Avg_Children_Rating  Avg_Comedy_Rating  ...  Avg_Film-noir_Rating  \
0                  0.0           0.666667  .