# Module E: AI Applications - Individual Open Project
## Content Recommendation System

**Student Name:** Neelay Upadhyay  
**Student ID:** iitrpr_ai_25010718  
**Submission Date:** 17.01.26

---

### Problem Definition & Objective

#### Selected Project Track
**Personalization & Recommender Systems**

#### Problem Statement
Users face decision fatigue when navigating vast content libraries on digital platforms. This project builds a recommendation engine that matches a user's natural-language query against movie and game metadata to suggest relevant titles, reducing discovery time and improving engagement.

#### Real-world Relevance
Recommender systems drive critical business metrics across platforms. Netflix has estimated that its recommender saves about $1 billion per year through reduced churn, and a widely cited industry estimate attributes roughly 35% of Amazon's sales to recommendations. With surveys reporting that 71% of consumers expect personalized experiences, effective recommendation systems directly affect user retention and platform competitiveness.


### Data Understanding & Preparation

#### Dataset Source
This project uses two datasets publicly available on Kaggle:

- "TMDB 5000 Movie Dataset" from The Movie Database (TMDb)
- "Steam Store Games (Clean dataset)" from Nik Davis

## Movies Dataset: Loading and Preprocessing

This section loads the TMDB movie metadata from Kaggle, merges it with credits, and prepares a cleaned feature-rich dataset for the recommendation engine.

In [1]:
# Cell 1 - Imports and setup

import pandas as pd
import ast
import nltk
from nltk.stem.porter import PorterStemmer
import os

DATA_DIR = 'data'
os.makedirs(DATA_DIR, exist_ok=True)

# Ensure NLTK tokenizer is available
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

### Helper functions for JSON parsing and text processing

This cell defines utility functions to parse JSON-like fields, extract directors and top cast, and normalize multi-word tokens.

In [2]:
# Cell 2 - Helper functions

# Errors expected when a field is missing, malformed, or lacks a key
PARSE_ERRORS = (ValueError, SyntaxError, TypeError, KeyError)


def clean_json(obj):
    """Extract every 'name' from a stringified list of dicts."""
    try:
        return [entry['name'] for entry in ast.literal_eval(obj)]
    except PARSE_ERRORS:
        # Fall back to an empty list so one bad row does not stop the run
        return []


def clean_json_top3(obj):
    """Extract the first 3 names from a stringified list of dicts."""
    try:
        return [entry['name'] for entry in ast.literal_eval(obj)[:3]]
    except PARSE_ERRORS:
        return []


def fetch_director(obj):
    """Extract the first crew member whose job is 'Director'."""
    try:
        for entry in ast.literal_eval(obj):
            if entry['job'] == 'Director':
                return [entry['name']]
        return []
    except PARSE_ERRORS:
        return []


def collapse(names):
    """Remove spaces: 'Sam Worthington' -> 'SamWorthington'."""
    return [name.replace(" ", "") for name in names]
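
As a quick illustration of what these helpers do, the sketch below runs the same parsing and collapsing steps on a hand-written string in the style of the TMDB `genres` column. The sample value is invented for illustration and is not taken from the dataset.

```python
import ast

# Invented sample in the style of the TMDB 'genres' column (not real data)
raw_genres = '[{"id": 878, "name": "Science Fiction"}, {"id": 28, "name": "Action"}]'

# Same steps as clean_json: parse the string, then keep each 'name'
names = [entry["name"] for entry in ast.literal_eval(raw_genres)]

# Same step as collapse: drop spaces so a multi-word name becomes one token
tokens = [name.replace(" ", "") for name in names]

print(names)
print(tokens)
```

Note that `ast.literal_eval` only understands Python literals, so a field containing JSON `null`, `true`, or `false` would raise an error and end up as an empty list in the helpers. `json.loads` is a possible alternative if that turns out to matter for a given column.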

### Loading raw CSVs and merging

Here, the TMDB movies and credits CSV files are loaded from the `data/` folder and merged on the `title` column, keeping only relevant metadata fields.

In [3]:
# Cell 3 - Load, merge, select, handle missing

print("Loading raw TMDB datasets...")
try:
    movies = pd.read_csv(os.path.join(DATA_DIR, 'tmdb_5000_movies.csv'))
    credits = pd.read_csv(os.path.join(DATA_DIR, 'tmdb_5000_credits.csv'))
except FileNotFoundError:
    print("Error: Raw CSV files not found in 'data/' folder. Please download them from Kaggle.")
    raise

print("Merging datasets...")
movies = movies.merge(credits, on='title')

print("Selecting relevant columns...")
movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords',
                 'cast', 'crew', 'release_date', 'vote_average', 'vote_count']]

movies.dropna(inplace=True)

Loading raw TMDB datasets...
Merging datasets...
Selecting relevant columns...
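
Merging on `title` works as long as titles are unique. Where two films share a title, the join pairs every movie row with every credits row of that title, which can silently duplicate rows. The sketch below shows the effect on invented toy frames and contrasts it with a merge on the numeric ID. It assumes the movies file exposes an `id` column that matches `movie_id` in the credits file, which is an assumption about the raw CSVs rather than something shown in the cell above.

```python
import pandas as pd

# Toy frames (invented): two different films share the title "Batman"
movies_toy = pd.DataFrame({"id": [1, 2, 3], "title": ["Batman", "Batman", "Heat"]})
credits_toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Heat"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Title merge: each "Batman" row pairs with both credits rows (2 x 2 + 1 = 5)
by_title = movies_toy.merge(credits_toy, on="title")

# ID merge: one row per film; validate raises if a key is duplicated
by_id = movies_toy.merge(
    credits_toy.drop(columns="title"),
    left_on="id",
    right_on="movie_id",
    validate="one_to_one",
)

print(len(by_title), len(by_id))
```

Comparing `len(movies)` before and after the merge is a cheap way to check whether this affects the real data.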


### Cleaning, preprocessing, and feature engineering

This cell parses JSON-like fields, tokenizes the overview, collapses multi-word entries, and builds a unified `tags` feature for content-based similarity.

In [4]:
# Cell 4 - Transform JSON fields and build tags

print("Transforming JSON fields...")
movies['genres'] = movies['genres'].apply(clean_json)
movies['keywords'] = movies['keywords'].apply(clean_json)
movies['cast'] = movies['cast'].apply(clean_json_top3)
movies['crew'] = movies['crew'].apply(fetch_director)

movies['overview'] = movies['overview'].apply(lambda x: x.split())

movies['genres'] = movies['genres'].apply(collapse)
movies['keywords'] = movies['keywords'].apply(collapse)
movies['cast'] = movies['cast'].apply(collapse)
movies['crew'] = movies['crew'].apply(collapse)

print("Creating master tags...")
movies['tags'] = (
    movies['overview']
    + movies['genres']
    + movies['keywords']
    + movies['cast']
    + movies['crew']
)

Transforming JSON fields...
Creating master tags...
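
The reason for collapsing multi-word names is easiest to see on a small case. In the sketch below (illustrative names, plain Python), two different people share a first name: without collapsing, whitespace tokenization produces a shared token that would make the two movies look more similar than they are.

```python
def collapse(names):
    """Remove spaces inside each name, as in the helper above."""
    return [name.replace(" ", "") for name in names]

# Illustrative example: an actor and a director sharing a first name
cast_a = ["Sam Worthington"]
crew_b = ["Sam Mendes"]

# Without collapsing, whitespace tokenization yields a shared token "Sam"
raw_overlap = set(" ".join(cast_a).split()) & set(" ".join(crew_b).split())

# After collapsing, each person is one distinct token
collapsed_overlap = set(collapse(cast_a)) & set(collapse(crew_b))

print(raw_overlap)
print(collapsed_overlap)
```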


### Final feature table and stemming

A compact dataframe is created with `id`, `title`, `tags`, and metadata, then tags are lowercased and stemmed before saving to disk.

In [5]:
# Cell 5 - Final dataframe, stemming, save

# Take an explicit copy so the assignments below modify new_df itself
# rather than a view of movies (avoids pandas' SettingWithCopyWarning)
new_df = movies[['movie_id', 'title', 'tags', 'release_date',
                 'vote_average', 'vote_count']].copy()

# Join the token lists into one lowercase string per movie
new_df['tags'] = new_df['tags'].apply(lambda x: " ".join(x).lower())

print("Applying Porter Stemmer...")
ps = PorterStemmer()

def stem(text):
    """Stem every whitespace-separated token in text."""
    return " ".join(ps.stem(word) for word in text.split())

new_df['tags'] = new_df['tags'].apply(stem)

new_df = new_df.rename(columns={'movie_id': 'id'})

output_path = os.path.join(DATA_DIR, 'movies.csv')
new_df.to_csv(output_path, index=False)
print(f"Success! Processed data saved to '{output_path}'.")

Applying Porter Stemmer...
Success! Processed data saved to 'data\movies.csv'.


## Games Dataset: Loading and Preprocessing

This section loads the Steam Store games data from Kaggle, merges multiple metadata files, and prepares a cleaned feature-rich dataset for the recommendation engine.

In [6]:
# Cell 1 - Imports and setup

import pandas as pd
import numpy as np
import os
import ast

DATA_DIR = 'data'

### Loading and merging Steam datasets

The core Steam datasets (main, descriptions, media, and requirements) are loaded from the `data/` folder and merged on the application ID.

In [7]:
# Cell 2 - Load and merge raw Steam CSVs

print("Loading Steam datasets...")
try:
    df_main = pd.read_csv(os.path.join(DATA_DIR, 'steam.csv'))
    df_desc = pd.read_csv(os.path.join(DATA_DIR, 'steam_description_data.csv'))
    df_media = pd.read_csv(os.path.join(DATA_DIR, 'steam_media_data.csv'))
    df_reqs = pd.read_csv(os.path.join(DATA_DIR, 'steam_requirements_data.csv'))
except FileNotFoundError as e:
    print(f"Error: Missing file. {e}")
    print("Please ensure steam.csv, steam_description_data.csv, steam_media_data.csv, and steam_requirements_data.csv are in the 'data/' folder.")
    raise

print("Merging datasets...")
df_desc.rename(columns={'steam_appid': 'appid'}, inplace=True)
df_media.rename(columns={'steam_appid': 'appid'}, inplace=True)
df_reqs.rename(columns={'steam_appid': 'appid'}, inplace=True)

df = df_main.merge(df_desc, on='appid', how='left')
df = df.merge(df_media, on='appid', how='left')
df = df.merge(df_reqs, on='appid', how='left')

Loading Steam datasets...
Merging datasets...
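
Because these are left merges, every game in the main table is kept even when a side table has no row for it; the missing fields simply become `NaN`. A minimal sketch of how such gaps could be counted, using invented toy frames and pandas' `indicator` option:

```python
import pandas as pd

# Toy frames (invented): app 30 has no description row
main_toy = pd.DataFrame({"appid": [10, 20, 30], "name": ["A", "B", "C"]})
desc_toy = pd.DataFrame({"steam_appid": [10, 20], "short_description": ["x", "y"]})

merged = main_toy.merge(
    desc_toy.rename(columns={"steam_appid": "appid"}),
    on="appid",
    how="left",
    indicator=True,  # adds a '_merge' column: 'both' or 'left_only'
)

# Rows from the main table that found no match in the description table
unmatched = merged[merged["_merge"] == "left_only"]

print(len(merged), unmatched["appid"].tolist())
```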


### Cleaning metadata and creating rating features

This step standardizes column names, parses release dates, and derives vote-based popularity metrics used for filtering and ranking.

In [8]:
# Cell 3 - Rename, date processing, rating features, popularity filter

print("Cleaning and transforming...")

df.rename(columns={
    'appid': 'id',
    'name': 'title',
    'short_description': 'overview',
    'header_image': 'poster',
}, inplace=True)

df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
df['year'] = df['release_date'].dt.year.fillna(0).astype(int)

df['total_votes'] = df['positive_ratings'] + df['negative_ratings']
df['vote_average'] = (df['positive_ratings'] / df['total_votes']) * 10
df['vote_average'] = df['vote_average'].fillna(0).round(1)
df.rename(columns={'total_votes': 'vote_count'}, inplace=True)

initial_count = len(df)
df = df[df['vote_count'] >= 200]
print(f"📉 Filtered dataset from {initial_count} -> {len(df)} high-quality games.")

Cleaning and transforming...
📉 Filtered dataset from 27075 -> 6376 high-quality games.
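
The rating features reduce to simple arithmetic: the share of positive ratings scaled to a 0-10 range, plus a minimum-count filter. The NumPy sketch below reproduces that calculation on invented counts for three hypothetical games, including one with no ratings at all.

```python
import numpy as np

# Invented rating counts for three hypothetical games
positive = np.array([900, 150, 0])
negative = np.array([100, 30, 0])

total_votes = positive + negative

# Share of positive ratings scaled to 0-10; games with no votes get 0
with np.errstate(divide="ignore", invalid="ignore"):
    vote_average = np.where(total_votes > 0, positive / total_votes * 10, 0.0)
vote_average = np.round(vote_average, 1)

# Same threshold as the cell above: keep games with at least 200 ratings
keep = total_votes >= 200

print(vote_average.tolist())
print(keep.tolist())
```

The count filter matters because the ratio alone is unreliable for small samples: a game with ten ratings, all positive, would score a perfect 10.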


### Tag construction and feature engineering

Genres, categories, tags, descriptions, and developer names are combined into a single lowercase `tags` field to support content-based similarity.

In [9]:
# Cell 4 - Tag construction and requirements

def clean_tags(text):
    if pd.isna(text):
        return ""
    return text.replace(';', ' ')

df['genres_str'] = df['genres'].apply(clean_tags)
df['categories_str'] = df['categories'].apply(clean_tags)
df['tags_str'] = df['steamspy_tags'].apply(clean_tags)
df['developer'] = df['developer'].fillna('')

df['tags'] = (
    df['overview'].fillna('') + " " +
    df['genres_str'] + " " +
    df['categories_str'] + " " +
    df['tags_str'] + " " +
    df['developer']
).str.lower()

df['pc_requirements'] = df['minimum'].fillna("")

### Final games table and export

The final games dataframe keeps IDs, metadata, tags, and requirements, removes invalid rows, and saves the result to `games.csv`.

In [10]:
# Cell 5 - Final games dataframe and save

final_df = df[[
    'id',
    'title',
    'overview',
    'tags',
    'poster',
    'year',
    'vote_average',
    'vote_count',
    'developer',
    'publisher',
    'genres',
    'pc_requirements'
]]

final_df = final_df.dropna(subset=['title', 'overview'])

output_path = os.path.join(DATA_DIR, 'games.csv')
final_df.to_csv(output_path, index=False)

print(f"Success! Saved clean dataset to: {output_path}")

Success! Saved clean dataset to: data\games.csv


## Model / System Design

### AI Technique Used
This project implements a **content-based filtering recommendation system** driven by **natural language processing (NLP)**.  
- **Embeddings:** Uses `sentence-transformers` (specifically `all-MiniLM-L6-v2`) to convert text data (titles, overviews, genres) into dense vector representations.  

- **Similarity search:** Utilizes **FAISS (Facebook AI Similarity Search)** for efficient nearest-neighbor search to find content with semantically similar vectors.
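
To make the retrieval step concrete, here is a minimal NumPy sketch of exact nearest-neighbor search by squared L2 distance, which is the computation a flat L2 index performs. The vectors are tiny made-up stand-ins for real embeddings (the model named above produces 384-dimensional vectors), and `l2_search` is a hypothetical helper written for this illustration, not part of FAISS.

```python
import numpy as np

def l2_search(item_vectors, query_vector, k):
    """Return (squared distances, indices) of the k nearest items."""
    diffs = item_vectors - query_vector      # broadcast the query over rows
    sq_dists = np.sum(diffs ** 2, axis=1)    # squared L2 distance per item
    order = np.argsort(sq_dists)[:k]         # smallest distance first
    return sq_dists[order], order

# Made-up 3-dimensional "embeddings" for four items
items = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
], dtype="float32")
query = np.array([1.0, 0.0, 0.0], dtype="float32")

distances, indices = l2_search(items, query, k=2)
print(indices.tolist())
```

A library index does the same work with optimized kernels and, for its approximate index types, without scanning every item.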

### Architecture & Pipeline Explanation
The system follows a modular design divided into three primary components:

1. **`run.py` (application entry point)**  
   - Initializes the Flask web server and configures it to run on host `0.0.0.0` and port `7860`, enabling cloud deployment (e.g., Hugging Face Spaces).

2. **`routes.py` (API controller)**  
   - Defines API endpoints (such as `/recommend`) that receive user requests, validate inputs, trigger the recommendation logic, and return JSON responses to the frontend.

3. **`nlp_engine.py` (core recommendation engine)**  
   - Loads and cleans the movie and game datasets.  
   - Encodes combined text features into vector embeddings.  
   - Builds a FAISS index for fast retrieval.  
   - Runs semantic search to return the top-k most relevant items for a given query.
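
The controller's role (validate, dispatch, return JSON) can be sketched without any web framework. The actual `routes.py` is not shown in this notebook, so everything below is an assumption made for illustration: the payload field names (`query`, `type`, `top_k`), the 1 to 50 limit, and the `handle_recommend` and `StubEngine` names are all hypothetical. In a Flask route, the returned dictionary and status code would be handed back as the JSON response.

```python
def handle_recommend(payload, engines):
    """Validate a request dict and return (response_body, http_status)."""
    query = payload.get("query")
    media_type = payload.get("type", "movies")
    top_k = payload.get("top_k", 10)

    if not isinstance(query, str) or not query.strip():
        return {"error": "query must be a non-empty string"}, 400
    if not isinstance(media_type, str) or media_type not in engines:
        return {"error": "type must be one of: " + ", ".join(sorted(engines))}, 400
    if isinstance(top_k, bool) or not isinstance(top_k, int) or not 1 <= top_k <= 50:
        return {"error": "top_k must be an integer between 1 and 50"}, 400

    # Dispatch to the engine for the requested media type
    results = engines[media_type].search(query.strip(), top_k=top_k)
    return {"results": results}, 200


class StubEngine:
    """Stand-in for the real engine so the sketch runs on its own."""

    def search(self, query, top_k=20):
        return [{"title": f"match for {query!r}", "rank": i} for i in range(top_k)]


engines = {"movies": StubEngine(), "games": StubEngine()}
body, status = handle_recommend({"query": "space drama", "top_k": 3}, engines)
print(status, len(body["results"]))
```

Keeping validation in a plain function like this makes it testable without starting a server.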

### Justification of Design Choices
- **Sentence Transformers:** Preferred over TF-IDF because they capture semantic context, not just exact word overlap, which improves recommendation quality.  

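- **FAISS:** Chosen for scalability, since brute-force cosine similarity over all items becomes slow as the catalog grows. The flat index used in this notebook (`IndexFlatL2`) still performs an exact, exhaustive search, which is adequate for catalogs of a few thousand items; FAISS's approximate index types are what provide the scalability for larger ones.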

- **Flask:** Used as a lightweight framework to expose the model as an HTTP API with minimal overhead.
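
One detail worth spelling out: the justification above speaks of cosine similarity, while the index built in the implementation ranks by L2 distance. Assuming the embeddings are normalized to unit length (an assumption that should be verified for the chosen model), the two produce the same ranking, because for unit vectors $a$ and $b$:

$$
\lVert a - b \rVert^2 = \lVert a \rVert^2 + \lVert b \rVert^2 - 2\,a \cdot b = 2 - 2\cos(a, b)
$$

A smaller squared L2 distance therefore corresponds exactly to a larger cosine similarity. If the embeddings were not normalized, the two orderings could differ.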

## Core Implementation

This section implements the content-based recommendation engine using Sentence Transformers for embeddings and FAISS for similarity search.

In [11]:
# Core recommendation engine: NeuroBrain (Tailored version for notebook)

import pandas as pd
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
import os

_shared_model = None  # global shared model


class NeuroBrain:
    def __init__(self, media_type='movies'):
        global _shared_model

        self.media_type = media_type
        self.data_path = f'data/{media_type}.csv'
        self.index_path = f'data/{media_type}_index.bin'
        self.model_name = 'all-MiniLM-L6-v2'
        self.df = pd.DataFrame()

        print(f"Initializing NeuroBrain for {self.media_type.upper()}...")

        # 1. Load data
        if os.path.exists(self.data_path):
            try:
                self.df = pd.read_csv(self.data_path)

                # ID cleanup
                if 'id' in self.df.columns:
                    self.df['id'] = self.df['id'].astype(str)
                elif 'movie_id' in self.df.columns:
                    self.df['id'] = self.df['movie_id'].astype(str)
                elif 'appid' in self.df.columns:
                    self.df['id'] = self.df['appid'].astype(str)
                else:
                    self.df['id'] = self.df.index.astype(str)

                # Year cleanup
                if 'release_date' in self.df.columns:
                    self.df['release_date'] = pd.to_datetime(self.df['release_date'], errors='coerce')
                    self.df['year'] = self.df['release_date'].dt.year.fillna(0).astype(int)
                elif 'year' in self.df.columns:
                    self.df['year'] = self.df['year'].fillna(0).astype(int)
                else:
                    self.df['year'] = 0

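                # Ensure required columns exist; ratings need a numeric default
                # because _format_item converts vote_average with float()
                for col, default in [('vote_average', 0.0), ('title', ''), ('tags', '')]:
                    if col not in self.df.columns:
                        self.df[col] = default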

            except Exception as e:
                print(f"Error reading CSV: {e}")
                return
        else:
            print(f"Data file missing: {self.data_path}")
            return

        # 2. Load / share SentenceTransformer model
        if _shared_model is None:
            print("Loading SentenceTransformer model (once)...")
            _shared_model = SentenceTransformer(self.model_name)
        self.model = _shared_model

        # 3. Load or build FAISS index
        if not self.df.empty:
            if os.path.exists(self.index_path):
                print(f"Loading FAISS index from {self.index_path}...")
                self.index = faiss.read_index(self.index_path)
            else:
                print("No index found. Building new index...")
                self.build_index()

        print(f"{self.media_type.upper()} brain ready.")

    def build_index(self):
        if self.df.empty:
            return
        text_data = (self.df['title'].astype(str) + " " + self.df['tags'].astype(str)).tolist()
        embeddings = self.model.encode(text_data, show_progress_bar=True)
        embeddings = np.array(embeddings).astype('float32')
        dimension = embeddings.shape[1]
        self.index = faiss.IndexFlatL2(dimension)
        self.index.add(embeddings)
        faiss.write_index(self.index, self.index_path)
        print("Index built and saved.")

    def search(self, query, top_k=20):
        if self.df.empty or not hasattr(self, 'index'):
            return []
        query_vector = self.model.encode([query]).astype('float32')
        fetch_k = min(top_k * 5, len(self.df))
        distances, indices = self.index.search(query_vector, fetch_k)

        results = []
        for idx in indices[0]:
            if idx == -1:
                continue
            item = self.df.iloc[idx]
            results.append(self._format_item(item))
            if len(results) >= top_k:
                break
        return results

    def get_random(self, top_k=10):
        if self.df.empty:
            return []
        n = min(top_k, len(self.df))
        samples = self.df.sample(n=n).to_dict(orient='records')
        return [self._format_item(item) for item in samples]

    def _format_item(self, item):
        """Return a compact dict without poster/overview."""
        obj = {
            'id': str(item.get('id', '')),
            'title': str(item.get('title', 'Unknown')),
            'year': int(item.get('year', 0)),
            'vote_average': float(item.get('vote_average', 0.0)),
            'type': self.media_type
        }
        if 'developer' in item:
            obj['developer'] = str(item['developer'])
        if 'publisher' in item:
            obj['publisher'] = str(item['publisher'])
        return obj
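
`search` asks the index for `top_k * 5` candidates but returns the first `top_k` unchanged. One plausible reason to over-fetch is to leave headroom for filtering after retrieval, so that discarded candidates can be replaced by later ones. The sketch below shows that pattern in plain Python. The `min_year` filter and the `filter_and_truncate` helper are hypothetical and do not appear in the class above.

```python
def filter_and_truncate(candidates, top_k, min_year=None):
    """Keep candidates passing the filter, in ranked order, up to top_k."""
    results = []
    for item in candidates:
        if min_year is not None and item["year"] < min_year:
            continue  # filtered out; a later candidate can take its place
        results.append(item)
        if len(results) >= top_k:
            break
    return results

# Invented candidates, already sorted by similarity (best first)
candidates = [
    {"title": "A", "year": 1995},
    {"title": "B", "year": 2012},
    {"title": "C", "year": 1988},
    {"title": "D", "year": 2016},
    {"title": "E", "year": 2010},
]

top = filter_and_truncate(candidates, top_k=2, min_year=2000)
print([item["title"] for item in top])
```

Had only two candidates been fetched, the filter would have left a single result instead of the two requested.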


### Recommendation Engine Initialization

We instantiate separate `NeuroBrain` engines for movies and games, each loading its own dataset, embeddings index, and shared transformer model.

In [14]:
# Initialize engines for both media types

movies_engine = NeuroBrain('movies')
games_engine = NeuroBrain('games')

Initializing NeuroBrain for MOVIES...
Loading FAISS index from data/movies_index.bin...
MOVIES brain ready.
Initializing NeuroBrain for GAMES...
Loading FAISS index from data/games_index.bin...
GAMES brain ready.


In [17]:
# Helper to display recommendations as a DataFrame

def show_recommendations(engine, query, top_k=5):
    results = engine.search(query, top_k=top_k)
    if not results:
        return "No results."
    return pd.DataFrame(results)[['title', 'year', 'vote_average']]


### Example Recommendations

The following cells show example recommendations for different natural-language queries over movies and games.

In [18]:
# Example 1: movie query

show_recommendations(movies_engine, "emotional sci-fi space drama", top_k=5)

| | title | year | vote_average |
|---|---|---|---|
| 0 | Star Trek Into Darkness | 2013 | 7.4 |
| 1 | Red Planet | 2000 | 5.4 |
| 2 | Serenity | 2005 | 7.4 |
| 3 | The Day the Earth Stood Still | 2008 | 5.2 |
| 4 | Interstellar | 2014 | 8.1 |


In [19]:
# Example 2: another movie query

show_recommendations(movies_engine, "dark psychological thriller", top_k=5)

| | title | year | vote_average |
|---|---|---|---|
| 0 | May | 2002 | 6.3 |
| 1 | Regression | 2015 | 5.3 |
| 2 | Dark City | 1998 | 7.2 |
| 3 | Dressed to Kill | 1980 | 6.8 |
| 4 | The Dark Hours | 2005 | 5.5 |


In [20]:
# Example 1: game query

show_recommendations(games_engine, "open world RPG with rich story", top_k=5)

| | title | year | vote_average |
|---|---|---|---|
| 0 | The Book of Legends | 2014 | 7.1 |
| 1 | La Tale - Evolved | 2017 | 5.2 |
| 2 | UnReal World | 2016 | 9.6 |
| 3 | The Memory of Eldurim | 2014 | 4.8 |
| 4 | Thea 2: The Shattering | 2018 | 8.2 |


In [21]:
# Example 2: another game query

show_recommendations(games_engine, "fast-paced competitive shooter", top_k=5)

| | title | year | vote_average |
|---|---|---|---|
| 0 | Rogue Shooter: The FPS Roguelike | 2014 | 7.5 |
| 1 | Ballistic Overkill | 2017 | 7.6 |
| 2 | The Art of Fight \| 4vs4 Fast-Paced FPS | 2017 | 7.0 |
| 3 | Mad Bullets | 2016 | 9.2 |
| 4 | Shot Shot Tactic | 2016 | 4.3 |


## Evaluation & Analysis

- **Metrics used (qualitative):** Due to the lack of user-level interaction data, the system is evaluated qualitatively by manually inspecting the top-k recommendations for different natural-language queries and checking genre, theme, and mood consistency. This is a common approach for early-stage content-based recommenders when click or rating logs are not available.

- **Sample outputs:** The previous cells show sample recommendations for movie queries such as "emotional sci-fi space drama" and game queries such as "open world RPG with rich story". The returned items are generally consistent in genre (sci-fi, RPG) and tone, indicating that the semantic embeddings + FAISS pipeline is capturing high-level content similarity.

- **Performance analysis and limitations:** While the recommendations are thematically coherent, the system does not optimize any explicit ranking metric such as Precision@k or NDCG, and it cannot personalize results to individual users because only item content is modeled. It also inherits popularity and representation biases from the TMDB and Steam datasets, and items with sparse or poor text metadata may receive lower-quality recommendations.
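
For reference, the two ranking metrics named above can be computed in a few lines once a set of relevant items has been labelled for each query. The sketch below uses binary relevance and an invented example; the item names and labels are placeholders, not results from this system.

```python
import math

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top = recommended[:k]
    return sum(1 for item in top if item in relevant) / k

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 -> log2(2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Invented example: hand-labelled relevant items for one query
recommended = ["A", "B", "C", "D", "E"]
relevant = {"A", "C", "F"}

print(precision_at_k(recommended, relevant, k=5))
print(round(ndcg_at_k(recommended, relevant, k=5), 3))
```

Precision@k ignores the order within the top k, while NDCG rewards placing relevant items earlier, so the two are usually reported together.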

## Ethical Considerations & Responsible AI

- **Bias and fairness:** The recommender inherits popularity and representation biases from TMDB and Steam, so already popular genres, studios, and AAA titles are more likely to be suggested than niche or indie content. Popularity bias is a well-known ethical issue in recommender systems.

- **Dataset limitations:** Both datasets are historically bounded: the TMDB 5000 movies metadata mostly covers titles released before 2017-2018, and the Steam games dataset only includes games available on Steam before around 2019. This means recent releases and games from other platforms (PlayStation, Xbox, Switch, mobile, etc.) are completely absent, limiting coverage and making the system unsuitable for "latest releases" discovery.

- **Responsible use:** The system is designed for entertainment discovery only and should not be used for high-stakes or sensitive decisions. Any real-world deployment should clearly communicate these limitations, avoid over-personalization that could create filter bubbles, and continuously monitor for biased or harmful recommendation patterns.

## Conclusion & Future Scope

- **Summary of results:** The project demonstrates a content-based recommendation engine that uses transformer embeddings and FAISS to retrieve semantically similar movies and games from TMDB and Steam datasets, producing thematically coherent recommendations for natural-language queries.

- **Possible improvements and extensions:** Future work could add user interaction data for personalization, adopt more recent and broader datasets that include post-2019 titles and non-Steam platforms, experiment with larger or domain-specific embedding models, and introduce quantitative ranking metrics (e.g., Precision@k, NDCG) and A/B testing for more rigorous evaluation.