
# Movie Recommendation System — step-by-step notebook
**What you'll get:** a complete starter notebook to build a movie recommender (popularity baseline, collaborative filtering, content-based, and a simple classification approach).  
**Dataset:** `Movie Recommendation System` (you provided the Kaggle dataset link).

> ⚠️ The dataset is not bundled with this notebook (it was created in an environment without internet access). The download section below lists the Kaggle CLI commands; run them on your machine to fetch the data into `./data/` before executing the rest of the notebook.

**Sections**
1. Setup & data download
2. Smart data loading (auto-detect CSVs)
3. Exploratory Data Analysis (EDA)
4. Popularity-based recommender (baseline)
5. Collaborative Filtering (Surprise SVD)
6. Content-based recommender (TF-IDF on titles)
7. Classification: predict if a user will *like* a movie (rating >= 4)
8. Advanced: sentence-transformers embeddings & next steps


In [None]:

# -----------------------------
# 1) Setup: pip installs (run once in your env)
# -----------------------------
# Uncomment and run these if you don't have the libraries.
# Note: sentence-transformers is optional (used in advanced section).
#
# !pip install pandas numpy scikit-learn matplotlib scikit-surprise sentence-transformers
#
# If you prefer conda:
# conda install -c conda-forge scikit-surprise pandas scikit-learn matplotlib
#
print('Skip installs if already available.')




## Download dataset (two options)

**Option A — Kaggle CLI (recommended if the dataset is on Kaggle)**

1. Install kaggle CLI and configure an API token: https://github.com/Kaggle/kaggle-api
2. Run in a terminal (or prefix each line with `!` to run it from a notebook cell):
```bash
kaggle datasets download -d parasharmanas/movie-recommendation-system -p ./data
unzip ./data/movie-recommendation-system.zip -d ./data
```

**Option B — manual**
- Go to the Kaggle dataset page (the link you provided) and download `.csv` files. Put them in `./data/`.
- The notebook will auto-detect CSVs in `./data/`.

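Before loading everything into memory, it can help to look only at the header row of each CSV. The sketch below is one way to do that with `pd.read_csv(..., nrows=0)`. The helper name `peek_csv_columns` is made up for illustration, and the demo writes two tiny stand-in files to a temporary directory instead of touching `./data/`; their contents are hypothetical, not taken from the real dataset.

```python
import glob
import os
import tempfile

import pandas as pd


def peek_csv_columns(data_dir):
    """Return {file name: list of column names} without reading any data rows."""
    columns = {}
    for path in sorted(glob.glob(os.path.join(data_dir, '*.csv'))):
        # nrows=0 parses only the header line, so this stays cheap for huge files
        columns[os.path.basename(path)] = list(pd.read_csv(path, nrows=0).columns)
    return columns


# Demo on two tiny stand-in files (hypothetical contents, not the real dataset)
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, 'ratings.csv'), 'w') as f:
    f.write('userId,movieId,rating,timestamp\n1,10,4.0,0\n')
with open(os.path.join(demo_dir, 'movies.csv'), 'w') as f:
    f.write('movieId,title,genres\n10,Toy Story (1995),Animation\n')

demo_columns = peek_csv_columns(demo_dir)
print(demo_columns)
```

Calling `peek_csv_columns('./data')` after the download gives a quick way to confirm which file holds ratings and which holds titles before running the loader cell.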

In [None]:
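# -----------------------------
# 2) Smart data loading (auto-detect CSVs in ./data)
# -----------------------------
import os, glob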
import pandas as pd

data_dir = './data'
os.makedirs(data_dir, exist_ok=True)  # create if missing

print('Files in', data_dir, ':', os.listdir(data_dir))

csv_files = glob.glob(os.path.join(data_dir, '*.csv'))
print('\nDetected CSV files:', [os.path.basename(f) for f in csv_files])

# try to read all CSVs into memory (careful with very large files)
dfs = {}
for p in csv_files:
    name = os.path.basename(p)
    try:
        dfs[name] = pd.read_csv(p, low_memory=False)
        print(f'Loaded {name} — shape={dfs[name].shape}')
    except Exception as e:
        print('Failed to read', name, e)

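# helper to guess which dataframe is which based on column-name heuristics:
# pick the dataframe whose columns match the most keywords, so that a file
# containing only a 'movie' column is not mistaken for the ratings table
def find_df_by_cols(dfs, keywords):
    best_name, best_df, best_score = None, None, 0
    for name, df in dfs.items():
        cols = [c.lower() for c in df.columns]
        score = sum(any(k in c for c in cols) for k in keywords)
        if score > best_score:
            best_name, best_df, best_score = name, df, score
    return best_name, best_df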

# Common heuristics
ratings_name, ratings_df = find_df_by_cols(dfs, ['rating', 'user', 'movie'])
movies_name, movies_df = find_df_by_cols(dfs, ['title', 'genres', 'movie'])

print('\nGuessed ratings file:', ratings_name)
print('Guessed movies file:', movies_name)

# Show samples (if found)
if ratings_df is not None:
    print('\nRatings sample:')
    display(ratings_df.head())
else:
    print('\nNo ratings file auto-detected. If your dataset uses different column names, inspect ./data and load manually.')

if movies_df is not None:
    print('\nMovies sample:')
    display(movies_df.head())
else:
    print('\nNo movies file auto-detected. If your dataset uses different column names, inspect ./data and load manually.')

In [None]:

# -----------------------------
# 3) Quick EDA (ratings distribution, counts)
# -----------------------------
import numpy as np
import matplotlib.pyplot as plt

if 'ratings_df' in globals() and ratings_df is not None:
    # try to find rating column name
    rating_col = None
    for c in ratings_df.columns:
        if 'rating' in c.lower():
            rating_col = c
            break
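    if rating_col is None:
        raise ValueError('No rating-like column found in ratings_df; set rating_col manually.')
    print('rating_col =', rating_col)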
    print('\nBasic stats:')
    display(ratings_df[rating_col].describe())
    
    # histogram
    plt.figure(figsize=(6,4))
    plt.hist(ratings_df[rating_col].dropna(), bins=20)
    plt.title('Rating distribution')
    plt.xlabel('Rating')
    plt.ylabel('Count')
    plt.show()
    
    # unique counts
    user_cols = [c for c in ratings_df.columns if 'user' in c.lower()]
    movie_cols = [c for c in ratings_df.columns if 'movie' in c.lower() or 'title' in c.lower()]
    print('\nUnique users (guess):', ratings_df[user_cols[0]].nunique() if user_cols else 'N/A')
    print('Unique movies (guess):', ratings_df[movie_cols[0]].nunique() if movie_cols else 'N/A')
else:
    print('No ratings_df available for EDA. Load your CSVs into ./data/ and re-run the loader cell.')

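Beyond the rating histogram, it is often worth checking how sparse the user-item matrix is, since that affects how well collaborative filtering can work. The sketch below computes the density (share of user-movie pairs that have a rating) on a toy frame. The function name `rating_matrix_density`, the column names `userId` / `movieId`, and the toy values are assumptions for illustration; substitute the column names detected in your data.

```python
import pandas as pd


def rating_matrix_density(ratings, user_col='userId', movie_col='movieId'):
    """Fraction of (user, movie) pairs that have at least one rating."""
    n_users = ratings[user_col].nunique()
    n_movies = ratings[movie_col].nunique()
    n_pairs = len(ratings[[user_col, movie_col]].drop_duplicates())
    return n_pairs / (n_users * n_movies)


# Toy data (hypothetical values)
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0],
})

density = rating_matrix_density(toy)
ratings_per_user = toy.groupby('userId').size()

print(f'density = {density:.3f}')
print(ratings_per_user.describe())
```

On real rating data the density is typically far below 1, and the ratings-per-user distribution tends to be long-tailed, which is one reason the later cells fall back to global statistics for rarely seen users and movies.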

In [None]:

# -----------------------------
# 4) Popularity-based recommender (baseline)
# -----------------------------
# This recommends overall top movies by average rating and minimum number of ratings.
if 'ratings_df' in globals() and ratings_df is not None:
    # find column names
    def find_col(df, options):
        for o in options:
            for c in df.columns:
                if c.lower() == o.lower():
                    return c
        # partial match
        for o in options:
            for c in df.columns:
                if o.lower() in c.lower():
                    return c
        return None

    user_col = find_col(ratings_df, ['userId', 'user_id', 'user'])
    movie_col = find_col(ratings_df, ['movieId', 'movie_id', 'movie'])
    rating_col = find_col(ratings_df, ['rating'])

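    agg = ratings_df.groupby(movie_col)[rating_col].agg(['mean', 'count']).reset_index().rename(columns={'mean': 'avg_rating', 'count': 'n_ratings'})
    # keep only movies with enough ratings; otherwise a movie with a single 5-star vote tops the list
    min_ratings = 50  # tune for your dataset size
    agg = agg[agg['n_ratings'] >= min_ratings]
    display(agg.sort_values(['avg_rating', 'n_ratings'], ascending=[False, False]).head(20))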

    # join titles if movies_df exists
    if 'movies_df' in globals() and movies_df is not None:
        title_col = find_col(movies_df, ['title', 'name'])
        if title_col is not None:
            agg = agg.merge(movies_df[[movie_col, title_col]].drop_duplicates(), on=movie_col, how='left')
            print('\nTop movies (with titles):')
            display(agg.sort_values(['avg_rating', 'n_ratings'], ascending=[False, False]).head(20))
else:
    print('ratings_df not available — run the loader cell.')

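A hard minimum-count cutoff is one way to keep rarely rated movies out of the top list. A common alternative is a shrinkage ("weighted rating") score that pulls each movie's average toward the global mean, more strongly when it has few ratings: `score = v/(v+m) * R + m/(v+m) * C`, where `R` is the movie's average, `v` its rating count, `C` the global mean rating and `m` a prior weight you choose. The sketch below applies it to a toy table; the function name, `m=20`, `global_mean=3.5` and the toy numbers are all assumptions for illustration.

```python
import pandas as pd


def weighted_rating(agg, m, global_mean):
    """Shrink each movie's average rating toward the global mean.

    agg must have columns 'avg_rating' and 'n_ratings';
    m is the prior weight (roughly: how many ratings before we trust the average).
    """
    v = agg['n_ratings']
    r = agg['avg_rating']
    return (v / (v + m)) * r + (m / (v + m)) * global_mean


# Toy aggregate table (hypothetical values)
toy_agg = pd.DataFrame({
    'movieId':    [1, 2, 3],
    'avg_rating': [5.0, 4.2, 3.0],
    'n_ratings':  [1, 200, 50],
})

toy_agg['score'] = weighted_rating(toy_agg, m=20, global_mean=3.5)
ranked = toy_agg.sort_values('score', ascending=False)
print(ranked)
```

Here the movie with a single 5-star rating no longer ranks first, because its score sits close to the assumed global mean of 3.5.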

In [None]:

# -----------------------------
# 5) Collaborative Filtering using Surprise (SVD)
# -----------------------------
# Trains SVD on user-item ratings and reports RMSE. Then shows how to get top-N recommendations for a user.
try:
    from surprise import Dataset, Reader, SVD
    from surprise.model_selection import train_test_split as surprise_train_test_split
    from surprise import accuracy
except ImportError:
    print('surprise library not installed; install with: pip install scikit-surprise')
    raise

if 'ratings_df' in globals() and ratings_df is not None:
    # find columns
    def find_col(df, options):
        for o in options:
            for c in df.columns:
                if c.lower() == o.lower():
                    return c
        for o in options:
            for c in df.columns:
                if o.lower() in c.lower():
                    return c
        return None

    user_col = find_col(ratings_df, ['userId', 'user_id', 'user'])
    movie_col = find_col(ratings_df, ['movieId', 'movie_id', 'movie'])
    rating_col = find_col(ratings_df, ['rating'])

    print('Using columns:', user_col, movie_col, rating_col)
    df_sub = ratings_df[[user_col, movie_col, rating_col]].dropna()

    # ensure rating range defined
    minr, maxr = df_sub[rating_col].min(), df_sub[rating_col].max()
    reader = Reader(rating_scale=(minr, maxr))
    data = Dataset.load_from_df(df_sub[[user_col, movie_col, rating_col]], reader)

    trainset, testset = surprise_train_test_split(data, test_size=0.2, random_state=42)

    algo = SVD(n_factors=50, random_state=42)
    algo.fit(trainset)
    preds = algo.test(testset)

    print('\nRMSE on testset:')
    accuracy.rmse(preds, verbose=True)

    # helper: get top-N recommendations for a raw user id
    def get_top_n_recommendations(algo, user_raw_id, movies_df=None, n=10):
        # collect all movie ids
        all_movie_ids = df_sub[movie_col].unique()
        # movies the user has already rated (empty for unknown users);
        # a set keeps the membership test below fast
        user_rated = set(df_sub.loc[df_sub[user_col] == user_raw_id, movie_col])
        candidates = [m for m in all_movie_ids if m not in user_rated]
        predictions = [(m, algo.predict(user_raw_id, m).est) for m in candidates]
        predictions.sort(key=lambda x: x[1], reverse=True)
        topn = predictions[:n]

        if movies_df is not None:
            title_col = find_col(movies_df, ['title', 'name'])
            rows = []
            for mid, score in topn:
                title = movies_df.loc[movies_df[movie_col] == mid, title_col].values
                rows.append({
                    'movieId': mid,
                    'pred_score': score,
                    'title': title[0] if len(title) > 0 else None
                })
            return rows
        else:
            return [{'movieId': m, 'pred_score': s} for m, s in topn]

    # Example: replace with a real user id from your data
    example_user = df_sub[user_col].iloc[0]
    print('\nTop 10 recommendations for user', example_user)
    display(get_top_n_recommendations(algo, example_user, movies_df=globals().get('movies_df', None), n=10))

else:
    print('ratings_df required — run the loader cell to populate ratings_df.')

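RMSE measures rating accuracy, but a recommender is usually judged on the quality of its top-N list. The sketch below computes per-user precision@k and recall@k from plain `(user_id, item_id, true_rating, estimated_rating)` tuples, so it runs without Surprise. The function name, the relevance threshold of 4.0 and the toy predictions are assumptions for illustration. With Surprise predictions you could pass `[(p.uid, p.iid, p.r_ui, p.est) for p in preds]`.

```python
from collections import defaultdict


def precision_recall_at_k(predictions, k=10, threshold=4.0):
    """Per-user precision@k and recall@k.

    predictions: iterable of (user_id, item_id, true_rating, estimated_rating).
    An item counts as relevant if true_rating >= threshold and as
    recommended if it is in the user's top k and estimated_rating >= threshold.
    """
    per_user = defaultdict(list)
    for uid, _iid, true_r, est in predictions:
        per_user[uid].append((est, true_r))

    precisions, recalls = {}, {}
    for uid, rows in per_user.items():
        rows.sort(key=lambda x: x[0], reverse=True)  # highest estimate first
        top_k = rows[:k]
        n_relevant = sum(true_r >= threshold for _, true_r in rows)
        n_recommended = sum(est >= threshold for est, _ in top_k)
        n_hits = sum(est >= threshold and true_r >= threshold for est, true_r in top_k)
        precisions[uid] = n_hits / n_recommended if n_recommended else 0.0
        recalls[uid] = n_hits / n_relevant if n_relevant else 0.0
    return precisions, recalls


# Toy predictions (hypothetical values)
toy_preds = [
    ('u1', 'a', 5.0, 4.8), ('u1', 'b', 3.0, 4.5), ('u1', 'c', 4.0, 3.0),
    ('u2', 'a', 2.0, 2.5), ('u2', 'd', 4.5, 4.1),
]
precisions, recalls = precision_recall_at_k(toy_preds, k=2, threshold=4.0)
print(precisions, recalls)
```

Averaging the per-user values gives a single number to compare models; the choice of `k` and `threshold` should match how many items you plan to show and what rating you treat as a "like".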

In [None]:

# -----------------------------
# 6) Content-based recommender using TF-IDF on titles
# -----------------------------
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd

# Use movies_df if it has a title-like column; otherwise fall back to a title-like column in ratings_df.
def find_title_col(df):
    for c in df.columns:
        if 'title' in c.lower() or 'name' in c.lower():
            return c
    return None

movies_for_cb = None
for source in (globals().get('movies_df'), globals().get('ratings_df')):
    title_col = find_title_col(source) if source is not None else None
    if title_col is not None:
        # one row per distinct title, indexed 0..n-1 so row labels match TF-IDF row positions
        movies_for_cb = (
            source[[title_col]].dropna().astype(str).drop_duplicates()
            .rename(columns={title_col: 'title_text'}).reset_index(drop=True)
        )
        break

if movies_for_cb is None:
    print('No title text found in your CSVs. Content-based recommender requires a title/description column.')
else:
    tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english')
    tfidf_matrix = tfidf.fit_transform(movies_for_cb['title_text'].fillna(''))

    # Map each title to its row position in tfidf_matrix (first occurrence wins for duplicate titles).
    titles = movies_for_cb['title_text'].fillna('').tolist()
    indices = pd.Series(range(len(titles)), index=titles)
    indices = indices[~indices.index.duplicated(keep='first')]

    # helper to recommend by movie title string (fuzzy match)
    from difflib import get_close_matches

    def find_title_match(query, choices, n=1):
        matches = get_close_matches(query, choices, n=n, cutoff=0.4)
        return matches[0] if matches else None

    def recommend_by_title(query_title, topn=10):
        match = find_title_match(query_title, titles, n=1)
        if match is None:
            print('No close title match found for query:', query_title)
            return []
        idx = int(indices[match])
        # Compare only the query row against all rows; a full N x N similarity
        # matrix may not fit in memory for tens of thousands of titles.
        sims = cosine_similarity(tfidf_matrix[idx], tfidf_matrix).ravel()
        sims[idx] = -1.0  # exclude the movie itself
        top = np.argsort(-sims)[:topn]
        return movies_for_cb.iloc[top][['title_text']].assign(score=sims[top])

    # small demo (replace with a title from your dataset)
    sample_title = movies_for_cb['title_text'].iloc[0]
    print('Sample title:', sample_title)
    display(recommend_by_title(sample_title, topn=8))


In [None]:
# -----------------------------
# 7) Classification: predict whether a user will like a movie (rating >= 4)
# -----------------------------
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
from scipy.sparse import hstack

if 'ratings_df' not in globals() or ratings_df is None:
    print('ratings_df required — run the loader cell first.')
else:
    # find column names robustly
    def find_col(df, options):
        for o in options:
            for c in df.columns:
                if c.lower() == o.lower():
                    return c
        for o in options:
            for c in df.columns:
                if o.lower() in c.lower():
                    return c
        return None

    user_col = find_col(ratings_df, ['userId', 'user_id', 'user'])
    movie_col = find_col(ratings_df, ['movieId', 'movie_id', 'movie'])
    rating_col = find_col(ratings_df, ['rating'])

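    # Prepare dataset
    dfc = ratings_df[[user_col, movie_col, rating_col]].dropna().copy()
    dfc['liked'] = (dfc[rating_col] >= 4).astype(int)  # binary target (tweak threshold as needed)

    # attach movie title if movies_df is available (for TF-IDF)
    dfc['title_text'] = ''
    if 'movies_df' in globals() and movies_df is not None and movie_col in movies_df.columns:
        title_col = next((c for c in movies_df.columns
                          if 'title' in c.lower() or 'name' in c.lower()), None)
        if title_col is not None:
            titles = movies_df[[movie_col, title_col]].drop_duplicates(subset=movie_col)
            titles = titles.rename(columns={title_col: 'title_text'})
            dfc = dfc.drop(columns='title_text').merge(titles, on=movie_col, how='left')
            dfc['title_text'] = dfc['title_text'].fillna('').astype(str)

    # split first, so the aggregate features below never see test-set ratings
    train_df, test_df = train_test_split(
        dfc, test_size=0.2, random_state=42, stratify=dfc['liked']
    )

    # aggregate movie and user stats from the training rows only
    # (each training row's own rating is still part of its averages;
    # use out-of-fold statistics if you need a stricter setup)
    movie_stats = train_df.groupby(movie_col)[rating_col].agg(['mean', 'count']).reset_index().rename(
        columns={'mean': 'movie_avg_rating', 'count': 'movie_rating_count'}
    )
    user_stats = train_df.groupby(user_col)[rating_col].agg(['mean']).reset_index().rename(
        columns={'mean': 'user_avg_rating'}
    )
    global_mean = train_df[rating_col].mean()
    num_feats = ['movie_avg_rating', 'movie_rating_count', 'user_avg_rating']
    fill = {'movie_avg_rating': global_mean, 'movie_rating_count': 0, 'user_avg_rating': global_mean}

    def add_stats(part):
        # movies/users unseen in training fall back to the global training mean
        out = part.merge(movie_stats, on=movie_col, how='left').merge(user_stats, on=user_col, how='left')
        return out.fillna(fill)

    train_df, test_df = add_stats(train_df), add_stats(test_df)

    # TF-IDF on title_text (fit on train only); skipped when no titles are available
    blocks_train, blocks_test = [], []
    if train_df['title_text'].str.strip().ne('').any():
        tfidf = TfidfVectorizer(max_features=2000, ngram_range=(1, 2), stop_words='english')
        blocks_train.append(tfidf.fit_transform(train_df['title_text']))
        blocks_test.append(tfidf.transform(test_df['title_text']))

    # combine text + numeric features; the numeric columns stay last
    from scipy.sparse import csr_matrix
    X_train = hstack(blocks_train + [csr_matrix(train_df[num_feats].values)]).tocsr()
    X_test = hstack(blocks_test + [csr_matrix(test_df[num_feats].values)]).tocsr()
    y_train, y_test = train_df['liked'].values, test_df['liked'].values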

    # logistic regression
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('\nLogistic Regression results:')
    print(classification_report(y_test, y_pred))
    print('accuracy:', accuracy_score(y_test, y_pred))
    print('ROC AUC:', roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

    # random forest on numeric only (as an alternative)
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train[:, -len(num_feats):].toarray(), y_train)
    y_pred_rf = rf.predict(X_test[:, -len(num_feats):].toarray())
    print('\nRandom Forest (numeric-only) results:')
    print(classification_report(y_test, y_pred_rf))
    print('accuracy:', accuracy_score(y_test, y_pred_rf))

    print('\nNote: Text features often help but require enough data & tuning.')



## Advanced (optional): use sentence-transformers embeddings for better text vectors

If you want stronger semantic embeddings (title/description), use `sentence-transformers`:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')  # small and fast
list_of_titles = movies_for_cb['title_text'].tolist()  # titles prepared in the content-based cell
embs = model.encode(list_of_titles, show_progress_bar=True, convert_to_numpy=True)
```
Then you can index embeddings with Annoy / Faiss / Milvus for fast nearest-neighbor search, or use cosine similarity directly for small datasets.

**Tip:** TF-IDF is cheap and often works surprisingly well for short title text. For descriptions or plots, embeddings shine.

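For a small catalogue, the "cosine similarity directly" option mentioned above needs nothing beyond NumPy once the embeddings exist. The sketch below is a minimal brute-force top-k lookup. It uses a tiny hand-made array in place of real sentence-transformer output, and the function name `top_k_cosine` and the toy vectors are assumptions for illustration.

```python
import numpy as np


def top_k_cosine(embeddings, query_index, k=5):
    """Return [(row_index, cosine_similarity), ...] for the k rows closest to the query row."""
    # L2-normalise rows so that a dot product equals cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit[query_index]
    sims[query_index] = -np.inf  # never return the query itself
    k = min(k, len(sims) - 1)
    top = np.argsort(-sims)[:k]
    return list(zip(top.tolist(), sims[top].tolist()))


# Toy 2-D "embeddings" (hypothetical values standing in for model.encode output)
toy_embs = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])
neighbors = top_k_cosine(toy_embs, query_index=0, k=2)
print(neighbors)
```

This scans every row per query, which is fine for thousands of titles; an approximate index such as the ones named above becomes worthwhile when the catalogue or query volume grows.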


## Utilities & Next steps

**Ideas to improve & expand**
- Use full movie metadata (plot, genres, cast) for content-based models.
- Use matrix-factorization (ALS) at scale, or neural CF models for deep learning approaches.
- Hybrid: combine content-similarity features with CF predictions as inputs for a classifier or ranker.
- For production: expose the model via a small Flask/FastAPI service and cache recommendations for speed.

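As a starting point for the hybrid idea in the list above, here is a deliberately simpler variant than a trained classifier or ranker: a weighted blend of min-max normalised CF and content scores. Everything in it is an assumption for illustration, including the function name `blend_scores`, the weight `alpha=0.7`, the choice to score items missing from one source as 0, and the toy scores.

```python
def blend_scores(cf_scores, content_scores, alpha=0.7):
    """Blend two {item_id: score} dicts into one ranked list.

    alpha is the weight on the CF score; (1 - alpha) goes to the content score.
    Items missing from one source get 0 for that source (an assumption).
    """
    def min_max(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        return {k: ((v - lo) / span if span else 0.0) for k, v in scores.items()}

    cf_n, ct_n = min_max(cf_scores), min_max(content_scores)
    items = set(cf_n) | set(ct_n)
    blended = {i: alpha * cf_n.get(i, 0.0) + (1 - alpha) * ct_n.get(i, 0.0) for i in items}
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)


# Toy scores (hypothetical values): predicted ratings and title similarities
cf = {'A': 4.5, 'B': 3.5, 'C': 2.5}
content = {'A': 0.1, 'B': 0.9, 'D': 0.5}
ranked_items = blend_scores(cf, content, alpha=0.7)
print(ranked_items)
```

Normalising first matters because predicted ratings and cosine similarities live on different scales. Once you have enough interaction data, replacing the fixed `alpha` with a learned model over both features is the natural next step.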
**How to run**
1. Download the dataset into `./data/` using the Kaggle CLI or manually.
2. Open this notebook and run cells top-to-bottom.
3. Tweak thresholds, hyperparameters, and vector sizes based on results.
