# Movie Recommendation System

## Prerequisites
 **Dataset**: Download the MovieLens 32M dataset from https://files.grouplens.org/datasets/movielens/ml-32m.zip. Unzip `ml-32m.zip` and copy `ratings.csv` and `movies.csv` into the `dataset/` folder.
   - Expected files: `ratings.csv`, `movies.csv`.
   - Directory structure:
     ```
     Movie Recommendation System/
     ├── dataset/
     │     ├── ratings.csv
     │     ├── movies.csv
     │     └── ...
     └── notebook/
           └── reduced_32M_movies_dataset.ipynb
     ```

In [1]:
import pandas as pd

In [2]:
def load_data():
    try:
        # Load movies data
        movies = pd.read_csv("../dataset/movies.csv")

        # Load full ratings data at once (no chunking)
        ratings = pd.read_csv("../dataset/ratings.csv")

        # Get valid movie IDs from ratings
        valid_movie_ids = ratings['movieId'].unique()

        # Filter movies to only include those present in ratings
        movies = movies[movies['movieId'].isin(valid_movie_ids)]

        return movies, ratings

    except FileNotFoundError as e:
        print(f"Error: Data file not found - {e}")
        raise
    except Exception as e:
        print(f"Error loading data: {e}")
        raise
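
Reading all 32 million ratings in one call can be heavy on memory. The following is a minimal sketch of a chunked alternative, not the notebook's method: `load_ratings_chunked`, `RATING_DTYPES`, and the tiny in-memory sample are hypothetical names and data introduced here for illustration, and the narrower dtypes are an assumption about what the columns can safely hold.

```python
import io
import pandas as pd

# Assumed dtypes (not from the notebook): narrower types than the pandas defaults
RATING_DTYPES = {
    "userId": "int32",
    "movieId": "int32",
    "rating": "float32",
    "timestamp": "int64",
}

def load_ratings_chunked(source, keep_movie_ids=None, chunksize=1_000_000):
    """Read a ratings CSV in chunks, optionally keeping only selected movie IDs."""
    parts = []
    for chunk in pd.read_csv(source, dtype=RATING_DTYPES, chunksize=chunksize):
        if keep_movie_ids is not None:
            # Drop unwanted rows early so they never accumulate in memory
            chunk = chunk[chunk["movieId"].isin(keep_movie_ids)]
        parts.append(chunk)
    if not parts:
        # Source contained no data rows
        return pd.DataFrame(columns=list(RATING_DTYPES))
    return pd.concat(parts, ignore_index=True)

# Made-up three-row sample standing in for ratings.csv
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,17,4.0,944249077\n"
    "1,25,1.0,944250228\n"
    "2,17,3.5,943230976\n"
)
sample_ratings = load_ratings_chunked(sample_csv, keep_movie_ids={17}, chunksize=2)
print(sample_ratings)
```

With the real file, `source` would be the same `"../dataset/ratings.csv"` path used in `load_data()`. Filtering per chunk only helps when the set of movie IDs to keep is known before the ratings are read.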

In [3]:
movies, ratings = load_data()
print(f"Movies shape: {movies.shape}")
print(f"Ratings shape: {ratings.shape}")

Movies shape: (84432, 3)
Ratings shape: (32000204, 4)


In [4]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,17,4.0,944249077
1,1,25,1.0,944250228
2,1,29,2.0,943230976
3,1,30,5.0,944249077
4,1,32,5.0,943228858


In [5]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [6]:
movies.isna().sum().sum()

0

In [7]:
ratings.isna().sum().sum()

0

In [8]:
# Step 2: Extract year from title
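def extract_year(title):
    """Return the four-digit release year from a title such as 'Toy Story (1995)', or None."""
    # The year is expected at the very end of the title, wrapped in parentheses
    if '(' in title and title.endswith(')'):
        year_str = title[-5:-1]  # the four characters before the closing parenthesis
        if year_str.isdigit():
            return int(year_str)
    # No trailing "(YYYY)" found
    return None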
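
As a possible alternative, the same extraction can be sketched with a vectorized regular expression. `extract_year_vectorized` and the sample titles are hypothetical and not part of the notebook. Unlike `extract_year`, this version tolerates trailing whitespace after the closing parenthesis, so it may match a few more titles. It also returns a nullable integer column; the float values such as `1874.0` in the outputs further down are consistent with pandas storing the missing years as `NaN`.

```python
import pandas as pd

def extract_year_vectorized(titles: pd.Series) -> pd.Series:
    """Extract a trailing '(YYYY)' year from each title as a nullable integer."""
    # Capture four digits inside parentheses at the end of the string,
    # allowing optional trailing whitespace
    years = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)
    # Titles without a match become <NA> instead of forcing the column to float
    return pd.to_numeric(years).astype("Int64")

# Made-up sample: a normal title, one with a trailing space, one without a year
sample_titles = pd.Series(["Toy Story (1995)", "Jumanji (1995) ", "Untitled Film"])
years = extract_year_vectorized(sample_titles)
print(years)
```

In the notebook this would be applied as `movies['year'] = extract_year_vectorized(movies['title'])`, under the assumption that every title is a string.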

In [9]:
movies['year'] = movies['title'].apply(extract_year)

In [10]:
# Step 3: Get min and max year
min_year = movies['year'].min()
max_year = movies['year'].max()

In [11]:
min_year

1874.0

In [12]:
max_year

2023.0

In [13]:
# Choose any start and end year; both bounds are inclusive
start_year = 2019
end_year = 2023

# Filter movies within the year range
filtered_movies = movies[(movies['year'] >= start_year) & (movies['year'] <= end_year)]

# Get valid movie IDs
valid_movie_ids = filtered_movies['movieId'].unique()

In [14]:
# Filter ratings to only include those for selected movies
filtered_ratings = ratings[ratings['movieId'].isin(valid_movie_ids)]
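
To make the two filtering steps easier to follow, here is a small sketch on made-up data. The `toy_*` frames are hypothetical, and `Series.between` is swapped in for the explicit `>=` / `<=` comparison; it is inclusive on both ends by default and treats a missing year as out of range.

```python
import pandas as pd

# Made-up movies: one before the range, two inside it, one with no year
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "year": [2018.0, 2019.0, 2023.0, None],
})
# Made-up ratings, one per movie
toy_ratings = pd.DataFrame({
    "userId": [10, 10, 11, 12],
    "movieId": [1, 2, 3, 4],
    "rating": [3.0, 4.0, 5.0, 2.5],
})

# Step 1: keep movies whose year falls in [2019, 2023]; NaN years evaluate to False
in_range = toy_movies["year"].between(2019, 2023)
toy_filtered_movies = toy_movies[in_range]

# Step 2: keep only ratings that refer to one of the remaining movies
toy_filtered_ratings = toy_ratings[
    toy_ratings["movieId"].isin(toy_filtered_movies["movieId"])
]

print(toy_filtered_movies["movieId"].tolist())   # [2, 3]
print(toy_filtered_ratings["movieId"].tolist())  # [2, 3]
```

Movies without a parsable year are dropped by either form of the comparison, so they never reach the filtered files.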

In [15]:
filtered_movies.to_csv('../dataset/filtered_movies.csv', index=False)
filtered_ratings.to_csv('../dataset/filtered_ratings.csv', index=False)

print("Filtered datasets saved:")
print("- filtered_movies.csv")
print("- filtered_ratings.csv")

Filtered datasets saved:
- filtered_movies.csv
- filtered_ratings.csv


In [17]:
m = pd.read_csv("../dataset/filtered_movies.csv")
r = pd.read_csv("../dataset/filtered_ratings.csv")

In [18]:
m.shape

(10991, 4)

In [19]:
r.shape

(515161, 4)
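
Beyond checking shapes, it can be useful to confirm that the two reloaded files agree with each other. The helper below is a hedged sketch; `summarize_subset` and the demo frames are hypothetical names introduced here, not part of the notebook.

```python
import pandas as pd

def summarize_subset(movies_df: pd.DataFrame, ratings_df: pd.DataFrame) -> dict:
    """Report basic consistency figures for a movies/ratings pair."""
    movie_ids = set(movies_df["movieId"].unique())
    rated_ids = set(ratings_df["movieId"].unique())
    return {
        # Ratings whose movieId is missing from the movies table
        "orphan_ratings": int((~ratings_df["movieId"].isin(movie_ids)).sum()),
        # Movies that have no rating at all
        "unrated_movies": len(movie_ids - rated_ids),
        # Number of distinct users left in the subset
        "users": int(ratings_df["userId"].nunique()),
    }

# Made-up frames: movie 3 is unrated, and one rating points at unknown movie 9
demo_movies = pd.DataFrame({"movieId": [1, 2, 3]})
demo_ratings = pd.DataFrame({
    "userId": [10, 10, 11],
    "movieId": [1, 2, 9],
    "rating": [4.0, 3.5, 5.0],
})
summary = summarize_subset(demo_movies, demo_ratings)
print(summary)
```

Calling `summarize_subset(m, r)` on the reloaded frames should report zero orphan ratings if the filtering worked as intended.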

In [20]:
m.head()

Unnamed: 0,movieId,title,genres,year
0,122914,Avengers: Infinity War - Part II (2019),Action|Adventure|Sci-Fi,2019.0
1,143345,Shazam! (2019),Action|Adventure|Fantasy|Sci-Fi,2019.0
2,195473,Les Invisibles (2019),(no genres listed),2019.0
3,196223,Hellboy (2019),Action|Adventure|Fantasy,2019.0
4,196417,How to Train Your Dragon: The Hidden World (2019),Adventure|Animation|Children,2019.0


In [21]:
r.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,16,200818,4.0,1572741766
1,20,196417,4.0,1553181245
2,20,200042,4.0,1553180593
3,22,197889,4.0,1617162338
4,22,200838,4.0,1581474048
