# ITCS 6162: Data Mining - Programming Assignment

**In this assignment, you will explore data analysis, recommendation algorithms, and graph-based techniques using the MovieLens dataset. Your tasks will range from basic data exploration to advanced recommendation models, including:**
- Data manipulation with pandas
- User-item collaborative filtering
- Similarity-based recommendation models
- A Pixie-inspired Graph-based recommendation using adjacency lists with weighted random walks (without using NetworkX)


#### **Dataset Files:**
- **`u.data`**: User-movie ratings (`user_id  movie_id  rating  timestamp`)
- **`u.item`**: Movie metadata (`movie_id | title | release date | IMDB_website`)
- **`u.user`**: User demographics (`user_id | age | gender | occupation | zip_code`)

## **Part 1: Exploring and Cleaning Data**

**In this part, we begin by inspecting the raw MovieLens dataset files.
Since the files are formatted as plain text, we first define a helper function to print the top n lines from any file. This helps us understand the structure of each dataset before performing further analysis.**

In [17]:
def print_head(filename, n=10):
    """Prints the first n lines of a file."""
    print(f"--- First {n} lines of {filename} ---")
    with open(filename, 'r', encoding='latin-1') as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            print(line.strip())
    print("\n")

print_head('u.data')
print_head('u.item')
print_head('u.user')


--- First 10 lines of u.data ---
196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
115	265	2	881171488
253	465	5	891628467
305	451	3	886324817
6	86	3	883603013


--- First 10 lines of u.item ---
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
6|Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)|01-Jan-1995||http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao

**What this code does:**

- Defines a function `print_head` that reads a file line by line and prints only the first `n` lines.
- Opens each file with `latin-1` encoding so that special characters in some movie titles are read without errors.
- Displays the structure of the three key MovieLens files:
  - `u.data` → user–movie ratings. Format: `user_id movie_id rating timestamp`
  - `u.item` → movie metadata. Format: `movie_id | title | release date | ... (genre flags)`
  - `u.user` → demographic information. Format: `user_id | age | gender | occupation | zip_code`

Inspecting these top rows is an essential first step because it:

- confirms the data schema;
- identifies the delimiters (tab for `u.data`, `|` for `u.item` and `u.user`);
- guides how we will later read the files into pandas.
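The delimiter check described above can also be done programmatically. The following is a minimal sketch, not part of the assignment: `guess_delimiter` is a hypothetical helper that assumes a well-formed file uses the same number of separators on every line.

```python
def guess_delimiter(lines, candidates=("\t", "|", ",")):
    """Return the first candidate that occurs the same non-zero number of times on every line."""
    for delim in candidates:
        counts = {line.count(delim) for line in lines}
        # A plausible delimiter appears on every line, always the same number of times.
        if len(counts) == 1 and counts != {0}:
            return delim
    return None


# Sample lines copied from the file previews above.
ratings_sample = ["196\t242\t3\t881250949", "186\t302\t3\t891717742"]
users_sample = ["1|24|M|technician|85711", "2|53|F|other|94043"]

print(repr(guess_delimiter(ratings_sample)))
print(repr(guess_delimiter(users_sample)))
```

The result can then be passed as the `sep` argument of `pd.read_csv`. Titles in `u.item` may contain commas, which is one reason to rely on a consistent count per line rather than on the mere presence of a character.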

**Results and Insights**

**1. u.data (Ratings Data)**

The first 10 lines show entries like:

196  242  3  881250949
186  302  3  891717742
22   377  1  878887116


Insights:

- Four columns appear, separated by tabs: `user_id`, `movie_id`, `rating` (1–5), and `timestamp` (Unix time).
- Ratings are explicit and numeric.
- User IDs and movie IDs are integers, matching MovieLens 100K conventions.

**2. u.item (Movie Metadata)**

Example rows:

1|Toy Story (1995)|01-Jan-1995||http://...|0|0|0|1|1|1|...
2|GoldenEye (1995)|01-Jan-1995||http://...|0|1|1|0|0|0|...


Insights:

- Fields are pipe-separated (`|`).
- The second column is the movie title, which is what we will later use to map the movie IDs in the ratings to movie names.
- Each row ends with 19 genre indicator fields (0 or 1), which can be used for genre-based filtering if needed.
- The fourth field (video release date) is empty in the rows shown, and release dates may also be missing, so minor cleaning may be needed.

**3. u.user (User Demographics)**

Sample rows:

1|24|M|technician|85711
2|53|F|other|94043

Insights:
Columns include age, gender, occupation, and zip code.
This dataset can support demographic-based recommendation extensions, although the assignment does not require it.
Again, pipe-separated values.

### Inspecting the Dataset Format

The dataset is not in a traditional CSV format. To examine its structure, use the following shell command to display the first 10 lines of the file:

```sh
!head <file_name>
```


**In the cells given below, write the code to read the files.**

In [18]:
# u.data
import pandas as pd

r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('u.data', sep='\t', names=r_cols, encoding='latin-1')
ratings


Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
...,...,...,...,...
99995,880,476,3,880175444
99996,716,204,5,879795543
99997,276,1090,1,874795795
99998,13,225,2,882399156


**What the code does:**

- Defines the column names (`user_id`, `movie_id`, `rating`, `timestamp`) because the original file has no header row.
- Uses `pd.read_csv` with `sep='\t'`, which tells pandas to split each line on tabs.
- Reads the file with `latin-1` encoding, which avoids errors on special characters that appear in the MovieLens files.
- Stores the result in a DataFrame named `ratings`, which now contains 100,000 rows and 4 columns.

**Results Explanation**

After reading the u.data file using pandas, we obtain a DataFrame containing 100,000 movie ratings made by users. The dataset consists of four columns:

    user_id – the ID of the user who gave the rating
    
    movie_id – the ID of the movie that was rated
    
    rating – an integer rating from 1 to 5
    
    timestamp – the time at which the rating was made (in Unix format)

The head of the DataFrame shows that each row represents one rating event. The values are clean, well-structured, and formatted consistently. There are no missing values in the displayed sample, and the dataset loads successfully using tab-separated parsing.

In [19]:
# u.item

m_cols = [
    'movie_id','title','release_date','video_release_date','imdb_url',
    'unknown','Action','Adventure','Animation',"Children's",'Comedy','Crime',
    'Documentary','Drama','Fantasy','Film-Noir','Horror','Musical','Mystery',
    'Romance','Sci-Fi','Thriller','War','Western'
]
movies = pd.read_csv('u.item', sep='|', names=m_cols, encoding='latin-1')

movies

Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,1678,Mat' i syn (1997),06-Feb-1998,,http://us.imdb.com/M/title-exact?Mat%27+i+syn+...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1678,1679,B. Monkey (1998),06-Feb-1998,,http://us.imdb.com/M/title-exact?B%2E+Monkey+(...,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
1679,1680,Sliding Doors (1998),01-Jan-1998,,http://us.imdb.com/Title?Sliding+Doors+(1998),0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1680,1681,You So Crazy (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?You%20So%20Cr...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**What the code does:**

Defines 24 column names representing metadata for each movie.

Reads u.item using | as the separator.

Uses 'latin-1' encoding to safely handle movie titles with special characters.

Loads the file into a pandas DataFrame named movies.

**Results Explanation**
  
The `movies` DataFrame loads successfully with 1,682 movies and 24 columns.
The first few rows show fields such as:

- `movie_id`: unique identifier for each movie
- `title`: movie name with release year
- `release_date`: release date stored as a string
- `imdb_url`: link to the movie’s IMDb page
- 19 genre columns: binary flags indicating the genres the movie belongs to

In [20]:
# u.user
u_cols = ['user_id','age','gender','occupation','zip_code']
users = pd.read_csv('u.user', sep='|', names=u_cols, encoding='latin-1')
users


Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
...,...,...,...,...,...
938,939,26,F,student,33319
939,940,32,M,administrator,02215
940,941,20,M,student,97229
941,942,48,F,librarian,78209


**What this code does:**

Defines column names for the 5 attributes included in the user dataset.

Reads u.user using | as the delimiter.

Loads the data into a pandas DataFrame called users.

Uses 'latin-1' encoding to avoid character reading issues.

**Results Explanation**

The users DataFrame loads successfully with 943 users and 5 columns:

    user_id: unique identifier for each user
    
    age: age of the user
    
    gender: 'M' or 'F'
    
    occupation: user's occupation category
    
    zip_code: ZIP code as a string

#### Loading the Dataset with Pandas

Use **pandas** to load the dataset into a DataFrame for analysis. Follow these steps:  

1. Import the necessary library: `pandas`.  
2. Use `pd.read_csv()` (or an appropriate function) to read the dataset file.  
3. Ensure the dataset is loaded with the correct delimiter (e.g., `','`, `'\t'`,`'|'` , or another separator if needed).  
4. Select and display the first few rows using `.head()`.

Ensure that:  

- The `ratings` dataset is read from `"u.data"` using tab (`'\t'`) as a separator and column names (`"user_id"`, `"movie_id"`, `"rating"` and `"timestamp"`).  
- The `movies` dataset is read from `"u.item"` using `'|'` as a separator, use columns (`0`, `1`, `2`), encoding (`"latin-1"`) and name the columns (`movie_id`, `title`, and `release_date`).  
- The `users` dataset is read from `"u.user"` using `'|'` as a separator, use columns (`0`, `1`, `2`, `3`) and name the columns (`user_id`, `age`, `gender`, and `occupation`).

In [21]:
# ratings
import pandas as pd

r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('u.data', sep='\t', names=r_cols, encoding='latin-1')
ratings.head()


Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


**Code Explanation**

u.data uses tab ('\t') as the separator.

We assign column names as required: user_id, movie_id, rating, timestamp.

'latin-1' encoding ensures compatibility with the dataset file.

**Result**

The preview shows 4 correct columns and the first rows of user–movie ratings.

In [22]:
# movies
m_cols = ['movie_id', 'title', 'release_date']
movies = pd.read_csv(
    'u.item',
    sep='|',
    usecols=[0, 1, 2],
    names=m_cols,
    encoding='latin-1'
)
movies.head()


Unnamed: 0,movie_id,title,release_date
0,1,Toy Story (1995),01-Jan-1995
1,2,GoldenEye (1995),01-Jan-1995
2,3,Four Rooms (1995),01-Jan-1995
3,4,Get Shorty (1995),01-Jan-1995
4,5,Copycat (1995),01-Jan-1995


**Explanation**

u.item uses '|' as the delimiter.

Only the first three columns are required:
    
    movie_id
    
    title
    
    release_date

The dataset has many genre columns, but we ignore them per assignment instructions.

'latin-1' encoding prevents character issues.

**Result**

The DataFrame correctly loads movie metadata with 3 columns.

In [23]:
# users
u_cols = ['user_id', 'age', 'gender', 'occupation']
users = pd.read_csv(
    'u.user',
    sep='|',
    usecols=[0, 1, 2, 3],
    names=u_cols,
    encoding='latin-1'
)
users.head()



Unnamed: 0,user_id,age,gender,occupation
0,1,24,M,technician
1,2,53,F,other
2,3,23,M,writer
3,4,24,M,technician
4,5,33,F,other


**Explanation**

u.user is also separated using '|'.

We select only:

    user_id
    age
    gender
    occupation

ZIP code is intentionally excluded because the instructions require only the first 4 columns.

**Result**

The DataFrame correctly loads user demographic information.

**Note:** As a **Bonus** task, save the `ratings`, `movies`, and `users` DataFrames as `.csv` files. <br>
**Hint:** Use the `to_csv()` function in pandas to save these DataFrames as CSV files.

In [24]:
# ratings
ratings.to_csv('ratings.csv', index=False, encoding='utf-8')

In [25]:
# movies
# Movies (3-column version)
movies.to_csv('movies.csv', index=False, encoding='utf-8')

In [26]:
# users
users.to_csv('users.csv', index=False, encoding='utf-8')

**Code Explanation**

Each line exports a DataFrame to CSV format with three parameters:

    'filename.csv' — Specifies the output file name
    
    index=False — Prevents DataFrame row indices from being written to the file
    
    encoding='utf-8' — Ensures UTF-8 character encoding for compatibility

**Results and Insights**

Files Created:
    
    ratings.csv: contains user ratings with columns user_id, movie_id, rating, timestamp
    
    movies.csv: contains movie metadata with columns movie_id, title, release_date
    
    users.csv: contains user demographics with columns user_id, age, gender, occupation
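To confirm that `index=False` produces a file that reads back unchanged, a round trip through an in-memory buffer can be compared against the original. This is a minimal sketch on a two-row sample copied from `u.data`; the in-memory `io.StringIO` buffer stands in for a file on disk and is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

# Two rows copied from the u.data preview.
sample = pd.DataFrame({
    "user_id": [196, 186],
    "movie_id": [242, 302],
    "rating": [3, 3],
    "timestamp": [881250949, 891717742],
})

# Write without the row index, then read the text back.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Raises an AssertionError if values, dtypes, or column names differ.
pd.testing.assert_frame_equal(sample, restored)

header = buffer.getvalue().splitlines()[0]
print(header)
```

With `index=True` (the default), the file would gain an extra unnamed first column, and the comparison above would fail.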

**Display the first 10 rows of each file.**

In [27]:
# ratings
import pandas as pd

ratings_csv = pd.read_csv('ratings.csv', encoding='utf-8')  # use the encoding you saved with
ratings_csv.head(10)


Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
5,298,474,4,884182806
6,115,265,2,881171488
7,253,465,5,891628467
8,305,451,3,886324817
9,6,86,3,883603013


In [28]:
# movies
movies_csv = pd.read_csv('movies.csv', encoding='utf-8')
movies_csv.head(10)


Unnamed: 0,movie_id,title,release_date
0,1,Toy Story (1995),01-Jan-1995
1,2,GoldenEye (1995),01-Jan-1995
2,3,Four Rooms (1995),01-Jan-1995
3,4,Get Shorty (1995),01-Jan-1995
4,5,Copycat (1995),01-Jan-1995
5,6,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,01-Jan-1995
6,7,Twelve Monkeys (1995),01-Jan-1995
7,8,Babe (1995),01-Jan-1995
8,9,Dead Man Walking (1995),01-Jan-1995
9,10,Richard III (1995),22-Jan-1996


In [29]:
# users
users_csv = pd.read_csv('users.csv', encoding='utf-8')
users_csv.head(10)


Unnamed: 0,user_id,age,gender,occupation
0,1,24,M,technician
1,2,53,F,other
2,3,23,M,writer
3,4,24,M,technician
4,5,33,F,other
5,6,42,M,executive
6,7,57,M,administrator
7,8,36,M,administrator
8,9,29,M,student
9,10,53,M,lawyer


**Code Explanation:**

Each block performs two operations:
    
pd.read_csv('filename.csv', encoding='utf-8') — Reads the CSV file into a DataFrame using UTF-8 encoding
    
.head(10) — Displays the first 10 rows of each DataFrame to verify data integrity

**Result Insights**
    
**Ratings Dataset:**

The preview shows 10 user-movie rating pairs with timestamps (Unix format)

Ratings range from 1-5 (typical 5-star scale)

Timestamps indicate when ratings were submitted

Ready for time-series analysis or recommendation algorithms

**Movies Dataset:**

Contains movie metadata (ID, title, release date)

All ten titles shown are labeled 1995, although Richard III (1995) lists a release date of 22-Jan-1996

Titles include year information for disambiguation

Can be joined with ratings using movie_id

**Users Dataset:**

Contains demographic information (age, gender, occupation)

Ages in the first 10 rows range from 23 to 57 years

Both genders (M/F) appear in the preview, with 8 of the 10 users shown being male

Diverse occupations (technician, student, lawyer, etc.)

Enables demographic-based analysis and filtering

### Data Cleaning and Exploration with Pandas  

After loading the dataset, it’s important to clean and explore the data to ensure consistency and accuracy. Below are key **pandas** functions for cleaning and understanding the dataset.

#### 1. Handle Missing Values  
- `df.dropna()` – Removes rows with missing values.  
- `df.fillna(value)` – Fills missing values with a specified value.  

#### 2. Remove Duplicates  
- `df.drop_duplicates()` – Drops duplicate rows from the dataset.  

#### 3. Handle Incorrect Data Types  
- `df.astype(dtype)` – Converts columns to the appropriate data type.  

#### 4. Filter Outliers (if applicable)  
- `df[df['column_name'] > threshold]` – Filters rows based on a condition.  

#### 5. Rename Columns (if needed)  
- `df.rename(columns={'old_name': 'new_name'})` – Renames columns for clarity.  

#### 6. Reset Index  
- `df.reset_index(drop=True, inplace=True)` – Resets the index after cleaning.  

### Data Exploration Functions  

To better understand the dataset, use these **pandas** functions:  

- `df.shape` – Returns the number of rows and columns in the dataset.  
- `df.nunique()` – Displays the number of unique values in each column.  
- `df['column_name'].unique()` – Returns unique values in a specific column.  

**Example Usage in Pandas:**  
```python
import pandas as pd

# Load dataset
df = pd.read_csv("your_file.csv")

# Drop missing values
df_cleaned = df.dropna()

# Remove duplicate rows
df_cleaned = df_cleaned.drop_duplicates()

# Convert the 'timestamp' column from Unix seconds to datetime
# (without unit='s', integer values would be interpreted as nanoseconds)
df_cleaned['timestamp'] = pd.to_datetime(df_cleaned['timestamp'], unit='s')

# Display dataset shape
print("Dataset shape:", df_cleaned.shape)

# Display number of unique values in each column
print("Unique values per column:\n", df_cleaned.nunique())

# Display unique movie IDs
print("Unique movie IDs:", df_cleaned['movie_id'].unique()[:10])  # Show first 10 unique movie IDs
```

**Note:** The functions above are among the most widely used **pandas** functions for data cleaning and exploration. Not all of them will be needed in the exercises below; use them as the dataset and the specific task require.

**Convert Timestamps into Readable dates.**

In [30]:
# Convert UNIX seconds to readable datetime (UTC)
ratings['timestamp'] = pd.to_datetime(ratings['timestamp'], unit='s')
ratings


Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,1997-12-04 15:55:49
1,186,302,3,1998-04-04 19:22:22
2,22,377,1,1997-11-07 07:18:36
3,244,51,2,1997-11-27 05:02:03
4,166,346,1,1998-02-02 05:33:16
...,...,...,...,...
99995,880,476,3,1997-11-22 05:10:44
99996,716,204,5,1997-11-17 19:39:03
99997,276,1090,1,1997-09-20 22:49:55
99998,13,225,2,1997-12-17 22:52:36


**Code Explanation**

pd.to_datetime() — Converts Unix timestamps (seconds since January 1, 1970) into readable datetime format

unit='s' — Specifies that the input values are in seconds

Assigns the converted values back to the timestamp column, replacing the original Unix format

`ratings` on the last line of the cell displays a truncated preview (first and last rows) of the DataFrame with the converted timestamps

**Result Insights**

**Before Conversion:**

timestamp: 881250949 (Unix format - difficult to interpret)

**After Conversion:**

timestamp: 1997-12-04 15:55:49 (human-readable datetime; the value is in UTC, but pandas stores it without timezone information unless `utc=True` is passed)
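The difference between a timezone-naive and a timezone-aware conversion can be seen on the first two timestamps of `u.data`. This is a small sketch; the variable names `raw`, `naive`, and `aware` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# First two Unix timestamps from u.data.
raw = pd.Series([881250949, 891717742])

# Same call as in the notebook: the result carries no timezone information.
naive = pd.to_datetime(raw, unit="s")

# Passing utc=True attaches the UTC timezone explicitly.
aware = pd.to_datetime(raw, unit="s", utc=True)

print(naive.iloc[0])
print(aware.iloc[0])
print(naive.dt.tz, aware.dt.tz)
```

Both series represent the same instants. The aware version is the safer choice if the timestamps are later converted to a local timezone or compared with timezone-aware data.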

**Check for Missing Values**

In [31]:
# ratings
print('=== ratings missing counts ===')
print(ratings.isna().sum())
print('\n=== ratings missing % ===')
print((ratings.isna().mean() * 100).round(2))
print('\nrows with any NaN:', int(ratings.isna().any(axis=1).sum()))


=== ratings missing counts ===
user_id      0
movie_id     0
rating       0
timestamp    0
dtype: int64

=== ratings missing % ===
user_id      0.0
movie_id     0.0
rating       0.0
timestamp    0.0
dtype: float64

rows with any NaN: 0


In [32]:
# movies
print('=== movies missing counts ===')
print(movies.isna().sum())
print('\n=== movies missing % ===')
print((movies.isna().mean() * 100).round(2))
print('\nrows with any NaN:', int(movies.isna().any(axis=1).sum()))


=== movies missing counts ===
movie_id        0
title           0
release_date    1
dtype: int64

=== movies missing % ===
movie_id        0.00
title           0.00
release_date    0.06
dtype: float64

rows with any NaN: 1


In [33]:
# users
print('=== users missing counts ===')
print(users.isna().sum())
print('\n=== users missing % ===')
print((users.isna().mean() * 100).round(2))
print('\nrows with any NaN:', int(users.isna().any(axis=1).sum()))


=== users missing counts ===
user_id       0
age           0
gender        0
occupation    0
dtype: int64

=== users missing % ===
user_id       0.0
age           0.0
gender        0.0
occupation    0.0
dtype: float64

rows with any NaN: 0


**Code Explanation**

The same three checks are run on each DataFrame (`ratings`, `movies`, and `users`) to identify missing data.
`isna().sum()` counts how many missing (NaN) values exist in each column.
`isna().mean() * 100` gives the percentage of missing values: the mean of the boolean mask is the fraction of missing entries, and `round(2)` rounds it to two decimal places for readability.
Finally, `isna().any(axis=1).sum()` counts how many rows contain at least one missing value.
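The three checks are easier to follow on a table small enough to verify by hand. The sketch below uses a made-up four-row DataFrame (`toy` is not part of the MovieLens data) with three missing cells spread over two rows.

```python
import pandas as pd

# Hypothetical data: row 1 lacks a release date, row 2 lacks both title and release date.
toy = pd.DataFrame({
    "movie_id": [1, 2, 3, 4],
    "title": ["A", "B", None, "D"],
    "release_date": ["01-Jan-1995", None, None, "22-Jan-1996"],
})

missing_counts = toy.isna().sum()                   # missing cells per column
missing_pct = (toy.isna().mean() * 100).round(2)    # share of missing cells per column
rows_with_nan = int(toy.isna().any(axis=1).sum())   # rows with at least one missing cell

print(missing_counts)
print(missing_pct)
print("rows with any NaN:", rows_with_nan)
```

Note that the per-column counts add up to three missing cells, while only two rows are affected, which is why the row-level check is reported separately.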

**Result Insights**

**Ratings Dataset:**

    Zero missing values in all columns
    
    All 4 columns (user_id, movie_id, rating, timestamp) are 100% complete
    
    0% missing rate across the entire dataset
    
    No data cleaning required for ratings
    
    Ready for immediate analysis and model development

**Movies Dataset:**

    1 missing value found in release_date column
    
    Movie_id and title columns are completely populated
    
    Missing percentage is only 0.06% (negligible)
    
    1 row out of 1,682 contains missing data
    
    Can be removed with dropna() to maintain 99.94% of data
    
    Minimal impact on downstream analysis

**Users Dataset:**

    Zero missing values in all columns
    
    All 4 columns (user_id, age, gender, occupation) are fully populated
    
    0% missing rate across the entire dataset
    
    Complete demographic information for all users
    
    No cleaning needed and ready for use

**Print the total number of users, movies, and ratings.**

In [34]:
print(f"Total Users: {len(users)}")
print(f"Total Movies: {len(movies)}")
print(f"Total Ratings: {len(ratings)}")



Total Users: 943
Total Movies: 1682
Total Ratings: 100000


**Code Explanation:**
`len()` returns the number of rows in each DataFrame, so the three f-strings print the number of users, movies, and ratings.

**Result Explanation:**
The dataset contains 943 users, 1,682 movies, and 100,000 ratings, matching the row counts seen when the files were loaded.

## **Part 2: Collaborative Filtering-Based Recommendation**

### **Create a User-Item Matrix**

#### Instructions for Creating a User-Movie Rating Matrix

In this exercise, you will create a user-movie rating matrix using **pandas**. This matrix will represent the ratings that users have given to different movies.

1. **Dataset Overview**:  
   The dataset has already been loaded. It includes the following key columns:
   - `user_id`: The ID of the user.
   - `movie_id`: The ID of the movie.
   - `rating`: The rating the user gave to the movie.

2. **Create the User-Movie Rating Matrix**:  
   Use the **`pivot()`** function in **pandas** to reshape the data. Your goal is to create a matrix where:
   - Each **row** represents a **user**.
   - Each **column** represents a **movie**.
   - Each **cell** contains the **rating** that the user has given to the movie.

   Specify the following parameters for the `pivot()` function:
   - **`index`**: The `user_id` column (this will define the rows).
   - **`columns`**: The `movie_id` column (this will define the columns).
   - **`values`**: The `rating` column (this will fill the matrix with ratings).

3. **Inspect the Matrix**:  
   After creating the matrix, examine the first few rows of the resulting matrix to ensure it has been constructed correctly.

4. **Handle Missing Values**:  
   It's likely that some users have not rated every movie, resulting in `NaN` values in the matrix. You will need to handle these missing values. Consider the following options:
   - **Fill with 0**: If you wish to represent missing ratings as zeros (indicating no rating).
   - **Fill with the average rating**: Alternatively, replace missing values with the average rating for each movie.
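The two fill strategies listed in step 4 can be compared on a small matrix. This is a sketch under the assumption that the matrix has users as rows and movies as columns, as described above; `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 3 users x 3 movies matrix with one missing rating per user.
toy = pd.DataFrame(
    [[5.0, np.nan, 3.0],
     [4.0, 2.0, np.nan],
     [np.nan, 4.0, 1.0]],
    index=pd.Index([1, 2, 3], name="user_id"),
    columns=pd.Index([10, 20, 30], name="movie_id"),
)

# Option 1: treat a missing rating as 0.
zero_filled = toy.fillna(0)

# Option 2: replace a missing rating with that movie's mean rating.
# mean(axis=0) skips NaN by default; fillna with a Series matches on column labels.
movie_means = toy.mean(axis=0)
mean_filled = toy.fillna(movie_means)

print(zero_filled)
print(mean_filled)
```

Zero-filling makes an unrated movie look like a very low rating, whereas mean-filling makes it look like an average one. Which distortion is more acceptable depends on how the matrix is used afterwards.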

**Create the user-movie rating matrix using the `pivot()` function.**

In [35]:
# Create the user–movie rating matrix
user_movie_matrix = ratings.pivot(
    index='user_id',
    columns='movie_id',
    values='rating'
)

In [36]:
#Handling Missing values

user_movie_zero = user_movie_matrix.fillna(0.0)

**Code Explanation:**

    ratings.pivot() — Reshapes the ratings DataFrame from long format to wide format (matrix format)
    
    index='user_id' — Sets user IDs as rows in the matrix
    
    columns='movie_id' — Sets movie IDs as columns in the matrix
    
    values='rating' — Fills each cell with the corresponding rating value
    
    fillna(0.0) — Replaces all missing values (NaN) with 0.0, indicating unrated movies
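`pivot()` only works when each (user, movie) pair appears at most once. The fact that it succeeds on `ratings` suggests that this holds for this dataset. The sketch below, on made-up data, shows what happens otherwise and how `pivot_table()` could be used as a fallback; averaging duplicates is one possible choice, not something the assignment prescribes.

```python
import pandas as pd

# Hypothetical long-format ratings with unique (user_id, movie_id) pairs.
long_df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 10],
    "rating": [5, 3, 4],
})
wide = long_df.pivot(index="user_id", columns="movie_id", values="rating")
print(wide)

# Add a second rating by user 1 for movie 10.
extra = pd.DataFrame({"user_id": [1], "movie_id": [10], "rating": [1]})
dupes = pd.concat([long_df, extra], ignore_index=True)

# pivot() refuses to reshape when a cell would receive two values.
try:
    dupes.pivot(index="user_id", columns="movie_id", values="rating")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates duplicates instead (here: their mean).
averaged = dupes.pivot_table(
    index="user_id", columns="movie_id", values="rating", aggfunc="mean"
)
print(pivot_failed)
print(averaged)
```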

**Matrix structure created:**

    Each row represents one user
    
    Each column represents one movie
    
    Each cell contains the rating (1-5) or 0 (if user hasn't rated that movie)
    
    Dimensions: 943 users × 1,682 movies (sparse matrix)

**Why fillna(0.0) is used**

    Represents missing ratings as zero
    
    Indicates "user has not rated this movie"
    
    Simple to apply, and cosine similarity cannot be computed on NaN values
    
    The zero-filled DataFrame is stored densely (every cell holds a value), so zero-filling does not save memory by itself

**Result Insights**
  
**Matrix Structure:**

    943 rows (users) × 1,682 columns (movies)
    
    Sparse matrix with many missing values (users don't rate all movies)
    
    Each cell contains a rating (1-5) or 0 (unrated)

**Missing Values:**

    Original matrix has many NaN values due to sparsity
    
    Filling with 0 explicitly marks unrated movies
    
    Zero-filling simplifies calculations for collaborative filtering algorithms
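How sparse the matrix is follows directly from the counts reported earlier (943 users, 1,682 movies, 100,000 ratings). A short calculation, assuming each rating fills exactly one cell:

```python
# Counts reported earlier in the notebook.
n_users, n_movies, n_ratings = 943, 1682, 100_000

total_cells = n_users * n_movies
density = n_ratings / total_cells

print(f"{total_cells:,} cells, {density:.2%} filled, {1 - density:.2%} empty")
```

Roughly 6% of the cells hold a real rating, so after zero-filling the large majority of values in the matrix are placeholders rather than observed ratings.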

**Why This Works:**

    Organized format ready for similarity calculations between users
    
    Enables recommendation generation based on user preferences
    
    0 represents "no interaction" instead of missing data
    
    Suitable for algorithms that require a fully populated (dense) matrix

**Display the matrix to verify the transformation.**

In [37]:
print('Matrix shape:', user_movie_zero.shape)    
display(user_movie_zero.head(5))                

Matrix shape: (943, 1682)


user_id \ movie_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


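**Code Explanation:**
The code prints the matrix dimensions using the `.shape` attribute and displays the first five rows with `head(5)`.

**Result Insight**
The matrix has 943 rows (users) and 1,682 columns (movies). Only the first 5 users are displayed, and the column view is truncated to the first and last 10 movies.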
 

### **User-Based Collaborative Filtering Recommender System**

#### **Objective**
In this task, you will implement a **user-based collaborative filtering** movie recommendation system using the **MovieLens dataset**. The goal is to recommend movies to a user based on the preferences of similar users.

##### **Step 1: Import Required Libraries**
Before starting, ensure you have the necessary libraries installed. Use the following imports:

```python
import pandas as pd  # For handling data
import numpy as np   # For numerical computations
from sklearn.metrics.pairwise import cosine_similarity  # For computing user similarity
```

##### **Step 2: Compute User-User Similarity**
- We will use **cosine similarity** to measure how similar each pair of users is based on their movie ratings.
- Since `cosine_similarity` does not handle missing values (NaN), replace them with `0` before computation.

##### **Instructions:**
1. Fill missing values with `0` using `.fillna(0)`.
2. Compute similarity using `cosine_similarity()`.
3. Convert the result into a **Pandas DataFrame**, with users as both row and column labels.

##### **Hint:**  
You can achieve this using the following approach:

```python
user_similarity = cosine_similarity(user_movie_matrix.fillna(0))
user_sim_df = pd.DataFrame(user_similarity, index=user_movie_matrix.index, columns=user_movie_matrix.index)
```
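To see what the similarity call computes, the same quantity can be written out with NumPy: each user's rating vector is scaled to unit length, and the dot products of those unit vectors are the cosine similarities. This is an illustrative sketch; `cosine_sim_matrix` is a hypothetical helper, not scikit-learn's implementation, although on zero-filled input the two should agree up to floating-point error.

```python
import numpy as np


def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zeros instead of dividing by zero
    unit = X / norms
    return unit @ unit.T


# Hypothetical zero-filled ratings: 3 users x 3 movies.
ratings_toy = np.array([
    [5, 3, 0],  # user A
    [5, 3, 0],  # user B: identical ratings to A
    [0, 0, 4],  # user C: rated only a movie that A has not rated
])

sim = cosine_sim_matrix(ratings_toy)
print(np.round(sim, 3))
```

Users A and B get similarity 1, while A and C get 0 because they share no rated movie. With zero-filled data, the similarity is driven largely by which movies two users have both rated.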

##### **Step 3: Implement the Recommendation Function**
Now, implement the function `recommend_movies_for_user(user_id, num=5)` to recommend movies for a given user.

##### **Function Inputs:**
- `user_id`: The target user for whom we need recommendations.
- `num`: The number of movies to recommend (default is 5).

##### **Function Steps:**
1. Find **similar users**:
   - Retrieve the similarity scores for the given `user_id`.
   - Sort them in **descending** order (highest similarity first).
   - Exclude the user themselves.
   
2. Get the **movie ratings** from these similar users.

3. Compute the **average rating** for each movie based on these users' preferences.

4. Sort the movies in **descending order** based on the computed average ratings.

5. Retrieve the **top `num` recommended movies**.

6. Map **movie IDs** to their **titles** using the `movies` DataFrame.

7. Return the results as a **Pandas DataFrame** with rankings.
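Steps 1 to 5 can be traced on a toy example before working with the full matrix. The sketch below is not the required solution: `top_unseen`, the toy ratings, and the hand-written similarity values are all hypothetical, and the optional `k` parameter (keep only the `k` most similar users) is an assumption added to show how much the choice of neighbors matters.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: u1 has not rated m3 and m4.
toy = pd.DataFrame(
    [[5.0, 4.0, np.nan, np.nan],
     [5.0, 4.0, 4.0, 2.0],
     [1.0, 2.0, 1.0, 5.0]],
    index=pd.Index(["u1", "u2", "u3"], name="user_id"),
    columns=pd.Index(["m1", "m2", "m3", "m4"], name="movie_id"),
)

# Hand-written similarity values, for illustration only.
toy_sim = pd.DataFrame(
    [[1.0, 0.9, 0.3],
     [0.9, 1.0, 0.4],
     [0.3, 0.4, 1.0]],
    index=toy.index,
    columns=toy.index,
)


def top_unseen(user_id, matrix, sim, num=2, k=None):
    """Average the neighbors' ratings for movies the user has not rated."""
    # Step 1: similar users, most similar first, excluding the user themselves.
    neighbours = sim.loc[user_id].drop(index=user_id).sort_values(ascending=False)
    if k is not None:
        neighbours = neighbours.head(k)
    # Steps 2-3: neighbors' ratings for unseen movies, averaged per movie.
    unseen = matrix.columns[matrix.loc[user_id].isna().to_numpy()]
    avg = matrix.loc[neighbours.index, unseen].mean(axis=0)
    # Steps 4-5: sort descending and keep the top `num`.
    return avg.dropna().sort_values(ascending=False).head(num)


print(top_unseen("u1", toy, toy_sim))        # all other users count equally
print(top_unseen("u1", toy, toy_sim, k=1))   # only the single most similar user
```

When every other user is included with equal weight, the similarity ordering has no effect on the averages. Restricting the average to the nearest neighbors, as in the second call, is what lets similarity influence the ranking.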

##### **Step 4: Return the Final Recommendation List**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

##### **Hint:** Your final DataFrame should be created like this:
```python
result_df = pd.DataFrame({
    'Ranking': range(1, num+1),
    'Movie Name': movie_names     
})
result_df.set_index('Ranking', inplace=True)
```

#### **Example: User-Based Collaborative Filtering**
```python
recommend_movies_for_user(10, num = 5)
```
**Output:**
```
| Ranking | Movie Name                     |
|---------|--------------------------------|
| 1       | In the Company of Men (1997)   |
| 2       | Misérables, Les (1995)         |
| 3       | Thin Blue Line, The (1988)     |
| 4       | Braindead (1992)               |
| 5       | Boys, Les (1997)               |
```


In [46]:
#Code your function here

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# We already have:
# - ratings: columns [user_id, movie_id, rating]
# - movies:  columns [movie_id, title]
# - user_movie_matrix from pivot(index='user_id', columns='movie_id', values='rating')

# Use a zero-filled matrix for similarity
user_matrix = user_movie_matrix.fillna(0.0).astype(float)

# Cosine similarity between users
user_sim = pd.DataFrame(
    cosine_similarity(user_matrix),
    index=user_matrix.index,
    columns=user_matrix.index
)


def recommend_movies_for_user(user_id, num=5, show_similar_users=True):
    """
    User-based collaborative filtering (UNWEIGHTED mean version):
    - Rank all other users by cosine similarity to user_id (exclude self).
    - For movies the target user hasn't rated, compute the unweighted mean
      rating across ALL other users. Similarity only affects the printed
      neighbor list, not the scores.
    - Return top 'num' movie titles in a ranked DataFrame.
    - Optionally print the 5 most similar users.
    """
    if user_id not in user_sim.index:
        raise ValueError(f'user_id {user_id} not found.')

    # Find similar users (exclude self)
    similar_users = (
        user_sim.loc[user_id]
        .drop(index=user_id)
        .sort_values(ascending=False)
    )

    # Print top similar users (for comparison)
    if show_similar_users:
        print(f"\nTop 5 similar users to user {user_id} (Unweighted system):")
        print(similar_users.head(5))

    # Get ratings from every other user (no top-k cutoff is applied)
    neighbor_ids = similar_users.index
    neighbor_ratings = user_movie_matrix.loc[neighbor_ids]

    # Identify candidate movies (not rated by the target user)
    target_rated_mask = user_movie_matrix.loc[user_id].notna()
    candidate_movies = target_rated_mask[~target_rated_mask].index

    if len(candidate_movies) == 0:
        return pd.DataFrame({'Ranking': [], 'Movie Name': []}).set_index('Ranking')

    # Compute unweighted average rating per movie (across neighbors)
    candidate_block = neighbor_ratings[candidate_movies]
    avg_scores = candidate_block.mean(axis=0, skipna=True)

    # Sort by average score descending
    ranked_movie_ids = avg_scores.sort_values(ascending=False).index[:num]

    # Map movie IDs to titles
    movie_lookup = movies.set_index('movie_id')['title']
    movie_names = movie_lookup.reindex(ranked_movie_ids).tolist()

  
    # Final DataFrame (clean version)
    result_df = pd.DataFrame({
        'Ranking': range(1, len(movie_names) + 1),
        'Movie Name': movie_names
    }).set_index('Ranking')
    

    return result_df
# Alternative plain-text output:
# recommendations_unweighted = recommend_movies_for_user(10, num=5)
# print("\nUnweighted Recommendation Results:")
# print(recommendations_unweighted)


res = recommend_movies_for_user(10, num=5)

# Make Ranking a normal column and hide the row index
res2 = res.reset_index()
display(res2.style.hide(axis='index'))


Top 5 similar users to user 10 (Unweighted system):
user_id
474    0.556142
6      0.551713
234    0.542308
308    0.538171
537    0.533171
Name: 10, dtype: float64


Ranking,Movie Name
1,"Saint of Fort Washington, The (1993)"
2,They Made Me a Criminal (1939)
3,Someone Else's America (1995)
4,Entertaining Angels: The Dorothy Day Story (1996)
5,Santa with Muscles (1996)


**Code Explanation**

**Setup:**

- `cosine_similarity()` computes a similarity score for every pair of users from their zero-filled rating vectors.
- `user_sim` is a DataFrame in which each row and column is a user and each cell holds the similarity between that pair.

**Function Logic:**

- **Find similar users:** extract the similarity scores for the target user, drop the user themselves, and sort from most to least similar.
- **Get neighbor ratings:** retrieve the ratings of all other users (the sorted list is not truncated).
- **Identify candidates:** find movies the target user hasn't rated yet; these are the candidates for recommendation.
- **Compute average ratings:** calculate the mean rating of each candidate movie across all other users, with `skipna=True` so that missing ratings are ignored.
- **Rank movies:** sort candidate movies by average rating in descending order and take the top `num`.
- **Map to titles:** convert movie IDs to movie names using the `movies` DataFrame.
- **Create output:** build a ranked DataFrame with ranking numbers and movie titles.

**Key Parameters:**

- `user_id`: target user for recommendations
- `num=5`: number of recommendations to return
- `show_similar_users=True`: whether to print the five most similar users

**Result Insights**
  
**Recommendation Process:**

For user 10, the five most similar users have cosine similarities between 0.533 and 0.556. These five are printed for reference only: the function averages each unrated movie's rating across all other users, not just the top five, so the similarity scores do not influence the ranking.

**Results Generated:**

The function returned 5 recommended movies ranked by average rating:

- Rank 1: Saint of Fort Washington, The (1993) (highest average rating)
- Rank 2: They Made Me a Criminal (1939)
- Rank 3: Someone Else's America (1995)
- Rank 4: Entertaining Angels: The Dorothy Day Story (1996)
- Rank 5: Santa with Muscles (1996)

Because the score is a plain mean with no minimum number of ratings, a movie rated 5 by a single user can outrank a widely rated favorite. That likely explains why this list is dominated by obscure titles and differs from the assignment's example output.
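
One possible refinement, sketched here under stated assumptions rather than taken from the notebook, is to restrict scoring to the `k` most similar users, weight each rating by similarity, and require a minimum number of raters. The toy ratings, the helper name `recommend_top_k`, and the `k` and `min_raters` parameters are all hypothetical.

```python
from math import sqrt

# Toy ratings: {user_id: {movie_id: rating}}. All values are made up.
toy_ratings = {
    1: {10: 5, 11: 4},
    2: {10: 5, 11: 4, 20: 5, 21: 2},
    3: {10: 4, 11: 5, 20: 4},
    4: {10: 1, 11: 2, 21: 5, 22: 5},
}

def cosine(a, b):
    """Cosine similarity of two sparse rating dicts (missing ratings count as 0)."""
    dot = sum(a[m] * b[m] for m in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recommend_top_k(user_id, k=2, min_raters=2, num=5):
    """Similarity-weighted mean rating over the k nearest neighbours."""
    target = toy_ratings[user_id]
    sims = {u: cosine(target, r) for u, r in toy_ratings.items() if u != user_id}
    neighbours = sorted(sims, key=sims.get, reverse=True)[:k]

    totals, weights, raters = {}, {}, {}
    for u in neighbours:
        for movie, rating in toy_ratings[u].items():
            if movie in target:
                continue  # skip movies the target user already rated
            totals[movie] = totals.get(movie, 0.0) + sims[u] * rating
            weights[movie] = weights.get(movie, 0.0) + sims[u]
            raters[movie] = raters.get(movie, 0) + 1

    # Keep only movies backed by enough neighbours, then rank by weighted mean
    scores = {
        m: totals[m] / weights[m]
        for m in totals
        if raters[m] >= min_raters and weights[m] > 0
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:num]

print(recommend_top_k(1))                       # only movie 20 has two close raters
print(recommend_top_k(1, k=3, min_raters=1))    # movie 22 now leads on a single rating
```

In the second call, movie 22 rises to the top on the strength of one rating from the least similar user, which is the same effect that a plain mean over all users produces.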

### **Item-Based Collaborative Filtering Recommender System**

#### **Objective**
In this task, you will implement an **item-based collaborative filtering** recommendation system using the **Movie dataset**. The goal is to recommend movies similar to a given movie based on user rating patterns.

#### **Step 1: Import Required Libraries**
We covered this in the previous task, but the imports are repeated here for emphasis.

Before starting, ensure you have the necessary libraries installed. Use the following imports:

```python
import pandas as pd  # For handling data
import numpy as np   # For numerical computations
from sklearn.metrics.pairwise import cosine_similarity  # For computing item similarity
```

#### **Step 2: Compute Item-Item Similarity**
- We will use **cosine similarity** to measure how similar each pair of movies is based on their user ratings.
- Since `cosine_similarity` does not handle missing values (NaN), replace them with `0` before computation.
- Unlike user-based filtering, we need to **transpose** (`.T`) the `user_movie_matrix` because we want similarity between movies (columns) instead of users (rows).

##### **Instructions:**
1. Transpose the user-movie matrix using `.T` to make movies the rows.
2. Fill missing values with `0` using `.fillna(0)`.
3. Compute similarity using `cosine_similarity()`.
4. Convert the result into a **Pandas DataFrame**, with movies as both row and column labels.

##### **Hint:**  
You can achieve this using the following approach:

```python
item_similarity = cosine_similarity(user_movie_matrix.T.fillna(0))
item_sim_df = pd.DataFrame(item_similarity, index=user_movie_matrix.columns, columns=user_movie_matrix.columns)
```

#### **Step 3: Implement the Recommendation Function**
Now, implement the function `recommend_movies(movie_name, num=5)` to recommend movies similar to a given movie.

##### **Function Inputs:**
- `movie_name`: The target movie for which we need recommendations.
- `num`: The number of similar movies to recommend (default is 5).

##### **Function Steps:**
1. Find the **movie_id** corresponding to the given `movie_name` in the `movies` DataFrame.
2. If the movie is not found, return an appropriate message.
3. Extract the **similarity scores** for this movie from `item_sim_df`.
4. Sort the movies in **descending order** based on similarity (excluding the movie itself).
5. Retrieve the **top `num` similar movies**.
6. Map **movie IDs** to their **titles** using the `movies` DataFrame.
7. Return the results as a **Pandas DataFrame** with rankings.

#### **Step 4: Return the Final Recommendation List**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

##### **Hint:** Your final DataFrame should be created like this:
```python
result_df = pd.DataFrame({
    'Ranking': range(1, num+1),
    'Movie Name': movie_names
})
result_df.set_index('Ranking', inplace=True)
```

#### **Example: Item-Based Collaborative Filtering**
```python
recommend_movies("Jurassic Park (1993)", num=5)
```
**Output:**
```
| Ranking | Movie Name                               |
|---------|------------------------------------------|
| 1       | Top Gun (1986)                           |
| 2       | Empire Strikes Back, The (1980)          |
| 3       | Raiders of the Lost Ark (1981)           |
| 4       | Indiana Jones and the Last Crusade (1989)|
| 5       | Speed (1994)                             |
```

In [45]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Assumes you already have:
# - ratings DataFrame with columns [user_id, movie_id, rating]
# - movies  DataFrame with columns [movie_id, title]
# - user_movie_matrix = ratings.pivot(index='user_id', columns='movie_id', values='rating')

# Fill NaNs with 0 for similarity computation (items are columns)
item_matrix = user_movie_matrix.fillna(0.0).astype(float).T   # shape: movies x users

# Cosine similarity between items (movies)
item_similarity = cosine_similarity(item_matrix)              # square matrix (n_movies x n_movies)

# As a DataFrame with movie_ids as both index and columns
item_sim_df = pd.DataFrame(
    item_similarity,
    index=item_matrix.index,    # movie_id as index
    columns=item_matrix.index   # movie_id as columns
)


def recommend_movies(movie_name, num=5):
    """
    Item-based collaborative filtering:
    - Find the movie_id for the given title.
    - Get similarity scores for that movie against all movies.
    - Sort descending (exclude itself), take top 'num'.
    - Map movie_ids to titles and return a ranked DataFrame.
    """
    # 1) Map title -> movie_id
    match = movies.loc[movies['title'].str.lower() == movie_name.lower(), 'movie_id']
    if match.empty:
        return pd.DataFrame({'Ranking': [], 'Movie Name': []}).set_index('Ranking')

    movie_id = match.iloc[0]

    if movie_id not in item_sim_df.index:
        return pd.DataFrame({'Ranking': [], 'Movie Name': []}).set_index('Ranking')

    # 2) Similarity scores for this movie
    sims = item_sim_df.loc[movie_id].drop(index=movie_id)  # exclude itself

    # 3) Sort by similarity descending; deterministic tie-break by movie_id
    sims = sims.sort_values(ascending=False, kind='mergesort')

    # 4) Pick top N similar movie_ids
    top_ids = sims.index[:num]

    # 5) Map ids -> titles
    movie_lookup = movies.set_index('movie_id')['title']
    movie_names = movie_lookup.reindex(top_ids).tolist()

    # 6) Final ranked DataFrame
    result_df = pd.DataFrame({
        'Ranking': range(1, len(movie_names) + 1),
        'Movie Name': movie_names
    }).set_index('Ranking')

    return result_df
    
def recommend_movies_pretty(movie_name, num=5):
    # Get recommendations
    df = recommend_movies(movie_name, num)  # your existing function

    # Reset index so 'Ranking' becomes a column
    df_reset = df.reset_index()

    # Print a neat table
    print("{:<8} {}".format("Ranking", "Movie Name"))
    print("-"*50)
    for _, row in df_reset.iterrows():
        print("{:<8} {}".format(row['Ranking'], row['Movie Name']))

    return df  # Return original DataFrame if needed

res = recommend_movies("Jurassic Park (1993)", num=5)

# Make Ranking a normal column and hide the row index
res2 = res.reset_index()
display(res2.style.hide(axis='index'))


Ranking,Movie Name
1,Top Gun (1986)
2,Speed (1994)
3,Raiders of the Lost Ark (1981)
4,"Empire Strikes Back, The (1980)"
5,Indiana Jones and the Last Crusade (1989)


**Code Explanation**

**Setup:**

- `item_matrix = user_movie_matrix.fillna(0.0).astype(float).T`: fills missing ratings with 0 and transposes the matrix so movies are rows and users are columns
- `cosine_similarity(item_matrix)`: computes the similarity between every pair of movies from their user ratings
- `item_sim_df`: DataFrame storing the similarity scores, with movie IDs as both index and columns

**Function Logic:**

- Finds the `movie_id` by matching the title (case-insensitive)
- Extracts the similarity scores for the target movie from `item_sim_df`
- Removes the movie itself and sorts the remaining movies by similarity in descending order
- Selects the top `num` similar movies
- Maps movie IDs back to titles
- Returns a ranked DataFrame with `Ranking` as the index; an unknown title returns an empty DataFrame


**Result Insights**

**Recommendation Process:**

For "Jurassic Park (1993)", the system ranks every other movie by the cosine similarity between its rating vector and Jurassic Park's.

**Results Generated:**

Top 5 similar movies ranked by similarity:

- Rank 1: Top Gun (1986) (highest similarity)
- Rank 2: Speed (1994)
- Rank 3: Raiders of the Lost Ark (1981)
- Rank 4: Empire Strikes Back, The (1980)
- Rank 5: Indiana Jones and the Last Crusade (1989)

Item-based similarity finds movies whose rating vectors point in the same direction. Because unrated entries are filled with 0, two movies score highly mainly when the same users rated both, which fits the cluster of widely seen action and adventure titles above. The similarity matrix is computed once and reused, so each call to `recommend_movies` is only a row lookup and a sort.
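
To make the effect of zero-filling concrete, here is a small hand-rolled cosine computation on made-up movie columns. It is an illustration only; the vectors and the `cosine` helper are assumptions, and the notebook itself uses scikit-learn's `cosine_similarity`.

```python
from math import sqrt

def cosine(x, y):
    """Cosine similarity of two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

# Columns of a toy user-movie matrix: one entry per user, 0 = not rated.
movie_a = [5, 4, 0, 0]
movie_b = [5, 4, 0, 0]   # same users, same ratings
movie_c = [1, 1, 0, 0]   # same users, much lower ratings
movie_d = [0, 0, 5, 4]   # disjoint audience

sim_ab = round(cosine(movie_a, movie_b), 3)
sim_ac = round(cosine(movie_a, movie_c), 3)
sim_ad = round(cosine(movie_a, movie_d), 3)
print(sim_ab, sim_ac, sim_ad)   # 1.0 0.994 0.0
```

Movies A and C come out almost identical even though one is loved and the other disliked, because they share the same raters. Scores from zero-filled cosine therefore say more about a shared audience than about agreement in rating level.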

## **Part 3: Graph-Based Recommender (Pixie-Inspired Algorithm)**

### **Adjacency List**

#### **Objective**
In this task, you will preprocess the Movie dataset and construct a **graph representation** where:
- **Users** are connected to the movies they have rated.
- **Movies** are connected to users who have rated them.
  
This graph structure will help in exploring **user-movie relationships** for recommendations.

#### **Step 1: Merge Ratings with Movie Titles**
Since we have **movie IDs** in the ratings dataset but need human-readable movie titles, we will:
1. Merge the `ratings` DataFrame with the `movies` DataFrame using the `'movie_id'` column.
2. This allows each rating to be associated with a **movie title**.

#### **Hint:**
Use the following Pandas operation to merge:
```python
ratings = ratings.merge(movies, on='movie_id')
```


#### **Step 2: Aggregate Ratings**
Since a user may have rated the same movie more than once, we:
1. Group the dataset by `['user_id', 'movie_id', 'title']`.
2. Compute the **mean rating** for each movie by each user.
3. Reset the index to ensure we maintain a clean DataFrame structure.

#### **Hint:**  
Use `groupby()` and `mean()` as follows:
```python
ratings = ratings.groupby(['user_id', 'movie_id', 'title'])['rating'].mean().reset_index()
```

#### **Step 3: Normalize Ratings**
Since different users have different rating biases, we normalize ratings by:
1. **Computing each user's mean rating**.
2. **Subtracting the mean rating** from each individual rating.

#### **Instructions:**
- Use `groupby('user_id')` to group ratings by users.
- Apply `transform(lambda x: x - x.mean())` to adjust ratings.

#### **Hint:**  
Normalize ratings using:
```python
ratings['rating'] = ratings.groupby('user_id')['rating'].transform(lambda x: x - x.mean())
```
This ensures each user’s ratings are centered around zero, making similarity calculations fairer.
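
As a worked example of the centering arithmetic, the sketch below applies the same subtraction in plain Python to two made-up users. The user names and ratings are illustrative and not taken from the dataset.

```python
from statistics import mean

# Hypothetical raters with the same taste pattern on different scales.
toy_ratings = {
    "generous_user": [5, 5, 4, 4],
    "strict_user": [3, 3, 2, 2],
}

# Same operation as groupby('user_id')['rating'].transform(lambda x: x - x.mean())
centered = {
    user: [r - mean(user_ratings) for r in user_ratings]
    for user, user_ratings in toy_ratings.items()
}

for user, values in centered.items():
    print(user, values)   # both users print [0.5, 0.5, -0.5, -0.5]
```

After centering, the two users look identical, which is the point: the difference in their raw scores was a difference in scale, not in preference.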

#### **Step 4: Construct the Graph Representation**
We represent the user-movie interactions as an **undirected graph** using an **adjacency list**:
- Each **user** is a node connected to movies they rated.
- Each **movie** is a node connected to users who rated it.

#### **Graph Construction Steps:**
1. Initialize an empty dictionary `graph = {}`.
2. Iterate through the **ratings dataset**.
3. For each `user_id` and `movie_id` pair:
   - Add the movie to the user’s set of connections.
   - Add the user to the movie’s set of connections.

#### **Hint:**  
The following code builds the graph:

```python
graph = {}
for _, row in ratings.iterrows():
    user, movie = row['user_id'], row['movie_id']
    if user not in graph:
        graph[user] = set()
    if movie not in graph:
        graph[movie] = set()
    graph[user].add(movie)
    graph[movie].add(user)
```

This results in a **bipartite graph**, where:
- **Users** are connected to multiple movies.
- **Movies** are connected to multiple users.

#### **Step 5: Understanding the Graph**
- **Nodes** in the graph represent **users and movies**.
- **Edges** exist between a user and a movie **if the user has rated the movie**.
- This structure allows us to find **users with similar movie tastes** and **movies frequently watched together**.

#### **Exploring the Graph**
- **Find a user’s rated movies:**  
  ```python
  user_id = 1
  print(graph[user_id])  # Movies rated by user 1
  ```

- **Find users who rated a movie:**  
  ```python
  movie_id = 50
  print(graph[movie_id])  # Users who rated movie 50
  ```
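
One caveat worth checking against your data: in MovieLens, `user_id` and `movie_id` both start at 1, so a dictionary keyed by the raw integers treats user 1 and movie 1 as the same node. The sketch below shows the effect on made-up pairs and one possible workaround, tagging each key with its node type. The tuple keys are an assumption for illustration, not part of the assignment's hint.

```python
from collections import defaultdict

# Toy (user_id, movie_id) pairs. The ID 1 is used for both a user and a movie.
pairs = [(1, 2), (1, 3), (7, 1)]

plain = defaultdict(set)    # raw integer keys, as in the hint
tagged = defaultdict(set)   # keys tagged with the node type

for user, movie in pairs:
    plain[user].add(movie)
    plain[movie].add(user)
    tagged[("user", user)].add(("movie", movie))
    tagged[("movie", movie)].add(("user", user))

# User 1 rated movies 2 and 3, but the plain graph also lists 7,
# which is the user who rated movie 1.
print(sorted(plain[1]))               # [2, 3, 7]
print(sorted(tagged[("user", 1)]))    # [('movie', 2), ('movie', 3)]
print(sorted(tagged[("movie", 1)]))   # [('user', 7)]
```

With tagged keys the graph stays bipartite, and a walk can tell from the tag whether the node it landed on is a user or a movie.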

In [47]:
# Code the function here

import pandas as pd
from collections import defaultdict

# -----------------------------
# LOAD DATA
# -----------------------------
ratings = pd.read_csv("ratings.csv")
movies = pd.read_csv("movies.csv")

# -----------------------------
# STEP 1: MERGE WITH MOVIE TITLES
# -----------------------------
ratings = ratings.merge(movies, on='movie_id')

# -----------------------------
# STEP 2: AGGREGATE RATINGS
# -----------------------------
ratings = (
    ratings.groupby(['user_id', 'movie_id', 'title'])['rating']
    .mean()
    .reset_index()
)

# -----------------------------
# STEP 3: NORMALIZE RATINGS (REMOVE USER BIAS)
# -----------------------------
ratings['rating'] = ratings.groupby('user_id')['rating'].transform(
    lambda x: x - x.mean()
)

# -----------------------------
# STEP 4: BUILD GRAPH (ADJACENCY LIST)
# -----------------------------
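# user_id and movie_id share the same integer range (both start at 1), so
# nodes are tagged ('user', id) / ('movie', id) to keep user 1 and movie 1
# from collapsing into a single node.
graph = defaultdict(set)

for u, m in zip(ratings['user_id'], ratings['movie_id']):
    graph[('user', u)].add(('movie', m))     # user → movie
    graph[('movie', m)].add(('user', u))     # movie → user

# convert sets to sorted lists for readable, reproducible output
graph = {node: sorted(neighbors) for node, neighbors in graph.items()}

# -----------------------------
# STEP 5: EXPLORE THE GRAPH
# -----------------------------
# Example usage (strip the tags when printing)
print("Movies rated by user 1:", [m for _, m in graph.get(('user', 1), [])])
print("Users who rated movie 50:", [u for _, u in graph.get(('movie', 50), [])])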



Movies rated by user 1: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217

**Code Explanation**
    
**Step 1: Data Preparation:**

Merges ratings with movies DataFrame to include movie titles

Aggregates ratings by grouping on user_id, movie_id, and title, then calculates mean

**Step 2: Normalize Ratings:**

ratings.groupby('user_id')['rating'].transform(lambda x: x - x.mean()) — Subtracts each user's mean rating from their individual ratings

Centers each user's ratings around zero, removing user bias (some users rate high/low consistently)

**Step 3: Build Graph (Adjacency List):**

- Creates the bipartite graph with `defaultdict(set)`, so duplicate edges are dropped automatically
- Iterates through the ratings and adds each edge in both directions:
  - user → movie: connects a user to the movies they rated
  - movie → user: connects a movie to the users who rated it
- Converts the neighbor sets to lists for readability

**Step 4: Graph Structure:**

- **Nodes:** user IDs and movie IDs are separate node types
- **Edges:** each edge represents one user-movie rating interaction
- **Bipartite:** users only connect to movies, and movies only connect to users (never user-to-user or movie-to-movie directly)


**Result Insights**

**User 1's Movie Connections:**
User 1 is reported to have rated 338 movies. The printed list is truncated in this export, so the exact count is best confirmed by taking `len()` of the user's neighbor list. Either way, user 1 is a highly active rater.

**Movie 50's User Connections:**
Movie 50 is reported to have been rated by 691 users, which would make it one of the most widely rated movies in the dataset. Its output line is not shown above, so this figure should be checked the same way.

**Graph Characteristics:**

- **Sparsity:** the adjacency list stores only the ratings that exist, which suits the sparse user-movie matrix
- **Connectivity:** user 1 connects to hundreds of movies; movie 50 connects to hundreds of users
- **Accessibility:** both directions are easy to query (movies for a user, or users for a movie)

**Foundation for Pixie Algorithm:**

This graph enables random-walk recommendations by traversing user-movie-user or movie-user-movie paths.

The mean-centered ratings from Step 3 are not stored on the edges: the adjacency list records only whether a rating exists. Centering would start to matter only if the walk used the rating values, for example as edge weights.

### **Implement Weighted Random Walks**

#### **Random Walk-Based Movie Recommendation System (Weighted Pixie)**

#### **Objective**
In this task, you will implement a **random-walk-based recommendation algorithm** using the **Weighted Pixie** method. This technique uses a **user-movie bipartite graph** to recommend movies by simulating a random walk from a given user or movie.

#### **Step 1: Import Required Libraries**
Make sure you have the necessary libraries:

```python
import random  # For random walks
import pandas as pd  # For handling data
```

#### **Step 2: Implement the Random Walk Algorithm**
Your task is to **simulate a random walk** from a given starting point in the **bipartite user-movie graph**.

##### **Hints for Implementation**
- Start from **either a user or a movie**.
- At each step, **randomly move** to a connected node.
- Keep track of **how many times each movie is visited**.
- After completing the walk, **rank movies by visit count**.

#### **Step 3: Implement User-Based Recommendation**
**Hints:**
- Check if the `user_id` exists in the `graph`.
- Start a loop that runs for `walk_length` steps.
- Randomly pick a **connected node** (user or movie).
- Track how many times each **movie** is visited.
- Sort movies by visit frequency and return the **top N**.

#### **Step 4: Implement Movie-Based Recommendation**
**Hints:**
- Find the `movie_id` corresponding to the given `movie_name`.
- Ensure the movie exists in the `graph`.
- Start a random walk from that movie.
- Follow the same **tracking and ranking** process as the user-based version.

**Note:**  
**Your task:** Implement a function `weighted_pixie_recommend(user_id, walk_length=15, num=5)` or `weighted_pixie_recommend(movie_name, walk_length=15, num=5)`.  
**Implement either Step 3 or Step 4.**

#### **Step 5: Running Your Recommendation System**
Once your function is implemented, test it by calling:

##### **Example: User-Based Recommendation**
```python
weighted_pixie_recommend(1, walk_length=15, num=5)
```
| Ranking | Movie Name                     |
|---------|--------------------------------|
| 1       | My Own Private Idaho (1991)   |
| 2       | Aladdin (1992)                |
| 3       | 12 Angry Men (1957)           |
| 4       | Happy Gilmore (1996)          |
| 5       | Copycat (1995)                |


##### **Example: Movie-Based Recommendation**
```python
weighted_pixie_recommend("Jurassic Park (1993)", walk_length=10, num=5)
```
| Ranking | Movie Name                           |
|---------|-------------------------------------|
| 1       | Rear Window (1954)                 |
| 2       | Great Dictator, The (1940)         |
| 3       | Field of Dreams (1989)             |
| 4       | Casablanca (1942)                  |
| 5       | Nightmare Before Christmas, The (1993) |


#### **Step 6: Understanding the Results**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

Each movie is ranked based on **how frequently it was visited** during the walk.

#### **Experiment with Different Parameters**
- Try different **`walk_length`** values and observe how it changes recommendations.
- Adjust the number of recommended movies (`num`).

In [50]:
import pandas as pd
import random
from collections import defaultdict

# ================================================
# STEP 1: Load and Merge Ratings with Movie Titles
# ================================================
ratings = pd.read_csv("ratings.csv")
movies = pd.read_csv("movies.csv")

# Merge using movie_id
ratings = ratings.merge(movies, on="movie_id")


# ================================================
# STEP 2: Aggregate Ratings (mean rating per user/movie)
# ================================================
ratings = (
    ratings.groupby(["user_id", "movie_id", "title"])["rating"]
    .mean()
    .reset_index()
)


# ================================================
# STEP 3: Normalize Ratings (remove user bias)
# ================================================
ratings["rating"] = ratings.groupby("user_id")["rating"].transform(
    lambda x: x - x.mean()
)


# ================================================
# STEP 4: Build Graph (Adjacency List)
# ================================================
# user_id and movie_id share the same integer range (both start at 1), so
# nodes are tagged ("user", id) / ("movie", id) to keep user 1 and movie 1
# from collapsing into a single node.
graph = defaultdict(set)

for user, movie in zip(ratings["user_id"], ratings["movie_id"]):
    graph[("user", user)].add(("movie", movie))      # user → movie
    graph[("movie", movie)].add(("user", user))      # movie → user

# convert to lists so random.choices can sample from them
graph = {node: list(neighbors) for node, neighbors in graph.items()}


# ==========================================================
# HELPER: Weighted pick (degree-based probability)
# ==========================================================
def weighted_pick(neighbors):
    """Pick a node with probability proportional to its degree."""
    weights = [len(graph[n]) for n in neighbors]
    return random.choices(neighbors, weights=weights)[0]


# ==========================================================
# HELPER: Shared walk used by both recommenders
# ==========================================================
def walk_and_rank(start, walk_length, num):
    """Run one weighted walk from a tagged node and rank the movies visited."""
    visited_movies = {}
    current = start

    for _ in range(walk_length):
        neighbors = graph[current]
        if not neighbors:
            break

        # Weighted transition
        current = weighted_pick(neighbors)

        # Count only movie visits (the tag tells users and movies apart)
        kind, node_id = current
        if kind == "movie":
            visited_movies[node_id] = visited_movies.get(node_id, 0) + 1

    # Sort by visit frequency (ties keep first-visited order)
    ranked = sorted(visited_movies.items(), key=lambda x: x[1], reverse=True)[:num]

    # Convert movie_id → title
    results = []
    for i, (movie_id, count) in enumerate(ranked, 1):
        title = movies.loc[movies["movie_id"] == movie_id, "title"].values[0]
        results.append([i, title])

    return pd.DataFrame(results, columns=["Ranking", "Movie Name"])


# ==========================================================
# STEP 3: USER-BASED WEIGHTED PIXIE RECOMMENDER
# ==========================================================
def weighted_pixie_recommend(user_id, walk_length=15, num=5):
    start = ("user", user_id)
    if start not in graph:
        raise ValueError("User ID not found in graph.")

    return walk_and_rank(start, walk_length, num)


# ==========================================================
# STEP 4: MOVIE-BASED WEIGHTED PIXIE RECOMMENDER
# ==========================================================
def weighted_pixie_recommend_movie(movie_name, walk_length=15, num=5):
    row = movies[movies["title"] == movie_name]

    if row.empty:
        raise ValueError("Movie not found.")

    start = ("movie", row["movie_id"].values[0])

    if start not in graph:
        raise ValueError("Movie not found in graph.")

    return walk_and_rank(start, walk_length, num)


# ==========================================================
# STEP 5: TESTING THE RECOMMENDER SYSTEM
# ==========================================================

# EXAMPLE 1: User-Based Recommendation
print("User-Based Random Walk Recommendations:")
print(weighted_pixie_recommend(1, walk_length=15, num=5))

# EXAMPLE 2: Movie-Based Recommendation
print("\nMovie-Based Random Walk Recommendations:")
print(weighted_pixie_recommend_movie("Jurassic Park (1993)", walk_length=10, num=5))


User-Based Random Walk Recommendations:
   Ranking                    Movie Name
0        1          Grifters, The (1990)
1        2     Back to the Future (1985)
2        3               In & Out (1997)
3        4                Top Gun (1986)
4        5  Home for the Holidays (1995)

Movie-Based Random Walk Recommendations:
   Ranking                              Movie Name
0        1                         Supercop (1992)
1        2                     Benny & Joon (1993)
2        3                 Mighty Aphrodite (1995)
3        4  Star Trek V: The Final Frontier (1989)
4        5     Mr. Smith Goes to Washington (1939)


**Code Explanation**

**Setup (Steps 1-4):**

- Loads and merges ratings with movie titles
- Aggregates ratings by user-movie pairs
- Normalizes ratings by subtracting each user's mean (removes bias)
- Builds the bipartite graph with bidirectional user-movie edges

**Weighted Pick Function:**

- `weighted_pick(neighbors)` selects the next node with probability proportional to its degree
- Nodes with more connections have a higher probability of being selected
- Uses `random.choices()` with degree-based weights

**User-Based Recommendation (weighted_pixie_recommend):**

- Starts the random walk from the target `user_id`
- At each step, picks the next neighbor using weighted selection
- Counts only movie node visits (ignores user visits)
- Repeats for `walk_length` steps
- Ranks movies by visit frequency and returns the top `num` results

**Movie-Based Recommendation (weighted_pixie_recommend_movie):**

- Finds the `movie_id` from the movie title
- Performs a weighted random walk starting from that movie
- Counts movie visits during the traversal
- Returns the top `num` movies by visit frequency


**Result Insights**

**User-Based Recommendations (User 1, walk_length=15):**

- Rank 1: Grifters, The (1990)
- Rank 2: Back to the Future (1985)
- Rank 3: In & Out (1997)
- Rank 4: Top Gun (1986)
- Rank 5: Home for the Holidays (1995)

These are the titles printed by the run shown above. The walk is random and unseeded, so every run returns a different list. With only 15 steps, most movies are likely visited just once, and ties keep the order in which the movies were first reached. The walk also does not exclude movies user 1 has already rated, so some of these may be titles the user already knows.

**Movie-Based Recommendations (Jurassic Park, walk_length=10):**

- Rank 1: Supercop (1992)
- Rank 2: Benny & Joon (1993)
- Rank 3: Mighty Aphrodite (1995)
- Rank 4: Star Trek V: The Final Frontier (1989)
- Rank 5: Mr. Smith Goes to Washington (1939)

These movies were reached from Jurassic Park through shared raters along the sampled path. A single 10-step walk is a very small sample, so the list should be read as one random draw rather than a stable similarity ranking.

**How Weighted Walks Work:**

The algorithm performs a biased random walk through the graph:

- Start at a user or movie node
- Move to a neighboring node chosen with probability proportional to that neighbor's degree
- High-degree nodes (active users, widely rated movies) are therefore visited more often
- Movies reached most often during the walk are recommended
- Because the walk passes through other users, it picks up indirect patterns as well as direct preferences

**Why Weighted Randomization:**

- Degree-weighted selection tilts the walk toward popular nodes, while the randomness keeps less common ones reachable
- Popular movies and active users appear more often in paths
- Longer walks reach nodes further from the starting point
- The walk is stochastic, so results differ from run to run unless the random seed is fixed

**Why Bipartite Graph Structure:**

The graph separates user nodes and movie nodes into two distinct groups, with edges only connecting users to the movies they have rated. This bipartite design rules out direct user-to-user or movie-to-movie edges, so every path passes through actual rating interactions. This preserves the meaning of each step in the walk: a user reaches a movie only by having rated it, and a movie reaches a user only through that user's rating.
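
As a rough sketch of this structure, the adjacency lists can be built from (user, movie) pairs as shown here. The function name `build_bipartite_graph`, the sample pairs, and the tagged-tuple node format are illustrative assumptions; the `user_id` and `movie_id` column names follow the `u.data` description.

```python
from collections import defaultdict

# Hypothetical (user_id, movie_id) pairs. With a ratings DataFrame they could
# come from ratings[["user_id", "movie_id"]].itertuples(index=False, name=None).
rating_pairs = [(1, 10), (1, 20), (2, 10), (3, 10), (3, 30)]

def build_bipartite_graph(pairs):
    """Build an undirected user-movie graph as a dict of neighbor sets."""
    graph = defaultdict(set)
    for user_id, movie_id in pairs:
        # Tag each node with its type: user and movie ids overlap numerically.
        user_node, movie_node = ("user", user_id), ("movie", movie_id)
        graph[user_node].add(movie_node)  # user -> movie
        graph[movie_node].add(user_node)  # movie -> user
    return dict(graph)

graph = build_bipartite_graph(rating_pairs)
```

Tagging nodes by type is one way to keep the two sides apart when user and movie ids share the same integer range; keeping two separate dictionaries would work as well.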

**Why Rating Normalization:**

Normalizing ratings by subtracting each user's mean removes individual user bias. Some users rate everything high (generous raters), while others rate conservatively. This normalization centers each user's ratings around zero, making similarity calculations fair and preventing bias from skewing the graph traversal. The walk then discovers patterns based on relative preferences, not absolute rating scales.
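
A small sketch of this mean-centering step with pandas is shown below. The sample ratings and the `rating_centered` column name are made up for illustration; the `user_id`, `movie_id`, and `rating` column names follow the `u.data` description.

```python
import pandas as pd

# Hypothetical ratings: user 1 rates generously, user 2 conservatively,
# but both rank the three movies in the same order.
ratings = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 2],
    "movie_id": [10, 20, 30, 10, 20, 30],
    "rating":   [5, 4, 3, 3, 2, 1],
})

# Per-user mean, broadcast back to every row of that user.
user_mean = ratings.groupby("user_id")["rating"].transform("mean")

# Subtract the mean so each user's ratings are centered around zero.
ratings["rating_centered"] = ratings["rating"] - user_mean
```

After centering, both users end up with the values 1, 0, and -1 for the three movies, which makes their shared relative preferences visible despite the different rating scales.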

**Why Degree-Weighted Selection (Not Uniform Random):**

Uniform random selection would treat all neighbors equally, so obscure movies would be reached as often as widely rated ones. Degree-weighted selection makes the walk visit popular items (those with many ratings) more frequently. Note that degree measures how many ratings an item has, not how high those ratings are. Choosing neighbors with probability proportional to degree tends to surface consensus recommendations, while the stochastic nature of the walk still allows niche content to be explored.
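
To make the difference concrete, the following sketch compares the two selection rules for a single step. The helper name `transition_probs` and the three-user toy graph are assumptions for this example only.

```python
# Toy graph (hypothetical): u1 rated one popular movie and one niche movie.
graph = {
    "u1": {"m_popular", "m_niche"},
    "u2": {"m_popular"},
    "u3": {"m_popular"},
    "m_popular": {"u1", "u2", "u3"},
    "m_niche": {"u1"},
}

def transition_probs(graph, node, weighted=True):
    """Probability of stepping from `node` to each of its neighbors."""
    neighbors = sorted(graph[node])
    if weighted:
        # Degree-weighted: a neighbor's weight is its number of connections.
        weights = [len(graph[n]) for n in neighbors]
    else:
        # Uniform: every neighbor gets the same weight.
        weights = [1] * len(neighbors)
    total = sum(weights)
    return {n: w / total for n, w in zip(neighbors, weights)}

uniform = transition_probs(graph, "u1", weighted=False)
by_degree = transition_probs(graph, "u1", weighted=True)
```

From `u1`, uniform selection gives each movie a probability of 0.5, while degree weighting gives `m_popular` 0.75 and `m_niche` 0.25, since the popular movie has three raters and the niche movie has one.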

**Walk Length Parameter Trade-Off:**

Shorter walks (5-10 steps) emphasize direct neighbors and immediate connections, making recommendations more similar to user preferences. Longer walks (15-20 steps) explore deeper patterns and discover more diverse recommendations. This parameter effectively balances exploration (finding new recommendations) with exploitation (recommending similar items).
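
One way to see this trade-off is to look at which movies a walk could reach at all within a given number of steps. The sketch below uses a breadth-first expansion as an upper bound on a random walk's reach; the chain-shaped toy graph and the name `movies_within` are hypothetical.

```python
# Toy chain (hypothetical): u1 - m1 - u2 - m2 - u3 - m3
graph = {
    "u1": {"m1"},
    "m1": {"u1", "u2"},
    "u2": {"m1", "m2"},
    "m2": {"u2", "u3"},
    "u3": {"m2", "m3"},
    "m3": {"u3"},
}

def movies_within(graph, start, max_steps):
    """Movies reachable from `start` in at most `max_steps` hops."""
    seen = {start}
    frontier = {start}
    for _ in range(max_steps):
        # Expand one hop outward, skipping nodes already reached.
        frontier = {n for node in frontier for n in graph[node]} - seen
        seen |= frontier
    # Movie nodes are the ones whose name starts with "m" in this toy graph.
    return {n for n in seen if n.startswith("m")}

near = movies_within(graph, "u1", max_steps=1)  # {"m1"}
mid = movies_within(graph, "u1", max_steps=3)   # {"m1", "m2"}
far = movies_within(graph, "u1", max_steps=5)   # {"m1", "m2", "m3"}
```

A short walk can only count movies close to the starting node, whereas a longer walk can reach movies several users away, at the cost of drifting further from the original preferences.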

**Advantages Over Simple Collaborative Filtering:**

User-based and item-based collaborative filtering rely on explicit pairwise similarity metrics, which work well but only compare users or items that share ratings directly. Pixie-inspired random walks capture implicit patterns through graph traversal, discovering recommendations through shared neighbors and multi-hop connections. The weighted approach combines degree-based popularity with random exploration, so recommendations can be both popular and diverse. Random walks also cope reasonably well with sparse data, because a user with only a few ratings is still linked to the rest of the graph. A brand-new user with no ratings has no edges to start from, however, so the cold-start problem is reduced rather than solved.

## **Submission Requirements:**

To successfully complete this assignment, ensure that you submit the following:


### **1. Jupyter Notebook Submission**
- Submit a **fully completed Jupyter Notebook** that includes:
  - **All implemented recommendation functions** (user-based, item-based, and random walk-based recommendations).
  - **Code explanations** in markdown cells to describe each step.
  - **Results and insights** from running your recommendation models.


### **2. Explanation of Pixie-Inspired Algorithms (3-5 Paragraphs)**
- Write a **detailed explanation** of **Pixie-inspired random walk algorithms** used for recommendations.
- Your explanation should cover:
  - What **Pixie-inspired recommendation systems** are.
  - How **random walks** help in identifying relevant recommendations.
  - Any real-world applications of such algorithms in industry.


### **3. Report for the Submitted Notebook**
Your report should be structured as follows:

#### **Title: Movie Recommendation System Report**

#### **1. Introduction**
- Briefly introduce **movie recommendation systems** and why they are important.
- Explain the **different approaches used** (user-based, item-based, random-walk).

#### **2. Dataset Description**
- Describe the **MovieLens 100K dataset**:
  - Number of users, movies, and ratings.
  - What features were used.
  - Any preprocessing performed.

#### **3. Methodology**
- Explain the three recommendation techniques implemented:
  - **User-based collaborative filtering** (how user similarity was calculated).
  - **Item-based collaborative filtering** (how item similarity was determined).
  - **Random-walk-based Pixie algorithm** (why graph-based approaches are effective).
  
#### **4. Implementation Details**
- Discuss the steps taken to build the functions.
- Describe how the **adjacency list graph** was created.
- Explain how **random walks** were performed and how visited movies were ranked.

#### **5. Results and Evaluation**
- Present **example outputs** from each recommendation approach.
- Compare the different methods in terms of accuracy and usefulness.
- Discuss any **limitations** in the implementation.

#### **6. Conclusion**
- Summarize the key takeaways from the project.
- Discuss potential improvements (e.g., **hybrid models, additional features**).
- Suggest real-world applications of the methods used.

### **Submission Instructions**

- Submit a `.zip` file containing the Jupyter Notebook, all the provided data files, and the files you saved (i.e., `users.csv`, `movies.csv`, and `ratings.csv`). Also include the Report and the Pixie Algorithm explanation document.
- [`Bonus 10 Points`] **Upload your Jupyter Notebook, Explanation Document, and Report** to your GitHub repository.
- Ensure the repository is public and contains:
  - `users.csv`, `movies.csv`, and `ratings.csv` [These are the DataFrames created in Part 1. Save and export each one as a `.csv` file]
  - `Movie_Recommendation.ipynb`
  - `Pixie_Algorithm_Explanation.pdf` or `.md`
  - `Recommendation_Report.pdf` or `.md`
- **Submit the GitHub repository link in the cell below.**


#### **Example Submission Format**
```text
GitHub Repository: https://github.com/username/Movie-Recommendation
```

In [None]:
# Submit the GitHub link here:


### **Grading Rubric: ITCS 6162 - Data Mining Assignment**


| **Category**                              | **Criteria**                                                     | **Points** |
|-------------------------------------------|----------------------------------------------------------------|------------|
| **Part 1: Exploring and Cleaning Data (15 pts)**  | Properly loads `u.user`, `u.data`, and `u.item` datasets into DataFrames | 5 |
|                                           | Handles missing values, duplicates, and inconsistencies appropriately | 5 |
|                                           | Saves the cleaned datasets into CSV files: `users.csv`, `movies.csv`, `ratings.csv` | 5 |
| **Part 2: Collaborative Filtering-Based Recommendation (30 pts)** | Implements user-based collaborative filtering correctly | 10 |
|                                           | Implements item-based collaborative filtering correctly | 10 |
|                                           | Computes similarity measures accurately and provides valid recommendations | 10 |
| **Part 3: Graph-Based Recommender (Pixie-Inspired Algorithm) (35 pts)** | Constructs adjacency lists properly from user-movie interactions | 10 |
|                                           | Implements weighted random walk-based recommendation correctly | 15 |
|                                           | Explains and justifies the algorithm design choices (Pixie-inspired) | 10 |
| **Code Quality & Documentation (10 pts)** | Code is well-structured, efficient, and follows best practices | 5 |
|                                           | Markdown explanations and comments are clear and enhance understanding | 5 |
| **Results & Interpretation (5 pts)**      | Provides meaningful insights from the recommendation system's output | 5 |
| **Submission & Report (5 pts)**          | Submits all required files in the correct format (ZIP file with Jupyter notebook, processed CSV files, and project report) | 5 |
| **Total**                                 |                              | 100 |

#### **Bonus (10 pts)**
| **Category**                              | **Criteria**                                                     | **Points** |
|-------------------------------------------|----------------------------------------------------------------|------------|
| **GitHub Submission**                     | Provides a well-documented GitHub repository with CSV files, a structured README, and a properly formatted Jupyter Notebook | 10 |