# ITCS 6162: Data Mining - Programming Assignment

## Pavithra Selvan

#### Assignment Structure

1. Part-1: Explore data analysis
2. Part-2: Recommendation algorithms (collaborative filtering)
3. Part-3: Pixie-inspired Graph-based techniques (Unweighted and then, Weighted versions).



#### Dataset Files:
- **`u.data`**: User-movie ratings (`user_id  movie_id  rating  timestamp`)
- **`u.item`**: Movie metadata (`movie_id | title | release date | IMDB_website`)
- **`u.user`**: User demographics (`user_id | age | gender | occupation | zip_code`)

## **Part 1: Exploring and Cleaning Data**

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
path = 'C:\\pyprojects\\DataMining\\Data\\'
path

'C:\\pyprojects\\DataMining\\Data\\'

### Step-A Inspecting the Dataset Format

The dataset is not in a traditional CSV format. 
To examine its structure, the file is opened in read mode, and printed the first 10 lines.
The code is being run in a Windows machine. Therefore, using python functions to examine the contents, instead of shell commands.


***
**u.data file - print first 10 lines**
***
**Findings:**
1. The file is a tab separated file, with no header
2. There are four columns, out of which the first three will be used (user_id, movie_id and rating).


In [3]:
with open(path+'u.data', 'r') as input_file:
    for _ in range(10):
        print(next(input_file))

196	242	3	881250949

186	302	3	891717742

22	377	1	878887116

244	51	2	880606923

166	346	1	886397596

298	474	4	884182806

115	265	2	881171488

253	465	5	891628467

305	451	3	886324817

6	86	3	883603013



***
**u.item file - print first 10 lines**
***
**Findings:**
1. The file is a pipe separated file, with no header
2. There are many columns, out of which the first three will be used (movie_id, title and release_date).


In [4]:
with open(path+'u.item', 'r') as input_file:
    for _ in range(10):
        print(next(input_file))

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0

2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0

3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0

4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0

5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0

6|Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)|01-Jan-1995||http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0

7|Twelve Monkeys (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Twelve%20Monkeys%20(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|1|0|0|0

8|Babe (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Babe%20(1995)|0|

***
**u.user file - print first 10 lines**
***
**Findings:**
1. The file is a pipe separated file, with no header
2. There are four columns, out of which the first four will be used (user_id, age, gender and occupation).


In [5]:
with open(path+'u.user', 'r') as input_file:
    for _ in range(10):
        print(next(input_file))

1|24|M|technician|85711

2|53|F|other|94043

3|23|M|writer|32067

4|24|M|technician|43537

5|33|F|other|15213

6|42|M|executive|98101

7|57|M|administrator|91344

8|36|M|administrator|05201

9|29|M|student|01002

10|53|M|lawyer|90703



### Step-B Loading the Dataset with Pandas

Use **pandas** to load the dataset into a DataFrame for analysis. Follow these steps:  

1. Import the necessary library: `pandas`.  
2. Use `pd.read_csv()` (or an appropriate function) to read the dataset file.  
3. Ensure the dataset is loaded with the correct delimiter (e.g., `','`, `'\t'`,`'|'` , or another separator if needed).  
4. Select and display the first few rows using `.head()`.

In [6]:
# ratings
u_data_df=pd.read_csv(path+'u.data',sep='\t',names=['user_id','movie_id','rating','timestamp'])
u_data_df.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [7]:
# movies
u_item_df=pd.read_csv(path+'u.item',sep='|',encoding='latin-1',usecols=[0, 1, 2],names=['movie_id','title','release_date'])
u_item_df.head()

Unnamed: 0,movie_id,title,release_date
0,1,Toy Story (1995),01-Jan-1995
1,2,GoldenEye (1995),01-Jan-1995
2,3,Four Rooms (1995),01-Jan-1995
3,4,Get Shorty (1995),01-Jan-1995
4,5,Copycat (1995),01-Jan-1995


In [8]:
# users
u_user_df=pd.read_csv(path+'u.user',sep='|',usecols=[0, 1, 2, 3],names=['user_id','age','gender','occupation'])
u_user_df.head()


Unnamed: 0,user_id,age,gender,occupation
0,1,24,M,technician
1,2,53,F,other
2,3,23,M,writer
3,4,24,M,technician
4,5,33,F,other


**Note:** As a **Bonus** task save the `ratings`, `movies` and `users` dataframe created into a `.csv` file format. <br>
**Hint:** Use the `to_csv()` function in pandas to save these DataFrames as CSV files.

### Step-C Saving dataframe to CSV file format
The three dataframes have been saved as csv files. 

**File Names:**
1. u.data.csv
2. u.item.csv
3. u.user.csv

In [9]:
# ratings
u_data_csv=u_data_df.to_csv(path+'u.data.csv',index=False)

In [10]:
# movies
u_item_csv=u_item_df.to_csv(path+'u.item.csv',index=False)

In [11]:
# users
u_user_csv=u_user_df.to_csv(path+'u.user.csv',index=False)

**Display the first 10 rows of each file.**

In [12]:
# ratings
u_data_df.head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
5,298,474,4,884182806
6,115,265,2,881171488
7,253,465,5,891628467
8,305,451,3,886324817
9,6,86,3,883603013


In [13]:
# movies
u_item_df.head(10)

Unnamed: 0,movie_id,title,release_date
0,1,Toy Story (1995),01-Jan-1995
1,2,GoldenEye (1995),01-Jan-1995
2,3,Four Rooms (1995),01-Jan-1995
3,4,Get Shorty (1995),01-Jan-1995
4,5,Copycat (1995),01-Jan-1995
5,6,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,01-Jan-1995
6,7,Twelve Monkeys (1995),01-Jan-1995
7,8,Babe (1995),01-Jan-1995
8,9,Dead Man Walking (1995),01-Jan-1995
9,10,Richard III (1995),22-Jan-1996


In [14]:
# users
u_user_df.head(10)

Unnamed: 0,user_id,age,gender,occupation
0,1,24,M,technician
1,2,53,F,other
2,3,23,M,writer
3,4,24,M,technician
4,5,33,F,other
5,6,42,M,executive
6,7,57,M,administrator
7,8,36,M,administrator
8,9,29,M,student
9,10,53,M,lawyer


### Step-D Data Cleaning and Exploration with Pandas  

After loading the dataset, it’s important to clean and explore the data to ensure consistency and accuracy. Below are key **pandas** functions for cleaning and understanding the dataset.


**Note:** The functions mentioned above are some of the widely used **pandas** functions for data cleaning and exploration. However, it is not necessary that all of these functions will be required in the exercises below. Use them as needed based on the dataset and the specific tasks.

**Dataframe-1: u_data_df**
1. Convert Timestamps into Readable dates.
2. Check for missing values. There are no missing values, so proceed to next step.

In [15]:
# Timestamp conversion
u_data_df['timestamp']=pd.to_datetime(u_data_df['timestamp'],unit='s')
u_data_df.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,1997-12-04 15:55:49
1,186,302,3,1998-04-04 19:22:22
2,22,377,1,1997-11-07 07:18:36
3,244,51,2,1997-11-27 05:02:03
4,166,346,1,1998-02-02 05:33:16


In [16]:
# Checking for Missing Values
u_data_df.isnull().sum()

user_id      0
movie_id     0
rating       0
timestamp    0
dtype: int64

**Dataframe-2: u_item_df**
1. Check for missing values. 
2. Remove rows with missing values
3. Recheck for missing values, and proceed to next step.

In [17]:
#Check for blank or missing values
u_item_df.isnull().sum()

movie_id        0
title           0
release_date    1
dtype: int64

In [18]:
# Drop the data rows that had blank or missing values
u_item_df.dropna(inplace=True)
# Re-check for blank of missing values
u_item_df.isnull().sum()

movie_id        0
title           0
release_date    0
dtype: int64

**Dataframe-3: u_users_df**
1. Check for missing values. 
2. No rows with missing values
3. So, proceed to next step.

In [19]:
# users
u_user_df.isnull().sum()


user_id       0
age           0
gender        0
occupation    0
dtype: int64

**Reset Index for all three dataframes, after cleaning**

In [20]:
u_data_df.reset_index(drop=True, inplace=True)
u_item_df.reset_index(drop=True, inplace=True)
u_user_df.reset_index(drop=True, inplace=True)

**Print the total number of users, movies, and ratings.**

In [21]:
print(f"Total Users: { u_user_df.shape[0] }")
print(f"Total Movies: {u_item_df.shape[0] }")
print(f"Total Ratings: {u_data_df.shape[0]}")


Total Users: 943
Total Movies: 1681
Total Ratings: 100000


## **Part 2: Collaborative Filtering-Based Recommendation**

### **Create a User-Item Matrix**

#### Instructions for Creating a User-Movie Rating Matrix

In this exercise, you will create a user-movie rating matrix using **pandas**. This matrix will represent the ratings that users have given to different movies.

1. **Dataset Overview**:  

2. **Create the User-Movie Rating Matrix**:  
 
3. **Inspect the Matrix**:  
  
4. **Handle Missing Values**:  
  

**Create the user-movie rating matrix using the `pivot()` function.**

In [22]:
user_movie_matrix = u_data_df.pivot(index='user_id', columns='movie_id', values='rating')
user_movie_matrix.head()

movie_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,,,,,,,,,,
2,4.0,,,,,,,,,2.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,3.0,,,,,,,,,...,,,,,,,,,,


**Display the matrix to verify the transformation.**

In [23]:
user_movie_matrix.fillna(0,inplace=True)
user_movie_matrix.head(10)

movie_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,4.0,0.0,0.0,0.0,0.0,0.0,2.0,4.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,5.0,0.0,0.0,5.0,5.0,5.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,5.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,4.0,0.0,0.0,4.0,0.0,0.0,4.0,0.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### **User-Based Collaborative Filtering Recommender System**

#### **Objective**
In this task, you will implement a **user-based collaborative filtering** movie recommendation system using the **Movie dataset**. The goal is to recommend movies to a user based on the preferences of similar users.

##### **Step 2: Compute User-User Similarity**
- We will use **cosine similarity** to measure how similar each pair of users is based on their movie ratings.
- Since `cosine_similarity` does not handle missing values (NaN), replace them with `0` before computation.

In [24]:
user_similarity = cosine_similarity(user_movie_matrix.fillna(0))
user_sim_df = pd.DataFrame(user_similarity, index=user_movie_matrix.index, columns=user_movie_matrix.index)

##### **Step 3: Implement the Recommendation Function**
Implementation of function `recommend_movies_for_user(user_id, num=5)` to recommend movies for a given user.


In [25]:
def recommend_movies_for_user(user_id, num=5):
  similar_users=user_sim_df[user_id].sort_values(ascending=False)[1:].head(num)
  movie_ratings=user_movie_matrix.loc[similar_users.index]
  avg_ratings=movie_ratings.mean()
  top_movies=avg_ratings.sort_values(ascending=False).head(num)
  movie_names=u_item_df.loc[top_movies.index]['title']
  result_df=pd.DataFrame({'Ranking':range(1,num+1),'Movie Name':movie_names})
  result_df.set_index('Ranking',inplace=True)
  return result_df


##### **Step 4: Return the Final Recommendation List**


In [26]:
recommend_movies_for_user(7, num = 5)

Unnamed: 0_level_0,Movie Name
Ranking,Unnamed: 1_level_1
1,2001: A Space Odyssey (1968)
2,"Apartment, The (1960)"
3,Legends of the Fall (1994)
4,"Third Man, The (1949)"
5,Fly Away Home (1996)


### **Item-Based Collaborative Filtering Recommender System**

#### **Objective**
In this task, you will implement an **item-based collaborative filtering** recommendation system using the **Movie dataset**. The goal is to recommend movies similar to a given movie based on user rating patterns.

##### **Step 1: Compute Item-Item Similarity**
- We will use **cosine similarity** to measure how similar each pair of movies is based on their user ratings.

In [27]:
# Code the function here
item_similarity = cosine_similarity(user_movie_matrix.T.fillna(0))
item_sim_df = pd.DataFrame(item_similarity, index=user_movie_matrix.columns, columns=user_movie_matrix.columns)

##### **Step 2: Implement the Recommendation Function**
Implementation of the function `recommend_movies(movie_name, num=5)` to recommend movies similar to a given movie.


In [28]:
def recommend_movies(movie_name, num=5):
  if movie_name not in u_item_df['title'].values:
    return "Movie not found in the dataset."
  else:
    movie_id=u_item_df[u_item_df['title']==movie_name]['movie_id'].values[0]
    similar_movies=item_sim_df[movie_id].sort_values(ascending=False)[1:].head(num)
    movie_names=u_item_df.loc[similar_movies.index]['title']
    result_df=pd.DataFrame({'Ranking':range(1,num+1),'Movie Name':movie_names})
    result_df.set_index('Ranking',inplace=True)
    return result_df


##### **Step 3: Return the Final Recommendation List**


In [29]:
recommend_movies("Jurassic Park (1993)", num=5)

Unnamed: 0_level_0,Movie Name
Ranking,Unnamed: 1_level_1
1,On Golden Pond (1981)
2,Wyatt Earp (1994)
3,Brazil (1985)
4,"Princess Bride, The (1987)"
5,M*A*S*H (1970)


In [30]:
recommend_movies("Brazil (1985)", num=5)

Unnamed: 0_level_0,Movie Name
Ranking,Unnamed: 1_level_1
1,So I Married an Axe Murderer (1993)
2,Mr. Smith Goes to Washington (1939)
3,Indiana Jones and the Last Crusade (1989)
4,"Wrong Trousers, The (1993)"
5,"First Wives Club, The (1996)"


## **Part 3: Graph-Based Recommender (Pixie-Inspired Algorithm)**

##### **Goal is to Create a Graph for Adjacency List**
In this task, you will preprocess the Movie dataset and construct a **graph representation** where:
- **Users** are connected to the movies they have rated.
- **Movies** are connected to users who have rated them.
  
This graph structure will help in exploring **user-movie relationships** for recommendations.

##### **Step 1: Merge Ratings with Movie Titles**


In [31]:
# Code the function here

ratings_df = u_data_df.merge(u_item_df, on='movie_id')

##### **Step 2: Aggregate Ratings**

In [32]:
ratings_df = ratings_df.groupby(['user_id', 'movie_id', 'title'])['rating'].mean().reset_index()
ratings_df.head(5)

Unnamed: 0,user_id,movie_id,title,rating
0,1,1,Toy Story (1995),5.0
1,1,2,GoldenEye (1995),3.0
2,1,3,Four Rooms (1995),4.0
3,1,4,Get Shorty (1995),3.0
4,1,5,Copycat (1995),3.0


##### **Step 3: Normalize Ratings**


In [33]:
ratings_df['rating'] = ratings_df.groupby('user_id')['rating'].transform(lambda x: x - x.mean())
ratings_df.head(5)

Unnamed: 0,user_id,movie_id,title,rating
0,1,1,Toy Story (1995),1.391144
1,1,2,GoldenEye (1995),-0.608856
2,1,3,Four Rooms (1995),0.391144
3,1,4,Get Shorty (1995),-0.608856
4,1,5,Copycat (1995),-0.608856


##### **Step 4: Construct the Graph Representation**


In [34]:
graph = {}
for _, row in ratings_df.iterrows():
    user, movie = row['user_id'], row['title']
    # user = 'u-' + str(user)
    if user not in graph:
        graph[user] = set()
    if movie not in graph:
        graph[movie] = set()
    graph[user].add(movie)
    graph[movie].add(user)


##### **Step 5: Understanding the Graph**
- **Nodes** in the graph represent **users and movies**.
- **Edges** exist between a user and a movie **if the user has rated the movie**.
- This structure allows us to find **users with similar movie tastes** and **movies frequently watched together**.

##### **Step 6: Exploring the Graph**

In [35]:
# get movie title recommendations for users
user_id=5
print(graph[user_id])

{'Fish Called Wanda, A (1988)', 'Burnt Offerings (1976)', 'Rumble in the Bronx (1995)', 'True Lies (1994)', 'Thinner (1996)', 'Sudden Death (1995)', 'Day the Earth Stood Still, The (1951)', 'Star Trek V: The Final Frontier (1989)', 'Back to the Future (1985)', 'Jaws 2 (1978)', 'Wrong Trousers, The (1993)', 'Heathers (1989)', 'Adventures of Priscilla, Queen of the Desert, The (1994)', 'Highlander (1986)', 'Parent Trap, The (1961)', 'Star Trek: Generations (1994)', 'Psycho (1960)', 'Bob Roberts (1992)', 'Batman Forever (1995)', 'Willy Wonka and the Chocolate Factory (1971)', 'Apple Dumpling Gang, The (1975)', "William Shakespeare's Romeo and Juliet (1996)", 'Miracle on 34th Street (1994)', 'Santa Clause, The (1994)', 'Amityville II: The Possession (1982)', 'Home Alone (1990)', 'Love Bug, The (1969)', 'Jeffrey (1995)', 'Age of Innocence, The (1993)', 'Clerks (1994)', 'Blade Runner (1982)', 'Return of the Pink Panther, The (1974)', 'Toy Story (1995)', 'Naked Gun 33 1/3: The Final Insult (1

In [36]:
# get user recommendations for movie titles
movie_title = 'Powder (1995)'
print(graph[movie_title])

{642, 5, 262, 7, 774, 393, 11, 395, 269, 401, 276, 405, 790, 798, 417, 291, 804, 551, 682, 684, 429, 308, 311, 314, 62, 65, 577, 711, 200, 712, 588, 846, 472, 345, 346, 601, 94, 222, 224, 483, 109, 749, 880, 244, 885, 887, 760, 378}


### **Implement Random Walks (Unweighted)**

#### **Random Walk-Based Movie Recommendation System (UnWeighted)**

#### **Objective**
In this task, you will implement a **random-walk-based recommendation algorithm** using the **Weighted Pixie** method. This technique uses a **user-movie bipartite graph** to recommend movies by simulating a random walk from a given user or movie.

#### **Step 1: Import Required Libraries**


In [37]:
# Code the function here
import random
import pandas as pd
import sklearn

In [38]:
# Random Walk simulation - NO WEIGHTS

# Let's find movie recommendation for the user_id 5
user_id = 5

# find the movies that user id has visited:
visited_movies=graph[user_id]

# randomly pick a movie to start the walk from:
current_node=random.choice(list(visited_movies))

# start with a walk length of 5000 (High number is for illustration purpose to see evident numbers)
walk_length = 5000

# start with the visit count dictionary, that keeps track of count of each movie visited during the walk:
visit_count={movie:0 for movie in visited_movies}

# start walking
for _ in range(walk_length):
  connected_nodes=graph[current_node]
  if not connected_nodes:
    break
  next_node=random.choice(list(connected_nodes))
  if next_node in visit_count:
    visit_count[next_node]+=1
  current_node=next_node

# Review the visit count after the walk
# visit_count

#### **Step 3: Implement User-Based Recommendation**

In [39]:
def unweighted_pixie_recommend_userbased(user_id, walk_length=5000, num=5):
    

    if user_id in graph:
        visited_movies=graph[user_id]
        current_node=random.choice(list(visited_movies))
        visit_count={movie:0 for movie in visited_movies}
        for _ in range(walk_length):
          connected_nodes=graph[current_node]
          if not connected_nodes:
            break
          next_node=random.choice(list(connected_nodes))
          if next_node in visit_count:
            visit_count[next_node]+=1
          current_node=next_node
        ranked_movies=sorted(visit_count,key=visit_count.get,reverse=True)
        # ranked_movies = [movie for movie in ranked_movies if movie in u_item_df['movie_id'].values]

    else:
        return "User not found in the dataset."
    
    result_df=pd.DataFrame({'Ranking':range(1,num+1),'Movie Name':ranked_movies[:num]})
    result_df.set_index('Ranking',inplace=True)
    return result_df




#### **Step 4: Implement Movie-Based Recommendation**

In [40]:
def unweighted_pixie_recommend_moviebased(movie_name, walk_length=5000, num=5):
  if movie_name not in u_item_df['title'].values:
    return "Movie not found in the dataset."
  else:
    current_node = movie_name
    visit_count={movie:0 for movie in u_item_df['title']}
    for _ in range(walk_length):
      connected_nodes=graph[current_node]
      if not connected_nodes:
        break

      next_node=random.choice(list(connected_nodes))
      if next_node in visit_count:
        visit_count[next_node]+=1
      current_node=next_node
    ranked_movies=sorted(visit_count,key=visit_count.get,reverse=True)
    # ranked_movies = [movie for movie in ranked_movies if movie in u_item_df['movie_id'].values]
    # movie_names = u_item_df.set_index('movie_id').loc[ranked_movies[:num]]['title']
    result_df=pd.DataFrame({'Ranking':range(1,num+1),'Movie Name':ranked_movies[:num]})
    result_df.set_index('Ranking',inplace=True)
    return result_df



#### **Step 5: Running Your Recommendation System**



##### **User-Based Recommendation**


In [41]:

# call the function for any user_id
unweighted_pixie_recommend_userbased(1)

Unnamed: 0_level_0,Movie Name
Ranking,Unnamed: 1_level_1
1,Fargo (1996)
2,Contact (1997)
3,Return of the Jedi (1983)
4,"Silence of the Lambs, The (1991)"
5,Mr. Holland's Opus (1995)



##### **Movie-Based Recommendation**


In [42]:

unweighted_pixie_recommend_moviebased("Jurassic Park (1993)")

Unnamed: 0_level_0,Movie Name
Ranking,Unnamed: 1_level_1
1,Liar Liar (1997)
2,Air Force One (1997)
3,"Silence of the Lambs, The (1991)"
4,"English Patient, The (1996)"
5,Dante's Peak (1997)


#### **Step 6: Understanding the Results**


In [43]:
unweighted_pixie_recommend_moviebased("Rear Window (1954)",walk_length=10,num=5)

Unnamed: 0_level_0,Movie Name
Ranking,Unnamed: 1_level_1
1,Crimson Tide (1995)
2,"Net, The (1995)"
3,Four Weddings and a Funeral (1994)
4,Wolf (1994)
5,Lost Horizon (1937)


In [44]:
unweighted_pixie_recommend_moviebased("abc", walk_length=10, num=5)

'Movie not found in the dataset.'

In [45]:
unweighted_pixie_recommend_userbased(15, walk_length=10, num=5)

Unnamed: 0_level_0,Movie Name
Ranking,Unnamed: 1_level_1
1,Michael Collins (1996)
2,Dante's Peak (1997)
3,Air Force One (1997)
4,Murder at 1600 (1997)
5,Unforgettable (1996)


In [46]:
unweighted_pixie_recommend_userbased(0, walk_length=10, num=5)

'User not found in the dataset.'

In [47]:
unweighted_pixie_recommend_moviebased("Top Gun (1986)",walk_length=10,num=5)

Unnamed: 0_level_0,Movie Name
Ranking,Unnamed: 1_level_1
1,Gone with the Wind (1939)
2,"Mrs. Brown (Her Majesty, Mrs. Brown) (1997)"
3,187 (1997)
4,"Grifters, The (1990)"
5,Rocket Man (1997)


In [48]:
unweighted_pixie_recommend_moviebased("Jurassic Park (1993)",walk_length=15,num=5)

Unnamed: 0_level_0,Movie Name
Ranking,Unnamed: 1_level_1
1,Four Weddings and a Funeral (1994)
2,Independence Day (ID4) (1996)
3,"Princess Bride, The (1987)"
4,Return of the Jedi (1983)
5,Back to the Future (1985)


## Weighted Pixie Recommendation

#### **Step 1: Import Required Libraries**

In [49]:
# Code the function here
import random
import pandas as pd
import sklearn

#### **Step 2: Implement the Random Walk Algorithm**
The task is to **simulate a random walk** from a given starting point in the **bipartite user-movie graph**.


In [50]:
# Updated Graph (with weights):

weighted_graph = {}
for _, row in ratings_df.iterrows():
    user, movie, weight = row['user_id'], row['title'], row['rating']
    # user = 'u-' + str(user)
    if user not in weighted_graph:
        weighted_graph[user] = set()
    if movie not in weighted_graph:
        weighted_graph[movie] = set()
    weighted_graph[user].add((movie, weight))
    weighted_graph[movie].add((user, weight))



In [51]:
# Random Walk simulation

# Let's find movie recommendation for the user_id 5
user_id = 5

# find the movies that user id has visited:
visited_movies=weighted_graph[user_id]
# visited_movies 



# filter the movies that have highest score
max_weight = max(dict(visited_movies).values())
top_nodes = [node for node, weight in dict(visited_movies).items() if weight == max_weight]

# randomly pick a movie to start the walk from:
current_node=random.choice(top_nodes)

# start with a walk length of 5000 (High number is for illustration purpose to see evident numbers)
walk_length = 5000

# # start with the visit count dictionary, that keeps track of count of each movie visited during the walk:
visit_count={movie:0 for movie in (dict(visited_movies)).keys()}
visit_count
# # start walking
for _ in range(walk_length):
    connected_nodes=weighted_graph[current_node]
    if not connected_nodes:
        break
    
    max_weight = max(dict(connected_nodes).values())
    top_nodes = [node for node, weight in dict(connected_nodes).items() if weight == max_weight]
    next_node=random.choice(top_nodes)
    
    if next_node in visit_count:
        visit_count[next_node]+=1
    current_node=next_node
    # if _ == 2:
    #     break


# Review the visit count after the walk
counter = 1
print('Counts\tTitle')
for movie in sorted(visit_count, key=visit_count.get, reverse=True):
    print(str(visit_count[movie]) + '\t' + movie )
    counter = counter + 1
    if counter > 5:
        break

Counts	Title
49	Batman (1989)
44	To Kill a Mockingbird (1962)
43	Return of the Jedi (1983)
43	Star Wars (1977)
42	Snow White and the Seven Dwarfs (1937)


#### **Step 3: Implement User-Based Recommendation**

In [52]:
def weighted_pixie_recommend_userbased(user_id, walk_length=5000, num=10):
    

    if user_id in weighted_graph:
        visited_movies=weighted_graph[user_id]
        max_weight = max(dict(visited_movies).values())
        top_nodes = [node for node, weight in dict(visited_movies).items() if weight == max_weight]
        
        # randomly pick a movie to start the walk from:
        current_node=random.choice(top_nodes)
        visit_count={movie:0 for movie in (dict(visited_movies)).keys()}

        for _ in range(walk_length):
            connected_nodes=weighted_graph[current_node]
            if not connected_nodes:
                break
            
            max_weight = max(dict(connected_nodes).values())
            top_nodes = [node for node, weight in dict(connected_nodes).items() if weight == max_weight]
            next_node=random.choice(top_nodes)
            
            if next_node in visit_count:
                visit_count[next_node]+=1
            current_node=next_node
        ranked_movies=sorted(visit_count,key=visit_count.get,reverse=True)
        # ranked_movies = [movie for movie in ranked_movies if movie in u_item_df['movie_id'].values]

    else:
        return "User not found in the dataset."
    
    result_df=pd.DataFrame({'Ranking':range(1,num+1),'Movie Name':ranked_movies[:num]})
    result_df.set_index('Ranking',inplace=True)
    return result_df


# call the function for any user_id
weighted_pixie_recommend_userbased(1, walk_length=50, num=5)

Unnamed: 0_level_0,Movie Name
Ranking,Unnamed: 1_level_1
1,Field of Dreams (1989)
2,Chasing Amy (1997)
3,Ed Wood (1994)
4,"Usual Suspects, The (1995)"
5,"Net, The (1995)"



#### **Step 4: Implement Movie-Based Recommendation**

In [53]:
def weighted_pixie_recommend_moviebased(movie_name, walk_length=5000, num=5):

    if movie_name not in u_item_df['title'].values:
        return "Movie not found in the dataset."
    else:

    
        # randomly pick a movie to start the walk from:
        current_node=movie_name
        visit_count={movie:0 for movie in u_item_df['title']}

        for _ in range(walk_length):
            connected_nodes=weighted_graph[current_node]
            if not connected_nodes:
                break
            
            max_weight = max(dict(connected_nodes).values())
            top_nodes = [node for node, weight in dict(connected_nodes).items() if weight == max_weight]
            next_node=random.choice(top_nodes)
            
            if next_node in visit_count:
                visit_count[next_node]+=1
            current_node=next_node
        ranked_movies=sorted(visit_count,key=visit_count.get,reverse=True)
        # ranked_movies = [movie for movie in ranked_movies if movie in u_item_df['movie_id'].values]

        result_df=pd.DataFrame({'Ranking':range(1,num+1),'Movie Name':ranked_movies[:num]})
        result_df.set_index('Ranking',inplace=True)
        return result_df


# call the function for any user_id
weighted_pixie_recommend_moviebased('Young Frankenstein (1974)', walk_length=50, num=5)

Unnamed: 0_level_0,Movie Name
Ranking,Unnamed: 1_level_1
1,Ace Ventura: Pet Detective (1994)
2,"Client, The (1994)"
3,Cinderella (1950)
4,Cape Fear (1962)
5,Shadowlands (1993)


#### **Step 5: Running Your Recommendation System**



##### **User-Based Recommendation**


In [54]:
# call the function for any user_id
weighted_pixie_recommend_userbased(1, walk_length=50, num=5)

Unnamed: 0_level_0,Movie Name
Ranking,Unnamed: 1_level_1
1,Star Wars (1977)
2,"Terminator, The (1984)"
3,"Blues Brothers, The (1980)"
4,Swingers (1996)
5,"Usual Suspects, The (1995)"



##### **Movie-Based Recommendation**


In [55]:
# call the function for any user_id
weighted_pixie_recommend_moviebased('Back to the Future (1985)', walk_length = 50, num=5)

Unnamed: 0_level_0,Movie Name
Ranking,Unnamed: 1_level_1
1,Schindler's List (1993)
2,2001: A Space Odyssey (1968)
3,Friday (1995)
4,Ace Ventura: Pet Detective (1994)
5,Home Alone (1990)


## Insights and Findings
1. `walk_length` is a parameter that says how many steps the random walk happens. Higher the walk_length, the more likely it finds some hidden gems (rarely watched movies). Lower walk_length could lead to popular movies in the results, that most people watch
2. `num` is anotherthe variable that limits the number of recommendations

---

## **Submission Requirements:**

To successfully complete this assignment, ensure that you submit the following:


### **1. Jupyter Notebook Submission**
- Submit a **fully completed Jupyter Notebook** that includes:
  - **All implemented recommendation functions** (user-based, item-based, and random walk-based recommendations).
  - **Code explanations** in markdown cells to describe each step.
  - **Results and insights** from running your recommendation models.


### **2. Explanation of Pixie-Inspired Algorithms (3-5 Paragraphs)**
- Write a **detailed explanation** of **Pixie-inspired random walk algorithms** used for recommendations.
- Your explanation should cover:
  - What **Pixie-inspired recommendation systems** are.
  - How **random walks** help in identifying relevant recommendations.
  - Any real-world applications of such algorithms in industry.


### **3. Report for the Submitted Notebook**
Your report should be structured as follows:

#### **Title: Movie Recommendation System Report**

#### **1. Introduction**
- Briefly introduce **movie recommendation systems** and why they are important.
- Explain the **different approaches used** (user-based, item-based, random-walk).

#### **2. Dataset Description**
- Describe the **MovieLens 100K dataset**:
  - Number of users, movies, and ratings.
  - What features were used.
  - Any preprocessing performed.

#### **3. Methodology**
- Explain the three recommendation techniques implemented:
  - **User-based collaborative filtering** (how user similarity was calculated).
  - **Item-based collaborative filtering** (how item similarity was determined).
  - **Random-walk-based Pixie algorithm** (why graph-based approaches are effective).
  
#### **4. Implementation Details**
- Discuss the steps taken to build the functions.
- Describe how the **adjacency list graph** was created.
- Explain how **random walks** were performed and how visited movies were ranked.

#### **5. Results and Evaluation**
- Present **example outputs** from each recommendation approach.
- Compare the different methods in terms of accuracy and usefulness.
- Discuss any **limitations** in the implementation.

#### **6. Conclusion**
- Summarize the key takeaways from the project.
- Discuss potential improvements (e.g., **hybrid models, additional features**).
- Suggest real-world applications of the methods used.

### **Submission Instructions**

- Submit `.zip` file consisting of Jupyter Notebook and all the datafiles (provided) and the ones saved [i.e. `users.csv`, `movies.csv` and `ratings.csv`]. Also, include the Report and Pixie Algorithm explanation document.
- [`Bonus 10 Points`] **Upload your Jupyter Notebook, Explanation Document, and Report** to your GitHub repository.
- Ensure the repository is public and contains:
  - `users.csv`, `movies.csv` and `ratings.csv` [These are the Dataframes which were created in part 1. Save and export them as a `.csv` file]
  - `Movie_Recommendation.ipynb`
  - `Pixie_Algorithm_Explanation.pdf` or `.md`
  - `Recommendation_Report.pdf` or `.md`
- **Submit the GitHub repository link in the cell below.**


#### **Example Submission Format**
```text
GitHub Repository: https://github.com/username/Movie-Recommendation
```

In [None]:
# Submit the Github Link here:


### **Grading Rubric: ITCS 6162 - Data Mining Assignment**


| **Category**                              | **Criteria**                                                     | **Points** |
|-------------------------------------------|----------------------------------------------------------------|------------|
| **Part 1: Exploring and Cleaning Data (15 pts)**  | Properly loads `u.user`, `u.movies`, and `u.item` datasets into DataFrames | 5 |
|                                           | Handles missing values, duplicates, and inconsistencies appropriately | 5 |
|                                           | Saves the cleaned datasets into CSV files: `users.csv`, `movies.csv`, `ratings.csv` | 5 |
| **Part 2: Collaborative Filtering-Based Recommendation (30 pts)** | Implements user-based collaborative filtering correctly | 10 |
|                                           | Implements item-based collaborative filtering correctly | 10 |
|                                           | Computes similarity measures accurately and provides valid recommendations | 10 |
| **Part 3: Graph-Based Recommender (Pixie-Inspired Algorithm) (35 pts)** | Constructs adjacency lists properly from user-movie interactions | 10 |
|                                           | Implements weighted random walk-based recommendation correctly | 15 |
|                                           | Explains and justifies the algorithm design choices (Pixie-inspired) | 10 |
| **Code Quality & Documentation (10 pts)** | Code is well-structured, efficient, and follows best practices | 5 |
|                                           | Markdown explanations and comments are clear and enhance understanding | 5 |
| **Results & Interpretation (5 pts)**      | Provides meaningful insights from the recommendation system's output | 5 |
| **Submission & Report (5 pts)**          | Submits all required files in the correct format (ZIP file with Jupyter notebook, processed CSV files, and project report) | 5 |
| **Total**                                 |                              | 100 |

#### **Bonus (10 pts)**
| **Category**                              | **Criteria**                                                     | **Points** |
|-------------------------------------------|----------------------------------------------------------------|------------|
| **GitHub Submission**                     | Provides a well-documented GitHub repository with CSV files, a structured README, and a properly formatted Jupyter Notebook | 10 |