# MovieLens 100k Dataset Description

The MovieLens 100k dataset contains 100,000 ratings from 943 users on 1,682 movies. It is often used to build and test recommendation systems. The dataset is split into multiple files. Two of the most important are:

## 1. `u.item` (Movie Information)

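- **Contains**: Metadata about each movie  
- **Delimiter**: `|` (pipe symbol)  
- **Columns (simplified):**
  - `movieID` → Unique numeric ID for each movie  
  - `movieName` → The title of the movie (e.g., *Toy Story (1995)*)  
  - `genres` → One or more categories the movie belongs to (e.g., Action, Comedy, Romance)  

In the actual file, the title is followed by release-date and IMDb URL fields, and the genres are stored as a series of 0/1 flag columns rather than as names. The code in this notebook only reads the first two fields.

**Example row (simplified):**

`1|Toy Story (1995)|Animation|Children's|Comedy`

**This means:**
- movieID = 1  
- movieName = "Toy Story (1995)"  
- genres = Animation, Children's, Comedy  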

## 2. `u.data` (User Ratings)

- **Contains**: The ratings users gave to movies  
- **Delimiter**: Tab (the parsing code in this notebook uses `split()`, which splits on any whitespace)  
- **Columns:**
  - `userID` → The user who rated the movie  
  - `movieID` → The movie being rated (matches with u.item)  
  - `rating` → A number from 1 to 5 (higher = better)  
  - `timestamp` → When the rating was given  

**Example row:**

`196 242 3 881250949`

**This means:**
- userID = 196  
- movieID = 242  
- rating = 3 (out of 5)  
- timestamp = 881250949 (Unix time)  
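
As a small illustration of the row layout described above, the following plain-Python sketch splits the example row and converts the Unix timestamp into a calendar date. The variable names are assumptions chosen for this example; the notebook itself discards the timestamp.

```python
from datetime import datetime, timezone

# Example row from above; the real u.data file separates the fields with tabs.
line = "196\t242\t3\t881250949"

# split() with no argument splits on any run of whitespace (tabs or spaces).
user_id, movie_id, rating, timestamp = line.split()

# A Unix timestamp counts seconds since 1970-01-01 00:00:00 UTC.
rated_at = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)

print(int(user_id), int(movie_id), float(rating), rated_at.date().isoformat())
# prints: 196 242 3.0 1997-12-04
```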

___

## Spark ALS Recommendation Setup

In [2]:
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row
from pyspark.sql.functions import lit

`lit()` creates a column that holds the same constant value in every row.  

For example, to add a column userID = 0 to every row of a movie DataFrame, you can use `withColumn('userID', lit(0))`.


## Define Helper Functions

In [24]:
# Load up movie ID -> movie name dictionary
def loadMovieNames():
    """Return a dict where movieNames[movieID] = movieName."""
    movieNames = {}
    # Note: we specify encoding='ISO-8859-1' to correctly handle special characters in the file.
    with open("u.item", encoding='ISO-8859-1') as f:
        for line in f:
            fields = line.split('|')
            movieNames[int(fields[0])] = fields[1]
    return movieNames

# Convert u.data lines into (userID, movieID, rating) rows
def parseInput(line):
    # `line` is a Row produced by spark.read.text(); its text is in the `value` field.
    fields = line.value.split()
    return Row(userID=int(fields[0]), movieID=int(fields[1]), rating=float(fields[2]))


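### Understanding the `|` Separator and Iteration in the Code

- The character `|` is **already in the dataset** as a delimiter (separator).  
- It separates each column of information in the file:  
  - **First column**: Movie ID  
  - **Second column**: Movie name  
  - **Remaining columns**: Additional metadata such as release date and genre information  
- `line.split('|')` turns one row into a list of these columns, so `fields[0]` is the ID and `fields[1]` is the name.  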



### Iteration with `for line in f:`

- The key point: the loop is driven by `for line in f`, not by `fields[0]`.  
- Each time through the loop, `line` becomes the **next row** in the file, and `fields` is recomputed from that row.  
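
The loop behaviour described above can be tried without the real file. This is a minimal sketch that assumes a two-line, in-memory stand-in for `u.item` in the simplified layout; `sample` and `movie_names` are names invented for the example.

```python
import io

# Hypothetical two-line stand-in for u.item (simplified layout, not the real file).
sample = io.StringIO(
    "1|Toy Story (1995)|Animation|Children's|Comedy\n"
    "2|GoldenEye (1995)|Action|Adventure|Thriller\n"
)

movie_names = {}
for line in sample:  # each pass binds `line` to the next row of the file
    fields = line.rstrip("\n").split("|")
    movie_names[int(fields[0])] = fields[1]

print(movie_names)
# prints: {1: 'Toy Story (1995)', 2: 'GoldenEye (1995)'}
```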

---


## Initialize Spark

In [4]:
# Create a SparkSession
spark = SparkSession.builder.appName("MovieRecs").getOrCreate()
spark.conf.set("spark.sql.crossJoin.enabled", "true")

## Load Movie Names

In [12]:
from google.colab import files

# Upload multiple files
uploaded = files.upload()

# View the names of uploaded files
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])
    ))


Saving u.data to u.data
Saving u.item to u (1).item
User uploaded file "u.data" with length 2079229 bytes
User uploaded file "u (1).item" with length 236344 bytes


In [25]:
# Load up movie ID -> name dictionary
movieNames = loadMovieNames()

## Load and Prepare Ratings Data

In [27]:
# Load raw data from the local u.data file uploaded above
lines = spark.read.text("u.data").rdd

# Convert it to an RDD of Row objects with (userID, movieID, rating)
ratingsRDD = lines.map(parseInput)

# Convert RDD to DataFrame and cache
ratings = spark.createDataFrame(ratingsRDD).cache()


**RDD** stands for Resilient Distributed Dataset:

- **Resilient**: Fault tolerant. If a node fails, Spark can rebuild the lost partitions by replaying the operations that produced them.
- **Distributed**: The data is split into partitions spread across multiple machines in a cluster, not stored on a single computer.
- **Dataset**: A collection of records.

An RDD can be understood as a large distributed list (or array) that Spark splits into many smaller pieces and processes in parallel on different machines.
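
To make the "split into pieces, process each piece, gather the results" idea concrete, here is a conceptual sketch in plain Python. It is not Spark code: ordinary lists stand in for partitions, and `split_into_partitions` and `parse` are hypothetical helpers written for this illustration.

```python
# Illustrative lines in the u.data layout (timestamp omitted for brevity).
records = ["196 242 3", "186 302 3", "22 377 1", "244 51 2"]

def split_into_partitions(items, num_partitions):
    """Deal items round-robin into num_partitions separate lists."""
    return [items[i::num_partitions] for i in range(num_partitions)]

def parse(line):
    """Turn 'userID movieID rating' into a typed tuple."""
    user_id, movie_id, rating = line.split()
    return int(user_id), int(movie_id), float(rating)

partitions = split_into_partitions(records, 2)

# Each partition is processed independently (in Spark, possibly on different machines).
parsed_partitions = [[parse(line) for line in part] for part in partitions]

# Collecting gathers the per-partition results back into a single list.
collected = [row for part in parsed_partitions for row in part]
print(len(partitions), collected)
```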

---

1️⃣ **`lines = spark.read.text("u.data").rdd`**

- `spark.read.text("u.data")` → Reads the uploaded u.data file, one record per line.  
  This file contains user ratings for movies. Each line looks like:  
  `196 242 3 881250949`

  **Explanation of each value:**
  - `196` → userID  
  - `242` → movieID  
  - `3` → rating  
  - `881250949` → timestamp  

- `.rdd` → Converts the read data into an **RDD (Resilient Distributed Dataset)**.  
  The RDD is the fundamental data structure in Spark for distributed computing.  
  You can think of it as a large collection where each element is one record, which Spark can split across multiple nodes for parallel processing.

2️⃣ **`ratingsRDD = lines.map(parseInput)`**

- `lines.map(parseInput)` → Applies the previously defined `parseInput` function to each row.  

**What `parseInput` does:**
- Takes a line like "196 242 3 881250949", splits it, and keeps the first 3 values (the timestamp is dropped).  
- Converts it into a Row object: `Row(userID=196, movieID=242, rating=3.0)`  
- A Row behaves like a named tuple: fields can be read by name (`row.userID` or `row['userID']`), and Spark uses those names as the DataFrame column names.  

**Result:**
- `ratingsRDD` is an RDD containing all ratings, where each record is a Row object.

3️⃣ **`ratings = spark.createDataFrame(ratingsRDD).cache()`**

- spark.createDataFrame(ratingsRDD) → Converts the RDD into a **DataFrame**.  
Think of a DataFrame like an Excel table:

| userID | movieID | rating |
|--------|---------|--------|
| 196    | 242     | 3.0    |

- A DataFrame makes it easier to filter, sort, aggregate, and perform other operations.  
- `.cache()` → Marks the DataFrame to be kept in memory once it has been computed, so subsequent operations on it run faster.

___


## Train ALS Model

In [28]:
# Create an ALS collaborative filtering model from the complete data set
als = ALS(maxIter=5, regParam=0.01, userCol="userID", itemCol="movieID", ratingCol="rating")
model = als.fit(ratings)


Explanation:

ALS stands for Alternating Least Squares. It sounds complicated, but it can be understood like this:

It treats each user as a vector (an array of numbers) that records their "preference features" for different movies. It treats each movie as a vector that records the movie's "features". By multiplying the user vector and the movie vector (a dot product), it can predict how a user might rate a movie.

- `maxIter=5` → Number of alternating passes the algorithm runs.
- `regParam=0.01` → Regularization strength. It helps prevent overfitting (in simple terms: learn the patterns rather than memorize the ratings).
- `userCol` / `itemCol` / `ratingCol` → Tell the algorithm which column is the user, which is the movie, and which is the rating.


---

Vectors are basically lists or arrays of numbers.

Example:

[1, 0, 1]  
[0.8, 0.1, 0.9]  

Each number represents the strength of a certain feature. The genre labels used below are only for illustration: the features ALS learns are unlabeled numbers that do not necessarily correspond to genres.

Example: Movie feature vectors

| Movie     | Action | Comedy | Animation |
|-----------|--------|--------|-----------|
| Toy Story | 0      | 0.1    | 1         |
| Die Hard  | 1      | 0      | 0         |

0 = does not have this feature  
1 = very strong feature  
0.1 = a little of this feature  

User vectors are similar:

| User  | Action | Comedy | Animation |
|-------|--------|--------|-----------|
| Alice | 0.9    | 0.2    | 0.1       |
| Bob   | 0.1    | 0.7    | 0.9       |

Higher numbers → stronger preference for that feature.



### How ALS uses vectors to predict ratings

Suppose we want to predict Alice's rating for Toy Story:

Alice's vector = [0.9, 0.2, 0.1]  
Toy Story's vector = [0, 0.1, 1]  

The predicted rating is the **dot product** of the two vectors:

Predicted rating = (0.9 * 0) + (0.2 * 0.1) + (0.1 * 1) = 0 + 0.02 + 0.1 = 0.12

The larger the dot product, the more the user is predicted to like the movie. Here the score is low because Alice mainly likes Action, while Toy Story is almost entirely Animation.

Intuition: The more the user's preference vector aligns with the movie's feature vector, the higher the predicted rating.
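
The dot-product arithmetic above can be checked with a few lines of plain Python. This sketch reuses the illustrative vectors from the tables; the `dot` helper and the dictionary names are assumptions made for the example, not part of Spark.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Illustrative vectors from the tables above, in (Action, Comedy, Animation) order.
users = {"Alice": [0.9, 0.2, 0.1], "Bob": [0.1, 0.7, 0.9]}
movies = {"Toy Story": [0, 0.1, 1], "Die Hard": [1, 0, 0]}

scores = {
    (user, movie): round(dot(user_vec, movie_vec), 2)
    for user, user_vec in users.items()
    for movie, movie_vec in movies.items()
}

for (user, movie), score in scores.items():
    print(f"{user} -> {movie}: {score}")
```

With these numbers, Bob scores Toy Story far higher than Alice does (0.97 versus 0.12), because his vector puts most of its weight on Animation.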

---

### Why ALS uses "Alternating Least Squares"

We do not know the user vectors or the movie vectors initially, but we know some ratings (e.g., Alice gave Die Hard a 5).

ALS approach:

1. Start with random movie vectors, hold them fixed, and solve for the user vectors that bring the predicted ratings closest to the real ratings (a least-squares problem).  
2. Hold the user vectors fixed and solve for the movie vectors in the same way.  
3. Alternate between the two steps for a set number of iterations (`maxIter`) so the predicted ratings move closer to the real ones.  

In the end, we get a vector for each user and each movie, which lets us predict a rating for any user/movie pair, including movies the user has not rated yet.
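
The alternating steps above can be sketched in plain Python for the simplest possible case, where every user and every movie is described by a single number (a rank-1 model). This is an illustration under stated assumptions, not Spark's implementation: the toy ratings are hypothetical, the movie factors start at a fixed value instead of random ones so the run is reproducible, and `solve` and `objective` are helper names invented here.

```python
# Hypothetical toy ratings: (user, movie) -> rating.
ratings = {
    ("alice", "die_hard"): 5.0,
    ("alice", "toy_story"): 2.0,
    ("bob", "toy_story"): 1.0,
    ("bob", "aliens"): 2.0,
    ("carol", "die_hard"): 4.0,
    ("carol", "aliens"): 4.0,
}
reg = 0.01  # regularization strength (plays the role of regParam)

users = sorted({u for u, _ in ratings})
movies = sorted({m for _, m in ratings})
user_f = {u: 0.0 for u in users}
movie_f = {m: 1.0 for m in movies}  # fixed start instead of random values

def solve(pairs):
    """Closed-form regularized least squares for one scalar factor.

    pairs is a list of (rating, fixed_factor_of_the_other_side).
    """
    numerator = sum(r * f for r, f in pairs)
    denominator = sum(f * f for _, f in pairs) + reg
    return numerator / denominator

def objective():
    """Squared error on the known ratings plus the regularization penalty."""
    error = sum((r - user_f[u] * movie_f[m]) ** 2 for (u, m), r in ratings.items())
    penalty = reg * (sum(f * f for f in user_f.values())
                     + sum(f * f for f in movie_f.values()))
    return error + penalty

history = []
for _ in range(10):
    # Step 1: hold the movie factors fixed and solve for each user factor.
    for u in users:
        user_f[u] = solve([(r, movie_f[m]) for (uu, m), r in ratings.items() if uu == u])
    # Step 2: hold the user factors fixed and solve for each movie factor.
    for m in movies:
        movie_f[m] = solve([(r, user_f[u]) for (u, mm), r in ratings.items() if mm == m])
    history.append(objective())

# Predict a pair that has no rating in the toy data.
prediction = user_f["alice"] * movie_f["aliens"]
print(history[0], history[-1], prediction)
```

Because each step exactly minimizes the objective for one side while the other is held fixed, the recorded objective values never increase from one pass to the next.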


---

## Print out ratings from user 0

In [29]:
print("\nRatings for user ID 0:")
userRatings = ratings.filter("userID = 0")
for rating in userRatings.collect():
    print(movieNames[rating['movieID']], rating['rating'])



Ratings for user ID 0:
Star Wars (1977) 5.0
Empire Strikes Back, The (1980) 5.0
Gone with the Wind (1939) 1.0


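How the lookup works, using an illustrative row:

- `rating = Row(userID=0, movieID=1, rating=5.0)`
- `rating['movieID']` → 1
- `movieNames[1]` → "Toy Story (1995)"
- `rating['rating']` → 5.0

The `movieID` stored in the ratings table is the key used to look up the title in `movieNames`.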

## Generate Top 20 Recommendations for User 0

In [30]:
print("\nTop 20 recommendations:")
# Find movies rated more than 100 times
ratingCounts = ratings.groupBy("movieID").count().filter("count > 100")

# Construct a "test" dataframe for user 0 with every movie rated more than 100 times
popularMovies = ratingCounts.select("movieID").withColumn('userID', lit(0))

# Run model on that list of popular movies for user ID 0
recommendations = model.transform(popularMovies)

# Get the top 20 movies with the highest predicted rating for this user
topRecommendations = recommendations.sort(recommendations.prediction.desc()).take(20)

for recommendation in topRecommendations:
    print(movieNames[recommendation['movieID']], recommendation['prediction'])



Top 20 recommendations:
GoodFellas (1990) 6.379866123199463
Terminator, The (1984) 6.127628803253174
Die Hard (1988) 6.111161231994629
Pulp Fiction (1994) 6.1084747314453125
Army of Darkness (1993) 6.076435565948486
Terminator 2: Judgment Day (1991) 5.834048271179199
Good, The Bad and The Ugly, The (1966) 5.817391395568848
Chasing Amy (1997) 5.654448986053467
Alien (1979) 5.617126941680908
Kingpin (1996) 5.549688339233398
Reservoir Dogs (1992) 5.48178243637085
Godfather: Part II, The (1974) 5.452390193939209
Mars Attacks! (1996) 5.450388431549072
Raiders of the Lost Ark (1981) 5.438565254211426
Taxi Driver (1976) 5.411818027496338
Raging Bull (1980) 5.348715305328369
Beavis and Butt-head Do America (1996) 5.346415042877197
Platoon (1986) 5.344396114349365
Fifth Element, The (1997) 5.335829257965088
Blues Brothers, The (1980) 5.3260722160339355


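Summary:
These are the movies ALS predicts user 0 will like the most, based on the learned user and movie vectors. Higher predicted ratings indicate a stronger match between the user's preferences and the movie features. Some predictions exceed 5 because the dot product is not clipped to the 1 to 5 rating scale, so the values are best read as a ranking score.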
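
The recommendation step (keep popular movies, score them, sort, take the top N) can be mirrored in plain Python. The sketch below uses hypothetical counts and predictions rather than output from the model, and `top_n` is a helper name invented for the example.

```python
# Hypothetical numbers for illustration; not produced by the model above.
rating_counts = {"A": 250, "B": 80, "C": 130, "D": 400}
predictions = {"A": 4.2, "B": 4.9, "C": 3.1, "D": 4.6}

def top_n(predictions, rating_counts, min_count, n):
    """Return the n best-scoring movies among those rated more than min_count times."""
    popular = [movie for movie, count in rating_counts.items() if count > min_count]
    ranked = sorted(popular, key=lambda movie: predictions[movie], reverse=True)
    return ranked[:n]

print(top_n(predictions, rating_counts, min_count=100, n=2))
# prints: ['D', 'A']
```

Movie B has the highest predicted score but is filtered out because it has too few ratings, which is the purpose of the popularity filter.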

# Explaining ALS Predictions

## What ALS Stores

When you train an ALS model, it learns:

- **User vectors (user factors)**  
  Example: User 0's vector `[0.9, 0.2, 0.1]`  
  Each number represents the user's preference for a type of movie (e.g., Action, Comedy, Animation…).

- **Movie vectors (item factors)**  
  Example: Toy Story's vector `[0, 0.1, 1]`  
  Each number represents the strength of a feature in the movie.

These vectors are learned during training. They start out random and are then optimized alternately until predicted ratings are close to the real ratings.

---

## Calculating Predicted Ratings with Dot Product

For each (userID, movieID) row, `model.transform(popularMovies)` computes:

$$\text{prediction} = \mathbf{u} \cdot \mathbf{i} = u_1 i_1 + u_2 i_2 + u_3 i_3 + \dots$$

where $\mathbf{u}$ is the user vector and $\mathbf{i}$ is the movie vector.

**Example:**

- User 0 vector: `[0.9, 0.2, 0.1]`  
- Toy Story vector: `[0, 0.1, 1]`  

Predicted rating:

0.9 * 0 + 0.2 * 0.1 + 0.1 * 1 = 0.12

This `0.12` is the **prediction**.  
A higher dot product → the user is more likely to rate the movie highly.

---

## Why Dot Product Predicts Ratings

- **Intuition:**  
  - User vector represents the features the user likes  
  - Movie vector represents the movie's features  
  - The more they align, the higher the predicted rating

- **Mathematical goal:**  
  ALS aims to find vectors such that their dot product is as close as possible to the user's actual rating.
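
The mathematical goal can be written as a regularized least-squares objective. The notation is an assumption made for this sketch: $K$ is the set of (user, movie) pairs with a known rating, $r_{am}$ is the rating user $a$ gave movie $m$, $\mathbf{u}_a$ and $\mathbf{i}_m$ are the user and movie vectors, and $\lambda$ plays the role of `regParam`. Spark's implementation may differ in details such as how the regularization term is scaled.

```latex
\min_{\{\mathbf{u}_a\},\,\{\mathbf{i}_m\}}
\sum_{(a,m) \in K} \left( r_{am} - \mathbf{u}_a \cdot \mathbf{i}_m \right)^2
+ \lambda \left( \sum_a \lVert \mathbf{u}_a \rVert^2 + \sum_m \lVert \mathbf{i}_m \rVert^2 \right)
```

With the movie vectors held fixed, this becomes an ordinary least-squares problem in the user vectors, and vice versa, which is why the algorithm alternates between the two.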

---

## Stop Spark Session

In [31]:
spark.stop()
