# MovieLens Data Pipeline with Apache Beam

This notebook demonstrates a complete ETL pipeline built with Apache Beam that analyzes the MovieLens dataset. The pipeline reads movie and rating data, cleans and enriches it, joins the two datasets, and generates five analytical reports on movie ratings, genres, and popularity trends.

## Pipeline Overview
- **Input**: MovieLens CSV files (movies.csv, ratings.csv)
- **Processing**: Data cleaning, joining, aggregations
- **Output**:
  - Average rating by genre  
  - Top-N highest rated movies per genre  
  - Movie statistics by decade  
  - Rating distribution analysis  
  - Popularity analysis (Popular / Moderate / Niche)

## Setup and Imports

In [None]:
# Install Apache Beam (only needs to run once per environment)
!pip install apache-beam

In [73]:
import io
import csv
from typing import Dict, Iterable, List
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

## Helper Functions

These utility functions help with:
- **Safe type conversions:** Convert strings to float/int, returning NaN or `None` instead of raising on bad input
- **CSV formatting:** Properly format data for CSV output

In [74]:
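def to_float_safe(x: str) -> float:
    """Convert x to float; return NaN if it cannot be parsed."""
    try:
        return float(x)
    except (TypeError, ValueError):
        return float("nan")


def to_int_safe(x: str):
    """Convert x to int; return None if it cannot be parsed."""
    try:
        return int(x)
    except (TypeError, ValueError):
        return None


def format_csv_line(fields: Iterable) -> str:
    """Format fields as one CSV line (quoted where needed), without the line terminator."""
    out = io.StringIO()
    csv.writer(out).writerow(list(fields))
    return out.getvalue().rstrip("\r\n")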
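
To see why `format_csv_line` goes through `csv.writer` rather than a plain `",".join`, consider a title that itself contains a comma. The sketch below repeats the helper so that it runs standalone; the sample row and its rating value are made up for illustration.

```python
import csv
import io


def format_csv_line(fields):
    # Same logic as the helper above, repeated so this snippet runs on its own.
    out = io.StringIO()
    csv.writer(out).writerow(list(fields))
    return out.getvalue().rstrip("\r\n")


# Illustrative row (values are invented for this example).
row = ["Comedy", "American President, The (1995)", "3.67"]

naive = ",".join(row)          # the comma inside the title is not protected
quoted = format_csv_line(row)  # csv.writer quotes the title field

print(len(next(csv.reader([naive]))))     # 4: the naive line gains a field
print(next(csv.reader([quoted])) == row)  # True: the quoted line round-trips
```

Many MovieLens titles follow the "Name, The (Year)" pattern, so quoting matters for the Top-N report in particular.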

## Apache Beam Transformations (DoFn Classes)

Apache Beam uses DoFn (Do Function) classes to define data transformations. Each DoFn processes elements in a PCollection (Parallel Collection) and can:
- Filter elements
- Modify data
- Add new fields

### 1. ParseCSV - Converts CSV lines to dictionaries

This DoFn reads CSV text lines and converts them into Python dictionaries with column names as keys.

In [75]:
class ParseCSV(beam.DoFn):
    def __init__(self, header_line: str):
        self.header_line = header_line
        self._fieldnames: List[str] = []

    def setup(self):
        self._fieldnames = next(csv.reader([self.header_line]))

    def process(self, line: str) -> Iterable[Dict]:
        line = line.strip()
        if not line or line == self.header_line:
            return
        
        # parse CSV row and create dictionary
        values = next(csv.reader([line]))
        if len(values) < len(self._fieldnames):
            values = values + [""] * (len(self._fieldnames) - len(values))
        row = dict(zip(self._fieldnames, values))
        yield row
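
The row handling inside `process` can be tried outside Beam. This is a minimal sketch, assuming a hypothetical standalone function `parse_row` (not part of the notebook) and made-up input lines:

```python
import csv


def parse_row(header_line, line):
    """Hypothetical standalone version of the ParseCSV row logic."""
    fieldnames = next(csv.reader([header_line]))
    line = line.strip()
    if not line or line == header_line:
        return None  # blank lines and the header itself are dropped
    values = next(csv.reader([line]))
    # Pad short rows so that every column name gets a value.
    values += [""] * (len(fieldnames) - len(values))
    return dict(zip(fieldnames, values))


header = "movieId,title,genres"
full = parse_row(header, '11,"American President, The (1995)",Comedy|Drama|Romance')
short = parse_row(header, "99,Untitled")
skipped = parse_row(header, header)

print(full["title"])          # American President, The (1995)
print(short["genres"] == "")  # True: the missing column is padded with ""
print(skipped)                # None
```

Note that `zip` silently drops any values beyond the number of header columns, so over-long rows are truncated rather than rejected.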

### 2. PreprocessMovies - Cleans and enriches movie data

This DoFn:
- Extracts year from title: "Toy Story (1995)" → year = 1995
- Calculates decade: 1995 → 1990
- Parses genres: "Action|Adventure" → ["Action", "Adventure"]
- Filters invalid entries (missing year or genres)

In [76]:
class PreprocessMovies(beam.DoFn):
    def process(self, row: Dict) -> Iterable[Dict]:
        title = row.get("title", "")
        year = None
        if "(" in title and ")" in title:
            try:
                year_str = title.split("(")[-1].split(")")[0]
                year = int(year_str)
                if year < 1900 or year > 2025:
                    year = None
            except ValueError:
                pass  # text in the last parentheses is not a number
        if year is None:
            return  # skip movies without valid year
        row = dict(row)  # copy: Beam elements should not be mutated in place
        row["year"] = year
        row["decade"] = (year // 10) * 10
        
        # parse genres 
        genres = row.get("genres", "")
        if not genres or genres == "(no genres listed)":
            return  # skip movies without genres
        genre_list = genres.split("|")
        row["primaryGenre"] = genre_list[0]
        row["allGenres"] = genre_list
        yield row
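
The year and decade rules can be checked in isolation. This sketch uses a regular expression anchored at the end of the title, which is a slightly stricter alternative to the `split`-based code above rather than the notebook's own implementation; `extract_year` is a hypothetical helper name.

```python
import re
from typing import Optional

# Four digits in parentheses at the very end of the title.
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def extract_year(title: str, lo: int = 1900, hi: int = 2025) -> Optional[int]:
    """Return the release year from a MovieLens-style title, or None."""
    match = YEAR_PATTERN.search(title)
    if match is None:
        return None
    year = int(match.group(1))
    return year if lo <= year <= hi else None


year = extract_year("Toy Story (1995)")
decade = (year // 10) * 10
genres = "Action|Adventure".split("|")

print(year, decade, genres[0])  # 1995 1990 Action
print(extract_year("City of Lost Children, The (Cité des enfants perdus, La) (1995)"))  # 1995
print(extract_year("Untitled"))         # None
print(extract_year("Old Film (1890)"))  # None: outside the accepted range
```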

### 3. JoinWithRatings - Joins movies with their ratings

This DoFn:
- Receives results from CoGroupByKey (grouped by movie ID)
- Calculates average rating from all user ratings
- Counts number of ratings per movie
- Filters out movies with fewer than 10 ratings (quality threshold)

In [77]:
class JoinWithRatings(beam.DoFn):
    def process(self, element) -> Iterable[Dict]:
        movie_id, data = element
        movies_list = data.get("movies", [])
        ratings_list = data.get("ratings", [])
        
        # skipping if no movie or ratings found
        if not movies_list or not ratings_list:
            return        
        movie = dict(movies_list[0])  # copy so the grouped input is not mutated

        # calculating average rating from all user ratings
        total_rating = 0
        count = 0
        for rating_row in ratings_list:
            rating = to_float_safe(rating_row.get("rating", ""))
            if rating == rating:  # Check if not NaN
                total_rating += rating
                count += 1
        if count == 0:
            return
        movie["averageRating"] = round(total_rating / count, 2)
        movie["numRatings"] = count
        if count >= 10:
            yield movie
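
For reference, each element arriving at `JoinWithRatings` is a `(key, {tag: values})` pair, where the tags are the dictionary keys passed to `CoGroupByKey`. The sketch below builds such an element by hand and applies the same averaging and threshold rule in plain Python; `summarize` is a hypothetical function and the ratings are invented.

```python
import math


def summarize(element, min_ratings=10):
    """Hypothetical plain-Python mirror of the JoinWithRatings logic."""
    movie_id, data = element
    movies, ratings = list(data["movies"]), list(data["ratings"])
    if not movies or not ratings:
        return None
    values = []
    for r in ratings:
        try:
            v = float(r.get("rating", ""))
        except ValueError:
            continue  # unparseable rating
        if not math.isnan(v):
            values.append(v)
    if len(values) < min_ratings:
        return None
    movie = dict(movies[0])  # copy rather than mutate the grouped input
    movie["averageRating"] = round(sum(values) / len(values), 2)
    movie["numRatings"] = len(values)
    return movie


movie = {"movieId": "1", "title": "Toy Story (1995)"}
ten = [{"rating": "4.0"}] * 5 + [{"rating": "3.0"}] * 5
element = ("1", {"movies": [movie], "ratings": ten})

result = summarize(element)
print(result["averageRating"], result["numRatings"])              # 3.5 10
print(summarize(("1", {"movies": [movie], "ratings": ten[:9]})))  # None
```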

## Pipeline Configuration

Set up configuration parameters for the pipeline.

In [78]:
# configuration parameters
MOVIES_FILE = "data/movies.csv"
RATINGS_FILE = "data/ratings.csv"
OUTPUT_DIR = "outputs"
TOP_N = 10 

# reading headers from files
with open(MOVIES_FILE, "r", encoding="utf-8") as f:
    movies_header = f.readline().strip()

with open(RATINGS_FILE, "r", encoding="utf-8") as f:
    ratings_header = f.readline().strip()

print(f"Movies header: {movies_header}")
print(f"Ratings header: {ratings_header}")

Movies header: movieId,title,genres
Ratings header: userId,movieId,rating,timestamp


## Main Pipeline

### Pipeline Structure:
1. **Read & Parse**: Load CSV files and convert to dictionaries
2. **Preprocess**: Clean and enrich movie data
3. **Join**: Combine movies with ratings using movieId as key
4. **Analyze**: Generate 5 different analytical outputs
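
Two of the reports in step 4 rest on simple bucketing rules: average ratings are truncated to an integer bucket, and movies are labeled by their rating count. A plain-Python sketch of both rules follows; the function names are hypothetical, while the thresholds (100 and 50) are the ones used in this notebook's pipeline code.

```python
def rating_bucket(avg_rating: float) -> str:
    """Label such as "3-4", obtained by truncating the average rating."""
    low = int(avg_rating)
    return f"{low}-{low + 1}"


def popularity(num_ratings: int) -> str:
    """Popular (>= 100 ratings), Moderate (>= 50), otherwise Niche."""
    if num_ratings >= 100:
        return "Popular"
    if num_ratings >= 50:
        return "Moderate"
    return "Niche"


print(rating_bucket(3.67))  # 3-4
print(rating_bucket(4.0))   # 4-5: a boundary value opens the next bucket
print(rating_bucket(5.0))   # 5-6: a perfect average gets its own bucket
print([popularity(n) for n in (10, 50, 99, 100)])
# ['Niche', 'Moderate', 'Moderate', 'Popular']
```

Because the join step already drops movies with fewer than 10 ratings, "Niche" here effectively means 10 to 49 ratings.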

In [79]:
# creating pipeline options
opts = PipelineOptions()
with beam.Pipeline(options=opts) as p:
    
    # Read and preprocess movies
    movies_raw = (
        p 
        | "ReadMovies" >> beam.io.ReadFromText(MOVIES_FILE)
        | "ParseMovies" >> beam.ParDo(ParseCSV(movies_header))
        | "PreprocessMovies" >> beam.ParDo(PreprocessMovies())
        | "KeyMoviesByID" >> beam.Map(lambda r: (r["movieId"], r))
    )
    
    # Read ratings
    ratings_raw = (
        p
        | "ReadRatings" >> beam.io.ReadFromText(RATINGS_FILE)
        | "ParseRatings" >> beam.ParDo(ParseCSV(ratings_header))
        | "KeyRatingsByID" >> beam.Map(lambda r: (r["movieId"], r))
    )
    
    # Join movies with ratings
    movies = (
        {"movies": movies_raw, "ratings": ratings_raw}
        | "CoGroupByKey" >> beam.CoGroupByKey()
        | "JoinWithRatings" >> beam.ParDo(JoinWithRatings())
    )

    # Output 1 - Average rating by genre
    genre_rating = (
        movies
        | "KeyByGenre" >> beam.Map(lambda r: (r["primaryGenre"], r["averageRating"]))
        # Single-pass mean per genre; avoids iterating over each group twice
        | "AvgRating" >> beam.CombinePerKey(beam.combiners.MeanCombineFn())
    )

    rating_header = p | "RatingHeader" >> beam.Create(
        [format_csv_line(("genre", "avg_rating"))]
    )

    rating_rows = (
        genre_rating
        | "FormatRatingRows" >> beam.Map(
            lambda kv: format_csv_line((kv[0], f"{kv[1]:.2f}"))
        )
    )

    _ = ((rating_header, rating_rows)
         | "RatingFlatten" >> beam.Flatten()
         | "WriteRatingCSV" >> beam.io.WriteToText(
                file_path_prefix=f"{OUTPUT_DIR}/genre_avg_rating",
                file_name_suffix=".csv",
                num_shards=1
            )
    )

    # Output 2 - Top-N highest rated movies per genre
    genre_movie_ratings = (
        movies
        | "KeyByGenreMovie" >>
            beam.Map(lambda r: (r["primaryGenre"], 
                               (r["title"], r["averageRating"], r["numRatings"])))
        | "GroupByGenre" >> beam.GroupByKey()
        | "SortByRating" >> beam.Map(
            lambda kv: (kv[0], sorted(kv[1], key=lambda x: x[1], reverse=True))
        )
        | "TakeTopN" >> beam.Map(lambda kv: (kv[0], kv[1][:TOP_N]))
        | "ExplodeTopN" >> beam.FlatMap(
            lambda kv: [(kv[0], title, rating, votes) for (title, rating, votes) in kv[1]]
        )
    )

    topn_header = p | "TopNHeader" >> beam.Create(
        [format_csv_line(("genre", "title", "rating", "num_ratings"))]
    )
    
    topn_rows = (
        genre_movie_ratings
        | "FormatTopNRows" >> beam.Map(
            lambda t: format_csv_line((t[0], t[1], f"{t[2]:.2f}", t[3]))
        )
    )

    _ = ((topn_header, topn_rows)
         | "TopNFlatten" >> beam.Flatten()
         | "WriteTopNCSV" >> beam.io.WriteToText(
                file_path_prefix=f"{OUTPUT_DIR}/top{TOP_N}_movies_by_genre",
                file_name_suffix=".csv",
                num_shards=1
            )
    )

    # Output 3 - Movie count and average rating by decade
    decade_stats = (
        movies
        | "KeyByDecade" >> beam.Map(lambda r: (r["decade"], 
                                               (r["averageRating"], 1)))
        | "GroupByDecade" >> beam.GroupByKey()
        | "ComputeDecadeStats" >> beam.Map(
            lambda kv: (
                kv[0], 
                len(list(kv[1])),
                sum(x[0] for x in kv[1]) / len(list(kv[1]))
            )
        )
    )

    decade_header = p | "DecadeHeader" >> beam.Create(
        [format_csv_line(("decade", "movie_count", "avg_rating"))]
    )

    decade_rows = (
        decade_stats
        | "FormatDecadeRows" >> beam.Map(
            lambda t: format_csv_line((f"{t[0]}s", t[1], f"{t[2]:.2f}"))
        )
    )

    _ = ((decade_header, decade_rows)
         | "DecadeFlatten" >> beam.Flatten()
         | "WriteDecadeCSV" >> beam.io.WriteToText(
                file_path_prefix=f"{OUTPUT_DIR}/decade_statistics",
                file_name_suffix=".csv",
                num_shards=1
            )
    )

    # Output 4 - Rating distribution buckets
    rating_buckets = (
        movies
        | "CreateRatingBuckets" >> beam.Map(
            lambda r: (int(r["averageRating"]), 1)
        )
        | "SumRatingBuckets" >> beam.CombinePerKey(sum)
    )

    rating_bucket_header = p | "RatingBucketHeader" >> beam.Create(
        [format_csv_line(("rating_bucket", "movie_count"))]
    )

    rating_bucket_rows = (
        rating_buckets
        | "FormatRatingBuckets" >> beam.Map(
            lambda kv: format_csv_line((f"{kv[0]}-{kv[0]+1}", kv[1]))
        )
    )

    _ = ((rating_bucket_header, rating_bucket_rows)
         | "RatingBucketFlatten" >> beam.Flatten()
         | "WriteRatingBucketCSV" >> beam.io.WriteToText(
                file_path_prefix=f"{OUTPUT_DIR}/rating_distribution",
                file_name_suffix=".csv",
                num_shards=1
            )
    )

    # Output 5 - Popular vs Niche movies analysis
    popularity_category = (
        movies
        | "CategorizePopularity" >> beam.Map(
            lambda r: (
                "Popular" if r["numRatings"] >= 100 
                else "Moderate" if r["numRatings"] >= 50 
                else "Niche",
                (r["averageRating"], 1)
            )
        )
        | "GroupByPopularity" >> beam.GroupByKey()
        | "ComputePopularityStats" >> beam.Map(
            lambda kv: (
                kv[0],
                len(list(kv[1])),
                sum(x[0] for x in kv[1]) / len(list(kv[1]))
            )
        )
    )

    popularity_header = p | "PopularityHeader" >> beam.Create(
        [format_csv_line(("popularity_category", "movie_count", "avg_rating"))]
    )

    popularity_rows = (
        popularity_category
        | "FormatPopularityRows" >> beam.Map(
            lambda t: format_csv_line((t[0], t[1], f"{t[2]:.2f}"))
        )
    )

    _ = ((popularity_header, popularity_rows)
         | "PopularityFlatten" >> beam.Flatten()
         | "WritePopularityCSV" >> beam.io.WriteToText(
                file_path_prefix=f"{OUTPUT_DIR}/popularity_analysis",
                file_name_suffix=".csv",
                num_shards=1
            )
    )

print("\nPipeline completed successfully!")

Pipeline completed successfully!
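
The decade and popularity reports both compute a count and a mean per key by grouping and then iterating over each group several times. One possible alternative is a combiner that keeps a running `(sum, count)` accumulator. The sketch below writes that logic as a plain class with the four methods that Beam's `CombineFn` interface expects, so it runs without Beam installed; using it in the pipeline would mean subclassing `beam.CombineFn` and passing an instance to `beam.CombinePerKey`, with plain rating values as the inputs. `CountAndMean` is a hypothetical name and the numbers are invented.

```python
class CountAndMean:
    """Accumulator logic in the shape of a Beam CombineFn (plain class here)."""

    def create_accumulator(self):
        return (0.0, 0)  # (running sum, running count)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return (total + value, count + 1)

    def merge_accumulators(self, accumulators):
        total, count = 0.0, 0
        for part_total, part_count in accumulators:
            total += part_total
            count += part_count
        return (total, count)

    def extract_output(self, accumulator):
        total, count = accumulator
        mean = total / count if count else float("nan")
        return (count, mean)


fn = CountAndMean()

# Simulate two workers that each saw part of one key's ratings.
part_a = fn.create_accumulator()
for rating in (3.0, 4.0):
    part_a = fn.add_input(part_a, rating)

part_b = fn.create_accumulator()
for rating in (5.0,):
    part_b = fn.add_input(part_b, rating)

merged = fn.merge_accumulators([part_a, part_b])
print(fn.extract_output(merged))  # (3, 4.0)
```

The accumulator stays constant in size regardless of how many movies share a key, which is the main reason to prefer a combiner over `GroupByKey` followed by a `Map` for this kind of statistic.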
