In [14]:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._

## Case Study 1: Genre-Specific Data Aggregation Pipeline

Objective: Aggregate movie ratings by genre and store the results in Parquet format for analytics.

Scenario: The MovieLens dataset is stored in Google Cloud Storage (GCS) as CSV files. You need to calculate the average rating per genre. Some genre names are formatted inconsistently, so they need a custom transformation before aggregation.

Steps:

1. Ingestion: Load `movies.csv` and `ratings.csv` from GCS as DataFrames.
   - `movies.csv` contains the columns `movieId`, `title`, `genres`.
   - `ratings.csv` contains the columns `userId`, `movieId`, `rating`, `timestamp`.
2. Transformation:
   - Use DataFrames to split and explode the `genres` column into one row per genre (e.g., `Action|Comedy` becomes two rows, `Action` and `Comedy`).
   - Convert to an RDD and apply a custom mapping to normalize inconsistent genre names (e.g., `Sci-Fi` to `Science Fiction`).
3. Aggregation:
   - Join movies and ratings on `movieId` using DataFrames.
   - Use RDD transformations to compute the average rating per genre: map each row to a `(genre, (rating, 1))` pair, sum the pairs with `reduceByKey`, then divide the total by the count.
4. Storage: Convert the RDD back to a DataFrame and save the aggregated results in Parquet format on HDFS.
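
The transformation and aggregation steps above can be sketched on a small in-memory collection, without Spark, to show the logic in isolation. This is only an illustration: the sample rows, the `GenrePipelineSketch` object, and its helper names are hypothetical, and a plain `groupBy` followed by `reduce` stands in for what `reduceByKey` does on an RDD.

```scala
object GenrePipelineSketch {
  // Hypothetical sample rows: (movieId, title, pipe-delimited genres)
  val movies: Seq[(Int, String, String)] = Seq(
    (1, "Movie A", "Action|Sci-Fi"),
    (2, "Movie B", "Comedy"),
    (3, "Movie C", "Sci Fi|Comedy")
  )

  // Hypothetical sample rows: (userId, movieId, rating)
  val ratings: Seq[(Int, Int, Double)] = Seq(
    (10, 1, 4.0),
    (11, 1, 2.0),
    (10, 2, 3.0),
    (12, 3, 5.0)
  )

  // Canonical names for inconsistent spellings (illustrative subset)
  val genreMapping: Map[String, String] = Map(
    "Sci-Fi" -> "Science Fiction",
    "Sci Fi" -> "Science Fiction"
  )

  // "Explode": one (movieId, genre) pair per genre, with the name normalized
  def explodeGenres(rows: Seq[(Int, String, String)]): Seq[(Int, String)] =
    for {
      (movieId, _, genres) <- rows
      genre <- genres.split("\\|").toSeq
    } yield (movieId, genreMapping.getOrElse(genre, genre))

  // Join on movieId, then sum (rating, count) pairs per genre and divide
  def averageByGenre(
      movieGenres: Seq[(Int, String)],
      ratingRows: Seq[(Int, Int, Double)]
  ): Map[String, Double] = {
    val pairs = for {
      (movieId, genre) <- movieGenres
      (_, ratedMovieId, rating) <- ratingRows
      if ratedMovieId == movieId
    } yield (genre, (rating, 1L))

    pairs
      .groupBy(_._1)
      .map { case (genre, rows) =>
        val (sum, count) =
          rows.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
        genre -> (sum / count)
      }
  }

  def main(args: Array[String]): Unit = {
    val averages = averageByGenre(explodeGenres(movies), ratings)
    averages.toSeq.sortBy(_._1).foreach { case (genre, avg) =>
      println(f"$genre%-16s $avg%.2f")
    }
  }
}
```

Carrying a `(sum, count)` pair per key, rather than an average, is what makes the reduction safe to apply in any order: sums and counts combine associatively, while averages of averages do not.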

In [10]:
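// Configure the application to run on YARN.
// The SparkSession `spark` used in later cells is assumed to be provided by the notebook kernel.
val conf = new SparkConf()
      .setAppName("Genre-Specific Data Aggregation Pipeline")
      .setMaster("yarn")

val sc = new SparkContext(conf)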


conf = org.apache.spark.SparkConf@a4e6790
sc = org.apache.spark.SparkContext@61bf7eb2


org.apache.spark.SparkContext@61bf7eb2

In [11]:
// Path to movies.csv in the Cloud Storage bucket
val bucketName = "scala_assgn_bucket"
val filePath = s"gs://$bucketName/ml-32m/movies.csv"

val df = spark.read
  .option("header", "true")      // First row contains column names
  .option("inferSchema", "true") // Infer column types (requires an extra pass over the file)
  .option("quote", "\"")         // Treat text inside double quotes as a single field
  .option("escape", "\"")        // A doubled quote inside a quoted field is a literal quote
  .csv(filePath)

df.show()

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
|      6|         Heat (1995)|Action|Crime|Thri...|
|      7|      Sabrina (1995)|      Comedy|Romance|
|      8| Tom and Huck (1995)|  Adventure|Children|
|      9| Sudden Death (1995)|              Action|
|     10|    GoldenEye (1995)|Action|Adventure|...|
|     11|American Presiden...|Comedy|Drama|Romance|
|     12|Dracula: Dead and...|       Comedy|Horror|
|     13|        Balto (1995)|Adventure|Animati...|
|     14|        Nixon (1995)|               Drama|
|     15|Cutthroat Island ...|Action|Adventure|...|
|     16|       Casino (1995)|         Crime|Drama|
|     17|Sen

bucketName = scala_assgn_bucket
filePath = gs://scala_assgn_bucket/ml-32m/movies.csv
df = [movieId: int, title: string ... 1 more field]


[movieId: int, title: string ... 1 more field]
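
As an alternative to `inferSchema`, the schema can be declared up front, which avoids the extra pass over the file and pins the column types. The sketch below is a hypothetical variant of the read above, not the code that produced the output shown: the column names come from the dataset description, while the column types for `ratings.csv`, the `moviesSchema` and `ratingsSchema` names, and `moviesTypedDF` are assumptions. It relies on the notebook's `spark` and `filePath`.

```scala
import org.apache.spark.sql.types.{DoubleType, IntegerType, LongType, StringType, StructField, StructType}

// Column names are taken from the dataset description; the types are assumptions
val moviesSchema = StructType(Seq(
  StructField("movieId", IntegerType),
  StructField("title", StringType),
  StructField("genres", StringType)
))

val ratingsSchema = StructType(Seq(
  StructField("userId", IntegerType),
  StructField("movieId", IntegerType),
  StructField("rating", DoubleType),
  StructField("timestamp", LongType)
))

// Same options as the original read, with an explicit schema instead of inference
val moviesTypedDF = spark.read
  .option("header", "true")
  .option("quote", "\"")
  .option("escape", "\"")
  .schema(moviesSchema)
  .csv(filePath)

moviesTypedDF.printSchema()
```

Reading `ratings.csv` with `ratingsSchema` in the same way would give numeric `movieId` and `rating` columns, so the join and the average would no longer depend on string conversions.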

In [12]:
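// Split the pipe-delimited genres string into an array ("|" is a regex metacharacter, so it is escaped)
val movie = df.withColumn("genres_list", split(col("genres"), "\\|"))

// Explode the array so that each (movie, genre) pair becomes its own row
val explodedMoviesDF = movie.select(col("movieId"), col("title"), explode(col("genres_list")).as("genres"))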

// Lookup table that maps inconsistent genre spellings to a canonical name.
// "(no genres listed)" and "IMAX" are both folded into a single "None" bucket.
val genreMapping = Map(
    "Sci-Fi" -> "Science Fiction",
    "Sci Fi" -> "Science Fiction",
    "Science-Fiction" -> "Science Fiction",
    "(no genres listed)" -> "None",
    "IMAX" -> "None"
)

// Convert the DataFrame to an RDD and apply the mapping row by row;
// genres without an entry in the mapping are kept unchanged
val standardizedGenresRDD = explodedMoviesDF.rdd.map(row => {
  val movieId = row.getAs[Int]("movieId")
  val title = row.getAs[String]("title")
  val genre = row.getAs[String]("genres")
  val standardizedGenre = genreMapping.getOrElse(genre, genre)
  (movieId, title, standardizedGenre)
})

// Convert back to DataFrame for joining
val standardizedMoviesDF = standardizedGenresRDD.toDF("movieId", "title", "genres")

// Load ratings.csv from Cloud Storage. No schema is inferred here, so every column is read as a string.
val ratingsPath = s"gs://$bucketName/ml-32m/ratings.csv"
val ratingsDF = spark.read.option("header", "true").csv(ratingsPath)

// Join on movieId and keep only the columns needed for the aggregation
val joinedDF = standardizedMoviesDF.join(ratingsDF, "movieId").select("genres", "rating")

// Convert to an RDD of (genre, (rating, 1)) pairs; the 1 is a per-row count
val genreRatingsRDD = joinedDF.rdd.map(row => {
    val genre = row.getAs[String]("genres")
    val rating = row.getAs[String]("rating").toDouble // rating was read as a string
    (genre, (rating, 1))
})

// ReduceByKey to calculate total ratings and counts per genre
val aggregatedRatingsRDD = genreRatingsRDD.reduceByKey {
  case ((rating1, count1), (rating2, count2)) => (rating1 + rating2, count1 + count2)
}

// Map to calculate average rating per genre
val averageRatingsRDD = aggregatedRatingsRDD.map {
  case (genre, (totalRating, count)) => (genre, totalRating / count)
}

// Convert back to DataFrame
val averageRatingsDF = averageRatingsRDD.toDF("genre", "average_rating")

println("Average Ratings DataFrame:")
averageRatingsDF.show()

// Save the results in Parquet format on HDFS, replacing any previous output
val outputPath = "hdfs:///user/shraman_jana/average_ratings.parquet"
averageRatingsDF.write.mode("overwrite").parquet(outputPath)

println(s"Results saved to: $outputPath")



Average Ratings DataFrame:
+---------------+------------------+
|          genre|    average_rating|
+---------------+------------------+
|          Crime|3.6917711184948736|
|           None| 3.585249055125681|
|        Fantasy| 3.512174705402107|
|        Western|3.6001753109842554|
|Science Fiction|3.4916991949223912|
|      Animation|3.6153322869262636|
|       Thriller|3.5317020152396505|
|      Film-Noir| 3.915774014636868|
|         Horror|3.3071549944529486|
|      Adventure|3.5234385724723545|
|        Romance|3.5450028644529983|
|    Documentary|3.6911815290871948|
|            War|3.7916994435766664|
|         Comedy|3.4323856961311248|
|         Action| 3.476407141777424|
|        Musical| 3.554276956937205|
|       Children|3.4392409733948646|
|        Mystery| 3.673102967818112|
|          Drama|3.6824540581800784|
+---------------+------------------+

Results saved to: hdfs:///user/shraman_jana/average_ratings.parquet


movie = [movieId: int, title: string ... 2 more fields]
explodedMoviesDF = [movieId: int, title: string ... 1 more field]
genreMapping = Map(Science-Fiction -> Science Fiction, Sci Fi -> Science Fiction, IMAX -> None, (no genres listed) -> None, Sci-Fi -> Science Fiction)
standardizedGenresRDD = MapPartitionsRDD[20] at map at <console>:46
standardizedMoviesDF = [movieId: int, title: string ... 1 more field]
ratingsPath = gs://scala_assgn_bucket/ml-32m/ratings.csv
ratingsDF = [userId: string, movieId: string ... 2 more fields]


joinedDF: org.apache.spark.sql.Da...


[userId: string, movieId: string ... 2 more fields]
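
As a cross-check on the RDD aggregation, the same averages can be computed with the DataFrame API alone. This is a sketch, not part of the original pipeline: it assumes the `joinedDF` built above (columns `genres` and `rating`, with `rating` still a string), and the name `averageRatingsCheckDF` is hypothetical.

```scala
import org.apache.spark.sql.functions.{avg, col}

// Cast the string rating to double, then let Spark compute the per-genre mean
val averageRatingsCheckDF = joinedDF
  .groupBy("genres")
  .agg(avg(col("rating").cast("double")).as("average_rating"))
  .withColumnRenamed("genres", "genre")
  .orderBy(col("average_rating").desc)

averageRatingsCheckDF.show()
```

One behavioral difference to keep in mind: `avg` skips null ratings, whereas calling `.toDouble` on a null string in the RDD version would fail, so the two approaches should only be expected to agree when the rating column is fully populated.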

In [13]:
// Stop the SparkContext to release cluster resources
sc.stop()
