<a href="https://colab.research.google.com/github/EonTechie/Big_Data_Processing_Spark_Projects/blob/main/spark-sql-tasks/TagBasedAverageRatingAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#Filiz-Yıldız-Part_2_Question_1
"""
Objective:
The goal of Question 1 was to load, explore, and prepare the MovieLens dataset for further analysis.

Steps Performed:
Google Drive Mounting:
I mounted Google Drive using drive.mount() to access the dataset files stored in my drive.

Reading CSV Files:
I loaded ratings.csv and movies.csv using Spark’s DataFrame API with inferSchema=True and header=True options to automatically detect data types and column names.

Exploring Data:
I used .show() and .count() to understand the structure and size of both datasets.

Joining Data:
I joined the ratings and movies DataFrames on the movieId column to combine user ratings with movie titles and genres.

Genre Splitting:
Since movies can have multiple genres, I split the genres column by the | character and used explode() to create a separate row for each genre. This step made genre-based analysis possible.

Basic Statistics:
I counted the number of rows, distinct users, and genre values to get an idea of data coverage and variety.

Conclusion:
The dataset was successfully loaded, cleaned, and transformed into a format that provides the requested information.
"""

from google.colab import drive
drive.mount('/content/drive')

import os

folder_path = "/content/drive/My Drive/datasets/ml-latest-small"
files = os.listdir(folder_path)
print(files)


Mounted at /content/drive
['movies.csv', 'README.txt', 'ratings.csv', 'links.csv', 'tags.csv']


In [None]:
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession with the given application name
spark = SparkSession.builder.appName("Question_2_Part_1-Movielens-1").getOrCreate()

In [None]:
# Only 'col' is needed in this notebook; importing 'round' from
# pyspark.sql.functions would also shadow Python's built-in round()
from pyspark.sql.functions import col

# Load the 'tags.csv' file from the MovieLens dataset into a DataFrame
# - 'inferSchema': Automatically detects column data types
# - 'header': First row is treated as column names
df_tags = spark.read \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .csv("/content/drive/My Drive/datasets/ml-latest-small/tags.csv")

# Display the first few records to understand the structure and content
df_tags.show()

+------+-------+-----------------+----------+
|userId|movieId|              tag| timestamp|
+------+-------+-----------------+----------+
|     2|  60756|            funny|1445714994|
|     2|  60756|  Highly quotable|1445714996|
|     2|  60756|     will ferrell|1445714992|
|     2|  89774|     Boxing story|1445715207|
|     2|  89774|              MMA|1445715200|
|     2|  89774|        Tom Hardy|1445715205|
|     2| 106782|            drugs|1445715054|
|     2| 106782|Leonardo DiCaprio|1445715051|
|     2| 106782|  Martin Scorsese|1445715056|
|     7|  48516|     way too long|1169687325|
|    18|    431|        Al Pacino|1462138765|
|    18|    431|         gangster|1462138749|
|    18|    431|            mafia|1462138755|
|    18|   1221|        Al Pacino|1461699306|
|    18|   1221|            Mafia|1461699303|
|    18|   5995|        holocaust|1455735472|
|    18|   5995|       true story|1455735479|
|    18|  44665|     twist ending|1456948283|
|    18|  52604|  Anthony Hopkins|

In [None]:
# Load the 'ratings.csv' file from the MovieLens dataset into a DataFrame
# - 'inferSchema': Automatically detects column data types
# - 'header': First row is treated as column names
df_ratings = spark.read \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .csv("/content/drive/My Drive/datasets/ml-latest-small/ratings.csv")

# Display the first few records to see rating data (userId, movieId, rating, timestamp)
df_ratings.show()


+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
|     1|     70|   3.0|964982400|
|     1|    101|   5.0|964980868|
|     1|    110|   4.0|964982176|
|     1|    151|   5.0|964984041|
|     1|    157|   5.0|964984100|
|     1|    163|   5.0|964983650|
|     1|    216|   5.0|964981208|
|     1|    223|   3.0|964980985|
|     1|    231|   5.0|964981179|
|     1|    235|   4.0|964980908|
|     1|    260|   5.0|964981680|
|     1|    296|   3.0|964982967|
|     1|    316|   3.0|964982310|
|     1|    333|   5.0|964981179|
|     1|    349|   4.0|964982563|
+------+-------+------+---------+
only showing top 20 rows



In [None]:
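# Join 'df_tags' and 'df_ratings' DataFrames on 'movieId' using inner join
# This keeps only the movies that exist in both DataFrames
# Because the join key is 'movieId' alone, every tag of a movie is paired with
# every rating of that movie, and the result carries 'userId' and 'timestamp'
# twice (once from each DataFrame)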
df_ = df_tags.join(df_ratings, on="movieId", how="inner")

# Show the first few joined records (combined tag and rating info for each movie)
df_.show()


+-------+------+---------------+----------+------+------+---------+
|movieId|userId|            tag| timestamp|userId|rating|timestamp|
+-------+------+---------------+----------+------+------+---------+
|      1|   567|            fun|1525286013|     1|   4.0|964982703|
|      1|   474|          pixar|1137206825|     1|   4.0|964982703|
|      1|   336|          pixar|1139045764|     1|   4.0|964982703|
|      3|   289|            old|1143424860|     1|   4.0|964981247|
|      3|   289|          moldy|1143424860|     1|   4.0|964981247|
|     47|   474|  serial killer|1137206452|     1|   5.0|964983815|
|     47|   424|   twist ending|1457842458|     1|   5.0|964983815|
|     47|   424|        mystery|1457842470|     1|   5.0|964983815|
|     50|   474|          heist|1137206826|     1|   5.0|964982931|
|     50|   424|   twist ending|1457842306|     1|   5.0|964982931|
|     50|   424|         tricky|1457842340|     1|   5.0|964982931|
|     50|   424|       thriller|1457842332|     
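
The join above matches rows on movieId only, so a tag is repeated once for every rating its movie received. The sketch below imitates that behaviour in plain Python on a few in-memory rows, so it runs without Spark. The helper name `inner_join_on_movie_id`, the renamed `tag_userId`/`rating_userId` fields, and the second rating for movie 1 are assumptions made for illustration, not part of the notebook or the dataset.

```python
# Small stand-ins for df_tags and df_ratings (illustrative rows only).
tags = [
    {"userId": 567, "movieId": 1, "tag": "fun"},
    {"userId": 474, "movieId": 1, "tag": "pixar"},
    {"userId": 289, "movieId": 3, "tag": "old"},
    {"userId": 2, "movieId": 60756, "tag": "funny"},  # no rating in this sample
]
ratings = [
    {"userId": 1, "movieId": 1, "rating": 4.0},
    {"userId": 5, "movieId": 1, "rating": 3.0},  # hypothetical second rating
    {"userId": 1, "movieId": 3, "rating": 4.0},
]


def inner_join_on_movie_id(tag_rows, rating_rows):
    """Pair every tag row with every rating row that shares its movieId."""
    ratings_by_movie = {}
    for rating_row in rating_rows:
        ratings_by_movie.setdefault(rating_row["movieId"], []).append(rating_row)

    joined_rows = []
    for tag_row in tag_rows:
        # Movies without any rating contribute nothing (inner join semantics).
        for rating_row in ratings_by_movie.get(tag_row["movieId"], []):
            joined_rows.append({
                "movieId": tag_row["movieId"],
                "tag_userId": tag_row["userId"],
                "tag": tag_row["tag"],
                "rating_userId": rating_row["userId"],
                "rating": rating_row["rating"],
            })
    return joined_rows


joined = inner_join_on_movie_id(tags, ratings)
for row in joined:
    print(row)
```

In this sample the two tags of movie 1 each appear twice (once per rating), the tag of movie 3 appears once, and the tag of the unrated movie is dropped. The separate `tag_userId` and `rating_userId` fields are one way to avoid the ambiguous duplicate `userId` column that the Spark output shows.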

In [None]:
# Select only the 'tag' and 'rating' columns from the joined DataFrame
df = df_.select("tag", "rating")

# Show the selected data: each tag with its corresponding rating
df.show()


+---------------+------+
|            tag|rating|
+---------------+------+
|            fun|   4.0|
|          pixar|   4.0|
|          pixar|   4.0|
|            old|   4.0|
|          moldy|   4.0|
|  serial killer|   5.0|
|   twist ending|   5.0|
|        mystery|   5.0|
|          heist|   5.0|
|   twist ending|   5.0|
|         tricky|   5.0|
|       thriller|   5.0|
|       suspense|   5.0|
|       mindfuck|   5.0|
|         quirky|   5.0|
|off-beat comedy|   5.0|
|          crime|   5.0|
|       Scotland|   4.0|
|    sword fight|   4.0|
|        revenge|   4.0|
+---------------+------+
only showing top 20 rows



In [None]:
# Count how many unique (distinct) tags exist in the DataFrame
df.select("tag").distinct().count()


1584

In [None]:
from pyspark.sql.functions import avg

# Group data by 'tag' and calculate average rating for each tag
# Rename the result column as 'average_rating'
df_avg = df.groupBy("tag").agg(avg("rating").alias("average_rating"))

# Show the average rating per tag
df_avg.show()


+--------------------+------------------+
|                 tag|    average_rating|
+--------------------+------------------+
|              ransom|3.9245283018867925|
|              freaks|3.7577319587628866|
|wrongful imprison...| 4.429022082018927|
|        Heartwarming|4.1477272727272725|
|               anime| 4.002923976608187|
|  intelligent sci-fi| 3.776923076923077|
|               1970s|3.7934782608695654|
|                 art|             3.675|
|             lyrical|              3.58|
|                hope|3.4166666666666665|
|          creativity|               5.0|
|       John Travolta| 4.197068403908795|
|intertwining stor...| 4.197068403908795|
|        conversation| 4.197068403908795|
|              sequel|3.6893305439330546|
|               macho| 3.581967213114754|
|          Emma Stone|3.8773584905660377|
|           Wolverine| 3.723684210526316|
|               mafia| 3.649193548387097|
|          television|3.5970149253731343|
+--------------------+------------
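
The grouping step computes, for each tag, the mean of all ratings paired with it. As a rough illustration of that arithmetic, here is a plain-Python sketch that needs no Spark. The function name `average_rating_per_tag` and the sample (tag, rating) pairs are assumptions chosen for the example, not values taken from the full dataset.

```python
from collections import defaultdict

# Illustrative (tag, rating) pairs, shaped like the rows of the 'df' DataFrame.
tag_rating_pairs = [
    ("pixar", 4.0),
    ("pixar", 3.0),
    ("fun", 4.0),
    ("twist ending", 5.0),
    ("twist ending", 5.0),
]


def average_rating_per_tag(pairs):
    """Return {tag: mean rating}, mirroring groupBy('tag').agg(avg('rating'))."""
    totals = defaultdict(float)  # running sum of ratings per tag
    counts = defaultdict(int)    # number of ratings seen per tag
    for tag, rating in pairs:
        totals[tag] += rating
        counts[tag] += 1
    return {tag: totals[tag] / counts[tag] for tag in totals}


averages = average_rating_per_tag(tag_rating_pairs)
print(averages)
```

Tags are compared as exact strings here, as in the Spark grouping, so differently cased spellings such as "mafia" and "Mafia" would form separate groups.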

In [None]:
# Count how many unique tags have an average rating calculated
df_avg.count()


1584

In [None]:
# Show tags sorted by average rating in descending order (highest rated first)
df_avg.orderBy(col("average_rating").desc()).show()

+--------------------+-----------------+
|                 tag|   average_rating|
+--------------------+-----------------+
|          creativity|              5.0|
|        human rights|              5.0|
|          procedural|              5.0|
|    free to download|              5.0|
|         no dialogue|              5.0|
|            Dystopia|             4.75|
|   thought provoking|             4.75|
|             parrots|             4.75|
|            jon hamm|             4.75|
| movies about movies|4.666666666666667|
|interracial marriage|4.545454545454546|
|           prejudice|4.545454545454546|
|        Metaphorical|              4.5|
|political right v...|              4.5|
|       individualism|              4.5|
|             freedom|              4.5|
|        good writing|              4.5|
|     black-and-white|              4.5|
|   building a family|              4.5|
|               crazy|              4.5|
+--------------------+-----------------+
only showing top
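
Several tags in the output above share the same average, and a sort on the average alone leaves the order among such ties unspecified. A minimal sketch of a descending sort with an alphabetical tie-breaker follows, in plain Python and using four averages copied from the output above. The function name `rank_tags` is an assumption for illustration.

```python
# Averages copied from the sorted output above, deliberately out of order.
tag_averages = {
    "freedom": 4.5,
    "parrots": 4.75,
    "creativity": 5.0,
    "Dystopia": 4.75,
}


def rank_tags(averages):
    """Sort by average rating (highest first), then by tag name for ties."""
    return sorted(averages.items(), key=lambda item: (-item[1], item[0]))


ranked = rank_tags(tag_averages)
for tag, average in ranked:
    print(f"{tag}: {average}")
```

In PySpark the same idea could be expressed by passing a second column to the sort, for example `df_avg.orderBy(col("average_rating").desc(), col("tag"))`.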