# Project3_25M_MovieLens Data Analysis

Author: Ashita Chandnani

[1. What are the most popular tags?](#section_1)

[2. Finding top tags for a specific movie](#section_2)

[3. Exploring tagging trends over time](#section_3)

In [1]:
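from pyspark.sql import SparkSession
# Import only the functions used below; a wildcard import would shadow Python builtins such as sum and max
from pyspark.sql.functions import col, explode, split, trim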

In [2]:
# Creating a Spark session
spark = (SparkSession
        .builder
        .appName("Spark SQL Project3 Movie Data Mining- 25 Million")
        .getOrCreate())

# Set the log level to ERROR to suppress INFO and WARN messages
spark.sparkContext.setLogLevel("ERROR")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/12/06 14:10:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
# Loading the DataFrames
movie_csv_file = "hdfs://cscluster00.boisestate.edu:9000/user/ashitachandnani/ml-25M/movies.csv"
ratings_csv_file = "hdfs://cscluster00.boisestate.edu:9000/user/ashitachandnani/ml-25M/ratings.csv"
tags_csv_file = "hdfs://cscluster00.boisestate.edu:9000/user/ashitachandnani/ml-25M/tags.csv"

In [4]:
movies_df = (spark.read.format("csv")
      .option("inferSchema", "true")
      .option("header", "true")
      .option("samplingRatio", 0.1)  # Infer the schema from a 10% sample of rows
      .load(movie_csv_file))

In [5]:
ratings_df = (spark.read.format("csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .option("samplingRatio", 0.1)  # Infer the schema from a 10% sample of rows
    .load(ratings_csv_file))

In [6]:
tags_df = (spark.read.format("csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .option("samplingRatio", 0.1)  # Infer the schema from a 10% sample of rows
    .load(tags_csv_file))

In [7]:
movies=movies_df.withColumn("Genre", explode(split(trim(col("Genres")), "\\|")))

In [8]:
movies=movies.drop('Genres')
movies.show(5)

+-------+----------------+---------+
|movieId|           title|    Genre|
+-------+----------------+---------+
|      1|Toy Story (1995)|Adventure|
|      1|Toy Story (1995)|Animation|
|      1|Toy Story (1995)| Children|
|      1|Toy Story (1995)|   Comedy|
|      1|Toy Story (1995)|  Fantasy|
+-------+----------------+---------+
only showing top 5 rows



In [9]:
ratings_df.show(5)

+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|     1|    296|   5.0|1147880044|
|     1|    306|   3.5|1147868817|
|     1|    307|   5.0|1147868828|
|     1|    665|   5.0|1147878820|
|     1|    899|   3.5|1147868510|
+------+-------+------+----------+
only showing top 5 rows



In [10]:
tags_df.show(5)

+------+-------+----------------+----------+
|userId|movieId|             tag| timestamp|
+------+-------+----------------+----------+
|     3|    260|         classic|1439472355|
|     3|    260|          sci-fi|1439472256|
|     4|   1732|     dark comedy|1573943598|
|     4|   1732|  great dialogue|1573943604|
|     4|   7569|so bad it's good|1573943455|
+------+-------+----------------+----------+
only showing top 5 rows



In [11]:
# Joining tags and ratings on UserID and MovieID
# The README says users were selected separately for inclusion in the ratings and tags data sets,
# so user ids may appear in one set but not the other. A full outer join keeps rows from both sides.
ratings_tags_df = ratings_df.join(tags_df, on=['UserID', 'MovieID'], how="full_outer")
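
The effect of the full outer join can be illustrated without a cluster. The following is a minimal plain-Python sketch, not the Spark implementation: the sample rows are hypothetical, the helper `full_outer_join` is introduced only for illustration, and it assumes at most one rating and one tag per (userId, movieId) key, which is a simplification for tags.

```python
# Plain-Python illustration of a full outer join on (userId, movieId).
# The sample rows are hypothetical, not taken from the dataset.
ratings = {(1, 296): 5.0, (1, 306): 3.5}          # (userId, movieId) -> rating
tags = {(1, 296): "classic", (3, 260): "sci-fi"}  # (userId, movieId) -> tag


def full_outer_join(left, right):
    """Return one row per key found in either input; a missing side becomes None."""
    rows = []
    for key in sorted(left.keys() | right.keys()):
        user_id, movie_id = key
        rows.append((user_id, movie_id, left.get(key), right.get(key)))
    return rows


joined = full_outer_join(ratings, tags)
for row in joined:
    print(row)
```

Keys present on only one side survive with `None` on the other, which is why the later queries filter on `Tag IS NOT NULL`.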

In [12]:
# Joining with movies on MovieID
joined_df = movies.join(ratings_tags_df, on="MovieID")

In [13]:
# creating a view on the final joined dataframe
joined_df.createOrReplaceTempView("all_tbl")

Now we can issue standard SQL queries. These are no different from what we would run against a standard relational database.

The `all_tbl` view holds the data of all 3 tables joined on the common columns:
* MovieID
* UserID

We can now answer each of our questions using standard SQL statements.
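
One property of this view is worth keeping in mind when reading the counts. Because `movies` holds one row per (movie, genre) pair after `explode`, the join repeats every rating or tag row once per genre of its movie, so `COUNT(*)` over `all_tbl` scales with the number of genres. The sketch below shows the effect in plain Python on hypothetical rows. It is an illustration of the join arithmetic, not the notebook's code.

```python
from collections import Counter

# Hypothetical sample: one movie with five genre rows after explode(),
# and three tag applications for that movie.
genres = ["Adventure", "Animation", "Children", "Comedy", "Fantasy"]
movie_genre_rows = [(1, genre) for genre in genres]      # (movieId, genre)
tag_rows = [(1, "pixar"), (1, "pixar"), (1, "funny")]    # (movieId, tag)

# Inner join on movieId: every tag row pairs with every genre row of the movie.
joined_rows = [
    (movie_id, genre, tag)
    for (movie_id, genre) in movie_genre_rows
    for (tag_movie_id, tag) in tag_rows
    if tag_movie_id == movie_id
]

inflated = Counter(tag for (_, _, tag) in joined_rows)   # counts over the joined rows
direct = Counter(tag for (_, tag) in tag_rows)           # counts over the tag rows alone

print("joined:", dict(inflated))
print("direct:", dict(direct))
```

With five genres, each joined count is five times the direct count. Counting from `tags_df` alone would avoid the repetition, at the cost of losing the title and genre columns.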

In [14]:
joined_df.cache()

DataFrame[movieId: int, title: string, Genre: string, userId: int, rating: double, timestamp: int, tag: string, timestamp: int]

# Interesting Trends

## 1. What are the most popular tags?<a id="section_1"></a>

In [15]:
# SQL query to find the most popular tags
result = spark.sql("""
SELECT
    Tag,
    COUNT(*) AS TagCount
FROM
    all_tbl
WHERE
    Tag IS NOT NULL
    AND Tag <> ''
GROUP BY
    Tag
ORDER BY
    TagCount DESC
""")
result.show(10)




+------------------+--------+
|               Tag|TagCount|
+------------------+--------+
|            sci-fi|   28157|
|            action|   20003|
|       atmospheric|   18741|
|           surreal|   17701|
|visually appealing|   15111|
|      twist ending|   14899|
|          dystopia|   14575|
|            comedy|   14072|
|   based on a book|   13845|
|             funny|   13396|
+------------------+--------+
only showing top 10 rows



This analysis of popular tags can provide insights into user interests and trends.

**'sci-fi'** (TagCount 28,157) and **'action'** (TagCount 20,003) are the top tags and represent the most popular themes in this dataset.

## 2. Finding top tags for a specific movie <a id="section_2"></a>

In [16]:
# SQL query to find the top tags for a specific movie
result = spark.sql("""
SELECT
    MovieID,
    Title,
    Tag,
    COUNT(*) AS TagCount
FROM
    all_tbl
WHERE
    MovieID = 1
    AND Tag IS NOT NULL
    AND Tag <> ''
GROUP BY
    MovieID, Title, Tag
ORDER BY
    TagCount DESC
""")
result.show(10)




+-------+----------------+------------------+--------+
|MovieID|           Title|               Tag|TagCount|
+-------+----------------+------------------+--------+
|      1|Toy Story (1995)|         animation|     360|
|      1|Toy Story (1995)|             Pixar|     350|
|      1|Toy Story (1995)|             pixar|     220|
|      1|Toy Story (1995)|            Disney|     210|
|      1|Toy Story (1995)|         Tom Hanks|     155|
|      1|Toy Story (1995)|computer animation|     150|
|      1|Toy Story (1995)|             funny|     150|
|      1|Toy Story (1995)|          children|     135|
|      1|Toy Story (1995)|             witty|     130|
|      1|Toy Story (1995)|        friendship|     115|
+-------+----------------+------------------+--------+
only showing top 10 rows



This understanding of popular tags for a movie can help platforms suggest other movies with similar themes or characteristics.

**'animation'**, **'Pixar'**, and **'Disney'** are the top tags associated with the movie **Toy Story**. Tags are matched case-sensitively, so **'Pixar'** and **'pixar'** appear as separate rows.
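
Since the output lists 'Pixar' and 'pixar' separately, it may help to see what merging case variants would do to the ranking. The sketch below is plain Python over the ten counts copied from the Toy Story output above. It only covers the rows shown, so variants ranked below the top 10 are not included. In Spark SQL, the same idea could be expressed by grouping on `LOWER(TRIM(Tag))`.

```python
from collections import Counter

# Counts copied from the Toy Story output above (tag -> TagCount).
toy_story_tags = {
    "animation": 360, "Pixar": 350, "pixar": 220, "Disney": 210,
    "Tom Hanks": 155, "computer animation": 150, "funny": 150,
    "children": 135, "witty": 130, "friendship": 115,
}

# Merge tags that differ only in case or surrounding whitespace.
merged = Counter()
for tag, count in toy_story_tags.items():
    merged[tag.strip().lower()] += count

for tag, count in merged.most_common(3):
    print(tag, count)
```

After merging, 'pixar' (350 + 220) moves ahead of 'animation' among the rows shown.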

## 3. Exploring tagging trends over time <a id="section_3"></a>

In [17]:
# creating a view on the tags dataframe
tags_df.createOrReplaceTempView("tags_tbl")

In [18]:
# SQL query to explore tagging trends over time
result = spark.sql("""
WITH RankedTags AS (
    SELECT
        
        YEAR(FROM_UNIXTIME(Timestamp)) AS Year,
        Tag,
        COUNT(*) AS TagCount,
        ROW_NUMBER() OVER (PARTITION BY YEAR(FROM_UNIXTIME(Timestamp)) ORDER BY COUNT(*) DESC) AS TagRank
    FROM
        tags_tbl
    WHERE
        Tag IS NOT NULL
        AND Tag <> ''
    GROUP BY
        Year, Tag
)

-- Main query to filter the top 5 tags for each year
SELECT
    
    Year,
    Tag,
    TagCount,
    TagRank
FROM
    RankedTags
WHERE
    TagRank <= 5
ORDER BY
    Year DESC, TagRank
""")
result.show(20)

+----+--------------------+--------+-------+
|Year|                 Tag|TagCount|TagRank|
+----+--------------------+--------+-------+
|2019|         atmospheric|     842|      1|
|2019|              sci-fi|     719|      2|
|2019|  visually appealing|     671|      3|
|2019|              action|     670|      4|
|2019|             surreal|     644|      5|
|2018|      woman director|    3555|      1|
|2018|              murder|    1965|      2|
|2018|    independent film|    1843|      3|
|2018|             musical|    1266|      4|
|2018|based on novel or...|    1162|      5|
|2017|              sci-fi|     936|      1|
|2017|  visually appealing|     725|      2|
|2017|         atmospheric|     699|      3|
|2017|                BD-R|     653|      4|
|2017|              action|     545|      5|
|2016|              sci-fi|    1016|      1|
|2016|         atmospheric|     827|      2|
|2016|             surreal|     646|      3|
|2016|                BD-R|     584|      4|
|2016|    

This analysis identifies the most popular tags for each year by ranking them by number of occurrences. This information can help a content platform understand user preferences and observe how tag popularity evolves over time.

Identifying tags that are becoming more popular in recent years can be indicative of emerging trends or topics.

For example, **'atmospheric'** was the most frequently used tag in 2019, while in 2018 **'woman director'** was on top. The prevalence of certain genres **(sci-fi, action, surreal)** or themes **(visually appealing, woman director)** can indicate trends in movie preferences over time.
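
The ranking logic of the query (count per year and tag, rank within each year, keep the top N) can be sketched in plain Python. The rows below are hypothetical, and the helper `top_tags_per_year` is introduced only for illustration. Two assumptions differ from the SQL: years are derived in UTC, whereas `FROM_UNIXTIME` in Spark uses the session time zone, and ties are broken alphabetically, whereas `ROW_NUMBER()` gives no guaranteed order among equal counts.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone

# Hypothetical (tag, unix_timestamp) rows; not taken from the dataset.
rows = [
    ("sci-fi", 1439472256), ("classic", 1439472355), ("sci-fi", 1439472400),
    ("dark comedy", 1573943598), ("atmospheric", 1573943604),
    ("atmospheric", 1573943700), ("sci-fi", 1573943800),
]


def top_tags_per_year(rows, n):
    """Count tags per calendar year (UTC) and keep the n most frequent per year."""
    counts = defaultdict(Counter)
    for tag, ts in rows:
        year = datetime.fromtimestamp(ts, tz=timezone.utc).year
        counts[year][tag] += 1

    result = {}
    for year, counter in counts.items():
        # Sort by count descending, then tag name, so ties are deterministic.
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        result[year] = [
            (tag, count, rank)
            for rank, (tag, count) in enumerate(ranked[:n], start=1)
        ]
    return result


top = top_tags_per_year(rows, n=2)
for year in sorted(top, reverse=True):
    print(year, top[year])
```

Each entry is a `(tag, count, rank)` tuple, mirroring the `Tag`, `TagCount`, and `TagRank` columns of the query result.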

In [19]:
joined_df.unpersist()

DataFrame[movieId: int, title: string, Genre: string, userId: int, rating: double, timestamp: int, tag: string, timestamp: int]

In [20]:
spark.stop()