# Project3_10M_MovieLens Data Analysis
### Author: Farhana Alam

[k-Most popular movies of all time](#1)\
[k-Most popular movies for a particular year](#2)\
[k-Most popular movies for a particular season](#3)\
[Top k movies with the most ratings (presumably most popular) that have the lowest ratings](#4)\
[k-Most tagged movies of all time](#5)\
[k-Most commonly used tags for movies of all time](#6)\
[k-Most commonly used tags for the most common genre of the dataset](#7)\
[Finding the month of the year where movies get most tags based on tagging timestamp](#8)\
[For a particular genre which month is getting most tags](#9)

In [1]:
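from pyspark.sql import SparkSession
# Explicit imports avoid the name clashes of a wildcard import;
# note that `round` here still shadows Python's built-in round().
from pyspark.sql.functions import (col, explode, from_unixtime, month,
                                   round, split, trim, year)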
k = 10

In [2]:
# Creating a Spark session
spark = (SparkSession
        .builder
        .appName("Spark Project3 10M Movie Data Analysis")
        .getOrCreate())

In [3]:
# Loading the 10M MovieLens DataFiles
movies_file = "Documents/CS535/movie_data/ml-10m/movies.dat"
ratings_file = "Documents/CS535/movie_data/ml-10m/ratings.dat"
tags_file = "Documents/CS535/movie_data/ml-10m/tags.dat"

# Output CSV file path
output_csv_path = "Documents/CS535/movie_data/ml-10m-output1"

In [4]:
# Load the ratings data file into a DataFrame
ratings_df = (spark.read.format("csv")
    .option("sep", "::")
    .option("inferschema", "true")
    .option("samplingRatio", 0.1)  # Adjust the sampling ratio
    .load(ratings_file)
    .toDF("UserID", "MovieID", "Rating", "Rating_Timestamp"))

ratings_df.show(k)

+------+-------+------+----------------+
|UserID|MovieID|Rating|Rating_Timestamp|
+------+-------+------+----------------+
|     1|    122|   5.0|       838985046|
|     1|    185|   5.0|       838983525|
|     1|    231|   5.0|       838983392|
|     1|    292|   5.0|       838983421|
|     1|    316|   5.0|       838983392|
|     1|    329|   5.0|       838983392|
|     1|    355|   5.0|       838984474|
|     1|    356|   5.0|       838983653|
|     1|    362|   5.0|       838984885|
|     1|    364|   5.0|       838983707|
+------+-------+------+----------------+
only showing top 10 rows



In [5]:
# Load the tags data file into a DataFrame
tags_df = (spark.read.format("csv")
    .option("sep", "::")
    .option("inferschema", "true")
    .option("samplingRatio", 0.1)  # Adjust the sampling ratio
    .load(tags_file)
    .toDF("UserID", "MovieID", "Tag", "Tag_Timestamp"))

tags_df.show(k)

+------+-------+---------------+-------------+
|UserID|MovieID|            Tag|Tag_Timestamp|
+------+-------+---------------+-------------+
|    15|   4973|     excellent!|   1215184630|
|    20|   1747|       politics|   1188263867|
|    20|   1747|         satire|   1188263867|
|    20|   2424|chick flick 212|   1188263835|
|    20|   2424|          hanks|   1188263835|
|    20|   2424|           ryan|   1188263835|
|    20|   2947|         action|   1188263755|
|    20|   2947|           bond|   1188263756|
|    20|   3033|          spoof|   1188263880|
|    20|   3033|      star wars|   1188263880|
+------+-------+---------------+-------------+
only showing top 10 rows



In [6]:
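# Load the movies data file into a DataFrame
# NOTE: movies.dat has no header row, so "header" = "true" consumes the first
# record (MovieID 1); the output below therefore starts at MovieID 2.
movies_df = (spark.read.format("csv")
      .option("sep", "::")
      .option("header", "true")
      .option("samplingRatio", 0.1)  # Adjust the sampling ratio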
      .load(movies_file)
      .toDF("MovieID","Titles","Genres"))
movies_df.show(k)

+-------+--------------------+--------------------+
|MovieID|              Titles|              Genres|
+-------+--------------------+--------------------+
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
|      6|         Heat (1995)|Action|Crime|Thri...|
|      7|      Sabrina (1995)|      Comedy|Romance|
|      8| Tom and Huck (1995)|  Adventure|Children|
|      9| Sudden Death (1995)|              Action|
|     10|    GoldenEye (1995)|Action|Adventure|...|
|     11|American Presiden...|Comedy|Drama|Romance|
+-------+--------------------+--------------------+
only showing top 10 rows
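
The listing above starts at MovieID 2. One likely cause is the `header` option, which makes the reader treat the first line of the file as column names. The plain-Python sketch below mimics that effect on two sample lines in the `MovieID::Title::Genres` layout. The helper name `parse_movies` and the sample lines are illustrative assumptions, not part of the notebook.

```python
# Illustrative sample lines in the MovieID::Title::Genres layout (assumed data).
SAMPLE_LINES = [
    "1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy",
    "2::Jumanji (1995)::Adventure|Children|Fantasy",
]


def parse_movies(lines, has_header):
    """Split '::'-delimited lines; skip the first line when has_header is True."""
    data_lines = lines[1:] if has_header else lines
    # maxsplit=2 yields at most three fields per line.
    return [tuple(line.split("::", 2)) for line in data_lines]


with_header = parse_movies(SAMPLE_LINES, has_header=True)
without_header = parse_movies(SAMPLE_LINES, has_header=False)

print([row[0] for row in with_header])     # MovieIDs kept when line 1 is treated as a header
print([row[0] for row in without_header])  # MovieIDs kept when every line is data
```

With the first line treated as a header, the first movie never reaches the DataFrame, so any later join silently loses its ratings and tags.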



# Movies and Ratings

In [7]:
# Inner join of movies_df and ratings_df on MovieID
moviesNratings = movies_df.join(ratings_df,movies_df.MovieID == ratings_df.MovieID, 'inner').select(
        movies_df.MovieID,movies_df.Titles,ratings_df.UserID,ratings_df.Rating, ratings_df.Rating_Timestamp)

moviesNratings.sort(col("UserID")).show(3)

+-------+--------------------+------+------+----------------+
|MovieID|              Titles|UserID|Rating|Rating_Timestamp|
+-------+--------------------+------+------+----------------+
|    122|    Boomerang (1992)|     1|   5.0|       838985046|
|    185|     Net, The (1995)|     1|   5.0|       838983525|
|    231|Dumb & Dumber (1994)|     1|   5.0|       838983392|
+-------+--------------------+------+------+----------------+
only showing top 3 rows



<a id="1"></a>
## k-Most popular movies of all time
### Finding the k most popular movies of all time, assuming k = 10
#### Based on rating timestamps, not movie release dates

In [8]:
most_pop_movies = moviesNratings.groupBy("MovieID","Titles").avg("Rating").orderBy("avg(Rating)", ascending=False)

#writing to file
most_pop_movies.limit(k).write.csv(output_csv_path, header=True, mode="append")
most_pop_movies.show(k)

+-------+--------------------+-----------+
|MovieID|              Titles|avg(Rating)|
+-------+--------------------+-----------+
|  33264|Satan's Tango (Sá...|        5.0|
|  51209|Fighting Elegy (K...|        5.0|
|  53355|Sun Alley (Sonnen...|        5.0|
|  42783|Shadows of Forgot...|        5.0|
|  64275|Blue Light, The (...|        5.0|
|   5194|Who's Singin' Ove...|       4.75|
|  26048|Human Condition I...|       4.75|
|   4454|         More (1998)|       4.75|
|  65001|Constantine's Swo...|       4.75|
|  26073|Human Condition I...|       4.75|
+-------+--------------------+-----------+
only showing top 10 rows
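
An unweighted mean can be dominated by titles that received only a handful of ratings, which may be why little-known films lead the list above. A common refinement is to require a minimum number of ratings before ranking by average. The plain-Python sketch below illustrates the idea on made-up rows; `top_k_by_average`, `min_count`, and the sample ratings are assumptions for illustration only.

```python
from collections import defaultdict


def top_k_by_average(ratings, k, min_count):
    """Rank movies by mean rating, ignoring movies with fewer than min_count ratings.

    ratings: iterable of (movie_id, rating) pairs.
    Returns a list of (movie_id, mean_rating, count), best first.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for movie_id, rating in ratings:
        totals[movie_id] += rating
        counts[movie_id] += 1

    ranked = [
        (movie_id, totals[movie_id] / counts[movie_id], counts[movie_id])
        for movie_id in counts
        if counts[movie_id] >= min_count
    ]
    # Highest mean first; break ties by the larger rating count.
    ranked.sort(key=lambda row: (-row[1], -row[2]))
    return ranked[:k]


# Made-up sample: movie "A" has a single 5.0, movie "B" has three ratings.
sample = [("A", 5.0), ("B", 4.0), ("B", 4.5), ("B", 5.0)]
print(top_k_by_average(sample, k=10, min_count=1))
print(top_k_by_average(sample, k=10, min_count=3))
```

In Spark, the same idea would amount to aggregating both the count and the mean per movie and filtering on the count before sorting.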



<a id="2"></a>
## k-Most popular movies for a particular year
### Finding the k most popular movies for a given year, assuming k = 10 and year = 2000
#### Based on rating timestamps, not movie release dates

In [9]:
moviesNratings_withYear = moviesNratings.withColumn("Year", year(from_unixtime("Rating_Timestamp")))\
                               .select("MovieID","Titles","UserID","Rating","Year")\
                               .where(col("Year") == 2000)
moviesNratings_withYear.show(3)

+-------+--------------------+------+------+----+
|MovieID|              Titles|UserID|Rating|Year|
+-------+--------------------+------+------+----+
|      5|Father of the Bri...|    12|   3.0|2000|
|    253|Interview with th...|    12|   3.0|2000|
|    345|Adventures of Pri...|    12|   4.0|2000|
+-------+--------------------+------+------+----+
only showing top 3 rows
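
`from_unixtime` interprets the timestamp in the local (session) time zone, so ratings made close to midnight on 31 December can land in a different year depending on that setting. As a cross-check outside Spark, the sketch below converts one rating timestamp from the ratings table with the standard library, assuming UTC; the helper name `year_and_month` is hypothetical.

```python
from datetime import datetime, timezone


def year_and_month(unix_seconds):
    """Return (year, month) of a Unix timestamp, interpreted in UTC."""
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return moment.year, moment.month


# 838985046 is the Rating_Timestamp of the first row in the ratings table.
print(year_and_month(838985046))
```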



In [10]:
top_10_pop_movies_of_a_year = moviesNratings_withYear.select("MovieID","Titles","Rating","Year")\
                               .groupBy("MovieID","Titles").avg("Rating").orderBy("avg(Rating)",ascending=False)
#writing to file
top_10_pop_movies_of_a_year.limit(k).write.csv(output_csv_path, header=True, mode="append")
top_10_pop_movies_of_a_year.show(k)

+-------+--------------------+-----------+
|MovieID|              Titles|avg(Rating)|
+-------+--------------------+-----------+
|    756|Carmen Miranda: B...|        5.0|
|   3595|      Held Up (1999)|        5.0|
|   3236|    Zachariah (1971)|        5.0|
|   3172|Ulysses (Ulisse) ...|        5.0|
|    989|Brother of Sleep ...|        5.0|
|    654|Und keiner weint ...|        5.0|
|    584|I Don't Want to T...|        5.0|
|   1768|Mother and Son (M...|        5.0|
|   2270|Century of Cinema...|        5.0|
|   3280|    Baby, The (1973)|        5.0|
+-------+--------------------+-----------+
only showing top 10 rows



<a id="3"></a>
## k-Most popular movies for a particular season
### Seasons are defined as 1: Winter, 2: Spring, 3: Summer, 4: Fall; season s covers months 3s - 2 through 3s
### Assuming k = 10 and target_season = 3 (Summer: months 7, 8, 9)
#### Based on rating timestamps, not movie release dates

In [11]:
target_season = 3
moviesNratings_withMonth = moviesNratings.withColumn("Month", month(from_unixtime("Rating_Timestamp")))\
                               .select("MovieID","Titles","UserID","Rating","Month")\
                               .where((col("Month") >=(target_season * 3 - 2))&(col("Month")<=(target_season * 3)))

moviesNratings_withMonth.show(3)

+-------+--------------------+------+------+-----+
|MovieID|              Titles|UserID|Rating|Month|
+-------+--------------------+------+------+-----+
|    122|    Boomerang (1992)|     1|   5.0|    8|
|    185|     Net, The (1995)|     1|   5.0|    8|
|    231|Dumb & Dumber (1994)|     1|   5.0|    8|
+-------+--------------------+------+------+-----+
only showing top 3 rows
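
The filter above relies on mapping a season number s to the months 3s - 2 through 3s. A small plain-Python check of that arithmetic follows; the function name `season_months` is an assumption for illustration.

```python
def season_months(season):
    """Return the three month numbers covered by season 1..4 under the 3s-2..3s rule."""
    if season not in (1, 2, 3, 4):
        raise ValueError("season must be 1, 2, 3, or 4")
    first = season * 3 - 2
    return list(range(first, season * 3 + 1))


for s in (1, 2, 3, 4):
    print(s, season_months(s))
```

Under this rule the seasons are simply calendar quarters, so "Winter" means January through March rather than December through February.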



In [12]:
top_10_pop_movies_of_summer = moviesNratings_withMonth.select("MovieID","Titles","Rating","Month")\
                                  .groupBy("MovieID","Titles").avg("Rating").orderBy("avg(Rating)",ascending=False)
#writing to file
top_10_pop_movies_of_summer.limit(k).write.csv(output_csv_path, header=True, mode="append")
top_10_pop_movies_of_summer.show(k)

+-------+--------------------+-----------+
|MovieID|              Titles|avg(Rating)|
+-------+--------------------+-----------+
|   5194|Who's Singin' Ove...|        5.0|
|  25975|Life of Oharu, Th...|        5.0|
|   8120|  29th Street (1991)|        5.0|
|   3233|Smashing Time (1967)|        5.0|
|    654|Und keiner weint ...|        5.0|
|    401|       Mirage (1995)|        5.0|
|   4454|         More (1998)|        5.0|
|  42783|Shadows of Forgot...|        5.0|
|   5849|I'm Starting From...|        5.0|
|    395| Desert Winds (1995)|        5.0|
+-------+--------------------+-----------+
only showing top 10 rows



<a id="4"></a>
## Top k movies with the most ratings (presumably most popular) that have the lowest ratings
#### Highest rating counts, but lowest average rating

In [13]:
# count of the ratings
moviesNratings_with_rating_counts = moviesNratings.groupBy("MovieID","Titles").count()                                                                 
moviesNratings_with_rating_counts.show(10)

+-------+--------------------+-----+
|MovieID|              Titles|count|
+-------+--------------------+-----+
|   4995|Beautiful Mind, A...| 9575|
|   7153|Lord of the Rings...|12366|
|   4027|O Brother, Where ...| 9445|
|   4015|Dude, Where's My ...| 2496|
|   4866|Last Castle, The ...|  823|
|   1324|Amityville: Dollh...|  154|
|   2728|    Spartacus (1960)| 2577|
|     63|Don't Be a Menace...| 1343|
|   2202|     Lifeboat (1944)|  887|
|   2372| Fletch Lives (1989)| 1140|
+-------+--------------------+-----+
only showing top 10 rows



In [14]:
# average of the ratings
moviesNratings_with_avg_rating = moviesNratings.groupBy("MovieID","Titles").avg("Rating")
moviesNratings_with_avg_rating = moviesNratings_with_avg_rating.select("MovieID","Titles",round("avg(Rating)",2).alias("avg_rating"))
moviesNratings_with_avg_rating.show(k)

+-------+--------------------+----------+
|MovieID|              Titles|avg_rating|
+-------+--------------------+----------+
|   4995|Beautiful Mind, A...|      3.91|
|   7153|Lord of the Rings...|      4.16|
|   4027|O Brother, Where ...|      3.89|
|   4015|Dude, Where's My ...|       2.5|
|   4866|Last Castle, The ...|       3.3|
|   1324|Amityville: Dollh...|      1.71|
|   2728|    Spartacus (1960)|      3.89|
|     63|Don't Be a Menace...|      3.05|
|   2202|     Lifeboat (1944)|      3.89|
|   2372| Fletch Lives (1989)|      2.88|
+-------+--------------------+----------+
only showing top 10 rows



In [15]:
# Joining movie rating-count and average-rating
moviesNratings_rating_counts_with_avgRatings = moviesNratings_with_rating_counts.join(moviesNratings_with_avg_rating, 
                    moviesNratings_with_rating_counts.MovieID == moviesNratings_with_avg_rating.MovieID)\
                     .orderBy("count", ascending=False)\
                     .select(moviesNratings_with_rating_counts.MovieID,
                             moviesNratings_with_rating_counts.Titles,"count","avg_rating")
moviesNratings_rating_counts_with_avgRatings.show(k)

+-------+--------------------+-----+----------+
|MovieID|              Titles|count|avg_rating|
+-------+--------------------+-----+----------+
|    296| Pulp Fiction (1994)|34864|      4.16|
|    356| Forrest Gump (1994)|34457|      4.01|
|    593|Silence of the La...|33668|       4.2|
|    480|Jurassic Park (1993)|32631|      3.66|
|    318|Shawshank Redempt...|31126|      4.46|
|    110|   Braveheart (1995)|29154|      4.08|
|    457|Fugitive, The (1993)|28951|      4.01|
|    589|Terminator 2: Jud...|28948|      3.93|
|    260|Star Wars: Episod...|28566|      4.22|
|    150|    Apollo 13 (1995)|27035|      3.89|
+-------+--------------------+-----+----------+
only showing top 10 rows



In [16]:
# Sort by rating count (descending), then by average rating (ascending)
movies_hcount_lrating = moviesNratings_rating_counts_with_avgRatings.orderBy(['count', 'avg_rating'], ascending=[False, True])

#writing to file
movies_hcount_lrating.limit(k).write.csv(output_csv_path, header=True, mode="append")
movies_hcount_lrating.show(k)

+-------+--------------------+-----+----------+
|MovieID|              Titles|count|avg_rating|
+-------+--------------------+-----+----------+
|    296| Pulp Fiction (1994)|34864|      4.16|
|    356| Forrest Gump (1994)|34457|      4.01|
|    593|Silence of the La...|33668|       4.2|
|    480|Jurassic Park (1993)|32631|      3.66|
|    318|Shawshank Redempt...|31126|      4.46|
|    110|   Braveheart (1995)|29154|      4.08|
|    457|Fugitive, The (1993)|28951|      4.01|
|    589|Terminator 2: Jud...|28948|      3.93|
|    260|Star Wars: Episod...|28566|      4.22|
|    150|    Apollo 13 (1995)|27035|      3.89|
+-------+--------------------+-----+----------+
only showing top 10 rows
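
Sorting by `count` descending and then by `avg_rating` ascending only uses the average to break ties in the count, so the listing above is identical to the previous one. One possible reading of the task is to first keep the n most-rated movies and then order those by average rating, lowest first. The plain-Python sketch below applies that reading to the ten rows shown above; the function name and the two-step interpretation are assumptions, not the notebook's method.

```python
# (title, rating_count, avg_rating) copied from the table above.
ROWS = [
    ("Pulp Fiction (1994)", 34864, 4.16),
    ("Forrest Gump (1994)", 34457, 4.01),
    ("Silence of the La...", 33668, 4.2),
    ("Jurassic Park (1993)", 32631, 3.66),
    ("Shawshank Redempt...", 31126, 4.46),
    ("Braveheart (1995)", 29154, 4.08),
    ("Fugitive, The (1993)", 28951, 4.01),
    ("Terminator 2: Jud...", 28948, 3.93),
    ("Star Wars: Episod...", 28566, 4.22),
    ("Apollo 13 (1995)", 27035, 3.89),
]


def lowest_rated_among_most_rated(rows, n_most_rated, k):
    """Keep the n_most_rated rows by count, then return the k with the lowest average."""
    most_rated = sorted(rows, key=lambda r: r[1], reverse=True)[:n_most_rated]
    return sorted(most_rated, key=lambda r: r[2])[:k]


for title, count, avg in lowest_rated_among_most_rated(ROWS, n_most_rated=10, k=3):
    print(title, count, avg)
```

The choice of `n_most_rated` decides what counts as "most rated"; a larger pool would surface lower-rated titles than the ten rows used here.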



# Movies and Tags

In [17]:
# Inner join of movies_df and tags_df on MovieID
moviesNtags = movies_df.join(tags_df,movies_df.MovieID == tags_df.MovieID, 'inner').select(
        movies_df.MovieID,movies_df.Titles,movies_df.Genres,tags_df.UserID,tags_df.Tag, tags_df.Tag_Timestamp)\
           .orderBy("UserID", ascending=True)

moviesNtags.show(13)

+-------+--------------------+--------------------+------+---------------+-------------+
|MovieID|              Titles|              Genres|UserID|            Tag|Tag_Timestamp|
+-------+--------------------+--------------------+------+---------------+-------------+
|   4973|Amelie (Fabuleux ...|      Comedy|Romance|    15|     excellent!|   1215184630|
|   1747|  Wag the Dog (1997)|              Comedy|    20|       politics|   1188263867|
|   1747|  Wag the Dog (1997)|              Comedy|    20|         satire|   1188263867|
|   2424|You've Got Mail (...|      Comedy|Romance|    20|chick flick 212|   1188263835|
|   2424|You've Got Mail (...|      Comedy|Romance|    20|          hanks|   1188263835|
|   2424|You've Got Mail (...|      Comedy|Romance|    20|           ryan|   1188263835|
|   2947|   Goldfinger (1964)|Action|Adventure|...|    20|         action|   1188263755|
|   2947|   Goldfinger (1964)|Action|Adventure|...|    20|           bond|   1188263756|
|   3033|   Spaceball

<a id="5"></a>
## k-Most tagged movies of all time
### Finding the k most tagged movies of all time, assuming k = 10
#### Based on tagging timestamps, not movie release dates

In [18]:
most_tagged_movies = moviesNtags.groupBy("MovieID","Titles").count().orderBy("count", ascending=False)

#writing to file
most_tagged_movies.limit(k).write.csv(output_csv_path, header=True, mode="append")
most_tagged_movies.show(k)

+-------+--------------------+-----+
|MovieID|              Titles|count|
+-------+--------------------+-----+
|    296| Pulp Fiction (1994)|  308|
|    318|Shawshank Redempt...|  257|
|   2959|   Fight Club (1999)|  235|
|    527|Schindler's List ...|  232|
|   2571|  Matrix, The (1999)|  223|
|    260|Star Wars: Episod...|  223|
|   7361|Eternal Sunshine ...|  221|
|  44191|V for Vendetta (2...|  209|
|    858|Godfather, The (1...|  205|
|    593|Silence of the La...|  203|
+-------+--------------------+-----+
only showing top 10 rows



<a id="6"></a>
## k-Most commonly used tags for movies of all time
### Finding the k most commonly used tags for movies of all time, assuming k = 10

In [19]:
most_used_tags = moviesNtags.groupBy("Tag").count().orderBy("count", ascending=False)

#writing to file
most_used_tags.limit(k).write.csv(output_csv_path, header=True, mode="append")
most_used_tags.show(k)

+--------------------+-----+
|                 Tag|count|
+--------------------+-----+
|        Tumey's DVDs|  641|
|             classic|  619|
|     based on a book|  549|
|                   R|  518|
|less than 300 rat...|  505|
|                70mm|  464|
|    Nudity (Topless)|  464|
|       erlend's DVDs|  404|
|Oscar (Best Picture)|  400|
|              comedy|  393|
+--------------------+-----+
only showing top 10 rows



<a id="7"></a>
## k-Most commonly used tags for the most common genre of the dataset
### Finding the most common genre of the dataset, then the k most common tags for that genre, assuming k = 10

In [20]:
# Explode the pipe-separated Genres column into one row per genre
movie_tagsNgenre = moviesNtags.withColumn("Genre", explode(split(trim(col("Genres")), "\\|"))).drop('Genres')
movie_tagsNgenre.show(5)

+-------+--------------------+------+---------------+-------------+-------+
|MovieID|              Titles|UserID|            Tag|Tag_Timestamp|  Genre|
+-------+--------------------+------+---------------+-------------+-------+
|   4973|Amelie (Fabuleux ...|    15|     excellent!|   1215184630| Comedy|
|   4973|Amelie (Fabuleux ...|    15|     excellent!|   1215184630|Romance|
|   1747|  Wag the Dog (1997)|    20|       politics|   1188263867| Comedy|
|   1747|  Wag the Dog (1997)|    20|         satire|   1188263867| Comedy|
|   2424|You've Got Mail (...|    20|chick flick 212|   1188263835| Comedy|
+-------+--------------------+------+---------------+-------------+-------+
only showing top 5 rows
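
`split` followed by `explode` turns one row with a pipe-separated `Genres` string into one row per genre. The plain-Python sketch below mirrors that step on two rows taken from the joined movies-and-tags output; `explode_genres` is a hypothetical helper.

```python
def explode_genres(rows):
    """Yield one (movie_id, tag, genre) tuple per genre in the pipe-separated field."""
    for movie_id, tag, genres in rows:
        for genre in genres.strip().split("|"):
            yield movie_id, tag, genre


rows = [
    (4973, "excellent!", "Comedy|Romance"),
    (1747, "politics", "Comedy"),
]
exploded = list(explode_genres(rows))
for row in exploded:
    print(row)
```

One consequence is that a tag on a multi-genre movie is counted once per genre in every later per-genre aggregation.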



In [21]:
# Find the most common genre, which becomes the target genre
genre_counts = movie_tagsNgenre.groupBy("Genre").count().orderBy("count", ascending=False)
target_genre = genre_counts.first()[0]  # only the top row is needed
genre_counts.show(25)

+------------------+-----+
|             Genre|count|
+------------------+-----+
|             Drama|51136|
|            Comedy|31125|
|          Thriller|23282|
|            Action|22526|
|         Adventure|17759|
|           Romance|17518|
|             Crime|14847|
|            Sci-Fi|12205|
|           Fantasy|10702|
|            Horror| 7516|
|           Mystery| 7185|
|               War| 6708|
|          Children| 6357|
|         Animation| 5264|
|           Musical| 4524|
|       Documentary| 2704|
|         Film-Noir| 2266|
|           Western| 1841|
|              IMAX|  167|
|(no genres listed)|    6|
+------------------+-----+



In [22]:
# Count the occurrences of each tag within each genre
r_movie_tagsNgenre = movie_tagsNgenre.groupBy("Genre","Tag").count().orderBy(['count','Genre'], ascending=[False, False])
r_movie_tagsNgenre.show(5)

+------+---------------+-----+
| Genre|            Tag|count|
+------+---------------+-----+
| Drama|   Tumey's DVDs|  403|
|Comedy|         comedy|  374|
| Drama|              R|  371|
| Drama|based on a book|  354|
|Sci-Fi|         sci-fi|  345|
+------+---------------+-----+
only showing top 5 rows



In [23]:
tags_for_target_genre = movie_tagsNgenre.where(col("Genre")== target_genre)

#writing to file
tags_for_target_genre.limit(k).write.csv(output_csv_path, header=True, mode="append")
tags_for_target_genre.show(k)

+-------+--------------------+------+-------------+-------------+-----+
|MovieID|              Titles|UserID|          Tag|Tag_Timestamp|Genre|
+-------+--------------------+------+-------------+-------------+-----+
|   7438|Kill Bill: Vol. 2...|    20|       bloody|   1188263801|Drama|
|   7438|Kill Bill: Vol. 2...|    20|      kung fu|   1188263801|Drama|
|   7438|Kill Bill: Vol. 2...|    20|    Tarantino|   1188263801|Drama|
|  55247|Into the Wild (2007)|    21|            R|   1205081506|Drama|
|  55253|Lust, Caution (Se...|    21|        NC-17|   1205081488|Drama|
|    277|Miracle on 34th S...|    39|      classic|   1188263791|Drama|
|    724|   Craft, The (1996)|    39|         goth|   1188263843|Drama|
|    198| Strange Days (1995)|    49|Ralph Fiennes|   1188264255|Drama|
|    261| Little Women (1994)|    49| Winona Ryder|   1188264178|Drama|
|   1597|Conspiracy Theory...|    49|Julia Roberts|   1188264095|Drama|
+-------+--------------------+------+-------------+-------------+-----+
only showing top 10 rows
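
The cell above filters the exploded rows to the target genre but does not yet count tags. To rank tags within the genre, the filtered rows still need to be grouped by `Tag` and counted, for example by filtering the per-genre tag counts computed earlier. The plain-Python sketch below shows the counting step on made-up rows; `top_tags_for_genre` and the sample data are assumptions for illustration.

```python
from collections import Counter


def top_tags_for_genre(rows, genre, k):
    """Return the k most frequent tags among (genre, tag) rows for the given genre."""
    counts = Counter(tag for row_genre, tag in rows if row_genre == genre)
    return counts.most_common(k)


# Made-up (genre, tag) rows, not taken from the dataset.
sample = [
    ("Drama", "classic"),
    ("Drama", "classic"),
    ("Drama", "based on a book"),
    ("Comedy", "classic"),
    ("Drama", "classic"),
    ("Drama", "based on a book"),
    ("Drama", "R"),
]
print(top_tags_for_genre(sample, "Drama", k=2))
```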

<a id="8"></a>
## Finding the month of the year where movies get most tags based on tagging timestamp

In [24]:
moviesNtags_withMonth = moviesNtags.withColumn("Month", month(from_unixtime("Tag_Timestamp")))\
                               .select("MovieID","Titles","Tag","Month")
moviesNtags_withMonth.show(15)

+-------+--------------------+---------------+-----+
|MovieID|              Titles|            Tag|Month|
+-------+--------------------+---------------+-----+
|   4973|Amelie (Fabuleux ...|     excellent!|    7|
|   1747|  Wag the Dog (1997)|       politics|    8|
|   1747|  Wag the Dog (1997)|         satire|    8|
|   2424|You've Got Mail (...|chick flick 212|    8|
|   2424|You've Got Mail (...|          hanks|    8|
|   2424|You've Got Mail (...|           ryan|    8|
|   2947|   Goldfinger (1964)|         action|    8|
|   2947|   Goldfinger (1964)|           bond|    8|
|   3033|   Spaceballs (1987)|          spoof|    8|
|   3033|   Spaceballs (1987)|      star wars|    8|
|   7438|Kill Bill: Vol. 2...|         bloody|    8|
|   7438|Kill Bill: Vol. 2...|        kung fu|    8|
|   7438|Kill Bill: Vol. 2...|      Tarantino|    8|
|  55247|Into the Wild (2007)|              R|    3|
|  55253|Lust, Caution (Se...|          NC-17|    3|
+-------+--------------------+---------------+-----+
only showing top 15 rows

In [25]:
moviesNtags_withMonth = moviesNtags_withMonth.groupBy("Month").count().orderBy('count',ascending = False)

#writing to file
moviesNtags_withMonth.limit(12).write.csv(output_csv_path, header=True, mode="append")
moviesNtags_withMonth.show(12)

+-----+-----+
|Month|count|
+-----+-----+
|    2|14145|
|    1|13523|
|    8| 8733|
|    3| 8060|
|    4| 7597|
|    7| 7578|
|    6| 6839|
|   12| 6833|
|    5| 6095|
|    9| 5545|
|   10| 5396|
|   11| 5096|
+-----+-----+



<a id="9"></a>
## For a particular genre which month is getting most tags
### Selecting genre = Thriller

In [26]:
moviesNtags_Mn_Gn = moviesNtags.withColumn("Month", month(from_unixtime("Tag_Timestamp")))\
                               .select("MovieID","Titles","Genres","Tag","Month").where(col('Genres') == 'Thriller')
moviesNtags_Mn_Gn.show(k)

+-------+--------------------+--------+-----------------+-----+
|MovieID|              Titles|  Genres|              Tag|Month|
+-------+--------------------+--------+-----------------+-----+
|   1343|    Cape Fear (1991)|Thriller|           horror|    8|
|   1343|    Cape Fear (1991)|Thriller|           killer|    8|
|   1343|    Cape Fear (1991)|Thriller|          stalker|    8|
|    240|     Hideaway (1995)|Thriller|  based on a book|   10|
|    240|     Hideaway (1995)|Thriller|      Dean Koontz|   10|
|   1892|Perfect Murder, A...|Thriller|  based on a play|    4|
|   3005|Bone Collector, T...|Thriller|  based on a book|   11|
|   3015|         Coma (1978)|Thriller|  based on a book|    8|
|   4803|Play Misty for Me...|Thriller|directorial debut|    4|
|   5294|      Frailty (2001)|Thriller|    serial killer|    5|
+-------+--------------------+--------+-----------------+-----+
only showing top 10 rows
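
The equality test `col('Genres') == 'Thriller'` keeps only movies whose genre string is exactly `Thriller`; titles listed as, say, `Action|Crime|Thriller` are excluded. If multi-genre movies should count as well, membership has to be tested on the split list. The plain-Python sketch below shows the difference on illustrative genre strings; the helper name `has_genre` is an assumption.

```python
def has_genre(genres, target):
    """True if target is one of the pipe-separated genres."""
    return target in genres.split("|")


genre_strings = ["Thriller", "Action|Crime|Thriller", "Comedy|Romance"]

exact = [g for g in genre_strings if g == "Thriller"]
member = [g for g in genre_strings if has_genre(g, "Thriller")]

print(exact)
print(member)
```

In the notebook, the exploded `Genre` column of `movie_tagsNgenre` already provides one row per genre and could be filtered instead of the raw `Genres` string.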



In [27]:
moviesNtags_Mn_Gn = moviesNtags_Mn_Gn.groupBy("Month").count().orderBy('count',ascending = False)

#writing to file
moviesNtags_Mn_Gn.limit(12).write.csv(output_csv_path, header=True, mode="append")
moviesNtags_Mn_Gn.show(12)

+-----+-----+
|Month|count|
+-----+-----+
|    4|  104|
|    8|  100|
|    5|   79|
|    1|   78|
|    2|   67|
|    6|   54|
|    7|   51|
|    3|   50|
|    9|   48|
|   12|   29|
|   10|   25|
|   11|   25|
+-----+-----+



In [28]:
# Stop the Spark session
spark.stop()