In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Capstone workaround").getOrCreate()

# Quick analysis of the title.basics file

In [2]:
title_basic = spark.read.csv("../data/imdb_metadata/title.basics.tsv", header=True, inferSchema=True, sep="\t")

In [3]:
title_basic.printSchema()

root
 |-- tconst: string (nullable = true)
 |-- titleType: string (nullable = true)
 |-- primaryTitle: string (nullable = true)
 |-- originalTitle: string (nullable = true)
 |-- isAdult: string (nullable = true)
 |-- startYear: string (nullable = true)
 |-- endYear: string (nullable = true)
 |-- runtimeMinutes: string (nullable = true)
 |-- genres: string (nullable = true)



We only need the tconst, primaryTitle, originalTitle, startYear, endYear, and runtimeMinutes columns, so we drop the rest.

In [4]:
cols = (
    "titleType",
    "isAdult",
    "genres"
)

title_basic = title_basic.drop(*cols)

In [5]:
title_basic.printSchema()

root
 |-- tconst: string (nullable = true)
 |-- primaryTitle: string (nullable = true)
 |-- originalTitle: string (nullable = true)
 |-- startYear: string (nullable = true)
 |-- endYear: string (nullable = true)
 |-- runtimeMinutes: string (nullable = true)



## Check for null values

In [29]:
title_basic.count()

8852149

In [46]:
title_basic.filter("tconst is null").count()

0

In [47]:
title_basic.filter("originalTitle is null").count()

0

In [48]:
title_basic.filter("primaryTitle is null").count()

0

## Check for duplicates

In [40]:
title_basic.select("tconst").distinct().count() == title_basic.count()

True

In [56]:
title_basic.select("originalTitle").distinct().count() == title_basic.count()

False

The number of distinct values in originalTitle is not equal to the number of rows in the table, so some titles appear more than once. We need to find those values and investigate them.

In [53]:
title_basic.select("originalTitle").groupby("originalTitle").agg({"originalTitle": "count"}).where("count(1) > 1").show()

+--------------------+--------------------+
|       originalTitle|count(originalTitle)|
+--------------------+--------------------+
|La descente de croix|                   2|
|    Master and Pupil|                   2|
|          Quo Vadis?|                  10|
|        The Kangaroo|                   9|
|    The Star Boarder|                   6|
|      As It Happened|                   2|
|            Gladiola|                   2|
|The Moth and the ...|                   5|
|    The Stool Pigeon|                   5|
|  The Vivisectionist|                   2|
|                Zaza|                   9|
|     A Maid to Order|                   2|
|     Saved by a Song|                   2|
|       Anything Once|                   4|
|            The Moth|                  20|
|His Majesty, Bunk...|                   2|
|       The One Woman|                   3|
|    Between the Acts|                   2|
|     Marion de Lorme|                   2|
|   The Branding Iron|          

The titles above each appear more than once. We pick one of them and look at its rows in detail.

In [78]:
title_basic.select("*").where("originalTitle = 'The Moth'").show()

+----------+------------------+-------------+---------+-------+--------------+
|    tconst|      primaryTitle|originalTitle|startYear|endYear|runtimeMinutes|
+----------+------------------+-------------+---------+-------+--------------+
| tt0008321|          The Moth|     The Moth|     1917|     \N|            72|
| tt0025518|          The Moth|     The Moth|     1934|     \N|            64|
| tt0118405|          The Moth|     The Moth|     1997|     \N|           152|
| tt0410334|          The Moth|     The Moth|     1911|     \N|            10|
| tt0566835|          The Moth|     The Moth|     1961|     \N|            30|
| tt0636298|          The Moth|     The Moth|     2004|     \N|            43|
| tt0716993|          The Moth|     The Moth|     1987|     \N|            30|
| tt0799885|          The Moth|     The Moth|     2002|     \N|            60|
| tt1195784|          The Moth|     The Moth|     1914|     \N|            10|
|tt12511248|Moth Directors Cut|     The Moth|     20

Although the originalTitle values are the same, every tconst is different, and so are the startYear values. This means several distinct titles share the same name, so these rows are not erroneous duplicates.

## Check for wrong values

startYear and endYear must be numeric (nulls are allowed), and startYear must be less than or equal to endYear.

In [7]:
title_basic.filter("startYear > endYear").count()

0

# Quick analysis of the sample.json file
sample.json is a subset of the imdb_ratings data. We use it to demonstrate the analysis method.

In [65]:
ratings = spark.read.json("../data/imdb_ratings/sample.json", multiLine=True)

In [68]:
ratings.show()

+--------------------+------+------------+---------+-------------------+
|               movie|rating| review_date|review_id|           reviewer|
+--------------------+------+------------+---------+-------------------+
|Kill Bill: Vol. 2...|     8|24 July 2005|rw1133942|OriginalMovieBuff21|
|Journey to the Un...|  null|24 July 2005|rw1133943|           sentra14|
|   The Island (2005)|     9|24 July 2005|rw1133946|  GreenwheelFan2002|
|Win a Date with T...|     3|24 July 2005|rw1133948|     itsascreambaby|
|Saturday Night Li...|    10|24 July 2005|rw1133949|OriginalMovieBuff21|
|Outlaw Star (1998– )|    10|24 July 2005|rw1133950|          Aaron1375|
|  The Aviator (2004)|    10|24 July 2005|rw1133952| TheFilmConnoisseur|
|Star Wars: Episod...|     9|24 July 2005|rw1133953|        swansongang|
|The Amityville Ho...|     3|24 July 2005|rw1133954|             diand_|
|Flying Tigers (1942)|     6|24 July 2005|rw1133955|         btillman63|
|Phantasm III: Lor...|     6|24 July 2005|rw1133956

In [58]:
ratings.printSchema()

root
 |-- helpful: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- movie: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- review_date: string (nullable = true)
 |-- review_detail: string (nullable = true)
 |-- review_id: string (nullable = true)
 |-- review_summary: string (nullable = true)
 |-- reviewer: string (nullable = true)
 |-- spoiler_tag: long (nullable = true)



We only need the movie, rating, review_date, review_id, and reviewer columns.

In [66]:
cols = ("helpful", "review_detail", "review_summary", "spoiler_tag")
ratings = ratings.drop(*cols)

In [59]:
ratings.count()

100000

## Check for null values

In [60]:
ratings.filter("movie is null").count()

0

In [61]:
ratings.filter("rating is null").count()

12092

The rating column has null values. Let's look at those rows.

In [67]:
ratings.select("*").where("rating is null").show()

+--------------------+------+------------+---------+--------------------+
|               movie|rating| review_date|review_id|            reviewer|
+--------------------+------+------------+---------+--------------------+
|Journey to the Un...|  null|24 July 2005|rw1133943|            sentra14|
|The Venture Bros....|  null|24 July 2005|rw1133961|           Aaron1375|
|Good Times (1974–...|  null|24 July 2005|rw1133976|      sheworexacharm|
|Donny and Marie (...|  null|24 July 2005|rw1133979|        Manningmilt1|
|An Affair to Reme...|  null|24 July 2005|rw1133980|   Myshkin_Karamazov|
|Johnny Guitar (1954)|  null|24 July 2005|rw1133981|           thirsch-2|
|    All of Me (1984)|  null|24 July 2005|rw1133991|              TxMike|
|    Used Cars (1980)|  null|24 July 2005|rw1133997|aliasanythingyouwant|
|Pizza My Heart (2...|  null|24 July 2005|rw1134016|   HallmarkMovieBuff|
|    Liar Liar (1997)|  null|24 July 2005|rw1134017|           goleafs84|
|   The Fear (1988– )|  null|24 July 2

We cannot tell what star rating (from 1 to 10) these reviews were meant to carry, so the simplest solution is to drop them.

In [69]:
ratings = ratings.filter("rating is not null")

In [70]:
ratings.count()

87908

In [71]:
ratings.filter("review_date is null").count()

0

In [72]:
ratings.filter("reviewer is null").count()

0

## Check for duplicates

In theory, a reviewer rates a given movie only once. However, review_date holds only a date, not a timestamp with seconds, so a reviewer could post a review in the morning and revise it later the same day, producing two rows that look alike. This is acceptable, so we do not check for duplicates here.