# Recap Data Engineering Course

## Step 1: Scope the Project and Gather Data
This project is built on a movie dataset.  
The data comes from Kaggle. Two sources are used:
  - **A** : [IMDB summary (ratings)](https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset): contains movie details, review scores, the people involved, and their roles.
  - **B** : [IMDB plots & spoilers](https://www.kaggle.com/rmisra/imdb-spoiler-dataset): contains plots and synopses.

The dataset contains these files:
  - movie summary (score) : *IMDb_movies.csv*
  - movie details (plots, synopsis) : *IMDb_movies_details.jsonl*
  - people involved in movie making (actors, directors) : *IMDb_names.csv*
  - ratings summary : *IMDb_ratings.csv*
  - movie reviews (including spoiler flag) : *IMDb_reviews.jsonl*
  - people role in each movie : *IMDb_title_principals.csv*
  
The csv files came from [**source A**](https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset)  
The jsonl files came from [**source B**](https://www.kaggle.com/rmisra/imdb-spoiler-dataset)

For convenience, I also provide the csv and jsonl files on [cloud storage](https://storage.googleapis.com/course_data_engineering_sample_data/) 

In [None]:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

import os
import sys

The next cell is required if you get the error `java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified`

In [None]:
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

Files are stored in the cloud (a public Google Cloud Storage bucket). **Because the data is read from the cloud, expect a longer load time while the files download.**

In [None]:
movies_file = "gs://course_data_engineering_sample_data/movie-datasets/IMDb_movies.gz"
names_file = "gs://course_data_engineering_sample_data/movie-datasets/IMDb_names.gz"
ratings_file = "gs://course_data_engineering_sample_data/movie-datasets/IMDb_ratings.gz"
title_principals_file = "gs://course_data_engineering_sample_data/movie-datasets/IMDb_title_principals.gz"
movie_details_file = "gs://course_data_engineering_sample_data/movie-datasets/IMDb_movie_details.gz"
reviews_file = "gs://course_data_engineering_sample_data/movie-datasets/IMDb_reviews.gz"

If you are using local files, unzip `movie-datasets.zip` and `movie-datasets2.zip` from the lecture resources (last section of the course).
Then uncomment and use the following lines (adjust the paths if necessary).

In [None]:
#movies_file = "data/movie-datasets/IMDb_movies.gz"
#names_file = "data/movie-datasets/IMDb_names.csv"
#ratings_file = "data/movie-datasets/IMDb_ratings.csv"
#title_principals_file = "data/movie-datasets/IMDb_title_principals.csv"
#movie_details_file = "data/movie-datasets/IMDb_movies_details.gz"
#reviews_file = "data/movie-datasets/IMDb_reviews.gz"

### Google Cloud Storage Settings

**Note** : If the code above fails, you can follow these steps instead.

1. Go [here](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#non-dataproc_clusters) and download the Google Cloud Storage connector for Hadoop. Use the latest version. You will get a jar file like `gcs-connector-hadoop3-latest.jar`
2. Paste the downloaded jar file into the pyspark folder in your local computer. If you install pyspark using `pip`, it will be in `$PYTHON_INSTALLATION_DIR/lib/site-packages/pyspark/jars`
3. Go to [google cloud console](https://console.cloud.google.com/)
4. Go to menu `IAM & admin` > `Service accounts`
5. Create new service account, and give access as `Storage Object Viewer`
6. Create and download new key (JSON) for that service account (e.g. `google-credential-key.json`)
7. Set environment variable to the key file, like this

```
# Windows
# From Start Menu > Environment Variables > add new variable GOOGLE_APPLICATION_CREDENTIALS
# Or from terminal using this command
set "GOOGLE_APPLICATION_CREDENTIALS=path\to\your\google-credential-key.json"

# Linux
export GOOGLE_APPLICATION_CREDENTIALS="path/to/your/google-credential-key.json"
```
8. Restart jupyter notebook
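
After the restart, a quick check from Python can confirm that the notebook process actually sees the variable from step 7. This is only a minimal sketch: the helper name `check_gcs_credentials` is made up for illustration, and it only verifies that the file exists, not that the key is valid.

```python
import os

def check_gcs_credentials(environ=os.environ):
    """Return the key file path if GOOGLE_APPLICATION_CREDENTIALS points to an existing file, else None."""
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if path and os.path.isfile(path):
        return path
    return None

# Print either the resolved path or a hint about what is missing.
print(check_gcs_credentials() or "GOOGLE_APPLICATION_CREDENTIALS is not set or the file is missing")
```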

Open Spark session

In [None]:
spark = SparkSession.builder.appName("Movies DB").getOrCreate()

Read the csv and jsonl files. This takes some time because the data is fairly large.

In [None]:
movies_spark = spark.read.csv(movies_file, header=True)
names_spark = spark.read.csv(names_file, header=True)
ratings_spark = spark.read.csv(ratings_file, header=True)
title_principals_spark = spark.read.csv(title_principals_file, header=True)
movie_details_spark = spark.read.json(movie_details_file)
reviews_spark = spark.read.json(reviews_file)

All dataframes should be available by now. Try to check each of them.

In [None]:
movies_spark

In [None]:
names_spark

In [None]:
ratings_spark

In [None]:
title_principals_spark

In [None]:
movie_details_spark

In [None]:
reviews_spark

## Step 2: Explore and Assess the Data

Use Spark SQL (just for convenience)

In [None]:
movies_spark.createOrReplaceTempView("movies_v")
names_spark.createOrReplaceTempView("names_v")
ratings_spark.createOrReplaceTempView("ratings_v")
title_principals_spark.createOrReplaceTempView("title_principals_v")
movie_details_spark.createOrReplaceTempView("movie_details_v")
reviews_spark.createOrReplaceTempView("reviews_v")

Use pandas (just for convenience), taking only a sample of the data

In [None]:
movies = spark.sql("SELECT * FROM movies_v LIMIT 5000").toPandas()
names = spark.sql("SELECT * FROM names_v LIMIT 5000").toPandas()
ratings = spark.sql("SELECT * FROM ratings_v LIMIT 5000").toPandas()
title_principals = spark.sql("SELECT * FROM title_principals_v LIMIT 5000").toPandas()
movie_details = spark.sql("SELECT * FROM movie_details_v LIMIT 5000").toPandas()
reviews = spark.sql("SELECT * FROM reviews_v LIMIT 5000").toPandas()

### Analyze `movies` dataset

Quick peek for `movies`

In [None]:
movies.head()

Let's check the percentage of null values in each column (_cleaning_)

In [None]:
s = movies.isnull().sum()/(len(movies)) * 100
s.where(s > 0).dropna().sort_values(ascending=False)

Check duplicate value, use title id (_deduplication_)

In [None]:
movies.duplicated(subset=["imdb_title_id"]).sum()

Find out possible repeating columns (_key restructuring_)

In [None]:
movies.dtypes.sort_index()

Some fields should be numeric, but they seem to be read as strings. Let's look at sample data, ignoring nulls (_data validation_)

In [None]:
movies[["avg_vote", "budget", "usa_gross_income", "worlwide_gross_income"]].dropna().head()

Check if any column contains multiple values, hence violates 1NF (_key restructuring / splitting_).  
Check if cell contains comma character.

In [None]:
for col in movies:
    if movies[col].astype(str).str.contains(",", na=False).any():
        print(col)

Let's quickly check whether `title` and `original_title` are always the same or whether they differ. (_deduplication_)

In [None]:
movies.loc[movies["title"] != movies["original_title"]].shape[0]

Quite a difference. Let's see the sample data to make sure (_deduplication_)

In [None]:
movies.loc[movies["title"] != movies["original_title"]].head()

Are all `date_published` values in yyyy-mm-dd format? (_data validation_)

In [None]:
movies.loc[movies["date_published"].astype(str).str.len() != len("yyyy-mm-dd")]["date_published"].head()

Do `year` and `date_published` represent the same data? (_deduplication_)

In [None]:
date_check = pd.to_datetime(movies["date_published"], errors="coerce")
date_diff = date_check.loc[date_check.dt.year != movies["year"].astype(int, errors="raise")]

date_diff.shape[0]

Apparently not. Let's peek at some of the rows that differ.

In [None]:
# date_diff keeps the index labels of movies, so select by label with .loc
movies.loc[date_diff.sample(5).index][["year", "date_published"]]

Are the financial columns (`budget`, `usa_gross_income`, `worlwide_gross_income`) all in USD, or do they use other currencies too?

In [None]:
currency_check = movies.loc[~movies["budget"].str.startswith(("$"), na=False)]["budget"].dropna()
currency_check.sample(5)

What are the currencies?

In [None]:
foreign_currencies = set()

In [None]:
foreign_currencies.update(currency_check.str.slice(0,3).unique().tolist())

In [None]:
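# Keep every non-USD row so that all currency codes are collected in the next cell
currency_check = movies.loc[~movies["usa_gross_income"].str.startswith(("$"), na=False)]["usa_gross_income"].dropna()
# head() also works when fewer than 5 rows match, unlike sample(5)
currency_check.head()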

In [None]:
foreign_currencies.update(currency_check.str.slice(0,3).unique().tolist())

In [None]:
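# Keep every non-USD row so that all currency codes are collected in the next cell
currency_check = movies.loc[~movies["worlwide_gross_income"].str.startswith(("$"), na=False)][
    "worlwide_gross_income"].dropna()
# head() also works when fewer than 5 rows match, unlike sample(5)
currency_check.head()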

In [None]:
foreign_currencies.update(currency_check.str.slice(0,3).unique().tolist())

Count all foreign currencies found

In [None]:
len(foreign_currencies)

### Analyze `names` dataset

Quick peek for `names`. 

In [None]:
names.head()

Let's check the percentage of null values in each column (_cleaning_)

In [None]:
s = names.isnull().sum()/(len(names)) * 100
s.where(s > 0).dropna().sort_values(ascending=False)

Check duplicate value, use name id (_deduplication_)

In [None]:
names.duplicated(subset=["imdb_name_id"]).sum()

Find out possible repeating columns (_deduplication_)

In [None]:
names.dtypes.sort_index()

Check if any column contains multiple values, hence violates 1NF (_key restructuring / splitting_).  
Check if cell contains comma character.

In [None]:
for col in names:
    if names[col].astype(str).str.contains(",", na=False).any():
        print(col)

Why is there a comma in `date_of_birth`? See the sample data (_data validation_)

In [None]:
check_date = names.loc[names["date_of_birth"].str
                        .contains(",", na=False)][["place_of_birth", "birth_details", "date_of_birth"]]
check_date.sample(5)

Why is there a comma in `date_of_death`? See the sample data (_data validation_)

In [None]:
check_date = names.loc[names["date_of_death"].str
                        .contains(",", na=False)][["place_of_death", "death_details", "date_of_death"]]
check_date.sample(5)

From analyzing `movies`, we might need to create a dedicated table for `director, writer, production_company, actors`, 
then later relate each of them to `movies`.  

Let's check whether they are already in the `names` dataset

### `movies` director vs `names`

There are name fields in `movies`, and we also have the `names` dataset. Are they the same? (_key restructuring / data validation_)

In [None]:
name_list = [i for i in names["name"].tolist() if i]

In [None]:
col = "director"

In [None]:
print("{} name in names : {}"
      .format(col, movies.loc[movies[col].str.title()
                         .isin([x.title().strip() for x in name_list])]
              .shape[0]))
print("{} name not in names : {}"
      .format(col, movies.loc[~movies[col].str.title()
                         .isin([x.title().strip() for x in name_list])]
              .shape[0]))

In [None]:
movies.loc[~movies[col].str.title()
           .isin([x.title().strip() for x in name_list])][col].head()

In [None]:
names_from_comma = set()
for idx, val in movies.loc[~movies[col].str.title()
           .isin([x.title().strip() for x in name_list])][col].str.split(",").items():
    # Null cells are not lists after the split, so skip them explicitly
    if isinstance(val, list):
        names_from_comma.update(name.title().strip() for name in val)

In [None]:
names_diff = set()

In [None]:
names_diff = names_from_comma - set([x.title().strip() for x in name_list])
len(names_diff)

### `movies` writer vs `names`

There are name fields in `movies`, and we also have the `names` dataset. Are they the same? (_key restructuring / data validation_)

In [None]:
col = "writer"

In [None]:
print("{} name in names : {}"
      .format(col, movies.loc[movies[col].str.title().str.strip()
                         .isin([x.title().strip() for x in name_list])]
              .shape[0]))
print("{} name not in names : {}"
      .format(col, movies.loc[~movies[col].str.title().str.strip()
                         .isin([x.title().strip() for x in name_list])]
              .shape[0]))

In [None]:
movies.loc[~movies[col].str.title().str.strip()
           .isin([x.title().strip() for x in name_list])][col].head()

In [None]:
names_from_comma = set()
for idx, val in movies.loc[~movies[col].str.title().str.strip()
           .isin([x.title().strip() for x in name_list])][col].str.split(",").items():
    # Null cells are not lists after the split, so skip them explicitly
    if isinstance(val, list):
        names_from_comma.update(name.title().strip() for name in val)

names_from_comma

In [None]:
names_diff.update(names_from_comma - set([x.title().strip() for x in name_list]))
len(names_diff)

### `movies` production company vs `names`

There are name fields in `movies`, and we also have the `names` dataset. Are they the same? (_key restructuring / data validation_)

In [None]:
col = "production_company"

In [None]:
print("{} name in names : {}"
      .format(col, movies.loc[movies[col].str.title().str.strip()
                         .isin([x.title().strip() for x in name_list])]
              .shape[0]))
print("{} name not in names : {}"
      .format(col, movies.loc[~movies[col].str.title().str.strip()
                         .isin([x.title().strip() for x in name_list])]
              .shape[0]))

That's a big difference, but it makes sense since a `production_company` is not a person, unlike `director, writer, or actors`.
Hence, we will not use the `names` dataframe as the master for `production_company`.

### `movies` actors vs `names`

There are name fields in `movies`, and we also have the `names` dataset. Are they the same? (_key restructuring / data validation_)

In [None]:
col = "actors"

In [None]:
print("{} name in names : {}"
      .format(col, movies.loc[movies[col].str.title().str.strip()
                         .isin([x.title().strip() for x in name_list])]
              .shape[0]))
print("{} name not in names : {}"
      .format(col, movies.loc[~movies[col].str.title().str.strip()
                         .isin([x.title().strip() for x in name_list])]
              .shape[0]))

In [None]:
movies.loc[~movies[col].str.title().str.strip()
           .isin([x.title().strip() for x in name_list])][col].head()

In [None]:
names_from_comma = set()
for idx, val in movies.loc[~movies[col].str.title().str.strip()
           .isin([x.title().strip() for x in name_list])][col].str.split(",").items():
    # Null cells are not lists after the split, so skip them explicitly
    if isinstance(val, list):
        names_from_comma.update(name.title().strip() for name in val)

len(names_from_comma)

In [None]:
names_diff.update(names_from_comma - set([x.title().strip() for x in name_list]))

Total number of names from `movies` that do not exist in the `names` dataset

In [None]:
print("Total name difference (movies vs name) : {} data".format(len(names_diff)))

### Analyze `ratings` dataset

Quick peek for `ratings`.

In [None]:
ratings.head()

Let's check the percentage of null values in each column (_cleaning_)

In [None]:
s = ratings.isnull().sum()/(len(ratings)) * 100
s.where(s > 0).dropna().sort_values(ascending=False)

Check duplicate value, using title id (_deduplication_)

In [None]:
ratings.duplicated(subset=["imdb_title_id"]).sum()

Find out possible repeating columns (_deduplication_)

In [None]:
ratings.dtypes.sort_index()

Check if any column contains multiple values, hence violates 1NF (_key restructuring / splitting_).  
Check if cell contains comma character.

In [None]:
for col in ratings:
    if ratings[col].astype(str).str.contains(",", na=False).any():
        print(col)

Avoid data inconsistency.  
Let's check whether `total_votes` is sum of `votes_10, votes_9, ...` (_data validation_)

In [None]:
ratings.loc[ratings["total_votes"].astype(int) != 
            ratings["votes_1"].astype(int) + ratings["votes_2"].astype(int) 
            + ratings["votes_3"].astype(int) + ratings["votes_4"].astype(int)
            + ratings["votes_5"].astype(int) + ratings["votes_6"].astype(int)
            + ratings["votes_7"].astype(int) + ratings["votes_8"].astype(int) 
            + ratings["votes_9"].astype(int) + ratings["votes_10"].astype(int)]

### Analyze `title_principals` dataset

Quick peek for `title_principals`.

In [None]:
title_principals.head()

Seems like many-to-many join table between movies and names.

Let's check the percentage of null values in each column (_cleaning_)

In [None]:
s = title_principals.isnull().sum()/(len(title_principals)) * 100
s.where(s > 0).dropna().sort_values(ascending=False)

What is actually in `job`?

In [None]:
title_principals.loc[title_principals["job"].notnull()]["job"].value_counts()

What is actually in `characters`?

In [None]:
title_principals.loc[title_principals["characters"].notnull()]["characters"].value_counts()

Check duplicate value, using title id and name id (_deduplication_)

In [None]:
title_principals.duplicated(subset=["imdb_title_id", "imdb_name_id", "category", "job", "characters"]).sum()

Find out possible repeating columns (_deduplication_)

In [None]:
title_principals.dtypes.sort_index()

Check if any column contains multiple values, hence violates 1NF (_key restructuring / splitting_).  
Check if cell contains comma character.

In [None]:
for col in title_principals:
    if title_principals[col].astype(str).str.contains(",", na=False).any():
        print(col)

See sample data for `job`, which has comma. Is it multiple values? (_key restructuring / splitting_)

In [None]:
title_principals.loc[title_principals["job"].astype(str)
                     .str.contains(",", na=False)]["job"].head()

See sample data for `characters`, which has comma. Is it multiple values? (_key restructuring / splitting_)

In [None]:
ser_characters = title_principals.loc[title_principals["characters"].astype(str)
                                      .str.contains(",", na=False)]["characters"]
ser_characters.head()

Is this a regular string or a list? If it is a string, we need to transform it into a list later.

In [None]:
pd.api.types.is_string_dtype(ser_characters)

Check whether all titles are available in `movies`

In [None]:
set(title_principals["imdb_title_id"].tolist()) - set(movies["imdb_title_id"].tolist())

Check whether all names are available in `names`

In [None]:
set(title_principals["imdb_name_id"].tolist()) - set(names["imdb_name_id"].tolist())

### Analyze `movie_details` dataset

Quick peek for `movie_details`.

In [None]:
movie_details.head()

Let's check the percentage of null values in each column (_cleaning_)

In [None]:
s = movie_details.isnull().sum()/(len(movie_details)) * 100
s.where(s > 0).dropna().sort_values(ascending=False)

Check duplicate value, use title id (_deduplication_)

In [None]:
movie_details.duplicated(subset=["movie_id"]).sum()

Find out possible repeating columns (_key restructuring_)

In [None]:
movie_details.dtypes.sort_index()

Check if any column contains multiple values, hence violates 1NF (_key restructuring / splitting_).  
Check if cell contains comma character.

In [None]:
for col in movie_details:
    if movie_details[col].astype(str).str.contains(",", na=False).any():
        print(col)

Let's check the values

In [None]:
movie_details[["genre", "plot_summary", "plot_synopsis"]]

The genre here is not relevant; we will take the genre from the `movies` dataframe

### Analyze `reviews` dataset

Quick peek for `reviews`.

In [None]:
reviews.head()

The `reviews` dataset seems OK. Duplicated `movie_id` values are fine, and the other fields look OK. The only catch is that we need to process `review_date`, which is currently a plain string.

## End of analysis. Let's check what we have

### `movies` analysis
**Good**
- No duplicate values
- No repeating columns

**Not so good (need transform)**
- Null values found in these columns; define a default value or leave them null
  + `description` : leave as is (keep null)
  + `metascore` : default to 0
  + `usa_gross_income` : default to `$ 0` (maintain data type consistency as string, so use this instead of numeric 0)
  + `budget` : default to `$ 0` (maintain data type consistency as string, so use this instead of numeric 0)
  + `worlwide_gross_income` : default to `$ 0` (maintain data type consistency as string, so use this instead of numeric 0)
  + `reviews_from_critics` : default to 0
  + `reviews_from_users` : default to 0
  + `production_company` : default to `Unknown`
  + `writer` : default to `Unknown`
  + `language` : default to `Unknown`
  + `director` : default to `Unknown`
  + `actors` : default to `Unknown`
  + `country` : default to `Unknown`
- Multiple values in cell (based on comma character)
  + `title` : OK to have comma in this field, so it's a single value
  + `original_title` : OK to have comma in this field, so it's a single value
  + `description` : OK to have comma in this field, so it's a single value
  + `genre` : needs to be separated to different table(s) to achieve 1NF, will put this on generic lookup table
  + `country` : needs to be separated to different table(s) to achieve 1NF, will put this on generic lookup table
  + `language` : needs to be separated to different table(s) to achieve 1NF, will put this on generic lookup table
  + `director` : needs to be separated to different table(s) to achieve 1NF
  + `writer` : needs to be separated to different table(s) to achieve 1NF
  + `production_company` : needs to be separated to different table(s) to achieve 1NF
  + `actors` : needs to be separated to different table(s) to achieve 1NF
- `date_published` and `year` can disagree (the year inside `date_published` vs the `year` column in the raw data)
- Need to convert foreign currencies on `budget`, `usa_gross_income`, and `worlwide_gross_income` to **USD**
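
The currency conversion noted above could start from a small parser like the one below. It is a sketch under assumptions: it assumes every value looks like `$ 1000` or a currency code followed by a space and an integer amount, and the names `split_currency`, `to_usd`, and `EXAMPLE_RATES_TO_USD` are made up for illustration. The rates in the table are placeholders, not real exchange rates.

```python
import math

# Hypothetical placeholder rates for illustration only; not real exchange rates.
EXAMPLE_RATES_TO_USD = {"USD": 1.0, "ABC": 0.5}

def split_currency(value):
    """Split '$ 1000' or 'ABC 500' into (currency_code, amount)."""
    # Null cells follow the default chosen above: `$ 0`.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ("USD", 0)
    code, _, amount = str(value).strip().partition(" ")
    currency = "USD" if code == "$" else code
    try:
        return (currency, int(amount.strip()))
    except ValueError:
        # The amount part is not a plain integer; leave it for manual review.
        return (currency, None)

def to_usd(value, rates):
    """Convert a raw cell to USD, or return None when it cannot be converted."""
    currency, amount = split_currency(value)
    rate = rates.get(currency)
    if amount is None or rate is None:
        return None
    return amount * rate

print(to_usd("$ 1000", EXAMPLE_RATES_TO_USD))
print(to_usd("ABC 500", EXAMPLE_RATES_TO_USD))
```

Returning `None` for an unknown currency keeps unconverted rows visible instead of silently treating them as USD.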

### `names` analysis
**Good**
- No duplicate values
- No repeating columns

**Not so good (need transform)**
- Null values found in these columns; define a default value or leave them null
    + `reason_of_death` : leave as is (keep null)
    + `place_of_death` : leave as is (keep null)
    + `death_details` : leave as is (keep null)
    + `date_of_death` : leave as is (keep null)
    + `height` : leave as is (keep null)
    + `birth_details` : leave as is (keep null)
    + `date_of_birth` : leave as is (keep null)
    + `bio` : leave as is (keep null)
    + `place_of_birth` : default to `Unknown`
    + `spouses_string` : default to `Unknown`
- Multiple values in cell (based on comma character)
    + `name` : OK to have comma in this field, so it's a single value
    + `birth_name` : OK to have comma in this field, so it's a single value
    + `bio` : OK to have comma in this field, so it's a single value
    + `birth_details` : OK to have comma in this field, so it's a single value
    + `place_of_birth` : OK to have comma in this field, so it's a single value
    + `death_details` : OK to have comma in this field, so it's a single value
    + `place_of_death` : OK to have comma in this field, so it's a single value
    + `reason_of_death` : OK to have comma in this field, so it's a single value
    + `spouses_string` : treated as a descriptive value in this demo, so it's OK for it to contain commas (not considered multiple values)
    + `date_of_birth` : we need to extract year from this field, then just default it to 1-January using extracted year
    + `date_of_death` : we need to extract year from this field, then just default it to 1-January using extracted year
- A lot of names appear in `movies` but not in `names`. We must insert those missing names into `names`, using default data for every field other than the name:
    + `imdb_name_id` : use `xx` followed by a 7-digit sequence number
    + `name` : known name from our checking
    + `birth_name` : known name from our checking 
    + `height` : null
    + `bio` : null
    + `birth_details` : null
    + `date_of_birth` : null
    + `place_of_birth` : `Unknown`
    + `death_details` : null
    + `date_of_death` : null
    + `place_of_death` : null
    + `reason_of_death` : null
    + `spouses_string` : `Unknown`
    + `spouses` : null
    + `divorces` : null
    + `spouses_with_children` : null
    + `children` : null
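
Two of the `names` rules above lend themselves to small helpers: defaulting `date_of_birth` / `date_of_death` to 1 January of the extracted year, and building the `xx` ids for names that are missing from `names`. The sketch below assumes the year appears as a standalone 4-digit number somewhere in the raw value (the exact shape of the malformed values is not shown here), and the function names are hypothetical.

```python
import re
from datetime import date

# Assumption: the year is the first standalone 4-digit number in the value.
YEAR_PATTERN = re.compile(r"\b(\d{4})\b")

def year_to_first_of_january(raw):
    """Return 1 January of the first 4-digit year found in raw, or None."""
    if not isinstance(raw, str):
        return None  # keep null, as decided above
    match = YEAR_PATTERN.search(raw)
    if match is None:
        return None
    year = int(match.group(1))
    # date() rejects year 0, so treat it as missing.
    return date(year, 1, 1) if year >= 1 else None

def placeholder_name_id(sequence):
    """Build an id made of 'xx' plus a zero-padded 7-digit sequence number."""
    return "xx{:07d}".format(sequence)

print(year_to_first_of_january("1899-05-10"))
print(placeholder_name_id(1))
```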

### `ratings` analysis
**Good**
- No duplicate values
- No repeating columns
- The `total_votes` is exact sum of all `votes_x`

**Not so good (need transform)**
- Null values found in these columns; define a default value or leave them null
    + `females_0age_votes` : default to 0
    + `females_0age_avg_vote` : default to 0
    + `males_0age_votes` : default to 0
    + `males_0age_avg_vote` : default to 0
    + `allgenders_0age_votes` : default to 0
    + `allgenders_0age_avg_vote` : default to 0
    + `females_18age_votes` : default to 0
    + `females_18age_avg_vote` : default to 0
    + `females_45age_votes` : default to 0
    + `females_45age_avg_vote` : default to 0
    + `males_18age_avg_vote` : default to 0
    + `males_18age_votes` : default to 0
    + `females_30age_votes` : default to 0
    + `females_30age_avg_vote` : default to 0
    + `allgenders_18age_votes` : default to 0
    + `allgenders_18age_avg_vote` : default to 0
    + `top1000_voters_votes` : default to 0
    + `top1000_voters_rating` : default to 0
    + `us_voters_votes` : default to 0
    + `us_voters_rating` : default to 0
    + `males_45age_votes` : default to 0
    + `males_45age_avg_vote` : default to 0
    + `females_allages_votes` : default to 0
    + `females_allages_avg_vote` : default to 0
    + `allgenders_45age_votes` : default to 0
    + `allgenders_45age_avg_vote` : default to 0
    + `males_30age_votes` : default to 0
    + `males_30age_avg_vote` : default to 0
    + `allgenders_30age_votes` : default to 0
    + `allgenders_30age_avg_vote` : default to 0
    + `males_allages_votes` : default to 0
    + `males_allages_avg_vote` : default to 0
    + `non_us_voters_rating` : default to 0
    + `non_us_voters_votes` : default to 0

### `title_principals` analysis
**Good**
- No repeating columns

**Not so good (need transform)**
- Null values found in these columns; define a default value or leave them null
    + `category` : keep null, since a person may have a value in only some of `category / job / characters`
    + `job` : keep null, since a person may have a value in only some of `category / job / characters`
    + `characters` : keep null, since a person may have a value in only some of `category / job / characters`
- Multiple values in cell (based on comma character)
  + `job` : keep as is, the comma is part of the value
  + `characters` : one person can play multiple characters in one movie, so we need to convert this string into a list of characters
- Two values missing from `movies` : _'tt1860336', 'tt2082513'_
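
The string-to-list conversion for `characters` might look like the sketch below. It assumes the raw value is a JSON-style array string such as `["Neo", "Thomas Anderson"]`; that format is an assumption, not something confirmed by the checks above, so the function falls back to a plain comma split. The name `parse_characters` is made up for illustration.

```python
import json

def parse_characters(raw):
    """Turn the raw `characters` string into a list of names; keep null as None."""
    if not isinstance(raw, str) or not raw.strip():
        return None
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Not JSON: fall back to splitting on commas.
        return [part.strip() for part in raw.split(",") if part.strip()]
    if isinstance(parsed, list):
        return [str(item) for item in parsed]
    # A single JSON scalar becomes a one-element list.
    return [str(parsed)]

print(parse_characters('["Neo", "Thomas Anderson"]'))
```

Note that the comma-split fallback would wrongly split a single character name that itself contains a comma, so it is worth checking a sample before relying on it.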

### `movie_details` analysis
We will only process `movie_id`, `plot_summary`, `plot_synopsis`

**Good**
- Required fields are free text, so just take them as-is

**Not so good (need transform)**
- None

### `reviews` analysis
We will process `is_spoiler`, `movie_id`, `review_date`, `review_summary`, `review_text`, `user_id`

**Good**
- Required fields are boolean or free text, so just take them as-is

**Not so good (need transform)**
- Transform `is_spoiler` into boolean data type
- Transform `review_date` into date data type
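
The two `reviews` transforms above could be sketched as follows. The date format `%d %B %Y` (for example `10 February 2006`) is an assumption about how `review_date` is written, so it is exposed as a parameter, and `%B` relies on an English locale. The function names are hypothetical.

```python
from datetime import datetime

def parse_review_date(raw, fmt="%d %B %Y"):
    """Parse a review date string into a date, or return None if it does not match fmt."""
    if not isinstance(raw, str):
        return None
    try:
        return datetime.strptime(raw.strip(), fmt).date()
    except ValueError:
        return None

def parse_is_spoiler(raw):
    """Map booleans and common true/false strings to bool; anything else becomes None."""
    if isinstance(raw, bool):
        return raw
    if isinstance(raw, str):
        text = raw.strip().lower()
        if text in ("true", "1", "yes"):
            return True
        if text in ("false", "0", "no"):
            return False
    return None

print(parse_review_date("10 February 2006"))
print(parse_is_spoiler("True"))
```

With pandas, both helpers can be applied per cell with `Series.apply`; unparseable values come back as `None` so they can be inspected rather than dropped.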

----

## Tables & Relationships
### Core entities
- From `movies` : table `movies`
- From `names` : table `people` (list of directors, actors, writers, ...)
- From `ratings` : table `movie_numeric_votes` (from `votes_xx` field) and table `movie_avg_votes` 
(from `xxx_rating` and `xxx_votes` where `xxx_rating` is the average score, while `xxx_votes` is the vote count)

### Lookups
In a denormalized data warehouse, we avoid lookup tables so that queries need fewer joins. Since we use BigQuery, a single field can hold an array of values, and we use that instead.
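
Preparing such an array field from a comma-separated column (`genre`, `country`, `language`) can be sketched with a small helper. This is only an illustration: the name `split_values` is made up, and using `Unknown` for empty cells follows the defaults chosen in the `movies` analysis.

```python
def split_values(raw, default="Unknown"):
    """Split a comma-separated cell into a list of trimmed values."""
    if not isinstance(raw, str):
        return [default]
    parts = [part.strip() for part in raw.split(",") if part.strip()]
    # Cells that contain only whitespace or commas also fall back to the default.
    return parts or [default]

print(split_values("Drama, Romance"))
print(split_values(None))
```

On a pandas dataframe, `movies["genre"].apply(split_values)` would give one list per row, which can then be loaded as a repeated field.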

### Relationships
- many-to-many from `movies` to lookup table (for `genre`, `country`, and `language`)
- many-to-many from `movies` to `people` (for director, writers, and actors)
- one-to-many from `movies` to `movie_numeric_votes` and `movie_avg_votes`
- one-to-one from `movies` to `movie_details`
- one-to-many from `movie_details` to `reviews`