In [0]:
# imports
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t

In [0]:
# creating a spark session
spark = SparkSession.builder.master("yarn").appName("MovieAnalysis").getOrCreate()

In [0]:
# read the (already cleaned) data needed for this task from the stored parquet files into dataframes
pf_movie = spark.read.parquet("dbfs:/FileStore/shared_uploads/ibele@stud.dhbw-ravensburg.de/movies.parquet")
pf_crew = spark.read.parquet("dbfs:/FileStore/shared_uploads/ibele@stud.dhbw-ravensburg.de/crew.parquet")

<b> Question: </b>\
*How many movies were written by a female writer?*

In [0]:
# since there could be more than one female writer writing one movie, the id of the movie has to be selected as distinct
(pf_crew
    .where(
        (pf_crew.job=="Writer") 
         & (pf_crew.gender==1))
    .select("id")
    .distinct()
    .count())

Out[6]: 139

In [0]:
# nicer option with f-string
f_w_movies = pf_crew.where((pf_crew.job=="Writer") & (pf_crew.gender==1)).select("id").distinct().count()
print(f"There are {f_w_movies} movies, which were written by a female writer.")

There are 139 movies, which were written by a female writer.


<b> Answer: </b>\
Under the assumption that '1' in the column gender stands for female (2 for male, and 0 where the gender is not specified), there are 139 movies that were written by a female writer.

*Explain what data storage structure you used to store the information and why. When storing the information how can you speed up the information retrieval if you know you are interested in looking at the gender of the writer?\
Why does it speed up the information retrieval when you store the data differently?*

<b> Answer: </b>\
For this task I used Apache Parquet files as the data storage structure. Apache Parquet is a free and open-source, column-oriented data file format designed for efficient data storage and retrieval (definition source: https://parquet.apache.org/). It provides efficient compression and encoding schemes that handle complex data in bulk, and it is designed as a common interchange format for both batch and interactive workloads. Nested data structures are supported through a record shredding and assembly algorithm. Physically, Parquet uses a hybrid storage layout: the rows are split horizontally into row groups, and within each row group the values are stored column by column.\
One advantage of Parquet over, for example, CSV is that columnar files are more compact, because compression and encoding can be chosen per column, where all values share one data type and are often similar. The columnar layout also lets a query skip data that is not relevant, which is why queries and aggregations are typically faster than on row-oriented formats. This saves hardware, reduces the latency of data access, and so saves time and money. Furthermore, Parquet stores metadata for each column chunk, for example the minimum, the maximum and the number of values.\
In addition, Parquet is especially well suited to queries that read only certain columns from a large table (for example when processing data in the gigabyte range), because it can read just the needed columns and thereby greatly reduce I/O. Parquet also supports nested data structures and several compression codecs, for instance GZIP, LZO or Snappy. \
So in general Parquet is good for storing big data, above all structured data tables, although binary content such as images can also be kept in binary columns. It saves cloud storage space by using highly efficient column-wise compression and flexible encoding schemes for columns with different data types. Parquet offers different encoding schemes such as PLAIN or RLE_DICTIONARY; the latter helps when a column contains many repeated values, which further reduces the file size. It also increases data throughput and performance through techniques like data skipping, whereby queries that fetch specific column values need not read the entire row of data.\
In summary, Parquet is a more efficient data format for larger files, which is why I used it to store the given movie information: one of the requirements of this task is to handle very large amounts of data. \
To speed up the retrieval of specific information, you can partition the Parquet dataset. Partitioning divides the data into groups (partitions) based on the values of one or more columns, and Spark writes each partition to its own directory. This improves the performance of queries that filter on the partition column. For example, since we are looking specifically at the gender of the writer, we could partition by the column 'gender' (e.g. df.write.partitionBy("gender").parquet("parquet-gender")), so that all rows with the same gender are stored in one partition. We could also partition by the column 'job', so that for instance all writers end up in one partition. Since in this context we are interested in female writers, a possible partitioning uses both the column 'gender' and the column 'job' of the crew dataframe: pf_crew.write.partitionBy("gender", "job").parquet("gender-job-parquet"). However, because the dataset contains many different jobs, partitioning by "job" creates many small partitions and might not improve the query performance much. In general, partitioning speeds up the information retrieval because the wanted rows are stored together: when the data is partitioned by gender and job, a query that filters on these columns reads only the matching partitions instead of searching through the whole dataset.
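
The effect of partitioning described above can be sketched without Spark. The following is only an illustration, not how Spark or Parquet are implemented: it uses plain CSV files from the Python standard library instead of Parquet, the sample rows are made up, and the function names `write_partitioned` and `read_with_pruning` are hypothetical. It writes rows into directories named like `gender=1/job=Writer` and then answers the "female writer" question by opening only the directory that matches the filter.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample rows; the values are made up for illustration only.
ROWS = [
    {"id": 1, "name": "A", "job": "Writer", "gender": 1},
    {"id": 1, "name": "B", "job": "Writer", "gender": 1},
    {"id": 2, "name": "C", "job": "Writer", "gender": 2},
    {"id": 3, "name": "D", "job": "Director", "gender": 1},
    {"id": 4, "name": "E", "job": "Writer", "gender": 1},
]


def write_partitioned(rows, root, partition_cols):
    """Write rows into nested directories such as gender=1/job=Writer/."""
    for row in rows:
        part_dir = Path(root).joinpath(*(f"{c}={row[c]}" for c in partition_cols))
        part_dir.mkdir(parents=True, exist_ok=True)
        data_file = part_dir / "part-0.csv"
        is_new = not data_file.exists()
        # The partition columns are encoded in the path, so they are not
        # repeated inside the data file.
        payload = {k: v for k, v in row.items() if k not in partition_cols}
        with data_file.open("a", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(payload))
            if is_new:
                writer.writeheader()
            writer.writerow(payload)


def read_with_pruning(root, **filters):
    """Open only the files whose directory path matches every filter."""
    files_opened = 0
    rows = []
    for data_file in Path(root).rglob("*.csv"):
        # Recover the partition values from the directory names.
        parts = dict(p.split("=", 1) for p in data_file.relative_to(root).parts[:-1])
        if any(parts.get(col) != str(val) for col, val in filters.items()):
            continue  # pruned: this file is skipped without being opened
        files_opened += 1
        with data_file.open(newline="") as fh:
            rows.extend(csv.DictReader(fh))
    return rows, files_opened


with tempfile.TemporaryDirectory() as root:
    write_partitioned(ROWS, root, ["gender", "job"])
    total_files = len(list(Path(root).rglob("*.csv")))
    rows, files_opened = read_with_pruning(root, gender=1, job="Writer")
    distinct_movies = len({r["id"] for r in rows})
    print(f"opened {files_opened} of {total_files} files, {distinct_movies} distinct movies")
```

With this sample data only one of the three partition directories is read. The same idea applies to the partitioned Parquet dataset: a filter on the partition columns lets the reader skip every directory that cannot contain matching rows.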