# Shared Variables

Shared variables are special variables that can be used across parallel operations. PySpark provides two types of shared variables:

1. Broadcast variables
2. Accumulators


### 1. Broadcast Variables in PySpark

#### Key Points:

- **Definition**: Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
- **Purpose**: They are useful for sharing large read-only data, such as lookup tables, with all nodes, so the data is sent to each executor once instead of being serialized with every task.
- **Read-Only**: Broadcast variables are read-only; tasks cannot modify them once they have been broadcast.

In [14]:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("BroadcastApp").setMaster("local[*]")
sc = SparkContext(conf=conf)

In [16]:
# Create an RDD from a list
numbers = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])


In [17]:
squared_numbers = numbers.map(lambda x: x ** 2)
print("Original:", numbers.collect())
print("Squared:", squared_numbers.collect())

Original: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Squared: [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]


In [20]:
# Lookup table (kept tiny here; in practice this would be large)
lookup_table = {1: "one", 2: "two", 3: "three"}

# Broadcast the lookup table
broadcast_lookup = sc.broadcast(lookup_table)

# Data RDD
data = sc.parallelize([1, 2, 3, 4, 5])

# Use the broadcast variable in a transformation
result = data.map(lambda x: broadcast_lookup.value.get(x, "unknown")).collect()

print(result)

['one', 'two', 'three', 'unknown', 'unknown']


### 2. Accumulators in PySpark

#### Key Points:

- **Definition**: Accumulators are variables that are only “added” to through an associative and commutative operation and can be used to implement counters or sums.
- **Purpose**: They are useful for aggregating information across the cluster, such as counting events or accumulating values.
- **Write-Only from Workers**: Accumulators can be updated from the workers, but the value is only reliably read by the driver program.


### Behavior and Characteristics:

- **Initialization**:    
    - An accumulator is created on the driver and initialized to a given value.
- **Update Mechanism**:    
    - Tasks running on workers add to the accumulator using operations like `+=`.
    - Each task has its own local copy of the accumulator to which it writes.
- **Aggregation**:    
    - Spark merges each task's updates into the driver's copy as tasks complete.
    - The aggregated value can be accessed on the driver program using the `value` attribute.

In [21]:
error_count = sc.accumulator(0)

data = sc.parallelize(["good", "bad", "good", "bad", "good"])

def process_record(record):
    global error_count
    if record == "bad":
        error_count += 1
        return record
    else:
        return "it's good"



processed_data = data.map(process_record)
print(processed_data.collect())
print(error_count.value)


["it's good", 'bad', "it's good", 'bad', "it's good"]
2


In [22]:


total_sales = sc.accumulator(0.0)

# Data RDD
data = sc.parallelize([100.0, 200.5, 300.0, 150.75, 250.25])

# Function to add sales amount to the accumulator
def add_sales(sale):
    global total_sales
    total_sales += sale

# Apply the function
data.foreach(add_sales)

# The value of the accumulator on the driver
print(f"Total sales: {total_sales.value}")

Total sales: 1001.5


----

# Dataframes


## Introduction to Spark DataFrames
A Spark DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database. DataFrames provide a high-level API for working with structured data and integrate seamlessly with Spark’s SQL capabilities.

In [23]:
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder \
.appName("Spark DataFrame") \
.getOrCreate()

In [24]:
# HW: use subprocess to download the file with wget or curl

download_url = "https://raw.githubusercontent.com/KirkYagami/PySpark_Training/refs/heads/main/PySpark/03_data/disney_plus_shows.csv"

import requests

# A timeout keeps the cell from hanging if the server does not respond
response = requests.get(download_url, timeout=30)

if response.status_code == 200:
    with open("disney_plus_shows.csv", "wb") as file:
        file.write(response.content)
    print("Successfully downloaded the file")
else:
    print(f"Failed to download the file. Status code: {response.status_code}")

Successfully downloaded the file


In [25]:
disney_df = spark.read \
  .format("csv") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .option("multiline", "true") \
  .option("quote", "\"") \
  .option("escape", "\"") \
  .load("disney_plus_shows.csv")

In [26]:
type(disney_df)

In [27]:
disney_df.printSchema()

root
 |-- imdb_id: string (nullable = true)
 |-- title: string (nullable = true)
 |-- plot: string (nullable = true)
 |-- type: string (nullable = true)
 |-- rated: string (nullable = true)
 |-- year: string (nullable = true)
 |-- released_at: string (nullable = true)
 |-- added_at: string (nullable = true)
 |-- runtime: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- director: string (nullable = true)
 |-- writer: string (nullable = true)
 |-- actors: string (nullable = true)
 |-- language: string (nullable = true)
 |-- country: string (nullable = true)
 |-- awards: string (nullable = true)
 |-- metascore: string (nullable = true)
 |-- imdb_rating: string (nullable = true)
 |-- imdb_votes: string (nullable = true)



In [30]:
disney_df.show(3, truncate=False, vertical=True)

-RECORD 0------------------------------------------------------------------------------------------------
 imdb_id     | tt0147800
 title       | 10 Things I Hate About You
 plot        | A pretty, popular teenager can't go out on a date until her ill-tempered older sister does.
 type        | movie
 rated       | PG-13

In [31]:
# Display column names
print(disney_df.columns)

['imdb_id', 'title', 'plot', 'type', 'rated', 'year', 'released_at', 'added_at', 'runtime', 'genre', 'director', 'writer', 'actors', 'language', 'country', 'awards', 'metascore', 'imdb_rating', 'imdb_votes']


In [32]:
# Select specific columns
selected_columns = disney_df.select("title", "genre", "imdb_rating")
selected_columns.show(truncate=False)

+------------------------------------------+---------------------------------------------------------------+-----------+
|title                                     |genre                                                          |imdb_rating|
+------------------------------------------+---------------------------------------------------------------+-----------+
|10 Things I Hate About You                |Comedy, Drama, Romance                                         |7.3        |
|101 Dalmatian Street                      |Animation, Comedy, Family                                      |6.2        |
|101 Dalmatians                            |Adventure, Comedy, Crime, Family                               |5.7        |
|101 Dalmatians 2: Patch's London Adventure|Animation, Adventure, Comedy, Family, Musical                  |5.8        |
|102 Dalmatians                            |Adventure, Comedy, Family                                      |4.9        |
|12 Dates of Christmas          

In [33]:
# imdb_rating is a string column; Spark casts it implicitly for this numeric comparison
high_rated_shows = disney_df.filter(disney_df.imdb_rating > 7.0)
high_rated_shows.show(truncate=False)


## The Four Primary Column Selection Methods
### Method 1: String Literals - The Simple Approach

String literals represent the most straightforward way to reference columns. You simply use the column name as a string.

In [34]:
# Basic string literal selection
disney_df.select("title", "genre", "imdb_rating").show(5)

+--------------------+--------------------+-----------+
|               title|               genre|imdb_rating|
+--------------------+--------------------+-----------+
|10 Things I Hate ...|Comedy, Drama, Ro...|        7.3|
|101 Dalmatian Street|Animation, Comedy...|        6.2|
|      101 Dalmatians|Adventure, Comedy...|        5.7|
|101 Dalmatians 2:...|Animation, Advent...|        5.8|
|      102 Dalmatians|Adventure, Comedy...|        4.9|
+--------------------+--------------------+-----------+
only showing top 5 rows



In [35]:
# String literals work well with groupBy operations
disney_df.groupBy("type").count().show()

+-------+-----+
|   type|count|
+-------+-----+
|   NULL|   98|
|episode|   23|
|  movie|  680|
| series|  191|
+-------+-----+



In [36]:
# filter() also accepts a SQL expression string
# Note: year is a string column, so this comparison is lexicographic ("2015–" > "2015")
disney_df.filter("year > '2015'").select("title", "year", "genre").show(5)

+--------------------+-----+--------------------+
|               title| year|               genre|
+--------------------+-----+--------------------+
|101 Dalmatian Street|2018–|Animation, Comedy...|
|A Celebration of ...| 2020|               Music|
|   A Wrinkle in Time| 2018|Adventure, Family...|
|             Aladdin| 2019|Adventure, Family...|
|America's Nationa...|2015–|         Documentary|
+--------------------+-----+--------------------+
only showing top 5 rows



**When to use string literals:**

- Simple column selections without any transformations
- GroupBy operations where you're just specifying grouping columns
- Basic filtering operations with simple conditions
- When you want clean, readable code for straightforward operations

**Limitations of string literals:**

- No support for column expressions or transformations
- Cannot chain operations like `.desc()` or `.alias()`
- Limited to basic column references only
- Cannot be used for mathematical operations between columns

### Method 2: DataFrame Dot Notation - The Object-Oriented Way

Dot notation treats columns as attributes of the DataFrame object, giving you access to Column objects that support more operations.

In [37]:
# Using dot notation for column selection
disney_df.select(disney_df.title, disney_df.genre, disney_df.imdb_rating).show(5)


+--------------------+--------------------+-----------+
|               title|               genre|imdb_rating|
+--------------------+--------------------+-----------+
|10 Things I Hate ...|Comedy, Drama, Ro...|        7.3|
|101 Dalmatian Street|Animation, Comedy...|        6.2|
|      101 Dalmatians|Adventure, Comedy...|        5.7|
|101 Dalmatians 2:...|Animation, Advent...|        5.8|
|      102 Dalmatians|Adventure, Comedy...|        4.9|
+--------------------+--------------------+-----------+
only showing top 5 rows



In [38]:
# Dot notation supports expressions and transformations
disney_df.select(
    disney_df.title,
    disney_df.imdb_rating.cast("double").alias("rating_numeric"),
    disney_df.genre.substr(1, 10).alias("genre_short")
).show(5)

+--------------------+--------------+-----------+
|               title|rating_numeric|genre_short|
+--------------------+--------------+-----------+
|10 Things I Hate ...|           7.3| Comedy, Dr|
|101 Dalmatian Street|           6.2| Animation,|
|      101 Dalmatians|           5.7| Adventure,|
|101 Dalmatians 2:...|           5.8| Animation,|
|      102 Dalmatians|           4.9| Adventure,|
+--------------------+--------------+-----------+
only showing top 5 rows



In [40]:
# You can use dot notation in orderBy operations
# Note: imdb_rating is a string column, so the sort is lexicographic and "N/A" comes first
disney_df.select("title", "imdb_rating", "year").orderBy(disney_df.imdb_rating.desc()).show(5)

+--------------------+-----------+-----+
|               title|imdb_rating| year|
+--------------------+-----------+-----+
|       Holiday Magic|        N/A| 2017|
|        Episode #2.7|        N/A| 2013|
|Disneyland Around...|        N/A| 1966|
|     Dog: Impossible|        N/A|2019–|
|Incredible! The S...|        N/A| 2015|
+--------------------+-----------+-----+
only showing top 5 rows



**When to use dot notation:**

- When you need to apply transformations to columns
- For chaining operations like `.alias()`, `.desc()`, `.asc()`
- When working with column expressions
- For type casting and string operations on columns

**Limitations of dot notation:**

- Column names must be valid Python identifiers (no spaces, special characters)
- Doesn't work with column names stored in variables
- Can become verbose when referencing the same DataFrame repeatedly
- Fails with column names that conflict with DataFrame methods

```

# This won't work if you have a column named "count" or "select"
# disney_df.count  # This refers to the DataFrame method, not a column

# This won't work with spaces in column names
# disney_df.imdb rating  # Syntax error due to space

```

### Method 3: DataFrame Bracket Notation - The Flexible Approach

Bracket notation is similar to dot notation but uses dictionary-style access, making it more flexible for problematic column names.

In [41]:
# Basic bracket notation usage
disney_df.select(disney_df["title"], disney_df["genre"], disney_df["imdb_rating"]).show(5)
# Bracket notation handles any column name, including those with spaces


+--------------------+--------------------+-----------+
|               title|               genre|imdb_rating|
+--------------------+--------------------+-----------+
|10 Things I Hate ...|Comedy, Drama, Ro...|        7.3|
|101 Dalmatian Street|Animation, Comedy...|        6.2|
|      101 Dalmatians|Adventure, Comedy...|        5.7|
|101 Dalmatians 2:...|Animation, Advent...|        5.8|
|      102 Dalmatians|Adventure, Comedy...|        4.9|
+--------------------+--------------------+-----------+
only showing top 5 rows



In [42]:

# Using variables with bracket notation - very powerful feature
columns_of_interest = ["title", "genre", "imdb_rating", "year"]
disney_df.select([disney_df[col] for col in columns_of_interest]).show(5)

+--------------------+--------------------+-----------+-----+
|               title|               genre|imdb_rating| year|
+--------------------+--------------------+-----------+-----+
|10 Things I Hate ...|Comedy, Drama, Ro...|        7.3| 1999|
|101 Dalmatian Street|Animation, Comedy...|        6.2|2018–|
|      101 Dalmatians|Adventure, Comedy...|        5.7| 1996|
|101 Dalmatians 2:...|Animation, Advent...|        5.8| 2002|
|      102 Dalmatians|Adventure, Comedy...|        4.9| 2000|
+--------------------+--------------------+-----------+-----+
only showing top 5 rows



In [43]:
# Bracket notation supports all the same expressions as dot notation
disney_df.select(
    disney_df["title"],
    disney_df["imdb_rating"].cast("double").alias("numeric_rating"),
    disney_df["year"].cast("integer").alias("release_year")
).show(5)

+--------------------+--------------+------------+
|               title|numeric_rating|release_year|
+--------------------+--------------+------------+
|10 Things I Hate ...|           7.3|        1999|
|101 Dalmatian Street|           6.2|        NULL|
|      101 Dalmatians|           5.7|        1996|
|101 Dalmatians 2:...|           5.8|        2002|
|      102 Dalmatians|           4.9|        2000|
+--------------------+--------------+------------+
only showing top 5 rows



In [44]:

# Dynamic column selection based on conditions
rating_columns = [col for col in disney_df.columns if "rating" in col.lower()]
disney_df.select(["title"] + [disney_df[col] for col in rating_columns]).show(5)


+--------------------+-----------+
|               title|imdb_rating|
+--------------------+-----------+
|10 Things I Hate ...|        7.3|
|101 Dalmatian Street|        6.2|
|      101 Dalmatians|        5.7|
|101 Dalmatians 2:...|        5.8|
|      102 Dalmatians|        4.9|
+--------------------+-----------+
only showing top 5 rows



**When to use bracket notation:**

- Column names contain spaces, special characters, or reserved words
- Dynamic column selection using variables or loops
- When column names are stored in lists or generated programmatically
- Any situation where dot notation limitations become problematic

**Advantages of bracket notation:**

- Works with any column name regardless of special characters
- Supports dynamic column references through variables
- All the expression capabilities of dot notation
- More flexible for programmatic column selection

### Method 4: col() Function - The Universal Solution

The `col()` function from `pyspark.sql.functions` is the most versatile approach: it creates a Column from a name without needing a reference to a particular DataFrame.

In [46]:
# Note: importing max and min this way shadows Python's built-in max() and min()
from pyspark.sql.functions import col, count, avg, max, min, desc, asc

# Basic col() usage
disney_df.select(col("title"), col("genre"), col("imdb_rating")).show(5)




+--------------------+--------------------+-----------+
|               title|               genre|imdb_rating|
+--------------------+--------------------+-----------+
|10 Things I Hate ...|Comedy, Drama, Ro...|        7.3|
|101 Dalmatian Street|Animation, Comedy...|        6.2|
|      101 Dalmatians|Adventure, Comedy...|        5.7|
|101 Dalmatians 2:...|Animation, Advent...|        5.8|
|      102 Dalmatians|Adventure, Comedy...|        4.9|
+--------------------+--------------------+-----------+
only showing top 5 rows



In [47]:
# col() inside aggregations lets you chain Column methods such as .cast()
disney_df.groupBy("genre").agg(
    count(col("title")).alias("movie_count"),
    avg(col("imdb_rating").cast("double")).alias("avg_rating")
).show()

+--------------------+-----------+------------------+
|               genre|movie_count|        avg_rating|
+--------------------+-----------+------------------+
|Animation, Advent...|          1|               6.5|
|Animation, Comedy...|          1|               6.7|
|Adventure, Biogra...|          1|               8.1|
|Adventure, Comedy...|          1|               4.7|
|Family, Romance, ...|          1|               5.4|
|Animation, Action...|          5|               6.4|
|Animation, Advent...|          1|               6.1|
|Action, Adventure...|          6|7.1499999999999995|
|Animation, Comedy...|          3| 6.533333333333334|
|Documentary, Hist...|          1|               7.6|
|Adventure, Family...|          1|               8.5|
|Action, Adventure...|          1|               6.5|
|Documentary, Biog...|          2|              7.85|
|Documentary, Adve...|          1|               4.7|
|  Documentary, Music|          2|               1.9|
|Animation, Advent...|      

In [48]:
# col() works seamlessly with ordering operations
# (imdb_rating is still a string here, so the order is lexicographic)
disney_df.select("title", "imdb_rating", "year").orderBy(col("imdb_rating").desc()).show(5)

+--------------------+-----------+-----+
|               title|imdb_rating| year|
+--------------------+-----------+-----+
|       Holiday Magic|        N/A| 2017|
|        Episode #2.7|        N/A| 2013|
|Disneyland Around...|        N/A| 1966|
|     Dog: Impossible|        N/A|2019–|
|Incredible! The S...|        N/A| 2015|
+--------------------+-----------+-----+
only showing top 5 rows



In [49]:
# col() is convenient for building complex expressions in select statements
disney_df.select(
    col("title"),
    col("imdb_rating").cast("double").alias("rating"),
    (col("year").cast("integer") + 10).alias("future_year")
).show(5)

+--------------------+------+-----------+
|               title|rating|future_year|
+--------------------+------+-----------+
|10 Things I Hate ...|   7.3|       2009|
|101 Dalmatian Street|   6.2|       NULL|
|      101 Dalmatians|   5.7|       2006|
|101 Dalmatians 2:...|   5.8|       2012|
|      102 Dalmatians|   4.9|       2010|
+--------------------+------+-----------+
only showing top 5 rows



In [50]:
# col() expressions support mathematical operations between columns
# Let's create a derived column using multiple existing columns
disney_df.select(
    col("title"),
    col("imdb_rating").cast("double").alias("rating"),
    col("metascore").cast("double").alias("meta_score"),
    (col("imdb_rating").cast("double") * 10 + col("metascore").cast("double")).alias("combined_score")
).show(5)

+--------------------+------+----------+--------------+
|               title|rating|meta_score|combined_score|
+--------------------+------+----------+--------------+
|10 Things I Hate ...|   7.3|      70.0|         143.0|
|101 Dalmatian Street|   6.2|      NULL|          NULL|
|      101 Dalmatians|   5.7|      49.0|         106.0|
|101 Dalmatians 2:...|   5.8|      NULL|          NULL|
|      102 Dalmatians|   4.9|      35.0|          84.0|
+--------------------+------+----------+--------------+
only showing top 5 rows



**When to use col():**

- Inside aggregation functions when you need to chain Column methods such as `.cast()` (a plain column name is also accepted when no chaining is needed)
- Complex column expressions and mathematical operations
- When working with functions from `pyspark.sql.functions`
- Ordering operations that need `.desc()` or `.asc()`
- When no DataFrame variable is at hand, for example in the middle of a method chain

**Why col() is often the best choice:**

- Works in nearly every context where the other methods work
- Accepted wherever PySpark expects a Column
- Provides consistent behavior across different operations
- Most explicit and clear about your intentions

## Null Handling🕵
Null values are common in datasets, representing missing, unknown, or undefined data. Properly handling these null values is crucial for accurate data analysis and processing.

In [51]:
disney_df.filter(disney_df["director"].isNull()).show(truncate=False)

+-------+-----+----+----+-----+----+-----------+-----------------+-------+-----+--------+------+------+--------+-------+------+---------+-----------+----------+
|imdb_id|title|plot|type|rated|year|released_at|added_at         |runtime|genre|director|writer|actors|language|country|awards|metascore|imdb_rating|imdb_votes|
+-------+-----+----+----+-----+----+-----------+-----------------+-------+-----+--------+------+------+--------+-------+------+---------+-----------+----------+
|NULL   |NULL |NULL|NULL|NULL |NULL|NULL       |December 13, 2019|NULL   |NULL |NULL    |NULL  |NULL  |NULL    |NULL   |NULL  |NULL     |NULL       |NULL      |
|NULL   |NULL |NULL|NULL|NULL |NULL|NULL       |May 1, 2020      |NULL   |NULL |NULL    |NULL  |NULL  |NULL    |NULL   |NULL  |NULL     |NULL       |NULL      |
|NULL   |NULL |NULL|NULL|NULL |NULL|NULL       |November 12, 2019|NULL   |NULL |NULL    |NULL  |NULL  |NULL    |NULL   |NULL  |NULL     |NULL       |NULL      |
|NULL   |NULL |NULL|NULL|NULL |NUL

In [61]:
disney_df.filter(col("director").isNull()).count()

#HW: count of nulls in each and every column

98

In [59]:
# Replacing Null Values
disney_filled = disney_df.fillna({"metascore": 0})
disney_filled.show(truncate=True)




+----------+--------------------+--------------------+------+---------+-----+-----------+-----------------+-------+--------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+---------+-----------+----------+
|   imdb_id|               title|                plot|  type|    rated| year|released_at|         added_at|runtime|               genre|            director|              writer|              actors|        language|             country|              awards|metascore|imdb_rating|imdb_votes|
+----------+--------------------+--------------------+------+---------+-----+-----------+-----------------+-------+--------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+---------+-----------+----------+
| tt0147800|10 Things I Hate ...|A pretty, popular...| movie|    PG-13| 1999|31 Mar 1999|November 12, 2019| 97 min|Comedy, D

In [60]:
# Replacing Null Values
disney_filled = disney_df.fillna({"director": "Unknown"})
disney_filled.show(truncate=False)


In [62]:
disney_filled.filter(col("director").isNull()).count()

0

In [63]:
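# Replacing nulls in multiple columns
# Both columns are strings, so the replacements are cast to strings (visible in the explain() plan)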

disney_filled = disney_df.fillna({"metascore": 0, "imdb_rating": 5.0})
disney_filled.show(truncate=True)

+----------+--------------------+--------------------+------+---------+-----+-----------+-----------------+-------+--------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+---------+-----------+----------+
|   imdb_id|               title|                plot|  type|    rated| year|released_at|         added_at|runtime|               genre|            director|              writer|              actors|        language|             country|              awards|metascore|imdb_rating|imdb_votes|
+----------+--------------------+--------------------+------+---------+-----+-----------+-----------------+-------+--------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+---------+-----------+----------+
| tt0147800|10 Things I Hate ...|A pretty, popular...| movie|    PG-13| 1999|31 Mar 1999|November 12, 2019| 97 min|Comedy, D

In [65]:
disney_filled.explain(True)

== Parsed Logical Plan ==
Project [imdb_id#0, title#1, plot#2, type#3, rated#4, year#5, released_at#6, added_at#7, runtime#8, genre#9, director#10, writer#11, actors#12, language#13, country#14, awards#15, coalesce(metascore#16, cast(0 as string)) AS metascore#1519, coalesce(imdb_rating#17, cast(5.0 as string)) AS imdb_rating#1520, imdb_votes#18]
+- Relation [imdb_id#0,title#1,plot#2,type#3,rated#4,year#5,released_at#6,added_at#7,runtime#8,genre#9,director#10,writer#11,actors#12,language#13,country#14,awards#15,metascore#16,imdb_rating#17,imdb_votes#18] csv

== Analyzed Logical Plan ==
imdb_id: string, title: string, plot: string, type: string, rated: string, year: string, released_at: string, added_at: string, runtime: string, genre: string, director: string, writer: string, actors: string, language: string, country: string, awards: string, metascore: string, imdb_rating: string, imdb_votes: string
Project [imdb_id#0, title#1, plot#2, type#3, rated#4, year#5, released_at#6, added_at#7