# Working with PySpark: Reading CSV and Basic Operations

In this notebook we will cover:
1. Reading CSV files with proper parameters
2. Why inferSchema is important
3. Basic transformations (Select, Filter, GroupBy)
4. Saving to Parquet format

## 1. Initialize Spark Session

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Netflix Data Analysis") \
    .getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
26/01/31 18:00:31 WARN Utils: Your hostname, Deniss-Laptop.local, resolves to a loopback address: 127.0.0.1; using 192.168.1.37 instead (on interface en0)
26/01/31 18:00:31 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/31 18:00:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## 2. Reading CSV with Proper Parameters

In [2]:
# Read CSV with header=True and inferSchema=True
df = spark.read.csv("data/netflix_titles.csv", header=True, inferSchema=True)

In [3]:
# Show first 5 rows
df.show(5)

+-------+-------+--------------------+---------------+--------------------+-------------+------------------+------------+------+---------+--------------------+--------------------+
|show_id|   type|               title|       director|                cast|      country|        date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+--------------------+---------------+--------------------+-------------+------------------+------------+------+---------+--------------------+--------------------+
|     s1|  Movie|Dick Johnson Is Dead|Kirsten Johnson|                NULL|United States|September 25, 2021|        2020| PG-13|   90 min|       Documentaries|As her father nea...|
|     s2|TV Show|       Blood & Water|           NULL|Ama Qamata, Khosi...| South Africa|September 24, 2021|        2021| TV-MA|2 Seasons|International TV ...|After crossing pa...|
|     s3|TV Show|           Ganglands|Julien Leclercq|Sami Bouajila, Tr...|         NULL|Septem

In [4]:
# Display the schema
df.printSchema()

root
 |-- show_id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- title: string (nullable = true)
 |-- director: string (nullable = true)
 |-- cast: string (nullable = true)
 |-- country: string (nullable = true)
 |-- date_added: string (nullable = true)
 |-- release_year: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- listed_in: string (nullable = true)
 |-- description: string (nullable = true)



### Why is inferSchema Important?

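**inferSchema=True** makes Spark scan the data and pick a type for each column:
- Columns whose values are all numeric get a numeric type such as `int` or `double`
- Columns that contain any non-numeric value remain `string`

In this dataset `release_year` is still reported as `string` in the schema above. That suggests some rows carry non-numeric values in that column (the GroupBy output in section 3 shows examples such as `40 min`).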

**Without inferSchema** all columns will be strings, which:
- Makes numeric operations more complex
- Requires manual type casting
- Can lead to errors in filtering and aggregation

### What Happens WITHOUT header=True?

In [5]:
# Read WITHOUT header - first row becomes data!
df_no_header = spark.read.csv("data/netflix_titles.csv", inferSchema=True)
df_no_header.show(5)

+-------+-------+--------------------+---------------+--------------------+-------------+------------------+------------+------+---------+--------------------+--------------------+
|    _c0|    _c1|                 _c2|            _c3|                 _c4|          _c5|               _c6|         _c7|   _c8|      _c9|                _c10|                _c11|
+-------+-------+--------------------+---------------+--------------------+-------------+------------------+------------+------+---------+--------------------+--------------------+
|show_id|   type|               title|       director|                cast|      country|        date_added|release_year|rating| duration|           listed_in|         description|
|     s1|  Movie|Dick Johnson Is Dead|Kirsten Johnson|                NULL|United States|September 25, 2021|        2020| PG-13|   90 min|       Documentaries|As her father nea...|
|     s2|TV Show|       Blood & Water|           NULL|Ama Qamata, Khosi...| South Africa|Septem

In [6]:
# Notice the auto-generated column names - _c0, _c1, _c2...
df_no_header.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)



In [None]:
# Read WITHOUT header and WITHOUT inferSchema - columns are named _c0, _c1, ... and are all strings
df_no_schema_infer = spark.read.csv("data/netflix_titles.csv")
df_no_schema_infer.show(5)

In [None]:
df_no_schema_infer.printSchema()

## 3. Basic Transformations

### Select - Choosing Columns

In [7]:
# Select only the columns we need
df.select("title", "release_year", "type").show(10)

+--------------------+------------+-------+
|               title|release_year|   type|
+--------------------+------------+-------+
|Dick Johnson Is Dead|        2020|  Movie|
|       Blood & Water|        2021|TV Show|
|           Ganglands|        2021|TV Show|
|Jailbirds New Orl...|        2021|TV Show|
|        Kota Factory|        2021|TV Show|
|       Midnight Mass|        2021|TV Show|
|My Little Pony: A...|        2021|  Movie|
|             Sankofa|        1993|  Movie|
|The Great British...|        2021|TV Show|
|        The Starling|        2021|  Movie|
+--------------------+------------+-------+
only showing top 10 rows


### Filter - Filtering Data

In [8]:
# Filter content from 2020 and newer
# (release_year is a string column here, so Spark casts it implicitly for the comparison)
df.filter(df.release_year >= 2020).show(10)

+-------+-------+--------------------+--------------------+--------------------+--------------+------------------+------------+------+---------+--------------------+--------------------+
|show_id|   type|               title|            director|                cast|       country|        date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+--------------------+--------------------+--------------------+--------------+------------------+------------+------+---------+--------------------+--------------------+
|     s1|  Movie|Dick Johnson Is Dead|     Kirsten Johnson|                NULL| United States|September 25, 2021|        2020| PG-13|   90 min|       Documentaries|As her father nea...|
|     s2|TV Show|       Blood & Water|                NULL|Ama Qamata, Khosi...|  South Africa|September 24, 2021|        2021| TV-MA|2 Seasons|International TV ...|After crossing pa...|
|     s3|TV Show|           Ganglands|     Julien Leclercq|Sami B

In [9]:
# Filter only movies (not TV shows) from 2020 onwards
df.filter((df.type == "Movie") & (df.release_year >= 2020)).show(10)

+-------+-----+--------------------+--------------------+--------------------+--------------------+------------------+------------+------+--------+--------------------+--------------------+
|show_id| type|               title|            director|                cast|             country|        date_added|release_year|rating|duration|           listed_in|         description|
+-------+-----+--------------------+--------------------+--------------------+--------------------+------------------+------------+------+--------+--------------------+--------------------+
|     s1|Movie|Dick Johnson Is Dead|     Kirsten Johnson|                NULL|       United States|September 25, 2021|        2020| PG-13|  90 min|       Documentaries|As her father nea...|
|     s7|Movie|My Little Pony: A...|Robert Cullen, Jo...|Vanessa Hudgens, ...|                NULL|September 24, 2021|        2021|    PG|  91 min|Children & Family...|Equestria's divid...|
|    s10|Movie|        The Starling|      Theodore

### GroupBy - Grouping and Aggregation

In [10]:
# Count number of movies vs TV shows
df.groupBy("type").count().show()

+-------------+-----+
|         type|count|
+-------------+-----+
|         NULL|    1|
|      TV Show| 2676|
|        Movie| 6131|
|William Wyler|    1|
+-------------+-----+



In [11]:
# Count content by year, sort in descending order
# (release_year is a string, so the sort is alphabetical and non-year values come first)
df.groupBy("release_year").count().orderBy("release_year", ascending=False).show(20)

+-----------------+-----+
|     release_year|count|
+-----------------+-----+
|    United States|    1|
|    June 12, 2021|    1|
| January 15, 2021|    1|
| January 13, 2021|    1|
|December 15, 2020|    1|
|  August 13, 2020|    1|
|           40 min|    1|
|             2021|  589|
|             2020|  952|
|             2019| 1026|
|             2018| 1145|
|             2017| 1030|
|             2016|  901|
|             2015|  559|
|             2014|  352|
|             2013|  288|
|             2012|  237|
|             2011|  185|
|             2010|  193|
|             2009|  152|
+-----------------+-----+
only showing top 20 rows


## 4. Saving to Parquet Format

### Why Parquet?

**Parquet** is a columnar storage format with several advantages:
- **Faster**: columnar storage lets analytical queries read only the columns they need
- **More compact**: built-in compression reduces file size
- **Preserves schema**: column names and data types are stored in the files
- **Works with partitions**: data can be split into directories by column value, which suits large datasets

We'll cover this in more detail in a separate video!

In [12]:
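# Save DataFrame to Parquet format
# mode("overwrite") replaces any existing output, so this cell can be re-run safely
df.write.mode("overwrite").parquet("output/netflix_cleaned")

Without `mode("overwrite")`, re-running this cell fails because the output path already exists:

AnalysisException: [PATH_ALREADY_EXISTS] Path file:/Users/dkulemza/youtube_code/Spark101/dataframes_101/output/netflix_cleaned already exists. Set mode as "overwrite" to overwrite the existing path. SQLSTATE: 42K04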

After the write succeeds:
- A folder `output/netflix_cleaned/` is created
- Inside are `.parquet` files (one per partition)
- A `_SUCCESS` marker file and metadata files are written alongside them

In [None]:
# Verify we can read the saved data
df_parquet = spark.read.parquet("output/netflix_cleaned")
df_parquet.show(5)

In [None]:
# Schema is preserved automatically!
df_parquet.printSchema()