In [4]:
#pip install pyspark

### Start a spark session

In [2]:
import pyspark 
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadWriteVal").getOrCreate()

spark

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/25 15:49:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
# Counts the block managers (driver plus executors) known to Spark, not CPU cores
executors = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
executors

1

Dataset source: https://www.kaggle.com/satpreetmakhija/netflix-movies-and-tv-shows-2021?select=netflixData.csv

### Importing netflix data in the form of csv

In [3]:
path = 'netflixData.csv'
# Ask Spark to infer the column types and to treat the first row as the header
netflix_rawdata = spark.read.csv(path,inferSchema=True,header=True)

# Print top 5 rows of data
netflix_rawdata.limit(5).toPandas()

                                                                                

Unnamed: 0,Show Id,Title,Description,Director,Genres,Cast,Production Country,Release Date,Rating,Duration,Imdb Score,Content Type,Date Added
0,cc1b6ed9-cf9e-4057-8303-34577fb54477,(Un)Well,This docuseries takes a deep dive into the luc...,,Reality TV,,United States,2020,TV-MA,1 Season,6.6/10,TV Show,
1,e2ef4e91-fb25-42ab-b485-be8e3b23dedb,#Alive,"As a grisly virus rampages a city, a lone man ...",Cho Il,"Horror Movies, International Movies, Thrillers","Yoo Ah-in, Park Shin-hye",South Korea,2020,TV-MA,99 min,6.2/10,Movie,"September 8, 2020"
2,b01b73b7-81f6-47a7-86d8-acb63080d525,#AnneFrank - Parallel Stories,"Through her diary, Anne Frank's story is retol...","Sabina Fedeli, Anna Migotto","Documentaries, International Movies","Helen Mirren, Gengher Gatti",Italy,2019,TV-14,95 min,6.4/10,Movie,"July 1, 2020"
3,b6611af0-f53c-4a08-9ffa-9716dc57eb9c,#blackAF,Kenya Barris and his family navigate relations...,,TV Comedies,"Kenya Barris, Rashida Jones, Iman Benson, Genn...",United States,2020,TV-MA,1 Season,6.6/10,TV Show,
4,7f2d4170-bab8-4d75-adc2-197f7124c070,#cats_the_mewvie,This pawesome documentary explores how our fel...,Michael Margolis,"Documentaries, International Movies",,Canada,2020,TV-14,90 min,5.1/10,Movie,"February 5, 2020"


### Writing csv file

In [5]:
# Create a copy of the original data source by converting it to pandas and writing it to another csv.
# Note that to_csv also writes the pandas index as an unnamed first column by default.

netflix_rawdata.toPandas().to_csv('netflix_copy.csv')

In [6]:
path = './netflix_copy.csv'
# Ask Spark to infer the column types and to treat the first row as the header
netflix_copy_rawdata = spark.read.csv(path,inferSchema=True,header=True)

# Print top 5 rows of data
netflix_copy_rawdata.limit(5).toPandas()

Unnamed: 0,_c0,Show Id,Title,Description,Director,Genres,Cast,Production Country,Release Date,Rating,Duration,Imdb Score,Content Type,Date Added
0,0,cc1b6ed9-cf9e-4057-8303-34577fb54477,(Un)Well,This docuseries takes a deep dive into the luc...,,Reality TV,,United States,2020,TV-MA,1 Season,6.6/10,TV Show,
1,1,e2ef4e91-fb25-42ab-b485-be8e3b23dedb,#Alive,"As a grisly virus rampages a city, a lone man ...",Cho Il,"Horror Movies, International Movies, Thrillers","Yoo Ah-in, Park Shin-hye",South Korea,2020,TV-MA,99 min,6.2/10,Movie,"September 8, 2020"
2,2,b01b73b7-81f6-47a7-86d8-acb63080d525,#AnneFrank - Parallel Stories,"Through her diary, Anne Frank's story is retol...","Sabina Fedeli, Anna Migotto","Documentaries, International Movies","Helen Mirren, Gengher Gatti",Italy,2019,TV-14,95 min,6.4/10,Movie,"July 1, 2020"
3,3,b6611af0-f53c-4a08-9ffa-9716dc57eb9c,#blackAF,Kenya Barris and his family navigate relations...,,TV Comedies,"Kenya Barris, Rashida Jones, Iman Benson, Genn...",United States,2020,TV-MA,1 Season,6.6/10,TV Show,
4,4,7f2d4170-bab8-4d75-adc2-197f7124c070,#cats_the_mewvie,This pawesome documentary explores how our fel...,Michael Margolis,"Documentaries, International Movies",,Canada,2020,TV-14,90 min,5.1/10,Movie,"February 5, 2020"


The copy was written to disk and can be read back the same way as the original file. Note the extra `_c0` column: `to_csv` also wrote the pandas index, which Spark then read as an unnamed column.
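The `_c0` column in the table above comes from the pandas index, which `to_csv` writes by default under an empty header; Spark gives such a column an auto-generated `_cN` name. Passing `index=False` to `to_csv` avoids writing it in the first place. For a copy that already contains it, here is a minimal sketch of a clean-up helper. The name `leaked_index_columns` is hypothetical, not part of PySpark.

```python
import re

def leaked_index_columns(columns):
    """Return the column names that look like Spark's auto-generated _cN names."""
    return [c for c in columns if re.fullmatch(r"_c\d+", c)]

# Column names as read back from netflix_copy.csv (shortened here).
copy_columns = ["_c0", "Show Id", "Title", "Description"]
to_drop = leaked_index_columns(copy_columns)
print(to_drop)

# With a real Spark DataFrame, the same list could be passed to drop():
#   netflix_copy_rawdata = netflix_copy_rawdata.drop(*to_drop)
```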

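We can also save and load data in Parquet format. Parquet is a columnar, compressed format that stores the schema together with the data, so it is usually smaller and faster to query than a csv file.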

### Writing parquet file

In [7]:
#netflix_rawdata.write.parquet('netflix_parquet/')

The commented-out line above raised an error in the Spark version used here, because its Parquet writer rejects column names that contain spaces. We need to rename those columns first.

In [8]:
netflix_rawdata_for_parquet = netflix_rawdata.withColumnRenamed("Show Id","Show_Id") \
                                .withColumnRenamed("Production Country","Production_Country") \
                                .withColumnRenamed("Release Date","Release_Date") \
                                .withColumnRenamed("Imdb Score","Imdb_Score") \
                                .withColumnRenamed("Content Type","Content_Type") \
                                .withColumnRenamed("Date Added","Date_Added")
netflix_rawdata_for_parquet.limit(2).toPandas()

Unnamed: 0,Show_Id,Title,Description,Director,Genres,Cast,Production_Country,Release_Date,Rating,Duration,Imdb_Score,Content_Type,Date_Added
0,cc1b6ed9-cf9e-4057-8303-34577fb54477,(Un)Well,This docuseries takes a deep dive into the luc...,,Reality TV,,United States,2020,TV-MA,1 Season,6.6/10,TV Show,
1,e2ef4e91-fb25-42ab-b485-be8e3b23dedb,#Alive,"As a grisly virus rampages a city, a lone man ...",Cho Il,"Horror Movies, International Movies, Thrillers","Yoo Ah-in, Park Shin-hye",South Korea,2020,TV-MA,99 min,6.2/10,Movie,"September 8, 2020"
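Chaining `withColumnRenamed` works, but it gets long as the number of columns grows. A minimal sketch of a generic alternative is to compute a safe name for every column and rename them all in one call with `DataFrame.toDF`. The helper names `parquet_safe` and `rename_for_parquet` are hypothetical, and the set of replaced characters is an assumption about what the Parquet writer may reject, not something taken from this notebook.

```python
import re

# Assumed set of characters to replace: space , ; { } ( ) newline tab =
_UNSAFE = re.compile(r"[ ,;{}()\n\t=]")

def parquet_safe(name):
    """Replace each unsafe character in a column name with an underscore."""
    return _UNSAFE.sub("_", name)

def rename_for_parquet(df):
    """Return df (assumed to be a Spark DataFrame) with every column renamed."""
    return df.toDF(*[parquet_safe(c) for c in df.columns])

print([parquet_safe(c) for c in ["Show Id", "Production Country", "Title"]])
```

With the data from this notebook, `rename_for_parquet(netflix_rawdata)` would be expected to produce the same column names as the chained calls above.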


Now that the columns are renamed and the spaces are removed, we can write the data to a Parquet file.

In [9]:
netflix_rawdata_for_parquet.write.mode("overwrite").parquet('netflix_parquet/')

We can also partition Parquet output by a column. For example, let's take the first 10 rows that have a release date and partition them by `Release_Date`.

In [10]:
netflix_rawdata_for_partition_parquet = netflix_rawdata_for_parquet.filter("Release_Date is not NULL").limit(10)
netflix_rawdata_for_partition_parquet.toPandas()

Unnamed: 0,Show_Id,Title,Description,Director,Genres,Cast,Production_Country,Release_Date,Rating,Duration,Imdb_Score,Content_Type,Date_Added
0,cc1b6ed9-cf9e-4057-8303-34577fb54477,(Un)Well,This docuseries takes a deep dive into the luc...,,Reality TV,,United States,2020,TV-MA,1 Season,6.6/10,TV Show,
1,e2ef4e91-fb25-42ab-b485-be8e3b23dedb,#Alive,"As a grisly virus rampages a city, a lone man ...",Cho Il,"Horror Movies, International Movies, Thrillers","Yoo Ah-in, Park Shin-hye",South Korea,2020,TV-MA,99 min,6.2/10,Movie,"September 8, 2020"
2,b01b73b7-81f6-47a7-86d8-acb63080d525,#AnneFrank - Parallel Stories,"Through her diary, Anne Frank's story is retol...","Sabina Fedeli, Anna Migotto","Documentaries, International Movies","Helen Mirren, Gengher Gatti",Italy,2019,TV-14,95 min,6.4/10,Movie,"July 1, 2020"
3,b6611af0-f53c-4a08-9ffa-9716dc57eb9c,#blackAF,Kenya Barris and his family navigate relations...,,TV Comedies,"Kenya Barris, Rashida Jones, Iman Benson, Genn...",United States,2020,TV-MA,1 Season,6.6/10,TV Show,
4,7f2d4170-bab8-4d75-adc2-197f7124c070,#cats_the_mewvie,This pawesome documentary explores how our fel...,Michael Margolis,"Documentaries, International Movies",,Canada,2020,TV-14,90 min,5.1/10,Movie,"February 5, 2020"
5,c293788a-41f7-49a3-a7fc-005ea33bce2b,#FriendButMarried,"Pining for his high school crush for years, a ...",Rako Prijanto,"Dramas, International Movies, Romantic Movies","Adipati Dolken, Vanesha Prescilla, Rendi Jhon,...",Indonesia,2018,TV-G,102 min,7.0/10,Movie,"May 21, 2020"
6,0555e67e-f624-4a05-93e4-55c117d0056d,#FriendButMarried 2,As Ayu and Ditto finally transition from best ...,Rako Prijanto,"Dramas, International Movies, Romantic Movies","Adipati Dolken, Mawar de Jongh, Sari Nila, Von...",Indonesia,2020,TV-G,104 min,7.0/10,Movie,"June 28, 2020"
7,c844460f-6178-4f87-929e-80816c74ca35,#realityhigh,When nerdy high schooler Dani finally attracts...,Fernando Lebrija,Comedies,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,2017,TV-14,99 min,5.1/10,Movie,"September 8, 2017"
8,8b34e0e9-7258-4e49-b799-2e7eddbd7e34,#Rucker50,This documentary celebrates the 50th anniversa...,Robert McCullough Jr.,"Documentaries, Sports Movies",,United States,2016,TV-PG,56 min,5.1/10,Movie,"December 1, 2016"
9,6da2fc83-1546-4e9d-bf2e-9b472a059c18,#Selfie,"Two days before their final exams, three teen ...",Cristina Jacob,"Comedies, Dramas, International Movies","Flavia Hojda, Crina Semciuc, Olimpia Melinte, ...",Romania,2014,TV-MA,125 min,5.8/10,Movie,"June 21, 2021"


In [11]:
netflix_rawdata_for_partition_parquet.write.mode("overwrite").partitionBy("Release_Date").parquet('part_parquet/')
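
`partitionBy` writes one subdirectory per distinct value of the partition column, named `Release_Date=<value>`, and stores the remaining columns in the Parquet files inside it. When the directory is read back, Spark rebuilds the column from the directory names and appends it after the other columns. Below is a small sketch for inspecting that layout with the standard library. `list_partitions` is a hypothetical helper, and the demo builds an empty stand-in directory tree instead of touching `part_parquet/`.

```python
import tempfile
from pathlib import Path

def list_partitions(root, column):
    """Return the sorted partition values found under root for the given column."""
    prefix = column + "="
    return sorted(
        p.name[len(prefix):]
        for p in Path(root).iterdir()
        if p.is_dir() and p.name.startswith(prefix)
    )

# Stand-in for the layout written above, using the years of the ten rows.
with tempfile.TemporaryDirectory() as tmp:
    for year in (2014, 2016, 2017, 2018, 2019, 2020):
        (Path(tmp) / f"Release_Date={year}").mkdir()
    (Path(tmp) / "_SUCCESS").touch()  # marker file that Spark typically writes
    found = list_partitions(tmp, "Release_Date")

print(found)
```

On the real output, `list_partitions("part_parquet", "Release_Date")` should list one entry per release year.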

### Reading parquet file

In [12]:
path = "./part_parquet"
partitioned = spark.read.parquet(path)
partitioned.toPandas()

Unnamed: 0,Show_Id,Title,Description,Director,Genres,Cast,Production_Country,Rating,Duration,Imdb_Score,Content_Type,Date_Added,Release_Date
0,c844460f-6178-4f87-929e-80816c74ca35,#realityhigh,When nerdy high schooler Dani finally attracts...,Fernando Lebrija,Comedies,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,TV-14,99 min,5.1/10,Movie,"September 8, 2017",2017
1,c293788a-41f7-49a3-a7fc-005ea33bce2b,#FriendButMarried,"Pining for his high school crush for years, a ...",Rako Prijanto,"Dramas, International Movies, Romantic Movies","Adipati Dolken, Vanesha Prescilla, Rendi Jhon,...",Indonesia,TV-G,102 min,7.0/10,Movie,"May 21, 2020",2018
2,6da2fc83-1546-4e9d-bf2e-9b472a059c18,#Selfie,"Two days before their final exams, three teen ...",Cristina Jacob,"Comedies, Dramas, International Movies","Flavia Hojda, Crina Semciuc, Olimpia Melinte, ...",Romania,TV-MA,125 min,5.8/10,Movie,"June 21, 2021",2014
3,b01b73b7-81f6-47a7-86d8-acb63080d525,#AnneFrank - Parallel Stories,"Through her diary, Anne Frank's story is retol...","Sabina Fedeli, Anna Migotto","Documentaries, International Movies","Helen Mirren, Gengher Gatti",Italy,TV-14,95 min,6.4/10,Movie,"July 1, 2020",2019
4,cc1b6ed9-cf9e-4057-8303-34577fb54477,(Un)Well,This docuseries takes a deep dive into the luc...,,Reality TV,,United States,TV-MA,1 Season,6.6/10,TV Show,,2020
5,e2ef4e91-fb25-42ab-b485-be8e3b23dedb,#Alive,"As a grisly virus rampages a city, a lone man ...",Cho Il,"Horror Movies, International Movies, Thrillers","Yoo Ah-in, Park Shin-hye",South Korea,TV-MA,99 min,6.2/10,Movie,"September 8, 2020",2020
6,b6611af0-f53c-4a08-9ffa-9716dc57eb9c,#blackAF,Kenya Barris and his family navigate relations...,,TV Comedies,"Kenya Barris, Rashida Jones, Iman Benson, Genn...",United States,TV-MA,1 Season,6.6/10,TV Show,,2020
7,7f2d4170-bab8-4d75-adc2-197f7124c070,#cats_the_mewvie,This pawesome documentary explores how our fel...,Michael Margolis,"Documentaries, International Movies",,Canada,TV-14,90 min,5.1/10,Movie,"February 5, 2020",2020
8,0555e67e-f624-4a05-93e4-55c117d0056d,#FriendButMarried 2,As Ayu and Ditto finally transition from best ...,Rako Prijanto,"Dramas, International Movies, Romantic Movies","Adipati Dolken, Mawar de Jongh, Sari Nila, Von...",Indonesia,TV-G,104 min,7.0/10,Movie,"June 28, 2020",2020
9,8b34e0e9-7258-4e49-b799-2e7eddbd7e34,#Rucker50,This documentary celebrates the 50th anniversa...,Robert McCullough Jr.,"Documentaries, Sports Movies",,United States,TV-PG,56 min,5.1/10,Movie,"December 1, 2016",2016


### Getting summary

To get type of dataframe

In [13]:
print(type(partitioned))

<class 'pyspark.sql.dataframe.DataFrame'>


To get type of each column in dataframe

In [14]:
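# printSchema() prints the schema itself and returns None, hence the "None" in the output
print(partitioned.printSchema())
#------
#--OR--
#------
# describe() returns a summary DataFrame whose columns are all strings, not the original types
print(partitioned.describe()) 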

root
 |-- Show_Id: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Director: string (nullable = true)
 |-- Genres: string (nullable = true)
 |-- Cast: string (nullable = true)
 |-- Production_Country: string (nullable = true)
 |-- Rating: string (nullable = true)
 |-- Duration: string (nullable = true)
 |-- Imdb_Score: string (nullable = true)
 |-- Content_Type: string (nullable = true)
 |-- Date_Added: string (nullable = true)
 |-- Release_Date: integer (nullable = true)

None
DataFrame[summary: string, Show_Id: string, Title: string, Description: string, Director: string, Genres: string, Cast: string, Production_Country: string, Rating: string, Duration: string, Imdb_Score: string, Content_Type: string, Date_Added: string, Release_Date: string]
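
A DataFrame also exposes its column types directly as `partitioned.dtypes`, a list of `(name, type)` tuples, which is easier to use in code than the printed schema. Here is a minimal sketch that uses a hand-copied subset of the schema above as a stand-in for `partitioned.dtypes`. `columns_of_type` is a hypothetical helper.

```python
def columns_of_type(dtypes, type_name):
    """Return the names of the columns whose type string equals type_name."""
    return [name for name, dtype in dtypes if dtype == type_name]

# Stand-in for partitioned.dtypes (subset of the schema printed above).
sample_dtypes = [
    ("Show_Id", "string"),
    ("Imdb_Score", "string"),
    ("Date_Added", "string"),
    ("Release_Date", "int"),
]

print(columns_of_type(sample_dtypes, "int"))
print(columns_of_type(sample_dtypes, "string"))
```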


To get list of columns in dataframe

In [15]:
print(partitioned.columns)

['Show_Id', 'Title', 'Description', 'Director', 'Genres', 'Cast', 'Production_Country', 'Rating', 'Duration', 'Imdb_Score', 'Content_Type', 'Date_Added', 'Release_Date']


To get type of just 1 column

In [16]:
partitioned.schema['Director'].dataType

StringType

To get summary statistics of 1 feature

In [17]:
partitioned.describe(['Imdb_Score']).show()

+-------+----------+
|summary|Imdb_Score|
+-------+----------+
|  count|        10|
|   mean|      null|
| stddev|      null|
|    min|    5.1/10|
|    max|    7.0/10|
+-------+----------+



To get summary of multiple features

In [18]:
partitioned.select("Imdb_Score", "Director").summary("count", "min", "25%", "75%", "max").show()

+-------+----------+--------------------+
|summary|Imdb_Score|            Director|
+-------+----------+--------------------+
|  count|        10|                   8|
|    min|    5.1/10|              Cho Il|
|    25%|      null|                null|
|    75%|      null|                null|
|    max|    7.0/10|Sabina Fedeli, An...|
+-------+----------+--------------------+



In the two outputs above,
* notice that the count of Director is 8 although the dataframe has 10 rows. This is because 2 of the titles have null values in Director.
* notice that the mean, stddev and percentiles of Imdb_Score are null, because the column holds strings such as 6.6/10.
* Also notice the data types we got:
    - Release_Date: integer (already numeric, since it is just the year)
    - Rating: string (categorical, e.g. TV-MA, so it should stay a string)
    - Duration: string (a number followed by a unit, e.g. 99 min or 1 Season; the number should be numeric)
    - Imdb_Score: string (e.g. 6.6/10; the score should be numeric)
    - Date_Added: string (should be a date)

In [19]:
from pyspark.sql.functions import col, regexp_extract, to_date
from pyspark.sql.types import DoubleType, IntegerType

# A plain cast of "99 min" or "6.6/10" to a number returns null by default, so extract the numeric part first.
# Duration keeps only the number, so minutes and seasons end up in the same column.
partitioned_formatted = partitioned \
            .withColumn("Duration", regexp_extract(col("Duration"), r"(\d+)", 1).cast(IntegerType())) \
            .withColumn("Imdb_Score", regexp_extract(col("Imdb_Score"), r"([\d.]+)", 1).cast(DoubleType())) \
            .withColumn("Date_Added", to_date(col("Date_Added"), "MMMM d, yyyy"))
partitioned_formatted

DataFrame[Show_Id: string, Title: string, Description: string, Director: string, Genres: string, Cast: string, Production_Country: string, Rating: string, Duration: int, Imdb_Score: double, Content_Type: string, Date_Added: date, Release_Date: int]

We can see that Duration, Imdb_Score and Date_Added have been converted to the intended types.
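
Before relying on a conversion in Spark, it can help to check the parsing rules against a few literal values from the tables above. A plain-Python sketch follows. The function names are hypothetical, and treating `1 Season` as the number 1 is an assumption that mixes seasons with minutes.

```python
import re
from datetime import datetime

def parse_score(text):
    """'6.6/10' -> 6.6; None when the text is missing or has another shape."""
    match = re.match(r"(\d+(?:\.\d+)?)/10$", text or "")
    return float(match.group(1)) if match else None

def parse_duration(text):
    """'99 min' -> 99, '1 Season' -> 1 (the unit is discarded)."""
    match = re.match(r"(\d+)", text or "")
    return int(match.group(1)) if match else None

def parse_date_added(text):
    """'September 8, 2020' -> date(2020, 9, 8); None for missing values."""
    if not text:
        return None
    return datetime.strptime(text, "%B %d, %Y").date()

print(parse_score("6.6/10"), parse_duration("99 min"), parse_date_added("September 8, 2020"))
```

Values that do not match, such as the missing Date_Added of (Un)Well, come back as `None`, which mirrors the null a Spark conversion would produce.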