## PySpark DataFrame
This is a short introduction and quickstart for the PySpark DataFrame API. PySpark DataFrames are implemented on top of RDDs and are lazily evaluated: when Spark transforms data, it does not compute the result immediately but plans how to compute it later. The computation starts only when an action such as collect() is explicitly called.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

25/01/25 22:38:44 WARN Utils: Your hostname, Amoakos-MacBook-Air.local resolves to a loopback address: 127.0.0.1; using 172.20.10.12 instead (on interface en0)
25/01/25 22:38:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/01/25 22:38:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
spark

In [4]:
df = spark.read.csv('appendix.csv', header=True)

In [5]:
df.show()

+-----------+-------------+-----------+--------------------+--------------------+--------------------+----+-----------------------+--------------------------------------+---------------------------------------+---------+---------+-----------+--------------------------------------------+--------------+-----------------+------------------------+------------------------------+------------------------------+----------+------+--------+-----------------------------+
|Institution|Course Number|Launch Date|        Course Title|         Instructors|      Course Subject|Year|Honor Code Certificates|Participants (Course Content Accessed)|Audited (> 50% Course Content Accessed)|Certified|% Audited|% Certified|% Certified of > 50% Course Content Accessed|% Played Video|% Posted in Forum|% Grade Higher Than Zero|Total Course Hours (Thousands)|Median Hours for Certification|Median Age|% Male|% Female|% Bachelor's Degree or Higher|
+-----------+-------------+-----------+--------------------+----------

In [9]:
df.printSchema()

root
 |-- Institution: string (nullable = true)
 |-- Course Number: string (nullable = true)
 |-- Launch Date: string (nullable = true)
 |-- Course Title: string (nullable = true)
 |-- Instructors: string (nullable = true)
 |-- Course Subject: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- Honor Code Certificates: string (nullable = true)
 |-- Participants (Course Content Accessed): string (nullable = true)
 |-- Audited (> 50% Course Content Accessed): string (nullable = true)
 |-- Certified: string (nullable = true)
 |-- % Audited: string (nullable = true)
 |-- % Certified: string (nullable = true)
 |-- % Certified of > 50% Course Content Accessed: string (nullable = true)
 |-- % Played Video: string (nullable = true)
 |-- % Posted in Forum: string (nullable = true)
 |-- % Grade Higher Than Zero: string (nullable = true)
 |-- Total Course Hours (Thousands): string (nullable = true)
 |-- Median Hours for Certification: string (nullable = true)
 |-- Median Age: string

In [7]:
df.createOrReplaceTempView('tableA')
spark.sql('Select * from tableA limit 5').show()

+-----------+-------------+-----------+--------------------+--------------------+--------------------+----+-----------------------+--------------------------------------+---------------------------------------+---------+---------+-----------+--------------------------------------------+--------------+-----------------+------------------------+------------------------------+------------------------------+----------+------+--------+-----------------------------+
|Institution|Course Number|Launch Date|        Course Title|         Instructors|      Course Subject|Year|Honor Code Certificates|Participants (Course Content Accessed)|Audited (> 50% Course Content Accessed)|Certified|% Audited|% Certified|% Certified of > 50% Course Content Accessed|% Played Video|% Posted in Forum|% Grade Higher Than Zero|Total Course Hours (Thousands)|Median Hours for Certification|Median Age|% Male|% Female|% Bachelor's Degree or Higher|
+-----------+-------------+-----------+--------------------+----------

In [12]:
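# Backticks quote a column name; in single quotes, '% Female' is a string literal and the filter is always true.
spark.sql("SELECT * FROM tableA WHERE CAST(`% Female` AS DOUBLE) < 43").show()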

+-----------+-------------+-----------+--------------------+--------------------+--------------------+----+-----------------------+--------------------------------------+---------------------------------------+---------+---------+-----------+--------------------------------------------+--------------+-----------------+------------------------+------------------------------+------------------------------+----------+------+--------+-----------------------------+
|Institution|Course Number|Launch Date|        Course Title|         Instructors|      Course Subject|Year|Honor Code Certificates|Participants (Course Content Accessed)|Audited (> 50% Course Content Accessed)|Certified|% Audited|% Certified|% Certified of > 50% Course Content Accessed|% Played Video|% Posted in Forum|% Grade Higher Than Zero|Total Course Hours (Thousands)|Median Hours for Certification|Median Age|% Male|% Female|% Bachelor's Degree or Higher|
+-----------+-------------+-----------+--------------------+----------

Alternatively, you can enable the spark.sql.repl.eagerEval.enabled configuration so that PySpark DataFrames are evaluated eagerly in notebooks such as Jupyter. The number of rows shown is controlled by the spark.sql.repl.eagerEval.maxNumRows configuration.

In [13]:
spark.conf.set('spark.sql.repl.eagerEval.enabled', True)
df

Institution,Course Number,Launch Date,Course Title,Instructors,Course Subject,Year,Honor Code Certificates,Participants (Course Content Accessed),Audited (> 50% Course Content Accessed),Certified,% Audited,% Certified,% Certified of > 50% Course Content Accessed,% Played Video,% Posted in Forum,% Grade Higher Than Zero,Total Course Hours (Thousands),Median Hours for Certification,Median Age,% Male,% Female,% Bachelor's Degree or Higher
MITx,6.002x,09/05/2012,Circuits and Elec...,Khurram Afridi,"Science, Technolo...",1,1,36105,5431,3003,15.04,8.32,54.98,83.2,8.17,28.97,418.94,64.45,26,88.28,11.72,60.68
MITx,6.00x,09/26/2012,Introduction to C...,"Eric Grimson, Joh...",Computer Science,1,1,62709,8949,5783,14.27,9.22,64.05,89.14,14.38,39.5,884.04,78.53,28,83.5,16.5,63.04
MITx,3.091x,10/09/2012,Introduction to S...,Michael Cima,"Science, Technolo...",1,1,16663,2855,2082,17.13,12.49,72.85,87.49,14.42,34.89,227.55,61.28,27,70.32,29.68,58.76
HarvardX,CS50x,10/15/2012,Introduction to C...,"David Malan, Nate...",Computer Science,1,1,129400,12888,1439,9.96,1.11,11.11,0,0.0,1.11,220.9,0.0,28,80.02,19.98,58.78
HarvardX,PH207x,10/15/2012,Health in Numbers...,Earl Francis Cook...,"Government, Healt...",1,1,52521,10729,5058,20.44,9.64,47.12,77.45,15.98,32.52,804.41,76.1,32,56.78,43.22,88.33
MITx,6.00x,02/04/2013,Introduction to C...,Larry Rudolph,Computer Science,1,1,65380,6473,3313,9.9,5.07,51.17,82.43,10.3,28.9,639.4,84.14,27,83.99,16.01,60.9
MITx,3.091x,02/05/2013,Introduction to S...,Michael Cima,"Science, Technolo...",1,1,8270,838,547,10.13,6.61,65.16,80.25,10.22,23.49,68.11,59.29,27,73.3,26.7,58.99
MITx,14.73x,02/12/2013,The Challenges of...,"Esther Duflo, Abh...","Government, Healt...",1,1,29044,6510,4607,22.41,15.86,70.6,83.24,13.89,39.38,279.22,40.3,30,53.76,46.24,81.94
MITx,8.02x,02/18/2013,Electricity and M...,"Walter Lewin, Joh...","Science, Technolo...",1,1,39178,3543,1722,9.04,4.4,48.49,85.3,5.86,16.04,380.35,107.88,26,85.42,14.58,56.97
HarvardX,ER22x,03/02/2013,Justice,Michael Sandel,"Humanities, Histo...",1,1,58779,9425,5438,16.05,9.26,51.07,---,21.86,20.98,186.61,13.67,30,60.42,39.58,69.78


In [17]:
df.describe('Institution','Course Title','Course Subject','Certified','% Certified','Median Age','% Male','% Female').show()

+-------+-----------+--------------------+--------------------+------------------+-----------------+-----------------+------------------+------------------+
|summary|Institution|        Course Title|      Course Subject|         Certified|      % Certified|       Median Age|            % Male|          % Female|
+-------+-----------+--------------------+--------------------+------------------+-----------------+-----------------+------------------+------------------+
|  count|        290|                 290|                 290|               290|              290|              290|               290|               290|
|   mean|       NULL|                NULL|                NULL| 843.8103448275862|7.782586206896548|             29.3| 67.01068965517243| 32.98931034482757|
| stddev|       NULL|                NULL|                NULL|1105.5943720296111|6.972436627689397|4.047896630106513|15.843641959038022|15.843641959038024|
|    min|   HarvardX|A Global History ...|    Computer Sci

In [27]:
df.collect()

[Row(Institution='MITx', Course Number='6.002x', Launch Date='09/05/2012', Course Title='Circuits and Electronics', Instructors='Khurram Afridi', Course Subject='Science, Technology, Engineering, and Mathematics', Year='1', Honor Code Certificates='1', Participants (Course Content Accessed)='36105', Audited (> 50% Course Content Accessed)='5431', Certified='3003', % Audited='15.04', % Certified='8.32', % Certified of > 50% Course Content Accessed='54.98', % Played Video='83.2', % Posted in Forum='8.17', % Grade Higher Than Zero='28.97', Total Course Hours (Thousands)='418.94', Median Hours for Certification='64.45', Median Age='26', % Male='88.28', % Female='11.72', % Bachelor's Degree or Higher='60.68'),
 Row(Institution='MITx', Course Number='6.00x', Launch Date='09/26/2012', Course Title='Introduction to Computer Science and Programming', Instructors='Eric Grimson, John Guttag, Chris Terman', Course Subject='Computer Science', Year='1', Honor Code Certificates='1', Participants (Cou

In [28]:
df.distinct().count()

290

In [31]:
df.dtypes

[('Institution', 'string'),
 ('Course Number', 'string'),
 ('Launch Date', 'string'),
 ('Course Title', 'string'),
 ('Instructors', 'string'),
 ('Course Subject', 'string'),
 ('Year', 'string'),
 ('Honor Code Certificates', 'string'),
 ('Participants (Course Content Accessed)', 'string'),
 ('Audited (> 50% Course Content Accessed)', 'string'),
 ('Certified', 'string'),
 ('% Audited', 'string'),
 ('% Certified', 'string'),
 ('% Certified of > 50% Course Content Accessed', 'string'),
 ('% Played Video', 'string'),
 ('% Posted in Forum', 'string'),
 ('% Grade Higher Than Zero', 'string'),
 ('Total Course Hours (Thousands)', 'string'),
 ('Median Hours for Certification', 'string'),
 ('Median Age', 'string'),
 ('% Male', 'string'),
 ('% Female', 'string'),
 ("% Bachelor's Degree or Higher", 'string')]

In [33]:
df.summary()

summary,Institution,Course Number,Launch Date,Course Title,Instructors,Course Subject,Year,Honor Code Certificates,Participants (Course Content Accessed),Audited (> 50% Course Content Accessed),Certified,% Audited,% Certified,% Certified of > 50% Course Content Accessed,% Played Video,% Posted in Forum,% Grade Higher Than Zero,Total Course Hours (Thousands),Median Hours for Certification,Median Age,% Male,% Female,% Bachelor's Degree or Higher
count,290,290,290,290,289,290,290.0,290.0,290.0,290.0,290.0,290.0,290.0,290.0,290,290.0,290.0,290.0,290.0,290.0,290.0,290.0,290.0
mean,,,,,,,3.1724137931034484,0.8137931034482758,15344.33448275862,2549.1724137931037,843.8103448275862,24.91696551724138,7.782586206896548,31.445655172413804,63.934948096885776,9.347517241379316,21.21037931034481,94.9818275862069,44.364551724137925,29.3,67.01068965517243,32.98931034482757,72.07872413793105
stddev,,,,,,,0.9063011070802596,0.3899464411196652,28207.57873317167,3095.1599694695287,1105.5943720296111,15.883538161621258,6.972436627689397,19.75110161401606,13.47148320736528,7.51714060072847,13.411539505110827,157.6176102908904,43.95370941795084,4.047896630106513,15.843641959038022,15.843641959038024,10.256434460742785
min,HarvardX,0.111x,01/01/2014,A Global History ...,"Aaron Bernstein, ...",Computer Science,1.0,0.0,10188.0,1000.0,0.0,10.13,0.0,0.0,---,0.0,0.0,0.11,0.0,22.0,25.24,10.39,44.95
25%,,,,,,,3.0,1.0,3806.0,754.0,138.0,14.21,2.4,13.38,58.87,3.98,10.57,12.87,12.23,26.0,54.14,18.31,64.49
50%,,,,,,,3.0,1.0,7898.0,1516.0,393.0,20.41,5.88,31.25,65.99,7.19,19.5,37.58,26.74,29.0,66.49,33.46,73.01
75%,,,,,,,4.0,1.0,18183.0,3389.0,1208.0,33.92,10.71,47.79,72.41,14.16,30.91,97.28,64.45,31.0,81.69,45.86,79.35
max,MITx,VJx,12/15/2015,World Religions T...,Wolfgang Ketterle...,"Science, Technolo...",4.0,1.0,9933.0,997.0,99.0,9.96,9.74,9.88,89.14,9.97,9.91,99.42,99.23,53.0,93.44,8.96,98.11
