<a href="https://colab.research.google.com/github/FredArgoX/ChaoticTest_PySpark/blob/main/01_GL_Spark_DataFrame_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Source: [Great Learning](https://olympus.mygreatlearning.com/courses/31729/modules/items/879875?pb_id=581)

Spark DataFrames are the workhorse of Spark and the main way of working with Spark and Python since Spark 2.0. DataFrames act as powerful versions of tables, with rows and columns, and they handle large datasets easily. The shift to DataFrames provides many advantages:

- A much simpler syntax
- The ability to run SQL queries directly against a DataFrame
- Operations are automatically distributed across the cluster (DataFrames are built on top of RDDs)

If you've used R or the pandas library with Python, you are probably already familiar with the concept of DataFrames. Spark DataFrames expand on many of these concepts, so you can transfer that knowledge easily once you understand the simple syntax of Spark DataFrames. Remember that the main advantage of Spark DataFrames over those of other tools is that Spark can handle data spread across a cluster: huge data sets that would never fit on a single computer.

# Creating a DataFrame

First we need to import SparkSession:

In [1]:
from pyspark.sql import SparkSession

Then start the SparkSession

In [2]:
# May take a little while on a local computer
spark = SparkSession.builder.appName("Basics").getOrCreate()

In [3]:
spark

You will first need to get the data from a file (or connect to a distributed file system such as HDFS):

In [5]:
# This dataset is from Spark's examples
# The JSON data is available at: https://raw.githubusercontent.com/FredArgoX/ChaoticTest_PySpark/refs/heads/main/data/people.json

!wget https://raw.githubusercontent.com/FredArgoX/ChaoticTest_PySpark/refs/heads/main/data/people.json -O people.json

--2025-06-26 19:07:41--  https://raw.githubusercontent.com/FredArgoX/ChaoticTest_PySpark/refs/heads/main/data/people.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 72 [text/plain]
Saving to: ‘people.json’


2025-06-26 19:07:41 (4.47 MB/s) - ‘people.json’ saved [72/72]



In [6]:
# Read data
df = spark.read.json('people.json')

Showing the data

In [7]:
df.show()

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [8]:
df.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



In [9]:
df.columns

['age', 'name']

In [10]:
df.describe()

DataFrame[summary: string, age: string, name: string]

In [12]:
df.describe().show()

+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   NULL|
| stddev|7.7781745930520225|   NULL|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+



Some data types make it easier to infer schema (like tabular formats such as csv).

However, you often have to set the schema yourself when the .read method you are using has no built-in schema inference (such as the inferSchema option), or when the inferred types are not the ones you want.

Spark has all the tools you need for this; it just requires a very specific structure:

In [13]:
from pyspark.sql.types import StructField, StringType, IntegerType, StructType

Next we need to create the list of structure fields. Each StructField takes:

- name: string, name of the field.
- dataType: the DataType of the field.
- nullable: boolean, whether the field can be null (None) or not.

In [14]:
data_schema = [
    StructField("age", IntegerType(), True),
    StructField("name", StringType(), True)
]

In [15]:
final_struc = StructType(fields=data_schema)

In [16]:
df = spark.read.json("people.json", schema=final_struc)

In [17]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)



# Grabbing the data