In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("DataFrames").getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/11/30 12:02:42 WARN Utils: Your hostname, bhuvaneshwaran-Latitude-5420, resolves to a loopback address: 127.0.1.1; using 192.168.1.17 instead (on interface wlp0s20f3)
25/11/30 12:02:42 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/30 12:02:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
spark

## ✔ Method 1: From a Python List

In [4]:
data = [("Bhuvan", 25), ("Arun", 30)]
cols = ["Name", "Age"]

df = spark.createDataFrame(data, cols)
df.show()

                                                                                

+------+---+
|  Name|Age|
+------+---+
|Bhuvan| 25|
|  Arun| 30|
+------+---+



Why use this?

* For testing

* For small sample datasets

* To validate logic

In [5]:
from pyspark.sql import Row

# Build a list of Row objects and convert it to a DataFrame
data = [ 
    Row(Name="Krish", Age=31, Experience=10),
    Row(Name="Sudhanshi", Age=30, Experience=8),
    Row(Name="Ajay", Age=29, Experience=4)
]

df = spark.createDataFrame(data)
df.show()

+---------+---+----------+
|     Name|Age|Experience|
+---------+---+----------+
|    Krish| 31|        10|
|Sudhanshi| 30|         8|
|     Ajay| 29|         4|
+---------+---+----------+



## ✔ Method 2: Reading a CSV File (Most common)

In [6]:
df = spark.read.csv(r"/home/bhuvaneshwaran/Desktop/Medium/csvFiles/Employees.csv", header=True, inferSchema=True)
df.show(50)

+---+-----------+---------------+------+----------+-----+--------------------+--------------------+------+--------------+-------------+--------+-----------+-------------+--------------+
| No| First Name|      Last Name|Gender|Start Date|Years|          Department|             Country|Center|Monthly Salary|Annual Salary|Job Rate|Sick Leaves|Unpaid Leaves|Overtime Hours|
+---+-----------+---------------+------+----------+-----+--------------------+--------------------+------+--------------+-------------+--------+-----------+-------------+--------------+
|  1|     Ghadir|          Hmshw|  Male|04/04/2018|    7|     Quality Control|               Egypt|  West|          1560|        18720|     3.0|          1|            0|           183|
|  2|       Omar|         Hishan|  Male|21/05/2020|    5|     Quality Control|        Saudi Arabia|  West|          3247|        38964|     1.0|          0|            5|           198|
|  3|      Ailya|         Sharaf|Female|28/09/2017|    8|  Major Mfg P

Explanation of each parameter:
### 🏷 header=True

Uses the first row as column names.

### 📐 inferSchema=True

Automatically detects column data types.

### 🛑 Important note:
For production, avoid inferSchema=True.
Spark has to make an extra pass over the data to infer the types, which is slow on large files.

### Alternative:

Define the schema manually:

In [7]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

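# Caution: a CSV schema is matched to the file by position, not by header name.
# Declaring only 2 of the 15 columns maps them onto the first two columns
# ("No" and "First Name"), which is why the output below is wrong.
schema = StructType([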
    StructField("First Name", StringType(), True),
    StructField("Monthly Salary", IntegerType(), True)
])

df = spark.read.csv(r"/home/bhuvaneshwaran/Desktop/Medium/csvFiles/Employees.csv", header=True, schema=schema)


In [8]:
df.show()

+----------+--------------+
|First Name|Monthly Salary|
+----------+--------------+
|         1|          NULL|
|         2|          NULL|
|         3|          NULL|
|         4|          NULL|
|         5|          NULL|
|         6|          NULL|
|         7|          NULL|
|         8|          NULL|
|         9|          NULL|
|        10|          NULL|
|        11|          NULL|
|        12|          NULL|
|        13|          NULL|
|        14|          NULL|
|        15|          NULL|
|        16|          NULL|
|        17|          NULL|
|        18|          NULL|
|        19|          NULL|
|        20|          NULL|
+----------+--------------+
only showing top 20 rows


25/11/30 12:02:49 WARN CSVHeaderChecker: Number of column in CSV header is not equal to number of fields in the schema:
 Header length: 15, schema size: 2
CSV file: file:///home/bhuvaneshwaran/Desktop/Medium/csvFiles/Employees.csv


## ✔ Method 3: Read JSON

In [9]:
df = spark.read.json(r"/home/bhuvaneshwaran/Desktop/Medium/csvFiles/US_STATE_recipes.json")

JSON is widely used in logs, APIs, and NoSQL exports.

In [10]:
df

DataFrame[0: struct<Contient:string,Country_State:string,URL:string,cook_time:bigint,cuisine:string,description:string,ingredients:array<string>,instructions:array<string>,nutrients:struct<calories:string,carbohydrateContent:string,cholesterolContent:string,fatContent:string,fiberContent:string,proteinContent:string,saturatedFatContent:string,sodiumContent:string,sugarContent:string,unsaturatedFatContent:string>,prep_time:bigint,rating:double,serves:string,title:string,total_time:bigint>, 1: struct<Contient:string,Country_State:string,URL:string,cook_time:bigint,cuisine:string,description:string,ingredients:array<string>,instructions:array<string>,nutrients:struct<calories:string,carbohydrateContent:string,cholesterolContent:string,fatContent:string,fiberContent:string,proteinContent:string,saturatedFatContent:string,sodiumContent:string,sugarContent:string,unsaturatedFatContent:string>,prep_time:bigint,rating:double,serves:string,title:string,total_time:bigint>, 2: struct<Contient:str

## ✔ Method 4: Read Parquet (FASTEST)

In [11]:
df = spark.read.parquet(r"/home/bhuvaneshwaran/Desktop/Medium/parquetFiles/titanic.parquet")

In [12]:
df.show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| NULL|       S|
|          6|       0|     3|    Moran, Mr. James|  male|NULL|    0|    0|      

## 💡 Why is Parquet preferred?

* Compressed

* Columnar format

* Very fast for queries

* Saves storage

Commonly used in data lakes (for example on S3 or ADLS, and as the file format underneath Delta Lake).

## 📊 Viewing Data


In [13]:
df.show()  # Shows the first 20 rows by default; pass a number, e.g. df.show(5), to change it.


+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| NULL|       S|
|          6|       0|     3|    Moran, Mr. James|  male|NULL|    0|    0|      

In [14]:
df.printSchema()   # Useful for checking nested fields, types, and nullability.

root
 |-- PassengerId: long (nullable = true)
 |-- Survived: long (nullable = true)
 |-- Pclass: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: long (nullable = true)
 |-- Parch: long (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



In [15]:
df.columns  # Returns a list of column names.

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

## ✨ Selecting Columns
.select()

In [16]:
df.select("Name", "Age").show()

+--------------------+----+
|                Name| Age|
+--------------------+----+
|Braund, Mr. Owen ...|22.0|
|Cumings, Mrs. Joh...|38.0|
|Heikkinen, Miss. ...|26.0|
|Futrelle, Mrs. Ja...|35.0|
|Allen, Mr. Willia...|35.0|
|    Moran, Mr. James|NULL|
|McCarthy, Mr. Tim...|54.0|
|Palsson, Master. ...| 2.0|
|Johnson, Mrs. Osc...|27.0|
|Nasser, Mrs. Nich...|14.0|
|Sandstrom, Miss. ...| 4.0|
|Bonnell, Miss. El...|58.0|
|Saundercock, Mr. ...|20.0|
|Andersson, Mr. An...|39.0|
|Vestrom, Miss. Hu...|14.0|
|Hewlett, Mrs. (Ma...|55.0|
|Rice, Master. Eugene| 2.0|
|Williams, Mr. Cha...|NULL|
|Vander Planke, Mr...|31.0|
|Masselmani, Mrs. ...|NULL|
+--------------------+----+
only showing top 20 rows


## Select with operations:

In [17]:
from pyspark.sql.functions import col

df.select(col("Age") + 10).show()


+----------+
|(Age + 10)|
+----------+
|      32.0|
|      48.0|
|      36.0|
|      45.0|
|      45.0|
|      NULL|
|      64.0|
|      12.0|
|      37.0|
|      24.0|
|      14.0|
|      68.0|
|      30.0|
|      49.0|
|      24.0|
|      65.0|
|      12.0|
|      NULL|
|      41.0|
|      NULL|
+----------+
only showing top 20 rows


## ✨ Filtering Rows
Method 1: Using conditions

In [18]:
df.filter(df.Age > 25).show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| NULL|       S|
|          7|       0|     1|McCarthy, Mr. Tim...|  male|54.0|    0|    0|           17463|51.8625|  E46|       S|
|          9|       1|     3|Johnson, Mrs. Osc...|female|27.0|    0|    2|      

Method 2: Using SQL-like strings

In [19]:
df.filter("Age > 25").show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| NULL|       S|
|          7|       0|     1|McCarthy, Mr. Tim...|  male|54.0|    0|    0|           17463|51.8625|  E46|       S|
|          9|       1|     3|Johnson, Mrs. Osc...|female|27.0|    0|    2|      

Method 3: Multiple conditions

In [20]:
df.filter((df.Age > 25) & (df.Sex == "female")).show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          9|       1|     3|Johnson, Mrs. Osc...|female|27.0|    0|    2|          347742|11.1333| NULL|       S|
|         12|       1|     1|Bonnell, Miss. El...|female|58.0|    0|    0|          113783|  26.55| C103|       S|
|         16|       1|     2|Hewlett, Mrs. (Ma...|female|55.0|    0|    0|      

## 🔥 Adding New Columns: withColumn

One of the most used functions in PySpark.

In [21]:
df = df.withColumn("AgePlus10", df.Age + 10)
df.show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+---------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|AgePlus10|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+---------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|     32.0|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|     48.0|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|     36.0|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|     45.0|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| NULL|       S|     45.0|


In [22]:
df.printSchema()

root
 |-- PassengerId: long (nullable = true)
 |-- Survived: long (nullable = true)
 |-- Pclass: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: long (nullable = true)
 |-- Parch: long (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)
 |-- AgePlus10: double (nullable = true)



### Examples:
Convert type:

In [23]:
df = df.withColumn("AgePlus10", df.AgePlus10.cast("string"))

In [24]:
df.printSchema()

root
 |-- PassengerId: long (nullable = true)
 |-- Survived: long (nullable = true)
 |-- Pclass: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: long (nullable = true)
 |-- Parch: long (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)
 |-- AgePlus10: string (nullable = true)



Conditional column:


In [27]:
from pyspark.sql.functions import when

df = df.withColumn(
    "Category",
    when(df.Age > 30, "Senior").otherwise("Junior")
)

In [28]:
df.show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+---------+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|AgePlus10|Category|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+---------+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|     32.0|  Junior|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|     48.0|  Senior|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|     36.0|  Junior|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|     45.0|  Senior|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|

## 🗂 Dropping Columns

In [29]:
df = df.drop("AgePlus10")

In [30]:
df.show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|Category|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|  Junior|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|  Senior|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|  Junior|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|  Senior|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| NULL|       S|  Senior|
|       

## 📌 Renaming Columns

In [31]:
df = df.withColumnRenamed("Age", "Employee_Age")

In [32]:
df.show()

+-----------+--------+------+--------------------+------+------------+-----+-----+----------------+-------+-----+--------+--------+
|PassengerId|Survived|Pclass|                Name|   Sex|Employee_Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|Category|
+-----------+--------+------+--------------------+------+------------+-----+-----+----------------+-------+-----+--------+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|        22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|  Junior|
|          2|       1|     1|Cumings, Mrs. Joh...|female|        38.0|    1|    0|        PC 17599|71.2833|  C85|       C|  Senior|
|          3|       1|     3|Heikkinen, Miss. ...|female|        26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|  Junior|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|        35.0|    1|    0|          113803|   53.1| C123|       S|  Senior|
|          5|       0|     3|Allen, Mr. Willia...|  male|        35.0|    0|

## 📊 Sorting

In [34]:
df.orderBy("Employee_Age", ascending=False).show()

+-----------+--------+------+--------------------+------+------------+-----+-----+-----------+-------+-----------+--------+--------+
|PassengerId|Survived|Pclass|                Name|   Sex|Employee_Age|SibSp|Parch|     Ticket|   Fare|      Cabin|Embarked|Category|
+-----------+--------+------+--------------------+------+------------+-----+-----+-----------+-------+-----------+--------+--------+
|        631|       1|     1|Barkworth, Mr. Al...|  male|        80.0|    0|    0|      27042|   30.0|        A23|       S|  Senior|
|        852|       0|     3| Svensson, Mr. Johan|  male|        74.0|    0|    0|     347060|  7.775|       NULL|       S|  Senior|
|         97|       0|     1|Goldschmidt, Mr. ...|  male|        71.0|    0|    0|   PC 17754|34.6542|         A5|       C|  Senior|
|        494|       0|     1|Artagaveytia, Mr....|  male|        71.0|    0|    0|   PC 17609|49.5042|       NULL|       C|  Senior|
|        117|       0|     3|Connors, Mr. Patrick|  male|        70.5

In [37]:
df.count()

891

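## 🔁 Handling Missing Values
Drop rows with nulls (this returns a new DataFrame; `df` itself is unchanged, so `df.count()` afterwards still reports 891):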

In [38]:
df.dropna().show()

+-----------+--------+------+--------------------+------+------------+-----+-----+-----------+--------+-----------+--------+--------+
|PassengerId|Survived|Pclass|                Name|   Sex|Employee_Age|SibSp|Parch|     Ticket|    Fare|      Cabin|Embarked|Category|
+-----------+--------+------+--------------------+------+------------+-----+-----+-----------+--------+-----------+--------+--------+
|          2|       1|     1|Cumings, Mrs. Joh...|female|        38.0|    1|    0|   PC 17599| 71.2833|        C85|       C|  Senior|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|        35.0|    1|    0|     113803|    53.1|       C123|       S|  Senior|
|          7|       0|     1|McCarthy, Mr. Tim...|  male|        54.0|    0|    0|      17463| 51.8625|        E46|       S|  Senior|
|         11|       1|     3|Sandstrom, Miss. ...|female|         4.0|    1|    1|    PP 9549|    16.7|         G6|       S|  Junior|
|         12|       1|     1|Bonnell, Miss. El...|female|     

In [39]:
df.count()

891

Fill nulls:

In [40]:
df.fillna({"Employee_Age": 0, "Name": "Unknown"}).show()

+-----------+--------+------+--------------------+------+------------+-----+-----+----------------+-------+-----+--------+--------+
|PassengerId|Survived|Pclass|                Name|   Sex|Employee_Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|Category|
+-----------+--------+------+--------------------+------+------------+-----+-----+----------------+-------+-----+--------+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|        22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|  Junior|
|          2|       1|     1|Cumings, Mrs. Joh...|female|        38.0|    1|    0|        PC 17599|71.2833|  C85|       C|  Senior|
|          3|       1|     3|Heikkinen, Miss. ...|female|        26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|  Junior|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|        35.0|    1|    0|          113803|   53.1| C123|       S|  Senior|
|          5|       0|     3|Allen, Mr. Willia...|  male|        35.0|    0|

## 🚀 Aggregations
.groupBy().count()

In [41]:
df.groupBy("Category").count().show()

+--------+-----+
|Category|count|
+--------+-----+
|  Senior|  305|
|  Junior|  586|
+--------+-----+



In [42]:
df.groupBy("Sex").count().show()

+------+-----+
|   Sex|count|
+------+-----+
|female|  314|
|  male|  577|
+------+-----+



.agg()

In [44]:
from pyspark.sql import functions as F  # avoids shadowing Python's built-in sum

df.groupBy("Sex").agg(
    F.avg("Fare").alias("AvgFare"),
    F.sum("Fare").alias("TotalFare")
).show()


+------+------------------+-----------------+
|   Sex|           AvgFare|        TotalFare|
+------+------------------+-----------------+
|female| 44.47981783439487|13966.66279999999|
|  male|25.523893414211418|14727.28649999999|
+------+------------------+-----------------+



In [45]:
# Integer keys match the type of PassengerId in df, so the join needs no implicit cast
data = [(1, "chennai"), (2, "Madurai"), (3, "Sivaganga"), (4, "Pudukkottai"), (5, "Trichy"), (6, "Ramanathapuram"), (7, "Tenkasi")]
cols = ["PassengerId", "City"]

df2 = spark.createDataFrame(data, cols)
df2.show()

+-----------+--------------+
|PassengerId|          City|
+-----------+--------------+
|          1|       chennai|
|          2|       Madurai|
|          3|     Sivaganga|
|          4|   Pudukkottai|
|          5|        Trichy|
|          6|Ramanathapuram|
|          7|       Tenkasi|
+-----------+--------------+



## 🔥 Joins in PySpark

In [47]:
df.join(df2, on="PassengerId", how="inner").show(10)

+-----------+--------+------+--------------------+------+------------+-----+-----+----------------+-------+-----+--------+--------+--------------+
|PassengerId|Survived|Pclass|                Name|   Sex|Employee_Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|Category|          City|
+-----------+--------+------+--------------------+------+------------+-----+-----+----------------+-------+-----+--------+--------+--------------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|        22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|  Junior|       chennai|
|          2|       1|     1|Cumings, Mrs. Joh...|female|        38.0|    1|    0|        PC 17599|71.2833|  C85|       C|  Senior|       Madurai|
|          3|       1|     3|Heikkinen, Miss. ...|female|        26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|  Junior|     Sivaganga|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|        35.0|    1|    0|          113803|   53.1| C123|     

### Join types:

* inner

* left

* right

* full

* semi

* anti

In [48]:
df.join(df2, on="PassengerId", how="left").show(10)

+-----------+--------+------+--------------------+------+------------+-----+-----+----------------+-------+-----+--------+--------+--------------+
|PassengerId|Survived|Pclass|                Name|   Sex|Employee_Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|Category|          City|
+-----------+--------+------+--------------------+------+------------+-----+-----+----------------+-------+-----+--------+--------+--------------+
|          7|       0|     1|McCarthy, Mr. Tim...|  male|        54.0|    0|    0|           17463|51.8625|  E46|       S|  Senior|       Tenkasi|
|          6|       0|     3|    Moran, Mr. James|  male|        NULL|    0|    0|          330877| 8.4583| NULL|       Q|  Junior|Ramanathapuram|
|          9|       1|     3|Johnson, Mrs. Osc...|female|        27.0|    0|    2|          347742|11.1333| NULL|       S|  Junior|          NULL|
|          5|       0|     3|Allen, Mr. Willia...|  male|        35.0|    0|    0|          373450|   8.05| NULL|     

In [49]:
df.join(df2, on="PassengerId", how="right").show(10)

+-----------+--------+------+--------------------+------+------------+-----+-----+----------------+-------+-----+--------+--------+--------------+
|PassengerId|Survived|Pclass|                Name|   Sex|Employee_Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|Category|          City|
+-----------+--------+------+--------------------+------+------------+-----+-----+----------------+-------+-----+--------+--------+--------------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|        22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|  Junior|       chennai|
|          2|       1|     1|Cumings, Mrs. Joh...|female|        38.0|    1|    0|        PC 17599|71.2833|  C85|       C|  Senior|       Madurai|
|          3|       1|     3|Heikkinen, Miss. ...|female|        26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|  Junior|     Sivaganga|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|        35.0|    1|    0|          113803|   53.1| C123|     

In [50]:
df.join(df2, on="PassengerId", how="full").show(10)

+-----------+--------+------+--------------------+------+------------+-----+-----+----------------+-------+-----+--------+--------+--------------+
|PassengerId|Survived|Pclass|                Name|   Sex|Employee_Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|Category|          City|
+-----------+--------+------+--------------------+------+------------+-----+-----+----------------+-------+-----+--------+--------+--------------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|        22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|  Junior|       chennai|
|          2|       1|     1|Cumings, Mrs. Joh...|female|        38.0|    1|    0|        PC 17599|71.2833|  C85|       C|  Senior|       Madurai|
|          3|       1|     3|Heikkinen, Miss. ...|female|        26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|  Junior|     Sivaganga|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|        35.0|    1|    0|          113803|   53.1| C123|     

In [51]:
df.join(df2, on="PassengerId", how="semi").show(10)

+-----------+--------+------+--------------------+------+------------+-----+-----+----------------+-------+-----+--------+--------+
|PassengerId|Survived|Pclass|                Name|   Sex|Employee_Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|Category|
+-----------+--------+------+--------------------+------+------------+-----+-----+----------------+-------+-----+--------+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|        22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|  Junior|
|          2|       1|     1|Cumings, Mrs. Joh...|female|        38.0|    1|    0|        PC 17599|71.2833|  C85|       C|  Senior|
|          3|       1|     3|Heikkinen, Miss. ...|female|        26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|  Junior|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|        35.0|    1|    0|          113803|   53.1| C123|       S|  Senior|
|          5|       0|     3|Allen, Mr. Willia...|  male|        35.0|    0|

In [52]:
df.join(df2, on="PassengerId", how="anti").show(10)

+-----------+--------+------+--------------------+------+------------+-----+-----+---------------+--------+-----+--------+--------+
|PassengerId|Survived|Pclass|                Name|   Sex|Employee_Age|SibSp|Parch|         Ticket|    Fare|Cabin|Embarked|Category|
+-----------+--------+------+--------------------+------+------------+-----+-----+---------------+--------+-----+--------+--------+
|         26|       1|     3|Asplund, Mrs. Car...|female|        38.0|    1|    5|         347077| 31.3875| NULL|       S|  Senior|
|         29|       1|     3|O'Dwyer, Miss. El...|female|        NULL|    0|    0|         330959|  7.8792| NULL|       Q|  Junior|
|        474|       1|     2|Jerwan, Mrs. Amin...|female|        23.0|    0|    0|SC/AH Basle 541| 13.7917|    D|       C|  Junior|
|         65|       0|     1|Stewart, Mr. Albe...|  male|        NULL|    0|    0|       PC 17605| 27.7208| NULL|       C|  Junior|
|        191|       1|     2| Pinsky, Mrs. (Rosa)|female|        32.0|    0|

## 🚀 Window Functions (Advanced but useful)

In [54]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window = Window.orderBy("Fare")

df.withColumn("RowNum", row_number().over(window)).show()

+-----------+--------+------+--------------------+----+------------+-----+-----+------+------+-----------+--------+--------+------+
|PassengerId|Survived|Pclass|                Name| Sex|Employee_Age|SibSp|Parch|Ticket|  Fare|      Cabin|Embarked|Category|RowNum|
+-----------+--------+------+--------------------+----+------------+-----+-----+------+------+-----------+--------+--------+------+
|        303|       0|     3|Johnson, Mr. Will...|male|        19.0|    0|    0|  LINE|   0.0|       NULL|       S|  Junior|     1|
|        278|       0|     2|Parkes, Mr. Franc...|male|        NULL|    0|    0|239853|   0.0|       NULL|       S|  Junior|     2|
|        272|       1|     3|Tornquist, Mr. Wi...|male|        25.0|    0|    0|  LINE|   0.0|       NULL|       S|  Junior|     3|
|        264|       0|     1|Harrison, Mr. Wil...|male|        40.0|    0|    0|112059|   0.0|        B94|       S|  Senior|     4|
|        482|       0|     2|Frost, Mr. Anthon...|male|        NULL|    0|  

25/11/30 12:58:49 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


### Window functions are used for:

* ranking

* cumulative totals

* moving averages

* per-group calculations (with `Window.partitionBy`)