## Loading data in PySpark

In [1]:
from pyspark import SparkContext
sc = SparkContext("local")

In [2]:
rdd1 = sc.parallelize([1,2,3,4,5])

In [3]:
rdd2 = sc.textFile("test.txt")

## Lambda Function

Syntax:

lambda arguments: expression

In [17]:
double = lambda x:x*2
print(double(3))

6


Difference between def and lambda:

def is used for named functions with more complex logic, while lambda is used for simple, anonymous functions that serve short-term tasks or are passed as arguments to higher-order functions. A def body can contain multiple statements and a return statement, whereas a lambda contains only a single expression. Anonymous means the function has no name.


In [18]:
def cube(x):
    return x ** 3

g = lambda x: x ** 3

print(g(10))
print(cube(10))

1000
1000
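To illustrate the point about passing a lambda to a higher-order function, here is a minimal sketch using Python's built-in sorted(). The `players` list and the helper name `age_of` are made up for this example.

```python
# Hypothetical sample data for illustration.
players = [("Sam", 23), ("Mary", 34), ("Peter", 25)]

# A lambda passed inline as the sort key: used once, never named.
by_age = sorted(players, key=lambda pair: pair[1])

# The equivalent def needs a name and a separate statement.
def age_of(pair):
    return pair[1]

by_age_def = sorted(players, key=age_of)

print(by_age)
print(by_age == by_age_def)
```

Both versions sort the same way; the lambda simply avoids defining a one-off helper.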


## Map function

The map() function takes a function and an iterable (such as a list) and returns an iterator that yields the function's result for each item. Wrap it in list() to get a list.

General syntax of map():

map(function, list)

In [19]:
items = [1,2,3,4]
list(map(lambda x: x+2, items))

[3, 4, 5, 6]
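In Python 3, map() produces a lazy iterator rather than a list, which is why the cell above wraps it in list(). A small sketch of what that means in practice (variable names are illustrative):

```python
items = [1, 2, 3, 4]

# map() builds a lazy iterator; the lambda has not been called yet.
mapped = map(lambda x: x + 2, items)
is_list = isinstance(mapped, list)

# Consuming the iterator produces the values once...
first_pass = list(mapped)
# ...and a second pass finds it exhausted.
second_pass = list(mapped)

print(is_list, first_pass, second_pass)
```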

## Filter function

The filter() function takes a function and an iterable (such as a list) and returns an iterator over the items for which the function evaluates to true. Wrap it in list() to get a list.

Syntax:

filter(function, list)

In [21]:
items = [1,2,3,4]
list(filter(lambda x: (x%2 != 0), items))

[1, 3]
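For comparison, the same selection can be written as a list comprehension. This is only a sketch of an equivalent form; which one reads better is a matter of taste.

```python
items = [1, 2, 3, 4]

# filter() with a lambda predicate.
odd_with_filter = list(filter(lambda x: x % 2 != 0, items))

# The same condition expressed as a list comprehension.
odd_with_comprehension = [x for x in items if x % 2 != 0]

print(odd_with_filter == odd_with_comprehension)
```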

## RDDs

Resilient Distributed Datasets

- Resilient: Ability to withstand failures
- Distributed: Spanning across multiple machines
- Datasets: Collection of partitioned data, e.g., arrays, tables, tuples, etc.

## Partitioning

A partition is a logical division of a large distributed dataset.

In [4]:
numRDD = sc.parallelize(range(10), numSlices = 6)

In [5]:
numRDD.getNumPartitions()

6

In [7]:
fileRDD = sc.textFile(r"C:\Users\aditya.rambhad\Downloads\spark notes.txt", minPartitions = 5)

In [9]:
fileRDD.getNumPartitions()

5
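To build intuition for what "logical division" means, here is a plain-Python model that splits a collection into contiguous, near-equal chunks. It is a simplified sketch, not Spark's actual partitioning code, and the function name `split_evenly` is hypothetical. In PySpark itself, `numRDD.glom().collect()` shows the real contents of each partition.

```python
def split_evenly(data, num_slices):
    """Split data into num_slices contiguous chunks of near-equal size."""
    data = list(data)
    n = len(data)
    # Chunk i covers positions bounds[i] up to (not including) bounds[i + 1].
    bounds = [i * n // num_slices for i in range(num_slices + 1)]
    return [data[bounds[i]:bounds[i + 1]] for i in range(num_slices)]

partitions = split_evenly(range(10), 6)
print(len(partitions))
print(partitions)
```

Ten elements cannot be divided evenly into six slices, so some chunks hold one element and others two.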

## PySpark Operations

- Transformations create new RDDs
- Actions perform computation on the RDDs
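Transformations are lazy: Spark only records them, and nothing runs until an action is called. The sketch below mimics that behaviour with a plain Python generator; it is an analogy rather than Spark code, and the `calls` list exists only to show when the function actually executes.

```python
calls = []

def square(x):
    calls.append(x)  # record that the function actually ran
    return x * x

data = [1, 2, 3, 4]

# "Transformation": describes the work but runs nothing yet.
pipeline = (square(x) for x in data)
calls_before = len(calls)

# "Action": consuming the pipeline triggers the computation.
result = list(pipeline)
calls_after = len(calls)

print(calls_before, calls_after, result)
```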

## Basic RDD Transformations

- map()
- filter()
- flatMap()
- union()

### map() Transformation

- Applies a function to every element of the RDD.

In [10]:
RDD = sc.parallelize([1,2,3,4])
RDD_map = RDD.map(lambda x: x * x)

In [17]:
# Output: [1,4,9,16]

### filter() Transformation

- Returns a new RDD with only the elements that pass the condition. 

In [16]:
RDD = sc.parallelize([1,2,3,4])
RDD_filter = RDD.filter(lambda x: x > 2)

In [18]:
# Output: [3,4]

### flatMap() Transformation

- Returns zero or more values for each element in the original RDD and flattens them into a single RDD.

In [19]:
RDD = sc.parallelize(["hello world", "how are you"])
RDD_flatmap = RDD.flatMap(lambda x: x.split(" "))

In [20]:
# Output: ["hello","world","how","are","you"]
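The difference between map() and flatMap() is easiest to see side by side. This plain-Python sketch reproduces both results for the same input without needing Spark; itertools.chain stands in for the flattening step.

```python
from itertools import chain

lines = ["hello world", "how are you"]

# map(): one output element per input element, so the result is nested.
mapped = [line.split(" ") for line in lines]

# flatMap(): the per-element lists are concatenated into one flat sequence.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```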

## RDD Actions

- An action returns a value after running a computation on the RDD.
- Basic RDD Actions
  - collect()
  - take(N)
  - first()
  - count()

In [21]:
RDD = sc.parallelize([1,2,3,4])

### collect()

In [22]:
RDD.collect()

[1, 2, 3, 4]

### take()

In [24]:
RDD.take(2)

# Output: [1,2]

### first()

In [27]:
RDD.first()

# Output: 1

### count()

In [28]:
RDD.count()

# Output: 4

## Pair RDDs

- Real life datasets are usually key/value pairs.
- Pair RDD: Key is the identifier and value is data.

In [29]:
my_tuple = [('Sam', 23), ('Mary', 34), ('Peter', 25)]
pairRDD_tuple = sc.parallelize(my_tuple)

In [30]:
pairRDD_tuple.collect()

[('Sam', 23), ('Mary', 34), ('Peter', 25)]

## Pair RDDs Transformations

- reduceByKey()
- groupByKey()
- sortByKey(): Return an RDD sorted by the key
- join()

### reduceByKey() transformation

- Combine values with the same key

In [33]:
regularRDD = sc.parallelize([("Messi", 23), ("Ronaldo", 34), ("Neymar", 22), ("Messi", 24)]) 
pairRDD_reducebykey = regularRDD.reduceByKey(lambda x,y : x + y) 
pairRDD_reducebykey.collect()

In [34]:
# Output: [('Neymar', 22), ('Ronaldo', 34), ('Messi', 47)]
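As a rough mental model, reduceByKey() folds all values that share a key into one value using the supplied function. The plain-Python sketch below shows that logic on a single machine; `reduce_by_key` is a hypothetical helper, not part of PySpark.

```python
def reduce_by_key(pairs, func):
    """Fold the values of each key together with func, keeping one value per key."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return list(result.items())

pairs = [("Messi", 23), ("Ronaldo", 34), ("Neymar", 22), ("Messi", 24)]
totals = reduce_by_key(pairs, lambda x, y: x + y)

print(sorted(totals))
```

The order of keys returned by collect() in Spark is not guaranteed, so sort the result before comparing it with an expected list.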

### groupByKey() transformation

- Group values with the same key

In [36]:
airports = [("US", "JFK"),("UK", "LHR"),("FR", "CDG"),("US", "SFO")]
regularRDD = sc.parallelize(airports) 
pairRDD_group = regularRDD.groupByKey().collect() 
for cont, air in pairRDD_group:
    print(cont, list(air))

In [35]:
# Output: 
FR ['CDG']
US ['JFK', 'SFO'] 
UK ['LHR']

### join() transformation

- Join two pairRDDs based on their key

In [37]:
RDD1 = sc.parallelize([("Messi", 34),("Ronaldo", 32),("Neymar", 24)])
RDD2 = sc.parallelize([("Ronaldo", 80),("Neymar", 120),("Messi", 100)])

RDD1.join(RDD2).collect()

In [38]:
# Output: [('Neymar', (24, 120)), ('Ronaldo', (32, 80)), ('Messi', (34, 100))]
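join() on pair RDDs is an inner join: each key that appears in both RDDs yields a (key, (left_value, right_value)) tuple, and keys present on only one side are dropped. A plain-Python sketch of that logic follows; `inner_join` is a hypothetical helper and the "Mbappe" row is added only to show an unmatched key.

```python
from collections import defaultdict

def inner_join(left, right):
    """Pair up values from left and right that share a key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]

left = [("Messi", 34), ("Ronaldo", 32), ("Neymar", 24), ("Mbappe", 23)]
right = [("Ronaldo", 80), ("Neymar", 120), ("Messi", 100)]
joined = sorted(inner_join(left, right))

print(joined)
```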

## Actions on pair RDDs

- countByKey()
- collectAsMap()

### countByKey()

- Counts the number of elements for each key.

In [39]:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)]) 
for kee, val in rdd.countByKey().items():
    print(kee, val)

In [40]:
# Output:
a 2
b 1

### collectAsMap()

- Returns the key-value pairs in the RDD as a dictionary.

In [41]:
sc.parallelize([(1, 2), (3, 4)]).collectAsMap()

In [42]:
# Output: {1: 2, 3: 4}

## More Actions on RDDs

- reduce()
- saveAsTextFile()
- coalesce() (a transformation, shown here because it is often paired with saveAsTextFile())

### reduce()

- The reduce() action aggregates the elements of a regular RDD.

In [44]:
x = [1,3,4,6]
RDD = sc.parallelize(x) 
RDD.reduce(lambda x, y : x + y)

In [45]:
# Output: 14
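The function passed to reduce() should be commutative and associative, because each partition is reduced on its own before the partial results are combined. The sketch below models that two-stage process in plain Python; `reduce_partitioned` and the partition layouts are assumptions made for illustration.

```python
from functools import reduce

def reduce_partitioned(partitions, func):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

one_partition = [[1, 3, 4, 6]]
two_partitions = [[1, 3], [4, 6]]

add = lambda x, y: x + y
subtract = lambda x, y: x - y

# Addition gives the same answer regardless of how the data is partitioned.
add_results = (reduce_partitioned(one_partition, add),
               reduce_partitioned(two_partitions, add))

# Subtraction does not, so its result would depend on the partitioning.
sub_results = (reduce_partitioned(one_partition, subtract),
               reduce_partitioned(two_partitions, subtract))

print(add_results, sub_results)
```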

### saveAsTextFile()

- Saves RDD into a text file inside a directory with each partition as a separate file.

In [46]:
RDD.saveAsTextFile("tempFile")

### coalesce()

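- Reduces the number of partitions. Coalescing to a single partition before saving writes the RDD as one part file inside the output directory.

In [47]:
# saveAsTextFile() fails if the target directory already exists, so use a new path.
RDD.coalesce(1).saveAsTextFile("tempFile_single")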

## DataFrames

- PySpark DataFrame is an immutable distributed collection of data with named columns.
- Designed for processing both structured (e.g., relational database) and semi-structured (e.g., JSON) data.
- DataFrames in PySpark support both SQL queries ( SELECT * from table ) and expression methods ( df.select() ).
- SparkSession: Entry point to interact with Spark DataFrames.

## Create DataFrame from RDD

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

In [42]:
iphones_RDD = sc.parallelize([
    ("XS", 2018, 5.65, 2.79, 6.24),
    ("XR", 2018, 5.94, 2.98, 6.84),
    ("X10", 2017, 5.65, 2.79, 6.13),
    ("8Plus", 2017, 6.23, 3.07, 7.12)
])

names = ['Model', 'Year', 'Height', 'Width', 'Weight']

iphones_df = spark.createDataFrame(iphones_RDD, schema=names)

## Create a DataFrame from reading a CSV/JSON/TXT

- Two optional parameters for CSV files
  - header = True
  - inferSchema = True

In [56]:
df_csv = spark.read.csv("people.csv", header=True, inferSchema=True)

# JSON carries its own field names and types, so header/inferSchema do not apply.
df_json = spark.read.json("people.json")

# text() reads each line into a single string column named "value".
df_txt = spark.read.text("people.txt")

## DataFrame operations

- Transformations:
  - select()
  - filter()
  - groupby()
  - orderBy()
  - dropDuplicates()
  - withColumnRenamed()
- Actions:
  - printSchema()
  - head()
  - show()
  - count()
  - columns (an attribute, not a method)
  - describe()

In [7]:
df = spark.read.csv(r"C:\Users\aditya.rambhad\Downloads\IPL - Player Performance Dataset\IPL_Ball_by_Ball_2008_2022.csv", header=True, inferSchema=True)

### select() operation

- subsets the columns in the dataframe

In [15]:
df_id = df.select('ID')

df_id.show(5)

+-------+
|     ID|
+-------+
|1312200|
|1312200|
|1312200|
|1312200|
|1312200|
+-------+
only showing top 5 rows



### filter() operation

- Filters out the rows based on a condition.

In [14]:
df_runs = df.filter(df.total_run >= 4)

df_runs.select('batter','total_run').show(3)

+-----------+---------+
|     batter|total_run|
+-----------+---------+
| JC Buttler|        4|
|YBK Jaiswal|        4|
|YBK Jaiswal|        6|
+-----------+---------+
only showing top 3 rows



### where() operation

- Same as filter

In [48]:
df_runs = df.where(df.total_run >= 4)

df_runs.select('batter','total_run').show(3)

+-----------+---------+
|     batter|total_run|
+-----------+---------+
| JC Buttler|        4|
|YBK Jaiswal|        4|
|YBK Jaiswal|        6|
+-----------+---------+
only showing top 3 rows



### groupby() operation

- Groups rows by one or more columns so that an aggregation can be applied to each group.

In [17]:
df_group = df.groupby('batter')

df_group.count().show(5)

+--------------+-----+
|        batter|count|
+--------------+-----+
| Kuldeep Yadav|  130|
|  M Theekshana|    7|
|    KA Pollard| 2447|
|   SS Cottrell|    2|
|R Sanjay Yadav|    2|
+--------------+-----+
only showing top 5 rows



### orderBy() operation

- Sorts the dataframe based on one or more columns

In [19]:
df_group.count().orderBy('count').show(5)

+--------------+-----+
|        batter|count|
+--------------+-----+
|   RP Meredith|    1|
|V Pratap Singh|    1|
| Y Prithvi Raj|    1|
|    Yash Dayal|    1|
|      JL Denly|    1|
+--------------+-----+
only showing top 5 rows



### dropDuplicates() operation

- Removes the duplicate rows of a dataframe.

In [21]:
df_dup = df.select('ID','batter').dropDuplicates()

print(df.select('ID','batter').count())
print(df_dup.count())

225954
14229


### withColumnRenamed() operation

- Renames a column in the dataframe.

In [25]:
df_renamed = df.withColumnRenamed('batter', 'player')

df_renamed.select('ID','player').show(5)

+-------+-----------+
|     ID|     player|
+-------+-----------+
|1312200|YBK Jaiswal|
|1312200|YBK Jaiswal|
|1312200| JC Buttler|
|1312200|YBK Jaiswal|
|1312200|YBK Jaiswal|
+-------+-----------+
only showing top 5 rows



### withColumn() operation

- Adds a new column to the dataframe.

In [57]:
import pyspark.sql.functions as F

df_new = df.withColumn('upper', F.upper('batter')) \
           .withColumn('splits', F.split('batter',' '))
    
df_new.select('ID','batter','upper','splits').show(5)

+-------+-----------+-----------+--------------+
|     ID|     batter|      upper|        splits|
+-------+-----------+-----------+--------------+
|1312200|YBK Jaiswal|YBK JAISWAL|[YBK, Jaiswal]|
|1312200|YBK Jaiswal|YBK JAISWAL|[YBK, Jaiswal]|
|1312200| JC Buttler| JC BUTTLER| [JC, Buttler]|
|1312200|YBK Jaiswal|YBK JAISWAL|[YBK, Jaiswal]|
|1312200|YBK Jaiswal|YBK JAISWAL|[YBK, Jaiswal]|
+-------+-----------+-----------+--------------+
only showing top 5 rows



### drop() operation

- Removes a column from the dataframe.

In [66]:
df_drop = df_new.select('ID','batter','upper').drop('upper')

df_drop.show(5)

+-------+-----------+
|     ID|     batter|
+-------+-----------+
|1312200|YBK Jaiswal|
|1312200|YBK Jaiswal|
|1312200| JC Buttler|
|1312200|YBK Jaiswal|
|1312200|YBK Jaiswal|
+-------+-----------+
only showing top 5 rows



### printSchema()

- Prints the types of columns in the dataframe.

In [26]:
df.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- innings: integer (nullable = true)
 |-- overs: integer (nullable = true)
 |-- ballnumber: integer (nullable = true)
 |-- batter: string (nullable = true)
 |-- bowler: string (nullable = true)
 |-- non-striker: string (nullable = true)
 |-- extra_type: string (nullable = true)
 |-- batsman_run: integer (nullable = true)
 |-- extras_run: integer (nullable = true)
 |-- total_run: integer (nullable = true)
 |-- non_boundary: integer (nullable = true)
 |-- isWicketDelivery: integer (nullable = true)
 |-- player_out: string (nullable = true)
 |-- kind: string (nullable = true)
 |-- fielders_involved: string (nullable = true)
 |-- BattingTeam: string (nullable = true)



### columns

- Returns the column names of the dataframe as a list.

In [27]:
df.columns

['ID',
 'innings',
 'overs',
 'ballnumber',
 'batter',
 'bowler',
 'non-striker',
 'extra_type',
 'batsman_run',
 'extras_run',
 'total_run',
 'non_boundary',
 'isWicketDelivery',
 'player_out',
 'kind',
 'fielders_involved',
 'BattingTeam']

### describe()

- Computes summary statistics (count, mean, stddev, min, max) for numerical and string columns in the dataframe.

In [31]:
df.select('batter','total_run','bowler','overs','ballnumber').describe().show()

+-------+--------------+------------------+--------------+-----------------+------------------+
|summary|        batter|         total_run|        bowler|            overs|        ballnumber|
+-------+--------------+------------------+--------------+-----------------+------------------+
|  count|        225954|            225954|        225954|           225954|            225954|
|   mean|          null|1.3104304415943069|          null|9.185679386069731|3.6197500376182763|
| stddev|          null|1.6060501061067816|          null|5.681796781138258|1.8106327890386609|
|    min|A Ashish Reddy|                 0|A Ashish Reddy|                0|                 1|
|    max|        Z Khan|                 7|        Z Khan|               19|                10|
+-------+--------------+------------------+--------------+-----------------+------------------+



### head()

- Returns the first row of the dataframe as a Row object.

In [38]:
df.head()

Row(ID=1312200, innings=1, overs=0, ballnumber=1, batter='YBK Jaiswal', bowler='Mohammed Shami', non-striker='JC Buttler', extra_type='NA', batsman_run=0, extras_run=0, total_run=0, non_boundary=0, isWicketDelivery=0, player_out='NA', kind='NA', fielders_involved='NA', BattingTeam='Rajasthan Royals')

### explain()

- Shows the physical plan of execution.

In [104]:
df_new.explain()

== Physical Plan ==
*(1) Project [ID#119, innings#120, overs#121, ballnumber#122, batter#123, bowler#124, non-striker#125, extra_type#126, batsman_run#127, extras_run#128, total_run#129, non_boundary#130, isWicketDelivery#131, player_out#132, kind#133, fielders_involved#134, BattingTeam#135, upper(batter#123) AS upper#3348, split(batter#123,  , -1) AS splits#3367]
+- FileScan csv [ID#119,innings#120,overs#121,ballnumber#122,batter#123,bowler#124,non-striker#125,extra_type#126,batsman_run#127,extras_run#128,total_run#129,non_boundary#130,isWicketDelivery#131,player_out#132,kind#133,fielders_involved#134,BattingTeam#135] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/C:/Users/aditya.rambhad/Downloads/IPL - Player Performance Datas..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ID:int,innings:int,overs:int,ballnumber:int,batter:string,bowler:string,non-striker:string...




## Functions

- lit()
- substring()
- regexp_replace()
- broadcast()

In [68]:
import pyspark.sql.functions as F

### lit() function

- Add a literal or constant to the dataframe

In [67]:
df.select('ID','batter',F.lit('IPL')).show(5)

+-------+-----------+---+
|     ID|     batter|IPL|
+-------+-----------+---+
|1312200|YBK Jaiswal|IPL|
|1312200|YBK Jaiswal|IPL|
|1312200| JC Buttler|IPL|
|1312200|YBK Jaiswal|IPL|
|1312200|YBK Jaiswal|IPL|
+-------+-----------+---+
only showing top 5 rows



### substring() function

- Gets substring from a column.

In [72]:
df.select('batter',F.substring('batter',1,3).alias('player')).show(5)

+-----------+------+
|     batter|player|
+-----------+------+
|YBK Jaiswal|   YBK|
|YBK Jaiswal|   YBK|
| JC Buttler|   JC |
|YBK Jaiswal|   YBK|
|YBK Jaiswal|   YBK|
+-----------+------+
only showing top 5 rows
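Spark's substring() counts positions from 1 rather than 0, unlike Python slicing. The helper below is a plain-Python model of that convention for positive positions only; `sql_substring` is a hypothetical name, not a PySpark function.

```python
def sql_substring(text, pos, length):
    """Model of a 1-based substring: pos=1 is the first character."""
    start = pos - 1
    return text[start:start + length]

first_three = sql_substring("YBK Jaiswal", 1, 3)
with_space = sql_substring("JC Buttler", 1, 3)
surname = sql_substring("YBK Jaiswal", 5, 7)

print(repr(first_three), repr(with_space), repr(surname))
```

The second result keeps the trailing space, which matches the `JC ` value in the table above.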



### regexp_replace() function

- Replaces every part of a column value that matches a regular expression with another string.

In [74]:
df.select('batter',F.regexp_replace('batter','YBK','Yashasvi').alias('Updated_batter')).show(5)

+-----------+----------------+
|     batter|  Updated_batter|
+-----------+----------------+
|YBK Jaiswal|Yashasvi Jaiswal|
|YBK Jaiswal|Yashasvi Jaiswal|
| JC Buttler|      JC Buttler|
|YBK Jaiswal|Yashasvi Jaiswal|
|YBK Jaiswal|Yashasvi Jaiswal|
+-----------+----------------+
only showing top 5 rows
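Because the pattern is a regular expression, characters such as `.` have special meaning. The sketch below uses Python's re module to show the idea; Spark uses Java regular expressions, which differ from Python's in some details, so treat this as an illustration of the concept only. The sample strings are made up.

```python
import re

names = ["YBK Jaiswal", "JC Buttler"]

# A plain word behaves like a literal search-and-replace.
updated = [re.sub("YBK", "Yashasvi", name) for name in names]

# '.' is a regex wildcard, so an unescaped pattern can match more than intended.
unescaped = re.sub("J.C", "X", "JxC and J.C")
escaped = re.sub(re.escape("J.C"), "X", "JxC and J.C")

print(updated)
print(unescaped, "|", escaped)
```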



### broadcast() function

- Marks the smaller dataframe to be broadcast to every executor for a join, so the larger dataframe does not need to be shuffled.

In [115]:
df1 = spark.read.csv(r"C:\Users\aditya.rambhad\Downloads\IPL - Player Performance Dataset\IPL_Ball_by_Ball_2008_2022.csv", header=True, inferSchema=True)
df2 = spark.read.csv(r"C:\Users\aditya.rambhad\Downloads\IPL - Player Performance Dataset\IPL_Matches_2008_2022.csv", header=True, inferSchema=True)

In [116]:
df1.count() #larger df

225954

In [117]:
df2.count() #smaller df

950

In [118]:
from pyspark.sql.functions import broadcast

joined_df = df1.join(broadcast(df2),df1.ID == df2.ID)

## Conditional clauses

- Inline version of if / then / else
- when()
- otherwise()

In [64]:
df.select(df.batter, df.batsman_run,
         F.when(df.batsman_run >= 4, "yes")
         .otherwise('no').alias('is_boundary')).show(10)

+-----------+-----------+-----------+
|     batter|batsman_run|is_boundary|
+-----------+-----------+-----------+
|YBK Jaiswal|          0|         no|
|YBK Jaiswal|          0|         no|
| JC Buttler|          1|         no|
|YBK Jaiswal|          0|         no|
|YBK Jaiswal|          0|         no|
|YBK Jaiswal|          0|         no|
| JC Buttler|          0|         no|
| JC Buttler|          0|         no|
| JC Buttler|          4|        yes|
| JC Buttler|          0|         no|
+-----------+-----------+-----------+
only showing top 10 rows
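when() calls can be chained, in which case the first matching branch wins and otherwise() supplies the fallback. The plain-Python function below models that per-row logic; the labels "six" and "four" are hypothetical. In PySpark the same chain would be written as `F.when(df.batsman_run >= 6, "six").when(df.batsman_run >= 4, "four").otherwise("no")`.

```python
def classify(batsman_run):
    """Model of when(...).when(...).otherwise(...): first matching branch wins."""
    if batsman_run >= 6:
        return "six"
    if batsman_run >= 4:
        return "four"
    return "no"

labels = [classify(run) for run in [0, 1, 4, 6]]
print(labels)
```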



## Interacting with DataFrames using PySpark SQL

### DataFrame API vs SQL queries

- The DataFrame API provides a programmatic domain-specific language (DSL) for working with data
- DataFrame transformations and actions are easier to construct programmatically
- SQL queries can be concise, easier to understand, and portable
- The operations on DataFrames can also be done using SQL queries

In [39]:
df.createOrReplaceTempView("table1")

The sql() method takes a SQL statement as an argument and returns the result as a DataFrame.

In [41]:
df2 = spark.sql("Select batter, total_run from table1")
df2.show(5)

+-----------+---------+
|     batter|total_run|
+-----------+---------+
|YBK Jaiswal|        0|
|YBK Jaiswal|        1|
| JC Buttler|        1|
|YBK Jaiswal|        0|
|YBK Jaiswal|        0|
+-----------+---------+
only showing top 5 rows



### Summarizing and grouping data using SQL queries

In [44]:
query = '''Select batter, sum(total_run) as runs_scored from table1 group by batter'''

spark.sql(query).show(5)

+--------------+-----------+
|        batter|runs_scored|
+--------------+-----------+
| Kuldeep Yadav|        111|
|  M Theekshana|         11|
|    KA Pollard|       3650|
|   SS Cottrell|          0|
|R Sanjay Yadav|          0|
+--------------+-----------+
only showing top 5 rows
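One way to see the portability of SQL is to run the same query text against a different engine. The sketch below uses Python's built-in sqlite3 module with a few made-up rows standing in for the real dataset; only the table and column names are taken from the notebook.

```python
import sqlite3

# Hypothetical rows standing in for the real dataset.
rows = [("KA Pollard", 4), ("KA Pollard", 6), ("M Theekshana", 1), ("SS Cottrell", 0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (batter TEXT, total_run INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", rows)

# Same query text as the Spark SQL cell above.
query = '''Select batter, sum(total_run) as runs_scored from table1 group by batter'''
result = sorted(conn.execute(query).fetchall())
conn.close()

print(result)
```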



### Filtering rows using SQL queries

In [46]:
query = '''Select batter, total_run from table1 where total_run >= 4'''

spark.sql(query).show(5)

+-----------+---------+
|     batter|total_run|
+-----------+---------+
| JC Buttler|        4|
|YBK Jaiswal|        4|
|YBK Jaiswal|        6|
|YBK Jaiswal|        6|
|  SV Samson|        4|
+-----------+---------+
only showing top 5 rows



## User defined functions

- Python method
- Wrapped via the pyspark.sql.functions.udf method
- Stored as a variable
- Called like a normal Spark function

In [79]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def reverseString(mystr):
    return mystr[::-1]

udfReverseString = udf(reverseString, StringType())

In [83]:
df_udf = df.withColumn('ReverseName', udfReverseString(df.batter))