# Simple PySpark Programs

Lab Exercises:

1) Implement a PySpark script that applies transformations like filter and withColumn on a DataFrame.

2) Write a PySpark script that performs actions like count and show on a DataFrame.

3) Demonstrate how to perform basic aggregations (e.g., sum, average) on a PySpark DataFrame.

4) Show how to write a PySpark DataFrame to a CSV file.

5) Implement a word count program in PySpark.

In [1]:
import os
import sys

# Point both the driver and the workers at the interpreter running this notebook,
# so they do not end up on mismatched Python versions.
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

In [6]:
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
spark = SparkSession.builder.getOrCreate()

# Create a small DataFrame with two columns: id and name
df = spark.createDataFrame(
    [
        (1, "A"),
        (2, "B"),
        (3, "C"),
        (4, "D"),
        (5, "E"),
    ],
    ["id", "name"],
)

### Count and show on a DataFrame.

In [6]:
df.show()
print(df.count())

+---+----+
| id|name|
+---+----+
|  1|   A|
|  2|   B|
|  3|   C|
|  4|   D|
|  5|   E|
+---+----+

5


### Apply transformations like filter and withColumn on a DataFrame.

In [7]:
df1 = df.filter(df["name"].startswith("A"))
df1.show()

+---+----+
| id|name|
+---+----+
|  1|   A|
+---+----+



In [9]:
df2 = df.withColumn("age", df["id"]*10)
df2.show()

+---+----+---+
| id|name|age|
+---+----+---+
|  1|   A| 10|
|  2|   B| 20|
|  3|   C| 30|
|  4|   D| 40|
|  5|   E| 50|
+---+----+---+



### Basic aggregations (e.g., sum, average) on a PySpark DataFrame.

In [10]:
# Import the functions module under an alias so Python's built-in sum is not shadowed
from pyspark.sql import functions as F

df2.select(F.sum("age")).show()

+--------+
|sum(age)|
+--------+
|     150|
+--------+



In [11]:
from pyspark.sql.functions import avg

df2.select(avg("age")).show()

+--------+
|avg(age)|
+--------+
|    30.0|
+--------+



### Wordcount program in PySpark.

In [8]:
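from pyspark.sql.functions import explode, split, col

# Read the file as a DataFrame with one row per line, in a column named "value".
# A separate name is used so the id/name DataFrame df is not overwritten.
text_df = spark.read.text("/home/lplab/Desktop/210962069/BDAL/lab2/text")

# Split each line on spaces, explode into one row per word, then count per word
df_count = (
    text_df.withColumn('word', explode(split(col('value'), ' ')))
        .groupBy('word')
        .count()
        .sort('count', ascending=False)
)

# Display the output
df_count.show()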

+---------+-----+
|     word|count|
+---------+-----+
|       in|    3|
|  pyspark|    3|
|wordcount|    3|
|      for|    2|
|       is|    2|
|  testing|    2|
|   sample|    2|
|     This|    2|
|     text|    2|
+---------+-----+



### Write PySpark DataFrame to a CSV file.

In [10]:
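# Spark writes a directory named "output" containing one part file per partition;
# mode('overwrite') replaces that directory if it already exists.
df.write.format("csv").mode('overwrite').save('output')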