Q1. Implement a PySpark script that applies transformations like filter and withColumn on a DataFrame.

Q2. Write a PySpark script that performs actions like count and show on a DataFrame.

Q3. Demonstrate how to perform basic aggregations (e.g., sum, average) on a PySpark DataFrame.

Q4. Show how to write a PySpark DataFrame to a CSV file.

In [84]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1,"Alice"),(2,"Bill"),(3,"Cathy"),(4,"Annabeth"),(5,"Athena")],["id","name"])
print(df.count())
df.show()

# Filter and withColumn
df1 = df.filter(df["name"].startswith("A"))
df1.show()

# Reference df1 (the DataFrame being extended) rather than its parent df
df2 = df1.withColumn('age', df1.id + 10)
df2.show()

# Aggregate Functions
df2.agg(F.avg(df2['age']).alias("Average Age")).show()
df2.agg(F.sum(df2['age']).alias("Sum of Ages")).show()   

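# Write to CSV
# coalesce(n) can only reduce the number of partitions, so this writes at most 3 part files.
# 'output.csv' is created as a directory containing those part files, not as a single file.
df2.coalesce(3).write.mode('overwrite').csv('output.csv')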

5
+---+--------+
| id|    name|
+---+--------+
|  1|   Alice|
|  2|    Bill|
|  3|   Cathy|
|  4|Annabeth|
|  5|  Athena|
+---+--------+

+---+--------+
| id|    name|
+---+--------+
|  1|   Alice|
|  4|Annabeth|
|  5|  Athena|
+---+--------+

+---+--------+---+
| id|    name|age|
+---+--------+---+
|  1|   Alice| 11|
|  4|Annabeth| 14|
|  5|  Athena| 15|
+---+--------+---+

+------------------+
|       Average Age|
+------------------+
|13.333333333333334|
+------------------+

+-----------+
|Sum of Ages|
+-----------+
|         40|
+-----------+
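
The filter, withColumn and aggregate results above can be cross-checked by hand. The following is a minimal plain-Python sketch (no Spark involved) that repeats the same steps on the same five rows; the variable names are illustrative only and are not part of the notebook.

```python
# Plain-Python cross-check of the Spark output above (no Spark required).
rows = [(1, "Alice"), (2, "Bill"), (3, "Cathy"), (4, "Annabeth"), (5, "Athena")]

# Equivalent of df.filter(df["name"].startswith("A"))
filtered = [(i, name) for i, name in rows if name.startswith("A")]

# Equivalent of withColumn('age', id + 10)
with_age = [(i, name, i + 10) for i, name in filtered]

# Equivalent of F.sum and F.avg over the 'age' column
ages = [age for _, _, age in with_age]
total_age = sum(ages)
average_age = total_age / len(ages)

print(len(rows))               # row count, as in df.count()
print(with_age)                # rows after filter + withColumn
print(total_age, average_age)  # sum and average of ages
```

This only mirrors the arithmetic; it says nothing about how Spark partitions or schedules the work.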



Q5. Implement wordcount program in PySpark.

In [83]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark import SparkContext
from pyspark import SparkConf

#Using DataFrame
spark = SparkSession.builder.getOrCreate()
lines = spark.read.text("sample.txt")
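# Split each line on spaces, emit one row per word, then count and sort by frequency.
# show() returns None, so keep the DataFrame in a variable and display it separately.
word_counts = (
    lines.withColumn("word", F.explode(F.split(F.col("value"), " ")))
    .groupBy("word")
    .count()
    .sort("count", ascending=False)
)
word_counts.show()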

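# Using RDDs
# getOrCreate returns the already-running context if there is one; the conf is used only when a new context is created
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
lines = sc.textFile("sample.txt")
# Split each line into words, pair each word with 1, then sum the 1s per word
counts = (
    lines.flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda x, y: x + y)
)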

output = counts.collect()

for word,count in output:
    print("%s,%i"%(word,count))

DataFrame[value: string]
+-------+-----+
|   word|count|
+-------+-----+
| Python|    4|
|    SQL|    3|
|Pyspark|    3|
|     ML|    2|
+-------+-----+

sample.txt MapPartitionsRDD[1153] at textFile at NativeMethodAccessorImpl.java:0
SQL,3
Python,4
Pyspark,3
ML,2
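
To make the three RDD stages concrete, here is a minimal plain-Python sketch of what flatMap, map and reduceByKey each compute. It does not use Spark. The contents of sample.txt are not shown in this notebook, so `sample_lines` below is a hypothetical stand-in chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for sample.txt; the real file's contents are not shown.
sample_lines = [
    "Python SQL Pyspark",
    "Python ML SQL",
    "Pyspark Python ML",
    "SQL Python Pyspark",
]

# flatMap: split every line and flatten the results into one list of words
words = [word for line in sample_lines for word in line.split(" ")]

# map: pair each word with the count 1
pairs = [(word, 1) for word in words]

# reduceByKey: add up the 1s for each distinct word
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

# Print in descending order of count, in the same "word,count" format as above
for word, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print("%s,%i" % (word, count))
```

In Spark the same steps run per partition and the per-key sums are then merged across partitions, which is why the order of the collected RDD output is not guaranteed.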
