# SPARK101 EXERCISES

## 1. **Create a spark data frame that contains your favorite programming languages.**

In [1]:
import pyspark

In [2]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()



In [3]:
spark

* The name of the column should be language  

In [4]:
# Request a session named "ProgrammingLanguages"; getOrCreate() returns the session that is already running
spark = pyspark.sql.SparkSession.builder.appName("ProgrammingLanguages").getOrCreate()

23/11/25 12:56:29 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [5]:
# Created a list of my favorite programming languages
fav_languages = ["Python", "C++", "Java", "C#", "R"]
fav_languages

['Python', 'C++', 'Java', 'C#', 'R']

In [6]:
# Import SparkSession and Row to build a single-column spark dataframe
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Note: the exercise asks for a column named `language`; this cell names it `Languages`
prgm_lang_df = spark.createDataFrame([Row(Languages=lang) for lang in fav_languages])

In [7]:
prgm_lang_df.show()

+---------+
|Languages|
+---------+
|   Python|
|      C++|
|     Java|
|       C#|
|        R|
+---------+



* View the schema of the dataframe  

In [8]:
# Viewing Schema
prgm_lang_df.printSchema()

root
 |-- Languages: string (nullable = true)



* Output the shape of the dataframe  

In [9]:
# Spark dataframes have no .shape attribute, so count rows and columns separately

# number of rows
num_rows = prgm_lang_df.count()

# number of columns
columns = prgm_lang_df.columns
num_columns = len(columns)

In [10]:
# printing out shape

print(f"Number of Rows: {num_rows}")
print(f"Number of Columns: {num_columns}")

Number of Rows: 5
Number of Columns: 1


* Show the first 5 records in the dataframe

In [11]:
prgm_lang_df.show(5)

+---------+
|Languages|
+---------+
|   Python|
|      C++|
|     Java|
|       C#|
|        R|
+---------+



In [12]:
import pandas as pd
import numpy as np
import pyspark
from pyspark.sql import SparkSession
from pydataset import data
# from vega_datasets import data

# Note: The pyspark avg and mean functions are aliases of each other
from pyspark.sql.functions import col, expr, concat, sum, avg, min, max, count, mean, lit, regexp_extract, regexp_replace, when, asc, desc, month, year, quarter

### Visualization (or Lack Thereof)

Spark dataframes have no built-in plotting methods. To visualize data held in
spark, call the `.toPandas()` method on the spark dataframe to convert it to a
pandas dataframe, then visualize it as you normally would.

!!!warning "Converting to A Pandas Dataframe"
    Converting a spark dataframe to a pandas dataframe will pull all the data into memory, so make sure you have enough available memory to do so.

## References

- [PySpark API Docs](https://spark.apache.org/docs/latest/api/python/index.html)
- [Spark SQL Programming Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html): the examples are shown in several programming languages, so make sure you select Python.
- [DataFrame class](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame)
- [Column class](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column)
- [pyspark.sql.functions module](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions)
- `df.na`: [DataFrameNaFunctions class](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions)

In [15]:
spark = SparkSession.builder.getOrCreate()

In [16]:
# Use the demo code to build the mpg spark dataframe
mpg = spark.createDataFrame(data("mpg"))

mpg.write.json("data/mpg_json", mode="overwrite")

# As with much else in spark, there are multiple ways to do this:
(
    mpg.write.format("csv")
    .mode("overwrite")
    .option("header", "true")
    .save("data/mpg_csv")
)


## 2. **Load the mpg dataset as a spark dataframe.**

$a.$ For each vehicle, create one column of output that contains a message like the one below:

 `The 1999 audi a4 has a 4 cylinder engine.`

In [17]:
# Display the first 20 vehicles
# concat can be used to create a message like the one in the example.
mpg.show()

+------------+------------------+-----+----+---+----------+---+---+---+---+-------+
|manufacturer|             model|displ|year|cyl|     trans|drv|cty|hwy| fl|  class|
+------------+------------------+-----+----+---+----------+---+---+---+---+-------+
|        audi|                a4|  1.8|1999|  4|  auto(l5)|  f| 18| 29|  p|compact|
|        audi|                a4|  1.8|1999|  4|manual(m5)|  f| 21| 29|  p|compact|
|        audi|                a4|  2.0|2008|  4|manual(m6)|  f| 20| 31|  p|compact|
|        audi|                a4|  2.0|2008|  4|  auto(av)|  f| 21| 30|  p|compact|
|        audi|                a4|  2.8|1999|  6|  auto(l5)|  f| 16| 26|  p|compact|
|        audi|                a4|  2.8|1999|  6|manual(m5)|  f| 18| 26|  p|compact|
|        audi|                a4|  3.1|2008|  6|  auto(av)|  f| 18| 27|  p|compact|
|        audi|        a4 quattro|  1.8|1999|  4|manual(m5)|  4| 18| 26|  p|compact|
|        audi|        a4 quattro|  1.8|1999|  4|  auto(l5)|  4| 16| 25|  p|c

In [18]:
from pyspark.sql import functions as F

def create_message(year, make, model, cylinders):
    return f"The {year} {make} {model} has a {cylinders} cylinder engine."

create_message_udf = F.udf(create_message)

mpg_df = mpg.withColumn(
    "output_message",
    create_message_udf(
        F.col("year"), F.col("manufacturer"), F.col("model"), F.col("cyl")
    )
)

mpg_df.select("output_message").show(truncate=False)

+--------------------------------------------------------------+
|output_message                                                |
+--------------------------------------------------------------+
|The 1999 audi a4 has a 4 cylinder engine.                     |
|The 1999 audi a4 has a 4 cylinder engine.                     |
|The 2008 audi a4 has a 4 cylinder engine.                     |
|The 2008 audi a4 has a 4 cylinder engine.                     |
|The 1999 audi a4 has a 6 cylinder engine.                     |
|The 1999 audi a4 has a 6 cylinder engine.                     |
|The 2008 audi a4 has a 6 cylinder engine.                     |
|The 1999 audi a4 quattro has a 4 cylinder engine.             |
|The 1999 audi a4 quattro has a 4 cylinder engine.             |
|The 2008 audi a4 quattro has a 4 cylinder engine.             |
|The 2008 audi a4 quattro has a 4 cylinder engine.             |
|The 1999 audi a4 quattro has a 6 cylinder engine.             |
|The 1999 audi a4 quattro

                                                                                

$b.$ Transform the trans column so that it only contains either manual or auto.

In [19]:
from pyspark.sql.functions import regexp_extract
from pyspark.sql.types import StringType

In [20]:
mpg.select(mpg.trans).show(5)

+----------+
|     trans|
+----------+
|  auto(l5)|
|manual(m5)|
|manual(m6)|
|  auto(av)|
|  auto(l5)|
+----------+
only showing top 5 rows



In [21]:
# regexp_extract already returns a string column; the cast only makes the type explicit
mpg_transformed = mpg.withColumn(
    "trans",
    regexp_extract(mpg["trans"], r"(manual|auto)", 1).cast(StringType()),
)

mpg_transformed.select("trans").show(5)

+------+
| trans|
+------+
|  auto|
|manual|
|manual|
|  auto|
|  auto|
+------+
only showing top 5 rows



In [22]:
mpg.select(
    "trans",
    regexp_extract("trans", r"^(\w+)", 1).alias("trans_new"),
).show(truncate=False)

+----------+---------+
|trans     |trans_new|
+----------+---------+
|auto(l5)  |auto     |
|manual(m5)|manual   |
|manual(m6)|manual   |
|auto(av)  |auto     |
|auto(l5)  |auto     |
|manual(m5)|manual   |
|auto(av)  |auto     |
|manual(m5)|manual   |
|auto(l5)  |auto     |
|manual(m6)|manual   |
|auto(s6)  |auto     |
|auto(l5)  |auto     |
|manual(m5)|manual   |
|auto(s6)  |auto     |
|manual(m6)|manual   |
|auto(l5)  |auto     |
|auto(s6)  |auto     |
|auto(s6)  |auto     |
|auto(l4)  |auto     |
|auto(l4)  |auto     |
+----------+---------+
only showing top 20 rows



## 3. **Load the tips dataset as a spark dataframe.**

In [24]:
# Use the demo code to build the tips spark dataframe
tips = spark.createDataFrame(data("tips"))

tips.write.json("data/tips_json", mode="overwrite")

# As with much else in spark, there are multiple ways to do this:
(
    tips.write.format("csv")
    .mode("overwrite")
    .option("header", "true")
    .save("data/tips_csv")
)


$a.$ What percentage of observations are smokers?  

In [25]:
tips.show(100_000)

+----------+----+------+------+----+------+----+
|total_bill| tip|   sex|smoker| day|  time|size|
+----------+----+------+------+----+------+----+
|     16.99|1.01|Female|    No| Sun|Dinner|   2|
|     10.34|1.66|  Male|    No| Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No| Sun|Dinner|   3|
|     23.68|3.31|  Male|    No| Sun|Dinner|   2|
|     24.59|3.61|Female|    No| Sun|Dinner|   4|
|     25.29|4.71|  Male|    No| Sun|Dinner|   4|
|      8.77| 2.0|  Male|    No| Sun|Dinner|   2|
|     26.88|3.12|  Male|    No| Sun|Dinner|   4|
|     15.04|1.96|  Male|    No| Sun|Dinner|   2|
|     14.78|3.23|  Male|    No| Sun|Dinner|   2|
|     10.27|1.71|  Male|    No| Sun|Dinner|   2|
|     35.26| 5.0|Female|    No| Sun|Dinner|   4|
|     15.42|1.57|  Male|    No| Sun|Dinner|   2|
|     18.43| 3.0|  Male|    No| Sun|Dinner|   4|
|     14.83|3.02|Female|    No| Sun|Dinner|   2|
|     21.58|3.92|  Male|    No| Sun|Dinner|   2|
|     10.33|1.67|Female|    No| Sun|Dinner|   3|
|     16.29|3.71|  M

$b.$ Create a column that contains the tip percentage  

$c.$ Calculate the average tip percentage for each combination of sex and smoker.  

## 4. **Use the Seattle weather dataset referenced in the lesson to answer the questions below.**

* Convert the temperatures to Fahrenheit.

* Which month has the most rain, on average?
   
* Which year was the windiest?

* What is the most frequent type of weather in January?
  
* What is the average high and low temperature on sunny days in July in 2013 and 2014?
  
* What percentage of days were rainy in Q3 of 2015?
  
* For each year, find what percentage of days it rained (had non-zero precipitation).