# Spark UI and Debugging Lab

Run the following code. It performs an import needed later and also starts up the Spark session. Click the link to open the Spark UI. In the link that opens, you will need to replace `http://spark-training-primary.umbctraining.com` with `http://54.156.199.198`.

In [None]:
from pyspark.sql import functions as F

Start the following code, then head over to the Spark UI and watch what happens while the code is running. The code takes about 5 minutes to run, so you should have time to explore.

In [None]:
# Load the names file and keep rows with no recorded death year, a known birth year after 1950, and a profession matching "act".
players = spark.read.csv("hdfs:///user/bryan/data/name.basics.tsv", header=True, inferSchema=True, sep='\t')
players = players.filter(players.primaryProfession.rlike("act")).filter((players.deathYear == '\\N') & (players.birthYear != '\\N')).filter(players.birthYear.cast('int') > 1950)

# Load the same file again and suffix every column with "_2" so the names stay distinct after the join.
players2 = spark.read.csv("hdfs:///user/bryan/data/name.basics.tsv", header=True, inferSchema=True, sep='\t')
for c in players2.columns:
    players2 = players2.withColumnRenamed(c, c + "_2")
players2 = players2.filter(players2.primaryProfession_2.rlike("act")).filter((players2.deathYear_2 == '\\N') & (players2.birthYear_2 != '\\N')).filter(players2.birthYear_2.cast('int') > 1950)

print(players2.count())

joined = players.crossJoin(players2)

print(joined.count())
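The cell above is probably slow mostly because of the cross join, which pairs every row on one side with every row on the other. The sketch below illustrates that growth in plain Python, without Spark. The helper `cross_join_rows` and the row count `hypothetical_rows` are illustrative assumptions, not values taken from the lab data.

```python
from itertools import product

def cross_join_rows(left_count, right_count):
    """Number of rows a cross join produces: one per (left, right) pair."""
    return left_count * right_count

# Tiny example: 3 rows crossed with 2 rows gives 6 pairs.
left = ["a", "b", "c"]
right = ["x", "y"]
pairs = list(product(left, right))
print(len(pairs))  # 6

# Hypothetical size, NOT the real count from the lab data.
hypothetical_rows = 100_000
print(cross_join_rows(hypothetical_rows, hypothetical_rows))  # 10000000000
```

Comparing the number printed by `players2.count()` with the stage and task counts in the Spark UI may help connect this arithmetic to what the cluster is doing.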

In the next section, fix the bug. Even if you can spot the bug now, run the code so you can practice reading error messages.

In [None]:
def get_last_4(column):
    return column.split('-')[2]

In [None]:
data = spark.read.csv("hdfs:///user/bryan/data/100_percent_real_data.csv",inferSchema=True,header=True)

In [None]:
data.show(100)

In [None]:
get_last_4_udf = spark.udf.register("get_last_4",get_last_4)

In [None]:
data = data.withColumn("last_4",get_last_4_udf(data.phone))

In [None]:
data.show()
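One way to narrow down a failure in a UDF like the one above is to call the plain Python function on a few values outside Spark, where the error is much shorter. The sketch below does that with `get_last_4` as defined earlier. The values in `sample_values` and the helper `try_value` are hypothetical, chosen only to show different failure modes; they are not taken from the lab data.

```python
def get_last_4(column):
    # Same function as in the lab, copied here so the sketch is self-contained.
    return column.split('-')[2]

# Hypothetical inputs: a dashed number, an undashed number, and a missing value.
sample_values = ["555-123-4567", "5551234567", None]

def try_value(func, value):
    """Call func(value) and report either the result or the exception type."""
    try:
        return ("ok", func(value))
    except Exception as exc:
        return ("error", type(exc).__name__)

results = [try_value(get_last_4, v) for v in sample_values]
for value, result in zip(sample_values, results):
    print(repr(value), "->", result)
```

If one of the exception names printed here also appears in the Spark error, that suggests which kind of row to look for in `data.show(100)`.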

Again, the code below has a bug in it. Run it until you hit the error, then use the error messages to work out the fix.

In [None]:
def same_address(df):
    return df.assign(address = df['adress'].iloc[0])

In [None]:
from pyspark.sql.types import StructType, StructField, StringType

In [None]:
from pyspark.sql.functions import pandas_udf, PandasUDFType

In [None]:
same_udf = pandas_udf(same_address, StructType([StructField("name",StringType()),
                                StructField("address",StringType()),
                                StructField("company",StringType()),
                                StructField("phone",StringType()),
                                StructField("ssn",StringType())]),
                              PandasUDFType.GROUPED_MAP)

In [None]:
data.groupby('company').apply(same_udf).show()
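When a Python function fails inside Spark, the Python part of the error is usually surrounded by a much longer stack trace. The exception type, its message, and the innermost frame tend to be the most useful pieces. The sketch below shows how to pull those three things out of an exception in plain Python. `lookup_field` and `summarize_error` are hypothetical helpers for illustration, not part of the lab code or of PySpark.

```python
import traceback

def lookup_field(record):
    # Hypothetical stand-in for a function that would run inside a UDF.
    return record["missing_key"]

def summarize_error(func, arg):
    """Run func(arg); on failure return (exception type, message, failing function name)."""
    try:
        func(arg)
    except Exception as exc:
        frames = traceback.extract_tb(exc.__traceback__)
        innermost = frames[-1]  # the frame where the exception was actually raised
        return type(exc).__name__, str(exc), innermost.name
    return None

summary = summarize_error(lookup_field, {"name": "Ada"})
print(summary)
```

Reading a Spark error the same way, starting from the final exception line and the last Python frame, is often quicker than reading it from the top.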