# UDF and Caching Lab

In this lab, we will be working with data for all international soccer games ever played. 

First, run the following imports for later use and read in the data. 

In [None]:
from pyspark.sql.functions import pandas_udf, PandasUDFType

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType, BooleanType

In [None]:
games = spark.read.csv("hdfs:///data/soccer_games.csv",header=True, inferSchema=True)
games = games.repartition(100)

View the data using `show`.

### Expand some acronyms in the Tournament column using UDFs

We have provided the dictionary below to use for the lookups.

In [None]:
acronyms = {'UEFA': "Union of European Football Associations",
            "FIFA":"Fédération Internationale de Football Association",
            "AFC":"Asian Football Confederation", 
            "CONCACAF":"Confederation of North, Central American and Caribbean Association Football"
           }

Write a Python function `slow` that takes one argument, a string.
- Split the string into words by splitting on spaces using `split`
- For each word in the string, use the value in the acronyms dictionary if it exists. Otherwise leave it as is.
- Return the expanded words joined back together in a single string
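
The steps above follow a common split, look up, and join pattern. Here is a minimal sketch of that pattern, assuming a small made-up table: the names `lookup` and `expand_words` are hypothetical and are not part of the lab.

```python
# Hypothetical lookup table for illustration only; the lab's `acronyms`
# dictionary can be used in exactly the same way.
lookup = {"UN": "United Nations", "EU": "European Union"}


def expand_words(text, table):
    """Replace every space-separated word that is a key in `table`."""
    words = text.split(" ")
    # dict.get(key, default) returns the word itself when there is no entry.
    expanded = [table.get(word, word) for word in words]
    return " ".join(expanded)


print(expand_words("UN General Assembly", lookup))
```

Using `dict.get` with the word as its own default avoids an explicit `if word in table` check.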

In [None]:
def slow(row):


Test your Python function below. It should return "Fédération Internationale de Football Association World Cup".

In [None]:
print(slow("FIFA World Cup"))

Register your python function as a UDF using `spark.udf.register`.

In [None]:
slow_udf =

Call your UDF function on the tournament column by using a `select` method on `games`.

In [None]:
expanded =

Use `distinct` and `show` to view the results.

Now we are going to write the same UDF using pandas. We have written the Python function for you this time.

In [None]:
def fast(series):
    # Split each string into words, expand known acronyms, and rejoin with spaces.
    return series.str.split().apply(lambda y: ' '.join([acronyms.get(x,x) for x in y]))
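
A scalar pandas UDF receives a whole `pandas.Series` per batch and returns a Series of the same length, so a function like this can be checked locally on a small Series before it is wrapped with `pandas_udf`. The sketch below assumes a two-entry table; `table`, `expand_series`, and `sample` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical two-entry table; substitute the lab's `acronyms` dictionary.
table = {
    "FIFA": "Fédération Internationale de Football Association",
    "AFC": "Asian Football Confederation",
}


def expand_series(series):
    # Split each string on whitespace, map each word through the table,
    # and rejoin the words with single spaces.
    return series.str.split().apply(
        lambda words: " ".join(table.get(w, w) for w in words)
    )


sample = pd.Series(["FIFA World Cup", "AFC Asian Cup", "Friendly"])
result = expand_series(sample)
print(result.tolist())
```

Checking that the output has the same length as the input is a useful sanity test, since a scalar pandas UDF must return one value per input row.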

Make a vectorized UDF using `pandas_udf`.

In [None]:
fast_udf =

Call your UDF function on the tournament column by using a `select` method on `games`.

In [None]:
expanded =

Use `distinct` and `show` to view the results.

### Find the games in which a team scored the most goals, per tournament
We have already written the vectorized Python function for you; see if you can follow what it is doing.

In [None]:
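def most_goals(df):
    # Highest score by either team in each game.
    df = df.assign(game_max = df[['home_score','away_score']].max(axis=1))
    # idxmax returns an index label, so select the row with .loc rather than .iloc.
    most = df.loc[df.game_max.idxmax()]
    most = most.drop('game_max')
    # Return a one-row DataFrame with the original columns.
    return most.to_frame().T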
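
To follow the logic, it can help to run the same steps on a tiny pandas DataFrame outside Spark. The sketch below is a stand-alone illustration: `top_scoring_game` and `toy` are hypothetical, and the toy data is made up and carries only four of the real columns.

```python
import pandas as pd


def top_scoring_game(df):
    # Highest score by either team in each game.
    df = df.assign(game_max=df[["home_score", "away_score"]].max(axis=1))
    # idxmax returns the index label of the first maximum; .loc selects by label.
    best = df.loc[df["game_max"].idxmax()].drop("game_max")
    # Turn the selected row (a Series) back into a one-row DataFrame.
    return best.to_frame().T


# Made-up scores for illustration only.
toy = pd.DataFrame({
    "home_team": ["A", "B", "C"],
    "away_team": ["B", "C", "A"],
    "home_score": [1, 0, 2],
    "away_score": [3, 7, 2],
})

print(top_scoring_game(toy))
```

In a grouped map UDF, Spark calls the function once per group with that group's rows as a pandas DataFrame, and the columns of the returned frame must match the declared return schema.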

Next, we need to define the return type: a schema that lists every column and its type.

We've done the first few columns for you.

In [None]:
gamesType = StructType([StructField('date',DateType()),
                        StructField('home_team',StringType()),
                        StructField('away_team',StringType()),
                       ...
                       ])

Make a GROUPED_MAP Vectorized UDF.

In [None]:
most_goals_udf = 

Use `groupby` and `apply` to determine the game in each tournament with the most goals for one team. 
Use `show` to view the results.

## Caching and Repartitioning

In [None]:
from pyspark.sql.functions import year

In [None]:
games = games.withColumn('date', year(games.date))

First, we are going to run a `groupby` on the data as is.

In [None]:
games.groupby('tournament').min('date').show()

In [None]:
games.groupby('tournament').count().show()

Now, repartition on tournament using `repartition` and then call `cache`. Make 100 partitions.
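
Repartitioning on the grouping column places all rows with the same key in the same partition, which can let a later `groupby` on that column avoid another shuffle, while caching keeps the repartitioned data in memory for reuse. Here is a minimal sketch of the two calls, assuming `df` is a Spark DataFrame; the helper name `repartition_and_cache` is hypothetical.

```python
def repartition_and_cache(df, column, num_partitions):
    """Hash-partition `df` by `column`, then mark the result for caching."""
    # DataFrame.repartition(numPartitions, *cols) partitions rows by the given column(s).
    repartitioned = df.repartition(num_partitions, column)
    # cache() is lazy: the data is stored only when the first action runs.
    return repartitioned.cache()
```

Because `cache` is lazy, the first action after it still pays the full cost of computing and storing the data; the benefit shows up on later actions.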

In [None]:
games = 

Run the same code as before.

In [None]:
games.groupby('tournament').min('date').show()

In [None]:
games.groupby('tournament').count().show()