# Introduction

This notebook serves as an easy dive into Spark. It walks you through the basics step by step and gives you the information you need to actually use Spark in your work.

We refer to sections of the book _Spark: The Definitive Guide_ in case you want to learn more about a topic. Don't hesitate to read the [PySpark documentation](https://spark.apache.org/docs/latest/api/python/reference/index.html), which describes the methods that are already available.

# Install PySpark

In [None]:
!pip3 install pyspark

# Run Spark

Let's create a Spark session so we can start loading data.

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

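## Spark UI

The Spark UI is a web interface for monitoring the jobs, stages, and executors of your application. In a notebook, evaluating the `spark` session object displays a summary of the session, including a link to the Spark UI.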

In [None]:
spark

# Spark basics

Read the player data from the player CSV file and print its schema. See _Basics of Reading Data_ (pg. 161).

Select the _player_name_ column. There are multiple ways to select a column; see _Columns_ (pg. 68).

Create two additional columns: the first contains the player's first name and the second their surname. Now select all players whose first name is Quinten or Paolo. Don't forget that you need to import functions from the [PySpark sql functions api](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions).

Now let's get back to the player dataset. Let's extract the tallest player and the lightest player.

Now load the player attributes and inspect the data. Is it clean?
If not, assume the following:
   1. duplicates of column __player_api_id__ can be _dropped_,
   2. null values can be replaced with 0.

Now join each player's attributes with their name from the _player_ table. If you are unsure which join to use, think about what should happen to attributes with no matching player and to players with no attributes (see Chap. 8, __Joins__).

_Tip_: first investigate how the player attributes table and the player name table are constructed.

Based on the attributes you now have, select 4 combinations of numerical attributes that you think a goalkeeper, a defender, a midfielder and a striker should have to be a good player, and compute the mean of each combination to obtain one rating per role.

Ex.:
 * goalkeeper -> gk_diving, gk_handling, gk_kicking, gk_positioning, gk_reflexes
 * defender -> defensive_work_rate, sliding_tackle, standing_tackle, marking, interceptions, long_passing
 * midfielder -> attacking_work_rate, defensive_work_rate, short_passing, ball_control, stamina
 * striker -> acceleration, sprint_speed, attacking_work_rate, finishing, heading_accuracy, dribbling

Now that you have some new ratings for the players, let's create an alternative potential column based on those ratings. If any of the ratings a player receives is higher than 85, then the column _high_potential_ should be marked as _True_.

_Tip_: use the [_when/otherwise_ function](https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.when)

Great! You are now ready to start the project! Good luck.