# Ex2 - Filtering and Sorting Data

This time we are going to pull data directly from the internet.

### Step 1. Import the necessary libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import DoubleType

In [2]:
spark = SparkSession.builder.master("local[2]").appName("euro").getOrCreate()


### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/02_Filtering_%26_Sorting/Euro12/Euro_2012_stats_TEAM.csv). 

### Step 3. Assign it to a variable called euro12.

In [3]:
euro12 = spark.read.options(header=True, inferSchema=True).csv("Euro_2012_stats_TEAM.csv")
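Step 2 points at a URL, while the cell above reads a local copy of the file. Spark's CSV reader typically cannot open an `https://` address in a default local setup, so one option is to download the file first and then read it from disk. The sketch below assumes the driver machine has network access; `fetch_csv` and `load_euro12` are hypothetical helper names, not part of PySpark.

```python
import urllib.request
from pathlib import Path

# URL taken from Step 2 of this exercise.
EURO12_URL = (
    "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/"
    "02_Filtering_%26_Sorting/Euro12/Euro_2012_stats_TEAM.csv"
)


def fetch_csv(url, dest="Euro_2012_stats_TEAM.csv"):
    """Download `url` to `dest` unless the file already exists; return the local path."""
    path = Path(dest)
    if not path.exists():
        urllib.request.urlretrieve(url, str(path))
    return str(path)


def load_euro12(spark, url=EURO12_URL):
    """Hypothetical helper: read the downloaded CSV with the options used in this notebook."""
    return spark.read.options(header=True, inferSchema=True).csv(fetch_csv(url))
```

With the session from Step 1, `euro12 = load_euro12(spark)` would then replace the hard-coded local path.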

In [10]:
euro12.columns

['Team',
 'Goals',
 'Shots on target',
 'Shots off target',
 'Shooting Accuracy',
 '% Goals-to-shots',
 'Total shots (inc. Blocked)',
 'Hit Woodwork',
 'Penalty goals',
 'Penalties not scored',
 'Headed goals',
 'Passes',
 'Passes completed',
 'Passing Accuracy',
 'Touches',
 'Crosses',
 'Dribbles',
 'Corners Taken',
 'Tackles',
 'Clearances',
 'Interceptions',
 'Clearances off line',
 'Clean Sheets',
 'Blocks',
 'Goals conceded',
 'Saves made',
 'Saves-to-shots ratio',
 'Fouls Won',
 'Fouls Conceded',
 'Offsides',
 'Yellow Cards',
 'Red Cards',
 'Subs on',
 'Subs off',
 'Players Used']

### Step 4. Select only the Goals column.

In [6]:
euro12.select("Goals").show()

+-----+
|Goals|
+-----+
|    4|
|    4|
|    4|
|    5|
|    3|
|   10|
|    5|
|    6|
|    2|
|    2|
|    6|
|    1|
|    5|
|   12|
|    5|
|    2|
+-----+



### Step 5. How many teams participated in the Euro2012?

In [11]:
euro12.count()

16

### Step 6. What is the number of columns in the dataset?

In [12]:
len(euro12.columns)

35

### Step 7. View only the columns Team, Yellow Cards and Red Cards and assign them to a dataframe called discipline

In [13]:
discipline = euro12.select("Team", "Yellow Cards", "Red Cards")

In [15]:
discipline.show(10)

+--------------+------------+---------+
|          Team|Yellow Cards|Red Cards|
+--------------+------------+---------+
|       Croatia|           9|        0|
|Czech Republic|           7|        0|
|       Denmark|           4|        0|
|       England|           5|        0|
|        France|           6|        0|
|       Germany|           4|        0|
|        Greece|           9|        1|
|         Italy|          16|        0|
|   Netherlands|           5|        0|
|        Poland|           7|        1|
+--------------+------------+---------+
only showing top 10 rows



### Step 8. Sort the teams by Red Cards, then by Yellow Cards

In [17]:
discipline.sort(["Red Cards", "Yellow Cards"], ascending=False).show(10)

+-------------------+------------+---------+
|               Team|Yellow Cards|Red Cards|
+-------------------+------------+---------+
|             Greece|           9|        1|
|             Poland|           7|        1|
|Republic of Ireland|           6|        1|
|              Italy|          16|        0|
|           Portugal|          12|        0|
|              Spain|          11|        0|
|            Croatia|           9|        0|
|     Czech Republic|           7|        0|
|             Sweden|           7|        0|
|             France|           6|        0|
+-------------------+------------+---------+
only showing top 10 rows
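The call above applies `ascending=False` to both sort keys. `DataFrame.sort` also accepts a list of booleans, one per column, which helps when the tie-breaker should run in the opposite direction. A minimal sketch, where `sort_by_cards` is a hypothetical wrapper name:

```python
def sort_by_cards(df, red_desc=True, yellow_desc=True):
    """Sort by Red Cards, breaking ties on Yellow Cards, with one direction per column."""
    return df.sort(
        ["Red Cards", "Yellow Cards"],
        # `ascending` takes one flag per column, in the same order as the column list.
        ascending=[not red_desc, not yellow_desc],
    )
```

`sort_by_cards(discipline).show(10)` should reproduce the ordering shown above, while `sort_by_cards(discipline, yellow_desc=False)` keeps red cards descending and lists the fewest yellow cards first within each group.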



### Step 9. Calculate the mean Yellow Cards given per Team

In [33]:
discipline.select("Yellow Cards", "Team").groupby("Team").mean().show()

+-------------------+-----------------+
|               Team|avg(Yellow Cards)|
+-------------------+-----------------+
|             Russia|              6.0|
|             Sweden|              7.0|
|            Germany|              4.0|
|             France|              6.0|
|             Greece|              9.0|
|            Croatia|              9.0|
|              Italy|             16.0|
|              Spain|             11.0|
|            Denmark|              4.0|
|            Ukraine|              5.0|
|     Czech Republic|              7.0|
|Republic of Ireland|              6.0|
|            England|              5.0|
|             Poland|              7.0|
|           Portugal|             12.0|
|        Netherlands|              5.0|
+-------------------+-----------------+
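Because every team occupies exactly one row, grouping by `Team` simply returns each team's own count. If the step is read as asking for the average number of yellow cards across all teams, an ungrouped aggregation is one way to get it. The helper name `mean_yellow_cards` below is hypothetical; the dictionary form of `agg` is used so that no extra imports are needed.

```python
def mean_yellow_cards(df):
    """Return the average of the Yellow Cards column over all rows."""
    # agg({"column": "function"}) runs the named aggregate over the whole DataFrame.
    row = df.agg({"Yellow Cards": "avg"}).first()
    return row[0]
```

The 16 values in the table above sum to 119, so `mean_yellow_cards(discipline)` should come out at 7.4375.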



### Step 10. Filter teams that scored more than 6 goals

In [34]:
euro12.filter("Goals > 6").select("Team").show()

+-------+
|   Team|
+-------+
|Germany|
|  Spain|
+-------+



### Step 11. Select the teams that start with G

In [37]:
euro12.select("Team").filter("Team like 'G%'").show(10)

+-------+
|   Team|
+-------+
|Germany|
| Greece|
+-------+



### Step 12. Select the first 7 columns

In [40]:
# show(7) would print the first 7 rows; slice the column list to get the first 7 columns.
# Backticks stop Spark from reading the dot in 'Total shots (inc. Blocked)' as a nested-field accessor.
euro12.select([f"`{c}`" for c in euro12.columns[:7]]).show()

### Step 13. Select all columns except the last 3.

In [41]:
euro12.columns[:-3]

['Team',
 'Goals',
 'Shots on target',
 'Shots off target',
 'Shooting Accuracy',
 '% Goals-to-shots',
 'Total shots (inc. Blocked)',
 'Hit Woodwork',
 'Penalty goals',
 'Penalties not scored',
 'Headed goals',
 'Passes',
 'Passes completed',
 'Passing Accuracy',
 'Touches',
 'Crosses',
 'Dribbles',
 'Corners Taken',
 'Tackles',
 'Clearances',
 'Interceptions',
 'Clearances off line',
 'Clean Sheets',
 'Blocks',
 'Goals conceded',
 'Saves made',
 'Saves-to-shots ratio',
 'Fouls Won',
 'Fouls Conceded',
 'Offsides',
 'Yellow Cards',
 'Red Cards']
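The slice above returns a list of column names rather than a DataFrame. To obtain the data itself without the last three columns, one option is to drop them by name. A minimal sketch, with `drop_last_columns` as a hypothetical helper name:

```python
def drop_last_columns(df, n=3):
    """Return `df` without its last `n` columns."""
    if n <= 0:
        return df
    # DataFrame.drop accepts several column names as separate arguments.
    return df.drop(*df.columns[-n:])
```

`drop_last_columns(euro12).columns` should then match the list printed above.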

### Step 14. Present only the Shooting Accuracy from England, Italy and Russia

In [46]:
euro12.select("Team", "Shooting Accuracy").filter("Team IN ('England', 'Italy', 'Russia')").show()

+-------+-----------------+
|   Team|Shooting Accuracy|
+-------+-----------------+
|England|            50.0%|
|  Italy|            43.0%|
| Russia|            22.5%|
+-------+-----------------+
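The trailing `%` means `Shooting Accuracy` is held as a string, so it cannot be sorted or averaged numerically as-is. One possible conversion, sketched here with Spark SQL's `regexp_replace` and a `CAST`, strips the sign and produces a double column. The helper names and the `shooting_accuracy_pct` alias are assumptions made for illustration.

```python
def percent_to_double_expr(column, alias):
    """Build a Spark SQL expression that strips '%' from `column` and casts the rest to DOUBLE."""
    # Backticks allow spaces in the column name and alias.
    return f"CAST(regexp_replace(`{column}`, '%', '') AS DOUBLE) AS `{alias}`"


def with_numeric_accuracy(df):
    """Return Team plus a numeric version of Shooting Accuracy."""
    return df.selectExpr(
        "Team",
        percent_to_double_expr("Shooting Accuracy", "shooting_accuracy_pct"),
    )
```

Calling `with_numeric_accuracy(euro12)` would yield a two-column DataFrame whose second column can be filtered or sorted as a number.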

