# Ex2 - Filtering and Sorting Data

This time we are going to pull data directly from the internet.

### Step 1. Import the necessary libraries

In [1]:
!pip install pyspark

Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [2]:
from pyspark.sql import SparkSession, functions as f
from pyspark.files import SparkFiles

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/02_Filtering_%26_Sorting/Euro12/Euro_2012_stats_TEAM.csv). 

In [11]:
url = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/02_Filtering_%26_Sorting/Euro12/Euro_2012_stats_TEAM.csv"

spark = SparkSession.builder.appName("exercise22").getOrCreate()
spark.sparkContext.addFile(url)

df = spark.read.csv("file://" + SparkFiles.get("Euro_2012_stats_TEAM.csv"), sep=',', header=True, inferSchema=True) 
df.printSchema()
df.show()

root
 |-- Team: string (nullable = true)
 |-- Goals: integer (nullable = true)
 |-- Shots on target: integer (nullable = true)
 |-- Shots off target: integer (nullable = true)
 |-- Shooting Accuracy: string (nullable = true)
 |-- % Goals-to-shots: string (nullable = true)
 |-- Total shots (inc. Blocked): integer (nullable = true)
 |-- Hit Woodwork: integer (nullable = true)
 |-- Penalty goals: integer (nullable = true)
 |-- Penalties not scored: integer (nullable = true)
 |-- Headed goals: integer (nullable = true)
 |-- Passes: integer (nullable = true)
 |-- Passes completed: integer (nullable = true)
 |-- Passing Accuracy: string (nullable = true)
 |-- Touches: integer (nullable = true)
 |-- Crosses: integer (nullable = true)
 |-- Dribbles: integer (nullable = true)
 |-- Corners Taken: integer (nullable = true)
 |-- Tackles: integer (nullable = true)
 |-- Clearances: integer (nullable = true)
 |-- Interceptions: integer (nullable = true)
 |-- Clearances off line: integer (nullable = t

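### Step 3. Assign it to a variable called euro12.

The DataFrame loaded in Step 2 is named `df`, and the remaining steps keep using that name; `euro12 = df` would bind the requested name to the same DataFrame.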

### Step 4. Select only the Goals column.

In [6]:
df.select("Goals").show()

+-----+
|Goals|
+-----+
|    4|
|    4|
|    4|
|    5|
|    3|
|   10|
|    5|
|    6|
|    2|
|    2|
|    6|
|    1|
|    5|
|   12|
|    5|
|    2|
+-----+



### Step 5. How many teams participated in Euro 2012?

In [7]:
df.select("Team").distinct().count()

16

### Step 6. What is the number of columns in the dataset?

In [8]:
len(df.columns)

35
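
Because `df.columns` is an ordinary Python list of names, selecting columns by position comes down to list slicing. The sketch below runs without a Spark session: it uses a hand-copied list of the 35 column names (taken from the header row printed in this notebook's outputs) to show which names the slices `[:7]` and `[:-3]` keep. The variable names are illustrative, not part of the exercise.

```python
# Column names copied from the header row shown in this notebook's outputs.
columns = [
    "Team", "Goals", "Shots on target", "Shots off target",
    "Shooting Accuracy", "% Goals-to-shots", "Total shots (inc. Blocked)",
    "Hit Woodwork", "Penalty goals", "Penalties not scored", "Headed goals",
    "Passes", "Passes completed", "Passing Accuracy", "Touches", "Crosses",
    "Dribbles", "Corners Taken", "Tackles", "Clearances", "Interceptions",
    "Clearances off line", "Clean Sheets", "Blocks", "Goals conceded",
    "Saves made", "Saves-to-shots ratio", "Fouls Won", "Fouls Conceded",
    "Offsides", "Yellow Cards", "Red Cards", "Subs on", "Subs off",
    "Players Used",
]

first_seven = columns[:7]          # the first 7 names
all_but_last_three = columns[:-3]  # every name except the final 3

print(len(columns))    # number of columns
print(first_seven)     # names kept by [:7]
print(columns[-3:])    # names dropped by [:-3]
```

With a live DataFrame the same slices can be handed to `select`, with one caveat: a name that contains a dot, such as `Total shots (inc. Blocked)`, has to be wrapped in backticks there, otherwise Spark tries to read it as a nested field.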

### Step 7. View only the columns Team, Yellow Cards and Red Cards and assign them to a dataframe called discipline

In [9]:
discipline = df.select("Team", "Yellow Cards", "Red Cards")
discipline.show()

+-------------------+------------+---------+
|               Team|Yellow Cards|Red Cards|
+-------------------+------------+---------+
|            Croatia|           9|        0|
|     Czech Republic|           7|        0|
|            Denmark|           4|        0|
|            England|           5|        0|
|             France|           6|        0|
|            Germany|           4|        0|
|             Greece|           9|        1|
|              Italy|          16|        0|
|        Netherlands|           5|        0|
|             Poland|           7|        1|
|           Portugal|          12|        0|
|Republic of Ireland|           6|        1|
|             Russia|           6|        0|
|              Spain|          11|        0|
|             Sweden|           7|        0|
|            Ukraine|           5|        0|
+-------------------+------------+---------+



### Step 8. Sort the teams by Red Cards, then by Yellow Cards

In [10]:
discipline.sort("Red Cards", "Yellow Cards").show()

+-------------------+------------+---------+
|               Team|Yellow Cards|Red Cards|
+-------------------+------------+---------+
|            Denmark|           4|        0|
|            Germany|           4|        0|
|        Netherlands|           5|        0|
|            Ukraine|           5|        0|
|            England|           5|        0|
|             France|           6|        0|
|             Russia|           6|        0|
|     Czech Republic|           7|        0|
|             Sweden|           7|        0|
|            Croatia|           9|        0|
|              Spain|          11|        0|
|           Portugal|          12|        0|
|              Italy|          16|        0|
|Republic of Ireland|           6|        1|
|             Poland|           7|        1|
|             Greece|           9|        1|
+-------------------+------------+---------+
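
A two-column sort orders rows by the first key and uses the second key only to break ties in the first. As a rough cross-check that needs no Spark session, the sketch below applies the same rule to the rows of the `discipline` table using plain Python tuples; the tuple layout and variable names are assumptions made for this illustration.

```python
# (team, yellow_cards, red_cards) rows copied from the discipline table.
discipline_rows = [
    ("Croatia", 9, 0), ("Czech Republic", 7, 0), ("Denmark", 4, 0),
    ("England", 5, 0), ("France", 6, 0), ("Germany", 4, 0),
    ("Greece", 9, 1), ("Italy", 16, 0), ("Netherlands", 5, 0),
    ("Poland", 7, 1), ("Portugal", 12, 0), ("Republic of Ireland", 6, 1),
    ("Russia", 6, 0), ("Spain", 11, 0), ("Sweden", 7, 0), ("Ukraine", 5, 0),
]

# Sort by red cards first, then by yellow cards among equal red-card counts.
ordered = sorted(discipline_rows, key=lambda row: (row[2], row[1]))

for team, yellow, red in ordered:
    print(f"{team:<20}{yellow:>3}{red:>3}")
```

Python's `sorted` is stable, so rows that tie on both keys keep their input order. Spark does not guarantee any particular order for such ties, which is why England, Netherlands and Ukraine (5 yellow cards, 0 red cards each) may come out in a different order in the Spark output.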



### Step 9. Calculate the mean Yellow Cards given per Team

In [12]:
df.select(f.mean("Yellow Cards")).show()

+-----------------+
|avg(Yellow Cards)|
+-----------------+
|           7.4375|
+-----------------+
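
The reported mean can be checked by hand: the 16 Yellow Cards values shown in Step 7 sum to 119, and $119 / 16 = 7.4375$. A small plain-Python sketch of that arithmetic, with the values copied from the `discipline` table (no Spark session needed):

```python
from statistics import fmean

# Yellow Cards values copied from the discipline table, one per team.
yellow_cards = [9, 7, 4, 5, 6, 4, 9, 16, 5, 7, 12, 6, 6, 11, 7, 5]

total = sum(yellow_cards)          # sum of all yellow cards
mean_yellow = fmean(yellow_cards)  # arithmetic mean as a float

print(total, len(yellow_cards), mean_yellow)
```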



### Step 10. Filter teams that scored more than 6 goals

In [14]:
df.filter(f.col("Goals") > 6).show()

+-------+-----+---------------+----------------+-----------------+----------------+--------------------------+------------+-------------+--------------------+------------+------+----------------+----------------+-------+-------+--------+-------------+-------+----------+-------------+-------------------+------------+------+--------------+----------+--------------------+---------+--------------+--------+------------+---------+-------+--------+------------+
|   Team|Goals|Shots on target|Shots off target|Shooting Accuracy|% Goals-to-shots|Total shots (inc. Blocked)|Hit Woodwork|Penalty goals|Penalties not scored|Headed goals|Passes|Passes completed|Passing Accuracy|Touches|Crosses|Dribbles|Corners Taken|Tackles|Clearances|Interceptions|Clearances off line|Clean Sheets|Blocks|Goals conceded|Saves made|Saves-to-shots ratio|Fouls Won|Fouls Conceded|Offsides|Yellow Cards|Red Cards|Subs on|Subs off|Players Used|
+-------+-----+---------------+----------------+-----------------+----------------
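
The condition `Goals > 6` is a strict inequality, so a team with exactly 6 goals is left out. The plain-Python sketch below illustrates this with the Goals values from Step 4 paired with the team names from Step 7. It assumes both outputs list the rows in the same order, which the notebook does not state explicitly.

```python
# Team names (Step 7 output) and Goals values (Step 4 output),
# assumed to be listed in the same row order.
teams = [
    "Croatia", "Czech Republic", "Denmark", "England", "France", "Germany",
    "Greece", "Italy", "Netherlands", "Poland", "Portugal",
    "Republic of Ireland", "Russia", "Spain", "Sweden", "Ukraine",
]
goals = [4, 4, 4, 5, 3, 10, 5, 6, 2, 2, 6, 1, 5, 12, 5, 2]

goals_by_team = dict(zip(teams, goals))

# Keep only teams with strictly more than 6 goals.
high_scorers = [team for team, scored in goals_by_team.items() if scored > 6]

print(high_scorers)
```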

### Step 11. Select the teams that start with G

In [16]:
df.filter(f.col("Team").startswith("G")).show()

+-------+-----+---------------+----------------+-----------------+----------------+--------------------------+------------+-------------+--------------------+------------+------+----------------+----------------+-------+-------+--------+-------------+-------+----------+-------------+-------------------+------------+------+--------------+----------+--------------------+---------+--------------+--------+------------+---------+-------+--------+------------+
|   Team|Goals|Shots on target|Shots off target|Shooting Accuracy|% Goals-to-shots|Total shots (inc. Blocked)|Hit Woodwork|Penalty goals|Penalties not scored|Headed goals|Passes|Passes completed|Passing Accuracy|Touches|Crosses|Dribbles|Corners Taken|Tackles|Clearances|Interceptions|Clearances off line|Clean Sheets|Blocks|Goals conceded|Saves made|Saves-to-shots ratio|Fouls Won|Fouls Conceded|Offsides|Yellow Cards|Red Cards|Subs on|Subs off|Players Used|
+-------+-----+---------------+----------------+-----------------+----------------

### Step 12. Select the first 7 columns

In [33]:
# limit(7) would keep the first 7 rows; slice the column list instead (backticks protect the "." in one name)
df.select([f"`{c}`" for c in df.columns[:7]]).show()

+-------------------+-----+---------------+----------------+-----------------+----------------+--------------------------+
|               Team|Goals|Shots on target|Shots off target|Shooting Accuracy|% Goals-to-shots|Total shots (inc. Blocked)|
+-------------------+-----+---------------+----------------+-----------------+----------------+--------------------------+

### Step 13. Select all columns except the last 3.

In [31]:
# limit(df.count()-3) would drop the last 3 rows; drop the last 3 columns by name instead
df.drop(*df.columns[-3:]).show()

+-------------------+-----+---------------+----------------+-----------------+----------------+--------------------------+------------+-------------+--------------------+------------+------+----------------+----------------+-------+-------+--------+-------------+-------+----------+-------------+-------------------+------------+------+--------------+----------+--------------------+---------+--------------+--------+------------+---------+
|               Team|Goals|Shots on target|Shots off target|Shooting Accuracy|% Goals-to-shots|Total shots (inc. Blocked)|Hit Woodwork|Penalty goals|Penalties not scored|Headed goals|Passes|Passes completed|Passing Accuracy|Touches|Crosses|Dribbles|Corners Taken|Tackles|Clearances|Interceptions|Clearances off line|Clean Sheets|Blocks|Goals conceded|Saves made|Saves-to-shots ratio|Fouls Won|Fouls Conceded|Offsides|Yellow Cards|Red Cards|
+-------------------+-----+---------------+----------------+-----------------+----------------+--------------------------+------------+-------------+--------------------+------------+------+----------------+----------------+-------+-------+--------+-------------+-------+----------+-------------+-------------------+------------+------+--------------+----------+--------------------+---------+--------------+--------+------------+---------+

### Step 14. Present only the Shooting Accuracy from England, Italy and Russia

In [35]:
df.select("Team", "Shooting Accuracy").filter(f.col("Team").isin("England", "Italy","Russia")).show()

+-------+-----------------+
|   Team|Shooting Accuracy|
+-------+-----------------+
|England|            50.0%|
|  Italy|            43.0%|
| Russia|            22.5%|
+-------+-----------------+
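
The schema printed in Step 2 lists Shooting Accuracy as a string, presumably because of the trailing `%`, so the values cannot be compared numerically as they stand. A minimal plain-Python sketch of the conversion, using the three values shown above; the helper name `parse_percent` is hypothetical:

```python
# Shooting Accuracy strings copied from the output above.
shooting_accuracy = {"England": "50.0%", "Italy": "43.0%", "Russia": "22.5%"}


def parse_percent(text: str) -> float:
    """Convert a string such as '50.0%' to the float 50.0."""
    return float(text.strip().rstrip("%"))


as_numbers = {team: parse_percent(value) for team, value in shooting_accuracy.items()}

# With numeric values, comparisons such as "most accurate team" become possible.
most_accurate = max(as_numbers, key=as_numbers.get)

print(as_numbers)
print(most_accurate)
```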

