In this example we will do some _Exploratory Data Analysis_ (EDA) using data on players from the 2010 World Cup.

The data frame contains 595 observations on the following variables:

Variable      | Description
--------------|--------------------------------------------------------
`Player`      | Player's last name
`Team`        | Country
`Position`    | Position on the field; a factor (levels: `Defender`, `Forward`, `Goalkeeper`, `Midfielder`)
`Time`        | Time played in minutes
`Shots`       | Number of shots attempted
`Passes`      | Number of passes made
`Tackles`     | Number of tackles made
`Saves`       | Number of saves made


1. Load the `tidyverse` package


In [1]:
library(tidyverse)



── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


2. Get the data:



In [2]:
fifa_link <- "https://raw.githubusercontent.com/reisanar/datasets/master/worldcup.csv"
fifa10 <- read_csv(fifa_link)


Rows: 595 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Player, Team, Position
dbl (5): Time, Shots, Passes, Tackles, Saves

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


3. Learn about the nature of the data



In [3]:
glimpse(fifa10)



Rows: 595
Columns: 8
$ Player   <chr> "Abdoun", "Abe", "Abidal", "Abou Diaby", "Aboubakar", "Abreu"…
$ Team     <chr> "Algeria", "Japan", "France", "France", "Cameroon", "Uruguay"…
$ Position <chr> "Midfielder", "Midfielder", "Defender", "Midfielder", "Forwar…
$ Time     <dbl> 16, 351, 180, 270, 46, 72, 138, 33, 21, 103, 270, 55, 106, 27…
$ Shots    <dbl> 0, 0, 0, 1, 2, 0, 0, 0, 5, 0, 2, 0, 2, 1, 1, 1, 5, 9, 0, 1, 0…
$ Passes   <dbl> 6, 101, 91, 111, 16, 15, 51, 9, 22, 38, 120, 31, 57, 123, 172…
$ Tackles  <dbl> 0, 14, 6, 5, 0, 0, 2, 0, 0, 1, 10, 2, 2, 11, 13, 7, 3, 18, 0,…
$ Saves    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
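
Note that `read_csv()` imports `Position` as a character column (`<chr>`), while the variable table describes it as a factor. The sketch below shows one possible way to convert it and count the players in each position. The name `fifa10_fct` is introduced here only for illustration, and the data is reloaded only if `fifa10` does not already exist in the session.

```r
library(tidyverse)

# Reuse fifa10 from step 2; reload it only when this block is run on its own
if (!exists("fifa10")) {
  fifa10 <- read_csv(
    "https://raw.githubusercontent.com/reisanar/datasets/master/worldcup.csv",
    show_col_types = FALSE
  )
}

# Convert Position from character to factor (levels are sorted alphabetically)
fifa10_fct <- fifa10 |>
  mutate(Position = factor(Position))

# Inspect the levels and count the players in each position
levels(fifa10_fct$Position)
count(fifa10_fct, Position)
```

Keeping `Position` as a character column also works for the plots in this example, since `ggplot2` treats character variables as discrete.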


In [10]:
fifa10 |>
  filter(Team == "Spain") |>  # keep only players from Spain
  slice_sample(n = 3)         # draw 3 rows at random (supersedes sample_n())

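Player          | Team  | Position   | Time    | Shots   | Passes  | Tackles | Saves
----------------|-------|------------|---------|---------|---------|---------|--------
`<chr>`         | `<chr>` | `<chr>`  | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>`
Casillas        | Spain | Goalkeeper | 540     | 0       | 67      | 0       | 11
TorresF         | Spain | Forward    | 278     | 9       | 50      | 0       | 0
Pedro Rodriguez | Spain | Midfielder | 116     | 5       | 80      | 0       | 0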


**Challenge**

- Use the `filter()` function to find the names of at least 3 players from Spain's national team during the 2010 World Cup.

## Boxplot 

4. We can create a boxplot that shows how the number of `Passes` made by a player (vertical axis) varies with his `Position` on the field (horizontal axis).
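
The plotting code for this step is not shown here, so the following is a minimal sketch of one way to draw the boxplot with `ggplot2`. The object name `p_passes` and the axis labels are illustrative choices, and the data is reloaded only if `fifa10` is not already in the session.

```r
library(tidyverse)

# Reuse fifa10 from step 2; reload it only when this block is run on its own
if (!exists("fifa10")) {
  fifa10 <- read_csv(
    "https://raw.githubusercontent.com/reisanar/datasets/master/worldcup.csv",
    show_col_types = FALSE
  )
}

# One box per Position, summarizing the distribution of Passes
p_passes <- ggplot(fifa10, aes(x = Position, y = Passes)) +
  geom_boxplot() +
  labs(x = "Position", y = "Number of passes")

p_passes
```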


**Challenge**

- What do you notice from the above boxplots?

- Can you create a boxplot that shows how the number of `Shots` by a player (vertical axis) varies with his `Position` on the field (horizontal axis)?

- What do you notice?



## Scatterplot

5. To understand the relationship between the number of passes and the number of tackles, let us use a _scatter plot_ with `Tackles` on the horizontal axis and `Passes` on the vertical axis. Color the points by `Position`.
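
A possible version of this scatter plot is sketched below, assuming `geom_point()` with `Position` mapped to the color aesthetic. The object name `p_tackles` is hypothetical, and the data is reloaded only if `fifa10` is not already in the session.

```r
library(tidyverse)

# Reuse fifa10 from step 2; reload it only when this block is run on its own
if (!exists("fifa10")) {
  fifa10 <- read_csv(
    "https://raw.githubusercontent.com/reisanar/datasets/master/worldcup.csv",
    show_col_types = FALSE
  )
}

# Tackles on the x-axis, Passes on the y-axis, one color per Position
p_tackles <- ggplot(fifa10, aes(x = Tackles, y = Passes, color = Position)) +
  geom_point()

p_tackles
```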


**Challenge**

- Do you notice anything in particular about those players in `Goalkeeper` and `Forward` positions?



6. We can also check the relationship between the number of minutes played in the tournament (on the horizontal axis) and the number of passes made by the player (on the vertical axis). Color the points by `Position`.
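
The same approach can be sketched for minutes played versus passes. The object name `p_time` and the axis labels are illustrative, and the data is reloaded only if `fifa10` is not already in the session.

```r
library(tidyverse)

# Reuse fifa10 from step 2; reload it only when this block is run on its own
if (!exists("fifa10")) {
  fifa10 <- read_csv(
    "https://raw.githubusercontent.com/reisanar/datasets/master/worldcup.csv",
    show_col_types = FALSE
  )
}

# Time (minutes played) on the x-axis, Passes on the y-axis, colored by Position
p_time <- ggplot(fifa10, aes(x = Time, y = Passes, color = Position)) +
  geom_point() +
  labs(x = "Time played (minutes)", y = "Number of passes")

p_time
```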


**Challenge**

- Do you notice any particular _structure/pattern_ in the graph above? 


7. We can even add the variable `Shots` to the previous plot and size the points by the number of shots per player. Use the option `size = Shots` in the aesthetics.
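
One way to add `Shots` as a size aesthetic is sketched below. The object name `p_shots` and the `alpha = 0.6` transparency, added here only to reduce overplotting, are assumptions rather than part of the original exercise. The data is reloaded only if `fifa10` is not already in the session.

```r
library(tidyverse)

# Reuse fifa10 from step 2; reload it only when this block is run on its own
if (!exists("fifa10")) {
  fifa10 <- read_csv(
    "https://raw.githubusercontent.com/reisanar/datasets/master/worldcup.csv",
    show_col_types = FALSE
  )
}

# Same plot as step 6, with point size proportional to the number of shots
p_shots <- ggplot(
  fifa10,
  aes(x = Time, y = Passes, color = Position, size = Shots)
) +
  geom_point(alpha = 0.6)  # semi-transparent points (illustrative choice)

p_shots
```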


**Challenge**

- Can you find the names of any _outstanding_ players based on the graph above?
