## An R Markdown file to scrape the data from [IMO official website](https://www.imo-official.org/)
### General Idea

First we need the link to the problem statistics, with the year as a variable, so that we can scrape
the tables and turn each one into a data frame. Then we clean each table individually, and lastly
combine them all and save the result as a .csv file for later cleaning
and analysis.

In [2]:
#Loading rvest (for scraping) and the tidyverse
library(rvest)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()         masks stats::filter()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag()            masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conf

### Scraping the first table

We need to establish a first table as a basis that all the cleaned tables get appended to.
We start by using the link and a few functions to do so.

In [3]:
#declaring a url variable
url <- 'https://www.imo-official.org/year_statistics.aspx?year=1985'

#using pipes to load the table into the data frame dummy variable df
read_html(url) %>% html_table() %>% data.frame() -> df

#removing extra rows
df <- df[-c(9:20),]

#adding a name to the first column, since it doesn't have one in the webpage
colnames(df)[1] = "Problem Number"

#adding the year to the table
df <- df %>% mutate(Problem_year = 1985)

#our log super table, establishing log as the table where all the other tables will
#get parsed
log = df
log

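| | Problem Number | P1 | P2 | P3 | P4 | P5 | P6 | Problem_year |
|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | Num( P# = 0 ) | 32 | 60 | 153 | 61 | 113 | 57 | 1985 |
| 2 | Num( P# = 1 ) | 46 | 27 | 27 | 28 | 21 | 74 | 1985 |
| 3 | Num( P# = 2 ) | 11 | 8 | 8 | 46 | 20 | 19 | 1985 |
| 4 | Num( P# = 3 ) | 9 | 9 | 5 | 18 | 7 | 13 | 1985 |
| 5 | Num( P# = 4 ) | 4 | 3 | 1 | 16 | 7 | 9 | 1985 |
| 6 | Num( P# = 5 ) | 2 | 4 | 0 | 6 | 5 | 10 | 1985 |
| 7 | Num( P# = 6 ) | 2 | 6 | 3 | 2 | 1 | 4 | 1985 |
| 8 | Num( P# = 7 ) | 103 | 92 | 12 | 32 | 35 | 23 | 1985 |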
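To see what `html_table()` returns without touching the network, here is a minimal sketch that parses a tiny inline page. The numbers in the two `Num(` rows are copied from the output above, while the `Mean( P# )` row is a made-up stand-in for the extra rows that get dropped. Selecting rows by the `Num(` prefix instead of by position is an assumption of this sketch, not what the cell above does.

```r
library(rvest)

# A tiny stand-in page; the "Mean( P# )" row is hypothetical filler
page <- minimal_html('
  <table>
    <tr><th></th><th>P1</th><th>P2</th></tr>
    <tr><td>Num( P# = 0 )</td><td>32</td><td>60</td></tr>
    <tr><td>Num( P# = 1 )</td><td>46</td><td>27</td></tr>
    <tr><td>Mean( P# )</td><td>4.9</td><td>4.1</td></tr>
  </table>
')

# html_table() on a whole page returns a list with one tibble per <table>
tables <- html_table(page)
demo <- as.data.frame(tables[[1]])

# The first header cell is empty on the page, so name it by position
colnames(demo)[1] <- "Problem Number"

# Assumption: the rows we want are exactly those starting with "Num("
demo <- demo[grepl("^Num\\(", demo[["Problem Number"]]), ]

# Tag every remaining row with its year
demo$Problem_year <- 1985
demo
```

Filtering by label keeps working if a page has a different number of extra rows, whereas `df[-c(9:20),]` relies on the wanted rows always being the first eight.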


### Looping through the years

Now all we have to do is use a for-loop to repeat the process and append
the table of every year to the log super-table. This can be achieved as follows:

In [5]:
#creating a list of years to be scraped 
Dates <- c(1986:2023)

#start of for-loop
for (year in Dates){
  
  #pasting the year as a string to the end of each URL to get the desired webpage
  url <- paste('https://www.imo-official.org/year_statistics.aspx?year=',toString(year),sep ="")
  
  #using the df dummy variable to store the scraped raw data
  url %>% read_html %>% html_table() %>% data.frame() -> df
  
  #removing unwanted rows
  df <- df[-c(9:20),]
  
  #filling the empty column name
  colnames(df)[1] = "Problem Number"
  
  #adding the year as an extra column to make later cleaning easier
  df <- df %>% mutate(Problem_year = year)
  
  #adding the dummy variable df's table to the log super-table, and preparing to
  #repeat the loop
  log <- bind_rows(log, df)

}
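The same loop can also be written as a pair of small functions, which removes the need to scrape 1985 separately and lets one failed download be skipped instead of stopping the run. This is only a sketch: `clean_year_table()` and `scrape_year()` are hypothetical helper names, the one-second pause and the choice of the first table on the page are assumptions, and the demonstration at the bottom uses made-up raw tables so that nothing is downloaded.

```r
# Hypothetical helper: name the first column, keep the "Num( ... )" rows,
# and add the year (row selection by prefix is an assumption of this sketch)
clean_year_table <- function(raw, year) {
  colnames(raw)[1] <- "Problem Number"
  kept <- raw[grepl("^Num\\(", raw[["Problem Number"]]), , drop = FALSE]
  kept$Problem_year <- year
  rownames(kept) <- NULL
  kept
}

# Hypothetical helper: download and clean one year, returning NULL on failure
scrape_year <- function(year) {
  url <- paste0("https://www.imo-official.org/year_statistics.aspx?year=", year)
  tryCatch({
    Sys.sleep(1)  # assumed pause between requests, to be gentle on the server
    raw <- as.data.frame(rvest::html_table(rvest::read_html(url))[[1]])
    clean_year_table(raw, year)
  }, error = function(e) {
    message("Skipping ", year, ": ", conditionMessage(e))
    NULL
  })
}

# Real run (not executed here):
# log <- do.call(rbind, Filter(Negate(is.null), lapply(1985:2023, scrape_year)))

# Offline demonstration with made-up raw tables
raw_a <- data.frame(V1 = c("Num( P# = 0 )", "Num( P# = 7 )", "Mean( P# )"),
                    P1 = c(32, 103, 4.9))
raw_b <- data.frame(V1 = c("Num( P# = 0 )", "Num( P# = 7 )", "Mean( P# )"),
                    P1 = c(10, 50, 3.2))

# The NULL stands in for a year whose download failed
tables <- list(clean_year_table(raw_a, 1985), NULL, clean_year_table(raw_b, 1986))
all_years <- do.call(rbind, Filter(Negate(is.null), tables))
all_years
```

Collecting the cleaned tables in a list and binding them once at the end also avoids copying the growing data frame on every iteration.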

### Saving the results as .csv

The analysis and visualizations are available on [Tableau](https://public.tableau.com/views/InternationalMathematicalOlympiadDataCategoryandDifficulty/Dashboard1?:language=en-GB&:sid=&:display_count=n&:origin=viz_share_link).

In [7]:
write.csv(log, "Name_of_your_file.csv")
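By default `write.csv()` also writes the row names as an extra unnamed first column. If that column is not wanted in the later analysis, a possible variation is to pass `row.names = FALSE`. The sketch below uses a small made-up stand-in for `log` and a temporary file, so the file name and data are assumptions for illustration only.

```r
# Small stand-in for the log table (two rows copied from the 1985 output)
demo_log <- data.frame(
  `Problem Number` = c("Num( P# = 0 )", "Num( P# = 1 )"),
  P1 = c(32, 46),
  Problem_year = 1985,
  check.names = FALSE  # keep the space in "Problem Number"
)

# Write to a temporary file without the row-name column
out_file <- tempfile(fileext = ".csv")
write.csv(demo_log, out_file, row.names = FALSE)

# Read it back; check.names = FALSE preserves the original column names
read_back <- read.csv(out_file, check.names = FALSE)
read_back
```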

In [1]:
#Python code
from IPython.display import display, HTML
display(HTML("rrrr.html"))