## REI 603M - Assignment 2
### Andri Freyr Viðarsson

This notebook collects data on the top divisions of Icelandic football from the website of the Icelandic football association, https://www.ksi.is. The scraping is performed with the Gazpacho library, and the data is exported as CSV files. The association has kept data on the men's league since the 1910s and on the women's league since its establishment in the 1970s. More extensive data is available for seasons played after the mid-1980s.

Each row in the exported datasets corresponds to a single team in a given year. The datasets have 18 columns:

*team_name*: Team name<br />
*year*: Year<br />
*place*: Spot in the league table at end of season<br />
*played_games*: Number of games that the team played in the season<br />
*n_wins*: Number of games won<br />
*n_ties*: Number of ties<br />
*n_losses*: Number of losses<br />
*goals_scored*: Number of goals scored<br />
*goals_against*: Number of goals conceded<br />
*goal_difference*: goals_scored - goals_against<br />
*points*: Number of points<br />
*points_3point_scale*: Points on the 3-point scale (3 points for a win, 1 for a draw), first used in 1984<br />
*teams_in_league*: Number of participating teams<br />
*url*: URL for team info site<br />
*n_yellow_cards*: Number of yellow cards<br />
*n_red_cards*: Number of red cards<br />
*avg_weighted_age*: Average age of players in the team, weighted by played minutes<br />
*avg_home_attendance*: Average attendance at home games
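
The two derived columns follow directly from the raw counts. As a minimal sketch (the helper name `derived_columns` is hypothetical and not part of the scraper), the check below recomputes them for FH in 2006 and Haukar in 1979, using the figures that appear in the outputs of this notebook:

```python
def derived_columns(n_wins, n_ties, goals_scored, goals_against):
    """Recompute goal_difference and points_3point_scale from raw counts."""
    return {
        'goal_difference': goals_scored - goals_against,
        'points_3point_scale': 3 * n_wins + n_ties,
    }

# Figures taken from the notebook outputs for FH (2006) and Haukar (1979).
fh_2006 = derived_columns(n_wins=10, n_ties=6, goals_scored=31, goals_against=14)
haukar_1979 = derived_columns(n_wins=1, n_ties=3, goals_scored=12, goals_against=45)
print(fh_2006)
print(haukar_1979)
```

For Haukar in 1979 the recorded *points* value is 5 while *points_3point_scale* is 6, which would be consistent with two points per win before the 3-point scale was introduced.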

In [1]:
from gazpacho import get, Soup
import pandas as pd
import time
import re

In [2]:
base_url = 'https://www.ksi.is'

Here we get the URL of the league table for the latest season. Pass 'm' for the men's league or 'f' for the women's league. The scraping starts from this URL.

In [3]:
def get_start_url(gender):
    # params: gender: m or f, m for male league, f for female
    # returns: url for most recent tournament for the gender
    index = 1 if gender == 'm' else 2
    html = get(base_url)
    soup = Soup(html)
    nav = soup.find('nav', {'class': 'main-nav'})
    nav_column = nav.find('div', {'class':'col-md-3'})[1]
    ul = nav_column.find('ul')[index] # 1 for male, 2 for female
    li = ul.find('li')[0]
    a = li.find('a').attrs
    if 'https:' in a['href']: # href is sometimes already an absolute URL
        return a['href']
    start_url = base_url + a['href']
    return start_url
    
get_start_url('m')   

'https://www.ksi.is/mot/stakt-mot/?motnumer=43801'
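
The mix of relative and absolute `href` values could also be handled with the standard library's `urllib.parse.urljoin`, which leaves absolute URLs untouched and resolves relative ones against the base. This is only a sketch of an alternative, not what the function above does; the two `href` strings are built from the URL shown in the output.

```python
from urllib.parse import urljoin

base_url = 'https://www.ksi.is'

# Relative href: resolved against the base URL.
relative = urljoin(base_url, '/mot/stakt-mot/?motnumer=43801')
# Absolute href: returned unchanged.
absolute = urljoin(base_url, 'https://www.ksi.is/mot/stakt-mot/?motnumer=43801')

print(relative)
print(absolute)
```

Both calls yield the same URL, so the explicit `'https:' in a['href']` check would no longer be needed.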

From the latest league table we gather the URLs of the league tables for all previous years.

In [4]:
def get_tournament_urls(start_url):
    # get url for league table for all previous years 
    html = get(start_url)
    soup = Soup(html)
    dropdown = soup.find('ul', {'class':'dropdown-menu'})[1].find('li')[1:]
    urls = []
    for i in dropdown:
        href = i.find('a').attrs['href']
        year = i.find('a').text.lower()
        if 'riðill' not in year and 'úrslit' not in year and 'auka' not in year:
            urls.append({'year':int(year[:4]), 'url': base_url+href})
    return urls
get_tournament_urls(get_start_url('m'))

[{'year': 2022,
  'url': 'https://www.ksi.is/mot/stakt-mot/$TournamentDetails/Table/?motnumer=43525'},
 {'year': 2022,
  'url': 'https://www.ksi.is/mot/stakt-mot/$TournamentDetails/Table/?motnumer=44701'},
 {'year': 2021,
  'url': 'https://www.ksi.is/mot/stakt-mot/$TournamentDetails/Table/?motnumer=42373'},
 {'year': 2020,
  'url': 'https://www.ksi.is/mot/stakt-mot/$TournamentDetails/Table/?motnumer=40928'},
 {'year': 2019,
  'url': 'https://www.ksi.is/mot/stakt-mot/$TournamentDetails/Table/?motnumer=38296'},
 {'year': 2018,
  'url': 'https://www.ksi.is/mot/stakt-mot/$TournamentDetails/Table/?motnumer=37403'},
 {'year': 2017,
  'url': 'https://www.ksi.is/mot/stakt-mot/$TournamentDetails/Table/?motnumer=36622'},
 {'year': 2016,
  'url': 'https://www.ksi.is/mot/stakt-mot/$TournamentDetails/Table/?motnumer=35586'},
 {'year': 2015,
  'url': 'https://www.ksi.is/mot/stakt-mot/$TournamentDetails/Table/?motnumer=33503'},
 {'year': 2014,
  'url': 'https://www.ksi.is/mot/stakt-mot/$TournamentDet
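
The link-text filter in `get_tournament_urls` can be illustrated in isolation. The sketch below applies the same three exclusion words and the same `int(label[:4])` year parsing to a handful of labels; the labels themselves are hypothetical and only assume that each dropdown entry starts with a four-digit year.

```python
# Words that mark group stages, finals and extra matches (from the filter above).
EXCLUDED = ('riðill', 'úrslit', 'auka')

def keep_entry(label):
    """Return True if the dropdown label looks like a regular league table."""
    label = label.lower()
    return not any(word in label for word in EXCLUDED)

def parse_year(label):
    """The year is assumed to be the first four characters of the label."""
    return int(label[:4])

# Hypothetical dropdown labels, not copied from the site.
labels = ['2021 Efsta deild', '2019 A riðill', '2018 Úrslit', '2017 Efsta deild']
kept = [label for label in labels if keep_entry(label)]
years = [parse_year(label) for label in kept]
print(kept)
print(years)
```

The filter does not deduplicate years: in the output above, 2022 appears twice with different `motnumer` values.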

Given a team_url and team_name we get average home attendance for the team by going through each match report linked to the team's home games and collecting the attendance numbers.

In [5]:
def get_avg_attendance(team_url, team_name):
    attendance_ls = []
    html = get(team_url)
    soup = Soup(html)
    games_url = base_url + soup.find('div', {'class': 'btn-group pull-right'}).find('a')[2].attrs['href']
    
    html_games = get(games_url)
    soup_games = Soup(html_games)
    results = soup_games.find('div', {'class': 'results-area'}).find('table', {'class': 'games-list'})
    for r in results:
        home_team = r.find('ul', {'class': 'list type2'}).find('li')[0].find('span').text
        if home_team == team_name:
            match_report = r.find('ul', {'class':'img-list'}).find('li')[-1].find('a').attrs['href']
            game_url = base_url + match_report
            n_fans = get_attendance(game_url)
            if n_fans:
                attendance_ls.append(n_fans)
    if len(attendance_ls) > 0:
        return sum(attendance_ls)/len(attendance_ls)
    return None
        
def get_attendance(game_url):
    html = get(game_url)
    soup = Soup(html)
    info = soup.find('div', {'class': 'report-head'}).find('p').find('span')
    try:
        att_str = re.findall(r'\(.*?\)', str(info))[0]
        attendance = int(att_str[1:-1].split(':')[1])
        return attendance
    except (IndexError, ValueError):  # no parenthesised "label: number" found
        return None
    

get_avg_attendance('https://www.ksi.is/mot/stakt-mot/lid-i-moti/?motnumer=40928&lid=101', 'Valur')

680.0
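
The parsing step inside `get_attendance` can be tried without any network access. The sketch below follows the same idea (take the first parenthesised fragment, split on `:`, convert the second part to an integer). The helper name `parse_attendance` and the sample strings are hypothetical; the real markup of the match reports may differ.

```python
import re

def parse_attendance(text):
    """Extract an integer from the first '(label: number)' fragment, else None."""
    match = re.search(r'\(([^)]*)\)', text)
    if match is None:
        return None
    parts = match.group(1).split(':')
    try:
        # int() tolerates surrounding whitespace, e.g. ' 680'.
        return int(parts[1])
    except (IndexError, ValueError):
        return None

# Hypothetical snippets standing in for the report header.
with_number = parse_attendance('<span>Hypothetical stadium (Áhorfendur: 680)</span>')
without_number = parse_attendance('<span>No attendance listed</span>')
empty_number = parse_attendance('<span>Hypothetical stadium (Áhorfendur: )</span>')
print(with_number, without_number, empty_number)
```

Returning `None` for reports without a usable number matches how `get_avg_attendance` skips those games when averaging.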

Here we collect data from the league table and additional data from team pages.

In [6]:
def get_league_table(url, year):
    html = get(url)
    soup = Soup(html)
    table = soup.find('table', {'class':'situation-list'})
    body = table.find('tbody')
    rows = body.find('tr')
    n_teams = len(rows)
    teams_info = []
    for r in rows:
        info = r.find('td')[0]
        team_name = info.find('span', {'class':'name'}).text
        team_stats = r.find('td')[1:8]
        goals_scored = int(team_stats[4].text.split(':')[0])
        goals_against = int(team_stats[4].text.split(':')[1])
        goal_difference = goals_scored - goals_against
        
        team_url = base_url + info.find('a').attrs['href']
        try:
            n_yellows, n_reds, avg_weighted_age = get_team_info(team_url, year)
            avg_home_attendance = get_avg_attendance(team_url, team_name)
        except Exception:  # missing team data; still lets Ctrl-C interrupt the scrape
            n_yellows, n_reds, avg_weighted_age = None, None, None
            avg_home_attendance = None
           
        teams_info.append({
            'team_name': team_name,
            'year':year,
            'place': int(info.find('span', {'class':'num'}).text),
            'played_games': int(team_stats[0].text),
            'n_wins': int(team_stats[1].text),
            'n_ties': int(team_stats[2].text),
            'n_losses': int(team_stats[3].text),
            'goals_scored': goals_scored,
            'goals_against': goals_against,
            'goal_difference': goal_difference,
            'points': int(team_stats[6].text),
            'points_3point_scale': 3*int(team_stats[1].text)+int(team_stats[2].text),
            'teams_in_league':n_teams,
            'url':team_url,
            'n_yellow_cards':n_yellows,
            'n_red_cards': n_reds,
            'avg_weighted_age': avg_weighted_age,
            'avg_home_attendance': avg_home_attendance
        })
    return teams_info



def get_team_info(url, year):
    # params: url: url for team page
    # returns: total number of reds and yellows and weighted avg player age
    html = get(url)
    soup = Soup(html)
    n_yellows = 0
    n_red_cards = 0
    weighted_ages = []
    sum_minutes = 0
    players_list = soup.find('ul', {'class': 'profile-grid players'}).find('li')
    for player in players_list:
        birth_year = int(player.find('h3', {'class': 'birth-year'}).text)
        age = year - birth_year
        perc_minutes = float(player.find('div',{'class': 'chart-item'})[1].attrs['data-value'])
        sum_minutes += perc_minutes
        weighted_ages.append(perc_minutes*age)
        
        bookings_info = player.find('ul', {'class': 'list'}).find('li')[1:]
        n_yellows += int(bookings_info[0].text)
        n_red_cards += int(bookings_info[1].text)
    
    avg_weighted_age = sum(weighted_ages)/sum_minutes # sum_minutes should be 11 and is for most years
    # however something wrong with 2020 because of short covid season
    
    return n_yellows, n_red_cards, avg_weighted_age
        
                
get_league_table('https://www.ksi.is/mot/stakt-mot/$TournamentDetails/Table/?motnumer=12486', 2006)

[{'team_name': 'FH',
  'year': 2006,
  'place': 1,
  'played_games': 18,
  'n_wins': 10,
  'n_ties': 6,
  'n_losses': 2,
  'goals_scored': 31,
  'goals_against': 14,
  'goal_difference': 17,
  'points': 36,
  'points_3point_scale': 36,
  'teams_in_league': 10,
  'url': 'https://www.ksi.is/mot/stakt-mot/lid-i-moti/?motnumer=12486&lid=220',
  'n_yellow_cards': 22,
  'n_red_cards': 3,
  'avg_weighted_age': 27.27822671156005,
  'avg_home_attendance': 1948.4444444444443},
 {'team_name': 'KR',
  'year': 2006,
  'place': 2,
  'played_games': 18,
  'n_wins': 9,
  'n_ties': 3,
  'n_losses': 6,
  'goals_scored': 23,
  'goals_against': 27,
  'goal_difference': -4,
  'points': 30,
  'points_3point_scale': 30,
  'teams_in_league': 10,
  'url': 'https://www.ksi.is/mot/stakt-mot/lid-i-moti/?motnumer=12486&lid=107',
  'n_yellow_cards': 29,
  'n_red_cards': 6,
  'avg_weighted_age': 27.076167076167067,
  'avg_home_attendance': 1459.7777777777778},
 {'team_name': 'Valur',
  'year': 2006,
  'place': 3,
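
The weighted average age computed in `get_team_info` can be checked on a small example. The sketch below assumes, as the code above suggests, that each player contributes a share of the team's played minutes; the function name `weighted_average_age` and the three players are hypothetical.

```python
def weighted_average_age(players, season_year):
    """players: iterable of (birth_year, share_of_minutes) pairs."""
    total_share = sum(share for _, share in players)
    if total_share == 0:
        # No recorded minutes: nothing to average.
        return None
    weighted = sum((season_year - birth_year) * share
                   for birth_year, share in players)
    return weighted / total_share

# Hypothetical squad: one regular, one part-time player, one unused player.
players = [(1980, 1.0), (1986, 0.5), (1990, 0.0)]
avg_age = weighted_average_age(players, 2006)
print(avg_age)
```

Here the ages are 26, 20 and 16, so the result is (26 * 1.0 + 20 * 0.5) / 1.5 = 24.0; the unused player does not affect the average. Dividing by the actual sum of shares, as `get_team_info` does, keeps the result a proper weighted mean even when the shares do not add up to 11.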
  

In the following cells the previous functions are combined to produce the final datasets.

In [7]:
def get_historic_data(gender):
    # params: m for male league, f for female league
    data = []
    years_info = get_tournament_urls(get_start_url(gender))
    for dict_ in years_info:
        try:
            league_table = get_league_table(dict_['url'], dict_['year'])
            data +=league_table
        except Exception:  # report the failing year; still lets Ctrl-C interrupt
            print(dict_['year'])
    return data

In [None]:
start_time = time.time()
df = pd.DataFrame(get_historic_data('m'))
print(f'Elapsed time: {time.time()-start_time}' )
df

In [9]:
df.loc[df['team_name'] == 'Haukar']

Unnamed: 0,team_name,year,place,played_games,n_wins,n_ties,n_losses,goals_scored,goals_against,goal_difference,points,points_3point_scale,teams_in_league,url,n_yellow_cards,n_red_cards,avg_weighted_age,avg_home_attendance
130,Haukar,2010,11,22,4,8,10,29,45,-16,20,20,12,https://www.ksi.is/mot/stakt-mot/lid-i-moti/?m...,48.0,1.0,26.87337,759.454545
435,Haukar,1979,10,18,1,3,14,12,45,-33,5,6,10,https://www.ksi.is/mot/stakt-mot/lid-i-moti/?m...,,,,


In [10]:
df.loc[df['team_name'] == 'FH']

Unnamed: 0,team_name,year,place,played_games,n_wins,n_ties,n_losses,goals_scored,goals_against,goal_difference,points,points_3point_scale,teams_in_league,url,n_yellow_cards,n_red_cards,avg_weighted_age,avg_home_attendance
1,FH,2020,2,18,11,3,4,37,23,14,36,36,12,https://www.ksi.is/mot/stakt-mot/lid-i-moti/?m...,37.0,3.0,28.466667,620.6
14,FH,2019,3,22,11,4,7,40,36,4,37,37,12,https://www.ksi.is/mot/stakt-mot/lid-i-moti/?m...,51.0,4.0,28.345102,1205.818182
28,FH,2018,5,22,10,7,5,36,28,8,37,37,12,https://www.ksi.is/mot/stakt-mot/lid-i-moti/?m...,48.0,3.0,28.281836,1069.454545
38,FH,2017,3,22,9,8,5,33,25,8,35,35,12,https://www.ksi.is/mot/stakt-mot/lid-i-moti/?m...,51.0,1.0,27.511369,1060.090909
48,FH,2016,1,22,12,7,3,32,17,15,43,43,12,https://www.ksi.is/mot/stakt-mot/lid-i-moti/?m...,48.0,1.0,26.604759,1540.636364
60,FH,2015,1,22,15,3,4,47,26,21,48,48,12,https://www.ksi.is/mot/stakt-mot/lid-i-moti/?m...,52.0,1.0,26.54336,1924.727273
73,FH,2014,2,22,15,6,1,46,17,29,51,51,12,https://www.ksi.is/mot/stakt-mot/lid-i-moti/?m...,34.0,4.0,27.177778,1740.727273
85,FH,2013,2,22,14,5,3,47,22,25,47,47,12,https://www.ksi.is/mot/stakt-mot/lid-i-moti/?m...,44.0,4.0,27.098072,1496.181818
96,FH,2012,1,22,15,4,3,51,23,28,49,49,12,https://www.ksi.is/mot/stakt-mot/lid-i-moti/?m...,43.0,2.0,28.924197,1595.363636
109,FH,2011,2,22,13,5,4,48,31,17,44,44,12,https://www.ksi.is/mot/stakt-mot/lid-i-moti/?m...,41.0,5.0,28.656933,1685.727273


Finally, we export the datasets as CSV files.

In [13]:
df.to_csv("ksi_kk.csv",encoding='utf-8-sig', index = False)

In [14]:
df_kvk = pd.DataFrame(get_historic_data('f'))
df_kvk.to_csv('ksi_kvk.csv', encoding='utf-8-sig', index = False)

2021
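
The `utf-8-sig` encoding used for the export writes a byte order mark at the start of the file, which matters when the files are read back. The sketch below uses only the standard library and a temporary file with one illustrative row (not scraped data) to show the difference between reading with `utf-8-sig` and with plain `utf-8`.

```python
import csv
import os
import tempfile

# Illustrative row only; not taken from the scraped datasets.
rows = [{'team_name': 'Breiðablik', 'year': 2006}]

path = os.path.join(tempfile.mkdtemp(), 'example.csv')
with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['team_name', 'year'])
    writer.writeheader()
    writer.writerows(rows)

# Reading with utf-8-sig strips the byte order mark from the first header.
with open(path, encoding='utf-8-sig', newline='') as f:
    header_sig = next(csv.reader(f))

# Reading with plain utf-8 leaves '\ufeff' glued to the first column name.
with open(path, encoding='utf-8', newline='') as f:
    header_plain = next(csv.reader(f))

print(header_sig)
print(header_plain)
```

When loading `ksi_kk.csv` or `ksi_kvk.csv` later, passing the same `utf-8-sig` encoding avoids a first column named `'\ufeffteam_name'`.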
