# Eurovision voting - Data wrangling

## Introduction

This notebook gathers and prepares the data needed for the model used later on.

In [14]:
# Set up
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pycountry
import pathlib

DATAPATH = pathlib.Path().resolve().parent / "data"

## Voting data

Analysing Eurovision voting patterns first requires data on how each country voted in each year. To start, we will use [this data](https://www.kaggle.com/datasets/datagraver/eurovision-song-contest-scores-19752019), which can be found in the data directory of this repo.

Let's load the data and see what we are working with:

In [16]:
df_input = pd.read_excel(DATAPATH / "eurovision_voting_scores_1975_2019.xlsx")
df_input.head()

Unnamed: 0,Year,(semi-) final,Edition,Jury or Televoting,From country,To country,Points,Duplicate
0,1975,f,1975f,J,Belgium,Belgium,0,x
1,1975,f,1975f,J,Belgium,Finland,0,
2,1975,f,1975f,J,Belgium,France,2,
3,1975,f,1975f,J,Belgium,Germany,0,
4,1975,f,1975f,J,Belgium,Ireland,12,


The voting system [significantly changed](https://en.wikipedia.org/wiki/Voting_at_the_Eurovision_Song_Contest) from 1998 onwards, so we keep only the finals data from the years after 1998 (the filter is `Year > 1998`, i.e. 1999 onwards).

In [17]:
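# keep finals only, for years after 1998 (1999 onwards)
df = df_input[df_input['Year'] > 1998]
df = df.loc[df['(semi-) final'] == 'f']
# copy so that later column assignments act on an independent frame
df = df.drop_duplicates().copy()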
df.head()

Unnamed: 0,Year,(semi-) final,Edition,Jury or Televoting,From country,To country,Points,Duplicate
10841,1999,f,1999f,J,Austria,Austria,0,x
10842,1999,f,1999f,J,Austria,Belgium,0,
10843,1999,f,1999f,J,Austria,Bosnia & Herzegovina,12,
10844,1999,f,1999f,J,Austria,Croatia,8,
10845,1999,f,1999f,J,Austria,Cyprus,0,
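
As a quick check of what the filter keeps, here is a minimal sketch on made-up rows (the values, including the `'sf'` label for a semi-final, are assumptions for illustration and are not taken from the real file). It combines the two conditions into one boolean mask and shows that `Year > 1998` excludes 1998 itself.

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "Year": [1997, 1998, 1999, 1999, 2004],
    "(semi-) final": ["f", "f", "f", "f", "sf"],
    "Points": [12, 10, 8, 8, 5],
})

# Same conditions as the cell above, combined into a single mask
mask = (toy["Year"] > 1998) & (toy["(semi-) final"] == "f")

# drop_duplicates compares whole rows, so the two identical 1999 rows collapse to one
finals = toy.loc[mask].drop_duplicates().copy()
print(finals)
```

Only the 1999 final survives here: the 1997 and 1998 rows fail the year condition, and the 2004 row is not a final.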


We need to clean up some of these columns and standardise them by stripping white space and converting all names to lower case, to prevent mismatches later on. A closer look at the country names also reveals inconsistencies (typos, punctuation and alternative names) that need resolving.

In [None]:
# remove surrounding white space and convert to lower case
df['To country'] = df['To country'].str.strip().str.lower()
df['From country'] = df['From country'].str.strip().str.lower()

# tidy country names: replace hyphens and ampersands, fix typos, rename
# (order matters: later pairs see the output of earlier ones)
replace_list = [['-', ' '],
                ['&', 'and'],
                ['netherands', 'netherlands'],
                ['f.y.r. macedonia', 'north macedonia'],
                ['russia', 'russian federation'],
                ['the netherlands', 'netherlands'],
                ['czech republic', 'czechia'],
                ['serbia and montenegro', 'yugoslavia'],
                ['moldova', 'moldova, republic of']]

# the patterns are literal strings, so regex=False stops '.' acting as a wildcard
for old, new in replace_list:
    df['To country'] = df['To country'].str.replace(old, new, regex=False)
    df['From country'] = df['From country'].str.replace(old, new, regex=False)

# sorted array of every distinct country name (np.unique already returns sorted values)
countries = np.unique(pd.concat([df['From country'], df['To country']]))
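
To see what these cleaning steps do to individual names, here is a minimal sketch on a handful of made-up inputs. The helper `tidy_names` and the sample strings are hypothetical; only a subset of the replacement pairs is reused, and they are applied as literal substring replacements.

```python
import pandas as pd

# Made-up inputs mimicking the inconsistencies described above
raw = pd.Series([
    " Bosnia & Herzegovina",
    "The Netherlands ",
    "Netherands",
    "F.Y.R. Macedonia",
    "Russia",
])

# Subset of the (old, new) pairs from the cell above, in the same order
replacements = [
    ("-", " "),
    ("&", "and"),
    ("netherands", "netherlands"),
    ("f.y.r. macedonia", "north macedonia"),
    ("russia", "russian federation"),
    ("the netherlands", "netherlands"),
]

def tidy_names(names: pd.Series) -> pd.Series:
    """Strip, lower-case, then apply each literal replacement in order (hypothetical helper)."""
    cleaned = names.str.strip().str.lower()
    for old, new in replacements:
        cleaned = cleaned.str.replace(old, new, regex=False)
    return cleaned

tidy = tidy_names(raw)
print(tidy.tolist())
```

Because the replacements match substrings, applying `tidy_names` a second time would turn `russian federation` into `russian federationn federation`, so the cleaning is probably best run once on freshly loaded data.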

For consistency with other data sets, we want to convert the country names to ISO alpha-2 codes. We will use the [pycountry](https://pypi.org/project/pycountry/) package to do this.