## Data preparation

This dataset contains ESC final results from 1956 to 2022.

This notebook cleans the data and replaces categorical values with numeric ones and then applies machine learning to created dataset.

In [1]:
import pandas as pd
import numpy as np

Read in data from csv file

In [2]:
data = pd.read_csv("../data/ESCDB2.csv", sep=";", encoding="latin-1")

# Removing potential unneccesary spaces from dataframe columns
for column in data.select_dtypes(include='object'):
    data[column] = data[column].str.strip()

# Dropping duplicate column "songid"
data.drop('songid', axis=1)

# Dropping 1956 contest result, since no points/places were awarded
data = data.drop(data[data['Year'] == 1956].index)

# Replacing "Macedonia" with "North Macedonia", since the country changed its name
data["Country"] = data["Country"].replace("Macedonia", "North Macedonia")

data

Unnamed: 0,song_id,Year,Order,Country,Singer,Title,Points,Place,in_english,songid,lovementions,population
14,15,1957,1,Belgium,Bobbejaan Schoepen,Straatdeuntje,5,8,False,15,0,11350000
15,16,1957,2,Luxembourg,DaniÃ¨le DuprÃ©,Tant De Peine,8,5,False,16,5,590667
16,17,1957,3,United Kingdom,Patricia Bredin,All,6,7,True,17,1,66020000
17,18,1957,4,Italy,Nunzio Gallo,Corde Della Mia Chitarra,7,6,False,18,0,60590000
18,19,1957,5,Austria,Bob Martin,Wohin Kleines Pony,3,10,False,19,0,8773000
...,...,...,...,...,...,...,...,...,...,...,...,...
1368,1369,2022,21,Australia,Sheldon Riley,Not the Same,125,15,True,1369,0,24600000
1369,1370,2022,22,United Kingdom,Sam Ryder,Space Man,466,2,True,1370,0,66020000
1370,1371,2022,23,Poland,Ochman,River,151,12,True,1371,0,37970000
1371,1372,2022,24,Serbia,Konstrakta,In corpore sano,312,5,False,1372,0,7240000


Making era column based on years:

- years 1957-1986 -> 0 (very old contests)
- years 1987-1999 -> 1 (roughly after soviet union collapse)
- years 2000-2013 -> 2 (prior to invasion of ukraine)
- years 2014-2022 -> 3 (modern era of the contest)

In [3]:
def era(row):
    if row["Year"] >= 1957 and row["Year"] <= 1986:
        return 0
    if row["Year"] >= 1987 and row["Year"] <= 1999:
        return 1
    if row["Year"] >= 2000 and row["Year"] <= 2013:
        return 2
    if row["Year"] >= 2014 and row["Year"] <= 2022:
        return 3
    
data["Era"] = data.apply(lambda row: era(row), axis = 1)

Making country column numeric

In [4]:
countries = data["Country"].unique()
countries.sort()
countries

countries_to_numbers = {}

for i in range(len(countries)):
    countries_to_numbers[countries[i]] = i

data["Country"] = data["Country"].map(countries_to_numbers)

Making the singer column numeric

This will be a boolean value:
* **0** if the singer has only performed once in ESC
* **1** if the singer has performed multiple times in ESC

In [5]:
data["Singer"] = data["Singer"].map(lambda x: 0 if data["Singer"].value_counts()[x] == 1 else 1)

Making title column numeric.

We are extracting two attributes:
* 1 if song title included word "Love" 0 otherwise
* 1 if song title is more than 1 word, 0 otherwise 

In [6]:
data["love_in_title"] = data["Title"].map(lambda x: 1 if 'love' in x.lower() else 0)

In [7]:
data["Title"] = data["Title"].map(lambda x: len(x.split()))
data = data.rename(columns={"Title" : "title_word_count"})

Making "in_english" column numeric.

In [8]:
data["in_english"] = data["in_english"].map(lambda x: 1 if x else 0)

In [9]:
# display out the result
data

Unnamed: 0,song_id,Year,Order,Country,Singer,title_word_count,Points,Place,in_english,songid,lovementions,population,Era,love_in_title
14,15,1957,1,6,0,1,5,8,0,15,0,11350000,0,0
15,16,1957,2,26,0,3,8,5,0,16,5,590667,0,0
16,17,1957,3,49,0,1,6,7,1,17,1,66020000,0,0
17,18,1957,4,23,0,4,7,6,0,18,0,60590000,0,0
18,19,1957,5,3,0,3,3,10,0,19,0,8773000,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1368,1369,2022,21,2,0,3,125,15,1,1369,0,24600000,3,0
1369,1370,2022,22,49,0,2,466,2,1,1370,0,66020000,3,0
1370,1371,2022,23,35,0,1,151,12,1,1371,0,37970000,3,0
1371,1372,2022,24,40,0,3,312,5,0,1372,0,7240000,3,0


---

In [10]:
# unneccesary column
data = data.drop('song_id', axis=1)

# not needed
data = data.drop(columns=['Place'], axis=1)

In [11]:
year_to_predict = 2016

y = data.iloc[:, [0, 5]] # year | points

# trying to predict 2018
y_train = y[y["Year"] != year_to_predict]["Points"] # y_train
y_test = y[y["Year"] == year_to_predict]["Points"]  # y_test

X_train = data[data["Year"] != year_to_predict]     # X_train
X_test = data[data["Year"] == year_to_predict]      # X_test

In [12]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn import neighbors

classifier1 = neighbors.KNeighborsClassifier(n_neighbors = 5)
classifier2 = RandomForestClassifier()
classifier3 = SVC(kernel="rbf")
classifier4 = Lasso() # SEE ANNAB 100% õiged, sest meil on databaasis aasta kohta read koha järgi järjestatud ja ta teab seda
classifier5 = Ridge() # ka 100% õiged, sest meil on databaasis aasta kohta read koha järgi järjestatud ja ta teab seda

classifier1.fit(X_train, y_train)
classifier2.fit(X_train, y_train)
classifier3.fit(X_train, y_train)
classifier4.fit(X_train, y_train)
classifier5.fit(X_train, y_train)

y_pred1 = classifier1.predict(X_test)
y_pred2 = classifier2.predict(X_test)
y_pred3 = classifier3.predict(X_test)
y_pred4 = classifier4.predict(X_test)
y_pred5 = classifier5.predict(X_test)

  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T


In [13]:
def top10accuracy(result):
    real_top10 = []
    pred_top10 = []
    o = 0
    for i in range(10):
        o+=1
        real_top10.append(result.loc[result['actual_place'] == o, 'Country'].iloc[0])
        pred_top10.append(result.loc[result['pred_place'] == o, 'Country'].iloc[0])
    #Create a counter for the number of elements that match
    accuracy_count = 0
    #Loop through both lists and check if the elements are equal and have the same index
    for country in pred_top10:
        if country in real_top10:
            accuracy_count += 1
    #Calculate the accuracy of the two lists
    accuracy = accuracy_count/len(real_top10)
    print("quessed top 10 countries with", accuracy*100 "% accuracy")
    return accuracy

def top3accuracy(result):
    real_top3 = []
    pred_top3 = []
    o = 0
    for i in range(3):
        o+=1
        real_top3.append(result.loc[result['actual_place'] == o, 'Country'].iloc[0])
        pred_top3.append(result.loc[result['pred_place'] == o, 'Country'].iloc[0])
    #Create a counter for the number of elements that match
    accuracy_count = 0
    #Loop through both lists and check if the elements are equal and have the same index
    for country in pred_top3:
        if country in real_top3:
            accuracy_count += 1
    #Calculate the accuracy of the two lists
    accuracy = accuracy_count/len(real_top3)
    print("quessed top 3 countries with", accuracy*100 "% accuracy")
    
    return accuracy

In [14]:
def evaluate(y_pred, X_test, countries_to_numbers, classifier_name):
    numbers_to_countries = dict((v, k) for k, v in countries_to_numbers.items())

    result = X_test.iloc[:, [2, 5]]

    result["Country"] = result["Country"].map(numbers_to_countries)
    result = result.assign(Predicted=list(y_pred))

    result = result.sort_values("Predicted", ascending=False)
    result = result.assign(pred_place=range(1, len(result) + 1))

    result = result.sort_values("Points", ascending=False)
    result = result.assign(actual_place=range(1, len(result) + 1))

    result["correct"] = abs(result["pred_place"] - result["actual_place"])
    result["correct"] = result["correct"].apply(lambda x: x <= 0)

    print(classifier_name, "classifiers results were:")
    print(result)
    score = result["correct"].value_counts("True")

    print(score)
    top10acc = top10accuracy(result)
    top3acc = top3accuracy(result)

In [15]:
evaluate(y_pred1, X_test, countries_to_numbers, "KNN")
evaluate(y_pred2, X_test, countries_to_numbers, "Random Forest")
evaluate(y_pred3, X_test, countries_to_numbers, "SVM")
evaluate(y_pred4, X_test, countries_to_numbers, "Lasso")
evaluate(y_pred5, X_test, countries_to_numbers, "Ridge")

KNN classifiers results were:
             Country  Points  Predicted  pred_place  actual_place  correct
1238         Ukraine     534        214           2             1    False
1230       Australia     511         99           6             2    False
1235          Russia     491        204           3             3     True
1225        Bulgaria     307        157           4             4     True
1226          Sweden     261        218           1             5    False
1228          France     257         82           7             6    False
1243         Armenia     249         34          13             7    False
1229          Poland     229         10          22             8    False
1233       Lithuania     200         30          17             9    False
1218         Belgium     181         64           8            10    False
1220     Netherlands     153        114           5            11    False
1239           Malta     153         32          16            12    F

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  result["Country"] = result["Country"].map(numbers_to_countries)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  result["Country"] = result["Country"].map(numbers_to_countries)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  result["Country"] = result["Country"].map(numbers_to_countries)
A value is tr