In [1]:
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
df = pd.read_csv('./WinnersInterviewBlogPosts.csv')
pd.set_option('display.max_colwidth', None)

In [3]:
df.head(1)

Unnamed: 0,title,link,publication_date,content
0,Computer scientist Jure Zbontar on winning the Eurovision challenge,http://blog.kaggle.com/2010/06/09/computer-scientist-jure-zbontars-method-for-winning-the-eurovision-challenge/,2010-06-09 18:22:29,"My approach was actually quite simple. The only attributes I used where the approximate betting odds and the information on past voting. I sought patterns in the voting behaviour of all countries and combined that knowledge with this year's betting odds. I used cross-validation to select my model and to avoid overfitting it.<!--more-->\r\n\r\n<strong>Predicting the finalists </strong>\r\n\r\nI trusted the bookmakers on this one and just took the top ten countries from each semi-final group. I got the betting odds from Betfair.\r\n\r\n<strong>Learning the voting patterns </strong>\r\n\r\nA simple approach worked well enough here. The idea was to calculate, for each country, the average points awarded to each other country. Coming from Slovenia which was once part of Yugoslavia, together with Croatia, Serbia, Bosnia and Herzegovina, Macedonia and Montenegro, it is perhaps not surprising that our voting patterns are rather interesting:\r\n<pre><code> AVG' COUNTRY\r\n10.38 Serbia\r\n 8.53 Croatia\r\n 8.00 Bosnia and Herzegovina\r\n 5.91 Macedonia\r\n 3.21 Norway\r\n 3.17 Russia\r\n 3.07 Greece\r\n...\r\n 0.18 Portugal\r\n 0.17 Belarus\r\n</code></pre>\r\nIt is painfully obvious that Slovenia is not judging the quality of the artist alone and it is well known that other countries follow similar patterns. It would, therefore, seem like a good idea to use this knowledge in predicting this year's voting.<!--more-->\r\n\r\nThe estimated average points awarded are not very stable, especially for newer countries. 
To remedy this, instead of using:\r\n<pre><code>avg := sum(x) / |x|\r\n</code></pre>\r\nI used\r\n<pre><code>avg' := (sum(x) + 1) / (|x| + 1)\r\n</code></pre>\r\nThe new estimate got better results on the cross-validation tests.\r\n\r\n<strong>Betting Odds</strong>\r\n\r\nUsing just the voting patterns of countries to predict this year's results was not enough. I had to, somehow, incorporate the approximate betting odds as well. Many approaches could have worked well here. In the end I opted for the one that gave the best cross-validation results.\r\n\r\nI had to convert the approximate betting odds into something comparable with the average points awarded. I used:\r\n<pre><code>odds'(ctr) := 1 / log(odds(ctr)) * a + b\r\n</code></pre>\r\nThe coefficients a and b were chosen experimentally, as the ones that gave the best cross-validation score.\r\n\r\nA small example will elucidate how I calculated the converted betting odds.\r\n<pre><code>odds'(Croatia) = 1 / log(odds(Croatia)) * 4.4 + 0.8 =\r\n = 1 / log(48) * 4.4 + 0.8\r\n = 1.94\r\n</code></pre>\r\nThe converted betting odds for the top and bottom countries were:\r\n<pre><code>ODDS' COUNTRY\r\n5.23 Azerbaijan\r\n3.21 Germany\r\n2.54 Armenia\r\n...\r\n1.94 Croatia\r\n...\r\n1.45 Slovenia\r\n1.44 Bulgaria\r\n1.44 Macedonia\r\n1.44 Switzerland</code></pre>\r\n<strong>Combining the voting patterns with the betting odds</strong>\r\n\r\nIt was now time to bring everything together. 
This was simply a matter of summing the average points awarded with the converted betting odds.\r\n\r\nThis was how I predicted Slovenia's votes for this year:\r\n<pre><code>COUNTRY AVG' ODDS' SUM POINTS\r\nSerbia 10.38 + 1.84 = 12.21 12\r\nCroatia 8.53 + 1.94 = 10.47 10\r\nBosnia and Herzegovina 8.00 + 1.49 = 9.49 8\r\nMacedonia 5.91 + 1.44 = 7.35 7\r\nAzerbaijan 1.80 + 5.23 = 7.03 6\r\nNorway 3.21 + 2.01 = 5.22 5\r\nGreece 3.07 + 1.96 = 5.03 4\r\nSweden 2.85 + 2.18 = 5.03 3\r\nRussia 3.17 + 1.62 = 4.79 2\r\nGermany 1.50 + 3.21 = 4.71 1\r\nDenmark 2.42 + 2.25 = 4.66 0\r\n...\r\n</code></pre>\r\nWe saw earlier that Slovenia's votes have little to do with song quality, as we usually award the top points to Balkan countries, no matter how bad they sing. The added betting odds should not influence the prediction of such countries considerably. On the other hand, if we take a country that is perhaps a bit more fair, like Israel, we see that the final predictions are affected to a greater extent:\r\n<pre><code>COUNTRY AVG' ODDS' SUM POINTS\r\nArmenia 7.50 + 2.54 = 10.04 12\r\nAzerbaijan 3.75 + 5.23 = 8.98 10\r\nRussia 7.23 + 1.62 = 8.85 8\r\nUkraine 6.30 + 1.54 = 7.84 7\r\nRomania 6.07 + 1.61 = 7.68 6\r\nGreece 4.08 + 1.96 = 6.04 5\r\nGeorgia 4.25 + 1.77 = 6.02 4\r\nIceland 3.83 + 1.72 = 5.55 3\r\nSerbia 3.71 + 1.84 = 5.55 2\r\nDenmark 3.25 + 2.25 = 5.50 1\r\nSweden 3.27 + 2.18 = 5.45 0\r\n...\r\n</code></pre>\r\n<h2><span style=""font-size: 13px;"">Cross validation</span></h2>\r\n<h2><span style=""font-weight: normal; font-size: 13px;"">The most important component of my solution was cross-validation and was probably the reason why I won the competition in the first place. It enabled me to try many different models and, between them, choose the one that was most likely to give the best results.</span></h2>\r\nThe dataset was split into partitions, one for each Eurovision event. 
I then proceeded to build the model on all but one partition and calculated the error of that model on the partition that was left out. The procedure was repeated so that each time a different partition was left out. This gave me a fair estimate of how the model performs on unseen data.\r\n\r\nThe cross-validation procedure in pseudocode:\r\n<pre><code>function crossValidation(dataset, buildModel):\r\n error = 0\r\n for year in eurovisionEvents:\r\n learnData = {example | example in dataset and example.year != year}\r\n testData = {example | example in dataset and example.year == year}\r\n model = buildModel(learnData)\r\n error += testModel(model, testData)\r\n return error</code></pre>\r\n<pre><span style=""font-family: monospace, Monaco, 'Courier New', Courier, monospace;""><span style=""font-family: Georgia, 'Times New Roman', 'Bitstream Charter', Times, serif; line-height: 19px; white-space: normal; font-size: 13px;""><strong>Conclusion </strong></span></span></pre>\r\n<pre><span style=""font-family: monospace, Monaco, 'Courier New', Courier, monospace;""><strong><span style=""font-family: Georgia, 'Times New Roman', 'Bitstream Charter', Times, serif; font-weight: normal; line-height: 19px; white-space: normal; font-size: 13px;"">I am well aware that certain parts of my approach are not very strong. I had to do my best with the time that was available. I have many ideas for next year, which I will, for the moment at least, keep to myself :)</span></strong></span></pre>\r\nI really enjoy competing in events like this and hope there will be more to come in the future."
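
The post shown in the row above quotes three pieces of method: a smoothed average, a conversion of betting odds, and leave-one-event-out cross-validation. The following is only a minimal sketch of those quoted formulas and pseudocode, not the author's actual code. It assumes the natural logarithm (which reproduces the post's 1.94 for Croatia); all function names, the toy data, and the toy mean model are hypothetical.

```python
import math

def smoothed_average(points):
    """avg' := (sum(x) + 1) / (|x| + 1), as quoted in the post."""
    return (sum(points) + 1) / (len(points) + 1)

def converted_odds(odds, a=4.4, b=0.8):
    """odds'(ctr) := 1 / log(odds(ctr)) * a + b.

    Assumption: log is the natural logarithm; a and b default to the
    values used in the post's Croatia example.
    """
    return 1 / math.log(odds) * a + b

def cross_validation(dataset, build_model, test_model):
    """Leave-one-event-out cross-validation, following the post's pseudocode."""
    error = 0.0
    for year in sorted({example["year"] for example in dataset}):
        learn_data = [e for e in dataset if e["year"] != year]
        test_data = [e for e in dataset if e["year"] == year]
        model = build_model(learn_data)
        error += test_model(model, test_data)
    return error

# Worked example from the post: Croatia's odds of 48
croatia = converted_odds(48)
print(round(croatia, 2))          # 1.94
print(round(8.53 + croatia, 2))   # 10.47, the SUM shown for Croatia in Slovenia's table

# Hypothetical toy data: the "model" is just the mean of the training points
toy = [{"year": 2007, "points": 8},
       {"year": 2008, "points": 10},
       {"year": 2009, "points": 12}]

def build_mean_model(examples):
    return sum(e["points"] for e in examples) / len(examples)

def absolute_error(model, examples):
    return sum(abs(e["points"] - model) for e in examples)

print(cross_validation(toy, build_mean_model, absolute_error))  # 6.0
```

Each year is held out exactly once, so the summed error reflects only predictions on data the model did not see during fitting.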


In [4]:
df.columns

Index(['title', 'link', 'publication_date', 'content'], dtype='object')

In [5]:
df.shape

(182, 4)

In [7]:
df.publication_date

0      2010-06-09 18:22:29
1      2010-08-09 12:35:46
2      2010-09-27 18:30:25
3      2010-10-11 13:31:35
4      2010-11-19 17:39:47
              ...         
177    2016-08-18 17:12:00
178    2016-09-08 16:08:23
179    2016-08-24 16:01:31
180    2016-08-31 15:16:47
181    2016-09-15 16:16:15
Name: publication_date, Length: 182, dtype: object
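
The output above reports `dtype: object`, so the dates are stored as strings, and the last few rows are not in chronological order. A minimal sketch of parsing and sorting them, using a small stand-in Series built from values visible in the output (the variable names are illustrative, and the format string is an assumption based on those values):

```python
import pandas as pd

# Stand-in for df.publication_date, using values shown in the output above
dates = pd.Series(["2016-09-08 16:08:23",
                   "2016-08-24 16:01:31",
                   "2016-08-31 15:16:47"])

# Parse the strings into real timestamps
parsed = pd.to_datetime(dates, format="%Y-%m-%d %H:%M:%S")

print(parsed.dtype)
print(parsed.is_monotonic_increasing)   # False: the rows are not in date order
print(parsed.sort_values().dt.date.tolist())
```

On the real frame the same call would be `pd.to_datetime(df['publication_date'])`, after which date arithmetic and `.dt` accessors become available.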

In [6]:
all_kaggle_interviews = []

# Parse each post's HTML; iterate over the column instead of a hard-coded row count
for content in df['content']:
    all_kaggle_interviews.append(BeautifulSoup(content, 'html.parser'))
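
Once a post is parsed, its plain text can be pulled out with `get_text()`. A minimal sketch on a short stand-in string modeled on the first post (the `html` and `plain_text` names are illustrative, and collapsing whitespace is just one possible clean-up choice):

```python
from bs4 import BeautifulSoup

# Short stand-in for one value of df['content'], modeled on the first post
html = ("I used cross-validation to select my model.<!--more-->\r\n\r\n"
        "<strong>Predicting the finalists </strong>\r\n\r\n"
        "I trusted the bookmakers.")
soup = BeautifulSoup(html, "html.parser")

# get_text() returns the text without the markup;
# split() + join() then collapses runs of whitespace and newlines
plain_text = " ".join(soup.get_text(separator=" ").split())
print(plain_text)
```

For the list built above, the same expression could be applied to each element of `all_kaggle_interviews`.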

In [None]:
for blogpost, row in df.iterrows():
    # publication_date is a string, so its first 10 characters are the YYYY-MM-DD date
    print(f"<h2>blogpost {blogpost} - {row['title']}. Posted on: {row['publication_date'][:10]}</h2>\n")
    print(f"{row['content']}\n")
    print("\n\n\n")