## Table of contents:
* [Introduction](#introduction)
* [Exploratory data analysis](#eda)
    * [Importing data](#importdata)
    * [Grouping and filtering data](#groupfilterdata)
    * [Plotting and analyzing](#plottinganalyzing)
* [Relationship between Olympic events and population, GDP](#olympic_pop_gdp)
    * [Cleaning and merging data](#cleaningmerging)
    * [Analyzing the relationship of Olympic achievement with population and GDP](#analyzing_pop_gdp)
    * [Summary of the findings](#summary_pop_gdp)
* [Predictions](#predictions)
* [References](#references)

# Introduction  <a class="anchor" id="introduction"></a>
We want to explore the Olympic dataset from [here](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results). The data has 15 attributes and more than 271k records.
Besides that, we would like to see whether there is any relationship between the medals a country wins and its population, GDP, life expectancy, fertility rate, and happiness.  
- For population, fertility rate, and life expectancy, we use the data from [here](https://www.kaggle.com/gemartin/world-bank-data-1960-to-2016).
- For GDP, we refer to the World Bank data [here](https://www.kaggle.com/theworldbank/world-bank-gdp-ranking).
- For happiness, [this report](https://www.kaggle.com/unsdsn/world-happiness) covers 2015 to 2017.  

# Exploratory data analysis  <a class="anchor" id="eda"></a>
Let's dig in.  
First, we import the necessary libraries and prepare to load the data.

In [None]:
# importing libraries for EDA
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sb
import matplotlib.pyplot as plt
import random
from scipy.stats import variation
from wordcloud import WordCloud
# importing libraries for logistic regression prediction
import statsmodels.api as sm 
import time
import profile
import math
import scipy 
%pylab inline
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

sb.set()

## Importing data  <a class="anchor" id="importdata"></a>
Now we load the Olympic data and look at a few records.  
Also, as mentioned on the Olympic data [page](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results):  
*Note that the Winter and Summer Games were held in the same year up until 1992. After that, they staggered them such that Winter Games occur on a four year cycle starting with 1994, then Summer in 1996, then Winter in 1998, and so on. A common mistake people make when analyzing this data is to assume that the Summer and Winter Games have always been staggered.*

Therefore, we will use the data after 1993.
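
The staggering described in the note can be checked directly. The sketch below uses a small hand-made frame (not the real data) with the same `Year` and `Season` columns and counts how many distinct seasons occur in each year. The toy rows and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy rows mimicking the Year/Season columns of athlete_events.csv (illustrative only).
toy = pd.DataFrame({
    "Year":   [1992, 1992, 1994, 1996, 1998, 2000],
    "Season": ["Summer", "Winter", "Winter", "Summer", "Winter", "Summer"],
})

# Number of distinct seasons held in each year.
seasons_per_year = toy.groupby("Year")["Season"].nunique()

# Keep only the years after 1993, as the notebook does.
staggered = seasons_per_year[seasons_per_year.index > 1993]
print(staggered)
```

If every remaining year shows a single season, the filtered data follows the staggered schedule.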

In [None]:
olympics = pd.read_csv('../input/120-years-of-olympic-history-athletes-and-results/athlete_events.csv')
olympics = olympics[olympics['Year'] > 1993]
olympics.head(5)

As we can see, the dataset has 15 attributes such as ID, Name, Sex, Age, ..., NOC, and Medal (if the athlete won one).  
Let's look at the number of medals awarded in each year in which Olympic Games were held.  
Here are the first 10 Games in the filtered records.

In [None]:
olympics.groupby(['Year'])[['Medal']].count().head(10)

And the last 10 Games.

In [None]:
olympics.groupby(['Year'])[['Medal']].count().tail(10)

Let's have a look at the population.

In [None]:
population = pd.read_csv('../input/world-bank-data-1960-to-2016/country_population.csv')
population.head()

As we can see above, each country has a population value for every year from 1960 to 2016.  
Each country also has a country code, which matches the NOC code in the Olympic data for many, though not all, countries.  
How about the life expectancy of each country?

In [None]:
lifeexpectancy = pd.read_csv('../input/world-bank-data-1960-to-2016/life_expectancy.csv')
lifeexpectancy.head()

We have values from 1960 to 2016, and the country code is present as well. However, we need to deal with some NaN values.  
Let's look at the fertility rate of each country (the fertility rate indicates the number of children a woman would give birth to during her childbearing years).

In [None]:
fertility = pd.read_csv('../input/world-bank-data-1960-to-2016/fertility_rate.csv')
fertility.head()

The country code is here too, and again we have NaN values to deal with.  
From the GDP data, we only need the ranking of countries in 2017.

In [None]:
gdp = pd.read_csv('../input/gdp-world-bank-data/GDP by Country.csv', skiprows = range(0,4))
gdp.head()

We only have happiness data for three years: 2015, 2016, and 2017. So, we will focus on 2016.

In [None]:
happiness2016 = pd.read_csv('../input/world-happiness/2016.csv')
happiness2016.head()

Later, we want to view the results on a map, so we will need GPS coordinates.

In [None]:
location = pd.read_csv("../input/world-capitals-gps/concap.csv")
location.head()

In [None]:
medaltypes = ['Gold', 'Silver', 'Bronze']
medalpoints = ['Gold', 'Silver', 'Bronze', 'Medals', 'Medal_points']

In [None]:
olympics = pd.concat([olympics, pd.get_dummies(olympics['Medal'])], axis = 1)
olympics = olympics.drop('Medal', axis = 1)
olympics.head()

## Grouping and filtering data  <a class="anchor" id="groupfilterdata"></a>
The full Olympic dataset contains more than 271k records, one for each athlete in each event. To analyze it further, we need to group the records by team and season to find the medal totals.  
Also, some countries enter multiple teams in the same event; for those, we need to normalize the team names to the country name.  
There are three types of medal: Gold, Silver, and Bronze. To easily find out which team or country performs best at each Games, we convert the categorical medal type to numerical data. One simple approach is to assign a number of points to each medal type:
- Gold = 3 points
- Silver = 2 points
- Bronze = 1 point
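
The scoring scheme above can be sketched on a few made-up rows. The example one-hot encodes the medal type, counts a team event as a single medal however many athletes were on the team, and then applies the 3/2/1 weights. The toy frame and its team and event names are assumptions, not values from the dataset.

```python
import pandas as pd

# Toy results: a two-person team event (both athletes get a Gold row) and an individual event.
toy = pd.DataFrame({
    "Team":  ["A", "A", "A", "B"],
    "Event": ["Relay", "Relay", "Sprint", "Sprint"],
    "Medal": ["Gold", "Gold", "Bronze", "Silver"],
})

# One-hot encode the medal type, then sum per team and event.
onehot = pd.get_dummies(toy["Medal"]).astype(int)
onehot = onehot.reindex(columns=["Gold", "Silver", "Bronze"], fill_value=0)
per_event = pd.concat([toy[["Team", "Event"]], onehot], axis=1).groupby(["Team", "Event"]).sum()

# A team event counts as one medal, however many athletes were on the team.
per_event = per_event.clip(upper=1)

# Weighted score: Gold = 3, Silver = 2, Bronze = 1.
per_team = per_event.groupby("Team").sum()
per_team["Medal_points"] = 3 * per_team["Gold"] + 2 * per_team["Silver"] + per_team["Bronze"]
print(per_team)
```

Without the clipping step, the relay in this toy data would count as two Gold medals for team A.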

In [None]:
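def prepareMedals(origData, medals):
    # Sum the one-hot medal columns per team and event.
    medalsDF = origData.groupby(['Year','Season','Team','NOC','Event'])[medals].sum()
    # A team event yields one medal row per athlete, so count it as a single medal.
    for m in medals:
        medalsDF.loc[medalsDF[m] > 0, m] = 1
    medalsDF.reset_index(inplace = True)
    return medalsDF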

medals = prepareMedals(olympics, medaltypes)
medals.head(10)

In [None]:
groupForMedals = ['Year','Season','NOC','Team']
medalsTeams = medals.groupby(groupForMedals)[medaltypes].sum()
print(medalsTeams.head(10))

In [None]:
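# Teams that entered several crews in one event carry a numeric suffix such as "-1" or "-2".
# Match only that suffix, so that team names containing a hyphen of their own are left untouched.
teamlist = olympics['Team'][olympics['Team'].str.contains(r"-\d+$")].unique() 
display(teamlist)

In [None]:
# Strip the numeric suffix so that all crews are counted under the same team name.
olympics['Team'] = olympics['Team'].str.replace(r"-\d+$", "", regex=True)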
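
The suffix handling can be illustrated on a few made-up names. The sketch strips only a trailing hyphen followed by digits, so a name that contains a hyphen of its own is left alone. The helper `strip_entry_suffix` and the sample names are hypothetical and not part of the notebook.

```python
import re

# Hypothetical team names: numbered entries plus a name with a hyphen of its own.
teams = ["Germany-1", "Germany-2", "United States", "Guinea-Bissau"]

def strip_entry_suffix(name):
    """Remove a trailing '-<digits>' entry number, leaving other hyphens alone."""
    return re.sub(r"-\d+$", "", name)

cleaned = [strip_entry_suffix(t) for t in teams]
print(cleaned)
```

Cutting a fixed number of characters instead would shorten every hyphenated name, whether or not it ends in an entry number.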

In [None]:
medals = prepareMedals(olympics, medaltypes)
medalsTeams = medals.groupby(groupForMedals)[medaltypes].sum()
medalsTeams.reset_index(inplace = True)
print(medalsTeams.head(20))

In [None]:
# Medal_points stores the weighted score of the medals each team won at each Games.
# We assume:
# - Gold = 3 points
# - Silver = 2 points
# - Bronze = 1 point
# - no medal = 0 points
# The Medals column is the total number of medals won by that team.
medalsTeams['Medal_points'] = (3 * medalsTeams['Gold']) + (2 * medalsTeams['Silver']) + medalsTeams['Bronze']
medalsTeams['Medals'] = medalsTeams['Gold'] + medalsTeams['Silver'] + medalsTeams['Bronze']
medalsTeams = medalsTeams.reindex(['Year', 'Season', 'Team', 'NOC', 'Gold', 'Silver', 'Bronze', 'Medals', 'Medal_points'] , axis=1)
display(medalsTeams.head(10))

In [None]:
totallist = medalsTeams.groupby(['Team'])[medalpoints].sum()
totallist.reset_index(inplace = True)
display(totallist.head(10))

Just for fun, let's plot a word cloud of the countries that won medals!

In [None]:
medalteams = totallist[totallist['Medal_points'] > 0]
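# Weight each team name by its medal points. str() of a Series is only its printed summary, which may be truncated.
wc = WordCloud(background_color='white', max_words=300, max_font_size=20, colormap='plasma').generate_from_frequencies(dict(zip(medalteams['Team'], medalteams['Medal_points'])))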
plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.show()

## Plotting and analyzing <a class="anchor" id="plottinganalyzing"></a>
Let's look at the distribution of medals across countries.  
We will go through medal points first and then each type of medal: Gold, Silver, and Bronze.

In [None]:
f, (ax1, ax2, ax3, ax4, ax5) = plt.subplots(5, 1, figsize=(30, 40), sharex=False)

# First graph to show the top 20 countries from 1994 until 2016 based on the weight of medals.
sb.barplot(data=totallist.sort_values(by='Medal_points').reset_index(drop=True).tail(20), x='Team', y='Medal_points', palette="deep", ax=ax1)
# Iterate through the list of axes' patches
for p in ax1.patches:
    ax1.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()), fontsize=12, color='Blue', ha='center', va='bottom')
ax1.axhline(0, color="k", clip_on=True)
ax1.set_ylabel("Medal points")
ax1.set_xlabel("Country")

# Second graph for medal count
sb.barplot(data=totallist.sort_values(by='Medals').reset_index(drop=True).tail(20), x='Team', y='Medals', palette="deep", ax=ax2)
for p in ax2.patches:
    ax2.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()), fontsize=12, color='Blue', ha='center', va='bottom')
ax2.axhline(0, color="k", clip_on=True)
ax2.set_ylabel("Medals")
ax2.set_xlabel("Country")

# 3rd graph for gold medals 
sb.barplot(data=totallist.sort_values(by='Gold', ascending = False).head(20), x='Team', y='Gold', palette="deep", ax=ax3)
for p in ax3.patches:
    ax3.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()), fontsize=12, color='Blue', ha='center', va='bottom')
ax3.axhline(0, color="k", clip_on=True)
ax3.set_ylabel("Gold")
ax3.set_xlabel("Country")

# 4th graph for silver medals 
sb.barplot(data=totallist.sort_values(by='Silver', ascending = False).head(20), x='Team', y='Silver', palette="deep", ax=ax4)
for p in ax4.patches:
    ax4.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()), fontsize=12, color='Blue', ha='center', va='bottom')
ax4.axhline(0, color="k", clip_on=True)
ax4.set_ylabel("Silver")
ax4.set_xlabel("Country")

# 5th graph for bronze medals 
sb.barplot(data=totallist.sort_values(by='Bronze', ascending = False).head(20), x='Team', y='Bronze', palette="deep", ax=ax5)
for p in ax5.patches:
    ax5.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()), fontsize=12, color='Blue', ha='center', va='bottom')
ax5.axhline(0, color="k", clip_on=True)
ax5.set_ylabel("Bronze")
ax5.set_xlabel("Country")

The graphs above give an overall view of how the medals are distributed between countries across both Summer and Winter Games from 1994 to 2016. But in which year did each country perform best across all the Games it took part in? And which team performs best in which season?  
These questions help us narrow down the criteria for each country's success.  
To answer them, we need some code to filter the countries.

In [None]:
countrylist = medalsTeams['Team'].unique()
bestyearofeachcountry = pd.DataFrame(columns=['Team','Year','Medal_points'])
for country in countrylist:
    temp = medalsTeams.loc[medalsTeams['Team']==country].sort_values(by='Medal_points', ascending = False).head(1)[['Team','Year','Medal_points']]
    frames = [bestyearofeachcountry, temp]
    bestyearofeachcountry = pd.concat(frames)
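
The per-country loop can also be expressed with a grouped `idxmax`. The sketch below shows that alternative on a made-up table with the same three columns. It is an illustration under that assumption, not the notebook's method, and ties may be resolved differently than with sorting.

```python
import pandas as pd

# Toy medal table with the same columns the loop uses (values are made up).
toy = pd.DataFrame({
    "Team": ["A", "A", "B", "B"],
    "Year": [1996, 2000, 1996, 2000],
    "Medal_points": [10, 25, 7, 3],
})

# Row label of the highest Medal_points within each team, then select those rows.
best_rows = toy.loc[toy.groupby("Team")["Medal_points"].idxmax()]
best = best_rows[["Team", "Year", "Medal_points"]].reset_index(drop=True)
print(best)
```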

In [None]:
bestyearofeachcountry = bestyearofeachcountry.loc[bestyearofeachcountry['Medal_points']>10]

# Medal_points has dtype object after the concatenation above, so convert it to float.
# (np.float was removed from NumPy; the built-in float is used instead.)
bestyearofeachcountry = bestyearofeachcountry.astype({'Medal_points': float})
g, (ax1) = plt.subplots(1, 1, figsize=(20, 10))
sb.scatterplot(data = bestyearofeachcountry, x = 'Team', y = 'Year', size ='Medal_points', sizes=(10,1000), hue ='Medal_points', palette="Set1", ax=ax1)
ax1.axhline(0, color="k", clip_on=True)
ax1.set(ylim=(1990, 2020))
ax1.set_ylabel("Year")
ax1.set_xlabel("Country")
for item in ax1.get_xticklabels():
    item.set_rotation(90)

Whoa, the graph looks nice with the bubbles (the bubble idea is borrowed from [Hans Rosling](https://en.wikipedia.org/wiki/Hans_Rosling), the master of data visualization).  
We can see some large bubbles, such as the US, China, Russia, Great Britain, Australia, and Germany.  
The Olympic Games are held in two seasons. Let's check whether that assumption holds in the data.  
Next, let's see how the records are split between the seasons.

In [None]:
medalsTeams.groupby('Season').count()

We will need to split the data into two subsets, one for each season, and check the distribution of the athletes taking part in the events.  
We only need to work with the top 10 countries, and we will handle each season in turn.  
Let's put them into a pie chart showing each country's code and its share of athlete entries.  
How about the distribution of athlete entries in Summer?

In [None]:
# offsets used to pull the largest slices out of the pie
explode = (0.3, 0.2, 0.1, 0.025, 0.025, 0.025, 0.025, 0.025, 0.025, 0.025)

# Athlete participation by country
summerdata = olympics[olympics.Season == 'Summer'] # focus only on Summer Olympics (exclude Winter Olympics)
summercount = summerdata.groupby('NOC').count()
summerdf = pd.DataFrame(summercount, columns = ['Name']) 
summersorted = summerdf.sort_values(['Name'], ascending = False) # number of athlete entries by country

# Top 10 countries by athlete entries
summertop = summersorted.head(10)

# Plot the top 10 countries by athlete entries
figure = summertop.plot(kind='pie', figsize=(7,5),subplots=True, explode = explode, autopct='%1.1f%%', legend = None)
plt.axis('equal')
plt.title("Top 10 countries by athletes participating in SUMMER events")
plt.tight_layout()
plt.show()  

How about the Winter events? Similarly, we will apply the same approach from the summer analysis.

In [None]:
#Athlete participation by country
winterdata = olympics[olympics.Season == 'Winter'] #focus only on Winter olympics (exclude Summer Olympics)
wintercount = winterdata.groupby('NOC').count()
winterdf = pd.DataFrame(wintercount, columns = ['Name']) 
wintersorted = winterdf.sort_values(['Name'], ascending = False) #Total number of athletes by country

# Top 10
wintertop = wintersorted.head(10)
wintertop.sum()

# Same chart as for Summer
figure = wintertop.plot(kind='pie', figsize=(7,5),subplots=True, explode = explode, autopct='%1.1f%%', legend = None)
plt.axis('equal')
plt.title("Top 10 countries by athletes participating in WINTER events")
plt.tight_layout()
plt.show()  

One thing we can see is that the number of participating athletes does not fully reflect the medal results: the country that sends the most athletes does not necessarily dominate the medal table.  
For example, by the medal points calculated above, China is in second place. However, China mainly takes part in the Summer events rather than the Winter events.

Let's compare Summer and Winter Games by the number of events.

In [None]:
eventspergame = pd.DataFrame(olympics.groupby(['Year','Season'])['Event'].nunique())
eventspergame.columns = ['Events']
eventspergame.reset_index(inplace = True)
g, (ax1) = plt.subplots(1, 1, figsize = (20, 5))
sb.lineplot(data = eventspergame,x = 'Year', y = 'Events', hue = 'Season', ax = ax1)

Clearly, the Summer Games have more events, and therefore more medals, than the Winter Games.  
Since there is a big gap between Summer and Winter, we need to separate the seasons when finding each country's best year. This tells us how each country performs in each season. We can then combine the result with GDP and population to see whether there are any links between them.

We need a function that finds the best year of each country, either for one season or regardless of season.  
Hence, the function takes the data and a season indicator as input.  
It also takes a `minimal` threshold, so that rows with no more than `minimal` medal points are dropped.  
To do so, we loop through the countries, filter by season and team, sort the values, keep the top row, and concatenate the results.  
Note: the value column may end up with dtype object, so we convert it to float.

In [None]:
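def getbestyear(dataFrame, season = 'All', minimal = 5, value = 'Medal_points', columns = ['Team','Year','Medal_points'], sort_ascending = False):
    # Keep only results with more than `minimal` medal points.
    dataFrame = dataFrame[dataFrame['Medal_points'] > minimal]
    # 'All' keeps both seasons; otherwise keep only the requested one.
    if season != 'All':
        dataFrame = dataFrame[dataFrame['Season'] == season]
    best = []
    for country in dataFrame['Team'].unique():
        # Best row of this country according to `value`.
        best.append(dataFrame.loc[dataFrame['Team'] == country].sort_values(by = value, ascending = sort_ascending).head(1)[columns])
    res = pd.concat(best) if best else pd.DataFrame(columns = columns)
    # np.float was removed from NumPy; the built-in float is used instead.
    res[value] = res[value].astype(float)
    return res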

Now that we have the best-year values, we need to plot them. Let's use a scatter plot. There are two things to do:
- Define the max and min values of the y axis based on the data.
- Loop through the rows, get the values, and label each scatter point.

In [None]:
def drawbybestyear(dataFrame, season = 'all', value = 'Medal_points', textDivisor = 1):
    # 'summer' and 'winter' are also matplotlib colormap names, so the season doubles as the palette.
    season = season.lower()
    palette = season if season in ('summer', 'winter') else 'viridis'
    g, (ax1) = plt.subplots(1, 1, figsize=(20, 8))
    sb.scatterplot(data = dataFrame, x = 'Team', y = 'Year', size = value, sizes = (5,750), hue = value, alpha = 0.7, edgecolors = "red", palette = palette, legend = False, ax = ax1)
    ax1.axhline(0, color="k", clip_on=True)
    # y axis: one tick per year present in the data, with a margin of one year.
    yearsTicks = sorted(dataFrame['Year'].unique().tolist())
    ax1.set(ylim = (yearsTicks[0] - 1, yearsTicks[-1] + 1), yticks = (yearsTicks))
    ax1.set_ylabel("Year")
    ax1.set_title("Year of best result in " + value, pad = 50, loc = "left")
    # Label each point with its value (the withdash argument is no longer accepted by recent Matplotlib).
    for i, (index, row) in enumerate(dataFrame.iterrows()):
        ax1.text(i, row['Year'] + 1, round(row[value]/textDivisor, 2), color = 'black')
    for item in ax1.get_xticklabels():
        item.set_rotation(90)

Now, let's see the plot for the Summer Games!

In [None]:
bestyearSummer = getbestyear(medalsTeams, "Summer")
drawbybestyear(bestyearSummer, 'Summer')

How about the Winter seasons?

In [None]:
bestyearWinter = getbestyear(medalsTeams, "Winter")
drawbybestyear(bestyearWinter, 'Winter')

Well, some interesting points:
- China did take part in some Winter Games, but its success there is far from its success in the Summer Games (23 vs 223).
- Some countries had fewer than 5 medals in the Winter Games, but overall, the Summer achievements outperformed the Winter ones.
- Winter has fewer events than Summer.
- We cannot see which countries appear in one season or in both.  

We need a combined graph that shows both Summer and Winter Games. To do that, we separate the countries into two groups and get the best year for each. After that, we merge them and draw the graph.
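
The merge can rely on a small coding trick: give Summer the code 1 and Winter the code 2, then sum the codes per team, so that 3 means both. A minimal sketch on hypothetical team lists (the team names are made up):

```python
import pandas as pd

# Hypothetical team lists for each season.
summer = pd.DataFrame({"Team": ["A", "B"], "Game": 1})   # 1 = Summer
winter = pd.DataFrame({"Team": ["B", "C"], "Game": 2})   # 2 = Winter

# Summing the codes per team gives 1 (Summer only), 2 (Winter only) or 3 (both).
codes = pd.concat([summer, winter]).groupby("Team")["Game"].sum()
labels = codes.map({1: "Summer", 2: "Winter", 3: "Both"})
print(labels)
```

The trick works because each team appears at most once per season, so the three possible sums are distinct.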

In [None]:
# For Summer
summerCountries = pd.DataFrame(columns=['Team','Game'])
summerCountries['Team'] = bestyearSummer['Team']
summerCountries['Game'] = 1
# For Winter
winterCountries = pd.DataFrame(columns = ['Team','Game'])
winterCountries['Team'] = bestyearWinter['Team']
winterCountries['Game'] = 2
# Merge Summer and Winter
wsCountries = pd.concat([winterCountries, summerCountries])
wsCountries = wsCountries.groupby(['Team'])['Game'].sum().to_frame()
wsCountries.reset_index(inplace = True)
# Now we plot the merged dataframe
g, (ax1) = plt.subplots(1, 1, figsize = (20, 7))
sb.scatterplot(data = wsCountries, x = 'Team', y = 'Game', s = 1000, ax = ax1, alpha = 0.5, edgecolors = "gray", linewidth = 1, palette="Set2")
ax1.axhline(0.6, clip_on = True)
ax1.set(ylim = (0.6, 3.4), yticks = (1, 2, 3), yticklabels = ('Summer','Winter', 'Both'))
ax1.set_ylabel("Olympic season")
for item in ax1.get_xticklabels():
    item.set_rotation(90)

What we can learn from the above chart:
- The Summer Games always have more countries with notable results than the Winter Games, which is understandable because the Winter Games have fewer events.
- Numerous countries took part in both seasons and performed well.

Now we can see in which season each country performs best. How do the participants from each country contribute to that success? With the number of participants per team, we can later check the coefficient of variation of each country in each season across the Olympic Games.

In [None]:
# create dataframe with unique participants grouped by year, season and team
participants = pd.DataFrame(olympics.groupby(['Year','Season','Team'])['ID'].nunique())
participants.columns = ['UniqueParticipants']
participants.reset_index(inplace=True)
#now we merge it with dataframe we worked with (until now that is)
medalsTeamsParticipants = medalsTeams.merge(participants, on=['Year','Season','Team'])
medalsTeamsParticipants.head()

We could go even deeper into the data, for example to see which country outperformed in which sport by medal points or by Gold, Silver, or Bronze medals.  
However, right now we are interested in how performance relates to the GDP and population of each country.  
First, we need a function that returns the results for all years for each country that scored more than 5 medal points at least once (5 points corresponds to 1 Gold plus 1 Silver).

In [None]:
def resultsallyears(tmpmedalsteams, season):
    season = season[0].upper() + season.lower()[1:]
    if season!='All':
        tmpmedalsteams = tmpmedalsteams[tmpmedalsteams['Season'] == season]
    tmplistcountries = np.unique(tmpmedalsteams.loc[tmpmedalsteams['Medal_points'] > 5, ['Team']].values)
    # dataframe for the list
    numcountries = pd.DataFrame(tmplistcountries)
    numcountries.columns = ['Team']
    # Add index column
    numcountries['Id'] = numcountries.index
    #Merge two dataframes 
    tmpmedalsteams = tmpmedalsteams.loc[tmpmedalsteams['Team'].isin(tmplistcountries)]
    tmpmedalsteams = tmpmedalsteams.merge(numcountries, on = 'Team')
    return tmpmedalsteams

We need to compute the coefficient of variation, because the variance or standard deviation alone is not enough to see who has the most stable results. We need to relate the standard deviation to the average value, and that ratio is the coefficient of variation.  
As usual, we create one function to compute the coefficients and another to draw them, and we apply both to the Winter and Summer Games.
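
As a small worked example, the coefficient of variation is the standard deviation divided by the mean. The sketch below computes it with the standard library on two made-up series that share the same mean but differ in spread. It uses the population standard deviation, which is also the default of `scipy.stats.variation` (`ddof=0`). The helper name and the numbers are illustrative assumptions.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)

# Made-up medal points for two teams with the same mean but different spread.
steady = [20, 21, 19, 20]
erratic = [5, 35, 10, 30]

print(round(coefficient_of_variation(steady), 3))
print(round(coefficient_of_variation(erratic), 3))
```

The lower the coefficient, the more consistent the results from one Games to the next.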

In [None]:
def computecoeffvars(dataFrame):
    rows = []
    for c in dataFrame['Team'].unique():
        points = dataFrame.loc[dataFrame['Team'] == c]['Medal_points']
        # coefficient of variation = standard deviation / mean
        rows.append({'Team': c, 'Coeff': variation(points)})
    # DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows.
    return pd.DataFrame(rows, columns=['Team', 'Coeff'])

Now we write the draw function to be called for Winter and Summer. Here, we use a bar plot.

In [None]:
def drawcoeffvar(dataFrame, palette): # palette for the graph
    maxim = dataFrame.sort_values(by = 'Coeff', ascending = False)['Coeff'].head(1)
    maxim = round((maxim.values[0] * 1.1), 2)
    palette = palette.lower()
    g, (ax1) = plt.subplots(1, 1, figsize = (20, 5))
    sb.barplot(data = dataFrame, x = 'Team', y = 'Coeff', palette = palette, ax = ax1)
    ax1.axhline(0, color = "k", clip_on=True)
    ax1.set(ylim = (0, maxim))
    ax1.set_ylabel("Coefficient of variation")
    # writing the data labels
    i = -0.25
    prevValue = 0
    for index, row in dataFrame.iterrows():
        value = round(row['Coeff'], 2)
        # raise the label slightly when it is close to the previous one, to avoid text over text
        if value != 0 and abs(1 - prevValue / value) < 0.1:
            position = value + (maxim * 0.05)
        else:
            position = value
        prevValue = position
        # the withdash argument is no longer accepted by recent Matplotlib
        ax1.text(i, position, value, color = 'black')
        i += 1
    for item in ax1.get_xticklabels():
        item.set_rotation(90)

Let's draw the coefficient of variation for Summer first.

In [None]:
drawcoeffvar(computecoeffvars(resultsallyears(medalsTeams, 'Summer')) ,'Summer')

How about Winter?

In [None]:
drawcoeffvar(computecoeffvars(resultsallyears(medalsTeams, 'Winter')) ,'Winter')

In the Summer Games, the United States (0.08) and France (0.08) have the lowest coefficients of variation, so they are the most consistent.  
In the Winter Games, Germany (0.19) and Norway (0.16) are the most consistent.

# Relationship between Olympic events and population, GDP <a class="anchor" id="olympic_pop_gdp"></a>
Now it's time to work with the population data.  
First, we need the years in which Olympic Games were held (only those after 1993, of course).

In [None]:
years = np.sort(olympics['Year'].unique())
print(years.tolist())

We will go through all the countries that took part in the Olympics. This gives us a table with one row per country and one column per year. In that table, we fill in: 
- the population of the country in each year, and
- the GDP of the country in each year.

In [None]:
years_string = [str(y) for y in years]
# In the GDP and population data, the country is stored under 'Country Name'.
# We will drop 'Country Name' later, after merging with the Olympic data.
# For now, we keep it so that we can select the columns.
cntrname = 'Country Name'
years_string.insert(0, cntrname)
print(years_string)

In [None]:
populationdata = population[years_string]
display(populationdata.head())

Similarly, we will have GDP data with corresponding years.

In [None]:
gdpdata = gdp[years_string]
display(gdpdata.head())

## Cleaning and merging data <a class="anchor" id="cleaningmerging"></a>
Not all the countries in the GDP data take part in the Olympics, so we need to filter out those that do not.

In [None]:
olympic_countries = olympics['Team'].unique()
gdp_countries = gdpdata['Country Name'].unique()

Let's check the list of countries that compete in the Olympics but are not in the GDP list.

In [None]:
np.setdiff1d(olympic_countries, gdp_countries)

And the countries that are in the GDP list but do not compete in the Olympics:

In [None]:
np.setdiff1d(gdp_countries, olympic_countries)

There are problems with the country names. For example:  
- 'Bahamas' vs 'Bahamas, The'
- 'Great Britain' vs 'United Kingdom'
- etc.  

We need to convert those names to the names used in the Olympic data, since we will merge on them.
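
The harmonization step can be sketched on a few made-up rows: rename the World Bank spellings with a dictionary, then check with set differences that no name is left unmatched on either side. The tiny tables and the `rename` dictionary are illustrative assumptions and much shorter than the real mapping.

```python
import pandas as pd

# Made-up name lists; the real ones come from the Olympic and GDP tables.
olympic_names = pd.Series(["Bahamas", "Great Britain", "Germany"])
gdp_toy = pd.DataFrame({
    "Country Name": ["Bahamas, The", "United Kingdom", "Germany"],
    "GDP": [1.0, 2.0, 3.0],
})

# Map World Bank spellings to the Olympic spellings; unmatched names pass through unchanged.
rename = {"Bahamas, The": "Bahamas", "United Kingdom": "Great Britain"}
gdp_toy["Team"] = gdp_toy["Country Name"].replace(rename)

# Names still missing on either side after renaming.
only_olympic = sorted(set(olympic_names) - set(gdp_toy["Team"]))
only_gdp = sorted(set(gdp_toy["Team"]) - set(olympic_names))
print(only_olympic, only_gdp)
```

Re-running the two difference checks after renaming is a cheap way to spot spellings the mapping has missed.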

In [None]:
change = {
'Bahamas, The':'Bahamas', 
'Cabo Verde':'Cape Verde',
'Congo, Rep.':'Congo (Brazzaville)',
'Russian Federation':'Russia',
'St. Vincent and the Grenadines':'Saint Vincent and the Grenadines',
'Venezuela, RB':'Venezuela',
'Congo, Dem. Rep.':'Congo (Kinshasa)',
'Micronesia, Fed. Sts.':'Federated States of Micronesia',
'Gambia, The':'Gambia',
'Guinea-Bissau':'Guinea Bissau',
'Iran, Islamic Rep.':'Iran',
'St. Kitts and Nevis':'Saint Kitts and Nevis',
'Slovak Republic':'Slovakia',
'Syrian Arab Republic':'Syria',
'Hong Kong SAR, China':'Hong Kong',
'Kyrgyz Republic':'Kyrgyzstan',
'Macedonia, FYR':'Macedonia',
'Korea, Dem. People’s Rep.':'North Korea',
'St. Lucia':'Saint Lucia',
'Timor-Leste':'Timor Leste',
'Brunei Darussalam':'Brunei',
'Egypt, Arab Rep.':'Egypt',
'United Kingdom':'Great Britain',
'Korea, Rep.':'South Korea',
'Virgin Islands (U.S.)':'United States Virgin Islands',
'Yemen, Rep.':'Yemen'}

gdpmelted = gdpdata.melt(id_vars = ['Country Name'], value_vars = years_string[1:], var_name = 'Years')
gdpmelted.columns = ['Team','Year','GDP']
gdpmelted['Country'] = gdpmelted['Team']
# Apply the name changes
for key in change:
    value = change[key]
    gdpmelted.loc[gdpmelted['Team'] == key,'Team'] = value
    
# The year labels were column names (strings), so convert them back to int.
gdpmelted['Year'] = gdpmelted['Year'].apply(int)
gdpmelted.head()

We now have a tidy dataframe. Next, we merge it with the Olympic participants data.

In [None]:
combineddata = pd.merge(medalsTeamsParticipants, gdpmelted, on = ['Team','Year'])
combineddata.drop(['Country'], axis = 1, inplace = True)
combineddata.head()

Similarly, we apply the same melting process to the population data.
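
For readers unfamiliar with melting, the sketch below reshapes a made-up wide table (one column per year) into a long table with one row per country and year, which is the shape needed for merging on `Team` and `Year`. The values and the `wide` and `long` names are illustrative assumptions.

```python
import pandas as pd

# Toy wide table: one row per country, one column per year (values are made up).
wide = pd.DataFrame({
    "Country Name": ["A", "B"],
    "1996": [10, 20],
    "2000": [11, 21],
})

# Wide to long: one row per (country, year) pair.
long = wide.melt(id_vars=["Country Name"], value_vars=["1996", "2000"],
                 var_name="Year", value_name="Population")
long = long.rename(columns={"Country Name": "Team"})
long["Year"] = long["Year"].astype(int)   # the year labels were column names (strings)
print(long)
```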

In [None]:
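populationmelted = populationdata.melt(id_vars = ['Country Name'], value_vars = years_string[1:], var_name = 'Years')
populationmelted.columns = ['Team','Year','Population']
# Use the Olympic team names here too; otherwise renamed countries would be lost in the merge.
populationmelted['Team'] = populationmelted['Team'].replace(change)
populationmelted['Year'] = populationmelted['Year'].apply(int)
populationmelted.head()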

Now, it's time to merge Olympics, GDP, and Population data.

In [None]:
combineddata = pd.merge(combineddata, populationmelted, on = ['Team','Year'])
combineddata.head()

## Analyzing the relationship of Olympic achievement with population and GDP <a class="anchor" id="analyzing_pop_gdp"></a>
Now the GDP and population data are combined with the Olympic data. A few questions remain:
- How can we find the rate of GDP per population, or per medal?
- How can we measure what a medal point is worth in terms of GDP?
- How many medals does a country win relative to its population?

We need to add some features to the combined dataframe.
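
To make the ratios concrete, here is a minimal sketch on made-up figures for two teams. It computes medals per 100,000 inhabitants and GDP per medal point. The numbers are invented for illustration and do not come from the datasets.

```python
import pandas as pd

# Made-up figures for two teams at one Games.
toy = pd.DataFrame({
    "Team": ["A", "B"],
    "Medals": [10, 2],
    "Medal_points": [20, 5],
    "Population": [50_000_000, 2_000_000],
    "GDP": [1.0e12, 4.0e10],
})

# Population expressed in units of 100,000 inhabitants.
per_100k = toy["Population"] / 100_000
toy["MedalsPerPopulation"] = toy["Medals"] / per_100k
toy["GDPWorthPerMedalPoints"] = toy["GDP"] / toy["Medal_points"]
print(toy[["Team", "MedalsPerPopulation", "GDPWorthPerMedalPoints"]])
```

In this toy data the smaller team B wins fewer medals in absolute terms but more per inhabitant, which is the kind of contrast these features are meant to expose.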

In [None]:
combineddata['MedalsPerPopulation'] = combineddata['Medals']/(combineddata['Population']/100000)
combineddata['MedalPointsPerPopulation'] = combineddata['Medal_points']/(combineddata['Population']/100000)
combineddata['GDPWorthPerMedal'] = (combineddata['GDP']/combineddata['Medals'])
combineddata['GDPWorthPerMedalPoints'] = (combineddata['GDP']/(combineddata['Medal_points']))
combineddata['ParticipantsPerPopulation'] = combineddata['UniqueParticipants']/(combineddata['Population']/100000)
combineddata.head()

In [None]:
# Keep a copy so that we can restore the data if a later step modifies it.
working_data = combineddata.copy()

Let's draw the GDP worth per medal point.  
For Summer events.

In [None]:
summerbest = getbestyear(combineddata, "Summer", 10, 'GDPWorthPerMedalPoints', ['Team', 'Year', 'GDPWorthPerMedalPoints'], True)
drawbybestyear(summerbest, 'Summer', 'GDPWorthPerMedalPoints', 1000000000)

For Winter events.

In [None]:
winterbest = getbestyear(combineddata, "Winter", 10, 'GDPWorthPerMedalPoints', ['Team', 'Year', 'GDPWorthPerMedalPoints'], True)
drawbybestyear(winterbest, 'Winter', 'GDPWorthPerMedalPoints', 1000000000)

Comparing these with the medal-point graphs for Summer and Winter, we can notice some points:
- At the most recent Games (2016), the values are higher. This can be explained by the growth of GDP over the years.
- The investment in athletes for the Olympic Games is likely increasing.

Let's compare GDP growth with Olympic success.  
We will test and analyze a few countries: the United States, Germany, and China.

In [None]:
def comparisongraph(country, palette, data1, datalb1, data2, datalb2, d1div = 1, d2div = 1):
    tempData = resultsallyears(combineddata, 'All')
    dataFrame = tempData.loc[tempData['Team'] == country]
    palette = palette.lower()
    yearlabels = dataFrame['Year'].astype(str) # categorical x axis shared by both plots
    g, ax1 = plt.subplots(figsize = (20, 7))
    # bars: first series, scaled by d1div (recent seaborn requires x and y as keywords)
    sb.barplot(x = yearlabels, y = dataFrame[data1] / d1div, ax = ax1, palette = palette)
    ax1.axhline(0, color = "k", clip_on = True)
    # line on a secondary y axis: second series, scaled by d2div
    ax2 = ax1.twinx()
    sb.lineplot(x = yearlabels, y = dataFrame[data2] / d2div, linewidth = 4, marker = 's', markersize = 12, ax = ax2)
    ax2.grid(False)
    ax1.set_title('Comparison of ' + datalb1 + ' and ' + datalb2 + ' through years for ' + country)
    ax1.set_ylabel(datalb1)
    ax2.set_ylabel(datalb2)   

Germany for all seasons.

In [None]:
comparisongraph("Germany", "seismic", "Medal_points", "Medal points", "GDP", "GDP (billion USD)", 1, 1000000000)

United States for all seasons.

In [None]:
comparisongraph("United States", "Rocket", "Medal_points", "Medal points", "GDP", "GDP (billion USD)", 1, 1000000000)

And China.

In [None]:
comparisongraph("China", "deep", "Medal_points", "Medal points", "GDP", "GDP (billion USD)", 1, 1000000000)

**Conclusion on GDP and Olympics performance view**  
As we can see from the three graphs:
- Germany and US have a stable GDP, but the performance is different between these two countries. US likely performs in a stable improvement, while Germany has some downgrade of the achievement.
- China GDP grows fast, but the performance decreased after the outstanding success in 2008, regardless its GDP increasing.

There are some links between the GDP and Olympics, but it's fading, not so clear. However, in bigger picture, the GDP increment has some effects on the Olympics performance, since more investments can spare for the athletes, the competition becomes more challenging.  

Let's look at which country is the most sporty. To do that, we need a column in the combineddata dataframe that holds the number of participants relative to each country's population.
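
As a minimal sketch of how such a column could be derived, the block below divides unique participants by population on a made-up frame. The scaling to "per 100,000 inhabitants" is an assumption, as are the sample values; the column names follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical rows; values are invented for illustration.
toy = pd.DataFrame({
    "Team": ["X", "Y"],
    "Year": [2016, 2016],
    "UniqueParticipants": [200, 500],
    "Population": [5_000_000, 300_000_000],
})

# Assumption: the rate is expressed per 100,000 inhabitants.
PER = 100_000
toy["ParticipantsPerPopulation"] = toy["UniqueParticipants"] / toy["Population"] * PER
print(toy[["Team", "ParticipantsPerPopulation"]])
```

In this toy example the small country X sends fewer athletes in absolute terms but has a much higher participation rate than the large country Y.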

In [None]:
combineddata = working_data.copy()
summermostparticipants = getbestyear(combineddata, "Summer", 5, 'ParticipantsPerPopulation', ['Team', 'Year', 'ParticipantsPerPopulation'], False)
drawbybestyear(summermostparticipants, 'Summer', 'ParticipantsPerPopulation')

In [None]:
wintermostparticipants = getbestyear(combineddata, "Winter", 5, 'ParticipantsPerPopulation', ['Team', 'Year', 'ParticipantsPerPopulation'], False)
drawbybestyear(wintermostparticipants, 'Winter', 'ParticipantsPerPopulation')

For the Summer Games, New Zealand (4.18) appears to have the highest participation rate.  
For the Winter Games, Slovenia and Latvia (both 2.81) are the most sporty.  
Let's plot them for better visualization.

In [None]:
shortlistofsportycountries = ['Latvia','Slovenia', 'New Zealand']
for c in shortlistofsportycountries:
    comparisongraph(c, "deep", "UniqueParticipants", "Unique participants", "Population", "Population (in millions)", 1, 1000000)

From these graphs, we can see:
- Latvia's population is falling, but its number of participants holds up well.
- In Slovenia and New Zealand, the population and the number of participants grow together.

## Summary about the findings <a class="anchor" id="summary_pop_gdp"></a>
1. In terms of number of medals (or medal points), Germany is the most successful in Winter and the United States in Summer.
2. In terms of stability of results, France is the winner in Summer and Norway in Winter.
3. In terms of number of participants relative to population, Slovenia and Latvia lead in Winter, and New Zealand in Summer.
4. Some countries don't participate in the Winter Games, which is understandable given their climate (tropical, hot, or rainy).

# Predictions <a class="anchor" id="predictions"></a>

- Find the average number of gold medals in Summer and Winter
- Label the classes as "Good = 1" and "Bad = 0"
- Compare the classifiers: Decision Tree, Random Forest (ensemble), KNN

We use the 'combineddata' dataset, as it contains countries, medal points, GDP, and population.  
The independent variables: GDP, Population, Team (country).  
The dependent variable: Medal (average).

The goal is to predict whether a country will perform well or badly.

We will:
- Get the top 10 gold medal counts for each year
- Get the average top 10 gold medal threshold

The dataset will be split into 2 sets: training and test.

Then we will apply 4 models one by one, tune the hyperparameters, and plot the complexity curve and 2 learning curves.  
Then we will choose the final model based on a comparison of all 4 models (including AUC).
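
The "average top 10 threshold" idea from the plan can be read in more than one way. The sketch below takes one possible reading: for each year, find the gold count of the n-th best team, then average those values across years and label every row at or above that average as "Good". The column name `Gold`, the helper `top_n_threshold`, and the toy values are assumptions made for illustration (with n = 2 so that the arithmetic is easy to follow).

```python
import pandas as pd

def top_n_threshold(df, value_col="Gold", n=10):
    """Average, over years, of the n-th best value recorded in each year."""
    nth_best = df.groupby("Year")[value_col].apply(lambda s: s.nlargest(n).min())
    return nth_best.mean()

# Hypothetical toy data: three teams in each of two Games.
toy = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2004, 2004, 2004],
    "Team": ["A", "B", "C", "A", "B", "C"],
    "Gold": [10, 6, 2, 12, 8, 1],
})

threshold = top_n_threshold(toy, n=2)          # (6 + 8) / 2
toy["Good"] = (toy["Gold"] >= threshold).astype(int)
print(threshold)
print(toy)
```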

In [None]:
summerdata = combineddata.copy()
summerdata = summerdata[summerdata['Season'] == 'Summer']
summerdata.head()

In [None]:
summerdata["Team"] = summerdata["Team"].astype('category')
summerdata.dtypes

In [None]:
summerdata["Team_Num"] = summerdata["Team"].cat.codes
summerdata.head()

We will fill the missing values with 0.

In [None]:
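# Assign the result back instead of calling fillna(inplace=True) on a column selection
summerdata["GDP"] = summerdata["GDP"].fillna(0)
summerdata["Population"] = summerdata["Population"].fillna(0)
summerdata["UniqueParticipants"] = summerdata["UniqueParticipants"].fillna(0)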
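
Filling with 0 treats a missing GDP or population as a true zero, which can distort distance-based models such as KNN. As an alternative sketch (not what this notebook does), each gap could be filled with the median of the same team. The frame below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in the GDP column.
toy = pd.DataFrame({
    "Team": ["A", "A", "A", "B", "B"],
    "GDP": [1.0, np.nan, 3.0, 10.0, np.nan],
})

# Fill each gap with the median GDP of the same team.
team_median = toy.groupby("Team")["GDP"].transform("median")
toy["GDP_filled"] = toy["GDP"].fillna(team_median)
print(toy)
```

Teams with no known value at all would still be left empty by this approach and would need a separate fallback.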

In [None]:
summerdata.isnull().sum()

Label each row as a winner (1) if the team earned any medal points, and 0 otherwise.

In [None]:
summerdata['Winner'] = summerdata.Medal_points.apply(lambda x: 1 if x > 0 else 0) 
summerdata.head()

Now we need to define the independent and dependent variables for our prediction models.
- Independent variables: Medals, Medal points, UniqueParticipants (athletes), GDP, ParticipantPerPopulation
- Dependent variable: the binary Winner label created above (the classifiers and ROC curves below expect a two-class target)

In [None]:
Independent_var = summerdata.iloc[:, [8,9,10,11,17]].values  
Dependent_var = summerdata.iloc[:, 18].values 
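
Positional indices such as `iloc[:, 18]` silently point at a different column whenever the frame layout changes. A sketch of selecting by name instead is shown below; the frame is made up, and the feature list is an assumption rather than the notebook's actual choice. Note that if a label is derived from a column (for example a flag built from `Medal_points`), keeping that column among the features would leak the answer into the inputs.

```python
import pandas as pd

# Hypothetical frame using column names that appear in this notebook.
toy = pd.DataFrame({
    "Team_Num": [0, 1, 2],
    "GDP": [1.0e12, 5.0e11, 2.0e11],
    "Population": [3.0e8, 8.0e7, 1.0e7],
    "UniqueParticipants": [500, 400, 50],
    "Winner": [1, 1, 0],
})

feature_cols = ["GDP", "Population", "UniqueParticipants", "Team_Num"]  # assumed feature set
target_col = "Winner"

X = toy[feature_cols].values
y = toy[target_col].values
print(X.shape, y)
```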

We separate the dataset into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(Independent_var, Dependent_var, test_size = 0.2, random_state = 10)  
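
When one class is much rarer than the other, a plain random split can leave the test set with very few positive rows. As a hedged sketch, `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. The arrays below are synthetic and stand in for the real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 5 positives out of 20 rows (illustrative only).
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([1] * 5 + [0] * 15)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())  # same share of positives in both parts
```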

We start with KNN.

In [None]:
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

#Create KNN Classifier
knn = KNeighborsClassifier(n_neighbors=5)
#Train the model using the training sets
knn.fit(X_train, y_train)
#Predict the response for test dataset
y_pred = knn.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
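
KNN relies on distances, so a feature with a very large scale (GDP in USD, for instance) can dominate all the others. The sketch below, on synthetic data, compares a plain KNN with one wrapped in a `StandardScaler` pipeline. The data is constructed so that the large-scale column is pure noise and the small-scale column carries the signal; this setup is an assumption chosen to make the effect visible.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
signal = rng.normal(size=n)                      # small-scale, informative
noise = rng.normal(scale=1e12, size=n)           # GDP-like scale, uninformative
X_demo = np.column_stack([noise, signal])
y_demo = (signal > 0).astype(int)

X_tr, X_te = X_demo[:200], X_demo[200:]
y_tr, y_te = y_demo[:200], y_demo[200:]

raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)

raw_acc = raw_knn.score(X_te, y_te)
scaled_acc = scaled_knn.score(X_te, y_te)
print("raw:", raw_acc, "scaled:", scaled_acc)
```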

What does the confusion matrix look like?

In [None]:
from sklearn.metrics import classification_report, confusion_matrix  
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))  
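
To read the printed matrix: in scikit-learn, rows are the true classes and columns are the predicted classes, with labels in sorted order. The small hand-made example below unpacks the four cells and derives precision and recall for class 1 from them.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 1, 0, 1, 1, 0, 1]

# For a binary problem the flattened matrix is: TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
precision = tp / (tp + fp)  # share of predicted winners that really are winners
recall = tp / (tp + fn)     # share of real winners that were found
print(tn, fp, fn, tp, precision, recall)
```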

Let's see whether the error curve is smooth or zigzag.

In [None]:
error = []

# Calculating error for K values from 1 to 39
for i in range(1, 40):  
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error.append(np.mean(pred_i != y_test))

plt.figure(figsize=(12, 6))  
plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o',  
         markerfacecolor='blue', markersize=10)
plt.title('Error Rate vs. K Value')  
plt.xlabel('K Value')  
plt.ylabel('Mean Error')  
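
The loop above measures the error on the test set, so choosing K from that curve lets test information influence the model. A possible alternative, sketched here on synthetic data from `make_classification` (a stand-in, not the Olympic data), is to pick K by cross-validation on the training data only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training split.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=10)

k_values = list(range(1, 40))
cv_error = []
for k in k_values:
    # Mean 5-fold accuracy for this K, converted to an error rate
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5)
    cv_error.append(1.0 - scores.mean())

best_k = k_values[int(np.argmin(cv_error))]
print("best K:", best_k, "CV error:", min(cv_error))
```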

Next, we will use a Random Forest model.

In [None]:
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

#Create a Random Forest classifier
clf=RandomForestClassifier(n_estimators=100)

#Train the model using the training sets
clf.fit(X_train,y_train)

#Predict the response for test dataset
y_pred=clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
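
A fitted random forest also exposes `feature_importances_`, which can hint at which inputs drive the prediction. The sketch below uses synthetic data in which only one column determines the label; the column names are borrowed from this notebook, but the values and the relationship are invented.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    "GDP": rng.normal(size=n),
    "Population": rng.normal(size=n),
    "UniqueParticipants": rng.normal(size=n),
})
# In this synthetic setup only UniqueParticipants drives the label.
label = (demo["UniqueParticipants"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(demo, label)
importances = pd.Series(forest.feature_importances_, index=demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```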

Next, a Decision Tree.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Create Decision Tree classifier object
clf = DecisionTreeClassifier(criterion="entropy", max_depth=20)

# Train the Decision Tree classifier
clf = clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
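
A `max_depth` of 20 allows a very deep tree, which may simply memorize the training set. One way to check, sketched here on synthetic data, is a complexity curve built with `validation_curve`: it reports training and cross-validated accuracy for several depths. The depth grid and the dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=10)

depths = [1, 2, 4, 8, 16, 20]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    X_demo, y_demo,
    param_name="max_depth", param_range=depths, cv=5,
)
mean_train = train_scores.mean(axis=1)  # one value per depth
mean_val = val_scores.mean(axis=1)
for d, tr, va in zip(depths, mean_train, mean_val):
    print(d, round(tr, 3), round(va, 3))
```

A widening gap between training and validation accuracy as depth grows is the usual sign of overfitting.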

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import matplotlib.pyplot as plt

plt.figure()

# Add the models to the list that you want to view on the ROC plot
models = [
    {
        'label': 'Random Forest',
        'model': RandomForestClassifier(n_estimators=100),
    },
    {
        'label': 'Decision Tree',
        'model': DecisionTreeClassifier(max_depth=20),
    },
    {
        'label': 'Neural Network',
        'model': MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=1000),
    },
    {
        'label': 'Logistic Regression',
        'model': LogisticRegression(),
    },
]

# Fit each model in the list and add its ROC curve to the plot
for m in models:
    model = m['model'] # select the model
    model.fit(X_train, y_train) # train the model
    # Predicted probability of the positive class
    y_score = model.predict_proba(X_test)[:, 1]
    # Compute false positive rate and true positive rate
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_score)
    # Area under the curve, computed from the same scores as the curve itself
    auc = metrics.roc_auc_score(y_test, y_score)
    # Now, plot the computed values
    plt.plot(fpr, tpr, label='%s ROC (area = %0.2f)' % (m['label'], auc))
# Custom settings for the plot
#plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('1-Specificity(False Positive Rate)')
plt.ylabel('Sensitivity(True Positive Rate)')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()   # Display
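
The area shown in a ROC legend should be computed from the same continuous scores that draw the curve. Computing it from hard `predict()` labels collapses the curve to a single threshold and usually gives a different, lower number. A tiny illustration with made-up values:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
proba = [0.1, 0.2, 0.3, 0.8]            # predicted probability of class 1
hard = [int(p >= 0.5) for p in proba]   # what predict() would return at a 0.5 cut-off

auc_from_scores = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, hard)
print(auc_from_scores, auc_from_labels)
```

Here the scores rank both positives above both negatives, so the ranking is perfect, while the thresholded labels miss one positive and report a lower area.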

In [None]:
# We skip autosklearn for now because it requires additional package installation
#import autosklearn.classification
#import sklearn.model_selection
#import sklearn.datasets
#import sklearn.metrics
#X, y = sklearn.datasets.load_digits(return_X_y=True)
#X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)
#automl = autosklearn.classification.AutoSklearnClassifier()
#automl.fit(X_train, y_train)
#y_hat = automl.predict(X_test)
#print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_hat))

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

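# Note: this TPOT demo runs on scikit-learn's digits dataset, not on the Olympic data,
# and it overwrites X_train, X_test, y_train, y_test defined earlier
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, train_size=0.75, test_size=0.25)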

pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=42, verbosity=2)
pipeline_optimizer.fit(X_train, y_train)
print(pipeline_optimizer.score(X_test, y_test))

# References <a class="anchor" id="references"></a>
- [120 years of Olympic history: athletes and results](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results)
- [Let's discover more about the Olympic Games!](https://www.kaggle.com/marcogdepinto/let-s-discover-more-about-the-olympic-games)
- [Olympic games results vs GDP vs Population](https://www.kaggle.com/vhlaca/olympic-games-results-vs-gdp-vs-population)
- [World Bank Data (1960 to 2016)](https://www.kaggle.com/gemartin/world-bank-data-1960-to-2016)
- [Olympic winner prediction - Athletics](https://www.kaggle.com/jbhosal/olympic-winner-prediction-athletics)
- [World Happiness Report Analysis](https://www.kaggle.com/sabihaif/world-happiness-report-analysis)
- [Olympics Data- Cleaning, Exploration, Prediction](https://www.kaggle.com/chadalee/olympics-data-cleaning-exploration-prediction)