# Network Analysis of Podiuming NOCs
We want to create a network that showcases the relationships between countries that meet on the podium of an event at each of the Games. We want to do this for every Summer and Winter Games separately, and then perhaps for the Summer and Winter Games as a whole, to get a broader view of these relationships over time.
What will we need?
*	The three NOCs on the podium for every event at every Olympics (edges).
	Each podium will be expanded into pairs, e.g. (USA, UK, RUS) becomes <br>
	-> USA,UK <br>
	-> UK,RUS <br>
	-> RUS,USA <br>

* A list of the NOCs for each Olympics (nodes).
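
The expansion of one podium into edges can be sketched as follows. This is a minimal illustration, assuming a podium is a plain tuple of NOC codes; `podium_to_edges` is a hypothetical helper name and is not used elsewhere in this notebook. Because `itertools.combinations` yields unordered pairs, the pair written as RUS,USA above comes out as USA,RUS, which is the same edge in an undirected network.

```python
from itertools import combinations

def podium_to_edges(podium):
    """Return every unordered pair of NOCs that shared a podium (hypothetical helper)."""
    return [a + ',' + b for a, b in combinations(podium, 2)]

edges = podium_to_edges(('USA', 'UK', 'RUS'))
print(edges)
```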


In [1]:
import os.path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
%matplotlib inline
from bs4 import BeautifulSoup
import webbrowser
import urllib.request
from lxml import html
import zipfile
import re
import string
import sys, os
from IPython.display import Image
import itertools

In [2]:
# Ensure the file exists
if not os.path.exists( r"..\..\data\prep\Games\Games-Unedited-500.csv" ):
    print("Missing dataset file")

In [3]:
# read the medal csv into a dataframe
df = pd.read_csv( r"..\..\data\prep\Games\Games-Unedited-500.csv" , encoding = "ISO-8859-1")

In [4]:
df.head(3)

Unnamed: 0,Ath_Name,Host_City,Discipline,Event,Event_Gender,Gender,Medal,NOC,Sport,Summer,Winter,Year
0,"WEBSTER, Robert",Rome,Diving,10m platform,M,Men,Gold,USA,Aquatics,True,False,1960
1,"TOBIAN, Gary",Rome,Diving,10m platform,M,Men,Silver,USA,Aquatics,True,False,1960
2,"PHELPS, Brian",Rome,Diving,10m platform,M,Men,Bronze,GBR,Aquatics,True,False,1960


In [5]:
# dropping the columns we don't need (Ath_Name, Event_Gender and Sport), keeping only the ones used below
df = df.drop(df.columns[[0, 4, 8]], axis=1)

In [6]:
df.head(3)

Unnamed: 0,Host_City,Discipline,Event,Gender,Medal,NOC,Summer,Winter,Year
0,Rome,Diving,10m platform,Men,Gold,USA,True,False,1960
1,Rome,Diving,10m platform,Men,Silver,USA,True,False,1960
2,Rome,Diving,10m platform,Men,Bronze,GBR,True,False,1960


In [7]:
# Here we're sorting the df so we can find out which countries are podiuming together 
df = df.sort_values(by=['Host_City', 'Year', 'Summer', 'Winter', 'Gender', 'Event'], ascending=False)

# Multi-athlete events
Below we showcase the situation where multiple athletes from each country take part in an event, meaning more than three medals are given out (3 * team size). We have to remove all but one of each team's medal rows, because we only want a team win to show up as a single occurrence of a podium for a country. <br></br>
Another situation to consider is a draw, or fewer than three medals being awarded due to unforeseen circumstances such as drug disqualifications.
We're not yet sure why fewer than three medals were awarded in some events, but we have seen some examples of the boxing draw occurring.
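
To make the team-medal issue concrete, here is a small sketch on toy rows modelled on a two-man bobsleigh event (one row per athlete). The `toy` frame and its three columns are illustrative assumptions, not the real dataset. Once the athlete name is gone, team-mates' rows are identical, so `drop_duplicates` leaves one row per medal and NOC.

```python
import pandas as pd

# Toy rows: two athletes per team, so six medal rows for three podium places.
toy = pd.DataFrame({
    'Event': ['two-man'] * 6,
    'Medal': ['Bronze', 'Bronze', 'Gold', 'Gold', 'Silver', 'Silver'],
    'NOC':   ['RUS', 'RUS', 'GER', 'GER', 'GER', 'GER'],
})

# Identical rows collapse to the first occurrence.
podium = toy.drop_duplicates(keep='first')
print(podium['NOC'].tolist())
```

One caveat of this approach: two different athletes from the same NOC who share the same medal in an individual event would also be collapsed into a single row.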

In [8]:
df.head(10)

Unnamed: 0,Host_City,Discipline,Event,Gender,Medal,NOC,Summer,Winter,Year
27223,Vancouver,Bobsleigh,two-man,M,Bronze,RUS,False,True,2010
27224,Vancouver,Bobsleigh,two-man,M,Bronze,RUS,False,True,2010
27225,Vancouver,Bobsleigh,two-man,M,Gold,GER,False,True,2010
27226,Vancouver,Bobsleigh,two-man,M,Gold,GER,False,True,2010
27227,Vancouver,Bobsleigh,two-man,M,Silver,GER,False,True,2010
27228,Vancouver,Bobsleigh,two-man,M,Silver,GER,False,True,2010
27554,Vancouver,Alpine Skiing,super-G,M,Bronze,USA,False,True,2010
27555,Vancouver,Alpine Skiing,super-G,M,Gold,NOR,False,True,2010
27556,Vancouver,Alpine Skiing,super-G,M,Silver,USA,False,True,2010
27542,Vancouver,Alpine Skiing,slalom,M,Bronze,SWE,False,True,2010


In [9]:
# here we're dropping the duplicate team medals, leaving one row per team so the relationship stays intact
df = df.drop_duplicates(keep='first')

In [10]:
df.head(10)

Unnamed: 0,Host_City,Discipline,Event,Gender,Medal,NOC,Summer,Winter,Year
27223,Vancouver,Bobsleigh,two-man,M,Bronze,RUS,False,True,2010
27225,Vancouver,Bobsleigh,two-man,M,Gold,GER,False,True,2010
27227,Vancouver,Bobsleigh,two-man,M,Silver,GER,False,True,2010
27554,Vancouver,Alpine Skiing,super-G,M,Bronze,USA,False,True,2010
27555,Vancouver,Alpine Skiing,super-G,M,Gold,NOR,False,True,2010
27556,Vancouver,Alpine Skiing,super-G,M,Silver,USA,False,True,2010
27542,Vancouver,Alpine Skiing,slalom,M,Bronze,SWE,False,True,2010
27543,Vancouver,Alpine Skiing,slalom,M,Gold,ITA,False,True,2010
27544,Vancouver,Alpine Skiing,slalom,M,Silver,CRO,False,True,2010
27409,Vancouver,Luge,singles,M,Bronze,ITA,False,True,2010


In [11]:
# Re-sorting and resetting the index, discarding the previous index column
df = df.sort_values(by=['Year', 'Winter', 'Gender', 'Discipline', 'Event', 'Medal']).reset_index(drop=True)

In [12]:
# Changing the order of the columns 
df = df[['Year', 'Host_City', 'Summer', 'Winter', 'Discipline', 'Event', 'Gender', 'NOC' , 'Medal']]

In [13]:
df.head(1)

Unnamed: 0,Year,Host_City,Summer,Winter,Discipline,Event,Gender,NOC,Medal
0,1960,Rome,True,False,Athletics,10000m,Men,AUS,Bronze


# Creating the relationships
We've taken out the duplicate medals for team wins; now we have to create the relationships shared by podiuming countries. A relationship exists between countries that win a medal in the same event. We have to get these relationships for each event, split into the different Olympics by year and by whether they are Summer or Winter Games. That way we can create a sub-network for every Olympics, and at the end a network that covers all the Olympics since 1960, although we'll probably still split that into Summer and Winter Games.

In [14]:
df.head(10)

Unnamed: 0,Year,Host_City,Summer,Winter,Discipline,Event,Gender,NOC,Medal
0,1960,Rome,True,False,Athletics,10000m,Men,AUS,Bronze
1,1960,Rome,True,False,Athletics,10000m,Men,URS,Gold
2,1960,Rome,True,False,Athletics,10000m,Men,EUA,Silver
3,1960,Rome,True,False,Athletics,100m,Men,GBR,Bronze
4,1960,Rome,True,False,Athletics,100m,Men,EUA,Gold
5,1960,Rome,True,False,Athletics,100m,Men,USA,Silver
6,1960,Rome,True,False,Athletics,110m hurdles,Men,USA,Bronze
7,1960,Rome,True,False,Athletics,110m hurdles,Men,USA,Gold
8,1960,Rome,True,False,Athletics,110m hurdles,Men,USA,Silver
9,1960,Rome,True,False,Athletics,1500m,Men,HUN,Bronze


# How? 
I think the easiest way to do this is to get an ordered list of all the NOCs that won a medal (podiumed) at each Games. If the list is ordered like the df above, we know each set of three shares a podium relationship. Once we have these lists, one for each Games, we can turn each list from the form ['GER', 'USA', 'CHN', ...] into ['GER,USA', 'GER,CHN', 'USA,CHN', ...], so we end up with a list of relationships for each Olympics. 
This is crucial because one easy way to create a network of relationships is to have CSVs containing the nodes (unique NOC names) and the relationships (the lists we talked about above). 
<br></br>
The other consideration is the situation of draws and of fewer than three medals being awarded, as discussed above. A flat list of all the NOCs sorted like ['GER', 'USA', 'CHN', ...], taking every three to be a podium, might not represent the relationships correctly, because a draw could put four NOCs on one podium. <br></br>
This issue is addressed below by putting each event's podium into a separate list and putting all these event lists into a list that represents the given Olympic Games. Then we put all of these Games lists into a final list. 
<br></br>
To be clear: we will have a main list of Olympic Games lists, and each Games list will contain a list for every event at that Games, holding every NOC that reached the podium for that event. 
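
The nested structure described above (Games, then events, then NOCs) could also be built with `groupby`, as in the rough sketch below. This is an alternative illustration, not the loop this notebook actually runs. The `toy` frame is made-up data, and including `Gender` in the event key is my assumption to keep a men's and a women's event of the same name apart.

```python
import pandas as pd

# Made-up rows, already sorted the way the df above is sorted.
toy = pd.DataFrame({
    'Year':       [1960, 1960, 1960, 1960, 1960, 1964],
    'Host_City':  ['Rome'] * 5 + ['Tokyo'],
    'Gender':     ['Men'] * 6,
    'Discipline': ['Athletics'] * 6,
    'Event':      ['100m', '100m', '100m', '1500m', '1500m', '100m'],
    'NOC':        ['GBR', 'EUA', 'USA', 'HUN', 'AUS', 'USA'],
})

games_keys = ['Year', 'Host_City']
event_keys = ['Gender', 'Discipline', 'Event']  # Gender is an added assumption

nested = []
# sort=False keeps the groups in their order of first appearance.
for _, games in toy.groupby(games_keys, sort=False):
    events = [grp['NOC'].tolist()
              for _, grp in games.groupby(event_keys, sort=False)]
    nested.append(events)

print(nested)
```

An event with only two medallists simply produces a two-element list, and a draw would produce a four-element list, so no fixed podium size is assumed.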

In [15]:
# first I'll create a list to hold all of the lists for each olympics 
olympicGsList = []
# Then a list to hold the values for the current olympics Games
currGamesList = []
# Then a list to hold the values for the current olympics events 
currEventList = []

In [16]:
# The lastyear and lastHost variables are needed so we can track when the games change in the iteration
lastyear = df['Year'].iloc[0]
lastHost = df['Host_City'].iloc[0]
# The lastEvent and lastDis variables are needed so we can track when the events change in the iteration
lastEvent = df['Event'].iloc[0]
lastDis = df['Discipline'].iloc[0]

for x, row in df.iterrows():
    
    # current year and host to compare with those of the previous row 
    curryear = row['Year']
    currHost = row['Host_City']
    
    # variable for the country 
    country = row['NOC']
    
    # current event and discipline 
    currEvent = row['Event']
    currDis = row['Discipline']
    
    
    # as long as the current host and year are the same we're in the same games
    # so each list representing an event is added to the currGamesList
    if(curryear == lastyear and currHost == lastHost):
        
        # as long as the current event and discipline are the same, these NOCs are added to the currEventList 
        if( currEvent == lastEvent and currDis == lastDis):
            currEventList.append(country)
        else:
            # when the event changes we add the list of the last event to the list for the current games 
            currGamesList.append(currEventList)
            # we reset the currEventsList for a new event 
            currEventList = []
            # we add the first NOC to the new event list 
            currEventList.append(country)
        
        
    
    # if the games changes then we add the currGamesList containing a full olympics to the olympicGsList
    # then we reset currGamesList and move to the next olympics 
    else:
        # add last event to the currGames
        currGamesList.append(currEventList)
        # add currGames to the olympicGsList
        olympicGsList.append(currGamesList)
        
        # reset currGamesList and currEventList for the new games and events 
        currGamesList = []
        currEventList = []
        
        # we need this here so that when one games changes to another we don't miss that first NOC
        currEventList.append(country)
        
        
        
    # give the last year and host variables their new values 
    lastyear = curryear
    lastHost = currHost
    # give the last event and discipline variables their new values 
    lastEvent = currEvent
    lastDis = currDis
    

# This just appends the last event to currGamesList and then this last games to olympicGsList 
currGamesList.append(currEventList)
olympicGsList.append(currGamesList)

In [17]:
#This nested for loop makes sure we got every NOC into a list of its event 
sumb = 0
suma = 0
for x in range(len(olympicGsList)): 

    for y in range(len(olympicGsList[x])):
        suma = suma +  len(olympicGsList[x][y])
    
    sumb = sumb + suma 
    suma = 0
    
if (sumb == len(df)):
    print("We got all the NOCS!")
    

We got all the NOCS!


**Quick note:** .append places each list at the end of the list. <br>
This means our list will start with the NOCs from 1960 and end with those from 2018.


# Nodes before relationships
Before I take the lists in olympicGsList and create all the relationships, I want to get the nodes for each of the Olympics, which, as I said before, are just the unique NOCs that won medals at each Olympics. Our big list olympicGsList holds all the NOCs that won medals at every Olympics, but there will obviously be lots of duplicates in each of these lists, as countries win in multiple events. To get a list of unique NOCs for each Olympics, we can transform each list into a set, which holds only its unique values, and then transform it back into a list. This is what I do below. 
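
As a compact illustration of the set idea, the sketch below works on a made-up miniature of olympicGsList (the name `games_events` and its contents are assumptions for the example). It uses `sorted` rather than `list` only so that the result has a predictable order; a bare set has no guaranteed ordering.

```python
# Hypothetical miniature of olympicGsList: games -> events -> NOCs.
games_events = [
    [['AUS', 'URS', 'EUA'], ['GBR', 'EUA', 'USA']],
    [['USA', 'USA', 'USA']],
]

# One set comprehension per games collects the unique NOCs across its events.
nodes_per_games = [
    sorted({noc for event in games for noc in event})
    for games in games_events
]
print(nodes_per_games)
```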

In [18]:
# This will hold one list for every olympics containing all the NOCs from each of its events 
# We'll then transform these lists in nodeList to sets and back to lists to get all the unique values 
nodeList = [] 

In [19]:
# We use three nested for loops
# The first goes through the list containing one list for every olympic games 
# The second goes through an olympic games list so we can access the different event lists in that games 
# The third goes through the NOCs in each of these event lists

# This list collects the NOCs of one games before it is appended to nodeList
currNodeList = []

for x in range(len(olympicGsList)): 

    for y in range(len(olympicGsList[x])):

        for j in range(len(olympicGsList[x][y])):

            # appending each NOC for each event 
            currNodeList.append(olympicGsList[x][y][j])

    # appending all the NOCs from a single olympics to nodeList
    nodeList.append(currNodeList)
    currNodeList = []

In [20]:
# this is the final nodes list with all the unique NOCs for each games. So these will be the nodes for our networks 
fnodesList = []

for x in range(len(nodeList)):
    
    # just getting the unique NOCs for all of the nodes 
    fnodesList.append(list(set(nodeList[x])))

# Creating the relationships 
As I said before, we are now left with olympicGsList, which contains a list for every Olympic Games. Each of these lists holds separate lists, one for each of the events at that Games. 
The structure of an event list is like so: ['USA', 'UK', 'CAN']. For an event like this, the relationships would be 'USA,UK', 'USA,CAN' and 'UK,CAN'. We have to do this for every event at every Games. Once we create the relationships we can keep a single list of relationships for each Games; they do not have to be separated into events.  

In [21]:
# This list will contain the list of relationships for each olympic games 
relationList = [] 

# Relationship for Loop
The second nested for loop is where the relationships are created. olympicGsList[x][y] is an event list inside one of the Games lists. The third, innermost loop takes this list and produces all the possible relationships, as in the example above. It appends these relationships to currNodeList and does the same for every event at that Games. After the last event is done, this list of relationships for the given Games is appended to relationList. The process repeats until relationList contains a list of relationships for every Games. Note that when one NOC holds more than one medal in an event, this also produces self-pairs such as 'USA,USA'. 
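
The pairing step copes with podiums of any size, which the sketch below illustrates. `event_pairs` and its `drop_self_loops` flag are hypothetical names introduced for this example, and the four-way podium is invented data. Dropping self-pairs is optional and is not something the notebook does; it is shown only as one possible treatment of a medal sweep.

```python
from itertools import combinations

def event_pairs(nocs, drop_self_loops=False):
    """All unordered NOC pairs for one event (hypothetical helper)."""
    pairs = []
    for a, b in combinations(nocs, 2):
        if drop_self_loops and a == b:
            continue  # skip pairs such as USA,USA
        pairs.append(a + ',' + b)
    return pairs

tie = event_pairs(['CUB', 'USA', 'GBR', 'KOR'])       # invented four-NOC podium
sweep = event_pairs(['USA', 'USA', 'USA'])             # one NOC takes all medals
sweep_clean = event_pairs(['USA', 'USA', 'USA'], drop_self_loops=True)

print(len(tie), sweep, sweep_clean)
```

A four-NOC podium yields six pairs and a two-NOC podium yields one, so the number of edges per event follows n * (n - 1) / 2 rather than always being three.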

In [22]:
# This list collects the relationships of one games before it is appended to relationList
currNodeList = []

# This list will contain the list of relationships for each olympic games 
relationList = [] 

for x in range(len(olympicGsList)): 

    for y in range(len(olympicGsList[x])):
        for pair in itertools.combinations(olympicGsList[x][y], 2):
            rel = pair[0] + ',' + pair[1]
            currNodeList.append(rel)
    # appending all the relationships from a single olympics to relationList
    relationList.append(currNodeList)
    currNodeList = []
            

In [23]:
relationList

[['AUS,URS',
  'AUS,EUA',
  'URS,EUA',
  'GBR,EUA',
  'GBR,USA',
  'EUA,USA',
  'USA,USA',
  'USA,USA',
  'USA,USA',
  'HUN,AUS',
  'HUN,FRA',
  'AUS,FRA',
  'FRA,ITA',
  'FRA,USA',
  'ITA,USA',
  'GBR,URS',
  'GBR,AUS',
  'URS,AUS',
  'URS,POL',
  'URS,URS',
  'POL,URS',
  'RSA,USA',
  'RSA,EUA',
  'USA,EUA',
  'USA,USA',
  'USA,USA',
  'USA,USA',
  'GBR,EUA',
  'GBR,URS',
  'EUA,URS',
  'BWI,USA',
  'BWI,EUA',
  'USA,EUA',
  'POL,NZL',
  'POL,EUA',
  'NZL,EUA',
  'ITA,GBR',
  'ITA,SWE',
  'GBR,SWE',
  'BWI,NZL',
  'BWI,BEL',
  'NZL,BEL',
  'URS,USA',
  'URS,TPE',
  'USA,TPE',
  'USA,USA',
  'USA,USA',
  'USA,USA',
  'POL,URS',
  'POL,HUN',
  'URS,HUN',
  'USA,URS',
  'USA,URS',
  'URS,URS',
  'HUN,URS',
  'HUN,EUA',
  'URS,EUA',
  'URS,USA',
  'URS,USA',
  'USA,USA',
  'NZL,ETH',
  'NZL,MAR',
  'ETH,MAR',
  'FIN,USA',
  'FIN,USA',
  'USA,USA',
  'USA,USA',
  'USA,USA',
  'USA,USA',
  'URS,POL',
  'URS,URS',
  'POL,URS',
  'BRA,USA',
  'BRA,URS',
  'USA,URS',
  'TCH,EUA',
  'TCH,ITA',

# List containing info on Games
The DataFrame I create below can be used to get information on the Olympics in relationList. The indexes of both reference the same Olympics, so we can match the relationships of an Olympics to any other information on that Olympics.  

In [36]:
# If we need the host city, year or anything else about an olympics we can look up an index with .iloc[]
# and use the same index on relationList to match the relationships to whatever info we want. 
gameInfoDf =  df.drop_duplicates(subset=['Year', 'Host_City'], keep='first').reset_index()
# dropping the old index column
gameInfoDf = gameInfoDf[['Year','Host_City','Summer','Winter','Discipline','Event','Gender','NOC','Medal']]

In [37]:
gameInfoDf.head(5)

Unnamed: 0,Year,Host_City,Summer,Winter,Discipline,Event,Gender,NOC,Medal
0,1960,Rome,True,False,Athletics,10000m,Men,AUS,Bronze
1,1960,Squaw Valley,False,True,Alpine Skiing,downhill,F,AUT,Bronze
2,1964,Tokyo,True,False,Athletics,10000m,Men,AUS,Bronze
3,1964,Innsbruck,False,True,Alpine Skiing,downhill,F,AUT,Bronze
4,1968,Mexico,True,False,Athletics,10000m,Men,TUN,Bronze


# Creating the csvs for Gephi
First we'll create a function that produces a nodes csv and an edges csv for a given Olympics. You reference the Olympics you want by its index in gameInfoDf above. 

In [38]:
# this is a function to get the csvs for the nodes and relationships of a given games 
# the number you must enter is the index of the olympics in relationList and gameInfoDf

def getNetworkCSVs(gNum):
    # creating two lists: the source is one of the NOCs in a relationship and the target is the other
    source = []
    relation = relationList[gNum]
    target = []

    # this for loop separates the relationships and puts one NOC in each column, source and target
    for i in relation:
        source.append(i.split(',')[0])
        target.append(i.split(',')[1])

    # create a df for these lists and equate the lists to the columns 
    dfRel = pd.DataFrame(columns=['Source', 'Target'])
    dfRel['Source'] = source
    dfRel['Target'] = target

    # creating a dataframe for all the nodes of a games 
    nodes = fnodesList[gNum]
    dfNodes = pd.DataFrame(nodes)

    # setting a variable for whether the games is summer or winter 
    if(gameInfoDf['Summer'].iloc[gNum]):
        gamesType = 'Summer'
    else:
        gamesType = 'Winter'

    # create the directory where the csvs for the edges and nodes are placed
    gamesDir = r"..\..\data\analysis\Games_Networks\{}".format(str(gameInfoDf['Year'].iloc[gNum]) + '_' + str(gameInfoDf['Host_City'].iloc[gNum]) + '_' + gamesType)
    if not os.path.exists(gamesDir):
        os.makedirs(gamesDir)

    # both csvs live in the same directory
    nodepath = os.path.join(gamesDir, 'nodes.csv')
    edgepath = os.path.join(gamesDir, 'edges.csv')

    # writing the dataframes made above to csvs in the directory made above
    dfNodes.to_csv(nodepath, index=False)
    dfRel.to_csv(edgepath, index=False)
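
For reference, here is a small sketch of what the two tables might look like as CSV text. As far as I know, Gephi's spreadsheet import looks for an `Id` column in a nodes table and `Source`/`Target` columns in an edges table; naming the node column `Id` is therefore my assumption and differs from the function above, which writes the default column name. The data is a toy example.

```python
import pandas as pd

# Toy nodes and relationships for a single games.
nodes = ['AUS', 'EUA', 'URS']
relations = ['AUS,URS', 'AUS,EUA', 'URS,EUA']

# Assumed header names: 'Id' for nodes, 'Source'/'Target' for edges.
df_nodes = pd.DataFrame({'Id': nodes})
df_edges = pd.DataFrame([r.split(',') for r in relations],
                        columns=['Source', 'Target'])

# to_csv without a path returns the CSV as a string, handy for a quick look.
nodes_csv = df_nodes.to_csv(index=False)
edges_csv = df_edges.to_csv(index=False)
print(nodes_csv)
print(edges_csv)
```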

Now we can loop over every Games to produce all of the csvs needed to build the networks. 

In [39]:
for i in range(len(gameInfoDf)):
    getNetworkCSVs(i)

# Network Analysis of Summer and Winter games as a whole
Now that we have csvs for the nodes and edges of every Olympics since 1960, we can do a more general analysis of the Summer and Winter Games as a whole. We can gather all the relationships from the Summer Games and from the Winter Games and create one big network for each. To do this we'll use relationList together with gameInfoDf, which tells us whether a Games is Summer or Winter. We'll follow the same process to get the nodes for the Summer and Winter Games.  

In [40]:
# We'll have two lists, one for the summer games relationships and one for the winter games relationships 
summerRels = []
winterRels = [] 
# We'll also have two lists for the summer and winter nodes 
summerNodes = []
winterNodes = []

In [41]:
# The for loop just checks whether the current games is winter or summer
# All the lists below hold each olympics at the same index, which makes this quite easy
for x in range(len(relationList)):
    
    if(gameInfoDf['Summer'].iloc[x]):
        summerRels.append(relationList[x])
        summerNodes.append(fnodesList[x])
    else:
        winterRels.append(relationList[x])
        winterNodes.append(fnodesList[x])

Now that we have all the Olympics lists split into Summer and Winter, we can break down the individual lists and form four big lists: two for the Winter nodes and edges and two for the Summer nodes and edges.  
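
The flattening step can be sketched with `itertools.chain`, as below. The toy `summer_rels` list is invented. The `Counter` part is an optional extra I am assuming might be useful: it treats 'URS,AUS' and 'AUS,URS' as the same undirected edge and counts how often each pair met, which is not something the cells in this notebook do.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for summerRels: one list of relationships per games.
summer_rels = [['AUS,URS', 'URS,EUA'], ['URS,AUS', 'USA,URS']]

# Flatten the per-games lists into one list of edges.
summer_edges = list(chain.from_iterable(summer_rels))

def undirected(rel):
    """Normalise 'B,A' and 'A,B' to the same key (hypothetical helper)."""
    a, b = rel.split(',')
    return ','.join(sorted((a, b)))

# Count how many times each unordered pair shared a podium.
weights = Counter(undirected(r) for r in summer_edges)
print(summer_edges)
print(weights)
```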

In [42]:
# the final list of summer olympic games relationships 
summerEdges = []

In [43]:
for x in range(len(summerRels)):
    
    for y in range(len(summerRels[x])):
        summerEdges.append(summerRels[x][y])

In [44]:
# the final list of winter olympic games relationships 
winterEdges = []

In [45]:
for x in range(len(winterRels)):
    
    for y in range(len(winterRels[x])):
        winterEdges.append(winterRels[x][y])

In [46]:
# the final list of summer olympic games nodes, i.e. unique NOCs
summerFNodes = []

In [47]:
for x in range(len(summerNodes)):
    
    for y in range(len(summerNodes[x])):
        summerFNodes.append(summerNodes[x][y])
        
summerFNodes = list(set(summerFNodes))

In [48]:
# the final list of winter olympic games nodes, i.e. unique NOCs
winterFNodes = []

In [49]:
for x in range(len(winterNodes)):
    
    for y in range(len(winterNodes[x])):
        winterFNodes.append(winterNodes[x][y])
        
winterFNodes = list(set(winterFNodes))

# Creating the DataFrames for the summer and winter edges and nodes
As we did for the networks of the individual Olympics, we must turn the lists of nodes and edges into DataFrames so they can be written out in the right format for Gephi.

In [50]:
# Summer games first  
# creating two lists: the source is one of the NOCs in a relationship and the target is the other
source = []
target = []

# this for loop separates the relationships and puts one NOC in each column, source and target
for i in summerEdges:
    source.append(i.split(',')[0])
    target.append(i.split(',')[1])

# create a df for these lists and equate the lists to the columns 
sumEdgesdf = pd.DataFrame(columns=['Source', 'Target'])
sumEdgesdf['Source'] = source
sumEdgesdf['Target'] = target

# creating a dataframe for all the nodes of the summer games 
sumNodesdf = pd.DataFrame(summerFNodes)

# Creating the Summer network folder
sumpath = r"..\..\data\analysis\Summer_Games_Network"
if not os.path.exists(sumpath):
    os.makedirs(sumpath)

    
# creating the csvs for the dataframes    
sumNodesdf.to_csv(os.path.join(sumpath, "SummerNodes.csv"), index=False)
sumEdgesdf.to_csv(os.path.join(sumpath, "SummerEdges.csv"), index=False)

In [51]:
# Winter games  
# creating two lists: the source is one of the NOCs in a relationship and the target is the other
source = []
target = []

# this for loop separates the relationships and puts one NOC in each column, source and target
for i in winterEdges:
    source.append(i.split(',')[0])
    target.append(i.split(',')[1])

# create a df for these lists and equate the lists to the columns 
winEdgesdf = pd.DataFrame(columns=['Source', 'Target'])
winEdgesdf['Source'] = source
winEdgesdf['Target'] = target

# creating a dataframe for all the nodes of the winter games 
winNodesdf = pd.DataFrame(winterFNodes)

# Creating the Winter network folder
winpath = r"..\..\data\analysis\Winter_Games_Network"
if not os.path.exists(winpath):
    os.makedirs(winpath)

    
# creating the csvs for the dataframes    
winNodesdf.to_csv(os.path.join(winpath, "WinterNodes.csv"), index=False)
winEdgesdf.to_csv(os.path.join(winpath, "WinterEdges.csv"), index=False)