Skip to content

Latest commit

 

History

History
1175 lines (982 loc) · 36 KB

visualization_football.md

File metadata and controls

1175 lines (982 loc) · 36 KB

Visualization of Football Matches using the Lean Startup Method

Lean Startup has been a breakthrough during the last years in the entrepreneurial landscape. But this movement is not only a methodology that can be used to create successful businesses, it can also be applied to other domains like Data Science. In this post, I discuss how to apply the Lean Startup method to a Data Science project. As an example, I created a project that visualizes all football matches that took place in the UEFA Champions League since 1955.

There are several methodologies to develop data science projects. At Microsoft, we developed and intensively use the Team Data Science Process (TDSP). This methodology enables us to efectively implement projects while collaborating between teams inside and outside the company.

In some situations, when starting a Data Science project, we have a clear view of the business case, access to all customer data and a clear roadmap of what the customer wants. In that situation, the Lean Startup process has little value. However, when there is uncertainty, when the customer doesn't know what he wants or when we don't know if a product is going to be sucessful, then the Lean Startup method can prove its benefit.

Implementing the Lean Startup method in the TDSP is easy, we can use all the tools TDSP proposes. In the Lean Startup method the priority is to reduce (or eliminate) the uncertainty to understand what the customer really wants.

So following the Lean Startup method, first we have to set the hyphotesis. The next step is to build a Minimun Viable Product (MVP). The MVP has two important features, it helps us to validate the hypothesis or ideas we proposed and it is a complete, end to end product with the minimum number of features. The final step is to show the MVP to the customer and measure its impact.

#Load all libraries
import os,sys  
import pandas as pd
import numpy as np
import xarray as xr
import datashader as ds
import datashader.transfer_functions as tf
from datashader import reductions
from datashader.colors import colormap_select, Hot, inferno
from datashader.bokeh_ext import InteractiveImage
from bokeh.palettes import Greens3, Blues3, Blues4, Blues9, Greys9
from bokeh.plotting import figure, output_notebook
from bokeh.tile_providers import WMTSTileSource, STAMEN_TONER, STAMEN_TERRAIN
from functools import partial
import wget
import zipfile
import math
from difflib import SequenceMatcher

output_notebook()
#print(sys.path)
print(sys.version)
<div class="bk-root">
    <a href="http://bokeh.pydata.org" target="_blank" class="bk-logo bk-logo-small bk-logo-notebook"></a>
    <span id="76c440ce-2298-48ae-a60f-7d5e47bbe87c">Loading BokehJS ...</span>
</div>




3.5.2 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:53:06) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]

Visualization of Champions League matches

Visualization is one of the key parts in a Data Science project. It allows us to get a global sense of our data and to understand better our results.

There are many free and non-free tools in the market to make data visualization. One of my favourites is datashader, an open source python library that allows to visualize big amounts of data with a clean and nice API.

MVP1: Initial match visualization

We can easily create a visualization of the Champion League matches from 1955 to 2016 using datashader. For that we need a dataset of the matches, such as this one and the coordinates of the stadiums of the teams, that you can find here. This last dataset has the stadiums coordinates only of teams form England, Spain, France, Germany and Scotland. We will still need the data from other countries such as Italy, Portugal, Netherlands and many others, but for our first MVP we don't care about it, our objective is to reach a minimum product as fast as possible to reduce uncertainty, and 5 countries is enough for now.

The first step is to treat the data.

df_stadium = pd.read_csv("stadiums.csv", usecols=['Team','Stadium','Latitude','Longitude','Country'])
print("Number of rows: %d" % df_stadium.shape[0])
dd1 = df_stadium.take([0,99, 64, 121])
dd1
Number of rows: 164
Team Stadium Latitude Longitude Country
0 Arsenal Emirates Stadium 51.555000 -0.108611 England
99 Real Madrid Santiago Bernabéu 40.453060 -3.688350 Spain
64 Bayern Munich Allianz Arena 48.218775 11.624753 Germany
121 Paris Saint-Germain Parc des Princes 48.841389 2.253056 France

The next step is to match the club names in the dataset of coordinates with the those in the dataset of matches. They are similar but not always exactly the same, for example, in the dataset of coordinates we have Real Madrid FC and in the dataset of matches we have Real Madrid. Furthermore, in the first one there are several entries for some teams, like Atletico Madrid, Atletico Madrid B or Atletico Madrid C meaning they are the teams from the first division and from other divisions.

df_match = pd.read_csv('champions.csv', usecols=['Date','home','visitor','hcountry','vcountry'])
df_match = df_match.rename(columns = {'hcountry':'home_country', 'vcountry':'visitor_country'})
df_teams_champions = pd.concat([df_match['home'], df_match['visitor']])
teams_champions = set(df_teams_champions)
print("Number of teams that have participated in the Champions League: %d" % len(teams_champions))
print("Number of matches in the dataset: %d" % df_match.shape[0])
df_match.head()
Number of teams that have participated in the Champions League: 502
Number of matches in the dataset: 6554
Date home visitor home_country visitor_country
0 1955-09-04 Sporting CP Partizan Belgrade POR SRB
1 1955-09-07 Budapesti Voros Lobogo RSC Anderlecht HUN BEL
2 1955-09-08 Servette Geneve Real Madrid SUI ESP
3 1955-09-14 Rot-Weiss Essen Hibernian FC GER SCO
4 1955-09-20 Djurgardens IF Gwardia Warszawa SWE POL

To find the string similarity you can use different methods. Here we will use a simple method to calculate it with difflib.

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def get_info_similar_team(team, df_stadium, threshold=0.6, verbose=False):
    max_rank = 0
    max_idx = -1
    stadium = "Unknown"
    latitude = np.NaN
    longitude = np.NaN
    for idx, val in enumerate(df_stadium['Team']):
        rank = similar(team, val)
        if rank > threshold:
            if(verbose): print("%s and %s(Idx=%d) are %f similar." % (team, val, idx, rank))
            if rank > max_rank:
                if(verbose): print("New maximum rank: %f" %rank)
                max_rank = rank
                max_idx = idx
                stadium = df_stadium['Stadium'].iloc[max_idx]
                latitude = df_stadium['Latitude'].iloc[max_idx]
                longitude = df_stadium['Longitude'].iloc[max_idx]
    return stadium, latitude, longitude
print(get_info_similar_team("Real Madrid FC", df_stadium, verbose=True))
print(get_info_similar_team("Atletico de Madrid FC", df_stadium, verbose=True))
print(get_info_similar_team("Inter Milan", df_stadium, verbose=True))
 
Real Madrid FC and Real Madrid (Idx=99) are 0.923077 similar.
New maximum rank: 0.923077
('Santiago Bernabéu ', 40.453060000000001, -3.6883499999999998)
Atletico de Madrid FC and Atlético Madrid (Idx=87) are 0.810811 similar.
New maximum rank: 0.810811
('Vicente Calderón ', 40.401719, -3.7206059999999996)
('Unknown', nan, nan)

The next step is to create a dataframe relating each match with the stadium coordinates of each team

%%time
df_match_stadium = df_match
home_stadium_index = df_match_stadium['home'].map(lambda x: get_info_similar_team(x, df_stadium))
visitor_stadium_index = df_match_stadium['visitor'].map(lambda x: get_info_similar_team(x, df_stadium))
df_home = pd.DataFrame(home_stadium_index.tolist(), columns=['home_stadium', 'home_latitude', 'home_longitude'])
df_visitor = pd.DataFrame(visitor_stadium_index.tolist(), columns=['visitor_stadium', 'visitor_latitude', 'visitor_longitude'])
df_match_stadium = pd.concat([df_match_stadium, df_home, df_visitor], axis=1, ignore_index=False)
CPU times: user 1min 50s, sys: 460 ms, total: 1min 51s
Wall time: 1min 51s
print("Number of missing values for home teams: %d out of %d" % (df_match_stadium['home_stadium'].value_counts()['Unknown'], df_match_stadium.shape[0]))
df1 = df_match_stadium['home_stadium'] == 'Unknown'
df2 = df_match_stadium['visitor_stadium'] == 'Unknown'
n_complete_matches = df_match_stadium.shape[0] - df_match_stadium[df1 | df2].shape[0]
print("Number of matches with complete data: %d out of %d" % (n_complete_matches, df_match_stadium.shape[0]))
df_match_stadium.head()
Number of missing values for home teams: 4499 out of 6554
Number of matches with complete data: 728 out of 6554
Date home visitor home_country visitor_country home_stadium home_latitude home_longitude visitor_stadium visitor_latitude visitor_longitude
0 1955-09-04 Sporting CP Partizan Belgrade POR SRB El Molinón 43.536111 -5.637222 Unknown NaN NaN
1 1955-09-07 Budapesti Voros Lobogo RSC Anderlecht HUN BEL Unknown NaN NaN Unknown NaN NaN
2 1955-09-08 Servette Geneve Real Madrid SUI ESP Unknown NaN NaN Santiago Bernabéu 40.453060 -3.688350
3 1955-09-14 Rot-Weiss Essen Hibernian FC GER SCO Unknown NaN NaN Easter Road 55.961667 -3.165556
4 1955-09-20 Djurgardens IF Gwardia Warszawa SWE POL Unknown NaN NaN Unknown NaN NaN

Now, even though there are many entries in the dataset that don't have any value, we are going to create a dataframe with the teams that do have values and advance in the project. This dataframe finds the combination of teams (home and visitor) that have values and concatenate each other to create the map.

def aggregate_dataframe_coordinates(dataframe):
    df = pd.DataFrame(index=np.arange(0, n_complete_matches*3), columns=['Latitude','Longitude'])
    count = 0
    for ii in range(dataframe.shape[0]):
        if dataframe['home_stadium'].loc[ii]!= 'Unknown' and dataframe['visitor_stadium'].loc[ii]!= 'Unknown':
            df.loc[count] = [dataframe['home_latitude'].loc[ii], dataframe['home_longitude'].loc[ii]]
            df.loc[count+1] = [dataframe['visitor_latitude'].loc[ii], dataframe['visitor_longitude'].loc[ii]]
            df.loc[count+2] = [np.NaN, np.NaN]
            count += 3
    return df
df_agg = aggregate_dataframe_coordinates(df_match_stadium)
df_agg.head()
Latitude Longitude
0 53.4631 -2.29139
1 51.4926 7.45184
2 NaN NaN
3 55.8532 -4.30926
4 43.7233 7.25876

We have to transform the latitude and longitude coordinates to web mercator format in order to be able to represent it in a map using bokeh. Mercator coordinates are a cilindrical projection of the World coordinates. It was invented in 1569 by Gerardus Mercator and became the standard format for nautical purposes. The web mercator format is an adaptation of the original mercator format and it is currently used by most modern map systems such as Google Maps, Bing Maps or OpenStreetMaps.

def to_web_mercator(yLat, xLon):
    # Check if coordinate out of range for Latitude/Longitude
    if (abs(xLon) > 180) and (abs(yLat) > 90):  
        return
 
    semimajorAxis = 6378137.0  # WGS84 spheriod semimajor axis
    east = xLon * 0.017453292519943295
    north = yLat * 0.017453292519943295
 
    northing = 3189068.5 * math.log((1.0 + math.sin(north)) / (1.0 - math.sin(north)))
    easting = semimajorAxis * east
 
    return [easting, northing]
df_agg_mercator = df_agg.apply(lambda row: to_web_mercator(row['Latitude'], row['Longitude']), axis=1)
df_agg_mercator.head()
Latitude Longitude
0 -255076.256689 7.069114e+06
1 829535.256912 6.708890e+06
2 NaN NaN
3 -479704.406257 7.529248e+06
4 808041.021713 5.422726e+06

The next step is to plot the trayectories in the map using datashader

plot_width  = 850
plot_height = 600
x_range = (-1.9e6, 5.9e6)
y_range = (3.7e6, 9.0e6)
def create_image(x_range=x_range, y_range=y_range, w=plot_width, h=plot_height, cmap=None):
    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
    agg = cvs.line(df_agg_mercator, 'Latitude', 'Longitude',  ds.count())
    #img = tf.shade(agg, cmap=reversed(Blues3), how='eq_hist')
    #img = tf.shade(agg, cmap=reversed(Greens3), how='eq_hist')    
    img = tf.shade(agg, cmap=cmap, how='eq_hist')
    return img

def base_plot(tools='pan,wheel_zoom,reset',plot_width=plot_width, plot_height=plot_height,**plot_args):
    p = figure(tools=tools, plot_width=plot_width, plot_height=plot_height,
        x_range=x_range, y_range=y_range, outline_line_color=None,
        min_border=0, min_border_left=0, min_border_right=0,
        min_border_top=0, min_border_bottom=0, **plot_args)
    
    p.axis.visible = False
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    
    return p
ArcGIS=WMTSTileSource(url='http://server.arcgisonline.com/ArcGIS/rest/services/World_Street_Map/MapServer/tile/{Z}/{Y}/{X}.png')
p = base_plot()
p.add_tile(ArcGIS)
#InteractiveImage(p, create_image, cmap=inferno)
<bokeh.models.renderers.TileRenderer at 0x7fd69392f5f8>

The function InteractiveImage is not rendered in github. I made a snapshot and show it here:

Matches between teams in the Champions League

This is the first MVP which provides a final product. It is imperfect, it doesn't have all the teams in Europe and some of the matches are wronly represented. However, we got some data, transform it and plot it in a map.

MVP2: Improving the map to show all teams in Europe

Now that we have the map, we can start to improve it. If you are into football, you will notice that there are several points in the north of Spain, that corresponds to Sporting de Gijon. Sadly for Sporting supporters, they have never reached to the Champions. Instead, Sporting Clube de Portugal has participated several times in the championship, but since the current dataset doesn't have teams from Portugal, the system mistakenly thinks that Sporting CP from champions.csv is the Sporting de Gijon from stadiums.csv. So lets fix this issue by getting the stadiums coordinates from the rest of the countries in Europe.
We can get that info from wikidata. Using SPARQL language we can get the information we need:

SELECT ?clubLabel ?venueLabel ?coordinates ?countryLabel  WHERE {
  ?club wdt:P31 wd:Q476028.
  ?club wdt:P115 ?venue.
  ?venue wdt:P625 ?coordinates.
  ?club wdt:P17 ?country.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?clubLabel

This generates 4435 Results in 10111 ms that can be saved to a csv file.

df_stadium_read = pd.read_csv('stadiums_wikidata.csv', usecols=['clubLabel','venueLabel','coordinates','countryLabel'])
df_stadium_read.tail()
clubLabel venueLabel coordinates countryLabel
4430 ŠK SFM Senec NTC Senec Point(17.410412 48.217703) Slovakia
4431 ŠK Slovan Bratislava Štadión Pasienky Point(17.141111 48.166111) Slovakia
4432 Żurrieq F.C. Ta' Qali National Stadium Point(14.415372222 35.894863888) Malta
4433 ŽFK Crvena zvezda Red Star Stadium Point(20.464914 44.7832) Serbia
4434 ŽP Šport Podbrezová Stadium Kolkáreň Point(19.535 48.8167) Slovakia

The first step is to clean the column coordinates. For that we will use a regex pattern. The pattern [-+]?[0-9]*\.?[0-9]+ finds any signed float in a string. Then we create two patterns separated by a space and name the columns using this format: (?P<Longitude>). Finally we have to concatente the club information with the coordinates.

df_temp = df_stadium_read['coordinates'].str.extract('(?P<Longitude>[-+]?[0-9]*\.?[0-9]+) (?P<Latitude>[-+]?[0-9]*\.?[0-9]+)', expand=True)
df_stadium_new = pd.concat([df_stadium_read['clubLabel'],df_stadium_read['venueLabel'], df_temp, df_stadium_read['countryLabel']], axis=1) 
df_stadium_new = df_stadium_new.rename(columns = {'clubLabel':'Team', 'venueLabel':'Stadium','countryLabel':'Country'})
print("Number of rows: %d" % df_stadium_new.shape[0])
unique_teams_stadium = list(set(df_stadium_new['Team']))
print("Unique team's name number: %d" % len(unique_teams_stadium))
df_stadium_new.take(list(range(3388,3393)))
Number of rows: 4435
Unique team's name number: 4314
Team Stadium Longitude Latitude Country
3388 Real Madrid Juan Ramón Loubriel Stadium -66.150833 18.393333 United States of America
3389 Real Madrid Galeshewe Stadium 24.740857 -28.716786 South Africa
3390 Real Madrid C Ciudad Real Madrid -3.61166667 40.47916667 Spain
3391 Real Madrid C.F. Ciudad Real Madrid -3.61166667 40.47916667 Spain
3392 Real Madrid FC Estadio Santiago Bernabéu -3.68835 40.45306 Spain

As it can be seen in the previous dataframe, we came into another problem. The new dataset contains all team instances, all over the world. There are some teams that have the same name in different countries, there is a Real Madrid team from USA and South Africa, and a similar name within a country. In our case, we are only interested in the instance Real Madrid, Estadio Santiago Bernabeu, -3.68835, 40.45306, Spain. So how can we find an automated way to filter the correct teams that have participated in the Champions League?

A practical approach is to combine an automated and manual way. With the data we have so far we can automatically filter the two first entries using the country. We can get the country info from champions.csv dataset. To distinguish teams from the same country we will filter them manually.

The first step then is to get a dataframe with all the teams that have participated in Champions and their country of origin. Then we have to remove the repeated entries and rename the country code to the country name that can be found in wikidata.

df_match_home = df_match[['home','home_country']]
df_match_home = df_match_home.rename(columns={'home':'Team','home_country':'Country'})
df_match_visitor = df_match[['visitor','visitor_country']]
df_match_visitor = df_match_visitor.rename(columns={'visitor':'Team','visitor_country':'Country'})
df_champions_teams = pd.concat([df_match_home,df_match_visitor], axis=0, ignore_index=True)
df_champions_teams = df_champions_teams.drop_duplicates()
print("Number of unique teams: %d" % df_champions_teams.shape[0])
country_dict = {'ALB':'Albania',
                'AND':'Andorra',
                'ARM':'Armenia',
                'AUT':'Austria',
                'AZE':'Azerbaijan',
                'BEL':'Belgium',
                'BIH':'Bosnia and Herzegovina',
                'BLR':'Belarus',
                'BUL':'Bulgaria',
                'CRO':'Croatia',
                'CYP':'Cyprus',
                'CZE':'Czech Republic',
                'DEN':'Denmark',
                #'ENG':'England',
                'ENG':'United Kingdom',
                'ESP':'Spain',
                'EST':'Estonia',
                'FIN':'Finland',
                'FRA':'France',
                'FRO':'Feroe Islands',
                'GEO':'Georgia',
                'GER':'Germany',
                'GIB':'Gibraltar',
                'GRE':'Greece',
                'HUN':'Hungary',
                'ITA':'Italy',
                'IRL':'Ireland',
                'ISL':'Iceland',
                'ISR':'Israel',
                'KAZ':'Kazakhstan',
                'LTU':'Lithuania',
                'LUX':'Luxembourg',
                'LVA':'Latvia',
                'MDA':'Moldova',
                'MKD':'Macedonia',
                'MLT':'Malta',
                'MNE':'Montenegro',
                'NED':'Netherlands',
                #'NIR':'Northern Ireland',
                'NIR':'United Kingdom',
                'NOR':'Norwey',
                'POL':'Poland',
                'POR':'Portugal',
                'ROU':'Romania',
                'RUS':'Russia',
                #'SCO':'Scotland',
                'SCO':'United Kingdom',
                'SMR':'San Marino',
                'SRB':'Serbia',
                'SUI':'Switzerland',
                'SVK':'Slovakia',
                'SVN':'Slovenia',
                'SWE':'Sweden',
                'TUR':'Turkey',
                'UKR':'Ukrania',
                #'WAL':'Wales',
                'WAL':'United Kingdom'}
df_champions_teams['Country'].replace(country_dict, inplace=True)
#df_champions_teams.to_csv('match_unique.csv')# To check that the mapping is correct
df_champions_teams.sort_values(by='Team',inplace=True)
df_champions_teams = df_champions_teams.reset_index(drop=True)
df_champions_teams.head()
Number of unique teams: 502
Team Country
0 AC Chievo Verona Italy
1 AC Milan Italy
2 AC Sparta Praha Czech Republic
3 ACF Fiorentina Italy
4 AE Larisa Greece

Once we have the list of all teams that have participated in the Champions League, we have to generate a new dataset relating each Champions League matches with the coordinates of the team stadiums. For that we will use the function similar to match a the name of the team in the different datasets similarly as we did before.

Once the csv has been generated, let's manually erase the combinations that are not correct and save everything in a new file. We won't correct those entries that are not matched, a Data Science project is better out than perfect!!

%%time
def get_info_similar_team_country(team, country, df_stadium, df, threshold, verbose):
    team2 = "Unknown"
    stadium = "Unknown"
    latitude = np.NaN
    longitude = np.NaN
    cols = list(df)
    for idx, val in enumerate(df_stadium['Team']):
        rank = similar(team, val)
        if rank > threshold and country == df_stadium['Country'].iloc[idx]:
            if(verbose): print("%s and %s(Idx=%d) are %f similar and from the same country %s." 
                               % (team, val, idx, rank, country))
            team2 = df_stadium['Team'].iloc[idx]
            stadium = df_stadium['Stadium'].iloc[idx]
            latitude = df_stadium['Latitude'].iloc[idx]
            longitude = df_stadium['Longitude'].iloc[idx]
            dtemp = pd.DataFrame([[team, team2, stadium, latitude, longitude, country]], columns=cols)
            df = df.append(dtemp, ignore_index=True)
    #if there is no match, register it
    if(team2 == "Unknown"):
        df_nomatch = pd.DataFrame([[team, team2, stadium, latitude, longitude, country]], columns=cols)
        df = df.append(df_nomatch, ignore_index=True)
    return df

def generate_new_stadium_dataset(df_champions_teams, df_stadium_new, threshold=0.6, verbose=False):
    df = pd.DataFrame(columns=['Team', 'Team2', 'Stadium', 'Latitude','Longitude','Country'])
    for idx, row in df_champions_teams.iterrows():
        df = get_info_similar_team_country(row['Team'],row['Country'], df_stadium_new, df, 
                                           threshold=threshold, verbose=verbose)
    return df

verbose = False # You can change this to True to see all the combinations
threshold = 0.5
df_stadiums_champions = generate_new_stadium_dataset(df_champions_teams, df_stadium_new, threshold, verbose)
df_stadiums_champions.to_csv('stadiums_champions.csv', index=False)
CPU times: user 2min 9s, sys: 240 ms, total: 2min 9s
Wall time: 2min 9s

After we filtered the entries in the csv, let's load again the data and repeat the process.

df_stadiums_champions = pd.read_csv('stadiums_champions_filtered.csv', usecols=['Team','Stadium','Latitude','Longitude','Country'])
df_stadiums_champions.head()
Team Stadium Latitude Longitude Country
0 AC Chievo Verona Stadio Marc'Antonio Bentegodi 45.435356 10.968647 Italy
1 AC Milan Stadio Giuseppe Meazza 45.478080 9.124000 Italy
2 AC Sparta Praha Generali Arena 50.099716 14.416122 Czech Republic
3 ACF Fiorentina Stadio Artemio Franchi 43.780822 11.282258 Italy
4 AE Larisa AEL FC Arena 39.615610 22.399490 Greece

As previously, we create a dataframe that relates each match with the coordinates of its stadium

df_match_stadium_new= df_match
home_stadium_index = df_match_stadium_new['home'].map(lambda x: get_info_similar_team(x, df_stadiums_champions))
visitor_stadium_index = df_match_stadium_new['visitor'].map(lambda x: get_info_similar_team(x, df_stadiums_champions))
df_home = pd.DataFrame(home_stadium_index.tolist(), columns=['home_stadium', 'home_latitude', 'home_longitude'])
df_visitor = pd.DataFrame(visitor_stadium_index.tolist(), columns=['visitor_stadium', 'visitor_latitude', 'visitor_longitude'])
df_match_stadium_new = pd.concat([df_match_stadium_new, df_home, df_visitor], axis=1, ignore_index=False)
df1 = df_match_stadium_new['home_stadium'] == 'Unknown'
df2 = df_match_stadium_new['visitor_stadium'] == 'Unknown'
n_complete_matches = df_match_stadium_new.shape[0] - df_match_stadium_new[df1 | df2].shape[0]
print("Number of matches with complete data: %d out of %d" % (n_complete_matches, df_match_stadium_new.shape[0]))
df_match_stadium_new.head()
Number of matches with complete data: 5110 out of 6554
Date home visitor home_country visitor_country home_stadium home_latitude home_longitude visitor_stadium visitor_latitude visitor_longitude
0 1955-09-04 Sporting CP Partizan Belgrade POR SRB Estádio José Alvalade 38.761194 -9.160783 Partizan Stadium 44.788611 20.459167
1 1955-09-07 Budapesti Voros Lobogo RSC Anderlecht HUN BEL Unknown NaN NaN Constant Vanden Stock Stadium 50.834167 4.298333
2 1955-09-08 Servette Geneve Real Madrid SUI ESP Stade de Genève 46.177778 6.127500 Estadio Santiago Bernabéu 40.453060 -3.688350
3 1955-09-14 Rot-Weiss Essen Hibernian FC GER SCO Stadion Essen 51.486667 6.976389 Easter Road 55.961667 -3.165556
4 1955-09-20 Djurgardens IF Gwardia Warszawa SWE POL Stockholm Olympic Stadium 59.345278 18.078889 Stadion Polonii Warszawa 52.255517 21.001322

The next step is to aggregate the coordinates and transform them to mercator format.

df_agg = aggregate_dataframe_coordinates(df_match_stadium_new)
df_agg_mercator = df_agg.apply(lambda row: to_web_mercator(row['Latitude'], row['Longitude']), axis=1)
print("Number of rows: %d" % df_agg_mercator.shape[0])
df_agg_mercator.head()
Number of rows: 15330
Latitude Longitude
0 -1.019774e+06 4.687522e+06
1 2.277504e+06 5.588304e+06
2 NaN NaN
3 6.821102e+05 5.808884e+06
4 -4.105852e+05 4.932000e+06

Finally, we plot the coordinates in the map.

%%time
#InteractiveImage(p, create_image, cmap=inferno)
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 7.63 µs

Finally, the image for github:

Matches between teams in the Champions League

This is the second MVP. In this case we have the correct connections between teams in Europe.

MVP3: improving design

If you take a look at the map you will notice that it is fairly ugly. The final quality of a product will drop if we don't show a good visualization. The next step is to improve the map changing the background and colors.

ArcGIS2 = WMTSTileSource(url='http://tile.stamen.com/toner-background/{Z}/{X}/{Y}.png')
p2 = base_plot()
p2.add_tile(ArcGIS2)
cmap = reversed(Blues3)
#cmap = reversed(Greens3)
#InteractiveImage(p2, create_image, cmap=cmap)

Finally, the image for github:

Matches between teams in the Champions League

We developed three different MVPs iterating on the project. The Lean Startup method provides a very easy way to remove uncertainty and get a minimum product very quickly that we can show to the customer.

Happy data sciencing!