Lean Startup has been a breakthrough in the entrepreneurial landscape over the last few years. But this movement is not only a methodology for creating successful businesses; it can also be applied to other domains, like Data Science. In this post, I discuss how to apply the Lean Startup method to a Data Science project. As an example, I created a project that visualizes all football matches played in the UEFA Champions League since 1955.
There are several methodologies for developing data science projects. At Microsoft, we developed and intensively use the Team Data Science Process (TDSP). This methodology enables us to effectively implement projects while collaborating with teams inside and outside the company.
In some situations, when starting a Data Science project, we have a clear view of the business case, access to all customer data and a clear roadmap of what the customer wants. In that situation, the Lean Startup process has little value. However, when there is uncertainty, when the customer doesn't know what they want or when we don't know whether a product is going to be successful, the Lean Startup method can prove its worth.
Implementing the Lean Startup method within the TDSP is easy; we can use all the tools TDSP proposes. In the Lean Startup method, the priority is to reduce (or eliminate) uncertainty in order to understand what the customer really wants.
So, following the Lean Startup method, first we have to set the hypotheses. The next step is to build a Minimum Viable Product (MVP). The MVP has two important features: it helps us validate the hypotheses or ideas we proposed, and it is a complete, end-to-end product with the minimum number of features. The final step is to show the MVP to the customer and measure its impact.
# Load all libraries
import os, sys
import pandas as pd
import numpy as np
import xarray as xr
import datashader as ds
import datashader.transfer_functions as tf
from datashader import reductions
from datashader.colors import colormap_select, Hot, inferno
from datashader.bokeh_ext import InteractiveImage
from bokeh.palettes import Greens3, Blues3, Blues4, Blues9, Greys9
from bokeh.plotting import figure, output_notebook
from bokeh.tile_providers import WMTSTileSource, STAMEN_TONER, STAMEN_TERRAIN
from functools import partial
import wget
import zipfile
import math
from difflib import SequenceMatcher
output_notebook()
#print(sys.path)
print(sys.version)
3.5.2 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Visualization is one of the key parts of a Data Science project. It allows us to get a global sense of our data and to better understand our results.
There are many free and paid data visualization tools on the market. One of my favourites is datashader, an open source Python library that allows us to visualize big amounts of data with a clean and nice API.
We can easily create a visualization of the Champions League matches from 1955 to 2016 using datashader. For that we need a dataset of the matches, such as this one, and the coordinates of the stadiums of the teams, which you can find here. This last dataset has stadium coordinates only for teams from England, Spain, France, Germany and Scotland. We will eventually need the data from other countries such as Italy, Portugal, the Netherlands and many others, but for our first MVP we don't care about that; our objective is to reach a minimum product as fast as possible to reduce uncertainty, and 5 countries is enough for now.
The first step is to treat the data.
df_stadium = pd.read_csv("stadiums.csv", usecols=['Team','Stadium','Latitude','Longitude','Country'])
print("Number of rows: %d" % df_stadium.shape[0])
dd1 = df_stadium.take([0,99, 64, 121])
dd1
Number of rows: 164
| | Team | Stadium | Latitude | Longitude | Country |
|---|---|---|---|---|---|
| 0 | Arsenal | Emirates Stadium | 51.555000 | -0.108611 | England |
| 99 | Real Madrid | Santiago Bernabéu | 40.453060 | -3.688350 | Spain |
| 64 | Bayern Munich | Allianz Arena | 48.218775 | 11.624753 | Germany |
| 121 | Paris Saint-Germain | Parc des Princes | 48.841389 | 2.253056 | France |
The next step is to match the club names in the coordinates dataset with those in the matches dataset. They are similar but not always exactly the same; for example, in the coordinates dataset we have `Real Madrid FC` and in the matches dataset we have `Real Madrid`. Furthermore, in the first one there are several entries for some teams, like `Atletico Madrid`, `Atletico Madrid B` or `Atletico Madrid C`, meaning they are the teams from the first division and from other divisions.
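To get a feel for how these name variants score, recall that `difflib.SequenceMatcher` computes its ratio as `2*M/T`, where `M` is the number of characters in matching blocks and `T` is the total length of both strings:

```python
from difflib import SequenceMatcher

# "Real Madrid" (11 chars) matches in full inside "Real Madrid FC" (14 chars),
# so the ratio is 2*11 / (14 + 11) = 22/25 = 0.88
ratio = SequenceMatcher(None, "Real Madrid FC", "Real Madrid").ratio()
print(round(ratio, 2))  # 0.88
```

A threshold around 0.6, as used below, comfortably accepts pairs like this one.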
df_match = pd.read_csv('champions.csv', usecols=['Date','home','visitor','hcountry','vcountry'])
df_match = df_match.rename(columns = {'hcountry':'home_country', 'vcountry':'visitor_country'})
df_teams_champions = pd.concat([df_match['home'], df_match['visitor']])
teams_champions = set(df_teams_champions)
print("Number of teams that have participated in the Champions League: %d" % len(teams_champions))
print("Number of matches in the dataset: %d" % df_match.shape[0])
df_match.head()
Number of teams that have participated in the Champions League: 502
Number of matches in the dataset: 6554
| | Date | home | visitor | home_country | visitor_country |
|---|---|---|---|---|---|
| 0 | 1955-09-04 | Sporting CP | Partizan Belgrade | POR | SRB |
| 1 | 1955-09-07 | Budapesti Voros Lobogo | RSC Anderlecht | HUN | BEL |
| 2 | 1955-09-08 | Servette Geneve | Real Madrid | SUI | ESP |
| 3 | 1955-09-14 | Rot-Weiss Essen | Hibernian FC | GER | SCO |
| 4 | 1955-09-20 | Djurgardens IF | Gwardia Warszawa | SWE | POL |
To find the string similarity you can use different methods. Here we will use a simple method to calculate it with `difflib`.
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def get_info_similar_team(team, df_stadium, threshold=0.6, verbose=False):
    max_rank = 0
    max_idx = -1
    stadium = "Unknown"
    latitude = np.NaN
    longitude = np.NaN
    for idx, val in enumerate(df_stadium['Team']):
        rank = similar(team, val)
        if rank > threshold:
            if(verbose): print("%s and %s(Idx=%d) are %f similar." % (team, val, idx, rank))
            if rank > max_rank:
                if(verbose): print("New maximum rank: %f" % rank)
                max_rank = rank
                max_idx = idx
                stadium = df_stadium['Stadium'].iloc[max_idx]
                latitude = df_stadium['Latitude'].iloc[max_idx]
                longitude = df_stadium['Longitude'].iloc[max_idx]
    return stadium, latitude, longitude
print(get_info_similar_team("Real Madrid FC", df_stadium, verbose=True))
print(get_info_similar_team("Atletico de Madrid FC", df_stadium, verbose=True))
print(get_info_similar_team("Inter Milan", df_stadium, verbose=True))
Real Madrid FC and Real Madrid (Idx=99) are 0.923077 similar.
New maximum rank: 0.923077
('Santiago Bernabéu ', 40.453060000000001, -3.6883499999999998)
Atletico de Madrid FC and Atlético Madrid (Idx=87) are 0.810811 similar.
New maximum rank: 0.810811
('Vicente Calderón ', 40.401719, -3.7206059999999996)
('Unknown', nan, nan)
The next step is to create a dataframe relating each match with the stadium coordinates of each team.
%%time
df_match_stadium = df_match
home_stadium_index = df_match_stadium['home'].map(lambda x: get_info_similar_team(x, df_stadium))
visitor_stadium_index = df_match_stadium['visitor'].map(lambda x: get_info_similar_team(x, df_stadium))
df_home = pd.DataFrame(home_stadium_index.tolist(), columns=['home_stadium', 'home_latitude', 'home_longitude'])
df_visitor = pd.DataFrame(visitor_stadium_index.tolist(), columns=['visitor_stadium', 'visitor_latitude', 'visitor_longitude'])
df_match_stadium = pd.concat([df_match_stadium, df_home, df_visitor], axis=1, ignore_index=False)
CPU times: user 1min 50s, sys: 460 ms, total: 1min 51s
Wall time: 1min 51s
print("Number of missing values for home teams: %d out of %d" % (df_match_stadium['home_stadium'].value_counts()['Unknown'], df_match_stadium.shape[0]))
df1 = df_match_stadium['home_stadium'] == 'Unknown'
df2 = df_match_stadium['visitor_stadium'] == 'Unknown'
n_complete_matches = df_match_stadium.shape[0] - df_match_stadium[df1 | df2].shape[0]
print("Number of matches with complete data: %d out of %d" % (n_complete_matches, df_match_stadium.shape[0]))
df_match_stadium.head()
Number of missing values for home teams: 4499 out of 6554
Number of matches with complete data: 728 out of 6554
| | Date | home | visitor | home_country | visitor_country | home_stadium | home_latitude | home_longitude | visitor_stadium | visitor_latitude | visitor_longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1955-09-04 | Sporting CP | Partizan Belgrade | POR | SRB | El Molinón | 43.536111 | -5.637222 | Unknown | NaN | NaN |
| 1 | 1955-09-07 | Budapesti Voros Lobogo | RSC Anderlecht | HUN | BEL | Unknown | NaN | NaN | Unknown | NaN | NaN |
| 2 | 1955-09-08 | Servette Geneve | Real Madrid | SUI | ESP | Unknown | NaN | NaN | Santiago Bernabéu | 40.453060 | -3.688350 |
| 3 | 1955-09-14 | Rot-Weiss Essen | Hibernian FC | GER | SCO | Unknown | NaN | NaN | Easter Road | 55.961667 | -3.165556 |
| 4 | 1955-09-20 | Djurgardens IF | Gwardia Warszawa | SWE | POL | Unknown | NaN | NaN | Unknown | NaN | NaN |
Now, even though many entries in the dataset have missing values, we are going to create a dataframe with the teams that do have values and move the project forward. This dataframe takes the matches where both teams (home and visitor) have coordinates and stacks the coordinate pairs, so we can later draw them on the map.
def aggregate_dataframe_coordinates(dataframe):
    df = pd.DataFrame(index=np.arange(0, n_complete_matches*3), columns=['Latitude','Longitude'])
    count = 0
    for ii in range(dataframe.shape[0]):
        if dataframe['home_stadium'].loc[ii] != 'Unknown' and dataframe['visitor_stadium'].loc[ii] != 'Unknown':
            df.loc[count] = [dataframe['home_latitude'].loc[ii], dataframe['home_longitude'].loc[ii]]
            df.loc[count+1] = [dataframe['visitor_latitude'].loc[ii], dataframe['visitor_longitude'].loc[ii]]
            df.loc[count+2] = [np.NaN, np.NaN]
            count += 3
    return df
df_agg = aggregate_dataframe_coordinates(df_match_stadium)
df_agg.head()
| | Latitude | Longitude |
|---|---|---|
| 0 | 53.4631 | -2.29139 |
| 1 | 51.4926 | 7.45184 |
| 2 | NaN | NaN |
| 3 | 55.8532 | -4.30926 |
| 4 | 43.7233 | 7.25876 |
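The `NaN` rows are deliberate: datashader treats a `NaN` coordinate as a break in the line, so each home-visitor pair is drawn as its own segment rather than one long connected polyline. A minimal self-contained sketch of the same aggregation on toy data (the compact helper below is a hypothetical rewrite, not the function above):

```python
import numpy as np
import pandas as pd

def aggregate_coordinates(matches):
    """Stack (home, visitor, NaN) coordinate triples; the NaN row breaks the line."""
    rows = []
    for _, m in matches.iterrows():
        rows.append([m['home_latitude'], m['home_longitude']])
        rows.append([m['visitor_latitude'], m['visitor_longitude']])
        rows.append([np.nan, np.nan])  # separator between matches
    return pd.DataFrame(rows, columns=['Latitude', 'Longitude'])

toy = pd.DataFrame([
    {'home_latitude': 51.6, 'home_longitude': -0.1,
     'visitor_latitude': 40.5, 'visitor_longitude': -3.7},
])
agg = aggregate_coordinates(toy)
print(agg.shape)  # (3, 2): home point, visitor point, NaN separator
```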
We have to transform the latitude and longitude coordinates to the web mercator format in order to represent them on a map using bokeh. Mercator coordinates are a cylindrical projection of the world's coordinates. The projection was invented in 1569 by Gerardus Mercator and became the standard format for nautical purposes. The web mercator format is an adaptation of the original mercator format and is currently used by most modern map systems such as Google Maps, Bing Maps or OpenStreetMap.
def to_web_mercator(yLat, xLon):
    # Check if coordinate is out of range for latitude/longitude
    if (abs(xLon) > 180) or (abs(yLat) > 90):
        return
    semimajorAxis = 6378137.0  # WGS84 spheroid semimajor axis
    east = xLon * 0.017453292519943295   # degrees to radians (pi/180)
    north = yLat * 0.017453292519943295
    northing = 3189068.5 * math.log((1.0 + math.sin(north)) / (1.0 - math.sin(north)))
    easting = semimajorAxis * east
    return [easting, northing]
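As a quick sanity check (the conversion is re-stated below so the snippet is self-contained), a point on the prime meridian should map to easting 0, and London's latitude should land near the familiar ~6.7e6 m northing:

```python
import math

def to_web_mercator(lat, lon):
    # Same conversion as above: WGS84 semimajor axis, degrees to radians
    if abs(lon) > 180 or abs(lat) > 90:
        return None
    r = 6378137.0  # WGS84 spheroid semimajor axis
    east = math.radians(lon) * r
    north = (r / 2.0) * math.log((1.0 + math.sin(math.radians(lat))) /
                                 (1.0 - math.sin(math.radians(lat))))
    return [east, north]

easting, northing = to_web_mercator(51.5, 0.0)  # roughly London
print(easting, round(northing))  # easting is 0 on the prime meridian
```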
df_agg_mercator = df_agg.apply(lambda row: to_web_mercator(row['Latitude'], row['Longitude']), axis=1)
df_agg_mercator.head()
| | Latitude | Longitude |
|---|---|---|
| 0 | -255076.256689 | 7.069114e+06 |
| 1 | 829535.256912 | 6.708890e+06 |
| 2 | NaN | NaN |
| 3 | -479704.406257 | 7.529248e+06 |
| 4 | 808041.021713 | 5.422726e+06 |
The next step is to plot the trajectories on the map using datashader.
plot_width = 850
plot_height = 600
x_range = (-1.9e6, 5.9e6)
y_range = (3.7e6, 9.0e6)

def create_image(x_range=x_range, y_range=y_range, w=plot_width, h=plot_height, cmap=None):
    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
    agg = cvs.line(df_agg_mercator, 'Latitude', 'Longitude', ds.count())
    #img = tf.shade(agg, cmap=reversed(Blues3), how='eq_hist')
    #img = tf.shade(agg, cmap=reversed(Greens3), how='eq_hist')
    img = tf.shade(agg, cmap=cmap, how='eq_hist')
    return img

def base_plot(tools='pan,wheel_zoom,reset', plot_width=plot_width, plot_height=plot_height, **plot_args):
    p = figure(tools=tools, plot_width=plot_width, plot_height=plot_height,
               x_range=x_range, y_range=y_range, outline_line_color=None,
               min_border=0, min_border_left=0, min_border_right=0,
               min_border_top=0, min_border_bottom=0, **plot_args)
    p.axis.visible = False
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    return p
ArcGIS = WMTSTileSource(url='http://server.arcgisonline.com/ArcGIS/rest/services/World_Street_Map/MapServer/tile/{Z}/{Y}/{X}.png')
p = base_plot()
p.add_tile(ArcGIS)
#InteractiveImage(p, create_image, cmap=inferno)
<bokeh.models.renderers.TileRenderer at 0x7fd69392f5f8>
The function `InteractiveImage` is not rendered on github. I made a snapshot and show it here:
This is the first MVP, which provides a complete product. It is imperfect: it doesn't have all the teams in Europe and some of the matches are wrongly represented. However, we got some data, transformed it and plotted it on a map.
Now that we have the map, we can start to improve it. If you are into football, you will notice that there are several points in the north of Spain that correspond to Sporting de Gijon. Sadly for Sporting supporters, they have never reached the Champions League. Instead, Sporting Clube de Portugal has participated several times in the championship, but since the current dataset doesn't have teams from Portugal, the system mistakenly matches `Sporting CP` from `champions.csv` to the Sporting de Gijon entry in `stadiums.csv`. So let's fix this issue by getting the stadium coordinates for the rest of the countries in Europe.
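We can reproduce the false positive directly. Assuming the stadium dataset names the Spanish club `Sporting Gijon` (a guess at the exact spelling), the shared prefix alone is enough to beat the 0.6 threshold:

```python
from difflib import SequenceMatcher

# The shared prefix "Sporting " (9 chars) dominates the score:
# ratio = 2*9 / (len("Sporting CP") + len("Sporting Gijon")) = 18/25 = 0.72
ratio = SequenceMatcher(None, "Sporting CP", "Sporting Gijon").ratio()
print(round(ratio, 2))  # 0.72, above the 0.6 threshold, hence the wrong match
```

With no Portuguese teams in the dataset, this is the best (and only) candidate above the threshold, so the wrong stadium wins.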
We can get that info from wikidata. Using the `SPARQL` query language we can get the information we need:
SELECT ?clubLabel ?venueLabel ?coordinates ?countryLabel WHERE {
  ?club wdt:P31 wd:Q476028.
  ?club wdt:P115 ?venue.
  ?venue wdt:P625 ?coordinates.
  ?club wdt:P17 ?country.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?clubLabel
This generates 4435 results in 10111 ms, which can be saved to a csv file.
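The same query can also be fetched programmatically. The snippet below only builds the request against the public Wikidata SPARQL endpoint (`https://query.wikidata.org/sparql`, asking for csv via the `Accept` header); actually fetching it requires network access, so it stops short of calling `urlopen`:

```python
from urllib.parse import urlencode
from urllib.request import Request

SPARQL = """
SELECT ?clubLabel ?venueLabel ?coordinates ?countryLabel WHERE {
  ?club wdt:P31 wd:Q476028.
  ?club wdt:P115 ?venue.
  ?venue wdt:P625 ?coordinates.
  ?club wdt:P17 ?country.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?clubLabel
"""

# Build the request; urlopen(req) would download the csv when online.
url = "https://query.wikidata.org/sparql?" + urlencode({"query": SPARQL})
req = Request(url, headers={"Accept": "text/csv"})  # ask the endpoint for csv output
print(req.full_url[:42])
```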
df_stadium_read = pd.read_csv('stadiums_wikidata.csv', usecols=['clubLabel','venueLabel','coordinates','countryLabel'])
df_stadium_read.tail()
| | clubLabel | venueLabel | coordinates | countryLabel |
|---|---|---|---|---|
| 4430 | ŠK SFM Senec | NTC Senec | Point(17.410412 48.217703) | Slovakia |
| 4431 | ŠK Slovan Bratislava | Štadión Pasienky | Point(17.141111 48.166111) | Slovakia |
| 4432 | Żurrieq F.C. | Ta' Qali National Stadium | Point(14.415372222 35.894863888) | Malta |
| 4433 | ŽFK Crvena zvezda | Red Star Stadium | Point(20.464914 44.7832) | Serbia |
| 4434 | ŽP Šport Podbrezová | Stadium Kolkáreň | Point(19.535 48.8167) | Slovakia |
The first step is to clean the coordinates column. For that we will use a regex pattern. The pattern `[-+]?[0-9]*\.?[0-9]+` finds any signed float in a string. Then we create two such patterns separated by a space and name the capture groups using the format `(?P<Longitude>...)`. Finally, we have to concatenate the club information with the coordinates.
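To see exactly what the pattern captures, here it is applied with the standard `re` module to one of the `Point(...)` strings from the table above:

```python
import re

pattern = r'(?P<Longitude>[-+]?[0-9]*\.?[0-9]+) (?P<Latitude>[-+]?[0-9]*\.?[0-9]+)'
m = re.search(pattern, "Point(17.410412 48.217703)")
print(m.groupdict())  # {'Longitude': '17.410412', 'Latitude': '48.217703'}
```

Note that wikidata's `Point(...)` stores longitude first, which is why the first named group is `Longitude`.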
df_temp = df_stadium_read['coordinates'].str.extract('(?P<Longitude>[-+]?[0-9]*\.?[0-9]+) (?P<Latitude>[-+]?[0-9]*\.?[0-9]+)', expand=True)
df_stadium_new = pd.concat([df_stadium_read['clubLabel'],df_stadium_read['venueLabel'], df_temp, df_stadium_read['countryLabel']], axis=1)
df_stadium_new = df_stadium_new.rename(columns = {'clubLabel':'Team', 'venueLabel':'Stadium','countryLabel':'Country'})
print("Number of rows: %d" % df_stadium_new.shape[0])
unique_teams_stadium = list(set(df_stadium_new['Team']))
print("Unique team's name number: %d" % len(unique_teams_stadium))
df_stadium_new.take(list(range(3388,3393)))
Number of rows: 4435
Unique team's name number: 4314
| | Team | Stadium | Longitude | Latitude | Country |
|---|---|---|---|---|---|
| 3388 | Real Madrid | Juan Ramón Loubriel Stadium | -66.150833 | 18.393333 | United States of America |
| 3389 | Real Madrid | Galeshewe Stadium | 24.740857 | -28.716786 | South Africa |
| 3390 | Real Madrid C | Ciudad Real Madrid | -3.61166667 | 40.47916667 | Spain |
| 3391 | Real Madrid C.F. | Ciudad Real Madrid | -3.61166667 | 40.47916667 | Spain |
| 3392 | Real Madrid FC | Estadio Santiago Bernabéu | -3.68835 | 40.45306 | Spain |
As can be seen in the previous dataframe, we ran into another problem. The new dataset contains all team instances from all over the world. Some teams share a name across countries (there is a Real Madrid from the USA and one from South Africa), and some have similar names within a country. In our case, we are only interested in the instance `Real Madrid, Estadio Santiago Bernabéu, -3.68835, 40.45306, Spain`. So how can we find an automated way to filter the correct teams that have participated in the Champions League?
A practical approach is to combine automated and manual filtering. With the data we have so far, we can automatically filter the first two entries using the country, which we can get from the `champions.csv` dataset. To distinguish teams from the same country, we will filter them manually.
The first step, then, is to get a dataframe with all the teams that have participated in the Champions League and their country of origin. Then we have to remove the repeated entries and rename the country codes to the country names found in wikidata.
df_match_home = df_match[['home','home_country']]
df_match_home = df_match_home.rename(columns={'home':'Team','home_country':'Country'})
df_match_visitor = df_match[['visitor','visitor_country']]
df_match_visitor = df_match_visitor.rename(columns={'visitor':'Team','visitor_country':'Country'})
df_champions_teams = pd.concat([df_match_home,df_match_visitor], axis=0, ignore_index=True)
df_champions_teams = df_champions_teams.drop_duplicates()
print("Number of unique teams: %d" % df_champions_teams.shape[0])
country_dict = {'ALB':'Albania',
'AND':'Andorra',
'ARM':'Armenia',
'AUT':'Austria',
'AZE':'Azerbaijan',
'BEL':'Belgium',
'BIH':'Bosnia and Herzegovina',
'BLR':'Belarus',
'BUL':'Bulgaria',
'CRO':'Croatia',
'CYP':'Cyprus',
'CZE':'Czech Republic',
'DEN':'Denmark',
#'ENG':'England',
'ENG':'United Kingdom',
'ESP':'Spain',
'EST':'Estonia',
'FIN':'Finland',
'FRA':'France',
'FRO':'Faroe Islands',
'GEO':'Georgia',
'GER':'Germany',
'GIB':'Gibraltar',
'GRE':'Greece',
'HUN':'Hungary',
'ITA':'Italy',
'IRL':'Ireland',
'ISL':'Iceland',
'ISR':'Israel',
'KAZ':'Kazakhstan',
'LTU':'Lithuania',
'LUX':'Luxembourg',
'LVA':'Latvia',
'MDA':'Moldova',
'MKD':'Macedonia',
'MLT':'Malta',
'MNE':'Montenegro',
'NED':'Netherlands',
#'NIR':'Northern Ireland',
'NIR':'United Kingdom',
'NOR':'Norway',
'POL':'Poland',
'POR':'Portugal',
'ROU':'Romania',
'RUS':'Russia',
#'SCO':'Scotland',
'SCO':'United Kingdom',
'SMR':'San Marino',
'SRB':'Serbia',
'SUI':'Switzerland',
'SVK':'Slovakia',
'SVN':'Slovenia',
'SWE':'Sweden',
'TUR':'Turkey',
'UKR':'Ukraine',
#'WAL':'Wales',
'WAL':'United Kingdom'}
df_champions_teams['Country'].replace(country_dict, inplace=True)
#df_champions_teams.to_csv('match_unique.csv')# To check that the mapping is correct
df_champions_teams.sort_values(by='Team',inplace=True)
df_champions_teams = df_champions_teams.reset_index(drop=True)
df_champions_teams.head()
Number of unique teams: 502
| | Team | Country |
|---|---|---|
| 0 | AC Chievo Verona | Italy |
| 1 | AC Milan | Italy |
| 2 | AC Sparta Praha | Czech Republic |
| 3 | ACF Fiorentina | Italy |
| 4 | AE Larisa | Greece |
Once we have the list of all teams that have participated in the Champions League, we have to generate a new dataset relating each Champions League match with the coordinates of the team stadiums. For that we will use the function `similar` to match the team names across the different datasets, as we did before.
Once the csv has been generated, let's manually erase the combinations that are not correct and save everything in a new file. We won't correct the entries that found no match; in a Data Science project, done is better than perfect!
%%time
def get_info_similar_team_country(team, country, df_stadium, df, threshold, verbose):
    team2 = "Unknown"
    stadium = "Unknown"
    latitude = np.NaN
    longitude = np.NaN
    cols = list(df)
    for idx, val in enumerate(df_stadium['Team']):
        rank = similar(team, val)
        if rank > threshold and country == df_stadium['Country'].iloc[idx]:
            if(verbose): print("%s and %s(Idx=%d) are %f similar and from the same country %s."
                               % (team, val, idx, rank, country))
            team2 = df_stadium['Team'].iloc[idx]
            stadium = df_stadium['Stadium'].iloc[idx]
            latitude = df_stadium['Latitude'].iloc[idx]
            longitude = df_stadium['Longitude'].iloc[idx]
            dtemp = pd.DataFrame([[team, team2, stadium, latitude, longitude, country]], columns=cols)
            df = df.append(dtemp, ignore_index=True)
    # if there is no match, register it
    if team2 == "Unknown":
        df_nomatch = pd.DataFrame([[team, team2, stadium, latitude, longitude, country]], columns=cols)
        df = df.append(df_nomatch, ignore_index=True)
    return df

def generate_new_stadium_dataset(df_champions_teams, df_stadium_new, threshold=0.6, verbose=False):
    df = pd.DataFrame(columns=['Team', 'Team2', 'Stadium', 'Latitude', 'Longitude', 'Country'])
    for idx, row in df_champions_teams.iterrows():
        df = get_info_similar_team_country(row['Team'], row['Country'], df_stadium_new, df,
                                           threshold=threshold, verbose=verbose)
    return df
verbose = False # You can change this to True to see all the combinations
threshold = 0.5
df_stadiums_champions = generate_new_stadium_dataset(df_champions_teams, df_stadium_new, threshold, verbose)
df_stadiums_champions.to_csv('stadiums_champions.csv', index=False)
CPU times: user 2min 9s, sys: 240 ms, total: 2min 9s
Wall time: 2min 9s
After filtering the entries in the csv, let's load the data again and repeat the process.
df_stadiums_champions = pd.read_csv('stadiums_champions_filtered.csv', usecols=['Team','Stadium','Latitude','Longitude','Country'])
df_stadiums_champions.head()
| | Team | Stadium | Latitude | Longitude | Country |
|---|---|---|---|---|---|
| 0 | AC Chievo Verona | Stadio Marc'Antonio Bentegodi | 45.435356 | 10.968647 | Italy |
| 1 | AC Milan | Stadio Giuseppe Meazza | 45.478080 | 9.124000 | Italy |
| 2 | AC Sparta Praha | Generali Arena | 50.099716 | 14.416122 | Czech Republic |
| 3 | ACF Fiorentina | Stadio Artemio Franchi | 43.780822 | 11.282258 | Italy |
| 4 | AE Larisa | AEL FC Arena | 39.615610 | 22.399490 | Greece |
As before, we create a dataframe that relates each match with the coordinates of its stadium.
df_match_stadium_new = df_match
home_stadium_index = df_match_stadium_new['home'].map(lambda x: get_info_similar_team(x, df_stadiums_champions))
visitor_stadium_index = df_match_stadium_new['visitor'].map(lambda x: get_info_similar_team(x, df_stadiums_champions))
df_home = pd.DataFrame(home_stadium_index.tolist(), columns=['home_stadium', 'home_latitude', 'home_longitude'])
df_visitor = pd.DataFrame(visitor_stadium_index.tolist(), columns=['visitor_stadium', 'visitor_latitude', 'visitor_longitude'])
df_match_stadium_new = pd.concat([df_match_stadium_new, df_home, df_visitor], axis=1, ignore_index=False)
df1 = df_match_stadium_new['home_stadium'] == 'Unknown'
df2 = df_match_stadium_new['visitor_stadium'] == 'Unknown'
n_complete_matches = df_match_stadium_new.shape[0] - df_match_stadium_new[df1 | df2].shape[0]
print("Number of matches with complete data: %d out of %d" % (n_complete_matches, df_match_stadium_new.shape[0]))
df_match_stadium_new.head()
Number of matches with complete data: 5110 out of 6554
| | Date | home | visitor | home_country | visitor_country | home_stadium | home_latitude | home_longitude | visitor_stadium | visitor_latitude | visitor_longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1955-09-04 | Sporting CP | Partizan Belgrade | POR | SRB | Estádio José Alvalade | 38.761194 | -9.160783 | Partizan Stadium | 44.788611 | 20.459167 |
| 1 | 1955-09-07 | Budapesti Voros Lobogo | RSC Anderlecht | HUN | BEL | Unknown | NaN | NaN | Constant Vanden Stock Stadium | 50.834167 | 4.298333 |
| 2 | 1955-09-08 | Servette Geneve | Real Madrid | SUI | ESP | Stade de Genève | 46.177778 | 6.127500 | Estadio Santiago Bernabéu | 40.453060 | -3.688350 |
| 3 | 1955-09-14 | Rot-Weiss Essen | Hibernian FC | GER | SCO | Stadion Essen | 51.486667 | 6.976389 | Easter Road | 55.961667 | -3.165556 |
| 4 | 1955-09-20 | Djurgardens IF | Gwardia Warszawa | SWE | POL | Stockholm Olympic Stadium | 59.345278 | 18.078889 | Stadion Polonii Warszawa | 52.255517 | 21.001322 |
The next step is to aggregate the coordinates and transform them to mercator format.
df_agg = aggregate_dataframe_coordinates(df_match_stadium_new)
df_agg_mercator = df_agg.apply(lambda row: to_web_mercator(row['Latitude'], row['Longitude']), axis=1)
print("Number of rows: %d" % df_agg_mercator.shape[0])
df_agg_mercator.head()
Number of rows: 15330
| | Latitude | Longitude |
|---|---|---|
| 0 | -1.019774e+06 | 4.687522e+06 |
| 1 | 2.277504e+06 | 5.588304e+06 |
| 2 | NaN | NaN |
| 3 | 6.821102e+05 | 5.808884e+06 |
| 4 | -4.105852e+05 | 4.932000e+06 |
Finally, we plot the coordinates in the map.
%%time
#InteractiveImage(p, create_image, cmap=inferno)
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 7.63 µs
Finally, the image for github:
This is the second MVP. In this case we have the correct connections between teams in Europe.
If you take a look at the map, you will notice that it is fairly ugly. The perceived quality of the final product will drop if we don't present a good visualization. The next step is to improve the map by changing the background and colors.
ArcGIS2 = WMTSTileSource(url='http://tile.stamen.com/toner-background/{Z}/{X}/{Y}.png')
p2 = base_plot()
p2.add_tile(ArcGIS2)
cmap = reversed(Blues3)
#cmap = reversed(Greens3)
#InteractiveImage(p2, create_image, cmap=cmap)
Finally, the image for github:
We developed three different MVPs by iterating on the project. The Lean Startup method provides an easy way to remove uncertainty and quickly get a minimum product that we can show to the customer.
Happy data sciencing!