Lean Startup has been a breakthrough in the entrepreneurial landscape over the last few years. But this movement is not only a methodology for creating successful businesses; it can also be applied to other domains, like Data Science. In this post, I discuss how to apply the Lean Startup method to a Data Science project. As an example, I created a project that visualizes all football matches played in the UEFA Champions League since 1955.
There are several methodologies for developing data science projects. At Microsoft, we developed and intensively use the Team Data Science Process (TDSP). This methodology enables us to effectively implement projects while collaborating with teams inside and outside the company.
In some situations, when starting a Data Science project, we have a clear view of the business case, access to all customer data and a clear roadmap of what the customer wants. In that situation, the Lean Startup process has little value. However, when there is uncertainty, when the customer doesn't know what they want or when we don't know whether a product is going to be successful, the Lean Startup method can prove its worth.
Implementing the Lean Startup method within the TDSP is easy; we can use all the tools TDSP proposes. In the Lean Startup method, the priority is to reduce (or eliminate) uncertainty in order to understand what the customer really wants.
So, following the Lean Startup method, first we have to set the hypotheses. The next step is to build a Minimum Viable Product (MVP). The MVP has two important features: it helps us validate the hypotheses or ideas we proposed, and it is a complete, end-to-end product with the minimum number of features. The final step is to show the MVP to the customer and measure its impact.
# Load all libraries
import os, sys
import pandas as pd
import numpy as np
import xarray as xr
import datashader as ds
import datashader.transfer_functions as tf
from datashader import reductions
from datashader.colors import colormap_select, Hot, inferno
from datashader.bokeh_ext import InteractiveImage
from bokeh.palettes import Greens3, Blues3, Blues4, Blues9, Greys9
from bokeh.plotting import figure, output_notebook
from bokeh.tile_providers import WMTSTileSource, STAMEN_TONER, STAMEN_TERRAIN
from functools import partial
import wget
import zipfile
import math
from difflib import SequenceMatcher
output_notebook()
#print(sys.path)
print(sys.version)
3.5.2 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Visualization is one of the key parts of a Data Science project. It allows us to get a global sense of our data and to better understand our results.
There are many free and paid data visualization tools on the market. One of my favourites is datashader, an open source Python library that allows us to visualize big amounts of data with a clean and nice API.
We can easily create a visualization of the Champions League matches from 1955 to 2016 using datashader. For that we need a dataset of the matches, such as this one, and the coordinates of the stadiums of the teams, which you can find here. This last dataset has stadium coordinates only for teams from England, Spain, France, Germany and Scotland. We will eventually need the data from other countries such as Italy, Portugal, the Netherlands and many others, but for our first MVP we don't care about that; our objective is to reach a minimum product as fast as possible to reduce uncertainty, and 5 countries is enough for now.
The first step is to treat the data.
df_stadium = pd.read_csv("stadiums.csv", usecols=['Team','Stadium','Latitude','Longitude','Country'])
print("Number of rows: %d" % df_stadium.shape[0])
dd1 = df_stadium.take([0,99, 64, 121])
dd1
Number of rows: 164
| | Team | Stadium | Latitude | Longitude | Country |
|---|---|---|---|---|---|
| 0 | Arsenal | Emirates Stadium | 51.555000 | -0.108611 | England |
| 99 | Real Madrid | Santiago Bernabéu | 40.453060 | -3.688350 | Spain |
| 64 | Bayern Munich | Allianz Arena | 48.218775 | 11.624753 | Germany |
| 121 | Paris Saint-Germain | Parc des Princes | 48.841389 | 2.253056 | France |
The next step is to match the club names in the coordinates dataset with those in the matches dataset. They are similar but not always exactly the same; for example, in the coordinates dataset we have `Real Madrid FC` and in the matches dataset we have `Real Madrid`. Furthermore, in the first one there are several entries for some teams, like `Atletico Madrid`, `Atletico Madrid B` or `Atletico Madrid C`, meaning they are the teams from the first division and from other divisions.
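To get a feel for how these name variants score, recall that `difflib.SequenceMatcher` computes its ratio as `2*M/T`, where `M` is the number of characters in matching blocks and `T` is the total length of both strings:

```python
from difflib import SequenceMatcher

# "Real Madrid" (11 chars) matches in full inside "Real Madrid FC" (14 chars),
# so the ratio is 2*11 / (14 + 11) = 22/25 = 0.88
ratio = SequenceMatcher(None, "Real Madrid FC", "Real Madrid").ratio()
print(round(ratio, 2))  # 0.88
```

A threshold around 0.6, as used below, comfortably accepts pairs like this one.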
df_match = pd.read_csv('champions.csv', usecols=['Date','home','visitor','hcountry','vcountry'])
df_match = df_match.rename(columns = {'hcountry':'home_country', 'vcountry':'visitor_country'})
df_teams_champions = pd.concat([df_match['home'], df_match['visitor']])
teams_champions = set(df_teams_champions)
print("Number of teams that have participated in the Champions League: %d" % len(teams_champions))
print("Number of matches in the dataset: %d" % df_match.shape[0])
df_match.head()
Number of teams that have participated in the Champions League: 502
Number of matches in the dataset: 6554
| | Date | home | visitor | home_country | visitor_country |
|---|---|---|---|---|---|
| 0 | 1955-09-04 | Sporting CP | Partizan Belgrade | POR | SRB |
| 1 | 1955-09-07 | Budapesti Voros Lobogo | RSC Anderlecht | HUN | BEL |
| 2 | 1955-09-08 | Servette Geneve | Real Madrid | SUI | ESP |
| 3 | 1955-09-14 | Rot-Weiss Essen | Hibernian FC | GER | SCO |
| 4 | 1955-09-20 | Djurgardens IF | Gwardia Warszawa | SWE | POL |
To find the string similarity you can use different methods. Here we will use a simple method to calculate it with `difflib`.
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def get_info_similar_team(team, df_stadium, threshold=0.6, verbose=False):
    max_rank = 0
    max_idx = -1
    stadium = "Unknown"
    latitude = np.NaN
    longitude = np.NaN
    for idx, val in enumerate(df_stadium['Team']):
        rank = similar(team, val)
        if rank > threshold:
            if(verbose): print("%s and %s(Idx=%d) are %f similar." % (team, val, idx, rank))
            if rank > max_rank:
                if(verbose): print("New maximum rank: %f" % rank)
                max_rank = rank
                max_idx = idx
                stadium = df_stadium['Stadium'].iloc[max_idx]
                latitude = df_stadium['Latitude'].iloc[max_idx]
                longitude = df_stadium['Longitude'].iloc[max_idx]
    return stadium, latitude, longitude
print(get_info_similar_team("Real Madrid FC", df_stadium, verbose=True))
print(get_info_similar_team("Atletico de Madrid FC", df_stadium, verbose=True))
print(get_info_similar_team("Inter Milan", df_stadium, verbose=True))
Real Madrid FC and Real Madrid (Idx=99) are 0.923077 similar.
New maximum rank: 0.923077
('Santiago Bernabéu ', 40.453060000000001, -3.6883499999999998)
Atletico de Madrid FC and Atlético Madrid (Idx=87) are 0.810811 similar.
New maximum rank: 0.810811
('Vicente Calderón ', 40.401719, -3.7206059999999996)
('Unknown', nan, nan)
The next step is to create a dataframe relating each match with the stadium coordinates of each team.
%%time
df_match_stadium = df_match
home_stadium_index = df_match_stadium['home'].map(lambda x: get_info_similar_team(x, df_stadium))
visitor_stadium_index = df_match_stadium['visitor'].map(lambda x: get_info_similar_team(x, df_stadium))
df_home = pd.DataFrame(home_stadium_index.tolist(), columns=['home_stadium', 'home_latitude', 'home_longitude'])
df_visitor = pd.DataFrame(visitor_stadium_index.tolist(), columns=['visitor_stadium', 'visitor_latitude', 'visitor_longitude'])
df_match_stadium = pd.concat([df_match_stadium, df_home, df_visitor], axis=1, ignore_index=False)
CPU times: user 1min 50s, sys: 460 ms, total: 1min 51s
Wall time: 1min 51s
print("Number of missing values for home teams: %d out of %d" % (df_match_stadium['home_stadium'].value_counts()['Unknown'], df_match_stadium.shape[0]))
df1 = df_match_stadium['home_stadium'] == 'Unknown'
df2 = df_match_stadium['visitor_stadium'] == 'Unknown'
n_complete_matches = df_match_stadium.shape[0] - df_match_stadium[df1 | df2].shape[0]
print("Number of matches with complete data: %d out of %d" % (n_complete_matches, df_match_stadium.shape[0]))
df_match_stadium.head()
Number of missing values for home teams: 4499 out of 6554
Number of matches with complete data: 728 out of 6554
| | Date | home | visitor | home_country | visitor_country | home_stadium | home_latitude | home_longitude | visitor_stadium | visitor_latitude | visitor_longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1955-09-04 | Sporting CP | Partizan Belgrade | POR | SRB | El Molinón | 43.536111 | -5.637222 | Unknown | NaN | NaN |
| 1 | 1955-09-07 | Budapesti Voros Lobogo | RSC Anderlecht | HUN | BEL | Unknown | NaN | NaN | Unknown | NaN | NaN |
| 2 | 1955-09-08 | Servette Geneve | Real Madrid | SUI | ESP | Unknown | NaN | NaN | Santiago Bernabéu | 40.453060 | -3.688350 |
| 3 | 1955-09-14 | Rot-Weiss Essen | Hibernian FC | GER | SCO | Unknown | NaN | NaN | Easter Road | 55.961667 | -3.165556 |
| 4 | 1955-09-20 | Djurgardens IF | Gwardia Warszawa | SWE | POL | Unknown | NaN | NaN | Unknown | NaN | NaN |
Now, even though many entries in the dataset have missing values, we are going to create a dataframe with the teams that do have values and move the project forward. This dataframe takes the matches where both teams (home and visitor) have coordinates and stacks the coordinate pairs, so we can later draw them on the map.
def aggregate_dataframe_coordinates(dataframe):
    df = pd.DataFrame(index=np.arange(0, n_complete_matches*3), columns=['Latitude','Longitude'])
    count = 0
    for ii in range(dataframe.shape[0]):
        if dataframe['home_stadium'].loc[ii] != 'Unknown' and dataframe['visitor_stadium'].loc[ii] != 'Unknown':
            df.loc[count] = [dataframe['home_latitude'].loc[ii], dataframe['home_longitude'].loc[ii]]
            df.loc[count+1] = [dataframe['visitor_latitude'].loc[ii], dataframe['visitor_longitude'].loc[ii]]
            df.loc[count+2] = [np.NaN, np.NaN]
            count += 3
    return df
df_agg = aggregate_dataframe_coordinates(df_match_stadium)
df_agg.head()
| | Latitude | Longitude |
|---|---|---|
| 0 | 53.4631 | -2.29139 |
| 1 | 51.4926 | 7.45184 |
| 2 | NaN | NaN |
| 3 | 55.8532 | -4.30926 |
| 4 | 43.7233 | 7.25876 |
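The `NaN` rows are deliberate: datashader treats a `NaN` coordinate as a break in the line, so each home-visitor pair is drawn as its own segment rather than one long connected polyline. A minimal self-contained sketch of the same aggregation on toy data (the compact helper below is a hypothetical rewrite, not the function above):

```python
import numpy as np
import pandas as pd

def aggregate_coordinates(matches):
    """Stack (home, visitor, NaN) coordinate triples; the NaN row breaks the line."""
    rows = []
    for _, m in matches.iterrows():
        rows.append([m['home_latitude'], m['home_longitude']])
        rows.append([m['visitor_latitude'], m['visitor_longitude']])
        rows.append([np.nan, np.nan])  # separator between matches
    return pd.DataFrame(rows, columns=['Latitude', 'Longitude'])

toy = pd.DataFrame([
    {'home_latitude': 51.6, 'home_longitude': -0.1,
     'visitor_latitude': 40.5, 'visitor_longitude': -3.7},
])
agg = aggregate_coordinates(toy)
print(agg.shape)  # (3, 2): home point, visitor point, NaN separator
```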
We have to transform the latitude and longitude coordinates to the web mercator format in order to represent them on a map using bokeh. Mercator coordinates are a cylindrical projection of the world's coordinates. The projection was invented in 1569 by Gerardus Mercator and became the standard format for nautical purposes. The web mercator format is an adaptation of the original mercator format and is currently used by most modern map systems such as Google Maps, Bing Maps or OpenStreetMap.
def to_web_mercator(yLat, xLon):
    # Check if coordinate is out of range for latitude/longitude
    if (abs(xLon) > 180) or (abs(yLat) > 90):
        return
    semimajorAxis = 6378137.0  # WGS84 spheroid semimajor axis
    east = xLon * 0.017453292519943295   # degrees to radians (pi/180)
    north = yLat * 0.017453292519943295
    northing = 3189068.5 * math.log((1.0 + math.sin(north)) / (1.0 - math.sin(north)))
    easting = semimajorAxis * east
    return [easting, northing]
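As a quick sanity check (the conversion is re-stated below so the snippet is self-contained), a point on the prime meridian should map to easting 0, and London's latitude should land near the familiar ~6.7e6 m northing:

```python
import math

def to_web_mercator(lat, lon):
    # Same conversion as above: WGS84 semimajor axis, degrees to radians
    if abs(lon) > 180 or abs(lat) > 90:
        return None
    r = 6378137.0  # WGS84 spheroid semimajor axis
    east = math.radians(lon) * r
    north = (r / 2.0) * math.log((1.0 + math.sin(math.radians(lat))) /
                                 (1.0 - math.sin(math.radians(lat))))
    return [east, north]

easting, northing = to_web_mercator(51.5, 0.0)  # roughly London
print(easting, round(northing))  # easting is 0 on the prime meridian
```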
df_agg_mercator = df_agg.apply(lambda row: to_web_mercator(row['Latitude'], row['Longitude']), axis=1)
df_agg_mercator.head()
| | Latitude | Longitude |
|---|---|---|
| 0 | -255076.256689 | 7.069114e+06 |
| 1 | 829535.256912 | 6.708890e+06 |
| 2 | NaN | NaN |
| 3 | -479704.406257 | 7.529248e+06 |
| 4 | 808041.021713 | 5.422726e+06 |
The next step is to plot the trajectories on the map using datashader.
plot_width = 850
plot_height = 600
x_range = (-1.9e6, 5.9e6)
y_range = (3.7e6, 9.0e6)

def create_image(x_range=x_range, y_range=y_range, w=plot_width, h=plot_height, cmap=None):
    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
    agg = cvs.line(df_agg_mercator, 'Latitude', 'Longitude', ds.count())
    #img = tf.shade(agg, cmap=reversed(Blues3), how='eq_hist')
    #img = tf.shade(agg, cmap=reversed(Greens3), how='eq_hist')
    img = tf.shade(agg, cmap=cmap, how='eq_hist')
    return img

def base_plot(tools='pan,wheel_zoom,reset', plot_width=plot_width, plot_height=plot_height, **plot_args):
    p = figure(tools=tools, plot_width=plot_width, plot_height=plot_height,
               x_range=x_range, y_range=y_range, outline_line_color=None,
               min_border=0, min_border_left=0, min_border_right=0,
               min_border_top=0, min_border_bottom=0, **plot_args)
    p.axis.visible = False
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    return p
ArcGIS = WMTSTileSource(url='http://server.arcgisonline.com/ArcGIS/rest/services/World_Street_Map/MapServer/tile/{Z}/{Y}/{X}.png')
p = base_plot()
p.add_tile(ArcGIS)
#InteractiveImage(p, create_image, cmap=inferno)
<bokeh.models.renderers.TileRenderer at 0x7fd69392f5f8>
The function `InteractiveImage` is not rendered on github. I made a snapshot and show it here:
This is the first MVP, which provides a complete product. It is imperfect: it doesn't have all the teams in Europe and some of the matches are wrongly represented. However, we got some data, transformed it and plotted it on a map.
Now that we have the map, we can start to improve it. If you are into football, you will notice that there are several points in the north of Spain that correspond to Sporting de Gijon. Sadly for Sporting supporters, they have never reached the Champions League. Instead, Sporting Clube de Portugal has participated several times in the championship, but since the current dataset doesn't have teams from Portugal, the system mistakenly matches `Sporting CP` from `champions.csv` to the Sporting de Gijon entry in `stadiums.csv`. So let's fix this issue by getting the stadium coordinates for the rest of the countries in Europe.
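We can reproduce the false positive directly. Assuming the stadium dataset names the Spanish club `Sporting Gijon` (a guess at the exact spelling), the shared prefix alone is enough to beat the 0.6 threshold:

```python
from difflib import SequenceMatcher

# The shared prefix "Sporting " (9 chars) dominates the score:
# ratio = 2*9 / (len("Sporting CP") + len("Sporting Gijon")) = 18/25 = 0.72
ratio = SequenceMatcher(None, "Sporting CP", "Sporting Gijon").ratio()
print(round(ratio, 2))  # 0.72, above the 0.6 threshold, hence the wrong match
```

With no Portuguese teams in the dataset, this is the best (and only) candidate above the threshold, so the wrong stadium wins.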
We can get that info from wikidata. Using the `SPARQL` query language we can get the information we need:
SELECT ?clubLabel ?venueLabel ?coordinates ?countryLabel WHERE {
  ?club wdt:P31 wd:Q476028.
  ?club wdt:P115 ?venue.
  ?venue wdt:P625 ?coordinates.
  ?club wdt:P17 ?country.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?clubLabel
This generates 4435 results in 10111 ms, which can be saved to a csv file.
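The same query can also be fetched programmatically. The snippet below only builds the request against the public Wikidata SPARQL endpoint (`https://query.wikidata.org/sparql`, asking for csv via the `Accept` header); actually fetching it requires network access, so it stops short of calling `urlopen`:

```python
from urllib.parse import urlencode
from urllib.request import Request

SPARQL = """
SELECT ?clubLabel ?venueLabel ?coordinates ?countryLabel WHERE {
  ?club wdt:P31 wd:Q476028.
  ?club wdt:P115 ?venue.
  ?venue wdt:P625 ?coordinates.
  ?club wdt:P17 ?country.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?clubLabel
"""

# Build the request; urlopen(req) would download the csv when online.
url = "https://query.wikidata.org/sparql?" + urlencode({"query": SPARQL})
req = Request(url, headers={"Accept": "text/csv"})  # ask the endpoint for csv output
print(req.full_url[:42])
```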
df_stadium_read = pd.read_csv('stadiums_wikidata.csv', usecols=['clubLabel','venueLabel','coordinates','countryLabel'])
df_stadium_read.tail()
| | clubLabel | venueLabel | coordinates | countryLabel |
|---|---|---|---|---|
| 4430 | ŠK SFM Senec | NTC Senec | Point(17.410412 48.217703) | Slovakia |
| 4431 | ŠK Slovan Bratislava | Štadión Pasienky | Point(17.141111 48.166111) | Slovakia |
| 4432 | Żurrieq F.C. | Ta' Qali National Stadium | Point(14.415372222 35.894863888) | Malta |
| 4433 | ŽFK Crvena zvezda | Red Star Stadium | Point(20.464914 44.7832) | Serbia |
| 4434 | ŽP Šport Podbrezová | Stadium Kolkáreň | Point(19.535 48.8167) | Slovakia |
The first step is to clean the coordinates column. For that we will use a regex pattern. The pattern `[-+]?[0-9]*\.?[0-9]+` finds any signed float in a string. Then we create two such patterns separated by a space and name the capture groups using the format `(?P<Longitude>...)`. Finally, we have to concatenate the club information with the coordinates.
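To see exactly what the pattern captures, here it is applied with the standard `re` module to one of the `Point(...)` strings from the table above:

```python
import re

pattern = r'(?P<Longitude>[-+]?[0-9]*\.?[0-9]+) (?P<Latitude>[-+]?[0-9]*\.?[0-9]+)'
m = re.search(pattern, "Point(17.410412 48.217703)")
print(m.groupdict())  # {'Longitude': '17.410412', 'Latitude': '48.217703'}
```

Note that wikidata's `Point(...)` stores longitude first, which is why the first named group is `Longitude`.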
df_temp = df_stadium_read['coordinates'].str.extract('(?P<Longitude>[-+]?[0-9]*\.?[0-9]+) (?P<Latitude>[-+]?[0-9]*\.?[0-9]+)', expand=True)
df_stadium_new = pd.concat([df_stadium_read['clubLabel'],df_stadium_read['venueLabel'], df_temp, df_stadium_read['countryLabel']], axis=1)
df_stadium_new = df_stadium_new.rename(columns = {'clubLabel':'Team', 'venueLabel':'Stadium','countryLabel':'Country'})
print("Number of rows: %d" % df_stadium_new.shape[0])
unique_teams_stadium = list(set(df_stadium_new['Team']))
print("Unique team's name number: %d" % len(unique_teams_stadium))
df_stadium_new.take(list(range(3388,3393)))
Number of rows: 4435
Unique team's name number: 4314
| | Team | Stadium | Longitude | Latitude | Country |
|---|---|---|---|---|---|
| 3388 | Real Madrid | Juan Ramón Loubriel Stadium | -66.150833 | 18.393333 | United States of America |
| 3389 | Real Madrid | Galeshewe Stadium | 24.740857 | -28.716786 | South Africa |
| 3390 | Real Madrid C | Ciudad Real Madrid | -3.61166667 | 40.47916667 | Spain |
| 3391 | Real Madrid C.F. | Ciudad Real Madrid | -3.61166667 | 40.47916667 | Spain |
| 3392 | Real Madrid FC | Estadio Santiago Bernabéu | -3.68835 | 40.45306 | Spain |
As can be seen in the previous dataframe, we ran into another problem. The new dataset contains all team instances from all over the world. Some teams share a name across countries (there is a Real Madrid from the USA and one from South Africa), and some have similar names within a country. In our case, we are only interested in the instance `Real Madrid, Estadio Santiago Bernabéu, -3.68835, 40.45306, Spain`. So how can we find an automated way to filter the correct teams that have participated in the Champions League?
A practical approach is to combine automated and manual filtering. With the data we have so far, we can automatically filter the first two entries using the country, which we can get from the `champions.csv` dataset. To distinguish teams from the same country, we will filter them manually.
The first step, then, is to get a dataframe with all the teams that have participated in the Champions League and their country of origin. Then we have to remove the repeated entries and rename the country codes to the country names found in wikidata.
df_match_home = df_match[['home','home_country']]
df_match_home = df_match_home.rename(columns={'home':'Team','home_country':'Country'})
df_match_visitor = df_match[['visitor','visitor_country']]
df_match_visitor = df_match_visitor.rename(columns={'visitor':'Team','visitor_country':'Country'})
df_champions_teams = pd.concat([df_match_home,df_match_visitor], axis=0, ignore_index=True)
df_champions_teams = df_champions_teams.drop_duplicates()
print("Number of unique teams: %d" % df_champions_teams.shape[0])
country_dict = {'ALB':'Albania',
'AND':'Andorra',
'ARM':'Armenia',
'AUT':'Austria',
'AZE':'Azerbaijan',
'BEL':'Belgium',
'BIH':'Bosnia and Herzegovina',
'BLR':'Belarus',
'BUL':'Bulgaria',
'CRO':'Croatia',
'CYP':'Cyprus',
'CZE':'Czech Republic',
'DEN':'Denmark',
#'ENG':'England',
'ENG':'United Kingdom',
'ESP':'Spain',
'EST':'Estonia',
'FIN':'Finland',
'FRA':'France',
'FRO':'Faroe Islands',
'GEO':'Georgia',
'GER':'Germany',
'GIB':'Gibraltar',
'GRE':'Greece',
'HUN':'Hungary',
'ITA':'Italy',
'IRL':'Ireland',
'ISL':'Iceland',
'ISR':'Israel',
'KAZ':'Kazakhstan',
'LTU':'Lithuania',
'LUX':'Luxembourg',
'LVA':'Latvia',
'MDA':'Moldova',
'MKD':'Macedonia',
'MLT':'Malta',
'MNE':'Montenegro',
'NED':'Netherlands',
#'NIR':'Northern Ireland',
'NIR':'United Kingdom',
'NOR':'Norway',
'POL':'Poland',
'POR':'Portugal',
'ROU':'Romania',
'RUS':'Russia',
#'SCO':'Scotland',
'SCO':'United Kingdom',
'SMR':'San Marino',
'SRB':'Serbia',
'SUI':'Switzerland',
'SVK':'Slovakia',
'SVN':'Slovenia',
'SWE':'Sweden',
'TUR':'Turkey',
'UKR':'Ukraine',
#'WAL':'Wales',
'WAL':'United Kingdom'}
df_champions_teams['Country'].replace(country_dict, inplace=True)
#df_champions_teams.to_csv('match_unique.csv')# To check that the mapping is correct
df_champions_teams.sort_values(by='Team',inplace=True)
df_champions_teams = df_champions_teams.reset_index(drop=True)
df_champions_teams.head()
Number of unique teams: 502
| | Team | Country |
|---|---|---|
| 0 | AC Chievo Verona | Italy |
| 1 | AC Milan | Italy |
| 2 | AC Sparta Praha | Czech Republic |
| 3 | ACF Fiorentina | Italy |
| 4 | AE Larisa | Greece |
Once we have the list of all teams that have participated in the Champions League, we have to generate a new dataset relating each Champions League match with the coordinates of the team stadiums. For that we will use the function `similar` to match the team names across the different datasets, as we did before.
Once the csv has been generated, let's manually erase the combinations that are not correct and save everything in a new file. We won't correct the entries that found no match; in a Data Science project, done is better than perfect!
%%time
def get_info_similar_team_country(team, country, df_stadium, df, threshold, verbose):
    team2 = "Unknown"
    stadium = "Unknown"
    latitude = np.NaN
    longitude = np.NaN
    cols = list(df)
    for idx, val in enumerate(df_stadium['Team']):
        rank = similar(team, val)
        if rank > threshold and country == df_stadium['Country'].iloc[idx]:
            if(verbose): print("%s and %s(Idx=%d) are %f similar and from the same country %s."
                               % (team, val, idx, rank, country))
            team2 = df_stadium['Team'].iloc[idx]
            stadium = df_stadium['Stadium'].iloc[idx]
            latitude = df_stadium['Latitude'].iloc[idx]
            longitude = df_stadium['Longitude'].iloc[idx]
            dtemp = pd.DataFrame([[team, team2, stadium, latitude, longitude, country]], columns=cols)
            df = df.append(dtemp, ignore_index=True)
    # if there is no match, register it
    if team2 == "Unknown":
        df_nomatch = pd.DataFrame([[team, team2, stadium, latitude, longitude, country]], columns=cols)
        df = df.append(df_nomatch, ignore_index=True)
    return df

def generate_new_stadium_dataset(df_champions_teams, df_stadium_new, threshold=0.6, verbose=False):
    df = pd.DataFrame(columns=['Team', 'Team2', 'Stadium', 'Latitude', 'Longitude', 'Country'])
    for idx, row in df_champions_teams.iterrows():
        df = get_info_similar_team_country(row['Team'], row['Country'], df_stadium_new, df,
                                           threshold=threshold, verbose=verbose)
    return df
verbose = False # You can change this to True to see all the combinations
threshold = 0.5
df_stadiums_champions = generate_new_stadium_dataset(df_champions_teams, df_stadium_new, threshold, verbose)
df_stadiums_champions.to_csv('stadiums_champions.csv', index=False)
CPU times: user 2min 9s, sys: 240 ms, total: 2min 9s
Wall time: 2min 9s
After filtering the entries in the csv, let's load the data again and repeat the process.
df_stadiums_champions = pd.read_csv('stadiums_champions_filtered.csv', usecols=['Team','Stadium','Latitude','Longitude','Country'])
df_stadiums_champions.head()
| | Team | Stadium | Latitude | Longitude | Country |
|---|---|---|---|---|---|
| 0 | AC Chievo Verona | Stadio Marc'Antonio Bentegodi | 45.435356 | 10.968647 | Italy |
| 1 | AC Milan | Stadio Giuseppe Meazza | 45.478080 | 9.124000 | Italy |
| 2 | AC Sparta Praha | Generali Arena | 50.099716 | 14.416122 | Czech Republic |
| 3 | ACF Fiorentina | Stadio Artemio Franchi | 43.780822 | 11.282258 | Italy |
| 4 | AE Larisa | AEL FC Arena | 39.615610 | 22.399490 | Greece |
As before, we create a dataframe that relates each match with the coordinates of its stadium.
df_match_stadium_new = df_match
home_stadium_index = df_match_stadium_new['home'].map(lambda x: get_info_similar_team(x, df_stadiums_champions))
visitor_stadium_index = df_match_stadium_new['visitor'].map(lambda x: get_info_similar_team(x, df_stadiums_champions))
df_home = pd.DataFrame(home_stadium_index.tolist(), columns=['home_stadium', 'home_latitude', 'home_longitude'])
df_visitor = pd.DataFrame(visitor_stadium_index.tolist(), columns=['visitor_stadium', 'visitor_latitude', 'visitor_longitude'])
df_match_stadium_new = pd.concat([df_match_stadium_new, df_home, df_visitor], axis=1, ignore_index=False)
df1 = df_match_stadium_new['home_stadium'] == 'Unknown'
df2 = df_match_stadium_new['visitor_stadium'] == 'Unknown'
n_complete_matches = df_match_stadium_new.shape[0] - df_match_stadium_new[df1 | df2].shape[0]
print("Number of matches with complete data: %d out of %d" % (n_complete_matches, df_match_stadium_new.shape[0]))
df_match_stadium_new.head()
Number of matches with complete data: 5110 out of 6554
| | Date | home | visitor | home_country | visitor_country | home_stadium | home_latitude | home_longitude | visitor_stadium | visitor_latitude | visitor_longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1955-09-04 | Sporting CP | Partizan Belgrade | POR | SRB | Estádio José Alvalade | 38.761194 | -9.160783 | Partizan Stadium | 44.788611 | 20.459167 |
| 1 | 1955-09-07 | Budapesti Voros Lobogo | RSC Anderlecht | HUN | BEL | Unknown | NaN | NaN | Constant Vanden Stock Stadium | 50.834167 | 4.298333 |
| 2 | 1955-09-08 | Servette Geneve | Real Madrid | SUI | ESP | Stade de Genève | 46.177778 | 6.127500 | Estadio Santiago Bernabéu | 40.453060 | -3.688350 |
| 3 | 1955-09-14 | Rot-Weiss Essen | Hibernian FC | GER | SCO | Stadion Essen | 51.486667 | 6.976389 | Easter Road | 55.961667 | -3.165556 |
| 4 | 1955-09-20 | Djurgardens IF | Gwardia Warszawa | SWE | POL | Stockholm Olympic Stadium | 59.345278 | 18.078889 | Stadion Polonii Warszawa | 52.255517 | 21.001322 |
The next step is to aggregate the coordinates and transform them to mercator format.
df_agg = aggregate_dataframe_coordinates(df_match_stadium_new)
df_agg_mercator = df_agg.apply(lambda row: to_web_mercator(row['Latitude'], row['Longitude']), axis=1)
print("Number of rows: %d" % df_agg_mercator.shape[0])
df_agg_mercator.head()
Number of rows: 15330
| | Latitude | Longitude |
|---|---|---|
| 0 | -1.019774e+06 | 4.687522e+06 |
| 1 | 2.277504e+06 | 5.588304e+06 |
| 2 | NaN | NaN |
| 3 | 6.821102e+05 | 5.808884e+06 |
| 4 | -4.105852e+05 | 4.932000e+06 |
Finally, we plot the coordinates in the map.
%%time
#InteractiveImage(p, create_image, cmap=inferno)
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 7.63 µs
Finally, the image for github:
This is the second MVP. In this case we have the correct connections between teams in Europe.
If you take a look at the map, you will notice that it is fairly ugly. The perceived quality of the final product will drop if we don't present a good visualization. The next step is to improve the map by changing the background and colors.
ArcGIS2 = WMTSTileSource(url='http://tile.stamen.com/toner-background/{Z}/{X}/{Y}.png')
p2 = base_plot()
p2.add_tile(ArcGIS2)
cmap = reversed(Blues3)
#cmap = reversed(Greens3)
#InteractiveImage(p2, create_image, cmap=cmap)
Finally, the image for github:
We developed three different MVPs by iterating on the project. The Lean Startup method provides an easy way to remove uncertainty and quickly get a minimum product that we can show to the customer.
Happy data sciencing!