# Analyzing the FIFA World Cup Tournament - Data Science Portfolio Project

This project focuses on using Python to explore and analyze men's FIFA World Cup matches, with the goal of answering the following questions:

- Do teams perform better when playing as the host team in the men's FIFA World Cup tournament?
- Who were the top goal scorers individually and by team?
- Are teams who perform better in the 1st or 2nd half more victorious?

## Import and Clean Data

### Import libraries


In [63]:
import pandas as pd
import numpy as np

### Load and Inspect Datasets

The football match data is stored across three CSV files:
- **results.csv**
- **goalscorers.csv**
- **shootouts.csv**

We'll need to load these datasets into pandas and merge them accordingly.

Let's first load and inspect the **results.csv** dataset.

In [64]:
results = pd.read_csv('datasets/results.csv')
print(results.info())
results.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44557 entries, 0 to 44556
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        44557 non-null  object
 1   home_team   44557 non-null  object
 2   away_team   44557 non-null  object
 3   home_score  44557 non-null  int64 
 4   away_score  44557 non-null  int64 
 5   tournament  44557 non-null  object
 6   city        44557 non-null  object
 7   country     44557 non-null  object
 8   neutral     44557 non-null  bool  
dtypes: bool(1), int64(2), object(6)
memory usage: 2.8+ MB
None


Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


Some initial observations:
- There are match results for 44,557 football matches
- It looks like there are no missing values in any of the columns
- The column names and data types look correct

Now let's load and inspect the **shootouts.csv** dataset.

In [65]:
shootouts = pd.read_csv('datasets/shootouts.csv')
print(shootouts.info())
shootouts.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 547 entries, 0 to 546
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   date       547 non-null    object
 1   home_team  547 non-null    object
 2   away_team  547 non-null    object
 3   winner     547 non-null    object
dtypes: object(4)
memory usage: 17.2+ KB
None


Unnamed: 0,date,home_team,away_team,winner
0,1967-08-22,India,Taiwan,Taiwan
1,1971-11-14,South Korea,Vietnam Republic,South Korea
2,1972-05-07,South Korea,Iraq,Iraq
3,1972-05-17,Thailand,South Korea,South Korea
4,1972-05-19,Thailand,Cambodia,Thailand


Some initial observations:
- We have the winners for 547 matches that ended in shootouts
- Again, it looks like there are no missing values in any of the columns
- The column names and data types look correct

### Data Cleaning + Preparation

Before we can merge `results` and `shootouts` into a single DataFrame, we'll need to clean the column names and row values in both DataFrames.

We want to make sure that all column names are lower case and have no hidden whitespace. We also want to make sure that there is no hidden whitespace in any of the text data, since we will be merging on those columns.

Let's save some time by creating a function that can be applied to all DataFrames in our analysis.

In [66]:
def clean_df_text(df):
    df.columns = df.columns.str.lower()
    df.columns = df.columns.str.strip()

    for column in df.columns:
        if df[column].dtype == 'object':
            df[column] = df[column].str.title()
            df[column] = df[column].str.strip()
    return df

Let's now apply the function to clean `results` and `shootouts`.

In [67]:
results_cleaned = clean_df_text(results)
results_cleaned.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


In [68]:
shootouts_cleaned = clean_df_text(shootouts)
shootouts_cleaned.head()

Unnamed: 0,date,home_team,away_team,winner
0,1967-08-22,India,Taiwan,Taiwan
1,1971-11-14,South Korea,Vietnam Republic,South Korea
2,1972-05-07,South Korea,Iraq,Iraq
3,1972-05-17,Thailand,South Korea,South Korea
4,1972-05-19,Thailand,Cambodia,Thailand


Let's now merge `results` and `shootouts` into a single DataFrame. We'll do this by matching the games in both DataFrames using the `date`, `home_team`, and `away_team` columns. Lastly, let's specify a **left** join to return all of the matches in `results`, whether they ended in a shootout or not.

In [69]:
results_shootouts = pd.merge(results_cleaned, shootouts_cleaned, on=['date','home_team','away_team'], how='left')
print(results_shootouts.info())
results_shootouts.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44557 entries, 0 to 44556
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        44557 non-null  object
 1   home_team   44557 non-null  object
 2   away_team   44557 non-null  object
 3   home_score  44557 non-null  int64 
 4   away_score  44557 non-null  int64 
 5   tournament  44557 non-null  object
 6   city        44557 non-null  object
 7   country     44557 non-null  object
 8   neutral     44557 non-null  bool  
 9   winner      546 non-null    object
dtypes: bool(1), int64(2), object(7)
memory usage: 3.4+ MB
None


Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,winner
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False,
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False,
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False,
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False,
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False,


The `winner` column only contains data on matches that resulted in shootouts. The left join inserts `NaN`s for all games that ended without a penalty shootout. We'll address this issue in the later data cleaning and preparation steps.

For now, since we want to analyze the results of the FIFA World Cup tournament, let's first filter the DataFrame for matches played in this specific tournament.

In [70]:
results_shootouts['tournament'].value_counts().head(10)

Friendly                                17593
Fifa World Cup Qualification             7878
Uefa Euro Qualification                  2631
African Cup Of Nations Qualification     1976
Fifa World Cup                            964
Copa América                              841
Afc Asian Cup Qualification               764
African Cup Of Nations                    741
Cecafa Cup                                620
Cfu Caribbean Cup Qualification           606
Name: tournament, dtype: int64

For this analysis, let's focus primarily on the 964 matches played in the FIFA World Cup tournament and ignore the qualification matches.

In [71]:
fifa_wc_mask = results_shootouts['tournament'] == 'Fifa World Cup'
fifa_wc = results_shootouts[fifa_wc_mask].copy()

print(fifa_wc.info())
fifa_wc.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 964 entries, 1311 to 44358
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        964 non-null    object
 1   home_team   964 non-null    object
 2   away_team   964 non-null    object
 3   home_score  964 non-null    int64 
 4   away_score  964 non-null    int64 
 5   tournament  964 non-null    object
 6   city        964 non-null    object
 7   country     964 non-null    object
 8   neutral     964 non-null    bool  
 9   winner      35 non-null     object
dtypes: bool(1), int64(2), object(7)
memory usage: 76.3+ KB
None


Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,winner
1311,1930-07-13,Belgium,United States,0,3,Fifa World Cup,Montevideo,Uruguay,True,
1312,1930-07-13,France,Mexico,4,1,Fifa World Cup,Montevideo,Uruguay,True,
1313,1930-07-14,Brazil,Yugoslavia,1,2,Fifa World Cup,Montevideo,Uruguay,True,
1314,1930-07-14,Peru,Romania,1,3,Fifa World Cup,Montevideo,Uruguay,True,
1315,1930-07-15,Argentina,France,1,0,Fifa World Cup,Montevideo,Uruguay,True,


One takeaway just from filtering for FIFA World Cup matches is that only 35 out of 964 matches have ended in a penalty shootout. One reason this number is relatively small is because of the tournament design. World Cup tournaments start with **group play**, where teams play in small round-robin tournaments that allow ties. After this stage, the best performing teams go into the **knockout rounds**. Only in these later rounds do ties result in penalty shootouts.

We'll need to know the winner of matches that aren't penalty shootouts. Let's create some columns that will help us fill in those values, and will be useful in other aspects of our analysis later on.

Specifically, creating the columns:
- `total_goals` that counts the total number of goals scored per match by both teams
- `win_margin` that calculates the margin of victory by `home_score - away_score`
    - positive `win_margin` values correspond to the home team winning
    - negative `win_margin` values correspond to the away team winning
    - `win_margin` values equal to `0` correpond to either a **Draw** or a match that ended in a penalty shootout

In [72]:
fifa_wc['total_goals'] = fifa_wc['home_score'] + fifa_wc['away_score']

fifa_wc['win_margin'] = fifa_wc['home_score'] - fifa_wc['away_score']

fifa_wc.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,winner,total_goals,win_margin
1311,1930-07-13,Belgium,United States,0,3,Fifa World Cup,Montevideo,Uruguay,True,,3,-3
1312,1930-07-13,France,Mexico,4,1,Fifa World Cup,Montevideo,Uruguay,True,,5,3
1313,1930-07-14,Brazil,Yugoslavia,1,2,Fifa World Cup,Montevideo,Uruguay,True,,3,-1
1314,1930-07-14,Peru,Romania,1,3,Fifa World Cup,Montevideo,Uruguay,True,,4,-2
1315,1930-07-15,Argentina,France,1,0,Fifa World Cup,Montevideo,Uruguay,True,,1,1


Since the `winner` column only contains the winners of matches that resulted in shootouts, let's rename the column to `shootout_winner`.

In [73]:
column_mapper = {'winner':'shootout_winner'}
fifa_wc = fifa_wc.rename(mapper=column_mapper, axis=1)
fifa_wc.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,shootout_winner,total_goals,win_margin
1311,1930-07-13,Belgium,United States,0,3,Fifa World Cup,Montevideo,Uruguay,True,,3,-3
1312,1930-07-13,France,Mexico,4,1,Fifa World Cup,Montevideo,Uruguay,True,,5,3
1313,1930-07-14,Brazil,Yugoslavia,1,2,Fifa World Cup,Montevideo,Uruguay,True,,3,-1
1314,1930-07-14,Peru,Romania,1,3,Fifa World Cup,Montevideo,Uruguay,True,,4,-2
1315,1930-07-15,Argentina,France,1,0,Fifa World Cup,Montevideo,Uruguay,True,,1,1


Let's now attempt to re-create the `winner` column to determine the winners of each match without any missing `NaN` values like before.

We can build a function to determine the winner of each match using the values in the `shootout_winner` and `win_margin` columns. 

We'll have to handle the `NaN`s in the `shootout_winner` column. There are a couple ways to do this. We could convert this column to strings, and use string comparisons to test if each value `== 'nan'`. But checking the pandas docs also shows a built-in method `pd.isna()` which will check for us!

In [74]:
def determine_winner(row):
    if not pd.isna(row['shootout_winner']):
        winner = row['shootout_winner']
    elif row['win_margin'] > 0:
        winner = row['home_team']
    elif row['win_margin'] < 0:
        winner = row['away_team']
    else:
        winner = 'Draw'

    return winner

fifa_wc['winner'] = fifa_wc.apply(determine_winner, axis=1)

print(fifa_wc.info())
fifa_wc[['date', 'home_team', 'away_team', 'home_score', 'away_score', 'win_margin', 'shootout_winner', 'winner']].head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 964 entries, 1311 to 44358
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   date             964 non-null    object
 1   home_team        964 non-null    object
 2   away_team        964 non-null    object
 3   home_score       964 non-null    int64 
 4   away_score       964 non-null    int64 
 5   tournament       964 non-null    object
 6   city             964 non-null    object
 7   country          964 non-null    object
 8   neutral          964 non-null    bool  
 9   shootout_winner  35 non-null     object
 10  total_goals      964 non-null    int64 
 11  win_margin       964 non-null    int64 
 12  winner           964 non-null    object
dtypes: bool(1), int64(4), object(8)
memory usage: 98.8+ KB
None


Unnamed: 0,date,home_team,away_team,home_score,away_score,win_margin,shootout_winner,winner
1311,1930-07-13,Belgium,United States,0,3,-3,,United States
1312,1930-07-13,France,Mexico,4,1,3,,France
1313,1930-07-14,Brazil,Yugoslavia,1,2,-1,,Yugoslavia
1314,1930-07-14,Peru,Romania,1,3,-2,,Romania
1315,1930-07-15,Argentina,France,1,0,1,,Argentina


Now let's use the `winner` column to create two additional columns:

- `home_win` returns `True` if the home team won, otherwise `False`
- `away_win` returns `True` if the away team won, otherwise `False`

In [75]:
fifa_wc['home_win'] = fifa_wc['winner'] == fifa_wc['home_team']

fifa_wc['away_win'] = fifa_wc['winner'] == fifa_wc['away_team']

fifa_wc[['home_team', 'away_team', 'winner', 'home_win', 'away_win']].head()

Unnamed: 0,home_team,away_team,winner,home_win,away_win
1311,Belgium,United States,United States,False,True
1312,France,Mexico,France,True,False
1313,Brazil,Yugoslavia,Yugoslavia,False,True
1314,Peru,Romania,Romania,False,True
1315,Argentina,France,Argentina,True,False


In [76]:
fifa_wc.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,shootout_winner,total_goals,win_margin,winner,home_win,away_win
1311,1930-07-13,Belgium,United States,0,3,Fifa World Cup,Montevideo,Uruguay,True,,3,-3,United States,False,True
1312,1930-07-13,France,Mexico,4,1,Fifa World Cup,Montevideo,Uruguay,True,,5,3,France,True,False
1313,1930-07-14,Brazil,Yugoslavia,1,2,Fifa World Cup,Montevideo,Uruguay,True,,3,-1,Yugoslavia,False,True
1314,1930-07-14,Peru,Romania,1,3,Fifa World Cup,Montevideo,Uruguay,True,,4,-2,Romania,False,True
1315,1930-07-15,Argentina,France,1,0,Fifa World Cup,Montevideo,Uruguay,True,,1,1,Argentina,True,False


Our full dataset now contains a lot of new information. Let's clean up the DataFrame a bit by dropping redundant or unnecessary columns for our analysis. 

Specifically let's drop the following columns:

- `tournament` since we know that all of the matches in our filtered dataset were played in the FIFA World Cup tournament
- `city` since we won't be needing the city information in our analysis
- `shootout_winner` since the `winner` column already determines the winner in all of the matches and we're also not directly analyzing shootouts

In [77]:
drop_columns = ['tournament', 'city', 'shootout_winner']
fifa_wc = fifa_wc.drop(labels= drop_columns, axis=1)

print(fifa_wc.info())
fifa_wc.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 964 entries, 1311 to 44358
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   date         964 non-null    object
 1   home_team    964 non-null    object
 2   away_team    964 non-null    object
 3   home_score   964 non-null    int64 
 4   away_score   964 non-null    int64 
 5   country      964 non-null    object
 6   neutral      964 non-null    bool  
 7   total_goals  964 non-null    int64 
 8   win_margin   964 non-null    int64 
 9   winner       964 non-null    object
 10  home_win     964 non-null    bool  
 11  away_win     964 non-null    bool  
dtypes: bool(3), int64(4), object(5)
memory usage: 78.1+ KB
None


Unnamed: 0,date,home_team,away_team,home_score,away_score,country,neutral,total_goals,win_margin,winner,home_win,away_win
1311,1930-07-13,Belgium,United States,0,3,Uruguay,True,3,-3,United States,False,True
1312,1930-07-13,France,Mexico,4,1,Uruguay,True,5,3,France,True,False
1313,1930-07-14,Brazil,Yugoslavia,1,2,Uruguay,True,3,-1,Yugoslavia,False,True
1314,1930-07-14,Peru,Romania,1,3,Uruguay,True,4,-2,Romania,False,True
1315,1930-07-15,Argentina,France,1,0,Uruguay,True,1,1,Argentina,True,False


## Exploratory Data Analysis

### Data Question #1 - Do teams perform better when playing as the host team in the FIFA World Cup tournament?

The home team data in our dataset is slightly misleading. In the World Cup, one team is always designated as the home team even if they aren't playing in their home country. What we really want to know is whether a team is the host country's team or not. 

Some ways we can tell that the designated home team is also the host team for that tournament:
- matches where `home_team` and `country` values are the same
- matches where `neutral` values are `False`

Let's first count how many games were played where the home team is also the host team.

In [78]:
fifa_wc['neutral'].value_counts()

True     843
False    121
Name: neutral, dtype: int64

Again, we'll want to compare the performances of each team as the host (in the 121 matches containing the host team) to their performance as non-hosts in the FIFA World Cup tournament. 

First, let's split the FIFA World Cup matches for matches played by the host team.

In [79]:
host_team_mask = fifa_wc['home_team'] == fifa_wc['country']
host_team_matches = fifa_wc[host_team_mask]
host_team_matches.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,country,neutral,total_goals,win_margin,winner,home_win,away_win
1320,1930-07-18,Uruguay,Peru,1,0,Uruguay,False,1,1,Uruguay,True,False
1325,1930-07-21,Uruguay,Romania,4,0,Uruguay,False,4,4,Uruguay,True,False
1329,1930-07-27,Uruguay,Yugoslavia,6,1,Uruguay,False,7,5,Uruguay,True,False
1330,1930-07-30,Uruguay,Argentina,4,2,Uruguay,False,6,2,Uruguay,True,False
1694,1934-05-27,Italy,United States,7,1,Italy,False,8,6,Italy,True,False


Let's compute summary statistics about the host team performances.

In [80]:
host_team_matches['home_team'].nunique()

18

In [81]:
host_team_matches['win_margin'].describe()

count    121.000000
mean       0.842975
std        2.000034
min       -6.000000
25%        0.000000
50%        1.000000
75%        2.000000
max        6.000000
Name: win_margin, dtype: float64

In [82]:
host_team_matches['home_win'].mean()

0.6528925619834711

Interestingly, we see that there are 18 unique host teams that won about 65.3% of their matches and on average scored about 0.843 more goals than their opponents!

Now, we'd like to take each of the 18 teams that has hosted a world cup, and compare their win record as a host team to their win record as a non-host team.

Let's start by calculating win percents for the host teams.

In [83]:
host_win_pct = host_team_matches.groupby('home_team').agg({'home_win':'mean'}).reset_index()
host_win_pct.columns = ['team','host_pct']
host_win_pct.head()

Unnamed: 0,team,host_pct
0,Argentina,0.714286
1,Brazil,0.615385
2,Chile,0.666667
3,England,0.833333
4,France,0.888889


And now let's calculate the win percents for the non-host teams.

To do this, we'll have to perform a Split-Apply-Combine process, since non-host teams can be designated home teams or away teams.

For each team, we'll calculate the number of wins and number of games as **home but not host** and **away**.

In [84]:
nonhost_home_matches = fifa_wc[~host_team_mask]
nonhost_home_record = nonhost_home_matches.groupby('home_team').agg({'home_win':'sum','date':'count'}).reset_index()
nonhost_home_record.columns = ['team','wins','num_games']

nonhost_away_record = fifa_wc.groupby('away_team').agg({'away_win':'sum','date':'count'}).reset_index()
nonhost_away_record.columns = ['team','wins','num_games']
nonhost_away_record

nonhost_record = pd.merge(left = nonhost_home_record,
                         right = nonhost_away_record,
                         on = ['team'],
                         how='outer')

nonhost_record.head()

Unnamed: 0,team,wins_x,num_games_x,wins_y,num_games_y
0,Algeria,1.0,6.0,2.0,7.0
1,Angola,0.0,1.0,0.0,2.0
2,Argentina,39.0,58.0,9.0,23.0
3,Australia,3.0,7.0,1.0,13.0
4,Austria,9.0,20.0,3.0,9.0


To calculate the win percents, we need to add up the total wins and divide them by the total number of games.

In [85]:
nonhost_record['nonhost_pct'] = (nonhost_record['wins_x'] + nonhost_record['wins_y']) / (nonhost_record['num_games_x'] + nonhost_record['num_games_y'])
nonhost_record = nonhost_record[['team','nonhost_pct']]

nonhost_record.head()

Unnamed: 0,team,nonhost_pct
0,Algeria,0.230769
1,Angola,0.0
2,Argentina,0.592593
3,Australia,0.2
4,Austria,0.413793


Lastly, we need to merge the non-host record to the host record to perform comparisons. We'll use a left merge to catch any teams that have hosted the world cup but never appeared as non-hosts (we expect this to be Qatar).

In [86]:
comparison = pd.merge(left = host_win_pct,
                  right = nonhost_record,
                  left_on = 'team',
                  right_on = 'team',
                  how = 'left')

comparison

Unnamed: 0,team,host_pct,nonhost_pct
0,Argentina,0.714286,0.592593
1,Brazil,0.615385,0.70297
2,Chile,0.666667,0.259259
3,England,0.833333,0.411765
4,France,0.888889,0.515625
5,Germany,0.857143,0.612245
6,Italy,0.833333,0.507042
7,Japan,0.5,0.238095
8,Mexico,0.555556,0.235294
9,Qatar,0.0,


Let's create a new column that quantifies the difference in each team's performance as the host team and non-host team.

In [87]:
comparison['win_pct_diff'] = comparison['host_pct'] - comparison['nonhost_pct']
comparison.sort_values(by='win_pct_diff',ascending=False)

Unnamed: 0,team,host_pct,nonhost_pct,win_pct_diff
17,Uruguay,1.0,0.4,0.6
12,South Korea,0.571429,0.129032,0.442396
3,England,0.833333,0.411765,0.421569
2,Chile,0.666667,0.259259,0.407407
4,France,0.888889,0.515625,0.373264
6,Italy,0.833333,0.507042,0.326291
8,Mexico,0.555556,0.235294,0.320261
14,Sweden,0.666667,0.355556,0.311111
7,Japan,0.5,0.238095,0.261905
5,Germany,0.857143,0.612245,0.244898


We see that only two countries have a better record in the world cup as non-host than as host. Interestingly, both are world cup winners and one is generally considered one of the most successful teams in the world cup (Brazil). Only one country (Qatar) has no non-host data to report.

There are a couple caveats to these conclusions (and are worth exploring in future analyses!), as this is simply a preliminary exploratory analysis and not a detailed statistical study. For example:

- most countries have played far more as non-host than as host. It may be the case that some of what we are observing is impacted by sample sizes. Additionally, most teams are only ever host once or twice. Thus, we are comparing performance in one tournament to performance in multiple tournaments. We could improve this analysis by disaggregating the nonhost tournaments, and averaging over them.
- win percent is not a perfect measure for tournament success. The more you win in a given tournament, the more games you play. Typically, all teams are guaranteed a certain number of games in group play, but after that it is win or go home. A team that leaves the tournament after group play will have a tournament denominator of 3 games, for example. A team that gets out of group play and then loses will have a tournament denominator of 4 games. Because of tie breakers, both teams could have the same number of wins, with one staying in the tournament by scoring more goals. This team is more successful in the tournament, but would have a lower win percentage for that tournament.
- we haven't taken into account the quality/strength of their opponents.
- we haven't taken into account varying team performance over time due to different coaches, players, and current meta-strategies.

### Data Question #2 - Who were top goal scorers individually and by team?

Next, let's explore the **goalscorers.csv** dataset and use the information to gain insights about:
* Top individual scorers
* Top scorers by team
* 1st half vs 2nd half team performances

In [88]:
scorers = pd.read_csv('datasets/goalscorers.csv')
print(scorers.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41008 entries, 0 to 41007
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   date       41008 non-null  object 
 1   home_team  41008 non-null  object 
 2   away_team  41008 non-null  object 
 3   team       41008 non-null  object 
 4   scorer     40959 non-null  object 
 5   minute     40750 non-null  float64
 6   own_goal   41008 non-null  bool   
 7   penalty    41008 non-null  bool   
dtypes: bool(2), float64(1), object(5)
memory usage: 2.0+ MB
None


Let's clean and standardize the `scorers` dataset by applying the same function used to clean `results` and `shootouts`!

In [89]:
scorers = clean_df_text(scorers)
scorers.head()

Unnamed: 0,date,home_team,away_team,team,scorer,minute,own_goal,penalty
0,1916-07-02,Chile,Uruguay,Uruguay,José Piendibene,44.0,False,False
1,1916-07-02,Chile,Uruguay,Uruguay,Isabelino Gradín,55.0,False,False
2,1916-07-02,Chile,Uruguay,Uruguay,Isabelino Gradín,70.0,False,False
3,1916-07-02,Chile,Uruguay,Uruguay,José Piendibene,75.0,False,False
4,1916-07-06,Argentina,Chile,Argentina,Alberto Ohaco,2.0,False,False


Some initial observations:
- there are about 41,000 rows corresponding individual goal scorers across matches 
- individual matches can be indexed/classified using the `date`, `home_team`, and `away_team` columns
- we know the minute each goal was scored
- we know if a goal was an own goal or a penalty

Let's perform an left merge to create a DataFrame containing goal scorers in the FIFA World Cup tournament matches while keeping all of the matches with zero goals scored.

In [90]:
fifa_wc_scorers = pd.merge(fifa_wc, scorers, on=['date','home_team', 'away_team'], how='left')
fifa_wc_scorers.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2798 entries, 0 to 2797
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         2798 non-null   object 
 1   home_team    2798 non-null   object 
 2   away_team    2798 non-null   object 
 3   home_score   2798 non-null   int64  
 4   away_score   2798 non-null   int64  
 5   country      2798 non-null   object 
 6   neutral      2798 non-null   bool   
 7   total_goals  2798 non-null   int64  
 8   win_margin   2798 non-null   int64  
 9   winner       2798 non-null   object 
 10  home_win     2798 non-null   bool   
 11  away_win     2798 non-null   bool   
 12  team         2720 non-null   object 
 13  scorer       2720 non-null   object 
 14  minute       2720 non-null   float64
 15  own_goal     2720 non-null   object 
 16  penalty      2720 non-null   object 
dtypes: bool(3), float64(1), int64(4), object(9)
memory usage: 336.1+ KB


Since we're only interested in information related to match results and goal scorers, let's select a useful subset of columns for our analysis. 

In [91]:
fifa_wc_scorers = fifa_wc_scorers[['date', 'home_team', 'away_team', 'winner','home_win', 'away_win', 'team', 'scorer', 'minute', 'own_goal', 'penalty']]
fifa_wc_scorers.head()

Unnamed: 0,date,home_team,away_team,winner,home_win,away_win,team,scorer,minute,own_goal,penalty
0,1930-07-13,Belgium,United States,United States,False,True,United States,Bart Mcghee,23.0,False,False
1,1930-07-13,Belgium,United States,United States,False,True,United States,Tom Florie,45.0,False,False
2,1930-07-13,Belgium,United States,United States,False,True,United States,Bert Patenaude,69.0,False,False
3,1930-07-13,France,Mexico,France,True,False,France,Lucien Laurent,19.0,False,False
4,1930-07-13,France,Mexico,France,True,False,France,Marcel Langiller,40.0,False,False


Let's explore the data a little bit by calculating which teams scored the most goals off penalty kicks and which teams, unfortunately, scored the most into their own goals.

In [92]:
penalty_goals = fifa_wc_scorers.groupby('team').agg({'penalty':'sum'})
penalty_goals = penalty_goals.sort_values('penalty', ascending=False).reset_index()
penalty_goals.columns = ['team', 'total_penalty_goals']
penalty_goals.head(10)

Unnamed: 0,team,total_penalty_goals
0,Spain,16
1,France,15
2,Argentina,12
3,Germany,12
4,England,12
5,Netherlands,10
6,Brazil,10
7,Mexico,10
8,Portugal,9
9,Italy,8


In [93]:
own_goals = fifa_wc_scorers.groupby('team').agg({'own_goal':'sum'})
own_goals = own_goals.sort_values('own_goal', ascending=False).reset_index()
own_goals.columns = ['team', 'total_own_goals']
own_goals.head(10)

Unnamed: 0,team,total_own_goals
0,France,6
1,Germany,4
2,Italy,4
3,Portugal,3
4,United States,3
5,Belgium,2
6,Austria,2
7,Russia,2
8,Spain,2
9,Paraguay,2


Let's now calculate the all-time leading goal scorers in the FIFA World Cup tournament.

In [94]:
top_scorers = fifa_wc_scorers.groupby('scorer').agg({'date':'count'})
top_scorers = top_scorers.sort_values('date', ascending=False).reset_index()
top_scorers.columns = ['scorer', 'total_goals']

top_scorers.head(10)

Unnamed: 0,scorer,total_goals
0,Miroslav Klose,16
1,Ronaldo,15
2,Gerd Müller,14
3,Lionel Messi,13
4,Just Fontaine,13
5,Pelé,12
6,Kylian Mbappé,12
7,Jürgen Klinsmann,11
8,Sándor Kocsis,11
9,Gabriel Batistuta,10


Perfect, we see some very famous retired players such as Klose and Ronaldo and current active players such as Mbappé and Messi (as of 2023). 

Now let's calculate the top individual goal scorers for each team. To do this, we'll create a function that utilizes the `.value_counts()` method to count the number of goals scored by individual players and extract the top scorer. We'll apply this function to each team using the `.groupby()` and `.agg()` methods.

In [97]:
def get_top_scorer(column):
    value_counts = column.value_counts()
    top_scorer = value_counts.index[0]
    goals_scored = value_counts.iloc[0]
    return str(top_scorer) + ',' + str(goals_scored)

top_scorers_by_team = fifa_wc_scorers.groupby('team').agg({'scorer': get_top_scorer}).reset_index()
top_scorers_by_team.head()

Unnamed: 0,team,scorer
0,Algeria,"Salah Assad,2"
1,Angola,"Flávio Amado,1"
2,Argentina,"Lionel Messi,13"
3,Australia,"Tim Cahill,5"
4,Austria,"Erich Probst,6"


In [98]:
top_scorers_by_team['top_scorer'] = top_scorers_by_team['scorer'].str.split(',', expand=True)[0]
top_scorers_by_team['goals'] = top_scorers_by_team['scorer'].str.split(',', expand=True)[1]

top_scorers_by_team = top_scorers_by_team.drop(['scorer'], axis=1)

top_scorers_by_team.head(10)

Unnamed: 0,team,top_scorer,goals
0,Algeria,Salah Assad,2
1,Angola,Flávio Amado,1
2,Argentina,Lionel Messi,13
3,Australia,Tim Cahill,5
4,Austria,Erich Probst,6
5,Belgium,Romelu Lukaku,5
6,Bolivia,Erwin Sánchez,1
7,Bosnia And Herzegovina,Vedad Ibišević,1
8,Brazil,Ronaldo,15
9,Bulgaria,Hristo Stoichkov,6


Now we have each team's all-time leading scorer in the FIFA World Cup tournament!

### Data Question #3 - Are teams who perform better in the 1st or 2nd half more victorious?

Next, we might be interested in comparing each team's performance in the 1st and 2nd half of their matches.

We'll determine if a team is more 1st half or 2nd half dominant by looking deeper into the `minute` column which tells us when a goal was scored in the match.

In [99]:
fifa_wc_scorers['minute'].describe()

count    2720.000000
mean       51.581250
std        27.535554
min         1.000000
25%        28.000000
50%        53.000000
75%        75.000000
max       120.000000
Name: minute, dtype: float64

Let's classify each goal depending on if the goal was scored in the 1st half (or first 45 minutes), the 2nd half (between 45 and 90 minutes), or overtime (after 90 minutes). 

- It is important to consider that this is not a perfect measure since extra-time may be added at the end of each half due to breaks, player injuries, or substitutions. 

In [100]:
def classify_goal_time(row):
    if row['minute'] <= 45.0:
        output = '1st half'
    elif row['minute'] <= 90.0:
        output = '2nd half'
    else:
        output = 'Overtime'
    return output

fifa_wc_scorers['classify_goal'] = fifa_wc_scorers.apply(classify_goal_time, axis=1)
fifa_wc_scorers.head()

Unnamed: 0,date,home_team,away_team,winner,home_win,away_win,team,scorer,minute,own_goal,penalty,classify_goal
0,1930-07-13,Belgium,United States,United States,False,True,United States,Bart Mcghee,23.0,False,False,1st half
1,1930-07-13,Belgium,United States,United States,False,True,United States,Tom Florie,45.0,False,False,1st half
2,1930-07-13,Belgium,United States,United States,False,True,United States,Bert Patenaude,69.0,False,False,2nd half
3,1930-07-13,France,Mexico,France,True,False,France,Lucien Laurent,19.0,False,False,1st half
4,1930-07-13,France,Mexico,France,True,False,France,Marcel Langiller,40.0,False,False,1st half


Now that we've classified each goal into 1st half, 2nd half, and overtime, let's count the number of goals scored in each half by each team.

In [101]:
team_classified_goals = pd.pivot_table(fifa_wc_scorers, 
                                       values='scorer', 
                                       index=['team'], 
                                       columns=['classify_goal'], 
                                       aggfunc='count')

team_classified_goals.head()

classify_goal,1st half,2nd half,Overtime
team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Algeria,7.0,5.0,1.0
Angola,,1.0,
Argentina,69.0,78.0,5.0
Australia,7.0,10.0,
Austria,18.0,23.0,2.0


Notice that some teams may not have scored in either half or overtime resulting in missing `NaN` values.

Let's now convert the counts to the percentages of goals scored in each half. To do this, we'll need to divide the number of goals in each half, respectively, by the total number of goals scored overall. 

In [102]:
team_classified_goals.info()

<class 'pandas.core.frame.DataFrame'>
Index: 78 entries, Algeria to Yugoslavia
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   1st half  70 non-null     float64
 1   2nd half  76 non-null     float64
 2   Overtime  28 non-null     float64
dtypes: float64(3)
memory usage: 2.4+ KB


In [103]:
team_classified_goals['total_goals'] = team_classified_goals[['1st half', '2nd half', 'Overtime']].sum(axis=1)
team_classified_goals.head()

classify_goal,1st half,2nd half,Overtime,total_goals
team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Algeria,7.0,5.0,1.0,13.0
Angola,,1.0,,1.0
Argentina,69.0,78.0,5.0,152.0
Australia,7.0,10.0,,17.0
Austria,18.0,23.0,2.0,43.0


Let's also round the percentages to 3 decimals places as we're making the calculations.

In [104]:
team_classified_goals['pct_goals_1st_half'] = round(team_classified_goals['1st half'] / team_classified_goals['total_goals'], 3)
team_classified_goals['pct_goals_2nd_half'] = round(team_classified_goals['2nd half'] / team_classified_goals['total_goals'], 3)
team_classified_goals['pct_goals_Overtime'] = round(team_classified_goals['Overtime'] / team_classified_goals['total_goals'], 3)

team_classified_goals.head()

classify_goal,1st half,2nd half,Overtime,total_goals,pct_goals_1st_half,pct_goals_2nd_half,pct_goals_Overtime
team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Algeria,7.0,5.0,1.0,13.0,0.538,0.385,0.077
Angola,,1.0,,1.0,,1.0,
Argentina,69.0,78.0,5.0,152.0,0.454,0.513,0.033
Australia,7.0,10.0,,17.0,0.412,0.588,
Austria,18.0,23.0,2.0,43.0,0.419,0.535,0.047


Now, let's see which teams scored most of their goals in the 2nd half.

In [105]:
team_classified_goals.sort_values('pct_goals_2nd_half', ascending=False).head(10)

classify_goal,1st half,2nd half,Overtime,total_goals,pct_goals_1st_half,pct_goals_2nd_half,pct_goals_Overtime
team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Qatar,,1.0,,1.0,,1.0,
El Salvador,,1.0,,1.0,,1.0,
Haiti,,2.0,,2.0,,1.0,
Kuwait,,2.0,,2.0,,1.0,
Bolivia,,1.0,,1.0,,1.0,
Angola,,1.0,,1.0,,1.0,
Iraq,,1.0,,1.0,,1.0,
Israel,,1.0,,1.0,,1.0,
Norway,1.0,6.0,,7.0,0.143,0.857,
Costa Rica,4.0,18.0,,22.0,0.182,0.818,


Unfortunately, it looks like there isn't enough data for many teams to confidently compare their 1st half and 2nd half performances. For example, it looks the top 8 teams scored only 1-2 total goals throughout all of their tournament appearances. These were all, perhaps coincidentally, in the 2nd half.

To better analyze and compare team's 1st half and 2nd half performances, let's filter for teams who have scored total goals greater than the 25th percentile.

In [106]:
team_classified_goals['total_goals'].describe()

count     78.000000
mean      34.871795
std       47.675134
min        1.000000
25%        5.000000
50%       16.500000
75%       43.000000
max      237.000000
Name: total_goals, dtype: float64

In [107]:
goal_threshold = 5.0
gt_threshold = team_classified_goals['total_goals'] >= goal_threshold
team_classified_goals = team_classified_goals[gt_threshold]

team_classified_goals.head(10)

classify_goal,1st half,2nd half,Overtime,total_goals,pct_goals_1st_half,pct_goals_2nd_half,pct_goals_Overtime
team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Algeria,7.0,5.0,1.0,13.0,0.538,0.385,0.077
Argentina,69.0,78.0,5.0,152.0,0.454,0.513,0.033
Australia,7.0,10.0,,17.0,0.412,0.588,
Austria,18.0,23.0,2.0,43.0,0.419,0.535,0.047
Belgium,28.0,36.0,5.0,69.0,0.406,0.522,0.072
Brazil,94.0,140.0,3.0,237.0,0.397,0.591,0.013
Bulgaria,8.0,14.0,,22.0,0.364,0.636,
Cameroon,5.0,15.0,2.0,22.0,0.227,0.682,0.091
Chile,18.0,22.0,,40.0,0.45,0.55,
Colombia,9.0,22.0,1.0,32.0,0.281,0.688,0.031


Perfect, it looks we have more data to better evaluate each team's 1st half and 2nd half performances.

But before we compare their performances, let's also calculate each team's overall win percentage so that we can see if teams that perform better in the 1st or 2nd half are more victorious. 

To do this, we'll need to combine each team's record as both the home team and away team. 

In [108]:
home_record = pd.pivot_table(fifa_wc_scorers, 
                            index='home_team', 
                            columns='home_win', 
                            values='date', 
                            aggfunc='count').reset_index()

home_record.columns = ['team', 'home_losses', 'home_wins']
home_record = home_record
home_record.head()

Unnamed: 0,team,home_losses,home_wins
0,Algeria,10.0,5.0
1,Angola,1.0,
2,Argentina,54.0,141.0
3,Australia,11.0,8.0
4,Austria,35.0,30.0


In [110]:
away_record = pd.pivot_table(fifa_wc_scorers, 
                            index='away_team', 
                            columns='away_win', 
                            values='date', 
                            aggfunc='count').reset_index()

away_record.columns = ['team', 'away_losses', 'away_wins']
away_record = away_record
away_record.head()

Unnamed: 0,team,away_losses,away_wins
0,Algeria,9.0,9.0
1,Angola,3.0,
2,Argentina,43.0,21.0
3,Australia,35.0,1.0
4,Austria,10.0,15.0


Again, it looks like some teams have missing `NaN` values corresponding to 0 wins.

Now let's combine each team's home and away record and calculate their total win percentage.

In [111]:
team_record = pd.merge(home_record, away_record, left_on='team', right_on='team', how='inner')
team_record['total_wins'] = team_record[['home_wins', 'away_wins']].sum(axis=1)
team_record['total_losses'] = team_record[['home_losses', 'away_losses']].sum(axis=1)
team_record['win_pct'] = team_record['total_wins'] / (team_record['total_wins'] + team_record['total_losses'])

team_record[['team', 'total_wins', 'total_losses', 'win_pct']].head(10)

Unnamed: 0,team,total_wins,total_losses,win_pct
0,Algeria,14.0,19.0,0.424242
1,Angola,0.0,4.0,0.0
2,Argentina,162.0,97.0,0.625483
3,Australia,9.0,46.0,0.163636
4,Austria,45.0,45.0,0.5
5,Belgium,60.0,85.0,0.413793
6,Bolivia,0.0,22.0,0.0
7,Bosnia And Herzegovina,4.0,4.0,0.5
8,Brazil,267.0,87.0,0.754237
9,Bulgaria,11.0,67.0,0.141026


Finally, let's combine each team's win percentage to their 1st and 2nd half performance data.

In [112]:
goals_record = pd.merge(team_classified_goals, team_record, left_on='team', right_on='team', how='inner')
goals_record = goals_record[['team', 'pct_goals_1st_half', 'pct_goals_2nd_half', 'pct_goals_Overtime', 'win_pct']]
print(goals_record.info())
goals_record.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59 entries, 0 to 58
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   team                59 non-null     object 
 1   pct_goals_1st_half  59 non-null     float64
 2   pct_goals_2nd_half  59 non-null     float64
 3   pct_goals_Overtime  27 non-null     float64
 4   win_pct             59 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.8+ KB
None


Unnamed: 0,team,pct_goals_1st_half,pct_goals_2nd_half,pct_goals_Overtime,win_pct
0,Algeria,0.538,0.385,0.077,0.424242
1,Argentina,0.454,0.513,0.033,0.625483
2,Australia,0.412,0.588,,0.163636
3,Austria,0.419,0.535,0.047,0.5
4,Belgium,0.406,0.522,0.072,0.413793


Let's see which teams scored most of their goals in the 1st half.

In [113]:
top_1sthalf_teams = goals_record.sort_values('pct_goals_1st_half', ascending=False)
top_1sthalf_teams.head(10)

Unnamed: 0,team,pct_goals_1st_half,pct_goals_2nd_half,pct_goals_Overtime,win_pct
31,North Korea,0.667,0.333,,0.037037
53,Turkey,0.65,0.3,0.05,0.459459
44,Serbia,0.625,0.375,,0.088889
15,Egypt,0.6,0.4,,0.0
46,Slovenia,0.6,0.4,,0.066667
30,Nigeria,0.565,0.435,,0.254545
43,Senegal,0.562,0.375,0.062,0.424242
42,Scotland,0.56,0.44,,0.25
28,Morocco,0.55,0.45,,0.27451
32,Northern Ireland,0.538,0.385,0.077,0.135135


In [114]:
print(top_1sthalf_teams['pct_goals_1st_half'].head(10).mean())
print(top_1sthalf_teams['pct_goals_2nd_half'].head(10).mean())

0.5917000000000001
0.3893


In [50]:
print(top_1sthalf_teams['win_pct'].head(10).mean())

0.19904848698966346


Looking at the top 10 1st half teams, we see that on average they scored ~59% of their goals in the 1st half and ~39% of their goals in the 2nd half. Unfortunately, it looks all of their win percentages are below 0.500 (with an average of ~20%) meaning they've lost more games than have won.

Let's see if the reverse phenomenon is true for teams who scored most of their goals in the 2nd half. 

In [115]:
top_2ndhalf_teams = goals_record.sort_values('pct_goals_2nd_half', ascending=False)
top_2ndhalf_teams.head(10)

Unnamed: 0,team,pct_goals_1st_half,pct_goals_2nd_half,pct_goals_Overtime,win_pct
33,Norway,0.143,0.857,,0.25
10,Costa Rica,0.182,0.818,,0.222222
18,German Dr,0.2,0.8,,0.3
45,Slovakia,0.2,0.8,,0.416667
38,Republic Of Ireland,0.2,0.8,,0.217391
23,Iran,0.231,0.769,,0.133333
48,South Korea,0.205,0.769,0.026,0.141667
25,Ivory Coast,0.308,0.692,,0.392857
9,Colombia,0.281,0.688,0.031,0.354839
7,Cameroon,0.227,0.682,0.091,0.126761


In [116]:
print(top_2ndhalf_teams['pct_goals_1st_half'].head(10).mean())
print(top_2ndhalf_teams['pct_goals_2nd_half'].head(10).mean())

0.2177
0.7675


In [117]:
print(top_2ndhalf_teams['win_pct'].head(10).mean())

0.2555736609151559


It looks like the top 10 2nd half teams scored over ~77% of their goals in the 2nd half and ~22% of their goals in the 1st half. 

Interestingly, all of their win percentages are again under 0.500 (with an average of ~26%) which again means they've lost more games than have won.

Let's investigate this phenomenon further by sorting the teams with the highest win percentages.

In [118]:
top_teams = goals_record.sort_values('win_pct', ascending=False).iloc[:10]
top_teams

Unnamed: 0,team,pct_goals_1st_half,pct_goals_2nd_half,pct_goals_Overtime,win_pct
5,Brazil,0.397,0.591,0.013,0.754237
19,Germany,0.409,0.556,0.034,0.663957
17,France,0.426,0.522,0.051,0.634361
1,Argentina,0.454,0.513,0.033,0.625483
24,Italy,0.43,0.508,0.062,0.613208
29,Netherlands,0.406,0.594,,0.612903
11,Croatia,0.372,0.558,0.07,0.607595
22,Hungary,0.471,0.506,0.023,0.6
37,Portugal,0.393,0.607,,0.571429
49,Spain,0.417,0.574,0.009,0.507937


In [120]:
print(top_teams['pct_goals_1st_half'].mean())
print(top_teams['pct_goals_2nd_half'].mean())

0.4175
0.5529000000000001


In [119]:
print(top_teams['win_pct'].mean())

0.6191108575714994


It looks like the top 10 teams tend to score most of their goals in the 2nd half (55%) than the 1st half (42%) and have an average win percent of about ~62%. 

Interestingly, these top teams seem to have a more balanced performance across halves than the underperforming teams who tend to score most of their goals in a single half. And it makes sense, to be a top performer in this sport, your team needs to perform consistently well throughout the entire match. 

Again, there are a couple caveats to these conclusions that we could address in future analyses:

- win percent is not a perfect measure for tournament success.
- we haven't taken into account the quality/strength of their opponents.
- varying team performance over time due to different coaches, players, and current meta-strategies.

It would also be interesting to continue exploring other factors related to timely goals like seeing which teams are more "clutch" by analyzing late-game winning goals scored in overtime, or seeing which teams more often comeback to win when trailing after the 1st half. 