# Data Engineer
## Introduction
The following tasks are to test the basic skillset of a data engineer.
You aren’t expected to spend more than 30-60 minutes on this task
You are free to use the internet but must solve this task yourself.


* You will find three csv files attached whih you will need to use to answer the below questions: 1.goalscorers.csv, results.csv & shootouts.csv
* Add your answers to this file:answers

## Objectives
1. Create a query that calculates the average number of goals per game between 1900 and 2000.
2. Create a query that counts the number of shootouts wins by country and arrange in alphabetical order.
3. Create a reliable key that allows the joining together of goal scorers, results, and shootouts.
4. Create a query that identifies which teams have won a penalty shootout after a 1-1 draw.
5. Create a query that identifies the top goal scorer by tournament, and what percentage that equates to for all goals scored in the tournament.

Additional (If you have time)
1. Create and additional column that flags records with data quality issues
2. Resolve the identified quality issues



In [1]:
import pandas as pd

goalscorers = pd.read_csv('goalscorers.csv')
results = pd.read_csv('results.csv')
shootouts = pd.read_csv('shootouts.csv')

In [2]:
goalscorers.head()

Unnamed: 0,date,home_team,away_team,team,scorer,minute,own_goal,penalty
0,1916-07-02,Chile,Uruguay,Uruguay,José Piendibene,44.0,False,False
1,1916-07-02,Chile,Uruguay,Uruguay,Isabelino Gradín,55.0,False,False
2,1916-07-02,Chile,Uruguay,Uruguay,Isabelino Gradín,70.0,False,False
3,1916-07-02,Chile,Uruguay,Uruguay,José Piendibene,75.0,False,False
4,1916-07-06,Argentina,Chile,Argentina,Alberto Ohaco,2.0,False,False


In [3]:
results.head()


Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


In [4]:
shootouts.head()

Unnamed: 0,date,home_team,away_team,winner,first_shooter
0,1967-08-22,India,Taiwan,Taiwan,
1,1971-11-14,South Korea,Vietnam Republic,South Korea,
2,1972-05-07,South Korea,Iraq,Iraq,
3,1972-05-17,Thailand,South Korea,South Korea,
4,1972-05-19,Thailand,Cambodia,Thailand,


Task 1: Average Number of Goals per Game (1900-2000)

Filter results for matches between 1900 and 2000, compute the total goals (home_score + away_score) for each match, and calculate the average.

1. Create a query that calculates the average number of goals per game between 1900 and 2000.


In [5]:
# Convert the 'date' column to datetime and extract the year
results['year'] = pd.to_datetime(results['date']).dt.year

# Filter the results for games between 1900 and 2000, and create a copy
filtered_results = results[(results['year'] >= 1900) & (results['year'] <= 2000)].copy()

# Add a new column for total goals
filtered_results['total_goals'] = filtered_results['home_score'] + filtered_results['away_score']

# Calculate the average number of goals per game
average_goals = filtered_results['total_goals'].mean()

print("Average goals per game (1900-2000):", average_goals)

Average goals per game (1900-2000): 3.0704284750337383


Task 2: Shootout Wins by Country

Group shootouts by winner and count the occurrences. Sort alphabetically by country.

2. Create a query that counts the number of shootouts wins by country and arrange in alphabetical order.


In [6]:
shootout_wins = shootouts['winner'].value_counts().reset_index()
shootout_wins.columns = ['country', 'wins']
shootout_wins_sorted = shootout_wins.sort_values(by='country')
print(shootout_wins_sorted)

                 country  wins
79              Abkhazia     2
26               Algeria     7
16                Angola     7
92   Antigua and Barbuda     2
0              Argentina    14
..                   ...   ...
1                 Zambia    13
108             Zanzibar     2
22              Zimbabwe     7
162                Åland     1
104        Åland Islands     2

[163 rows x 2 columns]


Task 3: Reliable Key for Joining

A reliable join key should uniquely identify matches across all datasets. Since all datasets share date, home_team, and away_team, we can use these fields to create a composite key.

3. Create a reliable key that allows the joining together of goal scorers, results, and shootouts.


In [7]:
# Create composite join keys for each dataset
results['join_key'] = results['date'] + '_' + results['home_team'] + '_' + results['away_team']
goalscorers['join_key'] = goalscorers['date'] + '_' + goalscorers['home_team'] + '_' + goalscorers['away_team']
shootouts['join_key'] = shootouts['date'] + '_' + shootouts['home_team'] + '_' + shootouts['away_team']

In [8]:
print("Results DataFrame:")
results.head()

Results DataFrame:


Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,join_key
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False,1872,1872-11-30_Scotland_England
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False,1873,1873-03-08_England_Scotland
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False,1874,1874-03-07_Scotland_England
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False,1875,1875-03-06_England_Scotland
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False,1876,1876-03-04_Scotland_England


In [9]:
print("\nGoalscorers DataFrame:")
goalscorers.head()



Goalscorers DataFrame:


Unnamed: 0,date,home_team,away_team,team,scorer,minute,own_goal,penalty,join_key
0,1916-07-02,Chile,Uruguay,Uruguay,José Piendibene,44.0,False,False,1916-07-02_Chile_Uruguay
1,1916-07-02,Chile,Uruguay,Uruguay,Isabelino Gradín,55.0,False,False,1916-07-02_Chile_Uruguay
2,1916-07-02,Chile,Uruguay,Uruguay,Isabelino Gradín,70.0,False,False,1916-07-02_Chile_Uruguay
3,1916-07-02,Chile,Uruguay,Uruguay,José Piendibene,75.0,False,False,1916-07-02_Chile_Uruguay
4,1916-07-06,Argentina,Chile,Argentina,Alberto Ohaco,2.0,False,False,1916-07-06_Argentina_Chile


In [10]:
print("\nShootouts DataFrame:")
shootouts.head()


Shootouts DataFrame:


Unnamed: 0,date,home_team,away_team,winner,first_shooter,join_key
0,1967-08-22,India,Taiwan,Taiwan,,1967-08-22_India_Taiwan
1,1971-11-14,South Korea,Vietnam Republic,South Korea,,1971-11-14_South Korea_Vietnam Republic
2,1972-05-07,South Korea,Iraq,Iraq,,1972-05-07_South Korea_Iraq
3,1972-05-17,Thailand,South Korea,South Korea,,1972-05-17_Thailand_South Korea
4,1972-05-19,Thailand,Cambodia,Thailand,,1972-05-19_Thailand_Cambodia


Task 4: Teams Winning Shootouts After a 1-1 Draw

Filter results for matches with a 1-1 draw, and then cross-reference with shootouts to identify winners.

4. Create a query that identifies which teams have won a penalty shootout after a 1-1 draw.


In [11]:
# Rename columns in shootouts to avoid conflicts
shootouts_renamed = shootouts.rename(columns={
    'date': 'shootout_date',
    'home_team': 'shootout_home_team',
    'away_team': 'shootout_away_team',
    'winner': 'shootout_winner'
})

# Filter results for matches that ended in a 1-1 draw
draws_1_1 = results[results['home_score'] == 1 & (results['away_score'] == 1)].copy()

# Merge with shootouts based on the composite join key
draws_with_shootouts = draws_1_1.merge(shootouts_renamed, on='join_key', how='inner')

#Extract the relevant columns for winners
winners = draws_with_shootouts[['shootout_date', 'shootout_home_team', 'shootout_away_team', 'shootout_winner']].copy()

# Step 4: Rename columns for clarity
winners.rename(columns={
    'shootout_date': 'date',
    'shootout_home_team': 'home_team',
    'shootout_away_team': 'away_team'
}, inplace=True)

# Step 5: Display the winners
#used unique so no repeats
print(winners["shootout_winner"].unique())

['Taiwan' 'South Korea' 'Iraq' 'Guinea' 'Mauritius' 'Singapore' 'Malaysia'
 'Myanmar' 'Vietnam Republic' 'Syria' 'Algeria' 'Qatar' 'Indonesia'
 'Morocco' 'Tunisia' 'Argentina' 'Iran' 'Mali' 'Czechoslovakia' 'China PR'
 'Togo' 'Burkina Faso' 'Ghana' 'Thailand' 'Kenya' 'Nigeria' 'Ivory Coast'
 'Sierra Leone' 'Senegal' 'Cameroon' 'Spain' 'Saudi Arabia' 'Zambia'
 'Chad' 'Kuwait' 'Mozambique' 'Bahrain' 'Egypt' 'France' 'Germany'
 'Belgium' 'Australia' 'DR Congo' 'Zimbabwe' 'Gabon' 'Ethiopia' 'Sweden'
 'Canada' 'Ecuador' 'Eswatini' 'Colombia' 'Republic of Ireland' 'Uganda'
 'Italy' 'United States' 'Fiji' 'Martinique' 'Switzerland' 'Finland'
 'Bulgaria' 'Brazil' 'Denmark' 'Mexico' 'India' 'Paraguay' 'Uruguay'
 'Zanzibar' 'Honduras' 'England' 'Czech Republic' 'Sudan' 'Croatia'
 'Benin' 'Russia' 'Trinidad and Tobago' 'Cuba' 'Hungary' 'Namibia'
 'Poland' 'Rwanda' 'Romania' 'British Virgin Islands'
 'Antigua and Barbuda' 'Sri Lanka' 'Angola' 'Barbados' 'Lesotho'
 'United Arab Emirates' 'Guernsey'

Task 5: Top Goal Scorer by Tournament

Group by tournament and scorer to find the player with the most goals in each tournament. Then calculate their percentage of total tournament goals.

5. Create a query that identifies the top goal scorer by tournament, and what percentage that equates to for all goals scored in the tournament.

In [12]:
# Merge the goalscorers dataset with the results dataset to include the tournament information
goalscorers_with_tournament = goalscorers.merge(results[['join_key', 'tournament']], on='join_key', how='inner')

# the goals for each player in each tournament
scorer_goals = goalscorers_with_tournament.groupby(['tournament', 'scorer']).size().reset_index(name='scorer_goals')

#total goals in each tournament
tournament_goals = scorer_goals.groupby('tournament')['scorer_goals'].sum().reset_index(name='total_goals')

#Find the top scorer per tournament
top_scorer_per_tournament = scorer_goals.loc[scorer_goals.groupby('tournament')['scorer_goals'].idxmax()]

#Merge
top_scorer_with_percentage = top_scorer_per_tournament.merge(tournament_goals, on='tournament')

#percentage of total goals scored by the top scorer
top_scorer_with_percentage['percentage'] = (top_scorer_with_percentage['scorer_goals'] / 
                                            top_scorer_with_percentage['total_goals']) * 100


top_scorer_with_percentage[['tournament', 'scorer', 'scorer_goals', 'total_goals', 'percentage']]

Unnamed: 0,tournament,scorer,scorer_goals,total_goals,percentage
0,AFC Asian Cup,Ali Daei,14,987,1.41844
1,African Cup of Nations,Samuel Eto'o,18,1767,1.018676
2,Baltic Cup,Ēriks Pētersons,9,229,3.930131
3,British Home Championship,Geoff Hurst,4,33,12.121212
4,CONMEBOL–UEFA Cup of Champions,Claudio Caniggia,1,7,14.285714
5,Confederations Cup,Cuauhtémoc Blanco,9,423,2.12766
6,Copa América,Norberto Doroteo Méndez,17,2671,0.636466
7,FIFA World Cup,Miroslav Klose,16,2720,0.588235
8,FIFA World Cup qualification,Carlos Ruiz,39,22738,0.171519
9,Gold Cup,Landon Donovan,18,1097,1.640839


TASK
Additional

1. Create and additional column that flags records with data quality issues

2. Resolve the identified quality issues


In [13]:
goalscorers['data_quality_issue'] = (
    # Columns Null Check
    (goalscorers['minute'] < 0) |
    (goalscorers['minute'].isnull()) |
    (goalscorers['scorer'].isnull()) |
    (goalscorers['own_goal'].isnull()) |
    (goalscorers['penalty'].isnull()) |
    # Check for duplicate entries based on 'scorer', 'join_key', and 'minute' (excluding minute=NaN)
    (goalscorers['minute'].notna()) & (goalscorers.duplicated(subset=['scorer', 'join_key', 'minute'], keep=False))
)

results['data_quality_issue'] = (
    (results['home_score'] < 0) |  # Check for negative home score
    (results['away_score'] < 0) |  # Check for negative away score
    results['home_score'].isnull() |  # Home score should not be null
    results['away_score'].isnull() |  # Away score should not be null
    results['tournament'].isnull() |  # Tournament should not be null
    results['city'].isnull() |  # City should not be null
    results['country'].isnull()  # Country should not be null
)


shootouts['data_quality_issue'] = (
    # First shooter should not be null when there is a winner
    (shootouts['winner'].notnull() & shootouts['first_shooter'].isnull()) |
    (shootouts['home_team'].isnull()) |
    (shootouts['away_team'].isnull()) |
    # Winner should be home or away team
    (~shootouts['winner'].isin(shootouts['home_team'])) & (~shootouts['winner'].isin(shootouts['away_team'])) |
    # First shooter should be from home or away team
    (~shootouts['first_shooter'].isin(shootouts['home_team'].tolist() + shootouts['away_team'].tolist())) |
    # Null checks for columns
    (shootouts['winner'].isnull()) |  # Check if winner is null
    (shootouts['first_shooter'].isnull())  # Check if first_shooter is null
)

Focusing on goalscorers

In [14]:
goalscorers['data_quality_issue'].value_counts()

data_quality_issue
False    42833
True       356
Name: count, dtype: int64

Resolve flag = Duplicates 

In [15]:
#Goalscorers
goalscorers[goalscorers['data_quality_issue']==True]

Unnamed: 0,date,home_team,away_team,team,scorer,minute,own_goal,penalty,join_key,data_quality_issue
3107,1960-10-16,Taiwan,Vietnam Republic,Taiwan,Yiu Cheuk Yin,,False,False,1960-10-16_Taiwan_Vietnam Republic,True
3743,1963-11-26,Ghana,Ethiopia,Ghana,Edward Acquah,,False,False,1963-11-26_Ghana_Ethiopia,True
3744,1963-11-26,Ghana,Ethiopia,Ghana,Edward Acquah,,False,False,1963-11-26_Ghana_Ethiopia,True
3747,1963-11-28,Ethiopia,Tunisia,Ethiopia,Mengistu Worku,,False,False,1963-11-28_Ethiopia_Tunisia,True
3748,1963-11-28,Ethiopia,Tunisia,Ethiopia,Mengistu Worku,,False,False,1963-11-28_Ethiopia_Tunisia,True
...,...,...,...,...,...,...,...,...,...,...
42607,2023-06-18,Netherlands,Italy,Netherlands,Steven Bergwijn,68.0,False,False,2023-06-18_Netherlands_Italy,True
42608,2023-06-18,Netherlands,Italy,Italy,Federico Chiesa,72.0,False,False,2023-06-18_Netherlands_Italy,True
42609,2023-06-18,Netherlands,Italy,Italy,Federico Chiesa,72.0,False,False,2023-06-18_Netherlands_Italy,True
42610,2023-06-18,Netherlands,Italy,Netherlands,Georginio Wijnaldum,89.0,False,False,2023-06-18_Netherlands_Italy,True


In [16]:
#Check duplicate e.g 
goalscorers[goalscorers["join_key"]=="2023-06-18_Netherlands_Italy"]

Unnamed: 0,date,home_team,away_team,team,scorer,minute,own_goal,penalty,join_key,data_quality_issue
42602,2023-06-18,Netherlands,Italy,Italy,Federico Dimarco,6.0,False,False,2023-06-18_Netherlands_Italy,True
42603,2023-06-18,Netherlands,Italy,Italy,Federico Dimarco,6.0,False,False,2023-06-18_Netherlands_Italy,True
42604,2023-06-18,Netherlands,Italy,Italy,Davide Frattesi,20.0,False,False,2023-06-18_Netherlands_Italy,True
42605,2023-06-18,Netherlands,Italy,Italy,Davide Frattesi,20.0,False,False,2023-06-18_Netherlands_Italy,True
42606,2023-06-18,Netherlands,Italy,Netherlands,Steven Bergwijn,68.0,False,False,2023-06-18_Netherlands_Italy,True
42607,2023-06-18,Netherlands,Italy,Netherlands,Steven Bergwijn,68.0,False,False,2023-06-18_Netherlands_Italy,True
42608,2023-06-18,Netherlands,Italy,Italy,Federico Chiesa,72.0,False,False,2023-06-18_Netherlands_Italy,True
42609,2023-06-18,Netherlands,Italy,Italy,Federico Chiesa,72.0,False,False,2023-06-18_Netherlands_Italy,True
42610,2023-06-18,Netherlands,Italy,Netherlands,Georginio Wijnaldum,89.0,False,False,2023-06-18_Netherlands_Italy,True
42611,2023-06-18,Netherlands,Italy,Netherlands,Georginio Wijnaldum,89.0,False,False,2023-06-18_Netherlands_Italy,True


In [17]:
# Get rows where 'minute' is NaN and 'data_quality_issue' is True
nan_minute_rows = goalscorers[(goalscorers['data_quality_issue'] == True) & (goalscorers['minute'].isna())]

goalscorers = goalscorers[(goalscorers['data_quality_issue'] == True) & (goalscorers['minute'].notna())]

# Remove duplicates
goalscorers.drop_duplicates(subset=['date', 'home_team', 'away_team', 'scorer', 'minute'], inplace=True)

# Concatenate the rows with NaN 'minute' back into the goalscorers
goalscorers = pd.concat([goalscorers, nan_minute_rows])

# Reset the index to ensure it's consistent
goalscorers.reset_index(drop=True, inplace=True)

In [18]:
#Check duplicates is resolved 
goalscorers[goalscorers["join_key"]=="2023-06-18_Netherlands_Italy"]

Unnamed: 0,date,home_team,away_team,team,scorer,minute,own_goal,penalty,join_key,data_quality_issue
43,2023-06-18,Netherlands,Italy,Italy,Federico Dimarco,6.0,False,False,2023-06-18_Netherlands_Italy,True
44,2023-06-18,Netherlands,Italy,Italy,Davide Frattesi,20.0,False,False,2023-06-18_Netherlands_Italy,True
45,2023-06-18,Netherlands,Italy,Netherlands,Steven Bergwijn,68.0,False,False,2023-06-18_Netherlands_Italy,True
46,2023-06-18,Netherlands,Italy,Italy,Federico Chiesa,72.0,False,False,2023-06-18_Netherlands_Italy,True
47,2023-06-18,Netherlands,Italy,Netherlands,Georginio Wijnaldum,89.0,False,False,2023-06-18_Netherlands_Italy,True


In [19]:
goalscorers['data_quality_issue'] = (
    # Columns Null Check
    (goalscorers['minute'] < 0) |
    (goalscorers['minute'].isnull()) |
    (goalscorers['scorer'].isnull()) |
    (goalscorers['own_goal'].isnull()) |
    (goalscorers['penalty'].isnull()) |
    # Check for duplicate entries based on 'scorer', 'join_key', and 'minute' (excluding minute=NaN)
    (goalscorers['minute'].notna()) & (goalscorers.duplicated(subset=['scorer', 'join_key', 'minute'], keep=False))
)

In [20]:
goalscorers['data_quality_issue'].value_counts()

data_quality_issue
True     260
False     48
Name: count, dtype: int64

Scorer = null and minute = null

In [21]:
goalscorers[(goalscorers['data_quality_issue'] == True)]

Unnamed: 0,date,home_team,away_team,team,scorer,minute,own_goal,penalty,join_key,data_quality_issue
48,1960-10-16,Taiwan,Vietnam Republic,Taiwan,Yiu Cheuk Yin,,False,False,1960-10-16_Taiwan_Vietnam Republic,True
49,1963-11-26,Ghana,Ethiopia,Ghana,Edward Acquah,,False,False,1963-11-26_Ghana_Ethiopia,True
50,1963-11-26,Ghana,Ethiopia,Ghana,Edward Acquah,,False,False,1963-11-26_Ghana_Ethiopia,True
51,1963-11-28,Ethiopia,Tunisia,Ethiopia,Mengistu Worku,,False,False,1963-11-28_Ethiopia_Tunisia,True
52,1963-11-28,Ethiopia,Tunisia,Ethiopia,Mengistu Worku,,False,False,1963-11-28_Ethiopia_Tunisia,True
...,...,...,...,...,...,...,...,...,...,...
303,1997-03-27,Saudi Arabia,Bangladesh,Saudi Arabia,Abdullah Al-Dosari,,False,False,1997-03-27_Saudi Arabia_Bangladesh,True
304,1997-03-29,Taiwan,Bangladesh,Taiwan,Hsu Te Ming,,False,False,1997-03-29_Taiwan_Bangladesh,True
305,1997-03-29,Taiwan,Bangladesh,Bangladesh,Alfaz Ahmed,,False,False,1997-03-29_Taiwan_Bangladesh,True
306,1997-03-29,Taiwan,Bangladesh,Bangladesh,Imtiaz Ahmed Nakib,,False,False,1997-03-29_Taiwan_Bangladesh,True


Resolve Flag = Minute is NaN

In [22]:
goalscorers[(goalscorers['data_quality_issue']==True) & (goalscorers['minute'].isnull())]

Unnamed: 0,date,home_team,away_team,team,scorer,minute,own_goal,penalty,join_key,data_quality_issue
48,1960-10-16,Taiwan,Vietnam Republic,Taiwan,Yiu Cheuk Yin,,False,False,1960-10-16_Taiwan_Vietnam Republic,True
49,1963-11-26,Ghana,Ethiopia,Ghana,Edward Acquah,,False,False,1963-11-26_Ghana_Ethiopia,True
50,1963-11-26,Ghana,Ethiopia,Ghana,Edward Acquah,,False,False,1963-11-26_Ghana_Ethiopia,True
51,1963-11-28,Ethiopia,Tunisia,Ethiopia,Mengistu Worku,,False,False,1963-11-28_Ethiopia_Tunisia,True
52,1963-11-28,Ethiopia,Tunisia,Ethiopia,Mengistu Worku,,False,False,1963-11-28_Ethiopia_Tunisia,True
...,...,...,...,...,...,...,...,...,...,...
303,1997-03-27,Saudi Arabia,Bangladesh,Saudi Arabia,Abdullah Al-Dosari,,False,False,1997-03-27_Saudi Arabia_Bangladesh,True
304,1997-03-29,Taiwan,Bangladesh,Taiwan,Hsu Te Ming,,False,False,1997-03-29_Taiwan_Bangladesh,True
305,1997-03-29,Taiwan,Bangladesh,Bangladesh,Alfaz Ahmed,,False,False,1997-03-29_Taiwan_Bangladesh,True
306,1997-03-29,Taiwan,Bangladesh,Bangladesh,Imtiaz Ahmed Nakib,,False,False,1997-03-29_Taiwan_Bangladesh,True


The NaN values in the “minute” column cannot be resolved, as other data files do not provide the necessary information to identify the minute of the goal.

Resolve Flag = Own goal is NaN

Resolve Flag = Penalty is NaN


In [23]:
goalscorers[(goalscorers['data_quality_issue'] == True)&(goalscorers['own_goal'].isna())&(goalscorers['penalty'].isna())]

Unnamed: 0,date,home_team,away_team,team,scorer,minute,own_goal,penalty,join_key,data_quality_issue
241,1980-09-23,Malaysia,Qatar,Malaysia,Tukamin Bahari,,,,1980-09-23_Malaysia_Qatar,True
242,1980-09-23,Malaysia,Qatar,Qatar,,,,,1980-09-23_Malaysia_Qatar,True


In [24]:
results[(results['join_key'] == "1980-09-23_Malaysia_Qatar")]

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,join_key,data_quality_issue
11687,1980-09-23,Malaysia,Qatar,1,1,AFC Asian Cup,Kuwait City,Kuwait,True,1980,1980-09-23_Malaysia_Qatar,False


In [25]:
shootouts[(shootouts['join_key'] == "1980-09-23_Malaysia_Qatar")]

Unnamed: 0,date,home_team,away_team,winner,first_shooter,join_key,data_quality_issue


Find final score for games using only goalscores.csv and compare with final score results.csv

-create new column in goalscorers.csv which holds final score for each game

-create new column in results.csv which holds final score for each game

-compare

In [26]:
# Initialise the final score column n goalscorers
goalscorers['final_score'] = None

# Group by 'join_key' and calculate the final score
for join_key, group in goalscorers.groupby('join_key'):
    # Extract home and away teams
    home_team = group['home_team'].iloc[0]
    away_team = group['away_team'].iloc[0]
    
    # Calculate final home and away scores
    home_score = group[group['team'] == home_team].shape[0]  # Count goals for home team
    away_score = group[group['team'] == away_team].shape[0]  # Count goals for away team
    
    final_score = f"{home_score}-{away_score}"
    
    goalscorers.loc[group.index, 'final_score'] = final_score

In [27]:
# Initialise the final_score column for results
results['final_score'] = results['home_score'].astype(str) + '-' + results['away_score'].astype(str)

In [28]:
print("goalscorers.head()")
goalscorers.head()

goalscorers.head()


Unnamed: 0,date,home_team,away_team,team,scorer,minute,own_goal,penalty,join_key,data_quality_issue,final_score
0,1981-06-06,Fiji,Taiwan,Fiji,Ratu Jone,55.0,False,False,1981-06-06_Fiji_Taiwan,False,1-0
1,2001-04-30,Oman,Laos,Oman,Hani Al-Dhabit,45.0,False,False,2001-04-30_Oman_Laos,False,1-0
2,2002-01-30,Burkina Faso,Ghana,Ghana,Isaac Boakye,90.0,False,False,2002-01-30_Burkina Faso_Ghana,False,0-1
3,2003-12-03,Maldives,Mongolia,Maldives,Ibrahim Fazeel,45.0,False,False,2003-12-03_Maldives_Mongolia,False,1-0
4,2004-06-13,France,England,France,Zinedine Zidane,90.0,False,False,2004-06-13_France_England,False,1-0


In [29]:
print("results.head()")
results.head()

results.head()


Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,join_key,data_quality_issue,final_score
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False,1872,1872-11-30_Scotland_England,False,0-0
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False,1873,1873-03-08_England_Scotland,False,4-2
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False,1874,1874-03-07_Scotland_England,False,2-1
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False,1875,1875-03-06_England_Scotland,False,2-2
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False,1876,1876-03-04_Scotland_England,False,3-0


In [30]:
# Merging the datasets
merged_data = pd.merge(goalscorers[['join_key', 'final_score']], 
                       results[['join_key', 'final_score']], 
                       on='join_key', 
                       suffixes=('_goalscorers', '_results'))

discrepancies = merged_data[merged_data['final_score_goalscorers'] != merged_data['final_score_results']]


In [31]:
discrepancies

Unnamed: 0,join_key,final_score_goalscorers,final_score_results
0,1981-06-06_Fiji_Taiwan,1-0,2-1
1,2001-04-30_Oman_Laos,1-0,12-0
2,2002-01-30_Burkina Faso_Ghana,0-1,1-2
3,2003-12-03_Maldives_Mongolia,1-0,12-0
4,2004-06-13_France_England,1-0,2-1
5,2004-10-09_Turkey_Kazakhstan,1-0,4-0
6,2015-06-13_Poland_Georgia,1-0,4-0
16,2019-09-08_Spain_Faroe Islands,1-0,4-0
17,2021-09-05_San Marino_Poland,0-1,1-7
32,2021-10-12_Syria_Lebanon,0-1,2-3


In [32]:
# Count the number of discrepancies
discrepancy_count = discrepancies.shape[0]

# Display the count
print("Number of discrepancies:", discrepancy_count)

Number of discrepancies: 39


Focussing on Results

In [33]:
results.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45315 entries, 0 to 45314
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   date                45315 non-null  object
 1   home_team           45315 non-null  object
 2   away_team           45315 non-null  object
 3   home_score          45315 non-null  int64 
 4   away_score          45315 non-null  int64 
 5   tournament          45315 non-null  object
 6   city                45315 non-null  object
 7   country             45315 non-null  object
 8   neutral             45315 non-null  bool  
 9   year                45315 non-null  int32 
 10  join_key            45315 non-null  object
 11  data_quality_issue  45315 non-null  bool  
 12  final_score         45315 non-null  object
dtypes: bool(2), int32(1), int64(2), object(8)
memory usage: 3.7+ MB


In [34]:
results.describe()

Unnamed: 0,home_score,away_score,year
count,45315.0,45315.0,45315.0
mean,1.739314,1.178241,1992.612733
std,1.746904,1.392095,24.735806
min,0.0,0.0,1872.0
25%,1.0,0.0,1979.0
50%,1.0,1.0,1999.0
75%,2.0,2.0,2011.0
max,31.0,21.0,2023.0


In [35]:
results[(results['data_quality_issue'] == True)]

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,join_key,data_quality_issue,final_score


Focussing on Shootouts

In [36]:
shootouts.head()

Unnamed: 0,date,home_team,away_team,winner,first_shooter,join_key,data_quality_issue
0,1967-08-22,India,Taiwan,Taiwan,,1967-08-22_India_Taiwan,True
1,1971-11-14,South Korea,Vietnam Republic,South Korea,,1971-11-14_South Korea_Vietnam Republic,True
2,1972-05-07,South Korea,Iraq,Iraq,,1972-05-07_South Korea_Iraq,True
3,1972-05-17,Thailand,South Korea,South Korea,,1972-05-17_Thailand_South Korea,True
4,1972-05-19,Thailand,Cambodia,Thailand,,1972-05-19_Thailand_Cambodia,True


Flags for shootouts

In [37]:
shootouts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 562 entries, 0 to 561
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   date                562 non-null    object
 1   home_team           562 non-null    object
 2   away_team           562 non-null    object
 3   winner              562 non-null    object
 4   first_shooter       86 non-null     object
 5   join_key            562 non-null    object
 6   data_quality_issue  562 non-null    bool  
dtypes: bool(1), object(6)
memory usage: 27.0+ KB


In [38]:
shootouts.describe()

Unnamed: 0,date,home_team,away_team,winner,first_shooter,join_key,data_quality_issue
count,562,562,562,562,86,562,562
unique,498,165,178,163,29,562,2
top,2016-06-03,South Africa,Uganda,Argentina,Colombia,1967-08-22_India_Taiwan,True
freq,5,15,15,14,9,1,476


In [39]:
shootouts['data_quality_issue'].value_counts()

data_quality_issue
True     476
False     86
Name: count, dtype: int64

In [40]:
shootouts[(shootouts['data_quality_issue'] == True)]

Unnamed: 0,date,home_team,away_team,winner,first_shooter,join_key,data_quality_issue
0,1967-08-22,India,Taiwan,Taiwan,,1967-08-22_India_Taiwan,True
1,1971-11-14,South Korea,Vietnam Republic,South Korea,,1971-11-14_South Korea_Vietnam Republic,True
2,1972-05-07,South Korea,Iraq,Iraq,,1972-05-07_South Korea_Iraq,True
3,1972-05-17,Thailand,South Korea,South Korea,,1972-05-17_Thailand_South Korea,True
4,1972-05-19,Thailand,Cambodia,Thailand,,1972-05-19_Thailand_Cambodia,True
...,...,...,...,...,...,...,...
557,2023-07-12,United States,Panama,Panama,,2023-07-12_United States_Panama,True
558,2023-09-07,Iraq,India,Iraq,,2023-09-07_Iraq_India,True
559,2023-09-10,Thailand,Iraq,Iraq,,2023-09-10_Thailand_Iraq,True
560,2023-10-13,Iraq,Qatar,Qatar,,2023-10-13_Iraq_Qatar,True


In [41]:
# Get the count of rows with 'data_quality_issue' as True and 'first_shooter' as null
first_shooter_Null= shootouts[(shootouts['data_quality_issue'] == True) & (shootouts['first_shooter'].isnull())].shape[0]

print(f"Count shootouts with missing 'first_shooter': {first_shooter_Null}")

Count shootouts with missing 'first_shooter': 476


In [42]:
shootouts[(shootouts['data_quality_issue'] == True) & (shootouts['first_shooter'].isnull()) ]

Unnamed: 0,date,home_team,away_team,winner,first_shooter,join_key,data_quality_issue
0,1967-08-22,India,Taiwan,Taiwan,,1967-08-22_India_Taiwan,True
1,1971-11-14,South Korea,Vietnam Republic,South Korea,,1971-11-14_South Korea_Vietnam Republic,True
2,1972-05-07,South Korea,Iraq,Iraq,,1972-05-07_South Korea_Iraq,True
3,1972-05-17,Thailand,South Korea,South Korea,,1972-05-17_Thailand_South Korea,True
4,1972-05-19,Thailand,Cambodia,Thailand,,1972-05-19_Thailand_Cambodia,True
...,...,...,...,...,...,...,...
557,2023-07-12,United States,Panama,Panama,,2023-07-12_United States_Panama,True
558,2023-09-07,Iraq,India,Iraq,,2023-09-07_Iraq_India,True
559,2023-09-10,Thailand,Iraq,Iraq,,2023-09-10_Thailand_Iraq,True
560,2023-10-13,Iraq,Qatar,Qatar,,2023-10-13_Iraq_Qatar,True


In [43]:
shootouts[shootouts['first_shooter'].notnull()]

Unnamed: 0,date,home_team,away_team,winner,first_shooter,join_key,data_quality_issue
24,1976-06-20,Czechoslovakia,Germany,Czechoslovakia,Czechoslovakia,1976-06-20_Czechoslovakia_Germany,False
37,1980-06-21,Italy,Czechoslovakia,Czechoslovakia,Italy,1980-06-21_Italy_Czechoslovakia,False
48,1982-07-08,Germany,France,Germany,France,1982-07-08_Germany_France,False
65,1984-06-24,Denmark,Spain,Spain,Denmark,1984-06-24_Denmark_Spain,False
84,1986-06-21,Brazil,France,France,Brazil,1986-06-21_Brazil_France,False
...,...,...,...,...,...,...,...
542,2022-12-05,Japan,Croatia,Croatia,Japan,2022-12-05_Japan_Croatia,False
543,2022-12-06,Morocco,Spain,Morocco,Morocco,2022-12-06_Morocco_Spain,False
544,2022-12-09,Croatia,Brazil,Croatia,Croatia,2022-12-09_Croatia_Brazil,False
545,2022-12-09,Netherlands,Argentina,Argentina,Netherlands,2022-12-09_Netherlands_Argentina,False


Only error is first_shooter = NaN this can't be solved as the first shooter team is decided by toss and we do not have data for winner of toss

Extra Findings

In [44]:
#draw games from the results dataset
draw_games = results[results['home_score'] == results['away_score']]
# Check if their join_key is present in the shootouts dataset
draw_games_in_shootouts = draw_games[draw_games['join_key'].isin(shootouts['join_key'])]

draw_games_in_shootouts

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,join_key,data_quality_issue,final_score
6648,1967-08-22,India,Taiwan,1,1,Merdeka Tournament,Kuala Lumpur,Malaysia,True,1967,1967-08-22_India_Taiwan,False,1-1
8110,1971-11-14,South Korea,Vietnam Republic,1,1,King's Cup,Bangkok,Thailand,True,1971,1971-11-14_South Korea_Vietnam Republic,False,1-1
8281,1972-05-07,South Korea,Iraq,0,0,AFC Asian Cup,Bangkok,Thailand,True,1972,1972-05-07_South Korea_Iraq,False,0-0
8299,1972-05-17,Thailand,South Korea,1,1,AFC Asian Cup,Bangkok,Thailand,False,1972,1972-05-17_Thailand_South Korea,False,1-1
8302,1972-05-19,Thailand,Cambodia,2,2,AFC Asian Cup,Bangkok,Thailand,False,1972,1972-05-19_Thailand_Cambodia,False,2-2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
44779,2023-07-12,United States,Panama,1,1,Gold Cup,San Diego,United States,False,2023,2023-07-12_United States_Panama,False,1-1
44782,2023-07-13,Åland,Falkland Islands,1,1,Island Games,Saint Sampson,Guernsey,True,2023,2023-07-13_Åland_Falkland Islands,False,1-1
44839,2023-09-07,Iraq,India,2,2,King's Cup,Chiang Mai,Thailand,True,2023,2023-09-07_Iraq_India,False,2-2
44905,2023-09-10,Thailand,Iraq,2,2,King's Cup,Chiang Mai,Thailand,False,2023,2023-09-10_Thailand_Iraq,False,2-2


In [45]:
draw_games = results[results['home_score'] == results['away_score']]

# Check if their join_key is in the shootouts dataset
draw_games_with_no_shootout = draw_games[~draw_games['join_key'].isin(shootouts['join_key'])]

# draw games that do not have a corresponding entry in the shootouts dataset
draw_games_with_no_shootout

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,join_key,data_quality_issue,final_score
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False,1872,1872-11-30_Scotland_England,False,0-0
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False,1875,1875-03-06_England_Scotland,False,2-2
28,1883-03-17,Northern Ireland,Wales,1,1,Friendly,Belfast,Ireland,False,1883,1883-03-17_Northern Ireland_Wales,False,1-1
36,1885-03-14,England,Wales,1,1,British Home Championship,Blackburn,England,False,1885,1885-03-14_England_Wales,False,1-1
38,1885-03-21,England,Scotland,1,1,British Home Championship,London,England,False,1885,1885-03-21_England_Scotland,False,1-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
45296,2023-11-21,Libya,Cameroon,1,1,FIFA World Cup qualification,Benghazi,Libya,False,2023,2023-11-21_Libya_Cameroon,False,1-1
45297,2023-11-21,Mauritius,Angola,0,0,FIFA World Cup qualification,Saint Pierre,Mauritius,False,2023,2023-11-21_Mauritius_Angola,False,0-0
45307,2023-11-21,Republic of Ireland,New Zealand,1,1,Friendly,Dublin,Republic of Ireland,False,2023,2023-11-21_Republic of Ireland_New Zealand,False,1-1
45309,2023-11-21,Greece,France,2,2,UEFA Euro qualification,Athens,Greece,False,2023,2023-11-21_Greece_France,False,2-2


In [46]:
#friendly games in results where the score is a draw
friendly_draws = results[(results['tournament'] == 'Friendly') & 
                         (results['home_score'] == results['away_score'])]

#Check if the 'join_key' in the results dataset is present in the shootouts dataset
friendly_draws_in_shootouts = friendly_draws[friendly_draws['join_key'].isin(shootouts['join_key'])]

friendly_draws_in_shootouts

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,join_key,data_quality_issue,final_score
12779,1983-02-27,Mali,Guinea,1,1,Friendly,Bamako,Mali,False,1983,1983-02-27_Mali_Guinea,False,1-1
13186,1983-12-15,Ivory Coast,Mali,1,1,Friendly,Abidjan,Ivory Coast,False,1983,1983-12-15_Ivory Coast_Mali,False,1-1
13191,1983-12-18,Mali,Nigeria,0,0,Friendly,Abidjan,Ivory Coast,True,1983,1983-12-18_Mali_Nigeria,False,0-0
14139,1985-08-25,Mali,Guinea,1,1,Friendly,Bamako,Mali,False,1985,1985-08-25_Mali_Guinea,False,1-1
15056,1988-01-06,South Korea,Egypt,1,1,Friendly,Doha,Qatar,True,1988,1988-01-06_South Korea_Egypt,False,1-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
40065,2018-03-22,Thailand,Gabon,0,0,Friendly,Bangkok,Thailand,False,2018,2018-03-22_Thailand_Gabon,False,0-0
40095,2018-03-24,Angola,Zimbabwe,2,2,Friendly,Ndola,Zambia,True,2018,2018-03-24_Angola_Zimbabwe,False,2-2
40277,2018-06-03,Andorra,Cape Verde,0,0,Friendly,Almada,Portugal,True,2018,2018-06-03_Andorra_Cape Verde,False,0-0
41005,2019-03-23,Oman,Singapore,1,1,Friendly,Kuala Lumpur,Malaysia,True,2019,2019-03-23_Oman_Singapore,False,1-1


In [47]:
# Check if all join_key values in shootouts are present in results
missing_in_results_shootouts = shootouts[~shootouts['join_key'].isin(results['join_key'])]

#rows where join_key in shootouts are not present in results
missing_in_results_shootouts

Unnamed: 0,date,home_team,away_team,winner,first_shooter,join_key,data_quality_issue
378,2011-06-29,Saare County,Åland Islands,Åland Islands,,2011-06-29_Saare County_Åland Islands,True


In [48]:
goalscorers[goalscorers["join_key"]=="2011-06-29_Saare County_Åland Islands"]

Unnamed: 0,date,home_team,away_team,team,scorer,minute,own_goal,penalty,join_key,data_quality_issue,final_score


In [49]:
results[results["join_key"]=="2011-06-29_Saare County_Åland Islands"]

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,join_key,data_quality_issue,final_score
