![A soccer pitch for an international match.](soccer-pitch.jpg)


# Investigating Goal Scoring in Men's vs. Women's FIFA World Cup Matches


As a sports journalist specializing in soccer analysis, you are investigating whether more goals are scored in women's international soccer matches than men's. Your hypothesis is based on years of observation, but you need statistical evidence to support your claim.


## Project Scope


- **Data:** Official FIFA World Cup matches (excluding qualifiers) since 2002-01-01.
- **Files:** `women_results.csv` and `men_results.csv` contain match results for women's and men's international football, respectively.
- **Assumption:** Each match is independent (team form is ignored).


## Research Question


> Are more goals scored in women's international soccer matches than men's?


## Hypotheses


- $H_0$: The mean number of goals scored in women's international soccer matches is the same as men's.
- $H_A$: The mean number of goals scored in women's international soccer matches is greater than men's.
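
The two hypotheses can also be written in symbols. The notation below is introduced here for illustration and is not part of the original brief: $\mu_W$ and $\mu_M$ denote the mean total goals per match in women's and men's matches.

$$
H_0: \mu_W = \mu_M \qquad\qquad H_A: \mu_W > \mu_M
$$

One common statistic for this comparison, assuming an unpaired test that does not require equal variances (Welch's form; whether a given library applies it depends on its settings), is

$$
t = \frac{\bar{x}_W - \bar{x}_M}{\sqrt{\dfrac{s_W^2}{n_W} + \dfrac{s_M^2}{n_M}}}
$$

where $\bar{x}$, $s^2$, and $n$ are the sample mean, sample variance, and number of matches in each group. Because $H_A$ is one-sided, only large positive values of $t$ count as evidence against $H_0$.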


## Significance Level


- $\alpha = 0.10$ (10%)


---

## Data Extraction and Inspection

In [None]:
# Import required libraries and load the datasets
import pandas as pd
import pingouin

# Set significance level
alpha = 0.1

# Load men's and women's results data
men_results_df = pd.read_csv("men_results.csv", index_col=0)
women_results_df = pd.read_csv("women_results.csv", index_col=0)

# Preview the first few rows of each dataset
print(men_results_df.head())
print(women_results_df.head())

         date home_team away_team  home_score  away_score tournament
0  1872-11-30  Scotland   England           0           0   Friendly
1  1873-03-08   England  Scotland           4           2   Friendly
2  1874-03-07  Scotland   England           2           1   Friendly
3  1875-03-06   England  Scotland           2           2   Friendly
4  1876-03-04  Scotland   England           3           0   Friendly
         date home_team  away_team  home_score  away_score        tournament
0  1969-11-01     Italy     France           1           0              Euro
1  1969-11-01   Denmark    England           4           3              Euro
2  1969-11-02   England     France           2           0              Euro
3  1969-11-02     Italy    Denmark           3           1              Euro
4  1975-08-25  Thailand  Australia           3           2  AFC Championship


In [None]:
# Check the structure and data types of the men's dataset
men_results_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 44353 entries, 0 to 44352
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        44353 non-null  object
 1   home_team   44353 non-null  object
 2   away_team   44353 non-null  object
 3   home_score  44353 non-null  int64 
 4   away_score  44353 non-null  int64 
 5   tournament  44353 non-null  object
dtypes: int64(2), object(4)
memory usage: 2.4+ MB


In [None]:
# Check the structure and data types of the women's dataset
women_results_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4884 entries, 0 to 4883
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        4884 non-null   object
 1   home_team   4884 non-null   object
 2   away_team   4884 non-null   object
 3   home_score  4884 non-null   int64 
 4   away_score  4884 non-null   int64 
 5   tournament  4884 non-null   object
dtypes: int64(2), object(4)
memory usage: 267.1+ KB


## Data Preprocessing

In [None]:
# Convert the 'date' columns to datetime and calculate total goals per match
men_results_df["date"] = pd.to_datetime(men_results_df["date"])
women_results_df["date"] = pd.to_datetime(women_results_df["date"])

# Calculate total goals scored in each match
men_results_df["total_score"] = men_results_df["home_score"] + men_results_df["away_score"]
women_results_df["total_score"] = women_results_df["home_score"] + women_results_df["away_score"]

In [None]:
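# Filter for FIFA World Cup matches since 2002-01-01 (assumes the finals are labeled "FIFA World Cup")
men_mask = (men_results_df["date"] >= "2002-01-01") & (men_results_df["tournament"] == "FIFA World Cup")
women_mask = (women_results_df["date"] >= "2002-01-01") & (women_results_df["tournament"] == "FIFA World Cup")
men_results_2002_plus = men_results_df.loc[men_mask]
women_results_2002_plus = women_results_df.loc[women_mask]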
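
Since the project scope requires both a date cutoff and a tournament restriction, it may be worth confirming that a filtered frame meets both conditions before running any test. The sketch below is one way to do that. The helper `check_scope` and the toy rows are hypothetical, made up for illustration, and the label `"FIFA World Cup"` is an assumption about how the finals appear in the `tournament` column.

```python
import pandas as pd


def check_scope(df, start="2002-01-01", tournament="FIFA World Cup"):
    """Return True if every row is on or after `start` and carries the given tournament label."""
    dates_ok = (pd.to_datetime(df["date"]) >= pd.Timestamp(start)).all()
    tournament_ok = (df["tournament"] == tournament).all()
    return bool(dates_ok and tournament_ok)


# Toy rows (invented for illustration, not taken from the CSV files)
toy = pd.DataFrame({
    "date": ["1998-06-10", "2002-05-31", "2006-06-09", "2010-03-03"],
    "tournament": ["FIFA World Cup", "FIFA World Cup", "FIFA World Cup", "Friendly"],
})
toy["date"] = pd.to_datetime(toy["date"])

# A date-only filter still lets the friendly through
date_only = toy.loc[toy["date"] >= "2002-01-01"]

# Combining both conditions keeps only in-scope matches
in_scope = toy.loc[(toy["date"] >= "2002-01-01") & (toy["tournament"] == "FIFA World Cup")]

print(check_scope(date_only))  # False
print(check_scope(in_scope))   # True
```

Running `check_scope` on `men_results_2002_plus` and `women_results_2002_plus` would flag a filter that applies only one of the two conditions.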

## Hypothesis Testing

In [None]:
# Perform a one-sided (greater) independent t-test to compare means
results = pingouin.ttest(
    x=women_results_2002_plus["total_score"],
    y=men_results_2002_plus["total_score"],
    alternative="greater",
    paired=False
)

# Extract p-value and test result
p_val = float(results["p-val"].iloc[0])
result = "reject" if p_val < alpha else "fail to reject"

# Store results in a dictionary as required
result_dict = {"p_val": p_val, "result": result}

print(result_dict)

{'p_val': 0.0051961448009743005, 'result': 'reject'}
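
Goal counts are small non-negative integers and tend to be right-skewed, so a cross-check that does not lean on normality can be reassuring. The sketch below uses a one-sided permutation test on the difference in means. This is a different technique from the t-test above, not the notebook's method, and the function `permutation_p_value` and the toy goal counts are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(x, y, n_resamples=2000, seed=0):
    """Estimate a one-sided permutation p-value for mean(x) > mean(y)."""
    rng = random.Random(seed)
    observed = mean(x) - mean(y)
    pooled = list(x) + list(y)
    n_x = len(x)
    hits = 0
    for _ in range(n_resamples):
        # Reassign group labels at random, as if H0 were true
        rng.shuffle(pooled)
        diff = mean(pooled[:n_x]) - mean(pooled[n_x:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_resamples + 1)


# Toy goal totals (invented for illustration, not real match data)
women_toy = [5, 6, 7, 8, 9, 6, 7, 8]
men_toy = [1, 2, 1, 0, 2, 1, 3, 2]

p_toy = permutation_p_value(women_toy, men_toy)
print(p_toy < 0.05)
```

On the real data, the same function could be called with the two `total_score` columns converted to lists. A permutation p-value on the same side of $\alpha$ as the t-test's would suggest the conclusion is not driven by distributional assumptions.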


---

## Conclusion


- **p-value:** The probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one computed from the data.
- **Test Result:** If the p-value is less than the significance level ($\alpha = 0.10$), we reject the null hypothesis in favor of the alternative.


Interpret the result in the context of the research question:

- If `result_dict["result"]` is "reject", there is statistical evidence at the 10% significance level that more goals are scored in women's FIFA World Cup matches than in men's (matches since 2002 only).

- If `result_dict["result"]` is "fail to reject", there is not enough evidence to support that claim at the 10% significance level.

With the output shown above ($p \approx 0.0052 < 0.10$), the null hypothesis is rejected.