# Data Science and Soccer
### Investigating the Power of Scoring First
By Tyler Yang <br>
Written on May 29th, 2023<br>
Medium Post Link: https://medium.com/@tjyang2005/data-science-and-soccer-a6b7d3445ae4

I started playing club soccer in 4th grade. I never took the sport very seriously since I played for fun, not for the dream of going pro. Yet in my 7+ years of competitive soccer, I have put considerable time and effort into performing at my highest level for my team. While playing in games, I have noticed some odd trends that I want to examine through data science. 
<br>
<br>
The first trend I notice is that the team that scores the first goal often wins. My guess is that the team that scores first was good enough to score in the first place, and that the first goal also demoralizes the other team.
<br>
<br>
However, there is a catch to that first observation - I noticed that when a team gets scored on before halftime, they often come out harder during the second half and make comebacks. 
<br>
<br>
I will use hypothesis testing to see whether the trends I notice in club soccer also occur in professional soccer. To do so, I will look at results from the 2022–2023 English Premier League season, specifically the first 14 matchdays. I have not learned web scraping yet, so I found the team that scored the first goal and the minute of that goal by reading each match's summary.

<strong>Data were obtained from several websites:</strong>
<br>
EuroSport: https://www.eurosport.com/football/premier-league/calendar-results.shtml <br>
Fbref: https://fbref.com/en/comps/9/schedule/Premier-League-Scores-and-Fixtures <br>
FixtureDownload: https://fixturedownload.com/results/epl-2022 <br>
The Python code for setting up my dataset is shown below. Download the .xlsx file from GitHub before trying to run this code.<br>
To test my hypotheses, I will be using Python with NumPy and pandas.

In [12]:
# Imports
import pandas as pd
import numpy as np
import math
from statistics import NormalDist

In [13]:
# SPI statistics to calculate predicted win percentages
spi = {
    'Arsenal': [83.9],
    'Aston Villa': [79.3],
    'Bournemouth': [59.6],
    'Brentford': [77.1],
    'Brighton': [80.9],
    'Chelsea': [75.8],
    'Crystal Palace': [73.5],
    'Everton': [63.6],
    'Fulham': [68.2],
    'Leeds': [59.0],
    'Leicester': [64.4],
    'Liverpool': [83.9],
    'Man City': [92.3],
    'Man Utd':  [79.1],
    'Newcastle': [83.7],
    'Nottingham Forest': [56.1],
    'Southampton': [56.7],
    'Spurs': [72.1],
    'West Ham': [70.9],
    'Wolves': [59.1]
    }

spi_df = pd.DataFrame.from_dict(spi)

print(spi_df)

   Arsenal  Aston Villa  Bournemouth  Brentford  Brighton  Chelsea  \
0     83.9         79.3         59.6       77.1      80.9     75.8   

   Crystal Palace  Everton  Fulham  Leeds  Leicester  Liverpool  Man City  \
0            73.5     63.6    68.2   59.0       64.4       83.9      92.3   

   Man Utd  Newcastle  Nottingham Forest  Southampton  Spurs  West Ham  Wolves  
0     79.1       83.7               56.1         56.7   72.1      70.9    59.1  


In [14]:
# Fixture Data from EPL 22-23 Season, Matchdays 1-14
data1 = pd.read_excel('epl-2022-UTC.xlsx', sheet_name='Data1') #Note: you may have to adjust the filepath to read the .xlsx file.
df = pd.DataFrame(data1)
gameCount = df.shape[0]
print(gameCount)
goalsHome = np.zeros(gameCount)
goalsAway = np.zeros(gameCount)
for i in range(0, gameCount):
    x = df['Score'][i].split()
    goalsHome[i] = x[0]
    goalsAway[i] = x[2]
df['Home Goals'] = goalsHome
df['Away Goals'] = goalsAway

winner = []
loser = []
for i in range(0, gameCount):
    if df['Home Goals'][i] == df['Away Goals'][i]:
        winner.append('TIE')
        loser.append('TIE')
    elif df['Home Goals'][i] > df['Away Goals'][i]:
        winner.append(df['Home'][i])
        loser.append(df['Away'][i])
    else:
        winner.append(df['Away'][i])
        loser.append(df['Home'][i])
df['Winner'] = winner
df['Loser'] = loser

other = []
for i in range(0, gameCount):
    if df['First Goal'][i] == df['Home'][i]:
        other.append(df['Away'][i])
    elif df['First Goal'][i] == df['Away'][i]:
        other.append(df['Home'][i])
    else:
        other.append("NaN")
df['Other'] = other

print(df)

140
               Home               Away  Score   First Goal Time  Home Goals  \
0    Crystal Palace            Arsenal  0 - 2      Arsenal   20         0.0   
1            Fulham          Liverpool  2 - 2       Fulham   32         2.0   
2       Bournemouth        Aston Villa  2 - 0  Bournemouth    2         2.0   
3             Leeds             Wolves  2 - 1       Wolves    6         2.0   
4         Newcastle  Nottingham Forest  2 - 0    Newcastle   58         2.0   
..              ...                ...    ...          ...  ...         ...   
135       Newcastle        Aston Villa  4 - 0    Newcastle   45         4.0   
136          Fulham            Everton  0 - 0          TIE  TIE         0.0   
137       Liverpool              Leeds  1 - 2        Leeds    4         1.0   
138         Arsenal  Nottingham Forest  5 - 0      Arsenal    5         5.0   
139         Man Utd           West Ham  1 - 0      Man Utd   38         1.0   

     Away Goals       Winner              Loser

## Does scoring first increase your chances of winning?
For the purposes of testing this hypothesis, a win is a success while a tie or a loss is a failure. After all, if you score first, you should maintain your lead; failing to do so is a failure. <br>
We will set the chances of winning for each game according to SPI ratings at the beginning of the 22–23 EPL season from FiveThirtyEight: https://projects.fivethirtyeight.com/soccer-predictions/premier-league/ <br>
I estimate the probability that team A beats team B as SPI_A / (SPI_A + SPI_B). One limitation of this approach is that a team's SPI changes throughout the season, so the probability should change accordingly. However, I will treat the teams as if they did not improve or regress during the season, keeping their SPI ratings constant. <br>
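
As a quick illustration of the formula, the sketch below applies it to the first fixture in the dataset, Crystal Palace against Arsenal, using the ratings from the spi dictionary above. The helper name `win_probability` is hypothetical and is used only for this example.

```python
def win_probability(spi_a, spi_b):
    """Estimated probability that team A beats team B under the SPI ratio model."""
    return spi_a / (spi_a + spi_b)

# Ratings copied from the spi dictionary above
arsenal = 83.9
crystal_palace = 73.5

p_arsenal = win_probability(arsenal, crystal_palace)
p_palace = win_probability(crystal_palace, arsenal)

print(round(p_arsenal, 3))  # 0.533
print(round(p_palace, 3))   # 0.467
```

The two values always sum to 1, so this simple model does not assign a separate probability to a tie.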

The code block below calculates the expected win probability of the first-scoring team in each game and averages these into p0. It also calculates the proportion of games with the expected result: namely, that the team that scored first was also the winning team. <br>
<i>Note: despite its name, the variable expectedResults holds that proportion, not a list of results.</i>

In [15]:
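p0 = 0.0  # running sum, then mean, of the first scorer's SPI-based win probability
expectedResults = 0  # count, then proportion, of games won by the team that scored first
gamesWithGoals = gameCount
for i in range(0, gameCount):
    # Goalless games have no first scorer, so leave them out of the sample
    if (df['Score'][i] == '0 - 0'):
        gamesWithGoals -= 1
        continue
    # Each spi_df column holds a single rating; .iloc[0] extracts it as a plain float
    spiFirst = spi_df[df['First Goal'][i]].iloc[0]
    spiOther = spi_df[df['Other'][i]].iloc[0]
    p0 += spiFirst / (spiFirst + spiOther)
    if df['First Goal'][i] == df['Winner'][i]:
        expectedResults += 1

p0 /= gamesWithGoals
expectedResults /= gamesWithGoals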

We carry out our 1-Sample Z-Test for Proportions.<br>
### Hypotheses
H0: p == p0<br>
Ha: p != p0<br>
Where p is the proportion of games in which the team that scores first also wins the game.<br>
### Checking Conditions
These conditions are only roughly satisfied. <i>Technically</i>, if the population were just the EPL 22-23 season, the sample would fail the 10% condition, since 14 matchdays is 14/38, or roughly 37%, of that season's matchdays. Furthermore, the games were not randomly chosen; they are the first 14 matchdays, which makes this a convenience sample. I carry out the test anyway because a single EPL season is a small enough population to analyze in full without significance testing, so I will not use my data to draw conclusions about the EPL 22-23 season, but rather about professional soccer games as a whole. In that sense, I treat the first 14 matchdays of the EPL 22-23 season as if they were a randomly chosen set of games in a randomly chosen country, division, and year. <br><br>
The <b>Random Condition</b> is not strictly satisfied. I proceed with the test anyway, treating the first 14 matchdays of the 22-23 EPL season as a stand-in for a random sample of all professional soccer games. <br><br>
The <b>10% Condition for Independence</b> is satisfied. n = 127, which is less than 10% of all professional soccer games in which at least one goal is scored (and thus in which some team scored first).<br><br>
The <b>Large Counts Condition</b> is satisfied.<br>
np0 = 127(0.511181) = 64.92 ≥ 10<br>
n(1-p0) = 127(1-0.511181) = 62.08 ≥ 10<br>
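
The arithmetic above can be checked with a short sketch. It assumes the values reported in this section (n = 127 and p0 = 0.511181); the helper name `large_counts_ok` is hypothetical and is not part of the notebook.

```python
def large_counts_ok(n, p, threshold=10):
    """Return expected successes, expected failures, and whether both reach the threshold."""
    successes = n * p
    failures = n * (1 - p)
    return successes, failures, (successes >= threshold and failures >= threshold)

n = 127         # games with at least one goal, as reported above
p0 = 0.511181   # average SPI-based win probability, as reported above

successes, failures, ok = large_counts_ok(n, p0)
print(round(successes, 2), round(failures, 2), ok)  # 64.92 62.08 True
```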

### Results

In [16]:
# Conduct the 1-Sample Z-Test for Proportions
# H0: p == p0
# Ha: p != p0
# Two tailed test
phat = expectedResults
n = gamesWithGoals

z = (phat - p0)/math.sqrt(p0 * (1 - p0) /n)
# Use abs(z) so the two-tailed p-value is also correct when z is negative
pvalue = 2 * (1 - NormalDist(mu=0,sigma=1).cdf(abs(z)))
print('P-value:', pvalue)
print('P-value:', pvalue)

P-value: 0.007429665277258746
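
For readers without the spreadsheet, the same test can be sketched from summary numbers alone. This is an illustration under stated assumptions: the 80 first-scorer wins are the sum of the first-half and second-half counts (65 and 15) quoted in the conditions of the second test in this post, n = 127 and p0 = 0.511181 come from the section above, and the function name `one_proportion_ztest` is hypothetical.

```python
import math
from statistics import NormalDist

def one_proportion_ztest(successes, n, p0):
    """Two-sided one-sample z-test for a proportion; returns (z, p-value)."""
    phat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)   # standard error under H0
    z = (phat - p0) / se
    pvalue = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, pvalue

# Assumed summary inputs: 80 first-scorer wins in 127 games, p0 rounded to six decimals
z, pvalue = one_proportion_ztest(80, 127, 0.511181)
print(round(z, 2), round(pvalue, 4))  # 2.68 0.0074
```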


### Conclusion
At a significance level of alpha = 0.05, we reject H0 because the p-value of 0.0074 (rounded to the nearest ten-thousandth) is less than 0.05. Because the observed proportion (about 0.630) is above p0 (about 0.511), there is statistical evidence that scoring the first goal improves a team's chances of winning. <br>
This conclusion is interesting, and it makes sense. However, in my personal experience, scoring the first goal is a double-edged sword. On one hand, the team that scores first can carry its momentum forward against a demoralized opposition. On the other hand, the leading team can get lax on defense and allow the other team back into the game. <br><br>
Therefore, I asked another question:

## If you score first, does when you score first make a difference?
Since scoring first does not always lead to victory, I want to know whether <i>when</i> the team scores that first goal affects its chances of winning. <br><br>
We first calculate our sample statistics. Among the sampled EPL 22-23 games whose first goal was scored in the first half, phat1 is the proportion in which the team that scored first went on to win. Among the sampled games whose first goal was scored in the second half, phat2 is the proportion in which the team that scored first went on to win.

In [17]:
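p1 = 0  # games with a first-half first goal that the first scorer won (a count, not a proportion)
n1 = 0  # games whose first goal came in the first half
p2 = 0  # games with a second-half first goal that the first scorer won (a count)
n2 = 0  # games whose first goal came in the second half
for i in range(0,gameCount):
    # Goalless games have no first goal, so skip them
    if (df['Score'][i] == '0 - 0'):
        continue
    # First Half
    if df['Time'][i] <= 45:
        n1 += 1
        if df['First Goal'][i] == df['Winner'][i]:
            p1+=1
    # Second Half
    elif df['Time'][i] > 45:
        n2 += 1
        if df['First Goal'][i] == df['Winner'][i]:
            p2 += 1

phat1 = p1/n1
phat2 = p2/n2
phatc = (p1+p2)/(n1+n2)  # pooled proportion across both halves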

### Hypotheses
H0: p1 - p2 = 0 <br>
Ha1: p1 - p2 < 0 <br>
Ha2: p1 - p2 > 0 <br>
p1 is the proportion of professional soccer games, among those whose first goal is scored in the first half, in which the team that scored first wins. <br>
p2 is the proportion of professional soccer games, among those whose first goal is scored in the second half, in which the team that scored first wins.<br><br>
In essence, my first alternative hypothesis is that if you score first in the second half, you are more likely to win because <br>
A. The other team has less time to come back, and <br>
B. Scoring first in the first half gives the other team more motivation at halftime, during which they will rest up and come out harder during the second half.
<br><br>
My second alternative hypothesis is that if you score first in the first half, you are a much stronger team than your opponent and thus it does not take long for you to score.
### Conditions
The <b>Random Condition</b> is not satisfied, but I will proceed anyway, using the same reasoning as in the first hypothesis test.<br><br>
The <b>10% Condition for Independence</b> is satisfied. n1 = 101, which is less than 10% of all professional soccer games in which the first goal is scored in the first half; n2 = 26, which is less than 10% of all professional soccer games in which the first goal is scored in the second half.<br><br>
The <b>Large Counts Condition</b> is satisfied.<br>
n1phat1 = 65 ≥ 10<br>
n1(1-phat1) = 36 ≥ 10<br>
n2phat2 = 15 ≥ 10<br>
n2(1-phat2) = 11 ≥ 10<br>
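
The counts above are enough to recover the sample proportions without the spreadsheet. The sketch below does that, assuming the counts exactly as listed (65 of 101 and 15 of 26); the variable names for the win counts are hypothetical.

```python
# Counts taken from the Large Counts Condition above
wins_first_half, n1 = 65, 101
wins_second_half, n2 = 15, 26

phat1 = wins_first_half / n1
phat2 = wins_second_half / n2
# Pooled proportion: all first-scorer wins over all games with a goal
phatc = (wins_first_half + wins_second_half) / (n1 + n2)

print(round(phat1, 4), round(phat2, 4), round(phatc, 4))  # 0.6436 0.5769 0.6299
```

The pooled proportion is the one used in the standard error of the test, because H0 assumes p1 and p2 are equal.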

### Results

In [18]:
# Pooled two-proportion z statistic; the standard error is the square root of the pooled variance
z = ((phat1 - phat2)-0)/math.sqrt(phatc*(1-phatc)*(1/n1+1/n2))
#Left tailed for Ha1
pvalue1 = NormalDist(mu=0,sigma=1).cdf(z)
print('P-value for Ha1: ', round(pvalue1, 3))
#Right tailed for Ha2
pvalue2 = 1-NormalDist(mu=0,sigma=1).cdf(z)
print('P-value for Ha2: ', round(pvalue2, 3))

P-value for Ha1:  0.735
P-value for Ha2:  0.265


### Conclusion
When testing Ha1 at a significance level of alpha = 0.05, we fail to reject H0 because the p-value of 0.735 (rounded to the nearest thousandth) is greater than 0.05. There is no convincing statistical evidence that p2 is greater than p1. <br>
When testing Ha2 at a significance level of alpha = 0.05, we also fail to reject H0 because the p-value of 0.265 (rounded to the nearest thousandth) is greater than 0.05. There is no convincing statistical evidence that p1 is greater than p2. <br><br>
In this sample, teams that scored first in the first half won more often (65 of 101 games, about 64%) than teams that scored first in the second half (15 of 26 games, about 58%), but with only 26 second-half games the difference is too small to distinguish from chance.<br><br>
The direction of the difference fits the idea that a team that scores first early on is the stronger side on the day, whereas a team that scores first later in the match has been struggling to find the net, which may also indicate an overall weakness that affects its defense. These data are not enough to confirm that explanation.