In [1]:
import pandas as pd
import numpy as np

The dataset records football match results together with match logistics and context. Each row represents a single match, with the following columns (names as they appear in the file):
1. Home_team_score: The number of goals scored by the home team
2. Home_team_name: The name of the team playing at its home venue
3. Away_team_score: The number of goals scored by the away team
4. Away_team_name: The name of the visiting team
5. attendance: The number of spectators at the match, stored as text (e.g., "Att: 61,369")
6. referee: The name of the referee who officiated the match
7. stadium: The name and location of the stadium where the match was played
8. Season: The season in which the match took place (e.g., "2022/23")
9. Date: The date and kick-off time of the match (e.g., "Sun 16 Apr 2023, 14:00 BST")

In [2]:
soccer_data = pd.read_csv('soccer_data.csv')

In [3]:
soccer_data.head(10)

Unnamed: 0,Home_team_score,Home_team_name,Away_team_score,Away_team_name,attendance,referee,stadium,Season,Date
0,2,West Ham United,2,Arsenal,,David Coote,"London Stadium, London",2022/23,"Sun 16 Apr 2023, 14:00 BST"
1,2,Tottenham Hotspur,3,Bournemouth,"Att: 61,369",Andy Madley,"Tottenham Hotspur Stadium, London",2022/23,"Sat 15 Apr 2023, 15:15 BST"
2,0,Southampton,2,Crystal Palace,"Att: 30,309",Michael Oliver,"St. Mary's Stadium, Southampton",2022/23,"Sat 15 Apr 2023, 15:00 BST"
3,0,Nottingham Forest,2,Manchester United,"Att: 29,435",Simon Hooper,"The City Ground, Nottingham",2022/23,"Sun 16 Apr 2023, 16:30 BST"
4,3,Manchester City,1,Leicester City,"Att: 53,329",Darren England,"Etihad Stadium, Manchester",2022/23,"Sat 15 Apr 2023, 17:30 BST"
5,1,Everton,3,Fulham,,Anthony Taylor,"Goodison Park, Liverpool",2022/23,"Sat 15 Apr 2023, 15:00 BST"
6,1,Chelsea,2,Brighton & Hove Albion,"Att: 40,126",Robert Jones,"Stamford Bridge, London",2022/23,"Sat 15 Apr 2023, 15:00 BST"
7,3,Aston Villa,0,Newcastle United,"Att: 42,055",John Brooks,"Villa Park, Birmingham",2022/23,"Sat 15 Apr 2023, 12:30 BST"
8,1,Wolverhampton Wanderers,0,Chelsea,"Att: 31,614",Peter Bankes,"Molineux Stadium, Wolverhampton",2022/23,"Sat 8 Apr 2023, 15:00 BST"
9,2,Tottenham Hotspur,1,Brighton & Hove Albion,"Att: 61,405",Stuart Attwell,"Tottenham Hotspur Stadium, London",2022/23,"Sat 8 Apr 2023, 15:00 BST"


In [4]:
soccer_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33230 entries, 0 to 33229
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Home_team_score  33230 non-null  object
 1   Home_team_name   33230 non-null  object
 2   Away_team_score  33230 non-null  object
 3   Away_team_name   33230 non-null  object
 4   attendance       9939 non-null   object
 5   referee          15284 non-null  object
 6   stadium          33230 non-null  object
 7   Season           33230 non-null  object
 8   Date             33230 non-null  object
dtypes: object(9)
memory usage: 2.3+ MB
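
Every column is loaded with dtype object, so the scores and attendance figures are text rather than numbers. A minimal sketch of one way to convert them is shown below. It runs on a small hypothetical sample frame (not the real file) that mirrors the formats visible in the preview, and it assumes attendance values always follow the "Att: 61,369" pattern.

```python
import pandas as pd

# Hypothetical sample mirroring the formats shown in the preview (not the real file)
sample = pd.DataFrame({
    "Home_team_score": ["2", "0", "3"],
    "attendance": [None, "Att: 61,369", "Att: 30,309"],
})

# Scores: coerce anything non-numeric to NaN rather than raising an error
sample["Home_team_score"] = pd.to_numeric(sample["Home_team_score"], errors="coerce")

# Attendance: drop the "Att: " prefix and the thousands separator, then convert
attendance_text = (
    sample["attendance"]
    .str.replace("Att: ", "", regex=False)
    .str.replace(",", "", regex=False)
)
sample["attendance"] = pd.to_numeric(attendance_text, errors="coerce")

print(sample.dtypes)
```

Missing attendance values stay as NaN after the conversion, so they can still be handled in a later step.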


In [5]:
soccer_data.rename(columns={'Home_team_score':'home_team_score',
                    'Home_team_name':'home_team_name',
                    'Away_team_score':'away_team_score',
                    'Away_team_name':'away_team_name',
                    'Season':'season',
                    'Date':'date'}, inplace=True)
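
Since the rename above only lowercases the capitalized names, the same result could probably be reached without listing each column. The sketch below uses a hypothetical two-row frame with the original column names. It assumes lowercasing is the only change wanted.

```python
import pandas as pd

# Hypothetical frame using the column names from the original file
sample = pd.DataFrame({
    "Home_team_score": ["2", "0"],
    "Home_team_name": ["West Ham United", "Southampton"],
    "attendance": [None, "Att: 30,309"],
    "Season": ["2022/23", "2022/23"],
    "Date": ["Sun 16 Apr 2023, 14:00 BST", "Sat 15 Apr 2023, 15:00 BST"],
})

# Lowercase every column name in one step
sample.columns = sample.columns.str.lower()

print(list(sample.columns))
```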

In [6]:
soccer_data.head()

Unnamed: 0,home_team_score,home_team_name,away_team_score,away_team_name,attendance,referee,stadium,season,date
0,2,West Ham United,2,Arsenal,,David Coote,"London Stadium, London",2022/23,"Sun 16 Apr 2023, 14:00 BST"
1,2,Tottenham Hotspur,3,Bournemouth,"Att: 61,369",Andy Madley,"Tottenham Hotspur Stadium, London",2022/23,"Sat 15 Apr 2023, 15:15 BST"
2,0,Southampton,2,Crystal Palace,"Att: 30,309",Michael Oliver,"St. Mary's Stadium, Southampton",2022/23,"Sat 15 Apr 2023, 15:00 BST"
3,0,Nottingham Forest,2,Manchester United,"Att: 29,435",Simon Hooper,"The City Ground, Nottingham",2022/23,"Sun 16 Apr 2023, 16:30 BST"
4,3,Manchester City,1,Leicester City,"Att: 53,329",Darren England,"Etihad Stadium, Manchester",2022/23,"Sat 15 Apr 2023, 17:30 BST"


In [7]:
data = pd.isnull(soccer_data["home_team_score"])
data

0        False
1        False
2        False
3        False
4        False
         ...  
33225    False
33226    False
33227    False
33228    False
33229    False
Name: home_team_score, Length: 33230, dtype: bool

In [8]:
data = pd.isnull(soccer_data["home_team_name"])
data

0        False
1        False
2        False
3        False
4        False
         ...  
33225    False
33226    False
33227    False
33228    False
33229    False
Name: home_team_name, Length: 33230, dtype: bool

In [9]:
data = pd.isnull(soccer_data["away_team_score"])
data

0        False
1        False
2        False
3        False
4        False
         ...  
33225    False
33226    False
33227    False
33228    False
33229    False
Name: away_team_score, Length: 33230, dtype: bool

In [10]:
data = pd.isnull(soccer_data["away_team_name"])
data

0        False
1        False
2        False
3        False
4        False
         ...  
33225    False
33226    False
33227    False
33228    False
33229    False
Name: away_team_name, Length: 33230, dtype: bool

In [11]:
# Compute the mode (most frequent value) of the 'attendance' column
mode_value = soccer_data["attendance"].mode()[0]
print("Mode value:", mode_value)

# Fill missing values with the mode; assign the result back instead of calling
# fillna(..., inplace=True) on the column, which pandas deprecates as chained assignment
soccer_data["attendance"] = soccer_data["attendance"].fillna(mode_value)

# Verify that no null values remain
print("Number of missing values after filling:", soccer_data["attendance"].isnull().sum())

Mode value: Att: 2,000
Number of missing values after filling: 0


In [12]:
data = pd.isnull(soccer_data["referee"])
data

0        False
1        False
2        False
3        False
4        False
         ...  
33225    False
33226    False
33227    False
33228    False
33229    False
Name: referee, Length: 33230, dtype: bool

In [13]:
data = pd.isnull(soccer_data["stadium"])
data

0        False
1        False
2        False
3        False
4        False
         ...  
33225    False
33226    False
33227    False
33228    False
33229    False
Name: stadium, Length: 33230, dtype: bool

In [14]:
data = pd.isnull(soccer_data["season"])
data

0        False
1        False
2        False
3        False
4        False
         ...  
33225    False
33226    False
33227    False
33228    False
33229    False
Name: season, Length: 33230, dtype: bool

In [15]:
data = pd.isnull(soccer_data["date"])
data

0        False
1        False
2        False
3        False
4        False
         ...  
33225    False
33226    False
33227    False
33228    False
33229    False
Name: date, Length: 33230, dtype: bool
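
The truncated boolean outputs above show only the first and last five rows, so nulls in the middle of a column are easy to miss. For example, `info()` reported only 15284 non-null `referee` values out of 33230. A more compact check is to count nulls per column in one call. The sketch below uses a small hypothetical frame to show the idea. On the real data the equivalent call would be `soccer_data.isnull().sum()`.

```python
import pandas as pd

# Hypothetical sample with a few missing values (not the real file)
sample = pd.DataFrame({
    "home_team_name": ["West Ham United", "Everton", "Chelsea"],
    "attendance": [None, None, "Att: 40,126"],
    "referee": ["David Coote", None, "Robert Jones"],
})

# Count missing values in every column at once
null_counts = sample.isnull().sum()

print(null_counts)
```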