# Predicting Soccer Match Results - Data Cleaning
In this notebook, I take the raw data I obtained and trim it down to the leagues I need. I then do some additional cleaning and save the new datasets for use in the main notebook.

# Data Cleaning


In [1]:
# import necessary libraries 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import seaborn as sns
sns.set(style='darkgrid')
import os

### Functions  

In [2]:
# function to read multiple csv docs and concatenate the DFs into one
def read_to_dataframe(path, csv_list):
    df_list = []
    for i in csv_list:
        df_list.append(pd.read_csv(path.format(i)))
    return pd.concat(df_list, sort=True)

# function to remove all leagues except UCL, Bundesliga, Premier League, La Liga, Serie A, Ligue 1 
def remove_useless_leagues(dataframe, leagues_to_keep):
    return dataframe[dataframe['league'].isin(leagues_to_keep)]

leagues_to_keep = ['Barclays Premier League','Spanish Primera Division','German Bundesliga',
                   'UEFA Champions League', 'French Ligue 1', 'Italy Serie A']

# functions to compare list of focus clubs to see where there are instances of overlap
def compare_overlap_home(list_of_clubs, df_to_compare):
    return df_to_compare[df_to_compare['HOME'].isin(list_of_clubs)]
def compare_overlap_away(list_of_clubs, df_to_compare):
    return df_to_compare[df_to_compare['AWAY'].isin(list_of_clubs)]

### FiveThirtyEight SPI Data

In [3]:
SPI_data = pd.read_csv('Data/soccer-spi/spi_global_rankings.csv')
SPI_data.head()

Unnamed: 0,rank,prev_rank,name,league,off,def,spi
0,1,1,Bayern Munich,German Bundesliga,3.51,0.43,93.96
1,2,2,Manchester City,Barclays Premier League,2.86,0.24,92.84
2,3,3,Barcelona,Spanish Primera Division,3.01,0.5,90.16
3,4,4,Liverpool,Barclays Premier League,2.79,0.46,88.95
4,5,5,Paris Saint-Germain,French Ligue 1,2.89,0.52,88.85


In [4]:
SPI_data.describe()

Unnamed: 0,rank,prev_rank,off,def,spi
count,637.0,637.0,637.0,637.0,637.0
mean,319.0,319.0,1.235243,1.439827,41.443846
std,184.030342,184.030342,0.493902,0.439988,18.325674
min,1.0,1.0,0.2,0.24,5.78
25%,160.0,160.0,0.9,1.15,28.18
50%,319.0,319.0,1.18,1.44,39.13
75%,478.0,478.0,1.53,1.72,53.59
max,637.0,637.0,3.51,2.99,93.96


In [4]:
SPI_data_Int = pd.read_csv('Data/soccer-spi/spi_global_rankings_intl.csv')
SPI_data_Int.head()

Unnamed: 0,rank,name,confed,off,def,spi
0,1,Spain,UEFA,3.54,0.39,93.99
1,2,Brazil,CONMEBOL,3.04,0.32,92.22
2,3,Belgium,UEFA,2.96,0.58,87.71
3,4,France,UEFA,2.9,0.55,87.64
4,5,England,UEFA,2.7,0.45,87.44


I can ignore this entire dataset, since I will not be handling international squads with this model. 

In [5]:
SPI_data_Matches = pd.read_csv('Data/soccer-spi/spi_matches.csv')
SPI_data_Matches.head()

Unnamed: 0,season,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
0,2016,2016-07-09,7921,FA Women's Super League,Liverpool Women,Reading,51.56,50.42,0.4389,0.2767,...,,,2.0,0.0,,,,,,
1,2016,2016-07-10,7921,FA Women's Super League,Arsenal Women,Notts County Ladies,46.61,54.03,0.3572,0.3608,...,,,2.0,0.0,,,,,,
2,2016,2016-07-10,7921,FA Women's Super League,Chelsea FC Women,Birmingham City,59.85,54.64,0.4799,0.2487,...,,,1.0,1.0,,,,,,
3,2016,2016-07-16,7921,FA Women's Super League,Liverpool Women,Notts County Ladies,53.0,52.35,0.4289,0.2699,...,,,0.0,0.0,,,,,,
4,2016,2016-07-17,7921,FA Women's Super League,Chelsea FC Women,Arsenal Women,59.43,60.99,0.4124,0.3157,...,,,1.0,2.0,,,,,,


In [6]:
SPI_data_Match_Latest = pd.read_csv('Data/soccer-spi/spi_matches_latest.csv')
SPI_data_Match_Latest.head()

Unnamed: 0,season,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
0,2019,2019-03-01,1979,Chinese Super League,Shandong Luneng,Guizhou Renhe,48.22,37.83,0.5755,0.174,...,45.9,22.1,1.0,0.0,1.39,0.26,2.05,0.54,1.05,0.0
1,2019,2019-03-01,1979,Chinese Super League,Shanghai Greenland,Shanghai SIPG,39.81,60.08,0.2387,0.5203,...,25.6,63.4,0.0,4.0,0.57,2.76,0.8,1.5,0.0,3.26
2,2019,2019-03-01,1979,Chinese Super League,Guangzhou Evergrande,Tianjin Quanujian,65.59,39.99,0.7832,0.0673,...,77.1,28.8,3.0,0.0,0.49,0.45,1.05,0.75,3.15,0.0
3,2019,2019-03-01,1979,Chinese Super League,Wuhan Zall,Beijing Guoan,32.25,54.82,0.2276,0.5226,...,35.8,58.9,0.0,1.0,1.12,0.97,1.51,0.94,0.0,1.05
4,2019,2019-03-01,1979,Chinese Super League,Chongqing Lifan,Guangzhou RF,38.24,40.45,0.4403,0.2932,...,26.2,21.3,2.0,2.0,2.77,3.17,1.05,2.08,2.1,2.1


### Club SPI 

In [7]:
SPI_data.head()

Unnamed: 0,rank,prev_rank,name,league,off,def,spi
0,1,1,Bayern Munich,German Bundesliga,3.51,0.43,93.96
1,2,2,Manchester City,Barclays Premier League,2.86,0.24,92.84
2,3,3,Barcelona,Spanish Primera Division,3.01,0.5,90.16
3,4,4,Liverpool,Barclays Premier League,2.79,0.46,88.95
4,5,5,Paris Saint-Germain,French Ligue 1,2.89,0.52,88.85


In [8]:
# looking for null values
SPI_data.isna().any()

rank         False
prev_rank    False
name         False
league       False
off          False
def          False
spi          False
dtype: bool

In [9]:
SPI_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 637 entries, 0 to 636
Data columns (total 7 columns):
rank         637 non-null int64
prev_rank    637 non-null int64
name         637 non-null object
league       637 non-null object
off          637 non-null float64
def          637 non-null float64
spi          637 non-null float64
dtypes: float64(3), int64(2), object(2)
memory usage: 35.0+ KB


In [10]:
SPI_data.league.value_counts()

United Soccer League                        35
Major League Soccer                         26
English League Two                          24
English League One                          24
Argentina Primera Division                  24
English League Championship                 24
Spanish Segunda Division                    22
Turkish Turkcell Super Lig                  21
Brasileiro Série A                          20
Italy Serie B                               20
Italy Serie A                               20
Spanish Primera Division                    20
Barclays Premier League                     20
French Ligue 1                              20
French Ligue 2                              20
Belgian Jupiler League                      18
Mexican Primera Division Torneo Apertura    18
Japanese J League                           18
German 2. Bundesliga                        18
German Bundesliga                           18
Portuguese Liga                             18
Dutch Eredivi

In [11]:
SPI_data = remove_useless_leagues(SPI_data, leagues_to_keep)
SPI_data.head()

Unnamed: 0,rank,prev_rank,name,league,off,def,spi
0,1,1,Bayern Munich,German Bundesliga,3.51,0.43,93.96
1,2,2,Manchester City,Barclays Premier League,2.86,0.24,92.84
2,3,3,Barcelona,Spanish Primera Division,3.01,0.5,90.16
3,4,4,Liverpool,Barclays Premier League,2.79,0.46,88.95
4,5,5,Paris Saint-Germain,French Ligue 1,2.89,0.52,88.85


In [12]:
SPI_data.league.value_counts()

Italy Serie A               20
Spanish Primera Division    20
French Ligue 1              20
Barclays Premier League     20
German Bundesliga           18
UEFA Champions League        3
Name: league, dtype: int64

In [13]:
SPI_data[SPI_data['league'] == 'UEFA Champions League']

Unnamed: 0,rank,prev_rank,name,league,off,def,spi
41,42,42,Shakhtar Donetsk,UEFA Champions League,2.31,1.02,72.48
68,69,69,Dynamo Kiev,UEFA Champions League,1.98,1.04,66.35
156,157,157,Ferencvaros,UEFA Champions League,1.64,1.28,53.9


Now that SPI_data is pared down to the five main leagues and the three clubs listed under the UEFA Champions League, I'll move on to preparing the match data currently found in the SPI_data_Matches dataframe. 
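The league filter used above comes down to a boolean mask built with `isin`. Here is a minimal sketch on a toy frame; the club names and the `toy_` variables are made up for illustration and are not part of the real data.

```python
import pandas as pd

# Toy stand-in for the rankings table (illustrative values only)
toy_rankings = pd.DataFrame({
    'name': ['Club A', 'Club B', 'Club C', 'Club D'],
    'league': ['German Bundesliga', 'Major League Soccer',
               'UEFA Champions League', 'English League Two'],
})

# Hypothetical keep-list, shorter than the real leagues_to_keep
toy_leagues_to_keep = ['German Bundesliga', 'UEFA Champions League']

# Boolean mask: True where the row's league is in the keep-list
mask = toy_rankings['league'].isin(toy_leagues_to_keep)
toy_filtered = toy_rankings[mask]

print(toy_filtered)
```

The filtered frame keeps its original index labels, which is why the Champions League rows above still show labels such as 41, 68 and 156 rather than 0, 1 and 2.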

### SPI match data

In [14]:
SPI_data_Matches.head()

Unnamed: 0,season,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
0,2016,2016-07-09,7921,FA Women's Super League,Liverpool Women,Reading,51.56,50.42,0.4389,0.2767,...,,,2.0,0.0,,,,,,
1,2016,2016-07-10,7921,FA Women's Super League,Arsenal Women,Notts County Ladies,46.61,54.03,0.3572,0.3608,...,,,2.0,0.0,,,,,,
2,2016,2016-07-10,7921,FA Women's Super League,Chelsea FC Women,Birmingham City,59.85,54.64,0.4799,0.2487,...,,,1.0,1.0,,,,,,
3,2016,2016-07-16,7921,FA Women's Super League,Liverpool Women,Notts County Ladies,53.0,52.35,0.4289,0.2699,...,,,0.0,0.0,,,,,,
4,2016,2016-07-17,7921,FA Women's Super League,Chelsea FC Women,Arsenal Women,59.43,60.99,0.4124,0.3157,...,,,1.0,2.0,,,,,,


In [15]:
SPI_data_Matches.league.value_counts()

English League Championship                 2223
Italy Serie A                               1900
Barclays Premier League                     1900
French Ligue 1                              1900
Spanish Primera Division                    1900
Spanish Segunda Division                    1865
Italy Serie B                               1594
English League Two                          1554
Major League Soccer                         1535
German Bundesliga                           1530
French Ligue 2                              1520
Brasileiro Série A                          1520
English League One                          1514
United Soccer League                        1497
Turkish Turkcell Super Lig                  1338
Portuguese Liga                             1224
Dutch Eredivisie                            1224
German 2. Bundesliga                        1224
Argentina Primera Division                   979
Swedish Allsvenskan                          960
Norwegian Tippeligae

In [16]:
SPI_match_data = remove_useless_leagues(SPI_data_Matches, leagues_to_keep)
SPI_match_data.head()

Unnamed: 0,season,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
10,2016,2016-08-12,1843,French Ligue 1,Bastia,Paris Saint-Germain,51.16,85.68,0.0463,0.838,...,32.4,67.7,0.0,1.0,0.97,0.63,0.43,0.45,0.0,1.05
11,2016,2016-08-12,1843,French Ligue 1,AS Monaco,Guingamp,68.85,56.48,0.5714,0.1669,...,53.7,22.9,2.0,2.0,2.45,0.77,1.75,0.42,2.1,2.1
12,2016,2016-08-13,2411,Barclays Premier League,Hull City,Leicester City,53.57,66.81,0.3459,0.3621,...,38.1,22.2,2.0,1.0,0.85,2.77,0.17,1.25,2.1,1.05
13,2016,2016-08-13,2411,Barclays Premier League,Crystal Palace,West Bromwich Albion,55.19,58.66,0.4214,0.2939,...,43.6,34.6,0.0,1.0,1.11,0.68,0.84,1.6,0.0,1.05
14,2016,2016-08-13,2411,Barclays Premier League,Everton,Tottenham Hotspur,68.02,73.25,0.391,0.3401,...,31.9,48.0,1.0,1.0,0.73,1.11,0.88,1.81,1.05,1.05


In [17]:
SPI_match_data.league.value_counts()

Spanish Primera Division    1900
Italy Serie A               1900
French Ligue 1              1900
Barclays Premier League     1900
German Bundesliga           1530
UEFA Champions League        590
Name: league, dtype: int64

### SPI latest match data

In [18]:
SPI_data_Match_Latest

Unnamed: 0,season,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
0,2019,2019-03-01,1979,Chinese Super League,Shandong Luneng,Guizhou Renhe,48.22,37.83,0.5755,0.1740,...,45.9,22.1,1.0,0.0,1.39,0.26,2.05,0.54,1.05,0.00
1,2019,2019-03-01,1979,Chinese Super League,Shanghai Greenland,Shanghai SIPG,39.81,60.08,0.2387,0.5203,...,25.6,63.4,0.0,4.0,0.57,2.76,0.80,1.50,0.00,3.26
2,2019,2019-03-01,1979,Chinese Super League,Guangzhou Evergrande,Tianjin Quanujian,65.59,39.99,0.7832,0.0673,...,77.1,28.8,3.0,0.0,0.49,0.45,1.05,0.75,3.15,0.00
3,2019,2019-03-01,1979,Chinese Super League,Wuhan Zall,Beijing Guoan,32.25,54.82,0.2276,0.5226,...,35.8,58.9,0.0,1.0,1.12,0.97,1.51,0.94,0.00,1.05
4,2019,2019-03-01,1979,Chinese Super League,Chongqing Lifan,Guangzhou RF,38.24,40.45,0.4403,0.2932,...,26.2,21.3,2.0,2.0,2.77,3.17,1.05,2.08,2.10,2.10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10863,2020,2021-05-30,1871,Spanish Segunda Division,Málaga,Castellon,32.34,27.63,0.4749,0.2226,...,,,,,,,,,,
10864,2020,2021-05-30,1871,Spanish Segunda Division,Mirandes,CD Sabadell,32.48,27.64,0.4739,0.2341,...,,,,,,,,,,
10865,2020,2021-05-30,1871,Spanish Segunda Division,AD Alcorcon,Espanyol,25.76,61.97,0.1171,0.6412,...,,,,,,,,,,
10866,2020,2021-05-30,1871,Spanish Segunda Division,Logrones,Las Palmas,32.13,35.66,0.3825,0.3257,...,,,,,,,,,,


In [19]:
SPI_data_Match_Latest.league.value_counts()

English League One                          552
English League Championship                 552
English League Two                          552
Spanish Segunda Division                    462
Turkish Turkcell Super Lig                  420
French Ligue 1                              380
Brasileiro Série A                          380
Italy Serie B                               380
Spanish Primera Division                    380
French Ligue 2                              380
Barclays Premier League                     380
Italy Serie A                               380
Major League Soccer                         315
Dutch Eredivisie                            306
Japanese J League                           306
Portuguese Liga                             306
Belgian Jupiler League                      306
German Bundesliga                           306
German 2. Bundesliga                        306
United Soccer League                        290
Argentina Primera Division              

In [21]:
SPI_match_data_latest = remove_useless_leagues(SPI_data_Match_Latest, leagues_to_keep)
SPI_match_data_latest

Unnamed: 0,season,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
2077,2020,2020-08-21,1843,French Ligue 1,Bordeaux,Nantes,59.92,59.31,0.4538,0.2626,...,,,0.0,0.0,0.60,0.19,0.36,0.60,0.00,0.00
2098,2020,2020-08-22,1843,French Ligue 1,Dijon FCO,Angers,54.91,59.50,0.4038,0.3013,...,,,0.0,1.0,0.89,1.74,0.72,1.54,0.00,1.05
2120,2020,2020-08-22,1843,French Ligue 1,Lille,Stade Rennes,70.39,65.18,0.5088,0.2242,...,,,1.0,1.0,0.37,1.67,0.18,1.04,0.84,1.05
2153,2020,2020-08-23,1843,French Ligue 1,AS Monaco,Reims,67.42,58.47,0.5583,0.1940,...,,,2.0,2.0,2.67,1.75,2.66,0.62,2.10,2.10
2157,2020,2020-08-23,1843,French Ligue 1,Lorient,Strasbourg,52.87,60.44,0.3774,0.3471,...,,,3.0,1.0,3.09,0.34,1.11,0.53,2.72,1.05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10849,2020,2021-05-23,1854,Italy Serie A,Internazionale,Udinese,84.87,63.23,0.6914,0.1134,...,,,,,,,,,,
10850,2020,2021-05-23,1854,Italy Serie A,Sampdoria,Parma,61.63,55.90,0.4757,0.2683,...,,,,,,,,,,
10851,2020,2021-05-23,1854,Italy Serie A,Crotone,Fiorentina,52.51,66.84,0.2702,0.4578,...,,,,,,,,,,
10852,2020,2021-05-23,1854,Italy Serie A,Cagliari,Genoa,57.00,54.17,0.4488,0.2821,...,,,,,,,,,,


In [22]:
# confirm I only have the leagues I need
SPI_match_data_latest.league.value_counts()

Barclays Premier League     380
Spanish Primera Division    380
Italy Serie A               380
French Ligue 1              380
German Bundesliga           306
UEFA Champions League        96
Name: league, dtype: int64

Now that the SPI_match_data and SPI_match_data_latest dataframes are pared down to the 6 leagues I'm focusing on, I can concatenate them into one larger match data DF. 
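When two frames share the same columns, the `axis` argument to `pd.concat` decides whether rows are appended or columns are placed side by side. The sketch below uses toy frames (hypothetical names and values) to show how the resulting shapes differ.

```python
import pandas as pd

# Toy frames with identical columns and non-overlapping index labels
older = pd.DataFrame({'team1': ['A', 'B'], 'score1': [1, 0]}, index=[10, 11])
latest = pd.DataFrame({'team1': ['C', 'D'], 'score1': [2, 3]}, index=[20, 21])

# axis=0 (the default): rows of `latest` are appended below `older`
stacked = pd.concat([older, latest])

# axis=1: frames are aligned on the index and their columns sit side by side,
# so the column count doubles and unmatched index labels are filled with NaN
side_by_side = pd.concat([older, latest], axis=1)

print(stacked.shape)
print(side_by_side.shape)
```

A quick sanity check after stacking is that the column count equals that of the inputs and the row count equals the sum of their row counts.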

In [23]:
# stack the two frames row-wise (default axis=0) so they share one set of columns
SPI_matches_all = pd.concat([SPI_match_data, SPI_match_data_latest])
SPI_matches_all.head()

Unnamed: 0,season,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
10,2016,2016-08-12,1843,French Ligue 1,Bastia,Paris Saint-Germain,51.16,85.68,0.0463,0.838,...,32.4,67.7,0.0,1.0,0.97,0.63,0.43,0.45,0.0,1.05
11,2016,2016-08-12,1843,French Ligue 1,AS Monaco,Guingamp,68.85,56.48,0.5714,0.1669,...,53.7,22.9,2.0,2.0,2.45,0.77,1.75,0.42,2.1,2.1
12,2016,2016-08-13,2411,Barclays Premier League,Hull City,Leicester City,53.57,66.81,0.3459,0.3621,...,38.1,22.2,2.0,1.0,0.85,2.77,0.17,1.25,2.1,1.05
13,2016,2016-08-13,2411,Barclays Premier League,Crystal Palace,West Bromwich Albion,55.19,58.66,0.4214,0.2939,...,43.6,34.6,0.0,1.0,1.11,0.68,0.84,1.6,0.0,1.05
14,2016,2016-08-13,2411,Barclays Premier League,Everton,Tottenham Hotspur,68.02,73.25,0.391,0.3401,...,31.9,48.0,1.0,1.0,0.73,1.11,0.88,1.81,1.05,1.05


In [24]:
# confirm the shape is correct: rows from both frames, one set of columns
SPI_matches_all.shape

(11642, 23)

In [25]:
# save the cleaned up and trimmed datasets
SPI_matches_all.to_csv('SPI_matches_all.csv')
SPI_data.to_csv('SPI_data.csv')

Currently, the datasets are stored in the following way:  
 - SPI_matches_all : contains 11,642 match rows (9,720 from SPI_match_data plus 1,922 from SPI_match_data_latest) for the leagues and clubs being modeled.  
 - SPI_data : contains offensive, defensive, and team SPI coefficients for the top clubs in the top leagues.  
 
### Mohammad Ghahramani’s Match datasets
Next, I'm going to clean up and pare down the dataset of 350,000 matches, found [here](https://data.world/analystmasters/earn-your-6-figure-prize-by-soccer-betting).  


In [37]:
path = 'Data/MoGhahData/{}.csv'
csv_list = ['names6', 'fresults6', 'odds6']

MGMatch_data_all = read_to_dataframe(path, csv_list)

MGMatch_data_all.head()

Unnamed: 0,AWAY,DRAW,HOME,RESULTS_A,RESULTS_H
0,,,Nurnberg II / Pipinsried,,
1,,,Schalding / Munich 1860,,
2,,,Walldorf / Stutt. Kickers,,
3,,,Verl / Alemannia Aachen,,
4,,,Westfalia Rhynern / Erndtebruck,,


In [39]:
# splitting the Home and Away teams
new = MGMatch_data_all['HOME'].str.split(" / ", n=1, expand=True)
MGMatch_data_all['HOME'] = new[0]
MGMatch_data_all['AWAY'] = new[1]
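As a small illustration of the split step, the sketch below runs the same `str.split` call on a toy column. The frame is made up, and the `None` entry stands in for rows that carry no team names.

```python
import pandas as pd

# Toy column in the same "home / away" format, with one missing entry
toy = pd.DataFrame({'HOME': ['Verl / Alemannia Aachen',
                             'Walldorf / Stutt. Kickers',
                             None]})

# n=1 splits on the first " / " only; expand=True returns one column per part
parts = toy['HOME'].str.split(' / ', n=1, expand=True)
toy['HOME'] = parts[0]
toy['AWAY'] = parts[1]

print(toy)
```

Rows with a missing value stay missing in both new columns, so the split does not create team names where none existed.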


In [41]:
MGMatch_data_all

Unnamed: 0,AWAY,DRAW,HOME,RESULTS_A,RESULTS_H
0,Pipinsried,,Nurnberg II,,
1,Munich 1860,,Schalding,,
2,Stutt. Kickers,,Walldorf,,
3,Alemannia Aachen,,Verl,,
4,Erndtebruck,,Westfalia Rhynern,,
...,...,...,...,...,...
2995,,5.02,,,
2996,,4.74,,,
2997,,5.07,,,
2998,,12.87,,,
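In the printout above, team names appear only in the first rows and DRAW values only in the last rows, which is the pattern that row-wise stacking of separate files produces. The toy sketch below reproduces that effect and, purely as an assumption that has not been checked against the real files, shows what a column-wise join would look like if the three CSVs were row-aligned. All frames and values here are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the names, results and odds files
names = pd.DataFrame({'HOME': ['Everton / San Luis', 'Verl / Alemannia Aachen']})
results = pd.DataFrame({'RESULTS_H': [2, 0], 'RESULTS_A': [1, 0]})
odds = pd.DataFrame({'DRAW': [3.4, 5.02]})

# Row-wise stacking: each file contributes its own rows, other columns are NaN
stacked = pd.concat([names, results, odds], sort=True)

# Column-wise join: only meaningful if row i describes the same match in every file
aligned = pd.concat([names, results, odds], axis=1)

print(stacked.shape, int(stacked.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

Whether the real files are row-aligned would need to be confirmed first, for example by comparing their row counts.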


While this dataset appears quite large, there seems to be very little match data actually pertaining to the clubs I'm focusing on. I'm going to make a list of club names from the SPI data above and compare it against this dataset to see how much useful data is actually present. 
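The comparison can be sketched as two `isin` masks, one per side of the fixture. The club list and matches below are toy values chosen for illustration.

```python
import pandas as pd

# Hypothetical focus list and match table
toy_focus_clubs = ['Everton', 'Lille', 'Napoli']
toy_matches = pd.DataFrame({
    'HOME': ['Everton', 'Verl', 'Walldorf', 'Lille'],
    'AWAY': ['San Luis', 'Napoli', 'Stutt. Kickers', 'Everton'],
})

# One mask per side of the fixture
home_hit = toy_matches['HOME'].isin(toy_focus_clubs)
away_hit = toy_matches['AWAY'].isin(toy_focus_clubs)

either = toy_matches[home_hit | away_hit]  # at least one focus club involved
both = toy_matches[home_hit & away_hit]    # both sides are focus clubs

print(f"{len(either)} of {len(toy_matches)} rows involve a focus club")
print(f"{len(both)} rows are between two focus clubs")
```

Because `isin` matches strings exactly, spelling differences between the two sources (for example 'Dijon' in the match data versus 'Dijon FCO' in the SPI list) count as misses.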

In [43]:
SPI_data.name

0            Bayern Munich
1          Manchester City
2                Barcelona
3                Liverpool
4      Paris Saint-Germain
              ...         
171                Crotone
176                 Spezia
180                  Nimes
186              Benevento
205              Dijon FCO
Name: name, Length: 101, dtype: object

In [45]:
Focus_clubs = SPI_data.name.to_list()
Focus_clubs

['Bayern Munich',
 'Manchester City',
 'Barcelona',
 'Liverpool',
 'Paris Saint-Germain',
 'Borussia Dortmund',
 'Atletico Madrid',
 'Real Madrid',
 'Chelsea',
 'RB Leipzig',
 'Internazionale',
 'Manchester United',
 'Real Sociedad',
 'Tottenham Hotspur',
 'Napoli',
 'Juventus',
 'Leicester City',
 'Sevilla FC',
 'AC Milan',
 'Arsenal',
 'Borussia Monchengladbach',
 'Villarreal',
 'Atalanta',
 'Bayer Leverkusen',
 'AS Roma',
 'Wolverhampton',
 'Lyon',
 'Lille',
 'Everton',
 'Getafe',
 'TSG Hoffenheim',
 'Southampton',
 'West Ham United',
 'Aston Villa',
 'Eintracht Frankfurt',
 'Shakhtar Donetsk',
 'Real Betis',
 'Brighton and Hove Albion',
 'VfL Wolfsburg',
 'AS Monaco',
 'Athletic Bilbao',
 'Lazio',
 'Celta Vigo',
 'Hertha Berlin',
 'Osasuna',
 'Granada',
 '1. FC Union Berlin',
 'Levante',
 'Crystal Palace',
 'Cadiz',
 'Burnley',
 'Fiorentina',
 'Valencia',
 'Stade Rennes',
 'Sassuolo',
 'Sheffield United',
 'Dynamo Kiev',
 'Leeds United',
 'Newcastle',
 'Eibar',
 'FC Cologne',
 'SD 

In [52]:
# save the teams that are present in both datasets to their own df
home_teams_to_save = compare_overlap_home(Focus_clubs, MGMatch_data_all)
away_teams_to_save = compare_overlap_away(Focus_clubs, MGMatch_data_all)

In [54]:
MGMatch_data = pd.concat([home_teams_to_save, away_teams_to_save], sort=True)
MGMatch_data.head()

Unnamed: 0,AWAY,DRAW,HOME,RESULTS_A,RESULTS_H
16,Orleans,,Reims,,
38,Norwich,,Fulham,,
187,San Luis,,Everton,,
292,Bordeaux,,Angers,,
366,Dijon,,Marseille,,


In [69]:
print(MGMatch_data.RESULTS_H.unique())
print(MGMatch_data.RESULTS_A.unique())
print(MGMatch_data.DRAW.unique())

[nan]
[nan]
[nan]


As shown above, once the dataset is trimmed down to only the matches involving the clubs I'm focusing on, the result and odds columns are entirely NaN, so there is no useful data left. I'm going to scrap this dataset and use the SPI set for modeling. 