# 2. Play Money Ball - Data Analytics in Sports

Inspired by the book (and movie) "Moneyball", I will try to predict sports events using data analytics. In the book, the Oakland A's revolutionized baseball through analytical player scouting, but I find it much more interesting to predict the final results of games, so that is what I will do here.

## Data sources for sports

There are plenty of data sources for data analytics in sports:
- [RotoWire.com](https://www.rotowire.com/)
- [Sports-Reference.com](https://www.sports-reference.com/)
- [Cricsheet.org](https://cricsheet.org/downloads/)
- [Football-Data.co.uk](http://football-data.co.uk/data.php)
- [Kaggle.com](https://www.kaggle.com/datasets?sortBy=relevance&group=public&search=sport&page=1&pageSize=20&size=all&filetype=all&license=all)

I have downloaded the .csv files and put them in one directory so that I can easily import all of them. Some of the .csv files had formatting errors, which I fixed. After that, all .csv files could be imported successfully.
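One way to locate such formatting errors before fixing them is to compare the number of fields in each row with the number of header fields. The following is only a minimal sketch of that idea, not the procedure actually used here: the helper name `find_malformed_rows` is hypothetical, and the `latin-1` default encoding is an assumption chosen because it never fails to decode.

```python
import csv

def find_malformed_rows(path, encoding="latin-1"):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    problems = []
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            return problems  # empty file, nothing to check
        expected = len(header)
        for row in reader:
            if not row:
                continue  # csv.reader yields [] for completely blank lines
            if len(row) != expected:
                # reader.line_num is the number of the source line just read
                problems.append((reader.line_num, len(row)))
    return problems
```

Calling `find_malformed_rows("data/D1_1993_1994.csv")` would return an empty list for a file in which every row has as many fields as the header.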

In [1]:
import pandas as pd
import numpy as np

In [2]:
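def test_read_all_csv():
    """Try to read every season file and report which ones fail to parse."""
    divisions = ["D1", "D2"]
    for division in divisions:
        # One file per season, e.g. data/D1_1993_1994.csv
        for year in range(1993, 2017):
            path = "data/" + division + "_" + str(year) + "_" + str(year+1) + ".csv"
            try:
                print(path)
                pd.read_csv(path, keep_default_na=False)
                print('success')
            except Exception:
                # A missing or malformed file should not stop the loop
                print(path + " not successful")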

In [3]:
test_read_all_csv()

data/D1_1993_1994.csv
success
data/D1_1994_1995.csv
success
data/D1_1995_1996.csv
success
data/D1_1996_1997.csv
success
data/D1_1997_1998.csv
success
data/D1_1998_1999.csv
success
data/D1_1999_2000.csv
success
data/D1_2000_2001.csv
success
data/D1_2001_2002.csv
success
data/D1_2002_2003.csv
success
data/D1_2003_2004.csv
success
data/D1_2004_2005.csv
success
data/D1_2005_2006.csv
success
data/D1_2006_2007.csv
success
data/D1_2007_2008.csv
success
data/D1_2008_2009.csv
success
data/D1_2009_2010.csv
success
data/D1_2010_2011.csv
success
data/D1_2011_2012.csv
success
data/D1_2012_2013.csv
success
data/D1_2013_2014.csv
success
data/D1_2014_2015.csv
success
data/D1_2015_2016.csv
success
data/D1_2016_2017.csv
success
data/D2_1993_1994.csv
success
data/D2_1994_1995.csv
success
data/D2_1995_1996.csv
success
data/D2_1996_1997.csv
success
data/D2_1997_1998.csv
success
data/D2_1998_1999.csv
success
data/D2_1999_2000.csv
success
data/D2_2000_2001.csv
success
data/D2_2001_2002.csv
success
data/D2_20

## First data set: games from 1993 onward

In [4]:
from dataset_builders.max_years_dataset_builder import MaxYearsDataSetBuilder

dataset_builder = MaxYearsDataSetBuilder()
df = dataset_builder.build_data_set()
df

| | Div | Date | HomeTeam | AwayTeam | FTHG | FTAG | FTR |
|---|---|---|---|---|---|---|---|
| 0 | D1 | 07/08/93 | Bayern Munich | Freiburg | 3.0 | 1 | H |
| 1 | D1 | 07/08/93 | Dortmund | Karlsruhe | 2.0 | 1 | H |
| 2 | D1 | 07/08/93 | Duisburg | Leverkusen | 2.0 | 2 | D |
| 3 | D1 | 07/08/93 | FC Koln | Kaiserslautern | 0.0 | 2 | A |
| 4 | D1 | 07/08/93 | Hamburg | Nurnberg | 5.0 | 2 | H |
| 5 | D1 | 07/08/93 | Leipzig | Dresden | 3.0 | 3 | D |
| 6 | D1 | 07/08/93 | M'Gladbach | Ein Frankfurt | 0.0 | 4 | A |
| 7 | D1 | 07/08/93 | Wattenscheid | Schalke 04 | 3.0 | 0 | H |
| 8 | D1 | 07/08/93 | Werder Bremen | Stuttgart | 5.0 | 1 | H |
| 9 | D1 | 14/08/93 | Dresden | Duisburg | 0.0 | 1 | A |


In [5]:
# Each row is one match
n_matches = df.shape[0]

# Every column except one, the result column FTR, counts as a feature
n_features = df.shape[1] - 1

# FTR == 'H' marks a home win
n_homewins = len(df[df.FTR == 'H'])

# Share of home wins in percent
win_rate = (float(n_homewins) / (n_matches)) * 100

print("Total number of matches: " + str(n_matches))
print("Number of features: "  + str(n_features))
print("Number of matches won by home team: " + str(n_homewins))
print("Win rate of home team: " + str(win_rate))

Total number of matches: 15918
Number of features: 6
Number of matches won by home team: 6735
Win rate of home team: 42.3105917828873
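The home win rate alone does not show how the remaining matches split between draws and away wins. As a rough sketch, the share of every full-time result code can be counted in one pass; any prediction model should later beat the simple baseline of always guessing the most frequent outcome. The function name `outcome_shares` and the sample list are made up for illustration and are not part of the data set.

```python
from collections import Counter

def outcome_shares(results):
    """Map each full-time result code ('H', 'D', 'A') to its share in percent."""
    counts = Counter(results)
    total = sum(counts.values())
    if total == 0:
        return {}  # no matches, no shares
    return {code: 100.0 * n / total for code, n in counts.items()}

# Tiny made-up sample, not taken from the data set
sample = ["H", "H", "D", "A", "H", "A", "D", "H"]
shares = outcome_shares(sample)
print(shares)
```

Because the function only iterates over its argument, it should also accept the notebook's column directly, as in `outcome_shares(df.FTR)`.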


The first data set has too few features, and the ones it has are too weak, to train our model on. Therefore we will use another data set for training our machine learning model.

## Second data set: all matches since 2000

In [6]:
from dataset_builders.century_dataset_builder import CenturyDataSetBuilder

dataset_builder = CenturyDataSetBuilder()
df = dataset_builder.build_data_set()
df

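| | Div | Date | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HTHG | HTAG | HTR | ... | IWA | LBH | LBD | LBA | SBH | SBD | SBA | WHH | WHD | WHA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | D1 | 11/08/00 | Dortmund | Hansa Rostock | 1 | 0 | H | 0 | 0 | D | ... | 5.00 | | | | 1.50 | 3.50 | 6.00 | 1.44 | 3.60 | 6.50 |
| 1 | D1 | 12/08/00 | Bayern Munich | Hertha | 4 | 1 | H | 1 | 0 | H | ... | 5.00 | | | | 1.40 | 3.75 | 7.00 | 1.44 | 3.60 | 6.50 |
| 2 | D1 | 12/08/00 | Freiburg | Stuttgart | 4 | 0 | H | 2 | 0 | H | ... | 2.50 | | | | 2.60 | 3.25 | 2.38 | 2.40 | 3.20 | 2.50 |
| 3 | D1 | 12/08/00 | Hamburg | Munich 1860 | 2 | 2 | D | 2 | 2 | D | ... | 3.50 | | | | 1.75 | 3.30 | 4.00 | 1.66 | 3.30 | 4.50 |
| 4 | D1 | 12/08/00 | Kaiserslautern | Bochum | 0 | 1 | A | 0 | 0 | D | ... | 5.00 | | | | 1.44 | 3.80 | 6.00 | 1.50 | 3.60 | 5.50 |
| 5 | D1 | 12/08/00 | Leverkusen | Wolfsburg | 2 | 0 | H | 2 | 0 | H | ... | 4.50 | | | | 1.44 | 3.80 | 6.00 | 1.44 | 3.60 | 6.50 |
| 6 | D1 | 12/08/00 | Werder Bremen | Cottbus | 3 | 1 | H | 2 | 1 | H | ... | 5.00 | | | | 1.40 | 3.75 | 7.00 | 1.50 | 3.40 | 6.00 |
| 7 | D1 | 13/08/00 | Ein Frankfurt | Unterhaching | 3 | 0 | H | 1 | 0 | H | ... | 4.00 | 1.80 | 3.25 | 3.75 | 1.80 | 3.20 | 3.90 | 1.72 | 3.20 | 4.20 |
| 8 | D1 | 13/08/00 | Schalke 04 | FC Koln | 2 | 1 | H | 2 | 0 | H | ... | 3.40 | 1.80 | 3.25 | 3.75 | 1.85 | 3.25 | 3.60 | 1.75 | 3.25 | 4.00 |
| 9 | D1 | 18/08/00 | Cottbus | Dortmund | 1 | 4 | A | 0 | 0 | D | ... | 2.00 | | | | 3.10 | 3.20 | 2.10 | 3.20 | 3.00 | 2.10 |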


In [7]:
# Each row is one match
n_matches = df.shape[0]

# Every column except one, the result column FTR, counts as a feature
n_features = df.shape[1] - 1

# FTR == 'H' marks a home win
n_homewins = len(df[df.FTR == 'H'])

# Share of home wins in percent
win_rate = (float(n_homewins) / (n_matches)) * 100

print("Total number of matches: " + str(n_matches))
print("Number of features: "  + str(n_features))
print("Number of matches won by home team: " + str(n_homewins))
print("Win rate of home team: " + str(win_rate))

Total number of matches: 10404
Number of features: 44
Number of matches won by home team: 4670
Win rate of home team: 44.88658208381392
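Many of the additional columns in this data set are bookmaker odds. Assuming that `WHH`, `WHD` and `WHA` hold decimal odds for a home win, a draw and an away win, as the column notes on Football-Data.co.uk describe them, they can be turned into rough outcome probabilities. The sketch below uses simple proportional normalization, which is only one of several ways to remove the bookmaker's margin; the function name `implied_probabilities` is hypothetical.

```python
def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds into probabilities that sum to 1."""
    # The inverse of a decimal odd is the raw implied probability
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    # The raw values usually sum to more than 1 because of the bookmaker's margin
    overround = sum(raw)
    # Proportional normalization (an assumption, other schemes exist)
    return [p / overround for p in raw]

# WHH, WHD, WHA from the first row shown above (Dortmund vs Hansa Rostock)
p_home, p_draw, p_away = implied_probabilities(1.44, 3.60, 6.50)
print(round(p_home, 3), round(p_draw, 3), round(p_away, 3))
```

For this row the home team gets by far the largest probability, which matches the actual result `H`.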
