# 2. Play Moneyball - Data Analytics in Sports

Inspired by the book (and movie) "Moneyball", I will try to predict sports events using data analytics. In the book, the Oakland A's revolutionized baseball through analytical player scouting, but I find it more interesting to predict the final results of games, so that is what I will do here.

## Data sources for sports

There are plenty of data sources for data analytics in sports:
- [RotoWire.com](https://www.rotowire.com/)
- [Sports-Reference.com](https://www.sports-reference.com/)
- [Cricsheet.org](https://cricsheet.org/downloads/)
- [Football-Data.co.uk](http://football-data.co.uk/data.php)
- [Kaggle.com](https://www.kaggle.com/datasets?sortBy=relevance&group=public&search=sport&page=1&pageSize=20&size=all&filetype=all&license=all)

I downloaded the .csv files and put them in one directory so that I can import all of them easily. Some of the files contained formatting errors, which I fixed. After that, every .csv file imported successfully.
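
The exact formatting errors and how they were fixed are not described here. As a rough sketch of how such problems could be located, the hypothetical helper below (`find_malformed_rows` is not part of the original notebook) reports every row whose number of fields differs from the header. The `latin-1` encoding is an assumption and may need to be changed for other files.

```python
import csv


def find_malformed_rows(path, encoding="latin-1"):
    """Return (line_number, field_count) for rows whose width differs from the header.

    Hypothetical helper; the encoding default is an assumption.
    """
    bad_rows = []
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            return bad_rows  # empty file, nothing to check
        expected = len(header)
        for row in reader:
            # csv.reader yields [] for blank lines; ignore those.
            if row and len(row) != expected:
                bad_rows.append((reader.line_num, len(row)))
    return bad_rows
```

Calling `find_malformed_rows("data/D1_1993_1994.csv")` would return an empty list for a file in which every row has the same number of fields as the header.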

In [1]:
import pandas as pd
import numpy as np

In [2]:
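def test_read_all_csv():
    """Try to read every season file and report which ones fail to parse."""
    divisions = ["D1", "D2"]
    for division in divisions:
        # Seasons 1993/94 through 2016/17
        for year in range(1993, 2017):
            path = "data/" + division + "_" + str(year) + "_" + str(year+1) + ".csv"
            try:
                print(path)
                pd.read_csv(path, keep_default_na=False)
                print('success')
            except Exception:
                # Catch parsing and I/O errors, but not KeyboardInterrupt
                print(path + " not successful")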

In [3]:
test_read_all_csv()

data/D1_1993_1994.csv
success
data/D1_1994_1995.csv
success
data/D1_1995_1996.csv
success
data/D1_1996_1997.csv
success
data/D1_1997_1998.csv
success
data/D1_1998_1999.csv
success
data/D1_1999_2000.csv
success
data/D1_2000_2001.csv
success
data/D1_2001_2002.csv
success
data/D1_2002_2003.csv
success
data/D1_2003_2004.csv
success
data/D1_2004_2005.csv
success
data/D1_2005_2006.csv
success
data/D1_2006_2007.csv
success
data/D1_2007_2008.csv
success
data/D1_2008_2009.csv
success
data/D1_2009_2010.csv
success
data/D1_2010_2011.csv
success
data/D1_2011_2012.csv
success
data/D1_2012_2013.csv
success
data/D1_2013_2014.csv
success
data/D1_2014_2015.csv
success
data/D1_2015_2016.csv
success
data/D1_2016_2017.csv
success
data/D2_1993_1994.csv
success
data/D2_1994_1995.csv
success
data/D2_1995_1996.csv
success
data/D2_1996_1997.csv
success
data/D2_1997_1998.csv
success
data/D2_1998_1999.csv
success
data/D2_1999_2000.csv
success
data/D2_2000_2001.csv
success
data/D2_2001_2002.csv
success
data/D2_20

In [4]:
from dataset_builders.max_years_dataset_builder import MaxYearsDateSetBuilder

dataset_builder = MaxYearsDateSetBuilder()
df = dataset_builder.build_data_set()
df

| | Div | Date | HomeTeam | AwayTeam | FTHG | FTAG | FTR |
|---|---|---|---|---|---|---|---|
| 0 | D1 | 07/08/93 | Bayern Munich | Freiburg | 3.0 | 1 | H |
| 1 | D1 | 07/08/93 | Dortmund | Karlsruhe | 2.0 | 1 | H |
| 2 | D1 | 07/08/93 | Duisburg | Leverkusen | 2.0 | 2 | D |
| 3 | D1 | 07/08/93 | FC Koln | Kaiserslautern | 0.0 | 2 | A |
| 4 | D1 | 07/08/93 | Hamburg | Nurnberg | 5.0 | 2 | H |
| 5 | D1 | 07/08/93 | Leipzig | Dresden | 3.0 | 3 | D |
| 6 | D1 | 07/08/93 | M'Gladbach | Ein Frankfurt | 0.0 | 4 | A |
| 7 | D1 | 07/08/93 | Wattenscheid | Schalke 04 | 3.0 | 0 | H |
| 8 | D1 | 07/08/93 | Werder Bremen | Stuttgart | 5.0 | 1 | H |
| 9 | D1 | 14/08/93 | Dresden | Duisburg | 0.0 | 1 | A |
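
The source of `MaxYearsDateSetBuilder` is not shown here, so the following is only a guess at the general idea, not the actual implementation. It is a minimal sketch that concatenates the per-season files into one DataFrame and keeps the columns shown in the output above. The function name `load_seasons` and its parameters are hypothetical, and the date format is assumed to be day/month/two-digit-year in every file, as in the rows above.

```python
from pathlib import Path

import pandas as pd

# Columns that appear in the output above.
COLUMNS = ["Div", "Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR"]


def load_seasons(data_dir="data", divisions=("D1", "D2"),
                 first_year=1993, last_year=2016):
    """Concatenate the per-season CSV files into one DataFrame (hypothetical helper)."""
    frames = []
    for division in divisions:
        for year in range(first_year, last_year + 1):
            path = Path(data_dir) / f"{division}_{year}_{year + 1}.csv"
            if not path.exists():
                continue  # skip seasons that are not on disk
            frames.append(pd.read_csv(path, usecols=COLUMNS))
    if not frames:
        return pd.DataFrame(columns=COLUMNS)
    matches = pd.concat(frames, ignore_index=True)
    # Assumption: every file uses dd/mm/yy dates, e.g. 14/08/93.
    matches["Date"] = pd.to_datetime(matches["Date"], format="%d/%m/%y")
    return matches
```

Parsing `Date` into real timestamps makes it possible to sort matches chronologically, which matters when earlier matches are later used to predict later ones.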


In [11]:
# Total number of matches (rows)
n_matches = df.shape[0]

# Subtract one column for the label (FTR, the full-time result)
n_features = df.shape[1] - 1

# Matches whose full-time result is a home win ('H')
n_homewins = len(df[df.FTR == 'H'])

# Home win rate in percent
win_rate = (float(n_homewins) / n_matches) * 100

print("Total number of matches: " + str(n_matches))
print("Number of features: " + str(n_features))
print("Number of matches won by home team: " + str(n_homewins))
print("Win rate of home team: " + str(win_rate))

Total number of matches: 15918
Number of features: 6
Number of matches won by home team: 6735
Win rate of home team: 42.3105917828873
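
The home win rate also serves as a simple baseline: a model that always predicts a home win would be correct in about 42.3% of these matches, so any useful predictor should beat that figure. As a small sketch, the share of all three outcomes can be computed in one step with `value_counts`. The helper name `outcome_shares` is hypothetical, and the example frame below is toy data, not the real dataset.

```python
import pandas as pd


def outcome_shares(matches):
    """Return the share of each full-time result (H, D, A) in percent."""
    return matches["FTR"].value_counts(normalize=True) * 100


# Toy data for illustration only; pass the real DataFrame (df) instead.
example = pd.DataFrame({"FTR": ["H", "H", "D", "A"]})
shares = outcome_shares(example)
print(shares)
```

Applied to the real data, `outcome_shares(df)["H"]` should match the win rate printed above.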
