# Machine Learning Prediction Model
#### by Mariam Sulakian
Machine learning model, written in Python, to predict the outcome of the 2018 English Premier
League (EPL)  football matches.  Built by training suitable machine
learning algorithms on historic results data.

## Introduction

I have built a machine learning model that looks at past EPL game data to predict future games in January 2018. Various attributes were studied including: total goals scored, total goals allowed, discipline (yellow cards, red cards, fouls incurred, total corners), shots per games, shots allowed per game, percentage of games won, defensive statistics (goalie saves, goalie save percentage, ratio of saves), and offensive statistics (scoringPercentage, scoring ratio). Various models were trained using these statistics and each team's outcome in the past games. KNN produced the highest accuracy with 56%. 

## Data	Import
Imported native [Python libraries](https://docs.python.org/3.7/library/index.html) and [Scikit-Learn libraries](http://scikit-learn.org/stable/). Update the workspace_id, authorization_token, and endpoint to correspond to the 'Training.csv' file in azure, or you can alternatively use the link below to include a file path from your computer. A pathway can be easily produced through azure 'Generate Data Access Code'.

In [206]:
import sys
sys.path.insert(0,"/Users/Sulakian/anaconda3/lib/python3.6")
import os 
print(os.getcwd())

/home/nbuser


In [207]:
#import python libraries
import pandas as pd
import numpy as np

import sys
import math
import csv
import urllib
import collections

#import Scikit-Learn libraries for implementation of machine learning algorithms 
import sklearn

In [208]:
# Recursive Feature Elimination
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
#feature importance
from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier
#to create predictions
from sklearn.cross_validation import train_test_split
#algorithms tested
from sklearn import svm
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn import linear_model
from sklearn import tree
from sklearn.cross_validation import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report

## Data Transformation and Exploration
A year column was added to the data, as well as two columns with 'Winner' and 'Loser' with a none entry in each corresponding to a tie. Data was transformed in excel to produce the various statistics that would be included in the model (code can be found below for each metric). Feature selection was done with recursive feature elimination and feature importance quantification using Extra Trees Classifier to select for top 10 features. However, best results were achieved when all 14 features were included. 

## Methodology Overview
This model takes in two teams and which year they will be compared in. So for 2018, 2017 data will be used since it is the most current season. The model will then predict the probability that each team will win. Many of the algorithms used require a numerical representation of attributes to conduct statistical analysis. Feature vectors are commonly used in machine learning model since they are n-dimensional vectors composed of numerical inputs. Since these models take in vectors as input, the statistics were transformed into vectors, one for each team, which could then be compared. The simplest way to compare the two vectors is to take the difference between them. The model will then use the resultant vector to predict the probability that each team will win. The model will then be composed of an x component which will be the difference vector and a y component, which will be 1 if team 1 wins, and 0 will be associated with the inverse of the difference. This will allow the model to introduce negative sampling by allowing the model to select against true negatives. 

### Data Visualization

In [209]:
from azureml import Workspace
ws = Workspace(
    workspace_id='5cd0bcab12fa4e13bc1a21023ba57673',
    authorization_token='COEXKjYaViYevjIKq+7hyXm+G9odq3v/8877WcEw7hy1VizII0IMoZGuX+crYHlY/vu8mRMEeTbD6eNxThA2bg==',
    endpoint='https://studioapi.azureml.net'
)
ds = ws.datasets['Training.csv']
EPL_data = ds.to_dataframe()
EPL_data['Date'] = EPL_data.Date.astype(str)
EPL_data.head()

Unnamed: 0,ID,Year,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,...,HF,AF,HC,AC,HY,AY,HR,AR,Winner,Loser
0,1,2005,13-Aug-05,Everton,Man United,0,2,A,0,1,...,15,14,8,6,3,1,0,0,Man United,Everton
1,2,2005,13-Aug-05,Man City,West Brom,0,0,D,0,0,...,13,11,3,6,2,3,0,0,,
2,3,2005,14-Aug-05,Arsenal,Newcastle,2,0,H,0,0,...,15,17,8,3,0,1,0,1,Arsenal,Newcastle
3,4,2005,20-Aug-05,Newcastle,West Ham,0,0,D,0,0,...,9,11,10,2,1,1,0,1,,
4,5,2005,21-Aug-05,Chelsea,Arsenal,1,0,H,0,0,...,17,21,3,7,2,3,0,0,Chelsea,Arsenal


In [210]:
#Import the csv data
#EPL_data = pd.read_csv('Training.csv')
#EPL_data.head()

In [211]:
#get list of teams in EPL that play in January 2018
from azureml import Workspace
ws = Workspace(
    workspace_id='5cd0bcab12fa4e13bc1a21023ba57673',
    authorization_token='COEXKjYaViYevjIKq+7hyXm+G9odq3v/8877WcEw7hy1VizII0IMoZGuX+crYHlY/vu8mRMEeTbD6eNxThA2bg==',
    endpoint='https://studioapi.azureml.net'
)
ds = ws.datasets['Teams.csv']
team_names = ds.to_dataframe()
#team_names.head()
#team_names = pd.read_csv('Teams.csv')
#dfList = df['one'].tolist()
teamList = team_names['Team_Name'].tolist()
team_names.tail()

Unnamed: 0,Team_Name
11,Watford
12,Newcastle
13,Tottenham
14,Liverpool
15,Bournemouth


In [212]:
#test
print (team_names[team_names['Team_Name'] == 'Arsenal'])

  Team_Name
0   Arsenal


In [213]:
#test
print (team_names[team_names['Team_Name'] == 'Liverpool'])

    Team_Name
14  Liverpool


### Feature Creation
Use a Support Vector Machine (SVM). Plot each data item as a point in a n-dimensional environment, where each feature is a value of a particular coordinate. Find the distance (subtraction) between two vectors. 

In [254]:
#get annual vectors for each team
def getAnnualTeamData(teamName, year):
    
    annual_data = EPL_data[EPL_data['Year'] == year]
    
    # num goals scored in wins and losses
    gamesHome = annual_data[annual_data['HomeTeam'] == teamName] 
    totalGoalsScored = gamesHome['FTHG'].sum()
    gamesAway = annual_data[annual_data['AwayTeam'] == teamName]
    totalGames = gamesHome.append(gamesAway)
    numGames = len(totalGames.index)
    #total goals scored
    totalGoalsScored += gamesAway['FTAG'].sum()
    # total goals allowed
    totalGoalsAllowed = gamesHome['FTAG'].sum()
    totalGoalsAllowed += gamesAway['FTHG'].sum()
    
    #discipline: total red cards, total yellow cards
    totalYellowCards = gamesHome['HY'].sum()
    totalYellowCards += gamesAway['AY'].sum()
    totalRedCards = gamesHome['HR'].sum()
    totalRedCards += gamesAway['AR'].sum()
    
    #total fouls
    totalFouls = gamesHome['HF'].sum()
    totalFouls += gamesAway['AF'].sum()
    
    #total Corners
    totalCorners = gamesHome['HC'].sum()
    totalCorners += gamesAway['AC'].sum()

    #shots per game (spg) = total shots / total games 
    totalShots = gamesHome['HS'].sum()
    # avg shots per game
    totalShots += gamesAway['AS'].sum()
    if numGames != 0:
        spg = totalShots / numGames
    # avg shots allowed per game
    totalShotsAgainst = gamesHome['AS'].sum()
    totalShotsAgainst += gamesAway['HS'].sum()
    if numGames != 0:
        sag = totalShotsAgainst / numGames
    
    #Games Won Percentage = Games Won / (Games Won + Games Lost) 
    gamesWon = annual_data[annual_data['Winner'] == teamName] 
    gamesLost = annual_data[annual_data['Loser'] == teamName] 
    numGamesWon = len(gamesWon.index)
    numGamesLost = len(gamesLost.index)
    if numGames != 0:
        gamesWonPercentage = numGamesWon / numGames
    
    #Defense stats
        #Goalie Saves = Shots on Goal - Goal Scored
    totalShotsOnGoal = gamesHome['HST'].sum()
    totalShotsOnGoal += gamesAway['AST'].sum()
    goalieSaves = totalShotsOnGoal - totalGoalsAllowed
    
        #Saves Percentage = Goalie Saves / Shots on Goal   
    if totalShotsOnGoal != 0:
        savesPercentage = goalieSaves / totalShotsOnGoal
        
        #Saves Ratio = Shots On Goal / Goalie Saves    
    if goalieSaves != 0:
        savesRatio = totalShotsOnGoal / goalieSaves

    #Offense stats
        #Scoring Percentage = (Scoring Attempts - Goals Scored ) / Scoring Attempts
    if totalShots != 0:
        scoringPercentage = (totalShots - totalGoalsScored) / totalShots
        
        #Scoring Ratio = Shots On Goal / Goals Scored
    if totalGoalsScored != 0:
        scoringRatio = totalShotsOnGoal / totalGoalsScored       
        
            
    if numGames == 0: #if team not in dataset
        gamesWon = 0
        gamesLost = 0
        totalGoalsScored = 0
        totalGoalsAllowed = 0
        totalYellowCards = 0
        totalRedCards = 0
        totalFouls = 0
        totalCorners = 0
        spg = 0
        sag = 0
        gamesWonPercentage = 0
        goalieSaves = 0
        savesPercentage = 0
        savesRatio = 0
        scoringPercentage = 0
        scoringRatio = 0 
        
    return [totalGoalsScored, totalGoalsAllowed, totalYellowCards, totalRedCards,
        totalFouls,totalCorners, spg, sag, gamesWonPercentage, goalieSaves, savesPercentage, savesRatio,
        scoringPercentage, scoringRatio]

In [215]:
#test
getAnnualTeamData('Arsenal', 2017)

[32, 24, 29, 2, 171, 110, 13, 12, 0, 62, 0, 1, 0, 2]

In [216]:
#test
getAnnualTeamData('Chelsea', 2015)

[40, 42, 61, 5, 324, 173, 13, 11, 0, 100, 0, 1, 0, 3]

In [217]:
#create a dictionary for all the team stats in a year for all the teams
def createAnnualDict(year):
    annualDictionary = collections.defaultdict(list)
    for team in teamList:
        team_vector = getAnnualTeamData(team, year)
        annualDictionary[team] = team_vector
    return annualDictionary

In [218]:
createAnnualDict(2016)

defaultdict(list,
            {u'Arsenal': [58, 35, 40, 3, 310, 165, 14, 12, 0, 117, 0, 1, 0, 2],
             u'Bournemouth': [36,
              57,
              43,
              2,
              313,
              177,
              10,
              14,
              0,
              46,
              0,
              2,
              0,
              2],
             u'Burnley': [15, 27, 25, 0, 171, 58, 8, 20, 0, 19, 0, 2, 0, 3],
             u'Crystal Palace': [34,
              59,
              60,
              0,
              388,
              175,
              12,
              14,
              0,
              58,
              0,
              2,
              0,
              3],
             u'Everton': [35, 43, 49, 4, 334, 171, 12, 13, 0, 90, 0, 1, 0, 3],
             u'Leicester': [47,
              35,
              48,
              4,
              359,
              186,
              12,
              14,
              0,
              90,
              0,
  

# Model Training
Create training method that takes in a dictionary with with all the teams vectors by year. For each game, the function calculates the difference between between the team vectors for that year. Then, the function assigns a yTrain that is a 1 if the home team wins, and 0 otherwise. The difference vector becomes the input (xTrain) for the model, and a label (yTrain).

In [253]:
def getTrainingData(years):
    totalNumGames = 0
    for year in years:
        annual = EPL_data[EPL_data['Year'] == year]
        totalNumGames += len(annual.index)
    numFeatures = len(getAnnualTeamData('Arsenal',2015)) #random team, to find dimensionality
    xTrain = np.zeros(( totalNumGames, numFeatures))
    yTrain = np.zeros(( totalNumGames ))
    indexCounter = 0
    for year in years:
        team_vectors = createAnnualDict(year)
        annual = EPL_data[EPL_data['Year'] == year]
        numGamesInYear = len(annual.index)
        xTrainAnnual = np.zeros(( numGamesInYear, numFeatures))
        yTrainAnnual = np.zeros(( numGamesInYear ))
        counter = 0
        for index, row in annual.iterrows():
            h_team = row['HomeTeam']
            h_vector = team_vectors[h_team]
            a_team = row['AwayTeam']
            a_vector = team_vectors[a_team]
            diff = [a - b for a, b in zip(h_vector, a_vector)]
            if (counter % 2 == 0):
                if len(diff) != 0:
                    xTrainAnnual[counter] = diff
                yTrainAnnual[counter] = 1
            # the opposite of the difference of the vectors should be a true negative, where team 1 does not win
            else:
                if len(diff) != 0:
                    xTrainAnnual[counter] = [ -p for p in diff]
                yTrainAnnual[counter] = 0
            counter += 1
        xTrain[indexCounter:numGamesInYear+indexCounter] = xTrainAnnual
        yTrain[indexCounter:numGamesInYear+indexCounter] = yTrainAnnual
        indexCounter += numGamesInYear
    return xTrain, yTrain

In [220]:
#get the dictionary
years = range(2005,2017)
xTrain, yTrain = getTrainingData(years)
np.save('xTrain', xTrain)
np.save('yTrain', yTrain)

In [221]:
xTrain.shape

(1656, 14)

In [222]:
yTrain.shape

(1656,)

## Feature Selection

Feature selection of the 14 features is done through recursive feature elimination and a ranking of feature importance with extra trees classifier. 10 features of the 14 were determined to be much more influential. 

In [223]:
model1 = LogisticRegression() #recursive feature elimination 
model2 = ExtraTreesClassifier() #feature importance

In [224]:
# create the RFE model and select 3 attributes
rfe = RFE(model1, 9)
rfe = rfe.fit(xTrain, yTrain)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)

[ True  True  True False  True False  True  True False  True False  True
 False  True]
[1 1 1 2 1 3 1 1 6 1 5 1 4 1]


In [225]:
# Feature Importance
#Top Features: total goals scored, total goals allowed, total yellow cards/red cards, total fouls, total corners,
    #spg, sag, goalie saves, scoring ratio
    
#Lowest features: scoring percentage, saves percentage, games won percentage, save ratio
    
#fit an Extra Trees model to the data
model2.fit(xTrain, yTrain)
#display the relative importance of each attribute
print(model2.feature_importances_) #the higher the more important the feature

[ 0.09646106  0.09692582  0.10426811  0.09198217  0.11679977  0.11330094
  0.082826    0.08832803  0.          0.0939929   0.          0.03201604
  0.          0.08309918]


### Updated Functions to include only top 10 features

In [250]:
#updated function to include only top 10 features
def getAnnualTeamData2(teamName, year):
    
    annual_data = EPL_data[EPL_data['Year'] == year]
    
    # num goals scored in wins and losses
    gamesHome = annual_data[annual_data['HomeTeam'] == teamName] 
    totalGoalsScored = gamesHome['FTHG'].sum()
    gamesAway = annual_data[annual_data['AwayTeam'] == teamName]
    totalGames = gamesHome.append(gamesAway)
    numGames = len(totalGames.index)
    #total goals scored
    totalGoalsScored += gamesAway['FTAG'].sum()
    # total goals allowed
    totalGoalsAllowed = gamesHome['FTAG'].sum()
    totalGoalsAllowed += gamesAway['FTHG'].sum()
    
    #discipline: total red cards, total yellow cards
    totalYellowCards = gamesHome['HY'].sum()
    totalYellowCards += gamesAway['AY'].sum()
    totalRedCards = gamesHome['HR'].sum()
    totalRedCards += gamesAway['AR'].sum()
    
    #total fouls
    totalFouls = gamesHome['HF'].sum()
    totalFouls += gamesAway['AF'].sum()
    
    #total Corners
    totalCorners = gamesHome['HC'].sum()
    totalCorners += gamesAway['AC'].sum()

    #shots per game (spg) = total shots / total games 
    totalShots = gamesHome['HS'].sum()
    # avg shots per game
    totalShots += gamesAway['AS'].sum()
    if numGames != 0:
        spg = totalShots / numGames
    # avg shots allowed per game
    totalShotsAgainst = gamesHome['AS'].sum()
    totalShotsAgainst += gamesAway['HS'].sum()
    if numGames != 0:
        sag = totalShotsAgainst / numGames
    
    #Games Won Percentage = Games Won / (Games Won + Games Lost) 
    gamesWon = annual_data[annual_data['Winner'] == teamName] 
    gamesLost = annual_data[annual_data['Loser'] == teamName] 
    numGamesWon = len(gamesWon.index)
    numGamesLost = len(gamesLost.index)
    if numGames != 0:
        gamesWonPercentage = numGamesWon / numGames
    
    #Defense stats
        #Goalie Saves = Shots on Goal - Goal Scored
    totalShotsOnGoal = gamesHome['HST'].sum()
    totalShotsOnGoal += gamesAway['AST'].sum()
    goalieSaves = totalShotsOnGoal - totalGoalsAllowed
    
        #Saves Percentage = Goalie Saves / Shots on Goal   
    if totalShotsOnGoal != 0:
        savesPercentage = goalieSaves / totalShotsOnGoal
        
        #Saves Ratio = Shots On Goal / Goalie Saves    
    if goalieSaves != 0:
        savesRatio = totalShotsOnGoal / goalieSaves

    #Offense stats
        #Scoring Percentage = (Scoring Attempts - Goals Scored ) / Scoring Attempts
    if totalShots != 0:
        scoringPercentage = (totalShots - totalGoalsScored) / totalShots
        
        #Scoring Ratio = Shots On Goal / Goals Scored
    if totalGoalsScored != 0:
        scoringRatio = totalShotsOnGoal / totalGoalsScored       
        
            
    if numGames == 0: #team not in dataset
        totalGoalsScored = 0
        totalGoalsAllowed = 0
        totalYellowCards = 0
        totalRedCards = 0
        totalFouls = 0
        totalCorners = 0
        spg = 0
        sag = 0
        goalieSaves = 0
        scoringRatio = 0
        
    return [totalGoalsScored, totalGoalsAllowed, totalYellowCards, totalRedCards,
        totalFouls,totalCorners, spg, sag, goalieSaves, scoringRatio]

In [227]:
#test
getAnnualTeamData2('Chelsea', 2015)

[40, 42, 61, 5, 324, 173, 13, 11, 100, 3]

In [251]:
#updated functions to include only top 10 features
def createAnnualDict2(year):
    annualDictionary = collections.defaultdict(list)
    for team in teamList:
        team_vector = getAnnualTeamData2(team, year)
        annualDictionary[team] = team_vector
    return annualDictionary

In [231]:
createAnnualDict2(2016)

defaultdict(list,
            {u'Arsenal': [58, 35, 40, 3, 310, 165, 14, 12, 117, 2],
             u'Bournemouth': [36, 57, 43, 2, 313, 177, 10, 14, 46, 2],
             u'Burnley': [15, 27, 25, 0, 171, 58, 8, 20, 19, 3],
             u'Crystal Palace': [34, 59, 60, 0, 388, 175, 12, 14, 58, 3],
             u'Everton': [35, 43, 49, 4, 334, 171, 12, 13, 90, 3],
             u'Leicester': [47, 35, 48, 4, 359, 186, 12, 14, 90, 2],
             u'Liverpool': [64, 42, 45, 2, 343, 209, 17, 10, 142, 2],
             u'Man City': [62, 40, 54, 3, 348, 210, 15, 8, 122, 2],
             u'Man United': [47, 34, 69, 2, 431, 186, 13, 11, 112, 3],
             u'Newcastle': [22, 27, 18, 2, 177, 65, 11, 12, 49, 3],
             u'Southampton': [44, 33, 42, 4, 364, 161, 13, 11, 103, 3],
             u'Swansea': [40, 63, 51, 0, 371, 145, 11, 16, 65, 3],
             u'Tottenham': [54, 31, 62, 0, 400, 219, 18, 10, 166, 3],
             u'Watford': [29, 51, 70, 4, 426, 120, 10, 14, 59, 3],
             u'

In [252]:
#updated functions to include only top 10 features
def getTrainingData2(years):
    totalNumGames = 0
    for year in years:
        annual = EPL_data[EPL_data['Year'] == year]
        totalNumGames += len(annual.index)
    numFeatures = len(getAnnualTeamData2('Arsenal',2015)) #random team, to find dimensionality
    xTrain2 = np.zeros(( totalNumGames, numFeatures))
    yTrain2 = np.zeros(( totalNumGames ))
    indexCounter = 0
    for year in years:
        team_vectors = createAnnualDict2(year)
        annual = EPL_data[EPL_data['Year'] == year]
        numGamesInYear = len(annual.index)
        xTrainAnnual = np.zeros(( numGamesInYear, numFeatures))
        yTrainAnnual = np.zeros(( numGamesInYear ))
        counter = 0
        for index, row in annual.iterrows():
            h_team = row['HomeTeam']
            h_vector = team_vectors[h_team]
            a_team = row['AwayTeam']
            a_vector = team_vectors[a_team]
            diff = [a - b for a, b in zip(h_vector, a_vector)]
            if (counter % 2 == 0):
                if len(diff) != 0:
                    xTrainAnnual[counter] = diff
                if h_team == row['Winner']:
                    yTrainAnnual[counter] = 1
                else: 
                    yTrainAnnual[counter] = 0
            # the opposite of the difference of the vectors should be a true negative, where team 1 does not win
            else:
                if len(diff) != 0:
                    xTrainAnnual[counter] = [ -p for p in diff]
                yTrainAnnual[counter] = 0
            counter += 1
        xTrain2[indexCounter:numGamesInYear+indexCounter] = xTrainAnnual
        yTrain2[indexCounter:numGamesInYear+indexCounter] = yTrainAnnual
        indexCounter += numGamesInYear
    return xTrain2, yTrain2

In [233]:
#get the dictionary
years = range(2005,2017)
xTrain2, yTrain2 = getTrainingData2(years)
np.save('xTrain2', xTrain2)
np.save('yTrain2', yTrain2)

In [234]:
xTrain2.shape

(1656, 10)

In [235]:
yTrain2.shape

(1656,)

In [236]:
print xTrain2

[[ -8.  15.   3. ...,   4. -25.   5.]
 [  5.   5. -15. ...,   5. -14.  -6.]
 [ -2.  -7.   0. ...,  -4.   1.   0.]
 ..., 
 [ -2.  -2.   9. ...,  -2. -20.   0.]
 [  5. -11. -18. ...,  -3.  50.   1.]
 [ -4.  -6.  -8. ...,  -2. -19.  -1.]]


## Model Validation
Training all 14 features using linear Regression produced the most reliable results. Thus although the function gave the top 10 features to be most influential, all 14 were used to calculate the final results for optimal accuracy

### Testing Models ALL 14 Features
Linear Regression: 62%<br />
SVM Regression: 50.7% (SVC)<br />
SVM Classification: 46.9% (SVR)<br />
Decision Tree (Classifier, Regressor): 49.3%, 52.9%<br />
Logistic Regression: 45.2%<br />
Random Forest Classifier (n = 100): 50.4%<br />
Bayesian Ridge Regression: 49.3%<br />
Lasso Regression: 47.6%<br />
Ridge Regression or Tikhonov regularization (alpha = 0.5): 46.4%<br />
Ada-boost Classifier (n = 100): 49.0%<br />
Gradient Boosting Classifier (n = 100): 50.7%<br />
Gradient Boosting Regressor (n = 100): 47.8%<br />
KNN (n = 60): 56.5%

### Testing Models Top 10 Features
Linear Regression: 57% <br />
SVM Regression: 58.1% (SVC)<br />
SVM Classification: 56.7% (SVR)<br />
Decision Tree (Classifier, Regressor): 47.6%, 50.4%<br />
Logistic Regression: 49.8%<br />
Random Forest Classifier (n = 100): 44.4%<br />
Bayesian Ridge Regression: 48.0%<br />
Lasso Regression: 41.2%<br />
Ridge Regression or Tikhonov regularization (alpha = 0.5): 41.8%<br />
Ada-boost Classifier (n = 100): 42.8%<br />
Gradient Boosting Classifier (n = 100): 51.6%<br />
Gradient Boosting Regressor (n = 100): 52.2%<br />

In [248]:
# Tried all the following models. Uncomment model to try.

lm = linear_model.LinearRegression()
#lm = tree.DecisionTreeClassifier()
#lm = tree.DecisionTreeRegressor()
#lm = linear_model.LogisticRegression()
#lm = linear_model.BayesianRidge()
#lm = linear_model.Lasso()
#lm = svm.SVC()
#lm = svm.SVR()
#lm = linear_model.Ridge(alpha = 0.5)
#lm = AdaBoostClassifier(n_estimators=100)
#lm = GradientBoostingClassifier(n_estimators=100) 
#lm = GradientBoostingRegressor(n_estimators=100, max_depth=9) 
#lm = RandomForestClassifier(n_estimators=100) 
#lm = KNeighborsClassifier(n_neighbors=60) #not possible with only 10 features

In [246]:
#use this
xTrain, X_test, yTrain, y_test = train_test_split(xTrain, yTrain)
print xTrain.shape, yTrain.shape
print X_test.shape, y_test.shape
#lm = linear_model.LinearRegression()
model2 = lm.fit(xTrain, yTrain)
predictions = lm.predict(X_test)
#avg pred
print sum(predictions)/len(predictions)
#print predictions

(931, 14) (931,)
(311, 14) (311,)
0.639871382637


## Results
I tested out the above models and selected the one with the highest prediction accuracy (linear regression). I then used this model to calculate the predictions for the 2018 games. 

In [239]:
def createGamePrediction(team1_vector, team2_vector, xTrain, yTrain):
    xTrain, X_test, yTrain, Y_test = train_test_split(xTrain, yTrain)
    xTrain.shape, yTrain.shape
    X_test.shape, y_test.shape
    lm = linear_model.LinearRegression()
    model2 = lm.fit(xTrain, yTrain)
    diff = [a - b for a, b in zip(team1_vector, team2_vector)]
    predictions = lm.predict(diff)
    return predictions

In [240]:
team1_vector = getAnnualTeamData("Arsenal", 2017)
team2_vector = getAnnualTeamData("West Brom", 2017)
team3_vector = getAnnualTeamData("Chelsea", 2017)

print 'Probability that ' + team1_name + ' wins:',createGamePrediction(team1_vector, team2_vector,xTrain, yTrain)
print 'Probability that ' + team2_name + ' wins:',createGamePrediction(team2_vector, team1_vector,xTrain, yTrain)

Probability that Everton wins: 0.472145799553
Probability that Arsenal wins: 0.516087854472


# Final Predictions on Test Set
Final predictions are based on the probability that first team (Home Team) wins. The probabilities are as follows:<br />
1) Arsenal	Crystal Palace 53.6%<br />
2) Burnley	Man United 49.6%<br />
3) Everton	West Brom 51.4%<br />
4) Leicester	Watford 54.4%<br />
5) Man City	Newcastle 77.4%<br />
6) Southampton	Tottenham 55.9%<br />
7) Swansea	Liverpool 47.3%<br />
8) West Ham	Bournemouth 58.96%<br />

In [241]:
#test_data = pd.read_csv('Data/test.csv')
from azureml import Workspace
ws = Workspace(
    workspace_id='5cd0bcab12fa4e13bc1a21023ba57673',
    authorization_token='COEXKjYaViYevjIKq+7hyXm+G9odq3v/8877WcEw7hy1VizII0IMoZGuX+crYHlY/vu8mRMEeTbD6eNxThA2bg==',
    endpoint='https://studioapi.azureml.net'
)
ds = ws.datasets['Test.csv']
test_data = ds.to_dataframe()
(test_data.head())

Unnamed: 0,Game_ID,Year,Date,HomeTeam,AwayTeam
0,1,2018,20-Jan-18,Arsenal,Crystal Palace
1,2,2018,20-Jan-18,Burnley,Man United
2,3,2018,20-Jan-18,Everton,West Brom
3,4,2018,20-Jan-18,Leicester,Watford
4,5,2018,20-Jan-18,Man City,Newcastle


In [242]:
team1_vector = getAnnualTeamData("Arsenal", 2017)
team2_vector = getAnnualTeamData("Crystal Palace", 2017)
print createGamePrediction(team1_vector, team2_vector,xTrain, yTrain)

0.428688538701


In [243]:
team1_vector = getAnnualTeamData("Burnley", 2017)
team2_vector = getAnnualTeamData("Man United", 2017)
print createGamePrediction(team1_vector, team2_vector,xTrain, yTrain)

0.545627382987


In [255]:
#game_ID given to each game to simplify identification of game
def formulatePredictions():
    probs = [[0 for x in range(2)] for x in range(len(test_data.index))]
    for index, row in test_data.iterrows():
        game_ID = row['Game_ID']
        year = row['Year'] - 1
        team1_Name = row['HomeTeam']
        team2_Name = row['AwayTeam']
        team1_vector = getAnnualTeamData(team1_Name, year)
        team2_vector = getAnnualTeamData(team2_Name, year)
        prediction = createGamePrediction(team1_vector, team2_vector,xTrain, yTrain)
        probs[index][0] = game_ID
        probs[index][1] = prediction
    probs = pd.np.array(probs)
    return probs

In [249]:
formulatePredictions()

array([[ 1.        ,  0.53643911],
       [ 2.        ,  0.49611248],
       [ 3.        ,  0.51446088],
       [ 4.        ,  0.54425525],
       [ 5.        ,  0.77397612],
       [ 6.        ,  0.55895189],
       [ 7.        ,  0.47308558],
       [ 8.        ,  0.58962172]])

## References
1. **Background reading:**<br />
    * Brucher, Matthieu, et al. âScikit-Learn: Machine Learning in Python.â Edited by Mikio Braun, Journal of Machine Learning Research, 2011, [www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf](http://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf).
    * Demsar, Janez, and Blaz Tupan. From Experimental Machine Learning to Interactive Data Mining. Orange, [www.celta.paris-sorbonne.fr/anasem/papers/miscelanea/InteractiveDataMining.pdf](http://www.celta.paris-sorbonne.fr/anasem/papers/miscelanea/InteractiveDataMining.pdf).
    * Dewey, Conor. âThe Hitchhiker's Guide to Machine Learning in Python.â FreeCodeCamp, FreeCodeCamp, 1 Aug. 2017, [medium.freecodecamp.org/the-hitchhikers-guide-to-machine-learning-algorithms-in-python-bfad66adb378](https://medium.freecodecamp.org/the-hitchhikers-guide-to-machine-learning-algorithms-in-python-bfad66adb378)/ 
    * Kaufmann, Morgan. âData Mining: Practical Machine Learning Tools and Techniques.â Edited by Diane Cerra, Research Gate, Nov. 2010, [www.researchgate.net/publication/220017784_Data_Mining_Practical_Machine_Learning_Tools_and_Techniques.](https://www.researchgate.net/profile/Ian_Witten/publication/220017784_Data_Mining_Practical_Machine_Learning_Tools_and_Techniques/links/00b495175e36c6f402000000.pdf).
    * Paruchuri, Vik. âMachine Learning with Python.â Dataquest, Dataquest, 14 Dec. 2017, [www.dataquest.io/blog/machine-learning-python/](https://www.dataquest.io/blog/machine-learning-python/).
    * âAn Introduction to Machine Learning with Scikit-Learn.â Scikit-Learn, Scikit-Learn Developers, [scikit-learn.org/stable/tutorial/basic/tutorial.html](http://scikit-learn.org/stable/tutorial/basic/tutorial.html).
    * Raghavan, Shreyas. âCreate a Model to Predict House Prices Using Python.â Towards Data Science, 17 June 2017, [towardsdatascience.com/create-a-model-to-predict-house-prices-using-python-d34fe8fad88f](https://towardsdatascience.com/create-a-model-to-predict-house-prices-using-python-d34fe8fad88f).
2. **Source used to find the common statistics calculated for football teams:** <br /> âWellington Phoenix.â Soccer Betting Statistics and Results for Wellington Phoenix, Soccer Betting Statistics 2018, 2018, [www.soccerbettingstatistics.com/team/wellington-phoenix/2017-2018/a-league/6282/53/1086](www.soccerbettingstatistics.com/team/wellington-phoenix/2017-2018/a-league/6282/53/1086).
<br />
3. **Feature vectors inspiration:** <br /> Agarwal, Sumeet. Machine Learning: A Very Quick Introduction. 6 Jan. 2013, [web.iitd.ac.in/~sumeet/mlintro_doc.pdf](web.iitd.ac.in/~sumeet/mlintro_doc.pdf).
<br />
4. **Research/background reading on explanations of Word2Vec - representing words with attributes in a vector, then subtracting those vectors to find the difference between them.**
    * âWhy word2vec works.â Galvanize, [blog.galvanize.com/add-and-subtract-words-like-vectors-with-word2vec-2/](http://andyljones.tumblr.com/post/111299309808/why-word2vec-works).
    * âVector Representations of Words.â TensorFlow, 2 Nov. 2017, [www.tensorflow.org/tutorials/word2vec](https://www.tensorflow.org/tutorials/word2vec).
    * âAdd and Subtract Words like Vectors with word2vec.â Galvanize, [blog.galvanize.com/add-and-subtract-words-like-vectors-with-word2vec-2/](http://blog.galvanize.com/add-and-subtract-words-like-vectors-with-word2vec-2/).
    * Critchlow, Will. âA Beginner's Guide to word2vec AKA What's the Opposite of Canada?â Distilled, 28 Jan. 2016, [www.distilled.net/resources/a-beginners-guide-to-word2vec-aka-whats-the-opposite-of-canada/](https://www.distilled.net/resources/a-beginners-guide-to-word2vec-aka-whats-the-opposite-of-canada/).
<br />
5. **Feature Selection inspiration:** <br /> âFeature Selection in Python with Scikit-Learn.â Machine Learning Mastery, 21 Sept. 2016, [machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/](machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/).
<br />
6. **Approach inspiration taken from:** <br /> Forsyth, Jared, and Andrew Wilde. âA Machine Learning Approach to March Madness.â Brigham Young University, 2014, [axon.cs.byu.edu/~martinez/classes/478/stuff/Sample_Group_Project3.pdf](http://axon.cs.byu.edu/~martinez/classes/478/stuff/Sample_Group_Project3.pdf).
