# Introduction
## Readings
[1]  Guyon and A Elisseeff, "An Introduction to Variable and Feature Selection", Journal of Machine Learning Research 3 (2003) 1157-1182 
[2]  Ulmer and Fernadez, "Predicting Soccer Match Results in the English Premier League", Online: http://cs229.stanford.edu/proj2014/Ben%20Ulmer,%20Matt%20Fernandez,%20Predicting%20Soccer%20Results%20in%20the%20English%20Premier%20League.pdf
[3]  D Greer, "Spectator Booing and the Home Advantage: A Study of Social Influence in the Basketball Arena", Social Psychology Quarterly Vol. 46, No. 3 (Sep., 1983), pp. 252-261

##Approach
First off, I had a look through the data provided and brainstormed any factors that could affect the output:
Weather conditions.
Temperature.
Stadium Turnout.
Formations.
Transfer Budgets.
Previous League Position.
Players.
etc...

Unfortunately, most of this information is rather difficult to access publically and reliably apart from previous league positions.
I gathered the previous league positions from FootballData.com and collated it in a table ordered by season of the relevant clubs.

# Data Import
The standard epl-training.csv is the frame variable.
The League Standings.csv is the leagueMap variable.
The epl-test.csv is the testingFrame variable.

In [25]:
from azureml import Workspace
import pandas as pd
import sys
import numpy as numpy
import math as math

#Constants for Win, Lose, Draw
LOSE = 0
DRAW = 1
WIN = 2 
#Constant for Home and Away
HOME = 0
AWAY = 1

#Importing standard EPL file
ws = Workspace(
    workspace_id='c47c481c87134eb285ac760aee7d84e2',
    authorization_token='OvEd12KUaZZQsheJy2EEnXM/mKV3yxelsLGb5JvlktLYRZBfB5VbEXJrKr8Pje6wRk7KJhG7A+0bdLPJA2MZRg==',
    endpoint='https://studioapi.azureml.net'
)
ds = ws.datasets['EPITraining']
frame = ds.to_dataframe()

#Importing Map of Season and team to league standing
ws = Workspace(
    workspace_id='c47c481c87134eb285ac760aee7d84e2',
    authorization_token='OvEd12KUaZZQsheJy2EEnXM/mKV3yxelsLGb5JvlktLYRZBfB5VbEXJrKr8Pje6wRk7KJhG7A+0bdLPJA2MZRg==',
    endpoint='https://studioapi.azureml.net'
)
ds = ws.datasets['League Standings.csv']
leagueMap = ds.to_dataframe()

ws = Workspace(
    workspace_id='c47c481c87134eb285ac760aee7d84e2',
    authorization_token='OvEd12KUaZZQsheJy2EEnXM/mKV3yxelsLGb5JvlktLYRZBfB5VbEXJrKr8Pje6wRk7KJhG7A+0bdLPJA2MZRg==',
    endpoint='https://studioapi.azureml.net'
)
ds = ws.datasets['epl-test.csv']
testingFrame = ds.to_dataframe()



# Data Transformation and Exploration
The data in the file included 1987 games from the teams in the current English Premier League. 
This data contained the date, teams, goals, referee, and various metrics regarding the games.
Unfortunately, we can't directly train the data to predict based on the performance in the game, as the game hasn't been played yet and thus information isn't available. Naturally, to predict current/future games outcomes, we'll need to be using historic data.

I found in a paper [3] that Home Advantage (basketball in this case) is a well documented phenomenon and could influence match results.
I decided that my new columns would gather historic data to provide a picture of the performance of the Home team both home and way, and the performance of the away team, both home and away.
I found through trial and error that historic data (Data from 03-04 league while predicting for 09-10) was a worse predictor than modern data. To remedy this, I decided to split the data in Season Averages, and Historic Averages.
The next step was to create the averages, though not all games have equal weighting. Games played at the start of the season don't necesarily reflect performance of the team later in the season. 
This was remedied by using exponentially weighted averages to ensure that more recent games were represented more than historic matches. 
I compiled an Excel document of all the previous league standing of the teams in the tournament and created a column "PreviousStanding" as those teams that finished higher than 17th are safe, while the lower ones are relegated to the division below.

In [26]:
#Data Transformation

#Adds previous league standings from the league before
def addPreviousLeagueStandings(data):
    leagueStandings = []
    homestands = []
    awaystands = []
    #Get last seasons values
    
    for index, row in data.iterrows():
        year = row['Date'].year
        month = row['Date'].month
        home = row['HomeTeam']
        away = row['AwayTeam']
        prevSeason = 0
        if(month >= 8):
            prevSeason = str(year-1) + "-" + str(year)
        elif(month <= 5):
            prevSeason = str(year-2) + "-" + str(year-1)
        mappingHome = leagueMap.loc[(leagueMap['Team'] == home) & (leagueMap['League'] == prevSeason)]
        mappingAway = leagueMap.loc[(leagueMap['Team'] == away) & (leagueMap['League'] == prevSeason)]
        homestands.append((mappingHome.iloc[0])['Finishing Position'])
        awaystands.append(mappingAway.iloc[0]['Finishing Position'])
    leagueStandings.append(homestands)
    leagueStandings.append(awaystands)
    return leagueStandings
    
    
#Converts single column htr and ftr to 3 indicator columns    
def resultsToIndicatorValues(data):
    for index, row in data.iterrows():
        row['FTR'] = ({'H':WIN,'D':DRAW,'A':LOSE}[row['FTR']])
        row['HTR'] = ({'H':WIN,'D':DRAW,'A':LOSE}[row['HTR']])
    print(data)

#Adds a season column
def addSeasons(data):
    seasons = []
    for index, row in data.iterrows():
        year = row['Date'].year
        month = row['Date'].month
        season = "No Season"
        if(month >= 8):
            season = str(year) + "-" + str(year+1)
        elif(month <= 5):
            season = str(year-1) + "-" + str(year)
        seasons.append(season)
    return seasons


#Weighted Average for this seasons games on an exponential scale. The most recent games are weighted higher.
def calculateSeasonWeightedAverages(games, homeAway):
    numberOfGames = len(games)
    weights = getExponentialWeights(numberOfGames)
    weightedAverages = []
    #print("Number of Games: " + str(len(games)))
    if(weights == 0):
        return [0,0,0,0,0,0,0,0] 
    weightedShots = 0
    weightedShotsTarget = 0
    weightedFouls = 0
    weightedCorners = 0
    weightedYellows = 0
    weightedReds = 0
    weightedHalfGoals = 0
    weightedFullGoals = 0
    x = 0
    if(homeAway == 0):
        for index, row in games.iterrows():
            #print("\t" + row['HomeTeam'] + " vs " + row['AwayTeam'])
            weightedShots += row['HS'] * weights[x]
            weightedShotsTarget += row['HST'] * weights[x]
            weightedFouls += row['HF'] * weights[x]
            weightedCorners += row['HC'] * weights[x]
            weightedYellows += row['HY'] * weights[x]
            weightedReds += row['HR'] * weights[x]
            weightedHalfGoals += row['HTHG'] * weights[x]
            weightedFullGoals += row['FTHG'] * weights[x]
            x = x+1
    else:
        for index, row in games.iterrows():
            #print("\t" + row['HomeTeam'] + " vs " + row['AwayTeam'])
            weightedShots += row['AS'] * weights[x]
            weightedShotsTarget += row['AST'] * weights[x]
            weightedFouls += row['AF'] * weights[x]
            weightedCorners += row['AC'] * weights[x]
            weightedYellows += row['AY'] * weights[x]
            weightedReds += row['AR'] * weights[x]
            weightedHalfGoals += row['HTAG'] * weights[x]
            weightedFullGoals += row['FTAG'] * weights[x]
            x = x+1
    weightedAverages.append(weightedShots)
    weightedAverages.append(weightedShotsTarget)
    weightedAverages.append(weightedFouls)
    weightedAverages.append(weightedCorners)
    weightedAverages.append(weightedYellows)
    weightedAverages.append(weightedReds)
    weightedAverages.append(weightedHalfGoals)
    weightedAverages.append(weightedFullGoals)
    return weightedAverages
        

#Gets the exponential weightings of each average
def getExponentialWeights(number):
    weights = []
    if(number == 0):
        return 0
    lastExp = math.exp(0)/math.exp(1)
    for x in range(1, number):
        nextExp = math.exp(x*(1/number))/math.exp(1)
        weights.append(nextExp - lastExp)
        lastExp = nextExp
    weights.append(1 - sum(weights))
    #print(weights)
    return weights

#Counts streak of wins/losses/draws throughout all Home and Away games played by the team.
def getStreak(mGames, team):
    games = mGames.copy()
    games.iloc[::-1]
    typeOfStreak = 1
    lastGame = 1
    streak = 0
    flag = 1
    #print(team)
    #print("Number of Games: " + str(len(games)))
    for index, row in games.iterrows():
        #print("\t" + row['HomeTeam'] + " vs " + row['AwayTeam'])
        if(row['HomeTeam'] == team):
            if(row['FTR'] == WIN):
                #print("\tWIN STREAK")
                lastGame = WIN
            elif(row['FTR'] == DRAW):
                #print("\tDRAW STREAK")
                lastGame = DRAW
            elif(row['FTR'] == LOSE):
                #print("\tLOSE STREAK")
                lastGame = LOSE
            if(flag):
                #print("\tlower flag")
                typeOfStreak = lastGame
                flag = 0
            if(lastGame == typeOfStreak):
                streak += 1
                #print("\tStreak Up")
            else:
                #print("\tEnd of Streak")
                return [streak, typeOfStreak]
        elif(row['AwayTeam'] == team):
            if(row['FTR'] == LOSE):
                #print("\tLOSE STREAK")
                lastGame = LOSE
            elif(row['FTR'] == DRAW):
                #print("\tDRAW STREAK")
                lastGame = DRAW
            elif(row['FTR'] == WIN):
                #print("\tWIN STREAK")
                lastGame = WIN
            if(flag):
                #print("\tlower flag")
                typeOfStreak = lastGame
                flag = 0
            if(lastGame == typeOfStreak):
                #print("\tStreak Up")
                streak += 1
            else:
                #print("\tEnd of Streak")
                return [streak, typeOfStreak]
    #print("\tStreak: " + str(streak))
    return [streak, typeOfStreak]

In [27]:
#Key: S (Time Period) H (Team Playing) A (Values for where they played)
#S =  Season, H =  Historic (All Games)
#H = Home, A = Away

#Example SHATotalGames = Total Games this Season for the Home team playing Away
#Sorry for the large stuff
#Average played for this season so far.
SHATotalGames = []
SHHTotalGames = []
SAATotalGames = []
SAHTotalGames = []
SHHWins = []
SHHDraws = []
SHHLosses = []
SHStreak = []
SHStreakType = []
SAHWins = []
SAHDraws = []
SAHLosses = []
SHAWins = []
SHADraws = []
SHALosses = []
SAAWins = []
SAADraws = []
SAALosses = []
SAStreak = []
SAStreakType = []

#All weighted averages
SHHShots = []
SHHShotsTarget = []
SHHFouls = []
SHHCorners = []
SHHYellows = []
SHHReds = []
SHHHalfGoals = []
SHHFullGoals = []
SHAShots = []
SHAShotsTarget = []
SHAFouls = []
SHACorners = []
SHAYellows = []
SHAReds = []
SHAHalfGoals = []
SHAFullGoals = []

SAAShots = []
SAAShotsTarget = []
SAAFouls = []
SAACorners = []
SAAYellows = []
SAAReds = []
SAAHalfGoals = []
SAAFullGoals = []
SAHShots = []
SAHShotsTarget = []
SAHFouls = []
SAHCorners = []
SAHYellows = []
SAHReds = []
SAHHalfGoals = []
SAHFullGoals = []

#H for Historic, H for Home, A for Away
HHHShots = []
HHHShotsTarget = []
HHHFouls = []
HHHCorners = []
HHHYellows = []
HHHReds = []
HHAShots = []
HHAShotsTarget = []
HHAFouls = []
HHACorners = []
HHAYellows = []
HHAReds = []

HAAShots = []
HAAShotsTarget = []
HAAFouls = []
HAACorners = []
HAAYellows = []
HAAReds = []
HAHShots = []
HAHShotsTarget = []
HAHFouls = []
HAHCorners = []
HAHYellows = []
HAHReds = []

#Adding Seasons to each column and converting date field into a datetime type
frame['Date'] = pd.to_datetime(frame['Date'], format = '%d/%m/%Y')
frame['Season'] = addSeasons(frame)
leagueStandings = addPreviousLeagueStandings(frame)
frame['HPrevStanding'] = leagueStandings[0]
frame['APrevStanding'] = leagueStandings[1]

pd.set_option('display.max_colwidth', -1)
for index, row in frame.iterrows():

    #print("Row Index: " + str(index))

    SHHGames = frame.loc[(frame['Date'] < row['Date']) & (frame['Season'] == row['Season']) & (frame['HomeTeam'] == row['HomeTeam'])]
    SHAGames = frame.loc[(frame['Date'] < row['Date']) & (frame['Season'] == row['Season']) & (frame['AwayTeam'] == row['HomeTeam'])]
    SAAGames = frame.loc[(frame['Date'] < row['Date']) & (frame['Season'] == row['Season']) & (frame['AwayTeam'] == row['AwayTeam'])]
    SAHGames = frame.loc[(frame['Date'] < row['Date']) & (frame['Season'] == row['Season']) & (frame['HomeTeam'] == row['AwayTeam'])]
    HHHGames = frame.loc[(frame['Date'] < row['Date']) & (frame['HomeTeam'] == row['HomeTeam'])]
    HHAGames = frame.loc[(frame['Date'] < row['Date']) & (frame['AwayTeam'] == row['HomeTeam'])]
    HAAGames = frame.loc[(frame['Date'] < row['Date']) & (frame['AwayTeam'] == row['AwayTeam'])]
    HAHGames = frame.loc[(frame['Date'] < row['Date']) & (frame['HomeTeam'] == row['AwayTeam'])]
    

    #Calculates the total games
    SHHTotal = len(SHHGames.ID)
    SHATotal = len(SHAGames.ID)
    SAATotal = len(SAAGames.ID)
    SAHTotal = len(SAHGames.ID)
    HHHTotal = len(HHHGames.ID)
    HHATotal = len(HHAGames.ID)
    HAATotal = len(HAAGames.ID)
    HAHTotal = len(HAHGames.ID)

    SHHTotalGames.append(SHHTotal)
    SHATotalGames.append(SHATotal)
    SAATotalGames.append(SAATotal)
    SAHTotalGames.append(SAHTotal)

    if(SHHTotal == 0): SHHTotal = 1
    if(SHATotal == 0): SHATotal = 1
    if(SAATotal == 0): SAATotal = 1
    if(SAHTotal == 0): SAHTotal = 1
    if(HHHTotal == 0): HHHTotal = 1
    if(HHATotal == 0): HHATotal = 1
    if(HAATotal == 0): HAATotal = 1
    if(HAHTotal == 0): HAHTotal = 1

    #Total wins, draws, and losses 
    SHHWins.append(len((SHHGames.loc[(SHHGames['FTR'] == WIN)]).ID))
    SHHDraws.append(len((SHHGames.loc[(SHHGames['FTR'] == DRAW)]).ID))
    SHHLosses.append(len((SHHGames.loc[(SHHGames['FTR'] == LOSE)]).ID))
    SAHWins.append(len((SAHGames.loc[(SAHGames['FTR'] == WIN)]).ID))
    SAHDraws.append(len((SAHGames.loc[(SAHGames['FTR'] == DRAW)]).ID))
    SAHLosses.append(len((SAHGames.loc[(SAHGames['FTR'] == LOSE)]).ID))
    SHAWins.append(len((SHAGames.loc[(SHAGames['FTR'] == WIN)]).ID))
    SHADraws.append(len((SHAGames.loc[(SHAGames['FTR'] == DRAW)]).ID))
    SHALosses.append(len((SHAGames.loc[(SHAGames['FTR'] == LOSE)]).ID))
    SAAWins.append(len((SAAGames.loc[(SAAGames['FTR'] == WIN)]).ID))
    SAADraws.append(len((SAAGames.loc[(SAAGames['FTR'] == DRAW)]).ID))
    SAALosses.append(len((SAAGames.loc[(SAAGames['FTR'] == LOSE)]).ID))
    
    #Calculate streaks. Streakiness is a measure of performance according to some studies
    HAllGames = frame.loc[(frame['Date'] < row['Date']) & (frame['Season'] == row['Season']) & ((frame['AwayTeam'] == row['HomeTeam']) | (frame['HomeTeam'] == row['HomeTeam']))]
    streakSet = getStreak(HAllGames ,row['HomeTeam'])
    SHStreak.append(streakSet[0])
    SHStreakType.append(streakSet[1])
                                                                                                
    AAllGames = frame.loc[(frame['Date'] < row['Date']) & (frame['Season'] == row['Season']) & ((frame['HomeTeam'] == row['AwayTeam']) | (frame['AwayTeam'] == row['AwayTeam']))]   
    streakSet = getStreak(AAllGames ,row['AwayTeam'])
    SAStreak.append(streakSet[0])
    SAStreakType.append(streakSet[1])

    #Weighted Averages for Seasonal Performance
    #print(str(row['ID']) + ": " + row['HomeTeam'] + " VS " + row['AwayTeam'])
    weighted = calculateSeasonWeightedAverages(SHHGames, HOME)
    SHHShots.append(weighted[0])
    SHHShotsTarget.append(weighted[1])
    SHHFouls.append(weighted[2])
    SHHCorners.append(weighted[3])
    SHHYellows.append(weighted[4])
    SHHReds.append(weighted[5])
    SHHHalfGoals.append(weighted[6])
    SHHFullGoals.append(weighted[7])
    
    weighted = calculateSeasonWeightedAverages(SAHGames, HOME)
    SAHShots.append(weighted[0])
    SAHShotsTarget.append(weighted[1])
    SAHFouls.append(weighted[2])
    SAHCorners.append(weighted[3])
    SAHYellows.append(weighted[4])
    SAHReds.append(weighted[5])
    SAHHalfGoals.append(weighted[6])
    SAHFullGoals.append(weighted[7])
    
    weighted = calculateSeasonWeightedAverages(SAAGames, AWAY)
    SAAShots.append(weighted[0])
    SAAShotsTarget.append(weighted[1])
    SAAFouls.append(weighted[2])
    SAACorners.append(weighted[3])
    SAAYellows.append(weighted[4])
    SAAReds.append(weighted[5])
    SAAHalfGoals.append(weighted[6])
    SAAFullGoals.append(weighted[7])
    
    weighted = calculateSeasonWeightedAverages(SHAGames, AWAY)
    SHAShots.append(weighted[0])
    SHAShotsTarget.append(weighted[1])
    SHAFouls.append(weighted[2])
    SHACorners.append(weighted[3])
    SHAYellows.append(weighted[4])
    SHAReds.append(weighted[5])
    SHAHalfGoals.append(weighted[6])
    SHAFullGoals.append(weighted[7])

    #Simple Average for Historic Performance
    HHHShots.append(HHHGames.HS.sum()/HHHTotal)
    HHHShotsTarget.append(HHHGames.HST.sum()/HHHTotal)
    HHHFouls.append(HHHGames.HF.sum()/HHHTotal)
    HHHCorners.append(HHHGames.HC.sum()/HHHTotal)
    HHHYellows.append(HHHGames.HY.sum()/HHHTotal)
    HHHReds.append(HHHGames.HR.sum()/HHHTotal)

    HAHShots.append(HAHGames.AS.sum()/HAHTotal)
    HAHShotsTarget.append(HAHGames.AST.sum()/HAHTotal)
    HAHFouls.append(round(HAHGames.AF.sum()/HAHTotal, 6))
    HAHCorners.append(HAHGames.AC.sum()/HAHTotal)
    HAHYellows.append(HAHGames.AY.sum()/HAHTotal)
    HAHReds.append(HAHGames.AR.sum()/HAHTotal)

    HAAShots.append(HAAGames.AS.sum()/HAATotal)
    HAAShotsTarget.append(HAAGames.AST.sum()/HAATotal)
    HAAFouls.append(HAAGames.AF.sum()/HAATotal)
    HAACorners.append(HAAGames.AC.sum()/HAATotal)
    HAAYellows.append(HAAGames.AY.sum()/HAATotal)
    HAAReds.append(HAAGames.AR.sum()/HAATotal)

    HHAShots.append(HHAGames.AS.sum()/HHATotal)
    HHAShotsTarget.append(HHAGames.AST.sum()/HHATotal)
    HHAFouls.append(HHAGames.AF.sum()/HHATotal)
    HHACorners.append(HHAGames.AC.sum()/HHATotal)
    HHAYellows.append(HHAGames.AY.sum()/HHATotal)
    HHAReds.append(HHAGames.AR.sum()/HHATotal)  

    
frameCopy = frame.copy()
#Adding columns to frame
frame["SHHTotalGames"] = SHHTotalGames
frame["SHATotalGames"] = SHATotalGames
frame["SAATotalGames"] = SAATotalGames
frame["SAHTotalGames"] = SAHTotalGames
frame["SHHWins"] = SHHWins
frame["SAHWins"] = SAHWins
frame["SAAWins"] = SAAWins
frame["SHAWins"] = SHAWins
frame["SHHLosses"] = SHHLosses
frame["SAHLosses"] = SAHLosses
frame["SAALosses"] = SAALosses
frame["SHALosses"] = SHALosses
frame["SHHDraws"] = SHHDraws
frame["SAHDraws"] = SAHDraws
frame["SAADraws"] = SAADraws
frame["SHADraws"] = SHADraws

frame["SHStreak"] = SHStreak
frame["SHStreakType"] = SHStreakType
frame["SAStreak"] = SAStreak
frame["SAStreakType"] = SAStreakType

frame["SHHShots"] = SHHShots
frame["SHHShotsTarget"] = SHHShotsTarget
frame["SHHFouls"] = SHHFouls
frame["SHHCorners"] = SHHCorners
frame["SHHYellows"] = SHHYellows
frame["SHHReds"] = SHHReds

frame["SHAShots"] = SHAShots
frame["SHAShotsTarget"] = SHAShotsTarget
frame["SHAFouls"] = SHAFouls
frame["SHACorners"] = SHACorners
frame["SHAYellows"] = SHAYellows
frame["SHAReds"] = SHAReds

frame["SAHShots"] = SAHShots
frame["SAHShotsTarget"] = SAHShotsTarget
frame["SAHFouls"] = SAHFouls
frame["SAHCorners"] = SAHCorners
frame["SAHYellows"] = SAHYellows
frame["SAHReds"] = SAHReds

frame["SAAShots"] = SAAShots
frame["SAAShotsTarget"] = SAAShotsTarget
frame["SAAFouls"] = SAAFouls
frame["SAACorners"] = SAACorners
frame["SAAYellows"] = SAAYellows
frame["SAAReds"] = SAAReds

frame["HHHShots"] = HHHShots
frame["HHHShotsTarget"] = HHHShotsTarget
frame["HHHFouls"] = HHHFouls
frame["HHHCorners"] = HHHCorners
frame["HHHYellows"] = HHHYellows
frame["HHHReds"] = HHHReds

frame["HHAShots"] = HHAShots
frame["HHAShotsTarget"] = HHAShotsTarget
frame["HHAFouls"] = HHAFouls
frame["HHACorners"] = HHACorners
frame["HHAYellows"] = HHAYellows
frame["HHAReds"] = HHAReds

frame["HAHShots"] = HAHShots
frame["HAHShotsTarget"] = HAHShotsTarget
frame["HAHFouls"] = HAHFouls
frame["HAHCorners"] = HAHCorners
frame["HAHYellows"] = HAHYellows
frame["HAHReds"] = HAHReds

frame["HAAShots"] = HAAShots
frame["HAAShotsTarget"] = HAAShotsTarget
frame["HAAFouls"] = HAAFouls
frame["HAACorners"] = HAACorners
frame["HAAYellows"] = HAAYellows
frame["HAAReds"] = HAAReds

#Removing current match statistics column as it's information we don't have when we recall.
frame = frame.drop('HS', axis=1)
frame = frame.drop('AS', axis=1)
frame = frame.drop('HST', axis=1)
frame = frame.drop('AST', axis=1)
frame = frame.drop('HF', axis=1)
frame = frame.drop('AF', axis=1)
frame = frame.drop('HY', axis=1)
frame = frame.drop('AY', axis=1)
frame = frame.drop('HR', axis=1)
frame = frame.drop('AR', axis=1)
frame = frame.drop('HC', axis=1)
frame = frame.drop('AC', axis=1)
frame = frame.drop('HTHG', axis=1)
frame = frame.drop('FTHG', axis=1)
frame = frame.drop('HTAG', axis=1)
frame = frame.drop('FTAG', axis=1)
frame = frame.drop('HTR', axis=1)

In [28]:
#Predictions with ETC and RFC
#Gathering input data
#Please excuse the messiness of this part, I know I could have made a function out of it. But deadlines are deadlines.
def createPredictions(games):
    SHATotalGames = []
    SHHTotalGames = []
    SAATotalGames = []
    SAHTotalGames = []
    SHHWins = []
    SHHDraws = []
    SHHLosses = []
    SHStreak = []
    SHStreakType = []
    SAHWins = []
    SAHDraws = []
    SAHLosses = []
    SHAWins = []
    SHADraws = []
    SHALosses = []
    SAAWins = []
    SAADraws = []
    SAALosses = []
    SAStreak = []
    SAStreakType = []

    #All weighted averages
    SHHShots = []
    SHHShotsTarget = []
    SHHFouls = []
    SHHCorners = []
    SHHYellows = []
    SHHReds = []
    SHHHalfGoals = []
    SHHFullGoals = []
    SHAShots = []
    SHAShotsTarget = []
    SHAFouls = []
    SHACorners = []
    SHAYellows = []
    SHAReds = []
    SHAHalfGoals = []
    SHAFullGoals = []

    SAAShots = []
    SAAShotsTarget = []
    SAAFouls = []
    SAACorners = []
    SAAYellows = []
    SAAReds = []
    SAAHalfGoals = []
    SAAFullGoals = []
    SAHShots = []
    SAHShotsTarget = []
    SAHFouls = []
    SAHCorners = []
    SAHYellows = []
    SAHReds = []
    SAHHalfGoals = []
    SAHFullGoals = []

    #H for Historic, H for Home, A for Away
    HHHShots = []
    HHHShotsTarget = []
    HHHFouls = []
    HHHCorners = []
    HHHYellows = []
    HHHReds = []
    HHAShots = []
    HHAShotsTarget = []
    HHAFouls = []
    HHACorners = []
    HHAYellows = []
    HHAReds = []

    HAAShots = []
    HAAShotsTarget = []
    HAAFouls = []
    HAACorners = []
    HAAYellows = []
    HAAReds = []
    HAHShots = []
    HAHShotsTarget = []
    HAHFouls = []
    HAHCorners = []
    HAHYellows = []
    HAHReds = []
    
    for index, row in games.iterrows():
    
        SHHGames = frameCopy.loc[(frameCopy['Date'] < row['Date']) & (frameCopy['Season'] == row['Season']) & (frameCopy['HomeTeam'] == row['HomeTeam'])]
        SHAGames = frameCopy.loc[(frameCopy['Date'] < row['Date']) & (frameCopy['Season'] == row['Season']) & (frameCopy['AwayTeam'] == row['HomeTeam'])]
        SAAGames = frameCopy.loc[(frameCopy['Date'] < row['Date']) & (frameCopy['Season'] == row['Season']) & (frameCopy['AwayTeam'] == row['AwayTeam'])]
        SAHGames = frameCopy.loc[(frameCopy['Date'] < row['Date']) & (frameCopy['Season'] == row['Season']) & (frameCopy['HomeTeam'] == row['AwayTeam'])]
        HHHGames = frameCopy.loc[(frameCopy['Date'] < row['Date']) & (frameCopy['HomeTeam'] == row['HomeTeam'])]
        HHAGames = frameCopy.loc[(frameCopy['Date'] < row['Date']) & (frameCopy['AwayTeam'] == row['HomeTeam'])]
        HAAGames = frameCopy.loc[(frameCopy['Date'] < row['Date']) & (frameCopy['AwayTeam'] == row['AwayTeam'])]
        HAHGames = frameCopy.loc[(frameCopy['Date'] < row['Date']) & (frameCopy['HomeTeam'] == row['AwayTeam'])]
        
        #print(row['HomeTeam'] + " have played home " + str(len(SHHGames)))
        #Calculates the total games
        SHHTotal = len(SHHGames.ID)
        #print(row['HomeTeam'] + " have played away " + str(len(SHAGames)))
        SHATotal = len(SHAGames.ID)
        #print(row['AwayTeam'] + " have played home " + str(len(SAAGames)))
        SAATotal = len(SAAGames.ID)
        #print(row['AwayTeam'] + " have played away " + str(len(SAHGames)))
        SAHTotal = len(SAHGames.ID)
        HHHTotal = len(HHHGames.ID)
        HHATotal = len(HHAGames.ID)
        HAATotal = len(HAAGames.ID)
        HAHTotal = len(HAHGames.ID)

        SHHTotalGames.append(SHHTotal)
        SHATotalGames.append(SHATotal)
        SAATotalGames.append(SAATotal)
        SAHTotalGames.append(SAHTotal)

        if(SHHTotal == 0): SHHTotal = 1
        if(SHATotal == 0): SHATotal = 1
        if(SAATotal == 0): SAATotal = 1
        if(SAHTotal == 0): SAHTotal = 1
        if(HHHTotal == 0): HHHTotal = 1
        if(HHATotal == 0): HHATotal = 1
        if(HAATotal == 0): HAATotal = 1
        if(HAHTotal == 0): HAHTotal = 1

        #Total wins, draws, and losses 
        SHHWins.append(len((SHHGames.loc[(SHHGames['FTR'] == WIN)]).ID))
        SHHDraws.append(len((SHHGames.loc[(SHHGames['FTR'] == DRAW)]).ID))
        SHHLosses.append(len((SHHGames.loc[(SHHGames['FTR'] == LOSE)]).ID))
        SAHWins.append(len((SAHGames.loc[(SAHGames['FTR'] == WIN)]).ID))
        SAHDraws.append(len((SAHGames.loc[(SAHGames['FTR'] == DRAW)]).ID))
        SAHLosses.append(len((SAHGames.loc[(SAHGames['FTR'] == LOSE)]).ID))
        SHAWins.append(len((SHAGames.loc[(SHAGames['FTR'] == WIN)]).ID))
        SHADraws.append(len((SHAGames.loc[(SHAGames['FTR'] == DRAW)]).ID))
        SHALosses.append(len((SHAGames.loc[(SHAGames['FTR'] == LOSE)]).ID))
        SAAWins.append(len((SAAGames.loc[(SAAGames['FTR'] == WIN)]).ID))
        SAADraws.append(len((SAAGames.loc[(SAAGames['FTR'] == DRAW)]).ID))
        SAALosses.append(len((SAAGames.loc[(SAAGames['FTR'] == LOSE)]).ID))

        #Calculate streaks. Streakiness is a measure of performance according to some studies
        HAllGames = frameCopy.loc[(frameCopy['Date'] < row['Date']) & (frameCopy['Season'] == row['Season']) & ((frameCopy['AwayTeam'] == row['HomeTeam']) | (frameCopy['HomeTeam'] == row['HomeTeam']))]
        streakSet = getStreak(HAllGames ,row['HomeTeam'])
        SHStreak.append(streakSet[0])
        SHStreakType.append(streakSet[1])

        AAllGames = frameCopy.loc[(frameCopy['Date'] < row['Date']) & (frameCopy['Season'] == row['Season']) & ((frameCopy['HomeTeam'] == row['AwayTeam']) | (frameCopy['AwayTeam'] == row['AwayTeam']))]   
        streakSet = getStreak(AAllGames ,row['AwayTeam'])
        SAStreak.append(streakSet[0])
        SAStreakType.append(streakSet[1])

        #Weighted Averages for Seasonal Performance
        #print(str(row['ID']) + ": " + row['HomeTeam'] + " VS " + row['AwayTeam'])
        weighted = calculateSeasonWeightedAverages(SHHGames, HOME)
        SHHShots.append(weighted[0])
        SHHShotsTarget.append(weighted[1])
        SHHFouls.append(weighted[2])
        SHHCorners.append(weighted[3])
        SHHYellows.append(weighted[4])
        SHHReds.append(weighted[5])
        SHHHalfGoals.append(weighted[6])
        SHHFullGoals.append(weighted[7])

        weighted = calculateSeasonWeightedAverages(SAHGames, HOME)
        SAHShots.append(weighted[0])
        SAHShotsTarget.append(weighted[1])
        SAHFouls.append(weighted[2])
        SAHCorners.append(weighted[3])
        SAHYellows.append(weighted[4])
        SAHReds.append(weighted[5])
        SAHHalfGoals.append(weighted[6])
        SAHFullGoals.append(weighted[7])

        weighted = calculateSeasonWeightedAverages(SAAGames, AWAY)
        SAAShots.append(weighted[0])
        SAAShotsTarget.append(weighted[1])
        SAAFouls.append(weighted[2])
        SAACorners.append(weighted[3])
        SAAYellows.append(weighted[4])
        SAAReds.append(weighted[5])
        SAAHalfGoals.append(weighted[6])
        SAAFullGoals.append(weighted[7])

        weighted = calculateSeasonWeightedAverages(SHAGames, AWAY)
        SHAShots.append(weighted[0])
        SHAShotsTarget.append(weighted[1])
        SHAFouls.append(weighted[2])
        SHACorners.append(weighted[3])
        SHAYellows.append(weighted[4])
        SHAReds.append(weighted[5])
        SHAHalfGoals.append(weighted[6])
        SHAFullGoals.append(weighted[7])

        #Simple Average for Historic Performance
        HHHShots.append(HHHGames.HS.sum()/HHHTotal)
        HHHShotsTarget.append(HHHGames.HST.sum()/HHHTotal)
        HHHFouls.append(HHHGames.HF.sum()/HHHTotal)
        HHHCorners.append(HHHGames.HC.sum()/HHHTotal)
        HHHYellows.append(HHHGames.HY.sum()/HHHTotal)
        HHHReds.append(HHHGames.HR.sum()/HHHTotal)

        HAHShots.append(HAHGames.AS.sum()/HAHTotal)
        HAHShotsTarget.append(HAHGames.AST.sum()/HAHTotal)
        HAHFouls.append(round(HAHGames.AF.sum()/HAHTotal, 6))
        HAHCorners.append(HAHGames.AC.sum()/HAHTotal)
        HAHYellows.append(HAHGames.AY.sum()/HAHTotal)
        HAHReds.append(HAHGames.AR.sum()/HAHTotal)

        HAAShots.append(HAAGames.AS.sum()/HAATotal)
        HAAShotsTarget.append(HAAGames.AST.sum()/HAATotal)
        HAAFouls.append(HAAGames.AF.sum()/HAATotal)
        HAACorners.append(HAAGames.AC.sum()/HAATotal)
        HAAYellows.append(HAAGames.AY.sum()/HAATotal)
        HAAReds.append(HAAGames.AR.sum()/HAATotal)

        HHAShots.append(HHAGames.AS.sum()/HHATotal)
        HHAShotsTarget.append(HHAGames.AST.sum()/HHATotal)
        HHAFouls.append(HHAGames.AF.sum()/HHATotal)
        HHACorners.append(HHAGames.AC.sum()/HHATotal)
        HHAYellows.append(HHAGames.AY.sum()/HHATotal)
        HHAReds.append(HHAGames.AR.sum()/HHATotal)
    
    #Adding columns to frame
    games["SHHTotalGames"] = SHHTotalGames
    games["SHATotalGames"] = SHATotalGames
    games["SAATotalGames"] = SAATotalGames
    games["SAHTotalGames"] = SAHTotalGames
    games["SHHWins"] = SHHWins
    games["SAHWins"] = SAHWins
    games["SAAWins"] = SAAWins
    games["SHAWins"] = SHAWins
    games["SHHLosses"] = SHHLosses
    games["SAHLosses"] = SAHLosses
    games["SAALosses"] = SAALosses
    games["SHALosses"] = SHALosses
    games["SHHDraws"] = SHHDraws
    games["SAHDraws"] = SAHDraws
    games["SAADraws"] = SAADraws
    games["SHADraws"] = SHADraws

    games["SHStreak"] = SHStreak
    games["SHStreakType"] = SHStreakType
    games["SAStreak"] = SAStreak
    games["SAStreakType"] = SAStreakType

    games["SHHShots"] = SHHShots
    games["SHHShotsTarget"] = SHHShotsTarget
    games["SHHFouls"] = SHHFouls
    games["SHHCorners"] = SHHCorners
    games["SHHYellows"] = SHHYellows
    games["SHHReds"] = SHHReds

    games["SHAShots"] = SHAShots
    games["SHAShotsTarget"] = SHAShotsTarget
    games["SHAFouls"] = SHAFouls
    games["SHACorners"] = SHACorners
    games["SHAYellows"] = SHAYellows
    games["SHAReds"] = SHAReds

    games["SAHShots"] = SAHShots
    games["SAHShotsTarget"] = SAHShotsTarget
    games["SAHFouls"] = SAHFouls
    games["SAHCorners"] = SAHCorners
    games["SAHYellows"] = SAHYellows
    games["SAHReds"] = SAHReds

    games["SAAShots"] = SAAShots
    games["SAAShotsTarget"] = SAAShotsTarget
    games["SAAFouls"] = SAAFouls
    games["SAACorners"] = SAACorners
    games["SAAYellows"] = SAAYellows
    games["SAAReds"] = SAAReds

    games["HHHShots"] = HHHShots
    games["HHHShotsTarget"] = HHHShotsTarget
    games["HHHFouls"] = HHHFouls
    games["HHHCorners"] = HHHCorners
    games["HHHYellows"] = HHHYellows
    games["HHHReds"] = HHHReds

    games["HHAShots"] = HHAShots
    games["HHAShotsTarget"] = HHAShotsTarget
    games["HHAFouls"] = HHAFouls
    games["HHACorners"] = HHACorners
    games["HHAYellows"] = HHAYellows
    games["HHAReds"] = HHAReds

    games["HAHShots"] = HAHShots
    games["HAHShotsTarget"] = HAHShotsTarget
    games["HAHFouls"] = HAHFouls
    games["HAHCorners"] = HAHCorners
    games["HAHYellows"] = HAHYellows
    games["HAHReds"] = HAHReds

    games["HAAShots"] = HAAShots
    games["HAAShotsTarget"] = HAAShotsTarget
    games["HAAFouls"] = HAAFouls
    games["HAACorners"] = HAACorners
    games["HAAYellows"] = HAAYellows
    games["HAAReds"] = HAAReds
    
    #Encoding timestamp as unix clock time.
    games['Date'] = games['Date'].astype(int)

# Methodology Overview
I opted to use the Random Forest Classifier for my predictions with 5000 estimators.
As mentioned in the data transformation section, I opted to use weighted averages both seasonally and historically in order to calculate metrics. After reading around to see previous attempts, I noticed that using averages of the past 10 games was a common theme in predictors. However, since we are using a rather limited dataset, I didn't think it would be prudent to use 10 games considering teams such as Watford in 2015-2016 didn't have any premier league match since 2007 and thus would be a poor indicator of future performance if included in a 10 game average.
Per convention I found that a good ratio of features to samples is 1/10.


# Model Training/Validation
Originally I used the standard split data function in sklearn with a 25% testing set split, however I quickly found that I was overfitting for the over represented classes (H, A) as my model was predicting H and A exclusively.

I remedied this by extracting a random set of 400 values for each class and using the combined 400 each as a training set. 
The model was validated by using the simple score function from sklearn to see accuracy results.

In [29]:
from sklearn import datasets, svm, tree, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from collections import defaultdict

#Preparing datatypes for fitting and test set for testing.
testingFrame['Date'] = pd.to_datetime(testingFrame['Date'], format = '%d/%m/%Y')   
testingFrame['Season'] = addSeasons(testingFrame)

categorical = ['HomeTeam', 'AwayTeam', 'Season']
for col in categorical:
    frame[col] = frame[col].astype('category')
    testingFrame[col] = testingFrame[col].astype('category')

#Encoding categories as numbers instead of strings
teamLe = preprocessing.LabelEncoder()
seasonLe = preprocessing.LabelEncoder()
#Using home team is enough as every team plays at home at least once
leagueStandings = addPreviousLeagueStandings(testingFrame)
testingFrame['HPrevStanding'] = leagueStandings[0]
testingFrame['APrevStanding'] = leagueStandings[1]

createPredictions(testingFrame)

teamLe = teamLe.fit(frame['HomeTeam'])
frame['HomeTeam'] = teamLe.transform(frame['HomeTeam'])
frame['AwayTeam'] = teamLe.transform(frame['AwayTeam'])
testingFrame['HomeTeam'] = teamLe.transform(testingFrame['HomeTeam'])
testingFrame['AwayTeam'] = teamLe.transform(testingFrame['AwayTeam'])
seasonLe = seasonLe.fit(frame['Season'])
frame['Season'] = seasonLe.transform(frame['Season'])
testingFrame['Season'] = seasonLe.transform(testingFrame['Season'])

#print(frame)
#print(testingFrame)

#Encoding timestamp as unix clock time.
frame['Date'] = frame['Date'].astype(int)

#Removing the referee because we don't have that information in the prediction set

features = ['Date', 'HomeTeam', 'AwayTeam', 'Season', 'HPrevStanding',
            'APrevStanding', 'SHHTotalGames', 'SHATotalGames', 'SAATotalGames', 'SAHTotalGames',
            'SHHWins', 'SAHWins', 'SAAWins', 'SHAWins', 'SHHLosses', 'SAHLosses', 'SAALosses',
            'SHALosses', 'SHHDraws', 'SAHDraws','SAADraws', 'SHADraws', 'SHStreak', 'SHStreakType',
            'SAStreak', 'SAStreakType', 'SHHShots', 'SHHShotsTarget', 'SHHFouls', 'SHHCorners',
            'SHHYellows', 'SHHReds', 'SHAShots', 'SHAShotsTarget', 'SHAFouls', 'SHACorners', 
            'SHAYellows', 'SHAReds', 'SAHShots', 'SAHShotsTarget', 'SAHFouls', 'SAHCorners',
            'SAHYellows', 'SAHReds', 'SAAShots', 'SAAShotsTarget', 'SAAFouls', 'SAACorners', 
            'SAAYellows', 'SAAReds', 'HHHShots', 'HHHShotsTarget', 'HHHFouls', 'HHHCorners',
            'HHHYellows', 'HHHReds', 'HHAShots', 'HHAShotsTarget', 'HHAFouls', 'HHACorners',
            'HHAYellows', 'HHAReds', 'HAHShots', 'HAHShotsTarget', 'HAHFouls', 'HAHCorners',
            'HAHYellows', 'HAHReds', 'HAAShots', 'HAAShotsTarget', 'HAAFouls', 'HAACorners',
            'HAAYellows', 'HAAReds']


x_home = frame[features].loc[frame['FTR'] == 'H']   
x_draw = frame[features].loc[frame['FTR'] == 'D']   
x_away = frame[features].loc[frame['FTR'] == 'A']   

labels = ['FTR']
y_home = frame[labels].loc[frame['FTR'] == 'H']   
y_draw = frame[labels].loc[frame['FTR'] == 'D']   
y_away = frame[labels].loc[frame['FTR'] == 'A'] 

#Building the training and testing set
print("Full set:")
print("\tHome wins: " + str(sum(frame['FTR'] == 'H')))
print("\tDraws: " + str(sum(frame['FTR'] == 'D')))
print("\tAway wins: " + str(sum(frame['FTR'] == 'A')))
print()

#Seperating into equal test sizes of 400 samples to prevent overfitting to Home wins
x_train_home, x_test_home, y_train_home, y_test_home = train_test_split(x_home ,y_home ,test_size=508, random_state=0)
x_train_draw, x_test_draw, y_train_draw, y_test_draw = train_test_split(x_draw ,y_draw ,test_size=110, random_state=0)
x_train_away, x_test_away, y_train_away, y_test_away = train_test_split(x_away ,y_away ,test_size=169, random_state=0)

#print(str(len(x_home)))
#print(str(len(x_draw)))
#print(str(len(x_away)))

x_train = [x_train_home, x_train_draw, x_train_away]
x_test = [x_test_home, x_test_draw, x_test_away]
y_train = [y_train_home, y_train_draw, y_train_away]
y_test = [y_test_home, y_test_draw, y_test_away]
x_train = pd.concat(x_train)
x_test = pd.concat(x_test)
y_train = pd.concat(y_train)
y_test = pd.concat(y_test)

print("Training Sets:")
print("\tHome wins: " + str(sum(y_train['FTR'] == 'H')))
print("\tDraws: " + str(sum(y_train['FTR'] == 'D')))
print("\tAway wins: " + str(sum(y_train['FTR'] == 'A')))
print()
print("Testing Sets:")
print("\tHome wins: " + str(sum(y_test['FTR'] == 'H')))
print("\tDraws: " + str(sum(y_test['FTR'] == 'D')))
print("\tAway wins: " + str(sum(y_test['FTR'] == 'A')))
print()

multi_clf = RandomForestClassifier(n_estimators=5000)
multi_clf = multi_clf.fit(x_train, y_train)

Full set:
	Home wins: 908
	Draws: 510
	Away wins: 569

Training Sets:
	Home wins: 400
	Draws: 400
	Away wins: 400

Testing Sets:
	Home wins: 508
	Draws: 110
	Away wins: 169





# Results
I acknowledge that due to an unblanced testing set, I could be incredibly off target.

In [30]:
print("Accuracy: " + str(multi_clf.score(x_test ,y_test)))

Accuracy: 0.536213468869


# Final Predictions on the Test Set

In [31]:
#print(testingFrame)

print(multi_clf.predict(testingFrame))

['H' 'D' 'D' 'A' 'A' 'D' 'D' 'H' 'H' 'H']
