# Predicting the Winning Football Team

Can we design a predictive model capable of accurately predicting if the home team will win a football match? 

![alt text](https://6544-presscdn-0-22-pagely.netdna-ssl.com/wp-content/uploads/2017/04/English-Premier-League.jpg "Logo Title Text 1")

## Steps

1. Clean the dataset
2. Split it into training and testing data (12 features and 1 target: the winning team, Home/Away/Draw)
3. Train 3 different classifiers on the data
   - Logistic Regression
   - Support Vector Machine
   - XGBoost
4. Use the best classifier to predict who will win given an away team and a home team

## History

Sports betting is a 500 billion dollar market (Sydney Herald)

![alt text](https://static1.squarespace.com/static/506a95bbc4aa0491a951c141/t/51a55d97e4b00f4428967e64/1369791896526/sports-620x349.jpg "Logo Title Text 1")

Kaggle hosts a yearly competition called March Madness

https://www.kaggle.com/c/march-machine-learning-mania-2017/kernels

Several papers address this topic

https://arxiv.org/pdf/1511.05837.pdf

"It is possible to predict the winner of English county twenty twenty cricket games in almost two thirds of instances."

https://arxiv.org/pdf/1411.1243.pdf

"Something that becomes clear from the results is that Twitter contains enough information to be useful for
predicting outcomes in the Premier League"

https://qz.com/233830/world-cup-germany-argentina-predictions-microsoft/

For the 2014 World Cup, Bing correctly predicted the outcomes for all of the 15 games in the knockout round.

So the right questions to ask are:

- What model should we use?
- Which features (the aspects of a game) matter most for predicting a team win? Does being the home team give a team an advantage?

## Dataset

- Football is played by 250 million players in over 200 countries (most popular sport globally)
- The English Premier League is the most popular domestic league in the world
- Retrieved dataset from http://football-data.co.uk/data.php

![alt text](http://i.imgur.com/YRIctyo.png "Logo Title Text 1")

- Football is a team sport; a cheering crowd helps morale
- Familiarity with the pitch and weather conditions helps
- No need to travel (less fatigue)

Acronyms: https://rstudio-pubs-static.s3.amazonaws.com/179121_70eb412bbe6c4a55837f2439e5ae6d4e.html

## Other repositories

- https://github.com/rsibi/epl-prediction-2017 (EPL prediction)
- https://github.com/adeshpande3/March-Madness-2017 (NCAA prediction)

## Import Dependencies

In [1]:
# Data preprocessing
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA  # principal component analysis


# THESE ARE THE MODELS WE WILL BE TRYING OUT

# XGBoost produces a prediction model in the form of an ensemble of weak
# prediction models, typically decision trees.
import xgboost as xgb  # XGBoost
# Logistic regression is used when the response (dependent) variable is
# categorical, i.e. it has only a limited number of possible values.
from sklearn.linear_model import LogisticRegression
# A random forest is a meta estimator that fits a number of decision tree classifiers
# on various sub-samples of the dataset and uses averaging to improve the predictive
# accuracy and control over-fitting.
from sklearn.ensemble import RandomForestClassifier
# A support vector machine is a discriminative classifier formally defined by a separating hyperplane.
from sklearn.svm import SVC  # SVM
from sklearn import neighbors, datasets  # k-nearest neighbors
from sklearn.ensemble import AdaBoostClassifier  # AdaBoost
from sklearn.naive_bayes import GaussianNB  # naive Bayes
from sklearn import tree  # decision tree
from sklearn.neural_network import MLPClassifier  # multi-layer perceptron (vanilla) neural network

# Display data
from IPython.display import display
%matplotlib inline

In [4]:
# Read the training and testing data for each feature set.
data1 = pd.read_csv('Set1_training.csv')
data_test1 = pd.read_csv('Set1_testing.csv')
data2 = pd.read_csv('Set2_training.csv')
data_test2 = pd.read_csv('Set2_testing.csv')
data3 = pd.read_csv('Set3_training.csv')
data_test3 = pd.read_csv('Set3_testing.csv')
data4 = pd.read_csv('Set4_training.csv')
data_test4 = pd.read_csv('Set4_testing.csv')
data5 = pd.read_csv('Set5_training.csv')
data_test5 = pd.read_csv('Set5_testing.csv')
data6 = pd.read_csv('Set6_training.csv')
data_test6 = pd.read_csv('Set6_testing.csv')
dataC = pd.read_csv('SetC_training.csv')
data_testC = pd.read_csv('SetC_testing.csv')
# Preview data.
display(data1.head())
display(data2.head())
display(data3.head())
display(data4.head())
display(data5.head())
display(data6.head())
display(dataC.head())


#Input - team statistics for both teams (e.g. W%, BABIP, ERA, K%, AVG, SLG)
#Output - HOME_TEAM_RESULT (W=Home Win, D=Draw, L=Away Win)

Unnamed: 0,AWAY,HOME,HOME_TEAM_RESULT,W%,BABIP,ERA,K%,K%.1,AVG,SLG,W%.1,BABIP.1,ERA.1,K%.2,K%.3,AVG.1,SLG.1
0,Orioles,Braves,D,0.549383,0.299,4.22,20.4,21.7,0.256,0.443,0.42236,0.293,4.51,19.6,20.0,0.255,0.384
1,Pirates,Tigers,L,0.481481,0.306,4.22,19.6,21.3,0.257,0.402,0.534161,0.3,4.24,20.4,21.3,0.267,0.438
2,BlueJays,Phillies,L,0.549383,0.282,3.79,21.5,21.9,0.248,0.426,0.438272,0.304,4.64,21.1,23.0,0.24,0.385
3,Reds,Indians,L,0.419753,0.29,4.91,19.6,21.1,0.256,0.408,0.583851,0.289,3.86,23.2,20.2,0.262,0.43
4,Braves,Orioles,L,0.42236,0.293,4.51,19.6,20.0,0.255,0.384,0.549383,0.299,4.22,20.4,21.7,0.256,0.443


Unnamed: 0,AWAY,HOME,HOME_TEAM_RESULT,W%,BABIP,ERA,K%,K%.1,AVG,SLG,...,B_index,W%.1,BABIP.1,ERA.1,K%.2,K%.3,AVG.1,SLG.1,P_Index.1,B_index.1
0,Orioles,Braves,D,0.549383,0.299,4.22,20.4,21.7,0.256,0.443,...,47.3,0.42236,0.293,4.51,19.6,20.0,0.255,0.384,24.11,45.5
1,Pirates,Tigers,L,0.481481,0.306,4.22,19.6,21.3,0.257,0.402,...,47.0,0.534161,0.3,4.24,20.4,21.3,0.267,0.438,24.64,48.0
2,BlueJays,Phillies,L,0.549383,0.282,3.79,21.5,21.9,0.248,0.426,...,46.7,0.438272,0.304,4.64,21.1,23.0,0.24,0.385,25.74,47.0
3,Reds,Indians,L,0.419753,0.29,4.91,19.6,21.1,0.256,0.408,...,46.7,0.583851,0.289,3.86,23.2,20.2,0.262,0.43,27.06,46.4
4,Braves,Orioles,L,0.42236,0.293,4.51,19.6,20.0,0.255,0.384,...,45.5,0.549383,0.299,4.22,20.4,21.7,0.256,0.443,24.62,47.3


Unnamed: 0,AWAY,HOME,HOME_TEAM_RESULT,W%,B_index,LV2,wOBA,W%.1,B_index.1,LV2.1,wOBA.1
0,Orioles,Braves,D,0.549383,47.3,26.91,0.326,0.42236,45.5,26.35,0.304
1,Pirates,Tigers,L,0.481481,47.0,26.13,0.318,0.534161,48.0,27.31,0.33
2,BlueJays,Phillies,L,0.549383,46.7,28.14,0.327,0.438272,47.0,28.53,0.296
3,Reds,Indians,L,0.419753,46.7,26.46,0.311,0.583851,46.4,30.09,0.326
4,Braves,Orioles,L,0.42236,45.5,26.35,0.304,0.549383,47.3,26.91,0.326


Unnamed: 0,AWAY,HOME,HOME_TEAM_RESULT,W%,B_index,LV2,wOBA,WAR,WAR.1,W%.1,B_index.1,LV2.1,wOBA.1,WAR.2,WAR.3
0,Orioles,Braves,D,0.549383,47.3,26.91,0.326,15.4,20.2,0.42236,45.5,26.35,0.304,8.9,10.0
1,Pirates,Tigers,L,0.481481,47.0,26.13,0.318,9.2,17.3,0.534161,48.0,27.31,0.33,17.4,19.4
2,BlueJays,Phillies,L,0.549383,46.7,28.14,0.327,19.0,23.7,0.438272,47.0,28.53,0.296,13.6,10.2
3,Reds,Indians,L,0.419753,46.7,26.46,0.311,-1.4,15.9,0.583851,46.4,30.09,0.326,18.7,26.7
4,Braves,Orioles,L,0.42236,45.5,26.35,0.304,8.9,10.0,0.549383,47.3,26.91,0.326,15.4,20.2


Unnamed: 0,AWAY,HOME,HOME_TEAM_RESULT,W%,Off,K%,AVG
0,Orioles,Braves,D,-0.127022,-128.6,-1.7,-0.001
1,Pirates,Tigers,L,0.05268,42.8,0.0,0.01
2,BlueJays,Phillies,L,-0.111111,-153.7,1.1,-0.008
3,Reds,Indians,L,0.164098,109.3,-0.9,0.006
4,Braves,Orioles,L,0.127022,128.6,1.7,0.001


Unnamed: 0,AWAY,HOME,HOME_TEAM_RESULT,W%,Off,K%,AVG,P_Index,B_index
0,Orioles,Braves,D,-0.127022,-128.6,-1.7,-0.001,-0.51,-1.8
1,Pirates,Tigers,L,0.05268,42.8,0.0,0.01,0.82,1.0
2,BlueJays,Phillies,L,-0.111111,-153.7,1.1,-0.008,0.45,0.3
3,Reds,Indians,L,0.164098,109.3,-0.9,0.006,2.55,-0.3
4,Braves,Orioles,L,0.127022,128.6,1.7,0.001,0.51,1.8


Unnamed: 0,AWAY,HOME,HOME_TEAM_RESULT,W%,BABIP,LOB%,ERA,FIP,xFIP,WAR,...,WAR.3,rPM.1,DRS.1,BIZ.1,RZR.1,FSR.1,RngR.1,ErrR.1,UZR.1,Def.3
0,Orioles,Braves,D,0.549383,0.299,73.8,4.22,4.31,4.34,15.4,...,10.0,-24,-29,2147,0.792,-46,-11.5,-7.5,-12.9,-27.9
1,Pirates,Tigers,L,0.481481,0.306,72.5,4.22,4.3,4.28,9.2,...,19.4,-51,-60,2158,0.811,-1,-25.8,17.5,-20.8,-12.8
2,BlueJays,Phillies,L,0.549383,0.282,74.5,3.79,4.04,4.02,19.0,...,10.2,-20,-30,2012,0.83,-20,18.0,4.4,18.1,19.1
3,Reds,Indians,L,0.419753,0.29,72.8,4.91,5.24,4.79,-1.4,...,26.7,7,15,2063,0.829,7,35.1,-2.9,35.6,41.6
4,Braves,Orioles,L,0.42236,0.293,70.2,4.51,4.32,4.48,8.9,...,20.2,-34,-30,2074,0.822,13,-20.7,13.3,-12.6,-12.6


## Data Exploration

In [6]:
#what is the win rate for the home team?

# Total number of matches.
n_matches = dataC.shape[0]

# Calculate number of features. -1 because we are saving one as the target variable (win/lose/draw)
n_features = dataC.shape[1] - 1

# Calculate matches won by home team.
n_homewins = len(dataC[dataC.HOME_TEAM_RESULT == 'W'])

# Calculate win rate for home team.
win_rate = (float(n_homewins) / (n_matches)) * 100

# Print the results
print ("Total number of matches: {}".format(n_matches))
print ("Number of features: {}".format(n_features))
print ("Number of matches won by home team: {}".format(n_homewins))
print ("Win rate of home team: {:.2f}%".format(win_rate))

Total number of matches: 2734
Number of features: 58
Number of matches won by home team: 1439
Win rate of home team: 52.63%
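
The home win rate is also the accuracy of the simplest possible model: one that always predicts a home win. Any classifier we train should be judged against that number. Below is a minimal sketch of this baseline. The label list is rebuilt from the counts printed above, and `majority_baseline_accuracy` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Labels rebuilt from the printed counts: 1439 home wins out of 2734 matches.
# The remaining matches are lumped together as 'not W' because the split
# between draws and losses is not shown in the output above.
labels = ['W'] * 1439 + ['not W'] * (2734 - 1439)

baseline = majority_baseline_accuracy(labels)
print("Majority-class baseline accuracy: {:.2f}%".format(baseline * 100))
```

A test accuracy close to this baseline suggests the model has learned little beyond "the home team usually wins".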


In [12]:
# Visualising distribution of data
#from pandas.plotting import scatter_matrix

#the scatter matrix is plotting each of the columns specified against each other column.
#You would have observed that the diagonal graph is defined as a histogram, which means that in the 
#section of the plot matrix where the variable is against itself, a histogram is plotted.

#Scatter plots show how much one variable is affected by another. 
#The relationship between two variables is called their correlation
#negative vs positive correlation

#HTGD - Home team goal difference
#ATGD - away team goal difference
#HTP - Home team points
#ATP - Away team points
#DiffFormPts - Difference in points
#DiffLP - Difference in last year's prediction

#scatter_matrix(data[['HOME_SCORE','AWAY_SCORE']], figsize=(10,10))
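
The correlation mentioned in the comments above can also be computed directly instead of being read off a scatter plot. This is a small sketch of the Pearson correlation coefficient in plain Python. The two lists are the `W%` and `ERA` values from the five preview rows of the first dataset, so the result is only illustrative, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance numerator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variance numerators)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# W% and ERA from the five preview rows of data1 shown earlier
win_pct = [0.549383, 0.481481, 0.549383, 0.419753, 0.42236]
era = [4.22, 4.22, 3.79, 4.91, 4.51]

r = pearson_r(win_pct, era)
print("Correlation between W% and ERA on the preview rows: {:.3f}".format(r))
```

A value near +1 means the two variables rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship. Five rows are far too few to draw conclusions; on the full data, `DataFrame.corr()` gives the same measure for every pair of columns.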

## Preparing the Data

In [15]:
# Separate into feature set and target variable
#Home team result:W=Home Win, D=Draw, L=Away Win
X_all1 = data1.drop(['HOME_TEAM_RESULT','AWAY','HOME'], axis=1)
y_all1 = data1['HOME_TEAM_RESULT']
X_all2 = data2.drop(['HOME_TEAM_RESULT','AWAY','HOME'], axis=1)
y_all2 = data2['HOME_TEAM_RESULT']
X_all3 = data3.drop(['HOME_TEAM_RESULT','AWAY','HOME'], axis=1)
y_all3 = data3['HOME_TEAM_RESULT']
X_all4 = data4.drop(['HOME_TEAM_RESULT','AWAY','HOME'], axis=1)
y_all4 = data4['HOME_TEAM_RESULT']
X_all5 = data5.drop(['HOME_TEAM_RESULT','AWAY','HOME'], axis=1)
y_all5 = data5['HOME_TEAM_RESULT']
X_all6 = data6.drop(['HOME_TEAM_RESULT','AWAY','HOME'], axis=1)
y_all6 = data6['HOME_TEAM_RESULT']
X_allC = dataC.drop(['HOME_TEAM_RESULT','AWAY','HOME'], axis=1)
y_allC = dataC['HOME_TEAM_RESULT']

X_test1 = data_test1.drop(['HOME_TEAM_RESULT','AWAY','HOME'], axis=1)
y_test1 = data_test1['HOME_TEAM_RESULT']
X_test2 = data_test2.drop(['HOME_TEAM_RESULT','AWAY','HOME'], axis=1)
y_test2 = data_test2['HOME_TEAM_RESULT']
X_test3 = data_test3.drop(['HOME_TEAM_RESULT','AWAY','HOME'], axis=1)
y_test3 = data_test3['HOME_TEAM_RESULT']
X_test4 = data_test4.drop(['HOME_TEAM_RESULT','AWAY','HOME'], axis=1)
y_test4 = data_test4['HOME_TEAM_RESULT']
X_test5 = data_test5.drop(['HOME_TEAM_RESULT','AWAY','HOME'], axis=1)
y_test5 = data_test5['HOME_TEAM_RESULT']
X_test6 = data_test6.drop(['HOME_TEAM_RESULT','AWAY','HOME'], axis=1)
y_test6 = data_test6['HOME_TEAM_RESULT']
X_testC = data_testC.drop(['HOME_TEAM_RESULT','AWAY','HOME'], axis=1)
y_testC = data_testC['HOME_TEAM_RESULT']


# Standardising the data.
from sklearn.preprocessing import scale

#Center to the mean and component wise scale to unit variance.
cols1 = [['W%', 'BABIP', 'ERA', 'K%', 'K%.1', 'AVG', 'SLG', 'W%.1', 'BABIP.1', 'ERA.1', 'K%.2', 'K%.3', 'AVG.1', 'SLG.1']]
cols2 = [['W%', 'BABIP', 'ERA', 'K%', 'K%.1', 'AVG', 'SLG', 'P_Index', 'B_index', 'W%.1', 'BABIP.1', 'ERA.1', 'K%.2', 'K%.3', 'AVG.1', 'SLG.1', 'P_Index.1', 'B_index.1']]
cols3 = [['W%', 'B_index', 'LV2', 'wOBA', 'W%.1', 'B_index.1', 'LV2.1', 'wOBA.1']]
cols4 = [['W%' , 'B_index' , 'LV2' , 'wOBA' , 'WAR' , 'WAR.1']]
cols5 = [['W%' , 'Off' , 'K%' , 'AVG']]
cols6 = [['W%' , 'Off' , 'K%' , 'AVG' , 'P_Index' , 'B_index']]
colsC = [['W%', 'BABIP', 'LOB%', 'ERA', 'FIP', 'xFIP', 'WAR', 'K/BB', 'K%', 'HR', 'K%.1', 'AVG', 'OBP', 'SLG', 'wOBA', 'wRC+', 'Off', 'Def', 'WAR.1', 'rPM', 'DRS', 'BIZ', 'RZR', 'FSR', 'RngR', 'ErrR', 'UZR', 'Def.1', 'W%.1', 'BABIP.1', 'LOB%.1', 'ERA.1', 'FIP.1', 'xFIP.1', 'WAR.2', 'K/BB.1', 'K%.2', 'HR.1', 'K%.3', 'AVG.1', 'OBP.1', 'SLG.1', 'wOBA.1', 'wRC+.1', 'Off.1', 'Def.2', 'WAR.3', 'rPM.1', 'DRS.1', 'BIZ.1', 'RZR.1', 'FSR.1', 'RngR.1', 'ErrR.1', 'UZR.1', 'Def.3']]

for col in cols1:
    X_all1[col] = scale(X_all1[col])
    X_test1[col] = scale(X_test1[col])
    
for col in cols2:
    X_all2[col] = scale(X_all2[col])
    X_test2[col] = scale(X_test2[col])
    
for col in cols3:
    X_all3[col] = scale(X_all3[col])
    X_test3[col] = scale(X_test3[col])
    
for col in cols4:
    X_all4[col] = scale(X_all4[col])
    X_test4[col] = scale(X_test4[col])
    
for col in cols5:
    X_all5[col] = scale(X_all5[col])
    X_test5[col] = scale(X_test5[col])
        
for col in cols6:
    X_all6[col] = scale(X_all6[col])
    X_test6[col] = scale(X_test6[col])
    
for col in colsC:
    X_allC[col] = scale(X_allC[col])
    X_testC[col] = scale(X_testC[col])
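
The cell above calls `scale` on the training and testing frames separately, so each one is centered and scaled with its own mean and variance. A common alternative is to estimate the mean and standard deviation on the training data only and reuse them for the test data, which is what scikit-learn's `StandardScaler` does with `fit` and `transform`. Here is a sketch of that idea in plain NumPy. The helper names are hypothetical and the toy values are `W%` and `ERA` from the preview rows.

```python
import numpy as np


def fit_standardizer(train):
    """Estimate per-column mean and standard deviation from training data."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)  # population std (ddof=0), the same convention as sklearn's scale
    std[std == 0] = 1.0      # avoid dividing by zero for constant columns
    return mean, std


def apply_standardizer(values, mean, std):
    """Standardize values using statistics estimated elsewhere."""
    return (values - mean) / std


# Toy data: W% and ERA from four preview rows (train) and one row (test)
train = np.array([[0.549383, 4.22],
                  [0.481481, 4.22],
                  [0.549383, 3.79],
                  [0.419753, 4.91]])
test = np.array([[0.42236, 4.51]])

mean, std = fit_standardizer(train)
train_scaled = apply_standardizer(train, mean, std)
test_scaled = apply_standardizer(test, mean, std)  # reuses the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

With this approach the test set is transformed the same way the training set was, so a given raw value maps to the same scaled value in both.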
    
    

In [16]:
# We want continuous numeric variables for our input data, so let's convert any categorical variables
def preprocess_features(X):
    ''' Preprocesses the football data and converts categorical variables into dummy variables. '''
    
    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigate each feature column for the data
    # (DataFrame.iteritems() was removed in pandas 2.0; items() is the equivalent)
    for col, col_data in X.items():

        # If data type is categorical, convert to dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix = col)
                    
        # Collect the revised columns
        output = output.join(col_data)
    
    return output

#my_model = PCA(n_components=0.99, svd_solver='full')
#U=my_model.fit_transform(X_all)

X_all1 = preprocess_features(X_all1)
X_all2 = preprocess_features(X_all2)
X_all3 = preprocess_features(X_all3)
X_all4 = preprocess_features(X_all4)
X_all5 = preprocess_features(X_all5)
X_all6 = preprocess_features(X_all6)
X_allC = preprocess_features(X_allC)

X_test1 = preprocess_features(X_test1)
X_test2 = preprocess_features(X_test2)
X_test3 = preprocess_features(X_test3)
X_test4 = preprocess_features(X_test4)
X_test5 = preprocess_features(X_test5)
X_test6 = preprocess_features(X_test6)
X_testC = preprocess_features(X_testC)
print ("Processed feature columns ({} total features):\n{}".format(len(X_allC.columns), list(X_allC.columns)))

Processed feature columns (56 total features):
['W%', 'BABIP', 'LOB%', 'ERA', 'FIP', 'xFIP', 'WAR', 'K/BB', 'K%', 'HR', 'K%.1', 'AVG', 'OBP', 'SLG', 'wOBA', 'wRC+', 'Off', 'Def', 'WAR.1', 'rPM', 'DRS', 'BIZ', 'RZR', 'FSR', 'RngR', 'ErrR', 'UZR', 'Def.1', 'W%.1', 'BABIP.1', 'LOB%.1', 'ERA.1', 'FIP.1', 'xFIP.1', 'WAR.2', 'K/BB.1', 'K%.2', 'HR.1', 'K%.3', 'AVG.1', 'OBP.1', 'SLG.1', 'wOBA.1', 'wRC+.1', 'Off.1', 'Def.2', 'WAR.3', 'rPM.1', 'DRS.1', 'BIZ.1', 'RZR.1', 'FSR.1', 'RngR.1', 'ErrR.1', 'UZR.1', 'Def.3']
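
Because the `AWAY` and `HOME` columns were dropped earlier, every remaining column is already numeric, which is why the 56 features come out unchanged. To see what `preprocess_features` would do with a text column, here is a small sketch of the same `pd.get_dummies` step on a toy frame. The `LEAGUE` column and its values are made up for illustration and do not exist in the datasets.

```python
import pandas as pd

# Toy frame: one numeric column and one hypothetical categorical column
toy = pd.DataFrame({
    'W%': [0.549383, 0.481481, 0.419753],
    'LEAGUE': ['AL', 'NL', 'AL'],   # hypothetical column, not in the real data
})

# Start from the numeric column, then join one indicator column per category
encoded = toy[['W%']].join(pd.get_dummies(toy['LEAGUE'], prefix='LEAGUE'))

print(list(encoded.columns))
print(encoded)
```

Each distinct category becomes its own indicator column (`LEAGUE_AL`, `LEAGUE_NL`), so the classifiers only ever see numbers.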


In [17]:
# Show the feature information by printing the first five rows
print ("\nFeature values:")
display(X_allC.head())


Feature values:


Unnamed: 0,W%,BABIP,LOB%,ERA,FIP,xFIP,WAR,K/BB,K%,HR,...,WAR.3,rPM.1,DRS.1,BIZ.1,RZR.1,FSR.1,RngR.1,ErrR.1,UZR.1,Def.3
0,0.710381,0.236304,0.320201,0.150904,0.3625,0.554531,0.159073,-0.941959,-0.429612,2.070131,...,-1.284302,-0.741165,-0.750842,0.779036,-0.976932,-1.712538,-0.50432,-0.864981,-0.495872,-0.911051
1,-0.336003,0.842766,-0.306001,0.150904,0.334466,0.339914,-0.950565,-0.887012,-0.871426,-1.076126,...,0.030591,-1.570765,-1.542719,0.901333,0.106564,-0.073575,-1.055679,1.99663,-0.759396,-0.444021
2,0.710381,-1.236532,0.657387,-0.86943,-0.394431,-0.590091,0.80338,0.596543,0.177881,1.063329,...,-1.256326,-0.618261,-0.776386,-0.721877,1.19006,-0.765582,0.633099,0.497146,0.538209,0.542617
3,-1.287262,-0.543433,-0.161492,1.788183,2.969707,2.164155,-2.847689,-1.876049,-0.871426,-0.730037,...,1.051731,0.211339,0.373113,-0.154865,1.133034,0.217796,1.292417,-0.338445,1.121965,1.238521
4,-1.247084,-0.28352,-1.413896,0.839036,0.390535,1.055303,-1.004257,-1.079325,-0.871426,-2.051465,...,0.142497,-1.048424,-0.776386,-0.032569,0.733851,0.436324,-0.85904,1.515879,-0.485865,-0.437836


In [18]:
#from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn releases

# Shuffle and split the dataset into training and testing set.
#X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, 
#                                                    test_size = 0.1,
#                                                    random_state = 2,
#                                                    stratify = y_all)

#X_train=X_all
#X_test=X_test
#y_train=y_all
#y_test=y_test
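
The split is commented out here because the data already comes as separate training and testing files. For reference, this is a sketch of how the stratified split would look with the current `sklearn.model_selection` import. The arrays are synthetic stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 matches, 2 features, 50 W / 30 L / 20 D
X = np.arange(200).reshape(100, 2)
y = np.array(['W'] * 50 + ['L'] * 30 + ['D'] * 20)

# Shuffle and split, keeping the W/L/D proportions the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.1,
    random_state=2,
    stratify=y,
)

labels, counts = np.unique(y_test, return_counts=True)
print(len(X_train), len(X_test))
print(dict(zip(labels, counts)))
```

The `stratify` argument matters when the classes are unbalanced: without it, a small test set could end up with very few draws by chance.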


## Training and Evaluating Models

In [19]:
#for measuring training time
from time import time 
# The F1 score (also F-score or F-measure) is a measure of a test's accuracy.
# It considers both the precision p and the recall r of the test to compute
# the score: p is the number of correct positive results divided by the number of
# all positive results returned by the classifier, and r is the number of correct
# positive results divided by the number of positive results that should have been
# returned. The F1 score is the harmonic mean of precision and recall; it
# reaches its best value at 1 and worst at 0.
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    
    # Print the results
    print ("Trained model in {:.4f} seconds".format(end - start))

    
def predict_labels(clf, features, target):  # target is y_train or y_test
    ''' Makes predictions using a fitted classifier and returns the F1 score and accuracy. '''
    
    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    
    end = time()
    # Print and return results
    print ("Made predictions in {:.4f} seconds.".format(end - start))
    
    # pos_label only applies to average='binary', so it is omitted for the micro average
    return f1_score(target, y_pred, average='micro'), sum(target == y_pred) / float(len(y_pred))


def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Train and predict using a classifier based on F1 score. '''
    
    # Indicate the classifier and the training set size
    print ("Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train)))
    
    # Train the classifier
    train_classifier(clf, X_train, y_train)
    
    # Print the results of prediction for both training and testing
    f1, acc = predict_labels(clf, X_train, y_train)
    print (f1, acc)
    print ("F1 score and accuracy score for training set: {:.4f} , {:.4f}.".format(f1 , acc))
    
    f1, acc = predict_labels(clf, X_test, y_test)
    print ("F1 score and accuracy score for test set: {:.4f} , {:.4f}.".format(f1 , acc))
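
One thing to keep in mind when reading the results: with `average='micro'` on a single-label multiclass problem, every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class). Micro precision and micro recall are therefore equal, and the micro F1 score is the same number as accuracy. The sketch below checks this in plain Python on a handful of made-up labels; `micro_f1` and `accuracy` are hypothetical helpers, not the scikit-learn functions.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label and truth != label:
                fp += 1
            elif pred != label and truth == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


# Made-up labels for illustration
y_true = ['W', 'W', 'L', 'D', 'L', 'W']
y_pred = ['W', 'L', 'L', 'W', 'L', 'W']

print(micro_f1(y_true, y_pred), accuracy(y_true, y_pred))
```

This is why the two numbers printed by `train_predict` are always identical. A macro or weighted average would give a figure that differs from accuracy.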

Logistic Regression

![alt text](https://image.slidesharecdn.com/logisticregression-predictingthechancesofcoronaryheartdisease-091203130638-phpapp01/95/logistic-regression-predicting-the-chances-of-coronary-heart-disease-2-728.jpg?cb=1259845609 "Logo Title Text 1")

![alt text](https://i.ytimg.com/vi/HdB-z0TJRK4/maxresdefault.jpg "Logo Title Text 1")

Support Vector Machine

![alt text](https://image.slidesharecdn.com/supportvectormachine-121112135318-phpapp01/95/support-vector-machine-3-638.jpg?cb=1352729591 "Logo Title Text 1")
![alt text](http://docs.opencv.org/2.4/_images/optimal-hyperplane.png "Logo Title Text 1")

XGBoost

![alt text](https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/model/cart.png "Logo Title Text 1")

![alt text](https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/model/twocart.png "Logo Title Text 1")

![alt text](https://image.slidesharecdn.com/0782ee51-165d-4e34-a09c-2b7f8dacff01-150403064822-conversion-gate01/95/feature-importance-analysis-with-xgboost-in-tax-audit-17-638.jpg?cb=1450092771 "Logo Title Text 1")

![alt text](https://image.slidesharecdn.com/0782ee51-165d-4e34-a09c-2b7f8dacff01-150403064822-conversion-gate01/95/feature-importance-analysis-with-xgboost-in-tax-audit-18-638.jpg?cb=1450092771 "Logo Title Text 1")
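
A practical note on XGBoost: as far as we know, recent releases expect the class labels to be integers starting at 0 rather than strings such as `'W'`, `'D'` and `'L'`. If your installed version rejects the string labels, one option is to encode them first. This is a sketch using scikit-learn's `LabelEncoder` on a few made-up results; it is an assumption about your environment, not something the original notebook needed.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up results for illustration
y = ['W', 'L', 'D', 'W', 'L']

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)      # classes are sorted, so D=0, L=1, W=2

print(list(encoder.classes_))
print(list(y_encoded))

# Map integer predictions back to the original labels when reporting results
decoded = encoder.inverse_transform(y_encoded)
print(list(decoded))
```

The same fitted encoder should be used for the training and testing targets so that each label maps to the same integer in both.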

In [22]:
# Initialize the nine models
clf_LR = LogisticRegression(C=2500, random_state = 42) #LOGISTIC REGRESSION
clf_SVM = SVC(C=2500, random_state = 912, kernel='rbf') #SUPPORT VECTOR MACHINE
clf_XGB = xgb.XGBClassifier(seed = 82) #XGBOOST
clf_KNN = neighbors.KNeighborsClassifier(n_neighbors=400, weights='uniform') #K-NEAREST NEIGHBORS
clf_RF = RandomForestClassifier(max_depth=3, random_state=0,min_samples_split=40) #RANDOM FOREST
clf_ADB = AdaBoostClassifier(n_estimators=20, learning_rate=1.0, random_state=0) #ADAPTIVE BOOSTING (default base estimator)
clf_NB = GaussianNB() #NAIVE BAYES
clf_DT = tree.DecisionTreeClassifier(min_samples_split=3) #DECISION TREE
clf_NN = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(9, 2), random_state=1) #MULTI-LAYER PERCEPTRON NEURAL NETWORK

#Boosting refers to this general problem of producing a very accurate prediction rule 
#by combining rough and moderately inaccurate rules-of-thumb

train_predict(clf_LR, X_all1, y_all1, X_test1, y_test1)
train_predict(clf_LR, X_all2, y_all2, X_test2, y_test2)
train_predict(clf_LR, X_all3, y_all3, X_test3, y_test3)
train_predict(clf_LR, X_all4, y_all4, X_test4, y_test4)
train_predict(clf_LR, X_all5, y_all5, X_test5, y_test5)
train_predict(clf_LR, X_all6, y_all6, X_test6, y_test6)
train_predict(clf_LR, X_allC, y_allC, X_testC, y_testC)
print ('')
train_predict(clf_SVM, X_all1, y_all1, X_test1, y_test1)
train_predict(clf_SVM, X_all2, y_all2, X_test2, y_test2)
train_predict(clf_SVM, X_all3, y_all3, X_test3, y_test3)
train_predict(clf_SVM, X_all4, y_all4, X_test4, y_test4)
train_predict(clf_SVM, X_all5, y_all5, X_test5, y_test5)
train_predict(clf_SVM, X_all6, y_all6, X_test6, y_test6)
train_predict(clf_SVM, X_allC, y_allC, X_testC, y_testC)
print ('')
train_predict(clf_XGB, X_all1, y_all1, X_test1, y_test1)
train_predict(clf_XGB, X_all2, y_all2, X_test2, y_test2)
train_predict(clf_XGB, X_all3, y_all3, X_test3, y_test3)
train_predict(clf_XGB, X_all4, y_all4, X_test4, y_test4)
train_predict(clf_XGB, X_all5, y_all5, X_test5, y_test5)
train_predict(clf_XGB, X_all6, y_all6, X_test6, y_test6)
train_predict(clf_XGB, X_allC, y_allC, X_testC, y_testC)
print ('')
train_predict(clf_KNN, X_all1, y_all1, X_test1, y_test1)
train_predict(clf_KNN, X_all2, y_all2, X_test2, y_test2)
train_predict(clf_KNN, X_all3, y_all3, X_test3, y_test3)
train_predict(clf_KNN, X_all4, y_all4, X_test4, y_test4)
train_predict(clf_KNN, X_all5, y_all5, X_test5, y_test5)
train_predict(clf_KNN, X_all6, y_all6, X_test6, y_test6)
train_predict(clf_KNN, X_allC, y_allC, X_testC, y_testC)
print ('')
train_predict(clf_RF, X_all1, y_all1, X_test1, y_test1)
train_predict(clf_RF, X_all2, y_all2, X_test2, y_test2)
train_predict(clf_RF, X_all3, y_all3, X_test3, y_test3)
train_predict(clf_RF, X_all4, y_all4, X_test4, y_test4)
train_predict(clf_RF, X_all5, y_all5, X_test5, y_test5)
train_predict(clf_RF, X_all6, y_all6, X_test6, y_test6)
train_predict(clf_RF, X_allC, y_allC, X_testC, y_testC)
print ('')
train_predict(clf_ADB, X_all1, y_all1, X_test1, y_test1)
train_predict(clf_ADB, X_all2, y_all2, X_test2, y_test2)
train_predict(clf_ADB, X_all3, y_all3, X_test3, y_test3)
train_predict(clf_ADB, X_all4, y_all4, X_test4, y_test4)
train_predict(clf_ADB, X_all5, y_all5, X_test5, y_test5)
train_predict(clf_ADB, X_all6, y_all6, X_test6, y_test6)
train_predict(clf_ADB, X_allC, y_allC, X_testC, y_testC)
print ('')
train_predict(clf_NB, X_all1, y_all1, X_test1, y_test1)
train_predict(clf_NB, X_all2, y_all2, X_test2, y_test2)
train_predict(clf_NB, X_all3, y_all3, X_test3, y_test3)
train_predict(clf_NB, X_all4, y_all4, X_test4, y_test4)
train_predict(clf_NB, X_all5, y_all5, X_test5, y_test5)
train_predict(clf_NB, X_all6, y_all6, X_test6, y_test6)
train_predict(clf_NB, X_allC, y_allC, X_testC, y_testC)
print ('')
train_predict(clf_DT, X_all1, y_all1, X_test1, y_test1)
train_predict(clf_DT, X_all2, y_all2, X_test2, y_test2)
train_predict(clf_DT, X_all3, y_all3, X_test3, y_test3)
train_predict(clf_DT, X_all4, y_all4, X_test4, y_test4)
train_predict(clf_DT, X_all5, y_all5, X_test5, y_test5)
train_predict(clf_DT, X_all6, y_all6, X_test6, y_test6)
train_predict(clf_DT, X_allC, y_allC, X_testC, y_testC)
print ('')
train_predict(clf_NN, X_all1, y_all1, X_test1, y_test1)
train_predict(clf_NN, X_all2, y_all2, X_test2, y_test2)
train_predict(clf_NN, X_all3, y_all3, X_test3, y_test3)
train_predict(clf_NN, X_all4, y_all4, X_test4, y_test4)
train_predict(clf_NN, X_all5, y_all5, X_test5, y_test5)
train_predict(clf_NN, X_all6, y_all6, X_test6, y_test6)
train_predict(clf_NN, X_allC, y_allC, X_testC, y_testC)
print ('')
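
The 63 calls above follow one pattern: every classifier is trained and scored on every feature set. The same grid can be written as two nested loops over dictionaries. This is only a sketch: it uses two classifiers and two synthetic feature sets so that it runs on its own, and names such as `make_toy_set`, `toy_sets` and `results` are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)


def make_toy_set(n_features):
    """Synthetic stand-in for one (X_train, y_train, X_test, y_test) feature set."""
    X_train = rng.normal(size=(60, n_features))
    y_train = np.where(X_train[:, 0] > 0, 'W', 'L')
    X_test = rng.normal(size=(20, n_features))
    y_test = np.where(X_test[:, 0] > 0, 'W', 'L')
    return X_train, y_train, X_test, y_test


toy_sets = {'set1': make_toy_set(3), 'set2': make_toy_set(5)}
classifiers = {'LR': LogisticRegression(), 'NB': GaussianNB()}

results = {}
for clf_name, clf in classifiers.items():
    for set_name, (X_train, y_train, X_test, y_test) in toy_sets.items():
        model = clone(clf)                 # fresh, unfitted copy for each run
        model.fit(X_train, y_train)
        results[(clf_name, set_name)] = model.score(X_test, y_test)  # test accuracy

for key, acc in sorted(results.items()):
    print(key, "{:.4f}".format(acc))
```

In the notebook, the dictionaries would hold the nine `clf_*` objects and the seven `X_all*`/`X_test*` sets, and the loop body could call `train_predict` directly. Collecting the scores in a dictionary also makes it easy to compare feature sets side by side afterwards.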


Training a LogisticRegression using a training set size of 2734. . .
Trained model in 0.0156 seconds
Made predictions in 0.0000 seconds.
0.562545720556 0.562545720556
F1 score and accuracy score for training set: 0.5625 , 0.5625.
Made predictions in 0.0000 seconds.
F1 score and accuracy score for test set: 0.5241 , 0.5241.
Training a LogisticRegression using a training set size of 2734. . .
Trained model in 0.0313 seconds
Made predictions in 0.0000 seconds.
0.562545720556 0.562545720556
F1 score and accuracy score for training set: 0.5625 , 0.5625.
Made predictions in 0.0000 seconds.
F1 score and accuracy score for test set: 0.5241 , 0.5241.
Training a LogisticRegression using a training set size of 2734. . .
Trained model in 0.0000 seconds
Made predictions in 0.0000 seconds.
0.561814191661 0.561814191661
F1 score and accuracy score for training set: 0.5618 , 0.5618.
Made predictions in 0.0000 seconds.
F1 score and accuracy score for test set: 0.5194 , 0.5194.
Training a LogisticRegression using a training set size of 2734. . .
Trained model in 0.0982 seconds
Made predictions in 0.0000 seconds.
0.56035113387 0.56035113387
F1 score and accuracy score for training set: 0.5604 , 0.5604.
Made predictions in 0.0010 seconds.
F1 score and accuracy score for test set: 0.5216 , 0.5216.
Training a LogisticRegression using a training set size of 2734. . .
Trained model in 0.0060 seconds
Made predictions in 0.0000 seconds.
0.565471836138 0.565471836138
F1 score and accuracy score for training set: 0.5655 , 0.5655.
Made predictions in 0.0000 seconds.
F1 score and accuracy score for test set: 0.5198 , 0.5198.
Training a LogisticRegression using a training set size of 2734. . .
Trained model in 0.0156 seconds
Made predictions in 0.0000 seconds.
0.565471836138 0.565471836138
F1 score and accuracy score for training set: 0.5655 , 0.5655.
Made predictions in 0.0000 seconds.
F1 score and accuracy score for test set: 0.5198 , 0.5198.
Training a LogisticRegression using a training set size of 2734. . .
Trained model in 1.0870 sec



Trained model in 0.5686 seconds
Made predictions in 0.0156 seconds.
0.626554498903 0.626554498903
F1 score and accuracy score for training set: 0.6266 , 0.6266.
Made predictions in 0.0313 seconds.
F1 score and accuracy score for test set: 0.5151 , 0.5151.
Training a XGBClassifier using a training set size of 2734. . .




Trained model in 0.2889 seconds
Made predictions in 0.0240 seconds.
0.623628383321 0.623628383321
F1 score and accuracy score for training set: 0.6236 , 0.6236.
Made predictions in 0.0230 seconds.
F1 score and accuracy score for test set: 0.5248 , 0.5248.
Training a XGBClassifier using a training set size of 2734. . .




Trained model in 0.3612 seconds
Made predictions in 0.0156 seconds.
0.627286027798 0.627286027798
F1 score and accuracy score for training set: 0.6273 , 0.6273.
Made predictions in 0.0313 seconds.
F1 score and accuracy score for test set: 0.5277 , 0.5277.
Training a XGBClassifier using a training set size of 2734. . .




Trained model in 0.2138 seconds
Made predictions in 0.0220 seconds.
0.616678858815 0.616678858815
F1 score and accuracy score for training set: 0.6167 , 0.6167.
Made predictions in 0.0102 seconds.
F1 score and accuracy score for test set: 0.5158 , 0.5158.
Training a XGBClassifier using a training set size of 2734. . .




Trained model in 0.2950 seconds
Made predictions in 0.0313 seconds.
0.621799561083 0.621799561083
F1 score and accuracy score for training set: 0.6218 , 0.6218.
Made predictions in 0.0156 seconds.
F1 score and accuracy score for test set: 0.5169 , 0.5169.
Training a XGBClassifier using a training set size of 2734. . .




Trained model in 1.6271 seconds
Made predictions in 0.0470 seconds.
0.63643013899 0.63643013899
F1 score and accuracy score for training set: 0.6364 , 0.6364.
Made predictions in 0.0281 seconds.
F1 score and accuracy score for test set: 0.5295 , 0.5295.

Training a KNeighborsClassifier using a training set size of 2734. . .
Trained model in 0.0040 seconds




Made predictions in 0.5100 seconds.
0.554498902707 0.554498902707
F1 score and accuracy score for training set: 0.5545 , 0.5545.
Made predictions in 0.5056 seconds.
F1 score and accuracy score for test set: 0.5233 , 0.5233.
Training a KNeighborsClassifier using a training set size of 2734. . .
Trained model in 0.0000 seconds
Made predictions in 0.5505 seconds.
0.55559619605 0.55559619605
F1 score and accuracy score for training set: 0.5556 , 0.5556.
Made predictions in 0.5810 seconds.
F1 score and accuracy score for test set: 0.5230 , 0.5230.
Training a KNeighborsClassifier using a training set size of 2734. . .
Trained model in 0.0000 seconds
Made predictions in 0.3956 seconds.
0.553401609364 0.553401609364
F1 score and accuracy score for training set: 0.5534 , 0.5534.
Made predictions in 0.4135 seconds.
F1 score and accuracy score for test set: 0.5280 , 0.5280.
Training a KNeighborsClassifier using a training set size of 2734. . .
Trained model in 0.0000 seconds
Made predictions in 0



Made predictions in 0.0156 seconds.
F1 score and accuracy score for test set: 0.5230 , 0.5230.
Training a RandomForestClassifier using a training set size of 2734. . .
Trained model in 0.0202 seconds
Made predictions in 0.0030 seconds.
0.577542062911 0.577542062911
F1 score and accuracy score for training set: 0.5775 , 0.5775.
Made predictions in 0.0029 seconds.
F1 score and accuracy score for test set: 0.5165 , 0.5165.
Training a RandomForestClassifier using a training set size of 2734. . .
Trained model in 0.0210 seconds
Made predictions in 0.0030 seconds.
0.577907827359 0.577907827359
F1 score and accuracy score for training set: 0.5779 , 0.5779.
Made predictions in 0.0000 seconds.
F1 score and accuracy score for test set: 0.5090 , 0.5090.
Training a RandomForestClassifier using a training set size of 2734. . .
Trained model in 0.0313 seconds
Made predictions in 0.0000 seconds.
0.580102414045 0.580102414045
F1 score and accuracy score for training set: 0.5801 , 0.5801.
Made predicti




Training a AdaBoostClassifier using a training set size of 2734. . .
Trained model in 0.0954 seconds
Made predictions in 0.0080 seconds.
0.559253840527 0.559253840527
F1 score and accuracy score for training set: 0.5593 , 0.5593.
Made predictions in 0.0070 seconds.
F1 score and accuracy score for test set: 0.5176 , 0.5176.
Training a AdaBoostClassifier using a training set size of 2734. . .
Trained model in 0.0782 seconds
Made predictions in 0.0000 seconds.
0.557059253841 0.557059253841
F1 score and accuracy score for training set: 0.5571 , 0.5571.
Made predictions in 0.0156 seconds.
F1 score and accuracy score for test set: 0.5259 , 0.5259.
Training a AdaBoostClassifier using a training set size of 2734. . .
Trained model in 0.0928 seconds
Made predictions in 0.0080 seconds.
0.562545720556 0.562545720556
F1 score and accuracy score for training set: 0.5625 , 0.5625.
Made predictions in 0.0070 seconds.
F1 score and accuracy score for test set: 0.5147 , 0.5147.
Training a AdaBoostClassifier using a training set size of 2734. . .
Trained model in 0.0708 seconds
Made predictions in 0.0000 seconds.
0.562179956108 0.562179956108
F1 score and accuracy score for training set: 0.5622 , 0.5622.
Made predictions in 0.0156 seconds.
F1 score and accuracy score for test set: 0.5295 , 0.5295.
Training a AdaBoostClassifier using a training set size of 2734. . .
Trained model in 0.0842 seconds
Made predictions in 0.0080 seconds.
0.561814191661 0.561814191661
F1 score and accuracy score for training set: 0.5618 , 0.5618.
Made predictions in 0.0092 seconds.
F1 score and accuracy score for test set: 0.5068 , 0.5068.
Training a AdaBoostClassifier using a training set size of 2734. . .
Trained model in 0.0676 seconds
Made predictions in 0.0000 seconds.
0.564374542794 0.564374542794
F1 score and accuracy score for training set: 0.5644 , 0.5644.
Made predictions in 0.0156 seconds.
F1 score and accuracy score for test set: 0.5075 , 0.5075.
Training a AdaBoostClassifier using a training set size of 2734. . .
Trained model in 0.1385 seconds
Made predictions in 0.0000 seconds.
0.559253840527 0.559253840527
F1 score and accuracy score for training set: 0.5593 , 0.5593.
Made predictions in 0.0156 seconds.
F1 score and accuracy score for test set: 0.5172 , 0.5172.

Training a GaussianNB using a training set size of 2734. . .
Trained model in 0.0000 seconds
Made predictions in 0.0000 seconds.
0.555230431602 0.555230431602
F1 score and accuracy score for training set: 0.5552 , 0.5552.
Made predictions in 0.0000 seconds.
F1 score and accuracy score for test set: 0.5050 , 0.5050.
Training a GaussianNB using a training set size of 2734. . .
Trained model in 0.0000 seconds
Made predictions in 0.0000 seconds.
0.543525969276 0.543525969276
F1 score and accuracy score for training set: 0.5435 , 0.5435.
Made predictions in 0.0000 seconds.
F1 score and accuracy score for test set: 0.4943 , 0.4943.
Training a GaussianNB using a training set size of 2734. . .
Trained model in 0.0156 seconds
Made predictions



0.562179956108 0.562179956108
F1 score and accuracy score for training set: 0.5622 , 0.5622.
Made predictions in 0.0000 seconds.
F1 score and accuracy score for test set: 0.5072 , 0.5072.
Training a GaussianNB using a training set size of 2734. . .
Trained model in 0.0060 seconds
Made predictions in 0.0070 seconds.
0.490855888808 0.490855888808
F1 score and accuracy score for training set: 0.4909 , 0.4909.
Made predictions in 0.0080 seconds.
F1 score and accuracy score for test set: 0.4425 , 0.4425.

Training a DecisionTreeClassifier using a training set size of 2734. . .
Trained model in 0.0100 seconds
Made predictions in 0.0010 seconds.
0.686905632772 0.686905632772
F1 score and accuracy score for training set: 0.6869 , 0.6869.
Made predictions in 0.0010 seconds.
F1 score and accuracy score for test set: 0.4925 , 0.4925.
Training a DecisionTreeClassifier using a training set size of 2734. . .
Trained model in 0.0156 seconds
Made predictions in 0.0000 seconds.
0.686174103877 0.6861741

**XGBoost appears to be the best model, as it has the highest F1 score and accuracy score on the test set.**
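
Every run in the log above reports the same number for F1 and for accuracy. That is what one would expect if F1 is micro-averaged over a single-label multiclass target, although the averaging mode used by `predict_labels` is not shown here, so treat this as an assumption. The sketch below is plain Python, not the notebook's code; the labels `H`/`A`/`D` and the toy predictions are made up for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    total_tp = sum(tp.values())
    if total_tp == 0:
        return 0.0
    precision = total_tp / (total_tp + sum(fp.values()))
    recall = total_tp / (total_tp + sum(fn.values()))
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: H = home win, A = away win, D = draw
y_true = ["H", "A", "D", "H", "H", "A", "D", "H"]
y_pred = ["H", "H", "D", "A", "H", "A", "H", "H"]

print(accuracy(y_true, y_pred), micro_f1(y_true, y_pred))
```

Each wrong prediction counts once as a false positive and once as a false negative, so pooled precision, pooled recall, and accuracy coincide. If that is what happens here, the F1 column adds no information beyond accuracy when ranking the models.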

# Tuning the parameters of XGBoost.

![alt text](https://i.stack.imgur.com/9GgQK.jpg "Logo Title Text 1")

In [70]:
# Import 'GridSearchCV' and 'make_scorer'
# ('sklearn.grid_search' was removed in scikit-learn 0.20; use 'sklearn.model_selection')
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer


# Create the parameters list you wish to tune
#parameters = { 'learning_rate' : [0.1],
#               'n_estimators' : [40],
#               'max_depth': [3],
#               'min_child_weight': [3],
#               'gamma':[0.4],
#               'subsample' : [0.8],
#               'colsample_bytree' : [0.8],
#               'scale_pos_weight' : [1],
#               'reg_alpha':[1e-5]
#             }  

parameters = { 'tol' : [0.0001],
              'C' : [1.0], 
              'intercept_scaling' : [1], 
              'max_iter' : [100],
              'n_jobs' : [1]
             }

# Initialize the classifier

#clf = xgb.XGBClassifier(seed=2)
clf = LogisticRegression(random_state = 2)

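# Make an F1 scoring function using 'make_scorer'.
# Note: average='binary' only works for a two-class target. With a multiclass
# target it raises the ValueError shown in this cell's output.

#f1_scorer = make_scorer(accuracy_score)
f1_scorer = make_scorer(f1_score,pos_label='W',average='binary')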

# TODO: Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(clf,
                        scoring=f1_scorer,
                        param_grid=parameters,
                        cv=5)

# TODO: Fit the grid search object to the training data and find the optimal parameters
grid_obj = grid_obj.fit(X_train,y_train)

# Get the estimator
clf = grid_obj.best_estimator_
print (clf)

# Report the final F1 score for training and testing after parameter tuning
f1, acc = predict_labels(clf, X_train, y_train)
print ("F1 score and accuracy score for training set: {:.4f} , {:.4f}.".format(f1 , acc))
    
f1, acc = predict_labels(clf, X_test, y_test)
print ("F1 score and accuracy score for test set: {:.4f} , {:.4f}.".format(f1 , acc))

ValueError: Target is multiclass but average='binary'. Please choose another average setting.
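
The `ValueError` above is raised because `average='binary'` is only defined for two-class targets. One possible fix, sketched here on synthetic data rather than the notebook's `X_train`/`y_train`, is to drop `pos_label` and pick a multiclass average such as `'weighted'`. The dataset, the `C` grid, and `max_iter=500` are illustrative assumptions, not the author's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the match data: 12 features, 3 outcome classes
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=2)

# Multiclass-safe F1 scorer: no pos_label, weighted average over classes
f1_scorer = make_scorer(f1_score, average='weighted')

# Small illustrative grid over the regularization strength
parameters = {'C': [0.1, 1.0, 10.0]}

clf = LogisticRegression(random_state=2, max_iter=500)
grid_obj = GridSearchCV(clf, param_grid=parameters, scoring=f1_scorer, cv=5)
grid_obj = grid_obj.fit(X, y)

print(grid_obj.best_params_)
print("Best cross-validated weighted F1: {:.4f}".format(grid_obj.best_score_))
```

`average='macro'` or `average='micro'` would also avoid the error. Which one is appropriate depends on whether rare outcomes such as draws should count as much as common ones.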

In [173]:
#clf.predict(X_test[5])
print(row[1] for row in X_test)

Exception ignored in: <bound method DMatrix.__del__ of <xgboost.core.DMatrix object at 0x0000020674604E80>>
Traceback (most recent call last):
  File "C:\Users\Juna\Anaconda3\lib\site-packages\xgboost\core.py", line 368, in __del__
    if self.handle is not None:
AttributeError: 'DMatrix' object has no attribute 'handle'


TypeError: 'zip' object is not subscriptable

## Possible Improvements

- Add sentiment from Twitter and news articles
- Add more features from other data sources (how much others bet, player-specific health stats)
