# Creating a random forest predictor for Eliteserien matches with pandas and sklearn

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, accuracy_score
import warnings
warnings.filterwarnings('ignore')

We will use the CSV file we created with scraping.ipynb.

In [2]:
matches = pd.read_csv("matches.csv", index_col=0)
matches

          date   time         comp         round  day venue result gf ga  \
0   2023-04-10  17:00  Eliteserien   Matchweek 1  Mon  Away      W  2  0   
1   2023-04-16  17:00  Eliteserien   Matchweek 2  Sun  Home      W  4  0   
2   2023-04-23  17:00  Eliteserien   Matchweek 3  Sun  Away      W  3  0   
3   2023-04-29  18:00  Eliteserien   Matchweek 4  Sat  Home      D  2  2   
4   2023-05-03  18:00  Eliteserien  Matchweek 18  Wed  Home      W  2  0   
..         ...    ...          ...           ...  ...   ...    ... .. ..   
25  2021-11-07  17:00  Eliteserien  Matchweek 26  Sun  Home      L  0  1   
26  2021-11-20  18:00  Eliteserien  Matchweek 27  Sat  Away      W  2  1   
27  2021-11-28  17:00  Eliteserien  Matchweek 28  Sun  Home      L  1  3   
28  2021-12-05  17:00  Eliteserien  Matchweek 29  Sun  Away      L  0  2   
29  2021-12-12  17:00  Eliteserien  Matchweek 30  Sun  Home      L  0  3   

        opponent  ...            referee  match report no

In [3]:
matches.shape

(1443, 24)

Each team plays 30 league matches per season, so every count should be a multiple of 30. Sandefjord (92) and Brann (61) have more matches than we would expect.

In [4]:
matches["team"].value_counts()

team
Sandefjord      92
Tromso          90
BodoGlimt       90
Valerenga       90
Viking          90
Lillestrom      90
Molde           90
Sarpsborg 08    90
Rosenborg       90
Odd             90
Stromsgodset    90
Haugesund       90
Brann           61
HamKam          60
Stabaek         60
Aalesund        60
Kristiansund    60
Jerv            30
Mjondalen       30
Name: count, dtype: int64
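
The check we did by eye above can also be sketched in code. The counts are copied from the output above (a subset of teams), and the 30 matches per season is an assumption based on a 16-team league where everyone meets home and away.

```python
import pandas as pd

# Counts copied from the value_counts() output above (subset of teams).
counts = pd.Series(
    {"Sandefjord": 92, "Tromso": 90, "Brann": 61, "HamKam": 60, "Jerv": 30}
)

# Assumption: 16 teams, each playing the others home and away.
MATCHES_PER_SEASON = 30

# Teams whose total is not a whole number of seasons need a closer look.
suspicious = counts[counts % MATCHES_PER_SEASON != 0]
print(suspicious)
```

Any team that shows up here has rows that do not belong to a regular league season.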

It seems the Eliteserien playoff matches were included when we scraped the data.

In [5]:
matches["round"].value_counts()

round
Matchweek 1                                     48
Matchweek 2                                     48
Matchweek 3                                     48
Matchweek 4                                     48
Matchweek 18                                    48
Matchweek 5                                     48
Matchweek 6                                     48
Matchweek 7                                     48
Matchweek 8                                     48
Matchweek 9                                     48
Matchweek 10                                    48
Matchweek 11                                    48
Matchweek 12                                    48
Matchweek 13                                    48
Matchweek 14                                    48
Matchweek 15                                    48
Matchweek 16                                    48
Matchweek 17                                    48
Matchweek 19                                    48
Matchweek 21             

We fix this by keeping only the rows where the column "comp" is "Eliteserien".

In [6]:
matches = matches[matches["comp"] == "Eliteserien"].copy()

In [7]:
matches["team"].value_counts()

team
BodoGlimt       90
Tromso          90
Viking          90
Valerenga       90
Molde           90
Lillestrom      90
Stromsgodset    90
Sarpsborg 08    90
Rosenborg       90
Odd             90
Haugesund       90
Sandefjord      90
Brann           60
HamKam          60
Stabaek         60
Aalesund        60
Kristiansund    60
Jerv            30
Mjondalen       30
Name: count, dtype: int64

Several of the columns have the object dtype, which the random forest classifier cannot use directly.

In [8]:
matches.dtypes

date             object
time             object
comp             object
round            object
day              object
venue            object
result           object
gf               object
ga               object
opponent         object
poss            float64
attendance      float64
captain          object
formation        object
referee          object
match report     object
notes            object
sh              float64
sot             float64
dist            float64
pk                int64
pkatt             int64
season            int64
team             object
dtype: object

We change "date" from an object type to datetime. We encode the venue and the opponent as integers in the new columns "venue_code" and "opp_code". We also create an "hour" column that keeps only the hour from the "time" column as an integer (17:30 becomes 17), and a "day_code" column with an integer for each day of the week (Monday is 0). The final column we add is the "target" column. For this project we only care whether a team has won or not, so the value is 1 if the team won and 0 if the result was a draw or a loss.

In [9]:
matches["date"] = pd.to_datetime(matches["date"])
matches["venue_code"] = matches["venue"].astype("category").cat.codes
matches["opp_code"] = matches["opponent"].astype("category").cat.codes
matches["hour"] = matches["time"].str.replace(":.+", "", regex=True).astype("int")
matches["day_code"] = matches["date"].dt.dayofweek
matches["target"] = (matches["result"] == "W").astype("int")
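
To see what each of these conversions produces, here is a small sketch on a made-up three-row frame. The rows are illustrative and not taken from matches.csv; the transformations are the same ones used in the cell above.

```python
import pandas as pd

# Illustrative rows only, not real data from matches.csv.
toy = pd.DataFrame({
    "date": ["2023-04-10", "2023-04-16", "2023-04-29"],
    "time": ["17:00", "17:30", "18:00"],
    "venue": ["Away", "Home", "Home"],
    "result": ["W", "D", "L"],
})

toy["date"] = pd.to_datetime(toy["date"])
# Categories are sorted alphabetically, so Away becomes 0 and Home becomes 1.
toy["venue_code"] = toy["venue"].astype("category").cat.codes
# Drop everything from the colon onwards and keep the hour as an integer.
toy["hour"] = toy["time"].str.replace(":.+", "", regex=True).astype("int")
# Monday is 0 and Sunday is 6.
toy["day_code"] = toy["date"].dt.dayofweek
# 1 for a win, 0 for a draw or a loss.
toy["target"] = (toy["result"] == "W").astype("int")

print(toy[["venue_code", "hour", "day_code", "target"]])
```

Note that the category codes depend on which values are present in the column, so the same opponent only gets the same code if the encoding is done on the full dataframe before splitting.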

In [10]:
matches

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,dist,pk,pkatt,season,team,venue_code,opp_code,hour,day_code,target
0,2023-04-10,17:00,Eliteserien,Matchweek 1,Mon,Away,W,2,0,Sarpsborg 08,...,,0,0,2023,BodoGlimt,0,13,17,0,1
1,2023-04-16,17:00,Eliteserien,Matchweek 2,Sun,Home,W,4,0,Stabæk,...,,0,1,2023,BodoGlimt,1,14,17,6,1
2,2023-04-23,17:00,Eliteserien,Matchweek 3,Sun,Away,W,3,0,Aalesund,...,,0,0,2023,BodoGlimt,0,0,17,6,1
3,2023-04-29,18:00,Eliteserien,Matchweek 4,Sat,Home,D,2,2,Brann,...,,0,0,2023,BodoGlimt,1,2,18,5,0
4,2023-05-03,18:00,Eliteserien,Matchweek 18,Wed,Home,W,2,0,Odd,...,,0,0,2023,BodoGlimt,1,10,18,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25,2021-11-07,17:00,Eliteserien,Matchweek 26,Sun,Home,L,0,1,Viking,...,,0,1,2021,Mjondalen,1,17,17,6,0
26,2021-11-20,18:00,Eliteserien,Matchweek 27,Sat,Away,W,2,1,Strømsgodset,...,,0,0,2021,Mjondalen,0,15,18,5,1
27,2021-11-28,17:00,Eliteserien,Matchweek 28,Sun,Home,L,1,3,Sarpsborg 08,...,,0,0,2021,Mjondalen,1,13,17,6,0
28,2021-12-05,17:00,Eliteserien,Matchweek 29,Sun,Away,L,0,2,Vålerenga,...,,0,0,2021,Mjondalen,0,18,17,6,0


We initialise the random forest classifier with 50 individual decision trees, a minimum of 10 samples in a node before that node can be split, and random state set to 1 so that we get the same result every time as long as we give the model the same data.

In [11]:
rf = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)
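
The effect of random_state and min_samples_split can be checked with a small sketch. The data here is synthetic and only for illustration, and the helper fit_forest is a hypothetical name, not part of this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, only for illustration (not the match data).
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 4))
y = (X[:, 0] + rng.integers(0, 5, size=200) > 6).astype(int)

def fit_forest(seed):
    # Hypothetical helper: same settings as rf above, with a chosen seed.
    model = RandomForestClassifier(
        n_estimators=50, min_samples_split=10, random_state=seed
    )
    return model.fit(X[:150], y[:150])

# Same data and same random_state give identical predictions.
first = fit_forest(1).predict_proba(X[150:])
second = fit_forest(1).predict_proba(X[150:])
same_result = np.array_equal(first, second)

# Every node that was split held at least min_samples_split samples.
forest = fit_forest(1)
smallest_split_node = min(
    tree.tree_.n_node_samples[tree.tree_.children_left != -1].min()
    for tree in forest.estimators_
)
print(same_result, smallest_split_node)
```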

As this is time-series data, it is important that we train on the oldest data and test on the newest, as we can't use future information to predict the past. We use the 2023 season as the test set and everything before 2023 as the training set.

In [12]:
train = matches[matches["date"] < "2023-01-01"]
train

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,dist,pk,pkatt,season,team,venue_code,opp_code,hour,day_code,target
0,2022-04-02,18:00,Eliteserien,Matchweek 1,Sat,Home,W,1,0,Vålerenga,...,,0,0,2022,Molde,1,18,18,5,1
1,2022-04-10,18:00,Eliteserien,Matchweek 2,Sun,Away,W,3,1,Strømsgodset,...,,0,0,2022,Molde,0,15,18,6,1
2,2022-04-18,18:00,Eliteserien,Matchweek 3,Mon,Home,L,1,2,Lillestrøm,...,,0,0,2022,Molde,1,7,18,0,0
3,2022-04-24,20:00,Eliteserien,Matchweek 4,Sun,Away,D,0,0,Rosenborg,...,,0,0,2022,Molde,0,11,20,6,0
4,2022-05-07,18:00,Eliteserien,Matchweek 5,Sat,Home,L,3,4,Viking,...,,0,0,2022,Molde,1,17,18,5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25,2021-11-07,17:00,Eliteserien,Matchweek 26,Sun,Home,L,0,1,Viking,...,,0,1,2021,Mjondalen,1,17,17,6,0
26,2021-11-20,18:00,Eliteserien,Matchweek 27,Sat,Away,W,2,1,Strømsgodset,...,,0,0,2021,Mjondalen,0,15,18,5,1
27,2021-11-28,17:00,Eliteserien,Matchweek 28,Sun,Home,L,1,3,Sarpsborg 08,...,,0,0,2021,Mjondalen,1,13,17,6,0
28,2021-12-05,17:00,Eliteserien,Matchweek 29,Sun,Away,L,0,2,Vålerenga,...,,0,0,2021,Mjondalen,0,18,17,6,0


In [13]:
test = matches[matches["date"] > "2023-01-01"]
test

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,dist,pk,pkatt,season,team,venue_code,opp_code,hour,day_code,target
0,2023-04-10,17:00,Eliteserien,Matchweek 1,Mon,Away,W,2,0,Sarpsborg 08,...,,0,0,2023,BodoGlimt,0,13,17,0,1
1,2023-04-16,17:00,Eliteserien,Matchweek 2,Sun,Home,W,4,0,Stabæk,...,,0,1,2023,BodoGlimt,1,14,17,6,1
2,2023-04-23,17:00,Eliteserien,Matchweek 3,Sun,Away,W,3,0,Aalesund,...,,0,0,2023,BodoGlimt,0,0,17,6,1
3,2023-04-29,18:00,Eliteserien,Matchweek 4,Sat,Home,D,2,2,Brann,...,,0,0,2023,BodoGlimt,1,2,18,5,0
4,2023-05-03,18:00,Eliteserien,Matchweek 18,Wed,Home,W,2,0,Odd,...,,0,0,2023,BodoGlimt,1,10,18,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25,2023-10-28,18:00,Eliteserien,Matchweek 26,Sat,Away,L,1,6,Haugesund,...,,0,0,2023,Aalesund,0,4,18,5,0
26,2023-11-06,19:00,Eliteserien,Matchweek 27,Mon,Home,L,0,3,Sandefjord,...,,0,0,2023,Aalesund,1,12,19,0,0
27,2023-11-12,17:00,Eliteserien,Matchweek 28,Sun,Away,L,0,1,Bodø/Glimt,...,,0,0,2023,Aalesund,0,1,17,6,0
28,2023-11-26,17:00,Eliteserien,Matchweek 29,Sun,Home,L,0,4,Viking,...,,0,0,2023,Aalesund,1,17,17,6,0
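
The date-based split used for train and test above can be sketched on a toy frame. The rows are made up, and using >= for the test set is a small variation so that a match played exactly on the cutoff date would not fall between the two sets.

```python
import pandas as pd

# Illustrative rows only.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-05-09", "2022-04-02", "2022-11-13", "2023-04-10", "2023-11-26"]
    ),
    "target": [0, 1, 0, 1, 0],
})

cutoff = pd.Timestamp("2023-01-01")
train_toy = toy[toy["date"] < cutoff]   # everything before the cutoff
test_toy = toy[toy["date"] >= cutoff]   # the cutoff date and everything after

# The newest training match is older than the oldest test match.
no_leak = train_toy["date"].max() < test_toy["date"].min()
print(len(train_toy), len(test_toy), no_leak)
```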


The predictors we use in this model are venue, opponent, time of day and the day of the week the match was played on.

In [14]:
predictors = ["venue_code", "opp_code", "hour", "day_code"]

In [15]:
rf.fit(train[predictors], train["target"])

In [16]:
preds = rf.predict(test[predictors])

In [17]:
accuracy = accuracy_score(test["target"], preds)

This model receives an accuracy score of 0.58. This means that the model's prediction (win or not a win) matched the actual outcome in 58% of the test matches, but let's dig deeper.

In [18]:
accuracy

0.58125

To get a better understanding, we create a dataframe that combines our actual values and predicted values.

In [19]:
combined = pd.DataFrame(dict(actual=test["target"], predicted=preds))

From this crosstab we can see that the model predicted a draw or a loss 433 times: 257 times it was correct, and 176 times the team actually won. The model predicted a win 47 times: 22 times the result was a win, and 25 times it was a draw or a loss.
Our goal with this project is to predict wins, so we need to revise the way we score the model, as the accuracy score also rewards correct predictions of a draw or loss.

In [20]:

pd.crosstab(index=combined["actual"], columns=combined["predicted"])

predicted,0,1
actual,,
0,257,25
1,176,22


The precision score shows us how often the result was a win when a win was predicted. When the model predicted a win, the team won only 47% of the time (22 out of 47 predictions).

In [21]:
precision_score(test["target"], preds)

np.float64(0.46808510638297873)
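
Both scores can be recomputed by hand from the four crosstab counts above, which makes the difference between them easier to see. The sketch also adds a simple baseline that is not part of the original analysis: always predicting "not a win".

```python
# Counts taken from the crosstab above (rows: actual, columns: predicted).
tn, fp = 257, 25   # actual 0: predicted 0, predicted 1
fn, tp = 176, 22   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp

# Accuracy counts every correct prediction, wins and non-wins alike.
accuracy = (tn + tp) / total

# Precision only looks at the matches where a win was predicted.
precision = tp / (tp + fp)

# Baseline for comparison: always predict "not a win".
baseline_accuracy = (tn + fp) / total

print(total, accuracy, precision, baseline_accuracy)
```

The baseline comes out slightly above the model's accuracy, which is another reason to judge this model on precision rather than accuracy.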

We can try to improve the model by adding rolling stats for each team. We start by splitting our matches dataframe into teams.

In [22]:

grouped_matches = matches.groupby("team")

This is what one group looks like: all of Brann's matches. The idea behind the rolling stats is that if we are on matchweek 4, we look at how well Brann did in the previous 3 matchweeks. This adds a measure of form, which can play a key part in football.

In [23]:
group = grouped_matches.get_group("Brann").sort_values("date")

In [24]:
group

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,dist,pk,pkatt,season,team,venue_code,opp_code,hour,day_code,target
0,2021-05-09,18:00,Eliteserien,Matchweek 1,Sun,Away,L,1,3,Viking,...,,0,0,2021,Brann,0,17,18,6,0
1,2021-05-12,20:00,Eliteserien,Matchweek 2,Wed,Home,L,0,3,Vålerenga,...,,0,0,2021,Brann,1,18,20,2,0
2,2021-05-16,20:00,Eliteserien,Matchweek 3,Sun,Away,L,0,4,Molde,...,,0,0,2021,Brann,0,9,20,6,0
3,2021-05-20,20:30,Eliteserien,Matchweek 14,Thu,Away,L,2,3,Rosenborg,...,,0,0,2021,Brann,0,11,20,3,0
4,2021-05-24,18:00,Eliteserien,Matchweek 4,Mon,Home,L,1,2,Bodø/Glimt,...,,0,0,2021,Brann,1,1,18,0,0
5,2021-05-27,18:00,Eliteserien,Matchweek 5,Thu,Away,L,0,2,Stabæk,...,,0,1,2021,Brann,0,14,18,3,0
6,2021-05-30,18:00,Eliteserien,Matchweek 6,Sun,Home,W,3,0,Strømsgodset,...,,0,1,2021,Brann,1,15,18,6,1
7,2021-06-13,18:00,Eliteserien,Matchweek 7,Sun,Away,D,0,0,Sarpsborg 08,...,,0,0,2021,Brann,0,13,18,6,0
8,2021-06-20,18:00,Eliteserien,Matchweek 8,Sun,Home,L,1,3,Odd,...,,0,0,2021,Brann,1,10,18,6,0
9,2021-06-24,18:00,Eliteserien,Matchweek 9,Thu,Away,L,0,1,Haugesund,...,,0,0,2021,Brann,0,4,18,3,0


We create a function that takes a group, the columns we want rolling stats for, and the names of the new columns, and then computes the rolling averages and adds them to the group.

In [25]:
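def rolling_averages(group, cols, new_cols):
    # Sort chronologically so each window only looks backwards in time.
    group = group.sort_values("date")
    # closed='left' leaves the current match out of its own window.
    rolling_stats = group[cols].rolling(3, closed='left').mean()
    group[new_cols] = rolling_stats
    # Drop rows without a complete window, such as a team's first three matches.
    group = group.dropna(subset=new_cols)
    return group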
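
The closed='left' argument is what keeps the current match out of its own average. A small sketch shows the difference, using the goals Brann scored in their first five matches of 2021 as listed above.

```python
import pandas as pd

# Goals for in Brann's first five matches of 2021 (from the table above).
goals = pd.Series([1, 0, 0, 2, 1], name="gf")

# Default window: the current match is included, which would leak the result.
with_current = goals.rolling(3).mean()

# closed='left': only the three matches before the current one are used.
prior_only = goals.rolling(3, closed="left").mean()

print(pd.DataFrame({"gf": goals, "with_current": with_current, "prior_only": prior_only}))
```

With closed='left' the first three rows have no complete window and become NaN, which is why rolling_averages drops them.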


We want the rolling averages of goals for, goals against, shots, shots on target, penalty kicks scored and penalty kick attempts.

In [26]:
cols = ["gf", "ga", "sh", "sot", "pk", "pkatt"]
new_cols = [f"{c}_rolling" for c in cols]

In [27]:
new_cols

['gf_rolling',
 'ga_rolling',
 'sh_rolling',
 'sot_rolling',
 'pk_rolling',
 'pkatt_rolling']

This is what the dataframe looks like for Brann with the rolling averages.

In [28]:
rolling_averages(group, cols, new_cols)

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,opp_code,hour,day_code,target,gf_rolling,ga_rolling,sh_rolling,sot_rolling,pk_rolling,pkatt_rolling
3,2021-05-20,20:30,Eliteserien,Matchweek 14,Thu,Away,L,2,3,Rosenborg,...,11,20,3,0,0.333333,3.333333,13.0,3.666667,0.0,0.0
4,2021-05-24,18:00,Eliteserien,Matchweek 4,Mon,Home,L,1,2,Bodø/Glimt,...,1,18,0,0,0.666667,3.333333,13.333333,4.333333,0.0,0.0
5,2021-05-27,18:00,Eliteserien,Matchweek 5,Thu,Away,L,0,2,Stabæk,...,14,18,3,0,1.0,3.0,15.333333,5.333333,0.0,0.0
6,2021-05-30,18:00,Eliteserien,Matchweek 6,Sun,Home,W,3,0,Strømsgodset,...,15,18,6,1,1.0,2.333333,12.666667,6.0,0.0,0.333333
7,2021-06-13,18:00,Eliteserien,Matchweek 7,Sun,Away,D,0,0,Sarpsborg 08,...,13,18,6,0,1.333333,1.333333,13.666667,5.666667,0.0,0.666667
8,2021-06-20,18:00,Eliteserien,Matchweek 8,Sun,Home,L,1,3,Odd,...,10,18,6,0,1.0,0.666667,10.0,3.666667,0.0,0.666667
9,2021-06-24,18:00,Eliteserien,Matchweek 9,Thu,Away,L,0,1,Haugesund,...,4,18,3,0,1.333333,1.0,12.666667,3.0,0.0,0.333333
10,2021-06-30,20:00,Eliteserien,Matchweek 10,Wed,Home,D,1,1,Lillestrøm,...,7,20,2,0,0.333333,1.333333,9.0,2.333333,0.0,0.0
11,2021-07-05,19:00,Eliteserien,Matchweek 11,Mon,Away,L,2,3,Kristiansund,...,6,19,0,0,0.666667,1.666667,11.666667,3.333333,0.0,0.0
12,2021-07-10,20:00,Eliteserien,Matchweek 12,Sat,Home,D,1,1,Tromsø,...,16,20,5,0,1.0,1.666667,10.333333,3.666667,0.0,0.0


Now that we have confirmed that the code works, we apply the rolling_averages function to every team in the matches dataframe.

In [29]:
matches_rolling = matches.groupby("team").apply(lambda x: rolling_averages(x, cols, new_cols))

This is what the new dataframe looks like.

In [30]:
matches_rolling

team,,date,time,comp,round,day,venue,result,gf,ga,opponent,...,opp_code,hour,day_code,target,gf_rolling,ga_rolling,sh_rolling,sot_rolling,pk_rolling,pkatt_rolling
Aalesund,3,2022-04-23,16:00,Eliteserien,Matchweek 4,Sat,Away,W,3,2,Odd,...,10,16,5,1,1.000000,1.000000,11.666667,3.333333,0.0,0.0
Aalesund,4,2022-04-28,20:00,Eliteserien,Matchweek 15,Thu,Away,L,0,2,Lillestrøm,...,7,20,3,0,1.666667,1.666667,8.666667,4.000000,0.0,0.0
Aalesund,5,2022-05-08,20:00,Eliteserien,Matchweek 5,Sun,Away,D,0,0,HamKam,...,3,20,6,0,1.666667,2.000000,11.000000,4.333333,0.0,0.0
Aalesund,6,2022-05-16,18:00,Eliteserien,Matchweek 6,Mon,Home,L,0,2,Molde,...,9,18,0,0,1.000000,1.333333,10.666667,4.000000,0.0,0.0
Aalesund,7,2022-05-22,18:00,Eliteserien,Matchweek 7,Sun,Away,W,2,1,Sarpsborg 08,...,13,18,6,1,0.000000,1.333333,12.333333,2.333333,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Viking,25,2023-10-29,17:00,Eliteserien,Matchweek 26,Sun,Away,L,0,1,Strømsgodset,...,15,17,6,0,1.333333,3.000000,10.666667,3.666667,0.0,0.0
Viking,26,2023-11-04,18:00,Eliteserien,Matchweek 27,Sat,Away,L,0,3,HamKam,...,3,18,5,0,1.333333,2.000000,13.333333,5.333333,0.0,0.0
Viking,27,2023-11-12,17:00,Eliteserien,Matchweek 28,Sun,Home,W,2,1,Sarpsborg 08,...,13,17,6,1,1.000000,2.666667,20.333333,7.333333,0.0,0.0
Viking,28,2023-11-26,17:00,Eliteserien,Matchweek 29,Sun,Away,W,4,0,Aalesund,...,0,17,6,1,0.666667,1.666667,16.333333,6.333333,0.0,0.0


We don't want the extra "team" index level that was created, as it makes the dataframe harder to work with. We also want the index to be unique for each row.

In [31]:
matches_rolling = matches_rolling.droplevel('team')
matches_rolling.index = range(matches_rolling.shape[0])
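
As an alternative to groupby.apply followed by the index cleanup, the per-team frames can be built in a loop and concatenated with a fresh index in one step. This is only a sketch under assumptions: add_form is a hypothetical helper with the same logic as rolling_averages, and the toy data is made up.

```python
import pandas as pd

def add_form(group, cols, window=3):
    # Hypothetical helper mirroring rolling_averages from this notebook.
    group = group.sort_values("date")
    rolled = group[cols].rolling(window, closed="left").mean()
    rolled.columns = [f"{c}_rolling" for c in cols]
    return pd.concat([group, rolled], axis=1).dropna(subset=list(rolled.columns))

# Illustrative data: two teams with four matches each.
toy = pd.DataFrame({
    "team": ["A"] * 4 + ["B"] * 4,
    "date": pd.to_datetime(
        ["2023-04-10", "2023-04-16", "2023-04-23", "2023-04-29"] * 2
    ),
    "gf": [2, 4, 3, 2, 0, 1, 1, 3],
})

# ignore_index=True gives a unique 0..n-1 index without an extra team level.
toy_rolling = pd.concat(
    [add_form(g, ["gf"]) for _, g in toy.groupby("team")],
    ignore_index=True,
)
print(toy_rolling)
```

Only the fourth match of each team survives, because it is the first one with three earlier matches to average over.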


In [32]:
matches_rolling

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,opp_code,hour,day_code,target,gf_rolling,ga_rolling,sh_rolling,sot_rolling,pk_rolling,pkatt_rolling
0,2022-04-23,16:00,Eliteserien,Matchweek 4,Sat,Away,W,3,2,Odd,...,10,16,5,1,1.000000,1.000000,11.666667,3.333333,0.0,0.0
1,2022-04-28,20:00,Eliteserien,Matchweek 15,Thu,Away,L,0,2,Lillestrøm,...,7,20,3,0,1.666667,1.666667,8.666667,4.000000,0.0,0.0
2,2022-05-08,20:00,Eliteserien,Matchweek 5,Sun,Away,D,0,0,HamKam,...,3,20,6,0,1.666667,2.000000,11.000000,4.333333,0.0,0.0
3,2022-05-16,18:00,Eliteserien,Matchweek 6,Mon,Home,L,0,2,Molde,...,9,18,0,0,1.000000,1.333333,10.666667,4.000000,0.0,0.0
4,2022-05-22,18:00,Eliteserien,Matchweek 7,Sun,Away,W,2,1,Sarpsborg 08,...,13,18,6,1,0.000000,1.333333,12.333333,2.333333,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1378,2023-10-29,17:00,Eliteserien,Matchweek 26,Sun,Away,L,0,1,Strømsgodset,...,15,17,6,0,1.333333,3.000000,10.666667,3.666667,0.0,0.0
1379,2023-11-04,18:00,Eliteserien,Matchweek 27,Sat,Away,L,0,3,HamKam,...,3,18,5,0,1.333333,2.000000,13.333333,5.333333,0.0,0.0
1380,2023-11-12,17:00,Eliteserien,Matchweek 28,Sun,Home,W,2,1,Sarpsborg 08,...,13,17,6,1,1.000000,2.666667,20.333333,7.333333,0.0,0.0
1381,2023-11-26,17:00,Eliteserien,Matchweek 29,Sun,Away,W,4,0,Aalesund,...,0,17,6,1,0.666667,1.666667,16.333333,6.333333,0.0,0.0


We create a function to make our predictions for future work. This way, if we want to make changes, we can call this function on the new dataframe without retyping everything for each change.

In [33]:
def make_predictions(data, predictors):
    # Train on matches before 2022 and test on everything after.
    train = data[data["date"] < '2022-01-01']
    test = data[data["date"] > '2022-01-01']
    rf.fit(train[predictors], train["target"])
    preds = rf.predict(test[predictors])
    combined = pd.DataFrame(dict(actual=test["target"], predicted=preds), index=test.index)
    # Precision: the share of predicted wins that were actual wins.
    precision = precision_score(test["target"], preds)
    accuracy = accuracy_score(test["target"], preds)
    return combined, precision, accuracy
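
The cutoff date is written into the function above, so it is easy to end up testing on a different period than intended. One possible variant takes the cutoff as an argument. This is a sketch only: evaluate_split is a hypothetical name, the data is synthetic, and zero_division=0 is an added safeguard for the case where no wins are predicted.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score

def evaluate_split(data, predictors, cutoff, model=None):
    """Hypothetical variant of make_predictions with an explicit cutoff date."""
    if model is None:
        model = RandomForestClassifier(
            n_estimators=50, min_samples_split=10, random_state=1
        )
    cutoff = pd.Timestamp(cutoff)
    train = data[data["date"] < cutoff]
    test = data[data["date"] >= cutoff]
    model.fit(train[predictors], train["target"])
    preds = model.predict(test[predictors])
    combined = pd.DataFrame(
        {"actual": test["target"], "predicted": preds}, index=test.index
    )
    precision = precision_score(test["target"], preds, zero_division=0)
    accuracy = accuracy_score(test["target"], preds)
    return combined, precision, accuracy

# Synthetic data, only to show the call (not the match data).
rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=n, freq="3D"),
    "venue_code": rng.integers(0, 2, size=n),
    "hour": rng.integers(16, 21, size=n),
})
toy["target"] = ((toy["venue_code"] == 1) & (rng.random(n) < 0.7)).astype(int)

toy_combined, toy_precision, toy_accuracy = evaluate_split(
    toy, ["venue_code", "hour"], "2023-01-01"
)
print(len(toy_combined), toy_precision, toy_accuracy)
```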

We run the function on the matches_rolling dataframe with the same predictors plus the rolling columns we created earlier.

In [34]:
combined, error, error_accuracy = make_predictions(matches_rolling, predictors + new_cols)

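The precision score improved slightly, but the results are not great. When the model predicted a win, the result was a win about 50% of the time. Note that make_predictions splits the data at 2022-01-01, so the test set now covers both the 2022 and 2023 seasons (951 matches) and the model is trained on 2021 only. The scores are therefore not a like-for-like comparison with the first model, which was tested on 2023 alone.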

In [35]:
error

np.float64(0.4976958525345622)

The accuracy has also improved slightly.

In [36]:
error_accuracy

0.6025236593059937

The model has 465 true negative predictions and 269 false negative predictions, along with 109 false positive predictions and 108 true positive predictions.

In [37]:
pd.crosstab(index=combined["actual"], columns=combined["predicted"])

predicted,0,1
actual,,
0,465,109
1,269,108


When this model was tested on English Premier League matches it received a precision score of 0.625, so I'm a little disappointed with 0.497 for this project. In future work we will tune the parameters of the random forest classifier, and perhaps try other non-linear models to see if they can perform better.
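
One way the parameter search mentioned above could look is a grid search scored on precision, with folds that respect the time order. This is a sketch under assumptions: the data is synthetic, the grid values are arbitrary examples, and with the real data the rows would need to be sorted by date first so that each fold trains on earlier matches than it tests on.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic data standing in for date-sorted predictors and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(240, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=240) > 0).astype(int)

# Example grid only; the values are not tuned recommendations.
param_grid = {"n_estimators": [50, 100], "min_samples_split": [5, 10, 20]}

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    scoring="precision",             # the score we care about in this project
    cv=TimeSeriesSplit(n_splits=3),  # each fold tests on later rows than it trains on
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```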

The not-so-great result compared to the Premier League dataset might also be because the top teams in England are more consistent, while in Norwegian football, as Tore-André Flo once said, anyone can beat anyone.