# Data analysis
In this notebook I collect information about the seasons 14-15 to 19-20 combined.  
I will answer the following questions:
* Which 5 teams won the most games overall?
* Which 5 teams lost the most games overall?
* Which 5 teams won the most games away?
* Which 5 teams won the most games at home?
* Which 5 teams lost the most games away?
* Which 5 teams lost the most games at home?
* What is the overall percentage of home wins?
* What is the overall percentage of away wins?
* What is the overall percentage of home losses?
* What is the overall percentage of away losses?
* Is there a correlation between playing at home and winning the game?
* Is there a correlation between playing away and winning the game?



## Import datasets and libraries

In [13]:
import pandas as pd
import numpy as np
import seaborn as sns

In [14]:
df = pd.read_csv("../assets/data/clean_data.csv")
df.drop("Unnamed: 0", inplace=True, axis=1)
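
The `Unnamed: 0` column dropped above is typically what pandas produces when a CSV was saved together with its index. Assuming that is the case here, a minimal sketch of an alternative is to pass `index_col=0` to `read_csv`. The CSV text below is a made-up two-row stand-in, not the real `clean_data.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV text mimicking a file written with its index:
# the first header field is empty, so pandas names it "Unnamed: 0".
csv_text = ",HomeTeam,AwayTeam,FTR\n0,Arsenal,Chelsea,H\n1,Everton,Watford,D\n"

# Default read: the saved index comes back as an extra column.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

Both approaches end with the same data columns; `index_col=0` simply avoids the separate `drop` step.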

## Which 5 teams won or lost the most games? Away? Home?

#### Feature engineering
Create new columns to make it easy to count each team's wins and losses.

##### Creation of a full time winner column

In [44]:
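def get_winner(row):
    # Return the name of the winning team, or "Draw" when FTR is "D".
    # Reading the team from the row itself avoids a positional lookup
    # that only works while the index is 0..n-1.
    if row["FTR"] == "H":
        return row["HomeTeam"]
    elif row["FTR"] == "A":
        return row["AwayTeam"]
    else:
        return "Draw"
df["ftr_winner"] = df.apply(get_winner, axis=1)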

##### Creation of a full time loser column

In [48]:
def get_looser(row):
    # Return the name of the losing team, or "Draw" when FTR is "D".
    # The column keeps its original name, "ftr_looser".
    if row["FTR"] == "H":
        return row["AwayTeam"]
    elif row["FTR"] == "A":
        return row["HomeTeam"]
    else:
        return "Draw"
df["ftr_looser"] = df.apply(get_looser, axis=1)
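
The two row-wise functions above can also be written without `apply`. The sketch below is one possible vectorized variant using `Series.where`, run on a made-up three-game table that only borrows the notebook's column names. It is an illustration under those assumptions, not the method used to produce the outputs in this notebook.

```python
import pandas as pd

# Toy data: same column names as the notebook, invented values.
toy = pd.DataFrame({
    "HomeTeam": ["Arsenal", "Everton", "Watford"],
    "AwayTeam": ["Chelsea", "Burnley", "Swansea"],
    "FTR": ["H", "A", "D"],
})

home_won = toy["FTR"] == "H"
away_won = toy["FTR"] == "A"

# Series.where keeps the value where the condition holds and
# falls back to the second argument elsewhere.
toy["ftr_winner"] = toy["HomeTeam"].where(
    home_won, toy["AwayTeam"].where(away_won, "Draw")
)
toy["ftr_looser"] = toy["AwayTeam"].where(
    home_won, toy["HomeTeam"].where(away_won, "Draw")
)

print(toy[["FTR", "ftr_winner", "ftr_looser"]])
```

On a few thousand rows the speed difference is small, so the choice between the two forms is mostly a matter of readability.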

#### Value count 

##### Most wins overall

In [56]:
most_wins = df["ftr_winner"].value_counts(sort=True)
# "Draw" is the most frequent value, so 6 rows are needed to see the top 5 teams.
most_wins.head(6)

Draw         547
Man City     156
Liverpool    139
Chelsea      130
Tottenham    126
Arsenal      119
Name: ftr_winner, dtype: int64

##### Most losses overall 

In [58]:
most_losses = df["ftr_looser"].value_counts(sort=True)
# "Draw" is the most frequent value, so 6 rows are needed to see the top 5 teams.
most_losses.head(6)

Draw              547
Crystal Palace    105
Watford            92
West Ham           91
Bournemouth        91
Southampton        90
Name: ftr_looser, dtype: int64

##### Most home wins 

In [62]:
home_wins = df.loc[df["FTR"]== "H"]
most_home_wins = home_wins["ftr_winner"].value_counts(sort=True)
most_home_wins.head(5)

Man City     86
Arsenal      77
Liverpool    77
Tottenham    74
Chelsea      71
Name: ftr_winner, dtype: int64

##### Most away wins

In [70]:
away_wins = df.loc[df["FTR"]== "A"]
most_away_wins = away_wins["AwayTeam"].value_counts(sort=True)
most_away_wins.head(5)

Man City      70
Liverpool     62
Chelsea       59
Tottenham     52
Man United    50
Name: AwayTeam, dtype: int64

##### Most home losses

In [73]:
home_losses = df.loc[df["FTR"] == "A"]
most_home_losses = home_losses["HomeTeam"].value_counts(sort=True)
most_home_losses.head(5)

Crystal Palace    55
Southampton       40
West Ham          38
Burnley           38
Bournemouth       36
Name: HomeTeam, dtype: int64

##### Most away losses 

In [75]:
away_losses = df.loc[df["FTR"] == "H"]
most_away_losses = away_losses["AwayTeam"].value_counts(sort=True)
most_away_losses.head(5)

Watford        57
Newcastle      55
Bournemouth    55
Everton        54
West Ham       53
Name: AwayTeam, dtype: int64
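
The four home/away counts computed above can also be gathered into a single per-team table. The following is a minimal sketch on invented data; `summary` and its column labels are hypothetical names chosen for this example.

```python
import pandas as pd

# Toy data: same column names as the notebook, invented values.
toy = pd.DataFrame({
    "HomeTeam": ["Arsenal", "Chelsea", "Arsenal", "Everton"],
    "AwayTeam": ["Chelsea", "Arsenal", "Everton", "Chelsea"],
    "FTR": ["H", "A", "D", "A"],
})

home = toy["FTR"] == "H"  # games won by the home side
away = toy["FTR"] == "A"  # games won by the away side

# Building a DataFrame from a dict of Series aligns them on team name;
# teams missing from one count get NaN, which is replaced by 0.
summary = pd.DataFrame({
    "home_wins": toy.loc[home, "HomeTeam"].value_counts(),
    "away_wins": toy.loc[away, "AwayTeam"].value_counts(),
    "home_losses": toy.loc[away, "HomeTeam"].value_counts(),
    "away_losses": toy.loc[home, "AwayTeam"].value_counts(),
}).fillna(0).astype(int)

summary["wins"] = summary["home_wins"] + summary["away_wins"]
summary["losses"] = summary["home_losses"] + summary["away_losses"]

print(summary.sort_values("wins", ascending=False))
```

Sorting this table by any column answers the corresponding "top 5" question with `head(5)`.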

## Overall percentage of home wins, away wins, home losses and away losses

##### Percentage of home wins and away wins

In [83]:
percentages = df["FTR"].value_counts(normalize=True) *100
percentages = percentages.to_frame()
percentages

| | FTR |
|---|---|
| H | 45.701754 |
| A | 30.350877 |
| D | 23.947368 |


Home win percentage: 45.70%.  

Away win percentage: 30.35%.  

Every home win is an away loss and every away win is a home loss, so the loss percentages are the same two figures swapped.  

Away loss percentage: 45.70%.  

Home loss percentage: 30.35%.  

The remaining 23.95% of games ended in a draw.
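
The relationship between the win and loss percentages can be written out explicitly. This is a small sketch on ten invented results; `report` is a hypothetical name used only here.

```python
import pandas as pd

# Toy results: 5 home wins, 3 away wins, 2 draws (invented values).
toy = pd.DataFrame({"FTR": ["H", "H", "A", "D", "H", "A", "D", "H", "A", "H"]})

pct = toy["FTR"].value_counts(normalize=True) * 100

# A home win is an away loss and an away win is a home loss,
# so the loss figures reuse the "A" and "H" percentages.
report = pd.Series({
    "home wins": pct["H"],
    "away wins": pct["A"],
    "home losses": pct["A"],
    "away losses": pct["H"],
    "draws": pct["D"],
})

print(report.round(2))
```

For either side, wins, losses and draws add up to 100%.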

## Wins or losses when trailing or leading at half time

In [95]:
home_wins_when_leading = home_wins.loc[home_wins["HTR"] == home_wins["FTR"]]

In [98]:
home_wins_when_leading

| | Div | Date | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HTHG | HTAG | HTR | ... | BbAvAHH | BbMxAHA | BbAvAHA | PSCH | PSCD | PSCA | Numerical_ftr | Numerical_htr | ftr_winner | ftr_looser |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | E0 | 2014-08-17 | Liverpool | Southampton | 2.0 | 1.0 | H | 1.0 | 0.0 | H | ... | 2.00 | 1.94 | 1.87 | 1.43 | 4.83 | 8.75 | 1.0 | 1.0 | Liverpool | Southampton |
| 15 | E0 | 2014-08-23 | Swansea | Burnley | 1.0 | 0.0 | H | 1.0 | 0.0 | H | ... | 2.09 | 1.87 | 1.79 | 1.68 | 3.89 | 5.99 | 1.0 | 1.0 | Swansea | Burnley |
| 18 | E0 | 2014-08-24 | Tottenham | QPR | 4.0 | 0.0 | H | 3.0 | 0.0 | H | ... | 1.81 | 2.20 | 2.07 | 1.61 | 4.20 | 6.18 | 1.0 | 1.0 | Tottenham | QPR |
| 19 | E0 | 2014-08-25 | Man City | Liverpool | 3.0 | 1.0 | H | 1.0 | 0.0 | H | ... | 2.06 | 1.88 | 1.83 | 1.85 | 4.01 | 4.35 | 1.0 | 1.0 | Man City | Liverpool |
| 24 | E0 | 2014-08-30 | QPR | Sunderland | 1.0 | 0.0 | H | 1.0 | 0.0 | H | ... | 2.06 | 1.85 | 1.82 | 2.56 | 3.28 | 3.05 | 1.0 | 1.0 | QPR | Sunderland |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2251 | E0 | 2019-04-26 | Liverpool | Huddersfield | 5.0 | 0.0 | H | 3.0 | 0.0 | H | ... | 1.89 | 2.02 | 1.97 | 1.08 | 14.10 | 34.97 | 1.0 | 1.0 | Liverpool | Huddersfield |
| 2261 | E0 | 2019-03-05 | Everton | Burnley | 2.0 | 0.0 | H | 2.0 | 0.0 | H | ... | 2.16 | 1.80 | 1.75 | 1.71 | 3.85 | 5.56 | 1.0 | 1.0 | Everton | Burnley |
| 2265 | E0 | 2019-04-05 | West Ham | Southampton | 3.0 | 0.0 | H | 1.0 | 0.0 | H | ... | 2.29 | 1.73 | 1.66 | 2.40 | 3.69 | 3.00 | 1.0 | 1.0 | West Ham | Southampton |
| 2273 | E0 | 2019-12-05 | Crystal Palace | Bournemouth | 5.0 | 3.0 | H | 3.0 | 1.0 | H | ... | 2.52 | 1.60 | 1.54 | 1.79 | 4.40 | 4.16 | 1.0 | 1.0 | Crystal Palace | Bournemouth |
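
The filter above keeps the home wins in which the home team was already leading at half time. To turn that into a percentage and to cover the trailing and level cases as well, one option is a cross-tabulation of `HTR` against `FTR`. The sketch below uses six invented games; `share_leading` and `table` are hypothetical names.

```python
import pandas as pd

# Toy data: same column names as the notebook, invented values.
toy = pd.DataFrame({
    "HTR": ["H", "D", "A", "H", "D", "H"],
    "FTR": ["H", "H", "H", "A", "D", "H"],
})

# Share of home wins in which the home team led at half time.
home_wins = toy.loc[toy["FTR"] == "H"]
share_leading = (home_wins["HTR"] == "H").mean() * 100

# Row-normalized table: for each half-time result (rows),
# the percentage of each full-time result (columns).
table = pd.crosstab(toy["HTR"], toy["FTR"], normalize="index") * 100

print(f"Home wins while leading at half time: {share_leading:.1f}%")
print(table.round(1))
```

Reading the `H` row of `table` gives the outcome split when the home team leads at the break, and the `A` row gives it when the home team trails.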


In [99]:
df.columns

Index(['Div', 'Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG',
       'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC',
       'AC', 'HY', 'AY', 'HR', 'AR', 'B365H', 'B365D', 'B365A', 'BWH', 'BWD',
       'BWA', 'IWH', 'IWD', 'IWA', 'LBH', 'LBD', 'LBA', 'PSH', 'PSD', 'PSA',
       'WHH', 'WHD', 'WHA', 'VCH', 'VCD', 'VCA', 'Bb1X2', 'BbMxH', 'BbAvH',
       'BbMxD', 'BbAvD', 'BbMxA', 'BbAvA', 'BbOU', 'BbMx>2.5', 'BbAv>2.5',
       'BbMx<2.5', 'BbAv<2.5', 'BbAH', 'BbAHh', 'BbMxAHH', 'BbAvAHH',
       'BbMxAHA', 'BbAvAHA', 'PSCH', 'PSCD', 'PSCA', 'Numerical_ftr',
       'Numerical_htr', 'ftr_winner', 'ftr_looser'],
      dtype='object')