### Plan
#### 1. Data collection
#### 2. Preprocess data
#### 3. Feature engineering
#### 4. Modeling
#### 5. Evaluation

### 1. Data collection

### Sofascore

![SofaScore](./images/sofascore.png)


### LiveScore

![LiveScore](./images/livescore.png)

### WhoScored

![WhoScored](./images/whoscored.png)


Scrape from as many sources as possible; the more, the better.
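
Combining several sources usually means reconciling team-name spellings and dropping matches that appear more than once. A minimal sketch, assuming each scraper already returns a DataFrame; the two frames, the `TEAM_ALIASES` mapping, and the column names are hypothetical and will differ per site:

```python
import pandas as pd

# Hypothetical output of two scrapers; real column names will differ per site.
sofascore = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02"],
    "home": ["Man City", "Liverpool"],
    "away": ["Arsenal", "Chelsea"],
    "home_goals": [2, 1],
    "away_goals": [0, 1],
})
whoscored = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-03"],
    "home": ["Manchester City", "Tottenham"],
    "away": ["Arsenal", "Liverpool"],
    "home_goals": [2, 0],
    "away_goals": [0, 3],
})

# Hypothetical alias table mapping each source's spelling to one canonical name.
TEAM_ALIASES = {"Man City": "Mancity", "Manchester City": "Mancity"}


def normalize(frame: pd.DataFrame, source: str) -> pd.DataFrame:
    """Return a copy with canonical team names, parsed dates and a source tag."""
    out = frame.copy()
    out["date"] = pd.to_datetime(out["date"])
    for col in ("home", "away"):
        out[col] = out[col].replace(TEAM_ALIASES)
    out["source"] = source
    return out


combined = pd.concat(
    [normalize(sofascore, "sofascore"), normalize(whoscored, "whoscored")],
    ignore_index=True,
)
# Keep one row per fixture; the source listed first wins when both have it.
matches = combined.drop_duplicates(subset=["date", "home", "away"]).reset_index(drop=True)
```

Keeping the `source` column makes it possible to trace conflicting statistics back to the site they came from.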

### 2. Preprocess data

Convert the data into tabular form.

Example:

In [1]:
import pandas as pd
import random

# data = {
#     'Team A': ['Liverpool', 'Mancity', 'Mancity'],
#     'Team B': ['Arsenal', 'Real Madrid', 'Barcelona'],
#     'Goals A': [1, 2, 3],
#     'Goals B': [0, 1, 2],
#     'Shots A': [10, 15, 20],
#     'Shots B': [5, 10, 15],
#     'Shots on Target A': [5, 10, 15],
#     'Shots on Target B': [3, 6, 9],
#     'Possession % A': [55, 60, 65],
#     'Possession % B': [45, 40, 35],
#     'Passes A': [400, 500, 600],
#     'Passes B': [300, 200, 100],
#     'Fouls A': [10, 20, 30],
#     'Fouls B': [5, 10, 15],
#     'Yellow Cards A': [1, 2, 3],
#     'Yellow Cards B': [0, 1, 2],
#     'Red Cards A': [0, 0, 1],
#     'Red Cards B': [0, 0, 0],
# }
teams = ['Liverpool', 'Mancity', 'Arsenal', 'Chelsea', 'Tottenham', 'Real Madrid', 'Barcelona', 'Atletico Madrid', 'Juventus', 'AC Milan']
data = {
    'Time': pd.date_range(start='1/1/2020', periods=100, freq='D'),
    'Team A': [],
    'Team B': [],
    'Goals A': [random.randint(0, 5) for _ in range(100)],
    'Goals B': [random.randint(0, 5) for _ in range(100)],
    'Shots A': [random.randint(5, 25) for _ in range(100)],
    'Shots B': [random.randint(5, 25) for _ in range(100)],
    'Shots on Target A': [random.randint(1, 10) for _ in range(100)],
    'Shots on Target B': [random.randint(1, 10) for _ in range(100)],
    'Possession % A': [random.randint(40, 60) for _ in range(100)],
    'Possession % B': [],
    'Passes A': [random.randint(200, 800) for _ in range(100)],
    'Passes B': [random.randint(200, 800) for _ in range(100)],
    'Fouls A': [random.randint(5, 30) for _ in range(100)],
    'Fouls B': [random.randint(5, 30) for _ in range(100)],
    'Yellow Cards A': [random.randint(0, 5) for _ in range(100)],
    'Yellow Cards B': [random.randint(0, 5) for _ in range(100)],
    'Red Cards A': [random.randint(0, 2) for _ in range(100)],
    'Red Cards B': [random.randint(0, 2) for _ in range(100)],
}

# Pick two distinct teams for each match and derive Team B's possession from Team A's
for i in range(100):
    team_a, team_b = random.sample(teams, 2)
    data['Team A'].append(team_a)
    data['Team B'].append(team_b)
    data['Possession % B'].append(100 - data['Possession % A'][i])

df = pd.DataFrame(data)
df
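
The table above is synthetic. Scraped match data often arrives as nested records instead, and `pd.json_normalize` is one way to flatten it. A sketch, assuming a hypothetical record layout (the keys below are illustrative, not any site's actual response format):

```python
import pandas as pd

# Hypothetical nested records, e.g. parsed from a scraped JSON payload.
raw_matches = [
    {
        "time": "2020-01-01",
        "home": {"name": "Liverpool", "goals": 1, "shots": 10},
        "away": {"name": "Arsenal", "goals": 0, "shots": 5},
    },
    {
        "time": "2020-01-02",
        "home": {"name": "Mancity", "goals": 2, "shots": 15},
        "away": {"name": "Real Madrid", "goals": 1, "shots": 10},
    },
]

# Nested keys become dotted column names such as "home.name".
flat = pd.json_normalize(raw_matches)
flat["time"] = pd.to_datetime(flat["time"])

# Rename to the column scheme used in this notebook.
flat = flat.rename(columns={
    "time": "Time",
    "home.name": "Team A", "away.name": "Team B",
    "home.goals": "Goals A", "away.goals": "Goals B",
    "home.shots": "Shots A", "away.shots": "Shots B",
})
```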

### 3. Feature engineering

Compute indicators of each team's current form, such as mean goals over the last 5 matches, mean passes, mean possession, mean shots, and so on.

Head-to-head history: win/loss differential, etc.
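
The head-to-head idea can be sketched as follows: for every match, count how often each side beat the other in their earlier meetings. This is only an illustration on a toy table; the helper `add_head_to_head` and the `H2H Diff` column are hypothetical names, not part of the notebook:

```python
import pandas as pd


def add_head_to_head(matches: pd.DataFrame) -> pd.DataFrame:
    """Add 'H2H Diff': Team A's earlier wins over Team B minus its earlier losses."""
    matches = matches.sort_values("Time").reset_index(drop=True)
    wins = {}  # (winner, loser) -> number of wins seen so far
    diffs = []
    for a, b, goals_a, goals_b in zip(
        matches["Team A"], matches["Team B"], matches["Goals A"], matches["Goals B"]
    ):
        # Record the differential before counting this match, so the feature
        # only reflects meetings that happened earlier.
        diffs.append(wins.get((a, b), 0) - wins.get((b, a), 0))
        if goals_a > goals_b:
            wins[(a, b)] = wins.get((a, b), 0) + 1
        elif goals_b > goals_a:
            wins[(b, a)] = wins.get((b, a), 0) + 1
    matches["H2H Diff"] = diffs
    return matches


toy = pd.DataFrame({
    "Time": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"]),
    "Team A": ["Liverpool", "Arsenal", "Liverpool", "Arsenal"],
    "Team B": ["Arsenal", "Liverpool", "Arsenal", "Chelsea"],
    "Goals A": [2, 0, 1, 1],
    "Goals B": [1, 3, 1, 0],
})
h2h = add_head_to_head(toy)
print(h2h["H2H Diff"].tolist())  # [0, -1, 2, 0]
```

Draws leave the differential unchanged, and a pair with no earlier meetings gets 0.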

In [2]:
# Rolling means over the last 5 matches, computed per team and per side:
# "A" columns use the matches a team played as Team A, "B" columns those it played as Team B.
# transform() keeps the original row index, so every value lands on its own match
# (resetting the index after groupby().rolling() would misalign the rows).
# The first 4 matches in each group are NaN because the window is not yet full.
df = df.sort_values(by='Time')
stats = ['Goals', 'Possession %', 'Shots', 'Shots on Target', 'Passes', 'Fouls']
for stat in stats:
    for side in ('A', 'B'):
        df[f'{stat} {side} Mean'] = (
            df.groupby(f'Team {side}')[f'{stat} {side}']
              .transform(lambda s: s.rolling(window=5).mean())
        )

df
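
Two caveats apply to rolling features like these: a window that includes the current match leaks the result being predicted, and grouping by `Team A` alone ignores the matches a team played as `Team B`. One possible way to address both, sketched on a toy table, is to reshape to one row per team per match and shift before rolling. The `Goals Form` column names and the window of 2 are choices made for this example only:

```python
import pandas as pd

toy = pd.DataFrame({
    "Time": pd.to_datetime(["2020-01-01", "2020-01-08", "2020-01-15", "2020-01-22"]),
    "Team A": ["Liverpool", "Arsenal", "Liverpool", "Chelsea"],
    "Team B": ["Arsenal", "Chelsea", "Chelsea", "Liverpool"],
    "Goals A": [2, 1, 0, 3],
    "Goals B": [0, 1, 4, 1],
})

# One row per (match, team), regardless of which side the team played on.
side_a = toy[["Time", "Team A", "Goals A"]].rename(
    columns={"Team A": "Team", "Goals A": "Goals"})
side_b = toy[["Time", "Team B", "Goals B"]].rename(
    columns={"Team B": "Team", "Goals B": "Goals"})
long = pd.concat([side_a, side_b], ignore_index=True).sort_values("Time")

# shift(1) removes the current match from the window, so the feature only uses
# matches played earlier; min_periods=1 keeps teams with a short history.
long["Goals Form"] = long.groupby("Team")["Goals"].transform(
    lambda s: s.shift(1).rolling(window=2, min_periods=1).mean()
)

# Attach the per-team feature back to the match table for both sides.
form = long[["Time", "Team", "Goals Form"]]
toy = toy.merge(
    form.rename(columns={"Team": "Team A", "Goals Form": "Goals A Form"}),
    on=["Time", "Team A"], how="left",
).merge(
    form.rename(columns={"Team": "Team B", "Goals Form": "Goals B Form"}),
    on=["Time", "Team B"], how="left",
)
```

A team's first match has no history, so its form value is NaN and needs to be dropped or imputed before modeling.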


### 4. Modeling

The plan is to take both teams' statistics and their head-to-head history as input, and to predict the result (win/loss) and the number of goals.

In [3]:
X = df[['Team A', 'Team B', 'Goals A Mean', 'Goals B Mean', 'Possession % A Mean', 'Possession % B Mean', 'Shots A Mean', 'Shots B Mean', 'Shots on Target A Mean', 'Shots on Target B Mean', 'Passes A Mean', 'Passes B Mean', 'Fouls A Mean', 'Fouls B Mean']]
# Goals might be difficult to predict, so we can try to predict the outcome of the match instead.
# Label encoding: 1 = Team A wins, 0 = draw, -1 = Team B wins
y = (df['Goals A'] > df['Goals B']).astype(int) - (df['Goals A'] < df['Goals B']).astype(int)
y
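
With `X` and `y` in place, one possible next step is a simple baseline classifier. The sketch below is an assumption about how the modeling could look, not the notebook's chosen model: it uses scikit-learn's `LogisticRegression` on synthetic stand-in features, one-hot encodes the team names, and splits chronologically so that the test matches come after the training matches.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200
teams = ["Liverpool", "Mancity", "Arsenal", "Chelsea"]

# Synthetic stand-in for the engineered features, assumed sorted by time.
X = pd.DataFrame({
    "Team A": rng.choice(teams, size=n),
    "Team B": rng.choice(teams, size=n),
    "Goals A Mean": rng.uniform(0, 3, size=n),
    "Goals B Mean": rng.uniform(0, 3, size=n),
})
# Synthetic labels: 1 = Team A wins, 0 = draw, -1 = Team B wins.
y = pd.Series(rng.choice([-1, 0, 1], size=n))

# Rolling features are NaN until a team has enough history; drop those rows.
mask = X.notna().all(axis=1)
X, y = X[mask], y[mask]

# Chronological split: the first 80% of matches train, the rest test.
split = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

numeric_cols = ["Goals A Mean", "Goals B Mean"]
model = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["Team A", "Team B"]),
        (StandardScaler(), numeric_cols),
    ),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
pred = model.predict(X_test)
accuracy = (pred == y_test.to_numpy()).mean()
```

Because the labels here are random, the accuracy should stay near chance level. On real data, a useful reference point is the accuracy of always predicting the most frequent outcome.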