# Rybin, IU5-63B, RK2
## Variant 17
## Dataset: Predict FIFA 2018 Man of the Match
The dataset was created for predicting the man of the match. We will solve a classification problem: predicting the value of the `Man of the Match` field from the remaining columns.

### Importing libraries

In [47]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

### Loading the data and initial analysis

In [48]:
# The file has an .xls extension but is parsed here as CSV text
data = pd.read_csv('FIFA 2018 Statistics.xls')

In [49]:
data.head()

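| | Date | Team | Opponent | Goal Scored | Ball Possession % | Attempts | On-Target | Off-Target | Blocked | Corners | ... | Yellow Card | Yellow & Red | Red | Man of the Match | 1st Goal | Round | PSO | Goals in PSO | Own goals | Own goal Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14-06-2018 | Russia | Saudi Arabia | 5 | 40 | 13 | 7 | 3 | 3 | 6 | ... | 0 | 0 | 0 | Yes | 12.0 | Group Stage | No | 0 | NaN | NaN |
| 1 | 14-06-2018 | Saudi Arabia | Russia | 0 | 60 | 6 | 0 | 3 | 3 | 2 | ... | 0 | 0 | 0 | No | NaN | Group Stage | No | 0 | NaN | NaN |
| 2 | 15-06-2018 | Egypt | Uruguay | 0 | 43 | 8 | 3 | 3 | 2 | 0 | ... | 2 | 0 | 0 | No | NaN | Group Stage | No | 0 | NaN | NaN |
| 3 | 15-06-2018 | Uruguay | Egypt | 1 | 57 | 14 | 4 | 6 | 4 | 5 | ... | 0 | 0 | 0 | Yes | 89.0 | Group Stage | No | 0 | NaN | NaN |
| 4 | 15-06-2018 | Morocco | Iran | 0 | 64 | 13 | 3 | 6 | 4 | 5 | ... | 1 | 0 | 0 | No | NaN | Group Stage | No | 0 | 1.0 | 90.0 |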


In [50]:
rows_count = data.shape[0]
columns_count = data.shape[1]
print('Total rows: {}\nTotal columns: {}'.format(rows_count, columns_count))

Total rows: 128
Total columns: 27


In [51]:
data.dtypes

Date                       object
Team                       object
Opponent                   object
Goal Scored                 int64
Ball Possession %           int64
Attempts                    int64
On-Target                   int64
Off-Target                  int64
Blocked                     int64
Corners                     int64
Offsides                    int64
Free Kicks                  int64
Saves                       int64
Pass Accuracy %             int64
Passes                      int64
Distance Covered (Kms)      int64
Fouls Committed             int64
Yellow Card                 int64
Yellow & Red                int64
Red                         int64
Man of the Match           object
1st Goal                  float64
Round                      object
PSO                        object
Goals in PSO                int64
Own goals                 float64
Own goal Time             float64
dtype: object

In [52]:
data.isnull().sum()

Date                        0
Team                        0
Opponent                    0
Goal Scored                 0
Ball Possession %           0
Attempts                    0
On-Target                   0
Off-Target                  0
Blocked                     0
Corners                     0
Offsides                    0
Free Kicks                  0
Saves                       0
Pass Accuracy %             0
Passes                      0
Distance Covered (Kms)      0
Fouls Committed             0
Yellow Card                 0
Yellow & Red                0
Red                         0
Man of the Match            0
1st Goal                   34
Round                       0
PSO                         0
Goals in PSO                0
Own goals                 116
Own goal Time             116
dtype: int64

Missing values in the `Own goals`, `1st Goal`, and `Own goal Time` columns mean that no such goal was scored. `Own goals` holds a count, so its missing values can be replaced with zeros. The other two columns hold a time, so replacing a missing value with zero would be incorrect (it would read as a goal scored at minute 0) and would distort the analysis. We therefore simply drop these two columns.
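To illustrate why zero-filling a time column is misleading, here is a small sketch that uses only the five `1st Goal` values visible in the `data.head()` output above. The variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# The first five "1st Goal" values from the data.head() output above
first_goal = pd.Series([12.0, None, None, 89.0, None], name="1st Goal")

# pandas skips NaN by default: this is the mean over rows that have a goal
mean_skipping_nan = first_goal.mean()

# Filling with 0 pretends that the goalless teams scored at minute 0
mean_zero_filled = first_goal.fillna(0).mean()

print(mean_skipping_nan)  # 50.5
print(mean_zero_filled)   # 20.2
```

The zero-filled mean is pulled far below either observed goal time, which is the kind of distortion that dropping the column avoids.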

### Data processing

In [53]:
data = data.drop(["1st Goal", "Own goal Time"], axis=1).fillna(0)
data.head()

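| | Date | Team | Opponent | Goal Scored | Ball Possession % | Attempts | On-Target | Off-Target | Blocked | Corners | ... | Distance Covered (Kms) | Fouls Committed | Yellow Card | Yellow & Red | Red | Man of the Match | Round | PSO | Goals in PSO | Own goals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14-06-2018 | Russia | Saudi Arabia | 5 | 40 | 13 | 7 | 3 | 3 | 6 | ... | 118 | 22 | 0 | 0 | 0 | Yes | Group Stage | No | 0 | 0.0 |
| 1 | 14-06-2018 | Saudi Arabia | Russia | 0 | 60 | 6 | 0 | 3 | 3 | 2 | ... | 105 | 10 | 0 | 0 | 0 | No | Group Stage | No | 0 | 0.0 |
| 2 | 15-06-2018 | Egypt | Uruguay | 0 | 43 | 8 | 3 | 3 | 2 | 0 | ... | 112 | 12 | 2 | 0 | 0 | No | Group Stage | No | 0 | 0.0 |
| 3 | 15-06-2018 | Uruguay | Egypt | 1 | 57 | 14 | 4 | 6 | 4 | 5 | ... | 111 | 6 | 0 | 0 | 0 | Yes | Group Stage | No | 0 | 0.0 |
| 4 | 15-06-2018 | Morocco | Iran | 0 | 64 | 13 | 3 | 6 | 4 | 5 | ... | 101 | 22 | 1 | 0 | 0 | No | Group Stage | No | 0 | 1.0 |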


In [54]:
data.isnull().sum()

Date                      0
Team                      0
Opponent                  0
Goal Scored               0
Ball Possession %         0
Attempts                  0
On-Target                 0
Off-Target                0
Blocked                   0
Corners                   0
Offsides                  0
Free Kicks                0
Saves                     0
Pass Accuracy %           0
Passes                    0
Distance Covered (Kms)    0
Fouls Committed           0
Yellow Card               0
Yellow & Red              0
Red                       0
Man of the Match          0
Round                     0
PSO                       0
Goals in PSO              0
Own goals                 0
dtype: int64

We also drop the fields that are not needed for the task.

In [55]:
columns_to_drop = ["Date", "Team", "Opponent", "Round"]
data = data.drop(columns=columns_to_drop)

Encode the categorical features.

In [56]:
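# LabelEncoder assigns integers in sorted label order, so "No" -> 0 and "Yes" -> 1
label_encoder = LabelEncoder()
data["Man of the Match"] = label_encoder.fit_transform(data["Man of the Match"])
# The same encoder object is refit here, so it now describes the PSO column
data["PSO"] = label_encoder.fit_transform(data["PSO"])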
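As a quick check of what the encoding does, the following sketch runs `LabelEncoder` on a toy list with the same two labels as `Man of the Match`. The list and variable names are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Toy column with the same two labels as "Man of the Match"
labels = ["Yes", "No", "No", "Yes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so index 0 is "No" and index 1 is "Yes"
print(encoder.classes_.tolist())  # ['No', 'Yes']
print(encoded.tolist())           # [1, 0, 0, 1]

# inverse_transform recovers the original strings
decoded = encoder.inverse_transform(encoded)
print(decoded.tolist())           # ['Yes', 'No', 'No', 'Yes']
```

Because the notebook refits a single encoder on `PSO`, its `classes_` afterwards describe `PSO`. Decoding each column independently would need one encoder per column.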

### Training and test sets

In [57]:
X = data.drop(columns=["Man of the Match"])  # features
y = data["Man of the Match"]  # target variable

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
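With 128 rows and `test_size=0.2`, the split yields 102 training rows and 26 test rows, since the test size is rounded up. The sketch below confirms this on placeholder data. It also passes `stratify`, which is an addition that the notebook does not use, to keep the class proportions equal in both parts. The arrays and their 50/50 labels are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset (128);
# the feature values and the balanced labels are assumptions
X_demo = np.arange(128 * 2).reshape(128, 2)
y_demo = np.array([0, 1] * 64)

# Same call as in the notebook, plus stratify (not in the original)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print(X_tr.shape[0], X_te.shape[0])  # 102 26
print(int(y_te.sum()))               # 13 positives out of 26
```

Without `stratify`, the class balance of a 26-row test set depends on the random seed, which adds noise to the metrics.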

In [59]:
# Decision tree model
decision_tree = DecisionTreeClassifier(random_state=1)
decision_tree.fit(X_train, y_train)

# Random forest model
random_forest = RandomForestClassifier(random_state=1)
random_forest.fit(X_train, y_train)

# Predictions on the test set
y_pred_dt = decision_tree.predict(X_test)
y_pred_rf = random_forest.predict(X_test)

In [67]:
# Evaluate the decision tree model
print("Decision tree model:")
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
print("Precision:", precision_score(y_test, y_pred_dt, average='weighted'))
print("Recall:", recall_score(y_test, y_pred_dt, average='weighted'))
print("F1-score:", f1_score(y_test, y_pred_dt, average='weighted'))

# Evaluate the random forest model
print("\nRandom forest model:")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Precision:", precision_score(y_test, y_pred_rf, average='weighted'))
print("Recall:", recall_score(y_test, y_pred_rf, average='weighted'))
print("F1-score:", f1_score(y_test, y_pred_rf, average='weighted'))

Decision tree model:
Accuracy: 0.6153846153846154
Precision: 0.6466165413533834
Recall: 0.6153846153846154
F1-score: 0.59375

Random forest model:
Accuracy: 0.6923076923076923
Precision: 0.7124183006535947
Recall: 0.6923076923076923
F1-score: 0.6848484848484849
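With `average='weighted'`, each score is computed per class and then averaged with weights equal to each class's share of the true labels. This is also why recall equals accuracy in the output above. The sketch below reproduces the weighted precision by hand. The labels are made up for illustration and are not the notebook's test set, and `weighted_precision` is a hypothetical helper.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 0, 1]

def weighted_precision(y_true, y_pred):
    """Per-class precision averaged by class support (count in y_true)."""
    total = 0.0
    for cls in sorted(set(y_true)):
        predicted = sum(1 for p in y_pred if p == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        support = sum(1 for t in y_true if t == cls)
        precision = correct / predicted if predicted else 0.0
        total += precision * support / len(y_true)
    return total

manual = weighted_precision(y_true, y_pred)
library = precision_score(y_true, y_pred, average="weighted")
print(manual, library)  # both are (2/3)*0.4 + (5/7)*0.6

# Weighted recall reduces to (correct predictions) / (all rows), i.e. accuracy
weighted_recall = recall_score(y_true, y_pred, average="weighted")
accuracy = accuracy_score(y_true, y_pred)
print(weighted_recall, accuracy)
```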


Based on these results, the random forest outperforms the single decision tree on every metric, though not by much: accuracy is 0.69 versus 0.62, which on the 26-row test set is a difference of two correctly classified rows (18 versus 16).
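A single split with a test set this small gives a noisy estimate, so one way to check the comparison is k-fold cross-validation. The sketch below is an addition, not part of the original notebook. It uses synthetic data from `make_classification` as a stand-in for `X` and `y`, and the choice of 5 folds is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (assumption: 128 rows, 10 features)
X_demo, y_demo = make_classification(n_samples=128, n_features=10, random_state=1)

# 5 stratified folds: every row is used for testing exactly once
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

models = {
    "decision tree": DecisionTreeClassifier(random_state=1),
    "random forest": RandomForestClassifier(random_state=1),
}

cv_scores = {}
for name, model in models.items():
    # cross_val_score returns one accuracy value per fold
    cv_scores[name] = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    print(f"{name}: mean={cv_scores[name].mean():.3f}, std={cv_scores[name].std():.3f}")
```

Comparing the mean and spread across folds shows whether the gap between the two models is larger than the fold-to-fold variation.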