# IE 582 Statistical Learning for Data Mining
### Part 1: Project Data Preparation
**Date:** *13-01-2025*  
- **Name:** *Fatih Mehmet Yılmaz*
    - **School Number:** *2024702054*  
- **Name:** *Yusuf Sina Öztürk*
    - **School Number:** *2023702075*  

In this part, we ***prepared the data*** based on the findings of the ***exploratory data analysis.***

### 0. Setup
- Import all libraries that are used in the different parts of the project.
- Read the provided dataset.

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import math
import seaborn as sns

import json
from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file


In [3]:
match_data = pd.read_csv('match_data.csv')

### 1. Preparation of Data
- First, we deleted the rows that should not be used in any analysis or training set. The `suspended` and `stopped` columns show which rows are not usable.
    - For this reason, we removed the roughly 7.8k rows flagged as `suspended` and the 4.1k rows flagged as `stopped`.
    - After that, 56,127 rows (about 56.1k) remained.

In [81]:
len(match_data[match_data["suspended"]])

7817

In [83]:
len(match_data[match_data["stopped"] == True])

4115

In [84]:
match_data = match_data[(match_data['stopped'] == False) & (match_data['suspended'] == False)]
match_data = match_data.drop(columns=['stopped', 'suspended'])
len(match_data)

56127
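The two counts above add up to 11,932, so the number of rows the filter actually removes depends on how many rows carry both flags. A minimal sketch of that check, using a made-up frame (`toy` and its values are assumptions for illustration, not taken from `match_data.csv`):

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "fixture_id": [1, 1, 2, 2, 3],
    "suspended": [True, False, True, False, False],
    "stopped": [True, False, False, True, False],
})

# Rows flagged by both columns are removed only once by the filter.
both = int((toy["suspended"] & toy["stopped"]).sum())
either = int((toy["suspended"] | toy["stopped"]).sum())

# Same filter as in the cell above, written with boolean negation.
kept = toy[~toy["suspended"] & ~toy["stopped"]]

print(both, either, len(kept))
```

Running the same three lines on `match_data` before the filter would show whether the removed rows equal the sum of the two counts or fewer.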

#### 1.1 Exploratory Data Analysis

- To check the properties of our dataset, we used the compact `ydata_profiling` library to inspect the summary statistics of each column.

- The call below creates an HTML report, which is also **included in the repository.**

In [17]:
%%time
report = ProfileReport(match_data, title="HW2_EDA", minimal=True)
report.to_file("HW2_EDA.html")

CPU times: user 1min 1s, sys: 35.6 s, total: 1min 36s
Wall time: 36.5 s


#### 1.2 Actions taken based on exploratory data analysis report

- The dataset contains a feature named `ticking` that is constant across the whole set. Since it carries no information, we dropped it.

In [85]:
match_data = match_data.drop(columns=['ticking'])
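Constant columns like this can also be found programmatically rather than read off the report. A small sketch under the assumption that "constant" means at most one distinct value, with missing values counted as a value; the `toy` frame and the `mostly_missing` column are hypothetical:

```python
import pandas as pd

# Hypothetical toy frame; only the column name `ticking` comes from the dataset.
toy = pd.DataFrame({
    "ticking": [False, False, False],
    "minute": [1, 2, 3],
    "mostly_missing": [None, None, 5.0],
})

# Count distinct values per column, treating NaN as its own value.
n_unique = toy.nunique(dropna=False)

# Columns with a single distinct value carry no information for a model.
constant_cols = n_unique[n_unique <= 1].index.tolist()
toy = toy.drop(columns=constant_cols)

print(constant_cols)
```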

#### 1.3 Comments and approaches for actions

- The EDA report also raised warnings about a high percentage of zeros and about missing values in the dataset.
  - For missing values, we impute with linear interpolation.
  - For zero values, we checked whether each zero comes from the start of the match or is misinformation. Misinformed values would also be replaced by linear interpolation.
      - After checking all distributions and some of the instances, we found no misinformation among the zero-valued instances.
  - The percentage features are close to normally distributed, and at first sight they seem to contain misinformed values, because the 0% and 100% values do not fit the rest of the histogram.
      - These turned out to be the first seconds of the match, so it is fine to keep them as they are.
  - For text features, we applied one-hot encoding to make them binary, which helps ML models learn from them.
      - Only the `name` features of the opposing teams are not encoded, because they carry much the same information as `fixture_id`. If we later build league-based models or add a league dimension to the dataset, we could encode them for those more sophisticated attempts.
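The one-hot encoding step mentioned in the list could look roughly like the sketch below. It uses `pd.get_dummies` on a made-up frame; the state labels `"1"`, `"X"` and `"2"` and the `state` prefix are assumptions for illustration, not confirmed values or names from the project:

```python
import pandas as pd

# Hypothetical toy frame; the state labels are assumed for illustration.
toy = pd.DataFrame({
    "fixture_id": [1, 1, 2],
    "current_state": ["1", "X", "2"],
})

# Replace the text column with one binary indicator column per category.
encoded = pd.get_dummies(toy, columns=["current_state"], prefix="state", dtype=int)

print(encoded.columns.tolist())
```

Each row ends up with exactly one indicator set to 1, so the indicator columns are linearly dependent; passing `drop_first=True` is a common choice for linear models.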

In [86]:
# Convert the datetime features to datetime dtype so the dataset can be sorted chronologically
datetime_columns = ['current_time', 'half_start_datetime', 'match_start_datetime', 'latest_bookmaker_update']
match_data[datetime_columns] = match_data[datetime_columns].apply(pd.to_datetime)

match_data = match_data.sort_values(by=['fixture_id', 'halftime', 'current_time']).reset_index(drop=True)

In [87]:
# Group the features by the type of imputation they need
# Features that contain missing values
missing_features = match_data.columns[match_data.isnull().any()].tolist()

percentage_features = [
    "Successful Passes Percentage - away", 
    "Successful Passes Percentage - home", 
    "Ball Possession % - away", 
    "Ball Possession % - home"
]
text_feature = ["current_state"]

integer_missing_features = set(missing_features) - set(percentage_features + text_feature)

In [88]:
# Imputed Version of Dataset
match_data_imputed = match_data.copy()

In [89]:
# Imputation for Integer Type Features
for fixture_id, group in match_data.groupby('fixture_id'):
    group = group.sort_values('current_time')

    for feature in integer_missing_features:
        missing_indices = group[group[feature].isna()].index

        for idx in missing_indices:
            valid_before = group.loc[:idx, feature].dropna()
            lower_bound = valid_before.iloc[-1] if not valid_before.empty else pd.NA
            lower_time = group.loc[valid_before.index[-1], 'current_time'] if not valid_before.empty else group.loc[idx, 'current_time']

            valid_after = group.loc[idx:, feature].dropna()
            upper_bound = valid_after.iloc[0] if not valid_after.empty else pd.NA
            upper_time = group.loc[valid_after.index[0], 'current_time'] if not valid_after.empty else group.loc[idx, 'current_time']

            if pd.isna(group.at[idx, feature]):
                if pd.isna(lower_bound) and pd.isna(upper_bound):
                    continue
                elif pd.isna(lower_bound):
                    group.at[idx, feature] = 0
                elif pd.isna(upper_bound):
                    group.at[idx, feature] = lower_bound
                else:
                    time_diff = (upper_time - lower_time).total_seconds()
                    value_diff = upper_bound - lower_bound
                    time_diff_missing = (group.loc[idx, 'current_time'] - lower_time).total_seconds()
                    interpolated_value = lower_bound + (value_diff * time_diff_missing / time_diff)
                    group.at[idx, feature] = interpolated_value
        match_data_imputed.loc[group.index, feature] = group[feature]
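The row-by-row loop above can be slow on tens of thousands of rows. As a hedged alternative, the same time-weighted interpolation for a single fixture can be sketched with `np.interp`. The helper `interpolate_by_time`, the `toy` frame and the `shots` column are hypothetical, and the sketch assumes the rows are sorted by `current_time` with no duplicate timestamps among the known values:

```python
import numpy as np
import pandas as pd

def interpolate_by_time(times, values, left=None):
    """Fill NaNs in `values` linearly in elapsed time (rows must be time-sorted)."""
    seconds = (times - times.iloc[0]).dt.total_seconds().to_numpy()
    v = values.to_numpy(dtype=float)
    known = ~np.isnan(v)
    if not known.any():
        return values.copy()  # no anchor value at all: leave the gaps untouched
    # Gaps before the first known value get `left` (first known value if None);
    # gaps after the last known value get the last known value.
    filled = np.interp(seconds, seconds[known], v[known], left=left)
    return pd.Series(filled, index=values.index)

# Hypothetical rows of one fixture.
toy = pd.DataFrame({
    "current_time": pd.to_datetime([
        "2025-01-01 12:00:00", "2025-01-01 12:00:10",
        "2025-01-01 12:00:20", "2025-01-01 12:00:40",
        "2025-01-01 12:00:50",
    ]),
    "shots": [np.nan, 2.0, np.nan, 5.0, np.nan],
})

# left=0.0 mirrors the rule above: a gap before any observation becomes 0.
toy["shots_filled"] = interpolate_by_time(toy["current_time"], toy["shots"], left=0.0)
print(toy["shots_filled"].tolist())
```

Leaving `left=None` back-fills leading gaps with the first known value instead, which corresponds to the rule used for the percentage features.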

In [90]:
# Rounding after linear interpolation
match_data_imputed[list(integer_missing_features)] = match_data_imputed[list(integer_missing_features)].round(0)

In [91]:
# Imputation for Float Type Features
for fixture_id, group in match_data.groupby('fixture_id'):
    group = group.sort_values('current_time')

    for feature in percentage_features:
        missing_indices = group[group[feature].isna()].index
        for idx in missing_indices:
            valid_before = group.loc[:idx, feature].dropna()
            lower_bound = valid_before.iloc[-1] if not valid_before.empty else pd.NA
            lower_time = group.loc[valid_before.index[-1], 'current_time'] if not valid_before.empty else group.loc[idx, 'current_time']

            valid_after = group.loc[idx:, feature].dropna()
            upper_bound = valid_after.iloc[0] if not valid_after.empty else pd.NA
            upper_time = group.loc[valid_after.index[0], 'current_time'] if not valid_after.empty else group.loc[idx, 'current_time']

            if pd.isna(group.at[idx, feature]):
                if pd.isna(lower_bound) and pd.isna(upper_bound):
                    # No valid value anywhere in this fixture: leave the gap as is
                    continue
                elif pd.isna(lower_bound):
                    # Leading gap: back-fill with the next valid value
                    group.at[idx, feature] = upper_bound
                elif pd.isna(upper_bound):
                    # Trailing gap: carry the last valid value forward
                    group.at[idx, feature] = lower_bound
                else:
                    time_diff = (upper_time - lower_time).total_seconds()
                    value_diff = upper_bound - lower_bound
                    time_diff_missing = (group.loc[idx, 'current_time'] - lower_time).total_seconds()
                    interpolated_value = lower_bound + (value_diff * time_diff_missing / time_diff)
                    group.at[idx, feature] = interpolated_value
        match_data_imputed.loc[group.index, feature] = group[feature]

In [92]:
# Imputation for Text Type Features
for fixture_id, group in match_data.groupby('fixture_id'):
    group = group.sort_values('current_time')

    for feature in text_feature:
        missing_indices = group[group[feature].isna()].index
        for idx in missing_indices:
            valid_before = group.loc[:idx, feature].dropna()
            lower_bound = valid_before.iloc[-1] if not valid_before.empty else pd.NA
            lower_time = group.loc[valid_before.index[-1], 'current_time'] if not valid_before.empty else group.loc[idx, 'current_time']

            valid_after = group.loc[idx:, feature].dropna()
            upper_bound = valid_after.iloc[0] if not valid_after.empty else pd.NA
            upper_time = group.loc[valid_after.index[0], 'current_time'] if not valid_after.empty else group.loc[idx, 'current_time']

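            if pd.isna(group.at[idx, feature]):
                if pd.isna(lower_bound):
                    # Leading gap: no earlier state is known, use the placeholder "X"
                    group.at[idx, feature] = "X"
                elif pd.isna(upper_bound):
                    # Trailing gap: carry the last known state forward
                    group.at[idx, feature] = lower_bound
                else:
                    # Gap between two known states: use the placeholder "X"
                    group.at[idx, feature] = "X"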
        match_data_imputed.loc[group.index, feature] = group[feature]
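The rule in the loop above (forward-fill trailing gaps, mark every other gap with `"X"`) can also be written without iterating over rows. A minimal sketch for the time-sorted states of one fixture; the helper `fill_state` and the sample labels are hypothetical:

```python
import pandas as pd

def fill_state(states, placeholder="X"):
    """Forward-fill trailing gaps; mark every other gap with `placeholder`."""
    # A gap is trailing when no known value follows it.
    trailing_gap = states.isna() & states.bfill().isna()
    # Trailing gaps take the last known state, other entries stay as they are.
    filled = states.where(~trailing_gap, states.ffill())
    # Remaining gaps (leading, interior, or a fixture with no state at all).
    return filled.fillna(placeholder)

toy = pd.Series(["1", None, "2", None, None], dtype="object")
print(fill_state(toy).tolist())  # -> ['1', 'X', '2', '2', '2']
```

Applied per fixture, for example through `groupby('fixture_id')[...].transform(fill_state)`, this should give the same result as the loop as long as the rows are sorted by `current_time`.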

In [93]:
match_data_imputed.to_csv('match_data_imputed.csv', index=False)

In [4]:
match_data_imputed = pd.read_csv('match_data_imputed.csv')
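One thing to watch with this CSV round trip: datetime columns are written as text and come back as plain strings unless they are parsed again. A small in-memory sketch of the difference; the `toy` frame is hypothetical, and only the column name `current_time` comes from the dataset:

```python
import io
import pandas as pd

# Hypothetical toy frame with one datetime column.
toy = pd.DataFrame({
    "fixture_id": [1, 1],
    "current_time": pd.to_datetime(["2025-01-01 12:00:00", "2025-01-01 12:00:10"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without parse_dates the timestamps are read back as strings.
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates the column is restored to a datetime dtype.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["current_time"])

print(plain["current_time"].dtype, parsed["current_time"].dtype)
```

For the real file, the list passed to `parse_dates` would be the `datetime_columns` list defined earlier.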

- The dataset is now ready for analysis and for training ML models.