### 4.5 Feature Engineering

The initial dataset was constructed from the player’s viewpoint. However, to align with the machine learning model’s requirements, we need to shift our focus to the team’s perspective. 

Consequently, we will reshape the dataset by merging and selectively deleting variables, so that each record describes teams rather than individual players. The model can then be trained to predict match outcomes from team-level features.

In [1]:
# Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Dataset
total_data = pd.read_csv('../data/interim/player_match_data.csv')
total_data.shape

(946025, 25)

In [2]:
    # Variable 1: player_team_id 

# 'player_team_id' function
def find_most_common_team_number(group):
        all_team_ids = pd.concat([group['team_1_id'], group['team_2_id']])
        team_numbers = all_team_ids.dropna().astype(int)
        most_common_team_number = np.argmax(np.bincount(team_numbers))
        return most_common_team_number

# Applying function to dataset
total_data['player_team_id'] = total_data.groupby('team_name').apply(find_most_common_team_number).reindex(total_data['team_name']).values

# ----------------------------------------------------------------------------------------------------------------------------

    # Variable 2: winning_team

# 'winning_team' function
def get_winning_team(team_1_score, team_2_score):
        if team_1_score == team_2_score: return 0
        elif team_1_score > team_2_score: return 1
        else: return 2

# Applying function to dataset
total_data['winning_team'] = total_data.apply(lambda row: get_winning_team(row['team_1_score'], row['team_2_score']), axis=1)

# ----------------------------------------------------------------------------------------------------------------------------

    # Variable 3: winning_team_id 

# Creating column called 'winning_team_id' (doesn't need function)
total_data['winning_team_id'] = np.where(total_data['winning_team'] == 1, total_data['team_1_id'],
                                np.where(total_data['winning_team'] == 2, total_data['team_2_id'], 0))

# ----------------------------------------------------------------------------------------------------------------------------

    # Variable 4: player_has_won 

# Creating column called 'player_has_won' (doesn't need function)
total_data['player_has_won'] = np.where(total_data['winning_team_id'] == total_data['player_team_id'], 1, 0)

# ----------------------------------------------------------------------------------------------------------------------------

# Selecting target variables

target_1 = 'winning_team'
target_2 = 'player_has_won'

# Show small overview of the dataset
total_data.head(3)




Unnamed: 0,adr,assists,deaths,fkdiff,hs,kdratio,kills,rating,match_id,player_id,...,hour,day,week,month,year,weekday,player_team_id,winning_team,winning_team_id,player_has_won
0,163.2,3,10,1,10,90.0%,32,2.44,32227,5736,...,13,2,26,7,2016,5,6621,2,6621,1
1,81.0,3,6,1,5,75.0%,17,1.55,32227,2532,...,13,2,26,7,2016,5,6621,2,6621,1
2,77.6,3,10,1,11,75.0%,16,1.41,32227,7382,...,13,2,26,7,2016,5,6621,2,6621,1
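
The row-wise `apply` used for `winning_team` can be slow on a dataset of this size. A minimal sketch of a vectorized alternative with `np.select`, using toy scores rather than the real data (the `scores` frame is an assumption made for illustration):

```python
import numpy as np
import pandas as pd

# Toy scores, not taken from the real dataset
scores = pd.DataFrame({
    "team_1_score": [16, 10, 15],
    "team_2_score": [14, 16, 15],
})

# Conditions are checked in order; rows matching none fall back to `default`
conditions = [
    scores["team_1_score"] > scores["team_2_score"],
    scores["team_1_score"] < scores["team_2_score"],
]
scores["winning_team"] = np.select(conditions, [1, 2], default=0)

print(scores["winning_team"].tolist())
```

The encoding matches the function above: 1 when team 1 wins, 2 when team 2 wins, and 0 for a draw.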


In [3]:
unique_teams_1 = total_data[['team_1_id', 'team_1_name']].rename(columns={'team_1_id': 'team_id', 'team_1_name': 'team_name'})
unique_teams_2 = total_data[['team_2_id', 'team_2_name']].rename(columns={'team_2_id': 'team_id', 'team_2_name': 'team_name'})
unique_teams = pd.concat([unique_teams_1, unique_teams_2]).drop_duplicates().reset_index(drop=True)
unique_teams.info()

# SAVE THE DATASET
unique_teams.to_csv('../data/interim/team_data.csv', index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5256 entries, 0 to 5255
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   team_id    5256 non-null   int64 
 1   team_name  5256 non-null   object
dtypes: int64(1), object(1)
memory usage: 82.3+ KB
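
The 5256 rows above are unique (ID, name) pairs, which does not by itself guarantee that every name maps to a single ID. Because the next step merges on `team_name`, a name shared by several IDs would multiply rows. Whether that happens in the real data is not shown here; the following is only a sketch of how one might check, on a toy lookup table:

```python
import pandas as pd

# Toy lookup table; "Alpha" deliberately appears under two IDs
teams = pd.DataFrame({
    "team_id": [1, 2, 3, 1],
    "team_name": ["Alpha", "Beta", "Alpha", "Alpha"],
}).drop_duplicates()

# Count distinct IDs per name; anything above 1 is ambiguous as a merge key
ids_per_name = teams.groupby("team_name")["team_id"].nunique()
ambiguous_names = ids_per_name[ids_per_name > 1].index.tolist()

print(ambiguous_names)
```

An empty list would mean the name is safe to use as a merge key.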


***Create a team dataframe: Group all matches by MatchID and TeamID***

A sub-dataframe is created with the team information for each match, giving two rows per match (one per team). Each row holds the IDs of the team's players, the mean of `adr` and `kdratio`, and the sum of the remaining statistics. This captures each **team's performance** in every match played.

In [4]:
total_data = pd.merge(total_data, unique_teams, left_on='team_name', right_on='team_name', how='left')

# Make invalid teams searchable
total_data = total_data.drop('team_name', axis=1)
total_data['team_id'] = total_data['team_id'].fillna(-1).astype(int)

# Keep only the records with a valid team
total_data = total_data[total_data['team_id'] != -1]

total_data.head()

Unnamed: 0,adr,assists,deaths,fkdiff,hs,kdratio,kills,rating,match_id,player_id,...,day,week,month,year,weekday,player_team_id,winning_team,winning_team_id,player_has_won,team_id
0,163.2,3,10,1,10,90.0%,32,2.44,32227,5736,...,2,26,7,2016,5,6621,2,6621,1,6621
1,81.0,3,6,1,5,75.0%,17,1.55,32227,2532,...,2,26,7,2016,5,6621,2,6621,1,6621
2,77.6,3,10,1,11,75.0%,16,1.41,32227,7382,...,2,26,7,2016,5,6621,2,6621,1,6621
3,77.0,2,10,-1,6,85.0%,14,1.38,32227,5698,...,2,26,7,2016,5,6621,2,6621,1,6621
4,61.2,4,12,4,4,85.0%,10,1.16,32227,10563,...,2,26,7,2016,5,6621,2,6621,1,6621


In [5]:
# Parse the non-numerical fields
total_data["adr"] = pd.to_numeric(total_data["adr"], errors="coerce")
total_data["kdratio"] = (
    pd.to_numeric(total_data["kdratio"].str.rstrip("%"), errors="coerce") / 100.0
)

# Fill each player's missing values with that player's own mean (mean() skips NaN)
total_data['adr'] = total_data.groupby('player_id')['adr'].transform(lambda x: x.fillna(x.mean()))
total_data['kdratio'] = total_data.groupby('player_id')['kdratio'].transform(lambda x: x.fillna(x.mean()))

# Deleting rows with null values
null_rows = total_data.loc[total_data.isnull().any(axis=1)]

total_data = total_data.dropna()
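
To make the imputation behaviour concrete, here is a small sketch on toy data (the `toy` frame and its values are invented for illustration). It shows that a gap is filled with the player's own mean, while a player with no valid value at all stays null and is removed by `dropna()`:

```python
import numpy as np
import pandas as pd

# Toy data: player 1 has one gap, player 2 has no valid value at all
toy = pd.DataFrame({
    "player_id": [1, 1, 1, 2],
    "adr": [80.0, np.nan, 100.0, np.nan],
})

# Fill each gap with that player's own mean (mean() skips NaN by default)
toy["adr"] = toy.groupby("player_id")["adr"].transform(lambda s: s.fillna(s.mean()))

# Player 2 has nothing to average, so the value stays NaN and the row is dropped
toy = toy.dropna()
print(toy)
```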


***Deleting irrelevant variables***

- The ID and name variables were removed because the machine learning model cannot use them.
- The time variables were deleted because they performed poorly with both target variables.
- Although target 2 (`player_has_won`) helped us discover correlations between variables, the target needed for the ML model is target 1 (`winning_team`).

In [6]:

# Drop unused columns for simplicity
total_data.drop(['hour', 'day', 'week', 'month', 'weekday', 'year', 'rating', 'team_1_name', 'team_2_name'], axis=1, inplace=True)
total_data.drop(['team_1_score', 'team_2_score', 'winning_team_id', 'player_has_won'], axis=1, inplace=True)
total_data.head()

Unnamed: 0,adr,assists,deaths,fkdiff,hs,kdratio,kills,match_id,player_id,team_1_id,team_2_id,data_unix,map,player_team_id,winning_team,team_id
0,163.2,3,10,1,10,0.9,32,32227,5736,6619,6621,1467476700000,Train,6621,2,6621
1,81.0,3,6,1,5,0.75,17,32227,2532,6619,6621,1467476700000,Train,6621,2,6621
2,77.6,3,10,1,11,0.75,16,32227,7382,6619,6621,1467476700000,Train,6621,2,6621
3,77.0,2,10,-1,6,0.85,14,32227,5698,6619,6621,1467476700000,Train,6621,2,6621
4,61.2,4,12,4,4,0.85,10,32227,10563,6619,6621,1467476700000,Train,6621,2,6621


In [7]:
# Group by match_id and team_id, then average or sum the stats and collect the player IDs
team_stats = total_data.groupby(['match_id', 'team_id']).agg({
    'player_id': lambda x: frozenset(x),
    'adr': 'mean',
    'assists': 'sum',
    'deaths': 'sum',
    'fkdiff': 'sum',
    'hs': 'sum',
    'kdratio': 'mean',
    'kills': 'sum',
}).reset_index()

team_stats.head()



Unnamed: 0,match_id,team_id,player_id,adr,assists,deaths,fkdiff,hs,kdratio,kills
0,12838,4411,"(884, 7148, 29, 39)",73.369221,0,38,6,0,0.691815,73
1,12838,4443,(7150),43.525,0,19,-7,0,0.495,5
2,12839,4411,"(884, 7148, 29, 39)",73.369221,0,30,11,0,0.691815,64
3,12839,4443,(7150),43.525,0,17,-5,0,0.495,10
4,12840,4444,"(6796, 7154, 7156, 7158)",74.894429,0,26,3,0,0.67792,63


In [8]:
# Add a new column with the size of frozenset
team_stats['team_size'] = team_stats['player_id'].apply(len)

# Sort the DataFrame by the size of frozenset
team_stats = team_stats.sort_values(by='team_size')

match_ids = team_stats[team_stats['team_size'] != 5]['match_id'].to_list()

team_stats = team_stats[~team_stats['match_id'].isin(match_ids)]

# Drop the temporary 'team_size' column
team_stats.drop(columns=['team_size'], inplace=True)

team_stats.rename(columns={
    'player_id':'members',
    'adr': 'avg_adr',
    'assists': 'sum_assists',
    'deaths': 'sum_deaths',
    'fkdiff': 'sum_fkdiffs',
    'hs': 'sum_hs',
    'kdratio': 'mean_kdratio',
    'kills': 'sum_kills',
    }, inplace=True)

team_stats.head()

Unnamed: 0,match_id,team_id,members,avg_adr,sum_assists,sum_deaths,sum_fkdiffs,sum_hs,mean_kdratio,sum_kills
150825,84590,7551,"(13731, 17605, 17831, 16731, 15166)",69.62,23,100,-7,40,0.6692,92
153007,85712,6898,"(13476, 8943, 8944, 9170, 12476)",70.52,18,107,-10,35,0.6398,93
154486,86435,10089,"(18949, 18950, 18951, 18740, 8796)",65.24,16,82,3,31,0.6174,65
151064,84716,9980,"(17609, 10992, 10993, 11258, 8891)",62.42,26,98,-6,32,0.623,69
154483,86434,8680,"(9289, 10764, 12564, 13461, 13238)",65.26,19,95,-6,48,0.6644,85
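
The filter above removes a whole match as soon as one of its teams does not have exactly five players. A compact sketch of the same idea using a boolean mask instead of a list of match IDs, on toy data (the `toy_stats` frame is an assumption for illustration):

```python
import pandas as pd

# Toy team-level rows: match 1 is complete, match 2 has a four-player side
toy_stats = pd.DataFrame({
    "match_id": [1, 1, 2, 2],
    "team_id": [10, 20, 10, 30],
    "members": [
        frozenset({1, 2, 3, 4, 5}),
        frozenset({6, 7, 8, 9, 10}),
        frozenset({1, 2, 3, 4, 5}),
        frozenset({11, 12, 13, 14}),
    ],
})

# Roster size per row, then the smallest and largest roster within each match
team_size = toy_stats["members"].apply(len)
per_match = team_size.groupby(toy_stats["match_id"])
complete = (per_match.transform("min") == 5) & (per_match.transform("max") == 5)

# Keep a match only if every side fielded exactly five players
toy_stats = toy_stats[complete].reset_index(drop=True)
print(toy_stats["match_id"].tolist())
```

This sketch does not check that each match has exactly two teams; that would need a separate count per `match_id`.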


***Merge the teams dataframe with the main dataframe***

Merging adds the team statistics to the main dataset, so every row carries both the player view and the team view. The player-level information is no longer needed, so the per-player columns are dropped; the player IDs survive only in the team `members` columns. Dropping the resulting duplicate rows then leaves one row per match, with the statistics of both teams side by side.

In [9]:
# Merge with the original DataFrame to get PlayerIDs
merged_df = pd.merge(total_data, team_stats, how='left', left_on=['match_id', 'team_1_id'], right_on=['match_id', 'team_id'])
merged_df = pd.merge(merged_df, team_stats, how='left', left_on=['match_id', 'team_2_id'], right_on=['match_id', 'team_id'], suffixes=('_team_1', '_team_2'))


merged_df.head(10)


Unnamed: 0,adr,assists,deaths,fkdiff,hs,kdratio,kills,match_id,player_id,team_1_id,...,sum_kills_team_1,team_id,members_team_2,avg_adr_team_2,sum_assists_team_2,sum_deaths_team_2,sum_fkdiffs_team_2,sum_hs_team_2,mean_kdratio_team_2,sum_kills_team_2
0,163.2,3,10,1,10,0.9,32,32227,5736,6619,...,48.0,6621.0,"(5698, 10563, 2532, 5736, 7382)",92.0,15.0,48.0,6.0,36.0,0.82,89.0
1,81.0,3,6,1,5,0.75,17,32227,2532,6619,...,48.0,6621.0,"(5698, 10563, 2532, 5736, 7382)",92.0,15.0,48.0,6.0,36.0,0.82,89.0
2,77.6,3,10,1,11,0.75,16,32227,7382,6619,...,48.0,6621.0,"(5698, 10563, 2532, 5736, 7382)",92.0,15.0,48.0,6.0,36.0,0.82,89.0
3,77.0,2,10,-1,6,0.85,14,32227,5698,6619,...,48.0,6621.0,"(5698, 10563, 2532, 5736, 7382)",92.0,15.0,48.0,6.0,36.0,0.82,89.0
4,61.2,4,12,4,4,0.85,10,32227,10563,6619,...,48.0,6621.0,"(5698, 10563, 2532, 5736, 7382)",92.0,15.0,48.0,6.0,36.0,0.82,89.0
5,67.8,2,16,-2,7,0.55,12,32227,2492,6619,...,48.0,6621.0,"(5698, 10563, 2532, 5736, 7382)",92.0,15.0,48.0,6.0,36.0,0.82,89.0
6,81.8,0,19,-2,6,0.5,13,32227,11247,6619,...,48.0,6621.0,"(5698, 10563, 2532, 5736, 7382)",92.0,15.0,48.0,6.0,36.0,0.82,89.0
7,77.7,1,20,1,5,0.55,12,32227,10814,6619,...,48.0,6621.0,"(5698, 10563, 2532, 5736, 7382)",92.0,15.0,48.0,6.0,36.0,0.82,89.0
8,43.0,0,17,-1,4,0.5,8,32227,5737,6619,...,48.0,6621.0,"(5698, 10563, 2532, 5736, 7382)",92.0,15.0,48.0,6.0,36.0,0.82,89.0
9,23.3,1,17,-2,2,0.25,3,32227,168,6619,...,48.0,6621.0,"(5698, 10563, 2532, 5736, 7382)",92.0,15.0,48.0,6.0,36.0,0.82,89.0
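
Because these are left merges, matches that were filtered out of `team_stats` end up with empty team columns. As a hedged sketch (toy frames, not the real data), pandas' `validate` and `indicator` options can make both the key uniqueness and the unmatched rows explicit:

```python
import pandas as pd

# Toy player-level rows (two players in match 1, one in match 2) and team-level stats
players = pd.DataFrame({
    "match_id": [1, 1, 2],
    "team_1_id": [10, 10, 10],
    "team_2_id": [20, 20, 30],
})
stats = pd.DataFrame({
    "match_id": [1, 1],
    "team_id": [10, 20],
    "sum_kills": [80, 75],
})

# validate="many_to_one" raises MergeError if stats has duplicate (match_id, team_id) keys;
# indicator=True adds a "_merge" column marking rows that found no partner
merged = players.merge(
    stats,
    how="left",
    left_on=["match_id", "team_1_id"],
    right_on=["match_id", "team_id"],
    validate="many_to_one",
    indicator=True,
)

unmatched = merged.loc[merged["_merge"] == "left_only", "match_id"].tolist()
print(unmatched)
```

Here match 2 has no team statistics, so it is reported as unmatched instead of silently producing empty columns.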


In [10]:
# Drop unnecessary columns and duplicates
merged_df.drop(
    [
        'adr', 
        'assists',
        'deaths',
        'fkdiff',
        'hs',
        'kdratio',
        'kills',
        'player_id',
        'player_team_id',
        'team_id_x',
        'team_id_y',
        'team_id',
        'team_1_id',
        'team_2_id'
        ], 
    axis=1,
    inplace=True
    )


# Collapse the repeated per-player rows into one row per match
merged_df.drop_duplicates(ignore_index=True, inplace=True)
merged_df.head()

Unnamed: 0,match_id,data_unix,map,winning_team,members_team_1,avg_adr_team_1,sum_assists_team_1,sum_deaths_team_1,sum_fkdiffs_team_1,sum_hs_team_1,mean_kdratio_team_1,sum_kills_team_1,members_team_2,avg_adr_team_2,sum_assists_team_2,sum_deaths_team_2,sum_fkdiffs_team_2,sum_hs_team_2,mean_kdratio_team_2,sum_kills_team_2
0,32227,1467476700000,Train,2,"(168, 5737, 11247, 2492, 10814)",58.72,4.0,89.0,-6.0,24.0,0.47,48.0,"(5698, 10563, 2532, 5736, 7382)",92.0,15.0,48.0,6.0,36.0,0.82,89.0
1,22528,1440702000000,Dust2,1,"(483, 484, 2757, 7594, 3347)",72.598373,21.0,54.0,12.0,32.0,0.69258,89.0,"(2469, 7398, 7592, 429, 4954)",75.954746,15.0,90.0,-12.0,25.0,0.709509,54.0
2,13059,1349812800000,Dust2_se,0,,,,,,,,,,,,,,,,
3,31325,1464655500000,Cache,2,"(10565, 11302, 10795, 10797, 10798)",62.88,11.0,96.0,-5.0,28.0,0.536,66.0,"(5698, 10563, 2532, 5736, 7382)",81.04,17.0,66.0,5.0,33.0,0.792,96.0
4,42943,1489702200000,Mirage,2,"(12102, 8493, 12272, 10897, 11230)",69.66,13.0,102.0,3.0,45.0,0.6374,86.0,"(9571, 10372, 8708, 9705, 9069)",79.9,26.0,86.0,-3.0,51.0,0.6892,102.0


***Drop irrelevant columns and reorder the final dataframe***

The final step consists of reordering the columns and dropping the remaining duplicates, NaN values and irrelevant columns from the dataset.

In [11]:
# Drop irrelevant columns
total_data = merged_df
total_data.drop([
    'match_id',
    'data_unix'
], axis=1, inplace=True)

# Reorder the dataframe
order = [
    'members_team_1',
    'members_team_2',
    'map',  
    'avg_adr_team_1',
    'sum_assists_team_1', 
    'sum_deaths_team_1', 
    'sum_fkdiffs_team_1',
    'sum_hs_team_1', 
    'mean_kdratio_team_1', 
    'sum_kills_team_1',     
    'avg_adr_team_2', 
    'sum_assists_team_2',
    'sum_deaths_team_2', 
    'sum_fkdiffs_team_2', 
    'sum_hs_team_2',
    'mean_kdratio_team_2', 
    'sum_kills_team_2',
    'winning_team'
]

total_data = total_data.reindex(columns=order)

total_data.dropna(inplace=True)
total_data.head()

Unnamed: 0,members_team_1,members_team_2,map,avg_adr_team_1,sum_assists_team_1,sum_deaths_team_1,sum_fkdiffs_team_1,sum_hs_team_1,mean_kdratio_team_1,sum_kills_team_1,avg_adr_team_2,sum_assists_team_2,sum_deaths_team_2,sum_fkdiffs_team_2,sum_hs_team_2,mean_kdratio_team_2,sum_kills_team_2,winning_team
0,"(168, 5737, 11247, 2492, 10814)","(5698, 10563, 2532, 5736, 7382)",Train,58.72,4.0,89.0,-6.0,24.0,0.47,48.0,92.0,15.0,48.0,6.0,36.0,0.82,89.0,2
1,"(483, 484, 2757, 7594, 3347)","(2469, 7398, 7592, 429, 4954)",Dust2,72.598373,21.0,54.0,12.0,32.0,0.69258,89.0,75.954746,15.0,90.0,-12.0,25.0,0.709509,54.0,1
3,"(10565, 11302, 10795, 10797, 10798)","(5698, 10563, 2532, 5736, 7382)",Cache,62.88,11.0,96.0,-5.0,28.0,0.536,66.0,81.04,17.0,66.0,5.0,33.0,0.792,96.0,2
4,"(12102, 8493, 12272, 10897, 11230)","(9571, 10372, 8708, 9705, 9069)",Mirage,69.66,13.0,102.0,3.0,45.0,0.6374,86.0,79.9,26.0,86.0,-3.0,51.0,0.6892,102.0,2
5,"(483, 484, 2757, 7594, 3347)","(1866, 7403, 338, 7796, 472)",Train,87.8,19.0,65.0,3.0,39.0,0.8176,96.0,66.64,18.0,96.0,-3.0,31.0,0.5826,65.0,1


***Save the dataset***

In [12]:
total_data.to_csv('../data/interim/clean_match_data.csv',index=False)
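
One thing to keep in mind when saving: CSV stores a `frozenset` as its text representation, so the `members` columns come back as plain strings when the file is read again. A possible workaround, sketched with hypothetical helpers `encode_members` and `decode_members` on toy data, is to store the IDs as a sorted, space-separated string:

```python
import io
import pandas as pd

# Toy frame with a frozenset column, mirroring the shape of `members_team_1`
toy = pd.DataFrame({
    "members_team_1": [frozenset({483, 484, 2757}), frozenset({168, 5737})],
    "winning_team": [1, 2],
})

# Hypothetical helpers: store the IDs as a sorted, space-separated string
def encode_members(members):
    return " ".join(str(player_id) for player_id in sorted(members))

def decode_members(text):
    return frozenset(int(player_id) for player_id in text.split())

# Write to an in-memory buffer instead of a file to keep the sketch self-contained
buffer = io.StringIO()
toy.assign(members_team_1=toy["members_team_1"].map(encode_members)).to_csv(buffer, index=False)

# Read the column back as text, then rebuild the frozensets
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"members_team_1": str})
restored["members_team_1"] = restored["members_team_1"].map(decode_members)
print(restored["members_team_1"].tolist() == toy["members_team_1"].tolist())
```

Sorting the IDs before joining keeps the stored string identical for the same roster, regardless of set iteration order.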

---

## Conclusions

1. The exploratory analysis of the dataset confirmed that there is a strong relationship between the player statistics in the game and the final outcome of the match.

2. The time variables did not show a strong correlation with the target variable, so they were removed from the dataset.

3. Players' null values were replaced by their personal mean so that the imputed values stay close to their usual performance. Players with no valid records to average, and any rows still containing null values, were removed from the dataset.

4. To meet the requirements of the machine learning model, a new dataset was created from the data given.

5. The new dataset includes the player IDs and the average and sum of the individual statistics per team. The target is the winning team. The data was extensively restructured so that a single row per match holds all the needed information.

6. The clean dataset has 84040 rows and 18 columns and no null values.