# Exploratory Data Analysis

This notebook explores the raw and processed data to identify anomalies, spot patterns, and check assumptions.
*** 

In [24]:
import sys
import pandas as pd
import numpy as np
from pathlib import Path

### Raw Data

This is the raw, uncleaned data gathered from this source on Kaggle:

https://www.kaggle.com/datasets/neelagiriaditya/ufc-datasets-1994-2025 

Here our goal is to get a feel for the data.

In [25]:
project_root = Path().resolve().parents[0]
sys.path.append(str(project_root))

# project_root is already a Path, so build the raw-data directory once and reuse it
raw_dir = project_root / "data" / "raw"
df_fighter = pd.read_csv(raw_dir / "fighter_details.csv")
df_fight = pd.read_csv(raw_dir / "fight_details.csv")
df_event = pd.read_csv(raw_dir / "event_details.csv")

In [26]:
df_fighter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2611 entries, 0 to 2610
Data columns (total 19 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          2611 non-null   object 
 1   name        2611 non-null   object 
 2   nick_name   1692 non-null   object 
 3   wins        2611 non-null   int64  
 4   losses      2611 non-null   int64  
 5   draws       2611 non-null   int64  
 6   height      2590 non-null   float64
 7   weight      2593 non-null   float64
 8   reach       1956 non-null   float64
 9   stance      2534 non-null   object 
 10  dob         2455 non-null   object 
 11  splm        2611 non-null   float64
 12  str_acc     2611 non-null   int64  
 13  sapm        2611 non-null   float64
 14  str_def     2611 non-null   int64  
 15  td_avg      2611 non-null   float64
 16  td_avg_acc  2611 non-null   int64  
 17  td_def      2611 non-null   int64  
 18  sub_avg     2611 non-null   float64
dtypes: float64(7), int64(7), ob

Of the 2611 listed fighters:
- 21 have no listed height
- 18 have no listed weight
- 655 have no listed reach
- 77 have no listed stance
- 156 have no listed date of birth

All of these attributes are important for the model, so we must decide how to handle the gaps (e.g. fill with a generic value, impute an average, or drop the rows).
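As a rough illustration of the options above, the sketch below summarizes missing values per column and applies one possible fill strategy. The helper names `summarize_missing` and `fill_fighter_gaps`, the choice of median imputation, and the `"Unknown"` stance label are assumptions made for this example, not decisions taken in this notebook. A small toy frame stands in for `df_fighter`.

```python
import numpy as np
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": missing,
        "pct_missing": (missing / len(df) * 100).round(2),
    })
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


def fill_fighter_gaps(df: pd.DataFrame) -> pd.DataFrame:
    """One possible strategy (an assumption): median for numeric gaps, a label for stance."""
    out = df.copy()
    for col in ["height", "reach"]:
        out[col] = out[col].fillna(out[col].median())
    out["stance"] = out["stance"].fillna("Unknown")
    return out


# Toy stand-in for df_fighter; the values are made up for illustration.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "height": [180.0, np.nan, 170.0, 175.0],
    "reach": [185.0, np.nan, np.nan, 178.0],
    "stance": ["Orthodox", None, "Southpaw", "Orthodox"],
})

missing_summary = summarize_missing(toy)
filled = fill_fighter_gaps(toy)
print(missing_summary)
print(filled)
```

Median imputation keeps every row but can blur real differences between weight classes, so a per-division median or simply dropping the rows may be more appropriate depending on how the model uses these features.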

In [27]:
df_fight.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8337 entries, 0 to 8336
Data columns (total 86 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   event_name           8337 non-null   object 
 1   event_id             8337 non-null   object 
 2   fight_id             8337 non-null   object 
 3   r_name               8337 non-null   object 
 4   r_id                 8337 non-null   object 
 5   b_name               8337 non-null   object 
 6   b_id                 8337 non-null   object 
 7   division             8337 non-null   object 
 8   title_fight          8337 non-null   int64  
 9   method               8337 non-null   object 
 10  finish_round         8337 non-null   int64  
 11  match_time_sec       8337 non-null   int64  
 12  total_rounds         8306 non-null   float64
 13  referee              8311 non-null   object 
 14  r_kd                 8316 non-null   float64
 15  r_sig_str_landed     8316 non-null   f

Of the 8337 fights, some are also missing information, including these important fields:
- Knockdowns
- Total strikes
- Significant strikes

Since only about 21 fights appear to be missing this information (8337 - 8316), they could simply be removed. They may also be removed automatically if they took place before 2000-11-17, the Unified Rules cutoff applied later in this notebook.
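One way to check whether the fights with missing statistics would be removed by the date cutoff anyway is to join them to their event dates and count how many fall on each side. The sketch below is a minimal version of that check on toy data. The helper name `missing_stats_by_era` is hypothetical, and it assumes `date` has already been parsed to datetime.

```python
import numpy as np
import pandas as pd

CUTOFF = pd.Timestamp("2000-11-17")


def missing_stats_by_era(fights, events, stat_col="r_kd", cutoff=CUTOFF):
    """Count fights missing `stat_col`, split by whether they predate the cutoff."""
    merged = fights.merge(events[["fight_id", "date"]], on="fight_id", how="left")
    missing = merged[merged[stat_col].isna()]
    return {
        "before_cutoff": int((missing["date"] < cutoff).sum()),
        "on_or_after_cutoff": int((missing["date"] >= cutoff).sum()),
    }


# Toy stand-ins for df_fight and df_event; the values are made up for illustration.
toy_fights = pd.DataFrame({
    "fight_id": ["f1", "f2", "f3", "f4"],
    "r_kd": [np.nan, 1.0, np.nan, 0.0],
})
toy_events = pd.DataFrame({
    "fight_id": ["f1", "f2", "f3", "f4"],
    "date": pd.to_datetime(["1999-03-05", "2005-01-01", "2010-06-12", "2000-11-17"]),
})

counts = missing_stats_by_era(toy_fights, toy_events)
print(counts)
```

If the `on_or_after_cutoff` count is zero on the real data, the date filter alone handles these rows and no separate drop is needed.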

In [28]:
df_event.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8337 entries, 0 to 8336
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   event_id   8337 non-null   object
 1   fight_id   8337 non-null   object
 2   date       8337 non-null   object
 3   location   8337 non-null   object
 4   winner     8190 non-null   object
 5   winner_id  8190 non-null   object
dtypes: object(6)
memory usage: 390.9+ KB


Note that 147 fights (8337 - 8190) have no listed winner, so those fights can be removed.
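Before dropping these rows, it may be worth checking what kind of outcome they represent. The sketch below groups the no-winner fights by the `method` column from the fight table. The helper name `no_winner_breakdown` is hypothetical, and the method labels in the toy data are illustrative and may not match the values in the real dataset.

```python
import pandas as pd


def no_winner_breakdown(events, fights, by="method"):
    """Count fights without a listed winner, grouped by a column of the fight table."""
    no_winner = events[events["winner"].isna()]
    joined = no_winner.merge(fights[["fight_id", by]], on="fight_id", how="left")
    return joined[by].value_counts(dropna=False)


# Toy stand-ins for df_event and df_fight; labels are illustrative only.
toy_events = pd.DataFrame({
    "fight_id": ["f1", "f2", "f3", "f4"],
    "winner": ["Fighter A", None, None, "Fighter B"],
})
toy_fights = pd.DataFrame({
    "fight_id": ["f1", "f2", "f3", "f4"],
    "method": ["KO/TKO", "Draw", "No Contest", "Decision"],
})

breakdown = no_winner_breakdown(toy_events, toy_fights)
print(breakdown)
```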

In [29]:
# Format DateTime
df_event["date"] = pd.to_datetime(df_event["date"], format="%B %d, %Y")
df_fighter["dob"] = pd.to_datetime(df_fighter["dob"], format="%b %d, %Y")
# Drop fights with no winner
df_event = df_event.dropna(subset=["winner", "winner_id"])
# Drop fights before Unified Rules of MMA (UFC 28, November 17, 2000)
cutoff_date = pd.Timestamp("2000-11-17")
df_event = df_event[df_event["date"] >= cutoff_date]
# Sort the fights from oldest to newest for dataset construction
df_event = df_event.sort_values(by="date")

# Join events with fights
df_event_fights = pd.merge(df_event, df_fight, on="fight_id")
df_event_fights.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7943 entries, 0 to 7942
Data columns (total 91 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   event_id_x           7943 non-null   object        
 1   fight_id             7943 non-null   object        
 2   date                 7943 non-null   datetime64[ns]
 3   location             7943 non-null   object        
 4   winner               7943 non-null   object        
 5   winner_id            7943 non-null   object        
 6   event_name           7943 non-null   object        
 7   event_id_y           7943 non-null   object        
 8   r_name               7943 non-null   object        
 9   r_id                 7943 non-null   object        
 10  b_name               7943 non-null   object        
 11  b_id                 7943 non-null   object        
 12  division             7943 non-null   object        
 13  title_fight          7943 non-nul

After some basic filtering, the data has slimmed down to 7943 fights. Dropping the fights with no winner and the fights before the cutoff date seems to have fixed the issues with missing values. Some accuracy statistics are still missing, but this is most likely because no value is recorded when a fighter attempted none of the relevant actions (computing the accuracy would mean dividing by zero).
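The division-by-zero explanation can be tested directly: every row with a missing accuracy should also have zero attempts. The sketch below returns the rows that break that rule. The column names `attempted` and `accuracy_pct` are placeholders, since the actual column names are cut off in the output above, and the helper name `unexplained_accuracy_gaps` is hypothetical.

```python
import numpy as np
import pandas as pd


def unexplained_accuracy_gaps(df, attempts_col, accuracy_col):
    """Rows where accuracy is missing even though at least one attempt was recorded."""
    missing_acc = df[accuracy_col].isna()
    zero_attempts = df[attempts_col] == 0
    return df[missing_acc & ~zero_attempts]


# Toy data with placeholder column names; the values are made up for illustration.
toy = pd.DataFrame({
    "attempted": [10, 0, 5, 0],
    "accuracy_pct": [50.0, np.nan, np.nan, np.nan],
})

unexplained = unexplained_accuracy_gaps(toy, "attempted", "accuracy_pct")
print(unexplained)
```

An empty result on the real data would support the explanation. Any rows returned would point to a different cause worth investigating.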
***


### Processed Data

Here we examine the dataset we created after cleaning and processing.