# Introduction
The UFC maintains an official statistics website covering every fight it has held, which can be found [here](http://www.ufcstats.com/statistics/events/completed). While the data is comprehensive, the site's visualisations are basic and show only a small part of it. This makes it harder for UFC fans to understand a fight in detail, specifically the breakdown of the types of strikes landed and where they landed. A tool that creates more detailed visualisations would be beneficial to fans and possibly fighters too. The same data could also help reveal trends, and possibly support predictions based on the number of strikes landed and where they land.

# Dataset
The following [data set](https://www.kaggle.com/datasets/rajeevw/ufcdata/) will be used as it contains comprehensive and detailed information pulled from the official UFC stats website from 1993 to 2021.

# Setup

In [1]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install numpy

Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
ufc_data = pd.read_csv("data.csv")

# Pre-processing
My initial aim is to create a visualisation similar to this:
<br><img src="Sample Screenshot.png" width="330" height="300"><br>
This is a screenshot from a YouTube video which can be found [here](https://www.youtube.com/watch?v=TWafkxT699c&t=327s).

This requires reducing the dimensionality of the dataset so that only the columns relevant to what I wish to produce remain.
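As a minimal sketch of what column-based reduction looks like in pandas, the example below uses a small made-up frame rather than the real dataset. The column names follow the dataset's naming, but the values and the choice of columns to keep are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for the real dataset; all values are made up.
toy = pd.DataFrame({
    "R_fighter": ["Fighter A", "Fighter C"],
    "B_fighter": ["Fighter B", "Fighter D"],
    "R_avg_SIG_STR_landed": [30.5, 12.0],
    "B_avg_SIG_STR_landed": [22.0, 18.5],
    "R_age": [29, 33],
})

# Columns to keep (a hypothetical choice for this illustration).
keep = ["R_fighter", "B_fighter", "R_avg_SIG_STR_landed", "B_avg_SIG_STR_landed"]

# Indexing with a list of names returns a new frame with only those columns.
reduced = toy[keep]

print(toy.shape, "->", reduced.shape)
```

Selecting by an explicit list keeps the reduction easy to audit, since every retained column is named in one place.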

In [4]:
# Returning high-level information on dataset
ufc_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6012 entries, 0 to 6011
Columns: 144 entries, R_fighter to R_age
dtypes: bool(1), float64(106), int64(28), object(9)
memory usage: 6.6+ MB


In [5]:
print(ufc_data.describe())

          B_avg_KD  B_avg_opp_KD  B_avg_SIG_STR_pct  B_avg_opp_SIG_STR_pct  \
count  4585.000000   4585.000000        4585.000000            4585.000000   
mean      0.247476      0.176818           0.453310               0.434290   
std       0.378509      0.324633           0.130458               0.132618   
min       0.000000      0.000000           0.000000               0.000000   
25%       0.000000      0.000000           0.376489               0.351045   
50%       0.015625      0.000000           0.450000               0.427500   
75%       0.500000      0.250000           0.527500               0.510000   
max       5.000000      3.000000           1.000000               1.000000   

       B_avg_TD_pct  B_avg_opp_TD_pct  B_avg_SUB_ATT  B_avg_opp_SUB_ATT  \
count   4585.000000       4585.000000    4585.000000        4585.000000   
mean       0.292650          0.268742       0.478884           0.409276   
std        0.273628          0.267178       0.724229           0.653826 

In [6]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 1)

print(ufc_data.describe())

       B_avg_KD  B_avg_opp_KD  B_avg_SIG_STR_pct  B_avg_opp_SIG_STR_pct  \
count    4585.0        4585.0             4585.0                 4585.0   
...         ...           ...                ...                    ...   

       B_avg_TD_pct  B_avg_opp_TD_pct  B_avg_SUB_ATT  B_avg_opp_SUB_ATT  \
count        4585.0            4585.0         4585.0             4585.0   
...             ...               ...            ...                ...   

       B_avg_REV  B_avg_opp_REV  B_avg_SIG_STR_att  B_avg_SIG_STR_landed  \
count     4585.0         4585.0             4585.0                4585.0   
...          ...            ...                ...                   ...   

       B_avg_opp_SIG_STR_att  B_avg_opp_SIG_STR_landed  B_avg_TOTAL_STR_att  \
count                 4585.0                    4585.0               4585.0   
...                      ...                       ...                  ...   

       B_avg_TOTAL_STR_landed  B_avg_opp_TOTAL_STR_att  \
count                  

## Relevant Columns
The relevant columns that I will likely need are as follows:
- R_fighter (Red fighter name)
- B_fighter (Blue fighter name)
- date (Date fight occurred)
- Winner (Winner of the fight, either Red or Blue)
- KD (Number of knockdowns)
- SIG_STR (No. Significant Strikes Landed)
- SIG_STR_pct (Significant strikes as percentage)
- TOTAL_STR (Total no. strikes landed)
- TD (No. Takedowns)
- TD_pct (Takedown percentages)
- HEAD (No. significant strikes landed to head)
- BODY (No. significant strikes landed to body)
- CLINCH (No. significant strikes landed in clinch)
- GROUND (No. significant strikes landed while on the ground)
- win_by (Method of win)

Now that the relevant columns have been identified, we can begin reducing the dimensionality of the dataset and leave only the columns listed above.
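The list above uses base stat names, while the columns seen in the `describe()` output carry a corner and `avg` prefix (for example `B_avg_KD`). A minimal sketch of turning base names into prefixed column names and checking them against a header follows. The header here is a short hypothetical sample, not the full 144-column header, and `R_avg_KD` is left out of it on purpose to show how a missing name is reported.

```python
# Hypothetical sample header following the dataset's naming pattern.
columns = [
    "R_fighter", "B_fighter", "date", "Winner",
    "B_avg_KD", "B_avg_opp_KD", "B_avg_SIG_STR_pct", "B_avg_SIG_STR_landed",
    "R_avg_SIG_STR_pct", "R_avg_SIG_STR_landed", "R_age",
]

base_stats = ["KD", "SIG_STR_pct", "SIG_STR_landed"]  # subset chosen for illustration
corners = ["R", "B"]

# Build the per-corner column names, e.g. "R_avg_KD".
wanted = [f"{corner}_avg_{stat}" for corner in corners for stat in base_stats]

# Split the wanted names into those present in the header and those that are not.
present = [name for name in wanted if name in columns]
missing = [name for name in wanted if name not in columns]

print("present:", present)
print("missing:", missing)
```

Generating the names from the base list avoids typing each prefixed name by hand, which is where duplicated or mistyped entries tend to creep in.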

In [7]:
# List of column names to be imported to Pandas data frame
# Duplicate entries removed; the *_landed names for LEG and HEAD are assumed
# from the dataset's att/landed naming pattern.
columns_to_import = ['R_fighter', 'B_fighter', 'date', 'Winner',
                     'B_avg_SIG_STR_pct', 'B_avg_SIG_STR_att', 'B_avg_SIG_STR_landed', 'B_avg_LEG_att', 'B_avg_LEG_landed', 'B_avg_HEAD_att', 'B_avg_HEAD_landed',
                     'R_avg_SIG_STR_pct', 'R_avg_SIG_STR_att', 'R_avg_SIG_STR_landed', 'R_avg_LEG_att', 'R_avg_LEG_landed', 'R_avg_HEAD_att', 'R_avg_HEAD_landed']

In [8]:
data_red = pd.read_csv("data.csv", usecols=columns_to_import)
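Because `usecols` raises an error when a requested name is not in the file, it can help to check the requested list against the header first. The sketch below shows one way to do that, using an in-memory CSV with made-up rows in place of `data.csv`. The `requested` list, including the deliberately duplicated and absent names, is hypothetical, and parsing `date` with `parse_dates` is an optional extra rather than something the notebook does.

```python
import io
import pandas as pd

# In-memory stand-in for data.csv; the rows are made up.
csv_text = (
    "R_fighter,B_fighter,date,Winner,R_avg_SIG_STR_landed,B_avg_SIG_STR_landed\n"
    "Fighter A,Fighter B,2021-03-20,Red,30.5,22.0\n"
    "Fighter C,Fighter D,2021-03-13,Blue,12.0,18.5\n"
)

# Hypothetical request with one duplicate and one name absent from the file.
requested = ["R_fighter", "R_fighter", "B_fighter", "date", "Winner",
             "R_avg_SIG_STR_landed", "R_avg_LEG_landed"]

# nrows=0 reads only the header row, so this stays cheap for a large file.
header = list(pd.read_csv(io.StringIO(csv_text), nrows=0).columns)

# dict.fromkeys drops duplicates while keeping the original order.
unique_requested = list(dict.fromkeys(requested))
available = [name for name in unique_requested if name in header]
unavailable = [name for name in unique_requested if name not in header]

# Load only the columns that exist, parsing the date column as datetimes.
subset = pd.read_csv(io.StringIO(csv_text), usecols=available, parse_dates=["date"])

print(subset.dtypes)
print("Not found:", unavailable)
```

With the real file, `io.StringIO(csv_text)` would be replaced by the path `"data.csv"` in both calls.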