# UFC Data Exploration - Initial Analysis

This notebook is my first look at the UFC fighter datasets. I'm going to explore what data we have, understand the structure, and look for any interesting patterns.

The main thing I learned from the Anscombe's Quartet example in class is that **you have to visualize your data**. Looking at just summary statistics (mean, std, correlation) can be super misleading since completely different datasets can have identical statistical properties. So I'll be making lots of graphs to actually see what's going on in the data.
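To make the Anscombe point concrete, here's a quick sketch using the first two Anscombe datasets typed in by hand (these are the standard published values, not UFC data, and the variable names are just ones I picked). The summary numbers come out nearly identical even though dataset I is roughly linear and dataset II is a curve.

```python
import numpy as np

# Anscombe's datasets I and II share the same x values
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Print mean, sample standard deviation, and correlation with x for each dataset
for label, y in [("I", y1), ("II", y2)]:
    r = np.corrcoef(x, y)[0, 1]
    print(f"Dataset {label}: mean={y.mean():.2f}, std={y.std(ddof=1):.2f}, corr={r:.3f}")
```

Only a plot would show that the two datasets have completely different shapes.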

I have four CSV files in the data folder - I'll load them all and start exploring.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

plt.style.use('seaborn-v0_8-whitegrid')

print("Libraries loaded successfully!")

## 1. Load the Data

First step is to load the data and take a look at what we have. Like we learned in the pandas lectures, I'll use `pd.read_csv()` to load the CSV files and then check the shapes to see how much data there is.

In [None]:
fighter_attributes = pd.read_csv('data/fighter_attributes.csv')
fighter_history = pd.read_csv('data/fighter_history.csv')
fighter_stats = pd.read_csv('data/fighter_stats.csv')
ufc_events = pd.read_csv('data/ufc-events.csv')

print(f"Fighter Attributes: {fighter_attributes.shape[0]} rows, {fighter_attributes.shape[1]} columns")
print(f"Fighter History: {fighter_history.shape[0]} rows, {fighter_history.shape[1]} columns")
print(f"Fighter Stats: {fighter_stats.shape[0]} rows, {fighter_stats.shape[1]} columns")
print(f"UFC Events: {ufc_events.shape[0]} rows, {ufc_events.shape[1]} columns")

print("\nOk so we have a lot of data here!")

## 2. Fighter Attributes Analysis

I'll start with fighter_attributes since that seems like the most straightforward dataset - should have physical characteristics like height, weight, etc for each fighter.

From the pandas lectures we learned to always start with `.head()`, `.info()`, and `.describe()` when exploring new data.

In [None]:
print("Fighter Attributes Sample:")
print("")
fighter_attributes.head(10)

In [None]:
print("Data Types:")
print(fighter_attributes.dtypes)
print("")
print("="*50)
print("")
print("Missing Values:")
missing_vals = fighter_attributes.isnull().sum()
print(missing_vals)

In [None]:
fighter_attributes.describe()

Ok so there's a good amount of missing data, especially in the reach column and the style column. That makes sense - not all fighters have complete records.

Now I want to visualize some of this data. From the graphing lectures we learned that you should pick the right plot type for what you're showing. For categorical data like weight class and stance, bar charts work well. For continuous data like age and height, histograms are better.

In [None]:
weight_class_counts = fighter_attributes['weight_class'].value_counts()

plt.figure(figsize=(14, 6))
weight_class_counts.plot(kind='bar', color='steelblue')
plt.title('Distribution of Fighters by Weight Class')
plt.xlabel('Weight Class')
plt.ylabel('Number of Fighters')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

stance_counts = fighter_attributes['stance'].value_counts()
axes[0].pie(stance_counts.values, labels=stance_counts.index, autopct='%1.1f%%')
axes[0].set_title('Fighter Stance Distribution')

gender_counts = fighter_attributes['gender'].value_counts()
colors = ['#3498db', '#e74c3c']
axes[1].pie(gender_counts.values, labels=gender_counts.index, autopct='%1.1f%%', colors=colors)
axes[1].set_title('Fighter Gender Distribution')

plt.tight_layout()
plt.show()

Lightweight and Welterweight are the most common weight classes - makes sense since those are kind of the "normal" weight ranges for most people.

Orthodox (the usual stance for right-handed fighters) is by far the most common stance.

In [None]:
country_counts = fighter_attributes['country'].value_counts()
top_countries = country_counts.head(15)

plt.figure(figsize=(12, 6))
top_countries.plot(kind='barh', color='teal')
plt.title('Top 15 Countries by Number of Fighters')
plt.xlabel('Number of Fighters')
plt.ylabel('Country')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

USA is way ahead as expected. Brazil and Russia are also big contributors which makes sense given their strong fighting traditions.

Now let me look at age distribution. For continuous variables, histograms are the best choice (from graphing lectures).

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

age_data = fighter_attributes['age'].dropna()
axes[0].hist(age_data, bins=30, color='steelblue', edgecolor='white')
mean_age = fighter_attributes['age'].mean()
median_age = fighter_attributes['age'].median()
axes[0].axvline(mean_age, color='red', linestyle='--', label=f'Mean: {mean_age:.1f}')
axes[0].axvline(median_age, color='green', linestyle='--', label=f'Median: {median_age:.1f}')
axes[0].set_title('Age Distribution of Fighters')
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Frequency')
axes[0].legend()

fighter_attributes.boxplot(column='age', by='gender', ax=axes[1])
axes[1].set_title('Age Distribution by Gender')
axes[1].set_xlabel('Gender')
axes[1].set_ylabel('Age')
plt.suptitle('')

plt.tight_layout()
plt.show()

The age distribution looks pretty normal, centered around 30-35 years old. Mean is slightly higher than median which indicates a small right skew - makes sense since you can have older fighters but not really younger than ~18.
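As a sanity check on the "mean above median means right skew" idea, here's a tiny sketch with made-up ages (hypothetical values, not taken from `fighter_attributes`). It compares the mean and median and also computes the sample skewness with pandas.

```python
import pandas as pd

# Hypothetical ages, made up for illustration: a few older fighters stretch the right tail
toy_ages = pd.Series([22, 24, 25, 26, 27, 28, 29, 30, 33, 38, 45])

toy_mean = toy_ages.mean()
toy_median = toy_ages.median()
toy_skew = toy_ages.skew()  # sample skewness; positive means a longer right tail

print(f"mean={toy_mean:.1f}, median={toy_median:.1f}, skew={toy_skew:.2f}")
```

The same `.skew()` call could be run on the real age column to put a number on the skew instead of eyeballing it.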

Now let me look at height vs weight - should be an interesting relationship there. Scatter plots are good for showing relationships between two continuous variables.

In [None]:
plt.figure(figsize=(12, 8))

weight_classes = fighter_attributes['weight_class'].unique()
for wc in weight_classes:
    if pd.notna(wc):
        subset = fighter_attributes[fighter_attributes['weight_class'] == wc]
        plt.scatter(subset['height'], subset['weight'], label=wc, alpha=0.5)

plt.title('Height vs Weight by Weight Class')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=8)
plt.tight_layout()
plt.show()

You can see height and weight are clearly correlated (which makes sense) and the weight classes form distinct horizontal bands. Heavier weight classes are obviously higher up on the y-axis.

Let me also check what fighting styles are most common in the dataset.

In [None]:
style_data = fighter_attributes['style'].value_counts()
style_counts = style_data.head(10)

plt.figure(figsize=(10, 6))
style_counts.plot(kind='bar', color='coral')
plt.title('Top 10 Fighting Styles')
plt.xlabel('Fighting Style')
plt.ylabel('Number of Fighters')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

Wrestling and BJJ (Brazilian Jiu-Jitsu) are the top styles. Makes sense since grappling is so important in MMA. Interesting that wrestling is #1 - this probably connects to Russia and USA both being top countries since they have strong wrestling programs.

I noticed earlier that reach had missing values. I wonder if reach is important for fighter success - that could be something to investigate more later.

## 3. Fighter History Analysis

Now let me look at the fight history data. This should tell us about actual fight outcomes - wins, losses, how fights ended, etc.

In [None]:
print("Fighter History Sample:")
print("")
fighter_history.head(10)

I need to convert the event_date column to datetime first. This is something we covered in the pandas lectures - dates get read as strings by default unless you tell pandas otherwise (either with `parse_dates` when reading the CSV or converting after).

In [None]:
fighter_history['event_date'] = pd.to_datetime(fighter_history['event_date'])

print(f"Total fights recorded: {len(fighter_history)}")
print(f"Unique fighters: {fighter_history['fighter_id'].nunique()}")
print(f"Date range: {fighter_history['event_date'].min()} to {fighter_history['event_date'].max()}")
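Here's a small sketch of the two ways to get a datetime column that the note on event_date mentions. The CSV text is made up (a stand-in for `fighter_history.csv`), so the column values are assumptions for illustration only.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for the real file
csv_text = "fighter_id,event_date\n1,1993-11-12\n2,2020-05-09\n"

# Option 1: parse the dates while reading
parsed_on_read = pd.read_csv(io.StringIO(csv_text), parse_dates=['event_date'])

# Option 2: read the column as strings, then convert afterwards
converted_after = pd.read_csv(io.StringIO(csv_text))
converted_after['event_date'] = pd.to_datetime(converted_after['event_date'])

print(parsed_on_read['event_date'].dtype)
print(converted_after['event_date'].dt.year.tolist())
```

Either way the column ends up as a datetime type, which is what makes the `.dt` accessor available.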

The data goes back to 1993 - that's basically the start of UFC! Now let me see how fights end.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

result_counts = fighter_history['fight_result'].value_counts()
colors_dict = {'W': '#27ae60', 'L': '#e74c3c', 'D': '#3498db', 'NC': '#95a5a6'}
# Fall back to dark gray for any result code not in colors_dict
result_colors = [colors_dict.get(r, '#333') for r in result_counts.index]

axes[0].bar(result_counts.index, result_counts.values, color=result_colors)
axes[0].set_title('Fight Result Distribution')
axes[0].set_xlabel('Result')
axes[0].set_ylabel('Count')

result_type_counts = fighter_history['fight_result_type'].value_counts()
result_type_counts.plot(kind='bar', ax=axes[1], color='mediumpurple')
axes[1].set_title('Fight Result Type Distribution')
axes[1].set_xlabel('Result Type')
axes[1].set_ylabel('Count')
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=45, ha='right')

plt.tight_layout()
plt.show()

Wins and losses are about equal which makes sense - for every win there's a loss (unless it's a draw or no contest). Decisions (DEC-UNA) and KO/TKO are the most common ways fights end.

Let me look at how UFC has grown over time.

In [None]:
fighter_history['year'] = fighter_history['event_date'].dt.year
fights_by_year = fighter_history.groupby('year').size()

plt.figure(figsize=(14, 5))
plt.fill_between(fights_by_year.index, fights_by_year.values, alpha=0.3, color='steelblue')
plt.plot(fights_by_year.index, fights_by_year.values, 'o-', color='steelblue')
plt.title('Number of Fights per Year')
plt.xlabel('Year')
plt.ylabel('Number of Fights')
plt.tight_layout()
plt.show()

You can see UFC has grown a lot over the years. There's a dip around 2020 which was probably COVID. Let me also specifically look at title fights.

In [None]:
title_fight_data = fighter_history[fighter_history['title_fight'] == True]
print(f"Total title fights: {len(title_fight_data)}")
print("\nTitle fights by result type:")
print(title_fight_data['fight_result_type'].value_counts())

Now let me calculate win/loss records for each fighter. This requires using `groupby` which we learned about in the pandas lectures - I'll group by fighter_id and count wins and losses for each fighter.

In [None]:
# Flag wins and losses per row, then sum the flags for each fighter
results = fighter_history.assign(
    wins=fighter_history['fight_result'] == 'W',
    losses=fighter_history['fight_result'] == 'L'
)
fighter_wins = results.groupby('fighter_id').agg(
    fighter_name=('fighter_name', 'first'),
    wins=('wins', 'sum'),
    losses=('losses', 'sum')
)

# total_fights counts only wins and losses, so draws and no contests don't affect win_rate
fighter_wins['total_fights'] = fighter_wins['wins'] + fighter_wins['losses']
fighter_wins['win_rate'] = fighter_wins['wins'] / fighter_wins['total_fights']

top_winners = fighter_wins.nlargest(15, 'wins')
print("Top 15 Fighters by Wins:")
top_winners[['fighter_name', 'wins', 'losses', 'win_rate']]
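To double check the groupby logic for win/loss records, here's a toy version with a made-up five-row history (the fighter IDs and results are hypothetical). It's just one way to do the counting: flag each row as a win or loss, then sum the flags per fighter.

```python
import pandas as pd

# Made-up fight history: fighter 1 goes 2-1, fighter 2 has one loss and one draw
toy_history = pd.DataFrame({
    'fighter_id': [1, 1, 1, 2, 2],
    'fight_result': ['W', 'W', 'L', 'L', 'D'],
})

# Boolean flags sum to counts when grouped
toy_history['wins'] = toy_history['fight_result'] == 'W'
toy_history['losses'] = toy_history['fight_result'] == 'L'
toy_record = toy_history.groupby('fighter_id')[['wins', 'losses']].sum()

# Draws are left out of the denominator in this sketch
toy_record['win_rate'] = toy_record['wins'] / (toy_record['wins'] + toy_record['losses'])
print(toy_record)
```

Whether draws and no contests should count in the denominator is a judgment call, so it's worth keeping in mind when comparing win rates.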

Jim Miller, Donald Cerrone, and Andrei Arlovski are at the top - these are all veterans with long careers. Jon Jones has an insane win rate of 95.7%!

Let me also look at fight duration - how long do fights last on average depending on how they end.

In [None]:
avg_durations = fighter_history.groupby('fight_result_type')['fight_duration'].mean()
avg_durations_sorted = avg_durations.sort_values(ascending=False)

plt.figure(figsize=(10, 6))
avg_durations_sorted.plot(kind='barh', color='darkgreen')
plt.title('Average Fight Duration by Result Type')
plt.xlabel('Average Duration (seconds)')
plt.ylabel('Result Type')
plt.tight_layout()
plt.show()

Decision fights go the longest (obviously since they go the full distance). KO/TKOs are the quickest. Submissions are somewhere in between.

## 4. Fighter Statistics Analysis

Now let me look at the fighter_stats dataset which has detailed fight statistics like strikes, takedowns, etc.

In [None]:
print("Fighter Stats Sample:")
cols = list(fighter_stats.columns)
print(f"Columns: {cols}")
print("")
fighter_stats.head()

Lots of columns with abbreviations. Let me make a reference for what some of these mean.

In [None]:
print("Key Statistics Columns:")
print("-" * 50)
print("TSL/TSA: Total Strikes Landed/Attempted")
print("TS_ACC: Total Strike Accuracy")
print("SSL/SSA: Significant Strikes Landed/Attempted")
print("SS_ACC: Significant Strike Accuracy")
print("KD: Knockdowns")
print("TDL/TDA: Takedowns Landed/Attempted")
print("TD_ACC: Takedown Accuracy")
print("SDHL/SDHA: Strikes to Head Landed/Attempted")
print("SDBL/SDBA: Strikes to Body Landed/Attempted")
print("SDLL/SDLA: Strikes to Leg Landed/Attempted")

Now I'll aggregate the stats by fighter to get career totals. I'll use `groupby` and `agg` to sum up all the strikes and takedowns for each fighter across all their fights.

In [None]:
fighter_agg = fighter_stats.groupby('fighter_id').agg({
    'TSL': 'sum',
    'TSA': 'sum',
    'SSL': 'sum',
    'SSA': 'sum',
    'KD': 'sum',
    'TDL': 'sum',
    'TDA': 'sum'
})

# Career accuracy = total landed / total attempted (NaN when there were no attempts)
fighter_agg['total_strike_acc'] = fighter_agg['TSL'] / fighter_agg['TSA']
fighter_agg['sig_strike_acc'] = fighter_agg['SSL'] / fighter_agg['SSA']
fighter_agg['takedown_acc'] = fighter_agg['TDL'] / fighter_agg['TDA']

# Note: this turns "no attempts" into 0% accuracy, so filter by attempts before reading accuracy
fighter_agg = fighter_agg.fillna(0)

fighter_agg = fighter_agg.merge(
    fighter_attributes[['fighter_id', 'name', 'weight_class']],
    on='fighter_id',
    how='left'
)

fighter_agg.head(10)
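One thing worth checking with career totals: summing landed and attempted strikes first and then dividing is not the same as averaging each fight's accuracy. Here's a toy sketch with two made-up fights for one hypothetical fighter (the numbers are invented for illustration).

```python
import pandas as pd

# Made-up stats: one short fight (1 of 2 landed) and one long fight (40 of 100 landed)
toy_stats = pd.DataFrame({
    'fighter_id': [1, 1],
    'SSL': [1, 40],    # significant strikes landed
    'SSA': [2, 100],   # significant strikes attempted
})

# Approach A: sum first, then divide (career accuracy weighted by volume)
totals = toy_stats.groupby('fighter_id')[['SSL', 'SSA']].sum()
pooled_acc = totals['SSL'] / totals['SSA']

# Approach B: accuracy per fight, then average (every fight weighted equally)
per_fight_acc = toy_stats['SSL'] / toy_stats['SSA']
mean_of_fight_acc = per_fight_acc.groupby(toy_stats['fighter_id']).mean()

print(f"pooled: {pooled_acc.loc[1]:.3f}, mean of per-fight: {mean_of_fight_acc.loc[1]:.3f}")
```

Summing first means a two-strike fight can't swing the career number as much as a hundred-strike fight, which seems like the more sensible choice for career totals.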

Let me see who the top strikers and grapplers are based on these aggregated stats.

In [None]:
top_strikers = fighter_agg.nlargest(15, 'SSL')

plt.figure(figsize=(12, 6))
plt.barh(top_strikers['name'], top_strikers['SSL'], color='crimson')
plt.title('Top 15 Fighters by Total Significant Strikes Landed')
plt.xlabel('Significant Strikes Landed')
plt.ylabel('Fighter')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

In [None]:
top_kd_fighters = fighter_agg.nlargest(15, 'KD')

plt.figure(figsize=(12, 6))
plt.barh(top_kd_fighters['name'], top_kd_fighters['KD'], color='darkred')
plt.title('Top 15 Fighters by Total Knockdowns')
plt.xlabel('Knockdowns')
plt.ylabel('Fighter')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

In [None]:
top_td_fighters = fighter_agg.nlargest(15, 'TDL')

plt.figure(figsize=(12, 6))
plt.barh(top_td_fighters['name'], top_td_fighters['TDL'], color='navy')
plt.title('Top 15 Fighters by Total Takedowns Landed')
plt.xlabel('Takedowns Landed')
plt.ylabel('Fighter')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

These are the top takedown artists - mostly wrestlers and grapplers. This probably correlates with the fighters who have wrestling or grappling listed as their style.

Let me look at the distribution of accuracy stats. I'll filter for fighters with a minimum number of attempts so we don't get weird percentages from guys who only threw a few strikes.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

min_attempts = 50
filtered = fighter_agg[fighter_agg['TSA'] > min_attempts]

axes[0].hist(filtered['total_strike_acc'], bins=30, color='steelblue', edgecolor='white')
axes[0].set_title('Total Strike Accuracy')
axes[0].set_xlabel('Accuracy')
axes[0].set_ylabel('Frequency')

filtered2 = fighter_agg[fighter_agg['SSA'] > min_attempts]
axes[1].hist(filtered2['sig_strike_acc'], bins=30, color='crimson', edgecolor='white')
axes[1].set_title('Significant Strike Accuracy')
axes[1].set_xlabel('Accuracy')

# Takedowns are much rarer than strikes, so use a lower cutoff
filtered3 = fighter_agg[fighter_agg['TDA'] > 10]
axes[2].hist(filtered3['takedown_acc'], bins=30, color='navy', edgecolor='white')
axes[2].set_title('Takedown Accuracy')
axes[2].set_xlabel('Accuracy')

plt.tight_layout()
plt.show()
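To show why the minimum-attempts filter matters, here's a toy sketch with three made-up fighters (names and numbers are hypothetical). Two of them threw so few strikes that their accuracy is either perfect or zero, which says nothing about how accurate they really are.

```python
import pandas as pd

# Made-up career totals: A and C barely threw anything, B has a real sample
toy_agg = pd.DataFrame({
    'name': ['Fighter A', 'Fighter B', 'Fighter C'],
    'TSL': [3, 260, 0],    # total strikes landed
    'TSA': [3, 520, 4],    # total strikes attempted
})
toy_agg['total_strike_acc'] = toy_agg['TSL'] / toy_agg['TSA']

# Keep only fighters with enough attempts for the percentage to mean something
min_attempts_demo = 50
reliable = toy_agg[toy_agg['TSA'] > min_attempts_demo]

print(toy_agg[['name', 'total_strike_acc']])
print(reliable[['name', 'total_strike_acc']])
```

The cutoff of 50 is arbitrary here, so trying a few different values is a reasonable way to see how sensitive the histograms are to it.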

Strike accuracy looks like it's around 40-50% for most fighters. Takedown accuracy has more spread - some fighters are really good at it, others not so much.

## 5. UFC Events Analysis

Now let me look at the events data to understand where and when UFC events happen.

In [None]:
print("UFC Events Sample:")
print("")
ufc_events.head(10)

In [None]:
ufc_events['event_date'] = pd.to_datetime(ufc_events['event_date'])

total_fights = len(ufc_events)
unique_events = ufc_events['event_id'].nunique()
unique_venues = ufc_events['event_venue'].nunique()

print(f"Total fights: {total_fights}")
print(f"Unique events: {unique_events}")
print(f"Unique venues: {unique_venues}")
print(f"Date range: {ufc_events['event_date'].min()} to {ufc_events['event_date'].max()}")

In [None]:
wc_counts = ufc_events['weight_class'].value_counts()

plt.figure(figsize=(12, 6))
wc_counts.plot(kind='bar', color='steelblue')
plt.title('Number of Fights by Weight Class')
plt.xlabel('Weight Class')
plt.ylabel('Number of Fights')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
venues_by_events = ufc_events.groupby('event_venue')['event_id'].nunique()
top_venues = venues_by_events.nlargest(15)

plt.figure(figsize=(12, 6))
top_venues.plot(kind='barh', color='teal')
plt.title('Top 15 Venues by Number of UFC Events')
plt.xlabel('Number of Events')
plt.ylabel('Venue')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

In [None]:
events_by_country = ufc_events.groupby('event_venue_country')['event_id'].nunique()
top_countries = events_by_country.nlargest(15)

plt.figure(figsize=(12, 6))
top_countries.plot(kind='barh', color='darkorange')
plt.title('Top 15 Countries by Number of UFC Events')
plt.xlabel('Number of Events')
plt.ylabel('Country')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

Las Vegas (UFC APEX and T-Mobile Arena) dominates. USA hosts the most events by far but they do go international to other countries.

In [None]:
completed = ufc_events[ufc_events['fight_completed'] == 1]
outcomes = completed['fight_outcome'].value_counts()

plt.figure(figsize=(10, 6))
outcomes.plot(kind='bar', color='mediumpurple')
plt.title('Fight Outcome Distribution (Completed Fights)')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
title_fight_events = ufc_events[ufc_events['title_fight'] == 1]
title_by_wc = title_fight_events['weight_class'].value_counts()

plt.figure(figsize=(12, 6))
title_by_wc.plot(kind='bar', color='gold', edgecolor='black')
plt.title('Title Fights by Weight Class')
plt.xlabel('Weight Class')
plt.ylabel('Number of Title Fights')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

## 6. Combined Analysis

Now let me try combining some datasets to look for interesting patterns. I want to see if physical attributes are related to performance.

To do this I'll merge the fighter_attributes with the aggregated stats and records. This is like doing a JOIN in SQL - combining tables based on a common key (fighter_id).
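Here's a toy version of the left join on fighter_id, using two made-up tables (the values are hypothetical). I added `indicator=True` in this sketch to show which rows found a match, since a left merge keeps every fighter from the left table and fills in NaN where the right table has nothing.

```python
import pandas as pd

# Made-up tables: fighter 3 has attributes but no stats
toy_attributes = pd.DataFrame({'fighter_id': [1, 2, 3], 'name': ['A', 'B', 'C']})
toy_totals = pd.DataFrame({'fighter_id': [1, 2], 'SSL': [120, 45]})

# Left join: keep all rows of toy_attributes, like LEFT JOIN in SQL
toy_combined = toy_attributes.merge(
    toy_totals, on='fighter_id', how='left', indicator=True
)
print(toy_combined)
```

Counting the `left_only` rows is a quick way to see how many fighters have no stats at all before trusting any combined analysis.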

In [None]:
combined = fighter_attributes.merge(fighter_agg.drop(columns=['name', 'weight_class'], errors='ignore'), 
                                       on='fighter_id', how='left')

combined = combined.merge(fighter_wins[['wins', 'losses', 'total_fights', 'win_rate']], 
                                left_on='fighter_id', right_index=True, how='left')

combined.head()

Now let me check if there are correlations between physical attributes and performance. A correlation matrix is a good way to visualize relationships between multiple variables.

Remember from class that correlation can be misleading (Anscombe's Quartet!), but it's still a useful starting point for identifying patterns to investigate further with visualization.

In [None]:
cols = ['height', 'weight', 'age', 'reach', 'wins', 'losses', 'win_rate', 'SSL', 'KD', 'TDL']
corr_matrix = combined[cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0, fmt='.2f')
plt.title('Correlation Matrix: Physical Attributes vs Performance')
plt.tight_layout()
plt.show()

Height, weight, and reach are all strongly correlated with each other (obviously - taller people weigh more and have longer arms).

But the correlation between these physical attributes and win_rate is pretty weak. Reach has a small positive correlation with wins and win_rate though - this could be worth investigating more. Maybe fighters with longer reach relative to their height have an advantage?

Let me look at some specific breakdowns by stance and style.

In [None]:
min_fights = 5
fighters_with_fights = combined[combined['total_fights'] >= min_fights]
stance_performance = fighters_with_fights.groupby('stance').agg({
    'win_rate': 'mean',
    'fighter_id': 'count'
}).rename(columns={'fighter_id': 'count'})

fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(stance_performance.index, stance_performance['win_rate'], color='steelblue')
ax.set_title('Average Win Rate by Stance (min 5 fights)')
ax.set_xlabel('Stance')
ax.set_ylabel('Average Win Rate')
ax.set_ylim(0, 0.7)

for bar, count in zip(bars, stance_performance['count']):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, height + 0.01, 
            f'n={count}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

Stance doesn't seem to make a big difference in win rate. What about fighting style?

In [None]:
min_fighters_per_style = 10
style_performance = fighters_with_fights.groupby('style').agg({
    'win_rate': 'mean',
    'fighter_id': 'count'
}).rename(columns={'fighter_id': 'count'})

style_performance = style_performance[style_performance['count'] >= min_fighters_per_style]
style_performance = style_performance.sort_values('win_rate', ascending=False)

plt.figure(figsize=(12, 6))
bars = plt.barh(style_performance.index, style_performance['win_rate'], color='coral')
plt.title('Average Win Rate by Fighting Style (min 10 fighters)')
plt.xlabel('Average Win Rate')
plt.ylabel('Style')

for bar, count in zip(bars, style_performance['count']):
    width = bar.get_width()
    plt.text(width + 0.01, bar.get_y() + bar.get_height()/2, 
             f'n={count}', ha='left', va='center', fontsize=9)

plt.tight_layout()
plt.show()

Wrestling and Sambo have higher win rates. Interesting! A lot of top grapplers are from Russia with Sambo backgrounds. This might be something to explore more - whether Russian wrestlers/sambo fighters are more successful.

Let me also check reach advantage - I'll calculate how much each fighter's reach deviates from the average for their weight class.

In [None]:
# Reach relative to the average reach in the fighter's own weight class
# (NaN when reach or weight_class is missing)
avg_reach_per_class = combined.groupby('weight_class')['reach'].transform('mean')
combined['reach_advantage'] = combined['reach'] - avg_reach_per_class

filtered_combined = combined[(combined['total_fights'] >= 5) & (combined['reach_advantage'].notna())]

plt.figure(figsize=(10, 6))
plt.scatter(filtered_combined['reach_advantage'], filtered_combined['win_rate'], alpha=0.4, c='steelblue')
plt.axvline(0, color='red', linestyle='--', alpha=0.5)

# Fit the trend line only on rows where both values are present so x and y stay aligned
fit_data = filtered_combined[['reach_advantage', 'win_rate']].dropna()
z = np.polyfit(fit_data['reach_advantage'], fit_data['win_rate'], 1)
p = np.poly1d(z)
x_vals = np.linspace(fit_data['reach_advantage'].min(), fit_data['reach_advantage'].max(), 100)
plt.plot(x_vals, p(x_vals), color='red', linewidth=2, label='Trend Line')

plt.title('Reach Advantage vs Win Rate')
plt.xlabel('Reach Advantage (cm above weight class average)')
plt.ylabel('Win Rate')
plt.legend()
plt.tight_layout()
plt.show()
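Here's a toy version of the reach advantage idea with four made-up fighters (the weight classes and reach values are hypothetical). This sketch uses `groupby(...).transform('mean')`, which is one way to line up each fighter with their own weight class average before subtracting.

```python
import pandas as pd

# Made-up reach values in cm for two weight classes
toy_reach = pd.DataFrame({
    'weight_class': ['Lightweight', 'Lightweight', 'Heavyweight', 'Heavyweight'],
    'reach': [180.0, 184.0, 200.0, 196.0],
})

# transform('mean') returns one value per row: the mean reach of that row's weight class
class_mean_reach = toy_reach.groupby('weight_class')['reach'].transform('mean')
toy_reach['reach_advantage'] = toy_reach['reach'] - class_mean_reach

print(toy_reach)
```

A positive value means the fighter has longer reach than is typical for their weight class, and by construction the values average to zero within each class.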

There does seem to be a slight positive trend - fighters with longer reach for their weight class tend to have slightly higher win rates. But there's a lot of noise in the data.

Maybe looking at reach/height ratio would be better to control for the fact that taller fighters naturally have longer reach. This is something I'll explore in more depth in a separate notebook.

## 7. Summary

Let me wrap up with some summary statistics.

In [None]:
print("="*60)
print("UFC DATA SUMMARY")
print("="*60)

num_fighters = len(fighter_attributes)
num_male = (fighter_attributes['gender'] == 'male').sum()
num_female = (fighter_attributes['gender'] == 'female').sum()
num_weight_classes = fighter_attributes['weight_class'].nunique()
num_countries = fighter_attributes['country'].nunique()
avg_age = fighter_attributes['age'].mean()
avg_height = fighter_attributes['height'].mean()

print(f"\nFIGHTERS:")
print(f"  Total fighters in database: {num_fighters}")
print(f"  Male fighters: {num_male}")
print(f"  Female fighters: {num_female}")
print(f"  Weight classes: {num_weight_classes}")
print(f"  Countries represented: {num_countries}")
print(f"  Average age: {avg_age:.1f} years")
print(f"  Average height: {avg_height:.1f} cm")

print(f"\nFIGHT HISTORY:")
print(f"  Total fights recorded: {len(fighter_history)}")
print(f"  Title fights: {(fighter_history['title_fight'] == True).sum()}")
print(f"  Most common result type: {fighter_history['fight_result_type'].mode()[0]}")

print(f"\nEVENTS:")
print(f"  Total events: {ufc_events['event_id'].nunique()}")
print(f"  Venues used: {ufc_events['event_venue'].nunique()}")
print(f"  Countries hosted: {ufc_events['event_venue_country'].nunique()}")

print("\n" + "="*60)

## Key Takeaways & Next Steps

From this initial exploration, here are some patterns I found that I want to investigate more:

1. **Reach-to-Height Ratio**: Reach has a small positive correlation with win rate. I want to look at whether fighters with proportionally longer arms (high reach/height ratio) have an advantage. This makes sense from a fighting perspective - longer reach means you can hit opponents while staying out of their range.

2. **Fighting Style Analysis**: Wrestling and Sambo fighters seem to have higher win rates. Combined with Russia being a top country for fighters, I want to look at whether fighters with grappling backgrounds (especially Russian wrestlers/sambo fighters) are more dominant than strikers.

3. **Age and Performance**: The age distribution is interesting - I might want to explore how age affects performance and at what age fighters tend to peak.

These will be the focus of my next analyses!
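As a starting point for the first takeaway, here's a rough sketch of how the reach-to-height ratio could be computed. The three fighters are made up, and I'm assuming reach and height are in the same units (cm), which I'd need to confirm in the real data.

```python
import pandas as pd

# Made-up measurements in cm (hypothetical fighters, not from the dataset)
toy_fighters = pd.DataFrame({
    'name': ['Fighter A', 'Fighter B', 'Fighter C'],
    'height': [180.0, 180.0, 190.0],
    'reach': [180.0, 190.0, 190.0],
})

# A ratio above 1 means arms that are long relative to height
toy_fighters['reach_height_ratio'] = toy_fighters['reach'] / toy_fighters['height']

print(toy_fighters.sort_values('reach_height_ratio', ascending=False))
```

Fighters A and C have the same ratio despite different heights, which is the point: the ratio separates "long arms" from "just tall".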