### Dataset Dictionary

Header | Definition
---|---------
`URL`| The URL of the comic character on the Marvel Wikia
`Name/Alias` | The full name or alias of the character
`Appearances` | The number of comic books the character appeared in as of April 30
`Current?` | Whether the member is currently active on an Avengers-affiliated team
`Gender` | The recorded gender of the character
`Probationary` | The date the character was given probationary status as an Avenger, if applicable
`Full/Reserve` | The month and year the character was introduced as a full or reserve member of the Avengers
`Year` | The year the character was introduced as a full or reserve member of the Avengers
`Years since joining` | 2015 minus `Year`
`Honorary` | The character's membership status: "Honorary" if they were given honorary Avenger status, "Academy" if they are only in the Academy, and "Full" otherwise
`Death1` | Yes if the Avenger died, No if not
`Return1` | Yes if the Avenger returned from their first death, No if they did not, blank if not applicable
`Death2` | Yes if the Avenger died a second time after their revival, No if they did not, blank if not applicable
`Return2` | Yes if the Avenger returned from their second death, No if they did not, blank if not applicable
`Death3` | Yes if the Avenger died a third time after their second revival, No if they did not, blank if not applicable
`Return3` | Yes if the Avenger returned from their third death, No if they did not, blank if not applicable
`Death4` | Yes if the Avenger died a fourth time after their third revival, No if they did not, blank if not applicable
`Return4` | Yes if the Avenger returned from their fourth death, No if they did not, blank if not applicable
`Death5` | Yes if the Avenger died a fifth time after their fourth revival, No if they did not, blank if not applicable
`Return5` | Yes if the Avenger returned from their fifth death, No if they did not, blank if not applicable
`Notes` | Descriptions of deaths and resurrections. 

# Questions to answer

1. How many Avengers have died at least once?
2. Which 10 Avengers died the most?
3. What is the distribution of deaths among characters?
4. Did all characters die at some point?
5. Is there a difference in mortality between characters of different genders?
6. What is the relationship between the number of appearances and deaths?
7. Has mortality changed over the years?
8. Are there Avengers who died multiple times and returned?
9. Which characters are currently active or inactive?
10. Who are the characters with the most appearances who never died?
11. What is the average time since a character joined the Avengers?

## Importing Libraries

In [111]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from tabulate import tabulate

plt.style.use('ggplot')

file_path = 'data/avengers.csv'

In [112]:
avengers = pd.read_csv(file_path, encoding='ISO-8859-1')

## Understanding the Data

Before diving into the analysis, it's essential to understand the structure and key characteristics of our dataset. The following functions provide a summary of the Avengers dataset:

```avengers.describe()``` gives statistical insights, such as count, mean, and standard deviation for numerical columns.

```avengers.info()``` provides a detailed overview of the dataset, including column names, data types, and missing values.

```avengers.head()``` displays the first few rows of the dataset.



In [None]:
avengers.describe()

In [None]:
avengers.info()

In [None]:
avengers.head()
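
As a complement to the summaries above, a per-column count of missing values makes gaps easier to spot than the `info()` output. This is a minimal sketch on a hypothetical three-row sample (the `Hero A`/`Hero B` names and the `Some_Hero` URL are placeholders, not rows from the dataset); on the real data the same call would be `avengers.isna().sum()`.

```python
import pandas as pd

# Hypothetical sample standing in for the real avengers DataFrame
sample = pd.DataFrame({
    'URL': ['marvel.wikia.com/Jocasta_(Earth-616)',
            'marvel.wikia.com/Some_Hero_(Earth-616)',
            'marvel.wikia.com/Other_Hero_(Earth-616)'],
    'Name/Alias': ['Hero A', None, 'Hero B'],
    'Return1': ['YES', None, None],
})

# Count the missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```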

## Fixing the 'Name/Alias' Column

The column containing the Avengers' names has some missing values.

To address this, we could either drop the rows with missing values or replace the empty fields with "Unknown".

However, the `URL` column contains each hero's name within the link, so we can extract the names from the URLs and use them to fill in the missing values.

In [115]:
# Function to extract the character's name from the URL
def extract_name_from_url(url):
    # We want the part after the last slash '/' and before the '_(Earth-N)' suffix,
    # as in this example: marvel.wikia.com/Jocasta_(Earth-616)
    match = re.search(r'/([^/]+)_\(Earth-\d+\)', url)
    if match:
        return match.group(1).replace('_', ' ')  # Replace '_' with a space
    return None
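
To see what the pattern accepts and rejects, the sketch below runs a guarded variant on a few inputs. The name `extract_name_safe` and the `Some_Hero` URL are hypothetical; the `isinstance` guard is an added assumption to cover missing URLs, which pandas represents as float `NaN` and which `re.search` cannot process.

```python
import re

def extract_name_safe(url):
    """Return the name embedded in a wiki URL, or None if it cannot be found."""
    # Missing values arrive from pandas as float NaN, so skip anything that is not a string
    if not isinstance(url, str):
        return None
    match = re.search(r'/([^/]+)_\(Earth-\d+\)', url)
    return match.group(1).replace('_', ' ') if match else None

print(extract_name_safe('marvel.wikia.com/Jocasta_(Earth-616)'))  # Jocasta
print(extract_name_safe('marvel.wikia.com/Some_Hero'))            # None (no "_(Earth-N)" suffix)
print(extract_name_safe(float('nan')))                            # None (not a string)
```

URLs that lack the `_(Earth-N)` suffix return `None`, so any names that are still missing after the fill would need to be handled separately.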


In [116]:
# Fill the 'Name/Alias' column with the extracted name from the 'URL' column
# if there are any missing values in 'Name/Alias'

avengers['Name/Alias'] = avengers['Name/Alias'].fillna(avengers['URL'].apply(extract_name_from_url))

### How many Avengers have died at least once?

In [None]:
# Create a new column 'Died' that will be True if any of the 'Death1' to 'Death5' columns contain 'YES'
avengers['Died'] = avengers[['Death1','Death2', 'Death3', 'Death4', 'Death5']].apply(lambda x: 'YES' in x.values, axis=1)

# Sum up the number of Avengers who have died at least once
dead_avengers = avengers['Died'].sum()

print(f'Number of Avengers who died at least once: {dead_avengers}')
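
The same flag can also be computed without a row-wise `apply`, by comparing the whole block of death columns to `'YES'` at once. This is a sketch on hypothetical rows, not the notebook's own method; it assumes deaths are recorded exactly as the string `'YES'`.

```python
import pandas as pd

death_cols = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']

# Hypothetical rows: blanks in the CSV are represented here as None
sample = pd.DataFrame({
    'Death1': ['YES', 'NO', 'YES'],
    'Death2': ['YES', None, 'NO'],
    'Death3': ['NO', None, None],
    'Death4': [None, None, None],
    'Death5': [None, None, None],
})

# True where at least one death column equals 'YES'; missing values compare as False
died = sample[death_cols].eq('YES').any(axis=1)
print(died.sum())  # 2
```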

In [226]:
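# Get the count of Avengers who have died and those who haven't,
# in a fixed False/True order so the counts line up with the tick labels below
died_counts = avengers['Died'].value_counts().reindex([False, True], fill_value=0)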

In [None]:
plt.figure(figsize=(8, 6))
sns.barplot(x=died_counts.index, y=died_counts.values, palette='viridis')

plt.title('Number of Avengers Who Died at Least Once')
plt.xlabel('Died at Least Once')
plt.ylabel('Number of Avengers')

plt.xticks(ticks=[0, 1], labels=['Did Not Die', 'Died at Least Once'])

# Adding value labels on top of the bars
for i, v in enumerate(died_counts.values):
    plt.text(i, v + 1, str(v), color='black', ha='center')

plt.show()


### Top 10 Avengers Who Died the Most

In [120]:
# List of columns representing different death occurrences
death_columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']

# Create a new column 'Total_Deaths' that sums the number of times 'YES' appears across the death columns
avengers['Total_Deaths'] = avengers[death_columns].apply(lambda x: x.str.contains('YES', na=False).sum(), axis=1)


In [121]:
# Sort the Avengers by the 'Total_Deaths' in descending order and select the top 10
top_10_deaths = avengers.sort_values(by = 'Total_Deaths', ascending = False).head(10)

In [None]:
top_10_deaths
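
Sorting and then taking `head(10)` can also be written with `DataFrame.nlargest`, which selects the top rows in one step. The sketch below uses a hypothetical four-row frame and picks the top two.

```python
import pandas as pd

# Hypothetical death counts
sample = pd.DataFrame({
    'Name/Alias': ['Hero A', 'Hero B', 'Hero C', 'Hero D'],
    'Total_Deaths': [1, 3, 0, 2],
})

# Select the two rows with the highest 'Total_Deaths'
top_2 = sample.nlargest(2, 'Total_Deaths')
print(top_2)
```

When several characters are tied at the cutoff, `nlargest` keeps the first ones it encounters by default; passing `keep='all'` returns every tied row instead, which may yield more than the requested number.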

In [None]:
plt.figure(figsize=(10, 8))
plt.barh(top_10_deaths['Name/Alias'], top_10_deaths['Total_Deaths'], color='darkred', edgecolor='black')

plt.xlabel('Number of Deaths', fontsize=12)
plt.ylabel('Avenger', fontsize=12)
plt.title('Top 10 Avengers Who Died the Most', fontsize=14)
plt.gca().invert_yaxis() 


for index, value in enumerate(top_10_deaths['Total_Deaths']):
    plt.text(value + 0.2, index, str(value), color='black', va='center', fontsize=10)

plt.grid(axis='x', linestyle='--', alpha=0.7) 
plt.tight_layout()
plt.show()


### Have all characters died at some point?

In [None]:
# Check if all characters have died at least once by using the 'Died' column
all_characters_died = avengers['Died'].all()

# If all characters died, print that information
if all_characters_died:
    print('All characters have died')
else:
    # If not all characters died, print how many have not died
    print('Not all characters have died')
    
    # Filter the list of characters who have never died
    never_died = avengers[~avengers['Died']]
    
    print(f'Number of characters who have not died: {len(never_died)}')
    
    # Create a list of characters who have never died
    never_died_list = never_died[['Name/Alias']].reset_index(drop=True).values.tolist()
    
    # Display the list in a fancy grid format using 'tabulate'
    print("\nCharacters who have not died:")
    print(tabulate(never_died_list, headers=['Name/Alias'], tablefmt='fancy_grid'))


### What is the distribution of deaths among characters?

In [None]:
# Recompute 'Total_Deaths' by counting 'YES' values across the death columns
# (counting non-null values would also count 'NO' entries as deaths)
avengers['Total_Deaths'] = (avengers[death_columns] == 'YES').sum(axis=1)

# Display each Avenger's name along with their total deaths
avengers[['Name/Alias', 'Total_Deaths']]

### Distribution of Number of Deaths Among Avengers

This histogram shows the distribution of the number of deaths among Avengers characters. Each bar represents the count of characters that have experienced a specific number of deaths. The x-axis shows the number of deaths, while the y-axis represents the number of Avengers with that number of deaths.


In [None]:
plt.figure(figsize=(12, 8))
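max_deaths = int(avengers['Total_Deaths'].max())
# Place bin edges at half-integers so each bar is centered on its integer death count
n, bins, patches = plt.hist(avengers['Total_Deaths'].dropna(), bins=[i - 0.5 for i in range(max_deaths + 2)], edgecolor='black', color='darkred')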


plt.title('Distribution of Number of Deaths Among Avengers', fontsize=14)
plt.xlabel('Number of Deaths', fontsize=12)
plt.ylabel('Number of Avengers', fontsize=12)
plt.xticks(range(int(avengers['Total_Deaths'].max() + 1)))
plt.grid(axis='y', linestyle='--', alpha=0.7)


for count, bin_edge in zip(n, bins):
    plt.text(bin_edge + 0.5, count + 1, f'{int(count)}', fontsize=10, color='black', ha='center')


plt.tight_layout()
plt.show()
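
The histogram can be cross-checked with a plain frequency table. This sketch uses hypothetical death counts; `reindex` is added so that counts nobody has (3 and 4 here) appear as zeros instead of being dropped.

```python
import pandas as pd

# Hypothetical per-character death counts
total_deaths = pd.Series([0, 1, 1, 2, 0, 5], name='Total_Deaths')

# Frequency of each death count from 0 up to the maximum, including empty counts
death_distribution = (
    total_deaths.value_counts()
    .reindex(range(total_deaths.max() + 1), fill_value=0)
)
print(death_distribution)
```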


### How many times have the Avengers returned from death?

In [None]:
# Calculate the number of returns by counting how many 'YES' values appear in the return columns
return_counts = avengers[['Return1', 'Return2', 'Return3', 'Return4', 'Return5']].apply(lambda x: x.str.contains('YES', na=False).sum(), axis=1)

# Sum the total number of returns from death across all Avengers
total_returns = return_counts.sum()

# Print the total number of returns
print(f'Total number of returns from death: {total_returns}')
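
A total number of returns is easier to interpret relative to the total number of deaths. The sketch below computes that ratio on hypothetical rows; it uses only two death/return slots for brevity (the real dataset has five), and `return_rate` is a name introduced here for illustration.

```python
import pandas as pd

n_slots = 2  # the real dataset has five Death/Return pairs
death_cols = [f'Death{i}' for i in range(1, n_slots + 1)]
return_cols = [f'Return{i}' for i in range(1, n_slots + 1)]

# Hypothetical rows: blanks in the CSV are represented here as None
sample = pd.DataFrame({
    'Death1':  ['YES', 'YES', 'NO'],
    'Return1': ['YES', 'NO', None],
    'Death2':  ['YES', None, None],
    'Return2': ['NO', None, None],
})

# Count 'YES' entries across all death columns and all return columns
total_deaths = sample[death_cols].eq('YES').sum().sum()
total_returns = sample[return_cols].eq('YES').sum().sum()

# Share of recorded deaths that were followed by a return
return_rate = total_returns / total_deaths
print(f'{total_returns} returns out of {total_deaths} deaths ({return_rate:.0%})')
```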

In [186]:
# Create a 'Total_Returns' column to store the number of returns for each Avenger
avengers['Total_Returns'] = avengers[['Return1', 'Return2', 'Return3', 'Return4', 'Return5']].apply(lambda x: x.str.contains('YES', na=False).sum(), axis=1)

# Calculate the distribution of total returns among characters
return_distribution = avengers['Total_Returns'].value_counts().sort_index()

In [None]:
print(return_distribution)

In [None]:
plt.figure(figsize=(10, 6))
plt.bar(return_distribution.index, return_distribution.values, color='darkred', edgecolor='black')


plt.title('Distribution of Returns from Death Among Avengers', fontsize=14)
plt.xlabel('Number of Returns', fontsize=12)
plt.ylabel('Number of Avengers', fontsize=12)
plt.xticks(return_distribution.index)
plt.grid(axis='y', linestyle='--', alpha=0.7)


# Place each label at the bar's actual x position (the number of returns), not its rank in the series
for returns, count in return_distribution.items():
    plt.text(returns, count + 1, str(count), ha='center', fontsize=10, color='black')


plt.tight_layout()
plt.show()


### Is there a difference in mortality between characters of different genders?

In [None]:
# Group Avengers by gender and calculate the total deaths for each gender
deaths_by_gender = avengers.groupby('Gender')['Total_Deaths'].sum()

deaths_by_gender

In [None]:
plt.figure(figsize=(10, 6))
deaths_by_gender.plot(kind='bar', color='darkblue', edgecolor='black')


plt.xlabel('Gender', fontsize=12)
plt.ylabel('Total Number of Deaths', fontsize=12)
plt.title('Total Number of Deaths by Gender', fontsize=14)
plt.xticks(rotation=45, fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)


for index, value in enumerate(deaths_by_gender):
    plt.text(index, value + 1, str(value), ha='center', fontsize=10, color='black')

plt.tight_layout()
plt.show()
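
Total deaths per gender depend on how many characters of each gender are in the dataset, so a rate is a fairer basis for comparison. This is a sketch on hypothetical rows; the `'MALE'`/`'FEMALE'` labels and the `members`, `died`, and `death_rate` column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows with a boolean 'Died' flag
sample = pd.DataFrame({
    'Gender': ['MALE', 'MALE', 'MALE', 'FEMALE', 'FEMALE'],
    'Died': [True, False, True, True, False],
})

# Per gender: group size, number who died at least once, and the share who died
by_gender = sample.groupby('Gender')['Died'].agg(
    members='size',
    died='sum',
    death_rate='mean',
)
print(by_gender)
```

The mean of a boolean column is the fraction of `True` values, which is what makes `death_rate` a proportion.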


### Has Avengers mortality changed over the years?

In [229]:
# Group Avengers by the year they joined and count how many from each year have died at least once
deaths_by_year = avengers.groupby('Year')['Died'].sum()

# Reset the index and name the count column 'Total_Deaths' (used by the plot below)
mortality_df = deaths_by_year.reset_index(name='Total_Deaths')

In [None]:
plt.figure(figsize=(12, 6))
plt.plot(mortality_df['Year'], mortality_df['Total_Deaths'], marker='o', color='darkblue', linewidth=2, markersize=6)


plt.title('Avengers Who Died at Least Once, by Year Joined', fontsize=14)
plt.xlabel('Year Joined', fontsize=12)
plt.ylabel('Avengers Who Died', fontsize=12)


plt.xticks(mortality_df['Year'][::5], rotation=45, fontsize=10)


plt.grid(axis='y', linestyle='--', alpha=0.7)


plt.tight_layout()
plt.show()
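
Yearly counts are noisy because few characters join in any single year, and they do not account for how many joined. One possible refinement, sketched here on hypothetical rows, is to group by joining decade and compute the share of each cohort that died at least once. The `Decade` column is introduced for illustration.

```python
import pandas as pd

# Hypothetical joining years and death flags
sample = pd.DataFrame({
    'Year': [1963, 1965, 1988, 1989, 2012],
    'Died': [True, True, False, True, False],
})

# Floor each year to the start of its decade, e.g. 1988 -> 1980
sample['Decade'] = sample['Year'] // 10 * 10

# Share of each decade's cohort that died at least once
rate_by_decade = sample.groupby('Decade')['Died'].mean()
print(rate_by_decade)
```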


### Are there Avengers who died multiple times and returned?

In [None]:
# Filter Avengers who died more than once and returned from death at least once
multiple_deaths_returned = avengers[(avengers[['Death2', 'Death3', 'Death4', 'Death5']] == 'YES').any(axis=1) &
         (avengers[['Return1', 'Return2', 'Return3', 'Return4', 'Return5']] == 'YES').any(axis=1)]

# Select columns with names, deaths, and returns
multiple_deaths_returned_names = multiple_deaths_returned[['Name/Alias', 'Death1', 'Death2', 'Death3', 'Death4', 'Death5', 'Return1', 'Return2', 'Return3', 'Return4', 'Return5']]
multiple_deaths_returned_names

### Which characters are currently active or inactive?

In [None]:
# Filter Avengers based on their current activity status
active_avengers = avengers[avengers['Current?'] == 'YES']
inactive_avengers = avengers[avengers['Current?'] == 'NO']

print(f"Number of active avengers: {len(active_avengers)} \n")
print(f"Number of inactive avengers: {len(inactive_avengers)} \n")

print("\nActive avengers:")
print(active_avengers['Name/Alias'].to_string(index=False))


In [None]:
print("\nInactive avengers:")
print(inactive_avengers['Name/Alias'].to_string(index=False))
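
Activity status can also be set against the death flag in a single table. This is a minimal sketch with `pd.crosstab` on hypothetical rows; the `'died'`/`'never died'` labels are introduced here only to make the columns readable.

```python
import pandas as pd

# Hypothetical rows
sample = pd.DataFrame({
    'Current?': ['YES', 'NO', 'NO', 'NO', 'YES'],
    'Died': [True, True, True, False, False],
})

# Turn the boolean flag into readable labels, then cross-tabulate against activity status
died_label = sample['Died'].map({True: 'died', False: 'never died'})
status_by_death = pd.crosstab(sample['Current?'], died_label)
print(status_by_death)
```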

### Which characters have the most appearances and never died?

In [None]:
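# Select Avengers who have never died (recomputed here so this cell does not depend
# on a variable defined inside an earlier if/else branch) and sort them by appearances
never_died = avengers[~avengers['Died']]
top_appear_never_died = never_died.sort_values(by='Appearances', ascending=False)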


print("Top 10 Avengers with the most appearances who never died: ")
print(top_appear_never_died[['Name/Alias', 'Appearances']].head(10).to_string(index = False))

### What is the average time (in years) since a character joined the Avengers?

In [None]:
# Calculate the average number of years since characters joined the Avengers
avg_years_since_joining = avengers['Years since joining'].mean()

print(f"The average time (in years) since a character joined the Avengers is: {avg_years_since_joining:.2f} years!")
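
Since the data dictionary defines `Years since joining` as 2015 minus the year, the column can be checked against `Year` before averaging, and the median can be reported alongside the mean in case a few unusual years skew it. The rows below are hypothetical, including the deliberately extreme 1900 entry.

```python
import pandas as pd

# Hypothetical rows; 1900 is included only to show the effect of an outlier
sample = pd.DataFrame({
    'Year': [1963, 1988, 2012, 1900],
    'Years since joining': [52, 27, 3, 115],
})

# Check that the stored column matches its definition (2015 minus the year)
recomputed = 2015 - sample['Year']
matches = recomputed.eq(sample['Years since joining']).all()

mean_years = sample['Years since joining'].mean()
median_years = sample['Years since joining'].median()
print(f'Consistent: {matches}, mean: {mean_years:.2f}, median: {median_years:.2f}')
```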