# Superhero Data Analysis Exercise
## Introduction
This notebook analyses a superhero dataset, as found via [Kaggle](https://www.kaggle.com/datasets/saadatkhalid/superhero-dataset).<br>
The dataset contains information on the heroes' `Name`, `First Appearance`, `Origin`, and `Publisher`.

We want to explore the data to answer the following questions:
1. Which publishers have introduced the most superheroes?
2. How are superheroes distributed across different origins?
3. Do major publishers focus on certain origins more than independent publishers?

## The Dataset

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# from ydata_profiling import ProfileReport

sns.set_style('whitegrid')

In [2]:
df = pd.read_csv("../data/superhero.csv")
df.rename(columns={
    'first_appeared_in_issue':  'first_apperance'
},inplace=True)
df.head(10)

Unnamed: 0,first_apperance,name,origin,publisher
0,The Legion of Super-Heroes,Lightning Lad,Alien,DC Comics
1,The Menace of Dream Girl!,Dream Girl,Alien,DC Comics
2,The War Between Supergirl and The Supermen Eme...,Brainiac 5,Alien,DC Comics
3,Hercules in the 20th Century!,Invisible Kid,Human,DC Comics
4,The War Between Supergirl and The Supermen Eme...,Phantom Girl,Alien,DC Comics
5,The War Between Supergirl and The Supermen Eme...,Sun Boy,Radiation,DC Comics
6,Lana Lang and the Legion of Super-Heroes!,Thom Kallor,Alien,DC Comics
7,Escape of the Fatal Five!; Mocked By The Master!,Shadow Lass,Alien,DC Comics
8,The War Between Supergirl and The Supermen Eme...,Duplicate Girl,Alien,DC Comics
9,The Confession of Superboy!,Element Lad,Alien,DC Comics


## Data Structure
### Datatypes

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45224 entries, 0 to 45223
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   first_apperance  35981 non-null  object
 1   name             45224 non-null  object
 2   origin           45224 non-null  object
 3   publisher        45224 non-null  object
dtypes: object(4)
memory usage: 1.4+ MB


In [4]:
df.shape

(45224, 4)

### Missing values 

In [5]:
non_nan_count = df.describe().loc['count']
nan_count_using_describe = len(df) - non_nan_count
print("NaN count per column using describe():")
print(nan_count_using_describe)

NaN count per column using describe():
first_apperance    9243
name                  0
origin                0
publisher             0
Name: count, dtype: object
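
A more direct route to the same counts is `DataFrame.isna().sum()`, which also keeps an integer dtype instead of `object`. The sketch below runs on a tiny invented frame (`demo` is a hypothetical stand-in, not the real file) so that it can be executed on its own:

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; all values are invented.
demo = pd.DataFrame({
    "first_apperance": ["Issue A", None, "Issue C", None],
    "name": ["Hero 1", "Hero 2", "Hero 3", "Hero 4"],
    "origin": ["Human", "Alien", "Human", "Mutant"],
    "publisher": ["Pub X", "Pub X", "Pub Y", "Pub Z"],
})

# isna() flags each missing cell; summing the booleans counts them per column.
nan_count = demo.isna().sum()

# The mean of the same booleans gives the share of rows affected, in percent.
nan_share = (demo.isna().mean() * 100).round(1)

print(nan_count)
print(nan_share)
```

On the real data the same two lines would be applied to `df` instead of `demo`.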


### Descriptive statistics



In [6]:
df.describe()

Unnamed: 0,first_apperance,name,origin,publisher
count,35981,45224,45224,45224
unique,17125,40342,10,580
top,Vol. 1,Mirage,Human,Marvel
freq,245,11,27561,10921


## Data Cleaning
### Missing Data
**Identify and Handle Missing Data:**
- Check for missing values and decide how to handle them
- For this dataset, it might be reasonable to drop rows with missing data or to fill in missing values where appropriate

Of the 45224 superhero entries we have, 9243 (about 20%) are missing information about their first appearance.<br>
The information does exist, but retrieving it would be time-consuming, so we drop the affected rows instead.

In [7]:
df.dropna(how='any', inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 35981 entries, 0 to 45223
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   first_apperance  35981 non-null  object
 1   name             35981 non-null  object
 2   origin           35981 non-null  object
 3   publisher        35981 non-null  object
dtypes: object(4)
memory usage: 1.4+ MB


In [8]:
df.describe()

Unnamed: 0,first_apperance,name,origin,publisher
count,35981,35981,35981,35981
unique,17125,32730,10,449
top,Vol. 1,Michael,Human,Marvel
freq,245,8,21711,10141


### Consistency in Data
**Ensure Consistency in Text Data:**
- Verify the 'publisher' and 'origin' columns for consistency in different cells
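
One possible way to check the two columns is to compare the number of unique labels before and after a light normalisation. This is only a sketch: the `normalise` helper and the `demo` frame with its deliberately messy spellings are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample with deliberately inconsistent spellings (invented values).
demo = pd.DataFrame({
    "origin": ["Human", "human ", "Alien", " Alien"],
    "publisher": ["DC Comics", "DC  Comics", "Marvel", "marvel"],
})


def normalise(series: pd.Series) -> pd.Series:
    """Trim whitespace, collapse repeated spaces, and lower-case the labels."""
    return (
        series.str.strip()
        .str.replace(r"\s+", " ", regex=True)
        .str.lower()
    )


for column in ["origin", "publisher"]:
    raw_unique = demo[column].nunique()
    clean_unique = normalise(demo[column]).nunique()
    # A drop in unique labels signals spelling variants of the same value.
    print(f"{column}: {raw_unique} raw labels, {clean_unique} after normalising")
```

If the two counts match on the real `df`, the column is already consistent with respect to case and whitespace.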

Of the 35981 remaining entries, only 32730 names are unique.<br>
This means that 3251 rows repeat a name that already appears in an earlier row.<br>
Filtering with the boolean mask from `df.duplicated('name')` lists these rows:

In [9]:
df[df.duplicated('name')].sort_values(by=['name', 'publisher','origin'])

Unnamed: 0,first_apperance,name,origin,publisher
24451,Pyre Part 3,Abel,Cyborg,Aspen MLT
27071,"Of Birth, Death And The Confused, Painful Bit ...",Aberdeen Angus,Other,Marvel
18431,King David,Abigail,Human,In the Public Domain
35670,Volume 1,Abigail,Human,Wildstorm
20111,David and Goliath,Abner,Human,In the Public Domain
...,...,...,...,...
24533,Training Day(Part 2),Zog,Human,Image
22435,Vengeance of Bane,Zombie,Human,DC Comics
31983,The Storm That Shook Japan; Operation: C.H.A.S...,Zon,Animal,DC Comics
14079,Devil's Lot; Meat Machine Part 1,Zora,Human,Dark Horse Comics


## Data Transformation
   - **Categorize Publishers:**
     - To analyze the influence of major versus independent publishers, categorize the publishers accordingly
     - Identify major publishers based on their frequency in the dataset
     - Create a new column `publisher_category`
     - Assign `major` to major publishers and `independent` to others
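
The steps above could be sketched as follows. The 25% cut-off (`MAJOR_SHARE`) is an arbitrary assumption chosen for this toy example, and the `demo` frame is hypothetical. On the real data a different threshold, or simply the top few publishers by count, may be more appropriate.

```python
import pandas as pd

# Hypothetical sample: 10 heroes spread over four invented publishers.
demo = pd.DataFrame({
    "name": [f"Hero {i}" for i in range(10)],
    "publisher": ["Pub A"] * 5 + ["Pub B"] * 3 + ["Pub C", "Pub D"],
})

# Assumption: a publisher is "major" if it holds at least 25% of all entries.
MAJOR_SHARE = 0.25

# Relative frequency of each publisher in the dataset.
publisher_share = demo["publisher"].value_counts(normalize=True)
major_publishers = publisher_share[publisher_share >= MAJOR_SHARE].index

# Label each row according to whether its publisher is in the major set.
demo["publisher_category"] = demo["publisher"].isin(major_publishers).map(
    {True: "major", False: "independent"}
)

print(demo["publisher_category"].value_counts())
```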




## Exploratory Data Analysis (EDA)
   - **Visualize the Distribution of Superhero Origins:**
     - Use a bar plot to visualize the distribution of superhero origins
   - **Analyze Publisher Trends Over Time:**
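     - If the `first_apperance` column (renamed from `first_appeared_in_issue` when loading the data) allows, explore how different publishers have introduced superheroes over time; the sampled values are issue titles rather than dates, so this may not be possible without additional data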
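
A minimal sketch of the origin bar plot, using plain matplotlib on an invented `demo` frame (hypothetical values). Seaborn's `countplot` would be an alternative way to draw the same chart.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample of origins; the real column has 10 distinct values.
demo = pd.DataFrame({"origin": ["Human"] * 5 + ["Alien"] * 3 + ["Mutant"] * 2})

# value_counts() sorts by frequency, so the bars appear in descending order.
origin_counts = demo["origin"].value_counts()

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(origin_counts.index, origin_counts.values)
ax.set_xlabel("Origin")
ax.set_ylabel("Number of superheroes")
ax.set_title("Superheroes per origin (demo data)")
fig.tight_layout()
```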




In [10]:
# profile = ProfileReport(df, title='Superhero Data Analysis')
# profile.to_notebook_iframe()

## Detailed Data Analysis
   - **Publisher Dominance:**
     - Analyze which publishers have introduced the most superheroes using a bar plot
   - **Relationship Between Origin and Publisher:**
     - Explore whether certain publishers are associated with specific origins using a heatmap
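
Both steps could be sketched as below. The `demo` frame is hypothetical, the cut-off of two publishers in `head(2)` is arbitrary, and the heatmap is drawn with matplotlib's `imshow` on a row-normalised crosstab. Seaborn's `heatmap` could be used on the same table instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample; publishers and origins are invented.
demo = pd.DataFrame({
    "publisher": ["Pub A", "Pub A", "Pub A", "Pub A",
                  "Pub B", "Pub B", "Pub B", "Pub C"],
    "origin": ["Human", "Human", "Alien", "Mutant",
               "Human", "Alien", "Alien", "Human"],
})

# Publisher dominance: the publishers with the most entries.
top_publishers = demo["publisher"].value_counts().head(2)

# Row-normalised crosstab: each row is the origin mix within one publisher.
origin_mix = pd.crosstab(demo["publisher"], demo["origin"], normalize="index")

fig, (ax_bar, ax_heat) = plt.subplots(1, 2, figsize=(10, 3))

# Horizontal bars, reversed so the largest publisher is drawn at the top.
ax_bar.barh(top_publishers.index[::-1], top_publishers.values[::-1])
ax_bar.set_xlabel("Number of superheroes")

# Heatmap of the origin shares per publisher.
image = ax_heat.imshow(origin_mix.to_numpy(), cmap="viridis", aspect="auto")
ax_heat.set_xticks(range(origin_mix.shape[1]))
ax_heat.set_xticklabels(origin_mix.columns)
ax_heat.set_yticks(range(origin_mix.shape[0]))
ax_heat.set_yticklabels(origin_mix.index)
fig.colorbar(image, ax=ax_heat, label="Share of publisher's heroes")
fig.tight_layout()
```

Normalising by row keeps large publishers from dominating the colour scale, which makes differences in origin mix easier to compare.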




## Hypothesis Testing
   - For instance, test whether major publishers are more likely to introduce superheroes from certain origins compared to independent publishers
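
One common choice for this kind of question is a chi-square test of independence on the publisher-category by origin contingency table. The sketch below assumes a `publisher_category` column already exists, uses an invented `demo` frame, and takes 0.05 as the significance level. All of these are assumptions made for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical sample: 60 "major" and 40 "independent" heroes, invented origins.
demo = pd.DataFrame({
    "publisher_category": ["major"] * 60 + ["independent"] * 40,
    "origin": ["Human"] * 40 + ["Alien"] * 20 + ["Human"] * 10 + ["Alien"] * 30,
})

# Observed counts for every (category, origin) combination.
observed = pd.crosstab(demo["publisher_category"], demo["origin"])

# H0: origin is independent of publisher category.
chi2, p_value, dof, expected = chi2_contingency(observed)

ALPHA = 0.05  # assumed significance level
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
if p_value < ALPHA:
    print("Reject H0: origin mix differs between publisher categories.")
else:
    print("No evidence that origin mix differs between publisher categories.")
```

The test is only reliable when the expected counts are not too small, so rare origins may need to be grouped before applying it to the real data.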




## Conclude Your Analysis
   - Summarize the key findings from your analysis
     ```markdown
     # Example of summarizing conclusions
     # "The analysis revealed that major publishers dominate the superhero landscape, introducing the majority of superheroes with a preference for certain origins. Independent publishers contribute a diverse set of origins, though in smaller numbers."
     ```



## Suggestions for Future Analysis
   - Suggest further areas to explore, such as examining trends in superhero characteristics over time or how different origins relate to the success of the superheroes
     ```markdown
     # Example of suggesting further analysis
     # "Future analysis could explore how the characteristics of superheroes (such as powers, alliances) have evolved over time, or examine the correlation between a superhero's origin and their success or longevity in popular culture."
     ```
