### 💾💻📊 Data Science - MMI Portfolio No. 1 
# 💥 Superhero analysis 💥

## Main goal: Use the provided datasets to answer the following questions
1. What is the male-to-female ratio of superheroes (Marvel + DC), and how does it change over time?
2. How does this compare between Marvel and DC superheroes?
3. How does the look (hair color, eye color, ...) change over time?
4. Is there a typical look of a bad superhero and of a good superhero?

## General instructions
- The final notebook should be executable in the correct order (this means it should work if you do `Kernel` --> `Restart kernel and run all cells...`)
- Providing only code and plots is not enough; you should document and comment where necessary. Comments on small code-related details are optional. The focus is on explaining what you do, why you do it, and what you observe.

More specifically:
- Please briefly comment on the changes you make to the data, in particular if you apply complex operations or if your changes depend on a choice you have to make.
- Please add descriptions and/or interpretations to the results you generate (for instance tables, plots). This doesn't have to be a lot of text. For simple, easy-to-understand results, a brief sentence can be enough. For more complex results, you might want to add a bit more explanation.

---
Please add your Name here
## Name: Sander Tebeck

---

## Imports and helper functions
Use this part to import the main libraries used in this notebook.  
Also add more complex helper functions to this part (if you use any).

In [1]:
import os

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# add imports if anything is missing
# feel free, for instance, to use other plotting libraries (e.g. seaborn, plotly...)

## Data import
Simple: Use this part to import your data.
For the present case you can simply use Pandas `.read_json()`.

In [8]:
# raw string, so the backslashes in the Windows path are not read as escape sequences
path_root = r"E:\Dokumente\MMI\Datascience\DataScience\superhero_data_portfolio_1"

file_marvel = "superhero_data_marvel_mmi2024.json"
file_dc = "superhero_data_dc_mmi2024.json"

In [9]:
df_marvel = pd.read_json(os.path.join(path_root, file_marvel))
df_dc = pd.read_json(os.path.join(path_root, file_dc))

## First Exploration & Data Cleaning
Use this part to have a first look at the data.  
Apply the necessary operations to clean and harmonize the data, such as handling missing values, conversions etc.

In [24]:
df_marvel.head()

Unnamed: 0,align,alive,appearances,eye,first appearance,gsm,hair,id,name,page_id,sex,urlslug
0,good characters,living characters,4043.0,hazel eyes,aug-62,,brown hair,secret identity,spider-man (peter parker),1678,male characters,\/spider-man_(peter_parker)
1,good characters,living characters,3360.0,blue eyes,mar-41,,white hair,public identity,captain america (steven rogers),7139,male characters,\/captain_america_(steven_rogers)
2,neutral characters,living characters,3061.0,blue eyes,oct-74,,black hair,public identity,"wolverine (james \""logan\"" howlett)",64786,male characters,\/wolverine_(james_%22logan%22_howlett)
3,good characters,living characters,2961.0,blue eyes,mar-63,,black hair,public identity,"iron man (anthony \""tony\"" stark)",1868,male characters,\/iron_man_(anthony_%22tony%22_stark)
4,good characters,living characters,2258.0,blue eyes,nov-50,,blond hair,no dual identity,thor (thor odinson),2460,male characters,\/thor_(thor_odinson)


In [25]:
df_dc.head()




### Exercise 1: First overview
Here, you don't have to perform any changes. Just briefly comment on the following:
- What are missing values here and are they a problem for the task?
- Are there features in the datasets that require changes to be useful for a later analysis?
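
A quick way to get such an overview is to count the missing values per column. The following is only a minimal sketch on a hypothetical toy frame (`df_example` and its rows are made up for illustration); on the real data the same calls would be applied to `df_marvel` and `df_dc`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics a few columns of the superhero data.
df_example = pd.DataFrame(
    {
        "name": ["spider-man (peter parker)", "thor (thor odinson)", "unknown hero"],
        "eye": ["hazel eyes", "blue eyes", np.nan],
        "gsm": [np.nan, np.nan, np.nan],
        "appearances": [4043.0, 2258.0, np.nan],
    }
)

# Number and share of missing values per column.
missing_count = df_example.isna().sum()
missing_share = df_example.isna().mean().round(2)

overview = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(overview.sort_values("share", ascending=False))
```

A column that is almost entirely empty is usually a candidate for being ignored, while a column with only a few gaps may be worth keeping.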

### Exercise 2: Handle missing values

If you think this is necessary for the next steps in your analysis, use this part to remove (or impute/replace/edit) missing values.
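
Two common options are sketched below, assuming that categorical gaps can be labeled as `"unknown"` and that rows without a first appearance are not usable for the time-based questions. Both assumptions are choices, not requirements, and `df_example` is again a hypothetical toy frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in different columns.
df_example = pd.DataFrame(
    {
        "sex": ["male characters", "female characters", np.nan],
        "hair": ["brown hair", np.nan, "black hair"],
        "first appearance": ["aug-62", "mar-41", np.nan],
    }
)

# Option 1: keep the rows and label missing categories explicitly.
df_filled = df_example.copy()
df_filled[["sex", "hair"]] = df_filled[["sex", "hair"]].fillna("unknown")

# Option 2: drop only the rows that lack a value the analysis cannot do without.
df_dated = df_filled.dropna(subset=["first appearance"])

print(df_filled)
print(len(df_dated), "rows with a first appearance")
```

Filling with an explicit label keeps the row counts intact for the look-related questions, whereas dropping rows is the simpler choice when a column is essential for a specific plot.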

### Exercise 3: Data conversions
Several columns contain data in a non-ideal format or style.
Apply the following changes:

- `first appearance` --> convert to consistent date format
- add column with only the year of appearance (call the new column: `year`)
- Convert the `name` column into a better, more consistent format. Try to add a proper `superhero_name` and a `real_name` column. So: `"batman (bruce wayne)"` should be split into `"batman"` and `"bruce wayne"`. Don't worry if this won't work for all cases, you are not expected to do (and check) this manually. Simply try to find a good solution that works most of the time.
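
One possible way to do these conversions is sketched below. It assumes that dates look like `aug-62`, as in the Marvel preview, and that names follow the pattern `hero (real name)`. The helper `parse_first_appearance`, the `cutoff_year` of 2024, and the row `galactus` are assumptions made for this illustration. Two-digit years are ambiguous: `%y` maps `62` to 2062, so years in the future are shifted back by one century.

```python
import pandas as pd

# Toy rows: two taken from the Marvel preview, one hypothetical row without details.
df_example = pd.DataFrame(
    {
        "name": ["spider-man (peter parker)", "thor (thor odinson)", "galactus"],
        "first appearance": ["aug-62", "nov-50", None],
    }
)


def parse_first_appearance(entry, cutoff_year=2024):
    """Turn 'aug-62' into Timestamp('1962-08-01'); return NaT if parsing fails."""
    parsed = pd.to_datetime(entry, format="%b-%y", errors="coerce")
    if pd.isna(parsed):
        return pd.NaT
    # '%y' places 00-68 in the 2000s; move dates in the future back by a century.
    if parsed.year > cutoff_year:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed


df_example["first appearance"] = pd.to_datetime(
    df_example["first appearance"].apply(parse_first_appearance)
)
# Nullable integer type, so rows without a date stay missing instead of failing.
df_example["year"] = df_example["first appearance"].dt.year.astype("Int64")

# Text before the parentheses becomes the hero name, text inside the real name.
pattern = r"^\s*(?P<superhero_name>[^(]+?)\s*(?:\((?P<real_name>.*)\))?\s*$"
df_example[["superhero_name", "real_name"]] = df_example["name"].str.extract(pattern)

print(df_example)
```

Entries in other date formats come back as `NaT` and would need their own rule. The name split only follows the pattern, so it cannot tell whether the text in parentheses really is a real name.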

### Hint:
Sometimes there is no proper Pandas function to do what you need (or we simply are not sure or can't find the right one...).
If you want to do a more complex operation on all entries in a column, you can work with the `.apply()` method from Pandas. On a single column (a Series), it calls the given Python function on every entry. On a whole dataframe, it calls the function once per column (or once per row with `axis=1`).

```python
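def my_special_operation(input_entry):
    # do what you want with the entry, then return the result
    output_entry = input_entry
    return output_entry

# Apply this to each column of the dataframe
# (the function receives a whole column as a Series, not a single entry;
# for element-wise use see DataFrame.map in pandas >= 2.1, applymap in older versions)
my_dataframe.apply(my_special_operation)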


# Apply this to all entries in one column
my_dataframe.loc[:, "column_A"].apply(my_special_operation)
```

## Data analysis 1 
Here you should address the first two main questions posed at the beginning:
1. What is the male-to-female ratio of superheroes (Marvel + DC), and how does it change over time?
2. How does this compare between Marvel and DC superheroes?

### Exercise 4:
- Please show the appearance of new superheroes over time.
- Compare the same for only "male" and "female" superheroes.
- Perform this analysis for `Marvel`, for `DC`, and for both combined (`Marvel_DC`).

Hints: 
- For such a comparison it is often helpful to plot the **ratio** between the two (male:female or female:male).
- When you are interested in trends over time, you are often free to decide how you handle the "time"-component. You can use data for every year, or combine several years etc.
- Optional "nice-to-have": for noisy temporal data it sometimes helps (visually) to also use a **moving average** to smooth the curves.
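
The hints above can be sketched as follows: count new characters per year and sex, divide the two counts, and smooth the result with a moving average. The toy values in `df_example` are hypothetical, and the window of 3 years is an arbitrary choice. On the real data the `year` column from Exercise 3 would be used.

```python
import pandas as pd

# Hypothetical toy data: one row per newly introduced character.
df_example = pd.DataFrame(
    {
        "year": [1962, 1962, 1962, 1963, 1963, 1964, 1964, 1964, 1964, 1965, 1965],
        "sex": [
            "male characters", "male characters", "female characters",
            "male characters", "female characters",
            "male characters", "male characters", "male characters", "female characters",
            "male characters", "female characters",
        ],
    }
)

# New characters per year and sex (rows: year, columns: sex).
counts = pd.crosstab(df_example["year"], df_example["sex"])

# Male-to-female ratio per year (a year without female characters would give inf).
ratio = counts["male characters"] / counts["female characters"]

# Centered moving average over 3 years to smooth the curve.
ratio_smooth = ratio.rolling(window=3, center=True, min_periods=1).mean()

print(pd.DataFrame({"ratio": ratio, "smoothed": ratio_smooth}))
```

Both series can be drawn with `.plot()`. To combine several years, the same code works after replacing `year` with a decade column such as `(year // 10) * 10`.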

## Data analysis 2 
Here you should address the remaining two main questions posed at the beginning:
3. How does the look (hair color, eye color, ...) change over time?
4. Is there a typical look of a bad superhero and a good superhero?

### Exercise 5:
- Find suitable plots to show if/how superhero hair color and eye color changed over time.
- Answer question no. 4 using visualization as well as a correlation analysis.
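
Since alignment and look are both categorical, one possible approach (an assumption, not the only valid one) is to compare shares with a normalized cross table and to one-hot encode the categories before computing a Pearson correlation. The rows of `df_example` are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Hypothetical toy data with alignment and hair color only.
df_example = pd.DataFrame(
    {
        "align": [
            "good characters", "good characters", "bad characters",
            "bad characters", "good characters", "bad characters",
        ],
        "hair": [
            "blond hair", "brown hair", "black hair",
            "black hair", "blond hair", "bald",
        ],
    }
)

# Share of each hair color within each alignment (every row sums to 1).
hair_by_align = pd.crosstab(df_example["align"], df_example["hair"], normalize="index")

# One-hot encode both columns so that a Pearson correlation can be computed.
dummies = pd.get_dummies(df_example[["align", "hair"]], dtype=float)
corr = dummies.corr()

# Keep only the alignment-versus-hair part of the correlation matrix.
align_cols = [c for c in corr.columns if c.startswith("align_")]
hair_cols = [c for c in corr.columns if c.startswith("hair_")]
look_corr = corr.loc[align_cols, hair_cols]

print(hair_by_align.round(2))
print(look_corr.round(2))
```

The cross table is well suited for a stacked bar chart or heatmap, and `look_corr` can be shown as a heatmap as well. The same steps apply to the `eye` column.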