### 💾💻📊 Data Science - MMI Portfolio No. 1 
# 💥 Superhero analysis 💥

## Main goal: Use the provided datasets to answer the following questions
1. What is the male-to-female ratio of superheroes (Marvel + DC), and how does it change over time?
2. How does this compare between Marvel and DC superheroes?
3. How does the look (hair color, eyes ...) change over time?
4. Is there a typical look of a bad superhero and of a good superhero?

## General instructions
- The final notebook should be executable in the correct order (this means it should work if you do `Kernel` --> `Restart kernel and run all cells...`)
- Just providing code and plots is not enough, you should document and comment where necessary. Not so much on small code-related things (you may still do this if you want though, but this is not required), but mostly to explain what you do, why you do it, what you observe.

More specifically:
- Please briefly comment on the changes you do to the data, in particular, if you apply complex operations or if your changes depend on a certain choice you have to make.
- Please add descriptions and/or interpretations to the results you generate (for instance tables, plots). This doesn't have to be a lot of text. For simple, easy-to-understand results, a brief sentence can be enough. For more complex results, you might want to add a bit more explanation.

---
Please add your Name here
## Name: Sander Tebeck

---

## Imports and helper function
Use this part to import the main libraries used in this notebook.  
Also add more complex helper functions to this part (if you use any).

In [1]:
import os

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

import seaborn as sns

# add imports if anything is missing
# feel free, for instance, to use other plotting libraries (e.g. seaborn, plotly...)

## Data import
Simple: Use this part to import your data.
For the present case you can simply use Pandas `.read_json()`.

In [22]:
# raw string so that the backslashes in the Windows path are not read as escape sequences
path_root = r"E:\Dokumente\MMI\Datascience\DataScience\superhero_data_portfolio_1"

file_marvel = "superhero_data_marvel_mmi2024.json"
file_dc = "superhero_data_dc_mmi2024.json"

In [25]:
df_marvel = pd.read_json(os.path.join(path_root, file_marvel))
df_dc = pd.read_json(os.path.join(path_root, file_dc))

## First Exploration & Data Cleaning
Use this part to have a first look at the data.  
Apply the necessary operations to clean and harmonize the data, such as handling missing values, conversions etc.

In [26]:
df_marvel.head()

Unnamed: 0,align,alive,appearances,eye,first appearance,gsm,hair,id,name,page_id,sex,urlslug
0,good characters,living characters,4043.0,hazel eyes,aug-62,,brown hair,secret identity,spider-man (peter parker),1678,male characters,\/spider-man_(peter_parker)
1,good characters,living characters,3360.0,blue eyes,mar-41,,white hair,public identity,captain america (steven rogers),7139,male characters,\/captain_america_(steven_rogers)
2,neutral characters,living characters,3061.0,blue eyes,oct-74,,black hair,public identity,"wolverine (james \""logan\"" howlett)",64786,male characters,\/wolverine_(james_%22logan%22_howlett)
3,good characters,living characters,2961.0,blue eyes,mar-63,,black hair,public identity,"iron man (anthony \""tony\"" stark)",1868,male characters,\/iron_man_(anthony_%22tony%22_stark)
4,good characters,living characters,2258.0,blue eyes,nov-50,,blond hair,no dual identity,thor (thor odinson),2460,male characters,\/thor_(thor_odinson)


In [27]:
df_dc.head()

Unnamed: 0,align,alive,appearances,eye,first appearance,gsm,hair,id,name,page_id,sex,urlslug
0,good characters,living characters,3093.0,blue eyes,"1939, may",,black hair,secret identity,batman (bruce wayne),1422,male characters,\/wiki\/batman_(bruce_wayne)
1,good characters,living characters,2496.0,blue eyes,"1986, october",,black hair,secret identity,superman (clark kent),23387,male characters,\/wiki\/superman_(clark_kent)
2,good characters,living characters,1565.0,brown eyes,"1959, october",,brown hair,secret identity,green lantern (hal jordan),1458,male characters,\/wiki\/green_lantern_(hal_jordan)
3,good characters,living characters,1316.0,brown eyes,"1987, february",,white hair,public identity,james gordon (new earth),1659,male characters,\/wiki\/james_gordon_(new_earth)
4,good characters,living characters,1237.0,blue eyes,"1940, april",,black hair,secret identity,richard grayson (new earth),1576,male characters,\/wiki\/richard_grayson_(new_earth)



### Exercise 1: First overview
Here, you don't have to perform any changes. Just briefly comment on the following:
- What are missing values here and are they a problem for the task?
- Are there features in the datasets that require changes to be useful for a later analysis?

#### What are missing values here and are they a problem for the task?

##### Analysis:

In [33]:
df_marvel.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16376 entries, 0 to 16375
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   align             13564 non-null  object 
 1   alive             16373 non-null  object 
 2   appearances       15280 non-null  float64
 3   eye               6609 non-null   object 
 4   first appearance  15561 non-null  object 
 5   gsm               90 non-null     object 
 6   hair              12112 non-null  object 
 7   id                12606 non-null  object 
 8   name              16376 non-null  object 
 9   page_id           16376 non-null  int64  
 10  sex               15522 non-null  object 
 11  urlslug           16376 non-null  object 
dtypes: float64(1), int64(1), object(10)
memory usage: 1.6+ MB


In [65]:
df_dc.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6896 entries, 0 to 6895
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   align             6295 non-null   object 
 1   alive             6893 non-null   object 
 2   appearances       6541 non-null   float64
 3   eye               3268 non-null   object 
 4   first appearance  6827 non-null   object 
 5   gsm               64 non-null     object 
 6   hair              4622 non-null   object 
 7   id                4883 non-null   object 
 8   name              6896 non-null   object 
 9   page_id           6896 non-null   int64  
 10  sex               6771 non-null   object 
 11  urlslug           6896 non-null   object 
dtypes: float64(1), int64(1), object(10)
memory usage: 700.4+ KB


In [43]:
# size of the Marvel dataset relative to the DC dataset
len(df_marvel) / len(df_dc)

2.374709976798144

In [39]:
df_marvel.notnull().sum(axis=0)/len(df_marvel)

align               0.828285
alive               0.999817
appearances         0.933073
eye                 0.403578
first appearance    0.950232
gsm                 0.005496
hair                0.739619
id                  0.769785
name                1.000000
page_id             1.000000
sex                 0.947851
urlslug             1.000000
dtype: float64

In [41]:
df_dc.notnull().sum(axis=0)/len(df_dc)

align               0.912848
alive               0.999565
appearances         0.948521
eye                 0.473898
first appearance    0.989994
gsm                 0.009281
hair                0.670244
id                  0.708092
name                1.000000
page_id             1.000000
sex                 0.981874
urlslug             1.000000
dtype: float64

In [42]:
df_dc.notnull().sum(axis=0)/len(df_dc) - df_marvel.notnull().sum(axis=0)/len(df_marvel)

align               0.084563
alive              -0.000252
appearances         0.015448
eye                 0.070320
first appearance    0.039762
gsm                 0.003785
hair               -0.069375
id                 -0.061693
name                0.000000
page_id             0.000000
sex                 0.034023
urlslug             0.000000
dtype: float64

How many entries have all appearance-related fields (`align`, `eye`, `hair`) filled in at the same time?

In [55]:
cols = ['align', 'eye', 'hair']
for label, df in [('DC: ', df_dc), ('Marvel: ', df_marvel)]:
    # True for rows in which all selected columns are filled
    complete = df[cols].notnull().all(axis=1)
    print(label)
    print(complete.sum())
    print(complete.sum()/len(df))

DC: 
2530
0.3668793503480278
Marvel: 
5529
0.337628236443576


Since we want to look at appearance over time, the same check including `first appearance` and `sex`:

In [62]:
cols = ['align', 'eye', 'hair', 'first appearance', 'sex']
for label, df in [('DC: ', df_dc), ('Marvel: ', df_marvel)]:
    # True for rows in which all selected columns are filled
    complete = df[cols].notnull().all(axis=1)
    print(label)
    print(complete.sum())
    print(complete.sum()/len(df))

DC: 
2514
0.3645591647331787
Marvel: 
5069
0.30953834880312653


In [63]:
cols = ['first appearance', 'sex']
for label, df in [('DC: ', df_dc), ('Marvel: ', df_marvel)]:
    # True for rows in which all selected columns are filled
    complete = df[cols].notnull().all(axis=1)
    print(label)
    print(complete.sum())
    print(complete.sum()/len(df))

DC: 
6703
0.9720127610208816
Marvel: 
14766
0.901685393258427


In [66]:
df_marvel['sex'].value_counts()

sex
male characters           11638
female characters          3837
agender characters           45
genderfluid characters        2
Name: count, dtype: int64

##### Answer:

Both datasets have very few `gsm` entries (0.9% for DC, 0.5% for Marvel), a moderate number of eye color entries (47% for DC, 40% for Marvel), and fairly good coverage for hair color and `id` (67% and 71% for DC, 74% and 77% for Marvel).
Moral alignment is better populated in DC (91%) than in Marvel (83%).
However, the Marvel dataset is about 2.4 times larger.
Of the columns with a fill rate below 90%, the following are of particular interest for answering the main questions: `align`, `eye`, and `hair`. These columns are essential for the last two questions. All three are filled in the same row for about 37% of the DC entries and 34% of the Marvel entries, i.e. we have 2530 complete appearance entries for DC and 5529 for Marvel. Almost all of these rows in DC (2514) and most in Marvel (5069) also have `first appearance` and `sex` filled in.

#### Are there features in the datasets that require changes to be useful for a later analysis?

'first appearance'

Two of our questions concern change over time, and this information is stored in the `first appearance` column as a string that includes the month, so the column should be simplified. One could either store year and month separately as numbers or combine them into a single float (year + month/12). Since we are interested in the general trend over a long period, that extra effort is not worthwhile, and the year alone is sufficient. The two datasets format `first appearance` differently (e.g. `aug-62` in Marvel, `1939, may` in DC) and therefore have to be converted separately.
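
A minimal sketch of the year extraction, assuming the Marvel entries look like `aug-62` and the DC entries like `1939, may`, as in the `.head()` outputs above. The helper names `year_from_marvel` and `year_from_dc` and the `pivot` cutoff for two-digit years are hypothetical; the cutoff in particular is an assumption that should be checked against the actual data.

```python
import re


def year_from_marvel(value, pivot=30):
    """Extract a four-digit year from strings such as 'aug-62'.

    Assumption: two-digit years below `pivot` belong to the 2000s,
    all others to the 1900s.
    """
    if not isinstance(value, str):  # NaN / None stay missing
        return None
    match = re.search(r"(\d{2})\s*$", value)
    if match is None:
        return None
    yy = int(match.group(1))
    return 2000 + yy if yy < pivot else 1900 + yy


def year_from_dc(value):
    """Extract the year from strings such as '1939, may'."""
    if not isinstance(value, str):
        return None
    match = re.match(r"\s*(\d{4})", value)
    return int(match.group(1)) if match else None


print([year_from_marvel(v) for v in ["aug-62", "mar-41", "nov-05", None]])
# prints [1962, 1941, 2005, None]
print([year_from_dc(v) for v in ["1939, may", "1986, october", None]])
# prints [1939, 1986, None]
```

Either function could then be passed to `.apply()` on the `first appearance` column of the matching dataset to build a `year` column.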

'name'

In both datasets, `name` contains two pieces of information: first the hero name (if there is one), then the real name or nickname when a hero name is present. This column could also be restructured. However, I consider this a low priority, since it does not help answer any of the main questions.
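
A possible way to split the column, sketched with a regular expression. The function name `split_name` is hypothetical, and treating a name without parentheses as the superhero name is an assumption, not something the data guarantees.

```python
import re


def split_name(name):
    """Split 'batman (bruce wayne)' into ('batman', 'bruce wayne')."""
    if not isinstance(name, str):  # NaN / None stay missing
        return (None, None)
    # text before the last "(...)" group, then the content of that group
    match = re.match(r"^(.*?)\s*\(([^()]*)\)\s*$", name)
    if match is None:
        # assumption: no parentheses means only a superhero name is given
        return (name.strip(), None)
    return (match.group(1), match.group(2))


print(split_name("batman (bruce wayne)"))
# prints ('batman', 'bruce wayne')
print(split_name("richard grayson (new earth)"))
# prints ('richard grayson', 'new earth')
print(split_name("thor"))
# prints ('thor', None)
```

The second example shows a limitation: for entries such as `richard grayson (new earth)` the parentheses hold a universe rather than a real name, so the split only works most of the time.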

### Exercise 2: Handle missing values

If you think this is necessary for the next steps in your analysis, use this part to remove (or impute/replace/edit) missing values.
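
One option, sketched on a small made-up frame, is to avoid dropping every incomplete row and instead filter only on the columns a given question needs. The toy values and the `year` column are assumptions for illustration (a `year` column only exists after the conversion in Exercise 3).

```python
import numpy as np
import pandas as pd

# toy stand-in for df_marvel / df_dc (values are made up for illustration)
df = pd.DataFrame({
    "sex": ["male characters", "female characters", np.nan, "male characters"],
    "year": [1962, np.nan, 1974, 1990],
    "hair": ["brown hair", np.nan, "black hair", np.nan],
    "eye": [np.nan, "blue eyes", "blue eyes", np.nan],
})

# questions 1 and 2 only need sex and year
df_ratio = df.dropna(subset=["sex", "year"])

# questions 3 and 4 can be filtered per appearance feature ...
df_hair = df.dropna(subset=["hair", "year"])

# ... or missing values can be kept as an explicit category
df_labeled = df.fillna({"hair": "unknown", "eye": "unknown"})

print(len(df), len(df_ratio), len(df_hair))
print(df_labeled[["hair", "eye"]])
```

Filtering per question keeps far more rows for the sex ratio analysis than requiring all appearance columns to be filled at once.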

### Exercise 3: Data conversions
Several columns contain data in a non-ideal format or style.
Apply the following changes:

- `first appearance` --> convert to consistent date format
- add column with only the year of appearance (call the new column: `year`)
- Convert the `name` column into a better, more consistent format. Try to add a proper `superhero_name` and a `real_name` column. So: `"batman (bruce wayne)"` should be split into `"batman"` and `"bruce wayne"`. Don't worry if this won't work for all cases, you are not expected to do (and check) this manually. Simply try to find a good solution that works most of the time.

### Hint:
Sometimes there is no proper Pandas function to do what you need (or we simply are not sure or can't find the right one...).
If you want to do a more complex operation on all entries in a column you can work with the `.apply()` method from Pandas. This applies any given Python function to all elements in a dataframe or column.

```python
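def my_special_operation(input_entry):
    # do what you want
    output_entry = input_entry  # placeholder: replace with your own logic
    return output_entry

# Apply this to the whole dataframe
# (note: DataFrame.apply passes each column to the function as a Series)
my_dataframe.apply(my_special_operation)


# Apply this to all entries in one column
my_dataframe.loc[:, "column_A"].apply(my_special_operation)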
```

## Data analysis 1 
Here you should address the two main questions posed at the beginning:
1. What is the male-to-female ratio of superheroes (Marvel + DC), and how does it change over time?
2. How does this compare between Marvel and DC superheroes?

### Exercise 4:
- Please show the appearance of new superheroes over time.
- Compare the same for only "male" and "female" superheroes.
- Perform this analysis for `Marvel`, `DC` and both `Marvel_DC`.

Hints: 
- For such a comparison it is often helpful to plot the **ratio** between the two (male:female or female:male).
- When you are interested in trends over time, you are often free to decide how you handle the "time"-component. You can use data for every year, or combine several years etc.
- Optional "nice-to-have": for noisy temporal data it sometimes helps (visually) to also use a **moving average** to smoothen curves.
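
The hints above (ratio, yearly counts, moving average) could be combined roughly as follows. This is a sketch on synthetic data, not a result from the real datasets; it assumes a numeric `year` column exists and that `sex` uses the labels `male characters` and `female characters` seen in the `value_counts()` output.

```python
import numpy as np
import pandas as pd

# synthetic stand-in data; the real notebook would use the cleaned
# Marvel/DC frames with a `year` column (assumed to exist after Exercise 3)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "year": rng.integers(1940, 2014, size=2000),
    "sex": rng.choice(
        ["male characters", "female characters"], size=2000, p=[0.75, 0.25]
    ),
})

# new characters per year and sex (rows: year, columns: sex)
counts = toy.groupby(["year", "sex"]).size().unstack(fill_value=0)

# male-to-female ratio; years without female characters become NaN instead of inf
ratio = counts["male characters"] / counts["female characters"].replace(0, np.nan)

# centered moving average over 5 rows (years) to smooth year-to-year noise
ratio_smooth = ratio.rolling(window=5, center=True, min_periods=1).mean()

print(counts.head())
print(ratio_smooth.head())
```

Both `ratio` and `ratio_smooth` are indexed by year, so they can be drawn in the same figure with `.plot()` to compare the raw and smoothed curves. Grouping by decade instead of year is an alternative way to reduce noise.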

## Data analysis 2 
Here you should address the two main questions posed at the beginning:
3. How does the look (hair color, eyes ...) change over time?
4. Is there a typical look of a bad superhero and a good superhero?

### Exercise 5:
- Find suitable plots to show if/how superhero hair color and eye color changed over time.
- Answer question no. 4 using visualization as well as a correlation analysis.
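
A rough sketch for both parts, using a small made-up frame. Because hair color, eye color and alignment are categorical, the "correlation" is measured here with Cramér's V computed from a contingency table; that choice, the helper name `cramers_v`, and the `decade` column are assumptions, and other association measures would be equally valid.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y):
    """Cramér's V between two categorical Series (0 = none, 1 = perfect).

    Uses the plain chi-square statistic without bias correction.
    """
    table = pd.crosstab(x, y).to_numpy().astype(float)
    n = table.sum()
    expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0, keepdims=True) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# made-up example rows, not taken from the real datasets
toy = pd.DataFrame({
    "year": [1941, 1945, 1962, 1963, 1968, 1974, 1986, 1987],
    "hair": ["black hair", "blond hair", "brown hair", "black hair",
             "black hair", "brown hair", "blond hair", "black hair"],
    "align": ["good characters", "good characters", "good characters",
              "bad characters", "bad characters", "good characters",
              "good characters", "bad characters"],
})

# share of each hair color per decade (each row sums to 1)
toy["decade"] = toy["year"] // 10 * 10
hair_share = pd.crosstab(toy["decade"], toy["hair"], normalize="index")
print(hair_share)

# strength of association between alignment and hair color
v_align_hair = cramers_v(toy["align"], toy["hair"])
print(round(v_align_hair, 3))
```

The `hair_share` table can be shown as a stacked area or stacked bar chart (for example with `hair_share.plot(kind="bar", stacked=True)`), and the same pattern applies to eye color. With only a few rows per category, as in this toy frame, Cramér's V is not reliable, so rare categories in the real data may need to be grouped first.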