# Marvellous Comics: mirror of society?
Superheroes and villains in comics can have a real impact on society. They’re supposed to represent what’s good and bad. Thus, the way the character is portrayed will have an influence on the reader. If for example all villains are part of the same minority, people will unconsciously see them in real life as bad people. Moreover, a character like Tony Stark could inspire people to study engineering. And these are just examples to illustrate the power comics can have on us.
We can thus study this choice of characters, how diverse it is, and if there is a tendency towards a specific portrait for superheroes and villains. 

## Data Acquisition
Here, the aim is to load the datasets we've previously parse from the websites:
personnage_url (which is the Marvel characters dataset)
perso_dc (which is the DC character dataset)
Then clean them and to have a ready to use datawarehouse. For the cleaning please remember that here we only plot the head of the value_counts but that we've check the entire dataset.
We will use the url of each characters as an ID as they are unique

In [79]:
# Import libraries
import pandas as pd
import numpy as np
%matplotlib inline
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import math
import re
import string
import pickle

### Let's first load the dataset

In [2]:
personnage = pd.read_pickle("personnage_url.txt").copy()
personnage.head(3)

Unnamed: 0,URL,Real Name,Identity,Current Alias,Citizenship,Marital Status,Occupation,Education,Gender,Height,Weight,Eyes,Hair,Place of Birth
0,/wiki/Aaron_the_Aakon_(Earth-616),Aaron,Secret Identity,,Aakon,Single,Slave trader,,Male,,,Brown,Black,Planet Oorga
1,/wiki/2-D_(Earth-616),Darell (full name unrevealed)[1],Secret Identity,2-D,American,Single,Adventurer,,Male,,,Brown,Brown,
2,/wiki/Abraham_Erskine_(Earth-616),Abraham Erskine[1],Known to Authorities Identity,Dr. Joseph Reinstein,"German, American",Married,Scientist,Advanced College Degree,Male,"5' 6"" (1.68 m)",160 lbs (73 kg),Brown,Black (graying),Germany


## Data acquaintance

In [3]:
#Check if we have any missing values
personnage.isnull().values.any()

False

***Now, we will have a look at every column and study how there are filled in order to better comprehend the data and clean it***

In [4]:
#We start with the Real Name of the characters
personnage['Real Name'].value_counts()

nknown                                                       3666
Unknown                                                       462
Unrevealed                                                    108
Not Applicable                                                 27
Unknown (The symbiote takes the name of its current host)      24
                                                             ... 
Naka                                                            1
Billy Jones                                                     1
Ebenezer Laughton[1]                                            1
Vladimir Ilyich Ulyanov                                         1
John Lewandow                                                   1
Name: Real Name, Length: 22947, dtype: int64

***Regarding the 'Real Name', we can observe that we have a few categories of unknown names, we are going to group them together under the label 'Unknown'***

In [5]:
personnage.loc[personnage['Real Name']=='nknown', 'Real Name'] ='Unknown'
personnage.loc[personnage['Real Name']=='Unrevealed', 'Real Name'] ='Unknown'
personnage.loc[personnage['Real Name']=='', 'Real Name'] ='Unknown'
personnage.loc[personnage['Real Name']=='N/A', 'Real Name'] ='Unknown'
personnage.loc[personnage['Real Name']=='Unknown (The symbiote takes the name of its current host)', 'Real Name'] ='Unknown'
personnage.loc[personnage['Real Name']=='None', 'Real Name'] ='Unknown'

# We also remove any link ([#])
personnage["Real Name"] = personnage["Real Name"].str.replace(r'\s\[\d\]', '')
personnage["Real Name"] = personnage["Real Name"].str.replace(r'\[\d\]', '')

personnage['Real Name'].value_counts()

Unknown                          4297
Not Applicable                     27
Martin (full name unrevealed)      11
James "Jamie" Arthur Madrox        10
Thunder                             6
                                 ... 
Evgeny Bezzubenkov                  1
Signor Korte                        1
Beleth                              1
Roscoe Kasady                       1
Malcolm Monroe                      1
Name: Real Name, Length: 22866, dtype: int64

**This seems to be quite good for what we will need later**

***We now look at the identity of the characters***

In [6]:
personnage["Identity"].value_counts().head()

No Dual Identity                 14653
Secret Identity                   7311
Public Identity                   2953
                                  2882
Known to Authorities Identity      144
Name: Identity, dtype: int64

***It looks like the identities are distribiuted among 'No Dual Identity', 'Secret Identity', 'Public Identity' and 'Known to Authorities Identity'***

In [7]:
personnage["Identity"] = personnage["Identity"].str.replace(r'\s\[\d\]', '')
personnage["Identity"] = personnage["Identity"].str.replace(r'\[\d\]', '')
personnage["Identity"] = personnage["Identity"].replace([''], 'Unknown')
personnage.loc[personnage['Identity'].str.contains('Dual'), 'Identity'] = 'No Dual Identity'
personnage.loc[personnage['Identity'].str.contains('Authorities'), 'Identity'] = 'Known to Authorities Identity'
personnage.loc[personnage['Identity'].str.contains('Public'), 'Identity'] = 'Public Identity'
personnage.loc[personnage['Identity'].str.contains('Secret'), 'Identity'] = 'Secret Identity'
personnage.loc[personnage['Identity'].str.contains('Dial'), 'Identity'] = 'No Dual Identity'
personnage.loc[personnage['Identity'].str.contains('Robot'), 'Identity'] = 'Public Identity'
personnage["Identity"].value_counts()

No Dual Identity                 14677
Secret Identity                   7331
Public Identity                   2965
Unknown                           2882
Known to Authorities Identity      154
Name: Identity, dtype: int64

***We continue with the Alias***

In [8]:
personnage["Current Alias"].value_counts()

                             15844
Nova                            28
Crimson Dynamo                  16
Ghost Rider                     16
Black Knight                    15
                             ...  
Blobba                           1
Bor                              1
Joelle                           1
Moth                             1
Yod of the All-Seeing Eye        1
Name: Current Alias, Length: 10047, dtype: int64

In [9]:
personnage.loc[personnage['Current Alias']=='', 'Current Alias'] ='Unknown'
personnage['Current Alias'] = personnage['Current Alias'].str.replace(r'\s\[\d\]', '')
personnage['Current Alias'] = personnage['Current Alias'].str.replace(r'\[\d\]', '')
personnage["Current Alias"].value_counts().head()

Unknown           15845
Nova                 28
Ghost Rider          18
Crimson Dynamo       16
Black Knight         16
Name: Current Alias, dtype: int64

***Citizenship***

In [10]:
personnage["Citizenship"].value_counts()

American                                            10392
                                                     8188
United States                                         520
British                                               499
German                                                469
                                                    ...  
Dakkamite, [2][3][4] with no criminal records[3]        1
Pooka                                                   1
English, Krakoan, British                               1
Spartax                                                 1
British, English, Monaco                                1
Name: Citizenship, Length: 1546, dtype: int64

In [11]:
personnage.loc[personnage["Citizenship"]=="", 'Citizenship'] = 'Unknown'
personnage.loc[personnage["Citizenship"]=="USA", 'Citizenship'] = 'American'
personnage.loc[personnage["Citizenship"]=="United States of America", 'Citizenship'] = 'American'
personnage.loc[personnage["Citizenship"]=="United States", 'Citizenship'] = 'American'
personnage.loc[personnage["Citizenship"]=="America", 'Citizenship'] = 'American'
personnage.loc[personnage["Citizenship"]=="British, English", 'Citizenship'] = 'British'
personnage.loc[personnage["Citizenship"]=="United Kingdom", 'Citizenship'] = 'British'
personnage.loc[personnage["Citizenship"]=="English", 'Citizenship'] = 'British'
personnage.loc[personnage["Citizenship"]=="Scottish, British", 'Citizenship'] = 'British'
personnage.loc[personnage["Citizenship"]=="British, Scottish", 'Citizenship'] = 'British'
personnage.loc[personnage["Citizenship"]=="English, British", 'Citizenship'] = 'British'
personnage["Citizenship"] = personnage["Citizenship"].str.replace(r'\s\[\d\]', '')
personnage["Citizenship"] = personnage["Citizenship"].str.replace(r'\[\d\]', '')
personnage["Citizenship"].value_counts().head()

American    11033
Unknown      8190
British       731
German        469
Canadian      327
Name: Citizenship, dtype: int64

***Marital Status***

In [12]:
personnage["Marital Status"].value_counts()

                                                                                            16788
Single                                                                                       7993
Married                                                                                      2190
Widowed                                                                                       608
Divorced                                                                                      255
Separated                                                                                      64
Engaged                                                                                        49
Married [1]                                                                                     4
Married [citation needed]                                                                       4
Single [1]                                                                                      3
Divorced ; Widowed  

In [13]:
personnage.loc[personnage["Marital Status"]=="", 'Marital Status'] = 'Unknown'
personnage["Marital Status"] = personnage["Marital Status"].str.replace(r'\s\[\d+\]', '')
personnage["Marital Status"] = personnage["Marital Status"].str.replace(r'\[\d+\]', '')
personnage["Marital Status"].value_counts()

Unknown                                                                                     16788
Single                                                                                       7997
Married                                                                                      2196
Widowed                                                                                       610
Divorced                                                                                      257
Separated                                                                                      67
Engaged                                                                                        49
Married [citation needed]                                                                       4
Single (presumed)                                                                               4
Divorced ; Widowed                                                                              2
Married (presumably)

In [14]:
def clean_marital(s):
    if s == '':
        return 'Unknown'
    s = s.replace(r'\s\[\d+\]', '')
    s = s.replace(r'\[\d+\]', '')
    allowed = ['Unknown','Single','Married','Widowed','Separated','Engaged','Divorced', 'Remarried']
    if s in allowed:
        return s
    else:
        try:
            ss = s.split()
            for i in range(len(ss)):
                if ss[i] in allowed:
                    return ss[i]
            if s == 'Singe':
                return 'Single'
            if s == 'single':
                return 'Single'
            if ss[0] == 'Single,':
                return 'Single'
            if s == 'Destroyed':
                return 'Widowed'
            if s == 'married':
                return 'Married'
            if s == 'Claims to be married':
                return 'Married'
            else:
                print(s)
        except:
            print(s)
            print(ss)

#clean marital status
personnage["Marital Status"] = personnage["Marital Status"].apply(clean_marital)

***Occupation***

In [15]:
personnage["Occupation"].value_counts()

                                                8700
Student                                          645
Criminal                                         600
Scientist                                        428
Adventurer                                       321
                                                ... 
interstellar mover, research subject               1
Agent of Kree Empire, geneticist                   1
Vassal of Pluto, former queen of the Amazons       1
Cowboy and criminal                                1
Interpreter                                        1
Name: Occupation, Length: 9029, dtype: int64

In [16]:
personnage.loc[personnage["Occupation"]=="",'Occupation'] = 'Unknown'
personnage["Occupation"] = personnage["Occupation"].str.replace(r'\s\[\d+\]', '')
personnage["Occupation"] = personnage["Occupation"].str.replace(r'\[\d+\]', '')
personnage["Occupation"].value_counts().head()

Unknown       8701
Student        645
Criminal       601
Scientist      428
Adventurer     322
Name: Occupation, dtype: int64

***Education***

In [17]:
personnage["Education"].value_counts()

                                                                              26162
Artificial Intelligence                                                          54
Trained on an unnamed world to be a spy                                          51
High School                                                                      39
High school graduate                                                             35
                                                                              ...  
PH.D. in hypnotherapy                                                             1
High school level courses at Massachusetts Academy                                1
University of Wisconsin (Madison)--Law Degree                                     1
Extensive training in skills useful to assassination; some college courses        1
Elementary level equivalent(ongoing)                                              1
Name: Education, Length: 1140, dtype: int64

In [18]:
personnage.loc[personnage["Education"]=="", 'Education'] = 'Unknown'
personnage.loc[personnage["Education"]=="Unrevealed", 'Education'] = 'Unknown'
personnage["Education"] = personnage["Education"].str.replace(r'\s\[\d+\]', '')
personnage["Education"] = personnage["Education"].str.replace(r'\[\d+\]', '')
personnage.loc[personnage["Education"]=="High school graduate", 'Education'] = 'High School'
personnage.loc[personnage["Education"]=="High School Graduate", 'Education'] = 'High School'
personnage.loc[personnage["Education"]=="High School graduate", 'Education'] = 'High School'
personnage.loc[personnage["Education"]=="High School student", 'Education'] = 'High School'
personnage.loc[personnage["Education"]=="High School Student", 'Education'] = 'High School'
personnage.loc[personnage["Education"]=="High school student", 'Education'] = 'High School'
personnage.loc[personnage["Education"]=="High-school dropout", 'Education'] = 'High School Dropout'
personnage.loc[personnage["Education"]=="High school dropout", 'Education'] = 'High School Dropout'
personnage.loc[personnage["Education"]=="High school drop-out", 'Education'] = 'High School Dropout'
personnage.loc[personnage["Education"]=="High school", 'Education'] = 'High School'
personnage.loc[personnage["Education"]=="Some high school", 'Education'] = 'High School'
personnage.loc[personnage["Education"]=="Some college", 'Education'] = 'College'
personnage.loc[personnage["Education"]=="College Graduate", 'Education'] = 'College'
personnage.loc[personnage["Education"]=="College graduate", 'Education'] = 'College'
personnage.loc[personnage["Education"]=="College educated", 'Education'] = 'College'
personnage.loc[personnage["Education"]=="College education", 'Education'] = 'College'
personnage.loc[personnage["Education"]=="College degree", 'Education'] = 'College'
personnage.loc[personnage["Education"]=="University graduate", 'Education'] = 'University'
personnage.loc[personnage["Education"]=="Ph.D.", 'Education'] = 'Doctorate'
personnage.loc[personnage["Education"]=="PhD", 'Education'] = 'Doctorate'
personnage["Education"].value_counts().head()

Unknown                                    26177
High School                                  164
College                                       80
Artificial Intelligence                       54
Trained on an unnamed world to be a spy       51
Name: Education, dtype: int64

In [19]:
personnage["Gender"].value_counts().head()

Male           19787
Female          6738
                1264
Agender          172
Genderfluid       10
Name: Gender, dtype: int64

In [20]:
personnage["Gender"] = personnage["Gender"].str.replace(r'\s\[\d+\]', '')
personnage["Gender"] = personnage["Gender"].str.replace(r'\[\d+\]', '')
personnage.loc[personnage["Gender"]=="", 'Gender'] = 'Unknown'
personnage.loc[personnage["Gender"]=="Earth-616", 'Gender'] = 'Unknown'
personnage.loc[personnage["Gender"]=="UNCLEAR", 'Gender'] = 'Unknown'
personnage.loc[personnage["Gender"]=="Male/Female", 'Gender'] = 'Unknown'
personnage.loc[personnage["Gender"]=="Male and Female", 'Gender'] = 'Unknown'
personnage.loc[personnage["Gender"]=="Female(as Shub-Niggurath),male(as Sahb Delanzar) (see notes)", 'Gender'] = 'Genderfluid'
personnage.loc[personnage["Gender"]=="Male(Originally), Genderfluid (as shapeshifter)", 'Gender'] = 'Genderfluid'
personnage.loc[personnage["Gender"]=="Mobile", 'Gender'] = 'Genderfluid'
personnage.loc[personnage["Gender"]=="Female, (formerly Male)", 'Gender'] = 'Transgender'
personnage.loc[personnage["Gender"]=="Female(Female Clone of Male)", 'Gender'] = 'Female'
personnage.loc[personnage["Gender"]=="female", 'Gender'] = 'Female'
personnage.loc[personnage["Gender"]=="Male, merged with a mortal female", 'Gender'] = 'Male'
personnage.loc[personnage["Gender"]=="Male(probably)", 'Gender'] = 'Male'
personnage.loc[personnage['Gender'].str.contains('Agender'), 'Gender'] = 'Agender'
personnage.loc[personnage['Gender'].str.contains('Genderfluid'), 'Gender'] = 'Genderfluid'

***Height: For an easier use later, we've decided to midify height and weight in float with the metric units.***

In [21]:
personnage["Height"].value_counts()

                                                 24555
Variable                                           278
6' 0" (1.83 m)                                     268
5' 11" (1.80 m)                                    228
5' 10" (1.78 m)                                    224
                                                 ...  
8' 0" (2.44 m) (variable); 6'1'(pre-mutation)        1
5' 10" (1.78 m) ; 6' (as Radion)                     1
6' 5" (1.96 m) (formerly, currently unknown)         1
5' 9" (1.75 m) in costume 6'1"[1]                    1
Variable (5'8" while replicating Sunspot)            1
Name: Height, Length: 586, dtype: int64

In [22]:
personnage["Height"] = personnage["Height"].str.replace(r'\s\[\d+\]', '')
personnage["Height"] = personnage["Height"].str.replace(r'\[\d+\]', '')
personnage.loc[personnage["Height"]=="", "Height"] = 'Unknown'
personnage.loc[personnage["Height"]=="Unknown ", "Height"] = 'Unknown'
personnage.loc[personnage["Height"]=="Incalculable ", "Height"] = 'Unknown'
personnage.loc[personnage["Height"]=="Variable ", "Height"] = 'Variable'
personnage.loc[personnage["Height"]=="variable ", "Height"] = 'Variable'
personnage["Height"].value_counts().head()

Unknown            24564
Variable             289
6' 0" (1.83 m)       270
5' 11" (1.80 m)      231
5' 10" (1.78 m)      227
Name: Height, dtype: int64

In [23]:
def clean_height(s):
    if s != 'Unknown' and s!='Variable':
        try:
            temp = re.findall(r'\d+\' \d+\"',s) 
            height = re.findall(r'\d+',temp[0])
            inch = float(height[1])
            feet = float(height[0])
            inch += feet*12
            cm = round(inch * 2.54, 0)
            
            return str(cm)
        except:
            try:
                temp = re.findall(r'\d+\'\d+\"',s) 
                height = re.findall(r'\d+',temp[0])
                inch = float(height[1])
                feet = float(height[0])
                inch += feet*12
                cm = round(inch * 2.54, 0)
                return str(cm)
            except:
                try:
                    temp = re.findall(r'\d+\'',s)
                    height = re.findall(r'\d+',temp[0])
                    inch = 0
                    feet = float(height[0])
                    inch += feet*12
                    cm = round(inch * 2.54, 0)
                    return str(cm)
                except:
                    try:
                        temp = re.findall(r'\d+\"',s)
                        height = re.findall(r'\d+',temp[0])
                        inch = float(height[0])
                        feet = 0
                        inch += feet*12
                        cm = round(inch * 2.54, 0)
                        return str(cm)
                    except:
                        return 'Variable'
            
    else:
        return s

#apply function
personnage["Height in string"] = personnage["Height"].apply(lambda s: clean_height(s))

In [24]:
#convert the string to float, and put nan if unknown or variable
def string_to_float(s):
    try:
        return float(s)
    except:
        return np.nan
personnage["Height in float"] = personnage["Height in string"].apply(string_to_float)
personnage["Height in float"].describe()

count      3125.000000
mean        762.641600
std       14247.283429
min           8.000000
25%         173.000000
50%         180.000000
75%         188.000000
max      609600.000000
Name: Height in float, dtype: float64

***Weight***

In [25]:
personnage["Weight"].value_counts()

                                                          24704
Variable                                                    272
180 lbs (82 kg)                                              95
175 lbs (79 kg)                                              88
190 lbs (86 kg)                                              84
                                                          ...  
400 lbs (181 kg) (with battlesuit); 150 lbs (normally)        1
1102 lbs (500 kg) [1]                                         1
184 lbs (83 kg)                                               1
398 lbs (181 kg)                                              1
141 lbs (64 kg) (Variable)                                    1
Name: Weight, Length: 867, dtype: int64

In [26]:
personnage["Weight"] = personnage["Weight"].str.replace(r'\s\[\d+\]', '')
personnage["Weight"] = personnage["Weight"].str.replace(r'\[\d+\]', '')
personnage.loc[personnage["Weight"]=="", "Weight"] = 'Unknown'
personnage.loc[personnage["Weight"]=="Unknown ", "Weight"] = 'Unknown'
personnage.loc[personnage["Weight"]=="variable ", "Weight"] = 'Variable'
personnage.loc[personnage["Weight"]=="Variable ", "Weight"] = 'Variable'
personnage["Weight"].value_counts()

Unknown                                                   24713
Variable                                                    282
180 lbs (82 kg)                                              96
175 lbs (79 kg)                                              89
190 lbs (86 kg)                                              85
                                                          ...  
400 lbs (181 kg) (with battlesuit); 150 lbs (normally)        1
184 lbs (83 kg)                                               1
398 lbs (181 kg)                                              1
386 lbs (175 kg)                                              1
145 lbs (66 kg) (Variable)                                    1
Name: Weight, Length: 848, dtype: int64

In [27]:
def clean_weight(s):
    if s != 'Unknown' and s != 'Variable':
        try:
            temp = re.findall(r'\d+\ lbs',s) 
            height = re.findall(r'\d+',temp[0])
            pound = float(height[0])
            kilo = round(0.453*pound,0)
            return str(kilo)
        except:
            return 'Variable'
    else:
        return s

#apply function
personnage["Weight in string"] = personnage["Weight"].apply(lambda s: clean_weight(s))
personnage["Weight in float"] = personnage["Weight in string"].apply(string_to_float)
personnage["Weight in float"].describe()

count    2.968000e+03
mean     5.496837e+08
std      2.993427e+10
min      0.000000e+00
25%      6.300000e+01
50%      8.200000e+01
75%      1.060000e+02
max      1.630800e+12
Name: Weight in float, dtype: float64

***Eye***

In [28]:
personnage["Eyes"] = personnage["Eyes"].str.replace(r'\s\[\d+\]', '')
personnage["Eyes"] = personnage["Eyes"].str.replace(r'\[\d+\]', '')
personnage.loc[personnage["Eyes"]=="", "Eyes"] = 'Unknown'
personnage['Eyes'].value_counts().head(60)

Unknown                                               15390
Brown                                                  4060
Blue                                                   3305
Black                                                  1154
Green                                                   920
Red                                                     694
White                                                   519
Yellow                                                  385
Grey                                                    157
Hazel                                                   142
No Eyes                                                  99
Variable                                                 74
Purple                                                   54
Gray                                                     52
Orange                                                   46
Pink                                                     32
Amber                                   

***Hair***

In [29]:
personnage['Hair'].value_counts().head()

Black      6538
           5644
Brown      4714
Blond      2722
No Hair    1643
Name: Hair, dtype: int64

In [30]:
personnage["Hair"] = personnage["Hair"].str.replace(r'\s\[\d+\]', '')
personnage["Hair"] = personnage["Hair"].str.replace(r'\[\d+\]', '')
personnage.loc[personnage["Hair"] == 'No' , "Hair"] = 'Bald'
personnage.loc[personnage["Hair"].str.contains('No Hair'), 'Hair'] = 'Bald'
personnage["Hair"] = personnage["Hair"].str.replace(r'\s\(Variable\)','')
personnage["Hair"] = personnage["Hair"].str.replace(r'\s\(formerly.+\)','')
personnage["Hair"] = personnage["Hair"].str.replace(r'\s\(.+bald\)','')
personnage["Hair"] = personnage["Hair"].str.replace(r'\s\(balding\)','')
personnage["Hair"] = personnage["Hair"].str.replace(r'\sformerly.+\)','')
personnage["Hair"] = personnage["Hair"].str.replace(r'\s\;\sformerly.+','')
personnage["Hair"] = personnage["Hair"].str.replace(r'\s\(fur\)','')
personnage["Hair"] = personnage["Hair"].str.replace(r'\sformerly.+','')
personnage["Hair"] = personnage["Hair"].str.replace(r'\s\;.+','')
personnage["Hair"] = personnage["Hair"].str.replace(r'\s\;','')
personnage["Hair"] = personnage["Hair"].str.replace(r'\s\(.+\)','')
personnage.loc[personnage["Hair"]=="", "Hair"] = 'Unknown'
personnage.loc[personnage["Hair"]=="Unrevealed", "Hair"] = 'Unknown'
personnage['Hair'].value_counts().head(10)

Black               6737
Unknown             5650
Brown               4855
Bald                3246
Blond               2800
White               1343
Grey                1094
Red                 1019
Green                185
Strawberry Blond     129
Name: Hair, dtype: int64

**Now let's move to the Place of Birth**

In [31]:
personnage["Place of Birth"] = personnage["Place of Birth"].str.replace(r'\s\[\d+\]', '')
personnage["Place of Birth"] = personnage["Place of Birth"].str.replace(r'\[\d+\]', '')
personnage.loc[personnage["Place of Birth"]=="", "Place of Birth"] = 'Unknown'
personnage["Place of Birth"].value_counts().head()

Unknown                    23249
Germany                      185
Attilan                      138
New York City, New York       89
Atlantis                      88
Name: Place of Birth, dtype: int64

In [32]:
personnage.head(20)

Unnamed: 0,URL,Real Name,Identity,Current Alias,Citizenship,Marital Status,Occupation,Education,Gender,Height,Weight,Eyes,Hair,Place of Birth,Height in string,Height in float,Weight in string,Weight in float
0,/wiki/Aaron_the_Aakon_(Earth-616),Aaron,Secret Identity,Unknown,Aakon,Single,Slave trader,Unknown,Male,Unknown,Unknown,Brown,Black,Planet Oorga,Unknown,,Unknown,
1,/wiki/2-D_(Earth-616),Darell (full name unrevealed),Secret Identity,2-D,American,Single,Adventurer,Unknown,Male,Unknown,Unknown,Brown,Brown,Unknown,Unknown,,Unknown,
2,/wiki/Abraham_Erskine_(Earth-616),Abraham Erskine,Known to Authorities Identity,Dr. Joseph Reinstein,"German, American",Married,Scientist,Advanced College Degree,Male,"5' 6"" (1.68 m)",160 lbs (73 kg),Brown,Black,Germany,168.0,168.0,72.0,72.0
3,/wiki/11-Ball_(Earth-616),Unknown,Secret Identity,11-Ball,American,Single,Professional criminal; former henchman,Unknown,Male,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,,Unknown,
4,/wiki/Abraham_(Earth-616),Abraham,No Dual Identity,Unknown,Unknown,Married,Prophet,Unknown,Male,Unknown,Unknown,Unknown,Black,Unknown,Unknown,,Unknown,
5,/wiki/Abarac_(Earth-616),Abarac,No Dual Identity,Unknown,Cybernian,Single,"Court magician, advisor",Unknown,Male,Unknown,Unknown,Unknown,White,Unknown,Unknown,,Unknown,
6,/wiki/Abdul_Faoul_(Earth-616),Professor Abdul Faoul,Secret Identity,Scarlet Scarab,Egyptian,Single,"Archeologist, adventurer",Unknown,Male,Unknown,Unknown,Unknown,Black,Egypt,Unknown,,Unknown,
7,/wiki/A.C._O%27Connor_(Earth-616),A. C. O'Connor,No Dual Identity,Ace O'Connor,American,Single,Journalist,Unknown,Female,Unknown,Unknown,Blue,Blond,Unknown,Unknown,,Unknown,
8,/wiki/7-X9_(Earth-616),Unknown,No Dual Identity,7-X9,Unknown,Unknown,Unknown,Unknown,Male,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,,Unknown,
9,/wiki/803_(Earth-616),803,No Dual Identity,Unknown,Unknown,Single,Unknown,Unknown,Agender,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,,Unknown,


***We delete the row where every element is 'Unknown'***

In [33]:
unknown = (personnage['Identity']=='Unknown') &\
                 (personnage['Real Name']=='Unknown') &\
                 (personnage['Current Alias']=='Unknown') &\
                 (personnage['Occupation']=='Unknown') &\
                 (personnage['Gender']=='Unknown') &\
                 (personnage['Place of Birth']=='Unknown') &\
                 (personnage['Eyes']=='Unknown') &\
                 (personnage['Citizenship']=='Unknown') &\
                 (personnage['Education']=='Unknown')

personnage[unknown].head(20)
personnage = personnage.drop(personnage[unknown].index)
personnage[unknown]

  del sys.path[0]


Unnamed: 0,URL,Real Name,Identity,Current Alias,Citizenship,Marital Status,Occupation,Education,Gender,Height,Weight,Eyes,Hair,Place of Birth,Height in string,Height in float,Weight in string,Weight in float


In [80]:
pickle.dump(personnage, open("marvel_pers_clean.txt",'wb'))

## Loading and cleaning the DC dataset
**We are know applying a really similar cleaner to clean the DC character dataset**

In [34]:
dc_pers = pd.read_pickle("perso_dc.txt")
dc_pers_df = pd.DataFrame(dc_pers[0])
dc_pers_df_1 = pd.DataFrame(dc_pers[1])
#Concatenate both dataframes
dc_pers_df = pd.concat([dc_pers_df, dc_pers_df_1], ignore_index=True)
dc_pers_df.head(20)

Unnamed: 0,URL,Real Name,Identity,Current Alias,Citizenship,Good or Bad,Marital Status,Occupation,Education,Gender,Height,Weight,Eyes,Hair,Place of Birth
0,https://dc.fandom.com//wiki/Adam_Blake_(The_Nail),Adam Blake,Secret Identity,Captain Comet,,Good,,,,Male,,,Hazel,Brown,
1,https://dc.fandom.com//wiki/Ada_LaBostrie_(New...,Ada LaBostrie,Public Identity,Ada LaBostrie,American,Good,Married,Housewife,,Female,,,Brown,Black,
2,https://dc.fandom.com//wiki/Adellca_(New_Earth),Adellca,Secret Identity,Green Lantern,,Good,Single,Green Lantern,,Female,,,Black,Black,
3,https://dc.fandom.com//wiki/A-1_(Prime_Earth),Artificial Intelligence Data Flow,,A-I,,Good,Single,,,,,,,,
4,https://dc.fandom.com//wiki/Ace_Egan_(Quality_...,Ace Egan,Secret Identity,Ace of Space,,Good,,,,Male,,,,,New York
5,https://dc.fandom.com//wiki/Abigail_Cable_(The...,Abigail Cable,,,,Good,,,,Female,,,Blue,White,
6,https://dc.fandom.com//wiki/Abraham_Arlington_...,Abraham Arlington,,Azrael,British,Good,,,,Male,,,,,
7,https://dc.fandom.com//wiki/Abisha_(Prime_Earth),Abisha (surname unknown),Public Identity,Abisha,,Good,Single,Bodyguard,,Male,,,Black,Bald,
8,https://dc.fandom.com//wiki/Alan_Scott_(DC_Uni...,Alan Scott,Secret Identity,Green Lantern,American,Good,,,,Male,,,,Blond,
9,https://dc.fandom.com//wiki/Adam_Strange_(Kryp...,Adam Strange,Public Identity,,American,Good,,,College (abandoned),Male,,,Blue,Light Brown,Earth


***Let us clean the data set, starting with the Real Name column***

In [35]:
print('Missing Values : {}'.format(dc_pers_df['Real Name'].isnull().sum()))
dc_pers_df['Real Name'].value_counts()

Missing Values : 0


Unknown             4335
Bruce Wayne          164
Kal-El               118
Lois Lane             86
None                  75
                    ... 
Edward Cantwell        1
Ducra                  1
3g4                    1
Alpheus V. Hyatt       1
Jack of Diamonds       1
Name: Real Name, Length: 10877, dtype: int64

In [36]:
dc_pers_df["Real Name"] = dc_pers_df["Real Name"].str.replace(r'\s\[\d\]', '')
dc_pers_df["Real Name"] = dc_pers_df["Real Name"].str.replace(r'\[\d\]', '')
dc_pers_df.loc[dc_pers_df['Real Name']=='None', 'Real Name'] ='Unknown'
dc_pers_df.loc[dc_pers_df['Real Name']=='', 'Real Name'] ='Unknown'

***Identity***

In [37]:
dc_pers_df['Identity'].value_counts()

Public Identity    9400
Secret Identity    8145
                   4642
secret Identity       2
public Identity       1
Name: Identity, dtype: int64

In [38]:
dc_pers_df["Identity"] = dc_pers_df["Identity"].replace([''], 'Unknown')
dc_pers_df.loc[dc_pers_df['Identity']=='secret Identity', 'Identity'] = 'Secret Identity'
dc_pers_df.loc[dc_pers_df['Identity']=='public Identity', 'Identity'] = 'Public Identity'
dc_pers_df["Identity"].value_counts()

Public Identity    9401
Secret Identity    8147
Unknown            4642
Name: Identity, dtype: int64

***Current Alias***

In [39]:
dc_pers_df["Current Alias"] = dc_pers_df["Current Alias"].str.replace(r'\s\[\d\]', '')
dc_pers_df["Current Alias"] = dc_pers_df["Current Alias"].str.replace(r'\[\d\]', '')
dc_pers_df["Current Alias"] = dc_pers_df["Current Alias"].replace([''], 'Unknown')

In [40]:
dc_pers_df['Current Alias'].value_counts().head(5)

Unknown          4968
Green Lantern     444
Batman            202
Superman          149
Wonder Woman       98
Name: Current Alias, dtype: int64

***Citizenship***

In [41]:
def clean_citizen (s):
    s = str(s)
    s = s.replace(r'\s\[\d\]', '')
    s = s.replace(r'\[\d\]', '')
    if s == '':
        return 'Unknown'
    elif s == 'English':
        return 'British'
    elif s == 'Amerikan':
        return 'American'
    elif '·' in s:
        ss = s.split()
        return ss[0].strip()
    elif '/' in s:
        ss = s.split(sep = '/')
        return ss[0].strip()
    elif ';' in s:
        ss = s.split(sep = ';')
        return ss[0].strip()
    elif 'American' in s:
        return 'American'
    elif 'Japanese' in s:
        return 'Japanese'
    elif 'Australian' in s:
        return 'Australian'
    elif 'German' in s:
        return 'German'
    else:
        return s.strip()

dc_pers_df["Citizenship"] = dc_pers_df["Citizenship"].apply(clean_citizen)

In [42]:
dc_pers_df["Citizenship"].value_counts().head()

American                  11107
Unknown                    7540
British                     477
Amazon                      306
United Planets Citizen      285
Name: Citizenship, dtype: int64

***Marital Status***

In [43]:
dc_pers_df["Marital Status"] = dc_pers_df["Marital Status"].apply(clean_marital)

In [44]:
dc_pers_df["Marital Status"].value_counts()

Unknown      10848
Single        8893
Married       1588
Widowed        533
Divorced       230
Engaged         71
Separated       25
Remarried        2
Name: Marital Status, dtype: int64

***Occupation***

In [45]:
dc_pers_df["Occupation"].value_counts()

                                                                                                               10297
Criminal                                                                                                         648
Scientist                                                                                                        421
Adventurer                                                                                                       402
Student                                                                                                          349
                                                                                                               ...  
Freedom fighter · Scientist                                                                                        1
Empress of Thanagar                                                                                                1
Biologist · Queen of Heaven                                     

In [46]:
def basic_clean(s):
    if s == '':
        return 'Unknown'
    s = s.replace(r'\s\[\d+\]', '')
    s = s.replace(r'\[\d+\]', '')
    return s

In [47]:
dc_pers_df["Occupation"] = dc_pers_df["Occupation"].apply(basic_clean)
dc_pers_df["Occupation"].value_counts()

Unknown                         10298
Criminal                          648
Scientist                         421
Adventurer                        402
Student                           349
                                ...  
Freedom fighter · Scientist         1
Empress of Thanagar                 1
Biologist · Queen of Heaven         1
Kaznian Minister of Commerce        1
Reach general                       1
Name: Occupation, Length: 4253, dtype: int64

***Education***

In [48]:
dc_pers_df["Education"] = dc_pers_df["Education"].apply(basic_clean)
dc_pers_df["Education"].value_counts()

Unknown                                                                               21635
College Graduate                                                                         38
Amazonian                                                                                26
High School                                                                              24
Programmed by Dr. Magnus                                                                 20
                                                                                      ...  
Louis E. Grieve Memorial High School, apprentice to Johnny Warlock and Enchantress        1
Privately tutored by his grandfather                                                      1
Ph.D. Nuclear Physics                                                                     1
Bachelor's Degree in Forensic Science                                                     1
High School student (still enrolled)                                            

In [49]:
def clean_educ(s):
    if s == "High school graduate":
        return 'High School'
    elif s == "High School Graduate":
        return 'High School'
    elif s == "High School graduate":
        return 'High School'
    elif s == "High School student":
        return 'High School'
    elif s == "High School Student":
        return 'High School'
    elif s == "High school student":
        return 'High School'
    elif s == "High-school dropout":
        return 'High School Dropout'
    elif s == "High school dropout":
        return 'High School Dropout'
    elif s == "High school drop-out":
        return 'High School Dropout'
    elif s == "High school":
        return 'High School'
    elif s == "Some high school":
        return 'High School'
    elif s == "Some college":
        return 'College'
    elif s == "College Graduate":
        return 'College'
    elif s == "College graduate":
        return 'College'
    elif s == "College educated":
        return 'College'
    elif s == "College education":
        return 'College'
    elif s == "College degree":
        return 'College'
    elif s == "University graduate":
        return 'University'
    elif s == "Ph.D.":
        return 'Doctorate'
    elif s == "PhD":
        return 'Doctorate'
    else:
        return s
    
dc_pers_df["Education"] = dc_pers_df["Education"].apply(clean_educ)
dc_pers_df["Education"].value_counts()

Unknown                                                                               21635
College                                                                                  68
High School                                                                              42
Amazonian                                                                                26
Programmed by Dr. Magnus                                                                 20
                                                                                      ...  
Louis E. Grieve Memorial High School, apprentice to Johnny Warlock and Enchantress        1
Privately tutored by his grandfather                                                      1
Ph.D. Nuclear Physics                                                                     1
Bachelor's Degree in Forensic Science                                                     1
High School student (still enrolled)                                            

***Gender***

In [50]:
dc_pers_df["Gender"] = dc_pers_df["Gender"].apply(basic_clean)
dc_pers_df["Gender"].value_counts()

Male           15876
Female          5867
Unknown          366
Genderless        64
female             7
Transgender        6
male               2
New Earth          2
Name: Gender, dtype: int64

In [51]:
def clean_gender(s):
    if s =='male':
        return 'Male'
    elif s =='female':
        return 'Female'
    else:
        return s
    
dc_pers_df["Gender"] = dc_pers_df["Gender"].apply(clean_gender)
dc_pers_df["Gender"].value_counts()

Male           15878
Female          5874
Unknown          366
Genderless        64
Transgender        6
New Earth          2
Name: Gender, dtype: int64

***Height***

In [52]:
dc_pers_df["Height"] = dc_pers_df["Height"].apply(basic_clean)
dc_pers_df["Height"].value_counts()

Unknown      19860
5' 11"         229
6' 0"          222
5' 10"         198
6' 2"          167
             ...  
7′0″             1
6' 8" [3]        1
1' 4" [1]        1
6' 6" 6'0        1
6' 9"            1
Name: Height, Length: 153, dtype: int64

In [53]:
dc_pers_df["Height in string"] = dc_pers_df["Height"].apply(clean_height)
dc_pers_df["Height in float"] = dc_pers_df["Height in string"].apply(string_to_float)
dc_pers_df["Height in float"].describe()

count    2234.000000
mean      185.813339
std       125.729376
min         8.000000
25%       173.000000
50%       180.000000
75%       188.000000
max      5486.000000
Name: Height in float, dtype: float64

***Weight***

In [54]:
dc_pers_df["Weight"] = dc_pers_df["Weight"].apply(basic_clean)
dc_pers_df["Weight"].value_counts()

Unknown                                                 20015
175 lbs (79 kg)                                            78
Variable                                                   69
180 lbs (82 kg)                                            66
120 lbs (54 kg)                                            59
                                                        ...  
129                                                         1
22 lbs (10 kg) [citation needed]                            1
441 lbs (200 kg)                                            1
336 lbs (152 kg)                                            1
202 lbs (92 kg) (As Firestorm); 179 lbs (As Raymond)        1
Name: Weight, Length: 373, dtype: int64

In [55]:
dc_pers_df["Weight in string"] = dc_pers_df["Weight"].apply(clean_weight)
dc_pers_df["Weight in float"] = dc_pers_df["Weight in string"].apply(string_to_float)
dc_pers_df["Weight in float"].describe()

count     2078.000000
mean       150.653994
std       1461.481409
min          0.000000
25%         63.000000
50%         79.000000
75%         93.000000
max      54360.000000
Name: Weight in float, dtype: float64

***Eyes***

In [56]:
dc_pers_df["Eyes"] = dc_pers_df["Eyes"].apply(basic_clean)
dc_pers_df["Eyes"].value_counts()

Unknown                      10790
Blue                          3739
Brown                         2708
Black                         1646
Green                         1103
                             ...  
Blue-green                       1
Red · Amber                      1
Blue ·  formerly Brown           1
Purple ·  Orange ·  White        1
Photocellular · Red              1
Name: Eyes, Length: 147, dtype: int64

In [57]:
def clean_eye(s):
    
    if s == 'No Eyes':
        return 'No eyes'
    try:
        ss = s.split()
        if '(' in ss[0]:
            if s == '(as Kristin) Colorless ·  (as Snowman) Red':
                return 'Colorless'
        else:
            res = ss[0].strip()
            if '-' in res:
                sss = res.split('-')
                return(sss[0])
            else:
                return res.strip(';')
    except:
        print(s)
        return s

        
dc_pers_df["Eyes"] = dc_pers_df["Eyes"].apply(clean_eye)
dc_pers_df["Eyes"].value_counts()

Unknown          10790
Blue              3800
Brown             2751
Black             1669
Green             1123
Red                720
White              331
Yellow             324
Grey               144
No eyes            116
Photocellular       92
Hazel               80
Purple              67
Orange              41
Violet              33
Gold                28
Pink                28
Amber               18
Silver               9
Gray                 5
Blond                3
blue                 2
Indigo               2
Variable             2
red                  2
Flaming              1
Dark                 1
Colorless            1
Bald                 1
Fire                 1
brown                1
Diamond              1
Pale                 1
Mirrored             1
violet               1
Name: Eyes, dtype: int64

***Hair***

In [58]:
dc_pers_df["Hair"] = dc_pers_df["Hair"].apply(basic_clean)
dc_pers_df["Hair"].value_counts()

Black                                            5462
Unknown                                          4360
Brown                                            3448
Blond                                            2390
Red                                              1348
                                                 ... 
Brown · Black (Captain Marvel)                      1
Black (human) ·  black and golden brown (bee)       1
Brown · Bald ·  once                                1
Green ·  formerly Brown                             1
Black ·  White ·  Balding                           1
Name: Hair, Length: 267, dtype: int64

In [59]:
def clean_hair(s):
    if 'No Hair' in s:
        return 'No Hair'
    if ' ·' in s:
        ss = s.split()
        s =  ss[0].strip(';').strip('.').strip()
    if ' (' in s:
        ss = s.split()
        s = ss[0].strip(';').strip('.').strip()
    if '; ' in s:
        ss = s.split(';')
        s = ss[0].strip(';').strip('.').strip()
    if '/' in s:
        ss = s.split('/')
        s =  ss[0].strip(';').strip('.').strip()
    if '[' in s:
        ss = s.split('[')
        s = ss[0].strip(';').strip('.').strip()
    if 'Blonde' in s:
        return 'Blond'
    else:
        s = s.strip(';')
        s = s.strip('.')
        return s.strip()
    
dc_pers_df["Hair"] = dc_pers_df["Hair"].apply(clean_hair)
dc_pers_df["Hair"].value_counts()

Black                           5616
Unknown                         4361
Brown                           3550
Blond                           2440
Red                             1375
No Hair                         1233
White                           1147
Bald                            1074
Grey                             599
Green                            207
Blue                             102
Orange                            98
Purple                            91
Auburn                            91
Strawberry Blond                  50
Pink                              43
Silver                            23
Gold                              17
Light Brown                       15
Yellow                            15
Violet                             5
white                              3
Platinum Blond                     3
black                              3
Balding                            3
Auborn                             2
Auburn with white highlights       2
S

***Place of Birth***

In [60]:
dc_pers_df["Place of Birth"] = dc_pers_df["Place of Birth"].apply(basic_clean)
dc_pers_df["Place of Birth"].value_counts()

Unknown               18583
Krypton                 315
Gotham City             208
Apokolips                83
Germany                  78
                      ...  
Moscow, Russia            1
Cadmus labs               1
Van'n                     1
Innsbruck, Austria        1
Putthole                  1
Name: Place of Birth, Length: 1082, dtype: int64

***General cleaning***

In [61]:
dc_pers_df

Unnamed: 0,URL,Real Name,Identity,Current Alias,Citizenship,Good or Bad,Marital Status,Occupation,Education,Gender,Height,Weight,Eyes,Hair,Place of Birth,Height in string,Height in float,Weight in string,Weight in float
0,https://dc.fandom.com//wiki/Adam_Blake_(The_Nail),Adam Blake,Secret Identity,Captain Comet,Unknown,Good,Unknown,Unknown,Unknown,Male,Unknown,Unknown,Hazel,Brown,Unknown,Unknown,,Unknown,
1,https://dc.fandom.com//wiki/Ada_LaBostrie_(New...,Ada LaBostrie,Public Identity,Ada LaBostrie,American,Good,Married,Housewife,Unknown,Female,Unknown,Unknown,Brown,Black,Unknown,Unknown,,Unknown,
2,https://dc.fandom.com//wiki/Adellca_(New_Earth),Adellca,Secret Identity,Green Lantern,Unknown,Good,Single,Green Lantern,Unknown,Female,Unknown,Unknown,Black,Black,Unknown,Unknown,,Unknown,
3,https://dc.fandom.com//wiki/A-1_(Prime_Earth),Artificial Intelligence Data Flow,Unknown,A-I,Unknown,Good,Single,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,,Unknown,
4,https://dc.fandom.com//wiki/Ace_Egan_(Quality_...,Ace Egan,Secret Identity,Ace of Space,Unknown,Good,Unknown,Unknown,Unknown,Male,Unknown,Unknown,Unknown,Unknown,New York,Unknown,,Unknown,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22185,https://dc.fandom.com//wiki/Zygo_(Prime_Earth),Unknown,Secret Identity,Zygo,Unknown,Bad,Unknown,Scientist,Unknown,Male,Unknown,Unknown,Brown,Brown,Unknown,Unknown,,Unknown,
22186,https://dc.fandom.com//wiki/Zymyr_(Pre-Zero_Hour),Zymyr,Public Identity,Zymyr,United Planets Citizen,Bad,Single,Scientist,Unknown,Male,Unknown,Unknown,No eyes,No Hair,Gil'Dishpan,Unknown,,Unknown,
22187,https://dc.fandom.com//wiki/Z%C3%BCM_(New_Earth),Unknown,Secret Identity,ZüM,Unknown,Bad,Single,Super-Villain,Unknown,Male,Unknown,Unknown,Unknown,No Hair,Ma'aleca'andra,Unknown,,Unknown,
22188,https://dc.fandom.com//wiki/Zyn_(New_Earth),Zyn,Public Identity,Unknown,Unknown,Bad,Unknown,Mercenary,Unknown,Male,Unknown,Unknown,Unknown,Bald,Unknown,Unknown,,Unknown,


In [62]:
unknown = (dc_pers_df['Identity']=='Unknown') &\
                 (dc_pers_df['Real Name']=='Unknown') &\
                 (dc_pers_df['Current Alias']=='Unknown') &\
                 (dc_pers_df['Occupation']=='Unknown') &\
                 (dc_pers_df['Gender']=='Unknown') &\
                 (dc_pers_df['Place of Birth']=='Unknown') &\
                 (dc_pers_df['Eyes']=='unknown') &\
                 (dc_pers_df['Citizenship']=='Unknown') &\
                 (dc_pers_df['Education']=='Unknown')

dc_pers_df[unknown].head(20)

Unnamed: 0,URL,Real Name,Identity,Current Alias,Citizenship,Good or Bad,Marital Status,Occupation,Education,Gender,Height,Weight,Eyes,Hair,Place of Birth,Height in string,Height in float,Weight in string,Weight in float


***There are no row where there is no information***

***Cleaning of DC Comics character Dataframe is done!!***

## Now lets clean the database of the comics for both DC and Marvel
**First we load the dataframe**

In [63]:
comics_dc = pd.read_pickle("comics_dc.txt")

In [64]:
# Cleaning and transforming to list
comics_dc['Good characters'] = comics_dc['Good characters'].str.replace(', ','',1)
comics_dc['Bad characters'] = comics_dc['Bad characters'].str.replace(', ','',1)
comics_dc['Neutral characters'] = comics_dc['Neutral characters'].str.replace(', ','',1)
comics_dc['Editor-in-chief'] = comics_dc['Editor-in-chief'].str.replace(', ','',1)
comics_dc['Editor-in-chief URL'] = comics_dc['Editor-in-chief URL'].str.replace(', ','',1)
comics_dc['Writer'] = comics_dc['Writer'].str.replace(', ','',1)
comics_dc['Writer URL'] = comics_dc['Writer URL'].str.replace(', ','',1)
comics_dc['Good characters'] = comics_dc['Good characters'].str.split(',')
comics_dc['Bad characters'] = comics_dc['Bad characters'].str.split(',')
comics_dc['Neutral characters'] = comics_dc['Neutral characters'].str.split(',')
comics_dc['Writer'] = comics_dc['Writer'].str.split(',')
comics_dc['Writer URL'] = comics_dc['Writer URL'].str.split(',')
comics_dc.reset_index(drop=True, inplace=True)
comics_dc

Unnamed: 0,URL,Good characters,Bad characters,Neutral characters,Editor-in-chief,Editor-in-chief URL,Writer,Writer URL,Publication date,Subcomic
0,/wiki/100_Bullets_Vol_1_64,"[/wiki/Jack_Daw_(100_Bullets), /wiki/Philip_G...",[],[],Karen Berger,/wiki/Karen_Berger,[],[],"November, 2005",The Dive
1,/wiki/100_Bullets_Vol_1_25,"[/wiki/Augustus_Medici_(100_Bullets), /wiki/B...",[],[],Karen Berger,/wiki/Karen_Berger,[],[],"August, 2001",Red Prince Blues (Part III of III)
2,/wiki/2020_Visions_Vol_1_5,[],[],[],Karen Berger,/wiki/Karen_Berger,[Ron Marz],[/wiki/Ron_Marz],"September, 1997",
3,/wiki/100%25_True%3F_Vol_1_2,[],[],[],Jenette Kahn,/wiki/Jenette_Kahn,[Ron Marz],[/wiki/Ron_Marz],"December, 1997",
4,/wiki/100_Bullets_Vol_1_11,[/wiki/Philip_Graves_(100_Bullets)],[],[],Karen Berger,/wiki/Karen_Berger,[],[],"June, 2000","Heartbreak, Sunny Side Up"
...,...,...,...,...,...,...,...,...,...,...
62309,/wiki/Zatanna_Vol_2_1,[],[],[],Dan DiDio,/wiki/Dan_DiDio,[],[],"July, 2010",
62310,/wiki/Zero_Girl_Vol_1_4,[],[],[],Jim Lee,/wiki/Jim_Lee,[],[],"May, 2001",
62311,/wiki/Young_Romance_Vol_1_196,[],[],[],,,[],[],"December, 1973",he 1st Stor
62312,/wiki/Young_Romance_Vol_1_126,[],[],[],,,[],[],"November, 1963",he 1st Stor


In [65]:
writer_editor = comics_dc.drop(['URL','Good characters','Bad characters','Neutral characters','Subcomic'],axis=1)

**We now have to deal with the writer that don't have URL**

In [66]:
bad_URL_index=writer_editor[writer_editor['Writer'].str.len() != writer_editor['Writer URL'].str.len()].index

In [67]:
for i in bad_URL_index:
    writer = writer_editor['Writer'][i]    
    writerURL = writer_editor['Writer URL'][i]
    idx = 0
    for name in writer:
        dummy = 0
        for s in writerURL:
            if name.split()[0] in s:
                dummy = 1
        if not dummy:
          if len(writerURL):
            if writerURL[0]!='':
              writer_editor['Writer URL'][i].insert(idx, name)
            else:
              writer_editor['Writer URL'][i].clear()
              writer_editor['Writer URL'][i].insert(idx, name)
          else:
            writer_editor['Writer URL'][i].insert(idx, name)
        idx += 1

In [68]:
def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(explode, 1), how='left')

In [69]:
writer_editor = unnesting(writer_editor, ['Writer', 'Writer URL'])

**Now we do the same for the Marvel comics**

In [70]:
comics_marvel = pd.read_pickle("comics_marvel.txt")

In [71]:
comics_marvel

Unnamed: 0,URL,Good characters,Bad characters,Neutral characters,Editor-in-chief,Editor-in-chief URL,Writer,Writer URL,Publication date,Subcomic
0,/wiki/Marvel_Mystery_Comics_Vol_1_NN,,,,,,", Joe Caramagna",", /wiki/Joe_Caramagna","January, 1943",st stor
0,/wiki/Comedy_Comics_Vol_1_12,,,,", Stan Lee",", /wiki/Stan_Lee",,,"December, 1942",Morphy
0,/wiki/Marvel_Mystery_Comics_Vol_1_7,", /wiki/Human_Torch_(Android)_(Earth-616), /wi...",", /wiki/Roglo_(Earth-616), #cite_note-Only_App...",", /wiki/New_York_City_Police_Department_(Earth...",", Joe Simon",", /wiki/Joe_Simon",", Stan Lee, Larry Lieber",", /wiki/Stan_Lee, /wiki/Larry_Lieber","May, 1940",The Human Torch
0,/wiki/Marvel_Mystery_Comics_Vol_1_7,", /wiki/Thomas_Halloway_(Earth-616), /wiki/Bet...",", /wiki/Emma_Martin_(Earth-616)",", /wiki/Henry_Martin_(Earth-616)",", Joe Simon",", /wiki/Joe_Simon",", Paul Gustavson, Ray Gill",", /wiki/Paul_Gustavson, /wiki/Ray_Gill","May, 1940",The Angel: Master of Men
0,/wiki/Marvel_Mystery_Comics_Vol_1_7,", /wiki/Namor_McKenzie_(Earth-616), /wiki/Thak...",,", /wiki/Homo_mermanus, /wiki/New_York_City_Pol...",", Joe Simon",", /wiki/Joe_Simon",", William Blake Everett",", /wiki/William_Blake_Everett","May, 1940","Prince Namor, the Sub-Mariner"
...,...,...,...,...,...,...,...,...,...,...
0,/wiki/Spider-Man:_The_Complete_Clone_Saga_Epic...,", /wiki/Peter_Parker_(Earth-616), /wiki/Ben_Re...",", /wiki/Kaine_Parker_(Earth-616), /wiki/Samuel...",", /wiki/Guardian_(Spider-Clone)_(Earth-616), /...",", Joe Quesada",", /wiki/Joe_Quesada",", J.M. DeMatteis",", /wiki/J.M._DeMatteis",1979,Resurrection!
0,/wiki/Spider-Man:_The_Complete_Clone_Saga_Epic...,", /wiki/Peter_Parker_(Earth-616), /wiki/Ben_Re...",", /wiki/Miles_Warren_(Jackal_Clone_2)_(Earth-616)",", /wiki/Kaine_Parker_(Earth-616), /wiki/Charle...",", Joe Quesada",", /wiki/Joe_Quesada",", Howard Mackie",", /wiki/Howard_Mackie",1979,Truths & Deceptions
0,/wiki/Hellraiser_Vol_1_17,,,,,,", Clive Barker",", /wiki/Clive_Barker",1992,Resurrection
0,/wiki/Ultimate_Spider-Man_Infinite_Comic_Vol_2_10,", /wiki/Peter_Parker_(Earth-12041), /wiki/Pete...",", /wiki/Shazana_(Earth-12041)",", /wiki/William_Howard_Taft_(Earth-12041), /wi...",", Axel Alonso",", /wiki/Axel_Alonso",", John Barber",", /wiki/John_Barber",2016,Ham-ilton (Part 2)


In [72]:
# Cleaning and transforming to list
comics_marvel['Good characters'] = comics_marvel['Good characters'].str.replace(', ','',1)
comics_marvel['Bad characters'] = comics_marvel['Bad characters'].str.replace(', ','',1)
comics_marvel['Neutral characters'] = comics_marvel['Neutral characters'].str.replace(', ','',1)
comics_marvel['Editor-in-chief'] = comics_marvel['Editor-in-chief'].str.replace(', ','',1)
comics_marvel['Editor-in-chief URL'] = comics_marvel['Editor-in-chief URL'].str.replace(', ','',1)
comics_marvel['Writer'] = comics_marvel['Writer'].str.replace(', ','',1)
comics_marvel['Writer URL'] = comics_marvel['Writer URL'].str.replace(', ','',1)

comics_marvel['Good characters'] = comics_marvel['Good characters'].str.split(',')
comics_marvel['Bad characters'] = comics_marvel['Bad characters'].str.split(',')
comics_marvel['Neutral characters'] = comics_marvel['Neutral characters'].str.split(',')
comics_marvel['Writer'] = comics_marvel['Writer'].str.split(r",")
comics_marvel['Writer URL'] = comics_marvel['Writer URL'].str.split(r",")

comics_marvel.reset_index(drop=True, inplace=True)

In [73]:
writer_editor = comics_marvel.drop(['URL','Good characters','Bad characters','Neutral characters','Subcomic'],axis=1)

In [74]:
bad_URL_index = writer_editor[writer_editor['Writer'].str.len() != writer_editor['Writer URL'].str.len()].index

In [75]:
for i in bad_URL_index:
    writer = writer_editor['Writer'][i]    
    writerURL = writer_editor['Writer URL'][i]
    idx = 0
    for name in writer:
        dummy = 0
        for s in writerURL:
            if name.split()[0] in s:
                dummy = 1
        if not dummy:
            if len(writerURL):
                if writerURL[0]!='' :
                    writer_editor['Writer URL'][i].insert(idx, name)
                else:
                    writer_editor['Writer URL'][i].clear()
                    writer_editor['Writer URL'][i].insert(idx, name)
            else:
                writer_editor['Writer URL'][i].insert(idx, name)
        idx += 1

**We still have remaining problems: Ivan Velez with ',' in his URL as '%2C' which is the code for ',' and two writer have the same name: Peyo written twice. Lets do it manually**

In [76]:
writer_editor.loc[19990,'Writer URL'].append('David Winn')
writer_editor.loc[20239,'Writer'] = ['Peyo']
writer_editor.loc[20368,'Writer'] = ['Peyo']
writer_editor.loc[20382,'Writer'] = ['Peyo']

In [77]:
for ind in writer_editor[writer_editor['Writer'].str.len() != writer_editor['Writer URL'].str.len()].index:
    writer_editor.loc[ind,'Writer URL'] = ['/wiki/Ivan_Velez%2C_Jr.']

In [78]:
writer_editor = unnesting(writer_editor, ['Writer','Writer URL'])
writer_editor

Unnamed: 0,Writer,Writer URL,Editor-in-chief,Editor-in-chief URL,Publication date
0,Joe Caramagna,/wiki/Joe_Caramagna,,,"January, 1943"
1,,,Stan Lee,/wiki/Stan_Lee,"December, 1942"
2,Stan Lee,/wiki/Stan_Lee,Joe Simon,/wiki/Joe_Simon,"May, 1940"
2,Larry Lieber,/wiki/Larry_Lieber,Joe Simon,/wiki/Joe_Simon,"May, 1940"
3,Paul Gustavson,/wiki/Paul_Gustavson,Joe Simon,/wiki/Joe_Simon,"May, 1940"
...,...,...,...,...,...
68477,J.M. DeMatteis,/wiki/J.M._DeMatteis,Joe Quesada,/wiki/Joe_Quesada,1979
68478,Howard Mackie,/wiki/Howard_Mackie,Joe Quesada,/wiki/Joe_Quesada,1979
68479,Clive Barker,/wiki/Clive_Barker,,,1992
68480,John Barber,/wiki/John_Barber,Axel Alonso,/wiki/Axel_Alonso,2016


**Now our datasets are ready to use**

In [81]:
pickle.dump(personnage, open("dc_pers_clean.txt",'wb'))