<a href="https://colab.research.google.com/github/RicNavarro/personal-projects/blob/main/DC_Rebirth_List.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import pandas as pd
import glob as gb
import os
import re

Some additional cleaning was done in Excel beforehand:
1) Commas (",") inside certain entries were removed
2) Some text was re-encoded properly (entries with single quotes, for example, were garbled)
3) A few entries contained descriptions of how to read certain events rather than the names of the issues. Those were removed

In [2]:
df_main = pd.read_csv("data/DC_Rebirth_Comic_Order.csv")

In [3]:
df_main.head()

Unnamed: 0,Comic Name
0,DC Universe: Rebirth #1 (2016)
1,The Flash: Rebirth Vol. 2 #1 (2016)
2,Nightwing: Rebirth #1 (2016)
3,Titans: Rebirth #1 (2016)
4,Titans Vol. 3 #1 (2016)


We proceed with some initial formatting to extract the character names, removing spacing, parentheses, issue numbering and other info that isn't relevant for the "Character" field. We also flag the entries that indicate an event.

In [4]:
def clean_name(name):
    if name.split(" ")[0] == 'Read':
        return 'EVENT'
    else:
        name = name.title()
        if ":" in name:
            name = name.split(':')
        
        # Some Comic Names have the character name after the colon when they're part of an event
            split_to_keep = 0
            
            if name [0] == "" or "BATMAN" in name[1].upper():    
            # The "BATMAN" exception is added specifically because of a single edge case where the title is
            # not part of an event, but the character's name is after the colon
                split_to_keep = 1
            name = name[split_to_keep]

        return (name.split('Vol')[0].split('#')[0]).split("Annual")[0].split("Special")[0].strip().upper()

df_main['Character'] = df_main['Comic Name'].apply(clean_name)
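
As a quick sanity check, the chain of `split` calls at the end of `clean_name` can be tried in isolation. The sketch below copies that chain into a helper called `strip_issue_info` (a name made up for this example) and runs it on three titles that appear in this notebook's listings. The splits are plain substring matches, so a title that itself begins with "Vol" or contains "Special" would be cut short; that is a limitation of the approach, not something this sketch fixes.

```python
def strip_issue_info(title):
    # Same chain of splits used in the last line of clean_name:
    # drop everything from "Vol", "#", "Annual" or "Special" onwards
    trimmed = title.split('Vol')[0].split('#')[0].split('Annual')[0].split('Special')[0]
    return trimmed.strip().upper()

samples = [
    "Titans Vol. 3 #1 (2016)",
    "Detective Comics #941",
    "Justice League vs. Suicide Squad #1 (2016)",
]
results = [strip_issue_info(s) for s in samples]
for original, cleaned in zip(samples, results):
    print(f"{original!r} -> {cleaned!r}")
```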

In [5]:
df_main

Unnamed: 0,Comic Name,Character
0,DC Universe: Rebirth #1 (2016),DC UNIVERSE
1,The Flash: Rebirth Vol. 2 #1 (2016),THE FLASH
2,Nightwing: Rebirth #1 (2016),NIGHTWING
3,Titans: Rebirth #1 (2016),TITANS
4,Titans Vol. 3 #1 (2016),TITANS
...,...,...
2476,Read Dark Nights: Death Metal here.,EVENT
2477,Generations Shattered #1 (2021),GENERATIONS SHATTERED
2478,Generations Forged #1 (2021),GENERATIONS FORGED
2479,Read Endless Winter here.,EVENT


### Defining the `Event` column

A mask is created to identify which entries receive the value contained in `Comic Name`, properly formatted. That mask is then applied and the values are inserted into the `Event` column.

In [6]:
def clean_event(x):
  return x.replace('Read ', '').replace(' here.', '')

events_true = df_main['Character'] == "EVENT"
#print(events_true)

df_main['event_names_clean'] = df_main.loc[events_true, 'Comic Name'].apply(clean_event)
df_main['Event'] = np.where(df_main['Character'] == "EVENT", df_main['event_names_clean'], "None")
df_main = df_main.drop("event_names_clean", axis="columns")

df_main['Event']

0                           None
1                           None
2                           None
3                           None
4                           None
                  ...           
2476    Dark Nights: Death Metal
2477                        None
2478                        None
2479              Endless Winter
2480                Future State
Name: Event, Length: 2481, dtype: object
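
For comparison, the same `Event` column could probably be built in one step with a regular expression instead of a mask plus a temporary column. This is only a sketch on a three-row toy frame, and it assumes every event marker has exactly the form "Read ... here.", as in the rows shown above.

```python
import pandas as pd

df = pd.DataFrame({"Comic Name": [
    "Titans Vol. 3 #1 (2016)",
    "Read Dark Nights: Death Metal here.",
    "Read Endless Winter here.",
]})

# Capture the text between "Read " and " here."; rows that do not match yield NaN
extracted = df["Comic Name"].str.extract(r"^Read (.+) here\.$", expand=False)

# Keep the notebook's convention of the string "None" for non-event rows
df["Event"] = extracted.fillna("None")
print(df["Event"].tolist())
```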

In [7]:
df_main

Unnamed: 0,Comic Name,Character,Event
0,DC Universe: Rebirth #1 (2016),DC UNIVERSE,
1,The Flash: Rebirth Vol. 2 #1 (2016),THE FLASH,
2,Nightwing: Rebirth #1 (2016),NIGHTWING,
3,Titans: Rebirth #1 (2016),TITANS,
4,Titans Vol. 3 #1 (2016),TITANS,
...,...,...,...
2476,Read Dark Nights: Death Metal here.,EVENT,Dark Nights: Death Metal
2477,Generations Shattered #1 (2021),GENERATIONS SHATTERED,
2478,Generations Forged #1 (2021),GENERATIONS FORGED,
2479,Read Endless Winter here.,EVENT,Endless Winter


Some regular expressions are then used for additional formatting of the `events_list` Series, which contains the names of the different events. This removes spacing and punctuation and inserts underscores. After this, the entries in the Series can be compared directly with the names of the files that will be accessed.

In [8]:
events_list_entry_names = events_list = df_main.loc[df_main['Event'] != 'None']['Event']
print (events_list_entry_names)

def reformat_name_event(event):
    # Replace any non-alphanumeric character (except _) or run of whitespace with an underscore
    #    Regex: [^\w\s] -> anything that is NOT (a word character OR whitespace)
    #    | -> OR
    #    \s+ -> one or more whitespace characters
    event = re.sub(r'[^\w\s]|\s+', '_', event)

    # Collapse repeated underscores that may appear (e.g. "Night__of___the")
    event = re.sub(r'__+', '_', event)

    # Strip underscores from the start and end of the string
    event = event.strip('_')

    return event


events_list_file_names = events_list.apply(reformat_name_event)
print (events_list_file_names)

105             Night of the Monster Men
273     Justice League vs. Suicide Squad
380                      Superman Reborn
534                 The Lazarus Contract
919                   Dark Nights: Metal
1560                    Heroes in Crisis
1577                       Drowned Earth
1911                     Event Leviathan
1970                        City of Bane
2015                        The Infected
2370                           Joker War
2476            Dark Nights: Death Metal
2479                      Endless Winter
2480                        Future State
Name: Event, dtype: object
105            Night_of_the_Monster_Men
273     Justice_League_vs_Suicide_Squad
380                     Superman_Reborn
534                The_Lazarus_Contract
919                   Dark_Nights_Metal
1560                   Heroes_in_Crisis
1577                      Drowned_Earth
1911                    Event_Leviathan
1970                       City_of_Bane
2015                       The_Infected

We need to read the names of the files in DC_Events and either

**(A)** store them in the list `df_events` as we read, then reorder that list according to `events_dict` or similar

or

**(B)** check the elements of `events_dict` while reading, sorting as we go and storing the results in `dfs_events` already in order
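
Option (B) could look roughly like the sketch below: instead of listing the folder and comparing counts, each expected file name is looked up directly, in the order the events appear in the main list. The function name `load_events_in_order` is hypothetical, the file names are assumed to follow the underscore format produced by `reformat_name_event`, and the demo writes throwaway CSV files with placeholder rows into a temporary folder rather than touching `data/DC_Events/`.

```python
import os
import tempfile

import pandas as pd

def load_events_in_order(events_dir, ordered_names, column="Comic Name"):
    """Return {file name: DataFrame}, following the order of ordered_names."""
    frames = {}
    for name in ordered_names:
        path = os.path.join(events_dir, name + ".csv")
        if not os.path.exists(path):
            # Report the gap and move on, so one missing file does not stop the run
            print(f"The {name} event has no corresponding file")
            continue
        frames[name] = pd.read_csv(path)[[column]]
    return frames

# Demo with placeholder files; "Future_State" is deliberately left out
with tempfile.TemporaryDirectory() as tmp:
    demo_files = {
        "Joker_War": ["Batman Vol. 3 #95"],
        "City_of_Bane": ["Batman Vol. 3 #75"],
    }
    for name, titles in demo_files.items():
        pd.DataFrame({"Comic Name": titles}).to_csv(os.path.join(tmp, name + ".csv"), index=False)

    frames = load_events_in_order(tmp, ["City_of_Bane", "Joker_War", "Future_State"])

print(list(frames))
```

Keying the result by name, rather than by position, means a missing file cannot silently shift every later event by one slot.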

In [9]:
col_to_keep = "Comic Name"

events_path = "data/DC_Events/"
fileformat = "*.csv"
event_names = []
dfs_events = []

path_events = gb.glob(f"{events_path}/{fileformat}")

#for i in events_list.values:
#  print(i)
#print ("File names\n")
#for i in path_events:
#  print(os.path.basename(i))

#print ("\nEvent Column names\n")
if (len(path_events) == len(events_list_file_names.values)):
  for i in range(len(path_events)):
    #print(events_path + events_list.values[i] + fileformat.strip("*"))
    try:
        # remember: the events are added according to the order they were read from in the original Comic_Order file
        event_address = events_path + events_list_file_names.values[i] + fileformat.strip("*")
        df_toshorten = pd.read_csv(event_address)
        dfs_events.append(df_toshorten[[col_to_keep]])
    except FileNotFoundError:
      # Only a missing file is reported here; any other error should surface normally
      print("The " + events_list_file_names.values[i] + " event has no corresponding file")

dfs_events

[              Comic Name
 0       Batman Vol. 3 #7
 1    Nightwing Vol. 4 #5
 2  Detective Comics #941
 3       Batman Vol. 3 #8
 4    Nightwing Vol. 4 #6
 5  Detective Comics #942,
                                     Comic Name
 0                      Suicide Squad Vol. 5 #8
 1                    Justice League Vol. 3 #12
 2   Justice League vs. Suicide Squad #1 (2016)
 3          Justice League vs. Suicide Squad #2
 4          Justice League vs. Suicide Squad #3
 5          Justice League vs. Suicide Squad #4
 6                      Suicide Squad Vol. 5 #9
 7          Justice League vs. Suicide Squad #5
 8                    Justice League Vol. 3 #13
 9          Justice League vs. Suicide Squad #6
 10                    Suicide Squad Vol. 5 #10,
             Comic Name
 0  Superman Vol. 4 #18
 1   Action Comics #975
 2        Superwoman #8
 3  Superman Vol. 4 #19
 4   Action Comics #976
 5  Supergirl Vol. 7 #8
 6        Superwoman #9
 7    Trinity Vol. 2 #8
 8   Action Comics #977


### Getting dataframes from each additional file and adding a column with the proper event name

In [10]:
def clean_event_name(comic_name, event):
    if event not in comic_name:
        return clean_name(comic_name)
        
    replaced_name = clean_name(comic_name.replace(event, ""))
    if len(replaced_name) != 0:
        return replaced_name
    else:
        return event.upper()
    
for i in range(len(events_list_entry_names)):

    dfs_events[i]["Event"] = events_list_entry_names.iloc[i]
    
    # Events require a special cleanup, considering many of them have the Event name in the Comic Book's title,
    # which would make some noise when doing the regular cleanup

    dfs_events[i]['Character'] = dfs_events[i].apply(lambda x: clean_event_name(x['Comic Name'], x['Event']), axis=1)

#Checking one of the dfs for the appropriate column
dfs_events

[              Comic Name                     Event         Character
 0       Batman Vol. 3 #7  Night of the Monster Men            BATMAN
 1    Nightwing Vol. 4 #5  Night of the Monster Men         NIGHTWING
 2  Detective Comics #941  Night of the Monster Men  DETECTIVE COMICS
 3       Batman Vol. 3 #8  Night of the Monster Men            BATMAN
 4    Nightwing Vol. 4 #6  Night of the Monster Men         NIGHTWING
 5  Detective Comics #942  Night of the Monster Men  DETECTIVE COMICS,
                                     Comic Name  \
 0                      Suicide Squad Vol. 5 #8   
 1                    Justice League Vol. 3 #12   
 2   Justice League vs. Suicide Squad #1 (2016)   
 3          Justice League vs. Suicide Squad #2   
 4          Justice League vs. Suicide Squad #3   
 5          Justice League vs. Suicide Squad #4   
 6                      Suicide Squad Vol. 5 #9   
 7          Justice League vs. Suicide Squad #5   
 8                    Justice League Vol. 3 #13   

The next step is to merge the dataframes

In [11]:
df_main.loc[df_main['Event'] == "Justice League vs. Suicide Squad"]

Unnamed: 0,Comic Name,Character,Event
273,Read Justice League vs. Suicide Squad here.,EVENT,Justice League vs. Suicide Squad


In [12]:
insertion_points = []
#for i in dfs_events:
    
for i in df_main.loc[df_main['Event'] != 'None']['Comic Name']:
    insertion_points.append(i)

#print(insertion_points)

for i in range(len(insertion_points)):
    insertion_index = df_main[df_main['Comic Name'] == insertion_points[i]].index[0]

    df_main_slice1 = df_main.iloc[:insertion_index]
    df_main_slice2 = df_main.iloc[insertion_index + 1:]

    df_main = pd.concat([df_main_slice1, dfs_events[i], df_main_slice2], ignore_index=True)
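
The loop above rebuilds `df_main` once per event. As an alternative, and not what the notebook does, the splice can be sketched as a single pass that collects the pieces and concatenates them once. The helper name `splice_events` is made up, the event frames are assumed to be stored in a dict keyed by event name, and the two surrounding Batman rows in the toy data are placeholders.

```python
import pandas as pd

def splice_events(main, event_frames):
    """Replace each marker row (Character == 'EVENT') with that event's issues."""
    pieces = []
    start = 0
    for pos, character in enumerate(main["Character"]):
        if character != "EVENT":
            continue
        # Rows before the marker, then the event's own rows; the marker itself is dropped
        pieces.append(main.iloc[start:pos])
        pieces.append(event_frames[main["Event"].iloc[pos]])
        start = pos + 1
    pieces.append(main.iloc[start:])
    return pd.concat(pieces, ignore_index=True)

main = pd.DataFrame({
    "Comic Name": ["Batman Vol. 3 #6", "Read Night of the Monster Men here.", "Batman Vol. 3 #9"],
    "Character": ["BATMAN", "EVENT", "BATMAN"],
    "Event": ["None", "Night of the Monster Men", "None"],
})
event_frames = {
    "Night of the Monster Men": pd.DataFrame({
        "Comic Name": ["Batman Vol. 3 #7", "Nightwing Vol. 4 #5"],
        "Character": ["BATMAN", "NIGHTWING"],
        "Event": ["Night of the Monster Men"] * 2,
    }),
}

merged = splice_events(main, event_frames)
print(merged["Comic Name"].tolist())
```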

In [13]:
df_main

Unnamed: 0,Comic Name,Character,Event
0,DC Universe: Rebirth #1 (2016),DC UNIVERSE,
1,The Flash: Rebirth Vol. 2 #1 (2016),THE FLASH,
2,Nightwing: Rebirth #1 (2016),NIGHTWING,
3,Titans: Rebirth #1 (2016),TITANS,
4,Titans Vol. 3 #1 (2016),TITANS,
...,...,...,...
2703,Future State: Superman: House of El #1 (2021),SUPERMAN,Future State
2704,Future State: Swamp Thing #1 (2021),SWAMP THING,Future State
2705,Future State: Swamp Thing #2,SWAMP THING,Future State
2706,Future State: Immortal Wonder Woman #1 (2021),IMMORTAL WONDER WOMAN,Future State


In [27]:
print("There are " + str(len(df_main['Character'].unique())) + " characters")
unique_heroes = df_main['Character'].unique()
unique_heroes

There are 123 characters


array(['DC UNIVERSE', 'THE FLASH', 'NIGHTWING', 'TITANS', 'SUPERMAN',
       'SUPERWOMAN', 'BATMAN', 'GREEN ARROW', 'SUICIDE SQUAD',
       'GREEN LANTERN', 'GREEN LANTERNS', 'HARLEY QUINN', 'AQUAMAN',
       'BATGIRL', 'TEEN TITANS', 'JUSTICE LEAGUE', 'TRINITY', 'CATWOMAN',
       'CYBORG', 'GOTHAM ACADEMY', 'WONDER WOMAN', 'DC REBIRTH HOLIDAY',
       'NEW SUPER-MAN', 'DEATHSTROKE', 'RAVEN', 'RED HOOD', 'SUPER SONS',
       'THE HELLBLAZER', 'SIXPACK', 'DEADMAN', 'SUPERGIRL',
       'MOTHER PANIC', 'BATWOMAN', 'BATMAN BEYOND', 'BLUE BEETLE',
       'CAVE CARSON', 'HAWKMAN', 'MIDNIGHTER', 'SHADE', 'BANE',
       'DOOM PATROL', 'THE ODYSSEY OF THE AMAZONS', 'THE FALL',
       'BUG! THE ADVENTURES OF FORAGER', 'SWAMP THING', 'DARK DAYS',
       'DARK NIGHTS: METAL', 'DARK KNIGHTS RISING',
       'THE BATMAN WHO LAUGHS', 'MISTER MIRACLE', 'RAGMAN', 'DAMAGE',
       'THE SILENCER', 'BLACK LIGHTNING', 'THE DEMON', 'THE TERRIFICS',
       'DC NATION', 'THE CURSE OF BRIMSTONE',
       'CHALL

### An evaluation of what constitutes a **"collab title"** is needed
To do that, we look for patterns within the `Character` column. The ones identified were the use of "AND", "VS.", "/" and "&" to signal the involvement of at least two characters. (This is not perfect, as we'll see in an edge case below.)

In [15]:
for i in unique_heroes:
    if "AND" in i or "VS" in i or "/" in i or "\\" in i or "&" in i:
        print(i)

#print("--------------------------------------")
#for i in unique_heroes:
#    if "OF" in i:
#        print(i)

HAL JORDAN & THE GREEN LANTERN CORPS
BATGIRL AND THE BIRDS OF PREY
RED HOOD AND THE OUTLAWS
JUSTICE LEAGUE VS. SUICIDE SQUAD
SIXPACK AND DOGWELDER
MIDNIGHTER AND APOLLO
THE FALL AND RISE OF CAPTAIN ATOM
HAL JORDAN AND THE GREEN LANTERN CORPS
JLA/DOOM PATROL
MOTHER PANIC/BATMAN
SHADE THE CHANGING GIRL/WONDER WOMAN
CAVE CARSON HAS A CYBERNETIC EYE/SWAMP THING
DOOM PATROL/JLA
BATMAN AND THE SIGNAL
NEW SUPER-MAN AND THE JUSTICE LEAGUE OF CHINA
BATMAN AND WONDER WOMAN
WONDER WOMAN AND DARK JUSTICE LEAGUE
DARK JUSTICE LEAGUE AND WONDER WOMAN
JUSTICE LEAGUE/AQUAMAN
AQUAMAN/JUSTICE LEAGUE
BATMAN AND THE OUTSIDERS
BATMAN/SUPERMAN
HARLEY QUINN & POISON IVY
BATMAN VS. RA'S AL GHUL
HARLEY QUINN AND THE BIRDS OF PREY
SUPERMAN VS. IMPERIOUS LEX
SUPERMAN/WONDER WOMAN


That looks good, except for a single instance: `THE FALL AND RISE OF CAPTAIN ATOM`. This title will have to be dealt with separately.

Other than that, we're good to use those parameters to create a `Side Character` column
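
One possible way to handle both the edge case and accidental matches inside longer words is sketched below. It is an illustration under assumptions, not the notebook's method: the names `COLLAB_PATTERN`, `NOT_COLLABS` and `split_collab` are invented here, "WONDERLAND" is a hypothetical title used only to show that "AND" inside a word is ignored, and the exception set holds just the one title identified above.

```python
import re

# A delimiter must stand alone between spaces (AND, VS., &) or be a slash
COLLAB_PATTERN = re.compile(r"\s+(?:AND|VS\.|&)\s+|\s*/\s*")

# Titles that contain a delimiter but are not collaborations
NOT_COLLABS = {"THE FALL AND RISE OF CAPTAIN ATOM"}

def split_collab(name):
    """Return (main character, side character), with "None" when there is no collab."""
    if name in NOT_COLLABS:
        return name, "None"
    # maxsplit=1 keeps everything after the first delimiter together
    parts = COLLAB_PATTERN.split(name, maxsplit=1)
    if len(parts) == 1:
        return name, "None"
    return parts[0].strip(), parts[1].strip()

for title in ["BATMAN AND THE OUTSIDERS", "JLA/DOOM PATROL",
              "THE FALL AND RISE OF CAPTAIN ATOM", "WONDERLAND"]:
    print(title, "->", split_collab(title))
```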

In [16]:
def split_characters(name):
    collab_indicators = ["AND","VS.","/","&"]
    #print(name)
    if not (any(ind in name for ind in collab_indicators)):
        return name, "None"

    split_point = "AND"
    if "VS." in name:
        split_point = "VS."
    elif "/" in name:
        split_point = "/"
    elif "&" in name:
        split_point = "&"
    
    name = name.split(split_point)
    #print(name[0].strip() + "-" + name[1].strip())
    return name[0].strip(), name[1].strip()

# Since a tuple is being returned, we need to unzip it
df_main["Character"], df_main["Side Character"] = zip(*df_main.apply(lambda x: split_characters(x['Character']), axis=1))

In [17]:
df_main

Unnamed: 0,Comic Name,Character,Event,Side Character
0,DC Universe: Rebirth #1 (2016),DC UNIVERSE,,
1,The Flash: Rebirth Vol. 2 #1 (2016),THE FLASH,,
2,Nightwing: Rebirth #1 (2016),NIGHTWING,,
3,Titans: Rebirth #1 (2016),TITANS,,
4,Titans Vol. 3 #1 (2016),TITANS,,
...,...,...,...,...
2703,Future State: Superman: House of El #1 (2021),SUPERMAN,Future State,
2704,Future State: Swamp Thing #1 (2021),SWAMP THING,Future State,
2705,Future State: Swamp Thing #2,SWAMP THING,Future State,
2706,Future State: Immortal Wonder Woman #1 (2021),IMMORTAL WONDER WOMAN,Future State,


### Dealing with duplicates/redundancies

There are some *visible* duplicates, *apparent but fake* duplicates (ambiguity), and *non-apparent* duplicates (aliases) in both the `Character` and the `Side Character` columns. Solving these requires domain knowledge. Listed below are instances of the first two kinds.

In [18]:
unique_heroes = df_main['Character'].unique()
unique_side = df_main["Side Character"].unique()
repeated_chars = set()

print("\nCharacter\n")

for i in unique_heroes:
    rep = False
    for j in unique_heroes:
        if j in i and j != i:
            rep = True
            print(j + " is in "+ i)
            repeated_chars.add(j)
    if rep:
        print("-------------")

print("\nSide Character\n")

for i in unique_side:
    rep = False
    for j in unique_side:
        if j in i and j != i:
            rep = True
            print(j + " is in "+ i)
    if rep:
        print("-------------")


Character

GREEN LANTERN is in GREEN LANTERNS
-------------
TITANS is in TEEN TITANS
-------------
BATMAN is in ALL-STAR BATMAN
-------------
JUSTICE LEAGUE is in JUSTICE LEAGUE OF AMERICA
-------------
BATMAN is in BATMAN BEYOND
-------------
HAWKMAN is in DEATH OF HAWKMAN
-------------
SWAMP THING is in SWAMP THING WINTER
-------------
BATMAN is in BATMAN LOST
-------------
HAWKMAN is in HAWKMAN FOUND
-------------
BATMAN is in THE BATMAN WHO LAUGHS
-------------
SUPER SONS is in ADVENTURES OF THE SUPER SONS
-------------
JUSTICE LEAGUE is in JUSTICE LEAGUE DARK
-------------
JUSTICE LEAGUE is in DARK JUSTICE LEAGUE
-------------
GREEN LANTERN is in THE GREEN LANTERN
-------------
JUSTICE LEAGUE is in JUSTICE LEAGUE ODYSSEY
-------------
HARLEY QUINN is in HARLEY QUINN'S VILLAIN OF THE YEAR
-------------
THE GREEN LANTERN is in THE GREEN LANTERN SEASON TWO
GREEN LANTERN is in THE GREEN LANTERN SEASON TWO
-------------
BATMAN is in BATMAN KNIGHTFALL
-------------
SUPERMAN is in SUPER

After some research into the specific titles, we can see that *apparent but fake* duplicates appear in both the `Character` and the `Side Character` columns. They are the following:

- `GREEN LANTERNS`, `TEEN TITANS`, `BATMAN BEYOND`, `THE BATMAN WHO LAUGHS`, `JUSTICE LEAGUE DARK`, `DARK JUSTICE LEAGUE` (which shall be renamed as `JUSTICE LEAGUE DARK`), `JUSTICE LEAGUE ODYSSEY`, `SUPERMAN'S PAL JIMMY OLSEN`, `THE NEXT BATMAN`, `SUPERMAN OF METROPOLIS`

The remaining instances in the `Character` column (among the ones presented above) are *visible* redundancies. They're dealt with below. (WIP)
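
The rename of `DARK JUSTICE LEAGUE` mentioned in the list could be done with an exact-match replacement on both columns. The snippet below is a minimal sketch on a two-row toy frame; `rename_map` is a name chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Character": ["WONDER WOMAN", "DARK JUSTICE LEAGUE"],
    "Side Character": ["DARK JUSTICE LEAGUE", "WONDER WOMAN"],
})

# Series.replace with a dict swaps whole values only, so other titles are untouched
rename_map = {"DARK JUSTICE LEAGUE": "JUSTICE LEAGUE DARK"}
for column in ["Character", "Side Character"]:
    df[column] = df[column].replace(rename_map)

print(df)
```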

In [19]:
#for i in repeated_heroes:
#    print (df_main.loc[df_main["Character"] == i]["Character"])
    
not_dupes = ["GREEN LANTERNS", "TEEN TITANS", "BATMAN BEYOND", "THE BATMAN WHO LAUGHS", "JUSTICE LEAGUE DARK", "DARK JUSTICE LEAGUE", "JUSTICE LEAGUE ODYSSEY", "SUPERMAN'S PAL JIMMY OLSEN", "THE NEXT BATMAN", "SUPERMAN OF METROPOLIS"]

def replace_dupes(entry, clear_name, to_replace):
    if (entry not in not_dupes) and to_replace:
        print(entry + " will be replaced by " + clear_name)
        return clear_name
    return entry

for i in repeated_chars:

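    # regex=False treats the name as literal text, so punctuation in a title is not read as a pattern
    mask_all_contains = df_main['Character'].str.contains(i, regex=False)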
    mask_not_exact_match = df_main["Character"] != i

    df_main["mask_only_repeats"] = mask_all_contains & mask_not_exact_match
    
    df_main['Character'] = df_main.apply(lambda x: replace_dupes(x['Character'], i, x["mask_only_repeats"]), axis=1)
    df_main = df_main.drop("mask_only_repeats", axis="columns")


TRINITY CRISIS will be replaced by TRINITY
DEATH OF HAWKMAN will be replaced by HAWKMAN
DEATH OF HAWKMAN will be replaced by HAWKMAN
DEATH OF HAWKMAN will be replaced by HAWKMAN
DEATH OF HAWKMAN will be replaced by HAWKMAN
DEATH OF HAWKMAN will be replaced by HAWKMAN
DEATH OF HAWKMAN will be replaced by HAWKMAN
HAWKMAN FOUND will be replaced by HAWKMAN
HARLEY QUINN'S VILLAIN OF THE YEAR will be replaced by HARLEY QUINN
ADVENTURES OF THE SUPER SONS will be replaced by SUPER SONS
ADVENTURES OF THE SUPER SONS will be replaced by SUPER SONS
ADVENTURES OF THE SUPER SONS will be replaced by SUPER SONS
ADVENTURES OF THE SUPER SONS will be replaced by SUPER SONS
ADVENTURES OF THE SUPER SONS will be replaced by SUPER SONS
ADVENTURES OF THE SUPER SONS will be replaced by SUPER SONS
ADVENTURES OF THE SUPER SONS will be replaced by SUPER SONS
ADVENTURES OF THE SUPER SONS will be replaced by SUPER SONS
ADVENTURES OF THE SUPER SONS will be replaced by SUPER SONS
ADVENTURES OF THE SUPER SONS will be 

In [20]:
#df_main.loc[df_main["Comic Name"].str.contains("The Fall")]

With domain knowledge, it's also possible to notice additional redundancies. The `key: [values]` matches listed below indicate which keys will replace the values in entries with redundancies

- SUPERMAN: ['ACTION COMICS', 'THE MAN OF STEEL']
- BATMAN: ['DETECTIVE COMICS', 'DARK DETECTIVE']
- THE FLASH: ['FLASH FORWARD']
- GREEN LANTERN: ['HAL JORDAN']
- CAVE CARSON: ['CAVE CARSON HAS A CYBERNETIC EYE', 'CAVE CARSON HAS AN INTERSTELLAR EYE']
- SHADE: ['SHADE THE CHANGING GIRL', 'SHADE THE CHANGING WOMAN']
- JUSTICE LEAGUE: ['JLA']
- SUPERGIRL: ['KARA ZOR-EL']
- ROBIN: ['ROBIN 80TH ANNIVERSARY 100-PAGE SUPER SPECTACULAR', 'ROBIN KING', 'ROBIN ETERNAL']
- SUPER SONS: ['CHALLENGE OF THE SUPERSONS']

In [25]:
# Map each redundant title to the canonical character that replaces it.
# 'CHALLENGE OF THE SUPERSONS' was missing from the old lookup list and was
# wrapped in a nested list in its case branch, so it was never replaced.
hidden_dupes = {
    'ACTION COMICS': 'SUPERMAN',
    'THE MAN OF STEEL': 'SUPERMAN',
    'DETECTIVE COMICS': 'BATMAN',
    'DARK DETECTIVE': 'BATMAN',
    'FLASH FORWARD': 'THE FLASH',
    'HAL JORDAN': 'GREEN LANTERN',
    'CAVE CARSON HAS A CYBERNETIC EYE': 'CAVE CARSON',
    'CAVE CARSON HAS AN INTERSTELLAR EYE': 'CAVE CARSON',
    'SHADE THE CHANGING GIRL': 'SHADE',
    'SHADE THE CHANGING WOMAN': 'SHADE',
    'JLA': 'JUSTICE LEAGUE',
    'KARA ZOR-EL': 'SUPERGIRL',
    'ROBIN 80TH ANNIVERSARY 100-PAGE SUPER SPECTACULAR': 'ROBIN',
    'ROBIN KING': 'ROBIN',
    'ROBIN ETERNAL': 'ROBIN',
    'CHALLENGE OF THE SUPERSONS': 'SUPER SONS',
}

def replace_hidden_dupes(name):
    # Return the canonical name for known aliases; leave every other name untouched
    return hidden_dupes.get(name, name)

df_main['Character'] = df_main['Character'].apply(replace_hidden_dupes)

In [26]:
#df_main.loc[df_main["mask_only_repeats"] == True]
df_main

Unnamed: 0,Comic Name,Character,Event,Side Character
0,DC Universe: Rebirth #1 (2016),DC UNIVERSE,,
1,The Flash: Rebirth Vol. 2 #1 (2016),THE FLASH,,
2,Nightwing: Rebirth #1 (2016),NIGHTWING,,
3,Titans: Rebirth #1 (2016),TITANS,,
4,Titans Vol. 3 #1 (2016),TITANS,,
...,...,...,...,...
2703,Future State: Superman: House of El #1 (2021),SUPERMAN,Future State,
2704,Future State: Swamp Thing #1 (2021),SWAMP THING,Future State,
2705,Future State: Swamp Thing #2,SWAMP THING,Future State,
2706,Future State: Immortal Wonder Woman #1 (2021),WONDER WOMAN,Future State,
