<a href="https://colab.research.google.com/github/RicNavarro/personal-projects/blob/main/DC_Rebirth_List.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [46]:
import numpy as np
import pandas as pd
import glob as gb
import os
import re

Some additional cleaning was done in Excel:
1) Commas (",") inside certain entries were removed
2) Some text was re-encoded properly (entries containing single quotes, for example, were garbled)
3) A few entries contained descriptions of how to read certain events rather than issue names. Those were removed

In [47]:
df_main = pd.read_csv("data/DC_Rebirth_Comic_Order.csv")

In [48]:
df_main.head()

Unnamed: 0,Comic Name
0,DC Universe: Rebirth #1 (2016)
1,The Flash: Rebirth Vol. 2 #1 (2016)
2,Nightwing: Rebirth #1 (2016)
3,Titans: Rebirth #1 (2016)
4,Titans Vol. 3 #1 (2016)


We now do some initial formatting to extract the character names, removing spacing, parentheses, issue numbering, and other information that is not relevant to the "Character" field. We also flag the entries that indicate an event.

In [49]:
def extract_and_clean_name(name):
    if name.split(" ")[0] == 'Read':
        return 'EVENT'
    else:
        return (name.title().split(':')[0].split('Vol')[0].split('#')[0]).rstrip().split("Annual")[-1].split("Special")[-1].upper()

df_main['Character'] = df_main['Comic Name'].apply(extract_and_clean_name)

In [50]:
df_main

Unnamed: 0,Comic Name,Character
0,DC Universe: Rebirth #1 (2016),DC UNIVERSE
1,The Flash: Rebirth Vol. 2 #1 (2016),THE FLASH
2,Nightwing: Rebirth #1 (2016),NIGHTWING
3,Titans: Rebirth #1 (2016),TITANS
4,Titans Vol. 3 #1 (2016),TITANS
...,...,...
2476,Read Dark Nights: Death Metal here.,EVENT
2477,Generations Shattered #1 (2021),GENERATIONS SHATTERED
2478,Generations Forged #1 (2021),GENERATIONS FORGED
2479,Read Endless Winter here.,EVENT


## Before stripping the 

**Event tagging**

The code below needs to be reworked so that every entry whose 'Character' equals 'EVENT' is stripped of 'Read ' and ' here.'
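One possible rework is sketched here, assuming that event rows always follow the pattern `Read <event name> here.`: the name is pulled out with a single anchored regular expression instead of chained `replace` calls. The `sample` frame is a small stand-in for `df_main`, and `extract_event_name` and `EVENT_PATTERN` are hypothetical names introduced only for this illustration.

```python
import re

import pandas as pd

# Stand-in for df_main: two ordinary issues and two event marker rows.
sample = pd.DataFrame({
    "Comic Name": [
        "Titans Vol. 3 #1 (2016)",
        "Read Dark Nights: Death Metal here.",
        "Generations Forged #1 (2021)",
        "Read Endless Winter here.",
    ]
})

# Anchored pattern: only a leading "Read " and a trailing " here." are removed,
# so the same words appearing inside an event name are left untouched.
EVENT_PATTERN = re.compile(r"^Read (.+) here\.$")


def extract_event_name(comic_name):
    """Return the event name for an event marker row, or the string "None" otherwise."""
    match = EVENT_PATTERN.match(comic_name)
    return match.group(1) if match else "None"


sample["Event"] = sample["Comic Name"].apply(extract_event_name)
print(sample)
```

Because the pattern is anchored at both ends, this version does not depend on the 'Character' column having been set to 'EVENT' first, which may or may not be desirable here.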

In [62]:
def clean_event(x):
  return x.replace('Read ', '').replace(' here.', '')

events_true = df_main['Character'] == "EVENT"
print(events_true)

df_main['event_names_clean'] = df_main.loc[events_true, 'Comic Name'].apply(clean_event)

0       False
1       False
2       False
3       False
4       False
        ...  
2476     True
2477    False
2478    False
2479     True
2480     True
Name: Character, Length: 2481, dtype: bool


In [66]:
df_main['Event'] = np.where(df_main['Character'] == "EVENT", df_main['event_names_clean'], "None")
df_main = df_main.drop("event_names_clean", axis="columns")
df_main['Event']

0                           None
1                           None
2                           None
3                           None
4                           None
                  ...           
2476    Dark Nights: Death Metal
2477                        None
2478                        None
2479              Endless Winter
2480                Future State
Name: Event, Length: 2481, dtype: object

In [67]:
df_main

Unnamed: 0,Comic Name,Character,Event
0,DC Universe: Rebirth #1 (2016),DC UNIVERSE,
1,The Flash: Rebirth Vol. 2 #1 (2016),THE FLASH,
2,Nightwing: Rebirth #1 (2016),NIGHTWING,
3,Titans: Rebirth #1 (2016),TITANS,
4,Titans Vol. 3 #1 (2016),TITANS,
...,...,...,...
2476,Read Dark Nights: Death Metal here.,EVENT,Dark Nights: Death Metal
2477,Generations Shattered #1 (2021),GENERATIONS SHATTERED,
2478,Generations Forged #1 (2021),GENERATIONS FORGED,
2479,Read Endless Winter here.,EVENT,Endless Winter


We will need regular expressions to properly modify the entries in the `events_list` Series, removing spaces and punctuation so that they can be compared against the file names.

The code includes an explanation of the expressions.

In [80]:
events_list_entry_names = events_list = df_main.loc[df_main['Event'] != 'None']['Event']
print (events_list_entry_names)

def reformat_name_event(event):
    # Replace any non-alphanumeric character (except _) or whitespace with an underscore
    #    Regex: [^\w\s] -> anything that is NOT (an alphanumeric character OR whitespace)
    #    | -> OR
    #    \s+ -> one or more whitespace characters
    event = re.sub(r'[^\w\s]|\s+', '_', event)

    # Collapse any repeated underscores that may appear (e.g. "Night__of___the")
    event = re.sub(r'__+', '_', event)

    # Remove underscores at the start or end of the string
    event = event.strip('_')

    return event


events_list_file_names = events_list.apply(reformat_name_event)
print (events_list_file_names)

105             Night of the Monster Men
273     Justice League vs. Suicide Squad
380                      Superman Reborn
534                 The Lazarus Contract
919                   Dark Nights: Metal
1560                    Heroes in Crisis
1577                       Drowned Earth
1911                     Event Leviathan
1970                        City of Bane
2015                        The Infected
2370                           Joker War
2476            Dark Nights: Death Metal
2479                      Endless Winter
2480                        Future State
Name: Event, dtype: object
105            Night_of_the_Monster_Men
273     Justice_League_vs_Suicide_Squad
380                     Superman_Reborn
534                The_Lazarus_Contract
919                   Dark_Nights_Metal
1560                   Heroes_in_Crisis
1577                      Drowned_Earth
1911                    Event_Leviathan
1970                       City_of_Bane
2015                       The_Infected

We need to read the file names in DC_Events and either

**(A)** Store the files in the list `df_events` as we read them, then reorder that list according to `events_dict` or similar

or

**(B)** Check the elements of `events_dict` as we read, sorting as we go and storing the results in `dfs_events` already in order
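Option (B) could look roughly like the sketch below. It assumes each event file is named `<reformatted event name>.csv`, and it iterates over the event names rather than over the directory listing, so the result comes out in event order without a later sort. `load_events_in_order` is a hypothetical helper, and the temporary directory with two tiny files stands in for `data/DC_Events/`.

```python
import os
import tempfile

import pandas as pd


def load_events_in_order(events_dir, event_file_names, column="Comic Name"):
    """Read one CSV per event name, in the order given.

    Returns (loaded, missing): a dict mapping name -> DataFrame that keeps
    the order of event_file_names, and a list of names with no file on disk.
    """
    loaded = {}
    missing = []
    for name in event_file_names:
        path = os.path.join(events_dir, f"{name}.csv")
        if os.path.exists(path):
            loaded[name] = pd.read_csv(path)[[column]]
        else:
            missing.append(name)
    return loaded, missing


# Demonstration with throwaway files (assumed layout, not the real data).
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Comic Name": ["Batman Vol. 3 #7"]}).to_csv(
        os.path.join(tmp, "Night_of_the_Monster_Men.csv"), index=False
    )
    pd.DataFrame({"Comic Name": ["Superman Vol. 4 #18"]}).to_csv(
        os.path.join(tmp, "Superman_Reborn.csv"), index=False
    )
    wanted = [
        "Night_of_the_Monster_Men",
        "Justice_League_vs_Suicide_Squad",
        "Superman_Reborn",
    ]
    loaded, missing = load_events_in_order(tmp, wanted)

print(list(loaded))
print(missing)
```

Keying the frames by event name, instead of relying on list position, also keeps the pairing correct when a file is missing.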

In [85]:
col_to_keep = "Comic Name"

events_path = "data/DC_Events/"
fileformat = "*.csv"
event_names = []
dfs_events = []

path_events = gb.glob(os.path.join(events_path, fileformat))  # avoids the doubled "/" in the pattern

#for i in events_list.values:
#  print(i)
#print ("File names\n")
#for i in path_events:
#  print(os.path.basename(i))

#print ("\nEvent Column names\n")
if (len(path_events) == len(events_list_file_names.values)):
  for i in range(len(path_events)):
    #print(events_path + events_list.values[i] + fileformat.strip("*"))
    try:
        # Remember: the events are added in the order in which they appear in the original Comic_Order file
        event_address = events_path + events_list_file_names.values[i] + fileformat.strip("*")
        df_toshorten = pd.read_csv(event_address)
        dfs_events.append(df_toshorten[[col_to_keep]])
    except FileNotFoundError:
      # Only a missing file is reported; any other error is allowed to surface
      print("The " + events_list_file_names.values[i] + " event has no corresponding file")

dfs_events

[              Comic Name
 0       Batman Vol. 3 #7
 1    Nightwing Vol. 4 #5
 2  Detective Comics #941
 3       Batman Vol. 3 #8
 4    Nightwing Vol. 4 #6
 5  Detective Comics #942,
                                     Comic Name
 0                      Suicide Squad Vol. 5 #8
 1                    Justice League Vol. 3 #12
 2   Justice League vs. Suicide Squad #1 (2016)
 3          Justice League vs. Suicide Squad #2
 4          Justice League vs. Suicide Squad #3
 5          Justice League vs. Suicide Squad #4
 6                      Suicide Squad Vol. 5 #9
 7          Justice League vs. Suicide Squad #5
 8                    Justice League Vol. 3 #13
 9          Justice League vs. Suicide Squad #6
 10                    Suicide Squad Vol. 5 #10,
             Comic Name
 0  Superman Vol. 4 #18
 1   Action Comics #975
 2        Superwoman #8
 3  Superman Vol. 4 #19
 4   Action Comics #976
 5  Supergirl Vol. 7 #8
 6        Superwoman #9
 7    Trinity Vol. 2 #8
 8   Action Comics #977


### Getting dataframes from each additional file and adding a column with the proper event name

In [104]:
for i in range(len(events_list_entry_names)):
    dfs_events[i]["Event"] = events_list_entry_names.iloc[i]

#Checking one of the dfs for the appropriate column
dfs_events[1].head()

Unnamed: 0,Comic Name,Event
0,Suicide Squad Vol. 5 #8,Justice League vs. Suicide Squad
1,Justice League Vol. 3 #12,Justice League vs. Suicide Squad
2,Justice League vs. Suicide Squad #1 (2016),Justice League vs. Suicide Squad
3,Justice League vs. Suicide Squad #2,Justice League vs. Suicide Squad
4,Justice League vs. Suicide Squad #3,Justice League vs. Suicide Squad


The next step is to merge the dataframes.
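The notebook does not yet say how the merge should work. One possible reading, sketched here as an assumption, is that each `Read ... here.` marker row in the main list gets replaced by the issues of the matching event, in place. The rows in `main` and `event_frames` below are illustrative stand-ins for `df_main` and `dfs_events`, not real data.

```python
import pandas as pd

# Illustrative stand-in for df_main: one event marker between two ordinary issues.
main = pd.DataFrame({
    "Comic Name": [
        "Batman Vol. 3 #6",
        "Read Night of the Monster Men here.",
        "Batman Vol. 3 #9",
    ],
    "Event": ["None", "Night of the Monster Men", "None"],
})

# Illustrative stand-in for the per-event frames, keyed by event name.
event_frames = {
    "Night of the Monster Men": pd.DataFrame({
        "Comic Name": ["Batman Vol. 3 #7", "Nightwing Vol. 4 #5"],
        "Event": ["Night of the Monster Men", "Night of the Monster Men"],
    })
}

# Walk the main list in order, swapping each marker row for its event's issues.
pieces = []
for idx, row in main.iterrows():
    if row["Event"] in event_frames:
        pieces.append(event_frames[row["Event"]])
    else:
        pieces.append(main.loc[[idx]])  # double brackets keep a one-row DataFrame

merged = pd.concat(pieces, ignore_index=True)
print(merged)
```

Marker rows whose event has no frame are kept as they are, so nothing is silently dropped.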

-----
## Removal of character name redundancies (WIP)

In [25]:
all_heroes = df_main['Character']
for i in all_heroes:
    if "JOKER" in i:
        print (i)

# Joker special

#for i in all_heroes:
#    if "JOKER" in i:

df_main.loc[df_main['Character'] == "THE JOKER"]

THE JOKER
THE JOKER 80TH ANNIVERSARY 100-PAGE SUPER SPECTACULAR


Unnamed: 0,Comic Name,Character,Event
1943,The Joker: Year of the Villain #1 (2019),THE JOKER,


-----

### The Complications of Comic Book Names

We have to deal with cases where an entry has multiple "protagonists" by creating an extra column that flags a possible "collab" title and identifies the secondary character. The criteria are separators such as "/", "AND", "&", "VS.", etc.
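A minimal sketch of that split, assuming the 'Character' values are already upper-cased as above and that the first separator found divides the primary from the secondary character. `split_collab` and `SEPARATOR` are hypothetical names, and the sample values are illustrative.

```python
import re

import pandas as pd

# Separators: "/" with optional spaces, or AND / & / VS / VS. surrounded by
# whitespace (so names that merely contain those letters are not split).
SEPARATOR = re.compile(r"\s*/\s*|\s+(?:AND|&|VS\.?)\s+")


def split_collab(character):
    """Return (primary, secondary); secondary is None when no separator is found."""
    parts = SEPARATOR.split(character, maxsplit=1)
    primary = parts[0]
    secondary = parts[1] if len(parts) > 1 else None
    return primary, secondary


# Illustrative values in the same upper-case form as the 'Character' column.
sample = pd.DataFrame({
    "Character": [
        "JUSTICE LEAGUE VS. SUICIDE SQUAD",
        "BATMAN/SUPERMAN",
        "NIGHTWING",
    ]
})

pairs = [split_collab(c) for c in sample["Character"]]
sample["Primary"] = [primary for primary, _ in pairs]
sample["Secondary"] = [secondary for _, secondary in pairs]
print(sample)
```

A purely textual rule like this will also split titles where "AND" is part of a single team or book name, so the resulting column is best treated as a flag for manual review.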