# Completing ficDTB & Filling in All Fic Info
- <b>Name:</b> Sofia Kobayashi
- <b>Date:</b> 02/01/2023
- <b>Notebook Stage:</b> 1.1 (using initial data collected to complete ficDTB)
- <b>Description:</b> Combining ALL fic data (mostly adding text-data fics & fics from FFN.net & AO3 sub/bookmarks list)

### **<u>Table of Contents</u>**

1. **[Imports](#imports)**

1. **[Filling AO3 ficDTB (from fic-3)](#sec2)**


1. **[Adding other url-data fics to ficDTB](#otherUrl_1)**
    1. [Adding info from otherDTB](#otherUrl_1_1)
        - Add fics to ficDTB from otherDTB (ffn.net fics & ao3 external fics)
        - Add users to userDTB from otherDTB (ffn.net users)
            - **Other CHECKPOINT NUM** - DESCRIPTION
    1. [Adding FFN.net fics & users](#otherUrl_1_2)
        - **User CHECKPOINT NUM** - DESCRIPTION
        - **Fic CHECKPOINT NUM** - DESCRIPTION
    1. [Adding AO3 fics](#otherUrl_1_3)
1. **[Adding text-data fics (from previous versions) to ficDTB](#TAG)**
1. **[TITLE](#TAG)**
1. **[TITLE](#TAG)**
    - **Fic CHECKPOINT NUM** - DESCRIPTION

### Plans for this Notebook 
- bring over excess stuff from dtbs_from_prev

- **add all fic urls from the other DTBs to ficDTB (location_found will just be different)**
    - add look_into from others -> look_into dtb (no)

- Grab fics from ffn.net (Follows & Favorites) & ao3 (bookmarks, subs, for later)
    - add to ficDTB
- populate ficDTB
- add text fics
    - add date_added & versionNum from v1-6 if possible to text fics, which are then added
- de-dup (same fic-diff places, etc.)
    - start making search/matching functions for text, ffn.net, version matching
- any other cleaning? DTB separation?
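
One of the planned steps above is back-filling `date_added` for text fics from their version number. A minimal sketch of how that could look, assuming each row carries a `version` column and that a version-to-date mapping is available. The helper name `fill_date_added`, the toy frame, and the two mapping entries are illustrative assumptions, not the notebook's final code:

```python
import pandas as pd

# hypothetical per-version fallback dates (illustrative entries only)
default_dates = {1: "2017-05-31", 2: "2018-05-30"}


def fill_date_added(df, defaults):
    """Fill missing `date_added` values from each row's `version` number."""
    out = df.copy()
    # look up the fallback date for every row, then use it only where date_added is missing
    fallback = pd.to_datetime(out["version"].map(defaults))
    out["date_added"] = out["date_added"].fillna(fallback)
    return out


# toy data: one fic with a known date, two without
toy = pd.DataFrame(
    {
        "title": ["fic a", "fic b", "fic c"],
        "version": [1, 2, 2],
        "date_added": pd.to_datetime(["2017-06-15", None, None]),
    }
)
filled = fill_date_added(toy, default_dates)
```

Rows that already have a date keep it; only the gaps take the version default.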


### Plans for next notebooks
- next notebook is cleaning up all non-fic dtbs

### AO3 Work Methods
bookmark, collect, comment, delete_bookmark, download, download_to_file, get, get_comments, get_images, is_subscribed, leave_kudos, load_chapters, request, reload, set_session, str_format, subscribe, unsubscribe


### AO3 Stats
t2 = ['authors',
 'bookmarks',
 'categories',
 'chapters',
 'characters',
 'collections',
 'comments',
 'complete',
 'date_edited',
 'date_published',
 'date_updated',
 'end_notes',
 'expected_chapters',
 'fandoms',
 'hits',
 'id',
 'is_subscribed',
 'kudos',
 'language',
 'loaded',
 'metadata',
 'nchapters',
 'oneshot',
 'rating',
 'relationships',
 'restricted',
 'series',
 'start_notes',
 'status',
 'summary',
 'tags',
 'text',
 'title',
 'url',
 'words']

<a id="imports"></a>
## 1. Imports

In [None]:
%run helpers.ipynb
import AO3

import json
import pandas as pd
import numpy as np
from datetime import datetime

from bs4 import BeautifulSoup
import requests
import difflib
pd.set_option('display.max_columns', None)

<a id="sec2"></a>
## 2. Filling AO3 ficDTB (from fic-3)

<a id="sec2.1"></a>
### 2.1 Defining ficDTB-filling Functions

In [None]:
fic_columns = {'location_found', 'dtb_type', 'smk_source', 'version',
       'date_added', 'date_last_viewed', 'url_type', 'id', 'url',
       'recced_from_collections', 'url_psueds', 'fic_obj', 'date_obj_updated',
       'is_subscribed', 'cur_chapter', 'my_notes', 'title', 'authors', 'fandoms',
       'rating', 'categories', 'warnings', 'relationships', 'characters',
       'tags', 'series', 'collections', 'words', 'nchapters',
       'expected_chapters', 'complete', 'date_published', 'date_updated',
       'date_edited', 'language', 'is_restricted', 'metadata', 'summary',
       'start_notes', 'end_notes', 'chapters', 'text', 'kudos', 'comments',
       'bookmarks', 'hits'}

In [None]:
def is_fic_row_complete(row):
    """
    Takes one row of a fic dtb.
    Returns a Boolean on whether or not the given row is completely filled.
    """
    new_row = row.copy().drop(columns=['dtb_type','date_last_viewed','cur_chapter', 'my_notes', 'metadata'])
    return len(np.where(pd.isnull(new_row))[1]) == 0

# is_fic_row_complete(ficDTB.iloc[[0]])

In [None]:
def fill_fic_ao3_info(fic_id, session, report=False):
    """
    Takes a fic id (int).
    Returns a 1-row pandas DF populated by ao3 data from the given fic id.
    """
    # initialize temp holder & fic obj
    single_fic = pd.DataFrame({'id': [fic_id]})
    fic = AO3.Work(fic_id, session=session)

    # write report info & report    
    title = fic.title
    authors = json.dumps([author.username for author in fic.authors])
    fandoms = json.dumps(fic.fandoms)

    single_fic['title'] = title
    single_fic['authors'] = authors
    single_fic['fandoms'] = fandoms
    if report: print(f"- Wrote '{title}' by {authors}\nin {fandoms}")

    # Write remaining info:
    # metadata
    single_fic['fic_obj'] = fic
    single_fic['date_obj_updated'] = datetime.now()
#     single_fic['metadata'] = json.dumps(fic.metadata)
    single_fic['is_subscribed'] = fic.is_subscribed
    single_fic['is_restricted'] = fic.restricted
    
    # tags
    single_fic['rating'] = fic.rating
    single_fic['categories'] = json.dumps(fic.categories)
    single_fic['warnings'] = json.dumps(fic.warnings)
    single_fic['relationships'] = json.dumps(fic.relationships)
    single_fic['characters'] = json.dumps(fic.characters)
    single_fic['tags'] = json.dumps(fic.tags)
    single_fic['series'] = json.dumps([series.id for series in fic.series])
    single_fic['collections'] = json.dumps(fic.collections)
    
    # text
    single_fic['summary'] = fic.summary
    single_fic['start_notes'] = fic.start_notes
    single_fic['end_notes'] = fic.end_notes
    single_fic['chapters'] = json.dumps([(chap.title, chap.id) for chap in fic.chapters])
    single_fic['text'] = fic.text
    
    # stats
    single_fic['words'] = fic.words
    single_fic['kudos'] = fic.kudos
    single_fic['comments'] = fic.comments
    single_fic['bookmarks'] = fic.bookmarks
    single_fic['hits'] = fic.hits
    single_fic['nchapters'] = fic.nchapters
    single_fic['expected_chapters'] = fic.expected_chapters
    single_fic['complete'] = fic.complete
    
    # dates
    single_fic['date_published'] = fic.date_published
    
    single_fic['date_updated'] = fic.date_updated
    single_fic['date_edited'] = fic.date_edited
    single_fic['language'] = fic.language

    # not found
    single_fic['not_found'] = False
    
    return single_fic

# fill_fic_ao3_info(16616795, ss1)

In [None]:
def fill_fic_dtb(initial_fic_dtb, session, done_set, update=False, report=False):
    """
    Takes a ficDTB (pandas DataFrame, post fic-3), an AO3 session, a set of already-processed row indices,
        a Boolean to put in 'update' mode (aka update all rows regardless of whether they're already complete)
        and a Boolean to print a report.
    Modifies the given ficDTB in place by filling it with ao3 information (finished indices are added to done_set).
    Returns nothing.
    """
    # find the largest row index (used for the progress percentage)
    total = max(initial_fic_dtb.index)
    
    # ensure initial_fic_dtb has all necessary columns
    for col_name in fic_columns:
        if col_name not in initial_fic_dtb.columns:
            initial_fic_dtb[col_name] = np.nan
        if 'date' in col_name:
            initial_fic_dtb[col_name] = initial_fic_dtb[col_name].astype('datetime64[ns]')
    
    # find remaining indices (every row not already in done_set)
    remaining_indices = set(initial_fic_dtb.index) - done_set
    
    # fill all fics/rows in initial_fic_dtb
    for ind in remaining_indices: 
        try: 
            # when not using full-report, alert at every 100 fics
            if not report:
                if ind%100 == 0: 
                    print(f'- {ind}! (printed every 100)')

            # if fic/row not entirely filled in OR we're updating the dtb 
            if ind not in done_set and ((not is_fic_row_complete(initial_fic_dtb.iloc[[ind]])) or update):
                # get fic id
                fic_id = initial_fic_dtb.at[ind, "id"]
                if report: print(f"{ind}: [{(ind/total)*100:.2f}%] Filling for [{fic_id}]")

                # get ao3 info
                fic_ao3_info = fill_fic_ao3_info(fic_id, session, report=False)

                # update initial_fic_dtb with fic_ao3_info (new info will overwrite old info)
                fic_ao3_info.index = [ind]
                initial_fic_dtb.update(fic_ao3_info, join='left', overwrite=True)
                
            # if fic/row is satisfactory
            else: 
                if report: print(f"{ind}: .")
                    
            # add index to done set
            done_set.add(ind)

        # if something goes wrong 
        except Exception as e:
            initial_fic_dtb.at[ind, "not_found"] = e
            # guard against exceptions raised without a message
            err_msg = e.args[0] if e.args else ""
            if err_msg == 'Cannot find work':
                print(f"-- ERROR, {e}: {initial_fic_dtb.at[ind, 'id']}")
                done_set.add(ind) # add index to done set
            elif err_msg == "'NoneType' object has no attribute 'text'":
                print(f"-- ERROR, {e}: {initial_fic_dtb.at[ind, 'id']}")
                done_set.add(ind) # add index to done set
#             else: 
#                 raise e
        
        # update temp csv every 20 rows/fics
        if ind%20 == 0: 
            file_name = "temp_fic.csv"
            if report: print(f'Wrote to {file_name}')
            initial_fic_dtb.to_csv(file_name)
        
        

    # Write finished fic DTB to csv
    initial_fic_dtb.to_csv("temp_fic_final.csv")

    print("\nDONE!")


<a id="sec2.2"></a>
### 2.2 Filling ficDTB

In [None]:
ss1=my_session()

In [None]:
# ficDTB = pd.read_csv('data-checkpoints/fic-3-all_02-03-23.csv', index_col=0, parse_dates=['date_obj_updated']) \
#             .drop(columns=['is_missing', 'restricted','notes'])

# done_set= set()
# AO3.utils.limit_requests()

In [None]:
ficDTB.to_csv('fic5.csv')

In [None]:
err = set()
for ind in ficDTB.index:
    e1 = ficDTB.at[ind,'not_found']
    if type(e1) is not bool and type(e1) is not float and e1.args[0] == 'We are being rate-limited. Try again in a while or reduce the number of requests':
        err.add(ind)

In [None]:
fill_fic_dtb(ficDTB, ss1, done_set, update=False, report=True)

<a id="otherUrl_1"></a>
## 2. Adding other url-data fics to ficDTB

In [None]:
# read in most recent ficDTB file
ficDTB = pd.read_csv(
    "data-checkpoints/fic-3-all_02-03-23.csv",
    parse_dates=["date_added", "date_last_viewed"],
    index_col=0,
)
ficDTB.head(2)

In [None]:
# read in most recent authors file
userDTB = pd.read_csv(
    "data-checkpoints/users-0-all_02-26-23.csv", parse_dates=["date_added"], index_col=0
)
userDTB["location_found"] = "ao3"
userDTB.head(3)

<a id="otherUrl_1_1"></a>
### 2A. Adding Info from otherDTB
- **2A.1)** Add fics to ficDTB from otherDTB (ffn.net fics & ao3 external fics)
- **2A.2)** Add users to userDTB from otherDTB (ffn.net users)

In [None]:
# read in most recent others file
othersDTB = pd.read_csv(
    "data-checkpoints/others-0-all_01-03-23.csv",
    parse_dates=["date_added", "date_last_viewed"],
    index_col=0,
)

# get AO3 external fics & FFN.net urls
ffn_net_fics = othersDTB.query("url.str.contains('archiveofourown.org/') or \
                                url.str.contains('fanfiction.net/s/')")

# get ffn.net users
ffn_net_users = othersDTB.query("url.str.contains('fanfiction.net/u/')")

# get remaining (all non-fanfiction.net & ao3 urls)
remaining_others = othersDTB.query("~(url.str.contains('fanfiction.net/u/') or \
                                      url.str.contains('archiveofourown.org/') or \
                                      url.str.contains('fanfiction.net/s/'))")

In [None]:
# remaining_others.to_csv('data-checkpoints/others-1-all_03-09-23.csv')

### CHECKPOINT! others-1-all_03-09-23.csv (remaining non-ffn.net and non-ao3 links)

In [None]:
def get_fic_url_type(url):
    if 'archiveofourown.org/external_works/' in url: return 'ao3_external_work'
    elif 'fanfiction.net/s/' in url: return 'work'
    else: return np.nan
    
def get_location_found(url):
    if 'archiveofourown.org/' in url: return 'ao3'
    elif 'fanfiction.net/' in url: return 'ffn.net'
    else: return np.nan
    
ffn_net_fics['location_found'] = ffn_net_fics['url'].apply(get_location_found)
ffn_net_fics['url_type'] = ffn_net_fics['url'].apply(get_fic_url_type)

In [None]:
# add new fic urls to ficDTB
ficDTB = (
    pd.concat([ficDTB, ffn_net_fics])
    .reset_index()
    .drop_duplicates(subset=["url"])
    .drop(columns=["index"])
)

ficDTB.tail(10)

In [None]:
ffn_net_users['location_found'] = ffn_net_users['url'].apply(get_location_found)

In [None]:
# add new user urls to userDTB
userDTB = (
    pd.concat([userDTB, ffn_net_users])
    .reset_index()
    .drop_duplicates(subset=["url"])
    .drop(columns=["index"])
)

userDTB.tail(10)

In [None]:
# 2A.2) Add users to userDTB from otherDTB (ffn.net users)

# get ffn.net user urls
new_users = othersDTB.query('url.str.contains("fanfiction.net/u/")')

# remove ffn.net users from othersDTB
othersDTB = othersDTB.query('~(url.str.contains("fanfiction.net/u/"))')

# add new user urls to userDTB
userDTB = (
    pd.concat([userDTB, new_users])
    .reset_index()
    .drop_duplicates(subset=["url"])
    .drop(columns=["index"])
)

userDTB

#### Other CHECKPOINT! others-1-all_02-07-23.csv (removed all fanfiction.net & ao3 (external) works)

In [None]:
othersDTB = othersDTB.reset_index(drop=True)
# othersDTB.to_csv("data-checkpoints/others-1-all_02-07-23.csv")

<a id="otherUrl_1_2"></a>
### 2B. Adding FFN.net fics & users
- manually collected authors & fics from my 'Alerts' & 'Favorites' sections on fanfiction.net
- made no distinction between the two categories; marked all fics as read
    - if there were any duplicate authors or fics in both Alerts & Favorites, chose earliest add date
    
- **2B.1)** Add users to userDTB from ffn.net Alerts & Favorites
- **2B.2)** Add fics to ficDTB from ffn.net Alerts & Favorites
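
The "chose earliest add date" rule above was applied by hand while collecting the lists. For reference, a minimal pandas sketch of the same rule, assuming a `url` column identifies a fic or author and `date_added` holds the add date. The helper name `keep_earliest` and the toy rows are hypothetical:

```python
import pandas as pd


def keep_earliest(df, key="url", date_col="date_added"):
    """Keep one row per `key`, choosing the row with the earliest `date_col`."""
    return (
        df.sort_values(date_col)  # earliest dates first
        .drop_duplicates(subset=[key], keep="first")
        .reset_index(drop=True)
    )


# hypothetical rows: the same fic appears in both Alerts and Favorites
toy = pd.DataFrame(
    {
        "url": ["fanfiction.net/s/1", "fanfiction.net/s/2", "fanfiction.net/s/1"],
        "date_added": pd.to_datetime(["2020-05-01", "2019-01-15", "2018-03-10"]),
    }
)
deduped = keep_earliest(toy)
```

Sorting before `drop_duplicates` is what makes `keep="first"` mean "earliest" rather than "first seen".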

### 2B.1) Add users to userDTB from ffn.net Alerts & Favorites

In [None]:
# read in authors from fanfiction.net account
ffn_authors = pd.read_csv(
    "urlsOutput/ffn-net_authors_02-04-23.csv", index_col=0, parse_dates=["date_added"]
)

# add ffn_authors to userDTB
userDTB = pd.concat([userDTB, ffn_authors]).reset_index().drop(columns=["index"])
userDTB.head(2)

#### User CHECKPOINT! users-1-all_02-07-23.csv (added ffn.net users)

In [None]:
def populateFicDTB(useAccount, debug=False):
    """
    Takes a boolean (use account, i.e. able to access restricted works?) and a debug flag.
    Fills any incomplete row of ficDTB with: title, authors, fandoms, Work object, and the date these were last updated.
    Requires that every row's url_type & id be filled in.
    Returns nothing.
    """
    total = max(ficDTB.index)

    for ind in ficDTB.index:
        # get necessary info from DTB
        title = ficDTB.at[ind, "title"]
        authors = ficDTB.at[ind, "authors"]
        fandoms = ficDTB.at[ind, "fandoms"]
        fic_obj = ficDTB.at[ind, "fic_obj"]
        obj_date = ficDTB.at[ind, "date_obj_updated"]
        url_type = ficDTB.at[ind, "url_type"]

        # if any col is empty
        if (
            pd.isnull(title)
            or pd.isnull(authors)
            or pd.isnull(fandoms)
            or pd.isnull(fic_obj)
            or pd.isnull(obj_date)
        ):
            try:
                # get fic id
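                if url_type == "chapters":
                    print("-- chapter!")
                    # chapter urls don't carry the work id, so fetch the page and read it from there
                    url = ficDTB.at[ind, "url"]
                    html_text = requests.get(url).text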
                    soup = BeautifulSoup(html_text, "lxml")
                    wId = soup.find(
                        "input", attrs={"id": "kudo_commentable_id", "type": "hidden"}
                    )["value"]
                else:
                    wId = ficDTB.at[ind, "id"]
                print(f"- [{(ind/total)*100:.2f}%, #{ind}] Filling for [{wId}]")

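                # initialize Work obj
                # NOTE: `session` is not defined in this notebook; this assumes a global
                # AO3 session with that name exists (e.g. from helpers.ipynb)
                if useAccount:
                    work = AO3.Work(wId, session=session)
                else:
                    work = AO3.Work(wId)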

                # write new info into DTB
                newTitle = work.title
                ficDTB.at[ind, "title"] = newTitle
                if debug:
                    print(f"- Wrote '{newTitle}'")

                newAuthors = json.dumps([x.username for x in work.authors])
                ficDTB.at[ind, "authors"] = newAuthors
                if debug:
                    print(f"- Wrote '{newAuthors}'")

                newFandoms = json.dumps(work.fandoms)
                ficDTB.at[ind, "fandoms"] = newFandoms
                if debug:
                    print(f"- Wrote '{newFandoms}'")

                ficDTB.at[ind, "fic_obj"] = work
                now = datetime.now()
                ficDTB.at[ind, "date_obj_updated"] = now
                if debug:
                    print(f"- Wrote fic obj at: {now.strftime('%m-%d-%y %H:%M:%S')}")

            # if Error
            except Exception as e:
                print(f"-- ERROR: {repr(e)} - - - {ficDTB.at[ind, 'id']}")
        else:
            print(".")

    print("\nDONE!")

In [None]:
userDTB = userDTB.reset_index(drop=True)
# userDTB.to_csv("data-checkpoints/users-1-all_02-07-23.csv")

### 2B.2) Add fics to ficDTB from ffn.net Alerts & Favorites

In [None]:
# read in fics from fanfiction.net account
ffn_fics = pd.read_csv(
    "urlsOutput/ffn-net_fics_02-04-23.csv",
    index_col=0,
    parse_dates=["date_added", "date_updated"],
)

def temp(cell):
    """Turn a comma-separated string into a JSON list of stripped names."""
    return json.dumps([name.strip() for name in cell.split(",")])

ffn_fics["authors"] = ffn_fics["author"].apply(temp)
ffn_fics["fandoms"] = ffn_fics["fandoms"].apply(temp)
ffn_fics = ffn_fics.drop(columns=["author"])

# add ffn_fics to ficDTB
ficDTB = pd.concat([ficDTB, ffn_fics]).reset_index(drop=True)
ficDTB

#### Fic CHECKPOINT! fics-4-all_02-07-23.csv (added ffn.net fics from my ffn.net account & othersDTB)

In [None]:
# ficDTB.to_csv("data-checkpoints/fics-4-all_02-07-23.csv")


In [None]:
df = pd.read_excel("raw_data_for_v9.xlsx", sheet_name=None)
df.keys()

<a id="otherUrl_1_3"></a>
### 2C. Adding AO3 fics
- make DTB of all AO3 fic-bookmarks, fic-subs, user-subs, series-subs
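
A rough sketch of how such a combined DTB could be assembled once the URLs for each list have been collected. It assumes `smk_source` is the column that records which list a URL came from; the helper name `combine_sources`, the source labels, and the URLs are all hypothetical:

```python
import pandas as pd


def combine_sources(sources):
    """Stack per-source URL lists into one frame with an `smk_source` label."""
    frames = [
        pd.DataFrame({"url": urls, "smk_source": label})
        for label, urls in sources.items()
    ]
    return pd.concat(frames, ignore_index=True)


# hypothetical URLs and source labels
sources = {
    "bookmark": ["archiveofourown.org/works/1", "archiveofourown.org/works/2"],
    "fic_sub": ["archiveofourown.org/works/2"],
    "series_sub": ["archiveofourown.org/series/9"],
}
ao3_dtb = combine_sources(sources)
```

A work that is both bookmarked and subscribed shows up twice here, once per source; whether to merge those rows is a separate de-dup decision.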

In [None]:
ficDTB.smk_source.value_counts()

In [None]:
ficDTB = pd.read_csv('data-checkpoints/fics-4-all_02-07-23.csv', index_col=0)
ficDTB

<a id="TAG"></a>
## Random Work

In [None]:
# fill new info cols
# populateFicDTB(True)

In [None]:
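def getRating(x):
    """Takes an AO3 Work object (or null). Returns a short rating code."""
    if pd.isnull(x):
        return x

    # strip stray whitespace so the comparisons below match
    rating = x.rating.strip()
    if rating == "Teen And Up Audiences":
        return "T"
    elif rating == "Explicit":
        return "E"
    elif rating == "Mature":
        return "M"
    elif rating == "General Audiences":
        return "G"
    elif rating == "Not Rated":
        return "--"
    # any unrecognized rating falls through and returns None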

## **Start making search/matching functions for text, ffn.net, version matching**
- match by title:
    - get title of unknown fic
    - search ficDTB for matching titles
        - if no match: add row to ficDTB with as much info as possible
        - if match: update row with incoming data (version, smk_source, dtb_type, date_added)

- OVERALL SECTION
    - read json file containing unsorted txt fics
    - match them one by one
    - remove url once matched
    - when done, overwrite previous file
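
The match-by-title step above could be sketched with the standard library's `difflib`. This is only an illustration under assumptions: the helper name `find_title_match`, the 0.6 cutoff, and the sample titles are hypothetical, and comparing case-insensitively is a choice made for the sketch:

```python
import difflib


def find_title_match(title, known_titles, cutoff=0.6):
    """Return the closest known title (ignoring case), or None if nothing is close enough."""
    # map lowercase titles back to their original spelling
    lowered = {t.lower(): t for t in known_titles}
    matches = difflib.get_close_matches(
        title.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


# hypothetical titles standing in for the titles already in ficDTB
known = ["The Long Road Home", "A Study in Emerald", "Stars Over Tatooine"]
match = find_title_match("the long road home", known)  # same title, different case
no_match = find_title_match("Untitled 42", known)      # nothing similar in the list
```

A `None` result corresponds to the "no match" branch (add a new row); any other result is the title of the row to update.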

In [None]:
version_default_dates = {
    1: "05-31-2017 00:00:01",
    2: "05-30-2018 00:00:01",
    3: "08-09-2018 00:00:01",
    4: "08-25-2018 00:00:01",
    5: "01-01-2019 00:00:01",
    6: "04-09-2020 00:00:01",
    7: "05-01-2021 00:00:01",
    8: "06-04-2022 00:00:01",
    9: "10-24-2022 00:00:01",
}

In [None]:
# early.reset_index(drop=True).to_csv("urlsOutput/v1-6_txt/all_early.csv")

In [None]:
fic_titles = [x.lower() for x in ficDTB_titles]
early_tit = [x.lower() for x in early_titles]

for ind in early.index:
    title = early.at[ind, "title"].lower()
    if title in fic_titles:
        early = early.drop(ind)

In [None]:
txt_fics = pd.DataFrame(columns=early.columns.to_list())
txt_fics["date_added"] = pd.to_datetime(txt_fics["date_added"])

txt_fics

In [None]:
# read in all_early CSV
early = pd.read_csv(
    "urlsOutput/v1-6_txt/all_early.csv", index_col=0, parse_dates=["date_added"]
).reset_index(drop=True)

early_titles = early.query("work_type == 'fic'").title.to_list()
ficDTB_titles = ficDTB.loc[pd.isnull(ficDTB["title"]) == False].title.to_list()

for ind in early.index:
    title = str(early.at[ind, "title"])

    #     from_early = difflib.get_close_matches(title, early_titles, n=5, cutoff=0.6)
    #     if title in from_early: from_early.remove(title)

    from_dtb = difflib.get_close_matches(title, ficDTB_titles, n=3, cutoff=0.6)

    if len(from_dtb) == 0:
        # add text fic to ficDTB
        txt_fics = pd.concat([txt_fics, early.iloc[[ind]]])

    print(f"NEW: {title}")
    [print(f"- {x}") for x in from_dtb]
    #     if len(from_early) > 0: print(f'--- {from_early}')
    print()

In [None]:
import pandas as pd
# t1 = pd.read_excel('testing_data/raw_data_for_v9.xlsx', sheet_name='fandom_names')
# fandom_names = pd.read_csv("testing_data/fandom_names.csv")
v6_fic_text_ffn = pd.read_excel('testing_data/raw_data_for_v9.xlsx', sheet_name="v6_fic_text_ffn-dtb")

In [None]:
FANDOM_NAMES = {('1/2 prince'): '1/2_prince',
                ('avatar-'): 'avatar',
                ('attack on titan'): 'attack on titan',
               ("assassin's creed"): 'assassins_creed',
               ('avatar: the last airbender'): 'atla',
               ('miraculous ladybug'): 'miraculous_ladybug',
               ('big hero 6', 'bh6'): 'big_hero_6',
               ('black panther'): 'black_panther',
               ('katekyo hitman reborn', 'khr'): 'katekyo_hitman_reborn',
               ('books of the raksura'): 'books_of_the_raksura',
                ('brooklyn nine-nine', 'b99'): 'brooklyn_99',
                ('captain america'): 'captain_america',
                ('captive prince'): 'captive_prince',
                ('chronicles of narnia'): 'chronicles_of_narnia',
                ('code geass'): 'code_geass',
                ('criminal minds'): 'criminal_minds',
                ('danny phantom'): 'danny_phantom',
                ('dark angel'): 'dark angel',
                ('detroit: become human'): 'detroit_become_human',
                ('disney', 'greek & roman myths',
                    'beauty & the beast',
                    'robin hood', 'rapunzel',
                    'pocahontas','sleeping beauty',
                    'the little mermaid','the secret garden',
                    'aladdin','cinderella','maid maleen',
                    'mulan'): 'folklore',
                ('eyeshield 21'): 'eyeshield_21',
                ('fairy tail'): 'fairy_tail',
                ('x-men'): 'xmen',
                ('fast & furious'): 'fast&furious',
                ('final fantasy vii'): 'final_fantasy_vii',
                ('final fantasy viii'): 'final_fantasy_viii',
                ('final fantasy xv'): 'final_fantasy_xv',
                ('fullmetal alchemist'): 'fullmental_alchemist',
                ('game of thrones', 'got'): 'game_of_thrones',
                ('good omens'): 'good_omens',
                ('gravity falls'): 'gravity_falls',
                ('guardians of the galaxy','gotg'): 'guardians_of_the_galaxy',
                ('gundam wing/ac'): 'gundam_wing/ac',
                ('john wick'): 'john_wick',
                ('joy of life'): 'joy_of_life',
                ('harry potter', "hp"): 'harry_potter',
                ('highschool of the dead'): 'highschool_of_the_dead',
                ('how to train your dragon'): 'httyd',
                ('hunger games'): 'hunger_games',
                ('james bond'): 'james_bond',
                ('jurassic park'): 'jurassic_park',
                ('k anime'): 'k_anime',
                ('kingsmen'): 'kingsman',
                ('kuroko no basuke', 'knb'): 'kuroko_no_basuke',
                ('kung fu panda'): 'kung_fu_panda',
                ('lotr'): 'lord_of_the_rings',
                ('ouran high school host club', 'ohshc'): 'ouran_hshc',
                ('gdc','modao zushi'): 'mdzs',
                ('magi!!! labyrinth of magic'): 'magi_lom',
                ('miraculous ladybug'): 'miraculous_ladybug',
                ('monster hunter'): 'monster_hunter',
                ('moon knight'): 'moon_knight',
                ('one piece'): 'one_piece',
                ('pacific rim'): 'pacific_rim',
                ('percy jackson and the olympians'): 'percy_jackson_olympians',
                ('person of interest'): 'person_of_interest',
                ('phineas and ferb'): 'phineas_and_ferb',
                ('pkmn: sword and shield'): 'pkmn_sword&shield',
                ('pokémon','pkmn'): 'pokemon',
                ('prince of tennis'): 'prince_of_tennis',
                ('princess kaguya'): 'princess_kaguya',
                ('reincarnated as a sword'): 'reincarnated_as_a_sword',
                ('rise of the guardians'): 'rise_of_the_guardians',
                ('solo levelling'): 'solo_levelling',
                ('star wars','sw'): '1234SW',
                ('star wars: the clone wars','star wars: clone wars',
                     'star wars cw','sw: the clone wars'): 'star_wars_cw',
                ('stargate-'): 'stargate',
                ('stargate atlantis'): 'stargate_atlantis',
                ('spn'): 'supernatural',
                ('stranger things'): 'stranger_things',
                ('scum villain', "scum villain's self-saving system"): 'svsss',
                ('sword art online'): 'sword_art_online',
                ('teen wolf'): 'teen_wolf',
                ('the 100'): 'the_100',
                ('the croods'): 'the_croods',
                ('the flash'): 'the_flash',
                ('the hobbit'): 'the_hobbit',
                ('the last of us'): 'the_last_of_us',
                ('the song of achillles'): 'the_song_of_achillles',
                ('the witcher'): 'the_witcher',
                ('tiger & bunny'): 'tiger&bunny',
                ('tokyo ghoul'): 'tokyo_ghoul',
                ('umbrella academy'): 'umbrella_academy',
                ('vampire hunter d'): 'vampire_hunter_d',
                ('yona of the dawn'): 'yona_of_the_dawn',
                ('young hercules'): 'young_hercules',
                ('young justice'): 'young_justice',
                ('yuuri on ice', 'yoi','yuuri on ice!!!'): 'yuuri_on_ice',
                ('gotham'): 'gotham',
                ('daredevil'): 'daredevil',
                ('temeraire'): 'temeraire',
                ('transformers'): 'transformers',
                ('leverage'): 'leverage',
                ('hamilton'): 'hamilton',
                ('fbawtft', 'fantastic beasts and where to find them'): 'fbawtft',
                ('spiderman'): 'spiderman',
                ('avengers'): 'avengers',
                ('smallville'): 'smallville',
                ('twilight'): 'twilight',
                ('thor'): 'thor',
                ('arrow'): 'arrow',
                ('ncis'): 'ncis',
                ('naruto'): 'naruto',
                ('merlin'): 'merlin',
                ('travelers'): 'travelers',
                ('left4dead'): 'left4dead',
                ('megamind'): 'megamind',
                ('rwby'): 'rwby',
                ('minecraft'): 'minecraft',
                ('original'): 'original_work',
                ('bts'): 'bts',
                ('bleach'): 'bleach',
                ('batman'): 'batman',
                ('torchwood'): 'torchwood',
                ('sherlock'): 'sherlock',
                ('descendants 2015'): 'descendants',
                ('bnha', 'mha','boku no hero academia'): 'bnha',
                ('multiple'): 'multiple_fandoms',
               }


def get_clean_fandom_name(unclean_fandom_name) -> str:
    """
    Takes str unclean fandom name.
    Returns str clean fandom name (if found), else returns None.
    """
    # if it's already a clean fandom
    if unclean_fandom_name in FANDOM_NAMES.values():
        return unclean_fandom_name
    
    # else, search thru all aliases
    # (single-alias keys like ('naruto') are plain strings, so `in` is a substring test for them)
    for key in FANDOM_NAMES.keys():
        if unclean_fandom_name in key:
            return FANDOM_NAMES[key]


def fandom_report(dtb, fic_col, verbose=False) -> str:
    """
    TAKES a dtb - read from csv/xlsx
        str - column name of the fandoms
        boolean - print report
    PURPOSE: Check all fandoms in 'fic_col'. If verbose, print a report: 
        num rows, known fandoms, & unknown fandoms
        list of error index nums
        list of unknown fandoms
    RETURNS a str status: "Ideal" (all fandoms known & clean), "Good" (all known, some unclean),
        or "Bad" (errors or unknown fandoms)
    """
    known_fandoms = []
    unknown_fandoms = []
    clean_fandoms = []
    error_ind = set()
    num_rows = len(dtb)
    
    # for each fandom row
    for ind in dtb.index:
        fandom_str = dtb.loc[ind].loc[fic_col]
        
        # if fandom cell empty
        if pd.isnull(fandom_str):
            error_ind.add(ind)
            continue
        
        # clean fandom string
        fandom_list = fandom_str.replace('*','') \
                            .replace(' x ',',') \
                            .split(',')
        
        # for each fandom in fic
        for old_fandom in fandom_list:
            
            # get clean fandom
            if old_fandom in FANDOM_NAMES.values():
                clean_fandoms.append(old_fandom)
                clean_fandom = old_fandom
            else:
                clean_fandom = get_clean_fandom_name(old_fandom)
            
            # if no clean fandom found
            if not clean_fandom:
                unknown_fandoms.append((ind, old_fandom))
            else:
                known_fandoms.append(clean_fandom)
    
    # print report
    unknown_fandom_names = []
    if unknown_fandoms:
        unknown_fandom_names = list(zip(*unknown_fandoms))[1]
    
    num_unclean = len(set(known_fandoms))-len(set(clean_fandoms))
    if verbose:
        print(f'- --- FANDOM REPORT --- -')
        print(f'- # rows/fandoms:           {num_rows}')
        print(f'- # errors (row num):       {len(error_ind)}')
        [print('  ', err) for err in error_ind]
        print(f'- # unique known fandoms:   {len(set(known_fandoms))} (total), \
            {len(set(clean_fandoms))} (clean), {num_unclean} (unclean)')
        print(f'- # unique unknown fandoms: {len(set(unknown_fandom_names))}')
        [print('  ', fname) for fname in set(unknown_fandom_names)]

    if len(error_ind) == 0 and len(set(unknown_fandom_names)) == 0:
        if num_unclean == 0:
            return f"Ideal - all fandoms known & clean"
        return f"Good - all fandoms known, but {num_unclean} unclean"
    return f"Bad - {len(error_ind)} errors and {len(set(unknown_fandom_names))} unknown fandoms"


In [None]:
fandom_report(v6_fic_text_ffn, 'fic_fandom', True)

In [None]:
"".join(['1','2'])

In [None]:
v6_test = pd.read_csv("testing_data/v6_test_3.csv")

In [None]:
def clean_fandom_names(dtb, fandom_col_name, verbose=False):
    """
    Takes a dtb and str name of the fandom column.
    Reads the fandoms in the given dtb -> maps all fandom names to the consistent ones in FANDOM_NAMES.
    Currently only prints each cleaned fandom string (writing back to the dtb is still commented out).
    Returns nothing.
    """
    for ind in dtb.index:
        # get & clean fandom string
        fandom_str = dtb.at[ind, fandom_col_name]
        fandom_list = fandom_str.replace('*','') \
                            .replace(' x ',',') \
                            .split(',')
        
        # for each fandom in fic
        clean_fandoms = []
        for old_fandom in fandom_list:
            # keep the original name if no clean name is found (",".join fails on None)
            clean_fandom = get_clean_fandom_name(old_fandom) or old_fandom
            clean_fandoms.append(clean_fandom)
        
        res = ",".join(clean_fandoms)
        print(res)
#         # place clean string back into dtb
#         dtb.at[ind, fandom_col_name]

clean_fandom_names(v6_test, 'fic_fandom', False)

In [None]:
test_cell = v6_test.at[5,'fic_fandom']
test_cell

In [None]:
v6_test.at[5,'fic_fandom'] = 'assassins_creed,star_wars_cw'

In [None]:
v6_test.to_csv('testing_data/v6_test_3.csv')

In [None]:
t6 = pd.read_excel("testing_data/raw_data_for_v9.xlsx", sheet_name="v3_fic_text_fandom")

In [None]:
t6