# Completing ficDTB & Filling in All Fic Info
- <b>Name:</b> Sofia Kobayashi
- <b>Date:</b> 02/01/2023
- <b>Notebook Stage:</b> 1.1 (using inital data collected to complete ficDTB)
- <b>Description:</b> Combining ALL fic data (mostly adding text-data fics & fics from FFN.ne& AO3 sub/bookmarks list)

### **<u>Table of Contents</u>**

1. **[Imports](#imports)**

1. Fill AO3 fics from fic-3


1. **[Adding other url-data fics to ficDTB](#otherUrl_1)**
    1. [Adding info from otherDTB](#otherUrl_1_1)
        - Add fics to ficDTB from otherDTB (ffn.net fics & ao3 external fics)
        - Add users to userDTB from otherDTB (ffn.net users)
            - **Other CHECKPOINT NUM** - DESCRIPTION
    1. [Adding FFN.net fics & users](#otherUrl_1_2)
        - **User CHECKPOINT NUM** - DESCRIPTION
        - **Fic CHECKPOINT NUM** - DESCRIPTION
    1. [Adding AO3 fics](#otherUrl_1_3)
1. **[Adding text-data fics (from previous versions) to ficDTB](#TAG)**
1. **[TITLE](#TAG)**
1. **[TITLE](#TAG)**
    - **Fic CHECKPOINT NUM** - DESCRIPTION

### Plans for this Notebook 
- bring over excess stuff from dtbs_from_prev

- **add all fic urls for the other DTBs to ficDTB (location_found will just be different**
    - add look_into from others -> look_into dtb (no)

- Grab fics from ffn.net (Follows & Favorites) & ao3 (bookmarks, subs, for later)
    - add to ficDTB
- populate ficDTB
- add text fics
    - add date_added & versionNum from v1-6 if possible to text fics, which are then added
- de-dup (same fic-diff places, etc.)
    - start making search/matching functions for text, ffn.net, version matching
- another other cleaning? DTB seperation? 


### Plans for next notebooks
- next notebook is cleaning up all non-fic dtbs

### AO3 Work Methods
bookmark
, collect
, comment
, delete_bookmark
, download
, download_to_file
, get
, get_comments
, get_images
, is_subscribed
, leave_kudos
, load_chapters
, request
, reload
, set_session
, str_format
, subscribe
, unsubscribe


### AO3 Stats
t2 = ['authors',
 'bookmarks',
 'categories',
 'chapters',
 'characters',
 'collections',
 'comments',
 'complete',
 'date_edited',
 'date_published',
 'date_updated',
 'end_notes',
 'expected_chapters',
 'fandoms',
 'hits',
 'id',
 'is_subscribed',
 'kudos',
 'language',
 'loaded',
 'metadata',
 'nchapters',
 'oneshot',
 'rating',
 'relationships',
 'restricted',
 'series',
 'start_notes',
 'status',
 'summary',
 'tags',
 'text',
 'title',
 'url',
 'words']

<a id="imports"></a>
## 1. Imports

In [7]:
%run helpers.ipynb
import AO3

import json
import pandas as pd
import numpy as np
from datetime import datetime

from bs4 import BeautifulSoup
import requests
import difflib
pd.set_option('display.max_columns', None)

Success!


<a id="sec2"></a>
## 2. Filling AO3 ficDTB (from fic-3)

<a id="sec2.1"></a>
### 2.1 Defining ficDTB-filling Functions

In [76]:
fic_columns = {'location_found', 'dtb_type', 'smk_source', 'version',
       'date_added', 'date_last_viewed', 'url_type', 'id', 'url',
       'recced_from_collections', 'url_psueds', 'fic_obj', 'date_obj_updated',
       'is_subscribed', 'cur_chapter', 'my_notes', 'title', 'authors', 'fandoms',
       'rating', 'categories', 'warnings', 'relationships', 'characters',
       'tags', 'series', 'collections', 'words', 'nchapters',
       'expected_chapters', 'complete', 'date_published', 'date_updated',
       'date_edited', 'language', 'is_restricted', 'metadata', 'summary',
       'start_notes', 'end_notes', 'chapters', 'text', 'kudos', 'comments',
       'bookmarks', 'hits'}

In [77]:
def is_fic_row_complete(row):
    """
    Takes one row of a fic dtb.
    Returns a Boolean on whether or not the given row is completely filled.
    """
    new_row = row.copy().drop(columns=['dtb_type','date_last_viewed','cur_chapter', 'my_notes', 'metadata'])
    return len(np.where(pd.isnull(new_row))[1]) == 0

# is_fic_row_complete(ficDTB.iloc[[0]])

In [85]:
def fill_fic_ao3_info(fic_id, session, report=False):
    """
    Takes a fic id (int).
    Returns a 1-row pandas DF populated by ao3 data from the given fic id.
    """
    # initialize temp holder & fic obj
    single_fic = pd.DataFrame({'id': [fic_id]})
    fic = AO3.Work(fic_id, session=session)

    # write report info & report    
    title = fic.title
    authors = json.dumps([author.username for author in fic.authors])
    fandoms = json.dumps(fic.fandoms)

    single_fic['title'] = title
    single_fic['authors'] = authors
    single_fic['fandoms'] = fandoms
    if report: print(f"- Wrote '{title}' by {authors}\nin {fandoms}")

    # Write remaining info:
    # metadata
    single_fic['fic_obj'] = fic
    single_fic['date_obj_updated'] = datetime.now()
#     single_fic['metadata'] = json.dumps(fic.metadata)
    single_fic['is_subscribed'] = fic.is_subscribed
    single_fic['is_restricted'] = fic.restricted
    
    # tags
    single_fic['rating'] = fic.rating
    single_fic['categories'] = json.dumps(fic.categories)
    single_fic['warnings'] = json.dumps(fic.warnings)
    single_fic['relationships'] = json.dumps(fic.relationships)
    single_fic['characters'] = json.dumps(fic.characters)
    single_fic['tags'] = json.dumps(fic.tags)
    single_fic['series'] = json.dumps([series.id for series in fic.series])
    single_fic['collections'] = json.dumps(fic.categories)
    
    # text
    single_fic['summary'] = fic.summary
    single_fic['start_notes'] = fic.start_notes
    single_fic['end_notes'] = fic.end_notes
    single_fic['chapters'] = json.dumps([(chap.title, chap.id) for chap in fic.chapters])
    single_fic['text'] = fic.text
    
    # stats
    single_fic['words'] = fic.words
    single_fic['kudos'] = fic.kudos
    single_fic['comments'] = fic.comments
    single_fic['bookmarks'] = fic.bookmarks
    single_fic['hits'] = fic.hits
    single_fic['nchapters'] = fic.nchapters
    single_fic['expected_chapters'] = fic.expected_chapters
    single_fic['complete'] = fic.complete
    
    # dates
    single_fic['date_published'] = fic.date_published
    
    single_fic['date_updated'] = fic.date_updated
    single_fic['date_edited'] = fic.date_edited
    single_fic['language'] = fic.language

    # not found
    single_fic['not_found'] = False
    
    return single_fic

# fill_fic_ao3_info(16616795, ss1)

In [149]:
def fill_fic_dtb(initial_fic_dtb, session, done_set, update=False, report=False):
    """
    Takes a seriesDTB (pandas DataFrame, post series-0), an AO3 session, a Boolean to put in 'update' mode 
        (aka update all rows regardless of it they're already complete) and a Boolean to print report.
    Modifies given seriesDTB by filling it up with ao3 information.
    Returns nothing.
    """
    # find total number of series to fill
    total = max(initial_fic_dtb.index)
    
    # ensure initial_fic_dtb has all necessary columns
    for col_name in fic_columns:
        if col_name not in initial_fic_dtb.columns:
            initial_fic_dtb[col_name] = np.nan
        if 'date' in col_name:
            initial_fic_dtb[col_name] = initial_fic_dtb[col_name].astype('datetime64[ns]')
    
    # find remaining indices
    remaining_indices = set(initial_fic_dtb.index) ^ done_set
    
    # fill all fics/rows in initial_fic_dtb
    for ind in remaining_indices: 
        try: 
            # when not using full-report, alert at every 100 series
            if not report:
                if ind%100 == 0: 
                    print(f'- {ind}! (printed every 100)')

            # if fic/row not entirely filled in OR we're updating the dtb 
            if ind not in done_set and ((not is_fic_row_complete(initial_fic_dtb.iloc[[ind]])) or update):
                # get fic id
                fic_id = initial_fic_dtb.at[ind, "id"]
                if report: print(f"{ind}: [{(ind/total)*100:.2f}%] Filling for [{fic_id}]")

                # get ao3 info
                fic_ao3_info = fill_fic_ao3_info(fic_id, session, report=False)

                # update initial_fic_dtb with series_ao3_info (new info will overwrite old info)
                fic_ao3_info.index = [ind]
                initial_fic_dtb.update(fic_ao3_info, join='left', overwrite=True)
                
            # if series/row is satifactory
            else: 
                if report: print(f"{ind}: .")
                    
            # add index to done set
            done_set.add(ind)

        # if something goes wrong 
        except Exception as e:
            initial_fic_dtb.at[ind, "not_found"] = e
            if e.args[0] == 'Cannot find work':
                print(f"-- ERROR [], {e}: {initial_fic_dtb.at[ind, 'id']}")
                done_set.add(ind) # add index to done set
            elif er1.args[0] == "'NoneType' object has no attribute 'text'":
                print(f"-- ERROR, {e}: {initial_fic_dtb.at[ind, 'id']}")
                done_set.add(ind) # add index to done set
#             else: 
#                 raise e
        
        # update temp csv w/ new row/series
        if ind%20 == 0: 
            file_name = "temp_fic.csv"
            if report: print(f'Wrote to {file_name}')
            initial_fic_dtb.to_csv(file_name)
        
        

    # Write finished series DTB to csv
    initial_fic_dtb.to_csv("temp_fic_final.csv")

    print("\nDONE!")


<a id="sec2.2"></a>
### 2.2 Filling ficDTB

In [81]:
ss1=my_session()

In [89]:
# ficDTB = pd.read_csv('data-checkpoints/fic-3-all_02-03-23.csv', index_col=0, parse_dates=['date_obj_updated']) \
#             .drop(columns=['is_missing', 'restricted','notes'])

# done_set= set()
# AO3.utils.limit_requests()

In [187]:
ficDTB.to_csv('fic5.csv')

In [173]:
err = set()
for ind in ficDTB.index:
    e1 = ficDTB.at[ind,'not_found']
    if type(e1) is not bool and type(e1) is not float and e1.args[0] == 'We are being rate-limited. Try again in a while or reduce the number of requests':
        err.add(ind)

In [186]:
fill_fic_dtb(ficDTB, ss1, done_set, update=False, report=True)

958: [38.66%] Filling for [23779702.0]
959: [38.70%] Filling for [23810251.0]
960: [38.74%] Filling for [23818510.0]
Wrote to temp_fic.csv
961: [38.78%] Filling for [23850361.0]
962: [38.82%] Filling for [23859457.0]
963: [38.86%] Filling for [23931013.0]
-- ERROR, 'NoneType' object has no attribute 'text': 23931013.0
964: [38.90%] Filling for [2393225.0]




965: [38.94%] Filling for [23935006.0]
966: [38.98%] Filling for [23939416.0]
-- ERROR, ("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer')): 23939416.0
967: [39.02%] Filling for [23951533.0]
968: [39.06%] Filling for [23956966.0]
-- ERROR, ("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer')): 23956966.0
969: [39.10%] Filling for [23961145.0]
970: [39.14%] Filling for [23973340.0]
971: [39.18%] Filling for [23983990.0]
972: [39.23%] Filling for [23993578.0]
973: [39.27%] Filling for [24028024.0]
974: [39.31%] Filling for [24056941.0]
975: [39.35%] Filling for [24067537.0]
-- ERROR, ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer')): 24067537.0
976: [39.39%] Filling for [240930.0]
977: [39.43%] Filling for [24103009.0]
978: [39.47%] Filling for [24105631.0]
979: [39.51%] Filling for [24110341.0]
980: [39.55%]



Wrote to temp_fic.csv
981: [39.59%] Filling for [24113557.0]




982: [39.63%] Filling for [2412722.0]
983: [39.67%] Filling for [24168637.0]




984: [39.71%] Filling for [24194614.0]
985: [39.75%] Filling for [24213475.0]




986: [39.79%] Filling for [24215029.0]




987: [39.83%] Filling for [24216859.0]
988: [39.87%] Filling for [24231487.0]




989: [39.91%] Filling for [24241987.0]
990: [39.95%] Filling for [24262567.0]
991: [39.99%] Filling for [24278113.0]
992: [40.03%] Filling for [24280420.0]
993: [40.07%] Filling for [24286120.0]
994: [40.11%] Filling for [2429072.0]
995: [40.15%] Filling for [24307858.0]




997: [40.23%] Filling for [24318004.0]
998: [40.27%] Filling for [24328996.0]
999: [40.31%] Filling for [24339220.0]
1000: [40.36%] Filling for [24343024.0]
-- ERROR, ("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer')): 24343024.0
Wrote to temp_fic.csv
1001: [40.40%] Filling for [24346537.0]
1002: [40.44%] Filling for [24349642.0]
1003: [40.48%] Filling for [24358207.0]




1004: [40.52%] Filling for [24358252.0]
1005: [40.56%] Filling for [24376201.0]
1006: [40.60%] Filling for [24376396.0]




1007: [40.64%] Filling for [24378940.0]
1008: [40.68%] Filling for [24401746.0]




1009: [40.72%] Filling for [24412372.0]
1010: [40.76%] Filling for [24414376.0]
1011: [40.80%] Filling for [2444300.0]
1012: [40.84%] Filling for [24526141.0]
1013: [40.88%] Filling for [24540286.0]
1014: [40.92%] Filling for [24546556.0]
1015: [40.96%] Filling for [24559009.0]
1016: [41.00%] Filling for [24579031.0]




1017: [41.04%] Filling for [24587365.0]




1018: [41.08%] Filling for [24613408.0]
1019: [41.12%] Filling for [24613453.0]




1020: [41.16%] Filling for [24630391.0]
Wrote to temp_fic.csv
1021: [41.20%] Filling for [24632080.0]
1022: [41.24%] Filling for [24634399.0]
1023: [41.28%] Filling for [24634954.0]
1024: [41.32%] Filling for [24639628.0]
1025: [41.36%] Filling for [24691249.0]
1026: [41.40%] Filling for [24700762.0]
1027: [41.44%] Filling for [24707857.0]
1028: [41.49%] Filling for [24743242.0]
-- ERROR, ("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer')): 24743242.0
1029: [41.53%] Filling for [24745498.0]
1030: [41.57%] Filling for [24756952.0]
1031: [41.61%] Filling for [24763852.0]
1032: [41.65%] Filling for [24766570.0]
1033: [41.69%] Filling for [24779452.0]
1034: [41.73%] Filling for [24786316.0]
1035: [41.77%] Filling for [24812572.0]
1036: [41.81%] Filling for [24822373.0]
1037: [41.85%] Filling for [24848329.0]
1038: [41.89%] Filling for [24854467.0]




1039: [41.93%] Filling for [24864346.0]
1040: [41.97%] Filling for [24872824.0]
Wrote to temp_fic.csv
1041: [42.01%] Filling for [24887110.0]
1042: [42.05%] Filling for [24887557.0]




1043: [42.09%] Filling for [24892276.0]




1044: [42.13%] Filling for [24905410.0]
1045: [42.17%] Filling for [24924808.0]
1046: [42.21%] Filling for [24925561.0]
1047: [42.25%] Filling for [24955558.0]
1048: [42.29%] Filling for [24963313.0]
1049: [42.33%] Filling for [24972502.0]
1050: [42.37%] Filling for [24972604.0]
1051: [42.41%] Filling for [24987124.0]




1052: [42.45%] Filling for [25015792.0]
1053: [42.49%] Filling for [25022086.0]
1054: [42.53%] Filling for [25026529.0]
1055: [42.57%] Filling for [25036408.0]
1056: [42.62%] Filling for [25047004.0]




1057: [42.66%] Filling for [25057393.0]




1058: [42.70%] Filling for [2506151.0]
1059: [42.74%] Filling for [25068775.0]
1060: [42.78%] Filling for [25079452.0]
Wrote to temp_fic.csv
1061: [42.82%] Filling for [2508782.0]
1062: [42.86%] Filling for [25092961.0]
1063: [42.90%] Filling for [25095610.0]
1064: [42.94%] Filling for [25111621.0]
1065: [42.98%] Filling for [25123036.0]
-- ERROR, ("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer')): 25123036.0
1066: [43.02%] Filling for [251352.0]
-- ERROR, ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer')): 251352.0
1067: [43.06%] Filling for [25138768.0]
1068: [43.10%] Filling for [25143577.0]
1069: [43.14%] Filling for [25177945.0]
1070: [43.18%] Filling for [25211344.0]
1071: [43.22%] Filling for [25216897.0]
1072: [43.26%] Filling for [25217872.0]




1073: [43.30%] Filling for [25230364.0]
1074: [43.34%] Filling for [25241206.0]
1075: [43.38%] Filling for [25262914.0]




1076: [43.42%] Filling for [25270801.0]
1077: [43.46%] Filling for [25271548.0]
1078: [43.50%] Filling for [25286608.0]
1079: [43.54%] Filling for [25295155.0]
1080: [43.58%] Filling for [2531417.0]
Wrote to temp_fic.csv
1081: [43.62%] Filling for [25329310.0]
1082: [43.66%] Filling for [25344355.0]
1083: [43.70%] Filling for [25386817.0]
1084: [43.74%] Filling for [25392859.0]
1085: [43.79%] Filling for [25414132.0]
1086: [43.83%] Filling for [25415320.0]
1087: [43.87%] Filling for [25428055.0]
1088: [43.91%] Filling for [25433362.0]
1089: [43.95%] Filling for [25436386.0]
1090: [43.99%] Filling for [25474642.0]




1091: [44.03%] Filling for [25498321.0]
1092: [44.07%] Filling for [25511461.0]
1093: [44.11%] Filling for [25547245.0]




1094: [44.15%] Filling for [25556128.0]
1095: [44.19%] Filling for [25574386.0]
1096: [44.23%] Filling for [25582654.0]
1097: [44.27%] Filling for [25601596.0]
1098: [44.31%] Filling for [25614247.0]
1099: [44.35%] Filling for [25625587.0]
1100: [44.39%] Filling for [25627864.0]
Wrote to temp_fic.csv
1101: [44.43%] Filling for [2569610.0]
1102: [44.47%] Filling for [25722511.0]
1103: [44.51%] Filling for [25723171.0]
1104: [44.55%] Filling for [25759528.0]
1105: [44.59%] Filling for [25771438.0]
1106: [44.63%] Filling for [2578418.0]
1107: [44.67%] Filling for [25787653.0]
1108: [44.71%] Filling for [25791574.0]
1109: [44.75%] Filling for [25791772.0]
1110: [44.79%] Filling for [25800691.0]
1111: [44.83%] Filling for [25818727.0]
1112: [44.87%] Filling for [25822288.0]
1113: [44.92%] Filling for [25825039.0]
1114: [44.96%] Filling for [25885264.0]




1115: [45.00%] Filling for [25916761.0]
1116: [45.04%] Filling for [25918189.0]
1117: [45.08%] Filling for [25957888.0]
1118: [45.12%] Filling for [25964908.0]
1119: [45.16%] Filling for [25997320.0]




1120: [45.20%] Filling for [26001232.0]
Wrote to temp_fic.csv
1121: [45.24%] Filling for [26004406.0]
1122: [45.28%] Filling for [26007388.0]
1123: [45.32%] Filling for [260273.0]
1124: [45.36%] Filling for [26046058.0]
1125: [45.40%] Filling for [26046520.0]




1126: [45.44%] Filling for [26071945.0]
1127: [45.48%] Filling for [26090464.0]
1128: [45.52%] Filling for [26096470.0]
1129: [45.56%] Filling for [26106754.0]




1130: [45.60%] Filling for [26112073.0]
1131: [45.64%] Filling for [26132014.0]
1132: [45.68%] Filling for [26143228.0]
1133: [45.72%] Filling for [26171239.0]



KeyboardInterrupt



<a id="otherUrl_1"></a>
## 2. Adding other url-data fics to ficDTB

In [28]:
# read in most recent ficDTB file
ficDTB = pd.read_csv(
    "data-checkpoints/fic-3-all_02-03-23.csv",
    parse_dates=["date_added", "date_last_viewed"],
    index_col=0,
)
ficDTB.head(2)

Unnamed: 0,location_found,is_missing,dtb_type,smk_source,version,date_added,date_last_viewed,url_type,id,url,recced_from_collections,url_psueds,fic_obj,date_obj_updated,is_subscribed,cur_chapter,notes,title,authors,fandoms,rating,categories,warnings,relationships,characters,tags,series,collections,words,nchapters,expected_chapters,complete,date_published,date_updated,date_edited,language,restricted,metadata,summary,start_notes,end_notes,chapters,text,kudos,comments,bookmarks,hits
0,AO3,,,v7_sheets,7,2021-07-24 05:56:00,NaT,works,26671084,http://www.archiveofourown.org/works/26671084,[],"[""https://archiveofourown.org/works/26671084""]",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,AO3,,,v7_sheets,7,2021-05-16 07:28:01,NaT,works,27740392,http://www.archiveofourown.org/works/27740392,[],"[""https://archiveofourown.org/works/27740392"",...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [52]:
# read in most recent authors file
userDTB = pd.read_csv(
    "data-checkpoints/users-0-all_02-26-23.csv", parse_dates=["date_added"], index_col=0
)
userDTB["location_found"] = "ao3"
userDTB.head(3)

Unnamed: 0,dtb_type,location_found,smk_source,version,date_added,date_last_viewed,user_name,url
0,author,ao3,v7_sheets,7,2022-01-01 00:00:01,,AMournfulHowlInTheNight,https://archiveofourown.org/users/AMournfulHow...
1,,ao3,v7_sheets,7,2021-07-08 18:47:22,2021-07-08 22:28:20,Alex51324,https://archiveofourown.org/users/Alex51324/ps...
2,,ao3,safari,9,2022-04-22 03:08:02,2022-04-22 04:14:39,Aminias,https://archiveofourown.org/users/Aminias/pseu...


<a id="otherUrl_1_1"></a>
### 2A. Adding Info from otherDTB
- **2A.1)** Add fics to ficDTB from otherDTB (ffn.net fics & ao3 external fics)
- **2A.2)** Add users to userDTB from otherDTB (ffn.net users)

In [21]:
# read in most recent others file
othersDTB = pd.read_csv(
    "data-checkpoints/others-0-all_01-03-23.csv",
    parse_dates=["date_added", "date_last_viewed"],
    index_col=0,
)

# get AO3 external fics & FFN.net urls
ffn_net_fics = othersDTB.query("url.str.contains('archiveofourown.org/') or \
                                url.str.contains('fanfiction.net/s/')")

# get ffn.net users
ffn_net_users = othersDTB.query("url.str.contains('fanfiction.net/u/')")

# get remaining (all non-fanfiction.net & ao3 urls)
remaining_others = othersDTB.query("~(url.str.contains('fanfiction.net/u/') or \
                                      url.str.contains('archiveofourown.org/') or \
                                      url.str.contains('fanfiction.net/s/'))")

In [27]:
# remaining_others.to_csv('data-checkpoints/others-1-all_03-09-23.csv')

### CHECKPOINT! others-1-all_03-09-23.csv (remaining non-ffn.net and non-ao3 links)

In [64]:
def get_fic_url_type(url):
    if 'archiveofourown.org/external_works/' in url: return 'ao3_external_work'
    elif 'fanfiction.net/s/' in url: return 'work'
    else: return np.nan
    
def get_location_found(url):
    if 'archiveofourown.org/' in url: return 'ao3'
    elif 'fanfiction.net/' in url: return 'ffn.net'
    else: return np.nan
    
ffn_net_fics['location_found'] = ffn_net_fics['url'].apply(get_location_foound)
ffn_net_fics['url_type'] = ffn_net_fics['url'].apply(get_fic_url_type)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ffn_net_fics['location_found'] = ffn_net_fics['url'].apply(get_location_foound)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ffn_net_fics['url_type'] = ffn_net_fics['url'].apply(get_fic_url_type)


In [47]:
# add new fic urls to ficDTB
ficDTB = (
    pd.concat([ficDTB, ffn_net_fics])
    .reset_index()
    .drop_duplicates(subset=["url"])
    .drop(columns=["index"])
)

ficDTB.tail(10)

Unnamed: 0,location_found,is_missing,dtb_type,smk_source,version,date_added,date_last_viewed,url_type,id,url,recced_from_collections,url_psueds,fic_obj,date_obj_updated,is_subscribed,cur_chapter,notes,title,authors,fandoms,rating,categories,warnings,relationships,characters,tags,series,collections,words,nchapters,expected_chapters,complete,date_published,date_updated,date_edited,language,restricted,metadata,summary,start_notes,end_notes,chapters,text,kudos,comments,bookmarks,hits
2474,AO3,,,v7_sheets,7,2022-01-01 00:00:01,NaT,works,6203218.0,https://www.archiveofourown.org/works/6203218,[],[],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2475,AO3,,,v7_sheets,7,2022-01-01 00:00:01,NaT,works,653038.0,https://www.archiveofourown.org/works/653038/c...,[],[],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2476,AO3,,,v7_sheets,7,2022-01-01 00:00:01,NaT,works,798684.0,https://www.archiveofourown.org/works/798684/c...,[],[],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2477,AO3,,,v7_sheets,7,2022-01-01 00:00:01,NaT,works,8710123.0,https://www.archiveofourown.org/works/8710123,[],[],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2478,AO3,,,v7_sheets,7,2022-01-01 00:00:01,NaT,works,9210233.0,https://www.archiveofourown.org/works/9210233,[],[],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2479,,,,v7_sheets,7,2022-01-01 00:00:01,NaT,,,https://archiveofourown.org/external_works/637417,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2480,,,,v7_sheets,7,2021-07-12 06:43:24,NaT,,,https://m.fanfiction.net/s/12783920/1/Authors-...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2481,,,,safari,9,2022-04-13 23:04:41,NaT,,,https://m.fanfiction.net/s/13910770/1/Nest,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2482,,,,safari,9,2022-04-13 17:16:21,NaT,,,https://m.fanfiction.net/s/14052100/1/Gaslight...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2483,,,,v7_sheets,7,2021-04-05 12:16:03,NaT,,,https://m.fanfiction.net/s/5645842/1/Rational-...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [65]:
ffn_net_users['location_found'] = ffn_net_users['url'].apply(get_location_found)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ffn_net_users['location_found'] = ffn_net_users['url'].apply(get_location_found)


In [66]:
# add new user urls to usersDTB
userDTB = (
    pd.concat([userDTB, ffn_net_users])
    .reset_index()
    .drop_duplicates(subset=["url"])
    .drop(columns=["index"])
)

userDTB.tail(10)

Unnamed: 0,dtb_type,location_found,smk_source,version,date_added,date_last_viewed,user_name,url
84,,ao3,safari,9,2022-12-16 18:54:29,,stormy1x2,https://archiveofourown.org/users/stormy1x2/ps...
85,,ao3,safari,9,2022-12-15 05:38:23,,sweetlolixo,https://archiveofourown.org/users/sweetlolixo/...
86,,ao3,v7_sheets,7,2022-01-01 00:00:01,,technorat,https://archiveofourown.org/users/technorat/ps...
87,,ao3,v7_sheets,7,2021-07-23 10:31:50,,thepartyresponsible,https://archiveofourown.org/users/thepartyresp...
88,author,ao3,v7_sheets,7,2022-01-01 00:00:01,,wintersnight,https://archiveofourown.org/users/wintersnight...
89,,ao3,safari,9,2022-12-15 05:42:10,2022-12-15 06:24:54,x_los,https://archiveofourown.org/users/x_los/pseuds...
90,author,ffn.net,v7_sheets,7,2022-01-01 00:00:01,,,https://www.fanfiction.net/u/1718955/May-Wren
91,author,ffn.net,v7_sheets,7,2022-01-01 00:00:01,,,https://www.fanfiction.net/u/2042977/cywsaphyre
92,author,ffn.net,v7_sheets,7,2022-01-01 00:00:01,,,https://www.fanfiction.net/u/6272865/Coeur-Al-...
93,author,ffn.net,v7_sheets,7,2022-01-01 00:00:01,,,https://www.fanfiction.net/u/6314924/BlueMoonC...


In [82]:
# 2A.2) Add users to userDTB from otherDTB (ffn.net users)

# get AO3 external fics & FFN.net urls
new_users = othersDTB.query('url.str.contains("fanfiction.net/u/")')

# remove ffn.net fics & ao3 external works
othersDTB = othersDTB.query('~(url.str.contains("fanfiction.net/u/"))')

# add new fic urls to ficDTB
userDTB = (
    pd.concat([userDTB, new_users])
    .reset_index()
    .drop_duplicates(subset=["url"])
    .drop(columns=["index"])
)

userDTB

Unnamed: 0,smk_source,version,date_added,date_last_viewed,user_name,url,location_found,dtb_type
0,v7_sheets,7,2022-01-01 00:00:01,,AMournfulHowlInTheNight,https://archiveofourown.org/users/AMournfulHow...,ao3,
1,v7_sheets,7,2021-07-08 18:47:22,2021-07-08 22:28:20,Alex51324,https://archiveofourown.org/users/Alex51324/ps...,ao3,
2,safari,9,2022-04-22 03:08:02,2022-04-22 04:14:39,Aminias,https://archiveofourown.org/users/Aminias/pseu...,ao3,
3,v7_sheets,7,2022-01-01 00:00:01,,Applepie,https://archiveofourown.org/users/Applepie/pse...,ao3,
4,v7_sheets,7,2022-01-01 00:00:01,,Araceil,https://archiveofourown.org/users/Araceil/pseu...,ao3,
...,...,...,...,...,...,...,...,...
89,safari,9,2022-12-15 05:42:10,2022-12-15 06:24:54,x_los,https://archiveofourown.org/users/x_los/pseuds...,ao3,
90,v7_sheets,7,2022-01-01 00:00:01,,,https://www.fanfiction.net/u/1718955/May-Wren,,author
91,v7_sheets,7,2022-01-01 00:00:01,,,https://www.fanfiction.net/u/2042977/cywsaphyre,,author
92,v7_sheets,7,2022-01-01 00:00:01,,,https://www.fanfiction.net/u/6272865/Coeur-Al-...,,author


#### Other CHECKPOINT! others-1-all_02-07-23.csv (removed all fanfiction.net & ao3 (external) works)

In [83]:
othersDTB = othersDTB.reset_index(drop=True)
# othersDTB.to_csv("data-checkpoints/others-1-all_02-07-23.csv")

<a id="otherUrl_1_2"></a>
### 2B. Adding FFN.net fics & users
- manually collected authors & fics from my 'Alerts' & 'Favorites' sections on fanfiction.net
- made not distinction between the two categories, marked all fics as read
    - if there were any duplicate authors or fics in both Alerts & Favorites, chose earliest add date
    
- **2B.1)** Add users to userDTB from ffn.net Alerts & Favorites
- **2B.2)** Add fics to ficDTB from ffn.net Alerts & Favorites

### 2B.1) Add users to userDTB from ffn.net Alerts & Favorites

In [84]:
# read in authors from fanfiction.net account
ffn_authors = pd.read_csv(
    "urlsOutput/ffn-net_authors_02-04-23.csv", index_col=0, parse_dates=["date_added"]
)

# add ffn_authors to userDTB
userDTB = pd.concat([userDTB, ffn_authors]).reset_index().drop(columns=["index"])
userDTB.head(2)

Unnamed: 0,smk_source,version,date_added,date_last_viewed,user_name,url,location_found,dtb_type
0,v7_sheets,7.0,2022-01-01 00:00:01,,AMournfulHowlInTheNight,https://archiveofourown.org/users/AMournfulHow...,ao3,
1,v7_sheets,7.0,2021-07-08 18:47:22,2021-07-08 22:28:20,Alex51324,https://archiveofourown.org/users/Alex51324/ps...,ao3,


#### User CHECKPOINT! users-1-all_02-07-23.csv (Add ffn.net users )

In [None]:
def populateFicDTB(useAccount, debug=False):
    """
    Takes boolean (use account/able to access restricted?)
    Fills any empty row of FicDTB with: title, authors, fandoms, Work object, and date this all last updated
    Requires that all rows' url_type & id be filled in.
    Returns nothing.
    """
    total = max(ficDTB.index)

    for ind in ficDTB.index:
        # get necessary info from DTB
        title = ficDTB.at[ind, "title"]
        authors = ficDTB.at[ind, "authors"]
        fandoms = ficDTB.at[ind, "fandoms"]
        fic_obj = ficDTB.at[ind, "fic_obj"]
        obj_date = ficDTB.at[ind, "date_obj_updated"]
        url_type = ficDTB.at[ind, "url_type"]

        # if any col is empty
        if (
            pd.isnull(title)
            or pd.isnull(authors)
            or pd.isnull(fandoms)
            or pd.isnull(fic_obj)
            or pd.isnull(obj_date)
        ):
            try:
                # get fic id
                if url_type == "chapters":
                    print("-- chapter!")
                    html_text = requests.get(url).text
                    soup = BeautifulSoup(html_text, "lxml")
                    wId = soup.find(
                        "input", attrs={"id": "kudo_commentable_id", "type": "hidden"}
                    )["value"]
                else:
                    wId = ficDTB.at[ind, "id"]
                print(f"- [{(ind/total)*100:.2f}%, #{ind}] Filling for [{wId}]")

                # initialize Work obj
                if useAccount:
                    work = AO3.Work(wId, session=session)
                else:
                    work = AO3.Work(wId)

                # write new info into DTB
                newTitle = work.title
                ficDTB.at[ind, "title"] = newTitle
                if debug:
                    print(f"- Wrote '{newTitle}'")

                newAuthors = json.dumps([x.username for x in work.authors])
                ficDTB.at[ind, "authors"] = newAuthors
                if debug:
                    print(f"- Wrote '{newAuthors}'")

                newFandoms = json.dumps(work.fandoms)
                ficDTB.at[ind, "fandoms"] = newFandoms
                if debug:
                    print(f"- Wrote '{newFandoms}'")

                ficDTB.at[ind, "fic_obj"] = work
                now = datetime.now()
                ficDTB.at[ind, "date_obj_updated"] = now
                if debug:
                    print(f"- Wrote fic obj at: {now.strftime('%m-%d-%y %H:%M:%S')}")

            # if Error
            except Exception as e:
                print(f"-- ERROR: {repr(e)} - - - {ficDTB.at[ind, 'id']}")
        else:
            print(".")

    print("\nDONE!")

In [85]:
userDTB = userDTB.reset_index(drop=True)
# userDTB.to_csv("data-checkpoints/users-1-all_02-07-23.csv")

### 2B.2) Add fics to ficDTB from ffn.net Alerts & Favorites

In [86]:
# read in fics from fanfiction.net account
ffn_fics = pd.read_csv(
    "urlsOutput/ffn-net_fics_02-04-23.csv",
    index_col=0,
    parse_dates=["date_added", "date_updated"],
)

def temp(x):
    return json.dumps([x.strip() for x in x.split(",")])

ffn_fics["authors"] = ffn_fics["author"].apply(temp)
ffn_fics["fandoms"] = ffn_fics["fandoms"].apply(temp)
ffn_fics = ffn_fics.drop(columns=["author"])

# add ffn_fics to ficDTB
ficDTB = pd.concat([ficDTB, ffn_fics]).reset_index(drop=True)
ficDTB

Unnamed: 0,location_found,is_missing,dtb_type,smk_source,version,date_added,date_last_viewed,url_type,id,url,recced_from_collections,url_psueds,fic_obj,date_obj_updated,is_subscribed,cur_chapter,notes,title,authors,fandoms,rating,categories,warnings,relationships,characters,tags,series,collections,words,nchapters,expected_chapters,complete,date_published,date_updated,date_edited,language,restricted,metadata,summary,start_notes,end_notes,chapters,text,kudos,comments,bookmarks,hits
0,AO3,,,v7_sheets,7.0,2021-07-24 05:56:00,NaT,works,26671084.0,http://www.archiveofourown.org/works/26671084,[],"[""https://archiveofourown.org/works/26671084""]",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,AO3,,,v7_sheets,7.0,2021-05-16 07:28:01,NaT,works,27740392.0,http://www.archiveofourown.org/works/27740392,[],"[""https://archiveofourown.org/works/27740392"",...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,AO3,,,v7_sheets,7.0,2021-05-21 17:23:47,2021-05-22 07:51:06,works,31373576.0,http://www.archiveofourown.org/works/31373576,[],"[""https://archiveofourown.org/works/31373576""]",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,AO3,,,v7_sheets,7.0,2022-01-01 00:00:01,NaT,chapters,1554514.0,https://archiveofourown.org/chapters/1554514?a...,[],[],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,AO3,,,v7_sheets,7.0,2022-01-01 00:00:01,NaT,chapters,17156026.0,https://archiveofourown.org/chapters/17156026?...,[],[],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2532,ffn_net,,read,ffn_net_account,,2016-11-08 00:00:00,NaT,,,https://www.fanfiction.net/s/10573845/1/The-Ma...,,,,,,,,The Mafia King,"[""erimies""]","[""One Piece""]",,,,,,,,,,,,,,2014-07-28 00:00:00,,,,,,,,,,,,,
2533,ffn_net,,read,ffn_net_account,,2016-07-13 00:00:00,NaT,,,https://www.fanfiction.net/s/10514625/1/Standards,,,,,,,,Standards,"[""Taisi""]","[""One Piece""]",,,,,,,,,,,,,,2014-07-06 00:00:00,,,,,,,,,,,,,
2534,ffn_net,,read,ffn_net_account,,2016-12-15 00:00:00,NaT,,,https://www.fanfiction.net/s/7724057/1/Family-...,,,,,,,,Family Bonds,"[""xXDesertRoseXx""]","[""Harry Potter""]",,,,,,,,,,,,,,2014-07-05 00:00:00,,,,,,,,,,,,,
2535,ffn_net,,read,ffn_net_account,,2016-07-09 00:00:00,NaT,,,https://www.fanfiction.net/s/9551666/1/Somewhe...,,,,,,,,Somewhere To Belong,"[""Pizza yum""]","[""One Piece""]",,,,,,,,,,,,,,2014-03-05 00:00:00,,,,,,,,,,,,,


#### Fic CHECKPOINT! fics-4-all_02-07-23.csv (Add fics from ffn.net from ffn.net account & othersDTB)

In [87]:
# ficDTB.to_csv("data-checkpoints/fics-4-all_02-07-23.csv")


In [13]:
df = pd.read_excel("raw_data_for_v9.xlsx", sheet_name=None)
df.keys()

dict_keys(['v1_fic_text', 'v1_series_text', 'v2_authors_text', 'v2_fic_text', 'v3_authors_text', 'v3_fic_text_to-read', 'v3_series_text_to-read', 'v3_fic_text_categories', 'v3_fic_text_fandom', 'v3_series_text_fandom', 'v4_authors_text', 'v4_fic_text_to-read', 'v4_series_text_to-read', 'v4_fic_text_categories', 'v5_fic_text_cont-reading', 'v5_authors_text', 'v5_fic_text_to-read', 'v5_series_text_to-read', 'v5_fic_text_categories', 'v5_series_text_categories', 'v5_fic_text_fandom'])

<a id="otherUrl_1_3"></a>
### 2.3 Adding AO3 fics
- make DTB of all AO3 fic-bookmarks, fic-subs, user-subs, series-subs

In [6]:
ficDTB.smk_source.value_counts()

v7_sheets          2482
ffn_net_account      53
safari                2
Name: smk_source, dtype: int64

In [5]:
ficDTB = pd.read_csv('data-checkpoints/fics-4-all_02-07-23.csv', index_col=0)
ficDTB

Unnamed: 0,location_found,is_missing,dtb_type,smk_source,version,date_added,date_last_viewed,url_type,id,url,recced_from_collections,url_psueds,fic_obj,date_obj_updated,is_subscribed,cur_chapter,notes,title,authors,fandoms,rating,categories,warnings,relationships,characters,tags,series,collections,words,nchapters,expected_chapters,complete,date_published,date_updated,date_edited,language,restricted,metadata,summary,start_notes,end_notes,chapters,text,kudos,comments,bookmarks,hits
0,AO3,,,v7_sheets,7.0,2021-07-24 05:56:00,,works,26671084.0,http://www.archiveofourown.org/works/26671084,[],"[""https://archiveofourown.org/works/26671084""]",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,AO3,,,v7_sheets,7.0,2021-05-16 07:28:01,,works,27740392.0,http://www.archiveofourown.org/works/27740392,[],"[""https://archiveofourown.org/works/27740392"",...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,AO3,,,v7_sheets,7.0,2021-05-21 17:23:47,2021-05-22 07:51:06,works,31373576.0,http://www.archiveofourown.org/works/31373576,[],"[""https://archiveofourown.org/works/31373576""]",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,AO3,,,v7_sheets,7.0,2022-01-01 00:00:01,,chapters,1554514.0,https://archiveofourown.org/chapters/1554514?a...,[],[],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,AO3,,,v7_sheets,7.0,2022-01-01 00:00:01,,chapters,17156026.0,https://archiveofourown.org/chapters/17156026?...,[],[],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2532,ffn_net,,read,ffn_net_account,,2016-11-08 00:00:00,,,,https://www.fanfiction.net/s/10573845/1/The-Ma...,,,,,,,,The Mafia King,"[""erimies""]","[""One Piece""]",,,,,,,,,,,,,,2014-07-28 00:00:00,,,,,,,,,,,,,
2533,ffn_net,,read,ffn_net_account,,2016-07-13 00:00:00,,,,https://www.fanfiction.net/s/10514625/1/Standards,,,,,,,,Standards,"[""Taisi""]","[""One Piece""]",,,,,,,,,,,,,,2014-07-06 00:00:00,,,,,,,,,,,,,
2534,ffn_net,,read,ffn_net_account,,2016-12-15 00:00:00,,,,https://www.fanfiction.net/s/7724057/1/Family-...,,,,,,,,Family Bonds,"[""xXDesertRoseXx""]","[""Harry Potter""]",,,,,,,,,,,,,,2014-07-05 00:00:00,,,,,,,,,,,,,
2535,ffn_net,,read,ffn_net_account,,2016-07-09 00:00:00,,,,https://www.fanfiction.net/s/9551666/1/Somewhe...,,,,,,,,Somewhere To Belong,"[""Pizza yum""]","[""One Piece""]",,,,,,,,,,,,,,2014-03-05 00:00:00,,,,,,,,,,,,,


<a id="TAG"></a>
## Random Work

In [None]:
# fill new info cols
# populateFicDTB(True)

In [None]:
def getRating(x):
    if pd.isnull(x):
        return x

    rating = x.rating
    if rating == "Teen And Up Audiences	":
        return "T"
    elif rating == "Explicit":
        return "E"
    elif rating == "Mature":
        return "M"
    elif rating == "General Audiences":
        return "G"
    elif rating == "Not Rated":
        return "--"

## **start making search/matching functions for text, ffn.net, version matching**
- match by title:
    - get title of unknown fic
    - search ficDTB for matching titles
        - if no match: add col to ficDTB with as much into as possible
        - if match: update row with incoming data (version, smk_source, dtb_type, date_added) 
        

- OVERALL SECTION
    - read json file containing up unsorted txt fics
    - match them one by one 
    - remove url as matched
    - when done, overwrite previous file 

In [1498]:
version_default_dates = {
    1: "05-31-2017 00:00:01",
    2: "05-30-2018 00:00:01",
    3: "08-09-2018 00:00:01",
    4: "08-25-2018 00:00:01",
    5: "01-01-2019 00:00:01",
    6: "04-09-2020 00:00:01",
    7: "05-01-2021 00:00:01",
    8: "06-04-2022 00:00:01",
    9: "10-24-2022 00:00:01",
}

In [1602]:
# early.reset_index(drop=True).to_csv("urlsOutput/v1-6_txt/all_early.csv")

In [1598]:
fic_titles = [x.lower() for x in ficDTB_titles]
early_tit = [x.lower() for x in early_titles]

for ind in early.index:
    title = early.at[ind, "title"].lower()
    if title in fic_titles:
        early = early.drop(ind)

In [6]:
txt_fics = pd.DataFrame(columns=early.columns.to_list())
txt_fics["date_added"] = pd.to_datetime(txt_fics["date_added"])

txt_fics

NameError: name 'early' is not defined

In [1624]:
# read in all_early CSV
early = pd.read_csv(
    "urlsOutput/v1-6_txt/all_early.csv", index_col=0, parse_dates=["date_added"]
).reset_index(drop=True)

early_titles = early.query("work_type == 'fic'").title.to_list()
ficDTB_titles = ficDTB.loc[pd.isnull(ficDTB["title"]) == False].title.to_list()

for ind in early.index:
    title = str(early.at[ind, "title"])

    #     from_early = difflib.get_close_matches(title, early_titles, n=5, cutoff=0.6)
    #     if title in from_early: from_early.remove(title)

    from_dtb = difflib.get_close_matches(title, ficDTB_titles, n=3, cutoff=0.6)

    if len(from_dtb) == 0:
        # add text fic to ficDTB
        txt_fics = pd.concat([txt_fics, early.iloc[[ind]]])

    print(f"NEW: {title}")
    [print(f"- {x}") for x in from_dtb]
    #     if len(from_early) > 0: print(f'--- {from_early}')
    print()

NEW: the thief of hogwarts

NEW: the double agent
- The Trouble With Wanting

NEW: against my nature

NEW: like a concussion

NEW: scarily attractive

NEW: Cirrus Cloud

NEW: "this ain't no fairytale" Series

NEW: red like a storm
- Filled like a Flower

NEW: the hand you're dealt
- the stars in your eyes

NEW: how we met your mother

NEW: trading yesterday

NEW: uchiha kyoya

NEW: The Rebirth of Tsunayoshi Sawada

NEW: odd-job tsuna

NEW: vigilante tendency

NEW: ten flames

NEW: naruto: myoushuu no fuuin

NEW: random word association
- an awkward position

NEW: accounting no jutsu

NEW: in the pink
- Salt in the Ruins
- Crossing the Line
- In the Know

NEW: scales and whitebeards

NEW: new game plus

NEW: the mafia king
- Mafia King
- Remaking

NEW: past and future king

NEW: you are not alone
- Parental Woes

NEW: i'm not supposed to talk to strangers

NEW: livingcontradictory

NEW: inner peace, utter chaos

NEW: not this time, fate

NEW: onslaught

NEW: research and smoothies
- res

KeyboardInterrupt: 

In [99]:
import pandas as pd
# t1 = pd.read_excel('testing_data/raw_data_for_v9.xlsx', sheet_name='fandom_names')
# fandom_names = pd.read_csv("testing_data/fandom_names.csv")
v6_fic_text_ffn = pd.read_excel('testing_data/raw_data_for_v9.xlsx', sheet_name="v6_fic_text_ffn-dtb")

In [154]:
FANDOM_NAMES = {('1/2 prince'): '1/2_prince',
                ('avatar-'): 'avatar',
                ('attack on titan'): 'attack on titan',
               ("assassin's creed"): 'assassins_creed',
               ('avatar: the last airbender'): 'atla',
               ('miraculous ladybug'): 'miraculous_ladybug',
               ('big hero 6', 'bh6'): 'big_hero_6',
               ('black panther'): 'black_panther',
               ('katekyo hitman reborn', 'khr'): 'katekyo_hitman_reborn',
               ('books of the raksura'): 'books_of_the_raksura',
                ('brooklyn nine-nine', 'b99'): 'brooklyn_99',
                ('captain america'): 'captain_america',
                ('captive prince'): 'captive_prince',
                ('chronicles of narnia'): 'chronicles_of_narnia',
                ('code geass'): 'code_geass',
                ('criminal minds'): 'criminal_minds',
                ('danny phantom'): 'danny_phantom',
                ('dark angel'): 'dark angel',
                ('detroit: become human'): 'detroit_become_human',
                ('disney', 'greek & roman myths',
                    'beauty & the beast',
                    'robin hood', 'rapunzel',
                    'pocahontas','sleeping beauty',
                    'the little mermaid','the secret garden',
                    'aladdin','cinderella','maid maleen',
                    'mulan'): 'folklore',
                ('eyeshield 21'): 'eyeshield_21',
                ('fairy tail'): 'fairy_tail',
                ('x-men'): 'xmen',
                ('fast & furious'): 'fast&furious',
                ('final fantasy vii'): 'final_fantasy_vii',
                ('final fantasy viii'): 'final_fantasy_viii',
                ('final fantasy xv'): 'final_fantasy_xv',
                ('fullmetal alchemist'): 'fullmental_alchemist',
                ('game of thrones', 'got'): 'game_of_thrones',
                ('good omens'): 'good_omens',
                ('gravity falls'): 'gravity_falls',
                ('guardians of the galaxy','gotg'): 'guardians_of_the_galaxy',
                ('gundam wing/ac'): 'gundam_wing/ac',
                ('john wick'): 'john_wick',
                ('joy of life'): 'joy_of_life',
                ('harry potter', "hp"): 'harry_potter',
                ('highschool of the dead'): 'highschool_of_the_dead',
                ('how to train your dragon'): 'httyd',
                ('hunger games'): 'hunger_games',
                ('james bond'): 'james_bond',
                ('jurassic park'): 'jurassic_park',
                ('k anime'): 'k_anime',
                ('kingsmen'): 'kingsman',
                ('kuroko no basuke', 'knb'): 'kuroko_no_basuke',
                ('kung fu panda'): 'kung_fu_panda',
                ('lotr'): 'lord_of_the_rings',
                ('ouran high school host club', 'ohshc'): 'ouran_hshc',
                ('gdc','modao zushi'): 'mdzs',
                ('magi!!! labyrinth of magic'): 'magi_lom',
                ('miraculous ladybug'): 'miraculous_ladybug',
                ('monster hunter'): 'monster_hunter',
                ('moon knight'): 'moon_knight',
                ('one piece'): 'one_piece',
                ('pacific rim'): 'pacific_rim',
                ('percy jackson and the olympians'): 'percy_jackson_olympians',
                ('person of interest'): 'person_of_interest',
                ('phineas and ferb'): 'phineas_and_ferb',
                ('pkmn: sword and shield'): 'pkmn_sword&shield',
                ('pokémon','pkmn'): 'pokemon',
                ('prince of tennis'): 'prince_of_tennis',
                ('princess kaguya'): 'princess_kaguya',
                ('reincarnated as a sword'): 'reincarnated_as_a_sword',
                ('rise of the guardians'): 'rise_of_the_guardians',
                ('solo levelling'): 'solo_levelling',
                ('star wars','sw'): '1234SW',
                ('star wars: the clone wars','star wars: clone wars',
                     'star wars cw','sw: the clone wars'): 'star_wars_cw',
                ('stargate-'): 'stargate',
                ('stargate atlantis'): 'stargate_atlantis',
                ('spn'): 'supernatural',
                ('stranger things'): 'stranger_things',
                ('scum villain', "scum villain's self-saving system"): 'svsss',
                ('sword art online'): 'sword_art_online',
                ('teen wolf'): 'teen_wolf',
                ('the 100'): 'the_100',
                ('the croods'): 'the_croods',
                ('the flash'): 'the_flash',
                ('the hobbit'): 'the_hobbit',
                ('the last of us'): 'the_last_of_us',
                ('the song of achillles'): 'the_song_of_achillles',
                ('the witcher'): 'the_witcher',
                ('tiger & bunny'): 'tiger&bunny',
                ('tokyo ghoul'): 'tokyo_ghoul',
                ('umbrella academy'): 'umbrella_academy',
                ('vampire hunter d'): 'vampire_hunter_d',
                ('yona of the dawn'): 'yona_of_the_dawn',
                ('young hercules'): 'young_hercules',
                ('young justice'): 'young_justice',
                ('yuuri on ice', 'yoi','yuuri on ice!!!'): 'yuuri_on_ice',
                ('gotham'): 'gotham',
                ('daredevil'): 'daredevil',
                ('temeraire'): 'temeraire',
                ('transformers'): 'transformers',
                ('leverage'): 'leverage',
                ('hamilton'): 'hamilton',
                ('fbawtft', 'fantastic beasts and where to find them'): 'fbawtft',
                ('spiderman'): 'spiderman',
                ('avengers'): 'avengers',
                ('smallville'): 'smallville',
                ('twilight'): 'twilight',
                ('thor'): 'thor',
                ('arrow'): 'arrow',
                ('ncis'): 'ncis',
                ('naruto'): 'naruto',
                ('merlin'): 'merlin',
                ('travelers'): 'travelers',
                ('left4dead'): 'left4dead',
                ('megamind'): 'megamind',
                ('rwby'): 'rwby',
                ('minecraft'): 'minecraft',
                ('original'): 'original_work',
                ('bts'): 'bts',
                ('bleach'): 'bleach',
                ('batman'): 'batman',
                ('torchwood'): 'torchwood',
                ('sherlock'): 'sherlock',
                ('descendants 2015'): 'descendants',
                ('bnha', 'mha','boku no hero academia'): 'bnha',
                ('multiple'): 'multiple_fandoms',
               }

def get_key(fandom) -> str:
    """
    Takes a str unclean fandom name.
    Searches through aliases of given fandom to find all aliases/key to FANDOM_NAMES dict.
    Returns that key.
    """
    for key in FANDOM_NAMES.keys():
        if fandom in key:
            return FANDOM_NAMES[key]

def fandom_report(dtb, fic_col, verbose=False) -> int:
    """
    TAKES a dtb - read from csv/xlsx
        str - column name of the fandoms
        boolean - print report
    PURPOSE: Check all fandoms in 'fic_col'. If verbose, print a report: 
        num rows, known fandoms, & unknown fandoms
        list of error index nums
        list of unknown fandoms
    RETURNS 1 if no unknown fandoms & no errors, 0 otherwise
    """
    known_fandoms = []
    unknown_fandoms = []
    clean_fandoms = []
    error_ind = set()
    num_rows = len(dtb)
    
    # for each fandom row
    for ind in dtb.index:
        fandom_str = dtb.loc[ind].loc[fic_col]
        
        # if fandom cell empty
        if pd.isnull(fandom_str):
            error_ind.add(ind)
            continue
        
        # clean fandom string
        fandom_list = fandom_str.replace('*','') \
                            .replace(' x ',',') \
                            .split(',')
        
        # for each fandom in fic
        for old_fandom in fandom_list:
            
            # get clean fandom
            if old_fandom in FANDOM_NAMES.values():
                clean_fandoms.append(old_fandom)
                clean_fandom = old_fandom
            else:
                clean_fandom = get_key(old_fandom)
            
            # if no clean fandom found
            if not clean_fandom:
                unknown_fandoms.append((ind, old_fandom))
            else:
                known_fandoms.append(clean_fandom)
    
    # print report
    unknown_fandom_names = []
    if unknown_fandoms:
        unknown_fandom_names = list(zip(*unknown_fandoms))[1]
    
    num_unclean = len(set(known_fandoms))-len(set(clean_fandoms))
    if verbose:
        print(f'- --- FANDOM REPORT --- -')
        print(f'- # rows/fandoms:           {num_rows}')
        print(f'- # errors (row num):       {len(error_ind)}')
        [print('  ', err) for err in error_ind]
        print(f'- # unique known fandoms:   {len(set(known_fandoms))} (total), \
            {len(set(clean_fandoms))} (clean), {num_unclean} (unclean)')
        print(f'- # unique unknown fandoms: {len(set(unknown_fandom_names))}')
        [print('  ', fname) for fname in set(unknown_fandom_names)]

    if len(error_ind) == 0 and len(set(unknown_fandom_names)) == 0:
        if num_unclean == 0:
            return f"Ideal - all fandoms known & clean"
        return f"Good - all fandoms known, but {num_unclean} unclean"
    return f"Bad - {len(error_ind)} errors and {len(set(unknown_fandom_names))} unknown fandoms"


In [155]:
fandom_report(v6_fic_text_ffn, 'fic_fandom', True)

- --- FANDOM REPORT --- -
- # rows/fandoms:           1638
- # errors (row num):       0
- # unique known fandoms:   106 (total),             33 (clean), 73 (unclean)
- # unique unknown fandoms: 0


'Good - all fandoms known, but 73 unclean'

In [173]:
v6_test = pd.read_csv("testing_data/v6_test_3.csv")

In [176]:
def get_clean_fandom(unclean_fandom) -> str:
    """
    Takes str unclean fandom name.
    Returns str clean fandom name (if found), else returns None.
    """
    key = get_key(unclean_fandom)
    return FANDOM_NAMES.get(key, None)

get_clean_fandom('mha')

In [None]:
def clean_fandom_names(dtb, fandom_col_name, verbose=False):
    """
    Takes a dtb and str name of the fandom column.
    Reads the fandoms in the given dtb -> clean/makes all fandom names consistent to the ones in FANDOM_NAMES.
    Returns str status update.
    """
    for ind in dtb.index:
        fandom_str = dtb.at[ind, fandom_col_name]
        

In [170]:
test_cell = v6_test.at[5,'fic_fandom']
test_cell

"*assassin's creed x sw: the clone wars"

In [171]:
v6_test.at[5,'fic_fandom'] = 'assassins_creed,star_wars_cw'

In [172]:
v6_test.to_csv('testing_data/v6_test_3.csv')