# Planning & Helper Functions
- <b>Name:</b> Sofia Kobayashi
- <b>Date:</b> 12/10/2022
- <b>Description:</b> FFN Project planning & helper functions for ffnv9.1 & ffnv9.2 Jupyter Notebook

#### <b><u>Functions Table of Contents</u></b>
0. [PROJECT PLANNING](#sec0)
1. [Cleaning & Incoming Data functions](#sec1)
1. [Display & Interface functions](#sec2)
1. [Search functions](#sec3)
1. [Meta-functions: testing & report](#sec4)
1. [Fic functions](#sec5)

1. [Misc. functions](#sec6)
1. [Single-use functions](#sec7)
1. [Cool Code](#sec8)

In [1]:
# IMPORT STATEMENTS & GLOBAL VARIABLES
import re
import json
from datetime import datetime

known_work_types = ["works","collections","series","users","tags","search","external_works","comments","chapters"]
masterNoDupUrls = "MASTER_noDupURLs.json"
masterNoDupWorks = "MASTER_noDupWorks.json"
masterOthers = "MASTER_others.json"

<a id="sec0"></a>
## OVERALL PROJECT PLANNING (ffnv9.2 & beyond)
- the 4 goal lists in this cell go top -> bottom: concrete/small scale -> overall/big scale


**CURRENT: #1** Got all 

<u>Steps for now</u>
1. Compile all text only fics, add url ones to separate file
1. Compile all URLs to 1 JSON file, each url has: url, dateAdded, dateLastViewed
    - get all, sort out other
1. Light analysis
    - count how many times each url appears for Safari reading list vs # times work appears
    - get all old urls, see how many are missing from current readinglist
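
The URL record layout and the occurrence count from the steps above could be sketched roughly like this. The helper name `build_url_records` and the placeholder sample URLs are hypothetical; only the field names `url`, `dateAdded`, and `dateLastViewed` come from the plan.

```python
from collections import Counter
from datetime import datetime

def build_url_records(urls, date_added=None):
    """Hypothetical helper: returns (records, counts).
    records = one dict per unique URL, in first-seen order.
    counts  = Counter of how many times each URL appeared."""
    stamp = (date_added or datetime.now()).strftime("%m-%d-%y")
    counts = Counter(urls)  # Counter keeps first-seen order of its keys
    records = [{"url": u, "dateAdded": stamp, "dateLastViewed": None} for u in counts]
    return records, counts

# placeholder URLs, for illustration only
sample = ["https://archiveofourown.org/works/1",
          "https://archiveofourown.org/works/2",
          "https://archiveofourown.org/works/1"]
records, counts = build_url_records(sample, datetime(2022, 12, 10))
```

`records` can be passed straight to `json.dump`, and `counts` covers the "how many times each url appears" part of the light analysis.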


- <u>Work Types:</u> fic (regular, chapters, colWorks, external_work), user, collection, series, other (search, tag, comments)
    - other set-up?: work_type, reason_listed (look into, fav, watch?, etc.), url


**Data Section Goals**
1. Compiling all URLs and all text fics **[to do]**
    1. [ ] Compile all text fics
        - v1-6 (in v7), temp updates? 
    1. [x] Compile all URLs from all lists
        - x Safari reading lists (with date added) 
        - x old local URL txt files
        - x Chrome reading lists (date added too?)
        - x temp updates/FFN DTB (v7&8)?
    1. Clean, combine, de-dup
    1. Light analysis
2. Creating database architecture **[to do]**
    1. [ ] Make flow chart for databases
        - Fics: read, to read/cont read, coffee, look into
        - Non-fics: Series, authors, collections, other (search, tag, fandom, comments, etc.)
            - these non-fics will also get a label (look into,  
    1. [ ] Detangle fics-series-collections-look intos 
3. Getting all possible fic data (clean & formatted) **[finish #2]**
    1. [ ] Format all urls -> databases (aka update database process)
        1. Pick out all non-AO3 & external works
        1. Sort remaining by work_types
            - Put into appropriate database
        1. Get all info for each database (using AO3 API)
    1. [ ] Find way to convert text fic info -> url -> databases
        - add their info too
4. Design my-info input process **[finish #3]**
    1. [ ] Decide what info I want to store for each work_type
    1. [ ] Create functions for me to see fic/work and add my info, then updates database
        - filling in data: full ffn, quick sort? (star, reread, etc.), temp update, coffee
        - changing/update works: moving fic from to read -> ffn or something else, look into -> something else
5. Creating intake system to quickly add new URLs **[finish #3/4]**
    1. [ ] Make function that gets current Safari reading list & updates databases
    1. [ ] Make function that takes a txt file of URLs & updates databases
6. [ ] Fill-in info **[finish #4]**
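
One possible shape for goal 3's "pick out non-AO3, then sort the rest by work type" step, assuming the first path segment of an AO3 URL is a good enough type label at this stage. The helper name `sort_urls` and the `example.com` URL are made up for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse

def sort_urls(urls):
    """Hypothetical helper: returns (non_ao3, by_type).
    non_ao3 = URLs whose host is not archiveofourown.org.
    by_type = dict of first path segment -> list of AO3 URLs."""
    non_ao3 = []
    by_type = defaultdict(list)
    for url in urls:
        parsed = urlparse(url)
        if not parsed.netloc.endswith("archiveofourown.org"):
            non_ao3.append(url)
            continue
        segments = [s for s in parsed.path.split("/") if s]
        by_type[segments[0] if segments else "unknown"].append(url)
    return non_ao3, dict(by_type)

non_ao3, by_type = sort_urls([
    "https://archiveofourown.org/works/22269148/chapters/53178208",
    "https://archiveofourown.org/external_works/637417",
    "https://example.com/story",  # placeholder non-AO3 URL
])
```

The `external_works` group can then be pulled out of `by_type` alongside `non_ao3`, as the plan suggests.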
    


**Implementation Goals**
1. Data Section - get all the fic info & my rating info, sort into DTB categories, be able to update it easily **[current]**
1. Analysis Section - analyze my own reading habits?
1. Display Section - display & search for fics & fic lists (website? app?) 
1. Product/Future Work Section - make something with data (personal algorithm? fandom analysis)
    - Personal fic finding algorithm (on existing fics & one for new/upcoming fics)
    - Fandom analysis/web crawling functions    
 


**Guiding Goals for Overall FFN Project:**
1. Honor favorite fics
2. Keep track & easily find favorites/read lists
3. Find new ones to read/reread easier
4. Get data on fics, fandoms & my own reading habits

<a id="sec1"></a>
## Cleaning & Incoming Data Functions

### getTypeAndId() Possibilities
<u>input - url possibilities</u>
- regular: work, series, authors, collections, tags
- colWork, colWork-deprecated, search (with ?), chapters (deprecated), external_works


<u>output - type possibilities</u>
- works - id num
- series - id num
- external_works - id num
- comments - id num
- chapters - id num
- - 
- collections - name
- users - name
- tags - tag string
- search - long search string
- - 
- (colWork) collections:colName - id num


<u>work types</u>
1. work: regular, chapters, colWorks
1. user
1. collection
1. series
1. search
1. tag
1. comments
1. other: non-ao3, external_work
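
A small sketch of how the output types could be folded into the work types listed above. `work_category` is a hypothetical helper, not part of the notebook, and sending unknown types to `"other"` is an assumption.

```python
def work_category(type_str):
    """Hypothetical helper: maps a getTypeAndId() type string to a work-type category."""
    # colWorks come back as "collections:colName" and count as regular works
    if type_str.startswith("collections:"):
        return "work"
    categories = {
        "works": "work",
        "chapters": "work",
        "users": "user",
        "collections": "collection",
        "series": "series",
        "search": "search",
        "tags": "tag",
        "comments": "comments",
        "external_works": "other",
    }
    # assumption: anything unrecognized (e.g. non-AO3) falls into "other"
    return categories.get(type_str, "other")
```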

In [4]:
# testing getting type & id from: 
url_0 = 'https://archiveofourown.org/collections/WorksOfGreatQualityAcrossTheFandoms/works/29387814'
url_1 = 'https://archiveofourown.org/works/22269148/chapters/53178208'
url_2 = 'https://archiveofourown.org/users/miscellea/pseuds/The%20Feels%20Whale'
url_3 = 'https://archiveofourown.org/collections/TheCrackheadBible/works?commit=Sort+and+Filter&include_work_search%5Bfandom_ids%5D%5B%5D=3828398&page=6&utf8=%E2%9C%93&work_search%5Bcomplete%5D=&work_search%5Bcrossover%5D=&work_search%5Bdate_from%5D=&work_search%5Bdate_to%5D=&work_search%5Bexcluded_tag_names%5D=&work_search%5Blanguage_id%5D=&work_search%5Bother_tag_names%5D=&work_search%5Bquery%5D=kudos%3A+%26gt%3B1000&work_search%5Bsort_column%5D=kudos_count&work_search%5Bwords_from%5D=25000&work_search%5Bwords_to%5D='
url_4 = 'https://archiveofourown.org/external_works/637417'
url_5 = 'https://archiveofourown.org/tags/Danny%20Phantom/works'
url_6 = "https://archiveofourown.org/collections/Clever_Crossovers_and_Fantastic_Fusions"
url_7 = "https://archiveofourown.org/works?commit=Sort+and+Filter&work_search%5Bsort_column%5D=kudos_count&work_search%5Bother_tag_names%5D=&work_search%5Bexcluded_tag_names%5D=&work_search%5Bcrossover%5D=&work_search%5Bcomplete%5D=&work_search%5Bwords_from%5D=&work_search%5Bwords_to%5D=&work_search%5Bdate_from%5D=&work_search%5Bdate_to%5D=&work_search%5Bquery%5D=&work_search%5Blanguage_id%5D=&tag_id=L%C3%A1n+Q%C7%90r%C3%A9n*s*M%C3%A8ng+Y%C3%A1o+%7C+J%C4%ABn+Gu%C4%81ngy%C3%A1o"
url_8 = "https://archiveofourown.org/works/search?utf8=%E2%9C%93&commit=Search&work_search%5Bquery%5D=&work_search%5Btitle%5D=Assembly+of+Pain%2C+Happiness%2C+%26+Feelings.&work_search%5Bcreators%5D=&work_search%5Brevised_at%5D=&work_search%5Bcomplete%5D=&work_search%5Bcrossover%5D=&work_search%5Bsingle_chapter%5D=0&work_search%5Bword_count%5D=&work_search%5Blanguage_id%5D=&work_search%5Bfandom_names%5D=&work_search%5Brating_ids%5D=&work_search%5Bcharacter_names%5D=TommyInnit+%28Video+Blogging+RPF%29&work_search%5Brelationship_names%5D=&work_search%5Bfreeform_names%5D=&work_search%5Bhits%5D=&work_search%5Bkudos_count%5D=&work_search%5Bcomments_count%5D=&work_search%5Bbookmarks_count%5D=&work_search%5Bsort_column%5D=_score&work_search%5Bsort_direction%5D=desc#:~:text=Works%20List-,Assembly%20of%20Pain%2C%20Happiness%2C%20%26%20Feelings.,-by%20RandomlySane"
url_9 = "https://archiveofourown.org/chapters/747149?show_comments=true"
url_10 = "https://archiveofourown.org/collections/asoiaftimetraveltransmigration/works/29620161"
url_11 = "https://archiveofourown.org/bookmarks?commit=Sort+and+Filter&bookmark_search%5Bsort_column%5D=created_at&include_bookmark_search%5Brelationship_ids%5D%5B%5D=27817261&bookmark_search%5Bother_tag_names%5D=&bookmark_search%5Bother_bookmark_tag_names%5D=&bookmark_search%5Bexcluded_tag_names%5D=&bookmark_search%5Bexcluded_bookmark_tag_names%5D=&bookmark_search%5Bbookmarkable_query%5D=&bookmark_search%5Bbookmark_query%5D=&bookmark_search%5Blanguage_id%5D=&bookmark_search%5Brec%5D=0&bookmark_search%5Bwith_notes%5D=0&user_id=kyme"
url_12 = "https://archiveofourown.org/collections:TheCrackheadBible/15774906"
url_13 = "https://archiveofourown.org/tags/esama"

import re
def getTypeAndId(url):
    """Give an AO3 url. Returns a tuple with (type-of-work, work-id). Type = works, series, tags, etc.
    If type = 'collections', assumed to be a colWork, returns (collections:colName, workId).
    Depends on: re"""
    # Check if it's a search result
#     print(url)
    # [search url]
    if ("works?" in url) or ("search?" in url) or ("bookmarks?" in url):
        pattern = re.compile('archiveofourown.org/(.+)')
        search = pattern.findall(url)[0]
        return("search", search)
    
    # reformatting deprecated colWork urls
    if "collections:" in url:
        pattern = re.compile(r"archiveofourown.org/collections:(.+)/(\d+)")
        search = pattern.findall(url)[0]
        url = f'https://archiveofourown.org/collections/{search[0]}/works/{search[1]}'
    
    # Find work type
    pattern = re.compile(r"(archiveofourown.org/)(\w+)/")
    info = pattern.findall(url)
    wType = info[0][1]

    # Check if it's an unknown type
    if wType not in known_work_types:
        raise Exception(f'WorkType not in global variable known_work_types!\n- url: {url}\n- output type: {wType}')
    
    # *** FIND TYPE & ID ***
    # If work type is 'collections', I think it has to be a colWork (different URL format)
    if wType == "collections":
        pattern = re.compile(r"archiveofourown.org/collections/(\w+)/works/(\d+)")
        info2 = pattern.findall(url)
        
        # Check if it's a colWork
        # [colWork or collections url]
        if info2 == []:
            pattern = re.compile("archiveofourown.org/collections/(.+)$")
            info3 = pattern.findall(url)

            # Check if it's NOT a colWork or collection
            if info3 == []:
                raise Exception(f'type="collections", but not a colWork or collection!\n- url: {url}\n- output type: {wType}')
             
            #Return collection data
            return (wType, info3[0])
        
        # Return colWork data
        colName = info2[0][0]
        wId = info2[0][1]
        return(f"{wType}:{colName}", wId)

    # [users url]
    elif wType == "users":
        pattern = re.compile(r"archiveofourown.org/users/(\w+)")
        authorName = pattern.findall(url)[0]
        return (wType, authorName)
    
    # [tags url]
    elif wType == "tags":
        pattern = re.compile("archiveofourown.org/tags/(.+)")
        tag = pattern.findall(url)[0]
        return (wType, tag)
    
    # Else, return type & idNum
    # [works, series, external_works, comments, chapters url]
    else:
        pattern = re.compile(r"(archiveofourown.org/)(\w+)/(\d+)")
        info = pattern.findall(url)
        return (wType, info[0][2])

# for i in range(12):
#     print(getTypeAndId(url_12))

# getTypeAndId(url_13)


In [5]:
import os 

def combineToTxt(dirPath):
    """
    Takes a string-path to a directory full of TXT files. Function then combines all files into 1 TXT file;
    it only concatenates the text, no de-duping.
    """
    # Get date
    now = datetime.now().strftime("%m-%d-%y")
    
    # Get all files in given directory
    allFiles = get_all_files(dirPath)
    others = []

    # Get all lines in files
    for file in allFiles:
        # read in file
        with open(f"{dirPath}/{file}", "r") as infile:
            for line in infile:
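                line = line.strip() # remove surrounding whitespace (incl. \n)
                others.append(line + "\n") # re-add newline so a file's last URL never fuses with the next file's first

    # Write to txt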
    with open(f"txtOutput_{now}.txt","w") as outfile:
        outfile.writelines(others)
    
    return f"txtOutput_{now}.txt"

In [6]:
import os 

def combineToJson(dirPath):
    """Takes a string-path to a directory full of TXT files. Function then combines all files into 1 JSON file."""
    # Get date
    now = datetime.now().strftime("%m-%d-%y")
    
    # Get all files in given directory
    allFiles = get_all_files(dirPath)
    others = []

    # Get all lines in files
    for file in allFiles:
        # read in file
        with open(f"{dirPath}/{file}", "r") as infile:
            for line in infile:
                line = line.strip() #to get rid of \n  
                others.append(line)

    # Write to json
    with open(f"jsonOutput_{now}.json","w") as outfile:
        json.dump(others, outfile)
        
    return f"jsonOutput_{now}.json"

# combineToJson("urlsOutput")

In [2]:
import json

def txtToJson(fileName):
    """
    Takes a TXT file name & creates a JSON file from the contents.
    Returns name of output json file.
    """
    urls = []
    with open(fileName, "r") as infile:
        for line in infile:
            line = line.strip()
            urls.append(line)
    
    with open(f"{fileName.replace('.txt','')}.json", "w") as outfile:
        json.dump(urls, outfile)
    
    return f"{fileName.replace('.txt','')}.json"

# txtToJson("urlsOutput/v8_chrome.txt")

In [8]:
import os
import json
from datetime import datetime

def add_to_masterfiles(urlFile):
    """
    Takes ONE txt file of URLs, appends new urls (probably from a new reading list) to the 3 MASTER json files:
    MASTER_noDupURLs, MASTER_noDupWorks, MASTER_others. Pair with `combineToTxt(dirPath)` to convert whole 
    folders of TXT files.
    Returns 'success' if successful. 
    """
    # Get current date & initialize 3 lists
    now = datetime.now()
    date_str = f"<Added: {now.strftime('%m-%d-%y %H:%M:%S')}>"
    
    
    # Initialize variables
    files = ["MASTER_noDupURLs.json", "MASTER_noDupWorks.json", "MASTER_others.json"]
    
    for file in files:
        if not os.path.isfile(file):
            with open(file,"w") as outfile:
                json.dump([], outfile)
            print(f"Made {file}")
    
    newNoDupURLs = []
    newNoDupWorks = []
    newOthers = []

    
    # Read in original files
    with open(files[0], "r") as infile:
        noDupURLs = json.load(infile)
    
    # rules a little different for noDupWorks bc it's formatted: [[typeId, url], ...]
    with open(files[1], "r") as infile:  
        noDupWorks = json.load(infile)
        # keep only the typeId of each [typeId, url] pair, skipping the date-stamp strings
        typeIdList = [entry[0] for entry in noDupWorks if isinstance(entry, list)]
        
    with open(files[2], "r") as infile:
        others = json.load(infile)

    totalLen = 0
    # Read in new URLs 
    with open(urlFile, "r") as infile:
        for line in infile:
            line = line.strip() #to get rid of \n
#             print(line) #DID SOMETHING GO WRONG?
            # if not an AO3 url
            # 1. others filter
            if "archiveofourown.org" not in line:
                if line not in others:
                    newOthers.append(line)

            else:
                # 2. noDupUrls filter
                if line not in noDupURLs:
                    newNoDupURLs.append(line)

                # 3. noDupWorks filter
                typeId = list(getTypeAndId(line))
                if typeId not in typeIdList:
                    typeIdList.append(typeId)
                    pair = [typeId, line]
                    newNoDupWorks.append(pair)
            totalLen += 1
    
    # Format & Write newly added-to files
    fileTypes = [[noDupURLs, newNoDupURLs, files[0]], 
                 [noDupWorks, newNoDupWorks, files[1]], 
                 [others, newOthers, files[2]]]
        
    for original, new, file in fileTypes:
        original.append(date_str) # add date stamp
        original.extend(new) # add new URLs
        
        # Write newly appended lists
        with open(file, "w") as outfile:
            json.dump(original, outfile)
        
    # print addition report
    print(f"There were {totalLen} url(s) in '{urlFile}'")
    print(f"Added {len(newNoDupURLs)} url(s) to MASTER_noDupURLs.json")
    print(f"Added {len(newNoDupWorks)} url(s) to MASTER_noDupWorks.json")
    print(f"Added {len(newOthers)} url(s) to MASTER_others.json")
    
    return "success"

    
# add_to_masterfiles("txtOutput_12-25-22.txt")

In [9]:
def dir_to_masterfiles(dirPath):
    """
    Takes a directory (full of URL TXT files), makes one combined TXT file, then adds all those URLs to the
    MASTER files.
    Returns nothing.
    """
    txtFile = combineToTxt(dirPath)
    result = add_to_masterfiles(txtFile)
    print(f"{result.title()}!")

# dir_to_masterfiles("urlsOutput")

In [10]:
def readinglist_to_masterfiles():
    """
    Makes TXT file from Safari reading list, adds all those urls to master files.
    Returns nothing.
    """
    txtFile = getReadingList()
    result = add_to_masterfiles(txtFile)
    print(f"{result.title()}!")

# readinglist_to_masterfiles()

<a id="sec2"></a>
## Display & Interface functions

In [11]:
def getCorrectInput(allowedList):
    """Gets & returns user input, but keeps prompting the user until the input is within allowedList."""
    # run loop until input is within allowedList
    passes = False
    while not passes: 
        passes = True
        ans = input()
        if ans not in allowedList:
            passes = False
    
    return ans

<a id="sec3"></a>
## Search functions

In [12]:
if False: 
    import difflib
    difflib.get_close_matches("apple", ["apl", "app", "bee", "cici"], n=3, cutoff=0.6)
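
A possible way to turn the `difflib` experiment above into a title search, sketched under the assumption that titles are plain strings and matching should ignore case. `search_titles` is a hypothetical name.

```python
import difflib

def search_titles(query, titles, n=3, cutoff=0.6):
    """Hypothetical helper: case-insensitive fuzzy search over a list of titles.
    Returns up to n original-cased titles, best match first."""
    # map lowercased title -> original title (first occurrence wins)
    lowered = {}
    for title in titles:
        lowered.setdefault(title.lower(), title)
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=n, cutoff=cutoff)
    return [lowered[m] for m in matches]

titles = ["Shifters", "Assembly of Pain, Happiness, & Feelings."]
found = search_titles("shiftrs", titles)
```

Raising `cutoff` makes the search stricter; lowering it tolerates more typos but returns more false matches.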

<a id="sec4"></a>
## Meta-functions: Testing & Report

In [13]:
import json

def reportMasters():
    for file in [masterNoDupUrls, masterNoDupWorks, masterOthers]:
        with open(file) as infile:
            data = json.load(infile)
            print(f"{file} has {len(data)} url(s)")
            
# reportMasters()
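
Because `add_to_masterfiles()` appends an `<Added: ...>` date stamp to each MASTER list on every run, `len(data)` counts those stamps along with the URLs. A minimal sketch of a stamp-aware count follows; `count_urls` is a hypothetical helper and the sample entries are illustrative.

```python
def count_urls(entries):
    """Hypothetical helper: counts entries in a loaded MASTER list,
    skipping the '<Added: ...>' date-stamp strings."""
    return sum(
        1 for entry in entries
        if not (isinstance(entry, str) and entry.startswith("<Added:"))
    )

# illustrative mix of a stamp, a plain URL, and a [typeId, url] pair
sample_entries = [
    "<Added: 12-25-22 10:00:00>",
    "https://archiveofourown.org/external_works/637417",
    [["works", "22269148"], "https://archiveofourown.org/works/22269148/chapters/53178208"],
]
n_urls = count_urls(sample_entries)
```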

<a id="sec5"></a>
## Fic functions

<a id="sec6"></a>
## Misc. functions

In [2]:
import AO3

def my_session():
    """
    Returns an AO3 session logged in to my account.
    """
    # read credentials (line 1: username, line 2: password); file closes automatically
    with open("randomData/to_add_authors.txt", "r") as payload:
        user = payload.readline().strip()
        password = payload.readline().strip()
    
    sess = AO3.Session(user, password)
    sess.refresh_auth_token()
    
    return sess

In [14]:
from os import listdir

def get_all_files(dirName): 
    """Takes a string - name of directory. Returns list of ALL files within that directory minus the .DS_Store"""
    allFiles = [f for f in listdir(dirName)]
    if ".DS_Store" in allFiles:
        allFiles.remove(".DS_Store")
    return allFiles

# get_all_files("urlsOutput")

In [15]:
def getReadingList():
    """extracturls.py ~ This script gets a list of all the URLs in Safari Reading List, and
    writes them all to a file. Requires Python 3. ~ from someone on StackOverflow"""
    #!/usr/bin/env python
    import os
    import plistlib

    # Get current date 
    now = datetime.now()
    current_date = now.strftime("%m-%d-%y")

    # set file paths
    INPUT_FILE  = os.path.join(os.environ['HOME'], 'Library/Safari/Bookmarks.plist')
    OUTPUT_FILE = f"readinglist_{current_date}.txt"

    # Load and parse the Bookmarks file
    with open(INPUT_FILE, 'rb') as plist_file:
        plist = plistlib.load(plist_file)

    # Look for the child node which contains the Reading List data.
    # There should only be one Reading List item
    children = plist['Children']
    for child in children:
        if child.get('Title', None) == 'com.apple.ReadingList':
            reading_list = child

    # Extract the bookmarks
    bookmarks = reading_list['Children']

    # For each bookmark in the bookmark list, grab the URL
    urls = (bookmark['URLString'] for bookmark in bookmarks)

    # Write the URLs to a file
    with open(OUTPUT_FILE, 'w') as outfile:
        outfile.write('\n'.join(urls))
    
    print(f"Wrote to {OUTPUT_FILE}")
    return OUTPUT_FILE

In [16]:
def getOnlyAO3(fileName):
    """DEPRECATED - Takes a TXT file of URLs & returns 2 lists: [[AO3 links], [non-AO3 links]]"""
    # Read in file of URLs
    with open(fileName, "r") as infile:
        lines = infile.readlines()
    
    # Sort URLs into archive & non-archive lists 
    archive = []
    notArchive = []
    for line in lines:
        if "archiveofourown.org" in line:
            archive.append(line)
        else:
            notArchive.append(line)

    return [archive, notArchive]
    
    
#     # Write archive URLs to file
#     fileNice = fileName.split("/")[-1] \
#                         .replace(' ','-')
#     archiveFile = f"archive_{fileNice}"
#     with open(archiveFile, "w") as outfile:
#         outfile.writelines(archive)
#         print(f"Wrote {len(archive)} AO3 link(s) to {archiveFile}")
    
#     # Write non-archive URLs to file
#     notArchiveFile = f"notArchive_{fileNice}"
#     with open(notArchiveFile, "w") as outfile:
#         outfile.writelines(notArchive)
#         print(f"Wrote {len(notArchive)} non-AO3 links to {notArchiveFile}")

<a id="sec7"></a>
## Single-use functions

In [17]:
def makeCol():
    """Single use - Made Collection Works URLs from (idNum, collectionName) tuples."""
    # Reads in numColWorks TXT file
    res = []
    with open("urls/numColWorks.txt", "r") as inFile:
        lines = inFile.readlines()
        
    # Creates ColWorks URLs from the id & name
    for line in lines:
        data = line.strip().split(",")
        col = data[1]
        idNum = data[0]
        url = f"https://archiveofourown.org/collections/{col}/works/{idNum}"
        print(url)
        res.append(url + "\n") # collect the URL so it actually gets written below

    # Writes ColWorks URLs
    fileName = f"urlsOutput/from_{'numColWorks'.lower().replace(' ','-')}.txt"
    with open(fileName, "w") as outFile:
        print(f"Writing to {fileName}")
        outFile.writelines(res)

In [18]:
def addToStart(infile, strToAdd):
    """Single Use (kinda) - Takes a TXT file & string. Adds given string to the front of each line in the file, 
    writes a new file named 'from_{given file}'."""
    # Reads in given file
    res = []
    with open(f"urls/urlFiles/{infile}", "r") as inFile:
        lines = inFile.readlines()
        
    # Attaches given string to front
    for line in lines:
        res.append(strToAdd+line)

    # Writes result to a new 'from_' file (the given file is left untouched)
    with open(f"urlsOutput/from_{infile.lower().replace(' ','-')}", "w") as outFile:
        outFile.writelines(res)

In [19]:
# separate dual-urls (FROM 1_dtbs_from_prev)
if False: 
    print(f"len pre: {len(pre)}")
    url = "https://archiveofourown.org/works/9841367/chapters/22088246https://archiveofourown.org/works/76861"

    import re
    for url in list(pre): # iterate over a copy, since pre is modified inside the loop
        pattern = re.compile('.+(https://archiveofourown.org/.+)')
        search = pattern.findall(url)
        if len(search) > 0:
            start = url.find(search[0])
            url1 = url[:start]
            url2 = url[start:]
            pre.remove(url)
            pre.append(url1)
            pre.append(url2)
        
# print(f"len pre after: {len(pre)}")
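
An alternative sketch for the fused-URL cleanup that splits one line into its URLs without editing the list in place. It assumes every URL starts with `http://` or `https://` and that no URL legitimately embeds another full URL (e.g. in a redirect parameter). `split_fused_urls` is a hypothetical name.

```python
import re

def split_fused_urls(line):
    """Hypothetical helper: returns the list of URLs found in a line,
    splitting wherever a new 'http(s)://' begins."""
    # lazily match each URL up to the next 'http(s)://' or the end of the line
    return re.findall(r"https?://.*?(?=https?://|$)", line)

fused = "https://archiveofourown.org/works/9841367/chapters/22088246https://archiveofourown.org/works/76861"
parts = split_fused_urls(fused)
```

A list can then be rebuilt in one pass, for example `pre = [u for line in pre for u in split_fused_urls(line)]`.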

In [20]:
# single-use, converting v1-6 text fics to json from txt 
if False: 
    v16_txt_fics = []
    with open("urlsOutput/v1-6_txt/all_early.txt", "r") as infile:
        for line in infile:
            line = line.strip()
            if 'ver.' not in line.lower():
                info = line.split('---')

                temp = {}
                temp["version"] = int(info[0])
                temp["title"] = str(info[1])
                v16_txt_fics.append(temp)


    with open(f"urlsOutput/v1-6_txt/all_early_1.json", "w") as outfile:
        json.dump(v16_txt_fics, outfile)

In [21]:
# single-use, making CSV file for unsorted txt fics (v1-6)
if False: 
    with open("urlsOutput/v1-6_txt/all_early.json", "r") as infile:
        early_unsorted = json.load(infile)

    t1 = pd.DataFrame(early_unsorted)

    def temp(x):
        if "series" in x.lower(): return "series"
        else: return 'fic'

    def temp2(x):
        str_date = version_default_dates[int(x)]
        date_date = datetime.strptime(str_date, '%m-%d-%Y %H:%M:%S')
        return date_date

    t1["work_type"] = t1["title"].apply(temp)
    t1["smk_source"] = t1["version"].apply(lambda x: f"v{x}_list")
    t1["dtb_type"] = "read"
    t1["date_added"] = t1["version"].apply(temp2)
    t1['is_sorted'] = False

    # t1.to_csv('all_early_1.csv')

In [22]:
# adding fic url to fic dtb
import numpy as np
import pandas as pd
def urlToFicDTB(url):
    """
    Outdated - Takes an AO3 fic url, creates a new row with appropriate info.
    Returns that new row with generated info.
    """
    row_data = {'dtb_type': ["read"], 
               'smk_source': ["v7_sheets"], 
               'version': [7], 
               'date_added': [datetime.strptime("01-01-22 00:00:01", "%m-%d-%y %H:%M:%S")], 
               'date_last_viewed': [np.datetime64("NaT")],
               'url_type': [getTypeAndId(url)[0]], 
               'id': [getTypeAndId(url)[1]], 
               'url': [url]
              }

    new_row = pd.DataFrame(data=row_data)
    return new_row


# total_4 = pd.concat([total_4, new_row], axis=0, ignore_index=True)

In [23]:
def new_combine(df1, df2, debug=False):
    """
    Takes 2 data frames, de-dups them, & updates (not just .update) df1 with df2.
    Since .update aligns on index, must fill both res & df2 with all rows.
    Returns updated df.
    """
    # concat & de-dup (keeping first), sort & reset index (since update aligns on index)
    res = pd.concat([df1, df2]) \
            .drop_duplicates(subset=["url"],keep="first") \
            .sort_values("url") \
            .reset_index()
    
    if debug == True: print(f"-- {res.smk_source.value_counts()}")
    
    df2 = pd.concat([df2, df1]) \
            .drop_duplicates(subset=["url"],keep="first") \
            .sort_values("url") \
            .reset_index()
    if debug == True: print(f"-- {df2.smk_source.value_counts()}")
    
    
    # update with df2
    res.update(df2)
    res = res.drop(columns=["index"])
    
    #ensure no duplicates
    if len(res) != len(res.drop_duplicates(subset=["url"])):
        raise Exception(f"Duplicates? {len(res)} in res, {len(res.drop_duplicates(subset=['url']))} after drop_duplicates")
    
    return res


In [242]:
def updateInfo(dtb, keyColName, keyInfo, newColName, newInfo, debug=True):   
    """
    Takes a DF & 2 pairs of parameters: 1. colName & value of 'key' to identify row to be changed and 
                                        2. colName & value of info to get updated in that row
    Key must be unique.
    Returns 0 if failed, 1 if success.
    """
    ind = dtb.index[dtb[keyColName] == keyInfo].tolist()
    if len(ind) == 0:
        if debug: print(f"- ERROR: '{keyInfo}' not found in '{keyColName}' column!")
        return 0
    elif len(ind) > 1:
        if debug: print(f"- ERROR: multiple of given key info found in dtb! Key must be unique!")
        return 0
    else:
        before = dtb.at[ind[0], newColName]
        dtb.at[ind[0], newColName] = newInfo
        if debug: print(f"Updated cell [#{ind[0]}, '{newColName}']: '{before}' -> '{newInfo}'")
        return 1

# updateInfo(t1, 'url', "https://archiveofourown.org/series/1995952", "name", "Shifters") 

<a id="sec8"></a>
## Cool Code/Examples

### Querying Examples - SQL & .query()

In [25]:
# Querying examples
if False: 
    # pd.df.query testing
    work_t = "series"
    id_n = '1029669'
    col = "url"

    df1.query(f"url_type == '{work_t}' and id == '{id_n}'").iloc[0][col]


    # pd sql testing
    import pandas as pd
    import pandasql as ps

    data = df1
    sql_query = "SELECT * FROM data WHERE work_type == 'work' AND date_added == '2022-01-11'"
    ps.sqldf(sql_query, locals())

### DF Examples

In [26]:
if False: 
    import numpy as np
    import pandas as pd

    no = np.datetime64("NaT")

    df1 = pd.DataFrame({"url": ["a","b","c","e",'f'],
                       "date_added": [1,np.nan,3,7,8],
                       "date_last_viewed": [no,no,no,no,no],
                       "dtb_type":[np.nan,np.nan,np.nan,np.nan,np.nan],
                       "smk_source": ["v8_old","v8_old","v8_old","v8_old","v8_old"]})
    df2 = pd.DataFrame({"url": ["a","b","c","d"],
                   "date_added": [4,2,3,6],
                   "date_last_viewed": [5, no, no, 9],
                   "dtb_type": ["read", "lookInto", np.nan, "toRead"],
                   "smk_source": ["chrome","chrome","chrome","chrome"]})

### BeautifulSoup example

In [27]:
# BS example
if False:
    from bs4 import BeautifulSoup
    import requests

    url = "https://archiveofourown.org/chapters/17156026?show_comments=true"

    # Getting HTML of the AO3 page
    html_text = requests.get(url).text
    soup = BeautifulSoup(html_text, "lxml")
    soup.find('input', attrs={'id': 'kudo_commentable_id','type':'hidden'})["value"]

In [28]:
print("Success!")

Success!


In [None]:
def fill_series_dtb_original(initial_series_dtb, session, report=False):
    """
    Takes a seriesDTB (pandas DataFrame, post series-0), an AO3 session, and a Boolean to print report.
    Modifies given seriesDTB by filling it up with ao3 information.
    Returns nothing.
    """
    # find total number of series to fill
    total = max(initial_series_dtb.index)
    
    # ensure initial_series_dtb has all necessary columns
    for col in series_columns:
        if col not in initial_series_dtb.columns:
            initial_series_dtb[col] = np.nan

    for ind in initial_series_dtb.index: # for every row in initial_series_dtb
        try: 
            if not report:
                if ind%100 == 0: 
                    print(f'- {ind}! (printed every 100)')

            if not series_row_complete(initial_series_dtb.iloc[[ind]]): # if any null value in rows (sans last_viewed & is_subbed)
                # get series id
                series_id = initial_series_dtb.at[ind, "id"]
                if report: print(f"{ind}: [{(ind/total)*100:.2f}%] Filling for [{series_id}]")

                # initialize Series obj
                series = AO3.Series(series_id, session=session)

                # write report info
                name = series.name
                creators = json.dumps([user.username for user in series.creators])
                fandoms = json.dumps(getSeriesFandoms(series))

                initial_series_dtb.at[ind, "name"] = name
                initial_series_dtb.at[ind, "creators"] = creators
                initial_series_dtb.at[ind, "fandoms"] = fandoms
                if report: print(f"- Wrote '{name}' by {creators}\nin {fandoms}")

                # write remaining info
                initial_series_dtb.at[ind, "series_obj"] = series
                initial_series_dtb.at[ind, "date_obj_updated"] = datetime.now()

                initial_series_dtb.at[ind, "description"] = series.description
                initial_series_dtb.at[ind, "notes"] = series.notes
                initial_series_dtb.at[ind, "words"] = series.words
                initial_series_dtb.at[ind, "complete"] = series.complete
                initial_series_dtb.at[ind, "is_subscribed"] = series.is_subscribed

                initial_series_dtb.at[ind, "series_begun"] = series.series_begun
                initial_series_dtb.at[ind, "series_updated"] = series.series_updated
                initial_series_dtb.at[ind, "nbookmarks"] = series.nbookmarks
                initial_series_dtb.at[ind, "nworks"] = series.nworks
                initial_series_dtb.at[ind, "work_list"] = json.dumps([work.id for work in get_series_work_list(series)])

                initial_series_dtb.at[ind, "is_restricted"] = series._soup.find("img", {"title": "Restricted"}) is not None
                initial_series_dtb.at[ind, "not_found"] = False
            else: 
                if report: print(f"{ind}: .")
            
        except Exception as e:
            initial_series_dtb.at[ind, "not_found"] = True
            print(f"-- ERROR, {ind}: {initial_series_dtb.at[ind, 'id']} ({e})")
        
        # update temp csv w/ new row/series
        initial_series_dtb.to_csv("temp_series.csv")

    # Write finished series DTB to csv
    initial_series_dtb.to_csv("temp_series_final.csv")



    print("\nDONE!")
