# Convert Factiva HTML to a corpus of TXT files

This Jupyter Notebook parses the HTML output of Factiva "save" operations and writes each article to a separate .txt file for text analysis.

## Getting files from Factiva
Factiva allows you to export batches of files to HTML. Here are the steps.
1. Do your search.
2. Check the checkbox to select all the articles on the page.
3. Click the "save" icon (it looks like a disk 💾) and click "Article format".
4. A new browser window should open with all the articles. Save the page with Ctrl-S, choosing a format such as "Web Page, HTML only" (the exact name differs between browsers).
5. Repeat steps 2-4 for each subsequent page of results.
6. Place these files in the directory you specify below as `directory_with_factivahtml`. Note: the code only processes files ending in `.html`, so make sure the files have this extension.
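
Before running the conversion, it can help to check which saved files the `*.html` pattern will actually pick up, since some browsers save pages as `.htm`. The following is a minimal sketch, not part of the original notebook; the helper name `split_by_extension` is hypothetical.

```python
import glob
import os


def split_by_extension(directory):
    """Return (picked_up, skipped) file paths found in `directory`.

    `picked_up` holds the files matching *.html (the pattern this notebook
    scans for); `skipped` holds every other regular file in the directory.
    """
    picked_up = sorted(glob.glob(os.path.join(directory, "*.html")))
    skipped = sorted(
        path
        for path in glob.glob(os.path.join(directory, "*"))
        if os.path.isfile(path) and path not in picked_up
    )
    return picked_up, skipped
```

Calling `split_by_extension('inputfactivahtml/')` and printing the second list shows any files that would need renaming before they can be processed.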

In [34]:
# If you don't already have BeautifulSoup, install it first, e.g. pip install beautifulsoup4
from bs4 import BeautifulSoup
import re
import datetime
import os
import glob

In [35]:
# You should have two directories: inputfactivahtml (put your saved Factiva HTML files here)
# and outputcorpus (this is where your corpus will be created)
directory_with_factivahtml = 'inputfactivahtml/'
directory_to_output_corpus = 'outputcorpus/'
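
The conversion cell writes files into the output directory and will fail if that directory does not exist. A minimal sketch of creating it up front follows; the helper name `ensure_directory` is an assumption for illustration and is not part of the original notebook.

```python
import os


def ensure_directory(path):
    """Create `path` (and any missing parents) if needed, then return it."""
    # exist_ok=True makes the call safe to repeat when the directory is already there.
    os.makedirs(path, exist_ok=True)
    return path
```

In this notebook the call would be `ensure_directory(directory_to_output_corpus)`, run once before the conversion cell.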

In [37]:
factivafiles = glob.glob(directory_with_factivahtml + "*.html")
for factivafile in factivafiles:
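    # Use a context manager so the input file is closed once it has been parsed
    with open(factivafile, encoding='utf8') as html_file:
        soup = BeautifulSoup(html_file, "html.parser")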
    i = 0
    for article in soup.select(".article .article,#lastArticle"):
        i += 1
        print('---------------' + factivafile + '|' + str(i) + '-----------------')
        doc_string = article.find_all('p', attrs={'class': None}, string=re.compile('^Document ')) 
        print('doc#:', doc_string[0].get_text())
        body = ''  # reset per article so text never carries over when a headline is missing
        for headline in article.select("#hd"):
            print('headline:', headline.get_text().strip())
            body = headline.get_text().strip() + "\n"
        for author in article.select("div.author"):
            print('author:', author.get_text())
        words_string = article.find_all('div', attrs={'class': None}, string=re.compile('[0-9]{1,2} words')) 
        print('words:', words_string[0].get_text())
        date_string = article.find_all('div', attrs={'class': None}, string=re.compile('[0-9]{1,2} [A-Za-z]{1,} [0-9]{4}')) 
        print('date:', date_string[0].get_text())
        time_string = article.find_all('div', attrs={'class': None}, string=re.compile('[0-9]{2}:[0-9]{2}')) 
        if (len(time_string) > 0):
            print('time:', time_string[0].get_text())
            print('publisher: ', time_string[0].next_sibling.get_text()) #text publisher
            short_publisher_id = time_string[0].next_sibling.next_sibling.get_text()
        else:
            print('publisher: ', date_string[0].next_sibling.get_text()) #text publisher
            short_publisher_id = date_string[0].next_sibling.next_sibling.get_text()
        print('short publisher id: ', short_publisher_id) #short publisher
        # Raw string so that \( and \) are passed to the regex engine without an invalid-escape warning
        copyright_string = article.find_all('div', attrs={'class': None}, string=re.compile(r'.*(All Rights Reserved|\(c\)|Copyright|©).*', flags=re.IGNORECASE)) 
        language = ''
        if (len(copyright_string) > 0):
            print('copyright:', copyright_string[0].get_text())
            # The language line is the element immediately before the copyright line
            language = copyright_string[0].previous_sibling.get_text()
            print('Language:', language)
        else:
            print('Warning: no copyright line found, so the language is unknown and the article will be skipped')

        for paragraph in article.select(".articleParagraph"):
            body += paragraph.get_text().strip() + "\n"

        format_str = '%d %B %Y'
        datetime_obj = datetime.datetime.strptime(date_string[0].get_text(), format_str)
        fileid = str(datetime_obj.date()) + '-' + short_publisher_id + '-' + doc_string[0].get_text().replace('Document ','')
        if (language == 'English'):
            output_path = directory_to_output_corpus + fileid + '.txt'
            # The with block closes the file automatically, so no explicit close() is needed
            with open(output_path, 'w', encoding='utf8') as f:
                f.write(body)
            print('Wrote file: ', output_path)


---------------inputfactivahtml/Factiva.html|1-----------------
doc#: Document PJRC000020240314ek5100003
headline: Evaluating Racial and Ethnic Invariance Among the Correlates of Guilty Pleas: A Focus on the Effect of Court Legitimacy, Attorney Type, Satisfaction, and Plea-Offer Evaluation
author: Jaynes Chae M; Lee Jacqueline G; N Franks Heath 
words: 240 words
date: 1 May 2024
publisher:  Journal of Research in Crime & Delinquency
short publisher id:  PJRC
copyright: © 2024 Journal of Research in Crime & Delinquency. Provided by ProQuest Information and Learning. All Rights Reserved. 
Language: English
Wrote file:  outputcorpus/2024-05-01-PJRC-PJRC000020240314ek5100003.txt
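
The output file name above combines the parsed date, the short publisher id, and the document number. The sketch below isolates that step so it can be tried on its own; the function name `build_file_id` is hypothetical, and it assumes an English locale so that `%B` matches month names such as "May".

```python
import datetime


def build_file_id(date_text, short_publisher_id, doc_text):
    """Build '<ISO date>-<publisher id>-<document number>' from Factiva metadata strings."""
    # '%d %B %Y' parses dates written like '1 May 2024'
    parsed = datetime.datetime.strptime(date_text.strip(), "%d %B %Y")
    # Drop the leading 'Document ' label to keep only the identifier
    doc_id = doc_text.replace("Document ", "").strip()
    return f"{parsed.date()}-{short_publisher_id.strip()}-{doc_id}"


print(build_file_id("1 May 2024", "PJRC", "Document PJRC000020240314ek5100003"))
# prints 2024-05-01-PJRC-PJRC000020240314ek5100003
```

Putting the ISO date first means the corpus files sort chronologically when listed by name.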
