# A - 2 - Full Texts Acquisition

## Description
**Process aim:**
This process adds a field containing the full text of each resource to the metadata dataframe.

**Input:** A CSV file containing metadata, including at least one column with URLs

**Sub-processes**:
1. Import metadata
2. Get and save PDF files
3. Extract full texts from PDF files
4. Add full text to dataset and save

**Output:** a CSV file

## 1. Import metadata

In [None]:
import pandas as pd
import requests
import textract
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

Write the names of the fields to keep. These can be:
* identifiers (e.g. record_id)
* target labels: fields that are intended to be predicted, in other words generated automatically (e.g. subjects_geo, subjects_topics)
* the url of the resource in English (i.e. url_English)
* features: fields that describe characteristics of the text and will be used to predict the labels (e.g. title)
* any other field that you want to keep to analyze your dataset

In [None]:
# Columns of the dataset to keep
columns = ['record_id','body', 'date', 'session', 'subjects_geo','subjects_primary', 'subjects_topics', 'symbol', 
           'title', 'type','url_English']

In [None]:
# Load the dataset and create a Pandas dataframe
dataset = (pd.read_csv('data/0_input_data/metadata/input/doc_2000_2017.csv',usecols=columns, index_col='record_id', dtype='str'))

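To see what the `usecols`, `index_col` and `dtype` arguments do, here is a minimal sketch on an in-memory CSV. The sample rows and the `unused` column are made up for illustration; only the `read_csv` arguments mirror the cell above.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real metadata file
sample_csv = io.StringIO(
    "record_id,title,url_English,unused\n"
    "001,First report,http://example.org/1.pdf,x\n"
    "002,Second report,,y\n"
)

sample = pd.read_csv(
    sample_csv,
    usecols=['record_id', 'title', 'url_English'],  # every other column is dropped
    index_col='record_id',                          # the identifier becomes the index
    dtype='str',                                    # read the values as text
)

# Columns that were kept, and number of records without an English url
print(list(sample.columns))
print(sample['url_English'].isna().sum())
```

The record without a url ends up with a missing value in `url_English`, which is what the later `dropna()` step relies on.
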
## 2. Get and save PDF files
We need the full text of the resources described in the MARC XML; it will be used later to infer some metadata. In this case we will focus on English texts only.

For this step, start by creating a list containing the record id and the url of each record that has an English url. We then use the function save_files() to download each file from its url and save it in the folder data/A_input_data/files/.

In [None]:
def save_files(files_list, save_path, file_extension):
    '''
    Takes a list of lists (record_id and url), the path of the location where the files
    will be saved, and the extension of the file type. Gets the files through HTTP requests
    and saves them. Returns the list of items (record_id and url) whose files could not be
    saved.
    '''
    errors = []
    for item in files_list:
        save_as = save_path + str(item[0]) + file_extension
        file_url = item[1]
        response = requests.get(file_url)
        if response.status_code == requests.codes.ok:
            with open(save_as, 'wb') as f:
                f.write(response.content)
        else:
            errors.append(item)
    return errors

def last_saved_file(record_id,file_list):
    '''
    Prints the index corresponding to record_id in file_list.
    Use it if save_files stops, in order to restart the downloads where it stopped.
    '''
    for i, item in enumerate(file_list):
        if record_id in item:
            print('{} : {}'.format(i, item))

In [None]:
# Create a list of record_id and url for all records that have a url
en_list = (dataset.reset_index()[['record_id', 'url_English']]
           .dropna() # filter out rows with missing values
           .values.tolist())
# Output the length of the list
len(en_list)

In [None]:
# Get the files and save them as PDF
save_files(en_list, 'data/A_input_data/files/', '.pdf')

If the script stops before the list is completed, find the record_id of the latest saved file, pass it to last_saved_file to get its index in the list, and restart from that index:

In [None]:
# last_record_id = # paste the record_id of the latest file saved
# last_saved_file(last_record_id,en_list)

In [None]:
# Restart at index, replace *** by the index number
# save_files(en_list[***:], 'data/A_input_data/files/', '.pdf')

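As an alternative to looking up the index by hand, the restart could be driven by what is already on disk. The sketch below is only an illustration: `pending_downloads` is a hypothetical helper that is not part of this notebook, and it assumes the same file naming as save_files (save path, then record_id, then extension).

```python
import os

def pending_downloads(files_list, save_path, file_extension):
    """Return the items of files_list whose target file is not on disk yet."""
    pending = []
    for item in files_list:
        # Same naming scheme as save_files: save_path + record_id + extension
        target = save_path + str(item[0]) + file_extension
        if not os.path.exists(target):
            pending.append(item)
    return pending

# Possible usage with the list and folder from the cells above:
# remaining = pending_downloads(en_list, 'data/A_input_data/files/', '.pdf')
# save_files(remaining, 'data/A_input_data/files/', '.pdf')
```

Note that this only checks that a file exists. A file that was partially written when the script stopped would be treated as saved, so the latest file may still need to be deleted and downloaded again.
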
## 3. Extract full text from PDF files
Using the same files list, we then use the convert_pdf_to_text function to get the content of each PDF, convert it to a string of text, and store it as a third column in the initial list. Note that if a file cannot be processed, the error is logged and an empty string is stored for that record.

In [None]:
def convert_pdf_to_text(files_list, path):
    '''
    Takes a list of lists and a path to the PDF files. Reads each file and converts it to text.
    Appends the resulting text to each item of the initial list and returns the list.
    '''
    new_list = []
    for file in files_list:
        file_path = path + str(file[0]) + '.pdf'
        full_text = ""
        try:
            # textract returns bytes; decode them to str in the same step so that
            # a decoding failure also leaves the empty string in place
            full_text = textract.process(file_path).decode()
        except Exception:
            # Log the failure and keep an empty string for this record
            logger.exception("record {}: could not convert pdf to text".format(file[0]))
        file.append(full_text)
        new_list.append(file)
    return new_list

In [None]:
# Add a column with the full text of the pdf
en_list = convert_pdf_to_text(en_list,'data/A_input_data/files/')

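Because failed conversions are stored as empty strings, it can be useful to list the affected records before going further. The following is a small sketch under that assumption; `records_without_text` is a hypothetical helper and the rows are made up for illustration.

```python
def records_without_text(rows):
    """Return the record_id of every [record_id, url, text] row whose text is empty."""
    return [row[0] for row in rows if not row[2].strip()]

# Made-up rows with the same shape as en_list after the conversion
rows = [
    ['001', 'http://example.org/1.pdf', 'Some extracted text'],
    ['002', 'http://example.org/2.pdf', ''],
]
print(records_without_text(rows))
```

On the real data this would be called as `records_without_text(en_list)`.
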
## 4. Add full texts to the metadata and save the output
To finish, we create a new dataframe with only the record_id and the full text. As both dataframes share the same index, record_id, we can join it to the metadata dataframe and save the result as a CSV file.

In [None]:
# Create a new dataframe with the record_id and the full text
full_text = (pd.DataFrame(en_list, columns=['record_id','url','text'])
             .drop('url', axis=1)
             .set_index('record_id')
            )

In [None]:
# Join the result to the metadata dataset
dataset = dataset.join(full_text)

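The join works on the shared index and keeps every row of the metadata. A minimal sketch with made-up records shows what happens to a record that has no full text (for example because it had no English url):

```python
import pandas as pd

# Toy stand-ins for the metadata and full-text dataframes, both indexed by record_id
metadata = pd.DataFrame(
    {'title': ['First report', 'Second report', 'Third report']},
    index=pd.Index(['001', '002', '003'], name='record_id'),
)
texts = pd.DataFrame(
    {'text': ['full text of 001', 'full text of 003']},
    index=pd.Index(['001', '003'], name='record_id'),
)

# DataFrame.join is a left join on the index by default
joined = metadata.join(texts)
print(joined['text'].isna().tolist())
```

Record `002` is kept in the result, with a missing value in the `text` column.
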
In [None]:
# Check the result
dataset.info()

In [None]:
# Save the content of the dataset in data/A_input_data/metadata/output/
dataset.to_csv('data/A_input_data/metadata/output/doc_2000_2017.csv')