# PDF data extraction

## Goal

This notebook contains the code to extract text from the unzipped PDF files obtained from the [HPSCI website](https://intelligence.house.gov/social-media-content/social-media-advertisements.htm).

Before running the code below:

1. Place the unzipped folders under the raw_data directory
2. Install the [pdftotext](https://www.xpdfreader.com/pdftotext-man.html) terminal command.

The code will create two outputs:

1. raw.csv, which is used in the data_cleaning notebook
2. txt_files/ directory containing the raw text extracted from the first page of each PDF file.

## Extracting text from pdfs

To extract text from the PDFs, we look for files ending in `.pdf` under the raw_data folder. We call the pdftotext utility to extract the text from the first page of each PDF and write it to a file with the same name, but with a txt extension, in a new txt_files folder.

Note that executing this will take a few minutes.
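
Before launching the extraction, it can help to confirm that the pdftotext command is reachable from the notebook's environment. The sketch below is one possible check using the standard library; the helper name `pdftotext_available` is hypothetical and not part of the original notebook.

```python
import shutil


def pdftotext_available(command="pdftotext"):
    """Return True if `command` can be found on the PATH (hypothetical helper)."""
    return shutil.which(command) is not None


if not pdftotext_available():
    print("pdftotext was not found on PATH; install it before running the next cell.")
```

If the message is printed, the extraction cell below would produce no txt files, so it is worth fixing the installation first.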

In [1]:
import os
import subprocess

input_path = '../raw_data/'
output_path = '../raw_data/txt_files/'

# Make sure the output directory exists before writing to it
os.makedirs(output_path, exist_ok=True)

# Get all pdf files under path
for root, directories, files in os.walk(input_path):
    for file in files:
        if file.endswith('.pdf'):
            # Create file paths
            pdf_file_path = os.path.join(root, file)
            txt_file_path = os.path.join(output_path, file[:-3] + 'txt')

            # You will need to have the pdftotext command line command available.
            # Passing the arguments as a list avoids escaping spaces or parentheses in paths.
            # '-l 1' stops the conversion after the first page.
            subprocess.run(['pdftotext', '-l', '1', pdf_file_path, txt_file_path])

## Extracting information from pdfs

The image below shows the format of the txt files we extracted.

![Russian ad description](../assets/pictures/ad_preview.png)

To extract the information found on each line of the text we need several regular expressions, which are listed below.

In [3]:
import re
import copy
import pandas as pd

regex_dict = {
    'ad_line_start': r"^Ad\s+",
    # The pdf files sometimes render the "I" of "ID" as 1, l or I, and the spacing is inconsistent
    'ad_id': r"^A\s*d\s*(I|l|1)\s*D\s*(?P<ad_id>[0-9]*)",
    'ad_text': r"^A\s*d\s*Text\s*",
    'ad_landing_page': r"^Ad\s+Landing\s+Page\s+(?P<ad_landing_page>.*)",
    'ad_targeting': r"^Ad\s*Targeting\s*(?P<ad_targeting>.*)",
    'ad_impressions': r"^Ad\s*Impressions\s*:?\s*(?P<ad_impressions>.*)",
    'ad_clicks': r"^Ad\s*Clicks\s*:?\s*(?P<ad_clicks>.*)",
    'ad_spend': r"^Ad\s*spend\s*:?\s*(?P<ad_spend>.*)",
    'ad_creation_date': r"^Ad\s*creation\s*date\s*:?\s*(?P<ad_creation_date>.*)",
    'ad_end_date': r"^Ad\s*end\s*date\s*:?\s*(?P<ad_end_date>.*)"
}

# Order of these entries is important!
ordered_targeting_topics_regex = {
    'custom_audience': r'^custom\s*audience\s*:?\s*(?P<custom_audience>.*)',
    'location': r'^location\s*:?\s*(?P<location>.*)',
    'interests': r'^interests\s*:?\s*(?P<interests>.*)',
    'excluded_connections': r'^excluded\s*connections\s*:?\s*(?P<excluded_connections>.*)',
    'age': r'^age\s*:?\s*(?P<age>.*)',
    'language': r'^language\s*:?\s*(?P<language>.*)',
    'placements': r'^placements\s*:?\s*(?P<placements>.*)',
    'people_who_match': r'^people\s*who\s*match\s*:?\s*(?P<people_who_match>.*)'
}
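
To see how these patterns tolerate the inconsistencies in the extracted text, the sketch below applies copies of two of them to a few sample lines. The sample lines are made up for illustration and are not taken from the real files; `first_group` is a hypothetical helper.

```python
import re

# Copies of two patterns from regex_dict above
ad_id_regex = r"^A\s*d\s*(I|l|1)\s*D\s*(?P<ad_id>[0-9]*)"
ad_spend_regex = r"^Ad\s*spend\s*:?\s*(?P<ad_spend>.*)"


def first_group(regex, group, line):
    """Return the named group if the line matches (case insensitive), else None."""
    match = re.search(regex, line, flags=re.IGNORECASE)
    return match.group(group) if match else None


# Hypothetical sample lines showing the I / l / 1 confusion and odd spacing
samples = ["Ad ID 374", "Ad lD 374", "A d 1D374"]
ids = [first_group(ad_id_regex, "ad_id", s) for s in samples]
print(ids)

# The optional colon and flexible whitespace are absorbed before the captured value
print(first_group(ad_spend_regex, "ad_spend", "Ad Spend 100.00 RUB\n"))
```

Because `.` does not match a newline by default, the captured value stops before the trailing line break.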

We need to define a few parsing functions before we can parse our ad text files. We first create a function which can parse information contained in one line.

In [4]:
# Given a key from the regex_dict, the name of the file, an array of lines from the file,
# the current line index and a dictionary to log parsing errors, use the appropriate regex to obtain the value.
#
# Returns the extracted value or None and the next line to parse (remains the same line if no value was found)
def extract_inline(regex_key, file_name, lines, line_index, parse_errors_file_dict):
    # Get the line from file
    line = lines[line_index]
    
    # Look for a match (case insensitive)
    matches = re.search(regex_dict[regex_key], line, flags=re.IGNORECASE)
    
    to_extract = None
    if matches is not None:
        group_dict = matches.groupdict()
        to_extract = group_dict.get(regex_key)

    # Add to fail to parse dictionary if regex was not comprehensive
    if to_extract is None:
        parse_errors_file_dict[regex_key].append(file_name)
        new_line_index = line_index
    else:
        new_line_index = line_index + 1

    return to_extract, new_line_index

The ad text can be on multiple lines so we create a function which iterates through lines from the current file until we find the next line starting with 'Ad '.

In [5]:
# Given the name of the file, an array of lines from the file, the current line index
# and a dictionary to log parsing errors it extracts the Ad Text value.
#
# Returns the extracted value or None and the next line to parse (remains the same line if no value was found)
def extract_ad_text(file_name, lines, line_index, parse_errors_file_dict):
    ad_text = None
    line = lines[line_index]

    # Search for line containing 'Ad Text'
    matches = re.search(regex_dict['ad_text'], line, flags=re.IGNORECASE)
    
    # If the line contains 'Ad Text'
    if matches is not None:
        end = matches.end()
        # Grab information for this first line after 'Ad Text'
        ad_text_array = [line[end:]]
        
        # Move on to the next line
        new_line_index = line_index + 1
        
        # As long as we have not reached the end of the file's lines
        while new_line_index < len(lines):
            # Get the current line
            line = lines[new_line_index]
            
            # Check for 'Ad ' signaling the start of a different field to parse
            matches = re.search(regex_dict['ad_line_start'], line, flags=re.IGNORECASE)
            if matches is not None:
                # We found the end of the Ad Text stop
                break
            else:
                ad_text_array.append(line)
                new_line_index += 1
        
        # Join lines together (each line keeps its trailing newline, so no separator is added)
        ad_text = ''.join(ad_text_array)

    # If no ad_text was found add to the parsing error dictionary
    if ad_text is None:
        parse_errors_file_dict['ad_text'].append(file_name)
        new_line_index = line_index

    return ad_text, new_line_index
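
The multi-line accumulation can be illustrated in isolation. The sketch below is a simplified, standalone variant of the function above without the error logging; `collect_ad_text` and the sample lines are hypothetical and only meant to show where the loop stops.

```python
import re

# Copies of the two patterns used by extract_ad_text
AD_TEXT = r"^A\s*d\s*Text\s*"
AD_LINE_START = r"^Ad\s+"


def collect_ad_text(lines, start):
    """Return (ad_text, next_index), or (None, start) if lines[start] is not an Ad Text line."""
    match = re.search(AD_TEXT, lines[start], flags=re.IGNORECASE)
    if match is None:
        return None, start

    # Keep whatever follows 'Ad Text' on the first line
    parts = [lines[start][match.end():]]
    index = start + 1

    # Keep appending lines until the next field (a line starting with 'Ad ') begins
    while index < len(lines) and not re.search(AD_LINE_START, lines[index], flags=re.IGNORECASE):
        parts.append(lines[index])
        index += 1

    return "".join(parts), index


# Made-up sample, not taken from a real file
sample = ["Ad Text First line\n", "second line\n", "Ad Landing Page https://example.com\n"]
print(collect_ad_text(sample, 0))
```

The returned index points at the Ad Landing Page line, which is where the next extraction step would resume.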

Extracting information for the Ad Targeting section is a bit more complex. First, we use a function to extract the lines that are part of the ad_targeting section. We then build a dictionary of the different targeting topics and assign each entry to a variable. Finally, we return the parsed variables.

In [6]:
# Given the name of the file, an array of lines from the file, the current line index
# and a dictionary to log parsing errors it extracts all ad targeting values
#
# Returns the extracted values (or None) and the next line to parse (remains the same line if no value was found)
def extract_ad_targeting(file_name, lines, line_index, parse_errors_file_dict):
    # Step 1) get all targeting related lines and next line
    targeting_lines, next_line_index = extract_targeting_lines(
        file_name, lines, line_index, parse_errors_file_dict)

    # Step 2) get topic dictionary with list of target per topic
    # (get_topic_dictionary returns None when no topic is recognised, so fall back to an empty dict)
    topic_dictionary = (get_topic_dictionary(targeting_lines) if targeting_lines else None) or dict()

    # Step 3) assign topic dictionary entry to variables (get defaults to None)
    ad_targeting_custom_audience = topic_dictionary.get('custom_audience')
    ad_targeting_location = topic_dictionary.get('location')
    ad_targeting_interests = topic_dictionary.get('interests')
    ad_targeting_excluded_connections = topic_dictionary.get('excluded_connections')
    ad_targeting_age = topic_dictionary.get('age')
    ad_targeting_language = topic_dictionary.get('language')
    ad_targeting_placements = topic_dictionary.get('placements')
    ad_targeting_people_who_match = topic_dictionary.get('people_who_match')

    return (ad_targeting_custom_audience,
            ad_targeting_location,
            ad_targeting_interests,
            ad_targeting_excluded_connections,
            ad_targeting_age,
            ad_targeting_language,
            ad_targeting_placements,
            ad_targeting_people_who_match,
            next_line_index)

# Given the name of the file, the file's lines, the current line and a dictionary to log parsing errors
# Returns an array of lines for the ad targeting and the appropriate next line to analyze.
def extract_targeting_lines(file_name, lines, line_index, parse_errors_file_dict):
    targeting_lines = []

    # Step 1) check that the current line is the targeting line
    matches = re.search(regex_dict['ad_targeting'], lines[line_index], flags=re.IGNORECASE)
    if matches is None:
        return None, line_index
    else:
        first_line = matches.groupdict()['ad_targeting']
        targeting_lines.append(first_line)
        line_index += 1

    # Step 2) get lines
    next_line_index = line_index
    match_found = False
    
    # While we have not reached the end of the file's lines and not found a match with 
    # ad_impressions which follows ad targeting...
    while next_line_index < len(lines) and not match_found:
        line = lines[next_line_index]
        matches = re.search(regex_dict['ad_impressions'], line, flags=re.IGNORECASE)

        if matches is not None:
            match_found = True
        else:
            targeting_lines.append(line)
            next_line_index += 1

    # The file name and error dictionary are passed in explicitly rather than read from globals
    if not match_found:
        targeting_lines = None
        parse_errors_file_dict['ad_targeting'].append(file_name)

    return targeting_lines, (next_line_index if match_found else line_index)

# Using the lines found for targeting, extract each topic
# and return a dictionary containing the successfully extracted values
def get_topic_dictionary(targeting_lines):
    # Create a copy of the regexes for targeting
    topic_regex = copy.deepcopy(ordered_targeting_topics_regex)
    
    # Create an empty dictionary for all matches found
    topic_text_dictionary = {}
    
    # For each targeting line
    current_topic = None
    for i in range(len(targeting_lines)):
        # Get current line
        line = targeting_lines[i]
        
        for topic, regex in topic_regex.items():
            matches = re.search(regex, line, flags=re.IGNORECASE)

            if matches and matches.groupdict():
                # current_topic can match until new topic is found
                current_topic = topic

                # remove topic name from value using groups
                # we use this as a marker to know if we have already covered a topic
                group_dict = matches.groupdict()
                line = group_dict.get(topic)

                # Once a topic is found it is removed
                topic_regex.pop(topic)
                break

        if current_topic is None:
            return None

        # If we have a current_topic we are matching
        if current_topic in topic_text_dictionary:
            # The topic spans multiple lines: continue to append
            topic_text_dictionary[current_topic] += (' ' + line.replace('\n', ''))
        else:
            # It is the first time we have matched this topic: add the line
            topic_text_dictionary[current_topic] = line if line is not None else ''

    return topic_text_dictionary
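
To make the topic splitting concrete, here is a simplified, standalone sketch of the same idea restricted to three topics. The name `split_targeting_topics` and the sample lines are hypothetical; the real function above handles all eight topics.

```python
import re

# Subset of ordered_targeting_topics_regex, copied so the example runs on its own
topics = {
    "location": r"^location\s*:?\s*(?P<location>.*)",
    "interests": r"^interests\s*:?\s*(?P<interests>.*)",
    "age": r"^age\s*:?\s*(?P<age>.*)",
}


def split_targeting_topics(targeting_lines):
    remaining = dict(topics)  # topics that have not been seen yet
    result = {}
    current = None
    for line in targeting_lines:
        line = line.rstrip("\n")
        for topic, regex in list(remaining.items()):
            match = re.search(regex, line, flags=re.IGNORECASE)
            if match:
                current = topic
                line = match.group(topic)  # drop the topic name, keep the value
                del remaining[topic]       # each topic can only start once
                break
        if current is None:
            return None  # the first line did not start with a known topic
        if current in result:
            result[current] += " " + line  # continuation of the current topic
        else:
            result[current] = line
    return result


# Made-up sample lines, not taken from a real file
sample = ["Location: United States\n", "Interests: History,\n", "Politics\n", "Age: 18 - 65+\n"]
print(split_targeting_topics(sample))
```

The line `Politics` matches no remaining topic, so it is appended to the topic that is currently open (`interests`).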

Using all the functions defined above, we create a function that can fully parse the lines of an ad txt file.

In [7]:
# Given the name of ad txt file, its path and a dictionary to log parsing errors,
# extract as many fields as possible from the raw text.
def row_from_file(file_name, file_path, parse_errors_file_dict):
    # Open the txt file
    with open(file_path, 'r') as ad_file:
        # Read each line in an array
        lines = ad_file.readlines()
        
        # If the file is empty, log a file empty error and return a row with no extracted values
        if not lines:
            parse_errors_file_dict['empty'].append(file_path)
            return [file_name] + [None] * 16

        # Start by looking at the first line
        line_index = 0

        # ad_id
        (ad_id, line_index) = extract_inline('ad_id', file_name, lines, line_index, parse_errors_file_dict)

        # ad_text
        (ad_text, line_index) = extract_ad_text(file_name, lines, line_index, parse_errors_file_dict)

        # ad_landing_page
        (ad_landing_page, line_index) = extract_inline('ad_landing_page', file_name, lines, line_index,
                                                       parse_errors_file_dict)

        # ad_targeting (all ad_targeting is extracted here)
        (ad_targeting_custom_audience,
         ad_targeting_location,
         ad_targeting_interests,
         ad_targeting_excluded_connections,
         ad_targeting_age,
         ad_targeting_language,
         ad_targeting_placements,
         ad_targeting_people_who_match,
         line_index) = extract_ad_targeting(
            file_name,
            lines,
            line_index,
            parse_errors_file_dict)

        # ad_impressions
        (ad_impressions, line_index) = extract_inline('ad_impressions', file_name, lines, line_index,
                                                      parse_errors_file_dict)

        # ad_clicks
        (ad_clicks, line_index) = extract_inline('ad_clicks', file_name, lines, line_index, parse_errors_file_dict)

        # ad_spend
        (ad_spend, line_index) = extract_inline('ad_spend', file_name, lines, line_index, parse_errors_file_dict)

        # ad_creation_date
        (ad_creation_date, line_index) = extract_inline('ad_creation_date', file_name, lines, line_index,
                                                        parse_errors_file_dict)

        # ad_end_date
        (ad_end_date, line_index) = extract_inline('ad_end_date', file_name, lines, line_index, parse_errors_file_dict)

    return [file_name,
            ad_id,
            ad_text,
            ad_landing_page,
            ad_targeting_custom_audience,
            ad_targeting_location,
            ad_targeting_interests,
            ad_targeting_excluded_connections,
            ad_targeting_age,
            ad_targeting_language,
            ad_targeting_placements,
            ad_targeting_people_who_match,
            ad_impressions,
            ad_clicks,
            ad_spend,
            ad_creation_date,
            ad_end_date]

The code below opens the txt versions of the ads and extracts the information using the row_from_file function explained above.

In [8]:
input_path = '../raw_data/txt_files/'

all_rows = []
parse_errors_file_dict = {
    'ad_id': [],
    'ad_text': [],
    'ad_landing_page': [],
    'ad_targeting': [],
    'ad_targeting_custom_audience': [],
    'ad_targeting_location': [],
    'ad_targeting_interests': [],
    'ad_targeting_excluded_connections': [],
    'ad_targeting_age': [],
    'ad_targeting_language': [],
    'ad_targeting_placements': [],
    'ad_targeting_people_who_match': [],
    'ad_impressions': [],
    'ad_clicks': [],
    'ad_spend': [],
    'ad_creation_date': [],
    'ad_end_date': [],
    'empty': []
}

for root, directories, files in os.walk(input_path):
    for file_name in files:
        if file_name.endswith('.txt'):
            txt_file_path = os.path.join(root, file_name)
            row_array = row_from_file(file_name, txt_file_path, parse_errors_file_dict)
            all_rows.append(row_array)
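
Once the loop has finished, the error dictionary gives a rough idea of how well the regular expressions cover the files. A minimal sketch of one way to summarise it follows; `summarize_parse_errors` is a hypothetical helper and the example data is made up.

```python
def summarize_parse_errors(errors):
    """Return {field: number of logged failures}, skipping fields without failures."""
    return {field: len(files) for field, files in errors.items() if files}


# Hypothetical example data; in the notebook, pass parse_errors_file_dict instead
example_errors = {"ad_id": ["a.txt"], "ad_text": [], "ad_spend": ["a.txt", "b.txt"]}
print(summarize_parse_errors(example_errors))
```

Fields with many failures are likely candidates for a more permissive regular expression.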

Finally, we collect the rows in a DataFrame and write them to a CSV file.

In [9]:
df = pd.DataFrame(all_rows, columns=[
    'file_name',
    'ad_id',
    'ad_text',
    'ad_landing_page',
    'ad_targeting_custom_audience',
    'ad_targeting_location',
    'ad_targeting_interests',
    'ad_targeting_excluded_connections',
    'ad_targeting_age',
    'ad_targeting_language',
    'ad_targeting_placements',
    'ad_targeting_people_who_match',
    'ad_impressions',
    'ad_clicks',
    'ad_spend',
    'ad_creation_date',
    'ad_end_date'
])
df.to_csv('../raw_data/raw.csv')

The next notebook is [data_cleaning](data_cleaning.ipynb).