# Scraping Exam Info from the SOA

## Download Results Files and Packages
Information for exam passers in 2018 is available here:
https://www.soa.org/education/general-info/exam-results/edu-exam-results-archive.aspx

We import packages for reading PDFs, using regular expressions, and working with our file directories.

In [1]:
import PyPDF2
import re
import os

## Extracting Names from a PDF
There are a lot of files to extract names from. First, let's figure out how to handle a single file; then we will work on iterating over all of the files.

Each PDF has multiple pages, so we iterate through the pages and use a regular expression to extract all the names on each page. The names come as a numbered list, so the pattern picks up text that starts with a number and a period and ends before the next digit.

In [10]:
#Regular expression for extracting names.
#A simpler regex will break for people with numbers in their names (example: "3rd")
regexName = re.compile(r'\d+\.\s[^,]+,\s[^0-9]+')

def scrapeNames(path):
    #Open the pdf; the with-block closes the file handle when we are done
    with open(path, "rb") as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        #Iterate over pages and extract all names
        allNames = []
        for pageNum in range(pdfReader.numPages):
            pageText = pdfReader.getPage(pageNum).extractText()
            allNames += regexName.findall(pageText)
    return allNames

#Extract names and examine output.
rawNames = scrapeNames("./edu-2018-02-c-names-ajl65e.pdf") #This is from the downloaded zip.
rawNames[0:9]

['1. Ab Manan, Muhd Azman Firdaus   \n',
 '2. Abadeer, Mirette   \n',
 '3. Abbott, Anthony   \n',
 '4. Abramova, Rena   \n',
 '5. Adair, Liam Alexander  \n',
 '6. Adams, Brooke   \n',
 '7. Adler, Justin   \n',
 '8. Agcaoili, Ramon Vicente Rimando  \n',
 '9. Ahmad, Osman   \n']
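
The page loop and the pattern can also be kept apart, which makes the pattern easy to try on plain strings. The sketch below assumes a PyPDF2 release that provides `PdfReader`, `.pages`, and `extract_text()` (the newer names for the calls used above). `names_from_pages`, `scrape_names`, and the sample page text are hypothetical and not part of the original notebook.

```python
import re

# Same pattern as above: "<number>. <last name>, <first names>" up to the next digit.
regex_name = re.compile(r'\d+\.\s[^,]+,\s[^0-9]+')

def names_from_pages(page_texts):
    """Collect every numbered name entry from an iterable of page strings."""
    names = []
    for text in page_texts:
        names += regex_name.findall(text)
    return names

def scrape_names(path):
    """Read a PDF with the newer PyPDF2 reader API (assumed to be installed)."""
    from PyPDF2 import PdfReader  # imported lazily so the helper above works without it
    with open(path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        # The generator is consumed inside the with-block, while the file is open.
        return names_from_pages(page.extract_text() for page in reader.pages)

# Made-up page text; the second entry has a digit in the last name.
sample_page = "1. Doe, Jane   \n2. Roe 3rd, Richard   \n3. Poe, Edgar Allan  \n"
sample_names = names_from_pages([sample_page])
print(sample_names)
```

A digit in the last name is kept because the part before the comma may contain anything except a comma. A digit after the comma would end the match early.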

We want to extract the last and first names of each exam passer. We remove the number at the start of each line and the newline and trailing whitespace at the end, then split each string at the first comma into a list of two strings: last name, then first name.

In [11]:
def formatName(nameString):
    nameString = re.sub(r'^[0-9]+\.\s','',nameString) #Remove Leading Numbers
    nameString = re.sub('\n','',nameString) #Remove newline
    nameString = nameString.rstrip() #Remove trailing whitespace
    namePair = nameString.split(", ", 1) #Split strings
    return namePair

[formatName(name) for name in rawNames][0:9]

[['Ab Manan', 'Muhd Azman Firdaus'],
 ['Abadeer', 'Mirette'],
 ['Abbott', 'Anthony'],
 ['Abramova', 'Rena'],
 ['Adair', 'Liam Alexander'],
 ['Adams', 'Brooke'],
 ['Adler', 'Justin'],
 ['Agcaoili', 'Ramon Vicente Rimando'],
 ['Ahmad', 'Osman']]
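
Because each pair comes back with the last name first, it can help to label the two parts explicitly. This is a minimal sketch that applies the same cleaning steps as `formatName`; the `PasserName` tuple and the `parse_name` helper are hypothetical names introduced here for illustration.

```python
import re
from typing import NamedTuple

class PasserName(NamedTuple):
    last: str
    first: str

def parse_name(raw):
    """Strip the list number, newlines, and trailing whitespace, then split at the first ', '.

    Raises ValueError if the cleaned string has no ', ' separator.
    """
    cleaned = re.sub(r'^[0-9]+\.\s', '', raw).replace('\n', '').rstrip()
    last, first = cleaned.split(", ", 1)
    return PasserName(last=last, first=first)

parsed = parse_name('5. Adair, Liam Alexander  \n')
print(parsed.last, "|", parsed.first)
```

Accessing `parsed.last` and `parsed.first` by name avoids relying on the position of each part in the list.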

## Extract From All PDFs and Combine Results

In [13]:
os.listdir("edu-names-2018")[0:9] #This is the downloaded zip before we do any cleaning.

['Exam CFEFD',
 'Exam CFESDM',
 'Exam EA1',
 'Exam EA2F',
 'Exam EA2L',
 'Exam ERM',
 'Exam FM',
 'Exam GHADV',
 'Exam GHCORC']

Some of the exams changed in 2018, which made the file structure less tidy. We manually split the "Exam IFM-MFE" folder into "Exam IFM" and "Exam MFE", and do the same for "Exam LTAM-MLC" and "Exam STAM-C". We also rename "Exam-SRM" to "Exam SRM" for consistency. The modified copy of the folder is "edu-names-2018-modified".

We make a list containing the path to every pdf. We scrape every file in the list to get the names of our exam takers.

In [17]:
examFolders = ["edu-names-2018-modified/" + examPath for examPath in os.listdir("edu-names-2018-modified") if "Exam" in examPath] #Full path for exam folders

examFiles = []
for examFolder in examFolders:
    for resultFile in os.listdir(examFolder):
        if "names" in resultFile:
            examFiles.append(examFolder + "/" + resultFile)

examFiles[0:9]

['edu-names-2018-modified/Exam C/edu-2018-02-c-names-ajl65e.pdf',
 'edu-names-2018-modified/Exam C/edu-2018-06-c-names-afaert6.pdf',
 'edu-names-2018-modified/Exam CFEFD/edu-2018-04-cfefd-names-je7hd9.pdf',
 'edu-names-2018-modified/Exam CFEFD/edu-2018-10-cfefd-names-je7hds.pdf',
 'edu-names-2018-modified/Exam CFESDM/edu-2018-04-cfesdm-names-2de9w0.pdf',
 'edu-names-2018-modified/Exam CFESDM/edu-2018-11-cfesdm-names-2de9ws.pdf',
 'edu-names-2018-modified/Exam EA1/edu-2018-05-ea1-names-eeyiuahr8.pdf',
 'edu-names-2018-modified/Exam EA2F/edu-2018-11-ea2f-names-fiojfe89.pdf',
 'edu-names-2018-modified/Exam EA2L/edu-2018-05-ea2l-names-jdtyu7.pdf']
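
The same kind of list can be built with a glob pattern instead of nested loops. This sketch uses `pathlib` from the standard library rather than the approach above. The temporary folder and the empty placeholder files are made up so that the example runs on its own, and `find_name_files` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

def find_name_files(root):
    """Return sorted paths with 'names' in the file name, inside folders whose name contains 'Exam'."""
    return sorted(Path(root).glob("*Exam*/*names*"))

# Placeholder layout: two result files plus two files that should be skipped.
placeholder_files = [
    "Exam C/edu-2018-02-c-names-ajl65e.pdf",
    "Exam CFEFD/edu-2018-04-cfefd-names-je7hd9.pdf",
    "Exam C/readme.txt",            # no "names" in the file name
    "Other/notes-names.pdf",        # folder name does not contain "Exam"
]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in placeholder_files:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    found = [p.relative_to(root).as_posix() for p in find_name_files(root)]

print(found)
```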

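We scrape every name in every file and add it to a list. We use the file paths to fill in which exam was passed and when it was passed. Because the exam pattern matches only capital letters, it stops at the first digit, so "Exam EA1", "Exam EA2F", and "Exam EA2L" are all recorded as "Exam EA".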

In [19]:
regexExamName = re.compile(r'Exam\s[A-Z]*')
regexDate = re.compile(r'[0-9]{4}-[0-9]{2}')

allNames = []
for file in examFiles:
    examInfo = regexExamName.search(file).group()
    monthInfo = regexDate.search(file).group()
    rawNames = scrapeNames(file)
    formattedNames = [formatName(name) for name in rawNames]
    for name in formattedNames:
        allNames.append([name[0], name[1], examInfo, monthInfo])
allNames[0:9]

[['Ab Manan', 'Muhd Azman Firdaus', 'Exam C', '2018-02'],
 ['Abadeer', 'Mirette', 'Exam C', '2018-02'],
 ['Abbott', 'Anthony', 'Exam C', '2018-02'],
 ['Abramova', 'Rena', 'Exam C', '2018-02'],
 ['Adair', 'Liam Alexander', 'Exam C', '2018-02'],
 ['Adams', 'Brooke', 'Exam C', '2018-02'],
 ['Adler', 'Justin', 'Exam C', '2018-02'],
 ['Agcaoili', 'Ramon Vicente Rimando', 'Exam C', '2018-02'],
 ['Ahmad', 'Osman', 'Exam C', '2018-02']]
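
To check what the two patterns pull out of a path, they can be run on a couple of the paths listed earlier. This is a small sketch; `exam_and_date` is a hypothetical helper.

```python
import re

regex_exam_name = re.compile(r'Exam\s[A-Z]*')
regex_date = re.compile(r'[0-9]{4}-[0-9]{2}')

def exam_and_date(path):
    """Return (exam label, 'YYYY-MM' sitting) parsed from a result file path."""
    return regex_exam_name.search(path).group(), regex_date.search(path).group()

paths = [
    'edu-names-2018-modified/Exam C/edu-2018-02-c-names-ajl65e.pdf',
    'edu-names-2018-modified/Exam EA1/edu-2018-05-ea1-names-eeyiuahr8.pdf',
]
parsed = [exam_and_date(p) for p in paths]
print(parsed)
```

The "2018-" in the folder name is skipped by the date pattern because it is not followed by two digits. A character class such as `[A-Z0-9]` in the exam pattern would keep the digit in "EA1", at the cost of changing how the exams are grouped.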

We convert this list of lists to a pandas DataFrame and count the number of passers for each exam.

In [20]:
import pandas as pd
df = pd.DataFrame(allNames, columns = ['last', 'first', 'exam', 'date']) #formatName returns [last, first]
df.groupby(['exam'])['exam'].count().sort_values(ascending=False)

exam
Exam P           4793
Exam FM          4740
Exam C           2604
Exam IFM         2078
Exam MFE         1365
Exam MLC          853
Exam LTAM         753
Exam STAM         586
Exam ILALP        578
Exam PA           524
Exam GHCORU       393
Exam ILALFVU      324
Exam GHADV        318
Exam ILALRM       318
Exam EA           292
Exam ERM          278
Exam GHSPC        199
Exam QFICORE      169
Exam QFIADV       139
Exam QFIIRM       126
Exam SRM          116
Exam RETRPIRM      95
Exam CFESDM        79
Exam CFEFD         75
Exam RETFRC        60
Exam RETDAU        48
Exam RETDAC        42
Exam ILALFVC       40
Exam GIINT         20
Exam GHCORC        10
Exam GIADV          5
Exam GIIRR          4
Exam GIFREU         2
Name: exam, dtype: int64
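
The same kind of tally can be made without pandas, for instance to break the counts down by sitting as well as by exam. This is a sketch using `collections.Counter`; the rows are hypothetical stand-ins shaped like the entries of `allNames`, not data from the PDFs.

```python
from collections import Counter

# Hypothetical rows in the same [last, first, exam, date] layout as allNames.
rows = [
    ['Doe', 'Jane', 'Exam C', '2018-02'],
    ['Roe', 'Richard', 'Exam C', '2018-06'],
    ['Poe', 'Edgar Allan', 'Exam C', '2018-06'],
    ['Low', 'Ann', 'Exam CFEFD', '2018-04'],
]

per_exam = Counter(row[2] for row in rows)                  # passers per exam
per_sitting = Counter((row[2], row[3]) for row in rows)     # passers per exam and month

# most_common() lists the exams from largest to smallest count.
for exam, count in per_exam.most_common():
    print(exam, count)
```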