# Scraping Exam Info from the SOA

## Download Results Files and Packages
Information for exam passers in 2018 is available here:
https://www.soa.org/education/general-info/exam-results/edu-exam-results-archive.aspx

We import packages for reading PDFs, using regular expressions, and working with our file directories.

In [8]:
import PyPDF2
import re
import os

## Extracting Names from a PDF
There are many files to extract names from. First, let's figure out how to handle a single file; then we will work on iterating over all of the files.

Each PDF has multiple pages, so we iterate through the pages and apply a regular expression to the text of each page. The names come as a numbered list, so the pattern picks up text that starts with a number and a period and stops just before the next digit.
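
As a quick illustration of what this pattern captures, here is a minimal sketch run on a made-up page string. The pattern is the same one used in the next cell; the names `NAME_PATTERN` and `sample_page`, and the sample text itself, are assumptions for illustration and were not extracted from an SOA file.

```python
import re

# Same pattern as the notebook's regexName:
# "<number>. <last name>, <first name...>" up to (not including) the next digit.
NAME_PATTERN = re.compile(r'\d+\.\s[^,]+,\s[^0-9]+')

# Made-up text imitating how a page of the numbered list might be extracted.
sample_page = "1. Abbas, Fizza  \n 2. Abdul Aziz, Talha  \n 3. Ackerman, Samuel  \n "

matches = NAME_PATTERN.findall(sample_page)
for match in matches:
    print(repr(match))
```

Each match still carries its list number and trailing whitespace, which is why a separate cleaning step is needed afterwards.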

In [9]:
#Regular expression for extracting names.
#A simpler regex will break for people with numbers in their names (example: "3rd")
regexName = re.compile(r'\d+\.\s[^,]+,\s[^0-9]+')

def scrapeNames(path):
    #Open the pdf in binary mode; the with block closes the file when we are done
    with open(path, "rb") as pdfFileObj:
        #Note: PyPDF2 3.0+ removed these names; use PdfReader, reader.pages and extract_text() there
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        #Iterate over pages and extract all names
        maxPage = pdfReader.numPages
        allNames = []
        for pageNum in range(maxPage):
            pageText = pdfReader.getPage(pageNum).extractText()
            allNames += regexName.findall(pageText)
    return allNames

#Extract names and examine output.
rawNames = scrapeNames("./edu-2019-03-p-names-3tgjhu7.pdf") #This is from the downloaded zip.
rawNames

['1. Abbas, Fizza  \n ',
 '2. Abdul Aziz, Talha  \n ',
 '3. Ackerman, Samuel  \n ',
 '4. Acosta, Ana  \n ',
 '5. Acri, Sierra  \n ',
 '6. Adamic, Haley Michelle \n ',
 '7. Adamo, Anthony  \n ',
 '8. Adeagbo, Oluwakemi  \n ',
 '9. Adlan, Anis Syafiqah  \n ',
 '10. Agrawal, Yogesh  \n ',
 '11. Aguayo Dupin, Nicolas Antonio  \n ',
 '12. Aguirre, Andrew  \n ',
 '13. Alam, Farheen Kabir \n ',
 '14. Alaskar, Ruba  \n ',
 '15. Albano, Carl  \n ',
 '16. Alberto, Janette  \n ',
 '17. Al-Burti, Aliaa  \n ',
 '18. Aldaoud, Abir  \n ',
 '19. Alfallaj, Abdulkarim  \n ',
 '20. Ali, Nade  \n ',
 '21. Allen, Asrielle  \n ',
 '22. Alqahtani, Reem Saleh \n ',
 '23. Alsayel, Abdulrhman  \n ',
 '24. Alvarado, John Renz  \n ',
 '25. Amstislavskiy\n, Eric  \n ',
 '26. Anderson, Lance  \n ',
 '27. Anderson, Michael  \n ',
 '28. Armanious, Miriam  \n ',
 '29. Asavatheputhai, Pongsathorn  \n ',
 '30. Aselisewine, Wisdom  \n ',
 '31. Astiazaran, Juan  \n ',
 '32. Attara, Florence  \n ',
 '33. Aun, Syed  \n ',
 

We want to extract the first and last names of each exam taker. We remove the list number at the start of each string and the newlines and whitespace at the end, then split on the first comma. The result is a two-element list of the form [last name, first name].

In [3]:
def formatName(nameString):
    nameString = re.sub(r'^[0-9]+\.\s','',nameString) #Remove Leading Numbers
    nameString = re.sub('\n','',nameString) #Remove newline
    nameString = nameString.rstrip() #Remove trailing whitespace
    namePair = nameString.split(", ", 1) #Split strings
    return namePair

[formatName(name) for name in rawNames]

[['Abbas', 'Fizza'],
 ['Abdul Aziz', 'Talha'],
 ['Ackerman', 'Samuel'],
 ['Acosta', 'Ana'],
 ['Acri', 'Sierra'],
 ['Adamic', 'Haley Michelle'],
 ['Adamo', 'Anthony'],
 ['Adeagbo', 'Oluwakemi'],
 ['Adlan', 'Anis Syafiqah'],
 ['Agrawal', 'Yogesh'],
 ['Aguayo Dupin', 'Nicolas Antonio'],
 ['Aguirre', 'Andrew'],
 ['Alam', 'Farheen Kabir'],
 ['Alaskar', 'Ruba'],
 ['Albano', 'Carl'],
 ['Alberto', 'Janette'],
 ['Al-Burti', 'Aliaa'],
 ['Aldaoud', 'Abir'],
 ['Alfallaj', 'Abdulkarim'],
 ['Ali', 'Nade'],
 ['Allen', 'Asrielle'],
 ['Alqahtani', 'Reem Saleh'],
 ['Alsayel', 'Abdulrhman'],
 ['Alvarado', 'John Renz'],
 ['Amstislavskiy', 'Eric'],
 ['Anderson', 'Lance'],
 ['Anderson', 'Michael'],
 ['Armanious', 'Miriam'],
 ['Asavatheputhai', 'Pongsathorn'],
 ['Aselisewine', 'Wisdom'],
 ['Astiazaran', 'Juan'],
 ['Attara', 'Florence'],
 ['Aun', 'Syed'],
 ['Avila', 'Monica'],
 ['Awotwe', 'Emmanuel'],
 ['Ayers', 'Megan'],
 ['Bahr', 'Anna'],
 ['Bahr', 'Spencer'],
 ['Baillie', 'Madeline'],
 ['Bajsicka', 'Weroni
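
The cleaning steps above can also be sketched with a guard for strings that do not have the "Last, First" shape. This is a hypothetical variant, not the notebook's `formatName`: the name `clean_name` and the choice to return `None` for unparseable input are assumptions made for illustration.

```python
import re

def clean_name(raw):
    """Hypothetical variant of formatName: return (last, first), or None if unparseable."""
    text = re.sub(r'^[0-9]+\.\s', '', raw)   # drop the leading list number, e.g. "25. "
    text = text.replace('\n', '').strip()    # drop line breaks and surrounding whitespace
    parts = text.split(', ', 1)              # split on the first ", " only
    if len(parts) != 2:
        return None                          # no "Last, First" structure found
    return parts[0], parts[1]

# A line break inside the last name (as in entry 25 above) is removed before splitting.
print(clean_name('25. Amstislavskiy\n, Eric  \n '))
print(clean_name('6. Adamic, Haley Michelle \n '))
print(clean_name('7. Unparsed entry'))
```

Removing the newline before splitting matters: otherwise the `", "` separator in an entry like number 25 would still be found, but the last name would keep a stray line break.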

## Extract From All PDFs and Combine Results

In [4]:
os.listdir("edu-names-2018_raw") #This is the downloaded zip before we do any cleaning.

['edu-2018-04-fsa-percents-453he7.pdf',
 'edu-2018-10-fsa-percents-453hes.pdf',
 'Exam CFEFD',
 'Exam CFESDM',
 'Exam EA1',
 'Exam EA2F',
 'Exam EA2L',
 'Exam ERM',
 'Exam FM',
 'Exam GHADV',
 'Exam GHCORC',
 'Exam GHCORU',
 'Exam GHSPC',
 'Exam GIADV',
 'Exam GIFREU',
 'Exam GIINT',
 'Exam GIIRR',
 'Exam IFM-MFE',
 'Exam ILALFVC',
 'Exam ILALFVU',
 'Exam ILALP',
 'Exam ILALRM',
 'Exam LTAM-MLC',
 'Exam P',
 'Exam PA',
 'Exam QFIADV',
 'Exam QFICORE',
 'Exam QFIIRM',
 'Exam RETDAC',
 'Exam RETDAU',
 'Exam RETFRC',
 'Exam RETRPIRM',
 'Exam STAM-C',
 'Exam-SRM']

Some of the exams changed in 2018, which left the downloaded folder structure inconsistent. We manually split the "Exam IFM-MFE" folder into "Exam IFM" and "Exam MFE", and do the same for "Exam LTAM-MLC" and "Exam STAM-C". We also rename "Exam-SRM" to "Exam SRM" for consistency. The cleaned folder structure is in "edu-names-2018".

We make a list containing the path to every PDF of names. We then scrape every file in the list to get the names of our exam takers.

In [5]:
examFolders = ["edu-names-2018/" + examPath for examPath in os.listdir("edu-names-2018") if "Exam" in examPath] #Full path for exam folders

examFiles = []
for examFolder in examFolders:
    for resultFile in os.listdir(examFolder):
        if "names" in resultFile:
            examFiles.append(examFolder + "/" + resultFile)

examFiles

['edu-names-2018/Exam C/edu-2018-02-c-names-ajl65e.pdf',
 'edu-names-2018/Exam C/edu-2018-06-c-names-afaert6.pdf',
 'edu-names-2018/Exam CFEFD/edu-2018-04-cfefd-names-je7hd9.pdf',
 'edu-names-2018/Exam CFEFD/edu-2018-10-cfefd-names-je7hds.pdf',
 'edu-names-2018/Exam CFESDM/edu-2018-04-cfesdm-names-2de9w0.pdf',
 'edu-names-2018/Exam CFESDM/edu-2018-11-cfesdm-names-2de9ws.pdf',
 'edu-names-2018/Exam EA1/edu-2018-05-ea1-names-eeyiuahr8.pdf',
 'edu-names-2018/Exam EA2F/edu-2018-11-ea2f-names-fiojfe89.pdf',
 'edu-names-2018/Exam EA2L/edu-2018-05-ea2l-names-jdtyu7.pdf',
 'edu-names-2018/Exam ERM/edu-2018-04-erm-names-00oe8u.pdf',
 'edu-names-2018/Exam ERM/edu-2018-10-erm-names-00oe8s.pdf',
 'edu-names-2018/Exam FM/edu-2018-02-fm-names-lu83jk.pdf',
 'edu-names-2018/Exam FM/edu-2018-04-fm-names-lhts3ok.pdf',
 'edu-names-2018/Exam FM/edu-2018-06-fm-names-sujheowihtr.pdf',
 'edu-names-2018/Exam FM/edu-2018-08-fm-names-6eh88j.pdf',
 'edu-names-2018/Exam FM/edu-2018-10-fm-names-w47rjknh49.pdf',
 '
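
The same listing could also be sketched with `pathlib` glob patterns. This is an alternative under stated assumptions, not the notebook's method: the helper `find_name_files` and the demo folder and file names are hypothetical, and the glob is slightly stricter than the cell above (folder names must start with "Exam" and files must end in ".pdf").

```python
import tempfile
from pathlib import Path

def find_name_files(root):
    """Return sorted POSIX-style paths of '*names*.pdf' files inside 'Exam*' folders."""
    root = Path(root)
    return sorted(p.as_posix() for p in root.glob("Exam*/*names*.pdf"))

# Demo on a throwaway directory tree with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "edu-names-demo"
    (base / "Exam FM").mkdir(parents=True)
    (base / "Exam P").mkdir()
    (base / "Exam FM" / "edu-demo-fm-names-a.pdf").touch()
    (base / "Exam FM" / "edu-demo-fm-percents-a.pdf").touch()  # skipped: not a names file
    (base / "Exam P" / "edu-demo-p-names-b.pdf").touch()
    (base / "edu-demo-fsa-percents-c.pdf").touch()             # skipped: not in an Exam folder
    found = [Path(p).relative_to(base).as_posix() for p in find_name_files(base)]

print(found)
```

Sorting the result keeps the file order stable across runs, which makes the combined output easier to compare.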

We scrape every name in every file and add it to a list. We use the file paths to fill in information for which exam was passed and when it was passed.

In [6]:
regexExamName = re.compile(r'Exam\s[A-Z0-9]+') #Allow digits so that exams such as EA1 and EA2F are captured in full
regexDate = re.compile(r'[0-9]{4}-[0-9]{2}')

allNames = []
for file in examFiles:
    examInfo = regexExamName.search(file).group()
    monthInfo = regexDate.search(file).group()
    rawNames = scrapeNames(file)
    formattedNames = [formatName(name) for name in rawNames]
    for name in formattedNames:
        allNames.append([name[0], name[1], examInfo, monthInfo])
allNames

[['Ab Manan', 'Muhd Azman Firdaus', 'Exam C', '2018-02'],
 ['Abadeer', 'Mirette', 'Exam C', '2018-02'],
 ['Abbott', 'Anthony', 'Exam C', '2018-02'],
 ['Abramova', 'Rena', 'Exam C', '2018-02'],
 ['Adair', 'Liam Alexander', 'Exam C', '2018-02'],
 ['Adams', 'Brooke', 'Exam C', '2018-02'],
 ['Adler', 'Justin', 'Exam C', '2018-02'],
 ['Agcaoili', 'Ramon Vicente Rimando', 'Exam C', '2018-02'],
 ['Ahmad', 'Osman', 'Exam C', '2018-02'],
 ['Akoto', 'Osei', 'Exam C', '2018-02'],
 ['Allen', 'Austin Sei', 'Exam C', '2018-02'],
 ['Allen', 'Dina', 'Exam C', '2018-02'],
 ['Al-Yassin', 'Julan', 'Exam C', '2018-02'],
 ['Amabile', 'Devan Rose', 'Exam C', '2018-02'],
 ['Amador', 'Daisy Margarita', 'Exam C', '2018-02'],
 ['Aman', 'Eric J', 'Exam C', '2018-02'],
 ['Amburgey', 'Stephen', 'Exam C', '2018-02'],
 ['Amoh', 'Ebenezer', 'Exam C', '2018-02'],
 ['Amponsah', 'Charles Kwame', 'Exam C', '2018-02'],
 ['An', 'Zuoni', 'Exam C', '2018-02'],
 ['Anderson', 'Emilie', 'Exam C', '2018-02'],
 ['Anderson', 'Mars
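
To check how the exam name and date are pulled out of a path, here is a minimal sketch run on two of the paths listed earlier. The helper `describe_path` and the constant names are hypothetical. The exam pattern here is assumed to accept digits as well as capital letters, so that a folder such as "Exam EA2F" is captured in full rather than cut off at the first digit.

```python
import re

# Assumed patterns: capital letters or digits after "Exam ", and a YYYY-MM stamp.
EXAM_PATTERN = re.compile(r'Exam\s[A-Z0-9]+')
DATE_PATTERN = re.compile(r'[0-9]{4}-[0-9]{2}')

def describe_path(path):
    """Return (exam, year-month) taken from a result file path, or None if either is missing."""
    exam = EXAM_PATTERN.search(path)
    date = DATE_PATTERN.search(path)
    if exam is None or date is None:
        return None
    return exam.group(), date.group()

print(describe_path('edu-names-2018/Exam FM/edu-2018-02-fm-names-lu83jk.pdf'))
print(describe_path('edu-names-2018/Exam EA2F/edu-2018-11-ea2f-names-fiojfe89.pdf'))
```

The date pattern skips the "2018" in the top-level folder name because that "2018" is followed by a slash rather than a hyphen and two digits.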

We convert this list of lists to a pandas DataFrame and count how many exam results appear under each name.

In [7]:
import pandas as pd
#formatName returns [last, first], so the columns are labeled in that order
df = pd.DataFrame(allNames, columns = ['last', 'first', 'exam', 'date'])
df.groupby(['last','first'])['exam'].count().sort_values(ascending=False)

last            first                
Liu             Chang                    9
                Yang                     8
Yen             Wen-Yuan                 6
Zhao            Yuxin                    6
Al Kahfi        Muhammad                 5
Li              Xiang                    5
Wang            Yuqing                   5
Liu             Yixuan                   5
Zhang           Xinyu                    5
Lee             Jay-Hee                  5
Sun             Tian                     5
Kim             Jaehwi                   4
Liu             Xinyu                    4
Wu              Yue                      4
Wang            Zijia                    4
Lu              Yuyan                    4
Rim             Jaejun                   4
Liu             Chen                     4
Wang            Qi                       4
Lam             Nok Ki                   4
Guo             Tianao                   4
He              Weiyu                    4
Li              
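
As a dependency-free cross-check of the count above, the same tally can be sketched with `collections.Counter`. The rows below are made-up placeholders in the notebook's [last, first, exam, date] layout, and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical rows in the [last, first, exam, date] layout used for allNames.
rows = [
    ['Doe', 'Jane', 'Exam P', '2018-01'],
    ['Doe', 'Jane', 'Exam FM', '2018-06'],
    ['Roe', 'Sam', 'Exam P', '2018-03'],
]

# Count how many rows share the same (last, first) pair.
passes_per_name = Counter((last, first) for last, first, _exam, _date in rows)

for (last, first), count in passes_per_name.most_common():
    print(last, first, count)
```

Like the groupby, this counts by name only, so two different people who share a name are merged into a single total. That is worth keeping in mind when reading the largest counts.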