# Project Luther: Web Scraping and Data Collection

In this project, I analyze data on student and teacher demographics to see whether there is any relationship between Illinois teacher demographics and the "achievement gap" between Hispanic students and their white peers on the high school standardized math test (the PSAE).

This notebook contains the code to scrape the publicly available data from the Illinois Report Card website.

URL: https://www.illinoisreportcard.com/ListSchools.aspx

The analysis is covered in a second notebook.

## Initializing libraries and modules

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd

import time
import re
import pickle
import collections

## Selenium Data Scraping Script

**I decided to use Selenium as my weapon of choice to scrape data because the Illinois Report Card website has dynamic content that requires quite a bit of clicking.**

The code below opens a new Chrome window from which all the data will be collected. Selenium automates the data collection.

In [2]:
chromedriver_path = "/home/farhaan/chromedriver"
driver = webdriver.Chrome(chromedriver_path)
driver.get('https://www.illinoisreportcard.com/ListSchools.aspx')
time.sleep(5)  # give the page time to load its dynamic content

Web scraping will yield one dictionary of scraped data per school. All of these dictionaries are collected in the list master_list_school_dictionaries until data collection is complete.
Once it is complete, the list of dictionaries will be converted into a pandas dataframe object.

In [4]:
master_list_school_dictionaries = []

### Navigation functions

The following four functions navigate the website to collect data on each high school. Schools are listed alphabetically, with a separate webpage for each letter.
- Running **page_navigator** sets the entire scraping apparatus in motion. It starts at the webpage listing schools whose names begin with 'A', moves to the next webpage once all the high school links on the current page have been processed, and ends with the letter 'Z'.
- **link_navigator** cycles through all the high school links on the current webpage.
    - The **is_highschool** helper function returns a boolean that lets link_navigator distinguish high schools from elementary/middle schools.
- **open_school_in_new_tab** opens the school link in a new tab once link_navigator has selected a high school. It runs grab_school_data on the new tab to collect the school's data. After data collection for the school is complete, open_school_in_new_tab closes the tab and returns focus to the main window containing all the school links.

In [5]:
def page_navigator():
    """
    will navigate alpha-nav pages while scraping data about every high school
    """
    driver.switch_to_default_content()
    alpha_page_list = driver.find_elements_by_xpath('//ul[@class="list-inline"]//a')
    alpha_page_index =  0
    time.sleep(0.5)
    while alpha_page_index < len(alpha_page_list):
        alpha_page_list = driver.find_elements_by_xpath('//ul[@class="list-inline"]//a')
        if alpha_page_index >0:
            next_page=alpha_page_list[alpha_page_index]
            next_page.click()
            time.sleep(4.5)
        link_navigator()
        alpha_page_index +=1

def link_navigator():
    """
    For the school links on the alpha-nav sorted page, this function
    will append scraped data about every high school to
    master_list_school_dictionaries
    """
    school_list = driver.find_elements_by_xpath('//div[@class="col-xs-6 col-sm-6 cellLeft"]/a')
    type_of_school = driver.find_elements_by_xpath('//div[@class="col-xs-6 col-sm-6 cellLeft"]')
    time.sleep(0.5)
    # Drop the first entry of type_of_school: it is the table heading row,
    # which has no matching link in school_list
    type_of_school = type_of_school[1:]
    # Only scrape high schools
    for school_type, school_link in zip(type_of_school, school_list):
        school_type = school_type.text
        if is_highschool(school_type):
            school_data_dict = open_school_in_new_tab(school_link)
            if school_data_dict is not None:
                master_list_school_dictionaries.append(school_data_dict)

def is_highschool(school_type):
    """
    Based on the description on the site, checks whether a given school is a
    high school. Returns True if it is a high school, False otherwise.
    """
    # A listing whose grade range ends in "-12)" is treated as a high school.
    # The raw string avoids Python's invalid-escape warning for "\)".
    regex = re.compile(r'(.*)\n.*-12\)', re.DOTALL | re.MULTILINE)
    return bool(regex.search(school_type))

def open_school_in_new_tab(school_link):
    """
    opens the school link in a new tab, runs data scraping algorithm,
    closes the tab, returns the data for the school as a dictionary,
    and then switches window focus back to the list of schools
    """
    main_window=driver.current_window_handle
    #open the school in a new tab
    school_link.send_keys(Keys.CONTROL + Keys.RETURN)
    time.sleep(4.5)
    #switch to the new tab
    driver.switch_to_window(driver.window_handles[-1])
    time.sleep(0.5)
    #collect the school data
    driver.switch_to_default_content()
    school_data_dict = grab_school_data()
    time.sleep(0.1)
    #close the tab and switch focus to the original school list
    driver.close()
    driver.switch_to_window(main_window)
    driver.switch_to_default_content()
    return school_data_dict
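
The grade-range check used by is_highschool can be tried offline, without opening a browser. The sketch below reuses the same regular expression on made-up listing texts. The sample strings and the name `looks_like_highschool` are assumptions for illustration only; the text Selenium actually returns from the site may be laid out differently.

```python
import re

# Same pattern as is_highschool, written as a raw string.
GRADE_RANGE_ENDS_AT_12 = re.compile(r'(.*)\n.*-12\)', re.DOTALL | re.MULTILINE)

def looks_like_highschool(listing_text):
    """Return True if the listing text contains a grade range ending in 12."""
    return bool(GRADE_RANGE_ENDS_AT_12.search(listing_text))

# Hypothetical listing texts: school name on one line, grade range on the next.
sample_listings = [
    "Example High School\nExample City (9-12)",
    "Example Elementary School\nExample City (K-5)",
    "Example Junior-Senior School\nExample City (7-12)",
]

highschool_flags = [looks_like_highschool(text) for text in sample_listings]
print(highschool_flags)  # [True, False, True]
```

Note that any school whose grade range ends at 12 passes the check, so a combined 7-12 school would be scraped as well.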

### Data collection functions

The following five functions are responsible for extracting the data from each school.
For each school, a dictionary will be returned containing:

*{School Name, White Hispanic achievement gap, white student demographics, black student demographics, hispanic student demographics, white teacher demographics, black teacher demographics, and hispanic teacher demographics}*

- The **grab_school_data** function is the main wrapper that runs the smaller functions. If the *Hispanic-white achievement gap* value is not present for a particular school, it stops collecting data for that school and returns *None* to open_school_in_new_tab, which originally called grab_school_data.
- **grab_achievement_gap** is the gatekeeper. If an achievement gap value cannot be obtained, there is no point in collecting any more data for the school, since the achievement gap **is my output variable of interest**.
- **grab_school_name**, **grab_student_ethnicity**, and **grab_teacher_ethnicity** are self-explanatory.
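
The gatekeeper idea can be sketched without a browser: parse the gap from a block of text and return None when it is missing. The sample texts and the helper name `parse_gap` below are assumptions for illustration; only the regular expression is taken from grab_achievement_gap.

```python
import re

# Same pattern as grab_achievement_gap: an optional leading character
# (such as a minus sign) followed by digits, on the line after the label.
GAP_PATTERN = re.compile('Hispanic and White\n(.?[0-9]+)\n', re.IGNORECASE | re.DOTALL)

def parse_gap(result_text):
    """Return the gap as a float, or None when the text has no gap value."""
    match = GAP_PATTERN.search(result_text)
    return float(match.group(1)) if match else None

# Hypothetical result texts; the real page text may differ.
with_gap = "Achievement Gap\nHispanic and White\n-23\npoints\n"
without_gap = "Achievement Gap\nNo data available\n"

gap_found = parse_gap(with_gap)
gap_missing = parse_gap(without_gap)
print(gap_found, gap_missing)  # -23.0 None
```

Returning None rather than raising keeps the decision to skip a school in the caller, which is how grab_school_data uses the value.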

In [6]:
def grab_school_data():
    """
    runs a scraping script for a specific school and returns
    a dictionary containing desired data in key:value form
    """
    school_data_dict = {}
    achievement_gap_val = grab_achievement_gap()
    if achievement_gap_val is not None:
        school_name = grab_school_name()
        student_demographics = grab_student_ethnicity()
        teacher_demographics = grab_teacher_ethnicity()
        school_data_dict.update(school_name)
        school_data_dict['Hispanic_White_Achievement_Gap'] = achievement_gap_val
        school_data_dict.update(student_demographics)
        school_data_dict.update(teacher_demographics)
        return school_data_dict
    else:
        return None

def grab_achievement_gap():    
    """
    checks to see if data for the school includes a
    white-hispanic standardized test score achievement gap
    if it does, this will return the value of the gap.
    """
    students_info = driver.find_element_by_partial_link_text('Academic Progress')
    students_info.click()
    time.sleep(2)
    achievement_gap = driver.find_element_by_partial_link_text('Achievement Gap')
    achievement_gap.click()
    time.sleep(3.5)
    driver.switch_to_frame(driver.find_element_by_name("IFrame_IRC"))
    unclick_poverty = driver.find_element_by_xpath('//input[@value="LowIncome,NonLowIncome"]')
    unclick_poverty.click()
    time.sleep(1.5)
    click_hisp_white_gap = driver.find_element_by_xpath('//input[@value="Hispanic,White"]')
    click_hisp_white_gap.click()
    time.sleep(1.5)
    click_math = driver.find_element_by_xpath('//input[@data-value="Mathematics"]')
    click_math.click()
    time.sleep(2.4)
    Hisp_White_Achievement_Gap = driver.find_element_by_xpath('//div[@class="result"]')
    time.sleep(0.8)
    Hisp_White_Achievement_Gap = Hisp_White_Achievement_Gap.text
    regex = re.compile('Hispanic and White\n(.?[0-9]+)\n',re.IGNORECASE|re.DOTALL)
    if re.search(regex,Hisp_White_Achievement_Gap):
        Achievement_Gap_Value = float(re.findall(regex,Hisp_White_Achievement_Gap)[0])
        driver.switch_to_default_content()
        return Achievement_Gap_Value
    else:
        # If we can't grab data on the achievement gap, skip this school
        # and move on to the next one in the list
        driver.switch_to_default_content()
        return None

def grab_school_name():
    """
    returns school name as a single key:value dictionary
    """
    school_name_dict={}
    school_name = driver.find_element_by_xpath('//section[@class="main-content"]//span[@class="lblHeader"]')
    time.sleep(0.4)
    school_name_dict['school_name'] = school_name.text
    driver.switch_to_default_content()    
    return school_name_dict

def grab_student_ethnicity():
    """
    returns black, white, and hispanic student demographics as a dictionary
    """
    student_demographics = {}
    time.sleep(0.2)
    students_info = driver.find_element_by_partial_link_text('Students')
    students_info.click()
    time.sleep(2.5)
    student_ethnicity = driver.find_element_by_partial_link_text('Racial/Ethnic Diversity')
    student_ethnicity.click()
    time.sleep(4.5)
    driver.switch_to_frame(driver.find_element_by_name("IFrame_IRC"))
    time.sleep(0.4)
    graph_info = driver.find_element_by_xpath('//div[@id="graph-data"]')
    time.sleep(0.8)
    graph_info_text = graph_info.text
    # Raw strings keep the regex escapes (\( and \.) from being treated as
    # invalid Python string escapes
    regex = re.compile(r'White \(([0-9]+\.*[0-9]*)%\)', re.IGNORECASE | re.DOTALL)
    student_demographics['white_students'] = float(re.findall(regex, graph_info_text)[0])
    regex = re.compile(r'Black \(([0-9]+\.*[0-9]*)%\)', re.IGNORECASE | re.DOTALL)
    student_demographics['black_students'] = float(re.findall(regex, graph_info_text)[0])
    regex = re.compile(r'Hispanic \(([0-9]+\.*[0-9]*)%\)', re.IGNORECASE | re.DOTALL)
    student_demographics['hispanic_students'] = float(re.findall(regex, graph_info_text)[0])
    driver.switch_to_default_content()
    return student_demographics
    

def grab_teacher_ethnicity():
    """
    returns black, white, and hispanic teacher demographics as a dictionary
    """
    teacher_demographics = {}
    time.sleep(0.1)
    teachers_info = driver.find_element_by_partial_link_text('Teachers')
    teachers_info.click()
    time.sleep(2.5)
    # Open the teacher "Demographics" view
    teacher_demographics_link = driver.find_element_by_partial_link_text('Demographics')
    teacher_demographics_link.click()
    time.sleep(4.9)
    driver.switch_to_frame(driver.find_element_by_name("IFrame_IRC"))
    time.sleep(0.4)
    graph_info = driver.find_element_by_xpath('//div[@id="nested-graph"]')
    time.sleep(0.8)
    graph_info_text = graph_info.text
    # Raw strings keep the regex escapes (\( and \.) from being treated as
    # invalid Python string escapes
    regex = re.compile(r'White \(([0-9]+\.*[0-9]*)%\)', re.IGNORECASE | re.DOTALL)
    temp_re = re.findall(regex, graph_info_text)
    teacher_demographics['white_teachers'] = float(temp_re[0])
    regex = re.compile(r'Black \(([0-9]+\.*[0-9]*)%\)', re.IGNORECASE | re.DOTALL)
    temp_re = re.findall(regex, graph_info_text)
    teacher_demographics['black_teachers'] = float(temp_re[0])
    regex = re.compile(r'Hispanic \(([0-9]+\.*[0-9]*)%\)', re.IGNORECASE | re.DOTALL)
    temp_re = re.findall(regex, graph_info_text)
    teacher_demographics['hispanic_teachers'] = float(temp_re[0])
    driver.switch_to_default_content()
    return teacher_demographics
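
The percentage extraction in grab_student_ethnicity and grab_teacher_ethnicity follows one pattern repeated per group. Below is a minimal offline sketch of that pattern as a single helper. The name `parse_group_percentages` and the sample graph text are hypothetical, and unlike the functions above it returns None for a group that is absent instead of failing on `[0]`.

```python
import re

def parse_group_percentages(graph_text, groups=("White", "Black", "Hispanic")):
    """Map each group label (lowercased) to its percentage, or None if absent."""
    percentages = {}
    for group in groups:
        # Same shape as the notebook's patterns, e.g. "White (45.2%)"
        pattern = re.compile(re.escape(group) + r' \(([0-9]+\.*[0-9]*)%\)', re.IGNORECASE)
        match = pattern.search(graph_text)
        percentages[group.lower()] = float(match.group(1)) if match else None
    return percentages

# Hypothetical graph text; the real iframe text may be laid out differently.
sample_graph_text = "White (45.2%)\nHispanic (38%)\nAsian (6.7%)"

parsed = parse_group_percentages(sample_graph_text)
print(parsed)  # {'white': 45.2, 'black': None, 'hispanic': 38.0}
```

Whether a missing group should become None, 0.0, or an error is a judgment call that depends on how the site reports very small groups.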

## Running the script

To scrape the data, all that needs to be done is run **page_navigator**. All data will be collected in the list *master_list_school_dictionaries*.

Caution: a nasty NoSuchElementException will occur if your internet connection is too slow to load the dynamic, JavaScript-based webpage content before the script searches for certain elements. Two possible fixes:

a. Increase the sleep times in the relevant parts of the script.

b. Better method: use WebDriverWait together with the expected_conditions module from Selenium's support package, which waits up to a user-specified time for an element to load before raising an error.
    - This is considered best practice.
    - time.sleep() works but is not best practice (allegedly).
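
The idea behind an explicit wait is to poll for a condition until it holds or a timeout expires, rather than sleeping for a fixed time. The sketch below is a plain-Python stand-in for that idea, not Selenium's own implementation. `wait_until` and `fake_find_element` are hypothetical names, and the fake lookup simulates an element that only appears on the third attempt.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5, ignored=(Exception,)):
    """
    Call condition() repeatedly until it returns a truthy value, then return
    that value. Exceptions listed in `ignored` count as "not ready yet".
    Raises TimeoutError once `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except ignored:
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)

# Stand-in for a page element that only "loads" on the third lookup.
attempts = {"count": 0}

def fake_find_element():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LookupError("element not loaded yet")
    return "IFrame_IRC"

found = wait_until(fake_find_element, timeout=2.0, poll_interval=0.01)
print(found, attempts["count"])  # IFrame_IRC 3
```

With Selenium itself, the equivalent would be along the lines of `WebDriverWait(driver, 10).until(expected_conditions.presence_of_element_located((By.NAME, "IFrame_IRC")))`, with `WebDriverWait` imported from `selenium.webdriver.support.ui`, `expected_conditions` from `selenium.webdriver.support`, and `By` from `selenium.webdriver.common.by`. The 10-second timeout there is an arbitrary choice.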

In [None]:
# Running this function is all that is needed to collect all the data
page_navigator()

## Pickling Data and formatting it for future use

The variable *master_list_school_dictionaries* contains one dictionary per school.
We can turn the data into a dataframe by converting the list of dictionaries into a dictionary of lists and then using pandas to convert that into a dataframe.
The resulting dataframe will be pickled for future use.

In [95]:
def list_of_dicts_to_dict_of_lists(list_of_dicts):
    """
    Turns a list of dictionaries with common keys into one dictionary containing
    a list of values for each key. This makes it easy to create a dataframe object.
    """
    dict_of_lists = collections.defaultdict(list)
    for dictionary in list_of_dicts:
        for key, value in dictionary.items():
            dict_of_lists[key].append(value)
    return dict_of_lists
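
As a quick illustration of the conversion, here is the same function applied to two made-up records. The school names and gap values are hypothetical and are not scraped data.

```python
import collections

def list_of_dicts_to_dict_of_lists(list_of_dicts):
    """Collect the values of each key across all dictionaries into one list per key."""
    dict_of_lists = collections.defaultdict(list)
    for dictionary in list_of_dicts:
        for key, value in dictionary.items():
            dict_of_lists[key].append(value)
    return dict_of_lists

# Hypothetical records in the same shape as the scraped dictionaries.
sample_records = [
    {"school_name": "Example High School A", "Hispanic_White_Achievement_Gap": 21.0},
    {"school_name": "Example High School B", "Hispanic_White_Achievement_Gap": 15.0},
]

columns = list_of_dicts_to_dict_of_lists(sample_records)
print(dict(columns))
```

This conversion relies on every dictionary having the same keys: if one record lacked a key, the lists would end up with different lengths and the rows would no longer line up. pandas can also build a dataframe directly from a list of dictionaries with `pd.DataFrame(list_of_dicts)`, so the intermediate step is optional.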

In [113]:
# The list is defined above as master_list_school_dictionaries
print("master_list_school_dictionaries is a",type(master_list_school_dictionaries))
pre_df_student_data = list_of_dicts_to_dict_of_lists(master_list_school_dictionaries)
student_data_df = pd.DataFrame(pre_df_student_data)
print("student_data_df is a",type(student_data_df))

master_list_school_dictionaries is a <class 'list'>
student_data_df is a <class 'pandas.core.frame.DataFrame'>


Pickling the DataFrame to filename:
**student_data_df_pickle**

In [96]:
pd.to_pickle(student_data_df,'/home/farhaan/ds/metis/metisgh/Projects/02-Luther/student_data_df_pickle')
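
Since each school involves several multi-second sleeps, re-running the scrape is slow, so it may also be worth pickling the raw list of dictionaries as a backup. The sketch below shows a pickle round trip with the standard library. The records and the file name `school_records_backup.pickle` are hypothetical, and a temporary directory is used so no real project files are touched.

```python
import os
import pickle
import tempfile

# Hypothetical records standing in for the scraped list of dictionaries.
records = [
    {"school_name": "Example High School A", "white_students": 45.2},
    {"school_name": "Example High School B", "white_students": 12.0},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    backup_path = os.path.join(tmp_dir, "school_records_backup.pickle")
    # Write the list to disk in binary mode
    with open(backup_path, "wb") as backup_file:
        pickle.dump(records, backup_file)
    # Read it back to confirm the round trip preserves the data
    with open(backup_path, "rb") as backup_file:
        restored = pickle.load(backup_file)

print(restored == records)  # True
```

The pickled dataframe itself can be loaded in the analysis notebook with `pd.read_pickle(path)`.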