# Data Wrangling: Scraping the Web

## Introduction
As part of my own research, fuelled by Udacity's course content on Data Wrangling, I decided to use my new-found skills to scrape data from the US Bureau of Transportation Statistics, namely statistics on major airports and the carriers travelling through them.

The purpose of this project is not to answer questions about airports and planes; rather, the focus is on understanding how to get the data, even when it's freely available as a webpage. No doubt, statistics about the airports and carriers can be computed from it, but that is a secondary focus.

## Analysis

My analysis of this data is broken into 4 phases:
1. Downloading the data to be scraped
2. Scraping the data
3. Converting the scraped data into a more accessible format
4. Analysing the extracted data

I cannot stress enough how important it is to **try the above steps on a single data element** and *only* then generalize the process. This has helped me catch errors early on, and made me more efficient.

*Note: Cell output is restricted to 1000 characters for brevity.*

In [10]:
# Basic imports for the rest of the analysis

# making HTTPS requests
import requests

# HTML scraping and parsing libraries
from bs4 import BeautifulSoup
import urllib

# folder navigation
import os

# plotting
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# data intensive
import numpy as np
import pandas as pd
import seaborn as sns

# sample page to scrape
scrape_page = "http://www.transtats.bts.gov/Data_Elements.aspx?Data=2"

s = requests.session()

## 1. Downloading the data to be scraped

### Basics
So, the first thing to be done is to take a look at what we're dealing with here. To do so, go to <a href = "http://www.transtats.bts.gov/Data_Elements.aspx?Data=2">this website</a>. 

When you go there, you will notice the following dropdown buttons for both carriers and airports:<br><br>
<img src = "Screenshot (109).png" style = "width: 60%; height: 60%">
<br>
You will also see a table like so: <br><br>
<img src = "Screenshot (108).png" style = "width: 60%; height: 60%">
<br>
As its caption says, this table outlines the number of domestic and international flights for a particular month in a year.
Now, copying the data from this table into Excel might be easy, but what if you had to do this 100 times?

Before we even think of this, we need a way to get the page for a particular carrier (say "American Airlines") and a particular airport (say "Jackson International Airport, Atlanta") to scrape. If you look carefully, you will see that the URI of the page does not change when you select different values, so we cannot simply request a different URL for each combination. This is a real problem!

Thanks to Chrome's Inspector, there's no need to worry!
Right-click on the page, choose Inspect, and open the Network tab, like so: <br>
<img src = "Screenshot (110).png" style = "width: 60%; height: 60%"><br><br>

Double-click on the highlighted text. It will open a small window next to the text. Scroll down to form parameters: **this** is the data that is sent with each request! 

At the bottom, you will be able to see the airport and carrier you chose. These are the only parts of the form data that change for this page.
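
As a rough illustration of what the browser sends, the sketch below builds a URL-encoded form body with Python 3's standard library. The field names are the ones this notebook posts further down; the `__VIEWSTATE` value is a placeholder, since the real one is a long opaque string that has to be read from the page.

```python
from urllib.parse import urlencode, parse_qs

# Placeholder values only: the real hidden fields are long opaque
# strings that must be read from the page itself.
form_fields = [
    ("__EVENTTARGET", ""),
    ("__EVENTARGUMENT", ""),
    ("__VIEWSTATE", "<viewstate-placeholder>"),
    ("CarrierList", "AA"),
    ("AirportList", "ATL"),
    ("Submit", "Submit"),
]

# This is the shape of the body a form POST carries.
body = urlencode(form_fields)
print(body)

# Decoding it again shows which fields carry the user's selection.
decoded = parse_qs(body, keep_blank_values=True)
print(decoded["CarrierList"], decoded["AirportList"])
```

Only `CarrierList` and `AirportList` differ between requests, which is why looping over those two values is enough later on.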

### Looking for these form parameters
If you examine the HTML for this page, you will not be able to find these values readily. Time to dive into some Python coding to look for these. (These parameters are actually well hidden in divs, and thus are not found easily.)

In [21]:
# BeautifulSoup functions to return the HTML page
r = s.get(scrape_page)
soup = BeautifulSoup(r.text, "lxml")

print soup

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head><title>
	Data Elements
</title><link href="styles/global.css" rel="stylesheet" type="text/css"/><link href="styles/rita_main.css" rel="stylesheet" type="text/css"/><link href="https://fonts.googleapis.com/css?family=Open+Sans" rel="stylesheet" type="text/css"/><link href="https://www.bts.dot.gov/sites/bts.dot.gov/themes/bts_standalone/bts_standalone.css" rel="stylesheet"/><link href="https://www.bts.dot.gov/sites/bts.dot.gov/themes/bts_standalone/bts_standalone_pn.css" rel="stylesheet"/>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js" type="text/javascript"></script>
<script src="https://www.bts.dot.gov/sites/bts.dot.gov/themes/bts_standalone/bts_standalone.js"></script>
<script language="javascript" type="text/javascript">
function window_CarrierList(page)
{
    //aUrl="Carrie

As can be seen from the output above, we now have the HTML of the page to work with!

After having finally found these parameters in the Inspector, I noticed that each parameter could be identified by its *id* of the same name. 

Additionally, some of the parameters I inspected seemed to be empty. These will likely be empty throughout my analysis, so I don't need to worry about them. The rest of the story is a simple loop.

In [22]:
data = {}

for parameter in ["__VIEWSTATE", "__VIEWSTATEGENERATOR", "__EVENTVALIDATION"]: # other parameters are empty
    parameter_id = soup.find(id = parameter)
    if parameter_id is None:
        data[parameter] = ""
    else:
        data[parameter] = parameter_id.get("value")

print data

{'__EVENTVALIDATION': '/wEdANQJI1Sqg9z91K1IxN+WYuLY9Pc8W+KtN+Mo2aSUQqdKHnn1wNLGBL09odqqO9CBtbAJsDUC3lheYiZJgW/YlADjuWzIaw6NWZdPqqJzsUIui3WJba4zPjTfLRRsH0Y4tKCbvFwJUL16Fg2zvSQ8pNpmmiXKEOf5q1Kv3vsuvzhW05PQbvopHFZM7OfQDQb4hdOUdXAXRWTwuBeN66YRUYZW+iHUadUKYEzQDBGkXUs/YCGr0cwlD3oBni9ShkvD3kwjk8WcekoqTVcnps8CXSCz2VxSHFLZn8o/OI/mzSaJLF7n4FW7/iSCbjzg5qjsDMH5Z4x2xKMscyTvkWGm4eCnGGT+PhzP2oB89KGJMNTRcLt8dfZ0OLmTKBRvL+6aLO1Jlqb4uy82+C1G/TuY290BKE0bVp+gYhNwVZBEHAug4oRNRIquFUPBmZHKNQbPcLdxJCrNuditKsxyOfCvi1s9pWWvurw18YSXxClfb0sw/6b9lN2DZRhnQ66+hD+aycC8f25oq7hT+8oWmhbKNSmthde8aZvm7cigiu7lj5rG9vMpuFq433U3WFpHq4V3rSYKlbsk7FCG4pN34VJPbxjIui8imvQgdNZYuF9tBmSfCMGWAtxVxwaJCe8yj4uJmba+0Pm8y2EfbpmeJ+44i6d8EmJqbmGABn8t1Ju434eerRoBJeHv60FRHHjWan2JtV04PuDW5M7IdHpDmLWkPtRYohRxs++m7m25rVGGO9xU1ZNF9UtcLUUhOKfxV7PKBup85Idz8i2sTT70Zc4lsW/JRHHlTXbctzOEfZD1Wprgm7Ac5PMqE0ra5Hs9xfdZld/gpyDYaqFgSrclhTGylLvDltyRWe3GPJgdPBxKU3ZsOOZKxTIQ2tefAP/tG5+tF/1bZ2SCbsUSUtVIjP7WdSJnLcMBNra8/krTTsDAFbBxcTaM9rKNoB1wQsxdHU68DHLOnCOgqQrN4

We now have the form data!
The only thing we need now is the list of carriers and airports to iterate over.

Below are some functions that operate similarly to how we gathered the form data. Each dropdown option is a child of a *select* tag with id "CarrierList" or "AirportList".

### Data Cleaning
As part of the extraction, it is necessary to get the relevant data and remove all extraneous information. 
In the functions below, I clean the data in 2 ways:
1. Extract the "value" attribute from the tags
2. Check to see if only relevant "values" are extracted

How this has been done programmatically can be seen in the code snippet below.

In [23]:
def extract_carriers(page):
    '''
    function to extract list of carriers (which are displayed as a drop down) from [page].
    '''
    data = []

    r = s.get(page)
    soup = BeautifulSoup(r.text, "lxml")
    carrier_list = soup.find(id = "CarrierList")
    children = carrier_list.findChildren()
    
    for airlines in children:
        # this is what is contained in airlines: <option value="AA">American Airlines </option>
        
        value = airlines.get('value')
        # this is what is contained in value: AA
        
        # making sure we don't extract options such as "AllUS" in <option value="AllUS">All U.S. Carriers</option>
        if (len(value) == 2):
            data.append(value)

    return data
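
To try the length filter offline before pointing it at the live page, here is a minimal Python 3 sketch. It swaps BeautifulSoup for the standard library's `html.parser` so that it has no dependencies, and the HTML snippet is hand-written for illustration rather than copied from the site.

```python
from html.parser import HTMLParser


class OptionValues(HTMLParser):
    """Collect the value attribute of every <option> tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            value = dict(attrs).get("value")
            if value is not None:
                self.values.append(value)


# Hand-written stand-in for the carrier dropdown.
snippet = """
<select id="CarrierList">
  <option value="AllUS">All U.S. Carriers</option>
  <option value="AA">American Airlines </option>
  <option value="AS">Alaska Airlines </option>
</select>
"""

parser = OptionValues()
parser.feed(snippet)

# Same rule as extract_carriers: keep two-character codes only.
carrier_codes = [v for v in parser.values if len(v) == 2]
print(parser.values)
print(carrier_codes)
```

Skipping options without a `value` attribute also avoids calling `len` on `None`, which the length check would otherwise trip over.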

In [24]:
carriers = extract_carriers(scrape_page)

In [25]:
def extract_airports(page):
    '''
    function to extract list of airports (which are displayed as a drop down) from [page].
    '''
    data = []

    r = s.get(page)
    soup = BeautifulSoup(r.text, "lxml")
    airport_list = soup.find(id = "AirportList")
    children = airport_list.findChildren()

    for airport in children:
        value = airport.get('value')

        # keep only three-character upper-case codes such as "ATL"
        if (len(value) == 3 and value.isupper()):
            data.append(value)
            print value

    return data


In [26]:
airports = extract_airports(scrape_page)

ATL
BWI
BOS
CLT
MDW
ORD
DAL
DFW
DEN
DTW
FLL
IAH
LAS
LAX
MIA
MSP
JFK
LGA
EWR
MCO
PHL
PHX
PDX
SLC
SAN
SFO
SEA
TPA
DCA
IAD
UXM
ABR
ABI
DYS
ADK
VZF
BQN
AKK
KKI
AKI
AKO
CAK
7AK
KQA
AUK
ALM
ALS
ABY
ALB
ABQ
ZXB
WKK
AED
AEX
AXN
AET
ABE
AIA
APN
DQH
AOO
AMA
ABL
OQZ
AOS
OTS
AKP
EDF
DQL
MRI
ANC
AND
AGN
ANI
ANN
ANB
ANV
ATW
ACV
ARC
ADM
AVL
HTS
ASE
AST
AHN
AKB
PDK
FTY
ACY
ATT
ATK
MER
AUO
AGS
AUG
AUS
A28
BFL
BGR
BHB
BRW
BTI
BQV
A2K
BTR
BTL
AK2
A56
BTY
BPT
BVD
WBQ
BKW
BED
A11
KBE
BLV
BLI
BLM
JVL
BVU
BJI
RDM
BEH
BET
BTT
BVY
OQB
A50
BIC
BIG
BGQ
BMX
PWR
A85
BIL
BIX
BGM
KBC
BHM
BIS
BYW
BID
BMG
BMI
BFB
BYH
BCT
BOI
RLU
BXS
BLD
BYA
BWG
BZN
BFD
A23
BRD
BKG
BWC
PWT
KTS
BDR
TRI
BKX
RBH
BRO
BWD
BQK
BCE
BKC
BUF
IFP
BUR
BRL
BTV
MVW
BNO
BTM
USA
UXI
CDW
C01
ADW
CDL
CGI
LUR
EHM
CZF
A61
A40
CYT
MDH
CLD
CNM
A87
CPR
CDC
CID
JRV
NRR
CEM
CDR
CIK
CMI
WCR
CHS
CRW
SPB
STT
CHO
CYM
CHA
CYF
WA7
CEX
EGA
NCN
KCN
VAK
CYS
PWK
DPA
LOT
CKX
CIC
CEF
KCG
KCL
WQZ
KCQ
CZN
CIV
ZXH
SSB
STX
CHU
LUK
CVG
OQC
A12
CHP
IRC
CLP
CKB
BKL
CLE
CGF
CFT


Rather than create folders for each of the airports and carriers for **testing**, I will only consider a small subset to see if my programs run correctly. To run on the full lists instead, just uncomment the lines marked UNCOMMENT below.

In [8]:
# partial data to test

airports_sample = sorted(airports[:3])
# UNCOMMENT: airports_sample = sorted(airports) 

# print airports_sample

carriers_sample = sorted(carriers[:3])
# UNCOMMENT: carriers_sample = sorted(carriers) 

# print carriers_sample

Generalizing the parameter extraction written earlier, I have written the *extract_params* function to extract a set of given parameters for any page that you want to scrape.

In [9]:
def extract_params(scrape_page, params):
    '''
    function to find certain parameters specified by [params] from [scrape_page], if they exist.
    '''
    
    data = {}
    r = s.get(scrape_page)
    soup = BeautifulSoup(r.text, "lxml")
        
    for parameter in params:
        parameter_id = soup.find(id = parameter)
        if parameter_id is None:
            data[parameter] = ""
        else:
            data[parameter] = parameter_id.get("value")
    
    return data
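
One way to test this logic without a network connection is to separate parsing from fetching. The Python 3 sketch below is a hypothetical variant (the name `extract_params_from_html` and the snippet values are made up for illustration) that takes an HTML string instead of a URL, and uses the standard library's `html.parser` in place of BeautifulSoup.

```python
from html.parser import HTMLParser


class InputsById(HTMLParser):
    """Map the id of each <input> tag to its value attribute."""

    def __init__(self):
        super().__init__()
        self.by_id = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attributes = dict(attrs)
            if "id" in attributes:
                # Treat a missing value attribute as an empty string.
                self.by_id[attributes["id"]] = attributes.get("value") or ""


def extract_params_from_html(html, params):
    """Return {param: value} for each id in params, "" when absent."""
    parser = InputsById()
    parser.feed(html)
    return {name: parser.by_id.get(name, "") for name in params}


# Made-up snippet; real values are much longer.
snippet = """
<form>
  <input type="hidden" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" id="__EVENTVALIDATION" value="xyz789" />
</form>
"""

result = extract_params_from_html(
    snippet, ["__VIEWSTATE", "__VIEWSTATEGENERATOR", "__EVENTVALIDATION"]
)
print(result)
```

As in *extract_params*, a parameter that is not on the page comes back as an empty string.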

The *create_files* function below is the result of everything we wrote above. 

It takes the form parameters and the lists of airports and carriers, and creates a folder structure with the relevant files inside each folder.

In [10]:
def create_files(data, airports, carriers):
    '''
    function to make an HTTPS request to the server, given parameter values from [data] 
    for all values in [airports] and [carriers] and store the corresponding result in 
    a folder structure.
    '''
    
    # go inside airport-data
    os.chdir("airport-data")
        
    # all data parameters have been stored in data
    eventvalidation = data["__EVENTVALIDATION"]
    viewstate = data["__VIEWSTATE"]
    viewstategenerator = data["__VIEWSTATEGENERATOR"]
      
    for airport in airports:
        
        # create new folder for each airport
        newpath = airport 

        if not os.path.exists(newpath):
            os.makedirs(newpath)
            
        # print newpath

        # navigate to the new folder
        os.chdir(newpath)
        
        for carrier in carriers:
            
            # make a POST request for the current airport and carrier 
            r = s.post("https://www.transtats.bts.gov/Data_Elements.aspx?Data=2",
                       data = (("__EVENTTARGET", ""),
                               ("__EVENTARGUMENT", ""),
                               ("__VIEWSTATE", viewstate),
                               ("__VIEWSTATEGENERATOR",viewstategenerator),
                               ("__EVENTVALIDATION", eventvalidation),
                               ("CarrierList", carrier),
                               ("AirportList", airport),
                               ("Submit", "Submit")))
            
            # write new file
            f = open("{0}-{1}.html".format(airport, carrier), "w")
            print "Created file!", "{0}-{1}.html".format(airport, carrier)
            
            f.write(r.text)
            f.close()
            
        os.chdir("..")
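
Changing the working directory inside a loop works, but it leaves the notebook in a different folder if anything fails halfway. A possible alternative, sketched here in Python 3 with a hypothetical helper named `save_page`, is to build each path explicitly with `os.path.join`. The HTML written here is a placeholder, not a real response.

```python
import os
import tempfile


def save_page(root, airport, carrier, html):
    """Write html to <root>/<airport>/<airport>-<carrier>.html; return the path."""
    folder = os.path.join(root, airport)
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, "{0}-{1}.html".format(airport, carrier))
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return path


# A temporary directory stands in for "airport-data".
with tempfile.TemporaryDirectory() as root:
    saved = save_page(root, "ATL", "AA", "<html>placeholder</html>")
    relative = os.path.relpath(saved, root)
    exists_after_write = os.path.exists(saved)
    print(relative)
```

The resulting layout is the same one *create_files* produces: one folder per airport, one file per carrier.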

In [11]:
# list of parameters required to make request
params = ["__VIEWSTATE", "__VIEWSTATEGENERATOR", "__EVENTVALIDATION"]

# value of parameters
params_dict = extract_params(scrape_page, params)

In [12]:
create_files(params_dict, airports_sample, carriers_sample)

Created file! ATL-AA.html
Created file! ATL-AS.html
Created file! ATL-G4.html
Created file! BOS-AA.html
Created file! BOS-AS.html
Created file! BOS-G4.html
Created file! BWI-AA.html
Created file! BWI-AS.html
Created file! BWI-G4.html


This is how the structure of the system looks after executing the above function: <br><br>
<img src = "Screenshot (111).png" style = "width: 60%; height: 60%;"><br><br>

You can see that there is a folder for each airport, and within each folder an HTML file for each carrier. Each file holds the page for that particular airport and carrier, ready to be scraped.

## 2. Scraping the data

Written below is a function that extracts the tabular data from any HTML page. Given the above folder structure, we can exploit the function and loop it to get data for all the pages.

In [31]:
def extract_table_data(page):
    '''
    function to extract all data present in the <table> element of [page] programmatically
    and store this data.
    
    The resultant data structure is a list of dictionaries, where each dictionary is data 
    corresponding to a particular time frame.
    '''
    
    html = urllib.urlopen(page).read()
    soup = BeautifulSoup(html, "lxml")
    
    data = []
    
    table = soup.find('table', class_ = "dataTDRight")
    tr =  table.find_all('tr', class_ = "dataTDRight")
    for tags in tr:
        inner_data = []
        info = {}
        td = tags.find_all('td')
        for inner_tags in td:
            inner_data.append(inner_tags.text)

        if (inner_data[1] != "TOTAL"):
            # values go through convert_to_int rather than a bare int() call (explained below)
            info["year"] = convert_to_int(inner_data[0]) 
            info["month"] = convert_to_int(inner_data[1])
            info["domestic"] = convert_to_int(inner_data[2])
            info["international"] = convert_to_int(inner_data[3])
            
            # print info
            data.append(info)

    return data

In [32]:
# sample output
extract_table_data(scrape_page)

[{'domestic': 815489, 'international': 92565, 'month': 10, 'year': 2002},
 {'domestic': 766775, 'international': 91342, 'month': 11, 'year': 2002},
 {'domestic': 782175, 'international': 96881, 'month': 12, 'year': 2002},
 {'domestic': 785651, 'international': 98053, 'month': 1, 'year': 2003},
 {'domestic': 690750, 'international': 85965, 'month': 2, 'year': 2003},
 {'domestic': 797634, 'international': 97929, 'month': 3, 'year': 2003},
 {'domestic': 766639, 'international': 89398, 'month': 4, 'year': 2003},
 {'domestic': 789857, 'international': 87671, 'month': 5, 'year': 2003},
 {'domestic': 798841, 'international': 95435, 'month': 6, 'year': 2003},
 {'domestic': 832075, 'international': 102795, 'month': 7, 'year': 2003},
 {'domestic': 831185, 'international': 102145, 'month': 8, 'year': 2003},
 {'domestic': 782264, 'international': 90681, 'month': 9, 'year': 2003},
 {'domestic': 818777, 'international': 91820, 'month': 10, 'year': 2003},
 {'domestic': 766266, 'international': 91004,

### Data Cleaning
Notice that my program skips the rows with "TOTAL" in the month column; I want only the raw monthly values, not precomputed aggregates that would be double-counted in any later sum.
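
The effect of skipping those rows can be seen on a few hand-typed rows. This is a Python 3 sketch under assumptions: the first two rows mirror the sample output above, the comma-separated cell format is assumed, and the TOTAL row is made up as the sum of the two.

```python
def rows_to_records(rows):
    """Turn [year, month, domestic, international] string rows into dicts,
    skipping the precomputed TOTAL rows."""
    records = []
    for year, month, domestic, international in rows:
        if month == "TOTAL":
            continue
        records.append({
            "year": int(year),
            "month": int(month),
            "domestic": int(domestic.replace(",", "")),
            "international": int(international.replace(",", "")),
        })
    return records


# The TOTAL row is invented for illustration.
sample_rows = [
    ["2002", "10", "815,489", "92,565"],
    ["2002", "11", "766,775", "91,342"],
    ["2002", "TOTAL", "1,582,264", "183,907"],
]

records = rows_to_records(sample_rows)
domestic_sum = sum(r["domestic"] for r in records)
print(len(records), domestic_sum)
```

Had the TOTAL row been kept, the domestic sum would have come out at twice the real figure.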

You will also notice that *extract_table_data* does not call *int* on the values directly. To see why, look at the cell below.

In [None]:
int(u'\xa0')

To avoid this, I have put the integer casting in a try block, which returns 0 if the casting is unsuccessful.

Now this is well and good, but you might be wondering how I noticed this error. Remember how I told you to always try your functions on a smaller dataset first? Well, that's how I found that a value wasn't being cast properly, and I wrote this function accordingly.

In [30]:
def convert_to_int(val):
    '''
    helper function to try and convert a value to its corresponding integer value, else 0
    '''
    try:
        int_val = int(val.replace(',', ''))
        return int_val
    except (ValueError, AttributeError):
        # non-numeric text (e.g. a non-breaking space) or a non-string value
        return 0
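
Returning 0 makes an empty cell indistinguishable from a month with genuinely zero flights. Whether that matters depends on the analysis; if it does, one option is the hypothetical variant below (Python 3, not part of the original notebook), which returns `None` for cells that hold no number.

```python
def to_int_or_none(val):
    """Like convert_to_int, but returns None for cells that hold no number."""
    try:
        return int(val.replace(",", ""))
    except (ValueError, AttributeError):
        return None


# "\xa0" is the non-breaking space that an empty table cell can contain.
cells = ["815,489", "\xa0", "", "92565"]
parsed = [to_int_or_none(c) for c in cells]
print(parsed)
```

Pandas turns `None` into a missing value when building a DataFrame, so such cells can later be dropped or filled explicitly.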

## 3. Converting the scraped data into a more accessible format
Hurray! We now have the data we need; however, the data is present as a list of dictionaries.
To perform statistical analyses or draw plots, it makes sense to have this data in a format that suits Python's data analysis tools such as NumPy or Pandas.

Luckily, a list of dictionaries can be converted to a Pandas DataFrame with the built-in *DataFrame.from_dict* function. I have wrapped this function in my own function for clarity.

In [None]:
def convert_to_dataframe(dic):
    '''
    function to convert dictionary data [dic] into a pandas DataFrame
    '''
    return pd.DataFrame.from_dict(dic)

In [None]:
# d is a list of dictionaries
d = extract_table_data(scrape_page)

# df is a pandas dataframe
df = convert_to_dataframe(d)

print df

Yay! Now that we have *extract_table_data*, I have written a function that applies it to all files within the "airport-data" folder.

This gives rise to a list, where each element concerns a particular airport. Each "airport" element is itself a list of single-entry dictionaries, one per carrier, each mapping a name such as "ATL-AA" to that carrier's DataFrame.

In [None]:
def convert_to_data_list():
    
    g_list = []
    os.chdir(r"C:\Users\Raghav\Desktop\Data Analysis Nanodegree\Data Wrangling")
    os.chdir("airport-data")

    # print os.listdir(os.getcwd())

    for dir in os.listdir(os.getcwd()):
        # go inside
        os.chdir(dir)

        # print os.listdir(os.getcwd())

        l_list = []

        for inner_dir in os.listdir(os.getcwd()):

            l_dic = {}        
            
            print "Reading file:", inner_dir
            
             # print inner_dir
            try:
                
                dic = extract_table_data(inner_dir)
                print "Converted {0}!".format(inner_dir)
                
            except:
                
                dic = {}
                print "Could not convert {0} ".format(inner_dir)

            df = convert_to_dataframe(dic)

            inner_name = inner_dir.split(".")[0]
            l_dic[inner_name] = df

            l_list.append(l_dic)
            print ""

        g_list.append(l_list)

        # come out
        os.chdir("..")
    return g_list

In [None]:
airport_sample_data = convert_to_data_list()

Now that we have all this data, we need a convenient way for a user to get at it. The function below, *get_data*, does exactly that!

As the user, you just need to provide the airport and carrier whose information you seek.

In [None]:
def get_data(airport, carrier):
    ind_airport = airports_sample.index(airport) # change this line later
    ind_carrier = carriers_sample.index(carrier) # change this too
    text = str(airport) + "-" + str(carrier)
    print ind_airport, ind_carrier, text
    return airport_sample_data[ind_airport][ind_carrier][text]
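
*get_data* works as long as the order of the folders and files on disk matches the order of the sorted sample lists. A lookup that does not depend on positions can be sketched as below (Python 3). The helper `build_lookup` is hypothetical, and plain strings stand in for the DataFrames.

```python
def build_lookup(nested):
    """Flatten the list of lists of single-entry dicts into one dict
    keyed by (airport, carrier)."""
    lookup = {}
    for airport_entries in nested:
        for entry in airport_entries:
            for name, table in entry.items():
                # "ATL-AA" -> ("ATL", "AA")
                airport, carrier = name.split("-", 1)
                lookup[(airport, carrier)] = table
    return lookup


# Stand-in data: strings take the place of the DataFrames.
nested_sample = [
    [{"ATL-AA": "table for ATL/AA"}, {"ATL-AS": "table for ATL/AS"}],
    [{"BOS-AA": "table for BOS/AA"}],
]

lookup = build_lookup(nested_sample)
print(lookup[("BOS", "AA")])
```

With such a dictionary, asking for a missing combination raises a `KeyError` that names the pair, rather than silently returning data for the wrong position.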

In [None]:
# example usage: airport is ATL and carrier is AA
atl_aa = get_data('ATL', 'AA')
atl_aa

## 4. Analysing the extracted data
Phew! That was a lot!

Now that we have the data we always wanted in a convenient DataFrame, let's see if we can spot some trends in the data.

Consider what we have been working with: the ATL airport and AA airlines.

In [None]:
# helper functions to extract certain details 
# example in next cell

def get_year(df, yr):
    return df[df['year'] == yr]

def get_month(df, month):
    return df[df['month'] == month]

def get_column_value(df, column, value):
    return df[df[column] == value]

In [None]:
# DataFrame for carrier AA at airport ATL (filtered to 2015 in the next cell)
atl_aa = get_data('ATL', 'AA')

In [None]:
atl_aa_2015 = get_year(atl_aa, 2015)

In [None]:
# plotting extracted data
sns.barplot(data = atl_aa_2015, x = "month", y = "domestic")

plt.xlabel('Month number')
plt.ylabel('Number of domestic flights')
plt.title('Domestic AA flights from ATL for 2015')
plt.ylim(atl_aa_2015['domestic'].min())

Looking at the graph above, the jump from June to July does seem like a lot for AA in 2015! 

After Googling around a bit, I found an <a href = "http://www.atl.com/wp-content/uploads/2017/01/07-01-2015ATL-to-exceed-national-travel-projections.pdf">article</a> which states that the airport picked up immense traffic in July for the Independence Day weekend, almost 14% more passengers! American Airlines, being the <a href = "https://en.wikipedia.org/wiki/List_of_largest_airlines_in_North_America">largest airline</a>, would thus see a higher increase.

That was a sweet prediction! 

# Conclusion

What did we cover in this project? 

We downloaded data that sits behind a web form, scraped the tables into Python data structures, converted them into Pandas DataFrames, and took a first look at the result with a plot.

Overall, this project serves as a pretty good overview of data wrangling.

# Additional stuff

If you want to look at data about certain airports and/or carriers, just make sure **airports_sample** and **carriers_sample** contain those values, and rerun all the cells in the notebook.

In [None]:
def convert_to_data_list_from_carrier(carrier):
    '''
    function generalizing [extract_table_data] function for all files within "airport-data" folder 
    for a PARTICULAR carrier
    '''

    g_list = []
    os.chdir(r"C:\Users\Raghav\Desktop\Data Analysis Nanodegree\Data Wrangling")
    os.chdir("airport-data")

    for dir in os.listdir(os.getcwd()):
        # go inside
        os.chdir(dir)

        for inner_dir in os.listdir(os.getcwd()):
            
            # break "ATL-AA.html" to "ATL-AA"
            name = inner_dir.split(".")[0]
            
            # break "ATL-AA" to "AA"
            carrier_name = name.split("-")[1]
            
            if (carrier_name == carrier):
            
                l_dic = {}        

                try:
                    print "Reading file:", inner_dir
                    dic = extract_table_data(inner_dir)
                    # only report success when extraction actually worked
                    print "Converted {0}!".format(inner_dir)

                except:
                    print "Could not convert {0} ".format(inner_dir)
                    dic = {}

                df = convert_to_dataframe(dic)

                inner_name = inner_dir.split(".")[0]
                l_dic[inner_name] = df

                g_list.append(l_dic)

        # come out
        os.chdir("..")
                
    return g_list
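
The file-name handling in this function can be tried in isolation. The Python 3 sketch below uses a hypothetical helper, `parse_filename`, and a hand-typed list of file names in the `AIRPORT-CARRIER.html` pattern that *create_files* produces.

```python
import os


def parse_filename(filename):
    """Split 'ATL-AA.html' into ('ATL', 'AA')."""
    stem, _extension = os.path.splitext(filename)
    airport, carrier = stem.split("-", 1)
    return airport, carrier


# Hand-typed names following the pattern used by create_files.
files = ["ATL-AA.html", "ATL-AS.html", "BOS-AA.html"]

# Keep only the files for one carrier, as the function above does.
aa_files = [f for f in files if parse_filename(f)[1] == "AA"]
print(aa_files)
```

Using `os.path.splitext` removes only the final extension, which is a little more robust than splitting on the first dot.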

# References
1. <a href = "https://in.udacity.com/course/data-analyst-nanodegree--nd002/">Udacity Data Analyst Nanodegree</a>
2. <a href = "http://www.atl.com/wp-content/uploads/2017/01/07-01-2015ATL-to-exceed-national-travel-projections.pdf">ATL travel predictions</a>
3. <a href = "https://en.wikipedia.org/wiki/List_of_largest_airlines_in_North_America">List of largest airlines in North America (Wikipedia)</a>