##Scraping data by interacting with a web application

I am interested in the daily temperatures for a given city. There is a web site that provides that information:

http://www.wunderground.com/history/


One can go to the web page and get the data manually, but what happens if you want to collect the mean temperature for each of the 365 days of 2011 (for instance)? Are you going to repeat the process manually 365 times? That would be an unhealthy decision. Let's see what we can do.


The process of taking data from a web site is known as "scraping". One can scrape data from a URL in different ways. However, sometimes we are forced to interact with the web site, and that reduces the number of options. This happens with browser applications in which the user must specify the parameters of a search. In this respect, two of the available options are:
* **Tuning the URL**: we can use a for loop or similar to fill the URL with the parameters of the search.
* **Mechanize library**: this library allows the user to automate the interaction with the web app.
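
For the first option, tuning the URL, a minimal sketch could look like the one below. It is not tested against the live site: it assumes the server accepts a plain GET with the field names of the search form that this notebook inspects later on, and `build_history_url` is a hypothetical helper name introduced only for illustration.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

# Action URL and field names copied from the search form inspected later on
ACTION = "http://www.wunderground.com/cgi-bin/findweather/getForecast"

def build_history_url(location, year, month, day):
    """Hypothetical helper: build the GET URL that the form would request."""
    params = [
        ("airportorwmo", "query"),
        ("historytype", "DailyHistory"),
        ("backurl", "/history/index.html"),
        ("code", location),
        ("month", month),
        ("day", day),
        ("year", year),
    ]
    # urlencode escapes spaces, commas and slashes in the values
    return ACTION + "?" + urlencode(params)

url_for_day = build_history_url("Washington, District of Columbia", 2011, 1, 3)
```

Calling the helper inside a loop over years, months and days would give one URL per date, which is the "for loop" idea described in the first bullet.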

I will use the second option, as it seems fancier and more useful for future work. However, it has limitations: forms and their fields should be named in order to work smoothly with this library. The problem of unnamed form fields is discussed at the link below.

http://stackoverflow.com/questions/27486361/mechanizer-in-python-selecting-form-field-with-no-name


###Mechanize 


In [98]:
# Mechanize
import mechanize
# Opener and parseresponse
from mechanize._opener import urlopen
from mechanize._form import ParseResponse
# import beautifulsoup
from bs4 import BeautifulSoup
# import os
import os
# import pandas (needed later to build the dataframe)
import pandas as pd

In [99]:
br = mechanize.Browser()
url = "http://www.wunderground.com/history/"

Get Forms:

In [100]:
request = mechanize.Request(url)
response = mechanize.urlopen(request)

Parse forms:

In [101]:
forms = mechanize.ParseResponse(response, backwards_compat=False)

How many forms do we have?


In [64]:
print len(forms) # there are 5 forms

5


Check each form and pick the one you need:

In [65]:
print forms[4]

<GET http://www.wunderground.com/cgi-bin/findweather/getForecast application/x-www-form-urlencoded
  <HiddenControl(airportorwmo=query) (readonly)>
  <HiddenControl(historytype=DailyHistory) (readonly)>
  <HiddenControl(backurl=/history/index.html) (readonly)>
  <TextControl(code=)>
  <SelectControl(month=[1, 2, 3, 4, 5, 6, 7, 8, 9, *10, 11, 12])>
  <SelectControl(day=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, *30, 31])>
  <SelectControl(year=[*2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970, 1969, 1968, 1967, 1966, 1965, 1964, 1963, 1962, 1961, 1960, 1959, 1958, 1957, 1956, 1955, 1954, 1953, 1952, 1951, 1950, 1949, 1948, 1947, 1946, 1945])>
  <SubmitControl(<None>=Submit) (readonly)>>


You can select the form in two ways:

* By name (it works when the form has a name): br.select_form("form1")

* Or by index when the form is unnamed: br.form = forms[4]

In [102]:
br.form = forms[4]  

Our object:

In [103]:
br.form 

<mechanize._form.HTMLForm instance at 0x10ba12d40>

In [82]:
# Specify the form options

LOCATION = "Washington, District of Columbia"
DAY = ['3']      # a list, not a string (select controls expect lists)
MONTH = ['1']    # also a list
YEAR = ['2011']  # also a list
 
# input options into the form
br.form["code"] = LOCATION
br.form["month"] = MONTH
br.form["day"] = DAY
br.form["year"] = YEAR

In [83]:
# submit the form
page = urlopen(br.form.click()).read()
 
# BS4 the response
soup = BeautifulSoup(page, "html.parser")

We can then proceed as usual by finding the table containing the data of interest:

In [84]:
table = soup.find('table',{'id':'historyTable'})


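temperatures = []

def get_temperatures():
    # collect the text of every element nested in the table
    data = []
    children = table.findChildren()
    for child in children:
        data.append(child.text)
    # the value sits two elements after the "Mean Temperature" label;
    # split() keeps only the number and drops the unit
    Mean_temperature = data[data.index("Mean Temperature") + 2].split()[0]

    return Mean_temperature

temperatures.append(get_temperatures())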

In [85]:
temperatures

[u'2']
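
The lookup used above (find a label, then read the cell a fixed number of positions after it) can be tried without any network access. The sketch below is only an illustration: `value_after_label` is a hypothetical helper, and the list of cell texts is made up, not taken from the real page.

```python
def value_after_label(cells, label, offset=2):
    """Return the first token of the cell `offset` places after `label`, or None."""
    try:
        return cells[cells.index(label) + offset].split()[0]
    except (ValueError, IndexError):
        # label missing, list too short, or the cell is empty
        return None

# Made-up cell texts, shaped like a flattened table
cells = ["Mean Temperature", "2 C", "2 C", "Max Temperature", "", ""]

mean_value = value_after_label(cells, "Mean Temperature")
max_value = value_after_label(cells, "Max Temperature")   # empty cell
min_value = value_after_label(cells, "Min Temperature")   # label not present
```

Returning `None` for a missing value, instead of raising, makes it easy to drop incomplete days later.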

Yeah! We have a list with one of the temperatures. It took me around 2 days to code this, so I hope it helps you!

Now we want the whole list of temperatures. Let's do this:

In [120]:
def Build_temperature_date_df():
    '''returns a single dataframe with all the temperatures per day'''
    
    Days=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14',
          '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26',
          '27', '28', '29', '30', '31']
      
    Months=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12']
    
    Years= ['2011','2012','2013']
    
    LOCATION = "Washington, District of Columbia" 
    
    #Empty list where to put the mean temperatures per day
    temperatures=[]
    #Empty list where to put the dates
    dates=[]
    
    for year in Years:
        YEAR = [str(year)]              # a list, not a string
        for month in Months:            # fixed: the list is named Months, not months
            MONTH = [str(month)]
            for day in Days:
                DAY = [str(day)]

                # input options into the form
                br.form["code"] = LOCATION
                br.form["month"] = MONTH
                br.form["day"] = DAY
                br.form["year"] = YEAR

                # submit the form
                page = urlopen(br.form.click()).read()

                # BS4 the response
                soup = BeautifulSoup(page, "html.parser")
                table = soup.find('table', {'id': 'historyTable'})

                # text of every element nested in the table, in document order
                data = [child.text for child in table.findChildren()]
                try:
                    # the value sits two elements after the "Mean Temperature" label
                    Mean_temperature = int(data[data.index("Mean Temperature") + 2].split()[0])
                except (ValueError, IndexError):
                    # label not found, or value missing or not a number
                    Mean_temperature = None

                temperatures.append(Mean_temperature)
                dates.append(str(year) + '-' + str(month) + '-' + str(day))
     
    temp_date_df=pd.DataFrame({'Date':dates, 'temperatures in C':temperatures})
    return temp_date_df                       
                                

In [121]:
Temperature_date_df=Build_temperature_date_df()

In [122]:
Temperature_date_df.head()

| | Date | temperatures in C |
|---|---|---|
| 0 | 2011-1-1 | 8 |
| 1 | 2011-1-2 | 8 |
| 2 | 2011-1-3 | 2 |
| 3 | 2011-1-4 | 1 |
| 4 | 2011-1-5 | 3 |


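NOTE: There are 20 fake dates. The loop tries 31 days for every month, so it also requests days that do not exist. For example, the web site jumps from **February 28th** to **March 1st**, while the script takes that temperature as the one registered on **February 29**. The same happens with February 30 and 31 and with the 31st of April, June, September and November.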

Let's fix it:

In [133]:
fakes=['2011-2-29','2011-2-30','2011-2-31', '2011-4-31','2011-6-31','2011-9-31','2011-11-31',
       '2012-2-30','2012-2-31', '2012-4-31','2012-6-31','2012-9-31','2012-11-31',
       '2013-2-29','2013-2-30','2013-2-31', '2013-4-31','2013-6-31','2013-9-31','2013-11-31']
len(fakes)
       

20
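
As a cross-check, the list of fake dates can be derived instead of typed by hand. The sketch below uses the standard `calendar` module and builds the same 'year-month-day' labels; the names `years` and `derived_fakes` are introduced here only for illustration.

```python
import calendar

years = [2011, 2012, 2013]

derived_fakes = []
for year in years:
    for month in range(1, 13):
        # monthrange returns (weekday of the 1st, number of days in the month)
        last_day = calendar.monthrange(year, month)[1]
        # every day number after the real end of the month, up to 31, is fake
        for day in range(last_day + 1, 32):
            derived_fakes.append("%d-%d-%d" % (year, month, day))

number_of_fakes = len(derived_fakes)
```

2012 is a leap year, so February 29 of that year is a real date and is not part of the list.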

In [134]:
Temperature_date_df=Temperature_date_df.dropna()     #drop all rows that have any NaN values

# keep only the rows whose date is not in the list of fakes
Temperature_date_df=Temperature_date_df[~Temperature_date_df.Date.isin(fakes)]

In [138]:
len(Temperature_date_df)


1096

**FIXED!!!!**

365+366+365 = 1096
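
An alternative to removing the fake dates afterwards is to loop only over real calendar days. This is a different approach from the one used in this notebook, sketched here with the standard `datetime` module; `valid_days` is a hypothetical helper name.

```python
from datetime import date, timedelta

def valid_days(first, last):
    """Yield every calendar day from `first` to `last`, inclusive."""
    current = first
    while current <= last:
        yield current
        current += timedelta(days=1)

days = list(valid_days(date(2011, 1, 1), date(2013, 12, 31)))

# Same 'year-month-day' label format, without zero padding, as in the dataframe
labels = ["%d-%d-%d" % (d.year, d.month, d.day) for d in days]
```

Each `d.year`, `d.month` and `d.day` could then be turned into the one-element string lists that the form expects, and no fake date would ever be requested.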

In [137]:
Temperature_date_df.set_index('Date').head()

| Date | temperatures in C |
|---|---|
| 2011-1-1 | 8 |
| 2011-1-2 | 8 |
| 2011-1-3 | 2 |
| 2011-1-4 | 1 |
| 2011-1-5 | 3 |


Finally the dataframe is exported as a CSV file.

In [139]:
Temperature_date_df.dropna().to_csv('temp_day', sep=',')
 