# Python for Data Science

##### By [Jiin Jung](http://jiinjung.com) & [Minjae Yun](https://sites.google.com/view/minjaeyun/home)
##### Claremont Graduate University
##### December 6 (Friday), 2019

# Part II. Web Scraping

In this part, we look at what web scraping is, how to carry it out in Python, and how to store the collected information in a flat file.

## 1. What is web scraping?

In [None]:
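# open Anaconda Prompt on your computer and install the packages first:
# pip install requests
# pip install beautifulsoup4
# pip install lxml   (the 'lxml' parser used with BeautifulSoup needs this package)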
import requests 
from bs4 import BeautifulSoup
url = 'https://www.icpsr.umich.edu/files/NACJD/ORIs/STATESoris.html'

To collect every link in the list of agencies:

In [None]:
# "." accesses a function (or attribute) that belongs to a package or an object
# use the requests package to download the page at the defined url
s = requests.get(url)
# use the BeautifulSoup package to parse the downloaded text as html
soup = BeautifulSoup(s.text, 'lxml')
# get every "li" element from the html source
# (find_all is the current name; findAll is an older alias for the same function)
elements = soup.find_all("li")
# note: python treats "" and '' the same way when writing strings

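To see what the parsing step produces without making a network request, the sketch below collects `href` values from `li` elements in a small HTML string. It deliberately swaps in the standard-library `html.parser` module in place of BeautifulSoup, and the class name `LinkCollector`, the sample markup, and the file names in it are all hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that sits inside an <li> tag."""

    def __init__(self):
        super().__init__()
        self.in_li = False   # True while the parser is inside an <li> element
        self.links = []      # href values found so far

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
        elif tag == "a" and self.in_li:
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False


# hypothetical markup, loosely shaped like a list of state pages
sample_html = """
<ul>
  <li><a href="ALoris.html">Alabama</a></li>
  <li><a href="AKoris.html">Alaska</a></li>
</ul>
<a href="ignored.html">not inside a list item</a>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```
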
To get the first link in the list (index 0):

In [None]:
# get "href" (Hypertext REFerence) 'attribute' from the element
elements[0].a['href']
# concatenate strings (or merge text) in python: "+"
url = 'https://www.icpsr.umich.edu/files/NACJD/ORIs/'+ elements[0].a['href']
print(url)

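Joining strings with "+" works when every `href` is a plain relative file name. As a hedged alternative, the standard-library function `urllib.parse.urljoin` resolves a link against the page it came from. The value `'ALoris.html'` below is a hypothetical stand-in for `elements[0].a['href']`, and the `example.org` address is made up.

```python
from urllib.parse import urljoin

# page that lists the states (the url defined at the start of this part)
index_url = 'https://www.icpsr.umich.edu/files/NACJD/ORIs/STATESoris.html'
# hypothetical relative link, standing in for elements[0].a['href']
relative_href = 'ALoris.html'

# urljoin replaces the last path segment of the base with the relative link
full_url = urljoin(index_url, relative_href)
print(full_url)

# an href that is already absolute is returned unchanged
absolute_href = 'https://example.org/other.html'
print(urljoin(index_url, absolute_href))
```
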
## 2. Repeating the same task: for loop and while loop
The idea: generate a changing value (such as an index), pass it to the task, and repeat until every value has been used.
### For loop
Python runs the tasks inside the loop sequentially, one pass at a time.

In [None]:
# Python indexing always starts from 0!
list(range(10))
# the 'range' function includes the initial value and excludes the final value
# loop over the values produced by 'range'
for i in range(10):
    print(i)
# we can also assign the initial value
for i in range(6,10):
    print(i)

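`range` also accepts a third argument, the step. The short sketch below illustrates the start, stop, and step arguments together; the variable names are only for illustration.

```python
# start, stop, and step: count from 0 up to (but not including) 10 in steps of 2
evens = list(range(0, 10, 2))
print(evens)

# a negative step counts downward; the stop value is still excluded
countdown = list(range(5, 0, -1))
print(countdown)

# with a step of 1, the number of values is stop - start
print(len(range(6, 10)))
```
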
### Practice
Use a for loop and one line of code to print 10 consecutive powers of 5

In [None]:
# in other words, pick any powers of 5 in consecutive order
for i in range(#fill out this one#):
    val = #fill out this line#
    print(val)

Let's apply a for loop to collect every link in the example url

In [None]:
length = len(elements)
links = []
for i in range(length):
    links.append('https://www.icpsr.umich.edu/files/NACJD/ORIs/'+elements[i].a['href'])
links[1]

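The index-based loop above can also be written by looping over the items directly. The sketch below compares three equivalent spellings. It uses a hypothetical list `hrefs` in place of the `href` values read from `elements`, so it runs without downloading anything.

```python
base = 'https://www.icpsr.umich.edu/files/NACJD/ORIs/'
# hypothetical href values standing in for elements[i].a['href']
hrefs = ['ALoris.html', 'AKoris.html', 'AZoris.html']

# index-based loop, as in the cell above
links_by_index = []
for i in range(len(hrefs)):
    links_by_index.append(base + hrefs[i])

# looping over the items directly avoids the index bookkeeping
links_direct = []
for href in hrefs:
    links_direct.append(base + href)

# a list comprehension writes the same loop in one line
links_comprehension = [base + href for href in hrefs]

print(links_comprehension[1])
```
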
### While loop

A while loop does the same task but uses a conditional statement instead of a predefined sequence of numbers.

In [None]:
i = 0
links = []
# Question: why don't we include the final value 'length' itself in the loop?
# Answer:
while i < length:
    links.append('https://www.icpsr.umich.edu/files/NACJD/ORIs/'+elements[i].a['href'])
    i = i+1

In [None]:
# Python lets us write i += 1 as shorthand for i = i+1
i = 0
links = []
while i < length:
    links.append('https://www.icpsr.umich.edu/files/NACJD/ORIs/'+elements[i].a['href'])
    i += 1

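A while loop is most useful when the number of passes is not known in advance. As a rough sketch, the loop below works through a queue that can grow while it is being processed. The page names are hypothetical, and the discovery of a new page is simulated rather than scraped.

```python
# hypothetical work queue of pages still to visit
pending = ['page1.html', 'page2.html', 'page3.html']
visited = []

# keep going for as long as the list is non-empty;
# the loop does not need to know the number of pages in advance
while pending:
    page = pending.pop(0)   # take the first remaining page
    visited.append(page)
    # a page may reveal one more page to visit (simulated here)
    if page == 'page2.html':
        pending.append('page2b.html')

print(visited)
```
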
### Practice
Use a while loop and the list 'append' function to get the content of every link

In [None]:
# first one:
contents = []
url = links[0]
s = requests.get(url)
# use the BeautifulSoup package to parse the content as html
soup = BeautifulSoup(s.text, 'lxml')
# get the "pre" (preformatted text) elements from the html source
content = soup.find_all("pre")
contents.append(content)

In [None]:
# the rest: 

### Difference between for loop and while loop
A for loop gives exhaustive access to each element of an object, while a while loop keeps running for as long as a conditional statement holds.

In [None]:
my_list = [1, 2, 3]
for element in my_list:
    # reassigning 'element' does not change my_list itself
    element = element + 10
    print(element)

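One detail of the for loop above is easy to miss: assigning a new value to the loop variable does not modify the list. The sketch below shows this, along with two common ways to keep the shifted values. The name `shifted` is introduced only for illustration.

```python
my_list = [1, 2, 3]

# reassigning the loop variable leaves the list untouched
for element in my_list:
    element = element + 10
print(my_list)

# build a new list to keep the shifted values
shifted = [element + 10 for element in my_list]
print(shifted)

# or overwrite the entries in place through their index
for i in range(len(my_list)):
    my_list[i] = my_list[i] + 10
print(my_list)
```
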
In [None]:
# get 10 odd natural numbers, starting from the 6th odd number (11)
# the k-th odd natural number is 2*k - 1
i = 6
while i < 6+10:
    number = 2*i - 1
    print(number)
    i+=1

## 3. Regular Expression
To extract the right information, we first need to break the raw text into elements and then reassemble those elements in the right order, which is why we need [regular expressions](https://www.regular-expressions.info/).

In [None]:
# 1) bring down the content
# an html element has an attribute 'text' which drops the html tags and returns only the string
import re
first = contents[0]
print(first[0].text)
#length=len(first[0].text)
#"AUTAUGA COUNTY SHERIFF'S OFFICE" in first[0].text
content = first[0].text
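# split the text on any run of whitespace (the raw string r"..." keeps the backslash intact)
print(re.split(r"\s+", content))
actual_content = re.split(r"\s+", content)
# drop the empty strings left over at the start or end of the text
actual_content = [x for x in actual_content if x != '']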

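The effect of the split and the filter is easier to see on a small string. The sample line below is hypothetical and only loosely modeled on the agency listing. The sketch also shows that `str.split()` with no argument gives the same tokens without a regular expression.

```python
import re

# hypothetical line, loosely modeled on the agency listing
sample = "  AUTAUGA COUNTY SHERIFF'S OFFICE    AL00402   AL0040200\n"

# re.split keeps an empty string wherever the text starts or ends with whitespace
tokens_re = re.split(r"\s+", sample)
print(tokens_re)

# filtering removes those empty strings, as in the cell above
tokens_clean = [x for x in tokens_re if x != '']

# str.split() with no argument also splits on whitespace runs and skips the empties
tokens_split = sample.split()
print(tokens_split)
```
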
In [None]:
# 2) get the variable titles first and find the location of each piece of information
title = []
for i in range(3):
    # take the i-th token, assuming the first three tokens are the column titles
    title.append(actual_content[i])
length=len(actual_content)
tickers=[]
for i in range(length):
    if 'AL' in actual_content[i]:
        tickers.append(i)
print(tickers)

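The substring test `'AL' in ...` also flags any ordinary word that happens to contain "AL". A stricter option, sketched below, is to match the whole token against a pattern. The patterns assume that an ORI7 code is two capital letters followed by five digits and an ORI9 code is two capital letters followed by seven digits. That assumption is inferred from sample values such as 'CA01999' and 'CA0199900' and may not cover every real code. The token list is hypothetical.

```python
import re

# hypothetical tokens; "ALABAMA" stands in for a header word that also contains 'AL'
tokens = ["ALABAMA", "AUTAUGA", "COUNTY", "SHERIFF'S", "OFFICE", "AL00402", "AL0040200"]

# substring test, as in the cell above: also flags the header word
substring_hits = [i for i, token in enumerate(tokens) if 'AL' in token]

# assumed code formats: two capital letters followed by 5 digits (ORI7) or 7 digits (ORI9)
ori7_pattern = re.compile(r"[A-Z]{2}\d{5}")
ori9_pattern = re.compile(r"[A-Z]{2}\d{7}")

# fullmatch requires the whole token to fit the pattern
ori7_hits = [i for i, token in enumerate(tokens) if ori7_pattern.fullmatch(token)]
ori9_hits = [i for i, token in enumerate(tokens) if ori9_pattern.fullmatch(token)]

print(substring_hits, ori7_hits, ori9_hits)
```
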
### Practice
Use the 'join' function we learned previously to get each office name

In [None]:
# get each office and put them in the same list
# example
', '.join(actual_content[5:7])

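As a reminder of how `join` and slicing interact, here is a small sketch on a hypothetical token list. The indices in the real `actual_content` will differ.

```python
# hypothetical tokens; the real positions in actual_content will differ
tokens = ["AUTAUGA", "COUNTY", "SHERIFF'S", "OFFICE", "AL00402", "AL0040200"]

# a slice [start:stop] includes start and excludes stop, just like range
name_tokens = tokens[0:4]

# the string before .join is the separator placed between the items
office_name = ' '.join(name_tokens)
print(office_name)

# a different separator gives a different result from the same tokens
print(', '.join(name_tokens))
```
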
In [None]:
# answer:
firstOffice = 
secondOffice = 
thirdOffice = 
names = [firstOffice, secondOffice, thirdOffice]

### Practice
Append the first office name, ORI7, and ORI9

In [None]:
# first office
agency=[]
agency.append(names[0])
agency.append(actual_content[tickers[0]])
agency.append(actual_content[tickers[0]+1])
print(agency)

Use a loop to build a list containing the other two agencies

In [None]:
agencies = []
# why does the range start with 1?
for i in range(1,3):
    one=[]
    one.append(names[i])
    # fill out this line
    # fill out this line
    agencies.append(one)
print(agencies)    

## 4. Pandas dataframe
Collect the agencies into a pandas DataFrame

In [None]:
agencies.append(agency)
import pandas as pd
agencyList = pd.DataFrame(agencies, columns = ['Agency', 'ORI7', 'ORI9'])
print(agencyList)
# 1) use pandas to save it to csv
#agencyList.to_csv('filename.csv', header = True, index = False)
# 2) can be saved without using pandas - not recommended
#with open('filename.txt', 'w', encoding='utf-8') as f:
#    f.write(agencyList.to_string(header = True, index = False))

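For comparison, a flat file can also be written with the standard-library `csv` module when pandas is not at hand. The sketch below writes to an in-memory buffer so that it leaves no file behind. The rows reuse two of the example agencies from this notebook, and the variable names are only for illustration.

```python
import csv
import io

# two of the example observations used in this notebook
header = ['Agency', 'ORI7', 'ORI9']
rows = [
    ['CHP LOS ANGELES COMM CTR', 'CA01999', 'CA0199900'],
    ['CLAREMONT POLICE DEPARTMENT', 'CA01913', 'CA0191300'],
]

# write to an in-memory buffer; open('filename.csv', 'w', newline='') works the same way
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

# read the text back to check that the flat file round-trips
buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back[0])
print(read_back[1])
```
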
### Merge dataframes
We may want to merge different datasets to have a collective source of information. <br>
We can use "pd.concat" and "pd.merge" for this, but they are not interchangeable: "pd.concat" stacks dataframes along rows or columns, while "pd.merge" matches rows on a shared key column. <br>
Please see [here](https://chrisalbon.com/python/data_wrangling/pandas_join_merge_dataframe/) for more details about "pd.merge".

In [None]:
new_obs = {                    
        'Agency': ['CHP LOS ANGELES COMM CTR', 'CHP PROTECTIVE SERVICES', 'CLAREMONT POLICE DEPARTMENT'],
        'ORI7': ['CA01999', 'CA01998', 'CA01913'],
        'ORI9': ['CA0199900', 'CA0199800', 'CA0191300'] 
}
more_info = {
        'ORI7': ['AL00402', 'AL00401', "AL00400", "CA01999", "CA01998", "CA01913"],
        'ZipCode': ['36003', '36067', '36067', '90041', '95831', '91711']
}
df_new = pd.DataFrame(new_obs, columns = ['Agency','ORI7', 'ORI9'])
df_more = pd.DataFrame(more_info, columns = ["ORI7", "ZipCode"])

In [None]:
# "concat" is useful for stacking observations: it appends the rows of df_new below agencyList
agencyList = pd.concat([agencyList, df_new], ignore_index=True)
print(agencyList)

In [None]:
# "merge" matches rows that share the same key (here "ORI7"); by default only keys found in both dataframes are kept
pd.merge(agencyList, df_more, on="ORI7")

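By default `pd.merge` performs an inner join, and passing `how="left"` keeps every row of the first dataframe instead. To make the difference concrete, the sketch below imitates both joins in plain Python, without pandas. It reuses a subset of the example data and adds one made-up agency, labeled as hypothetical, that has no zip code.

```python
# a subset of the example data from the cells above
agency_rows = [
    {'Agency': 'CHP LOS ANGELES COMM CTR', 'ORI7': 'CA01999'},
    {'Agency': 'CLAREMONT POLICE DEPARTMENT', 'ORI7': 'CA01913'},
    {'Agency': 'HYPOTHETICAL AGENCY', 'ORI7': 'ZZ99999'},  # made-up row with no zip code
]
zip_by_ori7 = {'CA01999': '90041', 'CA01998': '95831', 'CA01913': '91711'}

# inner join: keep a row only when its key appears on both sides
inner = [
    {**row, 'ZipCode': zip_by_ori7[row['ORI7']]}
    for row in agency_rows
    if row['ORI7'] in zip_by_ori7
]

# left join: keep every agency row and fill missing zip codes with None
left = [{**row, 'ZipCode': zip_by_ori7.get(row['ORI7'])} for row in agency_rows]

print(len(inner), len(left))
```
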
In [None]:
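# "concat" can also combine dataframes columnwise (axis = 1)
# rows are paired by index label, not by "ORI7", so this only makes sense
# when both indexes refer to the same observations in the same order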
pd.concat([agencyList, df_more], axis = 1)