# **Web Scraping for Social Sciences**
Minjae Yun <br> <br>
[Textbook](https://jakevdp.github.io/PythonDataScienceHandbook/)

---


## Final Goals
*   Obtain and read a web page from its URL
*   Clean the information and stack it into a dataframe
*   Repeat the same tasks across many pages
*   Learn through trial and error
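
The goals above can be sketched as a single loop. This is only a minimal offline sketch, not the course's actual code: `clean`, `scrape_all`, `fake_fetch`, and the page contents are hypothetical stand-ins. In practice `fetch` would wrap something like `requests.get(url).text`.

```python
def clean(raw):
    """Clean one raw page: strip whitespace and lowercase (placeholder logic)."""
    return raw.strip().lower()

def scrape_all(urls, fetch):
    """Obtain each page, clean it, and stack the results as rows."""
    rows, failed = [], []
    for url in urls:                      # repeat the same task for every page
        try:
            raw = fetch(url)              # obtain and read the page
        except Exception as err:          # trial and error: record the failure, keep going
            failed.append((url, str(err)))
            continue
        rows.append({"url": url, "text": clean(raw)})
    return rows, failed

# Hypothetical offline stand-in for a real download
def fake_fetch(url):
    pages = {"page-a": "  Hello ", "page-b": " World  "}
    return pages[url]                     # raises KeyError for unknown pages

rows, failed = scrape_all(["page-a", "missing", "page-b"], fake_fetch)
print(rows)
print(failed)
```

A list of dictionaries like `rows` can be turned into a dataframe with `pd.DataFrame(rows)`, and `failed` shows which pages need another attempt.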


---


## Lecture 1 : Setting Up and Basics
  Lecture 2 : Basic Web Scraping <br>
  Lecture 3 : Advanced Web Scraping <br>


---


[Installing Anaconda](https://python-programming.quantecon.org/getting_started.html)

### 1. Load modules
[General installation method](https://docs.python.org/3/installing/)

In [None]:
# !pip install beautifulsoup4 lxml
# !pip install requests
import os # operating-system functionality (e.g. setting the working directory)
import pandas as pd # data-frame library (also reads files)
import requests # download pages from the web (no interactivity)
from bs4 import BeautifulSoup # parse the downloaded HTML
import re # regular expressions

### 2. Notes on list, tuple, and dataframe
* A list is the most general container for a collection of values, and it can be modified after it is created
* A tuple also holds a collection of values, but it is immutable: its contents cannot be changed after it is created
  * Developers use tuples for values that should stay fixed, e.g. homepage addresses (social scientists rarely need them)
* Pandas is a module for working with tabular data structures; its DataFrame was inspired by R's data frame


In [None]:
# a list can contain anything
a_list = [1,2,3,4,5] # numbers
b_list = ["a",3,5,"d","e"] # strings and numbers
c_list = [a_list,"b",6,b_list,7,"c" ] # lists, strings, and numbers
a_list = [1+x for x in a_list] # list comprehension: add 1 to every element
a_tuple=(1,2,3,4,5) # a tuple holds values like a list but is immutable
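
A short sketch of the practical difference between the two containers, using only built-in Python: a list can be changed in place, while assigning into a tuple raises `TypeError`. The variable names and values are illustrative.

```python
scores = [1, 2, 3]          # list: mutable
scores[0] = 10              # changing an element in place works
scores.append(4)            # so does adding a new element

homepage = ("https", "example.org")   # tuple: immutable (illustrative values)
try:
    homepage[0] = "http"    # attempt to change an element
    tuple_changed = True
except TypeError:
    tuple_changed = False   # tuples do not support item assignment

print(scores)
print(tuple_changed)
```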

### 3. Use modules
Examples of obtaining information from web pages

In [None]:
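tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies') # returns a list of DataFrames, one per table found
df = tables[0] # pick the first table on the page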

In [None]:
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/List_of_mental_disorders'
s = requests.get(url)
b = BeautifulSoup(s.text, 'lxml')
# create an empty list to hold the text of each list item
main_disorders = []

# grab all of the li tags and store their text
for i in b.find_all(name = 'li'):
    main_disorders.append(i.text)
# keep only the items we care about (these positions depend on the current page layout)
main_disorders = main_disorders[26:216]

In [None]:
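import json
from urllib.parse import quote
base = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/'
# article titles use underscores instead of spaces; percent-encode any other special characters
all_urls = [base + quote(x.strip().replace(' ', '_'), safe='') + '/daily/2010010100/2019020900' for x in main_disorders]

url = all_urls[0]
s = requests.get(url).content
d = json.loads(s.decode('utf-8')) # the body is already valid JSON; replacing quotes could corrupt it
values = pd.DataFrame.from_dict(d['items'])
values.head()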

| | project | article | granularity | timestamp | access | agent | views |
|---|---|---|---|---|---|---|---|
| 0 | en.wikipedia | Absence_seizure | daily | 2015070100 | all-access | user | 636 |
| 1 | en.wikipedia | Absence_seizure | daily | 2015070200 | all-access | user | 592 |
| 2 | en.wikipedia | Absence_seizure | daily | 2015070300 | all-access | user | 460 |
| 3 | en.wikipedia | Absence_seizure | daily | 2015070400 | all-access | user | 454 |
| 4 | en.wikipedia | Absence_seizure | daily | 2015070500 | all-access | user | 488 |
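
The response body is JSON, so it can be parsed directly and reduced to the fields of interest. Below is a minimal offline sketch using only the standard library. `sample` is a hand-written string that mimics the shape of the response, reusing the first two rows of the output above, and `build_url` is a hypothetical helper that encodes an article title before placing it in the URL.

```python
import json
from urllib.parse import quote

BASE = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        "en.wikipedia/all-access/user/")

def build_url(title, start="2010010100", end="2019020900"):
    """Hypothetical helper: spaces become underscores, the rest is percent-encoded."""
    article = quote(title.replace(" ", "_"), safe="")
    return f"{BASE}{article}/daily/{start}/{end}"

# Hand-written sample mimicking the response shape (not a live download)
sample = '''{"items": [
  {"project": "en.wikipedia", "article": "Absence_seizure", "granularity": "daily",
   "timestamp": "2015070100", "access": "all-access", "agent": "user", "views": 636},
  {"project": "en.wikipedia", "article": "Absence_seizure", "granularity": "daily",
   "timestamp": "2015070200", "access": "all-access", "agent": "user", "views": 592}
]}'''

items = json.loads(sample)["items"]                           # list of dictionaries
views_by_day = {row["timestamp"]: row["views"] for row in items}

print(build_url("Absence seizure"))
print(views_by_day)
```

Encoding the title matters because list items scraped from a page can contain spaces, apostrophes, or slashes, any of which would otherwise produce a malformed request URL.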


### 4. Regular expressions
[Google education](https://developers.google.com/edu/python/regular-expressions) <br>
[Practice page](https://regexr.com/) <br>
Regular expressions let us capture text that matches an abstract pattern (e.g. "any run of digits") rather than one fixed word.

In [None]:
import re
text = "take each type of chunks of characters from this string: @$%&* 18424 AZDFTFH kjiwer mail@address.com another@address.com theother@gmail.com"
print(re.findall(r"\w+", text)) # runs of word characters (letters, digits, underscore)
print(re.findall(r"[a-z]+", text)) # runs of lowercase letters only
print(re.findall(r"[0-9]+", text)) # runs of digits only
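
The same string also contains e-mail addresses, which makes a useful next exercise. The sketch below uses a deliberately simplified e-mail pattern (an assumption for illustration; real addresses can be more complex) and shows how a capture group returns only the part of the match inside the parentheses.

```python
import re

text = ("take each type of chunks of characters from this string: @$%&* 18424 "
        "AZDFTFH kjiwer mail@address.com another@address.com theother@gmail.com")

# Simplified e-mail pattern: local part, "@", then dot-separated domain labels
email_pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
emails = re.findall(email_pattern, text)

# With one capture group, findall returns only the group: here, the domain
domains = re.findall(r"@([\w-]+(?:\.[\w-]+)+)", text)

# Runs of uppercase letters only
uppercase = re.findall(r"[A-Z]+", text)

print(emails)
print(domains)
print(uppercase)
```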