## Social Computing: Notebook 1
# Online data collection using the requests library

In [None]:
# Please include your names below and edit the name of the file to include the names of the people answering

# Students: Rupal Saxena, Orestis Oikonomou

In [None]:
import requests

As step 0, pick your favorite Wikipedia page, open it in the browser, and then save it as an html file. Now open it in the browser as well as in a text editor and look at the difference. 

Using the requests library you can retrieve the HTML source of the page in a response object (using `requests.get("url")`). The response object has content that you can access through its `.text` attribute.

Read `.text` and save the result in a file, then open the file in a browser and check whether you successfully saved the page. Note that you will only be able to open the file in the browser if you give it an .html extension.

In [None]:
## our code
filename = "getbillie.html"
res = requests.get("https://en.wikipedia.org/wiki/Billie_Eilish")
# Write as UTF-8 so non-ASCII characters on the page do not break the save
with open(filename, "w", encoding="utf-8") as file:
    file.write(res.text)

### 1) Basic web crawling (10 points)

URLs have specific formats. For example, any English Wikipedia article has a URL of the form https://en.wikipedia.org/wiki/Pythonidae, where the last part is the topic of the article.
Next, we want to automate this saving process using the requests library and making automated requests to Wikipedia.
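
The URL pattern described above can be sketched as a small helper. This is only an illustration using the standard library: the names `wiki_url` and `html_path` are hypothetical, and the replacement of spaces by underscores is an assumption based on how the example URLs in this notebook look.

```python
from pathlib import Path
from urllib.parse import quote

# Prefix taken from the example URL in the text
WIKI_BASE = "https://en.wikipedia.org/wiki/"


def wiki_url(topic):
    """Build the article URL for a topic such as 'Pythonidae' (hypothetical helper)."""
    # Article URLs use underscores instead of spaces; quote() escapes other special characters
    return WIKI_BASE + quote(topic.replace(" ", "_"))


def html_path(topic, folder="wiki_htmls"):
    """Local file path under which the page could be saved (hypothetical helper)."""
    return Path(folder) / (topic.replace(" ", "_") + ".html")


for topic in ["Pythonidae", "Billie Eilish"]:
    print(wiki_url(topic), "->", html_path(topic))
```

Keeping the URL construction in one function makes it easy to loop over many topics without repeating the string concatenation.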

Exercise: Pick 5 different words and write code that loops through these words, retrieves the HTML content of each associated Wikipedia page, and saves the HTML text as wiki_htmls/[word].html files. (Choose words that actually have associated Wikipedia pages.)


In [None]:
import os

# Create the output folder if it does not exist yet
os.makedirs('wiki_htmls', exist_ok=True)

words = ["Billboard_200", "Time_(magazine)", "Brit_Awards", "Golden_Globe_Award_for_Best_Original_Song", "MTV_Video_Music_Awards"]

for word in words:
    filename = "wiki_htmls/" + word + ".html"
    res = requests.get("https://en.wikipedia.org/wiki/" + word)
    # Write as UTF-8 so non-ASCII characters are saved correctly
    with open(filename, "w", encoding="utf-8") as file:
        file.write(res.text)
  

### 2) URL formats (10 points)

What is the common URL in the case of Google searches? And in the case of Yelp? 

**In the case of Google searches, the common URL is:**
https://www.google.com/search?q=

**In the case of Yelp, the common URL is:**
https://de.yelp.ch/search?


And what happens to the URL if you want to define the location as well as the type of venue you are looking for?



*   If we define the type of venue, the core part of the URL becomes: https://de.yelp.ch/search?find_desc=
*   If we also define the location, the core part of the URL becomes: https://de.yelp.ch/search?find_desc=&find_loc=
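
Search URLs of this kind can be assembled from a dictionary of parameters instead of by hand. The sketch below uses `urllib.parse.urlencode` from the standard library; the helper name `search_url` and the search terms are made up for illustration, and it assumes that `find_desc` holds the venue type and `find_loc` the location, as the parameter names suggest.

```python
from urllib.parse import urlencode


def search_url(base, **params):
    """Append URL-encoded query parameters to a base URL (hypothetical helper)."""
    return base + "?" + urlencode(params)


# Example search terms are illustrative only
google = search_url("https://www.google.com/search", q="pythonidae")
yelp = search_url("https://de.yelp.ch/search", find_desc="pizza", find_loc="Zurich")

print(google)
print(yelp)
```

The requests library offers the same convenience through the `params` argument of `requests.get`, which encodes a dictionary into the query string for you.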

Can you find more search parameters for either of the two pages that you can define via the URLs? What do they mean?



*   When we change the filter for price range, the URL becomes: https://de.yelp.ch/search?find_desc=&find_loc=zurich&attrs= . The value after *attrs=* defines the price range.
*   When we change the filter for distance, the URL becomes: https://de.yelp.ch/search?find_desc=&find_loc=zurich&l= . The value after *l=* is a series of characters that depends on the region we chose.
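
To inspect which parameters a given search URL carries, the query string can be split back into key/value pairs. This is a minimal sketch with the standard library; `query_parameters` is a hypothetical helper name, and the URL is the one from the price-range bullet above.

```python
from urllib.parse import parse_qs, urlsplit


def query_parameters(url):
    """Return the query parameters of a URL as a plain dict (hypothetical helper)."""
    query = urlsplit(url).query
    # keep_blank_values=True keeps parameters such as "find_desc=" that have no value yet
    parsed = parse_qs(query, keep_blank_values=True)
    # parse_qs returns a list per key; keep the first value for readability
    return {key: values[0] for key, values in parsed.items()}


params = query_parameters("https://de.yelp.ch/search?find_desc=&find_loc=zurich&attrs=")
print(params)
```

Comparing the dictionaries for two URLs (before and after changing a filter in the browser) is a quick way to spot which parameter a filter controls.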



### 3) HTML content basics (5 points)

In [None]:
import requests
res = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")

Using the BeautifulSoup parser library we will parse the webpage that you just retrieved.

In [None]:
# let's import BeautifulSoup, our parser library
# And make a soup object out of the html of the page

# in case bs4 throws error try
# !pip install --upgrade html5lib==1.0b8

from bs4 import BeautifulSoup
soup = BeautifulSoup(res.content, 'html.parser')

In [None]:
# print a nice version using prettify
print(soup.prettify())

<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    First paragraph.
   </p>
   <p class="inner-text">
    Second paragraph.
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    First outer paragraph.
   </b>
  </p>
  <p class="outer-text">
   <b>
    Second outer paragraph.
   </b>
  </p>
 </body>
</html>


Here's how we can find all instances of a tag at once: Try to predict what the following command will return: `soup.find_all('p')` and then call it to check if you were right. 

***Prediction***

It will return a list of all the parts of the HTML file that are defined as paragraphs (`<p>` elements).

In [None]:
soup.find_all('p')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>, <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

Print out the second element of this list.

In [None]:
print(soup.find_all("p")[1])

<p class="inner-text">
                Second paragraph.
            </p>


Print out the text inside the second element of the list, using the .text attribute of the element.

In [None]:
print(soup.find_all("p")[1].text)


                Second paragraph.
            


When you try to find a specific element on a page you can reach it by finding classes or IDs of the elements.

In [None]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

How many elements would it return for 'inner-text'? Guess, and check your guess by using the find_all command.

***Our guess***

It will return 2 elements.

In [None]:
soup.find_all('p', class_='inner-text')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]
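
Besides searching by tag and class, elements can also be looked up by ID or with a CSS selector. The sketch below works on a shortened inline copy of the example page so it runs without a network connection; the variable names are illustrative, and `get_text(strip=True)` is used to drop the surrounding whitespace that `.text` keeps.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example page used in this section
html = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# An ID is unique on a page, so find() returns a single element
by_id = soup_demo.find(id="second")

# class_ matches any element that has the class among its classes
first_items = soup_demo.find_all(class_="first-item")

# select() accepts CSS selectors: <p> elements with the class outer-text
outer_css = soup_demo.select("p.outer-text")

print(by_id.get_text(strip=True))
print(len(first_items), len(outer_css))
```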

### 4) Finding elements in the browser (50 points)
Since every web page is different and html can get very large and messy, the easiest way to find elements that you are interested in is to start from the browser window. So next we will quickly look at how to find elements using the developer tools in your browser. Open the following webpage in your browser (preferably Chrome): http://forecast.weather.gov/MapClick.php?lat=21.3049&lon=-157.8579#.Wkwh8VQ-fVo 

Find the developer tools in your browser. (In Chrome, it's View > Developer > Developer Tools, or Control+Shift+C on Windows and Command+Shift+C on Mac.) You should end up with a panel at the bottom or the right side of the browser like what you see below. Make sure the Elements panel is highlighted:

In [None]:
res = requests.get("http://forecast.weather.gov/MapClick.php?lat=21.3049&lon=-157.8579")
soup = BeautifulSoup(res.content, 'html.parser')

When trying to find a specific element, you can right click on it on the page and select "inspect". This will also open up the developer tools window. For example if we want to extract the current temperature value:

<img src="inspect.png">

<img src="inspect_class.png">

<br><br>
1. (5 points) Using the find function, extract and print out the current temperature from the page.
2. (5 points) Do the same with the value in Celsius.

In [None]:
### Fill out and print a full sentence describing the temperature in F and C. 
temp_F = soup.find('p', class_='myforecast-current-lrg').text
print("The current temperature in Fahrenheit is " + temp_F)
temp_C = soup.find('p', class_='myforecast-current-sm').text
print("The current temperature in Celsius is " + temp_C)

The current temperature in Fahrenheit is 80°F
The current temperature in Celsius is 27°C
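
The extracted values are strings such as "80°F". If a number is needed for further processing, it can be pulled out with a regular expression. This is a sketch under the assumption that the text always contains digits followed by a degree sign and a unit letter; both function names are hypothetical.

```python
import re


def parse_temperature(text):
    """Split a string like '80°F' into (80, 'F') (hypothetical helper)."""
    match = re.search(r"(-?\d+)\s*°\s*([FC])", text)
    if match is None:
        raise ValueError("no temperature found in " + repr(text))
    return int(match.group(1)), match.group(2)


def fahrenheit_to_celsius(deg_f):
    """Standard conversion from degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32) * 5 / 9


value_f, unit = parse_temperature("80°F")
print(value_f, unit, round(fahrenheit_to_celsius(value_f)))
```

Converting the Fahrenheit value this way gives a quick sanity check against the Celsius value scraped from the page.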


3. (20 points) In this exercise we will extract each half day's forecast from the extended forecast on the weather report page. <br>
    a. Find the container for the extended forecast on the weather page we just downloaded. <br>
    b. Make a list with all forecast items (overnight, Wednesday, Wednesday night, etc) <br>
    c. For each time period, print out the name of the period, the short description of the expected weather conditions, and the temperature. 

In [None]:
# Each forecast item is an <li> element with the class "forecast-tombstone"
forecast = soup.find_all('li', class_="forecast-tombstone")
for item in forecast:
    print(item.text)



Today
ScatteredShowersHigh: 84 °F


Tonight
ScatteredShowersLow: 73 °F


Thursday
Mostly Sunnyand BreezyHigh: 83 °F


ThursdayNight
ScatteredShowers andBreezyLow: 72 °F


Friday
ScatteredShowers andBreezyHigh: 83 °F


FridayNight
IsolatedShowers andBreezyLow: 72 °F


Saturday
IsolatedShowers andBreezyHigh: 83 °F


SaturdayNight
Mostly Clearthen IsolatedShowersLow: 72 °F


Sunday
IsolatedShowersHigh: 83 °F
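
In the output above, words run together (for example "ThursdayNight") because `.text` joins the pieces around `<br>` tags without a separator. A possible refinement is to read the period name, description, and temperature separately. The sketch below runs on a small stand-in snippet, and the class names `period-name`, `short-desc`, and `temp` are assumptions about the page markup that should be checked in the developer tools.

```python
from bs4 import BeautifulSoup

# Stand-in for one forecast item; the real page markup may differ
sample = """
<ul>
  <li class="forecast-tombstone">
    <p class="period-name">Thursday<br>Night</p>
    <p class="short-desc">Scattered<br>Showers and<br>Breezy</p>
    <p class="temp temp-low">Low: 72 °F</p>
  </li>
</ul>
"""


def forecast_rows(soup):
    """Return (period, description, temperature) tuples, one per forecast item."""
    rows = []
    for item in soup.find_all("li", class_="forecast-tombstone"):
        fields = []
        for class_name in ("period-name", "short-desc", "temp"):
            tag = item.find(class_=class_name)
            # Join text pieces with a space so words split by <br> stay separated
            fields.append(tag.get_text(" ", strip=True) if tag is not None else "")
        rows.append(tuple(fields))
    return rows


rows = forecast_rows(BeautifulSoup(sample, "html.parser"))
for period, description, temperature in rows:
    print(period, "|", description, "|", temperature)
```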


4. (20 points) Take a list of jobs (e.g. ['teacher', 'lawyer', 'data-scientist']). For each job, save the HTML of the result of searching for it on Indeed. The URL of a result page looks like: https://www.indeed.com/q-data-scientist-jobs.html.
<br>
For each job, find the names of the companies on the first result page. Make a dictionary where the keys are the jobs and each value is a list of the company names.

In [None]:
# Search result URL (first result page) for each job
job_urls = {
    "data-scientist": "https://www.indeed.com/q-data-scientist-jobs.html",
    "teacher": "https://www.indeed.com/jobs?q=teacher&l&vjk=df5f6bb62e33874f",
    "lawyer": "https://www.indeed.com/jobs?q=lawyer&l&vjk=dd4d90c192144f6b",
}

companies = {}

for job, url in job_urls.items():
    response_job = requests.get(url)
    soup_job = BeautifulSoup(response_job.content, 'html.parser')
    # Company names sit in <span> elements with the class "companyName"
    html_list = soup_job.find_all('span', class_="companyName")
    companies[job] = [company.text for company in html_list]

print(companies)

{'data-scientist': ['Nebraska Furniture Mart', 'Union Pacific', 'Spotify', 'BlueSky Technology Solutions', 'OTC Direct Inc', 'Accentuate Staffing', 'WorkCog', 'Amazon Web Services, Inc.', 'CDC Foundation ( Contract)', 'TheIncLab', 'Zoom Video Communications, Inc.', 'Lark Health', 'Millennium Health', 'Bayer', 'Radcube LLC'], 'teacher': ['Council Bluffs Community School District', 'Millard Public Schools', 'Council Bluffs Community School District', 'Millard Public Schools', 'US Bureau of Indian Education', 'Tinkergarten', 'US Bureau of Indian Education', 'Council Bluffs Community School District', 'ST. THOMAS MORE CATHOLIC SCHOOL', 'Teach Iowa', 'Council Bluffs Community School District', 'North Kingstown School Department', 'Teach Iowa', 'Glenwood CSD', 'Bellevue Public Schools'], 'lawyer': ['PSI Services LLC', 'PACCAR', 'US National Labor Relations Board', 'Contract Counselors', 'Animal Defense Legal Fund', 'U.S. Department of Education Office of General...', 'Spotify', 'US DHS Headq
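
The URL pattern from the task description and the extraction step can also be written as two small functions. This is a sketch only: the helper names are hypothetical, the `companyName` class is the one used in the code above and may change when the site is redesigned, and the HTML strings are tiny stand-ins so that the example runs without sending requests.

```python
from bs4 import BeautifulSoup


def indeed_url(job):
    """Result page URL following the pattern given in the task (hypothetical helper)."""
    return "https://www.indeed.com/q-" + job + "-jobs.html"


def company_names(html):
    """Collect the text of all <span class="companyName"> elements (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return [span.get_text(strip=True) for span in soup.find_all("span", class_="companyName")]


# Stand-in HTML per job; in the real exercise this would be the downloaded page source
pages = {
    "teacher": '<span class="companyName">Teach Iowa</span><span class="companyName">Glenwood CSD</span>',
    "lawyer": '<span class="companyName">PACCAR</span>',
}

companies_demo = {job: company_names(html) for job, html in pages.items()}
print(indeed_url("data-scientist"))
print(companies_demo)
```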

### 5) Headers (25 points)

Every request you send has a so-called HTTP header (separate from the content of the message), which is used, for example, to communicate the size of the message, the browser the request is coming from, or what kind of response it expects back.

1) Read up on this: What parts does a request contain exactly and what is the purpose of a header? 

2) Look in the browser: Take a URL and find the request header using the developer tools in your browser. (Hint: you will need to look inside 'network'). 

3) If you don’t tell Python otherwise, it will use a default header when sending requests. What is this default when you use the requests library?

4) The requests library allows you to specify the headers of your request exactly. Set the header of your request (for the URL you previously picked) to be the one copied from your browser.

5) Now compare the response headers for the same URL in the browser, and by calling a function on the response object in your code. What differences do you see? 
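
As a rough illustration for task 1, the parts of a request can be inspected in Python before anything is sent. The sketch builds a `PreparedRequest` with `requests.Request(...).prepare()`, which does not contact the server; the `Accept-Language` value is just an example.

```python
import requests

# Build a request object without sending it
request = requests.Request(
    "GET",
    "https://en.wikipedia.org/wiki/Billie_Eilish",
    headers={"Accept-Language": "en"},  # example header, chosen for illustration
)
prepared = request.prepare()

# Request line: the method and the path on the server
print(prepared.method, prepared.path_url)

# Headers: metadata about the request
for name, value in prepared.headers.items():
    print(name + ": " + value)

# Body: empty (None) for a plain GET request
print(prepared.body)
```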

**Your chosen URL:** 
https://en.wikipedia.org/wiki/Billie_Eilish

**Default header of Python requests:** ##
{'User-Agent': 'python-requests/2.23.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
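
The default headers can also be read without sending a request, via `requests.utils.default_headers()`. A short sketch; the exact values, such as the version number in `User-Agent`, depend on the installed version of requests.

```python
import requests

# Headers that requests adds when no custom headers are given
defaults = requests.utils.default_headers()

for name, value in defaults.items():
    print(name + ": " + value)
```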

**Header in your browser:** ##

:authority: en.wikipedia.org
:method: GET
:path: /wiki/Billie_Eilish
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: en-GB,en-US;q=0.9,en;q=0.8 ...

In [None]:
response = requests.get("https://en.wikipedia.org/wiki/Billie_Eilish")
print(response.request.headers)
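
For task 4, one possible approach is to put the headers copied from the browser into a dictionary and attach them to a session. The values below are shortened placeholders rather than a real browser header set, and entries starting with a colon (such as `:authority`) are left out because they are HTTP/2 pseudo-headers rather than ordinary header fields. The sketch only prepares the request, so nothing is sent.

```python
import requests

url = "https://en.wikipedia.org/wiki/Billie_Eilish"

# Placeholder values; paste the real ones from the Network tab of your browser
browser_headers = {
    "User-Agent": "Mozilla/5.0 (placeholder; copy the value from your browser)",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.8",
}

session = requests.Session()
# Custom headers override the defaults with the same name; other defaults stay in place
session.headers.update(browser_headers)

prepared = session.prepare_request(requests.Request("GET", url))
print(prepared.headers)

# To actually send the request, either of these would work:
#   response = session.get(url)
#   response = requests.get(url, headers=browser_headers)
```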

**Response header in your browser:** ##

accept-ranges: bytes
age: 70951
cache-control: private, s-maxage=0, max-age=0, must-revalidate
content-encoding: gzip
content-language: en
content-length: 122758
content-type: text/html; charset=UTF-8
date: Tue, 15 Mar 2022 16:26:12 GMT
last-modified: Tue, 15 Mar 2022 16:23:12 GMT

**Response header in the response in Python:** ##

{'User-Agent': 'python-requests/2.23.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

**Difference:** ##
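
For task 5, the response headers in Python are available as `response.headers`, while `response.request.headers` holds the headers that were sent. One way to compare two header sets is to normalise the names to lower case and list what differs. The sketch below is offline: `compare_headers` is a hypothetical helper, the first dictionary reuses a few values from the browser output above, and the second dictionary is a made-up stand-in for `dict(response.headers)`.

```python
def compare_headers(first, second):
    """Compare two header dicts, ignoring the case of header names (hypothetical helper)."""
    a = {name.lower(): value for name, value in first.items()}
    b = {name.lower(): value for name, value in second.items()}
    return {
        "only_in_first": sorted(a.keys() - b.keys()),
        "only_in_second": sorted(b.keys() - a.keys()),
        "different_values": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }


# A few response headers as shown in the browser
browser = {
    "content-type": "text/html; charset=UTF-8",
    "content-encoding": "gzip",
    "age": "70951",
}

# Made-up stand-in for dict(response.headers); real values will differ
python_side = {
    "Content-Type": "text/html; charset=UTF-8",
    "Age": "12",
    "Server": "example-server",
}

difference = compare_headers(browser, python_side)
print(difference)
```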