# HTTP requests and HTTP status codes

## Requests
For now, we'll deal with only two types of requests:
- GET
- POST

### GET
- Used to retrieve remote data.
- Can also be used to submit data (for example, a query that returns some data), but it is not considered secure when dealing with sensitive data.
- Encodes the data in the request URL, which may be logged by third parties. That is why it is not secure for forms with sensitive data.
- The result looks something like https://testingserver.com?id=test@gmail.com&password=this_is_a_test (and you don't want that!).
- Because the data is sent as part of the URL, it also ends up in your browser history, so anyone with access to that history has your account.
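
To make the point about URLs concrete, here is a minimal sketch of how form data ends up inside a GET URL. It assumes Python 3 and uses only the standard library; the host and values are the made-up ones from the example above, and no request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Made-up form data, same values as in the example URL above.
params = {"id": "test@gmail.com", "password": "this_is_a_test"}

# A GET request carries the data in the query string of the URL.
get_url = "https://testingserver.com/?" + urlencode(params)
print(get_url)
# https://testingserver.com/?id=test%40gmail.com&password=this_is_a_test

# Anyone who sees the URL (logs, browser history) can read the data back.
recovered = parse_qs(urlsplit(get_url).query)
print(recovered["password"])
# ['this_is_a_test']
```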

### POST
- Used to insert or update remote data, or more generally to submit data to a server for processing. The data travels in the request body rather than in the URL.
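
For contrast with GET, the sketch below builds (but never sends) a POST request with the standard library, assuming Python 3. The `/login` path is a hypothetical endpoint chosen for illustration; the point is only that the form data sits in the body while the URL stays clean.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Made-up form data; "/login" is a hypothetical endpoint.
form = {"id": "test@gmail.com", "password": "this_is_a_test"}
body = urlencode(form).encode("ascii")

# Passing data= makes urllib treat the request as a POST.
post_request = Request("https://testingserver.com/login", data=body)

print(post_request.get_method())  # POST
print(post_request.full_url)      # https://testingserver.com/login
print(post_request.data)          # b'id=test%40gmail.com&password=this_is_a_test'
```

Note that this only keeps the data out of URLs and browser history; protecting it in transit still depends on using HTTPS.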

## Status Codes

### 2xx Response (200, 201, 203, 204)
Indicates success.
### 3xx Response (301, 302, 303, 304)
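Redirection: the resource lives at a different URL, which the client is expected to follow (304 is the exception: it means the cached copy is still valid).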
### 4xx Response (400, 401, 402, 403, 404)
Error on the client side (for example, trying to access something that isn't there).
### 5xx Response (500, 501, 502, 503)
Error on the server side (internal server error, overload, etc.).
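
The grouping above depends only on the first digit of the code. A small sketch, assuming Python 3: `status_class` is a hypothetical helper written for this note, while `HTTPStatus` comes from the standard library and supplies the official reason phrases.

```python
from http import HTTPStatus


def status_class(code):
    """Map a status code to the class named in the headings above."""
    classes = {
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "other")


for code in (200, 301, 404, 503):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
# 200 OK -> success
# 301 Moved Permanently -> redirection
# 404 Not Found -> client error
# 503 Service Unavailable -> server error
```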

# Web Scraping
Gathering data or information from web pages.
![](../assets/scrape1.png)
## Applications
- Extract products information
- Extract data to make a search engine
- Extract data from social media
... etc

## Workflow
- Get the page source (with a GET or POST request)
- Parse the HTML
- Store the useful data

Before writing any code, experiment with this workflow manually and look for patterns you can use to scrape data from the pages.
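
The three workflow steps can be sketched end to end as follows. This is only an illustration under stated assumptions: the HTML is a made-up inline sample standing in for a fetched page, parsing uses the standard library's `html.parser` instead of a third-party parser so that the snippet runs as is on Python 3, and "storing" just means serializing to JSON. `RepoLinkParser` is a hypothetical class name.

```python
import json
from html.parser import HTMLParser

# Step 1 (get): in a real scraper this string would come from a GET or POST
# request; here it is a made-up sample so the sketch runs offline.
SAMPLE_HTML = """
<ul>
  <li><a href="/o-d-i-n/workshops" itemprop="name codeRepository">workshops</a></li>
  <li><a href="/o-d-i-n/DevBible" itemprop="name codeRepository">DevBible</a></li>
  <li><a href="mailto:team@example.com" itemprop="email">contact</a></li>
</ul>
"""


class RepoLinkParser(HTMLParser):
    """Step 2 (parse): collect links that match the pattern found manually."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("itemprop") == "name codeRepository":
            self._current_href = attrs.get("href")

    def handle_data(self, data):
        # The first non-empty text after a matching <a> is the link text.
        if self._current_href is not None and data.strip():
            self.links.append({"name": data.strip(), "href": self._current_href})
            self._current_href = None


parser = RepoLinkParser()
parser.feed(SAMPLE_HTML)

# Step 3 (store): keep only the useful data, here as a JSON string.
stored = json.dumps(parser.links, indent=2)
print(stored)
```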

**You should be comfortable using Inspect Element and your browser's developer tools, so that you can debug your scraping application.**

## Libraries
- requests (you can use urllib too; an example will be provided)
- BeautifulSoup
- urllib (for downloading files, BONUS)

In [1]:
import requests
from bs4 import BeautifulSoup as bs
import urllib  # for downloading stuff

In [2]:
# normal http request test
url = "https://github.com/alfbjabfjabskd"  # not a valid username, so we know we'll get a 4xx response
response = requests.get(url)
print(response)
print(response.text)

<Response [404]>
Not Found


In [3]:
url = "https://github.com/odin"
response = requests.get(url)
print(response)
print(response.text[:100])  # this will be too lengthy, so just print some part and make sure you've got data

<Response [200]>


<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" hr


In [4]:
# the returned HTML is available as response.text (decoded text) or response.content (raw bytes)
print("Cookies", response.cookies.keys())  # printing response.cookies itself works too; skipped here as it takes a lot of space
print("")  # blank line (a bare `print` does nothing on Python 3)
print("Headers", response.headers.keys())
print("")
print("HTTP Status Code", response.status_code)

('Cookies', ['logged_in', '_gh_sess'])

('Headers', ['Server', 'Date', 'Content-Type', 'Transfer-Encoding', 'Status', 'Cache-Control', 'Vary', 'X-UA-Compatible', 'Set-Cookie', 'X-Request-Id', 'X-Runtime', 'Content-Security-Policy', 'Strict-Transport-Security', 'Public-Key-Pins', 'X-Content-Type-Options', 'X-Frame-Options', 'X-XSS-Protection', 'X-Runtime-rack', 'Content-Encoding', 'X-GitHub-Request-Id'])

('HTTP Status Code', 200)


In [5]:
# using a POST request to maintain a session
# say a page can be accessed only if you're logged in
res = requests.get("http://hackthis.co.uk/levels/main/1")
html = res.text
err_idx = html.find("You must")  # this error message is shown when the page is accessed without logging in
print(html[err_idx : err_idx + 100])

You must be logged in to view this content
                        </div>                </article>



In [6]:
sess = requests.Session()
response = sess.post("https://www.hackthis.co.uk/?login",
                       data={"username": "username", "password": "password"})
# The saved output below comes from valid credentials. You'll need to create an account and change the parameters above.

print(response)
err_idx = response.text.find("You must")
print(err_idx)  # = -1, so the error message is gone. You can now scrape the page as usual
sess.close()
# Want more applications?

# Suppose you want to access your private GitHub repos. You cannot do that without proving
# that you're a legitimate, logged-in user.

# Session() maintains a session, so once you log in, the logged-in state is kept

# You could have done the above without Session, but it would have been more complicated.
# You would send your credentials with a plain requests.post() call and get some authentication token in response.
# You would then need to send this token and other data with every request to show that you're logged in.

# A plain post is useful when you just want to submit some data and don't need a session,
# for example to see what happens when wrong credentials are submitted to the login
response = requests.post("https://www.hackthis.co.uk/?login",
                       data={"username": "afjajkfajks", "password": "ajkvdavda"})
print("")  # blank line (a bare `print` does nothing on Python 3)
print(response)
idx = response.text.find("details")
print(response.text[idx - 20: idx + 7].strip())

<Response [200]>
-1

<Response [200]>
Invalid login details
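
What "maintaining a session" means can be seen without logging in anywhere. The sketch below assumes the `requests` library is installed; the cookie name and value and the example.com URL are made up, and nothing is sent over the network. The requests are only prepared so that their headers can be inspected.

```python
import requests

url = "https://example.com/private"  # hypothetical URL, never actually contacted

with requests.Session() as sess:
    # Pretend a successful login response stored this cookie in the session.
    sess.cookies.set("session_id", "abc123")
    # Every request prepared through the session carries the stored cookie.
    with_session = sess.prepare_request(requests.Request("GET", url))

# A plain request knows nothing about earlier responses.
without_session = requests.Request("GET", url).prepare()

print(with_session.headers.get("Cookie"))
print(without_session.headers.get("Cookie"))
```

The first print should show the cookie being sent back automatically, while the second should show that a standalone request has no cookie unless you attach one yourself.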


In [12]:
# Now let's move on to the scraping part (basics)
# I'll be uploading two scripts with proper applications for your reference

res = requests.get("https://github.com/o-d-i-n")

# pass the html to beautiful soup for parsing
soup = bs(res.text, "html.parser")
print(soup.prettify())  # prettify prints the HTML in a more readable, indented form

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8">
   <link href="https://assets-cdn.github.com" rel="dns-prefetch">
    <link href="https://avatars0.githubusercontent.com" rel="dns-prefetch">
     <link href="https://avatars1.githubusercontent.com" rel="dns-prefetch">
      <link href="https://avatars2.githubusercontent.com" rel="dns-prefetch">
       <link href="https://avatars3.githubusercontent.com" rel="dns-prefetch">
        <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch">
         <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch">
          <link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/frameworks-bedfc518345498ab3204d330c1727cde7e733526a09cd7df6867f6a231565091.css" media="all" rel="stylesheet"/>
          <link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/github-36dc4044670fb93253d932fbc3a10282233e86290bfaf8285c479925456b3d47.css" media="all" rel="stylesheet"/>
        

![](../assets/scrape2.jpg)

In [13]:
# As you can see, the div tag with class "org-repos" and class "repo-list" is the main div tag under
# which we can find all the repo links

repo_div = soup.find("div", {"class": "repo-list"})
print(repo_div)

<div class="org-repos repo-list">
<li class="col-12 d-block width-full py-4 border-bottom public source" itemprop="owns" itemscope="" itemtype="http://schema.org/Code">
<div class="d-inline-block mb-1">
<h3>
<a href="/o-d-i-n/workshops" itemprop="name codeRepository">
        workshops</a>
</h3>
</div>
<div>
<p class="col-9 d-inline-block text-gray mb-2 pr-4" itemprop="description">
          Notebooks and resources for sessions taken by odin team
        </p>
<div class="col-3 float-right text-right">
<poll-include-fragment src="/o-d-i-n/workshops/graphs/participation?h=28&amp;type=sparkline&amp;w=155">
</poll-include-fragment>
</div>
</div>
<div class="f6 text-gray mt-2">


        Updated <relative-time datetime="2017-10-03T13:15:52Z">Oct 3, 2017</relative-time>
</div>
</li>
<li class="col-12 d-block width-full py-4 border-bottom public source" itemprop="owns" itemscope="" itemtype="http://schema.org/Code">
<div class="d-inline-block mb-1">
<h3>
<a href="/o-d-i-n/DevBible" itemprop=

In [18]:
# Inside this div tag, every a tag whose itemprop attribute is "name codeRepository" holds a repository name.
# The extracted div tag can be treated like another soup object and searched for nested tags;
# here we search the whole page and filter on itemprop instead.
repo_names = []

for a_tag in soup.find_all('a'):
    # a_tag.attrs is a dict of the form {attribute: attribute_value}
    # skip tags that have no itemprop attribute
    if 'itemprop' not in a_tag.attrs:
        continue
    # if it is there, check whether its value is "name codeRepository" (the value for all repos)
    # there's itemprop=email too, which is why you need to look at all the data
    # and then filter it until you're left with what you need!
    if a_tag['itemprop'] == "name codeRepository":
        repo_names.append(a_tag.text.strip())

print("\n".join(repo_names))

workshops
DevBible
HelloWorld
AI-Think-Tank
all-contributors
bootcamp
all-contributors-atom
moksha-2017
nyan-cat
ShortestPathRL-DP
Extention-Google-web-light
EvilWords
Hypha
HouseKeeper
is-alpha
speed-tester
CloudShopper
nsit-fest
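
The same filtering can be written more compactly by letting BeautifulSoup match the attribute directly. This is a sketch assuming `bs4` is installed; the HTML is a trimmed, made-up stand-in for the real page (the email link in particular is invented), so it runs offline.

```python
from bs4 import BeautifulSoup

# Trimmed, made-up stand-in for the organisation page.
sample_html = """
<div class="org-repos repo-list">
  <a href="/o-d-i-n/workshops" itemprop="name codeRepository">
        workshops</a>
  <a href="/o-d-i-n/DevBible" itemprop="name codeRepository">
        DevBible</a>
  <a href="mailto:someone@example.com" itemprop="email">someone@example.com</a>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Narrow the search to the repo list, then match on the attribute value.
sample_div = sample_soup.find("div", {"class": "repo-list"})
repo_links = sample_div.find_all("a", attrs={"itemprop": "name codeRepository"})

sample_names = [a.text.strip() for a in repo_links]
print(sample_names)
# ['workshops', 'DevBible']
```

Searching inside the extracted div rather than the whole page keeps unrelated links, such as the email one, out of the loop from the start.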


In [19]:
# Downloading something from the internet
# I just copied the download URL of a repo; it could also be obtained by scraping
download_url = "https://github.com/o-d-i-n/DevBible/archive/master.zip"
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2
urlretrieve(download_url, "repo.zip")  # url, filename

('repo.zip', <httplib.HTTPMessage instance at 0x7fd9b85d2fc8>)
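
Download URLs like the one above can also be assembled from scraped links instead of being copied by hand. A minimal sketch, assuming Python 3: the `href` values are the ones seen in the repo list earlier, and the `/archive/master.zip` suffix is taken from the URL above, which assumes each repository has a `master` branch. Nothing is downloaded here.

```python
from urllib.parse import urljoin

# href values as they appear in the scraped a tags
repo_hrefs = ["/o-d-i-n/workshops", "/o-d-i-n/DevBible"]

# Assumption: every repo exposes its archive at <repo>/archive/master.zip
download_urls = [
    urljoin("https://github.com", href + "/archive/master.zip")
    for href in repo_hrefs
]

for link in download_urls:
    print(link)
# https://github.com/o-d-i-n/workshops/archive/master.zip
# https://github.com/o-d-i-n/DevBible/archive/master.zip
```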