# Web Scraping

Web scraping is a method for extracting (or scraping) relevant data from a website. It is also referred to as **web data extraction**. Web scraping generally has a specific goal in mind (retrieving a particular piece of information), as opposed to **web crawling**, in which all data is gathered.

Some examples:
- Gathering job information from [rozee.pk](https://rozee.pk).
- Gathering car information from pakwheels.
- Extracting relevant text data for Natural Language Processing (NLP), for learning purposes.


## Challenges of Web Scraping
- Web scraping is structure dependent. The location of the relevant information depends on how each website is designed, so the search process cannot be generalized to all types of websites.
- The scraper can become outdated if the website is modified and its structure changes.
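
One way to soften the second challenge is to make the scraper fail loudly when the expected elements disappear, rather than silently returning nothing. The sketch below is only an illustration: `select_or_fail` is a hypothetical helper name and the HTML snippets are made up.

```python
from bs4 import BeautifulSoup


def select_or_fail(soup, tag, class_name):
    """Return all matching tags, or raise if the page layout no longer matches.

    `select_or_fail` is a hypothetical helper name used for illustration.
    """
    matches = soup.find_all(tag, class_=class_name)
    if not matches:
        raise ValueError(
            f"No <{tag} class='{class_name}'> found; the page structure may have changed."
        )
    return matches


# Made-up HTML standing in for two versions of a downloaded page.
old_layout = BeautifulSoup('<ul><li class="result">Bike</li></ul>', "html.parser")
new_layout = BeautifulSoup('<div class="card">Bike</div>', "html.parser")

# The old layout still contains the element the scraper expects.
print(len(select_or_fail(old_layout, "li", "result")))

# The redesigned layout does not, so the helper raises instead of returning [].
try:
    select_or_fail(new_layout, "li", "result")
except ValueError as err:
    print(err)
```

Raising an error here is a design choice: an empty result could otherwise be mistaken for "no data today" when the real cause is a redesign.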

# Methodology

Before we dive into the details of Python code and the relevant libraries, let us first understand the process of web scraping.

1. We start by choosing the website we want to scrape.
2. We download the particular web page (using the *requests* library).
3. We parse the downloaded web page and store it as a parsed HTML document (other formats are available too). This way, we can access individual HTML tags.
4. We spend some time understanding the structure of the HTML file. This way we can identify what information is available, how it is stored, and how to access it.
5. Once the location is identified, we write Python code to access the relevant data.
6. Finally, we can store that data as a DataFrame (other options are available too) and perform data analysis on it.
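
The steps above can be sketched end to end on a small, made-up HTML string. The markup, class names, and field names below are assumptions chosen for illustration; the download in step 2 is replaced by a literal string so the sketch runs without network access.

```python
from bs4 import BeautifulSoup

# Step 2 stand-in: in practice this would come from requests.get(url).content.
# The markup below is made up for illustration.
downloaded_html = """
<html><body>
  <div class="job"><h2>Data Analyst</h2><span class="city">Lahore</span></div>
  <div class="job"><h2>Web Developer</h2><span class="city">Karachi</span></div>
</body></html>
"""

# Step 3: parse the page so individual tags can be accessed.
soup = BeautifulSoup(downloaded_html, "html.parser")

# Steps 4-5: after inspecting the structure, pull out the relevant fields.
records = []
for job in soup.find_all("div", class_="job"):
    records.append({
        "title": job.find("h2").get_text(strip=True),
        "city": job.find("span", class_="city").get_text(strip=True),
    })

# Step 6: the list of dicts is ready to be stored or analysed.
print(records)
```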

In [None]:
#required libraries
import requests
from bs4 import BeautifulSoup


# Getting the Web Page
To get a local copy of the web page, we use Python's *requests* library. Its `get()` function takes the URL of the web page and returns a response object; the downloaded page is available through the object's *content* attribute.



In [None]:
#request the content from website
page = requests.get("https://washingtondc.craigslist.org/search/bia")
page


<Response [200]>

In [None]:
#see response status
page.status_code

200
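
The status check can be folded into a small helper so that a failed download stops the script early. This is a minimal sketch: the name `fetch_page` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_page(url, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors.

    `fetch_page` and the default timeout are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.content


# Usage (performs a real network request, so it is left commented out):
# html_bytes = fetch_page("https://washingtondc.craigslist.org/search/bia")
```

Passing a timeout avoids waiting indefinitely on an unresponsive server, and `raise_for_status()` replaces a manual comparison of `status_code` with 200.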

In [None]:
page.content

b'<!DOCTYPE html>\n<html>\n<head>\n    \n\t<meta charset="UTF-8">\n\t<meta http-equiv="X-UA-Compatible" content="IE=Edge">\n\t<meta name="viewport" content="width=device-width,initial-scale=1">\n\t<meta property="og:site_name" content="craigslist">\n\t<meta name="twitter:card" content="preview">\n\t<meta property="og:title" content="washington, DC bicycles - craigslist">\n\t<meta name="description" content="washington, DC bicycles - craigslist">\n\t<meta property="og:description" content="washington, DC bicycles - craigslist">\n\t<meta property="og:url" content="https://washingtondc.craigslist.org/search/bia">\n\t<title>washington, DC bicycles - craigslist</title>\n\t<link rel="canonical" href="https://washingtondc.craigslist.org/search/bia">\n\t<link rel="alternate" href="https://washingtondc.craigslist.org/search/bia" hreflang="x-default">\n\n\n\n    <link rel="icon" href="/favicon.ico" id="favicon" />\n\n<script type="application/ld+json" id="ld_searchpage_data" >\n    {"description

# Parsing the Webpage
Once we have a local copy of the web page, the next step is to parse it for scraping. For this, we use a Python library called **Beautiful Soup**.

In [None]:
# create beautifulSoup object
soup = BeautifulSoup(page.content, 'html.parser')


In [None]:
soup

<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="craigslist" property="og:site_name"/>
<meta content="preview" name="twitter:card"/>
<meta content="washington, DC bicycles - craigslist" property="og:title"/>
<meta content="washington, DC bicycles - craigslist" name="description"/>
<meta content="washington, DC bicycles - craigslist" property="og:description"/>
<meta content="https://washingtondc.craigslist.org/search/bia" property="og:url"/>
<title>washington, DC bicycles - craigslist</title>
<link href="https://washingtondc.craigslist.org/search/bia" rel="canonical"/>
<link href="https://washingtondc.craigslist.org/search/bia" hreflang="x-default" rel="alternate"/>
<link href="/favicon.ico" id="favicon" rel="icon">
<script id="ld_searchpage_data" type="application/ld+json">
    {"description":"Bicycles for sale in Washington, DC","@context":"

### `prettify()`
This is a very useful function that returns the HTML as a formatted string with indentation, which is much easier to read when printed.

In [None]:
#View formatted content
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <meta content="craigslist" property="og:site_name"/>
  <meta content="preview" name="twitter:card"/>
  <meta content="washington, DC bicycles - craigslist" property="og:title"/>
  <meta content="washington, DC bicycles - craigslist" name="description"/>
  <meta content="washington, DC bicycles - craigslist" property="og:description"/>
  <meta content="https://washingtondc.craigslist.org/search/bia" property="og:url"/>
  <title>
   washington, DC bicycles - craigslist
  </title>
  <link href="https://washingtondc.craigslist.org/search/bia" rel="canonical"/>
  <link href="https://washingtondc.craigslist.org/search/bia" hreflang="x-default" rel="alternate"/>
  <link href="/favicon.ico" id="favicon" rel="icon">
   <script id="ld_searchpage_data" type="application/ld+json">
    {"description":"Bicycles for 
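
To make the effect of `prettify()` easier to see, here is a small sketch on a made-up one-line document. The variable names are illustrative; the point is that `prettify()` returns a new string and does not print anything by itself.

```python
from bs4 import BeautifulSoup

# Made-up one-line document for illustration.
soup_demo = BeautifulSoup(
    "<html><head><title>Demo</title></head></html>", "html.parser"
)

compact = str(soup_demo)       # markup on a single line, as it was given
pretty = soup_demo.prettify()  # a new string with line breaks and indentation

print(type(pretty).__name__)
print(pretty)
```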

Now let's dig in to find the relevant data. Before that, we need to read the HTML to see what we are looking for.

To do this, we go to the target web page and inspect its content.

- Open the web page in your browser.
- Right-click on the relevant information on the web page and select *Inspect*. This opens the HTML of the page in the browser's developer tools.
- Identify the tag in which the relevant information is stored.

- The most important element in HTML is the *tag*. Each tag can have multiple attributes; *class* and *id* are among the most common and are used to identify a tag (an *id* is meant to be unique within a page, while a *class* may be shared by several tags). The relevant data is generally the text between a start tag and an end tag, for example \<a\> and \</a\>.
- To access the text between tags, the attribute `.text` or the functions `.get_text()` or `.getText()` can be used. All behave similarly, but the function versions accept arguments (such as `strip=True`) that give more control over the result.
- If the relevant data is the value of an attribute, the `get()` function can be used to retrieve it.
- To search using tags, class or ID, two functions can be used (there are others as well):
    - `find()`: returns the first instance of the tag to be found.
    - `find_all()`: returns all instances of the tag to be found, as a list of items.
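
The functions listed above can be tried on a small, made-up snippet before applying them to the real page. The markup and variable names below are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = """
<ul>
  <li class="item" id="first"><a href="/a">  Alpha </a></li>
  <li class="item"><a href="/b">Beta</a></li>
</ul>
"""
demo = BeautifulSoup(html, "html.parser")

first_item = demo.find("li", class_="item")     # first match only
all_items = demo.find_all("li", class_="item")  # every match, as a list

raw_text = first_item.text                      # keeps surrounding whitespace
clean_text = first_item.get_text(strip=True)    # whitespace stripped
href = first_item.find("a").get("href")         # value of an attribute
item_id = first_item.get("id")
missing = demo.find("table")                    # None when nothing matches

print(len(all_items), repr(raw_text), clean_text, href, item_id, missing)
```

Note that `find()` returns `None` when nothing matches, so chaining `.get_text()` directly onto it fails if the tag is absent.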
    

In [None]:
# single search
soup.find('div')

<div class="cl-content">
<main>
</main>
</div>

In [None]:
soup.find_all('div')

#len(soup.find_all('div'))


In [None]:
data = soup.find('li', class_='cl-static-search-result')
data

<li class="cl-static-search-result" title="Specialized Roubaix SL4 Carbon Road Bike (54cm)">
<a href="https://washingtondc.craigslist.org/nva/bik/d/arlington-specialized-roubaix-sl4/7736411152.html">
<div class="title">Specialized Roubaix SL4 Carbon Road Bike (54cm)</div>
<div class="details">
<div class="price">$1,000</div>
<div class="location">
                        northern virginia
                    </div>
</div>
</a>
</li>

In [None]:
title = data.find('div', class_='title').get_text(strip=True)
title

'Specialized Roubaix SL4 Carbon Road Bike (54cm)'

In [None]:
price = data.find("div", class_='price').get_text(strip=True)
location = data.find("div", class_='location').get_text(strip=True)

In [None]:
price

'$1,000'

In [None]:
location

'northern virginia'

In [None]:
link = data.find('a').get('href')
link

'https://washingtondc.craigslist.org/nva/bik/d/arlington-specialized-roubaix-sl4/7736411152.html'
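
The extracted price is still a string such as `'$1,000'`. If numeric analysis is planned, it can be converted along the following lines. `parse_price` is a hypothetical helper, and the sketch assumes US-style prices with a leading `$` and comma thousands separators.

```python
def parse_price(price_text):
    """Convert a price string such as '$1,000' to a float.

    Returns None when float() cannot parse the remaining text
    (for example an empty string). Hypothetical helper for illustration.
    """
    cleaned = price_text.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_price("$1,000"))
print(parse_price("$380"))
print(parse_price(""))
```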

In [None]:
l = []
# Collect price, location, title, and link for every search result
for item in soup.find_all('li', class_='cl-static-search-result'):
    row = [
        item.find('div', class_='price').get_text(strip=True),
        item.find('div', class_='location').get_text(strip=True),
        item.find('div', class_='title').get_text(strip=True),
        item.find('a').get('href'),
    ]
    l.append(row)


In [None]:
len(l)

334
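
The loop above assumes every listing contains a price, location, title, and link. Because `find()` returns `None` for a missing tag, a listing without one of these fields would raise an `AttributeError`. A sketch of a more tolerant loop follows; `text_or_none` is a hypothetical helper and the two listings are made up.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, tag, class_name):
    """Return the stripped text of the first matching child, or None if absent."""
    node = parent.find(tag, class_=class_name)
    return node.get_text(strip=True) if node is not None else None


# Made-up markup imitating two listings; the second has no price.
sample = """
<ul>
  <li class="cl-static-search-result"><a href="https://example.com/1">
    <div class="title">Road Bike</div><div class="price">$100</div>
    <div class="location">Arlington</div></a></li>
  <li class="cl-static-search-result"><a href="https://example.com/2">
    <div class="title">Bike Rack</div>
    <div class="location">Sterling</div></a></li>
</ul>
"""
sample_soup = BeautifulSoup(sample, "html.parser")

rows = []
for item in sample_soup.find_all("li", class_="cl-static-search-result"):
    anchor = item.find("a")
    rows.append([
        text_or_none(item, "div", "price"),
        text_or_none(item, "div", "location"),
        text_or_none(item, "div", "title"),
        anchor.get("href") if anchor is not None else None,
    ])

print(rows)
```

Missing values end up as `None`, which pandas later treats as missing data instead of stopping the whole scrape.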

Storing the results in a DataFrame.

In [None]:
import pandas as pd

df = pd.DataFrame(l,columns=['Price','Location','Title','Link'])

In [None]:
df

| | Price | Location | Title | Link |
|---|---|---|---|---|
| 0 | \$1,000 | northern virginia | Specialized Roubaix SL4 Carbon Road Bike (54cm) | https://washingtondc.craigslist.org/nva/bik/d/... |
| 1 | \$800 | North Arlington, VA | RANS Stratus Recumbent Bicycle | https://washingtondc.craigslist.org/nva/bik/d/... |
| 2 | \$380 | North Arlington, VA | Bianchi Torino Men's Hybrid Bicycle (700C wheel) | https://washingtondc.craigslist.org/nva/bik/d/... |
| 3 | \$280 | North Arlington, VA | GT Palomar Mountain Bike (26" wheel) | https://washingtondc.craigslist.org/nva/bik/d/... |
| 4 | \$380 | North Arlington, VA | Trek 4300 Men's Mountain Bike (26" Wheel) | https://washingtondc.craigslist.org/nva/bik/d/... |
| ... | ... | ... | ... | ... |
| 329 | \$188 | Sterling | Thule Hitching Post Pro Bike Rack for 4 Bikes ... | https://washingtondc.craigslist.org/nva/bik/d/... |
| 330 | \$99 | NoVA | Dynacraft CycoCycle 3 Wheel Trike Fold Down Fo... | https://washingtondc.craigslist.org/nva/bik/d/... |
| 331 | \$77 | Sterling | Rare Mongoose Erupt FAT TIRE Bike 16 inch | https://washingtondc.craigslist.org/nva/bik/d/... |
| 332 | \$120 | Alexandria | 2 vintage Univega 20” Project MTBs | https://washingtondc.craigslist.org/doc/bik/d/... |


Converting the DataFrame into a CSV (comma-separated values) file.

In [None]:

df.to_csv('DataDC.csv')
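
By default, `to_csv()` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `read_csv()`. The sketch below compares the default with `index=False`; it uses two made-up rows and in-memory buffers instead of files, so nothing is written to disk.

```python
import io

import pandas as pd

# Two made-up rows in the same column layout as the DataFrame above.
demo_df = pd.DataFrame(
    [["$100", "Arlington", "Road Bike", "https://example.com/1"],
     ["$50", "Sterling", "Bike Rack", "https://example.com/2"]],
    columns=["Price", "Location", "Title", "Link"],
)

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
demo_df.to_csv(with_index)
cols_with_index = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()

# index=False leaves the index out of the file.
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)
cols_without_index = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Whether to keep the index is a judgment call; for scraped rows with a default integer index it usually carries no information.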