This notebook tees us up to do some interesting scraping of Craigslist sites. 

In [None]:
import sqlite3
import requests
from bs4 import BeautifulSoup 
import time # used to make our code "sleep" for a second or two before new calls.

Let's begin by doing a car search on Craigslist and looking at the results. Click on this link and feel free to change the search terms once you get there: https://seattle.craigslist.org/search/cto?query=volkswagen+jetta. 

Once you've looked around a bit (including looking at the page source and page elements), enter the URL down below.

In [None]:
my_search_page = "https://seattle.craigslist.org/search/cto?query=volkswagen+jetta"

Let's grab the page and process it with Beautiful Soup.

In [None]:
r = requests.get(my_search_page)
print(r.status_code)

bs = BeautifulSoup(r.text,'html.parser')

In [None]:
print(bs.prettify())

Well, not that pretty. Okay, we've come to our first challenge. As you can see when you open the page in a browser, it lists a bunch of cars. If you right-click on one, you'll be able to copy the link to that car's listing. It'll look something like this: `https://seattle.craigslist.org/skc/cto/d/seattle-jetta-2005/6994660904.html`.

### First challenge

Extract every one of those links from the page you're on. Store them in a list called `car_pages`.

Approach: Let's try to identify those page elements using "inspect element" in Chrome. It looks like they show up in blocks that look like this: 

```
<a 
    href="https://seattle.craigslist.org/tac/cto/d/tacoma-2001-volkswagen-jetta/6994004871.html" 
    class="result-image gallery" 
    data-ids="1:00B0B_kGzRKlMhcLE,1:00303_ivZbPUU6nb1,1:00n0n_32JBe7nYXP,
              1:00W0W_1AOugnLjUvc,1:00c0c_kB6girnJdhA,1:00c0c_gJQhKdxhS5o,
              1:00L0L_eyvNJpLqaPN,1:00x0x_4ZUhwK5KmER">
    <span class="result-price">$1700</span>
</a>
```

So we'll want to look for `a` tags that have both of the classes in `result-image gallery`. (There are probably lots of ways to do this, but going for this pair seems like the easiest to me.)


In [None]:
# let's just print the class of the links on the page.
for link in bs.find_all('a'):
    print(link.get('class'))

Okay, it looks like the class for the ones we want is `['result-image', 'gallery']`, so let's test for that and grab the links. 

In [None]:
car_pages = []

for link in bs.find_all('a') :
    if (link.get('class') and # some links don't have classes, so need to test for its existence  
        ('result-image' in link.get('class') and
        'gallery' in link.get('class'))) :
        
        car_pages.append(link.get('href'))
        
print(len(car_pages))

Okay, in the version I just ran we have 120 links. Looking at them, they look pretty reasonable. 
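
As one of those "lots of ways," here is a minimal sketch of the same filter written as a CSS selector. The HTML snippet and the `extract_listing_links` helper are made up for illustration; the only assumption carried over from the real page is that listing links are `a` tags carrying both the `result-image` and `gallery` classes.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the search results page.
sample_html = """
<a href="https://example.org/cto/d/car-one/1.html" class="result-image gallery"></a>
<a href="https://example.org/cto/d/car-one/1.html" class="result-title hdrlnk">Car one</a>
<a href="https://example.org/cto/d/car-two/2.html" class="result-image gallery"></a>
<a href="/about">no class here</a>
"""

def extract_listing_links(soup):
    """Return the href of every <a> tag that carries both classes."""
    # "a.result-image.gallery" only matches tags that have both classes,
    # so links without a class attribute are skipped automatically.
    return [a.get("href") for a in soup.select("a.result-image.gallery")]

sample_links = extract_listing_links(BeautifulSoup(sample_html, "html.parser"))
print(sample_links)
```

The selector does the same job as the existence check plus the two `in` tests in the loop, so which one you use is mostly a matter of taste.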

---

In [None]:
raw_car_pages = car_pages

### Second Challenge

Okay, you may want to do these out of order, so let's define a `car_pages` variable so you have something to play with.

In [None]:
car_pages = ['https://seattle.craigslist.org/skc/cto/d/seattle-jetta-2005/6994660904.html',
             'https://seattle.craigslist.org/skc/cto/d/federal-way-2003-vw-jetta/6994207516.html',
             'https://seattle.craigslist.org/see/cto/d/federal-way-2014-volkswagen-jetta-20l/6994139994.html']

Okay, now we're ready for the second challenge. Extract some of the key elements from the resulting Craigslist page. We'll discuss this in class, but some good ones to extract: 

* Make
* Model
* Price
* Transmission
* Miles
* Number of images
* Description
* Page title

Craigslist isn't trying to make things easy on us. But there is a box on the right-hand side where a bunch of these fields are being stored. Let's just grab those. Once you've done that, your goal is to scrape all the pages in `car_pages` and write out the info to a text file. 

In [None]:
page = car_pages[0]

r = requests.get(page)
bs = BeautifulSoup(r.text,'html.parser')

In [None]:
print(bs.prettify())

In [None]:
# I'm going to write a function that covers most of the obvious cases. This 
# is clearly overkill. I'll return a dictionary with all the values, including
# the missing ones. 
def get_vehicle_info(soup) :
    ''' Given soup, returns a dictionary of vehicle info with keys
        vin, condition, cylinders, drive, fuel, odometer, title_status,
        transmission, paint_color, size, and type. Fields that are
        missing from the page come back as None.
    '''
    
    car_text = soup.find_all("p", {"class":"attrgroup"})[1]
    
    # Set all values to None to start
    vin = cond = cyl = drive = fuel = None
    odometer = title_status = trans = None
    color = size = type_val = None

    for item in car_text.find_all("span") :
        if "VIN" in str(item) :
            vin = item.b.string
        elif "condition" in str(item) :
            cond = item.b.string
        elif "cylinders" in str(item) :
            cyl = item.b.string 
        elif "drive" in str(item) :
            drive = item.b.string 
        elif "fuel" in str(item) :
            fuel = item.b.string 
        elif "odometer" in str(item) :
            odometer = item.b.string 
        elif "title status" in str(item) :
            title_status = item.b.string 
        elif "transmission" in str(item) :
            trans = item.b.string 
        elif "paint color" in str(item) :
            color = item.b.string 
        elif "size" in str(item) :
            size = item.b.string
        elif "type" in str(item) :
            type_val = item.b.string

            
            
    return_dict = {"vin":vin,
                   "condition":cond,
                   "cylinders":cyl,
                   "drive":drive,
                   "fuel":fuel,
                   "odometer":odometer,
                   "title_status":title_status,
                   "transmission":trans,
                   "paint_color":color,
                   "size":size,
                   "type":type_val}
  
    return return_dict

In [None]:
# now let's test it
get_vehicle_info(bs)
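
If the long `elif` chain feels like too much typing, here is a hedged sketch of a more generic take. It assumes the same structure that `get_vehicle_info` relies on (spans that read `label: <b>value</b>` inside `p` tags with class `attrgroup`), and the sample HTML and the `parse_attr_spans` name are invented for illustration. Unlike the function above, it looks at every `attrgroup` block and only returns the fields it actually finds.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the structure get_vehicle_info assumes.
sample_html = """
<p class="attrgroup"><span><b>2005 volkswagen jetta</b></span></p>
<p class="attrgroup">
  <span>condition: <b>good</b></span>
  <span>odometer: <b>150000</b></span>
  <span>title status: <b>clean</b></span>
  <span>transmission: <b>manual</b></span>
</p>
"""

def parse_attr_spans(soup):
    """Map each 'label: value' span to a dictionary entry."""
    attrs = {}
    for group in soup.find_all("p", class_="attrgroup"):
        for span in group.find_all("span"):
            # Split on the first colon only; sep is "" when there is none.
            label, sep, value = span.get_text().partition(":")
            if sep:  # skip spans that carry no "label:" prefix
                attrs[label.strip().replace(" ", "_")] = value.strip()
    return attrs

sample_attrs = parse_attr_spans(BeautifulSoup(sample_html, "html.parser"))
print(sample_attrs)
```

The trade-off is that the keys now come from whatever labels the page uses, so different pages can produce different sets of keys. The explicit version above always returns the same keys, which is handy when writing rows to a file.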

Okay, that looks pretty good. Now let's write a cell that extracts all the info and writes each one to a text file.

In [None]:
output_file = "20191007_car_scrape.txt"
written_headers = False # notice the use of this trick. It's a good one. 

with open(output_file,'w') as ofile :
    for page in car_pages :
        r = requests.get(page)
        if r.status_code == 200 :
            bs = BeautifulSoup(r.text,'html.parser')

            vehicle_info = get_vehicle_info(bs)

            if not written_headers :
                headers = ['page']

                # build up our headers list
                for key in vehicle_info :
                    headers.append(key)
                    
                ofile.write("\t".join(headers) + "\n")
                written_headers = True

            # Now we build up the row. We use the headers list to
            # look up the dictionary values so that every row's
            # columns line up with the header row, without relying
            # on dictionary ordering (which is only guaranteed in
            # Python 3.7 and later).

            output_row = [page]
            
            for field_name in headers[1:] : # skip first spot
                output_row.append(vehicle_info[field_name])
                
            ofile.write("\t".join([str(item) for item in output_row]) + "\n")

        time.sleep(1)        
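
The header bookkeeping in the loop above can also be handed off to the standard library. Here is a minimal sketch using `csv.DictWriter` with a tab delimiter. The sample rows, the `write_rows` helper, and the in-memory buffer are stand-ins invented for illustration; in the real loop the rows would come from `get_vehicle_info` and the handle would be the open output file.

```python
import csv
import io

# Hypothetical rows shaped like {"page": ..., plus the vehicle info fields}.
sample_rows = [
    {"page": "https://example.org/1.html", "odometer": "150000", "transmission": "manual"},
    {"page": "https://example.org/2.html", "odometer": None, "transmission": "automatic"},
]

def write_rows(rows, handle):
    """Write dictionaries as tab-delimited text with a single header row."""
    writer = csv.DictWriter(
        handle,
        fieldnames=list(rows[0]),  # column order comes from the first row
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)

buffer = io.StringIO()  # stands in for an open file
write_rows(sample_rows, buffer)
print(buffer.getvalue())
```

One difference to be aware of: the `csv` module writes `None` as an empty field, whereas the `str(item)` approach in the cell above writes the literal text `None`.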
        

Okay, nice work. One big piece that's missing is price. For a bonus challenge, try to get the price off the page. Open question: why did CL make it so hard to get price?

In [None]:
# This one is a monster. I grabbed it from the full CL project I have, so it gets a _lot_ of other stuff that's useful.
def get_page_elements(soup) :
    ''' Getting elements of the page that 
      *should* be mostly straightforward:
      post_id, listing_name, posting_body_text,
      title_text, posting_dt, update_dt 
      price
    '''
  
    # post-id
    post_id = None
    for item in soup.find_all("p",{"class":"postinginfo"}) :
        if "post" in str(item.string) :
            post_id = item.string.split(":")[1]
            post_id = post_id.strip()
  
    # listing_name
    # TODO: Not sure what this is. I put it in the 
    # db design, but don't know what it means. Leaving 
    # it in for now in the hopes that I'll remember
    # what I was seeing.
    # TODO: It's been a while. Remove this. 
  
    listing_name = None

    # posting_body_text
    posting_body_text = soup.find_all("section",{"id":"postingbody"})[0].text.strip() 

    # title_text
    full_title_text = soup.find_all("span",{"class":"postingtitletext"})[0].text.split("\n")
    full_title_text = [a for a in full_title_text if len(a) > 0]  

    # attempt at actual title. Is hyphen split reliable?
    title_text = full_title_text[0].split("-")[0].strip()

    # grab price and location
    price = location = None
    title_parts = full_title_text[0].split("-")

    if len(title_parts) > 1 :
        # This is tricky. It's possible to have multiple
        # hyphens in the title text, so we take the piece
        # after the last hyphen and hope for the best.
        # With no hyphen at all, price and location stay None.
        price_loc_part = title_parts[-1].strip()
        price = price_loc_part.split("(")[0].strip()
        price = price.replace("$","")

        # location: the place listed in the title
        if "(" in price_loc_part :
            location = price_loc_part.split("(")[1].strip()
            location = location.replace(")","")

    # TODO: add error checking and logging. Also, 
    # this could be separated out of the pull itself.
    # 
    # 20181128: Not sure what this note actually means? am I 
    # saying that I should write an "extract price" function?
    # that seems like a reasonable idea. 
  
    full_title_text = "".join(full_title_text) 
    # For now we're storing this so we can figure out if we're screwing it up.

    # times
    posting_dt = update_dt = None
  
    for item in soup.find_all("p",{"class":"postinginfo reveal"}) :
        if "posted:" in item.text :
            holder = item.find_all("time")[0]
            posting_dt = holder.attrs['datetime']
        elif "updated:" in item.text :
            holder = item.find_all("time")[0]
            update_dt = holder.attrs['datetime']  
      
    # Num images
    holder = soup.find_all("div",{"id":"thumbs"})
    if len(holder) > 0 :
        num_images = len(holder[0].find_all("a"))    
    else :
        num_images = 0

    ret_d = {"post_id":post_id,
             "listing_name":listing_name,
             "posting_body_text":posting_body_text,
             "full_title_text":full_title_text,
             "title_text":title_text,
             "posting_dt":posting_dt,
             "update_dt":update_dt,
             "price":price,
             "location":location,
             "num_images":num_images}
  
    return(ret_d)

In [None]:
get_page_elements(bs)
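
The TODO inside `get_page_elements` floats the idea of a separate "extract price" function. Here is one hedged sketch of what that could look like. It uses a regular expression instead of hyphen splitting, which is a different technique from the one in the function above. The `extract_price` name and the sample titles are made up, and it assumes the price appears in the title text as a dollar sign followed by digits.

```python
import re

def extract_price(title_text):
    """Return the first dollar amount in title_text as an int, or None."""
    # A "$", optional whitespace, then digits with optional thousands commas.
    match = re.search(r"\$\s*([\d,]+)", title_text)
    if match is None:
        return None
    digits = match.group(1).replace(",", "")
    return int(digits) if digits else None

# Hypothetical title strings, not taken from real listings.
print(extract_price("2005 Jetta - $1700 (Seattle)"))
print(extract_price("2014 Volkswagen Jetta 2.0L - low miles - $8,500 (Federal Way)"))
print(extract_price("Jetta parts"))
```

Because it looks for the dollar sign directly, this version does not care how many hyphens the seller put in the title. It would still be fooled by a title that mentions some other dollar amount before the asking price.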