## Dealing with Different Website Layouts
In most cases of web crawling, you’re not looking to collect data from sites you’ve never seen before, but from a few, or a few dozen, websites that are preselected by a human. This means that you don’t need to use complicated algorithms or machine learning to detect which text on the page “looks most like a title” or which is probably the “main content.” You can determine what these elements are manually.

The most obvious approach is to write a separate page parser for each website. Each might take in a URL, string, or BeautifulSoup object, and return a Python object for the thing that was scraped.

In [12]:
import requests
from bs4 import BeautifulSoup


class Content:
    """ Class to define our data model that contains 3 elements"""
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body


def getPage(url):
    req = requests.get(url)
    return BeautifulSoup(req.text, 'html.parser')


def scrapeBrookings(url):
    bs = getPage(url)
    title = bs.find("h1").text
    body = bs.find("div", {"class": "post-body"}).text
    return Content(url, title, body)


def scrapeNYTimes(url):
    bs = getPage(url)
    title = bs.find("h1").text
    lines = bs.find_all("p", {"class":"story-content"})
    body = '\n'.join([line.text for line in lines])
    return Content(url, title, body)


content = scrapeBrookings('https://www.brookings.edu/research/why-ai-is-just-automation/')
print('Title: {}'.format(content.title), '\n', 'URL: {}\n'.format(content.url))
#print(content.body)

content = scrapeNYTimes('https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html')
print('Title: {}'.format(content.title), '\n', 'URL: {}\n'.format(content.url))
#print(content.body)

Title: Why AI is just automation 
 URL: https://www.brookings.edu/research/why-ai-is-just-automation/

Title: The Men Who Want to Live Forever 
 URL: https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html



### Making the Code More Object-Oriented

The previous code shows that the only real site-dependent variables are the CSS selectors **used to obtain each piece of information.** BeautifulSoup’s `find` and `find_all` functions take two arguments, a tag string and a dictionary of key/value attributes, so you can pass these arguments in as parameters that define the location of the target data.
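
As a rough illustration of passing the tag and attribute dictionary in as parameters, here is a minimal sketch. The `find_text` helper and the inline HTML are hypothetical stand-ins (not part of the original code), chosen so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded article page.
HTML = """
<html><body>
  <h1>Example headline</h1>
  <div class="post-body">First paragraph of the article.</div>
</body></html>
"""


def find_text(bs, tag, attrs=None):
    """Return the text of the first element matching tag/attrs, or '' if none."""
    element = bs.find(tag, attrs or {})
    return element.get_text(strip=True) if element is not None else ''


bs = BeautifulSoup(HTML, 'html.parser')

# The same function serves any site; only the arguments change.
title = find_text(bs, 'h1')
body = find_text(bs, 'div', {'class': 'post-body'})
missing = find_text(bs, 'div', {'class': 'no-such-class'})

print(title)
print(body)
```

Returning an empty string for a missing element is one possible design choice; it avoids the `AttributeError` that `bs.find(...).text` raises when nothing matches.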


To make things even more convenient, rather than dealing with all of these tag arguments and key/value pairs, you can use the BeautifulSoup `select` function with a single string CSS selector for each piece of information you want to collect and put all of these selectors in a dictionary object.
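
The dictionary-of-selectors idea might look something like the sketch below. The `SELECTORS` dictionary, the `extract` helper, and the sample HTML are assumptions made for illustration; real sites will need their own selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded article page.
HTML = """
<html><body>
  <h1>Example headline</h1>
  <div class="post-body">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""

# One CSS selector string per piece of information to collect (assumed values).
SELECTORS = {
    'title': 'h1',
    'body': 'div.post-body p',
}


def extract(bs, selectors):
    """Apply each CSS selector and join the text of all matching elements."""
    results = {}
    for name, selector in selectors.items():
        elements = bs.select(selector)  # returns a list of matching tags
        results[name] = '\n'.join(el.get_text(strip=True) for el in elements)
    return results


bs = BeautifulSoup(HTML, 'html.parser')
data = extract(bs, SELECTORS)
print(data['title'])
print(data['body'])
```

Because `select` always returns a list, a selector that matches nothing simply produces an empty string here rather than an exception.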



In [13]:
class Content:
    """  Data model for this project """
    
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        """ Flexible printing function controls output """
        print('URL: {}'.format(self.url))
        print('TITLE: {}'.format(self.title))
        print('BODY:\n{}'.format(self.body))

class Website:
    """ 
    Contains information about website structure. The Website class doesn't store the collected 
    data from the pages, but stores instructions about how to collect that data.
    """

    def __init__(self, name, url, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag
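
One way the two classes might fit together is sketched below, under stated assumptions. The `safe_get` and `parse` helpers, the inline HTML, and the `example.com` URLs are hypothetical and not from the original. `titleTag` and `bodyTag` are assumed to hold CSS selector strings for `select`. The classes are repeated in compact form only so the block runs on its own.

```python
from bs4 import BeautifulSoup


class Content:
    """Compact copy of the data model above."""
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body


class Website:
    """Compact copy of the site description above."""
    def __init__(self, name, url, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag


def safe_get(bs, selector):
    """Return the joined text of all elements matching selector, or ''."""
    elements = bs.select(selector)
    return '\n'.join(el.get_text(strip=True) for el in elements)


def parse(site, url, html):
    """Build a Content object from html using the selectors stored on site."""
    bs = BeautifulSoup(html, 'html.parser')
    title = safe_get(bs, site.titleTag)
    body = safe_get(bs, site.bodyTag)
    if title and body:
        return Content(url, title, body)
    return None  # the page did not match this site's layout


# Hypothetical site description and page markup.
site = Website('Example', 'https://example.com', 'h1', 'div.post-body')
html = '<h1>Example headline</h1><div class="post-body">Body text.</div>'

content = parse(site, 'https://example.com/article', html)
print(content.title)
print(content.body)
```

Supporting a new site then means creating another `Website` instance with different selectors, not writing another scraping function.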