# Ryan's Notes on Web Scraping

Using `requests`, `beautifulsoup4`, and the standard library's `contextlib`.

Get deps: `pip3 install requests beautifulsoup4`

**Resources**
- [Practical intro to web scraping](https://realpython.com/python-web-scraping-practical-introduction/)

In [46]:
import requests
import contextlib
from bs4 import BeautifulSoup
import pprint
import operator

Now that all of the libraries are imported, I need a simple function that makes a request and grabs all the HTML for any given URL (webpage address).

In [30]:
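def getPage(url):
    """Return the raw HTML bytes for url, or None if the check fails."""
    print('Attempting to get HTML/XML from ' + url)
    with contextlib.closing(requests.get(url, stream=True)) as resp:
        # .get() avoids a KeyError when the server omits Content-Type.
        content_type = resp.headers.get('Content-Type', '').lower()
        print('Content Type: ' + content_type)
        if resp.status_code == 200 and 'html' in content_type:
            print('Success!')
            return resp.content
    # Non-200 status or non-HTML content.
    return None


page_html = getPage('https://ryanfleck.github.io/java')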

Attempting to get HTML/XML from https://ryanfleck.github.io/java
Content Type: text/html; charset=utf-8
Success!
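
The success check inside `getPage` can also be pulled out into a small pure function, which makes it easy to try against different status codes and headers without touching the network. This is only a sketch: `looks_like_html` is a hypothetical helper name, not part of `requests`, and it takes a plain dict rather than a real response object.

```python
def looks_like_html(status_code, headers):
    """Return True for a 200 response whose Content-Type mentions HTML.

    `headers` is any mapping of header names to values. A plain dict is
    case-sensitive, so the header name is matched case-insensitively here.
    """
    content_type = ''
    for name, value in headers.items():
        if name.lower() == 'content-type':
            content_type = value.lower()
            break
    return status_code == 200 and 'html' in content_type


# Try the check against a few hand-written cases.
print(looks_like_html(200, {'Content-Type': 'text/html; charset=utf-8'}))
print(looks_like_html(404, {'Content-Type': 'text/html'}))
print(looks_like_html(200, {'content-type': 'application/json'}))
print(looks_like_html(200, {}))
```

Only the first case passes both conditions; a missing `Content-Type` header is treated as "not HTML" rather than raising an error.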


Now the HTML is stored in the `page_html` variable. Time to use Beautiful Soup.

In [34]:
page = BeautifulSoup(page_html, 'html.parser')

for paragraph in page.select('p'):
    print('\n'+paragraph.text)



          RCF2019
          ⋅
          Home
          ⋅
          
              Invert Page Colors
          


Java is a powerful Object-Oriented programming language used to create complex, scalable, reliable enterprise applications. The University of Ottawa uses Java and C to teach Operating Systems, Data Structures and Software Engineering. At MNP LLP, I applied Java to extend client WCMS systems built on Spring. Going forward, I will be using Java at my upcoming Summer internship at IBM. While not my most active manual, I expect this page to see tremendous growth in the next few months.

Java is as good a language as any for learning how to program. Typically I recommend JavaScript as you can create more visual, interactive projects right off the bat, and the developer community is more beginner-friendly and geared towards ‘fun’ projects. With Java, you will be able to write Android Applications, web servers and APIs, desktop applications and even games. Unfortunately, most fr

**Neat.** Now, with our `page` object, we can select HTML elements and read their contents and properties to our heart's content. `Fun.fun.fun()` For starters, I'll write a word count.

In [50]:
# Seeded with sample counts; scraped words are added on top of these.
words = {'and': 4, 'apple': 3}

def addWordToDict(word, dictionary):
    # Unseen words start at 0, then every occurrence adds one.
    dictionary[word] = dictionary.get(word, 0) + 1

for paragraph in page.select('p'):
    for word in paragraph.text.split():
        addWordToDict(word, words)

# See which words are most frequent.
pprint.pprint(sorted(
    words.items(),
    key=operator.itemgetter(1),
    reverse=True))

[('and', 73),
 ('the', 69),
 ('to', 51),
 ('a', 48),
 ('is', 43),
 ('of', 32),
 ('Java', 28),
 ('be', 24),
 ('for', 22),
 ('are', 22),
 ('in', 21),
 ('can', 21),
 ('thread', 15),
 ('with', 14),
 ('an', 13),
 ('will', 12),
 ('as', 12),
 ('A', 12),
 ('The', 11),
 ('on', 11),
 ('at', 10),
 ('that', 10),
 ('you', 9),
 ('EE', 9),
 ('it', 8),
 ('threads', 8),
 ('used', 7),
 ('I', 7),
 ('this', 7),
 ('or', 7),
 ('into', 7),
 ('your', 7),
 ('data', 7),
 ('not', 6),
 ('more', 6),
 ('all', 6),
 ('different', 6),
 ('most', 5),
 ('web', 5),
 ('some', 5),
 ('only', 5),
 ('variables', 5),
 ('objects', 5),
 ('ToDo', 5),
 ('semaphore', 5),
 ('using', 4),
 ('my', 4),
 ('good', 4),
 ('when', 4),
 ('one', 4),
 ('run', 4),
 ('code', 4),
 ('has', 4),
 ('must', 4),
 ('When', 4),
 ('variable', 4),
 ('In', 4),
 ('from', 4),
 ('ensures', 4),
 ('object', 4),
 ('each', 4),
 ('execute', 4),
 ('number', 4),
 ('continue', 4),
 ('B', 4),
 ('before', 4),
 ('until', 4),
 ('This', 4),
 ('SE', 4),
 ('apple', 3),
 ('⋅', 
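
The counts above treat `The` and `the` (and `A` and `a`) as different words, and the `⋅` separator from the page header gets counted as a word too. One possible way to normalise this is sketched below, assuming the standard library's `collections.Counter` and a hypothetical helper called `count_words`. The sample strings are made up for illustration, not scraped.

```python
import re
from collections import Counter


def count_words(texts):
    """Count lowercased words across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # [a-z']+ keeps ASCII letters and apostrophes only, so digits,
        # punctuation, and separators such as '⋅' are dropped.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


sample = ['The cat and the hat.', 'A cat ⋅ a hat!']
counts = count_words(sample)
print(counts.most_common(3))
```

With the `page` object from earlier, the same helper could be fed `(p.text for p in page.select('p'))`. Note that this pattern also drops non-ASCII letters, which may or may not be what you want.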

<br />

## Miscellaneous Python Language Experimentation
(AKA Stuff I didn't know so I had to write a test to find out.)

In [26]:
# Is None falsy?
if None:
    print('Lol!')
else:
    print('Not lol.')

Not lol.
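
`None` is not the only falsy value, which matters when a function like `getPage` can return `None` on failure: an empty response body is falsy as well, but it is not `None`. A quick sketch of the same kind of test for a few other built-in values (the variable names here are just for illustration):

```python
# Built-in values that evaluate to False in a boolean context.
falsy_values = [None, False, 0, 0.0, '', b'', [], {}, set(), ()]

for value in falsy_values:
    print(repr(value), '->', bool(value))

# Truthiness and identity are different checks: an empty bytes object
# is falsy, yet it is not None.
empty_body = b''
print(not empty_body)
print(empty_body is None)
```

So `if not page_html:` and `if page_html is None:` answer slightly different questions; the second only catches the case where nothing was returned at all.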
