In [1]:
import wikipedia
import requests
from bs4 import BeautifulSoup
from gensim.summarization import keywords  # requires gensim < 4.0; the module was removed in 4.0



### Getting Data
All data for this project comes from publicly available web pages. I use `requests` to fetch pages and `bs4` (BeautifulSoup) to parse them.

My goal is for this project to generalize to any list of webpages/URLs. To get there, I'm starting with a small dataset of a few Wikipedia pages and focusing on understanding how to work with it.

Eventually, I will separate the scraping portion into a script and create a `data/` directory, but that isn't necessary for now.

### Sharing Data
Because all of the data I'm using is publicly available on the internet, everyone will have access to all of my data. It will all be either scraped data or built-in corpora (like `gensim.corpora.wikicorpus`). The `data/` directory will probably remain local and gitignored, but I will include a script that can be run to generate everything in `data/`.
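
The generation script could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper name `save_pages`, the one-`.txt`-file-per-page layout, and the filename cleanup are all hypothetical, and the page text is passed in as a dict so that fetching stays a separate step.

```python
import re
from pathlib import Path


def save_pages(pages_content, data_dir="data"):
    """Write each page's text to data_dir/<name>.txt and return the paths.

    Hypothetical helper: `pages_content` maps a page name to its text.
    """
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for name, text in pages_content.items():
        # Collapse runs of characters that are awkward in filenames
        # (spaces, parentheses, underscores) into a single underscore.
        safe_name = re.sub(r"[^A-Za-z0-9-]+", "_", name).strip("_")
        path = out_dir / f"{safe_name}.txt"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written
```

Calling `save_pages(pages_content)` after the pages have been fetched would populate a local `data/` directory, which can stay gitignored because anyone can regenerate it.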

In [20]:
# Fetch a page with requests and return its text content.
def _get_with_requests(link):
    html = requests.get(link).text
    soup = BeautifulSoup(html, "html.parser")
    # Remove JavaScript and CSS so only readable text remains.
    for tag in soup.find_all(['script', 'style']):
        tag.decompose()

    return soup.text

def get_topics(content):
    '''
    Take a string of content and return a list of keywords.
    '''
    # TODO: look into richer signals for ranking keywords.
    # `words` sets the number of keywords returned; see
    # https://radimrehurek.com/gensim/summarization/keywords.html
    return keywords(content, words=10, lemmatize=True).split('\n')

In [21]:
# Suppose we have a list of topics: Apple, Orange, Yankees, Red Sox.
# Get the data, extract topics, and group the pages.
# For Wikipedia pages, use the `wikipedia` package, which returns clean article text.
pages = ['Apple', 'Orange_(fruit)', 'Yankees', 'Red Sox']
pages_content = {page:wikipedia.page(page).content for page in pages}


In [24]:
# Extract topics from each page
# Topics will be used for naming sections, not for grouping. Open question: group with paragraph2vec?
topic_dict = {page:get_topics(content) for page, content in pages_content.items()}

### Stats
This project is designed to be generic and should work for any list of URLs (with varying success, of course).
This section prints some stats and visualizes the topics.
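
As a rough idea of what such stats could include, the sketch below counts the keywords per page and lists keywords that appear on more than one page. The helper name `topic_stats` and the sample dictionary are made up for illustration; the input is assumed to have the same shape as `topic_dict` (page name mapped to a list of keywords).

```python
from collections import Counter


def topic_stats(topic_dict):
    """Return keyword counts per page and keywords shared across pages."""
    counts = {page: len(topics) for page, topics in topic_dict.items()}
    # Count how many pages each keyword appears on; set() ignores
    # repeats within a single page.
    page_freq = Counter(kw for topics in topic_dict.values() for kw in set(topics))
    shared = sorted(kw for kw, n in page_freq.items() if n > 1)
    return counts, shared


# Made-up sample data, not real output from the keyword extractor.
sample = {
    "Apple": ["apple", "fruit", "tree"],
    "Orange_(fruit)": ["orange", "fruit", "citrus", "tree"],
}
counts, shared = topic_stats(sample)
print(counts)
print(shared)
```

Shared keywords are a cheap first signal of which pages overlap, before trying anything heavier for grouping.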

In [23]:
topic_dict['Orange_(fruit)']

['orange',
 'grown',
 'fruits',
 'citrus',
 'tree',
 'variety',
 'navels',
 'florida',
 'juice',
 'called']