# Web Scraping

This notebook builds a function that returns the text and other relevant information from blog posts and articles. The results will be used to build the training set for the classifier.

It also tests the function on a few URLs to validate that the web scraping works as expected.

In [1]:
import re
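import urllib.request  # Importing the submodule explicitly; a bare `import urllib` does not guarantee urllib.request is loaded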
from bs4 import BeautifulSoup

import pandas as pd

def web_scraper(url, fluff=0):
    """
    Grabs the text of the body for a given URL
    
    To-Do:
    - Fix extracting number of images to include generated plots
    """
    page = urllib.request.urlopen(url)  # Reads the URL
    soup = BeautifulSoup(page, 'html.parser')  # Parses the HTML
    body = soup.find('body').get_text()  # Gets the text from the body
    text = body.replace('\n', ' ')  # Replaces \n with a blank space
    text = ' '.join(text.split())  # Removes duplicate white space
    num_latex_equations = len(re.findall('begin{equation}', text))  # Counts number of LaTeX equations
    num_images = len(re.findall('src', text))  # Counts occurrences of 'src' in the extracted text as a rough proxy for images (see To-Do)
    results = {'Text': text, 'NumLaTeXEquations': num_latex_equations, 'NumImages': num_images, 'IsFluff': fluff}
    return results


web_scraper('https://jeffmacaluso.github.io/jekyll/update/2017/02/10/RestaurantRecommender.html')

{'IsFluff': 0,
 'NumImages': 0,
 'NumLaTeXEquations': 0,
 'Text': 'Jeff Macaluso About Restaurant Recommender Feb 10, 2017 Introduction In this project, I extracted restaurant ratings and reviews from Foursquare and used distance (one of the main ideas behind recommender systems) to generate recommendations for restaurants in one city that have similar reviews to restaurants in another city. This post is the abridged version, but check out my github post for all of the code if you are curious or want to use it. Motivation I grew up in Austin, Texas, and moved to Minneapolis, Minnesota a few years ago. My wife and I are people who love food, and loved the food culture in Austin. After our move, we wanted to find new restaurants to replace our favorites from back home. However, most decently rated places in Minneapolis we went to just didn’t quite live up to our Austin expectations. These restaurants usually came at the recommendations of locals, acquaintances, or from Google ratings, bu

In [2]:
web_scraper('https://opendatascience.com/blog/curse-of-dimensionality-explained/')

{'IsFluff': 0,
 'NumImages': 3,
 'NumLaTeXEquations': 5,
 'Text': 'Toggle navigation News Webinar Content Talks Blog Start Intro Jobs News Webinar Talks Blogs Start Search Exact matches only Search in title Search in content Search in comments Search in excerpt Filter by Custom Post Type Intro Jobs @media (max-width: 325px) { /*#header {top: 139px;}*/ .home-widget-col-1 { margin-top:60px;} } var date_new = "April 12, 2017 23:00"; var minutes = 4; Curse of Dimensionality Explained By Nikolay Manchev | 06/10/2016 Tags: Machine Learning (function() { if (window.pluso)if (typeof window.pluso.start == "function") return; if (window.ifpluso==undefined) { window.ifpluso = 1; var d = document, s = d.createElement(\'script\'), g = \'getElementsByTagName\'; s.type = \'text/javascript\'; s.charset=\'UTF-8\'; s.async = true; s.src = (\'https:\' == window.location.protocol ? \'https\' : \'http\') + \'://share.pluso.ru/pluso-like.js\'; var h=d[g](\'body\')[0]; h.appendChild(s); }})(); The Curse of D

In [3]:
web_scraper('http://blog.revolutionanalytics.com/2018/01/doazureparallel-simulations.html')

{'IsFluff': 0,
 'NumImages': 1,
 'NumLaTeXEquations': 0,
 'Text': 'Revolutions Daily news about using open source R for big data analysis, predictive modeling, data science, and visualization since 2008 « Scraping a website with 5 lines of R code | Main | Because it\'s Friday: Excel Painter » $(function(){ var query = window.location.search.substring(1); if( query == "pintix=1" ) { var e=document.createElement(\'script\');e.setAttribute(\'type\',\'text/javascript\');e.setAttribute(\'charset\',\'UTF-8\');e.setAttribute(\'src\',\'http://static.typepad.com/.shared:va9a3b5c:typepad:en_us//js/pinmarklet.js?r=\'+Math.random()*99999999);document.body.appendChild(e); } }); window.ZemantaBlogSettings = ""; window.ZemantaPostSettings = ""; January 25, 2018 Speed up simulations in R with doAzureParallel I\'m a big fan using R to simulate data. When I\'m trying to understand a data set, my first step is sometimes to simulate data from a model and compare the results to the data, before I go down t
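
The outputs above show two side effects of calling `get_text()` on the whole body: inline JavaScript and CSS end up in `Text`, and `NumImages` counts occurrences of `'src'` in that text rather than actual image tags. The following is a minimal sketch of one possible way to address both, not the approach used in this notebook. The name `extract_features` is hypothetical, and it takes an HTML string instead of a URL so that it can be tried without a network connection.

```python
import re
from bs4 import BeautifulSoup


def extract_features(html, fluff=0):
    """
    Hypothetical variant of web_scraper that works on an HTML string
    """
    soup = BeautifulSoup(html, 'html.parser')  # Parses the HTML
    body = soup.find('body')
    if body is None:  # Falls back to the whole document if there is no <body>
        body = soup

    # Counts <img> tags on the parsed tree, since get_text() discards tags and attributes
    num_images = len(body.find_all('img'))

    # Removes <script> and <style> elements so their contents stay out of the text
    for tag in body.find_all(['script', 'style']):
        tag.decompose()

    text = ' '.join(body.get_text(separator=' ').split())  # Collapses all white space
    num_latex_equations = len(re.findall(r'begin\{equation\}', text))  # Same pattern as web_scraper, with the braces escaped
    return {'Text': text, 'NumLaTeXEquations': num_latex_equations,
            'NumImages': num_images, 'IsFluff': fluff}
```

Passing the result of `urllib.request.urlopen(url).read()` as `html` would mirror the original function. Plots rendered by JavaScript at load time would still be missed, because they are not present as `<img>` tags in the downloaded HTML.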

## Combining Everything Together

In [5]:
# List of good URLs to scrape
urls = ['https://jeffmacaluso.github.io/jekyll/update/2017/02/10/RestaurantRecommender.html',
        'https://opendatascience.com/blog/curse-of-dimensionality-explained/',
        'http://blog.revolutionanalytics.com/2018/01/doazureparallel-simulations.html']

# Putting the URLs into a data frame to take advantage of .apply
urlList = pd.DataFrame(urls, columns=['URL'])

# Scraping the URLs and adding the results back into the data frame as columns
results = urlList['URL'].apply(lambda url: pd.Series(web_scraper(url)))
results = urlList.merge(results, left_index=True, right_index=True)

results

| | URL | IsFluff | NumImages | NumLaTeXEquations | Text |
|---|---|---|---|---|---|
| 0 | https://jeffmacaluso.github.io/jekyll/update/2... | 0 | 0 | 0 | Jeff Macaluso About Restaurant Recommender Feb... |
| 1 | https://opendatascience.com/blog/curse-of-dime... | 0 | 3 | 5 | Toggle navigation News Webinar Content Talks B... |
| 2 | http://blog.revolutionanalytics.com/2018/01/do... | 0 | 1 | 0 | Revolutions Daily news about using open source... |
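
With a longer list of URLs, a single unreachable page would raise inside `.apply` and discard every result collected so far. As a hedged alternative, the sketch below loops over the URLs, skips any that fail, and builds the data frame from a list of dictionaries. The helper name `scrape_all` and its `scraper` parameter are assumptions made for illustration; `scraper` is expected to be a function such as `web_scraper` that takes a URL and returns a dictionary.

```python
import pandas as pd


def scrape_all(urls, scraper):
    """
    Hypothetical helper: applies `scraper` to each URL and collects the results
    """
    records = []
    for url in urls:
        try:
            record = scraper(url)
        except Exception as error:  # Broad on purpose for a sketch; narrow this in real use
            print(f'Skipping {url}: {error}')
            continue
        records.append({'URL': url, **record})  # Keeps the URL next to its scraped fields
    return pd.DataFrame(records)  # One row per successfully scraped URL
```

Calling `scrape_all(urls, web_scraper)` should give the same columns as the merge above, with rows only for the URLs that were scraped successfully.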
