# Using Regular Expressions to Clean up Web Pages

This notebook uses regular expressions to clean up a web page. Cleaning is a necessary first step for most textual analysis of web content, because it removes the HTML tags and other markup that would otherwise clutter the text.

Regular Expressions (or Regex) are a pattern-matching syntax supported by most programming languages. A Regex combines metacharacters (such as ? ^ . and *) with literal strings to describe the text it should match. For a full list of Regex metacharacters and what they do, please see the Regex cheatsheet: http://www.rexegg.com/regex-quickstart.html
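
As a small illustration of the difference between a metacharacter and a literal, the sketch below runs two patterns over a made-up string (the sample text and variable names are assumptions for illustration only). An unescaped dot matches any single character, while an escaped dot matches only a full stop.

```python
import re

sample = "3.5 3x5 3-5"

# An unescaped dot is a metacharacter: it matches any single character except a newline
any_char = re.findall(r"3.5", sample)
print(any_char)      # ['3.5', '3x5', '3-5']

# A backslash turns the dot into a literal full stop
literal_dot = re.findall(r"3\.5", sample)
print(literal_dot)   # ['3.5']
```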

## Libraries and Resources used

-  Python 3
-  re
-  urllib

Written February 27, 2018.

In [337]:
# Import the required libraries
import re
import urllib.request

## Import the Content from a Website

To start cleaning, we first need to read a site's content into Python.
For this notebook we use the landing page of Methodica, TAPoR's companion site.
Feel free to substitute the URL of any site that you wish to clean.

In [338]:
# Store your url in a variable
path = 'http://methodi.ca/' 

# Read in the content from the url
with urllib.request.urlopen(path) as response:
    webContent = response.read().decode('utf-8')
# Print out the first 1000 characters of the content to see what kind of html tags we are dealing with
print(webContent[:1000])

<!DOCTYPE html>
<html lang="en" dir="ltr"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:dc="http://purl.org/dc/terms/"
  xmlns:foaf="http://xmlns.com/foaf/0.1/"
  xmlns:og="http://ogp.me/ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:sioc="http://rdfs.org/sioc/ns#"
  xmlns:sioct="http://rdfs.org/sioc/types#"
  xmlns:skos="http://www.w3.org/2004/02/skos/core#"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
<head>
<meta charset="utf-8" />
<link rel="shortcut icon" href="http://methodi.ca/sites/all/themes/methodica_theme/favicon.ico" type="image/vnd.microsoft.icon" />
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" />
<link rel="canonical" href="/methodica" />
<meta name="Generator" content="Drupal 7 (http://drupal.org)" />
<link rel="shortlink" href="/node/2" />
<title>Welcome to the Methods Commons | Methods Commons</title>
<style type="text/css" media="all">
@import url("http://methodi.ca/modules/system/syst
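
The cell above assumes the page is encoded as UTF-8. For other sites that may not hold, so here is a hedged sketch of a helper that decodes with the charset declared by the server when there is one. The function name fetch_text and its fallback_encoding parameter are hypothetical, not part of the original notebook.

```python
import urllib.request


def fetch_text(url, fallback_encoding="utf-8"):
    """Download a page and decode it with the server's declared charset, if any."""
    with urllib.request.urlopen(url) as response:
        # get_content_charset() reads the charset from the Content-Type header
        # and returns None when the header does not declare one
        charset = response.headers.get_content_charset() or fallback_encoding
        # errors="replace" swaps undecodable bytes for a placeholder instead of raising
        return response.read().decode(charset, errors="replace")
```

Calling fetch_text(path) would then stand in for the with block in the cell above.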

## Eliminate Whitespace

By examining these first 1000 characters, you may notice some unnecessary line breaks and spaces. This 'whitespace' can be eliminated with a couple of lines of Regex. The step also matters for the patterns that follow: by default the Regex dot (.) does not match newline (\n) characters, so a pattern such as <.*?> would miss any tag that spans more than one line. Removing the newlines first avoids that problem.
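
The newline behaviour can be seen on a small, made-up snippet (the string and variable names below are assumptions for illustration). By default the dot (.) does not match \n, while the re.DOTALL flag lifts that restriction and is an alternative to removing the newlines first.

```python
import re

# A tag whose attributes continue on a second line
snippet = "<p\nclass='intro'>Hello</p>"

# Default behaviour: '.' stops at the newline, so the opening tag is missed
default_matches = re.findall(r"<.*?>", snippet)
print(default_matches)   # ['</p>']

# With re.DOTALL, '.' also matches newlines and both tags are found
dotall_matches = re.findall(r"<.*?>", snippet, flags=re.DOTALL)
print(dotall_matches)    # ["<p\nclass='intro'>", '</p>']
```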

In [339]:
# Eliminate new line characters with re.sub
# This function works by substituting the new line character with a space
webContent = re.sub(r'\n', " ", webContent)

# Check the altered text
print(webContent[:1000])

<!DOCTYPE html> <html lang="en" dir="ltr"   xmlns:content="http://purl.org/rss/1.0/modules/content/"   xmlns:dc="http://purl.org/dc/terms/"   xmlns:foaf="http://xmlns.com/foaf/0.1/"   xmlns:og="http://ogp.me/ns#"   xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"   xmlns:sioc="http://rdfs.org/sioc/ns#"   xmlns:sioct="http://rdfs.org/sioc/types#"   xmlns:skos="http://www.w3.org/2004/02/skos/core#"   xmlns:xsd="http://www.w3.org/2001/XMLSchema#"> <head> <meta charset="utf-8" /> <link rel="shortcut icon" href="http://methodi.ca/sites/all/themes/methodica_theme/favicon.ico" type="image/vnd.microsoft.icon" /> <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" /> <link rel="canonical" href="/methodica" /> <meta name="Generator" content="Drupal 7 (http://drupal.org)" /> <link rel="shortlink" href="/node/2" /> <title>Welcome to the Methods Commons | Methods Commons</title> <style type="text/css" media="all"> @import url("http://methodi.ca/modules/system/syst

In [340]:
# Collapse every run of 2 or more whitespace characters into a single space
# \s matches any whitespace character (spaces, tabs, newlines)
# Numbers in curly brackets set how many repetitions to match: {2,} means "2 or more"
webContent = re.sub(r'\s{2,}', " ", webContent)

# Check the altered text
print(webContent[:1000])

<!DOCTYPE html> <html lang="en" dir="ltr" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:og="http://ogp.me/ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:sioct="http://rdfs.org/sioc/types#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#"> <head> <meta charset="utf-8" /> <link rel="shortcut icon" href="http://methodi.ca/sites/all/themes/methodica_theme/favicon.ico" type="image/vnd.microsoft.icon" /> <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" /> <link rel="canonical" href="/methodica" /> <meta name="Generator" content="Drupal 7 (http://drupal.org)" /> <link rel="shortlink" href="/node/2" /> <title>Welcome to the Methods Commons | Methods Commons</title> <style type="text/css" media="all"> @import url("http://methodi.ca/modules/system/system.base.css?p2m7mv
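
Because \s matches newlines and tabs as well as spaces, the two whitespace steps can be folded into one. The sketch below assumes a short made-up string and uses \s+ (one or more whitespace characters) followed by strip() to drop the leading and trailing space.

```python
import re

messy = "  <head>\n\t<title>Methods   Commons</title>\n</head>  "

# \s+ replaces every run of whitespace, including newlines and tabs, with one space
tidy = re.sub(r"\s+", " ", messy).strip()
print(tidy)   # <head> <title>Methods Commons</title> </head>
```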

## Eliminate HTML Tags

The first thing that stands out in the printed content is the sheer amount of HTML markup.
Let's remove these tags so that only the text remains.
An obvious feature of HTML tags is that they come wrapped in angle brackets, < >.
Regex can strip out the brackets together with everything in between them.

In [341]:
# First, we are going to isolate the tags themselves with re.findall
# This technique is useful if it is the tags, rather than the text, that you want to pull from the web content
# The ? after .* makes the match stop at the first > it reaches
webTags = re.findall(r'<.*?>', webContent)

# Check the extracted tags
# Note that re.findall returns the tags as a list
print(webTags[:10])

['<!DOCTYPE html>', '<html lang="en" dir="ltr" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:og="http://ogp.me/ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:sioct="http://rdfs.org/sioc/types#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#">', '<head>', '<meta charset="utf-8" />', '<link rel="shortcut icon" href="http://methodi.ca/sites/all/themes/methodica_theme/favicon.ico" type="image/vnd.microsoft.icon" />', '<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" />', '<link rel="canonical" href="/methodica" />', '<meta name="Generator" content="Drupal 7 (http://drupal.org)" />', '<link rel="shortlink" href="/node/2" />', '<title>']
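
The question mark in <.*?> is what keeps each tag separate. A minimal sketch on a made-up string (an assumption for illustration) shows the difference between the greedy form .* and the non-greedy form .*?

```python
import re

html_text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the LAST > in the string, swallowing the text between the tags
greedy = re.findall(r"<.*>", html_text)
print(greedy)   # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the FIRST >, so each tag is matched on its own
lazy = re.findall(r"<.*?>", html_text)
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
```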


In [342]:
# Since we want just the text, eliminate the HTML tags with re.sub
# This call substitutes each tag with a space
webContent = re.sub(r'<.*?>', " ", webContent)

# Check the cleaned content
print(webContent[:1000])

                   Welcome to the Methods Commons | Methods Commons    @import url("http://methodi.ca/modules/system/system.base.css?p2m7mv"); @import url("http://methodi.ca/modules/system/system.menus.css?p2m7mv"); @import url("http://methodi.ca/modules/system/system.messages.css?p2m7mv"); @import url("http://methodi.ca/modules/system/system.theme.css?p2m7mv");     @import url("http://methodi.ca/sites/all/modules/codefilter/codefilter.css?p2m7mv");     @import url("http://methodi.ca/modules/comment/comment.css?p2m7mv"); @import url("http://methodi.ca/modules/field/theme/field.css?p2m7mv"); @import url("http://methodi.ca/modules/node/node.css?p2m7mv"); @import url("http://methodi.ca/modules/search/search.css?p2m7mv"); @import url("http://methodi.ca/modules/user/user.css?p2m7mv"); @import url("http://methodi.ca/sites/all/modules/views/css/views.css?p2m7mv"); @import url("http://methodi.ca/sites/all/modules/ckeditor/css/ckeditor.css?p2m7mv");     @import url("http://methodi.ca/sites/all/

## Eliminate Additional Web Scripting

Some of the text is starting to make its way to the surface, but our content is still riddled with leftover stylesheet and script code.
These leftovers vary from web page to web page, so what follows is an example of how to deal with the issue rather than a universal recipe.
A good way to begin cutting out unnecessary code is to identify a pattern that isolates it.
In this case, the pattern is everything between and including @import url(" and the closing ");

In [343]:
# Use the same substitution technique to remove the unnecessary code
webContent = re.sub(r'@import url.*?;', " ", webContent)

# Check the cleaned content
print(webContent)

                   Welcome to the Methods Commons | Methods Commons                                                                                                                                         Search form     Search                                  Home     Recipes     Tutorials     Examples     Utilities     Backgrounders     Glossary     About                                 Learn how to identify the locations named in a corpus and generate a map of the results.                 Work through an example of how to identify themes in a text.                 Read up on the history of text analysis tools.                              Welcome to the Methods Commons                         Methodica is a collection of research methods and techniques for analyzing text. Computation has produced new and exciting ways of studying text in the Digital Humanities, and many of these methods do not require the use of expensive programs or detailed programming knowledge. This site describe
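
The @import lines are left behind because only the style tags were removed, not what they enclose. An alternative, sketched here on a made-up HTML string, is to delete whole style and script elements before stripping the remaining tags. This is not the notebook's method and would have to run on the raw HTML while the tags are still present. The variable names are assumptions.

```python
import re

raw_html = (
    '<head><style type="text/css">@import url("a.css"); body {color: red}</style>'
    '<script>var x = 1 < 2;</script></head>'
    '<body><p>Welcome to the Methods Commons</p></body>'
)

# Remove <style> and <script> elements together with everything inside them.
# \1 repeats whichever tag name was matched; re.DOTALL lets '.' span newlines.
no_code = re.sub(r"<(style|script)\b.*?</\1\s*>", " ", raw_html,
                 flags=re.DOTALL | re.IGNORECASE)

# Then strip the remaining tags and tidy the whitespace as before
text = re.sub(r"<.*?>", " ", no_code)
text = re.sub(r"\s+", " ", text).strip()
print(text)   # Welcome to the Methods Commons
```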

In [344]:
# We are not done yet. There are still HTML character entities that need to be eliminated
# These are codes that start with & and end with ; (for example &amp; for an ampersand)
# Use the same substitution technique to remove them
webContent = re.sub(r'&.*?;', " ", webContent)

# Check the cleaned content
print(webContent)

                   Welcome to the Methods Commons | Methods Commons                                                                                                                                         Search form     Search                                  Home     Recipes     Tutorials     Examples     Utilities     Backgrounders     Glossary     About                                 Learn how to identify the locations named in a corpus and generate a map of the results.                 Work through an example of how to identify themes in a text.                 Read up on the history of text analysis tools.                              Welcome to the Methods Commons                         Methodica is a collection of research methods and techniques for analyzing text. Computation has produced new and exciting ways of studying text in the Digital Humanities, and many of these methods do not require the use of expensive programs or detailed programming knowledge. This site describe
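
One caveat with the pattern &.*?; is that it removes everything from any ampersand to the next semicolon, including ordinary text. If the sequences being removed are HTML character entities, the standard library's html.unescape is a gentler option: it converts entities to the characters they stand for and leaves other ampersands alone. The sample string below is an assumption for illustration.

```python
import html
import re

fragment = "Tools &amp; methods&nbsp;for text; see R&D notes; done"

# The broad pattern also deletes "&D notes;", which is ordinary text
pattern_result = re.sub(r"&.*?;", " ", fragment)

# html.unescape converts &amp; to & and &nbsp; to a non-breaking space,
# and leaves the plain ampersand in "R&D" untouched
decoded = html.unescape(fragment)

# \s also matches the non-breaking space, so the usual tidy-up still applies
decoded_clean = re.sub(r"\s+", " ", decoded)
print(decoded_clean)   # Tools & methods for text; see R&D notes; done
```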

In [345]:
# The substitutions above left long runs of spaces behind
# Let's run our whitespace-cleaning Regex one more time to tighten up the text
webContent = re.sub(r'\n', " ", webContent)

webContent = re.sub(r'\s{2,}', " ", webContent)

# Check the altered text
print(webContent)

 Welcome to the Methods Commons | Methods Commons Search form Search Home Recipes Tutorials Examples Utilities Backgrounders Glossary About Learn how to identify the locations named in a corpus and generate a map of the results. Work through an example of how to identify themes in a text. Read up on the history of text analysis tools. Welcome to the Methods Commons Methodica is a collection of research methods and techniques for analyzing text. Computation has produced new and exciting ways of studying text in the Digital Humanities, and many of these methods do not require the use of expensive programs or detailed programming knowledge. This site describes common or interesting sequences of actions, or recipes , showing users how to combine freely accessible resources to perform various analytic tasks. Each recipe begins by listing required ingredients (text analysis tools, pieces of code, texts in certain formats, etc.), walks the user through the steps of a process, and concludes wi

## Conclusion 

Importing a site into Python also brings in all of the HTML elements, which can get in the way of any textual analysis we want to run on the web content. Regex is a fairly simple way to eliminate those elements. We can even use it to isolate particular tags or lines of code that a project might need. The Regex required for cleaning will change from site to site, so treat this notebook as a worked example rather than a strict set of instructions.
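
To make the site-to-site variation explicit, the steps from this notebook can be gathered into one function that takes the site-specific patterns as an argument. This is a minimal sketch, not a definitive implementation: the function name clean_page, the site_patterns parameter, and the sample string are all hypothetical.

```python
import re


def clean_page(raw, site_patterns=()):
    """Apply the notebook's cleaning steps to raw HTML and return plain text.

    site_patterns is a hypothetical parameter: extra Regex patterns for
    residue specific to one site, such as the @import lines on Methodica.
    """
    text = re.sub(r"\s+", " ", raw)       # newlines and whitespace runs -> one space
    text = re.sub(r"<.*?>", " ", text)    # strip the HTML tags
    for pattern in site_patterns:         # remove site-specific leftovers
        text = re.sub(pattern, " ", text)
    text = re.sub(r"&.*?;", " ", text)    # remove &...; sequences, as in the notebook
    return re.sub(r"\s+", " ", text).strip()   # final tidy-up


sample = ('<style>\n@import url("a.css");\n</style>\n'
          '<p>Welcome&nbsp;to the  Methods Commons</p>')
cleaned = clean_page(sample, site_patterns=[r"@import url.*?;"])
print(cleaned)   # Welcome to the Methods Commons
```

For a different site, only the list passed as site_patterns should need to change.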