# Custom Sources - HTML

## How to bring in text directly from a website

In [2]:
import nltk
from urllib.request import urlopen

Websites are written in HTML, so when you pull information directly from a site, you get the HTML markup and any embedded scripts back along with the text.

In [3]:
url="https://en.wikipedia.org/wiki/Python_(programming_language)"

In [4]:
html = urlopen(url).read()

In [5]:
html[:1000]

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Python (programming language) - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Python_(programming_language)","wgTitle":"Python (programming language)","wgCurRevisionId":879223212,"wgRevisionId":879223212,"wgArticleId":23862,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Use dmy dates from August 2015","Wikipedia articles needing clarification from May 2018","All articles with unsourced statements","Articles with unsourced statements from May 2018","Articles containing potentially dated statements from March 2018","All articles containin

**We will use a Python library called BeautifulSoup to strip away the HTML code.**

In [6]:
from bs4 import BeautifulSoup

In [7]:
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning
web_str = BeautifulSoup(html, "html.parser").get_text()

In [8]:
web_str
# Traces of JavaScript code are still left in the text

'\n\n\nPython (programming language) - Wikipedia\ndocument.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );\n(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Python_(programming_language)","wgTitle":"Python (programming language)","wgCurRevisionId":879223212,"wgRevisionId":879223212,"wgArticleId":23862,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Use dmy dates from August 2015","Wikipedia articles needing clarification from May 2018","All articles with unsourced statements","Articles with unsourced statements from May 2018","Articles containing potentially dated statements from March 2018","All articles containing potentially dated statements","Articles containing potentially dated statements from August 2016","Articles containing potentially 
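The leftover code in the output above comes from the contents of `<script>` elements (and, on many pages, `<style>` elements). One way to reduce it, shown here as a minimal sketch rather than the notebook's own method, is to remove those elements from the parsed tree before calling `get_text()`. The `sample_html` string is a made-up stand-in for the downloaded page. Newer BeautifulSoup releases may already skip script and style contents in `get_text()`; removing them explicitly should be harmless either way.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; in the notebook, `html` would be used.
sample_html = """<html><head><title>Demo</title>
<script>document.documentElement.className = "client-js";</script>
<style>body { margin: 0; }</style></head>
<body><p>Python is an interpreted language.</p></body></html>"""

soup = BeautifulSoup(sample_html, "html.parser")

# Drop every <script> and <style> element together with its contents.
for tag in soup(["script", "style"]):
    tag.decompose()

# Join the remaining text nodes with spaces and trim surrounding whitespace.
clean_text = soup.get_text(separator=" ", strip=True)
print(clean_text)
```

Navigation menus and other page furniture will still be present after this step, so locating the article body by hand remains useful.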

In [8]:
web_tokens = nltk.word_tokenize(web_str)

In [166]:
web_tokens[0:10]

['Python',
 '(',
 'programming',
 'language',
 ')',
 '-',
 'Wikipedia',
 'document.documentElement.className',
 '=',
 'document.documentElement.className.replace']

With a little manual work to remove the leftover traces of code, we can isolate the main body of text. From the raw text, we know that the Wikipedia article begins with "Python is an interpreted high-level programming language for general-purpose programming."

In [167]:
start = web_str.find(
    "Python is an interpreted "
    "high-level programming language for general-purpose "
    "programming."
)

In [169]:
# Character index where the opening sentence begins
start

6869

The first section of the Wikipedia entry ends with "Python and CPython are managed by the non-profit Python Software Foundation."

In [170]:
end = web_str.find(
    "Python and CPython are managed "
    "by the non-profit Python Software Foundation."
)

In [171]:
# Character index where the closing sentence begins
end

7782

In [172]:
# `end` marks where the closing sentence begins, so we also
# need that sentence's length to include it in the slice.

last_sent = len("Python and CPython are managed "
                "by the non-profit Python Software Foundation.")
last_sent

76

In [174]:
intro = web_str[start:end + last_sent]
intro

"Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales.[27] In July 2018, Van Rossum stepped down as the leader in the language community after 30 years.[28][29]\nPython features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and has a large and comprehensive standard library.[30]\nPython interpreters are available for many operating systems. CPython, the reference implementation of Python, is open source software[31] and has a community-based development model, as do nearly all of Python's other implementations. Python and CPython are managed by the non-profit Python Software Foundation."

Because `end` is the index where the closing sentence begins, adding the length of that sentence to it ensures the slice includes the closing sentence as well.
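The find-and-slice steps above can be wrapped in a small helper. This is only a sketch: `extract_between` is a hypothetical name, not part of NLTK or BeautifulSoup, and the sample string is invented. It also checks for the `-1` that `str.find` returns when a marker is missing, which would otherwise silently produce a wrong slice.

```python
def extract_between(text, start_marker, end_marker):
    """Return the part of `text` from start_marker through end_marker, inclusive."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError("start marker not found")
    # Look for the end marker only at or after the start position.
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError("end marker not found")
    # `end` is where the end marker begins, so add its length to include it.
    return text[start:end + len(end_marker)]


# Invented sample text standing in for `web_str`.
sample = "menu | First sentence. Middle part. Last sentence. | footer"
section = extract_between(sample, "First sentence.", "Last sentence.")
print(section)  # First sentence. Middle part. Last sentence.
```

Raising an error on a missing marker is a design choice for this sketch; returning an empty string would also be reasonable if pages are processed in bulk.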

In [175]:
intro_tokens = nltk.word_tokenize(intro)

In [176]:
print(intro_tokens)

['Python', 'is', 'an', 'interpreted', 'high-level', 'programming', 'language', 'for', 'general-purpose', 'programming', '.', 'Created', 'by', 'Guido', 'van', 'Rossum', 'and', 'first', 'released', 'in', '1991', ',', 'Python', 'has', 'a', 'design', 'philosophy', 'that', 'emphasizes', 'code', 'readability', ',', 'notably', 'using', 'significant', 'whitespace', '.', 'It', 'provides', 'constructs', 'that', 'enable', 'clear', 'programming', 'on', 'both', 'small', 'and', 'large', 'scales', '.', '[', '27', ']', 'In', 'July', '2018', ',', 'Van', 'Rossum', 'stepped', 'down', 'as', 'the', 'leader', 'in', 'the', 'language', 'community', 'after', '30', 'years', '.', '[', '28', ']', '[', '29', ']', 'Python', 'features', 'a', 'dynamic', 'type', 'system', 'and', 'automatic', 'memory', 'management', '.', 'It', 'supports', 'multiple', 'programming', 'paradigms', ',', 'including', 'object-oriented', ',', 'imperative', ',', 'functional', 'and', 'procedural', ',', 'and', 'has', 'a', 'large', 'and', 'compre
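The token list still contains Wikipedia's footnote markers, split into pieces such as `'['`, `'27'`, `']'`. If those are unwanted, one option is to strip them from the text before tokenizing. The sketch below uses a regular expression on an excerpt of the intro; the pattern assumes markers are purely numeric, like `[27]`, and would miss other forms.

```python
import re

# Excerpt of the intro text shown above, including footnote markers.
excerpt = (
    "It provides constructs that enable clear programming on both small "
    "and large scales.[27] In July 2018, Van Rossum stepped down as the "
    "leader in the language community after 30 years.[28][29]"
)

# Remove bracketed numeric markers such as [27]; assumes no other bracket use matters.
no_citations = re.sub(r"\[\d+\]", "", excerpt)
print(no_citations)
```

The cleaned string can then be passed to `nltk.word_tokenize` in place of `intro`.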