# Week 1 - Retrieving and Preparing Text for Machines

This week, we begin by "begging, borrowing and stealing" text from several
contexts of human communication (e.g., PDFs, HTML, Word) and preparing it for
machines to "read" and analyze. This notebook outlines scraping text from the
web, PDF and Word documents. Then we detail "spidering", or walking
through hyperlinks to build samples of online content, and using APIs
(Application Programming Interfaces) provided by web services to access their
content. Along the way, we will use regular expressions, outlined in the
reading, to remove unwanted formatting and ornamentation. Finally, we discuss
various text encodings, filtering and data structures in which text can be
placed for analysis.

For this notebook we will be using the following packages:

In [1]:
#Special module written for this class
#This provides access to data and to helper functions from previous weeks
import lucem_illud #pip install git+https://github.com/Computational-Content-Analysis-2018/lucem_illud.git

#All these packages need to be installed from pip
import requests #for http requests
import bs4 #called `beautifulsoup4`, an html parser
import pandas #gives us DataFrames
import docx #reading MS doc files, install as `python-docx`

#Stuff for pdfs
#Install as `pdfminer2`
import pdfminer.pdfinterp
import pdfminer.converter
import pdfminer.layout
import pdfminer.pdfpage

#These come with Python
import re #for regexs
import urllib.parse #For joining urls
import io #for making http requests look like files
import json #For Tumblr API responses
import os.path #For checking if files exist
import os #For making directories

We will also be working with the following files and URLs:

In [56]:
wikipedia_base_url = 'https://en.wikipedia.org'
wikipedia_content_analysis = 'https://en.wikipedia.org/wiki/Content_analysis'
content_analysis_save = 'wikipedia_content_analysis.html'
example_text_file = 'sometextfile.txt'
information_extraction_pdf = 'https://github.com/Computational-Content-Analysis-2018/Data-Files/raw/master/1-intro/Content%20Analysis%2018.pdf'
example_docx = 'https://github.com/Computational-Content-Analysis-2018/Data-Files/raw/master/1-intro/macs6000_connecting_to_midway.docx'
example_docx_save = 'example.docx'

# Scraping

Before we can start analyzing content we need to obtain it. Sometimes it will be
provided to us from a pre-curated text archive, but sometimes we will need to
download it. As a starting example we will attempt to download the Wikipedia
page on content analysis. The page is located at
[https://en.wikipedia.org/wiki/Content_analysis](https://en.wikipedia.org/wiki/Content_analysis),
so let's start with that.

We can do this by making an HTTP GET request to that URL. A GET request is
simply a request to the server to provide the contents located at some URL. The
other request we will be using in this class is called a POST request, which
asks the server to accept some content we provide. While the Python standard
library does have the ability to make GET requests, we will be using the
[_requests_](http://docs.python-requests.org/en/master/) package as it is _'the
only Non-GMO HTTP library for Python'_...also it provides a nicer interface.

In [226]:
#wikipedia_content_analysis = 'https://en.wikipedia.org/wiki/Content_analysis'
requests.get(wikipedia_content_analysis)

<Response [200]>

`'Response [200]'` means the server responded with what we asked for. If you get
another number (e.g. 404) it likely means there was some kind of error. These
numbers are called HTTP status codes, and a list of them can be found
[here](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). The response
object contains all the data the server sent, including the website's contents
and the HTTP headers. We are interested in the contents, which we can access with
the `.text` attribute.
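
To make the status codes more concrete, here is a minimal sketch that uses only the standard library's `http.HTTPStatus` to translate a numeric code into its standard phrase. The helper name `describe_status` is made up for this illustration; with a real `requests` response you would pass it the response's `.status_code`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the code, its standard phrase and whether it signals success.

    `describe_status` is a hypothetical helper for this notebook,
    not part of `requests`.
    """
    status = HTTPStatus(code)  # raises ValueError for codes the stdlib does not know
    outcome = 'success' if 200 <= code < 300 else 'not a success'
    return '{} {}: {}'.format(code, status.phrase, outcome)

for code in (200, 404, 500):
    print(describe_status(code))
```

Only codes in the 200 range are treated as success here; a real script may want to handle redirects (300 range) separately.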

In [227]:
wikiContentRequest = requests.get(wikipedia_content_analysis)
print(wikiContentRequest.text[:1000])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Content analysis - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Content_analysis","wgTitle":"Content analysis","wgCurRevisionId":819393184,"wgRevisionId":819393184,"wgArticleId":473317,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles needing expert attention with no reason or talk parameter","Articles needing expert attention from April 2008","All articles needing expert attention","Sociology articles needing expert attention","Media articles needing expert attention","Articles that may be too long from January 2018","All articles with un

This is not what we were looking for: it is the start of the raw HTML that
makes up the page, which is meant to be read by computers. Luckily
we have a computer to parse it for us. To do the parsing we will use [_Beautiful
Soup_](https://www.crummy.com/software/BeautifulSoup/), which offers a more
convenient interface than the standard library's parser (below it actually uses
the built-in `html.parser` under the hood).

In [228]:
wikiContentSoup = bs4.BeautifulSoup(wikiContentRequest.text, 'html.parser')
print(wikiContentSoup.text[:200])






Content analysis - Wikipedia
document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
(window.RLQ=window.RLQ||[]).push(functio


This is better but there's still random whitespace and we have more than just
the text of the article. This is because what we requested is the whole webpage,
not just the text for the article.

We want to extract only the text we care about, and in order to do this we will
need to inspect the HTML. One way to do this is simply to go to the website with
a browser and use its inspection or view source tool. If JavaScript or other
dynamic loading occurs on the page, however, it is likely that what Python
receives is not what you will see, so we will need to inspect what Python
receives. To do this we can save the HTML that `requests` obtained.

In [229]:
#content_analysis_save = 'wikipedia_content_analysis.html'

with open(content_analysis_save, mode='w', encoding='utf-8') as f:
    f.write(wikiContentRequest.text)

Now let's open the file (`wikipedia_content_analysis.html`) we just created with
a web browser. It should look sort of like the original but without the images
and formatting.

As there is very little standardization in how webpages are structured, figuring out
how best to extract what you want is an art. Looking at this page, it appears that
all the main textual content is inside `<p>` (paragraph) tags within the `<body>`
tag.

In [230]:
contentPTags = wikiContentSoup.body.findAll('p')
for pTag in contentPTags[:3]:
    print(pTag.text)
    print()




Content analysis is a research method for studying documents and communication artifacts, which can be texts of various formats, pictures, audio or video. Social scientists use content analysis to quantify patterns in communication, in a replicable and systematic manner.[1] One of the key advantage of this research method is to analyse social phenomena in a non-invasive way, in contrast to simulating social experiences or collecting survey answers.

Practices and philosophies of content analysis vary between scholarly communities. They all involve systematic reading or observation of texts or artifacts which are assigned labels (sometimes called codes) to indicate the presence of interesting, meaningful patterns.[2][3] After labeling a large set of media, a researcher is able to statistically estimate the proportions of patterns in the texts, as well as correlations between patterns.



We now have all the text from the page, split up by paragraph. If we wanted to
get the section headers or references as well it would require a bit more work,
but is doable.
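
As a rough sketch of what "a bit more work" could look like, the block below collects section headers together with paragraphs, in document order. It deliberately swaps in the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra packages. The class name `HeadingParagraphParser` and the `sample_html` string are invented for illustration, and the choice of `<h2>` as the header tag is an assumption about the page layout.

```python
from html.parser import HTMLParser

class HeadingParagraphParser(HTMLParser):
    """Collect (tag, text) pairs for <h2> and <p> elements, in document order."""
    WANTED = {'h2', 'p'}

    def __init__(self):
        super().__init__()
        self.chunks = []       # finished (tag, text) pairs
        self._current = None   # tag we are currently inside, if any
        self._buffer = []      # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> is kept as well
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.chunks.append((tag, ''.join(self._buffer).strip()))
            self._current = None

# Hypothetical stand-in for a downloaded page
sample_html = ('<body><h2>History</h2>'
               '<p>Content analysis is <a href="/wiki/Old">old</a>.</p>'
               '<p>Second paragraph.</p></body>')

parser = HeadingParagraphParser()
parser.feed(sample_html)
parser.close()
for tag, text in parser.chunks:
    print(tag, '|', text)
```

Keeping the tag name next to each chunk makes it easy to later assign every paragraph to the section header that precedes it.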

There is one more thing we might want to do before sending this text to be
processed: remove the reference indicators (`[2]`, `[3]`, etc.). To do this we
can use a short regular expression (regex).

In [231]:
contentParagraphs = []
for pTag in contentPTags:
    #strings starting with r are raw, so their \'s are not treated as escape characters
    #If we didn't start with r the string would have to be: '\\[\\d+\\]'
    contentParagraphs.append(re.sub(r'\[\d+\]', '', pTag.text))
print(type(contentParagraphs))
print("-----divider-----")

#convert to a DataFrame
contentParagraphsDF = pandas.DataFrame({'paragraph-text' : contentParagraphs})
print(contentParagraphsDF)

<class 'list'>
-----divider-----
                                       paragraph-text
0                                                    
1   \nContent analysis is a research method for st...
2   Practices and philosophies of content analysis...
3   Computers are increasingly used in content ana...
4                                                    
5                                                    
6   Content analysis is best understood as a broad...
7   The simplest and most objective form of conten...
8   A further step in analysis is the distinction ...
9   More generally, content analysis is research u...
10  By having contents of communication available ...
11  Robert Weber notes: "To make valid inferences ...
12  There are five types of texts in content analy...
13  Over the years, content analysis has been appl...
14  In recent times, particularly with the advent ...
15  Quantitative content analysis has enjoyed a re...
16  Recently, Arash Heydarian Pashakhanlou has arg.

Now we have a `DataFrame` containing all relevant text from the page, ready to be
processed.

If you are not familiar with regex, it is a way of specifying searches in text.
A regex engine takes in the search pattern, in the above case `'\[\d+\]'`, and
some string, here the paragraph texts. Then it scans the input string
looking for places where the pattern matches. Here the regex `'\d'` matches
digit characters, while `'\['` and `'\]'` match the literal square brackets on
either side.
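
The pattern described above can be tried on a single sentence. This is a small sketch with a made-up `sample` string; it also checks that the raw string and the doubly escaped string are the same pattern.

```python
import re

raw_pattern = r'\[\d+\]'          # raw string: backslashes are passed to the regex engine as-is
escaped_pattern = '\\[\\d+\\]'    # the same pattern without the r prefix

# Hypothetical sentence with reference indicators
sample = 'Social scientists use content analysis.[1] Labels are sometimes called codes.[2][13]'

cleaned = re.sub(raw_pattern, '', sample)

print(raw_pattern == escaped_pattern)
print(cleaned)
```

Because of the `+`, multi-digit indicators such as `[13]` are removed in one piece.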

In [9]:
findNumber = r'\d'
regexResults = re.search(findNumber, 'not a number, not a number, numbers 2134567890, not a number')
regexResults

<_sre.SRE_Match object; span=(36, 37), match='2'>

In Python the regex package (`re`) usually returns `Match` objects (a single
`Match` can hold several captured groups). To get the string that matched our
pattern we can use the `.group()` method; group 0 is always the entire match, so
we will ask for the 0th group.

In [10]:
print(regexResults.group(0))

2


That gives us the first digit. If we want the whole block of digits we can
add the quantifier `'+'`, which requests 1 or more instances of the preceding
element.

In [11]:
findNumbers = r'\d+'
regexResults = re.search(findNumbers, 'not a number, not a number, numbers 2134567890, not a number')
print(regexResults.group(0))

2134567890


Now we have the whole block of numbers. There are a huge number of special
characters in regex; for the full description of Python's implementation look at
the [re docs](https://docs.python.org/3/library/re.html). There is also a short
[tutorial](https://docs.python.org/3/howto/regex.html#regex-howto).
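
Two features that come up often are capture groups and `re.findall`. The sketch below uses an invented `text` string to show how `.group()` indexes the groups of one match and how `findall` returns every match at once.

```python
import re

# Hypothetical input string
text = 'See pages 12-34 and 56-78, first printed in 1980.'

# Parentheses create capture groups inside a single Match
page_range = re.search(r'(\d+)-(\d+)', text)
print(page_range.group(0))   # the entire match
print(page_range.group(1))   # first group
print(page_range.group(2))   # second group
print(page_range.span())     # start and end index of the entire match

# findall returns all non-overlapping matches instead of only the first
all_numbers = re.findall(r'\d+', text)
# with groups in the pattern, findall returns one tuple per match
all_ranges = re.findall(r'(\d+)-(\d+)', text)
print(all_numbers)
print(all_ranges)
```

Note that `re.search` returns `None` when nothing matches, so real code should check the result before calling `.group()`.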

# <span style="color:red">Section 1</span>
<span style="color:red">Construct cells immediately below this that describe and download web content relating to your anticipated final project. Use Beautiful Soup and at least five regular expressions to extract relevant, nontrivial *chunks* of that content (e.g., cleaned sentences, paragraphs, etc.) to a pandas `DataFrame`.</span>

In [156]:
from bs4 import BeautifulSoup


def soup_a_book(a_book_url):
    #Request a book's web page and parse it with Beautiful Soup
    a_book_request= requests.get(a_book_url)
    a_book_soup= BeautifulSoup(a_book_request.text, 'lxml')
    return a_book_soup


"""
#Why do I need this step?
a_bool_file= 'a_book_file.html'
with open(a_bool_file, mode='w', encoding='utf-8') as fpw:
    fpw.write(a_book_request.text)
"""
a_book_soup= soup_a_book("https://book.qidian.com/info/1003354631")

In [157]:
def get_intro(a_book_soup):
    #Following the previous function, get the book's introduction and replace the full-width spaces ("\u3000")
    for_intro= a_book_soup.body.find("div", attrs= {"class": "book-intro"})
    intro= re.sub(r"\u3000", " ", for_intro.text.strip()) #(Regular Expression 1)
    return intro

intro= get_intro(a_book_soup)
intro

'一念成沧海，一念化桑田。一念斩千魔，一念诛万仙。  唯我念……永恒  这是耳根继《仙逆》《求魔》《我欲封天》后，创作的第四部长篇小说《一念永恒》'

In [158]:
def get_book_demo(a_book_soup):
    ###Book demographics###
    for_info= a_book_soup.body.find("div", attrs= {"class": "book-info"})
    #print(for_info.text)

    #name, author
    name= for_info.text.split()[0]
    author= for_info.text.split()[1]
    
    linked_text= re.findall(r"\d+.\d+.*总推荐", for_info.text)[0] #(Regular Expression 2)
    #The above re.findall() returns a list, I take its element as string.
    splited_text= linked_text.split("|")
    #print(splited_text)

    #word count
    word_count= splited_text[0] 

    #click count
    click_count= re.findall(r"\d+.\d+.*总点击", splited_text[1])[0] #(Regular Expression 3)  
    
    #recommendation count
    recom_count= splited_text[2]

    #tags
    tags= a_book_soup.body.find_all("a", attrs= {"class": "red"})
    tags= [i.text for i in tags]
    tags_string= "" 
    for t in tags:
        tags_string+= t+ ";"

    return name, author, word_count, click_count, recom_count, tags_string

name, author, word_count, click_count, recom_count, tags_string= get_book_demo(a_book_soup)
print(name, author, word_count, click_count, recom_count, tags_string)

一念永恒 耳根 354.98万字 1503.35万总点击 1616万总推荐 仙侠;幻想修仙;


In [159]:
def get_IntroMentioned_books(intro):
    #Uses the output of the above function get_intro()
    #Looks for other books mentioned in the book introduction
    #Note: `name` below is the global variable set by the get_book_demo() call above
    all_mentioned= re.findall(r"《(.*?)》", intro) #(Regular Expression 4)
    other_books= [i for i in all_mentioned if i!= name]
    other_books_string= ""
    for o in other_books:
        other_books_string+= o+ ";"
    return other_books_string
        
other_books_string= get_IntroMentioned_books(intro)
other_books_string

'仙逆;求魔;我欲封天;'

In [160]:
def get_text(a_book_soup):
    #Input is the soup of the introduction page of the book 
    #Obtain 10 chapters as example writing, 10 chapters are stored in one dictionary
    
    #Use the introduction page to find the "read for free" link (the first chapter)
    def get_content(for_read_url):
        read_url= "https:"+ for_read_url

        ch= requests.get(read_url)
        ch_soup= BeautifulSoup(ch.text, 'html.parser')

        chP= ch_soup.body.find("div", attrs= {"class": "read-content"}).find("p")
        for_chP= str(chP).replace("\u3000", "")
        #print(for_chP)
        ch_content= re.sub(r"(<\/*p>)(\1*)", "", for_chP) #(Regular Expression 5)
        ch_content= ch_content.strip()
        return ch_content, ch_soup
    
    content_dic= {}
    i= 1 
    while i< 11:
        if i== 1:
            for_read_url= a_book_soup.body.find("a", text= "免费试读")["href"] 
            ch_content, ch_soup = get_content(for_read_url)
            content_dic[i]= ch_content
            i+= 1
        else:
            for_read_url= ch_soup.body.find("a", text= "下一章")["href"]
            ch_content, ch_soup = get_content(for_read_url)
            content_dic[i]= ch_content
            i+= 1
    
    return content_dic
content_dic= get_text(a_book_soup)
#print(content_dic)

In [161]:
#Testing: Store the data as one entry (one row)
a_book_dic= {'name' : name, "author": author, "intro": intro, 
             "word_count": word_count, "click_count": click_count, "recom_count": recom_count,
             "tags": tags_string, "other_books": other_books_string, "example_text": content_dic}
helper= [a_book_dic]

bookDF= pandas.DataFrame(data= helper)
print(bookDF)

  author  click_count                                       example_text  \
0     耳根  1503.35万总点击  {1: '帽儿山，位于东林山脉中，山下有一个村子，民风淳朴，以耕田为生，与世隔绝。清晨，村庄...   

                                               intro  name  other_books  \
0  一念成沧海，一念化桑田。一念斩千魔，一念诛万仙。  唯我念……永恒  这是耳根继《仙逆》《求...  一念永恒  仙逆;求魔;我欲封天;   

  recom_count      tags word_count  
0    1616万总推荐  仙侠;幻想修仙;   354.98万字  


In [162]:
#Combining all functions above into one big function
#Output is a Pandas Data Frame
def deal_with_a_book(a_book_url):
    a_book_soup= soup_a_book(a_book_url)
    intro= get_intro(a_book_soup)
    name, author, word_count, click_count, recom_count, tags_string= get_book_demo(a_book_soup)
    other_books_string= get_IntroMentioned_books(intro)
    content_dic= get_text(a_book_soup)
    
    a_book_dic= {'name' : name, "author": author, "intro": intro, 
                 "word_count": word_count, "click_count": click_count, 
                 "recom_count": recom_count,
                 "tags": tags_string, "intro_mentioned_books": other_books_string, 
                 "example_text": content_dic}
    helper= [a_book_dic]
    bookDF= pandas.DataFrame(data= helper)
    return bookDF, a_book_soup

startDF, start_soup= deal_with_a_book("https://book.qidian.com/info/1003354631")
startDF

| | author | click_count | example_text | intro | intro_mentioned_books | name | recom_count | tags | word_count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 耳根 | 1503.35万总点击 | {1: '帽儿山，位于东林山脉中，山下有一个村子，民风淳朴，以耕田为生，与世隔绝。清晨，村庄... | 一念成沧海，一念化桑田。一念斩千魔，一念诛万仙。 唯我念……永恒 这是耳根继《仙逆》《求... | 仙逆;求魔;我欲封天; | 一念永恒 | 1616万总推荐 | 仙侠;幻想修仙; | 354.98万字 |



# Spidering

What if we want to get a bunch of different pages from Wikipedia? We would
need to get the URL for each of the pages we want. Typically, we want pages that
are linked to by other pages, so we will need to parse pages and identify the
links. Right now we will be retrieving all links in the body of the content
analysis page.

To do this we will need to find all the `<a>` (anchor) tags with `href`s
(hyperlink references) inside of `<p>` tags. `href` can have many
[different](http://stackoverflow.com/questions/4855168/what-is-href-and-why-is-it-used)
[forms](https://en.wikipedia.org/wiki/Hyperlink#Hyperlinks_in_HTML), so
dealing with them can be tricky, but generally, you will want to extract
absolute or relative links. An absolute link is one you can follow without
modification, while a relative link must be joined to a base URL before it can
be followed. Wikipedia uses relative URLs for its internal links; below is an
example of dealing with them.

In [20]:
#wikipedia_base_url = 'https://en.wikipedia.org'

otherPAgeURLS = []
#We also want to know where the links come from so we also will get:
#the paragraph number
#the word the link is in
for paragraphNum, pTag in enumerate(contentPTags):
    #we only want hrefs that link to wiki pages
    tagLinks = pTag.findAll('a', href=re.compile('/wiki/'), class_=False)
    for aTag in tagLinks:
        #We need to extract the url from the <a> tag
        relurl = aTag.get('href')
        linkText = aTag.text
        #wikipedia_base_url is the base; urllib's joining function merges it with the relative url
        #Giving a nice structured tuple like this means we can use tuple expansion later
        otherPAgeURLS.append((
            urllib.parse.urljoin(wikipedia_base_url, relurl),
            paragraphNum,
            linkText,
        ))
print(otherPAgeURLS[:10])

[('https://en.wikipedia.org/wiki/Document', 0, 'documents'), ('https://en.wikipedia.org/wiki/Text_(literary_theory)', 1, 'texts'), ('https://en.wikipedia.org/wiki/Semantics', 1, 'meaningful'), ('https://en.wikipedia.org/wiki/Machine_learning', 2, 'Machine learning'), ('https://en.wikipedia.org/wiki/Klaus_Krippendorff', 5, 'Klaus Krippendorff'), ('https://en.wikipedia.org/wiki/Radio', 6, 'radio'), ('https://en.wikipedia.org/wiki/Television', 6, 'television'), ('https://en.wikipedia.org/wiki/Key_Word_in_Context', 6, 'Keyword In Context'), ('https://en.wikipedia.org/wiki/Synonym', 6, 'synonyms'), ('https://en.wikipedia.org/wiki/Homonym', 6, 'homonyms')]
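
To see how `urllib.parse.urljoin` treats the different forms an `href` can take, here is a short sketch. The list of `hrefs` is made up for illustration (the `example.org` address in particular is a placeholder), and `is_absolute` is a hypothetical helper, not a library function.

```python
from urllib.parse import urljoin, urlparse

page_url = 'https://en.wikipedia.org/wiki/Content_analysis'

# Hypothetical href values of the kinds found in <a> tags
hrefs = [
    '/wiki/Semantics',                     # relative to the site root
    'https://example.org/about',           # already absolute
    '//book.qidian.com/info/1003354631',   # protocol-relative: reuses the page's scheme
    '#History',                            # fragment within the same page
]

def is_absolute(href):
    """Treat an href as absolute if it carries its own scheme (http, https, ...)."""
    return bool(urlparse(href).scheme)

resolved = [urljoin(page_url, href) for href in hrefs]
for href, full in zip(hrefs, resolved):
    print(is_absolute(href), href, '->', full)
```

Joining against the URL of the page the link was found on, rather than a fixed site root, also handles fragments and links to other domains.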


In [27]:
[wikipedia_content_analysis] * len(contentParagraphsDF['paragraph-text'])

['https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki/Content_analysis',
 'https://en.wikipedia.org/wiki

We will be adding these new texts to our DataFrame `contentParagraphsDF` so we
will need to add 2 more columns to keep track of paragraph numbers and sources.

In [36]:
#Repeat the address "https://en.wikipedia.org/wiki/Content_analysis"
#once for each paragraph.
contentParagraphsDF['source'] = [wikipedia_content_analysis] * len(contentParagraphsDF['paragraph-text'])
contentParagraphsDF['paragraph-number'] = range(len(contentParagraphsDF['paragraph-text']))

#contentParagraphsDF

Then we can add two more columns to our `DataFrame` and define a function to
parse each linked page and add its text to our DataFrame.

In [31]:
contentParagraphsDF['source-paragraph-number'] = [None] * len(contentParagraphsDF['paragraph-text'])
contentParagraphsDF['source-paragraph-text'] = [None] * len(contentParagraphsDF['paragraph-text'])

def getTextFromWikiPage(targetURL, sourceParNum, sourceText):
    #Make a dict to store data before adding it to the DataFrame
    parsDict = {'source' : [], 'paragraph-number' : [], 'paragraph-text' : [], 'source-paragraph-number' : [],  'source-paragraph-text' : []}
    #Now we get the page
    r = requests.get(targetURL)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    #enumerating gives us the paragraph number
    for parNum, pTag in enumerate(soup.body.findAll('p')):
        #same regex as before
        parsDict['paragraph-text'].append(re.sub(r'\[\d+\]', '', pTag.text))
        parsDict['paragraph-number'].append(parNum)
        parsDict['source'].append(targetURL)
        parsDict['source-paragraph-number'].append(sourceParNum)
        parsDict['source-paragraph-text'].append(sourceText)
    return pandas.DataFrame(parsDict)

And run it on our list of link tuples:

In [32]:
for urlTuple in otherPAgeURLS[:3]:
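    #DataFrame.append was removed in pandas 2.0, so we use pandas.concat instead
    #ignore_index=True renumbers the rows 0..n-1 instead of keeping each frame's own index
    contentParagraphsDF = pandas.concat([contentParagraphsDF, getTextFromWikiPage(*urlTuple)], ignore_index=True)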
contentParagraphsDF

| | paragraph-number | paragraph-text | source | source-paragraph-number | source-paragraph-text |
|---|---|---|---|---|---|
| 0 | 0 | \nContent analysis is a research method for st... | https://en.wikipedia.org/wiki/Content_analysis | | |
| 1 | 1 | Practices and philosophies of content analysis... | https://en.wikipedia.org/wiki/Content_analysis | | |
| 2 | 2 | Computers are increasingly used in content ana... | https://en.wikipedia.org/wiki/Content_analysis | | |
| 3 | 3 | | https://en.wikipedia.org/wiki/Content_analysis | | |
| 4 | 4 | | https://en.wikipedia.org/wiki/Content_analysis | | |
| 5 | 5 | Content analysis is best understood as a broad... | https://en.wikipedia.org/wiki/Content_analysis | | |
| 6 | 6 | The simplest and most objective form of conten... | https://en.wikipedia.org/wiki/Content_analysis | | |
| 7 | 7 | A further step in analysis is the distinction ... | https://en.wikipedia.org/wiki/Content_analysis | | |
| 8 | 8 | More generally, content analysis is research u... | https://en.wikipedia.org/wiki/Content_analysis | | |
| 9 | 9 | By having contents of communication available ... | https://en.wikipedia.org/wiki/Content_analysis | | |



# <span style="color:red">Section 2</span>
<span style="color:red">Construct cells immediately below this that spider web content from another site with content relating to your anticipated final project. Specifically, identify URLs on a core page, then follow and extract content from them into a pandas `DataFrame`. In addition, demonstrate a *recursive* spider, which follows more than one level of links (i.e., follows links from a site, then follows links on followed sites to new sites, etc.), making sure to define a reasonable endpoint so that you do not wander the web forever :-).</span>



In [165]:
a_book_url= "https://book.qidian.com/info/1003354631"
a_book_soup= soup_a_book(a_book_url)

def find_books_from_one(a_book_soup, a_book_url):
    #Takes a soup produced by the soup_a_book() function defined above
    #and extracts all book links on the page
    all_appear_books= a_book_soup.find_all("a", href= re.compile("/book.qidian.com/info/"))
    all_appear_books_list= ["https:"+ i["href"] for i in all_appear_books]
    all_appear_books_set= set(all_appear_books_list)
    following_books_list= [i for i in all_appear_books_set if i!= a_book_url]
    #following_books_id= [re.findall(r"(\d+)",i["href"])[0] for i in following_books]
    return following_books_list

#Testing:
following_books= find_books_from_one(a_book_soup, a_book_url)

In [167]:
#This is a recursive code. But it will always finish a layer of scraping before breaking,
#which, in this case, means that it ends after 22 books.
#For example, if I set the desired number to 30, and after a layer it acquires only 29,
#it will finish another layer of scraping, which will end up with way more than 30.
#The number of books can be modified by changing variable "I_want_n_books"

I_want_n_books= 20
all_books= ["https://book.qidian.com/info/1003354631"]

while True:
    if len(all_books)== 1:
        the_DF, a_book_soup0= deal_with_a_book(all_books[0])
        following_books0= find_books_from_one(a_book_soup0, all_books[0])
        len_all_books_B= len(all_books) 
        for i in following_books0:
            if i not in all_books:
                all_books+= [i]
        len_all_books_A= len(all_books)
        len_inc= len_all_books_A- len_all_books_B
            
    else:
        len_all_books_B= len(all_books)
        for b in all_books[-len_inc:]:
            print(b)
            a_DF, a_book_soup= deal_with_a_book(b)
            the_DF= pandas.concat([the_DF, a_DF]) #DataFrame.append was removed in pandas 2.0
            print(the_DF.shape)
            following_books= find_books_from_one(a_book_soup, b)
            for i in following_books:
                if i not in all_books:
                    all_books+= [i]
            len_all_books_A= len(all_books)
            len_inc= len_all_books_A- len_all_books_B
            
    if the_DF.shape[0]> I_want_n_books:
        break
  
            
#the_DF
    

https://book.qidian.com/info/1004595892
(2, 9)
https://book.qidian.com/info/1010805831
(3, 9)
https://book.qidian.com/info/1005373310
(4, 9)
https://book.qidian.com/info/1010159082
(5, 9)
https://book.qidian.com/info/1010916969
(6, 9)
https://book.qidian.com/info/1005317872
(7, 9)
https://book.qidian.com/info/1010969082
(8, 9)
https://book.qidian.com/info/2066661
(9, 9)
https://book.qidian.com/info/1010192932
(10, 9)
https://book.qidian.com/info/1009631704
(11, 9)
https://book.qidian.com/info/1010923301
(12, 9)
https://book.qidian.com/info/1010868771
(13, 9)
https://book.qidian.com/info/1010255643
(14, 9)
https://book.qidian.com/info/1010468795
(15, 9)
https://book.qidian.com/info/1264634
(16, 9)
https://book.qidian.com/info/1010926157
(17, 9)
https://book.qidian.com/info/3106580
(18, 9)
https://book.qidian.com/info/2070910
(19, 9)
https://book.qidian.com/info/1010965651
(20, 9)
https://book.qidian.com/info/1010928889
(21, 9)
https://book.qidian.com/info/1243117
(22, 9)


In [168]:
the_DF

| | author | click_count | example_text | intro | intro_mentioned_books | name | recom_count | tags | word_count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 耳根 | 1503.36万总点击 | {1: '帽儿山，位于东林山脉中，山下有一个村子，民风淳朴，以耕田为生，与世隔绝。清晨，村庄... | 一念成沧海，一念化桑田。一念斩千魔，一念诛万仙。 唯我念……永恒 这是耳根继《仙逆》《求... | 仙逆;求魔;我欲封天; | 一念永恒 | 1616万总推荐 | 仙侠;幻想修仙; | 354.98万字 |
| 0 | 耳根 | 136.73万总点击 | {1: '这是一座毗邻群山边缘的孤镇。秋水。冷河。远处崔嵬的群山，就像一座座万古魔神的雕像，... | 这是一部短篇小说，讲述的是一代滇国女王的成长史， 我若为王，定要四海臣服！ | | 滇娇传之天悦东方 | 183.08万总推荐 | 仙侠;古典仙侠; | 24.5万字 |
| 0 | 相雨山 | 2.12万总点击 | {1: '梅雨季节刚过，天空放晴，六月的香江流域，本应是稻秧下种的日子，但这诡异的天气却下起... | 修行有三门，天门，元门，仙门。 天门破，超凡脱俗，享受三百载，为人中仙。 元门破，遗世独... | | 大仙饶命 | 1435总推荐 | 仙侠;古典仙侠; | 17.17万字 |
| 0 | 发烧的胖头鱼 | 25.04万总点击 | {1: '“大王叫我来巡山哟——”光秃秃的小山头，忽的传来一阵抑扬顿挫的歌声，远处惊起一片飞... | 欢哥是一只老实本分的巡山小妖，却被天外陨石砸开了窍，脑子里整天冒出一堆奇怪的词语。 天猫是... | 小妖不上天; | 小妖不上天 | 4.9万总推荐 | 仙侠;幻想修仙; | 119.29万字 |
| 0 | 螃蟹慢爬 | 28.98万总点击 | {1: '“听说了吗？昨天又有人死了，是李家的媳妇......夜里她突然发狂，口中念叨着亡夫... | 地狱空荡荡，恶鬼满人间。 —————————————— 在这妖魔横行，众生苦楚的黑暗世道... | | 极道妖鬼 | 2.78万总推荐 | 仙侠;修真文明; | 104.15万字 |
| 0 | 雾都巨人 | 9041总点击 | {1: '【SCP，安全，收容，保护】【日期：2017·10·30】【文件安全许可等级：一级... | 妖者，人之所忌，其气焰以取之，妖由人兴也。人无衅焉，妖不自作。人弃常则妖兴，故有妖。龙者，鳞... | | 妖龙之灾 | 661总推荐 | 仙侠;幻想修仙; | 9.89万字 |
| 0 | 神秘男人 | 127.35万总点击 | {1: '镇南王府位于陈国与南蛮交界的泰安郡，郡城之外千门万户，极尽奢华之能。作为陈国开国皇... | 作者平平淡淡讲故事，诸位开开心心看小说。 | | 能穿越的修行者 | 12.56万总推荐 | 仙侠;幻想修仙; | 151.54万字 |
| 0 | 大郎该吃药了 | 5636总点击 | {1: '六七月份大热天，烈阳如一团火球高高的挂在天上。实在太热了，山里有股闷热，燥的人心慌... | 千般法术，万般大道，我只问一句：可为至尊么？ | | 无极香火道 | 299总推荐 | 仙侠;古典仙侠; | 19.93万字 |
| 0 | 玉爪俊 | 409.31万总点击 | {1: '新书，需要大家的关注和支持，请各位兄弟姐妹多多帮衬，收藏，推荐，点击!拜谢了！——... | 钟元新生了，在一个可以修行成仙的世界里。而且，他很轻松就拜入了仙家门庭，可是 ，他还没来得... | | 蜀山旁门之祖 | 11.19万总推荐 | 仙侠;修真文明; | 500.46万字 |
| 0 | 蓝白阁 | 54.23万总点击 | {1: '陈国，边陲，木岩村，是秋。萧瑟秋风中，陈默朝着村口重重的跪了下去，三拜九叩一番大礼... | 数万年前，一场大灾劫，彻底改变打乱了所有法则。在这个修行只能依靠各种灵植的世界之中，从贫困战... | | 大界果 | 9.24万总推荐 | 仙侠;幻想修仙; | 19.8万字 |
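
The idea of a spider with "a reasonable endpoint" can also be sketched independently of any particular website. The block below is one possible design, not the method used in the cells above: a breadth-first crawl that stops at a maximum link depth or a maximum number of pages, whichever comes first. The `toy_links` dictionary is a hypothetical stand-in for real pages, and `crawl`, `max_depth` and `max_pages` are names invented for this example; in a real spider the `get_links` argument would download a page and return the URLs found on it.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> pages it links to
toy_links = {
    'A': ['B', 'C'],
    'B': ['D', 'A'],
    'C': ['D', 'E'],
    'D': ['F'],
    'E': [],
    'F': ['G'],
}

def crawl(start, get_links, max_depth=2, max_pages=20):
    """Breadth-first spider that stops after max_depth levels or max_pages pages."""
    visited = [start]            # pages in the order they were discovered
    seen = {start}               # fast membership test to avoid revisiting
    queue = deque([(start, 0)])  # (page, depth at which it was found)
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue             # do not follow links found on the deepest layer
        for link in get_links(page):
            if link in seen:
                continue
            if len(visited) >= max_pages:
                return visited   # hard cap reached: stop immediately
            seen.add(link)
            visited.append(link)
            queue.append((link, depth + 1))
    return visited

def toy_get_links(page):
    return toy_links.get(page, [])

two_levels = crawl('A', toy_get_links, max_depth=2)
capped = crawl('A', toy_get_links, max_depth=5, max_pages=3)
print(two_levels)
print(capped)
```

Unlike a loop that always finishes the current layer, the `max_pages` check here stops in the middle of a layer, so the result never exceeds the requested size.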


## API (Tumblr)

Generally website owners do not like you scraping their sites. If done badly,
scraping can act like a denial-of-service (DoS) attack, so you should be careful
how often you make calls to a site. Some sites want automated tools to access
their data, so they create [application programming interfaces
(APIs)](https://en.wikipedia.org/wiki/Application_programming_interface). An API
specifies a procedure for an application (or script) to access their data. Often
this is through a [representational state transfer
(REST)](https://en.wikipedia.org/wiki/Representational_state_transfer) web
service, which just means if you make correctly formatted HTTP requests they
will return nicely formatted data.
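
One simple way to be careful about how often you call a site is to enforce a minimum pause between requests. The following is a minimal sketch of that idea; the `Throttle` class and the interval of 0.1 seconds are assumptions made for illustration, and an appropriate delay depends on the site's own rules.

```python
import time

class Throttle:
    """Block so that consecutive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None   # time of the previous wait(), None before the first call

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()

throttle = Throttle(min_interval=0.1)
start = time.monotonic()
for url in ['page-1', 'page-2', 'page-3']:   # placeholder names, not real URLs
    throttle.wait()
    # a real script would make its HTTP request for `url` here
elapsed = time.monotonic() - start
print('three throttled calls took {:.2f} seconds'.format(elapsed))
```

The first call goes through immediately and each later call waits out the remainder of the interval, so three calls take at least two intervals.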

A nice example for us to study is [Tumblr](https://www.tumblr.com). They have a
[simple RESTful API](https://www.tumblr.com/docs/en/api/v1) that allows you to
read posts without any complicated HTML parsing.

We can get the first 20 posts from a blog by making an HTTP GET request to
`'http://{blog}.tumblr.com/api/read/json'`, where `{blog}` is the name of the
target blog. Let's try to get the posts from
[http://lolcats-lol-cat.tumblr.com/](http://lolcats-lol-cat.tumblr.com/). (Note
the blog says at the top 'One hour one pic lolcats', but the canonical name that
Tumblr uses is in the URL: 'lolcats-lol-cat'.)

In [2]:
tumblrAPItarget = 'http://{}.tumblr.com/api/read/json'

r = requests.get(tumblrAPItarget.format('lolcats-lol-cat'))

print(r.text[:1000])

var tumblr_api_read = {"tumblelog":{"title":"One hour one pic lolcats","description":"","name":"lolcats-lol-cat","timezone":"Europe\/Paris","cname":false,"feeds":[]},"posts-start":0,"posts-total":2964,"posts-type":false,"posts":[{"id":"169306612050","url":"http:\/\/lolcats-lol-cat.tumblr.com\/post\/169306612050","url-with-slug":"http:\/\/lolcats-lol-cat.tumblr.com\/post\/169306612050\/kitty-mind-blown","type":"photo","date-gmt":"2018-01-04 15:00:13 GMT","date":"Thu, 04 Jan 2018 16:00:13","bookmarklet":0,"mobile":0,"feed-item":"","from-feed-id":0,"unix-timestamp":1515078013,"format":"html","reblog-key":"TZzKhxon","slug":"kitty-mind-blown","is-submission":false,"like-button":"<div class=\"like_button\" data-post-id=\"169306612050\" data-blog-name=\"lolcats-lol-cat\" id=\"like_button_169306612050\"><iframe id=\"like_iframe_169306612050\" src=\"http:\/\/assets.tumblr.com\/assets\/html\/like_iframe.html?_v=fc298e85f978b8662a643fe0a6b8c638#name=lolcats-lol-cat&amp;post_id=169306612050&amp;co

This might not look very good on first inspection, but it has far fewer angle
brackets than HTML, which makes it easier to parse. What we have is
[JSON](https://en.wikipedia.org/wiki/JSON), a 'human readable' text-based data
transmission format based on JavaScript. Luckily, we can readily convert it to a
Python `dict`.

In [6]:
#We need to load only the stuff between the curly braces
d = json.loads(r.text[len('var tumblr_api_read = '):-2])

print(d.keys())
print(len(d['posts']))

dict_keys(['posts-start', 'posts-type', 'posts-total', 'tumblelog', 'posts'])
20
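
The slice above depends on the exact prefix `var tumblr_api_read = ` and on the response ending with a `;` and a newline. A slightly more defensive sketch, assuming only that the payload is a single JSON object wrapped in other text, keeps everything from the first `{` to the last `}`. The helper name `extractJSONObject` and the sample string are made up for illustration.

```python
import json

def extractJSONObject(text):
    """Return the dict encoded between the first '{' and the last '}' of text."""
    start = text.find('{')
    end = text.rfind('}')
    if start == -1 or end < start:
        raise ValueError("No JSON object found in the response text")
    return json.loads(text[start:end + 1])

# A made-up response in the same shape as the Tumblr output above
sample = 'var tumblr_api_read = {"posts-start": 0, "posts": [{"id": "1"}]};\n'
parsed = extractJSONObject(sample)
print(sorted(parsed.keys()))
```

With the request above this would be called as `extractJSONObject(r.text)`.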


If we read the [API specification](https://www.tumblr.com/docs/en/api/v1), we
will see there are a lot of things we can get if we add things to our GET
request. First we can retrieve posts by their id number. Let's first get post
`146020177084`.

In [None]:
r = requests.get(tumblrAPItarget.format('lolcats-lol-cat'), params = {'id' : 146020177084})
d = json.loads(r.text[len('var tumblr_api_read = '):-2])
d['posts'][0].keys()
d['posts'][0]['photo-url-1280']

with open('lolcat.gif', 'wb') as f:
    gifRequest = requests.get(d['posts'][0]['photo-url-1280'], stream = True)
    f.write(gifRequest.content)

<img src='lolcat.gif'>

Such beauty; such vigor (If you can't see it you have to refresh the page). Now
we could retrieve the text from all posts as well
as related metadata, like the post date, caption or tags. We could also get
links to all the images.

In [22]:
#Putting a max in case the blog has millions of images
#The given max will be rounded down to the nearest multiple of 50
def tumblrImageScrape(blogName, maxImages = 200):
    #Restating this here so the function isn't dependent on any external variables
    tumblrAPItarget = 'http://{}.tumblr.com/api/read/json'

    #There are a bunch of possible locations for the photo url
    possiblePhotoSuffixes = [1280, 500, 400, 250, 100]

    #These are the pieces of information we will be gathering,
    #at the end we will convert this to a DataFrame.
    #There are a few other datums we could gather like the captions
    #you can read the Tumblr documentation to learn how to get them
    #https://www.tumblr.com/docs/en/api/v1
    postsData = {
        'id' : [],
        'photo-url' : [],
        'date' : [],
        'tags' : [],
        'photo-type' : []
    }

    #Tumblr limits us to a max of 50 posts per request
    for requestNum in range(maxImages // 50):
        requestParams = {
            'start' : requestNum * 50,
            'num' : 50,
            'type' : 'photo'
        }
        r = requests.get(tumblrAPItarget.format(blogName), params = requestParams)
        requestDict = json.loads(r.text[len('var tumblr_api_read = '):-2])
        for postDict in requestDict['posts']:
            #We are dealing with uncleaned data, we can't trust it.
            #Specifically, not all posts are guaranteed to have the fields we want
            try:
                postsData['id'].append(postDict['id'])
                postsData['date'].append(postDict['date'])
                postsData['tags'].append(postDict['tags'])
            except KeyError as e:
                raise KeyError("Post {} from {} is missing: {}".format(postDict['id'], blogName, e))

            foundSuffix = False
            for suffix in possiblePhotoSuffixes:
                try:
                    photoURL = postDict['photo-url-{}'.format(suffix)]
                    postsData['photo-url'].append(photoURL)
                    postsData['photo-type'].append(photoURL.split('.')[-1])
                    foundSuffix = True
                    break
                except KeyError:
                    pass
            if not foundSuffix:
                #Make sure your error messages are useful
                #You will be one of the users
                raise KeyError("Post {} from {} is missing a photo url".format(postDict['id'], blogName))

    return pandas.DataFrame(postsData)
tumblrImageScrape('lolcats-lol-cat', 50)

Unnamed: 0,date,id,photo-type,photo-url,tags
0,"Thu, 04 Jan 2018 16:00:13",169306612050,jpg,http://78.media.tumblr.com/704a681944a94a1fac1...,"[cat, cats, lol, lolcat, lolcats]"
1,"Thu, 04 Jan 2018 14:00:41",169303827051,png,http://78.media.tumblr.com/1ec2cc33b059d657351...,"[cat, cats, lol, lolcat, lolcats]"
2,"Thu, 04 Jan 2018 12:00:34",169301536211,png,http://78.media.tumblr.com/80dc7c42c35a2dbce58...,"[cat, cats, lol, lolcat, lolcats]"
3,"Tue, 02 Jan 2018 04:00:09",169209300706,jpg,http://78.media.tumblr.com/94b8dbe91683bd01551...,"[cat, cats, lol, lolcat, lolcats]"
4,"Tue, 02 Jan 2018 02:00:42",169205268941,jpg,http://78.media.tumblr.com/e3bbb26992ac10e0234...,"[cat, cats, lol, lolcat, lolcats]"
5,"Sun, 31 Dec 2017 08:00:10",169141076975,png,http://78.media.tumblr.com/5ef1ee635831085377d...,"[cat, cats, lol, lolcat, lolcats]"
6,"Sun, 31 Dec 2017 06:00:35",169137806653,jpg,http://78.media.tumblr.com/9dc91b972231d1e6250...,"[cat, cats, lol, lolcat, lolcats]"
7,"Sun, 31 Dec 2017 04:00:38",169134323015,png,http://78.media.tumblr.com/f62a86855bb0ed8bf1e...,"[cat, cats, lol, lolcat, lolcats]"
8,"Sun, 31 Dec 2017 02:00:42",169130694997,jpg,http://78.media.tumblr.com/760cd9b9017708c90d5...,"[cat, cats, lol, lolcat, lolcats]"
9,"Sun, 24 Dec 2017 08:00:08",168883132580,jpg,http://78.media.tumblr.com/c078fc21432061a41ec...,"[cat, cats, lol, lolcat, lolcats]"
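
The paging logic in `tumblrImageScrape` can be looked at in isolation. The sketch below is not part of the original function; `pageStarts` is a hypothetical helper that lists the `start` offsets requested for a given maximum. With floor division (`//`), as in the function above, a maximum that is not a multiple of 50 is rounded down. Ceiling division would round it up instead.

```python
def pageStarts(maxImages, pageSize=50, roundUp=False):
    """List the 'start' offsets for paging through posts pageSize at a time."""
    if roundUp:
        # Ceiling division: a partial final page still gets requested
        numRequests = -(-maxImages // pageSize)
    else:
        # Floor division, matching range(maxImages // 50) in the function above
        numRequests = maxImages // pageSize
    return [requestNum * pageSize for requestNum in range(numRequests)]

print(pageStarts(200))
print(pageStarts(120))
print(pageStarts(120, roundUp=True))
```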


Now we have the urls of a bunch of images and can run OCR on them to gather
compelling meme narratives, accompanied by cats.

# Files

What if the text we want isn't on a webpage? There are many other sources of
text available, typically organized into *files*.

## Raw text (and encoding)

The most basic form of storing text is as a _raw text_ document. Source code
(`.py`, `.r`, etc.) is usually raw text, as are text files (`.txt`) and those with
many other extensions (e.g., `.csv`, `.dat`, etc.). Opening an unknown file with a
text editor is often a great way of learning what the file is.

We can create a text file in Python with the `open()` function.

In [65]:
example_text_file = 'sometextfile.txt'
#stringToWrite = 'A line\nAnother line\nA line with a few unusual symbols \u2421 \u241B \u20A0 \u20A1 \u20A2 \u20A3 \u0D60\n'
stringToWrite = 'A line\nAnother line\nA line with a few unusual symbols ␡ ␛ ₠ ₡ ₢ ₣ ൠ\n'

with open(example_text_file, mode = 'w', encoding='utf-8') as f:
    f.write(stringToWrite)

Notice the `encoding='utf-8'` argument, which specifies how we map the bits from
the file to the glyphs (and whitespace characters like tab (`'\t'`) or newline
(`'\n'`)) on the screen. When dealing only with Latin letters, Arabic numerals
and the other symbols on American keyboards, you usually do not have to worry
about encodings, as the ones used today are backwards compatible with
[ASCII](https://en.wikipedia.org/wiki/ASCII), which gives the binary
representation of 128 characters.

Some of you, however, will want to use other characters (e.g., Chinese
characters). To solve this there is
[Unicode](https://en.wikipedia.org/wiki/Unicode), which assigns numbers (code
points) to symbols, e.g., 0041 is `'A'` and 03A3 is `'Σ'` (numbers starting with
0 are hexadecimal). Often non/beyond-ASCII characters are called Unicode
characters. Unicode has room for 1,114,112 code points, about 10% of which have
been assigned. Unfortunately, there are many ways to map combinations of bits to
Unicode symbols. The ones you are likely to encounter are called by Python
_utf-8_, _utf-16_ and _latin-1_. _utf-8_ is the standard for Linux and Mac OS,
while both _utf-16_ and _latin-1_ are used by Windows. If you use the wrong
encoding, characters can appear wrong or change in number, or Python could raise
an exception. Let's see what happens when we open the file we just created with
different encodings.

In [23]:
with open(example_text_file, encoding='utf-8') as f:
    print("This is with the correct encoding:")
    print(f.read())

with open(example_text_file, encoding='latin-1') as f:
    print("This is with the wrong encoding:")
    print(f.read())

NameError: name 'example_text_file' is not defined

Notice that with _latin-1_ the Unicode characters are mixed up and there are too
many of them. You need to keep encoding in mind when obtaining text files.
Determining the encoding can sometimes involve substantial work.
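
The same effect can be reproduced without a file. This is a minimal sketch using only built-in string methods; the sample string is arbitrary. Encoding turns text into bytes, and decoding those bytes with a different encoding either produces the wrong characters or fails outright.

```python
text = 'A Σ ₠'

# ord() and chr() convert between characters and Unicode code points
print(hex(ord('A')), chr(0x03A3))

# utf-8 uses 1 byte for 'A' and ' ', 2 bytes for 'Σ' and 3 bytes for '₠'
raw = text.encode('utf-8')
print(len(text), 'characters ->', len(raw), 'bytes')

# latin-1 maps every byte to exactly one character, so we get more of them
wrong = raw.decode('latin-1')
print(len(wrong), 'characters after decoding as latin-1')

# ascii only covers 128 values, so bytes above 127 raise an exception
try:
    raw.decode('ascii')
    asciiFailed = False
except UnicodeDecodeError:
    asciiFailed = True
print('ascii decode failed:', asciiFailed)
```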

We can also load many text files at once. Let's start by looking at the Shakespeare files in the `data` directory.

In [66]:
with open('../data/Shakespeare/midsummer_nights_dream.txt') as f:
    midsummer = f.read()
print(midsummer[-700:])

, and Train.]

PUCK
  If we shadows have offended,
  Think but this,--and all is mended,--
  That you have but slumber'd here
  While these visions did appear.
  And this weak and idle theme,
  No more yielding but a dream,
  Gentles, do not reprehend;
  If you pardon, we will mend.
  And, as I am an honest Puck,
  If we have unearned luck
  Now to 'scape the serpent's tongue,
  We will make amends ere long;
  Else the Puck a liar call:
  So, good night unto you all.
  Give me your hands, if we be friends,
  And Robin shall restore amends.

[Exit.]





End of Project Gutenberg Etext of A Midsummer Night's Dream by Shakespeare
PG has multiple editions of William Shakespeare's Complete Works



Then to load all the files in `../data/Shakespeare` we can use a for loop with `scandir`:

In [36]:
targetDir = '../data/Shakespeare' #Change this to your own directory of texts
shakespearText = []
shakespearFileName = []

for file in (file for file in os.scandir(targetDir) if file.is_file() and not file.name.startswith('.')):
    with open(file.path, encoding='utf-8') as f:
        shakespearText.append(f.read())
    shakespearFileName.append(file.name)

Then we can put them all in a pandas DataFrame.

In [35]:
print(shakespearFileName)
shakespear_df = pandas.DataFrame({'text' : shakespearText}, index = shakespearFileName)
shakespear_df

['alls_well_that_ends_well.txt', 'anthonie_and_cleopatra.txt', 'as_you_like_it.txt', 'comedy_of_errors.txt', 'coriolanus.txt', 'cymbeline.txt', 'hamlet.txt', 'julius_caesar.txt', 'king_henry_4_p1.txt', 'king_henry_4_p2.txt', 'king_henry_5.txt', 'king_henry_6_p1.txt', 'king_henry_6_p2.txt', 'king_henry_6_p3.txt', 'king_henry_8.txt', 'king_john.txt', 'king_lear.txt', 'king_richard_2.txt', 'king_richard_3.txt', 'lovers_complaint.txt', 'loves_labors_lost.txt', 'macbeth.txt', 'measure_for_measure.txt', 'merchant_of_venice.txt', 'merry_wives_of_windsor.txt', 'midsummer_nights_dream.txt', 'much_ado_about_nothing.txt', 'othello.txt', 'passionate_pilgrim.txt', 'pericles_prince_of_tyre.txt', 'phoenix_and_the_turtle.txt', 'rape_of_lucrece.txt', 'romeo_and_juliet.txt', 'sonnets.txt', 'taming_of_the_shrew.txt', 'tempest.txt', 'timon_of_athens.txt', 'titus_andronicus.txt', 'troilus_and_cressida.txt', 'twelth_night.txt', 'two_gentlemen_of_verona.txt', 'venus_and_adonis.txt', 'winters_tale.txt']


Unnamed: 0,text
alls_well_that_ends_well.txt,"All's Well, that Ends Well\n\nActus primus. Sc..."
anthonie_and_cleopatra.txt,"The Tragedie of Anthonie, and Cleopatra\n\nAct..."
as_you_like_it.txt,AS YOU LIKE IT\n\nby William Shakespeare\n\n\n...
comedy_of_errors.txt,"DRAMATIS PERSONAE\n\nSOLINUS, Duke of Ephesus\..."
coriolanus.txt,THE TRAGEDY OF CORIOLANUS\n\nby William Shakes...
cymbeline.txt,The Tragedie of Cymbeline\n\nActus Primus. Sco...
hamlet.txt,The Tragedie of Hamlet\n\nActus Primus. Scoena...
julius_caesar.txt,"Dramatis Personae\n\n JULIUS CAESAR, Roman st..."
king_henry_4_p1.txt,The First Part of Henry the Fourth\n\nwith the...
king_henry_4_p2.txt,"KING HENRY IV, SECOND PART\n\nby William Shake..."


Getting your text in a format like this is the first step of most analyses.

## PDF

Another common way text will be stored is in a PDF file. First we will download
a PDF in Python. To do that, let's grab a chapter from
_Speech and Language Processing_; chapter 21 is on Information Extraction, which
seems apt. It is stored as a PDF at
[https://web.stanford.edu/~jurafsky/slp3/21.pdf](https://web.stanford.edu/~jurafsky/slp3/21.pdf),
although we are downloading from a copy just in case Jurafsky changes their
website.

In [41]:
from sys import getfilesystemencoding

print( getfilesystemencoding() )

utf-8


In [42]:
information_extraction_pdf = 'https://github.com/KnowledgeLab/content_analysis/raw/data/21.pdf'

infoExtractionRequest = requests.get(information_extraction_pdf, stream=True)
print(infoExtractionRequest.text[:1000])

%PDF-1.5
%����
97 0 obj
<<
/Length 2988      
/Filter /FlateDecode
>>
stream
xڝYY��~�_��T,U�����f<{����z\[)o(
���!�������n��4�J�/b�� }7�9�s������o���I�,�|�a�d�J��M��y�9?�~<i]כHE���	�}����e��k��esp�ߋ�շyS�Jf��|�?>���׺�+��y7����==w��8�����M<�x�
����`?���Q�*gs�����!���Wn)Ϗ������a�*?p6�Ƒ2?��N����wӍq�A�ڽ|h��z�;%���pƍg�๑���s�����9��fQ�������$����pw��Շ���W[���f��{�M��M�(�M=gã�����l�?��r��s|Ǐ}׏B��o�:E}����N>b�t����{:�M2��I�����߉�M�Q�k�k�tW�C�6덂
������拍���Y��o乤�8s���Fy-�����g��vGvM ����3���}���[+o�y��שnٰ|7N6UŽ�f����l����D�D��|[i�n��MY��f�3��W�n`�>���X!~lڧ��>��K�7�
�_�d��el}��g�S�.4:���ve�WƟ�0�.�y���V�j������bh��i�)�uW������[�jGnG�T,?9�u]��-1�s��7e��.��XU%�ޗ5��[��²�؅���4w�W��N��淼)��R�MT,�,F4��o�>n��P�z������s;
*��\��o^=������d�8��P��P����@p��^(�&N�[L�d)�H�����a����wf��!��톁�$(dIf��h�4���;J,O���X\�u�$xʗ�zx�����t��|ӎ���n���V{�Av@�Wt:�!bc!*���>����΋�@R�Q`��_�A���Aл�yG+J���z[��\[G�1-r��U�m�A�|��'�2

It says `'PDF'`, so that's a good sign. The rest, though, looks like we are having
issues with an encoding. The random characters are not caused by our encoding
being wrong, however. They are caused by there not being a text encoding for
those parts at all. PDFs are nominally binary files, meaning there are sections
of binary data that are specific to PDF and nothing else, so you need something
that knows about PDF to read them. To do that we will be using `pdfminer`, the
PDF processing library imported at the top of this notebook.
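
Because a PDF is binary, it is safer to inspect the raw bytes (`.content` in `requests`) than the decoded text. PDF files normally begin with the marker `%PDF-` followed by a version number, as in the output above, which gives a quick sanity check before handing the data to a parser. The function name `looksLikePDF` and the sample bytes are illustrative assumptions, not part of the original notebook.

```python
def looksLikePDF(data):
    """Return True if the bytes start with the PDF header marker."""
    return data[:5] == b'%PDF-'

# Made-up byte strings standing in for downloaded content
pdfLike = b'%PDF-1.5\n%\xd0\xd4\xc5\xd8\n97 0 obj\n'
htmlLike = b'<!DOCTYPE html><html></html>'

print(looksLikePDF(pdfLike))
print(looksLikePDF(htmlLike))
```

With the request above this would be called as `looksLikePDF(infoExtractionRequest.content)`.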


Because PDF is a very complicated file format, pdfminer requires a large amount
of boilerplate code to extract text. We have written a function that takes in an
open PDF file and returns the text so you don't have to.

In [43]:
def readPDF(pdfFile):
    #Based on code from http://stackoverflow.com/a/20905381/4955164
    #Using utf-8, if there are a bunch of random symbols try changing this
    codec = 'utf-8'
    rsrcmgr = pdfminer.pdfinterp.PDFResourceManager()
    retstr = io.StringIO()
    layoutParams = pdfminer.layout.LAParams()
    device = pdfminer.converter.TextConverter(rsrcmgr, retstr, 
                                              laparams = layoutParams, codec = codec)
    #We need a device and an interpreter
    interpreter = pdfminer.pdfinterp.PDFPageInterpreter(rsrcmgr, device)
    password = ''
    maxpages = 0
    caching = True
    pagenos=set()
    for page in pdfminer.pdfpage.PDFPage.get_pages(pdfFile, pagenos, 
                                                   maxpages=maxpages, password=password,
                                                   caching=caching, check_extractable=True):
        interpreter.process_page(page)
    device.close()
    returnedString = retstr.getvalue()
    retstr.close()
    return returnedString

First we need to take the response object and convert it into a 'file like'
object so that pdfminer can read it. To do this we will use `io`'s `BytesIO`.

In [47]:
infoExtractionBytes = io.BytesIO(infoExtractionRequest.content)
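
As a small illustration of what 'file like' means, the sketch below wraps some made-up bytes in `io.BytesIO` and uses the same `read()`, `tell()` and `seek()` methods that an opened binary file provides. The byte string is a placeholder, not real PDF content.

```python
import io

fakeDownload = b'%PDF-1.5 pretend these bytes came from r.content'
fileLike = io.BytesIO(fakeDownload)

firstBytes = fileLike.read(8)   # read the first 8 bytes, like a file
position = fileLike.tell()      # the read position has advanced
fileLike.seek(0)                # rewind so the next reader starts at the top
everything = fileLike.read()

print(firstBytes, position, everything == fakeDownload)
```

If you read from such an object yourself before passing it on, rewinding with `seek(0)` avoids handing over a stream that is already at its end.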

Now we can give it to pdfminer.

In [45]:
print(readPDF(infoExtractionBytes)[:550])

Speech and Language Processing. Daniel Jurafsky & James H. Martin.
rights reserved.

Draft of November 7, 2016.

Copyright c(cid:13) 2016.

All

CHAPTER

21 Information Extraction

I am the very model of a modern Major-General,
I’ve information vegetable, animal, and mineral,
I know the kings of England, and I quote the ﬁghts historical
From Marathon to Waterloo, in order categorical...
Gilbert and Sullivan, Pirates of Penzance

Imagine that you are an analyst with an investment ﬁrm that tracks airline stocks.
You’re given the task of determini


From here we can either look at the full text or fiddle with our PDF reader and
get more information about individual blocks of text.

## Word Docs

The other type of document you are likely to encounter is the `.docx`. These are
actually zip archives of [XML](https://en.wikipedia.org/wiki/Office_Open_XML)
files, a markup format similar to HTML, and as with HTML we will use a
specialized parser.

For this class we will use
[`python-docx`](https://python-docx.readthedocs.io/en/latest/), which provides a
nice simple interface for reading `.docx` files.

In [53]:
example_docx = 'https://github.com/KnowledgeLab/content_analysis/raw/data/example_doc.docx'

r = requests.get(example_docx, stream=True)
d = docx.Document(io.BytesIO(r.content))
for paragraph in d.paragraphs[:7]:
    print(paragraph.text)

Week 1 - Retreiving and Preparing Text for Machines
This week, we begin by "begging, borrowing and stealing" text from several contexts of human communication (e.g., PDFs, HTML, Word) and preparing it for machines to "read" and analyze. This notebook outlines scraping text from the web, from images, PDF and Word documents. Then we detail "spidering" or walking through hyperlinks to build samples of online content, and using APIs, Application Programming Interfaces, provided by webservices to access their content. Along the way, we will use regular expressions, outlined in the reading, to remove unwanted formatting and ornamentation. Finally, we discuss various text encodings, filtering and data structures in which text can be placed for analysis.
For this notebook we will be using the following packages:
#All these packages need to be installed from pip
import requests #for http requests
import bs4 #called `beautifulsoup4`, an html parser
import pandas #gives us DataFrames
import docx 

This procedure uses the `io.BytesIO` class again, since `docx.Document` expects
a file. Another way to do it is to save the document to a file and then read it
like any other file. If we do this we can either delete the file afterwards, or
save it and avoid downloading it the next time.

The function below is useful as a part of many different tasks, so it and others like it will be added to the helper package `lucem_illud` so we can use them later without having to retype them.

In [51]:
def downloadIfNeeded(targetURL, outputFile, **openkwargs):
    if not os.path.isfile(outputFile):
        outputDir = os.path.dirname(outputFile)
        #This function is a more general os.mkdir()
        if len(outputDir) > 0:
            os.makedirs(outputDir, exist_ok = True)
        r = requests.get(targetURL, stream=True)
        #Using a context manager (`with`) like this is generally better than
        #having to remember to close the file. There are ways to make this
        #function work as a context manager too
        with open(outputFile, 'wb') as f:
            f.write(r.content)
    return open(outputFile, **openkwargs)

This function will download `targetURL`, save it as `outputFile` and open it, or
just open `outputFile` if it already exists. By default `open()` will open the
file as read-only text with the local encoding, which may cause issues if it's
not a text file.

In [117]:
try:
    d = docx.Document(downloadIfNeeded(example_docx, example_docx_save))
except Exception as e:
    print(e)

File is not a zip file


We need to tell `open()` to read in binary mode (`'rb'`). This is why we added
`**openkwargs`, which allows us to pass any keyword arguments (kwargs) from
`downloadIfNeeded` to `open()`.
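
The 'File is not a zip file' message hints at why binary mode matters: a `.docx` is a zip archive, and zip readers need the raw bytes. The sketch below is a toy illustration, not a valid Word document. `openForwarding` is a hypothetical stand-in for the last line of `downloadIfNeeded`, and the archive contains a single placeholder XML part.

```python
import os
import tempfile
import zipfile

def openForwarding(path, **openkwargs):
    # Any keyword arguments given here are passed straight through to open()
    return open(path, **openkwargs)

with tempfile.TemporaryDirectory() as tmpDir:
    toyPath = os.path.join(tmpDir, 'toy.docx')

    # Build a tiny zip archive with one XML member (placeholder content)
    with zipfile.ZipFile(toyPath, 'w') as z:
        z.writestr('word/document.xml', '<document/>')

    # mode='rb' is forwarded to open(), giving us the raw bytes
    with openForwarding(toyPath, mode='rb') as f:
        header = f.read(4)
        f.seek(0)
        isZip = zipfile.is_zipfile(f)

    with openForwarding(toyPath, mode='rb') as f:
        names = zipfile.ZipFile(f).namelist()

print(header, isZip, names)
```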

In [128]:
d = docx.Document(downloadIfNeeded(example_docx, example_docx_save, mode = 'rb'))
for paragraph in d.paragraphs[:7]:
    print(paragraph.text)

 
 

Accessing the Research Computing Center Resources

To connect to the midway compute cluster to access your home directory and the macs60000 storage space, and utilize the HPC resources, you will either use a terminal client (with or without X11 forwarding capabilities) or the Linux remote desktop server software client (Thinlinc) to connect to the midway cluster. To submit jobs, monitor jobs, browse directories or do other computing you will need to connect through either the terminal or remote desktop. Setup and utilization of these clients will be discussed below in the context of your local platform’s architecture.
SSH Client Setup & Remote Desktop Server


Now we can read the file with `docx.Document` and not have to wait for it to be
downloaded every time.


# <span style="color:red">Section 3</span>
<span style="color:red">Construct cells immediately below this that extract and organize textual content from text, PDF or Word into a pandas dataframe.</span>


In [120]:
#Getting an example txt from the internet and saving it as a dataframe.

from urllib.request import urlopen
text_url= "http://www.sample-videos.com/text/Sample-text-file-10kb.txt"
txt_by = urlopen(text_url).read() #Gives me a bytes object
txt_st= txt_by.decode("utf-8").replace("\r", "") #Transform it to string
txtP= txt_st.split("\n\n")
DF_txt= pandas.DataFrame({"txt": txtP})
#DF_txt

In [119]:
#Getting an example pdf from the internet and saving it as a Data Frame

pdf_url= "http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf"
pdf_request = requests.get(pdf_url, stream=True)
pdf_by = io.BytesIO(pdf_request.content)
pdf_st= readPDF(pdf_by)
#Stripping bullet points and splitting by (double) newline
pdfP= pdf_st.replace(u'\u2022', "").split("\n\n")
DF_pdf= pandas.DataFrame({"pdf": pdfP[:-2]})
#DF_pdf

In [155]:
#Getting an example doc from the internet and saving it as a Data Frame
doc_url= "http://homepages.inf.ed.ac.uk/neilb/TestWordDoc.doc"

d = docx.Document(downloadIfNeeded(doc_url, "helper.doc", mode = 'rb'))
for paragraph in d.paragraphs[:7]:
    print(paragraph.text)

PackageNotFoundError: Package not found at 'helper.doc'

## Shiyu: Scraping a Dataset the way I wanted

In [222]:
from bs4 import BeautifulSoup
import re
import pandas as pd

In [246]:
##Gathering book links from the ranking board before processing the books
##The ranking board includes 10 pages; each page includes 50 books. 
##500 books in total on the ranking board. 
BookList= []
for p in range(1,11):
    rank_url= "https://www.qidian.com/rank/collect?style=2&page={}".format(p)

    page_request= requests.get(rank_url)
    page_soup= BeautifulSoup(page_request.text, "lxml")

    url_list= page_soup.body.find("div", attrs= {"class": "rank-view-list"})\
               .find_all("a", attrs= {"class": "name"}) #Continuing the previous line!!!
    bookmark_list= page_soup.body.find_all("td", attrs= {"class": "month"})
    for i in range(50): #The page is well structured; 50 books a page
        a_book= [url_list[i]["href"], int(bookmark_list[i].text)]
        BookList.append(a_book)

#Output of this loop is links of the books on the ranking board.
#Output is list within list
#[[book1's url, number of being bookmarked], [book2's url, number of being bookmarked], etc]
print(len(BookList))
print(BookList[:10])

500
[['//book.qidian.com/info/1003354631', 4174089], ['//book.qidian.com/info/1004608738', 3297730], ['//book.qidian.com/info/3676417', 2249562], ['//book.qidian.com/info/2750457', 1631691], ['//book.qidian.com/info/1003307568', 1613495], ['//book.qidian.com/info/1209977', 1587833], ['//book.qidian.com/info/3602691', 1585675], ['//book.qidian.com/info/1010089759', 1338237], ['//book.qidian.com/info/3513193', 1279457], ['//book.qidian.com/info/3681560', 1236954]]
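
The links come back as protocol-relative URLs (they start with `//`), so a scheme has to be added before they can be requested. Prepending `'https:'` works. `urllib.parse.urljoin` is an alternative that also handles site-relative and absolute links. A small sketch, where the `/rank/other` path is a made-up example:

```python
import urllib.parse

rankPage = 'https://www.qidian.com/rank/collect?style=2&page=1'

# Protocol-relative link, as returned by the scrape above
fromProtocolRelative = urllib.parse.urljoin(rankPage, '//book.qidian.com/info/1003354631')
# Site-relative link (hypothetical path)
fromSiteRelative = urllib.parse.urljoin(rankPage, '/rank/other')
# Absolute links are returned unchanged
fromAbsolute = urllib.parse.urljoin(rankPage, 'https://book.qidian.com/info/3676417')

print(fromProtocolRelative)
print(fromSiteRelative)
print(fromAbsolute)
```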


In [None]:
##These functions are for dealing with individual book links
##and gathering information of each book

def soup_a_book(a_book_url):
    ##Extracting book introduction from a webpage+ Using Beautifulsoup to parse a book's web page
    a_book_request= requests.get(a_book_url)
    a_book_soup= BeautifulSoup(a_book_request.text, 'lxml')
    return a_book_soup


def get_intro(a_book_soup):
    #Following the previous function, get the book's introduction and get rid of "\u3000"
    for_intro= a_book_soup.body.find("div", attrs= {"class": "book-intro"})
    intro= re.sub(r"\u3000", " ", for_intro.text.strip()) #(Regular Expression 1)
    return intro

def get_book_demo(a_book_soup):
    ###Book demographics###
    for_info= a_book_soup.body.find("div", attrs= {"class": "book-info"})
    #print(for_info.text)

    #name, author
    name= for_info.text.split()[0]
    author= for_info.text.split()[1]
    
    linked_text= re.findall(r"\d+.\d+.*总推荐", for_info.text)[0] #(Regular Expression 2)
    #The above re.findall() returns a list, I take its element as string.
    splited_text= linked_text.split("|")
    #print(splited_text)

    #word count
    word_count= splited_text[0] 

    #click count
    click_count= re.findall(r"\d+.\d+.*总点击", splited_text[1])[0] #(Regular Expression 3)  
    
    #recommendation count
    recom_count= splited_text[2] #Taken directly from the split, no regular expression needed

    #tags
    tags= a_book_soup.body.find_all("a", attrs= {"class": "red"})
    tags= [i.text for i in tags]
    tags_string= "" 
    for t in tags:
        tags_string+= t+ ";"

    return name, author, word_count, click_count, recom_count, tags_string


def get_IntroMentioned_books(intro, name):
    #Using output of the above function get_intro() and the book's own name
    #(name is passed in explicitly rather than read from a global variable)
    #Looking for other books mentioned in the book introduction
    all_mentioned= re.findall(r"《(.*?)》", intro) #(Regular Expression 4)
    other_books= [i for i in all_mentioned if i!= name]
    other_books_string= ""
    for o in other_books:
        other_books_string+= o+ ";"
    return other_books_string




def get_text(a_book_soup):
    #Input is the soup of the introduction page of the book 
    #Obtain 10 chapters as example writing, 10 chapters are stored in one dictionary
    
    #Using the introduction page to get the "read for free (= first page)"
    def get_content(for_read_url):
        read_url= "https:"+ for_read_url

        ch= requests.get(read_url)
        ch_soup= BeautifulSoup(ch.text, 'html.parser')

        chP= ch_soup.body.find("div", attrs= {"class": "read-content"}).find("p")
        for_chP= str(chP).replace("\u3000", "")
        #print(for_chP)
        ch_content= re.sub(r"(<\/*p>)(\1*)", "", for_chP) #(Regular Expression 5)
        ch_content= ch_content.strip()
        return ch_content, ch_soup
    
    content_dic= {}
    i= 1 
    while i< 11:
        if i== 1:
            for_read_url= a_book_soup.body.find("a", text= "免费试读")["href"] 
            ch_content, ch_soup = get_content(for_read_url)
            content_dic[i]= ch_content
            i+= 1
        else:
            for_read_url= ch_soup.body.find("a", text= "下一章")["href"]
            ch_content, ch_soup = get_content(for_read_url)
            content_dic[i]= ch_content
            i+= 1
    
    return content_dic

def deal_with_a_book(a_book_url):
    a_book_soup= soup_a_book(a_book_url)
    intro= get_intro(a_book_soup)
    name, author, word_count, click_count, recom_count, tags_string= get_book_demo(a_book_soup)
    other_books_string= get_IntroMentioned_books(intro, name)
    content_dic= get_text(a_book_soup)
    
    a_book_dic= {'name' : name, "author": author, "intro": intro, 
                 "word_count": word_count, "click_count": click_count, 
                 "recom_count": recom_count,
                 "tags": tags_string, "intro_mentioned_books": other_books_string, 
                 "example_text": content_dic}
    helper= [a_book_dic]
    bookDF= pandas.DataFrame(data= helper)
    return bookDF, a_book_soup
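
To see what the title pattern (Regular Expression 4) does, here is a small check on an invented introduction string. The titles are placeholders, not real data. The non-greedy `.*?` stops at the first closing `》`, so each title is captured separately.

```python
import re

# Invented sample text with three titles in Chinese book-title marks
sampleIntro = 'By the author of 《Book A》 and 《Book B》, a new story: 《Book C》'
ownName = 'Book C'

allMentioned = re.findall(r"《(.*?)》", sampleIntro)
otherBooks = [title for title in allMentioned if title != ownName]
# ';'.join builds a similar string to the loop above, without the trailing ';'
otherBooksString = ';'.join(otherBooks)

print(allMentioned)
print(otherBooksString)
```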


In [242]:
##Processing the 500 books and output a data frame
bookDF_male= pd.DataFrame({})
for book in BookList:
    book_url= "https:"+ book[0]
    a_bookDF, a_book_soup= deal_with_a_book(book_url)
    #The "number of being bookmarked" information is not available on the
    #book introduction page. This information is available on the book ranking page. 
    #Thus, I add a column for each book about its number of being bookmarked:
    a_bookDF["num_bookmarked"]= book[1]
    bookDF_male= pd.concat([bookDF_male, a_bookDF]) #DataFrame.append() was removed in pandas 2.0
    print(bookDF_male.shape)
    
bookDF_male.to_csv("bookDF_male.csv")

(1, 10)
(2, 10)
(3, 10)


Unnamed: 0,author,click_count,example_text,intro,intro_mentioned_books,name,recom_count,tags,word_count,num_bookmarked
0,耳根,1503.53万总点击,{1: '帽儿山，位于东林山脉中，山下有一个村子，民风淳朴，以耕田为生，与世隔绝。清晨，村庄...,一念成沧海，一念化桑田。一念斩千魔，一念诛万仙。 唯我念……永恒 这是耳根继《仙逆》《求...,仙逆;求魔;我欲封天;,一念永恒,1616.09万总推荐,仙侠;幻想修仙;,354.98万字,4174077
0,辰东,1609.74万总点击,{1: '大漠孤烟直，长河落日圆。一望无垠的大漠，空旷而高远，壮阔而雄浑，当红日西坠，地平线...,在破败中崛起，在寂灭中复苏。 沧海成尘，雷电枯竭，那一缕幽雾又一次临近大地，世间的枷锁被打...,,圣墟,1818.01万总推荐,玄幻;东方玄幻;,316.22万字,3297696
0,忘语,503.29万总点击,{1: '浩瀚宇宙某个偏僻星域中，一点朦胧金光以某种固定速度在漆黑星空中徐徐飞行着，并不时从...,天降神物！异血附体！ 群仙惊惧！万魔退避！ 一名从东洲大陆走出的少年。 一具生死相依的...,凡人修仙传;魔天记;凡人修仙传;《凡人修仙之仙界篇;,玄界之门,757.24万总推荐,仙侠;幻想修仙;,338.88万字,2249565
