# Episode 1: Getting the Data

Specifically, we'll be walking through:
1. **Getting the data** - in this case, we'll be scraping data from a website
2. **Cleaning the data** - we will walk through popular text preprocessing techniques
3. **Organizing the data** - we will organize the cleaned data into a format that is easy to input into other algorithms

The output will be clean, organized data in two standard text formats:

1. Corpus - a collection of text
2. Document-Term Matrix - word counts in matrix format

In [1]:
import requests
from bs4 import BeautifulSoup
import pickle

In [2]:
def url_to_transcript(url):
    '''Return the text of every <p> element on the page at `url` as a list of strings.'''
    # response.content gives access to the raw bytes of the response payload.
    # response.text decodes those bytes into a string for you, using a character encoding such as UTF-8.
    page = requests.get(url).text
    soup = BeautifulSoup(page)
    text = [p.text for p in soup.find_all('p')]
    print(url)
    return text

urls = ['https://scrapsfromtheloft.com/2020/01/05/ambiguous-beginning-of-heart-of-darkness/',
        'https://scrapsfromtheloft.com/2019/12/23/vladimir-nabokov-the-man-who-scandalized-the-world/',
        'https://scrapsfromtheloft.com/2019/12/07/american-novels-first-world-war-ernest-hemingway-farewell-to-arms/']
writers = ['Joseph Conrad', 'Vladimir Nabokov', 'Ernest Hemingway']
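The comments in `url_to_transcript` distinguish the raw bytes of a response from its decoded text. Here is a minimal sketch of that decoding step, using a hand-made byte string rather than a real HTTP response (the sample text and variable names are assumptions for illustration):

```python
# Illustrative only: a hand-made byte string stands in for response.content.
raw_bytes = "Nabokov's émigrés".encode("utf-8")

# Decoding with the matching encoding recovers the original string,
# which is roughly what response.text does for you.
decoded = raw_bytes.decode("utf-8")

# Decoding with a mismatched encoding does not raise here,
# but it garbles the non-ASCII characters.
garbled = raw_bytes.decode("latin-1")

print(decoded)
print(garbled)
```

If scraped text shows stray characters in place of accents or quotes, a mismatched encoding is one likely cause.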

In [3]:
#!mkdir transcripts
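The commented-out `!mkdir transcripts` is a shell command, and the cells that write to `transcripts/` fail if that folder is missing. One portable alternative is `os.makedirs` with `exist_ok=True`. The sketch below runs inside a temporary directory, an assumption made only so the example leaves no files behind:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base_dir:
    transcripts_dir = os.path.join(base_dir, "transcripts")

    # Creates the folder if needed; calling it again is harmless.
    os.makedirs(transcripts_dir, exist_ok=True)
    os.makedirs(transcripts_dir, exist_ok=True)

    created = os.path.isdir(transcripts_dir)

print(created)
```

In the notebook itself, the equivalent call would be `os.makedirs("transcripts", exist_ok=True)`.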

In [4]:
transcripts = [url_to_transcript(u) for u in urls]



Warning (truncated) from BeautifulSoup because no parser was specified. It suggests changing this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")


https://scrapsfromtheloft.com/2020/01/05/ambiguous-beginning-of-heart-of-darkness/
https://scrapsfromtheloft.com/2019/12/23/vladimir-nabokov-the-man-who-scandalized-the-world/
https://scrapsfromtheloft.com/2019/12/07/american-novels-first-world-war-ernest-hemingway-farewell-to-arms/
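Each entry of `transcripts` is a list of paragraph strings, one per `<p>` element. To see that behavior without fetching a page, the sketch below swaps in the standard library's `html.parser` for BeautifulSoup. `ParagraphCollector` and the sample HTML are hypothetical, and this is a rough stand-in for `soup.find_all('p')`, not how bs4 is implemented:

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element (hypothetical helper, not part of bs4)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._parts = []

    def handle_data(self, data):
        # Keep text only while inside a paragraph, including text in nested tags.
        if self._in_paragraph:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self.paragraphs.append("".join(self._parts))
            self._in_paragraph = False


sample_html = "<h1>Sample title</h1><p>by Some Author</p><p>A <b>bold</b> claim.</p>"
collector = ParagraphCollector()
collector.feed(sample_html)
collector.close()
print(collector.paragraphs)
```

The heading text is skipped because only `<p>` elements are collected. This is why page elements outside paragraphs do not show up in the transcripts.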


In [5]:
data = {}
for i, w in enumerate(writers):
    print(i)
    print(w)
    # pickle is a binary serialization format, so the file is opened in "wb" mode.
    # The Python docs recommend pickle over marshal for serializing Python objects:
    # https://docs.python.org/3/library/pickle.html
    with open("transcripts/" + w + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)

0
Joseph Conrad
1
Vladimir Nabokov
2
Ernest Hemingway


In [6]:
for i, w in enumerate(writers):  # w (the writer's name) is the dictionary key
    with open("transcripts/" + w + ".txt", "rb") as file:
        data[w] = pickle.load(file)  # the value is a list with one element per paragraph
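The two cells above write each transcript with `pickle.dump` and read it back with `pickle.load`. Here is a self-contained sketch of that round trip, assuming a made-up paragraph list and a temporary file so nothing is left on disk:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for one scraped transcript (a list of paragraphs).
paragraphs = ["Sample title", "by Some Author", "First paragraph of the essay."]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")

    # Pickle is a binary format, so the file is opened in binary mode.
    with open(path, "wb") as file:
        pickle.dump(paragraphs, file)

    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == paragraphs)
```

The restored object is an equal copy of the original list, not the same object. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.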

In [7]:
data.keys()

dict_keys(['Joseph Conrad', 'Vladimir Nabokov', 'Ernest Hemingway'])

In [8]:
data['Joseph Conrad'][1]

'by Richard Adams'

In [9]:
transcripts[0]

['The Ambiguous Beginning of “Heart of Darkness”',
 'by Richard Adams',
 'Richard Adams analyzes the title and opening paragraphs of Heart of Darkness, showing that neither gives the reader clues regarding the subject matter and focus of the story. Adams offers possible meanings of the title and possible interpretations of the setting and the group of five men setting out on a journey. Adams suggests that the reader is prepared to eavesdrop on the story Heart of Darkness. Richard Adams is professor of English at California State University in Sacramento. He has written school texts—Appropriate English and Teaching Shakespeare—and published editions of works by Shakespeare, Conrad, Schaffer, and Iris Murdoch.',
 '* * *',
 'Many works of fiction, particularly those written before the end of the nineteenth century, provide us with some notion of their subject-matter or focus before we actually embark on a reading of the text. They do so by means of their title-pages. One, for instance, pu

## Data Cleaning 

Common data cleaning steps on all text:
* Make text all lower case 
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (\n)
* Tokenize text
* Remove stop words
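The steps listed above can be sketched end to end with the standard library. The function name, the sample sentence, and the tiny stop-word set are assumptions for illustration. A real stop-word list is far larger:

```python
import re
import string

# Tiny illustrative stop-word set; real lists contain hundreds of words.
STOP_WORDS = {"a", "and", "in", "is", "of", "the"}


def clean_and_tokenize(text):
    """Apply the listed cleaning steps in order and return a list of tokens."""
    text = text.lower()                                              # lower case
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # punctuation
    text = re.sub(r"\d", "", text)                                   # numerical values
    text = text.replace("\n", " ")                                   # newline characters
    tokens = text.split()                                            # tokenize on whitespace
    return [token for token in tokens if token not in STOP_WORDS]    # stop words


sample = "The novel, published in 1929,\nis a story of love and war."
print(clean_and_tokenize(sample))
```

Here the newline is replaced with a space rather than an empty string, so the words on either side of the line break are not glued together.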

In [10]:
def combine_text(list_of_text):
    '''Join a list of paragraph strings into one large string.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

In [11]:
dict_combined = {key:[combine_text(value)] for (key,value) in data.items()}
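To make the shape of `dict_combined` concrete, here is a small sketch with made-up data (the writer names and paragraphs are placeholders, not the scraped text):

```python
# Hypothetical input: each writer maps to a list of paragraph strings.
data_example = {
    "Writer A": ["Sample title", "First paragraph."],
    "Writer B": ["Another title", "More text here."],
}

# Join each writer's paragraphs into one string, wrapped in a one-element list
# so that each writer ends up with exactly one cell when loaded into a table.
combined_example = {key: [" ".join(value)] for key, value in data_example.items()}

print(combined_example["Writer A"])
```

Each value is now a one-element list holding the full text as a single string.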

In [12]:
import pandas as pd
pd.set_option('display.max_colwidth', 150)
# from_dict makes each writer a column, so transpose to swap rows and columns (one row per writer)
data_df = pd.DataFrame.from_dict(dict_combined).transpose()
data_df.columns = ['transcript']
#data_df = data_df.sort_index()
data_df

| | transcript |
|---|---|
| Ernest Hemingway | by Carlos Baker Since his death in the summer of 1961, a school of critics has arisen which holds that the novel we are about to discuss was reall... |
| Joseph Conrad | The Ambiguous Beginning of “Heart of Darkness” by Richard Adams Richard Adams analyzes the title and opening paragraphs of Heart of Darkness, show... |
| Vladimir Nabokov | Who and what is Vladimir Nabokov (the author of Lolita) and why by Helen Lawrenson One of the more diverting aspects of Lolita, the most controv... |


## Data cleaning 

In [13]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove digits.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)  # drop anything inside square brackets
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # drop ASCII punctuation
    text = re.sub(r'\d', '', text)  # \d matches any Unicode decimal digit
    return text

round1 = clean_text_round1
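`clean_text_round1` strips individual digits with `\d`, which leaves the letters of mixed tokens behind. The sketch below contrasts it with `\w*\d\w*`, a pattern that removes every word containing a digit. That alternative and the sample string are assumptions for illustration, not part of the cell above:

```python
import re

sample = "room 101 and the b2b model from 1961 [applause] ended"

# Remove bracketed text first, as round 1 does.
no_brackets = re.sub(r"\[.*?\]", "", sample)

# Digits only: "b2b" becomes "bb".
digits_removed = re.sub(r"\d", "", no_brackets)

# Whole words containing a digit: "b2b" disappears entirely.
words_removed = re.sub(r"\w*\d\w*", "", no_brackets)

print(repr(digits_removed))
print(repr(words_removed))
```

Both patterns leave extra spaces where text was removed. That is harmless here, because a later whitespace-based tokenizer ignores repeated spaces.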

In [14]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    text = re.sub("[‘’*“”…\"']", '', text)  # typographic quotes, asterisks, ellipses and leftover straight quotes
    text = re.sub('\n', '', text)  # newline characters
    return text

round2 = clean_text_round2
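A second round is needed because `string.punctuation` contains only ASCII characters, so curly quotes and the ellipsis character survive round 1. Here is a minimal sketch of that behavior on a made-up sentence (the sample text is an assumption):

```python
import re
import string

sample = "the ambiguous beginning of “heart of darkness”… it’s a start"

# Round-1 style: strip ASCII punctuation only. Nothing in the sample matches.
after_round1 = re.sub("[%s]" % re.escape(string.punctuation), "", sample)

# Round-2 style: strip the typographic characters explicitly.
after_round2 = re.sub("[‘’“”…]", "", after_round1)

print(after_round1 == sample)
print(after_round2)
```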

In [15]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.transcript.apply(round1))
data_clean

| | transcript |
|---|---|
| Ernest Hemingway | by carlos baker since his death in the summer of a school of critics has arisen which holds that the novel we are about to discuss was really the... |
| Joseph Conrad | the ambiguous beginning of “heart of darkness” by richard adams richard adams analyzes the title and opening paragraphs of heart of darkness showi... |
| Vladimir Nabokov | who and what is vladimir nabokov the author of lolita and why by helen lawrenson one of the more diverting aspects of lolita the most controvers... |


In [16]:
# Apply the second round of cleaning and take another look at the text
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

| | transcript |
|---|---|
| Ernest Hemingway | by carlos baker since his death in the summer of a school of critics has arisen which holds that the novel we are about to discuss was really the... |
| Joseph Conrad | the ambiguous beginning of heart of darkness by richard adams richard adams analyzes the title and opening paragraphs of heart of darkness showing... |
| Vladimir Nabokov | who and what is vladimir nabokov the author of lolita and why by helen lawrenson one of the more diverting aspects of lolita the most controvers... |


In [17]:
# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")

### Document-Term Matrix

For many of the techniques we'll be using in future notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no additional meaning to text, such as 'a', 'the', etc.
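To illustrate the row-per-document, column-per-word layout described above, here is a standard-library sketch that builds a tiny document-term matrix by hand. It is a stand-in for the idea, not for CountVectorizer itself. The documents and the three-word stop list are assumptions, and scikit-learn's tokenization and stop-word list differ:

```python
from collections import Counter

# Hypothetical cleaned documents.
docs = {
    "doc_a": "the river and the sea",
    "doc_b": "a dark river",
}
STOP_WORDS = {"a", "and", "the"}  # tiny illustrative list

# Count the non-stop words in each document.
counts = {
    name: Counter(word for word in text.split() if word not in STOP_WORDS)
    for name, text in docs.items()
}

# Columns: the sorted vocabulary across all documents.
vocabulary = sorted(set().union(*counts.values()))

# Rows: one list of counts per document, aligned with the vocabulary.
matrix = [[counts[name][term] for term in vocabulary] for name in docs]

print(vocabulary)
for name, row in zip(docs, matrix):
    print(name, row)
```

A word that is absent from a document gets a count of 0, which is why most entries of a real document-term matrix are zeros.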

In [18]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')  # remove common English stop words
data_cv = cv.fit_transform(data_clean.transcript)
print(data_cv.toarray())
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean.index
data_dtm

[[0 1 4 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [1 0 0 ... 1 1 1]]


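| | abilities | ability | able | abnormal | aboriginal | abounds | abrupt | abruzzi | absence | absorption | ... | yoked | yoking | york | young | youth | youthful | zeitgeist | zoo | zoology | émigrés |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ernest Hemingway | 0 | 1 | 4 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | ... | 0 | 0 | 3 | 8 | 0 | 0 | 1 | 0 | 0 | 0 |
| Joseph Conrad | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | ... | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Vladimir Nabokov | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 5 | 3 | 0 | 1 | 0 | 1 | 1 | 1 |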


In [19]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [20]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as file:  # the with block closes the file after writing
    pickle.dump(cv, file)