## Extract Data
From: Chapter 1 of    
Natural Language Processing Recipes: Unlocking Text Data with Machine Learning and Deep Learning using Python   
by Adarsha Shivananda, Akshay Kulkarni   
https://www.safaribooksonline.com/library/view/natural-language-processing/9781484242674/

### Extract Text from PDF files

In [54]:
!pip install PyPDF2





In [55]:
import PyPDF2
from PyPDF2 import PdfReader

In [56]:
# Open the source PDF (binary mode) and the output text file
with open("data/sample.pdf", "rb") as pdf, open("data/extracted.txt", "w", encoding="utf-8") as txt:
    # Create a PDF reader object (PdfReader replaces the deprecated PdfFileReader)
    pdf_reader = PdfReader(pdf)
    # Iterate over every page in the PDF
    for page in pdf_reader.pages:
        # Extract the page's text (extract_text replaces the deprecated extractText)
        txt.write(page.extract_text())
# No explicit close() is needed: the "with" block closes both files

### Collecting JSON Data from the Web

In [57]:
#Install requests
!pip install requests





In [58]:
import requests
import json

In [59]:
#json from "https://quotes.rest/qod.json"
r = requests.get("https://quotes.rest/qod.json")
res = r.json()
print(json.dumps(res, indent = 4))

{
    "success": {
        "total": 1
    },
    "contents": {
        "quotes": [
            {
                "quote": "The real opportunity for success lies within the person and not in the job.",
                "length": "75",
                "author": "Zig Ziglar",
                "tags": [
                    "inspire",
                    "opportunity",
                    "success"
                ],
                "category": "inspire",
                "date": "2019-02-11",
                "permalink": "https://theysaidso.com/quote/cG742g_g_dI81QSvneS9rweF/zig-ziglar-the-real-opportunity-for-success-lies-within-the-person-and-not-in-th",
                "title": "Inspiring Quote of the day",
                "background": "https://theysaidso.com/img/bgs/man_on_the_mountain.jpg",
                "id": "cG742g_g_dI81QSvneS9rweF"
            }
        ],
        "copyright": "2017-19 theysaidso.com"
    }
}


In [60]:
#extract contents
q = res['contents']['quotes'][0]
print(q['quote'], '\n--', q['author'])

The real opportunity for success lies within the person and not in the job. 
-- Zig Ziglar
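
The lookup `res['contents']['quotes'][0]` above raises `KeyError` or `IndexError` if the service returns an error payload or an empty list. Below is a minimal sketch of a more defensive extraction, assuming the response shape printed earlier. The function name `extract_quote` and both sample payloads are hypothetical and used only for illustration.

```python
def extract_quote(payload):
    """Return (quote, author) from a qod-style payload, or None if absent."""
    quotes = payload.get("contents", {}).get("quotes", [])
    if not quotes:
        return None
    first = quotes[0]
    return first.get("quote"), first.get("author")


# Hypothetical payloads: one mirrors the structure shown above,
# the other stands in for an error response without a "contents" key.
sample = {
    "contents": {
        "quotes": [
            {
                "quote": "The real opportunity for success lies within the person and not in the job.",
                "author": "Zig Ziglar",
            }
        ]
    }
}
error_payload = {"error": {"message": "hypothetical error"}}

print(extract_quote(sample))
print(extract_quote(error_payload))
```

With the live response, the call would be `extract_quote(res)`.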


### Web Scraping with Beautiful Soup: Collecting Data from HTML

In [61]:
!pip install bs4
import urllib.request as urllib2
from bs4 import BeautifulSoup





#### Fetch the HTML file
More on Beautiful Soup:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#

In [62]:
response = urllib2.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')
html_doc = response.read()

#### Simple Parsing

In [63]:
soup = BeautifulSoup(html_doc, 'html.parser')
# Pretty-print the parsed HTML (adds consistent indentation)
strhtm = soup.prettify()
# Print the first 100 characters
print(strhtm[:100])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <


In [64]:
print(soup.title)
print(soup.title.string)
print(soup.a.string) # first <a> tag; .string is None when the tag does not contain a single string
print(soup.b.string)

<title>Natural language processing - Wikipedia</title>
Natural language processing - Wikipedia
None
Natural language processing


In [65]:
for link in soup.find_all('a'):
    print(link.get('href'))

None
#mw-head
#p-search
/wiki/Language_processing_in_the_brain
/wiki/File:Automated_online_assistant.png
/wiki/File:Automated_online_assistant.png
/wiki/Automated_online_assistant
/wiki/Customer_service
#cite_note-Kongthon-1
/wiki/Computer_science
/wiki/Information_engineering_(field)
/wiki/Artificial_intelligence
/wiki/Natural_language
/wiki/Speech_recognition
/wiki/Natural_language_understanding
/wiki/Natural_language_generation
#History
#Rule-based_vs._statistical_NLP
#Major_evaluations_and_tasks
#Syntax
#Semantics
#Discourse
#Speech
#See_also
#References
#Further_reading
/w/index.php?title=Natural_language_processing&action=edit&section=1
/wiki/History_of_natural_language_processing
/wiki/Alan_Turing
/wiki/Intelligence
/wiki/Turing_test
/wiki/Georgetown-IBM_experiment
/wiki/Automatic_translation
#cite_note-2
/wiki/ALPAC
/wiki/Statistical_machine_translation
/wiki/SHRDLU
/wiki/Blocks_world
/wiki/ELIZA
/wiki/Rogerian_psychotherapy
/wiki/Joseph_Weizenbaum
/wiki/Ontology_(information_s
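
Most of the `href` values above are relative paths, page-internal fragments (`#...`), or `None` (an `<a>` tag without an `href`). Below is a minimal sketch, using only the standard library's `urllib.parse.urljoin`, of turning such values into absolute URLs. The helper name `absolute_links` and the short `hrefs` sample are assumptions for illustration.

```python
from urllib.parse import urljoin


def absolute_links(hrefs, base_url):
    """Resolve hrefs against base_url, skipping missing and fragment-only links."""
    links = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue
        links.append(urljoin(base_url, href))
    return links


base_url = "https://en.wikipedia.org/wiki/Natural_language_processing"
# A few values copied from the output above.
hrefs = [None, "#mw-head", "/wiki/Computer_science", "/wiki/Alan_Turing"]
for url in absolute_links(hrefs, base_url):
    print(url)
```

With the parsed page, the input could be `[link.get('href') for link in soup.find_all('a')]`.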

#### Detailed Parsing: IMDB Example

In [66]:
!pip install pandas lxml ipywidgets
from bs4 import BeautifulSoup
import requests
import pandas as pd
from pandas import Series, DataFrame
from time import sleep
import re
import pickle
from ipywidgets import FloatProgress
from IPython.display import display





In [67]:
# Top 250 Movies Page:
url = 'http://www.imdb.com/chart/top?ref_=nv_mv_250_6'
result = requests.get(url)
c = result.content
soup = BeautifulSoup(c, "lxml")  # lxml is a third-party parser; "html.parser" (used earlier) is the one bundled with Python's standard library

In [69]:
summary = soup.find('div', {'class': 'article'})
# Create empty lists to collect the extracted data.
moviename = []
cast = []
description = []
rating = []
ratingoutof = []
year = []
rot_audscore = []
rot_avgrating = []
rot_users = []
# Extract the required data from the HTML soup.
rgx = re.compile(r'[()]')  # strips the parentheses around the year
f = FloatProgress(min=0, max=250)
display(f)
for i, row in enumerate(summary.find('table').find_all('tr')):
    for sitem in row.find_all('span', {'class': 'secondaryInfo'}):
        s = sitem.find(text=True)
        year.append(rgx.sub('', s))
    for ritem in row.find_all('td', {'class': 'ratingColumn imdbRating'}):
        for iget in ritem.find_all('strong'):
            rating.append(iget.find(text=True))
            # Fourth word of the title attribute (the number of ratings)
            ratingoutof.append(iget.get('title').split(' ', 4)[3])
    for item in row.find_all('td', {'class': 'titleColumn'}):
        for href in item.find_all('a', href=True):
            title = href.find(text=True)
            moviename.append(title)
            # Look up the audience scores on Rotten Tomatoes.
            rurl = 'https://www.rottentomatoes.com/m/' + title
            try:
                rsoup = BeautifulSoup(requests.get(rurl).content, 'lxml')
                info = rsoup.find('div', {'class': 'audience-info hidden-xs superPageFontColor'})
                aud = rsoup.find('div', {'class': 'meter-value'}).find('span', {'class': 'superPageFontColor'}).text
                avg = info.find('div').contents[2].strip()
                users = info.contents[3].contents[2].strip()
            except (requests.exceptions.ConnectionError, AttributeError, IndexError, TypeError):
                # Request failed or the expected elements are missing: store blanks
                # so that every list stays the same length.
                aud = avg = users = ""
            rot_audscore.append(aud)
            rot_avgrating.append(avg)
            rot_users.append(users)
            cast.append(href.get('title'))
            # Fetch the movie's own IMDb page for its summary text.
            imdb = "http://www.imdb.com" + href.get('href')
            try:
                isoup = BeautifulSoup(requests.get(imdb).content, 'lxml')
                description.append(isoup.find('div', {'class': 'summary_text'}).find(text=True).strip())
            except (requests.exceptions.ConnectionError, AttributeError):
                description.append("")
    sleep(.1)
    f.value = i

FloatProgress(value=0.0, max=250.0)

In [70]:
# List to pandas series
moviename = Series(moviename)
cast = Series(cast)
description = Series(description)
rating = Series(rating)
ratingoutof = Series(ratingoutof)
year = Series(year)
rot_audscore = Series(rot_audscore)
rot_avgrating = Series(rot_avgrating)
rot_users = Series(rot_users)

In [71]:
imdb_df = pd.concat([moviename,year,description,cast,rating,ratingoutof,rot_audscore,rot_avgrating,rot_users],axis=1)
imdb_df.columns = ['moviename','year','description','cast','imdb_rating','imdb_ratingbasedon','tomatoes_audscore','tomatoes_rating','tomatoes_ratingbasedon']
imdb_df['rank'] = imdb_df.index + 1
imdb_df.head(5)

| | moviename | year | description | cast | imdb_rating | imdb_ratingbasedon | tomatoes_audscore | tomatoes_rating | tomatoes_ratingbasedon | rank |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Two imprisoned men bond over a number of years... | Frank Darabont (dir.), Tim Robbins, Morgan Fre... | 9.2 | 2051896 | | | | 1 |
| 1 | The Godfather | 1972 | The aging patriarch of an organized crime dyna... | Francis Ford Coppola (dir.), Marlon Brando, Al... | 9.2 | 1407392 | | | | 2 |
| 2 | The Godfather: Part II | 1974 | The early life and career of Vito Corleone in ... | Francis Ford Coppola (dir.), Al Pacino, Robert... | 9.0 | 975957 | | | | 3 |
| 3 | The Dark Knight | 2008 | When the menace known as the Joker emerges fro... | Christopher Nolan (dir.), Christian Bale, Heat... | 9.0 | 2018734 | | | | 4 |
| 4 | 12 Angry Men | 1957 | A jury holdout attempts to prevent a miscarria... | Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb | 8.9 | 578243 | | | | 5 |


In [72]:
# Saving the file as CSV.
imdb_df.to_csv("data/imdbdataexport.csv")

### Parsing Text Using Regular Expressions

#### Tokenize

In [73]:
import re
# Split the sentence on runs of whitespace
re.split(r'\s+', 'I like this book.')

['I', 'like', 'this', 'book.']
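
Splitting on whitespace leaves punctuation attached to the neighboring word (`'book.'` above). Below is a minimal sketch of one alternative: matching words and punctuation marks as separate tokens with `re.findall`. The pattern is an illustrative choice, not the book's method.

```python
import re

sentence = "I like this book."
# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
# → ['I', 'like', 'this', 'book', '.']
```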

#### Extracting emails

In [74]:
doc = "For more details please mail us at: xyz@abc.com, pqr@mno.com"
addresses = re.findall(r'[\w\.-]+@[\w\.-]+', doc)
for address in addresses:
    print(address)

xyz@abc.com
pqr@mno.com
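
Because the character class after `@` includes `.`, the pattern also captures a period that ends a sentence. Below is a minimal sketch of that edge case and one simple workaround, stripping trailing `.` and `-` from each match. The sample `text` is made up for illustration, and this is not a full email validator.

```python
import re

text = "Write to xyz@abc.com. Or, failing that, pqr@mno.com."
raw = re.findall(r"[\w.-]+@[\w.-]+", text)
# Each match ends with the sentence's period, so trim trailing '.' and '-'.
cleaned = [address.rstrip(".-") for address in raw]
print(raw)
print(cleaned)
```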
