# Week 4 - Modern Digital Technologies in Text Analysis

# Extracting the Data

1. Collecting text data using APIs
2. Reading a PDF file in Python
3. Reading a Word (.docx) document
4. Reading a JSON object
5. Reading and parsing an HTML page

## 1. Collecting Data

There are a lot of free APIs through which we can collect data and use it to solve problems.

**Examples**
1. Twitter
2. Wikipedia
3. Government data (e.g. http://data.gov)
4. Health care claims data (e.g. https://www.healthdata.gov/)

## 1.1 Collecting Data from Tweets

### Problem
We want to collect text data using Twitter APIs.

### Solution
Twitter holds a very large volume of public text, and social media marketers rely on it heavily.

An enormous number of tweets are posted every day, and each one has a story to tell. Collected and analyzed together, they can give a business insight into how its company, products, and services are perceived.

Let’s see how to pull the data in this section and then explore how to leverage it later.

### How It Works

### Step 1.1 Log in to the Twitter developer portal
Create your own app in the Twitter developer portal, and get the keys
mentioned below. Once you have these credentials, you can start pulling
data. 

Keys needed:

- **Consumer key**: Key that identifies the application to the service (Twitter, Facebook, etc.).
- **Consumer secret**: Password the application uses to authenticate with the authentication server (Twitter, Facebook, etc.).
- **Access token**: Key given to the client after successful authentication with the keys above.
- **Access token secret**: Password for the access token.
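
One way to keep these four values out of the notebook itself is to read them from environment variables. The sketch below is only an illustration: the variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are hypothetical and are not defined by Twitter or tweepy.

```python
import os

# Hypothetical environment variable names; choose any names you like.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}
```

Calling `load_credentials()` then returns a dictionary whose values can be assigned to `consumer_key`, `consumer_secret`, `access_token`, and `access_token_secret`.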

### Step 1.2 Run the query below in Python
Once all the credentials are in place, use the code below to fetch the data.

In [None]:
# Install tweepy

# !pip install tweepy

In [None]:
# import the libraries
import tweepy

from tweepy import OAuthHandler

# credentials
consumer_key = "*****************"
consumer_secret = "*****************"
access_token = "*****************"
access_token_secret = "*****************"

# authenticate and create the API client
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

query = "ABC"

# search for tweets matching the query
Tweets = api.search_tweets(query, count=10, lang='en', exclude='retweets', tweet_mode='extended')

The query above pulls up to 10 tweets that match the search term `ABC`.
Because the language is set to `'en'`, only English tweets are returned,
and retweets are excluded.
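
The search result can be treated as a list of status objects. As a minimal sketch, assuming each status exposes a `full_text` attribute (as tweepy statuses do when `tweet_mode='extended'` is requested), the tweet texts can be collected into a plain list. `collect_texts` is a hypothetical helper, and the stand-in objects below only imitate real statuses so that the snippet runs without credentials.

```python
from types import SimpleNamespace


def collect_texts(statuses):
    """Return the text of each status, falling back to `text` if needed."""
    texts = []
    for status in statuses:
        # Extended-mode statuses carry `full_text`; others carry `text`.
        text = getattr(status, "full_text", None) or getattr(status, "text", "")
        texts.append(text)
    return texts


# Stand-in objects that mimic the attributes used above (not real API output).
fake_statuses = [
    SimpleNamespace(full_text="ABC is great"),
    SimpleNamespace(text="ABC shipped late"),
]
print(collect_texts(fake_statuses))
```

With real data, the same helper would be called on the search result from Step 1.2.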

## 2. Collecting Data from PDFs

Data is often stored in PDF files. We need to extract the text from these files and store it for further analysis.

### Problem
You want to read a PDF file.

### Solution
The simplest way to do this is by using the **PyPDF2** library.

### How It Works
Let’s follow the steps in this section to extract data from PDF files.

### Step 2.1. Install and import all the necessary libraries

In [None]:
# !pip install PyPDF2

In [None]:
import PyPDF2
from PyPDF2 import PdfReader

### Step 2.2 Extracting text from PDF file

**Note**: You can download any PDF file from the web and place it in the location where you are running this Jupyter notebook or Python script.

In [None]:
# open the PDF in binary mode; the with block closes the file automatically
with open("paper.pdf", "rb") as pdf:
    # create a PDF reader object
    pdf_reader = PdfReader(pdf)

    # check the number of pages in the PDF file
    print(len(pdf_reader.pages), '\n\n')

    # get the first page
    page = pdf_reader.pages[0]

    # extract and print the text of that page
    print(page.extract_text())

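Note that the code above does not work for scanned PDFs: their pages are stored as images with no text layer, so the text has to be recovered with optical character recognition (OCR) first.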
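
The code in Step 2.2 reads only the first page. A minimal sketch for reading every page follows; `extract_all_text` is a hypothetical helper that accepts any reader object exposing `.pages`, such as a `PdfReader`. The stand-in classes exist only so the snippet runs without a PDF file.

```python
def extract_all_text(reader):
    """Join the text of every page in a PdfReader-like object."""
    page_texts = []
    for page in reader.pages:
        # Pages without a text layer may yield an empty result.
        page_texts.append(page.extract_text() or "")
    return "\n".join(page_texts)


# Stand-ins that imitate the two members used above (for demonstration only).
class FakePage:
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


class FakeReader:
    def __init__(self, texts):
        self.pages = [FakePage(t) for t in texts]


print(extract_all_text(FakeReader(["Page one", "", "Page three"])))
```

With PyPDF2 installed, the real call would look like `extract_all_text(PdfReader("paper.pdf"))`.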

## 3. Collecting Data from Word Files



### Problem
You want to read Word (.docx) files.

### Solution
The simplest way to do this is by using the **python-docx** library, which is imported as `docx`.

### How It Works
Let’s follow the steps in this section to extract data from the Word file.

### Step 3.1. Install and import all the necessary libraries

In [None]:
# Install python-docx (imported as docx)

# !pip install python-docx

In [None]:
# Import library
import docx
from docx import Document

### Step 3.2 Extracting text from Word file

**Note**: You can download any Word file from the web and place it in the location where you are running this Jupyter notebook or Python script.

In [None]:
# open the Word file directly from its path
document = Document("word.docx")

# join the paragraphs, one per line, so that their text does not run together
docu = "\n".join(para.text for para in document.paragraphs)

# we can see the output by printing docu
print(docu)
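
Word documents often contain empty paragraphs that are used only for spacing. As a sketch, the hypothetical helper `non_empty_paragraphs` below keeps only paragraphs that contain text. It works on any object with a `.paragraphs` list whose items have a `.text` attribute, as python-docx `Document` objects do; the stand-in document is there only so the snippet runs without a .docx file.

```python
from types import SimpleNamespace


def non_empty_paragraphs(document):
    """Return the stripped text of every paragraph that contains text."""
    return [p.text.strip() for p in document.paragraphs if p.text.strip()]


# Stand-in for a python-docx Document (for demonstration only).
fake_document = SimpleNamespace(
    paragraphs=[
        SimpleNamespace(text="Title"),
        SimpleNamespace(text="   "),
        SimpleNamespace(text="First paragraph. "),
    ]
)
print("\n".join(non_empty_paragraphs(fake_document)))
```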

## 4. Collecting Data from JSON

This section covers reading a JSON file or object.

### Problem
We want to read a JSON file/object.

### Solution
The simplest way to do this is with Python's built-in **json** module; **requests** is needed only when the JSON is fetched from a web API.

### How It Works
Let’s follow the steps in this section to extract data from the JSON.

### Step 4.1. Install and import all the necessary libraries

In [None]:
import requests
import json

### Step 4.2. Extracting Text from JSON file

In [None]:
# First we build a JSON-formatted string and store it in a file

data = '{"success": { "total": 1}, "contents": {"quotes": [ {"quote": \
"Where there is ruin, there is hope for a treasure.", "length": "50", "author": "Rumi", \
"tags": ["failure", "inspire", "learning-from-failure"], "category": "inspire", "date": "2018-09-29", \
"permalink": "https://theysaidso.com/quote/dPKsui4sQnQqgMnXHLKtfweF/rumi-where-there-is-\
ruin-there-is-hope-for-a-treasure", "title": "Inspiring Quote of the day",\
"background": "https://theysaidso.com/img/bgs/man_on_the_mountain.jpg", "id": "dPKsui4sQnQqgMnXHLKtfweF"\
} ], "copyright": "2017-19 theysaidso.com"}}'

# `data` is already a JSON-formatted string, so write it as-is;
# calling json.dumps on it would encode it a second time
with open("my_second_file.json", "w") as file:
    file.write(data)

In [None]:
with open("my_second_file.json", "r") as file:
    temp = json.load(file)
    
temp
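
It helps to keep the four `json` functions apart: `json.loads` and `json.dumps` convert between a string and a Python object, while `json.load` and `json.dump` do the same through a file object. A minimal sketch with a small illustrative record:

```python
import io
import json

text = '{"author": "Rumi", "tags": ["inspire"]}'  # a JSON-formatted string

obj = json.loads(text)            # string -> Python dict
print(type(obj).__name__)         # dict
print(obj["tags"][0])             # inspire

round_trip = json.dumps(obj)      # Python dict -> string
print(json.loads(round_trip) == obj)

# json.dump / json.load do the same through a file-like object.
buffer = io.StringIO()
json.dump(obj, buffer)
buffer.seek(0)
print(json.load(buffer) == obj)

# Pitfall: dumping a string that is already JSON encodes it a second time,
# so loading it back yields a str rather than a dict.
double_encoded = json.dumps(text)
print(type(json.loads(double_encoded)).__name__)  # str
```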

In [None]:
data = {"success": { "total": 1}, "contents": {"quotes": [ {"quote": \
"Where there is ruin, there is hope for a treasure.", "length": "50", "author": "Rumi", \
"tags": ["failure", "inspire", "learning-from-failure"], "category": "inspire", "date": "2018-09-29", \
"permalink": "https://theysaidso.com/quote/dPKsui4sQnQqgMnXHLKtfweF/rumi-where-there-is-\
ruin-there-is-hope-for-a-treasure", "title": "Inspiring Quote of the day",\
"background": "https://theysaidso.com/img/bgs/man_on_the_mountain.jpg", "id": "dPKsui4sQnQqgMnXHLKtfweF"\
} ], "copyright": "2017-19 theysaidso.com"}}

temp = json.dumps(data, indent=3)

print(temp)

In [None]:
# extract contents

q = data['contents']['quotes'][0]

q

In [None]:
# extract only quote

print(q['quote'], '\n--', q['author'])
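
Indexing with `['contents']['quotes'][0]` raises an error as soon as a key is missing or the list is empty. A more defensive sketch is shown below; `iter_quotes` is a hypothetical helper, and the second entry in the sample payload is made up to show the fallback for a missing author.

```python
import json

payload = json.loads(
    '{"contents": {"quotes": ['
    '{"quote": "Where there is ruin, there is hope for a treasure.", "author": "Rumi"},'
    '{"quote": "An entry without an author."}'
    "]}}"
)


def iter_quotes(data):
    """Yield (quote, author) pairs, tolerating missing keys."""
    quotes = data.get("contents", {}).get("quotes", [])
    for entry in quotes:
        yield entry.get("quote", ""), entry.get("author", "Unknown")


for quote, author in iter_quotes(payload):
    print(quote, "\n--", author)
```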

## 5. Collecting Data from HTML
Let us look at reading HTML pages.


### Problem
We want to parse/read HTML pages.

### Solution
The simplest way to do this is by using the **Beautiful Soup** library, which is imported as `bs4`.

### How It Works
Let’s follow the steps in this section to extract data from the web.

### Step 5.1. Install and import all the necessary libraries

In [None]:
# !pip install bs4

In [None]:
import urllib.request as urllib2
from bs4 import BeautifulSoup

### Step 5.2. Fetch the HTML file
Pick any web page you want to extract data from. Let’s pick Wikipedia for this example.

In [None]:
response = urllib2.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')
html_doc = response.read()
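
Some sites reject requests that do not carry a descriptive `User-Agent` header. As a hedged sketch, the request can be built explicitly with `urllib.request.Request`; the `USER_AGENT` value and the helpers `build_request` and `fetch_html` are hypothetical names chosen for illustration.

```python
import urllib.request

# Hypothetical User-Agent string; replace it with one describing your project.
USER_AGENT = "text-analysis-course-notebook/0.1"


def build_request(url, user_agent=USER_AGENT):
    """Create a Request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (requires network access)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


req = build_request("https://en.wikipedia.org/wiki/Natural_language_processing")
print(req.full_url)
```

Calling `fetch_html(url)` then plays the same role as `urlopen(url).read()` in the cell above.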

### Step 5.3. Parse the HTML file

Now we parse the fetched page:

In [None]:
# parsing
soup = BeautifulSoup(html_doc, 'html.parser')

# formatting the parsed HTML
strhtm = soup.prettify()

# print the first 1000 characters
print(strhtm[:1000])

### Step 5.4. Extracting tag value

We can extract a tag value from the first instance of the tag using the following code. Note that `.string` is `None` when a tag contains more than one child element.

In [None]:
print(soup.title)
print(soup.title.string)
print(soup.a.string)
print(soup.b.string)

### Step 5.5. Extracting all instances of a particular tag

Here we get all the instances of a tag that we are interested in:

In [None]:
for x in soup.find_all('a'):
    print(x.string)

### Step 5.6. Extracting all text of a particular tag

Finally, we get the text:

In [None]:
for x in soup.find_all('p'):
    print(x.text)
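
The same calls can be tried offline on a small HTML snippet, which makes it easier to see what each one returns. This sketch assumes `bs4` is installed; the HTML string is invented for illustration.

```python
from bs4 import BeautifulSoup

html_snippet = """
<html><head><title>Sample page</title></head>
<body>
<p>First <b>bold</b> paragraph with a <a href="/wiki/NLP">link</a>.</p>
<p>Second paragraph.</p>
</body></html>
"""

sample_soup = BeautifulSoup(html_snippet, "html.parser")

# text inside the <title> tag
print(sample_soup.title.string)

# target of every link on the page
print([a.get("href") for a in sample_soup.find_all("a")])

# text of every paragraph, with nested tags removed
print([p.get_text() for p in sample_soup.find_all("p")])
```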

### Step 5.7. Another Example

In [None]:
from urllib import request

In [None]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
html[:60]

To get text out of HTML, we will use a Python library called BeautifulSoup.

In [None]:
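from bs4 import BeautifulSoup
from nltk import word_tokenize

# word_tokenize needs NLTK's tokenizer data; if it is missing, run
# nltk.download('punkt') once (recent NLTK releases ask for 'punkt_tab')

# strip the HTML tags, then split the remaining text into tokens
raw = BeautifulSoup(html, 'html.parser').get_text()
tokens = word_tokenize(raw)
print(tokens[:10])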

### Step 5.8. Electronic Books

In [None]:
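# plain-text edition of a Project Gutenberg book
url = "http://www.gutenberg.org/files/2554/2554-0.txt"

# download the file
response = request.urlopen(url)

# decode the bytes into a Python string
raw = response.read().decode('utf8')
type(raw)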

In [None]:
len(raw)

In [None]:
raw[1:74]
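
Downloaded e-book files usually wrap the book text in header and footer material. A minimal sketch for trimming it is shown below. It assumes the file marks the body with lines beginning `*** START OF` and `*** END OF`; that marker wording is an assumption about the file layout and may differ between files, and `strip_gutenberg_boilerplate` is a hypothetical helper. The sample string is invented so the snippet runs offline.

```python
START_MARKER = "*** START OF"  # assumed wording of the start marker line
END_MARKER = "*** END OF"      # assumed wording of the end marker line


def strip_gutenberg_boilerplate(raw):
    """Return the text between the start and end marker lines, if both exist."""
    start = raw.find(START_MARKER)
    end = raw.rfind(END_MARKER)
    if start == -1 or end == -1 or end <= start:
        return raw  # markers not found; leave the text untouched
    # Skip to the end of the start-marker line before slicing.
    body_start = raw.find("\n", start)
    if body_start == -1 or body_start > end:
        return raw
    return raw[body_start:end].strip()


sample = (
    "Header and licence text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Chapter 1\nIt was a hot evening.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Footer text\n"
)
print(strip_gutenberg_boilerplate(sample))
```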