# Extracting Data

This notebook implements the following:
- 1. Collecting Data from API
- 2. Collecting Data from PDFs
- 3. Collecting Data from Word files
- 4. Collecting Data from JSON
- 5. Collecting Data from HTML
- 6. Parsing text using Regular expressions

# 1. Collecting Data from API

Many free APIs let us collect data and use it to solve problems. This section uses the Twitter API in particular, since it exposes a large amount of valuable data.

Once this data is collected and analyzed, it can give a business insight into how its company, products, and services are perceived.

The following credentials are needed for Twitter data analysis:
- consumer key: Key associated with the application 
- consumer secret: Password used to authenticate with the authentication server
- access token: Key given to the client after successful authentication of above keys
- access token secret: Password for the access key

In [None]:
# install tweepy
!pip3 install tweepy

In [None]:
# import required libraries
import numpy as np
import tweepy
import json
import pandas as pd
from tweepy import OAuthHandler

In [None]:
# credentials: read them from environment variables instead of hard-coding secrets
# (the variable names below are illustrative; set them in your shell beforehand)
import os
consumer_key = os.environ["TWITTER_CONSUMER_KEY"]
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]

In [None]:
# calling API
auth = OAuthHandler(consumer_key, consumer_secret)

In [None]:
auth.set_access_token(access_token, access_token_secret)

In [None]:
api = tweepy.API(auth)

In [None]:
# provide the search query (e.g. pulling data for the mobile phone ABC)
query = "ABC"

In [None]:
# fetching tweets
# note: Tweepy 4.0 renamed API.search to API.search_tweets; use that name on newer versions
tweets = api.search(query, count=10, lang='en', exclude='retweets', tweet_mode='extended')

The query above requests up to 10 tweets matching the search term ABC. Because the language is set to 'en', only English tweets are returned, and retweets are excluded.
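Once the tweets are fetched, it is usually convenient to flatten them into plain records before analysis. The sketch below assumes each returned object exposes `id`, `created_at`, `user.screen_name`, and (because of `tweet_mode='extended'`) `full_text`. The helper name `tweets_to_records` and the sample objects are hypothetical stand-ins, so the cell runs without calling the API.

```python
from types import SimpleNamespace


def tweets_to_records(tweets):
    """Convert status-like objects into a list of plain dictionaries."""
    records = []
    for tweet in tweets:
        records.append({
            "id": tweet.id,
            "created_at": tweet.created_at,
            "user": tweet.user.screen_name,
            # with tweet_mode='extended' the untruncated text is in full_text
            "text": tweet.full_text,
        })
    return records


# hypothetical stand-ins for the objects returned by the search call
sample_tweets = [
    SimpleNamespace(id=1, created_at="2020-01-01",
                    user=SimpleNamespace(screen_name="alice"),
                    full_text="ABC has a great camera"),
    SimpleNamespace(id=2, created_at="2020-01-02",
                    user=SimpleNamespace(screen_name="bob"),
                    full_text="Battery life on ABC could be better"),
]

records = tweets_to_records(sample_tweets)
for record in records:
    print(record["user"], ":", record["text"])
```

A list of dictionaries like this can be passed directly to `pd.DataFrame(records)` if a tabular view is preferred.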

# 2. Collecting Data from PDFs

Data is often stored in PDF files. We need to extract the text from these files and store it for further analysis. This section uses the PyPDF2 library to do so.

In [None]:
# install required library
!pip3 install PyPDF2 --user

In [None]:
# import required libraries
# note: PyPDF2 3.0 removed PdfFileReader, numPages, getPage and extractText;
# the cells below use their replacements (PdfReader, pages, extract_text)
from PyPDF2 import PdfReader

I will import a personal PDF file and extract text data from it.

In [None]:
# create a pdf file object
pdf = open('./data/pdf/Text Classification on Social Media Bachelor Thesis.pdf', 'rb')

In [None]:
# create a pdf reader object
pdf_reader = PdfReader(pdf)

In [None]:
# check number of pages in the pdf file
len(pdf_reader.pages)

In [None]:
# create a page object
page = pdf_reader.pages[0]

In [None]:
# finally extract text from the specified page
page.extract_text()

In [None]:
# close the pdf file
pdf.close()

# 3. Collecting Data from Word files

This section shows how to extract data from Word files in Python. For this we will use the python-docx library, which is imported as docx.

A personal .docx file is used as the example.

In [None]:
# install required library
!pip3 install python-docx --user

In [None]:
# import required library
from docx import Document

In [None]:
# create a word file object
file = open('./data/docx/TrustCo System - Analyzing Reviews Report.docx', 'rb')

In [None]:
# create a word reader object
document = Document(file)

In [None]:
# document.paragraphs holds every paragraph in the Word document.
# Join the text of all paragraphs into one string, one paragraph per line,
# so that words at paragraph boundaries do not run together.
doc = "\n".join(p.text for p in document.paragraphs)

In [None]:
doc

# 4. Collecting Data from JSON

A simple way to read JSON data from a web API in Python is to combine the third-party requests library with the json module from the standard library.

In [None]:
# import required libraries
import requests
import json

In [None]:
# json from 'https://quotes.rest/qod.json'
r = requests.get('https://quotes.rest/qod.json')

In [None]:
# get the result into dictionary
res = r.json()

In [None]:
# check json structure
print(json.dumps(res, indent=4))

In [None]:
# extract contents
q = res['contents']['quotes'][0]

In [None]:
# print output
q

In [None]:
# extract quote and author
print(q['quote'], '\n--', q['author'])
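Indexing with `res['contents']['quotes'][0]` raises an error if the service returns a different structure (for example, an error payload). The sketch below shows one defensive way to navigate the same shape. The sample payloads and the helper name `first_quote` are hypothetical and only mirror the keys used above.

```python
import json

# hypothetical payload shaped like the response navigated above
sample = '{"contents": {"quotes": [{"quote": "Stay curious.", "author": "Unknown"}]}}'


def first_quote(payload):
    """Return (quote, author) for the first entry, or None if the structure is missing."""
    quotes = payload.get("contents", {}).get("quotes", [])
    if not quotes:
        return None
    first = quotes[0]
    return first.get("quote"), first.get("author")


res_sample = json.loads(sample)
found = first_quote(res_sample)
missing = first_quote({"error": {"code": 429}})

print(found)    # ('Stay curious.', 'Unknown')
print(missing)  # None
```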

# 5. Collecting Data from HTML

This section shows how to collect data from HTML pages using the bs4 library, also known as BeautifulSoup.

In [None]:
# install required library
!pip3 install bs4 --user

In [None]:
# import required libraries
import urllib.request as urllib
from bs4 import BeautifulSoup

In [None]:
# fetch html file (e.g Wikipedia)
response = urllib.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')

In [None]:
# get data into html_doc (binary)
html_doc = response.read()

In [None]:
# parse html file
soup = BeautifulSoup(html_doc, 'html.parser')

In [None]:
# format the parsed html file
strhtml = soup.prettify()

In [None]:
# print few lines
print(strhtml[:1000])

In [None]:
# extract tag values
# 1. extract title
print(soup.title)

In [None]:
# 2. extract content between the following tags: <title> </title>
print(soup.title.string)

In [None]:
# 3. extract content between the following tags: <a> </a>
print(soup.a.string)

In [None]:
# 4. extract content between the following tags: <b> </b>
print(soup.b.string)

In [None]:
# extract all instances of a particular tag (e.g 'a')
content_tag_a = []
for x in soup.find_all('a'):
    content_tag_a.append(x.string)

In [None]:
# print the first 5 occurrences
content_tag_a[:5]

In [None]:
# extract all text of a particular tag
content_tag_p = []
for x in soup.find_all('p'):
    content_tag_p.append(x.text)

In [None]:
# print the first 5 occurrences
content_tag_p[:5]

# 6. Parsing text using Regular Expressions

Regular expressions are helpful when dealing with text data, especially raw data from the web, which may contain HTML tags, long text, and repeated text.

For this we will use the "re" module from Python's standard library.

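The basic flags in the "re" module are:
- re.I : ignore case when matching
- re.L : make \w, \W, \b, \B and case-insensitive matching depend on the current locale (bytes patterns only)
- re.M : make ^ and $ match at the start and end of each line, not only of the whole string
- re.S : make the dot (.) match any character, including a newline
- re.U : use Unicode matching (already the default for str patterns in Python 3)
- re.X : allow whitespace and comments inside the pattern so it can be written in a more readable format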
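The effect of the most common flags is easiest to see side by side. The following is a minimal sketch on a made-up two-line string; the variable names are illustrative only.

```python
import re

text = "Hello World\nhello again"

# re.I: case-insensitive matching
ci = re.findall(r"hello", text, flags=re.I)          # ['Hello', 'hello']

# re.M: ^ matches at the start of every line, not just the string
starts_default = re.findall(r"^\w+", text)           # ['Hello']
starts = re.findall(r"^\w+", text, flags=re.M)       # ['Hello', 'hello']

# re.S: the dot also matches the newline between the two lines
dot_default = re.search(r"World.hello", text)        # None
dot_all = re.search(r"World.hello", text, flags=re.S)

# re.X: whitespace and comments inside the pattern are ignored
date_pattern = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
""", flags=re.X)
date_parts = date_pattern.search("released 2020-05").groups()  # ('2020', '05')

print(ci, starts_default, starts, dot_default, bool(dot_all), date_parts)
```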

Regular expression functionality:
- Find a single occurrence of the character a or b:
Regex: [ab]
- Find any character except a and b:
Regex: [^ab]
- Find any character in the range a to z:
Regex: [a-z]
- Find any character outside the range a to z:
Regex: [^a-z]
- Find any character in a to z or A to Z:
Regex: [a-zA-Z]
- Any single character (except a newline, unless re.S is set):
Regex: .
- Any whitespace character:
Regex: \s
- Any non-whitespace character:
Regex: \S
- Any digit:
Regex: \d
- Any non-digit:
Regex: \D
- Any word character (letter, digit, or underscore):
Regex: \w
- Any non-word character:
Regex: \W
- Match either a or b:
Regex: (a|b)
- Zero or one occurrence of a:
Regex: a? ; ? matches zero or one occurrence, but not more
- Zero or more occurrences of a:
Regex: a* ; * matches zero or more occurrences
- One or more occurrences of a:
Regex: a+ ; + matches one or more occurrences
- Exactly three occurrences of a:
Regex: a{3}
- Three or more consecutive occurrences of a:
Regex: a{3,}
- Between three and six consecutive occurrences of a:
Regex: a{3,6}
- Start of the string:
Regex: ^
- End of the string:
Regex: $
- Word boundary:
Regex: \b
- Non-word boundary:
Regex: \B
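A few of the patterns listed above can be tried directly with re.findall. This is a small sketch on made-up strings, chosen only to make each result easy to check by eye.

```python
import re

# character classes
in_class = re.findall(r"[ab]", "cab")              # ['a', 'b']
not_in_class = re.findall(r"[^ab]", "cab")         # ['c']

# shorthand classes with a quantifier
numbers = re.findall(r"\d+", "room 42, floor 7")   # ['42', '7']

# optional character
spellings = re.findall(r"colou?r", "color colour") # ['color', 'colour']

# counted repetition (matches are non-overlapping and greedy)
exactly_three = re.findall(r"a{3}", "aaaaaaa")     # ['aaa', 'aaa']
three_to_six = re.findall(r"a{3,6}", "aaaaaaa")    # ['aaaaaa']

# anchors and word boundaries
first_word = re.findall(r"^\w+", "first second")   # ['first']
last_word = re.findall(r"\w+$", "first second")    # ['second']
whole_word = re.findall(r"\bcat\b", "cat concat cat.")  # ['cat', 'cat']

print(in_class, not_in_class, numbers, spellings)
print(exactly_three, three_to_six, first_word, last_word, whole_word)
```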

The most commonly used functions are re.match() and re.search(). Both look for a pattern and return a match object that can be processed according to the requirements of the application.

- re.match() checks for a match only at the beginning of the string. If it finds the pattern there, it returns a match object; otherwise it returns None.
- re.search() scans the whole string and returns a match object for the first location where the pattern matches, or None if there is no match. To collect every occurrence, use re.findall() or re.finditer().
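The difference between these functions can be illustrated on one made-up string, as in the sketch below.

```python
import re

text = "order 66 shipped, order 67 pending"

# re.match only looks at the beginning of the string
match_digits = re.match(r"\d+", text)    # None: the string does not start with a digit
match_word = re.match(r"order", text)    # match object covering text[0:5]

# re.search returns the first match found anywhere in the string
first_number = re.search(r"\d+", text)   # matches '66' starting at index 6

# re.findall returns every non-overlapping match as a list of strings
all_numbers = re.findall(r"\d+", text)   # ['66', '67']

print(match_digits)
print(match_word.group(), match_word.span())
print(first_number.group(), first_number.start())
print(all_numbers)
```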

# Tokenizing

Tokenizing is the process of splitting a sentence into individual words. One way to do this is with the re.split function.

In [None]:
# import required libraries
import re

In [None]:
# run the split query
re.split(r'\s+', 'I like this book.')
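Splitting on whitespace leaves punctuation attached to the neighbouring word ('book.'). As a sketch of two alternatives, re.findall can either keep only word characters or return punctuation marks as separate tokens. Which behaviour is appropriate depends on the downstream task.

```python
import re

sentence = "I like this book."

# split on runs of whitespace: punctuation stays attached to the word
by_whitespace = re.split(r"\s+", sentence)        # ['I', 'like', 'this', 'book.']

# keep only runs of word characters: punctuation is dropped
words_only = re.findall(r"\w+", sentence)         # ['I', 'like', 'this', 'book']

# words or single punctuation marks: punctuation becomes its own token
with_punct = re.findall(r"\w+|[^\w\s]", sentence) # ['I', 'like', 'this', 'book', '.']

print(by_whitespace)
print(words_only)
print(with_punct)
```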

# Extracting email IDs

The simplest way to do this is by using re.findall function from Python "re" library.

In [None]:
# 1. read / create the document or sentences
doc = "For more details please mail us at: xyz@abc.com, pqr@mno.com"

In [None]:
# 2. execute the re.findall function
addresses = re.findall(r'[\w\.-]+@[\w\.-]+', doc)

In [None]:
type(addresses)

In [None]:
for address in addresses:
    print(address)

# Replacing email IDs

Here we replace email IDs in sentences or documents with another email ID. A simple way to do this is with re.sub.

In [None]:
# 1. read / create the document or sentences
doc = "For more details please mail us at xyz@abc.com"

In [None]:
# 2. execute re.sub function
new_doc = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'pqr@mno.com', doc)

In [None]:
new_doc
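The pattern above already contains two capture groups (the user part and the domain), which re.sub can reuse in the replacement. The sketch below masks the user part while keeping the domain, first with a backreference and then with a replacement function. The helper name `mask_user` and the sample sentence are illustrative assumptions.

```python
import re

EMAIL = re.compile(r'([\w\.-]+)@([\w\.-]+)')
sample_doc = "Contact xyz@abc.com or pqr@mno.com for details"

# \2 refers to the second capture group (the domain)
masked = EMAIL.sub(r'***@\2', sample_doc)


def mask_user(match):
    """Keep the first character of the user part and star out the rest."""
    user, domain = match.groups()
    return user[0] + "*" * (len(user) - 1) + "@" + domain


# re.sub also accepts a function that receives each match object
partially_masked = EMAIL.sub(mask_user, sample_doc)

print(masked)            # Contact ***@abc.com or ***@mno.com for details
print(partially_masked)  # Contact x**@abc.com or p**@mno.com for details
```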

# Extract data from ebook and perform regex

In [None]:
# import required libraries
import re
import requests
import json

In [None]:
# url that we want to extract information
url = 'https://www.gutenberg.org/files/2638/2638-0.txt'

In [None]:
# define function to extract
def get_book(url):
    # send an HTTP request to get the text from Project Gutenberg
    raw_text = requests.get(url).text
    
    # skip metadata from the beginning of the book
    # (accept "THE" as well as "THIS", since some files word the marker differently)
    start = re.search(r"\*\*\* START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK .* \*\*\*", raw_text).end()
    
    # end the excerpt at the first occurrence of "II" in the text
    end = re.search(r"II", raw_text).start()
    
    # keep relevant text
    text = raw_text[start:end]
    return text
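Note that re.search returns None when the marker is not found, so calling .end() on the result would raise an AttributeError. A minimal sketch of a guarded version of the header-skipping step follows. The helper name `strip_header` and the sample text are hypothetical and do not touch the network.

```python
import re


def strip_header(raw_text, marker_pattern):
    """Return the text after the first match of marker_pattern.

    If the marker is not found, the text is returned unchanged.
    """
    match = re.search(marker_pattern, raw_text)
    if match is None:
        return raw_text
    return raw_text[match.end():]


# hypothetical sample standing in for a downloaded book
sample_text = "Title: Demo\n*** START OF DEMO ***\nChapter one begins here."

body = strip_header(sample_text, r"\*\*\* START OF .* \*\*\*")
unchanged = strip_header("no marker here", r"\*\*\* START OF .* \*\*\*")

print(body.strip())  # Chapter one begins here.
print(unchanged)     # no marker here
```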

In [None]:
# preprocessing
def preprocess(sentence):
    return re.sub('[^A-Za-z0-9.]+', ' ', sentence).lower()

In [None]:
book = get_book(url)

In [None]:
# apply preprocessing
processed_book = preprocess(book)

In [None]:
processed_book[:2000]

In [None]:
# perform exploratory data analysis on data using regex
# count the number of times the word "the" appears in the book
# (\b word boundaries avoid counting "the" inside words such as "there")
len(re.findall(r'\bthe\b', processed_book))

In [None]:
# Replace "i" with "I"
processed_book = re.sub(r'\si\s', " I ", processed_book)

In [None]:
processed_book[:2000]

In [None]:
# find all occurrences of text in the format "abc--xyz"
re.findall(r'[a-zA-Z0-9]*--[a-zA-Z0-9]*', book)