> **Copyright (c) 2020 Skymind Holdings Berhad**<br><br>
> **Copyright (c) 2021 Skymind Education Group Sdn. Bhd.**<br>
<br>
Licensed under the Apache License, Version 2.0 (the "License");
<br>you may not use this file except in compliance with the License.
<br>You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0/
<br>
<br>Unless required by applicable law or agreed to in writing, software
<br>distributed under the License is distributed on an "AS IS" BASIS,
<br>WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
<br>See the License for the specific language governing permissions and
<br>limitations under the License.
<br>
<br>
**SPDX-License-Identifier: Apache-2.0**
<br>

# Introduction

This notebook introduces the basics of extracting data from various sources. The first stage of an NLP project is to extract the required textual data, which is usually unstructured and stored across many different kinds of sources.
This notebook illustrates how to extract text-based data from the most common ones.

# Notebook Content

* [Extract Table From A Webpage](#Extract-Table-From-A-Webpage)


* [Extract Tweets](#Extract-Tweets)


* [Extract Text From A HTML Webpage](#Extract-Text-From-A-HTML-Webpage)


* [Read A Word Document](#Read-A-Word-Document)


* [Read A PDF Document](#Read-A-PDF-Document)


* [Read Text From A Csv File](#Read-Text-From-A-Csv-File)


* [Read Text From An Excel Spreadsheet](#Read-Text-From-An-Excel-Spreadsheet)


* [Read Outlook Emails](#Read-Outlook-Emails)


* [Extract RSS Feeds](#Extract-RSS-Feeds)

## Extract Table From A Webpage

Facts and figures are often presented as tables in an HTML webpage. To extract HTML tables from a web page we can use the pandas library: `pandas.read_html()` reads every HTML table it finds into a `list` of `DataFrame` objects.

In [1]:
import pandas as pd

In [2]:
# Pass in the url to extract the tables
url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
table_lists = pd.read_html(url)

In [3]:
table_lists

[                                                   0
 0                                Part of a series on
 1                            Artificial intelligence
 2  Major goals Artificial general intelligence Pl...
 3  Approaches Symbolic Deep learning Bayesian net...
 4  Philosophy Ethics Existential risk Turing test...
 5                History Timeline Progress AI winter
 6  Technology Applications Projects Programming l...
 7                                  Glossary Glossary
 8  .mw-parser-output .navbar{display:inline;font-...,
                                                     0
 0                                 Part of a series on
 1                     Machine learningand data mining
 2   Problems Classification Clustering Regression ...
 3   Supervised learning.mw-parser-output .nobold{f...
 4   Clustering BIRCH CURE Hierarchical k-means Exp...
 5   Dimensionality reduction Factor analysis CCA I...
 6   Structured prediction Graphical models Bayes n...
 7         Anomaly 
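
The output above shows that `read_html()` also returns single-column navigation boxes alongside any real data tables. A minimal sketch of one way to filter the list is given below. The helper name `keep_data_tables`, its thresholds, and the two sample frames are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd


def keep_data_tables(tables, min_columns=2, min_rows=2):
    """Keep only the DataFrames that look like real tables (hypothetical helper)."""
    return [
        table
        for table in tables
        if table.shape[1] >= min_columns and table.shape[0] >= min_rows
    ]


# Stand-ins for the list returned by pd.read_html(url); the values are made up.
nav_box = pd.DataFrame({0: ["Part of a series on", "Artificial intelligence"]})
data_table = pd.DataFrame({"Year": [2019, 2020], "Value": [1.0, 2.0]})

kept = keep_data_tables([nav_box, data_table])
print(len(kept))
```

In the notebook itself the same helper could be applied to `table_lists`. `read_html()` also accepts a `match` argument that keeps only the tables containing a given text or regular expression.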

## Extract Tweets

Tweets can be extracted and fed into an NLP model to gauge wider public opinion. If we are interested in a company, a topic or certain Twitter members, we can use the Tweepy library to extract the tweets that match our target keywords.

The first step is to generate the required API keys and access tokens:

* Navigate to https://apps.twitter.com/ and select 'Create New App'.
* Fill in the application details; your keys and tokens are then listed on the 'Keys and Access Tokens' tab.


Finally, use the following code to extract the tweets. Suppose you want the last 5 tweets about FinTechExplained and MachineLearning:

### Import the Classes

In [4]:
import tweepy
# The streaming classes are not used in this notebook, and StreamListener
# was removed in Tweepy 4.0, so only the OAuth helper is imported here.
from tweepy import OAuthHandler

### Get the API

In [5]:
# Enter the details below
# auth = tweepy.auth.OAuthHandler(enter_key_consumer, enter_secret_consumer)

# auth.set_access_token(token, secret)

# api = tweepy.API(auth)

### Get Tweets and Print Them

In [6]:
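def get_tweets(api, keywords, count):
    # The search endpoint expects a single query string, so join the keywords with OR
    query = " OR ".join(keywords)
    # Tweepy 4.x renamed API.search to API.search_tweets; fall back to the 3.x name
    search = getattr(api, "search_tweets", None) or api.search
    return search(q=query, result_type='recent', lang='en', count=count)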

# tweets = get_tweets(api, ['FinTechExplained','MachineLearning'], 5)

# #Print Output
# for tweet in tweets:
#     print(tweet.text)

The tweet object has a number of additional properties including place, friends count, followers count, screen name and so on. We can also extract tweets from a specific user.
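
As a rough sketch of both points, the helpers below fetch one account's timeline with Tweepy's `API.user_timeline()` and collect a few of the properties mentioned above. The function names `get_user_tweets` and `describe_tweet`, the account name, and the stand-in tweet are assumptions for illustration. With a real `api` object the stand-in is not needed.

```python
from types import SimpleNamespace


def get_user_tweets(api, screen_name, count):
    """Fetch the most recent tweets from one account's timeline."""
    return api.user_timeline(screen_name=screen_name, count=count)


def describe_tweet(tweet):
    """Collect a few of the properties carried by a tweet object."""
    return {
        "text": tweet.text,
        "screen_name": tweet.user.screen_name,
        "followers_count": tweet.user.followers_count,
        "friends_count": tweet.user.friends_count,
        # place is None unless the tweet was tagged with a location
        "place": tweet.place.full_name if tweet.place else None,
    }


# Stand-in for a real tweet so the sketch runs without credentials (made-up values).
fake_tweet = SimpleNamespace(
    text="Hello NLP",
    place=None,
    user=SimpleNamespace(
        screen_name="example_user", followers_count=10, friends_count=5
    ),
)

summary = describe_tweet(fake_tweet)
print(summary)
```

Once the `api` object from the earlier cell is configured, `get_user_tweets(api, "example_user", 5)` would return that account's five most recent tweets, each of which can be passed to `describe_tweet`.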

## Extract Text From A HTML Webpage

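For HTML scraping, use the BeautifulSoup library.

### Step 1: Install BeautifulSoup

The package is published on PyPI as `beautifulsoup4` and can be installed with `pip install beautifulsoup4`.

### Step 2: Import the required classes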

In [7]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

### Step 3: Pass the URL and Parse the HTML

In [8]:
url = "https://en.wikipedia.org/wiki/Natural_language_processing"
all_html = BeautifulSoup(urlopen(url), 'html.parser')

Use the `find_all()` method to get every occurrence of the required tag (here, all `<p>` paragraphs); `find()` would return only the first match.

In [9]:
text = all_html.find_all('p')

In [10]:
text

[<p><b>Natural language processing</b> (<b>NLP</b>) is a subfield of <a href="/wiki/Linguistics" title="Linguistics">linguistics</a>, <a href="/wiki/Computer_science" title="Computer science">computer science</a>, and <a href="/wiki/Artificial_intelligence" title="Artificial intelligence">artificial intelligence</a> concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of <a href="/wiki/Natural_language" title="Natural language">natural language</a> data.  The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
 </p>,
 <p>Challenges in natural language processing frequently involve <a href="/wiki/Speech_recognition" title="Speech recognition">speech recognition
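
The result above still contains the HTML markup. A minimal sketch of how the plain text could be recovered with `get_text()` is shown below. It parses a short inline HTML string (an abbreviated stand-in for the downloaded page, assumed here so the example runs offline) rather than the live URL.

```python
from bs4 import BeautifulSoup

# Abbreviated stand-in for the page fetched with urlopen(url)
html = (
    "<html><body>"
    "<p><b>Natural language processing</b> (<b>NLP</b>) is a subfield of "
    "<a href='/wiki/Linguistics'>linguistics</a>.</p>"
    "<p>Challenges include "
    "<a href='/wiki/Speech_recognition'>speech recognition</a>.</p>"
    "</body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# get_text() drops the tags and joins the text fragments inside each paragraph
paragraphs = [p.get_text().strip() for p in soup.find_all("p")]

for paragraph in paragraphs:
    print(paragraph)
```

In the notebook, the same list comprehension could be applied to `all_html.find_all('p')` to obtain one clean string per paragraph.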

## Read A Word Document

We can use the python-docx library (imported as `docx`) to read and extract text from Word documents.

In [11]:
from docx import Document

### Open File and Extract Text

In [12]:
all_text = []

doc = Document("../../../resources/day_02/NLP.docx")

for paragraph in doc.paragraphs:
    all_text.append(paragraph.text)

print("\n".join(all_text))

Everything we express (either verbally or in written) carries huge amounts of information. The topic we choose, our tone, our selection of words, everything adds some type of information that can be interpreted and value extracted from it. In theory, we can understand and even predict human behaviour using that information.
But there is a problem: one person may generate hundreds or thousands of words in a declaration, each sentence with its corresponding complexity. If you want to scale and analyze several hundreds, thousands or millions of people or declarations in a given geography, then the situation is unmanageable.
Data generated from conversations, declarations or even tweets are examples of unstructured data. Unstructured data doesn’t fit neatly into the traditional row and column structure of relational databases, and represent the vast majority of data available in the actual world. It is messy and hard to manipulate. Nevertheless, thanks to the advances in disciplines like m

## Read A PDF Document

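The PyPDF2 library can read PDF documents. The code below uses the older PyPDF2 API; in PyPDF2 3.0 and later the equivalents are `PdfReader`, `reader.pages[0]` and `extract_text()`. As the output shows, text extraction from a PDF is imperfect and may drop the spaces between words.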

In [13]:
from PyPDF2 import PdfFileReader

### Extract the Text from the First Page

In [14]:
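# Open the file in binary mode and close it automatically when done
with open("../../../resources/day_02/ELMo.pdf", 'rb') as pdf_file:
    reader = PdfFileReader(pdf_file)
    print(reader.getPage(0).extractText())  # page 0 is the first page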

Deepcontextualizedwordrepresentations
MatthewE.Peters
y
,MarkNeumann
y
,MohitIyyer
y
,MattGardner
y
,
f
matthewp,markn,mohiti,mattg
g
@allenai.org
ChristopherClark

,KentonLee

,LukeZettlemoyer

f
csquared,kentonl,lsz
g
@cs.washington.edu
y
AllenInstituteforIntelligence

PaulG.AllenSchoolofComputerScience&Engineering,UniversityofWashington
Abstract
Weintroduceanewtypeof
deepcontextual-
ized
wordrepresentationthatmodelsboth(1)
complexcharacteristicsofworduse(e.g.,syn-
taxandsemantics),and(2)howtheseuses
varyacrosslinguisticcontexts(i.e.,tomodel
polysemy).Ourwordvectorsarelearnedfunc-
tionsoftheinternalstatesofadeepbidirec-
tionallanguagemodel(biLM),whichispre-
trainedonalargetextcorpus.Weshowthat
theserepresentationscanbeeasilyaddedto
existingmodelsandimprovethe
stateoftheartacrosssixchallengingNLP
problems,includingquestionanswering,tex-
tualentailmentandsentimentanalysis.We
alsopresentananalysisshowingthatexposing
thedeepinternalsofthepre-trainednetworkis
crucial,allowingdownstreammod

## Read Text From A Csv File

pandas is a good choice for reading text from a CSV file. `pandas.read_csv()` reads a comma-separated values (CSV) file into a `DataFrame`, and can optionally iterate over the file or read it in chunks.

In [15]:
dataframe = pd.read_csv("../../../resources/day_02/financial.csv", sep=',')

In [16]:
dataframe

Unnamed: 0,Series_reference,Period,Data_value,Suppressed,STATUS,UNITS,Magnitude,Subject,Group,Series_title_1,Series_title_2,Series_title_3,Series_title_4,Series_title_5
0,BDCQ.SF1AA2CA,2016.06,1116.386,,F,Dollars,6,Business Data Collection - BDC,Industry by financial variable,Sales (operating income),Forestry and Logging,Current prices,Unadjusted,
1,BDCQ.SF1AA2CA,2016.09,1070.874,,F,Dollars,6,Business Data Collection - BDC,Industry by financial variable,Sales (operating income),Forestry and Logging,Current prices,Unadjusted,
2,BDCQ.SF1AA2CA,2016.12,1054.408,,F,Dollars,6,Business Data Collection - BDC,Industry by financial variable,Sales (operating income),Forestry and Logging,Current prices,Unadjusted,
3,BDCQ.SF1AA2CA,2017.03,1010.665,,F,Dollars,6,Business Data Collection - BDC,Industry by financial variable,Sales (operating income),Forestry and Logging,Current prices,Unadjusted,
4,BDCQ.SF1AA2CA,2017.06,1233.700,,F,Dollars,6,Business Data Collection - BDC,Industry by financial variable,Sales (operating income),Forestry and Logging,Current prices,Unadjusted,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3055,BDCQ.SF8RS2CA,2020.03,141.986,,F,Dollars,6,Business Data Collection - BDC,Industry by financial variable,Operating profit,Other Services,Current prices,Unadjusted,
3056,BDCQ.SF8RS2CA,2020.06,-5.260,,F,Dollars,6,Business Data Collection - BDC,Industry by financial variable,Operating profit,Other Services,Current prices,Unadjusted,
3057,BDCQ.SF8RS2CA,2020.09,73.572,,F,Dollars,6,Business Data Collection - BDC,Industry by financial variable,Operating profit,Other Services,Current prices,Unadjusted,
3058,BDCQ.SF8RS2CA,2020.12,229.689,,F,Dollars,6,Business Data Collection - BDC,Industry by financial variable,Operating profit,Other Services,Current prices,Unadjusted,
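
To illustrate the chunked reading mentioned above, here is a small sketch that uses the `chunksize` argument of `read_csv()`. It reads an inline string holding the first five rows (and four of the columns) from the output above, so that it runs without the data file. The chunk size of 2 is an arbitrary choice for the example, and `index_col=0` is an assumption that the unnamed first column is a saved row index.

```python
import io

import pandas as pd

# First five rows of the output above, reduced to four columns
csv_text = """,Series_reference,Period,Data_value
0,BDCQ.SF1AA2CA,2016.06,1116.386
1,BDCQ.SF1AA2CA,2016.09,1070.874
2,BDCQ.SF1AA2CA,2016.12,1054.408
3,BDCQ.SF1AA2CA,2017.03,1010.665
4,BDCQ.SF1AA2CA,2017.06,1233.700
"""

chunk_sizes = []
total = 0.0

# With chunksize set, read_csv yields DataFrames of at most that many rows
for chunk in pd.read_csv(io.StringIO(csv_text), index_col=0, chunksize=2):
    chunk_sizes.append(len(chunk))
    total += chunk["Data_value"].sum()

print(chunk_sizes)
print(round(total, 3))
```

Passing `index_col=0` also avoids the `Unnamed: 0` column seen in the output above. For a large file, replacing the `StringIO` object with the file path keeps only one chunk in memory at a time.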


## Read Text From An Excel Spreadsheet

pandas can also read text from an Excel spreadsheet: `pandas.read_excel()` imports a sheet as a `DataFrame`.

In [17]:
dataframe = pd.read_excel("../../../resources/day_02/person.xlsx")

In [18]:
dataframe

Unnamed: 0,0,First Name,Last Name,Gender,Country,Age,Date,Id
0,1,Dulce,Abril,Female,United States,32,15/10/2017,1562
1,2,Mara,Hashimoto,Female,Great Britain,25,16/08/2016,1582
2,3,Philip,Gent,Male,France,36,21/05/2015,2587
3,4,Kathleen,Hanner,Female,United States,25,15/10/2017,3549
4,5,Nereida,Magwood,Female,United States,58,16/08/2016,2468
5,6,Gaston,Brumm,Male,United States,24,21/05/2015,2554
6,7,Etta,Hurn,Female,Great Britain,56,15/10/2017,3598
7,8,Earlean,Melgar,Female,United States,27,16/08/2016,2456
8,9,Vincenza,Weiland,Female,United States,40,21/05/2015,6548


## Read Outlook Emails

A lot of useful information is sent in email messages, and we can use Python to read the text of those emails. On Windows, the `win32com` module from the pywin32 package can drive Outlook through its COM interface.

### Use the API to get the contents of an email

In [19]:
import win32com.client

# outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")

# folder = outlook.GetDefaultFolder(6)  # 6 is the Inbox folder
# for item in folder.Items:
#     print(item.Body)

## Extract RSS Feeds

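feedparser is a library for parsing RSS and Atom feeds. `feedparser.parse()` expects the URL of the feed itself. The URL used below points to an ordinary HTML article rather than a feed, which is why the output reports `'bozo': 1` and an empty `entries` list.

### Use feedparser to list the keys of each entry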

In [20]:
import feedparser

feed = feedparser.parse("https://www.feedotter.com/blog/find-an-rss-feed-url/")
for entry in feed.entries:
    print(entry.keys())

In [21]:
feed

{'bozo': 1,
 'entries': [],
 'feed': {'html': {'lang': 'en-US', 'prefix': 'og: https://ogp.me/ns#'},
  'meta': {'name': 'msapplication-TileImage',
   'content': 'https://mk0successfeedoqagp8.kinstacdn.com/wp-content/uploads/2019/08/cropped-feedotter_ICON.png'},
  'links': [{'rel': 'canonical',
    'href': 'https://www.feedotter.com/blog/find-an-rss-feed-url/',
    'type': 'text/html'},
   {'rel': 'dns-prefetch',
    'href': 'https://js.hs-scripts.com',
    'type': 'text/html'},
   {'rel': 'dns-prefetch',
    'href': 'https://assets.calendly.com',
    'type': 'text/html'},
   {'rel': 'dns-prefetch',
    'href': 'https://calendly.com',
    'type': 'text/html'},
   {'rel': 'dns-prefetch',
    'href': 'https://www.gstatic.com',
    'type': 'text/html'},
   {'rel': 'dns-prefetch',
    'href': 'https://www.google.com',
    'type': 'text/html'},
   {'rel': 'dns-prefetch',
    'href': 'https://www.googletagmanager.com',
    'type': 'text/html'},
   {'rel': 'dns-prefetch',
    'href': 'https://
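
Because the result above has `'bozo': 1` and no entries, it can help to check the parsed result before using it. The helper below is a minimal sketch of such a check. The name `summarize_feed` and the sample dictionaries are assumptions for illustration; the function only relies on the result of `feedparser.parse()` behaving like a dictionary.

```python
def summarize_feed(feed):
    """Return (title, link) pairs for each entry of a parsed feed."""
    entries = feed.get("entries", [])
    # A truthy 'bozo' flag with no entries suggests the URL was not a feed
    if feed.get("bozo") and not entries:
        raise ValueError("no entries parsed; check that the URL points to a feed")
    return [(entry.get("title", ""), entry.get("link", "")) for entry in entries]


# Stand-ins for parsed results; the second mirrors the output shown above.
good_feed = {
    "bozo": 0,
    "entries": [{"title": "Post A", "link": "https://example.com/a"}],
}
bad_feed = {"bozo": 1, "entries": []}

print(summarize_feed(good_feed))

try:
    summarize_feed(bad_feed)
except ValueError as error:
    print(error)
```

With a real feed, the call would be `summarize_feed(feedparser.parse(feed_url))`, where `feed_url` is a placeholder for the address of an actual RSS or Atom feed.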

# Contributors

**Author**
<br>Chee Lam

# References

1. [NLP: Python Data Extraction From Social Media, Emails, Documents, Webpages, RSS & Images](https://medium.com/fintechexplained/nlp-python-data-extraction-from-social-media-emails-images-documents-web-pages-58d2f148f5f4)