![NYPLogo.png](attachment:NYPLogo.png)

# Practical 11a: Extract Text Data

## Objectives
* Demonstrate different methods of acquiring text data from different data sources.

## Extract Text Data from PDFs
Much of the data we work with is stored in PDF files. We need to extract the text from these files and store it for further analysis.

The simplest way to do this is with the PyPDF2 library.

Follow the steps in this section to extract text data from PDF files.


### Install and import all the necessary libraries
To install the PyPDF2 library:
> Type this command in Jupyter or the Anaconda Prompt: **pip install pypdf2**

```Python
# import libraries
import PyPDF2
# PdfFileReader is the old class name; PyPDF2 3.x uses PdfReader
from PyPDF2 import PdfReader
```

### Extracting text from PDF file
Download the **sample.pdf** file and place it in the same directory as your Jupyter notebook or Python script.

Now we will extract the text using the Python code below.
```Python
# open sample.pdf in binary mode to create a pdf file object
pdf = open("sample.pdf", "rb")

# create a pdf reader object
pdf_reader = PyPDF2.PdfReader(pdf)

# check the number of pages in the pdf file
print(len(pdf_reader.pages))

# create a page object for the first page
page = pdf_reader.pages[0]

# extract the text from the page
print(page.extract_text())

# close the pdf file
pdf.close()
```

Note that the code above does not work for scanned PDFs, because their pages are stored as images with no text layer to extract.
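
One way to notice a scanned page is to check whether the extracted text is empty. The sketch below is a minimal illustration, not a PyPDF2 feature: `has_text_layer` and `first_page_text` are hypothetical helper names, and it assumes that `extract_text()` gives back an empty string (or nothing) when a page has no text layer.

```python
def has_text_layer(page_text):
    """Return True if the extracted text contains any non-whitespace character."""
    return bool(page_text and page_text.strip())


def first_page_text(path):
    """Extract the text of the first page of the PDF at `path`."""
    # imported here so that has_text_layer can be tried without PyPDF2 installed
    from PyPDF2 import PdfReader

    with open(path, "rb") as pdf:
        reader = PdfReader(pdf)
        return reader.pages[0].extract_text()


# plain strings standing in for extracted page text
print(has_text_layer("Hello PDF"))   # True
print(has_text_layer("  \n"))        # False
```

With a real file, `has_text_layer(first_page_text("sample.pdf"))` would report whether the first page yields any text.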

### Exercise
Download the **sample2.pdf** and try the following tasks:
1. Read all the pages in sample2.pdf and print out the extracted text

<em>Hint: use a loop to read all the pages.</em>

## Extract Text Data from Word Files
Word files are among the most common files that an organisation deals with. We need to extract the text from these files and store it for further analysis.

The simplest way to do this is with the python-docx library, which is imported as `docx`.

Follow the steps in this section to extract text data from Word files.


### Install and import all the necessary libraries
To install the python-docx library:
> Type this command in Jupyter or the Anaconda Prompt: **pip install python-docx**

```Python
# import library
from docx import Document
```

### Extracting text from Word file
Download the **sample.docx** file and place it in the same directory as your Jupyter notebook or Python script.

Now we will extract the text using the Python code below.
```Python
# create a word reader object; Document accepts the file path directly,
# so there is no separate file handle to close afterwards
doc = Document("sample.docx")

# create an empty string called docu to hold the text of the Word document.
# The for loop goes through each paragraph in the document and appends its text,
# followed by a newline so that paragraphs do not run together.
docu = ""
for para in doc.paragraphs:
    docu += para.text + "\n"

# print the extracted text
print(docu)
```
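
Once the text is extracted, it can be stored for later analysis. This is a minimal sketch, assuming a plain UTF-8 text file is a suitable store. `save_text` and the file name `sample_docx.txt` are hypothetical, and the string below stands in for the `docu` variable.

```python
import tempfile
from pathlib import Path


def save_text(text, path):
    """Write the extracted text to a UTF-8 text file and return the path."""
    path = Path(path)
    path.write_text(text, encoding="utf-8")
    return path


# stand-in for the text extracted from the Word document
docu = "First paragraph.\nSecond paragraph.\n"

# write into a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_text(docu, Path(tmp_dir) / "sample_docx.txt")
    read_back = saved.read_text(encoding="utf-8")
    print(read_back == docu)
```

In a real notebook you would pass a path in your working directory instead of a temporary one.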

## Extracting Text Data from RSS Feeds
RSS (Rich Site Summary) is a format for delivering regularly changing web content. Many news-related sites, weblogs and other online publishers syndicate their content as an RSS feed to whoever wants it.

The simplest way to read a feed is with the feedparser library.

Follow the steps in this section to extract text data from RSS feeds.


### Install and import all the necessary libraries
To install the feedparser library:
> Type this command in Jupyter or the Anaconda Prompt: **pip install feedparser**

```Python
# import library
import feedparser
```

### Feed Structure
In the example below, we inspect the structure of the feed so that we can decide which parts of it to process. Use the feedparser.parse() function to create a feed object that contains the parsed feed. It takes the URL of the feed.
```Python
# create feed and put in the RSS feed url
news_feed = feedparser.parse("https://www.channelnewsasia.com/api/v1/rss-outbound-feed?_format=xml&category=10416")

news_feed.keys()
```
The feed key holds general information about the RSS feed, but the "meat" of the feed is in entries. The remaining keys are less useful for our purposes.
```Python
# get keys under entries
news_feed.entries[0].keys()
```
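
Not every feed provides every key for every entry, so it can help to read fields with `.get()` rather than assume they exist. The sketch below assumes entries behave like dictionaries (feedparser entries support `.get()`). `entry_fields` is a hypothetical helper, and the sample entry is made up for illustration.

```python
def entry_fields(entry, fields=("title", "link", "published")):
    """Return the requested fields of one entry, using None for missing keys."""
    return {name: entry.get(name) for name in fields}


# made-up entry with no "published" key, standing in for news_feed.entries[0]
sample_entry = {"title": "Example headline", "link": "https://example.com/story"}

print(entry_fields(sample_entry))
```

A missing key comes back as `None` instead of raising an error, which keeps a loop over many entries from stopping halfway.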

### Get Details of Feed
An RSS document, often known as a feed, consists of text and metadata such as the publication time and the author's name.
```Python
# returns title of rss feed
print(news_feed.feed.title)

# return link of rss feed and number of entries
print(news_feed.feed.link)
print(len(news_feed.entries))

# details of individual entries can be accessed by using the attribute name
print(news_feed.entries[0].title)
print(news_feed.entries[0].link)
print(news_feed.entries[0].published)
```
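
The `published` value is a string. RSS 2.0 feeds normally use the RFC 822 date format, which the Python standard library can parse into a `datetime`. This is a minimal sketch, assuming the feed follows that format. The date string below is made up rather than taken from the feed.

```python
from email.utils import parsedate_to_datetime

# made-up value in the format RSS feeds typically use for `published`
published = "Mon, 01 Jan 2024 08:30:00 +0800"

# convert the string into a timezone-aware datetime object
published_dt = parsedate_to_datetime(published)

print(published_dt.isoformat())  # 2024-01-01T08:30:00+08:00
print(published_dt.year, published_dt.month, published_dt.day)
```

A `datetime` is easier to sort and filter than the raw string when the entries are analysed later.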

### Exercise
Try the following tasks:
1. Read this rss feed: https://www.channelnewsasia.com/api/v1/rss-outbound-feed?_format=xml
2. Save each entry's title, link and published date into a dataframe

## Extracting Text Data from Twitter
Twitter holds a huge amount of valuable data, and social media marketers make their living from it. An enormous number of tweets are posted every day, and every tweet has a story to tell. When this data is collected and analysed, it gives a business a great deal of insight into its company, products, services, etc.

In February 2023, Twitter introduced steep prices for its API, offering only small amounts of data at high cost. Some users switched to libraries such as snscrape, which relied on the public web APIs. In April 2023, Twitter closed that option as well, making search available only to authorized accounts.

In the exercise below, we use Python to scrape data from Twitter. Given Twitter's recent changes and the restrictions on its API, this is a useful way to extract data for a narrow time window.

This practical originally used the **scrapper API (decommissioned)**. To use it, you first had to create an account at scrapperapi.com by clicking **Get Started for Free**. After you created the account, it gave you an API key to use in the code.

When Twitter.com became X.com, it closed its public API, though web scraping is still possible.
Some third-party APIs still let us scrape data from X.com!

In this X.com web scraping tutorial, we look at scraping X.com posts and profiles using Python and Playwright.
Visit https://scrapfly.io/blog/how-to-scrape-twitter/ and follow the instructions to scrape data from x.com.
Alternatively, refer to the ready-made scraper at https://github.com/scrapfly/scrapfly-scrapers/tree/main/twitter-scraper



## Extracting Text Data from Twitter (Outdated)
To scrape our data, we are going to use the **scrapper API**. To use the API, you first have to create an account at scrapperapi.com. Click on **Get Started for Free** to sign up. After you create the account, it gives you an API key that you will need in the code later.

### Install and import all the necessary libraries
First we need to import two libraries. The first is requests, which we will use to make the HTTP requests that go out to the web and scrape the data. We also import pandas to put the data we scrape from Twitter into dataframes.

```Python
# import the libraries
import requests
import pandas as pd
```
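
The cells that follow work with a dictionary named `data` that holds the parsed JSON response. As a rough sketch of its shape, the stand-in below uses the three top-level keys described in the next paragraph. The nested key names `snippet` and `displayed_link`, the contents of `search_information` and `pagination`, and all values are assumptions made up for illustration.

```python
# hand-made stand-in for the parsed JSON response; all values are invented
data = {
    "search_information": {"total_results": 1},
    "organic_results": [
        {
            "title": "Example tweet title",
            "snippet": "Example text of the tweet.",
            "link": "https://twitter.com/example/status/1",
            "displayed_link": "twitter.com/example",
        }
    ],
    "pagination": {"current_page": 1},
}

# top-level keys, the type of organic_results, and the first title
print(list(data.keys()))
print(type(data["organic_results"]))
print(data["organic_results"][0]["title"])
```

A stand-in like this lets the later cells be tried offline. A real response would contain many more results and fields.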

Let's take a look at the data and its keys. We can see that there are three keys: search_information, organic_results and pagination. Each item in organic_results contains the title, the snippet from the tweet, the link and the displayed link.

The pagination key gives you some additional metadata about your data pool. The information that we want is in organic_results.

~~~Python
print (data.keys())
data['organic_results']
~~~

Let's look at the type that was returned for organic_results. It is a list, so we can access individual elements of organic_results by index. Each element of the list is a dictionary, and we can use its keys to access more data.

```Python
# data type of organic_results
print (type(data['organic_results']))

# get first element of organic_results
display(data['organic_results'][0])

# get the title of the first tweet
print (data['organic_results'][0]['title'])
```


Next, we will convert our scraped data to a dataframe. First we iterate through the tweets in organic_results and place each one in a twitter_data list. We then create a dataframe from the list.

~~~Python
twitter_data = []
for tweet in data['organic_results']:
    twitter_data.append(tweet)
    
df = pd.DataFrame(twitter_data)
df
~~~

### Exercise
Try the following tasks:
1. Scrape 50 tweets from Twitter on "**chatgpt**"
2. Save the scraped results into a dataframe