# Reading Data, Pandas

There are many file formats. How do we make sense of them all?

* Archive and compression formats such as .zip, .rar, .7z, and .tar hold other files.
* Text formats such as .txt, .csv, .json, and .tsv can be read by humans in a text editor.
* Binary formats such as .exe, .jpg, and .png are not human-readable.
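
To make the text/binary distinction concrete, here is a minimal sketch that writes a small file and reads it back twice: once in text mode and once in binary mode. The file name `sample.txt` and its contents are made up for this illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical sample file, created only for this demonstration.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text("Alice was beginning", encoding="utf-8")

    # Text mode decodes the stored bytes into a str.
    with open(sample_path, encoding="utf-8") as text_file:
        as_text = text_file.read()

    # Binary mode ("rb") returns the raw bytes without decoding them.
    with open(sample_path, "rb") as binary_file:
        as_bytes = binary_file.read()

print(type(as_text).__name__, type(as_bytes).__name__)
```

Every file is stored as bytes; a "text format" is one whose bytes decode into readable characters.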

### Reading text files

In this section we will read a simple text file.

In [None]:
filename = "alice_wonderland.txt"

The following two cells are commented out because they fail in Google Colab unless the file has been uploaded there:

In [None]:
## open the file in current directory for reading
#file_1 = open(filename)

## read contents of the file
#data = file_1.read()

## close the file
#file_1.close()

In [None]:
## a better way (automatically closing the open file)

#with open(filename) as file_1:
#    data = file_1.read()
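
The `with` pattern can be tried anywhere, including Colab, if the cell creates its own file first. The sketch below does that with a temporary file; the name `demo.txt` and its two lines are invented for illustration, and passing `encoding="utf-8"` explicitly is a suggestion rather than something the cells above require.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file created just for this example.
    demo_path = Path(tmp_dir) / "demo.txt"
    demo_path.write_text("first line\nsecond line\n", encoding="utf-8")

    # The file is closed automatically when the "with" block ends.
    with open(demo_path, encoding="utf-8") as demo_file:
        lines = [line.rstrip("\n") for line in demo_file]

print(lines)
```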

### Google Colab

Note: The above action (reading a local file) will fail if you execute it in Google Colab.

We can read the file from a remote web location (GitHub) instead, using the `requests` library:

In [None]:
import requests

url = "https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/notebooks/" + filename

response = requests.get(url)
response.raise_for_status()  # stop early if the download failed (e.g. a 404 error)
data = response.text

### Let's continue


In [None]:
# print the first 100 characters of the file
print(data[:100])

In [None]:
# split text into tokens (words)
words = data.split()

In [None]:
# count the number of tokens in text

print(len(words))

In [None]:
# print the first 50 tokens
print(words[:50])

### Counting word frequency

Here we will use Python's `Counter` class (from the standard `collections` module) to count how often each word occurs in the text.

https://docs.python.org/3/library/collections.html#collections.Counter

In [None]:
from collections import Counter

In [None]:
c = Counter(words)

In [None]:
# print the 20 most common words (tokens)
print(c.most_common(20))

In [None]:
# a nicer way of printing counter results using a *for* loop

for token, count in c.most_common(20):
    print(f"{token}: {count}")


Notice how words may appear in both lowercase ("the") and capitalized ("The") forms. You may want to normalize the text by converting it all to lowercase and applying other clean-up steps.
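
One possible normalization is sketched below: lowercase each token and strip punctuation from its ends before counting. The sample sentence is made up, and `string.punctuation` covers ASCII punctuation only, so treat this as a starting point rather than a complete clean-up.

```python
import string
from collections import Counter

# Hypothetical sample text standing in for the contents of `data`.
sample_text = "The cat saw the dog. The dog ran!"

normalized_tokens = []
for raw_token in sample_text.split():
    # Lowercase, then remove punctuation from both ends of the token.
    token = raw_token.lower().strip(string.punctuation)
    if token:  # skip tokens that consisted of punctuation only
        normalized_tokens.append(token)

normalized_counts = Counter(normalized_tokens)
print(normalized_counts.most_common(2))  # [('the', 3), ('dog', 2)]
```

Without normalization, "The" and "the" would be counted as two different tokens, and "dog." would not match "dog".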

### Pandas

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

https://pandas.pydata.org/

Pandas lets us define `DataFrames` that contain tabular data organized in columns and rows:
* both columns and rows may have labels (names for these columns / rows)
* every column has its data type (different columns may have different data types)

Pandas also lets us define `Series` that contain a series of data (one column). Every `Series` element may have a label (name).
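
A small sketch of both structures follows. The column names, row labels, and values are invented for illustration.

```python
import pandas

# Hypothetical data: two labeled columns and three labeled rows.
example_df = pandas.DataFrame(
    {"word": ["alice", "rabbit", "queen"], "count": [3, 2, 1]},
    index=["row_a", "row_b", "row_c"],
)

# Selecting one column of a DataFrame gives a Series.
count_series = example_df["count"]

print(example_df.dtypes)      # each column has its own data type
print(count_series["row_b"])  # Series elements can be looked up by label
```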

### Reading TSV files

The corpora we can work with are stored in archived TSV (tab-separated values) files:
https://github.com/CaptSolo/BSSDH_2023_beginners/tree/main/corpora

These files consist of rows (records) that contain one or more values separated by "Tab" characters.
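
To see what such a file looks like, the sketch below builds a tiny TSV document in memory and parses it. The column names mirror the `Date` and `Text` columns used later in this notebook; the two rows are invented.

```python
import io

import pandas

# "\t" is the Tab character that separates values; "\n" ends each row.
tsv_text = "Date\tText\n1920-01-01\tFirst article\n1920-01-02\tSecond article\n"

# io.StringIO lets Pandas read the string as if it were a file.
tsv_df = pandas.read_csv(io.StringIO(tsv_text), sep="\t")

print(tsv_df.shape)
```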

We will use the Pandas library to read a TSV file that contains a smaller version of the "lv_old_newspapers.zip" corpus: https://github.com/CaptSolo/BSSDH_2023_beginners/blob/main/corpora/lv_old_newspapers_5k.tsv

You may also use a TSV file for an English newspaper corpus (with slightly different column names): https://github.com/CaptSolo/BSSDH_2023_beginners/blob/main/corpora/en_old_newspapers_5k.tsv

In [None]:
import pandas

# common alternative:
# import pandas as pd
# this lets you write pd instead of pandas whenever you use pandas functionality

In [None]:
# Commented out code that will not work in Google Colab

## if you downloaded and unarchived the whole Github repository
## this is where you will find the lv_old_newspapers_5k.tsv file:

#filename = "../corpora/lv_old_newspapers_5k.tsv"

In [None]:
## read the tab-separated file (the "sep" parameter tells Pandas that values
## in the file are separated by the tab character)

#df_1 = pandas.read_csv(filename, sep="\t") # instead of df_1 we could use another name for our variable

#### Google Colab

Note: The above action (reading a local file) will fail if you execute it in Google Colab.

We then have two options:

1. Upload the file to Google Colab (remember that uploads are temporary), then read it just as you would on a local computer.

2. Read the file from the web: pass its web address (URL) to Pandas instead of a file path.

In [None]:
# Approach 1
# Assuming the file has been uploaded, it will be in the current directory

file_path = "lv_old_newspapers_5k.tsv"

df_1 = pandas.read_csv(file_path, sep="\t")

# print the first lines of the file
df_1.head()

In [None]:
# Approach 2: reading from a web address
url = "https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/corpora/lv_old_newspapers_5k.tsv"

# ... or you could use the English corpus instead:
# url = "https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/corpora/en_old_newspapers_5k.tsv"

df_2 = pandas.read_csv(url, sep="\t")

# print the first lines of the file
df_2.head()

In [None]:
# get the basic statistics of the dataset
df_2.describe()

### Let's continue working with the dataframe (containing a text corpus)

In [None]:
df_1 = df_2

# the size of the corpus:
print(len(df_1))

In [None]:
# select the Text column, show the first 10 entries

df_1["Text"][:10]

In [None]:
# we can get ALL of the texts from a pandas column as a Python list (one string per row)

list_of_rows = list(df_1.Text)
len(list_of_rows)

In [None]:
# let's see what we have in first 3 rows
list_of_rows[:3]

In [None]:
all_text = "\n".join(list_of_rows) # we can join all rows into one big string
# separating each document with a newline, but you could choose something else to join with

# "\n" means a newline symbol

all_text[:250]

### Reading archived files

Pandas can also read archived CSV and TSV files.

In [None]:
# filename_2 = "../corpora/lv_old_newspapers.zip"

## read the archived, tab-separated file ("compression" parameter tells
## Pandas that this is a ZIP archived file).

# df_2 = pandas.read_csv(filename_2, sep="\t", compression="zip")

Note: The commented-out code above (reading a local file) will fail if you execute it in Google Colab.

We will download the file from a remote web location instead (a GitHub repository in this case):

In [None]:
url_2 = "https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/corpora/lv_old_newspapers.zip"

df_2 = pandas.read_csv(url_2, sep="\t", compression="zip")
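
The same `compression="zip"` call can be tried without a network connection. The sketch below creates a tiny ZIP archive in a temporary folder and reads it back; the archive name, the file inside it, and the single data row are all made up. As far as I know, Pandas expects the ZIP archive to contain exactly one data file.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical archive holding a single small TSV file.
    zip_path = Path(tmp_dir) / "tiny_corpus.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("tiny_corpus.tsv", "Date\tText\n1920-01-01\tHello\n")

    # Pandas unpacks the archive and parses the TSV file inside it.
    zipped_df = pandas.read_csv(zip_path, sep="\t", compression="zip")

print(zipped_df)
```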

In [None]:
# the size of the corpus:

print(len(df_2))

In [None]:
# show the last 10 entries

df_2.tail(10)

In [None]:
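# Sorting the dataset (sort_values returns a new, sorted dataframe,
# so we store the result in a variable)
df_2_sorted = df_2.sort_values(by=["Date"])

# Minimum value of each column
df_2.min()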

In [None]:
# Maximum value
df_2.max()
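
One detail worth checking on a small example: `sort_values` returns a new dataframe and leaves the original unchanged. The three rows below are invented; only the `Date` and `Text` column names come from this notebook.

```python
import pandas

# Hypothetical mini-corpus with dates stored as ISO-formatted strings.
dates_df = pandas.DataFrame(
    {"Date": ["1921-05-01", "1919-03-15", "1920-11-30"], "Text": ["c", "a", "b"]}
)

# The sorted result is a new dataframe; dates_df keeps its original order.
sorted_df = dates_df.sort_values(by=["Date"])

print(sorted_df["Date"].tolist())
print(dates_df["Date"].min(), dates_df["Date"].max())
```

Dates written as year-month-day strings sort correctly even as plain text, which is why the minimum and maximum here are the earliest and latest dates.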

## Reading other formats

Pandas supports a wide variety of file formats.

The full list of formats is available here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

For example, to read Excel files you would use `my_dataframe = pandas.read_excel(filepath)`, where `filepath` is a string holding the file location or web address.
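
As one more illustration of the same pattern, the sketch below reads JSON with `pandas.read_json`. JSON is chosen here only because it needs no extra packages; the records are invented.

```python
import io

import pandas

# Hypothetical JSON data: a list of records, one per row.
json_text = '[{"word": "alice", "count": 3}, {"word": "rabbit", "count": 2}]'

# Wrapping the string in io.StringIO makes it behave like a file.
json_df = pandas.read_json(io.StringIO(json_text))

print(json_df)
```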

## Task - read data into a dataframe from file

We have 4 different corpora for you to use.

Web addresses:

* English - https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/corpora/en_old_newspapers_5k.tsv
* Estonian - https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/corpora/ee_old_newspapers.zip
* Latvian - https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/corpora/lv_old_newspapers.zip
* Ukrainian - https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/corpora/ua_old_newspapers.zip

Load one of them into a Pandas dataframe. Check its length and shape, sort it, and look at the first 15 and the last 20 entries.

# Text Mining with NLTK and Pandas

Source: [Text Mining and Sentiment Analysis with NLTK and Pandas in Python](https://www.kirenz.com/post/2021-12-11-text-mining-and-sentiment-analysis-with-nltk-and-pandas-in-python/text-mining-and-sentiment-analysis-with-nltk-and-pandas-in-python/)
* by Jan Kirenz
* license: [CC-BY-SA](https://creativecommons.org/licenses/by-sa/4.0/)

In [None]:
import pandas as pd

# Import some tweets from Barack Obama
df = pd.read_csv("https://raw.githubusercontent.com/kirenz/twitter-tweepy/main/tweets-obama.csv")
df.head(3)

In [None]:
df.info()

In [None]:
# Convert text to lowercase

df['text'] = df['text'].astype(str).str.lower()
df.head(3)

### Tokenization

* We use NLTK's `RegexpTokenizer`, which splits text into tokens using a regular expression
  * `\w+` matches one or more consecutive Unicode word characters
  * this includes most characters that can be part of a word in any language, as well as digits and the underscore.
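
To see what `\w+` picks out, the pattern can be tried with Python's built-in `re` module. The sample tweet below is invented, and the assumption that `RegexpTokenizer` with default settings returns the same matches is mine, so check it against the NLTK output.

```python
import re

# Hypothetical tweet text, already lowercased.
sample_tweet = "yes we can! visit https://example.org #hope_2008"

# Each match is a maximal run of word characters; punctuation is dropped.
tokens = re.findall(r"\w+", sample_tweet)

print(tokens)
# ['yes', 'we', 'can', 'visit', 'https', 'example', 'org', 'hope_2008']
```

Note how the URL is split into `https`, `example`, and `org`, which is why link fragments can show up among the most frequent tokens of a tweet corpus.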

In [None]:
from nltk.tokenize import RegexpTokenizer

# use a raw string (r'...') so that Python does not treat \w as an escape sequence
regexp = RegexpTokenizer(r'\w+')

df['text_token'] = df['text'].apply(regexp.tokenize)
df.head(3)

### Stopwords

In [None]:
import nltk

nltk.download('stopwords')

stopwords = nltk.corpus.stopwords.words("english")

# Extend the list with your own custom stopwords
my_stopwords = ['https']
stopwords.extend(my_stopwords)

In [None]:
# let's create a function to remove stopwords

def remove_stopwords(words):
    return [item for item in words if item not in stopwords]

# apply this function to every dataframe row:
df['text_token'] = df['text_token'].apply(remove_stopwords)

df.head(3)
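
The filtering step can be checked on a plain list without downloading anything. The sketch below uses a tiny, made-up stopword collection instead of the NLTK list; storing it as a `set` is a small design choice that keeps membership tests fast for long word lists.

```python
# Hypothetical stopwords for illustration (not the NLTK list).
demo_stopwords = {"the", "a", "is", "https"}


def remove_demo_stopwords(words):
    """Return the words that are not in demo_stopwords, keeping their order."""
    return [word for word in words if word not in demo_stopwords]


filtered = remove_demo_stopwords(["the", "economy", "is", "growing", "https"])
print(filtered)  # ['economy', 'growing']
```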

### Remove infrequent words

We first join the tokens in `text_token` back into one string per row, keeping only words that are longer than 2 letters.

In [None]:
# function for joining together words longer than 2 letters
def join_words(words):
    return ' '.join([item for item in words if len(item)>2])

# apply the function to the dataframe
df['text_string'] = df['text_token'].apply(join_words)

In [None]:
df[['text', 'text_token', 'text_string']].head(3)

### Continue working with the dataset

In [None]:
nltk.download('punkt')
# Note: newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'

# Join the text of all rows into one long string
all_words = ' '.join(df['text_string'])

# Tokenize all_words
tokenized_words = nltk.tokenize.word_tokenize(all_words)

In [None]:
tokenized_words[:10]

In [None]:
df['text_string'][:10]

In [None]:
# Create a frequency distribution

from nltk import FreqDist

fdist = FreqDist(tokenized_words)
fdist

In [None]:
# this function keeps only the words that appear more than once in the
# whole corpus and joins them into one string

def freq_words(words):
    return ' '.join([item for item in words if fdist[item] > 1 ])

# apply the function to the dataframe
df['text_string_fdist'] = df['text_token'].apply(freq_words)

In [None]:
df[['text', 'text_string', 'text_string_fdist']].head()
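
The frequency filter can be sketched with the standard library alone. Here `collections.Counter` stands in for NLTK's `FreqDist`, and the two token lists are invented. The `min_count` parameter is a hypothetical addition that makes the threshold explicit.

```python
from collections import Counter

# Hypothetical tokenized documents.
demo_docs = [["hope", "change", "vote"], ["hope", "vote", "economy"]]

# Count every token across the whole corpus.
demo_counts = Counter(token for doc in demo_docs for token in doc)


def keep_frequent(words, counts, min_count=2):
    """Join the words that occur at least min_count times in the corpus."""
    return " ".join(word for word in words if counts[word] >= min_count)


frequent_strings = [keep_frequent(doc, demo_counts) for doc in demo_docs]
print(frequent_strings)  # ['hope vote', 'hope vote']
```

Words that occur only once ("change", "economy") are dropped from both documents.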

In [None]:
df.head(10)

### Summary

We used Pandas to hold a corpus of tweets as it went through a series of transformations: lowercasing, tokenization, stopword removal, etc.
* after each transformation we added a new Pandas dataframe column to hold the transformed data
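
The column-per-step idea can be condensed into a few lines. This is a sketch on two made-up texts; it uses `re.findall` and a tiny hypothetical stopword set in place of the NLTK tools, and the column names `text_lower` and `text_clean` are chosen here only for illustration.

```python
import re

import pandas

# Hypothetical two-row corpus and stopword set.
demo_df = pandas.DataFrame({"text": ["Yes We Can", "We can do it"]})
demo_stop = {"we", "it"}

# Each step reads one column and stores its result in a new column.
demo_df["text_lower"] = demo_df["text"].str.lower()
demo_df["text_token"] = demo_df["text_lower"].apply(
    lambda text: re.findall(r"\w+", text)
)
demo_df["text_clean"] = demo_df["text_token"].apply(
    lambda words: [word for word in words if word not in demo_stop]
)

print(demo_df.columns.tolist())
print(demo_df["text_clean"].tolist())
```

Keeping every intermediate result in its own column makes it easy to compare the stages side by side.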

Further steps:
* see how to visualize data in the `Day 2 - Visualization` Jupyter notebook;
* see the [Text Mining and Sentiment Analysis with NLTK and Pandas in Python](https://www.kirenz.com/post/2021-12-11-text-mining-and-sentiment-analysis-with-nltk-and-pandas-in-python/text-mining-and-sentiment-analysis-with-nltk-and-pandas-in-python/) post.

---

This notebook by Uldis Bojārs is available under the [CC-BY-SA](https://creativecommons.org/licenses/by-sa/4.0/) license.