# Reading Data, Pandas

There are many file formats. How do we make sense of them all?

* There are archive and compression formats such as .zip, .rar, .7z and .tar; these hold other files.
* There are text formats such as .txt, .csv, .json and .tsv; these can be read by humans in a text editor.
* There are binary formats such as .exe, .jpg and .png; these are not human-readable.

### Reading text files

In this section we will read a simple text file.

In [None]:
filename = "alice_wonderland.txt"

In [None]:
# open the file in current directory for reading
file_1 = open(filename)

# read contents of the file
data = file_1.read()

# close the file
file_1.close()

In [None]:
# a better way (automatically closing the open file)

with open(filename) as file_1:   
    data = file_1.read()
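
Text files are stored in some character encoding, and `open()` falls back to a platform-dependent default when none is given. The sketch below shows how to pass `encoding="utf-8"` explicitly. The file `encoding_demo.txt` is a hypothetical sample created only so that the example runs on its own, and UTF-8 is an assumption about how your files are stored.

```python
import os
import tempfile

# Hypothetical sample file, created here only so the sketch is self-contained.
sample_path = os.path.join(tempfile.gettempdir(), "encoding_demo.txt")

# Write a short line containing non-ASCII (Latvian) characters.
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("Viņa lasīja grāmatu.\n")

# Passing the encoding explicitly avoids depending on the platform default.
with open(sample_path, encoding="utf-8") as f:
    sample_text = f.read()

print(sample_text)
```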

### Google Colab

Note: Reading a local file as shown above will fail in Google Colab, because the file is not present there.

We can read it from a remote web location (GitHub) instead. Let's use the `requests` library:

In [None]:
import requests

url = "https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/notebooks/" + filename

response = requests.get(url)
data = response.text
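
The two ways of loading the text (local file or web address) can be combined so that the same cell works both on a local computer and in Google Colab. This is only a sketch: `load_text` is a hypothetical helper name, and the 30-second timeout and the UTF-8 encoding are assumptions.

```python
import os
import requests

def load_text(local_path, url, timeout=30):
    """Return the text from local_path if that file exists, otherwise download it from url."""
    if os.path.exists(local_path):
        with open(local_path, encoding="utf-8") as f:
            return f.read()
    response = requests.get(url, timeout=timeout)
    # Raise an error for responses such as 404 instead of returning an error page as text.
    response.raise_for_status()
    return response.text

# Usage, with the filename and url variables defined earlier:
# data = load_text(filename, url)
```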

### Let's continue


In [None]:
# print the first 100 characters of the file
print(data[:100])

In [None]:
# split text into tokens (words)
words = data.split()

In [None]:
# count the number of tokens in text

print(len(words))

In [None]:
# print the first 50 tokens
print(words[:50])

### Counting word frequency

Here we will use Python's `Counter` class (from the `collections` module of the standard library) to count how often each word occurs in the text.

https://docs.python.org/3/library/collections.html#collections.Counter

In [None]:
from collections import Counter

In [None]:
c = Counter(words)

In [None]:
# print the 20 most common words (tokens)
print(c.most_common(20))

In [None]:
# a nicer way of printing counter results using a *for* loop

for token, count in c.most_common(20):
    print(f"{token}: {count}")


Notice how words may appear in both lowercase ("the") and capitalized ("The") forms. You may want to normalize the text by converting it all to lowercase and applying other clean-up steps.
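
One possible way to do such a clean-up is sketched below on a short made-up sentence. It lowercases each token and strips ASCII punctuation from both ends. Note that `string.punctuation` covers only ASCII characters, so typographic quotes and dashes would need extra handling.

```python
import string
from collections import Counter

# Made-up sample text, used only for illustration.
sample = "The cat saw the dog. The dog, surprised, saw THE cat!"

# Lowercase each token and strip punctuation from both of its ends.
tokens = [w.lower().strip(string.punctuation) for w in sample.split()]

# Drop tokens that consisted only of punctuation and are now empty.
tokens = [t for t in tokens if t]

counts = Counter(tokens)
print(counts.most_common(3))
```

After normalization, "The", "the" and "THE" are all counted as the same token.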

### Reading TSV files

Corpora that we could work with are located in archived TSV (Tab-separated-values) files:
https://github.com/CaptSolo/BSSDH_2023_beginners/tree/main/corpora

These files consist of rows (records) that contain one or more values separated by "Tab" characters.
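
To see what this structure looks like, here is a minimal sketch that parses a tiny TSV text with Python's built-in `csv` module. The two records are made-up sample data (not taken from the corpus file), and `"\t"` stands for the Tab character.

```python
import csv
import io

# Made-up TSV text: one header row and two records, values separated by tabs.
tsv_text = (
    "Language\tSource\tDate\n"
    "Latvian\tdiena.lv\t2012/01/10\n"
    "Latvian\tnra.lv\t2011/12/21\n"
)

# DictReader uses the first row as column names.
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
rows = list(reader)

for row in rows:
    print(row["Source"], row["Date"])
```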

We will use the Pandas library to read a TSV file that contains a smaller version of the "lv_old_newspapers.zip" corpus: https://github.com/CaptSolo/BSSDH_2023_beginners/blob/main/corpora/lv_old_newspapers_5k.tsv

You may also use a TSV file for an English newspaper corpus (with slightly different column names): https://github.com/CaptSolo/BSSDH_2023_beginners/blob/main/corpora/en_old_newspapers_5k.tsv

In [3]:
import pandas

# a common alternative:
# import pandas as pd
# with this alias you would write pd instead of pandas
# each time you need some pandas functionality

In [None]:
# if you downloaded and unarchived the whole Github repository 
# this is where you will find the lv_old_newspapers_5k.tsv file:

filename = "../corpora/lv_old_newspapers_5k.tsv"

In [None]:
# read the tab-separated file (the "sep" parameter tells Pandas that values
# in the file are separated with the "tab" character)

df_1 = pandas.read_csv(filename, sep="\t") # instead of df_1 we could use another name for our variable

#### Google Colab

Note: Reading a local file as shown above will fail in Google Colab, because the file is not present there.

We have two different approaches then:

1. Upload the file to Google Colab (remember that uploaded files are temporary) and read it just like you would on a local computer.

2. Read the file directly from the web: instead of a file path, pass its web address (URL).

In [None]:
# Approach 1
# Assuming file has been uploaded it will be found in current directory

file_path = "lv_old_newspapers_5k.tsv"

df_1 = pandas.read_csv(file_path, sep="\t")

# print the first lines of the file
df_1.head()

In [108]:
# Approach 2 reading from a web address 
url = "https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/corpora/lv_old_newspapers_5k.tsv"

# ... or you could use the English corpus instead:
# url = "https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/corpora/en_old_newspapers_5k.tsv"

df_2 = pandas.read_csv(url, sep="\t")

# print the first lines of the file
df_2.head()

Unnamed: 0,Language,Source,Date,Text
0,Latvian,rekurzeme.lv,2008/09/04,"""Viņa pirmsnāves zīmītē bija rakstīts vienīgi ..."
1,Latvian,diena.lv,2012/01/10,info@zurnalistiem.lv
2,Latvian,bauskasdzive.lv,2007/12/27,"Bhuto, kas Pakistānā no trimdas atgriezās tika..."
3,Latvian,bauskasdzive.lv,2008/10/08,Plkst. 4.00 Samoilovs / Pļaviņš (pludmales vol...
4,Latvian,diena.lv,2011/10/05,"CVK bija vērsusies Skaburska, lūdzot izskaidro..."


In [109]:
# get the basic statistics of the dataset
df_2.describe()

Unnamed: 0,Language,Source,Date,Text
count,4999,4999,4999,4999
unique,1,13,1428,4999
top,Latvian,diena.lv,2011/12/23,"""Viņa pirmsnāves zīmītē bija rakstīts vienīgi ..."
freq,4999,634,24,1


### Let's continue working with the dataframe (containing a text corpus)

In [None]:
# the size of the corpus:
print(len(df_1))

In [None]:
# select the Text column, show the first 10 entries

df_1["Text"][:10]

In [None]:
# we can convert a pandas column into a Python list of strings (one string per row)

list_of_rows = list(df_1.Text)
len(list_of_rows)

In [None]:
# let's see what we have in first 3 rows
list_of_rows[:3]

In [None]:
# join all rows into one big string, separating the documents with a newline
# ("\n" is the newline character; you could choose a different separator)
all_text = "\n".join(list_of_rows)

all_text[:250]
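
Joining with `"\n".join(...)` requires every row to be a string, so a row with a missing value would cause an error. The sketch below, using a small made-up dataframe (`df_join_demo` is a hypothetical name), shows two ways of joining a column while skipping missing values.

```python
import pandas

# Made-up dataframe with one missing value in the Text column.
df_join_demo = pandas.DataFrame({"Text": ["first document", None, "second document"]})

# Option 1: drop missing values, then join with the plain Python string method.
joined_a = "\n".join(df_join_demo["Text"].dropna())

# Option 2: let pandas concatenate the column; missing values are omitted by default.
joined_b = df_join_demo["Text"].str.cat(sep="\n")

print(joined_a == joined_b)
```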

### Reading archived files

Pandas can also read archived CSV and TSV files.

In [None]:
# filename_2 = "../corpora/lv_old_newspapers.zip"

## read the archived, tab-separated file ("compression" parameter tells
## Pandas that this is a ZIP archived file).

# df_2 = pandas.read_csv(filename_2, sep="\t", compression="zip")

Note: The commented-out code above reads a local file, so it will fail if you execute it in Google Colab.

We will download the archive from a remote web location instead (a GitHub repository in this case):

In [4]:
url_2 = "https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/corpora/lv_old_newspapers.zip"

df_2 = pandas.read_csv(url_2, sep="\t", compression="zip")
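
If you want to try this without downloading the full corpus, the following sketch creates a tiny ZIP archive in a temporary directory and reads it back. The archive name `demo_corpus.zip` and its contents are made up for illustration. When the path ends in `.zip`, `read_csv` can infer the compression from the extension, so the `compression` parameter is left out here; the archive has to contain a single data file.

```python
import os
import tempfile
import zipfile

import pandas

# Create a small ZIP archive holding one made-up TSV file.
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "demo_corpus.zip")

with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("demo_corpus.tsv", "Source\tText\ndiena.lv\tfirst\nnra.lv\tsecond\n")

# The compression is inferred from the ".zip" extension of the path.
df_zip_demo = pandas.read_csv(zip_path, sep="\t")

print(df_zip_demo.shape)
```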

In [5]:
# the size of the corpus:

print(len(df_2))

319428


In [6]:
# show the last 10 entries

df_2.tail(10)

Unnamed: 0,Language,Source,Date,Text
319418,Latvian,bdaugava.lv,2010/01/16,Ceturtdien no rajona padomes ēkas tika svinīgi...
319419,Latvian,nra.lv,2011/12/21,"AFP vēsta, ka naktī uz otrdienu, jau piekto na..."
319420,Latvian,db.lv,2011/12/02,TOP 500 ir vienīgais ikgadējais izdevums Latvi...
319421,Latvian,diena.lv,2009/12/21,ka pati visu mūžu bijusi saistīta ar šo jomu. ...
319422,Latvian,la.lv,2011/12/08,"Prakse liecina, ka tādos gadījumos tiesu izpil..."
319423,Latvian,ziemellatvija.lv,2008/01/30,Beigu beigās I. Klempere kopā ar dēlu mājās de...
319424,Latvian,db.lv,2012/01/03,"Vienkāršā valodā tas nozīmē, ka investori par ..."
319425,Latvian,la.lv,2011/08/27,– Visi mūsu projekti ir notikuši sadarbībā ar ...
319426,Latvian,ziemellatvija.lv,2007/03/12,"Pole atzina, ka par šo ziņojumu VM saņēmusi li..."
319427,Latvian,bauskasdzive.lv,2011/07/07,"Trešdienas, 6. jūlija, vakarā projekta vadītāj..."


In [7]:
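# Sorting the dataset by date
# (sort_values returns a new, sorted dataframe; df_2 itself is not changed)
df_sorted = df_2.sort_values(by=["Date"])

# Minimum value of each column
df_2.min()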

Language                                              Latvian
Source                                        bauskasdzive.lv
Date                                               2005/04/27
Text        ! Apsēdies, atpūties, centies noorientēties ap...
dtype: object

In [8]:
# Maximum value
df_2.max()

Language                                              Latvian
Source                                                  zz.lv
Date                                               2012/01/14
Text        ♦ virsseržants Pēteris Tetērins, jaunākais ins...
dtype: object
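
The Date column holds text, so the minimum and maximum above are found by comparing strings. That happens to match chronological order here because the dates are written as year/month/day with leading zeros. A minimal sketch of converting such a column to real dates before sorting is shown below; the small dataframe `df_dates` is made up, and the format string `"%Y/%m/%d"` assumes every date follows the pattern seen in the output above.

```python
import pandas

# Made-up dataframe with dates stored as text, as in the corpus.
df_dates = pandas.DataFrame({
    "Source": ["a.lv", "b.lv", "c.lv"],
    "Date": ["2011/12/23", "2005/04/27", "2012/01/14"],
})

# Convert the text column to datetime values.
df_dates["Date"] = pandas.to_datetime(df_dates["Date"], format="%Y/%m/%d")

# sort_values returns a new dataframe; df_dates keeps its original row order.
df_dates_sorted = df_dates.sort_values(by="Date")

print(df_dates_sorted["Date"].min(), df_dates_sorted["Date"].max())
```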

##  Reading other formats

Pandas supports a wide variety of file formats.

The full list of formats is available here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

For example, to read Excel files you would use `my_dataframe = pandas.read_excel(filepath)`,
where `filepath` is a string with the file location or web address.

## Task - read data into a dataframe from file

We have 4 different corpora for you to use.

Web addresses:

* English - https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/corpora/en_old_newspapers_5k.tsv
* Estonian - https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/corpora/ee_old_newspapers.zip
* Latvian - https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/corpora/lv_old_newspapers.zip
* Ukrainian - https://raw.githubusercontent.com/CaptSolo/BSSDH_2023_beginners/main/corpora/ua_old_newspapers.zip

Load one of them into a dataframe. Check its length and shape, sort it, and look at the first 15 entries and the last 20 entries.