# Data explanation (EXISTING)
This is the first notebook in the project.

In this notebook, the first section explores the dataset and the second section walks through how the dataset was gathered and cleaned.

## Table of contents
- [Intro to the dataset](#Intro-to-the-dataset)
    - [The extracted subdirectory](#The-extracted-subdirectory)
    - [The texts subdirectory](#The-texts-subdirectory)
    - [The chunked subdirectory](#The-chunked-subdirectory)
- [Explanation of data gathering pipeline](#Explanation-of-data-gathering-pipeline)
    - [1: Networking setup](#1:-Networking-setup)
    - [2: Find the data files to download](#2:-Find-the-data-files-to-download)
    - [3: Download raw wikipedia XML files](#3:-Download-raw-wikipedia-XML-files)
    - [4: Remove the XML and wikipedia formatting](#4:-Remove-the-XML-and-wikipedia-formatting)
    - [5: Get the article text](#5:-Get-the-article-text)
    - [6: Cleaning to remove punctuation, extra spaces](#6:-Cleaning-to-remove-punctuation,-extra-spaces)
    - [7: Chunking and shuffling](#7:-Chunking-and-shuffling)

In [1]:
import os # for file operations
import pandas as pd
import wikipedia # gets list of all languages on wikipedia
import paramiko

---
# Intro to the dataset

In [2]:
path = '/Users/snc/Documents/wikidata/'
os.listdir(path)

FileNotFoundError: [Errno 2] No such file or directory: '/Users/snc/Documents/wikidata/'

On the highest level of my data directory, there are three subdirectories. 

The extracted subdirectory contains the raw text of all articles for all languages' wikipedias, untouched except for being extracted from the XML format of the Wikipedia server dumps. Each language is its own .txt file in this directory. 

In texts, I have the data once all the articles' text bodies have been concatenated together into one very long string, with punctuation, URLs, formatting characters and so on stripped away. Each language has its own file(s) in the texts subdirectory. Some languages' wikipedias were too big to be held in just one file and subsequently transferred to another computer (I would get memory errors), so many languages are split into several installments (the number is proportional to the size of that wikipedia).

Lastly, the chunked subdirectory stores a subset of the texts data for more convenient use when trying out machine learning models. Here, each language is represented by its own file. I randomly selected 500-character chunks from each language's files in the texts subdirectory, then shuffled the order of the chunks. For languages with more than 10000 contiguous 500-character chunks, I capped the number of chunks at 10000. However, smaller wikipedias were too small to have all 10000 chunks.

## The extracted subdirectory
### The files
Here I'll make a dataframe of all languages/files in extracted and the file size for each language in megabytes (recall that a gigabyte is 1000 megabytes).

In [None]:
langnames = wikipedia.languages() # look up the code -> name dictionary once, not on every iteration
extracted_list = []
for f in os.listdir(path+'extracted/'):
    if f.startswith('.'): continue # skip system files
    fname = f[:f.index('.')] # language code is the part before the extension
    lang = langnames[fname]
    extracted_list.append((fname, lang, f, os.path.getsize(path+'extracted/' + f)/1000000))
files = pd.DataFrame(extracted_list, columns=['code', 'lang', 'fname', 'exfsize_MB'])
files.set_index('code', inplace=True)
files.head()

In [None]:
files.shape

248 files/languages in the extracted subdirectory - each language in the dataset is represented by its own file

In [None]:
files.exfsize_MB.sum()

This subdirectory is taking up about 85000 megabytes or 85 gigabytes on my computer

### Example for a language
Here is an example file, for Greek (language code el):

In [None]:
f = open(path+'extracted/el.txt', 'r')
exel = f.readlines()
f.close()

In [None]:
len(exel) # each title/header and each paragraph of each article is its own line

In [None]:
exel[0:10] # first 10 elements

## The texts subdirectory
### The files
Since languages can be split up across multiple files, the naming scheme is as follows when dividing a language with code 'lang' into n parts: the first file is 'lang0.txt', the second is 'lang1.txt', and so on up to 'lang(n-1).txt'.
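The naming scheme can be sketched as follows. This is only an illustration: the notebook does not show the code that did the splitting, so the helper names `installment_names` and `split_evenly` are hypothetical, and the even split is an assumption based on the files for one language being of equal size.

```python
def installment_names(lang, n):
    """Return the file names used when a language is split into n parts."""
    return [f"{lang}{i}.txt" for i in range(n)]


def split_evenly(text, n):
    """Split text into n contiguous pieces of (nearly) equal length."""
    size = -(-len(text) // n)  # ceiling division so no characters are lost
    return [text[i * size:(i + 1) * size] for i in range(n)]


names = installment_names("en", 3)
parts = split_evenly("abcdefghij", 3)
print(names)  # one file name per installment
print(parts)  # the pieces that would be written to those files
```

Joining the pieces back together in order gives the original string, which is what lets the installments be treated as one long text later on.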

In [None]:
files['tf_totalsize_MB'] = 0.0 # total size for each language in texts subdirectory
files['tf_count'] = 0 # number of files dedicated to that language
for f in os.listdir(path + 'texts/'):
    if f.startswith('.'): continue
    lang = f[:f.index('.')]
    # remove installment number to get lang code
    lang = ''.join([i for i in lang if not i.isdigit()])
    # .loc avoids chained assignment, which may fail to update files itself
    files.loc[lang, 'tf_count'] += 1
    files.loc[lang, 'tf_totalsize_MB'] += os.path.getsize(path+'texts/'+f)/1000000
files['tf_avgsize_MB'] = files['tf_totalsize_MB'] / files['tf_count']
files.head()

For instance, here is the information for the English dataset:

In [None]:
files.loc['en']

The `tf_count` entry tells you there are 10 files storing the English wikipedia. The total size of these files (from the `tf_totalsize_MB` entry) is 13384 megabytes or about 13 gigabytes. The average file size for each of the 10 English files (since I split the data evenly among the files) is 1338 megabytes or 1.3 gigabytes.

Now, you might wonder why I was able to create the 14 gigabyte file to represent English in the extracted directory, but I got a memory error when trying to create one 13 gigabyte file to represent English in the texts directory. I have that question too. Maybe it's because the texts files have no linebreaks while the extracted ones do, making the extracted ones easier to split up when writing to file and transferring to another computer. 

In [None]:
files['tf_count'].sum()

In [None]:
files['tf_totalsize_MB'].sum()

The texts directory contains 567 files in total (divided among 248 languages, that is an average of 2.3 files per language). The total space taken up by this subdirectory is about 80.5 gigabytes on my computer, so just slightly smaller than the extracted subdirectory.

### Example for a language
This is the first file for German

In [None]:
f = open(path + 'texts/de0.txt', 'r')
tede = f.read()
f.close()

The file's contents are just a lot of characters in one continuous line. Other than the very end of the file, the newline character '\n' is not present:

In [None]:
len(tede)

In [None]:
'\n' in tede[:-1]

First 1000 characters:

In [None]:
tede[:1000]

## The chunked subdirectory
### The files

In [None]:
files['fchunk_size_MB'] = 0.0 # file size of chunked file (float so sizes are not truncated)
files['nchunks'] = 0 # number of chunks (lines) in the file
for f in os.listdir(path+'chunked/'):
    if f.startswith('.'): continue
    lang = f[:f.index('.')]
    file = open(path+'chunked/'+f, 'r')
    chunks = file.readlines()
    file.close()
    # .loc avoids chained assignment, which may fail to update files itself
    files.loc[lang, 'nchunks'] = len(chunks)
    files.loc[lang, 'fchunk_size_MB'] = os.path.getsize(path+'chunked/'+f)/1000000
files.head()

As shown in this histogram of the number of chunks per language, most languages have the full 10000 chunks but a non-negligible number do not.

In [None]:
import matplotlib.pyplot as plt # needed for savefig; not imported at the top of the notebook
files['nchunks'].plot.hist(bins=30)
plt.savefig('figs/nchunks.png')

The average number of chunks is 7719 while the median is 10000. The minimum number of chunks is 343.

In [None]:
files['nchunks'].describe()

The language with only 343 chunks is Lakku (лакку)

In [None]:
files[files['nchunks']==343]

153 out of the 248 languages (so 62%) have a full 10000 chunks.

In [None]:
files['fchunk_size_MB'].sum()

These files take up 1366 megabytes or 1.4 gigabytes on my computer - much smaller than the extracted or texts subdirectories!

### Example for a language
Here are the Turkish chunks

In [None]:
f = open(path+'chunked/tr.txt', 'r')
chtr = f.readlines()
f.close()

Each line is a chunk. Turkish has 10000 chunks:

In [None]:
len(chtr)

Each line is 501 characters long (500 characters plus the newline character)

In [None]:
sum([len(l) for l in chtr])/len(chtr) # average line length

In [None]:
sum([len(l)==501 for l in chtr]) # this is the number of lines in the file that have exactly
# 501 characters. you see it's equal to the number of lines in the file

These are the first five chunks:

In [None]:
chtr[:5] 

---
# Explanation of data gathering pipeline
Now that the dataset is introduced, I'll explain how I gathered it. To assemble this full, final copy of the dataset, I ran my data-gathering script on CRC over a period of days. The script is located in the datagather/crc directory of this repository.

I'm not sure how long the script would take to run uninterrupted, because it would crash for various reasons (e.g., writing large strings caused memory errors, or the networking was set up wrong) or my session would expire, so I would need to fix the issue and restart it where it left off. Sometimes I also had up to six copies of the script running at a time over the weekend when CRC had a lot of open resources, each collecting languages starting with a different letter of the alphabet.

## 1: Networking setup
CRC doesn't have enough storage space for the full dataset, so I needed to periodically send data somewhere else and delete all of one language's files from CRC before moving onto the next language. I would send files to an old laptop I have for storage (see second progress report for more details). 

Here is an outline of the networking aspects involved in my data gathering/cleaning:

Outside script

1. Get an internet address (or technically, publicly expose port 22) for the storage laptop by running `./ngrok tcp 22` in shell and leaving it running
2. On my normal laptop, connect to Pitt's VPN (required to access CRC)
3. ssh to crc.h2p.pitt.edu to access CRC
4. Pass storage laptop's username, address and port to the script as command line arguments

Inside script

1. Prompt user for password, store it for future use in the runtime
2. Establish a test sftp connection with storage laptop to verify password (if fails, prompt user again)
3. After downloading a language's raw dump and extracting XML, send backup to device by opening a new sftp connection, transferring file, then closing connection (forms the extracted subdirectory)
4. Transfer another copy of the data after passing through the cleaning steps (forms the texts subdirectory)
5. Transfer another copy of the data after the chunking step (forms the chunked subdirectory)
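The first two "inside script" steps (prompt for a password, verify it, prompt again on failure) can be sketched like this. It is a sketch under assumptions, not the script's actual code: `ask_until_valid` and its `check` argument are hypothetical names, and in the real script `check` would presumably wrap a test sftp connection to the storage laptop. Here a stand-in check is used so the example runs without any network.

```python
import getpass


def ask_until_valid(check, prompt=getpass.getpass, max_tries=3):
    """Prompt for a password until check(password) succeeds, then return it."""
    for _ in range(max_tries):
        pwd = prompt('sftp password: ')
        if check(pwd):
            return pwd  # keep it in memory for the later transfers
    raise RuntimeError('too many failed password attempts')


# Stand-ins for the demonstration: a scripted "user" and a fake verification.
attempts = iter(['wrong', 'secret'])
pwd = ask_until_valid(check=lambda p: p == 'secret',
                      prompt=lambda _message: next(attempts))
print('password accepted after a retry:', pwd == 'secret')
```

Verifying the password once up front means a typo is caught before hours of downloading, rather than at the first transfer.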

What follows is an example that creates a very small example helloworld.txt file, then transfers it to another computer. I didn't bother with the VPN or ngrok for this simple example.

In [None]:
import paramiko
import getpass

In [None]:
# create the test file
fname = 'helloworld.txt'
f = open(fname, 'w')
f.write('hello world\n') # write() is the right call for a single string
f.close()

In [None]:
username = 'snc'
port = 22
address = '10.0.0.25' # this is the computer's PRIVATE IP address: the address your internet
# router uses to distinguish this computer from any other device on your own network so 
# you don't receive all the webpages requested by your brother and grandma or vice versa.
# You can enter the IP address of your own computer here and use your laptop as both the source 
# and the destination if you want to try it out. Then, it would be a bit like emailing or calling
# yourself in that the email is sent to the sender or the call is sent to the caller.

# to enable this sftp connection on a mac, go to System Preferences > Sharing > 
# Enable "Remote Login". There, it will display the IP address to use. Your source/local
# and destination/remote computers must be on the same wifi network unless you learn to use
# ngrok or something similar. I don't know how to do it on other operating systems, 
# but there are probably tutorials online if it isn't enabled by default.

In [None]:
pwd = getpass.getpass(prompt='sftp password: ') # password for your username 
# on destination/remote computer

In [None]:
# ESTABLISH CONNECTION
client = paramiko.client.SSHClient()
client.load_system_host_keys() # this loads known host keys from the user's known_hosts file
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(address, port=port, username=username, password=pwd)
sftp = client.open_sftp() # type SFTPClient

# TRANSFER THE FILE
sftp.put('./'+fname, fname) #src, dest path/filename
print('transferred', fname)

# CLOSE THE CONNECTION
client.close()

If you run this yourself, look in your home (~) directory for the transferred file at the destination/remote computer.

## 2: Find the data files to download
The Wikipedia API offers a dictionary of every language that has, or used to have, a Wikipedia edition. It provides both the language's abbreviation/code and its name. My code iterates through this dictionary:

In [None]:
langsdict = wikipedia.languages()
langsdict

When I have the language's abbreviation `abv`, I can construct the download link for the wikipedia dump as: 

'*https://dumps.wikimedia.org/*' + `abv` + '*wiki/latest/*' + `abv` + '*wiki-latest-pages-articles.xml.bz2*'

Once I download a dump, I store it on my computer, and on later runs I check whether the file is already present locally before re-downloading. This way I don't need to keep stressing Wikipedia's servers each time I run my program. Some wikipedias are closed/no longer maintained and throw an error if I try to download them, so I just skip those languages. I also skip languages whose raw dump file is smaller than 1MB because I deem them too small; I'll discuss why later.
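As a minimal sketch of the link construction and the two skip checks described above: the helper names (`dump_url`, `already_downloaded`, `too_small`) are hypothetical, and I am assuming the 1MB cutoff means 1,000,000 bytes, matching the decimal megabytes used elsewhere in this notebook.

```python
import os

MIN_DUMP_BYTES = 1_000_000  # assumed meaning of the 1MB cutoff


def dump_url(abv):
    """Build the download link for a language code, following the pattern above."""
    return ('https://dumps.wikimedia.org/' + abv + 'wiki/latest/'
            + abv + 'wiki-latest-pages-articles.xml.bz2')


def already_downloaded(local_path):
    """True if the dump is on disk, so the download can be skipped."""
    return os.path.exists(local_path)


def too_small(local_path):
    """True if a downloaded dump falls under the size cutoff."""
    return os.path.getsize(local_path) < MIN_DUMP_BYTES


url = dump_url('el')
print(url)
```

For Greek this prints `https://dumps.wikimedia.org/elwiki/latest/elwiki-latest-pages-articles.xml.bz2`.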

## 3: Download raw wikipedia XML files
The script downloads these automatically. They come in a compressed .xml.bz2 format.
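If you want to peek inside a compressed dump without fully decompressing it, Python's standard `bz2` module can read it as text. This is just a sketch for inspection, not part of my pipeline (the extractor tool handles decompression itself). The function name `head_of_dump` is hypothetical, and the tiny file written here is a stand-in for a real dump.

```python
import bz2
import os
import tempfile
from itertools import islice


def head_of_dump(path, n=5):
    """Return the first n lines of a bz2-compressed text file."""
    with bz2.open(path, 'rt', encoding='utf-8') as f:
        return [line.rstrip('\n') for line in islice(f, n)]


# Write a tiny stand-in file so the example is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.xml.bz2')
with bz2.open(sample_path, 'wt', encoding='utf-8') as f:
    f.write('<mediawiki>\n<page>\n<title>Example</title>\n</page>\n</mediawiki>\n')

first_lines = head_of_dump(sample_path, n=2)
print(first_lines)
```

Because the file is read lazily, only the start of the archive is decompressed, which matters for the multi-gigabyte dumps.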

In [None]:
dumpspath = './datagather/dumps/' # where I'm storing the raw files
dumpsraw = dict()
for f in os.listdir(dumpspath):
    if f.startswith('.'): continue
    lang = f[:f.index('.')-4] # isolate just name of the language
    dumpsraw[lang] = [langsdict[lang], os.path.getsize(dumpspath + f)]

Dataframe of the languages I've gotten - 248 languages:

In [None]:
dumps = pd.DataFrame.from_dict(dumpsraw, columns=['name', 'fsize'], orient='index')
print(dumps.shape)
dumps.head()

This is the number of bytes for these raw dump files alone, which is about 76 gigabytes:

In [None]:
dumps['fsize'].sum() 

These are the languages I skipped for either of the reasons I mentioned:

In [None]:
excluded = [langsdict[l] for l in langsdict.keys() if l not in dumps.index]
print(len(excluded), 'languages excluded:\n', ', '.join(excluded))

But these are the ones I was able to get:

In [None]:
print(dumps.shape[0], 'languages included:\n', ', '.join(dumps['name']))

Largest wikipedias by raw dump file size:

In [None]:
dumps.sort_values('fsize', ascending=False).head(10) # largest wikis included

English alone is about 18 GB! The distant second is German at about 6 GB.

Smallest file sizes that I included (since cutoff was 1MB):

In [None]:
dumps.sort_values('fsize').head(10) # smallest wikis included

Plot of the filesizes for the languages I included. You can see most are quite small, while a few are very big.

In [None]:
dumps['fsize'].plot.hist(bins=30) # many small wikis, just a few big wikis

Here is an example of Qaraqalpaqsha's raw dump file, if I manually decompress it (I don't store these anywhere, it's all handled by the extractor tool I use):

In [None]:
f = open('./data_samples/kaa-raw.xml', 'r')
exdump = f.readlines()
f.close()
exdump[:20] # start of file

In [None]:
exdump[3745:3765] # random middle part showing part of an article 

One of the languages that ultimately gets excluded is Qafár af. Here I'll read in its expanded dump file:

In [None]:
f = open('./data_samples/aa-raw.xml')
aaraw = f.readlines()
f.close()

In [None]:
aaraw[:20]

In [None]:
len(aaraw)

At first glance, this language's XML file has 5526 lines, which sounds like enough to at least get a good sampling of what the language looks like. But then I stripped all the XML and extra Wikipedia information off.

## 4: Remove the XML and wikipedia formatting
Wikipedia formatting is stuff like the particular templates used in the articles, the random stuff you see at the very start of the above raw XML file, etc.

In [None]:
f = open('./data_samples/aa-extracted.txt')
aa = f.readlines()
f.close()

Here is the *entire* file for Qafár af (the example excluded language) without the XML and extra Wikipedia formatting:

In [None]:
aa

As you can see, the output of the tool I used is one json object per page/article. From 5526 lines, Qafár af is left with just two pages, one of which has no text at all. The other page is just a sentence long; it seems to be a message notifying users that this wikipedia is closed and not maintained. This is why I excluded languages whose raw dump files are smaller than 1MB.

Returning to Qaraqalpaqsha, which I showed the raw dump for earlier, here is its Wikipedia with the XML and Wikipedia formatting removed:

In [None]:
f = open('./data_samples/kaa-extracted.txt', 'r')
kaa = f.readlines()
f.close()

In [None]:
kaa[:3] # first 3 json objects

The encoding is causing some weird display issues for some characters here. It's strange because the file shows up fine in my text editor and in zsh with the `head` command. Either way, it gets corrected as it travels through the rest of my pipeline, and the final file for this language, which I'll show later, displays fine.

The tool I'm using to go from raw dump to "clean" json items is called WikiExtractor. The current version on GitHub is not stable, but it's here: https://github.com/attardi/wikiextractor

I forked it and made a stable version here: https://github.com/soCromp/wikiextractor

I needed to use this specific tool for a project in another class and spent a lot of time fixing it and figuring out how to run it. It's designed specifically for processing Wikipedia data en masse and already "knows" Wikipedia formatting and how to find certain article attributes. As a result, I decided to just use it again here rather than learn an entirely different tool like Beautiful Soup. I used a similar Java library (jsoup) a couple of years ago and remember it taking a while to get the data cleaned the way I wanted.

The tool runs pretty quick - Afrikaans takes 1-2 minutes and English maybe 5. I also pass the text through some shell regex like `sed` at this point for a little extra cleaning. Shell regex seems to run faster than Python regex?

## 5: Get the article text
The next step in the pipeline is doing some encoding correction after the XML and Wikipedia formatting is extracted. Then, I grab just article text from the json objects.
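Grabbing the article text can be sketched like this. I am assuming one JSON object per line with the article body stored under a `text` key; the function name `article_texts` and the sample records are made up for illustration, and the encoding-correction step is not shown.

```python
import json


def article_texts(lines):
    """Return the non-empty article bodies from lines of JSON objects."""
    texts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        page = json.loads(line)
        text = page.get('text', '')
        if text.strip():  # drop pages that have no text at all
            texts.append(text)
    return texts


sample = [
    '{"id": "1", "title": "Voorbeeld", "text": "Eerste paragraaf.\\nTweede paragraaf."}\n',
    '{"id": "2", "title": "Leeg", "text": ""}\n',
]
texts = article_texts(sample)
print(len(texts), 'article(s) with text')
```

Pages with empty bodies, like one of the two Qafár af pages shown earlier, are dropped at this point.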

In [None]:
f = open('./data_samples/af-articles.txt')
aftexts = f.readlines()
f.close()

In [None]:
print(''.join(aftexts[:4]))

As you can see, it still has line breaks, punctuation, numbers, etc.

After this step, I was planning to tokenize. I added this in and started to run the script on all the data, but disliked the results I was getting. I don't know what tool I could use to tokenize different languages since there are so many different rules and probably a lot of things I don't know about how to tokenize the languages I don't speak. So, I just took the tokenizing step out at least for now.

## 6: Cleaning to remove punctuation, extra spaces
Next, I pass the text through a filter to remove all punctuation characters and replace sequences of multiple spaces in a row with just one space. I use a function from Python's unicodedata library, unicodedata.category(), to detect any Unicode punctuation character in order to catch even punctuation like the Japanese period "。"
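A minimal sketch of this filter is below. It is not the exact code from my script: the function name `strip_punctuation` is hypothetical, and it assumes that "punctuation" means every character whose Unicode general category starts with `P`.

```python
import re
import unicodedata


def strip_punctuation(text):
    """Remove Unicode punctuation, then collapse runs of spaces into one."""
    no_punct = ''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith('P')  # Po, Pd, Ps, Pe, ...
    )
    return re.sub(r' {2,}', ' ', no_punct)


cleaned = strip_punctuation('Hallo,  wêreld! これは文です。')
print(cleaned)
```

The comma, the exclamation mark and the Japanese period are all in category `Po`, so they are removed by the same test, while accented letters like "ê" are left alone.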

In [None]:
f = open('./data_samples/af-text.txt')
afclean = f.readlines()
f.close()

In [None]:
afclean[0][:402]

Again, there is some strange encoding issue when I display it in Jupyter. You can see it in the spaces between where the numbers were. It doesn't occur when I display it in Atom, Sublime or TextEdit.

## 7: Chunking and shuffling
Originally I had the idea to make each line a sentence or a certain number of words. Then I realized not all languages separate words or sentences the same way, and decided the most language-neutral approach would be to simply make each line 500 characters long. I refer to each line as a chunk, and simply divide the long block of text every 500 characters. I'm sure there are better ways to handle this, but I chose this simple way for now at least. Then, if there are more than 10000 chunks, I randomly sample 10000 chunks to prevent any larger languages from being overrepresented in my data. This number is somewhat arbitrary so I might revisit it later. Lastly, I shuffle the chunks.
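Here is a sketch of the chunking, sampling and shuffling. The function name `make_chunks` and the `seed` parameter are hypothetical. The cap is left as a parameter, with the default set to the 10000 chunks per language seen in the dataset overview. I am also assuming that a trailing piece shorter than 500 characters is dropped, since every line in the chunked files is exactly 500 characters plus a newline.

```python
import random


def make_chunks(text, chunk_len=500, max_chunks=10000, seed=None):
    """Cut text into fixed-length chunks, cap their number, and shuffle them."""
    rng = random.Random(seed)
    n_full = len(text) // chunk_len  # an incomplete final piece is discarded
    chunks = [text[i * chunk_len:(i + 1) * chunk_len] for i in range(n_full)]
    if len(chunks) > max_chunks:
        chunks = rng.sample(chunks, max_chunks)  # random subset without replacement
    rng.shuffle(chunks)
    return chunks


demo_text = 'ab' * 1300  # 2600 characters: 5 full chunks plus 100 left over
demo_chunks = make_chunks(demo_text, max_chunks=3, seed=0)
print(len(demo_chunks), [len(c) for c in demo_chunks])
```

Writing each chunk followed by `'\n'` then gives one chunk per line in the output file.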

In [None]:
f = open('./data_samples/af-chunks.txt', encoding='utf-8')
af = f.readlines()
f.close()

First five Afrikaans chunks:

In [None]:
af[:5]

In [None]:
f = open('./data_samples/en-chunks.txt', encoding='utf-8')
en = f.readlines()
f.close()

First 5 English chunks:

In [None]:
en[:5]

Some of the encoding issues from files I showed you mid-pipeline do not seem present in these written out files.

In [None]:
f = open('./data_samples/kaa-chunks.txt', encoding='utf-8')
kaa = f.readlines()
f.close()

The weird encoding errors are now gone for Qaraqalpaqsha. Here are its first 5 chunks:

In [None]:
kaa[:5]

In [None]:
f = open('./data_samples/ml-chunks.txt')
ml = f.readlines()
f.close()

There are still some weird encoding issues with Jupyter specifically. In Malayalam, Jupyter can't display this right: ണ്‌ 

Even when I paste it into this markdown cell, it shows me a red dot next to this character when I'm in edit mode, I guess to say it doesn't want to show the diacritic mark. And in the following code printout, it shows the character as \u200c instead.

In [None]:
ml[:5]

I'm not sure how much I can do about that issue; I know the correct "information" for that character is in there since these files are displaying correctly on my computer. I've tried reading the file in with different encoding schemes or doing the `str.encode('utf-8').decode('raw_unicode_escape')` trick.
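One possible explanation for the `\u200c` in the printout, which I have not verified against the actual file: U+200C is the zero-width non-joiner, an invisible formatting character (Unicode category `Cf`). When a list of strings is displayed, Python shows each string's `repr()`, and `repr()` writes non-printable characters like this one as escape sequences even though the string itself is intact. The sample string below is written with escapes and is only a stand-in for the real data.

```python
import unicodedata

zwnj = '\u200c'                    # zero-width non-joiner
sample = '\u0d23\u0d4d' + zwnj     # Malayalam letter + virama + ZWNJ

print(unicodedata.category(zwnj))  # the character's Unicode category
print(zwnj.isprintable())          # whether repr() will leave it as-is
print([sample])                    # list display uses repr(), so the escape appears
print(sample)                      # printing the string itself sends the real character
```

If this is the cause, `print(ml[0])` instead of `ml[:5]` should avoid the escape, though whether the browser then renders the glyph correctly is a separate question.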

As much as I want to fix it, it potentially is a deeply embedded issue in my browser or Jupyter or even my operating system. Knowing there's plenty of other stuff to do with the project, it's probably important that I prioritize at this point.