# 8: Text Files

* Basic reading & writing
* Encodings
* JSON
* Multiple files
* Building a simple corpus reader

So far, we have primarily been working with text files in the context of specific, complex formats, using Python packages like Pandas (for CSVs) and Beautiful Soup (for XML). One advantage of these packages is that they will deal with messy things for you, and get things set up fast.

However, we will inevitably run across a situation where these aren't quite the right thing, for instance:

* What about raw English text, with no formatting?
    * To be processed into sentences, tokens, perhaps tagged?
    * Or for information to be extracted with regular expressions?
* What about one sentence/word per line?
* What about CSV (or TSV) type situations where you want to process one line at a time (rather than load the entire file)?
* What about storing lexicons?
* What about working in other languages with different character sets?
* What happens when I've got a bunch of text files to process at once?

In this notebook, we'll dig a bit deeper into text file I/O (Input/Output) in Python.

## Basic reading and writing

The classic method for opening files in Python is to assign the result of the `open` function to a variable (here `f`), and then close the file by calling its `close` method when you are done.

In [1]:
f = open("8_text_files.ipynb")
print(f.read(500))
f.close()

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# 8: Text Files\n",
    "\n",
    "* Basic reading & writing\n",
    "* Encodings\n",
    "* JSON\n",
    "* Multiple files\n",
    "* Building a simple corpus reader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So far, we have primarily been working with text files in the context of specific, complex formats, using Pyth


A popular alternative is to use Python's `with ... as` syntax, which will close the file automatically at the end of the code block.

- Advantage: you won't accidentally leave files open and lose data
- Disadvantage: extra indentation
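To see what closing the file automatically means in practice, here is a minimal sketch that checks the file object's `closed` attribute inside and after a `with` block, including the case where an error interrupts the block. The scratch file `demo.txt` and the variable names are made up for this illustration.

```python
import os
import tempfile

# Hypothetical scratch file, created in a temporary directory for this demo only.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as fout:
    fout.write("some text\n")
    closed_inside = fout.closed    # still open inside the block

closed_after = fout.closed         # closed as soon as the block ends

# The file is also closed if an exception interrupts the block.
try:
    with open(path) as f:
        raise ValueError("something went wrong mid-read")
except ValueError:
    pass
closed_after_error = f.closed

print(closed_inside, closed_after, closed_after_error)
```

This should print `False True True`: the file is open only while we are inside the block.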

In [3]:
with open("8_text_files.ipynb") as f:
    print(f.read(500))

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# 8: Text Files\n",
    "\n",
    "* Basic reading & writing\n",
    "* Encodings\n",
    "* JSON\n",
    "* Multiple files\n",
    "* Building a simple corpus reader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So far, we have primarily been working with text files in the context of specific, complex formats, using Pyth


Remember that "r" mode does not need to be specified for reading; it is the default mode. Write mode "w" creates the file if it does not exist and overwrites it if it does. The append option "a" can be useful if you are adding continuously to a file.

In [4]:
with open("test.txt","w") as fout:
    fout.write("test write 1\n")

In [5]:
with open("test.txt","w") as fout:
    fout.write("test write 2\n")

In [6]:
with open("test.txt","a") as fout:
    fout.write("test append")

In [7]:
f = open("test.txt")
print(f.read())
f.close()

test write 2
test append


The two most common options for reading files are iterating line by line using a *for* loop (which does not require holding the entire file in memory), or reading the entire file into a single string at once, using [read](https://docs.python.org/3/library/io.html#io.TextIOBase.read) (great for quick use of regexes, for instance). You can read a single line of a file without a loop by using [readline](https://docs.python.org/3/library/io.html#io.TextIOBase.readline). A fourth option is [readlines](https://docs.python.org/3/library/io.html#io.IOBase.readlines), which will read the entire file into a list of strings where each string is a line. Remember that in all of these cases, the newline characters will still be there, so you'll probably want to use `strip`!

In [8]:
some_lines = "line1\nline2\nline3\nline4"
with open("test.txt","w") as fout:
    fout.write(some_lines)

In [9]:
with open("test.txt") as f:
    # my code here
    print(f.read())
    # my code here

line1
line2
line3
line4


In [10]:
with open("test.txt") as f:
    # my code here
    for line in f:
        print(line.strip())
    # my code here

line1
line2
line3
line4


In [11]:
with open("test.txt") as f:
    # my code here
    print(f.readline())
    print(f.readline())
    # my code here

line1

line2



In [12]:
with open("test.txt") as f:
    # my code here
    print(f.readlines())
    # my code here

['line1\n', 'line2\n', 'line3\n', 'line4']
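Since the newlines are kept, here is a small sketch of two common ways to get a clean list of lines. The temporary file and variable names are only for illustration. Note that `rstrip("\n")` removes just the trailing newline, whereas a bare `strip()` would also remove leading whitespace, which matters if indentation is meaningful in your data.

```python
import os
import tempfile

# Hypothetical scratch file with the same four lines used above.
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w") as fout:
    fout.write("line1\nline2\nline3\nline4")

# Option 1: iterate over the file and strip the newline from each line.
with open(path) as f:
    stripped = [line.rstrip("\n") for line in f]

# Option 2: read everything at once and let splitlines() drop the newlines.
with open(path) as f:
    split = f.read().splitlines()

print(stripped)
print(stripped == split)
```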


For writing, the [write](https://docs.python.org/3/library/io.html#io.TextIOBase.write) method can be used whether you are writing incrementally or in one shot. (There is also a [writelines](https://docs.python.org/3/library/io.html#io.IOBase.writelines) method if you already have a list of strings, though note that newlines are not added, and it is used less often.)

In [13]:
some_lines = ["line1","line2","line3","line4"]

In [14]:

with open("test.txt","w") as fout:
    #my code here
    for line in some_lines:
        fout.write(line + "\n")
    #my code here

In [15]:
with open("test.txt") as f:
    print(f.read())

line1
line2
line3
line4
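For comparison, here is a sketch of the same write done with `writelines`. Because `writelines` does not add newlines, we add them ourselves with a generator expression. The temporary file path is an assumption made for this example.

```python
import os
import tempfile

some_lines = ["line1", "line2", "line3", "line4"]

# Hypothetical scratch file for this demo.
path = os.path.join(tempfile.mkdtemp(), "writelines.txt")

with open(path, "w") as fout:
    # writelines accepts any iterable of strings and writes them back to back,
    # so each string needs its own trailing newline.
    fout.writelines(line + "\n" for line in some_lines)

with open(path) as f:
    contents = f.read()

print(contents)
```

An equivalent one-shot alternative is `fout.write("\n".join(some_lines) + "\n")`.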



For practice, let's write and then read in the counts of words in the Brown corpus, using a tab to delimit the word and its count:

In [16]:
from nltk.corpus import brown
from collections import Counter

counts = Counter(brown.words())

counts_copy = {}

#my code here
# write one "word<TAB>count" line per entry
with open("brown_counts.txt", "w") as fout:
    for word, count in counts.items():
        fout.write(word + "\t" + str(count) + "\n")

# read the lines back, splitting each one on the tab
with open("brown_counts.txt") as f:
    for line in f:
        word, count = line.rstrip("\n").split("\t")
        counts_copy[word] = int(count)
#my code here

counts == counts_copy

True

If you have a lexicon or corpus in CSV/TSV format with multiple columns, you probably won't want to manipulate each line manually as we did here. But you may not want to use pandas either, since it typically loads the entire file into memory.

An intermediate option between those two extremes is to use Python's [csv](https://docs.python.org/3/library/csv.html) library, which is lightweight but easy to use (note that it will do any necessary escaping for you!). For example, one popular format for annotated corpora is the CoNLL format; an example with headers is below.

In [17]:
CoNLL_example = "ID\tFORM\tLEMMA\tUPOSTAG\tXPOSTAG\tFEATS\tHEAD\tDEPREL\n1\tHe\the\tPRON\tPRP\tCase=Nom|Number=Sing|Person=3\t2\tnsubj\n2\tis\tbe\tVERB\tVBZ\tNumber=Sing|Person=3|Tense=Pres\t0\troot\n3\tin\tin\tADP\tIN\t_\t6\tcase\n4\tthe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t6\tdet\n5\tUnited\tunite\tVERB\tVBD\tTense=Past|VerbForm=Part\t6\tamod\n6\tKingdom\tkingdom\tNOUN\tNN\tNumber=Sing\t2   nmod\n7\t(\t(\tPUNCT\t-LRB-\t_\t8\tpunct\n8\tUK\tUK\tPROPN\tNNP\tNumber=Sing\t6\tappos\n9\t)\t)\tPUNCT\t-RRB-\t_\t8\tpunct\n10\t.\t.\tPUNCT\t.\t_\t2\tpunct"
with open("CoNNL.txt","w") as fout:
    fout.write(CoNLL_example)
print(CoNLL_example)

ID	FORM	LEMMA	UPOSTAG	XPOSTAG	FEATS	HEAD	DEPREL
1	He	he	PRON	PRP	Case=Nom|Number=Sing|Person=3	2	nsubj
2	is	be	VERB	VBZ	Number=Sing|Person=3|Tense=Pres	0	root
3	in	in	ADP	IN	_	6	case
4	the	the	DET	DT	Definite=Def|PronType=Art	6	det
5	United	unite	VERB	VBD	Tense=Past|VerbForm=Part	6	amod
6	Kingdom	kingdom	NOUN	NN	Number=Sing	2   nmod
7	(	(	PUNCT	-LRB-	_	8	punct
8	UK	UK	PROPN	NNP	Number=Sing	6	appos
9	)	)	PUNCT	-RRB-	_	8	punct
10	.	.	PUNCT	.	_	2	punct


If we like, we can skip the header line and read each row using a simple CSV [reader](https://docs.python.org/3/library/csv.html#csv.reader), which returns a list for each row:

In [18]:
import csv

f = open("CoNNL.txt")
f.readline() # skip the header by reading one line
reader = csv.reader(f,delimiter="\t")
for row in reader:
    print(row)
f.close()

['1', 'He', 'he', 'PRON', 'PRP', 'Case=Nom|Number=Sing|Person=3', '2', 'nsubj']
['2', 'is', 'be', 'VERB', 'VBZ', 'Number=Sing|Person=3|Tense=Pres', '0', 'root']
['3', 'in', 'in', 'ADP', 'IN', '_', '6', 'case']
['4', 'the', 'the', 'DET', 'DT', 'Definite=Def|PronType=Art', '6', 'det']
['5', 'United', 'unite', 'VERB', 'VBD', 'Tense=Past|VerbForm=Part', '6', 'amod']
['6', 'Kingdom', 'kingdom', 'NOUN', 'NN', 'Number=Sing', '2   nmod']
['7', '(', '(', 'PUNCT', '-LRB-', '_', '8', 'punct']
['8', 'UK', 'UK', 'PROPN', 'NNP', 'Number=Sing', '6', 'appos']
['9', ')', ')', 'PUNCT', '-RRB-', '_', '8', 'punct']
['10', '.', '.', 'PUNCT', '.', '_', '2', 'punct']


Or we might use a [DictReader](https://docs.python.org/3/library/csv.html#csv.DictReader), where each row is a dictionary with the headers as keys:

In [19]:
f = open("CoNNL.txt")
reader = csv.DictReader(f,delimiter="\t")
for row in reader:
    print(row)
f.close()

{'ID': '1', 'FORM': 'He', 'LEMMA': 'he', 'UPOSTAG': 'PRON', 'XPOSTAG': 'PRP', 'FEATS': 'Case=Nom|Number=Sing|Person=3', 'HEAD': '2', 'DEPREL': 'nsubj'}
{'ID': '2', 'FORM': 'is', 'LEMMA': 'be', 'UPOSTAG': 'VERB', 'XPOSTAG': 'VBZ', 'FEATS': 'Number=Sing|Person=3|Tense=Pres', 'HEAD': '0', 'DEPREL': 'root'}
{'ID': '3', 'FORM': 'in', 'LEMMA': 'in', 'UPOSTAG': 'ADP', 'XPOSTAG': 'IN', 'FEATS': '_', 'HEAD': '6', 'DEPREL': 'case'}
{'ID': '4', 'FORM': 'the', 'LEMMA': 'the', 'UPOSTAG': 'DET', 'XPOSTAG': 'DT', 'FEATS': 'Definite=Def|PronType=Art', 'HEAD': '6', 'DEPREL': 'det'}
{'ID': '5', 'FORM': 'United', 'LEMMA': 'unite', 'UPOSTAG': 'VERB', 'XPOSTAG': 'VBD', 'FEATS': 'Tense=Past|VerbForm=Part', 'HEAD': '6', 'DEPREL': 'amod'}
{'ID': '6', 'FORM': 'Kingdom', 'LEMMA': 'kingdom', 'UPOSTAG': 'NOUN', 'XPOSTAG': 'NN', 'FEATS': 'Number=Sing', 'HEAD': '2   nmod', 'DEPREL': None}
{'ID': '7', 'FORM': '(', 'LEMMA': '(', 'UPOSTAG': 'PUNCT', 'XPOSTAG': '-LRB-', 'FEATS': '_', 'HEAD': '8', 'DEPREL': 'punct'}
{'I
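The csv library also handles writing, and this is where the automatic escaping pays off. The sketch below uses made-up rows and an in-memory buffer instead of a real file. It writes a field that contains a tab and a field that contains quotation marks, then reads them back unchanged.

```python
import csv
import io

# Made-up rows: one field contains the delimiter itself, another contains quotes.
rows = [
    ["FORM", "GLOSS"],
    ["tab\there", 'a "quoted" gloss'],
]

# An in-memory text buffer stands in for a file opened with newline="".
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, delimiter="\t")
writer.writerows(rows)

# Show the raw text that was written, including the quoting csv added.
print(repr(buffer.getvalue()))

# Reading it back with the same delimiter recovers the original fields.
buffer.seek(0)
read_back = list(csv.reader(buffer, delimiter="\t"))
print(read_back == rows)
```

If we had joined the fields with `"\t".join(...)` by hand, the embedded tab would have silently produced an extra column.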

Finally, one other structured text format that is popular in computational linguistics and beyond is JSON. As it happens, .ipynb is a JSON format, so we can read these lecture notes using [json.load](https://docs.python.org/3/library/json.html#json.load), which can be passed a file object!

In [21]:
import json
with open("8_text_files.ipynb") as f:
    lecture_8 = json.load(f)

In [23]:
lecture_8['cells'][0]

{'cell_type': 'markdown',
 'metadata': {'slideshow': {'slide_type': 'slide'}},
 'source': ['# 8: Text Files\n',
  '\n',
  '* Basic reading & writing\n',
  '* Encodings\n',
  '* JSON\n',
  '* Multiple files\n',
  '* Building a simple corpus reader']}

> This probably won't work on Windows systems but it works on my Mac. Since Mac and PCs have different default encodings, let's change topics and first talk about encodings.

## Encodings

For computers, numbers are everything. However, when dealing with texts, we need a way to associate numbers with characters. *Encodings* provide such a mapping. Generally there is a trade-off between the number of possible characters that can be represented and the amount of space required to store text on disk, so different encodings were developed to represent the particular characters used in particular languages.
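As a quick illustration of characters being numbers, Python's built-in `ord` and `chr` convert between a character and its Unicode code point. An encoding then decides how that number is stored as bytes. The sample characters below are chosen arbitrarily.

```python
# Map a few sample characters to their Unicode code points and back again.
code_points = {}
for ch in "Aaç可":
    code_points[ch] = ord(ch)              # character -> number
    round_trip = chr(code_points[ch])      # number -> character
    print(ch, code_points[ch], hex(code_points[ch]), round_trip == ch)
```

The first two values fall below 128 and so fit in ASCII; the third fits in a single byte (below 256); the last needs a larger encoding.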

![test](http://www.asciitable.com/index/asciifull.gif)

In Python, an encoding can be selected using the `encoding` keyword when you open a file. ASCII was the first major encoding and is very compact, but it can only represent 128 characters; using ASCII will fail if you try to write text that uses characters which aren't found on a typical English keyboard.

In [24]:
with open("test.txt", "w",encoding="ascii") as fout:
    fout.write("this works\n")
    fout.write("ça ne va pas\n")
    fout.write("不行\n")

UnicodeEncodeError: 'ascii' codec can't encode character '\xe7' in position 0: ordinal not in range(128)

Latin-1 and various related encodings use a full byte per character, allowing up to 256 characters, and support most of the languages of Western Europe. A variation on Latin-1 called CP-1252 is usually the default encoding for Windows (in English and Western European locales).

In [25]:
with open("test.txt", "w",encoding="latin-1") as fout:
    fout.write("this works")
    fout.write("ça va")
    fout.write("还是不行")

UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)

These days, the most popular encoding is definitely UTF-8, which supports all the characters included in Unicode, including the characters of pretty much every written language, as well as things like emoji. Even if you don't think you need it, it is a good idea to save the text files you create in UTF-8. The characters included in ASCII have the same representation in UTF-8, so for normal English texts it is actually no less efficient! Note that UTF-8 is the default encoding on OS X (macOS) and most Linux systems.

In [26]:
with open("test.txt", "w",encoding="utf-8") as fout:
    fout.write("this works\n")
    fout.write("ça va\n")
    fout.write("可以了\n")

In [27]:
with open("test.txt",encoding="utf-8") as f:
    print(f.read())

this works
ça va
可以了



In [28]:
with open("test.txt",encoding="ascii") as f:
    print(f.read())

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)
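The space trade-off can be checked directly with `str.encode`. This sketch (sample strings chosen for illustration) compares the number of characters with the number of bytes UTF-8 needs, and confirms that plain ASCII text is stored identically in both encodings.

```python
samples = ["hello", "ça va", "可以了"]

# Count how many bytes UTF-8 uses for each sample string.
sizes = {}
for text in samples:
    utf8_bytes = text.encode("utf-8")
    sizes[text] = len(utf8_bytes)
    print(repr(text), len(text), "characters,", len(utf8_bytes), "bytes in UTF-8")

# For pure ASCII text the two encodings produce exactly the same bytes.
same_for_ascii = "hello".encode("ascii") == "hello".encode("utf-8")
print(same_for_ascii)
```

ASCII characters take one byte each, `ç` takes two, and each of the Chinese characters takes three, which is also why the failed ASCII read above stops at the first byte of `ç`.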

Most of the time, encodings just work and you don't have to think about them. However, sooner or later (probably sooner) you will get an encoding error when you read a file. You might try changing the encoding, or trying to autodetect the encoding (Beautiful Soup [can help you with this](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#unicode-dammit)). But sometimes it just doesn't work (or you don't have the patience), at which point you might want to try a more liberal option for the *errors* keyword parameter, such as `ignore` or `replace`.

In [29]:
with open("test.txt", "w",encoding="utf-8") as fout:
    fout.write("this works")
    fout.write("ça va")
    fout.write("可以了")
    
    
with open("test.txt",encoding="ascii",errors="replace") as f:
    print(f.read())
          

this works��a va���������
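Before resorting to `errors="replace"`, it can be worth trying a few likely encodings in order. The helper below, `read_with_fallback`, is a hypothetical function written for this sketch (it is not part of any library), and the list of candidate encodings is just an assumption about what you are likely to meet.

```python
import os
import tempfile

def read_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Hypothetical helper: return (text, encoding) for the first encoding that decodes."""
    for encoding in encodings:
        try:
            with open(path, encoding=encoding) as f:
                return f.read(), encoding
        except UnicodeDecodeError:
            continue  # this guess was wrong, try the next one
    raise ValueError(f"none of {encodings} could decode {path}")

# Simulate a file saved on an older Windows machine.
path = os.path.join(tempfile.mkdtemp(), "legacy.txt")
with open(path, "w", encoding="cp1252") as fout:
    fout.write("ça va")

text, used = read_with_fallback(path)
print(text, used)
```

A successful decode is not proof that the guess was right: `latin-1` maps every possible byte to some character, so it never raises an error, which is why it is listed last.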


Finally, although encodings can be handled as part of file I/O, sometimes you need to [encode](https://docs.python.org/3/library/stdtypes.html#str.encode) a string to a bytes string or [decode](https://docs.python.org/3/library/stdtypes.html#bytes.decode) a bytes (binary) string to a string when there is no file directly involved. Use the `encode` method of strings and the `decode` method of bytes strings, which also take the `errors` keyword argument.

In [30]:
with open("test.txt","rb") as f:
    text = f.read()
    print(text)
    print(text.decode("utf-8"))
    print(text.decode("ascii",errors="ignore"))

b'this works\xc3\xa7a va\xe5\x8f\xaf\xe4\xbb\xa5\xe4\xba\x86'
this worksça va可以了
this worksa va
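Going in the other direction, here is a short sketch of `str.encode` with no file involved, showing how the `errors` argument changes what happens to a character the target encoding cannot represent.

```python
text = "ça va"

# UTF-8 can represent everything, so the default strict mode succeeds.
as_utf8 = text.encode("utf-8")

# ASCII cannot represent "ç"; errors= decides what happens to it.
replaced = text.encode("ascii", errors="replace")   # unknown characters become "?"
ignored = text.encode("ascii", errors="ignore")     # unknown characters are dropped

# Decoding with the matching encoding gets the original string back.
round_trip = as_utf8.decode("utf-8")

print(as_utf8, replaced, ignored, round_trip)
```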


## JSON

Now that we understand encodings, let's try that JSON example again.

In [31]:
with open("8_text_files.ipynb", encoding="utf-8") as f:
    lecture_8 = json.load(f)

The JSON format can represent most of the basic Python types, including strings, numbers, booleans, lists, and dicts (with string keys). Let's take a look at what we have after we've loaded in from that .ipynb JSON.

In [33]:
# lecture_8

In [34]:
lecture_8.keys()

dict_keys(['cells', 'metadata', 'nbformat', 'nbformat_minor'])

In [35]:
lecture_8["cells"][0]

{'cell_type': 'markdown',
 'metadata': {'slideshow': {'slide_type': 'slide'}},
 'source': ['# 8: Text Files\n',
  '\n',
  '* Basic reading & writing\n',
  '* Encodings\n',
  '* JSON\n',
  '* Multiple files\n',
  '* Building a simple corpus reader']}

Working in reverse, we can use JSON to store complex Python data structures we have constructed, as long as they are built from those basic types. There are more compact ways to store these sorts of objects using a binary representation rather than text (e.g. [pickle](https://docs.python.org/3/library/pickle.html)), but JSON has the advantage of being easy to read. Write to a file using the [dump](https://docs.python.org/3/library/json.html#json.dump) function after you've opened a file for writing.

In [36]:
fout = open("8_text_files.ipynb","w",encoding="utf-8")
json.dump(lecture_8, fout)
fout.close()
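The same approach works for our own data. Here is a sketch that stores a small, made-up lexicon as JSON and loads it back. The `ensure_ascii=False` and `indent` arguments are real `json.dump` options that keep non-ASCII characters readable and pretty-print the file. The lexicon contents and the file name are invented for the example.

```python
import json
import os
import tempfile

# A made-up lexicon: nested dicts and lists of basic types.
lexicon = {
    "可以": {"pos": "VERB", "glosses": ["can", "may"], "count": 12},
    "ça": {"pos": "PRON", "glosses": ["that", "it"], "count": 7},
}

path = os.path.join(tempfile.mkdtemp(), "lexicon.json")

with open(path, "w", encoding="utf-8") as fout:
    json.dump(lexicon, fout, ensure_ascii=False, indent=2)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == lexicon)

# One caveat: JSON has no tuple type, so tuples come back as lists.
back = json.loads(json.dumps({"pair": ("a", "b")}))
print(back)
```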

## Multiple files

When dealing with corpora, we will often want to iterate over multiple files, applying the same processing to all of them. The easiest way to accomplish this is to use [os.listdir](https://docs.python.org/3/library/os.html#os.listdir), which returns the names of the entries in a directory (the current directory by default).

In [39]:
import os
os.listdir()

['CoNNL.txt',
 '.DS_Store',
 '5_regex.ipynb',
 '3_lexicons.ipynb',
 '6_XML.ipynb',
 '2_corpora.ipynb',
 'brown_counts.txt',
 '1_strings.ipynb',
 'test.txt',
 '.ipynb_checkpoints',
 '4_stats.ipynb',
 '8_text_files.ipynb',
 '7_preprocess.ipynb']

Let's programmatically make some files in a subdirectory for us to read. We use [os.mkdir](https://docs.python.org/3/library/os.html#os.mkdir) for this (note that if you run this code more than once, you'll have to comment out `os.mkdir`, because it raises an error if the directory already exists).

In [40]:
os.mkdir("files")
fout = open("files/test1.txt","w")
fout.write("Hello World!")
fout.close()
fout = open("files/test2.txt","w")
fout.write("Goodbye World!")
fout.close()
os.listdir("files")

['test1.txt', 'test2.txt']
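If commenting out `os.mkdir` gets tedious, one alternative is `os.makedirs` with `exist_ok=True`, which does nothing when the directory is already there. The sketch below works inside a temporary directory (an assumption, so that it does not touch your real files) and can be run repeatedly.

```python
import os
import tempfile

# Work inside a throwaway directory so the demo leaves real files alone.
base = tempfile.mkdtemp()
target = os.path.join(base, "files")

os.makedirs(target, exist_ok=True)
os.makedirs(target, exist_ok=True)   # second call is harmless: no error is raised

for name, text in [("test1.txt", "Hello World!"), ("test2.txt", "Goodbye World!")]:
    with open(os.path.join(target, name), "w") as fout:
        fout.write(text)

created = sorted(os.listdir(target))
print(created)
```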

We can iterate over the list of filenames and open each one, concatenating the directory name to the front to form a path:

In [41]:
for filename in os.listdir("files"):
    with open("files/" + filename) as f:
        print(f.read())

Hello World!
Goodbye World!
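A slightly more defensive version of this loop is sketched below. It uses `os.path.join` to build paths rather than concatenating strings, sorts the names (the order returned by `os.listdir` is arbitrary), and skips anything that is not a `.txt` file, such as hidden files like `.DS_Store`. The directory contents here are invented for the example.

```python
import os
import tempfile

# Build a throwaway directory containing two text files and two distractors.
directory = tempfile.mkdtemp()
demo_files = [("b.txt", "second"), ("a.txt", "first"),
              (".hidden", "skip me"), ("notes.md", "skip me too")]
for name, text in demo_files:
    with open(os.path.join(directory, name), "w", encoding="utf-8") as fout:
        fout.write(text)

texts = []
for filename in sorted(os.listdir(directory)):
    if not filename.endswith(".txt"):
        continue                                  # ignore non-corpus files
    path = os.path.join(directory, filename)      # portable path building
    with open(path, encoding="utf-8") as f:
        texts.append(f.read())

print(texts)
```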


Another popular way to deal with multiple files is to keep them in archives, so they take up minimal space and are easy to transfer. All major archive formats have direct library support in Python, including [zip](https://docs.python.org/3/library/zipfile.html), [gzip](https://docs.python.org/3/library/gzip.html), [tar](https://docs.python.org/3/library/tarfile.html), and [bzip2](https://docs.python.org/3/library/bz2.html). The code below creates a zip file with the two text files we created above.

In [42]:
from zipfile import ZipFile

my_zip = ZipFile("test.zip","w")
my_zip.write("files/test1.txt")
my_zip.write("files/test2.txt")
my_zip.close()

We could extract those files, but a more common use case is to access them directly inside the zip. We open the zip file, iterate over its contents using [ZipFile.namelist](https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.namelist), and get a file object for each archived file using [ZipFile.open](https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.open):

In [49]:
my_zip = ZipFile("test.zip")
for filename in my_zip.namelist():
    print(filename)
    f = my_zip.open(filename)
    print(f.read())
    f.close()
my_zip.close()

files/test1.txt
b'Hello World!'
files/test2.txt
b'Goodbye World!'


Note the b in front of the strings, indicating they are bytes strings, not (normal) Unicode strings; ZipFile opens files in binary mode. To convert them to a "proper" Python Unicode string, we would use the `decode` method mentioned above with the correct encoding (in this case, any of the encodings from above would work, since the text is plain ASCII)! Note that you will run across the same problem if you manipulate a webpage directly rather than passing it to Beautiful Soup, which does the conversion for you. You will definitely want to decode to a correct Unicode string; a bytes string is not what you normally want to work with!
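Here is a sketch of that decoding step. To keep it self-contained, the archive is built in memory with `io.BytesIO` rather than on disk, and the second file's contents are changed (an assumption for this demo) to include a non-ASCII character, so the choice of encoding actually matters.

```python
import io
from zipfile import ZipFile

# Build a small zip archive in memory instead of on disk.
archive = io.BytesIO()
with ZipFile(archive, "w") as my_zip:
    my_zip.writestr("files/test1.txt", "Hello World!".encode("utf-8"))
    my_zip.writestr("files/test2.txt", "ça va".encode("utf-8"))

# Read each member as bytes, then decode to a normal string.
decoded = {}
with ZipFile(archive) as my_zip:
    for filename in my_zip.namelist():
        with my_zip.open(filename) as f:
            raw = f.read()                          # bytes string
            decoded[filename] = raw.decode("utf-8") # Unicode string

print(decoded)
```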

## Building a simple corpus reader

Early in the repo we were able to explore corpora without dealing with the details of file I/O and text preprocessing because the NLTK corpus readers allow us to sidestep those issues.

Now, we have all the tools in place to build our own Corpus Readers. We'll now build a class that has a `words` method that acts just like `words` does for corpora in NLTK. 

In addition to some stuff we've learned recently, we'll also need our method to be a generator. Good corpus readers should not load the entire corpus into memory; they should `yield`!

In [50]:
from nltk import word_tokenize
import os

class MyCorpusReader: 
    def __init__(self, directory, encoding):
        self.directory = directory
        self.encoding = encoding
        
    def words(self):
        # my code here
        for filename in os.listdir(self.directory):
            path = os.path.join(self.directory, filename)
            # the with block closes each file once it has been read
            with open(path, encoding=self.encoding) as f:
                text = f.read()
            for word in word_tokenize(text):
                yield word
        # my code here

In [51]:
reader = MyCorpusReader("files", "utf-8")
for i,word in enumerate(reader.words()):
    print(i)
    print(word)

0
Hello
1
World
2
!
3
Goodbye
4
World
5
!


We could easily extend this to include all the functionality we got in NLTK (sents, tagged_sents, etc.). 
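As a rough sketch of what such an extension might look like, the class below adds a `sents` generator alongside `words`. To keep the example free of dependencies, it uses naive regular expressions as a stand-in for NLTK's `sent_tokenize` and `word_tokenize`; the class name `SimpleCorpusReader`, the helper `_texts`, and the sample files are all invented for this illustration.

```python
import os
import re
import tempfile

class SimpleCorpusReader:
    """Hypothetical reader: regex tokenization stands in for NLTK here."""

    def __init__(self, directory, encoding="utf-8"):
        self.directory = directory
        self.encoding = encoding

    def _texts(self):
        # Yield one file's text at a time, in a predictable order.
        for filename in sorted(os.listdir(self.directory)):
            path = os.path.join(self.directory, filename)
            with open(path, encoding=self.encoding) as f:
                yield f.read()

    def sents(self):
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for text in self._texts():
            for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
                if sentence:
                    # Naive word split: runs of word characters, or single punctuation marks.
                    yield re.findall(r"\w+|[^\w\s]", sentence)

    def words(self):
        for sentence in self.sents():
            yield from sentence

# Build a tiny throwaway corpus to try it on.
directory = tempfile.mkdtemp()
with open(os.path.join(directory, "test1.txt"), "w", encoding="utf-8") as fout:
    fout.write("Hello World! How are you?")
with open(os.path.join(directory, "test2.txt"), "w", encoding="utf-8") as fout:
    fout.write("Goodbye World!")

reader = SimpleCorpusReader(directory)
sentences = list(reader.sents())
print(sentences)
```

Because `words` is built on top of `sents`, both stay lazy: only one file's text is held in memory at a time.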