<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/17_Writing_Data_to_File.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting your results out of Colab

Most of the output in these notebooks is printed directly in the notebook, usually underneath the code cells. This is usually fine for our purposes. However, it is also worthwhile to know how you could turn your results into actual files. You might want to create new text files, or save the results of linguistic analysis for many texts to a new text file or spreadsheet. This notebook will show you a few options for how to accomplish this.

# Writing to a text file

Writing directly to a file in Python involves first opening the file and then adding information to the file.

To open a file, you use the `open()` function. This function needs to point at the file in question, so the path to the file needs to be supplied. In this case, we could either write to our Google Drive (assuming our Drive is mounted), or directly to the notebook environment.

The filepath to the environment is `./saved-files/...`, so for simplicity let's stick with that for now (plus we do not need to dally with mounting the Google Drive!).
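Note that `open()` can create files but not folders, so the `saved-files` folder has to exist before anything can be written into it. If it happens to be missing from your environment, a minimal sketch like the following (which assumes the folder name used throughout this notebook) should create it:

```python
import os

# create the folder if it is missing;
# exist_ok=True means no error is raised if it is already there
os.makedirs('./saved-files', exist_ok=True)

# confirm that the folder now exists
folder_exists = os.path.isdir('./saved-files')
print(folder_exists)
```

Creating the folder does not create any files inside it, so the examples that follow behave the same way.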

So, if we wanted to write to a file named `text.txt` in our environment, we would write something like this:

```
open('./saved-files/text.txt')
```
However, we need to open the file and keep it open, so we can save the results to a variable.


```
myfile = open('./saved-files/text.txt')
```



Run the cell below to try this out. You should receive an error telling you that no such file or directory exists. This makes sense, because we have not made this file!

In [1]:
myfile = open('./saved-files/text.txt')

FileNotFoundError: [Errno 2] No such file or directory: './saved-files/text.txt'

So, how do we create a file if it does not already exist? The answer is by supplying options to the `mode` argument within the `open()` function. This argument tells Python what to do with files that do not exist, as well as files that do exist. The default `mode` is `r`, which means read only. There are other flags which control whether a file is created, whether it is available for reading and/or writing, and whether existing content is kept or overwritten. [You can find a good answer about this here.](https://stackoverflow.com/questions/1466000/difference-between-modes-a-a-w-w-and-r-in-built-in-open-function)

So, if we wanted to first create a file and then continue to add data to the file, the `a+` mode is a good choice. This mode opens a file for reading and appending (and creates the file if it doesn't exist), so anything written is added to the end of the file.

Let's try it out:



In [6]:
myfile = open('./saved-files/text.txt', mode = 'a+')

You can verify that this file has been created by looking within the notebook environment. Next, we want to add some information to this file. We can do so with the `.write()` method, where we place the information we want to be added to the file in the argument.

Let's write a single sentence to the file. After writing, let's close the file as well.

In [15]:
myfile.write('take rests as long as you need, but for shorter than you want. ')
myfile.close()

Now we can load the file again using `mode = 'r'` to see what is in the file. We will add `.read()` to the end to access the content of the file.

You should see the sentence has been added to the text file.

In [17]:
open('./saved-files/text.txt', mode = 'r').read()

'take rests as long as you need, but for shorter than you want. '

Let's try adding another sentence to the file:

In [18]:
# open the file
myfile = open('./saved-files/text.txt', mode = 'a+')

# write to the file
myfile.write('a nod\'s as good as a wink to a blind bat!')

# close the file
myfile.close()

Take another look. We see that the new information was added to the file, after the data which had already been added. The result is a single long string of text.

In [20]:
open('./saved-files/text.txt').read()

"take rests as long as you need, but for shorter than you want. a nod's as good as a wink to a blind bat!"

What happens if we were to open the file with `mode = 'w+'` and add some more data?



In [22]:
# open the file
myfile = open('./saved-files/text.txt', mode = 'w+')

# write to the file
myfile.write('with your powers combined, I am Captain Planet!')

# close the file
myfile.close()

What does the file look like now? Well, opening the file with `w+` has overwritten the existing content. Therefore, it is crucially important to make sure you open files with the right mode! Sometimes you may want to continuously add to an existing file, whereas other times you may want to start over with a fresh file each time.

In [23]:
open('./saved-files/text.txt').read()

'with your powers combined, I am Captain Planet!'
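The difference between appending and overwriting can also be checked side by side. The following is a small sketch rather than part of the original walkthrough. It writes to a temporary folder (an assumption made so that no existing files are touched), and the file name `modes.txt` is made up for the demo:

```python
import os
import tempfile

# a throwaway file in a temporary folder (hypothetical name, demo only)
demo_path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

# 'w' creates the file, or empties it if it already exists
f = open(demo_path, mode='w')
f.write('first ')
f.close()

# 'a' keeps what is already there and adds to the end
f = open(demo_path, mode='a')
f.write('second')
f.close()

f = open(demo_path)
after_append = f.read()
f.close()

# 'w' again wipes the earlier content before writing
f = open(demo_path, mode='w')
f.write('third')
f.close()

f = open(demo_path)
after_write = f.read()
f.close()

print(after_append)  # first second
print(after_write)   # third
```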

One nice thing about Colab is that you can manually download the files from the notebook environment onto your computer. Or you could have them saved directly to your Google Drive.

# Creating a bunch of text files

What if we had a list of texts or some other container of texts we wanted to create individual text files for? Perhaps you want to create a corpus based on some sort of analysis you have already done, and your data is stored as a dictionary?

Let's make a sample dictionary:


In [24]:
sf_dict = {'JERRY': "What's the deal with these pretzels?",
           'KRAMER': "These pretzels are making me thirsty!",
           'GEORGE': 'We live in a society!'}

We want to create a separate text file for each line in the dictionary. We therefore need to make a new filename for each text file. Using a loop, we could exploit the keys of the dictionary and use them as the names of the text files.

Using an f-string (or string concatenation), we can glue the key to the file extension we want, in this case `.txt`. Because we want a fresh file each time, we will use the `w+` mode.

In [25]:
# initiate loop through the dictionary keys
for key in sf_dict.keys():
  # open the file with dictionary key + .txt as file name, using w+ mode
  f = open(f'./saved-files/{key}.txt', 'w+')
  # write the dictionary data to the text
  f.write(sf_dict[key])
  # close the file
  f.close()

In [27]:
# we can loop through the files and open them to read them
for key in sf_dict.keys():
  print(open('./saved-files/' + key + '.txt').read())

What's the deal with these pretzels?
These pretzels are making me thirsty!
We live in a society!


Pretty easy, right? That's how you can quickly make a bunch of text files separated from larger texts or analyses, which can come in handy for making your own corpora.

# Writing results of analyses as text files

In addition to writing text to file, we might want to make a text file which holds the results of an analysis. For instance, we might want to collect the name of a text, the number of words in a text, etc.

Or, we may want to output the results of a frequency distribution to a text file, so that we do not have to run the notebook over and over.

The procedure for doing so is exactly the same, but we may want to start including different whitespace characters to insert tabs and newlines as a means to organise the data.

Let's see this with a frequency analysis of a small text.

In [28]:
# import nltk to use FreqDist
import nltk

# create text, with .split() at the end to tokenize
woodchuck = """
How much wood could a woodchuck chuck
If a woodchuck could chuck wood?
As much wood as a woodchuck could chuck,
If a woodchuck could chuck wood.
""".split()

# create FreqDist of text
fdist = nltk.FreqDist(woodchuck)

We will loop through the FreqDist and add each word plus its frequency to a text file. Crucially, we will separate the words and their frequency values with a tab character `'\t'`, and add a newline `'\n'` so that each word/frequency pair is on its own line. We will write this to a file called `'woodchuckFreq.txt'`.

In order to use string concatenation, we will have to convert the number into a string using `str()`.

In [29]:
for key in fdist.keys():
  f = open('./saved-files/woodchuckFreq.txt', 'a+')
  f.write(key + '\t' + str(fdist[key]) + '\n')
  f.close()


Inspect the file: if we split on newlines, we can see how the data is laid out in the file. And, if you manually download the file and open it in a text editor, you will see that each word/frequency pair is on its own line.

In [30]:
open('./saved-files/woodchuckFreq.txt').read().split('\n')

['How\t1',
 'much\t2',
 'wood\t2',
 'could\t4',
 'a\t4',
 'woodchuck\t4',
 'chuck\t3',
 'If\t2',
 'wood?\t1',
 'As\t1',
 'as\t1',
 'chuck,\t1',
 'wood.\t1',
 '']

Annoyingly, our code inserts a newline at the final part of the loop, which means that the text file will have an empty newline at the end. How could this be resolved?

1. You could use `if`/`else` logic to add newlines for all lines except the last line of the loop
2. You could use `rstrip()` to strip the trailing newline characters
3. You could build up the text to be written in a loop, and then write only one time, after using `rstrip()`
4. You could manually open the text file and delete it (sometimes the manual ways are the faster ways...but only sometimes).
5. Likely many other ways...
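As one illustration, here is a variation on option 3 that uses `'\n'.join()` instead of `rstrip()`. It is only a sketch under a few assumptions: a plain `Counter` stands in for the FreqDist (`nltk.FreqDist` is a subclass of `Counter`), the words are made up, and the output goes to a temporary file with a hypothetical name.

```python
import os
import tempfile
from collections import Counter

# stand-in for the FreqDist above (nltk.FreqDist is a subclass of Counter)
counts = Counter('a b a c a b'.split())

# build one 'word<TAB>count' string per item
lines = [word + '\t' + str(count) for word, count in counts.items()]

# join() only puts '\n' between items, so nothing trails after the last one
text = '\n'.join(lines)

# write once to a temporary file (hypothetical file name, demo only)
out_path = os.path.join(tempfile.mkdtemp(), 'freq.txt')
f = open(out_path, 'w')
f.write(text)
f.close()

# read the file back and split on newlines: no empty string at the end
f = open(out_path)
result = f.read().split('\n')
f.close()
print(result)
```

Because the whole text is assembled before writing, the file is also opened and closed only once.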

# with open

Now that you understand file reading and writing, it is time to introduce one more common thing you will encounter with Python file writing: the `with` statement.

[You can read some good answers about `with` on Stack Overflow](https://stackoverflow.com/questions/3012488/what-is-the-python-with-statement-designed-for), but the details are not super important for our purposes. This information is included primarily because many examples of reading/writing files in Python will be accompanied with these statements.

Basically, instead of opening and closing a file each time in a loop, we can hold the file open using `with`, perform all of the looping and writing while it is open, and then safely close it.

To use `with`, simply place it before the `open()` call and include an `as variable` followed by a colon. This creates an indented block within which you can perform looping to read/write to the file. The file is closed automatically when the block ends.

In [31]:
# hold this file open as the variable `textfile`
with open('./saved-files/withtext.txt', 'w+') as textfile:
  # initiate a loop (what is enumerate doing? How is it used in the .write() line?)
  for index, word in enumerate('every day is exactly the same'.split()):
    # write to the same file
    textfile.write(word + str(index) + '\n')

Test that it worked:

In [32]:
# do you understand what is going on in this list comprehension? what is the .rstrip() doing?
[word for word in open('./saved-files/withtext.txt').read().rstrip().split('\n')]

['every0', 'day1', 'is2', 'exactly3', 'the4', 'same5']
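One way to confirm that `with` really does close the file for you is to check the file object's `.closed` attribute. This is a small sketch using a temporary file with a made-up name, not one of the notebook's files:

```python
import os
import tempfile

# throwaway file in a temporary folder (hypothetical name, demo only)
demo_path = os.path.join(tempfile.mkdtemp(), 'withdemo.txt')

with open(demo_path, 'w') as textfile:
    textfile.write('hello\n')
    # inside the block the file is still open
    open_inside = not textfile.closed

# after the block ends, Python has closed the file for us
closed_after = textfile.closed

print(open_inside, closed_after)
```

No explicit `.close()` call is needed, which is the main reason `with` shows up in so many file-handling examples.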

# Creating .csv files (spreadsheets)

Writing text to file is useful, and can also be used to write out data results from analyses. But text files are only structured when tabs and newlines separate their values. What if we wanted to write to a spreadsheet instead of a text file? We can! Specifically, we can create a comma-separated values file (`.csv`), which works like a spreadsheet and can be opened in programs like Google Sheets, Excel, or LibreOffice.

This is particularly useful for tabular data, such as a series of metrics for a text: total number of words, sentences, type-token ratio (TTR), etc. Data that looks like this can easily be written to a csv file:

|text|words|sentences|TTR|
|:-:|:-:|:-:|:-:|
|text1|200|10|.7|
|text2|100|10|.9|



Let's first create some data that could be written to a csv file like this. We will import two of the datasets from The Current, and save their filenames in a list.

In [35]:
# plastic ban question data
# load the TP007 data to the notebook environment
#!wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/the-current/tp007.txt'

# indoor only cats question data
# load the TP003 data to the notebook environment
#!wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/the-current/tp003.txt'

mytexts = ['./the-current/tp003.txt','./the-current/tp007.txt']

Having loaded these two texts in, let's calculate the number of words, sentences, and TTR for each text, and then write that information to a csv file. We will do this row by row, so to speak.
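Before applying these calculations to the real files, it may help to see them on a tiny made-up text. This sketch mirrors the formulas used in the csv-writing cell further down. Note that counting "sentences" by splitting on newlines only gives a sentence count if each sentence sits on its own line, which is assumed here for the sample string.

```python
# a tiny made-up text with one sentence per line (an assumption for this demo)
sample = "the cat sat\nthe cat ran"

tokens = sample.split()            # whitespace tokenisation
n_words = len(tokens)              # 6 tokens
n_lines = len(sample.split('\n'))  # 2 lines
n_types = len(set(tokens))         # 4 unique tokens: the, cat, sat, ran
ttr = n_types / n_words            # type-token ratio: types divided by tokens

print(n_words, n_lines, round(ttr, 2))
```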

First, import the csv library:

In [34]:
# import the csv library
import csv

Using the `csv` library is similar to using `open` for writing text. We will use `with` and `open` to hold a csv file open, and then use `csv.writer` or `csv.reader` to write or read its rows. The same modes also apply, so be sure whether you want to create a new file or append to an existing file.





There is a lot going on in the cell below, so here is an explanation:


Line 1: open the file using `with` and `mode = 'w+'` \
Line 2: convert the opened file into a csv writer object. The same variable name is used (`f`) \
Line 4: write a header row to give the columns names. Notice that this is done *before* the loop, otherwise the header would be written for each iteration through the loop! \
Line 5: loop through our filenames, which are saved as strings in a list \
Line 7: open the file and read its contents into the variable `t`, using `.rstrip()` to remove trailing newlines \
Line 9: write the name of the text, the number of words, number of sentences, and TTR of the file to the csv. This information is entered as a list, where each item in the list will be a new column in the spreadsheet.

In [37]:
with open('./saved-files/textAnalytics.csv', 'w+') as f: # open the file with a temporary variable
  f = csv.writer(f) # convert that variable to a csvwriter object
  # create header row
  f.writerow(['text', 'number of words', 'number of sentences', 'TTR'])
  for text in mytexts:
    # open the text
    t = open(text, encoding="utf8").read().rstrip()
    # write different text calculations in a single line
    f.writerow([text, len((t).split()), len(t.split('\n')), len(set(t.split()))/len(t.split())])

After creating the file, we can use `csv.reader` to read the lines of the file:


In [38]:
with open('./saved-files/textAnalytics.csv', 'r') as f:
  f = csv.reader(f)
  for row in f:
    print(row)

['text', 'number of words', 'number of sentences', 'TTR']
[]
['./the-current/tp003.txt', '21668', '1365', '0.17758907144175742']
[]
['./the-current/tp007.txt', '11158', '720', '0.24251658003226384']
[]
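The empty `[]` rows in the output above are what typically appears when a csv file is written on a platform that uses `\r\n` line endings (such as Windows) without passing `newline=''` to `open()`. The `csv` module documentation recommends `newline=''` for files handed to `csv.writer` and `csv.reader`. Here is a minimal sketch with made-up rows and a temporary file (both assumptions for illustration):

```python
import csv
import os
import tempfile

# throwaway csv file in a temporary folder (hypothetical name, demo only)
csv_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

# newline='' lets the csv module control the line endings itself
with open(csv_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['text', 'words'])
    writer.writerow(['text1', 200])

# read the rows back; values come back as strings
with open(csv_path, 'r', newline='') as f:
    rows = list(csv.reader(f))

print(rows)
```

With `newline=''` in place, the rows should read back without blank entries in between, whatever the operating system.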


You can also download the file manually or by using the Google Colab `files` library:


In [None]:
from google.colab import files
files.download('./saved-files/textAnalytics.csv')

Hopefully you now have a few new strategies to read and write data. While there are other ways to read/write csv files such as using `pandas` or `polars`, the `csv` module can serve the same purposes in many instances, and does not require loading in additional libraries.

Once you get the data into spreadsheet format, you could include it in a written report or create visualisations of the data. You could also do that with Python, depending on how far you want to go within a single notebook.