<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/13_Loading_Data_into_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using your own data

As useful and interesting as the NLTK data is, you will eventually want to load in your own data. This notebook shows you a few options for doing so.



# Option 1: Mounting your Google Drive

One way to do so involves connecting Google Colab with your Google Drive.

The process of connecting Colab to your Google Drive is known as mounting your drive. To do so, you click on the folder icon on the left side of the Colab page:

<img src = https://i.imgur.com/82Wedue.png>

Then you click the "mount drive" icon in the next menu:

<img src = https://i.imgur.com/d8DxFIu.png>

Colab should then automatically add a code cell like this:

<img src = https://i.imgur.com/ttfUkwi.png>

Run the cell to mount your Google Drive. You will most likely see several permission prompts asking whether it's okay to make this connection with the associated Google account. It's fine to do this with notebooks you make or the ones I give you, but be wary of other notebooks that ask for your account permissions. There is likely no big risk, but I feel obligated to tell you that you should not blindly trust any other Colab notebooks you might come across.
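
For reference, the cell that Colab adds usually looks something like the sketch below. The `try`/`except` guard and the `mounted` variable are not part of what Colab inserts; they are assumptions added here so the cell does not crash if you run this notebook somewhere other than Colab (for example, in local Jupyter).

```python
# google.colab only exists inside Colab, so guard the import
try:
    from google.colab import drive
except ImportError:
    drive = None  # running somewhere other than Colab

if drive is not None:
    # connects your Drive at /content/drive (this triggers the permission prompts)
    drive.mount('/content/drive')
    mounted = True
else:
    mounted = False
    print('Not running in Colab, so there is no Drive to mount.')
```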



## Accessing files in your Google Drive

Now that your Drive is connected, you can directly access files in your Google Drive account. This is very handy. You might need to click the refresh button (the folder with the circle arrow) to see the new folder.

You should see a new folder on the left side menu (after clicking on the folder icon) called `drive`. Clicking that folder should then reveal a subfolder called `MyDrive`. The `MyDrive` folder is the root folder for your Google Drive.

<img src = https://i.imgur.com/Av1mGtQ.png>




In order to access files on your drive, you will need to be able to give Python the full filepath to your files. No matter where your files are, the start of your filepath will always be `/content/drive/MyDrive/...`, where the `...` are any additional folders.

So, for example, if you had a file called `mydata.txt` located in the base level of your Google Drive, the filepath location would be `/content/drive/MyDrive/mydata.txt`. If you had that same file located in a folder called `mydata`, the filepath would be `/content/drive/MyDrive/mydata/mydata.txt`, and so on.
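
If you'd rather not type long filepaths by hand, one option is to build them from pieces. This is only a sketch using Python's built-in `pathlib` module; the variable names (`drive_root`, `top_level`, `in_folder`) are made up for illustration, and the two files are the same hypothetical examples described above.

```python
from pathlib import PurePosixPath

# the root of your Google Drive once it is mounted in Colab
drive_root = PurePosixPath('/content/drive/MyDrive')

# a file in the base level of the Drive
top_level = drive_root / 'mydata.txt'

# the same file inside a folder called mydata
in_folder = drive_root / 'mydata' / 'mydata.txt'

print(top_level)
print(in_folder)
```

Wrapping either result in `str()` gives you the plain string version of the filepath.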

### Practice uploading a file from your drive

Go [here](https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/sample-texts/marine_biologist.txt); you should see a page of text. Manually copy and paste the text into a text editor, such as Notepad on Windows or TextEdit on Mac (don't use Microsoft Word). Save the file as a `.txt` file in your Google Drive folder and name it `marine_biologist.txt`.

Once you've done that, you should be able to read the text into Colab using the following cell.

The code uses the Python `open()` function, which, well, opens files! We need to use the `.read()` method at the end of open to return the contents of the file, which in this case is a string of the raw text in the `.txt` file.

In [None]:
marine_biologist = open('/content/drive/MyDrive/marine_biologist.txt').read()

# a random quote from this text
marine_biologist[15041:15135]
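
The one-line `open(...).read()` pattern works fine here. A slightly more careful variant, sketched below, uses a `with` block so the file is closed automatically and names the encoding explicitly. This assumes your file was saved as UTF-8, which is not guaranteed for every text editor. The file `demo.txt` and its contents are made up so that the cell runs anywhere.

```python
import os
import tempfile

# make a small throwaway file so this cell runs anywhere (hypothetical name)
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('A “quoted” word and a café.')

# the with block closes the file for us once the indented code finishes
with open(demo_path, encoding='utf-8') as f:
    demo_text = f.read()

print(demo_text)
print(f.closed)
```

If curly quotes or accented letters in your own text come out as strange symbols, a mismatched encoding is a likely culprit.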

If you are having trouble with this step, make sure you are saving the file to your base level folder in Google Drive. Also make sure your Drive is mounted, and that you have saved the file using the same filename I used above. If all else fails, please reach out for some help, because being able to access texts on your Google Drive will be an important step for a lot of you in order to read in text data. Of course, if you're comfortable using Jupyter on your local machine, you're under no obligation to use Google Drive to store your files.
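
When a filepath doesn't work, it can help to ask Python what it can actually see. The helper below is a hypothetical sketch (the name `diagnose_path` is made up); it uses only the built-in `os` module to report whether the file exists and, if not, what is in the folder where you expected it.

```python
import os

def diagnose_path(path):
    """Say whether `path` points at a file, and if not, what is nearby."""
    if os.path.isfile(path):
        return 'found'
    folder = os.path.dirname(path) or '.'
    if not os.path.isdir(folder):
        # usually means the Drive is not mounted or a folder name is misspelled
        return 'missing folder: ' + folder
    nearby = sorted(os.listdir(folder))
    return 'missing file; folder contains: ' + ', '.join(nearby)

print(diagnose_path('/content/drive/MyDrive/marine_biologist.txt'))
```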

### Working with a read file

Once you've loaded the file in, you can perform all of the same operations on it as we have been doing on strings we've typed as well as the built-in data included with NLTK. You should be familiar with the following code at this point. Are you able to leave comments explaining what each line is doing?

In [None]:
# NLTK needs to be imported before we can use its functions
import nltk

marine_tokens = nltk.word_tokenize(marine_biologist.lower())

marine_fdist = nltk.FreqDist(marine_tokens)

marine_fdist.most_common(10)

# Option 2: Using `!wget` or the `requests` library

Using Google Drive is a solid bet for integrating with Colab, but you might not like mounting drives each time you run a notebook, or working with files in your drive.

There are other options which involve reading files directly from the internet, using tools such as `!wget` or Python libraries for requesting data from URLs, such as the `requests` library. There are various places in these materials which show how to do either method.

However, using these methods requires that the data already exists on the internet somewhere, and also exists at a URL you can access (and ideally control). Therefore, I only recommend using this method if you are able to control the place where your data lives, and it might just be easier for you to use Google Drive if you don't want to go that route. That said, GitHub is a solid choice, and one which is used in some of these notebooks (as well as the example below).

The main benefit of using `!wget` is that the data is loaded directly into the notebook environment, so you would not need to muck around with sifting through files on the Google Drive. This method also makes it a bit easier to share resources, since someone else would not need to have the same data on their Drive.

Below is an example of using `!wget` to access a text file saved on GitHub.

In [None]:
# using !wget to load a file into the notebook environment
!wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/sample-texts/tmoom.txt'

Instead of pointing at `/content/drive/MyDrive/...`, you now just point at `/content/...`.

You need to use the appropriate method to open the file, such as using `open()` to open a text file:

In [None]:
# read in the text
tmoom = open('tmoom.txt').read()

# split into tokens
tmoom_tokens = nltk.word_tokenize(tmoom)

# look at the first ten tokens!
tmoom_tokens[:10]
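
`!wget` is a shell command rather than Python, so it only works in environments that have it installed. As a rough Python-only alternative, the sketch below uses `urlretrieve` from the standard library to save a file next to the notebook. The helper name `download` is made up for illustration, and the usage lines are commented out because they need an internet connection.

```python
import os
from urllib.request import urlretrieve

def download(url, folder='.'):
    """Save the file at `url` into `folder` and return the local filepath."""
    filename = url.rsplit('/', 1)[-1]   # the part after the last slash, e.g. 'tmoom.txt'
    destination = os.path.join(folder, filename)
    urlretrieve(url, destination)
    return destination

# usage (needs an internet connection, so it is left commented out):
# path = download('https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/sample-texts/tmoom.txt')
# tmoom = open(path).read()
```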

## `requests`

You can also read in data directly from a URL using Python libraries such as `requests` or `urllib`. This method still requires that you know where the text file lives, but unlike `!wget` it loads the file directly into Python rather than saving it to the notebook environment first. You typically need to point a function towards a URL and then use some additional methods to open the data. This works best for raw `.txt` files.

In [None]:
# import the library
import requests

In [None]:
# save URL to a variable
URL = 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/sample-texts/tmoom.txt'

In [None]:
# use .get() to retrieve file at the URL
data = requests.get(URL)

You can see that the information is saved in the variable in a format specific to the requests library. On its own, we can't see that text in the data object.  

In [None]:
data

This variable has a variety of attributes, one of which is the `text` attribute, which holds the text found at the URL. In this case, it is the contents of a `.txt` file. You can access the text using `.text`. Note that you do not need parentheses, because `text` is an attribute rather than a method.

In [None]:
data.text

We can of course chain these functions together in order to read in the text and convert it into tokens or some other format in one single line. In the cell below, I split the URL results on newlines inside of a list comprehension. The result is that we request the file and receive a list of all the lines in the file.

In [None]:
[line for line in requests.get(URL).text.split('\n')]
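
One thing to watch when splitting on `'\n'`: blank lines and a final newline produce empty strings in the list. The sketch below shows this on a small made-up string (a stand-in for `requests.get(URL).text`, so no internet is needed) and compares it with the built-in `.splitlines()` method.

```python
# a tiny stand-in for requests.get(URL).text (made-up content)
sample = 'first line\nsecond line\n\nfourth line\n'

# splitting on '\n' keeps empty strings for the blank line and the final newline
print(sample.split('\n'))

# splitlines() drops the trailing empty string but keeps the blank line
print(sample.splitlines())

# adding a condition to the comprehension keeps only lines that contain text
non_empty = [line for line in sample.splitlines() if line.strip()]
print(non_empty)
```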

# Option 3: Uploading files manually

There is also a way to upload files directly to the notebook environment. This involves using the `files` module from the `google.colab` library. First, import the module.

In [None]:
# import the files function from colab
from google.colab import files

After importing `files`, you have access to a few functions, one of which allows you to upload files into your notebook. You do this with the `files.upload()` function.

In [None]:
# run a cell with this command to prompt the user to upload files.
files.upload()

You can then choose a file from anywhere on your computer and upload it to the notebook environment. The file can then be accessed using the same methods used with `!wget`.
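
As far as I know, `files.upload()` also returns a dictionary in which each uploaded filename is paired with the file's contents as bytes, so you can use the text without opening the file again. The sketch below assumes that behaviour. The helper `decode_uploads` is a made-up name, and the `except` branch supplies made-up stand-in data so the cell still runs outside Colab.

```python
def decode_uploads(uploaded, encoding='utf-8'):
    """Turn a {filename: bytes} dictionary into {filename: text}."""
    return {name: raw.decode(encoding) for name, raw in uploaded.items()}

try:
    from google.colab import files
    uploaded = files.upload()   # opens the file picker in Colab
except ImportError:
    # outside Colab: a made-up stand-in so the rest of the cell still runs
    uploaded = {'example.txt': b'some uploaded text'}

texts = decode_uploads(uploaded)
for name, text in texts.items():
    print(name, len(text))
```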

# Creating your own NLTK corpus

Regardless of how you get your data into Colab, you can use the NLTK library to make your own version of the NLTK corpora.

One way to do this is to read in a bunch of texts as one single corpus. To do this, we use the `PlaintextCorpusReader` class from NLTK.

In order to use it, we need three things:

1. some files,
2. a filepath which leads to the files, and
3. the names of the files.
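
To see how those three ingredients fit together before touching real data, here is a minimal sketch that builds a throwaway corpus in a temporary folder. The folder, filenames, and sentences are all made up for illustration.

```python
import os
import tempfile
from nltk.corpus import PlaintextCorpusReader

# 1. some files: two made-up texts written to a temporary folder
toy_root = tempfile.mkdtemp()
toy_texts = {'one.txt': 'Hello there. This is text one.',
             'two.txt': 'And this is text two.'}
for name, text in toy_texts.items():
    with open(os.path.join(toy_root, name), 'w', encoding='utf-8') as f:
        f.write(text)

# 2. the filepath to the folder, and 3. the names of the files
toy_corpus = PlaintextCorpusReader(toy_root, ['one.txt', 'two.txt'])

print(toy_corpus.fileids())
print(toy_corpus.raw('two.txt'))
print(list(toy_corpus.words('one.txt'))[:3])
```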

Again, please follow along. Go [here](https://github.com/scskalicky/LING-226-vuw/blob/main/other-data/seinfeld.zip) and click the "download" button to download a compressed file containing several scripts from the American television show *Seinfeld*.

Download the file, unzip it, and save the folder to your base Google Drive folder. Your files should be located in `/content/drive/MyDrive/seinfeld`. This will be the filepath we feed to the NLTK corpus reader. Let's go ahead and save that to a variable so we only need to type it once:

In [4]:
corpus_root = '/content/drive/MyDrive/seinfeld'

Next, we'll load in the corpus reader from NLTK

In [3]:
# import the module to read in plain text
from nltk.corpus import PlaintextCorpusReader

As well as some other required NLTK resources

In [5]:
# import the NLTK library
import nltk

# download resources
nltk_resources = ['gutenberg', 'punkt', 'brown', 'state_union']

nltk.download(nltk_resources)

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package state_union to
[nltk_data]     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]   Package state_union is already up-to-date!


True

Now, we need to create a new variable from the `PlaintextCorpusReader`. We need to put the path to the files as the first argument, followed by a list of the file names we want to be included in the corpus. The files in the folder are:

```
THE BOYFRIEND PT 1_cleaned.txt
THE BOYFRIEND PT 2_cleaned.txt
THE CHINESE RESTAURANT_cleaned.txt
THE DEALERSHIP_cleaned.txt
THE DOODLE_cleaned.txt
THE ENGLISH PATIENT_cleaned.txt
THE FACE PAINTER_cleaned.txt
THE GOOD SAMARITAN_cleaned.txt
THE JUNIOR MINT_cleaned.txt
THE LITTLE KICKS_cleaned.txt
THE MARINE BIOLOGIST_cleaned.txt
THE PARKING GARAGE_cleaned.txt
THE PARKING SPACE_cleaned.txt
THE PEZ DISPENSER_cleaned.txt
```

Let's try it out on a single file to start. Hey look, the marine biologist episode is in here, so we can try that again.

(You may need to mount your Google Drive, or be comfortable using other methods to get these texts into Colab).

In [7]:
# read in my text (I've passed the name in a list, so I could include more than one text if I need to later)
marine_biologist_corpus = PlaintextCorpusReader(corpus_root, ['THE MARINE BIOLOGIST_cleaned.txt'])

Now that we've loaded a corpus (even if it is just one text), we can use the built-in NLTK corpus functions.

In [8]:
# The raw version should be just the string
# note we get the exact same output here as when we read the text in manually, above.
marine_biologist_corpus.raw()[10:100]

In [None]:
# we can also get sentences
marine_biologist_corpus.sents()

If you remember from the first part of NLTK, they were using functions like `.concordance()` on the built-in data. We can do the same with our data, but we need to wrap the tokenized words in an NLTK class called `Text`.

In [None]:
# Create a special Text version of the corpus
from nltk.text import Text
mb_txt = Text(marine_biologist_corpus.words())

In [None]:
# now we can look for concordance lines
mb_txt.concordance('GEORGE')

In [None]:
mb_txt.concordance('whale')

### Loading in multiple texts to make a corpus

A corpus of a single text is not very interesting. Let's update our `PlaintextCorpusReader` to include all of the texts in our Seinfeld folder. But, it sure would be annoying having to type all of the filenames one-by-one. Fortunately, there's a way around this.

We can use the [`glob` library](https://docs.python.org/3/library/glob.html) to grab all of the filenames in a directory. The `glob` function makes it easy to save all of the filenames from a directory into a variable.  

In [None]:
# import the function which is the same name as the module
from glob import glob

# the * indicates you want everything from the folder.
# we can use more intelligent ways to select only certain files, we'll see this later with regex
filenames = glob('/content/drive/MyDrive/seinfeld/*')

filenames

Doing this gives us the entire filepath, which doesn't really hurt us but is also kind of annoying. We could easily remove this using slicing. Because the part that we want to remove is always the same (i.e., the `/content/drive/MyDrive/seinfeld/` part), we could just slice that part off from each filename. All we need to know is where to start the slice.

In [None]:
# starting at 32 gives us the episode name only.
filenames[1][32:]

In [None]:
# let's write a list comprehension which removes the start of each filename
filenames_short = [name[32:] for name in filenames]

# voila!
filenames_short
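
Slicing at a fixed position works as long as every filepath starts with exactly the same folder. If the folder might change, a sketch of a more forgiving option is `os.path.basename`, which keeps whatever comes after the final slash. The two paths below are typed by hand as examples rather than returned by `glob`.

```python
import os

# example paths in the shape glob would return for the Drive folder
example_paths = ['/content/drive/MyDrive/seinfeld/THE DOODLE_cleaned.txt',
                 '/content/drive/MyDrive/seinfeld/THE PEZ DISPENSER_cleaned.txt']

# os.path.basename keeps only the part after the final slash
short_names = [os.path.basename(path) for path in example_paths]
print(short_names)
```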

Now we can just pass `filenames_short` to `PlaintextCorpusReader` and make a larger corpus. I tested it and it also works without cleaning the filepaths we get from `glob`, but this is nicer because it removes the clutter.

In [None]:
# make our seinfeld corpus
seinfeld_corpus = PlaintextCorpusReader(root = corpus_root, fileids = filenames_short)

In [None]:
# we can use the fileids function to see the texts in here
seinfeld_corpus.fileids()
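
As a small preview of regex, `PlaintextCorpusReader` can also take a regular expression as its second argument instead of a list of names, in which case it collects every matching file in the folder for you. The sketch below tries this on a throwaway folder; the folder and the three filenames are made up for illustration.

```python
import os
import tempfile
from nltk.corpus import PlaintextCorpusReader

# a throwaway folder with made-up files, standing in for a real data folder
pattern_root = tempfile.mkdtemp()
for name in ['a_cleaned.txt', 'b_cleaned.txt', 'notes.md']:
    with open(os.path.join(pattern_root, name), 'w', encoding='utf-8') as f:
        f.write('placeholder text')

# the second argument is a regular expression rather than a list of names
pattern_corpus = PlaintextCorpusReader(pattern_root, r'.*_cleaned\.txt')

# only the two files ending in _cleaned.txt should be picked up
print(pattern_corpus.fileids())
```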

In [None]:
# what are the ten most common words in our corpus?
from nltk import FreqDist
FreqDist(seinfeld_corpus.words()).most_common(10)

In [None]:
# and I can search for concordances, neat!
Text(seinfeld_corpus.words()).concordance('red')    

# **Your Turn**

You now have the ability to load in any text you like and use the existing NLTK corpus functions to explore the text.

Spend some time repeating the steps above for a different set of text data to make your own corpus. You might want to create a specific folder on your Google Drive which has your data for this course.