### Beginning Text Preparation

In order to perform text analysis, there are a few Python commands you should have up your sleeve. Some of the commands help get you set up and locate all of the files in your corpora. Other commands can be used throughout the programming process to check on your algorithm and make sure everything looks the way you think it should. Learning the following commands will give you a brief introduction to Python while also setting you up with a solid toolkit to begin programming.

### Loading files from GitHub to Carbonate

Since we will be using Carbonate as the example for our file paths and other elements in the scripts and Jupyter notebooks, many of you may want to save these files to Carbonate to make following along a bit easier. To do this you will need a Carbonate account if you do not already have one. Indiana University students, faculty, staff, and sponsored affiliates can request a Carbonate account. Steps on how to do so can be found [here](https://kb.iu.edu/d/aolp). Once you have your account, you'll need access to the [Research Desktop (ReD)](https://kb.iu.edu/d/apum), which you reach through the [ThinLinc client](https://kb.iu.edu/d/aput). Next, acquire the [Cyber DH Text-Analysis](https://github.com/cyberdh/Text-Analysis) GitHub repository and save it to Carbonate via Research Desktop. You can do this by opening the Firefox browser on ReD, going to [https://github.com/cyberdh/Text-Analysis](https://github.com/cyberdh/Text-Analysis), and clicking the green download button. Make sure to save it in your 'home' directory on ReD, which should be your `/N/u/yourUserName/Carbonate` file path. Then extract the contents of the .zip file to that same 'home' folder. Once you have done this you can also put a copy of the folder in your Box account.

The nice thing about ReD is that it comes with a built-in way to access your Box account. You can download the repository to Box and use the Text-Analysis notebooks and scripts on both Carbonate and your own computer, without having to use an SFTP client or some other means of moving files back and forth.

To use Box on ReD go to Applications > Storage > Box setup and follow the instructions. You can also get help [here](https://kb.iu.edu/d/apxv#storage).

Now that you have Carbonate, ReD, and Box up and running, let's make sure you can run the notebooks and scripts.

### Loading Jupyter Notebook and Spyder

To load Jupyter Notebook on ReD go to the top left corner of the desktop where it says "Applications" and go to Applications > Analytics > Jupyter Notebook and double click "Jupyter Notebook". It should load after a short wait.

To load Spyder (which you will want to use with all our .py files) there are a couple of steps.

To begin, you will want to open the text editor on ReD. To do this go to where it says "Applications" in the top left corner and then choose Utilities > Text Editor. Then in the text editor type the following:
> module unload python<br>
> module load anaconda/python3.6/4.3.1

Then save the file in your home directory, which should be named "Carbonate", and name the file ".modules". The "." in front is important as it tells the system that this is a hidden file. In addition, the system knows to look for a file by this name and execute the commands inside the file.

If you already have a ".modules" file, then add the above lines to the already existing file. To see if you already have a ".modules" file, open your home directory (again, named "Carbonate", or the folder on ReD labeled "username's Home" with a house on it). Once in the home directory go to "View" in the menu on the upper left and check the "Show Hidden Files" check box. Then look and see if you have a file named ".modules" and, if so, add the lines mentioned above to the file.

This will unload Python 2 and load Python 3. You need to have Python 3 as the default for your system in order to open Spyder.

Now, open your terminal and type `spyder` and hit return. There is a bit of a wait, but Spyder should open before a minute is up.

### Opening a notebook or script

To open a notebook (files ending in .ipynb) you will need to do so from Jupyter Notebook itself. When you first open Jupyter Notebook you will see a list of the folders in your 'home' directory (`/N/u/yourUserName/Carbonate`) on ReD. To get started with one of our notebooks, navigate using those folders to where the notebook is stored. So if you wanted to open this notebook you would find `Text-Analysis-master`, then click on `textPrep-Py.ipynb`, and the notebook will open in another tab in your browser. To start a new notebook of your own, look for the drop down menu labeled 'New' in the top right of the page when Jupyter Notebook first launches and choose Python 3. **DO NOT CHOOSE PYTHON 2.7** or these notebooks will not work. They are all written in Python 3.

Python scripts (files ending in .py) are run in Spyder. As with Jupyter Notebook, you cannot double click a .py file and have Spyder launch. You need to launch Spyder as described above and then go to File > Open. If you want to open our topTenPlainText-Py.py file, start in your 'home' directory (same as above), go to `Text-Analysis-master`, then open `WordFrequencies`, then `histograms`, then `scripts`, and finally click on `topTenPlainText-Py.py`. The script should open up in the Spyder application.

### Running code in notebooks

The easiest way to run the code in the cells is to hit 'shift' + 'return' on your keyboard. This runs your code a cell at a time by running the highlighted cell and then moving to highlight the cell directly below it. If the cell is Markdown (which this current cell is) instead of code, it will read and interpret the Markdown in the cell and give you clean crisp text to read. If the cell is code, it will run the code and then produce any output (if the code in that cell has any) in the space just below the cell. If you wish to run the whole notebook at once, you can go to 'Cell' in the menu and choose the 'Run All' option. This will run every cell in the notebook beginning with the top cell and ending with the bottom cell. This should be enough to get you started, so let's dig in!

### Run CyberDH environment
The code in the cell below points to a Python environment specifically for use with the Python Jupyter Notebooks created by Cyberinfrastructure for Digital Humanities. It allows for the use of the different packages in our notebooks and their subsequent data sets. You will see this in all of our Python notebooks and scripts.

##### Packages
- **sys:** Provides access to some variables used or maintained by the interpreter and to functions that interact strongly with the interpreter. It is always available.
- **os:** Provides a portable way of using operating system dependent functionality.

#### NOTE: This cell is only for use with Research Desktop. You will get an error if you try to run this cell on your personal device!!

In [1]:
import sys
import os
sys.path.insert(0,"/N/u/cyberdh/Carbonate/dhPyEnviron/lib/python3.6/site-packages")
os.environ["NLTK_DATA"] = "/N/u/cyberdh/Carbonate/dhPyEnviron/nltk_data"
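
If you would like a version of this setup that can also be run on a machine other than Research Desktop, one option is to check that the shared directory exists before adding it. The sketch below is our own assumption rather than part of the CyberDH setup, and the helper name `add_env_if_present` is hypothetical.

```python
import os
import sys

def add_env_if_present(site_packages, nltk_data):
    """Add a shared environment to sys.path only if its folder exists (hypothetical helper)."""
    if os.path.isdir(site_packages):
        sys.path.insert(0, site_packages)
        os.environ["NLTK_DATA"] = nltk_data
        return True
    return False

# Paths copied from the cell above; on a personal device they will not exist.
found = add_env_if_present(
    "/N/u/cyberdh/Carbonate/dhPyEnviron/lib/python3.6/site-packages",
    "/N/u/cyberdh/Carbonate/dhPyEnviron/nltk_data",
)
print("CyberDH environment found:", found)
```

When `found` is `False`, you would need to have the required packages installed on your own machine instead.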

### Beginning Text-Analysis with Python

##### Installing Packages
While all the packages needed to run our notebooks are installed in our cyberdh Python environment on Carbonate, you may get adventurous and decide you want to change or adjust the notebooks, and, therefore, may need a package we don't have installed. To install a Python package on ReD you need to open your terminal using the icon on the ReD desktop. At the command line type `pip3 install myDesiredPackage --user` and press enter. Replace `myDesiredPackage` with the name of the package you want to install. The `--user` part of the command tells pip to install the package to your user profile and not system wide. Without the `--user` the system will not let you install the package, as you do not have permission to install packages for everyone who uses ReD. Now, you should be able to use the functionality of your installed package along with the packages in our cyberdh Python environment.
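
Before installing anything, it can help to check whether Python can already find the package. The following is a minimal sketch using the standard library's `importlib`; the helper name `is_installed` is hypothetical, and it is meant for top-level package names only.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None

print(is_installed("os"))                # a standard-library module, always present
print(is_installed("myDesiredPackage"))  # the placeholder name used above
```

If the second check reports `False` for the package you actually want, that is a sign you need the `pip3 install` step.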

##### Calling Packages
Python uses various packages in order to expand on the capabilities of the language. This makes Python very flexible in its uses and increases its capabilities dramatically. Some packages have various modules within them, and sometimes you might want to import one of these modules instead of the entire package. You can even import a specific function or class from a module if you plan on only using that one item. This is what you see below. We use the nltk package, which stands for "Natural Language Toolkit." To import the entire nltk package you would simply type `import nltk`. If we wanted just the corpus module from nltk we would type `from nltk import corpus`. If we want to narrow it down to a single item from the corpus module in nltk we type what you see below. The below code says from the `nltk` package go to the `.corpus` module and `import` the `PlaintextCorpusReader` class.

In [2]:
from nltk.corpus import PlaintextCorpusReader
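
To see the three import styles side by side, here is a small sketch that uses the built-in `os` package in place of nltk, so it runs even where nltk is not installed. The folder names are only illustrative.

```python
import os                  # the whole package: call os.path.join(...)
from os import path        # one module from the package: call path.join(...)
from os.path import join   # one function from the module: call join(...)

a = os.path.join('data', 'shakespeareFolger')
b = path.join('data', 'shakespeareFolger')
c = join('data', 'shakespeareFolger')

# All three names refer to the same function, so the results match.
print(a == b == c)  # True
```

The narrower the import, the less you have to type each time you use it, at the cost of it being less obvious where the name came from.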

##### Reading your data file
Now we need to point to the file or directory we will be using. 

The first variable, `homePath`, points to your home directory. It does this using `environ`, a dictionary-like object from the `os` package we imported earlier that holds your environment variables. On Carbonate it automatically points to `/N/u/yourUserName/Carbonate`, so you don't have to change the username in the file path constantly in our notebooks. The code `os.environ['HOME']` says: from the `os` package, look in `environ` for the value of the `HOME` variable. The `HOME` variable is normally set on Linux and macOS; on Windows it is often not set, in which case `os.path.expanduser('~')` is a common alternative.

The second variable, `corpusRoot`, points to the directory our file is in. Again, we will be using functions from the `os` package. Here we use a function from a specific module of the `os` package to accomplish our goal. Our code `os.path.join` says that from the `os` package we need the `path` module, and from that module we need the `join` function, which will join elements of a file path together. The elements of the file path it will join together are in the parentheses just after the function. So we are joining our `homePath` that we made above to the folder `'Text-Analysis-master'`, which we then join to the folder `'data'`, which is finally joined to the `'shakespeareFolger'` folder. So the final file path created is `/N/u/myUserName/Carbonate/Text-Analysis-master/data/shakespeareFolger`. While it might seem simpler to just save the whole file path as a variable and be done, this method allows other users to utilize the code with minimal adjustments.

The third variable, `textFile`, is the file of interest, in this case Hamlet, by William Shakespeare. 

The fourth variable uses the `PlaintextCorpusReader` class we pulled from nltk.corpus to read the file. This reader can break text down into paragraphs, sentences (not lines), and words. Here we create the variable `corpus`. Whereas `textFile` is just a 'str' object holding the file name, `corpus` is a 'nltk.corpus.reader.plaintext.PlaintextCorpusReader' object, so the other functions built to work with this class of object will do what we want them to do.

In [3]:
homePath = os.environ['HOME']

corpusRoot = os.path.join(homePath, 'Text-Analysis-master','data','shakespeareFolger')

textFile = 'Hamlet.txt'

corpus = PlaintextCorpusReader(corpusRoot, textFile)
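
If the reader cannot find your file, it helps to check the path you built before going further. The sketch below is one way to do that, under the assumption that the repository was extracted to your home directory as described earlier. It uses `os.path.expanduser('~')` rather than `os.environ['HOME']`, and the variable names `corpus_root` and `text_path` are ours.

```python
import os

# os.path.expanduser('~') is a standard-library way to find the home directory.
home = os.path.expanduser('~')

corpus_root = os.path.join(home, 'Text-Analysis-master', 'data', 'shakespeareFolger')
text_path = os.path.join(corpus_root, 'Hamlet.txt')

print(corpus_root)
print('Corpus folder exists:', os.path.isdir(corpus_root))
print('Hamlet.txt exists:', os.path.isfile(text_path))
```

If either check prints `False`, the folder names or the location of the extracted repository are the first things to look at.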

##### Class
Sometimes you will get an error in Python that says that the function or action you are trying to use does not work with the type or 'class' of object you are trying to perform the action on. The next code basically says show (`print`) what class of object (`type`) my data or object is (`textFile`). In this case when we run the Python code, we see it is a string or 'str' object.

In [4]:
print(type(textFile))

<class 'str'>


Now if we want to see what class it is after our reader is finished with it, we simply change the object of interest in the parentheses to match. Since the `corpus` variable now refers to the resulting object after running our data through the PlaintextCorpusReader (from here on referred to as PCR), we will change `textFile` to `corpus`.

In [5]:
print(type(corpus))

<class 'nltk.corpus.reader.plaintext.PlaintextCorpusReader'>
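
The same check works on any object. As a small illustration with built-in types only (the variable names are ours), the sketch below also shows `isinstance`, which asks a yes-or-no question about an object's class.

```python
text_file = 'Hamlet.txt'
word_list = ['Who', "'", 's', 'there', '?']

print(type(text_file))              # <class 'str'>
print(type(word_list))              # <class 'list'>
print(isinstance(text_file, str))   # True
```

Checking the class like this is often the quickest way to make sense of an error about the wrong type of object.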


### Data Inspection
##### Paragraphs
The PCR counts paragraphs by assuming that blank lines separate the paragraphs. The first line of code takes the `corpus` variable we made before and calls the PCR method that separates the text into paragraphs. We do this by appending the method name to the end of the variable (`corpus.paras`), calling it with no parameters (the empty parentheses), and saving the result as a new variable called `paragraphs`.

In [6]:
paragraphs = corpus.paras()

Now let's say you want to count the number of paragraphs in the play Hamlet. You could simply put `print(len(paragraphs))`, which basically means show me (`print`) the number or length (`len`) of the variable `paragraphs`. As you can see, you get 450 paragraphs.

In [7]:
print(len(paragraphs))

450
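
To get a feel for how a text is divided into paragraphs, here is a simplified sketch in plain Python that treats blank lines as paragraph breaks. It is our own approximation for illustration, not the PCR's actual code, and the sample text is made up from the opening lines of the play.

```python
import re

sample = "Who's there?\n\nNay, answer me.\nStand and unfold yourself.\n\nLong live the King!\n"

# Split wherever a blank (or whitespace-only) line separates two blocks of text.
blocks = [b for b in re.split(r'\n\s*\n', sample.strip()) if b]

print(len(blocks))  # 3
for block in blocks:
    print(repr(block))
```

Notice that the second block keeps both of its lines together, because a single line break does not start a new paragraph.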


If you are writing code using an IDE like Spyder or PyCharm, and your code is fairly complicated and gives you multiple results, it might be helpful to label those results or your 450 paragraphs could get lost in the shuffle. To print a label to go with your results you simply put the label in quotes. Python reads anything in quotes as a string, so if you give the command to print and have a phrase in quotes it will print that phrase. So the code below says to `print` the phrase `# of paragraphs:`. The `{}\n` says put the results in place of the `{}`, then add a blank line (this is the `\n` part). Make sure the `{}\n` are inside the quotes too and you leave a space between the `:` and the `{}` or the answer will be placed right up against the colon. Then we tell Python the result to put in place of the `{}` using `.format` is the `len` (length) of the variable `paragraphs`.

In [8]:
print('# of paragraphs: {}\n'.format(len(paragraphs)))

# of paragraphs: 450
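
Here is the same labeling idea in isolation, using a stand-in number so it runs without the corpus. It also shows an f-string, an alternative spelling available in Python 3.6 and later that produces the same text; which one you use is a matter of taste.

```python
paragraph_count = 450  # stand-in value, taken from the result above

label = '# of paragraphs: {}'.format(paragraph_count)
same_label = f'# of paragraphs: {paragraph_count}'

print(label)                 # # of paragraphs: 450
print(label == same_label)   # True
```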



Counting the number of paragraphs is well and good, but you might want to see some of the paragraphs so you know what your code is actually counting as a paragraph. First, we choose a variable and make it equal to the number of paragraphs we wish to see. The reason we turn our number into a variable is because this number is important to a couple different parts of our code and this way if we wish to change the number of paragraphs we want to see, we only have to change the number in this variable. This eliminates the potential for error as we don't have to go through and change the number in multiple places. Let's start with a paragraph count of 3. We name the variable `cnt`, so anytime you see `cnt` it actually is assigned to the number 3.

In [9]:
cnt = 3

Now we want to print 3 paragraphs. First we set up our label. Notice the `{}` is in the middle of our phrase this time. That means the result will be placed there. In this case our result is simply whatever our 'cnt' variable is. This way, when we change `cnt`, our label automatically updates. Try playing with the `cnt` number and watch the label change from 'First 3 paragraphs:' to 'First 5 paragraphs:' and back again.

In [10]:
print('First {} paragraphs:\n'.format(cnt))

First 3 paragraphs:



Now we want to print the actual paragraphs. To do this we create a `for` loop. The code below says for each individual paragraph, `p`, in our `paragraphs` variable between the first paragraph and whatever our `cnt` variable is, `[0 : cnt]`, print that paragraph, `print(p)`, and then print a blank line, `print()`. Note that the first paragraph is numbered `0` and not `1`, and that the slice stops just before position `cnt`, so `[0 : 3]` gives paragraphs 0, 1, and 2. Many programming languages start counting at '0' and Python is no different. Also, the name you use does not have to be `p`. We use that simply because it makes sense when we as humans read the code. You could use z if you like, it would not matter. Just make sure the name in the `for` loop matches the name in the first print command. Also, indentation is important in Python. When you indent, it says that the indented code belongs to the code that is not indented or is less indented above it. In our code the `print(p)` and `print()` are part of the `for` loop above them. In addition, the `:` after `[0 : cnt]` ends the first line of our `for` loop and says that indented code follows containing the statements to be executed in our loop. If you don't indent properly, you will often get an `IndentationError`, which is a kind of syntax error. Now, let's run the whole code with the label and everything and see our results!

In [11]:
print('First {} paragraphs:\n'.format(cnt))
for p in paragraphs[0 : cnt]:
    
    print(p)
    print()

First 3 paragraphs:

[['Who', "'", 's', 'there', '?']]

[['Nay', ',', 'answer', 'me', '.'], ['Stand', 'and', 'unfold', 'yourself', '.'], ['Long', 'live', 'the', 'King', '!'], ['Barnardo', '.'], ['He', '.']]

[['You', 'come', 'most', 'carefully', 'upon', 'your', 'hour', '.']]
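
The slicing and looping pattern is easier to see on a short list. This sketch uses a made-up list of character names instead of the corpus.

```python
speakers = ['Barnardo', 'Francisco', 'Horatio', 'Marcellus', 'Hamlet']
cnt = 3

first = speakers[0:cnt]   # the items at positions 0, 1, and 2

for name in first:
    # Everything indented under the for line runs once per item.
    print(name)
```

Writing `speakers[:cnt]` gives the same result, since a slice starts at position 0 when the first number is left out.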



##### Lines
We are going to take a break from the PCR tool, but don't worry, we'll come back to it.

If you are interested in the number of lines in your text, the PCR tool does not separate lines. This is most likely because Python does this all on its own, no packages required.

First, we are going to open our file, in this case it is once again Hamlet, but we are going to tell Python to open the file and refer to the file as `file`. By using a `with` statement at the beginning we are telling Python to close the file when we are done without having to type anything. The code says open the file, `with open(os.path.join('path','to','your','file.txt'))`, and refer to the now opened file as `file` or whatever you want to call the file. The `:` at the end says that an indented block of code that works with this file comes next.

Now, we want Python to convert `file` into a list of lines. So we create the variable `lines` and make it equal to this process. The code says to take `file` and convert the lines to a list, `.readlines`, and now execute this function (the empty parentheses).

Now we have a list of lines, including blank ones, and at the moment it will count those empty lines since they contain the '\n' which signifies a 'new line' and is what `readlines` is actually using to determine what a line is. So we create another variable called `linesRd` and make it equal to the result of the list comprehension that follows. That code basically says go through every line, `line for line`, in the variable `lines`, and only keep the line if, when you strip the '\n' from it, `line.strip('\n')`, it does not equal nothing, `!= ''`. In other words, only keep the lines that have something in them once you remove the '\n'. Now we have a list of lines with no blank lines. (A line containing only spaces would still be kept, since we strip only the '\n'.)

In [12]:
with open(os.path.join(corpusRoot, 'Hamlet.txt')) as file:
    lines = file.readlines()
    linesRd = [line for line in lines if line.strip('\n') != '']
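
To try the same filtering without needing Hamlet on disk, the sketch below uses `io.StringIO` from the standard library as a stand-in for an opened file. The sample text and the names `fake_file` and `lines_rd` are ours.

```python
import io

# io.StringIO behaves like an opened text file, so no real file is needed here.
fake_file = io.StringIO("Who's there?\n\nNay, answer me.\n\n\nLong live the King!\n")

lines = fake_file.readlines()
lines_rd = [line for line in lines if line.strip('\n') != '']

print(len(lines))     # 6, counting the three blank lines
print(len(lines_rd))  # 3, blank lines removed
```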

Now we print out the number of total lines, as well as the text of the first 5 lines and the last 5 lines. This all works the same as when we counted the paragraphs above and printed the text of the first 3 paragraphs, with the exception of printing the last 5 lines. For that, the code looks almost identical to what we do to print the first 5 lines, except in the `for` statement. The part where we say which lines we want to see for the first 5 lines, `linesRd[0 : lncnt]`, looks like this for the last 5 lines: `linesRd[-lncnt:]`. The negative number says to start counting from the end of the list, and leaving the number after the `:` blank says to keep going until the end. If you know how many lines you have total, you could also give the positions of the last 5 lines, which in our case is `[4149 : 4154]`. However, the code below will give you the last lines without having to have an exact count. Because each line still ends with its own '\n', `print(l)` followed by `print()` leaves two blank lines between lines in the output.

In [13]:
lncnt = 5
print('# of lines: {}\n'.format(len(linesRd)))
     
print('First {} lines:\n'.format(lncnt))
for l in linesRd[0 : lncnt]:
    print(l)
    print()
    
print('Last {} lines:\n'.format(lncnt))
for l in linesRd[-lncnt:]:
    print(l)
    print()

# of lines: 4154

First 5 lines:

 Who's there?


 Nay, answer me. Stand and unfold yourself.


 Long live the King!


 Barnardo.


 He.


Last 5 lines:

 The soldier's music and the rite of war


 Speak loudly for him.


 Take up the bodies. Such a sight as this


 Becomes the field but here shows much amiss.


 Go, bid the soldiers shoot.
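
Negative slicing can be checked on a small list of numbers. This sketch shows that counting back from the end gives the same items as spelling out the exact positions; the variable names are ours.

```python
numbers = list(range(10))   # 0 through 9
n = 3

last_by_negative = numbers[-n:]
last_by_position = numbers[len(numbers) - n : len(numbers)]

print(last_by_negative)                       # [7, 8, 9]
print(last_by_negative == last_by_position)   # True
```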




##### Sentences
Now we come back to our PCR tool. The PCR tool can also separate the text into sentences, but uses punctuation to do so. This means it will not work if you have removed punctuation before attempting this. This is also different from counting lines as the lines use '\n' to determine what a line is, while the PCR sentence function uses '.', '?', and '!' to separate out sentences. For this we have also added code that shows the last 5 sentences. To practice, try adjusting the code so you can add it to the paragraphs section above and see the last paragraphs of Hamlet.

In [14]:
sentences = corpus.sents()

print('# of sentences: {}\n'.format(len(sentences)))

sntcnt = 5

print('First {} sentences:\n'.format(sntcnt))

for s in sentences[0 : sntcnt]:
    
    print(s)
    print()

print('Last {} sentences:\n'.format(sntcnt))

for s in sentences[-sntcnt:]:
    
    print(s)
    print()

# of sentences: 2494

First 5 sentences:

['Who', "'", 's', 'there', '?']

['Nay', ',', 'answer', 'me', '.']

['Stand', 'and', 'unfold', 'yourself', '.']

['Long', 'live', 'the', 'King', '!']

['Barnardo', '.']

Last 5 sentences:

['But', 'let', 'this', 'same', 'be', 'presently', 'performed', 'Even', 'while', 'men', "'", 's', 'minds', 'are', 'wild', ',', 'lest', 'more', 'mischance', 'On', 'plots', 'and', 'errors', 'happen', '.']

['Let', 'four', 'captains', 'Bear', 'Hamlet', 'like', 'a', 'soldier', 'to', 'the', 'stage', ',', 'For', 'he', 'was', 'likely', ',', 'had', 'he', 'been', 'put', 'on', ',', 'To', 'have', 'proved', 'most', 'royal', ';', 'and', 'for', 'his', 'passage', ',', 'The', 'soldier', "'", 's', 'music', 'and', 'the', 'rite', 'of', 'war', 'Speak', 'loudly', 'for', 'him', '.']

['Take', 'up', 'the', 'bodies', '.']

['Such', 'a', 'sight', 'as', 'this', 'Becomes', 'the', 'field', 'but', 'here', 'shows', 'much', 'amiss', '.']

['Go', ',', 'bid', 'the', 'soldiers', 'shoot', '.']
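
To see why sentence splitting depends on punctuation, here is a deliberately naive sketch that breaks text after '.', '?', or '!'. It is our own simplification for illustration and not how the PCR is implemented; the helper name `naive_sentences` is hypothetical.

```python
import re

text = "Nay, answer me. Stand and unfold yourself. Long live the King!"
no_punct = "Nay answer me Stand and unfold yourself Long live the King"

def naive_sentences(s):
    """Split after '.', '?', or '!' when followed by whitespace (illustration only)."""
    return re.split(r'(?<=[.?!])\s+', s.strip())

print(len(naive_sentences(text)))      # 3
print(len(naive_sentences(no_punct)))  # 1
```

With the punctuation removed there is nothing left to split on, so the whole text comes back as one "sentence".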


##### Words
Finally, we want to count words and perhaps see the first 10 words in our text. If you use PCR it will include punctuation as a separate 'word' or 'token' and include it in the count, as well as show each apostrophe and comma as a separate word. This can definitely be useful, and the code is the same as with the paragraphs and sentences sections above; only the initial function and variables are different.

In [15]:
tokens = corpus.words()

print('# of tokens: {}\n'.format(len(tokens)))

tokcnt = 10

print('First {} tokens:\n'.format(tokcnt))

for t in tokens[0 : tokcnt]:
    
    print(t)
    print()

# of tokens: 36653

First 10 tokens:

Who

'

s

there

?

Nay

,

answer

me

.
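
The way punctuation ends up as its own token can be imitated with a regular expression. The sketch below uses a pattern that grabs either a run of word characters or a run of punctuation. As far as we know this mirrors the pattern behind NLTK's `WordPunctTokenizer`, but treat it as an approximation rather than the PCR's exact behavior.

```python
import re

line = "Who's there?"

# Match a run of letters/digits, or a run of characters that are neither word characters nor spaces.
tokens = re.findall(r"\w+|[^\w\s]+", line)

print(tokens)       # ['Who', "'", 's', 'there', '?']
print(len(tokens))  # 5
```

This is why a two-word line like this one counts as five tokens.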



If you want to count and also see only words, with no punctuation, then PCR is not a good option. This is once again because Python has a way of doing this all on its own, with a built-in module called `string`. So we `import string`.

Next we need to build a lookup table of the characters we want to remove. We assign it to the variable `remove`. Reading the code from the inside out: `string.punctuation` is a string containing the ASCII punctuation characters. The `map` function applies a given function to every item in an iterable (such as a list, tuple, or string), so we do not have to apply the function to each item individually. Here we apply the function `ord`, which converts a single character to its integer Unicode code point (for example, `ord('?')` is 63). Finally, `dict.fromkeys` creates a dictionary that uses those code points as keys, each with the value `None`. We refer to the resulting dictionary as `remove`.

Then we open the file like we did in the lines section above. We then read the file, `ham.read()`, while using the `remove` dictionary we made for the translate function, `translate(remove)`, which removes anything found in the dictionary, and then we apply the `split` function which separates or "splits" the document into a list of words. We name the result of this process `wrds`. The rest is the same as the other sections.

In [16]:
import string
remove = dict.fromkeys(map(ord, string.punctuation))

with open(os.path.join(corpusRoot, 'Hamlet.txt')) as ham:
    wrds = ham.read().translate(remove).split()
    

print('# of words: {}\n'.format(len(wrds)))

wrdcnt = 10

print('First {} words:\n'.format(wrdcnt))

for w in wrds[0 : wrdcnt]:
    
    print(w)
    print()

# of words: 29648

First 10 words:

Whos

there

Nay

answer

me

Stand

and

unfold

yourself

Long
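
To look inside the `remove` dictionary and watch `translate` work on a single line, here is a short sketch that needs no file.

```python
import string

remove = dict.fromkeys(map(ord, string.punctuation))

print(ord('?'))           # 63, the code point for '?'
print(remove[ord('?')])   # None, which tells translate to delete that character

words = "Who's there?".translate(remove).split()
print(words)              # ['Whos', 'there']
```

Note that removing the apostrophe turns "Who's" into "Whos", which is the same effect you see in the first word of the output above.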



### Explore
The above commands are a few tips and tricks to get you started with Python. Similar to Python's extensibility with packages, the Python user community has great resources for learners. The [Python Package Index](https://pypi.python.org/pypi) and the [Python Docs](https://docs.python.org/3/contents.html) answer quite a few questions about Python and its uses.

Googling the issue, function, package, or object name with "python" will return helpful resources. If a website dedicated to the specific package appears, there you will find extensive documentation and examples for the package's functions, etc. and other related resources. For any other issues, Stack Overflow is helpful to find answers to common questions as well as ask your own.

The rest of the IU Cyber DH tutorials explain some methods for textual analysis using Python. If you are ready to dive in, click on one to begin!