# Working with Texts in Python
## Getting Started

Jeremy Mikecz

Research Data Services

Dartmouth College

This notebook is a shortened and updated version of notebooks found in the following resources:
+ Text Analysis with Python
+ Computational Text Analysis Week


## 1. Importing One Text from a File



Import the **pathlib** library to work with file paths.

In [1]:
from pathlib import Path

Create a path to our desired files, then open one of those files.

In [None]:
sotu_dir = Path("../../texts/sotu/txt")

text_dir = Path(sotu_dir, "Roosevelt_1944.txt")

# f is a file object; f.read() returns the file's entire contents as one text string
with open(text_dir, encoding='utf-8') as f:
    txt = f.read()
print(txt)


To the Congress:

This Nation in the past two years has become an active partner in the
world's greatest war against human slavery.

We have joined with like-minded people in order to defend ourselves in a
world that has been gravely threatened with gangster rule.

But I do not think that any of us Americans can be content with mere
survival. Sacrifices that we and our allies are making impose upon us all a
sacred obligation to see to it that out of this war we and our children
will gain something better than mere survival.

We are united in determination that this war shall not be followed by
another interim which leads to new disaster--that we shall not repeat the
tragic errors of ostrich isolationism--that we shall not repeat the excesses
of the wild twenties when this Nation went for a joy ride on a roller
coaster which ended in a tragic crash.

When Mr. Hull went to Moscow in October, and when I went to Cairo and
Teheran in November, we knew that we were in agreement with our alli
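
The path-building and file-reading steps above can also be written with pathlib alone. The block below is a minimal sketch, not the notebook's own method: the folder and the file `Example_1944.txt` are hypothetical stand-ins created in a temporary directory so that the example runs without the speech files.

```python
from pathlib import Path
import tempfile

# Hypothetical stand-in for the speeches folder, created in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "sotu" / "txt"   # the / operator joins path parts
    demo_dir.mkdir(parents=True)
    demo_file = demo_dir / "Example_1944.txt"
    demo_file.write_text("To the Congress:\n", encoding="utf-8")

    print(demo_file.name)      # file name with its extension
    print(demo_file.stem)      # file name without its extension
    print(demo_file.exists())  # whether the file is on disk

    # read_text() opens, reads, and closes the file in one call
    demo_txt = demo_file.read_text(encoding="utf-8")
    print(demo_txt)
```

Checking `exists()` before reading is a quick way to catch a mistyped relative path such as `../../texts/sotu/txt`.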

## 2. Processing Text and Extracting Information

### 2a. Tokenization

In text analysis, we normally work with words rather than sentences. The process of dividing a text string into words and other segments, such as punctuation and numbers, is called **tokenization**, and the resulting segments are called **tokens**.

In [3]:
len(txt)  # number of characters in the text string

22108

The easiest way to tokenize text is to use the **.split()** method, which splits a string on whitespace. Note, however, that it does not separate punctuation from words (e.g. "To the Congress:" becomes ["To", "the", "Congress:"]).

In [4]:
tokens1 = txt.split()
print(tokens1[:20])
len(tokens1)

['To', 'the', 'Congress:', 'This', 'Nation', 'in', 'the', 'past', 'two', 'years', 'has', 'become', 'an', 'active', 'partner', 'in', 'the', "world's", 'greatest', 'war']


3755
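
One way to separate the punctuation that `.split()` leaves attached is a regular expression. The block below is a minimal sketch using the standard library's `re` module; the variable names `sample` and `regex_tokens` and the pattern itself are illustrative choices, not part of the original notebook.

```python
import re

sample = "To the Congress: This Nation in the past two years"

# \w+      matches a run of letters, digits, or underscores (a word or number)
# [^\w\s]  matches one character that is neither a word character nor whitespace
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(regex_tokens)
```

This pattern is deliberately simple: it also splits contractions and possessives, so "world's" would become three tokens ("world", "'", "s").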

### 2b. Basic Pre-processing

In [14]:
print(txt[:100])
print("***")
print(txt[-100:])

type(txt)

To the Congress:

This Nation in the past two years has become an active partner in the
world's grea
***
in its most critical hour--to keep this Nation great--to make this
Nation greater in a better world.


str

In [None]:
# Processing the full text string
# Type `txt.` and see the options that appear. These are methods that work on strings and strings only.

**Methods** work much like functions. The key difference is that:

- functions are independent and can be called by name alone, e.g. `len(txt)`
- methods act on objects of a particular class. In plain terms, some methods work only on text strings; others work only on integers or dataframes.

Thus, the syntax for calling a method is:

```
object_name.method_name()
```
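
The difference between the two call styles can be sketched with a short string. The variable `greeting` is a hypothetical example value.

```python
greeting = "hello!"

# A function is called by name, with the object passed in as an argument
print(len(greeting))

# A method is called on the object itself, using dot notation
print(greeting.upper())

# Methods belong to a class: .upper() exists for strings but not for integers
print(hasattr(greeting, "upper"))
print(hasattr(42, "upper"))
```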

Examples of [common methods for text strings](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str) include:

- str.capitalize()
- str.encode()
- str.endswith()
- str.startswith()
- str.lower()
- str.islower()
- str.upper()
- str.isupper()
- str.strip()
- str.replace()
- str.split()

Note: "str" above refers to either a raw text string such as `"hello!"` or a variable that contains a string such as `question = "what is happening?"`.

Let's see what some of these string methods do:

In [None]:
# try applying some of these methods to our text string stored in the variable `txt` 
# or a new text string you create:
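
As one possible starting point, the sketch below applies several of the methods listed above to a short string. The variable names `question` and `trimmed` are illustrative.

```python
question = "  What is happening?  "

trimmed = question.strip()            # remove leading and trailing whitespace
print(trimmed)
print(trimmed.lower())                # all lowercase
print(trimmed.upper())                # all uppercase
print(trimmed.startswith("What"))     # does the string begin with "What"?
print(trimmed.endswith("?"))          # does the string end with "?"
print(trimmed.islower())              # are all cased characters lowercase?
print(trimmed.replace("happening", "going on"))
print(trimmed.split())                # list of whitespace-separated pieces

# Strings are immutable: every method returns a new string,
# so the original variable still holds its padding
print(repr(question))
```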

In [23]:
# You can chain multiple methods together. Python applies them from left to right:
# the first method acts on the object, and each later method acts on the result
# returned by the one before it.
sent = "  Two roads diverged in a wood and I - I took the one less traveled by, and that has made all the difference.   "
# Here .strip() removes the surrounding whitespace, then .endswith(".") checks the stripped string
sent.strip().endswith(".")

True

In [None]:
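# another example of chained methods
# note: .replace() is case-sensitive, so "two" does not match the capitalized "Two";
# the result is displayed but not stored, so `sent` keeps its original words
sent = sent.strip()
sent.replace("two", "three").replace("one", "two")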

## Working with Lists (of words)

In [25]:
# the most basic form of tokenization: splitting on whitespace
words = sent.split()
words

['Two',
 'roads',
 'diverged',
 'in',
 'a',
 'wood',
 'and',
 'I',
 '-',
 'I',
 'took',
 'the',
 'one',
 'less',
 'traveled',
 'by,',
 'and',
 'that',
 'has',
 'made',
 'all',
 'the',
 'difference.']

In [31]:
type(words)

list

In [20]:
print(len(words))

23


In [27]:
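# .append() changes the list in place, so each run of this cell adds another "This";
# the output below shows it twice, which suggests the cell was run more than once
words.append("This")
print(words[-5:])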


['all', 'the', 'difference.', 'This', 'This']


In [28]:
words.extend(["is", "the", "end."])
words

['Two',
 'roads',
 'diverged',
 'in',
 'a',
 'wood',
 'and',
 'I',
 '-',
 'I',
 'took',
 'the',
 'one',
 'less',
 'traveled',
 'by,',
 'and',
 'that',
 'has',
 'made',
 'all',
 'the',
 'difference.',
 'This',
 'This',
 'is',
 'the',
 'end.']
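
The difference between `.append()` and `.extend()` can be sketched on a small list. The list `demo` is a hypothetical example.

```python
demo = ["the", "end"]

demo.append("now")             # adds one item to the end of the list
print(demo)

demo.extend(["and", "then"])   # adds each item of another list, one by one
print(demo)

demo.append(["a", "list"])     # append with a list adds it as a single nested item
print(demo)
print(len(demo))
```

Both methods modify the list in place and return `None`, so there is no need to assign the result back to the variable.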

In [29]:
# convert every word to lowercase with a for loop
lower_words = []
for word in words:
    lower_words.append(word.lower())
lower_words

['two',
 'roads',
 'diverged',
 'in',
 'a',
 'wood',
 'and',
 'i',
 '-',
 'i',
 'took',
 'the',
 'one',
 'less',
 'traveled',
 'by,',
 'and',
 'that',
 'has',
 'made',
 'all',
 'the',
 'difference.',
 'this',
 'this',
 'is',
 'the',
 'end.']

In [30]:
# the same result with a list comprehension
lower_words2 = [word.lower() for word in words]
lower_words2

['two',
 'roads',
 'diverged',
 'in',
 'a',
 'wood',
 'and',
 'i',
 '-',
 'i',
 'took',
 'the',
 'one',
 'less',
 'traveled',
 'by,',
 'and',
 'that',
 'has',
 'made',
 'all',
 'the',
 'difference.',
 'this',
 'this',
 'is',
 'the',
 'end.']
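
A list comprehension can also transform and filter in one step. The block below is a minimal sketch that strips the punctuation left attached by `.split()`; the names `sample_words` and `clean_words` are illustrative, and `string.punctuation` covers ASCII punctuation only.

```python
import string

sample_words = ["Two", "roads", "-", "by,", "difference."]

# Strip punctuation from both ends of each word, lowercase it,
# and drop any item that is empty once the punctuation is gone (such as "-")
clean_words = [
    w.strip(string.punctuation).lower()
    for w in sample_words
    if w.strip(string.punctuation)
]
print(clean_words)
```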

There are a variety of Python libraries and modules that help us work with texts. One interesting module that comes with the Python Standard Library is [**difflib**](https://docs.python.org/3/library/difflib.html). It allows us to compare sequences of text and report the differences between them.

Examine the code cells below. They apply the **ndiff** function from the difflib module to two lists of words. In the results, lines beginning with `-` appear only in the first list, lines beginning with `+` appear only in the second, and lines beginning with `?` mark where two similar items differ:

In [16]:
import difflib

sent1 = "What in the world is going on over there?".split()
sent2 = "What the heck is goin' on down there?".split()

In [17]:
from difflib import ndiff

diff = ndiff(sent1, sent2)
print("\n".join(diff))

  What
- in
  the
- world
+ heck
  is
- going
?     ^

+ goin'
?     ^

  on
- over
+ down
  there?
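
Because each line returned by `ndiff` starts with a two-character code, the results can be sorted into removed and added words. The block below is a minimal sketch; the list names `removed` and `added` are illustrative.

```python
from difflib import ndiff

sent1 = "What in the world is going on over there?".split()
sent2 = "What the heck is goin' on down there?".split()

removed, added = [], []
for line in ndiff(sent1, sent2):
    code, word = line[:2], line[2:]
    if code == "- ":      # present only in the first list
        removed.append(word)
    elif code == "+ ":    # present only in the second list
        added.append(word)
    # lines starting with "  " (shared) or "? " (hints) are skipped

print(removed)
print(added)
```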


## Using spaCy to Tokenize and Process Texts

In [6]:
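import spacy
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

# load spaCy's small English pipeline (including the tagger, parser, and named entity recognizer);
# the model must be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# alternative: tokenize only, without loading a trained pipeline
#nlp = English()
#tokenizer = Tokenizer(nlp.vocab)
#tokenizer2 = nlp.tokenizer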

#tokens2 = tokenizer(txt)
#print(tokens2[:40], "\n\n")

#tokens3 = tokenizer2(txt)
#print(tokens3[:40], "\n\n")

doc = nlp(txt)
doc


To the Congress:

This Nation in the past two years has become an active partner in the
world's greatest war against human slavery.

We have joined with like-minded people in order to defend ourselves in a
world that has been gravely threatened with gangster rule.

But I do not think that any of us Americans can be content with mere
survival. Sacrifices that we and our allies are making impose upon us all a
sacred obligation to see to it that out of this war we and our children
will gain something better than mere survival.

We are united in determination that this war shall not be followed by
another interim which leads to new disaster--that we shall not repeat the
tragic errors of ostrich isolationism--that we shall not repeat the excesses
of the wild twenties when this Nation went for a joy ride on a roller
coaster which ended in a tragic crash.

When Mr. Hull went to Moscow in October, and when I went to Cairo and
Teheran in November, we knew that we were in agreement with our alli

In [7]:
tokens = [token.text for token in doc]
print(len(tokens))
tokens[:30]

4649


['To',
 'the',
 'Congress',
 ':',
 '\n\n',
 'This',
 'Nation',
 'in',
 'the',
 'past',
 'two',
 'years',
 'has',
 'become',
 'an',
 'active',
 'partner',
 'in',
 'the',
 '\n',
 'world',
 "'s",
 'greatest',
 'war',
 'against',
 'human',
 'slavery',
 '.',
 '\n\n',
 'We']

### Part of Speech (POS) Tagging

In [8]:
import pandas as pd

# collect the linguistic attributes of the first 100 tokens, one dictionary per token
tokenlist = []
for token in doc[:100]:
    tokdict = {}
    tokdict["token"] = token.text
    tokdict["lemma"] = token.lemma_
    tokdict["pos"] = token.pos_
    tokdict["tag"] = token.tag_
    tokdict["dep"] = token.dep_
    tokdict["shape"] = token.shape_
    tokdict["alpha"] = token.is_alpha
    tokdict["stop"] = token.is_stop
    tokenlist.append(tokdict)
tokdf = pd.DataFrame(tokenlist)
tokdf

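| | token | lemma | pos | tag | dep | shape | alpha | stop |
|---|---|---|---|---|---|---|---|---|
| 0 | To | to | ADP | IN | prep | Xx | True | True |
| 1 | the | the | DET | DT | det | xxx | True | True |
| 2 | Congress | Congress | PROPN | NNP | pobj | Xxxxx | True | False |
| 3 | : | : | PUNCT | : | punct | : | False | False |
| 4 | `\n\n` | `\n\n` | SPACE | `_SP` | dep | `\n\n` | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | out | out | ADP | IN | prep | xxx | True | True |
| 96 | of | of | ADP | IN | prep | xx | True | True |
| 97 | this | this | DET | DT | det | xxxx | True | True |
| 98 | war | war | NOUN | NN | pobj | xxx | True | False |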


In [9]:
spacy.explain("ADP")

'adposition'

### Sentence Segmentation

In [10]:
for sent in doc.sents:
    print(sent.text)

To the Congress:

This Nation in the past two years has become an active partner in the
world's greatest war against human slavery.


We have joined with like-minded people in order to defend ourselves in a
world that has been gravely threatened with gangster rule.


But I do not think that any of us Americans can be content with mere
survival.
Sacrifices that we and our allies are making impose upon us all a
sacred obligation to see to it that out of this war we and our children
will gain something better than mere survival.


We are united in determination that this war shall not be followed by
another interim which leads to new disaster--that we shall not repeat the
tragic errors of ostrich isolationism--that we shall not repeat the excesses
of the wild twenties when this Nation went for a joy ride on a roller
coaster which ended in a tragic crash.


When Mr. Hull went to Moscow in October, and when I went to Cairo and
Teheran in November, we knew that we were in agreement with our 

In [11]:
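from spacy import displacy
# doc.sents is a generator, so convert it to a list to select a sentence by index
sentences = list(doc.sents)
# draw the dependency parse of the third sentence
displacy.render(sentences[2], style="dep")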

### Named Entity Recognition

In [12]:
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)

Congress ORG 7 15
the past two years DATE 33 51
Americans NORP 300 309
Hull PERSON 871 875
Moscow GPE 884 890
October DATE 894 901
Cairo GPE 922 927
Teheran GPE 932 939
November DATE 943 951
two CARDINAL 1603 1606
Hull PERSON 1657 1661
Nation ORG 1730 1736
Santa Claus PERSON 1784 1795
Churchill PERSON 1875 1884
Marshal Stalin PERSON 1890 1904
Generalissimo Chiang Kai-shek PERSON 1910 1939
Constitution LAW 1997 2009
Hull PERSON 2025 2029
Allied ORG 2195 2201
one CARDINAL 2360 2363
the United Nations ORG 2459 2477
one CARDINAL 2499 2502
Nations ORG 2693 2700
Generalissimo PERSON 2756 2769
Marshal Stalin PERSON 2774 2788
Churchill PERSON 2808 2817
China GPE 3330 3335
Russia GPE 3340 3346
Britain GPE 3369 3376
America GPE 3381 3388
Nation ORG 3456 3462
Nations ORG 3512 3519
Germany GPE 3639 3646
Italy GPE 3648 3653
Japan GPE 3659 3664
Nations ORG 3747 3754
Nations ORG 4109 4116
American NORP 4176 4184
power-- ORG 4379 4386
Moscow GPE 4614 4620
Cairo GPE 4622 4627
Teheran GPE 4633 4640
Wash