# Working with strings

It is possible to extract quite a lot of interesting, structured information from text data simply by using string processing techniques.

In this session, we'll see how to do some of these things, specifically calculating word frequencies and showing key-words-in-context (concordances). We'll do this for individual files and then you'll work together to write Python code which does this for a larger corpus of texts.

In [1]:
import os # interacting with operating system
import re # regex
import string
from collections import Counter, OrderedDict # counting and ordered dictionaries

__Loading text files__

We start by defining a filepath using ```os.path.join()``` like we saw last week.

In [3]:
dick_path = os.path.join("..", "data", "Dickens_Expectations_1861.txt")

We then need to load the file that we want to work with.

There are a number of ways to do this in Python, but the following should be considered "best practice".

In [15]:
with open(dick_path, "r", encoding = "utf-8-sig") as f: # specify what encoding the file is in
    text = f.read()
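
As a small aside on the ```encoding``` argument: the sketch below shows what ```"utf-8-sig"``` does differently from plain ```"utf-8"```. It writes a tiny temporary file that starts with a byte order mark (BOM) and reads it back both ways. The file name ```bom_demo.txt``` and its contents are made up for illustration and are not part of the course data.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical demo file: three BOM bytes followed by some text.
    demo_path = os.path.join(tmp_dir, "bom_demo.txt")
    with open(demo_path, "wb") as f:
        f.write(b"\xef\xbb\xbf" + b"GREAT EXPECTATIONS")

    # Plain utf-8 keeps the BOM as an invisible first character.
    with open(demo_path, "r", encoding="utf-8") as f:
        with_bom = f.read()

    # utf-8-sig removes the BOM if the file starts with one.
    with open(demo_path, "r", encoding="utf-8-sig") as f:
        without_bom = f.read()

print(with_bom.startswith("\ufeff"))     # True
print(without_bom.startswith("\ufeff"))  # False
print(len(with_bom) - len(without_bom))  # 1
```

If a file has no BOM, ```"utf-8-sig"``` reads it exactly like ```"utf-8"```, so it is a safe default for text files of unknown origin.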

When we load the text file, we just have a simple string object which can be indexed and sliced.

In [16]:
print(text[:300])

text[:300] # without print(), the notebook shows the raw string, so the \n escapes stay visible

REAT EXPECTATIONS
 1867 Edition 
by Charles Dickens
Chapter I
My father's family name being Pirrip, and my Christian name Philip, my
infant tongue could make of both names nothing longer or more explicit
than Pip. So, I called myself Pip, and came to be called Pip.
I give Pirrip as my father's famil


"REAT EXPECTATIONS\n 1867 Edition \nby Charles Dickens\nChapter I\nMy father's family name being Pirrip, and my Christian name Philip, my\ninfant tongue could make of both names nothing longer or more explicit\nthan Pip. So, I called myself Pip, and came to be called Pip.\nI give Pirrip as my father's famil"

You can see that the formatting is a little funky: the raw string contains lots of newline characters (```\n```).

We can get rid of those by using the ```.replace()``` method on strings.

In [17]:
text = text.replace("\n", " ")
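
One thing to be aware of: replacing each newline with a space can leave double spaces wherever a line already started or ended with one. The sketch below uses a short made-up ```sample``` string, not the novel, and shows one possible alternative that collapses any run of whitespace with ```re.sub()```.

```python
import re

# Made-up sample with the same kind of line breaks as the novel's header.
sample = "GREAT EXPECTATIONS\n 1867 Edition \nby Charles Dickens"

# Same approach as above: every newline becomes a single space.
replaced = sample.replace("\n", " ")

# Alternative: collapse any run of whitespace into one space.
collapsed = re.sub(r"\s+", " ", sample).strip()

print(repr(replaced))   # 'GREAT EXPECTATIONS  1867 Edition  by Charles Dickens'
print(repr(collapsed))  # 'GREAT EXPECTATIONS 1867 Edition by Charles Dickens'
```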

__Tokenize text__

So far, we have one long string of characters. But we want to be able to work with individual words. To do that, we have to *tokenize* our data - in other words, to split it into individual tokens (or words).

In [None]:
tokens = text.split(" ")
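
A small sketch of how this behaves on a made-up ```sample``` string: splitting on a single space produces empty tokens wherever two spaces sit next to each other, whereas calling ```.split()``` with no argument splits on any run of whitespace. Which one you want depends on the task. The empty tokens are harmless when counting a specific keyword, but they do show up in overall frequency counts.

```python
# Made-up sample containing double spaces.
sample = "GREAT EXPECTATIONS  1867 Edition  by Charles Dickens"

by_space = sample.split(" ")   # split on every single space
by_whitespace = sample.split() # split on runs of whitespace

print(by_space)
# ['GREAT', 'EXPECTATIONS', '', '1867', 'Edition', '', 'by', 'Charles', 'Dickens']
print(by_whitespace)
# ['GREAT', 'EXPECTATIONS', '1867', 'Edition', 'by', 'Charles', 'Dickens']
```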

__Get sentences with regex__

We can use a similar logic to split the data into separate sentences.

This time we use a bit of ```regex``` to do our string splitting.

In [19]:
sentences = re.split(r"[\.\?!]\s*", text) # Mr. Jones will be split into two sentences

["REAT EXPECTATIONS  1867 Edition  by Charles Dickens Chapter I My father's family name being Pirrip, and my Christian name Philip, my infant tongue could make of both names nothing longer or more explicit than Pip",
 'So, I called myself Pip, and came to be called Pip',
 "I give Pirrip as my father's family name, on the authority of his tombstone and my sister, - Mrs",
 'Joe Gargery, who married the blacksmith',
 'As I never saw my father or my mother, and never saw any likeness of either of them  for their days were long before the days of photographs , my first fancies regarding what they were like were unreasonably derived from their tombstones',
 "The shape of the letters on my father's, gave me an odd idea that he was a square, stout, dark man, with curly black hair",
 'From the character and turn of the inscription, "Also Georgiana Wife of the Above," I drew a childish conclusion that my mother was freckled and sickly',
 'To five little stone lozenges, each about a foot and a ha
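
The output shows the problem noted in the code comment: "Mrs. Joe Gargery" is cut in two. The sketch below reproduces this on a made-up ```sample``` string and tries one possible refinement, a pair of negative lookbehinds that skip a period directly after "Mr" or "Mrs". The pattern name ```ABBREVIATION_AWARE``` is hypothetical, and the approach only covers the abbreviations you list by hand, so treat it as an illustration rather than a robust sentence splitter.

```python
import re

# Made-up sample with two title abbreviations.
sample = "I asked Mrs. Joe Gargery. She said no! Did Mr. Jones agree? Yes."

# The pattern used above: split on ., ? or ! plus any trailing whitespace.
naive = re.split(r"[\.\?!]\s*", sample)

# Hypothetical refinement: do not split when the punctuation
# directly follows the whole word "Mr" or "Mrs".
ABBREVIATION_AWARE = r"(?<!\bMr)(?<!\bMrs)[.?!]\s*"
refined = [s for s in re.split(ABBREVIATION_AWARE, sample) if s]  # drop empty strings

print(naive)
# ['I asked Mrs', 'Joe Gargery', 'She said no', 'Did Mr', 'Jones agree', 'Yes', '']
print(refined)
# ['I asked Mrs. Joe Gargery', 'She said no', 'Did Mr. Jones agree', 'Yes']
```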

## Find word frequencies

We can count how many times an individual word appears manually, simply by iterating over the list of tokens and using a counter.

To do this, we use a ```for``` loop and add one to a counter variable each time a cleaned token matches the keyword.

In [22]:
counter = 0
keyword = "love"

for token in tokens:
    strip = token.strip(string.punctuation) # remove punctuation 
    lower = strip.lower() # lowercase
    if lower == keyword:
        counter += 1

print(counter)

60


In [23]:
tokens.count("love")

43

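The built-in ```.count()``` method gives a lower number because it only matches tokens that are exactly ```"love"```, so tokens with different casing or attached punctuation are missed. If we clean every token first and store the results in a new list, ```.count()``` agrees with the manual loop.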

In [25]:
clean = []
for token in tokens:
    strip = token.strip(string.punctuation) # remove punctuation 
    lower = strip.lower() # lowercase
    clean.append(lower)

clean.count("love")

60

Python also has some built-in tools, such as ```Counter``` from the ```collections``` module, which we can use to count how many times every token appears in a list.

In [None]:
Counter(clean).most_common()

In [None]:
Counter(clean)
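
To make the output of ```Counter``` easier to read, here is a small sketch on a made-up list called ```sample_tokens```. It cleans the tokens in the same way as above, then shows ```.most_common(n)``` for the top ```n``` entries and dictionary-style lookup for a single word.

```python
import string
from collections import Counter

# Made-up tokens with mixed casing and punctuation.
sample_tokens = ["Love,", "love", "the", "The", "love.", "Pip!"]

# Same cleaning as above: strip punctuation from the edges, then lowercase.
clean_sample = [t.strip(string.punctuation).lower() for t in sample_tokens]

freqs = Counter(clean_sample)

print(freqs.most_common(2))  # [('love', 3), ('the', 2)]
print(freqs["love"])         # 3
print(freqs["estella"])      # 0, missing keys count as zero
```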

We can use a similar logic to find all sentences where a certain keyword appears. There are some problems, though!

In [None]:
counter = 0
keyword = "love"

for idx, sent in enumerate(sentences): # idx is index
    strip = sent.strip(string.punctuation) # remove punctuation 
    lower = strip.lower() # lowercase
    modified_kw = " " + keyword + " " 
    if modified_kw in lower:
        counter += 1
        print(idx, lower)

counter

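If "love" is the first or last word of a sentence, this check will not catch it, because there is no whitespace on that side of the word. But if we do not pad the keyword with whitespace, we also match longer words such as "lover", "loves", etc.

```enumerate()``` pairs each sentence with its index, so we can print the sentence number alongside the text.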
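
One possible way around both problems is a regex word boundary, ```\b```, which matches at the edge of a word whether or not whitespace follows. The sketch below compares the whitespace-padded check with a word-boundary pattern on a few made-up sentences. The names ```sample_sentences```, ```padded_hits``` and ```boundary_hits``` are only for illustration.

```python
import re

keyword = "love"

# Made-up sentences covering the tricky cases.
sample_sentences = [
    "I love her",                    # keyword in the middle
    "She was my first love",         # keyword is the last word
    "He was a lover of gloves",      # only longer words containing "love"
    "Love her, love her, love her",  # keyword is also the first word
]

# The whitespace-padded check from above.
padded_hits = [idx for idx, sent in enumerate(sample_sentences)
               if " " + keyword + " " in sent.lower()]

# Word boundaries match at the start and end of a string too.
pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", flags=re.IGNORECASE)
boundary_hits = [idx for idx, sent in enumerate(sample_sentences)
                 if pattern.search(sent)]

print(padded_hits)    # [0, 3]
print(boundary_hits)  # [0, 1, 3]
```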

In [None]:
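print(list(enumerate(tokens[:10]))) # first ten (index, token) pairs; enumerate() alone only prints an enumerate object
print(tokens[:10]) # the same tokens without their indices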

## Viewing keywords in context (KWIC, concordancing)

In [36]:
# cleaned is the list of clean tokens

# .strip() and .lower() are string methods, so apply them to each token rather than to the list
cleaned = [token.strip(string.punctuation).lower() for token in tokens] # remove punctuation and lowercase

for idx, token in enumerate(cleaned):
    if token == keyword:
        before = ' '.join(cleaned[max(0, idx-5):idx]) # join the previous 5 words with a whitespace between them; max() avoids a negative start index
        after = ' '.join(cleaned[idx+1:idx+6]) # join the next 5 words
        full = [before, token, after]
        print(idx, "{:50} {:20} {:50}".format(*full)) # the numbers denote how many characters/columns each variable should fill; * unpacks full into separate arguments
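
The column widths in the format string are easier to see on a single made-up example. In the sketch below, ```{:25}``` pads a string to 25 characters, left-aligned by default. Adding ```>``` or ```^``` right-aligns or centres it, which is one optional way to line the keyword up in a concordance. The three example strings are invented for illustration.

```python
# Made-up context window around one keyword.
before = "came to be called"
token = "pip"
after = "i give pirrip as my"

# Default alignment: every column is left-aligned and padded with spaces.
left_aligned = "{:25} {:10} {:25}".format(before, token, after)

# Optional variant: right-align the left context and centre the keyword.
kwic_aligned = "{:>25} {:^10} {:<25}".format(before, token, after)

print(repr(left_aligned))
print(repr(kwic_aligned))
print(len(left_aligned), len(kwic_aligned))  # 62 62
```

Both lines have the same total width, so rows stay aligned as long as no field is longer than its column.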

## Exercises

In groups, work on the following exercises in class. 

I've left these somewhat underspecified, so you're welcome to solve them in whatever way you please, and to save the results in whatever format you think works best.

- Write some code which searches through *all* of the novels in the folder called *100 English Novels* and shows how many times a given keyword appears in each novel.
   - Save your results in a way which 
- Turn the KWIC code above into a function which can be used to show *all* occurrences of a keyword in the corpus. 
  - Bonus: Your output should have the same columns as the KWIC output above, plus an additional column showing the filename
  - Bonus: Write your function in such a way that a user can define the context window size to display.