# Working with strings

It is possible to extract quite a lot of interesting, structured information from text data simply by using string processing techniques.

In this session, we'll see how to do some of these things, specifically calculating word frequencies and showing key-words-in-context (concordances). We'll do this for individual files and then you'll work together to write Python code which does this for a larger corpus of texts.

In [None]:
import os # operating system tools
import re # regex
import string # string processing tools
from collections import Counter, OrderedDict


__Loading text files__

We start by defining a filepath using ```os.path.join()``` like we saw last week.

In [None]:
filename = os.path.join("..", "data", "Dickens_Expectations_1861.txt")

We then need to load the file that we want to work with.

There are a number of ways to do this in Python, but the following should be considered "best practice".

In [None]:
# a context manager opens a file in a given mode (read, write, append); here we open it in read mode
with open(filename, "r") as file:
    text = file.read()
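
As a small illustration of why the context manager is convenient, the sketch below writes a throwaway file and reads it back. The file name `example.txt`, the temporary directory, and the explicit `encoding="utf-8"` are assumptions made for the example, not part of the course data.

```python
import os
import tempfile

# Work in a temporary directory so the example is self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")  # hypothetical file name

    # Write a short text so there is something to read
    with open(path, "w", encoding="utf-8") as f:
        f.write("A short example text.")

    # Read it back in read mode
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
        print(f.closed)  # the file is still open inside the block

    # Once the with-block ends, the file has been closed for us
    is_closed = f.closed
    print(is_closed)
```

The file is closed as soon as the `with` block ends, even if an error occurs inside it, so there is no need to call `.close()` by hand.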

When we load the text file, we just have a simple string object which can be indexed and sliced.

In [None]:
print(text[:300])


You can see that there are some formatting things that are a little funky, such as lots of newline breaks.

We can get rid of those by using the ```.replace()``` method on strings.

In [None]:
text[:300] # without print(), the notebook shows the raw string, including escape characters such as \n
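
A minimal sketch of the ```.replace()``` step, using a short made-up string (`sample`) instead of the novel so that the effect is easy to see. Replacing newlines with spaces and then collapsing repeated spaces is only one possible clean-up.

```python
# Hypothetical sample standing in for the first lines of a text file
sample = "My father's family name\nbeing Pirrip,\n\nand my Christian name Philip"

# Replace every newline character with a single space
flattened = sample.replace("\n", " ")

# Two newlines in a row leave a double space behind; splitting on
# whitespace and re-joining collapses any run of spaces into one
tidy = " ".join(flattened.split())

print(repr(flattened))
print(repr(tidy))
```

Strings are immutable, so ```.replace()``` returns a new string; the result has to be assigned to a variable to be kept.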

__Tokenize text__

So far, we have one long string of characters. But we want to be able to work with individual words. To do that, we have to *tokenize* our data - in other words, to split it into individual tokens (or words).

In [None]:
tokens = text.split() # split the string on whitespace
tokens
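
To show what whitespace tokenization produces, here is a small sketch on a made-up sentence. It assumes that tokens are created with ```.split()```, which fits the punctuation stripping used later in this notebook; `sample` and `sample_tokens` are names invented for the example.

```python
# Hypothetical two-line sample text
sample = "I loved her, and she loved me.\nThat was all."

# With no argument, split() breaks on any run of whitespace,
# including spaces, tabs and newlines
sample_tokens = sample.split()

print(sample_tokens)
print(len(sample_tokens))
```

Note that punctuation stays attached to the words (`"her,"`, `"me."`), which is why the tokens need cleaning before they are counted.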

__Get sentences with regex__

We can use a similar logic to split the data into separate sentences.

This time we use a bit of ```regex``` to do our string splitting.

In [None]:
# [.?!] matches a literal ".", "?" or "!"; \s matches whitespace (space, tab, newline); * means zero or more
sentences = re.split(r"[.?!]\s*", text)
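
The behaviour of this pattern is easier to see on a short made-up string. The sketch below applies the same regex to `sample` (an invented example) and shows two side effects: abbreviations are split, and a trailing empty string appears.

```python
import re

# Hypothetical sample with an abbreviation in it
sample = "Is it you? Yes! Mr. Jaggers said so."

# Same pattern as above: sentence-final punctuation plus optional whitespace
parts = re.split(r"[.?!]\s*", sample)
print(parts)

# The final "." leaves an empty string at the end; drop empty pieces
non_empty = [part for part in parts if part]
print(non_empty)
```

The full stop in "Mr." is treated as a sentence boundary, so simple regex splitting is only a rough approximation of real sentence segmentation.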

## Find word frequencies

We can count how many times an individual word appears manually, simply by iterating over the list of tokens and using a counter. 

To do this, we clean each token (strip punctuation, make it lowercase) and add one to the counter whenever it matches the keyword.

In [None]:
counter = 0
keyword = "love"
# for every token
for token in tokens:
    # remove punctuation from the start and end of the token
    stripped = token.strip(string.punctuation)
    # make lowercase
    lowered = stripped.lower()
    # is the cleaned token the keyword?
    if lowered == keyword:
        # if yes, add 1 to the counter
        counter += 1
print(counter)
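
A brief sketch of what the cleaning step does to individual tokens. The list `raw` is made up for illustration.

```python
import string

# Hypothetical tokens with punctuation attached in different places
raw = ['"Love,', "don't", 'end."', "--well!"]

# strip() removes the given characters from both ends of the string only,
# so the apostrophe inside "don't" is left alone
cleaned_sample = [token.strip(string.punctuation).lower() for token in raw]

print(cleaned_sample)
```

Because only the ends are stripped, contractions and hyphenated words survive intact, while surrounding quotation marks, commas and full stops are removed.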

In [None]:
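# the built-in list method .count() counts exact matches only (no lowercasing or punctuation stripping)
tokens.count("love")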


In [None]:
cleaned = []
for token in tokens:
    # remove punctuation from the start and end of the token
    stripped = token.strip(string.punctuation)
    # make lowercase
    lowered = stripped.lower()
    # append the cleaned token to a list
    cleaned.append(lowered)

In [None]:
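# Counter builds a dictionary-like object that maps each token to its count
Counter(cleaned)
# most_common() returns (token, count) tuples sorted by frequency;
# function words and determiners tend to top the list
Counter(cleaned).most_common()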
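
For a quick feel for how ```Counter``` behaves, here is a small sketch on a hand-made list (`words` is invented for the example).

```python
from collections import Counter

# Hypothetical list of already-cleaned tokens
words = ["the", "love", "the", "of", "love", "the"]

freqs = Counter(words)

# Look up a count just like a dictionary value
print(freqs["the"])

# Unlike a plain dict, a missing key gives 0 instead of raising KeyError
print(freqs["hate"])

# Pass a number to most_common() to get only the top n (token, count) tuples
print(freqs.most_common(2))
```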

We can use a similar logic to find all sentences where a certain keyword appears.

In [None]:
for sentence in sentences:
    # make everything lowercase
    sentence = sentence.lower()
    # strip punctuation from both ends of the sentence
    stripped = sentence.strip(string.punctuation)
    # add whitespace around keyword
    modified_kw = " " + keyword + " "
    if modified_kw in stripped:
        print(stripped)


Python also has a built-in function called ```enumerate()```, which we can use to keep track of each sentence's index while we loop, so that we can see where each match occurs.

In [None]:
# enumerate() lets you go through the sentences one by one and gives each one an index (its position in the list)
for idx, sentence in enumerate(sentences):
    # make everything lowercase
    sentence = sentence.lower()
    # strip punctuation from both ends of the sentence
    stripped = sentence.strip(string.punctuation)
    # add whitespace around keyword
    modified_kw = " " + keyword + " "
    if modified_kw in stripped:
        print(idx, stripped)

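There are some problems, though! For example, padding the keyword with spaces misses it at the very start or end of a sentence and when it is followed by a comma, and splitting on every full stop breaks abbreviations such as "Mr." into separate sentences.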
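
A minimal sketch of one possible way to improve the keyword matching: a regex word boundary (`\b`) instead of whitespace padding. This is an alternative technique, not the approach used in the cells above, and the list `samples` is made up for the comparison.

```python
import re

keyword = "love"

# Hypothetical sentences covering the tricky cases
samples = [
    "Love is not all",       # keyword at the start
    "I was in love.",        # keyword followed by punctuation
    "she wore a glove",      # keyword only as part of a longer word
    "they love each other",  # keyword surrounded by spaces
]

# Whitespace padding, as in the loop above
padded = [s for s in samples if " " + keyword + " " in s.lower()]

# \b matches the edge of a word; re.escape guards against special characters
pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", flags=re.IGNORECASE)
bounded = [s for s in samples if pattern.search(s)]

print(padded)
print(bounded)
```

The padded version finds only the last sentence, while the word-boundary version finds all three real occurrences and still ignores "glove".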

## Viewing keywords in context (KWIC, concordancing)

In [None]:
cleaned= []
for token in tokens:
    # remove punctuation at the end
    stripped = token.strip(string.punctuation) 
    # make lowercase
    lowered = stripped.lower()
    # append the cleaned token to a list
    cleaned.append(lowered)
    

In [None]:
# define keyword
keyword = "love"

# for every token, going through the cleaned list
for idx, token in enumerate(cleaned):
    # if the token is the keyword
    if token == keyword:
        # take up to five words before and up to five words after the keyword
        # from the cleaned list, and join each group into a single string
        # separated by spaces; max() stops the slice start from going negative
        before = ' '.join(cleaned[max(0, idx-5):idx])
        after = ' '.join(cleaned[idx+1:idx+6])
        full = [before, token, after]
        # pad each field to a fixed width (50, 20 and 50 characters) so the columns line up
        print("{:50} {:20} {:50}".format(*full))
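
The slicing and formatting used for the context window can be tried out on a tiny list. In this sketch, `words`, `window` and the column widths are invented for illustration; it shows why the start of the slice is worth clamping at 0, and how alignment flags in a format string change the layout.

```python
# Hypothetical cleaned tokens; the keyword sits near the start of the list
words = ["true", "love", "never", "did", "run", "smooth"]
idx = 1      # position of "love"
window = 5   # number of context words on each side

# idx - window is negative here, and a negative start counts from the
# end of the list, so this slice comes back empty
naive_before = words[idx - window:idx]

# Clamping the start at 0 keeps the slice inside the list
safe_before = words[max(0, idx - window):idx]

# Slices that run past the end of the list are simply cut short
after = words[idx + 1:idx + 1 + window]

# > right-aligns, ^ centres and < left-aligns within the given width
line = "{:>20} {:^10} {:<30}".format(" ".join(safe_before), words[idx], " ".join(after))
print(naive_before, safe_before, after)
print(line)
```

Right-aligning the left context places the keyword in the same column on every line, which is the usual layout for a concordance.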


In [None]:
animals = ["dog", "cat", "bird"]

In [None]:
# join the items of the list into one string, with a newline between each item
print('\n'.join(animals))

## Exercises

In groups, work on the following exercises in class. 

I've left these somewhat underspecified, so you're welcome to solve them in whatever way you please, and to save the results in whatever format you think works best.

- Write some code which searches through *all* of the novels in the folder called *100 English Novels* and shows how many times a given keyword appears in each novel.
   - Save your results in a way which 
- Turn the KWIC code above into a function which can be used to show *all* occurrences of a keyword in the corpus. 
  - Bonus: Your output should match the results above, but with an additional column showing the filename
  - Bonus: Write your function in such a way that a user can define the context window size to display.