## An example of using Python for text analysis.

Python can be used for a lot more than crunching big datasets of numbers. For example, it can be used in the arts and humanities to analyse textual data.

In this notebook you will design and build some Python code to analyse Jane Austen's famous novel [*Pride and Prejudice*](https://en.wikipedia.org/wiki/Pride_and_Prejudice).

As you may know, *Pride and Prejudice* deals with the stormy relationship between the heroine, Elizabeth Bennet, and Mr Darcy. Since the book was written in a pre-feminist era (1813), but by a female author and with a female lead, we might be interested in how it manages the power balance between these two characters.
![](https://upload.wikimedia.org/wikipedia/commons/2/22/PrideandPrejudiceCH3.jpg)



In this very simple analysis, you will use Python to answer these questions:

1) How often is Elizabeth's name mentioned, relative to Darcy's?
2) In sentences where both characters are mentioned, how often is Elizabeth's name mentioned first, and how many times is Darcy's name mentioned first?

To complete this task, you will write some Python functions. You will be provided with some tips on how to approach this, but the details are up to you.

The full text of the book is available from [Project Gutenberg](https://www.gutenberg.org/ebooks/1342), but we have the text prepared for you already - you just need to download it. To do this, run the following cell:

In [None]:
!curl https://raw.githubusercontent.com/CharlieLaughton/colabtools/master/Pride_and_Prejudice/Pride_and_Prejudice.txt -o Pride_and_Prejudice.txt

### Task 1. Write a function that loads the contents of the book into a string variable

Write a function that ends up working like this:

    content = read(book_file_name)
    
Tips:

There are several ways to read the contents of a file into a Python program, but probably the most basic (good enough for now) will look something like this:

    f = open(book_file_name)
    content = f.read()
    f.close()
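
A common alternative to calling `close()` yourself is a `with` block, which closes the file for you. The sketch below is only an illustration: the file name `demo_text.txt`, its contents, and the `encoding='utf-8'` argument are assumptions made so the example is self-contained, and it does not touch the book file.

```python
import os
import tempfile

# Hypothetical demo file, created here so the example can run on its own.
demo_path = os.path.join(tempfile.gettempdir(), 'demo_text.txt')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('It is a truth universally acknowledged.')

# The 'with' block closes the file automatically, even if an error occurs.
with open(demo_path, encoding='utf-8') as f:
    demo_content = f.read()

print(demo_content)

# Tidy up the demo file.
os.remove(demo_path)
```

Either style is fine for this exercise; the `with` form just removes the risk of forgetting to close the file.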
    
OK, so now complete the cell below:

In [None]:
def read(book_title):
    # Add your code here
    
    return content # don't forget to 'return' the result!

Does your code work? Time to find out. Run the following cell. If something is wrong with your function you will get an error; if so, go back, fix your function, and try again!

In [None]:
title = 'Pride_and_Prejudice.txt'
content = read(title)
if len(content) == 773701:
    print('Great! Your function works!')
else:
    print('Whoops! - something is wrong...')

### Task 2. Split the contents into individual sentences

At the moment the whole book - every word - is in one Python string, `content`. The next step is to split it into sentences, each of which can then be analysed.

Tips:

Python `strings` have a useful `method` called `split()` that can help. Run the code in the following cell to see it in action:

In [None]:
sentence = 'This is a sentence made up of several words'
chunks = sentence.split()
print(chunks)

The original string is split into a `list` of shorter strings. With no argument, `split()` breaks the string at each run of whitespace (spaces, tabs and newlines).

But you can change this behaviour:

In [None]:
chunks = sentence.split('e')
print(chunks)

Notice the character used to define the split points is removed from the list of generated sub-strings.
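
To make the behaviour of `split()` more concrete, here is a small sketch on a made-up string (the text and the variable names are illustrative, not part of the exercise). It compares the default split, a split on a single space, and a split on a full stop.

```python
text = 'One fish.  Two fish.\tRed fish.'

# No argument: split on any run of whitespace (spaces, tabs, newlines).
words = text.split()
print(words)

# Single-space argument: every space is a split point, so two
# consecutive spaces produce an empty string in the result.
by_space = text.split(' ')
print(by_space)

# Splitting on '.': the full stops are removed, and whatever follows
# the final '.' (here nothing) becomes an empty last item.
by_stop = text.split('.')
print(by_stop)
```

Note the empty strings that can appear in the output; they count as list items, which matters if you later count the pieces.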

So - write a Python function that takes your big string with the entire contents of the book in it, and returns a list of strings, each of which is one sentence in the book. It should work something like this:

    sentences = sentence_split(long_string)
    

In [None]:
def sentence_split(long_string):
    # write your code here
    
    return sentences # this should be a list of strings

Let's see if it works:

In [None]:
sentences = sentence_split(content)
if len(sentences) == 5267:
    print('Great! Your function works!')
else:
    print('Whoops! - something is wrong...')

### Task 3. Find the position of a particular word in a sentence

To achieve our goal, we need to analyse each sentence to:

1. Find out if it contains the word 'Elizabeth' or the word 'Darcy'
2. If it does, how far into the sentence it is (so we can see which of the two is mentioned first, in sentences where they both appear)


A Python function to do this might look like this:

    position = word_position(sentence, target_word)
    
where `sentence` is a long string with a sentence in it, `target_word` is the word we are searching for, and `position` is an integer giving the number of characters into the sentence at which the match starts, or -1 if the word doesn't appear in the sentence at all. (Remember Python counts from zero, so if the sentence begins with the target word, the value of `position` will be 0.)

Tip:

Python `strings` have a useful method called `index()` we can use. See it in action here:

In [None]:
sentence = 'This is a sentence made up of several words'
print(sentence.index('sentence'))
print(sentence.index('several'))

But you have to be a bit careful using it in a function, because if you ask for the index of something that isn't there, Python raises a `ValueError`:

In [None]:
print(sentence.index('cat'))

To protect against this, you need to first check in some safe way if the substring is present at all, before using `.index()` to tell you where. Here is one approach:

In [None]:
if 'cat' in sentence:
    print(sentence.index('cat'))
else:
    print('cat does not appear in this sentence')
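
Another way to guard against the error is to attempt the lookup and catch the `ValueError` if it fails. The sketch below uses a hypothetical helper called `safe_index`, which returns `None` when the word is missing; that return value is a choice made for this illustration only, and the function you are asked to write should return -1 instead.

```python
def safe_index(text, word):
    """Return the character offset of word in text, or None if absent.

    Hypothetical helper for illustration only.
    """
    try:
        return text.index(word)
    except ValueError:
        # str.index() raises ValueError when the substring is not found.
        return None


demo = 'This is a sentence made up of several words'
found = safe_index(demo, 'sentence')
missing = safe_index(demo, 'cat')
print(found, missing)
```

Both approaches work; checking with `in` first reads more plainly, while `try`/`except` searches the string only once.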

OK, bearing this in mind, create your function:

In [None]:
def word_position(sentence, target_word):
    # Add your code below
    
    return position

Time to test your function. As it happens, the word "Elizabeth" first appears in `sentences[56]`, but the word "Darcy" does not appear in that sentence. Does your function agree?

In [None]:
position_test_1 = word_position(sentences[56], 'Elizabeth')
if position_test_1 == 39:
    print('Test 1 passed')
else:
    print('Test 1 failed - result was ', position_test_1)
    
position_test_2 = word_position(sentences[56], 'Darcy')
if position_test_2 == -1:
    print('Test 2 passed')
else:
    print('Test 2 failed - result was ', position_test_2)
    

### Task 4. Analyse the full text

Nearly there! You have all the functions you need; now you just need to write some code to put them all together. The approach will be:

1) Initialise some counters to zero:

    a) `nD` to count the number of sentences mentioning Darcy
    b) `nE` to count the number of sentences mentioning Elizabeth
    c) `nDE` to count the number of sentences where Darcy appears before Elizabeth
    d) `nED` to count the number of sentences where Elizabeth appears before Darcy
    
2) Loop over each sentence. For each:

    a) See if 'Elizabeth' is in the sentence, and if so, in what position
    b) Likewise for 'Darcy'
    c) Update `nD` and `nE` as appropriate
    d) If both are present, see whose name comes earlier in the sentence and update `nDE` or `nED` as appropriate.
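
The counting pattern described in the steps above can be sketched on a toy example. Everything here is made up for illustration (the four sentences, the words 'cat' and 'dog', and the counter names), so it shows the shape of the loop rather than the answer for the book.

```python
toy_sentences = [
    'The cat chased the dog',
    'The dog slept',
    'A dog barked at the cat',
    'Nothing happened',
]

n_cat = 0        # sentences mentioning 'cat'
n_dog = 0        # sentences mentioning 'dog'
n_cat_first = 0  # sentences with both, where 'cat' comes first
n_dog_first = 0  # sentences with both, where 'dog' comes first

for s in toy_sentences:
    has_cat = 'cat' in s
    has_dog = 'dog' in s
    if has_cat:
        n_cat += 1
    if has_dog:
        n_dog += 1
    if has_cat and has_dog:
        # Compare character offsets to see which word appears earlier.
        if s.index('cat') < s.index('dog'):
            n_cat_first += 1
        else:
            n_dog_first += 1

print(n_cat, n_dog, n_cat_first, n_dog_first)
```

Bear in mind that `in` and `index()` match substrings, so 'cat' would also be found inside a longer word such as 'concatenate'.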
    
OK, no need for a function this time. Just fill in the cell below with your code, then run it:

In [None]:
nD = 0   # number of sentences mentioning Darcy
nE = 0   # number of sentences mentioning Elizabeth
nDE = 0  # number of sentences where Darcy appears before Elizabeth
nED = 0  # number of sentences where Elizabeth appears before Darcy
# Add your code here:


# end with the following lines:
print('Pride and Prejudice contains about', len(sentences), 'sentences.  Elizabeth is')
print('mentioned by name in', nE, 'of them, while Darcy is mentioned by name')
print('in', nD, 'of them.')
print('In sentences where both are mentioned, Elizabeth is mentioned before')
print('Darcy', nED, 'times, whereas Darcy is mentioned before Elizabeth', nDE, 'times.')

### Checking the answer

Run the code in the following cell to reveal the correct answer!

In [None]:
secret_message = '507269646520616e64205072656a756469636520636f6e7\
461696e732061626f757420353236372073656e74656e6365732e2020\
456c697a61626574682069730a6d656e74696f6e6564206279206e616d\
6520696e20363233206f66207468656d2c207768696c652044617263792\
06973206d656e74696f6e6564206279206e616d650a696e20333939206f6\
6207468656d2e0a496e2073656e74656e63657320776865726520626f7468\
20617265206d656e74696f6e65642c20456c697a6162657468206973206d65\
6e74696f6e6564206265666f72650a44617263792035302074696d65732c207\
7686572656173204461726379206973206d656e74696f6e6564206265666f726\
520456c697a61626574682032372074696d65732e'
print(bytes.fromhex(secret_message).decode())