# Week 2.2 Task: Text in the Wild

This week we're going to find our own sets of documents online. This will sharpen our Python and regex skills as we try to overcome some of the issues that come with taking data from the real world.

There are 3 types of text data that I will suggest, along with some code and pointers for working with them, but feel free to use any text you have lying around or can rustle up from the interweb. It just has to be in the right format.

### The Format 

What we want to collect is sets of **documents** to compare. Each **document** is stored as a single string in a 1D array. 

```
documents = [
    document1,
    document2,
    document3,
    ...
]
```
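As a concrete illustration, here is a tiny dataset in that format. The three strings are made-up placeholders; yours will be whole books, songs or films.

```python
# Three made-up documents, each one a single string
documents = [
    "The first document is a short sentence about cats.",
    "The second document is a short sentence about dogs.",
    "The third document is about something else entirely.",
]

# One entry per document, and every entry is a string
print(len(documents))
print(all(isinstance(d, str) for d in documents))
```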

### Making Your Dataset

I am going to suggest getting books (like I did), song lyrics, or film scripts in the form of subtitle files.

#### Books

[Project Gutenberg](https://www.gutenberg.org/ebooks/) has loads of free eBooks. 
1. Find a book you like on there
2. Get the plain text version of the book
3. Make a new empty file in a text editor
4. Copy/paste the text into it and save it with a `.txt` extension

Then all we need to do is load them in as text files. You can choose to load in several books and treat them each as a separate document, or alternatively, you can load in one text file and try and split it into parts such as paragraphs or chapters using the regex skills you learnt last week.
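A minimal sketch of splitting one text into chapters and then paragraphs. The miniature `book` string is made up, and the assumption that headings look like `CHAPTER I` on their own line will not hold for every book, so check your own file and adjust the regex.

```python
import re

# A made-up miniature "book" standing in for the text of a real file
book = """CHAPTER I

The first paragraph of chapter one.

The second paragraph of chapter one.

CHAPTER II

The only paragraph of chapter two.
"""

# Split wherever a line consists of "CHAPTER" plus a Roman numeral.
# Text before the first heading comes back as the first item, so we
# strip whitespace and drop anything left empty.
chapters = [c.strip() for c in re.split(r'(?m)^CHAPTER [IVXLC]+\s*$', book)]
chapters = [c for c in chapters if c]

# Split the first chapter into paragraphs on blank lines
paragraphs = re.split(r'\n\s*\n', chapters[0])

print(len(chapters))
print(paragraphs)
```

Either `chapters` or `paragraphs` could serve as your array of documents, depending on the level you want to compare.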

#### Song Lyrics

1. Find some song lyrics online
2. Make a new empty file in a text editor
3. Copy/paste the lyrics into it and save it with a `.txt` extension

As with books, you can save song lyrics as text files and load each one in as a separate document. You can compare individual songs, lines in songs (split up the song using a regex), different artists, or different albums by the same artist. Just remember, what you want to end up with is an array where each document is a single string.
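Here is a rough sketch of splitting one song into lines. The lyrics are made up, and the `[Verse 1]` style section labels are an assumption about how your source marks up its text; drop that step if yours has none.

```python
import re

# Made-up lyrics standing in for the contents of a lyrics file
song = """[Verse 1]
Walking down the data lane
Counting words out in the rain

[Chorus]
Split it up, split it up
"""

# Remove section labels such as [Chorus]
song = re.sub(r'\[[^\]]*\]', '', song)

# Split on line breaks, then drop empty lines
lines = [l.strip() for l in re.split(r'\n+', song)]
lines = [l for l in lines if l]

print(lines)
```

Each entry of `lines` is a single string, so the list is already in the documents format.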

#### Subtitle Files 

1. Find some subtitle files in `.srt` format online
2. Save them on your computer and get the file paths
3. Use the `pysrt` example code below to load them in

As with books and songs, you can load each film or episode in as a separate document and compare them!
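To get a feel for what is inside an `.srt` file, here is a sketch that turns one into a single string using only regex. The sample text is made up and `srt_to_document` is a hypothetical helper, not part of any package.

```python
import re

# A made-up two-subtitle sample in the .srt layout: a counter, a
# timestamp line, one or more lines of text, then a blank line
srt_text = """1
00:00:01,000 --> 00:00:03,500
Hello there.

2
00:00:04,000 --> 00:00:06,000
Nice weather
we are having.
"""

def srt_to_document(text):
    # Drop timestamp lines
    stamp = r'\d{2}:\d{2}:\d{2},\d{3}'
    text = re.sub(r'(?m)^' + stamp + r' --> ' + stamp + r'.*$', '', text)
    # Drop lines that contain only a subtitle counter
    text = re.sub(r'(?m)^\d+\s*$', '', text)
    # Collapse all remaining whitespace, including line breaks
    return re.sub(r'\s+', ' ', text).strip()

document = srt_to_document(srt_text)
print(document)
```

One limitation of this sketch is that a subtitle consisting only of digits would be removed along with the counters.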

### Loading in files

Below we see code that takes an array of file paths and stores each file as a document. Below that, we take one text file and split it into documents using a regex.

Remember, you may need to do some extra cleaning up of your data with regex depending on the format!
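As one example of such cleaning, many Project Gutenberg plain text files wrap the book in licence text, marked by lines starting `*** START OF` and `*** END OF`. The sketch below assumes that layout (check your own file, as not every book follows it); `clean_gutenberg` is a hypothetical helper and the sample text is made up.

```python
import re

# Made-up text imitating the layout of a Project Gutenberg file
raw = """Licence information that is not part of the book.
*** START OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***
The   actual text
of the book.
*** END OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***
More licence information.
"""

def clean_gutenberg(text):
    # Keep only what lies between the START and END marker lines,
    # falling back to the whole text if the markers are missing
    pattern = r'\*\*\* ?START OF.*?\*\*\*(.*?)\*\*\* ?END OF'
    match = re.search(pattern, text, flags=re.DOTALL)
    body = match.group(1) if match else text
    # Collapse runs of whitespace into single spaces
    return re.sub(r'\s+', ' ', body).strip()

print(clean_gutenberg(raw))
```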

#### Getting file paths

If your text files are in the same folder as this notebook, you just need to use the filename as the path. This is called a **relative** path, as the path is _relative_ to the program you are trying to access it from (this notebook).

For example, you would just need `hacking.txt`. 


If you have the text files somewhere else on your computer, you'll need an **absolute** path. This means you need to give the whole path starting from the root of your file system. 

* For macOS, this is often `/Users/[USERNAME]/Documents/...`
    - You can find a file path by
     1. Opening a new Terminal window 
     2. Dragging and dropping a file into it
     
* For Windows, it will probably look like `C:\Documents\...` (the slashes go the other way!)
    - You can find a file path using any of [these](https://www.wikihow.com/Find-a-File%27s-Path-on-Windows) methods. 
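A small sketch of working with paths from Python using the standard `pathlib` module. The Windows path shown is a hypothetical example, and the folder searched is simply wherever the code is run from.

```python
from pathlib import Path

# A relative path is resolved against the current working directory,
# which for a notebook is normally the folder the notebook lives in
relative = Path("hacking.txt")
absolute = relative.resolve()
print(relative.is_absolute(), absolute.is_absolute())

# Collect every .txt file in a folder ("." means the current folder)
urls = sorted(str(p) for p in Path(".").glob("*.txt"))
print(urls)

# On Windows, write paths as raw strings (note the r prefix) so that
# backslashes are not read as escape characters
windows_path = r"C:\Documents\books\hacking.txt"
print(windows_path)
```

The `urls` list built this way can be used directly in the loading code, which saves typing out each filename by hand.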




In [3]:
#Import all necessary packages
import numpy as np
import pandas as pd 
import re
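#Newer versions of scikit-learn removed the stop_words module, so we alias the text module, which also provides ENGLISH_STOP_WORDS
from sklearn.feature_extraction import text as stop_words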
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
import nltk
nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
from sklearn.metrics.pairwise import cosine_similarity as cosine
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/louismccallum/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Text files (Books, songs etc...)

In [None]:
#Each txt file is a separate document
#Put your file paths in here
urls = [
    "path1.txt","path2.txt","path3.txt"
]
documents = []
for url in urls:
    #Open file (closed automatically at the end of the with block)
    #Assumes UTF-8 text, change the encoding if your files differ
    with open(url, 'r', encoding='utf-8') as fs:
        #Read in text as string
        doc = fs.read()
    #Add string to array of documents
    documents.append(doc)
documents = np.array(documents)
print(documents.shape)

In [None]:
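# Split single txt file into documents with a regex
url = "path1.txt"
#Open file (assumes UTF-8 text, change the encoding if your file differs)
with open(url, 'r', encoding='utf-8') as fs:
    #Read in text as string
    doc = fs.read()
#Split based on your own regex (here: six whitespace characters followed by "Chapter")
documents = np.array(re.split(r'\s{6}Chapter', doc))
print(documents.shape)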

### Subtitle files

Here you can try and load in subtitle files in the `.srt` format

In [None]:
#Install pysrt package
!pip install pysrt

In [9]:
#Getting your own subs
import pysrt
#Put your file paths in here
urls = [
    "path1.srt","path2.srt","path3.srt"
]
documents = []
for url in urls:
    #Load in the subtitle file and parse
    subs = pysrt.open(url)
    #Replace line breaks within each subtitle with spaces
    subs = [l.text.replace("\n", " ") for l in subs]
    #Join into one string
    subs = " ".join(subs)
    #Add string to array of documents 
    documents.append(subs)
documents = np.array(documents)
print(documents.shape)

(3,)


### Analysing your text

Now we have a set of documents in an array, we can start turning them into numerical representations.

Luckily, the `sklearn` library has built-in functions for us to use when we want to make **word vectors**. We can either use the `CountVectorizer` to make a bag of words, or `TfidfVectorizer` to get TF/IDF vectors. 

We also define our own **tokeniser**. The function `my_tokeniser` is called on every document, and returns an array of tokens. We can edit this function to try different stemming, lemmatisation, capitalisation and n-grams, all of which we have seen in the Week 2.2 lecture notebook and can copy across!
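As one possible variation, here is a sketch of a tokeniser that returns n-grams instead of single words. `ngram_tokeniser` is a hypothetical function written for this example rather than copied from the lecture notebook, and it skips lemmatisation and stop word removal to stay short.

```python
import re

def ngram_tokeniser(doc, n=2):
    # Split on whitespace and punctuation, lowercase, drop empty strings
    words = [w.lower() for w in re.split(r'[-\s.,;!?]+', doc) if w]
    # Slide a window of n words along the document and join each
    # window into a single token
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngram_tokeniser("The cat sat on the mat."))
```

Because it takes a document and returns a list of tokens, it could be passed to a vectoriser in place of `my_tokeniser`. The `sklearn` vectorisers also have their own `ngram_range` parameter if you would rather not write this by hand.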

Underneath is code for seeing the highest scoring words for each document, as well as viewing the similarities against each other. 

Try different approaches to tokenising and compare Bag of Words against TF/IDF on your documents. What do you find?
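To see why the two can disagree, here is a hand-rolled sketch on three made-up, already tokenised documents. The weighting follows what I understand to be the default `TfidfVectorizer` settings (smoothed inverse document frequency, then scaling each document to length 1); treat it as an illustration, not a replacement for the library.

```python
import math
from collections import Counter

# Three made-up documents, already tokenised
docs = [["cat", "sat", "the"], ["dog", "sat", "the"], ["bird", "flew", "the"]]

# Bag of words: how often each token appears in each document
counts = [Counter(d) for d in docs]

# Document frequency: in how many documents each token appears
df = Counter(t for d in docs for t in set(d))
n = len(docs)

def tfidf(count):
    # Count multiplied by the smoothed inverse document frequency
    weights = {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
               for t, c in count.items()}
    # Scale so that each document vector has length 1
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()}

print(dict(counts[0]))
print({t: round(w, 2) for t, w in tfidf(counts[0]).items()})
```

In the bag of words every token in the first document scores the same, while under TF/IDF the token found in only one document scores highest and the token found in every document scores lowest.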

In [None]:
#Called once for each document
def my_tokeniser(doc):
    #Split on whitespace and punctuation
    tokens = re.split(r'[-\s.,;!?]+', doc)
    processed = []
    for t in tokens:
        #Lemmatise and make lowercase
        t = lem.lemmatize(t.lower())
        #Skip empty strings left over from the split and remove stop words
        if t and t not in stop_words.ENGLISH_STOP_WORDS:
            processed.append(t)
    #Return an array of tokens for that document
    return processed

In [None]:
#Using the CountVectorizer to get a bag of words using a custom tokeniser
vectoriser = CountVectorizer(tokenizer=my_tokeniser)
vectorised = vectoriser.fit_transform(documents)
print(vectorised.todense().shape)

In [None]:
#Using the TFIDF Vectorizer to get TFIDF vectors with custom tokeniser
vectoriser = TfidfVectorizer(tokenizer=my_tokeniser)
vectorised = vectoriser.fit_transform(documents)
print(vectorised.todense().shape)

### Showing the highest scoring words per document

In [None]:
#Store in a dataframe and sort
num_words = 10
#Use the vocab as the column names
#get_feature_names_out() needs scikit-learn 1.0 or newer, older versions use get_feature_names()
vocab = vectoriser.get_feature_names_out()
data = pd.DataFrame(vectorised.toarray(), columns = vocab)
for i in range(len(data)):
    print("doc", i)
    print(data.iloc[i].sort_values(ascending = False).head(num_words))

### Showing the similarities between documents
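Cosine similarity compares the direction of two document vectors and ignores their length. Here is a by-hand sketch on made-up word counts; `cosine_by_hand` is a hypothetical helper for illustration, and the `sklearn` function does the same job for a whole matrix at once.

```python
import math

def cosine_by_hand(a, b):
    # Dot product of the two vectors
    dot = sum(x * y for x, y in zip(a, b))
    # Lengths (Euclidean norms) of each vector
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up word count vectors over the vocabulary [cat, dog, sat]
doc_a = [2, 0, 1]
doc_b = [4, 0, 2]   # same proportions as doc_a, twice the counts
doc_c = [0, 3, 0]   # shares no words with doc_a

print(cosine_by_hand(doc_a, doc_b))
print(cosine_by_hand(doc_a, doc_c))
```

The first pair scores (very close to) 1 because the documents use the same words in the same proportions, and the second pair scores 0 because they share no words at all.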

In [None]:
#Convert to array 
vector_array = vectorised.toarray()
#Find similarities
result = cosine(vector_array)
#Put the result in a dataframe
df = pd.DataFrame(result)
#Show with heatmap style gradients
df.style.background_gradient(cmap='Greens')