# Word Count of Treasure Island
    In this project we use the spaCy and nltk libraries to do a word count of
    Robert Louis Stevenson's "Treasure Island". Before starting, both libraries and
    spaCy's large English model ('en_core_web_lg') must be installed. The second set
    of imports connects to R through rpy2 so the book can be downloaded with the
    R package 'gutenbergr'.
    
    The code, along with the necessary files and the package versions used here,
    can be found in this repo.

In [1]:
import spacy
import nltk
nlp = spacy.load('en_core_web_lg')

import rpy2
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
utils = importr('utils')
utils.install_packages('gutenbergr', repos='https://cloud.r-project.org')
importr('gutenbergr')


rpy2.robjects.packages.Package as a <module 'gutenbergr'>

### 1 - Text Cleaning Function
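    The following function takes the given text, keeps only alphabetic tokens that
    are not stop words, and lemmatizes them. Pronouns, which spaCy 2.x lemmatizes to
    the placeholder '-PRON-', are kept in their lowercase form instead.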

In [2]:
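def clean_text(text):
    """Return the lemmas of all alphabetic, non-stop-word tokens as one string."""
    nlp_text = nlp(text)
    # Compare against the string lemma (lemma_); '-PRON-' is the pronoun
    # placeholder lemma used by spaCy 2.x.
    lemmas = [w.lower_ if w.lemma_ == '-PRON-' else w.lemma_
              for w in nlp_text if w.is_alpha and not w.is_stop]

    # Equivalent explicit loop:
    #lemmas = []
    #for w in nlp_text:
    #    if w.is_alpha and not w.is_stop:
    #        if w.lemma_ == '-PRON-':
    #            lemmas.append(w.lower_)
    #        else:
    #            lemmas.append(w.lemma_)
    return ' '.join(lemmas)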
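
    To see what the filter inside clean_text keeps and drops without loading the
    spaCy model, here is a minimal sketch. The Token class and join_lemmas function
    are hypothetical stand-ins written for this illustration, and the token lemmas
    are hand-written rather than real spaCy output.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for a spaCy token (only the attributes used here)."""
    text: str
    lemma_: str
    is_stop: bool = False

    @property
    def is_alpha(self):
        return self.text.isalpha()

    @property
    def lower_(self):
        return self.text.lower()


def join_lemmas(tokens):
    # Same filter as clean_text: keep alphabetic, non-stop-word tokens and
    # use the lowercase form when the lemma is the '-PRON-' placeholder.
    lemmas = [t.lower_ if t.lemma_ == '-PRON-' else t.lemma_
              for t in tokens if t.is_alpha and not t.is_stop]
    return ' '.join(lemmas)


# Hand-written tokens for "The pirates buried 3 chests themselves ."
tokens = [
    Token('The', 'the', is_stop=True),   # dropped: stop word
    Token('pirates', 'pirate'),
    Token('buried', 'bury'),
    Token('3', '3'),                     # dropped: not alphabetic
    Token('chests', 'chest'),
    Token('themselves', '-PRON-'),       # kept as lowercase text
    Token('.', '.'),                     # dropped: not alphabetic
]
result = join_lemmas(tokens)
print(result)  # pirate bury chest themselves
```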

### 2 - Read in file and apply the function
    This next step downloads the book (Project Gutenberg ID 120) through gutenbergr,
    joins the rows of the text column into a single string, skips the first 3,488
    characters (presumably the front matter), and finally applies our clean_text
    function to the text.

In [3]:
treasure_island_df = ro.r('gutenberg_download("120")')
treasure_island_full = ' '.join(treasure_island_df[1])
treasure_island = treasure_island_full[3488:]

In [4]:
cleaned_treasure_island = clean_text(treasure_island)

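    Alternatively, the file can be downloaded from
    http://www.gutenberg.org/files/120/120-0.txt and read from disk. The following
    strips the line breaks from each line, drops the empty lines, joins the rest
    into a single string, and cleans it: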

In [5]:
with open('Treasure_Island.txt', 'r', encoding='utf-8') as text:
    # Strip trailing whitespace (including line breaks) and skip empty lines.
    ti_example = [line.rstrip() for line in text if line.strip()]

single_text_ti = ' '.join(ti_example)
cleaned_ti = clean_text(single_text_ti)
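
    The read-and-join step can be tried on a small in-memory sample before running
    it on the whole book. This is only a sketch: the sample string is made up for
    illustration, and io.StringIO stands in for the opened file.

```python
import io

# Hypothetical sample standing in for the first lines of the downloaded file.
sample = "TREASURE ISLAND\n\nby Robert Louis Stevenson\n   \nPART ONE\n"

with io.StringIO(sample) as text:
    # Strip trailing whitespace and skip lines that are empty or blank.
    lines = [line.rstrip() for line in text if line.strip()]

single_text = ' '.join(lines)
print(lines)        # ['TREASURE ISLAND', 'by Robert Louis Stevenson', 'PART ONE']
print(single_text)  # TREASURE ISLAND by Robert Louis Stevenson PART ONE
```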

### 3 - Apply word count
    Lastly, we use nltk's FreqDist to count how often each word appears in the
    cleaned text.

In [6]:
treasure_island_top_words = nltk.FreqDist(cleaned_treasure_island.split())

In [7]:
treasure_island_top_words.most_common(25)

[('say', 409),
 ('man', 364),
 ('come', 215),
 ('like', 214),
 ('hand', 198),
 ('captain', 188),
 ('doctor', 159),
 ('go', 152),
 ('good', 150),
 ('silver', 142),
 ('time', 139),
 ('know', 139),
 ('cry', 137),
 ('look', 133),
 ('ship', 133),
 ('think', 128),
 ('see', 126),
 ('tell', 115),
 ('old', 114),
 ('begin', 113),
 ('sea', 107),
 ('run', 104),
 ('little', 102),
 ('find', 102),
 ('hear', 101)]
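
    nltk.FreqDist is a subclass of Python's collections.Counter, so the counting
    step can be sketched with the standard library alone. The short string below is
    a made-up stand-in for cleaned_treasure_island, not actual text from the book.

```python
from collections import Counter

# Hypothetical cleaned string, standing in for cleaned_treasure_island.
cleaned = "captain say ship come captain say sea captain"

# Split on whitespace and count each word, as FreqDist does.
counts = Counter(cleaned.split())
top = counts.most_common(2)
print(top)  # [('captain', 3), ('say', 2)]
```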

# Conclusion
    The counts above show that single-word frequencies alone do not give a good
    understanding of what the text contains. Some of the top words
    ('captain', 'silver', 'ship', 'sea') do suggest that the book is about treasure
    and an island, but we cannot judge their significance without additional
    analysis.
   