# Exploring the word count and distribution of Sherlock Holmes: The Hound of the Baskervilles (an ongoing project)

## Insights:
### Book:
 - After removing stop words, the book contains 26895 words in total, of which 5535 are unique (20.58%).
 - the 3 most common words are: 
 1. sir: 336 times 
 2. upon: 316 times
 3. said: 240 times
 
 The two most common words, sir and upon, both appear much more in writing than in speech. They also reflect the period in which the book was published (25 March 1902).
 
### TV Show:
 - After removing stop words, the episode transcript contains 11725 words in total, of which 2976 are unique (25.38%).
 - the 3 most common words are:
 1. sherlock: 273 times
 2. john: 242 times
 3. henry: 158 times
 
 The 3 most common words are the names of the main characters, which is very representative of TV scripts. 
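
The percentages in parentheses above are the share of unique words among all words kept after filtering. A minimal sketch of that arithmetic, using only the counts reported in this notebook (the helper name `unique_share` is made up for illustration):

```python
def unique_share(unique_words, total_words):
    """Percentage of the filtered words that are distinct."""
    return round(unique_words / total_words * 100, 2)

# Counts reported in this notebook, after stop-word removal
book_share = unique_share(5535, 26895)
show_share = unique_share(2976, 11725)

print(f"book: {book_share}%")   # book: 20.58%
print(f"show: {show_share}%")   # show: 25.38%
```

This ratio tends to be higher for shorter texts, so the gap between the two values partly reflects the difference in length rather than vocabulary alone.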
 
 
### Worth mentioning:
 - The words Sherlock and John are much less common in the book than in the TV show. This is likely because the book was written at a time when even close friends called each other by their last names. 
 - The word sir appears far less often in the TV show (21 times), perhaps because it was much more common to call people sir back then. The word upon does not appear at all in the TV show transcript. 
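
Because the two texts differ in length, raw counts are easier to compare once they are scaled to a common base. The sketch below converts the counts reported in this notebook into occurrences per 1000 filtered words; the helper `per_thousand` is a hypothetical name, and the numbers agree with the `ratio` columns computed further down.

```python
# Totals after stop-word removal, as reported in this notebook
BOOK_TOTAL = 26895
SHOW_TOTAL = 11725

# word -> (count in book, count in show), taken from the notebook outputs
counts = {"sir": (336, 21), "sherlock": (35, 273)}

def per_thousand(count, total):
    """Occurrences per 1000 filtered words."""
    return round(count / total * 1000, 2)

rates = {
    word: (per_thousand(book, BOOK_TOTAL), per_thousand(show, SHOW_TOTAL))
    for word, (book, show) in counts.items()
}

for word, (book_rate, show_rate) in rates.items():
    print(f"{word}: book {book_rate} vs show {show_rate} per 1000 words")
```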


#### Known limitations:
 1. This is not a like-for-like comparison, because it sets a book against a script. It is more of an overview of both texts. 
 2. In the book, the narrator is John Watson, so his name doesn't appear as much. In the series there's no narrator, so John's name comes up a lot more. 
 3. The book contains more words and probably uses much more descriptive language than the script.
  

In [1]:
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd
from collections import Counter

In [2]:
def analyze_text_from_url(url, n):
    
    """Receive 2 inputs: 1. a URL, 2. the number of most common words to return.
    
    Extract the HTML text content from the URL, tokenize it, remove stop words
    and non-alphabetic tokens, count the frequency of each word and return 3 outputs:
    common_words = list of (word, count) tuples for the n most common words in the text
    total_words = number of words in the text after filtering
    words_dict = dictionary of all the words in the text with their number of appearances as values"""
    # Step 1: Fetch the content from the URL
    response = requests.get(url)
    response.raise_for_status()  # Raise an exception for HTTP errors

    # Step 2: Extract text from the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')
    text = soup.get_text()

    # Step 3: Tokenize the text and remove non-alphabetic tokens and stopwords
    tokens = word_tokenize(text.lower())
    words = [word for word in tokens if word.isalpha()]
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word not in stop_words]

    # Step 4: Count the frequency of each word
    words_count = Counter(filtered_words)

    # Step 5: Get the n most common words
    common_words = words_count.most_common(n)

    # Step 6: Calculate the total number of words
    total_words = len(filtered_words)

    # Step 7: Create a dictionary of all words and their frequencies
    words_dict = dict(words_count)

    return common_words, total_words, words_dict
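
To see what steps 3 to 7 return without fetching a page, here is a minimal offline sketch of the same counting logic. It is not the notebook's function: the regular expression is only a rough stand-in for `word_tokenize` plus `isalpha`, the tiny stop-word set is hypothetical (the notebook uses NLTK's full English list), and `count_words` and the sample sentence are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical miniature stop-word list, for illustration only
STOP_WORDS = {"the", "a", "of", "and", "was", "to", "in", "it"}

def count_words(text, n, stop_words=STOP_WORDS):
    """Return (top-n (word, count) pairs, total kept words, full frequency dict)."""
    # Lowercase, then keep runs of letters only
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [token for token in tokens if token not in stop_words]
    counts = Counter(kept)
    return counts.most_common(n), len(kept), dict(counts)

# Invented sample sentence, not a quote from the book
sample = "The hound was upon the moor, and the hound of the moor was upon us, sir."
top, total, freq = count_words(sample, 2)

print(top)
print(total)
```

Note that the first return value is a list of `(word, count)` tuples, which is why the notebook loops over it with `for word, count in common_words`.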

In [3]:
url = 'https://www.gutenberg.org/cache/epub/2852/pg2852-images.html'
n = 20
common_words, total_words, words_dict = analyze_text_from_url(url, n)

print(f'Total Words: {total_words}')
print('Most Common Words:')
for word, count in common_words:
    print(f'{word}: {count}')

Total Words: 26895
Most Common Words:
sir: 336
upon: 316
said: 240
one: 238
man: 211
could: 201
would: 193
holmes: 191
us: 172
henry: 151
moor: 149
may: 117
know: 115
see: 114
baskerville: 111
watson: 110
well: 99
must: 96
charles: 94
stapleton: 92


In [4]:
#transforming into a df to use for graphs
series= pd.Series(words_dict).sort_values(ascending=False)
df_book = pd.DataFrame(series).reset_index()
columns = ['word','count']
df_book.columns = columns
# adding a ratio column
df_book['ratio'] = df_book['count']/df_book['count'].sum()
df_book.head(20)

Unnamed: 0,word,count,ratio
0,sir,336,0.012493
1,upon,316,0.011749
2,said,240,0.008924
3,one,238,0.008849
4,man,211,0.007845
5,could,201,0.007474
6,would,193,0.007176
7,holmes,191,0.007102
8,us,172,0.006395
9,henry,151,0.005614


In [5]:
df_book.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5535 entries, 0 to 5534
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   word    5535 non-null   object 
 1   count   5535 non-null   int64  
 2   ratio   5535 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 129.9+ KB


In [6]:
#there are a total of 5535 unique words in the book and 26895 words in total. 

In [7]:
print(f'{round(len(df_book["word"])/total_words*100,2)}% of the words in the book are unique')

20.58% of the words in the book are unique


### Sherlock season 2 episode 2: The Hounds of Baskerville

##### Note: I could only find the transcript split across several pages with a lot of unrelated text, so I saved it as a file on my local machine (included in the repo) and used that. Because it is a script, it contains the name of the character speaking each line (in all caps, like so: SHERLOCK:). I decided to remove those names, since nothing like them exists in books and they would likely bias the distribution of the words. For those reasons I ran the scraping step by step, adding steps along the way where needed, rather than using the function I created in the previous part. Other than that, all steps are the same as in the first part.

In [8]:
path = 'sherlock_transcript.html'

with open(path, 'r', encoding='utf-8') as file:
    html_content = file.read()

soup = BeautifulSoup(html_content, 'lxml')
text = soup.get_text()

In [9]:
# Since this is a script, it contains the name of the speaker before they speak. The names appear in full caps,
# so I'm removing words that are in full caps before tokenizing.
import re  # the standard-library module supports this pattern; the third-party regex package is not required
text_no_caps = re.sub(r'\b[A-Z]{2,}\b', '', text)
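
The pattern above removes every word made of two or more capital letters, which also catches all-caps words inside the dialogue. The sketch below shows that effect on two invented lines (not taken from the transcript) and compares it with a hypothetical narrower pattern that strips only a label at the start of a line. The narrower pattern assumes speaker labels look like `NAME:` at the beginning of a line, which may not hold for the actual file.

```python
import re

# Invented sample lines in the transcript's "SPEAKER: text" layout
sample = (
    "SHERLOCK: I need a case!\n"
    "JOHN: You've just solved one. The DNA results are in.\n"
)

# Pattern used in the notebook: drops any word of two or more capital letters
broad = re.sub(r"\b[A-Z]{2,}\b", "", sample)

# Hypothetical alternative: drops only an all-caps label (and its colon) at the start of a line
narrow = re.sub(r"^[A-Z][A-Z .'-]+:\s*", "", sample, flags=re.MULTILINE)

print(broad)
print(narrow)
```

With the notebook's pattern the leftover colons are harmless, since non-alphabetic tokens are filtered out in the next cell.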

In [10]:
tokens = word_tokenize(text_no_caps.lower())
words = [word for word in tokens if word.isalpha()]
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]

In [11]:
words_count = Counter(filtered_words)

In [12]:
common_words = words_count.most_common(20)

for word,count in common_words:
    print(f'{word}:{count}')

sherlock:273
john:242
henry:158
looks:131
back:112
turns:87
eyes:69
towards:67
away:66
oh:65
door:64
walks:62
around:60
right:54
see:52
room:50
know:49
face:48
one:47
look:46


In [13]:
series_show = pd.Series(words_count).sort_values(ascending = False)
df_show = pd.DataFrame(series_show).reset_index()
df_show.columns = ['word','count']

In [14]:
df_show['ratio'] = df_show['count']/df_show['count'].sum()
df_show.head(20)

Unnamed: 0,word,count,ratio
0,sherlock,273,0.023284
1,john,242,0.02064
2,henry,158,0.013475
3,looks,131,0.011173
4,back,112,0.009552
5,turns,87,0.00742
6,eyes,69,0.005885
7,towards,67,0.005714
8,away,66,0.005629
9,oh,65,0.005544


In [15]:
df_show.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2976 entries, 0 to 2975
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   word    2976 non-null   object 
 1   count   2976 non-null   int64  
 2   ratio   2976 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 69.9+ KB


In [16]:
len(filtered_words)

11725

In [17]:
# 2976 unique words in the series episode. 

In [18]:
print(f'{round(len(df_show["word"])/len(filtered_words)*100,2)}% of the words in the series episode are unique')

25.38% of the words in the series episode are unique


In [19]:
df_book[df_book['word']=='sherlock']

Unnamed: 0,word,count,ratio
109,sherlock,35,0.001301


In [20]:
df_show[df_show['word']=='sir']

Unnamed: 0,word,count,ratio
76,sir,21,0.001791


In [21]:
# merge tables for viz on Tableau
new_df = df_book.merge(df_show, on='word', how='outer')

In [22]:
new_df.columns = ['word','book_count','book_frequency','show_count','show_frequency']
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6988 entries, 0 to 6987
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   word            6988 non-null   object 
 1   book_count      5535 non-null   float64
 2   book_frequency  5535 non-null   float64
 3   show_count      2976 non-null   float64
 4   show_frequency  2976 non-null   float64
dtypes: float64(4), object(1)
memory usage: 273.1+ KB
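
The `info()` output above shows two side effects of the outer merge: the table has 6988 rows because 5535 + 2976 - 6988 = 1523 words occur in both texts, and the count columns became `float64` because words missing from one text get `NaN`. A minimal sketch on two tiny frames, using counts from this notebook; the frame names, the `suffixes` values, and the choice to fill missing counts with 0 are assumptions for illustration, not something the notebook does.

```python
import pandas as pd

# Tiny stand-ins for df_book and df_show, with counts taken from the notebook outputs
book = pd.DataFrame({"word": ["sir", "upon"], "count": [336, 316]})
show = pd.DataFrame({"word": ["sir", "sherlock"], "count": [21, 273]})

# The outer merge keeps words from both sides; missing counts become NaN
merged = book.merge(show, on="word", how="outer", suffixes=("_book", "_show"))
print(merged.dtypes)

# Optional: treat "word not present" as a count of 0 and restore integer columns
filled = merged.fillna({"count_book": 0, "count_show": 0}).astype(
    {"count_book": int, "count_show": int}
)
print(filled.set_index("word"))
```

Whether to fill with 0 or keep `NaN` depends on the visualization: `NaN` keeps "absent from this text" distinguishable from a genuine zero.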


In [23]:
new_df

Unnamed: 0,word,book_count,book_frequency,show_count,show_frequency
0,sir,336.0,0.012493,21.0,0.001791
1,upon,316.0,0.011749,,
2,said,240.0,0.008924,17.0,0.001450
3,one,238.0,0.008849,47.0,0.004009
4,man,211.0,0.007845,39.0,0.003326
...,...,...,...,...,...
6983,biting,,,1.0,0.000085
6984,licked,,,1.0,0.000085
6985,starter,,,1.0,0.000085
6986,frenetic,,,1.0,0.000085


In [24]:
new_df.to_csv('joint word count table.csv')