## 1. Tools for text processing
<p>What are the most frequent words in Franz Kafka's novel <em>The Metamorphosis</em>, and how often do they occur?</p>
<p>In this notebook, we'll scrape the novel <em>The Metamorphosis</em> from the website <a href="https://www.gutenberg.org/">Project Gutenberg</a> (which contains a large corpus of books) using the Python package <code>requests</code>. Then we'll extract words from this web data using <code>BeautifulSoup</code>. Finally, we'll analyze the distribution of words using the Natural Language Toolkit (<code>nltk</code>) and <code>Counter</code>.</p>


In [18]:
# Importing requests, BeautifulSoup, nltk, Counter, pandas, and plotly; downloading the stop word list

import requests
from bs4 import BeautifulSoup
import nltk
nltk.download('stopwords')
from collections import Counter
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2. Request Content
<p>To analyze <em>The Metamorphosis</em>, we need to get its text from <em>somewhere</em>. Luckily, the text is freely available online at Project Gutenberg as an HTML file: https://www.gutenberg.org/cache/epub/5200/pg5200-images.html .</p>
<p><strong>Note</strong> that HTML stands for Hypertext Markup Language and is the standard markup language for the web.</p>


In [19]:
# Getting the Metamorphosis HTML
r = requests.get('https://www.gutenberg.org/cache/epub/5200/pg5200-images.html')
# Setting the correct text encoding of the HTML page
r.encoding = 'utf-8'
# Extracting the HTML from the request object
html = r.text
#print(html[0:1000])


## 3. Get the text from the HTML
<p>This HTML is not quite what we want. However, it does <em>contain</em> what we want: the text of <em>The Metamorphosis</em>. What we need to do now is <em>wrangle</em> this HTML to extract the text of the novel. For this we'll use the package <code>BeautifulSoup</code>.</p>
We also have to make sure that we select only the text we are interested in: the license and the other boilerplate that Project Gutenberg adds around the book must be stripped out of the extracted content.

In [20]:
# Creating a BeautifulSoup object from the HTML
soup = BeautifulSoup(html,'html.parser')

# Getting the text out of the soup
complete_content = soup.get_text()
# Filtering to keep only the text of the book: Project Gutenberg wraps it
# between a START marker and an END marker

final_identifier = "*** END OF THE PROJECT GUTENBERG EBOOK METAMORPHOSIS ***"
initial_identifier = "*** START OF THE PROJECT GUTENBERG EBOOK METAMORPHOSIS ***"
final = complete_content.find(final_identifier)  # index where the book text ends
inicio = complete_content.find(initial_identifier) + len(initial_identifier)  # index where the book text starts

text = complete_content[inicio:final]
#print(text)
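
The slicing above relies on `str.find`, which returns -1 when a marker is missing, so a changed header would silently produce a wrong slice. A minimal sketch of a guarded version follows; the helper name `slice_between` and the sample string are made up for illustration and are not part of the notebook.

```python
def slice_between(content: str, start_marker: str, end_marker: str) -> str:
    """Return the text between two markers, raising if either one is absent."""
    start = content.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    start += len(start_marker)

    # Search for the end marker only after the start marker
    end = content.find(end_marker, start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return content[start:end]


# Invented sample standing in for the full page text
sample = "license header *** START *** One morning... *** END *** license footer"
body = slice_between(sample, "*** START ***", "*** END ***").strip()
print(body)
```

Raising an error instead of slicing with -1 makes a format change on the source page visible immediately.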

## 4. Extract the words
<p>Now that we have the text of interest, it's time to count how many times each word appears, and for this we'll use <code>nltk</code>, the Natural Language Toolkit. We'll start by tokenizing the text, that is, removing everything that isn't a word (whitespace, punctuation, etc.) and splitting the text into a list of words.</p>

In [21]:
# Creating a tokenizer
# A raw string keeps the backslash in \w from being read as a string escape
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

# Tokenizing the text
tokens = tokenizer.tokenize(text)

# Printing out the first 8 words / tokens
print(tokens[0:8])


['Metamorphosis', 'by', 'Franz', 'Kafka', 'Translated', 'by', 'David', 'Wyllie']
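
To see what the `\w+` pattern keeps and drops, here is a small sketch on an invented sentence. It uses the standard-library `re.findall` as a stand-in for `RegexpTokenizer`, on the assumption that both return the non-overlapping matches of the pattern.

```python
import re

# Invented sentence, used only to illustrate the pattern
sentence = "Gregor didn't move; he couldn't."

# \w+ matches runs of letters, digits and underscores; everything else separates tokens
tokens_demo = re.findall(r"\w+", sentence)
print(tokens_demo)
```

Because the apostrophe is not a word character, contractions are split and fragments such as `t` become separate tokens.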


## 5. Make the words lowercase
<p>We should build a list of all words in <em>The Metamorphosis</em> in which all capital letters have been made lower case.</p>

In [22]:
# Create a list called total_words containing all tokens transformed to lower case
total_words = [token.lower() for token in tokens]
print(total_words[0:8])
len(total_words)

['metamorphosis', 'by', 'franz', 'kafka', 'translated', 'by', 'david', 'wyllie']


22384

## 5.5 Split chapters
Split the book into its chapters so that we can see which words are most frequent in each **chapter**.

In [23]:
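# Positions of the chapter headings "II" and "III" (lower-cased in step 5)
chapter_2_Del = total_words.index("ii")
chapter_3_Del = total_words.index("iii")

# Slice the token list at the headings, skipping the heading tokens themselves
chapter_1 = total_words[0:chapter_2_Del]
chapter_2 = total_words[chapter_2_Del+1:chapter_3_Del]
chapter_3 = total_words[chapter_3_Del+1:]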
#chapter_3[-10:]
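
The split depends on `list.index`, which returns the position of the first matching element. A toy sketch of the same slicing, on an invented token list:

```python
# Invented tokens; "ii" and "iii" play the role of the chapter headings
toy_words = ["i", "one", "morning", "ii", "the", "door", "iii", "the", "end"]

second = toy_words.index("ii")   # position of the first "ii"
third = toy_words.index("iii")   # position of the first "iii"

part_1 = toy_words[:second]
part_2 = toy_words[second + 1:third]   # +1 skips the heading token itself
part_3 = toy_words[third + 1:]
print(part_1, part_2, part_3)
```

Since only the first match is used, this works as long as the heading tokens do not occur earlier in the running text, an assumption worth checking on the real token list.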

## 6. Load in stop words
<p>It is common practice to remove words that appear a lot in the English language such as 'the', 'of' and 'a' because they're not so interesting. Such words are known as <em>stop words</em>. The package <code>nltk</code> includes a good list of stop words in English that we can use.</p>

In [24]:
# Getting the English stop words from nltk


sw =  nltk.corpus.stopwords.words('english')

# Printing out the first eight stop words
print(sw[0:8])


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves']


## 7. Remove stop words in The Metamorphosis
<p>We now want to create a new list with all the words in <code>total_words</code>, except those that are stop words (that is, those words listed in <code>sw</code>).</p>

In [25]:
# Create a list words_ns containing all words that are in total_words but not in sw
words_ns = [word for word in total_words if word not in sw]
# Printing the first 5 words_ns to check that stop words are gone
print(words_ns[0:5])
# Pair of counts: (words kept, stop words removed)
values_total = len(words_ns), len(total_words) - len(words_ns)
values_total

['metamorphosis', 'franz', 'kafka', 'translated', 'david']


(10014, 12370)
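
Testing `word not in sw` against a list scans that list for every word. A sketch of the same filter with a set, which gives the same result with faster lookups; the short stop word list here is a hand-picked stand-in for the `nltk` list, chosen only for illustration.

```python
# Hand-picked stand-in for the nltk stop word list (illustration only)
demo_stop_words = ["the", "of", "a", "he", "was"]
demo_stop_set = set(demo_stop_words)

demo_words = ["he", "was", "a", "travelling", "salesman", "of", "the", "firm"]

# Same comprehension, once against the list and once against the set
kept_list = [w for w in demo_words if w not in demo_stop_words]
kept_set = [w for w in demo_words if w not in demo_stop_set]
print(kept_set)
```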

In [27]:
# For each chapter
chapter_1_ns = [word for word in chapter_1 if word  not in sw ]
count_c1 = Counter(chapter_1_ns)
values_c1 = len(chapter_1_ns), len(chapter_1)-len(chapter_1_ns)
values_c1

(3222, 4049)

In [29]:
chapter_2_ns = [word for word in chapter_2 if word  not in sw ]
count_c2 = Counter(chapter_2_ns)
values_c2 = len(chapter_2_ns), len(chapter_2)-len(chapter_2_ns)
values_c2

(3404, 4180)

In [31]:
# The same, but for each chapter individually (here chapter 3)
chapter_3_ns = [word for word in chapter_3 if word not in sw]
count_c3 = Counter(chapter_3_ns)
values_c3 = len(chapter_3_ns), len(chapter_3)-len(chapter_3_ns)
values_c3

(3386, 4141)

## 8. We have the answer
<p>Our original question was:</p>
<blockquote>
  <p>What are the most frequent words in Franz Kafka's novel <em>The Metamorphosis</em>, and how often do they occur?</p>
</blockquote>


In [85]:
# Number of top words to report
x = 10
# Initialize a Counter object from our processed list of words
count = Counter(words_ns)

# Store the x most common words and their counts as top_total
top_total = count.most_common(x)

# Print the top words and their counts
print(top_total)

[('gregor', 298), ('would', 187), ('room', 131), ('could', 119), ('father', 102), ('sister', 101), ('mother', 89), ('door', 87), ('back', 82), ('even', 80)]
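
As a reminder of how `Counter` produces a list like this one, here is a tiny sketch on invented words: `most_common(n)` returns the `n` highest counts as `(word, count)` pairs, largest first.

```python
from collections import Counter

# Invented word list, used only to illustrate the call
demo_count = Counter(["room", "door", "room", "gregor", "room", "gregor"])

# The two most frequent words with their counts
top_two = demo_count.most_common(2)
print(top_two)
```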


# Frequency per chapter


### Chapter 1

In [28]:
chapter_1_df_total = pd.DataFrame(count_c1.most_common())
chapter_1_df_total.columns = ['Word','Chapter1_Count']
display(chapter_1_df_total)


Unnamed: 0,Word,Chapter1_Count
0,gregor,85
1,would,49
2,chief,36
3,clerk,36
4,could,34
...,...,...
1213,released,1
1214,flying,1
1215,heavily,1
1216,bleeding,1


### Chapter 2

In [30]:
chapter_2_df_total = pd.DataFrame(count_c2.most_common())
chapter_2_df_total.columns = ['Word','Chapter2_Count']
chapter_2_df_total


Unnamed: 0,Word,Chapter2_Count
0,gregor,110
1,would,86
2,room,50
3,could,47
4,sister,46
...,...,...
1244,sliding,1
1245,stumbling,1
1246,uniting,1
1247,ability,1


### Chapter 3

In [58]:
# Build the table from the chapter 3 counter (count_c3)
chapter_3_df_total = pd.DataFrame(count_c3.most_common())
chapter_3_df_total.columns = ['Word','Chapter3_Count']
chapter_3_df_total


## Final Data Frame

In [97]:

# Outer-merge the per-chapter tables on Word; a word absent from a chapter gets a count of 0
df_juntas_1 = pd.merge(chapter_1_df_total, chapter_2_df_total, on='Word', how='outer')
df_juntas_2 = pd.merge(df_juntas_1, chapter_3_df_total, on='Word', how='outer')
df_juntas_2 = df_juntas_2.fillna(0)
df_juntas_2["Ttl_Count"] = df_juntas_2["Chapter1_Count"] + df_juntas_2["Chapter2_Count"] + df_juntas_2["Chapter3_Count"]
# Sort by total count first, then keep the x most frequent words
df_juntas_2 = df_juntas_2.sort_values(by='Ttl_Count', ascending=False).head(x)
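
One way to cross-check the merged totals is to add the per-chapter counters directly: adding `Counter` objects sums the counts word by word, and looking up a missing word returns 0, which mirrors an outer merge followed by `fillna(0)`. A sketch with invented counts (the names `c1`, `c2` and `c3` are placeholders, not the notebook's counters):

```python
from collections import Counter

# Invented per-chapter counts, for illustration only
c1 = Counter({"gregor": 3, "clerk": 2})
c2 = Counter({"gregor": 4, "room": 2})
c3 = Counter({"gregor": 1, "room": 1, "sister": 2})

# Adding counters sums the counts of each word across chapters
total = c1 + c2 + c3

# One row per top word: (word, chapter 1, chapter 2, chapter 3, total)
rows = [(word, c1[word], c2[word], c3[word], n) for word, n in total.most_common(2)]
print(rows)
```

Ranking on the summed counts before truncating is what guarantees that the rows shown are the overall most frequent words.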


In [32]:
# Each values_* tuple is (words kept, stop words removed), so the columns follow that order
sw_df = pd.DataFrame([values_c1,values_c2,values_c3,values_total])
sw_df.columns = ['Value_Words', 'Stop_Words']
sw_df.index = ['chapter_1','chapter_2','chapter_3','Total']
sw_df

Unnamed: 0,Value_Words,Stop_Words
chapter_1,3222,4049
chapter_2,3404,4180
chapter_3,3386,4141
Total,10014,12370
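
The counts in this table can be turned into shares. A short sketch using the numbers printed above, taking the first number of each pair as the words kept after filtering and the second as the stop words removed, as in `values_c1 = len(chapter_1_ns), len(chapter_1)-len(chapter_1_ns)`:

```python
# (words kept, stop words removed), copied from the table above
counts = {
    "chapter_1": (3222, 4049),
    "chapter_2": (3404, 4180),
    "chapter_3": (3386, 4141),
    "Total": (10014, 12370),
}

# Share of stop words among all tokens, rounded to three decimals
stop_share = {
    name: round(removed / (kept + removed), 3)
    for name, (kept, removed) in counts.items()
}
print(stop_share)
```

Under that reading, stop words make up a little over half of the tokens in every chapter.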


# Visualization

In [130]:

# One small frame per chapter: 'Type' holds the category label, and the column named after the chapter holds its counts
sw_df_chapter_1 = pd.DataFrame(sw_df.iloc[0]).reset_index().rename(columns={'index':'Type'})
sw_df_chapter_2 = pd.DataFrame(sw_df.iloc[1]).reset_index().rename(columns={'index':'Type'})
sw_df_chapter_3 = pd.DataFrame(sw_df.iloc[2]).reset_index().rename(columns={'index':'Type'})

# Making subplots
fig = make_subplots(rows=1, cols=3, specs=[[{'type': 'pie'}, {'type': 'pie'}, {'type': 'pie'}]],
                    subplot_titles=['Chapter 1', 'Chapter 2', 'Chapter 3'])

# Chapter 1 subplot
fig.add_trace(
    go.Pie(labels=sw_df_chapter_1['Type'], values=sw_df_chapter_1['chapter_1'],
           name='Chapter 1',
           marker=dict(colors= [px.colors.qualitative.G10[6],px.colors.qualitative.G10[5]])),
    row=1, col=1
)

# Chapter 2 subplot
fig.add_trace(
    go.Pie(labels=sw_df_chapter_2['Type'], values=sw_df_chapter_2['chapter_2'],
           name='Chapter 2',
           marker=dict(colors= [px.colors.qualitative.G10[6],px.colors.qualitative.G10[5]])),
    row=1, col=2
)

#Chapter 3 Subplot
fig.add_trace(
    go.Pie(labels=sw_df_chapter_3['Type'], values=sw_df_chapter_3['chapter_3'],
           name='Chapter 3',
           marker=dict(colors= [px.colors.qualitative.G10[6],px.colors.qualitative.G10[5]])),
    row=1, col=3
)


fig.update_layout(
    title_text='Value Words vs Stop Words in Each Chapter',
)


fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=20,
                  insidetextfont=dict(color='white'),
                  marker=dict(line=dict(color='#FFFFFF', width=2)))


annotations = [
    dict(text='Chapter 1', x=0.16, y=-0.1, font_size=14, showarrow=False, xanchor='center', yanchor='top'),
    dict(text='Chapter 2', x=0.5, y=-0.1, font_size=14, showarrow=False, xanchor='center', yanchor='top'),
    dict(text='Chapter 3', x=0.84, y=-0.1, font_size=14, showarrow=False, xanchor='center', yanchor='top')
]

fig.update_layout(annotations=annotations)

In [95]:
fig = px.bar(df_juntas_2, x="Word", y=["Chapter1_Count", "Chapter2_Count","Chapter3_Count"], title="Top "+ str(x) + " words in The Metamorphosis",
             color_discrete_sequence=[px.colors.qualitative.Plotly[3],px.colors.qualitative.Plotly[0],px.colors.qualitative.Bold[4]],
             text_auto = True)

fig.update_layout(
    xaxis_title='Word',
    yaxis_title='Count',
    title={
        'text': f"Top {x} Words in 'The Metamorphosis'",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    legend_title_text='Chapters',
    template='plotly_white'
)
fig.show()

In [107]:
# Top x words per chapter, each table built from its own chapter's counter
chapter_2_df_total = pd.DataFrame(count_c2.most_common())
chapter_2_df_total.columns = ['Word','Count']
chapter_2_df_total['Chapter'] = "Chapter 2"
chapter_2_df_total = chapter_2_df_total.head(x)

chapter_1_df_total = pd.DataFrame(count_c1.most_common())
chapter_1_df_total.columns = ['Word','Count']
chapter_1_df_total['Chapter'] = "Chapter 1"
chapter_1_df_total = chapter_1_df_total.head(x)

chapter_3_df_total = pd.DataFrame(count_c3.most_common())
chapter_3_df_total.columns = ['Word','Count']
chapter_3_df_total['Chapter'] = "Chapter 3"
chapter_3_df_total = chapter_3_df_total.head(x)




# Stack the three tables and renumber the rows from 0
df_juntas_total = pd.concat([chapter_1_df_total,chapter_2_df_total,chapter_3_df_total], axis=0, ignore_index=True)

fig3 = px.sunburst(df_juntas_total, path = [ 'Chapter','Word'],
                   color_discrete_sequence=[px.colors.qualitative.Plotly[0],px.colors.qualitative.Bold[4], px.colors.qualitative.Alphabet[0]],
                   maxdepth = 2,
                   width = 1000,
                   height=800,
                   values = 'Count'
                   )
fig3.show()



