# Using Python for Research Homework: Week 3, Case Study 2

In this case study, we will find and plot the distribution of word frequencies for each translation of Hamlet. Perhaps the distribution of word frequencies depends on the translation; let's find out!

In [2]:
# DO NOT EDIT THIS CODE!
import os
import pandas as pd
import numpy as np
from collections import Counter

def count_words_fast(text):
    text = text.lower()
    skips = [".", ",", ";", ":", "'", '"', "\n", "!", "?", "(", ")"]
    for ch in skips:
        text = text.replace(ch, "")
    word_counts = Counter(text.split(" "))
    return word_counts

def word_stats(word_counts):
    num_unique = len(word_counts)
    counts = word_counts.values()
    return (num_unique, counts)

### Exercise 1 

In this case study, we will find and visualize summary statistics of the text of different translations of Hamlet. The functions `count_words_fast` and `word_stats` are already defined, as in the Case 2 Videos (Videos 3.2.x).

#### Instructions 
- Read in the data as a `pandas` dataframe using `pd.read_csv`. Use the `index_col` argument to set the first column in the CSV file as the index for the dataframe. The data can be found at https://courses.edx.org/asset-v1:HarvardX+PH526x+2T2019+type@asset+block@hamlets.csv
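
To see what `index_col` does, here is a minimal sketch on a small in-memory CSV. The sample rows are invented for illustration; they only mimic the layout of `hamlets.csv` (an unnamed first column followed by `language` and `text`), and the name `sample` is hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for hamlets.csv: an unnamed first column, then two fields.
sample_csv = io.StringIO(
    ",language,text\n"
    "1,English,To be or not to be\n"
    "2,German,Sein oder Nichtsein\n"
)

# index_col=0 turns the first CSV column into the row index instead of a data column.
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.index.tolist())    # row labels come from the first column
print(sample.columns.tolist())  # only the remaining columns are data
```

Without `index_col=0`, the first column would be read as an ordinary data column named `Unnamed: 0`.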

In [8]:
hamlets = pd.read_csv('hamlets.csv', index_col=0)
hamlets

Unnamed: 0,language,text
1,English,The Tragedie of Hamlet\n ...
2,German,"Hamlet, Prinz von Dännemark.\n ..."
3,Portuguese,HAMLET\n DRAMA EM ...


### Exercise 2 

In this exercise, we will summarize the text for a single translation of Hamlet in a `pandas` dataframe. 

#### Instructions
- Find the dictionary of word frequency in `text` by calling `count_words_fast()`. Store this as `counted_text`.
- Create a `pandas` dataframe named `data`.
- Using `counted_text`, define two columns in data:
    - `word`, consisting of each unique word in text.
    - `count`, consisting of the number of times each word in `word` is included in the text.

In [30]:
# Take the first translation (English) and count its words.
language, text = hamlets.iloc[0]
counted_text = count_words_fast(text)

# One row per unique word, with the number of times it occurs.
data = pd.DataFrame({
    "word": list(counted_text.keys()),
    "count": list(counted_text.values())
})
counted_text['hamlet']


97

### Exercise 3

In this exercise, we will continue to define summary statistics for a single translation of Hamlet. 

#### Instructions
- Add a column to `data` named `length`, defined as the length of each word.
- Add another column named `frequency`, which is defined as follows for each word in `data`:
    - If `count > 10`, `frequency` is "frequent".
    - If `1 < count <= 10`, `frequency` is "infrequent".
    - If `count == 1`, `frequency` is "unique".
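
The three rules above amount to binning `count` into right-closed intervals. As a hedged alternative to an explicit loop, the sketch below applies `pd.cut` to a toy dataframe; the counts and the name `toy` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy counts chosen to hit every boundary of the three categories.
toy = pd.DataFrame({"count": [935, 3, 576, 1, 10, 11]})

# Right-closed bins: (0, 1] -> unique, (1, 10] -> infrequent, above 10 -> frequent.
toy["frequency"] = pd.cut(
    toy["count"],
    bins=[0, 1, 10, np.inf],
    labels=["unique", "infrequent", "frequent"],
).astype(str)
print(toy)
```

Because the bins are closed on the right, a count of exactly 10 lands in "infrequent" and a count of 11 in "frequent", matching the definitions above.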

In [36]:
# Length of each word in `data`.
data['length'] = data['word'].apply(len)

# Label each word by how often it occurs in the text.
frequency = []
for count in data['count']:
    if count > 10:
        frequency.append('frequent')
    elif 1 < count <= 10:
        frequency.append('infrequent')
    elif count == 1:
        frequency.append('unique')
data['frequency'] = frequency
data

Unnamed: 0,word,count,length,frequency
0,the,935,3,frequent
1,tragedie,3,8,infrequent
2,of,576,2,frequent
3,hamlet,97,6,frequent
4,,45513,0,frequent
...,...,...,...,...
5108,shooteexeunt,1,12,unique
5109,marching,1,8,unique
5110,peale,1,5,unique
5111,ord,1,3,unique
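
The row with an empty `word` and a count of 45513 is most likely a side effect of `text.split(" ")` in `count_words_fast`: splitting on a single space produces an empty string for every extra consecutive space. The sketch below shows the effect on an invented sentence; it does not change the provided function.

```python
from collections import Counter

# Invented sample with runs of two and three spaces.
sample = "to be  or not   to be"

by_single_space = Counter(sample.split(" "))  # how count_words_fast splits
by_whitespace = Counter(sample.split())       # splits on any run of whitespace

print(by_single_space[""])  # empty-string "words" created by the extra spaces
print("" in by_whitespace)  # the default split produces no empty strings
```

Since the empty string has length 0 and falls in the "frequent" category, it slightly lowers the mean word length reported for frequent words.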


### Exercise 4

In this exercise, we will summarize the statistics in `data` into a smaller `pandas` dataframe.

#### Instructions 
- Create a `pandas` dataframe named `sub_data` including the following columns:
    - `language`, which is the language of the text (defined in Exercise 2).
    - `frequency`, which is a list containing the strings "frequent", "infrequent", and "unique".
    - `mean_word_length`, which is the mean word length of each value in `frequency`.
    - `num_words`, which is the total number of words in each frequency category.

In [67]:
# Group word lengths by frequency category and aggregate them in one pass.
categories = ['frequent', 'infrequent', 'unique']
grouped = data.groupby('frequency')['length']

sub_data = pd.DataFrame({
    'language': language,  # defined in Exercise 2
    'frequency': categories,
    'mean_word_length': grouped.mean().reindex(categories).values,
    'num_words': grouped.size().reindex(categories).values
})
sub_data

Unnamed: 0,language,frequency,mean_word_length,num_words
0,English,frequent,4.371517,323
1,English,infrequent,5.825243,1442
2,English,unique,7.005675,3348


### Exercise 5

In this exercise, we will join the data summaries for all translations of Hamlet.

#### Instructions 
- The previous code for summarizing a particular translation of Hamlet is consolidated into a single function called `summarize_text`. Create a `pandas` dataframe `grouped_data` consisting of the results of `summarize_text` for each translation of Hamlet in `hamlets`.
    - Use a `for` loop across the row indices of `hamlets` to assign each translation to a new row.
    - Use the `.iloc` method to obtain the `i`th row of `hamlets`, and assign the output to the variables `language` and `text`.
    - Call `summarize_text` using `language` and `text`, and assign the output to `sub_data`.
    - Use the pandas `.append()` function to append `sub_data` row-wise to `grouped_data`. (`DataFrame.append` was removed in pandas 2.0; `pd.concat` is the replacement there.)

In [87]:
def summarize_text(language, text):
    counted_text = count_words_fast(text)

    data = pd.DataFrame({
        "word": list(counted_text.keys()),
        "count": list(counted_text.values())
    })
    
    data.loc[data["count"] > 10,  "frequency"] = "frequent"
    data.loc[data["count"] <= 10, "frequency"] = "infrequent"
    data.loc[data["count"] == 1,  "frequency"] = "unique"
    
    data["length"] = data["word"].apply(len)
    
    sub_data = pd.DataFrame({
        "language": language,
        "frequency": ["frequent","infrequent","unique"],
        "mean_word_length": data.groupby(by = "frequency")["length"].mean(),
        "num_words": data.groupby(by = "frequency").size()
    })
    
    return(sub_data)

# Summarize every translation and stack the summaries row-wise.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
summaries = []
for i in range(hamlets.shape[0]):
    language, text = hamlets.iloc[i]
    sub_data = summarize_text(language, text)
    summaries.append(sub_data)
grouped_data = pd.concat(summaries)
grouped_data

Unnamed: 0,language,frequency,mean_word_length,num_words
frequent,English,frequent,4.371517,323
infrequent,English,infrequent,5.825243,1442
unique,English,unique,7.005675,3348
frequent,German,frequent,4.528053,303
infrequent,German,infrequent,6.48183,1596
unique,German,unique,9.006987,5582
frequent,Portuguese,frequent,4.417625,261
infrequent,Portuguese,infrequent,6.49787,1643
unique,Portuguese,unique,8.669778,5357


### Exercise 6

In this exercise, we will plot our results and look for differences across each translation.

#### Instructions 
- Plot the word statistics of each translation on a single plot. Note that we have already done most of the work for you.
- Consider: do the word statistics differ by translation?

In [88]:
import matplotlib.pyplot as plt

# Color encodes the language; marker shape encodes the frequency category.
colors = {"Portuguese": "green", "English": "blue", "German": "red"}
markers = {"frequent": "o","infrequent": "s", "unique": "^"}
for i in range(grouped_data.shape[0]):
    row = grouped_data.iloc[i]
    plt.plot(row.mean_word_length, row.num_words,
        marker=markers[row.frequency],
        color = colors[row.language],
        markersize = 10
    )

color_legend = []
marker_legend = []
for color in colors:
    color_legend.append(
        plt.plot([], [],
        color=colors[color],
        marker="o",
        label = color, markersize = 10, linestyle="None")
    )
for marker in markers:
    marker_legend.append(
        plt.plot([], [],
        color="k",
        marker=markers[marker],
        label = marker, markersize = 10, linestyle="None")
    )
plt.legend(numpoints=1, loc = "upper left")

plt.xlabel("Mean Word Length")
plt.ylabel("Number of Words")
plt.show()

ModuleNotFoundError: No module named 'matplotlib'
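
The same comparison can also be read off a table. The sketch below re-enters the `num_words` values from the `grouped_data` output in Exercise 5 by hand, so that it runs without the earlier cells, and pivots them so that each language becomes a column. The names `summary` and `comparison` are illustrative.

```python
import pandas as pd

# Values copied from the grouped_data output in Exercise 5.
summary = pd.DataFrame({
    "language": ["English"] * 3 + ["German"] * 3 + ["Portuguese"] * 3,
    "frequency": ["frequent", "infrequent", "unique"] * 3,
    "num_words": [323, 1442, 3348, 303, 1596, 5582, 261, 1643, 5357],
})

# One row per frequency category, one column per language.
comparison = summary.pivot(index="frequency", columns="language", values="num_words")
print(comparison)
```

In this data, the German and Portuguese translations contain noticeably more words that occur only once (5582 and 5357) than the English text (3348), so the word statistics do appear to differ by translation.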