### Text mining: compare two books
# Packages

### Install packages

"codecs" is for reading the text files, "re" (regular expressions) and "collections" are for working with tokens, and "nltk" (Natural Language Toolkit) provides the tokenizer, the stemmer, and the stopword list.


In [1]:
!pip install pandas
!pip install numpy
!pip install scipy
!pip install scikit-learn
!pip install nltk
!pip install matplotlib



### Import packages

In [2]:
import codecs
import re
import copy
import collections


In [3]:
import numpy as np
import pandas as pd

import nltk

In [4]:
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer

In [5]:
from __future__ import division


In [28]:
import matplotlib
%matplotlib inline

# Download stopwords

### Some specialized functions from NLTK
You can also download everything in NLTK with nltk.download(), but it will take time!

In [9]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SvetlanaMeissner\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Import the stopwords corpus from NLTK

In [10]:
from nltk.corpus import stopwords

# Data

### Read data

In [11]:
with codecs.open('SenseSensibility.txt', "r", encoding="utf-8") as f:
    text_SS = f.read()
with codecs.open('EmailsHC2015_part2.txt', "r", encoding="utf-8") as f:
    text_HC = f.read()
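
In Python 3 the built-in `open()` accepts an `encoding` argument directly, so the same read can be written without `codecs`. The sketch below is only an illustration: the file name `sample.txt` and its contents are made up, and the file is written to a temporary directory so the example runs on its own.

```python
import os
import tempfile

# Write a small throwaway file so the sketch is self-contained;
# the file name and contents are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Café and naïve are non-ASCII words.")
    # Built-in open() with an explicit encoding, in place of codecs.open().
    with open(path, "r", encoding="utf-8") as f:
        text_sample = f.read()

print(len(text_sample))
```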


# Process data
Load the English stopword list and add "would" to it.

In [12]:
esw = stopwords.words('english')
esw.append("would")
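
The stopword list is later used for membership tests (`token not in esw`). Testing membership in a list scans the list each time, so for long texts a set is a common alternative. The sketch below assumes a tiny hand-made stopword list (`esw_list` is a hypothetical stand-in for `stopwords.words('english')`).

```python
# Hypothetical mini stopword list, standing in for stopwords.words('english').
esw_list = ["the", "and", "of", "would"]
esw_set = set(esw_list)  # set membership does not scan every element

tokens = ["the", "sister", "would", "elinor"]
kept = [t for t in tokens if t not in esw_set]
print(kept)
```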

Filter tokens (using regular expressions)

In [13]:
word_pattern = re.compile(r'^\w+$')  # raw string, so \w is not treated as a string escape
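
To see what this pattern accepts, here is a small check on a few hand-picked tokens (the samples are illustrative, not taken from the texts). Note that `\w` also matches digits and underscores, which is why the counter function additionally filters with `isalpha()`.

```python
import re

word_pattern = re.compile(r'^\w+$')

# Illustrative tokens, chosen by hand for this sketch.
samples = ["sister", "1811", "snake_case", "don't", "...", ""]
matches = {s: bool(word_pattern.match(s)) for s in samples}
for token, ok in matches.items():
    print(repr(token), ok)
```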

## Create a token counter function

In [14]:
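def get_text_counter(text):
    # Note: stem() is called on the whole text as if it were one word, so the
    # individual tokens are not stemmed (the results below show unstemmed
    # forms such as "every"). Stemming each token would change those counts.
    tokens = WordPunctTokenizer().tokenize(PorterStemmer().stem(text))
    tokens = [token.lower() for token in tokens]
    tokens = [token for token in tokens if token.isalpha()]  # drop numbers and punctuation
    # Keep word-like tokens that are not stopwords
    tokens = [token for token in tokens if word_pattern.match(token) and token not in esw]
    return collections.Counter(tokens), len(tokens)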
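
The filtering steps can be tried on a toy sentence without NLTK. This is a rough sketch under stated assumptions: `toy_text_counter` and `toy_stopwords` are hypothetical names, the stopword set is hand-made, and the splitting uses the regular expression that NLTK documents for `WordPunctTokenizer` (`\w+|[^\w\s]+`).

```python
import collections
import re

# Assumption: a tiny hand-made stopword set instead of NLTK's English list.
toy_stopwords = {"would", "not", "at"}

def toy_text_counter(text):
    # Split into runs of word characters or runs of punctuation.
    tokens = re.findall(r"\w+|[^\w\s]+", text.lower())
    # Keep alphabetic tokens that are not stopwords.
    tokens = [t for t in tokens if t.isalpha() and t not in toy_stopwords]
    return collections.Counter(tokens), len(tokens)

toy_counter, toy_size = toy_text_counter(
    "Elinor would not go; Elinor stayed at 3 Park Street."
)
print(toy_counter.most_common(2), toy_size)
```

The function returns both the counter and the number of tokens that survived filtering; the second value is what the relative frequencies are divided by.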

## Create a function to calculate the absolute and relative frequencies of the most common words

In [15]:
def make_df(counter, size):
    abs_freq = np.array([el[1] for el in counter])
    rel_freq = abs_freq / size
    index = [el[0] for el in counter]
    df = pd.DataFrame(data=np.array([abs_freq, rel_freq]).T, index=index, columns=["Absolute frequency", "Relative frequency"])
    df.index.name = "Most common words"
    return df
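
Note that `make_df` expects the list of `(word, count)` pairs returned by `Counter.most_common()`, not the counter itself. The relative frequency is the count divided by the total number of tokens. A minimal sketch of that arithmetic, with made-up counts and hypothetical names (`toy_counter`, `toy_size`):

```python
import collections

# Made-up counts for illustration; not taken from either text.
toy_counter = collections.Counter({"elinor": 6, "sister": 3, "letter": 1})
toy_size = sum(toy_counter.values())  # 10 tokens in total

# One row per word: (word, absolute frequency, relative frequency)
rows = [(word, count, count / toy_size) for word, count in toy_counter.most_common(2)]
for word, count, rel in rows:
    print(word, count, rel)
```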

# Analysis

## Analyze individual texts

Calculate the most common words of Sense and Sensibility and display the 15 most common.

In [16]:
ss_counter, ss_size = get_text_counter(text_SS)


In [17]:
make_df(ss_counter.most_common(15), ss_size)

Most common words,Absolute frequency,Relative frequency
elinor,685.0,0.012813
could,578.0,0.010811
marianne,566.0,0.010587
mrs,530.0,0.009914
said,397.0,0.007426
every,376.0,0.007033
one,331.0,0.006191
much,290.0,0.005424
must,283.0,0.005293
sister,282.0,0.005275


Save the 1000 most common words of Sense and Sensibility to a .csv file

In [18]:
ss_df = make_df(ss_counter.most_common(1000), ss_size)
ss_df.to_csv("SS2_1000.csv")

## Calculate the most common words in the Hillary Clinton emails and display the 15 most common.

In [19]:
hc_counter, hc_size = get_text_counter(text_HC)

In [20]:
make_df(hc_counter.most_common(15), hc_size)

Most common words,Absolute frequency,Relative frequency
state,8329.0,0.025936
f,6272.0,0.019531
u,4507.0,0.014035
h,4140.0,0.012892
department,4041.0,0.012584
case,3858.0,0.012014
gov,3857.0,0.012011
date,3812.0,0.011871
doc,3799.0,0.01183
unclassified,3761.0,0.011712


Save the 1000 most common words of the emails to a .csv file

In [21]:
hc_df = make_df(hc_counter.most_common(1000), hc_size)
hc_df.to_csv("HC2_1000.csv")

# Compare texts

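Choose the vocabulary for the comparison. The combined counter `all_counter` adds the counts from both documents, but the vocabulary below is taken from the 1000 most common words of the emails (`hc_counter`) only, so words that occur only in Sense and Sensibility are left out of the comparison.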

In [22]:
all_counter = ss_counter + hc_counter

In [23]:
all_df = make_df(hc_counter.most_common(1000), 1)
most_common_words = all_df.index.values
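
The cell above takes its vocabulary from `hc_counter`, although `all_counter` has just been built. The toy sketch below (made-up counts, hypothetical names) illustrates how that choice matters: a word that is frequent in only one text can be missing from the other text's top list but present in the combined one. Because raw counts are added, the longer text still tends to dominate the combined ranking.

```python
import collections

# Made-up counters for illustration only.
a_counter = collections.Counter({"elinor": 5, "state": 2})
b_counter = collections.Counter({"state": 4, "gov": 3})

combined = a_counter + b_counter  # adds counts word by word

top_b = [w for w, _ in b_counter.most_common(2)]
top_combined = [w for w, _ in combined.most_common(3)]
print(top_b)
print(top_combined)
```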

Create a data frame with the differences in word frequency

In [24]:
df_data = []
for word in most_common_words:
    ss_c = ss_counter.get(word, 0) / ss_size
    hc_c = hc_counter.get(word, 0) / hc_size
    d = abs(ss_c - hc_c)
    df_data.append([ss_c, hc_c, d])
    
    

In [25]:
dist_df = pd.DataFrame(data=df_data, index=most_common_words,
                          columns=["SS relative frequency", "HC relative frequency", "Differences in relative frequency"])
dist_df.index.name = "Most common words"
dist_df.sort_values("Differences in relative frequency", ascending=False, inplace=True)
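
The distinctiveness measure used here is the absolute difference between a word's relative frequency in the two texts. A small worked sketch with made-up counters (all names and numbers below are hypothetical) shows the calculation and the descending sort without pandas:

```python
import collections

# Made-up counters and sizes for illustration only.
a_counter = collections.Counter({"elinor": 6, "state": 1, "sister": 3})
b_counter = collections.Counter({"state": 12, "gov": 8})
a_size = sum(a_counter.values())  # 10
b_size = sum(b_counter.values())  # 20

rows = []
for word in sorted(set(a_counter) | set(b_counter)):
    a_rel = a_counter.get(word, 0) / a_size  # 0 if the word is absent
    b_rel = b_counter.get(word, 0) / b_size
    rows.append((word, a_rel, b_rel, abs(a_rel - b_rel)))

# Largest absolute difference first, as in sort_values(..., ascending=False).
rows.sort(key=lambda r: r[3], reverse=True)
for row in rows:
    print(row)
```

A word scores highly either because it is common in one text and rare in the other, or because it is absent from one text altogether.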
    

Display the 20 most distinctive words.

In [26]:
dist_df.head(20)

Most common words,SS relative frequency,HC relative frequency,Differences in relative frequency
state,0.000486,0.025936,0.02545
f,5.6e-05,0.019531,0.019475
u,0.0,0.014035,0.014035
h,0.0,0.012892,0.012892
department,0.0,0.012584,0.012584
gov,0.0,0.012011,0.012011
date,1.9e-05,0.011871,0.011852
doc,0.0,0.01183,0.01183
unclassified,0.0,0.011712,0.011712
case,0.000673,0.012014,0.01134


Save the full list of distinctive words to dist_SSHC.csv

In [27]:
dist_df.to_csv("dist_SSHC.csv")