<a id='sec0'></a>
# Basic Analysis of Word Frequencies
- <a href='#sec1'>Create classified datasets</a>
- <a href='#sec2'>Basic Frequency Analysis</a>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import random
import json
import mskcc_functions as ski
import scipy.stats as scs
import feature_engineering as fe

from xgboost import plot_importance
from pprint import pprint
from matplotlib  import cm
from collections import Counter
from importlib import reload
from gensim import corpora, matutils, models, similarities
from nltk import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D

%matplotlib inline

Using TensorFlow backend.
scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
  _deprecated()
Slow version of gensim.models.doc2vec is being used


In [2]:
class_train = pd.read_csv('./data/training_variants')
text_train = pd.read_csv("./data/training_text", sep=r"\|\|", engine='python',
                         header=None, skiprows=1, names=["ID","Text"])


In [3]:
train = class_train.merge(text_train, on='ID')

<a id='sec1'></a>
# Create classified datasets (<a href='#sec0'>Back to top</a>)

<b>Work Flow</b>
1. Create "classified_docs" to hold all the docs (str) in each class
2. Convert each doc in each class to a list of words with tokenizer
3. Remove special character-containing words and stop words from each word_list in each class
4. Apply stemmer to each word in each word_list in each class


- classified_docs : dictionary<br>
    keys are class labels. values are lists, each of which contains the strings from the 'Text' entries of that class in train table
- classified_tokenized_docs : dictionary<br>
    keys are class labels. values are lists of lists. Each inner list is the list of words from one 'Text' entry in train table
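
The filtering rules in step 3 can be tried on a handful of tokens before running the full loop. The following is a minimal sketch using only the standard library: `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English stop-word list, `keep_token` is an illustrative helper name that does not exist in the notebook, and the tokens are made up. Stemming (step 4) is left out.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (assumption: tiny subset)
STOP_WORDS = {"the", "of", "in", "is", "a"}

def keep_token(word, stop_words=STOP_WORDS):
    """Return True if a token passes the same five filters used in the notebook."""
    return bool(
        re.search(r'^[A-Za-z]', word)            # must start with a letter
        and re.search(r'[A-Za-z0-9]$', word)     # must end with a letter or digit
        and not re.search(r'[@#%&*()+=]', word)  # must not contain these special characters
        and len(word) > 1                        # drop single-character tokens
        and word.lower() not in stop_words       # drop stop words (case-insensitive)
    )

# Made-up tokens for illustration
tokens = ["The", "BRCA1", "mutation", "p53", "3D", "a", "x", "50%", "EGFR", "of", "T790M."]
kept = [t for t in tokens if keep_token(t)]
print(kept)  # ['BRCA1', 'mutation', 'p53', 'EGFR']
```

Note that tokens starting with a digit (such as "3D") and tokens ending in punctuation (such as "T790M.") are dropped by the first two patterns.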

In [6]:
# create class label container
class_labels = []
for i in range(9):
    class_labels.append('class' + str(i+1))

In [4]:
%%time
# Make a stemmer object, define stop words
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

# create classified_docs
classified_docs = {}
classified_tokenized_docs = {}
for i in range(9):
    print('%s being processed...' % class_labels[i])
    docs = list(train[train.Class == (i+1)]['Text'])
    classified_docs[class_labels[i]] = docs
    
    tokenized_docs = []
    for k, doc in enumerate(docs):
        # tokenize the doc (DO NOT MAKE IT A SET FOR LATER USE)
        tokenized_doc = word_tokenize(fe.replace_with_whitespace(doc, hyphens='on'))

        # Remove stop words and words with special characters
        tokenized_doc = [word for word in tokenized_doc \
                         if re.search(r'^[A-Za-z]', word) \
                         if re.search(r'[A-Za-z0-9]$', word) \
                         if not re.search(r'[@#%&*()+=]', word) \
                         if len(word) > 1 \
                         if word.lower() not in stop_words]

        # Apply stemmer to each word in the list
        tokenized_doc = [stemmer.stem(word) for word in tokenized_doc]
        
        tokenized_docs.append(tokenized_doc)
    
    classified_tokenized_docs[class_labels[i]] = tokenized_docs

class1 being processed...
class2 being processed...
class3 being processed...
class4 being processed...
class5 being processed...
class6 being processed...
class7 being processed...
class8 being processed...
class9 being processed...
CPU times: user 7min 45s, sys: 381 ms, total: 7min 45s
Wall time: 7min 46s


In [5]:
for i in range(9):
    len_docs = len(classified_docs[class_labels[i]])
    len_tokenized_docs = len(classified_tokenized_docs[class_labels[i]])
    print('%s check: ' % class_labels[i], (len_docs == len_tokenized_docs))

class1 check:  True
class2 check:  True
class3 check:  True
class4 check:  True
class5 check:  True
class6 check:  True
class7 check:  True
class8 check:  True
class9 check:  True


<b>Create combined big text for each class</b>

- classified_texts : dictionary<br>
    keys are class labels. values are strings made by combining all the strings from 'Text' entries in each class in train table
- classified_tokenized_texts : dictionary<br>
    keys are class labels. values are lists of words in the combined strings from 'Text' entries in each class in train table

The latter is processed in the same way as the per-document dictionary above (i.e. classified_tokenized_docs).

In [6]:
%%time
classified_texts = {}
classified_tokenized_texts = {}
for i in range(9):
    print('%s being processed...' % class_labels[i])
    docs = list(train[train.Class == (i+1)]['Text'])
    # combine all docs of the class into one string (each doc followed by a space)
    text = ''.join(doc + ' ' for doc in docs)
    
    tokenized_text = word_tokenize(fe.replace_with_whitespace(text, hyphens='on'))
    tokenized_text = [word for word in tokenized_text \
                         if re.search(r'^[A-Za-z]', word) \
                         if re.search(r'[A-Za-z0-9]$', word) \
                         if not re.search(r'[@#%&*()+=]', word) \
                         if len(word) > 1 \
                         if word.lower() not in stop_words]
    tokenized_text = [stemmer.stem(word) for word in tokenized_text]
    
    classified_texts[class_labels[i]] = text
    classified_tokenized_texts[class_labels[i]] = tokenized_text

class1 being processed...
class2 being processed...
class3 being processed...
class4 being processed...
class5 being processed...
class6 being processed...
class7 being processed...
class8 being processed...
class9 being processed...
CPU times: user 8min 6s, sys: 27.9 s, total: 8min 34s
Wall time: 8min 35s


<b>Save dictionaries for later use</b>

In [7]:
with open('./data/classified_docs.json', 'w') as f1:
    json.dump(classified_docs, f1)

with open('./data/classified_tokenized_docs.json', 'w') as f2:
    json.dump(classified_tokenized_docs, f2)

with open('./data/classified_texts.json', 'w') as f3:
    json.dump(classified_texts, f3)

with open('./data/classified_tokenized_texts.json', 'w') as f4:
    json.dump(classified_tokenized_texts, f4)

<a id='sec2'></a>
# Basic Frequency Analysis  (<a href='#sec0'>Back to top</a>)

<b>average per-doc appearance & appearance frequency (num_docs with the word / total_num_docs)</b>

\*Use dictionaries and pandas
- Create a list of unique words for each class
- Average per-doc appearances:
  count the total number of appearances of a word and divide by the number of documents in each class
- Appearance frequency (num_docs with the word / total_num_docs):
  check whether a word appears in each document, and calculate the fraction of documents that contain the word for each class
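
To make the two quantities concrete, here is a small worked example on made-up tokenized documents for a single class. It is only a sketch: it counts words by flattening the per-document token lists, whereas the notebook counts them from the separately tokenized combined text.

```python
from collections import Counter

# Toy tokenized documents for a single class (made-up data for illustration)
tokenized_docs = [
    ["mutat", "kinas", "mutat"],
    ["kinas", "cell"],
    ["mutat", "cell", "cell", "cell"],
    ["tumor"],
]
num_docs = len(tokenized_docs)

# Average per-doc appearances: total count of the word / number of documents
total_counts = Counter(word for doc in tokenized_docs for word in doc)
ave_perdoc_app = {word: count / num_docs for word, count in total_counts.items()}

# Appearance frequency: number of documents containing the word / number of documents
doc_counts = Counter(word for doc in tokenized_docs for word in set(doc))
app_freq = {word: count / num_docs for word, count in doc_counts.items()}

print(ave_perdoc_app["cell"], app_freq["cell"])  # 1.0 0.5
```

Here "cell" appears four times in four documents (average 1.0 per document) but only in two of them (appearance frequency 0.5), which is why the two measures are kept separately.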

In [8]:
classified_unique_sets = {}
for i in range(9):
    tokenized_text = classified_tokenized_texts[class_labels[i]]
    unique_set = list(set(tokenized_text))
    classified_unique_sets[class_labels[i]] = unique_set
    print('# of unique words in %s: %d'% (class_labels[i], len(unique_set)))

# of unique words in class1: 50671
# of unique words in class2: 42200
# of unique words in class3: 9331
# of unique words in class4: 41992
# of unique words in class5: 17022
# of unique words in class6: 19101
# of unique words in class7: 64771
# of unique words in class8: 7881
# of unique words in class9: 8949


In [9]:
%%time
ave_perdoc_apps = {}
app_freqs = {}
for i in range(9):
    print('%s being processed...' % class_labels[i])
    tokenized_docs = classified_tokenized_docs[class_labels[i]]
    num_docs = len(tokenized_docs)
    
    tokenized_text = classified_tokenized_texts[class_labels[i]]
    c = Counter(tokenized_text)
    ave_perdoc_app = dict(c)
    ave_perdoc_app = {key:(value/num_docs) for key, value in ave_perdoc_app.items()}
    
    app_freq_list = []
    for doc in tokenized_docs:
        c = Counter(doc)
        freq = dict(c)
        app_freq = {key:1 for key, value in freq.items() if value > 0}
        app_freq_list.append(app_freq)
    app_freq_table = pd.DataFrame(app_freq_list)
    app_freq = dict(app_freq_table.sum(axis=0)/num_docs)
    
    ave_perdoc_apps[class_labels[i]] = ave_perdoc_app
    app_freqs[class_labels[i]] = app_freq

class1 being processed...
class2 being processed...
class3 being processed...
class4 being processed...
class5 being processed...
class6 being processed...
class7 being processed...
class8 being processed...
class9 being processed...
CPU times: user 28.3 s, sys: 424 ms, total: 28.7 s
Wall time: 28.7 s


In [10]:
with open('./data/average_per_document_appearances.json', 'w') as f5:
    json.dump(ave_perdoc_apps, f5)
    
with open('./data/fraction_of_documents_with_appearance.json', 'w') as f6:
    json.dump(app_freqs, f6)

<b>Reload the dictionaries from json files</b>

In [8]:
with open('./data/classified_docs.json') as f1:
    classified_docs = json.load(f1)

with open('./data/classified_tokenized_docs.json') as f2:
    classified_tokenized_docs = json.load(f2)

with open('./data/classified_texts.json') as f3:
    classified_texts = json.load(f3)

with open('./data/classified_tokenized_texts.json') as f4:
    classified_tokenized_texts = json.load(f4)

with open('./data/average_per_document_appearances.json') as f5:
    ave_perdoc_apps = json.load(f5)
    
with open('./data/fraction_of_documents_with_appearance.json') as f6:
    app_freqs = json.load(f6)

<b>Remove frequent words that appear in >50% of docs in each class</b>

- Get the top 2000 words in each class, ranked by the fraction of documents in which they appear
- Take the intersection of these top-word lists across all classes
- Remove a word from the table if it is in the intersection and appears in more than 50% of the documents in every class
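
The three steps can be traced on a toy example. The sketch below restates them with plain dictionaries and sets instead of the pandas table, using three made-up classes and `n = 3` rather than the nine classes and `n = 2000` of the notebook. All numbers are invented for illustration.

```python
# Made-up fractions of documents containing each word, per class
app_freqs = {
    "class1": {"cell": 0.9, "mutat": 0.8, "kinas": 0.6, "brca1": 0.3},
    "class2": {"cell": 0.95, "mutat": 0.7, "kinas": 0.2, "egfr": 0.6},
    "class3": {"cell": 0.85, "mutat": 0.45, "kinas": 0.55, "tp53": 0.3},
}
n = 3         # number of top words kept per class (the notebook uses 2000)
cutoff = 0.5  # a word must exceed this fraction in every class to be removed

# Step 1: top-n words in each class, ranked by document fraction
top_words = {
    label: set(sorted(freqs, key=freqs.get, reverse=True)[:n])
    for label, freqs in app_freqs.items()
}

# Step 2: words that are in the top n of every class
overlap1 = set.intersection(*top_words.values())

# Step 3: of those, the words above the cutoff in every class
overlap2 = {w for w in overlap1
            if all(freqs[w] > cutoff for freqs in app_freqs.values())}

# Remove them from the full vocabulary
vocabulary = set().union(*app_freqs.values())
remaining = vocabulary - overlap2

print(sorted(overlap1))  # ['cell', 'mutat']
print(sorted(overlap2))  # ['cell']
```

"mutat" is among the top words everywhere but stays in the vocabulary, because its fraction in class3 (0.45) does not exceed the cutoff.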

In [13]:
fracdocs = pd.DataFrame(app_freqs).fillna(value=0)
n = 2000

top_words = []
for i in range(9):
    tops = fracdocs[class_labels[i]].sort_values(ascending=False).head(n)
    top_words.append(list(tops.index))

overlap1 = set(top_words[0])
for lis in top_words[1:]:
    overlap1.intersection_update(lis)
print('# intersecting words among top%d appearing words in each class: ' % n, len(overlap1))

remove_list = []
for i in range(9):
    remove_words = [word for word in overlap1 \
                    if word in fracdocs[class_labels[i]] \
                    if fracdocs[class_labels[i]][word] > 0.5]
    remove_list.append(list(remove_words))

overlap2 = set(remove_list[0])
for lis in remove_list[1:]:
    overlap2.intersection_update(lis)
print('# intersecting words with >50% appearance: ', len(overlap2))

fracdocs_update1 = fracdocs.copy()
fracdocs_update1 = fracdocs_update1.drop(overlap2)
print('Table shape before removal: ', fracdocs.shape)
print('Table shape after removal:  ', fracdocs_update1.shape)

# intersecting words among top2000 appearing words in each class:  1028
# intersecting words with >50% appearance:  287
Table shape before removal:  (125448, 9)
Table shape after removal:   (125161, 9)


<b>1-class appearance words (8-class non-appearance)</b>

Get words whose appearance frequency in one class is at least 0.5 higher than in every other class
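
The rule can be stated as: sort a word's per-class appearance frequencies in descending order and look for the first gap between neighbouring values that reaches the threshold. The sketch below illustrates this on made-up rows of nine frequencies. `n_class_index` is a hypothetical helper name, not part of the notebook.

```python
def n_class_index(fractions, threshold=0.5):
    """Return n if the n-th highest value exceeds the (n+1)-th highest by at
    least `threshold`, checking n = 1, 2, ... in order. Return None if no
    such gap exists."""
    ordered = sorted(fractions, reverse=True)
    for n in range(1, len(ordered)):
        if ordered[n - 1] - ordered[n] >= threshold:
            return n
    return None

# Made-up per-class appearance frequencies for three words (nine classes each)
row_a = [0.9, 0.1, 0.2, 0.0, 0.3, 0.1, 0.0, 0.2, 0.1]    # frequent in one class only
row_b = [0.8, 0.75, 0.1, 0.2, 0.0, 0.1, 0.0, 0.2, 0.1]   # frequent in two classes
row_c = [0.4, 0.45, 0.4, 0.35, 0.4, 0.45, 0.4, 0.35, 0.4]  # no clear separation

print(n_class_index(row_a), n_class_index(row_b), n_class_index(row_c))  # 1 2 None
```

A return value of 1 corresponds to a 1-class appearance word, 2 to a 2-class appearance word, and so on.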

In [None]:
threshold = 0.5
one_class_words = []
for word in fracdocs.index:
    apps = np.array(fracdocs.loc[word])
    apps[::-1].sort()  # sort in place in descending order
    if (apps[0] - apps[1]) >= threshold:
        one_class_words.append(word)

In [None]:
%%time
# more general version of above
# make n(_)c(lass_)w(ords)_labels
ncw_labels = ['one_class_words', 'two_class_words', 'three_class_words', 
              'four_class_words', 'five_class_words', 'six_class_words', 
              'seven_class_words', 'eight_class_words','other_words']

# Create a new dictionary to contain each n-class of words in list formats
n_class_words = {}
for i in range(9):
    n_class_words[ncw_labels[i]] = []

# Get words for each n-class of words (might be a better way to do this?)
threshold = 0.5
for word in fracdocs.index:
    apps = np.array(fracdocs.loc[word])
    apps[::-1].sort()  # sort in place in descending order
    # find the first gap between consecutive sorted values that reaches the threshold
    for k in range(8):
        if (apps[k] - apps[k+1]) >= threshold:
            n_class_words[ncw_labels[k]].append(word)
            break
    else:
        # no gap reaches the threshold
        n_class_words[ncw_labels[8]].append(word)

In [None]:
for i in range(9):
    print('# of words in %s: %d' % (ncw_labels[i], len(n_class_words[ncw_labels[i]])))
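
Because the table has more than 100,000 rows, a row-by-row Python loop may be slow. One possible alternative, shown here only as a sketch, is to sort and compare all rows at once with NumPy. `n_class_labels` is a hypothetical helper, and the input rows are made up; with the real table the input would presumably be `fracdocs.to_numpy()`.

```python
import numpy as np

def n_class_labels(values, threshold=0.5):
    """values: array of shape (n_words, n_classes).
    Return, for each word, the 1-based position of the first gap between
    consecutive descending-sorted values that is >= threshold, or 0 if
    there is no such gap."""
    ordered = -np.sort(-values, axis=1)        # each row sorted in descending order
    gaps = ordered[:, :-1] - ordered[:, 1:]    # differences between neighbouring values
    has_gap = gaps >= threshold
    first = has_gap.argmax(axis=1) + 1         # argmax finds the first True in each row
    return np.where(has_gap.any(axis=1), first, 0)

# Made-up appearance frequencies: three words, nine classes
values = np.array([
    [0.9, 0.1, 0.2, 0.0, 0.3, 0.1, 0.0, 0.2, 0.1],
    [0.8, 0.75, 0.1, 0.2, 0.0, 0.1, 0.0, 0.2, 0.1],
    [0.4, 0.45, 0.4, 0.35, 0.4, 0.45, 0.4, 0.35, 0.4],
])
labels = n_class_labels(values)
print(labels.tolist())  # [1, 2, 0]
```

A label of `k` between 1 and 8 would map to `ncw_labels[k-1]`, and 0 to `'other_words'`.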

In [None]:
# Loop version. Prob very slow
threshold = 0.5
for i in range(1):
    # Pre-screen out any word with <0.5 appearance frequency
    class_fracs = fracdocs[class_labels[i]]
    class_words = [word for word in fracdocs.index if class_fracs.loc[word] >= 0.5]
    
    other_classes = [class_label for class_label in class_labels if class_label != class_labels[i]]
    fracdocs_other_classes = fracdocs[other_classes]
    # keep words whose frequency in this class exceeds the highest frequency elsewhere by the threshold
    one_class_words = [word for word in class_words
                       if (class_fracs.loc[word] - np.max(fracdocs_other_classes.loc[word])) >= threshold]

<b>2-class appearance words (7-class non-appearance)</b>

Get words whose appearance frequency in two classes is at least 0.5 higher than in the other seven classes

<b>3-class appearance words (6-class non-appearance)</b>

Get words whose appearance frequency in three classes is at least 0.5 higher than in the other six classes

<b>4-class appearance words (5-class non-appearance)</b>

Get words whose appearance frequency in four classes is at least 0.5 higher than in the other five classes

<b>5-class appearance words (4-class non-appearance)</b>

Get words whose appearance frequency in five classes is at least 0.5 higher than in the other four classes

<b>6-class appearance words (3-class non-appearance)</b>

Get words whose appearance frequency in six classes is at least 0.5 higher than in the other three classes

<b>7-class appearance words (2-class non-appearance)</b>

Get words whose appearance frequency in seven classes is at least 0.5 higher than in the other two classes

<b>8-class appearance words (1-class non-appearance)</b>

Get words whose appearance frequency in eight classes is at least 0.5 higher than in the remaining class