**TEXT ANALYTICS**<br/>
To illustrate the basic concepts behind the analysis of text, a short collection of magazine stories will be use called "Astounding Stories" available from http://www.gutenberg.org. These are contained in a single file. The file will be read into a single string. The string will be parsed and filter to prepare the term-document matrix. Each row of this matrix will represent a single term, a word, number, date or
entity.

In [108]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.corpus import wordnet as wn
from collections import Counter
import operator
import string
import re

Next the lists files and term_doc are initialized. fiiles contains the name of the text documents that will be read and used to create the term-document matrix. These are books available from www.gutenberg.org. term_doc is a matrix that will be populated with the terms and their count for each document when run in a loop.

The last three binary variables , pos_tag, stemming, stop words are set to True or False to describe which of the four scenarios are being processed
1. POS(Yes) Stop(Yes) Stem(Yes)
2. POS(Yes) Stop(No) Stem(No)
3. POS(Yes) Stop(No) Stem(No)
4. POS(No) Stop(No) Stem(No)


In [109]:
file_path = 'C:/Users/gaura/Downloads/TAMU/656/Week 9 HW/TextFiles/'
files = ['T1.txt', 'T2.txt', 'T3.txt', 'T4.txt', 'T5.txt', 'T6.txt', 'T7.txt', 'T8.txt']
term_doc = []
pos_tags = True
stemming = True
remove_stop = True

**Process Each Document**<br/>
The main part of this code loops over individual documents. Each is processed using the following steps:
1. Read
2. Preprocessing anf Tokenization
3. Parts of Speech
4. Stop Words
5. Stem
6. Add to term_doc list

In [110]:
for file in files:
    with open (file_path+file, "r") as text_file:
        adoc = text_file.read()


# Convert to all lower case - required
    adoc = ("%s" %adoc).lower()
# Replace special characters with spaces
    adoc = adoc.replace('-', ' ')
    adoc = adoc.replace('_', ' ')
    adoc = adoc.replace(',', ' ')
# Replace not contraction with not
    adoc = adoc.replace("'nt", " not")
    adoc = adoc.replace("n't", " not")
    adoc = adoc.replace("'d", " ")
# Tokenize
    tokens = word_tokenize(adoc)
    tokens = [word.replace(',', '') for word in tokens]
    tokens = [word for word in tokens if ('*' not in word) and \
    word != "''" and word !="``"]
    for word in tokens:
        word = re.sub(r'[^\w\d\s]+','',word)
    print("\nDocument "+file+" contains a total of", len(tokens), " terms.")

    if pos_tags:
# POS Tagging
        tokens = nltk.pos_tag(tokens)
    if remove_stop:
# Remove stop words
        stop = stopwords.words('english') + list(string.punctuation)
        stop.append("said")
# Remove single character words and simple punctuation
        tokens = [word for word in tokens if len(word) > 1]
# Remove stop words
        if pos_tags:
            tokens = [word for word in tokens if word[0] not in stop]
            tokens = [word for word in tokens if (not word[0].replace('.','',1).isnumeric()) and word[0]!="'s" ]
        else:
            tokens = [word for word in tokens if word not in stop]
            tokens = [word for word in tokens if word != "'s" ]

    if stemming:
# WordNet Lematization Stems using POS
        stemmer = SnowballStemmer("english")
        wn_tags = {'N':wn.NOUN, 'J':wn.ADJ, 'V':wn.VERB, 'R':wn.ADV}
        wnl = WordNetLemmatizer()
        stemmed_tokens = []
        if pos_tags:
            for token in tokens:
                term = token[0]
                pos = token[1]
                pos = pos[0]
                try:
                    pos = wn_tags[pos]
                    stemmed_tokens.append(wnl.lemmatize(term, pos=pos))
                except:
                    stemmed_tokens.append(stemmer.stem(term))
        else:
            for token in tokens:
                stemmed_tokens.append(stemmer.stem(token))
    if stemming:
        print("Document "+file+" contains", len(stemmed_tokens), "terms after stemming.\n")
        tokens = stemmed_tokens
    fdist = FreqDist(tokens)
# Use with Wordnet
    td= {}
#term_doc = []
    for word, freq in fdist.most_common(2000):
        td[word] = freq
    term_doc.append(td)
    


Document T1.txt contains a total of 86484  terms.
Document T1.txt contains 40039 terms after stemming.


Document T2.txt contains a total of 108474  terms.
Document T2.txt contains 48289 terms after stemming.


Document T3.txt contains a total of 104778  terms.
Document T3.txt contains 50177 terms after stemming.


Document T4.txt contains a total of 83140  terms.
Document T4.txt contains 35269 terms after stemming.


Document T5.txt contains a total of 76238  terms.
Document T5.txt contains 35030 terms after stemming.


Document T6.txt contains a total of 35136  terms.
Document T6.txt contains 15670 terms after stemming.


Document T7.txt contains a total of 80206  terms.
Document T7.txt contains 35461 terms after stemming.


Document T8.txt contains a total of 64511  terms.
Document T8.txt contains 29797 terms after stemming.



<br/>**Term-Document Matrix** <br/>
This ends the loop over individual documents. At this point point, the matrix term_doc contains the counts for each term and each document, but the terms in multiple documents need to be combined. We create a dictionary *td_matrix* which combines the terms of different documents together in one matrix.

In [111]:
td_mat = {}
for td in term_doc:
    td_mat = Counter(td_mat)+Counter(td)
td_matrix = {}
for k, v in td_mat.items():
    td_matrix[k] = [v]

for td in term_doc:
    for k, v in td_matrix.items():
        if k in td:
            td_matrix[k].append(td[k])
        else:
            td_matrix[k].append(0)

<br/>**Printing Term-Document Matrix** <br/>
The term Document matrix is first sorted in descending order of frequency of each term and then printed along with the scenario description.


In [112]:
td_matrix_sorted = sorted(td_matrix.items(), key=operator.itemgetter(1),reverse=True)
print("Scenario: POS=", pos_tags, "Remove Stop Words=", remove_stop, " Stemming=", stemming)
print("------------------------------------------------------------")
print("TERM            TOTAL  D1  D2   D3   D4   D5   D6   D7   D8")
for i in range(20):
    s = '{:<15s}'.format(td_matrix_sorted[i][0])
    v = td_matrix_sorted[i][1]
    
    for j in range(9):
        s = s + '{:>5d}'.format(v[j])
    print('{:<60s}'.format(s))
print("____________________________________________________________")

Scenario: POS= True Remove Stop Words= True  Stemming= True
------------------------------------------------------------
TERM            TOTAL  D1  D2   D3   D4   D5   D6   D7   D8
one             2127  291  437  348  211  312  121  202  205
water           2040   47  922  825    7   94    7   55   83
make            1928  204  694  262  185  237   63  169  114
would           1855  270  407  195  309  222   60  289  103
go              1620  212  292   18  239  154  103  374  228
come            1511  211  153   62  126  276  155  282  246
could           1363  221  121   49  364  195   93  203  117
time            1333  137  128  175  167  164  213  216  133
see             1188  179  232  129  156  110   72  172  138
light           1175   87  461  322   21   92   61   60   71
get             1146  171  291   24   76  121   53  315   95
air             1126   69  518  412   20   19   23   30   35
know            1042  165  102  112  223  119   46  202   73
day              939   87 