# DTM & TF-IDF OF 1880s FICTION

- Let's take up Ramsay's project. We'll use some Python packages to analyze 1880s British fiction: we'll explore the data and build a document term matrix (DTM). We'll then use the DTM and a word-weighting technique called TF-IDF (term frequency-inverse document frequency) to identify important and distinctive words in this dataset with pandas.

- We start by importing the modules and packages we need and defining the data directory.

In [313]:
import os
import unicodedata
import numpy as np
import pandas as pd
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
# On matplotlib >= 3.6 this style is named 'seaborn-v0_8-poster'
plots.style.use('seaborn-poster')
from string import punctuation
from nltk.corpus import stopwords
from nltk import word_tokenize
from timeit import default_timer as timer

DATA_DIR = 'data/test' # Locate the directory for your dataset

## Listing text files ##
Define a function `list_textfiles`, taking a parameter `directory`, that:
   - lists all the text files in `directory`
   - adds each text file to a new list, written as "folder/file"
   - returns the new list

In [314]:
def list_textfiles(directory):
    """Return a sorted list of paths to the files ending in '.txt' in `directory`."""
    textFiles = []
    # We are sorting because different operating systems may list files in different orders
    for fileName in sorted(os.listdir(directory)):
        if fileName.endswith(".txt"):
            textFiles.append(directory + "/" + fileName)
    return textFiles
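
The function above keeps every `.txt` file, including empty ones. A minimal sketch of a variant that skips files below a size threshold might look like this. The name `list_nonempty_textfiles`, the `min_bytes` parameter, and the file names in the demonstration are assumptions made for illustration; they are not part of the original notebook.

```python
import os
import tempfile

def list_nonempty_textfiles(directory, min_bytes=1):
    """Return sorted '.txt' paths in `directory` whose size is at least `min_bytes`."""
    text_files = []
    # Sort so the order is the same on every operating system
    for file_name in sorted(os.listdir(directory)):
        path = directory + "/" + file_name
        if file_name.endswith(".txt") and os.path.getsize(path) >= min_bytes:
            text_files.append(path)
    return text_files

# Demonstration in a throwaway directory (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    for name, content in [("b_female.txt", "some text"), ("a.txt", "more text"),
                          ("empty.txt", ""), ("notes.md", "ignored")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(content)
    found = [os.path.basename(p) for p in list_nonempty_textfiles(tmp)]
    print(found)  # ['a.txt', 'b_female.txt']
```

Raising `min_bytes` would also drop files that are almost empty; a sensible threshold depends on the corpus.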

## Tokenizing ##

Define a function `tokenizer`, taking a parameter `filename`, that:
   - opens `filename`
   - reads its contents
   - turns all characters into lower case
   - removes punctuation
   - tokenizes the words (use whatever tokenizer you like)
   - removes stopwords (hint: look at the stopwords module we installed)
   - returns a list containing the cleaned up words

In [315]:
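def tokenizer(filename):
    start = timer()
    with open(filename, 'r') as f:
        text = f.read()
    lower = text.lower()
    # Drop every character whose Unicode category is punctuation ('P...')
    no_punctuation = ''.join(ch for ch in lower if unicodedata.category(ch)[0] != 'P')
    words = word_tokenize(no_punctuation)
    # A set makes the stopword lookup fast
    stop_words = set(stopwords.words('english'))
    # Keep words that are not stopwords, not pure digits, and longer than two characters
    cleaned = [w for w in words if w not in stop_words and not w.isdigit() and len(w) > 2]
    end = timer()
    # default_timer() measures seconds, not milliseconds
    print("function took " + str(end - start) + " s")
    return cleaned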
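
One side effect of deleting punctuation before tokenizing is that hyphenated words fuse into a single token. The sketch below illustrates this on a made-up sample sentence and compares it with a hypothetical variant, `punctuation_to_space`, that replaces punctuation with a space instead. To stay self-contained it splits on whitespace rather than calling NLTK's `word_tokenize`, so it is only an approximation of the tokenizer above.

```python
import unicodedata

def strip_punctuation(text):
    """Delete every character whose Unicode category starts with 'P'."""
    return ''.join(ch for ch in text if unicodedata.category(ch)[0] != 'P')

def punctuation_to_space(text):
    """Hypothetical variant: replace punctuation with a space instead of deleting it."""
    return ''.join(' ' if unicodedata.category(ch)[0] == 'P' else ch for ch in text)

# Made-up sample sentence, not taken from the corpus files
sample = "a respectable-looking, middle-aged woman"

deleted = strip_punctuation(sample).split()
spaced = punctuation_to_space(sample).split()

print(deleted)  # ['a', 'respectablelooking', 'middleaged', 'woman']
print(spaced)   # ['a', 'respectable', 'looking', 'middle', 'aged', 'woman']
```

Which behaviour is preferable depends on whether compounds should count as one word or two in the later frequency counts.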

## Running the functions ##

We'll run the functions we've just created:
   - Create a new list, `total`
   - Call list_textfiles with argument DATA_DIR and store its returned contents in a new variable `file_list`
   - Add a for-loop iterating over `file_list` that:
       - runs the `tokenizer` function with each file from `file_list` as argument
       - turns the tokens back into strings using the `' '.join()` function, saving this in variable `strings`
       - runs an `if` statement that checks whether the file name ends with `"_female.txt"` (hint: look up the `os.path.basename` function) and, if so, appends a new list containing the letter "F" and the `strings` variable to the `total` list.
       - runs an `else` statement that appends a new list containing the letter "M" and the `strings` variable to the `total` list.
   - Create a numpy array from the `total` list, called `total_array`
   - Print `total_array` to see if it works!

In [319]:
total = []
file_list = list_textfiles(DATA_DIR)
for filepath in file_list:
    tokens = tokenizer(filepath)
    strings = ' '.join(tokens)
    if os.path.basename(filepath).endswith("_female.txt"):
        total.append(["F", strings])    
    else:
        total.append(["M", strings])
total_array = np.asarray(total)
print(total_array)

function took 0.0005254239949863404 s
function took 0.000804812996648252 s
function took 0.0012806779996026307 s
[['F' 'selecting useful dress materials manufactured messrs']
 ['M'
  'village nearest market altringham two miles away nearest church bowdon like distance road came direct village baguley moor crossed fairy well brook reached oaklandss timperley hall timperley brook ere branched right left acute angle altringham bowdon']
 ['M'
  'sooner taken pen hand intention writing intimate friend toady ascham make inquiries concerning schools general sound pattering feet corridor followed many thumps door attended shouts mammy mammy feet moment opened door soon hugging riotous screaming boy whip hand child master bruce answered pathetic inquiries matter sobbing repetitions word janet janet janet darling asked mother soothingly nothing maam replied respectablelooking middleaged woman followed master bruce kept whipping teasing miss janet pushed away tumbled would smudge drawing maam']]

In [320]:
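from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()
sparse_dtm = countvec.fit_transform(total_array[:, 1])
# get_feature_names_out() requires scikit-learn >= 1.0; older releases used get_feature_names()
dtm = pd.DataFrame(sparse_dtm.toarray(), columns=countvec.get_feature_names_out())
# Quickly identify the most frequent words:
dtm.sum().sort_values(ascending=False).head(20)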

janet           4
bowdon          2
followed        2
door            2
hand            2
inquiries       2
nearest         2
bruce           2
brook           2
timperley       2
master          2
feet            2
mammy           2
altringham      2
village         2
maam            2
away            2
manufactured    1
friend          1
general         1
dtype: int64
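
To make clear what a document term matrix contains, here is a rough pure-Python sketch that builds one by hand with `collections.Counter`. The three toy documents are made up, and the sketch ignores `CountVectorizer`'s own tokenization rules, so treat it as an illustration of the idea rather than a reimplementation.

```python
from collections import Counter

# Toy documents (made up), already cleaned and joined with spaces like `strings`
docs = ["janet door janet", "village brook", "door village village"]

counts = [Counter(doc.split()) for doc in docs]
vocabulary = sorted(set(word for c in counts for word in c))

# One row per document, one column per vocabulary word
matrix = [[c[word] for word in vocabulary] for c in counts]

# Column sums give corpus-wide frequencies, comparable to dtm.sum()
totals = {word: sum(row[i] for row in matrix) for i, word in enumerate(vocabulary)}

print(vocabulary)  # ['brook', 'door', 'janet', 'village']
print(matrix)      # [[0, 1, 2, 0], [1, 0, 0, 1], [0, 1, 0, 2]]
print(totals)      # {'brook': 1, 'door': 2, 'janet': 2, 'village': 3}
```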

In [321]:
# View the average number of times each word is used in a book:
dtm.mean().sort_values(ascending=False).head(10)

janet        1.333333
bowdon       0.666667
followed     0.666667
door         0.666667
hand         0.666667
inquiries    0.666667
nearest      0.666667
bruce        0.666667
brook        0.666667
timperley    0.666667
dtype: float64

- Above, we turned the text into a document term matrix with scikit-learn's `CountVectorizer` class. Next, we weight the same texts with TF-IDF using `TfidfVectorizer`.

In [322]:
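from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
sparse_tfidf = vectorizer.fit_transform(total_array[:, 1])
# get_feature_names_out() requires scikit-learn >= 1.0; older releases used get_feature_names()
tfidf = pd.DataFrame(sparse_tfidf.toarray(), columns=vectorizer.get_feature_names_out())
# Highest TF-IDF weight each word reaches in any single document:
tfidf.max().sort_values(ascending=False).head(20)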

materials       0.408248
useful          0.408248
selecting       0.408248
dress           0.408248
manufactured    0.408248
messrs          0.408248
janet           0.389289
bowdon          0.289951
village         0.289951
brook           0.289951
altringham      0.289951
timperley       0.289951
nearest         0.289951
door            0.194645
followed        0.194645
feet            0.194645
maam            0.194645
master          0.194645
mammy           0.194645
inquiries       0.194645
dtype: float64
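
The value 0.408248 shared by the six top words can be checked by hand. The sketch below assumes `TfidfVectorizer`'s default settings: raw counts as term frequency, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The helper `tfidf_rows` and the toy counts are illustrative only. A document made of six words that each occur once and nowhere else gets six equal weights, which after normalisation are each 1/sqrt(6).

```python
import math

def tfidf_rows(count_rows):
    """TF-IDF with smoothed IDF and L2 row normalisation, for a list of count rows."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents each term occurs
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = [[row[j] * idf[j] for j in range(n_terms)] for row in count_rows]
    result = []
    for row in weighted:
        norm = math.sqrt(sum(v * v for v in row))
        result.append([v / norm if norm else 0.0 for v in row])
    return result

# Toy counts: document 0 has six words that appear once and in no other document
counts = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 2, 1],
    [0, 0, 0, 0, 0, 0, 1, 3],
]
weights = tfidf_rows(counts)
print(round(weights[0][0], 6))  # 0.408248, i.e. 1 / sqrt(6)
```

This matches the six-word document 'selecting useful dress materials manufactured messrs' in the output above. Very short documents therefore receive high weights for every word they contain.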

In [323]:
# Attach the gender label of each document, then split the rows by label
tfidf['gender_'] = total_array[:, 0]
male = tfidf[tfidf['gender_'] == 'M']
female = tfidf[tfidf['gender_'] == 'F']

In [324]:
female.max(numeric_only=True).sort_values(ascending=False).head(10)

materials       0.408248
messrs          0.408248
selecting       0.408248
useful          0.408248
dress           0.408248
manufactured    0.408248
make            0.000000
hand            0.000000
ere             0.000000
fairy           0.000000
dtype: float64

In [325]:
male.max(numeric_only=True).sort_values(ascending=False).head(10)

janet         0.389289
nearest       0.289951
altringham    0.289951
village       0.289951
timperley     0.289951
bowdon        0.289951
brook         0.289951
feet          0.194645
mammy         0.194645
door          0.194645
dtype: float64
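
The per-group maxima computed in the last two cells could also be obtained in one step with a pandas `groupby`. The sketch below is one possible way to do it; the toy table and its numbers are made up and stand in for the real `tfidf` frame.

```python
import pandas as pd

# Toy TF-IDF table (made-up numbers) with a gender label per document
toy = pd.DataFrame({
    "dress": [0.41, 0.00, 0.00],
    "janet": [0.00, 0.00, 0.39],
    "village": [0.00, 0.29, 0.00],
    "gender_": ["F", "M", "M"],
})

# Maximum weight of every word within each group, one row per label
group_max = toy.groupby("gender_").max()
print(group_max)

# Words ranked for one group, analogous to male.max(...).sort_values(...)
top_male = group_max.loc["M"].sort_values(ascending=False)
print(top_male.index.tolist())  # ['janet', 'village', 'dress']
```

Keeping both groups in one frame makes it easier to compare a word's weight across labels side by side.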