# Text Analysis - Clustering
## Pt 1 - Significant Terms

If you have thousands, or hundreds of thousands, of documents, how do you get an overall picture of what they are about? Techniques for finding *significant* terms are a useful way to summarise large amounts of text, either by summarising an entire collection of documents or by finding the terms that best describe a subset of those documents.

In this workbook we'll be looking at discovering significant terms through a process called *Vectorization*, and we'll be looking at two approaches.

- Count Vectorization
- TFIDF Vectorization

In [2]:
import pandas as pd
import spacy


## Loading our Sample Data

To demonstrate these techniques it is useful to have a set of documents with clear differences, so we can test how well the words we discover describe both the overall collection of texts and the groups of texts separately.

We will be using a dataset known as the "20 Newsgroups" Dataset. The [website about the dataset](http://qwone.com/~jason/20Newsgroups/) has more information.

In [3]:
from sklearn.datasets import fetch_20newsgroups

news_set = fetch_20newsgroups(subset='all', 
                              categories=['alt.atheism', 
                                          'talk.religion.misc',
                                            'comp.graphics', 
                                          'sci.space'],
                              remove=('headers', 'footers', 'quotes'))

The data is delivered as a dictionary-like object, with different keys referring to different components.

In [4]:
news_set['data'][0] # the list of texts itself; here we look at just the first by using [0]

"My point is that you set up your views as the only way to believe.  Saying \nthat all eveil in this world is caused by atheism is ridiculous and \ncounterproductive to dialogue in this newsgroups.  I see in your posts a \nspirit of condemnation of the atheists in this newsgroup bacause they don'\nt believe exactly as you do.  If you're here to try to convert the atheists \nhere, you're failing miserably.  Who wants to be in position of constantly \ndefending themselves agaist insulting attacks, like you seem to like to do?!\nI'm sorry you're so blind that you didn't get the messgae in the quote, \neveryone else has seemed to."

In [5]:
 # a list of the different categories of document, based on the newsgroup they came from.
news_set['target_names']

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

In [6]:
# and a list that assigns each of the documents to a particular category number, which maps back to the order of the list of names above

# i.e. Target number 0 refers to alt.atheism because that is the label at position 0 in the list.
news_set['target']

array([0, 1, 1, ..., 2, 1, 1])

The number of texts should match the number of target labels

In [7]:
len(news_set['data'])

3387

In [8]:
len(news_set['target'])

3387

We start by putting our text and category labels into a dataframe

In [9]:
df = pd.DataFrame({'text':news_set['data'], 'category_num':news_set['target']})
df

Unnamed: 0,text,category_num
0,My point is that you set up your views as the ...,0
1,\nBy '8 grey level images' you mean 8 items of...,1
2,FIRST ANNUAL PHIGS USER GROUP CONFERENCE\n\n ...,1
3,"I responded to Jim's other articles today, but...",3
4,"\nWell, I am placing a file at my ftp today th...",1
...,...,...
3382,I am working on a program to display 3d wirefr...,1
3383,\n Did the Russian spacecraft(s) on the ill-f...,2
3384,"\n\nOh gee, a billion dollars! That'd be just...",2
3385,I am looking for software to run on my brand n...,1


To provide a string label we essentially want to take the `category_num` and look up the correct label.

The best data structure to look up information is a dictionary. In our case we want one that looks like this...
```
{0: 'alt.atheism',
 1: 'comp.graphics',
 2: 'sci.space',
 3: 'talk.religion.misc'}
 ```
Now you could just copy that into a variable and be done with it, but what happens if you need a dictionary with 300 category labels? Better to use code that will create the dictionary for you no matter how long it is.

In [10]:
news_set['target_names']

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

The items are in the correct order in the list and we already know that the target numbers relate to the position of the label in `news_set['target_names']`. So what we need to do is... 

- create a dictionary where the key is the position of a label and the value is the label itself.
- To do this we need to be able to somehow automatically get the position of an item in a list.
- For this we can use the built-in `enumerate` function.
- When you loop over an iterable like a list that has been wrapped in the `enumerate` function, every loop produces two values: a number and the item itself. Unless you pass extra arguments to `enumerate`, that number is the index position of the item in the list.
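As a small illustration of the "extra arguments" mentioned in the list above, here is a sketch using a made-up list (`labels` is a hypothetical variable, not part of the dataset). By default `enumerate` counts from 0, which matches list positions. The optional `start` argument changes the first number.

```python
# A made-up list of labels, used only for illustration.
labels = ['red', 'green', 'blue']

# Default behaviour: the counter starts at 0, so it matches each item's index.
default_pairs = list(enumerate(labels))

# With start=1 the counter begins at 1 and no longer matches the index.
shifted_pairs = list(enumerate(labels, start=1))

print(default_pairs)
print(shifted_pairs)
```

For building our lookup dictionary we want the default behaviour, because the target numbers are list positions.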



In [11]:
for position, item in enumerate(news_set['target_names']):
    print(position, item)
    

0 alt.atheism
1 comp.graphics
2 sci.space
3 talk.religion.misc


In [12]:
# all we need to do is translate this loop into a dictionary.

# method A - the for loop

category_lookup = {}
for position, item in enumerate(news_set['target_names']):
    category_lookup[position] = item
category_lookup

{0: 'alt.atheism', 1: 'comp.graphics', 2: 'sci.space', 3: 'talk.religion.misc'}

In [13]:
# method B: the dictionary comprehension - like a list comprehension, but as a dictionary!

# note the curly braces rather than square brackets,
# ...and the separation of the position and the item using a : to denote the key:value pair.

category_lookup = {position: item for position, item in enumerate(news_set['target_names'])}
category_lookup

{0: 'alt.atheism', 1: 'comp.graphics', 2: 'sci.space', 3: 'talk.religion.misc'}

Now we can use this to look up our string labels to go with our category numbers

In [14]:
category_lookup[2]

'sci.space'

To do this for every row we can use apply and a function.

In [15]:
def lookup_label(category_number):
    return category_lookup[category_number]

In [16]:
df['category_num'].apply(lookup_label)

0              alt.atheism
1            comp.graphics
2            comp.graphics
3       talk.religion.misc
4            comp.graphics
               ...        
3382         comp.graphics
3383             sci.space
3384             sci.space
3385         comp.graphics
3386         comp.graphics
Name: category_num, Length: 3387, dtype: object

However, this is not best practice, because we now have a function hanging around that we will probably only ever use once. Ideally we want the function to exist for one job, after which we can forget it.

We can do this with a `lambda` function, which creates a new function on the fly rather than defining one separately using `def`. Lambdas are good for creating simple functions that you only need for one particular job. The general structure of a lambda is...

    lambda value being passed in : value to return after doing something with the passed in value 


Here we start the function by declaring `lambda` and then we give a name to the value about to be passed to it, which is the target `category_number` in each row of our dataframe.

The first part of the `lambda` ends with `:`, just as the `def` line that starts our `lookup_label` function ends with `:`.

The code after the `:` declares what will be returned. In this case we look up a string label using our `category_lookup` dictionary and the `category_number` and return that label value.
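To make the comparison concrete, here is a minimal sketch that puts a `def` function and its `lambda` equivalent side by side. It uses a cut-down stand-in for `category_lookup` and a made-up list of `numbers`. The lambda is assigned to a name (`lookup_lambda`, a hypothetical name) only so the two versions can be compared.

```python
# A small stand-in for the category_lookup dictionary built earlier.
category_lookup = {0: 'alt.atheism', 1: 'comp.graphics'}

# A named function defined with def...
def lookup_label(category_number):
    return category_lookup[category_number]

# ...and the equivalent anonymous function written as a lambda.
lookup_lambda = lambda category_number: category_lookup[category_number]

# Made-up category numbers to look up.
numbers = [1, 0, 1]
labels_from_def = [lookup_label(n) for n in numbers]
labels_from_lambda = list(map(lookup_lambda, numbers))

print(labels_from_def)
print(labels_from_lambda)
```

Both versions return the same labels. The lambda simply saves us from keeping a named function around.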

In [17]:
df['category_label'] = df['category_num'].apply(lambda category_number: category_lookup[category_number])

In [18]:
df.head()

Unnamed: 0,text,category_num,category_label
0,My point is that you set up your views as the ...,0,alt.atheism
1,\nBy '8 grey level images' you mean 8 items of...,1,comp.graphics
2,FIRST ANNUAL PHIGS USER GROUP CONFERENCE\n\n ...,1,comp.graphics
3,"I responded to Jim's other articles today, but...",3,talk.religion.misc
4,"\nWell, I am placing a file at my ftp today th...",1,comp.graphics


# Preprocessing the Text
As before we will use SpaCy for quick pre-processing of the text to tokenize and clean it.

However we're going to make a couple of changes to our text processor...

In [19]:
import spacy

In [20]:
nlp = spacy.load('en_core_web_md')

In [21]:
stop_list = nlp.Defaults.stop_words

In [22]:
%time df['text_nlp'] = list(nlp.pipe(df['text'],n_process=7))

CPU times: user 28.4 s, sys: 1.5 s, total: 29.9 s
Wall time: 1min 19s


### Phrasing
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/Archer-phrasing.jpg?raw=true" align="right" width="300">

- Operates on the assumption that if words often co-occur together in a corpus, they should be considered as a single 'phrase', rather than as individual words.
- Phrasing improves the accuracy of various analyses as it recognises that words may be transformed by their context.
- For example: 
    - In one document we have the phrase "human rights", in the other, "human biology". 
    - **Without phrasing** these may be considered similar as they both use the word "human".
    - However **with phrasing** these would be transformed into two separate tokens, human_rights and human_biology, and therefore be more likely to be distinguished as different.

### The phrasing process

Currently our system of text processing first preprocesses all the documents using SpaCy. Then we use Pandas `apply` to operate on each of those documents individually, lemmatising and cleaning out unwanted material.

For the phraser to know which words often co-occur, it needs to see the entire corpus at once. So we need to train the phraser on the entire corpus before we use it row by row on individual documents.
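The train-then-apply idea can be sketched in plain Python. This is only a toy: it uses a simple count threshold rather than the scoring that Gensim applies, and the names `train_toy_phraser`, `apply_toy_phraser` and `min_count` are invented for this illustration.

```python
from collections import Counter

def train_toy_phraser(sentences, min_count=2):
    """Count adjacent word pairs across the WHOLE corpus and keep the frequent ones."""
    pair_counts = Counter()
    for tokens in sentences:
        pair_counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, count in pair_counts.items() if count >= min_count}

def apply_toy_phraser(tokens, phrases):
    """Join any adjacent pair found in `phrases` into a single token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + '_' + tokens[i + 1])
            i += 2  # skip the second word of the pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# A tiny invented corpus of tokenized sentences.
corpus = [
    ['human', 'rights', 'matter'],
    ['defend', 'human', 'rights'],
    ['human', 'biology', 'lecture'],
]

# Training sees every sentence; applying works on one list of tokens at a time.
phrases = train_toy_phraser(corpus)
result = apply_toy_phraser(['human', 'rights', 'and', 'human', 'biology'], phrases)
print(result)
```

Here "human rights" appears twice in the toy corpus so it is joined, while "human biology" appears only once and is left as two tokens.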

In [23]:
import gensim

#### Training Functions

`train_phraser` has three stages. 
- First we create a list of tokenized sentences. 
- We then feed that list of sentences to a Gensim `Phrases` model. This model looks at which tokens co-occur and how often, and [makes a judgement](https://arxiv.org/abs/1310.4546) about whether the co-occurrence is common enough to consider it a 'phrase'.
- We `return` our trained phraser to use later on

In [24]:
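def train_phraser(texts, stopwords):
    # Build one list of lowercased, alphabetic lemmas per sentence, across all docs.
    sentences = [
        [token.lemma_.lower() for token in sentence if token.lemma_.lower().isalpha()]
        for doc in texts 
        for sentence in doc.sents]
    
    # Note: `common_terms` is the argument name in gensim 3.x;
    # gensim 4.0 renamed it to `connector_words`.
    bigram_phraser = gensim.models.Phrases(sentences, common_terms=stopwords)
    return bigram_phraser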

In [25]:
phraser = train_phraser(df['text_nlp'], stopwords=stop_list)

Here is our new filter function. It has a couple of new features.
- Now we pass in the set of stopwords rather than rely on the function finding them in the global scope.
- We have a process to handle the phrase detection stage

Our new `filter_text` function...
- Takes a SpaCy doc
- Iterates over the doc sentences
- For each sentence, it breaks the sentence up into individual tokens, lemmatises and lowercases them, and filters out any non-alphabetical tokens
- It then transforms those remaining tokens using the trained phraser
- ... and adds those tokens to a list using extend so the result is a flat list of tokens for the whole document.
- It then filters for stopwords before returning the list of tokens.

In [26]:
def filter_text(spacy_doc, phraser, stopwords):
    transformed_doc = []
    for sentence in spacy_doc.sents:
        sentence_tokens = [token.lemma_.lower() for token in sentence if token.lemma_.lower().isalpha()]
        transformed = phraser[sentence_tokens]
        transformed_doc.extend(transformed)
    tokens = [token for token in transformed_doc if token.lower() not in stopwords]
    return tokens

#### Why sentences?
We break down the text into sentences for phrase detection because of the following issue.

Consider this text...

```
... and so recognising that he was only human. Rights based discussions can only....
```

Here we have a division between two sentences around the full stop (period). If we fed the full text to our function, then before applying the phraser it would lowercase all the text and remove non-alphabetical tokens, meaning this section of the document would now look like...
```
.... and so recognising that he was only human rights based discussions can only..
```

The phraser, having perhaps seen the phrase 'human rights' a lot, would presume this is another instance of the phrase. By feeding the phraser individual sentences, we maintain the boundaries in the text and avoid false positives on phrases.
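The effect of the sentence boundary can be shown with a short sketch. The `adjacent_pairs` helper and the two toy sentences are made up for illustration and are not part of the phraser itself.

```python
def adjacent_pairs(tokens):
    """Return every pair of neighbouring tokens in a list."""
    return list(zip(tokens, tokens[1:]))

# Two sentences, already lowercased with punctuation removed.
sentences = [
    ['he', 'was', 'only', 'human'],
    ['rights', 'based', 'discussions'],
]

# Option 1: flatten the whole document first, losing the sentence boundary.
flat_tokens = [token for sentence in sentences for token in sentence]
pairs_flat = adjacent_pairs(flat_tokens)

# Option 2: look for pairs inside each sentence separately.
pairs_by_sentence = [pair for sentence in sentences for pair in adjacent_pairs(sentence)]

print(('human', 'rights') in pairs_flat)         # the false positive
print(('human', 'rights') in pairs_by_sentence)  # boundary respected
```

Only the flattened version contains the pair ('human', 'rights'), which is exactly the false positive we want to avoid.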

In [27]:
phrases = []
for i, row in df.iterrows():
    text = row['text_nlp']
    filtered = filter_text(text, phraser=phraser, stopwords=stop_list)
    # joined phrases are the tokens that contain an underscore
    detected = [token for token in filtered if '_' in token]
    phrases.extend(detected)


In [28]:
from collections import Counter
phrase_counts = Counter(phrases)

In [29]:
phrase_counts.most_common(n=20)

[('e_mail', 161),
 ('image_processing', 113),
 ('solar_system', 89),
 ('space_station', 88),
 ('file_format', 83),
 ('lord_jehovah', 78),
 ('space_shuttle', 77),
 ('thank_in_advance', 76),
 ('source_code', 73),
 ('god_elohim', 72),
 ('anonymous_ftp', 71),
 ('ray_tracer', 64),
 ('look_like', 64),
 ('new_york', 63),
 ('gamma_ray', 61),
 ('jpeg_file', 61),
 ('computer_graphics', 59),
 ('year_ago', 57),
 ('image_quality', 55),
 ('jesus_christ', 54)]

In [30]:
df['cleaned_tokens'] = df['text_nlp'].apply(filter_text, stopwords=stop_list, phraser=phraser)

In [31]:
df['cleaned_tokens']

0       [point, set, view, way, believe, eveil, world,...
1       [grey, level, image, mean, item, image, work, ...
2       [annual, phigs, user, group, conference, annua...
3       [respond, jim, article, today, neglect, respon...
4       [place, file, ftp, today, contain, polygonal, ...
                              ...                        
3382    [work, program, display, wireframe, model, use...
3383    [russian, ill, fate, phobos, mission, year_ago...
3384    [oh, gee, billion, dollar, cover, cost, feasab...
3385    [look, software, run, brand, new, know, site, ...
3386    [month, look, job, computer_graphic, software,...
Name: cleaned_tokens, Length: 3387, dtype: object

In [32]:
#doc 2 has a lot of phrases we can see

print(df.loc[2, 'cleaned_tokens'])

['annual', 'phigs', 'user', 'group', 'conference', 'annual', 'phigs', 'user', 'group', 'conference', 'hold', 'march', 'orlando', 'florida', 'conference', 'organize', 'laer', 'design', 'research_center', 'co', 'operation', 'ieee', 'graph', 'attendee', 'come', 'country', 'span', 'tinent', 'good', 'cross_section', 'phigs', 'community', 'represent', 'conference', 'participant', 'include', 'phigs', 'user', 'workstation', 'vendor', 'party', 'phigs', 'implementor', 'dard', 'committee', 'member', 'researcher', 'industry', 'academia', 'opening', 'speaker', 'richard', 'puk', 'challenge', 'phigs', 'user', 'charge', 'phigs', 'participate', 'phigs', 'standardization', 'activity', 'communicate', 'need', 'phigs', 'implementor', 'close', 'speaker', 'andries', 'van_dam', 'describe', 'vision', 'future', 'graphic', 'standard', 'phigs', 'technical', 'paper', 'session', 'conference', 'cover', 'follow', 'topic', 'phigs', 'x', 'application', 'toolkits', 'application', 'issues', 'texture_mapping', 'nurbs', 'p

# Vectorizing
For much of this process we will use tools from a library called [SciKit-Learn](https://scikit-learn.org/stable/). This is a very thorough data science and Machine Learning library in Python with a LOT of different features. Today we'll just use a few of their functions for text analysis.

In [33]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

### Turning Documents into numbers
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/alicequotes.png?raw=true" align="right" width="400">

Vectorizers are designed to turn a set of documents into a grid of values to be treated as data for other analysis techniques that rely on numerical rather than textual data. This is a key part of many text analysis processes, and how you vectorize makes a big difference to how your data will be treated.

We will look at two vectorizers today.

- Count Vectorizer: Numbers that represent simple frequency counts of words
- TFIDF Vectorizer: Numbers that represent the 'significance' of a word based on a formula (more later).

The result of vectorizing a list of documents is a spreadsheet with a row representing each document, and a column representing each unique word used across the entire corpus of texts. In each cell is a value representing the relationship between that document and that word. For example, for a count vectorizer it will be a frequency count.

This means that these arrays tend to have many values of 0, because a word that occurs across maybe two or three documents, may not occur in any of the other hundreds of documents in the corpus.

<img src="https://github.com/Minyall/sc207_materials/blob/master/images/docterm.png?raw=true" align="left" width="600">

## A simple example

In [34]:
test_corpus = ['This is my first sentence',
          'This is the second',
          'I enjoy peas in my sentence, peas peas peas!',
          'This is my first sentence']

In [35]:
# we make a basic count vectorizer without filters
test_vec = CountVectorizer()

In [36]:
# fit and transform to create a matrix
test_matrix = test_vec.fit_transform(test_corpus)

In [37]:
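# and we'll view as a dataframe for ease
# (in scikit-learn 1.0 and later use get_feature_names_out(); get_feature_names() was removed in 1.2)
test_df = pd.DataFrame(test_matrix.toarray(), columns=test_vec.get_feature_names())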
test_df

Unnamed: 0,enjoy,first,in,is,my,peas,second,sentence,the,this
0,0,1,0,1,1,0,0,1,0,1
1,0,0,0,1,0,0,1,0,1,1
2,1,0,1,0,1,4,0,1,0,0
3,0,1,0,1,1,0,0,1,0,1


#### Interpreting the matrix
We can see that each row corresponds to a document, and that each column corresponds to a unique word. The values correspond to the frequency of that word in each document. For example "peas" only occurs in the document at position 2, and it occurs 4 times. The word "sentence" occurs once in all documents except the document at position 1.
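To see where these numbers come from, here is a rough hand-rolled version of the same table in plain Python. It assumes the vectorizer's default behaviour is to lowercase the text and keep only tokens of two or more word characters, which is why "I" does not get a column. The helper name `simple_tokenize` is made up for this sketch.

```python
import re
from collections import Counter

test_corpus = ['This is my first sentence',
               'This is the second',
               'I enjoy peas in my sentence, peas peas peas!',
               'This is my first sentence']

def simple_tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r'\b\w\w+\b', text.lower())

# One Counter of word frequencies per document.
doc_counts = [Counter(simple_tokenize(text)) for text in test_corpus]

# The columns are the sorted unique words across the whole corpus.
vocabulary = sorted(set(word for counts in doc_counts for word in counts))

# Each row holds one document's count for every word in the vocabulary.
matrix = [[counts[word] for word in vocabulary] for counts in doc_counts]

print(vocabulary)
for row in matrix:
    print(row)
```

The rows printed should match the dataframe above: one row per document, one column per word, and a 4 for "peas" in the document at position 2.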

### The dummy function

Scikit-learn vectorizers are designed to do the majority of the heavy lifting for you. Above we were able to feed one unprocessed text and it did the job of tokenizing for us. However, they do not necessarily filter, lemmatise and pre-process in the ways that might be necessary for the kinds of text you are using. The way around this is to specify a custom tokenizer and a custom preprocessor for the vectorizer to use.

Our `dummy_function` just pretends to do something and then returns what was fed into it. This allows us to feed the vectorizer a list of pre-tokenized, pre-prepared documents. The downside is that this knocks out a couple of features of the vectorizer, such as discovering ngrams. You can either handle these with extra preprocessing, or let the vectorizer handle everything and accept some trade-offs, such as no lemmatisation.

In [38]:
def dummy_function(doc):
    return doc

### The vectorizer
Below we define our vectorizer and name it `count_vec`. The arguments we pass to it are...

- min_df: Minimum document frequency. The proportion of documents a token must occur in to be included. Filters out very low frequency words, which is also good for removing spelling mistakes. Here we set it to 0.01 or 1%, which is approximately 34 documents out of 3,387.
- max_df: Maximum document frequency. The proportion of documents a token can occur in before it is excluded. Filters out very high frequency words. If a word occurs in every single document, it does little for us if we want to distinguish the differences between documents. Here we set it to 0.999 or 99.9%, which is approximately 3,384 documents out of 3,387.
- tokenizer: Use to pass in a custom tokenizer function - as described in "The dummy function" above.
- preprocessor: Use to pass in a custom preprocessor function - as described in "The dummy function" above.
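The document frequency filters can be sketched without the vectorizer. This is a simplified illustration using an invented toy corpus and a hypothetical helper, `filter_by_document_frequency`. scikit-learn's own handling of the thresholds may differ in detail.

```python
from collections import Counter

# A toy corpus of pre-tokenized documents (invented for illustration).
docs = [
    ['image', 'file', 'use'],
    ['image', 'use', 'jpeg'],
    ['god', 'use', 'believe'],
    ['space', 'use', 'orbit'],
]

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep words whose share of documents lies between min_df and max_df."""
    n_docs = len(docs)
    # set(doc) ensures a word is counted once per document, however often it repeats.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return sorted(word for word, df in doc_freq.items()
                  if min_df <= df / n_docs <= max_df)

kept = filter_by_document_frequency(docs, min_df=0.5, max_df=0.999)
print(kept)
```

In this toy corpus "use" appears in every document, so it exceeds `max_df` and is dropped. The words that appear in only one document fall below `min_df`. Only "image", which appears in half of the documents, survives.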

In [39]:
count_vec  = CountVectorizer(min_df=0.01, max_df=0.999, tokenizer=dummy_function, preprocessor=dummy_function)

### Fitting and Transforming
In the next step we take our `count_vec` vectorizer and use the `.fit_transform()` method, feeding it our list of tokenized documents. `.fit_transform()` then goes through two stages.

- fit: Examines the documents, learns the vocabulary of the entire corpus, filters out words based on our `max_df` and `min_df` arguments and works out how to score those words. For a count vectorizer this is simple: +1 every time a word occurs in a document. TFIDF is a little more involved, as we'll see. After fitting we can see the vocabulary the vectorizer has retained using `.get_feature_names()` (named `.get_feature_names_out()` in scikit-learn 1.0 and later).

- transform: Takes the list of documents given, and creates a document/term matrix based on what it learned in the fitting stage.

In [40]:
count_matrix = count_vec.fit_transform(df['cleaned_tokens'])

In [41]:
# we'll limit to the first 20 items in the list
count_vec.get_feature_names()[:20] 

['ability',
 'able',
 'absolute',
 'absolutely',
 'accept',
 'access',
 'accord',
 'account',
 'accurate',
 'achieve',
 'act',
 'action',
 'active',
 'activity',
 'actual',
 'actually',
 'ad',
 'add',
 'addition',
 'address']

### The Sparse Matrix

A sparse matrix is a space-efficient way to store data that has a lot of 0s in it. Rather than remembering every 0, it simply remembers the non-zero numbers and where they are, and assumes the rest to be 0. Whilst sparse matrices are space efficient, not all Python functions can understand them, so often we have to transform them into a normal (dense) matrix as below.
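The idea of storing only the non-zero values can be illustrated with a dictionary. This is a toy sketch rather than the Compressed Sparse Row format the vectorizer actually returns, and `dense`, `sparse` and `to_dense` are names invented here.

```python
# A small grid that is mostly zeros.
dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
]

# Store only the non-zero cells, keyed by (row, column).
sparse = {(r, c): value
          for r, row in enumerate(dense)
          for c, value in enumerate(row)
          if value != 0}

def to_dense(sparse, n_rows, n_cols):
    """Rebuild the full grid, assuming every missing cell is 0."""
    return [[sparse.get((r, c), 0) for c in range(n_cols)] for r in range(n_rows)]

print(sparse)
print(to_dense(sparse, 3, 4) == dense)
```

Twelve cells are reduced to three stored values, and the full grid can still be rebuilt when a function needs it.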

In [42]:
count_matrix

<3387x1087 sparse matrix of type '<class 'numpy.int64'>'
	with 97940 stored elements in Compressed Sparse Row format>

In [43]:
count_matrix.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 4, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [44]:
# if we put this into a dataframe, we can set the column names to be the associated word using our .get_feature_names() method.
count_df = pd.DataFrame(count_matrix.toarray(), columns=count_vec.get_feature_names())
count_df

Unnamed: 0,ability,able,absolute,absolutely,accept,access,accord,account,accurate,achieve,...,worth,write,writing,wrong,x,year,year_ago,yes,z,zero
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,1,0,0,4,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3382,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
3383,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3384,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3385,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Top Terms
In this array, the values relate to how many times the relevant word was used in a document. That means that if we were to add up all the values in each column (think top to bottom), and then sorted those values, we could see which word occurred most across all documents.

In [45]:
# Here we take our count_df, sum up across the rows using the axis argument, 
# and then sort the results into descending order (largest number first) and then take the head, the top 5.

count_df.sum(axis='rows').sort_values(ascending=False).head()

use       1916
image     1484
know      1386
think     1252
people    1232
dtype: int64

Not particularly informative because...
- These are simple word frequencies, and the most frequent words are generally quite dull, even with the filters we used at the point of vectorizing.
- This is across ALL documents, so more generic frequent words are going to rise to the top.

However this is still a good approach to getting top word lists so we'll create a function to do it quickly, and to use in our next stage.

In [46]:
def top_terms(df, top_n=5):
    return df.sum().sort_values(ascending=False).head(top_n)

In [47]:
top_terms(count_df, top_n=10)

use       1916
image     1484
know      1386
think     1252
people    1232
god       1217
like      1131
time      1000
good       884
find       863
dtype: int64

## Grouped Top Terms
Using the top words can be a great way to get a sense of what a set of documents is about. Our current dataset is a mix of discussions around atheism, computer graphics, religion and space. This is broadly expressed in our overall top words above, but what about the top words per group? If you have a way to slice up your documents into groups, the top terms can be a great indicator of what each group is about.

In [48]:
# let's label our rows in our count_df by concatenating it with the 'category_label' column from our original dataframe
count_df_labelled = pd.concat([df['category_label'], count_df], axis=1)

In [49]:
count_df_labelled

Unnamed: 0,category_label,ability,able,absolute,absolutely,accept,access,accord,account,accurate,...,worth,write,writing,wrong,x,year,year_ago,yes,z,zero
0,alt.atheism,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,comp.graphics,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,comp.graphics,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,1,0,0,4,0
3,talk.religion.misc,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,comp.graphics,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3382,comp.graphics,0,1,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
3383,sci.space,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3384,sci.space,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3385,comp.graphics,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [50]:
# remember in pandas if we groupby first, and then .apply a function we are essentially...

# splitting the dataframe into four different dataframes (one for each category)
# applying the function to each dataframe separately - so summing together all the values only for rows labelled as alt.atheism, etc.
# combining the results of each application back together into a single object.

count_results = count_df_labelled.groupby('category_label').apply(top_terms, top_n=10)
count_results

category_label               
alt.atheism         god           601
                    people        495
                    think         449
                    know          381
                    believe       376
                    atheist       337
                    religion      306
                    use           280
                    thing         265
                    argument      260
comp.graphics       image        1285
                    use           920
                    file          748
                    jpeg          583
                    program       583
                    format        427
                    software      398
                    system        376
                    color         372
                    available     364
sci.space           space         722
                    use           482
                    launch        414
                    time          346
                    like          344
                    

### Understanding the Groupby result
The object produced is a pandas Series with a two-level index (a MultiIndex).

- The first level is the names of the groups that we split the dataframe into.
- The second level is the names of the columns with the highest scores, i.e. the most frequent words.
- Finally comes the actual series value, which is the frequency for each word.

Note, for example, that the word "use" occurs multiple times with different scores. These are the different frequencies of the word "use" within each group.
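The structure described above can be rebuilt by hand to make the two index levels explicit. This sketch constructs a small Series using four of the values from the output above. The names `toy_results` and `word` are invented for the illustration.

```python
import pandas as pd

# Two index levels: the group name first, then the word.
index = pd.MultiIndex.from_tuples(
    [('comp.graphics', 'image'), ('comp.graphics', 'use'),
     ('sci.space', 'space'), ('sci.space', 'use')],
    names=['category_label', 'word'])

toy_results = pd.Series([1285, 920, 722, 482], index=index)

# Selecting on the first level returns a Series indexed by word.
space_terms = toy_results.loc['sci.space']

# Supplying both levels returns a single value.
use_in_space = toy_results.loc[('sci.space', 'use')]

# xs picks out one word across every group.
use_everywhere = toy_results.xs('use', level='word')

print(space_terms)
print(use_in_space)
print(use_everywhere)
```

The last selection shows "use" with a separate frequency for each group, which is why the same word can appear several times in the grouped result.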

In [51]:
# we can access an individual group in the multi-index series by just using the name of the group as a key
count_results['sci.space']

space        722
use          482
launch       414
time         346
like         344
earth        336
satellite    334
orbit        321
year         316
think        314
dtype: int64

In [52]:
# and we can even access individual items by indexing twice
count_results['sci.space']['earth']


336

## TFIDF Vectorising

Term Frequency Inverse Document Frequency (TFIDF) is an approach to measuring word frequency that can be thought of as giving higher scores to words of greater "significance". 

TFIDF is not a simple word frequency, instead it assigns a word a score based on...

- The frequency of that word in a document
- How many other words are in that document
- How many documents are in the overall corpus
- How many of those documents that word appears in.

#### The formula for those interested
- TFIDF = term frequency * inverse document frequency
- term frequency = the number of occurrences of a term within a single document, sometimes divided by the number of terms in the document.
- inverse document frequency = number of documents within the entire corpus / number of documents the term occurs in. In practice the logarithm of this ratio is usually taken, so that the weight does not grow too quickly.

Remember our test example from earlier?


In [53]:
test_corpus = ['This is my first sentence',
          'This is the second',
          'I enjoy peas in my sentence, peas peas peas!',
          'This is my first sentence']

In [54]:
test_df

Unnamed: 0,enjoy,first,in,is,my,peas,second,sentence,the,this
0,0,1,0,1,1,0,0,1,0,1
1,0,0,0,1,0,0,1,0,1,1
2,1,0,1,0,1,4,0,1,0,0
3,0,1,0,1,1,0,0,1,0,1


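If we use the TFIDF vectorizer from scikit-learn, we can transform these numbers based on a variant of the formula above. By default scikit-learn uses a smoothed, logarithmic inverse document frequency and then rescales each row to unit length.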

In [55]:
test_tfidf = TfidfVectorizer()

In [56]:
test_tfidf_matrix = test_tfidf.fit_transform(test_corpus)

In [57]:
pd.DataFrame(test_tfidf_matrix.toarray(), columns=test_tfidf.get_feature_names())

Unnamed: 0,enjoy,first,in,is,my,peas,second,sentence,the,this
0,0.0,0.525464,0.0,0.425408,0.425408,0.0,0.0,0.425408,0.0,0.425408
1,0.0,0.0,0.0,0.380444,0.0,0.0,0.596039,0.0,0.596039,0.380444
2,0.230542,0.0,0.230542,0.0,0.147152,0.922168,0.0,0.147152,0.0,0.0
3,0.0,0.525464,0.0,0.425408,0.425408,0.0,0.0,0.425408,0.0,0.425408


<img src="https://github.com/Minyall/sc207_materials/blob/master/images/peas.jpg?raw=true" align="right" width="300">
We can see the weighting in these figures, which range from 0 to 1.

- 'Peas' has a high weighting in doc 2 because it is frequent in doc 2, but infrequent elsewhere.
- 'Sentence' has the same weighting in docs 0 and 3, but a lower one in doc 2 despite occurring once in all three, because in doc 2 it is competing against more terms, including four occurrences of 'peas'.
- 'Second' has an above average score because it is only competing against a few other words, and it doesn't occur anywhere else in the corpus.
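The scores in the table can be reproduced by hand. The sketch below assumes the scikit-learn defaults: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and each row rescaled to unit (L2) length. The helper names `tokenize`, `smoothed_idf` and `tfidf_row` are invented for this example.

```python
import math
import re
from collections import Counter

test_corpus = ['This is my first sentence',
               'This is the second',
               'I enjoy peas in my sentence, peas peas peas!',
               'This is my first sentence']

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r'\b\w\w+\b', text.lower())

docs = [Counter(tokenize(text)) for text in test_corpus]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for counts in docs for word in counts)

def smoothed_idf(word):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf_row(counts):
    # Raw term count times idf, then scale the row to unit (L2) length.
    raw = {word: count * smoothed_idf(word) for word, count in counts.items()}
    length = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {word: value / length for word, value in raw.items()}

rows = [tfidf_row(counts) for counts in docs]
print(round(rows[2]['peas'], 6))
print(round(rows[0]['first'], 6))
print(round(rows[2]['sentence'], 6))
```

The printed values should agree with the corresponding cells in the table above, up to rounding. The unit-length rescaling is what keeps every score between 0 and 1, and it is why "sentence" scores lower in doc 2, where the four occurrences of "peas" dominate the row.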

TFIDF highlights "significant" words for two reasons...

- It gives higher scores to words that occur frequently within a single document, relative to the number of other words in the document.
    - In a document with only 10 words, where 8 of them are "Peas", you would imagine peas to be a word that indicates what that document is about.
    - In a document where "Peas" occurs 8 times, but there are 10,000 other words, then suddenly Peas doesn't look so significant.


- It drags down the scores of words if they exist in many of the documents in the corpus. This gives a sense of context to the significance of words.
    - If you have a corpus about growing Peas, and every document mentions them, then no matter how many times the word occurs in an individual document, it is probably not very indicative of what that particular Pea-focussed document is about, in the broader context of Pea-focussed documents.
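Both effects can be put into numbers with a small sketch. The corpus size of 1,000 documents and the two helper functions are invented for illustration. The idf formula is the smoothed variant that scikit-learn uses by default.

```python
import math

def relative_frequency(term_count, total_words):
    """Share of a document's words taken up by one term."""
    return term_count / total_words

def smoothed_idf(n_docs, docs_with_term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + docs_with_term)) + 1

# Effect 1: the same 8 occurrences matter less in a longer document.
short_doc = relative_frequency(8, 10)
long_doc = relative_frequency(8, 10008)

# Effect 2: a term found in every document gets the lowest possible idf.
rare_term = smoothed_idf(n_docs=1000, docs_with_term=5)
common_term = smoothed_idf(n_docs=1000, docs_with_term=1000)

print(short_doc, long_doc)
print(rare_term, common_term)
```

In the short document "Peas" accounts for 80% of the words, while in the long one it accounts for less than 0.1%. Likewise, a term found in only 5 of 1,000 documents gets a much larger idf weight than one found in all of them.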

Peas photo by <a href="//commons.wikimedia.org/wiki/User:Atomicbre" title="User:Atomicbre">Bill Ebbesen</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=15727721">Link</a>

## Applying TFIDF to our News Data

In [58]:

# min_df/max_df drop terms appearing in under 1% or over 99.9% of documents
tfidf_vectorizer = TfidfVectorizer(min_df=0.01, max_df=0.999,
                                   preprocessor=dummy_function, tokenizer=dummy_function)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['cleaned_tokens'])

In [59]:
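# get_feature_names_out() requires scikit-learn >= 1.0 (older releases used get_feature_names())
tfidf_scores = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())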
tfidf_scores

Unnamed: 0,ability,able,absolute,absolutely,accept,access,accord,account,accurate,achieve,...,worth,write,writing,wrong,x,year,year_ago,yes,z,zero
0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0
1,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.103762,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0
2,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.024820,0.0,0.0,0.033153,0.026435,0.000000,0.0,0.156742,0.0
3,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0
4,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.319160,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3382,0.0,0.108671,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.091771,0.0,0.0,0.122585,0.000000,0.000000,0.0,0.000000,0.0
3383,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.350791,0.0,0.000000,0.0
3384,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0
3385,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0


Apart from seeing that some form of weighting is occurring (the values are less than 1, so we know they're not just frequency counts), the above is pretty uninterpretable as is.
Let's take a look at the top terms...

In [60]:
# across the whole corpus

top_terms(tfidf_scores)

think    95.041990
know     94.863873
use      92.717395
like     78.224450
god      77.616942
dtype: float64

In [61]:
# again not great, but let's try across groups

In [62]:
# we label our rows with categories
tfidf_scores_labelled = pd.concat([df['category_label'], tfidf_scores], axis=1)

In [63]:
tfidf_results = tfidf_scores_labelled.groupby('category_label').apply(top_terms, top_n=10)
tfidf_results

category_label               
alt.atheism         god          40.127134
                    think        32.036445
                    people       29.700357
                    atheist      26.767684
                    religion     25.425237
                    believe      25.377210
                    know         22.136654
                    post         21.408256
                    claim        18.397012
                    thing        17.902507
comp.graphics       file         44.677253
                    image        43.719291
                    use          41.937761
                    program      33.746890
                    look         31.641670
                    know         31.019904
                    format       29.613754
                    graphic      27.905464
                    thank        26.269853
                    need         26.033888
sci.space           space        47.051557
                    like         27.083155
                    thin

We can compare TFIDF to simple frequency counts like so:

In [64]:
tfidf_results['sci.space']

space        47.051557
like         27.083155
think        27.073104
orbit        26.758225
use          26.649382
launch       26.450035
know         23.324493
moon         22.931924
satellite    21.176212
time         20.691494
dtype: float64

In [65]:
count_results['sci.space']

space        722
use          482
launch       414
time         346
like         344
earth        336
satellite    334
orbit        321
year         316
think        314
dtype: int64
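
To make the differences between the two lists easier to spot, the sketch below rebuilds both outputs as small Series (values copied from the results above) and lines up the ranks. The names `tfidf_space`, `count_space` and `comparison` are made up for this illustration; in the notebook itself you would use `tfidf_results['sci.space']` and `count_results['sci.space']` directly.

```python
import pandas as pd

# Values copied from the two sci.space outputs above
tfidf_space = pd.Series({
    "space": 47.051557, "like": 27.083155, "think": 27.073104,
    "orbit": 26.758225, "use": 26.649382, "launch": 26.450035,
    "know": 23.324493, "moon": 22.931924, "satellite": 21.176212,
    "time": 20.691494,
})
count_space = pd.Series({
    "space": 722, "use": 482, "launch": 414, "time": 346, "like": 344,
    "earth": 336, "satellite": 334, "orbit": 321, "year": 316, "think": 314,
})

# Rank 1 = highest score; terms missing from one top 10 show up as NaN
comparison = pd.DataFrame({
    "tfidf_rank": tfidf_space.rank(ascending=False),
    "count_rank": count_space.rank(ascending=False),
})
print(comparison.sort_values("tfidf_rank"))

# Terms that make only one of the two top-10 lists
only_tfidf = sorted(set(tfidf_space.index) - set(count_space.index))
only_count = sorted(set(count_space.index) - set(tfidf_space.index))
print("Only in TFIDF top 10:", only_tfidf)
print("Only in count top 10:", only_count)
```

The two lists overlap heavily, but the ordering shifts and a couple of terms swap in and out, which is the effect of the weighting.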

We can even examine individual articles alongside their most significant words, to get a sense for ourselves of how well those words fit the original document.

In [66]:
article = 4
print(tfidf_scores.loc[article].sort_values(ascending=False).head(10))
print(df.loc[article,'text'])

ftp            0.437019
datum          0.371959
file           0.335330
polygon        0.330666
z              0.319160
contain        0.253432
workstation    0.163045
following      0.159580
normal         0.157069
resolution     0.155888
Name: 4, dtype: float64

Well, I am placing a file at my ftp today that contains several
polygonal descriptions of a head, face, skull, vase, etc. The format
of the files is a list of vertices, normals, and triangles. There are
various resolutions and the name of the data file includes the number
of polygons, eg. phred.1.3k.vbl contains 1300 polygons.


In order to get the data via ftp do the following:

	1) ftp taurus.cs.nps.navy.mil
	2) login as anonymous, guest as the password
	3) cd pub/dabro
	4) binary
	5) get cyber.tar.Z

Once you get the data onto your workstation:

	1) uncompress data.tar.Z
	2) tar xvof data.tar

If you have any questions, please let me know.

george dabro
dabro@taurus.cs.nps.navy.mil
-- 
george dabrowski
Cyberware Labs


## Summary

Pre-processing text and applying either simple word-frequency vectorisation or the more complex TFIDF vectorisation allows us to distinguish groups of documents from one another by the words they use. How words are used, measured either by raw frequency or by a more nuanced score, can indicate to the computer how similar or dissimilar documents are, which in turn can be used to find themes and patterns across a corpus of texts.

We can use these techniques to allow us to find groups of documents, or patterns across documents, even if we don't have any labels telling us which documents are different.
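
As a small taste of what "similarity" can mean here, the sketch below compares three made-up documents using cosine similarity on their TFIDF vectors. Cosine similarity is just one common choice and is an assumption on my part, not necessarily the measure used later; the `toy_docs` texts are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical documents: two about space, one about graphics
toy_docs = [
    "rocket launch orbit moon",
    "satellite orbit launch",
    "image file format polygon",
]

toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

# similarity[i, j] compares document i with document j
# (values near 1 mean very similar, 0 means no shared terms)
similarity = cosine_similarity(toy_tfidf)
print(similarity.round(2))
```

The two space documents share terms and so score above zero against each other, while neither shares any vocabulary with the graphics document. No labels were needed to see that.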

On to part 2!