# Session 12 - Text Mining
## EXTRA
### TFIDF Keywords

One interesting way to find useful keywords for a document, corpus, or group of documents in a corpus is to use TFIDF weighting to identify the most 'significant' terms. See below to learn how.

In [None]:
! conda install -c conda-forge scikit-learn --yes

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


In [None]:
df = pd.read_csv('sample_news_large_phrased.csv', index_col='index')

In [None]:
df.head()

In [None]:
# converting this specific data's tokens column back to a list
df['tokens'] = df['tokens'].apply(lambda token_string: token_string.split('|*|'))

In [None]:
df.head()

## Top Keywords
Term Frequency Inverse Document Frequency or TFIDF is a scoring system that...
- Gives a word a higher score if it occurs frequently in a document...
- But also weights that score depending on how often it occurs across the corpus.
- Words that occur often in a document, and rarely across the corpus are given higher scores.
- Words that occur often in a document and often across the corpus are given lower scores.

The result is a scoring system that doesn't just highlight 'frequent' words, but instead highlights significant words.
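
To make the scoring concrete, here is a minimal sketch that works the numbers out by hand using only the standard library. The three-document corpus is invented for illustration. The formula (raw term count multiplied by `ln((1 + n) / (1 + df)) + 1`) is the smoothed weighting that scikit-learn's `TfidfVectorizer` applies by default before it normalises each document; other TFIDF variants exist, so treat this as one possible definition rather than the only one.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenised
corpus = [
    ["peas", "peas", "peas", "garden"],
    ["garden", "soil", "water"],
    ["garden", "water", "sun"],
]

n_docs = len(corpus)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in corpus for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency: rarer terms get a larger value
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    # Raw count of each term in this document, weighted by its idf
    counts = Counter(doc)
    return {term: count * idf(term) for term, count in counts.items()}

scores = tfidf(corpus[0])
for term, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{term}: {score:.3f}")
```

In the first document "peas" is both frequent and unique to that document, so it outscores "garden", which appears in every document and therefore receives the lowest possible idf of 1.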

Here we'll use scikit-learn's `TfidfVectorizer` to generate these word scores.

In [None]:
# Normally the Tfidf Vectorizer would do tokenization and preprocessing for us. 
# As we're passing it pre-processed tokens we can use a dummy function, which simply pretends to
# process the text

def dummy(doc):
    return doc

In [None]:
# It is recommended to filter out the extreme ends of the vocabulary.
# Here we drop any word that appears in fewer than 5 documents (min_df=5) and any word that
# appears in more than 50% of the documents in the corpus (max_df=0.5).
# These thresholds can be tweaked depending on how successful your model is.

tfidf_model = TfidfVectorizer(analyzer=dummy, min_df=5, max_df=0.5)
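
As a rough illustration of what `min_df` and `max_df` do, the sketch below applies the same kind of document-frequency filter by hand. The `filter_vocab` helper and the ten-document corpus are hypothetical and exist only for this example; it assumes an integer `min_df` is a document count and a float `max_df` is a share of the corpus, which is how the two values are used in the cell above.

```python
from collections import Counter

def filter_vocab(tokenised_docs, min_df=5, max_df=0.5):
    """Keep terms found in at least `min_df` documents and in at most
    a `max_df` share of all documents (hypothetical helper)."""
    n_docs = len(tokenised_docs)
    # Count each term once per document, however often it repeats inside it
    doc_freq = Counter(term for doc in tokenised_docs for term in set(doc))
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )

# Invented corpus of 10 documents
docs = [["the", "peas"]] * 5 + [["the", "beans"]] * 3 + [["the", "rare"]] * 2

print(filter_vocab(docs, min_df=3, max_df=0.5))
```

Here "the" is dropped because it appears in every document, and "rare" is dropped because it appears in only two, leaving "beans" and "peas".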

In [None]:
# We train our model by fitting it to our entire corpus of tokens
model = tfidf_model.fit(df['tokens'])


### Term Document Matrix
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/DTM.png?raw=true" width="300">

We're going to use our model to build a "document-term matrix" (a term document matrix turned on its side). Think of it essentially as a spreadsheet where each row represents a document, and each column a word. The value for each cell is the weighted score for that word, in that document.

By default, scikit-learn creates a space-efficient 'sparse' matrix, so we use `todense()` to turn it into a full matrix.
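
The idea behind a sparse matrix can be sketched in plain Python: only the non-zero cells are stored, and everything else is assumed to be zero. This is an illustration of the concept with invented values, not how scikit-learn stores its data internally (it relies on SciPy's sparse formats); the `to_dense` helper is hypothetical.

```python
# A small full ("dense") matrix in which most cells are zero (invented values)
dense = [
    [0.0, 0.0, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.9],
]

# Sparse form: store only the non-zero cells, keyed by (row, column)
sparse = {
    (r, c): value
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0.0
}

def to_dense(sparse_cells, n_rows, n_cols):
    """Rebuild the full matrix, filling every unstored cell with zero."""
    full = [[0.0] * n_cols for _ in range(n_rows)]
    for (r, c), value in sparse_cells.items():
        full[r][c] = value
    return full

print(len(sparse), "stored cells instead of", len(dense) * len(dense[0]))
```

With a real vocabulary of thousands of words, most documents use only a small fraction of them, so the saving is far larger than in this tiny example.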

In [None]:
matrix = tfidf_model.transform(df['tokens']).todense()
matrix

In [None]:
print(matrix) # we can see the matrix layout
print(matrix.shape) # the shape of the matrix, i.e. (number of rows, number of columns)
print(len(tfidf_model.get_feature_names_out())) # the number of columns should match our vocabulary size
print(len(df)) # the number of rows should match our document count

In [None]:
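# Let's put this matrix into a dataframe and assign the column names the correct words.
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.

doc_term_matrix = pd.DataFrame(matrix, columns=tfidf_model.get_feature_names_out())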

In [None]:
doc_term_matrix

In [None]:
# top terms for document 0
doc_term_matrix.loc[0].sort_values(ascending=False).head(10)

In [None]:
# top terms overall
doc_term_matrix.mean().sort_values(ascending=False).head(10)

In [None]:
# top terms per query
# Here we create our groups by grouping the rows of our original df by the query value
# when we iterate over each group we ask which rows in the df are part of this group
# these rows will correspond to the same rows in our doc_term_matrix so we select the 
# appropriate rows using .loc, get the mean score for each word and then sort for the top 10

for query, group in df.groupby('query'):
    print(f"****{query}****")
    group_rows = group.index
    
    print(doc_term_matrix.loc[group_rows].mean().sort_values(ascending=False).head(10))
    print()

### Toy Example of TFIDF
Sometimes it is easier to see what is going on under the hood by working through a smaller example.

In [None]:
test_corpus = ['This is my first sentence',
               'This is the second',
               'I enjoy peas in my sentence peas peas peas',
               'This is my first sentence']

tokens = [doc.split() for doc in test_corpus]
tokens

In [None]:
tfidf_model = TfidfVectorizer(analyzer=dummy)

In [None]:
tfidf_model.fit(tokens)

In [None]:
matrix = tfidf_model.transform(tokens).todense()

In [None]:
doc_term = pd.DataFrame(matrix, columns=tfidf_model.get_feature_names_out())
doc_term

<img src="https://github.com/Minyall/sc207_materials/blob/master/images/peas.jpg?raw=true" align="right" width="300">
The scores in this table range from 0 to 1, and comparing them shows how the weighting works.

- 'Peas' has a high weighting in doc 2 because it is frequent in doc 2, but infrequent elsewhere.
- 'Sentence' has the same weighting in docs 0 and 3, but a lower one in doc 2 despite occurring once in all three, because in doc 2 it is competing against more terms.
- 'Second' has an above average score because it is only competing against a few other words, and it doesn't occur anywhere else in the corpus.
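
These observations can be checked by hand. The sketch below recomputes the weights for the same toy corpus using only the standard library, assuming the vectorizer's default settings: raw counts, the smoothed idf `ln((1 + n) / (1 + df)) + 1`, and scaling each document so that its squared weights sum to 1. The `weights` helper is hypothetical, and the result is meant to mirror the table above rather than replace it.

```python
import math
from collections import Counter

test_corpus = ['This is my first sentence',
               'This is the second',
               'I enjoy peas in my sentence peas peas peas',
               'This is my first sentence']
tokens = [doc.split() for doc in test_corpus]

n_docs = len(tokens)
# In how many documents does each term appear?
doc_freq = Counter(term for doc in tokens for term in set(doc))

def weights(doc):
    # Raw count multiplied by the smoothed idf
    raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
           for t, c in Counter(doc).items()}
    # Scale so the squared weights sum to 1 (L2 normalisation)
    length = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / length for t, w in raw.items()}

by_hand = [weights(doc) for doc in tokens]
for i, doc_weights in enumerate(by_hand):
    print(i, {t: round(w, 3) for t, w in doc_weights.items()})
```

The scaling step is what makes 'sentence' score lower in doc 2: the large raw weight for 'peas' increases the length of that document's vector, which shrinks every other term's share.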

TFIDF highlights "significant" words for two reasons...

- It gives higher scores to words that occur frequently within a single document, relative to the number of other words in that document.
    - In a document with only 10 words, 8 of which are "peas", you would expect "peas" to indicate what that document is about.
    - In a document where "peas" occurs 8 times among 10,000 other words, "peas" no longer looks so significant.
- It drags down the scores of words that appear in many of the documents in the corpus. This gives a sense of context to the significance of words.
    - If you have a corpus about growing peas, and every document mentions them, then no matter how many times the word occurs in an individual document, it probably does not indicate what that particular document is about within the broader set of pea-focused documents.

Peas photo by <a href="//commons.wikimedia.org/wiki/User:Atomicbre" title="User:Atomicbre">Bill Ebbesen</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=15727721">Link</a>