## Text Mining

![miners](img/text-miners.jpeg)

## Part 1

### Situation:

![greg](img/thinking.jpeg)

Last week we helped Greg build a model to sort through articles, but we rushed through the pre-processing. In this lesson we will go through it step by step.

### Discussion

- What type of problem is this?
- What are we trying to do?
- What steps do you think might be involved? (big picture steps)

![talk](https://media.giphy.com/media/l2SpQRuCQzY1RXHqM/giphy.gif)

### **Goal**: to internalize the steps, challenges, and methodology of text mining
- explore text analysis by hand
- apply text mining steps in Jupyter with the Python library NLTK
- classify documents correctly

#### How is text mining different? What is text?

- Order the words from **SMALLEST** to **LARGEST** units
 - character
 - corpora
 - sentence
 - word
 - corpus
 - paragraph
 - document

(after it is all organized)

- Any disagreements about the terms used?

### Steps with articles:

https://github.com/aapeebles/text_examples 

1. Create a list of words
2. Tally how many times each word is used
3. Order the words by frequency
4. Try to find similar articles in the group using only your frequencies
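
Once you have done it by hand, the first three steps can be sketched in a few lines of Python. This is only a minimal illustration: the `article` string is a made-up stand-in for one of your articles, and splitting on whitespace is the simplest possible way to get a word list.

```python
from collections import Counter

# Hypothetical stand-in for one article; swap in your own text.
article = "the deal is a debt deal and the debt deal is big"

# 1. Create the list of words (naive whitespace split).
words = article.split()

# 2. Tally how many times each word is used.
tally = Counter(words)

# 3. Order the words by frequency, most frequent first.
ordered = tally.most_common()

print(ordered[:3])
```

Step 4 is left to you: compare the top of each article's ordered list and see which articles share their most frequent words.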


Yes, the list might be long.
![list](https://media.giphy.com/media/YLHwkqayc1j7a/giphy.gif)

DISCUSS!

### Bag of Words Steps

<img style="float: left" src="./img/bag_of_word.jpg" width="200">

![step by step](https://i.gifer.com/VxbJ.gif)

1. Make everything lower case
2. Remove punctuation, numbers, symbols, etc.
3. Remove stop words, perhaps developing a custom stop word list
4. Stemming/Lemmatization
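
The four steps can be sketched in plain Python before reaching for a library. Everything here is an assumption made for illustration: `STOP_WORDS` is a tiny made-up list, `strip_suffix` is a crude stand-in for a real stemmer, and the function name `bag_of_words_tokens` is hypothetical. In this sketch tokenization happens after the punctuation is removed, which is one possible ordering, not the only one.

```python
import re

# A tiny, hypothetical stop-word list for illustration only;
# a real list (for example NLTK's) is much longer.
STOP_WORDS = {"the", "a", "to", "of", "by", "has", "he", "on", "be"}

def strip_suffix(word):
    """Crude stand-in for stemming: trim a few common endings."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words_tokens(text):
    text = text.lower()                                   # 1. make all lower case
    text = re.sub(r"[^a-z\s]", " ", text)                 # 2. drop punctuation, numbers, symbols
    tokens = text.split()                                 #    tokenize on whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 3. remove stop words
    return [strip_suffix(t) for t in tokens]              # 4. (very rough) stemming

print(bag_of_words_tokens("Chancellor Gordon Brown has said he hopes to announce a deal."))
```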


But what about tokenization? When is the best time to tokenize?

## New library!

While we have seen language processing tools in Spark, NLTK is its own Python library. And of course, it has its own [documentation](https://www.nltk.org/).

In [1]:

data_string=    "Tsunami debt deal to be announced \n \
\
Chancellor Gordon Brown has said he hopes to announce a deal to suspend debt interest repayments by tsunami-hit nations later on Friday.\n \
\
The agreement by the G8 group of wealthy nations would save affected countries £3bn pounds a year, he said. The deal is thought to have been hammered out on Thursday night after Japan, one of the biggest creditor nations, finally signed up to it. Mr Brown first proposed the idea earlier this week.\n \
\
G8 ministers are also believed to have agreed to instruct the World Bank and the International Monetary Fund to complete a country by country analysis of the reconstruction problems faced by all states hit by the disaster. Mr Brown has been locked in talks with finance ministers of the G8, which Britain now chairs. Germany also proposed a freeze and Canada has begun its own moratorium. The expected deal comes as Foreign Secretary Jack Straw said the number of Britons dead or missing in the disaster have reached 440."

In [2]:
data_string

'Tsunami debt deal to be announced \n Chancellor Gordon Brown has said he hopes to announce a deal to suspend debt interest repayments by tsunami-hit nations later on Friday.\n The agreement by the G8 group of wealthy nations would save affected countries £3bn pounds a year, he said. The deal is thought to have been hammered out on Thursday night after Japan, one of the biggest creditor nations, finally signed up to it. Mr Brown first proposed the idea earlier this week.\n G8 ministers are also believed to have agreed to instruct the World Bank and the International Monetary Fund to complete a country by country analysis of the reconstruction problems faced by all states hit by the disaster. Mr Brown has been locked in talks with finance ministers of the G8, which Britain now chairs. Germany also proposed a freeze and Canada has begun its own moratorium. The expected deal comes as Foreign Secretary Jack Straw said the number of Britons dead or missing in the disaster have reached 440.'

In [3]:
data_string.split()

['Tsunami',
 'debt',
 'deal',
 'to',
 'be',
 'announced',
 'Chancellor',
 'Gordon',
 'Brown',
 'has',
 'said',
 'he',
 'hopes',
 'to',
 'announce',
 'a',
 'deal',
 'to',
 'suspend',
 'debt',
 'interest',
 'repayments',
 'by',
 'tsunami-hit',
 'nations',
 'later',
 'on',
 'Friday.',
 'The',
 'agreement',
 'by',
 'the',
 'G8',
 'group',
 'of',
 'wealthy',
 'nations',
 'would',
 'save',
 'affected',
 'countries',
 '£3bn',
 'pounds',
 'a',
 'year,',
 'he',
 'said.',
 'The',
 'deal',
 'is',
 'thought',
 'to',
 'have',
 'been',
 'hammered',
 'out',
 'on',
 'Thursday',
 'night',
 'after',
 'Japan,',
 'one',
 'of',
 'the',
 'biggest',
 'creditor',
 'nations,',
 'finally',
 'signed',
 'up',
 'to',
 'it.',
 'Mr',
 'Brown',
 'first',
 'proposed',
 'the',
 'idea',
 'earlier',
 'this',
 'week.',
 'G8',
 'ministers',
 'are',
 'also',
 'believed',
 'to',
 'have',
 'agreed',
 'to',
 'instruct',
 'the',
 'World',
 'Bank',
 'and',
 'the',
 'International',
 'Monetary',
 'Fund',
 'to',
 'complete',
 'a',

In [4]:
set(data_string.split())

{'440.',
 'Bank',
 'Britain',
 'Britons',
 'Brown',
 'Canada',
 'Chancellor',
 'Foreign',
 'Friday.',
 'Fund',
 'G8',
 'G8,',
 'Germany',
 'Gordon',
 'International',
 'Jack',
 'Japan,',
 'Monetary',
 'Mr',
 'Secretary',
 'Straw',
 'The',
 'Thursday',
 'Tsunami',
 'World',
 'a',
 'affected',
 'after',
 'agreed',
 'agreement',
 'all',
 'also',
 'analysis',
 'and',
 'announce',
 'announced',
 'are',
 'as',
 'be',
 'been',
 'begun',
 'believed',
 'biggest',
 'by',
 'chairs.',
 'comes',
 'complete',
 'countries',
 'country',
 'creditor',
 'dead',
 'deal',
 'debt',
 'disaster',
 'disaster.',
 'earlier',
 'expected',
 'faced',
 'finally',
 'finance',
 'first',
 'freeze',
 'group',
 'hammered',
 'has',
 'have',
 'he',
 'hit',
 'hopes',
 'idea',
 'in',
 'instruct',
 'interest',
 'is',
 'it.',
 'its',
 'later',
 'locked',
 'ministers',
 'missing',
 'moratorium.',
 'nations',
 'nations,',
 'night',
 'now',
 'number',
 'of',
 'on',
 'one',
 'or',
 'out',
 'own',
 'pounds',
 'problems',
 'proposed

In [5]:
from __future__ import print_function  # must be the first statement in the cell
import nltk
import sklearn

In [6]:
#nltk.download() # for when you are bringing in files from Gutenberg, etc.

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [7]:
from nltk.collocations import *
from nltk import FreqDist, word_tokenize
import string, re
import urllib.request  # the request submodule has to be imported explicitly

In [8]:
#print(tokens[:100])

In [9]:
metamorph = urllib.request.urlopen('http://www.gutenberg.org/cache/epub/5200/pg5200.txt').read()
#print(x.read())


In [10]:
metamorph_st = metamorph.decode("utf-8") 

Load your article here

In [11]:
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
metamorph_tokens_raw = nltk.regexp_tokenize(metamorph_st, pattern)
print(metamorph_tokens_raw[:100])

['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Metamorphosis', 'by', 'Franz', 'Kafka', 'Translated', 'by', 'David', 'Wyllie', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', 'You', 'may', 'copy', 'it', 'give', 'it', 'away', 'or', 're', 'use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at', 'www', 'gutenberg', 'net', 'This', 'is', 'a', 'COPYRIGHTED', 'Project', 'Gutenberg', 'eBook', 'Details', 'Below', 'Please', 'follow', 'the', 'copyright', 'guidelines', 'in', 'this', 'file', 'Title', 'Metamorphosis', 'Author', 'Franz', 'Kafka', 'Translator', 'David', 'Wyllie', 'Release', 'Date', 'August', 'EBook', 'First', 'posted', 'May', 'Last', 'updated', 'May', 'Language', 'English', 'START', 'OF', 'THIS']
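
To see what the pattern actually keeps and drops, it can help to try it on a single sentence. The sketch below uses the standard library's `re.findall` as a stand-in for `nltk.regexp_tokenize` (assumed to behave the same way for this pattern), and the sample sentence is made up.

```python
import re

# Same pattern as the cell above: a run of letters, optionally followed by
# an apostrophe and more lowercase letters (so "Gregor's" stays one token).
pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"

sample = "Gregor's re-use of 3 rooms wasn't planned."  # hypothetical sentence
tokens = re.findall(pattern, sample)
print(tokens)
```

Notice that the digit disappears, the hyphenated word is split in two, and the contractions survive as single tokens. That matches the `'re', 'use'` pair visible in the output above.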


In [12]:
metamorph_tokens = [i.lower() for i in metamorph_tokens_raw]
print(metamorph_tokens[:100])


['the', 'project', 'gutenberg', 'ebook', 'of', 'metamorphosis', 'by', 'franz', 'kafka', 'translated', 'by', 'david', 'wyllie', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', 'you', 'may', 'copy', 'it', 'give', 'it', 'away', 'or', 're', 'use', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'www', 'gutenberg', 'net', 'this', 'is', 'a', 'copyrighted', 'project', 'gutenberg', 'ebook', 'details', 'below', 'please', 'follow', 'the', 'copyright', 'guidelines', 'in', 'this', 'file', 'title', 'metamorphosis', 'author', 'franz', 'kafka', 'translator', 'david', 'wyllie', 'release', 'date', 'august', 'ebook', 'first', 'posted', 'may', 'last', 'updated', 'may', 'language', 'english', 'start', 'of', 'this']


In [17]:
from nltk.corpus import stopwords
# Needs the stopwords corpus: run nltk.download('stopwords') once if this raises LookupError
stopwords.words("english")

LookupError: 
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')
  Searched in:
    - '/Users/Mango/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/Users/Mango/anaconda3/nltk_data'
    - '/Users/Mango/anaconda3/share/nltk_data'
    - '/Users/Mango/anaconda3/lib/nltk_data'
**********************************************************************


In [14]:
stop_words = set(stopwords.words('english'))
metamorph_tokens_stopped = [w for w in metamorph_tokens if w not in stop_words]
print(metamorph_tokens_stopped[:100])

LookupError: 
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')
  Searched in:
    - '/Users/Mango/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/Users/Mango/anaconda3/nltk_data'
    - '/Users/Mango/anaconda3/share/nltk_data'
    - '/Users/Mango/anaconda3/lib/nltk_data'
**********************************************************************


## Stemming / Lemmatization

### Stemming - Porter Stemmer 
![porter](https://cdn.homebrewersassociation.org/wp-content/uploads/Baltic_Porter_Feature-600x800.jpg)

In [52]:
from nltk.stem import PorterStemmer, SnowballStemmer
stemmer = PorterStemmer()
plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
          'plotted']

In [53]:
singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))

caress fli die mule deni die agre own humbl size meet state siez item sensat tradit refer colon plot


### Stemming - Snowball Stemmer
![snowball](https://localtvwiti.files.wordpress.com/2018/08/gettyimages-936380496.jpg?quality=85&strip=all)

In [None]:
print(" ".join(SnowballStemmer.languages))

In [None]:
stemmer = SnowballStemmer("english")
print(stemmer.stem("running"))

### Porter vs Snowball

In [None]:
print(SnowballStemmer("english").stem("generously"))
print(SnowballStemmer("porter").stem("generously"))

### Use Snowball on Metamorphosis

In [55]:
meta_stemmed = [stemmer.stem(word) for word in metamorph_tokens_stopped]
print(meta_stemmed[:100])

NameError: name 'metamorph_tokens_stopped' is not defined

### Lemmatization

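Lemmatization maps each word to its dictionary form (lemma) using the WordNet lexical database, so the `wordnet` resource has to be downloaded first.

`from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()`


Challenge of lemmatization: the lemmatizer needs to know each word's part of speech. By default it treats every word as a noun, so verbs have to be flagged explicitly:

`wordnet_lemmatizer.lemmatize(word, pos="v")`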

## Here is a short list of additional considerations when cleaning text:

- Handling large documents and large collections of text documents that do not fit into memory.
- Extracting text from markup like HTML, PDF, or other structured document formats.
- Transliteration of characters from other languages into English.
- Decoding text into Unicode and normalizing it to a consistent form (for example, UTF-8 encoded and NFC normalized).
- Handling of domain specific words, phrases, and acronyms.
- Handling or removing numbers, such as dates and amounts.
- Locating and correcting common typos and misspellings.
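
As a small illustration of the Unicode and transliteration points, here is a sketch using only the standard library. The helper name `strip_accents` is hypothetical, and dropping accents is just one rough approximation of transliteration; it does nothing for scripts that are not Latin based.

```python
import unicodedata

def strip_accents(text):
    # Decompose characters (NFKD) so accents become separate combining marks,
    # then drop those marks to leave the base letters.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("café naïve"))
```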

### Document statistics

Average word length in document

In [56]:
float(sum(map(len, meta_stemmed))) / len(meta_stemmed)

NameError: name 'meta_stemmed' is not defined

Number of words in document

In [None]:
len(meta_stemmed)

## What you've all been waiting for 

![big deal](http://reddebtedstepchild.com/wp-content/uploads/2013/04/Big-deal-gif.gif)


## Frequency distributions

In [None]:
meta_freqdist = FreqDist(meta_stemmed)

In [None]:
meta_freqdist.most_common(50)

In [None]:
meta_freqdist.plot(30, cumulative=False)

**TASK**: Create word frequency plot for your article

Question: Should any more stop words be added to the list, given your plot results?

In [54]:
meta_finder = BigramCollocationFinder.from_words(meta_stemmed)

NameError: name 'meta_stemmed' is not defined
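
The idea behind a bigram finder is to look at pairs of neighbouring tokens rather than single words. The sketch below shows that idea with plain Python counting. It is not how `BigramCollocationFinder` scores collocations, and the token list is a made-up stand-in for `meta_stemmed`.

```python
from collections import Counter

# Hypothetical stemmed tokens; in the notebook this would be meta_stemmed.
tokens = ["gregor", "samsa", "woke", "gregor", "samsa", "sister", "gregor", "samsa"]

# Pair each token with the one that follows it.
bigrams = list(zip(tokens, tokens[1:]))

# Count how often each adjacent pair occurs.
bigram_counts = Counter(bigrams)

print(bigram_counts.most_common(1))
```

Pairs that show up far more often than chance would suggest are candidates for being treated as a single unit, such as a full name.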

## Creating a Data frame that compares the documents

**Puzzle**: how could you adapt the code below to allow you to compare documents and word counts?

In [18]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['why hello there', 'omg hello pony', 'she went there? omg']
vec = CountVectorizer()
X = vec.fit_transform(docs)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(df)

   hello  omg  pony  she  there  went  why
0      1    0     0    0      1     0    1
1      1    1     1    0      0     0    0
2      0    1     0    1      1     1    0
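
One possible direction for the puzzle: once every document is a row of word counts, two documents can be compared by the cosine similarity of their rows. The sketch below rebuilds the counts with the standard library instead of `CountVectorizer`, so the regular expression is only an approximation of its default tokenization, and the helper names `word_counts` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter

docs = ['why hello there', 'omg hello pony', 'she went there? omg']

def word_counts(doc):
    # Lowercase and keep words of two or more characters,
    # roughly mirroring CountVectorizer's default tokenization.
    return Counter(re.findall(r"\b\w\w+\b", doc.lower()))

def cosine_similarity(a, b):
    # Dot product over shared words, divided by the two vector lengths.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

counts = [word_counts(d) for d in docs]
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(i, j, round(cosine_similarity(counts[i], counts[j]), 3))
```

A score near 1 means two documents use the same words in similar proportions, and a score of 0 means they share no words at all.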
