## Text Mining And Analytics

                                                                                        by Tasos Kachrimanis

#### Natural Language Toolkit — NLTK

<img src="http://cdn.slidesharecdn.com/ss_thumbnails/nltk-150425131816-conversion-gate02-thumbnail-4.jpg?cb=1429968186" width="60%">

- NLTK is a large toolkit that supports the whole Natural Language Processing (NLP) workflow: splitting paragraphs into sentences, splitting sentences into words, recognizing each word's part of speech, highlighting the main subjects, and ultimately helping a machine understand what a text is about. In this series, we tackle opinion mining, also known as sentiment analysis.

>> **Tokenizing Words and Sentences**

In [7]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

In [9]:
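# Greek sample text, roughly: "Good evening. I would like information on how I could join the night tariff."
message = "Καλησπέρα σας. Θα ήθελα πληροφορίες σχετικά με το πως θα μπορούσα να ενταχθώ στο νυστερινό τιμολόγιο."

print(sent_tokenize(message))
print(word_tokenize(message))
for i in word_tokenize(message):
    print(i)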

['Καλησπέρα σας.', 'Θα ήθελα πληροφορίες σχετικά με το πως θα μπορούσα να ενταχθώ στο νυστερινό τιμολόγιο.']
['Καλησπέρα', 'σας', '.', 'Θα', 'ήθελα', 'πληροφορίες', 'σχετικά', 'με', 'το', 'πως', 'θα', 'μπορούσα', 'να', 'ενταχθώ', 'στο', 'νυστερινό', 'τιμολόγιο', '.']
Καλησπέρα
σας
.
Θα
ήθελα
πληροφορίες
σχετικά
με
το
πως
θα
μπορούσα
να
ενταχθώ
στο
νυστερινό
τιμολόγιο
.
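
To make the idea of tokenizing concrete, here is a rough standard-library sketch of the two steps. It is not how NLTK's Punkt-based tokenizers work; the two regular expressions and the helper names `simple_sent_tokenize` and `simple_word_tokenize` are assumptions made for illustration, and the sentence splitter will break on abbreviations such as "Mr.".

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

message = "Good evening. I would like some information about the night tariff."
sentences = simple_sent_tokenize(message)
words = simple_word_tokenize(message)
print(sentences)
print(words)
```

As in the NLTK output, punctuation marks come out as separate tokens.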


- **Note:**

Some words carry more meaning than others, and some carry almost none.
We do not want these words taking up space in our database or valuable processing time. We call them "stop words" and simply drop them. The term can also be read more literally: *words we stop on.* For now, we treat stop words as words that contribute no meaning and that we want to remove.

This is easy to do with a stored list of the words you consider stop words. NLTK ships with a starter list, which you can access through the NLTK corpus:

In [12]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 'd',
 'did',
 'didn',
 'do',
 'does',
 'doesn',
 'doing',
 'don',
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 'has',
 'hasn',
 'have',
 'haven',
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 'it',
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 'more',
 'most',
 'mustn',
 'my',
 'myself',
 'needn',
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 're',
 's',
 'same',
 'shan',
 'she',
 'should',
 'shouldn',
 'so',
 'some',
 'such',
 't',
 'than',
 'that',
 'the',
 'their',
 'theirs',
 'them',
 

In [15]:
example_sent = "This is a sample sentence, showing off the stop words filtration."
word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if w not in stop_words]

filtered_sentence

['This',
 'sample',
 'sentence',
 ',',
 'showing',
 'stop',
 'words',
 'filtration',
 '.']
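
In the output above, 'This' survives because the stop list is lowercase and the membership test is case-sensitive; punctuation survives as well. A minimal sketch of a stricter filter follows, assuming a tiny hand-written stop list in place of `stopwords.words('english')` so that it runs without the NLTK data.

```python
# Hypothetical mini stop list; the notebook would use stopwords.words('english')
stop_words = {"this", "is", "a", "off", "the"}

word_tokens = ["This", "is", "a", "sample", "sentence", ",", "showing",
               "off", "the", "stop", "words", "filtration", "."]

# Compare lowercased tokens and keep only tokens with a letter or digit
filtered = [w for w in word_tokens
            if w.lower() not in stop_words and any(ch.isalnum() for ch in w)]
print(filtered)
```

Whether to drop punctuation depends on the task; for sentiment analysis, marks such as '!' can carry signal.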

>> **Stemming words**

Stemming is a form of normalization. Many variations of a word carry the same meaning and differ only in tense or inflection.

We stem to shorten lookups and to normalize sentences.

In [16]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]

for w in example_words:
    print(ps.stem(w))

python
python
python
python
pythonli


>> **Part of Speech Tagging** 

Labeling the words in a sentence as nouns, adjectives, verbs, and so on is a powerful tool, and NLTK can do it for you. Even more impressively, its tags also distinguish tense and more. Here are some of the tags, what they mean, and an example of each:


- JJ: adjective, e.g. 'big'
- JJR: adjective, comparative, e.g. 'bigger'
- JJS: adjective, superlative, e.g. 'biggest'
- NN: noun, singular, e.g. 'desk'
- NNS: noun, plural, e.g. 'desks'
- NNP: proper noun, singular, e.g. 'Harrison'
- NNPS: proper noun, plural, e.g. 'Americans'
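
Because related tags share a prefix (all noun tags start with NN, all adjective tags with JJ), tagged text can be filtered by prefix. The sketch below uses a few hand-written (token, tag) pairs in the shape that `nltk.pos_tag` returns; the helper name `tokens_with_prefix` is made up for illustration.

```python
# A few (token, tag) pairs in the shape returned by nltk.pos_tag
tagged = [("Mr.", "NNP"), ("Speaker", "NNP"), (",", ","), ("members", "NNS"),
          ("of", "IN"), ("Congress", "NNP"), ("diplomatic", "JJ"), ("corps", "NN")]

def tokens_with_prefix(tagged_tokens, prefix):
    """Return tokens whose Penn Treebank tag starts with `prefix`."""
    return [tok for tok, tag in tagged_tokens if tag.startswith(prefix)]

nouns = tokens_with_prefix(tagged, "NN")       # NN, NNS, NNP, NNPS
adjectives = tokens_with_prefix(tagged, "JJ")  # JJ, JJR, JJS
print(nouns)
print(adjectives)
```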

In [17]:
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

# Train a Punkt sentence tokenizer on one speech, then apply it to another
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)


def process_content():
    """Print the part-of-speech tags for every sentence in the sample text."""
    try:
        for sentence in tokenized:
            words = nltk.word_tokenize(sentence)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))


process_content()

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'), ('nat

>> **Lemmatizing**

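Lemmatizing is an operation very similar to stemming. The major difference, as you saw earlier, is that stemming can often create non-existent words, whereas lemmas are actual words. Note that lemmatize treats a word as a noun unless you pass pos, which is why the examples below give different results with pos="a" or pos="v".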

In [18]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))

cat
cactus
goose
rock


In [19]:
print(lemmatizer.lemmatize("better"))
print(lemmatizer.lemmatize("better", pos="a")) #as an adjective

better
good


In [20]:
print(lemmatizer.lemmatize("loving", pos="v")) #as a verb
print(lemmatizer.lemmatize("loving"))

love
loving


>> **Wordnet**

WordNet is a lexical database for the English language, created at Princeton University and available as one of the NLTK corpora.

You can use WordNet alongside NLTK to look up word meanings, synonyms, antonyms, and more.

In [21]:
from nltk.corpus import wordnet

syns = wordnet.synsets("program")

print(syns)

[Synset('plan.n.01'), Synset('program.n.02'), Synset('broadcast.n.02'), Synset('platform.n.02'), Synset('program.n.05'), Synset('course_of_study.n.01'), Synset('program.n.07'), Synset('program.n.08'), Synset('program.v.01'), Synset('program.v.02')]


In [22]:
#synset
print(syns[0].name())


#just the word
print(syns[0].lemmas()[0].name())


#definition (for plan)
print(syns[0].definition())

plan.n.01
plan
a series of steps to be carried out or goals to be accomplished


-  How might we find the synonyms and antonyms of a word? The lemmas of each synset are synonyms, and calling .antonyms() on a lemma returns its antonyms.

In [23]:
synonyms = []
antonyms = []

for syn in wordnet.synsets("good"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))

{'skilful', 'skillful', 'goodness', 'proficient', 'honorable', 'estimable', 'practiced', 'in_force', 'well', 'full', 'commodity', 'upright', 'beneficial', 'unspoiled', 'dependable', 'adept', 'safe', 'expert', 'effective', 'right', 'thoroughly', 'salutary', 'honest', 'trade_good', 'secure', 'sound', 'respectable', 'undecomposed', 'good', 'in_effect', 'soundly', 'ripe', 'serious', 'near', 'dear', 'unspoilt', 'just'}
{'evil', 'badness', 'evilness', 'ill', 'bad'}


- Next, we can also use WordNet to score how semantically similar two word senses (synsets) are

In [25]:
#compare similarity

w1 = wordnet.synset("ship.n.01")
w2 = wordnet.synset("boat.n.01")

print(w1.wup_similarity(w2))


w1 = wordnet.synset("ship.n.01")
w2 = wordnet.synset("cat.n.01")

print(w1.wup_similarity(w2))


0.9090909090909091
0.32


- Note: wup_similarity applies the Wu and Palmer method for semantic relatedness, so a higher score (such as ship vs. boat) means the two senses are closer in meaning.
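
For reference, the Wu and Palmer score is usually written as below, where lcs is the least common subsumer (the deepest ancestor the two synsets share in the WordNet hypernym hierarchy) and depth is the distance from the root. This is the textbook form; NLTK's exact depth-counting conventions may differ in detail.

$$
\mathrm{sim}_{\mathrm{wup}}(s_1, s_2) = \frac{2 \cdot \mathrm{depth}\big(\mathrm{lcs}(s_1, s_2)\big)}{\mathrm{depth}(s_1) + \mathrm{depth}(s_2)}
$$

Two senses that share a deep, specific ancestor score close to 1, while senses that only meet near the root of the hierarchy score much lower.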

### (Check MFQs - Most Frequent Questions)

>> Important Note

In [76]:
from nltk.corpus import wordnet
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()


def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to the matching WordNet POS ('' if none)."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''


word = input()

word_tok = word_tokenize(word)
pot = pos_tag(word_tok)

# Convert each (token, treebank_tag) pair to its WordNet POS
pot_list = [get_wordnet_pos(tag) for _, tag in pot]

print(pot_list)


this algorithm is going to be better by March
['', 'n', 'v', 'v', '', 'v', 'a', '', 'n']
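
The empty strings in this list mark tokens whose tag has no WordNet counterpart. One way to handle them before lemmatizing, sketched here under the assumption that falling back to noun is acceptable ('n' is also the default pos of `WordNetLemmatizer.lemmatize`), is to pair each token with a usable POS first. The helper name `with_default_pos` is hypothetical.

```python
tokens = ["this", "algorithm", "is", "going", "to", "be", "better", "by", "March"]
wn_tags = ["", "n", "v", "v", "", "v", "a", "", "n"]  # the list printed above

def with_default_pos(tokens, wn_tags, default="n"):
    """Pair each token with its WordNet POS, replacing '' with a default."""
    return [(tok, tag or default) for tok, tag in zip(tokens, wn_tags)]

pairs = with_default_pos(tokens, wn_tags)
print(pairs)
# Each pair could then be passed on as lemmatizer.lemmatize(tok, pos=tag)
```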


## Feature extraction

Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms, as most of them expect numerical feature vectors of fixed size rather than raw text documents of variable length.
In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:
- **tokenizing** strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.
- **counting** the occurrences of tokens in each document.
- **normalizing** and weighting with diminishing importance tokens that occur in the majority of samples / documents.

>>>Features and samples are defined as follows:

- each **individual token occurrence frequency** (normalized or not) is treated as a feature.
- the vector of all the token frequencies for a given **document** is considered a multivariate **sample.**


<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQQ9KKRdgcFD-HSDxCEgTvn5cvAlFZeIdfbcRAQq7pXQ3pWrbn7" width="30%">

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or “Bag of n-grams” representation.
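
To show what the document-by-token matrix looks like, here is a small standard-library sketch of the tokenize-and-count steps. It is an illustration only, not scikit-learn's implementation; the two sample sentences and the helper names `tokenize` and `bag_of_words` are assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(docs):
    """Return (sorted vocabulary, count matrix with one row per document)."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

docs = ["Big Data and data products", "Data patterns and hype"]
vocab, matrix = bag_of_words(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each column is one feature (a token) and each row is one sample (a document); word order inside a document is lost.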

<img src="http://images.slideplayer.com/10/2811681/slides/slide_51.jpg" width="60%">

In [112]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1,  stop_words='english')

In [125]:
content = ["Bursting the Big Data Data bubble starts with appreciating certain nuances about Data and its products and patterns"
           ,"the real solutions that are useful in dealing with certain Big Data will be needed and in demand even if the notion of Big Data falls from the height of its hype into the trough of disappointment"]

X = vectorizer.fit_transform(content)

print(vectorizer)
print(X.get_shape())
print(X)

- Note: the repr printed below reads TfidfVectorizer with stop_words=None, so this output was captured while vectorizer was bound to a TfidfVectorizer rather than to the CountVectorizer defined above. The values are therefore tf-idf weights, not raw counts, and stop words were not removed.

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
(2, 37)
  (0, 26)	0.24173784528
  (0, 27)	0.24173784528
  (0, 21)	0.171998467899
  (0, 1)	0.343996935798
  (0, 0)	0.24173784528
  (0, 24)	0.24173784528
  (0, 8)	0.171998467899
  (0, 2)	0.24173784528
  (0, 36)	0.171998467899
  (0, 30)	0.24173784528
  (0, 6)	0.24173784528
  (0, 9)	0.515995403697
  (0, 5)	0.171998467899
  (0, 32)	0.171998467899
  (0, 7)	0.24173784528
  (1, 12)	0.145594450865
  (1, 33)	0.145594450865
  (1, 20)	0.145594450865
  (1, 17)	0.145594450865
  (1, 16)	0.145594450865
  (1, 15)	0.145594450865
  (1, 14)	0.145594450865
  (1, 25)	0.4

In [126]:
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
vectorizer.get_feature_names()

['about',
 'and',
 'appreciating',
 'are',
 'be',
 'big',
 'bubble',
 'bursting',
 'certain',
 'data',
 'dealing',
 'demand',
 'disappointment',
 'even',
 'falls',
 'from',
 'height',
 'hype',
 'if',
 'in',
 'into',
 'its',
 'needed',
 'notion',
 'nuances',
 'of',
 'patterns',
 'products',
 'real',
 'solutions',
 'starts',
 'that',
 'the',
 'trough',
 'useful',
 'will',
 'with']

In [127]:
X_train = vectorizer.fit_transform(content)
num_samples, num_features = X_train.shape
print("#samples: %d, #features: %d" % (num_samples, num_features)) 

#samples: 2, #features: 37


>>  **Tf–idf term weighting**

Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

Tf means **term-frequency** while tf–idf means term-frequency times **inverse document-frequency**:
<img src="http://scikit-learn.org/stable/_images/math/40f34fb794a1d3561d64bc55e344634b1451a21f.png" width="20%">


- TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
- IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

>> Example

Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log10(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. Note that this example uses a base-10 logarithm; with the natural logarithm from the formula above, idf would be ln(10,000) ≈ 9.21 and the weight ≈ 0.28.
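
The arithmetic of this example can be checked in a few lines. The sketch below simply evaluates the two formulas given above; the function names `tf` and `idf` and the `log` parameter are illustrative choices, and both logarithm bases are shown because the choice of base changes the weight.

```python
import math

def tf(term_count, total_terms):
    """Term frequency: share of the document's terms that are this term."""
    return term_count / total_terms

def idf(n_docs, n_docs_with_term, log=math.log):
    """Inverse document frequency; natural log unless another is passed."""
    return log(n_docs / n_docs_with_term)

tf_cat = tf(3, 100)
idf_base10 = idf(10_000_000, 1_000, log=math.log10)
idf_natural = idf(10_000_000, 1_000)

print(tf_cat, idf_base10, tf_cat * idf_base10)
print(round(idf_natural, 4), round(tf_cat * idf_natural, 4))
```

The base only rescales every idf by a constant factor, so the ranking of terms stays the same.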

In [128]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1)

In [130]:
X = vectorizer.fit_transform(content)
print(vectorizer.get_feature_names())  # scikit-learn 1.2 and later: get_feature_names_out()

['about', 'and', 'appreciating', 'are', 'be', 'big', 'bubble', 'bursting', 'certain', 'data', 'dealing', 'demand', 'disappointment', 'even', 'falls', 'from', 'height', 'hype', 'if', 'in', 'into', 'its', 'needed', 'notion', 'nuances', 'of', 'patterns', 'products', 'real', 'solutions', 'starts', 'that', 'the', 'trough', 'useful', 'will', 'with']
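
With its default settings (smooth_idf=True, norm='l2', as in the repr printed earlier), TfidfVectorizer does not use the textbook formula unchanged: it weights raw counts by idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then scales each row to unit Euclidean length. The following standard-library sketch re-implements that under those assumptions, reusing the token pattern from the repr. It is an approximation for illustration, not the library's code, and the helper names are made up. For the first document it should land close to weights such as 0.171998 and 0.515995 that appear in the matrix printed earlier in this section.

```python
import math
import re
from collections import Counter

content = [
    "Bursting the Big Data Data bubble starts with appreciating certain nuances "
    "about Data and its products and patterns",
    "the real solutions that are useful in dealing with certain Big Data will be "
    "needed and in demand even if the notion of Big Data falls from the height "
    "of its hype into the trough of disappointment",
]

def tokenize(text):
    """Lowercase, then apply the default token pattern shown in the repr."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2 norm)."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    df = Counter(term for c in counts for term in c)  # document frequency
    rows = []
    for c in counts:
        weights = {t: n * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, n in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf_rows(content)
for term in ("data", "and", "bubble", "the"):
    print(term, round(rows[0][term], 6))
```

Terms that occur in both documents (such as "the") receive the minimum idf of 1, while terms unique to one document (such as "bubble") are weighted up.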


In [123]:
print(X)

  (0, 26)	0.238324028932
  (0, 27)	0.238324028932
  (0, 21)	0.169569509451
  (0, 1)	0.339139018902
  (0, 0)	0.238324028932
  (0, 24)	0.238324028932
  (0, 8)	0.238324028932
  (0, 2)	0.238324028932
  (0, 36)	0.169569509451
  (0, 30)	0.238324028932
  (0, 6)	0.238324028932
  (0, 9)	0.508708528353
  (0, 5)	0.169569509451
  (0, 32)	0.169569509451
  (0, 7)	0.238324028932
  (1, 12)	0.146381998863
  (1, 33)	0.146381998863
  (1, 20)	0.146381998863
  (1, 17)	0.146381998863
  (1, 16)	0.146381998863
  (1, 15)	0.146381998863
  (1, 14)	0.146381998863
  (1, 25)	0.439145996588
  (1, 23)	0.146381998863
  (1, 18)	0.146381998863
  (1, 13)	0.146381998863
  (1, 11)	0.146381998863
  (1, 22)	0.146381998863
  (1, 4)	0.146381998863
  (1, 35)	0.146381998863
  (1, 10)	0.146381998863
  (1, 19)	0.292763997726
  (1, 34)	0.146381998863
  (1, 3)	0.146381998863
  (1, 31)	0.146381998863
  (1, 29)	0.146381998863
  (1, 28)	0.146381998863
  (1, 21)	0.104151997811
  (1, 1)	0.104151997811
  (1, 36)	0.104151997811
  (1, 9)	0.

### Limitations of the Bag of Words representation

A collection of unigrams (what bag of words is) cannot capture phrases and multi-word expressions, effectively disregarding any word order dependence. Additionally, the bag of words model doesn’t account for potential misspellings or word derivations.
Instead of building a simple collection of unigrams (n=1), one might prefer a collection of bigrams (n=2), where occurrences of pairs of consecutive words are counted.
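
A short sketch of what counting pairs of consecutive words means in practice. The helper name `word_ngrams` and the sample phrase are assumptions for illustration; in scikit-learn the comparable setting would be the ngram_range parameter of CountVectorizer.

```python
def word_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "bursting the big data bubble".split()
unigrams = word_ngrams(tokens, 1)
bigrams = word_ngrams(tokens, 2)
print(bigrams)
```

Unlike the unigram view, the bigram ("big", "data") keeps the two words together as a single feature.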

<img src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRkBo8ElivvKC9EZhG86lgi0nMSek2l_VOIPYKA8fBLoOfH-KlD" width="40%">

- Example: ['words', 'wprds']

The second document contains a misspelling of the word ‘words’. A simple bag of words representation would treat these as two entirely distinct documents, differing in both of the two possible features. A character 2-gram representation, however, finds the documents matching in 4 out of 8 features, which may help a classifier decide better.

In [131]:
ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2), min_df=1)
counts = ngram_vectorizer.fit_transform(['words', 'wprds'])
# scikit-learn 1.2 and later: ngram_vectorizer.get_feature_names_out()
ngram_vectorizer.get_feature_names()

[' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp']

In [132]:
counts.toarray().astype(int)

array([[1, 1, 1, 0, 1, 1, 1, 0],
       [1, 1, 0, 1, 1, 1, 0, 1]])
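
The eight features and the 4-out-of-8 overlap can be reproduced by hand. The sketch below assumes the 'char_wb' behavior seen in the feature list above, namely that each word is padded with one space on either side before character pairs are taken. It handles single words only, and the helper name `char_wb_bigrams` is made up.

```python
def char_wb_bigrams(word):
    """Character 2-grams of one word padded with a space on each side."""
    padded = f" {word} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

a = char_wb_bigrams("words")
b = char_wb_bigrams("wprds")
features = sorted(a | b)
shared = sorted(a & b)
print(features)
print(shared, f"{len(shared)} of {len(features)}")
```

The shared pairs are the columns where both rows of the array above contain a 1.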

### Vectorizing a large text corpus with the hashing trick

The vectorization scheme above is simple, but it holds an in-memory mapping from string tokens to integer feature indices, which causes several **problems when dealing with large datasets**.

These limitations can be overcome by combining the “hashing trick” (feature hashing), implemented by the sklearn.feature_extraction.FeatureHasher class, with the text preprocessing and tokenization features of the CountVectorizer.
This combination is implemented in HashingVectorizer.

In [137]:
from sklearn.feature_extraction.text import HashingVectorizer
hv = HashingVectorizer(n_features=10)
X = hv.transform(content)
X.get_shape()

(2, 10)
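
The core of the hashing trick can be sketched with the standard library: instead of looking a token up in a vocabulary, its hash picks the column directly, so no mapping has to be kept in memory. This is a simplified illustration under stated assumptions. It uses MD5 from hashlib as a stable hash, whereas HashingVectorizer uses a different hash function (MurmurHash3) and by default also applies signed updates and row normalization, so the numbers will not match the library's. The helper name `hashed_counts` and the sample sentences are made up.

```python
import hashlib
import re

def hashed_counts(text, n_features=10):
    """Count tokens into a fixed-length vector; a stable hash picks the bucket."""
    vec = [0] * n_features
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % n_features
        vec[bucket] += 1
    return vec

docs = ["Big Data and data products", "Data patterns and hype"]
X_hashed = [hashed_counts(d) for d in docs]
print(len(X_hashed), len(X_hashed[0]))
```

The trade-off is that different tokens can collide in the same bucket and that the mapping cannot be inverted to recover feature names.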