# Text processing

In [None]:
import pandas as pd

text = ['the cat with the hat sat in the mat',
        'you have brains in your head ',' you have feet in your shoes',
        'You can steer yourself any direction you choose' ,
        'Think and wonder, wonder and think', 'The more that you read',' the more things you will know',
        'The more that you learn' ,'the more places you’ll go',
        'the mat has a cat with a hat']

txt = pd.DataFrame(text,columns=['text'])
txt

## Stopwords

In [None]:
from nltk.corpus import stopwords

stopwords.words('english')

In [None]:
from nltk import word_tokenize

# Build the stopword set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

filtered = [w for w in word_tokenize(txt.text.values[0]) if w not in stop_words]
filtered

## Stemming

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in word_tokenize(txt.text.values[3])])

## Bag-of-Words model

### Counts/Occurrences

The model is simple in that it throws away all of the order information in the words and focuses on the occurrence of words in a document.
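To make the "throws away order" point concrete, here is a minimal sketch that uses only the standard library. The helper `bag_of_words` and the two sample sentences are made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count how often each lowercased, whitespace-separated token occurs."""
    return Counter(sentence.lower().split())

# Two hypothetical sentences with the same words in a different order
a = bag_of_words("the cat sat on the mat")
b = bag_of_words("on the mat the cat sat")

print(a)
print(a == b)  # word order is discarded, so the two bags compare equal
```

`CountVectorizer` in the next cell applies the same idea to a whole corpus and returns the counts as a sparse matrix.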

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_transformed = count_vect.fit_transform(txt.text.values)
X_transformed.toarray()

In [None]:
# One row per term, one column per document
vectdf = pd.DataFrame(X_transformed.toarray().T)
vectdf['term'] = count_vect.get_feature_names_out()
vectdf

### Frequencies

Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.

Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus. This downscaling is called tf-idf, for Term Frequency times Inverse Document Frequency.
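The two weighting steps can be sketched by hand. This is a textbook-style illustration, not what scikit-learn computes internally: `TfidfVectorizer` uses a smoothed idf and L2-normalizes each document by default, so its numbers will differ. The function names and the three toy documents are assumptions made up for the example.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """tf: occurrences of each term divided by the document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}

def inverse_document_frequencies(tokenized_docs):
    """idf: log(number of documents / number of documents containing the term)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Hypothetical toy corpus, tokenized on whitespace
docs = [d.split() for d in ["the cat sat", "the dog sat down", "the bird flew"]]

idf = inverse_document_frequencies(docs)
tfidf_scores = [
    {term: tf * idf[term] for term, tf in term_frequencies(doc).items()}
    for doc in docs
]

for scores in tfidf_scores:
    print(scores)
```

Because "the" appears in every document, its idf is log(3/3) = 0 and it contributes nothing, while a word such as "cat" that occurs in a single document gets the largest weight.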


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf(corpus, stop_words=None):
    # stop_words=None is TfidfVectorizer's default, so it can be passed through directly
    tfidf_vect = TfidfVectorizer(stop_words=stop_words)
    X_tf_transformed = tfidf_vect.fit_transform(corpus)
    # One row per term, one column per document
    tfidf_vectdf = pd.DataFrame(X_tf_transformed.toarray().T)
    tfidf_vectdf['term'] = tfidf_vect.get_feature_names_out()
    return tfidf_vectdf


In [None]:
tfidf(txt.text.values)

In [None]:
tfidf(txt.text.values, stop_words=stopwords.words('english'))

## N-Grams as Features

In [None]:
from nltk import bigrams, trigrams
from nltk import word_tokenize

exmpl = word_tokenize("The quick brown fox jumps over the lazy dog")

In [None]:
list(bigrams(exmpl))

In [None]:
list(trigrams(exmpl))

## Word Embedding

Shortcomings of the bag-of-words method:

  - It ignores the order of the words; for example, "this is bad" and "bad is this" get the same representation.
  - It ignores the context of words. Take the text "He loved books. Education is best found in books". Bag of words creates one vector for "He loved books" and another for "Education is best found in books", and every distinct word gets its own orthogonal dimension. The model therefore treats words such as "books" and "Education" as independent, although in reality they are related to each other.

Word embeddings were developed to overcome these limitations, and word2vec is one approach to learning them.
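The orthogonality problem can be illustrated with cosine similarity. In the sketch below, the one-hot vectors mimic how a bag-of-words vocabulary assigns one dimension per word, while the 2-d "embedding" values are made up by hand purely for illustration; they are not output from a trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One dimension per word over the vocabulary ["books", "education", "hat"]
one_hot = {"books": [1, 0, 0], "education": [0, 1, 0], "hat": [0, 0, 1]}

# Hand-picked, hypothetical dense vectors (NOT learned from data)
dense = {"books": [0.9, 0.1], "education": [0.8, 0.3], "hat": [0.1, 0.9]}

print(cosine(one_hot["books"], one_hot["education"]))  # distinct words are orthogonal
print(cosine(dense["books"], dense["education"]))      # related words can lie close together
print(cosine(dense["books"], dense["hat"]))            # unrelated words can lie further apart
```

A trained word2vec model, as in the next cells, learns such dense vectors from co-occurrence in a corpus instead of having them assigned by hand.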

In [None]:
from gensim.models import Word2Vec
from nltk.corpus import abc

model = Word2Vec(abc.sents())
# gensim 4 replaced `wv.vocab` with `wv.key_to_index`
X = list(model.wv.key_to_index)

In [None]:
model.wv.most_similar('science')

In [None]:
model.wv.doesnt_match('See you later, thanks for visiting'.split())

In [None]:
model.wv['computer']

In [None]:
model.wv.similarity('man','woman')

## POS Tagging

| Open class words | Closed class words | Other |
| --- | --- | --- |
| ADJ | ADP | PUNCT |
| ADV | AUX | SYM |
| INTJ | CCONJ | X |
| NOUN | DET | |
| PROPN | NUM | |
| VERB | PART | |
| | PRON | |
| | SCONJ | |

#### VERB: verb
A verb is a member of the syntactic class of words that typically signal events and actions, can constitute a minimal predicate in a clause, and govern the number and types of other constituents which may occur in the clause. Verbs are often associated with grammatical categories like tense, mood, aspect and voice, which can either be expressed inflectionally or using auxiliary verbs or particles.

<p style="text-align: center;">What is the verb in this sentence?</p>


#### AUX: auxiliary
An auxiliary is a function word that accompanies the lexical verb of a verb phrase and expresses grammatical distinctions not carried by the lexical verb, such as person, number, tense, mood, aspect, voice or evidentiality. It is often a verb (which may have non-auxiliary uses as well) but many languages have nonverbal TAME markers and these should also be tagged AUX. The class AUX also includes copulas (in the narrow sense of pure linking words for nonverbal predication).

Examples
   - Tense auxiliaries: *has* (done), *is* (doing), *will* (do)
   - Passive auxiliaries: *was* (done), *got* (done)
   - Modal auxiliaries: *should* (do), *must* (do)
   - Verbal copulas: He *is* a teacher.

#### NOUN: noun
Nouns are a part of speech typically denoting a person, place, thing, animal or idea.

The NOUN tag is intended for common nouns only. See PROPN for proper nouns and PRON for pronouns.

Examples
  - girl
  - cat
  - tree
  
#### PROPN: proper noun  
A proper noun is a noun (or nominal content word) that is the name (or part of the name) of a specific individual, place, or object.

Acronyms of proper nouns, such as UN and NATO, should be tagged PROPN.

#### ADJ: adjective
Adjectives are words that typically modify nouns and specify their properties or attributes:
 - The car is *green*.

Numbers vs. Adjectives: In general, cardinal numbers receive the part of speech NUM, while ordinal numbers (more precisely adjectival ordinal numerals) receive the tag ADJ.

Examples:
  - big
  - old
  - green
  - African
  - incomprehensible
  - first, second, third

#### ADP: adposition
Adposition is a cover term for prepositions and postpositions. Adpositions belong to a closed set of items that occur before (preposition) or after (postposition) a complement composed of a noun phrase, noun, pronoun, or clause that functions as a noun phrase, and that form a single structure with the complement to express its grammatical and semantic relation to another unit within a clause.

Examples
  - in
  - to
  - during

#### ADV: adverb
Adverbs are words that typically modify verbs for such categories as time, place, direction or manner. They may also modify adjectives and other adverbs, as in *very briefly* or *arguably wrong*.

Examples
   - very
   - well
   - exactly
   - tomorrow
   - up, down
   - interrogative adverbs: where, when, how, why
   - demonstrative adverbs: here, there, now, then
   - indefinite adverbs: somewhere, sometime, anywhere, anytime
   - totality adverbs: everywhere, always
   - negative adverbs: nowhere, never

#### NUM: numeral
A numeral is a word, functioning most typically as a determiner, adjective or pronoun, that expresses a number and a relation to the number, such as quantity, sequence, frequency or fraction.

#### DET: determiner
Determiners are words that modify nouns or noun phrases and express the reference of the noun phrase in context. That is, a determiner may indicate whether the noun is referring to a definite or indefinite element of a class, to a closer or more distant element, to an element belonging to a specified person or thing, to a particular number or quantity, etc.

Examples
   - articles (a closed class indicating definiteness, specificity or givenness): a, an, the
   - possessive determiners (which modify a nominal): my, your
   - demonstrative determiners: *this* as in "I saw *this* car yesterday."
   - interrogative determiners: *which* as in "*Which* car do you like?"
   - relative determiners: *which* as in "I wonder *which* car you like."
   - quantity determiners (quantifiers): indefinite *any*, universal *all*, and negative *no* as in "We have *no* cars available."

#### PRON: pronoun
Pronouns are words that substitute for nouns or noun phrases, whose meaning is recoverable from the linguistic or extralinguistic context.
See also general principles on pronominal words for more tips on how to define pronouns. In particular:
  - Non-possessive personal, reflexive or reciprocal pronouns are always tagged PRON.
  - Possessives vary across languages. In some languages the above tests put them in the DET category. In others, they are more like a normal personal pronoun in a specific case (often the genitive), or a personal pronoun with an adposition; they are tagged PRON.

#### INTJ: interjection
An interjection is a word that is used most often as an exclamation or part of an exclamation. It typically expresses an emotional reaction, is not syntactically related to other accompanying expressions, and may include a combination of sounds not otherwise found in the language.

Examples
   - psst
   - ouch
   - bravo
   - hello
   
#### CCONJ: coordinating conjunction
A coordinating conjunction is a word that links words or larger constituents without syntactically subordinating one to the other and expresses a semantic relationship between them.

Examples
   - and
   - or
   - but

#### SCONJ: subordinating conjunction
A subordinating conjunction is a conjunction that links constructions by making one of them a constituent of the other. The subordinating conjunction typically marks the incorporated constituent which has the status of a (subordinate) clause.

Examples
   - *that* as in "I believe *that* he will come."
   - if
   - while

#### SYM: symbol
A symbol is a word-like entity that differs from ordinary words by form, function, or both.

Many symbols are or contain special non-alphanumeric characters, similarly to punctuation. What makes them different from punctuation is that they can be substituted by normal words. This involves all currency symbols, e.g. $ 75 is identical to seventy-five dollars.

#### PUNCT: punctuation
Punctuation marks are non-alphabetical characters and character groups used in many languages to delimit linguistic units in printed text.

Punctuation is not taken to include logograms such as $, \%, and §, which are instead tagged as SYM.

#### X: other
The tag X is used for words that for some reason cannot be assigned a real part-of-speech category. It should be used very restrictively.


[Source](https://universaldependencies.org/u/pos/). Note that the site provides these definitions for many different languages!

In [None]:
from nltk import pos_tag

print(txt.text.values[0])
pos_tag(word_tokenize(txt.text.values[0]), tagset='universal')

# Simple example of text classification

In [None]:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')

In [None]:
print(news.target_names)

In [None]:
for text, num_label in zip(news.data[:10], news.target[:10]):
    print('[%s]:\t\t "%s ..."' % (news.target_names[num_label], text[:100].split('\n')[0]))

In [None]:
from sklearn.model_selection import train_test_split


def train(classifier, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
    classifier.fit(X_train, y_train)
    print("Accuracy: %s" % classifier.score(X_test, y_test))
    return classifier

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

trial1 = Pipeline(
    [('vectorizer', TfidfVectorizer()),
     ('classifier', MultinomialNB()),])
train(trial1, news.data, news.target)

## Stopwords

In [None]:
from nltk.corpus import stopwords

trial2 = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))),
    ('classifier', MultinomialNB()),
])
train(trial2, news.data, news.target)

## Filter by occurrence

In [None]:
trial3 = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'),
                                   min_df=5)),
    ('classifier', MultinomialNB(alpha=0.05)),
])
train(trial3, news.data, news.target)

## Stemming

In [None]:
import string
from nltk.stem import PorterStemmer
from nltk import word_tokenize

def stemming_tokenizer(text):
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in word_tokenize(text)]

trial4 = Pipeline([
    ('vectorizer', TfidfVectorizer(
        tokenizer=stemming_tokenizer,
        stop_words=stopwords.words('english') + list(string.punctuation))),
    ('classifier', MultinomialNB(alpha=0.05)),
])
train(trial4, news.data, news.target)

# Additional resources

### LDA
A good visualization tool for LDA topics and keywords is LDAvis, an R package that has been ported to Python as [pyLDAvis](https://github.com/bmabey/pyLDAvis).

[A good notebook on LDA](http://nbviewer.jupyter.org/github/bmabey/hacker_news_topic_modelling/blob/master/HN%20Topic%20Model%20Talk.ipynb).

### PubMed article classification
A [model](https://github.com/melcutz/NLU_tutorial/blob/master/3_spacy_pubmed_model.ipynb) that can tell whether a PubMed article refers to child or adult patient(s).

### POS tagging
[Negation detection](https://github.com/melcutz/NLP-demo-2017/blob/master/SpaCy_Intro.ipynb) using POS tagging and syntactic dependencies.

### Biopython
[Biopython](http://biopython.org/DIST/docs/tutorial/Tutorial.html) offers an extensive tutorial on biomedical data processing in Python.