# Analyzing text data

### What tasks can be solved by processing text?


**Syntactic tasks**
* part-of-speech and morphological tagging
* splitting words into morphemes (prefix, suffix, etc.)
* stemming, lemmatization
* segmentation into sentences (complicated by initials and abbreviations) and into words (e.g. Chinese)
* named entity recognition: finding names and titles in the text
* word sense disambiguation: resolving the meaning of a word in a given context (e.g. Russian "замок": castle vs. lock)
* building a syntax tree
* coreference resolution: determining which other objects a word refers to

**Text comprehension tasks with supervision (a "teacher")**
* next character prediction
* information retrieval
* sentiment analysis
* relation and fact extraction
* question answering

**Understanding and generating text**
* text generation
* machine translation
* dialogue models (chatbots)

**Indirect tasks**
* image captioning
* speech recognition

**Business objectives**
* speech recognition (assistants)
* chatbots (replacing technical support for most issues)
* finding an exact answer to a question in a document base (for example, a base of standards)
* evaluating opinion about a product in social networks

In [1]:
import nltk
# nltk.download()  # download lots of data

# From text to simple models

## Splitting into tokens
**Def.** Tokenization is splitting a sequence of characters into parts (tokens), possibly excluding some characters from consideration.

Naive approach: split the string on spaces and strip out punctuation marks.


*Tricia loved New York as loving New York could positively influence her career.*
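
A minimal sketch of the naive approach, using only the standard library. The helper name `naive_tokenize` is hypothetical; it splits on whitespace and strips punctuation from the edges of each piece.

```python
import string

def naive_tokenize(text):
    """Split on whitespace, then strip punctuation from both ends of each piece."""
    tokens = []
    for piece in text.split():
        token = piece.strip(string.punctuation)
        if token:  # skip pieces that consisted of punctuation only
            tokens.append(token)
    return tokens

sentence = ("Tricia loved New York as loving New York "
            "could positively influence her career.")
tokens = naive_tokenize(sentence)
print(tokens)
```

Note that "New" and "York" come out as two unrelated tokens, and that stripping punctuation would also turn "C++" into "C".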


**Problems:**
* Email and IP addresses: my.email@mail.ru, 127.0.0.1
* C++, C#
* York University vs New York University
* Language dependence (“Lebensversicherungsgesellschaftsangestellter”, “l’amour”)

Alternative: n-grams
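
One way to read the n-gram alternative: instead of deciding where words begin and end, take every run of n consecutive units (characters or tokens). The sketch below is illustrative only; the function name `ngrams` is an assumption and does not come from any library used in this notebook.

```python
def ngrams(sequence, n):
    """Return all runs of n consecutive items from a string or a list."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

# Character trigrams need no decision about word boundaries.
char_trigrams = ngrams("New York", 3)
print(char_trigrams)

# Word bigrams keep neighbouring words such as "New York" together.
word_bigrams = ngrams(["New", "York", "University"], 2)
print(word_bigrams)
```

The price is a much larger vocabulary, since every overlapping window becomes a separate feature.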

In [2]:
from nltk.tokenize import RegexpTokenizer
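# Raw string avoids invalid-escape warnings; matches runs of word characters or runs of punctuation
tokenizer = RegexpTokenizer(r'\w+|[^\w\s]+')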
s = u'Трисия любила Нью-Йорк, поскольку любовь к Нью-Йорку могла положительно повлиять на ее карьеру.'

for t in tokenizer.tokenize(s)[:7]: 
    print(t)

Трисия
любила
Нью
-
Йорк
,
поскольку


## ftfy: fixes text for you

In [5]:
from ftfy import fix_text
print(fix_text(u'\001\033[36;44mI&#x92;m blue, da ba dee da ba doo&#133;\033[0m', normalization='NFKC'))

I'm blue, da ba dee da ba doo...


## Stop-Words

The most frequent words in a language, which carry almost no information about the content of the text.

In [6]:
from nltk.corpus import stopwords
print(' '.join(stopwords.words('russian')[:20]))
print(' '.join(stopwords.words('english')[:20]))

и в во не что он на я с со как а то все она так его но да ты
i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his
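
A small sketch of how such a list is typically used: tokens found in the stop list are dropped before further processing. To stay self-contained it uses a tiny hand-picked stop set, which is an assumption for illustration and not the NLTK list printed above; `remove_stop_words` is a hypothetical helper name.

```python
# A tiny illustrative stop set; in practice one would more likely use
# set(stopwords.words('english')) from NLTK.
STOP_WORDS = {"i", "me", "my", "we", "you", "he", "the", "a", "to", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not in the stop set."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["I", "would", "like", "to", "thank", "you"]
filtered = remove_stop_words(tokens)
print(filtered)
```

Storing the stop words in a set rather than a list keeps each membership test constant-time.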


## Normalization

Bringing tokens to a uniform form in order to get rid of superficial differences in spelling.

Approaches:
* formulate a set of rules by which a token is transformed:
  Нью-Йорк → нью-йорк → ньюйорк → ньюиорк
* explicitly store relations between tokens (e.g. WordNet, Princeton):
  car → automobile, Windows ↛ window

In [7]:
s = 'Нью-Йорк'
s1 = s.lower()
print(s1)

нью-йорк


In [12]:
import re
s2 = re.sub(r"\W", "", s1)  # drop every non-word character (here, the hyphen)
print(s2)

ньюйорк


In [13]:
s3 = re.sub("й", u"и", s2)
print(s3)

ньюиорк
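
The three steps above can be collected into one rule-based function. This is a sketch under the same assumptions as the cells above (lowercase, drop non-word characters, map й to и); the name `normalize_token` is hypothetical.

```python
import re

def normalize_token(token):
    """Apply the rule chain: lowercase, drop non-word characters, replace й with и."""
    token = token.lower()
    token = re.sub(r"\W", "", token)  # removes hyphens, spaces and punctuation
    token = token.replace("й", "и")
    return token

print(normalize_token("Нью-Йорк"))
print(normalize_token("нью йорк"))
```

Both spellings collapse to the same string, which is the point of normalization: superficially different tokens end up as one feature.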


## Stemming and Lemmatization

Reducing the grammatical forms of a word, and cognate words, to a single stem or lemma:

* Stemming: uses simple heuristic rules
   * Porter (Cambridge, 1980): 5 stages, each applying a set of rules such as
      * sses → ss (caresses → caress)
      * ies → i (ponies → poni)
   * Lovins (1968)
   * Paice (1990)
   * others
* Lemmatization: uses dictionaries and morphological analysis
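
To make the rule format concrete, here is a sketch of a single rule group in the style of the Porter stemmer: the plural-handling rules that include the two examples above. It is not the full five-stage algorithm, and the names `PLURAL_RULES` and `strip_plural` are hypothetical. Rules are tried from the longest suffix down, and only the first match is applied.

```python
# (suffix, replacement) pairs, ordered so that the longest matching suffix wins.
PLURAL_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (left unchanged)
    ("s", ""),       # cats     -> cat
]

def strip_plural(word):
    """Apply the first matching suffix rule; return the word unchanged if none match."""
    for suffix, replacement in PLURAL_RULES:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", strip_plural(w))
```

The result need not be a dictionary word ("poni"); a stemmer only has to map related forms to the same string.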


## Stemming

In [16]:
from nltk.stem.snowball import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('Tokenization'))
print(stemmer.stem('stemming'))

token
stem


In [17]:
from nltk.stem.snowball import RussianStemmer
r = RussianStemmer()
print(r.stem('Авиация'))
print(r.stem('национальный'))

авиац
национальн


## Lemmatization
(usually works better for morphologically rich languages, including Russian)

In [26]:
import pymorphy2

morph = pymorphy2.MorphAnalyzer()
for i in morph.parse(u'замок'):
    print("Metadata: {}".format(i))
    print("Word: {} | Normal form: {}".format(i.word, i.normal_form))
    print('\n')

Metadata: Parse(word='замок', tag=OpencorporaTag('NOUN,inan,masc sing,nomn'), normal_form='замок', score=0.3333333333333333, methods_stack=((<DictionaryAnalyzer>, 'замок', 139, 0),))
Word: замок | Normal form: замок


Metadata: Parse(word='замок', tag=OpencorporaTag('NOUN,inan,masc sing,accs'), normal_form='замок', score=0.3333333333333333, methods_stack=((<DictionaryAnalyzer>, 'замок', 139, 3),))
Word: замок | Normal form: замок


Metadata: Parse(word='замок', tag=OpencorporaTag('VERB,perf,intr masc,sing,past,indc'), normal_form='замокнуть', score=0.3333333333333333, methods_stack=((<DictionaryAnalyzer>, 'замок', 730, 1),))
Word: замок | Normal form: замокнуть




## Document representation

* **Boolean model.** Records only the presence or absence of a word in the document.
* **Bag of words.** Records how often each token occurs; token order is ignored.

*The weather was terrible, the princess was beautiful.
Or was it the other way around?*

Vector coordinates:
* Multinomial: the number of occurrences of each token in the document
* Numeric: a weighted number of occurrences of each token in the document
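
The quoted joke can be checked directly: two sentences with opposite meanings can have identical bag-of-words representations. The sketch below uses `collections.Counter` as the bag and a set of keys as the Boolean model; the second sentence is an assumed "other way around" variant written for this example, and `bag_of_words` is a hypothetical helper.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Multinomial representation: token -> number of occurrences."""
    return Counter(re.findall(r"\w+", text.lower()))

a = "The weather was terrible, the princess was beautiful."
b = "The weather was beautiful, the princess was terrible."  # assumed variant

bag_a, bag_b = bag_of_words(a), bag_of_words(b)
print(bag_a == bag_b)            # word order is lost, so the bags are compared as multisets
print(set(bag_a) == set(bag_b))  # the Boolean model keeps only presence or absence
print(bag_a["the"], bag_a["was"])
```

Any model built on these representations cannot tell the two sentences apart, which is the main limitation of ignoring token order.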

In [27]:
from sklearn.feature_extraction import DictVectorizer

In [28]:
dvectorizer = DictVectorizer(sparse=False)
text_dict = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
X = dvectorizer.fit_transform(text_dict)
X

array([[2., 0., 1.],
       [0., 1., 3.]])

In [29]:
dvectorizer.inverse_transform(X)

[{'bar': 2.0, 'foo': 1.0}, {'baz': 1.0, 'foo': 3.0}]

In [30]:
dvectorizer.transform({'foo': 4, 'unseen_feature': 3})

array([[0., 0., 4.]])
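
What the cells above do can be sketched in plain Python: collect the feature names seen during fitting, sort them to fix the column order, and fill one row per document, ignoring features that were not seen. This illustrates the idea and is not scikit-learn's implementation; `fit_vocabulary` and `to_rows` are hypothetical names.

```python
def fit_vocabulary(dicts):
    """Map every feature name seen in the data to a column index (sorted by name)."""
    names = sorted({name for d in dicts for name in d})
    return {name: index for index, name in enumerate(names)}

def to_rows(dicts, vocabulary):
    """Build one dense row per dict; features missing from the vocabulary are skipped."""
    rows = []
    for d in dicts:
        row = [0.0] * len(vocabulary)
        for name, value in d.items():
            if name in vocabulary:
                row[vocabulary[name]] = float(value)
        rows.append(row)
    return rows

text_dict = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
vocabulary = fit_vocabulary(text_dict)
print(vocabulary)
print(to_rows(text_dict, vocabulary))
print(to_rows([{'foo': 4, 'unseen_feature': 3}], vocabulary))
```

The columns come out in the order bar, baz, foo, and `unseen_feature` is dropped because it was not present when the vocabulary was built.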

In [31]:
from collections import Counter

docs = [
    "Thank 40 you, Mr President.",
    "Madam President, I agree and recognise Turkey's European prospects, but if these prospects are to have an auspicious outcome, Turkey needs to:",
    "Madam President, firstly, I would like to express my sincerest thanks to the High Representative for including this important issue in the agenda at such an early stage.",
]

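tokenizer = RegexpTokenizer(r'\w+|[^\w\s]+')
# English stop words only (words() without an argument returns every language);
# a set gives constant-time membership tests
stopwords_eng = set(stopwords.words('english'))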

document_bags = list()

for d in docs:
    bag = Counter()
    text = d.lower()
    for t in tokenizer.tokenize(text):     
        if t in stopwords_eng:
            continue
        bag[t] += 1
    document_bags.append(bag)
    
document_bags

[Counter({'thank': 1, '40': 1, ',': 1, 'mr': 1, 'president': 1, '.': 1}),
 Counter({'madam': 1,
          'president': 1,
          ',': 3,
          'agree': 1,
          'recognise': 1,
          'turkey': 2,
          "'": 1,
          'european': 1,
          'prospects': 2,
          'auspicious': 1,
          'outcome': 1,
          'needs': 1,
          ':': 1}),
 Counter({'madam': 1,
          'president': 1,
          ',': 2,
          'firstly': 1,
          'would': 1,
          'like': 1,
          'express': 1,
          'sincerest': 1,
          'thanks': 1,
          'high': 1,
          'representative': 1,
          'including': 1,
          'important': 1,
          'issue': 1,
          'agenda': 1,
          'early': 1,
          'stage': 1,
          '.': 1})]

In [32]:
dvectorizer.fit_transform(document_bags)

array([[0., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [1., 3., 0., 0., 1., 0., 1., 1., 0., 1., 0., 0., 0., 0., 0., 0.,
        0., 1., 0., 1., 1., 1., 2., 1., 0., 0., 0., 0., 0., 2., 0.],
       [0., 2., 1., 0., 0., 1., 0., 0., 1., 0., 1., 1., 1., 1., 1., 1.,
        1., 1., 0., 0., 0., 1., 0., 0., 1., 1., 1., 0., 1., 0., 1.]])

In [35]:
dvectorizer.feature_names_

["'",
 ',',
 '.',
 '40',
 ':',
 'agenda',
 'agree',
 'auspicious',
 'early',
 'european',
 'express',
 'firstly',
 'high',
 'important',
 'including',
 'issue',
 'like',
 'madam',
 'mr',
 'needs',
 'outcome',
 'president',
 'prospects',
 'recognise',
 'representative',
 'sincerest',
 'stage',
 'thank',
 'thanks',
 'turkey',
 'would']

In [37]:
from sklearn.feature_extraction.text import CountVectorizer
sklearn_vectorizer = CountVectorizer(stop_words='english')
sklearn_vectorizer.fit_transform(docs).todense()

matrix([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
         0, 0, 1, 0, 0],
        [0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 2, 1, 0,
         0, 0, 0, 0, 2],
        [0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1,
         1, 1, 0, 1, 0]])

In [38]:
sklearn_vectorizer.vocabulary_

{'thank': 23,
 '40': 0,
 'mr': 14,
 'president': 17,
 'madam': 13,
 'agree': 2,
 'recognise': 19,
 'turkey': 25,
 'european': 5,
 'prospects': 18,
 'auspicious': 3,
 'outcome': 16,
 'needs': 15,
 'firstly': 7,
 'like': 12,
 'express': 6,
 'sincerest': 21,
 'thanks': 24,
 'high': 8,
 'representative': 20,
 'including': 10,
 'important': 9,
 'issue': 11,
 'agenda': 1,
 'early': 4,
 'stage': 22}

## TF-IDF

The number of occurrences of term $t$ in document $d$:
$$
TF_{t,d} = \text{term-frequency}(t, d)
$$
The number of documents, out of $N$ in total, in which $t$ occurs:
$$
DF_t = \text{document-frequency}(t)
$$
$$
IDF_t = \text{inverse-document-frequency}(t) = \log \frac{N}{DF_t}
$$
TF-IDF:
$$
TF\text{-}IDF_{t,d} = TF_{t,d} \times IDF_t
$$

TF-IDF measures the importance of a word within a document that is part of a corpus.
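
The formulas above can be computed by hand in a few lines. The sketch below uses a tiny pre-tokenized corpus invented for illustration and the natural logarithm (the base only rescales the weights); `tf_idf` is a hypothetical helper, not a library function.

```python
import math
from collections import Counter

# Tiny tokenized corpus (hypothetical, for illustration only).
corpus = [
    ["thank", "mr", "president"],
    ["madam", "president", "turkey", "turkey"],
    ["madam", "president", "agenda"],
]

def tf_idf(corpus):
    """Return one {term: weight} dict per document, with weight = TF * log(N / DF)."""
    n_docs = len(corpus)
    # DF_t: number of documents that contain term t at least once.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)  # TF_{t,d}: raw count of t in d
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

weights = tf_idf(corpus)
for doc_weights in weights:
    print(doc_weights)
```

A term that occurs in every document ("president") gets weight 0, while a term repeated in a single document ("turkey") gets the highest weight. By default, scikit-learn's `TfidfVectorizer` uses a smoothed variant, $\ln\frac{1+N}{1+DF_t} + 1$, and scales each document vector to unit L2 length, so its numbers differ from this raw formula.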

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english')
features = vectorizer.fit_transform(docs).todense()
features

matrix([[0.54645401, 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.54645401,
         0.        , 0.        , 0.32274454, 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.54645401, 0.        ,
         0.        ],
        [0.        , 0.        , 0.25882751, 0.25882751, 0.        ,
         0.25882751, 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.19684499, 0.        ,
         0.25882751, 0.25882751, 0.1528677 , 0.51765502, 0.25882751,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.51765502],
        [0.        , 0.26795858, 0.        , 0.        , 0.26795858,
         0.        , 0.26795858, 0.26795858, 0.26795858, 0.26795858,
         0.26795858, 0.26795858, 0.26795858, 0.20378941, 0.        ,
         0.        , 0.        , 0.15826066, 0.        , 0.

In [48]:
vectorizer.vocabulary_

{'thank': 23,
 '40': 0,
 'mr': 14,
 'president': 17,
 'madam': 13,
 'agree': 2,
 'recognise': 19,
 'turkey': 25,
 'european': 5,
 'prospects': 18,
 'auspicious': 3,
 'outcome': 16,
 'needs': 15,
 'firstly': 7,
 'like': 12,
 'express': 6,
 'sincerest': 21,
 'thanks': 24,
 'high': 8,
 'representative': 20,
 'including': 10,
 'important': 9,
 'issue': 11,
 'agenda': 1,
 'early': 4,
 'stage': 22}