# Texts

**Bag of words** - encode each token of the corpus as a feature of a document: a binary flag (the token occurs in the document or not) or, as in the examples below, the number of its occurrences.

pros:
* easy to use

cons:
* the number of features grows quickly with the vocabulary size
* sentences made of the same words can have different meanings ("I have no" vs. "No, I have")
* the same word can have different meanings in different contexts
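
The second drawback can be checked with a minimal sketch of the binary encoding. The helper `binary_bag_of_words` and its simplistic tokenizer (lowercasing, dropping commas, splitting on whitespace) are assumptions made for this illustration, not part of any library.

```python
def binary_bag_of_words(documents):
    """Encode each document as a 0/1 vector over the corpus vocabulary."""
    # Simplistic tokenizer: lowercase, replace commas with spaces, split on whitespace.
    tokenized = [doc.lower().replace(',', ' ').split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = [[int(term in tokens) for term in vocabulary] for tokens in tokenized]
    return vocabulary, vectors


vocabulary, vectors = binary_bag_of_words(['I have no', 'No, I have'])
print(vocabulary)  # ['have', 'i', 'no']
print(vectors)     # [[1, 1, 1], [1, 1, 1]]
```

Both sentences get the same vector, so a model that sees only these features cannot tell them apart.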

*Basic variant*

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import euclidean

import pandas as pd
import numpy as np

vect = CountVectorizer()
documentA = 'the man went out for a walk'
documentB = 'the children sat around the fire'

data = vect.fit_transform([documentA, documentB]).toarray()
# Column i corresponds to the i-th term of the vocabulary in alphabetical order
pd.DataFrame(data, columns=sorted(vect.vocabulary_))

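| | around | children | fire | for | man | out | sat | the | walk | went |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 2 | 0 | 0 |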


*Exclude stop words*

In [3]:
vect_stop = CountVectorizer(stop_words='english')
documentA = 'the man went out for a walk'
documentB = 'the children sat around the fire'

data_stop = vect_stop.fit_transform([documentA, documentB]).toarray()
pd.DataFrame(data_stop, columns=sorted(vect_stop.vocabulary_))

| | children | man | sat | walk | went |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | 0 |


*N-grams from words*

Helps to capture differences in word order and word combinations.

In [6]:
vect_ngw = CountVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
documentA = 'I have no apple'
documentB = 'No, I have apple'

# The default token pattern drops single-character tokens such as 'I'
data_ngw = vect_ngw.fit_transform([documentA, documentB]).toarray()
pd.DataFrame(data_ngw, columns=sorted(vect_ngw.vocabulary_))

| | apple | have | have apple | have no | no | no apple | no have |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 |
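
To make the mechanics explicit, here is a rough sketch of how word n-grams can be generated by hand. The helper `word_ngrams` is hypothetical and uses a simplistic tokenizer that, unlike `CountVectorizer` with default settings, keeps the single-character token "i", so it yields "no i" where the table above has "no have".

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, in order of appearance."""
    # Simplistic tokenizer: lowercase, replace commas with spaces, split on whitespace.
    tokens = text.lower().replace(',', ' ').split()
    ngrams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams


a = set(word_ngrams('I have no apple'))
b = set(word_ngrams('No, I have apple'))
print(sorted(a - b))  # ['have no', 'no apple']
print(sorted(b - a))  # ['have apple', 'no i']
```

The unigrams of the two sentences are identical; only the bigrams distinguish them.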


*N-grams from part of words*

Helps to find similarity between words that share parts.

In [31]:
# Character trigrams built inside word boundaries (words are padded with spaces)
vect_ngchar = CountVectorizer(ngram_range=(3, 3), analyzer='char_wb')

data_ngchar = vect_ngchar.fit_transform(['beauty', 'beautiful']).toarray()

pd.DataFrame(data_ngchar, columns=sorted(vect_ngchar.vocabulary_))

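| | `" be"` | `"aut"` | `"bea"` | `"eau"` | `"ful"` | `"ifu"` | `"tif"` | `"ty "` | `"ul "` | `"uti"` | `"uty"` |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |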


In [32]:
euclidean(data_ngchar[0], data_ngchar[1])

2.6457513110645907
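
The distance above can be reproduced by hand. The sketch below assumes that each word is padded with one space on each side before trigrams are cut, which is consistent with the `char_wb` features shown in the table; the helper `char_wb_ngrams` is made up for this illustration.

```python
from math import sqrt


def char_wb_ngrams(word, n=3):
    """Character n-grams of a single word padded with one space on each side."""
    padded = f' {word} '
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


grams_a = set(char_wb_ngrams('beauty'))
grams_b = set(char_wb_ngrams('beautiful'))

shared = grams_a & grams_b
different = grams_a ^ grams_b  # n-grams present in exactly one of the two words

# Each n-gram occurs at most once per word here, so the count vectors are 0/1
# and the Euclidean distance is the square root of the number of differing n-grams.
distance = sqrt(len(different))
print(sorted(shared))  # [' be', 'aut', 'bea', 'eau']
print(distance)        # 2.6457513110645907
```

Four shared trigrams and seven differing ones give a distance of the square root of 7.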

**TF-IDF**

$$\text{TF-IDF}_{i,j} = TF_{i,j} \cdot IDF_i$$

TF - term frequency: the number of times word $i$ appears in document $j$, divided by the total number of words in that document. Every document has its own term frequencies.
$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

IDF - inverse document frequency: the logarithm of the number of documents in the corpus, $N$, divided by the number of documents that contain word $i$, $df_i$. It gives more weight to words that are rare across the corpus and less weight to words that occur in most documents.

$$IDF_i = \log\left(\frac{N}{df_i}\right)$$
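
As a worked example, take the two documents used throughout this section, A = "the man went out for a walk" (7 words) and B = "the children sat around the fire" (6 words), and assume the natural logarithm. The word "man" occurs once in A and in one of the two documents:

$$TF_{\text{man},A} = \frac{1}{7} \approx 0.1429, \qquad IDF_{\text{man}} = \ln\frac{2}{1} \approx 0.6931, \qquad TF_{\text{man},A} \cdot IDF_{\text{man}} \approx 0.0990$$

The word "the" occurs in both documents, so its weight vanishes regardless of how often it appears:

$$IDF_{\text{the}} = \ln\frac{2}{2} = 0$$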

*Sklearn*

In [37]:
documentA = 'the man went out for a walk'
documentB = 'the children sat around the fire'

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

vectors = vectorizer.fit_transform([documentA, documentB])
# scikit-learn >= 1.0; older versions use get_feature_names()
feature_names = vectorizer.get_feature_names_out()
df = pd.DataFrame(vectors.toarray(), columns=feature_names)

In [39]:
df

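| | around | children | fire | for | man | out | sat | the | walk | went |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.42616 | 0.42616 | 0.42616 | 0.0 | 0.303216 | 0.42616 | 0.42616 |
| 1 | 0.407401 | 0.407401 | 0.407401 | 0.0 | 0.0 | 0.0 | 0.407401 | 0.579739 | 0.0 | 0.0 |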
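
These values do not follow the plain formula above: "the" occurs in both documents yet has a non-zero weight. With default settings `TfidfVectorizer` uses the raw count as TF, a smoothed IDF of the form $\ln\frac{1+N}{1+df} + 1$, and then scales every row to unit Euclidean length. The sketch below reproduces the table under those assumptions, with the token lists written out by hand ("a" is omitted because the default tokenizer drops single-character tokens).

```python
from math import log, sqrt

docs = [
    ['the', 'man', 'went', 'out', 'for', 'walk'],
    ['the', 'children', 'sat', 'around', 'the', 'fire'],
]
n_docs = len(docs)
vocabulary = sorted({term for doc in docs for term in doc})

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = {}
for term in vocabulary:
    df = sum(term in doc for doc in docs)
    idf[term] = log((1 + n_docs) / (1 + df)) + 1

rows = []
for doc in docs:
    raw = [doc.count(term) * idf[term] for term in vocabulary]  # raw count as TF
    norm = sqrt(sum(value * value for value in raw))            # L2 norm of the row
    rows.append({term: round(value / norm, 6) for term, value in zip(vocabulary, raw)})

print(rows[0]['the'], rows[0]['man'])     # 0.303216 0.42616
print(rows[1]['the'], rows[1]['around'])  # 0.579739 0.407401
```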


*From scratch*

In [42]:
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

documentA = 'the man went out for a walk'
documentB = 'the children sat around the fire'

bagOfWordsA = documentA.split(' ')
bagOfWordsB = documentB.split(' ')
stop_words = set(stopwords.words('english'))

# Keep repeated words so that the counts below reflect real term frequencies
filtered_a = [w for w in bagOfWordsA if w not in stop_words]
filtered_b = [w for w in bagOfWordsB if w not in stop_words]

unique_words = set(filtered_a).union(filtered_b)

In [43]:
numOfWordsA = dict.fromkeys(unique_words, 0)
for word in filtered_a:
    numOfWordsA[word] += 1
    
numOfWordsB = dict.fromkeys(unique_words, 0)
for word in filtered_b:
    numOfWordsB[word] += 1 

In [48]:
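from math import log


def computeTF(dict_of_words, bag_of_words):
    # TF = word count / total number of tokens in the document.
    # bag_of_words is the unfiltered token list, so stop words count towards the total.
    dict_tf = {}
    words_cnt = len(bag_of_words)
    for word, count in dict_of_words.items():
        dict_tf[word] = count / words_cnt
    return dict_tf

def computeIDF(documents):
    # documents: list of {word: count} dicts that share the same keys
    dict_idf = dict.fromkeys(documents[0].keys(), 0)
    doc_cnt = len(documents)

    # Document frequency: in how many documents each word occurs
    for document in documents:
        for word, count in document.items():
            if count > 0:
                dict_idf[word] += 1

    # IDF = ln(N / df); df >= 1 because every key occurs in at least one document
    for word, count in dict_idf.items():
        dict_idf[word] = log(doc_cnt / count)

    return dict_idf

def calculate_tfidf(tf, idf):
    # TF-IDF = TF * IDF for every word
    dict_tfidf = {}
    for word, tf_value in tf.items():
        dict_tfidf[word] = tf_value * idf[word]

    return dict_tfidf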

In [49]:
tf_a = computeTF(numOfWordsA, bagOfWordsA)
tf_b = computeTF(numOfWordsB, bagOfWordsB)

idfs = computeIDF([numOfWordsA, numOfWordsB])

tf_idfs_a = calculate_tfidf(tf_a, idfs)
tf_idfs_b = calculate_tfidf(tf_b, idfs)

df = pd.DataFrame([tf_idfs_a, tf_idfs_b])

In [50]:
df

| | around | sat | walk | children | fire | went | man |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.099021 | 0.0 | 0.0 | 0.099021 | 0.099021 |
| 1 | 0.115525 | 0.115525 | 0.0 | 0.115525 | 0.115525 | 0.0 | 0.0 |


**Stemming, lemmatization**

* Stemming - reducing a word to its stem by stripping affixes; the result is not necessarily a dictionary word
* Lemmatization - reducing a word to its normal (dictionary) form, the lemma

*Libraries* - nltk, pymorphy2 (Russian morphology)
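
The difference between the two can be illustrated with a toy sketch. The suffix list and the lemma dictionary below are invented for this example and are far simpler than what nltk or pymorphy2 do internally.

```python
# Toy illustration only: SUFFIXES and LEMMAS are made up for this example.
SUFFIXES = ('ing', 'ed', 'es', 's')
LEMMAS = {'created': 'create', 'creating': 'create', 'went': 'go', 'better': 'good'}


def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary of known forms; fall back to the word itself."""
    return LEMMAS.get(word, word)


for word in ('created', 'creating', 'went'):
    print(word, toy_stem(word), toy_lemmatize(word))
# created creat create
# creating creat create
# went went go
```

Rule-based stripping is cheap but produces non-words such as "creat" and cannot handle irregular forms such as "went", while the dictionary lookup returns a real word but only for forms it knows.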

In [8]:
# Stemming
import nltk

stemmer = nltk.stem.snowball.EnglishStemmer()

print(stemmer.stem('created'), stemmer.stem('writing'))

creat write


In [14]:
# Lemmatization
import pymorphy2

morph = pymorphy2.MorphAnalyzer()

# 'играющих' is an inflected participle ("playing"); its normal form 'играть' means "to play"
print(morph.parse('играющих')[0].normal_form)

играть
