<div style="font-size:18pt; padding-top:20px; text-align:center">SEMINAR 13. <b>Text Document Classification and </b> <span style="font-weight:bold; color:green">NumPy/SciPy/Sklearn</span></div><hr>
<div style="text-align:right;">Papulin S.Yu. <span style="font-style: italic;font-weight: bold;">(papulin_hse@mail.ru)</span></div>

<a name="0"></a>
<div><span style="font-size:14pt; font-weight:bold">Contents</span>
    <ol>
        <li><a href="#1">TF-IDF Transformation</a></li>
        <li><a href="#2">Bayesian Classifiers</a>
            <ol style = "list-style-type:lower-alpha">
                <li><a href="#2a">Bernoulli model</a></li>
                <li><a href="#2b">Multinomial model</a></li>
            </ol>
        </li>
        <li><a href="#3">References</a></li>
    </ol>
</div>

<a name="1"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">1. TF-IDF Transformation</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">Back to contents</a></div>
    </div>
</div>

<p>Source set of documents</p>

In [1]:
# Sample corpus; the first document contains stray whitespace and punctuation
docs = ["\n\t This is an \nexample of how to transform documents to TF-IDF vectors. Transform, transform?",
        "The example is below.",
        "The example is not so bad"]

<a name="1a"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            a. Vectorizing documents with CountVectorizer and TfidfTransformer
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#1">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#1b">Next</a>
            </div>
        </div>
    </div>
</div>

<p><b>Counting words in the documents</b></p>

<a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer">CountVectorizer</a>

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
count_vectorizer = CountVectorizer(analyzer="word", ngram_range=(1,1), 
                                   stop_words=None, lowercase=True,
                                   binary=False, strip_accents=None)
count_vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

<p>Fit the vectorizer on the set of documents</p>

In [4]:
count_model = count_vectorizer.fit(docs)

<p>The resulting vocabulary</p>

In [5]:
count_model.vocabulary_

{'an': 0,
 'bad': 1,
 'below': 2,
 'documents': 3,
 'example': 4,
 'how': 5,
 'idf': 6,
 'is': 7,
 'not': 8,
 'of': 9,
 'so': 10,
 'tf': 11,
 'the': 12,
 'this': 13,
 'to': 14,
 'transform': 15,
 'vectors': 16}

<p>Transforming the documents into word-count vectors</p>

In [6]:
tf_vectors = count_vectorizer.transform(docs)
tf_vectors.toarray()

array([[1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 2, 3, 1],
       [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0]], dtype=int64)
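<p>The counts above can be reproduced without scikit-learn, which makes the tokenization rules explicit. The following is a minimal sketch, assuming the default settings shown in the vectorizer's repr (lowercasing and the token pattern <code>(?u)\b\w\w+\b</code>). The names <code>tokenize</code>, <code>count_vector</code> and <code>manual_counts</code> are illustrative and not part of the library.</p>

```python
import re
from collections import Counter

# The seminar's corpus; leading whitespace in the first document
# does not affect the tokens.
docs = ["\n\t This is an \nexample of how to transform documents to TF-IDF vectors. Transform, transform?",
        "The example is below.",
        "The example is not so bad"]

# Same regular expression as CountVectorizer's default token_pattern:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    """Lowercase the document and split it into tokens."""
    return TOKEN_PATTERN.findall(doc.lower())


# Vocabulary: every distinct token, sorted alphabetically and numbered from 0.
all_tokens = sorted({token for doc in docs for token in tokenize(doc)})
vocabulary = {term: index for index, term in enumerate(all_tokens)}


def count_vector(doc):
    """Return the per-term counts of one document, in vocabulary order."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]


manual_counts = [count_vector(doc) for doc in docs]
for row in manual_counts:
    print(row)
```

<p>Single-character tokens never reach the vocabulary, because the pattern requires at least two word characters. The hyphen in "TF-IDF" is not a word character, so the term is split into <code>tf</code> and <code>idf</code>.</p>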

<p>Inverse transformation</p>

In [7]:
count_model.inverse_transform(tf_vectors)

[array(['an', 'documents', 'example', 'how', 'idf', 'is', 'of', 'tf',
        'this', 'to', 'transform', 'vectors'], 
       dtype='<U9'), array(['below', 'example', 'is', 'the'], 
       dtype='<U9'), array(['bad', 'example', 'is', 'not', 'so', 'the'], 
       dtype='<U9')]

<p>Using stop words</p>

In [8]:
count_vectorizer = CountVectorizer(analyzer="word", ngram_range=(1,1), 
                                   stop_words="english", lowercase=True,
                                   binary=False, strip_accents=None)

In [9]:
count_vectorizer.fit_transform(docs).toarray()

array([[0, 1, 1, 1, 1, 3, 1],
       [0, 0, 1, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0]], dtype=int64)
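<p>One effect of the built-in English list is easy to miss: it also removes negations such as "not", so the third document is reduced to "example" and "bad". The sketch below illustrates this with plain Python. <code>STOP_WORDS_SUBSET</code> is a hand-copied assumption that contains only the ten vocabulary words missing from the 7-column matrix above; it is not the library's full list.</p>

```python
import re
from collections import Counter

docs = ["\n\t This is an \nexample of how to transform documents to TF-IDF vectors. Transform, transform?",
        "The example is below.",
        "The example is not so bad"]

# Assumption: only the stop words that actually occur in this corpus.
# The full English list in scikit-learn is much longer.
STOP_WORDS_SUBSET = {"an", "below", "how", "is", "not",
                     "of", "so", "the", "this", "to"}


def tokenize_without_stop_words(doc):
    """Tokenize like the default vectorizer, then drop stop words."""
    tokens = re.findall(r"\b\w\w+\b", doc.lower())
    return [token for token in tokens if token not in STOP_WORDS_SUBSET]


filtered_vocabulary = sorted(
    {token for doc in docs for token in tokenize_without_stop_words(doc)})

filtered_counts = []
for doc in docs:
    counts = Counter(tokenize_without_stop_words(doc))
    filtered_counts.append([counts[term] for term in filtered_vocabulary])

print(filtered_vocabulary)
for row in filtered_counts:
    print(row)
```

<p>For sentiment-like tasks, losing "not" can change the meaning of a document, so a stop list is worth inspecting before it is applied.</p>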

<p>Displaying the stop words</p>

In [10]:
count_vectorizer.get_stop_words()

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

<p><b>Transformation to TF-IDF</b></p>

<a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer">TfidfTransformer</a>

In [11]:
from sklearn.feature_extraction.text import TfidfTransformer

In [12]:
tfidf_transformer = TfidfTransformer(use_idf=True, smooth_idf=False)
tfidf_transformer

TfidfTransformer(norm='l2', smooth_idf=False, sublinear_tf=False,
         use_idf=True)

In [13]:
tfidf_model = tfidf_transformer.fit(tf_vectors)

In [14]:
tfidf_vectors = tfidf_model.transform(tf_vectors)
tfidf_vectors.toarray()

array([[ 0.21589605,  0.        ,  0.        ,  0.21589605,  0.10287563,
         0.21589605,  0.21589605,  0.10287563,  0.        ,  0.21589605,
         0.        ,  0.21589605,  0.        ,  0.21589605,  0.4317921 ,
         0.64768816,  0.21589605],
       [ 0.        ,  0.        ,  0.72497497,  0.        ,  0.34545446,
         0.        ,  0.        ,  0.34545446,  0.        ,  0.        ,
         0.        ,  0.        ,  0.48552418,  0.        ,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.50619914,  0.        ,  0.        ,  0.2412066 ,
         0.        ,  0.        ,  0.2412066 ,  0.50619914,  0.        ,
         0.50619914,  0.        ,  0.33900746,  0.        ,  0.        ,
         0.        ,  0.        ]])
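<p>The matrix above can be recomputed by hand from the count vectors. The following is a minimal sketch, assuming the settings used in this cell: <code>smooth_idf=False</code>, for which scikit-learn documents idf = ln(n / df) + 1, raw counts as the term frequency, and <code>norm='l2'</code>. The helper <code>l2_normalize</code> and the variable names are illustrative.</p>

```python
import math

# Count matrix produced by CountVectorizer (rows: documents, columns: terms).
counts = [
    [1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 2, 3, 1],
    [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: the number of documents in which each term occurs.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Unsmoothed inverse document frequency: ln(n / df) + 1.
idf = [math.log(n_docs / df_j) + 1.0 for df_j in df]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector]


# Multiply each count by the term's IDF, then normalize every row.
tfidf = [l2_normalize([tf * weight for tf, weight in zip(row, idf)])
         for row in counts]

for row in tfidf:
    print([round(value, 8) for value in row])
```

<p>Because of the row normalization, each document vector has unit length, so the weights depend on the relative counts within a document and not on its size.</p>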

<a name="1b"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            b. Vectorizing documents with TfidfVectorizer
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#1a">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#2">Next</a>
            </div>
        </div>
    </div>
</div>

<a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html">TfidfVectorizer</a>

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

<p><b>TF-IDF</b></p>

In [16]:
vectorizer = TfidfVectorizer(lowercase=True, stop_words=None, 
                             use_idf=True, ngram_range=(1,1),
                             smooth_idf=False)                        
vectorizer

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=False,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

<p>Fit the vectorizer on the set of documents from which the vectors will be built</p>

In [17]:
vector_model = vectorizer.fit(docs)

<p>The resulting vocabulary</p>

In [18]:
vector_model.vocabulary_

{'an': 0,
 'bad': 1,
 'below': 2,
 'documents': 3,
 'example': 4,
 'how': 5,
 'idf': 6,
 'is': 7,
 'not': 8,
 'of': 9,
 'so': 10,
 'tf': 11,
 'the': 12,
 'this': 13,
 'to': 14,
 'transform': 15,
 'vectors': 16}

<p>IDF values for the vocabulary words</p>

In [19]:
vector_model.idf_

array([ 2.09861229,  2.09861229,  2.09861229,  2.09861229,  1.        ,
        2.09861229,  2.09861229,  1.        ,  2.09861229,  2.09861229,
        2.09861229,  2.09861229,  1.40546511,  2.09861229,  2.09861229,
        2.09861229,  2.09861229])
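<p>The three distinct values above correspond to terms that occur in one, two or three of the three documents. The sketch below compares the unsmoothed formula used here with the smoothed variant that scikit-learn applies when <code>smooth_idf=True</code> (its default), which adds one to both the document count and the document frequency. The function names are illustrative.</p>

```python
import math

n_docs = 3  # size of the example corpus


def idf_plain(df):
    """IDF with smooth_idf=False: ln(n / df) + 1."""
    return math.log(n_docs / df) + 1.0


def idf_smooth(df):
    """IDF with smooth_idf=True: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1.0


# df is the number of documents that contain the term.
for df in (1, 2, 3):
    print(df, round(idf_plain(df), 8), round(idf_smooth(df), 8))
```

<p>Smoothing behaves as if one extra document containing every term had been seen, which narrows the gap between rare and frequent terms.</p>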

<p>Transforming the documents into TF-IDF vectors</p>

In [20]:
tfidf_vector = vector_model.transform(docs)
tfidf_vector.toarray()

array([[ 0.21589605,  0.        ,  0.        ,  0.21589605,  0.10287563,
         0.21589605,  0.21589605,  0.10287563,  0.        ,  0.21589605,
         0.        ,  0.21589605,  0.        ,  0.21589605,  0.4317921 ,
         0.64768816,  0.21589605],
       [ 0.        ,  0.        ,  0.72497497,  0.        ,  0.34545446,
         0.        ,  0.        ,  0.34545446,  0.        ,  0.        ,
         0.        ,  0.        ,  0.48552418,  0.        ,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.50619914,  0.        ,  0.        ,  0.2412066 ,
         0.        ,  0.        ,  0.2412066 ,  0.50619914,  0.        ,
         0.50619914,  0.        ,  0.33900746,  0.        ,  0.        ,
         0.        ,  0.        ]])

<p>Inverse transformation</p>

In [21]:
vector_model.inverse_transform(tfidf_vector[0])

[array(['vectors', 'transform', 'to', 'this', 'tf', 'of', 'is', 'idf',
        'how', 'example', 'documents', 'an'], 
       dtype='<U9')]

<p><b>Stop words</b></p>

In [22]:
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english", 
                             use_idf=True, ngram_range=(1,1),
                             smooth_idf=False)                        
vectorizer

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=False,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [23]:
vector_model = vectorizer.fit(docs)

In [24]:
vector_model.vocabulary_

{'bad': 0,
 'documents': 1,
 'example': 2,
 'idf': 3,
 'tf': 4,
 'transform': 5,
 'vectors': 6}

In [25]:
vector_model.idf_

array([ 2.09861229,  2.09861229,  1.        ,  2.09861229,  2.09861229,
        2.09861229,  2.09861229])

<p>Displaying the stop words</p>

In [26]:
vector_model.get_stop_words()

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

<a name="2"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">2. Bayesian Classifiers</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">Back to contents</a></div>
    </div>
</div>

In [None]:
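from sklearn.datasets import fetch_20newsgroups
# train_test_split lives in sklearn.model_selection;
# the old sklearn.cross_validation module has been removed
from sklearn.model_selection import train_test_split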

<p><b>Source data</b></p>

<p>Loading the source data</p>

In [None]:
data = fetch_20newsgroups(subset="all", shuffle=True, random_state=123)

<p>Data files</p>

In [None]:
data.filenames

<p>Dataset description</p>

In [None]:
# Current scikit-learn versions store the description in the DESCR attribute
data.DESCR

<p>A single data item</p>

In [None]:
data.data[0]

In [None]:
len(data.data)

<p>Document classes</p>

In [None]:
data.target_names, len(data.target)

<p>Splitting the source data into training and test subsets</p>

In [None]:
data = fetch_20newsgroups(subset="all", 
                          shuffle=True, random_state=123)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=0)
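<p><code>train_test_split</code> also accepts a <code>stratify</code> argument that keeps the class proportions the same in both subsets. The sketch below shows the call on hypothetical toy data (<code>toy_docs</code> and <code>toy_labels</code> are invented for illustration) so that it runs without downloading the corpus; whether stratification is needed for the seminar's split is a judgment call.</p>

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten short "documents" in two balanced classes.
toy_docs = [f"document {i}" for i in range(10)]
toy_labels = [0, 1] * 5

# Hold out 20% of the items; stratify keeps the class balance in both parts,
# and random_state makes the shuffle reproducible.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_docs, toy_labels, test_size=0.2, random_state=0, stratify=toy_labels)

print(len(x_tr), len(x_te))
print(Counter(y_tr), Counter(y_te))
```

<p>With the seminar's data the same argument would be <code>stratify=data.target</code>.</p>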

<a name="2a"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            a. Bernoulli model
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#2">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#2b">Next</a>
            </div>
        </div>
    </div>
</div>
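
<p>A minimal sketch of the Bernoulli model, assuming scikit-learn's <code>BernoulliNB</code> with presence/absence features from <code>CountVectorizer(binary=True)</code>. The toy corpus, its labels and the variable names are hypothetical and chosen only so that the example runs quickly.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy corpus: label 1 for promotional text, 0 for work messages.
train_docs = ["free prize money now",
              "win money free offer",
              "meeting schedule project team",
              "project deadline team report"]
train_labels = [1, 1, 0, 0]

# binary=True records only whether a word occurs, not how often.
binary_vectorizer = CountVectorizer(binary=True)
x_bin_train = binary_vectorizer.fit_transform(train_docs)

# alpha=1.0 is Laplace smoothing of the per-class word probabilities.
bernoulli_nb = BernoulliNB(alpha=1.0).fit(x_bin_train, train_labels)

new_docs = ["free money offer", "team meeting report"]
predictions = bernoulli_nb.predict(binary_vectorizer.transform(new_docs))
print(predictions)
```

<p>Unlike the multinomial model, the Bernoulli model also penalizes a class for vocabulary words that are absent from the document. On the seminar's data the same two steps would be fitted on <code>x_train</code> and applied to <code>x_test</code>.</p>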

<a name="2b"></a>
<div style="display:table; width:100%">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-style:italic; font-weight:bold; font-size:12pt">
            b. Multinomial model
        </div>
        <div style="display:table-cell; border:1px solid lightgrey; width:20%">
            <div style="display:table-cell; width:10%; text-align:center; background-color:whitesmoke;">
                <a href="#2a">Back</a>
            </div>
            <div style="display:table-cell; width:10%; text-align:center;">
                <a href="#3">Next</a>
            </div>
        </div>
    </div>
</div>

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
vectorizer = TfidfVectorizer(lowercase=True, stop_words=None, 
                             use_idf=True, ngram_range=(1,1),
                             smooth_idf=False)                        

In [None]:
tfidf_train_vectors = vectorizer.fit_transform(x_train)

In [None]:
tfidf_test_vectors = vectorizer.transform(x_test)

In [None]:
m_multNB = MultinomialNB(alpha=1).fit(tfidf_train_vectors, y_train)

In [None]:
y_test_pred = m_multNB.predict(tfidf_test_vectors)

In [None]:
m_multNB.score(tfidf_test_vectors, y_test)
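
<p>For a classifier, <code>score</code> returns the mean accuracy, that is, the fraction of test documents whose predicted label equals the true one. A minimal sketch of that computation follows; the function <code>accuracy</code> and the example labels are hypothetical.</p>

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


# Hypothetical labels, for illustration only: four of five predictions match.
example_true = [0, 1, 2, 2, 1]
example_pred = [0, 1, 1, 2, 1]
print(accuracy(example_true, example_pred))
```

<p>Applied to the cells above, <code>accuracy(y_test, y_test_pred)</code> should agree with the value returned by <code>score</code>.</p>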

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
params = {"alpha": [0.001, 0.01, 0.1, 1]}

In [None]:
multNB = MultinomialNB()

In [None]:
# Search over alpha using the unfitted estimator defined in the previous cell
m_multNB_grid = GridSearchCV(multNB, params)

In [None]:
m_multNB_grid.fit(tfidf_train_vectors, y_train)

In [None]:
m_multNB_grid.cv_results_
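
<p>In the cells above the TF-IDF weights are fitted on the whole training set before cross-validation, so each validation fold has already contributed to the IDF values. One possible alternative, sketched here as an assumption and not as the seminar's method, is to wrap the vectorizer and the classifier in a <code>Pipeline</code> so that both are refitted inside every fold. The toy corpus and the step names <code>tfidf</code> and <code>clf</code> are hypothetical.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: four documents per class.
toy_docs = ["free prize money now", "win money free offer",
            "free offer win prize", "money prize offer now",
            "meeting schedule project team", "project deadline team report",
            "team meeting report schedule", "deadline project meeting report"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(smooth_idf=False)),
    ("clf", MultinomialNB()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"clf__alpha": [0.001, 0.01, 0.1, 1]}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(toy_docs, toy_labels)

print(search.best_params_)
print(search.best_score_)
for alpha, mean_score in zip(search.cv_results_["param_clf__alpha"],
                             search.cv_results_["mean_test_score"]):
    print(alpha, mean_score)
```

<p><code>best_params_</code> and <code>best_score_</code> summarize the search, while <code>cv_results_</code> holds the per-candidate details such as <code>mean_test_score</code>.</p>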

<a name="3"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">3. References</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">Back to contents</a></div>
    </div>
</div>

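<ol>
    <li><a href="http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction">http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction</a></li>
    <li><a href="http://scikit-learn.org/stable/datasets/twenty_newsgroups.html">http://scikit-learn.org/stable/datasets/twenty_newsgroups.html</a></li>
    <li><a href="http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html">http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html</a></li>
</ol>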