# Sentiment analysis of reviews

**Natural Language Toolkit, or NLTK**, is a suite of libraries and programs for symbolic and statistical natural language processing, written in Python. It includes graphical demonstrations and sample data. It comes with extensive documentation, including a book that explains the core concepts behind the language-processing tasks the toolkit supports.

First, load a sample of movie reviews from NLTK:

In [2]:
# conda install --name py27 nltk
# the corpus data must be downloaded separately, see https://www.nltk.org/data.html

from nltk.corpus import movie_reviews

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

print(negids[:5])

[u'neg/cv000_29416.txt', u'neg/cv001_19502.txt', u'neg/cv002_17424.txt', u'neg/cv003_12683.txt', u'neg/cv004_12641.txt']


Prepare a list of texts and a list of class labels to use as the training set:

In [3]:
negfeats = [" ".join(movie_reviews.words(fileids=[f])) for f in negids]
posfeats = [" ".join(movie_reviews.words(fileids=[f])) for f in posids]

texts = negfeats + posfeats
labels = [0] * len(negfeats) + [1] * len(posfeats)

In [5]:
print(texts[1])

the happy bastard ' s quick movie review damn that y2k bug . it ' s got a head start in this movie starring jamie lee curtis and another baldwin brother ( william this time ) in a story regarding a crew of a tugboat that comes across a deserted russian tech ship that has a strangeness to it when they kick the power back on . little do they know the power within . . . going for the gore and bringing on a few action sequences here and there , virus still feels very empty , like a movie going for all flash and no substance . we don ' t know why the crew was really out in the middle of nowhere , we don ' t know the origin of what took over the ship ( just that a big pink flashy thing hit the mir ) , and , of course , we don ' t know why donald sutherland is stumbling around drunkenly throughout . here , it ' s just " hey , let ' s chase these people around with some robots " . the acting is below average , even from the likes of curtis . you ' re more likely to get a kick out of her work i

Import the modules we need:

In [7]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC
# in older scikit-learn versions: from sklearn.cross_validation import cross_val_score
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

### Evaluating different classifiers

In [8]:
def text_classifier(vectorizer, transformer, classifier):
    return Pipeline(
            [("vectorizer", vectorizer),
            ("transformer", transformer),
            ("classifier", classifier)]
        )
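
To make the first two pipeline stages concrete, here is a minimal pure-Python sketch of what counting words and then applying TF-IDF weighting amounts to. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which scikit-learn documents as the `TfidfTransformer` defaults. The whitespace tokenizer and the names `count_words` and `tfidf_weight` are illustrative only; this is not a replacement for the library classes.

```python
import math
from collections import Counter


def count_words(docs):
    """Return the sorted vocabulary and one dense count vector per document."""
    # Simplification: split on whitespace; CountVectorizer uses a regex tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted(set(word for tokens in tokenized for word in tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocab])
    return vocab, rows


def tfidf_weight(rows):
    """Apply smoothed IDF weighting and L2-normalize every row."""
    n_docs = len(rows)
    n_terms = len(rows[0])
    # Document frequency: in how many documents each term occurs.
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1.0 + n_docs) / (1.0 + d)) + 1.0 for d in df]
    weighted = []
    for row in rows:
        vec = [count * w for count, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in vec))
        weighted.append([x / norm for x in vec] if norm > 0 else vec)
    return weighted


docs = ["good movie", "bad movie", "good good plot"]
vocab, counts = count_words(docs)
weights = tfidf_weight(counts)
```

In the second toy document, "bad" ends up with a larger weight than "movie" because it occurs in fewer documents. That is the effect the `transformer` step adds on top of raw counts.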

In [9]:
for clf in [LogisticRegression, LinearSVC, SGDClassifier]:
    print(clf)
    print(cross_val_score(text_classifier(CountVectorizer(), TfidfTransformer(), clf()), texts, labels).mean())
    print("\n")

<class 'sklearn.linear_model.logistic.LogisticRegression'>
0.8135111159063255


<class 'sklearn.svm.classes.LinearSVC'>
0.8455071838305371


<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'>
0.8400271529014044
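
`cross_val_score` splits the data into folds, fits the pipeline on all folds but one, scores it on the held-out fold, and returns one score per fold; `.mean()` averages them. Because `texts` lists all negative reviews before all positive ones, every fold needs a share of both classes, which is why a stratified split matters for classifiers. The sketch below illustrates that idea with a simple round-robin stratified split in pure Python. It shows the concept only and is not scikit-learn's exact splitting algorithm. `stratified_folds` is a hypothetical helper name. The default of three folds is an assumption that matches the three scores printed per run later in this notebook; newer scikit-learn releases default to five.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds=3):
    """Yield (train_indices, test_indices) pairs with classes spread over folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Deal the indices of each class round-robin into the folds.
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)

    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


# Toy labels ordered like the notebook's: negatives first, then positives.
toy_labels = [0] * 6 + [1] * 6
splits = list(stratified_folds(toy_labels, n_folds=3))
for train, test in splits:
    print(len(train), len(test), sorted(toy_labels[i] for i in test))
```

Each held-out fold in the toy run contains two negatives and two positives, so no fold is scored on a single class.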




### Preparing a classifier trained on all the data

In [8]:
clf_pipeline = Pipeline(
            [("vectorizer", TfidfVectorizer()),
            ("classifier", LinearSVC())]
        )


clf_pipeline.fit(texts, labels)

print(clf_pipeline)

Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_id...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])


In [9]:
print(clf_pipeline.predict(["Amazing film! I will advice it to all my friends. Genious",
                            "Awful film! The man who advised me to watch it is really crazy idiot."]))

[1 0]
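
The predicted labels follow the encoding chosen when `labels` was built: 0 for negative reviews and 1 for positive ones. A small helper along these lines can make predictions easier to read. `LABEL_NAMES` and `describe_predictions` are hypothetical names, and the hard-coded predictions simply mirror the `[1 0]` output above instead of calling the fitted pipeline.

```python
# Mapping implied by labels = [0] * len(negfeats) + [1] * len(posfeats).
LABEL_NAMES = {0: "negative", 1: "positive"}


def describe_predictions(samples, predictions):
    """Pair each input text with a human-readable class name."""
    return [(text, LABEL_NAMES[int(label)]) for text, label in zip(samples, predictions)]


samples = ["Amazing film!", "Awful film!"]
predictions = [1, 0]  # stand-in for clf_pipeline.predict(samples)
described = describe_predictions(samples, predictions)
for text, name in described:
    print("%s -> %s" % (text, name))
```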


## Dimensionality reduction and tree ensembles

In [10]:
%%time
from sklearn.decomposition import NMF, TruncatedSVD

v = CountVectorizer()
mx = v.fit_transform(texts)
mf = TruncatedSVD(10)
u = mf.fit_transform(mx)

CPU times: user 3.74 s, sys: 100 ms, total: 3.84 s
Wall time: 2.45 s
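
`TruncatedSVD(10)` keeps only the 10 strongest directions of the document-term matrix, so each review is described by 10 numbers instead of one count per vocabulary word. As a rough illustration of what a single such direction is, the sketch below estimates the top singular value and right singular vector of a tiny matrix by power iteration in pure Python. This is a different and much simpler algorithm than the solvers scikit-learn uses. The matrix and the function name `top_singular_direction` are made up for the example.

```python
import math


def top_singular_direction(matrix, n_iter=200):
    """Estimate the largest singular value and its right singular vector."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # start from a uniform unit vector
    for _ in range(n_iter):
        # u = A v
        u = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        # w = A^T u, i.e. one step of power iteration on A^T A
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every row onto the direction: the 1-component representation.
    projected = [sum(a * b for a, b in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in projected))
    return sigma, v, projected


# Toy "document-term" counts: 3 documents, 2 terms.
toy_counts = [[3.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
sigma, direction, reduced = top_singular_direction(toy_counts)
```

Here the dominant direction is the first term, so the first and third documents keep their magnitude in the reduced representation while the second is projected to nearly zero.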


In [11]:
# NMF: non-negative matrix factorization
for transform in [TruncatedSVD, NMF]:
    print(transform)
    print(cross_val_score(text_classifier(CountVectorizer(), transform(n_components=10), LinearSVC()), texts, labels).mean())
    print("\n")


<class 'sklearn.decomposition.truncated_svd.TruncatedSVD'>
0.5555435675196154


<class 'sklearn.decomposition.nmf.NMF'>
0.6430082777388167







With n_components=1000:

In [12]:
%%time
print(cross_val_score(text_classifier(TfidfVectorizer(), TruncatedSVD(n_components=1000), LinearSVC()),
                      texts,
                      labels
                     ).mean())

0.842014169858
CPU times: user 3min 11s, sys: 13.5 s, total: 3min 25s
Wall time: 2min 24s


## Tree ensembles on transformed features

In [12]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

In [15]:
%%time
print(cross_val_score(
    Pipeline([
            ("vectorizer", CountVectorizer()),
            ("transformer", TruncatedSVD(100)),
            ("classifier", RandomForestClassifier(100))
        ]),
    texts,
    labels
    ))

[ 0.70209581  0.71621622  0.69069069]
CPU times: user 13.9 s, sys: 600 ms, total: 14.5 s
Wall time: 14.3 s


More components and more trees:

In [19]:
%%time
print(cross_val_score(text_classifier(CountVectorizer(), TruncatedSVD(n_components=1000), RandomForestClassifier(1000)),
                      texts,
                      labels
                     ).mean())

0.561485137832
CPU times: user 4min 43s, sys: 13.6 s, total: 4min 57s
Wall time: 3min 56s


TF-IDF instead of raw word counts:

In [18]:
%%time
print(cross_val_score(text_classifier(TfidfVectorizer(), TruncatedSVD(n_components=1000), RandomForestClassifier(1000)),
                      texts,
                      labels
                     ).mean())

0.590001678325
CPU times: user 4min 39s, sys: 14.3 s, total: 4min 53s
Wall time: 3min 52s


## Combining TF-IDF and SVD

In [16]:
from sklearn.pipeline import FeatureUnion

estimators = [('tfidf', TfidfTransformer()), ('svd', TruncatedSVD(1))]
combined = FeatureUnion(estimators)

In [17]:
%%time
print(cross_val_score(
    Pipeline([
            ("vectorizer", CountVectorizer()),
            ("transformer", combined),
            ("classifier", LinearSVC())
        ]),
    texts,
    labels
    ))

[ 0.74251497  0.78978979  0.62912913]
CPU times: user 10.7 s, sys: 24 ms, total: 10.7 s
Wall time: 10.7 s
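
`FeatureUnion` applies each of its transformers to the same input and concatenates the resulting feature columns side by side, so in the cell above the classifier sees every TF-IDF feature plus one extra SVD component. The pure-Python sketch below mimics only that concatenation step on dense lists. `union_features` and the two toy transformers are hypothetical stand-ins, not scikit-learn objects.

```python
def union_features(rows, transformers):
    """Apply every transformer to all rows and join the outputs column-wise."""
    outputs = [transform(rows) for transform in transformers]
    return [
        [value for output in outputs for value in output[i]]
        for i in range(len(rows))
    ]


def scale_to_unit_sum(rows):
    """Toy stand-in for a reweighting step: each row is scaled to sum to 1."""
    return [[x / float(sum(row)) for x in row] for row in rows]


def row_total(rows):
    """Toy stand-in for a 1-component reduction: one number per row."""
    return [[float(sum(row))] for row in rows]


toy_counts = [[2, 2], [1, 3]]
combined_rows = union_features(toy_counts, [scale_to_unit_sum, row_total])
print(combined_rows)
```

Each output row has the two reweighted columns followed by the single reduced column, which mirrors how the union widens the feature matrix without dropping the original features.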
