Hotels play a crucial role in travel, and wider access to information has opened up new ways of choosing the best ones.
With this dataset of about 20 thousand reviews collected from Tripadvisor, you can explore what makes a hotel great, and perhaps even use the resulting model on your own travels!

[Dataset](https://www.kaggle.com/andrewmvd/trip-advisor-hotel-reviews)  
&copy; Alam, M. H., Ryu, W.-J., Lee, S., 2016. Joint multi-grain topic sentiment: modeling semantic aspects for online reviews. Information Sciences 339, 206–223.

In [1]:
import pandas as pd
import dill
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
from sklearn.metrics import roc_auc_score,roc_curve
from sklearn.metrics import f1_score
#working with text
from sklearn.feature_extraction.text import TfidfVectorizer
#normalizing data
from sklearn.preprocessing import StandardScaler
#pipeline
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import precision_score,recall_score
#imputer
from sklearn.impute import SimpleImputer

import sklearn.datasets

In [2]:
df=pd.read_csv('tripadvisor_hotel_reviews.csv.zip')

In [3]:
df

Unnamed: 0,Review,Rating
0,nice hotel expensive parking got good deal sta...,4
1,ok nothing special charge diamond member hilto...,2
2,nice rooms not 4* experience hotel monaco seat...,3
3,"unique, great stay, wonderful time hotel monac...",5
4,"great stay great stay, went seahawk game aweso...",5
...,...,...
20486,"best kept secret 3rd time staying charm, not 5...",5
20487,great location price view hotel great quick pl...,4
20488,"ok just looks nice modern outside, desk staff ...",2
20489,hotel theft ruined vacation hotel opened sept ...,1


The model's goal is to predict the sentiment of a review:  
  0 - negative (rating 1-3),  
  1 - positive (rating 4-5)

In [4]:
def get_sentiment(x):
    if x<=3:
        return 0
    return 1

df['Sentiment']=df['Rating'].apply(get_sentiment)

In [5]:
df['Sentiment'].value_counts()

1    15093
0     5398
Name: Sentiment, dtype: int64
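
The counts above show that the classes are imbalanced, which matters when reading accuracy later on. As a rough reference point, the following sketch computes the accuracy of a trivial classifier that always predicts the positive class. The counts are copied from the output above; the variable names are only illustrative.

```python
# Class counts copied from the value_counts() output above
positive, negative = 15093, 5398
total = positive + negative

# Share of the majority (positive) class; a model that always predicts 1
# would reach exactly this accuracy
majority_share = positive / total

print(f"total reviews: {total}")
print(f"majority-class baseline accuracy: {majority_share:.3f}")
```

Any model worth keeping should beat this baseline clearly, and because of the imbalance it is worth looking at per-class recall rather than accuracy alone.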

The Rating column is no longer needed, so we can drop it.

In [6]:
df=df.drop('Rating', axis=1)

Split the dataset into training, validation, and test sets, and save the test set to disk.

In [7]:
X_train, X_test, y_train, y_test=train_test_split(df.iloc[:,:-1], df.iloc[:,-1], test_size= 0.2, random_state=24, stratify=df['Sentiment'])
X_test.to_csv("X_test.csv", index=None)
y_test.to_csv("y_test.csv", index=None)

In [8]:
X_train, X_val, y_train, y_val=train_test_split(X_train, y_train, test_size= 0.25, random_state=24, stratify=y_train)
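
The two calls above first hold out 20% of the data for testing and then take 25% of the remainder for validation, which works out to roughly a 60/20/20 split. The sketch below reproduces the arithmetic. It assumes that `train_test_split` rounds a fractional test size up to a whole number of rows, which is my understanding of its behaviour rather than something stated in this notebook.

```python
import math

n_total = 20491  # 15093 positive + 5398 negative reviews

# Assumption: for a float test_size, train_test_split rounds the test part up
n_test = math.ceil(0.2 * n_total)
n_rest = n_total - n_test

# Second split: 25% of the remaining rows become the validation set
n_val = math.ceil(0.25 * n_rest)
n_train = n_rest - n_val

for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_total:.1%})")
```

Under that assumption the validation set has 4098 rows, which agrees with the support reported by `classification_report` further down.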

In [9]:
import regex
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\asavv\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [15]:
class ColumnSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]
    
class NumberSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on numeric columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]

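class FeatImp(BaseEstimator, TransformerMixin):
    """
    Transformer that adds a 'lenght' column holding the number of words in the text column
    """
    def __init__(self, key):
        self.key = key
        
    def fit(self, X, y=None):
        self.new_key='lenght'
        return self

    def transform(self, X):
        # work on a copy so the caller's data frame is not modified in place
        X = X.copy()
        X[self.new_key]=X[self.key].apply(lambda x: len(x.strip().split()))
        return X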
    
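class TextLemmatizer(BaseEstimator, TransformerMixin):
    """
    Transformer that cleans the text column, removes English stop words and lemmatizes the remaining words
    """
    def __init__(self, key):
        self.key = key
        
    def fit(self, X, y=None):
        # a set gives constant-time stop-word lookups
        self.stop_words=set(stopwords.words('english'))
        self.lemmatizer=WordNetLemmatizer()
        return self
    
    def transform(self, X):
        # work on a copy so the caller's data frame is not modified in place
        X = X.copy()
        # regex=True is passed explicitly: newer pandas versions treat the pattern as a literal string by default
        X[self.key] = X[self.key].str.replace(r"http\S+", "", regex=True)  # drop URLs
        X[self.key] = X[self.key].str.replace(r"http", "", regex=True)
        X[self.key] = X[self.key].str.replace(r"@\S+", "", regex=True)  # drop @mentions
        X[self.key] = X[self.key].str.replace(r"[^A-Za-z0-9(),!?@\'\`\"\_\n]", " ", regex=True)
        X[self.key] = X[self.key].str.replace(r"@", " at ", regex=True)
        X[self.key] = X[self.key].str.lower()
        X[self.key] = X[self.key].apply(self._lemm)
        return X
    
    def _lemm(self, seq):
        # split on whitespace, drop stop words, lemmatize what is left
        review = seq.split()
        review = [word for word in review if word not in self.stop_words]
        review = [self.lemmatizer.lemmatize(word) for word in review]
        return ' '.join(review)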
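
To make the cleaning steps in `TextLemmatizer.transform` easier to follow, here is a minimal sketch of the same substitutions applied to a single string with the standard `re` module. The helper name `clean` and the sample sentence are made up for illustration, and the sketch assumes the mention pattern is meant to be `@\S+` (an `@` followed by any run of non-whitespace characters). Stop-word removal and lemmatization are left out so the example does not depend on NLTK data.

```python
import re

def clean(text):
    """Apply the same substitutions as the transformer, on one string."""
    text = re.sub(r"http\S+", "", text)        # drop URLs
    text = re.sub(r"@\S+", "", text)           # drop @mentions (assumed intent)
    # replace every character outside the allowed set with a space
    text = re.sub(r"[^A-Za-z0-9(),!?@'`\"_\n]", " ", text)
    text = text.replace("@", " at ")
    # lower-case and collapse repeated whitespace
    return " ".join(text.lower().split())

sample = "Great stay, see http://example.com thanks @hotel_staff!"
cleaned = clean(sample)
print(cleaned)

# A pattern written with a forward slash looks for a literal "/S" after the "@",
# so an ordinary mention is left untouched
untouched = re.sub(r"@/S+", "", "@hotel_staff")
print(untouched)
```

The second call shows why the direction of the slash matters: `\S` is the non-whitespace class, while `/S` is just two literal characters.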
        

In [17]:
text_feat=Pipeline([
    ('lemm', TextLemmatizer(key='Review')),
    ('selector', ColumnSelector(key='Review')),
    ('tfidf', TfidfVectorizer(ngram_range=(1, 3),  max_features=10000, tokenizer = word_tokenize))
    
])
next_feat=Pipeline([    
    ('selector', NumberSelector(key='lenght'))
])

In [18]:
features=FeatureUnion([
    ('rewiew', text_feat),
    ('lenght', next_feat)
])
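
`FeatureUnion` runs each branch on the same input and concatenates the results column-wise, so the final matrix holds the TF-IDF columns followed by the word-count column. The toy sketch below illustrates that layout. The data frame `toy`, the branch names, and the use of `FunctionTransformer` in place of the custom selector classes are all assumptions made to keep the example short and self-contained.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Three made-up reviews plus the word-count feature (same column name as above)
toy = pd.DataFrame({"Review": ["nice hotel", "bad room bad staff", "great stay"]})
toy["lenght"] = toy["Review"].str.split().str.len()

# Text branch: pick the text column, then vectorize it
text_branch = Pipeline([
    ("select", FunctionTransformer(lambda X: X["Review"])),
    ("tfidf", TfidfVectorizer()),
])
# Numeric branch: pick the word-count column as a 2-D array
length_branch = FunctionTransformer(lambda X: X[["lenght"]].to_numpy())

union = FeatureUnion([("text", text_branch), ("length", length_branch)])
matrix = union.fit_transform(toy)

n_terms = len(dict(union.transformer_list)["text"].named_steps["tfidf"].vocabulary_)
print(matrix.shape)                      # (rows, TF-IDF terms + 1)
print(matrix[:, -1].toarray().ravel())   # last column is the word count
```

Note that the word count is on a very different scale from the TF-IDF weights, so scaling that branch may be worth trying if the classifier has trouble converging.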

In [19]:
%%time

model = Pipeline([('imp', FeatImp(key='Review')),
    ('feat',features),
    ('classifier', LogisticRegression()),
])

model.fit(X_train, y_train)


Wall time: 26.9 s


Pipeline(steps=[('imp', FeatImp(key='Review')),
                ('feat',
                 FeatureUnion(transformer_list=[('rewiew',
                                                 Pipeline(steps=[('lemm',
                                                                  TextLemmatizer(key='Review')),
                                                                 ('selector',
                                                                  ColumnSelector(key='Review')),
                                                                 ('tfidf',
                                                                  TfidfVectorizer(max_features=10000,
                                                                                  ngram_range=(1,
                                                                                               3),
                                                                                  tokenizer=<function word_tokenize at 0x00000253CD037820>))])),
   

In [20]:
predictions=model.predict_proba(X_val)


In [21]:
print(classification_report(y_val,np.argmax(predictions,axis=1)))
print(confusion_matrix(y_val, np.argmax(predictions,axis=1)))

              precision    recall  f1-score   support

           0       0.90      0.70      0.79      1080
           1       0.90      0.97      0.94      3018

    accuracy                           0.90      4098
   macro avg       0.90      0.84      0.86      4098
weighted avg       0.90      0.90      0.90      4098

[[ 761  319]
 [  80 2938]]
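
The per-class figures in the report can be recomputed by hand from the confusion matrix, which helps when checking where the errors come from. The sketch below uses the four counts printed above; rows are true classes and columns are predicted classes. The variable names are only illustrative.

```python
# Confusion matrix from the validation run above
tn, fp = 761, 319    # true class 0: predicted 0 / predicted 1
fn, tp = 80, 2938    # true class 1: predicted 0 / predicted 1

precision_0 = tn / (tn + fn)   # of all reviews predicted negative, how many are negative
recall_0 = tn / (tn + fp)      # of all negative reviews, how many were found
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The weakest number is the recall for class 0: 319 of the 1080 negative reviews are classified as positive.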


In [22]:
roc_auc_score(y_true=y_val, y_score=predictions[:, 1])

0.9552453795744055
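
ROC AUC can be read as the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. The following sketch computes it that way on four made-up scores. The function `pairwise_auc` and the toy data are illustrative only, and the quadratic loop is meant for explanation, not for a data set of this size.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Read this way, the value of about 0.955 above means the model ranks a positive review above a negative one in roughly 95.5% of such pairs, independently of the 0.5 threshold implied by `np.argmax`.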

The results are quite acceptable, so let's save the model.

In [23]:
with open("model.dill", "wb") as f:
    dill.dump(model, f)
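
A saved model is only useful if it can be loaded back and gives the same predictions. The sketch below shows the dump and load round trip with `dill` on a tiny stand-in classifier written to a temporary directory. The toy data and variable names are assumptions, and the real pipeline would be loaded from `model.dill` in the same way.

```python
import os
import tempfile

import dill
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model; the real notebook would use the fitted pipeline
toy_X = [[0.0], [1.0], [2.0], [3.0]]
toy_y = [0, 0, 1, 1]
toy_model = LogisticRegression().fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.dill")
    # Save the fitted model
    with open(path, "wb") as f:
        dill.dump(toy_model, f)
    # Load it back into a new object
    with open(path, "rb") as f:
        restored = dill.load(f)

# The restored model should predict exactly what the original does
same = bool((restored.predict(toy_X) == toy_model.predict(toy_X)).all())
print(same)
```

For the real model, the loaded pipeline could then be scored on `X_test.csv` and `y_test.csv`, provided the NLTK resources it relies on are available in the loading environment.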