In [3]:
import pandas as pd
df = pd.read_csv('tripadvisor_hotel_reviews.csv')
df.head()

Unnamed: 0,Review,Rating
0,nice hotel expensive parking got good deal sta...,4
1,ok nothing special charge diamond member hilto...,2
2,nice rooms not 4* experience hotel monaco seat...,3
3,"unique, great stay, wonderful time hotel monac...",5
4,"great stay great stay, went seahawk game aweso...",5


In [4]:
import numpy as np
def create_sentiment(rating):
    # Map a star rating to a label: 1-2 negative, 4-5 positive, anything else neutral.
    if rating == 1 or rating == 2:
        return 'negative'
    elif rating == 4 or rating == 5:
        return 'positive'
    else:
        return 'neutral'


Done. Recommendations: Bound the rating to the range 1-5. Right now any value outside that range falls through to 'neutral'.
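One way to bound the rating, as a minimal sketch. The name `create_sentiment_checked` is made up for this note and is not part of the notebook; raising a `ValueError` is just one possible policy (returning `None` or clipping would be others).

```python
def create_sentiment_checked(rating):
    """Map a 1-5 star rating to a sentiment label, rejecting anything else."""
    # Hypothetical helper: only the values 1 through 5 are accepted.
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")
    if rating <= 2:
        return 'negative'
    if rating >= 4:
        return 'positive'
    return 'neutral'


print([create_sentiment_checked(r) for r in range(1, 6)])
```

It can be dropped into `df['Rating'].apply(...)` the same way as `create_sentiment`.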

In [5]:
df['Sentiment'] = df['Rating'].apply(create_sentiment)

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

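# lowercase = False makes the vocabulary case-sensitive.
tfidf = TfidfVectorizer(strip_accents = None, lowercase = False, preprocessor = None)
# X is built from the reviews as they are at this point, before clean_data is applied in In [10].
X = tfidf.fit_transform(df['Review'])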

Done. Recommendations: Document what each function does. It will help keep the logic clear.

TF-IDF - Term Frequency-Inverse Document Frequency.
TF - How often a word appears in a document, as a proportion of the words in that document.
IDF - How common or uncommon a word is across the whole collection of documents (the corpus).

Basically, it measures how important a word is to a document.

Question: What is the difference between this and bag of words?
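A small comparison that may help with this question. Bag of words stores raw counts per document; TF-IDF starts from the same counts and down-weights words that appear in many documents. The three toy documents and the names `bow` and `tfidf_demo` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (assumption): 'hotel' appears in every document, the other words do not.
docs = ["nice hotel nice room", "noisy hotel", "hotel parking expensive"]

bow = CountVectorizer()
counts = bow.fit_transform(docs)          # bag of words: raw term counts

tfidf_demo = TfidfVectorizer()
weights = tfidf_demo.fit_transform(docs)  # counts rescaled by inverse document frequency

# Both vectorizers use the same default tokenizer, so column indices line up.
hotel = bow.vocabulary_['hotel']
noisy = bow.vocabulary_['noisy']

# In "noisy hotel" both words occur once, so bag of words treats them equally,
# while TF-IDF gives the rarer word 'noisy' the larger weight.
print("counts :", counts[1, hotel], counts[1, noisy])
print("tf-idf :", weights[1, hotel], weights[1, noisy])
```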

In [7]:
df.head()

Unnamed: 0,Review,Rating,Sentiment
0,nice hotel expensive parking got good deal sta...,4,positive
1,ok nothing special charge diamond member hilto...,2,negative
2,nice rooms not 4* experience hotel monaco seat...,3,neutral
3,"unique, great stay, wonderful time hotel monac...",5,positive
4,"great stay great stay, went seahawk game aweso...",5,positive


In [8]:
import re

def clean_data(review):
    # Remove punctuation (anything that is not a word character or whitespace).
    no_punc = re.sub(r'[^\w\s]', '', review)
    # Remove digits.
    no_digits = ''.join([i for i in no_punc if not i.isdigit()])
    return no_digits

In [9]:
df['Review'][0]

'nice hotel expensive parking got good deal stay hotel anniversary, arrived late evening took advice previous reviews did valet parking, check quick easy, little disappointed non-existent view room room clean nice size, bed comfortable woke stiff neck high pillows, not soundproof like heard music room night morning loud bangs doors opening closing hear people talking hallway, maybe just noisy neighbors, aveda bath products nice, did not goldfish stay nice touch taken advantage staying longer, location great walking distance shopping, overall nice experience having pay 40 parking night,  '

In [10]:
df['Review'] = df['Review'].apply(clean_data)

In [11]:
from sklearn.model_selection import train_test_split 
y = df['Sentiment']
X_train, X_test, y_train, y_test = train_test_split(X,y)

In [17]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver = 'liblinear')
lr.fit(X_train, y_train)
preds = lr.predict(X_test)

In [23]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, preds)

0.8524302166699199

In [20]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

In [22]:
print("Precision score:", precision_score(y_test, preds, average='macro'))
print("Recall score:", recall_score(y_test, preds, average='macro'))
print("F1 score:", f1_score(y_test, preds, average='macro'))
print("Confusion matrix:\n", confusion_matrix(y_test, preds))

Precision score: 0.7658412198514739
Recall score: 0.6141740746680338
F1 score: 0.6327498153997057
Confusion matrix:
 [[ 583   22  188]
 [  99   68  398]
 [  28   21 3716]]
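To read the matrix per class, the sketch below recomputes precision and recall from the numbers printed above. It assumes the rows and columns are ordered negative, neutral, positive, which is the sorted label order scikit-learn uses when no `labels` argument is given; rows are true labels and columns are predictions.

```python
import numpy as np

# Assumed label order (sorted alphabetically, as scikit-learn does by default).
labels = ['negative', 'neutral', 'positive']

# Confusion matrix copied from the output above: rows = true, columns = predicted.
cm = np.array([[583, 22, 188],
               [99, 68, 398],
               [28, 21, 3716]])

recall = cm.diagonal() / cm.sum(axis=1)     # correct / all true members of the class
precision = cm.diagonal() / cm.sum(axis=0)  # correct / all predictions of the class

for name, p, r in zip(labels, precision, recall):
    print(f"{name:>8}: precision={p:.3f} recall={r:.3f}")

# The macro averages are the plain means over the three classes.
print("macro precision:", precision.mean())
print("macro recall:", recall.mean())
```

Under that ordering, most neutral reviews (398 of 565) are predicted as positive, which the overall accuracy hides.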


In [14]:
def predict_sentiment(review: str, rating: int) -> str:
    # Note: rating is carried along but never used; the model only sees the review text.
    df = pd.DataFrame({'Review': [review], 'Rating': [rating]})
    df['Review'] = df['Review'].apply(clean_data)
    X = tfidf.transform(df['Review'])
    result = lr.predict(X)
    # result is a one-element array; a 'neutral' prediction is reported as 'negative' here.
    sentiment = 'positive' if result[0] == 'positive' else 'negative'
    message = f"The sentiment of this review is {sentiment}."
    return message

In [14]:
predict_sentiment('really crap', 1)

'The sentiment of this review is positive.'

In [15]:
predict_sentiment('I had a terrible time at the hotel', 5)

'The sentiment of this review is negative.'

In [16]:
predict_sentiment('A MESS', 5)

'The sentiment of this review is positive.'

In [17]:
predict_sentiment('A MESS', 1)

'The sentiment of this review is positive.'

In [18]:
predict_sentiment('Trash', 5)

'The sentiment of this review is positive.'

It runs! However, this is a negative review, and the model still returns "positive". I may need a measure other than accuracy, such as the confusion matrix, to see what is going on.
There could also be an issue with the mapping.

*Put a lot of two token reviews and see how it works*
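Before running many short reviews, it may be worth checking whether their words reach the model at all. `tfidf` was created with `lowercase = False`, and the reviews shown in `df.head()` are all lowercase, so inputs like 'A MESS' or 'Trash' may match nothing in the vocabulary. The default tokenizer also drops single-character tokens such as 'A'. The sketch below shows the effect on a made-up corpus; `vec` and `corpus` are stand-ins, not the notebook's objects.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up lowercase corpus standing in for the hotel reviews (assumption).
corpus = ["trash hotel a mess", "nice clean room", "great stay"]

# Same case-sensitive setting as the notebook's vectorizer.
vec = TfidfVectorizer(lowercase=False)
vec.fit(corpus)

short_reviews = ["A MESS", "Trash", "a mess", "trash"]
matrix = vec.transform(short_reviews)

# Number of known words found in each review; 0 means the model sees an empty vector.
known_words = matrix.getnnz(axis=1)
for review, n in zip(short_reviews, known_words):
    print(repr(review), "->", n, "known word(s)")
```

If a review maps to an all-zero vector, the classifier's decision depends only on its intercepts and not on the text. Running `tfidf.transform([...]).getnnz(axis=1)` on the real short reviews would show whether that is happening here.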

# Correction: Train it with only rating. Build it to be 'trainable' with other datasets and other use cases, like restaurants.
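One possible shape for a reusable version, as a rough sketch under several assumptions. The function names, the column-name parameters and the toy restaurant data are all hypothetical. Labels are derived from the rating column only, which is one reading of the correction above. The sketch uses a scikit-learn `Pipeline` so the vectorizer is fitted on the training split only, and it swaps in the default `LogisticRegression` solver instead of the notebook's `liblinear`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def rating_to_sentiment(rating):
    # Same mapping as create_sentiment: 1-2 negative, 4-5 positive, else neutral.
    if rating in (1, 2):
        return 'negative'
    if rating in (4, 5):
        return 'positive'
    return 'neutral'


def train_sentiment_model(frame, text_col='Review', rating_col='Rating', seed=0):
    """Train on any DataFrame that has a text column and a 1-5 rating column."""
    labels = frame[rating_col].apply(rating_to_sentiment)
    X_train, X_test, y_train, y_test = train_test_split(
        frame[text_col], labels, random_state=seed, stratify=labels)
    # The pipeline fits TF-IDF on the training texts only, then the classifier.
    model = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', LogisticRegression(max_iter=1000)),
    ])
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)


# Made-up restaurant reviews, only to show the call; far too small to learn from.
toy = pd.DataFrame({
    'text': ['cold food', 'rude staff', 'terrible service', 'bad soup',
             'okay meal', 'average place', 'nothing special', 'fine lunch',
             'great pasta', 'lovely dinner', 'amazing dessert', 'friendly staff'],
    'stars': [1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5, 4],
})
model, accuracy = train_sentiment_model(toy, text_col='text', rating_col='stars')
print(model.predict(['great food']), accuracy)
```

Because the pipeline takes raw text, prediction on a new review is just `model.predict([review])`, with no separate `tfidf.transform` step to keep in sync.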