<a href="https://colab.research.google.com/github/zorsebolotanshiyolo/Sentiment_analysis-major-project/blob/main/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
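import csv
import pandas as pd

data = pd.read_csv('hotel-reviews.csv')  # main DataFrame used in the rest of the notebook
# Raw tab-delimited read without a header row; `df` is not used in the cells below.
df = pd.read_csv('hotel-reviews.csv', header=None, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')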

## Displaying the Data

After the data has been successfully read, we can inspect its shape, a few sample rows, and summary statistics programmatically.

In [9]:
data.shape

(38932, 5)

In [10]:
data.sample(5)

Unnamed: 0,User_ID,Description,Browser_Used,Device_Used,Is_Response
34051,id44377,Great location for Penn Station which we wante...,Mozilla Firefox,Mobile,happy
18323,id28649,"We loved this hotel! It was quiet, clean and h...",Mozilla Firefox,Desktop,happy
34035,id44361,We stayed for - nights on a weekend in March. ...,Firefox,Mobile,happy
16329,id26655,"Nice modern rooms, great plush bed. TV had lot...",Chrome,Tablet,happy
6394,id16720,The Belleclaire has a very friendly and accomm...,Mozilla Firefox,Desktop,happy


In [11]:
data.describe()

Unnamed: 0,User_ID,Description,Browser_Used,Device_Used,Is_Response
count,38932,38932,38932,38932,38932
unique,38932,38932,11,3,2
top,id12268,The hotel is old and so as the room. We stayed...,Firefox,Desktop,happy
freq,1,1,7367,15026,26521


In [12]:
data['Is_Response'].value_counts()

happy        26521
not happy    12411
Name: Is_Response, dtype: int64
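
The counts above show that the two classes are not balanced. As a rough sketch, the shares can be computed from those same counts; the variable names `counts`, `proportions`, and `majority_baseline` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({'happy': 26521, 'not happy': 12411}, name='Is_Response')

# Share of each class; on the raw column this is what
# data['Is_Response'].value_counts(normalize=True) would return.
proportions = counts / counts.sum()
print(proportions.round(3))

# Accuracy obtained by always predicting the majority class,
# a simple baseline to compare a trained model against.
majority_baseline = proportions.max()
print(round(majority_baseline, 3))
```

Any classifier trained later should beat this majority-class baseline to be considered useful.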

In this project, we'll use only the `Description` and `Is_Response` columns.

Later on, the cleaned review text is stored in a variable named `attribute` and the `Is_Response` labels in `target`.

In [13]:
data.drop(columns = ['User_ID', 'Browser_Used', 'Device_Used'], inplace = True)

Next, we change the `Is_Response` column values from "happy" and "not happy" to "positive" and "negative".

In [14]:
data['Is_Response'] = data['Is_Response'].map({'happy' : 'positive', 'not happy' : 'negative'})

data.sample(3)

Unnamed: 0,Description,Is_Response
26759,We just got back from our stay at the Best Wes...,negative
17080,The Michelangelo is a beautiful hotel in the l...,positive
34696,I have been traveling to New York to meet with...,positive
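
One property of `Series.map` with a dictionary is that any value missing from the dictionary becomes `NaN`. The following is a minimal sketch of how to check for that, using a small made-up series; the label `'Happy '` is a hypothetical typo and is not taken from the dataset.

```python
import pandas as pd

# Toy labels; the last entry is a hypothetical misspelled value.
labels = pd.Series(['happy', 'not happy', 'Happy '])
mapping = {'happy': 'positive', 'not happy': 'negative'}

mapped = labels.map(mapping)

# Values that were not found in the mapping come back as NaN.
unmapped = labels[mapped.isna()]
print(unmapped.tolist())
```

Running the same `isna()` check on `data['Is_Response']` after the mapping would confirm that every row received a label.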


In [15]:
import re
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
twitter_handle = r'@[A-Za-z0-9_]+'                         # matches Twitter handles (@username)
url_handle = r'http[^ ]+'                                  # matches URLs that start with 'http' or 'https'
combined_handle = r'|'.join((twitter_handle, url_handle))  # matches either of the two patterns above
www_handle = r'www\.[^ ]+'                                 # matches URLs that start with 'www.' (literal dot)
punctuation_handle = r'\W+'                                # matches runs of non-word characters
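
To see what each pattern matches, here is a small sketch that applies them to a made-up sentence. The patterns mirror the cell above, with the dot in `www_handle` escaped so it matches a literal `.`; the sample string and the `example.com` / `example.org` addresses are placeholders.

```python
import re

# Patterns mirroring the notebook cell (dot in www_handle escaped).
twitter_handle = r'@[A-Za-z0-9_]+'
url_handle = r'http[^ ]+'
combined_handle = r'|'.join((twitter_handle, url_handle))
www_handle = r'www\.[^ ]+'
punctuation_handle = r'\W+'

sample = "Thanks @hotel_nyc! See https://example.com or www.example.org :-)"

handles = re.findall(twitter_handle, sample)
urls = re.findall(url_handle, sample)
www_urls = re.findall(www_handle, sample)
print(handles, urls, www_urls)

# Same order of operations as the cleaning function defined later:
# drop handles and http URLs, drop www URLs, lowercase, punctuation -> space.
stripped = re.sub(combined_handle, '', sample)
stripped = re.sub(www_handle, '', stripped)
stripped = re.sub(punctuation_handle, ' ', stripped.lower()).strip()
print(stripped)
```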

In [20]:
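# One stopword per line under a `stopword` header (assumes no commas in the file); recent pandas rejects sep='\n'.
stopwords = set(pd.read_csv('stopword_list.txt', header=0)['stopword'])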

In [21]:
def process_text(text):
    """Strip HTML, handles, URLs, punctuation, stopwords, and one-character tokens."""
    # Keep only the visible text; get_text() already returns a str in Python 3,
    # so no decode step is needed.
    text = BeautifulSoup(text, 'lxml').get_text()

    text = re.sub(combined_handle, '', text)              # drop @handles and http(s) URLs
    text = re.sub(www_handle, '', text)                   # drop URLs that start with 'www.'
    text = re.sub(punctuation_handle, ' ', text.lower())  # lowercase, punctuation -> space
    words = [word for word in text.split() if word not in stopwords]

    # Re-tokenize and drop single-character tokens.
    return ' '.join(word for word in tokenizer.tokenize(' '.join(words)) if len(word) > 1)

The example below runs the text cleaning function on a sample input. Try it with your own text.

In [22]:
example_text = "hahaha if above a ----'-' www.adasd apakah SAYA ingin pergi pada tanggal 15 bulan februari besok ? tidak karena hari kemarin @twitter suka main https://www.twitter.com"

process_text(example_text)

'hahaha apakah saya ingin pergi pada tanggal 15 bulan februari besok tidak karena hari kemarin suka main'

In [23]:
# Clean every review and store the result in a new `clean_text` column.
data['clean_text'] = data['Description'].apply(process_text)

data.sample(5)

Unnamed: 0,Description,Is_Response,clean_text
12123,"The appearance from the outside is striking, w...",negative,appearance outside striking tall circle towers...
21741,Staff and service would definitely be a reason...,negative,staff service definitely reason return ave doo...
8220,"Yes, the rooms are small (same goes for most p...",positive,yes rooms small goes places nyc reserve deluxe...
25500,I was dissatisfied with the exercise room. The...,negative,dissatisfied exercise room stationary upright ...
27419,we checked in around -pm. There was blood on t...,negative,checked pm blood door jam bathroom mold ceilin...


In [24]:
from sklearn.model_selection import train_test_split

attribute = data.clean_text
target = data.Is_Response

In [25]:
attribute_train, attribute_test, target_train, target_test = train_test_split(attribute, target, test_size = 0.1, random_state = 225)

print('attribute_train :', len(attribute_train))
print('attribute_test  :', len(attribute_test))
print('target_train :', len(target_train))
print('target_test  :', len(target_test))

attribute_train : 35038
attribute_test  : 3894
target_train : 35038
target_test  : 3894
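
The split above is purely random. Since the classes are imbalanced, one option (not used in this notebook) is to pass `stratify` so that both parts keep the same class ratio. The sketch below uses hypothetical toy data with a 70/30 ratio; `toy_text` and `toy_target` are made-up names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a 70/30 class ratio.
toy_text = pd.Series([f"review {i}" for i in range(100)])
toy_target = pd.Series(['positive'] * 70 + ['negative'] * 30)

# stratify=toy_target keeps the class ratio the same in train and test.
x_train, x_test, y_train, y_test = train_test_split(
    toy_text, toy_target, test_size=0.1, random_state=225, stratify=toy_target
)

print(y_train.value_counts())
print(y_test.value_counts())
```

Applying this to the real data would change which rows land in the test set, so the scores reported below would no longer be reproduced exactly.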


# Training

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tvec = TfidfVectorizer()
clf2 = LogisticRegression()

In [27]:
from sklearn.pipeline import Pipeline

model = Pipeline([('vectorizer',tvec)
                 ,('classifier',clf2)])

model.fit(attribute_train, target_train)

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, inter
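
A fitted pipeline can be inspected through `named_steps`. The sketch below fits the same two-step pipeline on a tiny made-up corpus and lists the words with the most negative and most positive weights; the corpus, labels, and `toy_*` names are hypothetical and only meant to show the mechanics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, not taken from the hotel reviews.
toy_texts = [
    "great clean room friendly staff",
    "lovely stay great location",
    "dirty room rude staff",
    "terrible stay dirty bathroom",
]
toy_labels = ['positive', 'positive', 'negative', 'negative']

toy_model = Pipeline([('vectorizer', TfidfVectorizer()),
                      ('classifier', LogisticRegression())])
toy_model.fit(toy_texts, toy_labels)

vectorizer = toy_model.named_steps['vectorizer']
classifier = toy_model.named_steps['classifier']

# vocabulary_ maps word -> column index; invert it to read the weights.
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

# For a binary problem coef_ has one row; positive weights push toward classes_[1].
weights = classifier.coef_[0]
ranked = sorted(range(len(weights)), key=lambda i: weights[i])
print('classes:', list(classifier.classes_))
print('most negative words:', [index_to_word[i] for i in ranked[:3]])
print('most positive words:', [index_to_word[i] for i in ranked[-3:]])

toy_prediction = toy_model.predict(['great location'])
print(toy_prediction)
```

The same `named_steps` lookups work on the `model` object fitted above, where the vocabulary is far larger.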

In [28]:
example_text = ["I'm very happy now"]
example_result = model.predict(example_text)

print(example_result)

['positive']


In [29]:
from sklearn.metrics import confusion_matrix

verdict = model.predict(attribute_test)

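# scikit-learn's signature is confusion_matrix(y_true, y_pred). The predictions are passed
# first here, so rows are predicted labels and columns are true labels.
confusion_matrix(verdict, target_test)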

array([[ 988,  147],
       [ 335, 2424]])

In [30]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

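# These functions also expect (y_true, y_pred). Accuracy is symmetric, but with the
# predictions passed first, the weighted precision and recall treat `verdict` as ground truth.
print("Accuracy : ", accuracy_score(verdict, target_test))
print("Precision : ", precision_score(verdict, target_test, average = 'weighted'))
print("Recall : ", recall_score(verdict, target_test, average = 'weighted'))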

Accuracy :  0.8762198253723678
Precision :  0.8856843363140553
Recall :  0.8762198253723678
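
As a worked check, accuracy and per-class precision and recall can be derived directly from the confusion matrix printed earlier. This sketch assumes the orientation that follows from the call `confusion_matrix(verdict, target_test)`: rows are predicted labels, columns are true labels, in the order (negative, positive).

```python
import numpy as np

# Matrix printed by the confusion_matrix cell; rows = predicted, columns = true.
cm = np.array([[988, 147],
               [335, 2424]])

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Precision: correct / everything predicted as that class (row sums here).
precision = np.diag(cm) / cm.sum(axis=1)

# Recall: correct / everything that truly belongs to that class (column sums here).
recall = np.diag(cm) / cm.sum(axis=0)

print('accuracy :', accuracy)
print('precision (negative, positive):', precision)
print('recall    (negative, positive):', recall)
```

The accuracy computed this way matches the value reported above, and the per-class figures show how the two classes differ.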


In [31]:
!pip install streamlit

!pip install pyngrok
from pyngrok import ngrok

Collecting streamlit
  Downloading https://files.pythonhosted.org/packages/5f/3c/f0a97b684a49bd043bef9f8b8f4092b8a25bee9ad9a3f5e121a7161ded8f/streamlit-0.78.0-py2.py3-none-any.whl (7.5MB)
Collecting validators
  Downloading https://files.pythonhosted.org/packages/db/2f/7fed3ee94ad665ad2c1de87f858f10a7785251ff75b4fd47987888d07ef1/validators-0.18.2-py3-none-any.whl
Collecting blinker
  Downloading https://files.pythonhosted.org/packages/1b/51/e2a9f3b757eb802f61dc1f2b09c8c99f6eb01cf06416c0671253536517b6/blinker-1.4.tar.gz (111kB)
Collecting gitpython
  Downloading https://files.pythonhosted.org/packages/a6/99/98019716955ba243657daedd1de8f3a88ca1f5b75057c38e959db22fb87b/GitPython-3.1.14-py3-none-any.whl (159kB)
Collecting pydeck>=0.1.dev5
  Downloading https://files.pythonhosted.org/packages/1

Collecting pyngrok
  Downloading https://files.pythonhosted.org/packages/fd/14/70caa2fd38bbddfd19208bccbc8a20a82e1de5378829fc334c6397ef4dc9/pyngrok-5.0.4.tar.gz (743kB)
Building wheels for collected packages: pyngrok
  Building wheel for pyngrok (setup.py) ... done
  Created wheel for pyngrok: filename=pyngrok-5.0.4-cp37-none-any.whl size=18971 sha256=085bd49b86c0d3c20e99fc4a29e20273fdbd6138518657b98203f5c99f7cb886
  Stored in directory: /root/.cache/pip/wheels/8a/82/b1/cecfba4ff6e2f05777a5a4a65b46c1114842453d5a0e61bdd4
Successfully built pyngrok
Installing collected packages: pyngrok
Successfully installed pyngrok-5.0.4


In [32]:
%%writefile app.py
import streamlit as st
st.title('Sentiment Analysis')

Writing app.py
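
The app written above only sets a title, and the notebook does not show how it would reach the trained pipeline. One common option, sketched here as an assumption rather than the author's approach, is to save the fitted pipeline with `joblib` and load it again in another process. The toy corpus and the file name `sentiment_model.joblib` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the notebook's fitted `model`.
toy_model = Pipeline([('vectorizer', TfidfVectorizer()),
                      ('classifier', LogisticRegression())])
toy_model.fit(
    ["great clean room friendly staff", "lovely stay great location",
     "dirty room rude staff", "terrible stay dirty bathroom"],
    ['positive', 'positive', 'negative', 'negative'],
)

# Save the whole pipeline (vectorizer + classifier) to a single file.
model_path = os.path.join(tempfile.mkdtemp(), 'sentiment_model.joblib')
joblib.dump(toy_model, model_path)

# Load it back, as a separate script such as app.py could do.
restored = joblib.load(model_path)

queries = ["great friendly staff", "rude staff dirty bathroom"]
print(restored.predict(queries))
```

Because the vectorizer is saved together with the classifier, the loaded object accepts raw (cleaned) text directly.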


In [33]:
!nohup streamlit run app.py &

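# pyngrok 5.x takes the local port as the first argument (`addr`). The original call,
# ngrok.connect(port='8501'), left the tunnel on the default port 80, as the output below shows.
url = ngrok.connect(8501)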
url

nohup: appending output to 'nohup.out'


<NgrokTunnel: "http://6d98325c17c8.ngrok.io" -> "http://localhost:80">

In [34]:
!pip install ipython-autotime

Collecting ipython-autotime
  Downloading https://files.pythonhosted.org/packages/b4/c9/b413a24f759641bc27ef98c144b590023c8038dfb8a3f09e713e9dff12c1/ipython_autotime-0.3.1-py2.py3-none-any.whl
Installing collected packages: ipython-autotime
Successfully installed ipython-autotime-0.3.1


In [35]:
%load_ext autotime

time: 123 µs (started: 2021-03-12 17:53:52 +00:00)
