In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import string
import nltk

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [63]:
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('vader_lexicon')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [6]:
df = pd.read_csv('drive/My Drive/Data/cleaned_customer.csv')

In [None]:
df.head()

In [None]:
df.comments.isna().value_counts()

In [7]:
# Replace missing comments with empty strings; assign back rather than relying on inplace
df['comments'] = df['comments'].fillna('')

### 1. Using Vader Model

#### 1.1 Raw model performance

In [82]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

def get_sentiment(sentence, sid):
    """Return VADER's compound score (between -1 and 1) for a sentence."""
    return sid.polarity_scores(sentence)['compound']

# Score every comment with the stock VADER lexicon
sid_score = df.comments.apply(lambda x: get_sentiment(x, sid))

In [117]:
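# Check performance
import scipy.stats

def perform(score):
    """Print the Pearson correlation between sentiment scores and review ratings."""
    rated = df.review_scores_rating.notnull()  # keep only rows that have a rating
    res = scipy.stats.pearsonr(score[rated], df.review_scores_rating[rated])
    print('Correlation coefficient:', res[0], '; P value:', res[1])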

In [116]:
perform(sid_score)

Correlation coefficient: 0.19071607353248088 , P value: 0.0


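##### The correlation is statistically significant (the p-value prints as 0.0), but the coefficient (about 0.19) is small, so the raw VADER score tracks the review rating only weakly.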
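
A small coefficient and a p-value of 0.0 are not in conflict when the sample is large. The sketch below applies the standard t test for a Pearson coefficient to the value printed above. The sample size is an assumption: it uses 155,250, the comment count reported by `describe()` in this notebook, as an upper bound, since the number of rated rows actually passed to `pearsonr` is not shown.

```python
import math

def pearson_t_statistic(r, n):
    """t statistic for testing whether a Pearson correlation r from n pairs is zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))

r = 0.19071607353248088  # coefficient printed by perform(sid_score)
n = 155250               # ASSUMPTION: upper bound, taken from the comment count

t = pearson_t_statistic(r, n)
r_squared = r * r        # share of rating variance a linear fit on the score would explain

print(f"t = {t:.1f}, r^2 = {r_squared:.3f}")
```

With this many observations the t statistic is far beyond any conventional threshold, while r² stays below 4%, which is why the result is significant yet weak.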

#### 1.2 Extend the lexicon with domain-specific words

In [126]:
sid_new = SentimentIntensityAnalyzer()
newWords = {'amenities': 0.5, 'big': 0.5, 'bus': 0.5,'near':0.5,'nearby':0.5,'quiet':0.5,'quick':0.5,'spacious':0.5,'mrt':0.5,'station':0.5,'walking':0.5,'walk':0.5,'restaurant':0.5,'responsive':0.5,'shopping':0.5,'short':0.5,
    'pool':0.5,'kitchen':0.5,'food':0.5,'family':0.5,'distance':0.5,'close':0.5,'communication':0.5}
sid_new.lexicon.update(newWords)

In [110]:
get_sentiment('good amenities',sid_new)

0.5267
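
The value 0.5267 can be reproduced by hand, which helps when choosing valences for new words. The sketch below assumes two things that are not stated in this notebook: that the stock VADER lexicon rates "good" at 1.9, and that VADER maps the summed valence `x` to the compound score with `x / sqrt(x^2 + alpha)` where `alpha = 15`. It ignores VADER's extra rules for punctuation, capitalization, and negation, none of which apply to this phrase.

```python
import math

def vader_normalize(valence_sum, alpha=15):
    """Map a raw valence sum to (-1, 1) the way VADER's compound score is assumed to."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)

GOOD = 1.9       # ASSUMPTION: valence of "good" in the stock VADER lexicon
AMENITIES = 0.5  # valence added via newWords above

good_only = vader_normalize(GOOD)
good_amenities = vader_normalize(GOOD + AMENITIES)

print(round(good_only, 4), round(good_amenities, 4))
```

Under these assumptions, adding "amenities" at 0.5 lifts the phrase from the score of "good" alone to the 0.5267 shown above.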

In [127]:
sid_score_updated = df.comments.apply(lambda x : get_sentiment(x,sid_new))

In [130]:
sid_score_updated.describe()

count    155250.000000
mean          0.645265
std           0.402035
min          -0.998300
25%           0.440400
50%           0.839500
75%           0.945400
max           0.999700
Name: comments, dtype: float64

In [128]:
perform(sid_score_updated)

Correlation coefficient: 0.17954210721681987 ; P value: 0.0


#### 1.3 Detailed analysis

In [133]:
df.comments[sid_score_updated < -0.5].sample(1)

22736    Yanhong und ihr Partner sind ein sehr nettes P...
Name: comments, dtype: object

### 2. Using Afinn for sentiment analysis
Afinn assigns each word a sentiment score from -5 to 5 and scores a text from those word-level values. It does not take word combinations into account.
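
The limitation of word-level scoring can be illustrated with a toy scorer. This is not Afinn's implementation: the lexicon `TOY_LEXICON` and its values are hypothetical, chosen only to show what happens when each word is scored on its own and the results are summed.

```python
# Hypothetical word-level lexicon; the values are illustrative, not Afinn's.
TOY_LEXICON = {"clean": 2, "great": 3, "dirty": -2}

def toy_score(text):
    """Sum per-word valences; words missing from the lexicon contribute 0."""
    return sum(TOY_LEXICON.get(word, 0) for word in text.lower().split())

plain = toy_score("clean")
negated = toy_score("not clean")  # "not" is unknown, so the negation is lost

print(plain, negated)
```

Both phrases receive the same score because the scorer never looks at neighbouring words.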

In [121]:
!pip install afinn
from afinn import Afinn

In [124]:
afinn = Afinn()
afinn_score = np.array([afinn.score(text) for text in df.comments])

In [None]:
pd.DataFrame(afinn_score).describe()

In [None]:
df[afinn_score < -30].comments

#### Discovery: Most of the comments with extremely low Afinn scores appear to be written in German, which the Afinn lexicon used here does not cover.
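
One possible follow-up is to flag comments that are probably German before scoring them. The sketch below is a rough heuristic, not part of the original analysis: `GERMAN_HINTS`, `looks_german`, and the `min_hits` threshold are all hypothetical, and a dedicated language-detection library would be more reliable. The article "die" is deliberately left out of the hint list because it is also an English word, and likely one of the words that an English sentiment lexicon scores negatively.

```python
# Hypothetical list of common German function words.
GERMAN_HINTS = {"und", "ist", "sehr", "nicht", "der", "das", "wir", "ein", "eine", "war"}

def looks_german(text, min_hits=2):
    """Return True if at least `min_hits` tokens are common German function words."""
    tokens = (token.strip(".,!?;:") for token in text.lower().split())
    hits = sum(1 for token in tokens if token in GERMAN_HINTS)
    return hits >= min_hits

german_flag = looks_german("Die Wohnung ist sehr sauber und ruhig.")
english_flag = looks_german("The flat is very clean and quiet.")

print(german_flag, english_flag)
```

In the notebook this could be applied as `df.comments.apply(looks_german)` to build a mask of rows to exclude or translate before scoring.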

In [None]:
df[afinn_score > 70].comments

#### The extremely high-scoring comments do not show any obvious problems.

In [125]:
perform(afinn_score)

Correlation coefficient: 0.18179259638672068 ; P value: 0.0


### 3. Using TextBlob for sentiment analysis

#### Features of TextBlob

1. TextBlob measures a polarity score between -1 (negative) and 1 (positive).
2. It ignores words it does not know.
3. It can also measure a subjectivity score.

In [33]:
from textblob import TextBlob
TextBlob('Hi you are ugly').sentiment

Sentiment(polarity=-0.7, subjectivity=1.0)

In [134]:
textblob_score = np.array([TextBlob(text).sentiment for text in df.comments])

In [136]:
textblob_score

array([[0.21      , 0.68      ],
       [0.49640873, 0.79214286],
       [0.45888889, 0.58655556],
       ...,
       [0.6       , 1.        ],
       [0.        , 0.        ],
       [0.35      , 0.66666667]])
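
Each row of `textblob_score` holds a (polarity, subjectivity) pair, so the two measures need to be separated before polarity can be compared with the ratings. The sketch below shows the split on three rows copied from the output above, using plain Python lists so it runs without the dataset. On the real NumPy array the equivalent would be `textblob_score[:, 0]` and `textblob_score[:, 1]`.

```python
# Three (polarity, subjectivity) rows copied from the printed array above.
sample_scores = [
    (0.21, 0.68),
    (0.6, 1.0),
    (0.0, 0.0),
]

# Transpose the list of pairs into one list per measure.
polarity, subjectivity = (list(column) for column in zip(*sample_scores))

print(polarity)
print(subjectivity)
```

The polarity column is the one comparable to the VADER and Afinn scores, for example via `perform(textblob_score[:, 0])`.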