## Create customer review features from the NRC lexicon in four languages

In this part, we run a sentiment analysis based on the NRC lexicon for four major languages (English, French, Italian, and Spanish). The NRC CSV file contains 14,182 words, each given in 105 languages, and each word is coded for 2 sentiments and 8 emotions.

Unlike the NRC Emotion Lexicon we used in class, this CSV has a quite different format, so we need to build the emotion dictionary ourselves and look up the related emotions for each word in the four languages.
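
As a minimal sketch of the conversion described above, the toy CSV below mimics the layout of the NRC file (one column per language, one 0/1 column per emotion) with three made-up rows. The helper name `build_emotion_dict` and the toy data are assumptions for illustration only; the notebook's own version, built on the real file, follows in the cells below.

```python
import csv
import io

# Toy stand-in for the NRC file (made-up rows): one column per language,
# one 0/1 column per emotion
toy_csv = """English (en),Spanish (es),Negative,Fear,Trust
abandon,abandonar,1,1,0
abacus,ábaco,0,0,1
aback,aback,0,0,0
"""

def build_emotion_dict(rows, word_cols, emotion_cols):
    """Map each word to the set of emotions flagged with 1 in its rows."""
    result = {}
    for row in rows:
        # csv.DictReader yields strings, so the flag is compared with '1'
        flagged = {e for e in emotion_cols if row[e] == '1'}
        for col in word_cols:
            # The same word can occur in several rows or languages: merge the sets
            result.setdefault(row[col], set()).update(flagged)
    # Drop words that carry no emotion at all
    return {word: emotions for word, emotions in result.items() if emotions}

rows = list(csv.DictReader(io.StringIO(toy_csv)))
toy_dict = build_emotion_dict(
    rows, ['English (en)', 'Spanish (es)'], ['Negative', 'Fear', 'Trust'])
print(toy_dict)
```

Words with no flagged emotion (here `aback`) are dropped, which mirrors the filtering step that produces `emotion_dict_final`.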

In [1]:
import pandas as pd
import numpy as np
import datetime

In [3]:
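# The accented words in the dictionary output further down (e.g. '¨¢baco') look mis-decoded,
# so ISO-8859-1 may not be the file's actual encoding
nrc = pd.read_csv('NRC-Emotion-Lexicon-v0.92-In105Languages-Nov2017Translations.csv',encoding="ISO-8859-1")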

In [6]:
nrc.head()

,English (en),Afrikaans (af),Albanian (sq),Amharic (am),Arabic (ar),Armenian (hy),Azeerbaijani (az),Basque (eu),Belarusian (be),Bengali (bn),...,Positive,Negative,Anger,Anticipation,Disgust,Fear,Joy,Sadness,Surprise,Trust
0,aback,uit die veld geslaan,prapa,????,??? ??????,??????,sanki,aback,§Ù§Ù§Ñ§Õ§å,???????,...,0,0,0,0,0,0,0,0,0,0
1,abacus,abakus,num?rator,abacus,????? ???,????????????????,abacus,abako,§Ñ§Ò§Ñ§Ü§Ñ,????-???????????,...,0,0,0,0,0,0,0,0,0,1
2,abandon,verlaat,braktis,??,????,????,t?rk et,bertan behera,§Ñ§Õ§Þ§à§Ó?§è§è§Ñ §Ñ§Õ,?????? ???,...,0,1,0,0,0,1,0,1,0,0
3,abandoned,verlate,braktisur,????,?????,?????,t?rk etdi,abandonatutako,§Ù§Ñ§Ü?§ß§å§ä§í,?????????,...,0,1,1,0,0,1,0,1,0,0
4,abandonment,verlating,braktisje,????,?????? ??,???????????,l??v,abandono,§á§Ñ§Ü?§Õ§Ñ§ß§ß§Ö,???????,...,0,1,1,0,0,1,0,1,1,0


In [11]:
nrc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14182 entries, 0 to 14181
Columns: 115 entries, English (en) to Trust
dtypes: int64(10), object(105)
memory usage: 12.4+ MB


In [5]:
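# Accented characters in the comments shown later (e.g. 'TrÃ¨s') look like UTF-8 text
# decoded as ISO-8859-1, so encoding="utf-8" may be the better choice here
reviews = pd.read_csv('reviews.csv',encoding="ISO-8859-1")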

In [12]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1051974 entries, 0 to 1051973
Data columns (total 6 columns):
listing_id       1051974 non-null int64
id               1051974 non-null int64
date             1051974 non-null object
reviewer_id      1051974 non-null int64
reviewer_name    1051974 non-null object
comments         1050548 non-null object
dtypes: int64(3), object(3)
memory usage: 48.2+ MB


In [13]:
reviews.columns

Index(['listing_id', 'id', 'date', 'reviewer_id', 'reviewer_name', 'comments'], dtype='object')

In [5]:
reviews = reviews[reviews['comments'].notnull()]

In [8]:
nrc.columns

Index(['English (en)', 'Afrikaans (af)', 'Albanian (sq)', 'Amharic (am)',
       'Arabic (ar)', 'Armenian (hy)', 'Azeerbaijani (az)', 'Basque (eu)',
       'Belarusian (be)', 'Bengali (bn)',
       ...
       'Positive', 'Negative', 'Anger', 'Anticipation', 'Disgust', 'Fear',
       'Joy', 'Sadness', 'Surprise', 'Trust'],
      dtype='object', length=115)

In [9]:
nrc = nrc[['English (en)','French (fr)','Italian (it)','Spanish (es)','Positive', 'Negative', 'Anger', 'Anticipation', 'Disgust', 'Fear',
       'Joy', 'Sadness', 'Surprise', 'Trust']]

In [7]:
emotion_cols = ['Positive','Negative','Anger','Anticipation','Disgust','Fear','Joy','Sadness','Surprise','Trust']
word_cols = ['English (en)','French (fr)','Italian (it)','Spanish (es)']

emotion_dict = dict()
for _, row in nrc.iterrows():
    # Emotions flagged with 1 in this row
    row_emotions = {e for e in emotion_cols if row[e] == 1}
    for col in word_cols:
        word = row[col]
        # A word can appear in several rows or languages, so merge its emotions
        emotion_dict.setdefault(word, set()).update(row_emotions)

In [8]:
emotion_dict_final = {k: v for k, v in emotion_dict.items() if v != set()}

In [9]:
emotion_dict_final

{'abacus': {'Trust'},
 'abaque': {'Trust'},
 'abaco': {'Trust'},
 '¨¢baco': {'Trust'},
 'abandon': {'Anger', 'Fear', 'Negative', 'Sadness', 'Surprise'},
 'abandonner': {'Fear', 'Negative', 'Sadness'},
 'abbandono': {'Anger', 'Fear', 'Negative', 'Sadness', 'Surprise'},
 'abandonar': {'Fear', 'Negative', 'Sadness'},
 'abandoned': {'Anger', 'Fear', 'Negative', 'Sadness'},
 'abandonn¨¦': {'Anger', 'Fear', 'Negative', 'Sadness'},
 'abbandonato': {'Anger', 'Disgust', 'Fear', 'Negative', 'Sadness'},
 'abandonado': {'Anger', 'Disgust', 'Fear', 'Negative', 'Sadness'},
 'abandonment': {'Anger', 'Fear', 'Negative', 'Sadness', 'Surprise'},
 'abandono': {'Anger', 'Fear', 'Negative', 'Sadness', 'Surprise'},
 'abate': {'Trust'},
 'diminuer': {'Anger', 'Negative', 'Sadness'},
 'disminuir': {'Negative', 'Sadness'},
 'disminuci¨®n': {'Negative'},
 'abba': {'Positive'},
 'Abba': {'Positive'},
 'abbot': {'Trust'},
 'abb¨¦': {'Trust'},
 'abad': {'Trust'},
 'abduction': {'Fear', 'Negative', 'Sadness', 'Surp

In [10]:
reviews.columns

Index(['listing_id', 'id', 'date', 'reviewer_id', 'reviewer_name', 'comments'], dtype='object')

In [11]:
len(reviews.listing_id.unique())

39528

## Remove automatic system reviews

In [13]:
# .copy() makes reviews1 an independent DataFrame, so the column assignment
# below no longer triggers pandas' SettingWithCopyWarning
reviews1 = reviews[~reviews.comments.str.contains('The host canceled this reservation')].copy()

In [14]:
reviews1.set_index('listing_id',inplace=True)

In [15]:
# Parse the date strings (YYYY-MM-DD) into datetimes
reviews1['date'] = pd.to_datetime(reviews1['date'], format='%Y-%m-%d')


## Select reviews after 2018-07-01

In [16]:
reviews1=reviews1[reviews1['date']>datetime.datetime.strptime('2018-07-01', '%Y-%m-%d')]
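
The comparison in the cell above is strict, so reviews dated exactly 2018-07-01 are excluded. The sketch below shows the difference between the strict and the inclusive filter on a toy frame; the rows are made up and only the column names follow the notebook.

```python
import pandas as pd

# Toy reviews with made-up values; only the column names follow the notebook
toy = pd.DataFrame({
    'listing_id': [1, 1, 2],
    'date': ['2018-06-30', '2018-07-01', '2018-07-05'],
    'comments': ['a', 'b', 'c'],
})

# Parse the strings once, then compare against a Timestamp
toy['date'] = pd.to_datetime(toy['date'], format='%Y-%m-%d')
cutoff = pd.Timestamp('2018-07-01')

after_cutoff = toy[toy['date'] > cutoff]    # strict: drops 2018-07-01 itself
from_cutoff = toy[toy['date'] >= cutoff]    # inclusive: keeps 2018-07-01

print(list(after_cutoff['comments']))
print(list(from_cutoff['comments']))
```

Which of the two is wanted depends on whether the cutoff day itself should count; the notebook uses the strict form.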

In [17]:
reviews1

listing_id,id,date,reviewer_id,reviewer_name,comments
2515,286175649,2018-07-05,59440921,Nicole,Our room was large and spacious. The bunk bed ...
2515,317675755,2018-09-02,22582104,Joseph,!
2515,320292388,2018-09-08,37610745,Sebastian,Stephâs place was really convenient due to i...
2515,325040243,2018-09-18,112583472,Ana,"Steph was amazing, the location is the most co..."
2539,292241108,2018-07-17,64678596,Dahn,"Clean, quiet, well-run home, very close to the..."
2595,328954829,2018-09-27,203936538,Lucia,"Awesome location, spotless, wonderfully accomm..."
2595,331029784,2018-10-01,174975601,Julie,The location made getting to Time Square easil...
3330,324255635,2018-09-16,36137871,Patrick,We had a marvellous stay in Julia's penthouse!...
3831,288777574,2018-07-10,142965005,Patrick,TrÃ¨s bon sÃ©jour de trois semaines Ã NYC. Le...
3831,289115837,2018-07-11,130956729,Carlos,Muy buen lugar!
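
The accented characters in the French comment above (`TrÃ¨s`, `sÃ©jour`) have the shape that UTF-8 text takes when it is decoded as ISO-8859-1. This is an inference from the output, not something the notebook confirms. The sketch below reproduces the pattern on a sample string, where `'Très bon séjour'` is the presumed original text, and shows that the two steps can be reversed.

```python
# Presumed original of the garbled comment (an assumption based on the output)
original = 'Très bon séjour'

# Encoding as UTF-8 and decoding as ISO-8859-1 reproduces the garbling
garbled = original.encode('utf-8').decode('iso-8859-1')

# Reversing the two steps restores the text
repaired = garbled.encode('iso-8859-1').decode('utf-8')

print(garbled)
print(repaired)
```

This matters for the analysis because a garbled word such as `sÃ©jour` cannot match any dictionary key, so accented French, Italian, and Spanish words would be silently skipped.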


In [18]:
review_list = list()
for list_id in set(reviews1.index.values):
    comments = reviews1.loc[list_id, 'comments']
    # A listing with a single review yields a plain string;
    # a listing with several reviews yields a Series of strings
    if isinstance(comments, str):
        review_text = comments
    else:
        review_text = ' '.join(comments)
    review_list.append((list_id, review_text))
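
A possible alternative to the loop above is a `groupby` on the index, sketched here on a toy frame with made-up comments. It should produce the same `(listing_id, text)` tuples, but ordered by `listing_id` rather than in set order.

```python
import pandas as pd

# Toy frame indexed by listing_id, like reviews1 (comments are made up)
toy = pd.DataFrame(
    {'comments': ['Great host.', 'Very clean.', 'Muy buen lugar!']},
    index=pd.Index([2515, 2515, 3831], name='listing_id'),
)

# Join all comments that share a listing_id into one text per listing
joined = toy.groupby(level='listing_id')['comments'].apply(' '.join)

# Same (listing_id, text) tuple shape as review_list
toy_review_list = list(joined.items())
print(toy_review_list)
```

This avoids one `.loc` lookup per listing, which may matter when there are tens of thousands of listings.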

In [19]:
review_list

[(27000837,
  'Ms Septina was a very great host. The house was kept very clean at all times and I actually extended my stay because I was so impressed with the service!'),
 (23724038,
  'The apartment was exactly as portrayed and is in a great location. Just make sure you ask how everything works and/or test it out when you check in. Bosi is very responsive to issues but we probably could have avoided some of them if we had been more thorough during check in. '),
 (26869767,
  "This was a great stay! A little far from Manhattan,  but a really cute and clean apartment.  Conrad was incredibly friendly, responsive. The space was big, had great AC, lots of great spots nearby.  La ubicaciÃ³n del alojamiento es muy buena ( cuenta con tres paradas de metro cercanas), hay supermercados en la zona y locales  para tomar algo o en los que puedes escuchar mÃºsica en directo. Conrad es muy atento y respondiÃ³ y solucionÃ³ todos los problemas que le planteamos. Como cosas a mejorar, seÃ±alarÃ\xada l

In [20]:
def emotion_analyzer(text,emotion_dict=emotion_dict_final):
    # Collect every emotion label that occurs in the dictionary
    emotions = {x for y in emotion_dict.values() for x in y}
    emotion_count = {emotion: 0 for emotion in emotions}
    # Count exact, case-sensitive matches and normalize by the total number of words
    words = text.split()
    total_words = len(words)
    for word in words:
        for emotion in emotion_dict.get(word, ()):
            emotion_count[emotion] += 1/total_words
    return emotion_count
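
Because the lookup above matches whitespace-separated tokens exactly, a token such as `Great` or `Clean,` does not match the keys `great` or `clean`. The sketch below shows one way to normalize tokens before the lookup. The names `tokenize`, `toy_emotion_analyzer`, and the tiny dictionary are hypothetical, and this variant is not what produced the results in this notebook.

```python
import string

# Tiny made-up dictionary standing in for emotion_dict_final
toy_dict = {'great': {'Positive', 'Joy'}, 'clean': {'Positive', 'Trust'}}

def tokenize(text):
    """Lowercase each whitespace token and strip surrounding ASCII punctuation."""
    tokens = (w.strip(string.punctuation).lower() for w in text.split())
    return [t for t in tokens if t]

def toy_emotion_analyzer(text, emotion_dict):
    """Share of normalized tokens that carry each emotion."""
    words = tokenize(text)
    counts = {e: 0.0 for emotions in emotion_dict.values() for e in emotions}
    for word in words:
        for emotion in emotion_dict.get(word, ()):
            counts[emotion] += 1 / len(words)
    return counts

scores = toy_emotion_analyzer('Great host! Clean, quiet room.', toy_dict)
print(scores)
```

Lowercasing has a cost: capitalized keys such as `'Abba'` in the dictionary output would no longer match, so the keys would need the same normalization.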

In [21]:
def comparative_emotion_analyzer(text_tuples,object_name="listing_id"):
    columns = ['Fear','Trust','Negative','Positive','Joy','Disgust',
               'Anticipation','Anger','Sadness','Surprise']
    # Analyze each text once, keeping ids and results in matching order
    ids, rows = [], []
    for obj_id, text in text_tuples:
        ids.append(obj_id)
        rows.append(emotion_analyzer(text))
    # Build the frame in one step instead of appending row by row with .loc
    return pd.DataFrame(rows,index=pd.Index(ids,name=object_name),columns=columns)

df = comparative_emotion_analyzer(review_list)

In [22]:
df['customer_experience'] = (
    df['Fear']*(-10) + df['Trust']*10 + df['Negative']*(-5) + df['Positive']*5
    + df['Joy']*10 + df['Disgust']*(-10) + df['Anticipation']*5
    + df['Anger']*(-10) + df['Sadness']*(-5) + df['Surprise']*10
)
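
To make the weighting easier to check, the sketch below restates the formula as a weight table and recomputes the score for listing 27000837 from the rounded values in the output table further down. The `weights` dictionary and the `row` values are transcribed for illustration; they are not new data.

```python
# Weights read off the customer_experience formula
weights = {
    'Fear': -10, 'Trust': 10, 'Negative': -5, 'Positive': 5, 'Joy': 10,
    'Disgust': -10, 'Anticipation': 5, 'Anger': -10, 'Sadness': -5, 'Surprise': 10,
}

# Rounded values of listing 27000837 from the output table;
# every emotion not listed here is 0 for that listing
row = {'Trust': 0.033333, 'Positive': 0.033333, 'Joy': 0.033333}

# Weighted sum over all ten emotions
score = sum(weights[e] * row.get(e, 0.0) for e in weights)
print(round(score, 4))
```

The result agrees with the `customer_experience` value of 0.833333 reported for that listing, up to the rounding of the inputs.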

In [23]:
df.to_csv('list_review_rating.csv')

In [24]:
df

listing_id,Fear,Trust,Negative,Positive,Joy,Disgust,Anticipation,Anger,Sadness,Surprise,customer_experience
27000837,0.000000,0.033333,0.000000,0.033333,0.033333,0.000000,0.000000,0.000000,0.000000,0.000000,0.833333
23724038,0.000000,0.019231,0.000000,0.038462,0.000000,0.000000,0.019231,0.000000,0.000000,0.000000,0.480769
26869767,0.002538,0.055838,0.012690,0.073604,0.032995,0.000000,0.030457,0.000000,0.007614,0.022843,1.510152
12713995,0.030303,0.045455,0.000000,0.106061,0.045455,0.000000,0.060606,0.000000,0.000000,0.030303,1.742424
15073292,0.004651,0.032558,0.000000,0.065116,0.037209,0.000000,0.041860,0.000000,0.004651,0.000000,1.162791
25034766,0.000000,0.066667,0.000000,0.066667,0.000000,0.000000,0.066667,0.000000,0.000000,0.000000,1.333333
26869774,0.008000,0.056000,0.000000,0.056000,0.040000,0.000000,0.008000,0.008000,0.016000,0.008000,1.120000
28049432,0.000000,0.029557,0.000000,0.059113,0.029557,0.000000,0.009852,0.000000,0.004926,0.009852,1.009852
15859741,0.000000,0.024615,0.003077,0.049231,0.024615,0.000000,0.024615,0.000000,0.003077,0.003077,0.861538
15597600,0.012500,0.031250,0.000000,0.056250,0.043750,0.000000,0.031250,0.000000,0.006250,0.031250,1.343750
