# Index
## Sentiment Analysis

For the sentiment and emotion analysis, two lexicons were used: the AFINN sentiment lexicon and the NRC EmoLex. AFINN provides a sentiment score (positive or negative) for a text. NRC EmoLex provides both a sentiment (which is ignored here, as AFINN is likely more precise) and an association with 8 different emotions.
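To make the difference between the two kinds of lexicon concrete, the sketch below scores a sentence against two tiny hand-made dictionaries. The entries and weights in `toy_valence` and `toy_emotions` are invented for illustration and are not taken from AFINN or NRC EmoLex; only the shape of the lookup is meant to resemble them.

```python
from collections import Counter

# Hypothetical miniature lexicons (NOT real AFINN / NRC EmoLex entries).
toy_valence = {"friendly": 2, "rude": -2, "great": 3}
toy_emotions = {"friendly": ["joy", "trust"], "rude": ["anger", "disgust"]}


def valence_score(text):
    """AFINN-style score: sum of the signed weights of all known words."""
    return sum(toy_valence.get(word, 0) for word in text.lower().split())


def emotion_counts(text):
    """EmoLex-style profile: number of words associated with each emotion."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(toy_emotions.get(word, []))
    return counts


review = "Friendly staff but a rude manager"
print(valence_score(review))   # one signed number for the whole text
print(emotion_counts(review))  # one count per emotion
```

A single polarity score can cancel out (one positive and one negative word sum to zero here), while the emotion profile still records both associations.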

## Example:

In [1]:
import numpy as np
import pandas as pd
from afinn import Afinn
afn = Afinn()

In [2]:
df_NRC = pd.read_csv("processed/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt", sep="\t", index_col=0, header=None)
df_NRC.index.name = 'word'
df_NRC.rename(columns={ df_NRC.columns[0]: "emotion", df_NRC.columns[1]: "association" }, inplace = True)

In [3]:
def nrc_sentiment(sentence):
    # Start every NRC category (8 emotions plus negative/positive) at zero.
    categories = ["anger", "anticipation", "disgust", "fear", "joy", "negative", "positive", "sadness", "surprise", "trust"]
    df_overallScore = pd.DataFrame({"emotion": categories, "association": 0})
    for word in sentence.split():
        if word not in df_NRC.index:
            continue
        # This relies on the lexicon listing one row per category for each word,
        # in the same order as above, so the rows are added by position.
        df_test = df_NRC.loc[word].reset_index().iloc[:,1:]
        df_overallScore["association"] = df_overallScore["association"]+df_test["association"]
    return df_overallScore

In [4]:
sentence_to_be_analysed = "Knowledgeable and friendly staff who are very willing to spend time talking you through the history and design behind their models"
afinn_score = afn.score(sentence_to_be_analysed)
nrc_score = nrc_sentiment(sentence_to_be_analysed)

print(f"The sentence \"{sentence_to_be_analysed}\" has an AFINN score of {afinn_score} and the following NRC scores:\n")
print(nrc_score)

The sentence "Knowledgeable and friendly staff who are very willing to spend time talking you through the history and design behind their models" has an AFINN score of 2.0 and the following NRC scores:

        emotion  association
0         anger            0
1  anticipation            2
2       disgust            0
3          fear            0
4           joy            1
5      negative            0
6      positive            1
7       sadness            0
8      surprise            0
9         trust            1
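
`nrc_sentiment` looks up the raw output of `sentence.split()`, so a capitalised or punctuated token such as `Knowledgeable` or `staff,` only matches if the lexicon stores it in exactly that form. A possible normalisation step is sketched below, assuming the lexicon stores lowercase words. The helper name `tokenize` is an assumption and is not used elsewhere in this notebook.

```python
import re


def tokenize(text):
    """Lowercase the text and return alphabetic tokens (inner apostrophes kept)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


raw = "Friendly staff, very Knowledgeable!"
print(raw.split())    # tokens keep capitals and punctuation
print(tokenize(raw))  # tokens are lowercased and stripped of punctuation
```

Passing `tokenize(sentence)` instead of `sentence.split()` to the lookup loop would change which words are matched, so scores computed both ways are not directly comparable.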


## Practical usage:
Thanks to the scraped data of all boutiques available in Google My Business, we are now able to perform a detailed analysis of each review, which can then be separated per boutique, per country, per broader geographical region, and overall.

In [5]:
dataStructure = ["UID", "Country", "Region", "AFINN", "anger", "anticipation", "disgust", "fear", "joy", "negative", "positive", "sadness", "surprise", "trust"]
df_perBTQ = pd.DataFrame(columns=dataStructure)
df_perCountry = pd.DataFrame(columns=dataStructure[1:])
df_perRegion = pd.DataFrame(columns=dataStructure[2:])

The CSV that was processed has indicators for individual boutiques (_["UID"]_), country (_["Country"]_) and region (_["Region"]_).

In [6]:
data = "processed/JustNPSBoutiqueReviewsscrapedCSV.csv"
df_analysis = pd.read_csv(data)
df_analysis = df_analysis[df_analysis['UID'].notna()]
df_analysis.reset_index(drop=True, inplace=True)
df_analysis.head()

| | UID | title | location/lat | location/lng | address | text | textTranslated | actualText | reviewerId | reviewId | Stars | Country | Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25.0 | IWC | 31.236862 | 121.501857 | China, Shanghai, Pudong, Lujiazui, 世纪大道8号上海国金中... | | | 0 | | | | China | AP |
| 1 | 33.0 | IWC Schaffhausen Boutique – Geneve | 46.204155 | 6.147348 | Rue du Rhône 48, 1204 Genève, Switzerland | | | 0 | 103276218157342886127 | ChdDSUhNMG9nS0VJQ0FnSURBOWJEUWl3RRAB | 5.0 | Switzerland | EU |
| 2 | 33.0 | IWC Schaffhausen Boutique – Geneve | 46.204155 | 6.147348 | Rue du Rhône 48, 1204 Genève, Switzerland | | | 0 | 111212756567393344887 | ChZDSUhNMG9nS0VJQ0FnSURBMDhhUWFBEAE | 3.0 | Switzerland | EU |
| 3 | 33.0 | IWC Schaffhausen Boutique – Geneve | 46.204155 | 6.147348 | Rue du Rhône 48, 1204 Genève, Switzerland | I bought the Portuguese seven days here | | I bought the Portuguese seven days here | 106605683279110772621 | ChZDSUhNMG9nS0VJQ0FnSUN3emRQMFdREAE | 5.0 | Switzerland | EU |
| 4 | 33.0 | IWC Schaffhausen Boutique – Geneve | 46.204155 | 6.147348 | Rue du Rhône 48, 1204 Genève, Switzerland | | | 0 | 103390353641349883822 | ChZDSUhNMG9nS0VJQ0FnSUN3LXVLVlpREAE | 5.0 | Switzerland | EU |


We can now perform a sentiment analysis of all reviews (in the column "actualText") and then calculate an average sentiment score over all reviews. This is achieved by applying the functions "afn.score()" and "nrc_sentiment()" to each review and storing the results in new columns of the same row. After that, the empty dataframes created above can be filled with summaries of this main dataframe. The pandas chained-assignment warning is suppressed for the cells that follow.

In [7]:
pd.options.mode.chained_assignment = None  # default='warn'

In [8]:
df_analysis["AFINN"] = df_analysis["actualText"].apply(lambda x: afn.score(str(x)))
# Score each review once with NRC EmoLex instead of once per emotion column.
# The NRC "negative"/"positive" rows are skipped because AFINN covers polarity.
emotions = ["anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", "trust"]
nrc_scores = df_analysis["actualText"].apply(lambda x: nrc_sentiment(str(x)).set_index("emotion")["association"])
for emotion in emotions:
    df_analysis[emotion] = nrc_scores[emotion]
df_analysis.to_csv("processed/all_incl_sent.csv")

In [8]:
df_inclSent = pd.read_csv("processed/all_incl_sent.csv", index_col=[0])

In [9]:
df_perBTQ["UID"]=df_inclSent["UID"].unique()

In [10]:
# "surprise" is not written back here, which is why that column is empty in the output.
emotions = ["anger", "anticipation", "disgust", "fear", "joy", "sadness", "trust"]
for boutique in df_perBTQ.UID:
    df_btq = df_inclSent.loc[df_inclSent.UID == boutique]
    numRev = len(df_btq[df_btq['actualText']!='nothin'])
    afinnSum = df_btq.AFINN.sum()
    if pd.isnull(afinnSum):
        continue
    # Write with .loc[rows, column] so the assignment always reaches df_perBTQ itself.
    is_btq = df_perBTQ.UID == boutique
    df_perBTQ.loc[is_btq, "AFINN"] = afinnSum/numRev
    for emotion in emotions:
        # Average association per review for this boutique.
        df_perBTQ.loc[is_btq, emotion] = df_btq[emotion].sum()/numRev
df_perBTQ.to_csv("processed/PerBTQsent.csv")


In [11]:
df_perBTQ.head()

| | UID | Country | Region | AFINN | anger | anticipation | disgust | fear | joy | negative | positive | sadness | surprise | trust |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25.0 | | | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | | 0.0 | | 0.0 |
| 1 | 33.0 | | | 3.694444 | 0.027778 | 0.722222 | 0.055556 | 0.166667 | 0.416667 | | | 0.027778 | | 0.527778 |
| 2 | 32.0 | | | 2.608696 | 0.086957 | 0.51087 | 0.108696 | 0.26087 | 0.358696 | | | 0.173913 | | 0.619565 |
| 3 | 23.0 | | | 1.325 | 0.1 | 0.8 | 0.075 | 0.25 | 0.35 | | | 0.1 | | 0.75 |
| 4 | 52.0 | | | 4.75 | 0.416667 | 1.729167 | 0.458333 | 0.6875 | 1.020833 | | | 0.5 | | 1.520833 |
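
The per-boutique averages can also be produced with a `groupby`, and the same pattern would cover the per-country and per-region summaries mentioned above. The sketch below uses a small invented stand-in for `df_inclSent`. It assumes that every row counts as one review, so the group mean equals the sum divided by the number of reviews.

```python
import pandas as pd

# Toy stand-in for df_inclSent; all values are invented for illustration.
toy = pd.DataFrame({
    "UID": [1, 1, 2, 2],
    "Country": ["Spain", "Spain", "China", "China"],
    "Region": ["EU", "EU", "AP", "AP"],
    "AFINN": [4.0, 2.0, 0.0, 6.0],
    "joy": [1, 0, 0, 2],
})

score_cols = ["AFINN", "joy"]

# Mean score per boutique: sum of scores divided by the number of rows.
per_btq = toy.groupby("UID")[score_cols].mean()

# The same call with a different key gives the coarser summaries.
per_country = toy.groupby("Country")[score_cols].mean()
per_region = toy.groupby("Region")[score_cols].mean()

print(per_btq)
```

Unlike the explicit loop, `mean()` skips missing scores within a group, so the two approaches can differ when a score column contains NaN.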


## Linear regression

We can now use this data, summarised by boutique, to look for a statistical relationship between certain sentiments and the performance rank of a given boutique. As the performance rank is based on absolute sales numbers, we control for the number of employees (NOE, column "noe").

In [3]:
import statsmodels.formula.api as smf
from IPython.display import HTML
from stargazer.stargazer import Stargazer

In [10]:
df_perf = pd.read_csv("processed/Control Data.csv")
df_perf = df_perf.dropna()
df_perf.head()

| | Store name | UID | location | noe | perfrank | countries | region | Stars | NPS | AFINN | anger | anticipation | disgust | fear | joy | sadness | surprise | trust |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IWC Sydney | 1 | Downtown | 5.0 | 51.0 | Australia | AP | 4.4 | 68.421053 | 6.708333 | 0.0 | 1.583333 | 0.083333 | 0.625 | 0.958333 | 0 | 0 | 0.166667 |
| 1 | IFS Chengdu | 3 | Shopping Center | 6.0 | 69.0 | China | AP | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.0 |
| 2 | IWC Madrid | 4 | High street | 4.0 | 42.0 | Spain | EU | 4.8 | 91.919192 | 4.631579 | 0.210526 | 0.947368 | 0.157895 | 0.526316 | 0.684211 | 0 | 0 | 0.421053 |
| 3 | Mixcity Shenzhen | 5 | Shopping Center | 6.0 | 56.0 | China | AP | 0.0 | 33.333333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.0 |
| 5 | Shinsegae Gangnam | 7 | Shopping Center | 6.0 | 22.0 | South Korea | AP | 0.0 | 63.636364 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.0 |


In [11]:
results = smf.ols("perfrank ~ AFINN + NPS + anger + anticipation + disgust + fear + joy + trust + location + noe", data=df_perf).fit()

In [12]:
stargazer = Stargazer([results])

In [13]:
HTML(stargazer.render_html())

| | Dependent variable: perfrank |
|---|---|
| | (1) |
| AFINN | -5.864\* |
| | (3.065) |
| Intercept | 76.408\*\*\* |
| | (19.328) |
| NPS | -0.045 |
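
Controlling for a variable means including it as an additional regressor, so that the coefficient of the variable of interest is estimated with the control held fixed. The sketch below illustrates this with synthetic numbers and plain NumPy least squares rather than the `smf.ols` call used in the notebook. The data and the "true" coefficients (70, -5, 2) are invented, and the variable names only mirror the notebook's columns.

```python
import numpy as np

# Synthetic data (invented): rank depends on sentiment AND on staff size.
afinn = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
noe = np.array([4.0, 6.0, 5.0, 8.0, 7.0, 9.0])
perfrank = 70.0 - 5.0 * afinn + 2.0 * noe

# Design matrix: intercept column, predictor of interest, control variable.
X = np.column_stack([np.ones_like(afinn), afinn, noe])
coef, *_ = np.linalg.lstsq(X, perfrank, rcond=None)
intercept, b_afinn, b_noe = coef

# Without the control, the AFINN slope also absorbs the effect of staff size.
X_no_control = np.column_stack([np.ones_like(afinn), afinn])
coef_nc, *_ = np.linalg.lstsq(X_no_control, perfrank, rcond=None)

print(f"with control: {b_afinn:.2f}, without control: {coef_nc[1]:.2f}")
```

Because `afinn` and `noe` are correlated in this toy data, leaving `noe` out shifts the estimated AFINN slope away from the value used to generate the data.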


## Keyness analysis

Now that we have identified the emotions that are the most relevant potential predictors of the performance ranking, we can perform a keyness analysis on these data points.

A keyness analysis compares the frequencies of words in the data with their frequencies in a reference corpus, so choosing a relevant reference corpus is important. We found that the most relevant corpus is probably the Yelp dataset of online reviews, for which we had already prepared word frequencies:

In [None]:
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

In [2]:
df_yelpCorp = pd.read_csv("processed/processed10000chunks.csv", names=["word", "frequency"], header=0)
df_yelpCorp["relative frequency"] = df_yelpCorp["frequency"]/df_yelpCorp["frequency"].sum()

In [3]:
df_yelpCorp.head()

| | word | frequency | relative frequency |
|---|---|---|---|
| 0 | the | 36711861 | 0.050124 |
| 1 | and | 26130923 | 0.035678 |
| 2 | a | 18805880 | 0.025677 |
| 3 | i | 18989648 | 0.025927 |
| 4 | to | 17718243 | 0.024192 |


We can now use the data to analyse only the reviews associated with a certain emotion, here anger:

In [7]:
df_emotionallexicon = pd.read_csv("processed/all_incl_sent.csv", index_col = 0)
emotionTBA = "anger"
df_justEM = df_emotionallexicon[df_emotionallexicon[emotionTBA] >0]
df_justEM.shape

(80, 22)

## WordCount

In [16]:
def wordFrequencies(actualText):
    return(actualText.str.split(expand=True).stack().value_counts())

In [17]:
wordfreqdct = {}
wordFreq = wordFrequencies(df_justEM["actualText"])
for word in wordFreq.index:
    wordcl = ''.join([i for i in word if i.isalpha()])
    wordcl = wordcl.lower()
    if wordcl not in wordfreqdct:
        wordfreqdct[wordcl] = wordFreq[word]
    else:
        wordfreqdct[wordcl] += wordFreq[word]
df_wordFreq = pd.DataFrame.from_dict(wordfreqdct, orient='index', columns=["frequency"])
df_wordFreq.index.name = "word"
df_wordFreq["relative frequency"]= df_wordFreq["frequency"]/df_wordFreq["frequency"].sum()
df_wordFreq = df_wordFreq.reset_index()
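
The same cleaning and counting steps can be written with `collections.Counter`. The sketch below is an alternative, not the notebook's method, and the function name `word_frequencies` is hypothetical. It differs from the cell above in one respect: tokens that contain no letters are dropped instead of being counted under an empty string.

```python
from collections import Counter


def word_frequencies(texts):
    """Count cleaned, lowercased words across an iterable of review texts."""
    counts = Counter()
    for text in texts:
        if not isinstance(text, str):  # skip missing reviews (NaN, None)
            continue
        for token in text.split():
            cleaned = "".join(ch for ch in token if ch.isalpha()).lower()
            if cleaned:  # skip tokens without letters, e.g. "5/5"
                counts[cleaned] += 1
    return counts


counts = word_frequencies(["Great watch, great service!", "Service was slow...", None])
total = sum(counts.values())
relative = {word: n / total for word, n in counts.items()}
print(counts.most_common(2))
```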

## Performing keyness analysis
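According to Gabrielatos & Marchi (2011), the %DIFF calculation $$ \%DIFF = \frac{(NFC_1 - NFC_2) \times 100}{NFC_2} $$ where $NFC$ denotes normalised frequencies ($NFC_1$ in the study corpus, $NFC_2$ in the reference corpus), is the most relevant metric for keyness. It measures the size of a frequency difference (an effect size) rather than its statistical significance. Therefore: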

In [18]:
def diffCalc(word):
    # NFC1: relative frequency in the study data; NFC2: relative frequency in the Yelp reference corpus.
    nfc1 = df_wordFreq["relative frequency"].loc[df_wordFreq["word"] == word].values
    nfc2 = df_yelpCorp["relative frequency"].loc[df_yelpCorp["word"] == word].values
    DIFF = ((nfc1 - nfc2)*100)/nfc2
    return DIFF
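
As a worked example of the formula on plain numbers, the sketch below computes %DIFF for two hypothetical pairs of normalised frequencies. The helper name `pct_diff` and the frequencies are assumptions chosen only to show the scale of the metric.

```python
import math


def pct_diff(nfc1, nfc2):
    """%DIFF = (NFC1 - NFC2) * 100 / NFC2; NaN when the reference frequency is zero."""
    if nfc2 == 0:
        return math.nan
    return (nfc1 - nfc2) * 100 / nfc2


# Hypothetical normalised frequencies.
more_frequent = pct_diff(0.004, 0.001)   # word is over-represented in the study data
less_frequent = pct_diff(0.0005, 0.001)  # word is under-represented in the study data
print(more_frequent, less_frequent)
```

The first value is about 300 (the word is four times as frequent as in the reference corpus) and the second about -50 (half as frequent).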

In [19]:
def returnreffreq(word):
    reffreq = df_yelpCorp["relative frequency"].loc[df_yelpCorp["word"] == word].values
    return reffreq.astype(np.float64)

In [10]:
df_wordFreq["RelFreqComp"] = df_wordFreq["word"].apply(lambda x: returnreffreq(x))
df_wordFreq["%DIFF"] = df_wordFreq["word"].apply(lambda x: diffCalc(x))
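
The two `apply` calls scan `df_yelpCorp` once per word and store NumPy arrays, which are empty when a word does not occur in the reference corpus. A join on the "word" column is one possible alternative that yields plain numbers and NaN for missing words. The frames `study` and `reference` below are toy stand-ins with invented values, not the notebook's data.

```python
import pandas as pd

# Toy frames using the same column names as df_wordFreq and df_yelpCorp.
study = pd.DataFrame({"word": ["rude", "watch", "zzz"],
                      "relative frequency": [0.02, 0.05, 0.01]})
reference = pd.DataFrame({"word": ["rude", "watch"],
                          "relative frequency": [0.005, 0.05]})

# Left join keeps every study word; words absent from the reference get NaN.
merged = study.merge(
    reference.rename(columns={"relative frequency": "RelFreqComp"}),
    on="word", how="left",
)
merged["%DIFF"] = (
    (merged["relative frequency"] - merged["RelFreqComp"]) * 100
    / merged["RelFreqComp"]
)

# NaN values are placed last by default when sorting.
merged = merged.sort_values("%DIFF", ascending=False)
print(merged)
```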

In [11]:
df_wordFreq = df_wordFreq.sort_values(by=["%DIFF"], ascending=False)

In [12]:
df_wordFreq.to_csv("processed/"+emotionTBA+"keyness.csv")

This data can now be analysed in Excel. As can be seen, the outlying "%DIFF" values occur for words whose NFC2 value (the relative frequency in the reference corpus) is extremely small, because dividing by a near-zero NFC2 inflates the result. The data therefore needs to be cleaned of those values.
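
This cleaning step could also be done in pandas before exporting. The sketch below drops words whose reference frequency falls under a cut-off. The threshold `MIN_REF_FREQ` is a hypothetical value that would need to be chosen for the actual corpus, and the table is invented for illustration.

```python
import pandas as pd

# Toy keyness table; all values are invented.
keyness = pd.DataFrame({
    "word": ["portugieser", "rude", "the"],
    "relative frequency": [0.002, 0.002, 0.045],
    "RelFreqComp": [1e-9, 2e-4, 5e-2],
})
keyness["%DIFF"] = (
    (keyness["relative frequency"] - keyness["RelFreqComp"]) * 100
    / keyness["RelFreqComp"]
)

MIN_REF_FREQ = 1e-6  # hypothetical cut-off for NFC2

# Keep only words that are frequent enough in the reference corpus.
cleaned = keyness[keyness["RelFreqComp"] >= MIN_REF_FREQ]
print(cleaned)
```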

## Keyness analysis for overall IWC and competitors

Now that we have the keyness analysis for the specific emotions, it is interesting to compare it with the %DIFF values across all IWC reviews and across the reviews of IWC's competitors, to check that the %DIFF values of certain key words are smaller there. E.g.: if the keyness of "welcoming" is high in the "joy" dataset, it is interesting to see what its keyness is across all reviews of all IWC boutiques and of IWC's competitors.

In [38]:
df_IWC = pd.read_csv("Data/JustNPSBoutiqueReviewsscrapedCSV.csv")
df_comp = pd.read_csv("Data/Comp.csv")
both = [df_IWC, df_comp]

In [None]:
# Pair each dataframe with the name used in its output file. Looking the name up
# through globals() can return another variable bound to the same object.
for name, data in zip(["df_IWC", "df_comp"], both):
    wordfreqdct = {}
    wordFreq = wordFrequencies(data["actualText"])
    for word in wordFreq.index:
        # Keep letters only, lowercase, and merge the counts of identical cleaned words.
        wordcl = ''.join([i for i in word if i.isalpha()])
        wordcl = wordcl.lower()
        if wordcl not in wordfreqdct:
            wordfreqdct[wordcl] = wordFreq[word]
        else:
            wordfreqdct[wordcl] += wordFreq[word]
    df_wordFreq = pd.DataFrame.from_dict(wordfreqdct, orient='index', columns=["frequency"])
    df_wordFreq.index.name = "word"
    df_wordFreq["relative frequency"]= df_wordFreq["frequency"]/df_wordFreq["frequency"].sum()
    df_wordFreq = df_wordFreq.reset_index()
    df_wordFreq["RelFreqComp"] = df_wordFreq["word"].apply(lambda x: returnreffreq(x))
    df_wordFreq["%DIFF"] = df_wordFreq["word"].apply(lambda x: diffCalc(x))
    df_wordFreq = df_wordFreq.sort_values(by=["%DIFF"], ascending=False)
    df_wordFreq.to_csv("processed/"+name+"keyness.csv")


## Adding reference frequencies to dataset

We can now add these reference values (the %DIFF of each word across all IWC reviews and across competitor reviews) to the datasets created for the individual emotions in order to gain more statistical insight.

In [47]:
df_joy = pd.read_csv("processed/joykeyness.csv", index_col=0)
df_anger = pd.read_csv("processed/angerkeyness.csv", index_col=0)
df_IWC = pd.read_csv("processed/df_IWCkeyness.csv", index_col=0)
df_comp = pd.read_csv("processed/df_compkeyness.csv", index_col=0)
both = [df_joy, df_anger]

In [46]:
for data in both:
    data["Compared to IWC"] = data["word"].apply(lambda x:df_IWC.loc[df_IWC["word"] == x]["%DIFF"].values)
    data["Compared to comp"] = data["word"].apply(lambda x:df_comp.loc[df_comp["word"] == x]["%DIFF"].values)

df_joy.to_csv("processed/joykeynessINCLCOMP.csv")
df_anger.to_csv("processed/angerkeynessINCLCOMP.csv")
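
The per-word lookup above returns an array for every row. If the "%DIFF" columns hold plain numbers (an assumption, since values saved as arrays are read back from CSV as text), `Series.map` with a word-indexed lookup gives one scalar per row. The frames below are invented stand-ins for `df_joy` and `df_IWC`.

```python
import pandas as pd

# Toy stand-ins for an emotion dataset and a comparison dataset (values invented).
emotion_df = pd.DataFrame({"word": ["welcoming", "watch", "lovely"],
                           "%DIFF": [850.0, 120.0, 400.0]})
iwc_df = pd.DataFrame({"word": ["welcoming", "watch"],
                       "%DIFF": [300.0, 110.0]})

# Series indexed by word: one %DIFF value per word in the comparison dataset.
iwc_lookup = iwc_df.drop_duplicates("word").set_index("word")["%DIFF"]

# map() returns the matching value per row, or NaN when the word is missing.
emotion_df["Compared to IWC"] = emotion_df["word"].map(iwc_lookup)
print(emotion_df)
```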