# Part 3 - Text analysis and ethics

# 3.a Computing PMI

In this assessment you are tasked to discover strong associations between concepts in Airbnb reviews. The starter code we provide in this notebook is for orientation only. The below imports are enough to implement a valid answer.

### Imports, data loading and helper functions

We first load the `autotime` extension, which reports how long each cell takes and is useful for logging progress in the longer tasks. We then import pandas, NumPy and some useful NLTK and collections modules, connect Google Drive and load the DataFrame.

In [1]:
!pip install ipython-autotime
 
%load_ext autotime

time: 1.96 ms (started: 2021-06-04 05:01:01 +00:00)


In [2]:
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
import scipy.sparse
import string
tqdm.pandas()

time: 702 ms (started: 2021-06-04 05:01:01 +00:00)




In [3]:
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
time: 102 ms (started: 2021-06-04 05:01:01 +00:00)


In [4]:
# load stopwords
sw = set(stopwords.words('english'))

time: 4.23 ms (started: 2021-06-04 05:01:02 +00:00)


In [5]:
from google.colab import drive
drive.mount("/content/drive")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
time: 2.53 ms (started: 2021-06-04 05:01:02 +00:00)


In [28]:
p = '/content/drive/MyDrive/Computational Data science'  # directory that holds reviews.csv
df = pd.read_csv(os.path.join(p,'reviews.csv'))
# deal with empty reviews
df.comments = df.comments.fillna('')

time: 2.53 s (started: 2021-06-04 06:36:44 +00:00)


In [29]:
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...


time: 22.5 ms (started: 2021-06-04 06:36:48 +00:00)


In [30]:
df.shape

(452143, 6)

time: 5.76 ms (started: 2021-06-04 06:36:50 +00:00)


### 3.a1 - Process reviews

What to implement: A function `process_reviews(df)` that will take as input the original dataframe and will return it with three additional columns: `tokenized`, `tagged` and `lower_tagged`.

In [11]:
def process_reviews(df):
  ''' Tokenize each review, POS-tag it and lower-case the tagged tokens.
  argument: DataFrame with a `comments` column
  return: the same DataFrame with three additional columns (`tokenized`, `tagged`, `lower_tagged`)'''

  # word-tokenize each comment and store the result in a new column
  df['tokenized'] = df['comments'].apply(word_tokenize)

  # strip punctuation, split on whitespace and POS-tag the resulting tokens
  # (the tagger therefore sees whitespace tokens, not the `tokenized` column)
  strip_punct = str.maketrans('', '', string.punctuation)
  df["tagged"] = [pos_tag(comment.translate(strip_punct).split()) for comment in df.comments]

  # lower-case the token of every (token, tag) pair; the tag itself is kept as is
  df["lower_tagged"] = [[(word.lower(), tag) for word, tag in tagged] for tagged in df.tagged]
  return df

time: 13.4 ms (started: 2021-06-04 05:01:04 +00:00)


In [12]:
df = process_reviews(df)

time: 10min 44s (started: 2021-06-04 05:01:04 +00:00)


In [13]:
df

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,tokenized,tagged,lower_tagged
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...,"[Daniel, is, really, cool, ., The, place, was,...","[(Daniel, NNP), (is, VBZ), (really, RB), (cool...","[(daniel, NNP), (is, VBZ), (really, RB), (cool..."
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...,"[Daniel, is, the, most, amazing, host, !, His,...","[(Daniel, NNP), (is, VBZ), (the, DT), (most, R...","[(daniel, NNP), (is, VBZ), (the, DT), (most, R..."
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...,"[We, had, such, a, great, time, in, Amsterdam,...","[(We, PRP), (had, VBD), (such, JJ), (a, DT), (...","[(we, PRP), (had, VBD), (such, JJ), (a, DT), (..."
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...,"[Very, professional, operation, ., Room, is, v...","[(Very, RB), (professional, JJ), (operation, N...","[(very, RB), (professional, JJ), (operation, N..."
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...,"[Daniel, is, highly, recommended, ., He, provi...","[(Daniel, NNP), (is, VBZ), (highly, RB), (reco...","[(daniel, NNP), (is, VBZ), (highly, RB), (reco..."
...,...,...,...,...,...,...,...,...,...
199995,8560237,141676637,2017-04-04,17391423,Paul,It was a pleasure to stay at Marja's home. The...,"[It, was, a, pleasure, to, stay, at, Marja, 's...","[(It, PRP), (was, VBD), (a, DT), (pleasure, NN...","[(it, PRP), (was, VBD), (a, DT), (pleasure, NN..."
199996,8560237,146877857,2017-04-24,114705007,Dennis,Ruhig und durch den Anschluss zur Metro in 7 m...,"[Ruhig, und, durch, den, Anschluss, zur, Metro...","[(Ruhig, NNP), (und, JJ), (durch, NN), (den, N...","[(ruhig, NNP), (und, JJ), (durch, NN), (den, N..."
199997,8560237,147475181,2017-04-27,108433229,Vanessa,"Eine sehr zuvorkommende und liebe Familie, bei...","[Eine, sehr, zuvorkommende, und, liebe, Famili...","[(Eine, NNP), (sehr, NN), (zuvorkommende, NN),...","[(eine, NNP), (sehr, NN), (zuvorkommende, NN),..."
199998,8560237,158076178,2017-06-05,123933418,Megan,"Very lovely place to stay, very clean and a gr...","[Very, lovely, place, to, stay, ,, very, clean...","[(Very, RB), (lovely, RB), (place, NN), (to, T...","[(very, RB), (lovely, RB), (place, NN), (to, T..."


time: 365 ms (started: 2021-06-04 05:11:48 +00:00)
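
The lower-casing step can be looked at in isolation. The sketch below is only an illustration: `lower_tagged_pairs` is a hypothetical helper name, and the `tagged` list is written by hand in the shape of one row of the `tagged` column (it is not real `pos_tag` output).

```python
def lower_tagged_pairs(tagged):
    """Lower-case the token of every (token, tag) pair, leaving the tag untouched."""
    return [(token.lower(), tag) for token, tag in tagged]

# Hand-written pairs in the same shape as one row of the `tagged` column.
tagged = [("Daniel", "NNP"), ("is", "VBZ"), ("really", "RB"), ("cool", "JJ")]
lower_tagged = lower_tagged_pairs(tagged)
print(lower_tagged)
```

On the full DataFrame, a helper like this could be applied with `df['tagged'].apply(lower_tagged_pairs)`.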


### 3.a2 - Create a vocabulary

What to implement: A function `get_vocab(df)` which takes as input the DataFrame generated in step 1.c, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives). 

In [14]:
def get_vocab(df):
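  ''' Build two vocabularies from the `lower_tagged` column: the 1000 most common nouns (center words)
  and the 1000 most common context words (tokens with an adjective tag, see `verb_in`).
  argument: DataFrame
  return: two lists, the center vocabulary and the context vocabulary.'''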

  # flatten the per-review lists of (token, tag) pairs into one list
  New_list1 = [pair for review in df["lower_tagged"] for pair in review]
  noun_list =[]
  verb_list = []
  noun_in = ['NNP','NNS','NNPS','NN']  # noun tags (center words)
  verb_in = ['JJS','JJ','JJR']         # adjective tags (context words); verb tags (VB*) are not included here

  # assign each token to the noun list or the context list according to its tag
  for tok,tag in New_list1:
    if tag in noun_in:
      noun_list.append(tok)            # noun: candidate center word
    elif tag in verb_in:
      verb_list.append(tok)            # adjective: candidate context word
  # use Counter to count the occurrences of each word, most frequent first
  noun_count = Counter(noun_list)
  verb_count = Counter(verb_list)
  noun_sorted = noun_count.most_common()
  verb_sorted = verb_count.most_common()
  New_verb=[]
  New_noun=[]
  for i in tqdm(range(len(verb_sorted))):
    L=verb_sorted[i][0]
    New_verb.append(L)
  for i in tqdm(range(len(noun_sorted))):
    L=noun_sorted[i][0]
    New_noun.append(L)

  # strip punctuation characters from the tokens and drop tokens that become empty
  New_noun = [''.join(c for c in s if c not in string.punctuation) for s in New_noun]
  New_noun = [s for s in New_noun if s]
  New_verb = [''.join(c for c in s if c not in string.punctuation) for s in New_verb]
  New_verb = [s for s in New_verb if s]
  print(len(New_noun),len(New_verb))

  # keep only the nouns that are not also in the context list, so the two vocabularies do not overlap
  verb_set = set(New_verb)             # set membership is much faster than scanning the list
  final_noun=[]
  for i in New_noun:
    if i not in verb_set:
      final_noun.append(i)

  # select the 1000 most common nouns and the 1000 most common context words
  cent_vocab = final_noun[:1000]
  cont_vocab = New_verb[:1000]
  return cent_vocab, cont_vocab

time: 42.4 ms (started: 2021-06-04 05:11:48 +00:00)


In [15]:
cent_vocab, cont_vocab = get_vocab(df)

100%|██████████| 28134/28134 [00:00<00:00, 1343319.39it/s]
100%|██████████| 110980/110980 [00:00<00:00, 1424199.25it/s]


110980 28134
time: 50.2 s (started: 2021-06-04 05:11:48 +00:00)
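
The core of the vocabulary step is a frequency count restricted to a set of tags. Here is a minimal sketch of that idea on toy data. `top_vocab`, `NOUN_TAGS` and `CONTEXT_TAGS` are hypothetical names, the tagged reviews are made up, and the sketch leaves out the punctuation cleaning and overlap removal done in `get_vocab`.

```python
from collections import Counter

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
CONTEXT_TAGS = {"JJ", "JJR", "JJS"}  # adjective tags, mirroring `verb_in` in get_vocab

def top_vocab(tagged_reviews, tags, n):
    """Return the n most frequent tokens whose POS tag is in `tags`."""
    counts = Counter(
        tok for review in tagged_reviews for tok, tag in review if tag in tags
    )
    return [tok for tok, _ in counts.most_common(n)]

# Made-up tagged reviews in the shape of the `lower_tagged` column.
tagged_reviews = [
    [("room", "NN"), ("clean", "JJ"), ("host", "NN")],
    [("room", "NN"), ("nice", "JJ"), ("clean", "JJ")],
    [("room", "NN"), ("host", "NN"), ("clean", "JJ")],
]
center = top_vocab(tagged_reviews, NOUN_TAGS, 2)
context = top_vocab(tagged_reviews, CONTEXT_TAGS, 2)
print(center, context)
```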


### 3.a3 Count co-occurrences between center and context words

What to implement: A function `get_coocs(df, center_vocab, context_vocab)` which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach. 

In [16]:
def get_coocs(df, cent_vocab, cont_vocab):

  '''Count co-occurrences between center and context words, sentence by sentence.
  arguments: the DataFrame generated in step 1 and the two lists generated in step 2
  returns: a dictionary of dictionaries, of the form in the example below
  'A big restaurant served delicious food in big dishes'
  {'restaurant': {'big': 2, 'served': 1, 'delicious': 1}} '''

  X=(df["comments"].apply(str.lower))
  X=list(X)

  
  # sentence tokenisation
  sentence_tokenize_lst=[]  
  for i in (X):
    k=sent_tokenize(i)
    sentence_tokenize_lst.append(k)

  # word tokenisation and save in a separate list
  word_tokenizer_list=[] 
  for i in range(len(X)):
    for j in range(len(sentence_tokenize_lst[i])):
      T=word_tokenize(sentence_tokenize_lst[i][j])
      word_tokenizer_list.append(T)

  # search the context vocab sentence by sentence and save the matches in find_context_vocab
  find_context_vocab=[]
  for i in range(len(word_tokenizer_list)):
    context_vocab_list = []    # context words found in sentence i, in vocabulary order
    for j in cont_vocab:
      if j in word_tokenizer_list[i]:
        context_vocab_list.append(j)
      if j == '.':             # compare by value; `is` tests object identity
        break
    find_context_vocab.append(context_vocab_list)

  # count the context words of each sentence and save the counts in count_context_vocab
  # (each word is listed at most once per sentence, so every count is 1)
  count_context_vocab=[]
  for i in range(len(find_context_vocab)):
    K=dict(Counter(find_context_vocab[i]))
    count_context_vocab.append(K) 


  # search the center vocab sentence by sentence and save the matches in find_center_vocab
  find_center_vocab=[]
  for i in (range(len(word_tokenizer_list))):
    center_vocab_list = []     # center words found in sentence i, in vocabulary order
    for j in cent_vocab:
      if j in word_tokenizer_list[i]:
        center_vocab_list.append(j)
      if j == '.':             # compare by value; `is` tests object identity
        break
    find_center_vocab.append(center_vocab_list)

  # count the center words of each sentence and save the counts in count_center_vocab
  count_center_vocab=[]
  for i in range(len(find_center_vocab)):
    K=dict(Counter(find_center_vocab[i]))
    count_center_vocab.append(K)

  for i in range(len(count_center_vocab)):
    for j in count_center_vocab[i]:
      count_center_vocab[i][j]=count_context_vocab[i]

  # build one {center word: context counts} dictionary per sentence;
  # the `break` keeps only the first center word found in each sentence
  new_list=[]
  for i in range(len(count_center_vocab)):
    final_dict={}
    for j in count_center_vocab[i]:
      final_dict[j]=count_center_vocab[i][j]
      break
    new_list.append(final_dict)

  # remove all empty dictionaries
  while {} in new_list:
    new_list.remove({})

  # keep only the dictionaries whose center word has at least one context word
  Dictionary =[]
  for i in range(len(new_list)):
    if new_list[i].get(next(iter(new_list[i])))!={}:
      Dictionary.append(new_list[i])

  # merge the per-sentence dictionaries into one list of context dictionaries per center word
  dict1={}
  for i in cent_vocab:
    dict1[i]=[]
  for i in range(len(Dictionary)):
    for key in cent_vocab:
      if key ==next(iter(Dictionary[i])):
        dict1[key].append(Dictionary[i].get(key))
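  # for each center word, take the first context word (in vocabulary order) of every sentence collected above
  t_list=[]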
  for i in cent_vocab:
    a_list=[]
    for j in range(len(dict1[i])):
      k=next(iter(dict1[i][j]))
      a_list.append(k)
    t_list.append(a_list)

  # build the final dictionary of dictionaries: each of the 1000 nouns (keys) maps to the counts of its context words (values) over all comments
  dict2=[]
  coocs={}
  for i,j in zip(cent_vocab,range(len(cent_vocab))):
    k=dict(Counter(t_list[j]))
    dict2.append(k)
    coocs[i]=dict2[j]
  
  return coocs

time: 89.6 ms (started: 2021-06-04 05:12:39 +00:00)


In [17]:
coocs = get_coocs(df, cent_vocab, cont_vocab)

time: 1h 19min 40s (started: 2021-06-04 05:12:39 +00:00)
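
For comparison, here is a much shorter way to count sentence-level co-occurrences. It is a sketch under stated assumptions, not the method used in the cell above. `count_coocs` is a hypothetical name, and the sentences are assumed to be lower-cased and tokenized already. Unlike `get_coocs`, it counts every context token in a sentence, once for each distinct center word in that sentence.

```python
from collections import Counter, defaultdict

def count_coocs(sentences, center_vocab, context_vocab):
    """Count, per sentence, every (center word, context token) co-occurrence."""
    center_set, context_set = set(center_vocab), set(context_vocab)
    coocs = defaultdict(Counter)
    for tokens in sentences:
        # how often each context word occurs in this sentence
        context_counts = Counter(t for t in tokens if t in context_set)
        # each distinct center word in the sentence receives those counts once
        for center in set(tokens) & center_set:
            coocs[center].update(context_counts)
    return {center: dict(counts) for center, counts in coocs.items()}

# The example sentence from the task statement, already tokenized.
sentences = [["a", "big", "restaurant", "served", "delicious", "food", "in", "big", "dishes"]]
result = count_coocs(sentences, ["restaurant"], ["big", "served", "delicious"])
print(result)
```

Set lookups and a single pass over the sentences keep the cost proportional to the number of tokens, instead of scanning both 1,000-word vocabularies for every sentence.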


### 3.a4 Convert co-occurrence dictionary to 1000x1000 dataframe
What to implement: A function called `cooc_dict2df(cooc_dict)`, which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, and each column corresponds to one context word, and cells are their corresponding co-occurrence value. Some (x,y) pairs will never co-occur, you should have a 0 value for those cases. 

In [33]:
def cooc_dict2df(coocs):
  ''' This function takes dictionary of dictionaries as argument and return a DataFrame'''
  coocdf = pd.DataFrame.from_dict(coocs,orient = 'index',dtype="Int64").fillna(0)
  return coocdf

time: 2.35 ms (started: 2021-06-04 06:40:40 +00:00)


In [34]:
'''
The DataFrame has shape (995, 903) rather than (1000, 1000): 5 nouns have no corresponding context word,
and only 903 context words occur in the co-occurrence dictionary.
'''
coocdf = cooc_dict2df(coocs)
coocdf.shape

(995, 903)

time: 371 ms (started: 2021-06-04 06:40:41 +00:00)
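
The task statement asks for one row per center word and one column per context word, with 0 for pairs that never co-occur. One possible way to get the full shape is to reindex against both vocabularies. This is a minimal sketch on toy data: `cooc_dict_to_full_df` is a hypothetical name and the counts are made up.

```python
import pandas as pd

def cooc_dict_to_full_df(coocs, center_vocab, context_vocab):
    """Build a co-occurrence DataFrame with one row per center word and one column per context word."""
    df = pd.DataFrame.from_dict(coocs, orient="index")
    # reindex adds rows/columns for labels missing from the dictionary (filled with 0);
    # fillna(0) then covers pairs inside the dictionary that never co-occurred
    df = df.reindex(index=center_vocab, columns=context_vocab, fill_value=0).fillna(0)
    return df.astype(int)

# Made-up counts: 'towel' and 'big' never appear in the dictionary.
toy_coocs = {"room": {"clean": 3, "nice": 1}, "host": {"nice": 2}}
full = cooc_dict_to_full_df(toy_coocs, ["room", "host", "towel"], ["clean", "nice", "big"])
print(full)
```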


### 3.a5 Raw co-occurrences to PMI scores

What to implement: A function `cooc2pmi(df)` that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts. 

In [23]:
def cooc2pmi(df):
  ''' Takes the DataFrame from step 4 as argument and returns a new DataFrame with PMI scores instead of raw co-occurrence counts.
  Negative scores are clipped to 0, so the result is positive PMI.'''
  row_totals = df.sum(axis=1).astype(float)         # total count of each row (center word)
  prob_cols_given_row = (df.T / row_totals).T       # P(context | center): each cell divided by its row total
  col_totals = df.sum(axis=0).astype(float)         # total count of each column (context word)
  prob_of_cols = col_totals / sum(col_totals)       # P(context): each column total divided by the grand total
  ratio = prob_cols_given_row / prob_of_cols        # P(context | center) / P(context)
  ratio[ratio==0] = 0.00001                         # replace zero ratios with 0.00001 so that the log is defined
  pmidf = np.log(ratio)                             # natural log of the ratio gives the PMI
  pmidf[pmidf < 0] = 0                              # clip negative PMI to 0
  pmidf = pmidf.fillna(0.00001)                     # fill undefined cells (e.g. 0/0) with a small constant
  return pmidf

time: 5.94 ms (started: 2021-06-04 06:32:21 +00:00)


In [32]:
pmidf = cooc2pmi(coocdf)
pmidf.shape


(995, 903)

time: 601 ms (started: 2021-06-04 06:38:09 +00:00)
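
To make the arithmetic concrete, here is a small worked example of the same quantity, log(P(context | center) / P(context)) clipped at 0, in plain Python. The counts are made up and `ppmi` is a hypothetical helper. It is a sketch of the calculation, not a replacement for `cooc2pmi`.

```python
import math

# Made-up co-occurrence counts: center word -> {context word: count}
counts = {"room": {"clean": 3, "nice": 1}, "host": {"clean": 1, "nice": 3}}

# grand total and per-context totals
total = sum(sum(row.values()) for row in counts.values())
col_totals = {}
for row in counts.values():
    for ctx, c in row.items():
        col_totals[ctx] = col_totals.get(ctx, 0) + c

def ppmi(center, context):
    """Positive PMI: max(0, log(P(context | center) / P(context)))."""
    row = counts[center]
    joint = row.get(context, 0)
    if joint == 0:
        return 0.0  # pairs that never co-occur get a score of 0
    p_ctx_given_center = joint / sum(row.values())
    p_ctx = col_totals[context] / total
    return max(0.0, math.log(p_ctx_given_center / p_ctx))

print(ppmi("room", "clean"))  # log(0.75 / 0.5)
print(ppmi("room", "nice"))   # log(0.25 / 0.5) is negative, so it is clipped to 0
```

Here "clean" is more likely next to "room" (0.75) than overall (0.5), so the score is positive. "nice" is less likely next to "room" than overall, so the clipped score is 0.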


### 3.a6 Retrieve top-k context words, given a center word

What to implement: A function `topk(df, center_word, N=10)` that takes as input: (1) the DataFrame generated in step 5, (2) a `center_word` (a string like `‘towels’`), and (3) an optional named argument called `N` with default value of 10; and returns a list of `N` strings, in order of their PMI score with the `center_word`. You do not need to handle cases for which the word `center_word` is not found in `df`. 

In [25]:
def topk(df, center_word, N=10):
  ''' Takes a DataFrame of PMI scores, a word and an optional argument N (default 10).
  Looks the word up as a column label and returns a pandas Series with the N highest PMI scores
  in that column, indexed by the row words, in descending order of PMI.'''
  # sort the column by PMI score and keep the top N entries with their index labels
  top_words = df[center_word].sort_values(ascending = False).head(N)
  return top_words

time: 4.12 ms (started: 2021-06-04 06:32:22 +00:00)


In [26]:
"machine" in cent_vocab

True

time: 3.8 ms (started: 2021-06-04 06:32:22 +00:00)


In [36]:
topk(pmidf,'coffee')

pods           6.152465
drinking       5.641640
maker          5.428547
iron           5.379275
package        5.379275
machine        4.815768
quarters       4.360706
pastries       4.333307
arrangement    4.306639
foods          4.206555
Name: coffee, dtype: float64

time: 12.3 ms (started: 2021-06-04 06:45:44 +00:00)
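
The task statement describes a lookup by center word (a row of the PMI DataFrame) that returns a list of strings, whereas `topk` above indexes a column and returns a Series. Here is a minimal sketch of the row-based variant. `topk_row` is a hypothetical name and the PMI values are made up for illustration.

```python
import pandas as pd

def topk_row(pmi_df, center_word, N=10):
    """Return the N context words (column labels) with the highest PMI in the row of `center_word`."""
    return pmi_df.loc[center_word].sort_values(ascending=False).head(N).index.tolist()

# Made-up PMI scores: rows are center words, columns are context words.
toy_pmi = pd.DataFrame(
    {"clean": [1.2, 0.0], "fresh": [0.4, 0.9], "big": [0.0, 2.1]},
    index=["towels", "room"],
)
top_towels = topk_row(toy_pmi, "towels", N=2)
print(top_towels)
```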


# 3.b Ethical, social and legal implications



Local authorities in touristic hotspots like Amsterdam, NYC or Barcelona regulate the price of recreational apartments for rent to, among others, ensure that fair rent prices are kept for year-long residents. Consider your price recommender for hosts in Question 2c. Imagine that Airbnb recommends a new host to put the price of your flat at a price which is above the official regulations established by the local government. Upon inspection, you realize that the inflated price you have been recommended comes from many apartments in the area only being offered during an annual event which brings many tourists, and which causes prices to rise. 

In this context, critically reflect on the compliance of this recommender system with **one of the five actions** outlined in the **UK’s Data Ethics Framework**. You should prioritize the action that, in your opinion, is the weakest. Then, justify your choice by critically analyzing the three **key principles** outlined in the Framework, namely _transparency_, _accountability_ and _fairness_. Finally, you should propose and critically justify a solution that would improve the recommender system in at least one of these principles. You are strongly encouraged to follow a scholarly approach, e.g., with peer-reviewed references as support. 

Your report should be between 500 and 750 words long.  

### Your answer here. No Python, only Markdown.

The Data Ethics Framework guides appropriate and responsible data use in government and the wider public sector. It helps public servants understand ethical considerations and address them within their projects, and it encourages responsible innovation. In the given context, it is clear that the proposed price recommender algorithm is not effective. Of the five actions, I think *Comply with the law* is the weakest for this recommender system. This action requires an understanding of the relevant laws and codes of practice that relate to the use of data, so I score it 2 out of 5. *Review the quality and limitations of the data* is another weak action: one cause of the situation described is a lack of data and transparency, so I again score it 2 out of 5. *Define and understand public benefit and user need* ranks third, followed by *Involve diverse expertise* and *Evaluate and consider wider policy implications*.

Considering the ethical implications of disclosing information is a major challenge for information providers. Providers have to evaluate the potential ethical or unethical use of the information they disclose.


Accountability requires effective governance and oversight mechanisms for any project. Users should be able to exercise effective oversight and control over the decisions taken by the government. In this case, however, accountability is weak because the available data is incomplete, and users are heavily affected. Hence I score it 2 out of 5.

Finally, fairness: a project must avoid unintended discriminatory effects and bias. The authority must be fair enough to reflect the reality of its residents and provide complete, robust data. I therefore score it around 3 out of 5.

There could be several solutions in this context. For instance, before obtaining data from the authority, make sure the problem is clearly articulated before the project starts. Also make sure there is an effective governance structure that includes experts on all the relevant regulations. It is therefore essential to develop any project in an ethical way while maintaining the credibility of the data.

In support of the points above, consider the statement by Luciano et al.: "The problem of defining what kind of information should be disclosed by an organisation when the ethical nature of information transparency is taken into account. The common understanding of information transparency as the process of disclosing a set of data has been challenged as too limited, in favour of a more inclusive definition that takes into account also the ethical principles factually endorsed in producing information." The points and solutions above also draw on discussion with peers.



---

...