
# Part 3 - Text analysis and ethics

# 3.a Computing PMI

In this assessment you are tasked with discovering strong associations between concepts in Airbnb reviews. The starter code provided in this notebook is for orientation only. The imports below are enough to implement a valid answer.

### Imports, data loading and helper functions

We first import pandas, NumPy and some useful NLTK and collections modules, then connect Google Drive and load the dataframe. `tqdm` is imported to log our progress in some of the tasks.

In [None]:
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
import scipy.sparse
import string
tqdm.pandas()

In [None]:
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

In [None]:
# load stopwords
sw = set(stopwords.words('english'))

In [None]:
from google.colab import drive
drive.mount("/content/drive")


In [None]:
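# os.path.join discards earlier parts when a later part is absolute,
# so keep the directory and the file name separate.
p = '/content/drive/MyDrive/Computational Data science'
df = pd.read_csv(os.path.join(p, 'reviews.csv'))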
# deal with empty reviews
df.comments = df.comments.fillna('')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
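# keep the first 200,000 reviews; copy so that new columns can be added without a SettingWithCopyWarning
df = df[:200000].copy()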

### 3.a1 - Process reviews

What to implement: A function `process_reviews(df)` that will take as input the original dataframe and will return it with three additional columns: `tokenized`, `tagged` and `lower_tagged`.
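
As a rough illustration of what the three columns might contain, here is a toy sketch. The review text and the `toy_tags` lookup are made up for this example; the hand-written tags stand in for the output of `pos_tag`, so the sketch runs without any NLTK resources.

```python
import pandas as pd

# One made-up review; whitespace splitting stands in for word_tokenize.
toy = pd.DataFrame({"comments": ["Great Location near Vondelpark"]})
toy["tokenized"] = toy["comments"].str.split()

# Hypothetical hand-written Penn Treebank tags (an assumption, not tagger output).
toy_tags = {"Great": "JJ", "Location": "NN", "near": "IN", "Vondelpark": "NNP"}
toy["tagged"] = toy["tokenized"].apply(lambda toks: [(t, toy_tags[t]) for t in toks])

# lower_tagged keeps the tag assigned to the original casing and lowercases only the token.
toy["lower_tagged"] = toy["tagged"].apply(
    lambda pairs: [(t.lower(), tag) for t, tag in pairs]
)

print(toy.loc[0, "lower_tagged"])
```

Tagging before lowercasing matters because the tag is attached to the original token; only the token string is changed afterwards.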

In [None]:
# def process_reviews(df):
#   df["tokenized"]=df.comments.apply(word_tokenize)
#   df["tagged"]=df.tokenized.apply(pos_tag)
#   df["lower_comments"]=df.comments.apply(str.lower)
#   df["lower_tagged"]=df.lower_comments.apply(lambda x: pos_tag(word_tokenize(x)))
#   df.drop(['lower_comments'],axis=1,inplace=True)
#   return df

In [None]:
def process_reviews(df):
  df["tokenized"] = df["comments"].apply(word_tokenize)
  df["sen_tokenized"] = df["comments"].apply(sent_tokenize)
  # Strip punctuation before tagging so that punctuation marks are not tagged as words.
  strip_punct = str.maketrans('', '', string.punctuation)
  df["tagged"] = df["comments"].apply(lambda c: pos_tag(c.translate(strip_punct).split()))
  # Lowercase the tokens only after tagging; each tag stays as assigned to the original casing.
  df["lower_tagged"] = df["tagged"].apply(lambda pairs: [(tok.lower(), tag) for tok, tag in pairs])
  return df

In [None]:
df=process_reviews(df)

### 3.a2 - Create a vocabulary

What to implement: A function `get_vocab(df)` which takes as input the DataFrame generated in step 1.c, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives). 
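
The counting step can be sketched on toy data as follows. The rows and tags are hand-written for illustration, and `top_k` is a hypothetical stand-in for the 1,000 used in the assignment.

```python
from collections import Counter

# Toy lower_tagged rows (hand-written Penn Treebank tags, for illustration only).
rows = [
    [("great", "JJ"), ("location", "NN"), ("clean", "JJ"), ("room", "NN")],
    [("room", "NN"), ("was", "VBD"), ("clean", "JJ")],
    [("we", "PRP"), ("loved", "VBD"), ("the", "DT"), ("location", "NN")],
]

center_counts, context_counts = Counter(), Counter()
for row in rows:
    for token, tag in row:
        if tag.startswith("NN"):            # NN, NNS, NNP, NNPS
            center_counts[token] += 1
        elif tag.startswith(("JJ", "VB")):  # adjectives and verbs
            context_counts[token] += 1

top_k = 2  # the assignment uses 1,000
center_vocab = [w for w, _ in center_counts.most_common(top_k)]
context_vocab = [w for w, _ in context_counts.most_common(top_k)]
print(center_vocab, context_vocab)
```

Matching on the tag prefix is one convenient way to cover a whole tag family; listing the tags explicitly works just as well.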

In [None]:
def get_vocab(df):
  # Penn Treebank tags: nouns are center words; adjectives and verbs are context words.
  noun_tags = {'NN', 'NNS', 'NNP', 'NNPS'}
  context_tags = {'JJ', 'JJR', 'JJS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}
  noun_count = Counter()
  context_count = Counter()
  for tagged in tqdm(df["lower_tagged"]):
    for tok, tag in tagged:
      if tag in noun_tags:
        noun_count[tok] += 1
      elif tag in context_tags:
        context_count[tok] += 1

  def clean(counter):
    # Most frequent first; strip punctuation, then drop empty strings and duplicates.
    words = (''.join(c for c in w if c not in string.punctuation) for w, _ in counter.most_common())
    return list(dict.fromkeys(w for w in words if w))

  cent_vocab = clean(noun_count)[:1000]
  # A word that is frequent in both roles is kept only as a center word.
  cent_set = set(cent_vocab)
  cont_vocab = [w for w in clean(context_count) if w not in cent_set][:1000]
  print(len(cent_vocab), len(cont_vocab))
  return cent_vocab, cont_vocab

In [None]:
cent_vocab, cont_vocab = get_vocab(df)

In [None]:
print("Center_vocab List :",cent_vocab,'\n',
      "Context_vocab List :",cont_vocab)

### 3.a3 - Count co-occurrences between center and context words

What to implement: A function `get_coocs(df, center_vocab, context_vocab)` which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach. 
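
One possible shape for the dictionary of dictionaries is sketched below, assuming per-sentence context and counting each word at most once per sentence. The vocabularies and sentences are made up, and whitespace splitting stands in for `word_tokenize`.

```python
from collections import Counter, defaultdict

center_vocab = ["room", "host"]
context_vocab = ["clean", "friendly", "big"]

# Already lowercased toy sentences.
sentences = [
    "the room was clean and big",
    "friendly host and clean room",
    "the host was friendly",
]

center_set, context_set = set(center_vocab), set(context_vocab)
coocs = defaultdict(Counter)
for sentence in sentences:
    tokens = set(sentence.split())  # a word counts once per sentence
    for center in tokens & center_set:
        for context in tokens & context_set:
            coocs[center][context] += 1

# Plain nested dicts: {center_word: {context_word: count}}
coocs = {center: dict(counts) for center, counts in coocs.items()}
print(coocs)
```

Using the sentence as context keeps pairs local; a full-review context would link words that are far apart, while a fixed window would need a choice of window size.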

In [None]:
def get_coocs(df, cent_vocab, cont_vocab):
  # Context is defined per sentence: a center word co-occurs with every
  # context word found in the same sentence. Reviews are lowercased first
  # so that tokens match the lowercased vocabularies.
  # Returns a list of single-key dicts {center: {context: 1, ...}},
  # one per (sentence, center word) pair.
  coocs = []
  for comment in tqdm(df["comments"].str.lower()):
    for sentence in sent_tokenize(comment):
      tokens = set(word_tokenize(sentence))
      # A word repeated within a sentence counts once (presence, not frequency).
      context_counts = {w: 1 for w in cont_vocab if w in tokens}
      if not context_counts:
        continue  # no context word in this sentence, nothing to record
      # Record every center word in the sentence, not only the first one found.
      # The vocabularies do not overlap, so a word never pairs with itself.
      for center in cent_vocab:
        if center in tokens:
          coocs.append({center: dict(context_counts)})
  return coocs

In [None]:
coocs = get_coocs(df, cent_vocab, cont_vocab)

In [None]:
coocs

### 3.a4 - Convert co-occurrence dictionary to 1000x1000 dataframe

What to implement: A function called `cooc_dict2df(cooc_dict)`, which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, and each column corresponds to one context word, and cells are their corresponding co-occurrence value. Some (x,y) pairs will never co-occur, you should have a 0 value for those cases.
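
A minimal sketch of the conversion on toy data, assuming a nested dictionary of the form `{center: {context: count}}`. The vocabularies and counts are made up; `reindex` is used here so that words with no co-occurrences still get a row or column of zeros.

```python
import pandas as pd

center_vocab = ["room", "host", "kitchen"]
context_vocab = ["clean", "friendly", "big"]
cooc_dict = {
    "room": {"clean": 2, "big": 1, "friendly": 1},
    "host": {"friendly": 2, "clean": 1},
}

toy_df = (
    pd.DataFrame.from_dict(cooc_dict, orient="index")    # one row per center word
    .reindex(index=center_vocab, columns=context_vocab)  # add missing rows and columns
    .fillna(0)                                           # pairs that never co-occur
    .astype(int)
)
print(toy_df)
```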

In [None]:
def cooc_dict2df(coocs):
  # `coocs` is the list of single-key dicts {center: {context: count}} from get_coocs.
  # Sum every context count of every entry into one counter per center word.
  totals = defaultdict(Counter)
  for entry in coocs:
    for center, context_counts in entry.items():
      totals[center].update(context_counts)
  # One row per center word, in vocabulary order (uses the notebook-level cent_vocab).
  rows = {center: dict(totals[center]) for center in cent_vocab}
  # Pairs that never co-occur are missing from the dicts and become 0.
  coocdf = pd.DataFrame.from_dict(rows, orient='index', dtype="Int64").fillna(0)
  return coocdf

In [None]:
coocdf = cooc_dict2df(coocs)
print(coocdf.shape)
coocdf

### 3.a5 - Raw co-occurrences to PMI scores

What to implement: A function `cooc2pmi(df)` that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts. 
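
For reference, pointwise mutual information is PMI(x, y) = log( P(x, y) / (P(x) P(y)) ), which equals log( P(y | x) / P(y) ). The toy example below works through it on a made-up 2x2 count table; clipping negative scores to zero (positive PMI) is one common choice, shown here as an assumption rather than a requirement of the task.

```python
import numpy as np
import pandas as pd

# Made-up co-occurrence counts: rows are center words, columns are context words.
counts = pd.DataFrame(
    [[3, 1], [1, 3]],
    index=["room", "host"],
    columns=["clean", "friendly"],
    dtype=float,
)

total = counts.to_numpy().sum()
p_xy = counts / total             # joint probability P(x, y)
p_x = counts.sum(axis=1) / total  # marginal of each center word
p_y = counts.sum(axis=0) / total  # marginal of each context word

# Expected joint probability if center and context words were independent.
expected = pd.DataFrame(np.outer(p_x, p_y), index=counts.index, columns=counts.columns)

pmi = np.log(p_xy / expected)
ppmi = pmi.clip(lower=0)          # keep only positive associations
print(ppmi)
```

Here `room` and `clean` co-occur 1.5 times as often as independence would predict, so their PMI is log(1.5), while `room` and `friendly` co-occur less often than expected and are clipped to 0.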

In [None]:
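def cooc2pmi(df):
  counts = df.astype(float)
  # P(context | center): each row divided by its row total.
  row_totals = counts.sum(axis=1)
  prob_cols_given_row = counts.div(row_totals, axis=0)
  # P(context): column totals over the grand total.
  col_totals = counts.sum(axis=0)
  prob_of_cols = col_totals / col_totals.sum()
  # PMI = log(P(context | center) / P(context)).
  ratio = prob_cols_given_row / prob_of_cols
  # log(0) is -inf, so zero ratios get a small positive stand-in that is clipped below.
  ratio[ratio == 0] = 0.00001
  pmidf = np.log(ratio)
  # Keep only positive associations (positive PMI); negative scores become 0.
  pmidf[pmidf < 0] = 0
  # Rows or columns with no co-occurrences give 0/0 = NaN; treat them as no association.
  return pmidf.fillna(0)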

In [None]:
pmidf = cooc2pmi(coocdf)


In [None]:
print(pmidf.shape)
pmidf

### 3.a6 - Retrieve top-k context words, given a center word

What to implement: A function `topk(df, center_word, N=10)` that takes as input: (1) the DataFrame generated in step 5, (2) a `center_word` (a string like `'towels'`), and (3) an optional named argument called `N` with default value of 10; and returns a list of `N` strings, in order of their PMI score with the `center_word`. You do not need to handle cases for which the word `center_word` is not found in `df`.
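
The lookup can be sketched on a toy PMI table as follows. The scores are made up and `topk_sketch` is a hypothetical name used only for this illustration; it assumes center words are the rows and context words are the columns.

```python
import pandas as pd

# Made-up PMI scores: rows are center words, columns are context words.
toy_pmi = pd.DataFrame(
    [[0.4, 0.0, 1.2], [0.0, 0.9, 0.1]],
    index=["room", "host"],
    columns=["clean", "friendly", "big"],
)

def topk_sketch(df, center_word, N=10):
    row = df.loc[center_word]  # scores of one center word against all context words
    return row.sort_values(ascending=False).head(N).index.tolist()

print(topk_sketch(toy_pmi, "room", N=2))
```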

In [None]:
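def topk(df, center_word, N=10):
  # Center words are the rows of the PMI DataFrame, so select the row with .loc.
  scores = df.loc[center_word]
  # Return the N context words with the highest PMI, best first, as a list of strings.
  return scores.sort_values(ascending=False).head(N).index.tolist()

In [None]:
topk(pmidf, 'towels')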

# 3.b Ethical, social and legal implications



Local authorities in tourist hotspots like Amsterdam, NYC or Barcelona regulate the price of recreational apartments for rent to, among other things, ensure that rents stay fair for year-long residents. Consider your price recommender for hosts in Question 2c. Imagine that Airbnb recommends that a new host set the price of their flat above the limit established by the local government's official regulations. Upon inspection, you realize that the inflated recommendation comes from many apartments in the area being offered only during an annual event that brings many tourists and causes prices to rise.

In this context, critically reflect on the compliance of this recommender system with **one of the five actions** outlined in the **UK’s Data Ethics Framework**. You should prioritize the action that, in your opinion, is the weakest. Then, justify your choice by critically analyzing the three **key principles** outlined in the Framework, namely _transparency_, _accountability_ and _fairness_. Finally, you should propose and critically justify a solution that would improve the recommender system in at least one of these principles. You are strongly encouraged to follow a scholarly approach, e.g., with peer-reviewed references as support. 

Your report should be between 500 and 750 words long.  

### Your answer here. No Python, only Markdown.

Write your answer after the line.

---

...