# U1 Activity NLP

First, download the hotel reviews dataset into the `data/` directory and uncompress it.

In [None]:
# If you haven't downloaded the dataset yet, uncomment and run. Note that unzipping is OS-dependent.
# -L makes curl follow the redirect that GitHub issues for raw file URLs
# !curl -L https://github.com/kavgan/OpinRank/raw/master/OpinRankDatasetWithJudgments.zip -o "data/OpinRankDatasetWithJudgments.zip"
# !7z x -y "data/OpinRankDatasetWithJudgments.zip" -odata # 7-Zip / Windows
# !unzip "data/OpinRankDatasetWithJudgments.zip" -d data # macOS

## Load the dataset

Inside the uncompressed data we find the following structure:

- data
  - cars
  - hotels
    - data
      - beijing
      - chicago
      - dubai
      - las-vegas
      - london
      - montreal
      - new-delhi
      - new-york-city
      - san-francisco
      - shanghai
      - beijing.csv
      - chicago.csv
      - dubai.csv
      - las-vegas.csv
      - london.csv
      - montreal.csv
      - new-delhi.csv
      - new-york-city.csv
      - san-francisco.csv
      - shanghai.csv
    - judgments
      - beijing
      - chicago
      - dubai
      - las-vegas
      - london
      - montreal
      - new-delhi
      - new-york-city
      - san-francisco
      - shanghai

### Understanding the data structure

The data is grouped by city, which means there is one file or directory per city.
For each city, there are 3 types of data:
 - the hotel list, as `hotels/data/<city>.csv` (comma-separated)
 - the user reviews, one file per hotel, inside `hotels/data/<city>/` (tab-separated)
 - the hotel statistics, inside `hotels/judgments/<city>`
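
The layout above can be turned into a small path helper. This is only a sketch: the function name `city_paths` and the dictionary keys are hypothetical, and the root `data/hotels` is assumed from the directory tree shown earlier.

```python
from pathlib import Path


def city_paths(city, root="data/hotels"):
    """Return the three locations that hold data for one city.

    `city` is a directory name such as "beijing" or "new-york-city".
    The root folder is an assumption based on the tree shown above.
    """
    root = Path(root)
    return {
        "hotel_list": root / "data" / f"{city}.csv",  # one CSV per city
        "reviews_dir": root / "data" / city,          # one review file per hotel
        "judgments": root / "judgments" / city,       # hotel statistics
    }


paths = city_paths("beijing")
print(paths["hotel_list"].as_posix())
```

The helper only builds paths; it does not check that they exist on disk.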

As a first step, we load only the hotel lists and merge them into a single DataFrame called `hotels`.

In [34]:
import os
import pandas as pd

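# Read every per-city hotel list, then stack them into a single DataFrame.
# Sorting the file names keeps the row order reproducible across systems.
frames = []
for name in sorted(os.listdir("data/hotels/data")):
    if name.endswith(".csv"):
        frames.append(pd.read_csv(f"data/hotels/data/{name}", delimiter=",", index_col=False))

hotels = pd.concat(frames, axis=0)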

hotels.head()

Unnamed: 0,doc_id,hotel_name,hotel_url,street,city,state,country,zip,class,price,num_reviews,CLEANLINESS,ROOM,SERVICE,LOCATION,VALUE,COMFORT,overall_ratingsource
0,china_beijing_holiday_inn_central_plaza,holiday inn central plaza,http://www.tripadvisor.com/ShowUserReviews-g29...,no.1 caiyuan street xuanwu district,beijing,-1,China,100053,-1,-1,247,4.786408,4.631068,4.73301,3.553398,4.699029,0.0,4.480583
1,china_beijing_hilton_beijing_wangfujing,hilton beijing wangfujing,http://www.tripadvisor.com/ShowUserReviews-g29...,no.8 wangfujing east street dongcheng,beijing,-1,China,100006,-1,-1,74,4.810345,4.844828,4.758621,4.827586,4.517241,0.0,4.751724
2,china_beijing_hotel_g,hotel g,http://www.tripadvisor.com/ShowUserReviews-g29...,a7 worker's stadium chaoyang district,beijing,-1,China,100020,-1,-1,110,4.769231,4.75,4.576923,4.375,4.653846,0.0,4.625
3,china_beijing_the_regent_beijing,the regent beijing,http://www.tripadvisor.com/ShowUserReviews-g29...,no.99 jinbao street dongcheng district,beijing,-1,China,100005,-1,-1,111,4.625,4.8125,4.4375,4.645833,4.53125,0.0,4.610417
4,china_beijing_the_st_regis_beijing,the st regis beijing,http://www.tripadvisor.com/ShowUserReviews-g29...,no.21jianguomenwai street chaoyang district,beijing,-1,China,100020,-1,-1,89,4.846154,4.646154,4.615385,4.492308,4.184615,0.0,4.556923


In [99]:
# Return the reviews of a hotel, given its positional index in the hotels DataFrame
def getReviews(dataframe, idx):
  docId = dataframe.iloc[idx]["doc_id"]
  city = dataframe.iloc[idx]["city"]
  fileName = f"data/hotels/data/{city}/{docId}"

  reviews = pd.read_csv(fileName, delimiter="\t", index_col=False, header=None, encoding="iso-8859-1")
  reviews = reviews.drop(columns=3) # last column is generated by a tab before end of line
  reviews.columns = ["date", "title", "review"]
  return reviews

# test the function
reviews = getReviews(hotels, 10)
reviews.head()


Unnamed: 0,date,title,review
0,,Excellent realistic Chinese Courtyard Accommod...,We can't recommend this hotel enough. It reall...
1,,Very warm and helpful staff,Hi I had made a reservation to stay at Michael...
2,Nov 23 2009,Cozy and friendly courtyard for socializing,The most comfortable bed I have slept in for q...
3,Nov 16 2009,Excellent,Michael's House combines authentic old Chinese...
4,Nov 13 2009,Magic @ Michael's,"Unfortunately only stayed one night,wished it ..."


## 1. Which parts of a room are the most mentioned in each city?

In this section, I use WordNet to build a list of room-related terms and match them against the reviews. First, I look for the sense of "room" that we need. Then I use its hyponyms as the terms to search for in the reviews.
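
The matching step could look roughly like the sketch below. It is an illustration under stated assumptions, not the notebook's final method: `ROOM_TERMS` is a hand-picked, hypothetical subset of the hyponyms of `room.n.01` (in the notebook the terms would come from the lemma names of the hyponyms, with underscores replaced by spaces), and `count_room_mentions` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumption: a small hand-picked subset of the hyponyms of room.n.01.
# Multi-word lemmas such as "dining_room" are written with a space.
ROOM_TERMS = ["bathroom", "bedroom", "dining room", "ballroom", "closet"]


def count_room_mentions(reviews, terms=ROOM_TERMS):
    """Count how often each term (or its plain plural) appears in the reviews."""
    counts = Counter()
    for text in reviews:
        lowered = text.lower()
        for term in terms:
            # Whole-word match, with an optional trailing "s" for plurals
            pattern = r"\b" + re.escape(term) + r"s?\b"
            counts[term] += len(re.findall(pattern, lowered))
    return counts


sample_reviews = [
    "The bathroom was spotless and the bedroom was huge.",
    "Two bathrooms, but the dining room was closed.",
]
print(count_room_mentions(sample_reviews).most_common())
```

Running the same function on the `review` column of every hotel in a city, and summing the counters, would give per-city totals. Exact matching like this misses misspelled words, so it undercounts on noisy text.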

In [108]:
from nltk.corpus import wordnet

# check the room definition that we need
for sense in wordnet.synsets('room'):
    print(sense)
    print(sense.definition())
    print(sense.examples())
    print("-"*10)

# Synset('room.n.01') is the definition that I'm looking for

Synset('room.n.01')
an area within a building enclosed by walls and floor and ceiling
['the rooms were very small but they had a nice view']
----------
Synset('room.n.02')
space for movement
['room to pass', 'make way for', 'hardly enough elbow room to turn around']
----------
Synset('room.n.03')
opportunity for
['room for improvement']
----------
Synset('room.n.04')
the people who are present in a room
['the whole room was cheering']
----------
Synset('board.v.02')
live and take one's meals at or in
['she rooms in an old boarding house']
----------


In [103]:
room = wordnet.synset('room.n.01')
room.hyponyms()

# Next, we need to "correct" the spelling of words in the reviews, e.g. some reviewers wrote "bathrom" instead of "bathroom".


# https://subscription.packtpub.com/book/application-development/9781782167853/1/ch01lvl1sec16/calculating-wordnet-synset-similarity

[Synset('anechoic_chamber.n.01'),
 Synset('anteroom.n.01'),
 Synset('back_room.n.01'),
 Synset('ballroom.n.01'),
 Synset('barroom.n.01'),
 Synset('bathroom.n.01'),
 Synset('bedroom.n.01'),
 Synset('belfry.n.02'),
 Synset('billiard_room.n.01'),
 Synset('boardroom.n.01'),
 Synset('cardroom.n.01'),
 Synset('cell.n.06'),
 Synset('cell.n.07'),
 Synset('chamber.n.03'),
 Synset('checkroom.n.01'),
 Synset('classroom.n.01'),
 Synset('clean_room.n.01'),
 Synset('cloakroom.n.02'),
 Synset('closet.n.04'),
 Synset('clubroom.n.01'),
 Synset('compartment.n.02'),
 Synset('conference_room.n.01'),
 Synset('control_room.n.01'),
 Synset('court.n.02'),
 Synset('cubby.n.01'),
 Synset('cutting_room.n.01'),
 Synset('darkroom.n.01'),
 Synset('den.n.04'),
 Synset('dinette.n.01'),
 Synset('dining_room.n.01'),
 Synset('door.n.05'),
 Synset('dressing_room.n.01'),
 Synset('durbar.n.01'),
 Synset('engineering.n.03'),
 Synset('floor.n.10'),
 Synset('furnace_room.n.01'),
 Synset('gallery.n.03'),
 Synset('gallery.n.04'