# Natural Language Processing and Information Retrieval
Today we will work through several tasks. First we will download the relevant libraries (this is handled in the scripts below, so all you have to do is press run). Then we will read a CSV file of customer reviews for Disneyland. The data comes from Kaggle: https://www.kaggle.com/arushchillar/disneyland-reviews

The dataset we're using has far fewer rows than the original, though (about 200). Because this data is unstructured, it is up to us to create an indexer for these reviews. To do this, for a single review, we will:
1) Tokenize the words (go from a sentence to a list of words)
2) Remove all the stopwords
3) Filter the sentence down to just the nouns (so we can search for reviews that talk about particular things) and adjectives (so we can also search for how those things are described).
4) Collect our indexed terms, create an inverted index, and save it to a file.
5) Create a loop that does this for all reviews in the Disneyland review CSV.
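
To make the end goal concrete, here is a minimal sketch of an inverted index built over three made-up sentences. It uses plain `str.split` and a tiny hand-written stop word set instead of NLTK, and the names `toy_reviews`, `toy_stop_words`, and `toy_index` are illustrative only, not part of the lab.

```python
# Toy data, invented for illustration (not from the Disneyland CSV)
toy_reviews = [
    "the park was busy",
    "great rides and short queues",
    "the park has great food",
]
toy_stop_words = {"the", "was", "and", "has"}

toy_index = {}
for review_id, text in enumerate(toy_reviews):
    # set() so each review is listed at most once per term
    for term in set(text.split()):
        if term not in toy_stop_words:
            toy_index.setdefault(term, []).append(review_id)

print(toy_index["park"])   # [0, 2]
print(toy_index["great"])  # [1, 2]
```

Each key is a term and each value is the list of reviews that mention it, which is the same shape the rest of this lab builds with NLTK.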

In [1]:
# download nltk to do the language processing jobs for us
import sys
#!{sys.executable} -m pip install nltk
#!{sys.executable} -m pip install pandas

# import pandas to have our data in a table and read it from the csv
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.corpus import brown
import csv
nltk.download('stopwords')
nltk.download('punkt') # this is a tokenizer tool we need
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\WilliamClifford\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\WilliamClifford\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\WilliamClifford\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Here we will just load the Disneyland review data and print it to the screen to check it loaded correctly.

In [2]:
disney_data = pd.read_csv("disneylandReviews.csv")

In [3]:
disney_data.head()

Unnamed: 0,Review_Text
0,If you've ever been to Disneyland anywhere you...
1,Its been a while since d last time we visit HK...
2,Thanks God it wasn t too hot or too humid wh...
3,HK Disneyland is a great compact park. Unfortu...
4,"the location is not in the city, took around 1..."


# 1. Tokenize the sentence

In [4]:
#selecting the first review
example_review = disney_data["Review_Text"][0]
print("Review Raw Text: %s \n" % example_review)

#tokenize the text
tokens = nltk.word_tokenize(example_review)
print("Tokenized review text: %s" % tokens)

Review Raw Text: If you've ever been to Disneyland anywhere you'll find Disneyland Hong Kong very similar in the layout when you walk into main street! It has a very familiar feel. One of the rides  its a Small World  is absolutely fabulous and worth doing. The day we visited was fairly hot and relatively busy but the queues moved fairly well.  

Tokenized review text: ['If', 'you', "'ve", 'ever', 'been', 'to', 'Disneyland', 'anywhere', 'you', "'ll", 'find', 'Disneyland', 'Hong', 'Kong', 'very', 'similar', 'in', 'the', 'layout', 'when', 'you', 'walk', 'into', 'main', 'street', '!', 'It', 'has', 'a', 'very', 'familiar', 'feel', '.', 'One', 'of', 'the', 'rides', 'its', 'a', 'Small', 'World', 'is', 'absolutely', 'fabulous', 'and', 'worth', 'doing', '.', 'The', 'day', 'we', 'visited', 'was', 'fairly', 'hot', 'and', 'relatively', 'busy', 'but', 'the', 'queues', 'moved', 'fairly', 'well', '.']
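
To see why we use a tokenizer rather than splitting on whitespace, the sketch below compares `str.split` with a rough regular-expression tokenizer. The regex is only an approximation for illustration; it is not the algorithm `nltk.word_tokenize` uses (for example, it would split contractions such as "you've" differently). The sample sentence is made up.

```python
import re

sample = "The queues moved fairly well. Great day!"

# Naive approach: punctuation stays glued to the words
print(sample.split())
# ['The', 'queues', 'moved', 'fairly', 'well.', 'Great', 'day!']

# Rough approximation: runs of word characters, or single punctuation marks
rough_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(rough_tokens)
# ['The', 'queues', 'moved', 'fairly', 'well', '.', 'Great', 'day', '!']
```

With whitespace splitting, `well.` and `well` would become two different index terms, which is why punctuation needs to be separated before indexing.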


# 2. Remove stop words

## 2.1 To remove the stop words we first need a list of stop words to compare against.

In [5]:
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

## 2.2 Filter out the stop words

We then filter the sentence and remove the stop words. `filtered_sentence` now contains only words that are not stop words.

In [6]:
filtered_sentence = [w for w in tokens if w.lower() not in stop_words]

# also let's remove duplicates (note: a set does not keep the original word order)
filtered_sentence = set(filtered_sentence)
print(filtered_sentence)

{'Small', 'find', 'One', 'World', 'ever', 'Hong', 'familiar', 'fabulous', 'main', 'well', 'layout', 'queues', 'Kong', 'street', 'hot', "'ve", 'moved', 'absolutely', 'Disneyland', 'visited', 'relatively', 'busy', 'similar', "'ll", 'day', '.', 'fairly', 'worth', 'walk', '!', 'anywhere', 'feel', 'rides'}
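
Two optional refinements to the cell above, sketched on a made-up token list: storing the stop words in a `set` makes each membership test constant time on average, and `dict.fromkeys` removes duplicates while keeping the original word order (a plain `set` does not guarantee any order). The short stop word collection here is a hand-picked subset, not the full NLTK list.

```python
# Hand-picked subset of stop words, for illustration only
small_stop_words = {"the", "was", "and", "a", "very"}

sample_tokens = ["The", "park", "was", "busy", "and", "the", "park", "was", "hot"]

# Keep tokens whose lowercase form is not a stop word
kept = [w for w in sample_tokens if w.lower() not in small_stop_words]

# dict keys are unique and remember insertion order (Python 3.7+)
unique_in_order = list(dict.fromkeys(kept))
print(unique_in_order)  # ['park', 'busy', 'hot']
```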


# 3. Filter so we only have nouns and adjectives

This part will return each word with a corresponding part-of-speech label, for example:


* "and" is CC, a coordinating conjunction; "a" is DT, a determiner
* "now" and "completely" are RB, adverbs
* "for" is IN, a preposition; "something" is NN, a noun
* "different" is JJ, an adjective
* NNP are proper nouns.


So once the words are tagged, we will narrow them down to those whose tags begin with NN or JJ.

In [8]:
# function to test if a tag marks a noun (NN...) or an adjective (JJ...)
is_noun = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'

nouns_and_adjectives = [word for (word, pos) in nltk.pos_tag(filtered_sentence) if is_noun(pos)] 
print(nouns_and_adjectives)

['Small', 'World', 'Hong', 'familiar', 'fabulous', 'main', 'queues', 'Kong', 'street', 'hot', 'Disneyland', 'busy', 'similar', 'day', 'worth', 'walk', 'rides']
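
The filtering step can be sketched without running the tagger by working on word/tag pairs directly. The pairs below are written by hand in the Penn Treebank style that `nltk.pos_tag` returns; they are not actual tagger output, and `is_noun_or_adjective` is a hypothetical helper name. Note that `str.startswith` accepts a tuple, which covers `NN`, `NNS`, `NNP`, `JJ`, `JJR`, and so on in one test.

```python
# Hand-written (word, tag) pairs, for illustration only
tagged = [
    ("The", "DT"),
    ("queues", "NNS"),
    ("moved", "VBD"),
    ("fairly", "RB"),
    ("well", "RB"),
    ("Disneyland", "NNP"),
    ("busy", "JJ"),
]

def is_noun_or_adjective(tag):
    # True for any tag beginning with NN (nouns) or JJ (adjectives)
    return tag.startswith(("NN", "JJ"))

selected = [word for word, tag in tagged if is_noun_or_adjective(tag)]
print(selected)  # ['queues', 'Disneyland', 'busy']
```

A tagger generally relies on the surrounding words, so tagging the full token list in its original order before removing stop words may give more reliable tags than tagging an unordered set. Treat that as something to experiment with rather than a requirement of this lab.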


# 4. Save these to an inverted index structure

For this we will store each index term (e.g. Disneyland) together with the list of reviews that contain it (in this case a single item, review 1). Note that the loop in section 5 numbers the reviews from 0 instead.

In [9]:
# this will create a dictionary of each term and the corresponding list of documents
inverted_index = {index: [1] for index in nouns_and_adjectives}
print(inverted_index)

{'Small': [1], 'World': [1], 'Hong': [1], 'familiar': [1], 'fabulous': [1], 'main': [1], 'queues': [1], 'Kong': [1], 'street': [1], 'hot': [1], 'Disneyland': [1], 'busy': [1], 'similar': [1], 'day': [1], 'worth': [1], 'walk': [1], 'rides': [1]}


In [10]:
# now save this to a file, one "term, [review ids]" line per index term
with open('inverted_index.csv', 'w') as f:
    for key, review_ids in inverted_index.items():
        f.write("%s, %s\n" % (key, review_ids))

# 5. Now you should design a loop that does this for every review and puts all of this information in inverted_index.csv 

In [11]:
# get the stop words as a set for fast membership tests
stop_words = set(stopwords.words('english'))
# function to test if a tag marks a noun (NN...) or an adjective (JJ...)
is_noun = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
inverted_index = {}
# counter is the position of the review in the column, starting at 0
for counter, review in enumerate(disney_data["Review_Text"]):
    # tokenize the individual review
    tokens = nltk.word_tokenize(review)

    # remove the stop words
    filtered_sentence = [w for w in tokens if w.lower() not in stop_words]

    # also let's remove duplicates
    filtered_sentence = set(filtered_sentence)

    # keep only the words tagged as nouns or adjectives
    nouns_and_adjectives = [word for (word, pos) in nltk.pos_tag(filtered_sentence) if is_noun(pos)]

    # add this review to the list of documents for each term
    for index in nouns_and_adjectives:
        inverted_index.setdefault(index, []).append(counter)

In [12]:
print(inverted_index)

{'Small': [0, 52, 90, 121, 190], 'World': [0, 5, 44, 51, 113, 121, 135, 140], 'Hong': [0, 4, 5, 21, 22, 23, 26, 27, 32, 33, 34, 35, 36, 37, 44, 49, 51, 57, 58, 68, 71, 74, 77, 78, 82, 89, 90, 93, 96, 98, 120, 128, 139, 142, 143, 144, 148, 151, 156, 163, 164, 166, 172, 175, 177, 181, 189, 191, 199], 'familiar': [0], 'fabulous': [0, 83, 148], 'main': [0, 2, 23, 29, 37, 53, 101, 110, 144, 148, 171], 'queues': [0, 42, 52, 83, 87, 93, 98, 105, 114, 129, 144, 148, 157], 'Kong': [0, 4, 5, 21, 22, 23, 26, 27, 32, 33, 34, 35, 37, 44, 49, 51, 57, 58, 68, 71, 74, 77, 82, 89, 90, 93, 96, 98, 103, 120, 128, 139, 142, 143, 144, 148, 151, 156, 163, 164, 166, 172, 175, 177, 181, 189, 191, 199], 'street': [0, 16, 37, 46, 90, 96, 148, 171], 'hot': [0, 2, 4, 12, 29, 138, 198], 'Disneyland': [0, 1, 3, 5, 8, 9, 10, 12, 18, 19, 21, 23, 26, 27, 29, 32, 33, 34, 35, 36, 37, 42, 44, 45, 47, 48, 49, 51, 57, 58, 64, 70, 72, 74, 75, 76, 77, 79, 85, 88, 89, 90, 93, 96, 99, 106, 107, 108, 109, 111, 113, 114, 117, 12
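
Once the index exists, answering a query is a dictionary lookup. The sketch below uses a small hand-made index (not the real output above) and a hypothetical helper `search_all` that returns the reviews containing every query term by intersecting their lists of review ids.

```python
# Small hand-made index, for illustration only
sample_index = {
    "queues": [0, 4, 7],
    "hot": [0, 2, 4],
    "fabulous": [0, 9],
}

def search_all(index, terms):
    """Return sorted ids of reviews that contain every term in `terms`."""
    if not terms:
        return []
    # A term missing from the index matches no reviews
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings))

print(search_all(sample_index, ["queues", "hot"]))     # [0, 4]
print(search_all(sample_index, ["queues", "parade"]))  # []
```

Because the index in this lab keeps each word's original capitalization, `Hot` and `hot` would be separate terms, so a query has to match the case used in the reviews.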

In [13]:
# save the full index to a file, one "term, [review ids]" line per index term
with open('inverted_index.csv', 'w') as f:
    for key, review_ids in inverted_index.items():
        f.write("%s, %s\n" % (key, review_ids))
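
The cell above writes each Python list with its own commas (for example `Small, [0, 52, 90]`), which a CSV reader would split into several columns. One possible alternative, sketched here with a hand-made index and a hypothetical file name in a temporary directory, is to let the `csv` module write one term per row with the ids joined into a single space-separated field, so the file can be read back into the same dictionary.

```python
import csv
import os
import tempfile

# Small hand-made index, for illustration only
sample_index = {"queues": [0, 4, 7], "hot": [0, 2, 4]}

# Hypothetical output path in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "inverted_index_demo.csv")

# newline="" is what the csv module documentation recommends for file objects
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["term", "review_ids"])
    for term, review_ids in sample_index.items():
        # Store the ids as one space-separated field, e.g. "0 4 7"
        writer.writerow([term, " ".join(map(str, review_ids))])

# Read the file back into the same dictionary shape
loaded = {}
with open(path, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        loaded[row["term"]] = [int(i) for i in row["review_ids"].split()]

print(loaded == sample_index)  # True
```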

# Self Directed Section

For the remainder of this lab I want you to download another dataset of unstructured data. The choice of dataset is up to you, but I recommend one that is distributed as a CSV file, such as a restaurant review dataset: https://www.kaggle.com/search?q=restaurant+reviews

Once downloaded, I would like you to do the following:

1. Parse the files and display their contents in a pandas dataframe
2. Tokenize the words (go from a sentence to a list of words)
3. Remove all the stopwords
4. Filter the sentence down to just the nouns (so we can search for reviews that talk about particular things) and adjectives (so we can also search for how those things are described).
5. Collect our indexed terms, create an inverted index, and save it to a file.
6. Create a loop that does this for all reviews in the restaurant review CSV.
