# Homework 3 - Find the perfect place to stay in Texas!

# # Davide Toma, Fabio Montello, Umawhawhire Bonaventure

The homework consists in analyzing the text of Airbnb property listings and building a search engine.

## Step 0: Preparation code

In this part we will write some functions to make sure all the libraries that will be needed along the run of the program are available. We divided this in three main block of code: a function to install missing libraries, a block to check that the libraries that usually needs to be installed are already in the system and a general import of the other libraries that are for sure part of Python packages suite.

In the first part we define a function that will let us install a package in the future, if it does not exist in the local packages. The library will be installed through `pip` just if it does actually exists in the repositories.

In [5]:
def install(package):
    try:
        import subprocess
        import sys
        subprocess.call([sys.executable, "-m", "pip", "install", package])
        print("Library installed")
    except:
        print("Cannot install the library")

In the second part we try to install all the libraries that usually are not part of the local set of packages. In the case the library is already available, we import it, otherwise we install it with the previous function, and then try again to import it.

In [6]:
try:
    from langdetect import detect
except:
    print("The library LanfDetect is not in the repository, I will install it")
    install('langdetect')
    from langdetect import detect

try:
    from texttable import Texttable
except:
    print("The library TextTable is not in the repository, I will install it")
    install('texttable')
    from texttable import Texttable

try:
    from geopy.geocoders import Nominatim
    import geopy.distance
except:
    print("The library geopy is not in the repository, I will install it")
    install('geopy')
    from geopy.geocoders import Nominatim
    import geopy.distance

try:
    import tabulate
except:
    print("The library tabulate is not in the repository, I will install it")
    install('tabulate')
    import tabulate

The library tabulate is not in the repository, I will install it
Library installed


In the last block we import the common libraries that usually comes with Python (supposing that Anaconda is installed as well). In particular we import `unicodecsv` to not have problems with the import and export of text with special caracters. `Pickle` will be used to import/export the dictionaries into files, `nltk` is used to clean the sentences from the stopwords and for stemming, `langdetect` to understand in which language a text is written. `Pandas` is used to load the initial CSV and do some operations in a quicker way.
After all the imports, we download/update the list of the stopwords with the command `nltk.download('stopwords')`. `folium` is to create a map displaying the points and `re` is to apply regex to a string.

In [7]:
import pandas as pd
import unicodecsv as csv # We also import the unicodecsv library to handle any Unicode objects in the CSV report...as csv.
import nltk
import string
import pickle
import folium
import re
import json

from time import sleep
from IPython.display import HTML, display
from nltk import word_tokenize
from nltk.corpus import stopwords
from langdetect import detect
from heapq import heappush, heappop
from scipy import spatial

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\david\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Step 1: Load the Data

Now we want to read the data from the initial CSV. We start by defining the directory of the data and then running the command to read the CSV.

In [19]:
DIR = 'C:/Users/david/Desktop/DATA SCIENCE/AMD/documents/' # data/docs

df = pd.read_csv("Airbnb_Texas_Rentals.csv", sep=",") # we read the dataset

We want then to clean the data we will not be using. In particular the following actions has been executed:
- Remove the column `Unamed: 0` which is useless for our analysis
- Remove all the rows which do not have a description, a latitude or a longitude
- Remove all the rows where the description is a `\` or a `.`
- Replace the new lines characters with a space in the whole dataframe
- Define a new column language, with undefined language for each row (will be updated later)

Finally we display the table, useful later for reference

In [20]:
df = df.drop('Unnamed: 0',axis = 1) #this column is useless for our analisys
df = df[pd.notnull((df['description'])) & pd.notnull((df['latitude'])) & pd.notnull((df['longitude'])) & (df['description'] != "\\" ) & (df['description'] != "." )] #If the row has no description, for us is useless, so we drop it
df = df.replace(r'\\n',' ', regex=True) #Remove all the new lines in the dataframe
df['language'] = 'nd'
df.head()

Unnamed: 0,average_rate_per_night,bedrooms_count,city,date_of_listing,description,latitude,longitude,title,url,language
0,$27,2,Humble,May 2016,Welcome to stay in private room with queen bed...,30.020138,-95.293996,2 Private rooms/bathroom 10min from IAH airport,https://www.airbnb.com/rooms/18520444?location...,nd
1,$149,4,San Antonio,November 2010,"Stylish, fully remodeled home in upscale NW – ...",29.503068,-98.447688,Unique Location! Alamo Heights - Designer Insp...,https://www.airbnb.com/rooms/17481455?location...,nd
2,$59,1,Houston,January 2017,'River house on island close to the city' A w...,29.829352,-95.081549,River house near the city,https://www.airbnb.com/rooms/16926307?location...,nd
3,$60,1,Bryan,February 2016,Private bedroom in a cute little home situated...,30.637304,-96.337846,Private Room Close to Campus,https://www.airbnb.com/rooms/11839729?location...,nd
4,$75,2,Fort Worth,February 2017,Welcome to our original 1920's home. We recent...,32.747097,-97.286434,The Porch,https://www.airbnb.com/rooms/17325114?location...,nd


We also retrieve the list of all the locations we need, and assign to them an id. These will be used later on in the software:

In [7]:
locationsList = {elem.lower() : i for i, elem in enumerate(set(df["city"]))}
locationsList

{'johnson city': 0,
 'clyde': 1,
 'beaumont': 2,
 'anna': 3,
 'south houston': 4,
 'rockdale': 5,
 'teague': 6,
 'saginaw': 7,
 'lakeway': 8,
 'alvarado': 9,
 '阿纳瓦克': 10,
 'mansfield': 11,
 'east bernard': 12,
 'damon': 13,
 'fairview': 14,
 'bastrop county': 15,
 'fort worth': 469,
 'san leon': 17,
 'kingsland': 18,
 'lago vista': 19,
 'seagoville': 20,
 'clear lake shores': 21,
 'flower mound': 22,
 'point venture': 23,
 'grand saline': 24,
 'rio frio': 25,
 'whitney': 26,
 'mabank': 27,
 'fair oaks ranch': 28,
 'groesbeck': 29,
 'eliasville': 30,
 'lohn': 31,
 'melissa': 32,
 'coppell': 33,
 'china': 34,
 'plainview': 35,
 'van': 36,
 'junction': 37,
 'alamo heights': 38,
 'webster': 39,
 'bridge city': 40,
 'selma': 41,
 'trophy club': 42,
 'dripping springs': 43,
 'palestine': 44,
 'cibolo': 45,
 'hurst': 46,
 'godley': 47,
 'princeton': 48,
 'smithville': 49,
 'hereford': 50,
 'benbrook': 51,
 'winnie': 52,
 'nacogdoches': 53,
 'farmersville': 54,
 'bandera': 55,
 'shenandoah': 5

## Step 2: Create documents

Now from every row of the table, we want to create a `.tsv` file which will store all the relevant informations about a single istance that will be displayed later on in the search engine.

In [None]:
for row in df.iterrows():
    file_name = DIR + 'doc_'+str(row[0])+'.tsv' # We build dinamically the path of every file
    
    with open(file_name,'wb') as file: # We create/open the file. Wb stands for binary mode.
        csv_writer = csv.writer(file, delimiter = '\t') # Now we are writing the files as tab separated files.
        csv_writer.writerow(list(row[1].values)) # writerow want a list object ad input, so we used list.
        file.close() # Once the file is written, we close it, so we free some memory


After creating all the documents we will work with, now some more data to retrieve for the point 4 of our exercise. We are in fact going to retrieve the coordinates for the center of every city where there is an airbnb, so we can then do further computation to get a rank based also on the position of the B&B in respect to the city center.
To do so, starting from the location list we got previously, we retrieve the center through `geopy` and we add it to a list, containing the city ID and the coordinates of the center.
Please notice, since the following block of code can be very slow, expecially with a low connection, we saved a copy of the data retrived and executed this block just once. So we **strongly suggest to skip downloading the data again** (skip running the following block), and instead be sure you got the file `citycoord.pkl` from the Github repository in the `data/docs` directory. We will upload that file later on in the program, when needed

In [57]:
citycoord = {}

for key in locationsList:
    geolocator = Nominatim(user_agent="search engine")
    for i in range(5): #
        try:
            location = geolocator.geocode(key)
            coords_2 = ((location.latitude, location.longitude))
            break
        except: 
            sleep(0.1)
            pass
    citycoord[locationsList[key]] = coords_2
    
with open(DIR+'citycoord.pkl', 'wb') as file:
        pickle.dump(citycoord, file, pickle.HIGHEST_PROTOCOL)

## 3.1: Conjunctive query

We want now to be able to make a dictionary that, given a query, returns us a list of documents conatining all the words in that query. We start by creating a dictionary that will map for every document the single words that title and description will contain, cleaned of the stopwords and stemmed. 
Since there are a lot of different languages used for the descriptions (with different alphabets as well), we decided to keep just the 4 most used languages: English, German, Spanish and Dutch. All the other languages will not be considered.

In [6]:
docdict = {}

languages = {'en': 'english', 'de': 'german', 'es': 'spanish', 'nl': 'dutch'}

for i, row in df[:].iterrows(): # For every row of the dataframe, we get values and index

    #We merge title and description (since we will consider both) and remove punctuation
    text = (str(row['description'])+ ' '+ str(row['title'])).translate({key: None for key in string.punctuation})
    
    
    if(text): # If the text exists
        try: # Try to detect the language
            lan = detect(str(text))
            df.loc[i,'language'] = lan # Set the language in the row

            if (lan in languages): # If it's one of the languages considered
        
                # Remove all the stopwords and all the non alphabetic words
                tostem = ([x.lower() for x in text.split(' ') if x.lower() not in stopwords.words(languages[lan]) and x.isalpha()])
        
                # Stem all the words and add them to a dictionary
                docdict[i] = [nltk.stem.SnowballStemmer(languages[lan]).stem(word) for word in tostem]
            else:
                pass
        except:
            pass

docdict

{0: ['welcom',
  'stay',
  'privat',
  'room',
  'queen',
  'bed',
  'detach',
  'privat',
  'bathroom',
  'second',
  'anoth',
  'privat',
  'bedroom',
  'sofa',
  'bed',
  'avail',
  'addit',
  'addit',
  'iah',
  'airport',
  'airport',
  'avail',
  'privat',
  'iah',
  'airport'],
 1: ['fulli',
  'remodel',
  'home',
  'upscal',
  'nw',
  'alamo',
  'height',
  'amaz',
  'locat',
  'hous',
  'conveni',
  'locat',
  'quiet',
  'beauti',
  'season',
  'prestigi',
  'neighborhood',
  'close',
  'loop',
  'featur',
  'open',
  'floor',
  'origin',
  'hardwood',
  'full',
  'bathroom',
  'independ',
  'room',
  'sleep',
  'european',
  'inspir',
  'kitchen',
  'driveway',
  'park',
  'uniqu',
  'alamo',
  'height',
  'design',
  'inspir'],
 2: ['hous',
  'island',
  'close',
  'well',
  'maintain',
  'river',
  'hous',
  'san',
  'jacinto',
  'river',
  'extra',
  'room',
  'temporari',
  'river',
  'hous',
  'near',
  'citi'],
 3: ['privat',
  'bedroom',
  'cute',
  'littl',
  'home',


Next we want to map once every single stemmed word with a corresponding sequential ID into a dictionary. This will improve the speed later on.

In [8]:
voc = {}
cont = 0

for key, value in docdict.items(): # for every document
    for elem in value: # for every word
        if(elem not in voc): # if we did not map it yet
            voc[elem] = cont # map it with a sequential value
            cont += 1 # and increase the value
            
voc

{'welcom': 0,
 'stay': 1,
 'privat': 2,
 'room': 3,
 'queen': 4,
 'bed': 5,
 'detach': 6,
 'bathroom': 7,
 'second': 8,
 'anoth': 9,
 'bedroom': 10,
 'sofa': 11,
 'avail': 12,
 'addit': 13,
 'iah': 14,
 'airport': 15,
 'fulli': 16,
 'remodel': 17,
 'home': 18,
 'upscal': 19,
 'nw': 20,
 'alamo': 21,
 'height': 22,
 'amaz': 23,
 'locat': 24,
 'hous': 25,
 'conveni': 26,
 'quiet': 27,
 'beauti': 28,
 'season': 29,
 'prestigi': 30,
 'neighborhood': 31,
 'close': 32,
 'loop': 33,
 'featur': 34,
 'open': 35,
 'floor': 36,
 'origin': 37,
 'hardwood': 38,
 'full': 39,
 'independ': 40,
 'sleep': 41,
 'european': 42,
 'inspir': 43,
 'kitchen': 44,
 'driveway': 45,
 'park': 46,
 'uniqu': 47,
 'design': 48,
 'island': 49,
 'well': 50,
 'maintain': 51,
 'river': 52,
 'san': 53,
 'jacinto': 54,
 'extra': 55,
 'temporari': 56,
 'near': 57,
 'citi': 58,
 'cute': 59,
 'littl': 60,
 'situat': 61,
 'covet': 62,
 'garden': 63,
 'acr': 64,
 'access': 65,
 'campus': 66,
 'recent': 67,
 'purchas': 68,
 'mil

Now we want to create an inverted index that will map every word (using the ID we created previously) to a list of all the documents that contain that single word. 
The result will be in the following format: `{1: [0,2,5], 245: [1, 4, 7]}`.

In [9]:
invidx = {} 

for key, value in docdict.items():
    for elem in value:
        if(voc[elem] in invidx):
            invidx[voc[elem]] = invidx[voc[elem]] + [key]
        else:
            invidx[voc[elem]] = [key]
invidx

{0: [0,
  4,
  22,
  50,
  58,
  58,
  70,
  70,
  150,
  196,
  199,
  222,
  236,
  275,
  288,
  296,
  332,
  384,
  429,
  439,
  443,
  449,
  463,
  474,
  475,
  504,
  540,
  549,
  567,
  589,
  607,
  610,
  670,
  672,
  708,
  735,
  769,
  771,
  793,
  812,
  815,
  816,
  826,
  851,
  855,
  862,
  896,
  926,
  944,
  976,
  1024,
  1052,
  1058,
  1099,
  1138,
  1175,
  1187,
  1238,
  1240,
  1254,
  1257,
  1259,
  1305,
  1354,
  1366,
  1391,
  1391,
  1397,
  1411,
  1506,
  1540,
  1542,
  1552,
  1561,
  1581,
  1582,
  1589,
  1593,
  1605,
  1605,
  1611,
  1627,
  1646,
  1674,
  1695,
  1695,
  1701,
  1772,
  1839,
  1850,
  1902,
  1967,
  1969,
  1970,
  1999,
  2033,
  2039,
  2045,
  2055,
  2057,
  2088,
  2109,
  2129,
  2145,
  2158,
  2160,
  2181,
  2181,
  2212,
  2214,
  2216,
  2263,
  2275,
  2279,
  2317,
  2321,
  2340,
  2374,
  2382,
  2389,
  2401,
  2413,
  2423,
  2426,
  2429,
  2446,
  2465,
  2485,
  2489,
  2489,
  2492,
  2500,
 

Now we define a search function that can be called every time and will gives us back a list of all the items that contains all the words requested in the query (apart from the stopwords). The function is splitted mainly in three parts: 
- Recognizing the language and stemming the query. If the query is not recognized as one of the languages written before, then we consider it and stem it as English.
- Getting the lost of documents containing all the words requested in the query
- Printing the results

In [10]:
def searchConj(string, n):
    
    text = string 
    lan = detect(str(text)) # Detect the language
    
    if(lan not in languages): 
        lan = 'en'
        
    # Remove non alphanumeric text and stem the words
    tostem = ([x.lower() for x in text.split(' ') if x.lower() not in stopwords.words(languages[lan]) and x.isalpha()])
    result = [nltk.stem.SnowballStemmer(languages[lan]).stem(word) for word in tostem]
    
    # Get the list of documents
    listofdocs = {}
    for elem in result:
        if(elem in voc):
            if(len(listofdocs) == 0):
                listofdocs = set(invidx[voc[elem]])
            else:
                listofdocs = listofdocs & set(invidx[voc[elem]]) # & does the union of the sets

    table = [['<b>Title</b>', '<b>Description</b>', '<b>City</b>', '<b>URL</b>']]
    
    # Open every document and add it to a list
    for elem in listofdocs:
        file_name = DIR + 'doc_'+str(elem)+'.tsv'
        
        with open(file_name,'rb') as file: # the wb means we are writing a file in binary mode.
            text = str(file.read()).split('\\t')
            
            table = table + [[text[7], text[4], text[2], '<a href ="' + str(text[8]) + ' target= "_blank">' + str(text[8]) + '</a>' ]]
            
            file.close()

    # Display the list
    display(HTML(tabulate.tabulate(table[:n+1], tablefmt='html')))   
    

Now we call the search function, letting the user decide the query to input, and limiting the outputs to the first 5 elements.

In [11]:
searchConj(input(), 5)

House with garden


0,1,2,3
Title,Description,City,URL
"Rosen House Inn, Rose Garden Room","We are a small family owned Bed and Breakfast located in the Heart of the near Southside. Within walking distance to small shops, bars, and local dining. We offer snacks, coffee, tea, and sodas as well as a full hearty breakfast for our guests.",Fort Worth,https://www.airbnb.com/rooms/9972447?location=Benbrook%2C%20TX
"BRAND NEW HOME, EXQUISITE DECORATION, LARGE GARDEN","Brand new house with exquisite decoration. We have two indoor parking spaces and three exterior parking spaces in case you are traveling by car. The beds are orthopedic which provides a comfortable rest. For us cleaning is an essential point in the house, so feel totally safe. The house has been designed and decorated to satisfy the needs of families, business trips and relaxation time. We hope to make your stay an unforgettable experience!",Spring,https://www.airbnb.com/rooms/15285127?location=Conroe%2C%20TX
East Austin Hillside Gem,"Beautiful and modern 3Br, 2.5Ba located minutes from downtown. Amenities include front deck, back patio, peaceful backyard with tiered gardens and fire pits. There is also gym equipment, wifi, Xbox, and cable with HBO, Showtime, Starz, etc. There is free offstreet parking for two cars. Perfect location for families(with kids or furry friends,) couples, business travelers or friends looking for a spacious and comfortable location with plenty of amenities. This house has beautiful sunset views of downtown Austin just 6 miles from Darrell K Royal Stadium, 8 miles from Zilker Park and 14 miles from Circuit of the Americas.",Austin,https://www.airbnb.com/rooms/17555039?location=Bastrop%20County%2C%20TX
Charming 1 BR cottage near Chappell Hill,"Welcome to Cedar Meadows, a one-of-a-kind 2015 custom-built cottage situated on 7-acres of land with gardens all around. Surrounded by a white picket fence with decks on either side of the house, this property will allow you to enjoy the stunning views in any weather. The house is located on North Meyersville Road, 4.5 miles from Chappell Hill and 9.5 miles from Brenham making it convenient to access either town but giving you a feeling of secluded haven.",Brenham,https://www.airbnb.com/rooms/17921211?location=Brenham%2C%20TX
Gibbs House / Walk to the Beach,"Your private suite is in my personal residence. I live here. It takes 2 minutes to walk to the Convention Center and the beach is a three minute walk away. A theme park (The Pleasure Pier) is two miles away on the Galveston Seawall. Moody Gardens is about four miles away. A Kroger grocery store (with a Starbucks Coffee Shop) is a three minute walk from the house. Gibbs House is excellent for couples, business travelers, small families, and solo adventurers.",Galveston,https://www.airbnb.com/rooms/16982846?location=Bayou%20Vista%2C%20TX


In [12]:
searchConj(input(), 5)

Increíble habitación lugares


0,1,2,3
Title,Description,City,URL
Cuarto luminoso,"Lugares de inter\xc3\xa9s: transporte p\xc3\xbablico, parques. Te va a encantar mi espacio porque Hay mucho verde y espacios con juegos para ni\xc3\xb1os y chanchas de futbol. Parrilla y mesas! , La ubicaci\xc3\xb3n es muy buena, es una zona muy linda con muchos restorantes y supermercados, farmacias cercas., el ambiente Es agradable.. Mi alojamiento es bueno para parejas, aventureros, viajeros de negocios y mascotas.",Irving,https://www.airbnb.com/rooms/16073393?location=Coppell%2C%20TX
"House for Rent at Brownsville, Tx.","Lugares de inter\xc3\xa9s: VICC Country Golf Club, Schlitterbahn Resort Waterpark, Mercedes Outlet, Sunrise Mall, Shopping, South Padre Island, ValleyTennis Center. Te va a encantar mi espacio por Comodidad, ubicaci\xc3\xb3n, acogedor, practico. Mi alojamiento es bueno para familias (con hijos) y grupos grandes.",Brownsville,https://www.airbnb.com/rooms/16649395?location=Brownsville%2C%20TX
Peaceful home 25 min from Airport &35 min Downtown,"Lugares de inter\xc3\xa9s: el aeropuerto se encuentra a 25 minutos de distancia, el centro de la ciudad se encuentra a 35 min, en el centro hay mucha vida nocturna , transporte p\xc3\xbablico a 10 minutos de la casa hay autobuses que van a el centro . Te va a encantar mi lugar debido a los espacios, y la casa tiene un ambiente muy acogedor. la tranquilidad, los techos altos. Mi alojamiento es bueno para viajeros de negocios o turismo.",San Antonio,https://www.airbnb.com/rooms/15773671?location=Cibolo%2C%20TX
Houston 15 min NRG,Lugares de inter\xc3\xa9s: Hobby airport. Te va a encantar mi espacio por Extremely relax. Mi alojamiento es bueno para aventureros y familias (con hijos).,Houston,https://www.airbnb.com/rooms/16973563?location=Channelview%2C%20TX
Peaceful home 25 min from Airport &35 min Downtown,"Lugares de inter\xc3\xa9s: el aeropuerto se encuentra a 25 minutos de distancia, el centro de la ciudad se encuentra a 35 min, en el centro hay mucha vida nocturna , transporte p\xc3\xbablico a 10 minutos de la casa hay autobuses que van a el centro . Te va a encantar mi lugar debido a los espacios, y la casa tiene un ambiente muy acogedor. la tranquilidad, los techos altos. Mi alojamiento es bueno para viajeros de negocios o turismo.",San Antonio,https://www.airbnb.com/rooms/15773671?location=Bulverde%2C%20TX


## 3.2: Conjunctive query & Ranking score

Now we have to create a second search engine that, given a query, will return the top-k documents ordered by the similarity (cosine similarity).
First of all we create the tf_idf function that will compute the tfidf (term frequency–inverse document frequency) of every word in the documents (so it measure the 'importance' of every word in every document) as seen during classes.

In [18]:
#function tfidf
def tf_idf(word,doc,D,N):
    import math
    f = doc.count(word) / len(doc)
    tfidf = math.log(D/N)*f
    
    return (tfidf)   

As second step we create the second inverted index with the term ID as key of the disctionary, and with the documents where you can find this word and the tf-idf as values of this dictionary.

In [19]:
invidx2 = invidx.copy()
alldocs = []
for key, value in invidx2.items():
    alldocs = alldocs + invidx2[key]
    invidx2[key] = set(value)

Ndocs = len(set(alldocs))
print(Ndocs)

18123


In [20]:
inv_voc = {v: k for k, v in voc.items()}
invidx3 = invidx2.copy()

for key, value in invidx2.items():
    tmp = []
    for elem in value:
        x = tf_idf(inv_voc[key], docdict[elem], (Ndocs + 1), len(value))
        tmp.append((elem,x))
    invidx3[key] = tmp

print(invidx3)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [None]:
print(invidx3[1])

In [21]:
with open(DIR+'dicIndex.pkl', 'wb') as file:
        pickle.dump(invidx3, file, pickle.HIGHEST_PROTOCOL)

Now we want to find the documents most similar in according to the query the user give in input and we want as output of this second search engine, the first k-documents and in  particular:

- Title
- Description
- City
- Url
- Similarity

For the similarity we consider the cosine similarity and we compute it with the function spatial.distance.cosine between the two vectors (the one with the tfidf of the query and the other one with the tfidf of the words we can fine in the documents where we have all the words of the query).

In [23]:
def heapsort(d):
    l = []
    for key in d:
        heappush(l, (d[key], key))
    return [heappop(l) for i in range(len(l))]

def searchCos(string, n):
    
    with open(DIR+'dicIndex.pkl', 'rb') as file:
        data = pickle.load(file)

    languages = {'en': 'english', 'de': 'german', 'es': 'spanish', 'nl': 'dutch'}
    text = string
    lan = detect(str(text))
    if(lan not in languages):
        lan = 'en'
    tostem = ([x.lower() for x in text.split(' ') if x.lower() not in stopwords.words(languages[lan]) and x.isalpha()])
    result = [nltk.stem.SnowballStemmer(languages[lan]).stem(word) for word in tostem]
    

    query_tfidf = [tf_idf(i,text,(Ndocs + 1),1) for i in result]
    
    finaldic = {}
    listofdocs = {}
    for elem in result:
        if(elem in voc):
            for k in invidx3[voc[elem]]:
                if(k[0] in finaldic):
                    finaldic[k[0]] = finaldic[k[0]] + [k[1]]
                else:
                    finaldic[k[0]] = [k[1]]

    final = {}    
    for k in finaldic:
        if(len(finaldic[k]) == len(result)):
            final[k] = finaldic[k]
    for k in final:
        final[k] = 1 - spatial.distance.cosine(final[k] , query_tfidf) 
    final = heapsort(final)
    
    table = [['<b>Doc</b>','<b>Title</b>', '<b>Description</b>', '<b>City</b>', '<b>URL</b>', '<b>Similarity</b>']]
    
    for elem in final[:n:-1]:
        file_name = DIR + 'doc_'+str(elem[1])+'.tsv'
        with open(file_name,'rb') as file: 
            text = str(file.read()).split('\\t')
            table = table + [[str(elem[1]), text[7], text[4], text[2], '<a href ="' + str(text[8]) + ' target= "_blank">' + str(text[8]) + '</a>', elem[0]]]
            file.close()
    display(HTML(tabulate.tabulate(table[:n+1], tablefmt='html')))


In [24]:
searchCos(input(), 5)

House with garden


0,1,2,3,4,5
Doc,Title,Description,City,URL,Similarity
14254,Garden Bedroom,"Comfortable and private, the garden bedroom is ideal for a long weekend or festival stay. It features a private closet, a full-sized bed and a lovely view of the backyard. Common spaces are here for your use as well. Cook a meal and stick to your travel budget, sit in the living room front window with a cup of coffee and watch the peacocks in the morning or enjoy the sunshine and garden from the back patio. If being out and about in Austin is your goal, our neighborhood is central and there are many easy options for getting around the city. We have and love dogs. Please get in touch if you would like to bring yours! There are so many things to love about our neighborhood. Not only are we super close to great parkland, bars, restaurants, coffee shops, food trailers and vintage shops, we're surrounded by huge live oaks and peacocks! The neighbors are open and friendly and have made us feel welcome from the start. Some of our favorite places include Skylark Lounge, Contigo, Dai Due, Sugar Mama's, Coldtowne Theater, The Work Horse, Foreign and Domestic, Drink Well, Kome, Tyson's Tacos, Room Service Vintage, Uptown Modern, Michi Ramen, Big Bertha's and Lucy's Fried Chicken. We also really dig our backyard patio and grill, and we think you will, too. Our neighborhood connects directly to the Mueller Greenway trails, putting guests less than a mile away from The Thinkery and Mueller Park (they have a great Farmer's Market on Sundays and a beautiful loop for walkers and runners alike). By bike, take the Greenway two miles west to UT, North Loop or Hyde Park, take it a mile southeast to Cherrywood or Manor Road, five miles further and you're in the thick of the East Side & downtown without ever having to leave a bike lane. If cycling is not your thing, shared rides are abundant (getme is a good app) as are Car To Go cars. There is also a bus stop less than two blocks from the house, and we are two miles from the Capmetro station at Highland. I-35 is literally a stone's throw from the house, putting you just a few exits from wherever you want to be in Austin. If you are interested, we can provide bike rental services and route recommendations. We have two dogs, one big and one small. Although we are responsible owners and you will find the house clean, please consider this if you are afraid of dogs or have allergies. On the other hand, if you would like to bring your well-behaved dog, please reach out and we will see if this can be arranged. Our neighborhood connects directly to the Mueller Greenway trails, putting guests less than a mile away from The Thinkery and Mueller Park (they have a great Farmer's Market on Sundays and a beautiful loop for walkers and runners alike). By bike, take the Greenway two miles west to UT, North Loop or Hyde Park, take it a mile southeast to Cherrywood or Manor Road, five miles further and you're in the thick of the East Side & downtown without ever having to leave a bike lane. If cycling is not your thing, rideshares and cabs are abundant, as are Car To Go cars. There is also a bus stop less than two blocks from the house, and we are two miles from the Capmetro station at Highland. I-35 is a stone's throw from the house, putting you just a few exits from wherever you want to be in Austin. If you are interested, we can provide bike rental services and route recommendations.",Austin,https://www.airbnb.com/rooms/5096161?location=Austin%2C%20TX,0.9908098073078521
9137,Greeley Gardens Carriage House,This is a beautiful cozy garage apartment with a large porch overlooking the grounds that make up the Greeley Gardens estate. The Apartment has low ceilings (less than 7ft) and is ideal for travelers looking for a serene indoor/outdoor environment to relax during their travels.,Houston,https://www.airbnb.com/rooms/8301898?location=Bellaire%2C%20TX,0.9796706467921616
6037,CLEAN GARDEN PATIO HOUSE 15 MINUTES FROM ANYWHERE,"Private Patio Home 15 minutes from Hobby Airport, Downtown, The Medical Center . On The Bus Route, Bus 50 come every 15 minutes. Im a mile from the Magnolia Park Train Center. My Huge Covered Patio Is 420 Friendly. NO Smoking Inside. Pets Welcome. BE RESPECTFUL. USE SOMETHING CLEAN IT AND PUT IT BACK. Music can be played as LOUD as you want in your room because the walls are soundproof. I am a very tolerant and open minded. I mostlt garden and cook. Please Message Me Any Questions.",Houston,https://www.airbnb.com/rooms/19454562?location=Channelview%2C%20TX,0.9796706467921616
6072,Private East End Garden Patio Home,"This is a very colorful electic house 15 minutes from Hobby Airport. 15 minutes From Downtown. 30 minutes If you take public transportation which the bus stop is across thw street. Quiet Garden Sanctuary or Party Beer and Grilling With Friends. Perfect Privacy For Either Mood. I recycle, resuse and repurposed most all the patio decor and art. Art Installations Are Welcome And Some Supplies Are Provided.",Houston,https://www.airbnb.com/rooms/19508450?location=Channelview%2C%20TX,0.9796706467921615
4582,Alla's Garden House in S.W. Dallas.,"Welcome to Alla's Garden House! Beautiful private Garden House 3bedrooms, 2bathrooms (sleep 8 -10 people located only 15min away from downtown Dallas, American Airline Center, 7min from Horse Race Arena Grand Prairie, 20min Dallas Cowboys Stadium.",Duncanville,https://www.airbnb.com/rooms/286328?location=Cedar%20Hill%2C%20TX,0.9646886448489406


# Step 4: Define a new score!

Now it's our turn to think how documents should be classified according to the query.
For us it's really important the distance of each house from the city center and the price, so these are the variables we consider to define our score.

In the next cell we find the centers coordinates for each city with an house in the dataset.

In [58]:
with open(DIR+'citycoord.pkl', 'rb') as file:
        citycenters = pickle.load(file)
        print(citycenters)

{0: (36.3134398, -82.3534728), 1: (38.025702, -122.028262510398), 2: (50.0839627, 2.6567465), 3: (10.416667, 77.666667), 4: (29.663008, -95.2354902), 5: (33.6430283, -84.0322064), 6: (31.6271145, -96.2838621), 7: (43.4200387, -83.9490365), 8: (30.3644888, -97.9875325), 9: (18.82867105, -95.9238843132054), 10: (25.0338013, 99.8273188), 11: (40.75839, -82.5154471), 12: (29.5275595, -96.0645975), 13: (37.8229199, -78.6591822), 14: (37.6785422, -122.0457953), 15: (30.0900753, -97.3127179), 469: (32.753177, -97.3327459), 17: (28.5813687, -107.8668768), 18: (50.9863244, -114.0773408), 19: (30.4601975, -97.9883477), 20: (32.6395776, -96.5383228), 21: (29.547452, -95.0321506), 22: (33.0145673, -97.0969552), 23: (30.3793672, -97.9961238), 24: (18.5441194, -72.4766886), 25: (19.13499, -72.156828), 26: (40.73964505, -74.0089209502898), 27: (32.3665322, -96.1008056), 28: (29.7457771, -98.6433561), 29: (31.5243379, -96.5338693), 30: (32.95984, -98.7653408), 31: (52.665257, 8.2363523), 32: (39.30737

Now we define the function that will return the distance from the house coordinates and the city center.

In [59]:
def distanceFromCityCenter(lat, long, city):
    city = city.lower()
    return geopy.distance.distance([lat, long], citycenters[locationsList[city]]).km

In the next cell we will have a dictionary with the ranking of the houses computed in this way:

score = (1 - distance/max_distance)*0,7 + (1 - price/max_price)*0,3

From our scoring function you can see that for us the most importnat variable is the distance from the city center that has the 0,7 weight on the final score, and the price comes after the distance in importance.

In [60]:
rankdict = {}
maxprice = 0
for i, row in df[:].iterrows():
    try:
        dist = (distanceFromCityCenter(row['latitude'], row['longitude'], row['city']))
    except:
        dist = 1
    price = float((row['average_rate_per_night'])[1:])
    if (maxprice < price):
        maxprice = price
    if(locationsList[row['city'].lower()] in rankdict):
        rankdict[locationsList[row['city'].lower()]] = rankdict[locationsList[row['city'].lower()]] + [[i, dist, price]]
    else:
        rankdict[locationsList[row['city'].lower()]] =  [[i, dist, price]]
print(rankdict)
for k in rankdict:
    m = max([i[1] for i in rankdict[k]]) # Relative distance
    for i, val in enumerate(rankdict[k]):
        rankscore = (1-val[1]/m)*0.7 + (1-(val[2]/maxprice))*0.3
        rankdict[k][i] = [rankscore, val[0]]


{155: [[0, 3.8749399563436104, 27.0], [53, 4.991207205823675, 100.0], [55, 0.4503278334498338, 81.0], [91, 7.756838666749961, 200.0], [134, 3.5486679705853184, 175.0], [420, 10.230554488869872, 250.0], [441, 11.087377858237474, 160.0], [445, 0.6971933943690694, 27.0], [457, 11.280396729029892, 89.0], [484, 11.317228230511981, 36.0], [699, 2.7069265669804157, 100.0], [849, 2.963448561079422, 50.0], [1041, 8.004287754573184, 39.0], [1356, 5.424195423322233, 35.0], [1428, 6.581434198705654, 24.0], [1555, 7.030019924062355, 37.0], [1611, 1.7556165979782816, 25.0], [1633, 4.669089037304398, 50.0], [1768, 4.8841748309633495, 59.0], [2511, 9.900289831242617, 35.0], [2993, 4.8841748309633495, 59.0], [3228, 8.004394546891957, 999.0], [3260, 8.004287754573184, 39.0], [3366, 9.235785390532966, 25.0], [3412, 9.900289831242617, 35.0], [3439, 7.018176019322771, 94.0], [3686, 8.873813611949808, 1000.0], [3753, 11.187181190480057, 500.0], [3821, 3.7930425425811376, 60.0], [3849, 4.538756142919433, 250

Finally, this is the last search engine with our ranking score, so we will give in input the location (city) and then we will find the top-k houses in this city according to our search engine.

In [63]:
from heapq import heappush, heappop

def heapsortd(d):
    l = []
    for e in d:
        heappush(l, [e[0], e[1]])
    return [heappop(l) for i in range(len(l))]

def searchByLocation(loc, n):
    try:
        loc = loc.lower()
        todisplay = (heapsortd(rankdict[locationsList[loc]])[:n:-1])

        table = [['<b>Doc</b>','<b>Title</b>', '<b>Description</b>', '<b>City</b>', '<b>URL</b>', '<b>Rank</b>']]
        
        for elem in todisplay:
            file_name = DIR + 'doc_'+str(elem[1])+'.tsv'
            with open(file_name,'rb') as file: 
                text = str(file.read()).split('\\t')
                table = table + [[str(elem[1]), text[7], text[4], text[2], '<a href ="' + str(text[8]) + ' target= "_blank">' + str(text[8]) + '</a>', elem[0]]]
                file.close()
        display(HTML(tabulate.tabulate(table[:n+1], tablefmt='html')))
    except:
        print("All the locations available are displayed")

In [66]:
searchByLocation(input(), 10)

Austin


0,1,2,3,4,5
Doc,Title,Description,City,URL,Rank
4232,Historical Loft with Capitol View!!,Located one block south of the Capitol in a historical building on the corner of https://www.airbnb.com/loCongress Avenue this property is walkable to: - All of downtown - The University of Texas - Paramount Theatre - 6th Street - Lady Bird Lake,Austin,https://www.airbnb.com/rooms/961883?location=Colorado%20River%2C%20TX,0.9922330425532018
6287,Historical Loft with Capitol View!,Located one block south of the Capitol in a historical building on the corner of Congress Avenue this apartment is walkable to: - All of downtown - The University of Texas - Paramount Theatre - 6th Street - Lady Bird Lake,Austin,https://www.airbnb.com/rooms/669469?location=Colorado%20River%2C%20TX,0.9910560416416072
311,Historical Loft with Capitol View!,Located one block south of the Capitol in a historical building on the corner of Congress Avenue this apartment is walkable to: - All of downtown - The University of Texas - Paramount Theatre - 6th Street - Lady Bird Lake,Austin,https://www.airbnb.com/rooms/669469?location=Brazos%20River%2C%20TX,0.9910560416416072
5498,Extraordinary Accommodations 1bd|1ba Pool Gym,"Enjoy yourself in this brand new, uniquely designed one bedroom apartment with a phenomenal view in the heart of downtown! 9th floor apartment has an ensuite bathroom and features a private balcony with amazing views of Austin. Our place is great for couples, solo adventurers, business travelers, families and friends.",Austin,https://www.airbnb.com/rooms/18914845?location=Cedar%20Park%2C%20TX,0.9906070340384818
7687,PRIVATE ROOM/BATHROOM WITH AMAZING VIEWS,"private room in the heart of downtown, Private bath with access from inside the room FREE PARKING Room has floor to ceiling windows queen bed extra fold-able bed can be provided for guests with more than 2",Austin,https://www.airbnb.com/rooms/17547362?location=Cedar%20Park%2C%20TX,0.9900048501246079
14125,Downtown Oasis: Suite-Historic Home,"Charming private suite in the heart of downtown Austin in a beautiful historic home. Amenities include private entrance and patio, off-street parking and a kitchenette. Walking distance from the Texas Capitol, UT, downtown & Lady Bird Lake/Trail. Registered with the City of Austin: #V",Austin,https://www.airbnb.com/rooms/4013329?location=Austin%2C%20TX,0.9898293431901585
11718,Downtown Condo with PARKING & Pool,"Perfect downtown location for SXSW, F1, ACL Music Festival, or and business trip! Literally, in the middle of everything, and 2 blocks from convention center in downtown Austin.",Austin,https://www.airbnb.com/rooms/4111528?location=Brazos%20River%2C%20TX,0.9885557751295055
4136,Downtown Condo with PARKING & Pool,"Perfect downtown location for SXSW, F1, ACL Music Festival, or and business trip! Literally, in the middle of everything, and 2 blocks from convention center in downtown Austin.",Austin,https://www.airbnb.com/rooms/4111528?location=Colorado%20River%2C%20TX,0.9885557751295055
9680,4th St Loft in Downtown Austin!,Welcome to our home! This a two bedroom lofted condo with one bathroom and extra vanity area in Downtown Austin. We usually have a 2 night minimum but can offer one night stays for special pricing.,Austin,https://www.airbnb.com/rooms/323733?location=Brazos%20River%2C%20TX,0.9884395625684279


## Bonus: Make a nice visualization!

Now it's time to make some maps.
We will ask in input to the user the coordinates and a radius and our goal is returning each house in our database that is inside the circle with center the coordinates given in input and the radius.
But for us is important also to visualize the prices of the houses we can find in the circle, so we color the markers of each house in the circle in the following way:

-Green: houses with a rate per night less than 100$.
-Blue: houses with a rate per night more or equal than 100$ and less than 600$.
-Red: houses with a rate per night higher than 600$.

in the marker you can also find the ink to reach the announcement of the house you want to book.

In [1]:
lat = float(input())
lon = float(input())
c = [lat, lon]
r = int(input('please insert the radius in km:'))

29.829352
-95.081549
please insert the radius in km:15


In [11]:
m = folium.Map(location = c, zoom_start = 6)

In [12]:
folium.Circle(
    radius = (r * 1000),
    location = c,
    popup = 'The Waterfront',
    color = 'crimson',
    fill = True,
).add_to(m)

<folium.vector_layers.Circle at 0x1dedddafbe0>

In [13]:
m

In [21]:
Coords = {}
for doc in range(len(df)):
    f = open(DIR+'doc_'+str(doc)+'.tsv', 'r', encoding = 'utf-8')
    for line in f:
        ou = line.strip().split('\t')
        break
    if ou[8] not in Coords:
        if ou[6] != 'nan' and ou[7] != 'nan':
            Coords[ou[8]] = [float(ou[5]),ou[6]]

In [24]:
def google_maps(lat, lon):
    for k, v in Coords.items():
        P1 = c
        lat1 = v[0]
        lon1 = v[1]
        P2 = (lat1, lon1) 
        if (distance.distance(P1, P2).km) < r:
            if v[2] < 100 :
                tooltip = 'Click me!'
                folium.Marker(location = [float(P2[0]),float(P2[1])], 
                              popup = folium.Popup('<div><a href="'+k+'" target = "_blank" >'+k+'</a></div>'),
                              icon = folium.Icon(color = 'green', icon = 'home')
                             ).add_to(m)
            elif v[2]>= 100 and v[2] < 600:
                    tooltip = 'Click me!'
                    folium.Marker(location = [float(P2[0]),float(P2[1])], 
                                  popup = folium.Popup('<div><a href="'+k+'" target = "_blank" >'+k+'</a></div>'),
                                  icon = folium.Icon(color = 'blue', icon = 'home')
                                 ).add_to(m)
            elif v[2] >= 600:
                
                tooltip = 'Click me!'
                folium.Marker(location = [float(P2[0]),float(P2[1])], 
                                popup = folium.Popup('<div><a href="'+k+'" target = "_blank" >'+k+'</a></div>'),
                                icon = folium.Icon(color = 'red', icon = 'home')
                                ).add_to(m)
                
            
        else:
            pass

In [25]:
google_maps(c[0], c[1])

TypeError: 'module' object is not callable

In [73]:
m