# Homework 3 - Find the perfect place to stay in Texas!

The homework consists of analyzing the text of Airbnb property listings and building a search engine.

## Step 0: Preparation code

In this part we write some code to make sure that all the libraries needed during the run of the program are available. We divided it into three main blocks: a function to install missing libraries, a block that checks whether the libraries that usually need to be installed are already in the system, and a general import of the other libraries that are normally part of the Python distribution.

In the first part we define a function that lets us install a package if it is not among the local packages. The library is installed through `pip`, which succeeds only if the package actually exists in the repositories.

In [1]:
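def install(package):
    import subprocess
    import sys
    try:
        # check_call raises CalledProcessError if pip exits with an error
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        print("Library installed")
    except subprocess.CalledProcessError:
        print("Cannot install the library")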

In the second part we try to import all the libraries that are usually not part of the local set of packages. If a library is already available, we import it; otherwise we install it with the previous function and then try to import it again.

In [21]:
try:
    from langdetect import detect
except:
    print("The library langdetect is not in the repository, I will install it")
    install('langdetect')
    from langdetect import detect

try:
    from texttable import Texttable
except:
    print("The library TextTable is not in the repository, I will install it")
    install('texttable')
    from texttable import Texttable

try:
    from geopy.geocoders import Nominatim
    import geopy.distance
except:
    print("The library geopy is not in the repository, I will install it")
    install('geopy')
    from geopy.geocoders import Nominatim
    import geopy.distance

try:
    import tabulate
except:
    print("The library tabulate is not in the repository, I will install it")
    install('tabulate')
    import tabulate

The library tabulate is not in the repository, I will install it
Library installed
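
The import-or-install pattern above repeats for every library, so it could be folded into a single helper. The following is a minimal sketch, not part of the original notebook: the name `ensure_package` and its `pip_name` parameter are hypothetical, and it assumes that `pip` is available for the running interpreter.

```python
import importlib
import subprocess
import sys

def ensure_package(module_name, pip_name=None):
    """Import module_name, installing it with pip first if it is missing.

    ensure_package and pip_name are illustrative names, not part of the notebook.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # check_call raises CalledProcessError when pip exits with an error,
        # so a failed installation is not reported as a success.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module_name]
        )
        importlib.invalidate_caches()
        return importlib.import_module(module_name)

# A module that is already installed is simply imported and returned.
json_module = ensure_package("json")
```

With such a helper, each `try`/`except` block would reduce to one call per library, for example `ensure_package("langdetect")`.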


In the last block we import the common libraries that usually come with Python (assuming that Anaconda is installed as well). In particular, we import `unicodecsv` to avoid problems when importing and exporting text with special characters. `pickle` is used to save the dictionaries to files and load them back, `nltk` to remove the stopwords from the sentences and to stem the words, and `langdetect` to detect the language a text is written in. `pandas` is used to load the initial CSV and to speed up some operations. `folium` creates a map displaying the points and `re` applies regular expressions to strings.
After all the imports, we download or update the list of stopwords with the command `nltk.download('stopwords')`.

In [50]:
import pandas as pd
import unicodecsv as csv # unicodecsv handles Unicode text in CSV files; we import it under the name csv
import nltk
import string
import pickle
import folium
import re

from IPython.display import HTML, display
from nltk import word_tokenize
from nltk.corpus import stopwords
from langdetect import detect

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/fabiomontello/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Step 1: Load the Data

Now we want to read the data from the initial CSV. We start by defining the directory of the data and then running the command to read the CSV.

In [6]:
DIR = 'data/docs/' # C:/Users/david/Desktop/DATA SCIENCE/AMD/documents/

df = pd.read_csv("data/Airbnb_Texas_Rentals.csv", sep=",") # we read the dataset

We then remove the data we will not be using. In particular, the following actions are executed:
- Remove the column `Unnamed: 0`, which is useless for our analysis
- Remove all the rows which do not have a description, a latitude or a longitude
- Remove all the rows where the description is a `\` or a `.`
- Replace the newline sequences (`\n`) with a space in the whole dataframe
- Define a new column `language`, set to undefined (`nd`) for each row (it will be updated later)

Finally we display the table, which will be useful later for reference.

In [7]:
df = df.drop('Unnamed: 0',axis = 1) # This column is useless for our analysis
df = df[pd.notnull((df['description'])) & pd.notnull((df['latitude'])) & pd.notnull((df['longitude'])) & (df['description'] != "\\" ) & (df['description'] != "." )] # Rows without a description or coordinates, or with a placeholder description, are useless for us, so we drop them
df = df.replace(r'\\n',' ', regex=True) # Replace every literal "\n" sequence in the dataframe with a space
df['language'] = 'nd'
df.head()

Unnamed: 0,average_rate_per_night,bedrooms_count,city,date_of_listing,description,latitude,longitude,title,url,language
0,$27,2,Humble,May 2016,Welcome to stay in private room with queen bed...,30.020138,-95.293996,2 Private rooms/bathroom 10min from IAH airport,https://www.airbnb.com/rooms/18520444?location...,nd
1,$149,4,San Antonio,November 2010,"Stylish, fully remodeled home in upscale NW – ...",29.503068,-98.447688,Unique Location! Alamo Heights - Designer Insp...,https://www.airbnb.com/rooms/17481455?location...,nd
2,$59,1,Houston,January 2017,'River house on island close to the city' A w...,29.829352,-95.081549,River house near the city,https://www.airbnb.com/rooms/16926307?location...,nd
3,$60,1,Bryan,February 2016,Private bedroom in a cute little home situated...,30.637304,-96.337846,Private Room Close to Campus,https://www.airbnb.com/rooms/11839729?location...,nd
4,$75,2,Fort Worth,February 2017,Welcome to our original 1920's home. We recent...,32.747097,-97.286434,The Porch,https://www.airbnb.com/rooms/17325114?location...,nd


## Step 2: Create documents

Now, from every row of the table, we want to create a `.tsv` file that stores all the relevant information about a single instance; this information will be displayed later on by the search engine.

In [8]:
for row in df.iterrows():
    file_name = DIR + 'doc_'+str(row[0])+'.tsv' # We build the path of every file dynamically
    
    with open(file_name,'wb') as file: # We create/open the file. 'wb' stands for writing in binary mode.
        csv_writer = csv.writer(file, delimiter = '\t') # We write the files as tab-separated files.
        csv_writer.writerow(list(row[1].values)) # writerow wants a list as input, so we use list.
        # The with statement closes the file automatically when the block ends
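
To check that a document survives the round trip, it can be written and read back. This is a minimal sketch that uses the standard `csv` module in text mode instead of `unicodecsv`, and a temporary directory instead of `data/docs/`; the sample row is made up for illustration and only mimics the field order of the documents.

```python
import csv
import os
import tempfile

# Made-up row with the same field order as the documents above.
row = ["$27", "2", "Humble", "May 2016", "Private room, queen bed",
       "30.020138", "-95.293996", "Room near IAH", "https://example.com/1", "nd"]

with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = os.path.join(tmp_dir, "doc_0.tsv")

    # newline="" lets the csv module control the line endings itself.
    with open(file_name, "w", newline="", encoding="utf-8") as file:
        csv.writer(file, delimiter="\t").writerow(row)

    # Read the single tab-separated line back into a list of fields.
    with open(file_name, "r", newline="", encoding="utf-8") as file:
        read_back = next(csv.reader(file, delimiter="\t"))

print(read_back[7], "|", read_back[2])
```

Reading the fields through `csv.reader` keeps the positions stable (7 for the title, 2 for the city) even when a description contains commas.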


## 3.1: Conjunctive query

We now want to build a structure that, given a query, returns the list of documents containing all the words in that query. We start by creating a dictionary that maps every document to the words contained in its title and description, with the stopwords removed and the remaining words stemmed.
Since the descriptions are written in many different languages (with different alphabets as well), we decided to keep just the 4 most used languages: English, German, Spanish and Dutch. All the other languages are not considered.

In [9]:
docdict = {}

languages = {'en': 'english', 'de': 'german', 'es': 'spanish', 'nl': 'dutch'}

for i, row in df[:].iterrows(): # For every row of the dataframe, we get values and index

    #We merge title and description (since we will consider both) and remove punctuation
    text = (str(row['description'])+ ' '+ str(row['title'])).translate({key: None for key in string.punctuation})
    
    
    if(text): # If the text exists
        try: # Try to detect the language
            lan = detect(str(text))
            df.loc[i,'language'] = lan # Set the language in the row

            if (lan in languages): # If it's one of the languages considered
        
                # Remove all the stopwords and all the non alphabetic words
                tostem = ([x.lower() for x in text.split(' ') if x.lower() not in stopwords.words(languages[lan]) and x.isalpha()])
        
                # Stem all the words and add them to a dictionary
                docdict[i] = [nltk.stem.SnowballStemmer(languages[lan]).stem(word) for word in tostem]
            else:
                pass
        except:
            pass

docdict

{0: ['welcom',
  'stay',
  'privat',
  'room',
  'queen',
  'bed',
  'detach',
  'privat',
  'bathroom',
  'second',
  'anoth',
  'privat',
  'bedroom',
  'sofa',
  'bed',
  'avail',
  'addit',
  'addit',
  'iah',
  'airport',
  'airport',
  'avail',
  'privat',
  'iah',
  'airport'],
 1: ['fulli',
  'remodel',
  'home',
  'upscal',
  'nw',
  'alamo',
  'height',
  'amaz',
  'locat',
  'hous',
  'conveni',
  'locat',
  'quiet',
  'beauti',
  'season',
  'prestigi',
  'neighborhood',
  'close',
  'loop',
  'featur',
  'open',
  'floor',
  'origin',
  'hardwood',
  'full',
  'bathroom',
  'independ',
  'room',
  'sleep',
  'european',
  'inspir',
  'kitchen',
  'driveway',
  'park',
  'uniqu',
  'alamo',
  'height',
  'design',
  'inspir'],
 2: ['hous',
  'island',
  'close',
  'well',
  'maintain',
  'river',
  'hous',
  'san',
  'jacinto',
  'river',
  'extra',
  'room',
  'temporari',
  'river',
  'hous',
  'near',
  'citi'],
 3: ['privat',
  'bedroom',
  'cute',
  'littl',
  'home',
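
The cleaning applied to each document can be tried on a single sentence. This is a minimal sketch with a tiny hand-written stopword set standing in for `stopwords.words(...)` and without stemming, so that it runs without the NLTK data; `clean_tokens` is a hypothetical helper name. One detail worth noting is that `str.translate` looks characters up by code point, which is why the table here is built with `str.maketrans`.

```python
import string

# Tiny stand-in for nltk's English stopword list (assumption for this sketch).
STOPWORDS = {"to", "in", "with", "the", "a"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tokens(text):
    """Strip punctuation, lowercase, drop stopwords and non-alphabetic tokens."""
    no_punct = text.translate(PUNCT_TABLE)
    return [word.lower() for word in no_punct.split()
            if word.lower() not in STOPWORDS and word.isalpha()]

tokens = clean_tokens("Welcome to stay in private room, with queen bed!")
print(tokens)
```

Because the punctuation is deleted before the `isalpha()` test, words such as `room,` and `bed!` are kept instead of being discarded.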


Next we want to map every distinct stemmed word to a sequential ID in a dictionary. This will improve the speed later on.

In [13]:
voc = {}
cont = 0

for key, value in docdict.items(): # for every document
    for elem in value: # for every word
        if(elem not in voc): # if we did not map it yet
            voc[elem] = cont # map it with a sequential value
            cont += 1 # and increase the value
            
voc

{'welcom': 0,
 'stay': 1,
 'privat': 2,
 'room': 3,
 'queen': 4,
 'bed': 5,
 'detach': 6,
 'bathroom': 7,
 'second': 8,
 'anoth': 9,
 'bedroom': 10,
 'sofa': 11,
 'avail': 12,
 'addit': 13,
 'iah': 14,
 'airport': 15,
 'fulli': 16,
 'remodel': 17,
 'home': 18,
 'upscal': 19,
 'nw': 20,
 'alamo': 21,
 'height': 22,
 'amaz': 23,
 'locat': 24,
 'hous': 25,
 'conveni': 26,
 'quiet': 27,
 'beauti': 28,
 'season': 29,
 'prestigi': 30,
 'neighborhood': 31,
 'close': 32,
 'loop': 33,
 'featur': 34,
 'open': 35,
 'floor': 36,
 'origin': 37,
 'hardwood': 38,
 'full': 39,
 'independ': 40,
 'sleep': 41,
 'european': 42,
 'inspir': 43,
 'kitchen': 44,
 'driveway': 45,
 'park': 46,
 'uniqu': 47,
 'design': 48,
 'island': 49,
 'well': 50,
 'maintain': 51,
 'river': 52,
 'san': 53,
 'jacinto': 54,
 'extra': 55,
 'temporari': 56,
 'near': 57,
 'citi': 58,
 'cute': 59,
 'littl': 60,
 'situat': 61,
 'covet': 62,
 'garden': 63,
 'acr': 64,
 'access': 65,
 'campus': 66,
 'recent': 67,
 'purchas': 68,
 'mil

Now we want to create an inverted index that will map every word (using the ID we created previously) to a list of all the documents that contain that single word. 
The result will be in the following format: `{1: [0,2,5], 245: [1, 4, 7]}`.

In [49]:
invidx = {} 

for key, value in docdict.items():
    for elem in value:
        if(voc[elem] in invidx):
            invidx[voc[elem]] = invidx[voc[elem]] + [key]
        else:
            invidx[voc[elem]] = [key]
invidx

{0: [0,
  4,
  22,
  50,
  58,
  58,
  70,
  70,
  150,
  196,
  199,
  222,
  236,
  275,
  288,
  296,
  332,
  384,
  429,
  439,
  443,
  449,
  463,
  474,
  475,
  504,
  540,
  549,
  567,
  589,
  607,
  610,
  670,
  672,
  708,
  735,
  769,
  771,
  793,
  812,
  815,
  816,
  826,
  851,
  855,
  862,
  896,
  926,
  944,
  976,
  1024,
  1052,
  1058,
  1099,
  1138,
  1175,
  1187,
  1238,
  1240,
  1254,
  1257,
  1259,
  1305,
  1354,
  1366,
  1391,
  1391,
  1397,
  1411,
  1506,
  1540,
  1542,
  1552,
  1561,
  1581,
  1582,
  1589,
  1593,
  1605,
  1605,
  1611,
  1627,
  1646,
  1674,
  1695,
  1695,
  1701,
  1772,
  1839,
  1850,
  1902,
  1967,
  1969,
  1970,
  1999,
  2033,
  2039,
  2045,
  2055,
  2057,
  2088,
  2109,
  2129,
  2145,
  2158,
  2160,
  2181,
  2181,
  2212,
  2214,
  2216,
  2263,
  2275,
  2279,
  2317,
  2321,
  2340,
  2374,
  2382,
  2389,
  2401,
  2413,
  2423,
  2426,
  2429,
  2446,
  2465,
  2485,
  2489,
  2489,
  2492,
  2500,
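
The same construction on a toy corpus makes the two structures easier to inspect. This is a sketch on made-up documents, not the notebook's data; it stores each posting list as a sorted list without duplicates, which is a small variation on the code above, where a document appears once per occurrence of the word.

```python
# Made-up stemmed documents: document ID -> list of stems.
toy_docs = {
    0: ["privat", "room", "airport", "room"],
    1: ["hous", "garden", "room"],
    2: ["hous", "river"],
}

# Vocabulary: stem -> sequential term ID, in order of first appearance.
toy_voc = {}
for stems in toy_docs.values():
    for stem in stems:
        toy_voc.setdefault(stem, len(toy_voc))

# Inverted index: term ID -> sorted list of the documents containing the term.
postings = {}
for doc_id, stems in toy_docs.items():
    for stem in set(stems):  # set() keeps each document once per term
        postings.setdefault(toy_voc[stem], set()).add(doc_id)
toy_invidx = {term_id: sorted(docs) for term_id, docs in postings.items()}

print(toy_voc)
print(toy_invidx[toy_voc["room"]])
```

Keeping each document once per term makes the posting lists shorter, at the cost of losing the number of occurrences, which the ranking step can recover from the document itself.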
 

Now we define a search function that can be called at any time and gives us back a list of all the items that contain all the words requested in the query (apart from the stopwords). The function is split into three main parts:
- Recognizing the language and stemming the query. If the query is not recognized as one of the languages listed before, we treat it as English and stem it accordingly.
- Getting the list of documents containing all the words requested in the query
- Printing the results

In [47]:
def searchConj(string, n):
    
    text = string 
    lan = detect(str(text)) # Detect the language
    
    if(lan not in languages): 
        lan = 'en'
        
    # Remove the stopwords and the non-alphabetic tokens, then stem the words
    tostem = ([x.lower() for x in text.split(' ') if x.lower() not in stopwords.words(languages[lan]) and x.isalpha()])
    result = [nltk.stem.SnowballStemmer(languages[lan]).stem(word) for word in tostem]
    
    # Get the list of documents
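    listofdocs = None
    for elem in result:
        # A word that is not in the vocabulary matches no document
        docs = set(invidx[voc[elem]]) if elem in voc else set()
        # & computes the intersection of the sets; an empty intersection stays empty
        listofdocs = docs if listofdocs is None else listofdocs & docs
    listofdocs = listofdocs or set() # An empty query gives no results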

    table = [['<b>Title</b>', '<b>Description</b>', '<b>City</b>', '<b>URL</b>']]
    
    # Open every document and add it to a list
    for elem in listofdocs:
        file_name = DIR + 'doc_'+str(elem)+'.tsv'
        
        with open(file_name,'rb') as file: # 'rb' means we are reading the file in binary mode.
            text = str(file.read()).split('\\t')
            
            table = table + [[text[7], text[4], text[2], '<a href="' + str(text[8]) + '" target="_blank">' + str(text[8]) + '</a>' ]]
            
            file.close()

    # Display the list
    display(HTML(tabulate.tabulate(table[:n+1], tablefmt='html')))   
    

Now we call the search function, letting the user decide the query to input, and limiting the outputs to the first 5 elements.

In [48]:
searchConj(input(), 5)

House with garden


0,1,2,3
Title,Description,City,URL
"Rosen House Inn, Rose Garden Room","We are a small family owned Bed and Breakfast located in the Heart of the near Southside. Within walking distance to small shops, bars, and local dining. We offer snacks, coffee, tea, and sodas as well as a full hearty breakfast for our guests.",Fort Worth,https://www.airbnb.com/rooms/9972447?location=Benbrook%2C%20TX
"BRAND NEW HOME, EXQUISITE DECORATION, LARGE GARDEN","Brand new house with exquisite decoration. We have two indoor parking spaces and three exterior parking spaces in case you are traveling by car. The beds are orthopedic which provides a comfortable rest. For us cleaning is an essential point in the house, so feel totally safe. The house has been designed and decorated to satisfy the needs of families, business trips and relaxation time. We hope to make your stay an unforgettable experience!",Spring,https://www.airbnb.com/rooms/15285127?location=Conroe%2C%20TX
East Austin Hillside Gem,"Beautiful and modern 3Br, 2.5Ba located minutes from downtown. Amenities include front deck, back patio, peaceful backyard with tiered gardens and fire pits. There is also gym equipment, wifi, Xbox, and cable with HBO, Showtime, Starz, etc. There is free offstreet parking for two cars. Perfect location for families(with kids or furry friends,) couples, business travelers or friends looking for a spacious and comfortable location with plenty of amenities. This house has beautiful sunset views of downtown Austin just 6 miles from Darrell K Royal Stadium, 8 miles from Zilker Park and 14 miles from Circuit of the Americas.",Austin,https://www.airbnb.com/rooms/17555039?location=Bastrop%20County%2C%20TX
Charming 1 BR cottage near Chappell Hill,"Welcome to Cedar Meadows, a one-of-a-kind 2015 custom-built cottage situated on 7-acres of land with gardens all around. Surrounded by a white picket fence with decks on either side of the house, this property will allow you to enjoy the stunning views in any weather. The house is located on North Meyersville Road, 4.5 miles from Chappell Hill and 9.5 miles from Brenham making it convenient to access either town but giving you a feeling of secluded haven.",Brenham,https://www.airbnb.com/rooms/17921211?location=Brenham%2C%20TX
Gibbs House / Walk to the Beach,"Your private suite is in my personal residence. I live here. It takes 2 minutes to walk to the Convention Center and the beach is a three minute walk away. A theme park (The Pleasure Pier) is two miles away on the Galveston Seawall. Moody Gardens is about four miles away. A Kroger grocery store (with a Starbucks Coffee Shop) is a three minute walk from the house. Gibbs House is excellent for couples, business travelers, small families, and solo adventurers.",Galveston,https://www.airbnb.com/rooms/16982846?location=Bayou%20Vista%2C%20TX
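
The core of the conjunctive search is a set intersection over the posting lists of the query terms. The following is a minimal sketch on a made-up index keyed directly by stem (the notebook keys it by term ID); `conjunctive` is a hypothetical helper, and it treats an unknown term as matching no document.

```python
# Made-up inverted index: stem -> documents containing it.
toy_index = {
    "hous": [1, 2, 5],
    "garden": [1, 4, 5],
    "river": [2],
}

def conjunctive(terms, index):
    """Return the sorted IDs of the documents that contain every term."""
    if not terms:
        return []
    # A term absent from the index has an empty posting list.
    posting_sets = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*posting_sets))

print(conjunctive(["hous", "garden"], toy_index))
print(conjunctive(["hous", "pool"], toy_index))
```

The first query keeps only the documents shared by both posting lists, while the second one is empty because no document contains `pool`.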


## 3.2: Conjunctive query & Ranking score

In [None]:
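import math
import numpy as np

# tf-idf of a word in a document: D is the total number of documents,
# N is the number of documents that contain the word
def tf_idf(word,doc,D,N):
    f = doc.count(word) / len(doc) # term frequency
    tfidf = math.log(D/N)*f # term frequency times inverse document frequency
    
    return (tfidf)

# Cosine similarity between two vectors

def cosine_sim(x , y):
    a = np.dot(x,y)
    
    b = np.linalg.norm(x)
    c = np.linalg.norm(y)
    
    cos = (a/(b*c))
    
    return (cos)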
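
As a worked example of the two functions, the sketch below computes a tf-idf score by hand for a made-up document and compares two pairs of vectors with the cosine similarity. It uses only the standard library (no NumPy), and `tf_idf_toy` and `cosine_toy` are illustrative names that mirror the definitions above: term frequency times log(D/N), where D is the number of documents and N the number of documents containing the word.

```python
import math

def tf_idf_toy(word, doc, n_docs, n_docs_with_word):
    """Term frequency in doc times the log inverse document frequency."""
    tf = doc.count(word) / len(doc)
    return tf * math.log(n_docs / n_docs_with_word)

def cosine_toy(x, y):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Made-up corpus of 4 documents; "garden" appears in 2 of them.
doc = ["hous", "garden", "garden", "river"]
score = tf_idf_toy("garden", doc, n_docs=4, n_docs_with_word=2)
print(score)  # equals 0.5 * ln(2): tf is 2/4 and idf is ln(4/2)

orthogonal = cosine_toy([1, 0], [0, 1])  # vectors with no term in common
parallel = cosine_toy([1, 2], [2, 4])    # same direction, different length
print(orthogonal, parallel)
```

The cosine depends only on the direction of the vectors, so a long document and a short one with the same proportions of terms get the same similarity.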
    

    

In [None]:
# First step towards the index with the tf-idf scores: for every term ID we keep the set of documents that contain it, and we count the total number of documents.

invidx2 = invidx.copy()
alldocs = []
for key, value in invidx2.items():
    alldocs = alldocs + invidx2[key]
    invidx2[key] = set(value)

Ndocs = len(set(alldocs))
print(Ndocs)

In [None]:
inv_voc = {v: k for k, v in voc.items()}
invidx3 = invidx2.copy()

for key, value in invidx2.items():
    tmp = []
    for elem in value:
#         file_name = DIR + 'doc_'+str(elem)+'.tsv'
#         with open(file_name,'rb') as file: # the wb means we are writing a file in binary mode.
#             text = str(file.read()).split('\\t')
#             text = text[4].split(' ')
#             file.close()
        x = tf_idf(inv_voc[key], docdict[elem], Ndocs, len(value))
        tmp.append((elem,x))
    invidx3[key] = tmp

print(invidx3)

In [None]:
print(invidx3[1])

In [None]:
import json
with open(DIR+'dicIndex.json', 'w') as file:
     file.write(json.dumps(invidx3)) # use `json.loads` to do the reverse
        
with open(DIR+'dicIndex.pkl', 'wb') as file:
        pickle.dump(invidx3, file, pickle.HIGHEST_PROTOCOL)

In [None]:
from heapq import heappush, heappop
from scipy import spatial

def heapsort(d):
    # Sort (doc ID, score) pairs by ascending score using a heap
    l = []
    for key in d:
        heappush(l, key[::-1]) # Push (score, doc ID) so that the heap orders by score
    return [heappop(l)[::-1] for i in range(len(l))]

def searchCos(query, n):
    # Load the index: term ID -> list of (doc ID, tf-idf)
    with open(DIR+'dicIndex.pkl', 'rb') as file:
        data = pickle.load(file)

    languages = {'en': 'english', 'de': 'german', 'es': 'spanish', 'nl': 'dutch'}
    lan = detect(str(query))
    if(lan not in languages):
        lan = 'en'
    tostem = ([x.lower() for x in query.split(' ') if x.lower() not in stopwords.words(languages[lan]) and x.isalpha()])
    result = [nltk.stem.SnowballStemmer(languages[lan]).stem(word) for word in tostem]

    # Keep the distinct query terms that exist in the vocabulary
    terms = sorted({voc[elem] for elem in result if elem in voc})
    if(len(terms) == 0):
        print('No results')
        return
    # For every term, map doc ID -> tf-idf
    weights = [dict(data[term]) for term in terms]

    # Conjunctive step: keep the documents that contain all the terms (& is the intersection)
    listofdocs = set(weights[0])
    for w in weights[1:]:
        listofdocs = listofdocs & set(w)

    # Cosine similarity between the query vector and the tf-idf vector of the
    # document, both restricted to the query terms. The query weighs every
    # term equally: this is a simplifying assumption.
    qvect = [1] * len(terms)
    scores = []
    for doc in listofdocs:
        dvect = [w[doc] for w in weights]
        scores.append((doc, 1 - spatial.distance.cosine(qvect, dvect)))

    ranked = heapsort(scores)[::-1] # Highest similarity first

    t = Texttable()
    t.add_row(['Title', 'Description', 'City', 'URL', 'Similarity'])
    for doc, score in ranked[:n]:
        file_name = DIR + 'doc_'+str(doc)+'.tsv'
        with open(file_name,'rb') as file: # 'rb' means we are reading the file in binary mode.
            text = str(file.read()).split('\\t')
            t.add_row([text[7], text[4], text[2], text[8], score])

    print(t.draw())

In [None]:
searchCos(input(), 10)
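
When only the best `n` results are needed, it is not necessary to sort every candidate. The following is a minimal sketch with made-up (document ID, similarity) pairs that uses `heapq.nlargest` from the standard library as an alternative to a full heap sort; `top_n` is a hypothetical helper name.

```python
from heapq import nlargest

# Made-up (document ID, similarity) pairs.
scored_docs = [(12, 0.31), (7, 0.92), (40, 0.55), (3, 0.78), (18, 0.10)]

def top_n(pairs, n):
    """Return the n pairs with the highest similarity, best first."""
    return nlargest(n, pairs, key=lambda pair: pair[1])

best = top_n(scored_docs, 3)
print(best)
```

For a small `n` and a long list of candidates this avoids ordering the documents that will never be displayed.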

In [None]:
def findcenter(lat, long):
    return [sum(lat)/len(lat), sum(long)/len(long)]

In [None]:
# df.groupby
# latvect = df.groupby('city')['latitude'].apply(list)
# lonvect = df.groupby('city')['longitude'].apply(list)

In [None]:
# centers = {}
# for key, elem in latvect.iteritems():
#     centers[key] = findcenter(elem, lonvect[key])
# print(centers)


In [None]:
latvect

In [None]:
lonvect

In [None]:
location

In [69]:
locationsList = {elem : i for i, elem in enumerate(set(df["city"]))}
locationsList

{'Quitman': 0,
 'Martindale': 1,
 'Llano': 2,
 'Whitney': 3,
 'Salado': 4,
 'Channelview': 5,
 'DeSoto': 6,
 'Caddo': 7,
 'San Benito': 8,
 'Burnet': 9,
 'Dublin': 10,
 'Junction': 11,
 'Mason': 12,
 'Blue Mound': 13,
 'Plano': 14,
 'China Spring': 15,
 'Pflugerville': 16,
 'Wimberley': 17,
 '诺斯莱克': 18,
 'City By The Sea': 19,
 'Converse': 20,
 'The Woodlands': 21,
 'Cleburne': 22,
 'Eustace': 23,
 'McKinney': 24,
 'Natalia': 25,
 'Port Isabel': 26,
 'Ellinger': 27,
 'Decatur': 28,
 'Melissa': 29,
 'Los Fresnos': 30,
 'Coppell': 31,
 'Copper Canyon': 32,
 'Schertz': 33,
 'Justin': 34,
 'Baird': 35,
 'Pipe Creek': 36,
 'Buda': 37,
 'Jamaica Beach': 38,
 'Anson': 39,
 'CIty by the Sea': 40,
 'Hickory Creek': 41,
 'Crawford': 42,
 'Zephyr': 43,
 'Concan': 44,
 'Manor': 45,
 'Joshua': 46,
 'Beach City': 47,
 'Shore Acres': 48,
 'Orange': 49,
 'Fair Oaks Ranch': 50,
 'Needville': 51,
 'New Fairview': 52,
 'Addison': 53,
 'Van': 54,
 'Friendswood': 55,
 'Horseshoe bay': 56,
 'Kemah': 57,
 'S

In [None]:
from time import sleep

def distanceFromCityCenter(lat, long, city):
    coords_1 = (lat, long)
    geolocator = Nominatim(user_agent="search engine")
    for i in range(5): # Retry a few times instead of looping forever
        try:
            location = geolocator.geocode(city + ", Texas")
            coords_2 = (location.latitude, location.longitude)
            # vincenty was removed in geopy 2.0; distance computes the geodesic distance
            return geopy.distance.distance(coords_1, coords_2).km
        except:
            sleep(0.1) # Wait a little before retrying
    return None # The city could not be geocoded
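
For a quick check that needs neither the network nor geopy, the distance from a known centre can be approximated with the haversine formula. This is a swapped-in technique, not what the notebook uses: it assumes a spherical Earth with a radius of 6371 km, so it differs slightly from geopy's geodesic distance. The coordinates are approximate values for Houston and Austin, used only as an illustration, and `haversine_km` is a hypothetical helper name.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

houston = (29.7604, -95.3698)  # approximate city centre
austin = (30.2672, -97.7431)   # approximate city centre
distance_km = haversine_km(*houston, *austin)
print(round(distance_km), "km")
```

Since the formula runs locally, it can also serve as a sanity check for the values returned by the geocoding-based function.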

In [71]:
from time import sleep

citycoord = {}
for key in locationsList:
    geolocator = Nominatim(user_agent="search engine")
    coords_2 = None # Stays None if the city cannot be geocoded
    for i in range(5): # Try at most 5 times
        try:
            location = geolocator.geocode(key + ", Texas")
            coords_2 = (location.latitude, location.longitude)
            print(location)
            break
        except: 
            sleep(0.1) # Wait a little before retrying
    citycoord[locationsList[key]] = coords_2

with open(DIR+'citycoord.pkl', 'wb') as file:
        pickle.dump(citycoord, file, pickle.HIGHEST_PROTOCOL)

Quitman, Wood County, Texas, 75783, USA
Martindale, Caldwell County, Texas, USA
Llano, Llano County, Texas, 76824, USA
Whitney, Hill County, Texas, USA
Salado, Bell County, Texas, USA
Channelview, Harris County, Texas, 77530, USA
DeSoto, Dallas County, Texas, 75115, USA
Caddo, Wilson County, Texas, USA
San Benito, Cameron County, Texas, 78586, USA
Burnet, Burnet County, Texas, USA
Dublin, Erath County, Texas, 76446, USA
Junction, Kimble County, Texas, 76849, USA
Mason, Mason County, Texas, 76856, USA
Blue Mound, Denton County, Texas, 76202, USA
Plano, Collin County, Texas, USA
China Spring, McLennan County, Texas, 76633, USA
Pflugerville, Travis County, Texas, USA
Wimberley, Hays County, Texas, USA
City-by-the Sea, Aransas County, Texas, USA
Converse, Bexar County, Texas, 78109, USA
The Woodlands, Montgomery County, Texas, USA
Cleburne, Johnson County, Texas, 76033, USA
Eustace, Henderson County, Texas, USA
McKinney, Collin County, Texas, USA
Natalia, Medina County, Texas, 78059, USA
P

China, Jefferson County, Texas, 77613, USA
Sunset, Knox County, Texas, USA
Brownwood, Brown County, Texas, 76801, USA
Nacogdoches, Nacogdoches County, Texas, USA
New Waverly, Walker County, Texas, USA
Texas Avenue, Bryan, Brazos County, Texas, 77840, USA
Devine, Medina County, Texas, 78016, USA
Hallettsville, Lavaca County, Texas, USA
Mico, Medina County, Texas, 78056, USA
Galveston, Galveston County, Texas, USA
Fredonia, Gregg County, Texas, 75695, USA
Keene, Johnson County, Texas, 76059, USA
Jacksboro, Jack County, Texas, 76458, USA
Beaumont, Jefferson County, Texas, USA
Canton, Van Zandt County, Texas, 75103, USA
Hunt County, Texas, USA
Balch Springs, Dallas County, Texas, USA
Allen, Collin County, Texas, USA
Haslet, Tarrant County, Texas, USA
Weimar, Colorado County, Texas, 78962, USA
Flower Mound, Denton County, Texas, USA
Ovalo, Taylor County, Texas, 79541, USA
Washington County, Texas, USA
San Antonio, Bexar County, Texas, USA
Glenn Heights, Dallas County, Texas, USA
Livingston,

Nolanville, Bell County, Texas, USA
Grand Saline, Van Zandt County, Texas, 75140, USA
Trinidad, Henderson County, Texas, USA
Briarcliff, Travis County, Texas, USA
Runaway Bay, Wise County, Texas, USA
Lake Jackson, Brazoria County, Texas, 77566, USA
Duncanville, Dallas County, Texas, USA
Hereford, Deaf Smith County, Texas, USA
Glen Rose, Somervell County, Texas, 76043, USA
KIXL-AM (Del Valle), Crofford Lane, Daffan Gin Park, Travis County, Texas, 78724, USA
Leonard, Fannin County, Texas, 75452, USA
Garden Ridge, Comal County, Texas, 78266, USA
Port Neches, Jefferson County, Texas, 77651, USA
Winnie, Chambers County, Texas, 77665, USA
Llano County, Texas, USA
Mabank, Kaufman County, Texas, 75147, USA
Marion County, Texas, 75657, USA
Somerville, Burleson County, Texas, 77879, USA
Ingleside, San Patricio County, Texas, 78362, USA
Anahuac, Chambers County, Texas, TX 77514, USA
Waco, McLennan County, Texas, USA
Palo Pinto County, Texas, USA
Kyle, Hays County, Texas, USA
Cedar Hill, Dallas Co

In [None]:
locdict = {}
for i, row in df[:100].iterrows():
    dist = distanceFromCityCenter(row['latitude'], row['longitude'], row['city'])
    if(locationsList[row['city']] in locdict):
        locdict[locationsList[row['city']]] = locdict[locationsList[row['city']]] + [[i, dist]]
    else:
        locdict[locationsList[row['city']]] =  [[i, dist]]
locdict

In [None]:
locationsList
sortLocations = {}
for key in sorted(locationsList):
    sortLocations[key] =  locationsList[key]
sortLocations

In [None]:

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

def searchByLocation(location):
    return locdict[location]

interact(searchByLocation, location = sortLocations);

## Bonus: Make a nice visualization!

In [52]:
lat = float(input())
lon = float(input())
c = [lat, lon]

29.829352
-95.081549


In [65]:
m = folium.Map(location = c, zoom_start = 6)

In [67]:
folium.Circle(
    radius = 10000,
    location = c,
    popup = 'The Waterfront',
    color = 'crimson',
    fill = True,
).add_to(m)

<folium.vector_layers.Circle at 0x1a161d7160>

In [57]:
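Coords = {}
for doc in df.index: # The file names use the dataframe index, which has gaps after the cleaning
    with open(DIR+'doc_'+str(doc)+'.tsv', 'r', encoding = 'utf-8') as f:
        ou = f.readline().strip().split('\t') # Read only the first line of the document
    # Fields: 5 = latitude, 6 = longitude, 8 = URL
    if ou[8] not in Coords:
        if ou[5] != 'nan' and ou[6] != 'nan':
            Coords[ou[8]] = [float(ou[5]), float(ou[6])]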

In [62]:
def google_maps(lat, lon):
    r = 10 # Search radius in km
    P1 = (lat, lon) # Centre of the search
    for k, v in Coords.items():
        P2 = (v[0], v[1]) # Coordinates of the listing
        # Add a marker for every listing closer than r km to the centre
        if (geopy.distance.distance(P1, P2).km) < r:
            folium.Marker(location = [float(P2[0]),float(P2[1])], 
                          popup = folium.Popup('<div><a href="'+k+'" target = "_blank" >'+k+'</a></div>'),
                          icon = folium.Icon(color = 'blue', icon = 'home')
                         ).add_to(m)

In [63]:
google_maps(c[0], c[1])

In [64]:
m