# City Search Tool

There are a lot of factors that go into making a big move, and for many people, the top priority is either their job or their family. But if you’re on your own and you have job flexibility to go basically wherever you want (i.e. you work remotely), then what? In that case, you have the luxury of finding a place that suits you—and not necessarily just your career.

Many decisions go into picking the perfect place to call home: political leanings, crime rates, walkability, affordability, religious affiliations, weather, and more. Can you make a tool that helps Aggie graduates and others find their next move?

[High Speed Internet](https://www.highspeedinternet.com/best-cities-to-live-work-remotely) (of all people?!) made a tool to do this... but you can do better! Think of more factors, like the median income of a location, cuisine, primary ethnicity, pollution index, happiness index, or the number of coffee shops or microbreweries in the city. There's no end! Maybe you are an international student and want to build this tool for global placement. Go for it! Maybe you want to penalize distance from POIs (points of interest) like family. Do it! The world is your oyster!
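One way to combine several factors with a distance penalty is a weighted score. The sketch below is only an illustration, not a required approach: the function names `haversine_km` and `score_cities`, the weights, the penalty size, and the point of interest (roughly College Station, TX) are all assumptions. The three rows are copied from the starter data.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between points given in degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def score_cities(cities, weights, poi=None, penalty_per_1000_km=0.0):
    """Rank cities by a weighted sum of min-max scaled factors, minus a distance penalty."""
    cols = list(weights)
    # Min-max scaling puts every factor on a 0-1 range
    # (a column with a single repeated value would produce NaN here).
    scaled = (cities[cols] - cities[cols].min()) / (cities[cols].max() - cities[cols].min())
    score = sum(scaled[c] * w for c, w in weights.items())
    if poi is not None:
        dist = haversine_km(cities['lat'], cities['lng'], poi[0], poi[1])
        score = score - penalty_per_1000_km * dist / 1000.0
    return cities.assign(Score=score).sort_values('Score', ascending=False)

# Rows copied from the starter data; weights and penalty are hypothetical preferences.
cities = pd.DataFrame({
    'City': ['Caracas', 'Johannesburg', 'Saint Louis'],
    'Purchase Power': [11.25, 53.99, 80.4],
    'Pollution': [83.45, 47.39, 31.33],
    'lat': [10.480594, -26.204103, 38.627003],
    'lng': [-66.903606, 28.047305, -90.199404],
})
weights = {'Purchase Power': 1.0, 'Pollution': -1.0}  # negative weight: lower is better
poi = (30.6280, -96.3344)  # assumed point of interest, approximately College Station, TX
ranked = score_cities(cities, weights, poi=poi, penalty_per_1000_km=0.05)
print(ranked[['City', 'Score']])
```

Scaling each factor before weighting keeps a column with large raw values from dominating the score.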

#### Starter Datasets
- [MoveHub City Ratings](https://www.kaggle.com/blitzr/movehub-city-rankings?select=movehubqualityoflife.csv)
  - [Notebooks for ideas on how to use data](https://www.kaggle.com/blitzr/movehub-city-rankings/notebooks)
- [World City Populations](https://www.kaggle.com/max-mind/world-cities-database?select=worldcitiespop.csv)
- [Rental Price](https://www.kaggle.com/zillow/rent-index)

#### Where to Find More Data
- [Google Datasets](https://datasetsearch.research.google.com/)
- [US Census](https://data.census.gov/cedsci/?q=United%20States)
- [Kaggle Datasets](https://www.kaggle.com/datasets)


#### How We Judge
- *Data Use*: Effectively used data, acquired additional data
- *Analytics*: Effective application of analytics (bonus points for ML/clustering techniques)
- *Visualization*: Solution is visually appealing and useful (bonus points if you create an interactive tool, application, or website)
- *Impact*: Solution has a clear impact on solving the problem
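As a starting point for the clustering bonus mentioned under *Analytics*, here is a minimal sketch of k-means on standardized city ratings. It is a plain NumPy version of Lloyd's algorithm, swapped in for a library implementation to keep the example dependency-free. The function name `kmeans`, the choice of `k=2`, and the feature columns are assumptions, and the five rows are copied from the starter data.

```python
import numpy as np
import pandas as pd

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points; keep it if the cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Rows copied from the starter data.
cities = pd.DataFrame({
    'City': ['Caracas', 'Johannesburg', 'Fortaleza', 'Saint Louis', 'Mexico City'],
    'Purchase Power': [11.25, 53.99, 52.28, 80.4, 24.28],
    'Health Care': [44.44, 59.98, 45.46, 77.29, 61.76],
    'Quality of Life': [8.61, 51.26, 36.68, 87.51, 27.91],
})
features = cities.drop(columns='City')
# Standardize so every rating contributes on the same scale.
X = ((features - features.mean()) / features.std()).to_numpy()
labels, centers = kmeans(X, k=2)
cities['Cluster'] = labels
print(cities[['City', 'Cluster']])
```

Cities that share a cluster have similar rating profiles, which could feed a "cities like this one" recommendation.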

#### Helpful Workshops
- Intro to Python: Sat, 10:30-12:00
- Statistics for Data Scientists: Sat, 10:30-12:00
- How to Win TAMU Datathon: Sat, 13:00-14:00
- Data Wrangling: Sat, 17:00-18:15
- Data Visualization: Sat, 18:30-19:45
- Machine Learning Part 1 - Theory: Sat, 20:00-21:15
- Machine Learning Part 2 - Applied: Sat, 21:30-22:45


In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('https://tamu-datathon-2020.s3.us-east-2.amazonaws.com/data/country.csv')

In [152]:
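#Gets the country for each city by geocoding the city name

def get_country(df):
    
    from geopy.geocoders import Nominatim
    
    # Create the geocoder once instead of on every iteration
    nm = Nominatim(user_agent="my-application")
    
    for i in range(len(df)): 
        location = nm.geocode(df.at[i, 'City'], language = 'en')
        if location is None:
            # geocode returns None when the city is not found
            continue
        # The country is the last comma-separated part of the address
        df.at[i, 'Country'] = location.address.split(', ')[-1]

    return df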
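Geocoding services are slow and usually rate limited, so it can help to look up each distinct city only once. The sketch below shows that idea with an injectable lookup function so it runs offline. The names `add_country` and `fake_geocode` and the sample addresses are hypothetical; with geopy, the callable passed in could wrap `Nominatim.geocode` and return the result's `address`, possibly throttled with `geopy.extra.rate_limiter.RateLimiter`.

```python
import pandas as pd

def add_country(df, geocode, city_col='City'):
    """Add a 'Country' column by geocoding each distinct city once.

    `geocode` is any callable that takes a city name and returns an address
    string ending in the country name, or None when nothing is found.
    """
    lookup = {}
    for city in df[city_col].dropna().unique():
        address = geocode(city)
        # The country is taken to be the last comma-separated part of the address.
        lookup[city] = address.split(', ')[-1] if address else None
    return df.assign(Country=df[city_col].map(lookup))

# Stand-in geocoder (hypothetical data) so the sketch runs without network access.
def fake_geocode(city):
    known = {'Caracas': 'Caracas, Venezuela', 'Cape Town': 'Cape Town, South Africa'}
    return known.get(city)

demo = pd.DataFrame({'City': ['Caracas', 'Cape Town', 'Caracas', 'Atlantis']})
demo = add_country(demo, fake_geocode)
print(demo)
```

Cities that cannot be geocoded end up with a missing country instead of stopping the whole run.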

In [157]:
#Converts countries to 2 letter country codes

def convert_code(df):
    
    import country_converter as coco
    
    # Convert once and store the result
    df['Country_Code'] = coco.convert(names=df['Country'].tolist(), to='ISO2')
    
    return df

In [158]:
convert_code(get_country(df)).to_csv('country.csv', index = False)
# df_democracy is assumed to be loaded in a cell not shown here
convert_code(df_democracy).to_csv('democracy.csv', index = False)
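Once both tables carry a `Country_Code` column, country-level scores can be attached to each city with a join. The sketch below is one possible way to do it, not necessarily how the final dataset was built. The frame names `cities` and `democracy` are assumptions, and the scores are copied from the `Democracy` column shown later in this notebook.

```python
import pandas as pd

cities = pd.DataFrame({
    'City': ['Caracas', 'Johannesburg', 'Cape Town'],
    'Country_Code': ['VE', 'ZA', 'ZA'],
})
# Country-level scores, one row per country.
democracy = pd.DataFrame({
    'Country_Code': ['VE', 'ZA'],
    'Democracy': [2.88, 7.24],
})

# A left join keeps every city, even if its country has no score (NaN in that case).
# validate='many_to_one' raises an error if a country code is duplicated on the right.
merged = cities.merge(democracy, on='Country_Code', how='left', validate='many_to_one')
print(merged)
```

Joining on the two-letter code avoids mismatches between spellings of the same country name.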

In [186]:
#Web scraping rental prices

import urllib.request
from pprint import pprint
from html_table_parser import HTMLTableParser
import pandas as pd

def url_get_contents(url):
    # Fetch the raw bytes of the page and close the connection afterwards
    req = urllib.request.Request(url=url)
    with urllib.request.urlopen(req) as f:
        return f.read()

xhtml = url_get_contents('https://www.numbeo.com/cost-of-living/city_price_rankings?itemId=100').decode('utf-8')

p = HTMLTableParser()
p.feed(xhtml)

df_rental = pd.DataFrame(p.tables[1], columns = ['Index', 'City', 'Empty', 'Price'])
df_rental.head()

Unnamed: 0,Index,City,Empty,Price
0,1.0,"Hong Kong, Hong Kong",,"32,783.46 $"
1,2.0,"Singapore, Singapore",,"19,714.13 $"
2,3.0,"Seoul, South Korea",,"19,667.72 $"
3,4.0,"New York, NY, United States",,"15,824.01 $"
4,5.0,"Beijing, China",,"15,535.83 $"


In [187]:
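# Drop the columns that carry no information
df_rental = df_rental.drop(['Index', 'Empty'], axis = 1)

# Keep only the city name (the part before the first comma)
df_rental['City'] = df_rental['City'].str.split(', ').str[0]
# Strip the currency sign and thousands separators, then convert to a number
price = df_rental['Price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
df_rental['Price'] = pd.to_numeric(price.str.strip(), errors='coerce')

df_rental.to_csv('df_rental.csv')
df_rental.head()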

Unnamed: 0,City,Price
0,Hong Kong,32783.46
1,Singapore,19714.13
2,Seoul,19667.72
3,New York,15824.01
4,Beijing,15535.83


In [151]:
df.head()

Unnamed: 0,City,Movehub Rating,Purchase Power,Health Care,Pollution,Quality of Life,Crime Rating,lat,lng,Country,Country_Code
0,Caracas,65.18,11.25,44.44,83.45,8.61,85.7,10.480594,-66.903606,VE,
1,Johannesburg,84.08,53.99,59.98,47.39,51.26,83.93,-26.204103,28.047305,ZA,ZA
2,Fortaleza,80.17,52.28,45.46,66.32,36.68,78.65,-3.732714,-38.526998,BR,BR
3,Saint Louis,85.25,80.4,77.29,31.33,87.51,78.13,38.627003,-90.199404,US,US
4,Mexico City,75.07,24.28,61.76,18.95,27.91,77.86,19.432608,-99.133208,MX,MX


In [37]:
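#Scrape images for cities

def get_images(df):
    
    from urllib.parse import quote
    from bs4 import BeautifulSoup
    import requests
    
    for i in range(len(df)):
        # Percent-encode the city name so names with spaces form a valid URL
        url = 'https://unsplash.com/s/photos/' + quote(df.at[i, 'City'])
        page = requests.get(url)
        soup = str(BeautifulSoup(page.text, 'html.parser'))
        idx1 = soup.find('http://images.unsplash.com/photo')
        idx2 = soup.find('" data-react', idx1)
        # Store an image URL only when both markers were found
        df.at[i, 'Image'] = soup[idx1:idx2] if idx1 != -1 and idx2 != -1 else None
    
    return df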

In [39]:
get_images(df_country).to_csv('df_country.csv', index = False)

In [47]:
df = pd.read_csv('df_country.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,City,Movehub Rating,Purchase Power,Health Care,Pollution,Quality of Life,Crime Rating,Country,Country_Code,lat,lng,Democracy,Happiness,Image
0,0,Caracas,65.18,11.25,44.44,83.45,8.61,85.7,Venezuela,VE,10.506098,-66.914602,2.88,27.03,http://images.unsplash.com/photo-1509915964737...
1,1,Johannesburg,84.08,53.99,59.98,47.39,51.26,83.93,South Africa,ZA,-26.205,28.049722,7.24,28.38,http://images.unsplash.com/photo-1577948000111...
2,2,Cape Town,87.95,60.36,71.67,75.98,78.73,68.06,South Africa,ZA,-33.928992,18.417396,7.24,28.38,http://images.unsplash.com/photo-1576485290814...
3,3,Pretoria,80.56,46.74,71.11,70.13,61.44,68.06,South Africa,ZA,-25.745937,28.187944,7.24,28.38,http://images.unsplash.com/photo-1599641084223...
4,4,Fortaleza,80.17,52.28,45.46,66.32,36.68,78.65,Brazil,BR,-3.730451,-38.521799,6.86,78.38,http://images.unsplash.com/photo-1589216235686...


In [7]:
import pandas as pd
df = pd.read_csv('https://tamu-datathon-2020.s3.us-east-2.amazonaws.com/data/country.csv')
df = df.drop('Unnamed: 0', axis = 1)
df.head()

Unnamed: 0,City,Movehub Rating,Purchase Power,Health Care,Pollution,Quality of Life,Crime Rating,Country,Country_Code,lat,lng,Democracy,Happiness,Population,Price Per Square Foot (USD),paired
0,Caracas,65.18,11.25,44.44,83.45,8.61,85.7,Venezuela,VE,10.506098,-66.914602,2.88,27.03,2938992.0,98.12,"('Caracas', 'http://images.unsplash.com/photo-..."
1,Johannesburg,84.08,53.99,59.98,47.39,51.26,83.93,South Africa,ZA,-26.205,28.049722,7.24,28.38,5782747.0,926.65,"('Johannesburg', 'http://images.unsplash.com/p..."
2,Cape Town,87.95,60.36,71.67,75.98,78.73,68.06,South Africa,ZA,-33.928992,18.417396,7.24,28.38,4617560.0,2181.94,"('Cape Town', 'http://images.unsplash.com/phot..."
3,Pretoria,80.56,46.74,71.11,70.13,61.44,68.06,South Africa,ZA,-25.745937,28.187944,7.24,28.38,2565660.0,616.44,"('Pretoria', 'http://images.unsplash.com/photo..."
4,Fortaleza,80.17,52.28,45.46,66.32,36.68,78.65,Brazil,BR,-3.730451,-38.521799,6.86,78.38,4073465.0,1480.41,"('Fortaleza', 'http://images.unsplash.com/phot..."


In [None]:
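from geopy.geocoders import Nominatim

#updates lat long with correct lat long

# Create the geocoder once instead of on every iteration
geolocator = Nominatim(user_agent="my_user_agent")

for i in range(len(df)):
    
    city = df.at[i, 'City']
    country = df.at[i, 'Country']
    loc = geolocator.geocode(city + ', ' + country)
    # geocode returns None when nothing is found; keep the old coordinates then
    if loc is not None:
        df.at[i, 'lat'] = loc.latitude
        df.at[i, 'lng'] = loc.longitude

#looks up population and price by city name
#df_population is assumed to be loaded in a cell not shown here

for i in range(len(df)):
    
    city = df.at[i, 'City']
    # regex=False matches the name literally; cities with no match are skipped
    pop = df_population[df_population.Name.str.contains(city, regex=False, na=False)].Population.values
    if len(pop) > 0:
        df.at[i, 'Population'] = pop[0]
    price = df_rental[df_rental.City.str.contains(city, regex=False, na=False)].Price.values
    if len(price) > 0:
        df.at[i, 'Price Per Square Foot (USD)'] = price[0]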
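Substring lookups like the ones above can pick up the wrong row when one city name is contained in another, and they miss names that differ only by accents or spacing. A possible alternative, sketched here under assumptions, is to normalize city names and join on the normalized key. The helper `normalize_city`, the frame names, and the prices are hypothetical and only illustrate the matching step.

```python
import unicodedata
import pandas as pd

def normalize_city(name):
    """Lowercase, collapse whitespace, and strip accents from a city name."""
    decomposed = unicodedata.normalize('NFKD', str(name))
    no_accents = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return ' '.join(no_accents.lower().split())

cities = pd.DataFrame({'City': ['São Paulo', 'Cape Town']})
# Made-up prices, for illustration only.
rental = pd.DataFrame({
    'City': ['sao paulo', 'Cape  Town', 'Capetown Heights'],
    'Price': [1.0, 2.0, 3.0],
})

# Join on the normalized key so only exact (normalized) name matches are paired.
left = cities.assign(key=cities['City'].map(normalize_city))
right = rental.assign(key=rental['City'].map(normalize_city))[['key', 'Price']]
matched = left.merge(right, on='key', how='left').drop(columns='key')
print(matched)
```

Cities without a match keep a missing price, which makes gaps in the source data easy to spot.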