# Introduction

In this notebook I'm going to describe how to find the geographic coordinates for the houses in the 'Madrid real estate market' dataset.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

pd.set_option("display.max_columns", None)    

In [None]:
data = pd.read_csv('../input/madrid-real-estate-market/houses_Madrid.csv', sep=',', header=0, index_col=0)
data.head()

We start by loading the dataset and selecting the relevant attributes.

In [None]:
# take an explicit copy so the in-place edits below don't trigger SettingWithCopyWarning
data_map = data.loc[:, ['id', 'latitude', 'longitude', 'raw_address', 'subtitle']].copy()

In [None]:
data_map.shape

We'll have to send each address to a geocoding server to get its latitude and longitude (and altitude, which we won't need here). Doing this for thousands of addresses takes time, so we need to make sure that we are not sending missing or repeated addresses.

Let's start by removing null values and duplicated addresses.

In [None]:
(data_map.raw_address.isnull()).sum()

In [None]:
data_map.dropna(subset=['raw_address'],inplace=True)

In [None]:
data_map.raw_address.duplicated().sum()

In [None]:
data_map.drop_duplicates(subset=['raw_address'], inplace=True)
data_map.shape

We are left with fewer than ten thousand rows.
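
To get a feel for how long the geocoding will take, here is a rough back-of-the-envelope estimate. It assumes one request per second (the limit set by the Nominatim usage policy) and uses 10,000 as an upper bound on the row count; the helper name `estimate_geocoding_time` is made up for this sketch, and network latency is ignored.

```python
from datetime import timedelta


def estimate_geocoding_time(n_addresses, delay_seconds=1.0):
    """Minimum total run time when requests are spaced delay_seconds apart."""
    return timedelta(seconds=n_addresses * delay_seconds)


# 10_000 is an upper bound: the text says fewer than ten thousand rows remain
print(estimate_geocoding_time(10_000))  # 2:46:40
```

So even in the best case we are looking at close to three hours for the whole file, which is why removing nulls and duplicates first is worth the effort.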

We'll create a new column with the street's name and its county. That's what we'll send to the server.

In [None]:
data_map['address'] = data_map.raw_address + ', ' + data_map.subtitle
data_map.drop(columns=['raw_address', 'subtitle'], inplace=True)
data_map.reset_index(drop=True, inplace=True)

In [None]:
data_map.head(10)
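
One thing to watch for: concatenating strings in pandas gives a missing value whenever either part is missing, so a row without a `subtitle` would end up without an `address`. Whether that happens in this dataset is not checked here. The sketch below uses two toy rows (the second one is made up) and falls back to the raw address alone, which is just one possible choice.

```python
import pandas as pd

# Toy rows for illustration only; the second address is hypothetical
toy = pd.DataFrame({
    "raw_address": ["Calle de Godella, 64", "Calle de Alcalá, 1"],
    "subtitle": ["San Cristóbal, Madrid", None],
})

# String + missing value yields a missing value
combined = toy["raw_address"] + ", " + toy["subtitle"]

# Fall back to the raw address where the combined string is missing
toy["address"] = combined.fillna(toy["raw_address"])
print(toy["address"].tolist())
```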

We'll use the library 'geopy' to access a geocoding service. These services expose web APIs, and geopy provides Python clients for several of them.

We'll use Nominatim, which uses OpenStreetMap data to find locations by name. It can also do the reverse process: find an address from its coordinates.

We have to identify ourselves in the `user_agent` parameter, for example with our email.

In [None]:
from geopy.geocoders import Nominatim

locator = Nominatim(user_agent="my-email@")

location1 = locator.geocode("Calle de Godella, 64, San Cristóbal, Madrid")
location2 = locator.reverse("43.3624, -5.8444")

Let's see what attributes we get and their contents.

In [None]:
dir(location1)[-6:]

In [None]:
print(location1.address)
print(location1.altitude)
print(location1.latitude)
print(location1.longitude)
print(location1.point)
print(location1.raw)

In [None]:
print(location2.address)
print(location2.altitude)
print(location2.latitude)
print(location2.longitude)
print(location2.point)
print(location2.raw)

Whether we ask for an address or its coordinates, we always get the same attributes back.

We'll see a small example with a function that sends at most one request per second, as required by the <a href="https://operations.osmfoundation.org/policies/nominatim/">Nominatim Usage Policy</a>. This way we won't overload their servers or risk being blocked.

In [None]:
from geopy.extra.rate_limiter import RateLimiter

# wrap the geocoder so consecutive calls are at least 1 second apart
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)
# geocode the first 16 addresses and store the results in a new column
data_map['location'] = data_map.loc[0:15]['address'].apply(geocode)
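
To make the idea behind the delay concrete, here is a minimal plain-Python sketch of a throttling wrapper. It is not geopy's implementation (the real `RateLimiter` also retries failed calls, which is left out here). The names `throttle` and `fake_geocode` are made up for illustration, and the fake geocoder just upper-cases its input so the example runs without network access.

```python
import time


def throttle(func, min_delay_seconds=1.0):
    """Wrap func so consecutive calls start at least min_delay_seconds apart."""
    last_call = None  # monotonic timestamp of the previous call

    def wrapper(*args, **kwargs):
        nonlocal last_call
        if last_call is not None:
            wait = min_delay_seconds - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)  # pause until the minimum delay has passed
        last_call = time.monotonic()
        return func(*args, **kwargs)

    return wrapper


# Stand-in for a real geocoder call (hypothetical, no network needed)
def fake_geocode(address):
    return address.upper()


slow_geocode = throttle(fake_geocode, min_delay_seconds=0.2)
start = time.monotonic()
results = [slow_geocode(a) for a in ["calle a", "calle b", "calle c"]]
elapsed = time.monotonic() - start
print(results, f"{elapsed:.2f}s")
```

The first call goes through immediately and each later call waits, so three calls take at least two delays in total.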

In [None]:
# copy the slice so that adding columns to it later doesn't raise SettingWithCopyWarning
data_small = data_map.loc[0:15].copy()
data_small['location'][0]


'Location' gives us the street's name, county, province, area, autonomous region, zip code and country, as well as its coordinates.

Now, for each row, we'll extract its latitude and longitude.

In [None]:
data_small['latitude'] = data_small['location'].apply(lambda loc: loc.latitude if loc else None)
data_small['longitude'] = data_small['location'].apply(lambda loc: loc.longitude if loc else None)

In [None]:
data_small

We only got coordinates for 7 of the 16 addresses we sent. That's roughly the proportion we'll get for the whole file: a little under half.
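
The hit rate can be computed directly from the column instead of counting rows by eye. The sketch below uses a toy Series standing in for `data_small['latitude']`, with 7 filled values and 9 missing ones to match the counts above; the coordinate value itself is a placeholder.

```python
import pandas as pd

# Toy stand-in for the 'latitude' column: 7 hits and 9 misses (placeholder values)
latitudes = pd.Series([40.4] * 7 + [None] * 9, name="latitude")

found = latitudes.notna().sum()      # number of addresses with coordinates
hit_rate = latitudes.notna().mean()  # share of addresses with coordinates
print(f"{found} of {len(latitudes)} addresses geocoded ({hit_rate:.0%})")
```

On the real data the same two lines would work with `data_small['latitude']` in place of the toy Series.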

There are other geocoding services, like Google, that may be more accurate but require an API key or payment.