# Capstone Project - The Battle of the Neighborhoods
#### *Applied Data Science Capstone*

## 1. Introduction

### 1.1 Background

In 2015 the UN ranked Vitória as the second-best place to live in Brazil and the ninth-best place to grow old, and seven of the twenty neighborhoods with the highest HDI (Human Development Index) in the country are located there. As the capital of the state of Espírito Santo and its fourth-largest city, Vitória is a sought-after place to live. However, few things are more tedious than searching for a well-located home in an unfamiliar city, because it is hard to tell whether the venues in a given neighborhood resemble the ones you are looking for.

### 1.2 Problem

To make life easier for those arriving in the city, this work aims to create a way to find the lowest-priced properties in neighborhoods that are similar to the one where the user currently lives or would like to live.

### 1.3 Interest

The target audience is people who want to rent a home in the city but lack the local knowledge to choose where. The result of this work could also be readily turned into a service sold directly to customers.

## 2. Data

### 2.1 Data sources

#### 2.1.1 Foursquare API

To achieve this objective, the Foursquare Places API will be used, because it offers comprehensive data and a free plan that meets the project's needs. The Foursquare data makes it possible to find the neighborhoods most similar to the target neighborhood using an unsupervised algorithm.
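To make the idea of "similar neighborhoods" concrete, the sketch below ranks neighborhoods by the Euclidean distance between their venue-category frequency profiles. It is only an illustration, not the clustering step used in this project: the neighborhood names, categories, and frequencies are hypothetical, and a plain distance ranking stands in for the unsupervised algorithm.

```python
from math import sqrt

# Hypothetical venue-category frequencies per neighborhood (illustrative values only)
profiles = {
    "A": {"Restaurant": 0.5, "Bakery": 0.3, "Gym": 0.2},
    "B": {"Restaurant": 0.4, "Bakery": 0.4, "Gym": 0.2},
    "C": {"Beach": 0.7, "Bar": 0.3},
}

def distance(p, q):
    """Euclidean distance between two sparse category-frequency profiles."""
    categories = set(p) | set(q)
    return sqrt(sum((p.get(c, 0.0) - q.get(c, 0.0)) ** 2 for c in categories))

def most_similar(target, profiles):
    """Return the other neighborhoods ordered from most to least similar."""
    others = [name for name in profiles if name != target]
    return sorted(others, key=lambda name: distance(profiles[target], profiles[name]))

ranking = most_similar("A", profiles)
print(ranking)
```

Once such a ranking exists, the cheapest listings can be looked up among the top-ranked neighborhoods only.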

##### 2.1.1.1 Explore endpoint

The API provides several functions through specific endpoints. Thus, it is possible to search for specific locations or even find locations close to a certain point.

For the present work, the "/explore" endpoint will be used to explore the places close to a certain point. This point is passed as a parameter in the request. The parameters needed for the request are:
- ll: Latitude and longitude of the center of the search;
- radius: Radius of the search in meters;
- client_id: Account value used for authentication;
- client_secret: Account value used for authentication;
- limit: Maximum number of results.

Other parameters exist, but only those relevant to this project are described here.

A typical endpoint request has the following format:
```python
'https://api.foursquare.com/v2/venues/explore?ll=XXXX,XXXX&radius=XXX&limit=XXX&client_id=XXXX&client_secret=XXXX'
```
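As a rough sketch of how such a request URL could be assembled from the parameters listed above, the helper below uses only the standard library. The function name `build_explore_url`, the coordinates, and the credentials are placeholders introduced for illustration; the API may require further parameters that are not covered here.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, radius, limit, client_id, client_secret):
    """Assemble the query string from the parameters described in the text."""
    params = {
        "ll": f"{lat},{lng}",      # center of the search
        "radius": radius,          # search radius in meters
        "limit": limit,            # maximum number of results
        "client_id": client_id,
        "client_secret": client_secret,
    }
    # urlencode percent-encodes reserved characters (the comma in ll becomes %2C)
    return f"{BASE_URL}?{urlencode(params)}"

# Placeholder coordinates and credentials, for illustration only
url = build_explore_url(-20.3155, -40.3128, 500, 100, "MY_ID", "MY_SECRET")
print(url)
```

Building the query string with `urlencode` avoids typos in hand-written parameter names and handles escaping automatically.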

As a response, the following items are received:
- groups: An array of objects representing groups of recommendations.

Other fields exist, but only those relevant to this project are described here.
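The sketch below shows one way the `groups` array might be flattened into (venue name, category) pairs. The nested layout (`response` → `groups` → `items` → `venue` → `categories`) is an assumption based on the commonly seen shape of this endpoint's responses, and `sample_response` is a hand-written stand-in rather than real API output.

```python
# Hand-written stand-in for a parsed JSON response (not real API output)
sample_response = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Example Bakery", "categories": [{"name": "Bakery"}]}},
                    {"venue": {"name": "Example Gym", "categories": [{"name": "Gym"}]}},
                ]
            }
        ]
    }
}

def extract_venues(payload):
    """Collect (venue name, first category name) pairs from every group."""
    venues = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item.get("venue", {})
            categories = venue.get("categories", [])
            category = categories[0]["name"] if categories else None
            venues.append((venue.get("name"), category))
    return venues

venues = extract_venues(sample_response)
print(venues)
```

Using `.get` with defaults keeps the function from failing when a field is missing from a response.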

#### 2.1.2 Zap Imóveis

For the rental prices of the properties, the Zap Imóveis real estate website will be used, and the data it provides will be scraped. This makes it possible to gather real estate price information for each of the neighborhoods analyzed and, consequently, to find the neighborhood that best matches the characteristics sought.

The search is done through GET requests, so the site can be used much like an API, with only the returned results needing to be scraped. The image below shows the site with some search results.
<img src='./tela_principal.PNG'/>

All result pages will then be traversed, saving the relevant data for each property. The data chosen was:
- Price;
- Location;
- Number of bedrooms;
- Number of bathrooms;
- Number of parking spaces.

To simplify and reflect the true cost of the property, the condominium fee and one-twelfth of the annual property tax (IPTU) are added to the rental price.
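As a small worked example of this calculation, the function below adds the condominium fee and one-twelfth of the annual property tax to the listed price. The function name `monthly_cost` and the sample amounts are hypothetical.

```python
def monthly_cost(rent, condominium=0.0, annual_property_tax=0.0):
    """Listed price plus condominium fee plus the property tax spread over 12 months."""
    return rent + condominium + annual_property_tax / 12

# Hypothetical listing: 1500 rent, 400 condominium fee, 1200 annual property tax
total = monthly_cost(1500, 400, 1200)
print(total)  # 2000.0
```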

#### 2.1.3 Neighborhoods of the city

To obtain the neighborhoods of the city, a <a href='https://pt.wikipedia.org/wiki/Lista_de_bairros_de_Vit%C3%B3ria'>Wikipedia page</a> that lists all of them was used as the source of their names. Latitude and longitude data were obtained using the Google Maps API.

### 2.2 Data scraping

#### 2.2.1 Zap Imóveis

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd


In [2]:
# Use a list (not a set) so the column order is well defined
data = pd.DataFrame(columns=['neighborhood', 'price', 'bedrooms', 'parking', 'area', 'bathrooms'])

In [3]:
url = 'https://www.zapimoveis.com.br/venda/imoveis/{}/1-quarto/?pagina={}'

In [4]:
def clear_number(value):
    return value.replace('\n','').replace(' ', '').replace('R$', '').replace('.', '').replace('m²','').split('-')[-1]

def clear_address(value):
    ad = value.split(',')
    if(ad[-1] == 'Vitória'):
        r = ad[-2]
    else:
        r = ad[-1]
    return r.strip()
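To illustrate what the two helpers above do, the sketch below repeats them and applies them to a few hypothetical input strings (the real page content may differ). Note that when a space follows the last comma, the comparison with 'Vitória' does not match and the city name itself is returned, which may be why some rows end up labeled "Vitória".

```python
def clear_number(value):
    # Drop line breaks, spaces, currency and area markers, and thousands separators;
    # for ranges such as "45 - 58" keep only the upper bound
    return (value.replace('\n', '').replace(' ', '').replace('R$', '')
                 .replace('.', '').replace('m²', '').split('-')[-1])

def clear_address(value):
    ad = value.split(',')
    if ad[-1] == 'Vitória':
        r = ad[-2]
    else:
        r = ad[-1]
    return r.strip()

# Hypothetical inputs, for illustration only
print(clear_number('\n R$ 1.200 \n'))         # 1200
print(clear_number('45 - 58 m²'))             # 58
print(clear_address('Rua X, Centro'))         # Centro
print(clear_address('Centro,Vitória'))        # Centro
print(clear_address('Centro, Vitória'))       # Vitória (leading space defeats the comparison)
```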

In [5]:
def get_max_pages(number):
    mp = int(clear_number(
        BeautifulSoup(
            requests.get(url.format('es+vitoria', number)).content, 'lxml'
        ).select('.pagination__container li a')[-1].contents[-1]
    ))
    if(number == mp):
        return mp
    else:
        return get_max_pages(mp)

In [6]:
max_page = get_max_pages(1)

for page in range(max_page):
    r = requests.get(url.format('es+vitoria', page + 1))
    soup = BeautifulSoup(r.content, 'lxml')
    elements = soup.select('.simple-card__box')
    print(page + 1)

    for p in elements:
        try:
            price = int(clear_number(p.select('p strong')[0].contents[0]))
        except ValueError:
            continue

        try:
            condominum = int(clear_number(p.select('.condominium .card-price__value')[0].contents[0]))
        except IndexError:
            condominum = 0

        try:
            iptu = int(clear_number(p.select('.iptu .card-price__value')[0].contents[0]))
        except IndexError:
            iptu = 0

        try:
            neighborhood = clear_address(p.select('.simple-card__address')[0].contents[0])
        except IndexError:
            neighborhood = ''

        try:
            area = int(clear_number(p.select('.js-areas span')[-1].contents[0]))
        except IndexError:
            area = None

        try:
            bedrooms = int(clear_number(p.select('.js-bedrooms span')[-1].contents[0]))
        except IndexError:
            bedrooms = None

        try:
            parking = int(clear_number(p.select('.js-parking-spaces span')[-1].contents[0]))
        except IndexError:
            parking = None

        try:
            bathrooms = int(clear_number(p.select('.js-bathrooms span')[-1].contents[0]))
        except IndexError:
            bathrooms = None

        real_price = price + condominum + iptu/12

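        # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
        data = pd.concat([data, pd.DataFrame([{
            'price':real_price, 
            'neighborhood':neighborhood, 
            'area':area, 
            'bedrooms':bedrooms, 
            'parking':parking,
            'bathrooms':bathrooms
        }])], ignore_index=True)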

1
2
3
4
5
6
7
8
9
10


In [7]:
data.shape

(356, 6)

In [8]:
data.head()

| | neighborhood | price | bedrooms | parking | area | bathrooms |
|---|---|---|---|---|---|---|
| 0 | Praia da Costa | 349600.0 | 2 | 1.0 | 58 | 2 |
| 1 | Jardim Camburi | 1199000.0 | 4 | 3.0 | 146 | 6 |
| 2 | Praia do Canto | 2151350.0 | 4 | 2.0 | 210 | 3 |
| 3 | Vitória | 370000.0 | 2 | | 45 | 1 |
| 4 | Enseada do Suá | 520550.0 | 2 | 1.0 | 71 | 2 |


#### 2.2.2 Viva Real

In [9]:
url = 'https://www.vivareal.com.br/venda/espirito-santo/vitoria/apartamento_residencial/?pagina={}#onde=BR-Espirito_Santo-NULL-Vitoria&tipos=apartamento_residencial,casa_residencial,condominio_residencial,cobertura_residencial,flat_residencial,kitnet_residencial,sobrado_residencial,'

In [10]:
def clear_number(value):
    return value.replace('R$', '').replace('.', '').replace('m²','').strip().split('-')[-1]

def clear_address(value):
    ad = value.strip().split(',')
    if(ad[-1].strip() == 'Vitória - ES'):
        r = ad[-2].split('-')[-1]
    else:
        r = ad[-1]
    return r.strip()

In [11]:
def get_max_pages(number):
    mp = int(clear_number(
        BeautifulSoup(
            requests.get(url.format(number)).content, 'lxml'
        ).select('.pagination__item a')[-2].contents[-1]
    ))
    print('Max page:', mp)
    if(number == mp):
        return mp
    else:
        return get_max_pages(mp)

In [12]:
def pageScrap(url, page, results):
    r = requests.get(url.format(page + 1))
    soup = BeautifulSoup(r.content, 'lxml')
    elements = soup.select('.property-card__main-content')
    result = {
        'price':[], 
        'neighborhood':[], 
        'area':[], 
        'bedrooms':[], 
        'parking':[],
        'bathrooms':[]
    }
    for p in elements:
        try:
            price = int(clear_number(p.select('.property-card__price')[-1].contents[-1]))
        except ValueError:
            continue

        try:
            condominum = int(clear_number(p.select('.js-condo-price')[-1].contents[-1]))
        except IndexError:
            condominum = 0
        except ValueError:
            condominum = 0

        try:
            neighborhood = clear_address(p.select('.js-property-card-address')[-1].contents[-1])
        except IndexError:
            continue

        try:
            area = int(clear_number(p.select('.js-property-card-detail-area')[0].contents[0]))
        except IndexError:
            area = None
        except ValueError:
            area = None

        try:
            bedrooms = int(clear_number(p.select('.js-property-detail-rooms span')[0].contents[0]))
        except IndexError:
            bedrooms = None
        except ValueError:
            bedrooms = None

        try:
            parking = int(clear_number(p.select('.js-property-detail-garages span')[0].contents[0]))
        except IndexError:
            parking = None
        except ValueError:
            parking = None

        try:
            bathrooms = int(clear_number(p.select('.js-property-detail-bathroom span')[0].contents[0]))
        except IndexError:
            bathrooms = None
        except ValueError:
            bathrooms = None


        real_price = price + condominum
        
        result['price'] += [real_price]
        result['neighborhood'] += [neighborhood]
        result['area'] += [area]
        result['bedrooms'] += [bedrooms]
        result['parking'] += [parking]
        result['bathrooms'] += [bathrooms]
        
    # Report once per page, after all of its listings have been processed
    print("Page {} is completed".format(page + 1))
        
    results[page] = result

In [13]:
from threading import Thread
import time
max_page = get_max_pages(195)
print('Max page:', max_page)

results = [{} for x in range(max_page)]

def create_threads(url, max_page, results):
    threads = []
    for page in range(max_page):
        # We start one thread per url page.
        process = Thread(target=pageScrap, args=[url, page, results])
        process.start()
        threads.append(process)
        time.sleep(2)

    for process in threads:
        process.join()
        
process = Thread(target=create_threads, args=[url, max_page, results])
process.start()
process.join()

for result_page in results:
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
    data = pd.concat([data, pd.DataFrame(result_page)], ignore_index=True, sort=False)

193
196
199
202
202
202
Max page: 1
Max page: 2
Max page: 3
Max page: 4
Max page: 5
Max page: 7
Max page: 6
Max page: 8
Max page: 9
Max page: 10
Max page: 11
Max page: 12
Max page: 13
Max page: 14
Max page: 15
Max page: 19
Max page: 17
Max page: 16
Max page: 20
Max page: 18
Max page: 21
Max page: 23
Max page: 22
Max page: 25
Max page: 27
Max page: 24
Max page:Max page: 31
 28
Max page: 26
Max page: 30
Max page: 29
Max page: 34
Max page: 32
Max page:Max page: 33
 38
Max page: 36
Max page: 35
Max page: 43
Max page: 37
Max page: 39
Max page:Max page: 42
 46
Max page: 41
Max page: 48
Max page: 44
Max page: 40
Max page: 45
Max page: 54
Max page: 50
Max page: 47
Max page: 51
Max page: 49
Max page: 56
Max page: 61
Max page: 53
Max page:Max page: 55
 52
Max page: 57
Max page: 59
Max page: 58
Max page: 62
Max page: 70
Max page: 60
Max page: 64
Max page: 68
Max page: 63
Max page: 72
Max page: 73
Max page: 65
Max page: 74
Max page: 66
Max page: 67
Max page: 77Max page: 69

Max page: 79
Max page: 
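The cell above starts one thread per page, which is also why the printed lines interleave. As an alternative sketch, a bounded thread pool from the standard library caps the number of simultaneous requests and returns results in page order. Here `fetch_page` is a hypothetical stand-in for a real scraping function such as `pageScrap`, and the worker count is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page):
    # Stand-in for a real scraping function; returns fake data for illustration
    return {"page": page, "price": [100 * page]}

def scrape_all(max_page, workers=4):
    """Scrape pages 1..max_page with at most `workers` concurrent threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input pages
        return list(pool.map(fetch_page, range(1, max_page + 1)))

results = scrape_all(5)
print([r["page"] for r in results])
```

Because each task returns its result instead of writing into a shared list, no pre-allocated `results` list is needed.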

In [14]:
data.shape

(7224, 6)

In [19]:
data.tail()

| | neighborhood | price | bedrooms | parking | area | bathrooms |
|---|---|---|---|---|---|---|
| 7219 | Jardim Camburi | 330334.0 | 2 | 1 | 70 | 2 |
| 7220 | Jardim da Penha | 440680.0 | 3 | 1 | 85 | 2 |
| 7221 | Jardim da Penha | 890430.0 | 4 | 3 | 130 | 3 |
| 7222 | Bento Ferreira | 375450.0 | 2 | 1 | 70 | 2 |
| 7223 | Jardim Camburi | 280320.0 | 3 | 1 | 85 | 1 |


In [20]:
data.to_csv(r'./property.csv', index=False)

In [69]:
data.groupby('neighborhood').count().sort_values('neighborhood')

| neighborhood | price | bedrooms | parking | area | bathrooms |
|---|---|---|---|---|---|
| Balneário de Carapebus | 2 | 2 | 1 | 2 | 0 |
| Barro Vermelho | 415 | 415 | 415 | 415 | 415 |
| Bento Ferreira | 616 | 616 | 614 | 616 | 616 |
| Castanheiras | 1 | 1 | 1 | 1 | 0 |
| Centro | 408 | 408 | 205 | 408 | 408 |
| Enseada do Suá | 206 | 206 | 206 | 206 | 206 |
| Fradinhos | 3 | 3 | 3 | 3 | 3 |
| Goiabeiras | 1 | 1 | 1 | 1 | 1 |
| Ilha de Santa Maria | 1 | 1 | 1 | 1 | 1 |
| Ilha do Boi | 1 | 1 | 1 | 1 | 1 |


In [73]:
data.loc[data['neighborhood'] == 'Santa Lucia', 'neighborhood'] = 'Santa Lúcia'
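The line above fixes one spelling by hand. A more general sketch, assuming that missing accents are the main source of mismatched names, compares accent-stripped, lowercased keys instead. The helper name `normalize_name` is introduced here for illustration only.

```python
import unicodedata

def normalize_name(name):
    """Lowercase a name and strip accents so spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower().strip()

print(normalize_name("Santa Lúcia"))     # santa lucia
print(normalize_name("Enseada do Suá"))  # enseada do sua
```

Comparing `normalize_name` keys on both sides would match "Santa Lucia" with "Santa Lúcia" without a dedicated correction for each neighborhood.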

In [74]:
data.groupby('neighborhood').count().sort_values('neighborhood')

| neighborhood | price | bedrooms | parking | area | bathrooms |
|---|---|---|---|---|---|
| Balneário de Carapebus | 2 | 2 | 1 | 2 | 0 |
| Barro Vermelho | 415 | 415 | 415 | 415 | 415 |
| Bento Ferreira | 616 | 616 | 614 | 616 | 616 |
| Castanheiras | 1 | 1 | 1 | 1 | 0 |
| Centro | 408 | 408 | 205 | 408 | 408 |
| Enseada do Suá | 206 | 206 | 206 | 206 | 206 |
| Fradinhos | 3 | 3 | 3 | 3 | 3 |
| Goiabeiras | 1 | 1 | 1 | 1 | 1 |
| Ilha de Santa Maria | 1 | 1 | 1 | 1 | 1 |
| Ilha do Boi | 1 | 1 | 1 | 1 | 1 |


#### 2.2.3 Neighborhood names

In [29]:
r = requests.get('https://pt.wikipedia.org/wiki/Lista_de_bairros_de_Vit%C3%B3ria')
soup = BeautifulSoup(r.content, 'lxml')
table = soup.select('.sortable')[0]

In [55]:
n = []
for element in table.select('tr')[1:]:
    n += element.select('td')[-2].select('a')

In [67]:
neighborhood_name = [str(neighborhood.contents[0]).lower() for neighborhood in n]

Removing neighborhoods that don't appear on the Wikipedia list

In [80]:
mask = data['neighborhood'].str.lower().isin(neighborhood_name)
data = data[mask]

In [81]:
data.shape

(6883, 6)