# Capstone project -- Opening a restaurant

## Introduction and business problem

### Problem Description

Nowadays, it is difficult to imagine a city without restaurants or other venues where people can have a meal or a drink. The city I have chosen is Taganrog, a leading historic, cultural, and industrial center in the South of Russia. Its economy spans aerospace, machine building, the military sector, iron and steel, farming, and food production, alongside theaters, museums, and one of the major ports on the Sea of Azov. This creates many opportunities for the restaurant business, which in turn leads to high competition.

To survive in such a competitive market, it is very important to find the right location and to take into account many other factors, such as:
* City population
* Sport and Entertainment zones
* Food markets with products of local farmers
* Local competitors and their ratings
* etc

To reduce risk and avoid losing money, all accessible data should be analyzed carefully before choosing a location. In my opinion, even an amazing idea or an ingenious concept will not make a restaurant successful without a suitable place for it.

This project should be of interest to large companies as well as to anyone who wants to open a new restaurant in Taganrog.


## Data description

The city of <b>Taganrog</b> (located in the South of Russia) will be analyzed in this project.

To find the right location, we need to find all existing businesses in the city, explore them carefully to understand what is already there, and plot the venues on a map to reveal insights, patterns, or clusters.

For further analysis we will use the following data sources:
1. Wikipedia page for city population (https://en.wikipedia.org/wiki/Taganrog)
2. Nominatim search engine for OpenStreetMap data to get the bounding box of the city
    (https://nominatim.openstreetmap.org/search?format=json&q=Taganrog&polygon_geojson=1)
3. Foursquare API

Foursquare will be the main data source for the analysis. We will retrieve both the geographical coordinates and additional information about each venue using the Foursquare API.

The following attributes for each venue will be collected:
* Id -- venue id (in order to remove duplicates)
* Venue -- venue name
* Category -- venue category
* Location -- venue address
* Latitude -- venue latitude
* Longitude -- venue longitude
* Rating -- numerical rating of the venue (0 through 10)
* Tips -- total count of tips
* Likes -- the count of users who have liked this venue
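
The attribute list above can be turned into one flat record per venue. The sketch below is a minimal illustration, not the notebook's actual code: it assumes the nested layout of a Foursquare v2 venue object (`categories`, `location`, `tips.count`, `likes.count`), and the helper names `venue_to_row` and `drop_duplicate_venues` are hypothetical. The sample venue is made up.

```python
def venue_to_row(venue):
    """Flatten one venue dict into the attributes listed above.

    Assumes a Foursquare v2-style layout; missing fields become None.
    """
    categories = venue.get('categories') or [{}]
    location = venue.get('location', {})
    return {
        'Id': venue.get('id'),
        'Venue': venue.get('name'),
        'Category': categories[0].get('name'),
        'Location': location.get('address'),
        'Latitude': location.get('lat'),
        'Longitude': location.get('lng'),
        'Rating': venue.get('rating'),
        'Tips': venue.get('tips', {}).get('count'),
        'Likes': venue.get('likes', {}).get('count'),
    }


def drop_duplicate_venues(rows):
    """Keep the first row for each venue Id, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row['Id'] not in seen:
            seen.add(row['Id'])
            unique.append(row)
    return unique


# Illustrative input only: a made-up venue, repeated as if two queries returned it.
sample = {
    'id': 'abc123',
    'name': 'Example Cafe',
    'categories': [{'name': 'Cafe'}],
    'location': {'address': 'Example St. 1', 'lat': 47.2155, 'lng': 38.9293},
    'rating': 7.5,
    'tips': {'count': 4},
    'likes': {'count': 12},
}
rows = drop_duplicate_venues([venue_to_row(sample), venue_to_row(sample)])
print(len(rows), rows[0]['Category'])
```

Using `.get` with defaults keeps the helper tolerant of venues that lack a rating or an address, which would otherwise raise a `KeyError`.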

Because Taganrog has no neighborhood division like cities in the United States, we will build a coordinate grid that covers the entire city with cells of 0.005° x 0.005°. At Taganrog's latitude this is roughly 560 m north-south by 380 m east-west, close to 700 m along the diagonal.
The south-west (sw) and north-east (ne) corners of each cell will be used as input for the <b>search</b> endpoint of the Foursquare API.
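
As a rough sketch of this idea, the snippet below splits a bounding box into cells and builds the `sw`/`ne` query parameters for each one. It is written in plain Python to stay dependency-free. The function names `grid_cells` and `search_params` are hypothetical, the `intent=browse` parameter is an assumption about the Foursquare v2 search endpoint, credentials and the API version are deliberately left out, and the bounding box is made up for illustration.

```python
import math


def grid_cells(south, north, west, east, step=0.005):
    """Return (sw, ne) corner pairs of step x step degree cells covering the box."""
    # Round before ceil so float noise (e.g. 2.0000000001) does not add a cell.
    n_rows = math.ceil(round((north - south) / step, 9))
    n_cols = math.ceil(round((east - west) / step, 9))
    cells = []
    for i in range(n_rows):
        for j in range(n_cols):
            sw = (round(south + i * step, 6), round(west + j * step, 6))
            ne = (round(south + (i + 1) * step, 6), round(west + (j + 1) * step, 6))
            cells.append((sw, ne))
    return cells


def search_params(sw, ne):
    """Build the bounding-box query parameters for one cell."""
    return {
        'intent': 'browse',  # bounding-box search mode (assumed v2 behaviour)
        'sw': f'{sw[0]},{sw[1]}',
        'ne': f'{ne[0]},{ne[1]}',
    }


# Made-up bounding box for illustration, not the real city bounds.
cells = grid_cells(47.20, 47.21, 38.90, 38.915)
print(len(cells))
print(search_params(*cells[0]))
```

Computing each corner from an integer index, rather than repeatedly adding `step`, avoids accumulating floating-point drift across a large grid.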

#### Example of coordinate grid

In [6]:
url_bounds = 'https://nominatim.openstreetmap.org/search?format=json&q=Taganrog, Russia&polygon_geojson=1'
# Nominatim returns the bounding box as [south, north, west, east] strings
bounds = requests.get(url_bounds).json()[0]['boundingbox']
# convert the strings to floats
city_rect = [float(i) for i in bounds]

In [8]:
step = 0.005  # grid cell size in degrees

map_tag = folium.Map(location=[city_rect[0], city_rect[2]], zoom_start=12)

# draw one rectangle per grid cell
for lat in np.arange(city_rect[0], city_rect[1], step):
    for lon in np.arange(city_rect[2], city_rect[3], step):
        folium.Rectangle([[lat, lon], [lat+step, lon+step]], color='red', weight=0.3).add_to(map_tag)

# zoom the map to the city's bounding box
map_tag.fit_bounds([[city_rect[0], city_rect[2]], [city_rect[1], city_rect[3]]])
map_tag

#### Example of data from Foursquare
The first five venues found within a 700-metre bounding box around the city center are shown below:

In [305]:
# first five rows

| | Venue | Latitude | Longitude | Category | Id |
|---|---|---|---|---|---|
| 0 | Площадь перед администрацией города | 47.215733 | 38.92823 | Plaza | 5368f4ad498ea0cb80cef632 |
| 1 | Культ вина | 47.21551 | 38.92931 | Wine Bar | 5c74142e60255e002c1aefbc |
| 2 | Театр имени А. П. Чехова | 47.216325 | 38.928217 | Theater | 4dcbe98a1f6ea1401d49d12a |
| 3 | Администрация Таганрога | 47.215517 | 38.92842 | City Hall | 4da693d90cb66f658708dafc |
| 4 | Л'Этуаль | 47.215416 | 38.929266 | Cosmetics Shop | 4f83002ee4b0b2237e8a6cb1 |


Once the data is collected, the general approach is to cluster the venues in the city and identify which cluster fits best. This may be either the cluster with the most popular food venues or a cluster with similar parameters but the fewest food venues.
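
To make the selection step concrete, here is a minimal sketch of how clusters could be compared once every venue carries a cluster label. It assumes the labels have already been assigned and uses plain Python; the same summary could equally be produced with a pandas `groupby`. The `FOOD_CATEGORIES` set, the `Cluster` field, and the sample venues are all hypothetical.

```python
from collections import defaultdict

# Hypothetical set of categories treated as "food"; adjust to the real data.
FOOD_CATEGORIES = {'Restaurant', 'Cafe', 'Wine Bar', 'Pizza Place'}


def summarize_clusters(venues):
    """Per cluster: total venues, food venues, and mean rating of food venues."""
    stats = defaultdict(lambda: {'venues': 0, 'food': 0, 'ratings': []})
    for v in venues:
        s = stats[v['Cluster']]
        s['venues'] += 1
        if v['Category'] in FOOD_CATEGORIES:
            s['food'] += 1
            if v['Rating'] is not None:
                s['ratings'].append(v['Rating'])
    summary = {}
    for label, s in stats.items():
        ratings = s.pop('ratings')
        s['mean_food_rating'] = sum(ratings) / len(ratings) if ratings else None
        summary[label] = s
    return summary


# Made-up labelled venues, for illustration only.
venues = [
    {'Cluster': 0, 'Category': 'Wine Bar', 'Rating': 8.0},
    {'Cluster': 0, 'Category': 'Restaurant', 'Rating': 7.0},
    {'Cluster': 0, 'Category': 'Theater', 'Rating': None},
    {'Cluster': 1, 'Category': 'Plaza', 'Rating': None},
    {'Cluster': 1, 'Category': 'Cafe', 'Rating': 6.0},
]
summary = summarize_clusters(venues)
busiest = max(summary, key=lambda c: summary[c]['food'])
least_served = min(summary, key=lambda c: summary[c]['food'])
print(busiest, least_served)
```

The two candidates named in the text correspond to `busiest` (most food venues) and `least_served` (fewest food venues); which one is preferable is a business decision rather than something the code can settle.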

## Data collection

Let's import all the necessary packages:

In [None]:
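import numpy as np
import pandas as pd
import json

# geocoding client for OpenStreetMap data
from geopy.geocoders import Nominatim

import requests
# json_normalize is available at the top level since pandas 1.0
from pandas import json_normalize

# import k-means for the clustering stage
from sklearn.cluster import KMeans

# map rendering
import folium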

## Methodology

This section covers cleaning and preparing the previously collected data, exploratory analysis, and clustering. The steps are listed below; each one is described in more detail in the text that follows.

* Cleaning and preparation: the data (especially socioeconomic) needs to be cleaned and prepared:
    * fields renamed;
    * extra symbols removed;
    * NaNs fixed.

* Exploratory analysis:
    * check field distributions and detect abnormalities, such as a percentage exceeding 100%;
    * explore correlations between fields;
    * discard erroneous or unneeded data.
    
* Clustering -- the core machine learning method used in this project. The main goal is to group the venues of Taganrog into clusters and identify the cluster best suited for a new restaurant:
    * determine the optimum number of clusters;
    * perform clustering.
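
One common way to approach the last two steps is the elbow method: fit k-means for several values of k and look for the point where the within-cluster sum of squares stops dropping sharply. The sketch below is an illustration under stated assumptions, not the notebook's actual analysis: the points are synthetic stand-ins for venue features, and the range of k and the final choice of 3 clusters apply only to this synthetic data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic points around three centres; a stand-in for real venue features.
rng = np.random.default_rng(0)
centres = np.array([[47.20, 38.90], [47.22, 38.93], [47.25, 38.88]])
X = np.vstack([c + rng.normal(scale=0.002, size=(50, 2)) for c in centres])

# Fit k-means for several k and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 6))

# Final clustering with the k suggested by the elbow (3 for this synthetic data).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```

With real data the elbow is often less pronounced than here, so a second criterion, such as the silhouette score, may be worth checking before settling on k.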