# __Capstone Project - Final Report__

__In this notebook we explore the city of Prague to find a location for a new gym.__

## __Introduction:__

We want to open a new gym in Prague. For this business to be successful, we must find a good location. We will use Foursquare data to determine which administrative districts of Prague already have gyms, so that we can find the district with the fewest gyms.

## __Business problem:__

Prague is the capital of the Czech Republic.
<br>Population: 1.3 million (2018).
<br>It is the main political, economic and cultural center of the Czech Republic and a major tourist center in Europe.
<br>Moreover, the city is compact: only about 500 square kilometers.


Dense city life encourages bad habits: alcohol, unhealthy food, sedentary work, and little motivation to train at home, all of which harm health. At present, the pandemic forces many people to limit their visits to sports facilities. It will not last forever (I hope), which means that people will return to sports facilities and to playing sports. Now is the time to think about investing in sports, namely about which district of Prague is suitable for a new gym.

__Target audience:__ A new business that wants to open a gym in Prague.

## __Data__

Using data from the Foursquare Places API, we will determine in which district of Prague a new gym could be opened.
The same analysis can be repeated for any other type of business.
We also need:
* A list of the administrative districts of Prague, which serve as neighborhoods. Since we are considering only one city, no further data is required.
* Latitude and longitude coordinates of all districts. They are required for building the map and for querying venue data.
* Gym location data. It will be used for clustering.

We use the following data sources:
* Wikipedia page with the table of districts - [link](https://en.wikipedia.org/wiki/Districts_of_Prague)
* Foursquare - [link](https://developer.foursquare.com/docs/)

## __Methodology__

First, __we need to get a list of districts in Prague, Czech Republic__. The table is on the following Wikipedia page: [Districts of Prague](https://en.wikipedia.org/wiki/Districts_of_Prague).

We will parse the web page using the following Python libraries:
- requests
- Beautiful Soup

We extract the list of administrative districts of Prague and then store it in a dataframe using the pandas library.
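
The extraction step can be sketched roughly as follows. This is only an illustration: it uses the standard-library `html.parser` as a stand-in for Beautiful Soup so that it runs without extra dependencies or network access. The sample HTML, the `FirstColumnParser` class and the `extract_districts` helper are hypothetical, and the real Wikipedia table has more columns and rows.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the downloaded page.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>District</th><th>Note</th></tr>
  <tr><td>Prague 1</td><td>a</td></tr>
  <tr><td>Prague 2</td><td>b</td></tr>
</table>
"""


class FirstColumnParser(HTMLParser):
    """Collect the text of the first <td> cell in every table row."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._cell_index = -1  # index of the current <td> within its row
        self._in_cell = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cell_index = -1  # a new row starts
        elif tag == "td":
            self._cell_index += 1
            self._in_cell = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._in_cell = False
            text = "".join(self._buffer).strip()
            if self._cell_index == 0 and text:
                self.values.append(text)

    def handle_data(self, data):
        if self._in_cell:
            self._buffer.append(data)


def extract_districts(html):
    """Return the first-column values of all data rows in the HTML."""
    parser = FirstColumnParser()
    parser.feed(html)
    return parser.values


districts = extract_districts(SAMPLE_HTML)
print(districts)
```

With the real page, the HTML string would come from `requests.get(url).text`, and the resulting list could be passed to `pandas.DataFrame` to build the dataframe.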


Secondly, __we need to get the coordinates of the administrative districts we found__.
To do this, we use the geocoder library, which provides latitude and longitude for each administrative district.
First, we use the geocoder to find the center of Prague and get the following coordinates: latitude 50.0864234 and longitude 14.4156772. This point serves as a reference for the rest of the data: we need to be sure that the administrative districts are located near the city center.
Then we find the coordinates of all districts of Prague using the ArcGIS geocoding provider available in the geocoder library.
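
The check that geocoded districts lie near the city center could look something like the sketch below. It computes great-circle (haversine) distances from the center coordinates quoted above. The district names, their coordinates and the 30 km threshold are assumptions made for illustration; they are not results from the notebook.

```python
import math

PRAGUE_CENTER = (50.0864234, 14.4156772)  # coordinates quoted in the text
EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def far_from_center(coords, max_km):
    """Return names of districts farther than max_km from the city center."""
    return [name for name, point in coords.items()
            if haversine_km(PRAGUE_CENTER, point) > max_km]


# Hypothetical geocoding results, for illustration only.
sample_coords = {
    "District A": (50.09, 14.42),  # plausible point inside Prague
    "District B": (48.85, 2.35),   # obviously wrong match, far from Prague
}

# The 30 km threshold is an assumed, generous bound for a ~500 km^2 city.
suspicious = far_from_center(sample_coords, max_km=30)
print(suspicious)
```

Any district returned by `far_from_center` would be a candidate for a geocoding error and worth checking by hand before it is placed on the map.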


Let's __create a map of Prague and mark the centers of the administrative districts on it__. For this we need the coordinates and the Folium library. Folium is used for spatial visualization and analysis. It offers a lot of functionality, but we use only two features: rendering a map of real geographic data and placing markers on it by coordinates.

We then __use the Foursquare API to explore and segment the neighborhoods__. We set a limit of 100 results and a radius of 500 meters. In this study, we look at the central part of Prague. The search query is "gym". After receiving the response from the Foursquare API, we build a dataframe from the returned data.
We also create a word cloud from the full list of venue names. The most frequent words are fitness, studio and gym.
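
As an illustration of the query parameters described above, the request URL might be assembled as in the sketch below. The text does not say which endpoint was called, so the legacy v2 `venues/explore` endpoint, the version date and the `build_explore_url` helper are assumptions; the credentials are placeholders. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed legacy v2 endpoint; the report does not name the endpoint used.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lon, client_id, client_secret,
                      query="gym", radius=500, limit=100,
                      version="20180605"):
    """Build a venue search URL around the point (lat, lon)."""
    params = {
        "client_id": client_id,          # placeholder credential
        "client_secret": client_secret,  # placeholder credential
        "v": version,                    # assumed API version date
        "ll": f"{lat},{lon}",            # search center
        "query": query,                  # search term from the text
        "radius": radius,                # meters, as stated in the text
        "limit": limit,                  # maximum results, as stated
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


url = build_explore_url(50.0864234, 14.4156772,
                        "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")

# Decode the query string again to inspect the parameters that were set.
sent_params = parse_qs(urlsplit(url).query)
print(sent_params["query"], sent_params["radius"], sent_params["limit"])
```

In the notebook, one such URL would be built per district center, and the JSON responses would be flattened into the dataframe of venues.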


__Let's start analyzing each district.__
For this we use one-hot encoding. It works like this: each category becomes a column that is set to 1 if that category applies and to 0 otherwise. We also note the number of unique categories in the dataframe: according to Foursquare, there are several subcategories within the Gym / Fitness category. We then remove unnecessary data and group the rows by the "neighborhood" column.
Next, we find the 5 most visited places for each district and combine them into one table.
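
The encoding and grouping steps can be sketched in plain Python as shown below. In pandas, the same idea is usually written with `pd.get_dummies` followed by `groupby(...).mean()`. The venue list is made up for illustration, and ranking categories by how often they occur is an assumption, since the text does not say how "most visited" was measured.

```python
from collections import defaultdict

# Hypothetical (neighborhood, category) pairs standing in for the dataframe.
venues = [
    ("Prague 1", "Gym"),
    ("Prague 1", "Gym"),
    ("Prague 1", "Yoga Studio"),
    ("Prague 2", "Pilates Studio"),
    ("Prague 2", "Gym"),
    ("Prague 2", "Pilates Studio"),
]

# One column per unique category, in a fixed (sorted) order.
categories = sorted({category for _, category in venues})


def one_hot(category):
    """Return a 0/1 vector with a single 1 in the column of `category`."""
    return [1 if category == column else 0 for column in categories]


def category_frequencies(rows):
    """Average the one-hot rows per neighborhood (share of each category)."""
    grouped = defaultdict(list)
    for neighborhood, category in rows:
        grouped[neighborhood].append(one_hot(category))
    return {
        neighborhood: [sum(col) / len(vectors) for col in zip(*vectors)]
        for neighborhood, vectors in grouped.items()
    }


def top_categories(frequencies, n=5):
    """Return up to n categories per neighborhood, most frequent first."""
    result = {}
    for neighborhood, shares in frequencies.items():
        ranked = sorted(zip(categories, shares), key=lambda pair: -pair[1])
        result[neighborhood] = [name for name, share in ranked[:n] if share > 0]
    return result


frequencies = category_frequencies(venues)
top = top_categories(frequencies, n=5)
print(top)
```

The per-neighborhood frequency vectors produced by `category_frequencies` are the kind of numeric features that a clustering step can work on.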


__Finally, let's cluster the data__. The K-Means method will help us with this.
The K-Means clustering algorithm determines k centroids and assigns each data point to the nearest one, choosing the centroids so that the sum of squared distances between the points and their centroids is as small as possible. It is one of the most popular and simplest machine learning algorithms. For this we use the sklearn library. Since we have a small amount of data, we use 4 clusters. We render them on the map using the Folium library. We also display the clustering results, which helps us understand the data and make a decision.
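
To make the assignment and update steps concrete, here is a minimal pure-Python sketch of the basic K-Means loop (Lloyd's algorithm). It is a stand-in for `sklearn.cluster.KMeans`, not the notebook's code: it starts from the first k points instead of a smarter initialization, and the two-dimensional points are invented for illustration.

```python
import math


def kmeans(points, k, iterations=100):
    """Basic Lloyd's algorithm; the first k points are the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(axis) / len(members)
                                for axis in zip(*members)]
    return labels, centroids


# Hypothetical feature vectors, one per district, forming four clear groups.
points = [
    (0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
    (0.1, 0.0), (0.9, 0.1), (0.1, 0.9), (1.0, 0.9),
]

labels, centroids = kmeans(points, k=4)
print(labels)
print(centroids)
```

Each label is the cluster index of the corresponding district. With scikit-learn, a roughly equivalent call would be `KMeans(n_clusters=4).fit(features)`, after which the labels can be joined back to the district table and used to color the map markers.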


## __Results and Discussion__

Our research shows that the number of gyms in Prague is small and that there are districts with few or no gyms quite close to the city center. We found 17 districts with a gym and 5 districts without one within a radius of 500 meters of the center of Prague (latitude 50.0864234, longitude 14.4156772).

The best solution would be to open a new gym in a district that does not have one at all. Prague is divided into 22 districts, and as many as 5 in the city center do not have a gym.
It is also worth considering districts that have only a few gyms, where those gyms are among the most visited places. Since a small number of gyms there serve a large number of interested clients, a new modern gym is likely to attract clients.

There may be a reason for the lack of gyms in these districts despite the lack of competition. Therefore, the recommended districts should only be considered a starting point for a more detailed analysis. A relatively small search radius was also used; a larger radius would return more results.

## __Conclusion__

In this study, we modeled the distribution of gyms in Prague using geographic location data and the Foursquare API. Using K-Means clustering, we were able to identify the most promising districts for a new gym. This can help a budding gym owner choose where to start their business. According to our analysis, the best option would be districts that do not yet have a gym.

This work is for informational purposes only, as much of the relevant data is missing, for example visit counts and gym prices. The study also used a small radius of 500 meters, so the results apply mainly to the central area of Prague.