## 2. Data

### 2.1 Data Sources and Description

The data used in this project comes from three sources. For this analysis, New York City serves as the origin and the City of Toronto as the destination.

#### 2.1.1 New York City Data

Data for New York City was originally downloaded from the NYU Spatial Data Repository and then hosted on the course server for use in the Week 3 laboratory. The “newyork_data.json” file contains information for the city’s five boroughs and 306 neighborhoods, including the latitude and longitude coordinates of each neighborhood. The relevant data sits in the features object: borough and neighborhood values are in each feature’s properties object, and the latitude and longitude coordinates are in its geometry object. This information is read into a pandas DataFrame with columns for “Borough,” “Neighborhood,” “Latitude,” and “Longitude.” See table below for an example.

 <table>
    <tr>
        <th>Borough</th>
        <th>Neighborhood</th>
        <th>Latitude</th>
        <th>Longitude</th>
    </tr>
    <tr>
        <td>Manhattan</td>
        <td>Marble Hill</td>
        <td>40.876551</td>
        <td>-73.910660</td>
    </tr>
    <tr>
        <td>Manhattan</td>
        <td>Chinatown</td>
        <td>40.715618</td>
        <td>-73.994279</td>
    </tr>
    <tr>
        <td>Manhattan</td>
        <td>Washington Heights</td>
        <td>40.851903</td>
        <td>-73.936900</td>
    </tr>
    <tr>
        <td>Manhattan</td>
        <td>Inwood</td>
        <td>40.867684</td>
        <td>-73.921210</td>
    </tr>
    <tr>
        <td>Manhattan</td>
        <td>Hamilton Heights</td>
        <td>40.823604</td>
        <td>-73.949688</td>
    </tr>
</table>
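
As a minimal sketch of the flattening step described above, the snippet below walks the features list and pulls out one row per neighborhood. The key names (`borough`, `name`, `coordinates`) and the inline sample are assumptions made for illustration; the only thing taken from the text is that borough and neighborhood live under properties and the coordinates under geometry. GeoJSON points are stored as longitude first, then latitude.

```python
import json

# Tiny stand-in for newyork_data.json. Key names are assumed, not confirmed.
SAMPLE = json.loads("""
{
  "features": [
    {"properties": {"borough": "Manhattan", "name": "Marble Hill"},
     "geometry": {"coordinates": [-73.910660, 40.876551]}},
    {"properties": {"borough": "Manhattan", "name": "Chinatown"},
     "geometry": {"coordinates": [-73.994279, 40.715618]}}
  ]
}
""")


def features_to_rows(data):
    """Flatten each feature into a Borough/Neighborhood/Latitude/Longitude row."""
    rows = []
    for feature in data["features"]:
        # GeoJSON order is [longitude, latitude].
        longitude, latitude = feature["geometry"]["coordinates"]
        rows.append({
            "Borough": feature["properties"]["borough"],
            "Neighborhood": feature["properties"]["name"],
            "Latitude": latitude,
            "Longitude": longitude,
        })
    return rows


rows = features_to_rows(SAMPLE)
print(rows[0])
```

Passing `rows` to `pandas.DataFrame` would then give a DataFrame with the four columns listed above.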

#### 2.1.2 City of Toronto Data

Collecting data for the City of Toronto is more involved. The raw data for Toronto’s neighborhoods is gathered from the “List of Postal Codes of Canada: M” Wikipedia page. The page is scraped using BeautifulSoup4 with the lxml parser for HTML and XML. Each row of the HTML table with the “wikitable sortable” class is read into a pandas DataFrame that contains columns for “Postcode,” “Borough,” and “Neighborhood.” After creating the DataFrame, the data is wrangled in four steps:
1. Rows whose borough is “Not assigned” are removed.
2. Neighborhoods that share the same postal code are combined into one record, with the neighborhood names separated by commas.
3. If a record has a borough but its neighborhood is “Not assigned,” the neighborhood is given the same value as its borough.
4. Extra newline characters (“\n”) in the Neighborhood column are removed using the replace() method.
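
The four steps above might look roughly like the following in pandas. This is a sketch under assumptions: the sample rows are hypothetical stand-ins for the scraped table, and the newline cleanup is done first here (rather than last) so that the later string comparisons see clean values.

```python
import pandas as pd

# Hypothetical raw rows shaped like the scraped table, newlines left in.
raw = pd.DataFrame(
    [
        ("M1A", "Not assigned", "Not assigned\n"),
        ("M1B", "Scarborough", "Rouge\n"),
        ("M1B", "Scarborough", "Malvern\n"),
        ("M7A", "Queen's Park", "Not assigned\n"),
    ],
    columns=["Postcode", "Borough", "Neighborhood"],
)

df = raw.copy()

# Step 4: strip the extra newline character from neighborhood names.
df["Neighborhood"] = df["Neighborhood"].str.replace("\n", "", regex=False)

# Step 1: drop rows whose borough is "Not assigned".
df = df[df["Borough"] != "Not assigned"].copy()

# Step 3: an unassigned neighborhood takes the name of its borough.
df["Neighborhood"] = df["Neighborhood"].mask(
    df["Neighborhood"] == "Not assigned", df["Borough"]
)

# Step 2: combine neighborhoods that share a postal code into one record.
df = df.groupby(["Postcode", "Borough"], as_index=False)["Neighborhood"].agg(", ".join)

print(df)
```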

See table below for an example of the resulting DataFrame.

<table>
    <tr>
        <th>PostalCode</th>
        <th>Borough</th>
        <th>Neighborhood</th>
    </tr>
    <tr>
        <td>M1B</td>
        <td>Scarborough</td>
        <td>Rouge, Malvern</td>
    </tr>
    <tr>
        <td>M1C</td>
        <td>Scarborough</td>
        <td>Highland Creek, Rouge Hill, Port Union</td>
    </tr>
    <tr>
        <td>M1E</td>
        <td>Scarborough</td>
        <td>Guildwood, Morningside, West Hill</td>
    </tr>
    <tr>
        <td>M1G</td>
        <td>Scarborough</td>
        <td>Woburn</td>
    </tr>
    <tr>
        <td>M1H</td>
        <td>Scarborough</td>
        <td>Cedarbrae</td>
    </tr>
</table>

However, the preceding DataFrame does not contain the necessary latitude and longitude coordinates. This information was provided in an instructor-furnished .csv file, Geospatial_Coordinates.csv. Its data is merged with the original City of Toronto DataFrame. See table below for an example of the resulting DataFrame.

<table>
    <tr>
        <th>PostalCode</th>
        <th>Borough</th>
        <th>Neighborhood</th>
        <th>Latitude</th>
        <th>Longitude</th>
    </tr>
    <tr>
        <td>M1B</td>
        <td>Scarborough</td>
        <td>Rouge, Malvern</td>
        <td>43.806686</td>
        <td>-79.194353</td>
    </tr>
    <tr>
        <td>M1C</td>
        <td>Scarborough</td>
        <td>Highland Creek, Rouge Hill, Port Union</td>
        <td>43.784535</td>
        <td>-79.160497</td>
    </tr>
    <tr>
        <td>M1E</td>
        <td>Scarborough</td>
        <td>Guildwood, Morningside, West Hill</td>
        <td>43.763573</td>
        <td>-79.188711</td>
    </tr>
    <tr>
        <td>M1G</td>
        <td>Scarborough</td>
        <td>Woburn</td>
        <td>43.770992</td>
        <td>-79.216917</td>
    </tr>
    <tr>
        <td>M1H</td>
        <td>Scarborough</td>
        <td>Cedarbrae</td>
        <td>43.773136</td>
        <td>-79.239476</td>
    </tr>
</table>
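
A merge of this kind could be sketched as follows. The inline CSV text stands in for Geospatial_Coordinates.csv, and its header names (in particular `Postal Code`) are assumptions; the real file may label its columns differently.

```python
import io

import pandas as pd

toronto = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
})

# Stand-in for Geospatial_Coordinates.csv; header names are assumed.
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
"""
coords = pd.read_csv(io.StringIO(csv_text))

# A left join keeps every postal code from the Toronto DataFrame,
# even if the coordinates file has no entry for it.
merged = toronto.merge(
    coords, left_on="PostalCode", right_on="Postal Code", how="left"
)

# The join key from the coordinates file is now redundant.
merged = merged.drop(columns="Postal Code")

print(merged)
```

A left join is used here so that a postal code missing from the coordinates file would show up as a row with empty coordinates rather than silently disappearing.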

Finally, the PostalCode column is removed, giving the DataFrame the same structure as the one for New York City. See table below for an example of the resulting DataFrame.

<table>
    <tr>
        <th>Borough</th>
        <th>Neighborhood</th>
        <th>Latitude</th>
        <th>Longitude</th>
    </tr>
    <tr>
        <td>Scarborough</td>
        <td>Rouge, Malvern</td>
        <td>43.806686</td>
        <td>-79.194353</td>
    </tr>
    <tr>
        <td>Scarborough</td>
        <td>Highland Creek, Rouge Hill, Port Union</td>
        <td>43.784535</td>
        <td>-79.160497</td>
    </tr>
    <tr>
        <td>Scarborough</td>
        <td>Guildwood, Morningside, West Hill</td>
        <td>43.763573</td>
        <td>-79.188711</td>
    </tr>
    <tr>
        <td>Scarborough</td>
        <td>Woburn</td>
        <td>43.770992</td>
        <td>-79.216917</td>
    </tr>
    <tr>
        <td>Scarborough</td>
        <td>Cedarbrae</td>
        <td>43.773136</td>
        <td>-79.239476</td>
    </tr>
</table>

#### 2.1.3 FourSquare Data

The FourSquare API is used to explore neighborhoods in New York City and the City of Toronto. The resulting JSON responses provide the following information about venues in each neighborhood:
- Venue Name (name)
- Category Name (categories)
- Venue Latitude (lat)
- Venue Longitude (lng)

The data is cleaned and stored in a DataFrame. See table below for an example.

<table>
    <tr>
        <th>name</th>
        <th>categories</th>
        <th>lat</th>
        <th>lng</th>
    </tr>
    <tr>
        <td>Eagle's Nest Golf Club</td>
        <td>Golf Course</td>
        <td>43.805455</td>
        <td>-79.364186</td>
    </tr>
    <tr>
        <td>AY Jackson Pool</td>
        <td>Pool</td>
        <td>43.804515</td>
        <td>-79.366138</td>
    </tr>
    <tr>
        <td>Villa Madina</td>
        <td>Mediterranean Restaurant</td>
        <td>43.801685</td>
        <td>-79.363938</td>
    </tr>
    <tr>
        <td>Duncan Creek Park</td>
        <td>Dog Run</td>
        <td>43.805539</td>
        <td>-79.360695</td>
    </tr>
</table>
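
The cleaning step could be sketched as below. The nesting of the sample payload (`response`, `groups`, `items`, `venue`, `location`) is an assumption about the shape of the API response, and the payload itself is a trimmed, hypothetical example built from the rows shown above. No network call is made.

```python
# Hypothetical, trimmed-down response; the nesting is an assumed shape.
SAMPLE_RESPONSE = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Eagle's Nest Golf Club",
                               "categories": [{"name": "Golf Course"}],
                               "location": {"lat": 43.805455, "lng": -79.364186}}},
                    {"venue": {"name": "AY Jackson Pool",
                               "categories": [{"name": "Pool"}],
                               "location": {"lat": 43.804515, "lng": -79.366138}}},
                ]
            }
        ]
    }
}


def venues_to_rows(payload):
    """Flatten the nested venue records into name/categories/lat/lng rows."""
    rows = []
    for group in payload["response"]["groups"]:
        for item in group["items"]:
            venue = item["venue"]
            categories = venue.get("categories") or []
            rows.append({
                "name": venue["name"],
                # Keep only the first category name; None if there is none.
                "categories": categories[0]["name"] if categories else None,
                "lat": venue["location"]["lat"],
                "lng": venue["location"]["lng"],
            })
    return rows


venue_rows = venues_to_rows(SAMPLE_RESPONSE)
print(venue_rows)
```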

This information is then merged with the DataFrames for New York City and the City of Toronto. The resulting DataFrames provide columns for each city’s neighborhoods, neighborhood latitudes, neighborhood longitudes, venues, venue latitudes, venue longitudes, and venue categories. See table below for an example.

<table>
    <tr>
        <th>Neighborhood</th>
        <th>Neighborhood Latitude</th>
        <th>Neighborhood Longitude</th>
        <th>Venue</th>
        <th>Venue Latitude</th>
        <th>Venue Longitude</th>
        <th>Category</th>
    </tr>
    <tr>
        <td>Hillcrest Village</td>
        <td>43.803762</td>
        <td>-79.363452</td>
        <td>Eagle's Nest Golf Club</td>
        <td>43.805455</td>
        <td>-79.364186</td>
        <td>Golf Course</td>
    </tr>
    <tr>
        <td>Hillcrest Village</td>
        <td>43.803762</td>
        <td>-79.363452</td>
        <td>AY Jackson Pool</td>
        <td>43.804515</td>
        <td>-79.366138</td>
        <td>Pool</td>
    </tr>
    <tr>
        <td>Hillcrest Village</td>
        <td>43.803762</td>
        <td>-79.363452</td>
        <td>Villa Madina</td>
        <td>43.801685</td>
        <td>-79.363938</td>
        <td>Mediterranean Restaurant</td>
    </tr>
    <tr>
        <td>Hillcrest Village</td>
        <td>43.803762</td>
        <td>-79.363452</td>
        <td>Duncan Creek Park</td>
        <td>43.805539</td>
        <td>-79.360695</td>
        <td>Dog Run</td>
    </tr>
    <tr>
        <td>Fairview, Henry Farm, Oriole</td>
        <td>43.778517</td>
        <td>-79.346556</td>
        <td>The LEGO Store</td>
        <td>43.778207</td>
        <td>-79.360695</td>
        <td>Toy / Game Store</td>
    </tr>
</table>

### 2.2 How the Data Will Be Used to Solve the Problem

Data pertaining to the boroughs of New York City and the City of Toronto will be used to identify their respective neighborhoods. Neighborhood coordinates will be used to locate them on a geographical map. The map of New York City will identify the location of the neighborhood that serves as the starting point for the subject’s relocation. The initial map of the City of Toronto will identify locations of potential neighborhoods in which the subject may reside. Eventually, the City of Toronto map will display color-coded neighborhoods, identifying those that meet the subject’s requirements.

Neighborhood data will be combined with FourSquare’s venue data to:
1. associate venues with a specific neighborhood,
2. determine the number of unique categories in each neighborhood,
3. gather a specified number of top venues within a given radius of a neighborhood, and
4. determine the number of venues returned for each neighborhood.
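
Two of the items above, the number of venues and the number of unique categories per neighborhood, can be sketched with a single groupby. The rows are taken from the combined table in section 2.1.3; the output column names `venue_count` and `unique_categories` are hypothetical labels chosen for this example.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["Hillcrest Village"] * 4 + ["Fairview, Henry Farm, Oriole"],
    "Venue": [
        "Eagle's Nest Golf Club",
        "AY Jackson Pool",
        "Villa Madina",
        "Duncan Creek Park",
        "The LEGO Store",
    ],
    "Category": [
        "Golf Course",
        "Pool",
        "Mediterranean Restaurant",
        "Dog Run",
        "Toy / Game Store",
    ],
})

# One row per neighborhood: how many venues came back,
# and how many distinct categories they cover.
summary = venues.groupby("Neighborhood").agg(
    venue_count=("Venue", "count"),
    unique_categories=("Category", "nunique"),
)

print(summary)
```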

The collected data will then be processed to determine the mean frequency of occurrence of each category in each neighborhood. The five most common venues for each neighborhood will be identified and sorted in descending order of frequency. Finally, a DataFrame will be created to store the 10 most common venues for each neighborhood. The data from this DataFrame will be processed to cluster the neighborhoods, allowing for comparison between the baseline neighborhood in New York City and those in the City of Toronto.
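
One way to compute per-neighborhood category frequencies and rank them is to one-hot encode the categories and average within each neighborhood. The sketch below assumes that approach; the neighborhood labels `A` and `B`, the sample rows, and `TOP_N` are hypothetical, and the clustering step itself is not shown.

```python
import pandas as pd

# Hypothetical venue rows: one row per venue returned for a neighborhood.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "Category": ["Pool", "Pool", "Dog Run", "Golf Course", "Golf Course", "Pool"],
})

TOP_N = 2  # would be 5 or 10 in the analysis described above

# One indicator column per category, one row per venue.
onehot = pd.get_dummies(venues["Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of the indicators is the share of a neighborhood's venues
# that fall into each category.
freq = onehot.groupby("Neighborhood").mean()

# Rank categories within each neighborhood, most frequent first.
top = {
    name: row.nlargest(TOP_N).index.tolist()
    for name, row in freq.iterrows()
}

print(freq)
print(top)
```

The `freq` table has one row per neighborhood and one numeric column per category, which is the kind of input a clustering algorithm can work with directly.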