## Buffalo Recycles - Data section

The initial dataset was developed by Zerocycle for the City of Buffalo to analyze curbside recycling rates by neighborhood. It is available on data.world: <a href="https://data.world/buffalony/ug79-xatx">Neighborhood Curbside Recycling Rates</a>. The last update was on **December 14, 2018**, so the data is recent enough for the analysis to be meaningful. Updates are made by the Streets & Sanitation division of the Department of Public Works, so we can be reasonably confident that the source is reliable.

The dataset has **624** rows and **5** columns, where each row is a neighborhood's curbside recycling/waste collection statistic for a given date. The columns are:

1. NEIGHBORHOOD (String)
2. DATE (Date&Time)
3. CURBSIDE RECYCLING (IN POUNDS) (Float)
4. CURBSIDE GARBAGE (IN POUNDS) (Float)
5. CURBSIDE RECYCLING RATE (Integer)


Let's show that:

In [1]:
import pandas as pd

In [2]:
buff_df = pd.read_csv('https://query.data.world/s/cvtfgya4cauuuwosm32v75zoffl3dk')
buff_df.head()

Unnamed: 0,NEIGHBORHOOD,DATE,CURBSIDE RECYCLING (IN POUNDS),CURBSIDE GARBAGE (IN POUNDS),CURBSIDE RECYCLING RATE
0,ABBOTT McKINLEY,07/31/2018,49.18,256.09,16
1,ABBOTT McKINLEY,03/31/2018,35.12,182.03,16
2,ABBOTT McKINLEY,10/31/2018,46.98,243.91,16
3,ABBOTT McKINLEY,11/30/2018,42.04,216.08,16
4,ABBOTT McKINLEY,05/31/2018,43.48,232.49,16


In [3]:
buff_df.shape

(624, 5)

In [4]:
buff_df.columns

Index(['NEIGHBORHOOD', 'DATE', 'CURBSIDE RECYCLING (IN POUNDS)',
       'CURBSIDE GARBAGE (IN POUNDS)', 'CURBSIDE RECYCLING RATE'],
      dtype='object')

In [5]:
buff_df.dtypes

NEIGHBORHOOD                       object
DATE                               object
CURBSIDE RECYCLING (IN POUNDS)    float64
CURBSIDE GARBAGE (IN POUNDS)      float64
CURBSIDE RECYCLING RATE             int64
dtype: object

In [6]:
buff_df['NEIGHBORHOOD'].unique()

array(['ABBOTT McKINLEY', 'ALBRIGHT', 'ALLEN', 'BABCOCK', 'BLACK ROCK',
       'BROADWAY FILLMORE', 'BRYANT', 'CAZENOVIA PARK', 'COLD SPRING',
       'COLUMBUS', 'DELAWARE PARK', 'DELAWARE W. FERRY', 'EMERSON',
       'EMSLIE', 'FIRST WARD', 'FOREST', 'FRONT PARK', 'GENESEE MOSELLE',
       'GRANT FERRY', 'GRIDER', 'HAMLIN PARK', 'JOHNSON', 'KAISERTOWN',
       'KENFIELD', 'KENSINGTON', 'KINGSLEY', 'LAKEVIEW', 'LaSALLE',
       'LEROY', 'LOVEJOY', 'MASTEN PARK', 'MEDICAL PARK', 'MILITARY',
       'M.L.K. PARK', 'NORTH DELAWARE', 'NORTH PARK', 'PARK MEADOW',
       'PARKSIDE', 'PERRY', 'RIVERSIDE PARK', 'SCHILLER PARK', 'SENECA',
       'SOUTH ABBOTT', 'SOUTH ELLICOTT', 'SOUTH PARK', 'STARIN CENTRAL',
       'STATE HOSPITAL', 'TIFFT', 'TRIANGLE', 'UNIVERSITY', 'VALLEY',
       'WILLERT PARK'], dtype=object)

The only difference from the assumptions above is that the 'NEIGHBORHOOD' and 'DATE' columns are stored with the `object` dtype, which is how pandas holds the strings it read from the CSV file.
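If the dates were needed later, the `object` column could be parsed into real datetimes. The following is a minimal sketch on a two-row sample that mirrors the MM/DD/YYYY strings shown in the output above; the `sample` DataFrame is illustrative and not part of the original notebook.

```python
import pandas as pd

# Two rows in the same layout as the dataset's DATE column (MM/DD/YYYY strings)
sample = pd.DataFrame({
    "NEIGHBORHOOD": ["ABBOTT McKINLEY", "ABBOTT McKINLEY"],
    "DATE": ["07/31/2018", "03/31/2018"],
})

# Parse the strings into datetime values; an explicit format avoids
# any day/month ambiguity
sample["DATE"] = pd.to_datetime(sample["DATE"], format="%m/%d/%Y")

print(sample.dtypes)
print(sample["DATE"].dt.month.tolist())
```

Once parsed, the column supports date operations such as sorting chronologically or grouping by month.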

For now the date column is not relevant. We can group the data by neighborhood and average the curbside values, so let's reshape the dataset accordingly:

In [7]:
buff_df.drop(columns='DATE', inplace=True)
buff_df.head()

Unnamed: 0,NEIGHBORHOOD,CURBSIDE RECYCLING (IN POUNDS),CURBSIDE GARBAGE (IN POUNDS),CURBSIDE RECYCLING RATE
0,ABBOTT McKINLEY,49.18,256.09,16
1,ABBOTT McKINLEY,35.12,182.03,16
2,ABBOTT McKINLEY,46.98,243.91,16
3,ABBOTT McKINLEY,42.04,216.08,16
4,ABBOTT McKINLEY,43.48,232.49,16


In [8]:
buffG_df = buff_df.groupby(['NEIGHBORHOOD']).mean()
buffG_df.head(10)

Unnamed: 0_level_0,CURBSIDE RECYCLING (IN POUNDS),CURBSIDE GARBAGE (IN POUNDS),CURBSIDE RECYCLING RATE
NEIGHBORHOOD,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ABBOTT McKINLEY,42.654167,215.3575,16.666667
ALBRIGHT,23.645,88.204167,21.5
ALLEN,22.2925,95.519167,18.833333
BABCOCK,19.578333,131.258333,13.0
BLACK ROCK,34.943333,215.675833,13.833333
BROADWAY FILLMORE,26.028333,207.4625,11.25
BRYANT,28.58,139.280833,17.083333
CAZENOVIA PARK,30.0525,153.4075,16.5
COLD SPRING,5.445,37.299167,13.0
COLUMBUS,5.911667,32.0625,15.75


In [9]:
buffG_df.reset_index(inplace=True)
buffG_df.head()

Unnamed: 0,NEIGHBORHOOD,CURBSIDE RECYCLING (IN POUNDS),CURBSIDE GARBAGE (IN POUNDS),CURBSIDE RECYCLING RATE
0,ABBOTT McKINLEY,42.654167,215.3575,16.666667
1,ALBRIGHT,23.645,88.204167,21.5
2,ALLEN,22.2925,95.519167,18.833333
3,BABCOCK,19.578333,131.258333,13.0
4,BLACK ROCK,34.943333,215.675833,13.833333


In [10]:
buffG_df.shape

(52, 4)
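Averaging the rate column gives every row the same weight, regardless of how much material was collected on that date. As a hedged alternative, the sketch below computes a pooled rate from the summed pounds for the five ABBOTT McKINLEY rows shown earlier. It assumes the rate is defined as recycling divided by recycling plus garbage; the source does not state the formula, although the sample rows are consistent with it (49.18 / (49.18 + 256.09) is about 16%). The column name `POOLED RATE (%)` is made up for this example.

```python
import pandas as pd

# The five ABBOTT McKINLEY rows displayed in the first head() output
rows = pd.DataFrame({
    "NEIGHBORHOOD": ["ABBOTT McKINLEY"] * 5,
    "CURBSIDE RECYCLING (IN POUNDS)": [49.18, 35.12, 46.98, 42.04, 43.48],
    "CURBSIDE GARBAGE (IN POUNDS)": [256.09, 182.03, 243.91, 216.08, 232.49],
})

# Sum the pounds per neighborhood instead of averaging the rate column
totals = rows.groupby("NEIGHBORHOOD").sum()
recycling = totals["CURBSIDE RECYCLING (IN POUNDS)"]
garbage = totals["CURBSIDE GARBAGE (IN POUNDS)"]

# ASSUMPTION: rate = recycling / (recycling + garbage), expressed in percent
totals["POOLED RATE (%)"] = 100 * recycling / (recycling + garbage)

print(totals["POOLED RATE (%)"].round(2))
```

For neighborhoods whose collection volume is stable over time the two approaches give similar values, so the simple mean used in this notebook remains a reasonable choice.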

Before using the Foursquare API, we need the latitude and longitude of each neighborhood, which we can get with a **geolocation** library. So let's do that.

In [11]:
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim

Getting the coordinates of Buffalo to have a better idea of what we are looking for:

In [12]:
address = 'Buffalo, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Buffalo are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Buffalo are 42.8867166, -78.8783922.


In [13]:
buffG_df['Latitude']=""
buffG_df['Longitude']=""

The geopy library doesn't recognize some of the neighborhoods. Using a `try` block, we fill in the ones that are recognized; the ones that are not will be filled in manually afterwards.

In [15]:
# Create the geocoder once instead of on every iteration
geolocator = Nominatim(user_agent="ny_explorer")
for ind, val in buffG_df['NEIGHBORHOOD'].items():
    try:
        address = val+',Buffalo, NY'
        location = geolocator.geocode(address)
        # geocode() returns None for unknown places, so the attribute access
        # below raises and the row keeps its empty placeholder
        buffG_df.at[ind, 'Latitude'] = location.latitude
        buffG_df.at[ind, 'Longitude'] = location.longitude
    except Exception:
        pass
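The lookup step can also be written as a small function that takes the geocoder as an argument, which makes the "not found" case explicit and lets the logic be tried without network access. This is a sketch, not the notebook's actual code: `geocode_neighborhoods` and `fake_geocode` are hypothetical names, and the stand-in geocoder only knows the ALLEN coordinates shown in the output of this notebook.

```python
from types import SimpleNamespace
from typing import Callable, Dict, Iterable, Optional, Tuple

Coords = Optional[Tuple[float, float]]


def geocode_neighborhoods(names: Iterable[str], geocode: Callable,
                          city: str = "Buffalo, NY") -> Dict[str, Coords]:
    """Return {name: (lat, lon)}, with None for names that cannot be resolved."""
    results: Dict[str, Coords] = {}
    for name in names:
        location = geocode(f"{name}, {city}")
        # Geocoders such as geopy's Nominatim return None when nothing matches
        results[name] = (location.latitude, location.longitude) if location else None
    return results


def fake_geocode(query: str):
    """Stand-in for a real geocoder so the sketch runs offline."""
    known = {"ALLEN, Buffalo, NY": SimpleNamespace(latitude=42.8994, longitude=-78.8699)}
    return known.get(query)


coords = geocode_neighborhoods(["ALLEN", "M.L.K. PARK"], fake_geocode)
print(coords)
```

In a real run, the callable could be `geolocator.geocode` wrapped in `RateLimiter(..., min_delay_seconds=1)` from `geopy.extra.rate_limiter`, which spaces out the requests in line with Nominatim's usage policy of at most one request per second.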

In [16]:
buffG_df.head()

Unnamed: 0,NEIGHBORHOOD,CURBSIDE RECYCLING (IN POUNDS),CURBSIDE GARBAGE (IN POUNDS),CURBSIDE RECYCLING RATE,Latitude,Longitude
0,ABBOTT McKINLEY,42.654167,215.3575,16.666667,,
1,ALBRIGHT,23.645,88.204167,21.5,42.932,-78.8756
2,ALLEN,22.2925,95.519167,18.833333,42.8994,-78.8699
3,BABCOCK,19.578333,131.258333,13.0,42.8736,-78.8321
4,BLACK ROCK,34.943333,215.675833,13.833333,42.9326,-78.9002


Retrieving the neighborhoods that were not recognized:

In [17]:
for ind,value in buffG_df['Latitude'].items():
    if value=="":
        print(buffG_df['NEIGHBORHOOD'][ind], ind)

ABBOTT McKINLEY 0
BROADWAY FILLMORE 5
COLD SPRING 8
DELAWARE W. FERRY 11
GENESEE MOSELLE 17
GRANT FERRY 18
M.L.K. PARK 30
PARK MEADOW 36


Searching manually for the coordinates of those neighborhoods gives the following:
1. ABBOTT McKINLEY : 42.847, -78.824
2. BROADWAY FILLMORE : 42.89, -78.83
3. COLD SPRING : 42.9161, -78.8575
4. DELAWARE W. FERRY : Not found
5. GENESEE MOSELLE : 42.91 , -78.82
6. GRANT FERRY : 42.92, -78.89
7. M.L.K. PARK : Not found
8. PARK MEADOW : 42.94, -78.87
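Hand-entered coordinates are easy to get wrong, for example by dropping the minus sign on a longitude. A quick sanity check is to measure each point's distance from the Buffalo city center obtained earlier and flag anything implausibly far away. The sketch below uses the haversine formula; the helper names, the 25 km threshold, and the `SIGN SLIP EXAMPLE` entry are assumptions made for illustration.

```python
from math import asin, cos, radians, sin, sqrt

BUFFALO = (42.8867166, -78.8783922)  # city center printed earlier in this notebook


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def suspicious(coords, center=BUFFALO, max_km=25.0):
    """Return the names whose coordinates lie more than max_km from the center."""
    distances = {name: haversine_km(center, point) for name, point in coords.items()}
    return [name for name, km in distances.items() if km > max_km]


manual = {
    "GRANT FERRY": (42.92, -78.89),
    "SIGN SLIP EXAMPLE": (42.92, 78.89),  # hypothetical entry with a lost minus sign
}
flagged = suspicious(manual)
print(flagged)
```

Any name that is flagged is worth re-checking by hand before it goes into the DataFrame.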

Let's add those to the DataFrame. The neighborhoods whose coordinates could not be found will be deleted.

In [14]:
pd.options.mode.chained_assignment = None

In [24]:
# Manually found coordinates, keyed by the row index printed above
manual_coords = {
    0: (42.847, -78.824),    # ABBOTT McKINLEY
    5: (42.89, -78.83),      # BROADWAY FILLMORE
    8: (42.916, -78.8575),   # COLD SPRING
    17: (42.91, -78.82),     # GENESEE MOSELLE
    18: (42.92, -78.89),     # GRANT FERRY
    36: (42.94, -78.87),     # PARK MEADOW
}
for idx, (lat, lng) in manual_coords.items():
    buffG_df.at[idx, 'Latitude'] = lat
    buffG_df.at[idx, 'Longitude'] = lng

In [26]:
# Deleting the rows of the two neighborhoods without coordinates
buffG_df.drop(index=[11, 30], inplace=True)
buffG_df.head(10)

Unnamed: 0,NEIGHBORHOOD,CURBSIDE RECYCLING (IN POUNDS),CURBSIDE GARBAGE (IN POUNDS),CURBSIDE RECYCLING RATE,Latitude,Longitude
0,ABBOTT McKINLEY,42.654167,215.3575,16.666667,42.847,-78.824
1,ALBRIGHT,23.645,88.204167,21.5,42.932,-78.8756
2,ALLEN,22.2925,95.519167,18.833333,42.8994,-78.8699
3,BABCOCK,19.578333,131.258333,13.0,42.8736,-78.8321
4,BLACK ROCK,34.943333,215.675833,13.833333,42.9326,-78.9002
5,BROADWAY FILLMORE,26.028333,207.4625,11.25,42.89,-78.83
6,BRYANT,28.58,139.280833,17.083333,42.909,-78.8803
7,CAZENOVIA PARK,30.0525,153.4075,16.5,42.8446,-78.8026
8,COLD SPRING,5.445,37.299167,13.0,42.916,-78.8575
9,COLUMBUS,5.911667,32.0625,15.75,42.8477,-78.8215


In [27]:
# drop=True discards the old index instead of adding it as an extra column
buffG_df.reset_index(drop=True, inplace=True)

### Foursquare API

In [39]:
import requests

Now let's get the venues near each neighborhood using the Foursquare API. The explore endpoint is well suited for this because it returns both the venues around each neighborhood in Buffalo and their categories. First, there are some functions we need to create:

In [30]:
# The code was removed by Watson Studio for sharing.

Let's create a function to get the category from a venue:

In [31]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # fall back to the alternative column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now a function to get a desired number of venues within a desired radius around each neighborhood in Buffalo:

In [32]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
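The function above indexes `categories[0]` directly, which would raise an `IndexError` for a venue that has no category. The sketch below shows one way the parsing step could be separated out and made tolerant of that case. It relies only on the response layout already used above (`response` → `groups` → `items` → `venue`); the name `extract_venues` and the sample payload are hypothetical, with the venue values taken from the output shown later in this notebook.

```python
def extract_venues(response_json):
    """Return (name, lat, lng, category) tuples from an explore-style response."""
    groups = response_json.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Use None instead of failing when a venue has no category
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows


# Hand-written payload that mimics the structure used above
fake_response = {"response": {"groups": [{"items": [
    {"venue": {"name": "Mulroy Park",
               "location": {"lat": 42.848282, "lng": -78.825299},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Uncategorized place",
               "location": {"lat": 42.85, "lng": -78.82},
               "categories": []}},
]}]}}

venues = extract_venues(fake_response)
print(venues)
```

Keeping the parsing separate from the HTTP call also makes it possible to test it without API credentials.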

Now we define the parameters and call the venues function to obtain a dataframe with the venues and their categories. In this case, up to 100 venues can be retrieved within a 500-meter radius of each neighborhood:

In [40]:
LIMIT=100
radius = 500
buff_venues = getNearbyVenues(names=buffG_df['NEIGHBORHOOD'],
                                   latitudes=buffG_df['Latitude'],
                                   longitudes=buffG_df['Longitude'],
                                   radius=radius
                                  )

ABBOTT McKINLEY
ALBRIGHT
ALLEN
BABCOCK
BLACK ROCK
BROADWAY FILLMORE
BRYANT
CAZENOVIA PARK
COLD SPRING
COLUMBUS
DELAWARE PARK
EMERSON
EMSLIE
FIRST WARD
FOREST
FRONT PARK
GENESEE MOSELLE
GRANT FERRY
GRIDER
HAMLIN PARK
JOHNSON
KAISERTOWN
KENFIELD
KENSINGTON
KINGSLEY
LAKEVIEW
LEROY
LOVEJOY
LaSALLE
MASTEN PARK
MEDICAL PARK
MILITARY
NORTH DELAWARE
NORTH PARK
PARK MEADOW
PARKSIDE
PERRY
RIVERSIDE PARK
SCHILLER PARK
SENECA
SOUTH ABBOTT
SOUTH ELLICOTT
SOUTH PARK
STARIN CENTRAL
STATE HOSPITAL
TIFFT
TRIANGLE
UNIVERSITY
VALLEY
WILLERT PARK


In [41]:
buff_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,ABBOTT McKINLEY,42.847,-78.824,The Nine-Eleven Tavern,42.845277,-78.823668,Bar
1,ABBOTT McKINLEY,42.847,-78.824,7-Eleven,42.845152,-78.824053,Convenience Store
2,ABBOTT McKINLEY,42.847,-78.824,Family Dollar,42.846652,-78.823818,Discount Store
3,ABBOTT McKINLEY,42.847,-78.824,Mulroy Park,42.848282,-78.825299,Park
4,ABBOTT McKINLEY,42.847,-78.824,Molly's Pub,42.849712,-78.823871,Bar


Let's check the size of the data.

In [45]:
print(buff_venues.shape)

(712, 7)
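With 712 venues spread over 50 neighborhoods, that is about 14 venues per neighborhood on average, so it can be useful to see how they are distributed. The following sketch counts venues per neighborhood and per category on the five ABBOTT McKINLEY rows displayed above; the `sample_venues` DataFrame is a hand-copied illustration, not the full result.

```python
import pandas as pd

# The five rows shown in the buff_venues.head() output
sample_venues = pd.DataFrame({
    "Neighborhood": ["ABBOTT McKINLEY"] * 5,
    "Venue": ["The Nine-Eleven Tavern", "7-Eleven", "Family Dollar",
              "Mulroy Park", "Molly's Pub"],
    "Venue Category": ["Bar", "Convenience Store", "Discount Store", "Park", "Bar"],
})

# Number of venues found for each neighborhood
venues_per_neighborhood = sample_venues.groupby("Neighborhood").size()

# Number of venues per category within each neighborhood
category_counts = sample_venues.groupby(["Neighborhood", "Venue Category"]).size()

print(venues_per_neighborhood)
print(category_counts)
```

Applied to the full `buff_venues` DataFrame, the same two calls would show which neighborhoods returned few venues and which categories dominate each one.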


This is the final dataset that will be used in the analysis to find insights into the problem described in the Introduction/Business Problem section, which is available here: <a href="https://github.com/DanSeb04/Data-Science/blob/master/BuffaloRecyclesProject/Business%20Problem.ipynb">Buffalo Recycles Introduction</a>.