# ETL Project

This project extracted several data sets that share zipcode and/or latitude and longitude fields. The data sets were then transformed to JSON format so they could be loaded into a non-relational database for analysis. This portion of the project covers the extraction and transformation of two of those data sets.

### Data Sets Used:

- https://www.kaggle.com/savargaonkar/sf-zipcodes-limited, a list of neighborhoods and their zipcodes in San Francisco
- https://www.kaggle.com/danofer/sf-parks, a list of parks and their locations, features, and quarterly park evaluation scores

## Extract

Both data sets were CSV files that were read in as pandas data frames.

In [1]:
#Dependencies
import pandas as pd
import requests
import json
import pymongo
import numpy as np

### Neighborhoods

In [2]:
#Read file
neighborhoods = "Resources/SFZ.csv"
#Created data frame
nbh_df = pd.read_csv(neighborhoods)
nbh_df

Unnamed: 0,Zipcode,Neighborhood,Unnamed: 2,Unnamed: 3
0,94102,Hayes Valley,,
1,94102,Tenderloin,,
2,94102,North of Market,,
3,94103,South of Market,,
4,94107,Potrero Hill,,
...,...,...,...,...
61,94132,Lake Merced,,
62,94133,North Beach,,
63,94133,Chinatown,,
64,94134,Visitacion Valley,,


### Parks Scores

In [3]:
#Read file
parks = "Resources/SF_Park_Scores.csv"
#Created data frame    
parks_df = pd.read_csv(parks)
parks_df.head()

Unnamed: 0,ParkID,PSA,Park,FQ,Score,Facility Type,Facility Name,Address,State,Zipcode,Floor Count,Square Feet,Perimeter Length,Acres,Longitude,Latitude
0,86,PSA4,Carl Larsen Park,FY05Q3,0.795,Basketball Court,Ocean View Basketball Courts,Capitol & Montana St,CA,94112.0,,5572.020314,311.982228,0.127916,-122.456708,37.716335
1,13,PSA4,Junipero Serra Playground,FY05Q3,0.957,Ball Field,Glen ball fields,Diamond & Farnum Street,CA,94131.0,,124520.486259,1891.675445,2.858608,-122.440592,37.736008
2,9,PSA4,Rolph Nicol Playground,FY05Q3,0.864,Dog Play Area,Douglass dog play area,26th & Douglass Street,CA,94114.0,,70655.337234,1153.019646,1.62203,-122.438895,37.746741
3,117,PSA2,Alamo Square,FY05Q4,0.857,Restroom,Gilman Bathrooms,Gilman Ave & Griffith,CA,94124.0,,378.668603,94.257319,0.008693,-122.388772,37.717179
4,60,PSA6,Jose Coronado Playground,FY05Q4,0.859,Basketball Court,GGP1 Panhandle Basketball Courts,Stanyan & Great Hwy,CA,94117.0,,4645.553645,279.465313,0.106648,-122.44838,37.772304


## Transform

The data sets both contained some unnecessary and hard-to-read information. The transformation process consisted of cleaning it for clear, functional analysis.

### Steps taken to transform data:

- Deleted unnecessary columns
- Dropped duplicate rows
- Dropped rows with no zip/lat/long data
- Formatted columns for clarity
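The steps above can be sketched on a small in-memory frame. The sample rows and the `clean` helper are hypothetical and only illustrate the sequence; the column names mirror those in the neighborhoods data set. Note that this sketch drops rows missing a zipcode only, which is an assumption and narrower than the `dropna()` call used on the parks data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real data comes from the CSV files above.
raw = pd.DataFrame(
    {
        "Zipcode": [94102.0, 94102.0, np.nan, 94117.0],
        "Neighborhood": ["Tenderloin", "Tenderloin", "Unknown", "Haight"],
        "Unnamed: 2": [np.nan, np.nan, np.nan, np.nan],
    }
)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four transformation steps and return a new frame."""
    out = df.drop(columns=["Unnamed: 2"])   # 1. delete unnecessary columns
    out = out.drop_duplicates()             # 2. drop duplicate rows
    out = out.dropna(subset=["Zipcode"])    # 3. drop rows with no zipcode
    # 4. format for clarity: zipcodes as integers instead of floats
    out = out.assign(Zipcode=out["Zipcode"].astype(int))
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Returning a new frame from each step, rather than deleting columns in place, leaves the raw data untouched so the cell can be re-run safely.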

### Neighborhood

In [4]:
#Cleaned neighborhood dataframe columns
    #Deleted unnecessary columns
del nbh_df['Unnamed: 2']
del nbh_df['Unnamed: 3']

    #Dropped duplicate rows
nbh_df = nbh_df.drop_duplicates()

nbh_df

Unnamed: 0,Zipcode,Neighborhood
0,94102,Hayes Valley
1,94102,Tenderloin
2,94102,North of Market
3,94103,South of Market
4,94107,Potrero Hill
5,94108,Chinatown
6,94109,Polk
7,94109,Russian Hill (Nob Hill)
8,94110,Inner Mission
9,94110,Bernal Heights


### Parks

In [5]:
#Cleaned parks dataframe columns
    #Deleted unnecessary columns
del parks_df['PSA']
del parks_df['FQ']
del parks_df['Floor Count']

    #Dropped duplicate rows and rows with any missing value (including zipcode/lat/long)
parks_df = parks_df.drop_duplicates().dropna()

    #Formatted columns (the formatted numeric columns become strings)
parks_df["Zipcode"] = parks_df["Zipcode"].astype(int)
parks_df["Square Feet"] = parks_df["Square Feet"].astype(float).map("{:,.2f}".format)
parks_df["Perimeter Length"] = parks_df["Perimeter Length"].astype(float).map("{:,.2f}".format)
parks_df["Acres"] = parks_df["Acres"].astype(float).map("{:,.2f}".format)

parks_df = parks_df.rename(columns={"Perimeter Length":"Perimeter Length(ft)"})

parks_df.head()

Unnamed: 0,ParkID,Park,Score,Facility Type,Facility Name,Address,State,Zipcode,Square Feet,Perimeter Length(ft),Acres,Longitude,Latitude
0,86,Carl Larsen Park,0.795,Basketball Court,Ocean View Basketball Courts,Capitol & Montana St,CA,94112,5572.02,311.98,0.13,-122.456708,37.716335
1,13,Junipero Serra Playground,0.957,Ball Field,Glen ball fields,Diamond & Farnum Street,CA,94131,124520.49,1891.68,2.86,-122.440592,37.736008
2,9,Rolph Nicol Playground,0.864,Dog Play Area,Douglass dog play area,26th & Douglass Street,CA,94114,70655.34,1153.02,1.62,-122.438895,37.746741
3,117,Alamo Square,0.857,Restroom,Gilman Bathrooms,Gilman Ave & Griffith,CA,94124,378.67,94.26,0.01,-122.388772,37.717179
4,60,Jose Coronado Playground,0.859,Basketball Court,GGP1 Panhandle Basketball Courts,Stanyan & Great Hwy,CA,94117,4645.55,279.47,0.11,-122.44838,37.772304
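One detail of the formatting step above is worth noting: mapping `"{:,.2f}".format` over a column yields strings rather than numbers. If the values are needed for numeric analysis later, rounding is one alternative. A minimal sketch with hypothetical sample values:

```python
import pandas as pd

# Hypothetical sample values, for illustration only.
acres = pd.Series([0.127916, 2.858608, 1.62203], name="Acres")

# String formatting: readable, but each value becomes a str.
as_text = acres.map("{:,.2f}".format)

# Rounding: values stay floats and can still be summed or compared.
as_number = acres.round(2)

print(type(as_text.iloc[0]).__name__)    # str
print(type(as_number.iloc[0]).__name__)  # float64
```

Which form is preferable depends on whether the database consumers need display-ready text or numeric fields they can query and aggregate.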


## Load

This project loaded the transformed data into a non-relational MongoDB database, alongside the other transformed data sets. New CSV and JSON files were created, and a connection to a MongoDB database was established so the data could be loaded for analysis.

(This step is commented out because the database and load portion was completed in a separate notebook, along with the other transformed data sets.)

In [6]:
#Saved to new csv file
# nbh_df.to_csv("Output/Neighborhoods.csv", index=False, header=True)
# parks_df.to_csv("Output/Parks.csv", index=False, header=True)

In [7]:
#Saved to json file
# nbh_df.to_json('Output/Neighborhoods.json')
# parks_df.to_json('Output/Parks.json')

In [8]:
# #Connect to mongo db and create database
# conn = 'mongodb://localhost:27017'
# client = pymongo.MongoClient(conn)

# # Define the 'locations_mdb' database in Mongo
# db = client.locations_mdb

In [9]:
# collection_parks = db['parks']
# parks_df = pd.read_csv ('../Outputs/Parks.csv')
# parks_dict=parks_df.to_dict('index')

# dumps=json.dumps(parks_dict)
# load=json.loads(dumps)
# collection_parks.insert_one(load)

In [10]:
# collection_neighborhoods = db['neighborhoods']
# neighborhoods_df = pd.read_csv ('../Outputs/Neighborhoods.csv')
# neighborhoods_dict=neighborhoods_df.to_dict('index')

# dumps=json.dumps(neighborhoods_dict)
# load=json.loads(dumps)
# collection_neighborhoods.insert_one(load)
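
The commented-out cells above insert each data frame as a single document keyed by row index. A common alternative is to store one document per row with `insert_many`. The following is a sketch under stated assumptions, not the method used in this project: it assumes a local MongoDB instance and reuses the database and collection names from the cells above, `frame_to_documents` is a hypothetical helper, and the sample rows are made up. The database call is left commented out so the block runs without a server.

```python
import json

import pandas as pd


def frame_to_documents(df: pd.DataFrame) -> list:
    """Return one JSON-safe dict per row (hypothetical helper)."""
    # to_json converts NumPy types and NaN to plain JSON values;
    # json.loads turns that text back into a list of Python dicts.
    return json.loads(df.to_json(orient="records"))


# Hypothetical sample rows standing in for the cleaned parks frame.
sample = pd.DataFrame(
    {"Park": ["Alamo Square", "Carl Larsen Park"], "Zipcode": [94124, 94112]}
)
documents = frame_to_documents(sample)
print(documents)

# With a running MongoDB server, the documents could then be loaded with:
# import pymongo
# client = pymongo.MongoClient("mongodb://localhost:27017")
# client.locations_mdb["parks"].insert_many(documents)
```

Storing one document per row lets each park or neighborhood be queried and indexed individually, for example by `Zipcode`.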