# Non-relational Bay Area Datasets (ETL-Project)

---

### **Objective:**
#### The purpose of this project is to collaborate as a team to:
* **Extract** - 4 different datasets (in at least 2 formats)
* **Transform** - each dataset, based on its current state
* **Load** - these unrelated datasets as individual collections in a NoSQL database

---

### **Resources:**

#### This repo contains the collaborative work for Team 1. In this repo, you will find the following: ####
#### **Code Folder** ####


Contains each team member's Jupyter Notebook, with the code used to extract and transform that team member's dataset(s):

 - `william_notebook.ipynb` - William's file, which extracts a CSV file containing stats for 150 San Francisco restaurants into a Pandas DataFrame, transforms it (inc. `str.replace` and code to convert longitude and latitude to zipcodes) and finally converts it into a JSON file.
 
 - `SFzips.ipynb` - Caitlin's file, which imports 2x CSV files containing (i) Park and (ii) Neighbourhood information, performs transformation (inc. dropping duplicates and deleting obsolete rows), saves 2x new CSV files and finally converts the contents of both CSV files to a JSON file.
 
 - `Heesung_code_with_comments.ipynb` - Heesung's file, which imports 2x CSV files: (i) Boba Shops in the Bay Area and (ii) Pokemon Go Spawns in the Bay Area. Transformation of the datasets includes (i) removing irrelevant information, (ii) using Geopy to return zipcodes from latitude and longitude and (iii) merging the two datasets using the zipcode as the join.
 
 - `kathryn's.ipynb` - **<INSERT KATHRYN'S INFORMATION HERE>**
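
As an illustration of the kind of transformation these notebooks describe (reading a CSV into a Pandas DataFrame, dropping duplicates, cleaning a column with `str.replace` and converting to JSON), here is a minimal sketch. The sample data, the column names and the `transform` helper are all hypothetical and are not taken from the notebooks themselves.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a raw input CSV.
# The real files in the Inputs folder may use different columns.
RAW_CSV = """name,price,latitude,longitude
Cafe A,$12,37.77,-122.42
Cafe A,$12,37.77,-122.42
Cafe B,$30,37.80,-122.41
"""


def transform(raw_csv: str) -> pd.DataFrame:
    """Read CSV text, drop duplicate rows and clean the price column."""
    df = pd.read_csv(io.StringIO(raw_csv))
    # Remove exact duplicate rows and renumber the index.
    df = df.drop_duplicates().reset_index(drop=True)
    # Strip the literal "$" sign, then convert the column to integers.
    df["price"] = df["price"].str.replace("$", "", regex=False).astype(int)
    return df


clean_df = transform(RAW_CSV)

# One JSON object per row, a shape that loads as one MongoDB document per row.
records_json = clean_df.to_json(orient="records")
```

Passing `orient="records"` gives a JSON array of row objects rather than a column-keyed object, which is usually the more convenient shape for inserting into a document database.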
 
 
 

#### **Inputs Folder** ####

Contains each team member's raw dataset(s), as they stood once extraction was completed but before transformation and loading into the destination database:

**William**
- `restaurant_raw.csv` - a CSV file containing data (inc. longitude and latitude) of 150 Bay Area restaurants

**Caitlin**  
- `SF_Park_Scores.csv` - contains data for San Francisco parks (inc. a score for each park and its longitude and latitude)
- `SFZ.csv`  - contains a list of zipcodes for each neighbourhood in San Francisco

**Heesung**
- `boba.csv` - contains data for Boba merchants in the Bay Area, which includes each location's longitude and latitude.
- `pokemon-spawns.csv` - contains data for Pokemon Go spawns in the Bay Area - includes the longitude and latitude of each spawn location, as well as a unique identifier for each Pokemon

**Kathryn**<BR>
**<INSERT KATHRYN'S INFORMATION HERE>**


#### **Outputs Folder** ####

Contains each team member's output files. These include the JSON files that were subsequently used to form the collections of the final MongoDB database, as well as any intermediate files produced once transformation was completed and before conversion to JSON:

**William**
- `will_data.JSON` - the result of William's data extraction and transformation: a JSON file, loaded as the 'Restaurants' collection of the final MongoDB database.
     
**Caitlin**      
- `Neighborhoods.csv` & `Parks.csv` - the output files of both of Caitlin's datasets, once transformation had been completed on her original input files, which were `SF_Park_Scores.csv` & `SFZ.csv`.
- `sf_parks.json` - the output file, that combines both the `Neighborhoods.csv` & `Parks.csv` into a JSON file.

**Heesung**
- `pika_boba.csv` - contains a CSV of both datasets, once transformation had completed and before conversion into a JSON file.
- `boba_pika.json` - contains the merged dataset data (clean_boba_pika_df) in JSON format, ready for loading to the MongoDB database, as a collection.
- `boba_zip.csv` - contains the boba dataset, once zip codes had been obtained, using the longitude and latitude of each boba shop, with help from the `geocoders` library.
- `pika_zip.csv` - contains the Pokemon dataset, including zip codes for each spawn location.
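
The zip code step behind `boba_zip.csv` and `pika_zip.csv` can be sketched roughly as below. This is not the notebook's actual code: `add_zipcodes` and `stub_lookup` are hypothetical helpers, the coordinates and zip codes are made up, and the lookup is passed in as a function so that the sketch runs offline without calling a geocoding service.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
Lookup = Callable[[float, float], Optional[str]]


def add_zipcodes(rows: List[Row], reverse_lookup: Lookup) -> List[Row]:
    """Return copies of the rows with a 'zipcode' key added from lat/lon."""
    enriched = []
    for row in rows:
        zipcode = reverse_lookup(row["latitude"], row["longitude"])
        # Copy the row so the caller's input is left untouched.
        enriched.append({**row, "zipcode": zipcode})
    return enriched


def stub_lookup(latitude: float, longitude: float) -> Optional[str]:
    """Offline stand-in for a reverse geocoder (made-up values)."""
    known = {(37.0, -122.0): "00001", (38.0, -121.0): "00002"}
    return known.get((latitude, longitude))


shops = [
    {"name": "Shop A", "latitude": 37.0, "longitude": -122.0},
    {"name": "Shop B", "latitude": 39.0, "longitude": -120.0},  # not in the stub
]
shops_with_zip = add_zipcodes(shops, stub_lookup)
```

With Geopy, one plausible real lookup would wrap `Nominatim(user_agent=...).reverse((latitude, longitude))` and read the postcode from the returned location's `raw["address"]`. Whether the notebook does it this way is an assumption. Rows that cannot be resolved get `None`, so they can be filtered out before merging on the zip code.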

**Kathryn**<BR>
**<INSERT KATHRYN'S INFORMATION HERE>**


___

### **The Process:**

The objective of this project was to work as a team, as collaboratively as possible, to walk through the ETL process with at least 2 datasets from different sources (e.g. Kaggle and an API), in at least 2 differing formats (e.g. JSON and CSV).

The obvious approach would have been to have each team member tackle a different stage of the ETL process. However, it was collectively agreed that this approach would not necessarily give the entire team working knowledge of every step of the ETL process.

Consensus was therefore reached, that each team member would go through the ETL process individually, but in parallel. **Each team member would therefore:**

- **Extract** at least 1x dataset
- Perform **transformation**, dependent on the type and state of the data following extraction. The prerequisite was that each team member would present their transformed data in JSON format.
- Individually **load** their JSON-formatted transformed dataset(s) into a personal NoSQL (MongoDB) database, as a sanity check, before...
- Submitting their JSON file (once it had been successfully parsed by MongoDB) to the team member's respective resources file.

Once each team member had successfully extracted, transformed and 'test' loaded their datasets into MongoDB, Kathryn then took the helm of the load element of the project, by uploading each team member's dataset JSON file as individual collections in a master MongoDB database.
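
The final load step can be sketched as follows. This is a hedged illustration rather than the code actually used: `load_collections` and `connect` are hypothetical helpers, and the connection URI and database name are placeholders. The only driver calls assumed are pymongo's `MongoClient` and `Collection.insert_many`.

```python
import json
from pathlib import Path
from typing import Dict


def load_collections(db, json_files: Dict[str, str]) -> Dict[str, int]:
    """Insert each JSON file's records into the named collection of `db`.

    `db` is any object supporting db[name].insert_many(list_of_dicts),
    for example a pymongo Database.
    """
    counts = {}
    for collection_name, path in json_files.items():
        records = json.loads(Path(path).read_text())
        # A file holding a single JSON object becomes one document.
        if isinstance(records, dict):
            records = [records]
        # insert_many rejects an empty list, so skip empty files.
        if records:
            db[collection_name].insert_many(records)
        counts[collection_name] = len(records)
    return counts


def connect(uri: str = "mongodb://localhost:27017", db_name: str = "etl_project"):
    """Return a pymongo Database (URI and name are placeholders)."""
    from pymongo import MongoClient  # imported here so the sketch loads without pymongo

    return MongoClient(uri)[db_name]
```

A call such as `load_collections(connect(), {"Restaurants": "will_data.JSON"})` would then create one collection per team member's file. The returned counts give a quick check that each file was read in full.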