# Data Extraction and Loading Backlog

**Goal:** 
 * To use a cloud storage service to host the data lake environment

**Solution:**
 * AWS S3

### Extraction Steps Executed:
**Source**
The dataset is sourced from: 
https://catalog.data.gov/dataset/nyc-jobs

**Format**
The data is stored as a CSV file in an Amazon S3 bucket.

**Scope**
The data includes information about government job postings in NYC



### Extraction Logic:
**Method**
Use AWS S3 to store and access the data by uploading a structured file to the S3 bucket.

**Extraction Process**
Develop a script to access and download the file from the S3 bucket, using the Python Boto3 library to interact with S3.
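
The download step described above could be sketched as follows. This is a minimal sketch, not the project's actual script: the function name `read_csv_from_s3` and the optional `client` parameter are hypothetical, and the standard-library `csv` module stands in for pandas to keep the example small. It relies on the Boto3 `get_object` call, which returns a dictionary whose `Body` entry is a readable byte stream.

```python
import csv
import io


def read_csv_from_s3(bucket, key, client=None):
    """Download one CSV object from S3 and return its rows as dicts."""
    if client is None:
        # Imported lazily so the parsing logic can be exercised without AWS.
        import boto3
        client = boto3.client("s3")

    # get_object returns a dict whose "Body" is a readable byte stream.
    response = client.get_object(Bucket=bucket, Key=key)
    text = response["Body"].read().decode("utf-8")

    # DictReader uses the first CSV line as the column names.
    return list(csv.DictReader(io.StringIO(text)))
```

With valid AWS credentials configured, a call such as `read_csv_from_s3('semistructuredata', 'data1.csv')` would return one dictionary per row, with every value read as a string.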

### Justification for Chosen Method:

**AWS S3**
S3 provides a secure and reliable solution for storing and accessing large datasets. Features such as encryption and access control help manage data effectively.
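
As one illustration of the encryption feature, an upload can explicitly request server-side encryption. The sketch below assumes S3-managed keys (SSE-S3) are sufficient; the helper name `put_encrypted_object` is hypothetical, while `ServerSideEncryption="AES256"` is an existing parameter of the Boto3 `put_object` call. Access control is usually configured separately through IAM or bucket policies and is not shown here.

```python
def put_encrypted_object(client, bucket, key, body):
    """Upload bytes to S3, requesting SSE-S3 (AES-256) encryption at rest."""
    return client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ServerSideEncryption="AES256",  # assumption: S3-managed keys, not KMS
    )
```

The `client` argument is expected to be a Boto3 S3 client, for example the one returned by `boto3.client('s3')`.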

**Data Partitioning**
S3 allows data to be organized into folders (key prefixes) based on criteria such as category and date, which facilitates faster data retrieval and analysis.
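
One way to apply this kind of partitioning is to encode the category and date in the object key itself. The sketch below is an assumption for illustration only: the `key=value` prefix layout and the helper name `build_partitioned_key` are hypothetical and are not taken from the bucket shown in this document.

```python
from datetime import date


def build_partitioned_key(category, posting_date, filename):
    """Build an S3 object key partitioned by category, year, and month."""
    # Normalize the category so it is safe and consistent inside a key.
    slug = category.strip().lower().replace(" ", "-")
    return (
        f"category={slug}/"
        f"year={posting_date:%Y}/month={posting_date:%m}/"
        f"{filename}"
    )
```

For example, `build_partitioned_key("Public Safety", date(2024, 3, 5), "data1.csv")` yields `category=public-safety/year=2024/month=03/data1.csv`, so all objects for one category and month share a common prefix that can be listed or queried together.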

```python
import boto3
import pandas as pd
import requests

# Create the client once so warm Lambda invocations can reuse it.
s3 = boto3.client('s3')


def lambda_handler(event, context):
    bucket = 'semistructuredata'
    url = 'https://data.cityofnewyork.us/resource/kpav-sd4t.json'

    # Request at most 1000 records from the API.
    params = {'$limit': 1000}
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # fail early on an HTTP error
    data = response.json()

    # Load the JSON records into a DataFrame and drop exact duplicate rows.
    df = pd.DataFrame(data)
    df = df.drop_duplicates()

    # Serialize to CSV and upload the encoded bytes to the bucket.
    csv_data = df.to_csv(index=False)
    file_name = 'data1.csv'
    s3.put_object(Bucket=bucket, Key=file_name, Body=csv_data.encode('utf-8'))
    print('Put Complete')
```


## Layout of S3:
![s3 Bucket](./static/s3bucket.png)
## Objects in Bucket:
![objectsInBucket](./static/objectsInBucket.png)
## Creating a Lambda Function to read the S3 bucket:
![Lambda](./static/lambda.png)