# Processing an AWS S3 bucket

This example requires skale-engine and the AWS SDK for JavaScript (aws-sdk).

In [1]:
var sc = require('skale-engine').context();
var AWS = require('aws-sdk');

undefined

Let's create a readable stream on a JSON file stored in our S3 bucket.

In [4]:
var s3 = new AWS.S3({signatureVersion: 'v4'});
var bucket = s3.getObject({Bucket: 'skale-demo', Key: 'datasets/restaurants-ny.json'}).createReadStream();

undefined

The dataset contains a list of restaurants located in New York. Let's:
- Read the stream line by line, each line being a stringified JSON object
- Parse each line as JSON
- Persist the result in memory

In [5]:
var restaurants = sc.lineStream(bucket).map(line => JSON.parse(line)).persist();

undefined
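
The line-by-line parsing above can be sketched in plain JavaScript on an in-memory string. This is only an illustration of the idea, not how skale-engine implements `lineStream`; the sample records and the helper name `parseJsonLines` are made up for the example.

```javascript
// Hypothetical sample: two lines of newline-delimited JSON (not the real dataset).
const sampleText = [
  '{"name": "Cafe A", "borough": "Bronx"}',
  '{"name": "Cafe B", "borough": "Queens"}',
  ''
].join('\n');

// Split the text into lines, drop empty ones, and parse each line as JSON.
function parseJsonLines(text) {
  return text
    .split('\n')
    .filter(line => line.trim() !== '')
    .map(line => JSON.parse(line));
}

const parsed = parseJsonLines(sampleText);
console.log(parsed.length, parsed[0].borough); // prints: 2 Bronx
```

Unlike this sketch, the stream-based version does not need the whole file in memory before parsing starts.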

# What type of information do we have in our dataset?

Let's display the first restaurant of our dataset.

In [14]:
$$async$$ = restaurants.first().then(data => $$done$$(data[0]));

{ address: 
   { building: '1007',
     coord: [ -73.856077, 40.848447 ],
     street: 'Morris Park Ave',
     zipcode: '10462' },
  borough: 'Bronx',
  cuisine: 'Bakery',
  grades: 
   [ { date: [Object], grade: 'A', score: 2 },
     { date: [Object], grade: 'A', score: 6 },
     { date: [Object], grade: 'A', score: 10 },
     { date: [Object], grade: 'A', score: 9 },
     { date: [Object], grade: 'B', score: 14 } ],
  name: 'Morris Park Bake Shop',
  restaurant_id: '30075445' }

As we can see, each entry of our dataset is a nested JSON object. 
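
As a small sketch of what the nesting means in practice, the record displayed above can be navigated with ordinary property access. The object below is copied from that output, with the `date` fields left out because the output only shows them as `[Object]`. Reading `coord` as longitude followed by latitude is an assumption based on the values.

```javascript
// First record as displayed above (date fields omitted).
const restaurant = {
  address: {
    building: '1007',
    coord: [-73.856077, 40.848447],
    street: 'Morris Park Ave',
    zipcode: '10462'
  },
  borough: 'Bronx',
  cuisine: 'Bakery',
  grades: [
    { grade: 'A', score: 2 },
    { grade: 'A', score: 6 },
    { grade: 'A', score: 10 },
    { grade: 'A', score: 9 },
    { grade: 'B', score: 14 }
  ],
  name: 'Morris Park Bake Shop',
  restaurant_id: '30075445'
};

// Nested fields are reached by chaining property names.
const street = restaurant.address.street;

// Assumption: coord is ordered [longitude, latitude].
const [longitude, latitude] = restaurant.address.coord;

// Arrays of nested objects can be mapped and reduced as usual.
const scores = restaurant.grades.map(g => g.score);
const averageScore = scores.reduce((sum, s) => sum + s, 0) / scores.length;

console.log(street, longitude, latitude, averageScore);
```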

Let's count the number of restaurants located in the Bronx.

In [7]:
$$async$$ = restaurants.filter(restaurant => restaurant.borough == 'Bronx').count().then($$done$$);

2338

Let's now display the list of distinct boroughs and the number of restaurants in each.

In [16]:
$$async$$ = restaurants.map(restaurant => restaurant.borough).countByValue().then($$done$$);

[ [ 'Brooklyn', 6086 ],
  [ 'Manhattan', 10259 ],
  [ 'Staten Island', 969 ],
  [ 'Missing', 51 ],
  [ 'Bronx', 2338 ],
  [ 'Queens', 5656 ] ]
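
To make the shape of that result concrete, here is a minimal plain-JavaScript sketch of counting by value on an in-memory array. It mirrors the `[value, count]` pairs shown above but is not skale-engine's implementation; the helper `countValues` and the sample input are hypothetical.

```javascript
// Count how many times each value occurs and return [value, count] pairs.
function countValues(values) {
  const counts = new Map();
  for (const value of values) {
    counts.set(value, (counts.get(value) || 0) + 1);
  }
  // A Map keeps insertion order, so pairs come out in first-seen order.
  return Array.from(counts.entries());
}

// Hypothetical sample input, not taken from the dataset.
const sampleBoroughs = ['Bronx', 'Queens', 'Bronx', 'Brooklyn', 'Bronx'];
const boroughCounts = countValues(sampleBoroughs);

console.log(boroughCounts);
```

For this sample the pairs are `['Bronx', 3]`, `['Queens', 1]` and `['Brooklyn', 1]`.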

# How many Chinese restaurants are in Brooklyn?

Let's first display the list of cuisines related to Chinese food.

In [17]:
$$async$$ = restaurants
    .map(restaurant => restaurant.cuisine)
    .filter(cuisine => cuisine.search(/chinese/i) != -1)
    .distinct()
    .collect()
    .then($$done$$);

[ 'Chinese', 'Chinese/Cuban', 'Chinese/Japanese' ]
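
The same filter-then-deduplicate pattern can be sketched on a plain array, using a case-insensitive regular expression and a `Set`. The input list below is a hypothetical sample that reuses the cuisine names shown above; it is not read from the dataset.

```javascript
// Hypothetical sample of cuisine labels, with duplicates and unrelated values.
const sampleCuisines = [
  'Chinese',
  'Bakery',
  'Chinese/Cuban',
  'Chinese',
  'American',
  'Chinese/Japanese',
  'Chinese/Cuban'
];

// Keep labels matching /chinese/i, then remove duplicates with a Set.
const chineseCuisines = Array.from(
  new Set(sampleCuisines.filter(cuisine => /chinese/i.test(cuisine)))
);

console.log(chineseCuisines);
```

A `Set` preserves first-seen order, so the result here is `'Chinese'`, `'Chinese/Cuban'`, `'Chinese/Japanese'`.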

We can see that 'Chinese' is a type of cuisine. Let's find out how many Chinese restaurants there are in Brooklyn.

In [18]:
$$async$$ = restaurants
    .filter(restaurant => (restaurant.cuisine == 'Chinese') && (restaurant.borough == 'Brooklyn'))
    .count().then($$done$$);

763

Let's display the name and street of the first 10 Chinese restaurants in Brooklyn.

In [19]:
$$async$$ = restaurants
    .filter(d => (d.cuisine == 'Chinese') && (d.borough == 'Brooklyn'))
    .map(d => [d.name, d.address.street])
    .take(10).then($$done$$);

[ [ 'May May Kitchen', 'Sutter Avenue' ],
  [ 'Golden Pavillion', 'Rutland Road' ],
  [ 'Lee\'S Villa Chinese Restaurant', 'Lawrence Street' ],
  [ 'Kum Kau Kitchen', 'Myrtle Avenue' ],
  [ 'Szechuan Delight Restaurant', '7 Avenue' ],
  [ 'Yen Yen Restaurant', 'Church Avenue' ],
  [ 'Master Wok', 'Kings Plaza Shopping Ct' ],
  [ 'Choy Le Chinese Restaurant', 'Avenue U' ],
  [ 'New Ruan\'S Restaurant', '86 Street' ],
  [ 'Great Wall Restaurant', 'Fort Hamilton Parkway' ] ]

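# What are the five most reviewed Chinese restaurants in Brooklyn?

We use the number of entries in each restaurant's grades array as its review count, and sort the restaurants by that count in descending order.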

In [12]:
$$async$$ = restaurants
    .filter(restaurant => (restaurant.cuisine == 'Chinese') && (restaurant.borough == 'Brooklyn'))
    .map(restaurant => [restaurant.name, restaurant.address.street, restaurant.grades.length])
    .sortBy(data => data[2], false)
    .take(5).then(function(data) {
        // Iterate over the rows directly instead of using for...in on an array
        data.forEach(row => console.log(row[0] + ', ' + row[1], ': ', row[2] + ' reviews'));
        $$done$$();
    });

Noodle Station, 8 Avenue :  9 reviews
Lai Lai Gourmet, 8 Avenue :  9 reviews
New Chung Mee Restaurant, Church Avenue :  8 reviews
New Star Seafood Restaurant, Avenue U :  8 reviews
Mr. Q'S Grill, 8 Avenue :  8 reviews


undefined
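
The ranking step above (sort by a numeric key in descending order, then keep the first few rows) can be sketched in plain JavaScript as follows. The rows and the helper `topBy` are hypothetical and only illustrate the pattern on an in-memory array; they do not come from the dataset or from skale-engine.

```javascript
// Hypothetical rows shaped like [name, street, reviewCount].
const sampleRows = [
  ['Place A', '1 Street', 3],
  ['Place B', '2 Street', 9],
  ['Place C', '3 Street', 5],
  ['Place D', '4 Street', 8]
];

// Return the n items with the largest key, in descending order of key.
function topBy(items, keyFn, n) {
  return items
    .slice()                              // copy so the input is not mutated
    .sort((a, b) => keyFn(b) - keyFn(a))  // descending by key
    .slice(0, n);
}

const top2 = topBy(sampleRows, row => row[2], 2);
top2.forEach(([name, street, count]) =>
  console.log(`${name}, ${street}: ${count} reviews`));
```

Sorting the full array is fine for small inputs. On a large distributed dataset, the sort would normally be left to the engine, as in the cell above.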