**Install MongoDB**: https://docs.mongodb.com/manual/installation/

**Example Data**
* https://docs.mongodb.com/php-library/v1.2/tutorial/example-data/
* https://github.com/ozlerhakan/mongodb-json-files/tree/master/datasets

# MongoDB

* Previously, we learned about SQL (Structured Query Language), the standard language to define, manipulate, extract and manage data in an RDBMS (relational database management system), where data is stored as tables.
* Schema: the structure of the data, i.e. the data model that defines how a database system organizes it (tables, columns, types and relationships).

## Why NoSQL?

NoSQL (Not Only SQL) is not a query language but a family of database systems. It has some advantages compared to SQL (relational) databases:

* **Flexibility**: NoSQL databases are schema-flexible. There are no rigid columns as in tables; data is stored as key-value pairs that can be nested, so no predefined data model is required
* **Scalability**: horizontal scaling is easy, by adding more machines. NoSQL databases are easy to configure as a cluster
* **Agile Adaptability**: simple configuration and a schema-flexible nature make it easy to prototype and iterate frequently; changing the schema is less costly than in an RDBMS
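
To make the flexibility point concrete, here is a minimal sketch in plain Python (no database needed). The two documents and their field names are hypothetical; they only illustrate that documents in the same collection do not have to share the same fields.

```python
# Two hypothetical documents that could live in the same collection:
# they share some fields but not all, and one holds a nested document.
people = [
    {"name": "Ada", "age": 36},
    {
        "name": "Linus",
        "email": "linus@example.com",
        "address": {"city": "Helsinki", "country": "FI"},
    },
]


def field_names(docs):
    """Return the sorted union of top-level field names across documents."""
    names = set()
    for doc in docs:
        names.update(doc.keys())
    return sorted(names)


all_fields = field_names(people)
print(all_fields)
```

In a relational table, adding `email` or `address` would require changing the schema first; here each document simply carries the fields it has.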


## Types of NoSQL DBMS

* Key-value: each value is stored under a unique key, similar to a JSON object or a Python dict
* Document: each key maps to a document (JSON, BSON, XML, or even richer data). Documents can be nested
* Column Family Store: stores data by columns, each addressed by a column key, with rows addressed by a row key (=> more efficient for heavy read requests on large-scale data)
* Graph: the data is represented as nodes and links (each node has a list of properties; each link is a relationship)
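
As a rough illustration of the four models above, the same hypothetical users can be laid out in plain Python structures. This is only a sketch of the data shapes, not of how any real NoSQL engine stores data; all names are made up.

```python
# Key-value: an opaque value looked up by a single key.
key_value = {"user:1": "Ada", "user:2": "Linus"}

# Document: each key maps to a (possibly nested) document.
documents = {
    "user:1": {"name": "Ada", "langs": ["Python"], "address": {"city": "London"}},
    "user:2": {"name": "Linus", "langs": ["C"]},
}

# Column family: values grouped by column, then indexed by row key,
# so reading one column does not touch the others.
columns = {
    "name": {"user:1": "Ada", "user:2": "Linus"},
    "city": {"user:1": "London"},
}

# Graph: nodes with properties, plus links describing relationships.
nodes = {"user:1": {"name": "Ada"}, "user:2": {"name": "Linus"}}
links = [("user:1", "FOLLOWS", "user:2")]


def neighbours(node_id):
    """Return the ids reachable from node_id through one outgoing link."""
    return [dst for src, _, dst in links if src == node_id]


print(neighbours("user:1"))
```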

![nosql-databases-overview.png](./images/nosql-databases-overview.png)


## Load data into MongoDB

```
mongoimport --db vivadata --collection zips --drop --file /Users/danghuynhmaianh/Documents/GitHub/fsds-courses/02-Data-Collection/data/zips.json
```

![mongo-dbs.png](./images/mongo-dbs.png)

## Inspect the Data

1. `show dbs`
2. `use vivadata`
3. `show collections`
4. `db.zips.findOne()`

## Read (Query Data)

![mongodb-find.png](./images/mongodb-find.png)

1. Find a Particular Value: `db.zips.find({city: {$eq: "NEW YORK"} })`
2. Filter a range of numeric values: `db.zips.find({pop: {$gte: 20000, $lte: 50000} })`
3. Show field: `db.zips.find({pop : {$gt: 90000} }, {_id: 0, city: 1, pop: 1})` (Show fields `city`, `pop`)
4. Sorting: `db.zips.find({pop : {$gt: 90000} }, {_id: 0, city: 1, pop: 1}).sort({pop: 1})` (Sort by `pop` ascending; use `-1` for descending)
5. Limit displayed records: `db.zips.find({pop : {$gt: 90000} }, {_id: 0, city: 1, pop: 1}).sort({pop: 1}).limit(3)`
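6. Filter on conditions inside nested documents: `db.students.find({scores: {$elemMatch: {score: {$gt: 90} , type: "exam"}} })` (`score` and `type` are fields of the documents in the `scores` array; `$elemMatch` requires a single array element to satisfy both conditions)
7. Count: `db.students.find({scores: {$elemMatch: {score: {$gt: 90} , type: "exam"}} }).count()` (newer shells deprecate `count()` and recommend `db.students.countDocuments({scores: {$elemMatch: {score: {$gt: 90} , type: "exam"}} })`)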
8. Distinct values: `db.zips.distinct("city")`
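
The `$elemMatch` query in item 6 is the least obvious of the filters above, so here is a plain-Python sketch of its semantics. The two student documents are hypothetical and only mimic the shape of the `students` collection; the function name is made up for illustration.

```python
# Hypothetical documents shaped like the `students` collection.
students = [
    {"name": "a", "scores": [{"type": "exam", "score": 95},
                             {"type": "quiz", "score": 50}]},
    {"name": "b", "scores": [{"type": "exam", "score": 60},
                             {"type": "quiz", "score": 99}]},
]


def elem_match_exam_above(doc, threshold):
    """True if ONE element of doc['scores'] is an exam with score > threshold."""
    return any(
        s["type"] == "exam" and s["score"] > threshold
        for s in doc["scores"]
    )


matched = [d["name"] for d in students if elem_match_exam_above(d, 90)]
print(matched)
```

Student `b` has an exam and a score above 90, but not in the same array element, so it is not matched. That is the difference between `$elemMatch` and two independent conditions on `scores.type` and `scores.score`.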


## `db.collection.aggregate()`

![mongodb-aggregate.png](./images/mongodb-aggregate.png)

![agg-operator.png](./images/agg-operator.png)

1. Sum of population by city

```
db.zips.aggregate([
    { $group: { _id: "$city", total_population: { $sum: "$pop" } } }
])
```

2. Sum by Group, then sort, then limit

```
db.zips.aggregate([
    { $group: { _id: "$city", total_population: { $sum: "$pop" } } },
    { $sort: { total_population: -1 } },
    { $limit: 3 }
])
```

3. Count zip codes per city

```
db.zips.aggregate([
    { $group: { _id: "$city", nb_zipcodes: { $sum: 1 } } },
    { $sort: { nb_zipcodes: -1 } }
])
```

4. Add the set of zip codes to each city group

```
db.zips.aggregate([
    { $group: { _id: "$city", nb_zipcodes: { $sum: 1 }, zipcodes: { $addToSet: "$_id" } } }
])
```

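5. List the 10 most populated cities in NY
* Filter on state = 'NY' with `$match`
* Sum `pop` by city
* Sort by `total_population`, descending
* Keep the top 10 with `$limit`
* Collect the city names into a single array (grouping on `_id: null` is just a trick: it puts every document into one group, so no actual grouping happens)
* Hide `_id` from the output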

```
db.zips.aggregate([
    { $match: { state: { $eq: "NY" } } },
    { $group: { _id: "$city", total_population: { $sum: "$pop" } } },
    { $sort: { total_population: -1 } },
    { $limit: 10 },
    // $push keeps the sorted order; $addToSet does not guarantee any order
    { $group: { _id: null, most_populated_cities: { $push: "$_id" } } },
    { $project: { _id: 0 } }
])
```
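
The stages of this pipeline can be mirrored step by step in plain Python, which may help when reasoning about what each stage receives and returns. This is a sketch on a small hypothetical sample (made-up populations), not a replacement for running the aggregation on the server.

```python
from collections import defaultdict

# Hypothetical sample in the shape of the zips collection.
zips = [
    {"_id": "10001", "city": "NEW YORK", "state": "NY", "pop": 100},
    {"_id": "10002", "city": "NEW YORK", "state": "NY", "pop": 50},
    {"_id": "11201", "city": "BROOKLYN", "state": "NY", "pop": 120},
    {"_id": "14201", "city": "BUFFALO", "state": "NY", "pop": 30},
    {"_id": "60601", "city": "CHICAGO", "state": "IL", "pop": 500},
]


def top_cities(docs, state, n):
    """Return the n most populated cities of a state, largest first."""
    totals = defaultdict(int)
    for doc in docs:
        if doc["state"] == state:              # $match
            totals[doc["city"]] += doc["pop"]  # $group with $sum
    # $sort on total_population, descending
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    # $limit, then collect only the city names into one list
    return [city for city, _ in ranked[:n]]


result = top_cities(zips, "NY", 2)
print(result)
```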

6. City names with more than one word
* Split `city` on " " with `$split`
* Count the elements in `words_city` with `$size`
* Use `$match` to keep only `nb_words_city` > 1
* Group by city, then sort

```
db.zips.aggregate([
    { $project : { words_city : { $split: [ "$city" , " "]},  city : 1}},
    { $project : { nb_words_city : { $size : "$words_city"},  city : 1}},
    { $match : { nb_words_city : { $gt : 1}}},
    { $group : { _id :  "$city" }},
    { $sort : { _id : 1}},
])
```
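
For comparison, the same split, count, filter, group and sort logic can be written in a few lines of plain Python. The city list is a hypothetical sample; the function name is made up for illustration.

```python
# Hypothetical sample of city names, with a duplicate on purpose.
cities = ["NEW YORK", "BRONX", "BELL GARDENS", "NEW YORK", "CHICAGO"]


def multi_word_cities(names):
    """Return the distinct names made of more than one word, sorted."""
    # split(" ") mirrors $split, len() mirrors $size,
    # the set removes duplicates like $group, sorted() mirrors $sort.
    return sorted({name for name in names if len(name.split(" ")) > 1})


multi = multi_word_cities(cities)
print(multi)
```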


# PyMongo

In [1]:
!pip install pymongo

Collecting pymongo
  Downloading pymongo-3.11.4-cp38-cp38-macosx_10_9_x86_64.whl (380 kB)
Installing collected packages: pymongo
Successfully installed pymongo-3.11.4


In [7]:
# Importing the MongoDB client to communicate with our server
from pymongo import MongoClient

In [8]:
# Connecting to the MongoDB server
client = MongoClient('localhost', 27017)

In [11]:
db = client.vivadata # Select the vivadata database
db.list_collection_names() # List the collections in this database

['zips', 'students']

In [12]:
# Select Collection
zips = db.zips 

In [13]:
# Show one document 
zips.find_one()

{'_id': '01010',
 'city': 'BRIMFIELD',
 'loc': [-72.188455, 42.116543],
 'pop': 3706,
 'state': 'MA'}

In [14]:
# List of distinct cities with zip code areas of population greater than 80 000
## Equivalent to db.zips.distinct("city", {pop: {$gt: 80000}}) in the mongo shell
zips.distinct('city', {'pop': {'$gt': 80000} })

['ARLETA',
 'BELL GARDENS',
 'BRONX',
 'BROOKLYN',
 'CHICAGO',
 'FONTANA',
 'JACKSON HEIGHTS',
 'LOS ANGELES',
 'NEW YORK',
 'NORWALK',
 'PHILADELPHIA',
 'RIDGEWOOD',
 'SOUTH GATE',
 'WESTLAND']

In [15]:
# The result is a plain Python list, so we can process the MongoDB output with ordinary Python code
city_list = zips.distinct('city', {'pop': {'$gt': 80000} })
for c in city_list[:3]:
    print(c)

ARLETA
BELL GARDENS
BRONX


In [16]:
# Select zip code areas with population exceeding 90000, but this time displaying only the city name and the population 
high_pop = zips.find({"pop" : {"$gt": 90000} }, {"_id": 0, "city": 1, "pop": 1})
for city in high_pop:
    print("{} - {}".format(city['city'], city['pop']))

NEW YORK - 106564
NEW YORK - 100027
BROOKLYN - 111396
CHICAGO - 92005
CHICAGO - 98612
CHICAGO - 112047
CHICAGO - 91814
CHICAGO - 94317
CHICAGO - 95971
LOS ANGELES - 96074
BELL GARDENS - 99568
NORWALK - 94188
