<a href="https://colab.research.google.com/github/SawsanYusuf/Air-Quality-in-China/blob/main/1_data_wrangling_with_mongodb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font size="+3"><strong>Data Wrangling with MongoDB</strong></font>

In [2]:
from pprint import PrettyPrinter
import pandas as pd
from pymongo import MongoClient

**Step 1:** Instantiate a `PrettyPrinter` and assign it to the variable `pp`.

In [3]:
pp = PrettyPrinter(indent=2)

# Prepare Data

## Connect

**Step 2:** Create a client that connects to the database running at `localhost` on port `27017`.

In [None]:
client = MongoClient(host="localhost", port=27017)

## Explore

**Step 3:** Print a list of the databases available on `client`.

In [5]:
pp.pprint(list(client.list_databases()))

[ {'empty': False, 'name': 'Air-Quality', 'sizeOnDisk': 8118272},
  {'empty': False, 'name': 'sample_airbnb', 'sizeOnDisk': 55005184},
  {'empty': False, 'name': 'sample_analytics', 'sizeOnDisk': 9592832},
  {'empty': False, 'name': 'sample_geospatial', 'sizeOnDisk': 1404928},
  {'empty': False, 'name': 'sample_restaurants', 'sizeOnDisk': 6901760},
  {'empty': False, 'name': 'sample_weatherdata', 'sizeOnDisk': 2895872},
  {'empty': False, 'name': 'admin', 'sizeOnDisk': 344064},
  {'empty': False, 'name': 'local', 'sizeOnDisk': 25189883904}]


**Step 4:** Assign the `"Air-Quality"` database to the variable `db`.

In [6]:
db = client["Air-Quality"]

**Step 5:** Use the `list_collections` method to print a list of the collections available in `db`.

In [7]:
for c in db.list_collections():
    print(c["name"])

beijing
mumbai
Delhi


**Step 6:** Assign the `"beijing"` collection in `db` to the variable `beijing`.

In [8]:
beijing = db["beijing"]

**Step 7:** Use the `count_documents` method to see how many documents are in the beijing
collection.

In [9]:
beijing.count_documents({})

116027

**Step 8:** Use the `find_one` method to retrieve one document from the `beijing` collection, and
assign it to the variable `result`.

In [10]:
result = beijing.find_one({})
pp.pprint(result)

{ 'PM2': 4,
  '_id': ObjectId('64d5173b5c3e40269fce34fb'),
  'metadata': { 'lat': 25.055,
                'lon': 121.454,
                'measurement': 'PM2.5',
                'sensor_type': 'SDS011',
                'station': 'Aotizhongxin'},
  'timestamp': '2013-3-1 0:27:46'}


**Step 9:** Use the `distinct` method to determine how many sensor sites are included in the
beijing collection.

In [11]:
beijing.distinct("metadata.station")

['Aotizhongxin', 'Changping']

**Step 10:** Use the `count_documents` method to determine how many readings there are for
each site in the beijing collection.

In [12]:
print("Documents from Aotizhongxin Station:", beijing.count_documents({"metadata.station": "Aotizhongxin"}))
print("Documents from Changping Station:", beijing.count_documents({"metadata.station": "Changping"}))

Documents from Aotizhongxin Station: 62219
Documents from Changping Station: 53808


**Step 11:** Use the `aggregate` method to determine how many readings there are for each site
in the beijing collection.

In [13]:
result = beijing.aggregate(
    [
        {"$group": {"_id": "$metadata.station", "count": {"$count": {}}}}
    ]
)
pp.pprint(list(result))

[{'_id': 'Changping', 'count': 53808}, {'_id': 'Aotizhongxin', 'count': 62219}]
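
For intuition, the `$group` stage behaves like a tally keyed on the `_id` expression. The sketch below reproduces that tally in plain Python over a few hypothetical documents (the sample values and the names `sample_docs`, `group_with_sum`, and `count_by_station` are illustrative, not part of the notebook). It also writes the stage with `{"$sum": 1}`, which counts documents the same way and is the usual spelling on servers older than MongoDB 5.0, where the `$count` accumulator is not available.

```python
from collections import Counter

# Hypothetical documents shaped like those in the collection (values are illustrative).
sample_docs = [
    {"PM2": 4, "metadata": {"station": "Aotizhongxin", "measurement": "PM2.5"}},
    {"PM2": 8, "metadata": {"station": "Aotizhongxin", "measurement": "PM2.5"}},
    {"PM2": 6, "metadata": {"station": "Changping", "measurement": "PM2.5"}},
]

# The same grouping written with $sum, an equivalent way to count documents per group.
group_with_sum = [
    {"$group": {"_id": "$metadata.station", "count": {"$sum": 1}}}
]


def count_by_station(docs):
    """Tally documents per station, mirroring the shape $group returns."""
    tally = Counter(doc["metadata"]["station"] for doc in docs)
    return [{"_id": station, "count": n} for station, n in tally.items()]


station_counts = count_by_station(sample_docs)
```

Passing `group_with_sum` to `beijing.aggregate` should give the same per-station counts as the pipeline above.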


**Step 12:** Use the `distinct` method to determine how many types of measurements have been
taken in the beijing collection.

In [14]:
beijing.distinct("metadata.measurement")

['O3', 'PM10', 'PM2.5']

**Step 13:** Use the `find` method to retrieve the PM2.5 readings from all sites. Be sure to limit
your results to 3 records only.

In [15]:
result = beijing.find({"metadata.measurement": "PM2.5"}).limit(3)
pp.pprint(list(result))

[ { 'PM2': 4,
    '_id': ObjectId('64d5173b5c3e40269fce34fb'),
    'metadata': { 'lat': 25.055,
                  'lon': 121.454,
                  'measurement': 'PM2.5',
                  'sensor_type': 'SDS011',
                  'station': 'Aotizhongxin'},
    'timestamp': '2013-3-1 0:27:46'},
  { 'PM2': 8,
    '_id': ObjectId('64d5173b5c3e40269fce34fc'),
    'metadata': { 'lat': 25.055,
                  'lon': 121.454,
                  'measurement': 'PM2.5',
                  'sensor_type': 'SDS011',
                  'station': 'Aotizhongxin'},
    'timestamp': '2013-3-1 1:20:53'},
  { 'PM2': 7,
    '_id': ObjectId('64d5173b5c3e40269fce34fd'),
    'metadata': { 'lat': 25.055,
                  'lon': 121.454,
                  'measurement': 'PM2.5',
                  'sensor_type': 'SDS011',
                  'station': 'Aotizhongxin'},
    'timestamp': '2013-3-1 2:58:24'}]
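
The query key `"metadata.measurement"` uses dot notation to reach a field nested inside the `metadata` sub-document. As a rough illustration, the sketch below walks a dotted key through nested dictionaries and applies an equality filter with a limit. The helpers `get_path` and `find_local` and the sample documents are hypothetical, and the sketch covers equality matching only, not MongoDB's query operators or array handling.

```python
def get_path(doc, dotted_key):
    """Follow a dotted key such as "metadata.measurement" into nested dicts."""
    value = doc
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def find_local(docs, query, limit=None):
    """Return documents whose dotted keys equal the query values (equality only)."""
    matches = [
        doc for doc in docs
        if all(get_path(doc, key) == wanted for key, wanted in query.items())
    ]
    return matches if limit is None else matches[:limit]


# Hypothetical documents: four PM2.5 readings and one PM10 reading.
docs = [
    {"PM2": 4, "metadata": {"measurement": "PM2.5", "station": "Aotizhongxin"}},
    {"PM2": 8, "metadata": {"measurement": "PM2.5", "station": "Aotizhongxin"}},
    {"PM10": 9, "metadata": {"measurement": "PM10", "station": "Aotizhongxin"}},
    {"PM2": 7, "metadata": {"measurement": "PM2.5", "station": "Changping"}},
    {"PM2": 6, "metadata": {"measurement": "PM2.5", "station": "Changping"}},
]

first_three = find_local(docs, {"metadata.measurement": "PM2.5"}, limit=3)
```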


**Step 14:** Use the `aggregate` method to calculate how many readings there are for each type
(`"PM2.5"`, `"PM10"`, and `"O3"`) in `Aotizhongxin`.

In [16]:
result = beijing.aggregate(
    [
        {"$match": {"metadata.station": "Aotizhongxin"}},
        {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}}
    ]
)
pp.pprint(list(result))

[ {'_id': 'PM2.5', 'count': 18759},
  {'_id': 'PM10', 'count': 31188},
  {'_id': 'O3', 'count': 12272}]


**Step 15:** Use the `aggregate` method to calculate how many readings there are for each type
(`"PM2.5"`, `"PM10"`, and `"O3"`) in `Changping`.

In [17]:
result = beijing.aggregate(
    [
        {"$match": {"metadata.station": "Changping"}},
        {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}}
    ]
)
pp.pprint(list(result))

[ {'_id': 'O3', 'count': 11525},
  {'_id': 'PM2.5', 'count': 27654},
  {'_id': 'PM10', 'count': 14629}]
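
Steps 14 and 15 run the same two-stage pipeline and differ only in the station name. One way to avoid repeating it is to build the pipeline from a parameter, as in this minimal sketch. The function name `readings_by_type_pipeline` is hypothetical; the stages themselves are copied from the cells above.

```python
def readings_by_type_pipeline(station):
    """Build the match-then-group pipeline used above for one station name."""
    return [
        # Keep only documents recorded at the requested station.
        {"$match": {"metadata.station": station}},
        # Count the remaining documents per measurement type.
        {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}},
    ]


# One pipeline per station found earlier with distinct("metadata.station").
pipelines = {
    name: readings_by_type_pipeline(name)
    for name in ("Aotizhongxin", "Changping")
}
```

Each pipeline can then be passed to the collection, for example `beijing.aggregate(pipelines["Changping"])`.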


## Import

**Step 16:** Use the `find` method to retrieve the PM2.5 readings from `Aotizhongxin`. Since we won't
need the metadata for our model, use the `projection` argument to limit the results to the `"PM2"` and
`"timestamp"` keys only, then print the first record.

In [18]:
result = beijing.find(
    {"metadata.station": "Aotizhongxin", "metadata.measurement": "PM2.5"},
    projection={"PM2": 1, "timestamp": 1, "_id": 0},
)
pp.pprint(result.next())

{'PM2': 4, 'timestamp': '2013-3-1 0:27:46'}
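
The projection above keeps the keys flagged with `1` and drops `_id` because it is explicitly set to `0`; without that flag, `_id` would be returned as well. The sketch below imitates that behavior on a plain dictionary. The helper `project` and the placeholder `_id` value are hypothetical, and the sketch handles inclusion-style projections on top-level keys only.

```python
def project(doc, projection):
    """Emulate an inclusion-style projection on top-level keys."""
    wanted = {key for key, flag in projection.items() if flag and key != "_id"}
    result = {}
    # _id is returned unless the projection sets it to 0.
    if projection.get("_id", 1) and "_id" in doc:
        result["_id"] = doc["_id"]
    for key, value in doc.items():
        if key in wanted:
            result[key] = value
    return result


# Document shaped like the find_one output above; the _id is a placeholder.
doc = {
    "_id": "hypothetical-id",
    "PM2": 4,
    "metadata": {"station": "Aotizhongxin", "measurement": "PM2.5"},
    "timestamp": "2013-3-1 0:27:46",
}

with_id = project(doc, {"PM2": 1, "timestamp": 1})
without_id = project(doc, {"PM2": 1, "timestamp": 1, "_id": 0})
```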


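**Step 17:** Read the records from `result` into the DataFrame `df`. Be sure to set the index to
`"timestamp"`. Because `result.next()` already consumed the first document from the cursor, `df`
starts at the second reading.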

In [19]:
df = pd.DataFrame(result).set_index("timestamp")
df.head()

| timestamp | PM2 |
|---|---|
| 2013-03-01 01:20:53 | 8.0 |
| 2013-03-01 02:58:24 | 7.0 |
| 2013-03-01 03:32:40 | 6.0 |
| 2013-03-01 04:29:44 | 3.0 |
| 2013-03-01 05:31:46 | 5.0 |
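
The first row of `df` is the 01:20:53 reading rather than the 0:27:46 reading printed in Step 16. That is consistent with a cursor being a one-pass iterator: calling `next` hands out a record that later consumers no longer see. The sketch below shows the same effect with a plain Python iterator standing in for the cursor. The three records are taken from the outputs above, and the variable names are illustrative.

```python
# Records copied from the PM2.5 outputs above, standing in for query results.
records = [
    {"PM2": 4, "timestamp": "2013-3-1 0:27:46"},
    {"PM2": 8, "timestamp": "2013-3-1 1:20:53"},
    {"PM2": 7, "timestamp": "2013-3-1 2:58:24"},
]

cursor_like = iter(records)    # stands in for the object returned by find()
first = next(cursor_like)      # same effect as result.next()
remaining = list(cursor_like)  # what a later consumer would still receive

# To preview without losing a record, materialize once and reuse the list.
all_rows = list(iter(records))
preview = all_rows[0]
```

If every reading should end up in the DataFrame, one option is to rerun the `find` call before building `df`, or to build `df` from a materialized list such as `all_rows`.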
