## Step 0: Preparation

> **Import libraries and load the CSV files**

In [120]:
import pymongo
from pymongo import MongoClient
import pandas as pd
from pprint import pprint
import datetime

In [13]:
hot_df = pd.read_csv("datasets/hotspot_historic.csv")
clim_df = pd.read_csv("datasets/climate_historic.csv")

> Have a look at the data to find how the two datasets relate

In [27]:
#hot_df.date.unique()
len(clim_df.date.unique()), clim_df.shape, clim_df.station.unique()

(366, (366, 7), array([948700, 948701, 948702]))

In [33]:
len(hot_df.date.unique()), len(clim_df.date.unique()) 

(145, 366)

In [54]:
hot_df[hot_df.date == clim_df.date[0]]

Unnamed: 0,latitude,longitude,datetime,confidence,date,surface_temperature_celcius


In [47]:
clim_df.head()

Unnamed: 0,station,date,air_temperature_celcius,relative_humidity,windspeed_knots,max_wind_speed,precipitation
0,948700,31/12/2016,19,56.8,7.9,11.1,0.00I
1,948700,2/01/2017,15,50.7,9.2,13.0,0.02G
2,948700,3/01/2017,16,53.6,8.1,15.0,0.00G
3,948700,4/01/2017,24,61.6,7.7,14.0,0.00I
4,948700,5/01/2017,24,62.3,7.0,13.0,0.00I


> There are three stations, but the station does not seem to matter here: the date column has no duplicates (366 rows, 366 unique dates), so each date appears exactly once.

In [41]:
clim_df[clim_df.station==948702].tail(5)

Unnamed: 0,station,date,air_temperature_celcius,relative_humidity,windspeed_knots,max_wind_speed,precipitation
361,948702,28/12/2017,21,61.1,6.6,11.1,0.00I
362,948702,29/12/2017,19,59.7,7.4,14.0,0.63G
363,948702,30/12/2017,16,51.5,8.7,15.0,0.02G
364,948702,31/12/2017,18,53.6,7.9,15.9,0.00G
365,948702,1/01/2018,19,52.9,8.1,15.0,0.00I


## Task A
> Aim: design a data model for querying <br>
> Queries cover two kinds of data: hotspot and climate

In [57]:
# Check whether every date in the hotspot data also has climate data
climate_dates = set(clim_df.date.unique())
for date in hot_df.date.unique():
    if date not in climate_dates:
        print("not matched all")
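The same coverage check can be written as a set difference, which also reports which dates are missing rather than only that something failed. This is a minimal sketch: `missing_dates` is a hypothetical helper, and the sample strings are made up for illustration, not taken from the CSV files.

```python
def missing_dates(hotspot_dates, climate_dates):
    """Return the hotspot dates that have no matching climate date, sorted."""
    return sorted(set(hotspot_dates) - set(climate_dates))


# Hypothetical sample values, not real rows from the datasets.
hotspot_dates = ["27/12/2017", "27/12/2017", "5/01/2017"]
climate_dates = ["31/12/2016", "5/01/2017", "27/12/2017"]

print(missing_dates(hotspot_dates, climate_dates))  # prints []
```

With the notebook's DataFrames, the call would be `missing_dates(hot_df.date, clim_df.date)`, assuming both columns store dates as strings in the same format.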

In [18]:
doc_format = {
    "hotspot":{
        "latitude": 0.0,
        "longitude": 0.0,
        "datetime": "",
        "confidence": 0,
        "date": "",
        "surface_temperature_celcius": 0
    },
    "climate":{
        "station":0,
        "date":"",
        "air_temperature_celcius":0,
        "relative_humidity": 0.0,
        "windspeed_knots": 0.0,
        "max_wind_speed": 0.0,
        "precipitation": ""
    }
}

doc_format_together = {
    "station":0,
    "date":"",
    "air_temperature_celcius":0,
    "relative_humidity": 0.0,
    "windspeed_knots": 0.0,
    "max_wind_speed": 0.0,
    "precipitation": "",
    "hotspot":{
        "latitude": 0.0,
        "longitude": 0.0,
        "datetime": "",
        "confidence": 0,
        "surface_temperature_celcius": 0
    }
}

> Connect to MongoDB, create Collections in MongoDB

In [77]:
client = MongoClient()

In [78]:
# connect to the database; MongoDB creates it on first write if it does not exist
db = client.assignment

In [79]:
# connect to the collection; MongoDB creates it on first write if it does not exist
collection = db.document

In [126]:
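# A cursor can be iterated only once, so materialise it before counting
docs = list(collection.find({}))
print(len(docs))
# Uncomment to print every document:
# for document in docs:
#     pprint(document)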

366


In [101]:
# Check
print(clim_df.dtypes,"\n",hot_df.dtypes)

station                      int64
date                        object
air_temperature_celcius      int64
relative_humidity          float64
windspeed_knots            float64
max_wind_speed             float64
precipitation               object
dtype: object 
 latitude                       float64
longitude                      float64
datetime                        object
confidence                       int64
date                            object
surface_temperature_celcius      int64
dtype: object


In [129]:
# scan over the climate dataframe and, for each day, look up the related hotspot records
for i in range(len(clim_df)):
    doc = {}
    doc["station"] = int(clim_df.station[i])
    doc["date"] = datetime.datetime.strptime(clim_df.date[i], "%d/%m/%Y")
    doc["air_temperature_celcius"] = int(clim_df.air_temperature_celcius[i])
    doc["relative_humidity"] = float(clim_df.relative_humidity[i])
    doc["windspeed_knots"] = float(clim_df.windspeed_knots[i])
    doc["max_wind_speed"] = float(clim_df.max_wind_speed[i])
    # the CSV column name appears to carry a trailing space; the stored key drops it
    doc["precipitation"] = clim_df.at[i, "precipitation "]

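    # match on this row's date (not the stale `date` variable left over from an earlier cell)
    hot_df_related = hot_df[hot_df.date == clim_df.date[i]]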
    doc["Hotspots"] = []
    for index, row in hot_df_related.iterrows():
        hotspot = {}
        hotspot["latitude"] = float(row.latitude)
        hotspot["longitude"] = float(row.longitude)
        hotspot["datetime"] = datetime.datetime.strptime(row.datetime,"%Y-%m-%dT%H:%M:%S")
        hotspot["confidence"] = int(row.confidence)
        hotspot["surface_temperature_celcius"] = int(row.surface_temperature_celcius)
        doc["Hotspots"].append(hotspot)
    
    result = collection.insert_one(doc)
    #print(result.inserted_id)
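The loop above filters the whole hotspot frame once per climate row. A possible alternative is to group the hotspot records by date once and then attach each group to its day. The sketch below works on plain lists of dicts so it runs without pandas or MongoDB; `build_documents` and the sample rows are hypothetical, and it assumes the two files use identical date strings.

```python
from collections import defaultdict
from datetime import datetime


def build_documents(climate_rows, hotspot_rows):
    """Embed each day's hotspot records inside that day's climate record."""
    by_date = defaultdict(list)
    for row in hotspot_rows:
        # Drop the redundant date; the parent climate document already carries it.
        hotspot = {k: v for k, v in row.items() if k != "date"}
        hotspot["datetime"] = datetime.strptime(row["datetime"], "%Y-%m-%dT%H:%M:%S")
        by_date[row["date"]].append(hotspot)

    docs = []
    for row in climate_rows:
        doc = dict(row)
        doc["date"] = datetime.strptime(row["date"], "%d/%m/%Y")
        doc["Hotspots"] = by_date.get(row["date"], [])
        docs.append(doc)
    return docs


# Hypothetical rows shaped like the two CSV files.
climate_rows = [
    {"station": 948700, "date": "5/01/2017", "air_temperature_celcius": 24},
    {"station": 948700, "date": "6/01/2017", "air_temperature_celcius": 20},
]
hotspot_rows = [
    {"latitude": -37.0, "longitude": 143.0, "datetime": "2017-01-05T04:16:51",
     "confidence": 80, "date": "5/01/2017", "surface_temperature_celcius": 60},
]
docs = build_documents(climate_rows, hotspot_rows)
print(len(docs), len(docs[0]["Hotspots"]), len(docs[1]["Hotspots"]))  # prints 2 1 0
```

With the real data, the rows could come from `clim_df.to_dict("records")` and `hot_df.to_dict("records")`, and the result could be written in one call with `collection.insert_many(docs)`. Numeric values may still need casting to built-in `int` and `float`, as the loop above does.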

In [128]:
print(hot_df.datetime[0])
d = datetime.datetime.strptime(clim_df.date[0], "%d/%m/%Y")

d2 = datetime.datetime.strptime(hot_df.datetime[0],"%Y-%m-%dT%H:%M:%S")

d2
#clim_df.date[i]

2017-12-27T04:16:51


datetime.datetime(2017, 12, 27, 4, 16, 51)

## Task B

> **a**. Find climate data on 10th December 2017.
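One possible query for (a), assuming the embedded model built in Task A: one document per day, `date` stored as a midnight `datetime`, and hotspots in a `Hotspots` array. `climate_on` is a hypothetical helper; excluding `Hotspots` from the result is an interpretation of "climate data".

```python
from datetime import datetime


def climate_on(collection, day):
    """Climate fields for one calendar day, without _id or the embedded hotspots."""
    return collection.find({"date": day}, {"_id": 0, "Hotspots": 0})


# Usage with the notebook's collection (not executed here):
# for doc in climate_on(collection, datetime(2017, 12, 10)):
#     pprint(doc)
```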

> **b**. Find the latitude, longitude, surface temperature (°C), and confidence when the surface temperature (°C) was between 65 °C and 100 °C.
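Because the hotspots live in an array, a plain `find` would return whole days rather than individual fires. A sketch for (b) is to unwind the `Hotspots` array and filter the resulting rows. `hotspots_by_surface_temperature` is a hypothetical helper, and treating both bounds as inclusive is an assumption.

```python
def hotspots_by_surface_temperature(collection, low=65, high=100):
    """One result per hotspot whose surface temperature lies in [low, high]."""
    pipeline = [
        {"$unwind": "$Hotspots"},  # one row per embedded hotspot
        {"$match": {"Hotspots.surface_temperature_celcius": {"$gte": low, "$lte": high}}},
        {"$project": {
            "_id": 0,
            "latitude": "$Hotspots.latitude",
            "longitude": "$Hotspots.longitude",
            "surface_temperature_celcius": "$Hotspots.surface_temperature_celcius",
            "confidence": "$Hotspots.confidence",
        }},
    ]
    return collection.aggregate(pipeline)
```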

> **c**. Find date, surface temperature (°C), air temperature (°C), relative humidity and max wind speed on 15th and 16th of December 2016.
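A sketch for (c) using `$in` on the stored `date` and a projection that reaches into the embedded array. `climate_and_surface_on` is a hypothetical helper. The climate rows printed earlier begin at 31/12/2016, so these two dates may simply return no documents.

```python
from datetime import datetime


def climate_and_surface_on(collection, days):
    """Selected climate fields plus each hotspot's surface temperature for the given days."""
    projection = {
        "_id": 0,
        "date": 1,
        "air_temperature_celcius": 1,
        "relative_humidity": 1,
        "max_wind_speed": 1,
        "Hotspots.surface_temperature_celcius": 1,
    }
    return collection.find({"date": {"$in": list(days)}}, projection)


# Usage with the notebook's collection (not executed here):
# climate_and_surface_on(collection, [datetime(2016, 12, 15), datetime(2016, 12, 16)])
```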

> **d**. Find datetime, air temperature (°C), surface temperature (°C) and confidence when the confidence is between 80 and 100.
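For (d), the filter is on an embedded field while one of the requested values (air temperature) sits on the parent document. A possible pipeline unwinds the array and projects from both levels. `hotspots_by_confidence` is a hypothetical helper, and inclusive bounds are an assumption.

```python
def hotspots_by_confidence(collection, low=80, high=100):
    """One result per hotspot whose confidence lies in [low, high]."""
    pipeline = [
        {"$unwind": "$Hotspots"},
        {"$match": {"Hotspots.confidence": {"$gte": low, "$lte": high}}},
        {"$project": {
            "_id": 0,
            "datetime": "$Hotspots.datetime",
            "air_temperature_celcius": 1,  # taken from the parent climate document
            "surface_temperature_celcius": "$Hotspots.surface_temperature_celcius",
            "confidence": "$Hotspots.confidence",
        }},
    ]
    return collection.aggregate(pipeline)
```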

> **e**. Find the top 10 records with the highest surface temperature (°C).
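A sketch for (e): unwind the hotspots, sort by surface temperature in descending order, and keep the first ten. `hottest_hotspots` is a hypothetical helper, and returning the date together with the whole hotspot sub-document is one reading of "records".

```python
def hottest_hotspots(collection, n=10):
    """The n hotspots with the highest surface temperature, hottest first."""
    pipeline = [
        {"$unwind": "$Hotspots"},
        {"$sort": {"Hotspots.surface_temperature_celcius": -1}},
        {"$limit": n},
        {"$project": {"_id": 0, "date": 1, "Hotspots": 1}},
    ]
    return collection.aggregate(pipeline)
```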

> **f**. Find the number of fires on each day. You are required to only display the total number of fires and the date in the output.
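Since each document holds one day and its hotspots, the daily count in (f) can be read off the length of the `Hotspots` array. In this sketch, `fires_per_day` and the output field name `total_fires` are hypothetical. Days without hotspots would appear with a count of 0.

```python
def fires_per_day(collection):
    """Date and number of embedded hotspots for every day."""
    pipeline = [
        {"$project": {
            "_id": 0,
            "date": 1,
            "total_fires": {"$size": "$Hotspots"},
        }},
        {"$sort": {"date": 1}},
    ]
    return collection.aggregate(pipeline)
```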

> **g**. Find the average surface temperature (°C) for each day. You are required to only display average surface temperature (°C) and the date in the output.
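A sketch for (g): unwind the hotspots, group by date, and average the surface temperature. `average_surface_temperature_per_day` and the output field name `average_surface_temperature` are hypothetical. Because `$unwind` drops documents with an empty array by default, days without hotspots are absent from the result.

```python
def average_surface_temperature_per_day(collection):
    """Date and mean hotspot surface temperature for every day that has hotspots."""
    pipeline = [
        {"$unwind": "$Hotspots"},
        {"$group": {
            "_id": "$date",
            "average_surface_temperature": {"$avg": "$Hotspots.surface_temperature_celcius"},
        }},
        {"$project": {"_id": 0, "date": "$_id", "average_surface_temperature": 1}},
        {"$sort": {"date": 1}},
    ]
    return collection.aggregate(pipeline)
```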