# Data Connector
`Author: YUAN Yan Zhe`

The Data Connector links the raw data to the MongoDB database and to Spark data structures. It has two parts:
- Data Preprocessor
  - Preprocess the raw data
  - .csv -> pd.DataFrame
- Data Transportor
  - Write the preprocessed data to MongoDB through an instance of the `mongoDB` class
  - pd.DataFrame -> MongoDB documents (JSON)
- Standalone test versions of these two parts are in `data_transportor.ipynb` and `data_preprocessor.ipynb`

##### NOTES
**To run this code, first start the MongoDB service with `brew services start mongodb-community@4.4` (https://docs.mongodb.com/manual/tutorial/install-mongodb-on-os-x/).**

**For the MongoDB connection, start PySpark with `pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.0`. The connector version must match the Scala, PySpark, and MongoDB versions in use; details: https://docs.mongodb.com/spark-connector/master/python-api**

- To switch scala version, do the following:
  - `brew search scala`
  - `brew install scala@2.12`
  - `brew unlink scala`
  - `brew link scala@2.12 --force`
  - `scala -version`
  - run `echo 'export PATH="/usr/local/opt/scala@2.12/bin:$PATH"' >> ~/.zshrc` to add Scala 2.12 to the shell `PATH`

## Data Preprocessor
- Load the raw data from .csv files
- Preprocess the raw data
  - For the train and test data, split the timestamp into [datetime, date, time] for later analysis
  - For the weather train and test data, split the timestamp the same way and fill NaN values with the median value of that day
  - For the building metadata, drop the columns with too many NaN values
- Store the preprocessed data in .csv files for the Data Transportor
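
The timestamp step listed above can be sketched on a tiny synthetic frame. This is a minimal sketch rather than the notebook's own code: it uses the vectorized pandas `.dt` accessor instead of `apply(datetime.date)`, and the helper name `split_timestamp` and the sample values are assumptions made for illustration.

```python
import pandas as pd

def split_timestamp(df, column="timestamp"):
    """Return a copy of df with datetime, date, and time columns derived from `column`.

    Hypothetical helper for illustration; the notebook performs these steps inline.
    """
    out = df.copy()
    out["datetime"] = pd.to_datetime(out[column])
    out["date"] = out["datetime"].dt.date
    out["time"] = out["datetime"].dt.time
    return out

# Synthetic sample rows (not taken from the dataset).
sample = pd.DataFrame({
    "building_id": [0, 1],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 13:30:00"],
})
result = split_timestamp(sample)
print(result[["datetime", "date", "time"]])
```

On frames with tens of millions of rows, the `.dt` accessor may be noticeably faster than a row-wise `apply`, although that has not been measured here.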

In [1]:
spark

In [2]:
# Import packages

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from datetime import datetime
from pytz import timezone
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Load raw data from original dataset

train_df = pd.read_csv("ashrae-energy-prediction/train.csv")
building_df = pd.read_csv("ashrae-energy-prediction/building_metadata.csv")
test_df = pd.read_csv("ashrae-energy-prediction/test.csv")
weather_train_df = pd.read_csv('ashrae-energy-prediction/weather_train.csv')
weather_test_df = pd.read_csv('ashrae-energy-prediction/weather_test.csv')
print('Data Loaded')
print('the size of train dataset:', train_df.shape)
print('the size of test dataset:', test_df.shape)
print('the size of weather_train dataset:', weather_train_df.shape)
print('the size of weather_test dataset:', weather_test_df.shape)
print('the size of building_metadata dataset:', building_df.shape)

Data Loaded
the size of train dataset: (20216100, 4)
the size of test dataset: (41697600, 4)
the size of weather_train dataset: (139773, 9)
the size of weather_test dataset: (277243, 9)
the size of building_metadata dataset: (1449, 6)


In [4]:
# Preprocess the train and test data

train_df['datetime'] = train_df['timestamp'].astype('datetime64[ns]') 
train_df['date'] = train_df['datetime'].apply(datetime.date)
train_df['time'] = train_df['datetime'].apply(datetime.time)
print(train_df.isna().sum())

test_df['datetime'] = test_df['timestamp'].astype('datetime64[ns]') 
test_df['date'] = test_df['datetime'].apply(datetime.date)
test_df['time'] = test_df['datetime'].apply(datetime.time)
print(train_df.isna().sum())  # NOTE: prints train_df again (see output below); use test_df.isna().sum() to check the test set


building_id      0
meter            0
timestamp        0
meter_reading    0
datetime         0
date             0
time             0
dtype: int64
building_id      0
meter            0
timestamp        0
meter_reading    0
datetime         0
date             0
time             0
dtype: int64


In [5]:
train_df.head()

| | building_id | meter | timestamp | meter_reading | datetime | date | time |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 2016-01-01 00:00:00 | 0.0 | 2016-01-01 | 2016-01-01 | 00:00:00 |
| 1 | 1 | 0 | 2016-01-01 00:00:00 | 0.0 | 2016-01-01 | 2016-01-01 | 00:00:00 |
| 2 | 2 | 0 | 2016-01-01 00:00:00 | 0.0 | 2016-01-01 | 2016-01-01 | 00:00:00 |
| 3 | 3 | 0 | 2016-01-01 00:00:00 | 0.0 | 2016-01-01 | 2016-01-01 | 00:00:00 |
| 4 | 4 | 0 | 2016-01-01 00:00:00 | 0.0 | 2016-01-01 | 2016-01-01 | 00:00:00 |


In [6]:
test_df.head()

| | row_id | building_id | meter | timestamp | datetime | date | time |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 2017-01-01 00:00:00 | 2017-01-01 | 2017-01-01 | 00:00:00 |
| 1 | 1 | 1 | 0 | 2017-01-01 00:00:00 | 2017-01-01 | 2017-01-01 | 00:00:00 |
| 2 | 2 | 2 | 0 | 2017-01-01 00:00:00 | 2017-01-01 | 2017-01-01 | 00:00:00 |
| 3 | 3 | 3 | 0 | 2017-01-01 00:00:00 | 2017-01-01 | 2017-01-01 | 00:00:00 |
| 4 | 4 | 4 | 0 | 2017-01-01 00:00:00 | 2017-01-01 | 2017-01-01 | 00:00:00 |


In [7]:
weather_nan_train = weather_train_df.isna().sum()
print(weather_nan_train)
weather_nan_test = weather_test_df.isna().sum()
print(weather_nan_test)

site_id                   0
timestamp                 0
air_temperature          55
cloud_coverage        69173
dew_temperature         113
precip_depth_1_hr     50289
sea_level_pressure    10618
wind_direction         6268
wind_speed              304
dtype: int64
site_id                    0
timestamp                  0
air_temperature          104
cloud_coverage        140448
dew_temperature          327
precip_depth_1_hr      95588
sea_level_pressure     21265
wind_direction         12370
wind_speed               460
dtype: int64


In [8]:
# Preprocess the weather train and test data

weather_train_df['timestamp_2'] = weather_train_df['timestamp'].astype(str).str[:-6]
weather_train_df['timestamp_2'] = pd.to_datetime(weather_train_df['timestamp_2'])
weather_train_df['date'] = weather_train_df['timestamp_2'].apply(datetime.date)
weather_train_df['time'] = weather_train_df['timestamp_2'].apply(datetime.time)

weather_test_df['timestamp_2'] = weather_test_df['timestamp'].astype(str).str[:-6]
weather_test_df['timestamp_2'] = pd.to_datetime(weather_test_df['timestamp_2'])
weather_test_df['date'] = weather_test_df['timestamp_2'].apply(datetime.date)
weather_test_df['time'] = weather_test_df['timestamp_2'].apply(datetime.time)

weather_train_df['air_temperature'] = weather_train_df['air_temperature'].fillna(weather_train_df.groupby('date')['air_temperature'].transform('median'))
weather_train_df['dew_temperature'] = weather_train_df['dew_temperature'].fillna(weather_train_df.groupby('date')['dew_temperature'].transform('median'))
weather_train_df['sea_level_pressure'] = weather_train_df['sea_level_pressure'].fillna(weather_train_df.groupby('date')['sea_level_pressure'].transform('median'))
weather_train_df['wind_speed'] = weather_train_df['wind_speed'].fillna(weather_train_df.groupby('date')['wind_speed'].transform('median'))
weather_train_df['cloud_coverage'] = weather_train_df['cloud_coverage'].fillna(weather_train_df.groupby('date')['cloud_coverage'].transform('median'))
weather_train_df['precip_depth_1_hr'] = weather_train_df['precip_depth_1_hr'].fillna(weather_train_df.groupby('date')['precip_depth_1_hr'].transform('median'))
weather_train_df['wind_direction'] = weather_train_df['wind_direction'].fillna(weather_train_df.groupby('date')['wind_direction'].transform('median'))

weather_test_df['air_temperature'] = weather_test_df['air_temperature'].fillna(weather_test_df.groupby('date')['air_temperature'].transform('median'))
weather_test_df['dew_temperature'] = weather_test_df['dew_temperature'].fillna(weather_test_df.groupby('date')['dew_temperature'].transform('median'))
weather_test_df['sea_level_pressure'] = weather_test_df['sea_level_pressure'].fillna(weather_test_df.groupby('date')['sea_level_pressure'].transform('median'))
weather_test_df['wind_speed'] = weather_test_df['wind_speed'].fillna(weather_test_df.groupby('date')['wind_speed'].transform('median'))
weather_test_df['cloud_coverage'] = weather_test_df['cloud_coverage'].fillna(weather_test_df.groupby('date')['cloud_coverage'].transform('median'))
weather_test_df['precip_depth_1_hr'] = weather_test_df['precip_depth_1_hr'].fillna(weather_test_df.groupby('date')['precip_depth_1_hr'].transform('median'))
weather_test_df['wind_direction'] = weather_test_df['wind_direction'].fillna(weather_test_df.groupby('date')['wind_direction'].transform('median'))

weather_nan_train = weather_train_df.isna().sum()
print(weather_nan_train)
weather_nan_test = weather_test_df.isna().sum()
print(weather_nan_test)

site_id               0
timestamp             0
air_temperature       0
cloud_coverage        0
dew_temperature       0
precip_depth_1_hr     0
sea_level_pressure    0
wind_direction        0
wind_speed            0
timestamp_2           0
date                  0
time                  0
dtype: int64
site_id                 0
timestamp               0
air_temperature         0
cloud_coverage        528
dew_temperature         0
precip_depth_1_hr       0
sea_level_pressure      0
wind_direction          0
wind_speed              0
timestamp_2             0
date                    0
time                    0
dtype: int64


In [9]:
weather_test_df = weather_test_df.fillna(0)
print(weather_test_df.isna().sum())

site_id               0
timestamp             0
air_temperature       0
cloud_coverage        0
dew_temperature       0
precip_depth_1_hr     0
sea_level_pressure    0
wind_direction        0
wind_speed            0
timestamp_2           0
date                  0
time                  0
dtype: int64
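
The 528 `cloud_coverage` values still missing in `weather_test_df` after the per-day median fill are consistent with dates on which every reading of that column is NaN, so that the group median is itself NaN. The notebook does not investigate this, so treat it as an assumption. The sketch below reproduces that situation on synthetic data and shows one possible fallback, the column-wide median, in place of the notebook's `fillna(0)`. The helper name `fill_with_daily_median` and the fallback choice are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

def fill_with_daily_median(df, columns, date_column="date"):
    """Fill NaNs per column with that day's median, then with the column median.

    Hypothetical helper: the notebook repeats one fillna line per column
    and later fills the leftovers with 0 instead of the column median.
    """
    out = df.copy()
    for col in columns:
        daily_median = out.groupby(date_column)[col].transform("median")
        out[col] = out[col].fillna(daily_median)
        # A day whose values are all NaN has a NaN median, so a fallback is needed.
        out[col] = out[col].fillna(out[col].median())
    return out

# Synthetic data: the second day has no cloud_coverage reading at all.
weather = pd.DataFrame({
    "date": ["2017-01-01", "2017-01-01", "2017-01-01", "2017-01-02", "2017-01-02"],
    "cloud_coverage": [2.0, np.nan, 4.0, np.nan, np.nan],
})
filled = fill_with_daily_median(weather, ["cloud_coverage"])
print(filled)
```

Looping over the column names also replaces the fourteen near-identical `fillna` lines with a single call per frame. Whether 0 or a median is the better fallback depends on how the downstream model treats `cloud_coverage`.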


In [10]:
weather_train_df.head()

| | site_id | timestamp | air_temperature | cloud_coverage | dew_temperature | precip_depth_1_hr | sea_level_pressure | wind_direction | wind_speed | timestamp_2 | date | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2016-01-01 00:00:00 | 25.0 | 6.0 | 20.0 | 0.0 | 1019.7 | 0.0 | 0.0 | 2016-01-01 00:00:00 | 2016-01-01 | 00:00:00 |
| 1 | 0 | 2016-01-01 01:00:00 | 24.4 | 0.0 | 21.1 | -1.0 | 1020.2 | 70.0 | 1.5 | 2016-01-01 01:00:00 | 2016-01-01 | 01:00:00 |
| 2 | 0 | 2016-01-01 02:00:00 | 22.8 | 2.0 | 21.1 | 0.0 | 1020.2 | 0.0 | 0.0 | 2016-01-01 02:00:00 | 2016-01-01 | 02:00:00 |
| 3 | 0 | 2016-01-01 03:00:00 | 21.1 | 2.0 | 20.6 | 0.0 | 1020.1 | 0.0 | 0.0 | 2016-01-01 03:00:00 | 2016-01-01 | 03:00:00 |
| 4 | 0 | 2016-01-01 04:00:00 | 20.0 | 2.0 | 20.0 | -1.0 | 1020.0 | 250.0 | 2.6 | 2016-01-01 04:00:00 | 2016-01-01 | 04:00:00 |


In [11]:
weather_test_df.head()

| | site_id | timestamp | air_temperature | cloud_coverage | dew_temperature | precip_depth_1_hr | sea_level_pressure | wind_direction | wind_speed | timestamp_2 | date | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2017-01-01 00:00:00 | 17.8 | 4.0 | 11.7 | 0.0 | 1021.4 | 100.0 | 3.6 | 2017-01-01 00:00:00 | 2017-01-01 | 00:00:00 |
| 1 | 0 | 2017-01-01 01:00:00 | 17.8 | 2.0 | 12.8 | 0.0 | 1022.0 | 130.0 | 3.1 | 2017-01-01 01:00:00 | 2017-01-01 | 01:00:00 |
| 2 | 0 | 2017-01-01 02:00:00 | 16.1 | 0.0 | 12.8 | 0.0 | 1021.9 | 140.0 | 3.1 | 2017-01-01 02:00:00 | 2017-01-01 | 02:00:00 |
| 3 | 0 | 2017-01-01 03:00:00 | 17.2 | 0.0 | 13.3 | 0.0 | 1022.2 | 140.0 | 3.1 | 2017-01-01 03:00:00 | 2017-01-01 | 03:00:00 |
| 4 | 0 | 2017-01-01 04:00:00 | 16.7 | 2.0 | 13.3 | 0.0 | 1022.3 | 130.0 | 2.6 | 2017-01-01 04:00:00 | 2017-01-01 | 04:00:00 |


In [12]:
print(building_df.isna().sum())

site_id           0
building_id       0
primary_use       0
square_feet       0
year_built      774
floor_count    1094
dtype: int64


In [13]:
# Preprocess the building metadata

building_df = building_df.drop(columns='floor_count')
building_df = building_df.drop(columns='year_built')
building_df.head()

| | site_id | building_id | primary_use | square_feet |
|---|---|---|---|---|
| 0 | 0 | 0 | Education | 7432 |
| 1 | 0 | 1 | Education | 2720 |
| 2 | 0 | 2 | Education | 5376 |
| 3 | 0 | 3 | Education | 23685 |
| 4 | 0 | 4 | Education | 116607 |
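
In the NaN counts printed earlier, `year_built` is missing in 774 of 1449 rows (about 53%) and `floor_count` in 1094 (about 76%). The notebook drops both columns by name. A threshold-based version might look like the sketch below; the helper name `drop_sparse_columns`, the 0.5 cutoff, and the sample frame are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_nan_fraction=0.5):
    """Drop columns whose share of NaN values exceeds max_nan_fraction."""
    nan_fraction = df.isna().mean()  # per-column fraction of missing values
    sparse = nan_fraction[nan_fraction > max_nan_fraction].index
    return df.drop(columns=sparse)

# Synthetic frame that mimics the shape of the building metadata.
buildings = pd.DataFrame({
    "building_id": [0, 1, 2, 3],
    "square_feet": [7432, 2720, 5376, 23685],
    "year_built": [2008.0, np.nan, np.nan, np.nan],   # 75% missing
    "floor_count": [np.nan, np.nan, np.nan, np.nan],  # 100% missing
})
cleaned = drop_sparse_columns(buildings)
print(list(cleaned.columns))
```

With a 0.5 cutoff, both `year_built` and `floor_count` in the real metadata would be dropped, matching what the notebook does explicitly.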


In [14]:
train_df.head()

| | building_id | meter | timestamp | meter_reading | datetime | date | time |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 2016-01-01 00:00:00 | 0.0 | 2016-01-01 | 2016-01-01 | 00:00:00 |
| 1 | 1 | 0 | 2016-01-01 00:00:00 | 0.0 | 2016-01-01 | 2016-01-01 | 00:00:00 |
| 2 | 2 | 0 | 2016-01-01 00:00:00 | 0.0 | 2016-01-01 | 2016-01-01 | 00:00:00 |
| 3 | 3 | 0 | 2016-01-01 00:00:00 | 0.0 | 2016-01-01 | 2016-01-01 | 00:00:00 |
| 4 | 4 | 0 | 2016-01-01 00:00:00 | 0.0 | 2016-01-01 | 2016-01-01 | 00:00:00 |


In [15]:
test_df.head()

| | row_id | building_id | meter | timestamp | datetime | date | time |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 2017-01-01 00:00:00 | 2017-01-01 | 2017-01-01 | 00:00:00 |
| 1 | 1 | 1 | 0 | 2017-01-01 00:00:00 | 2017-01-01 | 2017-01-01 | 00:00:00 |
| 2 | 2 | 2 | 0 | 2017-01-01 00:00:00 | 2017-01-01 | 2017-01-01 | 00:00:00 |
| 3 | 3 | 3 | 0 | 2017-01-01 00:00:00 | 2017-01-01 | 2017-01-01 | 00:00:00 |
| 4 | 4 | 4 | 0 | 2017-01-01 00:00:00 | 2017-01-01 | 2017-01-01 | 00:00:00 |


In [34]:
# Store the preprocessed data as intermediate results into csv files

building_df.to_csv("ashrae_intermidiate_result/building_metadata_cld.csv", index=False)
weather_train_df.to_csv('ashrae_intermidiate_result/weather_train_cld.csv', index=False)
weather_test_df.to_csv('ashrae_intermidiate_result/weather_test_cld.csv', index=False)
train_df.to_csv("ashrae_intermidiate_result/train_cld.csv", index=False)
test_df.to_csv("ashrae_intermidiate_result/test_cld.csv", index=False)
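
CSV files do not store column types, so the `datetime`, `date`, and `time` columns written here come back as plain strings when the Data Transportor reloads them with `pd.read_csv`. The sketch below illustrates this with an in-memory buffer and shows `parse_dates` as one way to restore a datetime column on load. The sample frame is synthetic, and the notebook itself reloads the files without `parse_dates`.

```python
import io
import pandas as pd

# Synthetic two-row frame with a real datetime column.
frame = pd.DataFrame({"timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00"]})
frame["datetime"] = pd.to_datetime(frame["timestamp"])

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# A plain read_csv returns the datetime column as strings.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates asks pandas to convert the named columns while reading.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["datetime"])

print(plain["datetime"].dtype, parsed["datetime"].dtype)
```

If the values are meant to stay strings inside MongoDB, the plain reload is sufficient; the distinction matters mainly when date arithmetic is needed after reloading.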


## Data Transportor

- Load the preprocessed data from the csv files
- Write it into MongoDB using the methods of the `mongoDB` class defined below. The data already stored is organized as follows:
  - database: ashrae_db
  - collection:
    - train
    - test
    - weather_train
    - weather_test
    - building_metadata
- The data in MongoDB can then be loaded into a Spark DataFrame by the Data Loader.
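
MongoDB stores each row as a JSON-style document, which PyMongo represents as a Python dictionary. The sketch below shows two ways of turning a DataFrame into such a list of dictionaries without touching a database: the transpose-and-serialize route used in `data_to_db`, and `DataFrame.to_dict(orient="records")`. The sample frame and the variable names are assumptions for illustration only.

```python
import json
import pandas as pd

# Synthetic rows shaped like the weather data.
frame = pd.DataFrame({
    "site_id": [0, 0],
    "air_temperature": [25.0, 24.4],
    "date": ["2016-01-01", "2016-01-01"],
})

# Route used by data_to_db: transpose, serialize to JSON, parse back.
docs_via_json = list(json.loads(frame.T.to_json()).values())

# Alternative: build the dictionaries directly, one per row.
docs_via_records = frame.to_dict(orient="records")

print(docs_via_records[0])
```

Both routes yield one dictionary per row, keyed by column name, which is the shape `insert_many` expects.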

In [16]:
spark

In [17]:
# Import packages

import pandas as pd
import datetime
import numpy as np

from datetime import datetime
from pytz import timezone

In [36]:
# Load datasets from csv file into pandas.dataframe

train_df = pd.read_csv("ashrae_intermidiate_result/train_cld.csv")
building_df = pd.read_csv("ashrae_intermidiate_result/building_metadata_cld.csv")
test_df = pd.read_csv("ashrae_intermidiate_result/test_cld.csv")
weather_train_df = pd.read_csv('ashrae_intermidiate_result/weather_train_cld.csv')
weather_test_df = pd.read_csv('ashrae_intermidiate_result/weather_test_cld.csv')

print('Data Loaded')
print('the size of train dataset:', train_df.shape)
print('the size of test dataset:', test_df.shape)
print('the size of weather_train dataset:', weather_train_df.shape)
print('the size of weather_test dataset:', weather_test_df.shape)
print('the size of building_metadata dataset:', building_df.shape)

Data Loaded
the size of train dataset: (20216100, 7)
the size of test dataset: (41697600, 7)
the size of weather_train dataset: (139773, 12)
the size of weather_test dataset: (277243, 12)
the size of building_metadata dataset: (1449, 4)


In [19]:
# Import packages

import pymongo
from pymongo import MongoClient
import json
import bson
from bson import ObjectId
from bson import json_util as jsonb
import datetime
import pandas as pd

In [22]:
class mongoDB():
    
    """
    Class Description:
    - Serve as the Data Connector layer in the project. 
    - raw_data->database->spark_data
        
    Initialization:
      - host_address
      - port_number
      - connect the Mongo client
      
    Functions: 
      - Create a MongoDB service for the user
      - Write raw data (pd.dataframe)/huge raw data into mongodb collections
      - Load mongodb collections into a pandas DataFrame
      - Drop existing collections
      - Count the number of rows in a collection
      - Close MongoDB connection
    
    Notices:
      - Call close_connection once you have finished using the mongo service
    
    """  
    
    def __init__(self, host='127.0.0.1', port=27017):
        self.host = host
        self.port = port
        # Create a MongoClient to the running mongod instance
        self.mongo_client = MongoClient(self.host, self.port) # or we can use: MongoClient('localhost', 27017)
    
    def data_to_db(self, database_name='database_test', collection_name='collection_test', raw_data=pd.DataFrame(np.arange(100).reshape(25,4), columns=list('wxyz'))):
        '''
        Write raw data (pd.dataframe form) into mongodb database (as a collection)
        '''
        
        client = self.mongo_client
        
        # Create a database inside this client
        db = client[database_name]

        # Create a collection inside this database.
        # A collection is a group of documents stored in MongoDB, and can be thought of as roughly the equivalent of a table in a relational database
        collection = db[collection_name]

        # Data in MongoDB is represented (and stored) using JSON-style documents. In PyMongo we use dictionaries to represent documents.
        # data = your_df.to_dict(orient='records') / mycol.insert_many(data) may be useful as well.
        print('----------------------------------------------')
        print('Start Inserting......')
        collection.insert_many(json.loads(raw_data.T.to_json()).values())
    
        # print related information
        # print()
        print('- Data inserted to Database:', database_name, 'Collection:', collection_name)
        # print('There are other collections:', db.list_collection_names(), 'And other databases:', client.list_database_names())
        print('- The number of rows in', collection_name, 'is', collection.count_documents({}))
        print('- Fetch the first row in the collection', jsonb.dumps(list(collection.find_one())))
        print('Data Loaded')
    
        '''
        Set the info for MongoClient:
          - username="admin",
          - password="123456"
        '''
        '''
        If there are redundant info in the collection, use the following commands and simply rerun this function:
          client = MongoClient()
          db = client['database_test_1']
          collection = db['collection_test_1']
          collection.drop()
          client.close()
        Alternatively, use the following to remove matching documents:
          # collection.delete_one({'_id': id_num})
          db.test.delete_many({'x': 1})
        Use the following to find all documents in one collection:
          list(collection.find())
        '''

    def huge_data_to_db(self, database_name='database_test', collection_name='collection_test', data=pd.DataFrame(np.arange(100).reshape(25,4), columns=list('wxyz')),split_num=10):
        '''
        Split big data into pieces and feed them into mongoDB piece by piece in order to show the process
        '''
        def split_df(data=pd.DataFrame(np.arange(100).reshape(25,4), columns=list('wxyz')), split_num=10):
            '''
            Split raw data (pd.DataFrame form) into different pieces so that it is available to load into mongoDB
            '''
            df_length = data.shape[0]
            result = []
            start = 0
            end = 0
            for i in range(1,split_num+1):
                print(i)
                if i == split_num:
                    end = df_length
                else:
                    end=i*int(df_length/split_num)
                print('start',start)
                print('end',end)
                result.append(data[start:end])
                start=end
            return result
        
        self.get_row_num(database_name, collection_name)
        print('--------------Start At:', datetime.datetime.now(),'----------------')
        
        df_splited = split_df(data,split_num)
        for i,item in enumerate(df_splited):
            self.data_to_db(database_name, collection_name, item)
            print("--------------Data split",i+1,"has been loaded----------------")
        
        print('--------------Finish At:', datetime.datetime.now(),'----------------')
        self.get_row_num(database_name, collection_name)
        
    def db_to_data(self, database_name='database_test', collection_name='collection_test', query={}, id_exist=False):
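        '''
        Load documents from a collection into a pandas DataFrame.
        Set id_exist=True to keep the '_id' column.
        '''
        collection = self.mongo_client[database_name][collection_name]
        cursor = collection.find(query)
        df = pd.DataFrame(list(cursor))
        # Drop MongoDB's '_id' field unless requested; an empty result has no such column
        if not id_exist and '_id' in df.columns:
            del df['_id']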
        print('Data Loaded')
        return df
    
    def drop_collection(self, database_name='database_test', collection_name='collection_test'):
        '''
        Drop a collection(table) in a database, all specified.
        '''
        collection = self.mongo_client[database_name][collection_name]
        collection.drop()
        print('Collection Dropped')
    
    def get_row_num(self, database_name='database_test', collection_name='collection_test'):
        '''
        Get the number of rows in the specified collection
        '''
        collection = self.mongo_client[database_name][collection_name]
        print('There are', collection.count_documents({}),'rows of data in the collection.')

    def close_connection(self):
        self.mongo_client.close()
        print('Client Closed')
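
The nested `split_df` helper builds the full list of slices before anything is inserted. A generator variant that yields one slice at a time could look like the sketch below; the name `iter_chunks` is hypothetical, and the boundary logic follows `split_df` (equal-sized pieces, with the last piece taking the remainder).

```python
import numpy as np
import pandas as pd

def iter_chunks(data, split_num=10):
    """Yield split_num consecutive slices of data; the last one takes the remainder.

    Hypothetical generator variant of the nested split_df helper.
    """
    chunk_size = len(data) // split_num
    start = 0
    for i in range(1, split_num + 1):
        end = len(data) if i == split_num else i * chunk_size
        yield data.iloc[start:end]
        start = end

# Same toy frame as the class's default arguments: 25 rows, 4 columns.
frame = pd.DataFrame(np.arange(100).reshape(25, 4), columns=list("wxyz"))
sizes = [len(chunk) for chunk in iter_chunks(frame, split_num=10)]
print(sizes)
```

For the train set, 20216100 rows split 100 ways gives pieces of 202161 rows, which matches the `start`/`end` boundaries in the log further down. Each yielded slice could be passed to `data_to_db` in turn.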

In [25]:
# Load weather and building data (relatively small data)

instance = mongoDB()

# Load to database
instance.data_to_db('ashrae_db','weather_train',weather_train_df)
instance.data_to_db('ashrae_db','weather_test',weather_test_df)
instance.data_to_db('ashrae_db','building_metadata',building_df)

# Drop if needed
#instance.drop_collection('ashrae_db','weather_train')
#instance.drop_collection('ashrae_db','weather_test')
#instance.drop_collection('ashrae_db','building_metadata')


----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: weather_train
- The number of rows in weather_train is 139773
- Fetch the first row in the collection ["_id", "site_id", "timestamp", "air_temperature", "cloud_coverage", "dew_temperature", "precip_depth_1_hr", "sea_level_pressure", "wind_direction", "wind_speed", "timestamp_2", "date", "time"]
Data Loaded
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: weather_test
- The number of rows in weather_test is 277243
- Fetch the first row in the collection ["_id", "site_id", "timestamp", "air_temperature", "cloud_coverage", "dew_temperature", "precip_depth_1_hr", "sea_level_pressure", "wind_direction", "wind_speed", "timestamp_2", "date", "time"]
Data Loaded
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: building_metadata
- The nu

In [28]:
# Check if loaded
instance.get_row_num('ashrae_db','weather_train')
instance.get_row_num('ashrae_db','weather_test')
instance.get_row_num('ashrae_db','building_metadata')


There are 139773 rows of data in the collection.
There are 277243 rows of data in the collection.
There are 1449 rows of data in the collection.


In [31]:
# instance.get_row_num('ashrae_db','train')

There are 0 rows of data in the collection.


In [32]:
# Load train data (relatively large data)

instance.huge_data_to_db('ashrae_db','train',train_df,100)


There are 0 rows of data in the collection.
--------------Start At: 2020-11-20 02:21:53.126293 ----------------
1
start 0
end 202161
2
start 202161
end 404322
3
start 404322
end 606483
4
start 606483
end 808644
5
start 808644
end 1010805
6
start 1010805
end 1212966
7
start 1212966
end 1415127
8
start 1415127
end 1617288
9
start 1617288
end 1819449
10
start 1819449
end 2021610
11
start 2021610
end 2223771
12
start 2223771
end 2425932
13
start 2425932
end 2628093
14
start 2628093
end 2830254
15
start 2830254
end 3032415
16
start 3032415
end 3234576
17
start 3234576
end 3436737
18
start 3436737
end 3638898
19
start 3638898
end 3841059
20
start 3841059
end 4043220
21
start 4043220
end 4245381
22
start 4245381
end 4447542
23
start 4447542
end 4649703
24
start 4649703
end 4851864
25
start 4851864
end 5054025
26
start 5054025
end 5256186
27
start 5256186
end 5458347
28
start 5458347
end 5660508
29
start 5660508
end 5862669
30
start 5862669
end 6064830
31
start 6064830
end 6266991
32
start 626

- Data inserted to Database: ashrae_db Collection: train
- The number of rows in train is 3032415
- Fetch the first row in the collection ["_id", "building_id", "meter", "timestamp", "meter_reading", "datetime", "date", "time"]
Data Loaded
--------------Data split 15 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: train
- The number of rows in train is 3234576
- Fetch the first row in the collection ["_id", "building_id", "meter", "timestamp", "meter_reading", "datetime", "date", "time"]
Data Loaded
--------------Data split 16 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: train
- The number of rows in train is 3436737
- Fetch the first row in the collection ["_id", "building_id", "meter", "timestamp", "meter_reading", "datetime", "date", "time"]
Data Loaded
--------------Data spl

- Data inserted to Database: ashrae_db Collection: train
- The number of rows in train is 7682118
- Fetch the first row in the collection ["_id", "building_id", "meter", "timestamp", "meter_reading", "datetime", "date", "time"]
Data Loaded
--------------Data split 38 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: train
- The number of rows in train is 7884279
- Fetch the first row in the collection ["_id", "building_id", "meter", "timestamp", "meter_reading", "datetime", "date", "time"]
Data Loaded
--------------Data split 39 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: train
- The number of rows in train is 8086440
- Fetch the first row in the collection ["_id", "building_id", "meter", "timestamp", "meter_reading", "datetime", "date", "time"]
Data Loaded
--------------Data spl

- Data inserted to Database: ashrae_db Collection: train
- The number of rows in train is 12331821
- Fetch the first row in the collection ["_id", "building_id", "meter", "timestamp", "meter_reading", "datetime", "date", "time"]
Data Loaded
--------------Data split 61 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: train
- The number of rows in train is 12533982
- Fetch the first row in the collection ["_id", "building_id", "meter", "timestamp", "meter_reading", "datetime", "date", "time"]
Data Loaded
--------------Data split 62 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: train
- The number of rows in train is 12736143
- Fetch the first row in the collection ["_id", "building_id", "meter", "timestamp", "meter_reading", "datetime", "date", "time"]
Data Loaded
--------------Data 

- The number of rows in train is 16779363
- Fetch the first row in the collection ["_id", "building_id", "meter", "timestamp", "meter_reading", "datetime", "date", "time"]
Data Loaded
--------------Data split 83 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: train
- The number of rows in train is 16981524
- Fetch the first row in the collection ["_id", "building_id", "meter", "timestamp", "meter_reading", "datetime", "date", "time"]
Data Loaded
--------------Data split 84 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: train
- The number of rows in train is 17183685
- Fetch the first row in the collection ["_id", "building_id", "meter", "timestamp", "meter_reading", "datetime", "date", "time"]
Data Loaded
--------------Data split 85 has been loaded----------------
----------------

In [33]:
# Load test data (relatively large data)

instance.huge_data_to_db('ashrae_db','test',test_df,200)


There are 0 rows of data in the collection.
--------------Start At: 2020-11-20 05:30:12.741452 ----------------
1
start 0
end 208488
2
start 208488
end 416976
3
start 416976
end 625464
4
start 625464
end 833952
5
start 833952
end 1042440
6
start 1042440
end 1250928
7
start 1250928
end 1459416
8
start 1459416
end 1667904
9
start 1667904
end 1876392
10
start 1876392
end 2084880
11
start 2084880
end 2293368
12
start 2293368
end 2501856
13
start 2501856
end 2710344
14
start 2710344
end 2918832
15
start 2918832
end 3127320
16
start 3127320
end 3335808
17
start 3335808
end 3544296
18
start 3544296
end 3752784
19
start 3752784
end 3961272
20
start 3961272
end 4169760
21
start 4169760
end 4378248
22
start 4378248
end 4586736
23
start 4586736
end 4795224
24
start 4795224
end 5003712
25
start 5003712
end 5212200
26
start 5212200
end 5420688
27
start 5420688
end 5629176
28
start 5629176
end 5837664
29
start 5837664
end 6046152
30
start 6046152
end 6254640
31
start 6254640
end 6463128
32
start 646

- The number of rows in test is 1250928
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 6 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 1459416
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 7 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 1667904
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 8 has been loaded----------------
----------------------------------------------
S

- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 6046152
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 29 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 6254640
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 30 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 6463128
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 31 has been loaded------

- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 10841376
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 52 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 11049864
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 53 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 11258352
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 54 has been loaded---

- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 15636600
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 75 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 15845088
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 76 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 16053576
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 77 has been loaded----------------
...
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 20431824
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 98 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 20640312
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 99 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 20848800
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 100 has been loaded----------------
...
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 25227048
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 121 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 25435536
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 122 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 25644024
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 123 has been loaded----------------
...
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 30022272
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 144 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 30230760
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 145 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 30439248
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 146 has been loaded----------------
...
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 34817496
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 167 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 35025984
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 168 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 35234472
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 169 has been loaded----------------
...
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 39612720
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 190 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 39821208
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 191 has been loaded----------------
----------------------------------------------
Start Inserting......
- Data inserted to Database: ashrae_db Collection: test
- The number of rows in test is 40029696
- Fetch the first row in the collection ["_id", "row_id", "building_id", "meter", "timestamp", "datetime", "date", "time"]
Data Loaded
--------------Data split 192 has been loaded