## Code Details
Author: Rory Angus<br>
Created: 21FEB19<br>
Version: 0.1<br>
***
This code tests writing data to MongoDB. <br>
This is a proof of concept, but the data is real. It does not bring every field into Mongo, only the key ones.
It uses data that was extracted after the SSO data model was implemented at LE on the platform.
The data is now in three parts: the groups and their members, the users linked to their results, and the coach/coachee relationships.
Please note that there can be more than two survey results per journey and that the coach/coachee relationship is many-to-many. There is a field called coach on the user, but it comes from the old version and should not be used.<br>

There is also an issue with the data migration onto the new platform: some users ended up with coaching relationships with UTS students even though they are not part of the UTS organisation on the platform. The platform no longer allows this, so wherever such records occur they will be patched manually in this code.
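
To make the many-to-many shape of the coach/coachee relationship concrete, here is a minimal sketch that builds lookups in both directions from a list of relationship records. The helper name `build_lookups` and all IDs other than the pair `3206`/`3103` (which appears in the sample output of this notebook) are made up for illustration and are not part of the original code.

```python
from collections import defaultdict

# Illustrative records only: the first pair appears in this notebook's sample
# output, the other IDs are invented to show the many-to-many shape.
relationships = [
    {"coachId": "3206", "learnerId": "3103"},
    {"coachId": "3206", "learnerId": "4001"},  # hypothetical learner
    {"coachId": "5002", "learnerId": "3103"},  # hypothetical coach
]


def build_lookups(records):
    """Return (learners_by_coach, coaches_by_learner) as dicts of sets."""
    learners_by_coach = defaultdict(set)
    coaches_by_learner = defaultdict(set)
    for record in records:
        learners_by_coach[record["coachId"]].add(record["learnerId"])
        coaches_by_learner[record["learnerId"]].add(record["coachId"])
    return dict(learners_by_coach), dict(coaches_by_learner)


learners_by_coach, coaches_by_learner = build_lookups(relationships)
print(sorted(learners_by_coach["3206"]))   # one coach, several learners
print(sorted(coaches_by_learner["3103"]))  # one learner, several coaches
```

Because a learner can have several coaches, a single `coach` field on the user cannot represent this, which is consistent with the note that the old field should not be used.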

# Package Importing + Variable Setting

In [1]:
import pandas as pd
import numpy as np

import datetime

# mongo stuff
import pymongo
from pymongo import MongoClient
from bson.objectid import ObjectId
import bson

# json stuff
import json

In [2]:
# the file to read. This needs to be manually updated
readLoc = "~/datasets/CLARA/190328_052400_LE_LivePlatform_UsersCoachRelationship.json"
# if true the code outputs a whole lot of diagnostic data to the notebook, which is helpful when writing the code but not so much when running it for real
verbose = False
# first run will truncate the target database and reload it from scratch. Once delta updates have been implemented this needs adjusting
first_run = True

# Set display options

In [3]:
# further details found by running:
# pd.describe_option('display')
# set the values to show all of the columns etc.
pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_rows', None)  # or 1000
pd.set_option('display.max_colwidth', -1)  # or 199

# locals() # show all of the local environments

# Connect to Mongo DB

In [4]:
# create the connection to MongoDB
# define the location of the Mongo DB Server
# in this instance it is a local copy running on the dev machine. This is configurable at this point.
client = MongoClient('127.0.0.1', 27017)

# define what the database is called.
db = client.CLARA

# define the collection
raw_data_collection = db.raw_data_coach_coachee

Command to clean the database if needed when running this code

In [5]:
# Delete the raw_data_collection - used for testing
if first_run:
    raw_data_collection.drop()

# Functions Definitions

In [6]:
# The web framework gets post_id from the URL and passes it as a string
def get(post_id):
    # Convert from string to ObjectId, look the document up in the
    # collection defined above and return it (None if nothing matches)
    return raw_data_collection.find_one({'_id': ObjectId(post_id)})

# Place Data from JSON File into Mongo

In [7]:
#import the data file

claraDf = pd.read_json(readLoc, orient='records')

In [8]:
if verbose:

    # count columns and rows
    print("Number of columns are " + str(len(claraDf.columns)))
    print("Number of rows are " + str(len(claraDf.index)))
    print()

    # output the shape of the dataframe
    print("The shape of the data frame is " + str(claraDf.shape))
    print()

    # output the column names
    print("The column names of the data frame are: ")
    print(*claraDf, sep='\n')
    print()

    # output the column names and datatypes
    print("The datatypes of the data frame are: ")
    print(claraDf.dtypes)
    print()

Number of columns are 4
Number of rows are 52

The shape of the data frame is (52, 4)

The column names of the data frame are: 
coachId
dateFrom
dateTo
learnerId

The datatypes of the data frame are: 
coachId      int64 
dateFrom     object
dateTo       object
learnerId    int64 
dtype: object



In [9]:
# pymongo cannot encode numpy int64 values, so convert the IDs to strings

claraDf['coachId'] = claraDf['coachId'].astype(str)
claraDf['learnerId'] = claraDf['learnerId'].astype(str)

# output the column names and datatypes
if verbose:
    print("The datatypes of the data frame are: ")
    print(claraDf.dtypes)
    print()

The datatypes of the data frame are: 
coachId      object
dateFrom     object
dateTo       object
learnerId    object
dtype: object



In [10]:
# Loop through the data frame and build a list
# the list will be used for a bulk update of MongoDB

# define the list to hold the data
clara_row = []

# loop through dataframe and create each item in the list
# (the dataframe has the default 0..n-1 index, so index doubles as the row position)
for index, row in claraDf.iterrows():
    clara_row.append({
        "userGroup_index": index,
        "coachId": row['coachId'],
        "learnerId": row['learnerId'],
        "dateFrom": row['dateFrom'],
        "dateTo": row['dateTo'],
        "insertdate": datetime.datetime.utcnow()
    })

if verbose:
    print(clara_row[5])

{'userGroup_index': 5, 'coachId': '3206', 'learnerId': '3103', 'dateFrom': '2018-12-18 22:38:25', 'dateTo': None, 'insertdate': datetime.datetime(2019, 3, 28, 5, 55, 43, 670330)}
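
The sample record above keeps `dateFrom` as a plain string, so MongoDB would store it as text rather than as a date. If real dates are wanted, one option is to parse the strings before inserting, since pymongo stores Python `datetime` objects as BSON dates. The following is a minimal sketch, assuming every timestamp uses the format seen in the sample (`'2018-12-18 22:38:25'`) and that an open-ended relationship is represented by `None`. The helper name `parse_timestamp` is hypothetical, and the time zone of the source values is not stated, so the result is left naive.

```python
import datetime

# Assumed format, based on the sample value '2018-12-18 22:38:25'
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_timestamp(value):
    """Return a datetime for a timestamp string, or None if the value is None."""
    if value is None:
        return None
    return datetime.datetime.strptime(value, TIMESTAMP_FORMAT)


# Record shaped like the sample output above
record = {
    "coachId": "3206",
    "learnerId": "3103",
    "dateFrom": "2018-12-18 22:38:25",
    "dateTo": None,
}

# Replace the two date fields in place with parsed values
for field in ("dateFrom", "dateTo"):
    record[field] = parse_timestamp(record[field])

print(record["dateFrom"], record["dateTo"])
```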


In [11]:
# bulk insert into the mongo database
# insert_many returns an InsertManyResult and the new ids live on that result,
# not on the collection (raw_data_collection.inserted_ids only names a
# sub-collection called 'raw_data_coach_coachee.inserted_ids')

insert_result = raw_data_collection.insert_many(clara_row)

if verbose:
    print(insert_result.inserted_ids)
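
The comment on `first_run` mentions delta updates that are not implemented yet. One possible approach, sketched here under stated assumptions, is to upsert each relationship instead of dropping and reloading the collection. The sketch assumes that `coachId`, `learnerId` and `dateFrom` together identify a relationship, which the source does not confirm. The helper name `build_upsert` and the `updatedate` field are hypothetical. It only builds the filter and update documents, so it runs without a database.

```python
import datetime


def build_upsert(record, now=None):
    """Return (filter, update) documents for upserting one relationship."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    # Assumption: these three fields uniquely identify a relationship
    key = {name: record[name] for name in ("coachId", "learnerId", "dateFrom")}
    update = {
        # refreshed on every load, e.g. when a relationship is closed off
        "$set": {"dateTo": record["dateTo"], "updatedate": now},
        # written only when the document is created by the upsert
        "$setOnInsert": {"insertdate": now},
    }
    return key, update


record = {
    "coachId": "3206",
    "learnerId": "3103",
    "dateFrom": "2018-12-18 22:38:25",
    "dateTo": None,
}
query, update = build_upsert(record)
print(query)

# With a live collection the pair would be applied as:
# raw_data_collection.update_one(query, update, upsert=True)
```

With this approach `first_run` would stay `False` after the initial load, so the collection is not dropped before each run.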


# Create Index

In [12]:
# Only create the indexes on the first run through
if first_run:
    # put the result into a list so it can be looked at later.
    result = []

    # Create some indexes
    result.append(
        raw_data_collection.create_index(
            [('userGroup_index', pymongo.ASCENDING)], unique=False))
    result.append(
        raw_data_collection.create_index([('coachId', pymongo.ASCENDING)],
                                         unique=False))
    result.append(
        raw_data_collection.create_index([('learnerId', pymongo.ASCENDING)],
                                         unique=False))
    result.append(
        raw_data_collection.create_index([('dateTo', pymongo.ASCENDING)],
                                         unique=False))
    result.append(
        raw_data_collection.create_index([('dateFrom', pymongo.ASCENDING)],
                                         unique=False))

    if verbose:
        print(result)

['userGroup_index_1', 'coachId_1', 'learnerId_1', 'dateTo_1', 'dateFrom_1']
