# Data Management and Database Design INFO 6210


## How to create a database in MongoDB ?




In this article we will learn how to create, populateand query a MongoDB database. 
We will cover the following details in this article:

**1. Installing MongoDB on MacOS**

**2. Running MongoDB on MacOS**

**3. Creating Collections in MongoDB in python**

**4. Populating data in MongoDB**

**5. How to query MongoDB**


### Install MongoDB on MacOS

#### Install homebrew package manager 
If you dont have homebrew packmanager on you mac install as shown below. 
1. Open terminal app from Spotlight Search
    1. Press CMD + SPACE to open spotlight search 
    2. Type terminal and open app
2. Type : `ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"`

#### Install MongoDB with homebrew
1. Update homebrew package database with: 
>`brew update`
2. Install MongoDB
>`brew install mongodb`


### Run MongoDB on MacOS

To run MongoDB, we run `mongod` command on terminal. If necessary, specify the path of the data directory as explained below.

##### Specify the path of data directory
If you want to specify the path of data directory, run the following command on terminal 
>`mongod --dbpath <path to data directory>`

#### Using MongoDB on terminal 
To access MongoDB on terminal run the following command
>`mongo`

Here we can perform all the operations in MongoDB on terminal. 

### Brief Introduction to Document based databases

MongoDB stored data in the form of BSON documents. BSON is a binary representation of JSON documents. 
a typical JSON document looks as shown below. 
```
{
   field1: value1,
   field2: value2,
   field3: value3,
   ...
   fieldN: valueN
}
```
A collection of Documents is known as **Collection** in MongoDB.

We can compare a __Table__ in SQL as a __Collection__ in MongoDB

### Using MongoDB in Python. 

#### Importing all the required packages for running MongoDB

In [1]:
from pymongo import MongoClient
import json

#### Connecting to MongoDB and switching to required Database

In [2]:
client = MongoClient('localhost', 27017)  #Connection to MongoDB
db=client.project   #Switching to Database with name 'project'

In this article we will gather data from Twitter and store it in MongoDB.
We get a lot of data in twitter JSON file. But, we will only extract the data that we need. 

Suppose we are creating a database to store information of USERs and TWEETs. We will create two collections, 
1. User 
2. Tweet 

Now **User** collection will hold all the information about the user and the **Tweet** collection will store all the data about a specific tweet.

So, the collections will be as follows,
```
Collection user:
{
     "user_id":"Data String",
     "user_name":"Data String",
     "followers": Data Int,
     "following": Data Int,
     "tweet_count": Data Int,
     "hashtags":[ ]
}
```
```
Collection tweet:
{
     "tweet_id": "Data String",
     "user_id": "Data String",
     "tweet_contents": "Data String",
     "urls": "Data String",
     "date": "Data String",
     "time": "Data String",
     "favourites": Data Int,
     "retweets": Data Int,
     "hashtags":[ ]
}
```

Here we can see that we are storing all the data related to user in **user** collection. Similarly all the data related to tweet is stored in **tweet** collection

You can also see that we are storing **Hashtags** inside a list in the JSON file. This is because JSON provides us the functionality implement this. Thus we can Store and Query lists in JSON on MongoDB

We can populate the data to MongoDB using the **pymongo** package in Python

To populate data in MongoDB, we create a JSON file with all the data and insert it in MongoDB using the method 

`db.<Collection>.insert_one(<JSON data>)`


Thus a sample JSON data for Tweet will look like this:

```
{ "tweet_id": "990004485060624384", "user_id": "@AngelHealthTech", "tweet_contents": "RT @diioannid: What s hot in #Wearables and #WearableTech | https://t.co/susj0K9qUe via @MikeQuindazzi   #BigData #HealthTech #DataScience…", "urls": "https://pwc.to/2Hnu6s9", "date": "2018-04-27", "time": "23:07:17", "favourites": 0, "retweets": 12, "hashtags": ["wearables", "wearabletech", "bigdata", "healthtech", "datascience"] }
```

And a sample JSON data for User will look like this:

```
{ "user_id": "@ExpoDX" , "user_name": "DXWorldEXPO ®", "followers": 521, "following": 0, "tweet_count": 3709, "hashtags": ["clodnative", "apm", "linx", "serverless", "devops", "bigdata", "clod", "iot", "iiot"]  }

```

In case I have transferred data from SQL database to MongoDB. 

##### There were following tables in SQL database

`User`  
```
(user_id varchar(25) not null primary key , user_name varchar(50), followers int, following int, tweet_count int)
```
Here **user_id** is the Primary Key

`Tweet` 
```
(tweet_id varchar(25) not null primary key, user_id varchar(25),tweet_content varchar(300), urls varchar(100))
```
Here **tweet_id** is the Primary Key

`Tweet_details`  
```
(tweet_id varchar(25) not null primary key, date varchar(10),time varchar(8), favourate int, retweets int )
```
Here **tweet_id** is the Primary Key

`Tags` 
```
(tag_id integer not null primary key autoincrement, tag_details varchar(50) )
```
Here **tag_id** is the Primary Key

`User_tags`
```
(tag_no integer not null primary key autoincrement, user_id varchar(25), tag_id varchar(25))
```
Here **tag_no** is the Primary Key

`Tweet_tags` 
```
(tag_no integer not null primary key autoincrement, tweet_id varchar(25), tag_id varchar(25))
```
Here **tag_no** is the Primary Key


The formatting of data in SQL is done to satisfy the conditions of normalization. 

Thus we have different tables of `USER` , `Tweet` and  `Tweet_details`.

We also have tables for `Tags` which Tag_Id and Tag_details, we are then creating relational tables `user_tags` and `tweet_tags` to store data for all users.

We also have tables of *mispellings* and *synonyms*. But we use these while populating tags for users, to see if there is any mistake in spelling or if any Synonym is being used. 


### Populating Data into MongoDB

For **Transfering Data** from SQL to MongoDB, we wrote the following function. 

Here we have written a functions which extracts data from SQL in our required format and converts it into **JSON** file compatible for MongoDB and this **JSON** is directly inserted into the required document. 


##### Populate User Collection with data from SQL

In [3]:
def getAndPopulateUsers():

    mainQuery="Select * from user"

    for row in cursor.execute(mainQuery):

        output= "SELECT tag_id from user_tags where user_id='"+ row[0]+"' "
        list = []
        for row2 in cursor2.execute(output):
            getTag="select tag_details from tags where tag_id='"+row2[0]+"'"
            tagName=cursor3.execute(getTag)
            list.append(tagName.fetchone()[0])
        res= ('{ "user_id": "%s" , "user_name": "%s", "followers": %d, "following": %d, "tweet_count": %d, "hashtags": %s  }' %(row[0],row[1],row[2],row[3],row[4], str(list).replace("'","\"").replace("u","").replace("\\","")))
        print (res)
        db.user.insert_one(json.loads(res))

Here, the **mainquery** variable is used to extract all the data from the user table.

For every user, we extract tags related to that user and add them to a list.
We will insert this list into hashtags column directly. 

##### Populate Tweet Document with data from SQL

In [4]:
def getandPopulateTweets():

    mainQuery=" Select * from tweet inner join tweet_details where tweet.tweet_id=tweet_details.tweet_id"

    for row in cursor.execute(mainQuery):
        output = "SELECT tag_id from tweet_tags where tweet_id='" + row[0] + "' "
        list = []
        for row2 in cursor2.execute(output):
            getTag = "select tag_details from tags where tag_id='" + row2[0] + "'"
            tagName = cursor3.execute(getTag)
            list.append(tagName.fetchone()[0])
        res= ('{ "tweet_id": "%s", "user_id": "%s", "tweet_contents": "%s", "urls": "%s", "date": "%s", "time": "%s", "favourites": %d, "retweets": %d, "hashtags": %s }'  %(row[0], row[1], row[2].replace("\n"," ").replace("\"",""), row[3], row[5], row[6], row[7], row[8], str(list).replace("'","\"").replace("u\"","\"").replace("\\","")   ))
        print (res)
        db.tweet.insert_one(json.loads(res))

Similar to User Document, we will populate the tweet document by extracting the data from SQL with queries and converting them to **JSON** format

In the above two functions, it can be observed that, we are replacing few characters from the data and substituting with some another character, or we are adding an escape character. This is done to make the data compitable with JSON, i.e some inverted commas or double inverted commas can cause abnormal termination of string.

On running the above two functions, we can populate data in MongoDB

### Querying MongoDB

In [5]:
user=db.user
tweet=db.tweet

Here, we gave the cursor for db.user to a variable user and db.tweet to variable tweet.

In MongoDB, we use the following method to find and display data :

`db.<collection>.find(<query>, <projection>)`

Here, the projection parameter determines which fields are returned in the matching documents. The projection parameter takes a document of the following form:

```
{ field1: <value>, field2: <value> ... }
```

##### Displaying sample data from User Collection

In [6]:
output= user.find({}).limit(10)
for data in list(output):
    print data

{u'user_id': u'@ExpoDX', u'user_name': u'DXWorldEXPO \xae', u'hashtags': [u'clodnative', u'apm', u'linx', u'serverless', u'devops', u'bigdata', u'clod', u'iot', u'iiot'], u'followers': 521, u'following': 0, u'_id': ObjectId('5ada4426ec661529fc12d319'), u'tweet_count': 3709}
{u'user_id': u'@TechNewsRprt', u'user_name': u'TechNews Report', u'hashtags': [u'infographic', u'digitaltransformation', u'ai', u'marketing', u'mwc18', u'ericssondigital', u'bigdata'], u'followers': 4045, u'following': 3059, u'_id': ObjectId('5ada4426ec661529fc12d31a'), u'tweet_count': 5882}
{u'user_id': u'@ScopeOnline', u'user_name': u'SCOPE', u'hashtags': [u'artificialintelligence', u'deeplearning', u'machinelearning', u'atomation'], u'followers': 2635, u'following': 2127, u'_id': ObjectId('5ada4426ec661529fc12d31b'), u'tweet_count': 11505}
{u'user_id': u'@nschaetti', u'user_name': u'NS.ai (Nils Schaetti)', u'hashtags': [u'ai', u'machinelearning', u'ia', u'bigdata', u'datascience', u'robots', u'robotic', u'intelli

Here we are limit the output to 10 items with the `.limit(<number>)` method

##### Displaying sample data from Tweet Collection

In [7]:
output= tweet.find({}).limit(10)
for data in list(output):
    print data

{u'user_id': u'@ExpoDX', u'tweet_contents': u'RT @ExpoDX: AI, Monitoring and Digital Transformation with @SaboTaylorDiab  @Loom_Systems #CloudNative #APM #Linux #Serverless #DevOps #Dat\u2026', u'hashtags': [u'cloudnative', u'apm', u'linux', u'serverless', u'devops'], u'tweet_id': u'968036614202150912', u'retweets': 2, u'urls': u'NO_URL', u'time': u'08:14:48', u'date': u'2018-02-26', u'favourites': 0, u'_id': ObjectId('5ada44f2ec661529fc12dc5e')}
{u'user_id': u'@ExpoDX', u'tweet_contents': u'AI, Monitoring and Digital Transformation with @SaboTaylorDiab  @Loom_Systems #CloudNative #APM #Linux #Serverless\u2026 https://t.co/5CaXydsQEp', u'hashtags': [u'cloudnative', u'apm', u'linux', u'serverless'], u'tweet_id': u'968036597634564096', u'retweets': 2, u'urls': u'https://twitter.com/i/web/status/968036597634564096', u'time': u'08:14:44', u'date': u'2018-02-26', u'favourites': 1, u'_id': ObjectId('5ada44f2ec661529fc12dc5f')}
{u'user_id': u'@TechNewsRprt', u'tweet_contents': u'RT @TamaraMcC