# Topic 08: NoSQL Databases 

- 03/09/21
- onl01-dtsc-ft-022221

## Learning Objectives
- Discuss other database structures beyond SQL databases
- Walk through installation and usage of MongoDB and pymongo.
- Demo: A Taste of WebScraping with MongoDB & Puppies

### Announcements

- For tomorrow's study group: 
    - Topic 09: Appendix Lab > Yelp API Lab 

- Canvas Bonus Lesson: MongoDB Installation Links Fixed 

- Canvas Bonus Lesson Added: Pushing Illumidesk Work to Your Personal Forks 

### Questions 


# NoSQL - Not Only SQL

## What's wrong with SQL? 

- SQL offers a ton of structure for storing data 
    - That structure requires incoming data to arrive in a fixed shape (your data must fit predefined tables and columns) 
    - Enforcing that structure can come at the cost of speed 
    
    
- SQL structure is very rigid: changing the schema means migrating all of your existing data to match the new schema 


- Large data requires distributed computing (many computers working together to accomplish the same task), and executing distributed joins is a very complex problem in relational databases. 
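
To make the rigidity point concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` table and its columns are made up for illustration. A row with an unexpected field is rejected until the schema itself is altered, and the row that already existed ends up with an empty value for the new column.

```python
import sqlite3

# In-memory database with a fixed two-column schema (hypothetical table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, age INTEGER)")
conn.execute("INSERT INTO customers VALUES (?, ?)", ("John Doe", 28))

# A row with a field the schema does not know about is rejected.
try:
    conn.execute(
        "INSERT INTO customers (name, age, birthday) VALUES (?, ?, ?)",
        ("Jane Doe", 7, "02/20/2014"),
    )
    rejected = False
except sqlite3.OperationalError as err:
    rejected = True
    print("Rejected:", err)

# The schema has to change before the new field can be stored.
conn.execute("ALTER TABLE customers ADD COLUMN birthday TEXT")
conn.execute(
    "INSERT INTO customers (name, age, birthday) VALUES (?, ?, ?)",
    ("Jane Doe", 7, "02/20/2014"),
)
rows = conn.execute("SELECT name, age, birthday FROM customers").fetchall()
print(rows)
```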

## What does NoSQL offer? 

- Schemaless − Number of fields, content and size of the data object can differ from one data object to another.
- You can store virtually any kind of data. 
- Structure of a single object is clear.
- No complex joins.
- To scale up and handle more queries, just add more machines
- You can change the schema of your database on the fly
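
As a rough illustration of "schemaless", the sketch below models two documents as plain Python dictionaries. The field names are invented for the example, and no database is involved. Each document carries its own set of fields, and both serialize to JSON without any shared table definition.

```python
import json

# Two documents in the same (hypothetical) collection with different fields.
documents = [
    {"name": "John Doe", "age": 28},
    {"name": "Santa Claus", "address": "The North Pole",
     "reindeer": ["Dasher", "Dancer"]},
]

# Each document carries its own field names, so no shared schema is needed.
field_sets = [sorted(doc.keys()) for doc in documents]
print(field_sets)

# Documents serialize naturally to JSON text.
as_json = json.dumps(documents[1])
print(as_json)
```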

## Types of NoSQL Databases

<img style='width: 400px' src='images/nosql-types.png'>

<b>Document databases</b> pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.

<div   style='clear: both; display: table;'>
    <div style='float:left; size: 250px'>
        <img  style='align: center; width:150px' src='images/mongodb.png' /></div>
    <div style='float:left; size: 250px'>
        <img style='align: center;' src='images/couchdb.png' /></div>
    <div style= 'float: left; width: 250px'>
        <img style='align: center; width: 200px' src='images/documentdb.png' /></div>
</div>

<b>Graph stores</b> are used to store information about networks of data, such as social connections. Graph stores include Neo4J and Giraph.

<div style='clear: both; display: table;'>
    <div style='float:left; size: 250px'>
        <img  style='align: center; width:150px' src='images/ApacheGiraph.svg' /></div>
    <div style='float:left; size: 250px'>
        <img style='align: center;' src='images/neo4j.png' /></div>
</div>

<b>Key-value</b> stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or 'key'), together with its value. Examples of key-value stores are Riak and Berkeley DB. 

<div   style='clear: both; display: table;'>
    <div style='float:left; size: 250px'>
        <img  style='align: center; width:150px' src='images/riak.png' /></div>
    <div style='float:left; size: 250px'>
        <img style='align: center;' src='images/dynamodb.jpeg' /></div>
</div>

<b>Wide-column stores</b> such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.

<img src='images/widcolumn.jpg'/>

## What is MongoDB?

MongoDB stores data in flexible, JSON-like documents, meaning fields can vary from document to document and the data structure can be changed over time.

<b>Data Structure</b>

Single Entry = Document

` { 
  _id: ObjectId(8af37bd7891c), 
  title: 'MongoDB Lab',
  description: 'Introductory lab on how to use MongoDB',
  by: 'Flatiron School',
  topics: ['mongodb', 'database', 'NoSQL', 'JSON']  
   } `

You can embed documents inside documents! 
<!-- 
<img src ='images/househouse.gif' /> -->

`{ 
  _id: ObjectId(8af37bd78ssc), 
  title: 'Other Lab',
  description: 'Introductory lab on how to use something',
  by: 'Flatiron School',
  topics: ['blah', 'blah', 'blah', 'blah'],
  author: {
          _id: ObjectId(83928shkjw183),
          name: 'Vishal Patel',
          building: '11 Broadway'
          }
   }`
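
The same nested document can be sketched as a Python dictionary, which is how `pymongo` represents documents. The `_id` values are left out here because the ones shown above are placeholders. Chained keys reach the nested field in Python, while a MongoDB filter addresses it with dot notation in the key.

```python
# The nested document from above, written as a Python dictionary.
# The _id values are omitted because the ones shown are placeholders.
lab = {
    "title": "Other Lab",
    "by": "Flatiron School",
    "topics": ["blah", "blah", "blah", "blah"],
    "author": {"name": "Vishal Patel", "building": "11 Broadway"},
}

# Nested fields are reached by chaining keys.
author_name = lab["author"]["name"]
print(author_name)

# A MongoDB filter addresses the same field with dot notation in the key.
nested_filter = {"author.name": "Vishal Patel"}
print(nested_filter)
```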

##### Why would we want to nest objects? 

Multiple Documents = Collection

` { 
  _id: ObjectId(8af37bd7891c), 
  title: 'MongoDB Lab',
  description: 'Introductory lab on how to use MongoDB',
  by: 'Flatiron School',
  topics: ['mongodb', 'database', 'NoSQL', 'JSON']  
   }, 
{ 
  _id: ObjectId(8af37bd78ssc), 
  title: 'Other Lab',
  description: 'Introductory lab on how to use something',
  by: 'Flatiron School',
  topics: ['blah', 'blah', 'blah', 'blah']  
   }
`

# MongoDB

## MongoDB Installation Instructions - 2021


#### MacOS
- **[Video Walk Through - Mac OS](https://youtu.be/OpKEw7093F8)**

- For additional tips, follow the instructions from these 2 links:
    1. https://zellwk.com/blog/install-mongodb/
    2. https://zellwk.com/blog/local-mongodb/
    



#### Windows
- **[Video Walk Through - Windows](https://youtu.be/UJQiGBDKXY0)**

- [Article on Getting MongoDB Working in GitBash](https://teamtreehouse.com/community/how-to-setup-mongodb-on-windows-cmd-or-gitbash-with-shortcuts)

# USING MONGODB

## 📓Initializing Mongodb

- In a terminal window, enter `mongod` to initialize the server. 
    - Just like running a jupyter notebook, the terminal window will remain busy running the server. (LEAVE IT RUNNING)
    
    
- Note for MacOS Catalina users who followed the article above for installation: you need to add the aliases mentioned in the Install MongoDB article for `mongod` to work.

```bash
alias mongod='brew services run mongodb-community'
alias mongod-status='brew services list'
alias mongod-stop='brew services stop mongodb-community'
```

### Working with MongoDB via Terminal:

- Start the server in your terminal using `mongod`
- Start a **second terminal** and enter `mongo` to connect to the server.
- The normal `$` command prompt will become `>`

- Use the `db` command to test whether the connection is working (you will see "test" if it is)

- Use `db.help()` to get a list of available commands.
- `db.test.help()` lists the commands available on the `test` collection.

#### Creating, Reading, Updating, and Deleting (CRUD) Information in MongoDB

Type the following in the terminal running the mongo instance: 

`db.test.save( { a: 1 } )`
- The mongo shell evaluates JavaScript, so the spacing shown here is a readability convention rather than a requirement.
- `{ a: 1 }` is a document with the key `a` and the value `1`.

- The key:value pair with `_id` will appear after running `db.test.find()`


 Let's take a look at how we can write queries or do CRUD operations in Python with the `pymongo` library!
 
> - When done using `mongo`, press Control+C to stop the mongo server

## Working with Mongodb through Python with `pymongo`

1. Import the `pymongo` library. 
(If it is not installed, run `%conda install pymongo` in your notebook.)

2. Create a client that is connected to our running MongoDB server by using the `pymongo` library's `MongoClient` object and passing it the URL for the server (which the mongo server printed as output when we started it up at the very beginning).

3. Get the database that we'll be working with from the `myclient` object -- this can include creating a new database by passing in its name as a key.


In [None]:
#%conda install pymongo
try:
    import pymongo
except ImportError: 
    print("Pymongo not found, running:  %conda install pymongo")
    %conda install pymongo
    import pymongo

In [None]:
myclient = pymongo.MongoClient()  # connects to a local server by default; or pass the URL, e.g. "mongodb://127.0.0.1:27017/"
myclient

- Note that we can get a full list of the names of every database we have by running our client object's `.list_database_names()` method. 

In [None]:
print(myclient.list_database_names())

- Just as a SQL database has tables, a mongo database has **_Collections_** of documents.
- We can get a collection or create a new one by passing its name to the database object we created.



In [None]:
mydb = myclient['example_database']
mydb

We can get collection names by using `mydb.list_collection_names()`

Let's add some data to our database and see what we can do with it. 

In [None]:
mydb.list_collection_names()

In [None]:
mycollection = mydb['example_collection']
mycollection

### CRUD Operations with `pymongo`

To insert a document (in SQL, we would call this a *record*) into a mongoDB collection, we make use of the collection's `.insert_one()` method, and pass in the information we want saved as a Python dictionary. 

In [None]:
example_customer_data = {'name': 'John Doe',
                         'address': '123 elm street',
                         'age': 28}

results = mycollection.insert_one(example_customer_data)
results

When we insert something into mongo, we get back a `results` object. This object contains the unique `_id` of the object we just inserted inside its `.inserted_id` attribute. 

In [None]:
results.inserted_id

If we want to insert 2 or more items at the same time, we can just store the dictionary for each separate record we want to insert in a list and use `.insert_many()` method. 

In [None]:
customer_2 = {'name': 'Jane Doe', 'address': '234 elm street', 
              'age': 7}
customer_3 = {'name': 'Santa Claus', 'address': 'The North Pole',
              'age': 547}
customer_4 = {'name': 'John Doe jr.', 'address': '', 'age': 0.5}
customer_5 = {'name': 'John Doe jr.', 'address': '', 'age': 0.5}

list_of_customers = [customer_2, customer_3, customer_4,customer_5]

results_2 = mycollection.insert_many(list_of_customers)

In [None]:
results_2.inserted_ids
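
In practice the list passed to `.insert_many()` is often built from raw rows rather than typed by hand. The sketch below assumes made-up rows and field names, and only builds the list of dictionaries; the resulting `new_customers` list has the shape `.insert_many()` expects.

```python
# Hypothetical raw rows, e.g. read from a CSV file.
fields = ("name", "address", "age")
rows = [
    ("Jane Doe", "234 elm street", 7),
    ("Santa Claus", "The North Pole", 547),
]

# One dictionary per row; this list is the shape insert_many() expects.
new_customers = [dict(zip(fields, row)) for row in rows]
print(new_customers)
# With a running server: mycollection.insert_many(new_customers)
```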

## Querying data in MongoDB 

- The quickest and easiest way to get data from a collection is to use the collection object's `.find()` method!


### 📒📔Using `.find()`

- We call `.find()` on collection objects.
- To get all results, pass in an empty dictionary:
`mycollection.find({})`

In [None]:
query_1 = mycollection.find({})
for x in query_1:
    print(x)

In [None]:
## Note: instead of a loop, you can wrap the .find 
## with the list function to make a list of results
list(mycollection.find({}))
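
The object returned by `.find()` behaves like an iterator: once we loop over it, it is used up, which is why the cells in this notebook run the query again each time. The sketch below is only an analogy using a plain Python iterator over made-up documents. It is not `pymongo` code.

```python
# Stand-in for a query result: an iterator over documents (not a real cursor).
documents = [{"name": "John Doe"}, {"name": "Jane Doe"}]
fake_cursor = iter(documents)

first_pass = [doc["name"] for doc in fake_cursor]
second_pass = [doc["name"] for doc in fake_cursor]  # already exhausted

print(first_pass)
print(second_pass)
```

Wrapping the result in `list(...)` once and reusing that list avoids the problem.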

### Filtering columns:

- Pass a second dictionary with column names as keys and `1` as the value for each column you do want
- Pass `0` as the value for each column you don't want
- In query 3, we exclude only age, so we get everything else:
`query_3 = mycollection.find({}, {'age': 0})`

> - A given dictionary can either include or exclude columns, not both (so all values must be 1 to select columns, or all must be 0 to exclude columns). The one exception is `_id`, which can be set to 0 alongside included columns.

In [None]:
query_1 = mycollection.find({})
for x in query_1:
    print(x)

- In the cell above, we grabbed every field from every item in the entire collection.
- **What if we want to get all the names and addresses for each customer, but not the age?** There are two ways we can do this. 
    1. By passing in a dictionary specifying the fields we want, like so:
    ```python 
mycollection.find({}, {'_id': 1, 'name': 1, 'address': 1})
```

    2. By passing in a dictionary specifying the fields we DON'T want. 
    ```python
mycollection.find({}, {'age': 0})
```

In [None]:
## Explicitly selecting some columns
query_2 = mycollection.find({}, {'_id': 1, 'name': 1, 'address': 1})
for item in query_2:
    print(item)

In [None]:
## Explicitly excluding age
query_3 = mycollection.find({}, {'age': 0})
for item in query_3:
    print(item)
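
The second dictionary is an ordinary Python dict, so it can be built by a small helper. The function below is a hypothetical convenience, not part of `pymongo`. It builds an inclusion dictionary and, unless told otherwise, also drops `_id`, the one field that may be set to 0 alongside included fields.

```python
def include_fields(*names, keep_id=False):
    """Build an inclusion dictionary: listed fields get 1, _id is dropped unless kept."""
    projection = {name: 1 for name in names}
    if not keep_id:
        # _id is the one field that may be 0 alongside included fields.
        projection["_id"] = 0
    return projection


projection = include_fields("name", "address")
print(projection)
# With a running server: mycollection.find({}, projection)
```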

### Filtering Query Results

- We'll rarely want to get all the records at once. 

- If we know the value for a given key, we can pass that key-value pair (or pairs) into `.find()` as a dictionary, and the results will contain the entire document. 

In [None]:
query_4 = mycollection.find({'name': 'Santa Claus'})
for item in query_4:
    print(item)

- We can also filter queries by using **_Modifiers_** (MongoDB's documentation calls these query operators). 
- Suppose we want the record for every person in our collection older than 20. We can signify this with the 'greater than' modifier, `"$gt"`, and pass in the corresponding value. 

In [None]:
query_5 = mycollection.find({"age": {"$gt": 20}})
for item in query_5:
    print(item)

In [None]:
list(mycollection.find({"age": {"$gt": 20}}))

### 📓 MongoDB Notation/Modifiers

https://docs.mongodb.com/manual/reference/operator/query-modifier/

| symbol | action |
| --- | --- |
| "\$gt" | greater than |
| "\$lt" | less than |
| "\$ne" | not equal to |
| "\$set" | sets fields on a matching record, as specified by <br> a {'col': value_to_insert} dictionary |
| "\$sum" | sum of values (used when aggregating) |
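
Because filters are plain dictionaries, several modifiers can be combined before anything is sent to the server. The helper names below are hypothetical. Two modifiers on the same field give a range, and conditions on different fields in one dictionary must all hold.

```python
def age_between(low, high):
    """Filter for documents whose age is strictly between low and high."""
    return {"age": {"$gt": low, "$lt": high}}


def name_is_not(name):
    """Filter for documents whose name differs from the given one."""
    return {"name": {"$ne": name}}


# Conditions on different fields in one dictionary must all hold.
combined = {**age_between(20, 100), **name_is_not("John Doe")}
print(combined)
# With a running server: list(mycollection.find(combined))
```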

## Updating Documents

- Updating a record works like filtering a query with a specific value, although we also pass in an additional dictionary as the second parameter.
- This second parameter will contain the modifier `'$set'` as the key, and a dictionary containing the key-value pair we want to update. 

In [None]:
## Update birthday and age for John Doe
record_to_update = {'name' : 'John Doe'}
update_1 = {'$set': {'age': 29}}
update_2 = {'$set': {'birthday': '02/20/1986'}}

mycollection.update_one(record_to_update, update_1)
mycollection.update_one(record_to_update, update_2)
query_6 = mycollection.find({'name': 'John Doe'})
for item in query_6:
    print(item)
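
The two updates above could also be sent as one, since a single `'$set'` dictionary may hold several key-value pairs. A minimal sketch, with `set_fields` as a hypothetical helper that only builds the update dictionary:

```python
def set_fields(**fields):
    """Wrap keyword arguments in a single $set update dictionary."""
    return {"$set": dict(fields)}


update_both = set_fields(age=29, birthday="02/20/1986")
print(update_both)
# With a running server:
# mycollection.update_one({'name': 'John Doe'}, update_both)
```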

### Deleting Records

- We can delete records by using the collection object's `.delete_*()` methods. 
    -  `delete_one()` for a single deletion, 
    - `delete_many()` for multiple deletions.

- Let's try deleting the record for `'John Doe'`:

In [None]:
deletion_1 = mycollection.delete_one({'name': 'John Doe'})
print(deletion_1.deleted_count)

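- Note that we can also use modifiers here, too! 
    - For instance, in the cell below, we'll delete the first record found for a customer younger than 10. (`delete_one()` removes at most one match; `delete_many()` with the same filter would remove all of them.)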

In [None]:
deletion_2 = mycollection.delete_one({'age': {'$lt': 10}})
print(deletion_2.deleted_count)

### Deleting By ID

In [None]:
johns =list(mycollection.find({"name":'John Doe jr.'}))
johns

In [None]:
## Passing the id as a plain string matches nothing, so deleted_count is 0
deleted = mycollection.delete_one({'_id': '5ebc6da07d8c59d444d93d85'})
deleted.deleted_count

In [None]:
print(johns[0]['_id'])

- To delete via a document's ID, you must first import an object called `ObjectId` from `bson` (binary JSON).
- Get the "`_id`" of the document to delete.
- To delete, use `mycollection.delete_one({'_id': ObjectId(<insert id here>)})`
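
Why is the plain string not enough? An `ObjectId` is a 12-byte value that prints as 24 hexadecimal characters, and its first 4 bytes store the creation time in seconds. A string with the same characters is a different type, so it does not match. The sketch below uses only the standard library, with hypothetical helper names, to check the printed form and decode that timestamp for the id used in this section.

```python
import string
from datetime import datetime, timezone


def looks_like_object_id(text):
    """True if text is 24 hexadecimal characters (the printed form of an ObjectId)."""
    return len(text) == 24 and all(ch in string.hexdigits for ch in text)


def object_id_created_at(text):
    """Decode the creation time stored in the first 4 bytes (8 hex characters)."""
    seconds = int(text[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


example_id = "5ebc6da07d8c59d444d93d85"  # the id used in this section
print(looks_like_object_id(example_id))
print(object_id_created_at(example_id).date())
```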

In [None]:
from bson import ObjectId
## Replace the id below with one printed from your own collection
query = {'_id': ObjectId('5ebc6da07d8c59d444d93d85')}
list(mycollection.find(query))

In [None]:
deletion_3 = mycollection.delete_one(query)
deletion_3.deleted_count

In [None]:
johns =list(mycollection.find({"name":'John Doe jr.'}))
johns

In [None]:
deleted = mycollection.delete_many({})
deleted.deleted_count

In [None]:
res = mycollection.find({})
for x in res:
    print(x)

# Demo/Activity: Storing Instagram Puppies with MongoDB

> We will use some advanced web scraping to grab the URLs and captions for puppy posts from Instagram.

> #### To run this notebook on your own computer:
> - Open `password.py` and enter your Instagram username and password (this file is already added to the repo's .gitignore so it won't be pushed to GitHub) 
> - If it doesn't exist yet, the following cell will create it.

In [None]:
## check for password.py (don't modify this cell!!!)
import os 
FILE = './password.py'
if not os.path.exists(FILE):
    print(f"[!] {FILE} does not exist.")
    with open(FILE,'w') as f:
        f.write("""password = ''
username = ''
""")
    print(f"- {FILE} created. \n\tPlease modify with your personal login")
else:
    print(f'[!] {FILE} already exists. \n\tPlease verify it contains your personal login')

In [None]:
## Once you've added your username and password to password.py, run the following:
import password as pw

### A Taste of Advanced WebScraping

- We will learn basic web scraping in Topic 10, where we will be using `BeautifulSoup`. 
- Note that if we want to access data from a website that requires a login, we will need a more advanced toolkit.


- **Introducing [Selenium](https://selenium-python.readthedocs.io/) and its WebDriver class.**
   - [Getting Started](https://selenium-python.readthedocs.io/getting-started.html) Documentation

In [None]:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time

- Make sure you have the Chrome Web Driver installed (https://chromedriver.chromium.org/downloads), as well as selenium
- Follow the instructions [here for saving the .exe to your path.](https://www.selenium.dev/documentation/en/webdriver/driver_requirements/#adding-executables-to-your-path)

In [None]:
## Open a python-controlled chrome browser
driver = webdriver.Chrome("/opt/WebDriver/bin/chromedriver")

## Go to instagram's login page
driver.get("https://www.instagram.com/accounts/login/")
time.sleep(2)

## Find the Email and Password elements
email_input = driver.find_element(By.XPATH, "//input[@name='username']")
password_input = driver.find_element(By.XPATH, "//input[@name='password']")
time.sleep(2)
email_input

In [None]:
## Send username and password to login fields
email_input.send_keys(pw.username)
password_input.send_keys(pw.password)

## Find and click the submit button
login = driver.find_element(By.XPATH, '//*[@id="loginForm"]/div/div[3]/button')
login.click()

In [None]:
## Click through the 2 Not Now screens
for i in range(2):
    try: 
        not_now = WebDriverWait(driver, 15).until(
            lambda d: d.find_element(By.XPATH, '//button[text()="Not Now"]')
        )
        not_now.click()
    except Exception:
        ## The prompt did not appear before the timeout, so move on
        pass

> Now that we have logged in, let's explore the puppies tag

In [None]:
## open puppies tag page
driver.get("https://www.instagram.com/explore/tags/puppies/")

In [None]:
## Save the html source code as a BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')

## close the webdriver's chrome window
driver.close()

> - The variable soup now contains a beautiful soup object of all the html elements related to the image grid on Instagram. 
> - Loop over this object and store the image url and the category text into your MongoDB. 
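
As a rough sketch of the loop described above, the example below pulls the `src` and `alt` attributes out of every `img` tag and builds one dictionary per dog. To stay runnable without a browser or extra packages, it uses the standard library's `html.parser` instead of `BeautifulSoup`, and a made-up HTML snippet stands in for `driver.page_source`. The class and variable names are hypothetical.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src and alt attributes of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            self.images.append(
                {"src": attributes.get("src"), "alt": attributes.get("alt")}
            )


# Made-up HTML standing in for driver.page_source.
sample_html = """
<div>
  <img src="https://example.com/dog1.jpg" alt="A puppy on a couch">
  <img src="https://example.com/dog2.jpg" alt="Two puppies playing">
</div>
"""

collector = ImageCollector()
collector.feed(sample_html)

# Number each dog and keep its image url and alt text.
dogs = [{"Dog #": i, **image} for i, image in enumerate(collector.images, start=1)]
print(dogs)
# With a running server: insta_collection.insert_many(dogs)
```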

In [None]:
## Find all 'img' tags  and list how many tags in list


In [None]:
## select an image # and slice out 1 image


In [None]:
## Display an image of a dog
from PIL import Image
import requests
from io import BytesIO

def show_web_image(url):
    """Returns a PIL Image of the provided url
     Code from https://stackoverflow.com/questions/7391945/how-do-i-read-image-data-from-a-url-in-python"""
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    return img

In [None]:
## Create a mongodb client and an insta_db

# create your insta_collection


In [None]:
## Let's find the url and alt text for 1 dog


In [None]:
## View the image from the url


In [None]:
## Insert One Dog into db
# use 'src','alt', and 'Dog #' as keys


In [None]:
## Check if our dog was inserted


In [None]:
## Insert all dogs by first generating a list of dicts


In [None]:
## insert the dogs dict using insert_many


In [None]:
## In a final loop, get all dogs, print the dog # & alt text and show the image


___

# Appendix - Detailed Installation Instructions

### Instructions for Windows Users

#### Resources for Installing MongoDB - Windows 10
- [Video on installing MongoDB by Downloading from Mongo](https://youtu.be/UJQiGBDKXY0) **updated url 03/04/21**

- [Article on Getting MongoDB Working in GitBash](https://teamtreehouse.com/community/how-to-setup-mongodb-on-windows-cmd-or-gitbash-with-shortcuts)




#### Installation Steps
1. Download and install MongoDB from  https://www.mongodb.com/download-center/community 
    - Watch the video above for instructions on what options to select.
    
    
2. After installation is complete, open a GitBash Terminal As Administrator
    - In the Start Menu find and right click on GitBash > Run as Administrator. 


3. You must add the bin folder inside of MongoDB's Program Files to your system path.
    - To edit system path (see above article) type the following in GitBash:
    ```shell
    rundll32 sysdm.cpl,EditEnvironmentVariables
    ```
    - In the window that pops up, edit Path and add `C:/Program Files/MongoDB/<whatever version is installed>/bin` to the Path variable.
    - Close the settings window when complete.
    
    
4. You must manually create the `c:/data` and `c:/data/db` folders SEPARATELY
    - Use GitBash to create c:/data FIRST, cd into data and then mkdir db  
    - This replaces the official step in the notebook below (`sudo mkdir -p /data/db`)


5. Give the directory the correct permission: 
```bash
sudo chown -R `id -un` /data/db
```

6. (You may still need to run `conda install mongodb` even though we installed it directly.)

7. Finally, make sure that pymongo is installed 
```bash
conda install pymongo
```

8. Now you should be all set and `mongod` should work!

### Instructions for MacOS Catalina


- Follow the instructions from these 2 links:
    1. https://zellwk.com/blog/install-mongodb/
    2. https://zellwk.com/blog/local-mongodb/
    
#### Summary: 

1.  Install Homebrew
```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```

2. Install mongodb
```bash
brew tap mongodb/brew
brew install mongodb-community
```