****
# MongoDB connection using PyMongo
****

## About this notebook: 
Notebook prepared by **Jesus Perez Colino** Version 0.1, First Released: 01/12/2014, Alpha.  

- This work is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US). This work is offered for free, with the hope that it will be useful.


- **Summary**: This notebook contains a brief introduction to **MongoDB** and **PyMongo** with data scraping examples, using **Scrapy**.


- **Python and package versions** needed to reproduce the results of this notebook: 

In [1]:
from datetime import datetime, timedelta
import pymongo
import scrapy
from sys import version
from pymongo import MongoClient
print ' Reproducibility conditions for this notebook '.center(90,'-')
print 'Python version:       ' + version
print 'Pymongo version:      ' + pymongo.version
print 'Scrapy version:       ' + scrapy.__version__
print '-'*90

---------------------- Reproducibility conditions for this notebook ----------------------
Python version:       2.7.10 |Anaconda 2.3.0 (x86_64)| (default, Sep 15 2015, 14:29:08) 
[GCC 4.2.1 (Apple Inc. build 5577)]
Pymongo version:      3.0.3
Scrapy version:       0.20.2
------------------------------------------------------------------------------------------


# Basics about MongoDB with PyMongo

First, open a connection to the MongoDB server:

In [42]:
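try: 
    client = MongoClient("localhost", 27017)
    # PyMongo 3 connects lazily, so run a cheap command to force a round trip
    # and surface ConnectionFailure here if the server is unreachable.
    client.admin.command("ismaster")
    print "Connected to MongoDB in:", client
except pymongo.errors.ConnectionFailure, e:
    print "Could not connect to MongoDB: %s" % e 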

Connected to MongoDB in: MongoClient('localhost', 27017)


MongoDB creates **databases** and **collections** automatically if they do not already exist. A single MongoDB instance *can support multiple independent databases*. 

When working with PyMongo, you access databases using attribute-style access:

In [43]:
db = client.test_database
print db

Database(MongoClient('localhost', 27017), u'test_database')
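
Attribute-style access is a convenience: `client.test_database` and the dictionary-style `client['test_database']` refer to the same database, which helps when a name is not a valid Python identifier. The toy class below is a hypothetical, pure-Python sketch of how such dual access can be built with `__getattr__` and `__getitem__`. `ToyDatabase` is an illustrative name, not PyMongo code, and the real classes do considerably more.

```python
from __future__ import print_function


class ToyDatabase(object):
    """Hypothetical stand-in for a database handle (not PyMongo code)."""

    def __init__(self, name):
        self.name = name
        self._collections = {}

    def __getitem__(self, collection_name):
        # Dictionary-style access: create the entry lazily on first use.
        return self._collections.setdefault(collection_name, [])

    def __getattr__(self, collection_name):
        # Called only when normal attribute lookup fails, so any unknown
        # public attribute name is treated as a collection name.
        if collection_name.startswith("_"):
            raise AttributeError(collection_name)
        return self[collection_name]


toy_db = ToyDatabase("test_database")
toy_db.test.append({"x": 1})          # attribute-style access
same_object = toy_db["test"] is toy_db.test  # dictionary-style access
print("Both styles reach the same collection: %s" % same_object)
```

One consequence of this design is that a mistyped attribute name silently yields a new, empty collection instead of raising an error.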


A **collection** is a *group of documents* stored in MongoDB, and can be thought of as roughly the equivalent of a table in a relational database. 

In [114]:
# Drop existing collections to avoid collisions with data left over from previous sessions:
for name in db.collection_names():
    if name != 'system.indexes':
        db.drop_collection(name)

db.collection_names()

[u'system.indexes']

Getting a collection in PyMongo works the same as getting a database:

In [115]:
db.create_collection("test")

document = {"x": "jpcolino", "tags": ["author", "developer", "tester"]}

db.test.insert_one(document)

<pymongo.results.InsertOneResult at 0x10656bb40>

In [116]:
print '-'*75
print 'Databases open in client: ', client.database_names()
print 'Collection names in db:   ', db.collection_names()
print '-'*75

---------------------------------------------------------------------------
Databases open in client:  [u'local', u'test_database']
Collection names in db:    [u'system.indexes', u'test']
---------------------------------------------------------------------------


In [84]:
result = db.test.insert_many([{"x": 1, "tags": ["dog", "cat"]},
                              {"x": 2, "tags": ["cat"]},
                              {"x": 2, "tags": ["mouse", "cat", "dog"]},
                              {"x": 3, "tags": []}])

In [113]:

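print 'Name of the collection: \n', db.test.name
print '-'*50
# Collection has no `acknowledged` attribute, so attribute-style access
# returns the sub-collection `test.acknowledged`; the write acknowledgement
# itself is available as `result.acknowledged`.
print 'Sub-collection object: \n', db.test.acknowledged
print '-'*50
print result.inserted_ids
print '-'*50
print db.test.find_one()
print '-'*50
for d in db.test.find()[1:]:
    print d
print '-'*50
# Likewise, db.test['x'] is the sub-collection `test.x`, not the values of field x.
print db.test['x']
print '-'*50
print db.test['tags']
print '-'*50

Name of the collection: 
test
--------------------------------------------------
Sub-collection object: 
Collection(Database(MongoClient('localhost', 27017), u'test_database'), u'test.acknowledged')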
--------------------------------------------------
[ObjectId('56250c44c47fab411aa37826'), ObjectId('56250c44c47fab411aa37827'), ObjectId('56250c44c47fab411aa37828'), ObjectId('56250c44c47fab411aa37829')]
--------------------------------------------------
{u'x': u'jpcolino', u'_id': ObjectId('56250c42c47fab411aa37825'), u'tags': [u'author', u'developer', u'tester']}
--------------------------------------------------
{u'x': 1, u'_id': ObjectId('56250c44c47fab411aa37826'), u'tags': [u'dog', u'cat']}
{u'x': 2, u'_id': ObjectId('56250c44c47fab411aa37827'), u'tags': [u'cat']}
{u'x': 2, u'_id': ObjectId('56250c44c47fab411aa37828'), u'tags': [u'mouse', u'cat', u'dog']}
{u'x': 3, u'_id': ObjectId('56250c44c47fab411aa37829'), u'tags': []}
--------------------------------------------------
Collection(Databa

Here we count documents, first across the whole collection and then only those matching a simple equality filter: 

In [103]:
print 'Number of Documents: ', db.test.count()
print '-'*50
print 'Number of Documents where x = 2: ', db.test.find({"x": 2}).count()

Number of Documents:  5
--------------------------------------------------
Number of Documents where x = 2:  2


Queries can also use special query operators, which are written with a leading `$`. These operators include `$gt`, `$gte`, `$lt`, `$lte`, `$ne`, `$nin`, `$regex`, `$exists`, `$not`, `$or`, and many more. 
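
To make the operator syntax concrete, the sketch below builds a few filter dictionaries of the kind that would be passed to `db.test.find(...)`, and evaluates them against the notebook's sample documents with a small pure-Python matcher. The matcher (`matches`, `count`, `is_number`) is a hypothetical model written for illustration, not MongoDB's implementation. It handles only scalar fields and a handful of operators, and it assumes that the ordering operators compare numbers only with numbers. The counts it prints come from this model and have not been checked against a server.

```python
from __future__ import print_function
import operator

# The sample documents inserted earlier in this notebook (without _id).
documents = [
    {"x": "jpcolino", "tags": ["author", "developer", "tester"]},
    {"x": 1, "tags": ["dog", "cat"]},
    {"x": 2, "tags": ["cat"]},
    {"x": 2, "tags": ["mouse", "cat", "dog"]},
    {"x": 3, "tags": []},
]

# Ordering operators modelled here; the real server supports many more.
COMPARISONS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def is_number(value):
    """True for ints and floats, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def matches(document, query):
    """Return True if the document satisfies every condition in `query`."""
    for field, conditions in query.items():
        value = document.get(field)
        for op, operand in conditions.items():
            if op in COMPARISONS:
                # Assumption: ordering comparisons apply to numbers only.
                if not (is_number(value) and is_number(operand)):
                    return False
                if not COMPARISONS[op](value, operand):
                    return False
            elif op == "$ne":
                if value == operand:
                    return False
            elif op == "$nin":
                if value in operand:
                    return False
            elif op == "$exists":
                if (field in document) != bool(operand):
                    return False
            else:
                raise ValueError("operator not modelled: %s" % op)
    return True


def count(query):
    """Count the sample documents that match `query`."""
    return sum(1 for doc in documents if matches(doc, query))


queries = [
    {"x": {"$gt": 1}},
    {"x": {"$gte": 2, "$lt": 3}},   # several conditions on one field
    {"x": {"$ne": 2}},
    {"x": {"$nin": [1, 2]}},
    {"y": {"$exists": False}},
]
for query in queries:
    print("%r -> %d document(s)" % (query, count(query)))
```

Each filter is an ordinary Python dictionary, so filters can be built, combined, and stored like any other data before being handed to `find`.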

Additionally, we can use **regular expressions**: 

In [108]:
# Use a regex to count the documents that have a tag matching 'cat'
import re
regex = re.compile(r'cat')
rstats = db.test.find({"tags":regex}).count()
print rstats

3
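
The count of 3 follows from two matching rules: a regular expression is unanchored unless the pattern says otherwise, and a filter on an array field such as `tags` matches when at least one element matches. The sketch below reproduces that logic in pure Python on the notebook's sample documents. `count_tag_matches` is a hypothetical helper written for illustration, and the second pattern (`^do`) is an added example rather than part of the original query.

```python
from __future__ import print_function
import re

# The sample documents inserted earlier in this notebook (without _id).
documents = [
    {"x": "jpcolino", "tags": ["author", "developer", "tester"]},
    {"x": 1, "tags": ["dog", "cat"]},
    {"x": 2, "tags": ["cat"]},
    {"x": 2, "tags": ["mouse", "cat", "dog"]},
    {"x": 3, "tags": []},
]


def count_tag_matches(pattern):
    """Count documents with at least one tag that `pattern` matches anywhere."""
    regex = re.compile(pattern)
    return sum(
        1 for doc in documents
        if any(regex.search(tag) for tag in doc["tags"])
    )


cat_count = count_tag_matches(r'cat')      # unanchored: matches inside a tag
anchored_count = count_tag_matches(r'^do')  # anchored: tag must start with "do"
print("Documents with a tag matching 'cat': %d" % cat_count)
print("Documents with a tag starting with 'do': %d" % anchored_count)
```

Because the pattern is unanchored, `cat` would also match a tag such as `category`; anchoring with `^cat$` restricts the match to the exact tag.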
