In [2]:
from pymongo import MongoClient
import pprint
# Connect to the hosted MongoDB instance
client = MongoClient('localhost', 27017)
db = client['clicks']
log = db['log']

# MongoDB
MongoDB is a popular NoSQL database.  Its loose structure makes it well suited to capturing unstructured data, such as that encountered in web scraping.  This sprint focuses on getting up and running with MongoDB.  It is intended to be an individual sprint.


### Using Mongo with Docker
You are strongly encouraged to get used to Docker.  See [Using Mongo with Docker](using_mongo_with_docker.md) for detailed instructions.  If you would rather install MongoDB directly, see the instructions at the end of the assignment, although this is not recommended.


## Practicing Mongo Queries 

To get familiar with MongoDB, we are going to load in some click-log data from 
a government website and do some basic queries on it. Write your queries in a 
text file. Paste and run the queries in the Mongo shell.

1. Open a ***bash terminal in Docker***, navigate to the directory containing the data, and load the data with the command below (for more detailed directions, [see here](using_mongo_with_docker.md)):    
   `mongoimport --db clicks --collection log < click_log.json`

2. **In the Mongo shell on Docker**, run `show dbs;` to make sure the `clicks` database has 
   been created. Run `use clicks;` to use the `clicks` database for your 
   queries.

3. Inspect the `log` collection in your database. How many entries are in the 
   `log` collection? 
   
   If you are not sure about what command to use, you can access the help 
   section by:
    - `help`
    - `db.help()`
    - `db.<collection_name>.help()`

   Mongo also has tab complete, so you can tab complete some of your commands 
   for convenience.  

In [5]:
log.count_documents({})

3069
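
As a sanity check on the count above, the number of documents in the collection can be compared with the number of records in the import file. The sketch below assumes `click_log.json` holds one JSON document per line, which is the layout `mongoimport` reads when `--jsonArray` is not passed. The helper name `count_json_records` is hypothetical.

```python
import json

def count_json_records(path):
    """Count the non-empty lines in `path` that parse as JSON objects."""
    count = 0
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)  # raises ValueError on malformed input
            if isinstance(record, dict):
                count += 1
    return count
```

Calling `count_json_records('click_log.json')` from the directory that holds the data should return the same number as `log.count_documents({})` if every record was imported.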

4. Print out all of the clicks you have stored using `.find()`. Now, using 
   `.limit()`, return 10 entries. You can also use `.findOne()` to quickly 
   view the first document and examine the available fields.  

In [37]:
len(list(log.find()))

3069

In [16]:
log.find_one()

{'_id': ObjectId('5e838dea6d350b74698b840b'),
 'a': 'Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_3 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Mobile/10B329',
 'c': 'US',
 'nk': 0,
 'tz': 'America/Los_Angeles',
 'gr': 'CA',
 'g': '1084Psg',
 'h': '19Cztuz',
 'l': 'tweetdeckapi',
 'al': 'en-us',
 'hh': '1.usa.gov',
 'r': 'http://t.co/btKvKFBaF5',
 'u': 'http://science.nasa.gov/science-news/science-at-nasa/2013/16may_lunarimpact/',
 't': 1368774599,
 'hc': 1368774179,
 'cy': 'Palm Desert',
 'll': [33.7724, -116.345802]}
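
Step 4 also asks for a 10-entry result, which the cells above do not show. A minimal sketch, assuming `collection` is a PyMongo collection such as `log` from the first cell; the helper name `preview` is hypothetical. `Cursor.limit()` is the PyMongo counterpart of the shell's `.limit()`.

```python
def preview(collection, n=10):
    """Return at most `n` documents from `collection` as a list."""
    # find() with no filter matches every document; limit() caps the cursor.
    return list(collection.find().limit(n))
```

For example, `pprint.pprint(preview(log))` would print the first 10 documents in a readable layout.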

5. Use `.find()` to find all the clicks where `cy` (city) is `San Francisco`. 
   How many are there?

In [20]:
# Count the clicks whose city field is San Francisco
log.count_documents({'cy': 'San Francisco'})

6. Use `.distinct()` to find all the distinct types of web browsers (under the 
   field `a`) people use to visit the sites. Count the number of distinct web 
   browsers (use `.length` on your distinct list).

In [23]:
len(log.distinct('a'))

559

7. Select and count the records where the users have visited a website either 
   from a `Mozilla` or an `Opera` web browser. Search the `a` field using 
   [regex in mongo][mongo-like-query]. 

In [38]:
# 'i' makes the match case-insensitive; 'g' is not a MongoDB regex option
log.count_documents({'a': {'$regex': 'oper|moz', '$options': 'i'}})

2830
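
To see what the pattern in this query matches, the same alternation can be tried with Python's `re` module. This is only an approximation: MongoDB evaluates `$regex` with its own PCRE-based engine, but for a simple case-insensitive alternation like this one the two should agree. The first user-agent string is taken from the `find_one()` output above; the other two are made-up examples.

```python
import re

# Case-insensitive match on 'oper' or 'moz', as in the query above
pattern = re.compile(r'oper|moz', re.IGNORECASE)

agents = [
    'Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_3 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Mobile/10B329',
    'Opera/9.80 (Windows NT 6.1) Presto/2.12.388 Version/12.15',  # made-up example
    'curl/7.29.0',  # made-up example
]

# re.search scans the whole string, like an unanchored $regex
matches = [agent for agent in agents if pattern.search(agent)]
print(len(matches))  # prints 2
```

Note that the pattern matches any string containing `oper` or `moz`, not only strings that start with a browser name.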

8. Find the type of the `t` (timestamp) field. You can access the type of a 
   field in an entry by using `typeof db.log.findOne({'t': {$exists: true}}).t`. 
   The field should be a `number` now.
   
   Convert the timestamp field to the date type. You will need to multiply the 
   number by 1000 and then make it a `Date` object (you can create a `Date` 
   object by using `new Date()`). You can loop over each record using 
   `.forEach()` and then [`.update()`][mongo-update] the record (using the `_id`
   field) with the created `Date` object. When you're done, confirm that the 
   data type has been converted. Below is some template code. 

   ```javascript
   db.log.find({'t': {$exists: true}}).forEach(function(entry) {
      // your code to update an entry by _id and set the t field as a new 
      //  Date() object
   })
   ```

In [42]:
# Python has no typeof; use type() and index the returned dict with ['t']
type(log.find_one({'t': {'$exists': True}})['t'])
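
The shell template above translates to PyMongo roughly as follows. This is a sketch, not the assignment's reference solution: the helper name `convert_timestamps` is hypothetical, and `collection` is assumed to be a PyMongo collection such as `log`. Python's `datetime.fromtimestamp()` takes seconds, so the multiplication by 1000 that JavaScript's `Date` needs does not apply here.

```python
from datetime import datetime, timezone

def convert_timestamps(collection):
    """Replace numeric `t` values (epoch seconds) with datetime objects."""
    converted = 0
    for entry in collection.find({'t': {'$exists': True}}):
        t = entry['t']
        if not isinstance(t, (int, float)):
            continue  # already converted, or not a number
        as_date = datetime.fromtimestamp(t, tz=timezone.utc)
        # Update this one document by _id, setting only the t field.
        collection.update_one({'_id': entry['_id']}, {'$set': {'t': as_date}})
        converted += 1
    return converted

# The timestamp from the find_one() output above:
print(datetime.fromtimestamp(1368774599, tz=timezone.utc))
# prints 2013-05-17 07:09:59+00:00
```

After running `convert_timestamps(log)`, `type(log.find_one({'t': {'$exists': True}})['t'])` should report `datetime.datetime`. The `isinstance` check lets the function be rerun without failing on documents that were already converted.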