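# MongoDB Notes

### MongoDB

PyMongo is thread-safe but not fork-safe. Sharing one client instance across multiple processes causes errors.

Solution:
Pass the keyword argument `connect=False` when connecting to MongoDB, so that no connection is made when the client is instantiated; the connection is established only when the first database operation runs.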

In [None]:
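from pymongo import MongoClient


class MongoBaseModel(object):

    def __init__(self):
        # MONGODB is the settings dict (user, passwd, host, dbname)
        self.client = MongoClient(
            'mongodb://{}:{}@{}/{}'.format(MONGODB['user'],
                                           MONGODB['passwd'],
                                           MONGODB['host'],
                                           MONGODB['dbname']),
            connect=False  # defer connecting until the first operation
        )
        self.mongodb = self.client[MONGODB['dbname']]

    def __del__(self):
        # Close the client when the model is garbage-collected
        self.client.close()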
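
A complementary pattern, sketched here as an assumption rather than something taken from these notes, is to keep one client per process ID so that a forked child never reuses the parent's client. `PerProcessClient` and `factory` are hypothetical names; in real use `factory` might be something like `lambda: MongoClient(uri, connect=False)`.

```python
import os


class PerProcessClient(object):
    """Lazily create one client per process, keyed by the current PID."""

    def __init__(self, factory):
        # factory: zero-argument callable that builds a new client,
        # e.g. lambda: MongoClient(uri, connect=False)  (assumed usage)
        self._factory = factory
        self._pid = None
        self._client = None

    def get(self):
        pid = os.getpid()
        # After a fork the PID changes, so the child builds its own client
        if self._client is None or self._pid != pid:
            self._client = self._factory()
            self._pid = pid
        return self._client
```

Calling `get()` repeatedly inside one process returns the same object; a child process would get a fresh one on its first call.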

### Query the latest record by a specified field

In [26]:
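cursor = movie.find({}).sort('_id', -1).limit(1)
# A cursor is always truthy, so fetch the documents before testing for a match
docs = list(cursor)
if docs:
    pprint(docs[0])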

{'_id': ObjectId('5af8f6e52b5eba0bb8fece6b'), 'level': 4, 'score': 27}


**Note:** `limit` returns a cursor (an iterator), not the documents themselves.
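
As a small illustration of this note, the helper below pulls the first document out of any iterable cursor and returns `None` when nothing matches. `first_or_none` is a hypothetical name, and plain Python iterators stand in for a PyMongo cursor so the sketch runs without a database.

```python
def first_or_none(cursor):
    """Return the first document of an iterable cursor, or None if it is empty."""
    # next() with a default avoids the IndexError that cursor[0] raises on no match
    return next(iter(cursor), None)


# Plain iterators stand in for a PyMongo cursor here
print(first_or_none(iter([{'level': 4, 'score': 27}])))
print(first_or_none(iter([])))
```

With a real collection the call would look like `first_or_none(movie.find({}).sort('_id', -1).limit(1))`.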

### Sorting query results

In [None]:
import pymongo

db.Account.find({}).sort("UserName")  # ascending by default
db.Account.find({}).sort("UserName", pymongo.ASCENDING)   # ascending
db.Account.find({}).sort("UserName", pymongo.DESCENDING)  # descending

### Sorting by multiple fields

In [None]:
db.Account.find().sort([("UserName",pymongo.ASCENDING),("Email",pymongo.DESCENDING)])
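
To show what ordering a multi-field sort specification produces, here is an in-memory sketch on plain dicts. It only emulates the resulting order and says nothing about how the server sorts internally; `sort_documents` and the sample accounts are made up for illustration.

```python
ASCENDING, DESCENDING = 1, -1  # same values as pymongo.ASCENDING / pymongo.DESCENDING


def sort_documents(docs, key_spec):
    """Sort dicts by several (field, direction) pairs, first pair most significant."""
    result = list(docs)
    # Python's sort is stable, so apply the least significant key first
    for field, direction in reversed(key_spec):
        result.sort(key=lambda d: d[field], reverse=(direction == DESCENDING))
    return result


accounts = [
    {'UserName': 'bob', 'Email': 'a@example.com'},
    {'UserName': 'alice', 'Email': 'a@example.com'},
    {'UserName': 'alice', 'Email': 'z@example.com'},
]
ordered = sort_documents(accounts, [('UserName', ASCENDING), ('Email', DESCENDING)])
for doc in ordered:
    print(doc['UserName'], doc['Email'])
```

The two `alice` rows come first, with `z@example.com` ahead of `a@example.com`, followed by `bob`.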

### Nested queries

Query a key inside a subdocument.

Document structure:
```json
{
    "_id" : ObjectId("5af14ce42b5eba2fc08cfcd6"),
    "request" : {
        "headers" : {},
        "url" : "https://www.baidu.com/",
        "method" : "GET"
    },
    "response" : {
        "headers" : {},
        "content" : {},
        "status_code" : 200,
        "url" : "https://www.baidu.com:443/"
    },
    "title" : "百度一下，你就知道",
    "site" : "https://www.baidu.com",
    "end_type" : "PC",
    "time" : 1525763299
}
```

In [None]:
# Dot notation matches a key inside the subdocument; count_documents is the PyMongo call for counting matches
db['my_crawler_urls'].count_documents({'response.status_code': 302})
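
The dotted path `response.status_code` can be pictured as a walk through nested dicts. The sketch below emulates that lookup on a trimmed copy of the document above; `get_by_path` is a hypothetical helper and deliberately ignores arrays, which the server also handles.

```python
def get_by_path(doc, path):
    """Follow a dotted path such as 'response.status_code' through nested dicts."""
    current = doc
    for key in path.split('.'):
        if not isinstance(current, dict) or key not in current:
            return None  # the path does not exist in this document
        current = current[key]
    return current


doc = {
    'request': {'url': 'https://www.baidu.com/', 'method': 'GET'},
    'response': {'status_code': 200, 'url': 'https://www.baidu.com:443/'},
}
print(get_by_path(doc, 'response.status_code'))  # 200
print(get_by_path(doc, 'response.missing'))      # None
```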

### Insert a document and get its `_id` immediately

In [21]:
import random
from pymongo import MongoClient
from pprint import pprint

# MongoDB connection parameters
MONGODB = {
    "user": "xxx",
    "passwd": "xxxxxx",
    "host": "127.0.0.1:27017",
    "dbname": "xxxx"
}

def connect_mongo(MONGODB):
    # One argument fewer than in MongoBaseModel: connect=False is omitted
    client = MongoClient(
        'mongodb://{}:{}@{}/{}'.format(MONGODB['user'],
                                       MONGODB['passwd'],
                                       MONGODB['host'],
                                       MONGODB['dbname']))
    return client[MONGODB['dbname']]

db = connect_mongo(MONGODB)
movie = db['my_crawler_urls']

id_list = []

for i in range(5):
    doc = {'score': random.randint(1, 100), 'level': random.randint(1, 10)}
    result = movie.insert_one(doc)
    id_list.append(result.inserted_id)
    
pprint(id_list)

for item in id_list:
    data = movie.find_one({'_id':item})
    print(type(data))
    pprint(data)
    
for item in id_list:
    res = movie.delete_one({'_id': item})
    pprint(res)




[ObjectId('5af8f8c62b5eba0bb8fece6d'),
 ObjectId('5af8f8c62b5eba0bb8fece6e'),
 ObjectId('5af8f8c62b5eba0bb8fece6f'),
 ObjectId('5af8f8c62b5eba0bb8fece70'),
 ObjectId('5af8f8c62b5eba0bb8fece71')]
<class 'dict'>
{'_id': ObjectId('5af8f8c62b5eba0bb8fece6d'), 'level': 9, 'score': 76}
<class 'dict'>
{'_id': ObjectId('5af8f8c62b5eba0bb8fece6e'), 'level': 4, 'score': 70}
<class 'dict'>
{'_id': ObjectId('5af8f8c62b5eba0bb8fece6f'), 'level': 5, 'score': 73}
<class 'dict'>
{'_id': ObjectId('5af8f8c62b5eba0bb8fece70'), 'level': 4, 'score': 77}
<class 'dict'>
{'_id': ObjectId('5af8f8c62b5eba0bb8fece71'), 'level': 2, 'score': 38}
<pymongo.results.DeleteResult object at 0x000001F0261F49C8>
<pymongo.results.DeleteResult object at 0x000001F0261F4808>
<pymongo.results.DeleteResult object at 0x000001F0261F49C8>
<pymongo.results.DeleteResult object at 0x000001F0261F4808>
<pymongo.results.DeleteResult object at 0x000001F0261F49C8>
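
The connection string in `connect_mongo` above interpolates the user name and password directly. If either contains characters such as `@`, `:` or `/`, they need to be percent-escaped before going into the URI. A minimal sketch using only the standard library follows; `build_mongo_uri` and the sample credentials are made up for illustration.

```python
from urllib.parse import quote_plus


def build_mongo_uri(cfg):
    """Build a mongodb:// URI, percent-escaping the user name and password."""
    return 'mongodb://{}:{}@{}/{}'.format(
        quote_plus(cfg['user']),
        quote_plus(cfg['passwd']),
        cfg['host'],
        cfg['dbname'],
    )


example = {
    'user': 'app',
    'passwd': 'p@ss:word/1',  # hypothetical password with reserved characters
    'host': '127.0.0.1:27017',
    'dbname': 'test',
}
print(build_mongo_uri(example))
# mongodb://app:p%40ss%3Aword%2F1@127.0.0.1:27017/test
```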


### Updating selected fields

In [None]:
self.dst_movie.update_one({'_id': item.get('_id')}, {'$set':{'end_time': doc.get('time')}})
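
To make the effect of `$set` concrete, the sketch below emulates it on a plain dict: only the named fields change, and everything else in the document is left alone. `apply_set` is a hypothetical helper that covers top-level keys and dotted paths through nested dicts, nothing more.

```python
import copy


def apply_set(doc, fields):
    """Return a copy of doc with the given fields set, like a $set update."""
    updated = copy.deepcopy(doc)
    for path, value in fields.items():
        target = updated
        *parents, leaf = path.split('.')
        # Create intermediate subdocuments when the dotted path does not exist yet
        for key in parents:
            target = target.setdefault(key, {})
        target[leaf] = value
    return updated


original = {'_id': 1, 'title': 'example', 'time': 1525763299}
print(apply_set(original, {'end_time': original['time']}))
```

The printed document keeps `_id`, `title` and `time` and gains `end_time`; `original` itself is not modified.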

### What you should know about MongoDB

#### Indexes

Without an index, MongoDB has to scan the entire collection to find matching documents, which is very expensive.
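
As a hedged sketch of how such a scan might be avoided for the nested query shown earlier, the helper below creates an ascending index on `response.status_code`. `ensure_status_index` is a hypothetical name; `collection` is assumed to be a PyMongo collection such as `db['my_crawler_urls']`.

```python
def ensure_status_index(collection):
    """Create an ascending index on the nested field response.status_code."""
    # 1 means ascending (pymongo.ASCENDING); calling this again with the same
    # specification does not create a second index
    return collection.create_index([('response.status_code', 1)])
```

The call returns the name of the index. Running `explain()` on a cursor for the same filter is one way to check whether the query actually uses it.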

#### The default `_id` index

When a collection is created, MongoDB automatically builds a unique index on `_id`, which serves as the document's primary key. This index cannot be dropped.

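### Some optimization tips

1. When no index can be used, or the `_id` index cannot be used effectively, so that a full collection scan is unavoidable, keep the number of full-collection passes to a minimum.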
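
One way to follow this tip, sketched under the assumption that several statistics are needed from the same collection, is to gather them in a single pass over one cursor instead of issuing a separate full scan for each. `summarize` and the sample documents (including the `H5` value) are made up for illustration.

```python
from collections import Counter


def summarize(docs):
    """Collect several statistics in a single pass over the documents."""
    status_codes = Counter()
    end_types = Counter()
    for doc in docs:
        status_codes[doc.get('response', {}).get('status_code')] += 1
        end_types[doc.get('end_type')] += 1
    return status_codes, end_types


sample = [
    {'response': {'status_code': 200}, 'end_type': 'PC'},
    {'response': {'status_code': 302}, 'end_type': 'PC'},
    {'response': {'status_code': 200}, 'end_type': 'H5'},
]
codes, types = summarize(sample)
print(codes[200], codes[302], types['PC'])  # 2 1 2
```

With a real collection, `docs` could be a single cursor such as `collection.find({}, {'response.status_code': 1, 'end_type': 1})`, so the collection is traversed only once.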