<a href="https://colab.research.google.com/github/LiuChen-5749342/LiuChen-Programming-BigDataAnalytics/blob/main/Lecture/3_03_key_value_stores.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://drive.google.com/uc?export=view&id=1vv_PsWBnUJwSCkwKDoJAC-vXjtaEA4Ts)

# 3.03 Key-value Stores with TinyDB
This tutorial gives a basic introduction to working with key-value (KV) stores (or document DBs). We will be working with [TinyDB](https://tinydb.readthedocs.io/en/latest/index.html), a document-oriented database written in pure Python that stores its data in a JSON file by default. It is particularly attractive here because it is, as the name suggests, small and lightweight.

We will begin with the relevant installs:

In [None]:
!pip install tinydb
!pip install faker
!pip install python-lorem

Collecting tinydb
  Downloading tinydb-4.8.2-py3-none-any.whl.metadata (6.7 kB)
Downloading tinydb-4.8.2-py3-none-any.whl (24 kB)
Installing collected packages: tinydb
Successfully installed tinydb-4.8.2
Collecting faker
  Downloading faker-37.12.0-py3-none-any.whl.metadata (15 kB)
Downloading faker-37.12.0-py3-none-any.whl (2.0 MB)
Installing collected packages: faker
Successfully installed faker-37.12.0
Collecting python-lorem
  Downloading python_lorem-1.3.0.post3-cp312-none-any.whl.metadata (3.7 kB)
Downloading python_lorem-1.3.0.post3-cp312-none-any.whl (9.1 kB)
Installing collected packages: python-lorem
Successfully installed python-lorem-1.3.0.post3


As you may infer from the packages installed, we will run something similar to one of our DuckDB examples: specifically, building a database from fake data generated by Faker:

In [None]:
import random
from faker import Faker # generates fake (but correctly formatted) data such as names, emails and companies
import pandas as pd
from lorem import paragraph
import itertools

fake = Faker()

def get_person():
  person = {}
  person['id'] = random.randrange(1000,9999999999999)
  person['first_name'] = fake.first_name()
  person['last_name'] = fake.last_name()
  person['email'] = fake.unique.ascii_email() # unique ensures the generated fake emails are not repeated
  person['company'] = fake.company()
  person['phone'] = fake.phone_number()
  person['review'] = list(itertools.islice(paragraph(count=1), 1)) # itertools.islice(..., 1) takes the first element from the generator
  return person

personlist = []
for x in range(100):
  personlist.append(get_person())

df = pd.DataFrame.from_dict(personlist)
df.head()

| | id | first_name | last_name | email | company | phone | review |
|---|---|---|---|---|---|---|---|
| 0 | 9116595710739 | Taylor | Yang | peter40@yahoo.com | Robinson, Paul and Robbins | (640)236-8982 | [Mollit deserunt voluptate enim. Velit elit ex... |
| 1 | 5705656797552 | Henry | Williams | kelly43@gmail.com | Leon-Morrison | 386.761.3892 | [Deserunt lorem ipsum ut lorem. Pariatur do mi... |
| 2 | 5556364846410 | Gail | Velez | lbrown@yahoo.com | Ferrell Group | (367)302-9593 | [Ipsum ipsum commodo incididunt et. Sunt aliqu... |
| 3 | 2076248970653 | Melanie | Edwards | carrieray@villanueva.com | Brown-Bradley | (909)751-4383x52084 | [Eiusmod cupidatat do laboris magna incididunt... |
| 4 | 995894756059 | Brian | Orozco | john90@thompson.com | Garcia, Hampton and Gonzalez | 472-901-8575x2504 | [Quis aliqua incididunt velit tempor id qui ei... |


Everything here is the same except that we have also added a text column (using lorem ipsum). As before, we have created this as a Pandas DataFrame, but like most KV stores, TinyDB prefers data stored as dictionaries:

In [None]:
fake_data = df.to_dict(orient='records') # convert the DataFrame to a list of dictionaries; orient='records' gives one dictionary per row
fake_data

[{'id': 9116595710739,
  'first_name': 'Taylor',
  'last_name': 'Yang',
  'email': 'peter40@yahoo.com',
  'company': 'Robinson, Paul and Robbins',
  'phone': '(640)236-8982',
  'review': ['Mollit deserunt voluptate enim. Velit elit excepteur non culpa. Nisi non elit fugiat consequat qui quis et, fugiat mollit magna consectetur do sint esse sint. Ea tempor magna ea culpa nisi anim. Consequat tempor eu officia ad elit, anim reprehenderit sed cupidatat dolor. Nostrud esse magna ipsum commodo. Pariatur ea cupidatat dolor. Exercitation eiusmod lorem nulla, veniam sit ipsum ea tempor sunt. Proident eiusmod cillum sed eu ad cupidatat.']},
 {'id': 5705656797552,
  'first_name': 'Henry',
  'last_name': 'Williams',
  'email': 'kelly43@gmail.com',
  'company': 'Leon-Morrison',
  'phone': '386.761.3892',
  'review': ['Deserunt lorem ipsum ut lorem. Pariatur do minim excepteur proident voluptate, ex non sunt magna, non ut anim minim. Qui nostrud eu ullamco. Adipiscing eu exercitation aliqua reprehe
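
Since the output above is long and truncated, a much smaller frame may make the shape of the result easier to see. The following is a minimal sketch; `small_df` and its two rows are hypothetical stand-ins for the Faker data, not part of the tutorial's dataset.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the Faker data above.
small_df = pd.DataFrame({"id": [1, 2], "first_name": ["Ada", "Alan"]})

# orient='records' yields one dictionary per row, keyed by column name.
records = small_df.to_dict(orient="records")
print(records)
```

Each dictionary in the list corresponds to one row, which is the form in which TinyDB accepts a document.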

With this transform in place we can load the data into our database. Note that the database itself is a JSON file (here `db.json`):

In [None]:
from tinydb import TinyDB, Query

db = TinyDB('db.json')

for record in fake_data:
  db.insert(record)

We can check this has worked with a simple Python loop:

In [None]:
for item in db:
  print(item)

{'id': 9116595710739, 'first_name': 'Taylor', 'last_name': 'Yang', 'email': 'peter40@yahoo.com', 'company': 'Robinson, Paul and Robbins', 'phone': '(640)236-8982', 'review': ['Mollit deserunt voluptate enim. Velit elit excepteur non culpa. Nisi non elit fugiat consequat qui quis et, fugiat mollit magna consectetur do sint esse sint. Ea tempor magna ea culpa nisi anim. Consequat tempor eu officia ad elit, anim reprehenderit sed cupidatat dolor. Nostrud esse magna ipsum commodo. Pariatur ea cupidatat dolor. Exercitation eiusmod lorem nulla, veniam sit ipsum ea tempor sunt. Proident eiusmod cillum sed eu ad cupidatat.']}
{'id': 5705656797552, 'first_name': 'Henry', 'last_name': 'Williams', 'email': 'kelly43@gmail.com', 'company': 'Leon-Morrison', 'phone': '386.761.3892', 'review': ['Deserunt lorem ipsum ut lorem. Pariatur do minim excepteur proident voluptate, ex non sunt magna, non ut anim minim. Qui nostrud eu ullamco. Adipiscing eu exercitation aliqua reprehenderit officia minim, conse

With our database set up, we can start to query our records. In TinyDB we do this by creating a query object:

In [None]:
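User = Query() # query object
# A Query object is used to build TinyDB queries, including more complex ones.

db.search(User.first_name == 'Chad') # adapt based on your data
# Search the database for documents whose first_name field is exactly 'Chad'. The result is a list of all matching documents. The fake data generated here contains no one named 'Chad', so the output is an empty list [].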

[]

We can also add new data in dictionary/JSON-like format:

In [None]:
db.insert({'id': 123, 'first_name': 'Amir', 'star_sign': 'Dog', 'review': 'I do not speak Latin.'})

101

And retrieve the data as before:

In [None]:
db.search(User.id == 123)

[{'id': 123,
  'first_name': 'Amir',
  'star_sign': 'Dog',
  'review': 'I do not speak Latin.'}]

One thing to note here is that our new record does not follow the schema we may infer from the original dataset (i.e. the original data all used the same columns/fields). Here many of those fields are missing and we have the new field 'star_sign'.

This demonstrates the extra flexibility we get with a KV store over a relational model. We can also query our database to get all records that have a specific field:

In [None]:
db.search(User.star_sign.exists())
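# Search the database for any document that contains a star_sign field, whatever its value. This shows the flexibility of a document database: not every record needs to have the same fields.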

[{'id': 123,
  'first_name': 'Amir',
  'star_sign': 'Dog',
  'review': 'I do not speak Latin.'}]

This gives a basic introduction to KV (and document) stores. While there are many competing products, the common themes are the dictionary-like structure (key-value pairs) and the flexibility to accept any fields (keys).