## Title - A guide to connecting with MongoDB using PyMongo

## Introduction



This tutorial is designed for users who want to connect to a MongoDB collection using the PyMongo library and generate statistics from the data in the collection. It assumes a basic knowledge of Python and MongoDB.


**Learning Goal**
By the end of this tutorial, a user will be able to connect to a MongoDB collection hosted on a server and display some statistics about the data.


**Learning Objectives**
- Install the pymongo Library
- Explore the library to connect to a MongoDB collection
- Generate some statistics out of the data
- Obtain a data dump

**Description**

MongoDB is a popular open-source, NoSQL database management system designed for high performance, scalability, and flexibility. 
It stores data in flexible, JSON-like documents, making it easy to work with structured and unstructured data. 
MongoDB uses a document-oriented data model, which allows for the storage of complex data structures and nested arrays,
providing more flexibility than traditional relational databases. It supports features such as indexing, replication, sharding,
and aggregation, making it suitable for a wide range of use cases, from small-scale applications to large-scale enterprise systems.
PyMongo is the official Python client library for MongoDB. It allows Python developers to interact with MongoDB databases using a
simple and intuitive API. With PyMongo, you can perform various operations such as querying, inserting, updating, and deleting
documents in MongoDB collections directly from your Python code. It provides a flexible and powerful way to work with MongoDB 
databases, making it a popular choice for Python developers working with MongoDB. In this tutorial, we connect to an
existing MongoDB collection with the PyMongo library and generate some statistics from the data.

**Target Audience**

This tutorial is meant for users who want to use the PyMongo library to connect to a MongoDB collection.
It aims to put all related information in a single place.

**Prerequisites**

    Basic knowledge of Python (https://www.python.org/)
    Basic knowledge of MongoDB (https://www.mongodb.com/)
    A local MongoDB instance running in Docker on your system
    Data already stored in a MongoDB collection
   
**Difficulty Level**

    Easy

**Duration**

    2 hours

**Social Science Use Case**

John is a researcher who wants to generate statistics for a huge dataset that has been collected and stored in a MongoDB collection. To connect to and explore the dataset, he uses the PyMongo library to retrieve details about the collection, such as its size and the number of documents, and creates a dump of the collection. He can then reuse this method for any other data stored in MongoDB.



**Sample Data**

As sample data, we use data from Telegram channels stored in MongoDB. The name of the database is telegram and the name of
the collection is channel. The username is root and the password is example.
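
The cells below import `pymongo`, `pandas`, `langdetect`, and `tqdm`, but the installation step itself is not shown. As a minimal sketch, assuming a standard `pip` setup, the following checks which of these packages are missing and prints a suggested install command. The helper name `missing_packages` is hypothetical and not part of any library.

```python
import importlib.util

# Packages imported later in this notebook
REQUIRED = ["pymongo", "pandas", "langdetect", "tqdm"]


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    # Assumption: pip is the package manager in use
    print("Missing packages. Install them with: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```

In a notebook, the printed command can be run in a cell by prefixing it with `%pip` or `!`.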


In [None]:
# Connect to the MongoDB database
from pymongo import MongoClient

# MongoDB connection details
mongo_host = 'localhost'
mongo_port = 27017
mongo_database = 'telegram'
mongo_collection = 'channel'
mongo_username = 'root'
mongo_password = 'example'

# Connection URI with authentication, built from the settings above.
# No database is named in the URI, so the credentials are checked against
# the default authentication database ('admin').
mongo_uri = f"mongodb://{mongo_username}:{mongo_password}@{mongo_host}:{mongo_port}"

# Connect to MongoDB
client = MongoClient(mongo_uri)

# Specify the database and collection
db = client[mongo_database]
collection = db[mongo_collection]
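
The sample password `example` contains no special characters, but real credentials often do, and characters such as `@` or `/` must be percent-escaped inside a MongoDB URI. The sketch below shows one way to build the URI safely with the standard library. The function name `build_mongo_uri` and the second password are made up for illustration.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host="localhost", port=27017):
    """Build a MongoDB connection URI with percent-escaped credentials."""
    return f"mongodb://{quote_plus(username)}:{quote_plus(password)}@{host}:{port}"


print(build_mongo_uri("root", "example"))
# -> mongodb://root:example@localhost:27017
print(build_mongo_uri("root", "p@ss/word"))  # hypothetical password
# -> mongodb://root:p%40ss%2Fword@localhost:27017
```

The resulting string can be passed to `MongoClient`. Because `MongoClient` connects lazily, running `client.admin.command("ping")` afterwards is a quick way to confirm that the server is reachable and the credentials are accepted.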


In [2]:
# Count the number of documents in the collection
document_count = collection.count_documents({})

# Print the result
print(f"Number of documents in the '{mongo_collection}' collection: {document_count}")



Number of documents in the 'channel' collection: 82602523
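
With more than 82 million documents, an exact count with an empty filter can take a while. PyMongo also offers `estimated_document_count()`, which reads the count from collection metadata and is usually much faster, at the cost of being approximate. The wrapper below is a sketch of how the two can be switched; `count_docs` is a hypothetical helper name.

```python
def count_docs(collection, exact=True):
    """Count documents in a PyMongo collection.

    exact=True  -> count_documents({}) with an empty filter (accurate, slower)
    exact=False -> estimated_document_count() from metadata (approximate, fast)
    """
    if exact:
        return collection.count_documents({})
    return collection.estimated_document_count()


# Usage with the `collection` object created earlier (requires a live connection):
# print(count_docs(collection, exact=False))
```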


In [3]:
#print statistics out of the data
stats = db.command("collstats", mongo_collection)
print(stats)


{'ns': 'telegram.channel', 'size': 345146883805.0, 'count': 82603773, 'avgObjSize': 4178, 'storageSize': 118590033920.0, 'freeStorageSize': 270336, 'capped': False, 'wiredTiger': {'metadata': {'formatVersion': 1}, 'creationString': 'access_pattern_hint=none,allocation_size=4KB,app_metadata=(formatVersion=1),assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none,write_timestamp=off),block_allocation=best,block_compressor=snappy,cache_resident=false,checksum=on,colgroups=,collator=,columns=,dictionary=0,encryption=(keyid=,name=),exclusive=false,extractor=,format=btree,huffman_key=,huffman_value=,ignore_in_memory_cache_size=false,immutable=false,import=(enabled=false,file_metadata=,repair=false),internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,key_format=q,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=64MB,log=(enabled=true),lsm=(auto_throttle=true,bloom=true,bloom_bit_count=16,bloom_config=,bloom_hash_
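
The `size` and `storageSize` values in this output are reported in bytes, which is hard to read at this scale. The sketch below converts them to binary units. The helper `human_bytes` is hypothetical, and `stats_sample` copies only the four leading numeric fields from the output above.

```python
def human_bytes(num_bytes):
    """Format a byte count using binary units (1 KiB = 1024 B)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value is below 1024,
        # or at the largest unit available
        if value < 1024 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


# Values copied from the collStats output above
stats_sample = {
    "size": 345146883805.0,
    "storageSize": 118590033920.0,
    "count": 82603773,
    "avgObjSize": 4178,
}

for key in ("size", "storageSize"):
    print(f"{key}: {human_bytes(stats_sample[key])}")
```

Here `size` is the uncompressed data size and `storageSize` is the space allocated on disk, so the gap between the two reflects the snappy block compression listed in the `creationString`.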

In [4]:
#print one document out of the data
one_document = collection.find_one()
print(one_document)

{'_id': ObjectId('65ae71c72fff6da70b8f7e4a'), 'record_id': 8134311851216338949, 'message_id': 5, 'channel_id': 1893917064, 'retrieved_utc': 1705931207, 'updated_utc': 1705931207, 'data': '{"_": "Message", "id": 5, "peer_id": {"_": "PeerChannel", "channel_id": 1893917064}, "date": "2023-10-15T08:21:44+00:00", "message": "https://t.me/+TVKXr6DVofhWZlqH", "out": false, "mentioned": false, "media_unread": false, "silent": false, "post": true, "from_scheduled": false, "legacy": false, "edit_hide": false, "pinned": false, "noforwards": false, "from_id": null, "fwd_from": null, "via_bot_id": null, "reply_to": {"_": "MessageReplyHeader", "reply_to_msg_id": 4, "reply_to_scheduled": false, "forum_topic": false, "reply_to_peer_id": null, "reply_to_top_id": null}, "media": null, "reply_markup": null, "entities": [{"_": "MessageEntityUrl", "offset": 0, "length": 30}], "views": 10638, "forwards": 107, "replies": null, "edit_date": "2023-10-15T16:25:07+00:00", "post_author": null, "grouped_id": null,
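
As the output shows, the `data` field holds the Telegram message as a JSON string, not as a nested document, so it has to be parsed before its fields can be used. The following sketch extracts a few fields. `extract_message_fields` is a hypothetical helper, and `sample_doc` is an abridged copy of the document printed above.

```python
import json
from datetime import datetime


def extract_message_fields(doc):
    """Parse the JSON string in doc['data'] and return selected fields."""
    payload = json.loads(doc["data"])
    date = payload.get("date")
    return {
        "channel_id": doc["channel_id"],
        "message_id": doc["message_id"],
        "text": payload.get("message"),
        # ISO 8601 timestamp with a UTC offset, as in the sample document
        "date": datetime.fromisoformat(date) if date else None,
        "views": payload.get("views"),
        "forwards": payload.get("forwards"),
    }


# Abridged copy of the document shown above
sample_doc = {
    "channel_id": 1893917064,
    "message_id": 5,
    "data": '{"id": 5, "date": "2023-10-15T08:21:44+00:00", '
            '"message": "https://t.me/+TVKXr6DVofhWZlqH", '
            '"views": 10638, "forwards": 107}',
}

fields = extract_message_fields(sample_doc)
print(fields["text"], fields["date"].year, fields["views"])
```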

In [5]:
import pandas as pd




In [6]:
# Group documents by channel_id and count the documents in each group
pipeline = [
    {"$group": {"_id": "$channel_id", "count": {"$sum": 1}}}
]
# Execute the aggregation pipeline
result = list(collection.aggregate(pipeline))

# Load the results into a DataFrame
df = pd.DataFrame(result)
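
If only the largest channels are of interest, the sorting and truncation can be done by MongoDB itself by adding `$sort` and `$limit` stages, so that fewer rows are transferred to Python. This is a sketch of such a pipeline. The function name `top_channels_pipeline` is hypothetical.

```python
def top_channels_pipeline(limit=10):
    """Build a pipeline returning the `limit` channels with the most documents."""
    return [
        # Count documents per channel
        {"$group": {"_id": "$channel_id", "count": {"$sum": 1}}},
        # Largest channels first
        {"$sort": {"count": -1}},
        # Keep only the top `limit` rows
        {"$limit": limit},
    ]


pipeline_top = top_channels_pipeline(5)
print(pipeline_top)

# With a live connection, the pipeline would be run like this:
# top_df = pd.DataFrame(collection.aggregate(pipeline_top, allowDiskUse=True))
```

The `allowDiskUse=True` option lets the server spill to disk if a stage exceeds its memory limit, which can matter for a collection of this size.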

In [7]:
# One row per unique channel (grouped by channel_id) with its document count
df

             _id  count
0     1296944294   7415
1     1158379858    916
2     1076967923   5824
3     1074418185   4433
4     1788633993   1630
...          ...    ...
6798  1122432103  13405
6799  1594967462   2666
6800  1232912002   7723
6801  1314318233  13373


In [31]:
df.columns

Index(['_id', 'count'], dtype='object')

In [33]:
# Channel with the minimum count
min_count_channel = df.loc[df['count'].idxmin()]

# Channel with the maximum count
max_count_channel = df.loc[df['count'].idxmax()]

# Average count across all channels
average_count = df['count'].mean()

In [36]:
average_count

10001.887608069164

In [34]:
min_count_channel

_id      1379942214
count             1
Name: 1416, dtype: int64

In [35]:
max_count_channel

_id      1164999973
count        335500
Name: 1839, dtype: int64
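
The mean of about 10,002 documents per channel sits between a minimum of 1 and a maximum of 335,500, which suggests a strongly skewed distribution. In such cases the median is often a more representative summary. The sketch below illustrates the difference with the standard library, using only the nine counts visible in the DataFrame preview above, so the numbers are not statistics for the full collection.

```python
import statistics

# Per-channel counts copied from the DataFrame preview above (a small subset)
counts = [7415, 916, 5824, 4433, 1630, 13405, 2666, 7723, 13373]

mean_count = statistics.mean(counts)
median_count = statistics.median(counts)
print(f"mean={mean_count:.1f}, median={median_count}")
```

On the full DataFrame, `df['count'].median()` or `df['count'].describe()` would give the corresponding values for all channels.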

In [10]:
channel_ids = df['_id'].tolist()

In [11]:
import pymongo
import json

In [12]:
from langdetect import detect

In [13]:
collection.create_index([("channel_id", pymongo.ASCENDING)])

'channel_id_1'

In [14]:
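import os

def saveChannelDict(channel_id, messages):
    """Write the messages of one channel to channels/<channel_id>.jsonl, one JSON value per line."""
    savedir = 'channels/'
    os.makedirs(savedir, exist_ok=True)  # Create the directory if it doesn't exist
    # 'w' overwrites any existing file for this channel
    with open(os.path.join(savedir, str(channel_id) + '.jsonl'), 'w') as f:
        for m in messages:
            f.write(json.dumps(m) + '\n')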
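
Files written this way contain one JSON value per line (JSON Lines), so they can be read back line by line. The following sketch shows a matching reader. `load_channel_messages` is a hypothetical helper, and the demo writes to a temporary directory with two made-up messages so it does not depend on the `channels/` folder.

```python
import json
import os
import tempfile


def load_channel_messages(path):
    """Read a .jsonl file and return its lines as a list of Python values."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo: write two made-up messages in the same format, then read them back
demo_path = os.path.join(tempfile.mkdtemp(), "demo_channel.jsonl")
with open(demo_path, "w", encoding="utf-8") as f:
    for message in ["first message", "second message"]:
        f.write(json.dumps(message) + "\n")

loaded = load_channel_messages(demo_path)
print(loaded)
# -> ['first message', 'second message']
```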


In [15]:
from tqdm import tqdm

In [19]:
# Detect the language of the messages in each channel
import csv
import json
from tqdm import tqdm
from langdetect.lang_detect_exception import LangDetectException

# File path for the TSV output
tsv_file = 'channel_languages.tsv'
# Only process the first five channels to keep the example fast
limited_channel_ids = channel_ids[:5]
# Open the TSV file in write mode with newline='' to prevent extra newlines
with open(tsv_file, 'w', newline='') as f:
    # Create a CSV writer object
    writer = csv.writer(f, delimiter='\t')

    # Write the header row
    writer.writerow(['channel_id', 'detected_language'])

    # Iterate over each channel ID
    for channel_id in tqdm(limited_channel_ids):
        # Start a fresh list so messages from earlier channels are not mixed in
        messages_text = []
        # Iterate over each document in the collection for the current channel ID
        for doc in tqdm(collection.find({'channel_id': channel_id})):
            # Parse the 'data' field as JSON and extract its 'message' field
            data = json.loads(doc['data'])
            message = data.get('message')
            # Keep only non-empty messages
            if message:
                messages_text.append(message)
        saveChannelDict(channel_id, messages_text)
        # Concatenate all messages into a single string
        all_messages_text = ' '.join(messages_text)

        # Detect the language of the concatenated text; langdetect raises an
        # exception when the text has no usable features (e.g. it is empty)
        try:
            detected_language = detect(all_messages_text)
        except LangDetectException:
            detected_language = 'unknown'  # placeholder value for undetectable text

        # Write the channel ID and detected language to the TSV file
        writer.writerow([channel_id, detected_language])

# Print confirmation message
print("TSV file exported successfully.")


  0%|                                                          | 0/5 [00:00<?, ?it/s]
0it [00:00, ?it/s]
5764it [00:00, 34726.11it/s]
 20%|██████████                                        | 1/5 [00:00<00:01,  3.87it/s]
0it [00:00, ?it/s]
2613it [00:00, 20698.55it/s]
5068it [00:00, 22740.78it/s]
7789it [00:00, 24663.27it/s]
10279it [00:00, 19681.92it/s]
12927it [00:00, 21734.85it/s]
15774it [00:00, 23765.04it/s]
20962it [00:00, 23924.73it/s]
 40%|████████████████████                              | 2/5 [00:01<00:03,  1.03s/it]
94it [00:00, 23818.32it/s]
 60%|██████████████████████████████                    | 3/5 [00:02<00:01,  1.15it/s]
0it [00:00, ?it/s]
3496it [00:00, 34958.28it/s]
7246it [00:00, 36450.44it/s]
13340it [00:00, 37861.00it/s]
 80%|████████████████████████████████████████          | 4/5 [00:03<00:01,  1.00s/it]
260it [00:00, 5396.98it/s]
100%|██████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.08it/s]

TSV file exported successfully.
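
Once the TSV file exists, a natural next step is to summarise how many channels were assigned to each language. The sketch below does this with the standard library. `language_counts` is a hypothetical helper, and the demo file uses made-up language codes, not results from the actual collection.

```python
import csv
import os
import tempfile
from collections import Counter


def language_counts(tsv_path):
    """Count channels per detected language in a channel_id/detected_language TSV."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return Counter(row["detected_language"] for row in reader)


# Demo file with the same header as above; the language codes are invented
demo_tsv = os.path.join(tempfile.mkdtemp(), "channel_languages.tsv")
with open(demo_tsv, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["channel_id", "detected_language"])
    writer.writerows([[1296944294, "en"], [1158379858, "ru"], [1076967923, "en"]])

demo_counts = language_counts(demo_tsv)
print(demo_counts.most_common())
# -> [('en', 2), ('ru', 1)]
```

For the real output, calling `language_counts('channel_languages.tsv')` would give the corresponding summary.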





In [None]:
# Generate MongoDB data dump
import subprocess

# Set your MongoDB connection parameters
mongo_uri = "mongodb://root:example@172.24.0.2:27017"
database_name = "telegram"
collection_name = "channel"

# Set the output directory for the dump
output_directory = "/home/telegram/telegram_pushshift/mongodump"

# Construct the mongodump command as an argument list, so no shell is needed
mongodump_cmd = [
    "mongodump",
    f"--uri={mongo_uri}",
    f"--db={database_name}",
    f"--collection={collection_name}",
    f"--out={output_directory}",
    "--authenticationDatabase=admin",
]

# Execute the command; check=True raises CalledProcessError if mongodump fails,
# so the confirmation below is only printed after a successful dump
subprocess.run(mongodump_cmd, check=True)

print("MongoDB dump completed.")


## References 

1. https://pymongo.readthedocs.io/en/stable/

2. https://www.mongodb.com/


## Contact Details - Susmita.gangopadhyay@gesis.org