We start by importing the necessary libraries.

In [20]:
from pymongo import MongoClient
import configparser
from pathlib import Path

Using the config parser, we access our configuration file, which contains the MongoDB URI, database name, and collection name.

In [21]:
config = configparser.ConfigParser()
config.read("../settings.cfg")

MONGO_URI = config.get("MONGO", "mongo_uri")
DB_NAME = config.get("MONGO", "database")
COLLECTION_NAME = config.get("MONGO", "collection")
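
For reference, the configuration file read above might look like the sketch below. The section and key names come from the config.get calls; the values (URI, database name, collection name) are placeholders assumed for illustration, not the project's real settings. The snippet parses the same layout from a string, so it runs without the file.

```python
import configparser

# Hypothetical contents of settings.cfg: only the section and key names
# are taken from the notebook, the values are placeholders.
SAMPLE_CFG = """
[MONGO]
mongo_uri = mongodb://localhost:27017
database = example_db
collection = example_collection
"""

sample = configparser.ConfigParser()
sample.read_string(SAMPLE_CFG)

sample_uri = sample.get("MONGO", "mongo_uri")
sample_db = sample.get("MONGO", "database")
sample_collection = sample.get("MONGO", "collection")
print(sample_uri, sample_db, sample_collection)
```

Note that config.read returns the list of files it managed to parse and silently skips missing ones, so an empty return value is a sign that the path is wrong.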

After extracting these values, we create a Mongo client for our local MongoDB instance and select the specified database and collection.

In [22]:
client = MongoClient(MONGO_URI)
db = client[DB_NAME]
collection = db[COLLECTION_NAME]
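
MongoClient connects lazily, so constructing it does not prove that the server is reachable. Below is a minimal sketch of an explicit check, assuming the standard ping admin command. The helper name is_reachable is hypothetical and not part of the original notebook.

```python
def is_reachable(client) -> bool:
    """Return True if the server behind `client` answers a ping."""
    try:
        # `ping` is a cheap admin command that forces a round trip.
        client.admin.command("ping")
        return True
    except Exception:
        # Any failure (timeout, refused connection, auth error)
        # is treated as "not reachable".
        return False
```

In the notebook this would be called as is_reachable(client). Passing serverSelectionTimeoutMS=2000 to MongoClient is one way to make a failed check return after about two seconds instead of the default 30.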

We then load the cleaned dataset.

In [23]:
import pandas as pd

df = pd.read_csv("../data/processed/car_sales_cleaned.csv")
print(df.shape)

(2500000, 11)


Since it contains 2.5 million records, it's best to divide it into 250 batches of 10,000 records each. Inserting in batches keeps each insert_many call small, limits the memory needed for the converted records, and makes progress easy to track.

In [24]:
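batch_size = 10000
total = len(df)

# Walk the DataFrame in slices of batch_size rows.
for i in range(0, total, batch_size):
    batch_df = df.iloc[i:i+batch_size]
    # Convert the slice to a list of dicts, one per row, as insert_many expects.
    records = batch_df.to_dict(orient="records")
    collection.insert_many(records)
    # Report cumulative progress after each batch.
    print(f"Inserted {i+len(records)} out of {total} documents")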

Inserted 10000 out of 2500000 documents
Inserted 20000 out of 2500000 documents
Inserted 30000 out of 2500000 documents
Inserted 40000 out of 2500000 documents
Inserted 50000 out of 2500000 documents
Inserted 60000 out of 2500000 documents
Inserted 70000 out of 2500000 documents
Inserted 80000 out of 2500000 documents
Inserted 90000 out of 2500000 documents
Inserted 100000 out of 2500000 documents
Inserted 110000 out of 2500000 documents
Inserted 120000 out of 2500000 documents
Inserted 130000 out of 2500000 documents
Inserted 140000 out of 2500000 documents
Inserted 150000 out of 2500000 documents
Inserted 160000 out of 2500000 documents
Inserted 170000 out of 2500000 documents
Inserted 180000 out of 2500000 documents
Inserted 190000 out of 2500000 documents
Inserted 200000 out of 2500000 documents
Inserted 210000 out of 2500000 documents
Inserted 220000 out of 2500000 documents
Inserted 230000 out of 2500000 documents
Inserted 240000 out of 2500000 documents
Inserted 250000 out of 25

All data has now been inserted into the collection.
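
The statement above can be checked rather than assumed. The sketch below wraps the same batching loop in a helper that returns how many documents the server reported as inserted, so the total can be compared with the collection's document count. The helper name insert_in_batches is hypothetical, and the comparison assumes the collection was empty before the load.

```python
def insert_in_batches(collection, records, batch_size=10_000):
    """Insert `records` (a list of dicts) in slices of `batch_size`.

    Returns the total number of documents reported as inserted.
    """
    inserted = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        result = collection.insert_many(batch)
        # inserted_ids holds one _id per document written in this batch.
        inserted += len(result.inserted_ids)
    return inserted


# Usage in the notebook (assumes `df` and `collection` from the cells above):
# records = df.to_dict(orient="records")
# inserted = insert_in_batches(collection, records)
# assert collection.count_documents({}) == inserted
```

Converting the whole DataFrame to records up front uses more memory than converting one slice at a time, as the original loop does, so for very large files the per-slice conversion is the safer choice.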