### Analyzing Book Data with MongoDB

This notebook outlines a step-by-step process for working with book data in MongoDB.
The process involves:
1. Connecting to the database
2. Loading and inserting the data
3. Aggregating the data
4. Exporting the results

### Import the packages

In [1]:

import json
import pymongo  # pymongo is the Python driver for MongoDB
import credentials # Load username and password from credentials.py

### Establish a connection to the MongoDB database

In [3]:

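# Build the URI from credentials.py rather than hardcoding the password in the notebook
# (assumes credentials.py defines `username` and `password`)
connection_string = f"mongodb+srv://{credentials.username}:{credentials.password}@cluster0.rkx0lhr.mongodb.net/?retryWrites=true&w=majority&appName=AtlasApp"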
client = pymongo.MongoClient(connection_string)  # create a client object to connect to the cluster
db = client['bookdata']  # get a handle to the bookdata database; MongoDB creates it on the first write
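
Credentials embedded in a MongoDB URI need percent-escaping when they contain reserved characters such as `@`, `:` or `/`. The following is a minimal sketch of that step, not the notebook's own code: the helper name `build_atlas_uri`, the user name, the password, and the host are all hypothetical values chosen for illustration.

```python
from urllib.parse import quote_plus


def build_atlas_uri(username, password, host):
    """Return a mongodb+srv URI with percent-escaped credentials."""
    user = quote_plus(username)  # escape reserved characters in the user name
    pwd = quote_plus(password)   # escape reserved characters in the password
    return f"mongodb+srv://{user}:{pwd}@{host}/?retryWrites=true&w=majority"


# Hypothetical values for illustration only
uri = build_atlas_uri("book_user", "p@ss:word/1", "cluster0.example.mongodb.net")
print(uri)
```

The resulting string can be passed to `pymongo.MongoClient` in the same way as `connection_string` above.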


### Load the Data
1. The book data is loaded from a JSON file into the Python environment. I synthesized the data in the data.json file, which contains information about various books, including their titles, authors, publication years, and genres.<br>
2. The data is structured as a list of dictionaries, where each dictionary represents one book record with the attributes described below.<br>

Attribute description:<br>
1. Title: The title of the book.
2. Author: The author of the book.
3. Publication Year: The year the book was published.
4. Genre: The genre of the book.
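
To make the record structure concrete, here is a small sketch that parses two records and checks them against the four attributes above. The field names `title`, `author`, `publication_year`, and `genre` are taken from the printed documents further down; the helper `validate_book` and the deliberately malformed second record are assumptions added for illustration.

```python
import json

# Expected attributes and their Python types after json.loads()
REQUIRED_FIELDS = {"title": str, "author": str, "publication_year": int, "genre": str}


def validate_book(record):
    """Return a list of problems found in one book record (empty if it is valid)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


# The second record is intentionally malformed (year as a string, no genre)
sample_json = """[
  {"_id": 1, "title": "The Great Gatsby", "author": "F. Scott Fitzgerald",
   "publication_year": 1925, "genre": "Fiction"},
  {"_id": 2, "title": "To Kill a Mockingbird", "author": "Harper Lee",
   "publication_year": "1960"}
]"""

books = json.loads(sample_json)
for book in books:
    print(book["title"], validate_book(book))
```

Running a check like this before inserting keeps mistyped years out of the collection, where they would otherwise skew the averages computed later.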


In [4]:

with open('data.json', 'r') as fin:
    file_data = json.load(fin)

In [5]:
# Define the collection name
collection_name = 'bookdata_01'

# Drop the collection if it already exists
if collection_name in db.list_collection_names():
    db[collection_name].drop()

### Inserting the Data

In [6]:
# Insert the book data into the 'bookdata_01' collection within the bookdata database
# insert_one() adds one book record per call while the loop iterates through the file_data list

for document in file_data:
    result = db[collection_name].insert_one(document)
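
As an alternative to one `insert_one()` call per record, pymongo's `insert_many()` sends the whole list in a single call. The sketch below is one possible wrapper, not the notebook's method; the function name `insert_books` is hypothetical, and `collection` is assumed to be a pymongo collection such as `db[collection_name]`.

```python
def insert_books(collection, books):
    """Insert all book documents in one call and return how many were written."""
    if not books:
        # insert_many() rejects an empty list, so return early instead
        return 0
    result = collection.insert_many(books)
    # inserted_ids holds one _id per document that was written
    return len(result.inserted_ids)


# Usage against the notebook's objects (requires a live connection):
# inserted = insert_books(db[collection_name], file_data)
```

For a handful of records the difference is negligible; batching mainly pays off when the file grows, since each `insert_one()` is a separate round trip to the server.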

In [7]:
# To ensure the data has been successfully inserted, print all the documents in the collection.

cursor = db[collection_name].find()

for document in cursor:
    print(document)

{'_id': 1, 'title': 'The Great Gatsby', 'author': 'F. Scott Fitzgerald', 'publication_year': 1925, 'genre': 'Fiction'}
{'_id': 2, 'title': 'To Kill a Mockingbird', 'author': 'Harper Lee', 'publication_year': 1960, 'genre': 'Fiction'}
{'_id': 3, 'title': '1984', 'author': 'George Orwell', 'publication_year': 1949, 'genre': 'Dystopian'}
{'_id': 4, 'title': 'Pride and Prejudice', 'author': 'Jane Austen', 'publication_year': 1813, 'genre': 'Romance'}
{'_id': 5, 'title': 'The Hobbit', 'author': 'J.R.R. Tolkien', 'publication_year': 1937, 'genre': 'Fantasy'}
{'_id': 6, 'title': 'Moby-Dick', 'author': 'Herman Melville', 'publication_year': 1851, 'genre': 'Adventure'}
{'_id': 7, 'title': 'The Catcher in the Rye', 'author': 'J.D. Salinger', 'publication_year': 1951, 'genre': 'Fiction'}
{'_id': 8, 'title': 'The Lord of the Rings', 'author': 'J.R.R. Tolkien', 'publication_year': 1954, 'genre': 'Fantasy'}
{'_id': 9, 'title': 'Brave New World', 'author': 'Aldous Huxley', 'publication_year': 1932, '

### Data Aggregation
1. An aggregation query on the book data uses '$group' to calculate, for each genre, the number of books, the average publication year, and the earliest and latest publication years.<br>
2. The results are then sorted by genre name in ascending order using '$sort'.<br>
3. The aggregation query results are stored in the 'results' variable.<br>
4. The query summarizes the book dataset by genre, showing how many books each genre contains and the range of years they span.
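
To show what the '$group' and '$sort' stages compute, the sketch below reproduces the same summary in plain Python on four of the records printed earlier. It is an illustration of the logic only, not how MongoDB executes the pipeline, and the function name `summarize_by_genre` is hypothetical.

```python
from collections import defaultdict

# A subset of the records shown in the collection output
books = [
    {"title": "The Great Gatsby", "publication_year": 1925, "genre": "Fiction"},
    {"title": "To Kill a Mockingbird", "publication_year": 1960, "genre": "Fiction"},
    {"title": "The Hobbit", "publication_year": 1937, "genre": "Fantasy"},
    {"title": "The Lord of the Rings", "publication_year": 1954, "genre": "Fantasy"},
]


def summarize_by_genre(records):
    """Group records by genre and compute count, average, min and max year."""
    years_by_genre = defaultdict(list)
    for record in records:
        years_by_genre[record["genre"]].append(record["publication_year"])

    summary = []
    for genre in sorted(years_by_genre):  # ascending by genre, like {"$sort": {"_id": 1}}
        years = years_by_genre[genre]
        summary.append({
            "_id": genre,
            "average_year": sum(years) / len(years),  # $avg
            "count": len(years),                      # {"$sum": 1}
            "earliest_year": min(years),              # $min
            "latest_year": max(years),                # $max
        })
    return summary


summary = summarize_by_genre(books)
for row in summary:
    print(row)
```

Because only four records are used here, the Fiction row differs from the full query output, while the Fantasy row matches it.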

In [8]:

averages = db[collection_name].aggregate([
    {
        "$group": {
            "_id": "$genre",
            "average_year": {"$avg": "$publication_year"},
            "count": {"$sum": 1},  
            "earliest_year": {"$min": "$publication_year"},  
            "latest_year": {"$max": "$publication_year"}  
        }
    },
    {
        "$sort": {"_id": 1}
    }
])

results = list(averages) 
print(results)


[{'_id': 'Adventure', 'average_year': 1851.0, 'count': 1, 'earliest_year': 1851, 'latest_year': 1851}, {'_id': 'Dystopian', 'average_year': 1940.5, 'count': 2, 'earliest_year': 1932, 'latest_year': 1949}, {'_id': 'Fantasy', 'average_year': 1945.5, 'count': 2, 'earliest_year': 1937, 'latest_year': 1954}, {'_id': 'Fiction', 'average_year': 1945.3333333333333, 'count': 3, 'earliest_year': 1925, 'latest_year': 1960}, {'_id': 'Gothic Fiction', 'average_year': 1818.0, 'count': 1, 'earliest_year': 1818, 'latest_year': 1818}, {'_id': 'Romance', 'average_year': 1813.0, 'count': 1, 'earliest_year': 1813, 'latest_year': 1813}]


### Writing the query results to JSON

In [9]:
# The results of the aggregation query are saved in the 'write.json' file using 'json.dump()'

with open('write.json', 'w') as json_file:
    json.dump(results, json_file, indent=4)

print("Aggregation results saved to 'write.json'.")


Aggregation results saved to 'write.json'.
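
One way to confirm the export beyond the printed message is to read the file back and compare it with what was written. The sketch below does this with a single row copied from the aggregation output and writes to a temporary directory, which is an assumption made so the example does not overwrite the real write.json.

```python
import json
import os
import tempfile

# One row copied from the aggregation output above
results = [
    {"_id": "Fantasy", "average_year": 1945.5, "count": 2,
     "earliest_year": 1937, "latest_year": 1954}
]

# Write to a temporary location instead of the notebook's working directory
path = os.path.join(tempfile.mkdtemp(), "write.json")
with open(path, "w") as json_file:
    json.dump(results, json_file, indent=4)

# Read the file back and check that nothing changed in the round trip
with open(path) as json_file:
    reloaded = json.load(json_file)

round_trip_ok = reloaded == results
print(round_trip_ok)
```

`json.dump()` works here because the grouped `_id` values are plain strings. Documents that carry `ObjectId` values would need a BSON-aware serializer such as `bson.json_util` instead.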


### Summary:
The aggregation results are saved to the write.json file, and a confirmation message is printed.