# Dataset description

Group number: [your group number here]


Team members: [a list of team members here]

## Introduction

- Dataset name
- Authors
- Source/URL
- A brief description of what the dataset is about

## General information

- Data format
- How many files/collections
- Data size in terms of storage
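
The items above can be measured directly from the downloaded file rather than read off a web page. A minimal sketch, assuming the dataset is a single JSON Lines file (one JSON object per line); the helper name `describe_file` is hypothetical and not part of the assignment template:

```python
import json
import os


def describe_file(path):
    """Return the storage size and the number of non-empty JSON lines in a file."""
    size_bytes = os.path.getsize(path)
    num_records = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # Raises ValueError if a line is not valid JSON
                num_records += 1
    return {
        "size_bytes": size_bytes,
        "size_mb": round(size_bytes / 1024 ** 2, 2),
        "num_records": num_records,
    }
```

Calling `describe_file(file_path)` on the dataset file would give the size and record count to report in this section.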

## Import data

In [10]:
from pymongo import MongoClient
import json
import urllib.parse

# Load credentials from JSON file
with open('credentials_mongodb.json') as f:
    login = json.load(f)

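# Assign credentials to variables; URL-encode both the username and the password,
# since special characters in either one would break the connection string
username = urllib.parse.quote_plus(login['username'])
password = urllib.parse.quote_plus(login['password'])
host = login['host']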

# Construct the MongoDB connection string
url = f"mongodb+srv://{username}:{password}@{host}/?retryWrites=true&w=majority"

# Connect to MongoDB
client = MongoClient(url)

# Select the database you want to use
db = client['news_database']  # Replace with your database name

# Drop the collection if it exists to free up space
try:
    db.drop_collection('news_collection')  # Replace with the collection name you want to delete
    print("Collection dropped successfully.")
except Exception as e:
    print(f"Error dropping collection: {e}")

# Select the collection you want to use
collection = db['news_collection']  # Replace with your collection name

# Initialize an empty list to store the documents
documents = []

# Load the JSON file
file_path = r"C:\Users\tejaa\Downloads\archive (2)\News_Category_Dataset_v3.json"
with open(file_path, 'r', encoding='utf-8') as file:
    for line in file:
        # Each line is a separate JSON object/document
        documents.append(json.loads(line))

# Number of documents
num_documents = len(documents)
print(f"Total number of documents to insert: {num_documents}")

# Insert the documents into MongoDB
try:
    collection.insert_many(documents)
    print(f"Inserted {num_documents} documents into MongoDB.")
except Exception as e:
    print(f"An error occurred while inserting data: {e}")


Collection dropped successfully.
Total number of documents to insert: 209527
Inserted 209527 documents into MongoDB.
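
The import cell above parses all 209527 documents into memory before a single `insert_many` call. A hedged alternative is to read the file in batches. The sketch below assumes a JSON Lines input; `iter_batches` and the batch sizes are hypothetical choices, not part of the original notebook:

```python
import io
import json


def iter_batches(lines, batch_size=1000):
    """Yield lists of parsed JSON documents, at most batch_size per list."""
    batch = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        batch.append(json.loads(line))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # Final, possibly smaller batch


# Small in-memory demonstration (stands in for the real file handle)
demo = io.StringIO('{"n": 1}\n{"n": 2}\n{"n": 3}\n')
batch_sizes = [len(b) for b in iter_batches(demo, batch_size=2)]
print(batch_sizes)  # [2, 1]
```

With the real file, each batch could be passed to `collection.insert_many(batch)` inside the loop, which keeps memory use bounded by the batch size.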


- Describe how many collections/how many documents
- Describe the schema of the dataset/collection
- Print out a sample document
- List and briefly describe the most important fields/attributes in the dataset

In [11]:
#1 Describe how many collections/how many documents

# Count the number of collections and documents in the database
def describe_database(db):
    # Get the list of collections in the database
    collections = db.list_collection_names()
    num_collections = len(collections)

    print(f"Total number of collections: {num_collections}")

    # Print the number of documents in each collection
    for collection_name in collections:
        collection = db[collection_name]
        num_documents = collection.count_documents({})
        print(f"Collection '{collection_name}' has {num_documents} documents.")

# Call the function to describe the database
describe_database(db)

Total number of collections: 1
Collection 'news_collection' has 209527 documents.


#2 Describe the schema of the dataset/collection

The schema defines the structure of the documents in the dataset. Based on the JSON file, each document follows this structure:

- `link` (String): The URL of the news article.
- `headline` (String): The headline or title of the news article.
- `category` (String): The category under which the article is classified (e.g., "U.S. NEWS", "COMEDY").
- `short_description` (String): A brief summary or description of the news article.
- `authors` (String): The author(s) of the article.
- `date` (String): The date the article was published, typically formatted as "YYYY-MM-DD".

On insert, MongoDB also adds an `_id` field (ObjectId) to each document.
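
The field types listed above can be checked against the data instead of being asserted by hand. A minimal sketch, assuming the documents are available as a list of dicts (as `documents` is in the import cell); `infer_schema` and `sample_docs` are hypothetical names introduced for illustration:

```python
from collections import defaultdict


def infer_schema(docs):
    """Map each field name to the sorted list of Python type names observed for it."""
    schema = defaultdict(set)
    for doc in docs:
        for field, value in doc.items():
            schema[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}


# Hypothetical documents shaped like the dataset's records
sample_docs = [
    {"headline": "Example headline", "category": "COMEDY", "date": "2022-09-23"},
    {"headline": "Another headline", "category": "U.S. NEWS", "date": "2022-09-22"},
]
print(infer_schema(sample_docs))
```

A field that maps to more than one type name would indicate that the collection is not as uniform as the description suggests.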

In [13]:
#3 Print out a sample document
from bson import ObjectId

# Recursively convert ObjectId values to strings so json.dumps can serialize the document
def convert_objectid_to_string(doc):
    if isinstance(doc, dict):
        return {k: convert_objectid_to_string(v) for k, v in doc.items()}
    elif isinstance(doc, list):
        return [convert_objectid_to_string(i) for i in doc]
    elif isinstance(doc, ObjectId):
        return str(doc)
    return doc

# insert_many() added an '_id' field to the in-memory dicts,
# so the first loaded document can serve as the sample
if documents:
    sample_document = documents[0]  # Get the first document
    sample_document_str = convert_objectid_to_string(sample_document)
    print("Sample document:")
    print(json.dumps(sample_document_str, indent=4))  # Pretty-print the sample document
else:
    print("No documents loaded to display.")


Sample document:
{
    "link": "https://www.huffpost.com/entry/covid-boosters-uptake-us_n_632d719ee4b087fae6feaac9",
    "headline": "Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters",
    "category": "U.S. NEWS",
    "short_description": "Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.",
    "authors": "Carla K. Johnson, AP",
    "date": "2022-09-23",
    "_id": "66f08bfd13e93754c8f7305e"
}


In [14]:
#4 List and briefly describe the most important fields/attributes in the dataset

# Describing the most important fields/attributes
fields_description = {
    "link": "A string containing the URL of the news article.",
    "headline": "A string representing the headline or title of the article.",
    "category": "A string indicating the category or section of the news article.",
    "short_description": "A brief summary or description of the news article.",
    "authors": "A string containing the name(s) of the author(s) of the article.",
    "date": "A string representing the publication date of the article, typically in 'YYYY-MM-DD' format."
}

print("Important fields/attributes in the dataset:")
for field, description in fields_description.items():
    print(f"{field}: {description}")


Important fields/attributes in the dataset:
link: A string containing the URL of the news article.
headline: A string representing the headline or title of the article.
category: A string indicating the category or section of the news article.
short_description: A brief summary or description of the news article.
authors: A string containing the name(s) of the author(s) of the article.
date: A string representing the publication date of the article, typically in 'YYYY-MM-DD' format.
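
The descriptions above can be complemented with a quick summary of the two fields most useful for analysis, `category` and `date`. This is a sketch under the assumption that every `date` value follows the 'YYYY-MM-DD' format; `summarize_fields` and `example_docs` are hypothetical names, and the records are made up for illustration:

```python
from collections import Counter
from datetime import datetime


def summarize_fields(docs):
    """Return per-category counts plus the earliest and latest publication dates."""
    category_counts = Counter(doc["category"] for doc in docs)
    # Parse the 'YYYY-MM-DD' strings into date objects; this also validates the format
    dates = [datetime.strptime(doc["date"], "%Y-%m-%d").date() for doc in docs]
    return category_counts, min(dates), max(dates)


# Hypothetical records using the dataset's field names
example_docs = [
    {"category": "COMEDY", "date": "2022-09-23"},
    {"category": "U.S. NEWS", "date": "2022-09-21"},
    {"category": "COMEDY", "date": "2022-09-22"},
]
counts, first_date, last_date = summarize_fields(example_docs)
print(counts.most_common(1), first_date, last_date)
```

Passing the loaded `documents` list instead of `example_docs` would report the category distribution and the date range of the full dataset.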


## Submission instruction
- Push the notebook to your group GitHub repository
- Upload a URL to the `data-eda.ipynb` to Moodle under the week 3 assignment