# Introduction to Data Formats

In the world of data exchange and storage, several formats are commonly used to structure and represent data. This guide provides an overview of four popular data formats: JSON, YAML, CSV, and XML.

## JSON (JavaScript Object Notation)

### Overview
JSON is a lightweight data interchange format that is easy for humans to read and write, and easy for machines to parse and generate. It is often used for transmitting data in web applications.

### Example
```json
{
    "timestamp": "2023-10-01T12:00:00Z",
    "level": "INFO",
    "message": "User login successful",
    "user": {
        "id": 1248,
        "username": "johndoe"
    },
    "ip_address": "192.168.1.1"
}
```
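
JSON syntax is stricter than a Python dictionary literal: strings and keys need double quotes, and a trailing comma is not allowed. The following is a minimal sketch of how Python's `json` module reports such problems. The sample strings and the `try_parse` helper are made up for illustration.

```python
import json

valid = '{"level": "INFO", "user": {"id": 1248}}'
trailing_comma = '{"level": "INFO",}'
single_quotes = "{'level': 'INFO'}"


def try_parse(text):
    """Hypothetical helper: return the parsed object, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # exc.msg describes the problem; exc.pos is the character offset
        print(f"Invalid JSON at position {exc.pos}: {exc.msg}")
        return None


print(try_parse(valid))
print(try_parse(trailing_comma))
print(try_parse(single_quotes))
```

Catching `json.JSONDecodeError` is useful when the JSON comes from an outside source, such as a file or a network response, that may be malformed.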

### Data Types
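JSON supports only a small set of data types, so not every Python object can be stored in it. The types it supports map to Python as follows:
 - strings (`str`)
 - numbers (`int` and `float`)
 - Booleans (`bool`)
 - arrays (`list`)
 - objects (`dict`, with string keys)
 - null (`None`)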
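
The type mapping can be checked with a quick round trip through `json.dumps` and `json.loads`. This is a small sketch with made-up sample data; it also shows two conversions that are easy to miss (tuples and non-string keys) and what happens with an unsupported type.

```python
import json

record = {
    "name": "johndoe",          # str   -> JSON string
    "id": 1248,                 # int   -> JSON number
    "score": 9.5,               # float -> JSON number
    "active": True,             # bool  -> true / false
    "tags": ("admin", "dev"),   # tuple -> JSON array (comes back as a list)
    "manager": None,            # None  -> null
    7: "numeric key",           # non-string keys are converted to strings
}

text = json.dumps(record)
print(text)

restored = json.loads(text)
print(restored["tags"])    # ['admin', 'dev']
print("7" in restored)     # True
print(restored == record)  # False: the tuple and the int key changed type

# Types such as set have no JSON equivalent and raise TypeError
try:
    json.dumps({"ids": {1, 2, 3}})
except TypeError as exc:
    print("Cannot serialize:", exc)
```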

In [None]:
# JSON is usually transferred as a string
from pprint import pprint
import json

log_data = '''
[
    {
        "timestamp": "2023-10-01T12:00:00Z",
        "level": "INFO",
        "message": "User login successful",
        "user": {
            "id": 1248,
            "username": "johndoe"
        },
        "ip_address": "192.168.1.1"
    },
    {
        "timestamp": "2023-10-01T12:05:00Z",
        "level": "ERROR",
        "message": "Failed to load user profile",
        "user": {
            "id": 2734,
            "username": "janesmith"
        },
        "ip_address": "192.168.1.2",
        "error": {
            "code": 500,
            "message": "Internal Server Error"
        }
    },
    {
        "timestamp": "2023-10-01T12:10:00Z",
        "level": "WARNING",
        "message": "Password attempt failed",
        "user": {
            "id": 3890,
            "username": "bobjohnson"
        },
        "ip_address": "192.168.1.3"
    },
    {
        "timestamp": "2023-10-01T12:15:00Z",
        "level": "INFO",
        "message": "User logout successful",
        "user": {
            "id": 1248,
            "username": "johndoe"
        },
        "ip_address": "192.168.1.1"
    },
    {
        "timestamp": "2023-10-01T12:20:00Z",
        "level": "ERROR",
        "message": "Database connection failed",
        "error": {
            "code": 503,
            "message": "Service Unavailable"
        },
        "ip_address": "192.168.1.4"
    },
    {
        "timestamp": "2023-10-01T12:25:00Z",
        "level": "ERROR",
        "message": "Unknown error",
        "error": {},
        "ip_address": "192.168.1.5"
    }
]
'''


json_log_data = json.loads(log_data)
print(len(json_log_data))
print(type(json_log_data))
print()
pprint(json_log_data)

In [None]:
# Iterating over the parsed data

errors = 0
for log_entry in json_log_data:
    if log_entry.get("level") == "ERROR":
        errors += 1
        # Fall back to an empty dict so a missing "error" key cannot raise
        error = log_entry.get("error", {})
        print(log_entry.get("message", "No message provided"))
        print(error.get("code"))
        print(error.get("message"))
        print()

print(f"Total errors: {errors}")


In [None]:
# We can also convert that back to a string

str_log_data = json.dumps(json_log_data)

print(type(str_log_data))
print(len(str_log_data))
print(str_log_data)

In [None]:
# We can format that string to be more readable

formatted_str_log_data = json.dumps(json_log_data, indent=4)
print(type(formatted_str_log_data))
print(len(formatted_str_log_data))
print(formatted_str_log_data)

In [None]:
# Reading JSON from a file
with open('mto.json', 'r') as file:
    data = json.load(file)
    print(type(data))
    pprint(data)

In [None]:
# Writing JSON to a file
with open("log.json", "w") as file:
    json.dump(json_log_data, file, indent=4)


In [None]:
# MTO Exercise

## YAML (YAML Ain't Markup Language)

### Overview
YAML is a human-readable data serialization format that is commonly used for configuration files. YAML files can have either a `.yaml` or a `.yml` extension; the two are interchangeable.

### Characteristics
- **Human-readable**: YAML is designed to be easy for humans to read and write.
- **Hierarchical**: Supports nested structures, allowing for complex data representations.
- **Flexible**: Can represent various data types, including scalars, lists, and dictionaries.
- **Whitespace-sensitive**: Uses indentation to denote structure, similar to Python.
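
YAML infers the data type of each unquoted scalar, which is worth checking when a value looks like a number. The sketch below assumes PyYAML is installed (`pip3 install pyyaml`); the keys and values are made up for illustration.

```python
import yaml  # PyYAML, assumed to be installed

snippet = """
app:
  version: 1.0.0     # not a valid number, so it stays a string
  port: 8080         # integer
  ratio: 0.75        # float
  debug: true        # Boolean
  max_size: 10MB     # string
  owner: null        # None
  handlers:          # list
    - console
    - file
"""

config = yaml.safe_load(snippet)
for key, value in config["app"].items():
    print(f"{key}: {value!r} ({type(value).__name__})")
```

Quoting a value, for example `port: "8080"`, forces it to be loaded as a string.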

### Example

```yaml
# Comment: Application configurations
---
app:
  name: MyApp
  version: 1.0.0
  environment: production

server:
  host: 0.0.0.0
  port: 8080
  ssl:
    enabled: true
    certificate_path: /etc/ssl/certs/myapp.crt
    key_path: /etc/ssl/private/myapp.key

database:
  type: postgresql
  host: localhost
  port: 5432
  name: myapp_db
  user: dbuser
  password: dbpassword
  pool_size: 10

logging:
  level: INFO
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
  handlers:
    - console
    - file
  file:
    path: /var/log/myapp.log
    max_size: 10MB
    backup_count: 5

features:
  authentication: true
  caching: true
  rate_limiting: false

email:
  smtp_server: smtp.example.com
  port: 587
  use_tls: true
  username: emailuser
  password: emailpassword
  from_address: no-reply@example.com
```
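
The `---` line in the example marks the start of a YAML document, and one file can hold several documents separated this way. As a rough sketch, PyYAML's `safe_load_all` reads each document as a separate Python object. The two small documents below are made up, and PyYAML is assumed to be installed.

```python
import yaml  # PyYAML, assumed to be installed

stream = """
---
environment: staging
port: 8080
---
environment: production
port: 443
"""

# safe_load_all returns a generator that yields one object per YAML document
documents = list(yaml.safe_load_all(stream))
for doc in documents:
    print(doc["environment"], doc["port"])
```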

In [None]:
# pip3 install pyyaml
# PyYAML can read YAML into Python objects
import yaml

with open('config.yaml') as file:
    data = yaml.safe_load(file)
    pprint(data)
    print(type(data))
print()

database = data.get("database")
print(database)

In [None]:
# Loading YAML from a string and writing it to a file

configs_yaml = """
logging:
    level: DEBUG
    format: text
    handlers:
        - console
        - file
    file: /var/log/app.log
    max_size: 10MB
    backup_count: 5
"""

data = yaml.safe_load(configs_yaml)

pprint(data)
print(type(data))
with open("config-output.yml", "w") as file:
    yaml.dump(data, file, indent=4)

In [None]:
# Writing YAML from a dictionary
configs_yaml = {
    'logging': {
        'backup_count': 5,
        'file': '/var/log/app.log',
        'format': 'text',
        'handlers': ['console', 'file'],
        'level': 'DEBUG',
        'max_size': '10MB'
    }
}
print(type(configs_yaml))
with open("config-output.yml", "w") as file:
    yaml.dump(configs_yaml, file, indent=4)

In [None]:
# Converting from JSON to YAML

with open("mto.json", "r") as json_file:
    json_data = json.load(json_file)

yaml_data = yaml.dump(json_data, indent=4)
print(yaml_data)

with open("mto.yaml", "w") as yaml_file:
    yaml.dump(json_data, yaml_file, indent=4)

## CSV (Comma-Separated Values)

### Overview
CSV is a format used to store tabular data, such as spreadsheets or database tables. Each line in a CSV file represents one row, and the values in that row are separated by commas. CSV files are commonly opened in Excel or Google Sheets.


CSV files are much more limited than Excel files. Every value is a string, and there is no formatting.
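
Because every value is read back as a string, numeric columns have to be converted explicitly. The sketch below uses `io.StringIO` as an in-memory stand-in for a file, with made-up sample rows.

```python
import csv
import io

# In-memory stand-in for a CSV file (sample data for illustration)
raw = (
    "name,age,email\n"
    "John Doe,30,john.doe@example.com\n"
    "Jane Smith,25,jane.smith@example.com\n"
)

reader = csv.reader(io.StringIO(raw))
header = next(reader)   # first row holds the column names
records = list(reader)  # remaining rows hold the data

print(header)
print(records[0])       # ['John Doe', '30', 'john.doe@example.com']

# The age column arrives as text, so convert it before doing arithmetic
ages = [int(record[1]) for record in records]
print(sum(ages) / len(ages))
```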

### Example

```csv
name,age,email
John Doe,30,john.doe@example.com
Jane Smith,25,jane.smith@example.com
Bob Johnson,40,bob.johnson@example.com
```
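
Since the comma separates values, a field that itself contains a comma or a quotation mark has to be quoted. The `csv` module does this automatically when writing and undoes it when reading. A minimal sketch with made-up rows:

```python
import csv
import io

rows = [
    ["name", "address"],
    ["John Doe", "123 Main St, Anytown"],    # contains a comma
    ["Jane Smith", 'Suite "B", 5th Floor'],  # contains quotes and a comma
]

# Write to an in-memory buffer instead of a file
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)

# Reading the text back restores the original fields
parsed = list(csv.reader(io.StringIO(text, newline="")))
print(parsed == rows)
```

Splitting each line with `str.split(",")` would break these rows apart in the wrong places, which is the main reason to prefer the `csv` module.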


In [None]:
# Reading CSV data as a list of lists

import csv

with open('users.csv', 'r', newline='') as file:  # newline='' is recommended for csv files
    reader = csv.reader(file)
    data = list(reader)
    print(data)


In [None]:
# Using a for loop
with open('users.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)

In [None]:
# Writing to a CSV file

data = ["apples", "bananas", "cherries", "oranges", "lemons"]
with open('fruit.csv', 'w', newline='') as file:  # newline='' avoids extra blank lines on Windows
    writer = csv.writer(file)
    writer.writerow(data) # Can write individual rows
    writer.writerow([1, 2, 3, 4, 5])

In [None]:
# Delimiter and Lineterminator
data = [
    ['apples', 'bananas', 'cherries', 'oranges', 'lemons'],
    ["one", "two", "three", "four", "five"]
]
with open('fruit.csv', 'w', newline='') as file:
    writer = csv.writer(file, delimiter='\t', lineterminator='\n\n')
    writer.writerows(data)  # writerows writes multiple rows

In [None]:
# DictReader
# Puts each row of the CSV data into a dictionary, where the keys are the headers

with open('headers.csv', 'r', newline='') as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(row)
    print()
    # After the loop, row still refers to the last row that was read
    print(row["name"], row["email"])

In [None]:
# DictWriter

with open('headers-output.csv', 'w', newline='') as file:
    output = csv.DictWriter(file, fieldnames=["Name", "Email", "State"])
    output.writeheader()
    output.writerow({
        "Name": "John Doe",
        "Email": "john.doe@example.com",
        "State": "CA"
    })
    output.writerow({
        "Name": "Jane Smith",
        "Email": "jane.smith@example.com",
        "State": "NY"
    })

## XML (eXtensible Markup Language)

### Overview
XML is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It is commonly used for data representation and transfer, especially in web services and APIs.

### Characteristics
- **Self-descriptive**: Tags describe the data they enclose.
- **Hierarchical**: Supports nested elements, allowing for complex data structures.
- **Extensible**: Users can define their own tags, making it flexible for various applications.
- **Platform-independent**: Can be used across different systems and platforms.
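
An XML element can carry data both as attributes on its tag and as nested child elements. The sketch below parses a made-up snippet with the standard library's `xml.etree.ElementTree`; the `role` attribute is an assumption added for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up snippet: "role" is stored as an attribute, the rest as child elements
snippet = """
<user role="admin">
    <username>johndoe</username>
    <age>30</age>
    <newsletter>true</newsletter>
</user>
""".strip()

user = ET.fromstring(snippet)
print(user.tag)                    # user
print(user.attrib)                 # {'role': 'admin'}
print(user.find("username").text)  # johndoe

# Element text is always a string, so convert it explicitly
age = int(user.find("age").text)
newsletter = user.find("newsletter").text == "true"
print(age, newsletter)
```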

### Example
```xml
<users>
    <user>
        <id>1</id>
        <username>johndoe</username>
        <email>johndoe@example.com</email>
        <profile>
            <first_name>John</first_name>
            <last_name>Doe</last_name>
            <age>30</age>
            <gender>male</gender>
            <address>
                <street>123 Main St</street>
                <city>Anytown</city>
                <state>CA</state>
                <zip>12345</zip>
            </address>
        </profile>
        <preferences>
            <newsletter>true</newsletter>
            <notifications>
                <email>true</email>
                <sms>false</sms>
            </notifications>
        </preferences>
    </user>
</users>
```
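
A nested structure like the one above can be walked recursively. The following is a rough sketch that converts a shortened version of the example into nested dictionaries. `element_to_dict` is a hypothetical helper, not part of any library, and it makes simplifying assumptions: attributes are ignored and sibling tags are assumed to be unique.

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Recursively convert an element into nested dicts of strings.

    Simplifying assumptions: attributes are ignored, and repeated
    sibling tags would overwrite each other.
    """
    children = list(element)
    if not children:
        # Leaf element: return its text content
        return (element.text or "").strip()
    return {child.tag: element_to_dict(child) for child in children}


snippet = """
<user>
    <username>johndoe</username>
    <profile>
        <first_name>John</first_name>
        <address>
            <city>Anytown</city>
            <state>CA</state>
        </address>
    </profile>
</user>
""".strip()

user = element_to_dict(ET.fromstring(snippet))
print(user)
print(user["profile"]["address"]["state"])
```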

In [None]:
# pip3 install lxml
# pip3 install beautifulsoup4

from bs4 import BeautifulSoup

with open('users.xml', 'r') as f:
    data = f.read()

soup = BeautifulSoup(data, "xml")

state = soup.find("state")
print(state)

print(state.text)
print(state.attrs)
print()

all_states = soup.find_all("state")
print(all_states)

In [None]:
# Getting the text of the tags
for state in all_states:
    print(state.text)

In [None]:
import xml.etree.ElementTree as ET

tree = ET.parse('users.xml')
root = tree.getroot()

print(root)
print()

state = root.find("user/profile/address/state")
print(state.text)
print()

# findall returns every element that matches the path
state_list = root.findall("user/profile/address/state")
for state in state_list:
    print(state.text)



In [None]:
# Writing XML
import xml.etree.ElementTree as ET

# Create the root element
root = ET.Element("users")

# Create a user element
user = ET.SubElement(root, "user")
ET.SubElement(user, "id").text = "1"
ET.SubElement(user, "username").text = "johndoe"
ET.SubElement(user, "email").text = "johndoe@example.com"

# Create a profile element
profile = ET.SubElement(user, "profile")
ET.SubElement(profile, "first_name").text = "John"
ET.SubElement(profile, "last_name").text = "Doe"
ET.SubElement(profile, "age").text = "30"
ET.SubElement(profile, "gender").text = "male"

# Create an address element
address = ET.SubElement(profile, "address")
ET.SubElement(address, "street").text = "123 Main St"
ET.SubElement(address, "city").text = "Anytown"
ET.SubElement(address, "state").text = "CA"
ET.SubElement(address, "zip").text = "12345"

# Create a preferences element
preferences = ET.SubElement(user, "preferences")
ET.SubElement(preferences, "newsletter").text = "true"
notifications = ET.SubElement(preferences, "notifications")
ET.SubElement(notifications, "email").text = "true"
ET.SubElement(notifications, "sms").text = "false"

# Wrap the root in an ElementTree, indent it, and write it to a file
tree = ET.ElementTree(root)
ET.indent(tree, space="\t")  # ET.indent requires Python 3.9 or later

tree.write("output.xml", encoding="utf-8", xml_declaration=True)