<a href="https://colab.research.google.com/github/Chood16/DSCI222/blob/main/lectures/(8)_Semi_Structured_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Semi-Structured Data in Python

## 1. Introduction: Structured vs Semi-Structured vs Unstructured Data

- **Structured Data**: Organized in fixed fields, e.g., relational databases, spreadsheets. Easy to query with SQL.
- **Semi-Structured Data**: Does not conform to rigid tabular structures but still has some organizational properties (tags, key-value pairs). Examples: JSON, XML, YAML, HTML, log files.
- **Unstructured Data**: No inherent structure, difficult to query directly. Examples: images, videos, free-text documents.

Semi-structured data is common in modern data workflows because it provides flexibility while still being machine-readable.

## 2. JSON (JavaScript Object Notation)
- **What it is**: Lightweight format for representing structured data as nested key-value pairs.
- **Common Uses**: Web APIs, configuration files, NoSQL databases (e.g., MongoDB).
- **Sample Data**: [Sample JSON Dataset](https://jsonplaceholder.typicode.com/users)

In [None]:
import requests
import pandas as pd

# requests lets Python code talk to web services, APIs, and websites by sending HTTP requests (GET, POST, PUT, DELETE) and reading the responses.
# To parse is to take raw text or data and convert it into a structured form that a program can work with.


url = "https://jsonplaceholder.typicode.com/users"
json_data = requests.get(url).json() # <-- .json() parses the response body as JSON and returns Python lists and dicts

display(json_data)



In [None]:
# Load JSON into DataFrame
# json_normalize flattens a (possibly nested) JSON structure into a df; nested keys become dotted column names
df_json = pd.json_normalize(json_data)
df_json.head()
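
To see what "flattening" means, here is a minimal plain-Python sketch of the idea behind `json_normalize`. It is not pandas' actual implementation. The `flatten` helper and the `user` record are made up for illustration; the record is only shaped like an entry from the users endpoint.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts and merge their flattened keys
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Hypothetical record (not real API output)
user = {
    "id": 1,
    "name": "Ada Example",
    "address": {"city": "Morgantown", "geo": {"lat": "39.63", "lng": "-79.95"}},
}

flat_user = flatten(user)
print(flat_user)
```

Each nested key ends up as a dotted name such as `address.geo.lat`, which is the same naming style you see in the columns of `df_json`.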

## 3. XML (eXtensible Markup Language)
- **What it is**: Tag-based format for hierarchical data.
- **Common Uses**: RSS feeds, configuration files, document storage.
- **Sample Data**: [Sample XML Dataset](https://www.w3schools.com/xml/simple.xml)

In [None]:
import xml.etree.ElementTree as ET

# xml.etree.ElementTree will be used to parse .xml data

xml_url = "https://www.w3schools.com/xml/simple.xml"
xml_data = requests.get(xml_url).text # <-- .text is an attribute (no parentheses) that returns the response body as a string


xml_data

In [None]:
# Create an Element object called root.
# Parses xml_data and converts it into an ElementTree element, which is a tree structure Python can navigate.
root = ET.fromstring(xml_data)

root


In [None]:
# Let's look at some of the information stored in root

print(root.tag)

In [None]:
print(root.find('food').find('name').text)

In [None]:
for food in root.findall('food'):
    print(food.find('price').text)

In [None]:
# Let's look through all food items and convert this xml file into a df

menu_items = []
for food in root.findall("food"):
    menu_items.append({
        "name": food.find("name").text,
        "price": food.find("price").text,
        "description": food.find("description").text
    })

display(menu_items) # <-- we've created a list of dictionaries that can be converted into a df

In [None]:
df_xml = pd.DataFrame(menu_items)
df_xml.head()
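
Everything pulled out of XML with `.text` is a string, so a price like `$5.95` cannot be summed or averaged until it is converted. The sketch below shows one way to do that on a small inline document. The `sample_xml` string is made up to resemble the menu file, and `findtext` is a shortcut for `find(...).text`.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML shaped like the breakfast menu (not the real file)
sample_xml = """
<breakfast_menu>
  <food><name>Plain Waffles</name><price>$5.95</price></food>
  <food><name>French Toast</name><price>$4.50</price></food>
</breakfast_menu>
"""

sample_root = ET.fromstring(sample_xml)

rows = []
for food in sample_root.findall("food"):
    rows.append({
        "name": food.findtext("name"),
        # Strip the leading dollar sign, then convert the string to a float
        "price": float(food.findtext("price").lstrip("$")),
    })

print(rows)
```

This assumes every price starts with `$` and is otherwise a plain number; other currencies or missing prices would need extra handling.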

## 4. YAML (YAML Ain't Markup Language), files often use the .yml extension
- **What it is**: Human-readable data format, similar to JSON but easier for configuration.
- **Common Uses**: Configuration files (store settings or options for a program) and workflow files (automated steps or tasks).
- **Sample Data**: [Sample YAML Dataset](https://raw.githubusercontent.com/Chood16/DSCI222/main/.github/workflows/engadget.yml)

In [None]:
import yaml

yaml_url = "https://raw.githubusercontent.com/Chood16/DSCI222/main/.github/workflows/engadget.yml"
yaml_text = requests.get(yaml_url).text
yaml_data = yaml.safe_load(yaml_text)

yaml_data

## 5. HTML (HyperText Markup Language)
- **What it is**: The standard format for web pages.
- **Common Uses**: Web scraping and extracting data from websites.
- **Sample Data**: [Sample HTML Table](https://www.contextures.com/xlSampleData01.html)

In [None]:
import requests

url = "https://github.com/Chood16/DSCI222"
response = requests.get(url)

html_content = response.text
print(html_content[:1000])  # print the first 1000 characters to check
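
Printing raw HTML shows the tags, but scraping means pulling specific pieces out of them. As a minimal sketch using only the standard library, the class below collects every link (`<a href=...>`) from a small inline page. `LinkCollector` and `sample_html` are made up for this example; real pages are messier, and dedicated scraping libraries are usually more convenient.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page content
sample_html = """
<html><body>
  <h1>Course Links</h1>
  <a href="https://example.com/syllabus">Syllabus</a>
  <a href="https://example.com/lectures">Lectures</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```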



## 6. Log Files
- **What it is**: Text-based records of events, typically semi-structured with timestamps, log levels, and messages.
- **Common Uses**: Application monitoring, error tracking, system audits.
- **Sample Data**: [Sample Apache Log File](https://raw.githubusercontent.com/elastic/examples/master/Common%20Data%20Formats/apache_logs/apache_logs)

In short, Apache HTTP Server is the software that powers a web server, taking requests from clients and delivering web pages, APIs, or other content over the internet. This is an example log of those requests. What information could we possibly obtain from it?

In [None]:
# The regular expressions module (re) allows you to search, match, and manipulate strings using patterns.
import re

log_url = "https://raw.githubusercontent.com/elastic/examples/master/Common%20Data%20Formats/apache_logs/apache_logs"
log_text = requests.get(log_url).text.split("\n")
display(log_text[:10]) # <-- let's look at the first ten lines


In [None]:
# regex parse example, what could this possibly mean?
log_pattern = re.compile(r'(?P<ip>\S+) - - \[(?P<date>.*?)\] "(?P<request>.*?)" (?P<status>\d+) (?P<size>\d+)')



Here we are using `re.compile` to turn the pattern string above into a regex object that is easy to reuse later on.

With this object, we can:

| Method      | Purpose                                   | Example                                                      | Output                  |
|------------|-------------------------------------------|-------------------------------------------------------------|------------------------|
| `match()`   | Check pattern at **start** of string     | `re.match(r"\d+", "123abc")`                                | Match object (matches '123') |
| `search()`  | Find pattern **anywhere** in string      | `re.search(r"\d+", "abc123")`                               | Match object (matches '123') |
| `findall()` | Find **all matches** in string           | `re.findall(r"\d+", "I have 2 cats and 3 dogs")`           | `['2', '3']`           |
| `sub()`     | **Replace matches** with another string  | `re.sub(r"\d", "#", "123-456")`                             | `"###-###"`            |


Our log pattern in particular breaks down as follows:

| Part                 | Meaning                                                                                                                       |
| -------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| `(?P<ip>\S+)`        | Named group `ip`. Matches one or more non-whitespace characters (`\S+`). Captures the client IP address.                      |
| `- -`                | Matches the literal `- -` in the log. These usually represent remote logname and authenticated user (often unused).           |
| `\[`                 | Matches a literal `[` character. Needed because `[` has special meaning in regex.                                             |
| `(?P<date>.*?)`      | Named group `date`. Matches any characters lazily (`.*?`) until the next pattern. Captures the timestamp of the request.      |
| `\]`                 | Matches a literal `]` character to close the timestamp.                                                                       |
| `"(?P<request>.*?)"` | Named group `request`. Matches the request line (e.g., `GET /index.html HTTP/1.1`). Lazily captures everything inside quotes. |
| `(?P<status>\d+)`    | Named group `status`. Matches one or more digits (`\d+`). Captures the HTTP status code (e.g., 200, 404).                     |
| `(?P<size>\d+)`      | Named group `size`. Matches one or more digits (`\d+`). Captures the size of the response in bytes.                           |


In [None]:
# What in the world does this mean?
log_parsed = [log_pattern.match(line).groupdict() for line in log_text if log_pattern.match(line)]

# Let's look at all of the code together



In [None]:
import re

log_url = "https://raw.githubusercontent.com/elastic/examples/master/Common%20Data%20Formats/apache_logs/apache_logs"
log_text = requests.get(log_url).text.split("\n")
log_pattern = re.compile(r'(?P<ip>\S+) - - \[(?P<date>.*?)\] "(?P<request>.*?)" (?P<status>\d+) (?P<size>\d+)')
log_parsed = [log_pattern.match(line).groupdict() for line in log_text if log_pattern.match(line)]
display(log_parsed[:1])
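
The list comprehension packs a loop, a filter, and a conversion into one line. Written out as an ordinary loop, the logic looks like the sketch below. The `sample_lines` list is made up so the example runs without a download, and the last two lines show an optional variant (Python 3.8+) that uses `:=` so each line is matched only once.

```python
import re

log_pattern = re.compile(
    r'(?P<ip>\S+) - - \[(?P<date>.*?)\] "(?P<request>.*?)" (?P<status>\d+) (?P<size>\d+)'
)

# Hypothetical input: one valid line, one junk line, one empty line
sample_lines = [
    '203.0.113.5 - - [17/May/2015:10:05:03 +0000] "GET /index.html HTTP/1.1" 200 1024',
    "this line is not in Apache format",
    "",
]

parsed_rows = []
for line in sample_lines:
    match = log_pattern.match(line)  # a Match object, or None if the line does not fit
    if match is None:
        continue                     # skip lines that do not match
    parsed_rows.append(match.groupdict())

# Same result as a comprehension, matching each line only once
walrus_rows = [m.groupdict() for line in sample_lines if (m := log_pattern.match(line))]

print(parsed_rows)
```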

In [None]:
df_logs = pd.DataFrame(log_parsed)
df_logs.head()
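
Every captured value is still a string, including the status code, the size, and the timestamp. Below is a minimal sketch of turning one parsed row into typed values with the standard library. The `row` dictionary is made up, and the code assumes the request line has exactly three space-separated parts, which malformed requests in a real log may not.

```python
from datetime import datetime

# Hypothetical parsed row, shaped like one entry of log_parsed
row = {
    "ip": "203.0.113.5",
    "date": "17/May/2015:10:05:03 +0000",
    "request": "GET /index.html HTTP/1.1",
    "status": "200",
    "size": "1024",
}

# Assumption: the request line looks like "METHOD PATH PROTOCOL"
method, path, protocol = row["request"].split(" ", 2)

typed_row = {
    "ip": row["ip"],
    # %d day, %b abbreviated month name, %Y year, %z UTC offset such as +0000
    "timestamp": datetime.strptime(row["date"], "%d/%b/%Y:%H:%M:%S %z"),
    "method": method,
    "path": path,
    "protocol": protocol,
    "status": int(row["status"]),
    "size": int(row["size"]),
}

print(typed_row)
```

With typed values you can filter by status code, sum response sizes, or group requests by hour.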

# Example: How to make an RSS feed (.xml file) and run it with a .yml file

```xml
<rss version="2.0">
  <channel>
    <title>My Example RSS Feed</title>
    <link>https://www.example.com</link>
    <description>This is an example RSS feed</description>
    <language>en-us</language>
    <lastBuildDate>Mon, 15 Sep 2025 00:28:21 +0000</lastBuildDate>
    <item>
        <title>First Article</title>
        <link>https://www.example.com/articles/1</link>
        <description>This is a summary of the first article.</description>
        <pubDate>Mon, 15 Sep 2025 00:28:21 +0000</pubDate>
        <guid>https://www.example.com/articles/1</guid>
    </item>
    <item>
        <title>Second Article</title>
        <link>https://www.example.com/articles/2</link>
        <description>This is a summary of the second article.</description>
        <pubDate>Mon, 15 Sep 2025 00:28:21 +0000</pubDate>
        <guid>https://www.example.com/articles/2</guid>
    </item>
  </channel>
</rss>
```



In [None]:

# Let's take care of the time component first
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

def rfc822_now():
    """Return current UTC time formatted for RSS (RFC 822).
    RFC 822 is a technical standard that defines the format for text-based Internet messages"""
    return datetime.now(timezone.utc).strftime("%a, %d %b %Y %H:%M:%S %z")

# What does each of those % codes mean?
display(rfc822_now())
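
One way to answer that question is to format a fixed date one code at a time. The sketch below uses the timestamp from the example feed so the result is reproducible; the variable names are made up. It also shows `email.utils.format_datetime` from the standard library, which produces the same layout and, unlike `%a` and `%b`, always uses English day and month names regardless of locale.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

fixed = datetime(2025, 9, 15, 0, 28, 21, tzinfo=timezone.utc)

# Format the same moment with each directive on its own
codes = ["%a", "%d", "%b", "%Y", "%H", "%M", "%S", "%z"]
parts = {code: fixed.strftime(code) for code in codes}
for code, value in parts.items():
    print(code, "->", value)

formatted = fixed.strftime("%a, %d %b %Y %H:%M:%S %z")
print(formatted)

# Standard-library helper that builds the same string
print(format_datetime(fixed))
```

In order, the codes are: abbreviated weekday, day of month, abbreviated month, four-digit year, hour (24-hour clock), minute, second, and UTC offset.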




In [None]:
# Root <rss> element
rss = ET.Element("rss", version="2.0") # <-- begin creating the rss feed
channel = ET.SubElement(rss, "channel") # <-- .SubElement creates a child element

# Channel metadata
ET.SubElement(channel, "title").text = "My Example RSS Feed"
ET.SubElement(channel, "link").text = "https://www.example.com"
ET.SubElement(channel, "description").text = "This is an example RSS feed"
ET.SubElement(channel, "language").text = "en-us"
ET.SubElement(channel, "lastBuildDate").text = rfc822_now()

# Add first item
item1 = ET.SubElement(channel, "item")
ET.SubElement(item1, "title").text = "First Article"
ET.SubElement(item1, "link").text = "https://www.example.com/articles/1"
ET.SubElement(item1, "description").text = "This is a summary of the first article."
ET.SubElement(item1, "pubDate").text = rfc822_now()
ET.SubElement(item1, "guid").text = "https://www.example.com/articles/1" # <--Globally Unique Identifier

# Add second item
item2 = ET.SubElement(channel, "item")
ET.SubElement(item2, "title").text = "Second Article"
ET.SubElement(item2, "link").text = "https://www.example.com/articles/2"
ET.SubElement(item2, "description").text = "This is a summary of the second article."
ET.SubElement(item2, "pubDate").text = rfc822_now()
ET.SubElement(item2, "guid").text = "https://www.example.com/articles/2"

# Convert to XML file
tree = ET.ElementTree(rss) # <-- wraps the rss Element in an ElementTree, which knows how to write itself to a file
tree.write("feed.xml", encoding="utf-8", xml_declaration=True)
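
The two item blocks above are identical apart from their values, so with more articles a loop is easier to maintain. Here is a minimal sketch of that idea. The `articles` list and the `demo_channel` name are made up for illustration, and `pubDate` is left out to keep it short.

```python
import xml.etree.ElementTree as ET

# Hypothetical article data
articles = [
    {"title": "First Article", "link": "https://www.example.com/articles/1",
     "description": "This is a summary of the first article."},
    {"title": "Second Article", "link": "https://www.example.com/articles/2",
     "description": "This is a summary of the second article."},
]

demo_channel = ET.Element("channel")

for article in articles:
    item = ET.SubElement(demo_channel, "item")
    for tag in ("title", "link", "description"):
        ET.SubElement(item, tag).text = article[tag]
    # As in the cell above, reuse the link as the item's unique identifier
    ET.SubElement(item, "guid").text = article["link"]

# encoding="unicode" returns a str instead of bytes, handy for a quick look
print(ET.tostring(demo_channel, encoding="unicode"))
```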



A bit more on encoding. A text encoding represents characters as bytes so that computers can store, transmit, and read text correctly. Think of it as a mapping between characters (letters, numbers, symbols) and the numeric codes that computers store.

`encoding="utf-8"` sets the text encoding of the file. UTF-8 is the default encoding for XML and can represent every Unicode character. State it explicitly when writing an XML file (like an RSS feed): `tree.write()` otherwise defaults to US-ASCII and writes non-ASCII characters as numeric character references, and `xml_declaration=True` records the encoding at the top of the file so that any program reading it knows how to interpret the bytes as characters.
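
A small sketch of the character-to-bytes mapping, using the made-up word `Café`. It first encodes the string directly, then serializes the same text as XML under two encodings to show how the output bytes differ.

```python
import xml.etree.ElementTree as ET

text = "Café"

# Four characters, but five bytes: "é" takes two bytes in UTF-8
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)                  # b'Caf\xc3\xa9'
print(len(text), len(utf8_bytes))  # 4 5

title = ET.Element("title")
title.text = text

# US-ASCII cannot store "é", so it is written as a character reference
as_ascii = ET.tostring(title, encoding="us-ascii")
# UTF-8 stores "é" directly as the two bytes seen above
as_utf8 = ET.tostring(title, encoding="utf-8")

print(as_ascii)
print(as_utf8)
```

Both versions parse back to the same text; the difference is only in how the bytes on disk represent it.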

## Now let's create a .yml file to run with an actual RSS feed!

https://wvusports.com/rss

https://github.com/Chood16/DSCI222/blob/main/.github/workflows/WVU_Sports.yml

https://github.com/Chood16/DSCI222/actions/workflows/WVU_Sports.yml

https://github.com/Chood16/DSCI222/blob/main/lectures/wvu_sports.xml

https://chood16.github.io/DSCI222/lectures/wvu_sports.xml