# <center>Big Data for Engineers &ndash; Exercises</center>
## <center>Spring 2025 &ndash; Week 4 &ndash; ETH Zurich</center>

# Introduction and setup
This exercise will cover XML and JSON well-formedness.

For the next few weeks you will be using [oXygen 26.0](https://www.oxygenxml.com/xml_editor/download_oxygenxml_editor.html), an XML/JSON development IDE. Before starting, make sure oXygen is installed and working on your computer. You can download the required licence from the [ETH IT shop](https://itshop.ethz.ch/EndUser/Items/Home):

1. Log in with your ETH credentials.

2. Click on `+ CREATE REQUEST` in the top right, select **Software and Business Applications** and go to **Software & Licenses** > **Order Software Product**.

3. Look for "oxygen" and select the version that fits your local setup.

4. Click **Next step** at the bottom, and accept the terms of service.

5. Wait for the confirmation email (it should arrive within a couple of minutes). Download the __license file__, then download the software from the [official website](https://www.oxygenxml.com/xml_editor/software_archive_editor.html) and proceed with the installation. At some point the installer should ask you for the __license file__.

6. Alternatively, after downloading, open a shell and `cd` to the directory that contains the installer.

- At the prompt, type:
```
sh ./oxygen-64bit-openjdk.sh
```
- Copy the license key (License Key String) from the instructions you received in step 5 and paste it into the application's license registration dialog.

*Another option is to follow the instructions on the IT shop page and use the server address information given there for your operating system.*

# 1. JSON 

## 1.1 Well-formedness
Correct the following JSON documents to be well-formed. Try first to "parse" them in your mind manually, then use oXygen to check your solutions.
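
If oXygen is not at hand, a quick programmatic check is also possible. The following is a minimal sketch in Python, assuming the standard-library `json` module is an acceptable stand-in for the editor; the helper name `check_json` is made up for this illustration. Note that Python's parser is slightly more lenient than the JSON grammar (for example, it accepts `NaN` and silently tolerates duplicate keys), so passing this check is a strong hint rather than a proof of well-formedness.

```python
import json


def check_json(text):
    """Return None if `text` parses as JSON, otherwise a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


print(check_json('{"age": 25}'))       # None, the document parses
print(check_json('{age: 25}'))         # an error: the key is not double quoted
print(check_json('{"spouse": Null}'))  # an error: only lowercase null is a literal
```

The reported line and column point at the first position where parsing failed, which is usually at or shortly after the actual mistake.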

### 1.1.1 Document A

```
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  age: 25,
  "isRetired",
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100",
    'is verified' : "true"
  }
  'phoneNumbers': [
    {
      "type": [["home"]],
      "@number": "212 555-1234"
    },
    {
      "type": [["office"]],
      "@number": "646 555-4567"
    },
    {
      "type": [["mobile"[],
      "@number": "123 456-7890"
    }
  ],
  "children": [],
  "settings": {},
  "spouse": Null,
  "": ""
}
```

#### **Solution**
1. `age` key must be double quoted.
2. `isRetired` must have a value.
3. `is verified` and `phoneNumbers` should be double quoted.
4. `address` object must be followed by a comma.
5. The nested array in the `type` attribute of the last `phoneNumbers` is incorrectly balanced (`[["mobile"[]`).
6. `Null` is not a valid value (`null` is valid).

*Best practices:*
- Using whitespace and non-ASCII characters in key names is allowed, although not recommended.
- Mixing proper boolean values and strings used as boolean values (i.e. "true") is considered bad practice.

Corrected document:

```json
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "isRetired": false,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100",
    "isVerified" : true
  },
  "phoneNumbers": [
    {
      "type": [["home"]],
      "@number": "212 555-1234"
    },
    {
      "type": [["office"]],
      "@number": "646 555-4567"
    },
    {
      "type": [["mobile"]],
      "@number": "123 456-7890"
    }
  ],
  "children": [],
  "settings": {},
  "spouse": null,
  "": ""
}
```

### 1.1.2 Document B

```
[
    1: {
      "name": 'John'
      "lastname": 'Smith',
      "account": "jsmith"
      "phonenumbers" [{
           "type": "home",
           "1phone": 212-3242,
           "2phone": "545-4568"
       }]
    },
    2: {
      "name": "Jane"
      "lastname": 'Doe',
      "account": "jdoe"
      "phonenumbers" [
      {
           "type": "home",
           "phone": "8989 7685"
      },
      "phone": "545-4568"
      ],
      "account": "janedoe"
    }
]
```

#### **Solution:**
1. The document must start with `{`, not with `[`, as we have key-value pairs inside.
2. All strings must be double quoted.
3. Commas are missing after `"John"`, `"jsmith"`, `"Jane"` and `"jdoe"`.
4. A `:` is missing after each `phonenumbers` key.
5. `212-3242` is not a valid number; to keep the dash, the value must be a string.
6. `"phone": "545-4568"` cannot be an element of an array; it has to be part of an object (inside `{ }`).
7. The key `account` is duplicated in the second element.

Corrected document:

```json
{
    "1": {
        "name": "John",
        "lastname": "Smith",
        "account": "jsmith",
        "phonenumbers": [{
            "type": "home",
            "1phone": "212-3242",
            "2phone": "545-4568"
        }]
    },
    "2": {
        "name": "Jane",
        "lastname": "Doe",
        "account": "jdoe",
        "phonenumbers": [
            {
                "type": "home",
                "phone": "8989 7685"
            },
            {
                "phone": "545-4568"
            }
        ]
    }
}
```
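
Duplicate keys, as in item 7 of the solution, are easy to miss because many parsers do not complain about them. The sketch below illustrates this with Python's standard `json` module, which keeps the last value by default; the hook function `reject_duplicates` is a hypothetical helper written for this example, plugged into the module's `object_pairs_hook` parameter.

```python
import json


def reject_duplicates(pairs):
    """Build a dict from (key, value) pairs, raising on a repeated key."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result


doc = '{"account": "jdoe", "account": "janedoe"}'

# Default behaviour: no error, the last value silently wins.
print(json.loads(doc))  # {'account': 'janedoe'}

# With the hook, the duplicate is reported instead.
try:
    json.loads(doc, object_pairs_hook=reject_duplicates)
except ValueError as err:
    print(err)
```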

### 1.1.3 Document C

```
{
  "Physical quantities": [
    {"elementary charge": +1.6033e-19},
    {"electron specific charge": -1758819}
  ]
}
```

#### **Solution**
1. Leading plus sign (`+`) in front of the numeric value is not valid in JSON.
2. Negative numeric values (`-1758819`) are allowed, so that part is fine.
3. The rest of the structure (object, array, key-value pairs) is correct.

Corrected document:
```json
{
  "Physical quantities": [
    {"elementary charge": 1.6033e-19},
    {"electron specific charge": -1758819}
  ]
}
```
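
The rules for JSON numbers can be summarised in a single regular expression. The following is a sketch in Python that transcribes the number grammar (optional minus sign, integer part without leading zeros, optional fraction, optional exponent); the names `JSON_NUMBER` and `is_json_number` are chosen for this illustration only.

```python
import re

# optional minus, integer part without leading zeros,
# optional fraction, optional exponent with optional sign
JSON_NUMBER = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")


def is_json_number(token):
    """Return True if `token` is a syntactically valid JSON number."""
    return JSON_NUMBER.fullmatch(token) is not None


for token in ["+1.6033e-19", "1.6033e-19", "-1758819", "01", ".5", "1."]:
    print(token, is_json_number(token))
```

Only the second and third tokens are accepted: a leading `+`, a leading zero, a missing integer part, and a dangling decimal point are all rejected. A sign inside the exponent (`e-19`, `e+19`) is allowed.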


### 1.1.4 Document D

```
{
  "Physical quantities": [
    "sl":299792458,
    "eg":1.60217733e+19,
    "ep":-0
  ]
}
```

#### **Solution**
1. The snippet places `"sl"`, `"eg"`, and `"ep"` directly in the array, but key-value pairs can only appear inside an object. This set of key-value pairs should therefore be wrapped in an object: `{ "sl": ..., "eg": ..., "ep": ... }`.
2. Negative zero (`-0`) is unusual but still valid JSON.

Corrected document:

```json
{
  "Physical quantities": [
    {
      "sl":299792458,
      "eg":1.60217733e+19,
      "ep":-0
    }
  ]
}
```

## 1.2 JSON Key Names
Which of the following are well-formed JSON key names? 
1. `""`
1. `"123456"`
1. `"abcd"`
1. `"\"`
1. `"\\"`
1. `"""`
1. `"'"`

#### **Solution**

1, 2, 3, 5 and 7 are valid key names. The only restriction the JSON syntax imposes on key names is that `"`, `\` and control characters must be escaped.
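
The answer can be double-checked mechanically. The sketch below, written in Python with the standard `json` module, wraps each candidate in a one-member object and tries to parse it; `CANDIDATES` and `is_valid_key` are names invented for this example. Raw strings (`r'...'`) are used so that the backslashes reach the JSON parser unchanged.

```python
import json

# The seven candidates from the exercise, in order.
CANDIDATES = [
    r'""',
    r'"123456"',
    r'"abcd"',
    r'"\"',
    r'"\\"',
    r'"""',
    "\"'\"",
]


def is_valid_key(candidate):
    """Return True if `candidate` can be used as a key in a JSON object."""
    try:
        json.loads("{" + candidate + ": 0}")
    except json.JSONDecodeError:
        return False
    return True


valid = [i for i, c in enumerate(CANDIDATES, start=1) if is_valid_key(c)]
print(valid)  # [1, 2, 3, 5, 7]
```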

# 2. XML
## 2.1 Well-formedness
Correct the following XML documents so that they are well-formed. Just as with the JSON documents in the previous section, first try to solve the problems without software, and then check your solutions.
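
As with JSON, a small script can serve as a second opinion next to oXygen. The following sketch assumes Python's standard `xml.etree.ElementTree` parser; `check_xml` is a hypothetical helper name. Keep in mind that this parser only reports what it treats as fatal, so conventions such as avoiding names that start with xml are not checked.

```python
import xml.etree.ElementTree as ET


def check_xml(text):
    """Return None if `text` parses as XML, otherwise the parser's error message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None


samples = [
    "<a><b/></a>",         # fine
    "<Book></book>",       # start and end tag differ in case
    "<a lang=en/>",        # attribute value is not quoted
    "<a v='1' v='2'/>",    # attribute appears twice
    "<a>R & D</a>",        # & is not escaped
]
for sample in samples:
    print(sample, "->", check_xml(sample))
```

Only the first sample parses; for the others the message contains the line and column at which the parser gave up.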

### 2.1.1 Document A

```
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE catalog>
<catalog>
    <!-- Start book list --to be defined -->
   <Book id=`bk101`>
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95€</price>
      <publish_date version='hard' version='soft'>2000-10-01</publish_date>
      <_description lang=en>An `in-depth look` at creating applications 
      with XML <for dummies>.</_description>
      <xml_parse>true</xml_parse>
   </book>
</>
```

#### **Solution**

Document A has the following problems:
1. Comments `<!-- -->` cannot contain the character sequence `--`;
2. Attribute values must be delimited by single quotes (`'`) or double quotes (`"`), not by backticks or typographic "Word-style" quotes (〝, 〞, etc.);
3. The `version` attribute in `publish_date` is duplicated, which is forbidden;
4. The value of the `lang` attribute must be quoted;
5. `<` must be escaped as `&lt;` in text. It is also recommended to write `>` as `&gt;`;
6. The `Book` start tag does not match the `book` end tag, because names are case-sensitive;
7. The `catalog` element is not closed correctly (`</>` instead of `</catalog>`);
8. XML names beginning with xml are reserved by the W3C and should be avoided, except for the uses the W3C itself specifies (e.g. xml:space, xml:lang, xmlns...). **oXygen does not report this as an error, in order to stay future-compatible, but it is still considered an error**.

Here is the corrected document:

```xml
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE catalog [
<!ENTITY cright "&#169;">
]>
<catalog>
    <!-- Start book list - -to be defined -->
   <Book id='bk101'>
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95€</price>
      <publish_date version='hard' version2='soft'>2000-10-01</publish_date>
      <_description lang='en'>An `in-depth look` at creating applications 
      with XML &lt;for dummies&gt;.</_description>
      <parse>true</parse>
   </Book>
</catalog>
```

### 2.1.2 Document B

```
<?xml version="1.0" encoding="utf-16"?>
<h:library xmlns:xdc="http://www.xml.com/books" xmlns:h="http://xml.com/library">
    <head><h:title>Book Review</title></head>
    <body/>
        <_xdc:bookreview>
            <xdc:title>XML: A Primer</xdc:title>
            <_table _style='container'>
                <h:tr align="#center">
                    <h:td>Author<h:span>St. Laurent & Tom Faron</h:td></h:span>
                </h:tr>
                <h:tr align="#left">
                    <h:td><xdc:author>Simon St. Laurent</xdc:author></h:td>
                    <h:td><xdc:price>31.98</xdc:price></h:td>
                    <h:td><xdc:#pages>352</xdc:#pages></h:td>
                    <h:td><xdc:_date>1998/01</xdc:_date></h:td>
                    <h:td><xdc:-comment>Love it</xdc:-comment></h:td>
                </h:tr>
            </_table>
        </_xdc:bookreview>
    </body>
</h:library>
```

#### **Solution**

Document B has the following problems:
1. The `<h:title>` opening tag does not match the closing tag `</title>`;
1. In `<_xdc:bookreview>` the namespace prefix `_xdc` is not declared;
1. The `&` in the author text must be escaped as `&amp;`;
1. The `<h:span>` element containing the author name must be closed before its parent `<h:td>` is closed;
1. `<xdc:#pages>` is not a valid tag name, because names cannot contain `#`;
1. `<xdc:-comment>` is not a valid tag name, because the local name cannot start with a hyphen;
1. `<body/>` is an empty-element tag, but an opening tag `<body>` is required to match the later `</body>`.

Here is the corrected document:

```xml
<?xml version="1.0" encoding="utf-16"?>
<h:library xmlns:xdc="http://www.xml.com/books" xmlns:h="http://xml.com/library">
    <head><h:title>Book Review</h:title></head>
    <body>
    <xdc:bookreview>
        <xdc:title>XML: A Primer</xdc:title>
        <_table _style='container'>
            <h:tr align="#center">
                <h:td>Author<h:span>St. Laurent &amp; Tom Faron</h:span></h:td>
            </h:tr>
            <h:tr align="#left">
                <h:td><xdc:author>Simon St. Laurent</xdc:author></h:td>
                <h:td><xdc:price>31.98</xdc:price></h:td>
                <h:td><xdc:pages>352</xdc:pages></h:td>
                <h:td><xdc:_date>1998/01</xdc:_date></h:td>
                <h:td><xdc:comment>Love it</xdc:comment></h:td>
            </h:tr>
        </_table>
    </xdc:bookreview>
    </body>
</h:library>
```
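
The undeclared `_xdc` prefix is a namespace problem rather than a plain syntax problem, so it is worth seeing how a namespace-aware parser reacts. The sketch below assumes Python's standard `xml.etree.ElementTree`, which expands every prefix to its namespace URI and refuses prefixes that have no `xmlns:` declaration in scope; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

declared = (
    '<h:library xmlns:h="http://xml.com/library">'
    "<h:title>Book Review</h:title>"
    "</h:library>"
)
root = ET.fromstring(declared)
# The prefix h is replaced by the URI it was bound to.
print(root.tag)     # {http://xml.com/library}library
print(root[0].tag)  # {http://xml.com/library}title

undeclared = (
    '<h:library xmlns:h="http://xml.com/library">'
    "<_xdc:bookreview/>"
    "</h:library>"
)
try:
    ET.fromstring(undeclared)
    error = None
except ET.ParseError as err:
    error = str(err)
print(error)  # the parser complains about the unbound prefix
```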

## 2.2 XML Names
Which of the following are well-formed XML tags (i.e. which tags contain a conforming XML name)? 
1. `<_bar/>`
1. `<123foo/>`
1. `<Foo/>`
1. `<foo 123>`
1. `<foo_123/>`
1. `<foo#123/>`
1. `<foo-123/>`
1. `<foo.123/>`
1. `<XmL_123/>`

#### **Solution**

1, 3, 5, 7, 8 are valid names. Remember:
1. Element names are case-sensitive.
1. Element names must start with a letter or underscore.
1. Element names must not start with the letters xml (or XML, or Xml, etc.), because such names are reserved.
1. Element names can contain letters, digits, hyphens, underscores, and periods.
1. Element names cannot contain spaces.
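
The rules above can be turned into a small checker. This is a deliberately simplified sketch in Python: it only covers ASCII names, ignores the colon used for namespace prefixes, and treats the reserved xml prefix as invalid, as this exercise does. The real XML grammar also allows many non-ASCII characters. `NAME` and `is_allowed_name` are names made up for this example.

```python
import re

# First character: letter or underscore.
# Following characters: letters, digits, underscore, period, hyphen.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def is_allowed_name(name):
    """Apply the simplified naming rules listed above to `name`."""
    if name.lower().startswith("xml"):
        return False  # reserved, in any combination of upper and lower case
    return NAME.fullmatch(name) is not None


names = ["_bar", "123foo", "Foo", "foo 123", "foo_123",
         "foo#123", "foo-123", "foo.123", "XmL_123"]
allowed = [i for i, n in enumerate(names, start=1) if is_allowed_name(n)]
print(allowed)  # [1, 3, 5, 7, 8]
```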

# 3. Exercise: XML Document Structure

Below is an empty table describing where different XML constructs can appear within an XML document. Fill in each cell with **yes** or **no** to indicate whether the given construct (elements, attributes, text, comments) is allowed in that position.

|                | Top-Level | Between Element Tags | Inside Opening Element Tag |
|----------------|-----------|----------------------|----------------------------|
| **Elements**   | ?         | ?                    | ?                          |
| **Attributes** | ?         | ?                    | ?                          |
| **Text**       | ?         | ?                    | ?                          |
| **Comments**   | ?         | ?                    | ?                          |



#### **Solution**

|                | Top-Level | Between Element Tags | Inside Opening Element Tag |
|----------------|-----------|----------------------|----------------------------|
| **Elements**   | yes (once)| yes                  | no                         |
| **Attributes** | no        | no                   | yes                        |
| **Text**       | no        | yes                  | no                         |
| **Comments**   | yes       | yes                  | no                         |

### Explanations

1. **Elements**  
   - **Top-Level**: An XML document must have exactly **one** root element, hence "yes (once)".  
   - **Between Element Tags**: Child elements can appear within another element, so "yes".  
   - **Inside Opening Element Tag**: Not possible; elements cannot be nested in the start tag itself, so "no".

2. **Attributes**  
   - **Top-Level**: Standalone attributes (not inside an element) are not allowed, so "no".  
   - **Between Element Tags**: Attributes cannot exist outside of a start tag, so "no".  
   - **Inside Opening Element Tag**: This is where attributes belong (e.g., `<myElement attr="value">`), so "yes".

3. **Text**  
   - **Top-Level**: Outside the single root element only whitespace is allowed, not text content, so "no".  
   - **Between Element Tags**: Text content typically appears within elements, so "yes".  
   - **Inside Opening Element Tag**: Not allowed; only attributes can be placed there, so "no".

4. **Comments**  
   - **Top-Level**: Comments can be placed outside the root element, either before or after it (but not before the XML declaration), so "yes".  
   - **Between Element Tags**: Comments are allowed inside element content, between an opening and a closing tag (e.g., `<!-- comment -->`), so "yes".  
   - **Inside Opening Element Tag**: Comments cannot appear in the middle of a start tag, so "no".
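
Most cells of the table can be confirmed with tiny test documents. The sketch below assumes Python's standard `xml.etree.ElementTree` parser; the helper `parses` and the selection of test documents are illustrative choices, not part of the exercise. The "attribute between tags" cell is left out, because a parser would simply read `x="1"` between tags as text.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if `text` is accepted by the parser."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


cases = {
    "two top-level elements": "<a/><b/>",
    "element between tags": "<a><b/></a>",
    "attribute at top level": 'x="1"',
    "attribute in opening tag": '<a x="1"/>',
    "text at top level": "<a/>hello",
    "text between tags": "<a>hello</a>",
    "comments at top level": "<!-- c --><a/><!-- d -->",
    "comment between tags": "<a><!-- c --></a>",
    "comment in opening tag": "<a <!-- c --> />",
}
for label, document in cases.items():
    print(f"{label}: {'yes' if parses(document) else 'no'}")
```

The first case fails because the second element is already outside the single root, which is what "yes (once)" in the table expresses.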

# 4. XML vs CSV - the limits of tables for heterogeneous data (Optional)
If your document consists of a collection of heterogeneous objects with different attributes, XML/JSON is better suited than a comma-separated format for storing the data. In this exercise we want to show that denormalization is a good idea in this setting. 

You are given the following XML document representing a collection of products available in an online shop selling all kinds of products. In this product catalog each product has different attributes. You are asked to turn this data into a CSV file.
```xml
<productscatalog>
    <product>
        <id> 1 </id>
        <category> BBQ </category>
        <type> Gas </type>
        <height> 120cm </height>
    </product>
    <product>
        <id> 2 </id>
        <category> notebook </category>
        <brand> Apple </brand>
        <specs>
            <RAM> 16Gb </RAM>
            <storage> 128Gb </storage>
        </specs>
    </product>
    <product>
        <id> 3 </id>
        <category> shoes </category>
        <size> 39 </size>
        <model> Heels </model>
    </product>
</productscatalog>
```    

<br> 

1. Turn this data into a CSV file (i.e. into a table).

#### **Solution:**

```
id, category, type, height, brand, specs:RAM, specs:storage, size, model
1, BBQ, Gas, 120cm,,,,,
2, notebook,,,Apple,16Gb,128Gb,,
3, shoes,,,,,,39, Heels
```

This solution is not unique, however. You could, for example, also store the data as follows:

```
id, AttributeName, AttributeValue
1, category, BBQ
1, type, Gas
1, height, 120cm
2, category, notebook
2, brand, Apple
2, specs:RAM, 16Gb
2, specs:storage, 128Gb
3, category, shoes
3, size, 39
3, model, Heels
```
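
Both layouts can also be produced programmatically. The following is a sketch in Python using only the standard library (`xml.etree.ElementTree` and `csv`); the function names `flatten`, `to_wide_csv` and `to_long_csv` are hypothetical, and nested element names are joined with `:` to mirror the `specs:RAM` convention used in the tables above. Unlike the hand-written tables, the output contains no spaces after the commas.

```python
import csv
import io
import xml.etree.ElementTree as ET

CATALOG = """<productscatalog>
    <product>
        <id> 1 </id> <category> BBQ </category>
        <type> Gas </type> <height> 120cm </height>
    </product>
    <product>
        <id> 2 </id> <category> notebook </category> <brand> Apple </brand>
        <specs> <RAM> 16Gb </RAM> <storage> 128Gb </storage> </specs>
    </product>
    <product>
        <id> 3 </id> <category> shoes </category>
        <size> 39 </size> <model> Heels </model>
    </product>
</productscatalog>"""


def flatten(product):
    """Map one <product> to {column: value}, joining nested names with ':'."""
    row = {}

    def walk(element, prefix):
        for child in element:
            name = prefix + child.tag
            if len(child):  # the child has children of its own: recurse
                walk(child, name + ":")
            else:
                row[name] = (child.text or "").strip()

    walk(product, "")
    return row


def to_wide_csv(xml_text):
    """One row per product, one column per attribute seen anywhere."""
    rows = [flatten(p) for p in ET.fromstring(xml_text).iter("product")]
    columns = []
    for row in rows:
        columns.extend(name for name in row if name not in columns)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def to_long_csv(xml_text):
    """One row per (product id, attribute name, attribute value) triple."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "AttributeName", "AttributeValue"])
    for product in ET.fromstring(xml_text).iter("product"):
        row = flatten(product)
        for name, value in row.items():
            if name != "id":
                writer.writerow([row["id"], name, value])
    return out.getvalue()


print(to_wide_csv(CATALOG))
print(to_long_csv(CATALOG))
```

Note that the wide layout needs a first pass over all products just to discover the set of columns, which already hints at the disadvantages discussed in the next question.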

<br>

2. What are the disadvantages of the CSV format compared to the XML format in this case?

#### **Solution:**

For the first solution:
Each category of products has different attributes, so most of the cells in the table are empty. The resulting table is extremely sparse and hard for a human to read. 

For the second solution: 
A product is spread over several lines, which makes it inconvenient to read, and its id has to be stored multiple times. You also need to make sure the table is sorted by id if you want to see all the attributes of one product as a group.

Another problem: if the data contains many nested attributes, it becomes cumbersome to fit them into a table. 

<br>

3. Describe or give an example of one use case where the CSV format would be more appropriate than the XML format.

#### **Solution:**

If all the rows have the same fixed set of attributes and there is no nesting, it is more natural to describe the data as a table.
