# <center>Big Data &ndash; Exercises &ndash; Solution</center>
## <center>Fall 2024 &ndash; Week 6 &ndash; ETH Zurich</center>
## <center>Data Models</center>

## Reading:
* (Mandatory) Chapter 7: Data models and validation, in the course book
* (Recommended) M. Droettboom, Understanding JSON Schema [[online](https://json-schema.org/understanding-json-schema/)]
* (Recommended) Harold, E. R., & Means, W. S. (2004). XML in a Nutshell. [Available in the ETH library] [[online](https://learning.oreilly.com/library/view/xml-in-a/0596007647/?ar)] (Chapter 17 on XML Schema, except 17.3 on namespaces)


This exercise will consist of six main parts: 
* XML Data Models
* XML Schemas
* JSON Data Models
* JSON Schemas
* JSound
* DataFrames

## 1. XML Data Models &ndash; Information Sets

The XML "Information Set" (Infoset) provides an abstract representation of an XML document. It can be thought of as a set of rules for how one would draw an XML document on a whiteboard.

An XML document has an information set if it is well-formed and satisfies the namespace constraints. There is no requirement for an XML document to be valid in order to have an information set. An information set can contain up to eleven different types of information items, e.g., the document information item (always present), element information items, attribute information items, etc.
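
As a rough illustration of this abstraction, the following Python sketch walks a well-formed document with the standard library's `xml.etree.ElementTree` and lists the information items it finds. It is a simplification, not a full Infoset implementation: the function name `infoset_lines` is made up for this example, runs of characters are grouped into a single line (the Infoset has one character information item per character), and whitespace-only text is skipped for readability.

```python
import xml.etree.ElementTree as ET


def infoset_lines(xml_text):
    """List document, element, attribute and character items as an indented outline."""
    lines = ["document"]

    def visit(elem, depth):
        indent = "  " * depth
        lines.append(f"{indent}element: {elem.tag}")
        for name, value in elem.attrib.items():
            lines.append(f"{indent}  attribute: {name}={value!r}")
        text = (elem.text or "").strip()
        if text:
            lines.append(f"{indent}  characters: {text!r}")
        for child in elem:
            visit(child, depth + 1)
            # Text that follows a child still belongs to the current element.
            tail = (child.tail or "").strip()
            if tail:
                lines.append(f"{indent}  characters: {tail!r}")

    visit(ET.fromstring(xml_text), 1)
    return lines


burger = """
<Burger>
    <Bun>
        <Pickles/>
        <Cheese origin="Switzerland" />
        <Patty/>
    </Bun>
</Burger>
"""
print("\n".join(infoset_lines(burger)))
```

The indentation of the printed outline mirrors the parent/child edges one would draw in the tree.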

### Task 1.1

Draw the Information Set trees for the following XML documents. You can confine your trees to only have the following types of information items: *document information item, elements, character information items, and attributes.*

#### Document 1

```xml
<Burger>
    <Bun>
        <Pickles/>
        <Cheese origin="Switzerland" />
        <Patty/>
    </Bun>
</Burger>
```

#### Solution

![](https://polybox.ethz.ch/index.php/s/LfUUcrCwLFgwX45/download)

#### Document 2
```xml
<catalog>
   <!-- A list of books -->
   <book id='bk101'>
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date version='hard' version2='soft'>2000-10-01</publish_date>
   </book>
</catalog>
```

#### Solution

![](https://polybox.ethz.ch/index.php/s/2s54Gyy25QHXyg0/download)

#### Document 3

```xml
<eth date="11.11.2006">
   <date>16.11.2017</date>
   <president since="2020">Prof. Dr. Joël Mesot</president>
   <rector>Prof. Dr. Sarah M. Springman</rector>
</eth>
```


#### Solution

![](https://polybox.ethz.ch/index.php/s/fjPJi0DgvnPEJZw/download)

## 2. XML Schemas
 
In this task we will explore XML Schemas in detail. An XML Schema describes the structure of an XML document.

The purpose of an XML Schema is to define the legal building blocks of an XML document:
* the elements and attributes that can appear in a document
* the number of (and order of) child elements
* data types for elements and attributes
* default and fixed values for elements and attributes

When you open an XML Schema in oXygen, you can switch to its graphical representation by choosing the "Design" mode at the bottom of the document pane; "Text" mode shows the XML Schema as an XML document.

### Task 2.1
Match the following XML documents to XML Schemas that will validate them. Match them manually then validate with oXygen.

#### Document 1
```xml
<happiness xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="Schema.xsd"/>
```

#### Document 2
```xml
<happiness xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="Schema.xsd">
    <health/>
    <friends/>
    <family/>
</happiness>
```

#### Document 3
```xml
<happiness xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="Schema.xsd">
    3.141562
</happiness>
```

#### Document 4
```xml
<happiness xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="Schema.xsd">
    <health value="100"/>
    <friends/>
    <family/>
</happiness>
```

#### Document 5
```xml
<happiness xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="Schema.xsd">
    <health/>
    <friends/>
    <family/>
    But perhaps everybody defines it differently...
</happiness>
```

______


#### Schema 1
```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="happiness">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="health"/>
                <xs:element name="friends"/>
                <xs:element name="family"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>
```

#### Schema 2
```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="happiness">
        <xs:complexType mixed="true">
            <xs:sequence>
                <xs:element name="health"/>
                <xs:element name="friends"/>
                <xs:element name="family"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>
```

#### Schema 3
```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="happiness" type="xs:decimal"/>
</xs:schema>
```

#### Schema 4
```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="happiness">
        <xs:complexType>
            <xs:sequence/>
        </xs:complexType>
    </xs:element>
</xs:schema>
```

#### Schema 5
```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="happiness">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="health">
                    <xs:complexType>
                        <xs:attribute name="value" type="xs:integer" use="required"/>
                    </xs:complexType>
                </xs:element>
                <xs:element name="friends"/>
                <xs:element name="family"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>
```

#### Solution
*  Document 1 – Schema 4
*  Document 2 – Schema 1 and Schema 2
*  Document 3 – Schema 3
*  Document 4 – Schema 1, Schema 2, and Schema 5
*  Document 5 – Schema 2
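
The reasoning behind these matches can be sketched in Python. This is not an XML Schema validator: each function below is a hand-written approximation of what the corresponding schema says about the `happiness` root element only (its child sequence, whether text is allowed, and the `value` attribute of `health`). The helper names are made up for this example, and the documents are reproduced without their `xsi` attributes to keep the sketch short.

```python
import re
import xml.etree.ElementTree as ET

EXPECTED = ["health", "friends", "family"]


def child_tags(root):
    return [child.tag for child in root]


def has_text(root):
    """True if the element directly contains non-whitespace character data."""
    chunks = [root.text or ""] + [child.tail or "" for child in root]
    return any(chunk.strip() for chunk in chunks)


def matches(pattern, text):
    return re.fullmatch(pattern, text.strip()) is not None


def schema1(root):  # element-only content: the three children, no text
    return child_tags(root) == EXPECTED and not has_text(root)


def schema2(root):  # mixed content: the three children, text allowed
    return child_tags(root) == EXPECTED


def schema3(root):  # simple content that must be a decimal literal
    return len(root) == 0 and matches(r"[+-]?(\d+(\.\d*)?|\.\d+)", root.text or "")


def schema4(root):  # empty sequence: no children, no text
    return len(root) == 0 and not has_text(root)


def schema5(root):  # like Schema 1, but health needs an integer "value" attribute
    return schema1(root) and matches(r"[+-]?\d+", root[0].get("value", ""))


SCHEMAS = [schema1, schema2, schema3, schema4, schema5]

DOCUMENTS = {
    1: "<happiness/>",
    2: "<happiness><health/><friends/><family/></happiness>",
    3: "<happiness> 3.141562 </happiness>",
    4: '<happiness><health value="100"/><friends/><family/></happiness>',
    5: "<happiness><health/><friends/><family/>"
       "But perhaps everybody defines it differently...</happiness>",
}


def accepting_schemas(xml_text):
    root = ET.fromstring(xml_text)
    return [i for i, check in enumerate(SCHEMAS, start=1) if check(root)]


for number, document in DOCUMENTS.items():
    print(f"Document {number}: Schemas {accepting_schemas(document)}")
```

Note why Document 4 is accepted by Schemas 1 and 2: an element declared without a type places no restriction on its attributes, so the extra `value` attribute on `health` is allowed there.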

### Task 2.2

The [Great Language Game](http://greatlanguagegame.com/) is a game in which you listen to a voice clip and are asked to identify the language the person is speaking. It is a multiple-choice question&ndash;you make your choice out of several alternatives.

The following XML document presents a user's attempt at answering a single question in the game: it contains the identifier of the voice clip, the choices presented to the player, and the player's response.
Provide an XML Schema which will validate this document:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<attempt country="AU" date="2013-08-19">
    <voiceClip>48f9c924e0d98c959d8a6f1862b3ce9a</voiceClip>
    <choices>
        <choice>Maori</choice>
        <choice>Mandarin</choice>
        <choice>Norwegian</choice>
        <choice>Tongan</choice>
    </choices>
    <target>Norwegian</target>
    <guess>Norwegian</guess>
</attempt>

```

#### Solution
Here is one possible XML Schema that will validate the original document:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="attempt">
        <xs:complexType>
            <xs:sequence maxOccurs="1" minOccurs="1">
                <xs:element name="voiceClip" type="xs:string"/>
                <xs:element name="choices">
                    <xs:complexType>
                        <xs:sequence>
                            <xs:element name="choice" type="xs:string" minOccurs="4" maxOccurs="4"/>
                        </xs:sequence>
                    </xs:complexType>
                </xs:element>
                <xs:element name="target" type="xs:string"/>
                <xs:element name="guess" type="xs:string"/>
            </xs:sequence>
            <xs:attribute name="country" type="xs:string"/>
            <xs:attribute name="date" type="xs:date"/>
        </xs:complexType>
    </xs:element>
</xs:schema>
```

Also, the root element of the document should be changed as follows to point to the schema:
```xml
<attempt 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xsi:noNamespaceSchemaLocation="SingleGreatLanguageGame.xsd" country="AU" date="2013-08-19">
```

### Task 2.3
Continuing the topic of the Great Language Game, provide an XML Schema which will validate the following document:


```xml
<?xml version="1.0" encoding="UTF-8"?>
<attempts>
    <attempt country="AU" date="2013-08-19">
        <voiceClip>48f9c924e0d98c959d8a6f1862b3ce9a</voiceClip>
        <choices>
            <choice>Maori</choice>
            <choice>Mandarin</choice>
            <choice>Norwegian</choice>
            <choice>Tongan</choice>
        </choices>
        <target>Norwegian</target>
        <guess>Norwegian</guess>
    </attempt>
    <attempt country="US" date="2014-03-01">
        <voiceClip>5000be64c8cc8f61dda50fca8d77d307</voiceClip>
        <choices>
            <choice>Finnish</choice>
            <choice>Mandarin</choice>
            <choice>Scottish Gaelic</choice>
            <choice>Slovak</choice>
            <choice>Swedish</choice>
            <choice>Thai</choice>
        </choices>
        <target>Slovak</target>
        <guess>Slovak</guess>
    </attempt>
    <attempt country="US" date="2014-03-01">
        <voiceClip>923c0d6c9e593966e1b6354cc0d794de</voiceClip>
        <choices>
            <choice>Hungarian</choice>
            <choice>Sinhalese</choice>
            <choice>Swahili</choice>
        </choices>
        <target>Hungarian</target>
        <guess>Sinhalese</guess>
    </attempt>
</attempts>
```

#### Solution

Here is one possible solution for a schema that validates the above document. 

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="attempts">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="attempt" minOccurs="1" maxOccurs="unbounded">
                    <xs:complexType>
                        <xs:sequence maxOccurs="1" minOccurs="1">
                            <xs:element name="voiceClip" type="xs:string"/>
                            <xs:element name="choices">
                                <xs:complexType>
                                    <xs:sequence>
                                        <xs:element name="choice" type="xs:string" minOccurs="3"
                                            maxOccurs="6"/>
                                    </xs:sequence>
                                </xs:complexType>
                            </xs:element>
                            <xs:element name="target" type="xs:string"/>
                            <xs:element name="guess" type="xs:string"/>
                        </xs:sequence>
                        <xs:attribute name="country" type="xs:string"/>
                        <xs:attribute name="date" type="xs:date"/>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>

```

As in the previous task, the root element of the original document has to be augmented with:
```xml
<attempts 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xsi:noNamespaceSchemaLocation="MultipleGreatLanguageGames.xsd">
```

In an XML Schema you can also declare named custom types and reference them in your element declarations. For example, the following schema also validates the document from this task. Beware that named types belong to the schema's target namespace: if the schema declares a `targetNamespace`, you must also bind a prefix to that namespace and use it when referencing the types. The schema below declares no target namespace, so the types can be referenced by their unprefixed names.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"> 
    
    <!-- Let's first declare some custom types -->
    <xs:complexType name="choicesType">
        <xs:sequence>
            <xs:element name="choice" minOccurs="1" maxOccurs="unbounded" type="xs:string"/>
        </xs:sequence>
    </xs:complexType>
    
    <xs:complexType name="attemptType">
        <xs:sequence>
            <xs:element name="voiceClip" type="xs:string"/>
            <xs:element name="choices" type="choicesType"/>
            <xs:element name="target" type="xs:string"/>
            <xs:element name="guess" type="xs:string"/>
        </xs:sequence>
        <xs:attribute name="country" type="xs:string"/>
        <xs:attribute name="date" type="xs:date"/>  
    </xs:complexType>
    
    <!-- Let's declare the elements appearing in the document -->
    <xs:element name = "attempts">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="attempt" type="attemptType" maxOccurs="unbounded"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>    
    
</xs:schema>
```

### Task 2.4

Let us now solve the reverse problem. Given the following XML Schema, provide a valid instance document:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="movies">
        <xs:complexType>
            <xs:sequence maxOccurs="unbounded" minOccurs="0">
                <xs:element name="Movie">
                    <xs:complexType>
                        <xs:sequence>
                            <xs:element name="title" type="xs:string"/>
                            <xs:element name="year" type="xs:gYear"/>
                            <xs:element name="_director">
                                <xs:complexType>
                                    <xs:sequence/>
                                    <xs:attribute name="name" type="xs:string"/>
                                </xs:complexType>
                            </xs:element>
                            <xs:choice minOccurs="1" maxOccurs="unbounded">
                                <xs:element name="comment">
                                    <xs:complexType>
                                        <xs:simpleContent>
                                            <xs:extension base="xs:string">
                                                <xs:attribute name="lang" type="xs:string"/>
                                            </xs:extension>
                                        </xs:simpleContent>
                                    </xs:complexType>
                                </xs:element>
                                <xs:element name="newcomment" type="xs:string"/>
                            </xs:choice>
                        </xs:sequence>
                        <xs:attribute name="id" type="xs:ID"/>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>
```

#### Solution
One of the very many valid instance documents is:
```xml
<?xml version="1.0" encoding="UTF-8"?> 
<movies xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="Movies.xsd">
    <Movie id="imdb0173840">
        <title>Final Fantasy: The Spirits Within</title>
        <year>2001</year>
        <_director name="Hironobu Sakaguchi"/>
        <comment lang=""/>
        <newcomment>A great movie!</newcomment>
        <comment lang="de"/>
    </Movie>
    <Movie id="imdb0405094">
        <title>Das Leben der Anderen</title>
        <year>2006</year>
        <_director name="Florian Henckel von Donnersmarck"/>
        <comment lang="de">Das ist ein guter Film</comment>
        <!-- I need to watch more movies -->
    </Movie>
</movies>
```
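
A content model such as the one for `Movie` can be read as a regular expression over the names of the child elements: `title`, `year`, `_director`, followed by one or more `comment` or `newcomment` elements. The Python sketch below checks only this ordering for the instance above; it is a hand-written approximation with made-up names (`MOVIE_MODEL`, `children_match`), and it does not check datatypes, attributes, or the `xs:ID` uniqueness constraint.

```python
import re
import xml.etree.ElementTree as ET

# title, year, _director, then at least one comment or newcomment.
MOVIE_MODEL = re.compile(r"title year _director( (comment|newcomment))+")


def children_match(movie):
    """Check the order of a Movie element's children against the content model."""
    tags = " ".join(child.tag for child in movie)
    return MOVIE_MODEL.fullmatch(tags) is not None


movies = ET.fromstring("""
<movies>
    <Movie id="imdb0173840">
        <title>Final Fantasy: The Spirits Within</title>
        <year>2001</year>
        <_director name="Hironobu Sakaguchi"/>
        <comment lang=""/>
        <newcomment>A great movie!</newcomment>
        <comment lang="de"/>
    </Movie>
    <Movie id="imdb0405094">
        <title>Das Leben der Anderen</title>
        <year>2006</year>
        <_director name="Florian Henckel von Donnersmarck"/>
        <comment lang="de">Das ist ein guter Film</comment>
        <!-- I need to watch more movies -->
    </Movie>
</movies>
""")

# The default ElementTree parser drops XML comments, so they do not
# show up as children here.
print([children_match(movie) for movie in movies])

# A Movie without any comment or newcomment violates minOccurs="1" on the choice.
incomplete = ET.fromstring("<Movie><title/><year/><_director/></Movie>")
print(children_match(incomplete))
```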

## 3. JSON Data Models
The appropriate abstraction for any JSON document is a tree, the nodes of which are JSON logical values.

It is possible to visualize JSON documents as logical trees.
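
As a small sketch of this idea, the Python function below parses a JSON document with the standard `json` module and prints one line per node of the logical tree. The function name `tree_lines` and the textual layout are choices made for this example only.

```python
import json


def tree_lines(value, label="", depth=0):
    """Return one outline line per node of the JSON logical tree."""
    indent = "  " * depth
    prefix = f"{indent}{label}: " if label else indent
    if isinstance(value, dict):
        lines = [prefix + "object"]
        for key, child in value.items():
            lines += tree_lines(child, json.dumps(key), depth + 1)
    elif isinstance(value, list):
        lines = [prefix + "array"]
        for child in value:
            lines += tree_lines(child, "", depth + 1)
    elif value is None:
        lines = [prefix + "null"]
    elif isinstance(value, bool):  # must come before numbers: bool is an int in Python
        lines = [prefix + "boolean " + json.dumps(value)]
    elif isinstance(value, str):
        lines = [prefix + "string " + json.dumps(value)]
    else:
        lines = [prefix + "number " + json.dumps(value)]
    return lines


document = json.loads('{"name": "Einstein", "awards": ["Nobel Prize"], "year": 1905}')
print("\n".join(tree_lines(document)))
```

Objects and arrays become inner nodes, while strings, numbers, booleans and null become leaves; object keys label the edges to the children.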

###  Task 3.1 
Draw the logical trees for the following JSON documents.

#### Document 1

```json
{
  "name": "Einstein",
  "papers": [
    {
      "title": "Special Relativity",
      "year": 1905
    },
    {
      "title": "General Relativity",
      "year": 1915
    }
  ],
  "awards": ["Nobel Prize"]
}
```

#### Solution

![](https://polybox.ethz.ch/index.php/s/9lmWKP4JvAZodhS/download)

#### Document 2

```json
{
  "university": {
    "name": "ETH Zurich",
    "founded": 1855,
    "departments": [
      {
        "name": "Computer Science",
        "programs": ["BSc", "MSc", "PhD"],
        "stats": {
          "students": 2000,
          "faculty": 50,
          "international": true
        }
      },
      {
        "name": "Mathematics",
        "programs": ["BSc", "MSc"],
        "stats": null
      }
    ],
    "locations": ["Hönggerberg", "Zentrum"]
  }
}
```

#### Solution

![](https://polybox.ethz.ch/index.php/s/l6zEWNCykXdABda/download)

## 4. JSON Schemas

JSON Schema is a vocabulary that allows you to annotate and validate JSON documents. It is used to:
* Describe your existing data format(s).
* Provide clear human- and machine- readable documentation.
* Validate data, i.e., automated testing, ensuring quality of client submitted data.

###  Task 4.1 
Provide a JSON Schema which will validate the following document.

```json
{
  "firstName": "John",
  "lastName": "Doe",
  "age": 21
}
```

#### Solution
The following JSON Schema is a possible solution to the above question.

```json
{
  "$id": "https://example.com/person.schema.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Person",
  "type": "object",
  "properties": {
    "firstName": {
      "type": "string",
      "description": "The person's first name."
    },
    "lastName": {
      "type": "string",
      "description": "The person's last name."
    },
    "age": {
      "description": "Age in years which must be equal to or greater than zero.",
      "type": "integer",
      "minimum": 0
    }
  }
}
```

### Task 4.2

Provide a JSON Schema which will validate the following document.
The JSON Schema has to check for the following properties:


*   The price of a product has to be strictly positive.
*   Tags describe the product and are necessary for a proper product description. Each product needs at least one tag, and each tag should be unique.
*   The "productId", "productName" and the "price" should always be contained in a valid JSON document.



```json
  {
    "productId": 1,
    "productName": "An ice sculpture",
    "price": 12.50,
    "tags": [ "cold", "ice" ],
    "dimensions": {
      "length": 7.0,
      "width": 12.0,
      "height": 9.5
    }
  }
```

#### Solution

A possible schema for the original document:

```json
{
  "$schema":"http://json-schema.org/draft-07/schema#",
  "$id": "https://example.com/product.schema.json",
  "title": "Product",
  "description": "A product from Acme's catalog",
  "type": "object",
  "properties": {
    "productId": {
      "description": "The unique identifier for a product",
      "type": "integer"
    },
    "productName": {
      "description": "Name of the product",
      "type": "string"
    },
    "price": {
      "description": "The price of the product",
      "type": "number",
      "exclusiveMinimum": 0
    },
    "tags": {
      "description": "Tags for the product",
      "type": "array",
      "items": {
        "type": "string"
      },
      "minItems": 1,
      "uniqueItems": true
    },
    "dimensions": {
      "type": "object",
      "properties": {
        "length": {
          "type": "number"
        },
        "width": {
          "type": "number"
        },
        "height": {
          "type": "number"
        }
      },
      "required": [ "length", "width", "height" ]
    }
  },
  "required": [ "productId", "productName", "price" ]
}
```
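
To see how the keywords used above behave, here is a minimal Python sketch of a checker for just this subset of JSON Schema: `type` (single type name only), `properties`, `required`, `minimum`, `exclusiveMinimum`, `items`, `minItems` and `uniqueItems`. It is not a conforming validator, and the names `validate`, `type_ok` and `PRODUCT_SCHEMA` are made up for this example; for instance, uniqueness is approximated by comparing serialized items.

```python
import json


def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def type_ok(value, name):
    checks = {
        "object": isinstance(value, dict),
        "array": isinstance(value, list),
        "string": isinstance(value, str),
        "boolean": isinstance(value, bool),
        "null": value is None,
        "number": is_number(value),
        "integer": is_number(value) and (isinstance(value, int) or value.is_integer()),
    }
    return checks[name]


def validate(instance, schema, path="$"):
    """Return a list of error messages; an empty list means the instance is accepted."""
    if "type" in schema and not type_ok(instance, schema["type"]):
        return [f"{path}: expected {schema['type']}"]
    errors = []
    if is_number(instance):
        if "minimum" in schema and instance < schema["minimum"]:
            errors.append(f"{path}: below minimum")
        if "exclusiveMinimum" in schema and instance <= schema["exclusiveMinimum"]:
            errors.append(f"{path}: not above exclusiveMinimum")
    if isinstance(instance, dict):
        errors += [f"{path}: missing {key!r}"
                   for key in schema.get("required", []) if key not in instance]
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors += validate(instance[key], subschema, f"{path}.{key}")
    if isinstance(instance, list):
        if len(instance) < schema.get("minItems", 0):
            errors.append(f"{path}: too few items")
        serialized = [json.dumps(item, sort_keys=True) for item in instance]
        if schema.get("uniqueItems") and len(set(serialized)) < len(serialized):
            errors.append(f"{path}: duplicate items")
        if "items" in schema:
            for i, item in enumerate(instance):
                errors += validate(item, schema["items"], f"{path}[{i}]")
    return errors


# Condensed version of the product schema (descriptions and dimensions omitted).
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "productId": {"type": "integer"},
        "productName": {"type": "string"},
        "price": {"type": "number", "exclusiveMinimum": 0},
        "tags": {"type": "array", "items": {"type": "string"},
                 "minItems": 1, "uniqueItems": True},
    },
    "required": ["productId", "productName", "price"],
}

product = {"productId": 1, "productName": "An ice sculpture",
           "price": 12.50, "tags": ["cold", "ice"]}

print(validate(product, PRODUCT_SCHEMA))                    # accepted
print(validate({**product, "price": 0}, PRODUCT_SCHEMA))    # price must be > 0
print(validate({**product, "tags": ["ice", "ice"]}, PRODUCT_SCHEMA))
```

Note that `required` is a keyword of the enclosing object schema, not of the individual properties, which is why it appears next to `properties` rather than inside it.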

### Task 4.3 
Based on the given JSON Schema, can you give an instance of it?
Hint: the schema defines arrays of things.

```json
{
  "$id": "https://example.com/arrays.schema.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "description": "A representation of a person, company, organization, or place",
  "type": "object",
  "properties": {
    "fruits": {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "vegetables": {
      "type": "array",
      "items": { "$ref": "#/definitions/veggie" }
    }
  },
  "definitions": {
    "veggie": {
      "type": "object",
      "required": [ "veggieName", "veggieLike" ],
      "properties": {
        "veggieName": {
          "type": "string",
          "description": "The name of the vegetable."
        },
        "veggieLike": {
          "type": "boolean",
          "description": "Do I like this vegetable?"
        }
      }
    }
  }
}
```

#### Solution
Arrays are fundamental structures in JSON. One possible instance of the schema:

```json
{
  "fruits": [ "apple", "orange", "pear" ],
  "vegetables": [
    {
      "veggieName": "potato",
      "veggieLike": true
    },
    {
      "veggieName": "broccoli",
      "veggieLike": false
    }
  ]
}
```
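
The `$ref` value `#/definitions/veggie` points to another location inside the same schema document. The Python sketch below shows how such a local reference can be followed step by step; the helper `resolve_local_ref` is written for this example only, handles nothing but references of the form `#/a/b/...`, and ignores the escaping rules of JSON Pointer.

```python
def resolve_local_ref(schema, ref):
    """Follow a local reference such as '#/definitions/veggie' inside a schema."""
    if not ref.startswith("#/"):
        raise ValueError("only local references of the form '#/...' are handled")
    node = schema
    for part in ref[2:].split("/"):
        node = node[part]
    return node


# Condensed version of the schema from this task.
schema = {
    "type": "object",
    "properties": {
        "fruits": {"type": "array", "items": {"type": "string"}},
        "vegetables": {"type": "array", "items": {"$ref": "#/definitions/veggie"}},
    },
    "definitions": {
        "veggie": {
            "type": "object",
            "required": ["veggieName", "veggieLike"],
            "properties": {
                "veggieName": {"type": "string"},
                "veggieLike": {"type": "boolean"},
            },
        }
    },
}

instance = {
    "fruits": ["apple", "orange", "pear"],
    "vegetables": [
        {"veggieName": "potato", "veggieLike": True},
        {"veggieName": "broccoli", "veggieLike": False},
    ],
}

veggie = resolve_local_ref(schema, schema["properties"]["vegetables"]["items"]["$ref"])

# Every vegetable must carry all keys listed under "required" in the definition.
missing = [(v, key) for v in instance["vegetables"]
           for key in veggie["required"] if key not in v]
print(veggie["required"], missing)
```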

## 5. JSound

JSound is a vocabulary that allows you to validate JSON documents. It employs a very simple and intuitive JSON-like syntax.

### Task 5.1 
Repeat Task 4.1, but now produce a JSound schema which will validate the following document.

```json
{
  "firstName": "John",
  "lastName": "Doe",
  "age": 21
}
```

#### Solution
The following JSound schema is a possible solution to the above question.

```json
{
  "firstName": "string",
  "lastName": "string",
  "age": "integer"
}
```

### Task 5.2 

Build a valid JSON document based on the following JSound schema. 

```json
{
    "id": "integer",
    "who": [{
        "name": "string",
        "type": "string",
        "preferred": "boolean"
    }],
    "year_of_birth": "integer",
    "living": "boolean"
}
```

#### Solution

```json
{
    "id": 100,
    "who": [{
        "name": "Albert",
        "type": "first",
        "preferred": true
    },
    {
        "name": "Einstein",
        "type": "last",
        "preferred": false
    }],
    "year_of_birth": 1879,
    "living": false
}
```
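
To make the matching rules concrete, here is a small Python sketch of a checker for the marker-free subset of the JSound compact syntax used in this task. It rests on assumptions that should be checked against the JSound specification: fields without markers are treated as optional, objects are treated as closed (no undeclared fields), and only the atomic types `string`, `integer`, `boolean` and `decimal` are handled. The names `conforms` and `PERSON_SCHEMA` are made up for this example.

```python
ATOMIC = {
    "string": lambda v: isinstance(v, str),
    "boolean": lambda v: isinstance(v, bool),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "decimal": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def conforms(value, schema):
    """Check a parsed JSON value against a marker-free JSound compact schema."""
    if isinstance(schema, str):       # atomic type name
        return ATOMIC[schema](value)
    if isinstance(schema, list):      # array type: [member type]
        return isinstance(value, list) and all(conforms(v, schema[0]) for v in value)
    # Object type: assumed closed, with every field optional.
    return (isinstance(value, dict)
            and set(value) <= set(schema)
            and all(conforms(v, schema[k]) for k, v in value.items()))


PERSON_SCHEMA = {
    "id": "integer",
    "who": [{"name": "string", "type": "string", "preferred": "boolean"}],
    "year_of_birth": "integer",
    "living": "boolean",
}

valid = {
    "id": 100,
    "who": [{"name": "Albert", "type": "first", "preferred": True}],
    "year_of_birth": 1879,
    "living": False,
}

print(conforms(valid, PERSON_SCHEMA))
print(conforms({**valid, "id": "100"}, PERSON_SCHEMA))    # a string is not an integer
print(conforms({**valid, "alive": False}, PERSON_SCHEMA)) # "alive" is not declared
```

Under these assumptions, a document that gives `id` as a string, or that uses a field name not declared in the schema, is rejected.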

## 6. Data Frames

Data frames are collections of JSON objects that are valid against a common schema, where the schema requires closed object types and specific types for values while allowing for structured nestedness. 



### Task 6.1

Given the following JSound schema, draw the visual representation of the data frame and fill it with the following entries:

ID=0, {"Marie", "Curie"}, [{"Chemistry": 6.0}]

ID=1, {"Albert", "Einstein"}, [{"Math": 6.0}, {"Physics": 6.0}]

```json
{
  "student": {
    "id": "integer",
    "name": {
      "first": "string",
      "last": "string"
    },
    "grades": [{
      "course": "string",
      "grade": "decimal"
    }]
  }
}
```



#### Solution

![](https://polybox.ethz.ch/index.php/s/Fb3GigBk5BFg5SV/download)
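
The column structure of such a data frame can also be derived mechanically from the schema. The Python sketch below lists one column path per leaf of the `student` type and writes the two entries as JSON-like objects; the helper name `column_paths`, the `.` and `[]` path notation, and the mapping of the informal entries to `course`/`grade` pairs are assumptions made for this example.

```python
STUDENT = {
    "id": "integer",
    "name": {"first": "string", "last": "string"},
    "grades": [{"course": "string", "grade": "decimal"}],
}


def column_paths(schema, prefix=""):
    """List the leaf columns of a nested schema, with '.' for objects and '[]' for arrays."""
    columns = []
    for key, sub in schema.items():
        path = prefix + key
        if isinstance(sub, dict):                       # nested object
            columns += column_paths(sub, path + ".")
        elif isinstance(sub, list) and isinstance(sub[0], dict):  # array of objects
            columns += column_paths(sub[0], path + "[].")
        elif isinstance(sub, list):                     # array of atomic values
            columns.append(f"{path}[]: {sub[0]}")
        else:                                           # atomic value
            columns.append(f"{path}: {sub}")
    return columns


rows = [
    {"id": 0, "name": {"first": "Marie", "last": "Curie"},
     "grades": [{"course": "Chemistry", "grade": 6.0}]},
    {"id": 1, "name": {"first": "Albert", "last": "Einstein"},
     "grades": [{"course": "Math", "grade": 6.0}, {"course": "Physics", "grade": 6.0}]},
]

print(column_paths(STUDENT))
for row in rows:
    print(row)
```

Each row keeps its nested structure: `name` is a nested object with two sub-columns, and `grades` is a nested table with a variable number of rows per student.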

### Task 6.2

Convert the following heterogeneous JSON collection into a data frame by writing an appropriate JSound schema that enforces homogeneity while preserving as much information as possible:

```json
[
  {
    "name": "Alice",
    "scores": [95, 87, 91],
    "active": true
  },
  {
    "name": "Bob",
    "scores": [88],
    "graduated": "2023"
  },
  {
    "name": "Charlie",
    "active": false
  }
]
```

#### Solution
```json
{
  "!name": "string",
  "scores": ["integer"],
  "active?": "boolean",
  "graduated?": "string"
}
```
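
One way to arrive at such a schema is to first check which fields occur in every object of the collection and which occur only in some. The Python sketch below does this counting for the collection from this task; the variable names are made up for this example, and the sketch only inspects field presence, not the value types.

```python
from collections import Counter

collection = [
    {"name": "Alice", "scores": [95, 87, 91], "active": True},
    {"name": "Bob", "scores": [88], "graduated": "2023"},
    {"name": "Charlie", "active": False},
]

# Count in how many objects each field name appears.
presence = Counter(key for record in collection for key in record)

in_every_object = sorted(k for k, n in presence.items() if n == len(collection))
in_some_objects = sorted(k for k, n in presence.items() if n < len(collection))

print(in_every_object)
print(in_some_objects)
```

Only `name` occurs in all three objects, so it is the only field that can be required without rejecting part of the collection; `scores`, `active` and `graduated` have to be allowed to be absent.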

### Task 6.3
Consider the following data frame schema representing a movie database.
Write two valid JSON documents that conform to this schema.

```json
{
  "movie": {
    "!id": "string",
    "!title": "string",
    "year": "integer",
    "genres": ["string"],
    "cast": [{
      "!actor": "string",
      "role": "string",
      "award?": "string"
    }],
    "ratings": {
      "imdb?": "decimal",
      "rottenTomatoes?": "integer"
    }
  }
}
```

#### Solution
```json
{
  "id": "mv001",
  "title": "The Matrix",
  "year": 1999,
  "genres": ["Sci-Fi", "Action"],
  "cast": [
    {
      "actor": "Keanu Reeves",
      "role": "Neo"
    }
  ],
  "ratings": {
    "imdb": 8.7
  }
}
```

```json
{
  "id": "mv002",
  "title": "The Lord of the Rings: The Return of the King",
  "year": 2003,
  "genres": ["Fantasy", "Adventure", "Action", "Drama"],
  "cast": [
    {
      "actor": "Ian McKellen",
      "role": "Gandalf",
      "award": "Academy Award Nomination"
    },
    {
      "actor": "Viggo Mortensen",
      "role": "Aragorn"
    },
    {
      "actor": "Andy Serkis",
      "role": "Gollum",
      "award": "Special Achievement Award"
    },
    {
      "actor": "Elijah Wood",
      "role": null
    }
  ],
  "ratings": {
    "imdb": 9.0,
    "rottenTomatoes": 93
  }
}
```
