# Data and File Formats

When working on AI projects, you will most likely need to deal with a variety of data and file formats, including:
- CSV
- JSON
- YAML
- images
- videos
- audio

In this notebook, we give a brief introduction to each of them, along with some comments on how and when to use them.

We are going to work with some files in the formats mentioned above, so before you start reading the notebook, make sure to run the following cell to download the necessary files:

In [1]:
!wget "https://aicore-files.s3.amazonaws.com/Foundations/Data_Formats/Salaries.csv" "https://aicore-files.s3.amazonaws.com/Foundations/Data_Formats/employees.xml" "https://aicore-files.s3.amazonaws.com/Foundations/Data_Formats/JSON_sample.json" "https://aicore-files.s3.amazonaws.com/Foundations/Data_Formats/yaml_example.yaml"


## CSV

> <font size=+1> CSV __(comma-separated values)__ files contain rows of data, with each value in a row separated by a comma.</font>

- They are a very common way to store data.
- All of the data for a single record is on one line: each new line is a new record.
- The comma in this case is called the __'delimiter'__, as it marks the boundary (or limit) between one value and the next.
- Other common delimiters are semicolons and tabs (tab-delimited files are also called __TSV/tab-separated values__).
- We must be careful to check exactly what the delimiter is: a common error is reading a file with the wrong delimiter, which leaves several values merged into a single column or split in the wrong places.
- CSVs can also be opened in Excel.
<p style="font-size:10.5px">
Data from mainland European countries (France, Spain, etc.) usually uses semicolons, because the comma serves as the decimal separator there; hence some people prefer to read CSV as <i>character</i>-separated values.
</p>
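
To see why the delimiter matters, here is a minimal sketch that reads the same text twice, once with the wrong delimiter and once with the right one. The sample text is made up for illustration and is held in memory with `io.StringIO` rather than taken from one of the downloaded files.

```python
import csv
import io

# Hypothetical sample: semicolon-delimited text with decimal commas
raw = "name;height\nAnna;1,70\nLuis;1,82\n"

# The default comma delimiter splits the rows in the wrong places
wrong = list(csv.reader(io.StringIO(raw)))

# The correct delimiter recovers the two intended columns
right = list(csv.reader(io.StringIO(raw), delimiter=';'))

print(wrong[1])
print(right[1])
```

With the wrong delimiter no error is raised; the values are simply split incorrectly, which is why it is worth inspecting the first few rows after reading a file.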

### Open CSV files

Python comes with a built-in module called `csv` that provides the functionality needed to read and write CSV files.

We open an existing file (`Salaries.csv`) using a context manager, with the mode set to read (`r`). Then we pass the file to `csv.reader`, which returns an iterable that yields each row of the CSV as a list of strings.

In [None]:
import csv
with open('Salaries.csv', mode='r', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for n, row in enumerate(reader):
        print(','.join(row))
        if n == 5:  # Stop after the header row and the first five records
            break
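
If the file has a header row, `csv.DictReader` (also part of the standard `csv` module) can be convenient, since it yields each record as a dictionary keyed by column name. A minimal sketch, using a made-up in-memory sample instead of `Salaries.csv`:

```python
import csv
import io

# Hypothetical sample with a header row
sample = "Name,Age\nSparky,7\nFido,4\n"

reader = csv.DictReader(io.StringIO(sample))
records = list(reader)

print(records[0]['Name'])
print(type(records[0]['Age']))
```

Note that the `csv` module does not infer types: every value, including the ages here, is read as a string.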


### Create CSV files

The same library can be used to generate CSV files. The only thing you need to change is the mode argument in the context manager, which becomes write (`w`). If you want to append rows to an existing CSV instead, use the append mode (`a`).

As opposed to reading a CSV file, to write one we use `csv.writer`, which returns a writer object pointing to the file we want to create. Notice that the file doesn't have to exist beforehand (if it does exist, its content will be overwritten).

The `writer` object has two methods for writing data: `writerow`, which writes a single iterable as one comma-separated row, and `writerows`, which takes an iterable of rows and writes each one on its own line.

So, if we define a list:

In [None]:
my_list = [['Sparky', 7, 'Brown', 'Corgi'], ['Fido', 4, 'White', 'Husky']]

We can create a new CSV file where each row contains the characteristics of one dog:

In [None]:
import csv
with open('Dogs.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Age', 'Colour', 'Breed'])
    writer.writerows(my_list)

Notice the difference between `writerow` and `writerows`. Try running the following cell and compare the two files to see how they differ.

In [None]:
with open('Dogs_2.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Age', 'Colour', 'Breed'])
    writer.writerow(my_list)
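
If you want to check your answer without opening the files, the sketch below repeats the comparison in memory. It writes to `io.StringIO` buffers instead of files, and the variable names are chosen only for this illustration.

```python
import csv
import io

dogs = [['Sparky', 7, 'Brown', 'Corgi'], ['Fido', 4, 'White', 'Husky']]

# writerows: one CSV line per inner list
buf_rows = io.StringIO()
csv.writer(buf_rows).writerows(dogs)

# writerow: the whole outer list is treated as a single row,
# so each inner list is converted to text and becomes one cell
buf_row = io.StringIO()
csv.writer(buf_row).writerow(dogs)

lines_rows = buf_rows.getvalue().splitlines()
lines_row = buf_row.getvalue().splitlines()

print(len(lines_rows), lines_rows)
print(len(lines_row), lines_row)
```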

## JSON


> <font size=+1> JSON (JavaScript Object Notation) is a file format that stores data in a way that is easily readable by both humans and machines.</font>

- It is a useful way for a browser and a server to exchange data, so it is used extensively in web applications.
- In fact, Jupyter Notebook `.ipynb` files are actually stored in JSON format.


The JSON format is very similar to Python dictionaries: a JSON object contains keys, and each key has a corresponding value.
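
The similarity is not exact, because some JSON values are spelled differently from their Python counterparts. The sketch below uses `json.loads` (covered later in this section) on a made-up JSON string to show how the basic types are converted.

```python
import json

# Hypothetical JSON text, written as a Python string for illustration
text = '{"name": "Ada", "skills": ["Python", "SQL"], "active": true, "manager": null}'
data = json.loads(text)

print(type(data))            # JSON object becomes a dict
print(type(data["skills"]))  # JSON array becomes a list
print(data["active"])        # JSON true becomes True
print(data["manager"])       # JSON null becomes None
```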

### Read JSON files

Python includes a built-in module called `json` that can read JSON into Python objects and write Python objects out as JSON.

The syntax is very similar to the one for CSV files. We use a context manager, set the mode we want to use, and then use a method. In this case, for reading a file, we use the `load` method

In [None]:
import json
with open('JSON_sample.json', mode='r') as f:
    json_dict = json.load(f)

print(json_dict)

Observe that what we get back is a Python dictionary:

In [None]:
type(json_dict)

### Create JSON files

We can create json files from dictionaries. Observe that the mode of the context manager is `w`. The method in this case is `dump`. The `dump` method accepts the data we want to use, and then the file we want to dump the data into.

In [None]:
test_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
with open('JSON_test.json', mode='w') as f:
    json.dump(test_dict, f)

Observe in your directory that now you have a `json` file called `JSON_test.json`

We can also parse a string that contains JSON, using the `loads` method:

In [None]:
x =  '{"name": "John", "age": 30, "city": "New York"}'

In [None]:
y = json.loads(x)
print(y)
print(type(y))

Be careful with the quotes! JSON requires double quotes around keys and string values. If your string uses single quotes instead, the parser will raise an error, as the following cell shows:

In [None]:
x =  "{'name': 'John', 'age': 30, 'city': 'New York'}"
y = json.loads(x)
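
The error raised in this situation is `json.JSONDecodeError`, so it can be caught if the input might be malformed. A minimal sketch (the variable names are only for illustration):

```python
import json

bad = "{'name': 'John', 'age': 30}"   # single quotes: not valid JSON

try:
    parsed = json.loads(bad)
except json.JSONDecodeError as err:
    parsed = None
    print(f"Invalid JSON: {err.msg}")

# Letting json.dumps build the string guarantees valid quoting
good = json.dumps({'name': 'John', 'age': 30})
print(json.loads(good))
```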

We do the opposite (from dictionary to JSON string) using the `dumps` method

In [None]:
test_dict = {'a': 3, 'b': 4}
new_json = json.dumps(test_dict)
print(new_json)
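
`dumps` (and `dump`) also accept an `indent` argument, which can make the output easier to read. A short sketch with a made-up nested dictionary:

```python
import json

# Hypothetical nested data for illustration
nested = {'a': 3, 'b': {'c': [1, 2], 'd': None}}

pretty = json.dumps(nested, indent=2)
print(pretty)

# Indentation only changes the layout, not the data
round_trip = json.loads(pretty)
```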

## YAML

YAML is a data serialization language, which means that it provides a standard way of writing data down so that different applications can exchange it. In fact, you already saw a serialization language in this lesson: JSON.

> <font size=+1> YAML (YAML Ain't Markup Language) is a data serialization language </font>

The main advantage of YAML is that it is highly human-readable. Compare the same information written in YAML and in JSON:
### YAML:
```
simple-property: a simple value

object-property:
    first-property: first value
    second-property: second value

array-property:
    - item-1-property-1: one
      item-1-property-2: 2
    - item-2-property-1: three
      item-2-property-2: 4
```

### JSON
```
{
  "simple-property": "a simple value",

  "object-property": {
      "first-property": "first value",
      "second-property": "second value"
  },

  "array-property": [
      { "item-1-property-1": "one",
        "item-1-property-2": 2 },
      { "item-2-property-1": "three",
        "item-2-property-2": 4}
  ]
}
```


Observe that YAML relies on indentation and line breaks, rather than brackets and commas, to express structure.

The most basic syntax in a YAML file is the __key:value__ pair
```
key: value
```
For example:
```
# This is a comment
name: Ivan
surname: 'Ying'
role: "Instructor"
IQ: 0
```
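Notice that strings can be written in double quotes, in single quotes, or with no quotes at all, and they will work the same. Unquoted values that look like numbers, such as `0` above, are read as numbers rather than strings.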

Another useful way of using YAML files is leveraging __objects__ simply by indenting the key:value pairs:
```
# This is a comment
Person:
    name: Ivan
    surname: 'Ying'
    role: "Instructor"
    IQ: 0
```
As with Python, the indentation must be at the right level, so it is a good idea to use a linter to check it.

You can look for `docs-yaml` in the Extensions tab of VSCode to install a linter that tells you whether your YAML file is well indented. Alternatively, you can visit [this link](https://codebeautify.org/yaml-validator).

One more thing you can use in YAML files is lists. A list can contain single values, or it can contain sets of key:value pairs:
```
Person:
    - name: Ivan
      surname: 'Ying'
      role: "Instructor"
      IQ: 0
    - name: Not Ivan
      surname: 'Gniy'
      role: "Doppelganger"
      IQ: 150
Animals:
    - Cat
    - Dog
    - Shoebill
    - Kakapo
```
The last list can also be written as:
```
Animals: [Cat, Dog, Shoebill, Kakapo]
```

### Read YAML files

Python's standard library doesn't include a module for reading YAML files. But not to worry, you can install a library that allows you to do so. The library is named `PyYAML`.

In [None]:
!pip install PyYAML

Be careful: some libraries are imported under a different name from the one they are published with. In this case, to use the PyYAML library, you need to import `yaml`.

As with CSV and JSON files, we use a context manager in read mode:

In [None]:
import yaml
with open('yaml_example.yaml', 'r') as stream:
    data_loaded = yaml.safe_load(stream)

print(type(data_loaded))

Observe that, as with JSON files, we obtain a dictionary. Let's print it out:

In [None]:
print(data_loaded)
print(data_loaded.keys())

Notice that we have two main keys, 'Person' and 'Animals'. The value corresponding to 'Person' is a list of dictionaries, and the value corresponding to 'Animals' is just a regular list.

So we can get the values by indexing with the correct key and/or position:

In [None]:
print(f"The first element of Person is: {data_loaded['Person'][0]}")
print(f"The name of the first element of Person is: {data_loaded['Person'][0]['name']}")
print(f"The second element of Person is: {data_loaded['Person'][1]}")
print(f"The name of the second element of Person is: {data_loaded['Person'][1]['name']}")
print(f'The value corresponding to Animals is: {data_loaded["Animals"]}')
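
One subtlety worth checking when reading YAML is that unquoted values are not always loaded as strings. The sketch below assumes PyYAML is installed and uses a made-up YAML string (not the downloaded file) to show the Python type each value ends up with.

```python
import yaml  # provided by the PyYAML package

# Hypothetical YAML text for illustration
text = """
count: 0
label: '0'
ratio: 1.5
enabled: true
name: Ivan
"""

loaded = yaml.safe_load(text)
types = {key: type(value).__name__ for key, value in loaded.items()}
print(types)
```

Quoting a value, as with `label` here, is the usual way to force it to be read as a string.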


### Create YAML files

You can also create YAML files using the same library. The data you write is typically a dictionary. So, let's load a dictionary from a JSON file we already have, and then create a YAML file from it.

In [None]:
import json

with open('JSON_sample.json', mode='r') as f:
    my_dict = json.load(f)

print(my_dict)

Now we can use the `dump` method to save the dictionary as a YAML file. As with `json.dump`, it accepts the data first, and then the file into which we want to dump the data.

In [None]:
with open('YAML_from_JSON.yaml', 'w') as f:
    yaml.dump(my_dict, f)
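
To check that nothing is lost along the way, you can dump to a string and load it back. This is a minimal sketch assuming PyYAML is installed; the dictionary is made up for illustration, and `safe_dump` is the counterpart of the `safe_load` function used earlier.

```python
import yaml  # provided by the PyYAML package

# Hypothetical data for illustration
original = {'Animals': ['Cat', 'Dog'], 'Person': {'name': 'Ivan', 'IQ': 0}}

# With no file argument, safe_dump returns the YAML text as a string
yaml_text = yaml.safe_dump(original)
restored = yaml.safe_load(yaml_text)

print(yaml_text)
print(restored == original)
```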

# Pandas

Pandas allows us to read these data formats in an easy way, so we don't have to think about the libraries or modules that we need to import

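The general pattern is as follows, where `{format}` is a placeholder for the format name (for example `csv`, `json` or `xml`):

In [None]:
import pandas as pd

# Template only: replace {format} and 'FILE_DIR' before running
# df = pd.read_{format}('FILE_DIR')   # read a file into a DataFrame
# df.to_{format}('FILE_DIR')          # write a DataFrame to a file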

## Pandas and CSV
The syntax for reading a CSV into pandas is thus:

In [None]:
# read_csv returns a DataFrame, which we save to a variable
df = pd.read_csv('<filename>')

# to_csv is a DataFrame method that writes the data to a file
df.to_csv('<filename>')

Example:

In [None]:
# import pandas
import pandas as pd

# read in the csv file
df = pd.read_csv('Salaries.csv', index_col='Id')

# show as DataFrame
df

Let's take only the first 5 rows and save them into a new CSV:

In [None]:
df_short = df.head(5)
df_short.to_csv('Salaries_5.csv')
df_short
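
The delimiter caveat from the CSV section applies to pandas too. `read_csv` accepts `sep` and `decimal` arguments for files that do not use the default comma and point. A minimal sketch with a made-up semicolon-delimited sample held in memory:

```python
import io
import pandas as pd

# Hypothetical sample: semicolon-delimited text with decimal commas
raw = "name;height\nAnna;1,70\nLuis;1,82\n"

df_eu = pd.read_csv(io.StringIO(raw), sep=';', decimal=',')

print(df_eu)
print(df_eu.dtypes)
```

Unlike the `csv` module, pandas infers column types, so `height` is read as a floating-point column here.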

## Pandas and JSON

- Pandas can also read from and write to JSON, using `read_json` and `to_json`.

In [None]:
# json
df = pd.read_json('JSON_sample.json')
df


This doesn't look good... Each cell in the `Employees` column holds a whole dictionary. We have to normalize the values in that column, so that each key becomes a column of its own.

In [None]:
df['Employees']

In [None]:
df_nice = pd.json_normalize(df["Employees"])
df_nice

In [None]:
df_nice.to_json('JSON_sample_new.json')
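
`json_normalize` also flattens nested dictionaries, joining the key names with a dot. The sketch below uses made-up records (not the contents of `JSON_sample.json`) and also shows the `orient='records'` option of `to_json`, which writes one JSON object per row.

```python
import json
import pandas as pd

# Hypothetical records with a nested 'contact' object
employees = [
    {"name": "Alpha", "contact": {"email": "alpha@example.com", "city": "London"}},
    {"name": "Bravo", "contact": {"email": "bravo@example.com", "city": "Leeds"}},
]

flat = pd.json_normalize(employees)
print(list(flat.columns))

# With no path argument, to_json returns the JSON text as a string
as_records = json.loads(flat.to_json(orient='records'))
print(as_records[0])
```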

## XML
- XML (eXtensible Markup Language) is another way of exchanging data between browsers and servers (JSON is an alternative to XML).
- Hence, as with JSON, we can use XML to obtain data from the web. XML files have the extension `.xml`.
- XML is a markup language like HTML: it contains data and information on how that data is structured, but not on how it is displayed.
- Hence we need a parser to extract data from an XML file.
- The process shown below is one way to do it, although it is not the only one.

### XML Structure

XML documents are structured much like HTML:
- They are hierarchical in structure.
- The document usually starts with a prolog containing metadata, such as the version, character encoding and associated style sheet.
- The next tag is the root tag (`data` in this case), which contains all the other tags of the document.
- Tags can be named freely, unlike in HTML, which has a pre-defined set of tags.

#### Components of XML

- __Document:__ The root tag opens the document (in this case `<data>`) and the matching end tag `</data>` closes it.
- __Node:__ Each tag that contains other tags is a node; here `<employee>` is a node tag.
- __Elements:__ Tags that hold a value, such as `<email>alpha@aicore.com</email>` and `<age>36</age>`, are considered elements.
- __Content:__ The data between an element's tags is its content. In the email element `<email>alpha@aicore.com</email>`, the string `alpha@aicore.com` is the content.

```
<?xml version="1.0" encoding="UTF-8"?>
<data>
    <employee name="Alpha">
        <email>alpha@aicore.com</email>
        <department>HR</department>
        <age>36</age>
    </employee>
    <employee name="Bravo">
        <email>bravo@aicore.com</email>
        <department>sales</department>
        <age>23</age>
    </employee>
    <employee name="Charlie">
        <email>charlie@aicore.com</email>
        <department>accounts</department>
        <age>44</age>
    </employee>
    <employee name="Delta">
        <email>delta@aicore.com</email>
        <department>reception</department>
        <age>51</age>
    </employee>
</data>

```
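
These components map directly onto Python's built-in `xml.etree.ElementTree` module. The sketch below parses a shortened copy of the sample from a string (instead of from `employees.xml`) and reads the root tag, an attribute and an element's content.

```python
import xml.etree.ElementTree as et

# Shortened copy of the sample above, held in a string
xml_text = """<data>
    <employee name="Alpha">
        <email>alpha@aicore.com</email>
        <age>36</age>
    </employee>
</data>"""

root = et.fromstring(xml_text)
first = root[0]                      # first <employee> node

print(root.tag)                      # name of the root tag
print(first.attrib)                  # attributes of the node, as a dict
print(first.find('email').text)      # content of the <email> element
```

Note that content is always returned as text, so the age comes back as the string `'36'` and needs converting if you want a number.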

### Parsing XML

- You can use the following premade function to parse XML files. It requires only 2 arguments:
    - The XML filename
    - The columns of the data frame (the fields in each observation in the XML file)

In [None]:
import pandas as pd
import xml.etree.ElementTree as et

def parse_XML(xml_file, df_cols): 
    """Parse the input XML file and store the result in a pandas 
    DataFrame with the given columns. 
    
    The first element of df_cols is supposed to be the identifier 
    variable, which is an attribute of each node element in the 
    XML data; other features will be parsed from the text content 
    of each sub-element. 
    """
    
    xtree = et.parse(xml_file)
    xroot = xtree.getroot()
    rows = []
    
    for node in xroot: 
        res = []
        res.append(node.attrib.get(df_cols[0]))
        for el in df_cols[1:]: 
            if node is not None and node.find(el) is not None:
                res.append(node.find(el).text)
            else: 
                res.append(None)
        rows.append({df_cols[i]: res[i] 
                     for i, _ in enumerate(df_cols)})
    
    out_df = pd.DataFrame(rows, columns=df_cols)
        
    return out_df

In [None]:
df = parse_XML("employees.xml", ["name", "email", "department", "age"])
df

### XML and Pandas

We can also read `.xml` files using the pandas `read_xml` function.

In [None]:
df = pd.read_xml("employees.xml")
df

Using the parameter `attrs_only`, we can tell pandas to parse only the attributes of each row (here, the `name` attribute of each `<employee>`).

In [None]:
df = pd.read_xml("employees.xml", attrs_only=True)
df

With the `elems_only` parameter, we can tell pandas to parse only the child elements of each row and ignore the attributes.

In [None]:
df = pd.read_xml("employees.xml", elems_only=True)
df

We can then convert the dataframe easily back to an XML document with the `to_xml` method. 

In [19]:
df.to_xml("employees_df_export")

## Images

Computers don't see images the way we see them. When an image is stored on your computer, it needs to be stored in a form the computer can understand.
- Computers represent an image as a 2D matrix, or as a 3D array where the third dimension represents the channels.
- Each unit in that grid is a __pixel__.
- The image's resolution determines the size of this matrix. A resolution of 800 x 600 would be a grid of size 800 by 600 pixels.
- Each pixel has a number associated with it that determines its colour.

For example, this is how a computer would represent a grayscale image:
- __Grayscale__ images have a single channel (gray) to represent the image.
- Colours on computers are usually represented using 8-bit numbers, that is, a sequence of 8 zeros and ones. This gives 2<sup>8</sup> or 256 possible values for each pixel.
- Each value describes the intensity or brightness of that pixel. In this case, 0 is black and 255 is white.

![](./images/grayscale.png)


Colour images are normally represented using the RGB (Red, Green, Blue) model, for example:
- __RGB__ is represented by three channels - __Red__, __Green__ and __Blue__.
- The channels are combined together to create the image.
- Pure red would be expressed as (255, 0, 0), green as (0, 255, 0) and blue as (0, 0, 255).
- White is represented by (255, 255, 255) and black by (0, 0, 0).
- Any other colour can be represented by a combination of all three. For instance, (106, 13, 173) would represent the colour purple.
- With 256 values per channel, this gives 256<sup>3</sup> = 16,777,216 possible colours.

<img src="./images/rgb_image.png"/>
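
As a rough illustration of these grids, the sketch below builds a tiny image using plain nested lists. In practice an array library would normally be used; the pixel values here are made up.

```python
# A 2 x 3 grayscale image: one intensity (0-255) per pixel
gray = [
    [0, 128, 255],
    [255, 128, 0],
]
height, width = len(gray), len(gray[0])

# The same grid as RGB: each pixel becomes a (red, green, blue) triple
rgb = [[(v, v, v) for v in row] for row in gray]
rgb[0][0] = (255, 0, 0)   # paint the top-left pixel pure red

# Number of distinct colours with 8 bits per channel
levels_per_channel = 2 ** 8
total_colours = levels_per_channel ** 3

print(height, width)
print(rgb[0])
print(total_colours)
```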

There are other systems a computer can use to represent colours:
- Printers use the CMYK system to represent colours:
  - __C__ for Cyan
  - __M__ for Magenta
  - __Y__ for Yellow
  - __K__ for Black
- Another common one is the hexadecimal format.
  - Each colour is represented by # followed by six characters: #RRGGBB.
  - RR (red), GG (green) and BB (blue) are each hexadecimal integers between 00 and FF.
  - For example, #0000FF displays blue, since FF is the highest value and 00 is the lowest.
  - #CC5500 would represent a burnt orange colour. Can you see why?
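
To check your reasoning, here is a small sketch that converts a hexadecimal colour code into its RGB triple. The function name `hex_to_rgb` is just a helper written for this illustration.

```python
def hex_to_rgb(code):
    """Convert a '#RRGGBB' colour code into a (red, green, blue) tuple."""
    digits = code.lstrip('#')
    # Read two hexadecimal digits at a time: RR, GG, BB
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

blue = hex_to_rgb('#0000FF')
burnt_orange = hex_to_rgb('#CC5500')

print(blue)
print(burnt_orange)
```

The second result has a lot of red, a moderate amount of green and no blue, which is what gives the burnt orange colour.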


## Audio

Audio is a little more complicated; sound is an analog signal, which is made up of waves travelling through matter. The computer needs to convert this signal into a digital format to be machine readable.

The sound card on your computer performs this process for you:

- The sound card has __four__ main components to perform this task.
  - __analog-to-digital converter (ADC)__ - to process the incoming analog signal into a digital format, e.g. sound recorded by your microphone. 
  - __digital-to-analog converter (DAC)__ - converts the digital audio signal on your pc to an analog format so you can listen to it, e.g. converted to analog and output through the speakers. 
  - __PCI interface__ - to connect the sound card to the motherboard.
  - __Input and output__ - for devices such as a microphone or speakers.

Sound cards represent audio with the following process:
- When incoming audio is detected by the sound card, it takes measurements (samples) of it at regular intervals.
  - __Sampling rate__ is defined as the number of samples the sound card takes per second.
  - Sampling rate is measured in hertz (one hertz is one sample per second).
  - The higher the sampling rate, the better the quality of the sound representation.
- __Sampling resolution__ is the number of bits used to represent each sample.
  - The higher the resolution, the better the representation of the sound.
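
The two ideas can be sketched in a few lines of Python. This is a simplified illustration, not how a sound card is actually implemented: it samples a pure sine tone at a given rate and rounds each sample to one of the integer levels available at a given resolution. The function name and the chosen values are assumptions made for the example.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, duration_s, bits):
    """Sample a sine wave and quantise each sample to `bits` bits."""
    n_samples = round(sample_rate_hz * duration_s)
    max_level = 2 ** bits - 1          # highest integer level available
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz         # time of the n-th sample, in seconds
        amplitude = math.sin(2 * math.pi * freq_hz * t)   # between -1 and 1
        # Map [-1, 1] onto the integer levels 0..max_level
        samples.append(round((amplitude + 1) / 2 * max_level))
    return samples

# Same 440 Hz tone, 10 milliseconds long, at two different settings
low = sample_sine(440, 8000, 0.01, bits=3)
high = sample_sine(440, 44100, 0.01, bits=16)

print(len(low), min(low), max(low))
print(len(high), min(high), max(high))
```

The lower setting produces fewer samples and only 8 possible levels, so it gives a coarser picture of the same wave.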

True sound (yellow), then visualisations of low and higher sampling rate:

<img src="./images/sound_sampling.png" height="600" width="400"/>

Increasing sampling resolution:

<img src="./images/sound_resolution.png" height="600" width="400"/>

Commonly, audio data is visualised as a spectrogram. A spectrogram shows time on the horizontal axis and frequency on the vertical axis, with brighter colours where a frequency is more strongly present.

<img src="./images/spectrogram.png" height="600"/>


## Summary
- We now understand the basic file formats CSV, JSON, YAML and XML, and how computers represent images and audio.
- We now know how to read CSV, JSON and XML files into pandas.