# Pandas - File Formats

Pandas can read and write many data formats through one consistent interface, so we rarely have to think about which format-specific libraries or modules to import.

The syntax is as follows:

In [None]:
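import pandas as pd

# Template only: replace {format} with a real format name such as csv, json or xml
# df = pd.read_{format}('FILE_DIR')   # read a file into a DataFrame
# df.to_{format}('FILE_DIR')          # write a DataFrame to a file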

## Pandas and CSV
The syntax for reading a CSV file into pandas, and for writing a DataFrame back to CSV, is:

In [None]:
# read_csv returns a DataFrame, which we assign to a variable
df = pd.read_csv('<filename>')

# to_csv is a DataFrame method that writes the data to a CSV file
df.to_csv('<filename>')

Example:

In [None]:
# import pandas
import pandas as pd

# read in the csv file
df = pd.read_csv('Salaries.csv', index_col='Id')

# show as DataFrame
df

Let's take only the first 5 rows and save them to a new CSV file:

In [None]:
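df_short = df.head(5)

# write the shortened DataFrame to a new CSV file
df_short.to_csv('Salaries_5.csv')

# display it (a notebook cell only shows its last expression)
df_short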
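The same round trip can be sketched without `Salaries.csv`. The snippet below is a minimal illustration that builds a small made-up table (the column names and values are invented, not taken from the real file), writes its first rows to a temporary file, and reads them back with `index_col` so that the `Id` column becomes the index again.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data standing in for Salaries.csv
df = pd.DataFrame(
    {
        "Id": [1, 2, 3],
        "JobTitle": ["Analyst", "Engineer", "Manager"],
        "BasePay": [50000.0, 65000.0, 80000.0],
    }
).set_index("Id")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "salaries_short.csv")

    # to_csv writes the index as the first column of the file
    df.head(2).to_csv(path)

    # index_col tells read_csv which column to use as the index
    df_back = pd.read_csv(path, index_col="Id")

print(df_back)
```

Without `index_col`, the `Id` values would come back as an ordinary column and pandas would add a fresh 0-based index.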

## Pandas and JSON

- Pandas can also read from and write to JSON files using `read_json` and `to_json`:

In [None]:
# read a JSON file into a DataFrame
df = pd.read_json('JSON_sample.json')
df


This doesn't look good: each value in the `Employees` column is a nested record. We have to normalise that column so that each key becomes its own column:

In [None]:
df['Employees']

In [None]:
df_nice = pd.json_normalize(df["Employees"])
df_nice

In [None]:
df_nice.to_json('JSON_sample_new.json')
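To see what normalising does without needing `JSON_sample.json`, here is a small sketch on made-up nested records (the field names and values are assumptions for illustration only). `pd.json_normalize` flattens each nested dictionary into columns whose names join the keys with a dot.

```python
import pandas as pd

# Hypothetical records shaped like the entries of a nested JSON file
records = [
    {"name": "Alpha", "contact": {"email": "alpha@example.com", "city": "London"}},
    {"name": "Bravo", "contact": {"email": "bravo@example.com", "city": "Leeds"}},
]

# Nested keys become columns such as 'contact.email'
flat = pd.json_normalize(records)
print(flat)

# Serialise the flat table back to JSON, one object per row
json_text = flat.to_json(orient="records")
print(json_text)
```

The `orient="records"` argument is one of several layouts `to_json` supports; it is used here because it mirrors the list-of-objects shape of the input.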

## XML
- XML (eXtensible Markup Language) is another format for exchanging data between browsers and servers (JSON is an alternative to XML).
- Hence, as with JSON, we can use XML to obtain data from the web. XML files have the extension `.xml`.
- XML is a markup language like HTML, so it contains data and information on how that data is structured, but not on how it is displayed.
- Hence we need a parser to extract data from an XML file.
- The process shown below is one way to do it, although it is not the only one.

### XML Structure

XML documents are structured much like HTML:
- They are hierarchical in structure.
- The document usually starts with a prolog containing metadata, such as the version, character encoding and associated style sheet.
- The next tag is the root tag (`data` in this case), which contains all other tags in the document.
- Tag names are completely flexible, unlike HTML, which has a pre-defined set of tags.

#### Components of XML

- __Document:__ The root tag, in this case `<data>`, opens the document and the closing tag `</data>` ends it.
- __Node:__ A tag that contains other tags is a node; here `<employee>` is a node.
- __Elements:__ Tags that hold a value, such as `<email>alpha@aicore.com</email>` and `<age>36</age>`, are elements.
- __Content:__ The data between an element's opening and closing tags is its content. In the email element `<email>alpha@aicore.com</email>`, the string `alpha@aicore.com` is the content.

```
<?xml version="1.0" encoding="UTF-8"?>
<data>
    <employee name="Alpha">
        <email>alpha@aicore.com</email>
        <department>HR</department>
        <age>36</age>
    </employee>
    <employee name="Bravo">
        <email>bravo@aicore.com</email>
        <department>sales</department>
        <age>23</age>
    </employee>
    <employee name="Charlie">
        <email>charlie@aicore.com</email>
        <department>accounts</department>
        <age>44</age>
    </employee>
    <employee name="Delta">
        <email>delta@aicore.com</email>
        <department>reception</department>
        <age>51</age>
    </employee>
</data>

```
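The components above can be explored with Python's built-in `xml.etree.ElementTree` module. The sketch below parses a shortened copy of the document (one employee, no prolog, held in a string rather than a file) and collects the root tag, each node's attributes, and each element's content.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example document, kept in a string for illustration
xml_text = """<data>
    <employee name="Alpha">
        <email>alpha@aicore.com</email>
        <department>HR</department>
        <age>36</age>
    </employee>
</data>"""

root = ET.fromstring(xml_text)
print("root tag:", root.tag)

contents = {}
for node in root:                      # each <employee> node
    print("node:", node.tag, "attributes:", node.attrib)
    for element in node:               # <email>, <department>, <age>
        contents[element.tag] = element.text
        print("  element:", element.tag, "content:", element.text)
```

Note that all content is read as text, so the age comes back as the string `'36'` until it is converted.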

### Parsing XML

- You can use this premade function to parse XML files. It requires only 2 arguments:
    - The XML filename
    - The columns of the data frame (the fields in each observation in the XML file)

In [None]:
import pandas as pd
import xml.etree.ElementTree as et

def parse_XML(xml_file, df_cols): 
    """Parse the input XML file and store the result in a pandas 
    DataFrame with the given columns. 
    
    The first element of df_cols is supposed to be the identifier 
    variable, which is an attribute of each node element in the 
    XML data; other features will be parsed from the text content 
    of each sub-element. 
    """
    
    xtree = et.parse(xml_file)
    xroot = xtree.getroot()
    rows = []
    
    for node in xroot: 
        # the identifier comes from the node's attributes
        res = [node.attrib.get(df_cols[0])]
        for el in df_cols[1:]: 
            child = node.find(el)
            # use the sub-element's text if it exists, otherwise None
            res.append(child.text if child is not None else None)
        rows.append(dict(zip(df_cols, res)))
    
    out_df = pd.DataFrame(rows, columns=df_cols)
        
    return out_df

In [None]:
df = parse_XML("employees.xml", ["name", "email", "department", "age"])
df

### XML and Pandas

We can also read `.xml` files directly with the pandas `read_xml` function (available from pandas 1.3).

In [None]:
df = pd.read_xml("employees.xml")
df

Setting `attrs_only=True` parses only the attributes of each row node (here the `name` attribute of `<employee>`) and ignores its child elements.

In [None]:
df = pd.read_xml("employees.xml", attrs_only=True)
df

Setting `elems_only=True` does the opposite: it parses only the child elements (here `email`, `department` and `age`) and ignores attributes.

In [None]:
df = pd.read_xml("employees.xml", elems_only=True)
df
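The three calls can be compared side by side on an in-memory document. This is a sketch under two assumptions that differ from the cells above: the XML is wrapped in `StringIO` instead of being read from `employees.xml`, and `parser="etree"` is passed so that the standard-library parser is used rather than pandas' default `lxml`.

```python
from io import StringIO

import pandas as pd

# Shortened copy of the example document (two employees, two elements each)
xml_text = """<data>
    <employee name="Alpha">
        <email>alpha@aicore.com</email>
        <age>36</age>
    </employee>
    <employee name="Bravo">
        <email>bravo@aicore.com</email>
        <age>23</age>
    </employee>
</data>"""

# Attributes and child elements together
df_all = pd.read_xml(StringIO(xml_text), parser="etree")

# Only the attributes of each <employee> node
df_attrs = pd.read_xml(StringIO(xml_text), attrs_only=True, parser="etree")

# Only the child elements of each <employee> node
df_elems = pd.read_xml(StringIO(xml_text), elems_only=True, parser="etree")

print(df_all.columns.tolist())
print(df_attrs.columns.tolist())
print(df_elems.columns.tolist())
```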

We can then easily convert the DataFrame back to an XML document with the `to_xml` method.

In [19]:
df.to_xml("employees_df_export.xml")

## Images

Computers don't see images the way we do. When an image is stored on your computer, it needs to be stored in a form the computer can understand.
- Computers see an image as a 2D matrix, or as a 3D array where the third dimension represents the channels.
- Each unit in that grid is a __pixel__.
- The resolution determines the size of this matrix. A resolution of 800 x 600 would be a grid 800 pixels wide by 600 pixels high.
- Each pixel has a number associated with it determining its colour.

For example, this is how a computer would represent a grayscale image:
- __Grayscale__ images have a single channel (gray).
- Pixel values are usually stored as 8-bit numbers, that is, a sequence of 8 zeros and ones. This gives 2<sup>8</sup> or 256 possible values for each pixel.
- Each value describes the intensity or brightness of the pixel. In this case, 0 is black and 255 is white.

![](./images/grayscale.png)
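A grayscale image can be sketched as a small NumPy array. The pixel values below are made up for illustration, and NumPy is an assumption here since the text does not name an array library. Note that arrays are indexed rows first, so an image 4 pixels wide and 3 pixels high has shape `(3, 4)`.

```python
import numpy as np

# A made-up grayscale "image": 3 rows (height) by 4 columns (width),
# one unsigned 8-bit value per pixel (0 = black, 255 = white)
gray = np.array(
    [
        [0, 64, 128, 255],
        [32, 96, 160, 224],
        [255, 128, 64, 0],
    ],
    dtype=np.uint8,
)

print(gray.shape)              # (3, 4)
print(gray.dtype)              # uint8
print(gray.min(), gray.max())  # darkest and brightest pixel values
```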


Colour images are normally represented using the RGB (Red, Green, Blue) model:
- __RGB__ images have three channels: __Red__, __Green__ and __Blue__.
- The channels are combined to create the image.
- Pure red is expressed as (255, 0, 0), green as (0, 255, 0) and blue as (0, 0, 255).
- White is represented by (255, 255, 255) and black by (0, 0, 0).
- Any other colour can be represented by a combination of all three. For instance, (106, 13, 173) represents a shade of purple.
- With 256 values per channel, this gives 256<sup>3</sup> = 16,777,216 possible colours.
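As a rough sketch of the RGB model, the snippet below builds a tiny 2 x 2 colour image as a 3D NumPy array (height x width x channels), using the colour values listed above. NumPy and the pixel layout are assumptions made for illustration.

```python
import numpy as np

# Hypothetical 2 x 2 colour image: height x width x 3 channels (R, G, B)
rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[0, 0] = (255, 0, 0)      # pure red
rgb[0, 1] = (0, 255, 0)      # pure green
rgb[1, 0] = (0, 0, 255)      # pure blue
rgb[1, 1] = (106, 13, 173)   # purple

# Slicing the last axis extracts a single channel as a 2D matrix
red_channel = rgb[:, :, 0]
print(red_channel)

# 256 possible values in each of the 3 channels
n_colours = 256 ** 3
print(n_colours)             # 16777216
```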

<img src="./images/rgb_image.png"/>

There are other systems a computer can use to represent colours:
- Printers use the CMYK system:
  - __C__ for Cyan
  - __M__ for Magenta
  - __Y__ for Yellow
  - __K__ for Black (Key)
- Another common one is the hexadecimal format, which is a compact way of writing RGB values.
  - Each colour is written as # followed by six characters: #RRGGBB.
  - RR (red), GG (green) and BB (blue) are each hexadecimal integers between 00 and FF (0 to 255 in decimal).
  - For example, #0000FF displays blue, since FF is the highest value and 00 is the lowest.
  - #CC5500 represents a burnt orange colour. Can you see why?
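Converting between the hexadecimal and RGB notations is a matter of reading each pair of hex digits as one channel. The two helper functions below are hypothetical names written for this illustration, not part of any library.

```python
def hex_to_rgb(hex_colour):
    """Convert a '#RRGGBB' string to an (R, G, B) tuple of integers 0-255."""
    digits = hex_colour.lstrip("#")
    # Read the string two hex digits at a time: RR, GG, BB
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Convert an (R, G, B) tuple to a '#RRGGBB' string."""
    return "#{:02X}{:02X}{:02X}".format(*rgb)


print(hex_to_rgb("#0000FF"))        # (0, 0, 255)
print(hex_to_rgb("#CC5500"))        # (204, 85, 0)
print(rgb_to_hex((106, 13, 173)))   # #6A0DAD
```

The second result answers the question about #CC5500: a lot of red, a moderate amount of green and no blue gives a dark orange.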


## Audio

Audio is a little more complicated. Sound is an analog signal made up of waves travelling through a medium such as air. The computer needs to convert this signal into a digital format to make it machine readable.

The sound card on your computer performs this process for you:

- The sound card has __four__ main components to perform this task:
  - __analog-to-digital converter (ADC)__ - converts an incoming analog signal, e.g. sound recorded by your microphone, into a digital format.
  - __digital-to-analog converter (DAC)__ - converts the digital audio signal on your PC into an analog signal so you can listen to it, e.g. through the speakers.
  - __PCI interface__ - to connect the sound card to the motherboard.
  - __Input and output__ - for devices such as a microphone or speakers.

Sound cards represent audio with the following process:
- When incoming audio is detected by the sound card, it takes measurements (samples) of it at regular intervals.
  - __Sampling rate__ is the number of samples the sound card takes per second.
  - Sampling rate is measured in hertz (one hertz is one sample per second).
  - The higher the sampling rate, the better the quality of the sound representation.
- __Sampling resolution__ is the number of bits used to represent each sample.
  - The higher the resolution, the better the representation of the sound.
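Both ideas can be sketched in plain Python. The snippet below is a simplified illustration, not how a sound card is implemented: it samples a 5 Hz sine wave (standing in for the analog signal) at two sampling rates, then quantises the samples at two resolutions. The function names and the chosen rates and bit depths are assumptions for the example.

```python
import math


def sample_sine(frequency_hz, sampling_rate_hz, duration_s):
    """Measure a sine wave at regular intervals (sampling)."""
    n_samples = int(sampling_rate_hz * duration_s)
    return [
        math.sin(2 * math.pi * frequency_hz * n / sampling_rate_hz)
        for n in range(n_samples)
    ]


def quantise(samples, bits):
    """Map values in [-1, 1] to integer levels 0 .. 2**bits - 1."""
    levels = 2 ** bits
    return [min(levels - 1, int((s + 1) / 2 * levels)) for s in samples]


# Sampling rate: more samples per second trace the wave more closely
low_rate = sample_sine(5, 50, 1.0)     # 50 Hz  -> 50 samples
high_rate = sample_sine(5, 500, 1.0)   # 500 Hz -> 500 samples
print(len(low_rate), len(high_rate))

# Sampling resolution: more bits give more distinct amplitude levels
coarse = quantise(high_rate, 3)        # at most 2**3 = 8 levels
fine = quantise(high_rate, 8)          # at most 2**8 = 256 levels
print(len(set(coarse)), len(set(fine)))
```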

True sound (yellow), then visualisations of low and higher sampling rate:

<img src="./images/sound_sampling.png" height="600" width="400"/>

Increasing sampling resolution:

<img src="./images/sound_resolution.png" height="600" width="400"/>

Audio data is commonly visualised as a spectrogram. A spectrogram shows time on the horizontal axis and frequency on the vertical axis, with brighter colours where a frequency is more strongly present.

<img src="./images/spectrogram.png" height="600"/>


# Key Takeaways

- Pandas is a powerful Python library that we can use to read and write data in different file formats. The data is stored in a structure called a _DataFrame_
- To read CSV files, the `read_csv` function is used
- Similarly, to read JSON files, the `read_json` function is used
- XML is another popular file format. XML documents are structured much like HTML documents and contain the following components:
	- Document 
	- Node
	- Elements
	- Content
- Pandas provides `read_xml` to read data stored in XML documents
- Image data is stored as a 2D matrix (or a 3D array for colour images), where each unit within the matrix is a _pixel_
- Audio data is more complicated to store than image data. The computer's sound card converts analog sound waves into digital samples, which can then be interpreted by the computer.