# Python and Data Handling

Python is a powerful, general-purpose programming language that can handle many kinds of data. In this article we will:

1) Describe types of data <br>
2) Discuss types of data formats <br>
3) Provide examples of accessing different data types <br>




## Types of data 

Data comes in various forms, and these forms derive primarily from human perception and interaction: the types of data we work with reflect the ways we perceive and interact with the objects around us.

* Image data - A picture is something we see. Light emitted in various wavelengths of the visible spectrum is perceived as varying colors.

* Audio data - Sound consists of waves travelling through a medium, which we hear by means of a vibrating membrane (the eardrum).

* Numerical/Text data - Evolution of language and written script led to a new form of communication. Any written scripts/symbols create documented data.

An interesting point is that, although there are various types of data, real-world data is usually converted into a pool of numbers and/or strings so that it can be analyzed within a programming language or an analytical tool.
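
As a minimal sketch of this idea, the snippet below turns a short piece of text into numbers in two ways: Unicode code points and UTF-8 bytes. The sample string is an arbitrary illustration, not data from this article.

```python
# A short text sample (hypothetical, for illustration only)
text = "Data"

# Each character maps to an integer Unicode code point
code_points = [ord(ch) for ch in text]
print(code_points)  # [68, 97, 116, 97]

# The same text encoded as raw bytes, as it would be stored in a file
raw_bytes = list(text.encode("utf-8"))
print(raw_bytes)  # [68, 97, 116, 97]

# The numbers can be converted back into the original string
restored = "".join(chr(n) for n in code_points)
print(restored)  # Data
```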


## Types of data file formats

* **Image files -**
There are various file formats for storing image data. A raster image file is essentially a 2-dimensional grid in which each point has a distinct color. Each point in the grid is called a 'pixel', and its value is a color expressed on the RGB (Red-Green-Blue) scale, where the values of the three primary colors combine to create a specific color. JPEG, PNG, TIFF, Bitmap and GIF are raster formats of this kind, each designed for specific purposes; SVG is a vector format that stores shapes rather than pixels. Some formats support simple animations (GIF), some scale without loss of quality (SVG), and others are suited to high-resolution images (a dense packing of pixels).<br>


* **Audio files -** 
There are also various audio file formats. An audio file consists of wave data on one or more channels. The wave data of a single channel is sent to a single speaker (a device capable of creating sound waves). The speaker creates sound waves based on the amplitude values, and this re-creates the audio. MP3, WAV, AAC and WMA are some popular audio file formats; MP4 is a container format that is also often used for audio.<br>


* **Text files -**
Text files consist of organized or unorganized groups of strings. The data within a text file can be read and parsed as a string, and analysis can then be conducted on it using various string operations. CSV and TXT are plain-text formats; RTF and DOC also store text but add formatting information.<br>


* **Table/Database files -**
When storing data digitally, reliable and efficient structures are chosen to make the data points easy to access and understand. One widely used structure is the 'table': a combination of rows and columns of data. Each row usually represents a unique record pertaining to one specific observation of the table's subject, where the 'subject' is the main entity, event or object that defines the central theme of the data set. Each column is a unique attribute of the observation in question; columns are sometimes also known as dimensions or features.


* **Types of Analyses**
<img src="types_of_analysis.PNG" style="width:50vw">
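
To make the grid and table structures described in this list concrete, here is a minimal sketch in plain Python. The pixel values and the movie records are made up for illustration.

```python
# A 2x2 image as a grid of (R, G, B) pixel values, each in the range 0-255
image = [
    [(255, 0, 0), (0, 255, 0)],      # row 0: red, green
    [(0, 0, 255), (255, 255, 255)],  # row 1: blue, white
]
height, width = len(image), len(image[0])
print(height, width)  # 2 2
print(image[1][0])    # (0, 0, 255) -> the pixel at row 1, column 0

# A table as a list of rows; each row is one observation,
# and each key is a column (attribute) of that observation
table = [
    {"title": "Movie A", "year": 2014, "rating": 4.5},
    {"title": "Movie B", "year": 2015, "rating": 3.0},
]

# Selecting one column across all rows
ratings = [row["rating"] for row in table]
print(ratings)  # [4.5, 3.0]
```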


## Accessing different types of data 

Python allows for accessing data from different sources. Below, we provide examples for each type of data source. 

### Image files

TIFF and PNG are two common image formats.


#### TIFF

TIFF (Tagged Image File Format) is an image format like JPEG, PNG, Bitmap and GIF. Its main advantage is that it can store image data without loss (typically uncompressed or with lossless compression), which makes it well suited to archiving and to workflows where image quality must be preserved. Python has an imaging library called **PIL** (today maintained as the **Pillow** fork, which is still imported as `PIL`) that can process images.

* First, let's see how to use this library to open an image.

``` python
from PIL import Image

im = Image.open('myfile.tif')
im.show()
```
**Note:** The *.show()* method opens the image in your system's default image viewer.

* Next, let's convert this image to a numpy array for processing. It's as simple as:

``` python
import numpy as np

imarray = np.array(im)
imarray
```
This converts your image into numpy array values, which can be used for manipulating the image.
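
As a rough sketch of what such manipulation can look like, the example below builds a tiny image in memory (so that it runs without `myfile.tif`), converts it to an array, inverts the colors, and converts the result back to an image. The image size and color are arbitrary assumptions.

```python
import numpy as np
from PIL import Image

# Create a small 4x2 red image in memory instead of opening a file
im = Image.new("RGB", (4, 2), color=(255, 0, 0))

# Convert to a numpy array: shape is (height, width, channels)
imarray = np.array(im)
print(imarray.shape)  # (2, 4, 3)
print(imarray.dtype)  # uint8

# Invert every channel of every pixel (red becomes cyan)
inverted = 255 - imarray

# Convert the array back into a Pillow image
im_inverted = Image.fromarray(inverted)
print(im_inverted.getpixel((0, 0)))  # (0, 255, 255)
```

Note that Pillow reports image size as (width, height), while the numpy array is indexed as (row, column, channel).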


#### PNG

Portable Network Graphics (PNG) is an open image format that was created to replace the GIF (Graphics Interchange Format). It is the most widely used lossless image compression file format on the internet today. (Source: https://en.wikipedia.org/wiki/Portable_Network_Graphics)
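
PNG files can be opened with Pillow in the same way as TIFF files. The sketch below writes a small PNG into an in-memory buffer and reads it back, so it runs without any file on disk; the image size, color and transparency value are illustrative assumptions.

```python
import io
from PIL import Image

# Build a 3x2 semi-transparent blue image and save it as PNG into memory
buffer = io.BytesIO()
Image.new("RGBA", (3, 2), color=(0, 0, 255, 128)).save(buffer, format="PNG")
buffer.seek(0)

# Open the PNG data exactly as we would open a file path
im = Image.open(buffer)
print(im.format)  # PNG
print(im.size)    # (3, 2) -> (width, height)
print(im.mode)    # RGBA -> includes an alpha (transparency) channel

# PNG is lossless, so the pixel value survives the round trip unchanged
print(im.getpixel((0, 0)))  # (0, 0, 255, 128)
```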

### Viewing an image file using matplotlib

There are many libraries for reading, loading and analyzing images in Python. 'Pillow' is probably one of the most widely used. As seen in the section above, the 'open' function from the 'Image' module reads the image file, and the 'show()' method launches your operating system's image viewer to display it. To display the image within a Jupyter Notebook instead, we can plot the image data on a coordinate system, which is easy to do with the matplotlib library.

The 'image' sub-module within matplotlib contains a function called 'imread()', which is built specifically to read image files. Once the data is read using this function, it can be passed to the 'imshow()' function from the 'pyplot' sub-module to display it as a plot. See below for a simple example:
```python
# importing pyplot
import matplotlib.pyplot as plt
# importing image module from matplotlib
import matplotlib.image as mpimg
# Rendering all plots inline within the notebook
%matplotlib inline

# Reading the image file data into a variable
py_img=mpimg.imread('../../../data/python-logo.png')

# Showing/Displaying the image data as a plot
plt.imshow(py_img)

# Output: the image is rendered inline in the notebook
```

<img src="python-logo.PNG" style="width:30vw">
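
It can help to inspect what `imread()` actually returns. The sketch below creates a small PNG in memory with Pillow (an assumption made so the example needs no file on disk) and reads it with matplotlib. For PNG input, matplotlib returns floating-point values scaled to the range 0 to 1 rather than integers from 0 to 255.

```python
import io
import matplotlib.image as mpimg
from PIL import Image

# Write a 4x2 red PNG into an in-memory buffer
buffer = io.BytesIO()
Image.new("RGB", (4, 2), color=(255, 0, 0)).save(buffer, format="PNG")
buffer.seek(0)

# imread accepts a path or a file-like object
py_img = mpimg.imread(buffer, format="png")

# The result is a numpy array indexed as (rows, columns, channels)
print(py_img.shape)
print(py_img.dtype)
print(py_img.min(), py_img.max())
```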




### Audio files 

### Opening a WAV file and analyzing it

WAV is a common audio file format. WAV files can be opened with two common libraries:
1. scipy.io.wavfile
2. wave

<b>Opening a WAV file:</b>
Examples:
```python
# Using scipy.io.wavfile: returns the sample rate and the full sample array
from scipy.io.wavfile import read
a = read("../../../data/Matteo-Amandoi__Official_Music_HD_.wav")

# Using wave: returns a reader object that reads frames on demand
import wave
w = wave.open("../../../data/Matteo-Amandoi__Official_Music_HD_.wav", 'rb')  # 2nd parameter 'rb' specifies read mode
```

Note that the 'read' function from the scipy module reads the entire audio file at once and returns the sample data as a numpy array, ready for analysis. The 'open' function from the wave library instead returns a reader object with a cursor, so we may read the audio frame by frame or all at once. The object 'w' itself is not an array: its 'readframes()' method returns raw bytes, which must be decoded before they can be analyzed numerically.
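
For readers who prefer the wave library, the frames it returns can still be turned into numbers with a little extra work. The following is a minimal sketch, assuming 16-bit PCM audio; it writes a short synthetic tone to an in-memory WAV file (the tone, sample rate and variable names are illustrative assumptions) and decodes the frames with `numpy.frombuffer`.

```python
import io
import wave
import numpy as np

# Write one second of a 440 Hz tone into an in-memory mono WAV file
rate = 8000
t = np.arange(rate) / rate
samples = (0.5 * 32767 * np.sin(2 * np.pi * 440 * t)).astype("<i2")

buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)   # mono
    writer.setsampwidth(2)   # 2 bytes per sample = 16-bit audio
    writer.setframerate(rate)
    writer.writeframes(samples.tobytes())
buffer.seek(0)

# Read the frames back and decode the raw bytes into 16-bit integers
with wave.open(buffer, "rb") as reader:
    n_frames = reader.getnframes()
    channels = reader.getnchannels()
    raw = reader.readframes(n_frames)

data = np.frombuffer(raw, dtype="<i2").reshape(-1, channels)
print(n_frames)    # 8000
print(data.shape)  # (8000, 1)
```

The `dtype` passed to `frombuffer` must match the sample width of the file; `"<i2"` is only correct for 16-bit audio.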

### Wave to Numpy Array and Signal Plotting

When the audio file is read in its entirety using the 'scipy.io.wavfile' module, we can convert the file into a numpy object and understand the underlying programmable data that constitutes an audio file. 

<b>Parts of a wave file:</b> When a wave file is read using the read() function of the 'scipy.io.wavfile' module, it returns the audio file in two parts:
1. Audio sample rate - The first part is an integer that conveys the sample rate of the audio data. A higher sample rate generally allows better audio quality. A common sample rate is 44.1 kHz (44100) or higher, which means that every second of audio is made up of 44,100 points of wave data per channel.
2. Audio wave data - The second part is a numpy array of audio wave amplitudes. The audio can have multiple channels; most music files have 2 channels, left and right, corresponding to left and right speakers. For such a file the array is two-dimensional, and each of its elements is itself an array with 2 elements: the wave amplitudes of the left and right channels. A single-channel (mono) file yields a one-dimensional array.
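
The two parts can be unpacked directly into two variables. The sketch below uses a short, silent stereo clip written to an in-memory buffer in place of a real audio file (the clip length is an assumption for illustration) and shows how the sample rate relates the array length to the duration of the audio.

```python
import io
import numpy as np
from scipy.io import wavfile

# Build half a second of silent stereo audio at 44.1 kHz (illustrative data)
rate = 44100
stereo = np.zeros((rate // 2, 2), dtype=np.int16)

# Write it to an in-memory WAV file, then read it back
buffer = io.BytesIO()
wavfile.write(buffer, rate, stereo)
buffer.seek(0)
sample_rate, data = wavfile.read(buffer)

print(sample_rate)  # 44100
print(data.shape)   # (22050, 2) -> (samples, channels)

# Duration in seconds = number of samples / samples per second
duration = data.shape[0] / sample_rate
print(duration)  # 0.5

# Split the 2D array into its left and right channels
left, right = data[:, 0], data[:, 1]
```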


We can convert the wave data into a numpy array of a different data type simply by using the numpy.array() function and specifying the data type as 'float' or a wider integer type.

Example:
```python
from scipy.io.wavfile import read
import numpy as np

# read() returns a (sample_rate, data) tuple
a = read("../../../data/Matteo-Amandoi__Official_Music_HD_.wav")

# a[1] holds the wave data; convert it to floating point
np.array(a[1], dtype=float)

# Output
# array([[ 0.,  0.],
#        [ 0.,  0.],
#        [ 0.,  0.],
#        ...,
#        [ 0.,  0.],
#        [ 0.,  0.],
#        [ 0.,  0.]])
```

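Note that the printed output shows only the first and last few rows of the array. The amplitudes there may be largely zeroes, or sometimes a constant number repeated across multiple time stamps. Runs of zeroes indicate 'silence' (zero amplitude), which is typical at the very start and end of a track; a repeated constant value is a fixed offset in the signal rather than audible sound.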
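
Once the wave data is in a numpy array, it can be plotted against time. Here is a minimal sketch, using a synthetic stereo tone in place of the audio file; the sample rate and tone frequency are assumptions chosen to keep the plot readable.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for the (sample_rate, data) pair returned by read():
# one second of a 5 Hz stereo tone, with the right channel at half volume
sample_rate = 1000
t = np.arange(sample_rate) / sample_rate
left = np.sin(2 * np.pi * 5 * t)
data = np.column_stack([left, 0.5 * left])

# Time stamp (in seconds) of every sample
time = np.arange(data.shape[0]) / sample_rate

# Plot amplitude against time, one line per channel
fig, ax = plt.subplots()
ax.plot(time, data[:, 0], label="left")
ax.plot(time, data[:, 1], label="right")
ax.set_xlabel("Time (s)")
ax.set_ylabel("Amplitude")
ax.legend()
```

In a Jupyter Notebook the figure is rendered below the cell; in a script, call `plt.show()` to display it.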



### Text files


Python can read text files of various formats. One of the most common formats is CSV. 

#### CSV

CSV (Comma Separated Values) files usually contain mixed data types and are used to transfer tabular data between programs. There are multiple ways to import a CSV file into Python. The first method we’ll look at uses the **csv** module, a versatile module in Python's standard library. Its **reader()** function reads in the data as rows, and we can then print each row. The second method, usually the most convenient for analysis, is to import the file as a dataframe using the pandas **read_csv()** function.

* **Method 1:**

Using the CSV module

```python
import csv

# Open the file in a with-block so it is closed automatically
with open("myfile.csv", newline="") as f:
    csv_file = csv.reader(f)
    for row in csv_file:
        print(row)
```

* **Method 2:**

Using the pandas module

```python
import pandas as pd

df = pd.read_csv("myfile.csv")
df.head()
```
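
The practical difference between the two methods shows up in the data types. The sketch below runs both on the same small CSV text held in memory (the sample rows are made up for illustration): `csv.reader` returns every field as a string, while pandas infers a type for each column.

```python
import csv
import io
import pandas as pd

# A small CSV document held in memory (made-up sample data)
csv_text = "name,age,score\nAlice,30,88.5\nBob,25,92.0\n"

# Method 1: csv.reader yields every field as a string
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows[0])  # ['name', 'age', 'score']
print(rows[1])  # ['Alice', '30', '88.5']

# Method 2: pandas uses the first row as column names and infers column types
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)
print(df["age"].mean())  # 27.5
```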

### Database files

Python allows for creating and working with SQL databases. Below is an example of reading CSV data into pandas dataframes and writing them to SQL tables in an in-memory SQLite database.

```python 
import sqlite3
import pandas as pd

# Connecting to an in-memory SQLite database
fancon = sqlite3.connect(':memory:')

# Reading the CSV data into dataframes
scrapedf = pd.read_csv('https://raw.githubusercontent.com/colaberry/538data/master/fandango/fandango_scrape.csv')
scoredf = pd.read_csv('https://raw.githubusercontent.com/colaberry/538data/master/fandango/fandango_score_comparison.csv')

# Writing each dataframe to a SQL table
scrapedf.to_sql(name='fscrape',con=fancon,if_exists='append',index=False)
scoredf.to_sql(name='fscore',con=fancon,if_exists='append',index=False)
```
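
Once a dataframe has been written to a table, it can be queried with ordinary SQL. The following sketch uses a small dataframe with made-up table and column names, standing in for the Fandango data so that it runs without a network connection, and reads the query result back with `pandas.read_sql_query`.

```python
import sqlite3
import pandas as pd

# A small dataframe with made-up columns, standing in for the Fandango data
df = pd.DataFrame({
    "film": ["Film A", "Film B", "Film C"],
    "stars": [4.5, 3.0, 4.0],
})

# Write the dataframe to a table in an in-memory SQLite database
con = sqlite3.connect(":memory:")
df.to_sql(name="films", con=con, if_exists="append", index=False)

# Query the table with SQL and get the result back as a dataframe
result = pd.read_sql_query(
    "SELECT film, stars FROM films WHERE stars >= 4 ORDER BY stars DESC", con
)
print(result)
con.close()
```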

## Learn more about data manipulation in Python 

Python has many other features that can be used to manipulate data; the topics discussed above are only a subset. If you want to learn more about data handling with Python, check out our courses at https://refactored.ai. Our courses cover everything from introductory Python to Pandas, Plotly, Bokeh, and machine learning techniques.