<center><img src="https://www.dlsi.ua.es/~juanra/UA/curso_verano_DL/images/pandas-logo.png" height="100"></center>

# 1.3 Data tables with Pandas and basic graphics

Instructor: Juan Ramón Rico (<juanramonrico@ua.es>)

## Summary

----
**Pandas**: Pandas is a high-performance package for table-like data structures, data analysis and visualization. It is built on `NumPy`, and its plotting methods rely on `Matplotlib`.
- Documentation <https://pandas.pydata.org/pandas-docs/stable/>
- Quick start tutorial <https://pandas.pydata.org/pandas-docs/stable/10min.html>    

**Matplotlib**: A flexible package for plotting graphs. It is powerful but can be somewhat difficult for novice users.
- Documentation <https://matplotlib.org/contents.html>
- Quick start tutorial <https://matplotlib.org/users/pyplot_tutorial.html>

---

# Data tables (data frame)


A data frame is a data type that represents a data table. It is equivalent to a two-dimensional table with names for the rows and columns.
In addition, each column can hold a different type of data, which the matrix type does not allow.

The most common way to work with data frames is to import tables from files (text, spreadsheets).

Creating and storing data frames

To work with this type of data we will use a package called `Pandas`, which can be installed with:

`pip install pandas`

## Creating tables

In [None]:
import numpy as np
import pandas as pd

Name = np.array(['Juan', 'Pedro', 'Ana', 'Isabel'])
Group  = ['Morning','Afternoon','Morning','Afternoon']
Grade = [8, 5, 9, 5.5]

semester = pd.DataFrame({'Name':Name, 'Grade':Grade, 'Group':Group}, columns=['Name', 'Grade', 'Group'])
semester
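
To check that each column keeps its own data type, we can inspect `dtypes`. The following is a minimal sketch that rebuilds the same small table so that it runs on its own; it only illustrates the idea.

```python
import pandas as pd

# Same toy table as above, rebuilt so the snippet is self-contained
semester = pd.DataFrame({
    'Name': ['Juan', 'Pedro', 'Ana', 'Isabel'],
    'Grade': [8, 5, 9, 5.5],
    'Group': ['Morning', 'Afternoon', 'Morning', 'Afternoon'],
})

# One dtype per column: text columns and a float column coexist
print(semester.dtypes)

# shape is (number of rows, number of columns)
print(semester.shape)
```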

We can also load the data from a file. The same `data frame` above could be achieved with:

In [None]:
# Display the CSV file

!curl 'https://www.dlsi.ua.es/~juanra/UA/curso_verano_DL/data/pandas_example-en.csv'

In [None]:
import pandas as pd

semester = pd.read_csv('https://www.dlsi.ua.es/~juanra/UA/curso_verano_DL/data/pandas_example-en.csv')
semester
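
Storing a data frame works in the opposite direction with `to_csv()`. The sketch below writes to an in-memory buffer instead of a real file, so it does not depend on the course server. The buffer and the hypothetical file name mentioned in the comment are assumptions for illustration.

```python
import io
import pandas as pd

semester = pd.DataFrame({
    'Name': ['Juan', 'Pedro', 'Ana', 'Isabel'],
    'Grade': [8, 5, 9, 5.5],
    'Group': ['Morning', 'Afternoon', 'Morning', 'Afternoon'],
})

# Write to an in-memory text buffer; a file name such as
# 'semester.csv' (hypothetical) could be passed instead
buffer = io.StringIO()
semester.to_csv(buffer, index=False)  # index=False omits the row labels

# Rewind the buffer and read the CSV text back into a new data frame
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Passing `index=False` avoids an extra unnamed column when the file is read again.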

Analyzing the data

In [None]:
semester.describe(include='all')
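
By default `describe()` only summarizes numeric columns, while `include='all'` also adds rows for the text columns. A small sketch on the same toy table (rebuilt here so that it runs on its own):

```python
import pandas as pd

semester = pd.DataFrame({
    'Name': ['Juan', 'Pedro', 'Ana', 'Isabel'],
    'Grade': [8, 5, 9, 5.5],
    'Group': ['Morning', 'Afternoon', 'Morning', 'Afternoon'],
})

# Default: only the numeric column 'Grade' is summarized
numeric_summary = semester.describe()
print(numeric_summary)

# include='all' adds count/unique/top/freq rows for the text columns
full_summary = semester.describe(include='all')
print(full_summary.loc[['count', 'unique', 'top', 'freq'], ['Name', 'Group']])
```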

## Selecting elements

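| Expression on `semester` | Selects
|:----|:----
| `semester[:2]` | The first two rows
| `semester[2:]` | The rows from the third onward
| `semester[:2]['Grade']` | The `Grade` column of the first two rows
| `semester[2:]['Grade']` | The `Grade` column of the rows from the third onward
| `semester[semester.Group=='Morning']` | The rows whose `Group` is `Morning`
| `semester[(semester.Group=='Morning') & (semester.Grade>8)]` | The `Morning` rows with a `Grade` above 8
| `semester[(semester.Group=='Morning') & (semester.Grade>8)]["Name"]` | The `Name` column of those rows
| `semester[(semester.Group=='Morning') & (semester.Grade>8)][["Name","Grade"]]` | The `Name` and `Grade` columns of those rows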

In [None]:
semester[semester.Group=='Morning']

To select elements with a more compact syntax, we can use the `query()` method.

In [None]:
semester.query('Group == "Morning"')
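
Compound conditions can also be written with `query()`. The sketch below compares it with the boolean-mask form on the same toy table. The variable `min_grade` is a name introduced only for this illustration; inside the query string it is referenced with `@`.

```python
import pandas as pd

semester = pd.DataFrame({
    'Name': ['Juan', 'Pedro', 'Ana', 'Isabel'],
    'Grade': [8, 5, 9, 5.5],
    'Group': ['Morning', 'Afternoon', 'Morning', 'Afternoon'],
})

# Boolean-mask form: each condition needs parentheses and '&'
mask_result = semester[(semester.Group == 'Morning') & (semester.Grade > 8)]

# query() form: column names are used directly, and '@' refers
# to a Python variable defined outside the query string
min_grade = 8
query_result = semester.query('Group == "Morning" and Grade > @min_grade')

print(query_result)
print(mask_result.equals(query_result))
```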

## Preparing a data file

Data is usually loaded from a CSV file. In this section we will show how to load and verify data types.

### Copying test files to the server

In [None]:
# You have to copy the example files
!wget https://www.dlsi.ua.es/~juanra/UA/curso_verano_DL/data/basic_data.zip
!unzip basic_data

### Loading and verifying data types

In [None]:
import pandas as pd

# Binary classification with the 'diabetes_01.csv' file
data = pd.read_csv('./basic_data/diabetes_01.csv')

print('\nFirst rows')
display(data.head())

print('\nData types in columns')
display(data.dtypes)

print('\nCheck if there are unknown values in the data')
display(data.isnull().any())
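
`isnull().any()` only tells us whether a column has some unknown value. To see how many there are, the result of `isnull()` can be summed instead. The sketch below uses a hypothetical toy table (not the diabetes data) with two missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with two missing entries
toy = pd.DataFrame({
    'age': [50, np.nan, 31],
    'glucose': [148.0, 85.0, np.nan],
    'class': [1, 0, 1],
})

# any(): is there at least one missing value in each column?
print(toy.isnull().any())

# sum(): how many missing values does each column have?
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# dropna() keeps only the rows without any missing value
complete_rows = toy.dropna()
print(len(complete_rows))
```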

A precision of `int32` and `float32` is sufficient to represent this information. In addition, current GPUs are optimized for 32-bit (or lower) precision rather than 64-bit, so we convert the 64-bit columns.

In [None]:
# Converting 64-bit types to 32-bit to use GPU
pairs = {'int64':'int32', 'float64':'float32'}
for i in data.columns:
  pair = pairs.get(str(data[i].dtype))
  if pair is not None:
    data[i]= data[i].astype(pair)

data.dtypes
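
The same conversion can be sketched without an explicit check per column by using `select_dtypes()` to build a column-to-type mapping. The toy table below is hypothetical and stands in for the loaded data; the memory comparison is only there to show the effect of halving the precision.

```python
import pandas as pd

# Hypothetical toy table standing in for the loaded data
toy = pd.DataFrame({
    'pregnancies': [6, 1, 8],
    'bmi': [33.6, 26.6, 23.3],
    'label': ['a', 'b', 'c'],
})
# Force 64-bit types so the example behaves the same on every platform
toy = toy.astype({'pregnancies': 'int64', 'bmi': 'float64'})
before = toy.memory_usage(index=False)

# Build a {column: new_dtype} mapping for the 64-bit columns
mapping = {}
for col in toy.select_dtypes(include=['int64']).columns:
    mapping[col] = 'int32'
for col in toy.select_dtypes(include=['float64']).columns:
    mapping[col] = 'float32'

# Columns that are not in the mapping keep their dtype
toy32 = toy.astype(mapping)
after = toy32.memory_usage(index=False)

print(toy32.dtypes)
print(before['bmi'], after['bmi'])  # bytes used by 'bmi' before and after
```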

In [None]:
# Actually, the 'class' column corresponds to whether the diagnosis is diabetes (1) or not (0)
# It can be left as an integer, or transformed into a category which it actually is
data['class'] = data['class'].astype('category')
data['class'].dtype
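
A categorical column stores the distinct values once and an integer code for each row. A minimal sketch with a hypothetical series of 0/1 labels (not the real data):

```python
import pandas as pd

# Hypothetical 0/1 labels converted to a category
labels = pd.Series([1, 0, 0, 1, 1], name='class').astype('category')

# The distinct values (categories)
print(labels.cat.categories)

# For each row, the position of its value within the categories
print(labels.cat.codes.tolist())

# Number of rows in each category
print(labels.value_counts())
```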

In [None]:
# Selection of attributes and target variable
X = data.iloc[:,:-1]
y = data.iloc[:,-1]

display(X.head())
display(y.head())
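
The selection by position above assumes that the target is the last column. When that is not guaranteed, the split can be sketched by column name instead. The toy table below is hypothetical and only borrows the `class` column name.

```python
import pandas as pd

# Hypothetical toy table with the target in the last column
toy = pd.DataFrame({
    'glucose': [148, 85, 183],
    'bmi': [33.6, 26.6, 23.3],
    'class': [1, 0, 1],
})

# Split by name: every column except 'class' is a feature
X = toy.drop(columns='class')
y = toy['class']

# Here the result matches the position-based split with iloc
print(X.equals(toy.iloc[:, :-1]))
print(y.equals(toy.iloc[:, -1]))
```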

### Exercise: read data from iris.csv

In [None]:
# Exercise: adapt the values of the 'iris.csv' file

data = pd.read_csv('./basic_data/iris.csv')

display(data.head())
display(data.dtypes)

In [None]:
# Convert numeric features to 'float32' and 'class' to 'category'


In [None]:
# Assign the values of the features to the variable 'X'

# Assign the values of the target variable or class to the variable 'y'


# Basic graphics

The basic package for `Python` graphics is `Matplotlib`, which can be installed with

`pip install matplotlib`

if we don't already have it installed. It is also possible to show graphs with `Pandas` and `Plotly`, so we will show the same examples with all three packages.

## Showing 2D points

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

#
# matplotlib
#
df = pd.DataFrame(np.random.rand(10,2), columns=['x','y'])
plt.title('matplotlib')
plt.scatter(df.x, df.y)
plt.show()

# Pandas
df.plot.scatter('x','y', title='Pandas')

# Plotly
px.scatter(df, x='x', y='y', title='Plotly', height = 400, width = 500)
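
Since the Pandas plotting methods draw with Matplotlib, both can share one figure. The following is a minimal sketch that uses `plt.subplots()` and the `ax=` argument of the Pandas plot; the figure size is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.rand(10, 2), columns=['x', 'y'])

# One figure with two side-by-side axes
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Matplotlib draws directly on the first axes
ax_left.scatter(df.x, df.y)
ax_left.set_title('matplotlib')

# Pandas draws on the second axes through the ax= argument
df.plot.scatter('x', 'y', ax=ax_right, title='Pandas')

# Adjust the spacing so that titles and labels do not overlap
fig.tight_layout()
```

In a notebook the figure is rendered at the end of the cell.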

## 1D numeric vector

We are going to generate a vector of 100 random numbers drawn from
a $\mathcal{N}(0,1)$ distribution.

In [None]:
np.random.seed(1000)

df_100 = pd.DataFrame(np.random.normal(size=100), columns=['x'])

### Sequence of values

In [None]:
# matplotlib
plt.title('matplotlib')
plt.plot(df_100.x)

# Pandas
df_100.plot.line(title='Pandas')

# Plotly
px.line(df_100, y='x', title='Plotly', height = 400, width = 500)

### Histogram

In [None]:
# Matplotlib
plt.title('matplotlib')
plt.hist(df_100.x,20)

# Pandas
df_100.plot.hist(bins=20, title='Pandas')

# Plotly
px.histogram(df_100, x='x', title='Plotly', height = 400, width = 500)
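
The numbers behind a histogram can be obtained without drawing anything by using `np.histogram()`. The sketch below regenerates the same kind of sample with the seed used earlier so that it runs on its own.

```python
import numpy as np
import pandas as pd

np.random.seed(1000)
df_100 = pd.DataFrame(np.random.normal(size=100), columns=['x'])

# counts: number of values in each bin
# edges: bin limits (one more than the number of bins)
counts, edges = np.histogram(df_100.x, bins=20)

print(counts)
print(edges.round(2))
```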

### Boxplot

In [None]:
# Matplotlib
plt.title('matplotlib')
plt.boxplot(df_100.x)

# Pandas
df_100.plot.box(title='Pandas')

# Plotly
px.box(df_100, y='x', title='Plotly', height = 400, width = 500)

#### Calculating outliers

In [None]:
Q1, Q3 = np.percentile(df_100.x, [25,75])
IQR = (Q3 - Q1)
outlier_low = Q1 - 1.5 * IQR
outlier_high= Q3 + 1.5 * IQR
print(f'Q1: {Q1:.2f}; Q3: {Q3:.2f}; IQR: {IQR:.2f}')
print(f'outlier_low: {outlier_low:.2f} outlier_high: {outlier_high:.2f}')
print(f'df_100 outliers:\n{df_100[(df_100.x<outlier_low) | (df_100.x>outlier_high)]}')
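
The same rule can be sketched with Pandas only, using `quantile()` and `between()`. The function `iqr_outliers` and its parameter `k` are hypothetical names introduced for this illustration; with `k=1.5` it applies the same limits as the cell above.

```python
import numpy as np
import pandas as pd

np.random.seed(1000)
df_100 = pd.DataFrame(np.random.normal(size=100), columns=['x'])

def iqr_outliers(series, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    # between() includes both limits, so only values strictly outside are kept
    inside = series.between(q1 - k * iqr, q3 + k * iqr)
    return series[~inside]

outliers = iqr_outliers(df_100.x)
print(outliers)
```

Wrapping the rule in a function makes it easy to apply it to any numeric column, or to try a different multiplier `k`.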

---

# Summary

* **Pandas** is a basic package for manipulating data of different types with a **table** structure. It can also **represent** its content **graphically** or analyze it **statistically** in a simple way.

* **Matplotlib** is the most widely used plotting package in Python. It follows an **imperative** approach: you indicate how to draw, and you need to define each part of the graph with code. There is also a higher-level **declarative** approach, used by packages such as **Plotnine**, where you describe what to show and do not need to specify each part of the graph.