# Python data science intro

An introduction to Python data science libraries.

_2024 César Freire_

* [Numpy](#numpy)
* [Pandas](#pandas)
* [Matplotlib](#matplotlib)

### Libraries

In [None]:
%pip install numpy pandas matplotlib -q

## Numpy
https://numpy.org/

__NumPy__ is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

### CheatSheets

https://assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf

In [None]:
# code example
import numpy as np

# Create 10 uniform samples
sample = np.random.randint(1,11, size=10)
sample

In [None]:
# the expected mean is 5.5 (uniform integers from 1 to 10)

sample.mean()

### Motivation for NumPy
NumPy's core routines are pre-compiled C code, so array operations avoid the overhead of the Python interpreter loop.

In [None]:
%timeit [x for x in range(100_000)]

In [None]:
%timeit np.arange(100_000)
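
Speed is only part of the motivation: vectorized expressions are also shorter to write. The following is a minimal sketch (the names `values`, `squares_loop` and `squares_vec` are illustrative, not part of the original notebook) that computes the same result with a Python loop and with a single NumPy expression.

```python
import numpy as np

values = np.arange(5)

# Pure-Python approach: the interpreter handles one element at a time
squares_loop = [int(v) ** 2 for v in values]

# Vectorized approach: one expression, executed by compiled code
squares_vec = values ** 2

print(squares_loop)          # [0, 1, 4, 9, 16]
print(squares_vec.tolist())  # [0, 1, 4, 9, 16]
```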

### Byte code

In [None]:
import dis

dis.dis("[x for x in range(100_000)]")

In [None]:
dis.dis("np.arange(100_000)")

### Numpy array object
Most used properties

* shape
* size
* ndim
* nbytes
* itemsize
* dtype

In [None]:
array = np.array([[1,2], [3,4], [5,6]])
array

In [None]:
# try your properties here

array.dtype
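
As a reference for the properties listed above, here is a small sketch on the same 3x2 values. The `dtype` is set explicitly to `np.int64` (an assumption made here so that the byte counts do not depend on the platform default), and the name `demo` is illustrative.

```python
import numpy as np

# dtype fixed explicitly so the sizes below are platform independent
demo = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.int64)

print(demo.shape)     # (3, 2) -> 3 rows, 2 columns
print(demo.size)      # 6 elements in total
print(demo.ndim)      # 2 dimensions
print(demo.itemsize)  # 8 bytes per element (64 bits)
print(demo.nbytes)    # 48 = size * itemsize
```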

### Numpy data types (dtype)
* `int` (8 bits ... 64 bits)
* `uint` (8 bits ... 64 bits)
* `float` (16 bits ... 128 bits; 128-bit floats are platform dependent)
* `complex` (64 bits ... 256 bits; 256-bit complex numbers are platform dependent)
* `bool`

In [None]:
# forcing 16bits (float)

array = np.array([[1,2], [3,4], [5,6]], dtype=np.float16)
array
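
The choice of dtype is a trade-off between memory and precision. The sketch below (the names `wide` and `narrow` are illustrative) converts an existing array with `astype` and compares the memory used.

```python
import numpy as np

wide = np.arange(1_000, dtype=np.float64)
narrow = wide.astype(np.float16)  # astype returns a converted copy

print(wide.nbytes)    # 8000 bytes (8 bytes per element)
print(narrow.nbytes)  # 2000 bytes (2 bytes per element)

# float16 cannot represent 0.1 as closely as float64 does
print(float(np.float16(0.1)) == 0.1)  # False
```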

### Reshape data

In [None]:
array = np.arange(10)
array

In [None]:
array.reshape(5,2)
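
Two details of `reshape` are worth knowing; this is a minimal sketch with illustrative names (`flat`, `table`). A dimension given as `-1` is inferred from the total size, and for a contiguous array like this one the result is a view that shares data with the original.

```python
import numpy as np

flat = np.arange(10)

# -1 asks NumPy to infer that dimension from the total size
table = flat.reshape(2, -1)
print(table.shape)  # (2, 5)

# here reshape returns a view, so both names share the same data
table[0, 0] = 99
print(flat[0])  # 99
```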

### Slicing data
__NOTE:__ Array indices start at zero.

In [None]:
array = np.arange(1, 101).reshape(10,10)
array

In [None]:
array[1]  # second row

In [None]:
array[-1]  # last row

In [None]:
array[1,1]  # second row, second column

In [None]:
array[:2]  # first two rows

In [None]:
array[1:4:2]  # second and fourth rows

In [None]:
# exercise: extract last column
array[:,-1]
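
Slices can be combined on both axes, and a boolean mask can be used instead of a slice. The sketch below rebuilds the same 10x10 array under the illustrative name `grid`.

```python
import numpy as np

grid = np.arange(1, 101).reshape(10, 10)

# rows 0-1 and columns 0-2
block = grid[:2, :3]
print(block.tolist())  # [[1, 2, 3], [11, 12, 13]]

# boolean mask: keep only the multiples of 25
multiples = grid[grid % 25 == 0]
print(multiples.tolist())  # [25, 50, 75, 100]
```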

### Operations

In [None]:
x = np.array([[1,2], [3,4]])
y = np.array([[5,6], [7,8]])

In [None]:
y-x

In [None]:
np.concatenate([x, y])
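
A few more operations on the same two arrays, as a short sketch: arithmetic operators work element by element, `@` is the matrix product, and `axis=1` makes `np.concatenate` join columns instead of rows.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

print((x * y).tolist())   # element-wise: [[5, 12], [21, 32]]
print((x @ y).tolist())   # matrix product: [[19, 22], [43, 50]]
print((x * 10).tolist())  # scalar broadcast: [[10, 20], [30, 40]]

side_by_side = np.concatenate([x, y], axis=1)
print(side_by_side.tolist())  # [[1, 2, 5, 6], [3, 4, 7, 8]]
```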

### More info
https://numpy.org/doc/stable/user/absolute_beginners.html

### NumPy exercises

1. Write a NumPy program to manually create an array with the values 1, 7, 13, 105 and determine the size of the memory occupied by the array.
2. Write a NumPy program to create a vector with values ranging from 15 to 55 and print all values except the first and last.
3. Write a NumPy program to create a 3x4 matrix filled with values from 10 to 21.
4. Write a NumPy program to create a 2D array with 1 on the border and 0 inside.

In [None]:
# Ex: 1


In [None]:
# Ex: 2


In [None]:
# Ex: 3


In [None]:
# Ex: 4


## Pandas
https://pandas.pydata.org/

__pandas__ is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way toward this goal.

The two primary data structures of pandas, __Series__ (1-dimensional) and __DataFrame__ (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering.

### Cheat Sheets

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf


### Series

In [None]:
import pandas as pd
import numpy as np

# Create data from numpy
data = pd.Series(np.random.rand(10))
data

In [None]:
pd.set_option('display.precision', 3)
data.describe()

__Slicing__  
Indexing starts at zero.

In [None]:
data[:2]  # first and second

In [None]:
data[ data > 0.6 ] # logical slices

In [None]:
# a Series is backed by a NumPy array; this shows the element type
data.dtypes

In [None]:
# NumPy operations are allowed
np.mean(data)

In [None]:
data.mean()

### DataFrame
A DataFrame is a 2-dimensional labeled array. Its column types can be heterogeneous:
that is, of varying types. It is similar to structured arrays in NumPy with mutability
added. It has the following properties:
* Conceptually analogous to a table or spreadsheet of data.
* Similar to a NumPy ndarray but not a subclass of np.ndarray.
* Columns can be of heterogeneous types: float64, int, bool, and so on.
* A DataFrame column is a Series structure.
* It can be thought of as a dictionary of Series structures where both the
columns and the rows are indexed, denoted as 'index' in the case of rows
and 'columns' in the case of columns.
* It is size mutable: columns can be inserted and deleted.
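
The properties above can be sketched on a tiny frame. The data and the names `prices`, `price`, `stock` and `total` are made up for illustration only.

```python
import pandas as pd

# a DataFrame built as a dictionary of Series sharing the same index
prices = pd.DataFrame({
    'price': pd.Series([10.0, 20.0], index=['a', 'b']),
    'stock': pd.Series([5, 3], index=['a', 'b']),
})

# each column is a Series
print(type(prices['price']).__name__)  # Series

# size mutable: insert a derived column, then delete another
prices['total'] = prices['price'] * prices['stock']
del prices['stock']

print(list(prices.columns))      # ['price', 'total']
print(prices['total'].tolist())  # [50.0, 60.0]
```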

In [None]:
# Example creating a DF from dictionaries and series
stockSummaries = {
    'AMZN': pd.Series([346.15,0.59,459,0.52,589.8,158.88],
    index=['Closing price','EPS', 'Shares Outstanding(M)','Beta', 'P/E','Market Cap(B)']),
    'GOOG': pd.Series([1133.43,36.05,335.83,0.87,31.44,380.64],
    index=['Closing price','EPS','Shares Outstanding(M)', 'Beta','P/E','Market Cap(B)']),
}

stock = pd.DataFrame(stockSummaries)
stock

__Operations__
* Selection
* Assignment
* Deletion

In [None]:
# Selection
stock['AMZN']
stock.AMZN 

In [None]:
# Assignment
# use .loc[row, column]; chained indexing may assign to a temporary copy
stock.loc['Beta', 'AMZN'] = 0.6
stock

In [None]:
# Deletion
try:
    del stock['GOOG']
except KeyError:  # the column was already deleted
    pass

stock

### Importing data
https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html

* CSV
* SQL
* json
* excel
* (others)

#### Import CSV

__IPMA API__

https://api.ipma.pt/

Download the file from the following URL and upload it to your Jupyter/Colab/local data folder:

`http://api.ipma.pt/open-data/observation/climate/precipitation-total/leiria/mrrto-1006-caldas-da-rainha.csv`

__NOTE:__

On some systems it is possible to read the URL directly into pandas.

In [None]:
import pandas as pd

file = '../data/mrrto-1006-caldas-da-rainha.csv'

precipitation_data = pd.read_csv(file, index_col='date', parse_dates=['date'])
precipitation_data.head()
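
To see what `index_col` and `parse_dates` do without downloading anything, here is a minimal sketch that reads a tiny in-memory CSV. The values are made up, and the column names only mirror the ones used in this notebook (`date`, `minimum`, `maximum`, `mean`); the real file may contain more columns.

```python
import io

import pandas as pd

# made-up sample in the same shape as the columns used in this notebook
csv_text = """date,minimum,maximum,mean
2024-01-01,0.0,2.5,1.1
2024-01-02,1.0,4.0,2.3
"""

sample = pd.read_csv(io.StringIO(csv_text), index_col='date', parse_dates=['date'])

# the 'date' column became a DatetimeIndex instead of plain strings
print(type(sample.index).__name__)  # DatetimeIndex
print(sample.loc[pd.Timestamp('2024-01-02'), 'mean'])  # 2.3
```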


#### Importing from JSON

__Worldwide public holidays__

https://date.nager.at/

`https://date.nager.at/api/v3/publicholidays/2024/PT`

__NOTE__: If the direct import fails, download the file and load it from the local data folder.

In [None]:
url = "https://date.nager.at/api/v3/publicholidays/2024/PT"

In [None]:
import sys

public_holidays_data = None

if sys.platform != 'emscripten':
    public_holidays_data = pd.read_json(url)
else:
    import pyodide.http
    import json
    with pyodide.http.open_url(url) as f:
        public_holidays_data = pd.DataFrame(json.loads(f.getvalue()))

public_holidays_data

#### Importing from SQL

SimpleFolks for Simple SQL

[http://2016.padjo.org/files/data/starterpack/simplefolks.sqlite](http://2016.padjo.org/files/data/starterpack/simplefolks.sqlite)

Download the SQLite file to the local data folder.

Online SQLite viewer:

[https://inloop.github.io/sqlite-viewer/](https://inloop.github.io/sqlite-viewer/)

In [None]:
%pip install -q  sqlalchemy

In [None]:
sql_connection = 'sqlite:///../data/simplefolks.sqlite'

In [None]:
# VSCODE / Colab
sql_data = None

if sys.platform != 'emscripten':
    # SQL string literals use single quotes; double quotes are for identifiers
    sql_data = pd.read_sql("SELECT * FROM homes WHERE area='urban' ORDER BY value",
                con=sql_connection)
sql_data

In [None]:
# JupyterLite solution

from sqlalchemy import create_engine, text

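if sys.platform == 'emscripten':
    sql_connection = 'sqlite:///../data/simplefolks.sqlite'

    engine = create_engine(sql_connection)
    query = text("SELECT * FROM homes WHERE area='urban' ORDER BY value")
    # the context manager closes the connection when the block ends
    with engine.connect() as conn:
        result = conn.execute(query)
        sql_data = pd.DataFrame(result.fetchall(), columns=list(result.keys()))
sql_data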

### Label Indexing
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html

In addition to the standard indexing operator [] and attribute operator, there
are operators provided in pandas to make the job of indexing easier and more
convenient.
* `.loc` label-based operator (also accepts boolean masks)
* `.iloc` integer position-based operator
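
The difference between the two operators is easiest to see side by side. This is a small sketch on a made-up frame (`scores` and its values are illustrative): `.loc` selects by label and includes both ends of a slice, while `.iloc` selects by integer position and excludes the end.

```python
import pandas as pd

scores = pd.DataFrame({'score': [12.5, 9.0, 16.5]}, index=['a', 'b', 'c'])

# .loc selects by label; label slices include both endpoints
by_label = scores.loc['a':'b', 'score']
print(by_label.tolist())  # [12.5, 9.0]

# .iloc selects by integer position; the end position is excluded
by_position = scores.iloc[0:2, 0]
print(by_position.tolist())  # [12.5, 9.0]

# .loc also accepts a boolean mask
above_ten = scores.loc[scores['score'] > 10]
print(above_ten.index.tolist())  # ['a', 'c']
```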

In [None]:
from datetime import date, timedelta

# Using precipitation data from IPMA

some_days_ago = str(date.today() - timedelta(days=4))
print(some_days_ago)
precipitation_data.loc[some_days_ago]

In [None]:
# Boolean filters (AND & ) (OR | ) (NOT ~ )
precipitation_data.loc[ (precipitation_data['mean'] > 5) & (precipitation_data['minimum'] < 5) ]

__Integer position__ operator (`.iloc`)

In [None]:
# code
precipitation_data.iloc[:2]

In [None]:
# code
precipitation_data.iloc[:2,3] # first two rows, 4th column

### Groupby of data
https://pandas.pydata.org/docs/reference/groupby.html#groupby


In [None]:
# Using data from SQL
from sqlalchemy import create_engine, text

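engine = create_engine(sql_connection)
# the context manager closes the connection when the block ends
with engine.connect() as conn:
    result = conn.execute(text('SELECT area, value FROM homes'))
    sql_data = pd.DataFrame(result.fetchall(), columns=list(result.keys()))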
sql_data

In [None]:
# code
data_areas = sql_data.groupby('area')
data_areas.groups

In [None]:
# show size
data_areas.size()

### Aggregate method
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html

In [None]:
# aggregate code
sql_data.groupby('area').agg(['mean', 'std'])
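
When the result needs flat, readable column names, `agg` also accepts keyword arguments of the form `new_name=(column, function)`. The sketch below uses a made-up frame (`homes` and its values are assumptions that only imitate the `area` and `value` columns of the SQL data).

```python
import pandas as pd

# made-up data imitating the 'area' and 'value' columns
homes = pd.DataFrame({
    'area': ['urban', 'urban', 'rural'],
    'value': [100, 300, 50],
})

# named aggregation: each keyword becomes an output column
summary = homes.groupby('area').agg(
    mean_value=('value', 'mean'),
    max_value=('value', 'max'),
)

print(summary.loc['urban', 'mean_value'])  # 200.0
print(summary.loc['rural', 'max_value'])   # 50
```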

### Pivots and reshaping data
https://pandas.pydata.org/docs/user_guide/reshaping.html#pivot

This section shows how to reshape data. Data is sometimes stored in what is
known as the stacked (long) format, with one row per observation; a pivot turns
it into a wide table with one column per category.
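
As an illustration of the stacked format, the following sketch pivots a made-up long table into a wide one. The frame `stacked` and its values are assumptions for demonstration and do not come from the notebook's data sets.

```python
import pandas as pd

# stacked (long) format: one row per (date, variable) observation
stacked = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'variable': ['minimum', 'maximum', 'minimum', 'maximum'],
    'value': [0.0, 2.5, 1.0, 4.0],
})

# wide format: one row per date, one column per variable
wide = stacked.pivot(index='date', columns='variable', values='value')

print(list(wide.columns))                 # ['maximum', 'minimum']
print(wide.loc['2024-01-02', 'maximum'])  # 4.0
```

`pivot` requires each (index, column) pair to be unique; `pivot_table` handles duplicates by aggregating them.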

In [None]:
# Get all data from SQL
sql_data = pd.read_sql('SELECT * FROM homes', con=sql_connection)
sql_data

In [None]:
# pivot table
pd.pivot_table(sql_data, values='value', index='owner_name', aggfunc='sum')

### Value counts
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html#pandas.DataFrame.value_counts

Return a Series containing the frequency of each distinct row in the Dataframe.

In [None]:
# code
# normalize=True returns relative frequencies; * 100 turns them into percentages
sql_data['owner_name'].value_counts(normalize=True) * 100
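
A minimal sketch of the two modes of `value_counts` on a made-up Series (the name `owners` and its values are illustrative): absolute counts by default, relative frequencies with `normalize=True`.

```python
import pandas as pd

owners = pd.Series(['Ann', 'Bob', 'Ann', 'Ann'])

counts = owners.value_counts()                      # absolute counts
shares = owners.value_counts(normalize=True) * 100  # percentages

print(counts['Ann'])  # 3
print(shares['Bob'])  # 25.0
```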

### More info
https://pandas.pydata.org/docs/user_guide/10min.html#min

### Pandas exercises
1. Write a Pandas program to convert a dictionary to a Pandas series:

```python
d1 = {'a': 100, 'b': 200, 'c':300, 'd':400, 'e':800}
```
 
2. Write a Pandas program to create a `Series` of 100 uniform random numbers and compute its minimum, 25th percentile, median, 75th percentile, maximum, count and std (hint: `describe` the `Series`).

3. Write a Pandas program to select the 'name' and 'score' columns from the following dictionary

```python
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin'],
  'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8],
  'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2],
  'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no']}
  
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
```
4. Write a Pandas program to select the rows where number of attempts in the examination is less than 3 and score greater than 15. Use data from last exercise

5. Write a Pandas program to group by `attempts` and print the average `score`.

In [None]:
# Ex: 1


In [None]:
# Ex: 2


In [None]:
# Ex: 3


In [None]:
# Ex: 4


In [None]:
# Ex: 5


## Matplotlib
https://matplotlib.org/

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible.

### CheatSheets

https://matplotlib.org/cheatsheets/

In [None]:
from matplotlib import pyplot as plt

x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)

plt.plot(x, y, marker='.', alpha=0.5)
plt.title('sin func')
plt.xlabel('\u03C0')
plt.grid(linestyle='--')
plt.show()

### Bar plots and legends

In [None]:
fruits = ['apple', 'blueberry', 'cherry', 'orange']
counts = [40, 100, 30, 55]
bar_labels = ['red', 'blue', '_red', 'orange']
bar_colors = ['tab:red', 'tab:blue', 'tab:red', 'tab:orange']

plt.bar(fruits, counts, label=bar_labels, color=bar_colors)

plt.ylabel('fruit supply')
plt.title('Fruit supply by kind and color')
plt.legend(title='Fruit color')

plt.show()

### Sub plots and grids

In [None]:
# Fixing random state for reproducibility
np.random.seed(19680801)

dt = 0.01
t = np.arange(0, 30, dt)
nse1 = np.random.randn(len(t))   # white noise 1
nse2 = np.random.randn(len(t))   # white noise 2

# Two signals with a coherent part at 10 Hz and a random part
s1 = np.sin(2 * np.pi * 10 * t) + nse1
s2 = np.sin(2 * np.pi * 10 * t) + nse2

fig, axs = plt.subplots(2, 1)
axs[0].plot(t, s1, t, s2)
axs[0].set_xlim(0, 2)
axs[0].set_xlabel('Time')
axs[0].set_ylabel('s1 and s2')
axs[0].grid(True)

axs[1].cohere(s1, s2, 256, 1. / dt)
axs[1].set_ylabel('Coherence')
fig.tight_layout()

In [None]:
# example 2

x1 = np.random.randn(1_000)  # normal
x2 = np.random.rand(1_000)  # uniform

# rows=2, columns=2
plt.subplot(221)
plt.plot(x1)
plt.title('Normal distribution')

plt.subplot(222)  
plt.plot(x2, color='red')
plt.title('Uniform distribution')

plt.subplot(223)
plt.hist(x1)

plt.subplot(224)
plt.hist(x2, color='red')

plt.show()

### Confidence bands

In [None]:
x = np.linspace(0, 10, 11)
y = [3.9, 4.4, 10.8, 10.3, 11.2, 13.1, 14.1,  9.9, 13.9, 15.1, 12.5]

# fit a linear curve and estimate its y-values and their error.
a, b = np.polyfit(x, y, deg=1)
y_est = a * x + b
y_err = x.std() * np.sqrt(1/len(x) +
                          (x - x.mean())**2 / np.sum((x - x.mean())**2))

fig, ax = plt.subplots()
ax.plot(x, y_est, 'b-')
ax.fill_between(x, y_est - y_err, y_est + y_err, alpha=0.2)
ax.plot(x, y, 'o')
ax.grid(linestyle='--')

plt.show()

### Plot images

https://en.wikipedia.org/wiki/Grace_Hopper

In [None]:
import matplotlib.patches as patches
import matplotlib.cbook as cbook

with cbook.get_sample_data('grace_hopper.jpg') as image_file:
    image = plt.imread(image_file)

fig, ax = plt.subplots()
im = ax.imshow(image)
patch = patches.Circle((260, 200), radius=200, transform=ax.transData)
im.set_clip_path(patch)

ax.axis('off')
plt.show()

### Pandas integration

In [None]:
# Data from IPMA 

precipitation_data['mean'].plot(style='go--',
                  figsize=(12,4), 
                  grid=True, 
                  title='Precipitation', 
                  ylabel='mm')
plt.show()

### Exercises
https://api.ipma.pt/open-data/observation/climate/precipitation-total/

Get __Precipitação total diária por concelho (formato CSV)__ (total daily precipitation by municipality, CSV format) for your town.

1. Plot `mean` over time for your city with a `figsize` of 16x4
2. Plot `minimum` and `maximum` over time with a legend
3. Create two boxplots for the `minimum` and `maximum` precipitation values of the last month
4. Aggregate the `mean` precipitation per month with the mean function and create a bar plot

In [None]:
# Ex: 1


In [None]:
# Ex: 2


In [None]:
# Ex: 3


In [None]:
# Ex: 4


## Are you finished?
![](https://www.thesquirepresents.co.uk/wp-content/uploads/2020/10/Smashing-Pumpkins-1990-Album.jpg)

__Solutions__

__numpy__

    # Ex: 1
    np.array([1,7,13,105], dtype=np.uint8).nbytes

    # Ex: 2
    np.arange(15,56, dtype=np.uint8)[1:-1]

    # Ex: 3
    np.arange(10,22).reshape(3, 4)

    # Ex: 4
    dim = (5,5)
    matrix = np.ones(dim, dtype=np.uint8)
    matrix[1:-1,1:-1] = 0
    matrix

    
__pandas__

    # Ex: 1
    d1 = {'a': 100, 'b': 200, 'c':300, 'd':400, 'e':800}
    pd.Series(d1, dtype=pd.Int32Dtype())

    # Ex: 2
    x = pd.Series(np.random.randint(1,101, 100))
    x.describe()

    # Ex: 3
    exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin'],
      'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8],
      'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2],
      'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no']}
    labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
    x = pd.DataFrame(exam_data, index=labels)
    x[['name', 'score']]  # a list inside the brackets selects several columns

    # Ex: 4
    x.loc[ (x.attempts < 3) & (x.score > 15) ]

    # Ex: 5
    x.groupby('attempts')['score'].mean()

__matplotlib__


    # Ex: 1
    url = 'http://api.ipma.pt/open-data/observation/climate/' \
          'precipitation-total/leiria/mrrto-1006-caldas-da-rainha.csv'
    data = pd.read_csv(url, index_col='date', parse_dates=['date'])
    data['mean'].plot(figsize=(16,4));

    # Ex: 2
    data[['minimum','maximum']].plot();

    # Ex: 3
    df = data.loc[ data.index >= '2024-05-01']
    df[['minimum','maximum']].plot(kind='box', vert=False, 
                                  showfliers=False, patch_artist=True, 
                                  figsize=(14,3), xlabel='mm', 
                                  title='Precipitation (last month)');
    # Ex: 4
    data.groupby(data.index.strftime('%B'))['mean'].mean().plot(kind='bar');