# Python Introduction

An introductory notebook with a quick recap of the Pandas features used for data manipulation.

Links and useful material:
- Pandas documentation: https://pandas.pydata.org/docs/user_guide/index.html#user-guide
- Numpy documentation: https://numpy.org/devdocs/user/index.html
- Pandas cheatsheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- Numpy cheatsheet: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf
- Useful Python Tips and notes: https://chrisalbon.com/
- Useful Book (Python Data Science Handbook): https://jakevdp.github.io/PythonDataScienceHandbook/

# Google Colab

For this lesson we will use Google Colab as our main tool.

Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser, with:

- Zero configuration required
- Free access to GPUs
- Easy sharing

Whether you're a student, a data scientist or an AI researcher, Colab can make your work easier.


Official Google Notebook with documentation: https://colab.research.google.com/notebooks/intro.ipynb#

## Working with Data on Colab

There are two ways to make your data available in Google Colab:
- Mounting a Google Drive folder (the harder way)
- Uploading your data directly (the simpler way)


### Mounting a folder 
To mount a folder you must go through Google Drive: your Drive is mounted as a virtual folder inside the Colab runtime.
To do this, click the folder icon in the vertical bar on the left, wait a few moments, then click the "Mount Drive" icon.

You will then be asked to run the cell below: press the Play button on its left.

Grant permission to access your Drive, copy the code shown on the page, paste it into the cell you just ran, and press Enter.

Your Google Drive is now mounted inside this notebook.

You need to repeat this operation every time you open a new notebook.

To import a file, navigate to it using the menu on the left, right-click it and choose "Copy path".

You can then paste the file path into your code; as we will see later, this is a convenient way to access the file.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

ModuleNotFoundError: No module named 'google'
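The `google.colab` package exists only inside a Colab runtime, which is why the cell above fails with `ModuleNotFoundError` when the notebook is run elsewhere. A minimal sketch of a guard follows; the helper name `running_in_colab` is an illustrative assumption, not part of any library.

```python
import importlib.util


def running_in_colab() -> bool:
    """Return True when the google.colab package is importable."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # Raised when the parent package `google` is missing entirely
        return False


if running_in_colab():
    from google.colab import drive
    drive.mount('/content/drive')
else:
    print("Not running in Colab: skipping the Drive mount.")
```

The same check can wrap the upload and download cells, so the notebook also runs on a local Jupyter installation.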

### Upload your data directly

An easier way to use your data is to upload it directly from your computer.

There are two ways to do this:
1. Run the following code to open the file upload dialog
2. Click the folder icon in the left sidebar of Google Colab and drag the files in, or right-click and choose "Upload"

We will use this approach when we look at the data and the Pandas part.

In [4]:
from google.colab import files
uploaded = files.upload()

ModuleNotFoundError: No module named 'google'

To download a file from the Google Colab workspace to your computer, run the following code:

In [5]:
from google.colab import files
files.download('sample.csv')

ModuleNotFoundError: No module named 'google'

## Libraries

Colab offers a complete environment with the most common libraries already installed in the runtime.

You can still install additional libraries with this simple command:

`!pip install <library_name>`

In [1]:
#Example of installation
!pip install numpy
!pip install pandas
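Because behaviour can differ between library versions, it may help to check which versions the runtime already provides before installing anything. A small sketch:

```python
import numpy as np
import pandas as pd

# Print the versions available in the current runtime
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
```

To pin a specific version you can use the standard pip syntax, `!pip install pandas==<version>`.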



# Pandas Review

Now we can start the review of the Pandas library, covering its most common and useful functions.

First of all, always remember to import the libraries:

In [6]:
import pandas as pd
import numpy as np
import string

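In notebooks it is also possible to configure some settings using `magic commands`.

Info about autoreload, which reloads imported modules before each cell runs so that edits to your own `.py` files are picked up without restarting the runtime: https://www.wrighters.io/using-autoreload-to-speed-up-ipython-and-jupyter-work/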

In [7]:
%load_ext autoreload
%autoreload 2

## Data Frame

Pandas introduces two main data structures:
- `DataFrame`: a "relational" table with rows, columns and indexes
- `Series`: a single column; it is similar to a list, but richer

A `DataFrame` is therefore composed of multiple `Series`.

Let's do some examples:

In [8]:
#Series example
city = pd.Series(['Milano', 'Roma', 'Napoli'])
population = pd.Series([1398715, 2790712, 3033256])

#Create Dataframe from series
df = pd.DataFrame({ 'City Name': city, 'Population': population })
df

Unnamed: 0,City Name,Population
0,Milano,1398715
1,Roma,2790712
2,Napoli,3033256


In more detail, a `DataFrame` behaves like a dictionary of `Series` that share the same index: each key is a column name and each value is a column.

In addition, a `DataFrame` has labels (indexes) on both its rows and its columns.

It is always possible to convert a `DataFrame` into a dictionary and vice versa.

Furthermore, a `Series` can also be converted to a list.

In [9]:
#Define a dictionary
my_dict = {'City':['Milano', 'Roma', 'Napoli'], 'Population':[1398715, 2790712, 3033256]}
print(f"\nDictionary: \n {my_dict}")

#Create a dataframe from a dictionary
city_df = pd.DataFrame(my_dict)
print(f"\nDataframe: \n {city_df}")

#Create a dictionary from a dataframe
old_dict = city_df.to_dict()
print(f"\nDictionary obtained from dataframe: \n {old_dict}") # Note: each column becomes a nested {index: value} dictionary, not a list


Dictionary: 
 {'City': ['Milano', 'Roma', 'Napoli'], 'Population': [1398715, 2790712, 3033256]}

Dataframe: 
      City  Population
0  Milano     1398715
1    Roma     2790712
2  Napoli     3033256

Dictionary obtained from dataframe: 
 {'City': {0: 'Milano', 1: 'Roma', 2: 'Napoli'}, 'Population': {0: 1398715, 1: 2790712, 2: 3033256}}
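The nested `{index: value}` result above is only the default shape. As a short sketch, the `orient` parameter of `to_dict` selects other shapes, and `tolist` converts a single `Series` to a plain list; the variable names below are illustrative.

```python
import pandas as pd

city_df = pd.DataFrame({
    'City': ['Milano', 'Roma', 'Napoli'],
    'Population': [1398715, 2790712, 3033256],
})

# orient='list' gives back the {column: [values]} shape we started from
as_lists = city_df.to_dict(orient='list')

# orient='records' gives one dictionary per row
as_records = city_df.to_dict(orient='records')

# A single column (Series) converted to a plain Python list
cities = city_df['City'].tolist()

print(as_lists)
print(as_records[0])
print(cities)
```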


In [9]:
#Define a test dataframe
df = pd.DataFrame({
  'id': range(5),
  'heigth': [1.7, 1.7, 1.8, 1.9, 2.0],
  'weight': [70 , 73 , 80 , 100, 95]
})
df

Unnamed: 0,id,heigth,weight
0,0,1.7,70
1,1,1.7,73
2,2,1.8,80
3,3,1.9,100
4,4,2.0,95


In [None]:
#Create a new column
df['bmi'] = df.weight / (df.heigth)**2
df

Unnamed: 0,id,heigth,weight,bmi
0,0,1.7,70,24.221453
1,1,1.7,73,25.259516
2,2,1.8,80,24.691358
3,3,1.9,100,27.700831
4,4,2.0,95,23.75


In [None]:
# https://www.geeksforgeeks.org/python-change-column-names-and-row-indexes-in-pandas-dataframe/?ref=rp

# Extract the column names into a variable
idx = df.columns

# Manipulate the names
new_colnames = idx.str.upper()
new_colnames

# To apply the new names, assign `new_colnames` back to the columns:
# df.columns = new_colnames

Index(['ID', 'HEIGTH', 'WEIGHT', 'BMI'], dtype='object')

Rename columns with a function. The `rename` method does not mutate the current data frame: it returns a modified copy.

In [None]:
# Rename every column by applying a function to its name
tb_2 = df.rename(columns = lambda x: x.title())
tb_2

Unnamed: 0,Id,Heigth,Weight,Bmi
0,0,1.7,70,24.221453
1,1,1.7,73,25.259516
2,2,1.8,80,24.691358
3,3,1.9,100,27.700831
4,4,2.0,95,23.75
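To make the "does not mutate" behaviour concrete, here is a minimal sketch on a small made-up table (the `df_demo` name and its values are assumptions for illustration). It also shows that `rename` accepts a dictionary of old and new names instead of a function.

```python
import pandas as pd

df_demo = pd.DataFrame({'id': [0, 1], 'weight': [70, 73]})

# rename returns a new DataFrame; df_demo keeps its original column names
renamed = df_demo.rename(columns={'weight': 'weight_kg'})

print(list(df_demo.columns))   # original names, unchanged
print(list(renamed.columns))   # new names
```

To keep the change, assign the result to a variable, as done with `tb_2` above.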


Select and reorder columns:

In [None]:
tb_3 = df.reindex(['id', 'bmi', 'weight'], axis=1)
tb_3

Unnamed: 0,id,bmi,weight
0,0,24.221453,70
1,1,25.259516,73
2,2,24.691358,80
3,3,27.700831,100
4,4,23.75,95


The same concepts can be applied to row names:

In [None]:
# row names
df.index

RangeIndex(start=0, stop=5, step=1)

In [None]:
# rename the rows
df.rename(index = lambda x: "row-" + str(x))

Unnamed: 0,id,heigth,weight,bmi
row-0,0,1.7,70,24.221453
row-1,1,1.7,73,25.259516
row-2,2,1.8,80,24.691358
row-3,3,1.9,100,27.700831
row-4,4,2.0,95,23.75


Select rows by their label:

In [None]:
# Select rows by their label
df.reindex([1, 3, 4], axis=0)

Unnamed: 0,id,heigth,weight,bmi
1,1,1.7,73,25.259516
3,3,1.9,100,27.700831
4,4,2.0,95,23.75


## Data Import

### Read a csv file

A CSV (Comma-Separated Values) file stores one record per line, with fields separated by commas. Here the first row contains the column names:

In [15]:
%%bash
pwd
head ../dataset/rome-airbnb/dc-wikia-data.csv

/home/daco/dev/MaterialeLezioni/PercorsoDati/lab1
page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,YEAR
1422,Batman (Bruce Wayne),\/wiki\/Batman_(Bruce_Wayne),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3093,"1939, May",1939
23387,Superman (Clark Kent),\/wiki\/Superman_(Clark_Kent),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2496,"1986, October",1986
1458,Green Lantern (Hal Jordan),\/wiki\/Green_Lantern_(Hal_Jordan),Secret Identity,Good Characters,Brown Eyes,Brown Hair,Male Characters,,Living Characters,1565,"1959, October",1959
1659,James Gordon (New Earth),\/wiki\/James_Gordon_(New_Earth),Public Identity,Good Characters,Brown Eyes,White Hair,Male Characters,,Living Characters,1316,"1987, February",1987
1576,Richard Grayson (New Earth),\/wiki\/Richard_Grayson_(New_Earth),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,1237,"19

Read a csv with a standard separator (`,`):

In [16]:
dc = pd.read_csv('../dataset/rome-airbnb/dc-wikia-data.csv')
dc

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,YEAR
0,1422,Batman (Bruce Wayne),\/wiki\/Batman_(Bruce_Wayne),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3093.0,"1939, May",1939.0
1,23387,Superman (Clark Kent),\/wiki\/Superman_(Clark_Kent),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2496.0,"1986, October",1986.0
2,1458,Green Lantern (Hal Jordan),\/wiki\/Green_Lantern_(Hal_Jordan),Secret Identity,Good Characters,Brown Eyes,Brown Hair,Male Characters,,Living Characters,1565.0,"1959, October",1959.0
3,1659,James Gordon (New Earth),\/wiki\/James_Gordon_(New_Earth),Public Identity,Good Characters,Brown Eyes,White Hair,Male Characters,,Living Characters,1316.0,"1987, February",1987.0
4,1576,Richard Grayson (New Earth),\/wiki\/Richard_Grayson_(New_Earth),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,1237.0,"1940, April",1940.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6891,66302,Nadine West (New Earth),\/wiki\/Nadine_West_(New_Earth),Public Identity,Good Characters,,,Female Characters,,Living Characters,,,
6892,283475,Warren Harding (New Earth),\/wiki\/Warren_Harding_(New_Earth),Public Identity,Good Characters,,,Male Characters,,Living Characters,,,
6893,283478,William Harrison (New Earth),\/wiki\/William_Harrison_(New_Earth),Public Identity,Good Characters,,,Male Characters,,Living Characters,,,
6894,283471,William McKinley (New Earth),\/wiki\/William_McKinley_(New_Earth),Public Identity,Good Characters,,,Male Characters,,Living Characters,,,


Or specifying the separator explicitly:

In [18]:
dc = pd.read_table('../dataset/rome-airbnb/dc-wikia-data.csv', sep=',')
dc

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,YEAR
0,1422,Batman (Bruce Wayne),\/wiki\/Batman_(Bruce_Wayne),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3093.0,"1939, May",1939.0
1,23387,Superman (Clark Kent),\/wiki\/Superman_(Clark_Kent),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2496.0,"1986, October",1986.0
2,1458,Green Lantern (Hal Jordan),\/wiki\/Green_Lantern_(Hal_Jordan),Secret Identity,Good Characters,Brown Eyes,Brown Hair,Male Characters,,Living Characters,1565.0,"1959, October",1959.0
3,1659,James Gordon (New Earth),\/wiki\/James_Gordon_(New_Earth),Public Identity,Good Characters,Brown Eyes,White Hair,Male Characters,,Living Characters,1316.0,"1987, February",1987.0
4,1576,Richard Grayson (New Earth),\/wiki\/Richard_Grayson_(New_Earth),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,1237.0,"1940, April",1940.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6891,66302,Nadine West (New Earth),\/wiki\/Nadine_West_(New_Earth),Public Identity,Good Characters,,,Female Characters,,Living Characters,,,
6892,283475,Warren Harding (New Earth),\/wiki\/Warren_Harding_(New_Earth),Public Identity,Good Characters,,,Male Characters,,Living Characters,,,
6893,283478,William Harrison (New Earth),\/wiki\/William_Harrison_(New_Earth),Public Identity,Good Characters,,,Male Characters,,Living Characters,,,
6894,283471,William McKinley (New Earth),\/wiki\/William_McKinley_(New_Earth),Public Identity,Good Characters,,,Male Characters,,Living Characters,,,


Specifying column types:

In [19]:
pd.read_table(
  '../dataset/rome-airbnb/dc-wikia-data.csv', 
  sep=',',
  dtype={
      'page_id': 'int', 
      'name': 'string',
      # ...
    }
  )

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,YEAR
0,1422,Batman (Bruce Wayne),\/wiki\/Batman_(Bruce_Wayne),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3093.0,"1939, May",1939.0
1,23387,Superman (Clark Kent),\/wiki\/Superman_(Clark_Kent),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2496.0,"1986, October",1986.0
2,1458,Green Lantern (Hal Jordan),\/wiki\/Green_Lantern_(Hal_Jordan),Secret Identity,Good Characters,Brown Eyes,Brown Hair,Male Characters,,Living Characters,1565.0,"1959, October",1959.0
3,1659,James Gordon (New Earth),\/wiki\/James_Gordon_(New_Earth),Public Identity,Good Characters,Brown Eyes,White Hair,Male Characters,,Living Characters,1316.0,"1987, February",1987.0
4,1576,Richard Grayson (New Earth),\/wiki\/Richard_Grayson_(New_Earth),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,1237.0,"1940, April",1940.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6891,66302,Nadine West (New Earth),\/wiki\/Nadine_West_(New_Earth),Public Identity,Good Characters,,,Female Characters,,Living Characters,,,
6892,283475,Warren Harding (New Earth),\/wiki\/Warren_Harding_(New_Earth),Public Identity,Good Characters,,,Male Characters,,Living Characters,,,
6893,283478,William Harrison (New Earth),\/wiki\/William_Harrison_(New_Earth),Public Identity,Good Characters,,,Male Characters,,Living Characters,,,
6894,283471,William McKinley (New Earth),\/wiki\/William_McKinley_(New_Earth),Public Identity,Good Characters,,,Male Characters,,Living Characters,,,
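The sketch below combines a non-default separator with explicit column types on a tiny in-memory table, so it can be run without the dataset file. The data is a made-up stand-in, and the nullable `'Int64'` type is one possible choice for integer columns with missing values, which otherwise load as floats (as `year` does in the output above).

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for a file on disk
raw = io.StringIO(
    "page_id;name;year\n"
    "1422;Batman (Bruce Wayne);1939\n"
    "66302;Nadine West (New Earth);\n"
)

demo = pd.read_csv(
    raw,
    sep=';',                 # non-default separator
    dtype={
        'page_id': 'int64',
        'name': 'string',
        'year': 'Int64',     # nullable integer: stays integer despite the missing value
    },
)
print(demo.dtypes)
print(demo)
```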


Set all column names lowercase:

In [20]:
dc = dc.rename(columns = lambda x: x.lower())
dc

Unnamed: 0,page_id,name,urlslug,id,align,eye,hair,sex,gsm,alive,appearances,first appearance,year
0,1422,Batman (Bruce Wayne),\/wiki\/Batman_(Bruce_Wayne),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3093.0,"1939, May",1939.0
1,23387,Superman (Clark Kent),\/wiki\/Superman_(Clark_Kent),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2496.0,"1986, October",1986.0
2,1458,Green Lantern (Hal Jordan),\/wiki\/Green_Lantern_(Hal_Jordan),Secret Identity,Good Characters,Brown Eyes,Brown Hair,Male Characters,,Living Characters,1565.0,"1959, October",1959.0
3,1659,James Gordon (New Earth),\/wiki\/James_Gordon_(New_Earth),Public Identity,Good Characters,Brown Eyes,White Hair,Male Characters,,Living Characters,1316.0,"1987, February",1987.0
4,1576,Richard Grayson (New Earth),\/wiki\/Richard_Grayson_(New_Earth),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,1237.0,"1940, April",1940.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6891,66302,Nadine West (New Earth),\/wiki\/Nadine_West_(New_Earth),Public Identity,Good Characters,,,Female Characters,,Living Characters,,,
6892,283475,Warren Harding (New Earth),\/wiki\/Warren_Harding_(New_Earth),Public Identity,Good Characters,,,Male Characters,,Living Characters,,,
6893,283478,William Harrison (New Earth),\/wiki\/William_Harrison_(New_Earth),Public Identity,Good Characters,,,Male Characters,,Living Characters,,,
6894,283471,William McKinley (New Earth),\/wiki\/William_McKinley_(New_Earth),Public Identity,Good Characters,,,Male Characters,,Living Characters,,,


## Basic operations

### Simple column selection

In [21]:
# select columns: name, appearances, sex
dc_2 = dc[['name', 'appearances', 'sex']]
dc_2

Unnamed: 0,name,appearances,sex
0,Batman (Bruce Wayne),3093.0,Male Characters
1,Superman (Clark Kent),2496.0,Male Characters
2,Green Lantern (Hal Jordan),1565.0,Male Characters
3,James Gordon (New Earth),1316.0,Male Characters
4,Richard Grayson (New Earth),1237.0,Male Characters
...,...,...,...
6891,Nadine West (New Earth),,Female Characters
6892,Warren Harding (New Earth),,Male Characters
6893,William Harrison (New Earth),,Male Characters
6894,William McKinley (New Earth),,Male Characters


### Simple row filter

In [22]:
# get the "Female Characters"
dc_2_female = dc_2[dc_2.sex == "Female Characters"]
dc_2_female

Unnamed: 0,name,appearances,sex
5,Wonder Woman (Diana Prince),1231.0,Female Characters
8,Dinah Laurel Lance (New Earth),1075.0,Female Characters
10,GenderTest,1028.0,Female Characters
12,Barbara Gordon (New Earth),951.0,Female Characters
14,Lois Lane (New Earth),934.0,Female Characters
...,...,...,...
6878,Dorothea Tane (New Earth),,Female Characters
6881,Doris Zuel (New Earth),,Female Characters
6882,Doris Lee (New Earth),,Female Characters
6885,Catwoman (Selina Kyle),,Female Characters


### Pipeline

Two operations in one pipeline:

In [23]:
dc[['name', 'appearances', 'sex']][dc.sex == "Female Characters"]

Unnamed: 0,name,appearances,sex
5,Wonder Woman (Diana Prince),1231.0,Female Characters
8,Dinah Laurel Lance (New Earth),1075.0,Female Characters
10,GenderTest,1028.0,Female Characters
12,Barbara Gordon (New Earth),951.0,Female Characters
14,Lois Lane (New Earth),934.0,Female Characters
...,...,...,...
6878,Dorothea Tane (New Earth),,Female Characters
6881,Doris Zuel (New Earth),,Female Characters
6882,Doris Lee (New Earth),,Female Characters
6885,Catwoman (Selina Kyle),,Female Characters
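The chained form above applies two indexing steps one after the other. As an alternative sketch, `.loc[rows, columns]` expresses the row condition and the column list in a single step; the small `demo` table below is made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'appearances': [10.0, 20.0, 30.0],
    'sex': ['Female Characters', 'Male Characters', 'Female Characters'],
    'page_id': [1, 2, 3],
})

# Row condition and column list in a single indexing step
is_female = demo['sex'] == 'Female Characters'
females = demo.loc[is_female, ['name', 'appearances', 'sex']]
print(females)
```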


## Column Selections

#### Basic selections 

In [28]:
# select a range (immutable)
dc.loc[:,'name':'year'] 

# deselect (immutable)
dc.drop(columns=['page_id', 'urlslug'])
# or (immutable)
dc.drop(['page_id', 'urlslug'], axis=1)

Unnamed: 0,name,id,align,eye,hair,sex,gsm,alive,appearances,first appearance,year
0,Batman (Bruce Wayne),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3093.0,"1939, May",1939.0
1,Superman (Clark Kent),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2496.0,"1986, October",1986.0
2,Green Lantern (Hal Jordan),Secret Identity,Good Characters,Brown Eyes,Brown Hair,Male Characters,,Living Characters,1565.0,"1959, October",1959.0
3,James Gordon (New Earth),Public Identity,Good Characters,Brown Eyes,White Hair,Male Characters,,Living Characters,1316.0,"1987, February",1987.0
4,Richard Grayson (New Earth),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,1237.0,"1940, April",1940.0
...,...,...,...,...,...,...,...,...,...,...,...
6891,Nadine West (New Earth),Public Identity,Good Characters,,,Female Characters,,Living Characters,,,
6892,Warren Harding (New Earth),Public Identity,Good Characters,,,Male Characters,,Living Characters,,,
6893,William Harrison (New Earth),Public Identity,Good Characters,,,Male Characters,,Living Characters,,,
6894,William McKinley (New Earth),Public Identity,Good Characters,,,Male Characters,,Living Characters,,,


In [29]:
# rename 'id' to 'secret_id' (rename does not mutate dc because inplace=False; dc changes only because of the assignment)
dc = dc.rename(columns={'id': 'secret_id'}, inplace=False)
dc[['name', 'secret_id']]

Unnamed: 0,name,secret_id
0,Batman (Bruce Wayne),Secret Identity
1,Superman (Clark Kent),Secret Identity
2,Green Lantern (Hal Jordan),Secret Identity
3,James Gordon (New Earth),Public Identity
4,Richard Grayson (New Earth),Secret Identity
...,...,...
6891,Nadine West (New Earth),Public Identity
6892,Warren Harding (New Earth),Public Identity
6893,William Harrison (New Earth),Public Identity
6894,William McKinley (New Earth),Public Identity


NB: `inplace=False` (the default) means that the data frame itself is not changed; the changes appear only in the returned result (immutable method).

#### Name matching (filter)

Select matching columns with [filter](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.filter.html)

In [None]:
# all column names
dc.columns

Index(['page_id', 'name', 'urlslug', 'secret_id', 'align', 'eye', 'hair',
       'sex', 'gsm', 'alive', 'appearances', 'first appearance', 'year'],
      dtype='object')

In [30]:
# select columns by name
dc.filter(items=['name', 'first appearance', 'appearances'])

Unnamed: 0,name,first appearance,appearances
0,Batman (Bruce Wayne),"1939, May",3093.0
1,Superman (Clark Kent),"1986, October",2496.0
2,Green Lantern (Hal Jordan),"1959, October",1565.0
3,James Gordon (New Earth),"1987, February",1316.0
4,Richard Grayson (New Earth),"1940, April",1237.0
...,...,...,...
6891,Nadine West (New Earth),,
6892,Warren Harding (New Earth),,
6893,William Harrison (New Earth),,
6894,William McKinley (New Earth),,


In [31]:
# select columns containing 'appearance' as sub-string
dc.filter(like='appearance', axis=1)

Unnamed: 0,appearances,first appearance
0,3093.0,"1939, May"
1,2496.0,"1986, October"
2,1565.0,"1959, October"
3,1316.0,"1987, February"
4,1237.0,"1940, April"
...,...,...
6891,,
6892,,
6893,,
6894,,


Use [Regular-expressions](https://en.wikipedia.org/wiki/Regular_expression) (regex)

In [32]:
# begin with a specific string
dc.filter(regex='^appearance', axis=1)

Unnamed: 0,appearances
0,3093.0
1,2496.0
2,1565.0
3,1316.0
4,1237.0
...,...
6891,
6892,
6893,
6894,


In [None]:
# end with a specific string
dc.filter(regex='appearance$', axis=1)

Unnamed: 0,first appearance
0,"1939, May"
1,"1986, October"
2,"1959, October"
3,"1987, February"
4,"1940, April"
...,...
6891,
6892,
6893,
6894,


NB: in a regular expression the `^` and `$` signs mean the beginning and the end of the string, respectively.


In [None]:
# Any pattern you can create with a regex
dc.filter(regex='s.+_id$', axis=1)

Unnamed: 0,secret_id
0,Secret Identity
1,Secret Identity
2,Secret Identity
3,Public Identity
4,Secret Identity
...,...
6891,Public Identity
6892,Public Identity
6893,Public Identity
6894,Public Identity
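According to the pandas documentation, `filter(regex=...)` keeps the labels for which `re.search` finds a match, so a pattern without `^` or `$` can match anywhere in the name. The sketch below uses an empty table, since only the column names matter; the last pattern is an illustrative assumption.

```python
import pandas as pd

# Empty frame: only the column names matter here
demo = pd.DataFrame(columns=['page_id', 'secret_id', 'appearances', 'first appearance'])

# Unanchored pattern: matches anywhere in the name
anywhere = list(demo.filter(regex='appearance', axis=1).columns)

# Anchored at the start / at the end of the name
starts = list(demo.filter(regex='^appearance', axis=1).columns)
ends = list(demo.filter(regex='appearance$', axis=1).columns)

# Alternation: names ending in either '_id' or 's'
id_or_plural = list(demo.filter(regex='(_id|s)$', axis=1).columns)

print(anywhere, starts, ends, id_or_plural)
```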


#### Select by type

In [None]:
# Select all numeric fields
dc.select_dtypes(np.number).head()

# Select all real fields
dc.select_dtypes('float64').head()

# Select all integer fields
dc.select_dtypes('int64').head()

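# Exclude columns of the pandas 'string' dtype
# (text columns loaded as 'object', as in the output shown here, are kept; use exclude='object' to drop them)
dc.select_dtypes(exclude='string').head()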

Unnamed: 0,page_id,name,urlslug,secret_id,align,eye,hair,sex,gsm,alive,appearances,first appearance,year
0,1422,Batman (Bruce Wayne),\/wiki\/Batman_(Bruce_Wayne),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3093.0,"1939, May",1939.0
1,23387,Superman (Clark Kent),\/wiki\/Superman_(Clark_Kent),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2496.0,"1986, October",1986.0
2,1458,Green Lantern (Hal Jordan),\/wiki\/Green_Lantern_(Hal_Jordan),Secret Identity,Good Characters,Brown Eyes,Brown Hair,Male Characters,,Living Characters,1565.0,"1959, October",1959.0
3,1659,James Gordon (New Earth),\/wiki\/James_Gordon_(New_Earth),Public Identity,Good Characters,Brown Eyes,White Hair,Male Characters,,Living Characters,1316.0,"1987, February",1987.0
4,1576,Richard Grayson (New Earth),\/wiki\/Richard_Grayson_(New_Earth),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,1237.0,"1940, April",1940.0
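Only the result of the last expression in a cell is displayed, so the output above does not show what each selection returns. The sketch below runs the selections on a small made-up table with explicitly declared types and prints the column names each one keeps.

```python
import pandas as pd

demo = pd.DataFrame({
    'page_id': pd.Series([1422, 23387], dtype='int64'),
    'name': pd.Series(['Batman', 'Superman'], dtype='object'),
    'year': pd.Series([1939.0, 1986.0], dtype='float64'),
})

numeric = demo.select_dtypes(include='number')     # int64 and float64 columns
no_object = demo.select_dtypes(exclude='object')   # drops the text column
only_float = demo.select_dtypes(include='float64')

print(list(numeric.columns), list(no_object.columns), list(only_float.columns))
```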


### Unique values

In [34]:
# Unique values of a column
dc.sex.unique()

array(['Male Characters', 'Female Characters', nan,
       'Genderless Characters', 'Transgender Characters'], dtype=object)

In [35]:
# Count of those unique values
dc.sex.nunique()

4

In [33]:
dc.sex.nunique(dropna = False)

5
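The difference between the two counts above is the missing value. As a complementary sketch on a small made-up `Series`, `value_counts` shows how many rows fall under each distinct value, optionally including `NaN`.

```python
import numpy as np
import pandas as pd

sex = pd.Series(['Male', 'Female', 'Male', np.nan], dtype='object')

print(sex.nunique())               # distinct values, NaN ignored
print(sex.nunique(dropna=False))   # NaN counted as one more value

# value_counts shows how many rows fall under each value
counts = sex.value_counts(dropna=False)
print(counts)
```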

## Row Filter


In [37]:
dc_simple = dc[['name', 'appearances', 'sex', 'hair']]
dc_simple

Unnamed: 0,name,appearances,sex,hair
0,Batman (Bruce Wayne),3093.0,Male Characters,Black Hair
1,Superman (Clark Kent),2496.0,Male Characters,Black Hair
2,Green Lantern (Hal Jordan),1565.0,Male Characters,Brown Hair
3,James Gordon (New Earth),1316.0,Male Characters,White Hair
4,Richard Grayson (New Earth),1237.0,Male Characters,Black Hair
...,...,...,...,...
6891,Nadine West (New Earth),,Female Characters,
6892,Warren Harding (New Earth),,Male Characters,
6893,William Harrison (New Earth),,Male Characters,
6894,William McKinley (New Earth),,Male Characters,


Simple filter:

In [38]:
# get the "Female Characters"
dc_simple[dc_simple.sex == 'Female Characters'].head()

Unnamed: 0,name,appearances,sex,hair
5,Wonder Woman (Diana Prince),1231.0,Female Characters,Black Hair
8,Dinah Laurel Lance (New Earth),1075.0,Female Characters,Blond Hair
10,GenderTest,1028.0,Female Characters,Blond Hair
12,Barbara Gordon (New Earth),951.0,Female Characters,Red Hair
14,Lois Lane (New Earth),934.0,Female Characters,Black Hair


In [None]:
# match multiple conditions (all of them)
# get the "Female Characters" AND with 'Blond Hair'
dc_simple[dc_simple.sex.eq('Female Characters') & dc_simple.hair.eq('Blond Hair')].head()

Unnamed: 0,name,appearances,sex,hair
8,Dinah Laurel Lance (New Earth),1075.0,Female Characters,Blond Hair
10,GenderTest,1028.0,Female Characters,Blond Hair
21,Kara Zor-L (Earth-Two),635.0,Female Characters,Blond Hair
38,Cassandra Sandsmark (New Earth),423.0,Female Characters,Blond Hair
67,Courtney Whitmore (New Earth),305.0,Female Characters,Blond Hair


In [None]:
# Match one of the alternatives (any of them)
# get the "Female Characters" OR with 'Blond Hair'
dc_simple[dc_simple.sex.eq('Female Characters') | dc_simple.hair.eq('Blond Hair')].head()

Unnamed: 0,name,appearances,sex,hair
5,Wonder Woman (Diana Prince),1231.0,Female Characters,Black Hair
6,Aquaman (Arthur Curry),1121.0,Male Characters,Blond Hair
8,Dinah Laurel Lance (New Earth),1075.0,Female Characters,Blond Hair
9,Flash (Barry Allen),1028.0,Male Characters,Blond Hair
10,GenderTest,1028.0,Female Characters,Blond Hair
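The examples above use `.eq()`; the same conditions can be written with `==`, but then each condition needs its own parentheses, because in Python `&` and `|` bind more tightly than comparison operators. A minimal sketch on a made-up table:

```python
import pandas as pd

demo = pd.DataFrame({
    'sex': ['Female Characters', 'Male Characters', 'Female Characters'],
    'hair': ['Blond Hair', 'Blond Hair', 'Black Hair'],
})

# With == each condition needs its own parentheses
both = demo[(demo.sex == 'Female Characters') & (demo.hair == 'Blond Hair')]
either = demo[(demo.sex == 'Female Characters') | (demo.hair == 'Blond Hair')]

# ~ negates a condition
not_blond = demo[~(demo.hair == 'Blond Hair')]

print(len(both), len(either), len(not_blond))
```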


In [39]:
# Conditions on numbers
# Example: 
# get the characters with at least 1000 appearances
dc_simple[dc_simple.appearances >= 1000]


Unnamed: 0,name,appearances,sex,hair
0,Batman (Bruce Wayne),3093.0,Male Characters,Black Hair
1,Superman (Clark Kent),2496.0,Male Characters,Black Hair
2,Green Lantern (Hal Jordan),1565.0,Male Characters,Brown Hair
3,James Gordon (New Earth),1316.0,Male Characters,White Hair
4,Richard Grayson (New Earth),1237.0,Male Characters,Black Hair
5,Wonder Woman (Diana Prince),1231.0,Female Characters,Black Hair
6,Aquaman (Arthur Curry),1121.0,Male Characters,Blond Hair
7,Timothy Drake (New Earth),1095.0,Male Characters,Black Hair
8,Dinah Laurel Lance (New Earth),1075.0,Female Characters,Blond Hair
9,Flash (Barry Allen),1028.0,Male Characters,Blond Hair


In [None]:
# Example: 
# within a given interval (inclusive)
dc_simple[(900 <= dc_simple.appearances) & (dc_simple.appearances <= 1000)]
# or (equivalent)
dc_simple[ dc_simple.appearances.between(900, 1000, inclusive='both') ]

Unnamed: 0,name,appearances,sex,hair
11,Alan Scott (New Earth),969.0,Male Characters,Blond Hair
12,Barbara Gordon (New Earth),951.0,Female Characters,Red Hair
13,Jason Garrick (New Earth),951.0,Male Characters,Brown Hair
14,Lois Lane (New Earth),934.0,Female Characters,Black Hair
15,Alfred Pennyworth (New Earth),930.0,Male Characters,Black Hair


In [None]:
# Example: 
# outside a given interval
dc_simple[ (dc_simple.appearances < 900) | (1000 < dc_simple.appearances) ]
# or (NB: not equivalent with respect to NaN: negating `between` also keeps the rows where 'appearances' is NaN)
dc_simple[ ~ dc_simple.appearances.between(900, 1000, inclusive='both') ]

Unnamed: 0,name,appearances,sex,hair
0,Batman (Bruce Wayne),3093.0,Male Characters,Black Hair
1,Superman (Clark Kent),2496.0,Male Characters,Black Hair
2,Green Lantern (Hal Jordan),1565.0,Male Characters,Brown Hair
3,James Gordon (New Earth),1316.0,Male Characters,White Hair
4,Richard Grayson (New Earth),1237.0,Male Characters,Black Hair
...,...,...,...,...
6891,Nadine West (New Earth),,Female Characters,
6892,Warren Harding (New Earth),,Male Characters,
6893,William Harrison (New Earth),,Male Characters,
6894,William McKinley (New Earth),,Male Characters,
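The rows with missing `appearances` in the output above come from the negated `between`. The sketch below reproduces the difference on three made-up values; it assumes a pandas version that accepts `inclusive='both'`.

```python
import numpy as np
import pandas as pd

appearances = pd.Series([950.0, 1200.0, np.nan])

# Comparisons with NaN are False, so the NaN row is dropped
explicit = appearances[(appearances < 900) | (appearances > 1000)]

# between() is False for NaN, so its negation is True and the NaN row is kept
negated = appearances[~appearances.between(900, 1000, inclusive='both')]

print(len(explicit), len(negated))
```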


In [None]:
# set membership ('is in') operator

# Example: 
# 'appearances' has a value in a list of allowed numbers
dc_simple[ dc_simple.appearances.isin([900, 1316]) ] 
# Tip: read the 'isin' method as 'is in'

Unnamed: 0,name,appearances,sex,hair
3,James Gordon (New Earth),1316.0,Male Characters,White Hair


[is-in-operator](https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html)

In [None]:
# Example: 
# 'hair' has a value in a vector of possible strings
dc_simple[ dc_simple.hair.isin(['Violet Hair', 'Pink Hair']) ]

Unnamed: 0,name,appearances,sex,hair
743,Brainiac 8 (New Earth),38.0,Female Characters,Pink Hair
987,Susan Linden I (New Earth),28.0,Female Characters,Violet Hair
1009,Flora Black (New Earth),27.0,Female Characters,Violet Hair
1780,Susan Linden II (New Earth),14.0,Female Characters,Violet Hair
1959,Silica (New Earth),12.0,Female Characters,Pink Hair
1979,Fay Moffit (New Earth),12.0,Female Characters,Pink Hair
2373,Vanessa Kingsbury (New Earth),10.0,Female Characters,Pink Hair
2559,Gretti (New Earth),9.0,Male Characters,Violet Hair
2865,Poprocket (New Earth),7.0,Female Characters,Pink Hair
2910,Venizz (New Earth),7.0,Female Characters,Pink Hair


### Sampling

In [None]:
# take rows by their integer position (the end of the slice is excluded)
dc_simple.iloc[5:10]

Unnamed: 0,name,appearances,sex,hair
5,Wonder Woman (Diana Prince),1231.0,Female Characters,Black Hair
6,Aquaman (Arthur Curry),1121.0,Male Characters,Blond Hair
7,Timothy Drake (New Earth),1095.0,Male Characters,Black Hair
8,Dinah Laurel Lance (New Earth),1075.0,Female Characters,Blond Hair
9,Flash (Barry Allen),1028.0,Male Characters,Blond Hair


[Sample Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html)

In [None]:
# Extract a sample of 'n' rows
dc_simple.sample(n=3)

# Extract a fraction of all rows
dc_simple.sample(frac=0.1).head()

# Re-sample with replacement (rows can repeat, so the fraction can exceed 1)
dc_simple.sample(frac=1.5, replace=True).head()

# Fix the random state to make the sampling reproducible
dc_simple.sample(n=3, random_state=123)

Unnamed: 0,name,appearances,sex,hair
6690,Nameless One (New Earth),,Male Characters,White Hair
3850,Dead Hand Legendre (New Earth),4.0,Male Characters,
6312,Snowflame (New Earth),1.0,Male Characters,White Hair
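A quick way to convince yourself of what `random_state` and `replace` do is to run them on a small made-up table, as in this sketch:

```python
import pandas as pd

demo = pd.DataFrame({'value': range(100)})

# Same random_state, same rows
first = demo.sample(n=3, random_state=123)
second = demo.sample(n=3, random_state=123)
print(first.equals(second))

# With replacement the sample can be larger than the original table
bigger = demo.sample(frac=1.5, replace=True, random_state=123)
print(len(bigger))
```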


### Row sorting

Reference: [sort_values](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) documentation.

Sort using a column as the criterion:

In [None]:

# sort by one column
dc_simple.sort_values(by='appearances')
# sort by one column in descending order
dc_simple.sort_values(by='appearances', ascending=False)
# choose where the NaN values are placed
dc_simple.sort_values(by='appearances', ascending=False, na_position='first')

Unnamed: 0,name,appearances,sex,hair
6541,Matteo Bischoff (New Earth),,Male Characters,Grey Hair
6542,Doomslayer (New Earth),,Male Characters,White Hair
6543,Emily Sung (New Earth),,Female Characters,Purple Hair
6544,Ry'jll (New Earth),,Female Characters,
6545,Baron Gestapo (New Earth),,Male Characters,
...,...,...,...,...
5877,Scratch (New Earth),1.0,Male Characters,Red Hair
5878,Sheila Carr (New Earth),1.0,Female Characters,Black Hair
5879,Steel Fang (New Earth),1.0,Male Characters,
5880,Althea (New Earth),1.0,Female Characters,Brown Hair


In [None]:
# sort by two criteria (the first takes precedence)
dc_simple.sort_values(by=['sex', 'appearances'], ascending=[True, False])[['sex', 'appearances']]


Unnamed: 0,sex,appearances
5,Female Characters,1231.0
8,Female Characters,1075.0
10,Female Characters,1028.0
12,Female Characters,951.0
14,Female Characters,934.0
...,...,...
6680,,
6731,,
6736,,
6835,,
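The two-criteria sort is easier to follow on a few rows. This sketch uses a made-up table; it also shows `ignore_index=True`, an option not used above, which renumbers the rows after sorting.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'sex': ['M', 'F', 'F', 'M'],
    'appearances': [10.0, 5.0, 20.0, np.nan],
})

# 'sex' ascending first; within each sex, 'appearances' descending
ordered = demo.sort_values(by=['sex', 'appearances'], ascending=[True, False])
print(ordered)

# ignore_index=True renumbers the rows 0..n-1 after sorting
renumbered = demo.sort_values(by='appearances', ascending=False, ignore_index=True)
print(renumbered)
```

Note that the original row labels are kept in `ordered`, just as in the output above.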


## Transform columns

Here we need to use a function that returns a vector of the same length as its input:

$$
R^n \to R^n
$$

### Create a new column

In [None]:
dc.assign(age = 2021 - dc.year)[['name', 'year', 'age']]

Unnamed: 0,name,year,age
0,Batman (Bruce Wayne),1939.0,82.0
1,Superman (Clark Kent),1986.0,35.0
2,Green Lantern (Hal Jordan),1959.0,62.0
3,James Gordon (New Earth),1987.0,34.0
4,Richard Grayson (New Earth),1940.0,81.0
...,...,...,...
6891,Nadine West (New Earth),,
6892,Warren Harding (New Earth),,
6893,William Harrison (New Earth),,
6894,William McKinley (New Earth),,
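
A small sketch of the same idea on made-up data (the `toy` frame is an assumption for illustration). Passing a `lambda` to `assign` lets a later column use one created earlier in the same chain, and the original frame is left untouched:

```python
import pandas as pd

# Hypothetical toy data; None becomes NaN in a float column
toy = pd.DataFrame({"name": ["x", "y", "z"], "year": [1939.0, 1986.0, None]})

with_age = (
    toy
    .assign(age=lambda d: 2021 - d["year"])   # element-wise: one value per row
    .assign(is_old=lambda d: d["age"] > 50)   # uses the column created just above
)

print(with_age)
print("age" in toy.columns)  # False: assign returns a new DataFrame
```

Rows with a missing `year` get a missing `age`, and the comparison `NaN > 50` evaluates to `False`.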


In [42]:
from nycflights13 import flights

In [43]:
flights[['dep_delay', 'arr_delay']]\
  .assign(time_gain = flights.dep_delay - flights.arr_delay)

Unnamed: 0,dep_delay,arr_delay,time_gain
0,2.0,11.0,-9.0
1,4.0,20.0,-16.0
2,2.0,33.0,-31.0
3,-1.0,-18.0,17.0
4,-6.0,-25.0,19.0
...,...,...,...
336771,,,
336772,,,
336773,,,
336774,,,


### Segment data values into bins

In order to create a categorical variable from a continuous variable you need to segment and sort data values into bins.

You can do that with the [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) function:

#### Example 1: split a variable into equal intervals

Here is a `cut` example. `cut` splits a variable into classes, each of which corresponds to an interval of values.

You can automatically split the range of `year` into a given number of classes:

In [40]:
# Split the range of `year` into 10 equal-width classes
dc[['year']]\
  .assign(year_class = pd.cut(dc.year, bins = 10))

Unnamed: 0,year,year_class
0,1939.0,"(1934.922, 1942.8]"
1,1986.0,"(1981.8, 1989.6]"
2,1959.0,"(1958.4, 1966.2]"
3,1987.0,"(1981.8, 1989.6]"
4,1940.0,"(1934.922, 1942.8]"
...,...,...
6891,,
6892,,
6893,,
6894,,


#### Example 2: split a variable into customized intervals

In [None]:
flights[['arr_delay']].assign(
    delay_class=pd.cut(
        x=flights['arr_delay'],
        bins=  [ -np.inf,         0,                15,             np.inf],
        labels=[       'no-delay',    'small-delay',    'big-delay']
    )
)

Unnamed: 0,arr_delay,delay_class
0,11.0,small-delay
1,20.0,big-delay
2,33.0,big-delay
3,-18.0,no-delay
4,-25.0,no-delay
...,...,...
336771,,
336772,,
336773,,
336774,,
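
One detail worth checking is what happens exactly at the bin edges. By default `pd.cut` uses right-closed intervals, so with the bins above a delay of `0` falls in `no-delay` and a delay of `15` in `small-delay`. A minimal sketch on a made-up series (the values are assumptions chosen to hit the edges):

```python
import numpy as np
import pandas as pd

# Hypothetical delays, including the two bin edges and a missing value
delays = pd.Series([-5.0, 0.0, 10.0, 15.0, 16.0, np.nan])

classes = pd.cut(
    delays,
    bins=[-np.inf, 0, 15, np.inf],   # intervals: (-inf, 0], (0, 15], (15, inf)
    labels=["no-delay", "small-delay", "big-delay"],
)

print(classes.tolist())
# ['no-delay', 'no-delay', 'small-delay', 'small-delay', 'big-delay', nan]
```

Missing values stay missing in the resulting categorical column.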


## Aggregate rows

### Scalar-returning aggregations

Here we need to use a function that returns a scalar:

$$
R^n \to R
$$


For example, the `mean` function takes a vector and returns a single value:

In [None]:
np.mean(flights.arr_delay)

6.89537675731489

Examples:

- [numpy.mean](https://numpy.org/doc/stable/reference/generated/numpy.mean.html)

- [numpy.std](https://numpy.org/doc/stable/reference/generated/numpy.std.html)
- [numpy.quantile](https://numpy.org/doc/stable/reference/generated/numpy.quantile.html)

- [pandas.Series.max](https://pandas.pydata.org/docs/reference/api/pandas.Series.max.html)
- [pandas.Series.min](https://pandas.pydata.org/docs/reference/api/pandas.Series.min.html)
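
Missing values are handled differently by these functions, which matters for columns such as `arr_delay` that contain `NaN`. The sketch below uses a made-up series; as far as I can tell, `np.mean` on a pandas Series delegates to the Series' own `mean`, which would explain why the call above returns a number.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
s = pd.Series([1.0, 2.0, np.nan, 5.0])

pandas_mean = s.mean()                     # pandas skips NaN by default (skipna=True)
numpy_mean = np.mean(s.to_numpy())         # on a plain array, NaN propagates
numpy_nanmean = np.nanmean(s.to_numpy())   # NaN-aware NumPy variant

print(pandas_mean, numpy_mean, numpy_nanmean)
```

When working on raw NumPy arrays, prefer the `nan*` variants (`np.nanmean`, `np.nanstd`, `np.nanquantile`) if missing values should be ignored.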


### Aggregate to a scalar

Prepare the dataset:

In [46]:
flights_tiny = flights[[ 'dep_delay', 'arr_delay', 'carrier', 'origin', 'dest' ]]
flights_tiny.head()

Unnamed: 0,dep_delay,arr_delay,carrier,origin,dest
0,2.0,11.0,UA,EWR,IAH
1,4.0,20.0,UA,LGA,IAH
2,2.0,33.0,AA,JFK,MIA
3,-1.0,-18.0,B6,JFK,BQN
4,-6.0,-25.0,DL,LGA,ATL


Calculate a single aggregation:

In [None]:
flights_tiny.agg({
    'arr_delay': np.mean,
  })

arr_delay    6.895377
dtype: float64

Calculate multiple aggregations:

In [None]:
flights_tiny.aggregate({
  'dep_delay': [np.mean, np.std, np.median],
  'arr_delay': [np.mean, np.std, np.median],
})

Unnamed: 0,dep_delay,arr_delay
mean,12.63907,6.895377
std,40.210061,44.633292
median,-2.0,-5.0


Create your own aggregation function:


In [50]:
def my_median(x):
    return np.nanquantile(list(x), q=0.5)

flights_tiny.agg({
    'arr_delay': [np.median, my_median]
  })

Unnamed: 0,arr_delay
median,-5.0
my_median,-5.0


## Compute per groups

### Aggregation per groups

In [None]:
flights_tiny\
  .groupby('carrier')\
  .aggregate({
    'dep_delay': [np.mean, np.std, np.median],
    'arr_delay': [np.mean, np.std, np.median],
  })
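
Since the cell above is shown without output, here is a minimal sketch of the same pattern on made-up data (the `toy` frame is an assumption). It uses string names such as `'mean'` instead of NumPy functions; to my knowledge recent pandas versions warn when `np.mean` and similar are passed to `agg`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "carrier":   ["UA", "UA", "AA", "AA", "AA"],
    "arr_delay": [11.0, 20.0, 33.0, -18.0, -3.0],
})

per_carrier = toy.groupby("carrier").agg({"arr_delay": ["mean", "median", "count"]})
print(per_carrier)

# The columns form a MultiIndex of (column, aggregation) pairs
ua_mean = per_carrier.loc["UA", ("arr_delay", "mean")]
print(ua_mean)  # 15.5
```

The result has one row per group, with the group key (`carrier`) as the index.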

## Joins

![Join e Merge](https://www.practicaldatascience.org/html/_images/join-or-merge-in-python-pandas.png)

Let us build a couple of example tables. We define a [foreign key](https://en.wikipedia.org/wiki/Foreign_key) as a column whose values correspond to values in another table. This creates a relationship between the two tables.

In [51]:
import numpy as np

letters = np.array(list("abcdefghijklmnopqrstuvwxyz"))
letters

array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
       'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'],
      dtype='<U1')

In [54]:
upper_letters = np.vectorize(lambda x: x.upper())(letters)
upper_letters


array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
       'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
      dtype='<U1')

`DataFrame` joins are performed with the function  [merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html).

In [56]:
# Main table, where
# 'id' is a row identifier
# 'lower' is a foreign key into the table of letters
main_tbl = pd.DataFrame({
  'id': np.arange(5),
  'lower': np.concatenate((letters[0:2], np.array(['a-non-letter']), letters[3:5]))
})
main_tbl

Unnamed: 0,id,lower
0,0,a
1,1,b
2,2,a-non-letter
3,3,d
4,4,e


And another table that lists the values the foreign key can take (usually all of them):

In [57]:
# The table of letters
# 'lower' lists all the lower-case letters
# 'upper' is an attribute of each letter: here, its upper-case copy
letter_tbl = pd.DataFrame({
  'lower': letters,
  'upper': upper_letters
})
letter_tbl

Unnamed: 0,lower,upper
0,a,A
1,b,B
2,c,C
3,d,D
4,e,E
5,f,F
6,g,G
7,h,H
8,i,I
9,j,J


In [58]:
# full-join
main_tbl.merge(letter_tbl, how='outer')

Unnamed: 0,id,lower,upper
0,0.0,a,A
1,1.0,b,B
2,2.0,a-non-letter,
3,3.0,d,D
4,4.0,e,E
5,,c,C
6,,f,F
7,,g,G
8,,h,H
9,,i,I


In [62]:
# inner-join
main_tbl.merge(letter_tbl, how='inner')

Unnamed: 0,id,lower,upper
0,0,a,A
1,1,b,B
2,3,d,D
3,4,e,E


In [None]:
# left-join
main_tbl.merge(letter_tbl, how='left')

In [None]:
# right-join
main_tbl.merge(letter_tbl, how='right')

In [None]:
# with the tables swapped, a left join returns the same rows as the right join above
letter_tbl.merge(main_tbl, how='left')
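
The left and right joins above are shown without output. A minimal sketch on smaller, made-up tables (`left` and `right` are assumptions) uses the `indicator=True` option of `merge`, which adds a `_merge` column telling where each row came from:

```python
import pandas as pd

# Hypothetical small tables mirroring main_tbl and letter_tbl
left = pd.DataFrame({"id": [0, 1, 2], "lower": ["a", "b", "a-non-letter"]})
right = pd.DataFrame({"lower": ["a", "b", "c"], "upper": ["A", "B", "C"]})

# Left join: keep every row of `left`, fill `upper` with NaN when there is no match
left_join = left.merge(right, how="left", on="lower", indicator=True)
# Right join: keep every row of `right`, fill `id` with NaN when there is no match
right_join = left.merge(right, how="right", on="lower", indicator=True)

print(left_join)
print(right_join)
```

In the left join the unmatched row is marked `left_only`; in the right join the unmatched letter is marked `right_only`.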

In [79]:
# A more explicit syntax is also available:
# left_on and right_on name the key column of each table, so the keys may have different names
df_merged = pd.merge(main_tbl, letter_tbl, how="inner", left_on=["lower"], right_on=["lower"])
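
In the cell above both keys are called `lower`, so `left_on` and `right_on` are not strictly needed. A sketch where the key names really differ (the `orders` and `lookup` tables and their column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical tables whose key columns have different names
orders = pd.DataFrame({"order_id": [1, 2, 3], "letter": ["a", "b", "z"]})
lookup = pd.DataFrame({"lower": ["a", "b", "c"], "upper": ["A", "B", "C"]})

merged = pd.merge(orders, lookup, how="inner", left_on="letter", right_on="lower")
# Both key columns are kept in the result; drop the redundant one
merged = merged.drop(columns="lower")

print(merged)
```

Only the rows whose `letter` has a match in `lookup.lower` survive the inner join.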

### Unite data frames

#### Concatenate rows

Example from the [guide](https://pandas.pydata.org/docs/user_guide/merging.html):

In [63]:
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)

df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[0, 1, 2, 3],
)

df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    },
    index=[0, 1, 2, 3],
)

In [64]:
# concat rows
pd.concat([df1, df2, df3])

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
0,A4,B4,C4,D4
1,A5,B5,C5,D5
2,A6,B6,C6,D6
3,A7,B7,C7,D7
0,A8,B8,C8,D8
1,A9,B9,C9,D9
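
Notice that the concatenated frame repeats the index values `0..3`. If a fresh, unique index is preferable, `pd.concat` accepts `ignore_index=True`. A minimal sketch with two small made-up frames:

```python
import pandas as pd

# Hypothetical frames sharing the same index labels
a = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
b = pd.DataFrame({"A": ["A2", "A3"]}, index=[0, 1])

stacked = pd.concat([a, b])                        # index: 0, 1, 0, 1
renumbered = pd.concat([a, b], ignore_index=True)  # index: 0, 1, 2, 3

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

With a repeated index, `stacked.loc[0]` returns two rows, which is often not what you want.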


In [65]:
# bind columns by their index
pd.concat([df1, df2, df3], axis=1)


Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1,A.2,B.2,C.2,D.2
0,A0,B0,C0,D0,A4,B4,C4,D4,A8,B8,C8,D8
1,A1,B1,C1,D1,A5,B5,C5,D5,A9,B9,C9,D9
2,A2,B2,C2,D2,A6,B6,C6,D6,A10,B10,C10,D10
3,A3,B3,C3,D3,A7,B7,C7,D7,A11,B11,C11,D11


# Bonus Topics

## Configure Pandas Settings

Pandas has many display settings that you can configure:

In [None]:
# Use 3 decimal places in output display
pd.set_option("display.precision", 3)

# Don't wrap repr(DataFrame) across additional lines
pd.set_option("display.expand_frame_repr", False)

# Set max rows displayed in output to 25
pd.set_option("display.max_rows", 25)
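
`pd.set_option` changes a setting for the rest of the session. When a setting is needed only for one piece of code, `pd.option_context` applies it temporarily and restores the previous value afterwards. A minimal sketch (the values `5` and `2` are arbitrary):

```python
import pandas as pd

before = pd.get_option("display.max_rows")

# Settings are given as alternating name/value pairs
with pd.option_context("display.max_rows", 5, "display.precision", 2):
    inside = pd.get_option("display.max_rows")   # 5 inside the block

after = pd.get_option("display.max_rows")        # previous value is restored

print(before, inside, after)
```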

## Apply custom functions 

With Pandas you can write custom functions and apply them to a dataframe.

Pandas provides a very useful and powerful method called `apply`.
It applies a function you define, or any other transformation, to each row or column of a dataframe.

In the example we will also use `lambda` expressions, Python's syntax for small anonymous functions.

In [66]:
# Build a custom function that lower-cases every string in a row
def put_string_lower(dataset):
    """Lower-case the string values of a row and report whether a condition holds.

    Args:
        dataset (pandas.Series): one row of the dataframe (when applied with axis=1)
    """
    print(type(dataset))
    # The values are scalars here, so use `and` instead of the element-wise `&`
    if (dataset['value'] > 200) and (dataset['check'] is True):
        applied = True
    else:
        applied = False

    print(f"\nFunction applied: {applied} for:\n {dataset}")
    # Non-string values become NaN after .str.lower()
    return dataset.str.lower()

In [67]:
# Build a custom function that flags a record as anomalous when 'check' is False
def anomaly_record(dataset):
    print(type(dataset))  # each row arrives as a pandas Series
    if dataset['check'] is False:
        applied = True
    else:
        applied = False
    return applied

In [68]:
import pandas as pd

# Define a dataframe from Series of different lengths: the shorter one is padded with NaN
df_check = pd.Series([True, True, False, False, True])
df_value = pd.Series([200, 500, 600, 1, 2000, 20])
df_string = pd.Series(["FIRSTSTRING","SecondString","ThirdOne","allunder","SUPER","why?"])

df = pd.DataFrame({'check':df_check,'value':df_value,'string':df_string})
df

Unnamed: 0,check,value,string
0,True,200,FIRSTSTRING
1,True,500,SecondString
2,False,600,ThirdOne
3,False,1,allunder
4,True,2000,SUPER
5,,20,why?


In [69]:
%%time
df['anomaly'] = df.apply(anomaly_record, axis=1)
df

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
CPU times: user 1.74 ms, sys: 4.47 ms, total: 6.21 ms
Wall time: 17.8 ms


Unnamed: 0,check,value,string,anomaly
0,True,200,FIRSTSTRING,False
1,True,500,SecondString,False
2,False,600,ThirdOne,True
3,False,1,allunder,True
4,True,2000,SUPER,False
5,,20,why?,False


In [71]:
%%time
#Apply a function
result = df.apply(put_string_lower, axis=1)
df['lower'] = result['string']
df

<class 'pandas.core.series.Series'>

Function applied: False for:
 check             True
value              200
string     FIRSTSTRING
anomaly          False
lower      firststring
Name: 0, dtype: object
<class 'pandas.core.series.Series'>

Function applied: True for:
 check              True
value               500
string     SecondString
anomaly           False
lower      secondstring
Name: 1, dtype: object
<class 'pandas.core.series.Series'>

Function applied: False for:
 check         False
value           600
string     ThirdOne
anomaly        True
lower      thirdone
Name: 2, dtype: object
<class 'pandas.core.series.Series'>

Function applied: False for:
 check         False
value             1
string     allunder
anomaly        True
lower      allunder
Name: 3, dtype: object
<class 'pandas.core.series.Series'>

Function applied: True for:
 check       True
value       2000
string     SUPER
anomaly    False
lower      super
Name: 4, dtype: object
<class 'pandas.core.series.Serie

Unnamed: 0,check,value,string,anomaly,lower
0,True,200,FIRSTSTRING,False,firststring
1,True,500,SecondString,False,secondstring
2,False,600,ThirdOne,True,thirdone
3,False,1,allunder,True,allunder
4,True,2000,SUPER,False,super
5,,20,why?,False,why?


In [73]:
#Apply a function using lambda
df['anomaly'] = df.apply(lambda x: anomaly_record(x), axis=1)
df

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


Unnamed: 0,check,value,string,lower,anomaly
0,True,200,FIRSTSTRING,firststring,True
1,True,500,SecondString,secondstring,True
2,False,600,ThirdOne,thirdone,False
3,False,1,allunder,allunder,False
4,True,2000,SUPER,super,True
5,,20,why?,why?,True
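
`apply` with `axis=1` calls the Python function once per row, which can be slow on large tables. For simple rules like the two above, column-wise (vectorized) operations are usually enough. A sketch that rebuilds the same example frame and computes both columns without `apply` (this is an alternative, not the approach used in the cells above):

```python
import pandas as pd

# Same example data as above; 'check' is one item shorter, so its last value is NaN
df = pd.DataFrame({
    "check": pd.Series([True, True, False, False, True]),
    "value": pd.Series([200, 500, 600, 1, 2000, 20]),
    "string": pd.Series(["FIRSTSTRING", "SecondString", "ThirdOne", "allunder", "SUPER", "why?"]),
})

# Flag rows where 'check' is False; a missing 'check' is not flagged
df["anomaly"] = df["check"].eq(False)
# Lower-case the whole column at once
df["lower"] = df["string"].str.lower()

print(df)
```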


## Working with Dates

Working with dates in Python is fairly easy, but it can be confusing because dates and times come in several different types.

A date or time may be stored as a string or as a `datetime` object.

Let's take a quick look at how to work with dates using the Python standard library.

In [82]:
#Working with dates
from datetime import date

d1 = date(2021,2,9)

print(d1)

print(type(d1))

2021-02-09
<class 'datetime.date'>


In [3]:
# The library has many useful functions
# For example, you can get today's date and split it into its parts

# present day date
d1 = date.today()
print(d1)
# day
print('Day :',d1.day)
# month
print('Month :',d1.month)
# year
print('Year :',d1.year)

2021-04-10
Day : 10
Month : 4
Year : 2021


To work with times of day:

In [72]:
# Let's work with time
from datetime import time

t1 = time(13,20,13,40)

print(t1)

print(type(t1))

# hour
print('Hour :',t1.hour)
# minute
print('Minute :',t1.minute)
# second
print('Second :',t1.second)
# microsecond
print('Microsecond :',t1.microsecond)

13:20:13.000040
<class 'datetime.time'>
Hour : 13
Minute : 20
Second : 13
Microsecond : 40


Date + time = `datetime`

A `datetime` object stores the date and the time together:

In [73]:
from datetime import datetime
d1 = datetime(2021,2,9,11,20,30,40)
print(d1)
print(type(d1))

#Visualizing current datetime
d1 = datetime.now()
print(d1)
print(type(d1))

2021-02-09 11:20:30.000040
<class 'datetime.datetime'>
2021-04-11 16:05:16.689877
<class 'datetime.datetime'>


Datetime objects are very useful because common operations on them take a single call:

In [74]:
print('Datetime :',d1)
# date
print('Date :',d1.date())
# time
print('Time :',d1.time())

# change the data (day = 24 and hour = 14)
print('New datetime :',d1.replace(day=24, hour=14))

Datetime : 2021-04-11 16:05:16.689877
Date : 2021-04-11
Time : 16:05:16.689877
New datetime : 2021-04-24 14:05:16.689877


There are many other useful methods, for example those related to the calendar:

In [75]:
d1 = datetime.now()

# Day of the week, counting from Monday = 0
print(d1.weekday())
# Day of the week, counting from Monday = 1
print(d1.isoweekday())

# ISO calendar
# returns (year, week number, weekday)
print(d1.isocalendar())
print('Year :',d1.isocalendar()[0])
print('Week :',d1.isocalendar()[1])
print('Weekday :',d1.isocalendar()[2])

6
7
(2021, 14, 7)
Year : 2021
Week : 14
Weekday : 7
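
Because `datetime.now()` changes at every run, the output above is hard to reproduce. A sketch with a fixed date (11 April 2021, the Sunday on which the cell above appears to have been run) gives the same numbers every time:

```python
from datetime import datetime

d = datetime(2021, 4, 11)  # a Sunday

# isocalendar() can be unpacked into its three parts
iso_year, iso_week, iso_weekday = d.isocalendar()

print(d.weekday())                      # 6  (Monday = 0)
print(d.isoweekday())                   # 7  (Monday = 1)
print(iso_year, iso_week, iso_weekday)  # 2021 14 7
```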


In [76]:
# Display a calendar directly :)

# Using the calendar module
import calendar

# Print the month of February 2021
print(calendar.month(2021, 2))

print(calendar.calendar(2021))

   February 2021
Mo Tu We Th Fr Sa Su
 1  2  3  4  5  6  7
 8  9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28

                                  2021

      January                   February                   March
Mo Tu We Th Fr Sa Su      Mo Tu We Th Fr Sa Su      Mo Tu We Th Fr Sa Su
             1  2  3       1  2  3  4  5  6  7       1  2  3  4  5  6  7
 4  5  6  7  8  9 10       8  9 10 11 12 13 14       8  9 10 11 12 13 14
11 12 13 14 15 16 17      15 16 17 18 19 20 21      15 16 17 18 19 20 21
18 19 20 21 22 23 24      22 23 24 25 26 27 28      22 23 24 25 26 27 28
25 26 27 28 29 30 31                                29 30 31

       April                      May                       June
Mo Tu We Th Fr Sa Su      Mo Tu We Th Fr Sa Su      Mo Tu We Th Fr Sa Su
          1  2  3  4                      1  2          1  2  3  4  5  6
 5  6  7  8  9 10 11       3  4  5  6  7  8  9       7  8  9 10 11 12 13
12 13 14 15 16 17 18      10 11 12 13 14 15 16      14 15 16 

So we have seen how the standard-library date types work, and they are very powerful.

Two methods are especially useful when working with `datetime` objects:
- **strptime** = creates a `datetime` object from a string representing a date and time; you pass the format in which the string is written
- **strftime** = converts a `datetime` object into a string (the opposite of strptime)

Let's see how to use them.

In [77]:
# Be careful with the import: we need the datetime class from the datetime module
from datetime import datetime

In [79]:
# strptime
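# Use a name other than `date`, so the `date` class imported earlier is not overwritten
date_str = '22 April, 2020 13:20:13'  # %B expects the month name in the current locale (English by default)
d1 = datetime.strptime(date_str, '%d %B, %Y %H:%M:%S')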
print(d1)
print(type(d1))

2020-04-22 13:20:13
<class 'datetime.datetime'>


In [80]:
# strftime
d1 = datetime.now()
print('Datetime object :',d1)
new_date = d1.strftime('%d/%m/%Y %H:%M')
print('Formatted date :',new_date)
print(type(new_date))

Datetime object : 2021-04-11 16:08:36.292555
Formatted date : 11/04/2021 16:08
<class 'str'>


The codes in a pattern such as `'%d %B, %Y %H:%M:%S'` are called format codes; each one stands for a part of the date or time.

For the full list of format codes, check the documentation: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes

In [7]:
# Format codes can also be used to extract specific pieces of information
d1 = datetime.now()
print('Weekday :',d1.strftime('%A'))
print('Month :',d1.strftime('%B'))
print('Week number :',d1.strftime('%W'))
print("Locale's date and time representation :",d1.strftime('%c'))

Weekday : Saturday
Month : April
Week number : 14
Locale's date and time representation : Sat Apr 10 19:25:42 2021
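
`strftime` and `strptime` are inverses of each other when the same format is used and the format keeps every field you care about. A minimal round-trip sketch (the format string and the date are arbitrary choices; a purely numeric format avoids any dependence on the locale):

```python
from datetime import datetime

fmt = "%d/%m/%Y %H:%M:%S"
original = datetime(2020, 4, 22, 13, 20, 13)

as_text = original.strftime(fmt)          # datetime -> str
parsed = datetime.strptime(as_text, fmt)  # str -> datetime

print(as_text)             # 22/04/2020 13:20:13
print(parsed == original)  # True
```

If the format drops a field (for example the seconds), the round trip loses that information.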


Be a bit careful when doing arithmetic on dates: subtracting one datetime from another does not give a date but a new kind of object, a **timedelta**.

A timedelta represents a duration and comes with its own attributes and methods.

In [84]:
# timedelta : duration between dates
d1 = datetime(2020,4,23,11,13,10)
d2 = datetime(2021,1,10,12,13,10)
duration = d2 - d1
print(type(duration))
duration

<class 'datetime.timedelta'>


datetime.timedelta(days=262, seconds=3600)

In [None]:
print(duration.days) # 262
print(duration.seconds) # 3600
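
Note that `seconds` is only the remainder within the last day, not the whole duration. For the full length in seconds use `total_seconds()`. A self-contained sketch with the same two dates:

```python
from datetime import datetime

d1 = datetime(2020, 4, 23, 11, 13, 10)
d2 = datetime(2021, 1, 10, 12, 13, 10)
duration = d2 - d1

print(duration.days)             # 262
print(duration.seconds)          # 3600 (the extra hour beyond 262 whole days)
print(duration.total_seconds())  # 22640400.0 = 262 * 86400 + 3600
```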

A timedelta matters because it lets you do arithmetic on dates directly, for example dividing one duration by another to convert it into a different unit:

In [85]:
from datetime import timedelta
# duration in hours
print('Duration in hours :', duration / timedelta(hours=1))
# duration in minutes
print('Duration in minutes :', duration / timedelta(minutes=1))
# duration in seconds
print('Duration in seconds :', duration / timedelta(seconds=1))

Duration in hours : 6289.0
Duration in minutes : 377340.0
Duration in seconds : 22640400.0


In [93]:
# This does not work: a timedelta is required
d1 = datetime.now()
# d1 + 5

# because we are trying to add an int to a datetime

TypeError: unsupported operand type(s) for +: 'datetime.datetime' and 'int'

In [88]:
d1 = datetime.now()
print("Today's date :", d1)

d2 = d1+timedelta(days=2)
print("Date 2 days from today :", d2)

d3 = d1+timedelta(weeks=2)
print("Date 2 weeks from today :", d3)

Today's date : 2021-04-11 16:17:05.104910
Date 2 days from today : 2021-04-13 16:17:05.104910
Date 2 weeks from today : 2021-04-25 16:17:05.104910


Now that we have seen the basics of the `datetime` library, let's see how dates behave in Pandas and NumPy.

They work in a similar way, but are optimized and simplified for use on dataframes.

In addition, both Pandas and NumPy provide functions that extend the standard library and make working with dates in dataframes easier.

In [94]:
# convert a string to a date with to_datetime (it also works on a whole column)
date = pd.to_datetime('24th of April, 2020')
print(date)
print(type(date))

2020-04-24 00:00:00
<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [99]:
# timedelta
import numpy as np
date = datetime.now()
# present date
print(date)
# date after 1 day
print(date + pd.to_timedelta(1, unit='D'))
# date after 1 month
# print(date + pd.to_timedelta(1, unit='M'))

2021-04-11 16:20:37.680910
2021-04-12 16:20:37.680910
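
Two things are left open above: converting a whole column, and adding one month. The line with `unit='M'` is commented out, presumably because a month is not a fixed-length duration (to my knowledge recent pandas versions reject that unit). A calendar-aware alternative is `pd.DateOffset`. The sketch below uses a made-up `raw` frame as an assumption:

```python
import pandas as pd

# Hypothetical column of date strings
raw = pd.DataFrame({"day": ["2021-01-30", "2021-01-31"]})

# Convert the whole column at once, stating the expected format
raw["day"] = pd.to_datetime(raw["day"], format="%Y-%m-%d")

# The .dt accessor exposes datetime properties for every row
raw["weekday"] = raw["day"].dt.day_name()

# Fixed-length shift (1 day) versus calendar-aware shift (1 month)
raw["next_day"] = raw["day"] + pd.to_timedelta(1, unit="D")
raw["next_month"] = raw["day"] + pd.DateOffset(months=1)

print(raw)
```

Both dates move to 28 February 2021 when one month is added, because February 2021 has no 30th or 31st day.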


## From SQL to Pandas example

Pandas is very useful for manipulating data, but some of you may be more used to SQL or Excel.  
So let's compare the SQL you might write in a query with the equivalent Pandas code.

In [None]:
!pip install nycflights13

In [93]:
from nycflights13 import flights

print(list(flights.columns))

flights.head(10)

['year', 'month', 'day', 'dep_time', 'sched_dep_time', 'dep_delay', 'arr_time', 'sched_arr_time', 'arr_delay', 'carrier', 'flight', 'tailnum', 'origin', 'dest', 'air_time', 'distance', 'hour', 'minute', 'time_hour']


Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z
5,2013,1,1,554.0,558,-4.0,740.0,728,12.0,UA,1696,N39463,EWR,ORD,150.0,719,5,58,2013-01-01T10:00:00Z
6,2013,1,1,555.0,600,-5.0,913.0,854,19.0,B6,507,N516JB,EWR,FLL,158.0,1065,6,0,2013-01-01T11:00:00Z
7,2013,1,1,557.0,600,-3.0,709.0,723,-14.0,EV,5708,N829AS,LGA,IAD,53.0,229,6,0,2013-01-01T11:00:00Z
8,2013,1,1,557.0,600,-3.0,838.0,846,-8.0,B6,79,N593JB,JFK,MCO,140.0,944,6,0,2013-01-01T11:00:00Z
9,2013,1,1,558.0,600,-2.0,753.0,745,8.0,AA,301,N3ALAA,LGA,ORD,138.0,733,6,0,2013-01-01T11:00:00Z


In [86]:
# select origin from flights where dep_delay > 0 (flights that departed late)
flights[flights['dep_delay'] > 0]['origin']

# select distinct destinations from dataframe
flights['dest'].unique()

# select year, month, dep_delay, origin from flights where dep_delay > 0 and distance < 1000
# then reset the index, keeping the original index as a column
flights[(flights['dep_delay'] > 0) & (flights.distance < 1000)][['year','month','dep_delay','origin']].reset_index()

# select * from dataframe where Delay > 0 order by Distance desc
flights[flights['dep_delay'] > 0].sort_values('distance',ascending = False)

# select year, month, count(origin), count(dest) from flights group by year, month
# as_index=False is required to emulate the SQL result:
# otherwise pandas moves the grouping columns into the MultiIndex of the result
flights.groupby(['year','month'])[['origin','dest']].count()
flights.groupby(['year','month'],as_index=False)[['origin','dest']].count()

# select year, month, count(origin) as n_origin, count(dest) as n_dest from flights group by year, month
# To emulate the SQL group by you also need sort=False,
# because pandas sorts the group keys by default
flights.groupby(['year','month'],as_index=False, sort=False)[['origin','dest']].count().rename(columns={'origin':'n_origin','dest':'n_dest'})

# select year, month, count(*) from flights group by year, month
# .count() excludes NaN values, while .size() counts every row, NaN or not
flights.groupby(['year','month'],as_index=False).size()


### AGGREGATE FUNCTIONS: MIN, MAX, MEAN ###
# select max(Arrival Time), min(Arrival Time), avg(Arrival Time), median(Arrival Time) from dataframe
flights.agg({'arr_time': ['min', 'max', 'mean', 'median']})

### JOIN ###
# select Capo Area, Premi Anno Prec, df_anagrafica.'Specialista Di Territorio' from df_test join df_anagrafica on df_test.CapoArea = df_anagrafica.CapoArea where df_anagrafica.CapoArea = 'Area T1'
# df_test.merge(df_anagrafica[df_anagrafica['Capo Area'] == 'Area T1'][['Specialista Di Territorio','Capo Area']], 
#               left_on=['Capo Area'], 
#               right_on=['Capo Area'], 
#               how='inner')[['Capo Area', 'Premi Anno Prec', 'Specialista Di Territorio']]
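
# The difference between .count() and .size() is easiest to see on a small table
# with missing values. A minimal sketch on made-up data (the `toy` frame is an
# assumption for illustration, it is not the flights table):

```python
import pandas as pd

# Hypothetical toy data with missing destinations
toy = pd.DataFrame({
    "month": [1, 1, 2, 2, 2],
    "dest":  ["IAH", None, "MIA", "ATL", None],
})

# count(dest): non-missing values per group
counts = toy.groupby("month", as_index=False)[["dest"]].count()
# count(*): number of rows per group, missing or not
sizes = toy.groupby("month", as_index=False).size()

print(counts)
print(sizes)
```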


## Verify time of execution

There are two main ways to measure the execution time of a piece of code:
1. Using Jupyter magic commands, if you are working in a Jupyter notebook
2. Using the `time` module of the Python standard library

Interesting article on how to benchmark functions in Python  
https://towardsdatascience.com/how-to-benchmark-functions-in-python-ed10522053a2

### With Jupyter Magics
Jupyter Notebook provides some very useful commands called `magics`.

One of them is `%%time`, which reports the execution time of a cell.

It is very useful because you can see how long the code inside a cell takes to run.

You can also time a single statement by writing `%timeit <statement>` (or a whole cell with `%%timeit`).  
`%timeit` is more thorough than `%%time`: it runs the code many times and reports the mean and standard deviation, whereas `%%time` reports a single run.

In [1]:
%%time
lst = [i for i in range(100000)]

print(len(lst))

100000
CPU times: user 5.44 ms, sys: 3.55 ms, total: 8.99 ms
Wall time: 9.4 ms


In [101]:
%timeit lst = [i for i in range(100000)]

6.34 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [102]:
#Using the python standard library timer
import time
#Launch the timer
start_time = time.time()

lst = [i for i in range(100000)]

print(len(lst))

# Look at the result
finish_time = time.time()
print("--- %s seconds ---" % (finish_time - start_time))

100000
--- 0.01683354377746582 seconds ---
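
Outside a notebook the magics are not available. A sketch of two standard-library alternatives: `time.perf_counter`, a clock intended for measuring short intervals, and the `timeit` module, which repeats the measurement much like `%timeit` does. The helper `build_list` and the repetition counts are assumptions chosen for illustration:

```python
import time
import timeit

def build_list(n=100_000):
    """Hypothetical workload: build a list of n integers."""
    return [i for i in range(n)]

# Single measurement with a high-resolution clock
start = time.perf_counter()
build_list()
elapsed = time.perf_counter() - start
print(f"single run: {elapsed:.6f} s")

# Repeated measurement: 3 rounds of 10 calls each
runs = timeit.repeat(build_list, number=10, repeat=3)
best_per_call = min(runs) / 10
print(f"best of 3 rounds: {best_per_call:.6f} s per call")
```

Taking the minimum over several rounds reduces the influence of other processes running on the machine.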
