<b><font size="5">Pandas Data Structures, Indexing and Slicing. </font></b>
<br><br>
This notebook is an introduction to the pandas library. Feel free to complement it with the online documentation:<br>
https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html

### <font color='#BFD72F'>Table of Contents </font> <a class="anchor" id='toc'></a> 

- [1. Data Structures](#P1) 
    - [Data Types](#P1.1)
    - [Series](#P1.2)
    - [DataFrame](#P1.3)
- [2. Indexing and Slicing](#P2)
- [3. MultiIndex and Advanced indexing](#P3)
    - [MultiIndex / Hierarchical Index](#P3.1)
    - [Advanced indexing](#P3.2)
    - [Multi to Single Indexing](#P3.3)
- [4. Try it out](#P4)

### <font color='#BFD72F'>1. Data Structures </font> <a class="anchor" id="P1"></a>
  [Back to TOC](#toc)

In [7]:
# Import libraries and define the alias
import pandas as pd  # Series.append (used further down) needs pandas < 2.0
import numpy as np

In [15]:
# Can downgrade pandas
!pip install pandas==1.5.3

# Can upgrade pandas
#!pip install -U pandas

Collecting pandas==1.5.3
  Obtaining dependency information for pandas==1.5.3 from https://files.pythonhosted.org/packages/da/6d/1235da14daddaa6e47f74ba0c255358f0ce7a6ee05da8bf8eb49161aa6b5/pandas-1.5.3-cp311-cp311-win_amd64.whl.metadata
  Using cached pandas-1.5.3-cp311-cp311-win_amd64.whl.metadata (12 kB)
Using cached pandas-1.5.3-cp311-cp311-win_amd64.whl (10.3 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 2.2.2
    Uninstalling pandas-2.2.2:
      Successfully uninstalled pandas-2.2.2
Successfully installed pandas-1.5.3


#### Data Types <a class="anchor" id="P1.1"></a>

- Ints: also called integers, are positive or negative whole numbers with no decimal point. <br>
- Floats: represent real numbers and are written with a decimal point separating the integer and fractional parts. <br>
- Strings: used to record text, written between single or double quotes. <br>
- Booleans: the predefined values True and False (the capitalization is important!), which behave like the integers 1 and 0.
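
The relationship between booleans and integers described in the list above can be checked directly. This is a minimal sketch; the variable names are illustrative and do not come from the notebook.

```python
# bool is a subclass of int, so True and False take part in arithmetic
is_subclass = issubclass(bool, int)
total = True + True          # behaves like 1 + 1
as_int = int(False)          # behaves like 0

# Each literal maps to its own built-in type
kinds = [type(v).__name__ for v in (1, 1.5, "a", True)]

print(is_subclass, total, as_int)
print(kinds)
```

This is why summing a boolean Series later in the notebook would count the True values.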

In [8]:
# Example of an Integer
x = 1

# Confirm the Data Type 
type(x)

int

In [9]:
# Example of a Float
y = 1.5

# Confirm the Data Type
type(y)

float

In [10]:
# Example of a String
s = 'Hello World!'
print(s)

# Confirm the Data Type
print(type(s))

Hello World!
<class 'str'>


In [11]:
# Strings with special characters
print('Hello..\n New World!')
print('\n') #just a blank newline
print("And you can use \t to create tabs on your sentence.")

Hello..
 New World!


And you can use 	 to create tabs on your sentence.


Some special characters: <br>
 \\" – double quote <br>
 \\ – single backslash <br>
 \a – bell/alert <br>
 \b – backspace <br>
 \r – carriage return <br>
 \n – newline <br>
 \t – tab
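
To see how escape sequences are counted, the sketch below compares a normal string, a raw string, and a string with an escaped backslash. The variable names are illustrative assumptions.

```python
newline = "a\nb"     # \n is a single newline character
raw = r"a\nb"        # raw string: the backslash and the n are kept as two characters
escaped = "a\\nb"    # \\ produces one literal backslash, followed by n

print(len(newline), len(raw), len(escaped))
print(raw == escaped)
```

Raw strings are convenient for Windows paths and regular expressions, where backslashes should not be interpreted.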

In [12]:
# Example of a Boolean
b = True

# Confirm the Data Type
type(b)

bool

In [13]:
# Get a boolean result
1 > 2

False

#### Series <a class="anchor" id="P1.2"></a>

- One-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html <br>
*class pandas.Series(data=None, index=None, dtype=None, name=None, copy=None, fastpath=False)* <br>
Data can be many different things: array-like, Iterable, dict, or scalar value.
<br><br>[Back to TOC](#toc)

In [14]:
# Create Series from one array and check datatype
a = np.array([1, 2])
print(type(a))

ser = pd.Series(data=a)
print(type(ser))
ser

<class 'numpy.ndarray'>
<class 'pandas.core.series.Series'>


0    1
1    2
dtype: int32

In [26]:
# Create Series from the list l=[1,2] and check datatype
l = [1,2]
S = pd.Series(data=l)
type(S)


pandas.core.series.Series

In [27]:
# Create Series from the dictionary d={'a': 1, 'b': 2, 'c': 3} and check datatype
d = {'a': 1, 
     'b': 2,
     'c': 3
    }

s = pd.Series(data=d)  # or pd.Series(data=..., index=...)
print(s)
type(s)

a    1
b    2
c    3
dtype: int64


pandas.core.series.Series

In [28]:
# Define the index order
pd.Series(d, index=['c', 'b', 'a'])

c    3
b    2
a    1
dtype: int64

In [29]:
# What if a nonexistent label is requested ('e')?
pd.Series(d, index=['c', 'b', 'e']) # 'e' is not a key of d, so its value is NaN (and the dtype becomes float64)

c    3.0
b    2.0
e    NaN
dtype: float64
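
The dtype changes from int64 to float64 because NaN is a floating-point value. The sketch below reproduces this and shows one possible way to keep integers, using the nullable `Int64` extension dtype; the variable names are illustrative.

```python
import pandas as pd

d = {'a': 1, 'b': 2, 'c': 3}

full = pd.Series(d, index=['c', 'b', 'a'])      # every label exists in d
partial = pd.Series(d, index=['c', 'b', 'e'])   # 'e' is missing, so NaN is inserted

print(full.dtype)       # integer dtype
print(partial.dtype)    # float dtype, forced by the NaN
print(partial.isna())

# The nullable integer dtype stores a missing value without switching to float
nullable = partial.astype('Int64')
print(nullable)
```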

In [31]:
# Create Series from a scalar 
# Note: with scalar value an index must be provided
ser1 = pd.Series(data=5.0, index=["a", "b", "c", "d", "e"])
ser1

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

In [44]:
# Change the dtype to 'int8' (8-bit integer); astype returns a new Series
ser1.astype('int8') 

# Check other dtypes and bit sizes: https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes

a    5
b    5
c    5
d    5
e    5
dtype: int8

In [46]:
# How to define the Series' name
ser=pd.Series(ser, name="rank")
print(ser)
print('\nThis series is labelled as:',ser.name) # Print a string with a variable (name attribute of the Series)
# The name becomes the column name when the Series is placed in a DataFrame

0    1
1    2
Name: rank, dtype: int32

This series is labelled as: rank


In [47]:
# How to add/append new values to Series
# Note: Series.append was removed in pandas 2.0, so this only works on older versions
ser = ser.append(ser)
print(ser)

AttributeError: 'Series' object has no attribute 'append'

In [48]:
# Series.append was deprecated in pandas 1.4 and removed in 2.0; use pd.concat instead
ser = pd.concat([ser,ser])
print(ser)

0    1
1    2
0    1
1    2
Name: rank, dtype: int32
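
Note that the concatenated Series above repeats the labels 0 and 1. A minimal sketch of the difference `ignore_index=True` makes (the variable names are illustrative):

```python
import pandas as pd

ser = pd.Series([1, 2], name="rank")

stacked = pd.concat([ser, ser])                         # keeps the original labels
renumbered = pd.concat([ser, ser], ignore_index=True)   # builds a fresh 0..n-1 index

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(stacked.loc[0])    # a duplicated label returns every matching row
```

Duplicated labels are allowed, but label lookups then return several rows, which is often not what is wanted.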


#### DataFrame <a class="anchor" id="P1.3"></a>

- DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html <br>
*class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)* <br>
Data can be many different things: array-like, Iterable, dict, Series, or DataFrame.
<br><br>[Back to TOC](#toc)

In [49]:
# Create DataFrame from an array
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(type(a))
             
df = pd.DataFrame(data=a)
print(type(df))
df

<class 'numpy.ndarray'>
<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,0,1,2
0,1,2,3
1,4,5,6
2,7,8,9


In [51]:
# Create DataFrame from an array setting the index and columns label 
pd.DataFrame(a, columns=['tizio', 'caio', 'sempronio'], index=['d', 'e', 'f'])

Unnamed: 0,tizio,caio,sempronio
d,1,2,3
e,4,5,6
f,7,8,9


In [52]:
# Create DataFrame from list; same principle...
l = [['a', 2],['b',3]]
print(type(l))

df = pd.DataFrame(l)
print(type(df))
df

<class 'list'>
<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,0,1
0,a,2
1,b,3


In [53]:
# Create DataFrame from list of tuples
t = [('a',1),('b',2),('c',3)]
print(type(t))

df = pd.DataFrame(t)
print(type(df))
df

<class 'list'>
<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,0,1
0,a,1
1,b,2
2,c,3


In [56]:
# Create DataFrame from dictionary
d = {'col1': [1, 2], 'col2': [3, 4]}
print(type(d))

df = pd.DataFrame(d)
print(type(df))
df

<class 'dict'>
<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,col1,col2
0,1,3
1,2,4


In [57]:
# Create DataFrame from scalar
# Note: with a scalar value, index and columns must be provided
pd.DataFrame(5.0, columns=['a', 'b', 'c'], index=['d', 'e', 'f'])

Unnamed: 0,a,b,c
d,5.0,5.0,5.0
e,5.0,5.0,5.0
f,5.0,5.0,5.0


In [71]:
# Create DataFrame from Series
ser = pd.Series([1,'a',2,'b',3,'c'])
df = pd.DataFrame(ser, columns=['col1'])
df

Unnamed: 0,col1
0,1
1,a
2,2
3,b
4,3
5,c


In [72]:
# Add new data/column, named 'col2', to DataFrame using previous series
df['col2']=ser
df


Unnamed: 0,col1,col2
0,1,1
1,a,a
2,2,2
3,b,b
4,3,3
5,c,c


In [74]:
# Create DataFrame from DataFrames (reference, not a copy)
df1 = df            # df1 is just another name for df, no data is copied. So...
df1['col3'] = 'wow' # ...when df1 data is changed...

display(df1)        # ...as displayed...
display(df)         # ...df changes too: both names refer to the SAME object.

Unnamed: 0,col1,col2,col3
0,1,1,wow
1,a,a,wow
2,2,2,wow
3,b,b,wow
4,3,3,wow
5,c,c,wow


Unnamed: 0,col1,col2,col3
0,1,1,wow
1,a,a,wow
2,2,2,wow
3,b,b,wow
4,3,3,wow
5,c,c,wow


In [75]:
df['col3'] = 'oida' # same the other way around
display(df1)        
display(df)  

Unnamed: 0,col1,col2,col3
0,1,1,oida
1,a,a,oida
2,2,2,oida
3,b,b,oida
4,3,3,oida
5,c,c,oida


Unnamed: 0,col1,col2,col3
0,1,1,oida
1,a,a,oida
2,2,2,oida
3,b,b,oida
4,3,3,oida
5,c,c,oida


In [76]:
# DataFrame from DataFrames (new variable)
df1 = df.copy()      # a new variable df1 is created as a copy of df. So...
df1['col4'] = 'cool' # ...when df1 data is changed...

display(df1)         # ...as displayed...
display(df)          # ...it won't affect df.

Unnamed: 0,col1,col2,col3,col4
0,1,1,oida,cool
1,a,a,oida,cool
2,2,2,oida,cool
3,b,b,oida,cool
4,3,3,oida,cool
5,c,c,oida,cool


Unnamed: 0,col1,col2,col3
0,1,1,oida
1,a,a,oida
2,2,2,oida
3,b,b,oida
4,3,3,oida
5,c,c,oida
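
The difference between plain assignment and `copy()` can be verified with the `is` operator. This is a small sketch on a made-up two-row DataFrame; `alias` and `clone` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2]})

alias = df            # second name for the same object
clone = df.copy()     # independent object (deep copy by default)

alias['col2'] = 'x'   # also visible through df
clone['col3'] = 'y'   # not visible through df

print(alias is df, clone is df)
print(df.columns.tolist())
print(clone.columns.tolist())
```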


In [77]:
# Create new boolean data/column
df1['col5'] = df1['col1']==2
df1

Unnamed: 0,col1,col2,col3,col4,col5
0,1,1,oida,cool,False
1,a,a,oida,cool,False
2,2,2,oida,cool,True
3,b,b,oida,cool,False
4,3,3,oida,cool,False
5,c,c,oida,cool,False


In [78]:
# How to change DataFrame index
df1['new_index']=['l1','l2','l3','l4','l5','l6']
df1.set_index('new_index', inplace=True)
df1

Unnamed: 0_level_0,col1,col2,col3,col4,col5
new_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
l1,1,1,oida,cool,False
l2,a,a,oida,cool,False
l3,2,2,oida,cool,True
l4,b,b,oida,cool,False
l5,3,3,oida,cool,False
l6,c,c,oida,cool,False


In [80]:
# Resetting the DataFrame index is also possible
df1 = df1.reset_index(drop=True)
# OR df1.reset_index(drop=True, inplace=True)
df1

Unnamed: 0,col1,col2,col3,col4,col5
0,1,1,oida,cool,False
1,a,a,oida,cool,False
2,2,2,oida,cool,True
3,b,b,oida,cool,False
4,3,3,oida,cool,False
5,c,c,oida,cool,False


### <font color='#BFD72F'>2. Indexing and Slicing </font> <a class="anchor" id="P2"></a>
  [Back to TOC](#toc)

The basic operations are as follows:

| Operation | Syntax | Result |
| ----- | ----- | ---- |
|Select column | df[col] | Series |
| Select row by label | df.loc[label] | Series |
| Select row by integer location | df.iloc[loc] | Series |
| Slice rows | df[start:end] | DataFrame |
| Select rows by boolean vector | df[bool_vec] | DataFrame |


https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
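
The "Result" column of the table can be checked on a small DataFrame. The sketch below uses made-up integer data and only reports the type and shape returned by each syntax.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['c1', 'c2', 'c3', 'c4'],
                  index=['l1', 'l2', 'l3'])

results = {
    "df['c1']": df['c1'],              # select column
    "df.loc['l2']": df.loc['l2'],      # select row by label
    "df.iloc[0]": df.iloc[0],          # select row by integer location
    "df[0:2]": df[0:2],                # slice rows
    "df[df['c1'] > 0]": df[df['c1'] > 0],   # select rows by boolean vector
}

for syntax, value in results.items():
    print(f"{syntax:<18} -> {type(value).__name__}, shape {value.shape}")
```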

In [81]:
# Create a DataFrame with random data (using standard normal distribution)
df = pd.DataFrame(np.random.randn(6, 5), columns=['c1','c2','c3','c4','c5'], index=['l1','l2','l3','l4','l5','l6'])
df

Unnamed: 0,c1,c2,c3,c4,c5
l1,-0.076629,-0.616865,-0.594993,0.950307,-1.320038
l2,0.330005,1.323038,-2.1652,0.006963,-0.356811
l3,-1.599392,0.033633,-1.208574,0.840661,0.849681
l4,3.341122,0.403435,0.198862,-0.739774,0.272548
l5,-0.0806,-1.004251,0.325248,-0.020234,0.224757
l6,0.562777,-0.577687,0.788213,-0.156211,0.479833


In [83]:
# Select column 'c3'
df['c3'] # or df.c3

l1   -0.594993
l2   -2.165200
l3   -1.208574
l4    0.198862
l5    0.325248
l6    0.788213
Name: c3, dtype: float64

In [84]:
# Attribute access works only if the column name is a valid Python identifier (no spaces) and does not clash with a DataFrame attribute or method
df.c3

l1   -0.594993
l2   -2.165200
l3   -1.208574
l4    0.198862
l5    0.325248
l6    0.788213
Name: c3, dtype: float64

In [85]:
# Select more than one column (with DataFrame output)
df[['c3','c5']]

Unnamed: 0,c3,c5
l1,-0.594993,-1.320038
l2,-2.1652,-0.356811
l3,-1.208574,0.849681
l4,0.198862,0.272548
l5,0.325248,0.224757
l6,0.788213,0.479833


In [88]:
# Select row 'l2' by label (loc)
df.loc['l2']

c1    0.330005
c2    1.323038
c3   -2.165200
c4    0.006963
c5   -0.356811
Name: l2, dtype: float64

In [97]:
# Select rows 'l2' to 'l4' by label (loc)
# Note: label slices include both end points
df.loc['l2':'l4']

Unnamed: 0,c1,c2,c3,c4,c5
l2,0.330005,1.323038,-2.1652,0.006963,-0.356811
l3,-1.599392,0.033633,-1.208574,0.840661,0.849681
l4,3.341122,0.403435,0.198862,-0.739774,0.272548


In [104]:
# Select one row by integer location (iloc)
# Note: as in NumPy, positions start at 0
df.iloc[1]

c1    0.330005
c2    1.323038
c3   -2.165200
c4    0.006963
c5   -0.356811
Name: l2, dtype: float64

In [99]:
# Select one row by integer location (iloc) using negative indexing
df.iloc[-1]

c1    0.562777
c2   -0.577687
c3    0.788213
c4   -0.156211
c5    0.479833
Name: l6, dtype: float64

In [101]:
# Select more than one row by integer location (iloc)
# Note: the index notation is [inclusive:exclusive]; [a,b)
df.iloc[1:3]

Unnamed: 0,c1,c2,c3,c4,c5
l2,0.330005,1.323038,-2.1652,0.006963,-0.356811
l3,-1.599392,0.033633,-1.208574,0.840661,0.849681
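
Label slices with `loc` and position slices with `iloc` treat the end point differently. A minimal sketch on a made-up four-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 20, 30, 40]}, index=['l1', 'l2', 'l3', 'l4'])

by_label = df.loc['l2':'l4']    # label slice: both end points are included
by_position = df.iloc[1:3]      # position slice: position 3 is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```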


In [105]:
# Slice DataFrame by rows, defining both limits
df[1:5]

Unnamed: 0,c1,c2,c3,c4,c5
l2,0.330005,1.323038,-2.1652,0.006963,-0.356811
l3,-1.599392,0.033633,-1.208574,0.840661,0.849681
l4,3.341122,0.403435,0.198862,-0.739774,0.272548
l5,-0.0806,-1.004251,0.325248,-0.020234,0.224757


In [106]:
# Slice DataFrame by rows, leaving one limit open
df[2:]

Unnamed: 0,c1,c2,c3,c4,c5
l3,-1.599392,0.033633,-1.208574,0.840661,0.849681
l4,3.341122,0.403435,0.198862,-0.739774,0.272548
l5,-0.0806,-1.004251,0.325248,-0.020234,0.224757
l6,0.562777,-0.577687,0.788213,-0.156211,0.479833


In [107]:
# Select rows by boolean vector
print(df['c1']<0) # This boolean output is used...
print('\n')
print(df[df['c1']<0]) # ...to filter df

l1     True
l2    False
l3     True
l4    False
l5     True
l6    False
Name: c1, dtype: bool


          c1        c2        c3        c4        c5
l1 -0.076629 -0.616865 -0.594993  0.950307 -1.320038
l3 -1.599392  0.033633 -1.208574  0.840661  0.849681
l5 -0.080600 -1.004251  0.325248 -0.020234  0.224757


In [118]:
# Select rows by a boolean vector built from column 'c3'

print(df['c3']<0)
print('\n')
print(df[df['c3']<0])

l1     True
l2     True
l3     True
l4    False
l5    False
l6    False
Name: c3, dtype: bool


          c1        c2        c3        c4        c5
l1 -0.076629 -0.616865 -0.594993  0.950307 -1.320038
l2  0.330005  1.323038 -2.165200  0.006963 -0.356811
l3 -1.599392  0.033633 -1.208574  0.840661  0.849681


In [119]:
# Select rows by boolean vector and logic
df[(df['c1']<0) & (df['c3']<0)]

Unnamed: 0,c1,c2,c3,c4,c5
l1,-0.076629,-0.616865,-0.594993,0.950307,-1.320038
l3,-1.599392,0.033633,-1.208574,0.840661,0.849681


In [120]:
# Select rows and columns by label (loc)
df.loc['l3':'l6','c4':'c5']

Unnamed: 0,c4,c5
l3,0.840661,0.849681
l4,-0.739774,0.272548
l5,-0.020234,0.224757
l6,-0.156211,0.479833


In [121]:
# Select rows and columns by integer location (iloc)
df.iloc[3:6,3]

l4   -0.739774
l5   -0.020234
l6   -0.156211
Name: c4, dtype: float64

In [122]:
# Select rows and columns by label, without loc
df['l3':'l6']['c4']

l3    0.840661
l4   -0.739774
l5   -0.020234
l6   -0.156211
Name: c4, dtype: float64

In [123]:
# Select rows and columns by integer location, without iloc
df[3:6]['c4']

l4   -0.739774
l5   -0.020234
l6   -0.156211
Name: c4, dtype: float64
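
Chained selections such as the ones above return the same values as a single `loc` call, but a single call is generally the safer choice when values have to be assigned. The sketch below uses made-up data and illustrative variable names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(4, 3),
                  index=['l1', 'l2', 'l3', 'l4'],
                  columns=['c3', 'c4', 'c5'])

chained = df['l3':'l4']['c4']        # two steps: slice rows, then pick the column
single = df.loc['l3':'l4', 'c4']     # one step: rows and column together

print(chained.equals(single))

# Assignment through one loc call modifies df itself
df.loc['l3':'l4', 'c4'] = 0.0
print(df)
```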

In [124]:
# How to delete columns
df.drop(['c5'], axis=1) # axis=1 applies the drop to columns; a new DataFrame is returned and df is unchanged

Unnamed: 0,c1,c2,c3,c4
l1,-0.076629,-0.616865,-0.594993,0.950307
l2,0.330005,1.323038,-2.1652,0.006963
l3,-1.599392,0.033633,-1.208574,0.840661
l4,3.341122,0.403435,0.198862,-0.739774
l5,-0.0806,-1.004251,0.325248,-0.020234
l6,0.562777,-0.577687,0.788213,-0.156211


In [132]:
# Delete rows 'l2' and 'l4'
df.drop(['l2','l4'], axis=0)

Unnamed: 0,c1,c2,c3,c4,c5
l1,-0.076629,-0.616865,-0.594993,0.950307,-1.320038
l3,-1.599392,0.033633,-1.208574,0.840661,0.849681
l5,-0.0806,-1.004251,0.325248,-0.020234,0.224757
l6,0.562777,-0.577687,0.788213,-0.156211,0.479833
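
Both `drop` calls above return a new DataFrame and leave `df` untouched, which is why 'c5' is still present in the second output. A short sketch on made-up data, using the equivalent `columns=` and `index=` keywords:

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3], 'c5': [4, 5, 6]}, index=['l1', 'l2', 'l3'])

trimmed = df.drop(columns=['c5'])     # same as df.drop(['c5'], axis=1)
fewer_rows = df.drop(index=['l2'])    # same as df.drop(['l2'], axis=0)

print(df.columns.tolist())            # original is unchanged
print(trimmed.columns.tolist())
print(fewer_rows.index.tolist())
```

To keep the result, assign it back, for example `df = df.drop(columns=['c5'])`.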


#### query()

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html <br>
*DataFrame.query(expr, inplace=False, \*\*kwargs)* <br>
Query the columns of a DataFrame with a boolean expression.
<br><br>[Back to TOC](#toc)

In [127]:
# Filter rows with boolean expression for: negative values in c1
df.query('c1<0')

Unnamed: 0,c1,c2,c3,c4,c5
l1,-0.076629,-0.616865,-0.594993,0.950307,-1.320038
l3,-1.599392,0.033633,-1.208574,0.840661,0.849681
l5,-0.0806,-1.004251,0.325248,-0.020234,0.224757


In [128]:
# Filter rows with boolean expression for: positive values in c1 AND negative values in c2
df.query('c1>0 & c2<0')

Unnamed: 0,c1,c2,c3,c4,c5
l6,0.562777,-0.577687,0.788213,-0.156211,0.479833


In [138]:
# Filter rows with boolean expression for: positive values in c1 OR negative values in c2
df.query('c1>0 | c2<0')

Unnamed: 0,c1,c2,c3,c4,c5
l1,-0.076629,-0.616865,-0.594993,0.950307,-1.320038
l2,0.330005,1.323038,-2.1652,0.006963,-0.356811
l4,3.341122,0.403435,0.198862,-0.739774,0.272548
l5,-0.0806,-1.004251,0.325248,-0.020234,0.224757
l6,0.562777,-0.577687,0.788213,-0.156211,0.479833
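
Inside a query string, `and`/`or` can be used in place of `&`/`|`, and a Python variable can be referenced with the `@` prefix. The sketch below uses made-up data; `threshold` is a hypothetical cut-off that is not part of the notebook.

```python
import pandas as pd

df = pd.DataFrame({'c1': [-1.0, 0.5, 2.0], 'c2': [0.3, -0.2, -1.5]},
                  index=['l1', 'l2', 'l3'])

threshold = 1.0   # hypothetical value, used only for illustration

with_and = df.query('c1 > 0 and c2 < 0')            # same as 'c1 > 0 & c2 < 0'
with_var = df.query('c1 > @threshold')              # @ refers to a Python variable
same_as_mask = df[(df['c1'] > 0) & (df['c2'] < 0)]  # boolean-vector equivalent

print(with_and)
print(with_var)
```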


#### where()

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html <br>
*DataFrame.where(cond, other=_NoDefault.no_default, inplace=False, axis=None, level=None)* <br>
Where the condition is True the original values are kept; where it is False they are replaced by other (NaN by default).
<br><br>[Back to TOC](#toc)

In [139]:
# Check the value of rows that have negative values in c1
df.where((df['c1']<0))

Unnamed: 0,c1,c2,c3,c4,c5
l1,-0.076629,-0.616865,-0.594993,0.950307,-1.320038
l2,,,,,
l3,-1.599392,0.033633,-1.208574,0.840661,0.849681
l4,,,,,
l5,-0.0806,-1.004251,0.325248,-0.020234,0.224757
l6,,,,,


In [140]:
# Check the value of rows that have positive values in c1 AND negative values in c2
df.where((df['c1']>0) & (df['c2']<0))

Unnamed: 0,c1,c2,c3,c4,c5
l1,,,,,
l2,,,,,
l3,,,,,
l4,,,,,
l5,,,,,
l6,0.562777,-0.577687,0.788213,-0.156211,0.479833
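
Unlike boolean indexing, `where()` keeps the original shape. The sketch below, on made-up data, also shows the `other` argument and the complementary `mask()` method.

```python
import pandas as pd

df = pd.DataFrame({'c1': [-1.0, 2.0, -3.0], 'c2': [4.0, 5.0, 6.0]})

cond = df['c1'] < 0

kept = df.where(cond)               # rows failing the condition become NaN
filled = df.where(cond, other=0)    # rows failing the condition become 0
inverse = df.mask(cond)             # mask() replaces where the condition is True
filtered = df[cond]                 # boolean indexing drops the rows instead

print(kept)
print(filled)
print(inverse)
print(filtered.shape, kept.shape)
```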


#### filter()

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.filter.html <br>
*DataFrame.filter(items=None, like=None, regex=None, axis=None)* <br>
Subset the DataFrame rows or columns according to the specified index labels, not the DataFrame's contents.
<br><br>[Back to TOC](#toc)

In [141]:
# Create 3 new columns
df['d1']=1
df['d2']=2
df['d3']=3
df

Unnamed: 0,c1,c2,c3,c4,c5,d1,d2,d3
l1,-0.076629,-0.616865,-0.594993,0.950307,-1.320038,1,2,3
l2,0.330005,1.323038,-2.1652,0.006963,-0.356811,1,2,3
l3,-1.599392,0.033633,-1.208574,0.840661,0.849681,1,2,3
l4,3.341122,0.403435,0.198862,-0.739774,0.272548,1,2,3
l5,-0.0806,-1.004251,0.325248,-0.020234,0.224757,1,2,3
l6,0.562777,-0.577687,0.788213,-0.156211,0.479833,1,2,3


In [142]:
# Filter columns that start with letter 'c'
# Note: regex will be detailed in a few weeks
df.filter(regex='^c')

Unnamed: 0,c1,c2,c3,c4,c5
l1,-0.076629,-0.616865,-0.594993,0.950307,-1.320038
l2,0.330005,1.323038,-2.1652,0.006963,-0.356811
l3,-1.599392,0.033633,-1.208574,0.840661,0.849681
l4,3.341122,0.403435,0.198862,-0.739774,0.272548
l5,-0.0806,-1.004251,0.325248,-0.020234,0.224757
l6,0.562777,-0.577687,0.788213,-0.156211,0.479833


In [151]:
# Filter can also be done with the complete label (using 'items') or part of it (using 'like')
#df.filter(like='2')
df.filter(regex='2$') # regex alternative: labels ending in '2'

Unnamed: 0,c2,d2
l1,-0.616865,2
l2,1.323038,2
l3,0.033633,2
l4,0.403435,2
l5,-1.004251,2
l6,-0.577687,2
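
The three selection arguments of `filter()` and the `axis` argument can be compared side by side. This is a sketch on a made-up one-row DataFrame with illustrative variable names.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=['c1', 'c2', 'd1', 'd2'], index=['l1'])

by_items = df.filter(items=['c1', 'd2'])   # exact labels
by_like = df.filter(like='2')              # labels containing '2'
starts_c = df.filter(regex='^c')           # labels starting with 'c'
rows = df.filter(like='l', axis=0)         # axis=0 filters the row labels instead

print(by_items.columns.tolist())
print(by_like.columns.tolist())
print(starts_c.columns.tolist())
print(rows.index.tolist())
```

Only one of `items`, `like` and `regex` can be passed in a single call.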


### <font color='#BFD72F'>3. MultiIndex and advanced indexing </font> <a class="anchor" id="P3"></a>
  [Back to TOC](#toc)

The MultiIndex/Hierarchical indexing object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects.<br>
https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html

#### MultiIndexing from DataFrame <a class="anchor" id="P3.1"></a>

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.from_frame.html<br>
*classmethod MultiIndex.from_frame(df, sortorder=None, names=None)*

In [3]:
# Create a MultiIndex from a DataFrame
import pandas as pd
df1 = pd.DataFrame(
    [
        ["morning", "measure_one"],
        ["morning", "measure_two"],
        ["afternoon", "measure_one"],
        ["afternoon", "measure_two"],
        ["night", "measure_one"],
        ["night", "measure_two"],
    ],
    columns=["index_1", "index_2"],
)

m_index = pd.MultiIndex.from_frame(df1)
m_index

MultiIndex([(  'morning', 'measure_one'),
            (  'morning', 'measure_two'),
            ('afternoon', 'measure_one'),
            ('afternoon', 'measure_two'),
            (    'night', 'measure_one'),
            (    'night', 'measure_two')],
           names=['index_1', 'index_2'])
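
When the index is the full combination of two lists, as it is here, `MultiIndex.from_product` is a possible shortcut. The sketch below rebuilds the same index both ways and compares them; the helper lists `times` and `measures` are illustrative names.

```python
import pandas as pd

times = ["morning", "afternoon", "night"]
measures = ["measure_one", "measure_two"]

# Same rows as the DataFrame used above, generated with a comprehension
frame = pd.DataFrame([[t, m] for t in times for m in measures],
                     columns=["index_1", "index_2"])

from_frame = pd.MultiIndex.from_frame(frame)
from_product = pd.MultiIndex.from_product([times, measures],
                                          names=["index_1", "index_2"])

print(from_frame.equals(from_product))
print(from_product.nlevels, len(from_product))
```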

In [153]:
# Create a DataFrame with MultiIndexing
df = pd.DataFrame(np.random.randn(6, 4), index=m_index[:6], columns=['station_1','station_2','station_3','station_4'])
df

Unnamed: 0_level_0,Unnamed: 1_level_0,station_1,station_2,station_3,station_4
index_1,index_2,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
morning,measure_one,-1.11039,-2.552172,-0.367237,0.732908
morning,measure_two,0.495317,-0.648151,-0.230684,-1.786846
afternoon,measure_one,0.245475,-0.135267,-1.312154,1.052776
afternoon,measure_two,0.6654,0.512635,0.647473,-0.223894
night,measure_one,0.196007,0.380577,-0.534307,0.056584
night,measure_two,0.060593,-0.015213,1.076551,-0.353717


#### Advanced Indexing <a class="anchor" id="P3.2"></a>
[Back to TOC](#toc)

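The same basic operations apply to a DataFrame with a MultiIndex (with a partial label, df.loc[label] returns a DataFrame rather than a Series):

| Operation | Syntax | Result |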
| ----- | ----- | ---- |
|Select column | df[col] | Series |
| Select row by label | df.loc[label] | Series |
| Select row by integer location | df.iloc[loc] | Series |
| Slice rows | df[start:end] | DataFrame |
| Select rows by boolean vector | df[bool_vec] | DataFrame |

In [154]:
# Select the column 'station_1'
df['station_1']

index_1    index_2    
morning    measure_one   -1.110390
           measure_two    0.495317
afternoon  measure_one    0.245475
           measure_two    0.665400
night      measure_one    0.196007
           measure_two    0.060593
Name: station_1, dtype: float64

In [157]:
# Select more than one column, for instance 'station_1' and 'station_3'
df[['station_1', 'station_3']]

Unnamed: 0_level_0,Unnamed: 1_level_0,station_1,station_3
index_1,index_2,Unnamed: 2_level_1,Unnamed: 3_level_1
morning,measure_one,-1.11039,-0.367237
morning,measure_two,0.495317,-0.230684
afternoon,measure_one,0.245475,-1.312154
afternoon,measure_two,0.6654,0.647473
night,measure_one,0.196007,-0.534307
night,measure_two,0.060593,1.076551


In [158]:
# Select rows by label (loc) using 1 level index
df.loc['morning']

Unnamed: 0_level_0,station_1,station_2,station_3,station_4
index_2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
measure_one,-1.11039,-2.552172,-0.367237,0.732908
measure_two,0.495317,-0.648151,-0.230684,-1.786846


In [159]:
# Select rows by label (loc) using 2 levels index
df.loc['morning','measure_one']

station_1   -1.110390
station_2   -2.552172
station_3   -0.367237
station_4    0.732908
Name: (morning, measure_one), dtype: float64

In [160]:
# Select several rows by label (loc) using a list of 2-level index tuples
df.loc[[('morning','measure_one'),('afternoon','measure_one')]]

Unnamed: 0_level_0,Unnamed: 1_level_0,station_1,station_2,station_3,station_4
index_1,index_2,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
morning,measure_one,-1.11039,-2.552172,-0.367237,0.732908
afternoon,measure_one,0.245475,-0.135267,-1.312154,1.052776


In [161]:
# Select row by integer location (iloc)
df.iloc[1]

station_1    0.495317
station_2   -0.648151
station_3   -0.230684
station_4   -1.786846
Name: (morning, measure_two), dtype: float64

In [162]:
# Select row by integer location (iloc) using negative indexing
df.iloc[-1]

station_1    0.060593
station_2   -0.015213
station_3    1.076551
station_4   -0.353717
Name: (night, measure_two), dtype: float64

In [163]:
# Select row by integer location (iloc), defining the row interval
df.iloc[1:5]

Unnamed: 0_level_0,Unnamed: 1_level_0,station_1,station_2,station_3,station_4
index_1,index_2,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
morning,measure_two,0.495317,-0.648151,-0.230684,-1.786846
afternoon,measure_one,0.245475,-0.135267,-1.312154,1.052776
afternoon,measure_two,0.6654,0.512635,0.647473,-0.223894
night,measure_one,0.196007,0.380577,-0.534307,0.056584


In [164]:
# Slice rows as any other DataFrame
df[1:4]

Unnamed: 0_level_0,Unnamed: 1_level_0,station_1,station_2,station_3,station_4
index_1,index_2,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
morning,measure_two,0.495317,-0.648151,-0.230684,-1.786846
afternoon,measure_one,0.245475,-0.135267,-1.312154,1.052776
afternoon,measure_two,0.6654,0.512635,0.647473,-0.223894


In [165]:
# Select rows by boolean vector
df[df['station_2']<0]

Unnamed: 0_level_0,Unnamed: 1_level_0,station_1,station_2,station_3,station_4
index_1,index_2,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
morning,measure_one,-1.11039,-2.552172,-0.367237,0.732908
morning,measure_two,0.495317,-0.648151,-0.230684,-1.786846
afternoon,measure_one,0.245475,-0.135267,-1.312154,1.052776
night,measure_two,0.060593,-0.015213,1.076551,-0.353717


In [166]:
# Select rows by boolean vector and filter column
df[df['station_2']<0]['station_3']

index_1    index_2    
morning    measure_one   -0.367237
           measure_two   -0.230684
afternoon  measure_one   -1.312154
night      measure_two    1.076551
Name: station_3, dtype: float64

In [167]:
# Select rows and columns by label (loc)
df.loc['morning','station_2':'station_4']

Unnamed: 0_level_0,station_2,station_3,station_4
index_2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
measure_one,-2.552172,-0.367237,0.732908
measure_two,-0.648151,-0.230684,-1.786846


In [168]:
# Select one row (full index tuple) and a range of columns by label (loc)
df.loc[('morning','measure_two'),'station_2':'station_4']

station_2   -0.648151
station_3   -0.230684
station_4   -1.786846
Name: (morning, measure_two), dtype: float64

In [169]:
# Select rows and column by integer location (iloc)
df.iloc[3:6,3]

index_1    index_2    
afternoon  measure_two   -0.223894
night      measure_one    0.056584
           measure_two   -0.353717
Name: station_4, dtype: float64

In [170]:
# Select rows and columns by integer location, without iloc
df[1:4]['station_3']

index_1    index_2    
morning    measure_two   -0.230684
afternoon  measure_one   -1.312154
           measure_two    0.647473
Name: station_3, dtype: float64
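
The selections above all start from the outer level (index_1). One possible way to select on the inner level is `xs()` with the `level` argument, or a boolean vector built from `get_level_values()`. The sketch uses made-up integer data on the same index layout.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["morning", "afternoon", "night"], ["measure_one", "measure_two"]],
    names=["index_1", "index_2"])
df = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx,
                  columns=['station_1', 'station_2'])

# Cross-section on the inner level; the selected level is dropped from the result
inner = df.xs('measure_one', level='index_2')

# Boolean vector on the inner level; both levels are kept
mask = df.index.get_level_values('index_2') == 'measure_one'
same_rows = df[mask]

print(inner)
print(same_rows)
```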

#### Multi to Single Indexing / Reset Index<a class="anchor" id="P3.3"></a>
[Back to TOC](#toc)


https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html <br>
*DataFrame.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='', allow_duplicates=_NoDefault.no_default, names=None)* <br>
Reset the index, or a level of it.

In [172]:
# Just resetting the index will do the job...
df.reset_index(drop=True)  # drop=True discards the old index; with the default drop=False its levels are added as columns

Unnamed: 0,station_1,station_2,station_3,station_4
0,-1.11039,-2.552172,-0.367237,0.732908
1,0.495317,-0.648151,-0.230684,-1.786846
2,0.245475,-0.135267,-1.312154,1.052776
3,0.6654,0.512635,0.647473,-0.223894
4,0.196007,0.380577,-0.534307,0.056584
5,0.060593,-0.015213,1.076551,-0.353717


In [174]:
# ...but can also just drop one level, by label...
df.reset_index(level=['index_2'], drop=True) 

Unnamed: 0_level_0,station_1,station_2,station_3,station_4
index_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
morning,-1.11039,-2.552172,-0.367237,0.732908
morning,0.495317,-0.648151,-0.230684,-1.786846
afternoon,0.245475,-0.135267,-1.312154,1.052776
afternoon,0.6654,0.512635,0.647473,-0.223894
night,0.196007,0.380577,-0.534307,0.056584
night,0.060593,-0.015213,1.076551,-0.353717


In [176]:
# ...or by level position
df.reset_index(level=[1], drop=True) 

Unnamed: 0_level_0,station_1,station_2,station_3,station_4
index_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
morning,-1.11039,-2.552172,-0.367237,0.732908
morning,0.495317,-0.648151,-0.230684,-1.786846
afternoon,0.245475,-0.135267,-1.312154,1.052776
afternoon,0.6654,0.512635,0.647473,-0.223894
night,0.196007,0.380577,-0.534307,0.056584
night,0.060593,-0.015213,1.076551,-0.353717
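
All three calls above use `drop=True`, so the index values are discarded. With the default `drop=False` the levels are kept as ordinary columns, as the sketch below shows on made-up data.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["morning", "night"], ["measure_one", "measure_two"]],
    names=["index_1", "index_2"])
df = pd.DataFrame({'station_1': range(4)}, index=idx)

flat = df.reset_index()                     # both levels become columns
partial = df.reset_index(level='index_2')   # only index_2 becomes a column
dropped = df.reset_index(drop=True)         # index values are discarded

print(flat.columns.tolist())
print(partial.index.name, partial.columns.tolist())
print(dropped.columns.tolist())
```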


### <font color='#BFD72F'>4. Try it out </font> <a class="anchor" id="P4"></a>
  [Back to TOC](#toc)

In [180]:
# Run this line to get a new DataFrame (netflix user database)
df = pd.read_csv('./netflix_dataset.csv', index_col=0)
df.head(10)

Unnamed: 0,User ID,Subscription Type,Monthly Revenue,Join Date,Last Payment Date,Country,Age,Gender,Device,Plan Duration
0,1,Basic,10,15-01-22,10-06-23,United States,28,Male,Smartphone,1 Month
1,2,Premium,15,05-09-21,22-06-23,Canada,35,Female,Tablet,1 Month
2,3,Standard,12,28-02-23,27-06-23,United Kingdom,42,Male,Smart TV,1 Month
3,4,Standard,12,10-07-22,26-06-23,Australia,51,Female,Laptop,1 Month
4,5,Basic,10,01-05-23,28-06-23,Germany,33,Male,Smartphone,1 Month
5,6,Premium,15,18-03-22,27-06-23,France,29,Female,Smart TV,1 Month
6,7,Standard,12,09-12-21,25-06-23,Brazil,46,Male,Tablet,1 Month
7,8,Basic,10,02-04-23,24-06-23,Mexico,39,Female,Laptop,1 Month
8,9,Standard,12,20-10-22,23-06-23,Spain,37,Male,Smartphone,1 Month
9,10,Premium,15,07-01-23,22-06-23,Italy,44,Female,Smart TV,1 Month


In [187]:
df.info() # the FIRST thing to do: shows dtypes and non-null counts, so missing values are easy to spot

<class 'pandas.core.frame.DataFrame'>
Index: 2500 entries, 0 to 2499
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   User ID            2500 non-null   int64  
 1   Subscription Type  2500 non-null   object 
 2   Monthly Revenue    2500 non-null   float64
 3   Join Date          2500 non-null   object 
 4   Last Payment Date  2500 non-null   object 
 5   Country            2500 non-null   object 
 6   Age                2500 non-null   int64  
 7   Gender             2500 non-null   object 
 8   Device             2500 non-null   object 
 9   Plan Duration      2500 non-null   object 
dtypes: float64(1), int64(2), object(7)
memory usage: 214.8+ KB


In [183]:
df['Monthly Revenue'] = df['Monthly Revenue'].astype('float')

In [184]:
df.describe() # by default only numeric columns are summarized

Unnamed: 0,User ID,Monthly Revenue,Age
count,2500.0,2500.0,2500.0
mean,1250.5,12.5084,38.7956
std,721.83216,1.686851,7.171778
min,1.0,10.0,26.0
25%,625.75,11.0,32.0
50%,1250.5,12.0,39.0
75%,1875.25,14.0,45.0
max,2500.0,15.0,51.0


In [185]:
df.describe(include='int') # only integer columns are summarized

Unnamed: 0,User ID,Age
count,2500.0,2500.0
mean,1250.5,38.7956
std,721.83216,7.171778
min,1.0,26.0
25%,625.75,32.0
50%,1250.5,39.0
75%,1875.25,45.0
max,2500.0,51.0


In [186]:
df.describe(include='object') # only object (text) columns are summarized

Unnamed: 0,Subscription Type,Join Date,Last Payment Date,Country,Gender,Device,Plan Duration
count,2500,2500,2500,2500,2500,2500,2500
unique,3,300,26,10,2,4,1
top,Basic,05-11-22,28-06-23,United States,Female,Laptop,1 Month
freq,999,33,164,451,1257,636,2500


In [202]:
# Filter columns by data type, to obtain only the numeric ones (as seen in class)

a = df.select_dtypes(include='number').columns
df[a]

Unnamed: 0,User ID,Monthly Revenue,Age
0,1,10.0,28
1,2,15.0,35
2,3,12.0,42
3,4,12.0,51
4,5,10.0,33
...,...,...,...
2495,2496,14.0,28
2496,2497,15.0,33
2497,2498,12.0,38
2498,2499,13.0,48


In [205]:
# Filter all columns that contain text (what data type is it?)

b = df.select_dtypes(include='object').columns
df[b]

Unnamed: 0,Subscription Type,Join Date,Last Payment Date,Country,Gender,Device,Plan Duration
0,Basic,15-01-22,10-06-23,United States,Male,Smartphone,1 Month
1,Premium,05-09-21,22-06-23,Canada,Female,Tablet,1 Month
2,Standard,28-02-23,27-06-23,United Kingdom,Male,Smart TV,1 Month
3,Standard,10-07-22,26-06-23,Australia,Female,Laptop,1 Month
4,Basic,01-05-23,28-06-23,Germany,Male,Smartphone,1 Month
...,...,...,...,...,...,...,...
2495,Premium,25-07-22,12-07-23,Spain,Female,Smart TV,1 Month
2496,Basic,04-08-22,14-07-23,Spain,Female,Smart TV,1 Month
2497,Standard,09-08-22,15-07-23,United States,Male,Laptop,1 Month
2498,Standard,12-08-22,12-07-23,Canada,Female,Tablet,1 Month


In [215]:
# Using query(), filter subscriptions of users between 30 and 40 years old, and revenue equal to 15
# Note: columns with spaces must be surrounded by backticks (`Monthly Revenue`)

df.query('Age>29 & Age<41 & `Monthly Revenue`==15')

# or df[df... & df...]

Unnamed: 0,User ID,Subscription Type,Monthly Revenue,Join Date,Last Payment Date,Country,Age,Gender,Device,Plan Duration
1,2,Premium,15.0,05-09-21,22-06-23,Canada,35,Female,Tablet,1 Month
15,16,Premium,15.0,07-04-22,27-06-23,France,36,Male,Tablet,1 Month
18,19,Premium,15.0,15-02-23,23-06-23,Spain,32,Female,Smart TV,1 Month
28,29,Premium,15.0,19-12-22,23-06-23,Spain,36,Female,Laptop,1 Month
35,36,Premium,15.0,01-03-22,27-06-23,France,35,Male,Tablet,1 Month
...,...,...,...,...,...,...,...,...,...,...
2459,2460,Basic,15.0,12-11-22,13-07-23,Germany,35,Female,Smart TV,1 Month
2464,2465,Basic,15.0,03-11-22,11-07-23,Italy,30,Male,Tablet,1 Month
2466,2467,Basic,15.0,26-10-22,11-07-23,Spain,40,Female,Smartphone,1 Month
2496,2497,Basic,15.0,04-08-22,14-07-23,Spain,33,Female,Smart TV,1 Month
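
The boolean-vector alternative hinted at in the cell above could look like the sketch below. The rows are made up and only reuse the column names of the dataset; `between()` includes both limits by default.

```python
import pandas as pd

# Made-up rows, for illustration only
df = pd.DataFrame({
    'Age': [28, 35, 40, 44, 30],
    'Monthly Revenue': [10.0, 15.0, 15.0, 15.0, 12.0],
})

via_query = df.query('Age > 29 and Age < 41 and `Monthly Revenue` == 15')
via_mask = df[df['Age'].between(30, 40) & (df['Monthly Revenue'] == 15)]

print(via_query)
print(via_query.equals(via_mask))
```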


#### That's all for today. Feel free to complement your knowledge with the online documentation.
*https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html*
