# Machine Learning

## HSE, 2024-25

# Seminar 2. Pandas

*   [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html)
*   [Cookbook: lots of useful recipes and good practices](https://pandas.pydata.org/docs/user_guide/cookbook.html)

There are two main objects in pandas: pd.Series and pd.DataFrame. A Series is essentially a one-dimensional data array with additional metadata (an index and a name), while a DataFrame is essentially a table whose columns are Series that share a common index.

## Creating an object



### Pandas Series
Let's start with pd.Series. Just as for a numpy array, we can set the data type. All numpy data types are available, an existing series can be converted to another type with astype, and you can also supply [your own functions](https://pbpython.com/pandas_dtypes.html) for the conversion.

In [None]:
import numpy as np
import pandas as pd

In [None]:
s = pd.Series([1,2,3], dtype=np.int32, name='numbers') # pd.Series
s

Unnamed: 0,numbers
0,1
1,2
2,3
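
As a small sketch of the type conversions mentioned above (the values and variable names are made up for illustration): astype performs a direct cast, while pd.to_numeric with errors='coerce' is one possible option when some entries cannot be parsed.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3], dtype=np.int32, name='numbers')
floats = ints.astype(np.float64)              # direct cast to another numpy dtype

raw = pd.Series(['10', '20', 'oops'])         # strings, one of them is not a number
parsed = pd.to_numeric(raw, errors='coerce')  # entries that cannot be parsed become NaN

print(floats.dtype)
print(parsed)
```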


In [None]:
s['Lada'] = 10
s

Unnamed: 0,numbers
0,1
1,2
2,3
Lada,10


Note the column on the left: this is the index, and unless otherwise specified it is created automatically. We will encounter indexes in both pd.Series and pd.DataFrame. What do they give us? The analogy is a telephone directory: an index lets you organize information more logically and perform some operations on series (pd.Series) and dataframes (pd.DataFrame) more efficiently. In short, an index
1. Identifies data (provides metadata) using known labels that matter for analysis, visualization, and display in an interactive console.
2. Enables automatic and explicit data alignment.
3. Lets you intuitively get and set subsets of a dataset.
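
Automatic alignment from the list above can be sketched as follows (the two series are invented for the example): arithmetic matches values by label rather than by position.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['c', 'b', 'd'])

# Values are matched by label, not by position;
# labels present on only one side produce NaN.
total = left + right
print(total)
```

Here 'b' and 'c' are summed with their namesakes, while 'a' and 'd' have no partner and become NaN.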

In addition, note that the series also has a name. This is useful when we need to insert a new column into a DataFrame without explicitly specifying a name.

We can specify an arbitrary index as follows; our entries are now identified by the letters a, b, c.

In [None]:
s = pd.Series([1,2,3], dtype=np.int32, name='numbers', index=['a', 'b', 'c'])
s

Unnamed: 0,numbers
a,1
b,2
c,3


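Square brackets on a series with a custom index (property s.index) select by label. Positional (integer) access is still available, but through .iloc, which is covered later in this seminar.

Below is a selection by label with plain square brackets, much like looking up a key in a dictionary.

In [None]:
s['c']  # selection by the label 'c'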

3

The .loc indexer selects strictly by index label.
Note the square brackets: .loc is an indexer rather than a regular method, so the selection looks the same as a selection from a regular list or dictionary.

In [None]:
s.loc['c']

3
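
The difference between labels and positions is easiest to see when the labels are themselves integers. A minimal sketch with an invented series:

```python
import pandas as pd

numbers = pd.Series(['one', 'two', 'three'], index=[10, 20, 30])

by_label = numbers.loc[10]      # the entry whose index label is 10
by_position = numbers.iloc[0]   # the entry at position 0, whatever its label
last = numbers.iloc[-1]         # positions can also be counted from the end

print(by_label, by_position, last)
```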

You can view the index separately using the .index property

In [None]:
s.index[0:2]

Index(['a', 'b'], dtype='object')

### Pandas DataFrame

A pandas DataFrame is a two-dimensional object (the analogue of a two-dimensional NumPy array, i.e. a matrix).

Let's create a pandas DataFrame from a random numpy matrix

In [None]:
m = np.random.rand(5,3)
df = pd.DataFrame(m)
df

Unnamed: 0,0,1,2
0,0.016619,0.513514,0.22862
1,0.199597,0.308959,0.256003
2,0.224755,0.282382,0.656739
3,0.925231,0.365754,0.044201
4,0.587947,0.197427,0.655381


We see a row index that was created automatically, as well as a column index (or simply columns), which was also set automatically. Numbered columns are not very readable, so let's give the columns more descriptive names.

In [None]:
np.random.rand(5,3)

In [None]:
m = np.random.rand(5,3)
df = pd.DataFrame(data=m, columns=['first', 'second', 'third'],)
df['name'] = ['Dima', 'Ivan', 'Nikolay', 'Dima', 'Pavel']
df

Unnamed: 0,first,second,third,name
0,0.887491,0.294756,0.168454,Dima
1,0.291785,0.914294,0.217355,Ivan
2,0.908683,0.701471,0.576963,Nikolay
3,0.136764,0.232178,0.982347,Dima
4,0.236495,0.134165,0.725935,Pavel


In pandas DataFrame, selection with square brackets occurs by columns

In [None]:
df['first']
df['name']
df.name

Unnamed: 0,name
0,Dima
1,Ivan
2,Nikolay
3,Dima
4,Pavel


In [None]:
type(df['name'])

In [None]:
df['name'].unique()

array(['Dima', 'Ivan', 'Nikolay', 'Pavel'], dtype=object)

In [None]:
type(df[['name']])

In [None]:
df[['name']].unique()

AttributeError: 'DataFrame' object has no attribute 'unique'

In [None]:
df[['name']]

Unnamed: 0,name
0,Dima
1,Ivan
2,Nikolay
3,Dima
4,Pavel


In [None]:
df[0] # there is no such column, there will be an error

KeyError: 0

However, if we apply slicing as with regular numpy arrays or lists, the selection is made by rows. This is a quirk that simply has to be remembered. The slice uses positional integer indexing (0, 1, 2, 3, 4, ...).

In [None]:
df[:2]


Unnamed: 0,first,second,third,name
0,0.887491,0.294756,0.168454,Dima
1,0.291785,0.914294,0.217355,Ivan


In [None]:
df['index'] = [1, 2, 3, 4, 1]
df.set_index('index', inplace=True)
df

Unnamed: 0_level_0,first,second,third,name
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,0.887491,0.294756,0.168454,Dima
2,0.291785,0.914294,0.217355,Ivan
3,0.908683,0.701471,0.576963,Nikolay
4,0.136764,0.232178,0.982347,Dima
1,0.236495,0.134165,0.725935,Pavel


In [None]:
df.loc[[1, 2], ['name', 'third']]

Unnamed: 0_level_0,name,third
index,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Dima,0.168454
1,Pavel,0.725935
2,Ivan,0.217355


There is a convenient way to initialize a new DataFrame from a dictionary. The keys become the column names, and the values become the columns.

In [None]:
# pd.DataFrame through the dictionary
d = {
    'name': ['Dmitry', 'Alexey', 'Vladimir', 'Elena'],
    'age': [24, 25, 30, 40]
}
pd.DataFrame(d)

Unnamed: 0,name,age
0,Dmitry,24
1,Alexey,25
2,Vladimir,30
3,Elena,40


In [None]:
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'x': 10},
           {'a': 100, 'b': 200,'d': 400},
           {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]
data = pd.DataFrame(mydict)
data


Unnamed: 0,a,b,c,d,x
0,1,2,3.0,4,10.0
1,100,200,,400,
2,1000,2000,3000.0,4000,


## View table

By default, a Colab notebook (or Jupyter notebook) truncates the display of tables: a table with many rows would take up a lot of space and could slow down or even hang your browser.

In [None]:
pd.DataFrame(np.random.rand(100,2)) # this will skip a few lines to save space

Unnamed: 0,0,1
0,0.117413,0.895163
1,0.098924,0.518112
2,0.345985,0.144807
3,0.587942,0.250417
4,0.143252,0.467746
...,...,...
95,0.185550,0.984437
96,0.713008,0.829795
97,0.426212,0.700351
98,0.482440,0.791235
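
The truncation limit is a display option. As a sketch (the limit of 6 rows is an arbitrary choice for the example), pd.option_context changes it temporarily:

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(np.random.rand(100, 2))

# Temporarily lower the limit; the previous setting is restored on exit.
with pd.option_context('display.max_rows', 6):
    shortened = repr(wide)

print(shortened)
```

pd.set_option('display.max_rows', ...) changes the same option for the rest of the session.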


However, it is unlikely that you will need to look through 100,000 rows of a table manually. As a rule, we only need to look at the first few rows of the table to understand what is there and whether we read our table correctly from the file.

In [None]:
df.head(2) # show the first two lines of the dataframe

Unnamed: 0,first,second,third,name
0,0.456191,0.922987,0.962495,Dima
1,0.524134,0.06222,0.634252,Ivan


In [None]:
df.tail(2) # last two lines from the end

Unnamed: 0,first,second,third,name
3,0.361775,0.991741,0.273335,Dima
4,0.532018,0.623318,0.916571,Pavel


In [None]:
df.sample(2)

Unnamed: 0,first,second,third,name
1,0.524134,0.06222,0.634252,Ivan
4,0.532018,0.623318,0.916571,Pavel


We can view the row index and the columns separately using the corresponding object properties

In [None]:
df.index

Index([1, 2, 3, 4, 1], dtype='int64', name='index')

In [None]:
df.columns

Index(['first', 'second', 'third', 'name'], dtype='object')

In [None]:
df.columns[1:3]

Index(['second', 'third'], dtype='object')

Find out the shape of our table

In [None]:
df.shape

(5, 4)

In [None]:
df.shape[0]

5

In [None]:
df.shape[1]

4

View data types

In [None]:
df.dtypes

Unnamed: 0,0
first,float64
second,float64
third,float64
name,object


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 1 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   first   5 non-null      float64
 1   second  5 non-null      float64
 2   third   5 non-null      float64
 3   name    5 non-null      object 
dtypes: float64(3), object(1)
memory usage: 372.0+ bytes


We can change them using the astype method. Note that we can pass a dictionary in which the keys are column names and the values are the data types to which the corresponding columns should be converted.

In [None]:
df = df.astype({'first': np.float32})

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 1 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   first   5 non-null      float32
 1   second  5 non-null      float64
 2   third   5 non-null      float64
 3   name    5 non-null      object 
dtypes: float32(1), float64(2), object(1)
memory usage: 352.0+ bytes


We can discard all metadata and switch to a numpy array in order to work with functions from the numpy library. Because our table mixes numeric and string columns, the resulting array has dtype=object.

In [None]:
df.to_numpy()

array([[0.4561910629272461, 0.9229871036614111, 0.9624945896343942,
        'Dima'],
       [0.5241341590881348, 0.062220130041355626, 0.6342517233358534,
        'Ivan'],
       [0.38148126006126404, 0.4247111638505364, 0.2297194061794231,
        'Nikolay'],
       [0.3617745637893677, 0.9917409677883949, 0.27333524048243296,
        'Dima'],
       [0.5320175886154175, 0.6233180390611812, 0.9165708625555892,
        'Pavel']], dtype=object)
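
If a numeric array is needed, one possible approach is to keep only the numeric columns before converting. A sketch on a made-up table:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'first': [0.1, 0.2, 0.3],
    'second': [1.0, 2.0, 3.0],
    'name': ['Dima', 'Ivan', 'Pavel'],
})

mixed = people.to_numpy()                                     # strings force dtype=object
numeric = people.select_dtypes(include='number').to_numpy()   # numeric columns only

print(mixed.dtype, numeric.dtype, numeric.shape)
```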

The .describe() method is extremely useful: it gives descriptive statistics for the numeric columns of our dataframe.

In [None]:
df_desc = df.describe()
df_desc

Unnamed: 0,first,second,third
count,5.0,5.0,5.0
mean,0.45112,0.604995,0.603274
std,0.078634,0.380176,0.345165
min,0.361775,0.06222,0.229719
25%,0.381481,0.424711,0.273335
50%,0.456191,0.623318,0.634252
75%,0.524134,0.922987,0.916571
max,0.532018,0.991741,0.962495


The info method also lets us check for missing values right away (the Non-Null Count column) and see how much memory the table occupies (the less, the better, of course).

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 1 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   first   5 non-null      float32
 1   second  5 non-null      float64
 2   third   5 non-null      float64
 3   name    5 non-null      object 
dtypes: float32(1), float64(2), object(1)
memory usage: 352.0+ bytes


In [None]:
df.astype({'first': np.float16}).info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 1 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   first   5 non-null      float16
 1   second  5 non-null      float64
 2   third   5 non-null      float64
 3   name    5 non-null      object 
dtypes: float16(1), float64(2), object(1)
memory usage: 342.0+ bytes
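
The memory figures reported by info can also be obtained per column with memory_usage. A rough sketch on synthetic data (the column name value is an assumption of the example):

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({'value': np.random.rand(1000)})   # float64 by default

before = table.memory_usage(index=False)['value']       # bytes used by the column
after = table.astype({'value': np.float32}).memory_usage(index=False)['value']

print(before, after)   # 8 bytes per float64 value, 4 bytes per float32 value
```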


In [None]:
pd.isna(df['first'])

By and large, the columns are the same kind of index, only along the horizontal axis (axis=1). Row and column indexes can swap places; let's demonstrate this with the transpose operation.

![axes](https://railsware.com/blog/wp-content/uploads/2018/11/data-frame-axes.png)

![axes](https://i.stack.imgur.com/FzimB.png)

In [None]:
df.T

index,1,2,3,4,1
first,0.887491,0.291785,0.908683,0.136764,0.236495
second,0.294756,0.914294,0.701471,0.232178,0.134165
third,0.168454,0.217355,0.576963,0.982347,0.725935
name,Dima,Ivan,Nikolay,Dima,Pavel


We can sort table rows by column values. Note that each row keeps its index label.

In [None]:
df.sort_values(['name','first'], ascending=[False, True]) # name in descending order, then first in ascending order

Unnamed: 0_level_0,first,second,third,name
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,0.236495,0.134165,0.725935,Pavel
3,0.908683,0.701471,0.576963,Nikolay
2,0.291785,0.914294,0.217355,Ivan
4,0.136764,0.232178,0.982347,Dima
1,0.887491,0.294756,0.168454,Dima


We can also sort by the index labels; with axis=1 the columns are ordered by name.

In [None]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,third,second,name,first
0,0.962495,0.922987,Dima,0.456191
1,0.634252,0.06222,Ivan,0.524134
2,0.229719,0.424711,Nikolay,0.381481
3,0.273335,0.991741,Dima,0.361775
4,0.916571,0.623318,Pavel,0.532018


## Samples and slices
Detailed information on selecting data is provided [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html).

### Square brackets
As noted above, for dataframes square brackets select columns.

In [None]:
df['third']

Unnamed: 0_level_0,third
index,Unnamed: 1_level_1
1,0.168454
2,0.217355
3,0.576963
4,0.982347
1,0.725935


In [None]:
df[np.array(['third', 'first'])]

Unnamed: 0_level_0,third,first
index,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.168454,0.887491
2,0.217355,0.291785
3,0.576963,0.908683
4,0.982347,0.136764
1,0.725935,0.236495


In [None]:
df[1:4] # slice rows by position, as in a list or array

Unnamed: 0,first,second,third,name
1,0.524134,0.06222,0.634252,Ivan
2,0.381481,0.424711,0.229719,Nikolay
3,0.361775,0.991741,0.273335,Dima


### Sampling by label (loc)
Let's add a new column to our table and make it a new index using the .set_index() method

In [None]:
df = df.reset_index()
df

Unnamed: 0,index,first,second,third,name
0,1,0.887491,0.294756,0.168454,Dima
1,2,0.291785,0.914294,0.217355,Ivan
2,3,0.908683,0.701471,0.576963,Nikolay
3,4,0.136764,0.232178,0.982347,Dima
4,1,0.236495,0.134165,0.725935,Pavel


In [None]:
df['new_index'] = pd.Series(['a', 'b', 'e', 'c', 'g'])
df

Unnamed: 0,index,first,second,third,name,new_index
0,1,0.887491,0.294756,0.168454,Dima,a
1,2,0.291785,0.914294,0.217355,Ivan,b
2,3,0.908683,0.701471,0.576963,Nikolay,e
3,4,0.136764,0.232178,0.982347,Dima,c
4,1,0.236495,0.134165,0.725935,Pavel,g


In [None]:
df = df.set_index('new_index')
df = df.drop('index', axis = 1)
df

Unnamed: 0_level_0,first,second,third,name
new_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
a,0.887491,0.294756,0.168454,Dima
b,0.291785,0.914294,0.217355,Ivan
e,0.908683,0.701471,0.576963,Nikolay
c,0.136764,0.232178,0.982347,Dima
g,0.236495,0.134165,0.725935,Pavel


Now using .loc we can navigate by this index

In [None]:
df.loc['b']

Unnamed: 0,b
first,0.291785
second,0.914294
third,0.217355
name,Ivan


We can even use ranges (slices) of index labels. Unlike ordinary Python slices, a .loc slice includes both endpoints.

In [None]:
df.loc['b':'c']

Unnamed: 0_level_0,first,second,third,name
new_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
b,0.291785,0.914294,0.217355,Ivan
e,0.908683,0.701471,0.576963,Nikolay
c,0.136764,0.232178,0.982347,Dima
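
To make the endpoint behaviour of label slices concrete, here is a sketch that compares .loc and .iloc on a small invented table with the same index labels:

```python
import pandas as pd

frame = pd.DataFrame({'value': [1, 2, 3, 4, 5]}, index=['a', 'b', 'e', 'c', 'g'])

by_label = frame.loc['b':'c']     # rows b, e, c: the stop label is included
by_position = frame.iloc[1:3]     # rows b, e: the stop position is excluded

print(list(by_label.index))
print(list(by_position.index))
```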


After a comma we can also specify which columns to select; the row slice may include a step as well

In [None]:
df.loc['b':'c':2, ['third', 'first']]

Unnamed: 0_level_0,third,first
new_index,Unnamed: 1_level_1,Unnamed: 2_level_1
b,0.217355,0.291785
c,0.982347,0.136764


### Selection by position in the table (iloc)
Positional integer indexing is preserved and is available through the .iloc indexer. It also uses square brackets

In [None]:
df

Unnamed: 0_level_0,first,second,third,name
new_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
a,0.887491,0.294756,0.168454,Dima
b,0.291785,0.914294,0.217355,Ivan
e,0.908683,0.701471,0.576963,Nikolay
c,0.136764,0.232178,0.982347,Dima
g,0.236495,0.134165,0.725935,Pavel


In [None]:
df.iloc[1:3]

Unnamed: 0_level_0,first,second,third,name
new_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
b,0.291785,0.914294,0.217355,Ivan
e,0.908683,0.701471,0.576963,Nikolay


The selection occurs exactly by **row number** and **column number** (starting from zero)

In [None]:
df.iloc[[0, 2, 3], [1, 2]]

Unnamed: 0_level_0,second,third
new_index,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.294756,0.168454
e,0.701471,0.576963
c,0.232178,0.982347


## Reading and writing data

The pandas library has a huge number of capabilities for reading and writing data.

For example, pd.read_csv accepts format-specific options (such as the column separator sep), and you can also extend the list of values that pandas treats as missing by default by explicitly setting the na_values parameter.

More detailed information about reading and writing data can be found [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)
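
A minimal sketch of the sep and na_values parameters. The file contents are hypothetical and are read from a string through io.StringIO so that the example does not depend on any file:

```python
import io
import pandas as pd

# Hypothetical file contents: ';' is the separator, '?' marks a missing value.
text = "name;age\nDmitry;24\nElena;?\n"

people = pd.read_csv(io.StringIO(text), sep=';', na_values=['?'])
print(people)
print(people['age'].isna().sum())
```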

In this section and further we will work with the iris.csv dataset

The data itself can be downloaded from here https://drive.google.com/file/d/1fjyopp9FZ-g6KIsIE8vPX2r62A43h2XI/view?usp=sharing

To upload the file to your notebook's storage, you can use the cell below:

In [None]:
from google.colab import files
uploaded = files.upload()

Saving iris.csv to iris.csv


Let's load and open the dataset itself:

In [None]:
iris = pd.read_csv('iris.csv', header='infer', sep=',')
iris

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Virginica
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica
148,6.2,3.4,5.4,2.3,Virginica


In [None]:
# typing pd.read_ and pressing Tab lists the many available readers, e.g. pd.read_csv, pd.read_excel, pd.read_json

In [None]:
# to_csv() and others

In [None]:
iris.to_csv('testtest.csv', header=True, index=False) # we save the list of columns as the first line, we do NOT write the index as the first column

In [None]:
!ls

iris.csv  mikhail.csv  sample_data


Let's now practice on a real dataset and repeat the selections and slices that we have just covered!

### Tasks for independent solution (for data sampling)

1. Print the first 4 rows and the first 2 columns using .iloc
2. Print only the `sepal.length` and `petal.length` columns using .loc and/or square brackets
3. Make the ***variety*** column the index using the .set_index() method, and select only the 'Setosa' species using .loc

In [None]:
# 1. your code here

iris.iloc[:4, :2]

Unnamed: 0,sepal.length,sepal.width
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1


In [None]:
# 2. your code here

iris[['sepal.length', 'petal.length']]

Unnamed: 0,sepal.length,petal.length
0,5.1,1.4
1,4.9,1.4
2,4.7,1.3
3,4.6,1.5
4,5.0,1.4
...,...,...
145,6.7,5.2
146,6.3,5.0
147,6.5,5.2
148,6.2,5.4


In [None]:
iris.loc[:, ['sepal.length', 'petal.length']]

Unnamed: 0,sepal.length,petal.length
0,5.1,1.4
1,4.9,1.4
2,4.7,1.3
3,4.6,1.5
4,5.0,1.4
...,...,...
145,6.7,5.2
146,6.3,5.0
147,6.5,5.2
148,6.2,5.4


In [None]:
# 3. your code here

iris.set_index('variety').loc['Setosa']

Unnamed: 0_level_0,sepal.length,sepal.width,petal.length,petal.width
variety,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Setosa,5.1,3.5,1.4,0.2
Setosa,4.9,3.0,1.4,0.2
Setosa,4.7,3.2,1.3,0.2
Setosa,4.6,3.1,1.5,0.2
Setosa,5.0,3.6,1.4,0.2
Setosa,5.4,3.9,1.7,0.4
Setosa,4.6,3.4,1.4,0.3
Setosa,5.0,3.4,1.5,0.2
Setosa,4.4,2.9,1.4,0.2
Setosa,4.9,3.1,1.5,0.1


## Data filtering (sampling by mask)
Just as in numpy, pandas supports selection by mask, but the mechanism is somewhat different. In numpy the mask is an array of True and False values with the same shape as the data, where each element says whether the corresponding value is taken (True) or not (False). In pandas the mask is typically a pandas Series of True and False values **with the same index** as the original dataframe or series. That is, we indicate which rows go into the result and which do not.

In [None]:
vec = np.random.rand(3, 2)
vec

array([[0.61543947, 0.99788564],
       [0.19776353, 0.49429452],
       [0.83293117, 0.01492045]])

In [None]:
vec > 0.5

array([[ True,  True],
       [False, False],
       [ True, False]])

In [None]:
vec[vec > 0.5]

array([0.61543947, 0.99788564, 0.83293117])

In [None]:
vec_df = pd.DataFrame(vec)
vec_df > 0.5

Unnamed: 0,0,1
0,True,True
1,False,False
2,True,False


In [None]:
vec_df[(vec_df > 0.5).all(axis=1)] # keep the rows in which every value exceeds 0.5

Unnamed: 0,0,1
0,0.615439,0.997886


In [None]:
iris['sepal.length'] > 5.0

Unnamed: 0,sepal.length
0,True
1,False
2,False
3,False
4,False
...,...
145,True
146,True
147,True
148,True


Of course, conditions can be combined with the element-wise operators & (and), | (or) and ~ (not); each condition must be wrapped in parentheses

In [None]:
iris[(iris['sepal.length'] > 5.0) & (iris['sepal.width'] <= 3.0)]

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
53,5.5,2.3,4.0,1.3,Versicolor
54,6.5,2.8,4.6,1.5,Versicolor
55,5.7,2.8,4.5,1.3,Versicolor
58,6.6,2.9,4.6,1.3,Versicolor
59,5.2,2.7,3.9,1.4,Versicolor
...,...,...,...,...,...
142,5.8,2.7,5.1,1.9,Virginica
145,6.7,3.0,5.2,2.3,Virginica
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica


In [None]:
type(vec_df) == type(vec)

False

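We can even reorder the rows, and the selection will still be the same, because the mask is matched to the rows by index label rather than by position! (pandas emits a UserWarning here to say that the mask is being reindexed.)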

In [None]:
iris.sort_values(['sepal.length', 'petal.length'])[(iris['sepal.length'] > 5.0) & (iris['sepal.width'] <= 3.0)]
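
The index matching can be checked on a tiny example. This is a sketch with invented data; it uses .loc, which aligns a boolean Series with the rows by label:

```python
import pandas as pd

frame = pd.DataFrame({'value': [5, 1, 4, 2]}, index=['a', 'b', 'c', 'd'])
mask = frame['value'] > 3                  # True for rows a and c

shuffled = frame.sort_values('value')      # row order is now b, d, c, a

# The mask is aligned with the rows by label, so a and c are still chosen.
selected = shuffled.loc[mask]
print(selected)
```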

To check a column against a list of allowed values, use the .isin() method

In [None]:
iris['variety']

Unnamed: 0,variety
0,Setosa
1,Setosa
2,Setosa
3,Setosa
4,Setosa
...,...
145,Virginica
146,Virginica
147,Virginica
148,Virginica


In [None]:
iris['variety'].isin(['Setosa', 'Virginica']) # membership check against several values
# equivalent: ((iris['variety'] == 'Setosa') | (iris['variety'] == 'Virginica'))
# does NOT work (raises ValueError): (iris['variety'] in ['Setosa', 'Virginica'])

Unnamed: 0,variety
0,True
1,True
2,True
3,True
4,True
...,...
145,True
146,True
147,True
148,True


## Inserting values
Values can be inserted using the .loc and .iloc indexers, as well as the .at accessor. The difference is that .loc and .iloc are more general and let you change an entire range at once, in which case the shape of the inserted data must match the selection. .at, in turn, sets a single cell in place and makes that intent clearer to the reader of the code.

In [None]:
df = pd.DataFrame(np.random.rand(6,3),
                  index=['a','b','c','d','e','f'],
                  columns=['first', 'second', 'third'])
df

Unnamed: 0,first,second,third
a,0.542129,0.14306,0.865551
b,0.2252,0.342926,0.425872
c,0.59816,0.564947,0.637436
d,0.890208,0.84497,0.391347
e,0.194378,0.723726,0.979219
f,0.653162,0.564815,0.714769


In [None]:
df.loc['b','first'] = 1.0
df

Unnamed: 0,first,second,third
a,0.542129,0.14306,0.865551
b,1.0,0.342926,0.425872
c,0.59816,0.564947,0.637436
d,0.890208,0.84497,0.391347
e,0.194378,0.723726,0.979219
f,0.653162,0.564815,0.714769


In [None]:
df.loc['a':'c', 'first'] = [1, 2, 3]
df

Unnamed: 0,first,second,third
a,1.0,0.14306,0.865551
b,2.0,0.342926,0.425872
c,3.0,0.564947,0.637436
d,0.890208,0.84497,0.391347
e,0.194378,0.723726,0.979219
f,0.653162,0.564815,0.714769


In [None]:
df.at['e', 'second'] = 100
df

Unnamed: 0,first,second,third
a,1.0,0.14306,0.865551
b,2.0,0.342926,0.425872
c,3.0,0.564947,0.637436
d,0.890208,0.84497,0.391347
e,0.194378,100.0,0.979219
f,0.653162,0.564815,0.714769
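
The requirement that the inserted data match the selection can be illustrated with a short sketch (the table of zeros is made up for the example): a range assignment with the wrong number of values is rejected.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.zeros((4, 2)), index=['a', 'b', 'c', 'd'],
                     columns=['first', 'second'])

frame.loc['a':'c', 'first'] = [1, 2, 3]   # three rows selected, three values given
frame.at['d', 'second'] = 100             # a single cell

mismatch_raised = False
try:
    frame.loc['a':'c', 'first'] = [1, 2]  # three rows selected, only two values
except ValueError:
    mismatch_raised = True

print(frame)
print(mismatch_raised)
```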


## Missing values
By default, missing values are excluded from calculations. Most often a missing value is represented by np.nan (Not a Number), or by None for non-numeric types.

In [None]:
# let's make some missing values on purpose
df.at['e', 'second'] = np.nan
df.at['e', 'third'] = np.nan
df

Unnamed: 0,first,second,third
a,1.0,0.14306,0.865551
b,2.0,0.342926,0.425872
c,3.0,0.564947,0.637436
d,0.890208,0.84497,0.391347
e,0.194378,,
f,0.653162,0.564815,0.714769


The .isna() method returns a map of the missing values: a value is missing wherever the map is True.

In [None]:
df.isna()

Unnamed: 0,first,second,third
a,False,False,False
b,False,False,False
c,False,False,False
d,False,False,False
e,False,True,True
f,False,False,False


In [None]:
df.isna().any().any()

True
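
Since True counts as 1, the same map can be summed to count the missing values. A sketch on a small invented table:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'first': [1.0, 2.0, 3.0],
                      'second': [0.5, np.nan, 0.7],
                      'third': [np.nan, np.nan, 0.9]})

per_column = frame.isna().sum()        # number of missing values in each column
share = frame.isna().mean()            # fraction of missing values in each column
total = int(frame.isna().sum().sum())  # missing values in the whole table

print(per_column)
print(total)
```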

Recall that the number of missing values per column can also be read from the output of the .info() method

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, a to f
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   first   6 non-null      float64
 1   second  5 non-null      float64
 2   third   5 non-null      float64
dtypes: float64(3)
memory usage: 364.0+ bytes


To remove missing values, use the .dropna() method.

By default, .dropna() removes every row that has at least one missing value.

In [None]:
df

Unnamed: 0,first,second,third
a,1.0,0.14306,0.865551
b,2.0,0.342926,0.425872
c,3.0,0.564947,0.637436
d,0.890208,0.84497,0.391347
e,0.194378,,
f,0.653162,0.564815,0.714769


In [None]:
df.dropna()

Unnamed: 0,first,second,third
a,1.0,0.14306,0.865551
b,2.0,0.342926,0.425872
c,3.0,0.564947,0.637436
d,0.890208,0.84497,0.391347
f,0.653162,0.564815,0.714769


In [None]:
df

Unnamed: 0,first,second,third
a,1.0,0.14306,0.865551
b,2.0,0.342926,0.425872
c,3.0,0.564947,0.637436
d,0.890208,0.84497,0.391347
e,0.194378,,
f,0.653162,0.564815,0.714769


And with transposition you can delete entire columns

In [None]:
df.T.dropna().T

Unnamed: 0,first
a,1.0
b,2.0
c,3.0
d,0.890208
e,0.194378
f,0.653162


Or use the axis=1 parameter

In [None]:
df.dropna(axis=1)

Unnamed: 0,first
a,1.0
b,2.0
c,3.0
d,0.890208
e,0.194378
f,0.653162
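
Besides axis, .dropna() has the parameters how, subset and thresh for finer control. A sketch on a made-up table:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'first': [1.0, 2.0, np.nan],
                      'second': [np.nan, 5.0, np.nan],
                      'third': [7.0, 8.0, np.nan]},
                     index=['a', 'b', 'c'])

any_missing = frame.dropna()                  # default how='any': only row b survives
all_missing = frame.dropna(how='all')         # drops row c, where every value is missing
by_subset = frame.dropna(subset=['first'])    # looks only at the 'first' column
at_least_two = frame.dropna(thresh=2)         # keeps rows with at least 2 non-missing values

print(list(any_missing.index), list(all_missing.index))
```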


Often, however, we still want to keep the rows that contain missing values. To work with them, you can use the .fillna() method

This is how we will fill all the gaps with the same value

In [None]:
df.fillna(0)

Unnamed: 0,first,second,third
a,1.0,0.14306,0.865551
b,2.0,0.342926,0.425872
c,3.0,0.564947,0.637436
d,0.890208,0.84497,0.391347
e,0.194378,0.0,0.0
f,0.653162,0.564815,0.714769


But usually we still want to fill different columns with different values

In [None]:
df.fillna({'second': 0, 'third': 1.0})

Unnamed: 0,first,second,third
a,1.0,0.14306,0.865551
b,2.0,0.342926,0.425872
c,3.0,0.564947,0.637436
d,0.890208,0.84497,0.391347
e,0.194378,0.0,1.0
f,0.653162,0.564815,0.714769
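
A common variant of per-column filling is to use each column's own mean. A minimal sketch, assuming a small invented table:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'second': [1.0, np.nan, 3.0],
                      'third': [10.0, 20.0, np.nan]})

# frame.mean() is a Series indexed by column name, so each column
# is filled with its own mean (computed without the missing values).
filled = frame.fillna(frame.mean())
print(filled)
```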


There are more advanced filling methods; let's create several consecutive gaps

In [None]:
df.at['d', 'second'] = np.nan
df.at['d', 'third'] = np.nan
df

Unnamed: 0,first,second,third
a,1.0,0.14306,0.865551
b,2.0,0.342926,0.425872
c,3.0,0.564947,0.637436
d,0.890208,,
e,0.194378,,
f,0.653162,0.564815,0.714769


The bfill method (backward fill) fills each run of gaps with the next valid (non-null) value that follows it, propagating values from the end of the table towards the beginning.

In [None]:
df.bfill()  # same result as the deprecated df.fillna(method='bfill')


Unnamed: 0,first,second,third
a,1.0,0.14306,0.865551
b,2.0,0.342926,0.425872
c,3.0,0.564947,0.637436
d,0.890208,0.564815,0.714769
e,0.194378,0.564815,0.714769
f,0.653162,0.564815,0.714769


The ffill method (forward fill) does the same in the opposite direction: each gap receives the last valid value that precedes it.

In [None]:
df.ffill()  # same result as the deprecated df.fillna(method='ffill')


Unnamed: 0,first,second,third
a,1.0,0.14306,0.865551
b,2.0,0.342926,0.425872
c,3.0,0.564947,0.637436
d,0.890208,0.564947,0.637436
e,0.194378,0.564947,0.637436
f,0.653162,0.564815,0.714769


bfill and ffill are especially useful for filling gaps in a time series. This functionality is described in more detail [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html).

[Interpolation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html) is handled by a separate method, .interpolate().
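
For comparison, here is a sketch of the three approaches on one short invented series; interpolate() uses linear interpolation by default:

```python
import numpy as np
import pandas as pd

series = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = series.ffill()        # 1, 1, 1, 4
backward = series.bfill()       # 1, 4, 4, 4
linear = series.interpolate()   # 1, 2, 3, 4

print(pd.DataFrame({'ffill': forward, 'bfill': backward, 'linear': linear}))
```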

## Statistics
Of course, pandas has a bunch of methods for calculating various statistics.
You can see the full list of methods [here](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats)

In [None]:
# mean, std, var, value_counts, df +- series

This is how we can calculate the mean of every column at once

In [None]:
df.mean()

Unnamed: 0,0
first,1.289625
second,0.403937
third,0.660907


And this is how to calculate the mean of a single column

In [None]:
df['first'].mean()

0.5453229902580239

Same for [standard deviation](https://berg.com.ua/indicators-overlays/stdev/#:~:text=%D0%A1%D1%82%D0%B0%D0%BD%D0%B4%D0%B0%D1%80%D1%82%D0%BD%D0%BE%D0%B5%20%D0%BE%D1%82%D0%BA%D0%BB%D0%BE%D0%BD%D0%B5%D0%BD%D0%B8%D0%B5%20%D0%BC%D0%BE%D0%B6%D0%BD%D0%BE%20%D0%B2%D1%8B%D1%80%D0%B0%D0%B7%D0%B8%D1%82%D1%8C%20%D1%84%D0%BE%D1%80%D0%BC%D1%83%D0%BB%D0%BE%D0%B9,%D0%BD%D0%B0%20%D0%BA%D0%BE%D0%BB%D0%B8%D1%87%D0%B5%D1%81%D1%82%D0%B2%D0%BE%20%D1%8D%D0%BB%D0%B5%D0%BC%D0%B5%D0%BD%D1%82%D0%BE%D0%B2%20%D0%B2%20%D0%B2%D1%8B%D0%B1%D0%BE%D1%80%D0%BA%D0%B5.)

In [None]:
df.std() # standard deviation

Unnamed: 0,0
first,0.251296
second,0.365479
third,0.213194


Or [variance](https://ru.qwe.wiki/wiki/Variance)

In [None]:
df.var() # variance

Unnamed: 0,0
first,0.063149
second,0.133575
third,0.045452


With .value_counts() you can count how many times each unique value occurs

In [None]:
iris['variety'].value_counts()

Unnamed: 0_level_0,count
variety,Unnamed: 1_level_1
Setosa,50
Versicolor,50
Virginica,50
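
If shares are more convenient than raw counts, value_counts accepts normalize=True. A sketch on a short invented series:

```python
import pandas as pd

variety = pd.Series(['Setosa', 'Setosa', 'Virginica', 'Setosa', 'Versicolor'])

counts = variety.value_counts()                 # sorted by frequency, descending
shares = variety.value_counts(normalize=True)   # fractions instead of counts

print(counts)
print(shares)
```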


## Applying functions to data (apply)
Sometimes the built-in pandas methods are not enough and we need functions with our own processing logic. This is where the .apply method comes to the rescue.

It works like this: the first argument is the function that implements the logic, and the axis argument specifies whether the function receives columns (axis=0) or rows (axis=1).

The function itself must return the processed row or column (return), otherwise the result is lost.

Full feature description available [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)

In [None]:
def my_function(row):
  row = row.copy()  # do not modify the object that pandas passes in
  row['first'] = row['first']**2
  row['second'] = row['second'] - 1
  return row

print("begin")
df.apply(my_function, axis=1) # axis=1: each row is passed to my_function as the row parameter
# i.e. iterate over the table row by row,
#      apply my_function to each row and collect the returned rows into a new table

begin


Unnamed: 0,first,second,third
a,1.0,-1.85694,0.865551
b,16.0,-1.657074,0.425872
c,81.0,-1.435053,0.637436
d,0.62801,,
e,0.001428,,
f,0.182005,-1.435185,0.714769


In [None]:
def my_function(column):
  if column.name == 'first':
    column = column**2
  if column.name == 'second':
    column = column - 1
  return column

print("begin")
df.apply(my_function, axis=0) # axis=0: each column is passed to my_function as the column parameter
# i.e. iterate over the columns,
#      apply my_function to each column and collect the returned columns into a new table

begin


Unnamed: 0,first,second,third
a,1.0,-2.85694,0.865551
b,256.0,-2.657074,0.425872
c,6561.0,-2.435053,0.637436
d,0.394397,,
e,2e-06,,
f,0.033126,-2.435185,0.714769
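
For simple arithmetic like this, the same result can usually be obtained without apply by operating on whole columns. A sketch on invented data; the helper name square_and_shift is hypothetical:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'first': [1.0, 2.0, 3.0],
                      'second': [0.5, 0.25, np.nan],
                      'third': [7.0, 8.0, 9.0]})

def square_and_shift(row):
    row = row.copy()                      # leave the original row untouched
    row['first'] = row['first'] ** 2
    row['second'] = row['second'] - 1
    return row

via_apply = frame.apply(square_and_shift, axis=1)

# The same transformation written with whole-column arithmetic.
vectorized = frame.assign(first=frame['first'] ** 2,
                          second=frame['second'] - 1)

print(via_apply)
print(vectorized)
```

The column-wise version avoids calling a Python function once per row, which tends to matter on large tables.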


## Methods for working with strings
pandas offers vectorized versions of the methods of the standard [str data type](https://pyprog.pro/python/py/str/str_methods.html). For example, we can convert all strings to uppercase or lowercase, count occurrences of particular characters, and so on. If the variable s refers to a pandas.Series object, these methods are available through the s.str accessor, as s.str.<method name>.

[pandas.Series.str](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.html)

[Guide to working with string columns.](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html)

In [None]:
iris['variety'].str.upper()

Unnamed: 0,variety
0,SETOSA
1,SETOSA
2,SETOSA
3,SETOSA
4,SETOSA
...,...
145,VIRGINICA
146,VIRGINICA
147,VIRGINICA
148,VIRGINICA


Let's set another string column in our dataframe

In [None]:
df['fourth'] = pd.Series(['python', 'data analysis', 'i love python', 'zoom', 'string', 'beautiful string'], index=df.index)
df

Unnamed: 0,first,second,third,fourth
a,1.0,-1.85694,0.865551,python
b,16.0,-1.657074,0.425872,data analysis
c,81.0,-1.435053,0.637436,i love python
d,0.62801,,,zoom
e,0.001428,,,string
f,0.182005,-1.435185,0.714769,beautiful string


Convert the entire column to uppercase

In [None]:
df['fourth'].str.upper()

Unnamed: 0,fourth
a,PYTHON
b,DATA ANALYSIS
c,I LOVE PYTHON
d,ZOOM
e,STRING
f,BEAUTIFUL STRING


Or split all strings on a specific character (here with at most one split, n=1)

In [None]:
df['fourth'].str.split(' ', n = 1)

Unnamed: 0,fourth
a,[python]
b,"[data, analysis]"
c,"[i, love python]"
d,[zoom]
e,[string]
f,"[beautiful, string]"


You can also specify the maximum number of splits as the second argument, and create a new dataframe from the resulting arrays, where a split element will be written to each column (expand=True).

Description of the method [pandas.Series.str.split()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html)

In [None]:
df['fourth'].str.split(' ', expand=True)

Unnamed: 0,0,1,2
a,python,,
b,data,analysis,
c,i,love,python
d,zoom,,
e,string,,
f,beautiful,string,
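
The two parameters can also be combined. The sketch below uses a hypothetical series (not the df from above) and made-up column names to split each string into its first word and the rest.

```python
import pandas as pd

# Hypothetical toy data for illustration.
phrases = pd.Series(['python', 'data analysis', 'i love python'])

# At most one split per string, and each piece goes into its own column.
parts = phrases.str.split(' ', n=1, expand=True)
parts.columns = ['first_word', 'rest']  # column names are our own choice
print(parts)
```

Strings without a separator get a missing value in the second column.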


## Combining dataframes

[Guide to pd.merge, DataFrame.join and pd.concat](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

### pd.concat method

Let's look at pd.concat, which concatenates objects along an axis, using the iris dataframe.

In [None]:
iris

pd.concat takes as input a sequence of dataframes or series to concatenate. By default, they are stacked along axis=0 (vertically, row after row), but horizontal concatenation (axis=1) is also possible.

In [None]:
vertical_concat = pd.concat([iris, iris])
vertical_concat

Note that the indexes are not reset, so we now see two entries at the same index. To create a new index, pass ignore_index=True to pd.concat.

In [None]:
vertical_concat.loc[110] # indexes are not reset, if you need to reset, use the reset_index() method
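
A minimal sketch of the difference, on a hypothetical three-row dataframe instead of iris:

```python
import pandas as pd

# Hypothetical toy dataframe for illustration.
part = pd.DataFrame({'x': [1, 2, 3]})

kept = pd.concat([part, part])                      # index: 0, 1, 2, 0, 1, 2
fresh = pd.concat([part, part], ignore_index=True)  # index: 0, 1, 2, 3, 4, 5

print(kept.loc[1])   # two rows share the label 1
print(fresh.loc[1])  # exactly one row
```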

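Or reset the index after concatenating. Note that reset_index() keeps the old index as a new column named index; pass drop=True to discard it.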

In [None]:
vertical_concat.reset_index()

In [None]:
vertical_concat.reset_index().loc[110]

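Horizontal concatenation works the same way. Note that the result contains duplicate column names, so selecting a column by name returns both copies.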

In [None]:
horizontal_concat = pd.concat([iris, iris], axis=1)
horizontal_concat

In [None]:
horizontal_concat['sepal.length']

It is also worth noting the join parameter, which defaults to 'outer' but can be set to 'inner'. It specifies what to do with labels on the other axis (the index for axis=1, the columns for axis=0) that are missing from one of the objects being concatenated.

'outer' keeps the union of the labels and fills the missing values with NaN.

'inner' keeps only the labels that are present in every object.

See examples

In [None]:
pd.concat([iris, iris[:50]], join='outer', axis=1)

In [None]:
pd.concat([iris, iris[:50]], join='inner', axis=1)

In [None]:
pd.concat([iris, iris[['sepal.length', 'petal.length']]], join='outer', axis=0)

In [None]:
pd.concat([iris, iris[['sepal.length', 'petal.length']]], join='inner', axis=0)
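
The same behaviour is easier to see on two small dataframes with partially overlapping indexes. This is a sketch on hypothetical data; the names left and right are ours.

```python
import pandas as pd

# Hypothetical toy dataframes: only the label 'y' is shared.
left = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])
right = pd.DataFrame({'b': [10, 20]}, index=['y', 'w'])

outer = pd.concat([left, right], axis=1, join='outer')  # union of index labels, NaN where missing
inner = pd.concat([left, right], axis=1, join='inner')  # only labels present in both

print(outer)
print(inner)
```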

### pd.merge (and DataFrame.join)

pandas has full-featured, high-performance in-memory join operations, idiomatically very similar to joins in relational databases (SQL).

Generally speaking, there are 4 types of joins:
1. inner join (inner): only the rows whose keys are present in both tables remain
2. left join (left): all rows from the left table remain, and the columns of the right table are set to NaN where there is no match
3. right join (right): all rows from the right table remain, and the columns of the left table are set to NaN where there is no match
4. outer join (outer): a combination of the left and right joins

The picture below helps to remember them. Note that the intersection and union are taken over the set of keys (columns) used for the join. For example, an inner join keeps, out of all possible pairwise combinations of rows, only those in which the values in the key columns are equal.

![joins](https://i.stack.imgur.com/VQ5XP.png)

[Comparison of use with sql joins](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html#compare-with-sql-join)
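
The "pairwise combinations" point matters when keys repeat. A small sketch on hypothetical tables (orders and cities are made-up names) that counts the rows produced by each join type:

```python
import pandas as pd

# Hypothetical tables: the key 'ann' occurs twice on both sides,
# 'bob' only on the left and 'eve' only on the right.
orders = pd.DataFrame({'customer': ['ann', 'ann', 'bob'],
                       'item': ['pen', 'ink', 'cup']})
cities = pd.DataFrame({'customer': ['ann', 'ann', 'eve'],
                       'city': ['Oslo', 'Rome', 'Riga']})

sizes = {how: len(pd.merge(orders, cities, on='customer', how=how))
         for how in ['inner', 'left', 'right', 'outer']}
print(sizes)  # {'inner': 4, 'left': 5, 'right': 5, 'outer': 6}
```

The inner join has 4 rows because each of the two 'ann' rows on the left is paired with each of the two 'ann' rows on the right.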


We will work with pd.merge() as it is more versatile, although DataFrame.join() is sometimes shorter to write.

In [None]:
# let's look at some example tasks
import pandas as pd
import numpy as np

df_left = pd.DataFrame({
    'name': ['Dmitry', 'Sergey', 'Anna'],
    'age': [20, 30, 40]
}, index=['a', 'a', 'b'])

df_right = pd.DataFrame({
    'name': ['Dmitry', 'Sergey', 'Anna', 'Vasiliy'],
    'second_name': ['Petrov', 'Ivanov', 'Smirnova', 'Alexandrov']
}, index=['a', 'b', 'c', 'b'])

In [None]:
df_left

Unnamed: 0,name,age
a,Dmitry,20
a,Sergey,30
b,Anna,40


In [None]:
df_right

Unnamed: 0,name,second_name
a,Dmitry,Petrov
b,Sergey,Ivanov
c,Anna,Smirnova
b,Vasiliy,Alexandrov


pd.merge has quite a lot of parameters, which can be a little overwhelming when you see them for the first time :). Let's take them in order.

The first 2 parameters are left and right. These are the left and right dataframes (tables) that participate in the join.

left_index and right_index take the values True or False and specify whether to use the index of the left table and of the right table, respectively, as the key. So, calling pd.merge(left, right, left_index=True, right_index=True) performs a join in which the indexes of the left and right tables are compared for equality.

left_on and right_on are used when we want to join by columns rather than by index. They take the names of the key columns from the left and the right table, respectively. You can pass a list of several column names at once, but the number of columns on the left and on the right must be the same. Thus, pd.merge(left, right, left_on='A', right_on='B') performs a join in which the values in column 'A' of the left table and column 'B' of the right table are compared for equality.

We can combine left_index, right_index and left_on, right_on. For example, to use the index as the key in the left table and column 'B' in the right table: pd.merge(left, right, left_index=True, right_on='B').

If the key columns have the same names in both tables, then instead of passing identical values to left_on and right_on, you can simply specify on='column name'.

The how parameter specifies the join type and can take the values 'inner' (default), 'outer', 'left' and 'right'.

There is also a validate parameter, which checks whether the join keys satisfy our expectations about the relationship between the tables and raises an error if they do not. It accepts the following values:
- “one_to_one” or “1:1”: checks that the join keys are unique in both the left and the right table

- “one_to_many” or “1:m”: checks that the keys are unique in the left table

- “many_to_one” or “m:1”: checks that the keys are unique in the right table

- “many_to_many” or “m:m”: can be specified, but no checks are performed; the keys are allowed to repeat in both tables.
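
A minimal sketch of these parameters working together. The tables people and teams are hypothetical and deliberately different from df_left and df_right, so the tasks below stay unsolved.

```python
import pandas as pd

# Hypothetical tables for illustration only.
people = pd.DataFrame({'person': ['Olga', 'Ivan', 'Olga'], 'score': [5, 3, 4]})
teams = pd.DataFrame({'member': ['Olga', 'Ivan'], 'team': ['red', 'blue']})

# Key columns have different names; every key must be unique in the right table.
joined = pd.merge(people, teams, left_on='person', right_on='member',
                  how='left', validate='many_to_one')

# A column as the key on the left, the index as the key on the right.
teams_by_name = teams.set_index('member')
joined_idx = pd.merge(people, teams_by_name, left_on='person', right_index=True)

# 'Olga' repeats in the left table, so a one-to-one check fails.
try:
    pd.merge(people, teams, left_on='person', right_on='member', validate='one_to_one')
    failed = False
except pd.errors.MergeError:
    failed = True
print(failed)
```

Using validate is a cheap way to catch unexpected duplicate keys before they silently multiply rows.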

### Tasks for independent solution (on pd.merge)
1. Join the rows of the first table with the second by index (inner join)
2. Join the rows of the first table with the second by index (left join)
3. Join the rows of the first table with the second by index (right join)
4. Join the rows of the first table with the second using the name column (inner join)
5. Join the rows of the first table with the second using the name column (right join)

In [None]:
# 1. your code here

In [None]:
# 2. your code here

In [None]:
# 3. your code here

In [None]:
# 4. your code here

In [None]:
# 5. your code here

## Grouping
Very often we need to compute various statistics and build plots by group. pandas provides the groupby operation for this, and it works essentially the same way as GROUP BY in SQL.

Guide about [grouping](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) and [one more on habr](https://habr.com/ru/post/501214/).

In [None]:
# groupby, max, min, describe, agg, apply

This is how we can group by the variety column and calculate the average value in each column by group

In [None]:
iris.groupby('variety').mean()

Unnamed: 0_level_0,sepal.length,sepal.width,petal.length,petal.width
variety,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Setosa,5.006,3.428,1.462,0.246
Versicolor,5.936,2.77,4.26,1.326
Virginica,6.588,2.974,5.552,2.026


Or maximum

In [None]:
iris.groupby('variety').max()

Unnamed: 0_level_0,sepal.length,sepal.width,petal.length,petal.width
variety,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Setosa,5.8,4.4,1.9,0.6
Versicolor,7.0,3.4,5.1,1.8
Virginica,7.9,3.8,6.9,2.5


Or minimum

In [None]:
iris.groupby('variety').min()

Unnamed: 0_level_0,sepal.length,sepal.width,petal.length,petal.width
variety,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Setosa,4.3,2.3,1.0,0.1
Versicolor,4.9,2.0,3.0,1.0
Virginica,4.9,2.2,4.5,1.4


Sometimes you need to calculate different metrics for different columns. Use the .agg() method for this.

In [None]:
iris.groupby('variety').agg({
    'sepal.length': ['max', 'min'],
    'petal.length': ['mean', 'median'],
    'petal.width': 'max'
    })

Unnamed: 0_level_0,sepal.length,sepal.length,petal.length,petal.length,petal.width
Unnamed: 0_level_1,max,min,mean,median,max
variety,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Setosa,5.8,4.3,1.462,1.5,0.6
Versicolor,7.0,4.9,4.26,4.35,1.8
Virginica,7.9,4.9,5.552,5.55,2.5
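
The dictionary form above produces multi-level column names. If flat names are preferable, .agg() also accepts keyword arguments of the form new_name=(column, function), known as named aggregation. A sketch on a hypothetical toy dataframe (flowers and the result column names are ours):

```python
import pandas as pd

# Hypothetical toy data with iris-like column names.
flowers = pd.DataFrame({
    'variety': ['A', 'A', 'B', 'B'],
    'petal.length': [1.0, 3.0, 4.0, 6.0],
    'petal.width': [0.2, 0.4, 1.0, 2.0],
})

# Each keyword becomes a column of the result: name=(source column, function).
summary = flowers.groupby('variety').agg(
    length_mean=('petal.length', 'mean'),
    length_max=('petal.length', 'max'),
    width_max=('petal.width', 'max'),
)
print(summary)
```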


There's even a .describe(), but its output looks a bit big

In [None]:
iris.groupby('variety').describe()

Unnamed: 0_level_0,sepal.length,sepal.length,sepal.length,sepal.length,sepal.length,sepal.length,sepal.length,sepal.length,sepal.width,sepal.width,...,petal.length,petal.length,petal.width,petal.width,petal.width,petal.width,petal.width,petal.width,petal.width,petal.width
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
variety,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
Setosa,50.0,5.006,0.35249,4.3,4.8,5.0,5.2,5.8,50.0,3.428,...,1.575,1.9,50.0,0.246,0.105386,0.1,0.2,0.2,0.3,0.6
Versicolor,50.0,5.936,0.516171,4.9,5.6,5.9,6.3,7.0,50.0,2.77,...,4.6,5.1,50.0,1.326,0.197753,1.0,1.2,1.3,1.5,1.8
Virginica,50.0,6.588,0.63588,4.9,6.225,6.5,6.9,7.9,50.0,2.974,...,5.875,6.9,50.0,2.026,0.27465,1.4,1.8,2.0,2.3,2.5


Note that calling .groupby() returns a special DataFrameGroupBy object. The pandas documentation has a separate page listing its available methods. Applying one of the aggregation functions to it produces an ordinary dataframe, which we already know how to work with.

In [None]:
iris.groupby('variety')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7c5a11754490>

DataFrameGroupBy also has an apply method, which is similar in behavior to apply for DataFrame

In [None]:
def my_function(gr):
  gr = gr.copy()  # work on a copy so that the original group is not mutated
  # Look at the value of the variety column in the first row of the group
  # and transform the group depending on it.
  if gr.iloc[0]['variety'] == 'Setosa':
    gr['sepal.width'] = gr['sepal.width']**2
  if gr.iloc[0]['variety'] == 'Virginica':
    gr['sepal.width'] = gr['sepal.width']**3
  return gr

iris.groupby('variety', group_keys = False).apply(my_function)
# for each subgroup:
#   pass the subgroup to my_function as gr
#   apply the transformations defined in my_function
#   put the transformed group (the return value) back into the table

### Assignment for independent solution (on df.groupby.apply)

Write an aggregation function that calculates the mean square of each column for each group. Fill in the code cell below

In [None]:
def my_function(gr):
  return '''your code here'''

iris.groupby('variety').apply(my_function)

### Other functions for grouping

There are also a number of useful methods that, unlike aggregations, return one value per row, such as the cumulative sum (cumsum) or the rank (rank). This way we can assign ordinal values to the objects in each group depending on one of the columns. This can be very useful, for example, if you need to trace the evolution of some parameter depending on the event number over time.

In [None]:
iris['sepal.length.rank'] = iris.groupby('variety')['sepal.length'].rank(method='dense')
iris
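
A sketch of the "event number" use case on a hypothetical event log (the events dataframe and its column names are assumptions made for this example):

```python
import pandas as pd

# Hypothetical event log: rows are in chronological order.
events = pd.DataFrame({
    'user': ['u1', 'u2', 'u1', 'u1', 'u2'],
    'amount': [10, 5, 20, 10, 7],
})

# Running total of amount within each user.
events['running_total'] = events.groupby('user')['amount'].cumsum()
# Sequential event number within each user (cumcount starts at 0).
events['event_no'] = events.groupby('user').cumcount() + 1
# Dense rank of amount within each user: ties share a rank, no gaps.
events['amount_rank'] = events.groupby('user')['amount'].rank(method='dense')
print(events)
```

All three results have the same length and index as the original dataframe, which is why they can be assigned directly as new columns.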

## Multiindex
A MultiIndex is a multi-level (hierarchical) index. It appears, for example, when we group by several columns. Let's look at how to work with it.

In [None]:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                             'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                             'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df

Unnamed: 0,A,B,C,D
0,foo,one,0.266897,-0.851742
1,bar,one,-1.648909,0.855905
2,foo,two,2.604975,0.881153
3,bar,three,0.802271,-1.197982
4,foo,two,-1.32594,-0.873265
5,bar,two,0.334465,-1.422754
6,foo,one,-0.355242,0.408657
7,foo,three,0.363084,-1.090783


In [None]:
df.groupby(['A','B']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,-1.648909,0.855905
bar,three,0.802271,-1.197982
bar,two,0.334465,-1.422754
foo,one,-0.044173,-0.221543
foo,three,0.363084,-1.090783
foo,two,0.639517,0.003944


In [None]:
df.groupby(['A','B']).mean().index

MultiIndex([('bar',   'one'),
            ('bar', 'three'),
            ('bar',   'two'),
            ('foo',   'one'),
            ('foo', 'three'),
            ('foo',   'two')],
           names=['A', 'B'])

In [None]:
multi_gr = df.groupby(['A','B']).mean()
multi_gr

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,-1.648909,0.855905
bar,three,0.802271,-1.197982
bar,two,0.334465,-1.422754
foo,one,-0.044173,-0.221543
foo,three,0.363084,-1.090783
foo,two,0.639517,0.003944


With .loc, we can now pass a tuple of 2 values, one for each level of the row index.

In [None]:
multi_gr.loc[('foo', 'three')]

Unnamed: 0_level_0,foo
Unnamed: 0_level_1,three
C,0.363084
D,-1.090783


The columns are unchanged, so they are selected as usual. If the columns also had a multi-index, we could select them by passing tuples in the same way.

In [None]:
multi_gr.loc[('foo', 'three'), 'C']

0.36308429689600896

However, a record like ('foo', 'one':'three') is not valid Python: the colon syntax for slices is only allowed directly inside square brackets. So if we need a slice inside a tuple, we must create it explicitly with the built-in [slice()](https://www.programiz.com/python-programming/methods/built-in/slice) function. In fact, a colon inside square brackets creates the same slice object as a call to slice(); the colon is just syntactic sugar.

In [None]:
multi_gr.loc[('foo', slice('one','three')), ['C', 'D']]

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
foo,one,-0.044173,-0.221543
foo,three,0.363084,-1.090783
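
pandas also offers pd.IndexSlice, a helper that lets us keep the colon syntax inside .loc. The sketch below uses a hypothetical dataframe with the same index layout as multi_gr but made-up integer values.

```python
import pandas as pd

# Hypothetical dataframe with a sorted two-level index (A, B).
index = pd.MultiIndex.from_product(
    [['bar', 'foo'], ['one', 'three', 'two']], names=['A', 'B'])
table = pd.DataFrame({'C': range(6), 'D': range(10, 16)}, index=index)

idx = pd.IndexSlice
by_slice = table.loc[('foo', slice('one', 'three')), ['C', 'D']]  # explicit slice()
by_idx = table.loc[idx['foo', 'one':'three'], ['C', 'D']]         # same selection

print(by_idx)
```

Slicing a MultiIndex by labels like this generally requires the index to be sorted, which is the case after groupby.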


In [None]:
multi_gr.loc[slice(('foo','three'), ('bar','two'), -1), ['C', 'D']]

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
foo,three,0.363084,-1.090783
foo,one,-0.044173,-0.221543
bar,two,0.334465,-1.422754


.iloc works too!

In [None]:
multi_gr.iloc[1:3]

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,three,0.802271,-1.197982
bar,two,0.334465,-1.422754


Using the unstack and stack methods, we can unpack index levels into columns and pack them back into rows

In [None]:
multi_gr

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,-1.648909,0.855905
bar,three,0.802271,-1.197982
bar,two,0.334465,-1.422754
foo,one,-0.044173,-0.221543
foo,three,0.363084,-1.090783
foo,two,0.639517,0.003944


In [None]:
multi_gr.unstack(level=1) # level 1 of the index (B) is moved into the columns

Unnamed: 0_level_0,C,C,C,D,D,D
B,one,three,two,one,three,two
A,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
bar,-1.648909,0.802271,0.334465,0.855905,-1.197982,-1.422754
foo,-0.044173,0.363084,0.639517,-0.221543,-1.090783,0.003944


In [None]:
multi_gr.unstack(level=0) # level 0 of the index (A) is moved into the columns

Unnamed: 0_level_0,C,C,D,D
A,bar,foo,bar,foo
B,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
one,-1.648909,-0.044173,0.855905,-0.221543
three,0.802271,0.363084,-1.197982,-1.090783
two,0.334465,0.639517,-1.422754,0.003944


In [None]:
multi_gr.unstack(level=0).stack(level=0)

Unnamed: 0_level_0,A,bar,foo
B,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,C,-1.648909,-0.044173
one,D,0.855905,-0.221543
three,C,0.802271,0.363084
three,D,-1.197982,-1.090783
two,C,0.334465,0.639517
two,D,-1.422754,0.003944


In [None]:
multi_gr.unstack(level=0).stack(level=1) # essentially the reverse operation

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
B,A,Unnamed: 2_level_1,Unnamed: 3_level_1
one,bar,-1.648909,0.855905
one,foo,-0.044173,-0.221543
three,bar,0.802271,-1.197982
three,foo,0.363084,-1.090783
two,bar,0.334465,-1.422754
two,foo,0.639517,0.003944


In [None]:
multi_gr

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,-1.648909,0.855905
bar,three,0.802271,-1.197982
bar,two,0.334465,-1.422754
foo,one,-0.044173,-0.221543
foo,three,0.363084,-1.090783
foo,two,0.639517,0.003944


In [None]:
multi_gr.stack()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,0
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,C,-1.648909
bar,one,D,0.855905
bar,three,C,0.802271
bar,three,D,-1.197982
bar,two,C,0.334465
bar,two,D,-1.422754
foo,one,C,-0.044173
foo,one,D,-0.221543
foo,three,C,0.363084
foo,three,D,-1.090783


## Pivot tables ([pivot table](http://datareview.info/article/svodnyie-tablitsyi-v-python/))

Pivot tables are available in spreadsheets and other programs that operate on tabular data. A pivot table takes data from individual columns as input and groups it into a two-dimensional table that provides a multidimensional summary of the data. To get a feel for the difference between a pivot table and a GroupBy operation, you can think of a pivot table as a multidimensional version of GroupBy aggregation. That is, the data is split, transformed and combined, but the splitting and combining happen not along a one-dimensional index but across a two-dimensional grid.

In [None]:
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                      'B': ['A', 'B', 'C'] * 4,
                       'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                       'D': np.random.randn(12),
                       'E': np.random.randn(12)})

In [None]:
df

Unnamed: 0,A,B,C,D,E
0,one,A,foo,-1.841852,-1.535826
1,one,B,foo,-1.321115,1.145076
2,two,C,foo,-0.538177,-0.232545
3,three,A,bar,0.220179,-0.630737
4,one,B,bar,-0.572993,0.044623
5,one,C,bar,1.110605,-0.698981
6,two,A,foo,0.539923,0.352976
7,three,B,foo,1.429373,-0.23752
8,one,C,foo,-0.231555,-0.268223
9,one,A,bar,-0.816409,0.367162


In [None]:
# pivot(values, index, columns, margins)

The main parameters are:
- values - the column(s) whose values are aggregated
- index - row index (one of the columns for grouping)
- columns - column index (one of the columns for grouping)
- aggfunc - aggregation function

In [None]:
pd.pivot_table(df, values='E', index='A', columns='B', aggfunc='mean')

B,A,B,C
A,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,-0.584332,0.59485,-0.483602
three,-0.630737,-0.23752,0.138591
two,0.352976,-1.997584,-0.232545
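
To see the connection with GroupBy, here is a sketch showing that a pivot table with one index column and one columns column gives the same numbers as grouping by both columns and unstacking one level. The sales dataframe is hypothetical.

```python
import pandas as pd

# Hypothetical toy data for illustration.
sales = pd.DataFrame({
    'A': ['one', 'one', 'two', 'two', 'one'],
    'B': ['x', 'y', 'x', 'y', 'x'],
    'E': [1.0, 2.0, 3.0, 4.0, 5.0],
})

pivoted = pd.pivot_table(sales, values='E', index='A', columns='B', aggfunc='mean')
grouped = sales.groupby(['A', 'B'])['E'].mean().unstack('B')

print(pivoted)
print(grouped)
```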


margins=True adds an extra row and column named All that contain the same aggregate computed over all the groups

In [None]:
pd.pivot_table(df, values='E', index='A', columns='B', aggfunc='mean', margins=True)

B,A,B,C,All
A,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
one,-0.584332,0.59485,-0.483602,-0.157695
three,-0.630737,-0.23752,0.138591,-0.243222
two,0.352976,-1.997584,-0.232545,-0.625718
All,-0.361606,-0.261351,-0.26529,-0.296082


In [None]:
pd.pivot_table(df, values='E', index='A', columns='B', aggfunc='max')

B,A,B,C
A,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0.367162,1.145076,-0.268223
three,-0.630737,-0.23752,0.138591
two,0.352976,-1.997584,-0.232545


Unfortunately, it is impossible to use one function for aggregating the cells and another for the margins: margins=True always applies the same aggfunc. But we can work around this limitation manually.

In [None]:
pvt = pd.pivot_table(df, values='E', index='A', columns='B', aggfunc='max')
pvt['All_mean'] = pvt.mean(axis=1)
pvt.loc['All_mean'] = pvt.mean(axis=0)
pvt

B,A,B,C,All_mean
A,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
one,0.367162,1.145076,-0.268223,0.414672
three,-0.630737,-0.23752,0.138591,-0.243222
two,0.352976,-1.997584,-0.232545,-0.625718
All_mean,0.0298,-0.363343,-0.120726,-0.151423


You can pass more than one column to index or columns

In [None]:
pd.pivot_table(df, values='E' , index=['A','C'], columns='B', aggfunc='mean', margins=True)

Unnamed: 0_level_0,B,A,B,C,All
A,C,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
one,bar,0.367162,0.044623,-0.698981,-0.095732
one,foo,-1.535826,1.145076,-0.268223,-0.219658
three,bar,-0.630737,,0.138591,-0.246073
three,foo,,-0.23752,,-0.23752
two,bar,,-1.997584,,-1.997584
two,foo,0.352976,,-0.232545,0.060215
All,,-0.361606,-0.261351,-0.26529,-0.296082


pd.crosstab builds a similar table directly from arrays or Series, so the inputs do not have to be columns of the same dataframe

In [None]:
pd.crosstab(df['A'], df['B'], values=df['E'], aggfunc='max')
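
Without values and aggfunc, pd.crosstab simply counts how often each pair of values occurs. A sketch with three standalone hypothetical Series (not columns of any dataframe):

```python
import pandas as pd

# Hypothetical standalone Series of equal length.
sizes = pd.Series(['S', 'M', 'S', 'L', 'M'], name='size')
colors = pd.Series(['red', 'red', 'blue', 'blue', 'red'], name='color')
prices = pd.Series([10, 20, 30, 40, 50])

counts = pd.crosstab(sizes, colors)                                    # frequency table
max_price = pd.crosstab(sizes, colors, values=prices, aggfunc='max')   # aggregate a third Series

print(counts)
print(max_price)
```

Combinations that never occur get 0 in the frequency table and NaN in the aggregated one.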

*Keep in mind that the commands and functions above are far from all that pandas can do: there are many, many more! However, we have tried to collect and demonstrate the most important and most commonly used tools of the library. We hope this guide helps you get comfortable and start your journey with this wonderful library! :)*