# Pandas

[Pandas](https://pandas.pydata.org/) is a library for working with tabular data. It is widely used for data analysis and data manipulation.

In most use cases it can replace the Linux command-line tools that astronomers often rely on, such as `for`, `split`, `grep`, and `awk`.

A few points taken directly from the documentation:

Library Highlights
* A fast and efficient DataFrame object for data manipulation with integrated indexing;

* Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;

* Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;

* Flexible reshaping and pivoting of data sets;

* Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;

* Columns can be inserted and deleted from data structures for size mutability;

* Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;

* High performance merging and joining of data sets;

* Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;

* Time series-functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;

* Highly optimized for performance, with critical code paths written in Cython or C.

* Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.



[10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)

## Getting to know the data

### Dimensions

- 1-D: Series, e.g.
    - Solar planets: [Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune]
    - Set of astronomical objects and when they were observed:
      [[NGC1952, 2012-05-01],
       [NGC224, 2013-01-23],
       [NGC5194, 2014-02-13]]
- 2-D: DataFrame, e.g. (more business oriented):
    - 3 months of sales information for 3 fictitious companies:
      sales = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': 140},
               {'account': 'Alpha Co',  'Jan': 200, 'Feb': 210, 'Mar': 215},
               {'account': 'Blue Inc',  'Jan': 50,  'Feb': 90,  'Mar': 95 }]

### Index

* It is a value (key) used as a reference to each element. (Note: it does not have to be unique.)

* Most data has at least one index.
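The note about non-unique indexes can be illustrated with a small sketch. The observation log below is hypothetical (the second date for NGC1952 is made up for illustration); it only shows how pandas behaves when a label is repeated.

```python
import pandas as pd

# Hypothetical observation log: NGC1952 was observed twice,
# so its label appears twice in the index.
obs = pd.Series(
    ['2012-05-01', '2013-01-23', '2014-02-13'],
    index=['NGC1952', 'NGC224', 'NGC1952'],
)

print(obs.index.is_unique)   # the index contains a repeated label
print(obs.loc['NGC1952'])    # a Series holding both matching entries
print(obs.loc['NGC224'])     # a single value, because this label is unique
```

Selecting by a repeated label returns all matching elements, so code that expects a single value should check `index.is_unique` first.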

In [101]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Series definition
A Series is a one-dimensional labeled array that can hold data of any type.

The labels of its elements are collectively called the `index`.

Conceptually, a pandas Series is created as follows:

**s = pd.Series(data, index=index)**

where `data` can be:
- a list
- an ndarray
- a Python dictionary
- a scalar

and `index` is a list of labels.
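Of the four input types, the scalar case is the least obvious one. A minimal sketch, with variable names chosen only for illustration: the scalar is repeated for every label in `index`.

```python
import pandas as pd

# A scalar is broadcast to every label in the index.
s_scalar = pd.Series(5.0, index=['a', 'b', 'c'])
print(s_scalar)

# A list combined with an explicit index: labels replace the default 0..n-1.
s_list = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
print(s_list['y'])
```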

#### Creating a Series from a list

In [102]:
solar_planets = ['Mercury','Venus','Earth','Mars','Jupiter','Saturn','Uranus','Neptune']

In [103]:
splanets = pd.Series(solar_planets)

In [105]:
?pd.Series

In [104]:
splanets

0    Mercury
1      Venus
2      Earth
3       Mars
4    Jupiter
5     Saturn
6     Uranus
7    Neptune
dtype: object

In [106]:
splanets.index

RangeIndex(start=0, stop=8, step=1)

#### Creating a Series from an ndarray

1. Without an index

In [107]:
np.random.randn(5)

array([ 0.23563796,  0.06674269,  0.13274601, -0.06337702, -0.63565239])

In [110]:
s1 = pd.Series(np.random.randn(5))

In [111]:
s1

0   -1.674876
1    0.693976
2    0.298720
3    1.093886
4   -1.443769
dtype: float64

In [112]:
s1.index

RangeIndex(start=0, stop=5, step=1)

2. With an index

In [113]:
s2 = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [114]:
s2

a    1.288988
b   -0.308112
c   -0.720290
d   -0.786458
e   -0.771848
dtype: float64

In [115]:
s2.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

#### Creating a Series from a Python dictionary

In [116]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}

In [117]:
sd = pd.Series(d)

In [118]:
sd

a    0.0
b    1.0
c    2.0
dtype: float64
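When a dictionary is combined with an explicit `index`, pandas looks up each label in the dictionary. The sketch below reuses the dictionary from the cell above; the name `sd_reordered` is only illustrative.

```python
import pandas as pd

d = {'a': 0., 'b': 1., 'c': 2.}

# Labels are taken from `index`; 'd' is not a key of the dictionary,
# so its value is missing (NaN).
sd_reordered = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(sd_reordered)
```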

### DataFrame definition

A DataFrame is a two-dimensional labeled structure whose columns may have different types.

You can think of it as **an Excel spreadsheet, an SQL table, or a dictionary of Series objects.**

It is the most commonly used pandas type.

Like a Series, a DataFrame can be created in many ways, including from:
- a dictionary of 1-D ndarrays, lists, dicts, or Series,
- a 2-D numpy.ndarray,
- a structured or record ndarray,
- a Series,
- another DataFrame.
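The cells below cover dictionaries and lists. As a complement, here is a minimal sketch of the ndarray-based constructors; the array contents and the column names (`x`, `y`, `id`, `mag`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a 2-D ndarray: row and column labels are supplied separately.
arr = np.arange(6).reshape(3, 2)
df_arr = pd.DataFrame(arr, columns=['x', 'y'], index=['r0', 'r1', 'r2'])
print(df_arr)

# From a structured ndarray: field names become column names.
rec = np.array([(1, 2.5), (2, 3.5)], dtype=[('id', 'i4'), ('mag', 'f8')])
df_rec = pd.DataFrame(rec)
print(df_rec)
```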

#### From a list of dictionaries

In [133]:
sales = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200,'Mar': 140},
                 {'account': 'Alpha Co',  'Jan': 200, 'Feb': 210, 'Mar': 215},
                 {'account': 'Blue Inc',  'Jan': 50,  'Feb': 90,  'Mar': 95 }]

In [134]:
df = pd.DataFrame(sales)

In [135]:
df

,account,Jan,Feb,Mar
0,Jones LLC,150,200,140
1,Alpha Co,200,210,215
2,Blue Inc,50,90,95


In [136]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   account  3 non-null      object
 1   Jan      3 non-null      int64 
 2   Feb      3 non-null      int64 
 3   Mar      3 non-null      int64 
dtypes: int64(3), object(1)
memory usage: 224.0+ bytes


In [137]:
df.index

RangeIndex(start=0, stop=3, step=1)

In [141]:
df= df.set_index('account')

In [142]:
df

account,Jan,Feb,Mar
Jones LLC,150,200,140
Alpha Co,200,210,215
Blue Inc,50,90,95


In [145]:
df= df.transpose()

In [146]:
df

account,Jones LLC,Alpha Co,Blue Inc
Jan,150,200,50
Feb,200,210,90
Mar,140,215,95


#### From a dictionary of Series or other dictionaries

In [147]:
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

In [148]:
df = pd.DataFrame(d)

In [149]:
df

,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0


In [150]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   one     3 non-null      float64
 1   two     4 non-null      float64
dtypes: float64(2)
memory usage: 96.0+ bytes


In [152]:
pd.DataFrame(d, index=['d', 'b', 'a'])

,one,two
d,,4.0
b,2.0,2.0
a,1.0,1.0


In [153]:
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [154]:
df.columns

Index(['one', 'two'], dtype='object')

#### From a dictionary of ndarrays or lists

In [155]:
d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}

In [156]:
pd.DataFrame(d)

,one,two
0,1.0,4.0
1,2.0,3.0
2,3.0,2.0
3,4.0,1.0


In [157]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

,one,two
a,1.0,4.0
b,2.0,3.0
c,3.0,2.0
d,4.0,1.0


#### From a list of dictionaries

In [158]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [159]:
pd.DataFrame(data2)

,a,b,c
0,1,2,
1,5,10,20.0


In [160]:
pd.DataFrame(data2, index=['first', 'second'])

,a,b,c
first,1,2,
second,5,10,20.0


In [161]:
pd.DataFrame(data2, columns=['a', 'b'])

,a,b
0,1,2
1,5,10


## IO

I will only demonstrate reading from CSV files, but pandas can also work with **SQL** tables, and **.fits** files can be loaded with the help of other libraries such as astropy.

Function for reading CSV files: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
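Before working with the real file, the `read_csv` options used later in this section (`index_col`, `comment`, `na_values`) can be tried on a tiny in-memory sample. The text below is a made-up stand-in for the catalogue, not actual data.

```python
import io
import pandas as pd

# Made-up sample in the same layout as the catalogue: a commented header
# line, a column header, and a missing value written as \N.
csv_text = r"""# comment line, skipped because of comment='#'
unique_gal_id,ra_gal,dec_gal
1,6.32,25.82
2,32.70,\N
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    index_col='unique_gal_id',  # use this column as the index
    comment='#',                # ignore lines starting with '#'
    na_values=r'\N',            # treat \N as a missing value
)
print(sample)
```

`io.StringIO` stands in for a file path here; `read_csv` accepts either.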

In [162]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [163]:
import os
import pandas as pd
import matplotlib.pyplot as plt
# Build the path to the file:
gdrive_path= "/content/gdrive/MyDrive"
data_directory= "python_tutorial/data" # Change this if the data was saved somewhere else.
data_path= os.path.join(gdrive_path, data_directory, "pandas_data/galaxy_sample.csv")

In [164]:
!head -30 {data_path}

# This catalog has been produced on behalf of Jorge Carretero (jorgecarreteropalacios@gmail.com) with ID #1365.
# It took 0:01:07 (h:mm:ss) to complete and the SQL issued was:
# 
# SELECT unique_gal_id, ra_gal, dec_gal, z_cgal, z_cgal_v, lmhalo, (mr_gal - 0.8 * (atan(1.5 * z_cgal)- 0.1489)) AS abs_mag, gr_gal AS color, (des_asahi_full_i_true - 0.8 * (atan(1.5 * z_cgal)- 0.1489)) AS app_mag FROM micecatv2_0_view TABLESAMPLE (BUCKET 1 OUT OF 2048)
# 
# Please, remember to follow the citation guide if you use any of this data in your work.
# 
# Generated by CosmoHub (https://cosmohub.pic.es) on 2017-09-08 06:48:53.790000 UTC.
unique_gal_id,ra_gal,dec_gal,z_cgal,z_cgal_v,lmhalo,abs_mag,color,app_mag
28581888,6.322946,25.82068,0.30917,0.30894,13.5638,-19.107266531240747,0.8683,20.77373398755808
6686720,32.696644,53.073577,0.17156,0.16844,14.4175,-16.58748015323386,0.8183,21.80361977657571
23693312,17.699937,75.128659,0.7227,0.72423,11.4068,-19.40324439630544,0.5311,22.553554157161358
611532

In [174]:
!tail -5 {data_path}

61456384,86.53829,14.352161,0.14028,0.14349,13.8817,-14.45119540825824,0.7766,23.48320515021344
48244736,10.330761,15.381618,0.64717,0.64651,11.6325,-19.97822762045092,0.6734,21.861671442659432
53536768,4.038207,22.9018,0.92618,0.92719,12.6905,-20.333802931012343,0.4837,22.52959657460289
72990720,34.863831,2.792849,0.19577,0.19544,12.4254,-16.18298123008797,0.7664,22.620619081191325
63993856,68.025492,34.740194,0.20558,0.20466,13.8767,-14.26427450495626,0.3934,24.659826370898234


In [165]:
unique_gal_id_field = 'unique_gal_id'

In [170]:
galaxy_sample = pd.read_csv(data_path, sep=',', index_col = unique_gal_id_field, comment='#', na_values = '\\N')

In [171]:
galaxy_sample.head()

unique_gal_id,ra_gal,dec_gal,z_cgal,z_cgal_v,lmhalo,abs_mag,color,app_mag
28581888,6.322946,25.82068,0.30917,0.30894,13.5638,-19.107267,0.8683,20.773734
6686720,32.696644,53.073577,0.17156,0.16844,14.4175,-16.58748,0.8183,21.80362
23693312,17.699937,75.128659,0.7227,0.72423,11.4068,-19.403244,0.5311,22.553554
6115328,61.603497,59.016913,0.21891,0.21605,13.9224,-16.929097,0.178,22.106301
8955904,32.202269,71.912705,0.29446,0.29086,13.2412,-15.597617,0.1631,24.230383


In [172]:
galaxy_sample.tail(10)

unique_gal_id,ra_gal,dec_gal,z_cgal,z_cgal_v,lmhalo,abs_mag,color,app_mag
65386496,66.37077,48.483216,0.23369,0.23318,13.096,-16.209401,0.8161,22.876399
35108864,15.013475,0.029625,0.34678,0.3458,11.8156,-20.119203,0.7126,20.080397
65302528,56.558624,36.787993,0.2412,0.24362,12.445,-16.964598,0.3157,22.367401
37863424,9.710972,34.448477,0.46939,0.46876,12.2258,-20.530049,0.7172,20.48535
75610112,43.292052,41.488275,0.28673,0.28792,13.6292,-15.888523,0.3105,23.902177
61456384,86.53829,14.352161,0.14028,0.14349,13.8817,-14.451195,0.7766,23.483205
48244736,10.330761,15.381618,0.64717,0.64651,11.6325,-19.978228,0.6734,21.861671
53536768,4.038207,22.9018,0.92618,0.92719,12.6905,-20.333803,0.4837,22.529597
72990720,34.863831,2.792849,0.19577,0.19544,12.4254,-16.182981,0.7664,22.620619
63993856,68.025492,34.740194,0.20558,0.20466,13.8767,-14.264275,0.3934,24.659826


In [175]:
galaxy_sample.describe()

,ra_gal,dec_gal,z_cgal,z_cgal_v,lmhalo,abs_mag,color,app_mag
count,243988.0,243988.0,243988.0,243988.0,243988.0,243988.0,243988.0,243988.0
mean,44.997907,33.59257,0.713766,0.713795,11.75481,-18.731732,0.583744,23.089854
std,25.659674,21.614009,0.344807,0.344841,0.848638,1.886084,0.227726,1.157234
min,0.0,-0.000512,0.07305,0.07034,10.077,-23.352722,-0.1686,14.606459
25%,23.076223,15.321979,0.4283,0.42821,11.1571,-20.145995,0.4027,22.515134
50%,45.04113,31.260208,0.700715,0.700655,11.5631,-19.056837,0.5481,23.376388
75%,66.876238,49.74322,0.97926,0.979553,12.2054,-17.604014,0.7884,23.93137
max,90.0,89.834812,1.41708,1.42185,15.2683,-12.979675,1.3467,24.957492


In [176]:
galaxy_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 243988 entries, 28581888 to 63993856
Data columns (total 8 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   ra_gal    243988 non-null  float64
 1   dec_gal   243988 non-null  float64
 2   z_cgal    243988 non-null  float64
 3   z_cgal_v  243988 non-null  float64
 4   lmhalo    243988 non-null  float64
 5   abs_mag   243988 non-null  float64
 6   color     243988 non-null  float64
 7   app_mag   243988 non-null  float64
dtypes: float64(8)
memory usage: 16.8 MB


In [177]:
filename_bz2= os.path.join(gdrive_path, data_directory, "pandas_data/galaxy_sample.csv.bz2")
galaxy_sample_bz2 = pd.read_csv(filename_bz2, sep=',', index_col = unique_gal_id_field, comment='#', na_values = r'\N')

In [178]:
galaxy_sample_bz2.head()

unique_gal_id,ra_gal,dec_gal,z_cgal,z_cgal_v,lmhalo,abs_mag,color,app_mag
28581888,6.322946,25.82068,0.30917,0.30894,13.5638,-19.107267,0.8683,20.773734
6686720,32.696644,53.073577,0.17156,0.16844,14.4175,-16.58748,0.8183,21.80362
23693312,17.699937,75.128659,0.7227,0.72423,11.4068,-19.403244,0.5311,22.553554
6115328,61.603497,59.016913,0.21891,0.21605,13.9224,-16.929097,0.178,22.106301
8955904,32.202269,71.912705,0.29446,0.29086,13.2412,-15.597617,0.1631,24.230383


In [179]:
galaxy_sample.dtypes

ra_gal      float64
dec_gal     float64
z_cgal      float64
z_cgal_v    float64
lmhalo      float64
abs_mag     float64
color       float64
app_mag     float64
dtype: object

### Writing to a file

[`to_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)

In [181]:
outfile= os.path.join(gdrive_path, data_directory, "pandas_data/outfile_name.csv")
galaxy_sample_bz2.to_csv(outfile,
          columns = ['ra_gal', 'dec_gal','color'],
          index=True,
          header=True
          )

In [182]:
!head {outfile}

unique_gal_id,ra_gal,dec_gal,color
28581888,6.322946,25.82068,0.8683
6686720,32.696644,53.073577,0.8183
23693312,17.699937,75.128659,0.5311
6115328,61.603497,59.016913,0.17800000000000002
8955904,32.202269,71.912705,0.1631
12351488,41.511403,78.611159,0.8499
15034368,36.454817999999996,1.470759,0.4932
17098752,22.305408,32.920644,0.1543
15955968,30.081703000000005,6.398922,0.3872
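Values such as `0.17800000000000002` in the output above are full-precision representations of floating-point numbers. If shorter values are preferred in the output file, one option is the `float_format` parameter of `to_csv`. The sketch below uses a small made-up frame and writes to an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Made-up frame; 0.1 + 0.2 is a classic value with a long float representation.
df_out = pd.DataFrame(
    {'color': [0.1 + 0.2, 0.8683]},
    index=pd.Index([1, 2], name='unique_gal_id'),
)

buf = io.StringIO()
df_out.to_csv(buf, float_format='%.4f')  # four decimal places for every float
print(buf.getvalue())
```

Note that `float_format` rounds the written values, so it should only be used when the dropped digits are not needed.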


## Indexing and slicing

The basics of indexing are summarized in the table below:

| Operation                      | Syntax           | Result        |
|--------------------------------|------------------|---------------|
| Select column                  | df[column label] | Series        |
| Select row by index            | df.loc[index]    | Series        |
| Select row by integer location | df.iloc[pos]     | Series        |
| Slice rows                     | df[5:10]         | DataFrame     |
| Select rows by boolean vector  | df[bool_vec]     | DataFrame     |
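Each row of the table can be tried on a small frame before moving to the galaxy catalogue. The frame `df_demo` and its values are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'ra': [10.0, 50.0, 80.0], 'dec': [1.0, 2.0, 3.0]},
    index=[101, 102, 103],
)

col = df_demo['ra']                    # select column: Series
row_by_label = df_demo.loc[102]        # select row by index label: Series
row_by_pos = df_demo.iloc[0]           # select row by integer position: Series
rows = df_demo[1:3]                    # slice rows by position: DataFrame
masked = df_demo[df_demo['ra'] > 45]   # select rows by boolean vector: DataFrame

print(type(col).__name__, type(rows).__name__)
```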

In [183]:
filename_bz2= os.path.join(gdrive_path, data_directory, "pandas_data/galaxy_sample.csv.bz2")
galaxy_sample_bz2 = pd.read_csv(filename_bz2, sep=',', index_col = unique_gal_id_field, comment='#', na_values = r'\N')

In [184]:
galaxy_sample.head()

unique_gal_id,ra_gal,dec_gal,z_cgal,z_cgal_v,lmhalo,abs_mag,color,app_mag
28581888,6.322946,25.82068,0.30917,0.30894,13.5638,-19.107267,0.8683,20.773734
6686720,32.696644,53.073577,0.17156,0.16844,14.4175,-16.58748,0.8183,21.80362
23693312,17.699937,75.128659,0.7227,0.72423,11.4068,-19.403244,0.5311,22.553554
6115328,61.603497,59.016913,0.21891,0.21605,13.9224,-16.929097,0.178,22.106301
8955904,32.202269,71.912705,0.29446,0.29086,13.2412,-15.597617,0.1631,24.230383


In [185]:
len(galaxy_sample)

243988

* Selecting a column

In [186]:
galaxy_sample['ra_gal'].head()

unique_gal_id
28581888     6.322946
6686720     32.696644
23693312    17.699937
6115328     61.603497
8955904     32.202269
Name: ra_gal, dtype: float64

In [187]:
type(galaxy_sample['dec_gal'])

pandas.core.series.Series

In [193]:
galaxy_sample[['ra_gal','dec_gal','lmhalo']].head()

unique_gal_id,ra_gal,dec_gal,lmhalo
28581888,6.322946,25.82068,13.5638
6686720,32.696644,53.073577,14.4175
23693312,17.699937,75.128659,11.4068
6115328,61.603497,59.016913,13.9224
8955904,32.202269,71.912705,13.2412


* Selecting a row by index label

In [194]:
galaxy_sample.loc[28581888]

ra_gal       6.322946
dec_gal     25.820680
z_cgal       0.309170
z_cgal_v     0.308940
lmhalo      13.563800
abs_mag    -19.107267
color        0.868300
app_mag     20.773734
Name: 28581888, dtype: float64

In [195]:
type(galaxy_sample.loc[28581888])

pandas.core.series.Series

* Selecting a row by integer position

In [196]:
galaxy_sample.iloc[0]

ra_gal       6.322946
dec_gal     25.820680
z_cgal       0.309170
z_cgal_v     0.308940
lmhalo      13.563800
abs_mag    -19.107267
color        0.868300
app_mag     20.773734
Name: 28581888, dtype: float64

In [197]:
type(galaxy_sample.iloc[0])

pandas.core.series.Series

* Slicing rows

In [198]:
galaxy_sample.iloc[3:7]

unique_gal_id,ra_gal,dec_gal,z_cgal,z_cgal_v,lmhalo,abs_mag,color,app_mag
6115328,61.603497,59.016913,0.21891,0.21605,13.9224,-16.929097,0.178,22.106301
8955904,32.202269,71.912705,0.29446,0.29086,13.2412,-15.597617,0.1631,24.230383
12351488,41.511403,78.611159,0.76228,0.75945,13.7969,-18.716348,0.8499,23.537352
15034368,36.454818,1.470759,0.16827,0.16882,12.4438,-16.753474,0.4932,21.759327


In [199]:
galaxy_sample[3:7]

unique_gal_id,ra_gal,dec_gal,z_cgal,z_cgal_v,lmhalo,abs_mag,color,app_mag
6115328,61.603497,59.016913,0.21891,0.21605,13.9224,-16.929097,0.178,22.106301
8955904,32.202269,71.912705,0.29446,0.29086,13.2412,-15.597617,0.1631,24.230383
12351488,41.511403,78.611159,0.76228,0.75945,13.7969,-18.716348,0.8499,23.537352
15034368,36.454818,1.470759,0.16827,0.16882,12.4438,-16.753474,0.4932,21.759327


In [200]:
type(galaxy_sample.iloc[3:7])

pandas.core.frame.DataFrame

* Selecting rows with a boolean mask

In [203]:
(galaxy_sample['ra_gal'] < 45).tail()

unique_gal_id
61456384    False
48244736     True
53536768     True
72990720     True
63993856    False
Name: ra_gal, dtype: bool

In [204]:
type(galaxy_sample['ra_gal'] < 45)

pandas.core.series.Series

In [205]:
galaxy_sample[galaxy_sample['ra_gal'] > 45].head()

unique_gal_id,ra_gal,dec_gal,z_cgal,z_cgal_v,lmhalo,abs_mag,color,app_mag
6115328,61.603497,59.016913,0.21891,0.21605,13.9224,-16.929097,0.178,22.106301
490539008,50.947777,47.746368,0.52318,0.52377,10.847,-16.736888,0.5417,24.525811
486078464,82.925868,40.278724,0.43548,0.43453,10.7899,-16.574389,0.4893,24.224212
497823744,60.660616,2.134763,0.44947,0.44761,10.7859,-16.568143,0.3134,24.271558
499359744,58.234774,38.335735,0.5406,0.5412,10.8696,-17.017661,0.5047,24.231038


In [210]:
# AND - &
# OR  - |
# NOT - ~
# (galaxy_sample.z_cgal <= 0.2) OR (galaxy_sample.z_cgal >= 1.0)
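The operators listed above work element-wise on boolean Series. Each comparison needs its own parentheses, because `&`, `|`, and `~` bind more tightly than `<=` or `>=` in Python. A minimal sketch with made-up redshift values:

```python
import pandas as pd

# Made-up redshifts for three galaxies.
z = pd.Series([0.1, 0.5, 1.2], index=['g1', 'g2', 'g3'])

low_or_high = (z <= 0.2) | (z >= 1.0)   # OR
in_between = (z > 0.2) & (z < 1.0)      # AND
not_extreme = ~low_or_high              # NOT

print(z[low_or_high])
print(z[not_extreme])
```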

In [212]:
galaxy_sample[(galaxy_sample.z_cgal <= 0.2) | (galaxy_sample.z_cgal >= 1.0)].head()

unique_gal_id,ra_gal,dec_gal,z_cgal,z_cgal_v,lmhalo,abs_mag,color,app_mag
6686720,32.696644,53.073577,0.17156,0.16844,14.4175,-16.58748,0.8183,21.80362
15034368,36.454818,1.470759,0.16827,0.16882,12.4438,-16.753474,0.4932,21.759327
17098752,22.305408,32.920644,0.19433,0.19137,12.1769,-16.021589,0.1543,22.861211
15955968,30.081703,6.398922,0.18972,0.1857,13.6578,-15.146481,0.3872,23.614319
17274880,16.696839,21.846449,0.19777,0.19551,13.7144,-14.924288,0.2035,23.980411


In [213]:
galaxy_sample[(galaxy_sample.z_cgal <= 1.0) & (galaxy_sample.index.isin([6686720,13615360,3231232]))]

unique_gal_id,ra_gal,dec_gal,z_cgal,z_cgal_v,lmhalo,abs_mag,color,app_mag
6686720,32.696644,53.073577,0.17156,0.16844,14.4175,-16.58748,0.8183,21.80362


In [219]:
galaxy_sample[(galaxy_sample['ra_gal'] < 1.) & (galaxy_sample['dec_gal'] < 1.)]\
              [['ra_gal','dec_gal']]\
              .head()

unique_gal_id,ra_gal,dec_gal
60252160,0.50634,0.603683
325529600,0.511039,0.443529
443926528,0.322707,0.102679
60200960,0.944569,0.540849
445370368,0.774362,0.125568


### Merge, join, and concatenate

<https://pandas.pydata.org/pandas-docs/stable/merging.html>

- pandas provides many facilities for easily combining Series and DataFrame objects, using various kinds of set logic on the indexes, and relational algebra for `join`/`merge`-type operations.

- the *concat* function (the `join_axes` parameter found in older tutorials was removed in pandas 1.0):
```
pd.concat(objs, axis=0, join='outer', ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          sort=False)
```

In [220]:
df1 = pd.DataFrame(
    {'A': ['A0', 'A1', 'A2', 'A3'],
     'B': ['B0', 'B1', 'B2', 'B3'],
     'C': ['C0', 'C1', 'C2', 'C3'],
     'D': ['D0', 'D1', 'D2', 'D3']},
    index=[0, 1, 2, 3]
)

In [221]:
df2 = pd.DataFrame(
    {'A': ['A4', 'A5', 'A6', 'A7'],
     'B': ['B4', 'B5', 'B6', 'B7'],
     'C': ['C4', 'C5', 'C6', 'C7'],
     'D': ['D4', 'D5', 'D6', 'D7']},
    index=[4, 5, 6, 7]
) 

In [222]:
df3 = pd.DataFrame(
    {'A': ['A8', 'A9', 'A10', 'A11'],
     'B': ['B8', 'B9', 'B10', 'B11'],
     'C': ['C8', 'C9', 'C10', 'C11'],
     'D': ['D8', 'D9', 'D10', 'D11']},
    index=[8, 9, 10, 11]
)

In [223]:
df3

,A,B,C,D
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


In [224]:
df2

,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [225]:
frames = [df1, df2, df3]

In [226]:
frames

[    A   B   C   D
 0  A0  B0  C0  D0
 1  A1  B1  C1  D1
 2  A2  B2  C2  D2
 3  A3  B3  C3  D3,     A   B   C   D
 4  A4  B4  C4  D4
 5  A5  B5  C5  D5
 6  A6  B6  C6  D6
 7  A7  B7  C7  D7,       A    B    C    D
 8    A8   B8   C8   D8
 9    A9   B9   C9   D9
 10  A10  B10  C10  D10
 11  A11  B11  C11  D11]

In [227]:
result = pd.concat(frames)
result

,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


In [228]:
# Multiindex
result = pd.concat(frames, keys=['x', 'y','z'])

In [229]:
result

,,A,B,C,D
x,0,A0,B0,C0,D0
x,1,A1,B1,C1,D1
x,2,A2,B2,C2,D2
x,3,A3,B3,C3,D3
y,4,A4,B4,C4,D4
y,5,A5,B5,C5,D5
y,6,A6,B6,C6,D6
y,7,A7,B7,C7,D7
z,8,A8,B8,C8,D8
z,9,A9,B9,C9,D9
z,10,A10,B10,C10,D10
z,11,A11,B11,C11,D11


In [230]:
result.index

MultiIndex([('x',  0),
            ('x',  1),
            ('x',  2),
            ('x',  3),
            ('y',  4),
            ('y',  5),
            ('y',  6),
            ('y',  7),
            ('z',  8),
            ('z',  9),
            ('z', 10),
            ('z', 11)],
           )

In [233]:
result.loc['y']

,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [232]:
result.loc[('y',4)]

A    A4
B    B4
C    C4
D    D4
Name: (y, 4), dtype: object

In [234]:
df4 = pd.DataFrame(
    {'B': ['B2', 'B3', 'B6', 'B7'],
     'D': ['D2', 'D3', 'D6', 'D7'],
     'F': ['F2', 'F3', 'F6', 'F7']},
    index=[2, 3, 6, 7]
)

In [235]:
df4

,B,D,F
2,B2,D2,F2
3,B3,D3,F3
6,B6,D6,F6
7,B7,D7,F7


In [236]:
df1

,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [237]:
result = pd.concat([df1, df4])
result

,A,B,C,D,F
0,A0,B0,C0,D0,
1,A1,B1,C1,D1,
2,A2,B2,C2,D2,
3,A3,B3,C3,D3,
2,,B2,,D2,F2
3,,B3,,D3,F3
6,,B6,,D6,F6
7,,B7,,D7,F7


In [238]:
result = pd.concat([df1, df4], axis=1)
result

,A,B,C,D,B,D,F
0,A0,B0,C0,D0,,,
1,A1,B1,C1,D1,,,
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3
6,,,,,B6,D6,F6
7,,,,,B7,D7,F7


In [239]:
result = pd.concat([df1, df4], axis=1, join='inner')
result

,A,B,C,D,B,D,F
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3
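The cells above cover `concat` only. For the `merge` side of this section, here is a minimal sketch of a database-style join on a shared column. The frames `left` and `right` and their `key` column are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K2', 'K3'], 'B': ['B0', 'B2', 'B3']})

# Inner join: keep only keys present in both frames.
inner = pd.merge(left, right, on='key', how='inner')
print(inner)

# Outer join: keep all keys, filling the gaps with NaN.
outer = pd.merge(left, right, on='key', how='outer')
print(outer)
```

Unlike `concat`, which aligns on the index, `merge` matches rows on column values (or on the index when `left_index`/`right_index` are set).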


### Takeaways

* Do not iterate over a DataFrame with a `for` loop unless you really have to.
* Prefer built-in methods over your own functions.
* Work with standard data formats.
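The first two points can be illustrated by computing the same quantity twice, once with a loop and once with a column-wise expression. The frame and the derived column `mag_diff` are made up for illustration.

```python
import pandas as pd

df_mag = pd.DataFrame({
    'app_mag': [20.8, 21.8, 22.6],
    'abs_mag': [-19.1, -16.6, -19.4],
})

# Loop version: works, but visits rows one at a time in Python.
diffs_loop = []
for _, row in df_mag.iterrows():
    diffs_loop.append(row['app_mag'] - row['abs_mag'])

# Vectorized version: one expression over whole columns.
df_mag['mag_diff'] = df_mag['app_mag'] - df_mag['abs_mag']
print(df_mag)
```

Both versions give the same numbers; on a catalogue with hundreds of thousands of rows the vectorized form is typically much faster and is also shorter to read.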

---
#### View vs. Copy

<https://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy>
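A minimal sketch of two habits that sidestep most view-versus-copy surprises: assign through a single `.loc` call, and call `.copy()` when an independent subset is wanted. The frame `df_vc` is made up for illustration.

```python
import pandas as pd

df_vc = pd.DataFrame({'ra': [10.0, 50.0, 80.0], 'flag': [0, 0, 0]})

# Assign through one .loc call: rows and column are selected together,
# so the original frame is modified.
df_vc.loc[df_vc['ra'] > 45, 'flag'] = 1

# Take an explicit copy before modifying a subset: the original is untouched.
subset = df_vc[df_vc['ra'] > 45].copy()
subset['flag'] = 99

print(df_vc)
print(subset)
```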