# Pandas 1

Pandas is a data analysis library built on top of NumPy. It is the main tool in the Python ecosystem for analyzing, cleaning, and preparing data, and it is valued for its speed and productivity. It can work with data from a variety of sources, such as CSV files.

Pandas is installed with conda install pandas or pip install pandas.

In [300]:
import pandas as pd
import numpy as np

# Series

A Series is the basic pandas data structure, built on top of the NumPy array.

In [301]:
labels = ['x', 'y', 'z']
data = [20, 30, 40]
pd.Series(data=data)

0    20
1    30
2    40
dtype: int64

As we can see, a pandas Series differs from an ordinary array in that it carries an index. One of the parameters we can pass when creating a Series is index.

In [302]:
pd.Series(data=data, index=labels)


x    20
y    30
z    40
dtype: int64

A pandas Series can also be created from a Python dictionary:



In [303]:
zodynas = {'x':20, 'y':30, 'z':40}
pd.Series(zodynas)

x    20
y    30
z    40
dtype: int64

#### Retrieving a value from a Series

In [304]:
serija = pd.Series(data=[1,2,3,4,5], index=['Vilnius', 'Kaunas', 'Klaipėda', 'Panevėžys', 'Šiauliai'])

In [305]:
serija

Vilnius      1
Kaunas       2
Klaipėda     3
Panevėžys    4
Šiauliai     5
dtype: int64

In [306]:
serija["Vilnius"]

1
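
Besides a single label, a Series can be indexed in a few other ways. The following is a minimal sketch that recreates the same serija; the choice of examples is ours, not part of the original notebook.

```python
import pandas as pd

serija = pd.Series(data=[1, 2, 3, 4, 5],
                   index=['Vilnius', 'Kaunas', 'Klaipėda', 'Panevėžys', 'Šiauliai'])

by_label = serija.loc['Kaunas']             # explicit label-based lookup
by_position = serija.iloc[0]                # position-based lookup (first element)
several = serija[['Vilnius', 'Šiauliai']]   # a list of labels returns a Series
has_city = 'Lentvaris' in serija            # membership is checked against the index

print(by_label, by_position, has_city)
print(several)
```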

#### Operations on Series

In [307]:
serija2 = pd.Series([1,2,3,4,5], ['Vilnius', 'Kaunas', 'Lentvaris', 'Šiauliai', 'Klaipėda'])

In [308]:
serija2

Vilnius      1
Kaunas       2
Lentvaris    3
Šiauliai     4
Klaipėda     5
dtype: int64

With addition, pandas aligns the two Series by index label and sums the values wherever it can:

In [309]:
serija + serija2

Kaunas       4.0
Klaipėda     8.0
Lentvaris    NaN
Panevėžys    NaN
Vilnius      2.0
Šiauliai     9.0
dtype: float64

Where pandas could not perform the addition, because a label exists in only one of the two Series, it produced NaN (not a number). NaN is a floating-point value, so the integer values are converted to float in the result.

Other operations work the same way:
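
If NaN is not the desired result for labels present in only one Series, the add method accepts a fill_value. The sketch below reuses the two Series from above; treating a missing label as 0 is our assumption about what would be wanted, not something the notebook prescribes.

```python
import pandas as pd

serija = pd.Series([1, 2, 3, 4, 5],
                   ['Vilnius', 'Kaunas', 'Klaipėda', 'Panevėžys', 'Šiauliai'])
serija2 = pd.Series([1, 2, 3, 4, 5],
                    ['Vilnius', 'Kaunas', 'Lentvaris', 'Šiauliai', 'Klaipėda'])

# A label missing from one Series is treated as 0 instead of producing NaN
suma = serija.add(serija2, fill_value=0)
print(suma)
```

The result is still float64, because the alignment step works with missing values internally before they are filled.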

In [310]:
serija - serija2

Kaunas       0.0
Klaipėda    -2.0
Lentvaris    NaN
Panevėžys    NaN
Vilnius      0.0
Šiauliai     1.0
dtype: float64

In [311]:
serija * serija2

Kaunas        4.0
Klaipėda     15.0
Lentvaris     NaN
Panevėžys     NaN
Vilnius       1.0
Šiauliai     20.0
dtype: float64

In [312]:
serija / serija2

Kaunas       1.00
Klaipėda     0.60
Lentvaris     NaN
Panevėžys     NaN
Vilnius      1.00
Šiauliai     1.25
dtype: float64

# DataFrames

The DataFrame is the main object for pandas operations. To create a new DataFrame, we pass the data, index, and columns parameters:



In [313]:
df = pd.DataFrame(np.random.rand(5,6), 
                  ['a', 'b', 'c', 'd', 'e'], 
                  ['U', 'V', 'W', 'X', 'Y', 'Z'])
df

| | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| a | 0.975097 | 0.908093 | 0.336792 | 0.594996 | 0.1724 | 0.819233 |
| b | 0.583449 | 0.984551 | 0.803541 | 0.694863 | 0.959989 | 0.511862 |
| c | 0.6007 | 0.275267 | 0.312654 | 0.015492 | 0.890436 | 0.506157 |
| d | 0.772712 | 0.159734 | 0.313451 | 0.293376 | 0.992515 | 0.069996 |
| e | 0.404471 | 0.261135 | 0.286117 | 0.621187 | 0.044638 | 0.721126 |
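
A DataFrame can also be built from a dictionary, where each key becomes a column name. This is a small sketch with made-up values; the names df_demo, population, and area are hypothetical and do not appear in the original notebook.

```python
import pandas as pd

# Hypothetical data: each dict key becomes a column, the index labels the rows
df_demo = pd.DataFrame(
    {'population': [10, 20, 30], 'area': [1.5, 2.5, 3.5]},
    index=['a', 'b', 'c'],
)

print(df_demo)
print(df_demo.shape)   # (number of rows, number of columns)
```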


Each column is a pandas Series, and all columns share one index (a, b, c, d, e). For example:



In [314]:
df['U']

a    0.975097
b    0.583449
c    0.600700
d    0.772712
e    0.404471
Name: U, dtype: float64

In [315]:
type(df['U'])

pandas.core.series.Series

To select several columns, pass a list of column names:

In [316]:
df[['U', 'Y', 'Z']]

| | U | Y | Z |
|---|---|---|---|
| a | 0.975097 | 0.1724 | 0.819233 |
| b | 0.583449 | 0.959989 | 0.511862 |
| c | 0.6007 | 0.890436 | 0.506157 |
| d | 0.772712 | 0.992515 | 0.069996 |
| e | 0.404471 | 0.044638 | 0.721126 |


Keep in mind that the result is now a DataFrame, not a Series:

In [317]:
type(df[['U', 'Y', 'Z']])

pandas.core.frame.DataFrame

#### Creating a new column


In [318]:
df['naujas'] = [1, 2, 3, 4, 5]
df

| | U | V | W | X | Y | Z | naujas |
|---|---|---|---|---|---|---|---|
| a | 0.975097 | 0.908093 | 0.336792 | 0.594996 | 0.1724 | 0.819233 | 1 |
| b | 0.583449 | 0.984551 | 0.803541 | 0.694863 | 0.959989 | 0.511862 | 2 |
| c | 0.6007 | 0.275267 | 0.312654 | 0.015492 | 0.890436 | 0.506157 | 3 |
| d | 0.772712 | 0.159734 | 0.313451 | 0.293376 | 0.992515 | 0.069996 | 4 |
| e | 0.404471 | 0.261135 | 0.286117 | 0.621187 | 0.044638 | 0.721126 | 5 |
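
A new column does not have to come from a list; it can also be computed from existing columns. A minimal sketch with a small hypothetical frame (df_demo and the column name suma are ours):

```python
import pandas as pd

df_demo = pd.DataFrame({'U': [1.0, 2.0, 3.0], 'V': [10.0, 20.0, 30.0]},
                       index=['a', 'b', 'c'])

# Element-wise arithmetic on two columns, aligned by index
df_demo['suma'] = df_demo['U'] + df_demo['V']
print(df_demo)
```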


#### Deleting a column



In [319]:
df.drop('naujas', axis=1)


| | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| a | 0.975097 | 0.908093 | 0.336792 | 0.594996 | 0.1724 | 0.819233 |
| b | 0.583449 | 0.984551 | 0.803541 | 0.694863 | 0.959989 | 0.511862 |
| c | 0.6007 | 0.275267 | 0.312654 | 0.015492 | 0.890436 | 0.506157 |
| d | 0.772712 | 0.159734 | 0.313451 | 0.293376 | 0.992515 | 0.069996 |
| e | 0.404471 | 0.261135 | 0.286117 | 0.621187 | 0.044638 | 0.721126 |


axis=0 would mean that the operation applies to rows; axis=1 means columns.
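
The drop method also accepts the keywords columns and index, which some readers find easier to remember than axis numbers. A short sketch on a hypothetical two-row frame:

```python
import pandas as pd

df_demo = pd.DataFrame({'U': [1, 2], 'V': [3, 4], 'naujas': [5, 6]},
                       index=['a', 'b'])

without_col = df_demo.drop(columns='naujas')   # same as drop('naujas', axis=1)
without_row = df_demo.drop(index='b')          # same as drop('b', axis=0)

print(without_col)
print(without_row)
```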



#### The inplace parameter



Our last operation did not change the original: if we display df now, we see that it is exactly as it was:



In [320]:
df

| | U | V | W | X | Y | Z | naujas |
|---|---|---|---|---|---|---|---|
| a | 0.975097 | 0.908093 | 0.336792 | 0.594996 | 0.1724 | 0.819233 | 1 |
| b | 0.583449 | 0.984551 | 0.803541 | 0.694863 | 0.959989 | 0.511862 | 2 |
| c | 0.6007 | 0.275267 | 0.312654 | 0.015492 | 0.890436 | 0.506157 | 3 |
| d | 0.772712 | 0.159734 | 0.313451 | 0.293376 | 0.992515 | 0.069996 | 4 |
| e | 0.404471 | 0.261135 | 0.286117 | 0.621187 | 0.044638 | 0.721126 | 5 |


DataFrame methods (for example, df.drop) do not modify the DataFrame itself; they only return an updated copy.

So we either have to save the result to another variable:



In [321]:
df2 = df.drop('naujas', axis=1)
df2

| | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| a | 0.975097 | 0.908093 | 0.336792 | 0.594996 | 0.1724 | 0.819233 |
| b | 0.583449 | 0.984551 | 0.803541 | 0.694863 | 0.959989 | 0.511862 |
| c | 0.6007 | 0.275267 | 0.312654 | 0.015492 | 0.890436 | 0.506157 |
| d | 0.772712 | 0.159734 | 0.313451 | 0.293376 | 0.992515 | 0.069996 |
| e | 0.404471 | 0.261135 | 0.286117 | 0.621187 | 0.044638 | 0.721126 |




or, if we want to change the original, we have to pass the parameter inplace=True:

In [322]:
df.drop('naujas', axis=1, inplace=True)
df

| | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| a | 0.975097 | 0.908093 | 0.336792 | 0.594996 | 0.1724 | 0.819233 |
| b | 0.583449 | 0.984551 | 0.803541 | 0.694863 | 0.959989 | 0.511862 |
| c | 0.6007 | 0.275267 | 0.312654 | 0.015492 | 0.890436 | 0.506157 |
| d | 0.772712 | 0.159734 | 0.313451 | 0.293376 | 0.992515 | 0.069996 |
| e | 0.404471 | 0.261135 | 0.286117 | 0.621187 | 0.044638 | 0.721126 |


This default works in our favor: as long as inplace is not True, we cannot accidentally damage the data.

#### Dropping a row
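
One detail worth knowing is that a call with inplace=True returns None, so its result should not be assigned. The sketch below contrasts the two styles on hypothetical frames (df_demo and df_other are our names):

```python
import pandas as pd

df_demo = pd.DataFrame({'U': [1, 2], 'naujas': [3, 4]})

# With inplace=True the frame is modified and the call returns None
result = df_demo.drop('naujas', axis=1, inplace=True)
print(result)

# The alternative is to reassign the returned copy to the same name
df_other = pd.DataFrame({'U': [1, 2], 'naujas': [3, 4]})
df_other = df_other.drop('naujas', axis=1)
print(df_other)
```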



In [323]:
df.drop('e')

| | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| a | 0.975097 | 0.908093 | 0.336792 | 0.594996 | 0.1724 | 0.819233 |
| b | 0.583449 | 0.984551 | 0.803541 | 0.694863 | 0.959989 | 0.511862 |
| c | 0.6007 | 0.275267 | 0.312654 | 0.015492 | 0.890436 | 0.506157 |
| d | 0.772712 | 0.159734 | 0.313451 | 0.293376 | 0.992515 | 0.069996 |


When dropping a row, specifying axis=0 is not necessary; it is the default value.

#### Selecting rows



In [324]:
df.loc['e']

U    0.404471
V    0.261135
W    0.286117
X    0.621187
Y    0.044638
Z    0.721126
Name: e, dtype: float64

Rows can also be selected by integer position:



In [325]:
df.iloc[4]

U    0.404471
V    0.261135
W    0.286117
X    0.621187
Y    0.044638
Z    0.721126
Name: e, dtype: float64
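
The difference between loc (labels) and iloc (positions) is most visible when slicing: a label slice includes its end point, while a position slice does not. A minimal sketch on a hypothetical frame with the same a to e index:

```python
import pandas as pd

df_demo = pd.DataFrame({'U': [1, 2, 3, 4, 5]},
                       index=['a', 'b', 'c', 'd', 'e'])

by_label = df_demo.loc['b':'d']     # label slice: 'd' is included
by_position = df_demo.iloc[1:3]     # position slice: rows 1 and 2 only

print(by_label)
print(by_position)
```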

#### Subsets

If we want a single value from the table:

In [326]:
df.loc['c', 'U']

0.6007004101357305

If we want a fragment of rows and columns (a subset):

In [327]:
df.loc[['a', 'c'], ['U', 'V', 'Z']]

| | U | V | Z |
|---|---|---|---|
| a | 0.975097 | 0.908093 | 0.819233 |
| c | 0.6007 | 0.275267 | 0.506157 |


#### Selecting data by condition

Selecting data by condition works much like it does in NumPy:



In [328]:
df

| | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| a | 0.975097 | 0.908093 | 0.336792 | 0.594996 | 0.1724 | 0.819233 |
| b | 0.583449 | 0.984551 | 0.803541 | 0.694863 | 0.959989 | 0.511862 |
| c | 0.6007 | 0.275267 | 0.312654 | 0.015492 | 0.890436 | 0.506157 |
| d | 0.772712 | 0.159734 | 0.313451 | 0.293376 | 0.992515 | 0.069996 |
| e | 0.404471 | 0.261135 | 0.286117 | 0.621187 | 0.044638 | 0.721126 |


In [329]:
df[df>0.4] 


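| | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| a | 0.975097 | 0.908093 | NaN | 0.594996 | NaN | 0.819233 |
| b | 0.583449 | 0.984551 | 0.803541 | 0.694863 | 0.959989 | 0.511862 |
| c | 0.6007 | NaN | NaN | NaN | 0.890436 | 0.506157 |
| d | 0.772712 | NaN | NaN | NaN | 0.992515 | NaN |
| e | 0.404471 | NaN | NaN | 0.621187 | NaN | 0.721126 |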


Where a value satisfies the condition, we get the value; where it does not, we get NaN.



If we need the subset where the values in column 'W' are > 0.5:



In [330]:
kazkas = df[df['W']>0.5]  # filter rows with a boolean Series
kazkas2 = df[[True, True, True, False, False]]  # a plain list of booleans works too: keeps rows a, b, c
kazkas


| | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| b | 0.583449 | 0.984551 | 0.803541 | 0.694863 | 0.959989 | 0.511862 |


The difference between these operations is this: when we apply a condition to the whole DataFrame, we get a DataFrame of the same shape with NaN wherever the original value does not satisfy the condition. When we apply a condition to a single column, we get only the rows that satisfy the condition, that is, we filter.
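
The two behaviours described above can be seen side by side in a small sketch. The frame df_demo and its values are hypothetical; the point is only what the condition itself evaluates to in each case.

```python
import pandas as pd

df_demo = pd.DataFrame({'U': [0.9, 0.2, 0.6], 'W': [0.3, 0.8, 0.1]},
                       index=['a', 'b', 'c'])

mask = df_demo['W'] > 0.5           # a boolean Series, one value per row
filtered = df_demo[mask]            # keeps only rows where the mask is True

masked = df_demo[df_demo > 0.5]     # same shape as df_demo, NaN where False

print(mask)
print(filtered)
print(masked)
```

Indexing with a boolean DataFrame gives the same result as df_demo.where(df_demo > 0.5).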



#### Combining queries



In [331]:
df[df['W']>0.5][['U', 'W', 'Z']]


| | U | W | Z |
|---|---|---|---|
| b | 0.583449 | 0.803541 | 0.511862 |


In this example we get the same result as from running two separate statements one after the other: df1 = df[df['W']>0.5], then df1[['U', 'W', 'Z']]. Chaining the queries saves us from introducing an extra named variable (df1 in this case).
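
The same selection can also be written as a single loc call that takes the row condition and the column list together. This is a sketch on a hypothetical frame; for reading data both forms give the same result, while a single loc call tends to be the safer choice when values are later assigned to the selection.

```python
import pandas as pd

df_demo = pd.DataFrame({'U': [0.9, 0.2, 0.6],
                        'W': [0.3, 0.8, 0.7],
                        'Z': [0.1, 0.5, 0.4]},
                       index=['a', 'b', 'c'])

chained = df_demo[df_demo['W'] > 0.5][['U', 'Z']]       # two steps chained
single = df_demo.loc[df_demo['W'] > 0.5, ['U', 'Z']]    # rows and columns at once

print(single)
```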



#### Combining conditions



In [332]:
df

| | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| a | 0.975097 | 0.908093 | 0.336792 | 0.594996 | 0.1724 | 0.819233 |
| b | 0.583449 | 0.984551 | 0.803541 | 0.694863 | 0.959989 | 0.511862 |
| c | 0.6007 | 0.275267 | 0.312654 | 0.015492 | 0.890436 | 0.506157 |
| d | 0.772712 | 0.159734 | 0.313451 | 0.293376 | 0.992515 | 0.069996 |
| e | 0.404471 | 0.261135 | 0.286117 | 0.621187 | 0.044638 | 0.721126 |


In [333]:
df[(df['U']>0.5) & (df['Z']<0.5)]

| | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| d | 0.772712 | 0.159734 | 0.313451 | 0.293376 | 0.992515 | 0.069996 |


We got the rows where the value in column U is greater than 0.5 and the value in column Z is less than 0.5.
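
Besides & (and), conditions can be combined with | (or) and inverted with ~ (not). Each condition needs its own parentheses, because these operators bind more tightly than the comparisons. A minimal sketch on a hypothetical frame:

```python
import pandas as pd

df_demo = pd.DataFrame({'U': [0.9, 0.2, 0.3], 'Z': [0.8, 0.1, 0.7]},
                       index=['a', 'b', 'c'])

# Rows where at least one of the two conditions holds
either = df_demo[(df_demo['U'] > 0.5) | (df_demo['Z'] < 0.5)]

# Rows where the condition does NOT hold
negated = df_demo[~(df_demo['U'] > 0.5)]

print(either)
print(negated)
```

The Python keywords and, or, and not do not work here, since they expect a single truth value rather than a Series of them.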



#### Operations on the index

reset_index turns the old index into an ordinary column and creates a new integer index. As with drop, it returns a new DataFrame; inplace=True is needed if we want to change the original.



In [334]:
df

| | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| a | 0.975097 | 0.908093 | 0.336792 | 0.594996 | 0.1724 | 0.819233 |
| b | 0.583449 | 0.984551 | 0.803541 | 0.694863 | 0.959989 | 0.511862 |
| c | 0.6007 | 0.275267 | 0.312654 | 0.015492 | 0.890436 | 0.506157 |
| d | 0.772712 | 0.159734 | 0.313451 | 0.293376 | 0.992515 | 0.069996 |
| e | 0.404471 | 0.261135 | 0.286117 | 0.621187 | 0.044638 | 0.721126 |


In [335]:
df.reset_index()  # returns a new DataFrame; the result is discarded here, so df stays unchanged
df

| | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| a | 0.975097 | 0.908093 | 0.336792 | 0.594996 | 0.1724 | 0.819233 |
| b | 0.583449 | 0.984551 | 0.803541 | 0.694863 | 0.959989 | 0.511862 |
| c | 0.6007 | 0.275267 | 0.312654 | 0.015492 | 0.890436 | 0.506157 |
| d | 0.772712 | 0.159734 | 0.313451 | 0.293376 | 0.992515 | 0.069996 |
| e | 0.404471 | 0.261135 | 0.286117 | 0.621187 | 0.044638 | 0.721126 |
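
To actually see what reset_index produces, its result has to be kept. The sketch below uses a hypothetical two-row frame and also shows the drop=True option, which discards the old index instead of turning it into a column.

```python
import pandas as pd

df_demo = pd.DataFrame({'U': [1, 2]}, index=['a', 'b'])

with_column = df_demo.reset_index()         # old index becomes a column named 'index'
dropped = df_demo.reset_index(drop=True)    # old index is discarded

print(with_column)
print(dropped)
```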


One way to set a new index is to add a new column first and then promote it with set_index:



In [336]:
naujas_indeksas = 'Vilnius Kaunas Klaipėda Šiauliai Panevėžys'.split()
naujas_indeksas


['Vilnius', 'Kaunas', 'Klaipėda', 'Šiauliai', 'Panevėžys']

In [337]:
df['Miestai'] = naujas_indeksas
df

| | U | V | W | X | Y | Z | Miestai |
|---|---|---|---|---|---|---|---|
| a | 0.975097 | 0.908093 | 0.336792 | 0.594996 | 0.1724 | 0.819233 | Vilnius |
| b | 0.583449 | 0.984551 | 0.803541 | 0.694863 | 0.959989 | 0.511862 | Kaunas |
| c | 0.6007 | 0.275267 | 0.312654 | 0.015492 | 0.890436 | 0.506157 | Klaipėda |
| d | 0.772712 | 0.159734 | 0.313451 | 0.293376 | 0.992515 | 0.069996 | Šiauliai |
| e | 0.404471 | 0.261135 | 0.286117 | 0.621187 | 0.044638 | 0.721126 | Panevėžys |


In [338]:
df.set_index('Miestai')


| Miestai | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| Vilnius | 0.975097 | 0.908093 | 0.336792 | 0.594996 | 0.1724 | 0.819233 |
| Kaunas | 0.583449 | 0.984551 | 0.803541 | 0.694863 | 0.959989 | 0.511862 |
| Klaipėda | 0.6007 | 0.275267 | 0.312654 | 0.015492 | 0.890436 | 0.506157 |
| Šiauliai | 0.772712 | 0.159734 | 0.313451 | 0.293376 | 0.992515 | 0.069996 |
| Panevėžys | 0.404471 | 0.261135 | 0.286117 | 0.621187 | 0.044638 | 0.721126 |
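
Like drop, set_index returns a new DataFrame and leaves the original untouched unless the result is kept. Once a column is the index, its values can be used as labels with loc. A minimal sketch; df_demo and by_city are hypothetical names:

```python
import pandas as pd

df_demo = pd.DataFrame({'U': [1, 2], 'Miestai': ['Vilnius', 'Kaunas']},
                       index=['a', 'b'])

by_city = df_demo.set_index('Miestai')   # keep the result in a new variable
value = by_city.loc['Kaunas', 'U']       # city names now work as row labels

print(by_city)
print(value)
```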


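#### Number of rows

With the head() method we can specify how many rows of data we want. By default it returns the first 5 rows; with a negative argument such as head(-1) it returns all rows except the last one.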

In [341]:
df.head()

| | U | V | W | X | Y | Z | Miestai |
|---|---|---|---|---|---|---|---|
| a | 0.975097 | 0.908093 | 0.336792 | 0.594996 | 0.1724 | 0.819233 | Vilnius |
| b | 0.583449 | 0.984551 | 0.803541 | 0.694863 | 0.959989 | 0.511862 | Kaunas |
| c | 0.6007 | 0.275267 | 0.312654 | 0.015492 | 0.890436 | 0.506157 | Klaipėda |
| d | 0.772712 | 0.159734 | 0.313451 | 0.293376 | 0.992515 | 0.069996 | Šiauliai |
| e | 0.404471 | 0.261135 | 0.286117 | 0.621187 | 0.044638 | 0.721126 | Panevėžys |


In [342]:
df.head(3)

| | U | V | W | X | Y | Z | Miestai |
|---|---|---|---|---|---|---|---|
| a | 0.975097 | 0.908093 | 0.336792 | 0.594996 | 0.1724 | 0.819233 | Vilnius |
| b | 0.583449 | 0.984551 | 0.803541 | 0.694863 | 0.959989 | 0.511862 | Kaunas |
| c | 0.6007 | 0.275267 | 0.312654 | 0.015492 | 0.890436 | 0.506157 | Klaipėda |


In [343]:
df.head(-1)

| | U | V | W | X | Y | Z | Miestai |
|---|---|---|---|---|---|---|---|
| a | 0.975097 | 0.908093 | 0.336792 | 0.594996 | 0.1724 | 0.819233 | Vilnius |
| b | 0.583449 | 0.984551 | 0.803541 | 0.694863 | 0.959989 | 0.511862 | Kaunas |
| c | 0.6007 | 0.275267 | 0.312654 | 0.015492 | 0.890436 | 0.506157 | Klaipėda |
| d | 0.772712 | 0.159734 | 0.313451 | 0.293376 | 0.992515 | 0.069996 | Šiauliai |
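
The counterpart of head() is tail(), which counts from the end of the frame. A short sketch on a hypothetical ten-row frame:

```python
import pandas as pd

df_demo = pd.DataFrame({'U': range(10)})

first_default = df_demo.head()     # first 5 rows by default
last_three = df_demo.tail(3)       # last 3 rows
all_but_first = df_demo.tail(-1)   # every row except the first one

print(last_three)
```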


#### Reading files


In [376]:
df = pd.read_csv("test.csv", index_col="id")
df

| id | firstname | lastname | email | email2 | profession |
|---|---|---|---|---|---|
| 100 | Anthia | O'Carroll | Anthia.O'Carroll@yopmail.com | Anthia.O'Carroll@gmail.com | worker |
| 101 | Tina | Noam | Tina.Noam@yopmail.com | Tina.Noam@gmail.com | worker |
| 102 | Bettine | Wildermuth | Bettine.Wildermuth@yopmail.com | Bettine.Wildermuth@gmail.com | developer |
| 103 | Zia | My | Zia.My@yopmail.com | Zia.My@gmail.com | firefighter |
| 104 | Nollie | Rooney | Nollie.Rooney@yopmail.com | Nollie.Rooney@gmail.com | developer |
| ... | ... | ... | ... | ... | ... |
| 195 | Rori | Firmin | Rori.Firmin@yopmail.com | Rori.Firmin@gmail.com | firefighter |
| 196 | Roberta | Alwin | Roberta.Alwin@yopmail.com | Roberta.Alwin@gmail.com | worker |
| 197 | Ofilia | Bebe | Ofilia.Bebe@yopmail.com | Ofilia.Bebe@gmail.com | doctor |
| 198 | Tiffie | Angelis | Tiffie.Angelis@yopmail.com | Tiffie.Angelis@gmail.com | worker |


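In [374]:
# Alternative to index_col: read the file first, then set the index.
# (Calling set_index("id") on a frame whose index is already id raises KeyError.)
df = pd.read_csv("test.csv")
df.set_index("id", inplace=True)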

In [375]:
df

| id | firstname | lastname | email | email2 | profession |
|---|---|---|---|---|---|
| 100 | Anthia | O'Carroll | Anthia.O'Carroll@yopmail.com | Anthia.O'Carroll@gmail.com | worker |
| 101 | Tina | Noam | Tina.Noam@yopmail.com | Tina.Noam@gmail.com | worker |
| 102 | Bettine | Wildermuth | Bettine.Wildermuth@yopmail.com | Bettine.Wildermuth@gmail.com | developer |
| 103 | Zia | My | Zia.My@yopmail.com | Zia.My@gmail.com | firefighter |
| 104 | Nollie | Rooney | Nollie.Rooney@yopmail.com | Nollie.Rooney@gmail.com | developer |
| ... | ... | ... | ... | ... | ... |
| 195 | Rori | Firmin | Rori.Firmin@yopmail.com | Rori.Firmin@gmail.com | firefighter |
| 196 | Roberta | Alwin | Roberta.Alwin@yopmail.com | Roberta.Alwin@gmail.com | worker |
| 197 | Ofilia | Bebe | Ofilia.Bebe@yopmail.com | Ofilia.Bebe@gmail.com | doctor |
| 198 | Tiffie | Angelis | Tiffie.Angelis@yopmail.com | Tiffie.Angelis@gmail.com | worker |
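
Since test.csv is not included here, the following sketch reproduces the idea without a file: read_csv also accepts a file-like object, so a short CSV string wrapped in io.StringIO is enough to try index_col. The two rows are copied from the output above, with the columns trimmed for brevity; csv_text and df_demo are hypothetical names.

```python
import io

import pandas as pd

# A tiny in-memory stand-in for test.csv (only three of the columns)
csv_text = "id,firstname,profession\n100,Anthia,worker\n101,Tina,worker\n"

# index_col="id" makes the id column the index while reading
df_demo = pd.read_csv(io.StringIO(csv_text), index_col="id")

print(df_demo)
print(df_demo.loc[101, 'firstname'])
```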
