___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" alt="CLRSWY"></p>

___

<head>
    <center><title>~ Pandas DataFrames | Lesson-1 ~</title></center>
</head>
    

# DataFrames

``DataFrames`` are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a collection of Series objects that have been put together to share the same index. Let's use pandas to explore this topic!
https://daten_setcience.eu/de/programmierung/python-pandas-datenrahmen/
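
The idea of a DataFrame as a set of Series sharing one index can be sketched as follows. The names `height`, `weight` and `people` and their values are made up for illustration only.

```python
import pandas as pd

# Hypothetical example data: two Series whose labels only partly overlap
height = pd.Series([1.70, 1.82, 1.65], index=["ann", "bob", "cy"])
weight = pd.Series([61.0, 80.5], index=["bob", "ann"])

# A dict of Series is aligned on the union of the labels;
# labels missing from one Series are filled with NaN
people = pd.DataFrame({"height": height, "weight": weight})
print(people)

# Every column is a Series that shares the DataFrame's index
print(type(people["height"]))
print(people["weight"].index.equals(people.index))
```

Note that the values are matched by label, not by position: `ann` gets the weight 80.5 even though it was listed second.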

In [139]:
import pandas as pd
import numpy as np

## Creating a DataFrame

### Creating a DataFrame from a ``list`` of data and columns

In [140]:
daten_set = [1, 3, 5, 7, 9, 18]
columns = ['alter']
daten_set, columns

([1, 3, 5, 7, 9, 18], ['alter'])

In [141]:
pd.DataFrame(daten_set, columns=columns)

Unnamed: 0,alter
0,1
1,3
2,5
3,7
4,9
5,18


### Creating a DataFrame from a ``NumPy array``

In [142]:
daten_set = np.arange(1, 24, 2).reshape(3, 4)
daten_set

array([[ 1,  3,  5,  7],
       [ 9, 11, 13, 15],
       [17, 19, 21, 23]])

In [143]:
pd.DataFrame(daten_set, columns=['var1','var2','var3','var4'])

Unnamed: 0,var1,var2,var3,var4
0,1,3,5,7
1,9,11,13,15
2,17,19,21,23


In [144]:
df = pd.DataFrame(data=daten_set, columns=['var1','var2','var3','var4'])
df

Unnamed: 0,var1,var2,var3,var4
0,1,3,5,7
1,9,11,13,15
2,17,19,21,23


In [145]:
df.head(2)

Unnamed: 0,var1,var2,var3,var4
0,1,3,5,7
1,9,11,13,15


In [146]:
df.tail(2)

Unnamed: 0,var1,var2,var3,var4
1,9,11,13,15
2,17,19,21,23


In [147]:
df.sample(2)

Unnamed: 0,var1,var2,var3,var4
0,1,3,5,7
1,9,11,13,15
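
Since `sample()` draws rows at random, its output changes from run to run. A minimal sketch of how to make the draw repeatable, assuming a fixed seed is acceptable (the seed value 42 is arbitrary):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(1, 24, 2).reshape(3, 4),
                     columns=["var1", "var2", "var3", "var4"])

# The same random_state yields the same rows on every call
first = frame.sample(2, random_state=42)
second = frame.sample(2, random_state=42)
print(first)
print(first.equals(second))
```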


In [148]:
df.columns

Index(['var1', 'var2', 'var3', 'var4'], dtype='object')

In [149]:
[i for i in df.columns]

['var1', 'var2', 'var3', 'var4']

In [150]:
df.columns=['new1','new2','new3','new4']
df

Unnamed: 0,new1,new2,new3,new4
0,1,3,5,7
1,9,11,13,15
2,17,19,21,23


In [151]:
type(df)

pandas.core.frame.DataFrame

In [152]:
print("Rows-Columns:", df.shape, "Columns:", df.shape[1], "Dimensions:", df.ndim, "Size:", df.size, "len:", len(df))

Rows-Columns: (3, 4) Columns: 4 Dimensions: 2 Size: 12 len: 3


In [153]:
df

Unnamed: 0,new1,new2,new3,new4
0,1,3,5,7
1,9,11,13,15
2,17,19,21,23


In [154]:
df.values

array([[ 1,  3,  5,  7],
       [ 9, 11, 13, 15],
       [17, 19, 21, 23]])

In [155]:
df.index.values

array([0, 1, 2], dtype=int64)

In [156]:
print("Index:", df.index.values, "Index[1]:", df.index[1])

Index: [0 1 2] Index[1]: 1


### Creating a DataFrame from a ``dict``

In [157]:
s1 = np.random.randint(2, 10, size = 4)
s2 = np.random.randint(3, 10, size = 4)
s3 = np.random.randint(4, 15, size = 4)

In [158]:
s1, s2, s3

(array([8, 6, 7, 8]), array([8, 9, 8, 5]), array([ 9,  5, 11,  8]))

In [159]:
dict_= {'var1':s1,'var2':s2,'var3':s3}

In [160]:
df_ = pd.DataFrame(dict_)
df_

Unnamed: 0,var1,var2,var3
0,8,8,9
1,6,9,5
2,7,8,11
3,8,5,8


In [161]:
df_.index

RangeIndex(start=0, stop=4, step=1)

In [162]:
[i for i in df_.index]

[0, 1, 2, 3]

In [163]:
df_.index = ["a", "b", "c", "d"]

In [164]:
df_

Unnamed: 0,var1,var2,var3
a,8,8,9
b,6,9,5
c,7,8,11
d,8,5,8


In [165]:
# We can check whether a given column name belongs to the DataFrame
"var2" in df_, 'var5' in df_

(True, False)

## Indexing, Selecting and Slicing DataFrames
Let us now revisit ``indexing``, ``selection`` and ``slicing``, along with various ``attributes``, using a different DataFrame.

In [166]:
from numpy.random import randn
np.random.seed(101)

In [167]:
df = pd.DataFrame(randn(5, 4),
                    index='A B C D E'.split(),
                    columns='W X Y Z'.split())

In [168]:
'A B C D E'.split()

['A', 'B', 'C', 'D', 'E']

In [169]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [170]:
# Creating a DataFrame with positional arguments
pd.DataFrame(randn(5, 4), 'a b c d e'.split(), 'w x y z'.split())

Unnamed: 0,w,x,y,z
a,0.302665,1.693723,-1.706086,-1.159119
b,-0.134841,0.390528,0.166905,0.184502
c,0.807706,0.07296,0.638787,0.329646
d,-0.497104,-0.75407,-0.943406,0.484752
e,-0.116773,1.901755,0.238127,1.996652


In [171]:
# Creating a DataFrame with keyword arguments
pd.DataFrame(randn(5, 4), columns='w x y z'.split(), index='a b c d e'.split())

Unnamed: 0,w,x,y,z
a,-0.993263,0.1968,-1.136645,0.000366
b,1.025984,-0.156598,-0.031579,0.649826
c,2.154846,-0.610259,-0.755325,-0.346419
d,0.147027,-0.479448,0.558769,1.02481
e,-0.925874,1.862864,-1.133817,0.610478


### Selection and Indexing

Let's learn the different ways to get data out of a DataFrame.

In [172]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [173]:
df['Y']

A    0.907969
B   -0.848077
C    0.528813
D   -0.933237
E    2.605967
Name: Y, dtype: float64

In [174]:
# Attribute (dot) access, sometimes called SQL-like syntax (NOT RECOMMENDED!)
df.Y

A    0.907969
B   -0.848077
C    0.528813
D   -0.933237
E    2.605967
Name: Y, dtype: float64

DataFrame columns are just Series

In [175]:
df['Y'], type(df['Y'])

(A    0.907969
 B   -0.848077
 C    0.528813
 D   -0.933237
 E    2.605967
 Name: Y, dtype: float64,
 pandas.core.series.Series)

In [176]:
df[['Y']], type(df[['Y']])

(          Y
 A  0.907969
 B -0.848077
 C  0.528813
 D -0.933237
 E  2.605967,
 pandas.core.frame.DataFrame)

In [177]:
# Pass a list of column names
# df['Z','X'] raises an error (KeyError)
df[['Z','X']]

Unnamed: 0,Z,X
A,0.503826,0.628133
B,0.605965,-0.319318
C,-0.589001,0.740122
D,0.955057,-0.758872
E,0.683509,1.978757


In [178]:
df["X":"Z"]

Unnamed: 0,W,X,Y,Z


In [179]:
df['B':'C']

Unnamed: 0,W,X,Y,Z
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001


In [180]:
df["A":"C"][["Y", "Z"]]

Unnamed: 0,Y,Z
A,0.907969,0.503826
B,-0.848077,0.605965
C,0.528813,-0.589001
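
The cells above show that plain brackets behave differently depending on what is passed in. The following sketch summarises this with a small stand-in frame (the integer values are arbitrary and only serve the illustration):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(20).reshape(5, 4),
                     index=list("ABCDE"), columns=list("WXYZ"))

rows = frame["B":"C"]              # a slice selects ROWS by label (end included)
cols = frame[["X", "Z"]]           # a list selects COLUMNS
col_slice = frame.loc[:, "X":"Z"]  # slicing columns by label needs .loc

# "X":"Z" is interpreted as a row slice; no row label falls in that range
print(frame["X":"Z"].empty)
print(rows, cols, col_slice, sep="\n\n")
```

This is why `df["X":"Z"]` returned an empty DataFrame earlier: the slice was applied to the row index, not to the columns.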


**Creating a new column:**

In [181]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [182]:
df['X*Y'] = df['X'] * df['Y']
df

Unnamed: 0,W,X,Y,Z,X*Y
A,2.70685,0.628133,0.907969,0.503826,0.570325
B,0.651118,-0.319318,-0.848077,0.605965,0.270806
C,-2.018168,0.740122,0.528813,-0.589001,0.391387
D,0.188695,-0.758872,-0.933237,0.955057,0.708208
E,0.190794,1.978757,2.605967,0.683509,5.156577


In [183]:
df["T"] = [1, 2, 3, 4, 5]

df.T      # .T is the transpose: it swaps rows and columns (it does not return the new 'T' column)

Unnamed: 0,A,B,C,D,E
W,2.70685,0.651118,-2.018168,0.188695,0.190794
X,0.628133,-0.319318,0.740122,-0.758872,1.978757
Y,0.907969,-0.848077,0.528813,-0.933237,2.605967
Z,0.503826,0.605965,-0.589001,0.955057,0.683509
X*Y,0.570325,0.270806,0.391387,0.708208,5.156577
T,1.0,2.0,3.0,4.0,5.0


In [184]:
df["s"] = [1, 2, 3, 4]
df

ValueError: Length of values (4) does not match length of index (5)
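
The error above occurs because a plain list must supply exactly one value per row. A minimal sketch of the difference between assigning a list and assigning a Series, using a small made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({"W": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=list("ABCDE"))

# A list with the wrong length is rejected
try:
    frame["s"] = [1, 2, 3, 4]
except ValueError as err:
    print("ValueError:", err)

# A Series is aligned on the index instead; rows without a match become NaN
frame["s"] = pd.Series([1, 2, 3, 4], index=list("ABCD"))
print(frame)
```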

### Removing Columns & Rows

#### Removing columns

In [None]:
df

Unnamed: 0,W,X,Y,Z,T,X*Y
A,2.70685,0.628133,0.907969,0.503826,1,0.570325
B,0.651118,-0.319318,-0.848077,0.605965,2,0.270806
C,-2.018168,0.740122,0.528813,-0.589001,3,0.391387
D,0.188695,-0.758872,-0.933237,0.955057,4,0.708208
E,0.190794,1.978757,2.605967,0.683509,5,5.156577


In [None]:
df.drop('X*Y', axis=1)

Unnamed: 0,W,X,Y,Z,T
A,2.70685,0.628133,0.907969,0.503826,1
B,0.651118,-0.319318,-0.848077,0.605965,2
C,-2.018168,0.740122,0.528813,-0.589001,3
D,0.188695,-0.758872,-0.933237,0.955057,4
E,0.190794,1.978757,2.605967,0.683509,5


In [None]:
df

Unnamed: 0,W,X,Y,Z,T,X*Y
A,2.70685,0.628133,0.907969,0.503826,1,0.570325
B,0.651118,-0.319318,-0.848077,0.605965,2,0.270806
C,-2.018168,0.740122,0.528813,-0.589001,3,0.391387
D,0.188695,-0.758872,-0.933237,0.955057,4,0.708208
E,0.190794,1.978757,2.605967,0.683509,5,5.156577


In [None]:
df.drop(["X*Y", "T"], axis=1)

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [None]:
df

Unnamed: 0,W,X,Y,Z,X*Y,T
A,2.70685,0.628133,0.907969,0.503826,0.570325,1
B,0.651118,-0.319318,-0.848077,0.605965,0.270806,2
C,-2.018168,0.740122,0.528813,-0.589001,0.391387,3
D,0.188695,-0.758872,-0.933237,0.955057,0.708208,4
E,0.190794,1.978757,2.605967,0.683509,5.156577,5


In [None]:
# The DataFrame itself is not changed unless ``inplace=True`` is specified!
df.drop(["X*Y", "T"], axis=1, inplace=True)

In [None]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
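
As the cells above show, `drop()` returns a new DataFrame and leaves the original untouched unless `inplace=True` is given. The sketch below uses the `columns=` and `index=` keywords as an alternative to `axis`; the frame and its values are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"W": [1, 2], "X": [3, 4], "T": [5, 6]}, index=["A", "B"])

trimmed = frame.drop(columns=["T"])   # same effect as frame.drop(["T"], axis=1)
without_a = frame.drop(index="A")     # same effect as frame.drop("A", axis=0)

print(trimmed)
print(without_a)
print(frame)   # still has column T and row A
```

Reassigning the result (`frame = frame.drop(...)`) is a common alternative to `inplace=True`.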


#### Removing rows

In [None]:
df.drop('C', axis=0)

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [None]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [None]:
# the default value of axis is 0 (axis=0)
df = df.drop('C', axis=0)
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [None]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [None]:
df.drop(["D","E"], axis=0, inplace=True)

In [None]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965


### Selecting Rows

Let's first take a quick look at ``.loc[]`` and ``.iloc[]``

#### ``.loc[] ``
Lets us select data by the labels (names) of rows (index) and columns.

#### `.iloc[]` 
Lets us select data by the **integer positions** of rows (index) and columns. It follows classic positional indexing logic.
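
One practical difference between the two accessors is how slices are bounded. A minimal sketch with a made-up single-column frame:

```python
import pandas as pd

frame = pd.DataFrame({"var1": [10, 20, 30, 40, 50]}, index=list("abcde"))

by_label = frame.loc["b":"d"]    # label slice: the end label 'd' is INCLUDED
by_position = frame.iloc[1:3]    # position slice: the end position 3 is EXCLUDED

print(by_label)
print(by_position)

# The same cell addressed by label and by position
print(frame.loc["c", "var1"], frame.iloc[2, 0])
```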

In [None]:
daten_set = np.random.randint(1, 40, size=(8, 4))
df = pd.DataFrame(daten_set, columns = ["var1","var2","var3",'var4'])
df

Unnamed: 0,var1,var2,var3,var4
0,4,38,30,23
1,22,22,18,24
2,31,37,8,21
3,28,12,6,23
4,26,19,14,39
5,4,15,24,14
6,25,21,1,30
7,12,28,34,25


In [None]:
df.loc[4]

var1    26
var2    19
var3    14
var4    39
Name: 4, dtype: int32

In [None]:
df.loc[[4]]

Unnamed: 0,var1,var2,var3,var4
4,26,19,14,39


In [None]:
# Slicing returns the same data type. Here, a DataFrame
df.loc[2:5]

Unnamed: 0,var1,var2,var3,var4
2,31,37,8,21
3,28,12,6,23
4,26,19,14,39
5,4,15,24,14


In [None]:
df.iloc[2:5]

Unnamed: 0,var1,var2,var3,var4
2,31,37,8,21
3,28,12,6,23
4,26,19,14,39


In [None]:
df

Unnamed: 0,var1,var2,var3,var4
0,4,38,30,23
1,22,22,18,24
2,31,37,8,21
3,28,12,6,23
4,26,19,14,39
5,4,15,24,14
6,25,21,1,30
7,12,28,34,25


In [None]:
df.index='a b c d e f g h'.split()
df

Unnamed: 0,var1,var2,var3,var4
a,4,38,30,23
b,22,22,18,24
c,31,37,8,21
d,28,12,6,23
e,26,19,14,39
f,4,15,24,14
g,25,21,1,30
h,12,28,34,25


In [None]:
df.iloc[1:4]

Unnamed: 0,var1,var2,var3,var4
b,22,22,18,24
c,31,37,8,21
d,28,12,6,23


In [None]:
# df.loc[1:4] raises an error because the index now consists of string labels

In [None]:
df.loc['c':'g']

Unnamed: 0,var1,var2,var3,var4
c,31,37,8,21
d,28,12,6,23
e,26,19,14,39
f,4,15,24,14
g,25,21,1,30


In [None]:
df

Unnamed: 0,var1,var2,var3,var4
a,4,38,30,23
b,22,22,18,24
c,31,37,8,21
d,28,12,6,23
e,26,19,14,39
f,4,15,24,14
g,25,21,1,30
h,12,28,34,25


In [None]:
df.iloc[4, 1]

19

In [None]:
df.iloc[:, 1]

a    38
b    22
c    37
d    12
e    19
f    15
g    21
h    28
Name: var2, dtype: int32

In [None]:
df.loc['d':'g', 'var3']

d     6
e    14
f    24
g     1
Name: var3, dtype: int32

In [None]:
df.loc[:, 'var3']

a    30
b    18
c     8
d     6
e    14
f    24
g     1
h    34
Name: var3, dtype: int32

In [None]:
df.loc[df.index.isin(['c', 'g'])]

Unnamed: 0,var1,var2,var3,var4
c,31,37,8,21
g,25,21,1,30


In [None]:
df.loc['d':'g'][['var3']]

Unnamed: 0,var3
d,6
e,14
f,24
g,1


In [None]:
# How can we select this data as a DataFrame rather than a Series?
df.loc['d':'g'][['var3']]

Unnamed: 0,var3
d,6
e,14
f,24
g,1


In [None]:
df.loc['d':'g', ["var3"]]

Unnamed: 0,var3
d,6
e,14
f,24
g,1


In [None]:
df.iloc[2:5, 2]

c     8
d     6
e    14
Name: var3, dtype: int32

In [None]:
df.iloc[2:5][['var2']]

Unnamed: 0,var2
c,37
d,12
e,19


Let's continue to examine ``.loc[]`` and ``.iloc[]``

In [None]:
df = pd.DataFrame(randn(5, 4),
                    index='A B C D E'.split(),
                    columns='W X Y Z'.split())
df

Unnamed: 0,W,X,Y,Z
A,-0.758436,-0.454696,1.297617,-0.825378
B,0.251915,0.518763,0.587968,-0.148194
C,-0.876702,0.79275,0.539118,0.669774
D,-1.270484,-0.446181,0.779475,0.4799
E,-0.960697,-2.002399,-1.263599,-0.696232


In [None]:
df.loc['C']

W   -0.876702
X    0.792750
Y    0.539118
Z    0.669774
Name: C, dtype: float64

Or select based on position instead of label

In [None]:
df.iloc[2]

W   -0.876702
X    0.792750
Y    0.539118
Z    0.669774
Name: C, dtype: float64

In [None]:
type(df.iloc[2])

pandas.core.series.Series

In [None]:
df.iloc[2].values

array([-0.87670184,  0.79275011,  0.53911754,  0.66977395])

In [None]:
df.iloc[[2]]

Unnamed: 0,W,X,Y,Z
C,-0.876702,0.79275,0.539118,0.669774


In [None]:
type(df.iloc[[2]])

pandas.core.frame.DataFrame

In [None]:
df.iloc[[2]].values

array([[-0.87670184,  0.79275011,  0.53911754,  0.66977395]])

In [None]:
# returns a DataFrame
df.loc[['C']]

Unnamed: 0,W,X,Y,Z
C,-0.876702,0.79275,0.539118,0.669774


In [None]:
# returns a DataFrame
df.iloc[[2]]

Unnamed: 0,W,X,Y,Z
C,-0.876702,0.79275,0.539118,0.669774


In [None]:
# Now, how can we select the entire 'Y' column with '.iloc[]'?
df.iloc[:, 2]

A    1.297617
B    0.587968
C    0.539118
D    0.779475
E   -1.263599
Name: Y, dtype: float64

In [None]:
df.iloc[:,[2]]

Unnamed: 0,Y
A,1.297617
B,0.587968
C,0.539118
D,0.779475
E,-1.263599


In [None]:
df.columns

Index(['W', 'X', 'Y', 'Z'], dtype='object')

In [None]:
df[['Y','X']]

Unnamed: 0,Y,X
A,1.297617,-0.454696
B,0.587968,0.518763
C,0.539118,0.79275
D,0.779475,-0.446181
E,-1.263599,-2.002399


In [None]:
df[['X','Y']]

Unnamed: 0,X,Y
A,-0.454696,1.297617
B,0.518763,0.587968
C,0.79275,0.539118
D,-0.446181,0.779475
E,-2.002399,-1.263599


#### Selecting a subset of rows and columns

 `.loc[[row labels|names], [column labels|names]]`

`.iloc[[row index numbers], [column index numbers]]`

In [None]:
df

Unnamed: 0,W,X,Y,Z
A,-0.758436,-0.454696,1.297617,-0.825378
B,0.251915,0.518763,0.587968,-0.148194
C,-0.876702,0.79275,0.539118,0.669774
D,-1.270484,-0.446181,0.779475,0.4799
E,-0.960697,-2.002399,-1.263599,-0.696232


In [None]:
df.loc['C','Z']

0.6697739471644235

In [None]:
# Let's select the same data as a DataFrame
df.loc[['C'],['Z']]

Unnamed: 0,Z
C,0.669774


In [None]:
df.loc[['C']][['Z']]

Unnamed: 0,Z
C,0.669774


In [None]:
df.loc[['A','C']]

Unnamed: 0,W,X,Y,Z
A,-0.758436,-0.454696,1.297617,-0.825378
C,-0.876702,0.79275,0.539118,0.669774


In [None]:
df.loc[['A','C'],['W','Z']]

Unnamed: 0,W,Z
A,-0.758436,-0.825378
C,-0.876702,0.669774


In [None]:
df.loc[['A','C']][['W','Z']]

Unnamed: 0,W,Z
A,-0.758436,-0.825378
C,-0.876702,0.669774


In [None]:
df.iloc[[0,  2], [0, 3]]

Unnamed: 0,W,Z
A,-0.758436,-0.825378
C,-0.876702,0.669774


#### Conditional Selection
An important feature of pandas is conditional selection using bracket notation, which is very similar to NumPy:

In [None]:
df

Unnamed: 0,W,X,Y,Z
A,-0.758436,-0.454696,1.297617,-0.825378
B,0.251915,0.518763,0.587968,-0.148194
C,-0.876702,0.79275,0.539118,0.669774
D,-1.270484,-0.446181,0.779475,0.4799
E,-0.960697,-2.002399,-1.263599,-0.696232


In [None]:
# returns a DataFrame of bool values
df > 0.5

Unnamed: 0,W,X,Y,Z
A,False,False,True,False
B,False,True,True,False
C,False,True,True,True
D,False,False,True,False
E,False,False,False,False


In [None]:
df[df > 0.5]

Unnamed: 0,W,X,Y,Z
A,,,1.297617,
B,,0.518763,0.587968,
C,,0.79275,0.539118,0.669774
D,,,0.779475,
E,,,,


In [None]:
# A boolean Series filters by rows: only rows where the condition is True are returned.
df[df['Z'] > 0.5]

Unnamed: 0,W,X,Y,Z
C,-0.876702,0.79275,0.539118,0.669774


In [None]:
df[['Z']]

Unnamed: 0,Z
A,-0.825378
B,-0.148194
C,0.669774
D,0.4799
E,-0.696232


In [None]:
df

Unnamed: 0,W,X,Y,Z
A,-0.758436,-0.454696,1.297617,-0.825378
B,0.251915,0.518763,0.587968,-0.148194
C,-0.876702,0.79275,0.539118,0.669774
D,-1.270484,-0.446181,0.779475,0.4799
E,-0.960697,-2.002399,-1.263599,-0.696232


In [None]:
df[df['X'] < 1][['W']]

Unnamed: 0,W
A,-0.758436
B,0.251915
C,-0.876702
D,-1.270484
E,-0.960697


In [None]:
# How can we select the data as a DataFrame?

In [None]:
df[df['Y'] > 0][['Z', 'W', 'Y']]

Unnamed: 0,Z,W,Y
A,-0.825378,-0.758436,1.297617
B,-0.148194,0.251915,0.587968
C,0.669774,-0.876702,0.539118
D,0.4799,-1.270484,0.779475


Note: to combine two conditions you can use 

**|** → `or`, 

**&** → `and`, with each condition wrapped in parentheses.
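
The rule above can be sketched with a small made-up frame. The example also shows `~` for negation and what happens when Python's own `and` keyword is used by mistake.

```python
import pandas as pd

frame = pd.DataFrame({"W": [-1.0, 0.5, 2.0, 0.1], "Y": [1.5, 0.2, 3.0, -0.4]},
                     index=list("ABCD"))

both = frame[(frame["W"] > 0) & (frame["Y"] < 1)]        # element-wise "and"
either = frame[(frame["W"] > 1) | (frame["Y"] < 0)]      # element-wise "or"
neither = frame[~((frame["W"] > 1) | (frame["Y"] < 0))]  # element-wise "not"

print(both, either, neither, sep="\n\n")

# The keyword `and` asks for a single True/False value of a whole Series,
# which is ambiguous, so pandas raises a ValueError
try:
    frame[(frame["W"] > 0) and (frame["Y"] < 1)]
except ValueError as err:
    print("ValueError:", err)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `>` and `<`.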

In [None]:
df

Unnamed: 0,W,X,Y,Z
A,-0.758436,-0.454696,1.297617,-0.825378
B,0.251915,0.518763,0.587968,-0.148194
C,-0.876702,0.79275,0.539118,0.669774
D,-1.270484,-0.446181,0.779475,0.4799
E,-0.960697,-2.002399,-1.263599,-0.696232


In [None]:
df[(df['W'] > 0) & (df['Y'] < 1)]

Unnamed: 0,W,X,Y,Z
B,0.251915,0.518763,0.587968,-0.148194


In [None]:
df[(df['W'] > 0) & (df['Y'] < 1)] = 0
df

Unnamed: 0,W,X,Y,Z
A,-0.758436,-0.454696,1.297617,-0.825378
B,0.0,0.0,0.0,0.0
C,-0.876702,0.79275,0.539118,0.669774
D,-1.270484,-0.446181,0.779475,0.4799
E,-0.960697,-2.002399,-1.263599,-0.696232
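
Assigning through a boolean mask, as in the cell above, can also be written with `.loc[]`, which additionally lets us restrict the assignment to certain columns. This is a sketch on a made-up frame, not a replacement for the cell above:

```python
import pandas as pd

frame = pd.DataFrame({"W": [-1.0, 0.5, 2.0], "Y": [1.5, 0.2, 3.0], "Z": [7.0, 8.0, 9.0]},
                     index=list("ABC"))

mask = (frame["W"] > 0) & (frame["Y"] < 1)   # True only for row B

# Set W and Y to 0 in the matching rows; column Z is left alone
frame.loc[mask, ["W", "Y"]] = 0
print(frame)
```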


#### Conditional selection with ``.loc[]`` and ``.iloc[]``

In [None]:
df

Unnamed: 0,W,X,Y,Z
A,-0.758436,-0.454696,1.297617,-0.825378
B,0.0,0.0,0.0,0.0
C,-0.876702,0.79275,0.539118,0.669774
D,-1.270484,-0.446181,0.779475,0.4799
E,-0.960697,-2.002399,-1.263599,-0.696232


In [None]:
df[df.X > 0]

Unnamed: 0,W,X,Y,Z
C,-0.876702,0.79275,0.539118,0.669774


In [None]:
df['X'] > 0

A     True
B    False
C     True
D    False
E     True
Name: X, dtype: bool

In [None]:
df[df.X > 0][['X','Z']]

Unnamed: 0,X,Z
A,0.628133,0.503826
C,0.740122,-0.589001
E,1.978757,0.683509


In [None]:
df.loc[(df.X > 0), ['X','Z']]

Unnamed: 0,X,Z
A,0.628133,0.503826
C,0.740122,-0.589001
E,1.978757,0.683509


In [None]:
df.loc[(df.X > 0)][['X','Z']]

Unnamed: 0,X,Z
A,0.628133,0.503826
C,0.740122,-0.589001
E,1.978757,0.683509


In [None]:
df.loc[((df.W > 1) | (df.Y < 1)), ['Y','Z']]

Unnamed: 0,Y,Z
B,0.0,0.0
C,0.539118,0.669774
D,0.779475,0.4799
E,-1.263599,-0.696232


## More Index Details

Let's discuss a few more indexing features, including resetting the index or setting it to something else. We will also talk about index hierarchy!

In [None]:
df

Unnamed: 0,W,X,Y,Z,X*Y,T
A,2.70685,0.628133,0.907969,0.503826,0.570325,1
B,0.651118,-0.319318,-0.848077,0.605965,0.270806,2
C,-2.018168,0.740122,0.528813,-0.589001,0.391387,3
D,0.188695,-0.758872,-0.933237,0.955057,0.708208,4
E,0.190794,1.978757,2.605967,0.683509,5.156577,5


In [None]:
# Reset to the default 0, 1, ..., n index
df.reset_index()

Unnamed: 0,index,W,X,Y,Z
0,A,-0.758436,-0.454696,1.297617,-0.825378
1,B,0.0,0.0,0.0,0.0
2,C,-0.876702,0.79275,0.539118,0.669774
3,D,-1.270484,-0.446181,0.779475,0.4799
4,E,-0.960697,-2.002399,-1.263599,-0.696232


In [None]:
df

Unnamed: 0,W,X,Y,Z
A,-0.758436,-0.454696,1.297617,-0.825378
B,0.0,0.0,0.0,0.0
C,-0.876702,0.79275,0.539118,0.669774
D,-1.270484,-0.446181,0.779475,0.4799
E,-0.960697,-2.002399,-1.263599,-0.696232


In [None]:
df.reset_index(drop=True)

Unnamed: 0,W,X,Y,Z
0,-0.758436,-0.454696,1.297617,-0.825378
1,0.0,0.0,0.0,0.0
2,-0.876702,0.79275,0.539118,0.669774
3,-1.270484,-0.446181,0.779475,0.4799
4,-0.960697,-2.002399,-1.263599,-0.696232


In [None]:
neueindx = 'CA NY WY OR CO'.split()
neueindx

['CA', 'NY', 'WY', 'OR', 'CO']

In [None]:
df['neueidx'] = neueindx
df

Unnamed: 0,W,X,Y,Z,X*Y,T,neueidx
A,2.70685,0.628133,0.907969,0.503826,0.570325,1,CA
B,0.651118,-0.319318,-0.848077,0.605965,0.270806,2,NY
C,-2.018168,0.740122,0.528813,-0.589001,0.391387,3,WY
D,0.188695,-0.758872,-0.933237,0.955057,0.708208,4,OR
E,0.190794,1.978757,2.605967,0.683509,5.156577,5,CO


In [None]:
df.set_index('neueidx')

Unnamed: 0_level_0,W,X,Y,Z,X*Y,T
neueidx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
CA,2.70685,0.628133,0.907969,0.503826,0.570325,1
NY,0.651118,-0.319318,-0.848077,0.605965,0.270806,2
WY,-2.018168,0.740122,0.528813,-0.589001,0.391387,3
OR,0.188695,-0.758872,-0.933237,0.955057,0.708208,4
CO,0.190794,1.978757,2.605967,0.683509,5.156577,5


In [None]:
df

Unnamed: 0,W,X,Y,Z,neueidx
A,-0.758436,-0.454696,1.297617,-0.825378,CA
B,0.0,0.0,0.0,0.0,NY
C,-0.876702,0.79275,0.539118,0.669774,WY
D,-1.270484,-0.446181,0.779475,0.4799,OR
E,-0.960697,-2.002399,-1.263599,-0.696232,CO


In [None]:
df.set_index('neueidx',inplace=True)

In [None]:
df

Unnamed: 0_level_0,W,X,Y,Z
neueidx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CA,-0.758436,-0.454696,1.297617,-0.825378
NY,0.0,0.0,0.0,0.0
WY,-0.876702,0.79275,0.539118,0.669774
OR,-1.270484,-0.446181,0.779475,0.4799
CO,-0.960697,-2.002399,-1.263599,-0.696232
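
To recap the cells above, here is a compact round trip between `set_index()` and `reset_index()`. The column name `state` and the values are made up for this sketch.

```python
import pandas as pd

frame = pd.DataFrame({"W": [1.0, 2.0, 3.0]}, index=list("ABC"))
frame["state"] = ["CA", "NY", "WY"]

by_state = frame.set_index("state")   # the column becomes the index; old index A, B, C is discarded
restored = by_state.reset_index()     # the index moves back into a column, new index 0..n-1

print(by_state)
print(restored)
print(frame.index.tolist())           # the original frame is unchanged
```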


## Multi-Index and Index Hierarchy

Let's go over how to work with a Multi-Index. First, we'll create a quick example of what a multi-indexed DataFrame would look like:

In [186]:
# Index levels
stufe1 = [1, 2, 3, 1, 2, 3, 5, 6, 7]
stufe2 = ['M1', 'M1', 'M1', 'M2', 'M2', 'M2','M3', 'M3', 'M3']
multi_index = list(zip(stufe2 , stufe1))
multi_index

[('M1', 1),
 ('M1', 2),
 ('M1', 3),
 ('M2', 1),
 ('M2', 2),
 ('M2', 3),
 ('M3', 5),
 ('M3', 6),
 ('M3', 7)]

In [187]:
index_ = pd.MultiIndex.from_tuples(multi_index)

In [188]:
index_

MultiIndex([('M1', 1),
            ('M1', 2),
            ('M1', 3),
            ('M2', 1),
            ('M2', 2),
            ('M2', 3),
            ('M3', 5),
            ('M3', 6),
            ('M3', 7)],
           )

In [189]:
df = pd.DataFrame(np.random.randn(9, 4), 
                  index=index_, 
                  columns=['A','B','C','D'])
df

Unnamed: 0,Unnamed: 1,A,B,C,D
M1,1,0.38603,2.084019,-0.376519,0.230336
M1,2,0.681209,1.035125,-0.03116,1.939932
M1,3,-1.005187,-0.74179,0.187125,-0.732845
M2,1,-1.38292,1.482495,0.961458,-2.141212
M2,2,0.992573,1.192241,-1.04678,1.292765
M2,3,-1.467514,-0.494095,-0.162535,0.485809
M3,5,0.392489,0.221491,-0.855196,1.54199
M3,6,0.666319,-0.538235,-0.568581,1.407338
M3,7,0.641806,-0.9051,-0.391157,1.028293


Now let's show how to index this! For the index hierarchy we use ``df.loc[]``; if this were on the column axis, you would just use normal bracket notation ``df[]``. Calling one level of the index returns the sub-DataFrame:

In [190]:
df.loc['M1']

Unnamed: 0,A,B,C,D
1,0.38603,2.084019,-0.376519,0.230336
2,0.681209,1.035125,-0.03116,1.939932
3,-1.005187,-0.74179,0.187125,-0.732845


In [191]:
df.loc['M1'].loc[2]

A    0.681209
B    1.035125
C   -0.031160
D    1.939932
Name: 2, dtype: float64

In [192]:
df.loc['M1'].loc[[2]]

Unnamed: 0,A,B,C,D
2,0.681209,1.035125,-0.03116,1.939932
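
Instead of chaining two `.loc[]` calls, both index levels can be passed at once as a tuple. A minimal sketch on a smaller made-up frame with the same kind of two-level index:

```python
import numpy as np
import pandas as pd

index_ = pd.MultiIndex.from_tuples(
    [("M1", 1), ("M1", 2), ("M2", 1), ("M2", 2)], names=["Group", "Num"])
frame = pd.DataFrame(np.arange(8).reshape(4, 2), index=index_, columns=["A", "B"])

block = frame.loc["M1"]             # outer level only -> sub-DataFrame
row = frame.loc[("M1", 2)]          # both levels -> Series, like frame.loc["M1"].loc[2]
cell = frame.loc[("M2", 1), "B"]    # both levels plus a column -> single value

print(block)
print(row)
print(cell)
```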


In [193]:
df.index.names

FrozenList([None, None])

In [195]:
df.index.names = ['Group','Num']

In [196]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
M1,1,0.38603,2.084019,-0.376519,0.230336
M1,2,0.681209,1.035125,-0.03116,1.939932
M1,3,-1.005187,-0.74179,0.187125,-0.732845
M2,1,-1.38292,1.482495,0.961458,-2.141212
M2,2,0.992573,1.192241,-1.04678,1.292765
M2,3,-1.467514,-0.494095,-0.162535,0.485809
M3,5,0.392489,0.221491,-0.855196,1.54199
M3,6,0.666319,-0.538235,-0.568581,1.407338
M3,7,0.641806,-0.9051,-0.391157,1.028293


Let's take a quick look at ``.xs()``

In [197]:
# This method takes a `key` argument to select data at a particular level of a MultiIndex.
df.xs('M1')

Unnamed: 0_level_0,A,B,C,D
Num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,0.38603,2.084019,-0.376519,0.230336
2,0.681209,1.035125,-0.03116,1.939932
3,-1.005187,-0.74179,0.187125,-0.732845


In [198]:
df.loc['M1']

Unnamed: 0_level_0,A,B,C,D
Num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,0.38603,2.084019,-0.376519,0.230336
2,0.681209,1.035125,-0.03116,1.939932
3,-1.005187,-0.74179,0.187125,-0.732845


In [139]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
M1,1,-1.14822,1.607435,-1.22687,1.405532
M1,2,-1.137201,-0.535478,2.142717,1.691452
M1,3,0.275225,-0.852057,0.298659,-0.56537
M2,1,0.358325,0.699676,0.417366,-0.238049
M2,2,-1.850038,1.049774,-0.43787,0.608334
M2,3,-0.342021,0.58902,0.827388,0.163044
M3,5,0.031363,0.783105,0.06956,0.660136
M3,6,0.811349,-1.299794,2.195249,-0.620243
M3,7,-1.531769,0.061996,0.823122,0.644121


In [200]:
# Pass the key as a tuple; list keys trigger a warning or an error in newer pandas versions
df.xs(('M1', 2))

A    0.681209
B    1.035125
C   -0.031160
D    1.939932
Name: (M1, 2), dtype: float64

In [201]:
df.xs(('M3',6))

A    0.666319
B   -0.538235
C   -0.568581
D    1.407338
Name: (M3, 6), dtype: float64

In [142]:
df.xs(('M3',6), level=[0,1])

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
M3,6,0.811349,-1.299794,2.195249,-0.620243


In [143]:
df.xs(5, level='Num')

Unnamed: 0_level_0,A,B,C,D
Group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
M3,0.031363,0.783105,0.06956,0.660136


In [144]:
df.xs(3, level=1)

Unnamed: 0_level_0,A,B,C,D
Group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
M1,0.275225,-0.852057,0.298659,-0.56537
M2,-0.342021,0.58902,0.827388,0.163044
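
By default `.xs()` drops the level that was used for the selection. The sketch below contrasts this with `drop_level=False`, again on a small made-up frame:

```python
import numpy as np
import pandas as pd

index_ = pd.MultiIndex.from_tuples(
    [("M1", 1), ("M1", 2), ("M2", 1), ("M2", 2)], names=["Group", "Num"])
frame = pd.DataFrame(np.arange(8).reshape(4, 2), index=index_, columns=["A", "B"])

cross = frame.xs(2, level="Num")                    # 'Num' level is dropped
kept = frame.xs(2, level="Num", drop_level=False)   # both levels are kept
pair = frame.xs(("M1", 2))                          # tuple key across both levels

print(cross)
print(kept)
print(pair)
```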


In [145]:
df.xs('C', axis=1)

Group  Num
M1     1     -1.226870
       2      2.142717
       3      0.298659
M2     1      0.417366
       2     -0.437870
       3      0.827388
M3     5      0.069560
       6      2.195249
       7      0.823122
Name: C, dtype: float64

## Let's learn new functions/attributes/methods on the "iris dataset"

In [202]:
from sklearn import datasets
import seaborn as sns

In [203]:
df = sns.load_dataset("iris")
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [148]:
df.shape

(150, 5)

In [149]:
df.ndim

2

In [150]:
df.size

750

In [151]:
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [152]:
df.sample(4)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
130,7.4,2.8,6.1,1.9,virginica
66,5.6,3.0,4.5,1.5,versicolor
2,4.7,3.2,1.3,0.2,setosa
106,4.9,2.5,4.5,1.7,virginica


In [153]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [154]:
df.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [205]:
df.describe(include= 'object')

Unnamed: 0,species
count,150
unique,3
top,setosa
freq,50


In [206]:
df.describe(include= 'all')

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
count,150.0,150.0,150.0,150.0,150
unique,,,,,3
top,,,,,setosa
freq,,,,,50
mean,5.843333,3.057333,3.758,1.199333,
std,0.828066,0.435866,1.765298,0.762238,
min,4.3,2.0,1.0,0.1,
25%,5.1,2.8,1.6,0.3,
50%,5.8,3.0,4.35,1.3,
75%,6.4,3.3,5.1,1.8,


In [155]:
df.species.value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64
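
`value_counts()` can also report relative frequencies instead of absolute counts. A short sketch on a made-up Series (not the iris data itself):

```python
import pandas as pd

species = pd.Series(["setosa", "setosa", "versicolor", "virginica"], name="species")

counts = species.value_counts()                 # absolute counts, most frequent first
shares = species.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```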

In [156]:
df.mean(numeric_only=True)

sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64
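
Aggregations such as the mean are only defined for numeric columns, so a text column like `species` has to be left out. Two ways to do that are sketched below on a small made-up frame; which one reads better is a matter of taste.

```python
import pandas as pd

flowers = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "sepal_width": [3.5, 3.0, 2.5],
    "species": ["setosa", "setosa", "virginica"],
})

# Option 1: let the aggregation skip non-numeric columns
means = flowers.mean(numeric_only=True)

# Option 2: select the numeric columns first, then aggregate
same = flowers.select_dtypes(include="number").mean()

# Row-wise totals over the numeric columns only
row_totals = flowers.sum(axis=1, numeric_only=True)

print(means)
print(same)
print(row_totals)
```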

In [157]:
df.sum(axis=0)

sepal_length                                                876.5
sepal_width                                                 458.6
petal_length                                                563.7
petal_width                                                 179.9
species         setosasetosasetosasetosasetosasetosasetosaseto...
dtype: object

In [158]:
df.sum(axis=1, numeric_only=True)

0      10.2
1       9.5
2       9.4
3       9.4
4      10.2
       ... 
145    17.2
146    15.7
147    16.7
148    17.3
149    15.8
Length: 150, dtype: float64

In [159]:
df.sepal_length.sum()

876.5

In [160]:
df.species.unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [161]:
df.isnull()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
...,...,...,...,...,...
145,False,False,False,False,False
146,False,False,False,False,False
147,False,False,False,False,False
148,False,False,False,False,False


In [162]:
df.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

In [163]:
len(df)

150

In [164]:
df.head(9)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa


In [165]:
df.iloc[0:6 ,0:]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa


In [166]:
df.loc[0:6, :]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa


In [207]:
df.loc[0:6]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa


In [167]:
df.drop('species', axis=1)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [168]:
df[(df.sepal_length > 5) & (df.sepal_width > 3)].head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
10,5.4,3.7,1.5,0.2,setosa
14,5.8,4.0,1.2,0.2,setosa
15,5.7,4.4,1.5,0.4,setosa


In [169]:
df[(df.sepal_length > 5) | (df.sepal_width > 3)].tail()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


In [170]:
df.sort_values(by='species', ascending=True)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
27,5.2,3.5,1.5,0.2,setosa
28,5.2,3.4,1.4,0.2,setosa
29,4.7,3.2,1.6,0.2,setosa
30,4.8,3.1,1.6,0.2,setosa
...,...,...,...,...,...
119,6.0,2.2,5.0,1.5,virginica
120,6.9,3.2,5.7,2.3,virginica
121,5.6,2.8,4.9,2.0,virginica
111,6.4,2.7,5.3,1.9,virginica
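
`sort_values()` also accepts several columns, each with its own sort direction. A minimal sketch on a made-up three-row frame:

```python
import pandas as pd

flowers = pd.DataFrame({
    "species": ["virginica", "setosa", "setosa"],
    "sepal_length": [6.3, 4.9, 5.1],
})

# Sort by species ascending, then by sepal_length descending within each species
ordered = flowers.sort_values(by=["species", "sepal_length"], ascending=[True, False])
print(ordered)

# The original row labels travel with the rows; reset them if a clean 0..n-1 index is wanted
print(ordered.reset_index(drop=True))
```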


<head>
    <center><title>~ End of Pandas DataFrames | Lesson-1 ~</title></center>
</head>