<a href="https://colab.research.google.com/github/Gulayrose/DAwPy/blob/main/DAwPy_S4_(Pandas_DataFrames%2C_Selection_and_Indexing)_empty.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" alt="CLRSWY"></p>

___

<h1><p style="text-align: center;">Pandas Lesson, Session - 4</p></h1>
    

# Data Frames

 - ### ``DataFrames`` are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!
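
To make the "Series sharing one index" picture concrete, here is a minimal sketch. The names `prices`, `stock`, and `items` and the data are made up for illustration.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative data)
prices = pd.Series([1.5, 0.5, 2.0], index=["apple", "banana", "cherry"])
stock = pd.Series([10, 25, 7], index=["apple", "banana", "cherry"])

# Placing them side by side gives a DataFrame with one shared index
items = pd.DataFrame({"price": prices, "stock": stock})
print(items)

# Pulling a column back out gives a Series again
print(type(items["price"]))
```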

In [1]:
import numpy as np
import pandas as pd

 ## Creating a DataFrame using ``list``s of data and columns

pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

Two-dimensional, size-mutable, potentially heterogeneous tabular data.

In [2]:
data = [1,3,5,7,9,18]
data

[1, 3, 5, 7, 9, 18]

In [3]:
pd.DataFrame(data=data)

Unnamed: 0,0
0,1
1,3
2,5
3,7
4,9
5,18


In [4]:
pd.Series(data=data)

0     1
1     3
2     5
3     7
4     9
5    18
dtype: int64

In [5]:
pd.DataFrame(data=data, columns=["col_1"])

Unnamed: 0,col_1
0,1
1,3
2,5
3,7
4,9
5,18


 ## Creating a DataFrame using a ``NumPy Array``

In [6]:
m = np.arange(1,24,2).reshape(3,4)
m

array([[ 1,  3,  5,  7],
       [ 9, 11, 13, 15],
       [17, 19, 21, 23]])

In [7]:
df=pd.DataFrame(data=m, columns=['var1','var2','var3','var4'])
df # the number of column names must match the number of columns in the data

Unnamed: 0,var1,var2,var3,var4
0,1,3,5,7
1,9,11,13,15
2,17,19,21,23


In [8]:
df.head(1)

Unnamed: 0,var1,var2,var3,var4
0,1,3,5,7


In [9]:
df.tail(1)

Unnamed: 0,var1,var2,var3,var4
2,17,19,21,23


In [12]:
df.sample(2)

Unnamed: 0,var1,var2,var3,var4
2,17,19,21,23
1,9,11,13,15


In [14]:
df.columns

Index(['var1', 'var2', 'var3', 'var4'], dtype='object')

In [15]:
for i in df.columns: 
  print(i)


var1
var2
var3
var4


In [16]:
for i in df.columns: 
  print(df[i].sum())

27
33
39
45


In [75]:
df

Unnamed: 0,var1,var2,var3,var4
0,1,3,5,7
1,9,11,13,15
2,17,19,21,23


In [78]:
df.columns = ["new1", "new2","new3","new4"]
df # we can rename the columns by assigning a new list of names

Unnamed: 0,new1,new2,new3,new4
0,1,3,5,7
1,9,11,13,15
2,17,19,21,23


In [82]:
df.columns

Index(['new1', 'new2', 'new3', 'new4'], dtype='object')


In [79]:
df.index=["a", "b", "c"]
df # we can rename the index labels as well

Unnamed: 0,new1,new2,new3,new4
a,1,3,5,7
b,9,11,13,15
c,17,19,21,23


In [80]:
df.rename(columns = {"new1": "a", "new2": "b"})

Unnamed: 0,a,b,new3,new4
a,1,3,5,7
b,9,11,13,15
c,17,19,21,23


In [81]:
df.rename(index = {"a": 1, "b": 2})

Unnamed: 0,new1,new2,new3,new4
1,1,3,5,7
2,9,11,13,15
c,17,19,21,23
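
Note that `rename` returns a new DataFrame and leaves the original untouched unless the result is assigned back. A minimal sketch on a small made-up frame (`df_demo` and `renamed` are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame({"new1": [1, 9], "new2": [3, 11]}, index=["a", "b"])

# rename returns a new object; keys that are not listed stay as they are
renamed = df_demo.rename(columns={"new1": "a"}, index={"a": 1})

print(list(df_demo.columns))  # the original column names are unchanged
print(list(renamed.columns))
print(list(renamed.index))
```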


In [17]:
df.shape

(3, 4)

In [18]:
df.size

12

In [19]:
len(df)

3

In [20]:
df.values

array([[ 1,  3,  5,  7],
       [ 9, 11, 13, 15],
       [17, 19, 21, 23]])

In [21]:
type(df)

pandas.core.frame.DataFrame

In [22]:
type(df.values)

numpy.ndarray
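
The attributes above are related: `size` is rows times columns, and `len` counts rows only. The sketch below rebuilds the same array under the illustrative names `m_demo` and `df_demo`, and also uses `to_numpy()`, which the pandas documentation recommends over `.values`.

```python
import numpy as np
import pandas as pd

m_demo = np.arange(1, 24, 2).reshape(3, 4)
df_demo = pd.DataFrame(m_demo, columns=["var1", "var2", "var3", "var4"])

n_rows, n_cols = df_demo.shape
print(n_rows, n_cols, df_demo.size, len(df_demo))

# to_numpy() returns the underlying data as a NumPy array, like .values
arr = df_demo.to_numpy()
print(type(arr))
```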

 ## Creating a DataFrame using a ``dict``

In [24]:
s1 = np.random.randint(2,10, size = 4)
s2 = np.random.randint(3,10, size = 4)
s3 = np.random.randint(4,15, size = 4)

In [25]:
s1

array([9, 3, 3, 6])

In [26]:
s2

array([9, 7, 6, 3])

In [27]:
s3

array([ 8,  7,  5, 14])

In [28]:
my_dict = {"var1":s1,"var2":s2,"var3":s3}
my_dict

{'var1': array([9, 3, 3, 6]),
 'var2': array([9, 7, 6, 3]),
 'var3': array([ 8,  7,  5, 14])}

In [29]:
df1 = pd.DataFrame(my_dict)
df1

Unnamed: 0,var1,var2,var3
0,9,9,8
1,3,7,7
2,3,6,5
3,6,3,14


In [30]:
df1.index

RangeIndex(start=0, stop=4, step=1)

In [31]:
[i for i in df1.index]

[0, 1, 2, 3]

In [32]:
"var2" in df1

True
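
The `in` operator on a DataFrame checks the column labels, not the values. A small sketch of the different membership checks, using a made-up frame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({"var1": [9, 3], "var2": [9, 7]})

in_columns = "var2" in df_demo     # checks column labels
in_index = 1 in df_demo.index      # checks row labels
value_as_key = 9 in df_demo        # 9 is a value, not a column label
in_values = 9 in df_demo.values    # checks the underlying array

print(in_columns, in_index, value_as_key, in_values)
```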

### Now, let's examine again the ***indexing, selection*** and ***slicing*** methods and several ***attributes*** using a different DataFrame

In [33]:
from numpy.random import randn

In [35]:
randn(5,4)

array([[ 0.30266545,  1.69372293, -1.70608593, -1.15911942],
       [-0.13484072,  0.39052784,  0.16690464,  0.18450186],
       [ 0.80770591,  0.07295968,  0.63878701,  0.3296463 ],
       [-0.49710402, -0.7540697 , -0.9434064 ,  0.48475165],
       [-0.11677332,  1.9017548 ,  0.23812696,  1.99665229]])

In [36]:
'A B C D E '.split()

['A', 'B', 'C', 'D', 'E']

In [124]:
np.random.seed(101) # set a seed so that everyone gets the same random numbers
df3 = pd.DataFrame(randn(5,4), index = 'A B C D E'.split(), columns = 'W X Y Z'.split())
df3

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


## Selection and Indexing

Let's learn the various methods to grab data from a DataFrame

In [37]:
df3["W"]

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [38]:
type(df3["W"])

pandas.core.series.Series

In [39]:
df3[["W"]] # an extra pair of square brackets returns a DataFrame instead of a Series

Unnamed: 0,W
A,2.70685
B,0.651118
C,-2.018168
D,0.188695
E,0.190794


In [41]:
type(df3[["W"]])

pandas.core.frame.DataFrame

In [42]:
df3.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [43]:
type(df3.W)

pandas.core.series.Series

In [72]:
df3

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


#### DataFrame Columns are just Series

In [45]:
df3[["W","Z"]]

Unnamed: 0,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509


In [46]:
df3["A":"C"] # slicing with plain brackets works on the rows (index) only

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001


In [47]:
df3["A":"C"]["W"]

A    2.706850
B    0.651118
C   -2.018168
Name: W, dtype: float64

In [50]:
df3["A":"C"][["W","Y"]]

Unnamed: 0,W,Y
A,2.70685,0.907969
B,0.651118,-0.848077
C,-2.018168,0.528813
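
The chained form `df3["A":"C"][["W","Y"]]` selects in two steps. As a sketch, the same selection can be written as a single `.loc[]` call with rows and columns together. The frame `df_demo` below is a stand-in with random values, not the notebook's `df3`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_demo = pd.DataFrame(
    rng.standard_normal((5, 4)), index=list("ABCDE"), columns=list("WXYZ")
)

chained = df_demo["A":"C"][["W", "Y"]]       # two-step selection
single = df_demo.loc["A":"C", ["W", "Y"]]    # rows and columns in one call

print(chained.equals(single))
```

The single-call form is generally the safer habit, especially when assigning values to the selection.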


In [51]:
df3["new"] = df3["X"] * df3["Y"]
df3

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,0.570325
B,0.651118,-0.319318,-0.848077,0.605965,0.270806
C,-2.018168,0.740122,0.528813,-0.589001,0.391387
D,0.188695,-0.758872,-0.933237,0.955057,0.708208
E,0.190794,1.978757,2.605967,0.683509,5.156577


In [52]:
df3["new2"] = [1,2,3,4,5]
df3

Unnamed: 0,W,X,Y,Z,new,new2
A,2.70685,0.628133,0.907969,0.503826,0.570325,1
B,0.651118,-0.319318,-0.848077,0.605965,0.270806,2
C,-2.018168,0.740122,0.528813,-0.589001,0.391387,3
D,0.188695,-0.758872,-0.933237,0.955057,0.708208,4
E,0.190794,1.978757,2.605967,0.683509,5.156577,5


In [55]:
df3.columns = ["W","X","Y","Z","new","new2"]
df3 

Unnamed: 0,W,X,Y,Z,new,new2
A,2.70685,0.628133,0.907969,0.503826,0.570325,1
B,0.651118,-0.319318,-0.848077,0.605965,0.270806,2
C,-2.018168,0.740122,0.528813,-0.589001,0.391387,3
D,0.188695,-0.758872,-0.933237,0.955057,0.708208,4
E,0.190794,1.978757,2.605967,0.683509,5.156577,5


**Reordering the columns** (the new columns `new` and `new2` were created above):

In [56]:
df3 = df3[["new","new2","W","X","Y","Z"]]
df3 # we changed the order of the columns

Unnamed: 0,new,new2,W,X,Y,Z
A,0.570325,1,2.70685,0.628133,0.907969,0.503826
B,0.270806,2,0.651118,-0.319318,-0.848077,0.605965
C,0.391387,3,-2.018168,0.740122,0.528813,-0.589001
D,0.708208,4,0.188695,-0.758872,-0.933237,0.955057
E,5.156577,5,0.190794,1.978757,2.605967,0.683509


## [Removing Columns & Rows](http://localhost:8888/notebooks/pythonic/DAwPythonSessions/w3resource-pandas-dataframe-drop.ipynb)

 ### Removing Columns

In [57]:
df3.drop("new2",axis=1) # drops new2; to make the change permanent, pass inplace=True or assign the result to a variable

Unnamed: 0,new,W,X,Y,Z
A,0.570325,2.70685,0.628133,0.907969,0.503826
B,0.270806,0.651118,-0.319318,-0.848077,0.605965
C,0.391387,-2.018168,0.740122,0.528813,-0.589001
D,0.708208,0.188695,-0.758872,-0.933237,0.955057
E,5.156577,0.190794,1.978757,2.605967,0.683509


In [58]:
df3.drop(["new","new2"],axis=1)

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [64]:
df3.drop(["new", "new2"], axis=1, inplace=True) # inplace=True modifies df3 itself

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [69]:
df3

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [71]:
df3.drop(columns = ["new", "new2"]) # raises KeyError: these columns were already dropped in place

KeyError: ignored
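
The copy-of-a-slice warning and the `KeyError` seen in this section can both be sidestepped. The sketch below uses a made-up frame `df_demo` and assumes the warning most likely arose because the frame being modified in place was itself produced by selecting columns from another frame; taking an explicit `.copy()` is a common remedy. The `errors="ignore"` argument of `drop` skips labels that are not present.

```python
import pandas as pd

df_demo = pd.DataFrame({"new": [1, 2], "new2": [3, 4], "W": [5, 6]})

# .copy() makes the selection an independent DataFrame before modifying it
subset = df_demo[["new2", "W"]].copy()
subset.drop("new2", axis=1, inplace=True)

# Dropping a column that is already gone raises KeyError by default ...
try:
    subset.drop(columns=["new2"])
    raised = False
except KeyError:
    raised = True

# ... unless errors="ignore" is passed
result = subset.drop(columns=["new2"], errors="ignore")

print(list(subset.columns), raised, list(result.columns))
```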


 ### Removing rows

In [65]:
df3.drop("C",axis=0) # normally we would assign this result to a new variable such as df4


Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [66]:
df3.drop(index= ["B"])

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


## Selecting Rows

### First, let's take a quick look at [`.loc[]`](http://localhost:8888/notebooks/pythonic/DAwPythonSessions/w3resource-pandas-dataframe-loc.ipynb) | [`.iloc[]`](http://localhost:8888/notebooks/pythonic/DAwPythonSessions/w3resource-pandas-dataframe-iloc.ipynb)

#### `.loc[]` → allows us to select data using **labels** (names) of rows (index) & columns

#### `.iloc[]` → allows us to select data using **index numbers** (integer positions) of rows & columns. It works like classical Python indexing

In [83]:
m = np.random.randint(1,40, size=(8,4))
df4 = pd.DataFrame(m, columns = ["var1","var2","var3",'var4'])
df4

Unnamed: 0,var1,var2,var3,var4
0,38,23,10,3
1,19,29,12,11
2,31,36,29,4
3,20,21,15,6
4,6,7,25,38
5,8,38,5,24
6,36,16,35,4
7,19,14,4,38


In [84]:
df4.loc[4] # returns the row labeled 4

var1     6
var2     7
var3    25
var4    38
Name: 4, dtype: int64

In [85]:
df4.loc[[4]]

Unnamed: 0,var1,var2,var3,var4
4,6,7,25,38


In [86]:
df4.loc[2:5] # with .loc the end label 5 is included

Unnamed: 0,var1,var2,var3,var4
2,31,36,29,4
3,20,21,15,6
4,6,7,25,38
5,8,38,5,24


In [87]:
df4.iloc[2:5] # with .iloc the end position 5 is excluded

Unnamed: 0,var1,var2,var3,var4
2,31,36,29,4
3,20,21,15,6
4,6,7,25,38


In [88]:
df4.index='a b c d e f g h'.split()
df4 # we renamed the index labels

Unnamed: 0,var1,var2,var3,var4
a,38,23,10,3
b,19,29,12,11
c,31,36,29,4
d,20,21,15,6
e,6,7,25,38
f,8,38,5,24
g,36,16,35,4
h,19,14,4,38


In [89]:
df4.iloc[1:4] # .iloc still selects by position even though we changed the index labels

Unnamed: 0,var1,var2,var3,var4
b,19,29,12,11
c,31,36,29,4
d,20,21,15,6


In [91]:
#df4.loc[1:4] # raises an error because the index labels are now strings, not integers
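
The difference between label slicing and position slicing can be checked on a small made-up frame. In this sketch `df_demo` is illustrative, and the integer slice on string labels is expected to fail, typically with a `TypeError`, though the exact exception may depend on the pandas version.

```python
import pandas as pd

df_demo = pd.DataFrame({"var1": range(8)}, index=list("abcdefgh"))

by_label = df_demo.loc["b":"d"]    # label slice: the end label "d" is included
by_position = df_demo.iloc[1:4]    # position slice: position 4 is excluded

# Integer slice bounds do not match string index labels
try:
    df_demo.loc[1:4]
    failed = False
except (TypeError, KeyError):
    failed = True

print(by_label.equals(by_position), failed)
```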

#### Let's continue to examine `.loc[]` and `.iloc[]` using the relabeled ``df4``

In [93]:
df4.loc["a":"d"]

Unnamed: 0,var1,var2,var3,var4
a,38,23,10,3
b,19,29,12,11
c,31,36,29,4
d,20,21,15,6


In [94]:
df4

Unnamed: 0,var1,var2,var3,var4
a,38,23,10,3
b,19,29,12,11
c,31,36,29,4
d,20,21,15,6
e,6,7,25,38
f,8,38,5,24
g,36,16,35,4
h,19,14,4,38


In [95]:
df4.iloc[3,1]

21

In [96]:
df4.loc["d","var2"] # [row, column]

21

In [97]:
df4.loc["d":"g","var3"]

d    15
e    25
f     5
g    35
Name: var3, dtype: int64

In [98]:
df4.loc["d":"g"]["var3"] # this returns a Series

d    15
e    25
f     5
g    35
Name: var3, dtype: int64

In [99]:
df4.loc["d":"g"][["var3"]] # this returns a DataFrame

Unnamed: 0,var3
d,15
e,25
f,5
g,35


### Selecting subset of rows and columns

 - ### `.loc[[row labels|names], [column labels|names]]`

 - ### `.iloc[[row index numbers], [column index numbers]]`

In [100]:
df4.loc["d":"g",["var3"]]

Unnamed: 0,var3
d,15
e,25
f,5
g,35


In [101]:
df4.iloc[2:5,2]

c    29
d    15
e    25
Name: var3, dtype: int64

In [102]:
df4.iloc[2:5][["var2"]]

Unnamed: 0,var2
c,36
d,21
e,7


In [103]:
df3

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [104]:
df3.loc["C"]

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

In [105]:
df3.iloc[2]

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

In [106]:
df3.loc[["C"]]

Unnamed: 0,W,X,Y,Z
C,-2.018168,0.740122,0.528813,-0.589001


In [107]:
df3.loc["C","Z"]

-0.5890005332865824

In [108]:
df3.loc[["C"],["Z"]]

Unnamed: 0,Z
C,-0.589001


In [109]:
df3.loc[["A","C"],["X","Z"]]

Unnamed: 0,X,Z
A,0.628133,0.503826
C,0.740122,-0.589001


In [110]:
df3.iloc[[0,2],[0,3]]

Unnamed: 0,W,Z
A,2.70685,0.503826
C,-2.018168,-0.589001
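
Labels and positions can be translated into each other with `Index.get_loc`, which helps when one axis is known by label and the other by position. A minimal sketch with a made-up frame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["A", "B", "C"],
    columns=["X", "Y", "Z"],
)

# Label -> integer position
row_pos = df_demo.index.get_loc("C")
col_pos = df_demo.columns.get_loc("Z")

value_by_label = df_demo.loc["C", "Z"]
value_by_pos = df_demo.iloc[row_pos, col_pos]

# Position -> label: row chosen by label, column chosen by its position
mixed = df_demo.loc["C", df_demo.columns[0]]

print(row_pos, col_pos, value_by_label, value_by_pos, mixed)
```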


### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [111]:
df3

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [112]:
df3>2 # returns a DataFrame of booleans

Unnamed: 0,W,X,Y,Z
A,True,False,False,False
B,False,False,False,False
C,False,False,False,False
D,False,False,False,False
E,False,False,True,False


In [113]:
df3[df3>2] # keeps the values that satisfy the condition; the rest become NaN

Unnamed: 0,W,X,Y,Z
A,2.70685,,,
B,,,,
C,,,,
D,,,,
E,,,2.605967,


In [115]:
df3[df3["Z"]>0.5]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [116]:
df3[df3["Z"]>0.5]["X"]

A    0.628133
B   -0.319318
D   -0.758872
E    1.978757
Name: X, dtype: float64

In [117]:
df3[df3["Z"]>0.7]["X"]

D   -0.758872
Name: X, dtype: float64

In [129]:
df3[(df3['W']>0) & (df3['Y']<1)] = 0

In [128]:
df3

Unnamed: 0,W,X,Y,Z
A,0.0,0.0,0.0,0.0
B,0.0,0.0,0.0,0.0
C,-2.018168,0.740122,0.528813,-0.589001
D,0.0,0.0,0.0,0.0
E,0.190794,1.978757,2.605967,0.683509
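
Assigning through a boolean mask with plain brackets overwrites every column of the matching rows. If only some columns should change, `.loc[mask, columns]` restricts the assignment. A sketch on a made-up frame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"W": [2.0, -1.0, 0.5], "Y": [0.9, 0.5, 2.6], "Z": [0.5, -0.6, 0.7]},
    index=["A", "B", "C"],
)

# Rows where W is positive and Y is below 1 (only row A here)
mask = (df_demo["W"] > 0) & (df_demo["Y"] < 1)

# Zero out W and Y for those rows, leaving Z as it is
df_demo.loc[mask, ["W", "Y"]] = 0

print(df_demo)
```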


In [130]:
df3.loc[(df3.X>0),["X","Y"]]

Unnamed: 0,X,Y
C,0.740122,0.528813
E,1.978757,2.605967


#### To combine two conditions, use **|** → `or` and **&** → `and`, with each condition wrapped in parentheses:

### Conditional selection using ``.loc[]`` and ``.iloc[]``

In [131]:
df3.loc[((df3.W>1) | (df3.Y<1)), ['Y','Z']]

Unnamed: 0,Y,Z
A,0.0,0.0
B,0.0,0.0
C,0.528813,-0.589001
D,0.0,0.0
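
The element-wise operators `&`, `|`, and `~` (negation) are needed because Python's `and`/`or` try to reduce a whole Series to a single truth value, which pandas refuses to do. A short sketch with a made-up frame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({"W": [2.0, -1.0, 0.5], "Y": [0.9, 0.5, 2.6]}, index=list("ABC"))

both = df_demo[(df_demo.W > 0) & (df_demo.Y < 1)]     # both conditions hold
either = df_demo[(df_demo.W > 0) | (df_demo.Y < 1)]   # at least one holds
negated = df_demo[~(df_demo.W > 0)]                   # condition does not hold

# The keyword `and` raises, because a Series has no single truth value
try:
    df_demo[(df_demo.W > 0) and (df_demo.Y < 1)]
    ambiguous = False
except ValueError:
    ambiguous = True

print(list(both.index), list(either.index), list(negated.index), ambiguous)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `>` and `<`.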


## More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

In [132]:
df3

Unnamed: 0,W,X,Y,Z
A,0.0,0.0,0.0,0.0
B,0.0,0.0,0.0,0.0
C,-2.018168,0.740122,0.528813,-0.589001
D,0.0,0.0,0.0,0.0
E,0.190794,1.978757,2.605967,0.683509


In [133]:
df3.reset_index() # assigns a new default integer index; the old index becomes a column

Unnamed: 0,index,W,X,Y,Z
0,A,0.0,0.0,0.0,0.0
1,B,0.0,0.0,0.0,0.0
2,C,-2.018168,0.740122,0.528813,-0.589001
3,D,0.0,0.0,0.0,0.0
4,E,0.190794,1.978757,2.605967,0.683509


In [134]:
df3.reset_index(drop=True) # drop=True discards the old index instead of keeping it as a column

Unnamed: 0,W,X,Y,Z
0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0
2,-2.018168,0.740122,0.528813,-0.589001
3,0.0,0.0,0.0,0.0
4,0.190794,1.978757,2.605967,0.683509


In [135]:
df3.set_index("Z") # Z becomes the index; the change is not permanent without inplace=True or reassignment


Z,W,X,Y
0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0
-0.589001,-2.018168,0.740122,0.528813
0.0,0.0,0.0,0.0
0.683509,0.190794,1.978757,2.605967


In [136]:
df3.reset_index(drop=True,inplace=True)
df3

Unnamed: 0,W,X,Y,Z
0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0
2,-2.018168,0.740122,0.528813,-0.589001
3,0.0,0.0,0.0,0.0
4,0.190794,1.978757,2.605967,0.683509
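
`reset_index` and `set_index` are roughly inverse operations. The sketch below, on a made-up frame `df_demo`, moves the index into a column and back again; the column name `"index"` is what pandas uses by default when the index has no name.

```python
import pandas as pd

df_demo = pd.DataFrame({"W": [1, 2, 3]}, index=["A", "B", "C"])

# The old index becomes a regular column called "index"
flat = df_demo.reset_index()

# Turn that column back into the index and clear the index name
restored = flat.set_index("index")
restored.index.name = None

print(list(flat.columns))
print(restored.equals(df_demo))
```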


## Multi-Index and Index Hierarchy

Let us go over how to work with a Multi-Index. First, we'll create a quick example of what a Multi-Indexed DataFrame looks like:

### Let's take a quick look at the [``.xs()``](http://localhost:8888/notebooks/pythonic/DAwPythonSessions/w3resource-pandas-dataframe-xs.ipynb)
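
As a minimal sketch of a Multi-Indexed DataFrame and of `.xs()`, the example below builds a two-level index by hand. The level names `Group` and `Num`, the labels, and the data are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

outside = ["G1", "G1", "G1", "G2", "G2", "G2"]
inside = [1, 2, 3, 1, 2, 3]
hier_index = pd.MultiIndex.from_arrays([outside, inside], names=["Group", "Num"])

df_multi = pd.DataFrame(
    np.arange(12).reshape(6, 2), index=hier_index, columns=["A", "B"]
)

g1 = df_multi.loc["G1"]                # all rows of the outer group G1
one_row = df_multi.loc[("G1", 2)]      # one row, addressed by the full key
num_1 = df_multi.xs(1, level="Num")    # cross-section: Num == 1 in every group

print(df_multi)
print(num_1)
```

`.xs()` is convenient here because it can select on an inner level directly, which plain `.loc[]` with a single label cannot do.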

### Let's learn new functions/attributes/methods on the "iris dataset"

# End of the Session