# The pandas module

pandas is a module that is used very efficiently for data processing and data cleaning, and it is also used in data analysis, data mining and machine learning. It is built on two data types: Series and DataFrame. A Series can be thought of as a one-dimensional array and a DataFrame as a two-dimensional one. Join and GroupBy operations allow queries similar to those on SQL tables. pandas can read data from CSV and Excel files (.csv, .xlsx) and write results back to the same formats.
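As a rough illustration of the CSV round trip and the SQL-like grouping mentioned above, here is a minimal sketch. The `sales` table, its column names and its values are hypothetical and exist only for this example; an in-memory buffer stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical data, used only for illustration.
sales = pd.DataFrame({
    "city": ["Ankara", "Izmir", "Ankara", "Izmir"],
    "amount": [10, 20, 30, 40],
})

# Write the table as CSV text and read it back.
# A file path such as "sales.csv" could be used instead of the buffer.
buffer = io.StringIO()
sales.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# SQL-like aggregation: total amount per city.
totals = loaded.groupby("city")["amount"].sum()
print(totals)
```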

pandas is installed automatically with Anaconda. If it is missing, it can be added to Anaconda by running conda install pandas in the Anaconda Prompt. With Python IDLE or VS Code, it can be installed with pip install pandas.

# pandas Series

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Ways to define a Series - 1

In [3]:
labels_list = ["Hagi","Popescu","Tafarel","Okan Buruk","Bülent Korkmaz"]

In [4]:
data_list = [10,20,30,40,50]

With the cursor inside the parentheses of pd.Series(), pressing Shift+Tab shows the function's signature and documentation.

In [5]:
pd.Series()


Series([], dtype: float64)

The parameters listed by Shift+Tab, such as data and index, can then be passed explicitly.

In [6]:
pd.Series(data = data_list)

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [7]:
pd.Series(data = data_list, index = labels_list)

Hagi              10
Popescu           20
Tafarel           30
Okan Buruk        40
Bülent Korkmaz    50
dtype: int64

If the arguments are passed in positional order, the data = and index = keywords are not needed.

In [8]:
pd.Series(data_list, labels_list)

Hagi              10
Popescu           20
Tafarel           30
Okan Buruk        40
Bülent Korkmaz    50
dtype: int64

In [None]:
# Ways to define a Series - 2

When no index is given, a default integer index starting at 0 is used.

In [9]:
pd.Series(data_list)

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [10]:
npArray = np.array([100,200,300,400,500])

In [11]:
pd.Series(data = npArray)

0    100
1    200
2    300
3    400
4    500
dtype: int32

In [12]:
pd.Series(data = npArray,index = labels_list)

Hagi              100
Popescu           200
Tafarel           300
Okan Buruk        400
Bülent Korkmaz    500
dtype: int32

In [13]:
pd.Series(data = npArray,index = ["A","B","C","D","E"])

A    100
B    200
C    300
D    400
E    500
dtype: int32

In [None]:
# Ways to define a Series - 3

In [14]:
dataDict = {"Hagi":10,"Tafarel":80,"Aykut":11} 

In [15]:
pd.Series(dataDict)

Hagi       10
Tafarel    80
Aykut      11
dtype: int64
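When a Series is built from a dictionary, the keys become the index labels. The sketch below additionally passes an explicit index, which selects and orders the keys. The label "Popescu" is deliberately missing from the dictionary to show what happens in that case, and the name `selected` is just for this example.

```python
import pandas as pd

data_dict = {"Hagi": 10, "Tafarel": 80, "Aykut": 11}

# Only the labels listed in `index` are kept, in that order.
# A label that is not a key of the dictionary gets NaN.
selected = pd.Series(data_dict, index=["Hagi", "Aykut", "Popescu"])
print(selected)
```

Because NaN is a floating-point value, the resulting Series is stored as float64 rather than int64.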

# Operations on Series

Export figures for 2019 and 2020 (the values are made up).

In [16]:
s2019 = pd.Series([5,10,14,20],["Buğday","Mısır","Kiraz","Erik"])

In [17]:
s2020 = pd.Series([2,12,12,6],["Buğday","Mısır","Çilek","Erik"])

In [18]:
s2019

Buğday     5
Mısır     10
Kiraz     14
Erik      20
dtype: int64

In [19]:
s2020

Buğday     2
Mısır     12
Çilek     12
Erik       6
dtype: int64

In [20]:
s2019 + s2020

Buğday     7.0
Erik      26.0
Kiraz      NaN
Mısır     22.0
Çilek      NaN
dtype: float64

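NaN means "not a number". Kiraz and Çilek each appear in only one of the two Series, so no sum can be computed for them. Because NaN is a float, the result has dtype float64.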
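If a label missing from one Series should count as zero instead of producing NaN, one option is `Series.add` with `fill_value`. This is a sketch of an alternative to the `+` operator, not something the cells here rely on; the name `total_filled` is chosen for this example.

```python
import pandas as pd

s2019 = pd.Series([5, 10, 14, 20], ["Buğday", "Mısır", "Kiraz", "Erik"])
s2020 = pd.Series([2, 12, 12, 6], ["Buğday", "Mısır", "Çilek", "Erik"])

# Labels present in only one Series are treated as 0 in the other.
total_filled = s2019.add(s2020, fill_value=0)
print(total_filled)
```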

In [21]:
s2019["Buğday"] # the value stored under the label "Buğday"

5

In [22]:
s2020["Buğday"] # the value stored under the label "Buğday"

2

In [23]:
total = s2019 + s2020

In [24]:
total["Buğday"]

7.0

In [25]:
total["üzüm"]

KeyError: 'üzüm'
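Asking for a label that is not in the index raises a KeyError, as seen with "üzüm". A minimal sketch of two ways to avoid the exception follows; the small `total` Series is rebuilt here with two of the values from above so that the block runs on its own, and the default of 0.0 is an arbitrary choice.

```python
import pandas as pd

total = pd.Series([7.0, 26.0], ["Buğday", "Erik"])

# The `in` operator checks the index labels, not the values.
has_grape = "üzüm" in total

# Series.get returns a default instead of raising KeyError.
grape = total.get("üzüm", 0.0)
wheat = total.get("Buğday", 0.0)
print(has_grape, grape, wheat)
```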

# Defining DataFrames

A DataFrame is a collection of several Series that share the same index, and it closely resembles an SQL table. This makes it possible to run table-style operations on it.
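To make the "collection of Series" idea concrete, the sketch below builds a DataFrame from the two export Series used earlier. The column names "2019" and "2020" and the variable `exports` are chosen for this example only.

```python
import pandas as pd

s2019 = pd.Series([5, 10, 14, 20], ["Buğday", "Mısır", "Kiraz", "Erik"])
s2020 = pd.Series([2, 12, 12, 6], ["Buğday", "Mısır", "Çilek", "Erik"])

# Each Series becomes one column; the row index is the union of both indexes.
exports = pd.DataFrame({"2019": s2019, "2020": s2020})
print(exports)

# Selecting a column gives back a Series.
print(type(exports["2019"]))
```

Labels that exist in only one of the Series appear as NaN in the other column.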

In [27]:
import numpy as np
import pandas as pd
from numpy.random import randn

In [28]:
randn(3,3)

array([[ 0.35821443, -0.15590421, -1.2684308 ],
       [-0.06694089, -0.50789645,  0.62098399],
       [ 1.01986768,  0.75825894, -0.28986443]])

In [29]:
df = pd.DataFrame(randn(3,3),index = ["A","B","C"],
                  columns = ["sütun1","sütun2","sütun3"])
df

     sütun1    sütun2    sütun3
A  0.741850 -2.121489  0.897391
B  0.680279  0.755771  0.358398
C  1.079519 -0.883834  1.395758


In [30]:
df["sütun1"]

A    0.741850
B    0.680279
C    1.079519
Name: sütun1, dtype: float64

In [31]:
df["sütun2"]

A   -2.121489
B    0.755771
C   -0.883834
Name: sütun2, dtype: float64

In [32]:
type(df["sütun1"])

pandas.core.series.Series

To view a row, that is, the elements that correspond to an index label, use .loc[].

In [33]:
df.loc["A"]

sütun1    0.741850
sütun2   -2.121489
sütun3    0.897391
Name: A, dtype: float64

In [34]:
type(df.loc["A"])

pandas.core.series.Series

In [None]:
# Selecting part of a DataFrame

In [35]:
df[["sütun1","sütun2"]]

     sütun1    sütun2
A  0.741850 -2.121489
B  0.680279  0.755771
C  1.079519 -0.883834


# Adding a column to a DataFrame

In [36]:
df["sütun4"] = pd.Series(randn(3),index = ["A","B","C"])
df

     sütun1    sütun2    sütun3    sütun4
A  0.741850 -2.121489  0.897391 -0.092363
B  0.680279  0.755771  0.358398  1.531593
C  1.079519 -0.883834  1.395758  0.205678


In [37]:
df["sütun5"] = df["sütun1"] + df["sütun2"] + df["sütun3"]
df

     sütun1    sütun2    sütun3    sütun4    sütun5
A  0.741850 -2.121489  0.897391 -0.092363 -0.482247
B  0.680279  0.755771  0.358398  1.531593  1.794447
C  1.079519 -0.883834  1.395758  0.205678  1.591443


In [None]:
# Dropping a column from a DataFrame

In [38]:
df.drop("sütun5")
# Called like this, the default axis=0 applies, so pandas looks for "sütun5" among the index labels and fails.

KeyError: "['sütun5'] not found in axis"

In [39]:
df.drop("sütun5", axis = 1)

     sütun1    sütun2    sütun3    sütun4
A  0.741850 -2.121489  0.897391 -0.092363
B  0.680279  0.755771  0.358398  1.531593
C  1.079519 -0.883834  1.395758  0.205678


In [40]:
df

     sütun1    sütun2    sütun3    sütun4    sütun5
A  0.741850 -2.121489  0.897391 -0.092363 -0.482247
B  0.680279  0.755771  0.358398  1.531593  1.794447
C  1.079519 -0.883834  1.395758  0.205678  1.591443


In [None]:
df # the change was not applied: drop returned a new DataFrame and left df as it was
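Since `drop` returns a new DataFrame by default, one way to keep the result is to assign it to a name. The sketch below uses a small hypothetical frame and the `columns=` keyword, which is an alternative spelling of `axis=1`; the name `trimmed` is chosen for this example.

```python
import pandas as pd

# Small stand-in frame; the values are arbitrary.
df = pd.DataFrame({"sütun1": [1, 2], "sütun5": [3, 4]}, index=["A", "B"])

# Equivalent to df.drop("sütun5", axis=1); df itself is not modified.
trimmed = df.drop(columns="sütun5")

print(list(df.columns))
print(list(trimmed.columns))
```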

In [41]:
df.drop("sütun5",axis = 1,inplace=True)
df

     sütun1    sütun2    sütun3    sütun4
A  0.741850 -2.121489  0.897391 -0.092363
B  0.680279  0.755771  0.358398  1.531593
C  1.079519 -0.883834  1.395758  0.205678


In [None]:
# loc and iloc

In [42]:
df.loc["A"]

sütun1    0.741850
sütun2   -2.121489
sütun3    0.897391
sütun4   -0.092363
Name: A, dtype: float64

In [43]:
df.iloc[0] # integer position 0, i.e. row A

sütun1    0.741850
sütun2   -2.121489
sütun3    0.897391
sütun4   -0.092363
Name: A, dtype: float64

In [44]:
df.loc["A","sütun1"]

0.7418499093612967

In [45]:
df.loc[["A","B"],["sütun1","sütun2"]]

     sütun1    sütun2
A  0.741850 -2.121489
B  0.680279  0.755771
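The difference between label-based `loc` and position-based `iloc` also shows up in slicing: a label slice includes its end point, while a position slice excludes it. A minimal sketch with a hypothetical three-row frame:

```python
import pandas as pd

# Small stand-in frame; the values are arbitrary.
df = pd.DataFrame(
    {"sütun1": [1, 2, 3], "sütun2": [4, 5, 6]},
    index=["A", "B", "C"],
)

# Label slice: both "A" and "B" are included.
by_label = df.loc["A":"B", "sütun1"]

# Position slice: positions 0 and 1 are included, position 2 is not.
by_position = df.iloc[0:2, 0]

print(by_label)
print(by_position)
```

Both expressions select the same two rows of the first column.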


# Filtering a DataFrame

In [46]:
import numpy as np
import pandas as pd
from numpy.random import randn

In [47]:
df = pd.DataFrame(randn(4,3),["A","B","C","D"],["Column1","Column2","Column3"])

In [48]:
df

    Column1   Column2   Column3
A -1.402757  2.055278  0.723531
B -1.303609 -0.015641  0.069158
C  0.694593 -0.268569 -0.265927
D -1.605943  0.392598  0.928522


In [49]:
df > -1

   Column1  Column2  Column3
A    False     True     True
B    False     True     True
C     True     True     True
D    False     True     True


In [50]:
booldf = df > 0
booldf

   Column1  Column2  Column3
A    False     True     True
B    False    False     True
C     True    False    False
D    False     True     True


In [51]:
df[booldf]

    Column1   Column2   Column3
A       NaN  2.055278  0.723531
B       NaN       NaN  0.069158
C  0.694593       NaN       NaN
D       NaN  0.392598  0.928522


In [52]:
df[df > 0]

    Column1   Column2   Column3
A       NaN  2.055278  0.723531
B       NaN       NaN  0.069158
C  0.694593       NaN       NaN
D       NaN  0.392598  0.928522


In [53]:
df

    Column1   Column2   Column3
A -1.402757  2.055278  0.723531
B -1.303609 -0.015641  0.069158
C  0.694593 -0.268569 -0.265927
D -1.605943  0.392598  0.928522


In [54]:
df["Column1"]

A   -1.402757
B   -1.303609
C    0.694593
D   -1.605943
Name: Column1, dtype: float64

In [55]:
df["Column1"] > -1

A    False
B    False
C     True
D    False
Name: Column1, dtype: bool

In [56]:
df[df["Column1"] > -1]

    Column1   Column2   Column3
C  0.694593 -0.268569 -0.265927


Rows A, B and D are False in the boolean Series, so the query above filters them out and only row C remains.
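The same boolean Series can be stored in a variable and passed to `.loc`, which also allows picking columns in the same step. This is a sketch with a hypothetical frame whose values are rounded stand-ins for the random numbers above; `mask` and `kept` are names chosen for this example.

```python
import pandas as pd

# Rounded stand-in values; the real frame above holds random numbers.
df = pd.DataFrame(
    {"Column1": [-1.4, -1.3, 0.7, -1.6], "Column2": [2.1, 0.0, -0.3, 0.4]},
    index=["A", "B", "C", "D"],
)

# One True/False entry per row.
mask = df["Column1"] > -1

# Keep the rows where the mask is True and only the listed columns.
kept = df.loc[mask, ["Column2"]]
print(kept)
```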

In [57]:
df

    Column1   Column2   Column3
A -1.402757  2.055278  0.723531
B -1.303609 -0.015641  0.069158
C  0.694593 -0.268569 -0.265927
D -1.605943  0.392598  0.928522


In [58]:
df["Column1"] > 0.5

A    False
B    False
C     True
D    False
Name: Column1, dtype: bool

In [59]:
df[df["Column1"] > 0.5]

    Column1   Column2   Column3
C  0.694593 -0.268569 -0.265927


In [60]:
df["Column2"] > 0.5

A     True
B    False
C    False
D    False
Name: Column2, dtype: bool

In [61]:
df[df["Column2"] > 0.5]

    Column1   Column2   Column3
A -1.402757  2.055278  0.723531


In [62]:
df

    Column1   Column2   Column3
A -1.402757  2.055278  0.723531
B -1.303609 -0.015641  0.069158
C  0.694593 -0.268569 -0.265927
D -1.605943  0.392598  0.928522


In [63]:
df[df["Column1"] > -1]

    Column1   Column2   Column3
C  0.694593 -0.268569 -0.265927


In [64]:
df[df["Column2"] > -1]

    Column1   Column2   Column3
A -1.402757  2.055278  0.723531
B -1.303609 -0.015641  0.069158
C  0.694593 -0.268569 -0.265927
D -1.605943  0.392598  0.928522


In [65]:
df[(df["Column1"] > -1) & (df["Column2"] > 0)]

Empty DataFrame
Columns: [Column1, Column2, Column3]
Index: []


In [66]:
df[(df["Column1"] > -1) | (df["Column2"] > 0)]

    Column1   Column2   Column3
A -1.402757  2.055278  0.723531
C  0.694593 -0.268569 -0.265927
D -1.605943  0.392598  0.928522


In [None]:
# The | character is typed with AltGr + hyphen.
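When conditions are combined, each comparison needs its own parentheses because `&` and `|` bind more tightly than `>`. Python's `and` and `or` cannot be used here, since they raise a ValueError on a Series. The sketch below uses a hypothetical frame with rounded stand-in values and also shows `~` for negation, which the cells here do not use.

```python
import pandas as pd

# Rounded stand-in values; the real frame above holds random numbers.
df = pd.DataFrame(
    {"Column1": [-1.4, -1.3, 0.7, -1.6], "Column2": [2.1, 0.0, -0.3, 0.4]},
    index=["A", "B", "C", "D"],
)

cond1 = df["Column1"] > -1   # True only for C
cond2 = df["Column2"] > 0    # True for A and D

both = df[cond1 & cond2]         # element-wise AND
either = df[cond1 | cond2]       # element-wise OR
neither = df[~(cond1 | cond2)]   # element-wise NOT

print(list(both.index), list(either.index), list(neither.index))
```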

In [67]:
df

    Column1   Column2   Column3
A -1.402757  2.055278  0.723531
B -1.303609 -0.015641  0.069158
C  0.694593 -0.268569 -0.265927
D -1.605943  0.392598  0.928522


In [68]:
df["Column4"] = pd.Series(randn(4),index = ["A","B","C","D"])
df

    Column1   Column2   Column3   Column4
A -1.402757  2.055278  0.723531 -1.531644
B -1.303609 -0.015641  0.069158  1.500013
C  0.694593 -0.268569 -0.265927 -1.078551
D -1.605943  0.392598  0.928522  1.748340


In [69]:
df["Column5"] = randn(4)
df

    Column1   Column2   Column3   Column4   Column5
A -1.402757  2.055278  0.723531 -1.531644 -0.460878
B -1.303609 -0.015641  0.069158  1.500013  0.167203
C  0.694593 -0.268569 -0.265927 -1.078551 -0.850132
D -1.605943  0.392598  0.928522  1.748340  0.264729


In [70]:
df["Column6"] = ["newValue1","newValue2","newValue3","newValue4"]
df

    Column1   Column2   Column3   Column4   Column5    Column6
A -1.402757  2.055278  0.723531 -1.531644 -0.460878  newValue1
B -1.303609 -0.015641  0.069158  1.500013  0.167203  newValue2
C  0.694593 -0.268569 -0.265927 -1.078551 -0.850132  newValue3
D -1.605943  0.392598  0.928522  1.748340  0.264729  newValue4


In [None]:
# To replace the index with the newly added Column6, use .set_index(column name)

In [71]:
df.set_index("Column6")

            Column1   Column2   Column3   Column4   Column5
Column6
newValue1 -1.402757  2.055278  0.723531 -1.531644 -0.460878
newValue2 -1.303609 -0.015641  0.069158  1.500013  0.167203
newValue3  0.694593 -0.268569 -0.265927 -1.078551 -0.850132
newValue4 -1.605943  0.392598  0.928522  1.748340  0.264729


In [72]:
df

    Column1   Column2   Column3   Column4   Column5    Column6
A -1.402757  2.055278  0.723531 -1.531644 -0.460878  newValue1
B -1.303609 -0.015641  0.069158  1.500013  0.167203  newValue2
C  0.694593 -0.268569 -0.265927 -1.078551 -0.850132  newValue3
D -1.605943  0.392598  0.928522  1.748340  0.264729  newValue4


In [73]:
df.set_index("Column6",inplace = True)
df

            Column1   Column2   Column3   Column4   Column5
Column6
newValue1 -1.402757  2.055278  0.723531 -1.531644 -0.460878
newValue2 -1.303609 -0.015641  0.069158  1.500013  0.167203
newValue3  0.694593 -0.268569 -0.265927 -1.078551 -0.850132
newValue4 -1.605943  0.392598  0.928522  1.748340  0.264729


In [74]:
df.index.names # names of the index levels

FrozenList(['Column6'])
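Note that `set_index` discards the previous index labels (A to D above). If the index column is needed as an ordinary column again, `reset_index` moves it back and installs a default integer index. A minimal sketch with a hypothetical two-row frame; `indexed` and `restored` are names chosen for this example.

```python
import pandas as pd

# Small stand-in frame; the values are arbitrary.
df = pd.DataFrame(
    {"Column1": [1.0, 2.0], "Column6": ["newValue1", "newValue2"]},
    index=["A", "B"],
)

indexed = df.set_index("Column6")   # Column6 becomes the index; A and B are dropped
restored = indexed.reset_index()    # Column6 becomes a regular column again

print(list(indexed.index.names))
print(restored)
```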