# PANDAS

Introduction to Pandas
In this section of the course we will learn how to use pandas for data analysis. You can think of pandas as an extremely powerful version of Excel, with a lot more features. Work through the notebooks in this order:

1. Introduction to Pandas
2. Series
3. DataFrames
4. Missing Data
5. GroupBy
6. Merging, Joining, and Concatenating
7. Operations
8. Data Input and Output


## Series
###  Q: What is Series in Pandas?
A: A Series is a one-dimensional labeled array that can store any data type. The row labels of a Series are called the index. The `pd.Series` constructor converts a list, tuple, or dictionary into a Series. A Series holds a single column of values; it cannot contain multiple columns.

The first main data type we will learn about for pandas is the Series data type. Let's import Pandas and explore the Series object.

A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a number location. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.
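As a minimal sketch of that difference (the variable names here are illustrative and not part of the course notebook), the same values can be read by position from a NumPy array, and by label or by position from a Series:

```python
import numpy as np
import pandas as pd

values = [10, 20, 30]
arr = np.array(values)
ser = pd.Series(values, index=["a", "b", "c"])

# A NumPy array can only be indexed by integer position.
second_from_array = arr[1]

# A Series can be indexed by label, or explicitly by position with .iloc.
second_by_label = ser["b"]
second_by_position = ser.iloc[1]

print(second_from_array, second_by_label, second_by_position)
```

All three lookups return the same value; only the way of addressing it differs.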

Let's explore this concept through some examples:

### Practice
You can think of a pandas Series as a column with labels: a Series holds a single list of values with an index, whereas a DataFrame can be made up of more than one Series.

In [15]:
import numpy as np
import pandas as pd

### Creating a Series

You can convert a list, NumPy array, or dictionary to a Series:

In [16]:
labels = ['a', 'b','c']
my_data = [10,20,30]
arr = np.array(my_data)
d = {'a':10, 'b':20, 'c':30}

**Using Lists**

In [17]:
pd.Series(data=my_data) # pd.Series accepts many parameters, but for now we focus on data and index.
                        # Here pandas assigned a default integer index to the data.

0    10
1    20
2    30
dtype: int64

In [18]:
pd.Series(data=my_data, index=labels) # here the labels list is passed as the index for the data

a    10
b    20
c    30
dtype: int64

In [19]:
pd.Series(my_data, labels) # the arguments can also be positional: the first is the data, the second is the index

a    10
b    20
c    30
dtype: int64

**NumPy Arrays**

In [20]:
pd.Series(arr)

0    10
1    20
2    30
dtype: int32

In [21]:
pd.Series(arr,labels)

a    10
b    20
c    30
dtype: int32

**Dictionary**

In [22]:
pd.Series(d) # d = {'a':10, 'b':20, 'c':30} is a dictionary; its keys a, b, c become the index
                                # and its values become the data

a    10
b    20
c    30
dtype: int64

In [23]:
d

{'a': 10, 'b': 20, 'c': 30}

### Data in a Series

A pandas Series can hold a variety of object types:

In [24]:
pd.Series(data=labels)

0    a
1    b
2    c
dtype: object

In [25]:
# Even functions (although unlikely that you will use this)
pd.Series(data=[sum,print,len]) # passing arbitrary objects as the data gives a Series of dtype object

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

## Using an Index

The key to using a Series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast look ups of information (works like a hash table or dictionary).

Let's see some examples of how to grab information from a Series. Let us create two Series, ser1 and ser2:

In [26]:
ser1 = pd.Series([1,2,3,4],['USA','Germany','USSR','Japan']) # here the countries are the index and the numbers are the data

In [27]:
ser1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [28]:
ser2 = pd.Series([1,2,3,4],['USA','Germany','Italy','Japan']) 

In [29]:
ser2

USA        1
Germany    2
Italy      3
Japan      4
dtype: int64

In [30]:
ser1['USA'] # to get a value from the Series, pass its index label in square brackets.
            # Usually your index is going to be hopefully either a number or a string.

1

In [31]:
labels

['a', 'b', 'c']

In [32]:
ser3 = pd.Series(data=labels)

In [33]:
ser3

0    a
1    b
2    c
dtype: object

In [34]:
ser3[0] # returns the value whose index label is 0

'a'

In [35]:
ser1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [36]:
ser2

USA        1
Germany    2
Italy      3
Japan      4
dtype: int64

In [37]:
ser1 + ser2 # values are added by matching index labels, and the resulting index (the countries) is sorted alphabetically;
            # a label that appears in only one of the two Series gets NaN.

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
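If the `NaN` results are not wanted, one option is the `Series.add` method with `fill_value`, which treats a label missing from one Series as that value before adding. The sketch below rebuilds the two Series so that it runs on its own; whether 0 is a sensible default depends on what the data means.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

# Labels present in only one Series are treated as 0 instead of producing NaN.
total = ser1.add(ser2, fill_value=0)
print(total)
```

Note that the result is still of dtype float64, just as with the plain `+` operator on these two Series.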

### Let's Practice
*import numpy as np*

*import pandas as pd* 

*dict = {'x':10,'y':20}* 

*pd.Series(dict) *

*What is the output?*

*x    10*

*y    20*



## Pandas - DataFrames - Part 1
DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!

#### Q: What is DataFrame?
A: A DataFrame is the most widely used pandas data structure: a two-dimensional array with labeled axes (rows and columns). It is the standard way to store tabular data and has two indexes, a row index and a column index. It has the following properties:

- The columns can be of heterogeneous types, such as int and bool.
- It can be seen as a dictionary of Series in which both the rows and the columns are indexed. The column labels are exposed as "columns" and the row labels as "index".
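The "dictionary of Series" view can be sketched directly. The Series names and values below are made up for illustration; the point is that each dictionary key becomes a column and the shared labels become the row index.

```python
import pandas as pd

# Two Series with the same index but different dtypes.
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
flag = pd.Series([True, False, True], index=['A', 'B', 'C'])

# Each dictionary key becomes a column label.
frame = pd.DataFrame({'W': w, 'flag': flag})

print(frame.index.tolist())    # row labels
print(frame.columns.tolist())  # column labels
print(frame.dtypes)            # one dtype per column
```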

In [38]:
import numpy as np
import pandas as pd

In [39]:
from numpy.random import randn
np.random.seed(101)

In [40]:
df = pd.DataFrame(randn(5,4),['A','B','C','D','E'], ['W','X','Y','Z']) # inspecting pd.DataFrame with Shift+Tab shows:
                                                                    # the first argument is the data (a matrix with 5 rows and 4 columns),
                                                                    # the second, ['A','B','C','D','E'], is the index, i.e. the row labels,
                                                                    # the third, ['W','X','Y','Z'], is the column labels

In [41]:
df # Each column in this output is actually a pandas Series; a DataFrame is a set of Series that share the same index.
        # Here the Series 'W','X','Y','Z' share the index A,B,C,D,E

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [42]:
randn(5,4) # creates a matrix with 5 rows and 4 columns

array([[ 0.30266545,  1.69372293, -1.70608593, -1.15911942],
       [-0.13484072,  0.39052784,  0.16690464,  0.18450186],
       [ 0.80770591,  0.07295968,  0.63878701,  0.3296463 ],
       [-0.49710402, -0.7540697 , -0.9434064 ,  0.48475165],
       [-0.11677332,  1.9017548 ,  0.23812696,  1.99665229]])

## Selection and Indexing

Let's learn the various methods to grab data from a DataFrame

In [43]:
df['W'] # this selects the W Series; each column of the DataFrame is a Series
        # There are two different ways to select a single column from a DataFrame.

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [44]:
# DataFrame Columns are just Series
type(df['W']) # this confirms that each column is a Series

pandas.core.series.Series

In [45]:
type(df) # this confirms that df is a DataFrame

pandas.core.frame.DataFrame

In [46]:
# Attribute (dot) syntax, sometimes called SQL syntax (NOT RECOMMENDED: a column name can clash with a DataFrame method or attribute)
df.W     # the second of the two ways to select a column from a DataFrame

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [47]:
df[['W','Z']] # to select several columns and get a DataFrame back,
            # pass a list of the column names, which means a second pair of square brackets

Unnamed: 0,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509


# Adding a New Column to a DataFrame

In [48]:
df['new'] = df['W'] + df['Y'] # a new column can be created by doing arithmetic on existing columns

In [49]:
# Assigning to df['new'] modifies df directly, so the new column is already part of df
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


## Removing a Column from a DataFrame
**Removing Columns**

In [50]:
df.drop('new', axis = 1, inplace=True) # Shift+Tab shows that the default is axis = 0
                                      # axis = 0 refers to the rows
                                      # axis = 1 refers to the columns
                        # to target a column, set axis = 1; this is how the 'new' column is removed

In [51]:
df # The 'new' column is gone because inplace=True was passed above.
   # With the default inplace=False, drop returns a new DataFrame and leaves df unchanged.

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [53]:
df.drop('E', axis=0) # axis = 0 is the default, so it does not need to be specified

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057


In [44]:
df.shape # still (5, 4): the drop of row 'E' above did not change df,
        # because inplace defaults to False.

(5, 4)

In [45]:
df # row 'E' is still there because drop was called with the default inplace=False

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
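An alternative to `inplace=True` is to keep the DataFrame that `drop` returns. The sketch below uses a small stand-in DataFrame with the same labels as above (its integer values are placeholders, not the random data from the notebook) and also shows the `columns=` keyword, which avoids having to remember the axis number.

```python
import numpy as np
import pandas as pd

# Stand-in data with the same row and column labels as the df above.
df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index=list('ABCDE'), columns=list('WXYZ'))

# drop returns a new DataFrame; df itself is not modified.
without_e = df.drop('E')          # axis=0 (rows) is the default
without_z = df.drop(columns='Z')  # same effect as df.drop('Z', axis=1)

print(df.shape, without_e.shape, without_z.shape)
```

Assigning the result to a new name keeps the original available, which can make a notebook easier to re-run cell by cell.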


In [46]:
df[['Z','X']]

Unnamed: 0,Z,X
A,0.503826,0.628133
B,0.605965,-0.319318
C,-0.589001,0.740122
D,0.955057,-0.758872
E,0.683509,1.978757


# ROWS

***There are two ways to select rows from a DataFrame: .loc and .iloc***

In [47]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


**Selecting Rows**

In [209]:
df.loc['A'] # selecting the row labeled 'A' (position 0) with .loc returns a Series whose index is the column labels.
            # Row 'A' comes back as the Series shown below

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

In [48]:
#df.loc['A','B'] # would raise a KeyError: the second argument is a column label, and 'B' is a row label

In [54]:
df.loc['C'] # returns row 'C', the third row (position 2)

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

In [55]:
df.iloc[2] # .iloc selects rows by integer position.
        # Positions start at 0, so position 2 is the third row, 'C'.

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

**Selecting a subset of rows and columns**

In [50]:
df.loc['A', 'Y'] # returns the value at the intersection of row 'A' and column 'Y' (0.907969)

0.9079694464765431

In [51]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [52]:
df.loc[['A','B'],['W','Y']] # with .loc, pass a list of row labels ['A','B'] first,
                            # then a list of column labels ['W','Y'] (labels, not positions)

Unnamed: 0,W,Y
A,2.70685,0.907969
B,0.651118,-0.848077
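One detail worth knowing when slicing with these two accessors: a label slice with `.loc` includes its end label, while a position slice with `.iloc` excludes its end position, like ordinary Python slicing. The sketch below uses a stand-in DataFrame with placeholder integer values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index=list('ABCDE'), columns=list('WXYZ'))

# Label slices include both endpoints: rows A, B, C and columns W, X, Y.
by_label = df.loc['A':'C', 'W':'Y']

# Position slices exclude the stop position: rows 0, 1 and columns 0, 1.
by_position = df.iloc[0:2, 0:2]

print(by_label.shape, by_position.shape)
```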


In [58]:
# Let's Practice
import pandas as pd
people_dict={"first":["Richard", "Robert", "Jason"],
             "last" :["Stone", "Deepdive", "Seaborn"],
             "email": ["richardstone@email.com", "robertdeepdive@email.com", 
                        "jasonseaborn@email.com"]}
df=pd.DataFrame(people_dict)
df

Unnamed: 0,first,last,email
0,Richard,Stone,richardstone@email.com
1,Robert,Deepdive,robertdeepdive@email.com
2,Jason,Seaborn,jasonseaborn@email.com


In [59]:
df["last"]

0       Stone
1    Deepdive
2     Seaborn
Name: last, dtype: object

In [60]:
# Let's Practice
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'age' :     [ 10, 22, 13, 21, 12, 11, 17],
    'section' : [ 'A', 'B', 'C', 'B', 'B', 'A', 'A'],
    'city' :    [ 'Gurgaon', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
    'gender' :  [ 'M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'favourite_color' : [ 'red', 'blue', 'yellow', 'pink', 'black', 'green', 'red']
})

data

Unnamed: 0,age,section,city,gender,favourite_color
0,10,A,Gurgaon,M,red
1,22,B,Delhi,F,blue
2,13,C,Mumbai,F,yellow
3,21,B,Delhi,M,pink
4,12,B,Mumbai,M,black
5,11,A,Delhi,M,green
6,17,A,Mumbai,F,red


In [61]:
data[['city','age']]

Unnamed: 0,city,age
0,Gurgaon,10
1,Delhi,22
2,Mumbai,13
3,Delhi,21
4,Mumbai,12
5,Delhi,11
6,Mumbai,17


# DataFrames - Part 2

Q: How to set index and reset index?
A: The set_index() method makes an existing column the index of a DataFrame. This is useful, for example, when a DataFrame has been built from two or more DataFrames and needs a different index. The reset_index() method restores the default integer index and moves the old index into a regular column. If the DataFrame has a MultiIndex, reset_index() can remove one or more levels.

- Interview Q&A
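A short round trip illustrates both methods. The column names `state` and `value` below are hypothetical; neither call modifies the original DataFrame unless `inplace=True` is passed.

```python
import pandas as pd

df = pd.DataFrame({'state': ['CA', 'NY', 'WY'], 'value': [1, 2, 3]})

# Move the 'state' column into the index.
indexed = df.set_index('state')

# Move the index back into a column and restore the default 0..n-1 index.
restored = indexed.reset_index()

# drop=True discards the old index instead of keeping it as a column.
discarded = indexed.reset_index(drop=True)

print(indexed.index.name, list(restored.columns), list(discarded.columns))
```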

### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [62]:
import numpy as np
import pandas as pd

In [63]:
from numpy.random import randn
np.random.seed(101)

In [64]:
df = pd.DataFrame(randn(5,4),['A','B','C','D','E'], ['W','X','Y','Z'])

In [65]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [67]:
df > 0 # a comparison returns a DataFrame of Boolean values

Unnamed: 0,W,X,Y,Z
A,True,True,True,True
B,True,False,False,True
C,False,True,True,False
D,True,False,False,True
E,True,True,True,True


In [68]:
booldf = df>0

In [69]:
df[booldf]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,,,0.605965
C,,0.740122,0.528813,
D,0.188695,,,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [70]:
df[df>0]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,,,0.605965
C,,0.740122,0.528813,
D,0.188695,,,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [71]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [74]:
df["W"]>0

A     True
B     True
C    False
D     True
E     True
Name: W, dtype: bool

In [75]:
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [76]:
df[df['W']>0] # filtering on W > 0 returns a DataFrame without row C, because W is negative in that row

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [78]:
df[df['Z']<0] # a DataFrame of the rows where column Z is less than 0

Unnamed: 0,W,X,Y,Z
C,-2.018168,0.740122,0.528813,-0.589001


In [79]:
resuldf = df[df['W']>0]

In [80]:
resuldf

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [81]:
resuldf['X']

A    0.628133
B   -0.319318
D   -0.758872
E    1.978757
Name: X, dtype: float64

In [82]:
df[df['W']>0]['X']

A    0.628133
B   -0.319318
D   -0.758872
E    1.978757
Name: X, dtype: float64

In [83]:
df[df['W']>0][['Y','X']]

Unnamed: 0,Y,X
A,0.907969,0.628133
B,-0.848077,-0.319318
D,-0.933237,-0.758872
E,2.605967,1.978757


In [85]:
boolser = df['W']>0

In [86]:
boolser

A     True
B     True
C    False
D     True
E     True
Name: W, dtype: bool

In [89]:
result = df[boolser]
result

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [90]:
result = df[boolser]
result
mycols = ['Y','X']
result[mycols]

Unnamed: 0,Y,X
A,0.907969,0.628133
B,-0.848077,-0.319318
D,-0.933237,-0.758872
E,2.605967,1.978757
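The filter-then-select steps above can also be written as a single `.loc` call that takes the Boolean mask for the rows and the column list together. This is a sketch on a small made-up DataFrame, and it assumes only reading is intended; the two forms give the same result here.

```python
import pandas as pd

df = pd.DataFrame({'W': [1, -1, 2], 'X': [10, 20, 30], 'Y': [100, 200, 300]},
                  index=list('ABC'))

# Two steps: filter the rows, then pick the columns.
chained = df[df['W'] > 0][['Y', 'X']]

# One step: rows and columns selected in the same .loc call.
single = df.loc[df['W'] > 0, ['Y', 'X']]

print(single)
```

The single-call form is generally the safer habit when the selection is later used for assignment, because chained indexing may operate on a temporary copy.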


***For two conditions, combine them with | (or) and & (and), wrapping each condition in parentheses:***

In [91]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [None]:
# Let's build a DataFrame of the rows where column W is greater than 0 and column Y is greater than 1

In [101]:
df[(df['W']>0) & (df['Y']>1)] # each parenthesized condition returns a Boolean Series; & means "and"

Unnamed: 0,W,X,Y,Z
E,0.190794,1.978757,2.605967,0.683509


In [102]:
df[(df['W']>0) | (df['Y']>1)] # the vertical bar | means "or"

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [95]:
True and True

True

In [96]:
True and False

False

In [97]:
False and False

False

In [98]:
df['W']>0

A     True
B     True
C    False
D     True
E     True
Name: W, dtype: bool

In [100]:
df['Y']>1

A    False
B    False
C    False
D    False
E     True
Name: Y, dtype: bool
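Python's `and` works on single Boolean values, as in the `True and False` cells above, but it cannot combine two Boolean Series element by element. The sketch below, on a small made-up DataFrame, shows that `and` raises a `ValueError` while `&` and `|` work.

```python
import pandas as pd

df = pd.DataFrame({'W': [2.7, 0.65, -2.0], 'Y': [0.9, -0.8, 0.5]},
                  index=list('ABC'))

# `and` asks for a single truth value of a whole Series, which is ambiguous.
try:
    df[(df['W'] > 0) and (df['Y'] > 0)]
    raised = False
except ValueError:
    raised = True

# `&` and `|` compare the two Boolean Series element by element.
both = df[(df['W'] > 0) & (df['Y'] > 0)]
either = df[(df['W'] > 0) | (df['Y'] > 0)]

print(raised, list(both.index), list(either.index))
```

The parentheses around each comparison matter because `&` and `|` bind more tightly than `>` in Python.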

## More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

In [103]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [113]:
df.reset_index()

Unnamed: 0,index,W,X,Y,Z
0,A,2.70685,0.628133,0.907969,0.503826
1,B,0.651118,-0.319318,-0.848077,0.605965
2,C,-2.018168,0.740122,0.528813,-0.589001
3,D,0.188695,-0.758872,-0.933237,0.955057
4,E,0.190794,1.978757,2.605967,0.683509


In [114]:
newind = 'CA NY WY OR CO'.split()

In [115]:
'CA NY WY OR CO'.split()

['CA', 'NY', 'WY', 'OR', 'CO']

In [116]:
newind 

['CA', 'NY', 'WY', 'OR', 'CO']

In [117]:
df['States'] = newind

In [118]:
df

Unnamed: 0,W,X,Y,Z,States
A,2.70685,0.628133,0.907969,0.503826,CA
B,0.651118,-0.319318,-0.848077,0.605965,NY
C,-2.018168,0.740122,0.528813,-0.589001,WY
D,0.188695,-0.758872,-0.933237,0.955057,OR
E,0.190794,1.978757,2.605967,0.683509,CO


In [119]:
df.set_index('States')

Unnamed: 0_level_0,W,X,Y,Z
States,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CA,2.70685,0.628133,0.907969,0.503826
NY,0.651118,-0.319318,-0.848077,0.605965
WY,-2.018168,0.740122,0.528813,-0.589001
OR,0.188695,-0.758872,-0.933237,0.955057
CO,0.190794,1.978757,2.605967,0.683509


In [120]:
df

Unnamed: 0,W,X,Y,Z,States
A,2.70685,0.628133,0.907969,0.503826,CA
B,0.651118,-0.319318,-0.848077,0.605965,NY
C,-2.018168,0.740122,0.528813,-0.589001,WY
D,0.188695,-0.758872,-0.933237,0.955057,OR
E,0.190794,1.978757,2.605967,0.683509,CO


In [121]:
df.set_index('States',inplace=True)

In [122]:
df

Unnamed: 0_level_0,W,X,Y,Z
States,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CA,2.70685,0.628133,0.907969,0.503826
NY,0.651118,-0.319318,-0.848077,0.605965
WY,-2.018168,0.740122,0.528813,-0.589001
OR,0.188695,-0.758872,-0.933237,0.955057
CO,0.190794,1.978757,2.605967,0.683509


In [123]:
### Let's Practice
import pandas as pd
people_dict={"first":["Richard", "Robert", "Jason"],
             "last" :["Stone", "Deepdive", "Seaborn"],
             "email": ["richardstone@email.com", "robertdeepdive@email.com", 
                        "jasonseaborn@email.com"]}
df=pd.DataFrame(people_dict)
df

Unnamed: 0,first,last,email
0,Richard,Stone,richardstone@email.com
1,Robert,Deepdive,robertdeepdive@email.com
2,Jason,Seaborn,jasonseaborn@email.com


In [124]:
df[(df['first']=="Richard") | (df['last'] == 'Deepdive')]

Unnamed: 0,first,last,email
0,Richard,Stone,richardstone@email.com
1,Robert,Deepdive,robertdeepdive@email.com


In [125]:
### Let's Practice
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'age' :     [ 10, 22, 13, 21, 12, 11, 17],
    'section' : [ 'A', 'B', 'C', 'B', 'B', 'A', 'A'],
    'city' :    [ 'Gurgaon', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
    'gender' :  [ 'M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'favourite_color' : [ 'red', 'blue', 'yellow', 'pink', 'black', 'green', 'red']
})

data

Unnamed: 0,age,section,city,gender,favourite_color
0,10,A,Gurgaon,M,red
1,22,B,Delhi,F,blue
2,13,C,Mumbai,F,yellow
3,21,B,Delhi,M,pink
4,12,B,Mumbai,M,black
5,11,A,Delhi,M,green
6,17,A,Mumbai,F,red


In [126]:
data[data["age"]>=21]

Unnamed: 0,age,section,city,gender,favourite_color
1,22,B,Delhi,F,blue
3,21,B,Delhi,M,pink


In [127]:
data['new'] = data['city'].str[:2]

In [128]:
data

Unnamed: 0,age,section,city,gender,favourite_color,new
0,10,A,Gurgaon,M,red,Gu
1,22,B,Delhi,F,blue,De
2,13,C,Mumbai,F,yellow,Mu
3,21,B,Delhi,M,pink,De
4,12,B,Mumbai,M,black,Mu
5,11,A,Delhi,M,green,De
6,17,A,Mumbai,F,red,Mu


In [129]:
data.set_index('section', inplace=True)

In [130]:
data

Unnamed: 0_level_0,age,city,gender,favourite_color,new
section,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
A,10,Gurgaon,M,red,Gu
B,22,Delhi,F,blue,De
C,13,Mumbai,F,yellow,Mu
B,21,Delhi,M,pink,De
B,12,Mumbai,M,black,Mu
A,11,Delhi,M,green,De
A,17,Mumbai,F,red,Mu


#Pandas Part-1
## DataFrames - Part 3

## Multi-Index and Index Hierarchy

Let us go over how to work with a Multi-Index. First we'll create a quick example of what a Multi-Indexed DataFrame looks like:

Q: What is Multiple Indexing?
A: Multiple indexing (hierarchical indexing) uses more than one level of labels on an axis. It is useful for data analysis and manipulation, especially when working with higher-dimensional data, because it lets us store and manipulate data with an arbitrary number of dimensions in lower-dimensional data structures like Series and DataFrame.

- Interview Q&A

In [132]:
import numpy as np
import pandas as pd

In [133]:
x = [1, 2, 3]
y = [4, 5, 6]
zipped = zip(x, y)
list(zipped)

[(1, 4), (2, 5), (3, 6)]

In [148]:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)

In [149]:
hier_index

MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)],
           )

In [150]:
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df

Unnamed: 0,Unnamed: 1,A,B
G1,1,0.302665,1.693723
G1,2,-1.706086,-1.159119
G1,3,-0.134841,0.390528
G2,1,0.166905,0.184502
G2,2,0.807706,0.07296
G2,3,0.638787,0.329646


Now let's show how to index this! For an index hierarchy on the rows we use df.loc[]; if the hierarchy were on the columns axis, you would use normal bracket notation, df[]. Calling one level of the index returns the sub-DataFrame:

In [153]:
df.loc['G1']

Unnamed: 0,A,B
1,0.302665,1.693723
2,-1.706086,-1.159119
3,-0.134841,0.390528


In [154]:
df.loc['G1'].loc[1]

A    0.302665
B    1.693723
Name: 1, dtype: float64

In [155]:
df.index.names

FrozenList([None, None])

In [156]:
df.index.names = ['Groups','Num']

In [157]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
Groups,Num,Unnamed: 2_level_1,Unnamed: 3_level_1
G1,1,0.302665,1.693723
G1,2,-1.706086,-1.159119
G1,3,-0.134841,0.390528
G2,1,0.166905,0.184502
G2,2,0.807706,0.07296
G2,3,0.638787,0.329646


In [158]:
df.loc['G2']

Unnamed: 0_level_0,A,B
Num,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.166905,0.184502
2,0.807706,0.07296
3,0.638787,0.329646


In [160]:
df.loc['G2'].loc[2]['B']

0.07295967531703869

In [161]:
df.loc['G1']

Unnamed: 0_level_0,A,B
Num,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.302665,1.693723
2,-1.706086,-1.159119
3,-0.134841,0.390528


In [163]:
# cross section method
df.xs('G1') # cross section

Unnamed: 0_level_0,A,B
Num,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.302665,1.693723
2,-1.706086,-1.159119
3,-0.134841,0.390528


In [164]:
df.xs(1, level='Num') # return the rows where the 'Num' level equals 1

Unnamed: 0_level_0,A,B
Groups,Unnamed: 1_level_1,Unnamed: 2_level_1
G1,0.302665,1.693723
G2,0.166905,0.184502
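Chained lookups such as `df.loc['G2'].loc[2]['B']` can also be written as one `.loc` call by passing the row key as a tuple. The sketch below builds a comparable two-level index with `pd.MultiIndex.from_product` and uses placeholder integer data instead of the random values above.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                   names=['Groups', 'Num'])
df = pd.DataFrame({'A': list(range(6)), 'B': list(range(10, 16))}, index=index)

# One tuple selects across both index levels; the second argument is the column.
cell = df.loc[('G2', 2), 'B']

# reset_index turns both index levels back into ordinary columns.
flat = df.reset_index()

print(cell)
print(list(flat.columns))
```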


In [165]:
##Let's Practice
import pandas as pd
people_dict={"first":["Richard", "Robert", "Jason"],
             "last" :["Stone", "Deepdive", "Seaborn"],
             "email": ["richardstone@email.com", "robertdeepdive@email.com", 
                        "jasonseaborn@email.com"]}
df=pd.DataFrame(people_dict)
df

Unnamed: 0,first,last,email
0,Richard,Stone,richardstone@email.com
1,Robert,Deepdive,robertdeepdive@email.com
2,Jason,Seaborn,jasonseaborn@email.com


In [167]:
df.iloc[1,2] # the first number is the row position, the second is the column position

'robertdeepdive@email.com'

In [168]:
##Let's Practice
outside = ['A1','A1','A1','A2','A2','A2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
np.random.seed(101)
data = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['X','Y'])
data

Unnamed: 0,Unnamed: 1,X,Y
A1,1,2.70685,0.628133
A1,2,0.907969,0.503826
A1,3,0.651118,-0.319318
A2,1,-0.848077,0.605965
A2,2,-2.018168,0.740122
A2,3,0.528813,-0.589001


In [169]:
data.loc['A1']

Unnamed: 0,X,Y
1,2.70685,0.628133
2,0.907969,0.503826
3,0.651118,-0.319318


# Check Yourself-2


In [170]:
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'age' :     [ 10, 22, 13, 21, 12, 11, 17],
    'section' : [ 'A', 'B', 'C', 'B', 'B', 'A', 'A'],
    'city' :    [ 'Gurgaon', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
    'gender' :  [ 'M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'favourite_color' : [ 'red', 'blue', 'yellow', 'pink', 'black', 'green', 'red']
})

data

Unnamed: 0,age,section,city,gender,favourite_color
0,10,A,Gurgaon,M,red
1,22,B,Delhi,F,blue
2,13,C,Mumbai,F,yellow
3,21,B,Delhi,M,pink
4,12,B,Mumbai,M,black
5,11,A,Delhi,M,green
6,17,A,Mumbai,F,red


In [172]:
data["city"]

0    Gurgaon
1      Delhi
2     Mumbai
3      Delhi
4     Mumbai
5      Delhi
6     Mumbai
Name: city, dtype: object

In [173]:
data.drop(6, axis=0, inplace=True)

In [174]:
data

Unnamed: 0,age,section,city,gender,favourite_color
0,10,A,Gurgaon,M,red
1,22,B,Delhi,F,blue
2,13,C,Mumbai,F,yellow
3,21,B,Delhi,M,pink
4,12,B,Mumbai,M,black
5,11,A,Delhi,M,green


In [175]:
data.iloc[1]

age                   22
section                B
city               Delhi
gender                 F
favourite_color     blue
Name: 1, dtype: object

In [176]:
data.iloc[[3,4], [1,2]]

Unnamed: 0,section,city
3,B,Delhi
4,B,Mumbai


In [177]:
data.iloc[3:5, 1:3]

Unnamed: 0,section,city
3,B,Delhi
4,B,Mumbai


## Check Yourself-3

In [178]:
data = pd.DataFrame({
    'age' :     [ 10, 22, 13, 21, 12, 11, 17],
    'section' : [ 'A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'country' : [ 'USA', 'GBR', 'TUR', 'NLD', 'BEL', 'DEU', 'FRA'],
    'gender' :  [ 'M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'favourite_color' : [ 'red', 'blue', 'yellow', 'pink', 'black', 'green', 'red']
})
data.set_index('section', inplace=True)
data

Unnamed: 0_level_0,age,country,gender,favourite_color
section,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,10,USA,M,red
B,22,GBR,F,blue
C,13,TUR,F,yellow
D,21,NLD,M,pink
E,12,BEL,M,black
F,11,DEU,M,green
G,17,FRA,F,red


In [183]:
data[data['age']>15]

Unnamed: 0_level_0,age,country,gender,favourite_color
section,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
B,22,GBR,F,blue
D,21,NLD,M,pink
G,17,FRA,F,red


In [184]:
data[data['age']>15]['country']

section
B    GBR
D    NLD
G    FRA
Name: country, dtype: object

In [185]:
data[(data['age']>15) & (data['favourite_color'] == 'red')]

Unnamed: 0_level_0,age,country,gender,favourite_color
section,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
G,17,FRA,F,red


In [188]:
data[(data['age']>20) | (data['favourite_color'] == 'red')]

Unnamed: 0_level_0,age,country,gender,favourite_color
section,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,10,USA,M,red
B,22,GBR,F,blue
D,21,NLD,M,pink
G,17,FRA,F,red


In [189]:
outside = ['A1','A1','A1','A2','A2','A2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
np.random.seed(101)
data = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['X','Y'])
data

Unnamed: 0,Unnamed: 1,X,Y
A1,1,2.70685,0.628133
A1,2,0.907969,0.503826
A1,3,0.651118,-0.319318
A2,1,-0.848077,0.605965
A2,2,-2.018168,0.740122
A2,3,0.528813,-0.589001


In [190]:
data['Y']['A2']

1    0.605965
2    0.740122
3   -0.589001
Name: Y, dtype: float64

In [191]:
data.loc['A1','X']

1    2.706850
2    0.907969
3    0.651118
Name: X, dtype: float64

In [192]:
data['X']['A1']

1    2.706850
2    0.907969
3    0.651118
Name: X, dtype: float64

In [193]:
data.loc['A2'].loc[1, 'Y']

0.6059653494949336

# Missing Data

Let's show a few convenient methods to deal with Missing Data in pandas:

In [194]:
import pandas as pd
import numpy as np
import seaborn as sns

In [195]:
d = {'A':[1,2,np.nan], 'B':[5,np.nan,np.nan], 'C':[1,2,3]}

In [196]:
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})

In [197]:
df

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
2,,,3


In [199]:
df.dropna(axis=1)

Unnamed: 0,C
0,1
1,2
2,3


In [200]:
df.dropna()

Unnamed: 0,A,B,C
0,1.0,5.0,1


In [201]:
df.dropna(thresh=2) # keeps only the rows that have at least 2 non-NA values

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
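To see why `thresh=2` keeps rows 0 and 1 but drops row 2, it can help to count the non-missing values per row first. This sketch rebuilds the same small DataFrame so that it runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Number of non-missing values in each row.
non_missing = df.notna().sum(axis=1)

# thresh=2 keeps the rows with at least 2 non-missing values.
kept = df.dropna(thresh=2)

print(non_missing.tolist())
print(kept.index.tolist())
```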


In [203]:
df.fillna(value='FILL VALUE')

Unnamed: 0,A,B,C
0,1,5,1
1,2,FILL VALUE,2
2,FILL VALUE,FILL VALUE,3


In [205]:
df['A'].fillna(value=df['A'].mean())

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
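The same idea extends to the whole DataFrame: passing `df.mean()` to `fillna` fills each column with its own mean. The sketch below assumes all columns are numeric, as they are in this example; whether the mean is an appropriate fill value depends on the data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Count the missing values per column before filling.
missing_before = df.isna().sum()

# df.mean() is a Series of column means; fillna matches it to columns by name.
filled = df.fillna(df.mean())

print(missing_before.to_dict())
print(filled)
```

As with `drop`, `fillna` returns a new DataFrame here and leaves `df` unchanged.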

*** DA with Python-S3- 27/08/20 - Joseph (Pandas Series/Data Frames-1)Page***

## Creating a Pandas Series ##

In [53]:
import numpy as np
import pandas as pd

### Creating a Pandas Series with Basic Format

In [54]:
pd.Series([10,88,3,4,5])

0    10
1    88
2     3
3     4
4     5
dtype: int64

In [219]:
ser = pd.Series([10,88,3,4,5])

In [220]:
ser

0    10
1    88
2     3
3     4
4     5
dtype: int64

In [221]:
# Basic Attributes of Series
# attributes are accessed without parentheses; let's look at the attributes of a pandas Series

In [222]:
type(ser)

pandas.core.series.Series

In [223]:
ser.dtype

dtype('int64')

In [224]:
ser.size

5

In [225]:
ser.ndim

1

In [226]:
ser.values

array([10, 88,  3,  4,  5], dtype=int64)

In [227]:
for i in ser.values: 
    print(i)

10
88
3
4
5


In [228]:
[i for i in ser.values] # the same loop written as a list comprehension

[10, 88, 3, 4, 5]

In [229]:
ser.head(3) # shows the first three elements of the Series

0    10
1    88
2     3
dtype: int64

In [230]:
ser.tail(3) # shows the last three elements of the Series

2    3
3    4
4    5
dtype: int64

In [231]:
string = "clarusway"
pd.Series([i for i in string])

0    c
1    l
2    a
3    r
4    u
5    s
6    w
7    a
8    y
dtype: object

In [232]:
# Creating Pandas Series by Using a list, numpy array or dictionary

In [233]:
label = ["a","b","c"]
my_lis= [10,20,30]
arr = np.array([10,20,30])
d = {"a":10, "b":20, "c":30}

In [234]:
pd.Series(data=my_lis)

0    10
1    20
2    30
dtype: int64

In [235]:
# Using NumPy Arrays

In [236]:
arr

array([10, 20, 30])

In [237]:
pd.Series(arr)

0    10
1    20
2    30
dtype: int32

In [238]:
pd.Series(arr,label)

a    10
b    20
c    30
dtype: int32

In [239]:
# Using Dictionary

In [240]:
d   # here a, b, c are the keys
    # and 10, 20, 30 are the values

{'a': 10, 'b': 20, 'c': 30}

In [241]:
pd.Series(d) # the keys become the index and the values become the data

a    10
b    20
c    30
dtype: int64
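
As a compact recap of the three construction routes above, the sketch below builds the same Series from a list, a NumPy array, and a dictionary. The names `from_list`, `from_array`, `from_dict` and `picked` are illustrative only. The last line shows what happens when a dictionary is combined with an explicit `index`: only the requested labels are kept, and a label missing from the dictionary becomes NaN.

```python
import numpy as np
import pandas as pd

labels = ["a", "b", "c"]
values = [10, 20, 30]
source = {"a": 10, "b": 20, "c": 30}

from_list = pd.Series(values, index=labels)
from_array = pd.Series(np.array(values), index=labels)
from_dict = pd.Series(source)            # keys -> index, values -> data

# With a dict AND an index, pandas looks each label up in the dict.
# "z" is not a key, so it becomes NaN and the dtype switches to float.
picked = pd.Series(source, index=["a", "c", "z"])

print(from_dict)
print(picked)
```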

# Data in a Series

In [242]:
pd.Series(data=label)

0    a
1    b
2    c
dtype: object

In [243]:
pd.Series([sum, print, len]) # a Series can even hold Python objects such as functions

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

In [244]:
mix_data = [1, "cat", True]

In [245]:
pd.Series(mix_data)

0       1
1     cat
2    True
dtype: object

In [246]:
# https://numpy.org/devdocs/user/quickstart.html

# Indexing Pandas Series

In [247]:
serr1 = pd.Series([1,2,3,4], index=["USA","Germany","USSR","Japan"])

In [248]:
serr1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [249]:
serr2 = pd.Series([1,2,5,4], index=["USA","Germany","Italy","Japan"])

In [250]:
serr2

USA        1
Germany    2
Italy      5
Japan      4
dtype: int64

In [251]:
serr1["USA"]

1

In [252]:
serr3 = pd.Series(data = label)

In [253]:
serr3 

0    a
1    b
2    c
dtype: object

In [254]:
serr3[0]

'a'

In [255]:
serr1+serr2

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
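
The NaN values above come from index alignment: `+` matches the two Series label by label, and a label that exists on only one side has nothing to be added to. The sketch below reproduces that result and then shows one possible way to avoid the NaNs, using `Series.add` with `fill_value=0`. Treating a missing country as 0 is an assumption made for illustration, not something the original notebook does.

```python
import pandas as pd

serr1 = pd.Series([1, 2, 3, 4], index=["USA", "Germany", "USSR", "Japan"])
serr2 = pd.Series([1, 2, 5, 4], index=["USA", "Germany", "Italy", "Japan"])

# Plain + aligns on labels; one-sided labels (Italy, USSR) become NaN,
# and the result is float because NaN is a float value.
total = serr1 + serr2

# add() with fill_value=0 substitutes 0 for the missing side instead.
total_filled = serr1.add(serr2, fill_value=0)

print(total)
print(total_filled)
```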

In [256]:
# Indexing Examples

In [257]:
a = np.array([1,2,33,444,75])

In [258]:
a

array([  1,   2,  33, 444,  75])

In [259]:
panser = pd.Series(a)

In [260]:
panser

0      1
1      2
2     33
3    444
4     75
dtype: int32

In [261]:
panser[0]

1

In [262]:
panser[0:3]

0     1
1     2
2    33
dtype: int32

In [263]:
serr1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [264]:
serr1["USA":"USSR"] # a label slice includes both endpoints, USA through USSR

USA        1
Germany    2
USSR       3
dtype: int64

In [265]:
# Pandas_Series[index] | pandas_Series[[indices,indices]]

In [266]:
panser = pd.Series([121,200,150,99], index = ["ali", "veli", "gul", "nur"])

In [267]:
panser

ali     121
veli    200
gul     150
nur      99
dtype: int64

In [268]:
panser["ali"]

121

In [269]:
panser.iloc[0] # positional access; plain panser[0] on a label index is deprecated in recent pandas versions

121

In [270]:
panser[[ "veli", "nur" ]]

veli    200
nur      99
dtype: int64

In [271]:
panser[0:3]

ali     121
veli    200
gul     150
dtype: int64

In [272]:
panser["ali":"nur"]

ali     121
veli    200
gul     150
nur      99
dtype: int64
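
The two slices above behave differently at the end point: an integer slice stops before the end position, while a label slice includes the end label. A minimal sketch of the same contrast with the explicit `.iloc` and `.loc` accessors, which make the intent visible in the code:

```python
import pandas as pd

panser = pd.Series([121, 200, 150, 99], index=["ali", "veli", "gul", "nur"])

by_position = panser.iloc[0:3]      # positions 0, 1, 2; position 3 is excluded
by_label = panser.loc["ali":"gul"]  # labels ali through gul; gul is included
first = panser.iloc[0]              # first element, whatever its label is

print(by_position)
print(by_label)
print(first)
```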

In [273]:
# Several Selecting Attributes

In [274]:
panser.index

Index(['ali', 'veli', 'gul', 'nur'], dtype='object')

In [275]:
panser.values

array([121, 200, 150,  99], dtype=int64)

In [276]:
panser.items  # without parentheses this returns the bound method itself, not the items

<bound method Series.items of ali     121
veli    200
gul     150
nur      99
dtype: int64>

In [277]:
panser.items()

<zip at 0x1d935654b00>

In [278]:
list(panser.items()) # items() returns a lazy zip object; list() turns it into a list of (index, value) pairs

[('ali', 121), ('veli', 200), ('gul', 150), ('nur', 99)]

In [279]:
for index, value in panser.items():
    print(index, "-", value)

ali - 121
veli - 200
gul - 150
nur - 99


In [280]:
"mehmet" in panser

False

In [281]:
"ali" in panser

True

In [282]:
"gulnur" in panser

False

In [283]:
"gul" in panser

True

In [284]:
99 in panser.values

True

In [285]:
500 in panser.values

False
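
The membership tests above rely on one rule: `in` applied to a Series checks the index labels, not the data. To test the data, go through `.values` or use `isin`. A short sketch (the variable names are illustrative):

```python
import pandas as pd

panser = pd.Series([121, 200, 150, 99], index=["ali", "veli", "gul", "nur"])

label_hit = "ali" in panser        # True: "ali" is an index label
value_as_label = 121 in panser     # False: 121 is a value, not a label
value_hit = 121 in panser.values   # True: checks the underlying array

# isin() tests every element against a collection and returns a boolean Series.
mask = panser.isin([99, 150])

print(label_hit, value_as_label, value_hit)
print(mask)
```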

In [286]:
panser["veli"]

200

In [287]:
panser["veli"] = 571 # assignment by label: select a single index label and assign it a new value

In [288]:
panser

ali     121
veli    571
gul     150
nur      99
dtype: int64

In [289]:
panser > 130

ali     False
veli     True
gul      True
nur     False
dtype: bool

In [290]:
panser[panser > 130]

veli    571
gul     150
dtype: int64
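
The filter above works in two steps: the comparison produces a boolean Series (a mask), and indexing with that mask keeps only the `True` rows. The sketch below spells those steps out and adds two related operations, mask negation with `~` and `Series.where`, as optional extras that the original notebook does not cover.

```python
import pandas as pd

panser = pd.Series([121, 571, 150, 99], index=["ali", "veli", "gul", "nur"])

mask = panser > 130        # boolean Series, one True/False per element
selected = panser[mask]    # keeps veli and gul
rest = panser[~mask]       # ~ flips the mask: keeps ali and nur

# where() keeps values where the condition holds and replaces the others.
capped = panser.where(panser <= 200, 200)

print(selected)
print(rest)
print(capped)
```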

In [291]:
## Pandas DataFrames

In [292]:
# Creating a DataFrame using lists of data and columns

In [293]:
datam = [1,2,39,67,90]

In [294]:
datam

[1, 2, 39, 67, 90]

In [295]:
pd.DataFrame(datam, columns = ["column_name"])

Unnamed: 0,column_name
0,1
1,2
2,39
3,67
4,90


In [296]:
# Creating a DataFrame using a NumPy array

In [297]:
m = np.arange(1,10).reshape((3,3))
m

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [299]:
pd.DataFrame(m,columns = ["var1","var2","var3"])

Unnamed: 0,var1,var2,var3
0,1,2,3
1,4,5,6
2,7,8,9


In [300]:
pd.DataFrame(data = m,columns = ["var1","var2","var3"])

Unnamed: 0,var1,var2,var3
0,1,2,3
1,4,5,6
2,7,8,9


In [301]:
df = pd.DataFrame(data = m,columns = ["var1","var2","var3"])
df

Unnamed: 0,var1,var2,var3
0,1,2,3
1,4,5,6
2,7,8,9


In [302]:
df.head(1)

Unnamed: 0,var1,var2,var3
0,1,2,3


In [303]:
df.head(3)

Unnamed: 0,var1,var2,var3
0,1,2,3
1,4,5,6
2,7,8,9


In [304]:
df.columns 

Index(['var1', 'var2', 'var3'], dtype='object')

In [305]:
for i in df.columns:
    print(i)

var1
var2
var3


In [306]:
df.columns = ["new1", "new2", "new3"]

In [307]:
df

Unnamed: 0,new1,new2,new3
0,1,2,3
1,4,5,6
2,7,8,9


In [308]:
type(df)

pandas.core.frame.DataFrame

In [309]:
df.shape

(3, 3)

In [310]:
df.ndim

2

In [311]:
df.size # 3 * 3 = 9 elements

9

In [312]:
df.values # returns the values of the DataFrame as a NumPy array

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

# creating a DataFrame using a dict

In [313]:
s1 = np.random.randint(10,size = 5)
s2 = np.random.randint(10,size = 5)
s3 = np.random.randint(10,size = 5)

In [314]:
s1

array([6, 5, 6, 9, 2])

In [315]:
s2

array([2, 1, 3, 3, 3])

In [316]:
s3

array([4, 5, 9, 5, 8])

In [317]:
myDict = {"var1":s1, "var2":s2, "var3" :s3}
myDict

{'var1': array([6, 5, 6, 9, 2]),
 'var2': array([2, 1, 3, 3, 3]),
 'var3': array([4, 5, 9, 5, 8])}

In [318]:
df1 = pd.DataFrame(myDict)
df1

Unnamed: 0,var1,var2,var3
0,6,2,4
1,5,1,5
2,6,3,9
3,9,3,5
4,2,3,8
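
When a DataFrame is built from a dictionary, each key becomes a column name and each array becomes that column's data, so all arrays must have the same length. The sketch below repeats the construction with a seeded generator so that reruns give the same numbers. The seed value 42 and the name `df_demo` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded, so the numbers are reproducible

# One array of 5 random integers in [0, 10) per column name.
columns = {name: rng.integers(0, 10, size=5) for name in ["var1", "var2", "var3"]}
df_demo = pd.DataFrame(columns)
print(df_demo)

# Arrays of different lengths cannot be lined up into rows.
try:
    pd.DataFrame({"var1": [1, 2], "var2": [1, 2, 3]})
    length_error = False
except ValueError:
    length_error = True
print(length_error)
```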


In [319]:
pwd

'C:\\Users\\Mustafa\\Desktop\\MyWorkSpace\\Github_Data_Science\\03_DataAnalysis_with_Py'

In [320]:
# ornekcsv.csv

In [2]:
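# pd must already be imported in this kernel (import pandas as pd); the NameError below comes from running the cell in a fresh session
df3 = pd.read_csv("ornekcsv.csv", delimiter = ";")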

NameError: name 'pd' is not defined

In [322]:
df3.head()

NameError: name 'df3' is not defined

In [323]:
df1

Unnamed: 0,var1,var2,var3
0,6,2,4
1,5,1,5
2,6,3,9
3,9,3,5
4,2,3,8


In [324]:
df1[1:3]

Unnamed: 0,var1,var2,var3
1,5,1,5
2,6,3,9


In [325]:
df1.index

RangeIndex(start=0, stop=5, step=1)

In [326]:
[i for i in df1.index]

[0, 1, 2, 3, 4]

In [327]:
df1.index = ["a","b","c","d","e"]

In [328]:
df1

Unnamed: 0,var1,var2,var3
a,6,2,4
b,5,1,5
c,6,3,9
d,9,3,5
e,2,3,8


In [329]:
df1["b":"d"] # slicing returns an object of the same type (here a DataFrame); a label slice includes both endpoints

Unnamed: 0,var1,var2,var3
b,5,1,5
c,6,3,9
d,9,3,5


In [330]:
"var2" in df1 # is there a column named var2 in the DataFrame?

True

In [331]:
len("var2")

4

In [332]:
"Joseph" in df1

False

In [333]:
## Now, let's examine indexing, selection and slicing methods, and several attributes again using a different DataFrame

In [334]:
from numpy.random import randn
np.random.seed(101)

In [335]:
df4 = pd.DataFrame(randn(5,4), index="A B C D E".split(), columns = "W X Y Z".split())

In [336]:
df4

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [337]:
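# positional arguments must come before keyword arguments, so this call raises a SyntaxError; pass randn(5,4) first or as data=randn(5,4)
df4 = pd.DataFrame(index="A B C D E".split(), randn(5,4), columns = "W X Y Z".split())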

SyntaxError: positional argument follows keyword argument (<ipython-input-337-e6a4fade62aa>, line 1)

In [338]:
df4

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [339]:
df4["W"]

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [340]:
type(df4["W"])

pandas.core.series.Series

In [341]:
df4["W"].values

array([ 2.70684984,  0.65111795, -2.01816824,  0.18869531,  0.19079432])

In [342]:
df4[["W"]]

Unnamed: 0,W
A,2.70685
B,0.651118
C,-2.018168
D,0.188695
E,0.190794


In [343]:
type(df4[["W"]])

pandas.core.frame.DataFrame
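
The difference between the last few cells comes down to the brackets: a single label returns one column as a Series, while a list of labels (double brackets) returns a DataFrame, even when the list has only one entry. A minimal sketch on a small made-up frame:

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the values are just 0..11.
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=list("ABC"), columns=list("WXYZ"))

col_series = df["W"]        # single label -> Series (1-dimensional)
col_frame = df[["W"]]       # list with one label -> DataFrame (2-dimensional)
two_cols = df[["W", "Z"]]   # list with several labels -> DataFrame

print(type(col_series).__name__, col_series.shape)
print(type(col_frame).__name__, col_frame.shape)
print(two_cols)
```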

In [344]:
istedigimstunlar = ["W", "Z"] # "istedigimstunlar" means "the columns I want"

In [345]:
df4[istedigimstunlar]

Unnamed: 0,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509


In [346]:
WZ_df = df4[istedigimstunlar]
WZ_df

Unnamed: 0,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509


In [347]:
df4[["W", "Z"]]

Unnamed: 0,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509


In [348]:
df4["W"]
# df4.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [349]:
# df4["W"]
df4.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [350]:
df4["A":"C"] # slicing

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001


In [351]:
df4["C":"C"]

Unnamed: 0,W,X,Y,Z
C,-2.018168,0.740122,0.528813,-0.589001


In [352]:
df4["new"] = df4["W"]+df4["Y"] # creating a new column

In [353]:
df4["new2"] = df4["W"]*df4["Y"]

In [354]:
df4

Unnamed: 0,W,X,Y,Z,new,new2
A,2.70685,0.628133,0.907969,0.503826,3.614819,2.457737
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959,-0.552198
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355,-1.067235
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542,-0.176097
E,0.190794,1.978757,2.605967,0.683509,2.796762,0.497204


In [355]:
## https://www.w3resource.com/python-exercises/pandas/index.php

In [356]:
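# the list bracket wraps the keyword arguments, hence the SyntaxError; the intended call is df4.drop("new", axis = 1, inplace = True)
df4.drop(["new", axis = 1, inplace = True])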

SyntaxError: invalid syntax (<ipython-input-356-9586f21d8aba>, line 1)

## Creating a new column

# Removing Columns & Rows
# Removing Columns

In [357]:
df4

Unnamed: 0,W,X,Y,Z,new,new2
A,2.70685,0.628133,0.907969,0.503826,3.614819,2.457737
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959,-0.552198
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355,-1.067235
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542,-0.176097
E,0.190794,1.978757,2.605967,0.683509,2.796762,0.497204


In [358]:
df4.drop("E", axis=0)

Unnamed: 0,W,X,Y,Z,new,new2
A,2.70685,0.628133,0.907969,0.503826,3.614819,2.457737
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959,-0.552198
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355,-1.067235
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542,-0.176097


In [359]:
df.loc['viper'] # loc looks rows up by label, not by position; 'viper' is not a label in df's index, so this raises a KeyError

KeyError: 'viper'

In [360]:
m = np.random.randint(1,30, size = (10,3))
df4 = pd.DataFrame(m, columns = ["var1","var2","var3"])
df4

Unnamed: 0,var1,var2,var3
0,8,11,21
1,25,8,7
2,15,10,21
3,19,24,8
4,8,16,13
5,1,21,11
6,13,18,25
7,12,20,16
8,2,25,5
9,29,20,19


In [361]:
df4.loc[1] # loc selects rows by label; this returns the row whose index label is 1

var1    25
var2     8
var3     7
Name: 1, dtype: int32

In [362]:
df4.loc[1:4] # with labels, the slice returns the rows labelled 1, 2, 3 and 4 (the end label is included)

Unnamed: 0,var1,var2,var3
1,25,8,7
2,15,10,21
3,19,24,8
4,8,16,13


In [363]:
df4.iloc[1:4] # iloc works with integer positions, so the end position 4 is excluded

Unnamed: 0,var1,var2,var3
1,25,8,7
2,15,10,21
3,19,24,8


In [364]:
df4.index = "a b c d e f g h i j".split()

In [365]:
df4

Unnamed: 0,var1,var2,var3
a,8,11,21
b,25,8,7
c,15,10,21
d,19,24,8
e,8,16,13
f,1,21,11
g,13,18,25
h,12,20,16
i,2,25,5
j,29,20,19


In [366]:
# the index now holds string labels, so an integer label slice with loc raises a TypeError
df4.loc[1:4]

TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [1] of <class 'int'>

In [367]:
df4.iloc[1:4]

Unnamed: 0,var1,var2,var3
b,25,8,7
c,15,10,21
d,19,24,8
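
The cells above show the core rule: `loc` selects by label and includes the end of a slice, while `iloc` selects by integer position and excludes it. The sketch below puts the two side by side on a made-up frame with string labels, selecting the same rows and column both ways.

```python
import numpy as np
import pandas as pd

# Illustrative frame: 10 rows labelled a..j, values 0..29.
df = pd.DataFrame(np.arange(30).reshape(10, 3),
                  columns=["var1", "var2", "var3"],
                  index=list("abcdefghij"))

by_label = df.loc["b":"e", "var2"]   # rows b, c, d, e; end label included
by_pos = df.iloc[1:5, 1]             # positions 1..4; position 5 excluded
three_rows = df.iloc[1:4]            # positions 1, 2, 3 only

print(by_label)
print(by_pos)
print(three_rows)
```

Both selections return the same Series, which is why the end position has to be one higher for `iloc` than the position of the end label used with `loc`.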


In [368]:
df4.loc["b":"e", "var2"]

b     8
c    10
d    24
e    16
Name: var2, dtype: int32

In [369]:
df4.loc["b":"e"]

Unnamed: 0,var1,var2,var3
b,25,8,7
c,15,10,21
d,19,24,8
e,8,16,13


In [370]:
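# the comma sits outside the square brackets, so this builds a tuple (DataFrame, list) instead of selecting a column
df4.loc["b":"e"], ["var2"]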

(   var1  var2  var3
 b    25     8     7
 c    15    10    21
 d    19    24     8
 e     8    16    13,
 ['var2'])

In [371]:
df4.loc["b":"e"], [["var2"]]

(   var1  var2  var3
 b    25     8     7
 c    15    10    21
 d    19    24     8
 e     8    16    13,
 [['var2']])

In [372]:
df4.loc["b":"e", ["var2"]]

Unnamed: 0,var2
b,8
c,10
d,24
e,16


In [373]:
df4.iloc[1:5,1]

b     8
c    10
d    24
e    16
Name: var2, dtype: int32

In [374]:
df4

Unnamed: 0,var1,var2,var3
a,8,11,21
b,25,8,7
c,15,10,21
d,19,24,8
e,8,16,13
f,1,21,11
g,13,18,25
h,12,20,16
i,2,25,5
j,29,20,19


In [375]:
df4.iloc[1:5] [["var2"]]

Unnamed: 0,var2
b,8
c,10
d,24
e,16


In [376]:
df4.iloc[1:5]

Unnamed: 0,var1,var2,var3
b,25,8,7
c,15,10,21
d,19,24,8
e,8,16,13


In [377]:
df4

Unnamed: 0,var1,var2,var3
a,8,11,21
b,25,8,7
c,15,10,21
d,19,24,8
e,8,16,13
f,1,21,11
g,13,18,25
h,12,20,16
i,2,25,5
j,29,20,19


In [378]:
df2 = pd.DataFrame(randn(5,4), index="A B C D E".split(), columns = "W X Y Z".split())

In [379]:
df2

Unnamed: 0,W,X,Y,Z
A,-0.497104,-0.75407,-0.943406,0.484752
B,-0.116773,1.901755,0.238127,1.996652
C,-0.993263,0.1968,-1.136645,0.000366
D,1.025984,-0.156598,-0.031579,0.649826
E,2.154846,-0.610259,-0.755325,-0.346419


In [380]:
df2.iloc[2]

W   -0.993263
X    0.196800
Y   -1.136645
Z    0.000366
Name: C, dtype: float64

In [381]:
df2.loc[["B"]]

Unnamed: 0,W,X,Y,Z
B,-0.116773,1.901755,0.238127,1.996652


In [382]:
df2.iloc[:,2]

A   -0.943406
B    0.238127
C   -1.136645
D   -0.031579
E   -0.755325
Name: Y, dtype: float64

In [383]:
df2.iloc[:,[2]]

Unnamed: 0,Y
A,-0.943406
B,0.238127
C,-1.136645
D,-0.031579
E,-0.755325


In [384]:
df2.Y

A   -0.943406
B    0.238127
C   -1.136645
D   -0.031579
E   -0.755325
Name: Y, dtype: float64

In [385]:
df2[["Y"]]

Unnamed: 0,Y
A,-0.943406
B,0.238127
C,-1.136645
D,-0.031579
E,-0.755325


In [386]:
df2["Y"]

A   -0.943406
B    0.238127
C   -1.136645
D   -0.031579
E   -0.755325
Name: Y, dtype: float64

In [387]:
## Selecting a subset of rows and columns

In [388]:
# .loc[[row]]

In [389]:
df2

Unnamed: 0,W,X,Y,Z
A,-0.497104,-0.75407,-0.943406,0.484752
B,-0.116773,1.901755,0.238127,1.996652
C,-0.993263,0.1968,-1.136645,0.000366
D,1.025984,-0.156598,-0.031579,0.649826
E,2.154846,-0.610259,-0.755325,-0.346419


In [390]:
df2.loc["B","Y"]

0.23812695876901832

In [391]:
df2.loc[["B"],["Y"]]

Unnamed: 0,Y
B,0.238127


In [392]:
df2.loc[["A", "B"],["W", "Y"]]

Unnamed: 0,W,Y
A,-0.497104,-0.943406
B,-0.116773,0.238127


In [393]:
df2

Unnamed: 0,W,X,Y,Z
A,-0.497104,-0.75407,-0.943406,0.484752
B,-0.116773,1.901755,0.238127,1.996652
C,-0.993263,0.1968,-1.136645,0.000366
D,1.025984,-0.156598,-0.031579,0.649826
E,2.154846,-0.610259,-0.755325,-0.346419


In [394]:
df2.iloc[[0,1],[0,2]] # row positions first, then column positions

Unnamed: 0,W,Y
A,-0.497104,-0.943406
B,-0.116773,0.238127


In [395]:
df2 > 0

Unnamed: 0,W,X,Y,Z
A,False,False,False,True
B,False,True,True,True
C,False,True,False,True
D,True,False,False,True
E,True,False,False,False


In [396]:
df2[df2 > 0]

Unnamed: 0,W,X,Y,Z
A,,,,0.484752
B,,1.901755,0.238127,1.996652
C,,0.1968,,0.000366
D,1.025984,,,0.649826
E,2.154846,,,


In [397]:
df2["W"] > 0

A    False
B    False
C    False
D     True
E     True
Name: W, dtype: bool

In [398]:
df2[df2["W"] > 0]

Unnamed: 0,W,X,Y,Z
D,1.025984,-0.156598,-0.031579,0.649826
E,2.154846,-0.610259,-0.755325,-0.346419


In [399]:
df2[[True,True,False,True,True]]

Unnamed: 0,W,X,Y,Z
A,-0.497104,-0.75407,-0.943406,0.484752
B,-0.116773,1.901755,0.238127,1.996652
D,1.025984,-0.156598,-0.031579,0.649826
E,2.154846,-0.610259,-0.755325,-0.346419


In [400]:
df2[df2["W"] > 0] [["Y"]]

Unnamed: 0,Y
D,-0.031579
E,-0.755325


In [401]:
df2

Unnamed: 0,W,X,Y,Z
A,-0.497104,-0.75407,-0.943406,0.484752
B,-0.116773,1.901755,0.238127,1.996652
C,-0.993263,0.1968,-1.136645,0.000366
D,1.025984,-0.156598,-0.031579,0.649826
E,2.154846,-0.610259,-0.755325,-0.346419


In [402]:
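# this line is missing a boolean operator (& or |) between the two conditions and the closing bracket, e.g. df2[(df2["W"] > 0) & (df2["Y"] > 1)]
df2[(df2["W"] >0) (df2["Y"]>1)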

SyntaxError: unexpected EOF while parsing (<ipython-input-402-33b8472f5449>, line 1)

In [403]:
df2.loc[(df2.X > 0), ["X","Z"]]

Unnamed: 0,X,Z
B,1.901755,1.996652
C,0.1968,0.000366


In [404]:
# each comparison needs its own parentheses, because | binds more tightly than > and <
df2.loc[(df2.W > 2) | (df2.W < -2), ["W", "Y"]]

Unnamed: 0,W,Y
E,2.154846,-0.755325
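
Combining conditions on a DataFrame follows two rules: use `&` (and) or `|` (or) instead of the Python keywords, and wrap every comparison in its own parentheses, because `&` and `|` bind more tightly than `>` and `<`. The sketch below applies both rules to a small frame with made-up values, chosen so the results are easy to check by eye.

```python
import pandas as pd

# Illustrative values, not the random numbers used in the notebook.
df = pd.DataFrame({"W": [-0.5, -0.1, -1.0, 1.0, 2.2],
                   "Y": [-0.9, 0.2, -1.1, -0.03, -0.8]},
                  index=list("ABCDE"))

# AND: both conditions must hold in the same row (only E qualifies).
both = df[(df["W"] > 0) & (df["Y"] < -0.5)]

# OR, combined with a column selection through loc (C and E qualify).
either = df.loc[(df["W"] > 2) | (df["W"] < -0.9), ["W", "Y"]]

print(both)
print(either)
```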


In [405]:
df4

Unnamed: 0,var1,var2,var3
a,8,11,21
b,25,8,7
c,15,10,21
d,19,24,8
e,8,16,13
f,1,21,11
g,13,18,25
h,12,20,16
i,2,25,5
j,29,20,19


In [406]:
df2

Unnamed: 0,W,X,Y,Z
A,-0.497104,-0.75407,-0.943406,0.484752
B,-0.116773,1.901755,0.238127,1.996652
C,-0.993263,0.1968,-1.136645,0.000366
D,1.025984,-0.156598,-0.031579,0.649826
E,2.154846,-0.610259,-0.755325,-0.346419


In [407]:
df2.reset_index()

Unnamed: 0,index,W,X,Y,Z
0,A,-0.497104,-0.75407,-0.943406,0.484752
1,B,-0.116773,1.901755,0.238127,1.996652
2,C,-0.993263,0.1968,-1.136645,0.000366
3,D,1.025984,-0.156598,-0.031579,0.649826
4,E,2.154846,-0.610259,-0.755325,-0.346419


In [408]:
df2.reset_index(drop = True)

Unnamed: 0,W,X,Y,Z
0,-0.497104,-0.75407,-0.943406,0.484752
1,-0.116773,1.901755,0.238127,1.996652
2,-0.993263,0.1968,-1.136645,0.000366
3,1.025984,-0.156598,-0.031579,0.649826
4,2.154846,-0.610259,-0.755325,-0.346419


In [410]:
newindx = " CA NY WY OR CO".split()
newindx

['CA', 'NY', 'WY', 'OR', 'CO']

In [415]:
df2["States"] = newindx
df

Unnamed: 0,new1,new2,new3
0,1,2,3
1,4,5,6
2,7,8,9


In [None]:
# df2.reset_index(drop = True)

## Multi-Index and Index Hierarchy

In [417]:
# Index Levels
outside = ['M1', 'M1', 'M1', 'M2', 'M2', 'M2']
inside = [1, 2, 3, 1, 2, 3]
multi_index = list(zip(outside, inside))
multi_index

[('M1', 1), ('M1', 2), ('M1', 3), ('M2', 1), ('M2', 2), ('M2', 3)]

In [418]:
hier_index = pd.MultiIndex.from_tuples(multi_index)

In [419]:
df5 = pd.DataFrame(np.random.randn(6, 2), index = hier_index, columns = ["A", "B"])

In [420]:
df5

Unnamed: 0,Unnamed: 1,A,B
M1,1,0.147027,-0.479448
M1,2,0.558769,1.02481
M1,3,-0.925874,1.862864
M2,1,-1.133817,0.610478
M2,2,0.38603,2.084019
M2,3,-0.376519,0.230336


In [421]:
df5.loc["M1"]

Unnamed: 0,A,B
1,0.147027,-0.479448
2,0.558769,1.02481
3,-0.925874,1.862864


In [423]:
df5.loc["M1"].loc[1] # returns the row labelled 1 within group M1

A    0.147027
B   -0.479448
Name: 1, dtype: float64

In [424]:
df5.loc["M1"].loc[[1]] # with double square brackets the result comes back as a DataFrame

Unnamed: 0,A,B
1,0.147027,-0.479448


In [425]:
df5.index

MultiIndex([('M1', 1),
            ('M1', 2),
            ('M1', 3),
            ('M2', 1),
            ('M2', 2),
            ('M2', 3)],
           )

In [426]:
df5.index.names

FrozenList([None, None])

In [427]:
df5.index.names = ["Group", "Num"] # give names to the index levels

In [428]:
df5

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1
M1,1,0.147027,-0.479448
M1,2,0.558769,1.02481
M1,3,-0.925874,1.862864
M2,1,-1.133817,0.610478
M2,2,0.38603,2.084019
M2,3,-0.376519,0.230336


In [None]:
## Let's take a quick

In [429]:
d = {'num_legs': [4, 4, 2, 2],
     'num_wings': [0, 0, 2, 2],
     'class': ['mammal', 'mammal', 'mammal', 'bird'],
     'animal': ['cat', 'dog', 'bat', 'penguin'],
     'locomotion': ['walks', 'walks', 'flies', 'walks']}
df = pd.DataFrame(data=d)
df = df.set_index(['class', 'animal', 'locomotion'])
df

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,num_legs,num_wings
class,animal,locomotion,Unnamed: 3_level_1,Unnamed: 4_level_1
mammal,cat,walks,4,0
mammal,dog,walks,4,0
mammal,bat,flies,2,2
bird,penguin,walks,2,2


In [430]:
df5.xs("M1")

Unnamed: 0_level_0,A,B
Num,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.147027,-0.479448
2,0.558769,1.02481
3,-0.925874,1.862864


In [431]:
df5.xs(("M1",1)) # select the first row, (M1, 1)

A    0.147027
B   -0.479448
Name: (M1, 1), dtype: float64

In [432]:
df5.xs(("M1",1), level =[0,1])

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1
M1,1,0.147027,-0.479448


In [433]:
df5

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1
M1,1,0.147027,-0.479448
M1,2,0.558769,1.02481
M1,3,-0.925874,1.862864
M2,1,-1.133817,0.610478
M2,2,0.38603,2.084019
M2,3,-0.376519,0.230336


In [436]:
df5.xs(1, level = "Num")

Unnamed: 0_level_0,A,B
Group,Unnamed: 1_level_1,Unnamed: 2_level_1
M1,0.147027,-0.479448
M2,-1.133817,0.610478


In [438]:
df5.xs(1, level = 1)

Unnamed: 0_level_0,A,B
Group,Unnamed: 1_level_1,Unnamed: 2_level_1
M1,0.147027,-0.479448
M2,-1.133817,0.610478
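
To summarize the multi-index selections above: `loc` works from the outer level inward, while `xs` can also reach an inner level directly through its `level` argument. The sketch below rebuilds the same two-level index with fixed values (0 to 11 instead of random numbers, so the results can be checked by eye) and shows the three most common selections.

```python
import numpy as np
import pandas as pd

outside = ["M1", "M1", "M1", "M2", "M2", "M2"]
inside = [1, 2, 3, 1, 2, 3]
hier_index = pd.MultiIndex.from_tuples(list(zip(outside, inside)),
                                       names=["Group", "Num"])

# Fixed values replace the random numbers used in the notebook.
df = pd.DataFrame(np.arange(12).reshape(6, 2),
                  index=hier_index, columns=["A", "B"])

outer = df.loc["M1"]              # every row of group M1, indexed by Num
one_row = df.loc[("M1", 1)]       # a single row, returned as a Series
inner = df.xs(1, level="Num")     # Num == 1 from every group

print(outer)
print(one_row)
print(inner)
```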
