# What is Pandas?

Pandas is one of the most widely used Python libraries for data analysis.

It provides data structures designed for working with labelled and tabular data. The two most commonly used are `Series` and `DataFrame`.

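**A Series is:**
- one-dimensional, i.e. it consists of a single column
- homogeneous, i.e. all values share one dtype (mixed values are stored with the generic `object` dtype)
- size-immutable in the sense that methods which add or remove values, such as `drop`, return a new Series object
- value-mutable, i.e. existing values can be changed in place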

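**A DataFrame is:**
- two-dimensional, i.e. it consists of rows and columns
- heterogeneous, i.e. each column can have its own dtype
- size-mutable, i.e. rows and columns can be added or removed
- value-mutable, i.e. existing values can be changed in place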
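
The properties listed above can be checked directly. The following is a minimal sketch with made-up values (the names `s`, `df` and `s_without_b` are only for illustration):

```python
import pandas as pd

# A Series is one-dimensional: a single labelled column of values.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame is two-dimensional: each column may hold a different type.
df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [90, 80]})

print(s.ndim, df.ndim)  # number of dimensions: 1 2
print(df.dtypes)        # one dtype per column

# Values are mutable in both structures.
s["a"] = 99
df.loc[0, "score"] = 95

# drop() returns a new object and leaves the original untouched.
s_without_b = s.drop("b")
print(len(s), len(s_without_b))  # 3 2

# A DataFrame can grow in place by assigning a new column.
df["passed"] = df["score"] >= 85
print(df.shape)  # (2, 3)
```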

To install Pandas, run `pip install pandas`.

In [2]:
import pandas as pd # import pandas under the conventional alias pd

In [2]:
pd.__version__ # show the installed pandas version

'1.3.0'

In [30]:
from pandas import Series, DataFrame

# Series Data Structure

In [3]:
obj=pd.Series([1,"John",3.5,"Hey"])
obj

0       1
1    John
2     3.5
3     Hey
dtype: object

In [5]:
print(obj.values)
print(obj.index)

[1 'John' 3.5 'Hey']
RangeIndex(start=0, stop=4, step=1)


In [10]:
# the index must have the same length as the data
obj2=pd.Series([1,"John",3.5,"Hey"],index=["a","b","c","d"])
obj2

a       1
b    John
c     3.5
d     Hey
dtype: object

In [11]:
obj3=pd.Series([1,"John",3.5,"Hey"],index=["a","b","c"])
obj3

ValueError: Length of values (4) does not match length of index (3)

In [12]:
obj4=pd.Series([1,"John",3.5],index=["a","b","c","d"])
obj4

ValueError: Length of values (3) does not match length of index (4)
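
When the labels come from another source, the length mismatch shown above can be caught instead of stopping the program. This is only a sketch; `values`, `labels` and `mismatch_detected` are hypothetical names:

```python
import pandas as pd

values = [1, "John", 3.5, "Hey"]
labels = ["a", "b", "c"]  # one label too few

mismatch_detected = False
try:
    pd.Series(values, index=labels)
except ValueError as err:
    # pandas reports both lengths in the error message
    mismatch_detected = True
    print("could not build Series:", err)

# With one label per value the construction succeeds.
ok = pd.Series(values, index=labels + ["d"])
print(ok.index.tolist())
```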

In [15]:
print(obj2.index)
print(obj2.values)

Index(['a', 'b', 'c', 'd'], dtype='object')
[1 'John' 3.5 'Hey']


#### Creating Series from dictionary

In [18]:
score={"Jane":90, "Bill":80,"Elon":85,"Tom":75,"Tim":95}
names=pd.Series(score) 
names

Jane    90
Bill    80
Elon    85
Tom     75
Tim     95
dtype: int64

In [19]:
names["Tim"] 

95

In [21]:
names['Timmy']

KeyError: 'Timmy'

In [22]:
"Tom" in names

True

In [23]:
"Can" in names

False

In [26]:
names.name = 'qiymetler'   # Azerbaijani for "grades"
names.index.name = 'adlar' # Azerbaijani for "names"
names

adlar
Jane    90
Bill    80
Elon    85
Tom     75
Tim     95
Name: qiymetler, dtype: int64

In [40]:
names.to_excel('names_series.xlsx')

In [7]:
# The order given in index is kept; labels missing from the dict are filled with NaN
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a','f'])
print(s)

b    1.0
c    2.0
d    NaN
a    0.0
f    NaN
dtype: float64


In [46]:
s.to_excel('s.xlsx')

In [3]:
# Dict keys that are not listed in index are left out
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s_subset = pd.Series(data,index=['b','c'])
print(s_subset)

b    1.0
c    2.0
dtype: float64


###### Accessing Data from Series

In [8]:
# get a value by position (equivalent to s.iloc[0])
print(s[0])

1.0


In [9]:
# get multiple values
print(s[:3])
print(s[-3:])

b    1.0
c    2.0
d    NaN
dtype: float64
d    NaN
a    0.0
f    NaN
dtype: float64


###### Retrieve Data Using Label (Index)

In [10]:
# get a value
print(s['a'])

0.0


In [11]:
# get multiple values
print(s[['a', 'b']])

a    0.0
b    1.0
dtype: float64


In [12]:
# getting non-existing key (index)
print(s['g'])

KeyError: 'g'

# Working with DataFrame Data Structure

In [2]:
# column names are Azerbaijani: sheher = city, il = year, ehali = population
data = {'sheher': ['Baku', 'Sumgayit', 'Guba', 'Aghdam', 'Yevlakh', 'Barda'],
 'il': [2000, 2001, 2002, 2001, 2002, 2003],
 'ehali': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame = pd.DataFrame(data)
frame

Unnamed: 0,sheher,il,ehali
0,Baku,2000,1.5
1,Sumgayit,2001,1.7
2,Guba,2002,3.6
3,Aghdam,2001,2.4
4,Yevlakh,2002,2.9
5,Barda,2003,3.2


In [4]:
frame.head(2)

Unnamed: 0,sheher,il,ehali
0,Baku,2000,1.5
1,Sumgayit,2001,1.7


In [3]:
frame.tail()

Unnamed: 0,sheher,il,ehali
1,Sumgayit,2001,1.7
2,Guba,2002,3.6
3,Aghdam,2001,2.4
4,Yevlakh,2002,2.9
5,Barda,2003,3.2


#### Creating a DataFrame from an existing dictionary

In [3]:
data= {"name":["Bill","Tom","Tim","John","Alex","Vanessa","Kate"],      
       "score":[90,80,85,75,95,60,65],      
       "sport":["Wrestling","Football","Skiing","Swimming","Tennis", "Karete","Surfing"],      
      "sex":["M","M","M","M","F","F","F"]
      }

In [4]:
df=pd.DataFrame(data)

In [5]:
df

Unnamed: 0,name,score,sport,sex
0,Bill,90,Wrestling,M
1,Tom,80,Football,M
2,Tim,85,Skiing,M
3,John,75,Swimming,M
4,Alex,95,Tennis,F
5,Vanessa,60,Karete,F
6,Kate,65,Surfing,F


In [6]:
# defining column order
df=pd.DataFrame(data,columns=["name","sport","sex","score"])
df

Unnamed: 0,name,sport,sex,score
0,Bill,Wrestling,M,90
1,Tom,Football,M,80
2,Tim,Skiing,M,85
3,John,Swimming,M,75
4,Alex,Tennis,F,95
5,Vanessa,Karete,F,60
6,Kate,Surfing,F,65


In [7]:
df.head()

Unnamed: 0,name,sport,sex,score
0,Bill,Wrestling,M,90
1,Tom,Football,M,80
2,Tim,Skiing,M,85
3,John,Swimming,M,75
4,Alex,Tennis,F,95


In [8]:
df.tail() 

Unnamed: 0,name,sport,sex,score
2,Tim,Skiing,M,85
3,John,Swimming,M,75
4,Alex,Tennis,F,95
5,Vanessa,Karete,F,60
6,Kate,Surfing,F,65


In [9]:
df.tail(3)

Unnamed: 0,name,sport,sex,score
4,Alex,Tennis,F,95
5,Vanessa,Karete,F,60
6,Kate,Surfing,F,65


In [10]:
df.head(2)

Unnamed: 0,name,sport,sex,score
0,Bill,Wrestling,M,90
1,Tom,Football,M,80


In [11]:
# columns that are not keys of data ("gender", "age") are created and filled with NaN
df=pd.DataFrame(data,columns=["name", "sport", "gender", "score", "age"])
df

Unnamed: 0,name,sport,gender,score,age
0,Bill,Wrestling,,90,
1,Tom,Football,,80,
2,Tim,Skiing,,85,
3,John,Swimming,,75,
4,Alex,Tennis,,95,
5,Vanessa,Karete,,60,
6,Kate,Surfing,,65,


In [12]:
df=pd.DataFrame(data,columns=["name", "sport"])
df

Unnamed: 0,name,sport
0,Bill,Wrestling
1,Tom,Football
2,Tim,Skiing
3,John,Swimming
4,Alex,Tennis
5,Vanessa,Karete
6,Kate,Surfing


In [22]:
df=pd.DataFrame(data,columns=["name", "sport", "gender", "score", "age"],
                index=["one","two","three","four","five","six","seven"])
df

Unnamed: 0,name,sport,gender,score,age
one,Bill,Wrestling,,90,
two,Tom,Football,,80,
three,Tim,Skiing,,85,
four,John,Swimming,,75,
five,Alex,Tennis,,95,
six,Vanessa,Karete,,60,
seven,Kate,Surfing,,65,


`frame[column]` works for any column name, but `frame.column` only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.
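
A small sketch of the difference, using a hypothetical frame in which one column name contains a space and another has the same name as the `DataFrame.count` method:

```python
import pandas as pd

# Hypothetical data: "home town" is not a valid identifier,
# and "count" collides with the DataFrame.count method.
df = pd.DataFrame({"sport": ["Tennis", "Skiing"],
                   "home town": ["Baku", "Guba"],
                   "count": [3, 5]})

print(df["sport"].tolist())      # bracket access always works
print(df.sport.tolist())         # attribute access works for a valid identifier

print(df["home town"].tolist())  # only bracket access can reach this column
print(df["count"].tolist())      # the column
print(callable(df.count))        # df.count is the method, not the column: True
```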

In [23]:
df["sport"]

one      Wrestling
two       Football
three       Skiing
four      Swimming
five        Tennis
six         Karete
seven      Surfing
Name: sport, dtype: object

In [24]:
my_columns=["name","sport"]
df[my_columns]

Unnamed: 0,name,sport
one,Bill,Wrestling
two,Tom,Football
three,Tim,Skiing
four,John,Swimming
five,Alex,Tennis
six,Vanessa,Karete
seven,Kate,Surfing


In [25]:
df.sport

one      Wrestling
two       Football
three       Skiing
four      Swimming
five        Tennis
six         Karete
seven      Surfing
Name: sport, dtype: object

In [28]:
df["age"]=18
df

Unnamed: 0,name,sport,gender,score,age
one,Bill,Wrestling,,90,18
two,Tom,Football,,80,18
three,Tim,Skiing,,85,18
four,John,Swimming,,75,18
five,Alex,Tennis,,95,18
six,Vanessa,Karete,,60,18
seven,Kate,Surfing,,65,18


In [29]:
df=pd.DataFrame(data,columns=["name", "sport", "gender", "score", "age"], 
                index=["one","two","three","four","five","six","seven"])
values=[18,19,20,18,17,17,18]
df["age"]=values
df

Unnamed: 0,name,sport,gender,score,age
one,Bill,Wrestling,,90,18
two,Tom,Football,,80,19
three,Tim,Skiing,,85,20
four,John,Swimming,,75,18
five,Alex,Tennis,,95,17
six,Vanessa,Karete,,60,17
seven,Kate,Surfing,,65,18


In [30]:
df["pass"] = df.score>=70
df

Unnamed: 0,name,sport,gender,score,age,pass
one,Bill,Wrestling,,90,18,True
two,Tom,Football,,80,19,True
three,Tim,Skiing,,85,20,True
four,John,Swimming,,75,18,True
five,Alex,Tennis,,95,17,True
six,Vanessa,Karete,,60,17,False
seven,Kate,Surfing,,65,18,False


In [31]:
del df["pass"]
df

Unnamed: 0,name,sport,gender,score,age
one,Bill,Wrestling,,90,18
two,Tom,Football,,80,19
three,Tim,Skiing,,85,20
four,John,Swimming,,75,18
five,Alex,Tennis,,95,17
six,Vanessa,Karete,,60,17
seven,Kate,Surfing,,65,18


In [32]:
scores={"Math":{"A":85,"B":90,"C":95}, "Physics":{"A":90,"B":80,"C":75}}

In [33]:
scores_df=pd.DataFrame(scores)
scores_df

Unnamed: 0,Math,Physics
A,85,90
B,90,80
C,95,75


In [34]:
scores_df.T

Unnamed: 0,A,B,C
Math,85,90,95
Physics,90,80,75


In [35]:
scores_df.index.name="name"
scores_df.columns.name="lesson"

In [36]:
scores_df

lesson  Math  Physics
name                 
A         85       90
B         90       80
C         95       75


In [37]:
scores_df.values

array([[85, 90],
       [90, 80],
       [95, 75]], dtype=int64)

In [40]:
scores_index=scores_df.index
scores_index

Index(['A', 'B', 'C'], dtype='object', name='name')

In [50]:
scores_index[2]="Jack"
scores_index

TypeError: Index does not support mutable operations

In [52]:
scores_df.rename({'B':'new_B'}, axis=0, inplace=True)

In [53]:
scores_df

lesson  Math  Physics
name                 
A         85       90
new_B     90       80
C         95       75
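
An `Index` cannot be changed element by element, so labels are changed either with `rename` or by assigning a complete new set of labels. A minimal sketch (the names `renamed`, `relabelled` and `assignment_failed` are illustrative):

```python
import pandas as pd

scores_df = pd.DataFrame({"Math": {"A": 85, "B": 90, "C": 95},
                          "Physics": {"A": 90, "B": 80, "C": 75}})

# Item assignment on an Index raises TypeError.
try:
    scores_df.index[2] = "Jack"
    assignment_failed = False
except TypeError:
    assignment_failed = True

# rename returns a relabelled copy and leaves scores_df unchanged.
renamed = scores_df.rename(index={"C": "Jack"})

# Assigning a full list of labels replaces the whole Index object.
relabelled = scores_df.copy()
relabelled.index = ["A", "B", "Jack"]

print(assignment_failed)
print(renamed.index.tolist())
print(scores_df.index.tolist())
```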


# Indexing & Selection & Filtering in Series

In [55]:
import numpy as np

In [56]:
obj=pd.Series(np.arange(5),
              index=["a","b","c","d","e"])

In [57]:
obj

a    0
b    1
c    2
d    3
e    4
dtype: int32

In [58]:
obj["c"]

2

In [59]:
obj[2]

2

In [60]:
obj[0:3]

a    0
b    1
c    2
dtype: int32

In [61]:
obj[["a","c"]]

a    0
c    2
dtype: int32

In [62]:
obj[[0,2]]

a    0
c    2
dtype: int32

In [63]:
obj[obj<2]

a    0
b    1
dtype: int32

In [64]:
obj["a":"c"]

a    0
b    1
c    2
dtype: int32

In [65]:
obj["b":"c"]=5
obj

a    0
b    5
c    5
d    3
e    4
dtype: int32

## DataFrame Indexing

In [129]:
data=pd.DataFrame(
    np.arange(16).reshape(4,4),
    columns=["one","two","three","four"])
data

Unnamed: 0,one,two,three,four
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


In [135]:
data[:3]

Unnamed: 0,one,two,three,four
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


In [128]:
data[2]

KeyError: 2

In [146]:
data['one'][2] # column, row

8

In [139]:
data=pd.DataFrame(
    np.arange(16).reshape(4,4),
    index=['London', 'Paris', 'Berlin', 'Istanbul'],
    columns=["one","two","three","four"])
data

Unnamed: 0,one,two,three,four
London,0,1,2,3
Paris,4,5,6,7
Berlin,8,9,10,11
Istanbul,12,13,14,15


In [140]:
data["two"]

London       1
Paris        5
Berlin       9
Istanbul    13
Name: two, dtype: int32

In [141]:
data[["one","two"]]

Unnamed: 0,one,two
London,0,1
Paris,4,5
Berlin,8,9
Istanbul,12,13


In [142]:
data[:3]

Unnamed: 0,one,two,three,four
London,0,1,2,3
Paris,4,5,6,7
Berlin,8,9,10,11


In [143]:
data[data["four"]>5]

Unnamed: 0,one,two,three,four
Paris,4,5,6,7
Berlin,8,9,10,11
Istanbul,12,13,14,15


In [144]:
data[data["four"]>5][['one']]

Unnamed: 0,one
Paris,4
Berlin,8
Istanbul,12


In [145]:
data[data["four"]>5]['one']

Paris        4
Berlin       8
Istanbul    12
Name: one, dtype: int32

In [125]:
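data_zeroed = data.copy() # work on a copy so that data keeps its original values
data_zeroed[data_zeroed<5]=0
data_zeroed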

Unnamed: 0,one,two,three,four
London,0,0,0,0
Paris,0,5,6,7
Berlin,8,9,10,11
Istanbul,12,13,14,15


## Selecting with iloc and loc

### Position-based indexing

In [147]:
data.iloc[1]

one      4
two      5
three    6
four     7
Name: Paris, dtype: int32

In [148]:
data.iloc[0, 1] # row position 0, column position 1

1

In [149]:
data.iloc[1,[1,2,3]]

two      5
three    6
four     7
Name: Paris, dtype: int32

In [150]:
data.iloc[[1,3],[1,2,3]]

Unnamed: 0,two,three,four
Paris,5,6,7
Istanbul,13,14,15


In [151]:
data.iloc[:, :2]

Unnamed: 0,one,two
London,0,1
Paris,4,5
Berlin,8,9
Istanbul,12,13


In [152]:
data.iloc[2,'one']

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

In [153]:
data.iloc[1]['one']

4

### Label-based indexing

In [154]:
data.loc["Paris",["one","two"]]

one    4
two    5
Name: Paris, dtype: int32

In [155]:
data.loc[:"Paris","four"]

London    3
Paris     7
Name: four, dtype: int32

In [156]:
data.loc[data['one']==0,'one']

London    0
Name: one, dtype: int32

In [157]:
data.loc['London']['one']

0
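
One difference between the two accessors is easy to miss: an `iloc` slice excludes its end position, while a `loc` slice includes its end label. A short sketch on the same city-indexed frame (`by_position` and `by_label` are illustrative names):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=["London", "Paris", "Berlin", "Istanbul"],
                    columns=["one", "two", "three", "four"])

# iloc slices by position: the end position is excluded.
by_position = data.iloc[0:2, 0:2]

# loc slices by label: the end label is included.
by_label = data.loc["London":"Berlin", "one":"two"]

print(by_position.shape)  # (2, 2)
print(by_label.shape)     # (3, 2)

# A single value can be read in one call instead of chaining brackets.
print(data.loc["Paris", "two"], data.iloc[1, 1])  # 5 5
```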

### Conditional selection

In [159]:
data = {'sheher': ['Baku', 'Sumgayit', 'Guba', 'Aghdam', 'Yevlakh', 'Barda'],
 'il': [2000, 2001, 2002, 2001, 2002, 2003],
 'ehali': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame = pd.DataFrame(data)
frame

Unnamed: 0,sheher,il,ehali
0,Baku,2000,1.5
1,Sumgayit,2001,1.7
2,Guba,2002,3.6
3,Aghdam,2001,2.4
4,Yevlakh,2002,2.9
5,Barda,2003,3.2


In [160]:
frame_2 = frame[frame['il']>2001]
frame_2  

Unnamed: 0,sheher,il,ehali
2,Guba,2002,3.6
4,Yevlakh,2002,2.9
5,Barda,2003,3.2


In [161]:
frame.sheher == 'Guba'

0    False
1    False
2     True
3    False
4    False
5    False
Name: sheher, dtype: bool

In [163]:
frame.loc[frame.sheher == 'Barda']

Unnamed: 0,sheher,il,ehali
5,Barda,2003,3.2


In [165]:
frame.loc[frame.sheher == 'Barda',['ehali']]

Unnamed: 0,ehali
5,3.2


In [169]:
# OR (|): keep rows where at least one condition holds
frame[(frame['il']>=2002) | (frame['ehali']>3.4)]

Unnamed: 0,sheher,il,ehali
2,Guba,2002,3.6
4,Yevlakh,2002,2.9
5,Barda,2003,3.2


In [166]:
# AND (&): keep rows where both conditions hold
frame.loc[(frame.ehali> 3.0) & (frame.il >= 2002)]

Unnamed: 0,sheher,il,ehali
2,Guba,2002,3.6
5,Barda,2003,3.2


In [170]:
frame.loc[((frame.ehali> 3.0) | (frame.il >= 2002)),['sheher']]

Unnamed: 0,sheher
2,Guba
4,Yevlakh
5,Barda


In [172]:
frame.loc[((frame.ehali> 3.0) | (frame.il >= 2002))][['sheher']]

Unnamed: 0,sheher
2,Guba
4,Yevlakh
5,Barda


# Some Useful Methods in Pandas

In [178]:
s=pd.Series([1,2,3,4],
            index=["a","b","c","d"])
s

a    1
b    2
c    3
d    4
dtype: int64

In [174]:
s["a"]

1

In [179]:
s2=s.reindex(["b","d","a","c","e","f"])
s2

b    2.0
d    4.0
a    1.0
c    3.0
e    NaN
f    NaN
dtype: float64

In [180]:
s3=pd.Series(["blue","yellow","purple"],
             index=[0,2,4])
s3

0      blue
2    yellow
4    purple
dtype: object

In [181]:
s3.reindex(range(6))

0      blue
1       NaN
2    yellow
3       NaN
4    purple
5       NaN
dtype: object

In [182]:
s3.reindex(range(6), method="ffill")

0      blue
1      blue
2    yellow
3    yellow
4    purple
5    purple
dtype: object

In [183]:
s3.reindex(range(6), method="bfill")

0      blue
1    yellow
2    yellow
3    purple
4    purple
5       NaN
dtype: object

In [184]:
df=pd.DataFrame(np.arange(9).reshape(3,3),
                index=["a","c","d"],
                columns=["Tim","Tom","Kate"])
df

Unnamed: 0,Tim,Tom,Kate
a,0,1,2
c,3,4,5
d,6,7,8


In [185]:
df2=df.reindex(["d","c","b","a","a"])
df2

Unnamed: 0,Tim,Tom,Kate
d,6.0,7.0,8.0
c,3.0,4.0,5.0
b,,,
a,0.0,1.0,2.0
a,0.0,1.0,2.0


In [186]:
names=["Kate","Tim","Tom"]
df.reindex(columns=names)

Unnamed: 0,Kate,Tim,Tom
a,2,0,1
c,5,3,4
d,8,6,7


In [188]:
s=pd.Series(np.arange(5.),
            index=["a","b","c","d","e"])
s

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [189]:
new_s=s.drop("b") #default axis is 0
new_s

a    0.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [190]:
s.drop(["c","d"])

a    0.0
b    1.0
e    4.0
dtype: float64

In [191]:
data=pd.DataFrame(np.arange(16).reshape(4,4),
                  index=["Kate","Tim",
                         "Tom","Alex"],
                  columns=list("ABCD"))
data

Unnamed: 0,A,B,C,D
Kate,0,1,2,3
Tim,4,5,6,7
Tom,8,9,10,11
Alex,12,13,14,15


In [192]:
data.drop(["Kate","Tim"])

Unnamed: 0,A,B,C,D
Tom,8,9,10,11
Alex,12,13,14,15


In [193]:
data.drop("A",axis=1)

Unnamed: 0,B,C,D
Kate,1,2,3
Tim,5,6,7
Tom,9,10,11
Alex,13,14,15


In [194]:
data.drop("Kate",axis=0)

Unnamed: 0,A,B,C,D
Tim,4,5,6,7
Tom,8,9,10,11
Alex,12,13,14,15


In [195]:
data

Unnamed: 0,A,B,C,D
Kate,0,1,2,3
Tim,4,5,6,7
Tom,8,9,10,11
Alex,12,13,14,15


# Arithmetic Operations in Pandas

In [201]:
s1=pd.Series(np.arange(4),
             index=["a","c","d","e"], name='df1')
s2=pd.Series(np.arange(5),
             index=["a","c","e","f","g"], name='df2')

In [202]:
print(s1,'\n')
print(s2)

a    0
c    1
d    2
e    3
Name: df1, dtype: int32 

a    0
c    1
e    2
f    3
g    4
Name: df2, dtype: int32


In [107]:
s1+s2

a    0.0
c    2.0
d    NaN
e    5.0
f    NaN
g    NaN
dtype: float64

In [203]:
df1=pd.DataFrame(
    np.arange(6).reshape(2,3),
    columns=list("ABC"),
    index=["Tim","Tom"])

df2=pd.DataFrame(
    np.arange(9).reshape(3,3),
    columns=list("ACD"),
    index=["Tim","Kate","Tom"])

In [204]:
print(df1)
print(df2)

     A  B  C
Tim  0  1  2
Tom  3  4  5
      A  C  D
Tim   0  1  2
Kate  3  4  5
Tom   6  7  8


In [205]:
df1+df2

Unnamed: 0,A,B,C,D
Kate,,,,
Tim,0.0,,3.0,
Tom,9.0,,12.0,


In [209]:
df1.add(df2,fill_value=0) # a value missing from only one frame is treated as 0; a cell missing from both stays NaN

Unnamed: 0,A,B,C,D
Kate,3.0,,4.0,5.0
Tim,0.0,1.0,3.0,2.0
Tom,9.0,4.0,12.0,8.0
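
To make the effect of `fill_value` concrete, the sketch below rebuilds the two frames and counts the missing cells with and without it (`plain` and `filled` are illustrative names):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(2, 3),
                   columns=list("ABC"), index=["Tim", "Tom"])
df2 = pd.DataFrame(np.arange(9).reshape(3, 3),
                   columns=list("ACD"), index=["Tim", "Kate", "Tom"])

plain = df1 + df2                    # a cell missing on either side gives NaN
filled = df1.add(df2, fill_value=0)  # a cell missing on one side counts as 0

print(plain.isna().sum().sum())   # 8
print(filled.isna().sum().sum())  # 1

# ("Kate", "B") exists in neither frame, so it stays NaN even with fill_value.
print(pd.isna(filled.loc["Kate", "B"]))  # True
```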


In [210]:
df1.add(df2,fill_value=10)

Unnamed: 0,A,B,C,D
Kate,13.0,,14.0,15.0
Tim,0.0,11.0,3.0,12.0
Tom,9.0,14.0,12.0,18.0


In [211]:
1/df1

Unnamed: 0,A,B,C
Tim,inf,1.0,0.5
Tom,0.333333,0.25,0.2


In [212]:
df1*3

Unnamed: 0,A,B,C
Tim,0,3,6
Tom,9,12,15


In [214]:
df1.mul(df2)

Unnamed: 0,A,B,C,D
Kate,,,,
Tim,0.0,,2.0,
Tom,18.0,,35.0,


In [215]:
df1.mul(3)

Unnamed: 0,A,B,C
Tim,0,3,6
Tom,9,12,15


In [218]:
df2

Unnamed: 0,A,C,D
Tim,0,1,2
Kate,3,4,5
Tom,6,7,8


In [217]:
s=df2.iloc[1]
s

A    3
C    4
D    5
Name: Kate, dtype: int32

In [219]:
df2+s # s is matched on the column labels and added to every row

Unnamed: 0,A,C,D
Tim,3,5,7
Kate,6,8,10
Tom,9,11,13


In [223]:
s2=df2["A"]
s2

Tim     0
Kate    3
Tom     6
Name: A, dtype: int32

In [225]:
df2.add(s2,axis="index") # s2 is matched on the row labels and added to every column

Unnamed: 0,A,C,D
Tim,0,1,2
Kate,6,7,8
Tom,12,13,14
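
A common use of this alignment is centering data. The following sketch, with illustrative variable names, shows both directions on a frame like `df2`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  columns=list("ACD"), index=["Tim", "Kate", "Tom"])

row = df.loc["Kate"]  # Series labelled by column names
col = df["A"]         # Series labelled by row names

# Default alignment: Series labels are matched to column names,
# so `row` is added to every row.
by_columns = df.add(row)

# axis="index": Series labels are matched to row names,
# so `col` is added to every column.
by_index = df.add(col, axis="index")

# Centering each column on its mean uses the default alignment,
# while centering each row on its mean needs axis="index".
col_centered = df - df.mean()
row_centered = df.sub(df.mean(axis=1), axis="index")

print(by_columns.loc["Tim"].tolist())  # [3, 5, 7]
print(by_index.loc["Tom"].tolist())    # [12, 13, 14]
```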


In [226]:
df2

Unnamed: 0,A,C,D
Tim,0,1,2
Kate,3,4,5
Tom,6,7,8


# Sorting & Ranking in Pandas

In [227]:
s=pd.Series(range(5),
            index=["e","d","a","b","c"])
s

e    0
d    1
a    2
b    3
c    4
dtype: int64

In [231]:
s.sort_index() # sort by index label; options such as ascending=False are available

a    2
b    3
c    4
d    1
e    0
dtype: int64

In [232]:
df=pd.DataFrame(
    np.arange(12).reshape(3,4),
    index=["two","one","three"],
    columns=["d","a","b","c"])
df

Unnamed: 0,d,a,b,c
two,0,1,2,3
one,4,5,6,7
three,8,9,10,11


In [233]:
df.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,8,9,10,11
two,0,1,2,3


In [234]:
df.sort_index(axis=1)

Unnamed: 0,a,b,c,d
two,1,2,3,0
one,5,6,7,4
three,9,10,11,8


In [235]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
two,0,3,2,1
one,4,7,6,5
three,8,11,10,9


In [236]:
s2=pd.Series([5,np.nan,3,-1,9])
s2

0    5.0
1    NaN
2    3.0
3   -1.0
4    9.0
dtype: float64

In [238]:
s2.sort_values()

3   -1.0
2    3.0
0    5.0
4    9.0
1    NaN
dtype: float64

In [239]:
df2=pd.DataFrame(
    {"a":[5,3,-1,9],"b":[1,-2,0,5]})
df2

Unnamed: 0,a,b
0,5,1
1,3,-2
2,-1,0
3,9,5


In [240]:
df2.sort_values(by="b")

Unnamed: 0,a,b
1,3,-2
2,-1,0
0,5,1
3,9,5


In [241]:
df2.sort_values(by=["b","a"])

Unnamed: 0,a,b
1,3,-2
2,-1,0
0,5,1
3,9,5
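
The cells above cover sorting only. For ranking, pandas provides the `rank` method; the sketch below uses a made-up Series to show how ties are handled by three of its `method` options:

```python
import pandas as pd

s = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Default: tied values share the average of the ranks they occupy.
avg_rank = s.rank()

# method="first": ties are broken by order of appearance.
first_rank = s.rank(method="first")

# Descending ranking; tied values all receive the lowest rank of their group.
desc_rank = s.rank(ascending=False, method="min")

print(avg_rank.tolist())    # [6.5, 1.0, 6.5, 4.5, 3.0, 2.0, 4.5]
print(first_rank.tolist())  # [6.0, 1.0, 7.0, 4.0, 3.0, 2.0, 5.0]
print(desc_rank.tolist())   # [1.0, 7.0, 1.0, 3.0, 5.0, 6.0, 3.0]
```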


# Summarizing & Computing Descriptive Statistics

In [242]:
import pandas as pd
import numpy as np 

In [246]:
df=pd.DataFrame(
    [[2.4,np.nan],[6.3,-5.4],
     [np.nan,np.nan],[0.75,-1.3]],
    index=["a","b","c","d"],
    columns=["one","two"])
df

Unnamed: 0,one,two
a,2.4,
b,6.3,-5.4
c,,
d,0.75,-1.3


In [247]:
df.sum()

one    9.45
two   -6.70
dtype: float64

In [248]:
df.sum(axis=1)

a    2.40
b    0.90
c    0.00
d   -0.55
dtype: float64

In [249]:
df.mean(axis=1)

a    2.400
b    0.450
c      NaN
d   -0.275
dtype: float64

In [250]:
df.mean(axis=1,skipna=False)

a      NaN
b    0.450
c      NaN
d   -0.275
dtype: float64

In [251]:
df.idxmax()

one    b
two    d
dtype: object

In [252]:
df.idxmin()

one    d
two    b
dtype: object

In [253]:
df.cumsum()

Unnamed: 0,one,two
a,2.4,
b,8.7,-5.4
c,,
d,9.45,-6.7


In [254]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.15,-3.35
std,2.85,2.899138
min,0.75,-5.4
25%,1.575,-4.375
50%,2.4,-3.35
75%,4.35,-2.325
max,6.3,-1.3


In [255]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
one,3.0,3.15,2.85,0.75,1.575,2.4,4.35,6.3
two,2.0,-3.35,2.899138,-5.4,-4.375,-3.35,-2.325,-1.3


In [261]:
df.one.describe()

count    3.000
mean     3.150
std      2.850
min      0.750
25%      1.575
50%      2.400
75%      4.350
max      6.300
Name: one, dtype: float64

In [256]:
df.dtypes

one    float64
two    float64
dtype: object

In [259]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   one     3 non-null      float64
 1   two     2 non-null      float64
dtypes: float64(2)
memory usage: 268.0+ bytes


In [262]:
df.nunique()

one    3
two    2
dtype: int64

In [280]:
df.one.nunique()

3

In [278]:
# To see a list of unique values we can use the unique() function:

df['one'].unique()

array([2.4 , 6.3 ,  nan, 0.75])

In [282]:
# To see a list of unique values and how often they occur in the dataset, we can use the value_counts() method:
df['one'].value_counts(dropna=False)

2.40    1
6.30    1
NaN     1
0.75    1
Name: one, dtype: int64
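
How missing values are treated differs between these methods. A small sketch on a hypothetical column with one repeated value and one NaN:

```python
import numpy as np
import pandas as pd

col = pd.Series([2.4, 6.3, np.nan, 0.75, 2.4], name="one")

print(col.nunique())              # NaN is not counted by default: 3
print(col.nunique(dropna=False))  # NaN counted as a value: 4
print(len(col.unique()))          # unique() keeps NaN: 4

# Counts per value, most frequent first, with NaN included.
counts = col.value_counts(dropna=False)
print(counts)

# Relative frequencies among the non-missing values.
shares = col.value_counts(normalize=True)
print(shares)
```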

In [283]:
df

Unnamed: 0,one,two
a,2.4,
b,6.3,-5.4
c,,
d,0.75,-1.3


In [None]:
df.insert(0, column="new", value=3)

In [292]:
df.insert(3, column="new1", value=3)

In [293]:
df

Unnamed: 0,new,one,two,new1
a,3,2.4,,3
b,3,6.3,-5.4,3
c,3,,,3
d,3,0.75,-1.3,3


# Data Reading & Writing in Pandas

In [295]:
df2=pd.read_table("Data/data2.txt")
df2

Unnamed: 0,"Tom,80,M"
0,"Tim,85,M"
1,"Kim,70,M"
2,"Kate,90,F"
3,"Alex,75,F"


In [296]:
df2=pd.read_table("Data/data2.txt", sep=",")
df2

Unnamed: 0,Tom,80,M
0,Tim,85,M
1,Kim,70,M
2,Kate,90,F
3,Alex,75,F


In [299]:
df=pd.read_table("Data/data2.txt",
                 sep=",",
                 header=None)
df

Unnamed: 0,0,1,2
0,Tom,80,M
1,Tim,85,M
2,Kim,70,M
3,Kate,90,F
4,Alex,75,F


In [302]:
df=pd.read_table("Data/data2.txt",
                 sep=",",
                 header=None, 
                 names=["name","score","sex"])
df

Unnamed: 0,name,score,sex
0,Tom,80,M
1,Tim,85,M
2,Kim,70,M
3,Kate,90,F
4,Alex,75,F


In [303]:
df=pd.read_table("Data/data2.txt",
                 sep=",",header=None, 
                 names=["name","score","sex"],
                 index_col="name")
df

name,score,sex
Tom,80,M
Tim,85,M
Kim,70,M
Kate,90,F
Alex,75,F


In [304]:
df2=pd.read_table("Data/data3.txt",
                  sep=",",
                  index_col=["lesson","name"])
df2

lesson,name,one,two
Math,Kim,80,85
Math,Tim,90,70
Math,Tom,70,95
Stat,Kate,65,90
Stat,Alex,85,80
Stat,Sam,55,70


In [308]:
df3=pd.read_table("Data/data4.txt",
                  sep=",")
df3

Unnamed: 0,Unnamed: 1,#hello
name,score,sex
#scores of students,,
Tim,80,M
Kate,85,F
Alex,70,M
Tom,90,M
Kim,75,F


In [309]:
df3=pd.read_table("Data/data4.txt",
                  sep=",", 
                  skiprows=[0,2])
df3

Unnamed: 0,name,score,sex
0,Tim,80,M
1,Kate,85,F
2,Alex,70,M
3,Tom,90,M
4,Kim,75,F


In [310]:
df3=pd.read_table("Data/data4.txt",
                  sep=",", 
                  skiprows=[0,2],
                  usecols=[0,1])
df3

Unnamed: 0,name,score
0,Tim,80
1,Kate,85
2,Alex,70
3,Tom,90
4,Kim,75


In [313]:
df3=pd.read_table("Data/data4.txt",
                  sep=",", 
                  skiprows=[0,2],
                  usecols=[0,1],
                  nrows=3)
df3

Unnamed: 0,name,score
0,Tim,80
1,Kate,85
2,Alex,70


## Writing Data

In [314]:
df=pd.read_csv("Data/data.txt",sep="\t")
df

Unnamed: 0,name,score,sex
0,Tim,80,M
1,Tom,85,M
2,Kim,70,F
3,Sam,90,M
4,Efe,75,M


In [315]:
df.to_csv("Data/new_data.csv") # the index is written as the first column unless index=False is passed

In [316]:
df.to_excel("Data/new_data.xlsx")
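
Whether the index is written affects how the file must be read back. The sketch below uses an in-memory buffer instead of files under `Data/`, so the sample frame and variable names are assumptions made for illustration:

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Tim", "Tom", "Kim"],
                   "score": [80, 85, 70],
                   "sex": ["M", "M", "F"]})

# Without a path, to_csv returns the CSV text.
with_index = df.to_csv()                # index written as an unnamed first column
without_index = df.to_csv(index=False)  # index omitted

print(with_index.splitlines()[0])     # ,name,score,sex
print(without_index.splitlines()[0])  # name,score,sex

# Text written without the index can be read back directly.
restored = pd.read_csv(io.StringIO(without_index))

# Text written with the index needs index_col to restore it as the index.
restored_idx = pd.read_csv(io.StringIO(with_index), index_col=0)

print(restored.columns.tolist())
print(restored_idx.columns.tolist())
```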

# Missing Data in Pandas

In [317]:
s=pd.Series(["Sam",np.nan,"Tim","Kim"])
s

0    Sam
1    NaN
2    Tim
3    Kim
dtype: object

In [318]:
s.isnull()

0    False
1     True
2    False
3    False
dtype: bool

In [319]:
s.notnull()

0     True
1    False
2     True
3     True
dtype: bool

In [322]:
s[3]=None # replace an existing value with None
s[4]=None # assigning to a new label appends an element
print(s)
s.isnull()

0     Sam
1     NaN
2     Tim
3    None
4    None
dtype: object


0    False
1     True
2    False
3     True
4     True
dtype: bool

In [323]:
s.dropna()

0    Sam
2    Tim
dtype: object

In [324]:
from numpy import nan as NA

In [363]:
df=pd.DataFrame([[1,2,3,4],[4,NA,5,6],
                 [NA,NA,NA,7]])
df

Unnamed: 0,0,1,2,3
0,1.0,2.0,3.0,4
1,4.0,,5.0,6
2,,,,7


In [354]:
df.dropna()

Unnamed: 0,0,1,2,3
0,1.0,2.0,3.0,4


In [355]:
df.dropna(how="all")

Unnamed: 0,0,1,2,3
0,1.0,2.0,3.0,4
1,4.0,,5.0,6
2,,,,7


In [356]:
df[1]=NA
df

Unnamed: 0,0,1,2,3
0,1.0,,3.0,4
1,4.0,,5.0,6
2,,,,7


In [357]:
df.dropna(axis=1,how="all")

Unnamed: 0,0,2,3
0,1.0,3.0,4
1,4.0,5.0,6
2,,,7


In [358]:
df

Unnamed: 0,0,1,2,3
0,1.0,,3.0,4
1,4.0,,5.0,6
2,,,,7


In [359]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2,3
0,1.0,,3.0,4
1,4.0,,5.0,6


In [360]:
df.fillna(0)

Unnamed: 0,0,1,2,3
0,1.0,0.0,3.0,4
1,4.0,0.0,5.0,6
2,0.0,0.0,0.0,7


In [367]:
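df=pd.DataFrame([[1,2,3,4],[4,NA,5,6],[NA,NA,NA,NA]]) # rebuild df with a last row that is entirely NaN
df.fillna({0:15,1:25,2:35}) # a separate fill value for each listed column; column 3 is not listed and keeps its NaN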

Unnamed: 0,0,1,2,3
0,1.0,2.0,3.0,4.0
1,4.0,25.0,5.0,6.0
2,15.0,25.0,35.0,


In [368]:
df

Unnamed: 0,0,1,2,3
0,1.0,2.0,3.0,4.0
1,4.0,,5.0,6.0
2,,,,


In [369]:
df.fillna(0,inplace=True)
df

Unnamed: 0,0,1,2,3
0,1.0,2.0,3.0,4.0
1,4.0,0.0,5.0,6.0
2,0.0,0.0,0.0,0.0


In [370]:
df=pd.DataFrame([[1,2,3],[4,NA,5],
                 [NA,NA,NA]])
df

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,,5.0
2,,,


In [373]:
df.fillna(method="ffill")

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,2.0,5.0
2,4.0,2.0,5.0


In [374]:
df.fillna(method="bfill")

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,,5.0
2,,,


In [176]:
df.fillna(method="ffill",limit=1) # limit = maximum number of consecutive NaN values to fill

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,2.0,5.0
2,4.0,,5.0
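
The effect of `limit` is easier to see on a longer gap. This sketch uses the `ffill` and `bfill` methods, which perform the same forward and backward filling, on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()            # propagate the last valid value forward
forward_1 = s.ffill(limit=1)   # fill at most one consecutive NaN
backward_1 = s.bfill(limit=1)  # same idea, filling backward

print(forward.tolist())           # [1.0, 1.0, 1.0, 1.0, 5.0]
print(forward_1.isna().sum())     # 2 values remain missing
print(backward_1.isna().sum())    # 2 values remain missing
```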


In [375]:
data=pd.Series([1,0,NA,5])
data

0    1.0
1    0.0
2    NaN
3    5.0
dtype: float64

In [376]:
data.fillna(data.mean())

0    1.0
1    0.0
2    2.0
3    5.0
dtype: float64

In [377]:
df

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,,5.0
2,,,


In [378]:
df.fillna(df.mean())

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,2.0,5.0
2,2.5,2.0,4.0


In [386]:
data=pd.DataFrame({"a":["one","two"]*3,
                   "b":[1,1,2,3,2,3]})
data

Unnamed: 0,a,b
0,one,1
1,two,1
2,one,2
3,two,3
4,one,2
5,two,3


In [387]:
data.duplicated()

0    False
1    False
2    False
3    False
4     True
5     True
dtype: bool

In [380]:
data.drop_duplicates()

Unnamed: 0,a,b
0,one,1
1,two,1
2,one,2
3,two,3

In [388]:
data["c"]=range(6)
data

Unnamed: 0,a,b,c
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,2,4
5,two,3,5


In [389]:
data.duplicated(["a","b"],keep="last")

0    False
1    False
2     True
3     True
4    False
5    False
dtype: bool
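
The same `subset` and `keep` options are accepted by `drop_duplicates`. A minimal sketch on the frame above (the result names are illustrative):

```python
import pandas as pd

data = pd.DataFrame({"a": ["one", "two"] * 3,
                     "b": [1, 1, 2, 3, 2, 3],
                     "c": range(6)})

# Column "c" makes every full row unique, so nothing is dropped.
all_cols = data.drop_duplicates()

# Compare only "a" and "b"; keep the first or the last occurrence.
keep_first = data.drop_duplicates(subset=["a", "b"])
keep_last = data.drop_duplicates(subset=["a", "b"], keep="last")

print(len(all_cols))              # 6
print(keep_first.index.tolist())  # [0, 1, 2, 3]
print(keep_last.index.tolist())   # [0, 1, 4, 5]
```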

In [397]:
s=pd.Series([80,70,90,60])
s

0    80
1    70
2    90
3    60
dtype: int64

In [398]:
s.replace(70,np.nan)

0    80.0
1     NaN
2    90.0
3    60.0
dtype: float64

In [399]:
s.replace([70,60],[np.nan,0])

0    80.0
1     NaN
2    90.0
3     0.0
dtype: float64

In [400]:
s.replace({90:100,60:0})

0     80
1     70
2    100
3      0
dtype: int64

In [416]:
df=pd.DataFrame(
    np.arange(12).reshape(3,4),
    index=["one","two","three"],
    columns=["tim","tom","kim","sam"])
df

Unnamed: 0,tim,tom,kim,sam
one,0,1,2,3
two,4,5,6,7
three,8,9,10,11


In [420]:
df.rename(index={"one":"ten"},
          columns={"sam":"kate"},
          inplace=True)
df

Unnamed: 0,tim,tom,kim,kate
ten,0,1,2,3
two,4,5,6,7
three,8,9,10,11


In [421]:
df.rename({"kate":'katty'},
          axis=1,
          inplace=True)
df

Unnamed: 0,tim,tom,kim,katty
ten,0,1,2,3
two,4,5,6,7
three,8,9,10,11
