
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.


# Series


A pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).

The axis labels are collectively called the index. A Series is comparable to a single column in an Excel sheet.

Labels need not be unique but must be a hashable type. The object supports both integer and label-based indexing and provides a host of methods for performing operations involving the index.
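
To make the point about non-unique labels concrete, here is a minimal sketch. The Series `dup` and its values are hypothetical and are not part of the notebook's own cells. It uses `.loc` for label-based access and `.iloc` for position-based access.

```python
import pandas as pd

# Hypothetical example data: the label 'x' appears twice.
dup = pd.Series([10, 20, 30], index=['x', 'y', 'x'])

by_label = dup.loc['x']     # label-based: returns every entry labeled 'x'
by_position = dup.iloc[1]   # position-based: the second element

print(by_label)
print(by_position)
```

Because `'x'` is duplicated, `dup.loc['x']` gives back a Series rather than a single value.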

In [4]:
import pandas as pd
import numpy as np

list_1 = ['a', 'b', 'c', 'd']
labels = [1, 2, 3, 4]

ser1 = pd.Series(data=list_1, index=labels)

ser1

1    a
2    b
3    c
4    d
dtype: object

In [6]:
data = np.array(['a', 'b', 'c', 'd', 'e'])

ser = pd.Series(data)

print(ser)

0    a
1    b
2    c
3    d
4    e
dtype: object


In [7]:
dict_1 = {'f_name' : "livia", "l_name" : "macedo", 'age' : 24}
ser_d = pd.Series(dict_1)

print(ser_d)

f_name     livia
l_name    macedo
age           24
dtype: object


In [8]:
ser_d['f_name']

'livia'

In [9]:
ser[0]

'a'

In [11]:
ser_d.dtype

dtype('O')

In [15]:
# math operations on series
arr = np.array([1, 2, 3, 4])

serr = pd.Series(arr)

print("- : \n", serr - serr)

print("+ : \n", serr + serr)

print("* : \n", serr * serr)

print("/ : \n", serr / serr)

- : 
 0    0
1    0
2    0
3    0
dtype: int32
+ : 
 0    2
1    4
2    6
3    8
dtype: int32
* : 
 0     1
1     4
2     9
3    16
dtype: int32
/ : 
 0    1.0
1    1.0
2    1.0
3    1.0
dtype: float64


In [17]:
np.exp(serr)

0     2.718282
1     7.389056
2    20.085537
3    54.598150
dtype: float64

In [24]:
# unlike NumPy arrays, operations between Series are aligned by label
ser_4 = pd.Series({4: 5, 5: 6, 6: 7, 7: 8})
serr + ser_4
# the two Series share no labels, so every entry of the result is NaN

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
6   NaN
7   NaN
dtype: float64
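
If the all-NaN result is not what you want, there are two common ways around label alignment. The sketch below uses hypothetical names (`left`, `right`) that mirror `serr` and `ser_4`. It shows `Series.add` with `fill_value`, and a purely positional sum on the underlying NumPy arrays.

```python
import pandas as pd

left = pd.Series([1, 2, 3, 4])                # labels 0..3
right = pd.Series({4: 5, 5: 6, 6: 7, 7: 8})   # labels 4..7

aligned = left + right                    # no shared labels: every entry is NaN
filled = left.add(right, fill_value=0)    # a label missing on one side counts as 0
positional = left.to_numpy() + right.to_numpy()  # ignores labels entirely

print(filled)
print(positional)
```

Which one is appropriate depends on whether the labels carry meaning. If they do, `fill_value` keeps them; if they do not, the positional sum is simpler.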

In [27]:
ser_5 = pd.Series({0: 5, 1: 6, 2: 7, 3: 8})
serr + ser_5

# this works because the labels of the two Series match, so the values are aligned

0     6
1     8
2    10
3    12
dtype: int64

In [28]:
# it is possible to assign a name to a Series
ser_6 = pd.Series({0: 5, 1: 6, 2: 7, 3: 8}, name="rand_numbers")
ser_6.name

'rand_numbers'

# Dataframes

A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

In practice, a DataFrame is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from lists, a dictionary, a list of dictionaries, and so on.
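
As a small illustration of two of these sources, the sketch below reads CSV text with `pd.read_csv` and builds a second frame from a list of dictionaries. The CSV content is held in an in-memory string (a stand-in for a real file, which is an assumption made to keep the example self-contained), and the names `csv_text`, `people`, and `records` are hypothetical.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file on disk.
csv_text = "name,age\ntom,20\nnick,10\n"
people = pd.read_csv(io.StringIO(csv_text))

# Each dictionary becomes one row; the keys become column names.
records = [{'name': 'krish', 'age': 30}, {'name': 'jack', 'age': 40}]
more_people = pd.DataFrame(records)

print(people)
print(more_people)
```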

In [32]:
arr_rr = np.random.randint(10, 50, size=(2, 3))
arr_rr

array([[27, 11, 40],
       [42, 29, 44]])

In [35]:
df_1 = pd.DataFrame(arr_rr, ['A', 'B'], ['C', 'D', 'E']) # second argument: row labels ['A', 'B']; third argument: column labels ['C', 'D', 'E']
df_1

    C   D   E
A  27  11  40
B  42  29  44


In [38]:
arrr = np.array(['C++', 'Python', 'C', 'Java'])
pd.DataFrame(arrr)

        0
0     C++
1  Python
2       C
3    Java


In [39]:
# creating dataframe from dictionary
data = {'name': ['tom', 'nick', 'krish', 'jack'],
         'age': [20, 10, 30, 40]}

print(data)

{'name': ['tom', 'nick', 'krish', 'jack'], 'age': [20, 10, 30, 40]}


In [40]:
pd.DataFrame(data)

    name  age
0    tom   20
1   nick   10
2  krish   30
3   jack   40


In [71]:
dict_3  = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
          'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

dict_3

{'one': a    1.0
 b    2.0
 c    3.0
 dtype: float64, 'two': a    1.0
 b    2.0
 c    3.0
 d    4.0
 dtype: float64}

In [72]:
sd = pd.DataFrame(dict_3)
print(sd)

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [48]:
# creating dataframe from dictionary
pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]))


   A  B
0  1  4
1  2  5
2  3  6


In [51]:
# changing orientation
pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]),
                      orient='index', columns=['one', 'two', 'three'])


   one  two  three
A    1    2      3
B    4    5      6


In [52]:
print(df_1.shape) # shape is a (rows, columns) tuple

(2, 3)


# Editing and retrieving data

In [60]:
sss = pd.DataFrame(np.random.randint(10, 50, size=(2, 3)), ['A', 'B'], ['C', 'D', 'E'])
print(sss)
sss['C']

    C   D   E
A  13  46  44
B  11  49  47


A    13
B    11
Name: C, dtype: int32

In [61]:
sss[['C', 'E']]

    C   E
A  13  44
B  11  47


In [62]:
sss.loc['A'] # .loc accesses rows and columns by label(s) or a boolean array; a single row label returns a Series

C    13
D    46
E    44
Name: A, dtype: int32

In [63]:
sss.iloc[1] # .iloc selects by integer position

C    11
D    49
E    47
Name: B, dtype: int32

In [64]:
sss.loc['A', 'C'] # getting a single cell by specifying row and column labels (loc stands for location)

13

In [65]:
sss.loc[['A', 'B'], ['D', 'E']]

    D   E
A  46  44
B  49  47


In [68]:
# making new columns
# the expression is applied to every row
sss['total'] = sss['C'] + sss['D'] + sss['E']
print(sss)

    C   D   E  total
A  13  46  44    103
B  11  49  47    107


In [73]:
print(sd)

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [74]:
sd['mult'] = sd['one'] * sd['two']
print(sd)

   one  two  mult
a  1.0  1.0   1.0
b  2.0  2.0   4.0
c  3.0  3.0   9.0
d  NaN  4.0   NaN


In [75]:
dict_2 = {'C': 44, 'D' : 45, 'E': 46}
print(dict_2)
new_row = pd.Series(dict_2, name='F')
print(new_row)


{'C': 44, 'D': 45, 'E': 46}
C    44
D    45
E    46
Name: F, dtype: int64


In [76]:
# DataFrame.append requires pandas < 2.0 (it was removed in 2.0)
sss.append(new_row)

      C     D     E  total
A  13.0  46.0  44.0  103.0
B  11.0  49.0  47.0  107.0
F  44.0  45.0  46.0    NaN
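
`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the call above fails on recent versions. A minimal sketch of the same step with `pd.concat` follows. The name `frame` and its hard-coded values are assumptions that stand in for `sss`, and integer columns stay integers here, so the dtypes can differ from the `append` output.

```python
import pandas as pd

# Stand-in for the notebook's frame, with fixed values instead of random ones.
frame = pd.DataFrame({'C': [13, 11], 'D': [46, 49], 'E': [44, 47]},
                     index=['A', 'B'])
new_row = pd.Series({'C': 44, 'D': 45, 'E': 46}, name='F')

# to_frame() gives a one-column frame named 'F'; .T turns it into a one-row frame.
extended = pd.concat([frame, new_row.to_frame().T])

print(extended)
```

Like `append`, `pd.concat` returns a new object and leaves `frame` unchanged.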


`inplace` is a parameter accepted by many methods, including set_index(), dropna(), fillna(), reset_index(), drop(), and replace(). Its default value is False, in which case the method returns a modified copy and leaves the original object unchanged.
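
The difference between the two modes can be sketched with `drop()`. The frame `df` and its column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Default (inplace=False): a new frame is returned, df keeps both columns.
copy = df.drop('b', axis=1)

# inplace=True: df itself is modified and the call returns None.
result = df.drop('a', axis=1, inplace=True)

print(list(copy.columns), list(df.columns), result)
```

A common mistake is writing `df = df.drop(..., inplace=True)`, which rebinds `df` to `None`.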

In [78]:
# delete columns
sss.drop('total', axis=1, inplace=True) # axis=1 refers to columns, axis=0 to rows

In [79]:
sss

    C   D   E
A  13  46  44
B  11  49  47


In [81]:
sss['total'] = sss['C'] + sss['D'] + sss['E']

In [83]:
sss

    C   D   E  total
A  13  46  44    103
B  11  49  47    107


In [90]:
sss2 = sss.rename(columns={'total' : 'novonome'}) # inplace defaults to False, which is why the original does not change
print(sss2)
print(sss)

    C   D   E  novonome
A  13  46  44       103
B  11  49  47       107
    C   D   E  total
A  13  46  44    103
B  11  49  47    107


In [98]:
sss2 = sss.rename(columns={'total' : 'novonome'}, inplace=False) # passing inplace=False explicitly is the same as the default: the original does not change
print(sss2)
print(sss)

    C   D   E  novonome
A  13  46  44       103
B  11  49  47       107
    C   D   E  total
A  13  46  44    103
B  11  49  47    107


In [100]:
sss.rename(columns={'total' : 'novonome'}, inplace=True) # with inplace=True the original DataFrame is modified and nothing is returned

sss

    C   D   E  novonome
A  13  46  44       103
B  11  49  47       107


In [109]:
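# restore the column name and append the new row, so that a row can be deleted next
sss.rename(columns={'novonome': 'total'}, inplace=True)
sss = sss.append(new_row) # DataFrame.append requires pandas < 2.0
sss

      C     D     E  total
A  13.0  46.0  44.0  103.0
B  11.0  49.0  47.0  107.0
F  44.0  45.0  46.0    NaN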


In [110]:
sss.drop('F', axis=0, inplace=True) # axis=0 is row, axis = 1 is for col
print(sss)

      C     D     E  total
A  13.0  46.0  44.0  103.0
B  11.0  49.0  47.0  107.0


In [111]:
sss['gender'] = ['female', 'male']
sss

      C     D     E  total  gender
A  13.0  46.0  44.0  103.0  female
B  11.0  49.0  47.0  107.0    male


In [112]:
sss.set_index('gender', inplace=True)
sss

           C     D     E  total
gender
female  13.0  46.0  44.0  103.0
male    11.0  49.0  47.0  107.0


In [113]:
# reset the index back to the default integer index
sss.reset_index(inplace=True)

In [114]:
sss

   gender     C     D     E  total
0  female  13.0  46.0  44.0  103.0
1    male  11.0  49.0  47.0  107.0


In [117]:
# creating a new column while leaving the original DataFrame untouched
sss.assign(cd=sss['C'] / sss['D']) # cd is the name of the new column

   gender     C     D     E  total        cd
0  female  13.0  46.0  44.0  103.0  0.282609
1    male  11.0  49.0  47.0  107.0  0.224490


In [118]:
# with assign it is also possible to pass a function that receives the DataFrame
sss.assign(ssfd = lambda x: (x['C'] / x['D']))

   gender     C     D     E  total      ssfd
0  female  13.0  46.0  44.0  103.0  0.282609
1    male  11.0  49.0  47.0  107.0  0.224490


In [121]:
df_3 = pd.DataFrame({'A' : [1., np.nan, 3., np.nan]})
df_4 = pd.DataFrame({'A' : [8., 9., 2., 4.]})

# fills the NaN entries of df_3 with the values at the same positions in df_4
df_3.combine_first(df_4)

     A
0  1.0
1  9.0
2  3.0
3  4.0


# Conditional Selection

In [125]:
arr_2 = np.random.randint(10, 50, size=(2,3))
df_1= pd.DataFrame(arr_2, ['A', 'B'], ['C', 'D', 'E'])
print(df_1)

    C   D   E
A  48  29  18
B  39  48  33


In [126]:
print("Greater than 40\n", df_1 > 40)

Greater than 40
        C      D      E
A   True  False  False
B  False   True  False


In [128]:
print("Greater than 40\n", df_1.gt(40)) # other way to make comparison

Greater than 40
        C      D      E
A   True  False  False
B  False   True  False


In [129]:
print("Less than 40\n", df_1.lt(40)) # less than

Less than 40
        C      D     E
A  False   True  True
B   True  False  True


In [130]:
print("Less than or equal to 40\n", df_1.le(40)) # less than or equal to

Less than or equal to 40
        C      D     E
A  False   True  True
B   True  False  True


In [131]:
print("Greater than or equal to 40\n", df_1.ge(40)) # greater than or equal to

Greater than or equal to 40
        C      D      E
A   True  False  False
B  False   True  False


In [132]:
print("equal to 40\n", df_1.eq(40)) # equal to

equal to 40
        C      D      E
A  False  False  False
B  False  False  False


In [133]:
print("not equal to 40\n", df_1.ne(40)) # not equal to

not equal to 40
       C     D     E
A  True  True  True
B  True  True  True


In [134]:
# a boolean DataFrame can also be used inside brackets; cells where the condition is False become NaN
bool_1 = df_1 > 40
df_1[bool_1]

      C     D    E
A  48.0   NaN  NaN
B   NaN  48.0  NaN


In [135]:
# getting a boolean Series from a column
df_1['E'] > 40

A    False
B    False
Name: E, dtype: bool

In [136]:
# return the rows where the value in a column matches a condition
df_1[df_1['E'] > 30]

    C   D   E
B  39  48  33


In [139]:
df_2 = df_1[df_1['E'] > 30]
df_2

    C   D   E
B  39  48  33


In [143]:
print(df_1[df_1['E'] > 30]['C'])

B    39
Name: C, dtype: int32


In [144]:
print(df_1[df_1['E'] > 30][['C', 'D']])

    C   D
B  39  48


In [147]:
arr3 = np.array([[1,2,3],[4,5,6],[7,8,9]])
df33 = pd.DataFrame(arr3, ['A', 'B', 'C'], ['X', 'Y', 'Z'])
df33

   X  Y  Z
A  1  2  3
B  4  5  6
C  7  8  9


In [152]:
df33[(df33['X'] > 2 ) & ( df33['X'] < 7 ) ] # all rows where column X is greater than 2 and less than 7

   X  Y  Z
B  4  5  6


In [153]:
df33[(df33['X'] > 2 ) | ( df33['X'] < 7 ) ] # all rows where column X is greater than 2 or less than 7; every value satisfies at least one condition, so all rows are returned

   X  Y  Z
A  1  2  3
B  4  5  6
C  7  8  9
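
Boolean masks can also be negated with `~`, and the same `and` filter can be written as a string with `DataFrame.query`. The sketch below rebuilds the 3x3 frame under the hypothetical name `grid` so that it runs on its own. Each comparison needs its own parentheses because `&`, `|`, and `~` bind more tightly than `>` and `<`.

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                    ['A', 'B', 'C'], ['X', 'Y', 'Z'])

inside = (grid['X'] > 2) & (grid['X'] < 7)   # True only for row B
outside = grid[~inside]                      # ~ flips the mask: rows A and C
same_rows = grid.query('X > 2 and X < 7')    # string form of the same filter

print(outside)
print(same_rows)
```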
