pandas is useful for working with tabular data. Cleaning, restructuring (reshaping from long to wide, merging), manipulating, querying, and plotting the data can all be done with pandas.
In pandas, a data table is called a DataFrame.

In [None]:
!pip install pandas
!pip install numpy


There are two main data structures in pandas: Series and DataFrame. A Series differs from a list or a dictionary. Each element of a Series has an index label through which it can be retrieved, so the index can be thought of as the keys of a dictionary. The data column itself can also carry a label, which is available through the .name attribute.
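As a small illustration of the index and the .name attribute, the sketch below builds a Series by hand. The label "life_expectancy" is chosen here for illustration only; the values are the ones used in the cells that follow.

```python
import pandas as pd

# Values and index labels as used later in this notebook;
# the name "life_expectancy" is an illustrative choice.
life = pd.Series([66, 68, 74], index=["PAK", "IND", "BGD"], name="life_expectancy")

print(life.index.tolist())  # the index labels, which act like dictionary keys
print(life.name)            # the label of the data column
print(life["IND"])          # retrieve one element by its index label
```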

In [2]:
import pandas as pd
import numpy as np

In [3]:
# Make a series
countries=['PAK','IND','BGD']
pd.Series(countries)

0    PAK
1    IND
2    BGD
dtype: object

In [4]:
# Make a Series of whole numbers; the default index is 0, 1, 2
life_expectancy_birth=pd.Series([66,68,74])
life_expectancy_birth
# Use the country codes as index labels instead
life=pd.Series([66,68,74],index=countries)
life

PAK    66
IND    68
BGD    74
dtype: int64

In [25]:
# None and NaN are not the same
np.nan==None

False

In [33]:
# NaN never compares equal to anything, not even to itself
a=np.nan
a
a==a  # False
# Use a dedicated function to test for NaN instead
np.isnan(a)


True
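Inside a Series, the same test is usually done element-wise. The sketch below uses made-up values and relies on pd.isna and Series.isna, which treat both None and NaN as missing.

```python
import numpy as np
import pandas as pd

# Made-up values: one real number and two kinds of missing value.
values = pd.Series([66.0, np.nan, None])  # None is stored as NaN in a float Series

print(np.nan == np.nan)                # NaN does not compare equal to itself
missing = values.isna()                # element-wise missing-value test
print(missing.tolist())
print(pd.isna(None), pd.isna(np.nan))  # pd.isna accepts both None and NaN
```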

In [35]:
#Series from dictionary
countries_life_expectancy={'PAK':66,'IND':68,'BGD':74}
lifes=pd.Series(countries_life_expectancy)
lifes


PAK    66
IND    68
BGD    74
dtype: int64

In [6]:
countries_dict={'PAK':'LOWER MIDDLE','IND':'LOWER MIDDLE','BGD':'LOWER MIDDLE','AFG':'LOW'}
countries=pd.Series(countries_dict,index=['PAK','IND','AFG','NEP']) #countries names are the index labels
countries

PAK    LOWER MIDDLE
IND    LOWER MIDDLE
AFG             LOW
NEP             NaN
dtype: object

BGD is not part of the Series even though it was in the dictionary, because it is missing from the index list. NEP, which is in the index list but not in the dictionary, is included with a missing value (NaN).
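The alignment between dictionary keys and the requested index can be checked directly. This is a minimal sketch using the same dictionary and index list as the cell above; the variable names are illustrative.

```python
import pandas as pd

income = {"PAK": "LOWER MIDDLE", "IND": "LOWER MIDDLE", "BGD": "LOWER MIDDLE", "AFG": "LOW"}
wanted = ["PAK", "IND", "AFG", "NEP"]

# Only labels listed in `wanted` are kept; labels absent from the dictionary become NaN.
groups = pd.Series(income, index=wanted)

print("BGD" in groups.index)  # BGD was dropped because it is not in `wanted`
print(groups.isna())          # flags the labels that have no data
print(groups.dropna())        # keeps only the labels that have data
```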
Querying a Series

In [26]:
# A pandas Series can be queried either by index position or by index label. If you don't give an
# index to the Series when creating it, the position and the label are effectively the same values.
# To query by numeric position, starting at zero, use the iloc attribute. To query by index label,
# use the loc attribute.
countries.iloc[0] # returns the income level of the first entry
countries.loc["PAK"] # returns the same value, looked up by label

'LOWER MIDDLE'

In [33]:
countries_codes=pd.Series({586:'PAK',356:'IND',50:'BGD'})
countries_codes
#countries_codes[0] # raises a KeyError, because there is no index label 0
countries_codes.iloc[0]


'PAK'
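When the index labels are themselves integers, the difference between loc and iloc matters most. The sketch below reuses the numeric country codes from the cell above to contrast the two lookups.

```python
import pandas as pd

codes = pd.Series({586: "PAK", 356: "IND", 50: "BGD"})

print(codes.loc[586])    # lookup by label: the entry whose index label is 586
print(codes.iloc[0])     # lookup by position: the first entry
print(codes.iloc[-1])    # positions also support negative indexing
print(0 in codes.index)  # there is no label 0, so a label lookup of 0 would fail
```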

In [13]:
# Operations on the elements of a Series, for example computing an average or transforming a variable
lifes # contains the life expectancy of 3 countries
type(lifes)
# Compute the average life expectancy with an explicit loop
total=0
for life in lifes:
    total+=life
print(total/len(lifes))
# Looping is slow on large Series; the next cell uses vectorized NumPy functions instead

69.33333333333333


In [25]:
import numpy as np
total=np.sum(lifes)
print(total/len(lifes))
print(np.mean(lifes))
print(np.var(lifes))
print(np.sqrt(np.var(lifes)))
print(np.median(lifes))


69.33333333333333
69.33333333333333
11.555555555555555
3.39934634239519
68.0
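The same statistics are also available as Series methods. One detail worth checking is the variance: np.var divides by n (ddof=0) by default, while Series.var divides by n-1 (ddof=1) by default. A minimal sketch with the same three values:

```python
import numpy as np
import pandas as pd

lifes = pd.Series({"PAK": 66, "IND": 68, "BGD": 74})

print(lifes.mean())
print(lifes.median())
print(lifes.var(ddof=0))  # population variance, the NumPy default
print(lifes.var())        # sample variance (ddof=1), the pandas default
print(lifes.std(ddof=0))  # population standard deviation
```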


In [37]:
# A related feature in pandas and NumPy is called broadcasting. With broadcasting, you can
# apply an operation to every value in the Series at once. For instance, here we subtract
# the mean from each observation of lifes. This is faster than iterating through each observation.
# Note that -= modifies lifes in place.
lifes-=np.mean(lifes)
lifes

PAK   -3.333333
IND   -1.333333
BGD    4.666667
dtype: float64
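Broadcasting does not have to overwrite the original data. The sketch below, with illustrative variable names, stores the results in new Series and leaves lifes unchanged.

```python
import pandas as pd

lifes = pd.Series({"PAK": 66, "IND": 68, "BGD": 74})

centered = lifes - lifes.mean()  # new Series; lifes itself is not modified
months = lifes * 12              # the scalar 12 is applied to every element

print(centered.round(2))
print(months)
print(lifes)                     # still the original values
```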

The .loc attribute lets you modify existing data in place and also add new data. If the label you pass in does not exist in the index, a new entry is added. Keep in mind that an index can hold mixed types, for example strings and integers. While it is important to be aware of the typing going on underneath, pandas automatically changes the underlying NumPy types as appropriate, for instance from int64 to float64 when a NaN is added.

In [45]:
#define lifes again
countries_life_expectancy={'PAK':66,'IND':68,'BGD':74}
lifes=pd.Series(countries_life_expectancy)
lifes
lifes.loc[0]=np.nan
lifes
lifes.loc['PAK']=np.nan # replaces the existing value at the 'PAK' label with NaN

PAK    66.0
IND    68.0
BGD    74.0
0       NaN
dtype: float64
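The three behaviours of .loc assignment described above (overwrite, add, and type change) can be seen side by side in the following sketch. The new labels and values are made up for illustration.

```python
import numpy as np
import pandas as pd

lifes = pd.Series({"PAK": 66, "IND": 68, "BGD": 74})
print(lifes.dtype)         # integer dtype to begin with

lifes.loc["IND"] = 69      # existing label: the value is overwritten
lifes.loc["NEP"] = np.nan  # new label: an entry is added, values become floats
lifes.loc[0] = 70.5        # an integer label can sit next to string labels

print(lifes)
print(lifes.dtype)         # floating-point dtype after NaN was added
```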

Index values do not have to be unique in pandas.

In [49]:
pak_life=pd.Series([68,70],index=['PAK','PAK'])
pak_life

PAK    68
PAK    70
dtype: int64

In [52]:
# Combine lifes and pak_life; pd.concat replaces Series.append, which was removed in pandas 2.0
all_lifes=pd.concat([lifes,pak_life])
all_lifes


PAK     NaN
PAK    68.0
PAK    70.0
dtype: float64

In [53]:
all_lifes.loc['PAK'] #all values for PAK

PAK     NaN
PAK    68.0
PAK    70.0
dtype: float64
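Because labels may repeat, a .loc lookup can return either a single value or a Series. The sketch below uses made-up values to show how to detect duplicate labels and how to reduce them to one value per label with groupby.

```python
import pandas as pd

# Made-up values: three observations for PAK and one for IND.
all_lifes = pd.Series([66, 68, 70, 69], index=["PAK", "PAK", "PAK", "IND"])

print(all_lifes.index.is_unique)  # reports whether any label is repeated
print(all_lifes.loc["PAK"])       # repeated label: a Series with every match
print(all_lifes.loc["IND"])       # unique label: a single value

# One value per label: average the observations that share a label.
per_country = all_lifes.groupby(level=0).mean()
print(per_country)
```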