### Basic Pandas Functions

In [2]:
import pandas as pd
import numpy as np

# pandas Data Structures

## Series

A Series is a one-dimensional array-like object containing a sequence of values (similar to a NumPy array) and an associated array of labels, called its *index*. 

In [3]:
#1. Make a series in pandas

obj=pd.Series([4,5,-3,2])
obj
#A Series has two parts: the values, stored as a NumPy array ([4,5,-3,2]), and the index. 
#The default index is the integers from 0 to N-1, where N is the length of the Series. 

0    4
1    5
2   -3
3    2
dtype: int64

In [4]:
#2. To see values of a series
obj.values

array([ 4,  5, -3,  2])

In [5]:
#3. To see the index of a series
#For the default index, it shows the start, the stop (exclusive) and the step.
obj.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
#4. Specify a different index
obj2=pd.Series([4,5,-3,2],index=['d','c','a','b'])
obj2

d    4
c    5
a   -3
b    2
dtype: int64

In [7]:
obj2.index #string labels are stored with the generic Python object dtype

Index(['d', 'c', 'a', 'b'], dtype='object')

In [8]:
#5. Getting the value for a particular index label
#pandas is more flexible than NumPy here: you can index by label. 

obj2['c']

5

In [9]:
#6. Getting the value at a row position.

obj2.iloc[1] #use iloc for positions; obj2[1] on a label index is deprecated in recent pandas

5

In [10]:
#7. Getting a subset of rows by index labels.

obj2[['a','d']]
#['a','d'] is a list of index labels. It returns a subset of the original Series, which is also a Series. 

a   -3
d    4
dtype: int64

### NumPy-like operations

In [11]:
#8. Filtering rows

obj2[obj2>0]

d    4
c    5
b    2
dtype: int64

In [12]:
#9. Taking the exponential of the values

np.exp(obj2)

d     54.598150
c    148.413159
a      0.049787
b      7.389056
dtype: float64

#10. Create a Series from Dict

### <font color='red'>**Exercise 1**</font>

Can you create a Series from a Dictionary where {'ham':100,'egg':200}?

Note: in current pandas the index of the resulting Series follows the dict's insertion order (older pandas versions sorted the keys). 

In [13]:
food={'ham':100,'egg':200}
obj3=pd.Series(food)
obj3

ham    100
egg    200
dtype: int64
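
The output above lists `ham` before `egg`, which is the order of the dict itself. As a minimal sketch (the third key, `bread`, and its value are made up for illustration, and a recent pandas on Python 3.7+ is assumed), `sort_index` can be called when alphabetical order is wanted:

```python
import pandas as pd

# Hypothetical prices; the keys are deliberately not in alphabetical order.
food = {'ham': 100, 'egg': 200, 'bread': 50}

by_insertion = pd.Series(food)         # keeps the dict's own key order
by_label = by_insertion.sort_index()   # explicit alphabetical order

print(list(by_insertion.index))
print(list(by_label.index))
```

`sort_index` returns a new Series, so `by_insertion` keeps its original order.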

In [14]:
#11. Build a Series from a dict with an explicit index

#Passing an index selects and orders the dict keys. There is no value for
#'bread', so it appears as NaN (not a number) and the dtype becomes float64. 
#Since 'egg' is not in the new index list, it is excluded from the Series. 
new=['bread','ham']
obj4=pd.Series(food,index=new)
obj4

bread      NaN
ham      100.0
dtype: float64

### missing data: 'missing' or 'NA'

In [15]:
#12. Checking Null values

pd.isnull(obj4)
#obj4.isnull() gives the same result

bread     True
ham      False
dtype: bool

In [16]:
#12. Checking Not Null values

pd.notnull(obj4)

bread    False
ham       True
dtype: bool

In [17]:
#13. Changing a value in a row
obj4['bread']=300

obj4

bread    300.0
ham      100.0
dtype: float64

In [18]:
#14. Adding two series
print(obj3)
print(obj4)
obj4+obj3

ham    100
egg    200
dtype: int64
bread    300.0
ham      100.0
dtype: float64


bread      NaN
egg        NaN
ham      200.0
dtype: float64

## DataFrame

There are many possible data inputs to DataFrame, such as a NumPy array, a dict of lists or tuples, a dict of Series, a dict of dicts, and so on.

We only introduce how to construct a DataFrame from a dict of lists. 

In [19]:
#15. Make a DataFrame

#create a DataFrame through a dict of equal length lists or NumPy arrays:
data={'state':['Ohio','Ohio','Ohio','Nevada','Nevada','Nevada'],
     'year':[2000,2001,2002,2000,2001,2002],
     'pop':[1.5,1.7,3.6,2.4,2.9,3.2]}
frame=pd.DataFrame(data)
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2000  2.4
4  Nevada  2001  2.9
5  Nevada  2002  3.2


In [20]:
#16. Specify the index of a dataframe

frame=pd.DataFrame(data,index=[1,2,3,4,5,6])
frame

    state  year  pop
1    Ohio  2000  1.5
2    Ohio  2001  1.7
3    Ohio  2002  3.6
4  Nevada  2000  2.4
5  Nevada  2001  2.9
6  Nevada  2002  3.2


In [21]:
#17. Arranging columns in a dataframe

frame2=pd.DataFrame(data,index=range(1,7),columns=['year','state','pop','debt']) #columns follow the given order; 'debt' is not in data, so it is all NaN
frame2

   year   state  pop  debt
1  2000    Ohio  1.5   NaN
2  2001    Ohio  1.7   NaN
3  2002    Ohio  3.6   NaN
4  2000  Nevada  2.4   NaN
5  2001  Nevada  2.9   NaN
6  2002  Nevada  3.2   NaN


In [22]:
#18. Selecting a row of a dataframe by label

frame2.loc[6] #Retrieve the row whose index label is 6

year       2002
state    Nevada
pop         3.2
debt        NaN
Name: 6, dtype: object

In [23]:
#19. Making an array of floats from 1.0 to 6.0 (the stop value 7.0 is excluded)

np.arange(1.0,7.0,1.0)

array([1., 2., 3., 4., 5., 6.])

In [24]:
#20. Giving values to a column in a dataframe through NumPy

#Assign the floats 1.0 to 6.0 to the debt column. 
frame2['debt']=np.arange(1.0,7.0,1.0)
frame2

   year   state  pop  debt
1  2000    Ohio  1.5   1.0
2  2001    Ohio  1.7   2.0
3  2002    Ohio  3.6   3.0
4  2000  Nevada  2.4   4.0
5  2001  Nevada  2.9   5.0
6  2002  Nevada  3.2   6.0


In [25]:
#21. A Series with a customized index

#If you assign a Series to a column of a DataFrame, its labels are aligned to the DataFrame's index, and rows without a matching label get missing values.
#Here val is only created; it is not assigned to frame2.
val=pd.Series([-1.2,-1.5,-1.7],index=[2,4,5])
val

2   -1.2
4   -1.5
5   -1.7
dtype: float64
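
To make the alignment rule concrete, here is a small sketch on a stand-in frame (`demo` is a hypothetical name, used so that `frame2` stays untouched). It assumes the same 1 to 6 index as `frame2` and the same `val` Series:

```python
import pandas as pd

# Stand-in frame with the same 1..6 index as frame2 (illustrative only).
demo = pd.DataFrame({'year': [2000, 2001, 2002, 2000, 2001, 2002]},
                    index=range(1, 7))
val = pd.Series([-1.2, -1.5, -1.7], index=[2, 4, 5])

# Assignment aligns on index labels, not on position:
# rows 1, 3 and 6 have no matching label and become NaN.
demo['debt'] = val
print(demo)
```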

In [26]:
#22. Checking which rows of a column equal a value

frame2.state=='Ohio'

1     True
2     True
3     True
4    False
5    False
6    False
Name: state, dtype: bool

In [27]:
#23. Add a new boolean column that records whether another column holds a particular value.

frame2['eastern']= frame2.state=='Ohio'
frame2

   year   state  pop  debt  eastern
1  2000    Ohio  1.5   1.0     True
2  2001    Ohio  1.7   2.0     True
3  2002    Ohio  3.6   3.0     True
4  2000  Nevada  2.4   4.0    False
5  2001  Nevada  2.9   5.0    False
6  2002  Nevada  3.2   6.0    False


In [28]:
#24. Delete a column from a dataframe

del frame2['eastern']
frame2.columns
#Alternative that returns a new DataFrame instead: frame2.drop(columns=['eastern'])

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [29]:
#25 Transposing a data frame

frame2.T # Transpose DataFrame. note that "eastern" has been deleted

          1     2     3       4       5       6
year   2000  2001  2002    2000    2001    2002
state  Ohio  Ohio  Ohio  Nevada  Nevada  Nevada
pop     1.5   1.7   3.6     2.4     2.9     3.2
debt    1.0   2.0   3.0     4.0     5.0     6.0


In [30]:
#26. Extracting values from a data frame as a NumPy array


frame2.values
#The values attribute returns the data contained in the DataFrame as a two-dimensional NumPy array.

array([[2000, 'Ohio', 1.5, 1.0],
       [2001, 'Ohio', 1.7, 2.0],
       [2002, 'Ohio', 3.6, 3.0],
       [2000, 'Nevada', 2.4, 4.0],
       [2001, 'Nevada', 2.9, 5.0],
       [2002, 'Nevada', 3.2, 6.0]], dtype=object)
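
The `dtype=object` in the output above comes from mixing strings with numbers. The sketch below (on a small stand-in frame named `demo`, an assumption for illustration) compares that with a numeric-only selection, using `to_numpy()`, the method pandas recommends over `.values`:

```python
import pandas as pd

# Small stand-in frame with a string column and two numeric columns.
demo = pd.DataFrame({'year': [2000, 2001],
                     'state': ['Ohio', 'Nevada'],
                     'pop': [1.5, 2.4]})

mixed = demo.to_numpy()                      # strings force a generic object array
numeric = demo[['year', 'pop']].to_numpy()   # int and float are upcast to float64

print(mixed.dtype)
print(numeric.dtype)
```

Selecting only the numeric columns first keeps the array usable for NumPy arithmetic.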

## Index Objects
Index objects in pandas hold the axis labels; they are different from the integer positions used to index a regular array. 

Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index.

In [31]:
#27. Making an Index with mixed labels (dtype object)

import pandas as pd
labels=pd.Index(['foo','foo',1,2,3,4,7])
labels

Index(['foo', 'foo', 1, 2, 3, 4, 7], dtype='object')

In [32]:
#28. Can an index have duplicate values?

pd.DataFrame(frame,index=labels)
#A pandas Index can contain duplicate labels. Labels that are not in frame (foo, 7) give NaN rows.

      state    year  pop
foo     NaN     NaN  NaN
foo     NaN     NaN  NaN
1      Ohio  2000.0  1.5
2      Ohio  2001.0  1.7
3      Ohio  2002.0  3.6
4    Nevada  2000.0  2.4
7       NaN     NaN  NaN
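
Duplicate labels change what a selection returns. A minimal sketch, using a made-up Series `s` rather than the frame above:

```python
import pandas as pd

# 'foo' appears twice in the index, 'bar' once.
s = pd.Series([10, 20, 30], index=['foo', 'foo', 'bar'])

print(s.index.is_unique)   # whether every label occurs only once

unique_hit = s.loc['bar']      # a unique label gives a scalar
duplicate_hit = s.loc['foo']   # a duplicated label gives a Series
print(unique_hit)
print(duplicate_hit)
```

Checking `is_unique` first is a cheap way to know which of the two return types to expect.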


# Essential Functionality

## 1. Reindexing
Calling reindex on a Series or DataFrame rearranges the data according to the new index, introducing missing values for any index values that were not already present. 

In [33]:
#29 Create a Series and then apply reindex. 

import pandas as pd
obj=pd.Series([4.5,.3,-2],index=['a','c','d'])
obj

a    4.5
c    0.3
d   -2.0
dtype: float64

In [34]:
#30 Calling reindex
obj2=obj.reindex(['a','b','c','d'])
obj2

a    4.5
b    NaN
c    0.3
d   -2.0
dtype: float64

In [35]:
#31 For ordered data such as time series, we can fill missing values while reindexing.
#ffill means forward filling: each new label takes the last preceding value. 
obj3=obj.reindex(['a','b','c','d','e'],method='ffill')
obj3

a    4.5
b    4.5
c    0.3
d   -2.0
e   -2.0
dtype: float64
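
Forward filling is one option; a constant can be used instead through the `fill_value` parameter of `reindex`. A short sketch comparing the two on the same Series (the variable names `filled_const` and `filled_fwd` are illustrative):

```python
import pandas as pd

obj = pd.Series([4.5, 0.3, -2.0], index=['a', 'c', 'd'])
new_index = ['a', 'b', 'c', 'd', 'e']

filled_const = obj.reindex(new_index, fill_value=0.0)   # new labels get a constant
filled_fwd = obj.reindex(new_index, method='ffill')     # new labels copy the previous value

print(pd.DataFrame({'fill_value=0': filled_const, 'ffill': filled_fwd}))
```

Note that `method='ffill'` needs a sorted (monotonic) original index, which holds here.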

In [36]:
#32, create a DataFrame and then apply reindex for both index and columns. 
frame=pd.DataFrame(np.arange(9).reshape(3,3),index=['a','b','c'],columns=['Ohio','Texas','California'])
frame

   Ohio  Texas  California
a     0      1           2
b     3      4           5
c     6      7           8


In [37]:
#33. Reindexing a data frame

frame2=frame.reindex(['a','b','c','d'])
frame2

# Question: Will frame.reindex rewrite frame too? (No: reindex returns a new object and leaves frame unchanged.)

   Ohio  Texas  California
a   0.0    1.0         2.0
b   3.0    4.0         5.0
c   6.0    7.0         8.0
d   NaN    NaN         NaN


In [38]:
frame

   Ohio  Texas  California
a     0      1           2
b     3      4           5
c     6      7           8


#34

<font color='red'> Exercise:</font>

In frame2, can you fill the values in row 'd' with [9.0,10.0,11.0]?

In [39]:
frame2.loc['d']=[9.0,10.0,11.0]
frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   3.0    4.0         5.0
c   6.0    7.0         8.0
d   9.0   10.0        11.0


#35

<font color='red'> Exercise:</font>

In frame2, can you replace the column names with ['SF','NYC','Chicago']?

In [40]:
#Another way to fill row 'd' (the previous exercise): assign a dict keyed by column name.
d = {"Ohio":9.0, "Texas":10.0, "California": 11.0}
frame2.loc['d'] = d
frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   3.0    4.0         5.0
c   6.0    7.0         8.0
d   9.0   10.0        11.0


In [41]:
#36. reindex does not rename: it selects columns by label, so labels that do not exist yet come back as NaN columns

cities=['SF','NYC','Chicago']
frame2.reindex(columns=cities)

    SF  NYC  Chicago
a  NaN  NaN      NaN
b  NaN  NaN      NaN
c  NaN  NaN      NaN
d  NaN  NaN      NaN


In [42]:
#37. Renaming columns in a dataframe

frame2.rename(columns = {'Ohio':'SF','Texas':'NYC','California':'Chicago'})

    SF   NYC  Chicago
a  0.0   1.0      2.0
b  3.0   4.0      5.0
c  6.0   7.0      8.0
d  9.0  10.0     11.0
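
Like `reindex`, `rename` returns a new DataFrame, so the result has to be kept in a variable to persist. The sketch below uses a small stand-in frame (`demo`, `renamed` and `relabeled` are illustrative names) and also shows assigning to `.columns`, which relabels by position:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(6.).reshape(2, 3), index=['a', 'b'],
                    columns=['Ohio', 'Texas', 'California'])

# rename maps old labels to new ones and returns a new object.
renamed = demo.rename(columns={'Ohio': 'SF', 'Texas': 'NYC', 'California': 'Chicago'})
print(list(demo.columns))      # the original is unchanged
print(list(renamed.columns))

# Assigning a full list to .columns relabels by position instead.
relabeled = demo.copy()
relabeled.columns = ['SF', 'NYC', 'Chicago']
print(relabeled)
```

The mapping form is safer when the column order is not guaranteed; the list form requires one new name per existing column.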


## 2. Dropping Entries from an Axis

The _drop_ method returns a new object with the indicated value or values deleted from an axis.  

In [49]:
#38.  Create a Series and apply drop for a particular index

obj=pd.Series(np.arange(5),index=['a','b','c','d','e'])
print(obj)
obj.drop('c')

# Note: obj itself has not been modified; drop returned a new Series. 

a    0
b    1
c    2
d    3
e    4
dtype: int64


a    0
b    1
d    3
e    4
dtype: int64

In [47]:
#39. Dropping a row from a Series in place

# If you want to modify the current object, use inplace=True.
# Be careful with this option, because it destroys data. 
obj.drop('c',inplace=True)
obj

a    0
b    1
d    3
e    4
dtype: int64

In [100]:
#40.  Create a DataFrame and apply drop() to rows and columns

frame=pd.DataFrame(np.arange(16).reshape(4,4),index=['A','B','C','D'],columns=['one','two','three','four'])
print(frame,"\n")


#Without specifying the axis, drop works along axis 0, the rows. 
print(frame.drop('A'),"\n")

print(frame.drop(['A','B']),"\n")


#axis 1 is the column. 
print(frame.drop('one', axis = 1),"\n")


   one  two  three  four
A    0    1      2     3
B    4    5      6     7
C    8    9     10    11
D   12   13     14    15 

   one  two  three  four
B    4    5      6     7
C    8    9     10    11
D   12   13     14    15 

   one  two  three  four
C    8    9     10    11
D   12   13     14    15 

   two  three  four
A    1      2     3
B    5      6     7
C    9     10    11
D   13     14    15 
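
The `index=` and `columns=` keywords of `drop` are an alternative to `axis`, and they can be combined in one call. A short sketch on the same frame (the result names are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape(4, 4), index=['A', 'B', 'C', 'D'],
                     columns=['one', 'two', 'three', 'four'])

by_axis = frame.drop('one', axis=1)          # axis-based form
by_keyword = frame.drop(columns=['one'])     # keyword form, same result

# Rows and columns dropped in a single call.
both = frame.drop(index=['A', 'B'], columns=['one', 'four'])
print(both)
```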



## 3. Indexing and Selection

Series indexing works analogously to NumPy array indexing, except that you can use the Series' index labels instead of only integers. 

In [76]:
import numpy as np
obj=pd.Series(np.arange(5),index=['a','b','c','d','e'])
print(obj)

a    0
b    1
c    2
d    3
e    4
dtype: int64


#41

### <font color='red'> Exercise </font>

How do you select the value of index 'c'?

In [78]:
print (obj['c'])

2
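
Slicing also works with labels, with one difference from integer slicing that is easy to miss: a label slice includes its end point. A minimal sketch on the same Series:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])

by_label = obj.loc['b':'d']    # label slice: 'd' is included
by_position = obj.iloc[1:3]    # positional slice: position 3 is excluded

print(by_label)
print(by_position)
```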


#42

Indexing into a DataFrame with a single label or a list of labels retrieves one or more columns. 

In [104]:
frame=pd.DataFrame(np.arange(16).reshape(4,4),index=['A','B','C','D'],columns=['one','two','three','four'])
frame

# A slice such as frame[:2] selects rows (axis 0); a label or a list of labels selects columns. 
print(frame[:2],"\n")
print(frame['two'],"\n")
print(frame[['one','two']],"\n")


   one  two  three  four
A    0    1      2     3
B    4    5      6     7 

A     1
B     5
C     9
D    13
Name: two, dtype: int64 

   one  two
A    0    1
B    4    5
C    8    9
D   12   13 



In [107]:
#43 Selecting certain rows and columns

#When you want to select a certain row, use loc (by label) or iloc (by position). 

data=pd.DataFrame(np.arange(16).reshape(4,4),index=['Ohio','Colorado','Utah','New York'],
                 columns=['one','two','three','four'])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [108]:
data.loc['Colorado',['two','three']]

two      5
three    6
Name: Colorado, dtype: int64

In [109]:
#44 Getting rows and columns by integer position

#Similar selection, with the row and columns given as integer positions 
data.iloc[1,[1,2]]

two      5
three    6
Name: Colorado, dtype: int64

In [110]:
data.iloc[2] # Select the entire row at position 2. 

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [94]:
#45. Selecting Rows and columns with condition on values of data

data.iloc[:,:2][data.one < 5]

          one  two
Ohio        0    1
Colorado    4    5
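
The same selection can be written as a single `loc` call that takes the row condition and the column labels together. This is a sketch of an equivalent form, not a replacement for the cell above:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

chained = data.iloc[:, :2][data.one < 5]               # two steps: columns, then rows
single = data.loc[data['one'] < 5, ['one', 'two']]     # one step: rows and columns

print(single)
```

The single-call form is also the one to prefer when assigning values to the selection, since chained indexing may operate on a temporary copy.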


#46

### <font color='red'> **Exercise:**</font>

Select the rows where column "one" is greater than or equal to 8, keeping only columns "one" and "four".

In [111]:

data[["one","four"]][data.one>=8]


          one  four
Utah        8    11
New York   12    15


#47

### <font color='red'> **Exercise:**</font>

Select the single scalar value at row 'Ohio' and column 'one'. 

In [115]:
data.iloc[0,0]



0
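
Since the exercise names the row and column by label, label-based access is a natural alternative to positions. A small sketch showing three equivalent ways (`at` is the pandas accessor meant for single scalar lookups):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

by_position = data.iloc[0, 0]         # first row, first column
by_label = data.loc['Ohio', 'one']    # same cell, addressed by labels
by_at = data.at['Ohio', 'one']        # scalar-only accessor

print(by_position, by_label, by_at)
```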

## 4. Arithmetic and Data Alignment

You can add objects of the same type. We will try Series and then DataFrame. 

When the labels do not match, the internal data alignment introduces missing values at the labels that do not overlap. Missing values then propagate through later arithmetic computations. 

In [123]:
#48 Adding two series aligns on the union of their indexes (like an outer join)

s1=pd.Series([1.2,3.4],index=['a','c'])
s2=pd.Series([-1.2,2.,-3.4,5.0],index=['a','b','c','d'])
s1+s2

a    0.0
b    NaN
c    0.0
d    NaN
dtype: float64
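
If the NaN values at non-overlapping labels are not wanted, the `add` method accepts a `fill_value` that stands in for the missing side. A sketch with the same two Series:

```python
import pandas as pd

s1 = pd.Series([1.2, 3.4], index=['a', 'c'])
s2 = pd.Series([-1.2, 2., -3.4, 5.0], index=['a', 'b', 'c', 'd'])

plain = s1 + s2                     # 'b' and 'd' exist only in s2, so they become NaN
filled = s1.add(s2, fill_value=0)   # a missing entry is treated as 0 before adding

print(filled)
```

The other arithmetic methods (`sub`, `mul`, `div`) take the same parameter.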

In [130]:
#49 list() converts an iterable (tuple, string, set, dictionary) to a list.

df1=pd.DataFrame(np.arange(6.).reshape(2,3),columns=list('bcd'),index=list('AB'))
print(df1)

#A second DataFrame with partly different row and column labels

df2=pd.DataFrame(np.arange(4.).reshape(2,2),
                 columns=list('ab'),index=list('AC'))
print(df2)

#Adding two DataFrames: only the cell (A, b) exists in both
df1+df2

     b    c    d
A  0.0  1.0  2.0
B  3.0  4.0  5.0
     a    b
A  0.0  1.0
C  2.0  3.0


     a    b    c    d
A  NaN  1.0  NaN  NaN
B  NaN  NaN  NaN  NaN
C  NaN  NaN  NaN  NaN


**Arithmetic between DataFrame and Series**
matches the index of the Series on the DataFrame's **columns**, broadcasting down the rows. 

In [143]:
#50 A row of a data frame can be subtracted from the data frame by first extracting the row as a separate series

print(df1,"\n")

S=df1.iloc[0]
print(S,"\n")

print(df1-S)

     b    c    d
A  0.0  1.0  2.0
B  3.0  4.0  5.0 

b    0.0
c    1.0
d    2.0
Name: A, dtype: float64 

     b    c    d
A  0.0  0.0  0.0
B  3.0  3.0  3.0
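
To match on the row index instead and broadcast across the columns, the arithmetic methods take an `axis` argument. A minimal sketch, subtracting column `b` from every column (`col` and `by_rows` are illustrative names):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6.).reshape(2, 3),
                   columns=list('bcd'), index=list('AB'))

col = df1['b']                    # Series indexed by the row labels A, B
by_rows = df1.sub(col, axis=0)    # align on rows, broadcast over columns

print(by_rows)
```

Writing `df1 - col` here would instead try to match the labels A and B against the column names b, c, d and produce only NaN.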


In [148]:
#51 What if the Series has a label ('e') that is not among the DataFrame's columns? The result gets a new NaN column.
S2=pd.Series(range(4),index=list('bcde'))
S2

df1+S2

     b    c    d    e
A  0.0  2.0  4.0  NaN
B  3.0  5.0  7.0  NaN


In [151]:
#52. Numpy element-wise array methods also work with pandas objects. 

frame=pd.DataFrame(np.random.randn(4,3),
                   columns=list('abc'),index=list('ABCD'))
frame

          a         b         c
A  0.331346  0.019180  0.830610
B  1.481152 -0.094824 -1.579902
C  0.996314 -0.458132  0.581007
D -0.914769  1.670821 -0.167576


In [None]:
np.abs(frame)

## <font color='red'> The following is Self-study material.</font>

## 5. Sorting and Ranking
By default, sort_index sorts the row or column labels in ascending lexicographic order. It works on both Series and DataFrame. 

In [155]:
#53. Sorting a Series by its index

S=pd.Series(range(4),index=list('cdba'))
S

S.sort_index()

a    3
b    2
c    0
d    1
dtype: int64

In [159]:
#54 You can sort the column names with sort_index by using axis=1

frame=pd.DataFrame(np.arange(8).reshape(2,4),index=['B','A'],columns=list('badc'))
frame

frame.sort_index(axis=1)#sort columns

   a  b  c  d
B  1  0  3  2
A  5  4  7  6


In [162]:
#55 Sort the column labels in descending order. 

frame.sort_index(axis=1,ascending=False)

   d  c  b  a
B  2  3  0  1
A  6  7  4  5


In [164]:
#56.  To sort a Series by its values. 
# Any missing value will be sorted to the end of the Series by default. 

S.sort_values(ascending=False)

a    3
b    2
d    1
c    0
dtype: int64
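
The Series above has no missing values, so the remark about them cannot be seen in its output. Here is a small sketch with a made-up Series that contains one NaN, including the `na_position` parameter that moves missing values to the front:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 2.0])

ascending = s.sort_values()                    # NaN goes last
descending = s.sort_values(ascending=False)    # NaN still goes last
nan_first = s.sort_values(na_position='first') # NaN goes first

print(ascending)
print(descending)
print(nan_first)
```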

In [168]:
#57.  To sort a DataFrame by its values. 
# Pass one or more column names to the by option of sort_values. Here we sort in reverse order. 

frame.sort_values(by='a',ascending=False)

   b  a  d  c
A  4  5  6  7
B  0  1  2  3


### Ranking
Ranking assigns ranks from 1 through the number of valid data points in an array. 
For example, the list [3, 1.5, 3] sorted in ascending order is [1.5, 3, 3]: 1.5 is at rank 1, and the two 3s are at ranks 2 and 3. 
Those two 3s are called a tied group. 
By default, each 3 gets the mean rank of its group, 2.5. 
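
The worked example can be checked directly. A minimal sketch (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series([3, 1.5, 3])

# Default method: tied values share the mean of the ranks they occupy.
avg = s.rank()
print(avg)   # 1.5 has rank 1.0; both 3s have rank (2 + 3) / 2 = 2.5
```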

In [171]:
#58. Make a series and find the rank of values

S=pd.Series([7,-5,7,4,7])
print(S)
S.rank()

0    7
1   -5
2    7
3    4
4    7
dtype: int64


0    4.0
1    1.0
2    4.0
3    2.0
4    4.0
dtype: float64

Ranks can also be assigned according to the order in which the values are observed in the data. 
Take [1.5, 3, 3] as the example again: the first 3 takes rank 2, and the second 3 takes rank 3. 

In [173]:
#59. Use the 'first' method in ranking

#The 'first' method gives identical values different ranks, in order of appearance, instead of the mean rank of the tied group

S.rank(method='first')

0    3.0
1    1.0
2    4.0
3    2.0
4    5.0
dtype: float64

**More tie-breaking methods with rank**

'average': The default; assign the mean rank to each entry in the tied group

'min': Use the minimum rank for the whole group

'max': Use the maximum rank for the whole group

'first': Assign ranks in the order the values appear in the data


In [176]:
#60. Use the min and max method in ranking

#Min - Means take minimum rank as the rank for identical values
#Max - Means take maximum rank as the rank for identical values


print(S.rank(method='min'))
print(S.rank(method='max'))

0    3.0
1    1.0
2    3.0
3    2.0
4    3.0
dtype: float64
0    5.0
1    1.0
2    5.0
3    2.0
4    5.0
dtype: float64
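
Putting the four methods side by side makes the differences easier to compare. This sketch builds a small comparison table for the same Series; the table layout and the names `methods` and `table` are choices made here for illustration:

```python
import pandas as pd

S = pd.Series([7, -5, 7, 4, 7])

# One column of ranks per tie-breaking method.
methods = ['average', 'min', 'max', 'first']
table = pd.DataFrame({m: S.rank(method=m) for m in methods})
table.insert(0, 'value', S)   # show the original values alongside

print(table)
```

Only the three tied 7s differ between columns; the untied values -5 and 4 get ranks 1 and 2 under every method.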


In [185]:
#61 Adding a row to a data frame under a new index label

D=pd.DataFrame(np.arange(4).reshape(2,2),
               index=list('ab'), columns=list('CD'))
print(D)
D.loc['d']=[3,3]
print(D)

   C  D
a  0  1
b  2  3
   C  D
a  0  1
b  2  3
d  3  3


In [195]:
#62 Rank the data frame above within each column (axis=0), in descending order; axis=1 would rank across each row
D.rank(axis=0, method='first',ascending=False)

     C    D
a  3.0  3.0
b  2.0  1.0
d  1.0  2.0
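
For comparison, here is a sketch of ranking across each row with `axis=1`, on the same three-row frame and with the default `average` method:

```python
import pandas as pd

D = pd.DataFrame({'C': [0, 2, 3], 'D': [1, 3, 3]}, index=['a', 'b', 'd'])

# Each row is ranked on its own: the smaller of the two values gets rank 1.
across_rows = D.rank(axis=1)
print(across_rows)
```

In row `d` both values are 3, so they form a tied group and each gets the mean rank 1.5.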
