# Pandas: Data analysis Tool

##### Data structures

1. Series: 1-dimensional (a single labeled sequence of values)
2. DataFrame: 2-dimensional (rows, columns)
3. Panel: 3-dimensional container of DataFrames (deprecated and later removed from pandas)

##### Import required libraries

In [1]:
import pandas as pd

##### To see methods associated with pandas

##### Create empty series 
Series is a class in the pandas module

In [2]:
pd.Series()
# In pandas versions before 2.0, an empty Series defaults to float64 and emits a
# DeprecationWarning; from pandas 2.0 onward the default dtype is 'object'.


Series([], dtype: float64)

# Components of Series
pd.Series(
    data=None,
    index=None,
    dtype=None,
    
    name=None,
    copy=False,
    fastpath=False,
)

In [3]:
# The cell above raises a deprecation warning (not an error) because the default dtype of an empty Series is changing. Provide the dtype explicitly to avoid it.

s=pd.Series(data=None,dtype = 'object')      
s

Series([], dtype: object)

##### Structure of Series

In [4]:
"""
A Series display has three parts:
the left side shows the index/labels (by default it starts from 0)
the right side shows the values/data
the last line shows the data type
"""
#pd.Series(data,index,dtype)

pd.Series(data=[10,20,30,40,55])       # It can also be written as pd.Series([10,20,30,40,55])

0    10
1    20
2    30
3    40
4    55
dtype: int64

** tuple, list, and dict are accepted as Series data, but a set is not allowed because it is unordered.
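
The note above can be checked with a short sketch. The variable names are made up for illustration, and sorting the set into a list is only one possible workaround, not something the notebook prescribes.

```python
import pandas as pd

# A tuple and a list are ordered, so both are accepted as data.
from_tuple = pd.Series((10, 20, 30))
from_list = pd.Series([10, 20, 30])

# A set has no defined order, so pandas rejects it with a TypeError.
try:
    pd.Series({10, 20, 30})
    set_rejected = False
except TypeError:
    set_rejected = True

# Hypothetical workaround: turn the set into a sorted list first.
from_set = pd.Series(sorted({10, 20, 30}))

print(set_rejected)
print(from_set.tolist())
```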

##### Series creation using dictionary

In [5]:
pd.Series({1:'Amol',2:'Ajay',3:'Amar'})      #key as index and value as data

1    Amol
2    Ajay
3    Amar
dtype: object

##### Series creation using range

In [6]:
pd.Series(range(1,6))

0    1
1    2
2    3
3    4
4    5
dtype: int64

##### Series creation using numpy

In [7]:
import numpy as np
pd.Series(np.arange(1,7))

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int32

##### Customizing the index of Series

In [8]:
s=pd.Series(data=[1,2,3,4,5],index=[10,20,30,40,50])
s

10    1
20    2
30    3
40    4
50    5
dtype: int64

In [9]:
s1=pd.Series(data=[10,20,30,40,50],index=('a','b','c','d','e'))
s1

a    10
b    20
c    30
d    40
e    50
dtype: int64

##### Customizing the datatype of Series elements to reduce memory usage

In [10]:
# This data does not need int64; int16 is enough

s1=pd.Series(data=[10,20,30,40,50],index=('a','b','c','d','e'),dtype='int16')
s1                    

a    10
b    20
c    30
d    40
e    50
dtype: int16

In [11]:
# Boolean data type is also possible for data

pd.Series(data=[10,20,30,40],index=['a','b','c','d'],dtype='bool')

a    True
b    True
c    True
d    True
dtype: bool

In [12]:
# False, zero, empty string, and None all convert to False

pd.Series(data=[False,0,'',None],index=['a','b','c','d'],dtype='bool')     

a    False
b    False
c    False
d    False
dtype: bool

In [13]:
# 'f2' requests a 2-byte float (float16) instead of the default float64

pd.Series(data=['10','20','30','40'],index=['a','b','c','d'],dtype='f2')

a    10.0
b    20.0
c    30.0
d    40.0
dtype: float16

##### Series creation using linspace method of numpy
It creates a Series of 8 evenly spaced elements from 1 to 2. The end point is included.

In [14]:
pd.Series(np.linspace(1,2,8), dtype = 'f2')

0    1.000000
1    1.142578
2    1.286133
3    1.428711
4    1.571289
5    1.713867
6    1.857422
7    2.000000
dtype: float16

##### Series creation using arange method of numpy
It creates a Series of elements from start up to stop; stop itself is excluded.

In [15]:
s = pd.Series(np.arange(101,111))
s

0    101
1    102
2    103
3    104
4    105
5    106
6    107
7    108
8    109
9    110
dtype: int32

##### Sequence of upcasting
If the data is heterogeneous, it is automatically upcast to a higher data type     
int==>float==>complex==>object (strings, and any mix that contains strings, are stored as object)
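
A small sketch of how the inferred dtype changes as the input becomes more mixed. The variable names are illustrative only.

```python
import pandas as pd

ints_and_floats = pd.Series([1, 2.5])          # int + float
with_complex = pd.Series([1, 2.5, 3 + 4j])     # int + float + complex
with_string = pd.Series([1, 2.5, "three"])     # a string forces the object dtype

print(ints_and_floats.dtype)
print(with_complex.dtype)
print(with_string.dtype)
```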

##### Indexing on Series

In [16]:
s[4]          # Access a Series element through its index label

105

In [17]:
s[4]=700     # Update value of Series using indexing

In [18]:
# Try to access 110 with a negative index: with an integer index, -1 is treated as a label, so this raises KeyError

s[-1]       

KeyError: -1
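
The KeyError arises because `-1` is looked up as a label, and no such label exists. A minimal sketch of position-based access with `iloc`, assuming the same kind of default integer index (the name `values` is made up for this example):

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(101, 111))   # default integer index 0..9

last_by_position = values.iloc[-1]        # position-based: the last element
last_by_label = values.loc[9]             # label-based: the element labeled 9

try:
    values.loc[-1]                        # there is no label -1
    label_missing = False
except KeyError:
    label_missing = True

print(last_by_position, last_by_label, label_missing)
```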

##### Slicing on Series

In [20]:
s[2:8:2]                 # Every second value from position 2 up to, but not including, position 8

2    103
4    700
6    107
dtype: int32

In [21]:
s[:3]=[1,2,3]            # To update first three values of series
s

0      1
1      2
2      3
3    104
4    700
5    106
6    107
7    108
8    109
9    110
dtype: int32

In [22]:
s[:]=range(1,11)          # To update all values of series

##### Info of Series

In [23]:
s.size                    

10

In [24]:
s.index

RangeIndex(start=0, stop=10, step=1)

In [25]:
for i in s.index:
    print(i)

0
1
2
3
4
5
6
7
8
9


In [26]:
s[s<5]=90            # Update every value less than 5 to 90
s

0    90
1    90
2    90
3    90
4     5
5     6
6     7
7     8
8     9
9    10
dtype: int32

In [27]:
s[s<6]=[10]     # A list must have one item per matching element (here only one value is < 6)
s

0    90
1    90
2    90
3    90
4    10
5     6
6     7
7     8
8     9
9    10
dtype: int32

In [28]:
for i in range(s.size):
    if s[i]==6:                    # Change value of 6 to 600
        s[i]=600
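
The same kind of update can usually be written without an explicit loop by assigning through a boolean mask. A sketch on made-up data (the name `numbers` is hypothetical):

```python
import pandas as pd

numbers = pd.Series([90, 90, 10, 6, 7, 6])

# The mask selects every element equal to 6; all of them are updated at once.
numbers[numbers == 6] = 600

print(numbers.tolist())
```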

In [29]:
s.values                           # To access values of series

array([ 90,  90,  90,  90,  10, 600,   7,   8,   9,  10])

In [30]:
s.dtype                            # To get datatype of Series elements

dtype('int32')

In [31]:
sorted(s)                          # Built-in sorted() returns the values as a sorted Python list

[7, 8, 9, 10, 10, 90, 90, 90, 90, 600]

##### To see descriptive analysis of Series data

In [32]:
s.describe()   
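# mean is the arithmetic average
# std is the sample standard deviation: sqrt(sum((x - mean)^2) / (N - 1))
# 50% is the median; with 10 values it is the mean of the 5th and 6th sorted values (10 and 90)
# Quantiles are linearly interpolated between two neighbouring sorted values:
# i is the value just below the quantile position, j is the value just above it
# quantile = i + (j - i) * fraction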

count     10.00000
mean     100.40000
std      180.12107
min        7.00000
25%        9.25000
50%       50.00000
75%       90.00000
max      600.00000
dtype: float64
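
As a check on the figures above, the standard deviation and the 25% quantile can be recomputed by hand. This is a sketch that assumes the pandas defaults (sample standard deviation and linear interpolation); the helper variable names are made up.

```python
import numpy as np
import pandas as pd

data = pd.Series([90, 90, 90, 90, 10, 600, 7, 8, 9, 10])

# Sample standard deviation: divide by N - 1, not N.
manual_std = np.sqrt(((data - data.mean()) ** 2).sum() / (len(data) - 1))

# 25% quantile by linear interpolation on the sorted values.
ordered = np.sort(data.to_numpy())
position = 0.25 * (len(ordered) - 1)      # fractional position in the sorted data
lower = int(np.floor(position))           # index of the value just below
fraction = position - lower               # distance towards the next value
manual_q25 = ordered[lower] + (ordered[lower + 1] - ordered[lower]) * fraction

print(manual_std, data.std())
print(manual_q25, data.quantile(0.25))
```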

In [33]:
s.describe()[['25%','75%']]       # To get some entries of series

25%     9.25
75%    90.00
dtype: float64

In [34]:
s.head()      # Retrieve the first five records

0    90
1    90
2    90
3    90
4    10
dtype: int32

In [35]:
s.head(3)    # Retrieve the specified number of records

0    90
1    90
2    90
dtype: int32

In [36]:
s.head(-4)    # Retrieve all rows except the last 4

0     90
1     90
2     90
3     90
4     10
5    600
dtype: int32

In [37]:
s.tail()       # Retrieve the last five rows

5    600
6      7
7      8
8      9
9     10
dtype: int32

In [38]:
s.tail(2)       # Retrieve the last specified number of rows

8     9
9    10
dtype: int32

In [39]:
s.tail(-2)      # Retrieve all rows except the first two

2     90
3     90
4     10
5    600
6      7
7      8
8      9
9     10
dtype: int32

In [40]:
s=pd.Series([10,20,10,30])
s

0    10
1    20
2    10
3    30
dtype: int64

In [41]:
s.unique()       # It returns unique data entries

array([10, 20, 30], dtype=int64)

In [42]:
s.nunique()     # It returns number of unique records

3

In [43]:
s.value_counts()       # It gives the number of occurrences of each value

10    2
20    1
30    1
dtype: int64

In [44]:
import pandas as pd
s = pd.Series([10,20,30])

In [45]:
s1 = pd.Series([1,2,3])

In [46]:
s

0    10
1    20
2    30
dtype: int64

In [47]:
s1

0    1
1    2
2    3
dtype: int64

##### Operations on series

In [48]:
# Addition

pd.Series.add(s,s1)     # s+s1

0    11
1    22
2    33
dtype: int64

In [49]:
# Cumulative addition

pd.Series.cumsum(s)

0    10
1    30
2    60
dtype: int64

In [50]:
# Subtraction

pd.Series.subtract(s,s1)  # s-s1

0     9
1    18
2    27
dtype: int64

In [51]:
# Multiplication

pd.Series.multiply(s,s1)    # s*s1

0    10
1    40
2    90
dtype: int64

In [52]:
# Division

pd.Series.divide(s,s1)     # s/s1

0    10.0
1    10.0
2    10.0
dtype: float64
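
The operations above pair elements by index label, not by position. When the labels differ, the operator form and the method form with `fill_value` behave differently. A minimal sketch (the names `left` and `right` are made up):

```python
import pandas as pd

left = pd.Series([1, 2], index=["a", "b"])
right = pd.Series([10, 20], index=["b", "c"])

aligned_sum = left + right                    # labels found on one side only give NaN
filled_sum = left.add(right, fill_value=0)    # a missing label is treated as 0

print(aligned_sum)
print(filled_sum)
```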

In [53]:
# Perform addition of all elements of series

pd.Series.sum(s)

60

In [54]:
# Cumulative minimum: each position holds the smallest value seen so far

pd.Series.cummin(s)       

0    10
1    10
2    10
dtype: int64

In [55]:
s = pd.Series([30,10,20])  # Once a new minimum appears, later positions keep it until a smaller value is found
pd.Series.cummin(s)

0    30
1    10
2    10
dtype: int64

In [56]:
pd.Series.cummax(s)       # Each position holds the largest value seen so far

0    30
1    30
2    30
dtype: int64

In [57]:
s = pd.Series([30,20,10])
pd.Series.cummax(s)  

0    30
1    30
2    30
dtype: int64

##### Merging two Series

In [58]:
# Series creation using numpy array

s = pd.Series(np.arange(101,107))
s

0    101
1    102
2    103
3    104
4    105
5    106
dtype: int32

In [59]:
s1 = pd.Series(np.arange(111,117)) 
s1

0    111
1    112
2    113
3    114
4    115
5    116
dtype: int32

In [60]:
s,s1                 # It gives result as series one after another

(0    101
 1    102
 2    103
 3    104
 4    105
 5    106
 dtype: int32,
 0    111
 1    112
 2    113
 3    114
 4    115
 5    116
 dtype: int32)

In [61]:
pd.concat([s,s1])      # Combines both Series but keeps each original index (Series.append was removed in pandas 2.0)

0    101
1    102
2    103
3    104
4    105
5    106
0    111
1    112
2    113
3    114
4    115
5    116
dtype: int32

In [62]:
s2=pd.concat([s,s1],ignore_index=True)     # Combine both Series and renumber the index from 0

In [63]:
s

0    101
1    102
2    103
3    104
4    105
5    106
dtype: int32

In [64]:
s.pop(2)    # Remove the item with label 2 (value 103) and return it; s is modified in place

103

In [65]:
s

0    101
1    102
3    104
4    105
5    106
dtype: int32

In [66]:
s2

0     101
1     102
2     103
3     104
4     105
5     106
6     111
7     112
8     113
9     114
10    115
11    116
dtype: int32

In [67]:
s2.nlargest()       # Return the 5 largest elements, with their index labels, in descending order

11    116
10    115
9     114
8     113
7     112
dtype: int32

In [68]:
s2.nlargest(2)     # To access largest 2 numbers from series       

11    116
10    115
dtype: int32

In [69]:
s2.nsmallest()     # To access 5 smallest values from series with indexes in ascending order

0    101
1    102
2    103
3    104
4    105
dtype: int32

In [70]:
s2.nsmallest(2)     # Retrieve the 2 smallest entries of the Series in ascending order

0    101
1    102
dtype: int32

In [71]:
s = pd.Series([2,5,3,5,5,2])
s

0    2
1    5
2    3
3    5
4    5
5    2
dtype: int64

In [72]:
s.nlargest()      # Ties are listed in order of first appearance (keep='first' is the default)

1    5
3    5
4    5
2    3
0    2
dtype: int64

In [73]:
s.nlargest(keep='last')       # Ties are listed in order of last appearance

4    5
3    5
1    5
2    3
5    2
dtype: int64

In [74]:
s.nlargest(3)[::-1]      # To get reversed output

4    5
3    5
1    5
dtype: int64
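
Besides `'first'` and `'last'`, `nlargest` also accepts `keep='all'`, which returns every tied value even if that yields more than `n` rows. A sketch on the same data (the name `ties` is made up):

```python
import pandas as pd

ties = pd.Series([2, 5, 3, 5, 5, 2])

# Asking for the 2 largest values, but 5 occurs three times:
top_first = ties.nlargest(2)               # exactly 2 rows
top_all = ties.nlargest(2, keep="all")     # all rows tied for the top value

print(len(top_first), len(top_all))
```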

In [75]:
s.dtype

dtype('int64')

In [76]:
s.astype('int32')         # Change datatype after creation    

0    2
1    5
2    3
3    5
4    5
5    2
dtype: int32

In [77]:
s = pd.Series(['HR','Developer','coder','HR'])
s

0           HR
1    Developer
2        coder
3           HR
dtype: object

In [78]:
s.__sizeof__()

374

In [79]:
s1=s.astype('category')       # Only 3 distinct categories are present: ['Developer', 'HR', 'coder']
s1 

0           HR
1    Developer
2        coder
3           HR
dtype: category
Categories (3, object): ['Developer', 'HR', 'coder']

In [80]:
s1.__sizeof__()             # For this tiny Series the category version is larger than the plain object version

427
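
The comparison above uses only four rows, where the bookkeeping of the category dtype outweighs any savings. A rough sketch with a longer, repetitive Series suggests the opposite trend. The name `roles` and the 10,000 repeats are made up for illustration, and `memory_usage(deep=True)` is used here in place of `__sizeof__()`.

```python
import pandas as pd

# Three distinct strings repeated many times, stored as Python objects.
roles = pd.Series(["HR", "Developer", "coder"] * 10_000, dtype="object")
roles_cat = roles.astype("category")

object_bytes = roles.memory_usage(deep=True)
category_bytes = roles_cat.memory_usage(deep=True)

print(object_bytes, category_bytes)
```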

In [81]:
s1.ndim                     # To get dimension of series

1

In [82]:
s1.dtype                   # To get datatype of series

CategoricalDtype(categories=['Developer', 'HR', 'coder'], ordered=False)

In [83]:
s.dtype

dtype('O')

In [84]:
s = pd.Series(np.arange(101,107))
s

0    101
1    102
2    103
3    104
4    105
5    106
dtype: int32

##### Condition on Series

In [85]:
s.where(s<105)     # Keep values where the condition holds; replace the rest with NaN

0    101.0
1    102.0
2    103.0
3    104.0
4      NaN
5      NaN
dtype: float64

In [86]:
s.where(s<103,other=0)      # Use other to set the replacement value where the condition is False

0    101
1    102
2      0
3      0
4      0
5      0
dtype: int32

In [87]:
s            # s remains unchanged

0    101
1    102
2    103
3    104
4    105
5    106
dtype: int32

In [88]:
# To make changes inplace set inplace = True

s.where(s<103,other=0,inplace=True)
s

0    101
1    102
2      0
3      0
4      0
5      0
dtype: int32

In [89]:
# A condition on a Series can be used as a boolean index

s<103    # The condition returns a boolean Series

0    True
1    True
2    True
3    True
4    True
5    True
dtype: bool

In [90]:
# Passing this boolean Series as an index selects the elements where it is True; here every value is below 103, so the whole Series is returned

s[s<103]

0    101
1    102
2      0
3      0
4      0
5      0
dtype: int32
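
To separate selection from assignment, here is a minimal sketch with hypothetical names `scores` and `mask`. Indexing with a mask only reads values; the Series changes only when the masked expression is on the left of an assignment.

```python
import pandas as pd

scores = pd.Series([101, 102, 103, 104])
mask = scores < 103

selected = scores[mask]      # selection: returns a new Series, scores is unchanged
scores[~mask] = 0            # assignment: modifies scores in place

print(selected.tolist())
print(scores.tolist())
```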

In [91]:
s[:4]      # Use slicing to get the first four records

0    101
1    102
2      0
3      0
dtype: int32

##### How to deal with missing values

In [92]:
s = pd.Series([np.nan,2,np.nan,2,np.nan,3,4,2,np.nan])
s

0    NaN
1    2.0
2    NaN
3    2.0
4    NaN
5    3.0
6    4.0
7    2.0
8    NaN
dtype: float64

In [93]:
s.isna()      # Gives boolean output

0     True
1    False
2     True
3    False
4     True
5    False
6    False
7    False
8     True
dtype: bool

In [94]:
s.isna().sum()           # To get count of total nan values

4

In [95]:
s.isnull().sum()          # isnull() is an alias of isna(), so the count is the same

4

##### Fill nan

In [96]:
# Fill NaN values(Missing values) by zero

s.fillna(0)

0    0.0
1    2.0
2    0.0
3    2.0
4    0.0
5    3.0
6    4.0
7    2.0
8    0.0
dtype: float64

In [97]:
# Fill missing values by string

s.fillna('Missing')

0    Missing
1        2.0
2    Missing
3        2.0
4    Missing
5        3.0
6        4.0
7        2.0
8    Missing
dtype: object

In [98]:
# Fill missing values by mean of all values

s.fillna(s.mean())

0    2.6
1    2.0
2    2.6
3    2.0
4    2.6
5    3.0
6    4.0
7    2.0
8    2.6
dtype: float64

##### Measures of central tendency

In [99]:
s

0    NaN
1    2.0
2    NaN
3    2.0
4    NaN
5    3.0
6    4.0
7    2.0
8    NaN
dtype: float64

In [100]:
s.mean()           # Mean is the average; NaN values are skipped by default

2.6

In [101]:
s.median()      # It is the middle value in case of odd number of values
                # Average of two middle values in case of even number of values

2.0

In [102]:
s.mode()         # It gives most frequent value

0    2.0
dtype: float64

##### Forward filling and backward filling

In [103]:
s

0    NaN
1    2.0
2    NaN
3    2.0
4    NaN
5    3.0
6    4.0
7    2.0
8    NaN
dtype: float64

In [104]:
# Forward fill replaces each NaN with the previous valid value; a leading NaN stays as it is

s.ffill()      

0    NaN
1    2.0
2    2.0
3    2.0
4    2.0
5    3.0
6    4.0
7    2.0
8    2.0
dtype: float64

In [105]:
# Backward fill replaces each NaN with the next valid value; a trailing NaN stays as it is

s.bfill()       

0    2.0
1    2.0
2    2.0
3    2.0
4    3.0
5    3.0
6    4.0
7    2.0
8    NaN
dtype: float64

##### ffill and bfill using fillna method

In [106]:
s.fillna(method='ffill')        # Forward fill (the method argument is deprecated since pandas 2.1; prefer s.ffill())

0    NaN
1    2.0
2    2.0
3    2.0
4    2.0
5    3.0
6    4.0
7    2.0
8    2.0
dtype: float64

In [108]:
s.fillna(method='bfill')     # Back fill (the method argument is deprecated since pandas 2.1; prefer s.bfill())

0    2.0
1    2.0
2    2.0
3    2.0
4    3.0
5    3.0
6    4.0
7    2.0
8    NaN
dtype: float64
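
Each fill direction leaves one edge untouched: a leading NaN survives a forward fill and a trailing NaN survives a backward fill. Chaining the two is one way to cover both edges. A sketch on made-up data (the name `gaps` is hypothetical):

```python
import numpy as np
import pandas as pd

gaps = pd.Series([np.nan, 2, np.nan, 3, np.nan])

# Forward fill first, then back fill whatever is still missing at the start.
filled = gaps.ffill().bfill()

print(filled.tolist())
```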

##### Rather than filling NaN, the simplest way is to remove it
If the number of missing values (NaN) is small compared with the total number of records, removing them is usually the simplest option.

In [109]:
s

0    NaN
1    2.0
2    NaN
3    2.0
4    NaN
5    3.0
6    4.0
7    2.0
8    NaN
dtype: float64

In [110]:
s.dropna(inplace=True)
s

1    2.0
3    2.0
5    3.0
6    4.0
7    2.0
dtype: float64
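
After dropping NaN the surviving rows keep their original labels, so the index has gaps. If a clean 0-based index is wanted, one option is to drop without `inplace` and reset the index. A sketch with made-up names `raw` and `cleaned`:

```python
import numpy as np
import pandas as pd

raw = pd.Series([np.nan, 2, np.nan, 3])

# dropna() returns a new Series; reset_index(drop=True) renumbers it from 0.
cleaned = raw.dropna().reset_index(drop=True)

print(cleaned)
```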