# Pandas


## The Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows.


In [1]:
import pandas as pd
import numpy as np

data = pd.Series([0.25,0.5,0.75,1])
print(data)

# In the preceding output, the Series wraps both a sequence of values and a
# sequence of indices; these can be accessed with the values and index attributes

print(F"Values = {data.values}")
print(F"indices = {data.index}")

# The index is an array-like object of type pd.Index

# Like a NumPy array, data can be accessed by the associated index via square-bracket notation

print(F"\n 2nd element = {data[1]}, return type = {type(data[1])}")
print(F"\n slicing \n{data[1:3]}, \n return type = {type(data[1:3])}")

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
Values = [0.25 0.5  0.75 1.  ]
indices = RangeIndex(start=0, stop=4, step=1)

 2nd element = 0.5, return type = <class 'numpy.float64'>

 slicing 
1    0.50
2    0.75
dtype: float64, 
 return type = <class 'pandas.core.series.Series'>
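
The same construction works from a NumPy array. The following is a minimal sketch (the names `arr`, `ser`, and `values` are illustrative and not part of the cell above) showing that the values come back as an ndarray and that the default index counts from 0.

```python
import numpy as np
import pandas as pd

# Build a Series from a NumPy array instead of a list (illustrative names).
arr = np.array([0.25, 0.5, 0.75, 1.0])
ser = pd.Series(arr)

# The underlying values are available as a NumPy array.
values = ser.to_numpy()
print(type(values), values.dtype)

# With no index given, pandas assigns a default integer index starting at 0.
print(ser.index)

# Scalar access returns a single element; slicing returns another Series.
print(type(ser[1]), type(ser[1:3]))
```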


In [2]:
# The explicit index definition gives the Series object additional capabilities
# For example, the index need not be an integer but can be of any desired type

data = pd.Series([0.25,0.5,0.75,1],index=['a','b','c','d'])
print(data)

# Accessing with index works
print(F"\n data['b'] = {data['b']}")

# We can even use noncontiguous or nonsequential indices:

data = pd.Series([0.25, 0.5, 0.75, 1.0],index=[2, 5, 3, 7])
print("\n")
print(data)

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

 data['b'] = 0.5


2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64
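
With a non-sequential integer index, a label and a position can refer to different elements. The sketch below assumes the `loc` and `iloc` indexers, which are covered later in this document, to make the distinction explicit; the variable names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# Look up by label: the value stored under index label 5.
by_label = data.loc[5]

# Look up by position: the third value in storage order.
by_position = data.iloc[2]

print(by_label, by_position)

# The index keeps the order it was given; pandas does not sort it.
print(list(data.index))
```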


In [3]:
# Series as a specialized dictionary
# A dictionary is a structure that maps arbitrary keys to arbitrary values
# A Series object maps typed keys to typed values

# We can create a Series object from a dictionary to make the analogy clear
population_dict = { 'California': 38332521,
                    'Texas': 26448193,
                    'New York': 19651127,
                    'Florida': 19552860,
                    'Illinois': 12882135}
population = pd.Series(population_dict)
print(population)


# We can perform dictionary-style operations
print(F"\n dictionary styled operation {population['California']}")

# The Series also supports array-style operations like slicing
print(F"\n Array styled slicing operation = {population['California':'Florida']}")   # end is inclusive here

# Other ways of creating Series object

# data can be a scalar, which is repeated to fill the specified index

print(F"\n  Series created from scalar \n {pd.Series(5, index=[100, 200, 300])}")

# When a dictionary is passed, the index defaults to the dictionary keys; to keep only selected keys, we can pass an index list
# In this case the Series is populated with the specified indices only

print(F"\n passing indices when passing dictionary \n {pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])}") 




California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

 dictionary styled operation 38332521

 Array styled slicing operation = California    38332521
Texas         26448193
New York      19651127
Florida       19552860
dtype: int64

  Series created from scalar 
 100    5
200    5
300    5
dtype: int64

 passing indices when passing dictionary 
 3    c
2    a
dtype: object
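
One detail worth seeing in isolation is what happens when the requested index contains a key that the dictionary lacks. The sketch below uses hypothetical data (`source`, `subset`) rather than anything from the cell above.

```python
import pandas as pd

# Hypothetical dictionary; 'z' is deliberately absent from it.
source = {'a': 1, 'b': 2, 'c': 3}

# Only the requested labels are kept, in the requested order.
subset = pd.Series(source, index=['c', 'a', 'z'])
print(subset)

# The label with no matching key is filled with NaN, which also
# forces the integer values to be stored as floats.
print(subset.isna())
```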


## DataFrames
A DataFrame is analogous to a two-dimensional array with flexible row indices and flexible column names.

Just as we can think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, we can think of a DataFrame as a sequence of aligned Series objects. Here, by aligned we mean they share the **same index**.
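
To illustrate what alignment means in practice, here is a minimal sketch with two hypothetical Series (`height` and `weight`, which do not appear elsewhere in this document) whose indices only partly overlap. Under these assumptions, pandas matches rows by label and fills the gaps with NaN.

```python
import pandas as pd

# Two hypothetical Series whose indices only partly overlap.
height = pd.Series({'x': 1.0, 'y': 2.0})
weight = pd.Series({'y': 20.0, 'z': 30.0})

# Rows are matched by index label, not by position.
frame = pd.DataFrame({'height': height, 'weight': weight})
print(frame)

# Labels missing from one of the inputs show up as NaN in that column.
print(frame.isna())
```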

In [4]:
# To demonstrate this, let's create a DataFrame by combining two Series objects

area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
'Florida': 170312, 'Illinois': 149995}

area = pd.Series(area_dict)

print(F"\n Area = ")
print(area)

# We can use a dictionary to construct a single two-dimensional object containing this information

states = pd.DataFrame({'population': population, 'area': area})
print("\n DataFrame created by merging two Series Objects with common indices")
print(states)

# Like the Series object, the DataFrame has an index attribute that gives access to the index labels

print("\n Index column of DataFrame")
print(states.index)

# Additionally, the DataFrame has a columns attribute, which is an Index object holding the column names.

print("\n Index object holding the names of columns")
print(states.columns)

# Thus the DataFrame can be thought of as a generalization of a two-dimensional NumPy array where
# both the rows and the columns have a generalized index for accessing the data

# DataFrames as specialized dictionaries 
# We can think of DataFrames as special dictionaries where a DataFrame maps a column name 
# to a Series of column data

print("\n DataFrames as dictionaries where column name acts as key and column data acts as values")
print(states['area'])

# Creating DataFrame from a single Series object

print(F"\n Creating DataFrame from a single Series object")
print(pd.DataFrame(population, columns=['population']))  # we specify the column name; otherwise the single column would be labeled 0

# Creating DataFrame from a list of dicts 

print(F"\n Creating DataFrame from a list of dicts")
data = [{'a' : i + 1, 'b' : 2 * i + 1} for i in range(3)]
print(pd.DataFrame(data))

# Even if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e., "not a number") values

print("\n Pandas fills missing values with NaN")
print(pd.DataFrame([{'a' : 1, 'b' : 2}, {'b': 3, 'c':4}]))

# Creating DataFrame from a 2-dimensional numpy array

print(F"\n Creating DataFrame from a 2-dimensional numpy array")
print(pd.DataFrame(np.random.rand(3, 2),columns=['foo', 'bar'],index=['a', 'b', 'c']))



 Area = 
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

 DataFrame created by merging two Series Objects with common indices
            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995

 Index column of DataFrame
Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

 Index object holding the names of columns
Index(['population', 'area'], dtype='object')

 DataFrames as dictionaries where column name acts as key and column data acts as values
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

 Creating DataFrame from a single Series object
            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135

 Creatin

In [5]:
# The Index Object

# We have seen that both Series and DataFrame have an explicit index that lets us reference and modify data.
# An Index can be viewed as an immutable array or as an ordered multiset, because an Index can hold duplicate values

# Let's construct Index from a list of integers 

print("Constructing  Index from a list of integers ")
ind = pd.Index([2,3,5,7,8])
print(ind)

# An Index is immutable, but it still supports read operations such as slicing and element access through []
print("\n Accessing from []")
print(ind[1])

print("\n Slicing")
print(ind[::-1])

print("\n Index also has some attributes luke numpy arrays")
print(F"size = {ind.size}, shape = {ind.shape}, dim = {ind.ndim}, dtype = {ind.dtype}")


# Index as ordered multiset

# Pandas objects are designed to facilitate operations such as joins across datasets,
# which depend on many aspects of set arithmetic. The Index object follows many of
# the conventions used by Python’s built-in set data structure, so that unions, intersections,
# differences, and other combinations can be computed in a familiar way.
# The named methods are used here because the |, & and ^ operators no longer
# perform set operations on an Index in recent pandas versions.

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

print(F"\n indA = {indA},\n indB = {indB}")
print(F"\n union = {indA.union(indB)}")
print(F"\n Intersection = {indA.intersection(indB)}")
print(F"\n symmetric difference = {indA.symmetric_difference(indB)}")



Constructing  Index from a list of integers 
Int64Index([2, 3, 5, 7, 8], dtype='int64')

 Accessing from []
3

 Slicing
Int64Index([8, 7, 5, 3, 2], dtype='int64')

 Index also has some attributes luke numpy arrays
size = 5, shape = (5,), dim = 1, dtype = int64

 indA = Int64Index([1, 3, 5, 7, 9], dtype='int64'),
 indB = Int64Index([2, 3, 5, 7, 11], dtype='int64')

 union = Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

 Intersection = Int64Index([3, 5, 7], dtype='int64')

 symmetric difference = Int64Index([1, 2, 9, 11], dtype='int64')
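
The immutability mentioned in the cell above is easy to check directly. The sketch below assumes nothing beyond the documented `Index` methods `union`, `intersection`, and `symmetric_difference`; the names `immutable`, `ind_a`, and `ind_b` are illustrative.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 8])

# Item assignment on an Index is rejected with a TypeError.
immutable = False
try:
    ind[1] = 0
except TypeError as err:
    immutable = True
    print(f"Index is immutable: {err}")

# Set-style combinations using the named methods.
ind_a = pd.Index([1, 3, 5, 7, 9])
ind_b = pd.Index([2, 3, 5, 7, 11])

union = ind_a.union(ind_b)
common = ind_a.intersection(ind_b)
either = ind_a.symmetric_difference(ind_b)
print(list(union), list(common), list(either))
```

Because an Index cannot be changed in place, it can be shared between several Series and DataFrames without one of them altering the labels seen by the others.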


In [13]:
# Series as dictionary

# Like a dictionary, the Series object provides a mapping from a collection of keys to a
# collection of values

data = pd.Series([0.25, 0.5, 0.75, 1.0],index=['a', 'b', 'c', 'd'])

# Performing dictionary operations 

print("in operator")
print('a' in data)

print("\n d.keys()")
print(data.keys())

print("\n d.items()")
print(list(data.items()))

print("\n Adding a new key-value pair")
data['e'] = 1.25
print(data)


# Series as one-dimensional array

print("\n Slicing explicit indexing")
print(data['a':'c'])

print("\n Slicing implicit indexing")
print(data[0:2])

print("\n Masking")
print(data[(data > 0.3) & (data < 0.8)])

print("\n Fancy indexing")
print(data[['a','e']])

# A potential source of confusion: with explicit (label) slicing, data['a':'c'], the end label 'c' is included,
# whereas with implicit (positional) slicing, data[0:2], the end position 2 is excluded

in operator
True

 d.keys()
Index(['a', 'b', 'c', 'd'], dtype='object')

 d.items()
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

 Adding a new key-value pair
a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

 Slicing explicit indexing
a    0.25
b    0.50
c    0.75
dtype: float64

 Slicing implicit indexing
a    0.25
b    0.50
dtype: float64

 Masking
b    0.50
c    0.75
dtype: float64

 Fancy indexing
a    0.25
e    1.25
dtype: float64
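
The inclusive and exclusive slice ends are easiest to compare side by side. This is a small sketch that rebuilds the same five-element Series; the names `by_label`, `by_position`, and `mask` are illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25], index=list('abcde'))

# Label-based slice: the end label 'c' is part of the result.
by_label = data['a':'c']

# Position-based slice: position 2 is not part of the result.
by_position = data[0:2]

print(list(by_label.index), list(by_position.index))

# A boolean mask keeps only the rows where the condition is True.
mask = (data > 0.3) & (data < 0.8)
print(data[mask])
```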


In [19]:
# Indexers: loc and iloc (ix has been removed from pandas)

# Slicing and indexing can be a source of confusion when a Series has an explicit integer index:
# an indexing operation such as data[1] uses the explicit index, while a slicing operation like
# data[1:3] uses the implicit, position-based index

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

print(data)

print("\n Explicit index when indexing")
print(data[1])

print("\n Implicit index when slicing")
print(data[1:3])

# To avoid this confusion, Pandas provides special indexer attributes.

print("\n loc always uses explicit indexing")
print(data.loc[1])
print(data.loc[1:3])

print("\n iloc always uses implicit indexing")
print(data.iloc[1])
print(data.iloc[1:3])

# With integer indexes, loc and iloc keep the code clean and readable by making the intent explicit.

1    a
3    b
5    c
dtype: object

 Explicit index when indexing
a

 Implicit index when slicing
3    b
5    c
dtype: object

 loc always uses explicit indexing
a
1    a
3    b
dtype: object

 iloc always uses implicit indexing
b
3    b
5    c
dtype: object
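
The difference between the two indexers can be stated as checks rather than printed output. The sketch below reuses the same three-element Series; the result names are illustrative.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc works on labels: label 1 is the first element,
# and the slice 1:3 includes the end label 3.
label_item = data.loc[1]
label_slice = data.loc[1:3]

# iloc works on positions: position 1 is the second element,
# and the slice 1:3 stops before position 3.
pos_item = data.iloc[1]
pos_slice = data.iloc[1:3]

print(label_item, list(label_slice.index))
print(pos_item, list(pos_slice.index))
```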


In [51]:
# Data Selection in DataFrame

# A DataFrame can be thought of either as a two-dimensional structured array or as a
# dictionary with column names as keys and Series objects sharing the same index as values

# DataFrames as dictionaries

area = pd.Series({'California': 423967, 'Texas': 695662,
'New York': 141297, 'Florida': 170312,
'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127, 'Florida': 19552860,
'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})

print("The individual Series that make up that dataframe can be accessed by dictionary style indexing of the column name")
print(data["area"])

print("\n e can also use . notation but we need to careful as it should not conflict with any builtin attributes or mathods")
print(data.area)  # data.pop would not return the column: it resolves to the built-in DataFrame.pop method
print(data.pop is data["pop"])  # False, because data.pop is the method and not the 'pop' column

print("\n Dictionary styled syntax can also be used to add new columns")
data["density"] = data["pop"] / data["area"]
print(data)

# DataFrames as two-dimensional arrays

print("\n DataFramea as a 2D array")
print(data.values)

print(F"\n type of data.values = {type(data.values)}")   # it's numpy.ndarray
print("\n We can perform array like operations on dataframe itself eg transpose")
print(data.T)

# In NumPy the first index in arr[a, b, ...] selects the row, whereas in a DataFrame data['col'] selects a column
# A single row label cannot be selected with [] directly, but loc and iloc can select rows
# loc and iloc can also select columns in the same call

print("\n iloc - implicit indexing")
print(data.iloc[:3,:2]) # parallel access of columns with integer indexing (implicit indexing)


print("\n loc - explicit indexing")
print(data.loc[:"New York", :"pop"])

print("\n pandas removed the ix indexer because it was subject to the same confusion")

print("\n While using loc and iloc , we can also use filters (masks) for rows")
print(data.loc[data.density > 100, ['pop' , 'density']])

print("\n Any of these indexing convention can be used to modify data")
data.iloc[0,2] = 90
print(data)

# Some conventions might seem odd but are quite useful in practice
print("\n Though we cannot access the rows directly with [], we can slice them")
print(data["California":"New York"])   # explicit indexing 
print(data[0:3])                       # implicit indexing

# Direct masking with a boolean Series is interpreted row-wise

print("\n Direct masking is interpreted row-wise")
print(data[data.density > 100])

The individual Series that make up that dataframe can be accessed by dictionary style indexing of the column name
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

 e can also use . notation but we need to careful as it should not conflict with any builtin attributes or mathods
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
False

 Dictionary styled syntax can also be used to add new columns
              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763

 DataFramea as a 2D array
[[4.23967000e+05 3.83325210e+07 9.04139261e+01]
 [6.95662000e+05 2.64481930e+07 3.80187404e+01]
 [1.41297000e+05 1.96511270e+07 1.39076746e+02]
 [1.70312000e+05 1.95528
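
The selection patterns from the cell above can be condensed into a shorter sketch. It assumes a three-state subset of the same figures; the names `dense` and `first_two` are illustrative.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297})
pop = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127})
data = pd.DataFrame({'area': area, 'pop': pop})

# Dictionary-style assignment adds a derived column.
data['density'] = data['pop'] / data['area']

# Attribute access is unsafe for this column: data.pop is a method.
print(callable(data.pop))

# loc combines a row mask with a list of column labels.
dense = data.loc[data['density'] > 100, ['pop', 'density']]

# iloc selects by position: first two rows, first two columns.
first_two = data.iloc[:2, :2]

print(dense)
print(first_two)
```

Bracket access (`data['pop']`) always refers to the column, which is why it is the safer habit when column names may collide with DataFrame attributes.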

In [55]:
# UFuncs on Series and DataFrames

rng = np.random.RandomState(42)   # creates a seeded random number generator
ser = pd.Series(rng.randint(0,10,4))   # 4 random integers from 0 to 9 (the upper bound 10 is exclusive)
df = pd.DataFrame(rng.randint(0,10,(3,4)))  # 3 x 4 matrix of random integers from 0 to 9

print("Series = ")
print(ser)
print("\n DataFrame = ")
print(df)

print("\n If we apply UFuncs on these object result would be another pandas object with indices preserved")
print(np.exp(ser))
print("\n")
print(np.sin(df * np.pi / 4))


Series = 
0    6
1    3
2    7
3    4
dtype: int64

 DataFrame = 
   0  1  2  3
0  6  9  2  6
1  7  4  3  7
2  7  2  5  4

 If we apply UFuncs on these object result would be another pandas object with indices preserved
0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64


          0             1         2             3
0 -1.000000  7.071068e-01  1.000000 -1.000000e+00
1 -0.707107  1.224647e-16  0.707107 -7.071068e-01
2 -0.707107  1.000000e+00 -0.707107  1.224647e-16
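
Index preservation is easier to see with non-default labels. The sketch below uses hypothetical labels (`w` to `z`, `p`, `q`, `r1`, `r2`) that are not part of the cell above, and checks that a NumPy ufunc returns a pandas object carrying the same labels.

```python
import numpy as np
import pandas as pd

# Hypothetical labels chosen so they are easy to spot in the result.
ser = pd.Series([6, 3, 7, 4], index=['w', 'x', 'y', 'z'])
df = pd.DataFrame([[6, 9], [7, 4]], columns=['p', 'q'], index=['r1', 'r2'])

# Applying a ufunc returns a pandas object, not a bare ndarray.
result = np.exp(ser)
scaled = np.sin(df * np.pi / 4)

# Row and column labels carry over unchanged.
print(result.index.equals(ser.index))
print(scaled.index.equals(df.index), scaled.columns.equals(df.columns))
```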
