## Getting started with Pandas

pandas contains data structures and data manipulation tools designed to make data cleaning and analysis fast and convenient in Python.

pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib.


In [1]:
#import libraries

import numpy as np
import pandas as pd

In [2]:
# import Series and DataFrame into the local namespace since they are used frequently
from pandas import Series, DataFrame

### Series

A Series is a one-dimensional array-like object containing a sequence of values of the same type (similar to NumPy types) and an associated array of data labels, called its index.

In [3]:
obj = pd.Series([4,7,5])
obj

0    4
1    7
2    5
dtype: int64

In [6]:
# array representation of series
obj.array


<PandasArray>
[4, 7, 5]
Length: 3, dtype: int64

In [5]:
#index object of series
obj.index

RangeIndex(start=0, stop=3, step=1)

In [7]:
# often we create a Series with an index that labels each data point

obj2 = pd.Series([4,5,6], index=["d","b","c"])
obj2.index

Index(['d', 'b', 'c'], dtype='object')

In [8]:
#we can use labels to select single values 

obj2["d"]

4

In [9]:
# to select a set of values, pass a list of labels
obj2[["d", "b"]]

d    4
b    5
dtype: int64

In [11]:
# filtering with a Boolean array, scalar multiplication, or applying math functions, will preserve the index-value link

obj2[obj2 >1]


d    4
b    5
c    6
dtype: int64

In [12]:
obj2*2

d     8
b    10
c    12
dtype: int64

In [13]:
np.exp(obj2)  # NumPy functions also preserve the index-value link

d     54.598150
b    148.413159
c    403.428793
dtype: float64

##### Another way to think about a Series is as a fixed-length, ordered dictionary, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dictionary:

In [16]:
'b' in obj2


True
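The dictionary analogy can be pushed a little further. The following is a minimal sketch, using a hypothetical Series named `prices` (not part of the notebook above), of three dictionary-like operations: membership tests, `get` with a default, and iteration over `items()`.

```python
import pandas as pd

# Hypothetical example data, chosen only for illustration.
prices = pd.Series([4, 5, 6], index=["d", "b", "c"])

# Membership tests look at the index labels, like dictionary keys.
has_b = "b" in prices   # "b" is a label
has_4 = 4 in prices     # 4 is a value, not a label

# get() returns a default instead of raising KeyError for a missing label.
missing = prices.get("z", -1)

# items() yields (label, value) pairs in index order.
pairs = list(prices.items())

print(has_b, has_4, missing)
print(pairs)
```

Note that `in` checks labels, not values, which matches how `in` works on a dictionary.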

In [17]:
# creating a Series from a dictionary by passing it to the constructor
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [18]:
# series to dictionary using method
obj4 = obj3.to_dict()
obj4

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [21]:
# passing an index with the dictionary keys in the order you want them to appear in the resulting Series
states = ["California", "Ohio", "Oregon", "Texas"]
obj3 = pd.Series(sdata, index = states)
obj3

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [23]:
## detecting missing data
pd.isna(obj3)


California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [24]:
pd.notna(obj3)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [26]:
obj3.isna() #series also has these methods

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [32]:
# A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations
obj4 = pd.Series(obj4)  # rebuild a Series from the dictionary created above
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
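Labels that appear in only one operand produce NaN, as the output above shows. If that is not what you want, one option is the `add` method with `fill_value`. The sketch below uses two small hypothetical Series, `left` and `right`, to compare both behaviors.

```python
import pandas as pd

# Hypothetical data: the two Series share only the "Texas" label.
left = pd.Series({"Ohio": 35000, "Texas": 71000})
right = pd.Series({"Texas": 1000, "Utah": 5000})

# The + operator aligns on labels; non-overlapping labels become NaN.
plain = left + right

# add() with fill_value treats a label missing from one side as 0.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```

`fill_value` only helps when a label is missing from one operand; a label missing from both would still give NaN.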

In [33]:
# both the Series object itself and its index have a name attribute

obj4.name = "pop"
obj4.index.name = "state"
obj4

state
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
Name: pop, dtype: int64

In [36]:
# a Series's index can be altered in place by assignment
obj.index = ["Bob", "Steve", "Jeff"]
obj

Bob      4
Steve    7
Jeff     5
dtype: int64

## DataFrame

A DataFrame represents a rectangular table of data and contains an ordered, named collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.).
The DataFrame has both a row and column index; it can be thought of as a dictionary of Series all sharing the same index.
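To make the "dictionary of Series" picture concrete, here is a small sketch that builds a DataFrame from two Series with partly different labels. The names `pop`, `area`, and `table`, and the numbers in `area`, are made up for illustration.

```python
import pandas as pd

# Two Series with overlapping but not identical labels (hypothetical values).
pop = pd.Series({"Ohio": 1.5, "Nevada": 2.4})
area = pd.Series({"Ohio": 44.8, "Utah": 84.9})

# Each dictionary key becomes a column; the row index is the union of labels.
table = pd.DataFrame({"pop": pop, "area": area})

# Pulling a column back out gives a Series that shares the DataFrame's index.
pop_column = table["pop"]

print(table)
print(type(pop_column))
```

Labels missing from one of the input Series show up as NaN in that column.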

In [37]:
# creating a DataFrame from a dictionary of equal-length lists
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

In [38]:
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [39]:
# head() returns only the first five rows, which is handy for large DataFrames
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


In [40]:
# tail() returns the last five rows
frame.tail()

    state  year  pop
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2

In [42]:
# if we pass a sequence of column names, the columns are arranged in that order

pd.DataFrame(data, columns=["year", "state", "pop"])


   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


In [43]:
# passing a column name that is not in the dictionary gives a column of missing values (NaN)

pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

   year   state  pop debt
0  2000    Ohio  1.5  NaN
1  2001    Ohio  1.7  NaN
2  2002    Ohio  3.6  NaN
3  2001  Nevada  2.4  NaN
4  2002  Nevada  2.9  NaN
5  2003  Nevada  3.2  NaN


In [49]:
# retrieving column in pandas
frame["state"]

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

In [50]:
# attribute-style access also works when the column name is a valid Python identifier
frame.state

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

In [51]:
# modifying column values
# note: frame2 is just another name for the same object, so these changes also appear in frame
frame2 = frame
frame2['pop'] = 3

In [52]:
frame2

    state  year  pop
0    Ohio  2000    3
1    Ohio  2001    3
2    Ohio  2002    3
3  Nevada  2001    3
4  Nevada  2002    3
5  Nevada  2003    3
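Because plain assignment such as `frame2 = frame` only binds a second name to the same DataFrame, edits made through either name are visible through both. The sketch below contrasts that with `copy()`, using hypothetical names `alias` and `snapshot` and a small made-up DataFrame.

```python
import pandas as pd

# Small hypothetical DataFrame for illustration.
frame = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]})

alias = frame            # second name for the same object
snapshot = frame.copy()  # independent copy of the data

# Overwrite the column through the alias.
alias["pop"] = 3

same_object = alias is frame
frame_pop = frame["pop"].tolist()        # changed, because alias is frame
snapshot_pop = snapshot["pop"].tolist()  # unchanged

print(same_object, frame_pop, snapshot_pop)
```

If the original values are still needed later, taking a `copy()` before modifying is the safer choice.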


In [55]:
#modifying using array of values
import numpy as np
frame2['pop'] = np.arange(6)

In [56]:
frame2

    state  year  pop
0    Ohio  2000    0
1    Ohio  2001    1
2    Ohio  2002    2
3  Nevada  2001    3
4  Nevada  2002    4
5  Nevada  2003    5


In [58]:
# When assigning lists or arrays to a column, the value's length must match the length of the DataFrame.
# If you assign a Series, its labels are realigned exactly to the DataFrame's index, inserting missing values for any index labels not present in the Series

val = pd.Series([-1.2, -1.5, -1.7], index = [2,4,5])
frame["debt"] = val

In [59]:
frame

    state  year  pop  debt
0    Ohio  2000    0   NaN
1    Ohio  2001    1   NaN
2    Ohio  2002    2  -1.2
3  Nevada  2001    3   NaN
4  Nevada  2002    4  -1.5
5  Nevada  2003    5  -1.7


In [61]:
# assigning to a column that doesn't exist creates a new column
frame["eastern"] = frame["state"] == "Ohio"

In [62]:
frame

    state  year  pop  debt  eastern
0    Ohio  2000    0   NaN     True
1    Ohio  2001    1   NaN     True
2    Ohio  2002    2  -1.2     True
3  Nevada  2001    3   NaN    False
4  Nevada  2002    4  -1.5    False
5  Nevada  2003    5  -1.7    False


In [63]:
# deleting columns
del frame["eastern"]

In [64]:
frame

    state  year  pop  debt
0    Ohio  2000    0   NaN
1    Ohio  2001    1   NaN
2    Ohio  2002    2  -1.2
3  Nevada  2001    3   NaN
4  Nevada  2002    4  -1.5
5  Nevada  2003    5  -1.7


In [65]:
frame.columns

Index(['state', 'year', 'pop', 'debt'], dtype='object')

In [66]:
frame.index

RangeIndex(start=0, stop=6, step=1)

In [69]:
# passing a nested dictionary:
# outer keys become columns, inner keys become row indices

populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
    "Nevada": {2001: 2.4, 2002: 2.9}}
frame3 = pd.DataFrame(populations)
frame3

      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4
2002   3.6     2.9


In [70]:
## can transpose rows and columns

frame3.T

        2000  2001  2002
Ohio     1.5   1.7   3.6
Nevada   NaN   2.4   2.9


#### Many data inputs can be passed to DataFrame constructor

In [72]:
## DataFrame's to_numpy method returns the data contained in the DataFrame as a two-dimensional ndarray
frame3.to_numpy()

array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])

###  Index Objects

pandas's Index objects are responsible for holding the axis labels (including a DataFrame's column names) and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index.

Index objects are immutable and thus can't be modified by the user.

Immutability makes it safer to share Index objects among data structures.
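A quick way to see the immutability is to try assigning to one position of an Index. The sketch below uses a hypothetical Index named `labels`; the assignment raises `TypeError`, while operations such as `append` return a new Index and leave the original untouched.

```python
import pandas as pd

labels = pd.Index(["a", "b", "c"])

# In-place item assignment is rejected with a TypeError.
try:
    labels[1] = "d"
    mutated = True
except TypeError:
    mutated = False

# Methods like append() build and return a new Index instead.
extended = labels.append(pd.Index(["d"]))

print(mutated)
print(list(labels), list(extended))
```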


In [77]:
obj = pd.Series(np.arange(3), index=["a", "b", "c"])
index2 = obj.index

In [78]:
index2

Index(['a', 'b', 'c'], dtype='object')

In [79]:
index2[1:]

Index(['b', 'c'], dtype='object')

In [80]:
# sharing an Index with another Series
obj2 = pd.Series([1.3, 2, 0], index=index2)

In [81]:
obj2

a    1.3
b    2.0
c    0.0
dtype: float64

In [82]:
# an Index also behaves like a fixed-size set
'a' in obj2.index

True

In [83]:
frame3

      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4
2002   3.6     2.9


In [84]:
frame3.index

Int64Index([2000, 2001, 2002], dtype='int64')

In [85]:
2000 in frame3.index

True

In [87]:
"Ohio" in frame3.columns

True

In [88]:
# can contain duplicate labels
pd.Index(["foo", "foo", "bar", "bar"])


Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
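Duplicate labels change what a selection returns. As a small sketch with a hypothetical Series `dup`, selecting a repeated label gives back a Series with every match, while a unique label gives a scalar; `is_unique` reports whether the index has duplicates.

```python
import pandas as pd

# Hypothetical Series whose index repeats the label "foo".
dup = pd.Series([0, 1, 2, 3], index=["foo", "foo", "bar", "baz"])

# is_unique is False as soon as any label is repeated.
unique_flag = dup.index.is_unique

# A repeated label selects all matching entries as a Series.
foo_rows = dup["foo"]

# A label that occurs once selects a single scalar value.
baz_value = dup["baz"]

print(unique_flag)
print(foo_rows)
print(baz_value)
```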

##### Index objects have many methods and properties

#### Reindexing

In [92]:
#reindex - create a new object with the values rearranged to align with the new index.
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
obj2 = obj.reindex(["a", "b", "c", "d", "e"])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [93]:
# for ordered data like time series, the method option fills in values when reindexing; "ffill" forward-fills them
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
obj3


0      blue
2    purple
4    yellow
dtype: object

In [94]:
obj3.reindex(np.arange(6), method="ffill")

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [96]:
#reindex by columns
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])

In [98]:
states = ["Texas", "Utah", "California"]
frame.reindex(columns=states)  # Ohio is dropped because it is not in states; Utah is new, so it is filled with NaN

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


In [99]:
# another way to do the above
frame.reindex(states, axis="columns")

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


#### many function arguments can be passed to reindex
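Two of those arguments are `fill_value` and `limit`. The following is a minimal sketch with hypothetical Series `counts` and `colors`: `fill_value` substitutes a value for labels that are introduced by the reindex, and `limit` caps how many consecutive gaps a forward fill may cover.

```python
import pandas as pd

# fill_value: new labels get 0 instead of NaN.
counts = pd.Series([4, 7], index=["a", "b"])
zero_filled = counts.reindex(["a", "b", "z"], fill_value=0)

# limit: forward-fill at most one consecutive missing position.
colors = pd.Series(["blue", "purple"], index=[0, 3])
limited = colors.reindex(range(6), method="ffill", limit=1)

print(zero_filled)
print(limited)
```

With `limit=1`, positions 2 and 5 stay missing because each is the second consecutive gap after a known value.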

In [100]:
# reindexing can also be done with the loc operator, but only if all of the new labels already exist in the DataFrame
frame.loc[["a", "d", "c"], ["California", "Texas"]]

   California  Texas
a           2      1
d           8      7
c           5      4
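The restriction on `loc` can be seen by asking for a label that is absent. In this sketch, which rebuilds the same `frame` as above and uses a label `"b"` that is not in its index, `loc` raises `KeyError` whereas `reindex` inserts a row of missing values.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])

# loc refuses a list that contains a label missing from the index.
try:
    frame.loc[["a", "b"]]
    loc_failed = False
except KeyError:
    loc_failed = True

# reindex accepts the same list and fills the new row with NaN.
padded = frame.reindex(["a", "b"])

print(loc_failed)
print(padded)
```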


In [102]:
type(frame)

pandas.core.frame.DataFrame

## Dropping Entries from an Axis

In [104]:
#series
obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [105]:
obj.drop("c") #drop index

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [106]:
obj.drop(["c", "b"])  # dropping multiple labels

a    0.0
d    3.0
e    4.0
dtype: float64

In [112]:
# dataframe
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

In [108]:
data.drop(index=["Colorado", "Utah"])  # dropping row labels

          one  two  three  four
Ohio        0    1      2     3
New York   12   13     14    15


In [111]:
data.drop(columns=["one", "two"]) #dropping column labels

          three  four
Ohio          2     3
Colorado      6     7
Utah         10    11
New York     14    15


In [113]:
# other ways, using the axis option (only the result of the last expression is displayed)
data.drop("two", axis=1)
data.drop(["two", "four"], axis="columns")

          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
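In all of the calls above, `drop` returns a new object and leaves `data` as it was. The sketch below checks this on the same DataFrame and also shows the `errors="ignore"` option, which skips labels that do not exist; the column name `"missing"` is hypothetical.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Rows and columns can be dropped in one call; the result is a new DataFrame.
trimmed = data.drop(index="Ohio", columns=["two"])

# errors="ignore" silently skips labels that are not present.
unchanged = data.drop(columns=["missing"], errors="ignore")

print(data.shape, trimmed.shape, unchanged.shape)
```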


## Indexing, Selection, and Filtering

In [2]:
#Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers.
import pandas as pd
import numpy as np
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [6]:
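# obj[1] falls back to positional access here; newer pandas versions may warn about this, and obj.iloc[1] is the explicit form
print(obj["b"], obj[1])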
print(obj[2:4], obj[["b", "a", "d"]])
print(obj[[1, 3]], obj[obj < 2])

1.0 1.0
c    2.0
d    3.0
dtype: float64 b    1.0
a    0.0
d    3.0
dtype: float64
b    1.0
d    3.0
dtype: float64 a    0.0
b    1.0
dtype: float64


In [8]:
# loc is preferred because [] treats integers differently depending on the index:
# regular []-based indexing treats integers as labels if the index contains integers,
# so the behavior differs depending on the data type of the index
obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])
obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])
print(obj1[[0, 1, 2]])
print(obj2[[0, 1, 2]])
print(obj2.loc[[0, 1, 2]])  # fails because the index doesn't contain integers

0    2
1    3
2    1
dtype: int64
a    1
b    2
c    3
dtype: int64


KeyError: "None of [Int64Index([0, 1, 2], dtype='int64')] are in the [index]"

##### The loc operator indexes exclusively with labels. There is also an iloc operator that indexes exclusively with integers, so it works consistently whether or not the index contains integers.
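The difference is easiest to see on a Series whose labels are themselves integers in a shuffled order. This is a small sketch with a hypothetical Series `s`: the same key `0` means "the label 0" to `loc` and "the first position" to `iloc`.

```python
import pandas as pd

# Integer labels that do not match the positions.
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]      # the entry whose label is 0
by_position = s.iloc[0]  # the entry at position 0

print(by_label, by_position)
```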

In [11]:
# assigning to a label slice modifies the corresponding section; label slices include the endpoint
obj2.loc["b":"c"] = 5
obj2

a    1
b    5
c    5
dtype: int64
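Note that the label slice `"b":"c"` above changed both `b` and `c`. As a short sketch on a hypothetical Series `s`, slicing with `loc` includes the end label, while slicing with `iloc` follows the usual Python rule and excludes the end position.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Label-based slice: both endpoints are included.
label_slice = s.loc["b":"c"]

# Position-based slice: the stop position is excluded.
pos_slice = s.iloc[1:2]

print(list(label_slice.index), list(pos_slice.index))
```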

In [12]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

In [14]:
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [17]:
data["two"] #selecting columns

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [18]:
data[["three", "one"]]  # multiple columns

          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12


In [19]:
data[:2]  # slicing with [] selects rows

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7


In [20]:
data[data["three"] > 5]

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [23]:
data[:1]  # a slice selects rows (passing a single element or a list selects columns)

      one  two  three  four
Ohio    0    1      2     3


In [24]:
data < 5 # boolean data frame

            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False


In [26]:
data[data < 5] = 0  # assigning 0 to cells meeting this condition
data

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


## Selection on DataFrame with loc and iloc

In [27]:
data


          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [28]:
# loc selects by label and iloc selects by integer position
data.loc["Colorado"]

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int32

In [29]:
data.loc[["Colorado", "Utah"]]

          one  two  three  four
Colorado    0    5      6     7
Utah        8    9     10    11


In [31]:
# separate the row and column selections with a comma
data.loc["Utah", ["one", "three"]]

one       8
three    10
Name: Utah, dtype: int32

In [32]:
# selection with integers
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

In [33]:
data.iloc[[2,1]]

          one  two  three  four
Utah        8    9     10    11
Colorado    0    5      6     7


In [34]:
data.iloc[[2, 1], [0, 1]]  # rows and columns by integer position

          one  two
Utah        8    9
Colorado    0    5


In [36]:
# Both indexing functions work with slices in addition to single labels or lists of labels:
data.loc[:"Utah", "two"]

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

In [37]:
data.iloc[:, :3][data.three > 5]

          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14


In [38]:
# using Boolean arrays
data.loc[data.three >= 2]

          one  two  three  four
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
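Boolean masks can also be combined and paired with a column selection in a single `loc` call. The sketch below rebuilds the original 0 to 15 version of `data` (before the cells below 5 were zeroed) and uses hypothetical names `mask` and `subset`; each condition needs its own parentheses because `&` binds more tightly than the comparison operators.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Combine two row conditions element-wise with & (use | for "or").
mask = (data["three"] > 5) & (data["one"] < 12)

# Rows come from the mask, columns from the list of labels.
subset = data.loc[mask, ["one", "three"]]

print(subset)
```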


## Integer indexing pitfalls

## Pitfalls with chained indexing

## Arithmetic and Data Alignment

## Arithmetic methods with fill values

## Operations between DataFrame and Series

## Function Application and Mapping

## Sorting and Ranking

## Axis Indexes with Duplicate Labels

## Summarizing and Computing Descriptive Statistics

## Correlation and Covariance

## Unique Values, Value Counts, and Membership