# Introduction to Pandas

Pandas is an open-source Python library used for data manipulation, analysis, and cleaning. It provides easy-to-use data structures such as Series and DataFrame.

It’s built on top of NumPy and is especially powerful for working with structured data (like tables from Excel, CSV, SQL, etc.).

## Why We Use Pandas:
We use Pandas because it:
- Makes data easy to read, clean, and manipulate
- Works well with large datasets
- Integrates easily with NumPy, Matplotlib, Scikit-learn, and other Python libraries
- Helps with exploratory data analysis (EDA), transforming raw data into usable formats
- Offers flexible and fast performance for data operations

In [1]:
import pandas as pd

Here pandas is imported under the alias pd, so whenever you see pd. in code, it refers to pandas. You may also find it
easier to import Series and DataFrame into the local namespace since they are so
frequently used:

In [2]:
from pandas import Series, DataFrame

To get started with pandas, you will need to get comfortable with its two workhorse
data structures: Series and DataFrame. While they are not a universal solution for
every problem, they provide a solid foundation for a wide variety of data tasks.

## Series

A Series is a one-dimensional array-like object containing a sequence of values of
the same type (the types are similar to NumPy types) and an associated array of data labels,
called its index. The simplest Series is formed from only an array of data:

In [3]:
obj = pd.Series([4, 7, -5, 3])
obj

Unnamed: 0,0
0,4
1,7
2,-5
3,3


When you create a Series and display it, it shows two parts:

 - The index (on the left)

 - The values (on the right)

If you **don’t give your own index**, Pandas will automatically create one for you — starting from **0** up to **N-1** (where **N** is the number of items).

You can also get:

 - The actual values (as an array) using the .array or .values attribute

 - The index using the .index attribute

In [4]:
obj.array

<NumpyExtensionArray>
[np.int64(4), np.int64(7), np.int64(-5), np.int64(3)]
Length: 4, dtype: int64

obj.array returns the Series's data wrapped in a pandas ExtensionArray. For data backed by NumPy (such as the integers here), that wrapper is a thin layer around a NumPy ndarray; recent pandas versions display it as NumpyExtensionArray, while earlier versions called it PandasArray. It is not itself a NumPy array. If you specifically want a NumPy array, use obj.to_numpy() (or the older obj.values attribute). Either way, the underlying data is the same; only the container differs.
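As a small sketch of that difference, the container returned by `.array` can be compared with the result of `to_numpy()`. The variable names `wrapped` and `plain` are illustrative and not part of the notebook:

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

wrapped = obj.array      # pandas ExtensionArray wrapping the data
plain = obj.to_numpy()   # plain NumPy ndarray holding the same values

print(type(wrapped).__name__)  # the class name depends on the pandas version
print(type(plain).__name__)    # ndarray
print(plain.tolist())          # [4, 7, -5, 3]
```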

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
obj2

Unnamed: 0,0
d,4
b,7
a,-5
c,3


In [7]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

Compared with NumPy arrays, you can use labels in the index when selecting single
values or a set of values:


In [8]:
obj2["a"]

np.int64(-5)

In [9]:
obj2["d"] = 6

In [10]:
obj2[["c", "a", "d"]]

Unnamed: 0,0
c,3
a,-5
d,6


Here ["c", "a", "d"] is interpreted as a list of indices, even though it contains
strings instead of integers.

Using NumPy functions or NumPy-like operations, such as filtering with a Boolean
array, scalar multiplication, or applying math functions, will preserve the index-value
link:


In [11]:
obj2[obj2 > 0]

Unnamed: 0,0
d,6
b,7
c,3


In [12]:
obj2 * 2

Unnamed: 0,0
d,12
b,14
a,-10
c,6


In [13]:
import numpy as np
np.exp(obj2)

Unnamed: 0,0
d,403.428793
b,1096.633158
a,0.006738
c,20.085537


Another way to think about a Series is as a fixed-length, ordered dictionary, as it is a
mapping of index values to data values. It can be used in many contexts where you
might use a dictionary:

In [14]:
"b" in obj2

True

In [15]:
"e" in obj2

False
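A few more dictionary-style operations, sketched here on a Series with the same labels as obj2. The `.get` call and the `pairs` variable are additions for illustration:

```python
import pandas as pd

obj2 = pd.Series([6, 7, -5, 3], index=["d", "b", "a", "c"])

# Membership tests look at the index labels, not the values
print("b" in obj2)   # True
print(7 in obj2)     # False: 7 is a value, not a label

# .get returns a default instead of raising KeyError for a missing label
print(obj2.get("e", 0))   # 0

# .items() yields (label, value) pairs, much like dict.items()
pairs = list(obj2.items())
print(pairs[0])
```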

Should you have data contained in a Python dictionary, you can create a Series from
it by passing the dictionary:


In [16]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

In [17]:
obj3 = pd.Series(sdata)

obj3

Unnamed: 0,0
Ohio,35000
Texas,71000
Oregon,16000
Utah,5000


A Series can be converted back to a dictionary with its to_dict method:

In [18]:
obj3.to_dict()

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

When you are only passing a dictionary, the index in the resulting Series will respect
the order of the keys according to the dictionary’s keys method, which depends on
the key insertion order. You can override this by passing an index with the dictionary
keys in the order you want them to appear in the resulting Series:


In [19]:
states = ["California", "Ohio", "Oregon", "Texas"]

obj4 = pd.Series(sdata, index=states)

In [20]:
obj4

Unnamed: 0,0
California,
Ohio,35000.0
Oregon,16000.0
Texas,71000.0


Here, three values found in sdata were placed in the appropriate locations, but since
no value for "California" was found, it appears as NaN (Not a Number), which
pandas uses to mark missing or NA values. Since "Utah" was not included
in states, it is excluded from the resulting object.

I will use the terms “missing,” “NA,” or “null” interchangeably to refer to missing data.
The isna and notna functions in pandas should be used to detect missing data:


In [21]:
pd.isna(obj4)

Unnamed: 0,0
California,True
Ohio,False
Oregon,False
Texas,False


In [24]:
pd.notna(obj4)

Unnamed: 0,0
California,False
Ohio,True
Oregon,True
Texas,True
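Series also exposes the same checks as instance methods. The sketch below rebuilds obj4 so that it runs on its own and uses the Boolean result to count and filter missing entries; `mask`, `n_missing`, and `present` are illustrative names:

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

mask = obj4.isna()             # same result as pd.isna(obj4)
n_missing = int(mask.sum())    # True counts as 1, so this counts missing values
present = obj4[obj4.notna()]   # keep only the entries that have a value

print(n_missing)               # 1
print(list(present.index))     # ['Ohio', 'Oregon', 'Texas']
```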


In [25]:
obj3

Unnamed: 0,0
Ohio,35000
Texas,71000
Oregon,16000
Utah,5000


In [26]:
obj4

Unnamed: 0,0
California,
Ohio,35000.0
Oregon,16000.0
Texas,71000.0


In [27]:
# Arithmetic aligns on index labels; a label missing from either side gives NaN
obj3 + obj4

Unnamed: 0,0
California,
Ohio,70000.0
Oregon,32000.0
Texas,142000.0
Utah,
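If the NaN produced by a label that exists on only one side is not wanted, one option is the `add` method with `fill_value`. This is a sketch of an alternative, not something the notebook itself does; note that a position missing on both sides still comes out as NaN:

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# Treat a label that is absent on one side as 0 before adding
total = obj3.add(obj4, fill_value=0)

print(total["Utah"])               # 5000.0 (only present in obj3)
print(total["Ohio"])               # 70000.0
print(pd.isna(total["California"]))  # True: missing in both inputs
```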


In [28]:
obj4.name = "population"

obj4.index.name = "state"

In [29]:
obj4

state,population
California,
Ohio,35000.0
Oregon,16000.0
Texas,71000.0


In [30]:
obj4.dtype

dtype('float64')

In [33]:
obj4.index

Index(['California', 'Ohio', 'Oregon', 'Texas'], dtype='object', name='state')

## DataFrame

A DataFrame represents a rectangular table of data and contains an ordered,
named collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.).
The DataFrame has both a row and column index;
it can be thought of as a dictionary of Series all sharing the same index.
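The "dictionary of Series" view can be made concrete with a short sketch. The two Series below are made-up illustrations that share one index; each becomes a column with its own dtype:

```python
import pandas as pd

# Two Series with the same index labels (illustrative data)
pop = pd.Series([1.5, 1.7, 3.6], index=[2000, 2001, 2002])
even_year = pd.Series([True, False, True], index=[2000, 2001, 2002])

# Dictionary keys become column names; the shared index becomes the row index
frame = pd.DataFrame({"pop": pop, "even_year": even_year})

print(frame.dtypes)          # one dtype per column
print(type(frame["pop"]))    # selecting a column gives back a Series
```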

In [37]:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

In [38]:
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [39]:
frame.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [40]:
frame.tail()

Unnamed: 0,state,year,pop
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [41]:
pd.DataFrame(data, columns=["year", "state", "pop"])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [43]:
# "debt" is not a key in data, so that column is filled with missing values
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,
5,2003,Nevada,3.2,


In [45]:
frame2["state"]

Unnamed: 0,state
0,Ohio
1,Ohio
2,Ohio
3,Nevada
4,Nevada
5,Nevada


In [46]:
frame.year

Unnamed: 0,year
0,2000
1,2001
2,2002
3,2001
4,2002
5,2003


In [47]:
frame2.loc[1]

Unnamed: 0,1
year,2001
state,Ohio
pop,1.7
debt,


In [48]:
frame2.iloc[2]

Unnamed: 0,2
year,2002
state,Ohio
pop,3.6
debt,


In [49]:
frame2["debt"] = 16.5

In [50]:
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,16.5
1,2001,Ohio,1.7,16.5
2,2002,Ohio,3.6,16.5
3,2001,Nevada,2.4,16.5
4,2002,Nevada,2.9,16.5
5,2003,Nevada,3.2,16.5


In [51]:
frame2.debt = np.arange(6.)

In [52]:
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,0.0
1,2001,Ohio,1.7,1.0
2,2002,Ohio,3.6,2.0
3,2001,Nevada,2.4,3.0
4,2002,Nevada,2.9,4.0
5,2003,Nevada,3.2,5.0


When you are assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame’s index, inserting missing values in any index values not present:

In [53]:
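# None of these labels exist in frame2's integer index (0 to 5),
# so alignment leaves every debt value missing
val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])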

frame2.debt = val
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,
5,2003,Nevada,3.2,
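For contrast, here is a sketch where some of the Series labels do match the DataFrame's index. The small frame and the values are illustrative; only the matching rows receive a value and the rest become NaN:

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Ohio", "Nevada"],
                      "pop": [1.5, 1.7, 2.4]})

# Labels 0 and 2 exist in frame's index; label 1 is not supplied
val = pd.Series([-1.2, -1.5], index=[0, 2])
frame["debt"] = val

print(frame)
```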


Assigning a column that doesn’t exist will create a new column.

The del keyword deletes columns, as with a dictionary. As an example, I first add a new column of Boolean values where the state column equals "Ohio":

In [55]:
frame2["eastern"] = frame2.state == "Ohio"
frame2

Unnamed: 0,year,state,pop,debt,eastern
0,2000,Ohio,1.5,,True
1,2001,Ohio,1.7,,True
2,2002,Ohio,3.6,,True
3,2001,Nevada,2.4,,False
4,2002,Nevada,2.9,,False
5,2003,Nevada,3.2,,False


Note: New columns cannot be created with the frame2.eastern dot attribute notation.
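A short sketch of what happens in practice: assigning through a new attribute name sets an ordinary Python attribute on the object (pandas emits a warning, silenced here) and leaves the columns unchanged. The small frame is illustrative:

```python
import warnings

import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about this pattern
    frame.eastern = frame["state"] == "Ohio"   # attribute, not a column

created_by_attribute = "eastern" in frame.columns
print(created_by_attribute)   # False

# Bracket notation is what actually creates the column
frame["eastern"] = frame["state"] == "Ohio"
print("eastern" in frame.columns)   # True
```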

In [58]:
del frame2["eastern"]

frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

The column returned from indexing a DataFrame can be a view on the underlying data rather than a copy, so in-place modifications to the Series may be reflected in the DataFrame (the exact behavior depends on the pandas version and on whether Copy-on-Write is enabled). To get an independent object, copy the column explicitly with the Series’s copy method.
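A minimal sketch of the explicit copy, using an illustrative one-column frame. Changes made to the copy never reach the DataFrame, regardless of pandas version:

```python
import pandas as pd

frame = pd.DataFrame({"pop": [1.5, 1.7, 3.6]})

pop_copy = frame["pop"].copy()   # independent Series
pop_copy.iloc[0] = 99.0          # modify only the copy

print(pop_copy.iloc[0])          # 99.0
print(frame["pop"].iloc[0])      # 1.5, the DataFrame is unchanged
```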

In [59]:
populations = {"Ohio":{2000:1.5, 2001:1.7, 2002:3.6},
               "Nevada":{2001:2.4, 2002:2.9}}

If the nested dictionary is passed to the DataFrame, pandas will interpret the outer dictionary keys as the columns, and the inner keys as the row indices:

In [60]:
frame3 = pd.DataFrame(populations)

frame3

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


In [61]:
frame3.T

Unnamed: 0,2000,2001,2002
Ohio,1.5,1.7,3.6
Nevada,,2.4,2.9


Warning
Note that transposing discards the column data types if the columns do not all have the same data type, so transposing and then transposing back may lose the previous type information. The columns become arrays of pure Python objects in this case.
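The effect can be seen in a small sketch with one integer column and one string column (illustrative data). After a round trip through `.T`, the integer column comes back with object dtype; `infer_objects` is shown as one possible way to recover it:

```python
import pandas as pd

df = pd.DataFrame({"year": [2000, 2001], "state": ["Ohio", "Nevada"]})

round_trip = df.T.T   # transpose and transpose back

print(df.dtypes)           # year is an integer column
print(round_trip.dtypes)   # year is now object

# Ask pandas to re-infer better dtypes for object columns
restored = round_trip.infer_objects()
print(restored.dtypes)
```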

In [62]:
pd.DataFrame(populations, index=[2001, 2002, 2003])

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9
2003,,


In [64]:
frame3

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


In [65]:
frame3.index.name='year'

frame3.columns.name='state'

frame3

state,Ohio,Nevada
year,,
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


Unlike Series, DataFrame does not have a name attribute. DataFrame’s to_numpy method returns the data contained in the DataFrame as a two-dimensional ndarray:

In [66]:
frame.to_numpy()

array([['Ohio', 2000, 1.5],
       ['Ohio', 2001, 1.7],
       ['Ohio', 2002, 3.6],
       ['Nevada', 2001, 2.4],
       ['Nevada', 2002, 2.9],
       ['Nevada', 2003, 3.2]], dtype=object)

In [67]:
frame3.to_numpy()

array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])

## Index Objects

pandas’s Index objects are responsible for holding the axis labels (including a DataFrame’s column names) and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:

In [68]:
obj = pd.Series(np.arange(3), index=['a','b','c'])

index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [69]:
index[1:]

Index(['b', 'c'], dtype='object')

Index objects are immutable and thus can’t be modified by the user:

In [70]:
index[1] = "d"  # raises TypeError

TypeError: Index does not support mutable operations
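Since labels cannot be changed in place, the usual route is to build a new object with different labels. The sketch below catches the error and then uses `Series.rename` as one way to do that; the replacement label "d" is just an example:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(3), index=["a", "b", "c"])
index = obj.index

try:
    index[1] = "d"              # in-place modification is not allowed
except TypeError as exc:
    print("TypeError:", exc)

# rename returns a new Series with a new Index; obj is left untouched
renamed = obj.rename(index={"b": "d"})

print(list(renamed.index))   # ['a', 'd', 'c']
print(list(obj.index))       # ['a', 'b', 'c']
```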

Immutability makes it safer to share Index objects among data structures:

In [71]:
labels = pd.Index(np.arange(3))

labels

Index([0, 1, 2], dtype='int64')

In [72]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)

obj2

Unnamed: 0,0
0,1.5
1,-2.5
2,0.0


In addition to being array-like, an Index also behaves like a fixed-size set:

In [73]:
frame3

state,Ohio,Nevada
year,,
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


In [74]:
frame3.columns

Index(['Ohio', 'Nevada'], dtype='object', name='state')

In [75]:
"Ohio" in frame3.columns

True

In [76]:
2003 in frame3.index

False

Unlike Python sets, a pandas Index can contain duplicate labels:

In [78]:
pd.Index(["foo", "foo", "bar", "bar"])


Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

Selections with duplicate labels will select all occurrences of that label.
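A brief sketch of that behavior on a Series with illustrative labels: selecting a duplicated label returns a Series, while selecting a unique label returns a single value.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["foo", "foo", "bar", "baz"])

dup = s["foo"]      # label appears twice, so the result is a Series
single = s["baz"]   # label appears once, so the result is a scalar

print(dup.tolist())        # [1, 2]
print(single)              # 4
print(s.index.is_unique)   # False
```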

Each Index has a number of methods and properties for set logic, which answer other common questions about the data it contains. Some useful ones are summarized in the following table:

| Method/Property  | Description                                                                 |
|------------------|-----------------------------------------------------------------------------|
| append()         | Concatenate with additional Index objects, producing a new Index            |
| difference()     | Compute set difference as an Index                                          |
| intersection()   | Compute set intersection                                                    |
| union()          | Compute set union                                                           |
| isin()           | Compute Boolean array indicating whether each value is contained in the passed collection |
| delete()         | Compute new Index with element at Index i deleted                           |
| drop()           | Compute new Index by deleting passed values                                 |
| insert()         | Compute new Index by inserting element at Index i                           |
| is_monotonic_increasing | Returns True if each element is greater than or equal to the previous element (named is_monotonic before pandas 2.0) |
| is_unique        | Returns True if the Index has no duplicate values                           |
| unique()         | Compute the array of unique values in the Index                             |
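
Several of these methods can be tried on two small Index objects. The labels below are illustrative, and every call returns a new object rather than modifying `a` or `b`:

```python
import pandas as pd

a = pd.Index(["Ohio", "Nevada", "Texas"])
b = pd.Index(["Texas", "Utah"])

print(a.union(b))          # labels found in either Index
print(a.intersection(b))   # labels found in both
print(a.difference(b))     # labels in a that are not in b

print(a.isin(["Ohio", "Utah"]))   # [ True False False]
print(a.is_unique)                # True

print(a.insert(1, "Oregon"))   # new Index with "Oregon" at position 1
print(a.drop("Texas"))         # new Index without "Texas"
```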


## Reindexing

An important method on pandas objects is reindex, which creates a new object with the values rearranged to align with a new index. Consider an example:

In [79]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d','b','a','c'])

obj

Unnamed: 0,0
d,4.5
b,7.2
a,-5.3
c,3.6


Calling reindex on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:

In [81]:
obj2 = obj.reindex(['a','c','d','e'])

obj2

Unnamed: 0,0
a,-5.3
c,3.6
d,4.5
e,
