# Introduction to Pandas

Pandas is an open-source Python library used for data manipulation, analysis, and cleaning. It provides easy-to-use data structures such as Series and DataFrame.

It’s built on top of NumPy and is especially powerful for working with structured data (like tables from Excel, CSV, SQL, etc.).

## Why We Use Pandas
Pandas is widely used because it:
- Makes data easy to read, clean, and manipulate
- Handles large tabular datasets efficiently, as long as they fit in memory
- Integrates easily with NumPy, Matplotlib, Scikit-learn, and other Python libraries
- Helps with exploratory data analysis (EDA) and with transforming raw data into usable formats
- Offers fast, vectorized operations for filtering, combining, and reshaping data

In [1]:
import pandas as pd

Here we import pandas under the alias pd, so whenever you see pd. in code, it refers to pandas. You may also find it
easier to import Series and DataFrame into the local namespace since they are so
frequently used:

In [2]:
from pandas import Series, DataFrame

To get started with pandas, you will need to get comfortable with its two workhorse
data structures: Series and DataFrame. While they are not a universal solution for
every problem, they provide a solid foundation for a wide variety of data tasks.

## Series

A Series is a one-dimensional array-like object containing a sequence of values of the
same type (similar to NumPy types) and an associated array of data labels,
called its index. The simplest Series is formed from only an array of data:

In [6]:
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

When you create a Series and display it, it shows two parts:

 - The index (on the left)

 - The values (on the right)

If you **don’t give your own index**, Pandas will automatically create one for you — starting from **0** up to **N-1** (where **N** is the number of items).

You can also get:

 - The actual values (as an array) using the .array or .values attribute

 - The index using the .index attribute

In [7]:
obj.array

<NumpyExtensionArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

Calling obj.array returns the array that backs the Series, wrapped in a pandas ExtensionArray. For data stored in an ordinary NumPy array, such as integers or floats, the wrapper is a NumpyExtensionArray (named PandasArray before pandas 2.1), which is why the output above does not look like a plain ndarray. This is expected behavior: the wrapper is a thin layer over the same data, so your values are intact and usable.

If you specifically want a NumPy array, use obj.to_numpy(), which always returns an ndarray. The older obj.values attribute returns an ndarray for NumPy-backed dtypes like this one, but it can return an extension array for pandas-specific dtypes such as categoricals, so to_numpy() is the safer choice.
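As a small sketch of the difference between the two accessors, the snippet below rebuilds the same obj and compares what .array and .to_numpy() return. The variable names arr and values are only for illustration, and the exact class name printed for arr depends on the installed pandas version.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# .array returns a pandas ExtensionArray wrapping the underlying data
arr = obj.array
print(type(arr).__name__)

# .to_numpy() returns a plain NumPy ndarray
values = obj.to_numpy()
print(isinstance(values, np.ndarray))  # True
```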

In [8]:
obj.index

RangeIndex(start=0, stop=4, step=1)

Often you will want to give each data point its own label by passing an index when you create the Series:

In [10]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [11]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

Compared with NumPy arrays, you can use labels in the index when selecting single
values or a set of values:


In [12]:
obj2["a"]

-5

In [13]:
obj2["d"] = 6

In [14]:
obj2[["c", "a", "d"]]

c    3
a   -5
d    6
dtype: int64

Here ["c", "a", "d"] is interpreted as a list of index labels, even though it contains
strings instead of integers.
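Label-based selection can also be written with the .loc accessor, which makes it explicit that you are selecting by label rather than by position. This is a minimal sketch, not something the examples above require; obj2 is rebuilt here with "d" already set to 6 so the snippet runs on its own.

```python
import pandas as pd

# Same data as obj2 above, after the assignment obj2["d"] = 6
obj2 = pd.Series([6, 7, -5, 3], index=["d", "b", "a", "c"])

# A single label returns a scalar value
single = obj2.loc["a"]

# A list of labels returns a Series in the requested order
subset = obj2.loc[["c", "a", "d"]]

print(single)
print(subset)
```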

Using NumPy functions or NumPy-like operations, such as filtering with a Boolean
array, scalar multiplication, or applying math functions, will preserve the index-value
link:


In [15]:
obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

In [16]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [17]:
import numpy as np
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
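One way to confirm that the index-value link survives these operations is to inspect the index of each result. The sketch below repeats the three operations on a fresh copy of obj2; the names positive, doubled, and exponentiated are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([6, 7, -5, 3], index=["d", "b", "a", "c"])

positive = obj2[obj2 > 0]      # Boolean filtering keeps the matching labels
doubled = obj2 * 2             # scalar multiplication keeps every label
exponentiated = np.exp(obj2)   # NumPy ufuncs return a Series, not a bare array

print(list(positive.index))
print(doubled.index.equals(obj2.index))
print(type(exponentiated).__name__)
```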

Another way to think about a Series is as a fixed-length, ordered dictionary, as it is a
mapping of index values to data values. It can be used in many contexts where you
might use a dictionary:

In [18]:
"b" in obj2

True

In [19]:
"e" in obj2

False
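The dictionary analogy extends a little further: a Series also has keys, get, and items methods that behave much like their dict counterparts. The following is a short sketch; the default value 0 passed to get is an arbitrary choice for this example.

```python
import pandas as pd

obj2 = pd.Series([6, 7, -5, 3], index=["d", "b", "a", "c"])

# keys() returns the index
labels = list(obj2.keys())

# get() returns a default instead of raising KeyError for a missing label
missing = obj2.get("e", 0)

# items() yields (label, value) pairs, like dict.items()
pairs = {label: int(value) for label, value in obj2.items()}

print(labels)
print(missing)
print(pairs)
```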

Should you have data contained in a Python dictionary, you can create a Series from
it by passing the dictionary:


In [20]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

In [22]:
obj3 = pd.Series(sdata)

obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

A Series can be converted back to a dictionary with its to_dict method:

In [24]:
obj3.to_dict()

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
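As a quick check that the two conversions are inverses for this data, the sketch below sends obj3 to a dictionary and back and compares the result with the original. The name round_trip is used only for illustration.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# Series -> dict -> Series
round_trip = pd.Series(obj3.to_dict())

# equals() compares index labels, values, and dtype
print(round_trip.equals(obj3))  # True
```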

When you pass only a dictionary, the index of the resulting Series follows the order
of the dictionary’s keys, which is their insertion order. You can override this by
passing an index with the dictionary
keys in the order you want them to appear in the resulting Series:


In [26]:
states = ["California", "Ohio", "Oregon", "Texas"]

obj4 = pd.Series(sdata, index=states)

In [27]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Here, the three values found in sdata were placed in the appropriate locations, but since
no value for "California" was found, it appears as NaN (Not a Number), which
pandas uses to mark missing or NA values. Since "Utah" was not included
in states, it is excluded from the resulting object.
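Notice that the dtype changed from int64 to float64: NaN is a floating-point value, so pandas converts the integers to floats to make room for it. The sketch below shows the change and, as an optional aside that the rest of this section does not rely on, converts the result to the nullable "Int64" dtype, which can hold integers alongside a missing-value marker. The name obj4_int is hypothetical.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]

obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=states)

print(obj3.dtype)  # int64
print(obj4.dtype)  # float64, because NaN is a float

# Optional: the nullable integer dtype keeps whole numbers and marks gaps as <NA>
obj4_int = obj4.astype("Int64")
print(obj4_int)
```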

I will use the terms “missing,” “NA,” and “null” interchangeably to refer to missing data.
The isna and notna functions in pandas should be used to detect missing data:


In [28]:
pd.isna(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
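The notna function is the complement of isna, and both are also available as Series methods. As a brief sketch, the snippet below rebuilds obj4, uses notna to keep only the entries that have a value, and counts the missing ones; the names present_mask, same_mask, known, and n_missing are illustrative.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)

# Function form and method form give the same Boolean mask
present_mask = pd.notna(obj4)
same_mask = obj4.notna()

# Use the mask to keep only the non-missing entries
known = obj4[obj4.notna()]

# True counts as 1, so summing the isna mask counts missing values
n_missing = int(obj4.isna().sum())

print(known)
print(n_missing)  # 1
```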