In [3]:

import pandas as pd
import numpy as np
from random import random

#### Introduction to Pandas
Pandas is a Python library for **data analysis** and **manipulation**. It provides tools for **cleaning**, **transforming**, and exploring structured data such as spreadsheets or SQL tables. Its two main data structures are:

- **Series**: A one-dimensional labeled array, similar to a column in a spreadsheet or database.

- **DataFrame**: A two-dimensional labeled data structure, like a table or spreadsheet with rows and columns.

A third object, the Index, holds the axis labels for both structures.

##### Pandas has three core structures:

- Series → 1-D labeled array (like a column in Excel).

- DataFrame → 2-D table (collection of Series).

- Index → label system for rows/columns (acts like a set with alignment logic).

All three are built on top of NumPy arrays and add labels, automatic alignment, per-column dtypes (in a DataFrame), and a rich set of methods.
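As a rough illustration of how the three structures relate, the sketch below builds a small table and pulls a Series, an Index, and the underlying NumPy array back out of it. The names `prices`, `table`, `column`, and `labels` are made up for this example and do not appear elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# Illustrative data; these names are assumptions for the example only.
prices = pd.Series([1200, 800, 500], index=["Laptop", "Phone", "Tablet"], name="Price")
table = pd.DataFrame({"Price": prices, "Stock": [15, 30, 25]})

# A DataFrame column is a Series, and its row labels are an Index.
column = table["Price"]
labels = table.index

print(type(column).__name__)    # Series
print(type(labels).__name__)    # Index
print(type(column.to_numpy()))  # <class 'numpy.ndarray'>
```

The DataFrame takes its row labels from `prices`, so the plain list for `Stock` is matched by position.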

In [5]:
# from a Python list
s1 = pd.Series([10, 20, 30, 40])
print(s1)

# custom index
s1 = pd.Series([10, 20, 30, 40], index=['A', 'B', 'C', 'D'])
print(s1)


0    10
1    20
2    30
3    40
dtype: int64
A    10
B    20
C    30
D    40
dtype: int64


In [7]:
# from a dict

data = {'Alice': 85, 'Bob': 90, 'Charlie': 78}
result = pd.Series(data)
result

# Keys become index labels, values become data.

Alice      85
Bob        90
Charlie    78
dtype: int64

In [9]:
# scalar + index
s = pd.Series(5, index=['x','y','z'])
s

x    5
y    5
z    5
dtype: int64

In [15]:
# from NumPy array
arr = np.random.rand(4)
s = pd.Series(arr, index=list('WXYZ'))
s

W    0.653478
X    0.362264
Y    0.011001
Z    0.974031
dtype: float64

In [18]:
# Pandas aligns labels. Missing labels → NaN.

sA = pd.Series({'A': 1, 'B': 2, 'C': 3})
sB = pd.Series({'B': 10, 'C': 20, 'D': 30})
print(sA + sB)


A     NaN
B    12.0
C    23.0
D     NaN
dtype: float64



Internal working:

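- A Series stores its values in a single array (usually a NumPy array), wrapped internally by a SingleBlockManager.
- The index is a separate Index object.
- Operations use vectorized NumPy routines where possible, so they run at compiled (C) speed.
- Arithmetic first aligns the operands on their index labels, which makes it safe to combine Series whose labels differ or appear in a different order.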
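The alignment behaviour can be seen by comparing plain `+` with `Series.add` and its `fill_value` argument. This is a small sketch that reuses the `sA` and `sB` values from the cell above; `total` and `filled` are names introduced here.

```python
import pandas as pd

sA = pd.Series({"A": 1, "B": 2, "C": 3})
sB = pd.Series({"B": 10, "C": 20, "D": 30})

# Plain + aligns on the union of labels; labels present on only one side give NaN.
total = sA + sB

# With fill_value=0, a label missing on one side is treated as 0 instead.
filled = sA.add(sB, fill_value=0)

print(total.index.tolist())  # ['A', 'B', 'C', 'D']
print(filled.tolist())       # [1.0, 12.0, 23.0, 30.0]
```

Both results are `float64`, because the intermediate alignment step introduces missing values before the addition runs.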

In [50]:
# Method 1: Dictionary of lists
record = {
    'Product': ['Laptop', 'Phone', 'Tablet'],
    'Price': [1200, 800, 500],
    'Stock': [15, 30, 25]
}
data = pd.DataFrame(record)
print(data)



  Product  Price  Stock
0  Laptop   1200     15
1   Phone    800     30
2  Tablet    500     25


In [26]:
# Method 2: List of dictionaries
data2 = [
    {'Product': 'Laptop', 'Price': 1200, 'Stock': 15},
    {'Product': 'Phone', 'Price': 800, 'Stock': 30},
    {'Product': 'Tablet', 'Price': 500, 'Stock': 25}
]
df2 = pd.DataFrame(data2)
print(df2)


  Product  Price  Stock
0  Laptop   1200     15
1   Phone    800     30
2  Tablet    500     25


In [117]:

# Method 3: NumPy array with column names
arr = np.random.randint(0, 100, size=(5, 3))
data = pd.DataFrame(arr, columns=['Maths', 'Science', 'English'], index=['Student1', 'Student2', 'Student3', 'Student4', 'Student5'])

print(data)


          Maths  Science  English
Student1     98       59       16
Student2      3       68        0
Student3     47       27       64
Student4     18       11        6
Student5      6       33       28


In [132]:

# Method 4: From Series
s1 = pd.Series([100, 200, 300], name="Revenue")
s2 = pd.Series([80, 150, 250], name="Cost")
data = pd.DataFrame({"Revenue": s1, "Cost": s2})
print(data)

   Revenue  Cost
0      100    80
1      200   150
2      300   250


A DataFrame uses a BlockManager internally, which groups columns of the same dtype into shared blocks.

When you perform arithmetic, pandas aligns indices first, then uses NumPy vectorized ops on the underlying arrays.

Index objects are immutable to ensure predictable alignment semantics.
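A short sketch of these points, using the product table from Method 1. The block layout itself is an internal detail, so the example only inspects the public `dtypes` attribute and then shows that an Index rejects item assignment. The names `mutated` and `relabeled` are introduced for this example.

```python
import pandas as pd

data = pd.DataFrame({
    "Product": ["Laptop", "Phone", "Tablet"],
    "Price": [1200, 800, 500],
    "Stock": [15, 30, 25],
})

# Each column keeps its own dtype.
print(data.dtypes)

# Assigning to one position of an Index raises TypeError.
try:
    data.index[0] = 99
    mutated = True
except TypeError:
    mutated = False

# Relabeling means building a new Index; set_axis returns a new DataFrame.
relabeled = data.set_axis(["a", "b", "c"], axis=0)
print(mutated)                   # False
print(relabeled.index.tolist())  # ['a', 'b', 'c']
```

The original `data` keeps its default 0, 1, 2 labels, since `set_axis` does not modify it in place.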

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   order_id    60 non-null     int64         
 1   date        60 non-null     datetime64[ns]
 2   customer    60 non-null     object        
 3   region      60 non-null     object        
 4   product     60 non-null     object        
 5   units       60 non-null     int64         
 6   unit_price  60 non-null     float64       
 7   returned    60 non-null     bool          
 8   notes       50 non-null     object        
 9   revenue     60 non-null     float64       
dtypes: bool(1), datetime64[ns](1), float64(2), int64(2), object(4)
memory usage: 4.4+ KB


#### Panel: 3-dimensional data structure (items × major_axis × minor_axis)

- items → like the "sheets" in 3D data

- major_axis → rows

- minor_axis → columns

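Think of it like a DataFrame for multiple tables stacked along a 3rd axis. Panel was deprecated in pandas 0.20 and removed in 0.25, so current code represents the same data with a MultiIndex DataFrame.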

`Panel (items) → DataFrame → Series`
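One way to get the same stacked layout without Panel is to concatenate one DataFrame per item and let the dictionary keys become an outer index level. This is a minimal sketch with made-up numbers; `electronics`, `furniture`, and `stacked` are hypothetical names.

```python
import pandas as pd

days = pd.date_range("2025-01-01", periods=2)

# Two hypothetical "items", each a small table with the same rows and columns.
electronics = pd.DataFrame({"Laptop": [14, 16], "Tablet": [2, 8]}, index=days)
furniture = pd.DataFrame({"Laptop": [8, 11], "Tablet": [13, 12]}, index=days)

# With a dict, pd.concat uses the keys as the outer index level,
# which plays the role of the Panel "items" axis.
stacked = pd.concat(
    {"Electronics": electronics, "Furniture": furniture},
    names=["Item", "Date"],
)

print(stacked.shape)             # (4, 2)
print(stacked.loc["Furniture"])  # back to one 2-D table
```

Selecting a key with `.loc` drops the outer level and returns the table for that item.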

In [14]:
df.head()

| | order_id | date | customer | region | product | units | unit_price | returned | notes | revenue |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | 2025-01-01 | Diya | East | Headphones | 5 | 38319.36 | False | | 191596.8 |
| 1 | 2002 | 2025-01-02 | Reyansh | East | Smartphone | 7 | 12644.24 | False | Promo Applied | 88509.68 |
| 2 | 2003 | 2025-01-03 | Sai | North | Headphones | 7 | 36682.6 | False | Gift | 256778.2 |
| 3 | 2004 | 2025-01-04 | Sara | North | Smartphone | 4 | 19021.37 | False | Promo Applied | 76085.48 |
| 4 | 2005 | 2025-01-05 | Aditi | East | Laptop | 7 | 31982.99 | False | Urgent | 223880.93 |


#### MultiIndex DataFrame for 3D data

In [11]:
tuples = [
    ('USA', 'New York'),
    ('USA', 'California'),
    ('USA', 'Texas'),
    ('Canada', 'Ontario'),
    ('Canada', 'Quebec')
]

index = pd.MultiIndex.from_tuples(
    tuples, 
    names=['Country', 'State']
)

df_data = pd.DataFrame({
    'Population': [8.3, 39.5, 29.1, 14.7, 8.5],
    'GDP': [1900, 3600, 2100, 900, 450]
}, index=index)
df_data


| Country | State | Population | GDP |
|---|---|---|---|
| USA | New York | 8.3 | 1900 |
| USA | California | 39.5 | 3600 |
| USA | Texas | 29.1 | 2100 |
| Canada | Ontario | 14.7 | 900 |
| Canada | Quebec | 8.5 | 450 |


In [16]:
dates = pd.date_range("2025-01-01", periods=5)
products = ["Laptop", "Smartphone", "Tablet"]
items = ["Electronics", "Furniture"]
print(dates)

data = np.random.randint(1, 20, size=(2, 5, 3))

# pd.Panel was removed in pandas 0.25; a MultiIndex DataFrame is the replacement.


index = pd.MultiIndex.from_product([items, dates], names=["Item", "Date"])
df_panel = pd.DataFrame(data.reshape(-1, 3), index=index, columns=products)
df_panel


DatetimeIndex(['2025-01-01', '2025-01-02', '2025-01-03', '2025-01-04',
               '2025-01-05'],
              dtype='datetime64[ns]', freq='D')


| Item | Date | Laptop | Smartphone | Tablet |
|---|---|---|---|---|
| Electronics | 2025-01-01 | 14 | 8 | 2 |
| Electronics | 2025-01-02 | 16 | 3 | 8 |
| Electronics | 2025-01-03 | 10 | 4 | 3 |
| Electronics | 2025-01-04 | 7 | 13 | 16 |
| Electronics | 2025-01-05 | 2 | 17 | 18 |
| Furniture | 2025-01-01 | 8 | 18 | 13 |
| Furniture | 2025-01-02 | 11 | 5 | 12 |
| Furniture | 2025-01-03 | 3 | 12 | 6 |
| Furniture | 2025-01-04 | 1 | 14 | 17 |
| Furniture | 2025-01-05 | 4 | 8 | 5 |


In [None]:
# Select one product for one item:
df_panel.loc['Electronics', 'Smartphone']

Date
2025-01-01     8
2025-01-02     3
2025-01-03     4
2025-01-04    13
2025-01-05    17
Freq: D, Name: Smartphone, dtype: int64

In [None]:
# Cross-section: all items on a single date
df_panel.xs(pd.Timestamp('2025-01-01'), level='Date')
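The sketch below pulls the MultiIndex selection patterns together and shows that the 2-D layout still holds the full 3-D data. It rebuilds `df_panel` with deterministic values (an assumption made here so the results are reproducible, unlike the random values above); `cube`, `first_day`, and `restored` are names introduced for this example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2025-01-01", periods=5)
products = ["Laptop", "Smartphone", "Tablet"]
items = ["Electronics", "Furniture"]

# Deterministic 3-D data: (items, dates, products).
cube = np.arange(2 * 5 * 3).reshape(2, 5, 3)

index = pd.MultiIndex.from_product([items, dates], names=["Item", "Date"])
df_panel = pd.DataFrame(cube.reshape(-1, 3), index=index, columns=products)

# One item -> a 2-D table indexed by Date.
electronics = df_panel.loc["Electronics"]

# One date across all items -> a 2-D table indexed by Item.
first_day = df_panel.xs(pd.Timestamp("2025-01-01"), level="Date")

# Recover the original 3-D array from the 2-D layout.
restored = df_panel.to_numpy().reshape(len(items), len(dates), len(products))

print(electronics.shape)               # (5, 3)
print(first_day.shape)                 # (2, 3)
print(np.array_equal(restored, cube))  # True
```

`.loc` on the outer level and `.xs` on an inner level cover the two slicing directions that Panel used to expose through its items and major axis.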