# NumPy & Pandas
*Created: 2025-11-03*

Welcome! This notebook is designed for programmers familiar with **R** to learn **NumPy** and **pandas** by solving practical, bite-sized exercises.

In [6]:
# Load numpy 
import numpy as np
print(np.__version__, "numpy version")

2.3.3 numpy version


# Before We Begin: Important Things to Remember

These are key differences and reminders before we dive into **NumPy**.  


---

### 1. Everything in Python is an Object

In R you often call functions like `dim(x)` or `mean(x)`.  
In Python, **every piece of data is an object** that *knows things about itself* and *can do things*.

```python
x = "Bioinformaticians rock"
print(x.upper())   # method → does something
print(len(x))      # function → asks something; len() is one of the built-in functions
```



## Let's initialize an array


In [7]:
# any python sequence (list, tuple, range, string, nested lists/tuples) can be converted into an array

lst = [1,2,3,4,5,6]
a = np.array(lst)

print("Array a:\n", a)

# attributes to get info about array
print("Shape:", a.shape)
print("Number of dimensions:", a.ndim)
print("Data type of elements:", a.dtype)
print("Size (total elements):", a.size)

# built-in functions to get info about the array (note the parentheses)
print("Length:", len(a))
print("Type:", type(a))


# R equivalent:
# a <- c(1,2,3,4,5,6) or 1:6
# length(a); typeof(a), dim(a)

Array a:
 [1 2 3 4 5 6]
Shape: (6,)
Number of dimensions: 1
Data type of elements: int64
Size (total elements): 6
Length: 6
Type: <class 'numpy.ndarray'>


### Other ways to create arrays - using built-in functions


In [8]:
# you can set the dtype explicitly to control the element type
print(np.array(lst))

a_str = np.array(lst, dtype=np.str_)
print(f"Adding dtype=np.str_ converts it to a array of string:\n {a_str}")
print("-"*10)

print(f"np.arange(10)") 
print(np.arange(10)) # 0 to 9; the stop value (10) is excluded
print("-"*10)

print(f"np.arange(0, 10, 2)")
print(np.arange(0, 10, 2)) # start, stop, step
print("-"*10)

print(f"np.arange(0, 10, 2, dtype=float)")
print(np.arange(0, 10, 2, dtype=float))
print("-"*10)

print(f"np.linspace(1., 4., 6)")
print(np.linspace(1., 4., 6)) # start, stop, num of elements



[1 2 3 4 5 6]
Adding dtype=np.str_ converts it to a array of string:
 ['1' '2' '3' '4' '5' '6']
----------
np.arange(10)
[0 1 2 3 4 5 6 7 8 9]
----------
np.arange(0, 10, 2)
[0 2 4 6 8]
----------
np.arange(0, 10, 2, dtype=float)
[0. 2. 4. 6. 8.]
----------
np.linspace(1., 4., 6)
[1.  1.6 2.2 2.8 3.4 4. ]


In [9]:
# Two- and higher-dimensional arrays can be initialized from nested Python sequences:

a_2d = np.array([[10, 20, 30], [40, 50, 60]]) # 2X3 matrix
print("Array a_2d:\n", a_2d)
print("Shape:", a_2d.shape)
print("-"*10)

# or we can simply reshape the 1D array into a 2D array
print("np.array(lst).reshape(2, 3)) # 2 rows, 3 columns\n", np.array(lst).reshape(2, 3))
print("-"*10)

a_2d = a_2d.reshape(3,2)
print("Reshaped Array a_2d:\n", a_2d)

print("Shape:", a_2d.shape)
print("Number of dimensions:", a_2d.ndim)
print("Data type of elements:", a_2d.dtype)
print("Size (total elements):", a_2d.size)
print("-"*10)

Array a_2d:
 [[10 20 30]
 [40 50 60]]
Shape: (2, 3)
----------
np.array(lst).reshape(2, 3)) # 2 rows, 3 columns
 [[1 2 3]
 [4 5 6]]
----------
Reshaped Array a_2d:
 [[10 20]
 [30 40]
 [50 60]]
Shape: (3, 2)
Number of dimensions: 2
Data type of elements: int64
Size (total elements): 6
----------


In [10]:
# create zeros, ones, identity matrix
print(f"np.zeros((2,3))")
print(np.zeros((2,3))) # shape
print("-"*10)

print(f"np.ones((3,2), dtype=np.int_)")
print(np.ones((3,2), dtype=np.int_)) # shape and datatype
print("-"*10)

print(f"np.eye(3)")
print(np.eye(3)) # identity matrix of size n x n
print(np.eye(3).shape)

np.zeros((2,3))
[[0. 0. 0.]
 [0. 0. 0.]]
----------
np.ones((3,2), dtype=np.int_)
[[1 1]
 [1 1]
 [1 1]]
----------
np.eye(3)
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
(3, 3)


In [11]:
print("Original:\n", a_2d)

# ravel() → returns a view if possible
r = a_2d.ravel()
print("ravel():", r)

# flatten() → always returns a copy
f = a_2d.flatten()
print("flatten():", f)

# View and Copy behavior

print("a (original):\n", a_2d)      
# Modify the ravel() result
r[0] = 99
print("\nAfter modifying ravel():")
print("r:", r)
print("a (original):\n", a_2d)   # changed!

# Modify the flatten() result
f[1] = 55
print("\nAfter modifying flatten():")
print("f:", f)
print("a (original):\n", a_2d)   # unchanged


Original:
 [[10 20]
 [30 40]
 [50 60]]
ravel(): [10 20 30 40 50 60]
flatten(): [10 20 30 40 50 60]
a (original):
 [[10 20]
 [30 40]
 [50 60]]

After modifying ravel():
r: [99 20 30 40 50 60]
a (original):
 [[99 20]
 [30 40]
 [50 60]]

After modifying flatten():
f: [10 55 30 40 50 60]
a (original):
 [[99 20]
 [30 40]
 [50 60]]


In R:

```r
x <- matrix(1:6, nrow = 2)
as.vector(x)
```

The result behaves as an independent copy: modifying it never changes `x`.
R has no concept of a "view" onto the same memory.

So:

* `flatten()` behaves like R's `as.vector()`: it returns a copy.
* `ravel()` is the performance-optimized version that avoids copying whenever possible.
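
One way to check which result shares memory with the original is `np.shares_memory`. The sketch below uses a made-up array `m` (not one of the notebook's variables). It also shows a case where `ravel()` has to fall back to a copy, because a transposed array is no longer laid out contiguously in row-major order.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)   # illustrative array, not from the notebook

view = m.ravel()       # contiguous input: ravel() can return a view
copy = m.flatten()     # flatten() always copies
forced = m.T.ravel()   # transposed input: ravel() has to copy

shares_view = np.shares_memory(m, view)
shares_copy = np.shares_memory(m, copy)
shares_forced = np.shares_memory(m, forced)
print(shares_view, shares_copy, shares_forced)
```

So "a view if possible" really does depend on the memory layout of the input, not only on which method is called.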


## Accessing NumPy arrays

In [12]:
# Python starts with index 0
#1D
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
print("x:", x)
print("x[0] # First element\n", x[0])
print("x[:3] # upto 3rd element\n", x[:3])
print("x[-1] # Last element \n", x[-1])  # a negative index i means n + i; here 10 + (-1) = 9
print("x[::-1] # Reverse Array\n", x[::-1])

print("-"*20)
print("Slicing and Striding")
print("-"*20)

# The basic slice syntax is i:j:k where i is the starting index, j is the stopping index, and k is the step 
print("x:", x)
print("x[1:7:2]) # From index 1 to 6, step 2\nAns:", x[1:7:2])

# Negative i and j are interpreted as n + i and n + j where n is the number of elements in the corresponding dimension. 
# Negative k makes stepping go towards smaller indices

print("x[-3:3:-1]) # From index -3 to 3, step -1\nAns:", x[-3:3:-1])

print("-"*20)
print("Fancy Slice Indexing")
# Fancy indexing: pass an array of indices to pick elements in any order
x = x[::-1].copy()
print(x)
print(x[np.array([1,6,2])])

print("-"*20)
print("Boolean Indexing")
# We can also use boolean indexing
print(x)
mask = (x>5) & (x<7)
print(x[mask])

x: [0 1 2 3 4 5 6 7 8 9]
x[0] # First element
 0
x[:3] # upto 3rd element
 [0 1 2]
x[-1] # Last element 
 9
x[::-1] # Reverse Array
 [9 8 7 6 5 4 3 2 1 0]
--------------------
Slicing and Striding
--------------------
x: [0 1 2 3 4 5 6 7 8 9]
x[1:7:2]) # From index 1 to 6, step 2
Ans: [1 3 5]
x[-3:3:-1]) # From index -3 to 3, step -1
Ans: [7 6 5 4]
--------------------
Fancy Slice Indexing
[9 8 7 6 5 4 3 2 1 0]
[8 3 7]
--------------------
Boolean Indexing
[9 8 7 6 5 4 3 2 1 0]
[6]
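
For readers coming from R, the two habits that most often need translating are 1-based inclusive ranges and the meaning of negative indices. The sketch below is only an illustration: `r_range_to_slice` is a hypothetical helper written for this example, not a NumPy function.

```python
import numpy as np

def r_range_to_slice(start, stop):
    """Translate R's 1-based, inclusive start:stop into a Python slice.

    Hypothetical helper for illustration; it is not part of NumPy.
    """
    return slice(start - 1, stop)

x = np.arange(10)

# R: x[2:4] picks the 2nd, 3rd and 4th elements
picked = x[r_range_to_slice(2, 4)]   # same as x[1:4]
print(picked)

# R: x[-1] drops the first element; Python: x[-1] is the last element
last = x[-1]
without_first = x[1:]
print(last, without_first)
```

In practice you would write `x[1:4]` directly; the helper only makes the off-by-one rule explicit.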


In [13]:
#2D
print("a_2d\n", a_2d)
print("a_2d[0, 0] =", a_2d[0, 0])   # first row, first column
print("First row:", a_2d[0])
print("Second column:", a_2d[:, 1])

y = np.arange(35).reshape(5, 7)
print(y[1,5])
print(y)
print(y.ndim)
print(y.shape)
print(y[np.array([0,2,4]), np.array([0,1,2])])


a_2d
 [[99 20]
 [30 40]
 [50 60]]
a_2d[0, 0] = 99
First row: [99 20]
Second column: [20 40 60]
12
[[ 0  1  2  3  4  5  6]
 [ 7  8  9 10 11 12 13]
 [14 15 16 17 18 19 20]
 [21 22 23 24 25 26 27]
 [28 29 30 31 32 33 34]]
2
(5, 7)
[ 0 15 30]
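
Passing two index arrays pairs them element by element, which differs from R, where `y[c(1,3,5), c(1,2,3)]` returns the whole sub-block. A minimal sketch of both behaviours, using `np.ix_` to request every row/column combination:

```python
import numpy as np

y = np.arange(35).reshape(5, 7)
rows = np.array([0, 2, 4])
cols = np.array([0, 1, 2])

paired = y[rows, cols]          # elements (0,0), (2,1), (4,2)
block = y[np.ix_(rows, cols)]   # every combination of rows and cols

print(paired)
print(block)
```

`paired` has shape `(3,)`, while `block` has shape `(3, 3)`.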


In [14]:
# 3D
a_3d = np.arange(24).reshape(2, 3, 4)
print("3D array a_3d:\n", a_3d)
print("Shape:", a_3d.shape)   # (depth, rows, columns)
print("Dimensions:", a_3d.ndim)
print("-"*10)

# Accessing elements
print("a_3d[0] # First 2D slice (index 0):\n", a_3d[0])
print("-"*10)

print("a_3d[1, 2, 3] # Element at depth=1, row=2, col=3 \n", a_3d[1, 2, 3])

# NOTE: the same syntax extends to higher dimensions (3D, 4D, 10D, ...).
# This generality is what makes NumPy useful for images, time series, ML tensors, etc.

3D array a_3d:
 [[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]
Shape: (2, 3, 4)
Dimensions: 3
----------
a_3d[0] # First 2D slice (index 0):
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
----------
a_3d[1, 2, 3] # Element at depth=1, row=2, col=3 
 23


###  Recap — Understanding Array Dimensions

| Concept | R analogy | NumPy `.shape` | Dimensions (`.ndim`) |
|----------|------------|-----------------|----------|
| Vector | `c(1,2,3)` | `(n,)` | 1D |
| Matrix | `matrix(1:6, 2, 3)` | `(rows, cols)` | 2D |
| Array (tensor) | `array(1:24, c(2,3,4))` | `(2, 3, 4)` | 3D |

- `a.ndim` tells you **how many axes (dimensions)** the array has  
- `a.shape` tells you **how long each dimension is**  
- `a.size` gives total number of elements  

NumPy extends R’s vector/matrix idea to **arbitrary dimensions**, all with the same consistent syntax.
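
One difference worth keeping in mind when comparing with R: `matrix(1:6, 2, 3)` fills column by column, while NumPy's `reshape` fills row by row by default. The sketch below shows both fill orders; `order="F"` (Fortran, column-major) is the NumPy option that mimics R's fill.

```python
import numpy as np

values = np.arange(1, 7)   # 1..6, like 1:6 in R

row_major = values.reshape(2, 3)              # NumPy default (C order)
col_major = values.reshape(2, 3, order="F")   # fills like R's matrix(1:6, 2, 3)

print(row_major)
print(col_major)
```

Both results have shape `(2, 3)`; only the placement of the values differs.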


## Broadcasting

NumPy aligns arrays from the trailing dimensions (rightmost side).

Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays:

* Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
* Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
* Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.


In other words, two dimensions are compatible when:

* they are equal, or
* one of them is 1.

If they are not compatible, NumPy raises a `ValueError`.

![Alt Text](broadcast_examples.png "Broadcasting Examples")
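
The rules can be tried out on shapes alone, without building any arrays, using `np.broadcast_shapes` (available in NumPy 1.20 and later). The shapes below are made up for illustration.

```python
import numpy as np

s1 = np.broadcast_shapes((3, 3), (3,))       # (3,) is padded to (1, 3), then stretched
s2 = np.broadcast_shapes((2, 3), (2, 1))     # the size-1 axis is stretched
s3 = np.broadcast_shapes((8, 1, 6), (7, 1))  # padding and stretching combined
print(s1, s2, s3)

try:
    np.broadcast_shapes((4,), (2,))          # 4 vs 2, and neither is 1
    compatible = True
except ValueError as err:
    compatible = False
    print("ValueError:", err)
```

The last case is the one that surprises R users most: NumPy does not recycle a length-2 vector along a length-4 one.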


In [None]:
# How NumPy treats arrays with different shapes during arithmetic operations

# Scalar broadcasting
a = np.array([1.0, 2.0, 3.0])
b = 10
print(a*b)

# Vector broadcasting
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]]) # 3 X 3 matrix

b = np.array([10, 20, 30]) # 3 element vector

print(A + b) # b is broadcast across each row of A

# Column broadcasting
A = np.arange(9).reshape(3, 3)  # 3 X 3 matrix
print("Matrix A:\n", A)
col = np.array([[0], [10], [20]])   # 3 X 1 column vector
print("Column vector col:\n", col)
print(A - col)

[10. 20. 30.]
[[11 22 33]
 [14 25 36]
 [17 28 39]]
Matrix A:
 [[0 1 2]
 [3 4 5]
 [6 7 8]]
Column vector col:
 [[ 0]
 [10]
 [20]]
[[  0   1   2]
 [ -7  -6  -5]
 [-14 -13 -12]]
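
A 1-D vector always lines up with the last axis. To make it act on rows instead, it first has to become a column. A short sketch using `np.newaxis`, with the same numbers as the column-broadcasting cell:

```python
import numpy as np

A = np.arange(9).reshape(3, 3)
v = np.array([0, 10, 20])        # shape (3,)

by_row = A - v                   # v lines up with the columns (last axis)
by_col = A - v[:, np.newaxis]    # v reshaped to (3, 1), one value per row

print(v[:, np.newaxis].shape)
print(by_row)
print(by_col)
```

`v.reshape(-1, 1)` would give the same column; `np.newaxis` simply makes the added axis visible in the indexing expression.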


# Array vs Scientific Python Ecosystem

You might hear of a 0-D (zero-dimensional) array referred to as a “scalar”, a 1-D (one-dimensional) array as a “vector”, a 2-D (two-dimensional) array as a “matrix”, or an N-D (N-dimensional, where “N” is typically an integer greater than 2) array as a “tensor”. For clarity, it is best to avoid the mathematical terms when referring to an array because the mathematical objects with these names behave differently than arrays (e.g. “matrix” multiplication is fundamentally different from “array” multiplication), and there are other objects in the scientific Python ecosystem that have these names (e.g. the fundamental data structure of PyTorch is the “tensor”).

In [None]:
import numpy as np

a = np.array([[1, 2],
              [3, 4]])

b = np.array([[5, 6],
              [7, 8]])

# elementwise multiply (in R: a * b)
print(a * b)   

# matrix multiply (in R: a %*% b)
print(a @ b)  # np.dot(a, b) gives the same result for 2-D arrays; the @ operator (Python 3.5+) removed the old ambiguity with *

[[ 5 12]
 [21 32]]
[[19 22]
 [43 50]]


In [None]:
# Let's do some exercises on indexing and broadcasting with arrays of different shapes

# Basic indexing
# 1D array of 0–9
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# TODO:
# 1. Select the element at index 3.
# 2. Get the slice [2, 3, 4].
# 3. Get every second element from index 1 to 8 (inclusive of 1, exclusive of 9).
# 4. Get the last 3 elements using negative indices.
# 5. Reverse the array using slicing.

# Fill in:
# a = ...
# b = ...
# c = ...
# d = ...
# e = ...

#2D indexing 

A = np.array([
    [10, 11, 12],
    [20, 21, 22],
    [30, 31, 32],
    [40, 41, 42]
])

# TODO:
# 1. Select the element 31 using 2D indexing.
# 2. Get the second row as a 1D array.
# 3. Get the first column (all rows).
# 4. Get the submatrix:
#       [[21, 22],
#        [31, 32]]

# Fill in:
# v1 = ...      # 31
# row2 = ...
# col1 = ...
# sub = ...

# Boolean indexing
B = np.array([5, 12, 7, 20, 3, 15])

# TODO:
# 1. Create a boolean mask for elements > 10.
# 2. Use the mask to select those elements.
# 3. Select elements between 5 and 15 (inclusive) using a combined condition.
#    (Remember: use & and parentheses.)

# mask = ...
# big = ...
# mid = ...

# broadcasting column-wise + row-wise
M = np.array([
    [1, 2, 3],
    [4, 5, 6]
])  # shape (2, 3)

row = np.array([10, 20, 30])    # shape (3,)
col = np.array([[100],
                [200]])         # shape (2,1)

# TODO:
# 1. Predict the shape and result of M + row.
# 2. Predict the shape and result of M + col.
# 3. Then run them and verify.

# mr = M + row
# mc = M + col

# R vs Python/Numpy broadcasting with different lengths
a = np.array([1, 2, 3, 4])   # shape (4,)
b = np.array([10, 20])       # shape (2,)

# TODO:
# 1. In R, c(1,2,3,4) + c(10,20) works (recycling).
#    What do you EXPECT here?
# 2. Run: a + b
#    What happens, and why?

# res = 

# Fake gene expression: rows = genes, cols = samples
expr = np.array([
    [10, 15, 20],   # gene1
    [ 5,  3,  8],   # gene2
    [30, 25, 35],   # gene3
])  # shape (3, 3)

# Per-sample scaling factors (e.g. library size correction)
scale = np.array([0.5, 1.0, 2.0])   # shape (3,)

# TODO:
# 1. Use broadcasting to apply `scale` to each column of expr.
#    (Each column j: expr[:, j] * scale[j]), think about broadcasting in python
# 2. Compute per-gene mean AFTER scaling (mean() is a method).
# 3. Predict shapes at each step.

# scaled = ...
# gene_mean = ...

# scaled = expr * scale
# print(scaled)
# scaled.mean(axis=1)



## I/O in NumPy

In [153]:
# I/O: use np.loadtxt or np.genfromtxt to read data from a text file into a NumPy array

# Using np.loadtxt to read a CSV file, skipping first 2 rows and first column
# For numeric data only
mat = np.loadtxt("gene_expression.txt", delimiter=",", skiprows=2, usecols=range(1,5))
print(mat)
print(mat.shape)
print(mat.dtype)

# Using np.genfromtxt to read a CSV file, automatically handling missing values
data = np.genfromtxt("gene_expression.txt", delimiter=",", comments="#", dtype=None)
print(data)

print(data.shape)
print(f"{data.dtype} - Unicode string array where each element can hold up to 7 characters.")

data = np.genfromtxt("gene_expression_missing.txt", delimiter=",", comments="#", filling_values=np.nan)
print(data)

[[120. 135. 150. 160.]
 [ 80.  90.  97.  95.]
 [300. 280. 310. 290.]
 [ 45.  55.  50.  60.]
 [500. 480. 520. 510.]]
(5, 4)
float64
[['Gene' 'Sample1' 'Sample2' 'Sample3' 'Sample4']
 ['GeneA' '120' '135' '150' '160']
 ['GeneB' '80' '90' '97' '95']
 ['GeneC' '300' '280' '310' '290']
 ['GeneD' '45' '55' '50' '60']
 ['GeneE' '500' '480' '520' '510']]
(6, 5)
<U7 - Unicode string array where each element can hold up to 7 characters.
[[ nan  nan  nan  nan  nan]
 [ nan 120. 135. 150. 160.]
 [ nan  80.  nan  90.  95.]
 [ nan 300. 280. 310. 290.]
 [ nan  45.  55.  nan  60.]
 [ nan 500. 480. 520. 510.]]


When NumPy loads a text file with mixed data (like `GeneA`, `120`, `Sample1`, etc.),
it promotes everything to the smallest common type that can represent all entries;
in this case, that is a string.

It then scans the data for the longest string (here 7 characters, as in "Sample1")
and sets the dtype to `<U7`.
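
The same sizing rule can be seen without a file. The sketch below builds small string arrays by hand (made-up values that mirror the table above) and then recovers the numeric block with `astype(float)`.

```python
import numpy as np

labels = np.array(["GeneA", "Sample1", "120"])
print(labels.dtype)   # unicode strings sized to the longest entry

table = np.array([["Gene", "Sample1", "Sample2"],
                  ["GeneA", "120", "135"],
                  ["GeneB", "80", "90"]])

# Strip the header row and the name column, then convert the rest to numbers
numeric = table[1:, 1:].astype(float)
print(numeric.dtype, numeric.shape)
print(numeric.mean(axis=1))
```

Once the block is numeric again, aggregations such as `mean` work as they do on `mat`.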


In [125]:
# Aggregation functions
print(mat.mean(axis=1, keepdims=True))  # mean of gene expression across samples 
print(mat.sum(axis=0, keepdims=True))  # sum of gene expression per sample  

# aggregating a string array raises an error
#print(data.mean(axis=1))

[[141.25]
 [ 90.5 ]
 [295.  ]
 [ 52.5 ]
 [502.5 ]]
[[1045. 1040. 1127. 1115.]]


In [None]:
# Can you extract colnames and rownames from this "data" array?
#rownames = data[1:, 0]   # gene names: first column, skipping the header row
#colnames = data[0, 1:]   # sample names: header row, skipping the "Gene" label
#print(rownames)
#print(colnames)

['GeneA' 'GeneB' 'GeneC' 'GeneD' 'GeneE']
['Sample1' 'Sample2' 'Sample3' 'Sample4']


* For example, we can keep the numeric matrix as a NumPy array,
* get the metadata (gene names, sample names, etc.) from the string array,
* then save everything in a suitable format: txt or npz (compressed NumPy).
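
A minimal sketch of the last step, assuming the matrix and its labels are held in three separate arrays. The values, the variable names and the file name `expression.npz` are made up for illustration; `np.savez_compressed` and `np.load` are the NumPy calls that write and read the archive.

```python
import os
import tempfile
import numpy as np

# Small made-up values standing in for the matrix and its labels
expr = np.array([[120.0, 135.0], [80.0, 90.0]])
genes = np.array(["GeneA", "GeneB"])
samples = np.array(["Sample1", "Sample2"])

path = os.path.join(tempfile.mkdtemp(), "expression.npz")   # hypothetical file name
np.savez_compressed(path, expr=expr, genes=genes, samples=samples)

with np.load(path) as archive:
    loaded_expr = archive["expr"]
    loaded_genes = archive["genes"]
    loaded_samples = archive["samples"]

print(loaded_genes, loaded_samples)
print(loaded_expr)
```

Keeping the labels next to the numbers in one archive avoids the string promotion seen earlier, since each array keeps its own dtype.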

# This is where pandas is more convenient: mixed types

In [187]:
import pandas as pd

# (Optional) display tweaks
pd.set_option("display.max_rows", 50)
pd.set_option("display.width", 120)
print(pd.__version__, "pandas version")

2.3.3 pandas version


## Create a dataframe from csv

In [316]:
# R equivalent:
# df <- read.csv("gene_expression.txt", skip=1)
df = pd.read_csv("gene_expression.txt", skiprows=1) # sep="," is the default (delimiter="," is an alias)
display(df)

# Setting index column to "Gene"
# R equivalent:
# df <- read.csv("gene_expression.txt", skip=1, row.names=1)
df1 = pd.read_csv("gene_expression.txt", sep=",", index_col=0, skiprows=1)
display(df1)

Unnamed: 0,Gene,Sample1,Sample2,Sample3,Sample4
0,GeneA,120,135,150,160
1,GeneB,80,90,97,95
2,GeneC,300,280,310,290
3,GeneD,45,55,50,60
4,GeneE,500,480,520,510


Unnamed: 0_level_0,Sample1,Sample2,Sample3,Sample4
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
GeneA,120,135,150,160
GeneB,80,90,97,95
GeneC,300,280,310,290
GeneD,45,55,50,60
GeneE,500,480,520,510
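
To experiment with `read_csv` options without touching a file, the text can be fed in through `io.StringIO`. The content below is made up, and the leading `#` line is only an assumption about what the skipped first row might look like. With such a layout, `comment="#"` is an alternative to counting rows with `skiprows`.

```python
import io
import pandas as pd

# Made-up file content; the leading comment line is an assumption about the layout
text = """# expression counts
Gene,Sample1,Sample2
GeneA,120,135
GeneB,80,90
"""

by_skip = pd.read_csv(io.StringIO(text), skiprows=1, index_col="Gene")
by_comment = pd.read_csv(io.StringIO(text), comment="#", index_col="Gene")

print(by_comment)
print(by_skip.equals(by_comment))
```

Passing the column name to `index_col` reads a little more clearly than `index_col=0` and gives the same index here.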


In [317]:
# Preview top and bottom rows
df.head()   # like head(df) in R
df.tail()   # like tail(df) in R

# Summary info (like str(df) in R)
df.info() # gives concise summary of dataframe

# Quick statistics summary (like summary(df) in R)
df.describe() # gives descriptive statistics

# Check for missing values
df.isna().sum() # count of missing values per column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Gene     5 non-null      object
 1   Sample1  5 non-null      int64 
 2   Sample2  5 non-null      int64 
 3   Sample3  5 non-null      int64 
 4   Sample4  5 non-null      int64 
dtypes: int64(4), object(1)
memory usage: 332.0+ bytes


Gene       0
Sample1    0
Sample2    0
Sample3    0
Sample4    0
dtype: int64

## Extracting data from DataFrame

In [318]:
display(df)

# Get column names
print(df.columns) # R equivalent: colnames(df)

# Get row names
print(df.index) #  R equivalent: rownames(df)

# Select one column (returns a Series)
print(df["Sample1"]) # R equivalent: df$Sample1

# Select multiple columns (returns a DataFrame)
display(df[["Sample1", "Sample2"]]) # R equivalent: df[, c("Sample1", "Sample2")]

# Select by position (iloc)
print(df.iloc[:, 1:3])    # same as df[, 2:3] in R
print(df.iloc[1:3])    # same as df[2:3,] in R

# Select by label (loc)
df.loc[:, ["Sample1", "Sample2"]]  # same as df[, c("Sample1", "Sample2")] in R


Unnamed: 0,Gene,Sample1,Sample2,Sample3,Sample4
0,GeneA,120,135,150,160
1,GeneB,80,90,97,95
2,GeneC,300,280,310,290
3,GeneD,45,55,50,60
4,GeneE,500,480,520,510


Index(['Gene', 'Sample1', 'Sample2', 'Sample3', 'Sample4'], dtype='object')
RangeIndex(start=0, stop=5, step=1)
0    120
1     80
2    300
3     45
4    500
Name: Sample1, dtype: int64


Unnamed: 0,Sample1,Sample2
0,120,135
1,80,90
2,300,280
3,45,55
4,500,480


   Sample1  Sample2
0      120      135
1       80       90
2      300      280
3       45       55
4      500      480
    Gene  Sample1  Sample2  Sample3  Sample4
1  GeneB       80       90       97       95
2  GeneC      300      280      310      290


Unnamed: 0,Sample1,Sample2
0,120,135
1,80,90
2,300,280
3,45,55
4,500,480
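
One detail that is easy to miss: `iloc` slices exclude the stop position, like every Python slice, while `loc` slices include the stop label. A small sketch on a made-up frame `toy`:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Gene": ["GeneA", "GeneB", "GeneC", "GeneD"], "Sample1": [120, 80, 300, 45]}
)  # made-up frame with the default 0..3 index

by_position = toy.iloc[1:3]   # positions 1 and 2; the stop is excluded
by_label = toy.loc[1:3]       # labels 1, 2 and 3; the stop is included

print(len(by_position), len(by_label))

# With gene names as the index, loc reads much like R's df["GeneB", "Sample1"]
indexed = toy.set_index("Gene")
value = indexed.loc["GeneB", "Sample1"]
print(value)
```

With the default integer index the two look deceptively similar, so it helps to decide up front whether a selection is meant by position or by label.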


### Filtering

In [None]:
# Filter rows where Sample1 > 100
display(df[df["Sample1"] > 100])  # R equivalent: df[df$Sample1 > 100, ]

# Combine conditions (note the parentheses!)
display(df[(df["Sample1"] > 100) & (df["Sample3"] < 200)]) # R equivalent: df[df$Sample1 > 100 & df$Sample3 < 200, ]

# Negate condition (~)

display(df[~df["Gene"].str.startswith("GeneA")]) # R equivalent: df[!startsWith(df$Gene, "GeneA"), ]


Unnamed: 0,Gene,Sample1,Sample2,Sample3,Sample4
0,GeneA,120,135,150,160
2,GeneC,300,280,310,290
4,GeneE,500,480,520,510


Unnamed: 0,Gene,Sample1,Sample2,Sample3,Sample4
0,GeneA,120,135,150,160


Unnamed: 0,Gene,Sample1,Sample2,Sample3,Sample4
1,GeneB,80,90,97,95
2,GeneC,300,280,310,290
3,GeneD,45,55,50,60
4,GeneE,500,480,520,510
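
Two further filtering idioms that map onto familiar R patterns are `isin` (like `%in%`) and `query`, which takes the condition as a string. The sketch uses a made-up frame `toy` rather than the notebook's `df`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gene": ["GeneA", "GeneB", "GeneC"],
    "Sample1": [120, 80, 300],
    "Sample3": [150, 97, 310],
})  # made-up values

wanted = toy[toy["Gene"].isin(["GeneA", "GeneC"])]       # like df$Gene %in% c(...)
queried = toy.query("Sample1 > 100 and Sample3 < 200")   # string form of the boolean mask

print(wanted["Gene"].tolist())
print(queried["Gene"].tolist())
```

Inside `query` the word `and` can be used, so the extra parentheses needed with `&` are not required.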


### Handling missing values

In [330]:

# Check missing values
display(df)

# Let's introduce missing values for demonstration
df.iloc[0,1] = np.nan
df.iloc[3,3] = np.nan
display(df)

print(df.isna().sum()) # R equivalent: colSums(is.na(df))

# Drop rows with any missing values (returns a new DataFrame; df itself is unchanged)
df.dropna() # R equivalent: na.omit(df)

colmeans = df.mean(numeric_only=True)
print(colmeans)
# Fill missing values with column means
df.fillna(colmeans, inplace=True) # rough R equivalent, per column x: x[is.na(x)] <- mean(x, na.rm=TRUE)

# Fill with specific value
# df.fillna(0, inplace=True) # R equivalent: df[is.na(df)] <- 0

display(df)

Unnamed: 0_level_0,Sample1,Sample2,Sample3,Sample4,Total
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
GeneA,231.25,135,150.0,160,676.25
GeneB,80.0,90,97.0,95,362.0
GeneC,300.0,280,310.0,290,1180.0
GeneD,45.0,55,269.25,60,429.25
GeneE,500.0,480,520.0,510,2010.0


Unnamed: 0_level_0,Sample1,Sample2,Sample3,Sample4,Total
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
GeneA,231.25,,150.0,160.0,676.25
GeneB,80.0,90.0,97.0,95.0,362.0
GeneC,300.0,280.0,310.0,290.0,1180.0
GeneD,45.0,55.0,269.25,,429.25
GeneE,500.0,480.0,520.0,510.0,2010.0


Sample1    0
Sample2    1
Sample3    0
Sample4    1
Total      0
dtype: int64
Sample1    231.25
Sample2    226.25
Sample3    269.25
Sample4    263.75
Total      931.50
dtype: float64


Unnamed: 0_level_0,Sample1,Sample2,Sample3,Sample4,Total
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
GeneA,231.25,226.25,150.0,160.0,676.25
GeneB,80.0,90.0,97.0,95.0,362.0
GeneC,300.0,280.0,310.0,290.0,1180.0
GeneD,45.0,55.0,269.25,263.75,429.25
GeneE,500.0,480.0,520.0,510.0,2010.0
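
Both `dropna` and `fillna` return new objects unless `inplace=True` is passed, which is easy to overlook when coming from R-style assignment. A minimal sketch on a made-up frame `toy`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"Sample1": [120.0, np.nan, 300.0], "Sample2": [135.0, 90.0, np.nan]},
    index=["GeneA", "GeneB", "GeneC"],
)  # made-up values

dropped = toy.dropna()            # new frame; toy still has its NaNs
filled = toy.fillna(toy.mean())   # each NaN replaced by its own column mean

print(toy.isna().sum().tolist())
print(dropped.index.tolist())
print(filled)
```

Assigning the result to a new name keeps the original frame available for comparison, which can be safer than `inplace=True` while exploring.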


### Apply operations column-wise or row-wise

In [None]:
# Column means
print(df.mean(numeric_only=True))

# Row means
print(df.mean(axis=1, numeric_only=True))

# Create a new column (total expression per gene)
df["Total"] = df[["Sample1", "Sample2", "Sample3", "Sample4"]].sum(axis=1)
display(df)

# Apply a custom lambda row-wise
#df["Range"] = df.apply(lambda row: row.max() - row.min(), axis=1) # R equivalent: apply(..., 1, function(row) max(row) - min(row))
# This raises a TypeError while "Gene" is a regular (string) column,
# because pandas preserves the dtype of each column.

# So either move "Gene" into the index first
#df.set_index("Gene", inplace=True)


# (or) pass numeric_only=True to max() and min(); this works only when every value in the row is already numeric
df2 = df.apply(lambda row: row.max(numeric_only=True) - row.min(numeric_only=True), axis=1)

# (or) select the numeric columns before applying
# df.select_dtypes(include=np.number).apply(
#    lambda row: row.max() - row.min(), axis=1
#)

display(df)


# Let's stop here: create a normalized matrix using the lambda function x / x.mean() and assign it to df_norm.
## HINT: make sure only numeric dtypes are used for the calculation
##
##

Sample1    231.25
Sample2    226.25
Sample3    269.25
Sample4    263.75
Total      931.50
dtype: float64
Gene
GeneA    288.75
GeneB    144.80
GeneC    472.00
GeneD    212.45
GeneE    804.00
dtype: float64


Unnamed: 0_level_0,Sample1,Sample2,Sample3,Sample4,Total
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
GeneA,231.25,226.25,150.0,160.0,767.5
GeneB,80.0,90.0,97.0,95.0,362.0
GeneC,300.0,280.0,310.0,290.0,1180.0
GeneD,45.0,55.0,269.25,263.75,633.0
GeneE,500.0,480.0,520.0,510.0,2010.0


Unnamed: 0_level_0,Sample1,Sample2,Sample3,Sample4,Total
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
GeneA,231.25,226.25,150.0,160.0,767.5
GeneB,80.0,90.0,97.0,95.0,362.0
GeneC,300.0,280.0,310.0,290.0,1180.0
GeneD,45.0,55.0,269.25,263.75,633.0
GeneE,500.0,480.0,520.0,510.0,2010.0
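
Many row-wise or column-wise `apply` calls can also be written with built-in reductions, which tends to be shorter and faster. The sketch below shows one possible approach on a made-up frame `toy`; it is not the only way to solve the exercise above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gene": ["GeneA", "GeneB"],
    "Sample1": [120.0, 80.0],
    "Sample2": [140.0, 100.0],
})  # made-up values

numeric = toy.select_dtypes(include="number")   # drops the string column

# Row-wise range without apply: reduce along the columns (axis=1)
row_range = numeric.max(axis=1) - numeric.min(axis=1)

# Column-wise scaling: each column divided by its own mean
normalized = numeric / numeric.mean()

print(row_range.tolist())
print(normalized)
```

Dividing a DataFrame by a Series aligns the Series index with the column names, so each column is scaled by its own mean.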


In [None]:
# We can also chain multiple operations together, like piping in R, using method chaining
# For example, drop NA, create Total column, filter rows where Sample1 > 150, sort by Total descending
display(df)
like_piping_in_R = (df
 .dropna()
 .assign(Total=lambda d: d[["Sample1", "Sample2", "Sample3", "Sample4"]].sum(axis=1))
 .query("Sample1 > 150")
 .sort_values("Total", ascending=False)
)
display(like_piping_in_R)


Unnamed: 0,Gene,Sample1,Sample2,Sample3,Sample4
0,GeneA,120,135.0,150.0,160
1,GeneB,80,,90.0,95
2,GeneC,300,280.0,310.0,290
3,GeneD,45,55.0,,60
4,GeneE,500,480.0,520.0,510


Unnamed: 0,Gene,Sample1,Sample2,Sample3,Sample4,Total
4,GeneE,500,480.0,520.0,510,2010.0
2,GeneC,300,280.0,310.0,290,1180.0


### Grouping and summarizing

In [334]:
display(df)
# Suppose we add a fake group column
df["Group"] = ["Control", "Control", "Treatment", "Treatment", "Control"]

# Group and compute mean per group
display(df.groupby("Group")[["Sample1", "Sample2", "Sample3", "Sample4"]].mean())

# Multiple summaries at once
display(df.groupby("Group").agg({"Sample1": ["mean", "std"], 
                                 "Sample2": ["min", "max"]}))


Unnamed: 0_level_0,Sample1,Sample2,Sample3,Sample4,Total,Group
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
GeneA,231.25,226.25,150.0,160.0,767.5,Control
GeneB,80.0,90.0,97.0,95.0,362.0,Control
GeneC,300.0,280.0,310.0,290.0,1180.0,Treatment
GeneD,45.0,55.0,269.25,263.75,633.0,Treatment
GeneE,500.0,480.0,520.0,510.0,2010.0,Control


Unnamed: 0_level_0,Sample1,Sample2,Sample3,Sample4
Group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Control,270.416667,265.416667,255.666667,255.0
Treatment,172.5,167.5,289.625,276.875


Unnamed: 0_level_0,Sample1,Sample1,Sample2,Sample2
Unnamed: 0_level_1,mean,std,min,max
Group,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Control,270.416667,212.721698,90.0,480.0
Treatment,172.5,180.312229,55.0,280.0
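
The dictionary form of `agg` produces two-level column headers. If flat, readable column names are preferred, named aggregation is an alternative. A short sketch on made-up data; the output names `sample1_mean` and `sample2_max` are arbitrary.

```python
import pandas as pd

toy = pd.DataFrame({
    "Group": ["Control", "Control", "Treatment", "Treatment"],
    "Sample1": [100.0, 200.0, 300.0, 500.0],
    "Sample2": [10.0, 30.0, 20.0, 40.0],
})  # made-up values

summary = toy.groupby("Group").agg(
    sample1_mean=("Sample1", "mean"),   # new_name=(column, function)
    sample2_max=("Sample2", "max"),
)
print(summary)
```

This is close in spirit to `summarise(sample1_mean = mean(Sample1), ...)` in dplyr.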


### Reshaping data

(Equivalent to `pivot_longer()` / `pivot_wider()` from R's tidyr package)


In [341]:
# Melt to long format (tidy)
df = df.reset_index()  # make sure "Gene" is a column, not index
display(df)

df_long = df.melt(id_vars=["Gene", "Group"], var_name="Sample", value_name="Expression")
print(df_long.head())

# Pivot back to wide
df_wide = df_long.pivot(index="Gene", columns="Sample", values="Expression")
print(df_wide)

Unnamed: 0,Gene,Sample1,Sample2,Sample3,Sample4,Total,Group
0,GeneA,231.25,226.25,150.0,160.0,767.5,Control
1,GeneB,80.0,90.0,97.0,95.0,362.0,Control
2,GeneC,300.0,280.0,310.0,290.0,1180.0,Treatment
3,GeneD,45.0,55.0,269.25,263.75,633.0,Treatment
4,GeneE,500.0,480.0,520.0,510.0,2010.0,Control


    Gene      Group   Sample  Expression
0  GeneA    Control  Sample1      231.25
1  GeneB    Control  Sample1       80.00
2  GeneC  Treatment  Sample1      300.00
3  GeneD  Treatment  Sample1       45.00
4  GeneE    Control  Sample1      500.00
Sample  Sample1  Sample2  Sample3  Sample4   Total
Gene                                              
GeneA    231.25   226.25   150.00   160.00   767.5
GeneB     80.00    90.00    97.00    95.00   362.0
GeneC    300.00   280.00   310.00   290.00  1180.0
GeneD     45.00    55.00   269.25   263.75   633.0
GeneE    500.00   480.00   520.00   510.00  2010.0
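
In the round trip above, `Group` is lost and `Total` is melted along with the samples. The sketch below shows one way to control both, on a made-up frame `toy`: `value_vars` limits which columns are melted, and a list passed to `index` keeps several identifier columns when pivoting back.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gene": ["GeneA", "GeneB"],
    "Group": ["Control", "Treatment"],
    "Sample1": [120, 80],
    "Sample2": [135, 90],
    "Total": [255, 170],
})  # made-up values

long = toy.melt(
    id_vars=["Gene", "Group"],
    value_vars=["Sample1", "Sample2"],   # leave "Total" out of the long table
    var_name="Sample",
    value_name="Expression",
)

wide = (
    long.pivot(index=["Gene", "Group"], columns="Sample", values="Expression")
    .reset_index()
)
wide.columns.name = None   # drop the leftover "Sample" label on the columns

print(long)
print(wide)
```

`pivot` requires each index/column combination to be unique; when duplicates are possible, `pivot_table` with an aggregation function is the usual fallback.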


### Merging and joining DataFrames

In [342]:
meta = pd.DataFrame({
    "Gene": ["GeneA", "GeneC", "GeneD", "GeneE"],
    "Pathway": ["X", "Y", "Z", "Y"]
})
display(df)
display(meta)

# Left join on Gene
# R equivalent: merge(df, meta, by="Gene", all.x=TRUE)
merged_left = pd.merge(df, meta, on="Gene", how="left") # keeps every row of the left DataFrame (df); genes without a match in meta get NaN
print("\nLeft join:\n", merged_left)

# inner join
# R equivalent: merge(df, meta, by="Gene", all=FALSE)
merged_inner = pd.merge(df, meta, on="Gene", how="inner") # rows only present in both dataframes
print("\nInner join:\n", merged_inner)



Unnamed: 0,Gene,Sample1,Sample2,Sample3,Sample4,Total,Group
0,GeneA,231.25,226.25,150.0,160.0,767.5,Control
1,GeneB,80.0,90.0,97.0,95.0,362.0,Control
2,GeneC,300.0,280.0,310.0,290.0,1180.0,Treatment
3,GeneD,45.0,55.0,269.25,263.75,633.0,Treatment
4,GeneE,500.0,480.0,520.0,510.0,2010.0,Control


Unnamed: 0,Gene,Pathway
0,GeneA,X
1,GeneC,Y
2,GeneD,Z
3,GeneE,Y



Left join:
     Gene  Sample1  Sample2  Sample3  Sample4   Total      Group Pathway
0  GeneA   231.25   226.25   150.00   160.00   767.5    Control       X
1  GeneB    80.00    90.00    97.00    95.00   362.0    Control     NaN
2  GeneC   300.00   280.00   310.00   290.00  1180.0  Treatment       Y
3  GeneD    45.00    55.00   269.25   263.75   633.0  Treatment       Z
4  GeneE   500.00   480.00   520.00   510.00  2010.0    Control       Y

Inner join:
     Gene  Sample1  Sample2  Sample3  Sample4   Total      Group Pathway
0  GeneA   231.25   226.25   150.00   160.00   767.5    Control       X
1  GeneC   300.00   280.00   310.00   290.00  1180.0  Treatment       Y
2  GeneD    45.00    55.00   269.25   263.75   633.0  Treatment       Z
3  GeneE   500.00   480.00   520.00   510.00  2010.0    Control       Y
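
When a join silently introduces NaN, it helps to see which rows failed to match. The sketch below uses two real `pd.merge` options on made-up frames: `indicator=True` adds a `_merge` column, and `validate` raises an error if the keys are not unique in the way you expect.

```python
import pandas as pd

expr = pd.DataFrame({"Gene": ["GeneA", "GeneB", "GeneC"], "Sample1": [120, 80, 300]})
meta = pd.DataFrame({"Gene": ["GeneA", "GeneC"], "Pathway": ["X", "Y"]})  # made-up values

checked = pd.merge(
    expr, meta, on="Gene", how="left",
    indicator=True,           # adds a "_merge" column: both / left_only / right_only
    validate="one_to_one",    # raises if "Gene" is duplicated on either side
)
unmatched = checked.loc[checked["_merge"] == "left_only", "Gene"].tolist()

print(checked)
print(unmatched)
```

Here `unmatched` lists the genes that have no metadata, which is the same information the NaN in `Pathway` conveys, made explicit.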


In [343]:
# or we can use join using index, same as left join in this case
df3 = df.set_index("Gene")
meta3 = meta.set_index("Gene")

joined = df3.join(meta3, how="left")
joined

Unnamed: 0_level_0,Sample1,Sample2,Sample3,Sample4,Total,Group,Pathway
Gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
GeneA,231.25,226.25,150.0,160.0,767.5,Control,X
GeneB,80.0,90.0,97.0,95.0,362.0,Control,
GeneC,300.0,280.0,310.0,290.0,1180.0,Treatment,Y
GeneD,45.0,55.0,269.25,263.75,633.0,Treatment,Z
GeneE,500.0,480.0,520.0,510.0,2010.0,Control,Y


In [344]:
# concatenation of DataFrames (R equivalent: rbind() or cbind())
df3 = pd.DataFrame({
    "Gene": ["GeneF", "GeneG"],
    "Sample1": [12, 14],
    "Sample2": [18, 16],
    "Sample3": [22, 24],
    "Sample4": [28, 26],
    "Total": [80, 80],
    "Group": ["Control", "Control"]})

df3

concat_all = pd.concat([df, df3], axis=0) # row-wise concatenation (like rbind in R)
print("Row-wise concatenation:\n", concat_all)  

# concatenating on a shared index gives a clean cbind; here the gene sets do not overlap, so the result is padded with NaN
df3.set_index("Gene", inplace=True)
df.set_index("Gene", inplace=True)
concat_col = pd.concat([df, df3], axis=1) # column-wise concatenation (like cbind in R)
print("Column-wise concatenation:\n", concat_col)


Row-wise concatenation:
     Gene  Sample1  Sample2  Sample3  Sample4   Total      Group
0  GeneA   231.25   226.25   150.00   160.00   767.5    Control
1  GeneB    80.00    90.00    97.00    95.00   362.0    Control
2  GeneC   300.00   280.00   310.00   290.00  1180.0  Treatment
3  GeneD    45.00    55.00   269.25   263.75   633.0  Treatment
4  GeneE   500.00   480.00   520.00   510.00  2010.0    Control
0  GeneF    12.00    18.00    22.00    28.00    80.0    Control
1  GeneG    14.00    16.00    24.00    26.00    80.0    Control
Column-wise concatenation:
        Sample1  Sample2  Sample3  Sample4   Total      Group  Sample1  Sample2  Sample3  Sample4  Total    Group
Gene                                                                                                            
GeneA   231.25   226.25   150.00   160.00   767.5    Control      NaN      NaN      NaN      NaN    NaN      NaN
GeneB    80.00    90.00    97.00    95.00   362.0    Control      NaN      NaN      NaN      NaN
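
Two details of `pd.concat` are visible in the output above: row-wise concatenation keeps the original index labels (so `0` and `1` appear twice), and column-wise concatenation aligns on the index. A minimal sketch on made-up frames, using `ignore_index=True` for the first and a shared gene index for the second:

```python
import pandas as pd

first = pd.DataFrame({"Gene": ["GeneA", "GeneB"], "Sample1": [120, 80]})
second = pd.DataFrame({"Gene": ["GeneF", "GeneG"], "Sample1": [12, 14]})  # made-up values

stacked = pd.concat([first, second], axis=0, ignore_index=True)   # fresh 0..3 index
print(stacked)

# Column-wise concat aligns on the index, so shared gene labels end up on the same row
left = first.set_index("Gene")
right = pd.DataFrame(
    {"Sample2": [135, 90]}, index=pd.Index(["GeneA", "GeneB"], name="Gene")
)
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
```

Unlike R's `cbind`, which pastes columns together by position, pandas matches rows by label and fills with NaN wherever a label is missing on one side.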

In [None]:
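df = df.reset_index()  # move "Gene" back from the index into a regular column
# Regex capture groups become columns 0 and 1; column 1 holds the gene letter (A, B, ...)
df["Gene"].str.extract(r"(Gene)([A-Z])")[1]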


array([0, 1, 2, 3, 4])