# Lecture 3: Introduction to Pandas

This notebook introduces the basics of pandas, a powerful data manipulation and analysis library for Python.

To get started, we need to load the pandas package with `import`. The alias `pd` is the standard convention, so I would stick with it!

In [4]:
import pandas as pd

## Loading Data from Zenodo

We'll load a CRISPR dataset from Zenodo to explore pandas functionality.

In [5]:
url = "https://zenodo.org/records/17098555/files/combined_model_crispr_data_filtered.csv?download=1"
df = pd.read_csv(url)
print(type(df))

<class 'pandas.core.frame.DataFrame'>
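
When a file is large, it can help to load only part of it while exploring. The sketch below is a minimal illustration and not part of the original notebook: it reads a tiny made-up CSV held in memory (`csv_text`, `GENE_A`, and `GENE_B` are hypothetical stand-ins) so that it runs without a network connection. `usecols` and `nrows` are standard `pd.read_csv` parameters.

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for the real file (values are made up).
csv_text = """model_id,oncotree_lineage,GENE_A,GENE_B
M1,Breast,0.10,-0.20
M2,Myeloid,-0.30,0.40
M3,Breast,0.05,0.00
"""

# read_csv accepts a local path, a URL, or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

# usecols keeps only the named columns; nrows limits how many rows are parsed.
subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["model_id", "GENE_A"],
    nrows=2,
)

print(toy.shape)
print(subset.shape)
```

The same two parameters should work unchanged when the first argument is a URL such as the one used above.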


## Basic DataFrame Inspection

Let's explore the basic commands to inspect our DataFrame.

### 1. Check DataFrame dimensions with `.shape`

The `.shape` attribute returns a tuple showing (rows, columns). This gives you a quick overview of your dataset size.

In [6]:
df.shape

(94, 17211)
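
Since `.shape` is an ordinary tuple, it can be unpacked into separate variables. The following is a small sketch on a made-up DataFrame (`toy` and its values are hypothetical), assuming nothing beyond standard pandas behavior.

```python
import pandas as pd

# Hypothetical toy DataFrame; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "model_id": ["M1", "M2", "M3"],
        "GENE_A": [0.10, -0.30, 0.05],
    }
)

# Unpack the (rows, columns) tuple into two variables.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows x {n_cols} columns")

# len() on a DataFrame returns the number of rows only.
print(len(toy) == n_rows)
```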

### 2. Preview the first and last few rows with `.head()` and `.tail()`

The `.head()` method shows the first 5 rows by default. This helps you understand the structure and content of your data quickly. You can specify a different number of rows with `.head(n)`. `.tail()` does the same but starts from the end of the DataFrame.

In [7]:
df.head()

,model_id,cell_line_name,stripped_cell_line_name,oncotree_lineage,oncotree_primary_disease,oncotree_subtype,A1BG,A1CF,A2M,A2ML1,...,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
0,ACH-000004,HEL,HEL,Myeloid,Acute Myeloid Leukemia,"AML, NOS",0.005506,-0.069734,-0.098829,-0.018614,...,-0.178914,-0.33103,0.233527,0.079958,0.084987,0.059995,-0.313832,0.111395,0.261894,-0.08686
1,ACH-000005,HEL 92.1.7,HEL9217,Myeloid,Acute Myeloid Leukemia,Acute Myeloid Leukemia,-0.112103,0.008739,0.157658,0.106528,...,-0.273276,-0.341893,-0.027406,-0.080102,-0.027048,-0.247472,-0.132481,-0.061705,0.039933,-0.191611
2,ACH-000017,SK-BR-3,SKBR3,Breast,Invasive Breast Carcinoma,Invasive Breast Carcinoma,-0.032172,-0.102181,-0.012779,0.018938,...,-0.2573,-0.425172,0.048183,0.278794,-0.110023,-0.030859,-0.207186,0.053118,-0.257893,-0.201468
3,ACH-000019,MCF7,MCF7,Breast,Invasive Breast Carcinoma,Invasive Breast Carcinoma,0.036307,0.018542,0.095517,0.052744,...,-0.131793,-0.744819,0.06786,0.021533,0.030147,-0.038263,-0.150743,0.185094,-0.10824,-0.490506
4,ACH-000028,KPL-1,KPL1,Breast,Invasive Breast Carcinoma,Invasive Breast Carcinoma,-0.187719,-0.149819,0.077679,0.123123,...,-0.493698,-0.536292,0.143591,0.076073,-0.034894,-0.003931,-0.071018,-0.119227,-0.118681,-0.194408


In [8]:
df.tail()

,model_id,cell_line_name,stripped_cell_line_name,oncotree_lineage,oncotree_primary_disease,oncotree_subtype,A1BG,A1CF,A2M,A2ML1,...,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
89,ACH-002399,21NT,21NT,Breast,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma,0.01667,-0.102891,0.039184,0.012374,...,-0.26699,-0.384871,-0.167387,0.044349,-0.022211,0.00689,-0.248336,-0.028865,-0.244521,-0.226447
90,ACH-002400,21MT-1,21MT1,Breast,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma,0.035992,-0.093128,-0.02401,-0.002297,...,-0.072272,-0.261889,-0.089042,0.280628,-0.100707,-0.048671,0.062155,-0.062474,-0.184105,-0.209444
91,ACH-002401,21MT-2,21MT2,Breast,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma,-0.042245,-0.09255,0.095894,0.102182,...,-0.231961,-0.434732,0.023069,0.156461,0.016354,-0.003715,-0.166764,-0.052551,-0.277299,-0.185255
92,ACH-002475,HAP1,HAP1,Myeloid,Myeloproliferative Neoplasms,"Chronic Myeloid Leukemia, BCR-ABL1+",-0.326848,-0.106305,0.328807,0.087881,...,-0.532781,-0.502823,-0.185465,0.1336,0.069168,-0.035464,-0.103719,0.022963,0.181658,-0.200864
93,ACH-002499,UACC-3199,UACC3199,Breast,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma,0.049671,-0.01877,0.120281,-0.052114,...,-0.006244,-0.346016,0.088168,0.191364,-0.034267,0.216754,-0.382172,-0.111361,-0.017968,-0.374272
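
To make the `n` argument concrete, here is a minimal sketch on a made-up ten-row DataFrame (`toy`, `sample`, and `score` are hypothetical names, not columns of the CRISPR data).

```python
import pandas as pd

# Hypothetical ten-row DataFrame with a default 0..9 index.
toy = pd.DataFrame(
    {
        "sample": [f"S{i}" for i in range(10)],
        "score": range(10),
    }
)

# First three rows instead of the default five.
first_three = toy.head(3)

# Last two rows; the original index labels (8 and 9) are kept.
last_two = toy.tail(2)

print(first_three)
print(last_two)
```

Note that `.tail()` keeps the original row labels, which is why the output above ends at index 93 rather than restarting at 0.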


### 3. Get detailed information with `.info()`

The `.info()` method provides a comprehensive overview including:
- Number of rows and columns
- Column names and their data types
- Non-null count for each column (useful for identifying missing data)
- Memory usage of the DataFrame

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Columns: 17211 entries, model_id to ZZZ3
dtypes: float64(17205), object(6)
memory usage: 12.3+ MB
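
For a very wide DataFrame, `.info()` prints only a summary like the one above, so per-column details can be easier to get directly. The sketch below is one possible approach on a made-up DataFrame (`toy` and its values are hypothetical); `isna()` and `select_dtypes()` are standard pandas methods.

```python
import pandas as pd

# Hypothetical toy DataFrame with one missing value in GENE_A.
toy = pd.DataFrame(
    {
        "model_id": ["M1", "M2", "M3"],
        "GENE_A": [0.10, float("nan"), 0.05],
        "GENE_B": [-0.20, 0.40, 0.00],
    }
)

# Prints the index range, dtypes, non-null counts, and memory usage.
toy.info()

# Count missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Count how many columns hold numeric data.
n_numeric = toy.select_dtypes(include="number").shape[1]
print(n_numeric)
```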


### 4. Statistical summary with `.describe()`

The `.describe()` method generates descriptive statistics for numerical columns:
- **count**: Number of non-null values
- **mean**: Average value
- **std**: Standard deviation (measure of spread)
- **min/max**: Minimum and maximum values
- **25%, 50%, 75%**: Quartiles (percentiles) that help understand data distribution

In [10]:
df.describe()

,A1BG,A1CF,A2M,A2ML1,A3GALT2,A4GALT,A4GNT,AAAS,AACS,AADAC,...,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
count,94.0,94.0,94.0,94.0,94.0,94.0,94.0,94.0,94.0,94.0,...,94.0,94.0,94.0,94.0,94.0,94.0,94.0,94.0,94.0,94.0
mean,-0.054935,-0.093692,0.029812,0.064967,-0.114661,-0.03868,0.032787,-0.115873,-0.045571,0.108615,...,-0.204066,-0.379184,0.047896,0.082465,-0.071971,-0.040266,-0.12797,-0.020021,-0.118972,-0.307278
std,0.141256,0.106651,0.099385,0.095671,0.138206,0.099466,0.102891,0.219413,0.114337,0.090607,...,0.171297,0.236019,0.15251,0.142629,0.124164,0.094981,0.132863,0.140636,0.145147,0.190793
min,-0.626283,-0.451665,-0.186092,-0.24527,-0.616101,-0.320568,-0.16209,-0.831448,-0.296332,-0.164341,...,-0.872698,-0.893691,-0.65011,-0.323497,-0.48939,-0.247472,-0.446544,-0.456646,-0.589946,-1.146622
25%,-0.112972,-0.138388,-0.026509,0.008149,-0.189462,-0.07779,-0.036059,-0.239771,-0.110043,0.051166,...,-0.272001,-0.514609,-0.035083,0.01294,-0.150538,-0.094475,-0.201977,-0.098958,-0.201573,-0.3831
50%,-0.070896,-0.091606,0.028296,0.060233,-0.084616,-0.053613,0.018043,-0.09677,-0.056092,0.121614,...,-0.208706,-0.377878,0.06582,0.082953,-0.080664,-0.039286,-0.132295,-0.019451,-0.119042,-0.29621
75%,0.004349,-0.030879,0.081367,0.117312,-0.025943,0.013791,0.090748,0.014472,0.009893,0.163191,...,-0.092252,-0.225159,0.139735,0.163576,-0.016515,0.006408,-0.051638,0.054303,-0.03829,-0.201779
max,0.464101,0.210794,0.411906,0.313749,0.168673,0.236373,0.382592,0.330231,0.334593,0.357325,...,0.163556,0.193544,0.398083,0.462117,0.395878,0.216754,0.365249,0.285742,0.305158,0.136894
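
The statistics listed above are easy to check by hand on a small example. This is a minimal sketch with made-up values (`toy`, `label`, and `score` are hypothetical); it also shows that `.describe()` skips non-numeric columns by default and that `std` is the sample standard deviation (divides by n - 1).

```python
import pandas as pd

# Hypothetical toy DataFrame: one text column and one numeric column.
toy = pd.DataFrame(
    {
        "label": ["a", "b", "c", "d", "e"],
        "score": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

# By default only numeric columns are summarized, so "label" is dropped.
summary = toy.describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
mean_score = summary.loc["mean", "score"]
median_score = summary.loc["50%", "score"]
std_score = summary.loc["std", "score"]
```

The result of `.describe()` is itself a DataFrame, so a single statistic can be pulled out with `.loc[row, column]` as shown.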


### 5. List all column names with `.columns`

The `.columns` attribute returns an Index object containing all column names. This is essential for understanding what variables are available in your dataset and for referencing specific columns later.

In [11]:
df.columns

Index(['model_id', 'cell_line_name', 'stripped_cell_line_name',
       'oncotree_lineage', 'oncotree_primary_disease', 'oncotree_subtype',
       'A1BG', 'A1CF', 'A2M', 'A2ML1',
       ...
       'ZWILCH', 'ZWINT', 'ZXDA', 'ZXDB', 'ZXDC', 'ZYG11A', 'ZYG11B', 'ZYX',
       'ZZEF1', 'ZZZ3'],
      dtype='object', length=17211)
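
With thousands of columns, the truncated display above is rarely enough on its own. The sketch below shows a few ways to work with the Index object, using a made-up DataFrame (`toy` and the `GENE_` prefix are hypothetical and only serve the illustration).

```python
import pandas as pd

# Hypothetical toy DataFrame mixing metadata and gene columns.
toy = pd.DataFrame(
    {
        "model_id": ["M1"],
        "oncotree_lineage": ["Breast"],
        "GENE_A": [0.10],
        "GENE_B": [-0.20],
    }
)

# Check whether a column exists before using it.
has_gene_a = "GENE_A" in toy.columns
print(has_gene_a)

# Convert the Index to a plain Python list.
column_names = toy.columns.tolist()
print(column_names)

# Filter column names with an ordinary list comprehension.
gene_columns = [name for name in toy.columns if name.startswith("GENE_")]
print(gene_columns)
```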

### 6. Select individual columns
Filtering data and accessing separate areas of the DataFrame is an essential pandas skill.
The most basic operation is to retrieve a single column, which pandas returns as a Series.

In [13]:
model_id = df['model_id']
print(type(model_id))
print(model_id.head())

<class 'pandas.core.series.Series'>
0    ACH-000004
1    ACH-000005
2    ACH-000017
3    ACH-000019
4    ACH-000028
Name: model_id, dtype: object
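
The same bracket syntax also accepts a list of column names, in which case the result is a DataFrame rather than a Series. Here is a minimal sketch on a made-up DataFrame (`toy` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical toy DataFrame; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "model_id": ["M1", "M2", "M3"],
        "oncotree_lineage": ["Breast", "Myeloid", "Breast"],
        "GENE_A": [0.10, -0.30, 0.05],
    }
)

# A single column name returns a Series.
one_column = toy["model_id"]
print(type(one_column))

# A list of column names returns a DataFrame with those columns.
two_columns = toy[["model_id", "GENE_A"]]
print(type(two_columns))

# A one-element list also returns a DataFrame, not a Series.
one_column_frame = toy[["model_id"]]
print(one_column_frame.shape)
```

The double brackets are not special syntax: the outer pair is the selection operator and the inner pair is a regular Python list.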
