📦 Phase 1.3: Pandas (DataFrames)
🔍 Why it matters:
Most real AI projects take tabular data (CSV files, Excel sheets, database tables) as input.

Pandas is your go-to for loading, cleaning, transforming, and filtering that data.

It is used in ML pipelines, LLM data preparation, analytics, and more.

✅ Core Concepts Checklist
Concept	Why It Matters
DataFrame, Series	Core data types in Pandas
read_csv(), to_csv()	Load/save datasets
Column indexing, slicing	Access/modify columns/rows
Filtering with conditions	Get only the data you want
isnull(), fillna(), dropna()	Clean missing data
groupby(), agg()	Summarize, analyze by category
Merging & joining	Combine multiple datasets
Sorting, renaming, resetting index	Final cleanup before modeling

In [54]:
# Create or Load a Dataset
import pandas as pd
import numpy as np

# data_1d = {[25, 30, 35]}

# columns name, age, score
data = {
    'Name': ['Nobita', 'Doraemon', 'Suneo'],
    'Age': [12, 13, np.nan],    # np.nan marks a missing numeric value (preferred over None)
    'Score': [0, 85, 80]
}

# data_null = {
#     'Name': ['Nobita', 'NAN', 'Suneo'],
#     'Age': [12, 13, None],
#     'Score': [0, 85, 80]
# }

df = pd.DataFrame(data)
print(df)

       Name   Age  Score
0    Nobita  12.0      0
1  Doraemon  13.0     85
2     Suneo   NaN     80
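
The checklist lists read_csv() and to_csv(), but the cell above only builds a DataFrame from a dict. A minimal sketch of a save/load round trip follows. It writes to an in-memory buffer (io.StringIO) instead of a real file so it runs anywhere; the names df_demo, buffer and df_loaded are made up for this example.

```python
import io

import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Nobita', 'Doraemon', 'Suneo'],
    'Age': [12, 13, np.nan],
    'Score': [0, 85, 80],
})

# Write CSV text to an in-memory buffer (a file path would work the same way).
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)   # index=False: do not write the row labels as a column

# Rewind the buffer and load it back; the empty Age field is parsed as NaN.
buffer.seek(0)
df_loaded = pd.read_csv(buffer)
print(df_loaded)
```

With a real dataset, the buffer would be replaced by a path such as a local .csv file.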


In [None]:
# Accessing Data
print(df['Age'], end="\n\n")

print(df[['Name', 'Score']], end="\n\n") #access multiple columns

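# iloc selects by integer position, loc selects by label
print(df.iloc[0])            # first row (returned as a Series)
print(df.iloc[0:2])          # first two rows (end position is excluded)
print(df.loc[2, 'Name'])     # single value: row label 2, column 'Name'
print(df.loc[:, 'Age'])      # all rows of column 'Age'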

0    12.0
1    13.0
2     NaN
Name: Age, dtype: float64

       Name  Score
0    Nobita      0
1  Doraemon     85
2     Suneo     80

Name     Nobita
Age        12.0
Score         0
Name: 0, dtype: object
       Name   Age  Score
0    Nobita  12.0      0
1  Doraemon  13.0     85
Suneo
0    12.0
1    13.0
2     NaN
Name: Age, dtype: float64
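
One detail that is easy to miss in the cell above: slicing behaves differently in iloc and loc. A small sketch, assuming the default 0, 1, 2 index (df_demo is a throwaway frame for this example):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Nobita', 'Doraemon', 'Suneo'],
    'Score': [0, 85, 80],
})

by_position = df_demo.iloc[0:2]   # positions 0 and 1; the end position is excluded
by_label = df_demo.loc[0:2]       # labels 0, 1 and 2; the end label is included

print(len(by_position))  # 2
print(len(by_label))     # 3
```

So the same-looking slice 0:2 returns two rows with iloc and three rows with loc.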


In [56]:
# Filter data
print(df[df['Age'] > 12], end="\n\n")
print(df[df['Age'] == 13])  # comparing with == None or == np.nan never matches; use df['Age'].isna() to find missing values

       Name   Age  Score
1  Doraemon  13.0     85

       Name   Age  Score
1  Doraemon  13.0     85
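
To make the note about missing values concrete, here is a short sketch that finds NaN rows with isna() and combines two conditions. The frame and variable names are only for illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Nobita', 'Doraemon', 'Suneo'],
    'Age': [12, 13, np.nan],
    'Score': [0, 85, 80],
})

# NaN is not equal to anything, not even itself, so this filter matches no rows.
never_matches = df_demo[df_demo['Age'] == np.nan]

# isna() is the way to select rows with a missing value.
missing_age = df_demo[df_demo['Age'].isna()]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
older_and_passing = df_demo[(df_demo['Age'] > 12) & (df_demo['Score'] >= 80)]

print(len(never_matches))
print(missing_age)
print(older_and_passing)
```

Python's plain `and` / `or` do not work on whole columns, which is why the element-wise operators & and | are used.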


In [57]:
# Cleaning Missing Values
print(df.isnull().sum())  # count missing values per column

print(df.dropna())   # returns a copy without the rows that contain NaN
print(df.fillna(0))  # returns a copy with NaN replaced by 0


Name     0
Age      1
Score    0
dtype: int64
       Name   Age  Score
0    Nobita  12.0      0
1  Doraemon  13.0     85
       Name   Age  Score
0    Nobita  12.0      0
1  Doraemon  13.0     85
2     Suneo   0.0     80
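
Filling with 0 is not always a sensible choice (an age of 0 would distort any average). As a sketch of two common alternatives, the example below fills with the column mean and drops rows only when a specific column is missing. Whether either is appropriate depends on the dataset; the names here are made up for the example.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Nobita', 'Doraemon', 'Suneo'],
    'Age': [12, 13, np.nan],
    'Score': [0, 85, 80],
})

mean_age = df_demo['Age'].mean()   # mean() skips NaN by default, so this is 12.5

# Fill the missing Age with the column mean; assign() returns a new DataFrame.
filled = df_demo.assign(Age=df_demo['Age'].fillna(mean_age))

# Drop a row only if 'Age' is missing, ignoring NaN in other columns.
dropped = df_demo.dropna(subset=['Age'])

print(filled)
print(dropped)
print(df_demo['Age'].isna().sum())   # the original frame is unchanged
```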


In [None]:
# Grouping & Aggregation
# Add a second 'Doraemon' row first, so that grouping by Name has more than one row to aggregate.
df.loc[len(df)] = ['Doraemon', 20, 100]

# numeric_only=True skips non-numeric columns such as 'Name', which have no mean.
print(df.groupby('Score').mean(numeric_only=True), end="\n\n")    # group by Score, get mean Age

# Selecting [['Age']] explicitly gives the same result; NaN ages are skipped within each group.
print(df.groupby('Score')[['Age']].mean(), end='\n\n')

print(df.groupby('Name').agg({
    'Score': 'mean',          # mean Score per name
    'Age': 'max'              # max Age per name; keep the column numeric by using np.nan (not None) for missing values
}))

        Age
Score      
0      12.0
80      NaN
85     13.0
100    20.0

        Age
Score      
0      12.0
80      NaN
85     13.0
100    20.0

          Score   Age
Name                 
Doraemon   92.5  20.0
Nobita      0.0  12.0
Suneo      80.0   NaN
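
The dict form of agg() reuses the original column names in the result. A possible alternative is named aggregation, which lets each output column get its own name. This is a sketch on a copy of the same four rows; the output names mean_score, max_age and score_count are invented for the example.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Nobita', 'Doraemon', 'Suneo', 'Doraemon'],
    'Age': [12, 13, np.nan, 20],
    'Score': [0, 85, 80, 100],
})

# Each keyword is an output column: name=(source column, aggregation function).
summary = df_demo.groupby('Name').agg(
    mean_score=('Score', 'mean'),
    max_age=('Age', 'max'),
    score_count=('Score', 'count'),   # number of non-missing Score values per name
)
print(summary)
```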


In [None]:
# Merging Datasets
df1 = pd.DataFrame({'ID': [1,2], 'Name': ['A','B']})
df2 = pd.DataFrame({'ID': [1,2], 'Score': [90,80]})

merged = pd.merge(df1, df2, on='ID')  # if the on parameter is omitted, merge joins on the columns both frames share (here, 'ID')
print(merged)

   ID Name  Score
0   1    A     90
1   2    B     80
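
In the example above every ID appears in both frames, so the join type does not matter. When the keys only partly overlap, the how parameter decides which rows survive. A small sketch with made-up frames (students, scores):

```python
import pandas as pd

students = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
scores = pd.DataFrame({'ID': [1, 2, 4], 'Score': [90, 80, 70]})

# Default how='inner': keep only IDs present in both frames.
inner = pd.merge(students, scores, on='ID')

# how='left': keep every student; a missing Score becomes NaN.
left = pd.merge(students, scores, on='ID', how='left')

# how='outer': keep every ID from both sides; indicator=True adds a '_merge'
# column saying where each row came from (both / left_only / right_only).
outer = pd.merge(students, scores, on='ID', how='outer', indicator=True)

print(inner)
print(left)
print(outer)
```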

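
The last checklist row (sorting, renaming, resetting the index) has no cell of its own in this section. A minimal sketch of what such a final cleanup step might look like, chained on a throwaway copy of the data; the lowercase column names are just one possible convention.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Nobita', 'Doraemon', 'Suneo'],
    'Age': [12, 13, np.nan],
    'Score': [0, 85, 80],
})

cleaned = (
    df_demo
    .dropna(subset=['Age'])                  # drop rows with a missing Age
    .sort_values('Score', ascending=False)   # highest score first
    .rename(columns={'Name': 'name', 'Age': 'age', 'Score': 'score'})
    .reset_index(drop=True)                  # renumber rows 0..n-1 and discard the old labels
)
print(cleaned)
```

Without reset_index(drop=True), the rows would keep their original labels (1, 0) after sorting, which can be confusing when selecting with loc later.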