📦 Phase 1.3: Pandas (DataFrames)
🔍 Why it matters:
Most real AI projects take tabular data (CSV files, Excel sheets, database tables) as input.

Pandas is the go-to library for loading, cleaning, transforming, and filtering that data.

It is used in ML pipelines, LLM data preparation, and analytics.

✅ Core Concepts Checklist
Concept	Why It Matters
DataFrame, Series	Core data types in Pandas
read_csv(), to_csv()	Load/save datasets
Column indexing, slicing	Access/modify columns/rows
Filtering with conditions	Get only the data you want
isnull(), fillna(), dropna()	Clean missing data
groupby(), agg()	Summarize, analyze by category
Merging & joining	Combine multiple datasets
Sorting, renaming, resetting index	Final cleanup before modeling

In [123]:
# Create or Load a Dataset
import pandas as pd
import numpy as np

# data_1d = {[25, 30, 35]}

# columns name, age, score
data = {
    'Name': ['Nobita', 'Doraemon', 'Suneo'],
    'Age': [12, 13, np.nan],    # Use np.nan rather than None for missing numeric values
    'Score': [0, 85, 80]
}

# data_null = {
#     'Name': ['Nobita', 'NAN', 'Suneo'],
#     'Age': [12, 13, None],
#     'Score': [0, 85, 80]
# }

df = pd.DataFrame(data)
print(df)

       Name   Age  Score
0    Nobita  12.0      0
1  Doraemon  13.0     85
2     Suneo   NaN     80


In [124]:
# Accessing Data
print(df['Age'], end="\n\n")

print(df.columns)
print(df.values)

print(df[['Name', 'Score']], end="\n\n") #access multiple columns

# iloc - integer location, loc - label based location
print(df.iloc[0])            # first row 
print(df.iloc[0:2])          # First two rows
print(df.loc[2, 'Name']) 
print(df.loc[:, 'Age'])      # slicing

0    12.0
1    13.0
2     NaN
Name: Age, dtype: float64

Index(['Name', 'Age', 'Score'], dtype='object')
[['Nobita' 12.0 0]
 ['Doraemon' 13.0 85]
 ['Suneo' nan 80]]
       Name  Score
0    Nobita      0
1  Doraemon     85
2     Suneo     80

Name     Nobita
Age        12.0
Score         0
Name: 0, dtype: object
       Name   Age  Score
0    Nobita  12.0      0
1  Doraemon  13.0     85
Suneo
0    12.0
1    13.0
2     NaN
Name: Age, dtype: float64
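
The slicing calls above differ in one detail that is easy to miss: `iloc` excludes the end position, while `loc` includes the end label. A minimal sketch on the same toy data (the variable names are only illustrative):

```python
import numpy as np
import pandas as pd

# Same toy data as the cells above
df = pd.DataFrame({
    'Name': ['Nobita', 'Doraemon', 'Suneo'],
    'Age': [12, 13, np.nan],
    'Score': [0, 85, 80],
})

# iloc slices by position: positions 0 and 1, the end position is excluded
first_two_by_position = df.iloc[0:2]

# loc slices by label: labels 0, 1 and 2, the end label is included
up_to_label_two = df.loc[0:2]

# loc can select rows and columns in one call
names_and_scores = df.loc[0:1, ['Name', 'Score']]

print(len(first_two_by_position))  # 2
print(len(up_to_label_two))        # 3
print(names_and_scores.shape)      # (2, 2)
```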


In [125]:
# Filter data
print(df[df['Age'] > 12], end="\n\n")
print(df[df['Age'] == 13])  # == None or == np.nan never matches a missing value; use isna() for that

       Name   Age  Score
1  Doraemon  13.0     85

       Name   Age  Score
1  Doraemon  13.0     85
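
Filtering on missing values needs a different tool than `==`. The sketch below, on the same toy data, shows that comparing with `np.nan` matches nothing, that `isna()` finds the missing row, and how two conditions can be combined (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Same toy data as the cells above
df = pd.DataFrame({
    'Name': ['Nobita', 'Doraemon', 'Suneo'],
    'Age': [12, 13, np.nan],
    'Score': [0, 85, 80],
})

# NaN never compares equal to anything, so this mask is False for every row
eq_mask = df['Age'] == np.nan

# isna() is the reliable way to select rows with a missing Age
missing_age = df[df['Age'].isna()]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
older_high_scorers = df[(df['Age'] > 12) & (df['Score'] >= 80)]

print(eq_mask.any())
print(missing_age)
print(older_high_scorers)
```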


In [126]:
# Cleaning Missing Values
print(df.isnull().sum())  # Count missing values per column

print(df.dropna())   # Returns a copy without the rows that contain NaN
print(df.fillna(0))  # Returns a copy with every NaN replaced by 0


Name     0
Age      1
Score    0
dtype: int64
       Name   Age  Score
0    Nobita  12.0      0
1  Doraemon  13.0     85
       Name   Age  Score
0    Nobita  12.0      0
1  Doraemon  13.0     85
2     Suneo   0.0     80
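
Filling every gap with 0 is not always what you want (an age of 0 is misleading). One common alternative, sketched here on the same toy data, is to fill a single column with its mean, or to drop rows only when a specific column is missing:

```python
import numpy as np
import pandas as pd

# Same toy data as the cells above
df = pd.DataFrame({
    'Name': ['Nobita', 'Doraemon', 'Suneo'],
    'Age': [12, 13, np.nan],
    'Score': [0, 85, 80],
})

# Fill only the Age column, using the mean of the known ages (12 and 13)
filled = df.copy()
filled['Age'] = filled['Age'].fillna(filled['Age'].mean())

# Drop rows only when Age is missing; other columns are not checked
dropped = df.dropna(subset=['Age'])

print(filled)
print(dropped)
```

Neither call changes `df` itself; both work on or return separate objects.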


In [127]:
# Grouping & Aggregation
# numeric_only=True restricts the mean to numeric columns; without it, recent pandas versions try to average the non-numeric 'Name' column and raise an error
print(df.groupby('Score').mean(numeric_only=True), end="\n\n")    # Group by Score, get mean Age

print(df.groupby('Score')[['Age']].mean(), end='\n\n')  # Same result with 'Age' selected explicitly; NaN is skipped, so a group holding only NaN has a NaN mean

df.loc[len(df)] = ['Doraemon',20,100]
print(df.groupby('Name').agg({
    'Score': 'mean',          # average Score per name
    'Age': 'max'              # highest Age per name; NaN is skipped, so keep missing values as np.nan to preserve the numeric dtype
}), end='\n\n')

print(
    df.agg({
    'Score': ['min', 'max', 'mean'],
    'Age': 'median'
})
)

        Age
Score      
0      12.0
80      NaN
85     13.0

        Age
Score      
0      12.0
80      NaN
85     13.0

          Score   Age
Name                 
Doraemon   92.5  20.0
Nobita      0.0  12.0
Suneo      80.0   NaN

         Score   Age
min       0.00   NaN
max     100.00   NaN
mean     66.25   NaN
median     NaN  13.0
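
The dictionary form of `agg()` keeps the original column names. When you want to choose the output names yourself, pandas also supports named aggregation. A minimal sketch on the same four rows (the output names `avg_score`, `max_age` and `n_rows` are made up for this example):

```python
import numpy as np
import pandas as pd

# Same four rows as the DataFrame above
df = pd.DataFrame({
    'Name': ['Nobita', 'Doraemon', 'Suneo', 'Doraemon'],
    'Age': [12, 13, np.nan, 20],
    'Score': [0, 85, 80, 100],
})

# Named aggregation: output_name=(source_column, function)
summary = df.groupby('Name').agg(
    avg_score=('Score', 'mean'),
    max_age=('Age', 'max'),
    n_rows=('Score', 'count'),
)

print(summary)
```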


In [128]:
# Merging Datasets
df1 = pd.DataFrame({'ID': [1,2], 'Name': ['A','B']})
df2 = pd.DataFrame({'ID': [1,2], 'Score': [90,80]})
df3 = pd.DataFrame({'ID1': [1,2], 'Score': [90,80]})

merged = pd.merge(df1, df2, on='ID')  # if on is omitted, merge uses the columns the two frames share (here only 'ID')
print(merged)

   ID Name  Score
0   1    A     90
1   2    B     80
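
`df3` above uses a differently named key (`ID1`), so `on='ID'` would not work for it. The sketch below shows one way to merge it with `left_on`/`right_on`, and how the `how` parameter changes which rows survive. `df4` is a hypothetical extra frame added only to make the difference between inner and left merges visible:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['A', 'B']})
df3 = pd.DataFrame({'ID1': [1, 2], 'Score': [90, 80]})
df4 = pd.DataFrame({'ID': [2, 3], 'Score': [80, 70]})  # hypothetical: only ID 2 overlaps with df1

# Key columns with different names: both key columns are kept in the result
by_diff_keys = pd.merge(df1, df3, left_on='ID', right_on='ID1')

# Default how='inner' keeps only IDs present in both frames
inner = pd.merge(df1, df4, on='ID')

# how='left' keeps every row of df1 and fills missing scores with NaN
left = pd.merge(df1, df4, on='ID', how='left')

print(by_diff_keys)
print(inner)
print(left)
```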


In [129]:
# Joining by index
print(df1.columns)
print(df3.columns)

# join() aligns rows on the index, not on column values
# overlapping column names raise an error unless lsuffix/rsuffix is given; index labels with no match produce NaN

df1 = df1.set_index('ID')
df2 = df2.set_index('ID')
df3 = df3.set_index('Score')

joined = df1.join(df2)   # both frames are indexed by ID, so rows align on matching ID values
print(joined)
joined1 = df1.join(df3)  # df3 is indexed by Score (90, 80), which matches no ID in df1, so its column is filled with NaN
print(joined1)


Index(['ID', 'Name'], dtype='object')
Index(['ID1', 'Score'], dtype='object')
   Name  Score
ID            
1     A     90
2     B     80
   Name  ID1
ID          
1     A  NaN
2     B  NaN
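
The NaN column above comes from the choice of index, not from `join()` itself. A small sketch, rebuilding the frames so it runs on its own, compares the mismatched index with one that lines up:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['A', 'B']}).set_index('ID')
df3 = pd.DataFrame({'ID1': [1, 2], 'Score': [90, 80]})

# Index of the right frame is 90, 80: no label matches df1's index (1, 2)
mismatched = df1.join(df3.set_index('Score'))

# Index of the right frame is 1, 2: labels line up with df1's index
matched = df1.join(df3.set_index('ID1'))

print(mismatched)
print(matched)
```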


ADVANCED

In [130]:
#Save to CSV
print(df)
df.to_csv("studentsfinal.csv", index=False) #  to avoid writing row numbers as a column

       Name   Age  Score
0    Nobita  12.0      0
1  Doraemon  13.0     85
2     Suneo   NaN     80
3  Doraemon  20.0    100
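
To see what `index=False` prevents, the sketch below writes a small frame to CSV text both ways and reads it back. It uses in-memory strings instead of files so nothing is written to disk; the two-row frame is just a stand-in for the data above:

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Nobita', 'Doraemon'], 'Score': [0, 85]})

# to_csv() with no path returns the CSV text instead of writing a file
csv_with_index = df.to_csv()
csv_without_index = df.to_csv(index=False)

# Reading back: the saved index shows up as an extra unnamed column
cols_with_index = pd.read_csv(io.StringIO(csv_with_index)).columns.tolist()
cols_without_index = pd.read_csv(io.StringIO(csv_without_index)).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'Name', 'Score']
print(cols_without_index)  # ['Name', 'Score']
```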


In [132]:
#sort
print(df, end='\n\n')

print(df.sort_values(by='Score',ascending=False), end='\n\n')
df.sort_index() #sort by row index

       Name   Age  Score
0    Nobita  12.0      0
1  Doraemon  13.0     85
2     Suneo   NaN     80
3  Doraemon  20.0    100

       Name   Age  Score
3  Doraemon  20.0    100
1  Doraemon  13.0     85
2     Suneo   NaN     80
0    Nobita  12.0      0



       Name   Age  Score
0    Nobita  12.0      0
1  Doraemon  13.0     85
2     Suneo   NaN     80
3  Doraemon  20.0    100


In [133]:
# Rename columns (rename() returns a new DataFrame; df itself is unchanged)
df.rename(columns={'Name':'StudentName', 'Score':'Marks'})

   StudentName   Age  Marks
0       Nobita  12.0      0
1     Doraemon  13.0     85
2        Suneo   NaN     80
3     Doraemon  20.0    100


In [140]:
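# Reset index: sort by Score first, then renumber the rows from 0

df = df.sort_values(by='Score',ascending=False)
print(df, end='\n\n')
df.reset_index(drop=True)  # drop=True discards the old index; the result is displayed, not assigned back to df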

       Name   Age  Score
3  Doraemon  20.0    100
1  Doraemon  13.0     85
2     Suneo   NaN     80
0    Nobita  12.0      0



       Name   Age  Score
0  Doraemon  20.0    100
1  Doraemon  13.0     85
2     Suneo   NaN     80
3    Nobita  12.0      0
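
`rename()`, `sort_values()` and `reset_index()` all return a new DataFrame by default, so the result is lost unless it is assigned. A minimal sketch of the usual pattern, chaining the three steps and assigning once (the three-row frame is a stand-in for the data above):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Nobita', 'Doraemon', 'Suneo'],
    'Score': [0, 85, 80],
})

# The result is discarded here, so df still has a 'Score' column
df.rename(columns={'Score': 'Marks'})
still_has_score = 'Score' in df.columns

# Chain the cleanup steps and assign the final result back to df
df = (
    df.rename(columns={'Score': 'Marks'})
      .sort_values(by='Marks', ascending=False)
      .reset_index(drop=True)
)

print(still_has_score)
print(df)
```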


💡 Mini Project Idea: "Top Students"
Load a CSV of students (Name, Marks, Attendance)

Filter: Attendance > 75%

Show top 2 scorers

Save filtered data to CSV

In [35]:
import pandas as pd

# df = pd.DataFrame({
#     'Name': ['Chhota Bheem', 'Raju', 'Chutki', 'Jaggu'],
#     'Marks': [100, 90, 80, 89],
#     'Attendance': [90, 95, 100, 98]
# })

# df.to_csv("student_attendance.csv", index=False)

df = pd.read_csv("student_attendance.csv")
print(df, end="\n\n")

#filter
df = df[df['Attendance'] > 75]
print(df, end="\n\n")

print(df.head(2))
print(df.tail(2), end="\n\n")

#top 2 scorers
df = df.sort_values(by='Marks', ascending=False).head(2)
print(df, end="\n\n")

#save the filtered top scorers
df.to_csv("filtered_student_attendance.csv", index=False)  # index=False keeps row numbers out of the file

df.shape

           Name  Marks  Attendance
0  Chhota Bheem    100          90
1          Raju     90          95
2        Chutki     80         100
3         Jaggu     89          98

           Name  Marks  Attendance
0  Chhota Bheem    100          90
1          Raju     90          95
2        Chutki     80         100
3         Jaggu     89          98

           Name  Marks  Attendance
0  Chhota Bheem    100          90
1          Raju     90          95
     Name  Marks  Attendance
2  Chutki     80         100
3   Jaggu     89          98

           Name  Marks  Attendance
0  Chhota Bheem    100          90
1          Raju     90          95



(2, 3)
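
In the run above every student has attendance over 75, so the filter removes nothing. The sketch below builds the data in memory and adds one hypothetical low-attendance student ('Kalia', invented for this example) so the filter has a visible effect. It also uses `nlargest()` as a shorter alternative to sorting and taking `head(2)`:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Chhota Bheem', 'Raju', 'Chutki', 'Jaggu', 'Kalia'],
    'Marks': [100, 90, 80, 89, 95],
    'Attendance': [90, 95, 100, 98, 60],  # 'Kalia' (60) is a hypothetical extra row
})

# Keep only students with attendance above 75
eligible = df[df['Attendance'] > 75]

# nlargest(n, column) returns the n rows with the highest values in that column
top2 = eligible.nlargest(2, 'Marks')

print(eligible)
print(top2)
```

Kalia has the second-highest marks overall but is excluded by the attendance filter before the ranking step.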

🧪 Mini Task:
Add a new student to your current DataFrame

Re-print the updated table

In [39]:
import pandas as pd
df = pd.DataFrame({
    'Name': ['Doraemon', 'Nobita'],
    'Scores': [85,90]
})

print(df, end="\n\n")

#method 1
df.loc[len(df)] = ['Suneo', 70]
print(df, end="\n\n")

#method 2 - concat
new_row = {'Name': 'Gian', 'Scores': 55}
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
print(df, end="\n\n")

#method 3 - insert at a required position: assign a fractional label (2.5 falls between rows 2 and 3), then sort and reset the index
df.loc[2.5] = ['Suzuka', 60]
print(df)
df = df.sort_index().reset_index(drop=True)
print(df)

       Name  Scores
0  Doraemon      85
1    Nobita      90

       Name  Scores
0  Doraemon      85
1    Nobita      90
2     Suneo      70

       Name  Scores
0  Doraemon      85
1    Nobita      90
2     Suneo      70
3      Gian      55

         Name  Scores
0.0  Doraemon      85
1.0    Nobita      90
2.0     Suneo      70
3.0      Gian      55
2.5    Suzuka      60
       Name  Scores
0  Doraemon      85
1    Nobita      90
2     Suneo      70
3    Suzuka      60
4      Gian      55


1. Unique values per column
2. Random n rows
3. Column info, nulls, data types
4. Summary statistics (mean, std, etc.)
5. Shape, dtypes

In [48]:
#number of unique values per column
print(df, "\n\n")
print(df.nunique(), "\n\n")

#random n rows
print(df.sample(2), "\n\n")

#column info - info() prints its report and returns None, so wrapping it in print() also prints "None"
print(df['Name'].info(), "\n\n")

#summary
print(df.describe(), "\n\n")

#shape, dtype
print(df.shape)
print(df.dtypes)

       Name  Scores
0  Doraemon      85
1    Nobita      90
2     Suneo      70
3    Suzuka      60
4      Gian      55 


Name      5
Scores    5
dtype: int64 


     Name  Scores
2   Suneo      70
3  Suzuka      60 


<class 'pandas.core.series.Series'>
RangeIndex: 5 entries, 0 to 4
Series name: Name
Non-Null Count  Dtype 
--------------  ----- 
5 non-null      object
dtypes: object(1)
memory usage: 172.0+ bytes
None 


          Scores
count   5.000000
mean   72.000000
std    15.247951
min    55.000000
25%    60.000000
50%    70.000000
75%    85.000000
max    90.000000 


(5, 2)
Name      object
Scores     int64
dtype: object
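
`nunique()` above only counts distinct values. To see the values themselves, or how often each one occurs, `unique()` and `value_counts()` are the usual companions. A small sketch on a frame with one repeated name (the data is made up for this example):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Doraemon', 'Nobita', 'Doraemon'],
    'Scores': [85, 90, 100],
})

n_names = df['Name'].nunique()             # how many distinct names
names = df['Name'].unique().tolist()       # the distinct names, in order of first appearance
name_counts = df['Name'].value_counts()    # how many rows each name has

print(n_names)
print(names)
print(name_counts)
```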
