# Pandas Foundations for ML

## Objective
Learn and document core Pandas operations for ML projects:
- DataFrames and Series
- Reading/writing CSV
- Indexing, slicing, filtering
- Basic statistics and summary

## Why this matters
Pandas is the **standard tool** for tabular data in Python-based ML.  
It lets you load, inspect, filter, and summarize data efficiently before feeding it into models.

In [1]:
import pandas as pd

# Sample data
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "Score": [88, 92, 79, 85]
}

df = pd.DataFrame(data)

df

      Name  Age  Score
0    Alice   25     88
1      Bob   30     92
2  Charlie   35     79
3    David   40     85
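
The objective lists both DataFrames and Series, so a small sketch of how the two relate may help: each column of a DataFrame is a Series. The variable names here (`ages`, `scores`, `frame`) are illustrative and not part of the cells in this notebook.

```python
import pandas as pd

# A Series is a one-dimensional labeled array; `name` becomes the column label
ages = pd.Series([25, 30, 35, 40], name="Age")
scores = pd.Series([88, 92, 79, 85], name="Score")

# Placing Series side by side (axis=1) produces a DataFrame
frame = pd.concat([ages, scores], axis=1)

# Single brackets return a Series, double brackets return a DataFrame
age_column = frame["Age"]
age_frame = frame[["Age"]]

print(type(age_column).__name__)
print(type(age_frame).__name__)
```

The single versus double bracket distinction matters later on, because many ML libraries expect a two-dimensional input for features.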


## Key Notes
- `df.shape` → dimensions as a `(rows, columns)` tuple
- `df.columns` → column names
- `df.dtypes` → data type of each column
- `df.info()` → concise summary (index range, non-null counts, dtypes, memory usage)

In [2]:
print("Shape:", df.shape)
print("Columns:", df.columns)
print("\nData types:\n", df.dtypes)
print("\nInfo:")
df.info()

Shape: (4, 3)
Columns: Index(['Name', 'Age', 'Score'], dtype='object')

Data types:
 Name     object
Age       int64
Score     int64
dtype: object

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
 2   Score   4 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 228.0+ bytes


In [3]:
# Select a single column
print("Ages:", df["Age"])

# Select multiple columns
print("\nName and Score:\n", df[["Name", "Score"]])

# Row selection by integer position (iloc)
print("\nFirst row:\n", df.iloc[0])
print("\nLast two rows:\n", df.iloc[-2:])

Ages: 0    25
1    30
2    35
3    40
Name: Age, dtype: int64

Name and Score:
       Name  Score
0    Alice     88
1      Bob     92
2  Charlie     79
3    David     85

First row:
 Name     Alice
Age         25
Score       88
Name: 0, dtype: object

Last two rows:
       Name  Age  Score
2  Charlie   35     79
3    David   40     85
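
`iloc` selects by integer position. Its label-based counterpart is `loc`, and the difference is easiest to see when the index is not the default 0..3. The following is a minimal sketch that rebuilds the sample data under a separate, illustrative name (`df_by_name`) so the `df` used in the other cells is not modified.

```python
import pandas as pd

# Same sample data, but indexed by name instead of 0..3
df_by_name = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Age": [25, 30, 35, 40],
        "Score": [88, 92, 79, 85],
    }
).set_index("Name")

# iloc: integer positions, the end of the slice is excluded
first_two_by_position = df_by_name.iloc[0:2]

# loc: index labels, the end of the slice is included
alice_to_charlie = df_by_name.loc["Alice":"Charlie"]

# loc with a row label and a column label returns a single value
bob_score = df_by_name.loc["Bob", "Score"]

print(first_two_by_position)
print(alice_to_charlie)
print("Bob's score:", bob_score)
```

Note that the `iloc` slice returns two rows while the `loc` slice returns three, since label slices include their endpoint.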


In [4]:
# Filter rows with a boolean mask: keep rows where Score > 85
high_scores = df[df["Score"] > 85]
print("Students with Score > 85:\n", high_scores)

# Same pattern on another column: keep rows where Age > 30
ages_above_30 = df[df["Age"] > 30]
print("\nStudents older than 30:\n", ages_above_30)

Students with Score > 85:
     Name  Age  Score
0  Alice   25     88
1    Bob   30     92

Students older than 30:
       Name  Age  Score
2  Charlie   35     79
3    David   40     85
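
The filters above each use one condition. Conditions can also be combined, as in this sketch. It assumes the same sample data, rebuilt under the illustrative name `students` so that it runs on its own.

```python
import pandas as pd

students = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Age": [25, 30, 35, 40],
        "Score": [88, 92, 79, 85],
    }
)

# Wrap each comparison in parentheses and combine with & (and) or | (or).
# Python's `and` / `or` keywords do not work element-wise on Series.
older_and_high = students[(students["Age"] > 25) & (students["Score"] > 85)]
young_or_low = students[(students["Age"] < 30) | (students["Score"] < 80)]

# isin() keeps rows whose value appears in a list
selected = students[students["Name"].isin(["Alice", "David"])]

print(older_and_high)
print(young_or_low)
print(selected)
```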


In [5]:
# Descriptive stats
print("Mean Age:", df["Age"].mean())
print("Max Score:", df["Score"].max())
print("Summary statistics:\n", df.describe())


Mean Age: 32.5
Max Score: 92
Summary statistics:
              Age      Score
count   4.000000   4.000000
mean   32.500000  86.000000
std     6.454972   5.477226
min    25.000000  79.000000
25%    28.750000  83.500000
50%    32.500000  86.500000
75%    36.250000  89.000000
max    40.000000  92.000000
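
By default `describe()` covers only the numeric columns, which is why `Name` is missing from the summary above. A minimal sketch of including every column, again using an illustrative standalone copy of the data named `students`:

```python
import pandas as pd

students = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Age": [25, 30, 35, 40],
        "Score": [88, 92, 79, 85],
    }
)

# include="all" adds non-numeric columns; statistics that do not apply
# to a column (e.g. the mean of Name) are filled with NaN
summary_all = students.describe(include="all")
print(summary_all)

# Individual statistics can be read out of the summary by row and column label
print("Unique names:", summary_all.loc["unique", "Name"])
print("Mean age:", summary_all.loc["mean", "Age"])
```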


In [6]:
# Save to CSV
df.to_csv("students.csv", index=False)

# Read back
df_loaded = pd.read_csv("students.csv")
df_loaded

      Name  Age  Score
0    Alice   25     88
1      Bob   30     92
2  Charlie   35     79
3    David   40     85
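
The `index=False` argument in the save step matters. Without it, the row index is written as an extra unnamed column, which `read_csv` then loads as `Unnamed: 0`. The sketch below shows this round trip using an in-memory buffer (`io.StringIO`) instead of a file, which is an assumption made here to keep the example free of side effects. The `students` name is illustrative.

```python
import io

import pandas as pd

students = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Age": [25, 30, 35, 40],
        "Score": [88, 92, 79, 85],
    }
)

# Write with the default index=True: the index becomes the first CSV column
buffer = io.StringIO()
students.to_csv(buffer)

# Reading it back naively yields an extra "Unnamed: 0" column
buffer.seek(0)
with_extra_column = pd.read_csv(buffer)
print(with_extra_column.columns.tolist())

# index_col=0 tells read_csv to treat the first column as the index again
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored.columns.tolist())
```

Either approach works: drop the index when writing (`index=False`), or restore it when reading (`index_col=0`).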
