
# Pandas Basics for Machine Learning

## Overview
Pandas is a Python library for manipulating and analyzing tabular data.
In machine learning, it is typically used to load datasets, clean them,
handle missing values, and prepare features for models.
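
As a minimal sketch of the loading step, the snippet below reads CSV text with `pd.read_csv`. The CSV content and the names `csv_text` and `loaded` are made up for illustration; an in-memory string stands in for a file on disk so the example runs anywhere.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """Name,Age,Score
Alice,22,85
Bob,25,
Charlie,23,78
"""

# read_csv accepts a file path or any file-like object
loaded = pd.read_csv(io.StringIO(csv_text))

# The empty Score field for Bob is parsed as NaN
missing_per_column = loaded.isna().sum()

print(loaded)
print(missing_per_column)
```

With a real dataset, the `io.StringIO(...)` argument would be replaced by the path to the CSV file.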


In [12]:
import pandas as pd
import numpy as np


## Creating a DataFrame
A DataFrame is a 2D labeled data structure similar to a table.


In [14]:
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [22, 25, 23, 24],
    "Score": [85, 90, 78, 88]
}
df = pd.DataFrame(data)
df



,Name,Age,Score
0,Alice,22,85
1,Bob,25,90
2,Charlie,23,78
3,David,24,88
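
A DataFrame can also be built row by row. The sketch below assumes the data arrives as a list of records (one dict per row), which is common when parsing JSON; the name `df_records` is illustrative only.

```python
import pandas as pd

# Each dict is one row; keys become column names
records = [
    {"Name": "Alice", "Age": 22, "Score": 85},
    {"Name": "Bob", "Age": 25, "Score": 90},
]
df_records = pd.DataFrame(records)

# shape is (rows, columns); columns holds the column labels
n_rows, n_cols = df_records.shape
column_names = list(df_records.columns)

print(n_rows, n_cols, column_names)
```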


## Basic Data Exploration
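Three methods give a quick picture of a dataset: `head()` previews the first rows, `info()` reports column dtypes and non-null counts, and `describe()` summarizes the numeric columns.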


In [15]:
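df.head()      # first 5 rows (not displayed here: a cell shows only its last expression)
df.info()      # prints column dtypes and non-null counts
df.describe()  # summary statistics for the numeric columns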

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
 2   Score   4 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 228.0+ bytes


,Age,Score
count,4.0,4.0
mean,23.5,85.25
std,1.290994,5.251984
min,22.0,78.0
25%,22.75,83.25
50%,23.5,86.5
75%,24.25,88.5
max,25.0,90.0
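
To make the `describe()` rows concrete, the following sketch recomputes two of them by hand on the same `Age` and `Score` values. It illustrates that `std` is the sample standard deviation (`ddof=1`) and that the `50%` row is the median.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22, 25, 23, 24], "Score": [85, 90, 78, 88]})
summary = df.describe()

# Sample standard deviation (divides by n - 1), matching the std row
age_std_sample = np.std(df["Age"].to_numpy(), ddof=1)

# The 50% row is the median of the column
age_median = df["Age"].median()

print(summary.loc["std", "Age"], age_std_sample)
print(summary.loc["50%", "Age"], age_median)
```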


## Selecting Data
Access specific columns or rows from a DataFrame.


In [16]:
# Select a single column (returns a Series)
df["Age"]
# Select multiple columns (returns a DataFrame)
df[["Name", "Score"]]
# Select one row by its index label (returns a Series)
df.loc[0]

,0
Name,Alice
Age,22
Score,85
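
With the default integer index, label-based and position-based selection look the same. The sketch below uses a hypothetical string index (`"a"` to `"d"`) to show the difference between `loc` (labels) and `iloc` (positions).

```python
import pandas as pd

df_labeled = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Age": [22, 25, 23, 24],
        "Score": [85, 90, 78, 88],
    },
    index=["a", "b", "c", "d"],  # hypothetical labels, for illustration only
)

row_by_label = df_labeled.loc["b"]      # row whose label is "b"
row_by_position = df_labeled.iloc[1]    # second row, regardless of label
cell = df_labeled.loc["c", "Score"]     # single value: row "c", column "Score"

# Label slices include the end label; position slices exclude the end position
label_slice = df_labeled.loc["a":"b", ["Name", "Score"]]
position_slice = df_labeled.iloc[0:1]

print(row_by_label)
print(cell)
print(label_slice)
```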


## Handling Missing Values
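Real-world datasets often contain missing values, which pandas represents as `NaN`. Because `NaN` is a floating-point value, the `Score` column is displayed as float once a value is missing.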


In [17]:
df_missing = df.copy()
df_missing.loc[1, "Score"] = np.nan
df_missing

,Name,Age,Score
0,Alice,22,85.0
1,Bob,25,
2,Charlie,23,78.0
3,David,24,88.0


In [19]:
# Count missing values per column
df_missing.isnull().sum()
# Fill missing Score values with the Score mean (other columns are left untouched)
df_filled = df_missing.fillna({"Score": df_missing["Score"].mean()})
df_filled


,Name,Age,Score
0,Alice,22,85.0
1,Bob,25,83.666667
2,Charlie,23,78.0
3,David,24,88.0
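
Mean imputation is only one option. As a sketch of two common alternatives, the snippet below drops the incomplete row with `dropna` and, separately, fills with the median instead of the mean. Which strategy suits a given dataset is a judgment call and is not decided here.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Age": [22, 25, 23, 24],
        "Score": [85, np.nan, 78, 88],
    }
)

# Missing values per column
missing_counts = df_missing.isna().sum()

# Option 1: drop rows where Score is missing
df_dropped = df_missing.dropna(subset=["Score"])

# Option 2: fill Score with its median, which is less sensitive to outliers
score_median = df_missing["Score"].median()
df_median = df_missing.fillna({"Score": score_median})

print(missing_counts)
print(df_dropped)
print(df_median)
```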


## Filtering Data
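Filter rows with a boolean condition: the comparison produces a True/False mask, and indexing the DataFrame with that mask keeps only the matching rows.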


In [20]:
# Students with score greater than 85
df[df["Score"] > 85]

,Name,Age,Score
1,Bob,25,90
3,David,24,88
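
Conditions can be combined. The sketch below, on the same data, uses `&` with parentheses around each comparison, the equivalent `query` string, and `isin` for membership tests. The thresholds are arbitrary and chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Age": [22, 25, 23, 24],
        "Score": [85, 90, 78, 88],
    }
)

# Combine masks with & (and) or | (or); each comparison needs its own parentheses
older_high = df[(df["Age"] >= 24) & (df["Score"] > 85)]

# The same filter written as a query string
same_via_query = df.query("Age >= 24 and Score > 85")

# Keep rows whose Name is in a given list
named = df[df["Name"].isin(["Alice", "Charlie"])]

print(older_high)
print(named)
```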


## Key Takeaways
- Pandas DataFrames store labeled, tabular data
- Methods such as `head()`, `info()`, `describe()`, and `fillna()` cover basic exploration and cleaning
- These steps are a core part of data preprocessing for ML
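
As a closing sketch of the preprocessing hand-off, the snippet below separates a feature matrix and a target vector and converts them to NumPy arrays. Treating `Age` as the feature and `Score` as the target is an assumption made purely for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Age": [22, 25, 23, 24],
        "Score": [85, 90, 78, 88],
    }
)

# Double brackets keep X two-dimensional: (n_samples, n_features)
X = df[["Age"]].to_numpy()

# Single brackets give a one-dimensional target: (n_samples,)
y = df["Score"].to_numpy()

print(X.shape, y.shape)
```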
