# Introduction to Pandas for Machine Learning
## What is Pandas?
Pandas is the go-to library for data manipulation and analysis in Python. 

If NumPy is the engine for numerical operations, Pandas is the cockpit—it gives you intuitive, labeled data structures to explore, clean, and prepare real-world datasets for machine learning.

## Why Pandas?   
- Real data isn’t just numbers—it has names, categories, missing values, and timestamps.  
- Pandas handles all this cleanly and efficiently.  
- Every ML workflow starts with loading and cleaning data—Pandas makes it easy!
         
## Core ideas
Think of a DataFrame as a smart spreadsheet or SQL table:   
- Rows = observations (e.g., one customer, one house)  
- Columns = features/variables (e.g., age, price, color)  
- Labels = meaningful names (no more guessing what arr[:, 3] means!)     

In [None]:
import pandas as pd

# Create a DataFrame from a dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['Paris', 'London', 'Berlin']
}
df = pd.DataFrame(data)
print(df)
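
To make the point about labels concrete, here is a minimal sketch that stores the same values in a NumPy array and in a DataFrame. The data and the names `arr`, `people`, and `salary` are invented for illustration and are not part of the dataset used later.

```python
import numpy as np
import pandas as pd

# Toy values, made up for illustration: one row per person
arr = np.array([[25, 50000], [30, 62000], [35, 58000]])
people = pd.DataFrame(arr, columns=['age', 'salary'])

# NumPy: you have to remember that column 1 holds the salary
salaries_by_position = arr[:, 1]

# Pandas: the column name says what the values are
salaries_by_label = people['salary']

print(salaries_by_position)
print(salaries_by_label)
```

Both lines select the same numbers; only the labeled version documents itself.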

## Using Pandas on real Kaggle data

In [None]:
df = pd.read_csv('data/housing.csv')
df.head()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
# Data types + missing values
df.info()
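
If you want the number of missing values per column as data rather than as printed text, one common approach is to combine isna() with sum(). The sketch below uses a small made-up DataFrame (the name `toy` and its columns are hypothetical) so that it runs without the housing file.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration, with one gap in each column
toy = pd.DataFrame({
    'rooms': [3, 5, np.nan, 4],
    'city': ['Paris', None, 'Berlin', 'London'],
})

# isna() marks missing cells as True; sum() counts the True values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dtypes shows how each column is stored
print(toy.dtypes)
```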

In [None]:
# Summary stats (mean, std, min, max, etc.)
df.describe()

In [None]:
# Select one column by name (returns a "Series", like a 1D labeled array)
total_rooms = df['total_rooms']
total_rooms

In [None]:
df['total_rooms'].mean()  # Mean of one column

In [None]:
# Multiple columns
subset = df[['total_rooms', 'median_income']]
subset.head()

## Basic transformations

In [None]:
# Create a new column from an existing one (house value in thousands)
df['m_house_value_kilos'] = df['median_house_value'] / 1000

In [None]:
df[['median_house_value', 'm_house_value_kilos']]

In [None]:
df['house_value_per_person'] = df['median_house_value'] / df['population']
df['house_value_per_person'].describe()

In [None]:
# Filter rows: keep only those with at most 100 total rooms
few_rooms = df[df['total_rooms'] <= 100]
few_rooms.head()
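
Filters can also be combined. The sketch below assumes a small made-up DataFrame that reuses two column names from the housing data; the values and the threshold of 5 are invented for illustration.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    'total_rooms': [80, 1200, 50, 3000],
    'median_income': [2.5, 8.1, 6.0, 3.2],
})

# Each comparison yields a boolean Series (one True/False per row)
has_few_rooms = toy['total_rooms'] <= 100
has_high_income = toy['median_income'] > 5

# Combine masks with & (and) or | (or), not the Python keywords and/or
selected = toy[has_few_rooms & has_high_income]
print(selected)
```

When the comparisons are written inline, each one needs its own parentheses, for example `toy[(toy['total_rooms'] <= 100) & (toy['median_income'] > 5)]`.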

In [None]:
# Sort by a column
df_sorted = df.sort_values('median_income', ascending=False)
df_sorted.head()

## Why Pandas is Crucial for ML
- Load raw data in different formats: CSV, Excel, plain text, ...
- Understand data: head, describe, ...
- Clean data: handle missing values, remove irrelevant columns, ...
- Filter/transform data: make the algorithm focus on relevant data
- Get data ready for models: .to_numpy() (or the older .values) -> NumPy arrays
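
The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the CSV text, the column names (`rooms`, `income`, `notes`, `price`), and the filter threshold are all made up, and io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (made-up data with two gaps)
csv_text = """rooms,income,notes,price
3,2.5,a,150.0
5,,b,320.0
4,4.1,c,
6,6.3,d,410.0
2,1.8,e,90.0
5,5.0,f,300.0
"""

# 1. Load raw data
df_demo = pd.read_csv(io.StringIO(csv_text))

# 2. Understand it
print(df_demo.head())
print(df_demo.describe())

# 3. Clean: drop an irrelevant column, then drop rows with missing values
df_demo = df_demo.drop(columns=['notes'])
df_demo = df_demo.dropna()

# 4. Filter: keep only rows with at least 3 rooms
df_demo = df_demo[df_demo['rooms'] >= 3]

# 5. Get ready for a model: features and target as NumPy arrays
X = df_demo[['rooms', 'income']].to_numpy()
y = df_demo['price'].to_numpy()
print(X.shape, y.shape)
```

Dropping rows is the simplest way to deal with gaps; filling them in is often a better choice when data is scarce.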

**Golden Rule**: Always explore your data with Pandas BEFORE modeling!

# Exercises
1. Perform these actions on the following dataset:
- Display the DataFrame
- Check for missing values
- Fill missing 'sales' with the median
- Add a 'revenue' column (sales * price)
- Convert the features to a NumPy array

In [None]:
data = {
    'product': ['A', 'B', 'C', 'D'],
    'sales': [100, 150, None, 200], 
    'price': [10.0, 15.0, 20.0, 25.0]
}
df = pd.DataFrame(data)