# Dataset Selection and Preprocessing

## Choosing and preparing your data

In this notebook, we will learn how to select a suitable dataset and prepare it for machine learning. Careful preprocessing (handling missing values, encoding categories, scaling features) generally makes our models more accurate and reliable.


## 🎯 Dataset Selection Criteria

- **Relevance:** Does it match your problem?
- **Quality:** Is the data clean and reliable?
- **Size:** Are there enough samples?
- **Features:** Are important variables included?

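_💡 Popular beginner datasets: Titanic, Iris, California Housing, MNIST (Boston Housing appears in many older tutorials, but it was removed from scikit-learn in version 1.2 over ethical concerns)_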
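
Some of the criteria above can be checked roughly in code. The sketch below is one possible way to do that, not a standard recipe: `screen_dataset`, its `min_rows` threshold, and the sample values are all hypothetical names and numbers chosen for illustration. Relevance still has to be judged by hand.

```python
import pandas as pd


def screen_dataset(df: pd.DataFrame, required_columns: list[str], min_rows: int = 100) -> dict:
    """Return rough size, quality, and feature checks for a candidate dataset.

    Hypothetical helper: the checks and the default threshold are assumptions.
    """
    missing_columns = [c for c in required_columns if c not in df.columns]
    return {
        "n_rows": len(df),                                # Size
        "enough_rows": len(df) >= min_rows,               # Size vs. an assumed minimum
        "missing_share": float(df.isna().mean().mean()),  # Quality: average share of missing cells per column
        "duplicate_rows": int(df.duplicated().sum()),     # Quality: exact duplicate rows
        "missing_columns": missing_columns,               # Features: expected columns that are absent
    }


# Tiny made-up sample, for illustration only
sample = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "female"],
})

report = screen_dataset(sample, required_columns=["Age", "Fare", "Survived"], min_rows=3)
print(report)
```

A report like this only flags candidates for a closer look; the right thresholds depend on the problem and the model you plan to train.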

## 🔍 Exploratory Data Analysis (EDA)

Understanding your data is a crucial step before modeling. It helps you see patterns, spot problems, and decide how to clean or transform your data.

### Understanding Your Data

- Distribution of variables
- Relationships between features
- Patterns in missing values
- Statistical summaries
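
Each item in the list above maps to a short pandas call. The following is a minimal sketch on a made-up five-row table (the values are invented for illustration, not taken from a real dataset); on real data you would usually add plots as well.

```python
import pandas as pd

# Tiny made-up sample, for illustration only
df = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "Sex": ["male", "female", "female", "female", "male"],
})

summary = df.describe()                      # statistical summaries of the numeric columns
missing_per_column = df.isna().sum()         # how many values are missing in each column
correlations = df.corr(numeric_only=True)    # pairwise relationships between numeric features
category_counts = df["Sex"].value_counts()   # distribution of a categorical variable

print(summary)
print(missing_per_column)
print(correlations)
print(category_counts)
```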

*(Insert EDA process flowchart image here)*

## 🧹 Data Cleaning Essentials

Cleaning data means handling missing values, outliers, and duplicate records, and making sure each column has the correct data type. Clean data generally leads to better model performance.

- **Missing Values:** Remove, fill, or interpolate
- **Outliers:** Detect and handle extreme values
- **Duplicates:** Remove redundant records
- **Data Types:** Ensure correct formats
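
The four steps above can be sketched with pandas as follows. This is one reasonable approach under stated assumptions, not the only one: the table is made up, the median fill and the 1.5 × IQR outlier rule are common conventions chosen here for illustration, and dropping outliers is not always appropriate.

```python
import pandas as pd

# Tiny made-up sample, for illustration only: Age arrives as text, one value is
# missing, one row is duplicated, and one Fare is extreme.
raw = pd.DataFrame({
    "Age": ["22", "38", None, "35", "35", "40"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 53.10, 5000.0],
})

clean = raw.copy()

# Data types: convert the text column to numbers
clean["Age"] = pd.to_numeric(clean["Age"])

# Missing values: fill with the column median (one option among remove/fill/interpolate)
clean["Age"] = clean["Age"].fillna(clean["Age"].median())

# Duplicates: drop rows that are exact copies of an earlier row
clean = clean.drop_duplicates().reset_index(drop=True)

# Outliers: keep rows whose Fare lies within 1.5 * IQR of the quartiles
q1, q3 = clean["Fare"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = clean["Fare"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range].reset_index(drop=True)

print(clean)
```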

*(Insert data cleaning comparison image here)*

## 🔧 Data Preprocessing Code

Here's an example Python snippet demonstrating common preprocessing steps with pandas and scikit-learn. It assumes a local `titanic.csv` file containing `Age`, `Embarked`, `Sex`, and `Fare` columns.


In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load data
df = pd.read_csv('titanic.csv')

# Basic info
df.info()  # info() prints its report directly and returns None
print(df.describe())

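# Handle missing values (assign the result back: calling fillna(inplace=True)
# on a selected column may not update df in recent pandas versions)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])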

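# Encode categorical variables (one encoder per column, so each mapping
# can still be inverted later with inverse_transform)
sex_encoder = LabelEncoder()
df['Sex'] = sex_encoder.fit_transform(df['Sex'])
embarked_encoder = LabelEncoder()
df['Embarked'] = embarked_encoder.fit_transform(df['Embarked'])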

# Scale numerical features (in a train/test workflow, fit the scaler on the training split only)
scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])

print("Data preprocessing complete!")
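
The cell above fits the scaler on the whole table, which is fine for a demonstration. When a model is evaluated on held-out data, a common precaution is to fit the scaler on the training split only and reuse it on the test split, so that test-set statistics do not leak into training. The sketch below illustrates that idea on a made-up table; the values, the 25% test size, and the `random_state` are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Tiny made-up sample, for illustration only
data = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 30.07],
})

# Split first, then copy so the assignments below do not touch the original table
train, test = train_test_split(data, test_size=0.25, random_state=0)
train, test = train.copy(), test.copy()

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
train[["Age", "Fare"]] = scaler.fit_transform(train[["Age", "Fare"]])
# Apply the same training statistics to the test rows
test[["Age", "Fare"]] = scaler.transform(test[["Age", "Fare"]])

print(train)
print(test)
```

After this, the training columns have mean 0 and unit variance by construction, while the test columns generally do not, which is expected.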

## 🚀 Open in Colab

[Open this notebook in Google Colab](https://colab.research.google.com/github/Roopesht/codeexamples/blob/main/genai/python_easy/4/concept_2.ipynb)

## 🎯 Key Takeaway

Quality data is the foundation of successful machine learning: garbage in, garbage out!

### Think About It

How would you handle missing age data in a customer dataset?