# Scikit-learn for Machine Learning (Beginner-friendly)

**Learning Objectives:**
- Build and evaluate classification, regression, and clustering models
- Master the complete ML pipeline from data to predictions
- Apply preprocessing techniques and model evaluation metrics
- Understand when to use different algorithms and how to tune them

**Prerequisites:** Python basics, NumPy fundamentals, Pandas data preprocessing (complete previous notebooks first)

**Estimated Time:** ~90 minutes

---

Scikit-learn is the most widely used general-purpose library for classical machine learning in Python. This notebook brings together what you've learned in NumPy and Pandas to build ML models that make predictions on real data.

**Why Scikit-learn?** It provides:
- Consistent API across all algorithms (fit, predict, score)
- Built-in preprocessing tools that work seamlessly with Pandas
- Comprehensive model evaluation and validation tools
- Production-ready implementations of proven algorithms
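To make the "consistent API" point concrete, here is a minimal sketch, assuming a small synthetic dataset from `make_classification` (not the data used later in this notebook). Two different algorithms are trained and scored with exactly the same three calls. The dictionary keys and the `max_depth=3` setting are illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only (assumption: not the notebook's dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Two unrelated algorithms that share the same estimator interface
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X, y)                    # learn parameters from the data
    preds = model.predict(X[:5])       # predict labels for the first five rows
    scores[name] = model.score(X, y)   # mean accuracy for classifiers
    print(f"{name}: predictions={preds}, training accuracy={scores[name]:.2f}")
```

The scores here are computed on the training data just to keep the sketch short, so they are optimistic. Proper evaluation uses held-out data.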

**Learning Path Connection:** This notebook uses:
- **NumPy skills**: Array operations, mathematical functions, broadcasting
- **Pandas skills**: Data cleaning, feature engineering, preparing data for train/test splits
- **New ML skills**: Model training, evaluation, and prediction

**What You'll Build:** Complete ML projects including customer classification, sales prediction, and customer segmentation, the kinds of tasks data scientists handle routinely.

**🎯 Success Indicators:** By the end, you should be able to:
- Train models and make accurate predictions on new data
- Evaluate model performance using appropriate metrics
- Choose the right algorithm for different types of problems
- Build complete ML pipelines from raw data to final predictions

**💡 Beginner Tips:**
- Start simple: basic models often work surprisingly well
- Always split your data before training, and never evaluate a model on the data it was trained on
- Focus on understanding the problem before choosing algorithms
- Model evaluation is as important as model training
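The tip about splitting before training can be sketched as follows. This is a minimal example on synthetic data (an assumption, not the notebook's dataset), using an unrestricted decision tree because it tends to memorize its training set. That makes the gap between training and test accuracy easy to see.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out 20% of the rows; stratify keeps class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# An unrestricted tree can fit the training data almost perfectly
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = tree.score(X_test, y_test)     # accuracy on held-out data

print(f"Training accuracy: {train_acc:.2f}")
print(f"Test accuracy:     {test_acc:.2f}")
```

The training score says little about how the model behaves on new data. The test score is the one to report.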

**🔗 ML Problem Types We'll Cover:**
- **Classification**: Predicting categories (premium vs regular customers)
- **Regression**: Predicting numbers (sales amounts, prices)
- **Clustering**: Finding hidden groups in data (customer segments)
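The practical difference between the three problem types is the target you pass to `fit`. The sketch below uses made-up customer features (monthly spend and visit count) and made-up rules for the targets. All of these values are hypothetical and serve only to show the shape of each problem.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(42)

# Hypothetical features: column 0 = monthly spend, column 1 = visits per month
X = rng.normal(loc=[50.0, 5.0], scale=[15.0, 2.0], size=(100, 2))

# Classification: the target is a category (1 = premium, 0 = regular; made-up rule)
y_class = (X[:, 0] > 50).astype(int)
clf = LogisticRegression(max_iter=1000).fit(X, y_class)

# Regression: the target is a number (made-up sales amount with noise)
y_amount = 3.0 * X[:, 0] + 10.0 * X[:, 1] + rng.normal(scale=5.0, size=100)
reg = LinearRegression().fit(X, y_amount)

# Clustering: no target at all, the model groups rows by similarity
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("Predicted classes: ", clf.predict(X[:3]))
print("Predicted amounts: ", reg.predict(X[:3]).round(1))
print("Assigned clusters: ", km.labels_[:3])
```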


In [None]:
# Essential imports for ML
import numpy as np
import pandas as pd
from datetime import datetime

# Scikit-learn core modules
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# ML Algorithms we'll use
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

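# Set NumPy's global random seed for reproducibility (remember this from NumPy and Pandas!)
# Scikit-learn falls back to this global state only when random_state is not set,
# so passing random_state=42 to splitters and estimators is the more reliable habit
np.random.seed(42)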

# Display settings
pd.set_option('display.max_columns', 15)
pd.set_option('display.max_rows', 10)

print("Scikit-learn ready! Using reproducible random seed: 42")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

# Import sklearn and check version
import sklearn
print(f"Scikit-learn version: {sklearn.__version__}")
print("\n🚀 Ready to build ML models!")