# Machine Learning Basics

## Overview
This notebook introduces fundamental machine learning concepts using scikit-learn. We'll cover:
- Basic ML workflow
- Classification (predicting categories)
- Regression (predicting continuous values)
- Model evaluation

## Setup
Run this notebook in Google Colab or a local Jupyter environment.

In [None]:
# Install required packages (uncomment if needed)
# !pip install scikit-learn pandas matplotlib seaborn numpy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import load_iris, load_diabetes, make_classification

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("Libraries imported successfully!")

## Part 1: Classification - Iris Dataset

### What is Classification?
Classification is predicting which category (class) an item belongs to.

**Example:** Given flower measurements, predict the species (Setosa, Versicolor, or Virginica)

In [None]:
# Load the Iris dataset
iris = load_iris()
X = iris.data # Features: sepal length, sepal width, petal length, petal width
y = iris.target # Target: species (0, 1, or 2)

# Create a DataFrame for better visualization
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(y, iris.target_names)

print("Dataset shape:", df.shape)
print("\nFirst few rows:")
df.head()

In [None]:
# Visualize the data
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Scatter plot
for species in iris.target_names:
    subset = df[df['species'] == species]
    axes[0].scatter(subset['petal length (cm)'], subset['petal width (cm)'],
                    label=species, alpha=0.7, s=100)
axes[0].set_xlabel('Petal Length (cm)')
axes[0].set_ylabel('Petal Width (cm)')
axes[0].set_title('Iris Dataset - Petal Dimensions')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Distribution plot
df.groupby('species')['petal length (cm)'].plot(kind='kde', ax=axes[1], legend=True)
axes[1].set_xlabel('Petal Length (cm)')
axes[1].set_title('Distribution of Petal Length by Species')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nNotice how each species forms its own cluster!")

### The Machine Learning Workflow

```
1. Split data → Train/Test sets
2. Train model → Learn patterns from training data
3. Evaluate → Test on unseen data
4. Predict → Use model for new data
```

In [None]:
# Step 1: Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
print("\nWe train on 80% and hold out 20% to estimate performance on unseen data")
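
A plain random split does not guarantee that each species appears in the same proportion in both subsets. The sketch below is a variation on the cell above, not the notebook's own method: it passes `stratify=` to `train_test_split` so the class ratios are preserved. The variable names (`X_iris`, `X_tr_s`, and so on) are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# stratify=y_iris asks for the same class proportions in train and test
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# Iris has 50 samples per species, so each subset stays balanced
print("Train class counts:", np.bincount(y_tr_s))
print("Test class counts: ", np.bincount(y_te_s))
```

Stratification matters most when classes are imbalanced or the dataset is small, where an unlucky split can leave a class under-represented in the test set.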

In [None]:
# Step 2: Train a Logistic Regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

print("Model trained!")
print("\nThe model has learned patterns from the training data")
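
The setup cell imports `StandardScaler`, but the model above is trained on raw measurements. As an optional variation (an assumption about how one might use it, not a step the notebook requires), the sketch below chains scaling and logistic regression in a pipeline so the scaler is fitted on the training data only. `scaled_model` and the `_p` suffixed names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr_p, X_te_p, y_tr_p, y_te_p = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data, then the classifier
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
scaled_model.fit(X_tr_p, y_tr_p)

# score() applies the same scaling to the test data before predicting
scaled_accuracy = scaled_model.score(X_te_p, y_te_p)
print(f"Test accuracy with scaling: {scaled_accuracy:.2%}")
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training, which can happen if the whole dataset is scaled before splitting.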

In [None]:
# Step 3: Evaluate the model
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
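
To make the columns of the classification report less opaque, the sketch below computes precision, recall, F1, and support for one class by hand and compares them with scikit-learn's `precision_recall_fscore_support`. The labels are toy values invented for illustration, not the Iris predictions from the cell above.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy labels, invented for illustration
y_true_toy = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_pred_toy = np.array([0, 0, 1, 1, 2, 2, 2, 1])

cls = 1  # the class we inspect
tp = np.sum((y_pred_toy == cls) & (y_true_toy == cls))  # predicted cls, truly cls
fp = np.sum((y_pred_toy == cls) & (y_true_toy != cls))  # predicted cls, actually another class
fn = np.sum((y_pred_toy != cls) & (y_true_toy == cls))  # missed examples of cls

precision = tp / (tp + fp)   # of everything predicted as cls, how much was right
recall = tp / (tp + fn)      # of everything truly cls, how much was found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
support = tp + fn            # number of true examples of cls

# scikit-learn returns one value per class; index 1 is class 1
p, r, f, s = precision_recall_fscore_support(y_true_toy, y_pred_toy)
print(f"manual:  precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} support={support}")
print(f"sklearn: precision={p[1]:.3f} recall={r[1]:.3f} f1={f[1]:.3f} support={s[1]}")
```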

In [None]:
# Visualize confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=iris.target_names,
            yticklabels=iris.target_names)
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

print("\nDiagonal cells = correct predictions")
print("Off-diagonal cells = misclassifications (row = true class, column = predicted class)")
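
The confusion matrix also contains the summary metrics. As a small sketch on toy labels (invented for illustration, not the notebook's results), accuracy is the diagonal sum divided by the total, and normalizing each row with `normalize='true'` puts per-class recall on the diagonal.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration
y_true_toy = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_pred_toy = np.array([0, 0, 1, 1, 2, 2, 2, 1])

# Rows are true classes, columns are predicted classes
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)

# Accuracy = correct predictions (diagonal) / all predictions
acc_from_cm = np.trace(cm_toy) / cm_toy.sum()
print(f"Accuracy from matrix: {acc_from_cm:.2%}")

# Row-normalized matrix: each row sums to 1, diagonal = per-class recall
cm_norm = confusion_matrix(y_true_toy, y_pred_toy, normalize='true')
print("Per-class recall:", np.round(np.diag(cm_norm), 3))
```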

In [None]:
# Step 4: Make predictions on new data
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]]) # Example measurements
prediction = model.predict(new_flower)
probability = model.predict_proba(new_flower)

print(f"Predicted species: {iris.target_names[prediction[0]]}")
print("\nProbabilities:")
for i, species in enumerate(iris.target_names):
    print(f"  {species}: {probability[0][i]:.2%}")
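
`predict` and `predict_proba` are two views of the same result: each row of probabilities sums to 1, and the predicted class is the one with the highest probability. The sketch below checks this on two sample measurements. The second sample and the name `clf_demo` are illustrative, and the model is fitted on the full dataset purely to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris_demo = load_iris()
clf_demo = LogisticRegression(max_iter=200).fit(iris_demo.data, iris_demo.target)

# Two illustrative flowers: sepal length, sepal width, petal length, petal width
samples = np.array([[5.1, 3.5, 1.4, 0.2],
                    [6.7, 3.0, 5.2, 2.3]])

proba = clf_demo.predict_proba(samples)       # shape: (n_samples, n_classes)
from_proba = clf_demo.classes_[proba.argmax(axis=1)]
from_predict = clf_demo.predict(samples)

print("Row sums:", proba.sum(axis=1))
print("Argmax of probabilities:", iris_demo.target_names[from_proba])
print("predict():              ", iris_demo.target_names[from_predict])
```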

## Part 2: Comparing Different Classifiers

Let's compare several algorithms on the same train/test split.

In [None]:
# Define multiple models
models = {
    'Logistic Regression': LogisticRegression(max_iter=200),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

# Train and evaluate each model on the same split.
# The loop uses its own names (clf, preds, acc) so it does not overwrite
# the `model`, `y_pred`, and `accuracy` variables from Part 1.
results = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    acc = accuracy_score(y_test, preds)
    results[name] = acc
    print(f"{name}: {acc:.2%}")

# Visualize comparison
plt.figure(figsize=(10, 6))
plt.bar(results.keys(), results.values(), color=['#3498db', '#e74c3c', '#2ecc71'])
plt.ylabel('Accuracy')
plt.title('Model Comparison')
plt.ylim([0.9, 1.02])  # headroom so labels above near-100% bars stay inside the axes
for i, (name, acc) in enumerate(results.items()):
    plt.text(i, acc + 0.005, f'{acc:.2%}', ha='center', fontweight='bold')
plt.grid(axis='y', alpha=0.3)
plt.show()
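
With only 30 test samples, a single split can make models look identical or rank them by chance. One common way to get a steadier comparison is k-fold cross-validation. The sketch below uses `cross_val_score` with 5 folds, which is a choice made here for illustration and not part of the original comparison. The names `candidates` and `cv_results` are likewise illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=200),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_results = {}
for name, clf in candidates.items():
    # Each model is trained and scored 5 times on different train/validation folds
    scores = cross_val_score(clf, X_iris, y_iris, cv=5)
    cv_results[name] = scores
    print(f"{name}: mean {scores.mean():.2%} (std {scores.std():.2%})")
```

The standard deviation across folds gives a rough sense of how much of the difference between models could be noise.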

## Part 3: Regression - Predicting Continuous Values

### What is Regression?
Regression predicts continuous numerical values.

**Example:** Predicting house prices, stock prices, temperature, etc.

In [None]:
# Load diabetes dataset (predicting disease progression)
diabetes = load_diabetes()
X_reg = diabetes.data
y_reg = diabetes.target

print(f"Dataset shape: {X_reg.shape}")
print(f"Features: {diabetes.feature_names}")
print("Target: disease progression one year after baseline (continuous value)")

In [None]:
# Split data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Train Linear Regression model
reg_model = LinearRegression()
reg_model.fit(X_train_reg, y_train_reg)

# Make predictions
y_pred_reg = reg_model.predict(X_test_reg)

# Evaluate
mse = mean_squared_error(y_test_reg, y_pred_reg)
rmse = np.sqrt(mse)
r2 = r2_score(y_test_reg, y_pred_reg)

print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"R² Score: {r2:.3f}")
print("\nR² Score: 1.0 = perfect; 0.0 = no better than always predicting the mean; negative = worse than that")
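
R² compares the model's squared error with that of a constant prediction: R² = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of the test targets. The sketch below recomputes it by hand and also scores a mean-only baseline using `DummyRegressor`, which is added here for illustration. The `_r` suffixed names are assumptions of this example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X_r, y_r = load_diabetes(return_X_y=True)
X_tr_r, X_te_r, y_tr_r, y_te_r = train_test_split(
    X_r, y_r, test_size=0.2, random_state=42
)

pred_r = LinearRegression().fit(X_tr_r, y_tr_r).predict(X_te_r)

# R² by hand: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_te_r - pred_r) ** 2)
ss_tot = np.sum((y_te_r - y_te_r.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

# Baseline that always predicts the training-set mean
baseline = DummyRegressor(strategy="mean").fit(X_tr_r, y_tr_r)
baseline_r2 = r2_score(y_te_r, baseline.predict(X_te_r))

print(f"Manual R²:   {manual_r2:.3f}")
print(f"sklearn R²:  {r2_score(y_te_r, pred_r):.3f}")
print(f"Baseline R²: {baseline_r2:.3f}")
```

The baseline lands at or slightly below zero because it uses the training mean, which is rarely exactly the test mean.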

In [None]:
# Visualize predictions vs actual
plt.figure(figsize=(10, 6))
plt.scatter(y_test_reg, y_pred_reg, alpha=0.6, s=100)
plt.plot([y_test_reg.min(), y_test_reg.max()],
         [y_test_reg.min(), y_test_reg.max()],
         'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Regression: Predicted vs Actual')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("\nPoints closer to the red line = better predictions")
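
The vertical distance of each point from the red line is its residual (actual minus predicted), and the error metrics are summaries of those residuals. As a rough sketch, the code below recomputes RMSE and mean absolute error (MAE) directly from the residuals. MAE is not used elsewhere in this notebook and is included only for comparison. The `_e` suffixed names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

X_e, y_e = load_diabetes(return_X_y=True)
X_tr_e, X_te_e, y_tr_e, y_te_e = train_test_split(
    X_e, y_e, test_size=0.2, random_state=42
)
pred_e = LinearRegression().fit(X_tr_e, y_tr_e).predict(X_te_e)

# Residual = actual - predicted (the gap between a point and the diagonal)
residuals = y_te_e - pred_e

rmse_manual = np.sqrt(np.mean(residuals ** 2))  # penalizes large errors more
mae_manual = np.mean(np.abs(residuals))         # average absolute miss

print(f"RMSE: {rmse_manual:.2f}")
print(f"MAE:  {mae_manual:.2f}")
```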

## Key Concepts Summary

### Classification vs Regression

| Aspect | Classification | Regression |
|--------|---------------|------------|
| **Output** | Categories (discrete) | Numbers (continuous) |
| **Example** | Spam/Not Spam | House Price |
| **Metrics** | Accuracy, F1-Score | MSE, R² |
| **Algorithms** | Logistic Regression, Decision Trees | Linear Regression, SVR |

### ML Workflow
1. **Load Data** → Understand your dataset
2. **Split Data** → Train/Test sets
3. **Train Model** → Learn patterns
4. **Evaluate** → Measure performance
5. **Predict** → Use on new data

### Key Takeaways
- Machine learning models learn patterns from data
- Always split data to evaluate on unseen examples
- Different algorithms have different strengths
- Visualization helps understand data and results

## Exercise

Try modifying the code:
1. Change the train/test split ratio (e.g., 70/30)
2. Try different features from the Iris dataset
3. Experiment with model parameters
4. Create visualizations for other feature combinations
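
As a possible starting point for exercises 1 to 3, the sketch below combines a 70/30 split, a two-feature subset (petal length and width), and an explicit regularization setting. The specific choices (`petal_cols`, `C=1.0`, the `_x` suffixed names) are illustrative assumptions, not recommended values.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris_x = load_iris()

# Exercise 2: keep only petal length and petal width (columns 2 and 3)
petal_cols = [2, 3]
X_petal = iris_x.data[:, petal_cols]

# Exercise 1: 70% train, 30% test
X_tr_x, X_te_x, y_tr_x, y_te_x = train_test_split(
    X_petal, iris_x.target, test_size=0.3, random_state=42
)

# Exercise 3: C controls regularization strength (smaller C = stronger)
clf_x = LogisticRegression(max_iter=200, C=1.0).fit(X_tr_x, y_tr_x)
exercise_accuracy = clf_x.score(X_te_x, y_te_x)

print(f"Train/test sizes: {len(X_tr_x)}/{len(X_te_x)}")
print(f"Accuracy with petal features only: {exercise_accuracy:.2%}")
```

Try changing `petal_cols` to `[0, 1]` (the sepal measurements) or varying `C` to see how the accuracy responds.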

## Next Steps
- **Next Notebook:** Generative AI Demo with HuggingFace
- **Module 2:** Mathematical Foundations
- **Module 3:** Neural Networks from Scratch