# 🌸 Your First ML Model in Jupyter Notebook

**Welcome to local machine learning development!** This notebook will guide you through building your first ML model on your own computer.

**What you'll build**: A flower species classifier that can identify iris flowers from their measurements.

**Advantages of local development**:
- 💻 Full control over your environment
- 📁 Easy access to local files
- 🔒 Complete privacy for your data
- ⚡ No internet required after setup

**How to use this notebook**:
1. **Install the required packages**: `pip install numpy pandas matplotlib scikit-learn seaborn` (`joblib`, used later to save the model, is installed as a scikit-learn dependency)
2. **Run each cell**: Press Shift + Enter in each code cell, from top to bottom
3. **Save your work**: Press Ctrl + S (Cmd + S on macOS) or use File → Save to save locally

**Time needed**: 15-20 minutes

---

## Step 0: Verify Your Setup ✅

Let's make sure everything is installed correctly:

In [None]:
# Verify your Jupyter setup
import sys
import os
from pathlib import Path

print("🔍 Checking your setup...")
print(f"🐍 Python version: {sys.version}")
print(f"📍 Python location: {sys.executable}")
print(f"📁 Current directory: {os.getcwd()}")
print(f"💻 Operating system: {sys.platform}")  # e.g. 'linux', 'win32', 'darwin'

# Check if we can import required packages
required_packages = ['numpy', 'pandas', 'matplotlib', 'sklearn', 'seaborn']
missing_packages = []

for package in required_packages:
    try:
        __import__(package)
        print(f"✅ {package} is installed")
    except ImportError:
        print(f"❌ {package} is missing")
        missing_packages.append(package)

if missing_packages:
    print("\n🚨 Please install missing packages:")
    print(f"pip install {' '.join(missing_packages)}")
else:
    print("\n🎉 All packages are installed! You're ready to go!")
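
If an import succeeds but something later behaves oddly, it can help to know which versions you have. The sketch below is an optional extra and not part of the original notebook; the helper name `installed_versions` is made up for illustration. It uses the standard library's `importlib.metadata` and the pip distribution names, so `scikit-learn` appears here rather than `sklearn`.

```python
# Optional: report the installed version of each required package.
from importlib.metadata import PackageNotFoundError, version

# pip distribution names; "scikit-learn" is the package you import as "sklearn".
distributions = ["numpy", "pandas", "matplotlib", "scikit-learn", "seaborn"]


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(distributions)
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```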

## Step 1: Import Libraries 📚

Now let's import all the tools we need for machine learning:

In [None]:
# Import the tools we need for machine learning
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import joblib  # For saving models locally
from datetime import datetime

# Set up plotting for Jupyter
%matplotlib inline
plt.style.use('default')
sns.set_palette("husl")

print("✅ All libraries imported successfully!")
print("🚀 Ready to build your first ML model locally!")
print(f"⏰ Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## Step 2: Load and Explore the Data 🔍

Let's load the famous iris flower dataset and see what we're working with:

In [None]:
# Load the famous iris flower dataset
print("🌸 Loading the Iris dataset...")
data = load_iris()

# Convert to a pandas DataFrame for easier handling
df = pd.DataFrame(data.data, columns=data.feature_names)
df['species'] = data.target_names[data.target]

# Display basic information
print(f"📊 Dataset shape: {df.shape[0]} flowers, {df.shape[1]-1} measurements")
print(f"🏷️  Species: {', '.join(data.target_names)}")
print(f"📏 Measurements: {', '.join(data.feature_names)}")

# Show the first few flowers
print("\n🔍 First 5 flowers in our dataset:")
display(df.head())

# Show basic statistics
print("\n📈 Basic statistics:")
display(df.describe())

# Check for any missing data
print(f"\n❓ Missing values: {df.isnull().sum().sum()}")
print("✅ Dataset is clean and ready!")

## Step 3: Visualize the Data 📈

Let's create beautiful visualizations to understand our data better:

In [None]:
# Create beautiful visualizations to understand our data
print("📈 Creating data visualizations...")

# Set up a 2x2 grid of plots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# Emoji left out of the title: Matplotlib's default font usually has no glyph for it
fig.suptitle('Iris Dataset Analysis', fontsize=16, fontweight='bold')

# Plot 1: Sepal measurements by species
for i, species in enumerate(data.target_names):
    species_data = df[df['species'] == species]
    axes[0, 0].scatter(species_data['sepal length (cm)'], species_data['sepal width (cm)'], 
                      label=species, alpha=0.7, s=50)
axes[0, 0].set_xlabel('Sepal Length (cm)')
axes[0, 0].set_ylabel('Sepal Width (cm)')
axes[0, 0].set_title('Sepal Measurements by Species')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Petal measurements by species  
for i, species in enumerate(data.target_names):
    species_data = df[df['species'] == species]
    axes[0, 1].scatter(species_data['petal length (cm)'], species_data['petal width (cm)'], 
                      label=species, alpha=0.7, s=50)
axes[0, 1].set_xlabel('Petal Length (cm)')
axes[0, 1].set_ylabel('Petal Width (cm)')
axes[0, 1].set_title('Petal Measurements by Species')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

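# Plot 3: Distribution of measurements, overlaid on one axis
# (DataFrame.hist would clear the whole figure when given a single axis for four columns)
df[data.feature_names].plot.hist(bins=20, ax=axes[1, 0], alpha=0.6)
axes[1, 0].set_xlabel('Measurement (cm)')
axes[1, 0].set_title('Distribution of All Measurements')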

# Plot 4: Species count
species_counts = df['species'].value_counts()
bars = axes[1, 1].bar(species_counts.index, species_counts.values, 
                     color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
axes[1, 1].set_title('Number of Flowers per Species')
axes[1, 1].set_ylabel('Count')
# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    axes[1, 1].text(bar.get_x() + bar.get_width()/2., height + 0.5,
                    f'{int(height)}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

print("📊 Key observations:")
print("• Each species has distinct petal characteristics")
print("• Setosa has the smallest petals")
print("• Virginica has the largest petals")
print("• This should make classification easier!")
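
The observations above come from looking at the plots. A quick way to back them up with numbers is to average the petal measurements per species. This is a small sketch that reloads the dataset under its own names (`iris`, `flowers`, `petal_means`, all chosen for illustration) so it does not disturb the notebook's variables.

```python
# Compare average petal size per species to confirm what the scatter plots suggest.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
flowers = pd.DataFrame(iris.data, columns=iris.feature_names)
flowers["species"] = iris.target_names[iris.target]

# Mean petal length and width for each species
petal_means = flowers.groupby("species")[["petal length (cm)", "petal width (cm)"]].mean()
print(petal_means.round(2))
```

Setosa should come out with the smallest averages and virginica with the largest.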

## Step 4: Prepare Data for Machine Learning 🔧

Now let's split our data into training and testing sets:

In [None]:
# Prepare our data for training a machine learning model
print("🔧 Preparing data for machine learning...")

# Separate features (measurements) from target (species)
X = data.data  # Features: sepal length, sepal width, petal length, petal width
y = data.target  # Target: species (0=setosa, 1=versicolor, 2=virginica)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.3,      # Use 30% for testing
    random_state=42,    # For reproducible results
    stratify=y          # Keep each species' share the same in both sets
)

print(f"📚 Training set: {X_train.shape[0]} flowers")
print(f"🧪 Testing set: {X_test.shape[0]} flowers")
print(f"📊 Features per flower: {X_train.shape[1]}")

# Show the split by species
train_species = pd.Series(y_train).value_counts().sort_index()
test_species = pd.Series(y_test).value_counts().sort_index()

print("\n🌸 Training set by species:")
for i, species in enumerate(data.target_names):
    print(f"  {species}: {train_species[i]} flowers")

print("\n🧪 Testing set by species:")
for i, species in enumerate(data.target_names):
    print(f"  {species}: {test_species[i]} flowers")

# Save the data splits locally (Jupyter advantage!)
train_df = pd.DataFrame(X_train, columns=data.feature_names)
train_df['species'] = y_train
test_df = pd.DataFrame(X_test, columns=data.feature_names)
test_df['species'] = y_test

train_df.to_csv('iris_train.csv', index=False)
test_df.to_csv('iris_test.csv', index=False)
print("\n💾 Data splits saved to iris_train.csv and iris_test.csv")
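
To see what `stratify` actually changes, you can compare the test-set species counts with and without it. The sketch below is illustrative only; it uses its own variable names (`X_all`, `y_all`, `plain_counts`, `strat_counts`) so the splits made above stay untouched.

```python
# Compare species counts in the test set with and without stratification.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)

# Same seed and test size, but no stratification
_, _, _, y_test_plain = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42
)
# Stratified on the species labels, as in the notebook
_, _, _, y_test_strat = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42, stratify=y_all
)

plain_counts = np.bincount(y_test_plain, minlength=3)
strat_counts = np.bincount(y_test_strat, minlength=3)
print("Without stratify:", plain_counts)
print("With stratify:   ", strat_counts)
```

Because the full dataset has 50 flowers of each species, the stratified test set of 45 flowers contains 15 of each; the unstratified one is free to drift from that balance.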

## Step 5: Train the Machine Learning Model 🤖

Time to create and train our AI model!

In [None]:
# Create and train our machine learning model
print("🤖 Training the machine learning model...")

# Create a Random Forest classifier
model = RandomForestClassifier(
    n_estimators=100,    # Use 100 decision trees
    random_state=42,     # For reproducible results
    max_depth=3          # Limit tree depth to reduce overfitting
)

# Train the model on our training data
start_time = datetime.now()
model.fit(X_train, y_train)
training_time = (datetime.now() - start_time).total_seconds()

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print("🎯 Model trained successfully!")
print(f"⏱️  Training time: {training_time:.3f} seconds")
print(f"📈 Accuracy on test set: {accuracy:.1%}")

# Show which features are most important
feature_importance = pd.DataFrame({
    'feature': data.feature_names,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\n🔍 Most important features for classification:")
for _, row in feature_importance.iterrows():
    print(f"  {row['feature']}: {row['importance']:.3f}")

# Save feature importance to CSV (local advantage!)
feature_importance.to_csv('feature_importance.csv', index=False)
print("\n💾 Feature importance saved to feature_importance.csv")
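
The accuracy printed above comes from a single 70/30 split, so it can move a little if the split changes. One common cross-check, which is not part of the original notebook, is k-fold cross-validation with scikit-learn's `cross_val_score`. The sketch below assumes 5 folds and the same model settings; `cv_model` and `cv_scores` are illustrative names.

```python
# Estimate accuracy over 5 different train/test partitions instead of one.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_all, y_all = load_iris(return_X_y=True)
cv_model = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=3)

# Each fold trains on 4/5 of the data and scores on the remaining 1/5
cv_scores = cross_val_score(cv_model, X_all, y_all, cv=5)
print("Accuracy per fold:", cv_scores.round(3))
print(f"Mean accuracy: {cv_scores.mean():.1%} (std {cv_scores.std():.1%})")
```

If the mean is close to your single-split accuracy, the score you saw above is probably representative.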

## Step 6: Evaluate Model Performance 📊

Let's see how well our model performs in detail:

In [None]:
# Evaluate how well our model performs
print("📊 Evaluating model performance...")

# Detailed classification report
print("\n📋 Detailed Classification Report:")
report = classification_report(y_test, y_pred, target_names=data.target_names, output_dict=True)
print(classification_report(y_test, y_pred, target_names=data.target_names))

# Save classification report to CSV
report_df = pd.DataFrame(report).transpose()
report_df.to_csv('classification_report.csv')
print("💾 Classification report saved to classification_report.csv")

# Create a confusion matrix to see where the model makes mistakes
cm = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=data.target_names, 
            yticklabels=data.target_names,
            cbar_kws={'label': 'Number of Flowers'})
# Emoji left out of the title: Matplotlib's default font usually has no glyph for it
plt.title('Confusion Matrix: Actual vs Predicted Species', fontsize=14, fontweight='bold')
plt.xlabel('Predicted Species', fontsize=12)
plt.ylabel('Actual Species', fontsize=12)
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=300, bbox_inches='tight')
plt.show()
print("💾 Confusion matrix saved to confusion_matrix.png")

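# Calculate per-species accuracy: the share of each species' test flowers
# that the model labelled correctly (this is the "recall" column in the report)
print("\n🌸 Accuracy by species (recall):")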
species_accuracy = {}
for i, species in enumerate(data.target_names):
    species_mask = y_test == i
    if species_mask.sum() > 0:
        species_acc = (y_pred[species_mask] == i).mean()
        species_accuracy[species] = species_acc
        print(f"  {species}: {species_acc:.1%}")

# Show any misclassifications
misclassified = X_test[y_test != y_pred]
if len(misclassified) > 0:
    print(f"\n❌ Misclassified flowers: {len(misclassified)}")
    print("These are the flowers our model got wrong - let's learn from them!")
    
    # Save misclassified examples
    misclassified_df = pd.DataFrame(misclassified, columns=data.feature_names)
    misclassified_df['actual'] = data.target_names[y_test[y_test != y_pred]]
    misclassified_df['predicted'] = data.target_names[y_pred[y_test != y_pred]]
    misclassified_df.to_csv('misclassified_flowers.csv', index=False)
    print("💾 Misclassified examples saved to misclassified_flowers.csv")
else:
    print("\n🎉 Perfect classification! No mistakes on the test set!")
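
The per-species accuracy computed above is the same quantity scikit-learn calls recall, and it can be read straight off the confusion matrix: each diagonal entry divided by its row total. Here is a tiny sketch on made-up labels (`y_true_toy` and `y_pred_toy` are invented for illustration, not taken from the iris results).

```python
# Per-class recall from a confusion matrix, checked against recall_score.
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels: one class-1 flower is mistaken for class 2
y_true_toy = np.array([0, 0, 1, 1, 2, 2])
y_pred_toy = np.array([0, 0, 1, 2, 2, 2])

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
# Rows are actual classes, so row sums count the flowers of each class
per_class = cm_toy.diagonal() / cm_toy.sum(axis=1)

print(cm_toy)
print("From the matrix:  ", per_class)
print("From recall_score:", recall_score(y_true_toy, y_pred_toy, average=None))
```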

## Step 7: Make Predictions on New Flowers 🔮

Now for the exciting part - let's use our model to predict new flower species!

In [None]:
# Use our trained model to predict new flower species
print("🔮 Making predictions on new flowers...")

# Create some example new flowers to classify
new_flowers = np.array([
    [5.1, 3.5, 1.4, 0.2],  # Small petals - likely Setosa
    [6.2, 2.8, 4.8, 1.8],  # Large petals - likely Virginica  
    [5.7, 2.8, 4.1, 1.3],  # Medium petals - likely Versicolor
    [4.9, 3.1, 1.5, 0.1],  # Very small petals - likely Setosa
    [7.2, 3.0, 5.8, 1.6]   # Very large petals - likely Virginica
])

# Make predictions
predictions = model.predict(new_flowers)
probabilities = model.predict_proba(new_flowers)

print("\n🌸 Prediction Results:")
print("=" * 60)

# Create a results DataFrame
results_data = []

for i, (flower, pred, prob) in enumerate(zip(new_flowers, predictions, probabilities)):
    species = data.target_names[pred]
    confidence = prob.max()
    
    print(f"\n🌺 Flower #{i+1}:")
    print(f"   Measurements: {flower}")
    print(f"   Predicted species: {species}")
    print(f"   Confidence: {confidence:.1%}")
    
    # Store results for CSV
    result_row = {
        'flower_id': i+1,
        'sepal_length': flower[0],
        'sepal_width': flower[1],
        'petal_length': flower[2],
        'petal_width': flower[3],
        'predicted_species': species,
        'confidence': confidence
    }
    
    # Add probabilities for each species
    for j, (species_name, probability) in enumerate(zip(data.target_names, prob)):
        result_row[f'prob_{species_name.lower()}'] = probability
    
    results_data.append(result_row)
    
    # Show probability for each species
    print("   Probabilities:")
    for j, (species_name, probability) in enumerate(zip(data.target_names, prob)):
        emoji = "🎯" if j == pred else "  "
        print(f"     {emoji} {species_name}: {probability:.1%}")

# Save predictions to CSV (local advantage!)
predictions_df = pd.DataFrame(results_data)
predictions_df.to_csv('new_flower_predictions.csv', index=False)
print("\n💾 Predictions saved to new_flower_predictions.csv")
print("\n✨ Amazing! Your model can now identify iris species from measurements!")
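
A note on the "confidence" values: they are the largest entry in each row of `predict_proba`. For a random forest this is the class probability averaged over the trees, so treat it as a rough score rather than a calibrated certainty. The sketch below, with illustrative names (`forest`, `sample`, `label_from_proba`), shows that each row sums to 1 and that `predict` simply picks the class with the highest probability.

```python
# Relationship between predict_proba and predict for a random forest.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_all, y_all = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=100, random_state=42, max_depth=3
).fit(X_all, y_all)

sample = np.array([[5.7, 2.8, 4.1, 1.3]])  # one flower, four measurements
proba = forest.predict_proba(sample)

# The predicted label is the class with the highest averaged probability
label_from_proba = forest.classes_[proba.argmax(axis=1)]

print("Probabilities:", proba.round(3))
print("Row sum:", proba.sum())
print("argmax label:", label_from_proba, "| predict():", forest.predict(sample))
```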

## 💻 Jupyter Notebook Special Features

Since you're using Jupyter locally, let's explore some unique advantages:

In [None]:
# Save your trained model locally (Jupyter's superpower!)
model_filename = f'iris_classifier_{datetime.now().strftime("%Y%m%d_%H%M%S")}.pkl'
joblib.dump(model, model_filename)
print(f"💾 Model saved as: {model_filename}")
print(f"📁 File size: {os.path.getsize(model_filename) / 1024:.1f} KB")

# Show all files created in this session
print("\n📂 Files created in this session:")
created_files = [
    'iris_train.csv', 'iris_test.csv', 'feature_importance.csv',
    'classification_report.csv', 'confusion_matrix.png', 
    'new_flower_predictions.csv', model_filename
]

for file in created_files:
    if os.path.exists(file):
        size = os.path.getsize(file)
        print(f"  📄 {file} ({size:,} bytes)")

print("\n💡 Advantages of local development:")
print("  ✅ All files saved to your computer")
print("  ✅ No internet required after setup")
print("  ✅ Full control over your environment")
print("  ✅ Easy integration with other local tools")
print("  ✅ Complete privacy for your data")

In [None]:
# Load and test the saved model (demonstrating persistence)
print("🔄 Testing model persistence...")

# Load the model back from disk
loaded_model = joblib.load(model_filename)

# Test it on a new flower
test_flower = np.array([[5.0, 3.0, 1.6, 0.2]])
prediction = loaded_model.predict(test_flower)
probability = loaded_model.predict_proba(test_flower)

print(f"✅ Model loaded successfully from {model_filename}")
print(f"🌸 Test prediction: {data.target_names[prediction[0]]}")
print(f"🎯 Confidence: {probability.max():.1%}")
print("\n💡 Your model is now saved and can be used anytime!")
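
A single test flower is a light check. If you want more assurance that saving and loading changed nothing, you can compare predictions on the whole dataset. This is a sketch under a few assumptions: it trains its own small model (`original`), writes to a temporary folder so it leaves no files behind, and the file name `roundtrip_check.pkl` is made up for illustration.

```python
# Round-trip check: a model loaded from disk should predict exactly like the original.
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_all, y_all = load_iris(return_X_y=True)
original = RandomForestClassifier(
    n_estimators=100, random_state=42, max_depth=3
).fit(X_all, y_all)

# Save and reload inside a temporary directory that is removed afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "roundtrip_check.pkl"
    joblib.dump(original, path)
    restored = joblib.load(path)

same_predictions = np.array_equal(original.predict(X_all), restored.predict(X_all))
print("Identical predictions after reload:", same_predictions)
```

Two practical cautions: only load `.pkl` files you created or trust, and expect that a model saved with one scikit-learn version may not load cleanly in another.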

In [None]:
# Create a summary report
summary = {
    'experiment_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'dataset_size': len(df),
    'training_size': len(X_train),
    'test_size': len(X_test),
    'model_type': 'RandomForestClassifier',
    'n_estimators': model.n_estimators,
    'accuracy': accuracy,
    'training_time_seconds': training_time,
    'model_file': model_filename,
    'python_version': sys.version,
    'working_directory': os.getcwd()
}

summary_df = pd.DataFrame([summary])
summary_df.to_csv('experiment_summary.csv', index=False)

print("📊 Experiment Summary:")
for key, value in summary.items():
    print(f"  {key}: {value}")

print("\n💾 Summary saved to experiment_summary.csv")
print("🎉 Complete local ML workflow demonstrated!")

## 🎉 Congratulations!

You've successfully built your first machine learning model in Jupyter Notebook!

### What you accomplished:
✅ Set up and verified your local ML environment  
✅ Loaded and explored a dataset of 150 flowers  
✅ Visualized data patterns across 4 features  
✅ Trained a Random Forest model with 100 trees  
✅ Measured accuracy on unseen test data (your exact score is printed in Step 5)  
✅ Made predictions on new flower measurements  
✅ Saved your model and results locally  

### Key ML concepts you learned:
🧠 Data loading and exploration  
🧠 Data visualization and pattern recognition  
🧠 Train/test split for model evaluation  
🧠 Model training and prediction  
🧠 Performance evaluation and interpretation  
🧠 Model persistence and reuse  

### Local development advantages you experienced:
💻 Full control over your environment  
💻 Results saved directly to files on your computer  
💻 No internet dependency after setup  
💻 Easy integration with local tools  
💻 Complete data privacy  

### Files created in this session:
📄 `iris_train.csv` - Training data  
📄 `iris_test.csv` - Test data  
📄 `feature_importance.csv` - Feature analysis  
📄 `classification_report.csv` - Model performance  
📄 `confusion_matrix.png` - Visualization  
📄 `new_flower_predictions.csv` - Prediction results  
📄 `iris_classifier_[timestamp].pkl` - Trained model  
📄 `experiment_summary.csv` - Complete summary  

### Next steps:
🚀 Try different algorithms (SVM, Neural Networks)  
🚀 Work with your own datasets  
🚀 Learn feature engineering and data preprocessing  
🚀 Build models for regression and clustering  
🚀 Explore Jupyter extensions and widgets  

### Resources:
📚 [Complete ML Guide](../04-first-ml-example.md)  
📚 [Next Steps](../05-next-steps.md)  
📚 [Google Colab Version](colab-sample.ipynb)  
📚 [Python Script Version](python-sample.py)  

---

**Remember**: You now have a complete local ML development setup! The same concepts work in Google Colab and Python IDEs too. You have transferable skills across all ML environments. 🌟

**Happy local learning!** 🎓✨