
# Automated Machine Learning Pipeline

This notebook implements an end-to-end ML pipeline on the Iris dataset: dataset loading, preprocessing, model training and evaluation, model saving, and unit tests.

---

## Step 1: Dataset Loading

We will use the Iris dataset bundled with scikit-learn (150 samples, 4 numeric features, 3 classes) for this example.


In [2]:

# Import necessary libraries
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['target'] = iris.target

# Display the first few rows
data.head()


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
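Before preprocessing, it can help to confirm whether the dataset has any missing values at all and how the classes are distributed. The following is a minimal sketch of such a check; the variable names `missing_per_column` and `class_counts` are illustrative and do not appear elsewhere in this notebook.

```python
from sklearn.datasets import load_iris
import pandas as pd

# Rebuild the same DataFrame as in the cell above
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['target'] = iris.target

# Number of missing values in each column
missing_per_column = data.isna().sum()

# Number of samples in each class, ordered by class label
class_counts = data['target'].value_counts().sort_index()

print(missing_per_column)
print(class_counts)
```

If every count in `missing_per_column` is zero, the imputation step in the next section is a no-op for this dataset.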



## Step 2: Data Preprocessing

We will fill any missing feature values with the column mean, standardize the features to zero mean and unit variance, and split the data into training (80%) and test (20%) sets.


In [3]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

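# Preprocess the data: impute, standardize, and split into train/test sets
def preprocess_data(df):
    # Separate features and target by name rather than by column position
    X = df.drop(columns='target')
    y = df['target']

    # Fill missing feature values with the column mean (the target is left untouched)
    X = X.fillna(X.mean())

    # Standardize the features to zero mean and unit variance.
    # Note: the scaler is fit on all rows before the split, so test-set
    # statistics influence the scaling of the training data.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Hold out 20% of the rows for testing; random_state makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test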

# Preprocess the data
X_train, X_test, y_train, y_test = preprocess_data(data)
X_train[:5]  # Display the first few rows of the preprocessed data


array([[-1.50652052,  1.24920112, -1.56757623, -1.3154443 ],
       [-0.17367395,  3.09077525, -1.2833891 , -1.05217993],
       [ 1.03800476,  0.09821729,  0.36489628,  0.26414192],
       [-1.26418478,  0.78880759, -1.22655167, -1.3154443 ],
       [-1.74885626,  0.32841405, -1.39706395, -1.3154443 ]])
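`preprocess_data` fits the scaler on all rows and only then splits them, which lets test-set statistics leak into the training features. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that ordering; it is not the method used in this notebook, its scaled values will differ slightly from the output above, and the names `X_train_raw`, `X_test_raw`, `X_train_scaled`, and `X_test_scaled` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Split first so the test rows never influence the scaler
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_raw)

# Reuse the training statistics to transform the test rows
X_test_scaled = scaler.transform(X_test_raw)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this ordering the training features have exactly zero mean and unit variance, while the test features are only approximately standardized, which mirrors what happens when the model later sees new data.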


## Step 3: Model Training

We will train a Decision Tree classifier on the training set and measure its accuracy on the held-out test set.


In [4]:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train the model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")


Model Accuracy: 1.00
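The score above comes from a single test set of 30 samples, so it may be optimistic. One way to get a more stable estimate is k-fold cross-validation. The sketch below assumes 5 stratified folds, a choice made here for illustration rather than taken from this notebook. The scaler is omitted because decision tree splits are unaffected by per-feature scaling.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stratified folds keep the class proportions the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Train and score a fresh tree on each of the 5 train/validation splits
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=cv)

print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```

The spread across folds gives a rough sense of how much the accuracy depends on which rows happen to land in the test set.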



## Step 4: Saving the Model

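We will save the trained model to disk using `joblib`. Only the classifier is saved; the `StandardScaler` fitted inside `preprocess_data` is not returned by that function, so it is not persisted.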


In [5]:

import joblib

# Save the model
joblib.dump(model, 'decision_tree_model.pkl')
print("Model saved as 'decision_tree_model.pkl'")


Model saved as 'decision_tree_model.pkl'
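A saved model is only useful if it can be loaded back and still produces the same predictions. The following sketch shows a round trip with `joblib.dump` and `joblib.load`. To stay self-contained it trains a fresh tree on the unscaled Iris features and writes to a temporary directory, both of which are assumptions made for this example rather than steps from this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=42).fit(X, y)

# Write the model to a temporary directory and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'decision_tree_model.pkl')
    joblib.dump(model, model_path)
    loaded_model = joblib.load(model_path)

# The reloaded model should predict exactly what the original predicts
predictions_match = np.array_equal(model.predict(X), loaded_model.predict(X))
print(f"Predictions match after reload: {predictions_match}")
```

Because `joblib` files are pickle-based, loading one can execute arbitrary code, so only load files from sources you trust.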



## Step 5: Unit Tests

We will write and execute unit tests to verify the preprocessing function and model performance.


In [6]:

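# Unit test for data preprocessing
def test_preprocess_data():
    X_train, X_test, y_train, y_test = preprocess_data(data)
    assert X_train.shape[0] > 0, "Training data is empty!"
    assert X_test.shape[0] > 0, "Testing data is empty!"
    # Features and labels must stay aligned row for row
    assert X_train.shape[0] == len(y_train), "Training features and labels differ in length!"
    assert X_test.shape[0] == len(y_test), "Testing features and labels differ in length!"
    print("Preprocessing test passed!")

# Unit test for model performance
def test_model_performance():
    # Recompute accuracy from the global model and test set instead of reusing the Step 3 variable
    test_accuracy = accuracy_score(y_test, model.predict(X_test))
    assert test_accuracy > 0.8, "Model accuracy is not above 80%!"
    print("Model performance test passed!")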

# Run the tests
test_preprocess_data()
test_model_performance()


Preprocessing test passed!
Model performance test passed!
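The tests above never exercise the missing-value handling, because the Iris dataset has no gaps. A sketch of a test that injects a missing value and checks the mean imputation is shown below. `impute_feature_means` is a hypothetical helper written for this example; it mirrors the imputation idea from Step 2 but is not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

def impute_feature_means(df, target_column='target'):
    """Hypothetical helper: fill NaNs in feature columns with the column mean."""
    df = df.copy()
    feature_columns = df.columns.drop(target_column)
    df[feature_columns] = df[feature_columns].fillna(df[feature_columns].mean())
    return df

def test_imputation_fills_missing_values():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['target'] = iris.target

    # Inject one missing value into the first row
    df.loc[0, 'sepal length (cm)'] = np.nan
    # pandas skips NaN when averaging, so this is the mean of the remaining rows
    expected_fill = df['sepal length (cm)'].mean()

    result = impute_feature_means(df)
    assert not result.isna().any().any(), "Missing values remain!"
    assert np.isclose(result.loc[0, 'sepal length (cm)'], expected_fill), "Wrong fill value!"
    assert (result['target'] == df['target']).all(), "Target column was modified!"
    print("Imputation test passed!")

test_imputation_fills_missing_values()
```

In a project outside a notebook, tests like this would typically live in a separate file and be collected by a test runner such as `pytest`.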
