**Automated Machine Learning Workflow**


Description:

This project provides a streamlined machine learning workflow that automates data preprocessing, model selection, training, and evaluation. It supports both regression and classification tasks, allowing users to easily compare the performance of different models.
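As a rough illustration of what comparing models can look like, the sketch below scores three classifiers with 5-fold cross-validation. It is not part of the project script: the synthetic dataset from `make_classification`, the `candidates` dictionary, and the use of cross-validation instead of a single train/test split are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a real dataset (assumption for illustration only)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Hypothetical set of candidate models, keyed by the names the script uses
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "linear": LogisticRegression(max_iter=1000),
    "svm": SVC(kernel="rbf"),
}

scores = {}
for name, estimator in candidates.items():
    # Scaling lives inside the pipeline so it is refit on each training fold
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores[name] = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy").mean()

# Print models from best to worst mean accuracy
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Keeping the scaler inside the pipeline avoids fitting it on data that is later used for validation.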

Functionality:

1. Data Preprocessing: Handles missing values, encodes categorical variables, and scales/normalizes data.
2. Model Selection: Offers random forest, linear (linear regression or logistic regression), ridge, lasso, and support vector machine models.
3. Model Training and Evaluation: Trains the selected model on an 80/20 train/test split and evaluates it with MSE, RMSE, and R² for regression, or accuracy, precision, recall, and F1-score for classification.
4. Result Visualization: Displays evaluation metrics and visualizations (e.g., confusion matrices, ROC curves) for easy model comparison.
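The preprocessing step could be sketched as follows. This is one possible approach, not necessarily the project's: `build_preprocessor` is a hypothetical helper, the toy DataFrame is made up, and median/most-frequent imputation with one-hot encoding are assumptions (the script in this notebook uses label encoding).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(features):
    """Build a ColumnTransformer for a DataFrame of features (hypothetical helper)."""
    numeric_cols = features.select_dtypes(include="number").columns.tolist()
    categorical_cols = [c for c in features.columns if c not in numeric_cols]

    # Numeric columns: fill gaps with the median, then standardize
    numeric_steps = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    # Categorical columns: fill gaps with the most frequent value, then one-hot encode
    categorical_steps = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    # sparse_threshold=0 forces a dense array as output
    return ColumnTransformer(
        [("num", numeric_steps, numeric_cols), ("cat", categorical_steps, categorical_cols)],
        sparse_threshold=0,
    )


# Toy data with one missing value per column type (made up for illustration)
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Paris", "Lima", np.nan, "Paris"],
})
preprocessor = build_preprocessor(toy)
transformed = preprocessor.fit_transform(toy)
print(transformed.shape)
```

In practice the preprocessor would be fitted on the training split only and then applied to the test split.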


Usage Guidelines:

1. Install required libraries: pandas, scikit-learn, numpy, and matplotlib.
2. Prepare your dataset: Ensure it's in CSV format and contains a target variable.
3. Run the script: Execute the Python script and enter the dataset path when prompted.
4. Select task: Choose regression or classification, then enter the target variable name.
5. Select model: Pick from the available models.
6. Review results: Evaluate model performance using metrics and visualizations.

Input Requirements:

1. Dataset path (e.g., data.csv)
2. Target variable name (e.g., target_variable)
3. Task type (regression/classification)
4. Model selection (e.g., linear regression, random forest)
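A minimal sketch of checking these inputs before training is shown below. `validate_inputs`, `VALID_TASKS`, and `VALID_MODELS` are hypothetical names introduced for this example. The accepted task and model values are taken from the `select_model` docstring.

```python
import pandas as pd

# Accepted values, as listed in the select_model docstring
VALID_TASKS = {"regression", "classification"}
VALID_MODELS = {"random_forest", "linear", "ridge", "lasso", "svm"}


def validate_inputs(data, target, task, model_type):
    """Return a list of problems found; an empty list means the inputs look usable."""
    problems = []
    if target not in data.columns:
        problems.append(f"target column '{target}' not found")
    if task not in VALID_TASKS:
        problems.append(f"unknown task '{task}'")
    if model_type not in VALID_MODELS:
        problems.append(f"unknown model '{model_type}'")
    return problems


# Made-up DataFrame standing in for a loaded CSV file
toy = pd.DataFrame({"feature": [1, 2, 3], "target_variable": [0, 1, 0]})
ok = validate_inputs(toy, "target_variable", "classification", "random_forest")
bad = validate_inputs(toy, "missing", "clustering", "random_forest")
print(ok)
print(bad)
```

Returning a list of problems rather than raising on the first one lets the caller report every issue at once.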

Output:

1. Evaluation metrics (MSE, RMSE, and R² for regression; accuracy and a per-class precision, recall, and F1-score report for classification)
2. Visualizations (confusion matrices, ROC curves, etc.)
3. Trained model object for further use or deployment.
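One common way to keep a trained model for later use is to serialize it with `joblib`, which is installed alongside scikit-learn. The sketch below is an assumption about how that might be done, not something the script does: the synthetic data, the file name `model.joblib`, and the temporary directory are all illustrative.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset standing in for real training data (assumption)
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Save the fitted model to disk and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")  # illustrative file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Any scaler or encoder fitted during preprocessing would need to be saved as well, since new data must go through the same transformations before prediction.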


In [1]:
import pandas as pd

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import (
    Lasso, LinearRegression, LogisticRegression, Ridge, RidgeClassifier
)
from sklearn.metrics import mean_squared_error, accuracy_score, r2_score, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.svm import SVR, SVC

In [5]:
def select_model(data, target, model_type, task):
    """
    Train and evaluate a model for a regression or classification task.

    Parameters:
    - data (Pandas DataFrame): Input data
    - target (str): Target variable column name
    - model_type (str): Model type (random_forest, linear, ridge, lasso, svm)
    - task (str): Task type (regression, classification)

    Returns:
    - trained_model (scikit-learn model)
    - evaluation_metrics (dict)
    """
    # Work on a copy so the caller's DataFrame is left unchanged
    data = data.copy()

    # Label-encode every non-numeric column, using a fresh encoder per column
    for col in data.columns:
        if not pd.api.types.is_numeric_dtype(data[col]):
            data[col] = LabelEncoder().fit_transform(data[col])

    # Split data into training and testing sets
    X = data.drop(target, axis=1)
    y = data[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Feature scaling: fit on the training set only, then apply to both sets
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Candidate models per task. Ridge and lasso map to L2- and L1-regularized
    # linear models; 'ridge' is not a valid SVR kernel.
    if task == 'regression':
        models = {
            'random_forest': RandomForestRegressor(n_estimators=100, random_state=42),
            'linear': LinearRegression(),
            'ridge': Ridge(),
            'lasso': Lasso(),
            'svm': SVR(kernel='rbf'),
        }
    elif task == 'classification':
        models = {
            'random_forest': RandomForestClassifier(n_estimators=100, random_state=42),
            'linear': LogisticRegression(max_iter=1000),
            'ridge': RidgeClassifier(),
            # L1-penalized logistic regression as the classification analogue of lasso
            'lasso': LogisticRegression(penalty='l1', solver='saga', max_iter=5000),
            'svm': SVC(kernel='rbf', probability=True),
        }
    else:
        raise ValueError(f"Unknown task: {task!r}")
    if model_type not in models:
        raise ValueError(f"Unknown model type: {model_type!r}")

    # Model training and prediction
    model = models[model_type]
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)

    # Evaluation
    if task == 'regression':
        mse = mean_squared_error(y_test, y_pred)
        evaluation_metrics = {
            'mse': mse,
            'rmse': mse ** 0.5,
            'r2': r2_score(y_test, y_pred)
        }
    else:
        evaluation_metrics = {
            'accuracy': accuracy_score(y_test, y_pred),
            'classification_report': classification_report(y_test, y_pred)
        }

    return model, evaluation_metrics

def main():
    data_path = input("Enter data file path (e.g., data.csv): ")
    data = pd.read_csv(data_path)

    print("Select task:")
    print("1. Regression")
    print("2. Classification")
    task_choice = input("Enter choice (1/2): ")
    if task_choice == '1':
        task = 'regression'
    elif task_choice == '2':
        task = 'classification'
    else:
        print("Invalid choice. Exiting.")
        return

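    target = input("Enter target variable column name: ")
    if target not in data.columns:
        print(f"Column '{target}' not found in the dataset. Exiting.")
        return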

    # Menu number -> (label shown to the user, model_type passed to select_model)
    model_options = {
        '1': ('Random Forest', 'random_forest'),
        '2': ('Linear', 'linear'),
        '3': ('Ridge', 'ridge'),
        '4': ('Lasso', 'lasso'),
        '5': ('SVM', 'svm'),
    }
    print("Select model:")
    for key, (label, _) in model_options.items():
        print(f"{key}. {label}")
    model_choice = input("Enter choice (1/2/3/4/5): ")
    if model_choice not in model_options:
        print("Invalid choice. Exiting.")
        return
    model_type = model_options[model_choice][1]

    model, evaluation_metrics = select_model(data, target, model_type, task)
    print(evaluation_metrics)

if __name__ == "__main__":
    main()

Enter data file path (e.g., data.csv): /content/archive (36).zip
Select task:
1. Regression
2. Classification
Enter choice (1/2): 2
Enter target variable column name: Diagnosis
Select model:
1. Random Forest
2. Linear
3. Ridge
4. Lasso
5. SVM
Enter choice (1/2/3/4/5): 1
{'accuracy': 0.9230769230769231, 'classification_report': '              precision    recall  f1-score   support\n\n           0       1.00      0.90      0.95        20\n           1       0.75      1.00      0.86         6\n\n    accuracy                           0.92        26\n   macro avg       0.88      0.95      0.90        26\nweighted avg       0.94      0.92      0.93        26\n'}
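The report above is returned as a single string, so it prints with escaped newlines inside the dictionary. A sketch of one alternative is to request it as a nested dictionary with `output_dict=True` and pair it with a confusion matrix. The label lists below are reconstructed from the supports and recalls in the printed report (20 samples of class 0 with 18 correct, 6 samples of class 1 with all correct). They are not the actual test rows, which are not shown.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Labels reconstructed from the printed report (not the original test rows):
# 20 true class-0 samples, 2 of them predicted as class 1; 6 true class-1 samples
y_true = [0] * 20 + [1] * 6
y_pred = [0] * 18 + [1] * 2 + [1] * 6

# output_dict=True returns nested dicts keyed by the class label as a string
report = classification_report(y_true, y_pred, output_dict=True)
matrix = confusion_matrix(y_true, y_pred)

print(f"accuracy: {report['accuracy']:.2f}")
for label in ("0", "1"):
    row = report[label]
    print(f"class {label}: precision={row['precision']:.2f} "
          f"recall={row['recall']:.2f} f1={row['f1-score']:.2f}")
print(matrix)  # rows are true classes, columns are predicted classes
```

The dictionary form makes individual metrics easy to log or compare across models, while the string form is better suited to direct printing.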
