
# **01. Installing Packages**

In [None]:
!pip install streamlit pycaret pyngrok



# **02. Creating the Streamlit App (app.py)**


*   Streamlit re-runs the entire script from top to bottom every time the user interacts with the app. **Session state** (`st.session_state`, a dictionary-like object) holds values that must persist between reruns so they are not reset each time. Before using a session state variable, check whether it has already been initialized; this avoids recalculating, retraining, overwriting, or reading undefined values. Typical candidates are uploaded data, processed data, app state, user input, and results.
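
The initialize-once pattern described above can be sketched without Streamlit. This is a minimal illustration, not the app's actual code: a plain `dict` stands in for `st.session_state`, and the names `DEFAULTS` and `init_state` are hypothetical helpers introduced only for this example.

```python
# A plain dict stands in for st.session_state so the pattern runs without Streamlit.
# DEFAULTS and init_state are hypothetical names used only in this sketch.
DEFAULTS = {
    "data": None,
    "best_model": None,
    "trained_models": [],
    "task_type": None,
    "data_uploaded": False,
    "model_trained": False,
}

def init_state(state, defaults=DEFAULTS):
    """Set each key only if it is missing, so values from earlier reruns survive."""
    for key, value in defaults.items():
        if key not in state:
            # Copy list defaults so separate states never share one list object.
            state[key] = list(value) if isinstance(value, list) else value
    return state

state = {}                      # first run: nothing is stored yet
init_state(state)
state["data_uploaded"] = True   # the user uploads a file during this run

init_state(state)               # simulated rerun: existing values are kept
print(state["data_uploaded"])   # True
```

Because `st.session_state` supports `key in st.session_state` and item assignment, the same loop should work on it directly in `app.py`.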



In [None]:
%%writefile app.py
# Streamlit requires a separate Python script (app.py): the %%writefile app.py cell magic (which must be the first line of the cell) saves this cell to disk, and !streamlit run app.py launches it
import streamlit as st
import pandas as pd
import pycaret.classification as clf
import pycaret.regression as reg
import io
import hashlib
from sklearn.preprocessing import LabelEncoder

# Main function manages the workflow of the app
def main():

    st.title("Machine Learning App with PyCaret")

    # Initialize session state variables
    if 'data' not in st.session_state:
        st.session_state.data = None
    if 'best_model' not in st.session_state:
        st.session_state.best_model = None
    if 'trained_models' not in st.session_state:
        st.session_state.trained_models = []
    if 'task_type' not in st.session_state:
        st.session_state.task_type = None
    if 'data_uploaded' not in st.session_state:
        st.session_state.data_uploaded = False
    if 'model_trained' not in st.session_state:
        st.session_state.model_trained = False


    # Data Upload
    handle_data_upload()

    # Preprocessing (only if data is uploaded)
    if st.session_state.data_uploaded:
        handle_preprocessing()
        handle_model_training()

    # Model Evaluation (only if model is trained)
    if st.session_state.model_trained:
        handle_model_evaluation()

def handle_data_upload():
    st.header("Upload Your Dataset")
    uploaded_file = st.file_uploader("Upload CSV, Excel, or JSON file", type=["csv", "xlsx", "json"])

    # Ensures the function only runs if a file has been uploaded
    if uploaded_file is not None:

        try:
            if uploaded_file.name.endswith('.csv'):
                dataset = pd.read_csv(uploaded_file)
            elif uploaded_file.name.endswith('.xlsx'):
                dataset = pd.read_excel(uploaded_file)
            elif uploaded_file.name.endswith('.json'):
                dataset = pd.read_json(uploaded_file)

            st.session_state.data = dataset
            st.session_state.data_uploaded = True
            st.success("Data uploaded successfully!")
            display_data_overview(dataset)

            # Display missing values per column
            st.write("Missing Values per Column:")
            st.write(dataset.isna().sum())

            # Detect task type only when the data is first uploaded
            if 'task_type' not in st.session_state or st.session_state.task_type is None:
                task_type = detect_task_type(dataset)
                st.session_state.task_type = task_type  # Store the task type
                st.write(f"Detected Task Type: {task_type}")

            # Give a unique identifier for the data by hashing the dataset
            current_data_hash = hashlib.sha256(pd.util.hash_pandas_object(st.session_state.data).values).hexdigest()

            # Check if the dataset has changed since the last session
            should_reset_training_info = 'latest_dataset' in st.session_state and st.session_state.latest_dataset != current_data_hash

            # If the dataset has changed, reset the training information to prevent using outdated models
            if should_reset_training_info:
                st.session_state.trained_models = []
                st.session_state.model_trained = False
                st.session_state.best_model = None

        except Exception as e:
            st.error(f"Error reading file: {e}")

def display_data_overview(dataset):
    st.write("Data Overview:")
    st.write(dataset.head())
    st.write("Data Info:")

    # Create an empty text buffer in memory
    buffer = io.StringIO()

    # Redirect output to buffer instead of printing to console
    dataset.info(buf=buffer)

    # Get text from buffer and store it as a string
    info_string = buffer.getvalue()

    # Display the string as a plain-text code block in Streamlit (the info output is not Python code)
    st.code(info_string, language=None)

def handle_preprocessing():
    st.header("Preprocess Your Data (Optional)")

    # Ensure the original data is not modified until the user applies the changes
    dataset = st.session_state.data.copy()

    # Use the task type stored in session state (do not recalculate)
    task_type = st.session_state.task_type  # <-- This ensures task type remains fixed
    st.write(f"Detected Task Type: {task_type}")

    # [Dropping Columns]
    columns_to_drop = st.multiselect("Select columns to drop", dataset.columns[:-1])

    # Drop columns if list is not empty then update the dataframe
    if columns_to_drop:
        dataset = dataset.drop(columns=columns_to_drop)

    # [Handling Missing Values]
    missing_value_option = st.selectbox("Handle missing values", ["None", "Drop rows", "Fill with mean", "Fill with median"])

    if missing_value_option == "Drop rows":
        dataset = dataset.dropna()

    elif missing_value_option == "Fill with mean":
        # Check for missing values then update dataframe
        if dataset.isnull().any().any():
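            # numeric_only=True skips text columns, which would otherwise raise an error in recent pandas versions
            dataset = dataset.fillna(dataset.mean(numeric_only=True))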
        else:
            st.warning("No missing values found to fill with mean.")

    elif missing_value_option == "Fill with median":
        # Check for missing values then update dataframe
        if dataset.isnull().any().any():
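            # numeric_only=True skips text columns, which would otherwise raise an error in recent pandas versions
            dataset = dataset.fillna(dataset.median(numeric_only=True))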
        else:
            st.warning("No missing values found to fill with median.")

    # [Label Encoding]
    if 'label_encoders' not in st.session_state:
        st.session_state.label_encoders = {}  # Store the fitted encoders
    label_encode_cols = st.multiselect("Select columns for Label Encoding", dataset.iloc[:, :-1].select_dtypes(include=['object']).columns)

    for col in label_encode_cols:
        if col not in st.session_state.label_encoders:
            # First time this column is selected: fit a new encoder and keep it
            le = LabelEncoder()
            st.session_state.label_encoders[col] = le
            dataset[col] = st.session_state.label_encoders[col].fit_transform(dataset[col])
        else:
            # Reuse the encoder fitted on an earlier rerun
            dataset[col] = st.session_state.label_encoders[col].transform(dataset[col])

    # [One-Hot Encoding] as explained in the class
    one_hot_encode_cols = st.multiselect("Select columns for One-Hot Encoding", dataset.iloc[:, :-1].select_dtypes(include=['object']).columns)
    dataset = pd.get_dummies(dataset, columns=one_hot_encode_cols, drop_first=True)

    st.write("Processed Data:")
    st.write(dataset.head())

    if st.button("Apply Changes"):
        st.session_state.data = dataset

def detect_task_type(dataset):
    target_column = dataset.columns[-1]
    if dataset[target_column].dtype == 'object' or dataset[target_column].nunique() < 10:
        return "Classification"
    else:
        return "Regression"

def handle_model_training():
    st.header("Train Your Model")
    dataset = st.session_state.data
    target_column = dataset.columns[-1]

    task_type = detect_task_type(dataset)
    st.session_state.task_type = task_type
    st.write(f"Detected Task Type: {task_type}")

    st.write('Training Data: ')
    st.write(st.session_state.data.head())

    # Check if setup has already been run for the current task type and data
    current_data_hash = hashlib.sha256(pd.util.hash_pandas_object(st.session_state.data).values).hexdigest()
    dataset_changed = 'latest_dataset' not in st.session_state or st.session_state.latest_dataset != current_data_hash
    if task_type == "Classification":
        if dataset_changed:
            # Initializes PyCaret’s classification module with the new dataset
            clf.setup(data=dataset, target=target_column, session_id=123)
            # Store the dataset hash to track future changes
            st.session_state.latest_dataset = current_data_hash
        available_models = clf.models()
    elif task_type == "Regression":
        if dataset_changed:
            reg.setup(data=dataset, target=target_column, session_id=123)
            st.session_state.latest_dataset = current_data_hash
        available_models = reg.models()

    # Create a dictionary of model names and their indexes
    model_options = {row['Name']: index for index, row in available_models.iterrows()}
    # User selects the models to compare
    selected_model_names = st.multiselect("Select Models to Compare", options=list(model_options.keys()))  # e.g. Linear Regression, Elastic Net
    # Convert selected model names into their corresponding PyCaret model IDs
    selected_model_ids = [model_options[model_name] for model_name in selected_model_names]  # e.g. lr, en
    # Start training when the user clicks the button
    if st.button("Start Training"):
        # Nothing to train if the user has not picked any model
        if not selected_model_ids:
            st.warning("Select at least one model before training.")
            return
        try:
            if task_type == "Classification":
                trained = clf.compare_models(include=selected_model_ids, n_select=len(selected_model_ids))
                comparison_results = clf.pull()
            elif task_type == "Regression":
                trained = reg.compare_models(include=selected_model_ids, n_select=len(selected_model_ids))
                comparison_results = reg.pull()
            # compare_models returns a single model (not a list) when n_select is 1
            st.session_state.trained_models = trained if isinstance(trained, list) else [trained]
            # Store the best model and display results
            st.session_state.best_model = st.session_state.trained_models[0]
            st.session_state.model_trained = True

            st.success("Model training completed!")
            st.write("Best Model: ", st.session_state.best_model)
            st.write("Comparison Results:")
            st.write(comparison_results)
        except Exception as e:
            st.error(f"Error during training: {e}")

def handle_model_evaluation():
    st.header("Evaluate Your Model")
    #  Retrieve trained models from session state
    model_names = [model.__class__.__name__ for model in st.session_state.trained_models]
    selected_model_name = st.selectbox("Select a model to evaluate", model_names)
    # Find the model from st.session_state.trained_models whose class name matches selected_model_name
    selected_model = next((model for model in st.session_state.trained_models if model.__class__.__name__ == selected_model_name), None)

    if selected_model:
        st.write(f"Model '{selected_model_name}' Evaluation Metrics:")

        try:
            st.write("Predictions:")
            if st.session_state.task_type == "Classification":
                predictions = clf.predict_model(selected_model)
                st.write(predictions.head())

                plot_options = ["auc", "confusion_matrix"]
                plot_type = st.selectbox("Select plot type", plot_options)
                st.write(f"Plot: {plot_type}")
                clf.plot_model(selected_model, plot=plot_type, display_format='streamlit')

            elif st.session_state.task_type == "Regression":
                plot_options = ["residuals", "error"]
                predictions = reg.predict_model(selected_model)
                st.write(predictions.head())

                plot_type = st.selectbox("Select plot type", plot_options)
                st.write(f"Plot: {plot_type}")
                reg.plot_model(selected_model, plot=plot_type, display_format='streamlit')
        except Exception as e:
            st.error(f"Error during evaluation: {e}")

if __name__ == "__main__":
    main()

Overwriting app.py
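
The app resets its training results when the uploaded dataset changes, which it detects by comparing hashes. The sketch below illustrates that idea in isolation. It is a simplified variant, not the app's code: it hashes the raw file bytes with `hashlib` instead of using `pd.util.hash_pandas_object`, it uses a plain `dict` in place of `st.session_state`, and `fingerprint` and `register_dataset` are hypothetical helper names.

```python
import hashlib

def fingerprint(raw_bytes):
    """Return a SHA-256 hex digest that identifies the uploaded content."""
    return hashlib.sha256(raw_bytes).hexdigest()

def register_dataset(state, raw_bytes):
    """Store the fingerprint; clear training results if the content changed."""
    new_hash = fingerprint(raw_bytes)
    # "Changed" means a different fingerprint was stored before; the first
    # upload (nothing stored yet) does not count as a change.
    changed = state.get("latest_dataset") not in (None, new_hash)
    if changed:
        state["trained_models"] = []
        state["best_model"] = None
        state["model_trained"] = False
    state["latest_dataset"] = new_hash
    return changed

# A plain dict stands in for st.session_state (assumption for this sketch).
state = {"trained_models": ["model"], "best_model": "model", "model_trained": True}
print(register_dataset(state, b"a,b\n1,2\n"))  # False: first dataset seen
print(register_dataset(state, b"a,b\n1,2\n"))  # False: same content
print(register_dataset(state, b"a,b\n1,3\n"))  # True: content changed
print(state["model_trained"])                  # False
```

Hashing raw bytes treats any byte-level difference as a new dataset, whereas hashing the parsed DataFrame, as `app.py` does, also reacts to preprocessing applied inside the app.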


**Note:** You may need to run the last two cells more than once before the link works.
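
One possible reason for the retries (an assumption, not something verified for this notebook) is that the tunnel is opened before Streamlit has finished starting. A minimal sketch of a readiness check, using only the standard library, is shown here; `wait_for_port` is a hypothetical helper, and port 8501 is the port used in the cell below.

```python
import socket
import time

def wait_for_port(port, host="localhost", timeout=30.0, interval=0.5):
    """Poll until something accepts TCP connections on host:port, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connection means a server is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not ready yet (for example, connection refused); retry shortly.
            time.sleep(interval)
    return False
```

Calling `wait_for_port(8501)` after the `!streamlit run app.py` line and before `ngrok.connect(...)` would delay the tunnel until the app responds, or return `False` after 30 seconds.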

In [None]:
# Use ngrok to generate a public URL for the app
from pyngrok import ngrok

# Start the Streamlit app in the background and discard its logs and errors
!streamlit run app.py &>/dev/null&

# Set up the Ngrok authentication token (replace the placeholder with your own token; never publish a real token)
ngrok.set_auth_token("YOUR_NGROK_AUTH_TOKEN")

# Create an Ngrok tunnel
public_url = ngrok.connect(addr='8501', proto='http')
print("Your Streamlit app is live at:", public_url)

Your Streamlit app is live at: NgrokTunnel: "https://39e3-35-237-14-48.ngrok-free.app" -> "http://localhost:8501"


In [None]:
# Check running processes and stop Streamlit and Ngrok when needed
!ps aux
!pkill ngrok
!pkill -f "streamlit run app.py"

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   1076     8 ?        Ss   02:28   0:00 /sbin/docker-init -- /datalab/run
root           8  2.8  0.4 905756 62592 ?        Sl   02:28   0:02 /tools/node/bin/node /datalab/web
root          22  0.0  0.0   7376  3464 ?        S    02:28   0:00 /bin/bash -e /usr/local/colab/bin
root          24  0.0  0.0   7376  1804 ?        S    02:28   0:00 /bin/bash -e /datalab/run.sh
root          25  0.2  0.1 1237860 16392 ?       Sl   02:28   0:00 /usr/colab/bin/kernel_manager_pro
root          27  0.0  0.0   5808  1072 ?        Ss   02:28   0:00 tail -n +0 -F /root/.config/Googl
root          34  0.0  0.0   5808  1000 ?        Ss   02:28   0:00 tail -n +0 -F /root/.config/Googl
root          71 16.0  0.0      0     0 ?        Z    02:28   0:15 [python3] <defunct>
root          72  0.5  0.3  63756 50872 ?        S    02:28   0:00 python3 /usr/local/bin/colab-file
root          89  3.9  0.9 370080 1