
# Task

Select one of your previous projects that includes a machine learning component and use MLflow to track and manage your machine learning experiments. The following tasks should be performed:

- Train a machine learning model using the data from your previous project. You can use any machine learning model that is appropriate for your data and problem.

- Use MLflow to track and manage your machine learning experiments. Log the hyperparameters, metrics, and artifacts of your machine learning experiments in MLflow. Save structured and unstructured information related to your trained model in SQLite within MLflow.

- Optionally, prepare an ML app based on three layers (data, business, presentation) to provide a user-friendly interface for interacting with your machine learning model. This will involve creating a data layer that handles the data processing pipeline and provides functions for loading and preprocessing the data, a business layer that implements the machine learning model and its related functions, and a presentation layer that implements the user interface and connects it to the business layer.

# Imports

In [10]:
!pip install mlflow -q

In [11]:
import pandas as pd
import sqlite3

pd.set_option('max_colwidth', 1000)
pd.describe_option('max_colwidth')

display.max_colwidth : int or None
    The maximum width in characters of a column in the repr of
    a pandas data structure. When the column overflows, a "..."
    placeholder is embedded in the output. A 'None' value means unlimited.
    [default: 50] [currently: 1000]


# Cleaning the data

In [12]:
# Reading the CSV files into Pandas Dataframes and merging them together based on player ID
baseball_master = pd.read_csv('https://raw.githubusercontent.com/NadiaHolmlund/M6_Group_Assignments/main/Group_Assignment_3/Data/Master.csv', encoding="ISO-8859-1")
baseball_batting = pd.read_csv('https://raw.githubusercontent.com/NadiaHolmlund/M6_Group_Assignments/main/Group_Assignment_3/Data/Batting.csv', encoding="ISO-8859-1")

baseball = baseball_master.merge(baseball_batting, on = 'playerID')
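
Because each player has one row in the master table and one row per season in the batting table, the merge repeats the player's master data for every batting row. A minimal sketch of that behaviour on toy data (the frames, the `season` column and all values are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy stand-ins for Master.csv and Batting.csv (hypothetical values)
master = pd.DataFrame({
    "playerID": ["p1", "p2"],
    "weight": [180.0, 200.0],
})
batting = pd.DataFrame({
    "playerID": ["p1", "p1", "p2"],
    "season": [2000, 2001, 2000],  # hypothetical column name
    "HR": [10.0, 12.0, 3.0],
})

# Inner join on playerID: every batting row picks up the player's master data
merged = master.merge(batting, on="playerID")
print(merged.shape)  # (3, 4)
```

This is why the merged DataFrame in the notebook contains many rows per player.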

In [13]:
# Examining the DataFrame
baseball.head()

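| | lahmanID | playerID | managerID | hofID | birthYear | birthMonth | birthDay | birthCountry | birthState | birthCity | ... | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | G_old |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | aaronha01 | NaN | aaronha01h | 1934.0 | 2.0 | 5.0 | USA | AL | Mobile | ... | 2.0 | 2.0 | 28.0 | 39.0 | NaN | 3.0 | 6.0 | 4.0 | 13.0 | 122.0 |
| 1 | 1 | aaronha01 | NaN | aaronha01h | 1934.0 | 2.0 | 5.0 | USA | AL | Mobile | ... | 3.0 | 1.0 | 49.0 | 61.0 | 5.0 | 3.0 | 7.0 | 4.0 | 20.0 | 153.0 |
| 2 | 1 | aaronha01 | NaN | aaronha01h | 1934.0 | 2.0 | 5.0 | USA | AL | Mobile | ... | 2.0 | 4.0 | 37.0 | 54.0 | 6.0 | 2.0 | 5.0 | 7.0 | 21.0 | 153.0 |
| 3 | 1 | aaronha01 | NaN | aaronha01h | 1934.0 | 2.0 | 5.0 | USA | AL | Mobile | ... | 1.0 | 1.0 | 57.0 | 58.0 | 15.0 | 0.0 | 0.0 | 3.0 | 13.0 | 151.0 |
| 4 | 1 | aaronha01 | NaN | aaronha01h | 1934.0 | 2.0 | 5.0 | USA | AL | Mobile | ... | 4.0 | 1.0 | 59.0 | 49.0 | 16.0 | 1.0 | 0.0 | 3.0 | 21.0 | 153.0 |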


In [14]:
# Examining the DataFrame
baseball.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96609 entries, 0 to 96608
Data columns (total 56 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   lahmanID      96609 non-null  int64  
 1   playerID      96609 non-null  object 
 2   managerID     6546 non-null   object 
 3   hofID         17793 non-null  object 
 4   birthYear     96294 non-null  float64
 5   birthMonth    95759 non-null  float64
 6   birthDay      95397 non-null  float64
 7   birthCountry  96174 non-null  object 
 8   birthState    86389 non-null  object 
 9   birthCity     95844 non-null  object 
 10  deathYear     39107 non-null  float64
 11  deathMonth    39096 non-null  float64
 12  deathDay      39095 non-null  float64
 13  deathCountry  38839 non-null  object 
 14  deathState    38415 non-null  object 
 15  deathCity     38804 non-null  object 
 16  nameFirst     96557 non-null  object 
 17  nameLast      96609 non-null  object 
 18  nameNote      2459 non-nul

In [15]:
# Extracting columns to be included in the database
baseball_db = baseball[['weight', 'height', 'G', 'AB', 'HR']]

In [16]:
# Dropping NaN values
baseball_db = baseball_db.dropna()

In [17]:
# Examining the DataFrame
baseball_db.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 88718 entries, 0 to 96604
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   weight  88718 non-null  float64
 1   height  88718 non-null  float64
 2   G       88718 non-null  int64  
 3   AB      88718 non-null  float64
 4   HR      88718 non-null  float64
dtypes: float64(4), int64(1)
memory usage: 4.1 MB


In [19]:
# Saving the Dataframe as CSV
baseball_db.to_csv('baseball_db.csv', index=False)

# Setting up the Data Layer

Creating a SQLite database for the baseball dataset

In terminal: create a file named database.py and paste the code

In [21]:
#database.py
import sqlite3
import pandas as pd

def init_db():
  # Load the baseball dataset into a Pandas DataFrame.
  # The raw.githubusercontent.com URL serves the CSV itself; the github.com/.../blob/ URL
  # returns an HTML page, which makes pd.read_csv raise a ParserError.
  url = "https://raw.githubusercontent.com/NadiaHolmlund/M6_Group_Assignments/main/Group_Assignment_3/Data/baseball_db.csv"
  # The CSV was saved with a header row, so let Pandas read the column names from it
  df = pd.read_csv(url)

  # Connect to the SQLite database
  conn = sqlite3.connect("baseball.db")

  # Save the Pandas DataFrame to the SQLite database
  df.to_sql("baseball", conn, if_exists="replace", index=False)

  # Close the connection to the SQLite database
  conn.close()

if __name__ == '__main__':
    init_db()
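
Once database.py has run successfully, it can be worth confirming that the `baseball` table has the expected columns and a plausible number of rows before training on it. The following is a minimal sketch using only the standard library; `describe_table` is a hypothetical helper, and the demo rows are made up so the example runs without the real database:

```python
import sqlite3

def describe_table(conn, table="baseball"):
    """Return (column_names, row_count) for a table in an open SQLite connection."""
    # The table name is interpolated directly, so only pass trusted names.
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return columns, row_count

# Demo on an in-memory database with made-up rows; for the real data,
# connect with sqlite3.connect("baseball.db") instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baseball (weight REAL, height REAL, G INTEGER, AB REAL, HR REAL)")
conn.executemany(
    "INSERT INTO baseball VALUES (?, ?, ?, ?, ?)",
    [(180.0, 72.0, 122, 468.0, 13.0), (200.0, 75.0, 153, 602.0, 27.0)],
)
columns, row_count = describe_table(conn)
print(columns, row_count)
conn.close()
```

If the column list shows names other than weight, height, G, AB and HR, the CSV header was probably not parsed as a header.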

# Setting up the Business Layer

Using the Scikit-Learn library to train a random forest regressor that predicts home runs (HR) from weight, height, games played (G) and at-bats (AB).

The code also sets up an experiment named "HR_Prediction" and logs the model's parameters, performance metrics, and the trained model itself as an artifact in MLflow.

## Model ***with*** MLflow

In [10]:
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import sqlite3
import pandas as pd

# Connect to the SQLite database
conn = sqlite3.connect("baseball.db")

# Read data from a table using Pandas
data_df = pd.read_sql("SELECT * FROM baseball", conn)

def train_model():
    mlflow.set_experiment("HR_Prediction")
    X = data_df.drop('HR', axis=1)
    y = data_df['HR']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    regr = RandomForestRegressor()

    with mlflow.start_run():
        regr.fit(X_train, y_train)

        # Log model parameters
        mlflow.log_param("n_estimators", regr.n_estimators)
        mlflow.log_param("criterion", regr.criterion)

        # Log model performance metrics
        train_score = regr.score(X_train, y_train)
        test_score = regr.score(X_test, y_test)
        mlflow.log_metric("train_score", train_score)
        mlflow.log_metric("test_score", test_score)

        # Save the model as an artifact
        mlflow.sklearn.log_model(regr, "model")

    return regr, test_score

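if __name__ == '__main__':
    regr, accuracy = train_model()
    # For a regressor, score() returns R^2 rather than a classification accuracy
    print(f"Model trained with accuracy: {accuracy}")

    # Log the score and register the model once, inside an explicitly closed run
    with mlflow.start_run():
        mlflow.log_metric("accuracy", accuracy)
        mlflow.sklearn.log_model(regr, "model", registered_model_name="HR_model")

    # Save a local copy of the model (fails if the "HR_model" directory already exists)
    mlflow.sklearn.save_model(regr, "HR_model")

    # Launch MLflow UI (blocks until the server is stopped)
    import os
    os.system("mlflow ui")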

2023/04/16 18:54:58 INFO mlflow.tracking.fluent: Experiment with name 'HR_Prediction' does not exist. Creating a new experiment.


Model trained with accuracy: 0.6505444769367961


Successfully registered model 'HR_model'.
2023/04/16 18:55:42 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: HR_model, version 1
Created version '1' of model 'HR_model'.


To view the MLflow UI, run the command "mlflow ui" in terminal

## Model ***without*** MLflow

In terminal: create a file named model.py and paste the code

In [5]:
import pandas as pd
import sqlite3
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import pickle

# Connect to the SQLite database
conn = sqlite3.connect("baseball.db")

# Read data from a table using Pandas
data_df = pd.read_sql("SELECT * FROM baseball", conn)

def train_model():
    X = data_df.drop('HR', axis=1)
    y = data_df['HR']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    regr = RandomForestRegressor()
    regr.fit(X_train, y_train)

    with open("model.pkl", "wb") as f:
        pickle.dump(regr, f)

    return regr.score(X_test, y_test)

if __name__ == '__main__':
    accuracy = train_model()
    print(f"Model trained with accuracy: {accuracy}")

ValueError: ignored
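
The "accuracy" reported by both training scripts is the value returned by `regr.score(...)`, which for Scikit-Learn regressors is the coefficient of determination R², not a classification accuracy. A small sketch of how R² is computed, using a hypothetical helper `r2_score_manual` and made-up numbers:

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [10.0, 20.0, 30.0]
perfect = r2_score_manual(y_true, [10.0, 20.0, 30.0])   # exact predictions
baseline = r2_score_manual(y_true, [20.0, 20.0, 20.0])  # always predict the mean
print(perfect, baseline)  # 1.0 0.0
```

A score of about 0.65 therefore means the model explains roughly 65% of the variance in HR on the test set.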

# Setting up the Presentation Layer

Using HTML and CSS to create and style the prediction app

In terminal: create a folder named templates and create a file named index.html and paste the code

In [None]:
<!-- Setting up HTML for the app -->
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>HR Prediction</title>
    <link rel="stylesheet" href="static/style.css">
</head>
<body>
    <h1>HR Prediction</h1>
    <form action="/predict" method="post">
        <label for="weight">Weight:</label>
        <input type="number" step="1" id="weight" name="weight" required><br><br>
        <label for="height">Height:</label>
        <input type="number" step="1" id="height" name="height" required><br><br>
        <label for="G">Games Played:</label>
        <input type="number" step="1" id="G" name="G" required><br><br>
        <label for="AB">At-bat:</label>
        <input type="number" step="1" id="AB" name="AB" required><br><br>
        <input type="submit" value="Predict">
    </form>
    {% if prediction %}
    <h2>Prediction: {{ prediction }}</h2>
    {% endif %}
</body>
</html>

In terminal: create a folder named static and create a file named style.css and paste the code

In [None]:
/* Setting up CSS for the app */
body {
    font-family: Arial, sans-serif;
    max-width: 600px;
    margin: 0 auto;
    padding: 20px;
}

input[type=number], input[type=submit] {
    width: 100%;
    padding: 5px;
    margin: 5px 0;
    box-sizing: border-box;
}

input[type=submit] {
    background-color: #4CAF50;
    color: white;
    cursor: pointer;
}

# Connecting the Data Layer, Business Layer and Presentation Layer using Flask

In terminal: Create a file named app.py and paste the code

In [None]:
from flask import Flask, render_template, request, jsonify
import pickle
import sqlite3

app = Flask(__name__)

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/", methods=["GET"])
def index():
    return render_template("index.html", prediction=None)

@app.route("/predict", methods=["POST"])
def predict():
    # Read the four model features from the submitted form
    weight = float(request.form["weight"])
    height = float(request.form["height"])
    G = float(request.form["G"])
    AB = float(request.form["AB"])

    # Feature order must match the training columns: weight, height, G, AB
    data = [[weight, height, G, AB]]
    # Convert the NumPy scalar to a plain float for SQLite and JSON
    prediction = float(model.predict(data)[0])

    # Save the data to the database
    connection = sqlite3.connect("baseball.db")
    cursor = connection.cursor()
    cursor.execute("INSERT INTO baseball (weight, height, G, AB, HR) VALUES (?, ?, ?, ?, ?)",
                   (weight, height, G, AB, prediction))
    connection.commit()
    connection.close()

    return jsonify({"prediction": prediction})


if __name__ == "__main__":
    app.run(debug=True, port=5002)

Run app.py and navigate to http://127.0.0.1:5002/ to see the HR Prediction app. Each submitted input and its prediction are also saved to the SQLite database.
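
The `/predict` route converts the form fields to floats directly, so a missing or non-numeric field would surface as an unhandled error. One possible way to make this step explicit is a small parsing helper; `FEATURES` and `parse_features` are hypothetical names introduced for this sketch and are not part of app.py:

```python
# Feature names in the order the model was trained on (weight, height, G, AB)
FEATURES = ["weight", "height", "G", "AB"]

def parse_features(form):
    """Convert a form-like mapping into a list of floats in training order."""
    missing = [name for name in FEATURES if name not in form]
    if missing:
        raise ValueError(f"Missing form fields: {', '.join(missing)}")
    # float() raises ValueError itself if a value is not numeric
    return [float(form[name]) for name in FEATURES]

row = parse_features({"weight": "180", "height": "72", "G": "150", "AB": "600"})
print(row)  # [180.0, 72.0, 150.0, 600.0]
```

In a Flask route, `request.form` could be passed to such a helper and the result wrapped in a list before calling `model.predict`.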