# Task

Choose a previous project that involves a machine learning component and perform the following tasks:

- Train a machine learning model using the data from your previous project. Select an appropriate machine learning model based on your data and problem.

- Integrate MLflow for tracking and managing your machine learning experiments. Log hyperparameters, metrics, and artifacts of your experiments in MLflow. Save structured and unstructured information related to your trained model in SQLite within MLflow.

- Develop a user-friendly interface for your ML app using Streamlit. Optionally, you can create a three-layer ML app (data, business, presentation) for a user-friendly interface to interact with the machine learning model.

- Dockerize your ML app, ensuring that the SQLite database, MLflow, and the Streamlit or custom interface are all functioning correctly within the Docker image.

- Upload your dockerized app to Docker Hub and provide instructions for running the app from the Docker Hub repository.

# Imports

In [8]:
!pip install mlflow -q



In [9]:
import pandas as pd
import sqlite3

pd.set_option('max_colwidth', 1000)
pd.describe_option('max_colwidth')

display.max_colwidth : int or None
    The maximum width in characters of a column in the repr of
    a pandas data structure. When the column overflows, a "..."
    placeholder is embedded in the output. A 'None' value means unlimited.
    [default: 50] [currently: 1000]


# Cleaning the data

In [10]:
# Reading the CSV files into Pandas Dataframes and merging them together based on player ID
baseball_master = pd.read_csv('https://raw.githubusercontent.com/NadiaHolmlund/M6_Group_Assignments/main/Group_Assignment_3/Data/Master.csv', encoding="ISO-8859-1")
baseball_batting = pd.read_csv('https://raw.githubusercontent.com/NadiaHolmlund/M6_Group_Assignments/main/Group_Assignment_3/Data/Batting.csv', encoding="ISO-8859-1")

baseball = baseball_master.merge(baseball_batting, on = 'playerID')
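
The merge above pairs each player record with every batting record that shares its `playerID`. A minimal sketch of that behaviour, using made-up toy frames rather than the real CSVs and assuming the master table holds one row per player (the column values and the `validate` check are illustrative additions, not part of the original notebook):

```python
import pandas as pd

# Toy stand-ins: one row per player (master), one row per season (batting)
master = pd.DataFrame({"playerID": ["a01", "b02"], "weight": [180.0, 200.0]})
batting = pd.DataFrame({"playerID": ["a01", "a01", "b02"], "HR": [13.0, 27.0, 5.0]})

# validate="one_to_many" makes pandas raise MergeError if playerID repeats in `master`
merged = master.merge(batting, on="playerID", validate="one_to_many")
print(merged)
```

Each player's attributes are repeated once per season row, which is why the merged frame has as many rows as the batting table has matches.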

In [11]:
# Examining the DataFrame
baseball.head()

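| | lahmanID | playerID | managerID | hofID | birthYear | birthMonth | birthDay | birthCountry | birthState | birthCity | ... | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | G_old |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | aaronha01 | NaN | aaronha01h | 1934.0 | 2.0 | 5.0 | USA | AL | Mobile | ... | 2.0 | 2.0 | 28.0 | 39.0 | NaN | 3.0 | 6.0 | 4.0 | 13.0 | 122.0 |
| 1 | 1 | aaronha01 | NaN | aaronha01h | 1934.0 | 2.0 | 5.0 | USA | AL | Mobile | ... | 3.0 | 1.0 | 49.0 | 61.0 | 5.0 | 3.0 | 7.0 | 4.0 | 20.0 | 153.0 |
| 2 | 1 | aaronha01 | NaN | aaronha01h | 1934.0 | 2.0 | 5.0 | USA | AL | Mobile | ... | 2.0 | 4.0 | 37.0 | 54.0 | 6.0 | 2.0 | 5.0 | 7.0 | 21.0 | 153.0 |
| 3 | 1 | aaronha01 | NaN | aaronha01h | 1934.0 | 2.0 | 5.0 | USA | AL | Mobile | ... | 1.0 | 1.0 | 57.0 | 58.0 | 15.0 | 0.0 | 0.0 | 3.0 | 13.0 | 151.0 |
| 4 | 1 | aaronha01 | NaN | aaronha01h | 1934.0 | 2.0 | 5.0 | USA | AL | Mobile | ... | 4.0 | 1.0 | 59.0 | 49.0 | 16.0 | 1.0 | 0.0 | 3.0 | 21.0 | 153.0 |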


In [12]:
# Examining the DataFrame
baseball.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96609 entries, 0 to 96608
Data columns (total 56 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   lahmanID      96609 non-null  int64  
 1   playerID      96609 non-null  object 
 2   managerID     6546 non-null   object 
 3   hofID         17793 non-null  object 
 4   birthYear     96294 non-null  float64
 5   birthMonth    95759 non-null  float64
 6   birthDay      95397 non-null  float64
 7   birthCountry  96174 non-null  object 
 8   birthState    86389 non-null  object 
 9   birthCity     95844 non-null  object 
 10  deathYear     39107 non-null  float64
 11  deathMonth    39096 non-null  float64
 12  deathDay      39095 non-null  float64
 13  deathCountry  38839 non-null  object 
 14  deathState    38415 non-null  object 
 15  deathCity     38804 non-null  object 
 16  nameFirst     96557 non-null  object 
 17  nameLast      96609 non-null  object 
 18  nameNote      2459 non-nul

In [13]:
# Extracting columns to be included in the database
baseball_clean = baseball[['weight', 'height', 'G', 'AB', 'HR']]

In [14]:
# Dropping NaN values
baseball_clean = baseball_clean.dropna()

In [15]:
# Examining the DataFrame
baseball_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 88718 entries, 0 to 96604
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   weight  88718 non-null  float64
 1   height  88718 non-null  float64
 2   G       88718 non-null  int64  
 3   AB      88718 non-null  float64
 4   HR      88718 non-null  float64
dtypes: float64(4), int64(1)
memory usage: 4.1 MB


In [16]:
# Saving the cleaned dataset
baseball_clean.to_csv('baseball_clean.csv', index=False)


# Setting up the Data Layer

Creating a SQLite database for the baseball dataset

Create a new file named **database.py** and paste the following code:

In [18]:
#database.py
import sqlite3
import pandas as pd

def init_db():
  # Load the cleaned baseball dataset into a Pandas DataFrame
  url = "https://raw.githubusercontent.com/NadiaHolmlund/M6_Group_Assignments/main/Group_Assignment_4/Data/baseball_clean.csv"
  baseball_db = pd.read_csv(url)

  # Connect to the SQLite database
  conn = sqlite3.connect("baseball.db")

  # Save the Pandas DataFrame to the SQLite database
  baseball_db.to_sql("baseball", conn, if_exists="replace", index=False)

  # Close the connection to the SQLite database
  conn.close()

if __name__ == '__main__':
    init_db()
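
After running `database.py`, it can be worth confirming that the table was actually written. The following is a small sketch using only the standard library; `describe_table` is a hypothetical helper, and the in-memory database with two made-up rows stands in for `baseball.db`:

```python
import sqlite3

def describe_table(conn, table):
    """Return (row_count, column_names) for a table in an open SQLite connection."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    return row_count, columns

# In-memory stand-in for baseball.db with two made-up rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baseball (weight REAL, height REAL, G INTEGER, AB REAL, HR REAL)")
conn.executemany(
    "INSERT INTO baseball VALUES (?, ?, ?, ?, ?)",
    [(180.0, 72.0, 122, 468.0, 13.0), (180.0, 72.0, 153, 602.0, 27.0)],
)
row_count, columns = describe_table(conn, "baseball")
print(row_count, columns)
conn.close()
```

Against the real file, the same helper would be called with `sqlite3.connect("baseball.db")`.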

# Setting up the Business Layer with MLflow

Using the Scikit-Learn library to train a machine learning model for HR prediction.
The code also sets up an experiment named "HR_Prediction_exp_x" and logs the model's parameters, performance metrics, and the trained model itself as an artifact in MLflow.
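
The task asks for experiment information to be stored in SQLite, but the training script never sets a tracking URI. As far as I know, MLflow releases from the time of this notebook then fall back to a local `mlruns` directory. A hedged sketch of how a SQLite backend could be selected; `sqlite_tracking_uri`, `configure_tracking` and the file name `mlflow.db` are assumptions, while `mlflow.set_tracking_uri` is the standard MLflow call:

```python
from pathlib import Path

def sqlite_tracking_uri(db_path="mlflow.db"):
    """Build a SQLite URI for a database file relative to the working directory."""
    return f"sqlite:///{Path(db_path).as_posix()}"

def configure_tracking(db_path="mlflow.db"):
    """Point MLflow at the SQLite file; call this before mlflow.set_experiment()."""
    import mlflow  # imported lazily so the URI helper works without MLflow installed
    mlflow.set_tracking_uri(sqlite_tracking_uri(db_path))

print(sqlite_tracking_uri())
```

With this in place, the UI would need to read from the same store, for example `mlflow ui --backend-store-uri sqlite:///mlflow.db`.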

Create a new file named **ml_model.py** and paste the following code:

In [19]:
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import sqlite3
import pandas as pd

# Connect to the SQLite database
conn = sqlite3.connect("baseball.db")

# Read data from a table using Pandas
data_df = pd.read_sql("SELECT * FROM baseball", conn)

def train_model():
    mlflow.set_experiment("HR_Prediction_exp_0")
    X = data_df.drop('HR', axis=1)
    y = data_df['HR']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    RFR = RandomForestRegressor(n_estimators=100, criterion="squared_error")

    with mlflow.start_run():
        RFR.fit(X_train, y_train)

        # Log model parameters
        mlflow.log_param("n_estimators", RFR.n_estimators)
        mlflow.log_param("criterion", RFR.criterion)

        # Log model performance metrics
        train_score = RFR.score(X_train, y_train)
        test_score = RFR.score(X_test, y_test)
        mlflow.log_metric("train_score", train_score)
        mlflow.log_metric("test_score", test_score)

        # Save the model as an artifact
        mlflow.sklearn.log_model(RFR, "model")

    return RFR, test_score

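if __name__ == '__main__':
    # Note: for a regressor, score() returns R^2, not classification accuracy
    RFR, accuracy = train_model()
    print(f"Model trained with accuracy: {accuracy}")

    # train_model() has already closed its run, so register the model in a run of its own
    with mlflow.start_run():
        mlflow.log_metric("accuracy", accuracy)
        mlflow.sklearn.log_model(RFR, "model", registered_model_name="HR_model")

    # Save a local copy; the apps load it from HR_model/model.pkl
    mlflow.sklearn.save_model(RFR, "HR_model")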

    # Launch MLflow UI
    #import os
    #os.system("mlflow ui")

2023/04/28 11:11:01 INFO mlflow.tracking.fluent: Experiment with name 'HR_Prediction_exp_0' does not exist. Creating a new experiment.


Model trained with accuracy: 0.6512343316062437


Successfully registered model 'HR_model'.
2023/04/28 11:11:53 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: HR_model, version 1
Created version '1' of model 'HR_model'.
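
The value printed as "accuracy" comes from `RandomForestRegressor.score`, which for regressors is the coefficient of determination R², not a classification accuracy. A small standalone sketch of that formula, with made-up numbers:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot

actual = [10.0, 20.0, 30.0]
perfect = r_squared(actual, actual)                # predictions match exactly
baseline = r_squared(actual, [20.0, 20.0, 20.0])   # always predicting the mean
print(perfect, baseline)
```

A score of about 0.65 therefore means the model explains roughly 65% of the variance in `HR` on the test split.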


# Test in chunks

In [None]:
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import sqlite3
import pandas as pd

# Connect to the SQLite database
conn = sqlite3.connect("baseball.db")

# Read data from a table using Pandas in chunks
chunk_size = 1000
# Concatenate once over the chunk iterator instead of growing a DataFrame inside a loop
data_df = pd.concat(
    pd.read_sql("SELECT * FROM baseball", conn, chunksize=chunk_size),
    ignore_index=True,
)

def train_model(data_df):
    mlflow.set_experiment("HR_Prediction_exp_0")
    X = data_df.drop('HR', axis=1)
    y = data_df['HR']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    RFR = RandomForestRegressor(n_estimators=100, criterion="squared_error")

    with mlflow.start_run():
        RFR.fit(X_train, y_train)

        # Log model parameters
        mlflow.log_param("n_estimators", RFR.n_estimators)
        mlflow.log_param("criterion", RFR.criterion)

        # Log model performance metrics
        train_score = RFR.score(X_train, y_train)
        test_score = RFR.score(X_test, y_test)
        mlflow.log_metric("train_score", train_score)
        mlflow.log_metric("test_score", test_score)

        # Save the model as an artifact
        mlflow.sklearn.log_model(RFR, "model")

    return RFR, test_score

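if __name__ == '__main__':
    # Call the train_model function for each chunk of data
    # (each call fits a separate model on that chunk only)
    for i, chunk in enumerate(pd.read_sql("SELECT * FROM baseball", conn, chunksize=chunk_size)):
        RFR, accuracy = train_model(chunk)
        print(f"Model trained with accuracy: {accuracy}")

        # train_model() has already closed its run, so log the extras in a run of their own
        with mlflow.start_run():
            mlflow.log_metric("accuracy", accuracy)
            mlflow.sklearn.log_model(RFR, "model", registered_model_name="HR_model")

        # save_model() raises if the target directory already exists and is not empty,
        # so each chunk's model gets its own path
        mlflow.sklearn.save_model(RFR, f"HR_model_chunk_{i}")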

    # Launch MLflow UI
    import os
    os.system("mlflow ui")
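
Chunked reading mainly bounds how many rows are held per read; it does not by itself combine per-chunk models into one. A standard-library sketch of the reading pattern, where `iter_chunks` is a hypothetical helper and the in-memory table with 25 made-up rows stands in for `baseball.db`:

```python
import sqlite3

def iter_chunks(conn, query, chunk_size):
    """Yield lists of at most chunk_size rows from a query result."""
    cursor = conn.execute(query)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows

# In-memory stand-in with 25 made-up rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baseball (G INTEGER, HR REAL)")
conn.executemany("INSERT INTO baseball VALUES (?, ?)", [(g, g / 10) for g in range(25)])

chunk_sizes = [len(chunk) for chunk in iter_chunks(conn, "SELECT * FROM baseball", 10)]
print(chunk_sizes)
conn.close()
```

The last chunk is shorter than the rest whenever the row count is not a multiple of the chunk size, so a model trained on it sees fewer rows.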

# Presentation Layer

Creating a presentation layer using HTML and CSS

Create a new folder named templates, and inside it, create a new file named index.html. Paste the following code:

In [None]:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>HR Prediction</title>
    <link rel="stylesheet" href="static/style.css">
</head>
<body>
    <h1>HR Prediction</h1>
    <form action="/classify" method="post">
        <label for="weight">Weight:</label>
        <input type="number" step="0.1" id="weight" name="weight" required><br><br>
        <label for="height">Height:</label>
        <input type="number" step="0.1" id="height" name="height" required><br><br>
        <label for="G">G:</label>
        <input type="number" step="0.1" id="G" name="G" required><br><br>
        <label for="AB">AB:</label>
        <input type="number" step="0.1" id="AB" name="AB" required><br><br>
        <input type="submit" value="predict">
    </form>
    {% if prediction %}
    <h2>Prediction: {{ prediction }}</h2>
    {% endif %}
</body>
</html>

Create a new folder named static, and inside it, create a new file named style.css. Paste the following code:

In [None]:
body {
    font-family: Arial, sans-serif;
    max-width: 600px;
    margin: 0 auto;
    padding: 20px;
}

input[type=number], input[type=submit] {
    width: 100%;
    padding: 5px;
    margin: 5px 0;
    box-sizing: border-box;
}

input[type=submit] {
    background-color: #4CAF50;
    color: white;
    cursor: pointer;
}

# Connecting the Layers with Flask

Connecting the data layer, business layer and presentation layer with Flask

Create a new file named app.py and paste the following code:

In [None]:
from flask import Flask, render_template, request, jsonify
import pickle
import sqlite3

app = Flask(__name__)

with open("HR_model/model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/", methods=["GET"])
def index():
    return render_template("index.html", prediction=None)

@app.route("/classify", methods=["POST"])
def classify():
    # Read the four model features from the submitted form
    weight = float(request.form["weight"])
    height = float(request.form["height"])
    G = float(request.form["G"])
    AB = float(request.form["AB"])

    # Feature order must match the training columns: weight, height, G, AB
    data = [[weight, height, G, AB]]
    prediction = float(model.predict(data)[0])

    # Save the data to the database
    connection = sqlite3.connect("baseball.db")
    cursor = connection.cursor()
    cursor.execute("INSERT INTO baseball (weight, height, G, AB, HR) VALUES (?, ?, ?, ?, ?)",
                   (weight, height, G, AB, prediction))
    connection.commit()
    connection.close()

    return jsonify({"prediction": prediction})


if __name__ == "__main__":
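    # Bind to all interfaces so the app is also reachable from outside a Docker container
    app.run(debug=True, host="0.0.0.0", port=5000)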


Run app.py and navigate to http://127.0.0.1:5000/ in your web browser.
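
The route converts each form field with `float(...)`, so a missing or non-numeric field would surface as an unhandled error. A minimal sketch of a stricter parser that could sit in front of `model.predict`; `FEATURES` and `parse_features` are hypothetical names, and a plain dict stands in for Flask's `request.form`:

```python
FEATURES = ("weight", "height", "G", "AB")  # same order as the training columns

def parse_features(form):
    """Return the feature values as floats, in training-column order."""
    missing = [name for name in FEATURES if name not in form]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return [float(form[name]) for name in FEATURES]

# A plain dict stands in for request.form here
row = parse_features({"weight": "180", "height": "72", "G": "150", "AB": "600"})
print(row)
```

Inside the route, the result could be passed on as `model.predict([row])`, with a `ValueError` translated into a 400 response.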

# Streamlit

In [6]:
!pip install -q streamlit

In [21]:
import pandas as pd
import streamlit as st
import pickle

# Setting up page configurations
st.set_page_config(
    page_title="HR Prediction",
    page_icon="⚾",
    layout="wide")

# Loading the model once and caching it across reruns
# (st.cache_resource replaces the deprecated st.experimental_singleton)
@st.cache_resource
def read_objects():
    with open('HR_model/model.pkl', 'rb') as f:
        return pickle.load(f)

model = read_objects()

# Setting up the page
weight = st.number_input('Weight')
height = st.number_input('Height')
G = st.number_input('G')
AB = st.number_input('AB')

# Button that builds one input row in the format the model expects and runs the prediction
if st.button('Predict! 🚀'):
    # One-row DataFrame with the same column names and order as the training data
    user_input = pd.DataFrame({'weight': weight, 'height': height, 'G': G, 'AB': AB}, index=[0])

    # Run prediction for 1 new observation
    predicted_value = model.predict(user_input)[0]

    # Print out result to user
    st.metric(label="Predicted HR", value=f'{round(predicted_value)}')


In [22]:
!npm install localtunnel

[K[?25h[37;40mnpm[0m [0m[30;43mWARN[0m [0m[35msaveError[0m ENOENT: no such file or directory, open '/content/package.json'
[0m[37;40mnpm[0m [0m[30;43mWARN[0m [0m[35menoent[0m ENOENT: no such file or directory, open '/content/package.json'
[0m[37;40mnpm[0m [0m[30;43mWARN[0m[35m[0m content No description
[0m[37;40mnpm[0m [0m[30;43mWARN[0m[35m[0m content No repository field.
[0m[37;40mnpm[0m [0m[30;43mWARN[0m[35m[0m content No README data
[0m[37;40mnpm[0m [0m[30;43mWARN[0m[35m[0m content No license field.
[0m
[K[?25h+ localtunnel@2.0.2
updated 1 package and audited 36 packages in 0.546s

3 packages are looking for funding
  run `npm fund` for details

found [92m0[0m vulnerabilities



In [23]:
!streamlit run app.py &>/content/logs.txt &

In [25]:
!npx localtunnel --port 8501

[K[?25hnpx: installed 22 in 1.592s
your url is: https://rare-bags-cheer-34-171-41-84.loca.lt
^C


# Dockerizing the App

## Step 1: Create a Dockerfile

Create a new file called Dockerfile in the project directory and add the following:

In [None]:
# Use an official Python runtime as a parent image
FROM python:3.9-slim

# Set the working directory
WORKDIR /baseball

# Copy the requirements file into the container
COPY requirements.txt /baseball

# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt

# Copy the rest of the application code
COPY . /baseball

# Make the script executable
RUN chmod +x /baseball/app.py

# Run the Python scripts sequentially when the container launches
CMD python database.py && python ml_model.py && python app.py
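
The Dockerfile copies a `requirements.txt` that is not shown in this notebook. Judging only from the imports in `database.py`, `ml_model.py` and the Flask `app.py`, a plausible minimal version might look like the fragment below; the exact package list and the absence of version pins are assumptions, not the authors' file:

```text
flask
mlflow
pandas
scikit-learn
```

Pinning the versions used during development would make the image build more reproducible.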

## Step 2: Build the Docker image

Run the following command in your terminal to build the Docker image:

In [2]:
docker build -t baseball-docker .


## Step 3: Run the Docker container

After the image is built, you can run the Docker container with:

In [None]:
docker run -p 5000:5000 baseball-docker

## Step 4: Push the Docker image to Docker Hub

Log in to your Docker Hub account from the command line:

In [None]:
docker login

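Tag the image with your Docker Hub repository name (Docker requires repository names to be lowercase), then push it:

In [None]:
docker tag baseball-docker nadiaholmlund/baseball_docker:v1.0.0
docker push nadiaholmlund/baseball_docker:v1.0.0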

## Step 5: Share the Docker image with the company

Share the Docker image URL with the company. They can now pull the image from Docker Hub and run the container on their infrastructure:

In [None]:
docker pull nadiaholmlund/baseball_docker:v1.0.0
docker run -p 5000:5000 nadiaholmlund/baseball_docker:v1.0.0
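
Once the container is running, a quick smoke test can confirm that the prediction endpoint responds. The sketch below uses only the standard library and assumes the Flask app's `/classify` route returns JSON with a `prediction` key; `build_classify_request` and `fetch_prediction` are hypothetical helpers, and the base URL depends on whichever host port was published:

```python
import json
from urllib import parse, request

def build_classify_request(base_url, weight, height, games, at_bats):
    """Build the form-encoded POST that the HTML form would send."""
    form = parse.urlencode(
        {"weight": weight, "height": height, "G": games, "AB": at_bats}
    ).encode()
    return request.Request(f"{base_url}/classify", data=form, method="POST")

def fetch_prediction(base_url, **features):
    """Send the request to a running container and return the predicted HR."""
    req = build_classify_request(base_url, **features)
    with request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())["prediction"]

# Building the request needs no running server; sending it does
req = build_classify_request("http://localhost:5000", 180, 72, 150, 600)
print(req.full_url, req.data)
```

Calling `fetch_prediction("http://localhost:5000", weight=180, height=72, games=150, at_bats=600)` would then exercise the model end to end.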