
# Task

Choose a previous project that involves a machine learning component and perform the following tasks:

- Train a machine learning model using the data from your previous project. Select an appropriate machine learning model based on your data and problem.

- Integrate MLflow for tracking and managing your machine learning experiments. Log hyperparameters, metrics, and artifacts of your experiments in MLflow. Save structured and unstructured information related to your trained model in SQLite within MLflow.

- Develop a user-friendly interface for your ML app using Streamlit. Optionally, you can create a three-layer ML app (data, business, presentation) for a user-friendly interface to interact with the machine learning model.

- Dockerize your ML app, ensuring that the SQLite database, MLflow, and the Streamlit or custom interface are all functioning correctly within the Docker image.

- Upload your dockerized app to Docker Hub and provide instructions for running the app from the Docker Hub repository.

# Preprocessing

In [None]:
# Importing libraries
import pandas as pd

pd.set_option('max_colwidth', 1000)
pd.describe_option('max_colwidth')

In [None]:
# Reading the CSV files into Pandas Dataframes and merging them together based on player ID
baseball_master = pd.read_csv('https://raw.githubusercontent.com/NadiaHolmlund/M6_Group_Assignments/main/Group_Assignment_3/Data/Master.csv', encoding="ISO-8859-1")
baseball_batting = pd.read_csv('https://raw.githubusercontent.com/NadiaHolmlund/M6_Group_Assignments/main/Group_Assignment_3/Data/Batting.csv', encoding="ISO-8859-1")

baseball = baseball_master.merge(baseball_batting, on = 'playerID')
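
Because `merge` defaults to an inner join, every batting row is paired with its matching master row. A player with several seasons therefore appears once per season, and a `playerID` missing from either file is dropped. A minimal sketch on invented rows shows the effect (the toy `master` and `batting` frames and their values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-ins for Master.csv and Batting.csv (values invented for illustration)
master = pd.DataFrame({
    "playerID": ["p1", "p2", "p3"],
    "weight": [180.0, 200.0, 175.0],
})
batting = pd.DataFrame({
    "playerID": ["p1", "p1", "p2", "p4"],
    "HR": [10.0, 12.0, 3.0, 7.0],
})

# Default how="inner": keep only playerIDs present in both frames
merged = master.merge(batting, on="playerID")

# p1 appears twice (one row per season); p3 and p4 have no match and are dropped
print(merged)
```

Passing `how="left"` or `indicator=True` to `merge` would be one way to inspect the unmatched rows instead of silently dropping them.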

In [None]:
# Examining the DataFrame
baseball.head()

Unnamed: 0,lahmanID,playerID,managerID,hofID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,...,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,G_old
0,1,aaronha01,,aaronha01h,1934.0,2.0,5.0,USA,AL,Mobile,...,2.0,2.0,28.0,39.0,,3.0,6.0,4.0,13.0,122.0
1,1,aaronha01,,aaronha01h,1934.0,2.0,5.0,USA,AL,Mobile,...,3.0,1.0,49.0,61.0,5.0,3.0,7.0,4.0,20.0,153.0
2,1,aaronha01,,aaronha01h,1934.0,2.0,5.0,USA,AL,Mobile,...,2.0,4.0,37.0,54.0,6.0,2.0,5.0,7.0,21.0,153.0
3,1,aaronha01,,aaronha01h,1934.0,2.0,5.0,USA,AL,Mobile,...,1.0,1.0,57.0,58.0,15.0,0.0,0.0,3.0,13.0,151.0
4,1,aaronha01,,aaronha01h,1934.0,2.0,5.0,USA,AL,Mobile,...,4.0,1.0,59.0,49.0,16.0,1.0,0.0,3.0,21.0,153.0


In [None]:
# Examining the DataFrame
baseball.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96609 entries, 0 to 96608
Data columns (total 56 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   lahmanID      96609 non-null  int64  
 1   playerID      96609 non-null  object 
 2   managerID     6546 non-null   object 
 3   hofID         17793 non-null  object 
 4   birthYear     96294 non-null  float64
 5   birthMonth    95759 non-null  float64
 6   birthDay      95397 non-null  float64
 7   birthCountry  96174 non-null  object 
 8   birthState    86389 non-null  object 
 9   birthCity     95844 non-null  object 
 10  deathYear     39107 non-null  float64
 11  deathMonth    39096 non-null  float64
 12  deathDay      39095 non-null  float64
 13  deathCountry  38839 non-null  object 
 14  deathState    38415 non-null  object 
 15  deathCity     38804 non-null  object 
 16  nameFirst     96557 non-null  object 
 17  nameLast      96609 non-null  object 
 18  nameNote      2459 non-nul

In [None]:
# Extracting columns to be included in the database
baseball_clean = baseball[['weight', 'height', 'G', 'AB', 'HR']]

In [None]:
# Dropping NaN values
baseball_clean = baseball_clean.dropna()
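
Since `dropna()` removes a row as soon as any of its columns is NaN, it can be useful to count the missing values per column before dropping. A small sketch on invented data (the `toy` frame is hypothetical and only mimics three of the selected columns):

```python
import numpy as np
import pandas as pd

# Invented rows: three of the four rows have exactly one missing value
toy = pd.DataFrame({
    "weight": [180.0, np.nan, 175.0, 190.0],
    "height": [72.0, 70.0, np.nan, 74.0],
    "HR": [10.0, 3.0, 7.0, np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Drop every row that contains at least one NaN
toy_clean = toy.dropna()
rows_removed = len(toy) - len(toy_clean)

print(missing_per_column)
print(f"Rows removed: {rows_removed}")
```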

In [None]:
# Examining the DataFrame
baseball_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 88718 entries, 0 to 96604
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   weight  88718 non-null  float64
 1   height  88718 non-null  float64
 2   G       88718 non-null  int64  
 3   AB      88718 non-null  float64
 4   HR      88718 non-null  float64
dtypes: float64(4), int64(1)
memory usage: 4.1 MB


In [None]:
# Saving the cleaned dataset
baseball_clean.to_csv('baseball_clean.csv', index=False)


# Creating a three-layer ML app

Before continuing, create an instance in AWS and save the key pair in Downloads.

Note that the following code sections are to be added and executed in terminal.

Open terminal and run the following commands to:
- Open the Downloads folder: cd Downloads
- Restrict permission to the key pair: chmod 400 key_pair_name.pem
- Connect to the AWS instance: ssh -i key_pair_name.pem ubuntu@Public_IPv4_address
- Create a directory for the ML app: mkdir baseball
- Open the directory for the ML app: cd baseball
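
SSH refuses a private key that other users can read, which is why the key file's permissions are restricted before connecting. The check can be sketched in Python as follows; `is_private` is a hypothetical helper, and a temporary file stands in for `key_pair_name.pem`. Any mode that grants nothing to group and others passes the check.

```python
import os
import stat
import tempfile

def is_private(path):
    """Return True if group and others have no access to the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0

# A temporary file stands in for key_pair_name.pem
with tempfile.NamedTemporaryFile(suffix=".pem", delete=False) as tmp:
    key_path = tmp.name

os.chmod(key_path, 0o644)      # readable by everyone: too open for SSH
before = is_private(key_path)

os.chmod(key_path, 0o400)      # owner read-only
after = is_private(key_path)

os.remove(key_path)
print(before, after)
```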


## Setting up the Data Layer

Creating a SQLite database for the baseball dataset based on the cleaned data.

In terminal, run the following command to create a new file named **database.py**: vim database.py

Paste the following code in the file:

In [None]:
# database.py
import sqlite3
import pandas as pd

def init_db():
    # Load the cleaned dataset into a Pandas DataFrame
    url = "https://raw.githubusercontent.com/NadiaHolmlund/M6_Group_Assignments/main/Group_Assignment_4/Data/baseball_clean.csv"
    baseball_db = pd.read_csv(url)

    # Connect to the SQLite database
    conn = sqlite3.connect("baseball.db")

    # Save the Pandas DataFrame to the SQLite database
    baseball_db.to_sql("baseball", conn, if_exists="replace", index=False)

    # Close the connection to the SQLite database
    conn.close()

if __name__ == '__main__':
    init_db()
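
To confirm that the table was written as expected, the database can be inspected with the standard `sqlite3` module. The sketch below uses an in-memory database with two invented rows so that it runs on its own. Against the real file, `sqlite3.connect("baseball.db")` would replace the in-memory connection, and the `CREATE TABLE` and `INSERT` statements would be dropped.

```python
import sqlite3

# In-memory stand-in for baseball.db (rows are invented for illustration)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE baseball (weight REAL, height REAL, G INTEGER, AB REAL, HR REAL)"
)
conn.executemany(
    "INSERT INTO baseball VALUES (?, ?, ?, ?, ?)",
    [(180.0, 72.0, 122, 468.0, 13.0), (180.0, 72.0, 153, 602.0, 27.0)],
)

# Column names as stored in the table (second field of each PRAGMA row)
columns = [row[1] for row in conn.execute("PRAGMA table_info(baseball)")]

# Number of rows in the table
row_count = conn.execute("SELECT COUNT(*) FROM baseball").fetchone()[0]
conn.close()

print(columns)
print(row_count)
```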

## Setting up the Business Layer with MLflow

Training a machine learning model for HR (home run) prediction using the scikit-learn library.

The code also sets up an experiment named "HR_Prediction_exp_0" and logs the model's parameters, performance metrics, and the trained model itself as an artifact in MLflow.

In terminal, run the following command to create a new file named **ml_model.py**: vim ml_model.py

Paste the following code in the file:

In [None]:
# ml_model.py
import shutil
import sqlite3
import subprocess

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Connect to the SQLite database
conn = sqlite3.connect("baseball.db")

# Read the table in chunks to limit memory use while loading, then combine the chunks
chunk_size = 1000
data_df = pd.concat(
    pd.read_sql("SELECT * FROM baseball", conn, chunksize=chunk_size),
    ignore_index=True,
)
conn.close()

# Define a function to train the model
def train_model(data_df):
    mlflow.set_experiment("HR_Prediction_exp_0")
    X = data_df.drop('HR', axis=1)
    y = data_df['HR']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # On a small instance, lowering n_estimators or setting max_depth reduces memory use
    RFR = RandomForestRegressor(n_estimators=100, criterion="squared_error")

    with mlflow.start_run():
        RFR.fit(X_train, y_train)

        # Log model parameters
        mlflow.log_param("n_estimators", RFR.n_estimators)
        mlflow.log_param("criterion", RFR.criterion)

        # Log model performance metrics (score() returns R^2 for a regressor)
        train_score = RFR.score(X_train, y_train)
        test_score = RFR.score(X_test, y_test)
        mlflow.log_metric("train_score", train_score)
        mlflow.log_metric("test_score", test_score)

        # Save the model as an artifact and register it under the name "HR_model"
        mlflow.sklearn.log_model(RFR, "model", registered_model_name="HR_model")

    return RFR, test_score

if __name__ == '__main__':
    # Train once on the full dataset; training per chunk would keep only the last chunk's model
    RFR, test_score = train_model(data_df)
    print(f"Model trained with test R^2: {test_score}")

    # Save a local copy for app.py; save_model does not overwrite an existing folder
    shutil.rmtree("HR_model", ignore_errors=True)
    mlflow.sklearn.save_model(RFR, "HR_model")

    # Launch MLflow UI in the background so the script can finish
    subprocess.Popen(["mlflow", "ui"])
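
The `train_score` and `test_score` values logged above come from `RFR.score(...)`, which for a scikit-learn regressor returns the coefficient of determination, R², rather than a classification accuracy. A value of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean of the targets. A plain-Python sketch of the formula on invented numbers (`r_squared` is a hypothetical helper, not part of scikit-learn):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [0.0, 10.0, 20.0, 30.0]

perfect = r_squared(y_true, y_true)          # predictions equal the targets
mean_only = r_squared(y_true, [15.0] * 4)    # always predict the mean (15.0)

print(perfect, mean_only)
```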

## Setting up the Presentation Layer

Creating a presentation layer with HTML and CSS.

In terminal, run the following command to create a new folder named **templates**: mkdir templates

In terminal, run the following command to open the templates folder: cd templates

In terminal, run the following command to create a new file named **index.html** inside the templates folder: vim index.html

Paste the code below in the file. After saving the file, run the following command to exit the templates folder: cd ..

In [None]:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Predict the number of Home-Runs</title>
    <link rel="stylesheet" href="static/style.css">
</head>
<body>
    <h1>Predict the number of Home-Runs</h1>
    <form action="/predict" method="post">
        <label for="weight">Weight:</label>
        <input type="number" step="1" id="weight" name="weight" required><br><br>
        <label for="height">Height:</label>
        <input type="number" step="1" id="height" name="height" required><br><br>
        <label for="G">Games:</label>
        <input type="number" step="1" id="G" name="G" required><br><br>
        <label for="AB">At-Bat:</label>
        <input type="number" step="1" id="AB" name="AB" required><br><br>
        <input type="submit" value="Predict the number of Home-Runs">
    </form>
    {% if prediction %}
    <h2>Predicted number of Home-Runs: {{ prediction }}</h2>
    {% endif %}
</body>
</html>

In terminal, run the following command to create a new folder named **static**: mkdir static

In terminal, run the following command to open the static folder: cd static

In terminal, run the following command to create a new file named **style.css** inside the static folder: vim style.css

Paste the code below in the file. After saving the file, run the following command to exit the static folder: cd ..

In [None]:
body {
    font-family: Courier New, Arial, sans-serif;
    max-width: 600px;
    margin: 0 auto;
    padding: 20px;
}

input[type=number], input[type=submit] {
    width: 100%;
    padding: 5px;
    margin: 5px 0;
    box-sizing: border-box;
}

input[type=submit] {
    background-color: #c90076;
    color: white;
    cursor: pointer;
}

## Connecting the Layers with Flask

Connecting the data layer, business layer and presentation layer with Flask.

In terminal, run the following command to create a new file named **app.py** : vim app.py

Paste the following code in the file:

In [None]:
from flask import Flask, render_template, request, jsonify
import pickle
import sqlite3

app = Flask(__name__)

with open("HR_model/model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/", methods=["GET"])
def index():
    return render_template("index.html", prediction=None)

@app.route("/predict", methods=["POST"])
def classify():
    weight = float(request.form["weight"])
    height = float(request.form["height"])
    G = float(request.form["G"])
    AB = float(request.form["AB"])

    data = [[weight, height, G, AB]]
    prediction = model.predict(data)[0]

    # Save the data to the database
    connection = sqlite3.connect("baseball.db")
    cursor = connection.cursor()
    cursor.execute("INSERT INTO baseball (weight, height, G, AB, HR) VALUES (?, ?, ?, ?, ?)",
                   (weight, height, G, AB, prediction))
    connection.commit()
    connection.close()

    return jsonify({"prediction": prediction})


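if __name__ == "__main__":
    # Listen on all interfaces so the app is reachable through a published Docker port.
    # Debug mode is off because the interactive debugger should not be exposed publicly.
    app.run(host="0.0.0.0", port=5002)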
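
The `/predict` route above converts each form field with `float(...)`, which raises an exception when a field is missing or not numeric. One possible way to validate the input first is sketched below. `parse_features` and `FEATURE_NAMES` are hypothetical names that are not part of the app, and the function accepts any dict-like object such as `request.form`.

```python
FEATURE_NAMES = ("weight", "height", "G", "AB")  # same order as the model input

def parse_features(form):
    """Return the four features as floats, or raise ValueError naming the bad field."""
    values = []
    for name in FEATURE_NAMES:
        raw = form.get(name)
        if raw is None:
            raise ValueError(f"missing field: {name}")
        try:
            values.append(float(raw))
        except ValueError:
            raise ValueError(f"not a number: {name}") from None
    return values

# Form values arrive as strings, as they would from an HTML form
features = parse_features({"weight": "180", "height": "72", "G": "150", "AB": "600"})
print(features)
```

Inside the route, a caught `ValueError` could be turned into an HTTP 400 response instead of an unhandled server error.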

The directory should now include the following files and folders:

![picture](https://raw.github.com/NadiaHolmlund/M6_Group_Assignments/main/Group_Assignment_4/Images/Screenshot%202023-04-29%20at%2001.01.07.png)

# Running the scripts in terminal

Run the following commands in terminal to:
- Get information about the latest version of packages available: sudo apt update
- Install Pip: sudo apt install python3-pip
- Install Pandas: pip install pandas
- Install MLflow: pip install mlflow
- Install Flask: pip install flask
- Run the database script: python3 database.py
- Run the model script: python3 ml_model.py
- Run the app script: python3 app.py
- In a new terminal tab, create a connection to the remote server: ssh -NL 5002:localhost:5002 ubuntu@Public_IPv4_address -i key_pair_name.pem
- Navigate to http://127.0.0.1:5002/ in a web browser to interact with the app
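
Besides the browser form, the `/predict` route can be called from a script, since it accepts a standard form-encoded POST. A sketch using only the standard library follows. `build_predict_request` is a hypothetical helper, the feature values are invented, and the request is only sent if the last line is uncommented while the app and the SSH tunnel are running.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_predict_request(base_url, weight, height, games, at_bats):
    """Build a form-encoded POST request for the /predict route."""
    body = urlencode({"weight": weight, "height": height, "G": games, "AB": at_bats})
    return Request(f"{base_url}/predict", data=body.encode("utf-8"), method="POST")

req = build_predict_request("http://127.0.0.1:5002", 180, 72, 150, 600)
print(req.full_url, req.data)

# Uncomment to send the request once the app is reachable on port 5002:
# print(urlopen(req).read().decode("utf-8"))
```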

The app should now be available on localhost, as shown below:

![picture](https://raw.github.com/NadiaHolmlund/M6_Group_Assignments/main/Group_Assignment_4/Images/Screenshot%202023-04-29%20at%2001.09.00.png)

![picture](https://raw.github.com/NadiaHolmlund/M6_Group_Assignments/main/Group_Assignment_4/Images/Screenshot%202023-04-29%20at%2001.09.06.png)

# Dockerizing the App

## Step 1: Create a Dockerfile

In terminal, run the following command to create a new file named **Dockerfile** : vim Dockerfile

Paste the following code in the file:

In [None]:
# Use an official Python runtime as a parent image
FROM python:3.9-slim

# Set the working directory
WORKDIR /baseball

# Copy the requirements file into the container
COPY HR_model/requirements.txt /baseball

# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt

# Copy the rest of the application code
COPY . /baseball

# Make the script executable
RUN chmod +x /baseball/app.py

# Run the Python scripts sequentially when the container launches
CMD python database.py && python ml_model.py && python app.py

## Step 2: Build the Docker image

In terminal, run the following commands to install Docker and build the Docker image:

In [None]:
sudo snap install docker

In [None]:
sudo docker build -t baseball_docker .

## Step 3: Run the Docker container

In terminal, run the following command to run the Docker container and publish the app's port 5002 to the host:

In [None]:
sudo docker run -p 5002:5002 baseball_docker

## Step 4: Push the Docker image to Docker Hub

In terminal, run the following command and log in to your Docker Hub:

In [None]:
sudo docker login

In terminal, run the following code to tag the docker image:

Note: insert docker_username in the code

In [None]:
sudo docker tag baseball_docker docker_username/baseball_docker:v1.0.0

In terminal, run the following code to push the Docker image to your Docker Hub repository:

Note: insert docker_username in the code

In [None]:
sudo docker push docker_username/baseball_docker:v1.0.0

## Step 5: Share the Docker image

Share the Docker image URL with a company. They can now pull the image from Docker Hub and run the container on their infrastructure:

Note: insert docker_username in the code

In [None]:
docker pull docker_username/baseball_docker:v1.0.0

In [None]:
docker run -p 5002:5002 docker_username/baseball_docker:v1.0.0

The Docker image should now be available on Docker Hub as per below:

![picture](https://raw.github.com/NadiaHolmlund/M6_Group_Assignments/main/Group_Assignment_4/Images/Screenshot%202023-04-29%20at%2001.41.40.png)