# How to use

The Streamlit UI code MUST live in its own `.py` file; it cannot run as cells in an `.ipynb` notebook.

In this notebook, I use the `%%writefile app.py` cell magic to write a new Streamlit `.py` file to the Colab runtime's working directory (you can check with `!ls`).

Once your Streamlit `.py` file is ready, skip forward to the last section to run it: **Running streamlit instance**.
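For readers who prefer plain Python to the cell magic, the same "write the app to disk" step can be sketched with `pathlib`. This is only an illustration, not part of the original notebook: the helper name `write_app` and the two-line app body are assumptions.

```python
from pathlib import Path

# Hypothetical two-line Streamlit script, used only for illustration.
APP_SOURCE = 'import streamlit as st\n\nst.title("Hello from Colab")\n'


def write_app(path="app.py", source=APP_SOURCE):
    """Write the Streamlit script to disk, much like `%%writefile` does."""
    target = Path(path)
    target.write_text(source, encoding="utf-8")
    return target
```

Calling `write_app()` in a cell should leave an `app.py` next to the notebook, which `!ls` can confirm.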

# Resources

[Main page](https://streamlit.io/)

[Documentation](https://docs.streamlit.io/library/get-started/main-concepts)

[YouTube tutorial code (GitHub)](https://github.com/dataprofessor/ml-app)

[Linking streamlit and colab](https://medium.com/@jcharistech/how-to-run-streamlit-apps-from-colab-29b969a1bdfc)

In [None]:
!streamlit run https://raw.githubusercontent.com/streamlit/demo-uber-nyc-pickups/master/streamlit_app.py

2022-07-25 09:49:33.408 INFO    numexpr.utils: NumExpr defaulting to 2 threads.

  You can now view your Streamlit app in your browser.

  Network URL: http://172.28.0.2:8501
  External URL: http://34.73.210.146:8501

  Stopping...
^C


# Code for streamlit app starts here

In [1]:
%%writefile requirements.txt

transformers==4.21.0 # for BERT; PyTorch is preinstalled on Colab
streamlit==1.11.1
pyngrok==4.1.1 # newer versions did not work with this notebook

Writing requirements.txt
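After installing, it can help to confirm that the pinned versions are the ones actually present in the runtime. The sketch below is an assumption-laden illustration rather than part of the original notebook: `parse_pins` and `check_pins` are hypothetical helper names, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata


def parse_pins(text):
    """Return {package: version} for every `name==version` line."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop inline comments
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Return {package: (wanted, installed_or_None)} for every mismatch."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            problems[name] = (wanted, installed)
    return problems
```

A possible use is `check_pins(parse_pins(open("requirements.txt").read()))`; an empty dict would mean every pin matches.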


In [2]:
# install dependencies
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.21.0
  Downloading transformers-4.21.0-py3-none-any.whl (4.7 MB)
Collecting streamlit==1.11.1
  Downloading streamlit-1.11.1-py2.py3-none-any.whl (9.1 MB)
Collecting pyngrok==4.1.1
  Downloading pyngrok-4.1.1.tar.gz (18 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 k

In [None]:
# you can use this to check if installation is working
!streamlit hello
# need to stop this manually before running your own streamlit instance!

2022-08-04 10:23:03.599 INFO    numexpr.utils: NumExpr defaulting to 2 threads.

  Welcome to Streamlit. Check out our demo in your browser.

  Network URL: http://172.28.0.2:8501
  External URL: http://35.243.188.1:8501

  Ready to create your own Python apps super quickly?
  Head over to https://docs.streamlit.io

  May you create awesome apps!


In [None]:
# DO NOT USE ME
# get train/test dataset
!wget https://raw.githubusercontent.com/FakeNewsChallenge/fnc-1/master/train_stances.csv
!wget https://raw.githubusercontent.com/FakeNewsChallenge/fnc-1/master/train_bodies.csv
!wget https://raw.githubusercontent.com/FakeNewsChallenge/fnc-1/master/competition_test_bodies.csv
!wget https://raw.githubusercontent.com/FakeNewsChallenge/fnc-1/master/competition_test_stances.csv

# get pretrained model weights
# (a github.com /blob/ URL returns the HTML page, so request the /raw/ file instead)
!wget https://github.com/SzeChang/Fake_News_Challenge/raw/main/ModelWeight/fake_model_CNN_LSTM.pt

In [4]:
# check all required files downloaded
!ls

competition_test_bodies.csv   fake_model_CNN_LSTM.pt  train_bodies.csv
competition_test_stances.csv  sample_data	      train_stances.csv


In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.1-py3-none-any.whl (4.7 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninsta

In [3]:
import torch
import torch.nn as nn


class BERT_Arch(nn.Module):
    def __init__(self, bert_head, bert_body):
        super().__init__()
        self.bert_head = bert_head
        self.bert_body = bert_body
        # max pooling over the concatenated features (1536 -> 384)
        self.max_pooling = nn.MaxPool1d(4, stride=4)
        # dropout layer
        self.dropout = nn.Dropout(0.1)
        # relu activation function
        self.relu = nn.ReLU()
        # dense layers
        self.fc = nn.Linear(384, 768)
        self.fc1 = nn.Linear(768, 512)
        # output layer (4 stance classes)
        self.fc2 = nn.Linear(512, 4)
        # log-softmax over the class dimension
        self.softmax = nn.LogSoftmax(dim=1)

    # define the forward pass
    def forward(self, sent_id_head, sent_id_body, mask_head, mask_body):
        # return_dict=False makes each BERT return a (sequence_output, pooled_output) tuple
        _, cls_hs_h = self.bert_head(sent_id_head, attention_mask=mask_head, return_dict=False)
        _, cls_hs_b = self.bert_body(sent_id_body, attention_mask=mask_body, return_dict=False)
        # concatenate headline and body features: (batch, 768 + 768)
        cls_hs = torch.cat((cls_hs_h, cls_hs_b), dim=1)
        # squeeze(0) drops only the dummy leading dim, so a batch of 1 keeps its batch dim
        max_pool_out = self.max_pooling(cls_hs.unsqueeze(0)).squeeze(0)
        fc_out = self.fc(max_pool_out)
        fc_act_out = self.relu(fc_out)
        x = self.fc1(fc_act_out)
        x = self.relu(x)
        x = self.dropout(x)
        # output layer
        x = self.fc2(x)
        # apply log-softmax activation
        x = self.softmax(x)
        return x
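The layer sizes in `BERT_Arch` follow from simple arithmetic: two 768-wide pooled BERT outputs are concatenated to 1536 features, and a max pool with kernel 4 and stride 4 reduces that to the 384 inputs that `self.fc` expects. The snippet below checks this with the standard floor-mode output-length formula for 1-D pooling; the helper name `pool1d_out_len` is hypothetical and no PyTorch is needed to run it.

```python
def pool1d_out_len(length, kernel_size, stride, padding=0, dilation=1):
    """Output length of a 1-D pooling layer (floor mode)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


BERT_HIDDEN = 768  # pooled output width of bert-base-uncased

# Headline and body features are concatenated along the feature axis.
concat_width = 2 * BERT_HIDDEN

# MaxPool1d(4, stride=4) applied to the concatenated features.
pooled_width = pool1d_out_len(concat_width, kernel_size=4, stride=4)

print(concat_width, pooled_width)
```

If the pooling kernel or stride were changed, the input size of `self.fc` would have to change to match `pooled_width`.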

In [5]:
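import torch
from transformers import AutoModel, BertTokenizerFast

# must point to a state dict that was saved from a BERT_Arch model
PATH = 'saved_weights_bert_2.pt'

bert_head = AutoModel.from_pretrained('bert-base-uncased')
bert_body = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# fall back to CPU when no GPU runtime is attached
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BERT_Arch(bert_head, bert_body)
model = model.to(device)
# map_location lets weights saved on a GPU load onto the current device
model.load_state_dict(torch.load(PATH, map_location=device))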

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_

RuntimeError: ignored
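The traceback above is cut off, so the actual cause of the `RuntimeError` is not visible in this notebook. One common cause when calling `load_state_dict` is that the checkpoint's parameter names do not match the model's. The sketch below shows one way to compare the two sets of names; `diff_state_dict_keys` is a hypothetical helper and works on any iterables of strings.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Report parameter names present on only one side."""
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    return {
        # names the model expects but the checkpoint lacks
        "missing": sorted(model_keys - checkpoint_keys),
        # names the checkpoint holds but the model does not define
        "unexpected": sorted(checkpoint_keys - model_keys),
    }
```

In the notebook this could be called as `diff_state_dict_keys(model.state_dict().keys(), torch.load(PATH, map_location="cpu").keys())`, assuming the file at `PATH` holds a plain state dict. Two empty lists would point to a shape mismatch or a different problem instead.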

In [7]:
%%writefile danson.py
# Test: can a helper module be written with %%writefile and imported into app.py?
# YES, it works! (the magic must be the first line of the cell)

def moonlighter():
    return "danson"

Overwriting danson.py


## Working file

In [8]:
%%writefile app.py
# streamlit base .py file
# demonstrates basic functionality of a streamlit app
# (the %%writefile magic must be the first line of the cell)

import streamlit as st
from danson import moonlighter

#---------------------------------#
# Page layout

PAGE_CONFIG = {"page_title":"AI Project Group 18",
              "page_icon":":newspaper:",
              "layout":"wide"}
st.set_page_config(**PAGE_CONFIG)

#---------------------------------#
# Model building



#---------------------------------#
# Main panel

st.write("""
# The Moonlighter: Danson Lim

In this implementation, a *BERT* model is trained for Fake News Stance Detection... *(TO UPDATE THIS)*

Try adjusting the hyperparameters in the sidebar!
""")

user_input = st.text_area("Enter your news here:")

st.write("Below should output Danson.")
st.write(moonlighter())

#---------------------------------#
# Sidebar - Collects user input features into dataframe

st.sidebar.title("Model Customisation Tools")

stance_type = st.sidebar.radio("Stance Type", ("Agree", "Disagree", "Discuss", "Unrelated"), index=3, key=3)
if stance_type == "Agree":
    st.markdown("## Agree")
elif stance_type == "Disagree":
    st.markdown("## Disagree")
elif stance_type == "Discuss":
    st.markdown("## Discuss")
elif stance_type == "Unrelated":
    st.markdown("## Unrelated")

with st.sidebar.header("1. Set Parameters"):
    split_size = st.sidebar.slider('Data split ratio (% for Training Set)', 10, 90, 80, 5)

with st.sidebar.subheader("2. Learning Parameters"):
    parameter_n_estimators = st.sidebar.slider('Number of estimators (n_estimators)', 0, 1000, 100, 100)
    parameter_max_features = st.sidebar.select_slider('Max features (max_features)', options=['auto', 'sqrt', 'log2'])
    parameter_min_samples_split = st.sidebar.slider('Minimum number of samples required to split an internal node (min_samples_split)', 1, 10, 2, 1)
    parameter_min_samples_leaf = st.sidebar.slider('Minimum number of samples required to be at a leaf node (min_samples_leaf)', 1, 10, 2, 1)

with st.sidebar.subheader("3. General Parameters"):
    parameter_random_state = st.sidebar.slider('Seed number (random_state)', 0, 1000, 42, 1)
    parameter_criterion = st.sidebar.select_slider('Performance measure (criterion)', options=['mse', 'mae'])
    parameter_bootstrap = st.sidebar.select_slider('Bootstrap samples when building trees (bootstrap)', options=[True, False])
    parameter_oob_score = st.sidebar.select_slider('Whether to use out-of-bag samples to estimate the R^2 on unseen data (oob_score)', options=[False, True])
    parameter_n_jobs = st.sidebar.select_slider('Number of jobs to run in parallel (n_jobs)', options=[1, -1])



Overwriting app.py
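The "Model building" section of `app.py` is still empty. Once the model is wired in, its four output scores need to be turned into one of the stance names shown in the sidebar. A minimal sketch of that step follows; the label order in `STANCE_LABELS` is an assumption (it must match the order used during training), and `stance_from_scores` is a hypothetical helper.

```python
# Assumed class order; it must match the label encoding used in training.
STANCE_LABELS = ("Agree", "Disagree", "Discuss", "Unrelated")


def stance_from_scores(scores, labels=STANCE_LABELS):
    """Return the label whose score (e.g. log-probability) is highest."""
    if len(scores) != len(labels):
        raise ValueError("expected one score per stance label")
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best]
```

With a PyTorch output row `x`, the call could look like `stance_from_scores(x.tolist())`, and the result could be shown with `st.markdown`.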


## ML app file (from online)

In [None]:
%%writefile ml-app.py

import streamlit as st
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import load_diabetes, load_boston

#---------------------------------#
# Page layout
## Page expands to full width
st.set_page_config(page_title='The Machine Learning App',
    layout='wide')

#---------------------------------#
# Model building
def build_model(df):
    X = df.iloc[:,:-1] # Using all columns except the last one as X
    Y = df.iloc[:,-1] # Selecting the last column as Y

    # Data splitting
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=(100-split_size)/100)
    
    st.markdown('**1.2. Data splits**')
    st.write('Training set')
    st.info(X_train.shape)
    st.write('Test set')
    st.info(X_test.shape)

    st.markdown('**1.3. Variable details**:')
    st.write('X variable')
    st.info(list(X.columns))
    st.write('Y variable')
    st.info(Y.name)

    rf = RandomForestRegressor(n_estimators=parameter_n_estimators,
        random_state=parameter_random_state,
        max_features=parameter_max_features,
        criterion=parameter_criterion,
        min_samples_split=parameter_min_samples_split,
        min_samples_leaf=parameter_min_samples_leaf,
        bootstrap=parameter_bootstrap,
        oob_score=parameter_oob_score,
        n_jobs=parameter_n_jobs)
    rf.fit(X_train, Y_train)

    st.subheader('2. Model Performance')

    st.markdown('**2.1. Training set**')
    Y_pred_train = rf.predict(X_train)
    st.write('Coefficient of determination ($R^2$):')
    st.info( r2_score(Y_train, Y_pred_train) )

    st.write('Error (MSE):')
    st.info( mean_squared_error(Y_train, Y_pred_train) )

    st.markdown('**2.2. Test set**')
    Y_pred_test = rf.predict(X_test)
    st.write('Coefficient of determination ($R^2$):')
    st.info( r2_score(Y_test, Y_pred_test) )

    st.write('Error (MSE):')
    st.info( mean_squared_error(Y_test, Y_pred_test) )

    st.subheader('3. Model Parameters')
    st.write(rf.get_params())

#---------------------------------#
st.write("""

# The Machine Learning App

In this implementation, the *RandomForestRegressor()* function is used to build a regression model using the **Random Forest** algorithm.

Try adjusting the hyperparameters!

""")

#---------------------------------#
# Sidebar - Collects user input features into dataframe
with st.sidebar.header('1. Upload your CSV data'):
    uploaded_file = st.sidebar.file_uploader("Upload your input CSV file", type=["csv"])
    st.sidebar.markdown("""
[Example CSV input file](https://raw.githubusercontent.com/dataprofessor/data/master/delaney_solubility_with_descriptors.csv)
""")

# Sidebar - Specify parameter settings
with st.sidebar.header('2. Set Parameters'):
    split_size = st.sidebar.slider('Data split ratio (% for Training Set)', 10, 90, 80, 5)

with st.sidebar.subheader('2.1. Learning Parameters'):
    parameter_n_estimators = st.sidebar.slider('Number of estimators (n_estimators)', 0, 1000, 100, 100)
    parameter_max_features = st.sidebar.select_slider('Max features (max_features)', options=['auto', 'sqrt', 'log2'])
    parameter_min_samples_split = st.sidebar.slider('Minimum number of samples required to split an internal node (min_samples_split)', 1, 10, 2, 1)
    parameter_min_samples_leaf = st.sidebar.slider('Minimum number of samples required to be at a leaf node (min_samples_leaf)', 1, 10, 2, 1)

with st.sidebar.subheader('2.2. General Parameters'):
    parameter_random_state = st.sidebar.slider('Seed number (random_state)', 0, 1000, 42, 1)
    parameter_criterion = st.sidebar.select_slider('Performance measure (criterion)', options=['mse', 'mae'])
    parameter_bootstrap = st.sidebar.select_slider('Bootstrap samples when building trees (bootstrap)', options=[True, False])
    parameter_oob_score = st.sidebar.select_slider('Whether to use out-of-bag samples to estimate the R^2 on unseen data (oob_score)', options=[False, True])
    parameter_n_jobs = st.sidebar.select_slider('Number of jobs to run in parallel (n_jobs)', options=[1, -1])

#---------------------------------#
# Main panel

# Displays the dataset
st.subheader('1. Dataset')

if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)
    st.markdown('**1.1. Glimpse of dataset**')
    st.write(df)
    build_model(df)
else:
    st.info('Awaiting CSV file upload.')
    if st.button('Press to use Example Dataset'):
        # Boston housing dataset
        boston = load_boston()
        X = pd.DataFrame(boston.data, columns=boston.feature_names)
        Y = pd.Series(boston.target, name='response')
        df = pd.concat( [X,Y], axis=1 )

        st.markdown('The Boston housing dataset is used as the example.')
        st.write(df.head(5))

        build_model(df)


Overwriting ml-app.py
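The `criterion` options `'mse'` and `'mae'` in `ml-app.py` are the names used by older scikit-learn releases; scikit-learn 1.0 renamed them to `'squared_error'` and `'absolute_error'`, and later releases no longer accept the old spellings. If the app is run against a newer scikit-learn, a small translation step like the one sketched below could sit between the sidebar widget and `RandomForestRegressor`. The helper name `modern_criterion` is hypothetical.

```python
# Old criterion names mapped to the names used since scikit-learn 1.0.
_RENAMED_CRITERIA = {
    "mse": "squared_error",
    "mae": "absolute_error",
}


def modern_criterion(name):
    """Translate a legacy criterion name; pass other names through."""
    return _RENAMED_CRITERIA.get(name, name)
```

For example, `criterion=modern_criterion(parameter_criterion)` would keep the sidebar labels unchanged while passing the newer names to the regressor.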


## GitHub file (penguin)

In [None]:
!git clone https://github.com/dataprofessor/code.git

Cloning into 'code'...
remote: Enumerating objects: 657, done.
remote: Counting objects: 100% (21/21), done.
remote: Comp

In [None]:
%cd code/streamlit/part3
!ls

/content/code/streamlit/part3
penguins-app.py       penguins_clf.pkl	    penguins-model-building.py
penguins_cleaned.csv  penguins_example.csv


In [None]:
%%writefile penguin-app.py
# test penguin-app.py file, from https://www.youtube.com/watch?v=Eai1jaZrRDs&list=PLtqF5YXg7GLmCvTswG32NqQypOuYkPRUE&index=3

import streamlit as st
import pandas as pd
import numpy as np
import pickle
from sklearn.ensemble import RandomForestClassifier

st.write("""
# Penguin Prediction App
This app predicts the **Palmer Penguin** species!
Data obtained from the [palmerpenguins library](https://github.com/allisonhorst/palmerpenguins) in R by Allison Horst.
""")

st.sidebar.header('User Input Features')

st.sidebar.markdown("""
[Example CSV input file](https://raw.githubusercontent.com/dataprofessor/data/master/penguins_example.csv)
""")

# Collects user input features into dataframe
uploaded_file = st.sidebar.file_uploader("Upload your input CSV file", type=["csv"])
if uploaded_file is not None:
    input_df = pd.read_csv(uploaded_file)
else:
    def user_input_features():
        island = st.sidebar.selectbox('Island',('Biscoe','Dream','Torgersen'))
        sex = st.sidebar.selectbox('Sex',('male','female'))
        bill_length_mm = st.sidebar.slider('Bill length (mm)', 32.1,59.6,43.9)
        bill_depth_mm = st.sidebar.slider('Bill depth (mm)', 13.1,21.5,17.2)
        flipper_length_mm = st.sidebar.slider('Flipper length (mm)', 172.0,231.0,201.0)
        body_mass_g = st.sidebar.slider('Body mass (g)', 2700.0,6300.0,4207.0)
        data = {'island': island,
                'bill_length_mm': bill_length_mm,
                'bill_depth_mm': bill_depth_mm,
                'flipper_length_mm': flipper_length_mm,
                'body_mass_g': body_mass_g,
                'sex': sex}
        features = pd.DataFrame(data, index=[0])
        return features
    input_df = user_input_features()

# Combines user input features with entire penguins dataset
# This will be useful for the encoding phase
penguins_url = 'https://raw.githubusercontent.com/dataprofessor/code/master/streamlit/part3/penguins_cleaned.csv'
penguins_raw = pd.read_csv(penguins_url)
penguins = penguins_raw.drop(columns=['species'])
df = pd.concat([input_df,penguins],axis=0)

# One-hot encoding of categorical features
# https://www.kaggle.com/pratik1120/penguin-dataset-eda-classification-and-clustering
encode = ['sex','island']
for col in encode:
    dummy = pd.get_dummies(df[col], prefix=col)
    df = pd.concat([df,dummy], axis=1)
    del df[col]
df = df[:1] # Selects only the first row (the user input data)

# Displays the user input features
st.subheader('User Input features')

if uploaded_file is not None:
    st.write(df)
else:
    st.write('Awaiting CSV file to be uploaded. Currently using example input parameters (shown below).')
    st.write(df)

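# Reads in saved classification model (re-upload it after a runtime restart)
# Assumption: penguins_clf.pkl sits in the working directory (code/streamlit/part3)
DATA_PATH = '.'
with open(DATA_PATH + '/penguins_clf.pkl', 'rb') as penguins_pickle:
    load_clf = pickle.load(penguins_pickle)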

# Apply model to make predictions
prediction = load_clf.predict(df)
prediction_proba = load_clf.predict_proba(df)


st.subheader('Prediction')
penguins_species = np.array(['Adelie','Chinstrap','Gentoo'])
st.write(penguins_species[prediction])

st.subheader('Prediction Probability')
st.write(prediction_proba)

Overwriting penguin-app.py
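The penguin app concatenates the single input row with the whole dataset before encoding so that every category (each island, each sex) gets its own dummy column. An alternative that avoids downloading the dataset on every rerun is to encode against a fixed category list. The sketch below illustrates the idea in plain Python; `CATEGORIES` and `one_hot_row` are hypothetical names, and the column order the pickled classifier expects is not verified here.

```python
# Assumed category values, taken from the select boxes in the app above.
CATEGORIES = {
    "sex": ("female", "male"),
    "island": ("Biscoe", "Dream", "Torgersen"),
}


def one_hot_row(row, categories=CATEGORIES):
    """One-hot encode the categorical fields of a single input row."""
    # Numeric fields are copied through unchanged.
    encoded = {k: v for k, v in row.items() if k not in categories}
    for col, values in categories.items():
        if row[col] not in values:
            raise ValueError(f"unknown {col}: {row[col]!r}")
        for value in values:
            encoded[f"{col}_{value}"] = int(row[col] == value)
    return encoded
```

The resulting dict could be wrapped in `pd.DataFrame(encoded, index=[0])` and reordered to match the training columns before calling `predict`.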


# Running streamlit instance

In [3]:
# check if app.py has been written to colab sandbox
!ls

app.py	danson.py  requirements.txt  sample_data


In [4]:
# ngrok authentication, only needs to be done once at start of runtime
# replace the placeholder with your own token and keep real tokens out of shared notebooks
!ngrok authtoken YOUR_NGROK_AUTHTOKEN

Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml


In [5]:
# start streamlit app instance
!streamlit run app.py &>/dev/null& # change app.py to your streamlit app name
!pgrep streamlit # outputs streamlit process number (required for killing)

288
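Because `streamlit run` is launched in the background, the server may not be listening yet when the tunnel is opened in the next cell. A small wait loop on port 8501 is one way to guard against that. This is a sketch using only the standard library; `wait_for_port` is a hypothetical helper and the timeout values are arbitrary.

```python
import socket
import time


def wait_for_port(port, host="127.0.0.1", timeout=30.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Running `wait_for_port(8501)` before `ngrok.connect` would return `True` once Streamlit accepts connections, or `False` if it never comes up.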


In [6]:
from pyngrok import ngrok
# setup tunnel to 8501 (streamlit port)
pub_url = ngrok.connect(port='8501')
print(pub_url) # generates url for app

http://46a5-34-125-122-152.ngrok.io


In [None]:
# shutdown
!kill 288 # change the process number
ngrok.kill()
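The process number in the shutdown cell has to be edited by hand after every restart. One way to avoid that, sketched below, is to ask `pgrep` for the PID from Python and parse its output. Both helper names are hypothetical, and `find_pids` assumes `pgrep` is available on the runtime (as it is in the cells above).

```python
import subprocess


def parse_pids(pgrep_output):
    """Turn pgrep's one-PID-per-line output into a list of ints."""
    return [int(tok) for tok in pgrep_output.split() if tok.isdigit()]


def find_pids(pattern="streamlit"):
    """Return the PIDs of processes whose name matches the pattern."""
    result = subprocess.run(["pgrep", pattern], capture_output=True, text=True)
    return parse_pids(result.stdout)
```

Each returned PID could then be stopped with `os.kill(pid, signal.SIGTERM)` before calling `ngrok.kill()`.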