# How to use

The Streamlit UI code MUST live in a separate `.py` file; it cannot run as cells in the `.ipynb` notebook.

In this notebook, I use `%%writefile app.py` to write a new Streamlit `.py` file to the Colab runtime's working directory (you can check with `!ls`).

Once your Streamlit `.py` file is ready, skip forward to the last section, **Running streamlit instance**, to run it.
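
For anyone working outside a notebook, roughly the same "write the script to disk" step can be done in plain Python. This is only a sketch: the file name `app_sketch.py` and the contents of `APP_SOURCE` are hypothetical stand-ins, not the app built later in this notebook.

```python
from pathlib import Path

# Hypothetical stand-in for the real app source; the notebook writes its own app.py.
APP_SOURCE = '''import streamlit as st

st.write("Hello from a standalone script")
'''


def write_app(path: str, source: str) -> Path:
    """Write `source` to `path`, overwriting any existing file (like %%writefile)."""
    target = Path(path)
    target.write_text(source, encoding="utf-8")
    return target


app_path = write_app("app_sketch.py", APP_SOURCE)
print(app_path.name, app_path.stat().st_size, "bytes")
```

The script is only written here, never imported, so Streamlit does not need to be installed for this step.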

# Resources

[Main page](https://streamlit.io/)

[Documentation](https://docs.streamlit.io/library/get-started/main-concepts)

[YouTube tutorial code (GitHub repo)](https://github.com/dataprofessor/ml-app)

[Linking streamlit and colab](https://medium.com/@jcharistech/how-to-run-streamlit-apps-from-colab-29b969a1bdfc)

In [None]:
# ignore me
!streamlit run https://raw.githubusercontent.com/streamlit/demo-uber-nyc-pickups/master/streamlit_app.py

2022-07-25 09:49:33.408 INFO    numexpr.utils: NumExpr defaulting to 2 threads.

  You can now view your Streamlit app in your browser.

  Network URL: http://172.28.0.2:8501
  External URL: http://34.73.210.146:8501

  Stopping...
^C


# Code for streamlit app starts here

In [5]:
%%writefile requirements.txt

transformers==4.21.0 # for BERT, pytorch already inbuilt to colab
streamlit==1.11.1
pyngrok==4.1.1 # newer versions don't work

Writing requirements.txt
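
Because the pins matter here (the `pyngrok` comment says newer versions don't work), it can help to compare them with what is actually installed. The sketch below parses the same three pins and reports any mismatch. `parse_pins` and `installed_mismatches` are hypothetical helper names, and the comment stripping is deliberately naive.

```python
from importlib import metadata


def parse_pins(text: str) -> dict:
    """Return {package: version} for every `name==version` line, ignoring comments."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # naive: drop everything after '#'
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def installed_mismatches(pins: dict) -> dict:
    """Return {package: installed_version_or_None} for pins that are not satisfied."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            mismatches[name] = found
    return mismatches


REQUIREMENTS = """
transformers==4.21.0 # for BERT, pytorch already inbuilt to colab
streamlit==1.11.1
pyngrok==4.1.1 # newer versions don't work
"""

pins = parse_pins(REQUIREMENTS)
print(pins)
print(installed_mismatches(pins))
```

An empty second dictionary means every pinned package is installed at exactly the requested version.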


In [6]:
# install dependencies
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.21.0
  Downloading transformers-4.21.0-py3-none-any.whl (4.7 MB)
Collecting streamlit==1.11.1
  Downloading streamlit-1.11.1-py2.py3-none-any.whl (9.1 MB)
Collecting pyngrok==4.1.1
  Downloading pyngrok-4.1.1.tar.gz (18 kB)
Collecting validators>=0.2
  Downloading validators-0.20.0.tar.gz (30 kB)
Collecting gitpython!=3.1.19
  Downloading GitPython-3.1.27-py3-none-any.whl (181 kB)
Collecting pydeck>=0.1.dev5
  Downloading pydeck-0.7.1-py2.py3-none-any.whl (4.3 MB)
Collecting rich>=10.11.0
  Downloading rich-12.5.1-py3-none-any.whl (235 kB)

In [1]:
from google.colab import drive 
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import torch
import torch.nn as nn

class BERT_Arch(nn.Module):
    def __init__(self, bert_head, bert_body):
        super().__init__()
        self.bert_head = bert_head
        self.bert_body = bert_body
        # max pooling layer: shrinks the 1536-wide concatenation to 384
        self.max_pooling = nn.MaxPool1d(4, stride=4)
        # dropout layer
        self.dropout = nn.Dropout(0.1)
        # relu activation function
        self.relu = nn.ReLU()
        # dense layers
        self.fc = nn.Linear(384, 768)
        self.fc1 = nn.Linear(768, 512)
        # output layer (4 classes)
        self.fc2 = nn.Linear(512, 4)
        # log-softmax over the class dimension
        self.softmax = nn.LogSoftmax(dim=1)

    # define the forward pass
    def forward(self, sent_id_head, sent_id_body, mask_head, mask_body):
        # return_dict=False makes each encoder return a (sequence_output, pooled_output)
        # tuple, so the unpacking below yields tensors instead of ModelOutput keys
        _, cls_hs_h = self.bert_head(sent_id_head, attention_mask=mask_head, return_dict=False)
        _, cls_hs_b = self.bert_body(sent_id_body, attention_mask=mask_body, return_dict=False)
        # concatenate head and body: (batch, 768) + (batch, 768) -> (batch, 1536)
        cls_hs = torch.cat((cls_hs_h, cls_hs_b), dim=1)
        # squeeze(0) removes only the dummy dim added for pooling,
        # so a batch of one sample keeps its batch dimension
        max_pool_out = self.max_pooling(cls_hs.unsqueeze(0)).squeeze(0)
        fc_out = self.fc(max_pool_out)
        fc_act_out = self.relu(fc_out)
        x = self.fc1(fc_act_out)
        x = self.relu(x)
        x = self.dropout(x)
        # output layer
        x = self.fc2(x)
        # apply log-softmax activation
        x = self.softmax(x)
        return x
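
The layer sizes above fit together because of simple shape arithmetic: two 768-wide pooled BERT vectors are concatenated and then max-pooled with kernel 4 and stride 4, which is why the first dense layer takes 384 inputs. A small sketch of that calculation follows. `pooled_length` is a hypothetical helper, and the formula assumes no padding and dilation 1, matching the `nn.MaxPool1d(4, stride=4)` call.

```python
def pooled_length(length: int, kernel: int, stride: int) -> int:
    """Output length of a 1-D max pool with no padding and dilation 1."""
    return (length - kernel) // stride + 1


HIDDEN = 768          # width of the pooled output of bert-base-uncased
concat = 2 * HIDDEN   # head and body vectors concatenated along dim=1
pooled = pooled_length(concat, kernel=4, stride=4)

print(concat, pooled)  # 1536 384
```

If the pooling kernel or stride changes, the input size of `self.fc` has to change with it.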

In [4]:
import torch
from transformers import AutoModel, BertTokenizerFast

PATH = '/content/drive/MyDrive/saved_weights_bert_2.pt'

bert_head = AutoModel.from_pretrained('bert-base-uncased')
bert_body = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

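# use the GPU when the runtime has one, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BERT_Arch(bert_head, bert_body)
model = model.to(device)
model.eval()  # inference mode: disables dropout
# map_location lets GPU-saved weights load on a CPU-only runtime
model.load_state_dict(torch.load(PATH, map_location=device))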

Downloading config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/420M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Downloading tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/455k [00:00<?, ?B/s]

<All keys matched successfully>

In [None]:
# tokenisation and tensor
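
This cell was left empty. In the real pipeline this step would presumably go through the `tokenizer` loaded above, called with padding, truncation, a maximum length and `return_tensors="pt"`. What that produces can be illustrated without any library: a fixed-length list of token ids plus an attention mask that marks real tokens with 1 and padding with 0. In the toy sketch below, `pad_and_mask` is a hypothetical helper and the middle token ids are made up for illustration; 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of `bert-base-uncased`.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or right-pad `token_ids` to `max_len` and build the attention mask."""
    ids = list(token_ids[:max_len])
    mask = [1] * len(ids)          # 1 = real token
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 = padding


ids, mask = pad_and_mask([101, 7592, 2088, 102], max_len=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Unlike this toy version, a real tokenizer keeps the closing special token when it truncates, so the helper is for illustration only.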




In [None]:
%%writefile danson.py
# test whether a helper module can be written with %%writefile and imported into app.py
# YES it works! ignore this cell now

def moonlighter():
  return "danson"

## Working file

In [22]:
%%writefile app.py
# streamlit base .py file

import streamlit as st
import pandas as pd
# from danson import *

#---------------------------------#
# Page layout

PAGE_CONFIG = {"page_title":"AI Project Group 18",
              "page_icon":":newspaper:",
              "layout":"wide"}
st.set_page_config(**PAGE_CONFIG)

#---------------------------------#
# Model building

#TODO: load model and prediction functions here (or separate .py file?)
# all the logic should go here






# NOTE: this helper appears to be left over from a Keras-style pipeline and is not called yet.
# `preprocess`, `tokenizer` and `pad_sequences` are not defined in this file,
# so calling it as-is raises NameError.
def convert_input(user_input):
  processed_input = " ".join(preprocess(user_input))
  processed_input = pd.Series([processed_input])
  processed_sequences = tokenizer.texts_to_sequences(processed_input)
  processed_test = pad_sequences(processed_sequences, maxlen=40, truncating='post')
  return processed_test





#---------------------------------#
# Main panel

st.write("""
# unReal

#### *Fake News Classification and Prediction through AI*

In this implementation, a *BERT* model is used for Fake News Stance Detection... *(TO UPDATE THIS)*

Try adjusting the hyperparameters in the sidebar!
""")

user_input_head = st.text_area("Enter your news header here:")
user_input_body = st.text_area("Enter your news body here:")

st.write("# Inputted News")
st.write("#### (only has output below when there are both header and body inputs)")
if user_input_head != "" and user_input_body != "":
  st.write("Header:", user_input_head)
  st.write("Body:", user_input_body)

# if user_input !="":
#   processed_input = convert_input(user_input)
# 	prediction = model.predict(processed_input)
# 	if prediction.item() > 0.5:
# 		st.markdown("## Warning: Fake News Detected 👎")
# 		st.write("Your News: ")
# 		st.write(user_input)
# 	else:
# 		st.markdown("## Hurrah: Real News Detected 👍")
# 		st.markdown("Your News: ")
# 		st.write(user_input)

#TODO: remember to remove this LOL
# st.write("Below should output Danson.")
# st.write(moonlighter())

#---------------------------------#
# Sidebar - Collects user input features into dataframe

st.sidebar.title("Model Customisation Tools")

stance_type = st.sidebar.radio("Stance Type", ("Agree", "Disagree", "Discusses", "Unrelated"), index=3, key=3)
st.markdown("# Selected Stance Type")
# the radio widget always returns one of the four options, so echo it directly
st.markdown(f"## {stance_type}")

#TODO: remove everything below if not needed
st.sidebar.header("*Everything below is for show only*")
with st.sidebar.header("1. Set Parameters"):
  split_size = st.sidebar.slider('Data split ratio (% for Training Set)', 10, 90, 80, 5)

with st.sidebar.subheader("2. Learning Parameters"):
  parameter_n_estimators = st.sidebar.slider('Number of estimators (n_estimators)', 0, 1000, 100, 100)
  parameter_max_features = st.sidebar.select_slider('Max features (max_features)', options=['auto', 'sqrt', 'log2'])
  parameter_min_samples_split = st.sidebar.slider('Minimum number of samples required to split an internal node (min_samples_split)', 1, 10, 2, 1)
  parameter_min_samples_leaf = st.sidebar.slider('Minimum number of samples required to be at a leaf node (min_samples_leaf)', 1, 10, 2, 1)

with st.sidebar.subheader("3. General Parameters"):
  parameter_random_state = st.sidebar.slider('Seed number (random_state)', 0, 1000, 42, 1)
  parameter_criterion = st.sidebar.select_slider('Performance measure (criterion)', options=['mse', 'mae'])
  parameter_bootstrap = st.sidebar.select_slider('Bootstrap samples when building trees (bootstrap)', options=[True, False])
  parameter_oob_score = st.sidebar.select_slider('Whether to use out-of-bag samples to estimate the R^2 on unseen data (oob_score)', options=[False, True])
  parameter_n_jobs = st.sidebar.select_slider('Number of jobs to run in parallel (n_jobs)', options=[1, -1])



Overwriting app.py
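
The commented-out prediction block in `app.py` still assumes a single score compared against 0.5, whereas `BERT_Arch` ends in a 4-way `LogSoftmax`. One possible way to turn such an output row into a stance label is sketched below. The label order in `STANCES` is an assumption borrowed from the sidebar radio options and must match whatever encoding was used in training. `to_stance` is a hypothetical helper, and the example scores are made up.

```python
import math

# Assumed label order; it must match the label encoding used during training.
STANCES = ("Agree", "Disagree", "Discusses", "Unrelated")


def to_stance(log_probs):
    """Return (label, probability) for the highest-scoring class."""
    if len(log_probs) != len(STANCES):
        raise ValueError("expected one score per stance")
    best = max(range(len(log_probs)), key=lambda i: log_probs[i])
    # exp() converts the log-probability back into a probability
    return STANCES[best], math.exp(log_probs[best])


# Hypothetical log-probabilities, shaped like one row of LogSoftmax output.
example = [math.log(p) for p in (0.1, 0.2, 0.6, 0.1)]
label, prob = to_stance(example)
print(label, round(prob, 2))
```

With a real model, the input would be one row of the output tensor converted to a plain list, for example via `.tolist()`.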


## ML app file (for ref)

In [None]:
%%writefile ml-app.py

import streamlit as st
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import load_diabetes, load_boston

#---------------------------------#
# Page layout
## Page expands to full width
st.set_page_config(page_title='The Machine Learning App',
    layout='wide')

#---------------------------------#
# Model building
def build_model(df):
    X = df.iloc[:,:-1] # Using all column except for the last column as X
    Y = df.iloc[:,-1] # Selecting the last column as Y

    # Data splitting
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=(100-split_size)/100)
    
    st.markdown('**1.2. Data splits**')
    st.write('Training set')
    st.info(X_train.shape)
    st.write('Test set')
    st.info(X_test.shape)

    st.markdown('**1.3. Variable details**:')
    st.write('X variable')
    st.info(list(X.columns))
    st.write('Y variable')
    st.info(Y.name)

    rf = RandomForestRegressor(n_estimators=parameter_n_estimators,
        random_state=parameter_random_state,
        max_features=parameter_max_features,
        criterion=parameter_criterion,
        min_samples_split=parameter_min_samples_split,
        min_samples_leaf=parameter_min_samples_leaf,
        bootstrap=parameter_bootstrap,
        oob_score=parameter_oob_score,
        n_jobs=parameter_n_jobs)
    rf.fit(X_train, Y_train)

    st.subheader('2. Model Performance')

    st.markdown('**2.1. Training set**')
    Y_pred_train = rf.predict(X_train)
    st.write('Coefficient of determination ($R^2$):')
    st.info( r2_score(Y_train, Y_pred_train) )

    st.write('Error (MSE or MAE):')
    st.info( mean_squared_error(Y_train, Y_pred_train) )

    st.markdown('**2.2. Test set**')
    Y_pred_test = rf.predict(X_test)
    st.write('Coefficient of determination ($R^2$):')
    st.info( r2_score(Y_test, Y_pred_test) )

    st.write('Error (MSE or MAE):')
    st.info( mean_squared_error(Y_test, Y_pred_test) )

    st.subheader('3. Model Parameters')
    st.write(rf.get_params())

#---------------------------------#
st.write("""

# The Machine Learning App

In this implementation, the *RandomForestRegressor()* function is used to build a regression model using the **Random Forest** algorithm.

Try adjusting the hyperparameters!

""")

#---------------------------------#
# Sidebar - Collects user input features into dataframe
with st.sidebar.header('1. Upload your CSV data'):
    uploaded_file = st.sidebar.file_uploader("Upload your input CSV file", type=["csv"])
    st.sidebar.markdown("""
[Example CSV input file](https://raw.githubusercontent.com/dataprofessor/data/master/delaney_solubility_with_descriptors.csv)
""")

# Sidebar - Specify parameter settings
with st.sidebar.header('2. Set Parameters'):
    split_size = st.sidebar.slider('Data split ratio (% for Training Set)', 10, 90, 80, 5)

with st.sidebar.subheader('2.1. Learning Parameters'):
    parameter_n_estimators = st.sidebar.slider('Number of estimators (n_estimators)', 0, 1000, 100, 100)
    parameter_max_features = st.sidebar.select_slider('Max features (max_features)', options=['auto', 'sqrt', 'log2'])
    parameter_min_samples_split = st.sidebar.slider('Minimum number of samples required to split an internal node (min_samples_split)', 1, 10, 2, 1)
    parameter_min_samples_leaf = st.sidebar.slider('Minimum number of samples required to be at a leaf node (min_samples_leaf)', 1, 10, 2, 1)

with st.sidebar.subheader('2.2. General Parameters'):
    parameter_random_state = st.sidebar.slider('Seed number (random_state)', 0, 1000, 42, 1)
    parameter_criterion = st.sidebar.select_slider('Performance measure (criterion)', options=['mse', 'mae'])
    parameter_bootstrap = st.sidebar.select_slider('Bootstrap samples when building trees (bootstrap)', options=[True, False])
    parameter_oob_score = st.sidebar.select_slider('Whether to use out-of-bag samples to estimate the R^2 on unseen data (oob_score)', options=[False, True])
    parameter_n_jobs = st.sidebar.select_slider('Number of jobs to run in parallel (n_jobs)', options=[1, -1])

#---------------------------------#
# Main panel

# Displays the dataset
st.subheader('1. Dataset')

if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)
    st.markdown('**1.1. Glimpse of dataset**')
    st.write(df)
    build_model(df)
else:
    st.info('Awaiting CSV file upload.')
    if st.button('Press to use Example Dataset'):
        # Boston housing dataset
        boston = load_boston()
        X = pd.DataFrame(boston.data, columns=boston.feature_names)
        Y = pd.Series(boston.target, name='response')
        df = pd.concat( [X,Y], axis=1 )

        st.markdown('The Boston housing dataset is used as the example.')
        st.write(df.head(5))

        build_model(df)


Overwriting ml-app.py


# Running streamlit instance

In [2]:
# check if app.py has been written to colab sandbox
!ls

app.py	drive  requirements.txt  sample_data


In [3]:
# ngrok authentication, only needs to be done once at start of runtime
# replace the placeholder with your own token and keep real tokens out of shared notebooks
!ngrok authtoken YOUR_NGROK_AUTHTOKEN

Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml


In [24]:
# start streamlit app instance
!streamlit run app.py &>/dev/null& # change app.py to your streamlit app name
!pgrep streamlit # outputs streamlit process number (required for killing)

891
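
An alternative to `&` plus `pgrep` is to start the server from Python and keep the process handle, so the PID never has to be copied by hand. The sketch below uses only the standard library. `start_background` and `stop` are hypothetical helper names, and a short `sleep` command stands in for Streamlit so the example runs anywhere.

```python
import subprocess
import sys


def start_background(cmd):
    """Start `cmd` without blocking and discard its output (like `&>/dev/null &`)."""
    return subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)


def stop(proc, timeout=5.0):
    """Ask the process to terminate, then force-kill it if it does not exit in time."""
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()


# Stand-in command so the sketch runs anywhere; for the app it would be
# ["streamlit", "run", "app.py"].
proc = start_background([sys.executable, "-c", "import time; time.sleep(30)"])
print("pid:", proc.pid)

stop(proc)
print("exit code:", proc.returncode)
```

Keeping `proc` around replaces the manual "change the process number" step in the shutdown cell.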


In [25]:
from pyngrok import ngrok
# setup tunnel to 8501 (streamlit port)
pub_url = ngrok.connect(port='8501')
print(pub_url) # generates url for app

http://2c9a-34-143-164-186.ngrok.io
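
If the public URL shows a tunnel error, one common cause is that nothing is listening on the local port yet. A quick standard-library check is sketched below. `port_is_open` is a hypothetical helper, and the self-check uses a throwaway listener rather than a real Streamlit server.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Self-check against a throwaway listener on a free port chosen by the OS.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    free_port = server.getsockname()[1]
    reachable = port_is_open("127.0.0.1", free_port)

print(reachable)
```

In this notebook, the equivalent call would be `port_is_open("localhost", 8501)` before opening the tunnel.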


In [26]:
# shutdown
!kill 891 # change the process number
ngrok.kill()