# SEC Filing Predictive Model

## Overview of this notebook

### Preprocess Summary

- Number of records: The reports with less than 500 word phrases are excluded. 44817 records are used in building our models. 

- Features Selection: After data exploration, the non-text features are excluded because they don't provide much value to our models but add complexity of data process and modeling. 

- Text Features Extraction: The text can generate 5,449,804 unique tokens which is too large to process. In the TFIDF process, I tried to keep the number of features below 100,000 to limit the computing power that is needed. The word phrases that are in 90% of the reports and the word phrases that are in less than 70 reports are excluded. The shape of the data we feed to the models is (44817, 97629). The Matrix transformed to CSR (Compressed Sparse Row) format to minimize ram usage. 

- Cross Validation: 20% of the data is set to be the test dataset. The Stratified K Fold is used to allocate the train and the test sets. 

#### Models (Traditional Machine Learning Classification)

Five tradition classification model are run. The Logistic Regression has the higher accuracy. Below is the accuracy of each model.  

- Logistic Regression - 73%
- K-Nearest Neighbors - 65%
- Random Forest - 72%
- Support Vector Machine - 73% 

#### Models (Neural Network)

- ANN-a: node: 64, learning rate = 0.001 (default), layer 1, dropout = 0 (default), batch size = 32
    * Best Result: 1st epoch, train 72.54%, test 72.6%
- ANN-b: node: 64, learning rate = 0.001 (default), layer 1, dropout = 0.8 (default), batch size = 32
    * Best Result: 3rd epoch, train 72.58%, test 72.58%
- ANN-c: node: 64, learning rate = 0.005 (default), layer 1, dropout = 0.2 (default), batch size = 32
    * Best Result: 1st epoch, train 72.49%, test 72.55%
- ANN-d: node: (1000, 500, 50), learning rate = 0.01, layer 3 , dropout = (0.5, 0.5, 0.5), batch size = 32
    * Best Result: 1st epoch, train 72.47%, test 72.52%
- RNN-LSTM: 
    * Limit the data set to the top 100,000 words.
    * Set the max number of words in each reports at 2500.
    * Result: train 72.53%, test 72.52% for each epoch


In [1]:
%%time
import numpy as np
import sklearn
import pandas as pd 
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from tensorflow.keras.optimizers import Adadelta,Adam,RMSprop
from keras.utils import np_utils

from sklearn.preprocessing import StandardScaler

import sys  #system specific parameters and names
import gc   #garbage collector interface

from scipy.sparse import hstack, csr_matrix

from sklearn.feature_extraction.text import TfidfVectorizer
import tensorflow as tf 

import category_encoders as ce

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings("ignore")

CPU times: user 5.28 s, sys: 1.18 s, total: 6.46 s
Wall time: 15.5 s


In [2]:
import numpy as np
import pandas as pd

In [3]:
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

## Data Preparation

### Load Data

The index sec pickle file is loaded. It contains stock price, and the bag of word phrase of the reports. 

In [4]:
%%time
file_in ='/Users/wailunchung/Documents/GitHub/Capstone_data/index_sec'
import pickle as pickle
with open(file_in, "rb") as fh:
    data2 = pickle.load(fh)


CPU times: user 18.2 s, sys: 19.4 s, total: 37.6 s
Wall time: 49.1 s


### Run basic cleaning 

#### exclude short reports

The reports with less than 500 word phrases are excluded to ensure they are the full reports. 



In [5]:
%%time
# remove <500
data2 = data2.loc[(data2['file_text_length'] >= 500)]

CPU times: user 51.7 ms, sys: 236 ms, total: 288 ms
Wall time: 1.11 s


#### Response Variable

Originality, the response variable is if the percentage change after 20 days of the reporting date is higher than 5 percent. The per_change_exceeding is calculated. 

The eact same filing could have a different response depending on the day it was filed. We adjust the response variable to reflect a relative performance between each company after 20 days of the reporting. We calculate the mean percentage change response for each week and subtract that value from each individual response. The adjusted percentage change can factor in the dynamic market conditions in the period. 

Response variable: the stock price after 20 days of filing exceed 5% of the average stock price change of the same period. 

In [6]:
# get year and week of the filing
data2['weeknum'] = data2['FileDate'].dt.week
data2['year'] = data2['FileDate'].dt.year

In [7]:
yearweek_average = data2.groupby(['year', 'weeknum']).agg({'Pct_Change_20':['mean', 'min', 'max']})
yearweek_average.columns = ['Pct_Change_20_mean', 'Pct_Change_20_min', 'Pct_Change_20_max']
yearweek_average = yearweek_average.reset_index()

In [8]:
yearweek_average.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171 entries, 0 to 170
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   year                171 non-null    int64  
 1   weeknum             171 non-null    int64  
 2   Pct_Change_20_mean  171 non-null    float64
 3   Pct_Change_20_min   171 non-null    float64
 4   Pct_Change_20_max   171 non-null    float64
dtypes: float64(3), int64(2)
memory usage: 6.8 KB


In [9]:
data2 = pd.merge(data2, yearweek_average,  how='left', left_on=['year','weeknum'], right_on = ['year','weeknum'])

In [10]:
data2['adj_Pct_Change_20'] = data2['Pct_Change_20'] - data2['Pct_Change_20_mean']

In [11]:
%%time
# get response 
data2['per_change_exceeding'] = np.where(data2['adj_Pct_Change_20'] > 5, 1, 0)

CPU times: user 2.62 ms, sys: 1.22 ms, total: 3.84 ms
Wall time: 2.94 ms


In [12]:
data2.head(2)

Unnamed: 0,CompanyCIK,CompanyName,FileType,FileDate,EdgarTextUrl,EdgarHtmlUrl,AccessionNumber,SecFileName,CompanyTicker,FileDate_ClosingPrice,...,FileName,f_text,file_text_length,weeknum,year,Pct_Change_20_mean,Pct_Change_20_min,Pct_Change_20_max,adj_Pct_Change_20,per_change_exceeding
0,717954,UNIFIRST CORP,10-Q,2019-01-03,edgar/data/717954/0001284084-19-000002.txt,edgar/data/717954/0001284084-19-000002-index.html,0001284084-19-000002,2019-QTR1,UNF,133.860001,...,717954_0001284084-19-000002.txt,"[various estimate, the result, timely decision...",1870,1,2019,3.403937,-5.583489,21.985292,-1.424262,0
1,1084765,RESOURCES CONNECTION INC,10-Q,2019-01-03,edgar/data/1084765/0001193125-19-001543.txt,edgar/data/1084765/0001193125-19-001543-index....,0001193125-19-001543,2019-QTR1,RGP,13.6,...,1084765_0001193125-19-001543.txt,"[asc topic contract term, limited number, the ...",2305,1,2019,3.403937,-5.583489,21.985292,18.581354,1


### Other Company related information: Industry and Sector

Below extract the industry and sector of the companies. It could be used for our model, if we decide not to use these variable as features to our model, it can be used for checking the performance of the models. e.g. which industries get the best prediction from our models...

There is 6839 tickers. 

#### prepare unique company ticker df because yahoo finance library takes time to call

In [13]:
data2['CompanyTicker'] = data2['CompanyTicker'].astype('str') 

In [14]:
ticker_df = data2.CompanyTicker.value_counts()
ticker_df = ticker_df.reset_index()
ticker_df.columns = ['ticker','ticker_cnt']
ticker_df['industry'] = ''
ticker_df['beta'] = ''
ticker_df['sector'] = ''

In [15]:
ticker_df

Unnamed: 0,ticker,ticker_cnt,industry,beta,sector
0,CNNC,29,,,
1,DKMR,21,,,
2,SIPN,20,,,
3,SRCO,17,,,
4,MEPW,15,,,
...,...,...,...,...,...
6835,INRE,1,,,
6836,OPAD,1,,,
6837,LTCH,1,,,
6838,MIR,1,,,


In [None]:
file_name = '/Users/wailunchung/Documents/GitHub/capstone_projectb/tickers_industry.csv'
#ticker_df.head(349).to_csv(file_name, index=False)

In [16]:
file_name = '/Users/wailunchung/Documents/GitHub/capstone_projectb/tickers_industry.csv'

In [None]:
%%time
import yfinance as yf
for index, row in ticker_df.iterrows():
    if index > 6127:
        print(row['ticker'])
        print(str(index) + '/6839') 
        info = yf.Ticker(row['ticker']).info
        industry = info.get('industry')
        beta = info.get('beta')
        sector = info.get('sector')
        ticker_df.at[index,'industry'] = industry
        ticker_df.at[index,'beta'] = beta
        ticker_df.at[index,'sector'] = sector
        ticker_df[index:index+1].to_csv(file_name, mode='a', index=False, header=False)


PTRA
6128/6839
ACXP
6129/6839
XMTR
6130/6839
AOMR
6131/6839
COUR
6132/6839
ZIP
6133/6839
MCW
6134/6839
TKNO
6135/6839
ALIT
6136/6839
UOLI
6137/6839


## Data Exploratory 

### Check the number of companies

Company Ticker can be one important feature, however, there are 6840 unique tickers from this column. The one-hot encoding method will create too many features that overflow our computing capacity. We consider:

1. exclude the ticker column from our model. 
2. use binary encoding to reduce the number of features, but it creates dependencies between features. 


In [12]:
data2.groupby('CompanyTicker').size()

CompanyTicker
A       9
AA      9
AAAU    7
AAC     3
AACI    1
       ..
ZYME    9
ZYNE    9
ZYRX    3
ZYXI    9
ZZLL    9
Length: 6840, dtype: int64

In [13]:
len(data2['CompanyTicker'].unique())

6840

In [14]:
y = np.array(data2["per_change_exceeding"])

### Check the stats of numeric columns
There is a stock with very high stock price. It is double checked and verified the price is correct. 

In [15]:
data2.describe().apply(lambda s: s.apply('{0:.5f}'.format)).transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
CompanyCIK,44817.0,1176809.80969,508825.55306,1750.0,888981.0,1315257.0,1580345.0,1888734.0
FileDate_ClosingPrice,44817.0,114.3273,5029.73504,1e-05,3.5,13.28,39.25,435200.0
FileDate_Plus_20_Price,44817.0,114.72741,5042.89659,1e-05,3.48,13.4,39.42,432469.0
Pct_Change_20,44817.0,150.5739,24007.71995,-99.86667,-6.24999,0.0,5.99539,5009900.35545
Share_Unit_Value_Raw,44817.0,0.40011,142.47021,-17000.0,-0.6,0.0,0.76,16801.0
file_text_length,44817.0,1988.5389,1384.89905,500.0,1127.0,1555.0,2335.0,14139.0
weeknum,44817.0,30.0434,12.40878,1.0,19.0,32.0,44.0,53.0
year,44817.0,2020.13352,0.8611,2019.0,2019.0,2020.0,2021.0,2022.0
Pct_Change_20_mean,44817.0,150.5739,786.19869,-28.28753,-1.75519,0.93324,9.12629,5013.25957
Pct_Change_20_min,44817.0,-69.63613,17.50373,-99.86667,-80.35641,-74.0,-59.25926,0.10273


## Prepare functions to free memory

The index sec file is large, as we process the data and create transformed data matrices. More data is created. The following functions help check the memory usage when we delete and garbage collect dataframes.

In [16]:
def obj_size_fmt(num):
    if num<10**3:
        return "{:.2f}{}".format(num,"B")
    elif ((num>=10**3)&(num<10**6)):
        return "{:.2f}{}".format(num/(1.024*10**3),"KB")
    elif ((num>=10**6)&(num<10**9)):
        return "{:.2f}{}".format(num/(1.024*10**6),"MB")
    else:
        return "{:.2f}{}".format(num/(1.024*10**9),"GB")

def memory_usage():
    memory_usage_by_variable=pd.DataFrame({k:sys.getsizeof(v)\
    for (k,v) in globals().items()},index=['Size'])
    memory_usage_by_variable=memory_usage_by_variable.T
    memory_usage_by_variable=memory_usage_by_variable.sort_values(by='Size',ascending=False).head(10)
    memory_usage_by_variable['Size']=memory_usage_by_variable['Size'].apply(lambda x: obj_size_fmt(x))
    return memory_usage_by_variable

## Text features

### TF-IDF vectorization

The Index Sec file contains bag of word phrases. 

### Check size of matrix

#### max_df
"max_df" is used for removing terms that appear too frequently. There are some phrases that probably all the reports would use, for example, "report", "revenue", etc. I attempt to set max_df = .98, it ignores terms that appear in more than 95% of the reports. 

#### min_df
Similar to "max_df", "min_df" is used for removing terms that appear too infrequently. I attempt to set min_df = 0.05, it ignores terms that appear in less than 5% of the reports. 

#### After setting the limits, the matrix is 44817 x 4632. We have more room to relax the limits. 

In [17]:
%%time

def dummy_fun(doc):
    return doc

tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None,
    max_df=0.95,     
    min_df=0.05
    )  

K = tfidf.fit_transform(data2['f_text']).todense()

CPU times: user 1min 40s, sys: 39.9 s, total: 2min 20s
Wall time: 2min 40s


In [18]:
K.shape

(44817, 4632)

Note:
- 95% -> 4.6K 
- 96% -> 5K 
- 97% -> 7K 
- 99% -> 20K 
- 99%, 0.5% -> 37025
- 99%, 0.25% -> 65655
- 99%, 0.1% ~ 45 doc -> 
- 99%, 80 -? 87305

### TFIDF Vectorizer Threshold

We want to limit our features to be within 100K because of the limited processing power. Since the most frequent term and least frequent term do not produce much value, we exclude the terms that appear in 90% of the reports and terms that appear in less than 70 of our 44817 reports. By setting these threshold, there will be 97629 features. Our input matrix size is (44817, 97629). 


In [19]:
%%time
from sklearn.feature_extraction.text import TfidfVectorizer

def dummy_fun(doc):
    return doc

tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None,
    max_df=0.90,     
    min_df=70,       
    stop_words= None,
    strip_accents=None,
    use_idf=True,
    sublinear_tf=True)

CPU times: user 1.55 s, sys: 7.4 s, total: 8.96 s
Wall time: 11.5 s


In [20]:
%%time
tfidf.fit(data2["f_text"])
tfidf.vocabulary_

CPU times: user 1min 44s, sys: 45.7 s, total: 2min 30s
Wall time: 2min 54s


{'various estimate': 95696,
 'the result': 85852,
 'timely decision': 93155,
 'the asset': 73211,
 'our property plant': 53178,
 'hazardous material': 30132,
 'available cash': 11377,
 'tax cut': 70920,
 'any material loss': 7836,
 'november revenue': 45188,
 'other party': 48485,
 'the company restriction': 75044,
 'content impact': 17570,
 'net': 42097,
 'the nature amount timing': 82000,
 'its distribution center': 34462,
 'the annual period': 72754,
 'certain share': 14698,
 'the component': 75414,
 'relevant information': 61028,
 'material disruption': 39836,
 'january': 35577,
 'the stock': 87202,
 'the company date': 74672,
 'transfer': 93870,
 'new taxis': 43126,
 'filer small reporting company': 26538,
 'service cost': 64326,
 'total consolidated asset': 93376,
 'vi note': 96026,
 'million share': 40719,
 'the same product': 86300,
 'its consideration': 34313,
 'the issuer class': 80631,
 'market condition': 39453,
 'consideration': 17102,
 'the balance sheet date': 73473,
 't

In [21]:
%%time
vector_1 = tfidf.transform(data2["f_text"])

CPU times: user 1min 16s, sys: 16.7 s, total: 1min 33s
Wall time: 1min 40s


In [22]:
vector_1.shape

(44817, 97629)

In [23]:
%%time
X0 = vector_1.todense()

CPU times: user 10.4 s, sys: 17.2 s, total: 27.5 s
Wall time: 45.6 s


In [24]:
X0.shape

(44817, 97629)

#### Convert Matrix to Compressed Sparse Row Matrix
Since the matrix is sparse, it is wasteful to store the zero elements. CSR breaks down the data frame for fitting into RAM, so data can easily fit in RAM. Performing operations using only non-zero values of the sparse matrix can greatly increase execution speed of the algorithm.

In [25]:
%%time
from scipy import sparse
X = csr_matrix(X0)

CPU times: user 1min 8s, sys: 1min 16s, total: 2min 25s
Wall time: 3min 38s


In [26]:
X.shape

(44817, 97629)

In [27]:
memory_usage()

Unnamed: 0,Size
data2,806.67MB
_20,163.84MB
_12,717.27KB
y,350.24KB
_11,38.38KB
_15,8.22KB
yearweek_average,6.82KB
TfidfVectorizer,1.96KB
StandardScaler,1.04KB
RMSprop,1.04KB


#### Clear memory 
The data is extracted and ready to be modeled. The original dataframe are deleted and cleared to free up some memory. 

In [32]:
%%time
del data2 
#del vector_2
gc.collect()
memory_usage()

CPU times: user 14.8 s, sys: 3min 24s, total: 3min 39s
Wall time: 8min 59s


Unnamed: 0,Size
_20,163.84MB
_,1.60MB
_30,1.60MB
ticker_df,1.60MB
_12,717.27KB
y,350.24KB
_11,38.38KB
_15,8.22KB
yearweek_average,6.82KB
TfidfVectorizer,1.96KB


## Non text features

### Prepare non-text features

We review some of the non-text attributes and see if they produce value to be the features to our model. 
* Report type: we only have 10-Q reports
* Report quarter: it may produce some value to the model. 
* Year of the report: it provides the time information, however, the data is only collected from 2019 to 2022. It doesn't produce much value to the model. If we use the model to predict stock movement in the future, for example 2023, 2025, it may not help much. 
* The stock price when the report is filed: it may provide a some information about the size of the company, However, the price do not always reflect the size of a company, because the size and price are also dependent of the number of shares of the stock. 
* Company ticker: company is an important factor to the stock price, however, the company data is sparse. One-hot encoding for over 6000 possible values might get out of hand, especially some of the compnay are very rare with only 1 report. This leads to the problem of “sparsity” with a huge matrix and almost every value is zero. 

In [None]:
%%time
df_non_text_features = data2[['FileType','SecFileName','CompanyTicker','FileDate_ClosingPrice','FileDate']]
df_non_text_features = df_non_text_features.reset_index(drop=True)
df_non_text_features["SecFileName"] = df_non_text_features["SecFileName"].str.slice(5, 9)
df_non_text_features["Year"] = df_non_text_features["FileDate"].dt.year
df_non_text_features = df_non_text_features.drop(columns = ['FileDate'])


In [None]:
df_non_text_features.describe().apply(lambda s: s.apply('{0:.5f}'.format)).transpose()

In [None]:
df_non_text_features.groupby('SecFileName').size()

In [None]:
df_non_text_features.groupby('FileType').size()

In [None]:
# do not use on hot encoding
# df_non_text_features = pd.get_dummies(df_non_text_features, columns=['FileType','SecFileName','CompanyTicker'])

In [None]:
%%time
del data2
gc.collect()
memory_usage()

### Encode categorical columns (reserved if non-text features will be used) 

In [None]:
import category_encoders as ce
encoder = ce.BinaryEncoder(cols=['SecFileName','CompanyTicker']);
# transform the data 
data_binary = encoder.fit_transform(df_non_text_features);

In [None]:
data_binary.head(5)

In [None]:
#normalize = ["FileDate_ClosingPrice","Year"]
#from sklearn.preprocessing import StandardScaler
std = StandardScaler()

In [None]:
alldex = data_binary.index

In [None]:
normdf = pd.DataFrame(std.fit_transform(data_binary),columns=data_binary.columns).set_index(alldex)

In [None]:
non_text_features = csr_matrix(normdf)

In [None]:
memory_usage()

### Stacking the text features and non-text features (reserved if non-text features will be used) 

Below is the code to stack the text features and non-text features. After reviewing the non-text features, we determined the information from the non-text features is relatively small and stacking the sparse matrix of the text features would also minimize the learning we can get from the non-text features. We decide to only use the text features. 

In [None]:
#print("Sparse Matrix..")
# Sparse Matrix
#train_features = hstack([
#    X,
#    non_text_features], 'csr'
#)
#del train_word_features, train_char_features
#print("train shape: {} rows, {}".format(*train_features.shape))

### Clear duplicated data

In [None]:
%%time
del df_non_text_features
del data_binary
del alldex
del normdf
del non_text_features
gc.collect()
memory_usage()

## Prepare train and test set for Cross Validation

In [33]:
%%time
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits= 5,shuffle=True,random_state=42)

for train, test in cv.split(X,y):
    X_train = X[train] 
    X_test  = X[test] 
    y_train = y[train]
    y_test  = y[test] 
    
print('Size of training data: ', X_train.shape[0], 'and its shape : ', X_train.shape)
print('Size of training labels: ', len(y_train), 'and its shape : ', y_train.shape)
print('Size of test data: ', X_test.shape[0], 'and its shape : ', X_test.shape)
print('Size of test labels: ', len(y_test), 'and its shape : ', y_test.shape)

Size of training data:  35854 and its shape :  (35854, 97629)
Size of training labels:  35854 and its shape :  (35854,)
Size of test data:  8963 and its shape :  (8963, 97629)
Size of test labels:  8963 and its shape :  (8963,)
CPU times: user 1.31 s, sys: 3.19 s, total: 4.5 s
Wall time: 8.2 s


## Model: Traditional Classification Model

### Model: Logistic Regression

In [34]:
%%time
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

loss = []
lr = LogisticRegression(solver="sag", max_iter=100)
lr.fit(X_train, y_train)
print("Auc Score: ",np.mean(cross_val_score(lr, X_train, y_train, cv=3, scoring='roc_auc')))


Auc Score:  0.5365273535172462
CPU times: user 44.3 s, sys: 1.06 s, total: 45.4 s
Wall time: 47.5 s


In [35]:
y_pred = lr.predict(X_test)

In [None]:
#probabilities = model.predict(X_test)
#predictions = [float(np.round(x)) for x in probabilities]
accuracy = np.mean(y_pred == y_test)
print("Prediction Accuracy: %.2f%%" % (accuracy*100))

In [36]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[7123   11]
 [1808   21]]


In [37]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.80      1.00      0.89      7134
           1       0.66      0.01      0.02      1829

    accuracy                           0.80      8963
   macro avg       0.73      0.50      0.45      8963
weighted avg       0.77      0.80      0.71      8963



### Model: K-Nearest Neighbors

In [28]:
%%time
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
print("Auc Score: ",np.mean(cross_val_score(clf, X_train, y_train, cv=3, scoring='roc_auc')))

Auc Score:  0.5417877396071421
CPU times: user 10min 8s, sys: 30.3 s, total: 10min 38s
Wall time: 11min 10s


In [29]:
y_pred = clf.predict(X_test)

In [30]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[5175 1325]
 [1841  622]]


In [31]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.74      0.80      0.77      6500
           1       0.32      0.25      0.28      2463

    accuracy                           0.65      8963
   macro avg       0.53      0.52      0.52      8963
weighted avg       0.62      0.65      0.63      8963



### Model: Random Forest

In [32]:
%%time
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=20, n_estimators=150, n_jobs=-1)
clf.fit(X_train, y_train)
print("Auc Score: ",np.mean(cross_val_score(clf, X_train, y_train, cv=3, scoring='roc_auc')))

Auc Score:  0.5449668874651226
CPU times: user 3min 26s, sys: 6.58 s, total: 3min 33s
Wall time: 2min 58s


In [33]:
y_pred = clf.predict(X_test)

In [34]:
confusion_matrix = confusion_matrix(y_test, y_pred)
print(confusion_matrix)

[[6473   27]
 [2439   24]]


In [35]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.73      1.00      0.84      6500
           1       0.47      0.01      0.02      2463

    accuracy                           0.72      8963
   macro avg       0.60      0.50      0.43      8963
weighted avg       0.66      0.72      0.61      8963



### Model: Naive Bayes

In [36]:
%%time 
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
scaled_X_train = scaler.fit_transform(X_train)

TypeError: MinMaxScaler does not support sparse input. Consider using MaxAbsScaler instead.

In [41]:
%%time
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(scaled_X_train, y_train)
print("Auc Score: ",np.mean(cross_val_score(clf, scaled_X_train, y_train, cv=3, scoring='roc_auc')))

NameError: name 'scaled_X_train' is not defined

In [38]:
y_pred = clf.predict(X_test)


In [39]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[6473   27]
 [2439   24]]


In [40]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.73      1.00      0.84      6500
           1       0.47      0.01      0.02      2463

    accuracy                           0.72      8963
   macro avg       0.60      0.50      0.43      8963
weighted avg       0.66      0.72      0.61      8963



### Model: Support Vector Machine

In [43]:
%%time
from sklearn.linear_model import SGDClassifier
#clf_svm = SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, n_iter=5, random_state=42)
clf_svm = SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42)
clf_svm.fit(X_train, y_train)
print("Auc Score: ",np.mean(cross_val_score(clf_svm, X_train, y_train, cv=3, scoring='roc_auc')))


Auc Score:  0.548101224011268
CPU times: user 6.1 s, sys: 1.8 s, total: 7.9 s
Wall time: 9.5 s


In [44]:
y_pred = clf_svm.predict(X_test)


In [45]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[6500    0]
 [2463    0]]


In [46]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.73      1.00      0.84      6500
           1       0.00      0.00      0.00      2463

    accuracy                           0.73      8963
   macro avg       0.36      0.50      0.42      8963
weighted avg       0.53      0.73      0.61      8963



## Model Neural Network

### Model ANN-a: Neural Network ANN 1 layer

#### model parameters
- node: 64
- learning rate = 0.001 (default)
- layer 1
- dropout = 0 (default) 
- batch size = 32 
- epoch = 8 

#### Result: 
1st epoch, 72.53%, 72.6%
overfitting problem after 2nd epoch

In [38]:
%%time 
batch_size = 32
nb_epochs = 8

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=97629))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
print(model.summary())

2022-06-08 02:08:57.537797: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 64)                6248320   
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
Total params: 6,248,385
Trainable params: 6,248,385
Non-trainable params: 0
_________________________________________________________________
None
CPU times: user 211 ms, sys: 506 ms, total: 717 ms
Wall time: 2.23 s


In [39]:
%%time
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=batch_size, epochs=nb_epochs,verbose=1) 

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
CPU times: user 19min 41s, sys: 6min 58s, total: 26min 40s
Wall time: 13min 39s


### Model ANN-b: Neural Network ANN 1 layer with dropout 0.8

#### model parameters
- node: 64
- learning rate = 0.001 (default)
- layer 1
- dropout = 0.8 <---
- batch size = 32 
- epoch = 8 + 8

#### Result: 
* 3th epoch, 72.82%, 72.58%
* there is still an overfiting problem

In [40]:
%%time 
batch_size = 32
nb_epochs = 8

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=97629))
model.add(Dropout(0.8))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_2 (Dense)             (None, 64)                6248320   
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 65        
                                                                 
Total params: 6,248,385
Trainable params: 6,248,385
Non-trainable params: 0
_________________________________________________________________
None
CPU times: user 126 ms, sys: 90.2 ms, total: 216 ms
Wall time: 219 ms


In [41]:
%%time
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=batch_size, epochs=nb_epochs,verbose=1) 

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
CPU times: user 19min 57s, sys: 7min 6s, total: 27min 4s
Wall time: 14min 19s


#### dropout = 0.2

In [42]:
%%time 
batch_size = 32
nb_epochs = 8

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=97629))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
print(model.summary())

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_4 (Dense)             (None, 64)                6248320   
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense_5 (Dense)             (None, 1)                 65        
                                                                 
Total params: 6,248,385
Trainable params: 6,248,385
Non-trainable params: 0
_________________________________________________________________
None
CPU times: user 133 ms, sys: 77 ms, total: 210 ms
Wall time: 183 ms


In [43]:
%%time
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=batch_size, epochs=nb_epochs,verbose=1) 

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
CPU times: user 25min 23s, sys: 8min 30s, total: 33min 54s
Wall time: 18min 52s


### higher learning rate 
higher learning rate does not improve the accuracy

In [44]:
%%time 
batch_size = 32
nb_epochs = 8
learning_rate = 0.005

custom_adam = tf.keras.optimizers.Adam(lr=learning_rate) #, beta_1=0.9, beta_2=0.999, epsilon=1e-8)

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=97629))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer=custom_adam,
              loss='binary_crossentropy',
              metrics=['accuracy'])
print(model.summary())

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_6 (Dense)             (None, 64)                6248320   
                                                                 
 dense_7 (Dense)             (None, 1)                 65        
                                                                 
Total params: 6,248,385
Trainable params: 6,248,385
Non-trainable params: 0
_________________________________________________________________
None
CPU times: user 139 ms, sys: 97.8 ms, total: 237 ms
Wall time: 213 ms


In [45]:
%%time
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=batch_size, epochs=nb_epochs,verbose=1) 

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
CPU times: user 27min 54s, sys: 9min 14s, total: 37min 9s
Wall time: 20min 55s


#### add 1 layer with .5 drop out

In [46]:
%%time 
batch_size = 32
nb_epochs = 8

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=97629))
model.add(Dropout(0.5))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
print(model.summary())

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_8 (Dense)             (None, 64)                6248320   
                                                                 
 dropout_2 (Dropout)         (None, 64)                0         
                                                                 
 dense_9 (Dense)             (None, 32)                2080      
                                                                 
 dropout_3 (Dropout)         (None, 32)                0         
                                                                 
 dense_10 (Dense)            (None, 1)                 33        
                                                                 
Total params: 6,250,433
Trainable params: 6,250,433
Non-trainable params: 0
_________________________________________________________________
None
CPU times: user 159 ms, sys: 95.5 ms, t

In [47]:
%%time
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=batch_size, epochs=nb_epochs,verbose=1) 

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
CPU times: user 27min 26s, sys: 8min 56s, total: 36min 23s
Wall time: 21min 11s


### Model: Deeper neural network with more neurons

In [48]:
np.random.seed(122)
batch_size = 32
nb_epochs = 5

model.add(Dense(64, activation='relu', input_dim=97629))
model.add(Dropout(0.5))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.5))

model = Sequential()
model.add(Dense(1000,activation='relu',input_shape= (97629,)))
model.add(Dropout(0.5))
model.add(Dense(500, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(50, activation='relu'))
model.add(Dropout(0.5))
#model.add(Dense(nb_classes))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
#model.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),metrics=['accuracy'])

print(model.summary())


Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_13 (Dense)            (None, 1000)              97630000  
                                                                 
 dropout_6 (Dropout)         (None, 1000)              0         
                                                                 
 dense_14 (Dense)            (None, 500)               500500    
                                                                 
 dropout_7 (Dropout)         (None, 500)               0         
                                                                 
 dense_15 (Dense)            (None, 50)                25050     
                                                                 
 dropout_8 (Dropout)         (None, 50)                0         
                                                                 
 dense_16 (Dense)            (None, 1)                

In [49]:
%%time
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=batch_size, epochs=nb_epochs,verbose=1) 

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
CPU times: user 4h 43min 6s, sys: 3h 18min 6s, total: 8h 1min 13s
Wall time: 3h 2min 21s


In [50]:
loss, accuracy = model.evaluate(X_train, y_train)
print("\nLoss: %.2f, Accuracy: %.2f%%" % (loss, accuracy*100))


Loss: 0.23, Accuracy: 90.19%


In [51]:
probabilities = model.predict(X_test)
predictions = [float(np.round(x)) for x in probabilities]
accuracy = np.mean(predictions == y_test)
print("Prediction Accuracy: %.2f%%" % (accuracy*100))


Prediction Accuracy: 75.12%


In [52]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, predictions)
print(cm)

[[6480  654]
 [1576  253]]


In [53]:
print(classification_report(y_test, predictions))


              precision    recall  f1-score   support

           0       0.80      0.91      0.85      7134
           1       0.28      0.14      0.18      1829

    accuracy                           0.75      8963
   macro avg       0.54      0.52      0.52      8963
weighted avg       0.70      0.75      0.72      8963



note (model with tickers)
32, epoch 8,  84.4%, 71.54% best epoch 3 -> 75.1%, 74.5%


### LSTM (Long Short Term Memory) model
The ANN model doesn't seem to do better than logistic regresssion. LSTM maybe another model can be tried. 

#### LSTM Modeling
* Vectorize consumer complaints text, by turning each text into either a sequence of integers or into a vector.
* Limit the data set to the top 100,000 words.
* Set the max number of words in each reports at 2500.

In [20]:
%%time
from keras.preprocessing.text import Tokenizer
# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 100000
# Max number of words in each complaint.
MAX_SEQUENCE_LENGTH = 2500
# This is fixed.
EMBEDDING_DIM = 100
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='', lower=True)
tokenizer.fit_on_texts(data2['f_text'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 5449804 unique tokens.
CPU times: user 3min 8s, sys: 27.9 s, total: 3min 36s
Wall time: 4min 7s


* Truncate and pad the input sequences so that they are all in the same length for modeling.

In [21]:
%%time
from tensorflow.keras.preprocessing.sequence import pad_sequences
X = tokenizer.texts_to_sequences(data2['f_text'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)


Shape of data tensor: (44817, 2500)
CPU times: user 1min 41s, sys: 25.1 s, total: 2min 6s
Wall time: 2min 38s


In [22]:
%%time
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits= 5,shuffle=True,random_state=42)

for train, test in cv.split(X,y):
    X_train = X[train] 
    X_test  = X[test] 
    y_train = y[train]
    y_test  = y[test] 
    
print('Size of training data: ', X_train.shape[0], 'and its shape : ', X_train.shape)
print('Size of training labels: ', len(y_train), 'and its shape : ', y_train.shape)
print('Size of test data: ', X_test.shape[0], 'and its shape : ', X_test.shape)
print('Size of test labels: ', len(y_test), 'and its shape : ', y_test.shape)

Size of training data:  35854 and its shape :  (35854, 2500)
Size of training labels:  35854 and its shape :  (35854,)
Size of test data:  8963 and its shape :  (8963, 2500)
Size of test labels:  8963 and its shape :  (8963,)
CPU times: user 675 ms, sys: 959 ms, total: 1.63 s
Wall time: 2.3 s


* The first layer is the embedded layer that uses 100 length vectors to represent each word.
* SpatialDropout1D performs variational dropout in NLP models.
* The next layer is the LSTM layer with 100 memory units.
* The output layer must create 1 output values.
* Activation function is softmax for binary classification.
* Because it is a binary classification problem, categorical_crossentropy is used as the loss function.

In [24]:
%%time 
from keras.layers import Input, Dense, Embedding, SpatialDropout1D, add, concatenate
from keras.layers import CuDNNLSTM, Bidirectional, GlobalMaxPooling1D, GlobalAveragePooling1D
from tensorflow.keras.layers import LSTM

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

batch_size = 32
nb_epochs = 5

print(model.summary())

# history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])


Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 2500, 100)         10000000  
                                                                 
 spatial_dropout1d_3 (Spatia  (None, 2500, 100)        0         
 lDropout1D)                                                     
                                                                 
 lstm_2 (LSTM)               (None, 100)               80400     
                                                                 
 dense_2 (Dense)             (None, 1)                 101       
                                                                 
Total params: 10,080,501
Trainable params: 10,080,501
Non-trainable params: 0
_________________________________________________________________
None
CPU times: user 312 ms, sys: 85.4 ms, total: 397 ms
Wall time: 606 ms


In [25]:
%%time
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=batch_size, epochs=nb_epochs,verbose=1) 

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
CPU times: user 14h 32min 36s, sys: 3h 41min 15s, total: 18h 13min 51s
Wall time: 6h 35min 17s


## ########## BELOW are WIP codes ########## ##

### CNN-BiLSTM

In [None]:
from keras.layers import Dense,Input,LSTM,Bidirectional,Activation,Conv1D,GRU
from keras.callbacks import Callback
from keras.layers import Dropout,Embedding,GlobalMaxPooling1D, MaxPooling1D, Add, Flatten
from keras.preprocessing import text, sequence
from keras.layers import GlobalAveragePooling1D, GlobalMaxPooling1D, concatenate, SpatialDropout1D

In [None]:
sequence_input = Input(shape=(maxlen, ))
x = Embedding(max_features, embed_size, weights=[embedding_matrix],trainable = False)(sequence_input)
x = SpatialDropout1D(0.2)(x)
x = Bidirectional(GRU(128, return_sequences=True,dropout=0.1,recurrent_dropout=0.1))(x)
x = Conv1D(64, kernel_size = 3, padding = "valid", kernel_initializer = "glorot_uniform")(x)
avg_pool = GlobalAveragePooling1D()(x)
max_pool = GlobalMaxPooling1D()(x)
x = concatenate([avg_pool, max_pool]) 
# x = Dense(128, activation='relu')(x)
# x = Dropout(0.1)(x)
preds = Dense(6, activation="sigmoid")(x)
model = Model(sequence_input, preds)
model.compile(loss='binary_crossentropy',optimizer=Adam(lr=1e-3),metrics=['accuracy'])
model.summary()

### CNN-LSTM

In [41]:
from tensorflow.keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Flatten

batch_size = 32
nb_epochs = 8

max_length = 97629
embedding_vector_features=45
vocab_size = 97629

model = Sequential()
model.add(Embedding(5000, 100, input_length=97629))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

NameError: name 'SpatialDropout1D' is not defined

In [36]:
%%time
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=batch_size, epochs=nb_epochs,verbose=1) 

Epoch 1/8


TypeError: in user code:

    File "/Users/wailunchung/.pyenv/versions/3.8.12/lib/python3.8/site-packages/keras/engine/training.py", line 1021, in train_function  *
        return step_function(self, iterator)
    File "/Users/wailunchung/.pyenv/versions/3.8.12/lib/python3.8/site-packages/keras/engine/training.py", line 1010, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/Users/wailunchung/.pyenv/versions/3.8.12/lib/python3.8/site-packages/keras/engine/training.py", line 1000, in run_step  **
        outputs = model.train_step(data)
    File "/Users/wailunchung/.pyenv/versions/3.8.12/lib/python3.8/site-packages/keras/engine/training.py", line 859, in train_step
        y_pred = self(x, training=True)
    File "/Users/wailunchung/.pyenv/versions/3.8.12/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
        raise e.with_traceback(filtered_tb) from None

    TypeError: Exception encountered when calling layer "embedding_4" (type Embedding).
    
    Failed to convert elements of SparseTensor(indices=Tensor("DeserializeSparse:0", shape=(None, 2), dtype=int64), values=Tensor("sequential_8/embedding_4/Cast:0", shape=(None,), dtype=int32), dense_shape=Tensor("stack:0", shape=(2,), dtype=int64)) to Tensor. Consider casting elements to a supported type. See https://www.tensorflow.org/api_docs/python/tf/dtypes for supported TF dtypes.
    
    Call arguments received:
      • inputs=<tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x2dfeb69d0>


In [None]:
filters = 100
kernel_size = 3
activation = 'relu'
input1 = Input(shape=(max_length,))
embeddding1 = Embedding(input_dim=97629, 
                            output_dim=1, 
                            input_length=max_length, 
                            input_shape=(max_length, ),
                            # Assign the embedding weight with word2vec embedding marix
                            weights = [emb_matrix],
                            # Set the weight to be not trainable (static)
                            trainable = False)(input1)
conv1 = Conv1D(filters=filters, kernel_size=kernel_size, activation='relu', 
                   kernel_constraint= MaxNorm( max_value=3, axis=[0,1]))(embeddding1)

In [None]:
def ensemble_CNN_BiGRU(filters = 100, kernel_size = 3, activation='relu', 
                   input_dim = None, output_dim=300, max_length = None, emb_matrix = None):
  
    # Channel 1D CNN
input1 = Input(shape=(max_length,))
embeddding1 = Embedding(input_dim=input_dim, 
                            output_dim=output_dim, 
                            input_length=max_length, 
                            input_shape=(max_length, ),
                            # Assign the embedding weight with word2vec embedding marix
                            weights = [emb_matrix],
                            # Set the weight to be not trainable (static)
                            trainable = False)(input1)
    conv1 = Conv1D(filters=filters, kernel_size=kernel_size, activation='relu', 
                   kernel_constraint= MaxNorm( max_value=3, axis=[0,1]))(embeddding1)
    pool1 = MaxPool1D(pool_size=2, strides=2)(conv1)
    flat1 = Flatten()(pool1)
    drop1 = Dropout(0.5)(flat1)
    dense1 = Dense(10, activation='relu')(drop1)
    drop1 = Dropout(0.5)(dense1)
    out1 = Dense(1, activation='sigmoid')(drop1)
    
    # Channel BiGRU
    input2 = Input(shape=(max_length,))
    embeddding2 = Embedding(input_dim=input_dim, 
                            output_dim=output_dim, 
                            input_length=max_length, 
                            input_shape=(max_length, ),
                            # Assign the embedding weight with word2vec embedding marix
                            weights = [emb_matrix],
                            # Set the weight to be not trainable (static)
                            trainable = False,
                            mask_zero=True)(input2)
    gru2 = Bidirectional(GRU(64))(embeddding2)
    drop2 = Dropout(0.5)(gru2)
    out2 = Dense(1, activation='sigmoid')(drop2)
    
    # Merge
    merged = concatenate([out1, out2])
    
    # Interpretation
    outputs = Dense(1, activation='sigmoid')(merged)
    model = Model(inputs=[input1, input2], outputs=outputs)
    
    # Compile
    model.compile( loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

In [None]:
%%time
from matplotlib import pyplot
train_acc = model.evaluate(X_train, y_train, verbose=0)
test_acc = model.evaluate(X_test, y_test, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc[1], test_acc[1]))
# plot training history
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()

In [None]:
memory_usage()

## Model 2

In [None]:
np.random.seed(122)
nb_classes = 2
batch_size = 32
nb_epochs = 8
learning_rate = 0.01

Y_train = np_utils.to_categorical(y_train, nb_classes)

model = Sequential()
model.add(Dense(1000,input_shape= (10000,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(500))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(50))
model.add(Activation('relu'))
model.add(Dropout(0.5))
#model.add(Dense(nb_classes))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
#model.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),metrics=['accuracy'])

print(model.summary())

In [None]:
import tensorflow as tf
from tensorflow.keras import layers 
from tensorflow.keras.regularizers import l2

# defining the architecture of this connected neural network

def build_fc_model():    
    '''defining the model using the Sequential class'''
    fc_model = tf.keras.Sequential([
      # First define a input layer
      tf.keras.layers.InputLayer(input_shape=(X_train.shape[1],)),
      # Defining the activation function for the first fully connected (Dense) layer      
      tf.keras.layers.Dense(100, activation=tf.nn.relu, kernel_regularizer=tf.keras.regularizers.L2(0.01)),      
      # Defining the activation function for the second fully connected (Dense) layer      
      tf.keras.layers.Dense(100, activation=tf.nn.relu, kernel_regularizer=tf.keras.regularizers.L2(0.01)),     
      # Defining the activation function for the third fully connected (Dense) layer      
      tf.keras.layers.Dense(100, activation=tf.nn.relu, kernel_regularizer=tf.keras.regularizers.L2(0.01)),   
      # Defining the second Dense layer to output the classification probabilities
      tf.keras.layers.Dense(1, activation=tf.nn.sigmoid, activity_regularizer=tf.keras.regularizers.L2(0.01))       
    ])
    return fc_model

model2 = build_fc_model()
model2.summary()

In [None]:
learning_rate = 0.01

model2.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate), 
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
%%time
BATCH_SIZE = 64
EPOCHS = 5
history2 = model2.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=BATCH_SIZE, epochs=EPOCHS,verbose=1)

In [None]:
loss, accuracy = model2.evaluate(X_train, y_train)
print("\nLoss: %.2f, Accuracy: %.2f%%" % (loss, accuracy*100))

In [None]:
import numpy
probabilities = model2.predict(X_test)
predictions = [float(numpy.round(x)) for x in probabilities]
accuracy = numpy.mean(predictions == y_test)
print("Prediction Accuracy: %.2f%%" % (accuracy*100))

In [None]:
outfile = '/Users/wailunchung/Documents/GitHub/Capstone_data/my_model'
model.save(outfile)