## Experimenting with a Fully Connected Neural Network (FCNN) Classifier

* The input layer consists of 1024 neurons with ReLU activation, followed by dropout with a probability of 40%.
* The four hidden layers use ReLU activation and have 512, 256, 256, and 256 neurons.
* The output layer consists of 2 neurons with sigmoid activation.
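To give a sense of scale, the layer sizes listed above can be turned into a rough parameter count. This is only a sketch under assumptions the notebook does not confirm: every layer is a plain dense layer with a bias term, dropout adds no parameters, and the input dimension equals the vocabulary size of 51758 reported later in this notebook. The helper `dense_param_count` is a hypothetical name introduced for illustration.

```python
# Rough parameter count for the layer sizes listed above.
# Assumption (not confirmed by the notebook): plain dense layers with biases.
def dense_param_count(layer_sizes):
    """Sum weights plus biases over consecutive fully connected layers."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

input_dim = 51758  # vocabulary size printed later in this notebook
layer_sizes = [input_dim, 1024, 512, 256, 256, 256, 2]

total = dense_param_count(layer_sizes)
first_layer = (input_dim + 1) * 1024
print(f"total parameters: {total:,}")
print(f"share held by the first layer: {first_layer / total:.1%}")
```

Under these assumptions, almost all weights sit in the first layer, because the TF-IDF input is far wider than any hidden layer.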



In [3]:
import os 
datasets = ["testdata.manual.2009.06.14.csv", "training.1600000.processed.noemoticon.csv"]
train_path = os.path.join("dataset", datasets[1])

**Loading Data**

In [5]:
import os
from tqdm import trange
from dataloader import DataLoader

data_loader = DataLoader()

In [6]:
data = data_loader.read_df(train_path, 
                           df_type='csv', encoding='latin-1',
                           names=["Sentiment", "ID", "Date", "Query","UserID","Tweet"])
data.head(2)

| | Sentiment | ID | Date | Query | UserID | Tweet |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |


**Preprocessing using texthero**

In [7]:
import texthero as hero

data['CleanTweet'] = data['Tweet'].pipe(hero.clean)

data.head(2)

| | Sentiment | ID | Date | Query | UserID | Tweet | CleanTweet |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... | switchfoot http twitpic com 2y1zl awww bummer ... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... | upset update facebook texting might cry result... |


**Extracting top words for TF-IDF representation**

In [8]:
def get_vocabs(pos_words_dic, neg_words_dic, ft):
    """Return words whose frequency exceeds ft in either class, without duplicates."""
    vocabs = []
    seen = set()  # constant-time membership checks; the list keeps insertion order
    for words_dic in (pos_words_dic, neg_words_dic):
        for word, freq in words_dic.items():
            if freq > ft and word not in seen:
                seen.add(word)
                vocabs.append(word)
    return vocabs
    
ft = 5
top_pos_words = hero.top_words(data[data['Sentiment'] == 4]['CleanTweet'])
top_neg_words = hero.top_words(data[data['Sentiment'] == 0]['CleanTweet'])
vocabs5 = get_vocabs(top_pos_words.to_dict(), top_neg_words.to_dict(), ft=ft)
print("ft:{}, vocab length:{}".format(ft, len(vocabs5)))

ft:5, vocab length:51758
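The thresholding step above can be illustrated on toy data without texthero. The sketch below assumes that the per-class inputs are simple word-to-count mappings, which is how `get_vocabs` consumes them; `count_words`, `build_vocab`, and the example sentences are hypothetical stand-ins, not part of the notebook.

```python
from collections import Counter

def count_words(texts):
    """Count whitespace-separated tokens across a list of cleaned texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts

def build_vocab(count_dicts, ft):
    """Keep words whose count exceeds ft in at least one class, in first-seen order."""
    vocab = {}
    for counts in count_dicts:
        for word, freq in counts.items():
            if freq > ft:
                vocab.setdefault(word, None)  # dict keeps order and drops duplicates
    return list(vocab)

# Toy data (hypothetical), standing in for the positive and negative tweets.
pos = ["good day", "good good morning", "nice day"]
neg = ["bad day", "bad bad morning", "sad"]
vocab = build_vocab([count_words(pos), count_words(neg)], ft=1)
print(vocab)
```

Note that the threshold is applied per class: "morning" occurs twice overall in the toy data, but only once in each class, so it is excluded with `ft=1`.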


**Train test split**

In [9]:
def transform_labels(label):
    # Positive tweets are labelled 4 in the raw data; remap them to 1 so labels are 0/1.
    if label == 4:
        return 1
    return label

data['Sentiment'] = data['Sentiment'].apply(transform_labels)

In [10]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data['CleanTweet'].tolist(), 
                                                    data['Sentiment'].tolist(),
                                                    test_size=0.3, random_state=40)

print("Train size:", len(x_train))
print("Test size:", len(x_test))

Train size: 1120000
Test size: 480000
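For intuition, a seeded shuffle-and-cut split can be sketched with the standard library alone. This is not scikit-learn's implementation and will not reproduce the same partition as `train_test_split`; `split_indices` and its rounding rule are hypothetical choices made for illustration.

```python
import random

def split_indices(n, test_size, seed):
    """Shuffle indices 0..n-1 with a fixed seed and cut off a test portion."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # same seed, same order on every run
    n_test = int(n * test_size + 0.5)     # hypothetical rounding rule
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(n=1000, test_size=0.3, seed=40)
print(len(train_idx), len(test_idx))
```

Fixing the seed is what makes the split reproducible, which is the role `random_state=40` plays in the cell above.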


## TFIDFFCNN Model

This model feeds TF-IDF features, restricted to the vocabulary extracted above, into the fully connected neural network.
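As a rough illustration of what a TF-IDF representation over a fixed vocabulary looks like, here is a pure-Python sketch. It uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2-normalised rows, which matches scikit-learn's documented defaults to the best of my knowledge; tokenisation here is a plain whitespace split, which differs from `TfidfVectorizer`. The function name `tfidf_fixed_vocab` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter

def tfidf_fixed_vocab(docs, vocab):
    """TF-IDF over a fixed vocabulary with smoothed IDF and L2-normalised rows."""
    index = {word: i for i, word in enumerate(vocab)}
    n_docs = len(docs)
    # Words outside the vocabulary are simply ignored.
    tokenised = [[t for t in doc.split() if t in index] for doc in docs]

    # Document frequency: in how many documents each vocabulary word appears.
    df = Counter(word for tokens in tokenised for word in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

    rows = []
    for tokens in tokenised:
        row = [0.0] * len(vocab)
        for word, count in Counter(tokens).items():
            row[index[word]] = count * idf[word]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row] if norm else row)
    return rows

docs = ["good day", "bad day", "good good morning"]  # toy examples
vocab = ["good", "day", "bad"]
rows = tfidf_fixed_vocab(docs, vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

Each document becomes a vector whose length equals the vocabulary size, which is why the network's `input_dim` is set to `len(vocabs5)`.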

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from model import ModelPipeline
from fcnn import FCNN

model=ModelPipeline(estimator=FCNN(input_dim=len(vocabs5), 
                                   nb_classes=2, 
                                   best_model="fcnn", 
                                   epoch=3,
                                   batch_size=1024,
                                   verbose=1,
                                   validation_split=0.1),
                    transformer=TfidfVectorizer(vocabulary=vocabs5) )

model.fit(x_train, y_train)

Using TensorFlow backend.


Train on 1008000 samples, validate on 112000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
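The notebook does not show the source of `ModelPipeline`, so the following is only a guess at the general pattern such a wrapper might follow: fit the transformer on training text, train the estimator on the resulting features, and reuse the fitted transformer at prediction time. `SketchPipeline`, `LengthFeature`, and `ThresholdClassifier` are hypothetical classes written for this illustration.

```python
class SketchPipeline:
    """Hypothetical stand-in for a transformer + estimator wrapper."""

    def __init__(self, estimator, transformer):
        self.estimator = estimator
        self.transformer = transformer

    def fit(self, X, y):
        # Learn the feature mapping on training text only, then train the estimator.
        features = self.transformer.fit_transform(X)
        self.estimator.fit(features, y)
        return self

    def predict(self, X):
        # Reuse the fitted mapping; the transformer is not refit on new data.
        return self.estimator.predict(self.transformer.transform(X))


class LengthFeature:
    """Toy transformer: a single feature, the number of tokens."""

    def fit_transform(self, X):
        return self.transform(X)

    def transform(self, X):
        return [[len(text.split())] for text in X]


class ThresholdClassifier:
    """Toy estimator: predicts 1 when the feature exceeds the training mean."""

    def fit(self, features, y):
        self.mean_ = sum(f[0] for f in features) / len(features)

    def predict(self, features):
        return [int(f[0] > self.mean_) for f in features]


pipe = SketchPipeline(ThresholdClassifier(), LengthFeature()).fit(["a b c", "a"], [1, 0])
predictions = pipe.predict(["a b c d", "a"])
print(predictions)
```

Keeping the transformer and estimator behind one `fit`/`predict` interface means raw text can be passed straight in, as the cells above and below do with `x_train` and `x_test`.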


In [12]:
from sklearn.metrics import classification_report

y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.77      0.78    239943
           1       0.78      0.80      0.79    240057

    accuracy                           0.79    480000
   macro avg       0.79      0.79      0.79    480000
weighted avg       0.79      0.79      0.79    480000
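The per-class columns in the report above follow from true positive, false positive, and false negative counts. The sketch below computes them by hand on a small made-up label list; `per_class_metrics` is a hypothetical helper, and the toy labels are unrelated to the notebook's results.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treating `label` as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels (hypothetical), not taken from the notebook.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]
for label in (0, 1):
    p, r, f = per_class_metrics(y_true, y_pred, label)
    print(label, round(p, 2), round(r, 2), round(f, 2))
```

The "support" column in the report is simply the number of true examples of each class, and the macro average is the unweighted mean of the per-class values.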

