### Imports

In [None]:
import numpy as np
import pandas as pd
import csv

# Deep learning imports
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.layers import Embedding
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.layers import Bidirectional
from keras.layers import Activation

#import pandas profiling
import random 
import names
import pandas_profiling

### Reading Dataset

In [None]:
# Train Dataset
train_data = pd.read_csv('train.csv')
train_data = train_data.sample(frac=1)  # Shuffle the rows (sample 100% of the data)
train_data.head()

In [None]:
# Test Dataset
test_data = pd.read_csv('test.csv')
test_data.head()

In [None]:
table = [["id","Qualitative Nominal"],["title","Qualitative Nominal"],
         ["author","Qualitative Nominal"],["text","Qualitative Nominal"],
         ["label","Discrete Quantitative"]]

filing = pd.DataFrame(table, columns=["Variable", "Classification"])
filing

### Classification of Variables

### Data Dictionary

The **fake news** file contains actual information about ... as follows:


**train.csv:** A full training dataset with the following attributes:


- **ID:** unique id for a news article


- **TITLE:** the title of a news article


- **AUTHOR:** author of the news article


- **TEXT:** describe....


- **LABEL:** a label that marks the article as potentially unreliable
 - 1: unreliable
 - 0: reliable


**test.csv:** A testing dataset with the same attributes as train.csv, without the label.

### Data Profiling

When importing the data, we need to understand it: identify the range and data type of each predictor, and calculate the number or percentage of missing values for each predictor. We will use the pandas_profiling library, which provides many useful functions for exploratory data analysis.

In [None]:
profile = pandas_profiling.ProfileReport(train_data)
display(profile)

#### Settings

It is important to know which columns have missing data and in what proportion, because missing values can degrade training. The report generated by pandas_profiling shows:

- Attribute title has 558 samples (2.68%) with missing values.
- Attribute author has 1957 samples (9.41%) with missing values.
- Attribute text has 39 samples (0.19%) with missing values.

We decided to drop every row that has any missing value.

In [None]:
print('Before dropna we have {} lines in train'.format(train_data.shape[0]))
train_data.dropna(inplace=True)
print('After dropna we have {} lines in train'.format(train_data.shape[0]))
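The per-column missing counts and percentages quoted from the profile report can also be cross-checked directly with pandas. The sketch below runs on a small invented frame (the `toy` data is an assumption for illustration, not the real train.csv):

```python
import pandas as pd

# Toy stand-in for train.csv; the values are invented for illustration only
toy = pd.DataFrame({
    "title": ["A", None, "C", "D"],
    "author": ["x", "y", None, None],
    "text": ["t1", "t2", "t3", "t4"],
})

# isna() marks missing cells; sum() counts them and mean() gives the fraction
missing = pd.DataFrame({
    "count": toy.isna().sum(),
    "percent": (toy.isna().mean() * 100).round(2),
})
print(missing)
```

Running the same two expressions on `train_data` before `dropna` should reproduce the figures from the report.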

After dropping the rows with missing data, we ran a descriptive analysis to check for class imbalance and found that the classes are unbalanced by about 7%. In this step we balance the dataset.

In [None]:
train_data.label.describe()

In [None]:
unreliable = train_data[train_data['label'] == 1]
print('Unreliable:', len(unreliable))

reliable = train_data[train_data['label'] == 0]
print('Reliable:', len(reliable))

We randomly undersample the larger class so that the numbers of unreliable and reliable records match.

In [None]:
# Size of the smaller class; both classes are sampled down to this count
n_per_class = min(len(unreliable), len(reliable))

un_data = unreliable.sample(n=n_per_class)
print('Unreliable:', len(un_data))
r_data = reliable.sample(n=n_per_class)
print('Reliable:', len(r_data))

train_data = pd.concat([un_data, r_data])
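As a compact illustration of the same balancing idea, here is a minimal sketch of random undersampling on an invented frame. The `toy` data, the `n_per_class` name and the fixed `random_state` are assumptions made for the example:

```python
import pandas as pd

# Toy imbalanced dataset: 7 unreliable (1) and 3 reliable (0) articles
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(10)],
    "label": [1] * 7 + [0] * 3,
})

# Size of the smallest class
n_per_class = toy["label"].value_counts().min()

# Draw the same number of rows from every class, then shuffle the result
balanced = pd.concat(
    [group.sample(n=n_per_class, random_state=0) for _, group in toy.groupby("label")]
).sample(frac=1, random_state=0)

counts = balanced["label"].value_counts()
print(counts)
```

Undersampling discards rows from the majority class, which is acceptable here because the imbalance is small.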

In the next step, we create two new features: title_author_text and len_title_author_text.
The first stores the concatenation of the title, author, and text; the second stores the length of that string in characters.
Training will be based on the title_author_text feature.

In [None]:
train_data['title_author_text'] = train_data['title'] + ' ' + train_data['author'] + ' ' + train_data['text']
train_data['len_title_author_text'] = train_data['title_author_text'].str.len()

In [None]:
detail = train_data['len_title_author_text'].describe()
print(detail)
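Note that the length computed above counts characters, not words. The sketch below contrasts the two measures on two invented strings (the `combined` series is hypothetical), which matters later when a sequence length in tokens is chosen:

```python
import pandas as pd

# Hypothetical concatenated title + author + text strings
combined = pd.Series([
    "Breaking story John Doe Something happened today",
    "Short one",
])

char_len = combined.str.len()           # number of characters
word_len = combined.str.split().str.len()  # number of whitespace-separated words

print(pd.DataFrame({"chars": char_len, "words": word_len}))
```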

### Deep Learning Model

#### Training model

The data is split into 80% for training and 20% for testing.

In [None]:
#train_features = train_data.drop(['id', 'label'], axis=1)
train_features = train_data['title_author_text']
train_targets = train_data['label']

X_train, X_test, y_train, y_test = train_test_split(train_features, train_targets, test_size=0.2, random_state=42)

print('Train Data Feature: {}'.format(len(X_train)))
print('Train Data Label: {}'.format(len(y_train)))

print('Test Data Feature: {}'.format(len(X_test)))
print('Test Data Label: {}'.format(len(y_test)))

In [None]:
# Fix the NumPy random seed for reproducibility
np.random.seed(7)

We decided to use a token dictionary with no more than 5000 words, a reasonable vocabulary size for the model.

In [None]:
num_token = 5000
token = Tokenizer(num_words = num_token, filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
token.fit_on_texts(X_train)
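To make the effect of the vocabulary limit concrete, here is a simplified pure-Python stand-in for a frequency-ranked tokenizer. It is not the Keras implementation; `build_vocab` and `to_sequence` are hypothetical helpers, and the sketch only mirrors the documented behaviour that index 0 is reserved and only the `num_words - 1` most frequent words are kept:

```python
from collections import Counter

def build_vocab(texts, num_words):
    """Rank words by frequency and keep indices 1..num_words-1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}

def to_sequence(text, vocab):
    # Words outside the vocabulary are silently dropped
    return [vocab[word] for word in text.lower().split() if word in vocab]

texts = ["fake news spreads fast", "real news spreads slowly", "news is news"]
vocab = build_vocab(texts, num_words=3)
seq = to_sequence("fake news spreads", vocab)
print(vocab, seq)
```

With a limit of 5000, rare words never reach the model, which keeps the embedding table small.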

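We truncate and pad the input sequences so they all have the same length, since Keras requires fixed-length vectors for its computations.
The maximum sequence length is set to the mean of the len_title_author_text feature. Note that this mean is measured in characters, while the padded sequences are measured in tokens, so the limit is considerably larger than the token count of a typical article.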

In [None]:
max_review_length = int(detail['mean'])

X_train_token = token.texts_to_sequences(X_train)
X_test_token = token.texts_to_sequences(X_test)

X_train_seq = sequence.pad_sequences(X_train_token, maxlen=max_review_length)
X_test_seq = sequence.pad_sequences(X_test_token, maxlen=max_review_length)
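The padding and truncation step can be sketched in plain NumPy. This is a simplified stand-in that assumes the Keras defaults (zeros added at the front, excess tokens removed from the front); `pad_pre` is a hypothetical helper, not a library function:

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` tokens of long ones."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        if not seq:
            continue  # an empty sequence stays all padding
        trimmed = seq[-maxlen:]             # truncate from the front
        out[row, -len(trimmed):] = trimmed  # right-align the tokens
    return out

padded = pad_pre([[5, 8], [1, 2, 3, 4, 5]], maxlen=4)
print(padded)
```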

The first layer of the model is an Embedding layer that represents each word with a 32-dimensional vector. The next layer is an LSTM layer with 100 memory units. Since this is a classification problem, we use a Dense output layer with a single neuron and a sigmoid activation function to predict 0 (reliable) or 1 (unreliable).

Recurrent neural networks such as LSTMs are prone to overfitting, so we add a Keras Dropout layer.
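For intuition, dropout during training can be sketched in NumPy as randomly zeroing a fraction of the activations and rescaling the survivors by 1 / (1 - rate). `inverted_dropout` is a hypothetical helper written for this illustration, not the Keras layer itself:

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Zero each element with probability `rate` and rescale the rest to keep the expected sum."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(10)
dropped = inverted_dropout(activations, rate=0.2, rng=rng)
print(dropped)
```

Because the surviving values are rescaled, dropout is normally applied to hidden activations rather than to the final sigmoid output, where it would distort the predicted probabilities.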

Because this is a binary classification problem, we minimize the log loss (binary_crossentropy) with the Adam optimizer. We train for only three epochs with a batch size of 64 samples, which spaces out the weight updates.
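The log loss itself is short enough to write out. The following NumPy sketch computes the mean binary cross-entropy for a few invented labels and probabilities; the clipping constant `eps` is an assumption added to avoid taking log(0):

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

# Confident, correct predictions give a small loss
loss_good = binary_crossentropy([1, 0, 1], [0.9, 0.1, 0.8])
# Confident, wrong predictions are penalized heavily
loss_bad = binary_crossentropy([1, 0, 1], [0.1, 0.9, 0.2])
print(loss_good, loss_bad)
```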

In [None]:
embedding_vector_length = 32
lstm_dim = 100
dropout = 0.2

# model = Sequential()
# model.add(Embedding(input_dim=num_token, output_dim=embedding_vector_length, input_length=max_review_length))
# model.add(Bidirectional(LSTM(lstm_dim), merge_mode = 'sum'))
# model.add(Dense(units = 256, activation = 'relu'))
# model.add(Dense(units = 1, activation = 'sigmoid'))
# model.add(Dropout(dropout))
# model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model = Sequential()
model.add(Embedding(input_dim=num_token, output_dim=embedding_vector_length, input_length=max_review_length))
model.add(LSTM(lstm_dim))
# Apply dropout to the LSTM output, before the output layer, so the predictions themselves are never zeroed
model.add(Dropout(dropout))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# summary() prints the table itself and returns None
model.summary()

model.fit(X_train_seq, y_train, epochs=3, batch_size=64)

In [None]:
scores = model.evaluate(X_test_seq, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

#### Testing model

In [None]:
test_id = test_data['id']
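# Build the same title + author + text feature used for training; missing fields become empty strings
X_test_data = (test_data['title'].fillna('') + ' ' + test_data['author'].fillna('')
               + ' ' + test_data['text'].fillna(''))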

In [None]:
X_test_token = token.texts_to_sequences(X_test_data)
X_test_seq = sequence.pad_sequences(X_test_token, maxlen = max_review_length)

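# predict_classes was removed from recent Keras releases; threshold the sigmoid output instead
predict = (model.predict(X_test_seq) > 0.5).astype('int32')
predict_classes = predict.reshape(-1)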
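Turning sigmoid outputs into class labels amounts to thresholding at 0.5 and flattening the result. A minimal NumPy sketch with invented probabilities (the `probs` array stands in for the model output and is not real data):

```python
import numpy as np

# Hypothetical model output: one sigmoid probability per article, shape (n_samples, 1)
probs = np.array([[0.91], [0.08], [0.50], [0.73]])

# Probabilities strictly above 0.5 are labelled 1 (unreliable), the rest 0 (reliable)
labels = (probs > 0.5).astype("int32").reshape(-1)
print(labels)
```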

In [None]:
test_data['label'] = predict_classes
test_data.head()

#### Results

In [None]:
result = test_data[['id', 'label']]
result.head()

In [None]:
# Any results you write to the current directory are saved as output.
result.to_csv('submission.csv', index = False)