# Homework 1 Antoine

Objective: Use an LSTM to differentiate between human-written and AI-written texts

We will first implement 3 different LSTMs: a basic one, a bidirectional one and one combined with a CNN. Once we have found the best of the 3, we will tune its hyperparameters to optimize it.
(We assume that if one model is worse than another, its optimized version will also be worse than the optimized version of the other. This is a strong assumption, but it spares us from tuning every model in order to find the best one.)

We start by importing the necessary libraries and loading the data, then check that the data was imported correctly.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, SpatialDropout1D
from tensorflow.keras.callbacks import EarlyStopping

seed = 111

data = pd.read_csv('AI_human1.csv')
data.head()

Unnamed: 0,text,generated
0,Cars. Cars have been around since they became ...,0.0
1,Transportation is a large necessity in most co...,0.0
2,"""America's love affair with it's vehicles seem...",0.0
3,How often do you ride in a car? Do you drive a...,0.0
4,Cars are a wonderful thing. They are perhaps o...,0.0


We keep a random 10% sample of the dataset to save processing time, and encode the `generated` label as integers.

In [2]:
data = data.sample(frac=0.1, random_state=seed) 
data.head()
label_encoder = LabelEncoder()
data['generated'] = label_encoder.fit_transform(data['generated'])

We print 5 randomly chosen texts to see whether they need to be cleaned. At first sight, it is impossible to tell AI-generated texts from human-written ones, and nothing in the samples calls for cleaning, so we leave the text as is.

In [3]:
print("Sample texts from the dataset:")
for i in np.random.choice(len(data), 5, replace=False):
    print(f"Text {i+1}: {data['text'].iloc[i]} (Label: {data['generated'].iloc[i]})\n")

Sample texts from the dataset:
Text 6367: Venus is one of the brightest points of light in the night sky and the closest in distance as well. However, the Venus has proved a very challenging place to explore or to examine more closely. In this exploring process, it will have a variety of stumbling blocks to make scientists stop to explore it. But actually, striving to meet the challenge presented by Venus has value, not only because of the insight to be gained on the planet itself, but also because human curiosity will likely lead us into many equally intimidating endeavors. So that's why people studying Venus is a worthy pursuit despite the dangers it presents.

Here are some evidences regarding why the people still to explore it. For instance: Our travels on Earth and beyond should not be limited by dangers and doubts but should be expanded to meet the very edges of imagination and innovation (Paragraph 8). According to this evidence of the article, I can see innovation can help pe

We split the dataset into 3 groups: train (80%), validation (10%) and test (10%).

In [3]:
# We divide the dataset
train_data, temp_data = train_test_split(data, test_size=0.2, random_state=seed)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=seed)


print(f"Train size: {train_data.shape[0]}, Validation size: {val_data.shape[0]}, Test size: {test_data.shape[0]}")
print("Train labels distribution:\n", train_data['generated'].value_counts())
print("Validation labels distribution:\n", val_data['generated'].value_counts())
print("Test labels distribution:\n", test_data['generated'].value_counts())

Train size: 38979, Validation size: 4872, Test size: 4873
Train labels distribution:
 generated
0    24334
1    14645
Name: count, dtype: int64
Validation labels distribution:
 generated
0    3039
1    1833
Name: count, dtype: int64
Test labels distribution:
 generated
0    3053
1    1820
Name: count, dtype: int64
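
As a point of reference for the accuracies reported later, the label counts printed above fix the score of a trivial classifier that always predicts the majority class (label 0). This is a minimal sketch that uses only the counts from the output above; the helper name `majority_baseline` is ours and not part of the notebook.

```python
def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent label."""
    return max(counts.values()) / sum(counts.values())

# Label counts copied from the split summary printed above.
splits = {
    "train": {0: 24334, 1: 14645},
    "validation": {0: 3039, 1: 1833},
    "test": {0: 3053, 1: 1820},
}

for name, counts in splits.items():
    print(f"{name}: majority-class accuracy = {majority_baseline(counts):.4f}")
```

The three splits have nearly the same class balance, and any model has to beat roughly 62 to 63% accuracy before it can be said to have learned anything.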


Tokenization: (We tokenize in the simplest way, with the Keras `Tokenizer`. Many other tokenizers exist, and some suit particular models better. For example, the LSTM + CNN model may work better with BERT tokenization. A complete comparison would therefore also evaluate each model with its best-suited tokenization.)

In [4]:
# Tokenisation
max_words = 5000
max_len = 150

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(train_data['text'])

X_train = tokenizer.texts_to_sequences(train_data['text'])
X_val = tokenizer.texts_to_sequences(val_data['text'])
X_test = tokenizer.texts_to_sequences(test_data['text'])

# Padding
X_train = pad_sequences(X_train, maxlen=max_len)
X_val = pad_sequences(X_val, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)

y_train = train_data['generated']
y_val = val_data['generated']
y_test = test_data['generated']
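
To make the roles of `max_words` and `max_len` concrete, here is a simplified pure-Python imitation of the cell above. It is a sketch under assumptions, not the Keras implementation: it splits on whitespace only (the real `Tokenizer` also strips punctuation), and the helper names `fit_word_index`, `texts_to_sequences` and `pad_pre` are ours. It mirrors the default behaviour we rely on: words are ranked by frequency, index 0 is reserved for padding, words ranked `num_words` or beyond are dropped, and sequences are padded and truncated at the front.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word, 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and words ranked num_words or beyond."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

def pad_pre(sequence, maxlen):
    """Keep the last maxlen indices, then left-pad with zeros up to maxlen."""
    kept = sequence[-maxlen:]
    return [0] * (maxlen - len(kept)) + kept

# Toy corpus (hypothetical, for illustration only).
train_texts = ["the cat sat", "the cat ran away fast"]
word_index = fit_word_index(train_texts)
sequences = texts_to_sequences(train_texts, word_index, num_words=5)
padded = [pad_pre(seq, maxlen=4) for seq in sequences]
print(word_index)
print(padded)
```

With `max_len = 150` and the default settings, a text longer than 150 kept tokens loses its beginning rather than its end, because truncation happens at the front.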

Our first simple LSTM:

In [5]:
# Building a simple LSTM
model = Sequential()
model.add(Embedding(max_words, 100, input_length=max_len))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# training
history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_val, y_val), callbacks=[early_stopping],verbose=1)

# Evaluation
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test Accuracy: {accuracy:.4f}')
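
To get a sense of the size of this model, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard layer formulas (an LSTM has four gates, each with an input kernel, a recurrent kernel and a bias) and the sizes used in the cell above; the helper names are ours, and the result is meant to be compared against `model.summary()` rather than taken as authoritative.

```python
def embedding_params(vocab_size, dim):
    """One vector of size dim per vocabulary entry."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four gates, each with an input kernel, a recurrent kernel and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

max_words, embed_dim, lstm_units = 5000, 100, 100

layers = {
    "embedding": embedding_params(max_words, embed_dim),
    "lstm": lstm_params(embed_dim, lstm_units),
    "dense": dense_params(lstm_units, 1),
}
total = sum(layers.values())
print(layers, "total:", total)
```

Most of the parameters sit in the embedding table (500,000 of 580,501), so `max_words` and the embedding size dominate the model size. The dropout layers add no parameters.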



Epoch 1/5
610/610 ━━━━━━━━━━━━━━━━━━━━ 154s 235ms/step - accuracy: 0.8726 - loss: 0.2899 - val_accuracy: 0.9731 - val_loss: 0.0889
Epoch 2/5
610/610 ━━━━━━━━━━━━━━━━━━━━ 193s 221ms/step - accuracy: 0.8981 - loss: 0.2997 - val_accuracy: 0.9019 - val_loss: 0.2573
Epoch 3/5
610/610 ━━━━━━━━━━━━━━━━━━━━ 135s 221ms/step - accuracy: 0.9655 - loss: 0.1133 - val_accuracy: 0.9799 - val_loss: 0.0726
Epoch 4/5
610/610 ━━━━━━━━━━━━━━━━━━━━ 135s 222ms/step - accuracy: 0.9807 - loss: 0.0675 - val_accuracy: 0.9809 - val_loss: 0.0661
Epoch 5/5
610/610 ━━━━━━━━━━━━━━━━━━━━ 130s 214ms/step - accuracy: 0.9879 - loss: 0.0431 - val_accuracy: 0.9840 - val_loss: 0.0542
153/153 ━━━━━━━━━━━━━━━━━━━━ 6s 38ms/step - accuracy: 0.9866 - loss: 0.0481
Test Accuracy: 0.9858


The results look too good to be realistic: I expected an LSTM to reach roughly 75 to 85% accuracy on this task, not almost 99%. I tried to find the source of the problem but could not. This undermines the exercise, because if the simplest model is already near-perfect, there is little left for the other models to improve.
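
One cause worth ruling out when scores look too good is the same text appearing in both the training and the test split. The sketch below is one possible check, not a diagnosis of this dataset: the helpers `normalize` and `split_overlap` are hypothetical, and the toy lists stand in for `train_data['text']` and `test_data['text']`.

```python
def normalize(text):
    """Lowercase and collapse whitespace so near-identical copies compare equal."""
    return " ".join(text.lower().split())

def split_overlap(train_texts, test_texts):
    """Return (count, fraction) of test texts that also occur in the training texts."""
    train_set = {normalize(t) for t in train_texts}
    test_norm = [normalize(t) for t in test_texts]
    leaked = sum(1 for t in test_norm if t in train_set)
    fraction = leaked / len(test_norm) if test_norm else 0.0
    return leaked, fraction

# Toy data (hypothetical) with one duplicated text across the two splits.
toy_train = ["Cars are great.", "Venus is hot."]
toy_test = ["cars  are great.", "Something entirely new"]
leaked, fraction = split_overlap(toy_train, toy_test)
print(f"{leaked} test texts ({fraction:.0%}) also appear in train")
```

In the notebook, the call would be `split_overlap(train_data['text'], test_data['text'])`. A fraction close to zero would rule out exact duplication as the explanation.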

Creation of the bidirectional LSTM:

In [8]:
from tensorflow.keras.layers import Bidirectional

# Bidirectional LSTM
model_bidirectional = Sequential([
    Embedding(input_dim=max_words, output_dim=100),
    SpatialDropout1D(0.2),
    Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2)),
    Dense(1, activation='sigmoid')
])
model_bidirectional.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


In [9]:

# Early stopping (to stop before overfitting)
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# training
history_bidirectional = model_bidirectional.fit(X_train, y_train, epochs=5, batch_size=264, validation_data=(X_val, y_val), callbacks=[early_stopping], verbose=1)

loss_bidirectional, accuracy_bidirectional = model_bidirectional.evaluate(X_test, y_test)
print(f'Test Accuracy (Bidirectional LSTM): {accuracy_bidirectional:.4f}')

Epoch 1/5
148/148 ━━━━━━━━━━━━━━━━━━━━ 428s 3s/step - accuracy: 0.8122 - loss: 0.3833 - val_accuracy: 0.9791 - val_loss: 0.0671
Epoch 2/5
148/148 ━━━━━━━━━━━━━━━━━━━━ 445s 3s/step - accuracy: 0.9813 - loss: 0.0632 - val_accuracy: 0.9846 - val_loss: 0.0506
Epoch 3/5
148/148 ━━━━━━━━━━━━━━━━━━━━ 458s 3s/step - accuracy: 0.9877 - loss: 0.0435 - val_accuracy: 0.9819 - val_loss: 0.0585
Epoch 4/5
148/148 ━━━━━━━━━━━━━━━━━━━━ 460s 3s/step - accuracy: 0.9896 - loss: 0.0374 - val_accuracy: 0.9838 - val_loss: 0.0506
Epoch 5/5
148/148 ━━━━━━━━━━━━━━━━━━━━ 478s 3s/step - accuracy: 0.9896 - loss: 0.0360 - val_accuracy: 0.9862 - val_loss: 0.0420
153/153 ━━━━━━━━━━━━━━━━━━━━ 17s 111ms/step - accuracy: 0.9903 - loss: 0.0335
Test Accuracy (Bidirectional LSTM): 0.9885


These results are also too good. Since two different architectures reach the same near-perfect score, the problem is probably in the data rather than in the model. (I added lines at the beginning of the code to check certain values in search of the problem; as the results above show, I did not find anything.)
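
If the data is the cause, one way to probe it is to ask whether a single trivial feature, such as the number of words in a text, already separates the two classes. The sketch below is a hypothetical diagnostic, not something the notebook ran: `best_threshold_accuracy` returns the best accuracy any rule of the form "predict 1 above (or below) a threshold" can reach on one numeric feature.

```python
def best_threshold_accuracy(values, labels):
    """Best accuracy of a one-feature threshold rule, in either direction."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(label for _, label in pairs)
    # Start from the two constant predictions (all 1 or all 0).
    best = max(total_pos, n - total_pos) / n
    pos_below = 0
    for i, (value, label) in enumerate(pairs):
        pos_below += label
        # A threshold can only be placed between two distinct feature values.
        if i + 1 < n and pairs[i + 1][0] == value:
            continue
        below = i + 1
        neg_below = below - pos_below
        pos_above = total_pos - pos_below
        neg_above = (n - below) - pos_above
        best = max(best,
                   (neg_below + pos_above) / n,   # predict 1 above the threshold
                   (pos_below + neg_above) / n)   # predict 1 below the threshold
    return best

# Toy features (hypothetical): word counts that separate the classes perfectly.
print(best_threshold_accuracy([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))
```

In the notebook this could be called with `[len(t.split()) for t in train_data['text']]` as the feature and `train_data['generated']` as the labels. A score far above the majority-class rate would indicate that the dataset is easy for reasons unrelated to the LSTM.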

Creation of the LSTM + CNN model:

In [10]:
from tensorflow.keras.layers import Conv1D, MaxPooling1D

model_lstm_cnn = Sequential([
    Embedding(input_dim=max_words, output_dim=100),
    SpatialDropout1D(0.2),

    # Convolution Conv1D
    Conv1D(filters=64, kernel_size=5, activation='relu'),
    MaxPooling1D(pool_size=4),

    Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2)),

    Dense(1, activation='sigmoid')
])

model_lstm_cnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
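
The convolution and pooling layers shorten the sequence before it reaches the LSTM. The arithmetic below is a sketch that assumes the Keras defaults used in the cell above: `Conv1D` with `padding='valid'` and stride 1, and `MaxPooling1D` with a stride equal to `pool_size`. The helper names are ours.

```python
def conv1d_valid_len(length, kernel_size, stride=1):
    """Output length of a 1D convolution without padding."""
    return (length - kernel_size) // stride + 1

def maxpool1d_valid_len(length, pool_size, stride=None):
    """Output length of 1D max pooling without padding (stride defaults to pool_size)."""
    stride = pool_size if stride is None else stride
    return (length - pool_size) // stride + 1

max_len = 150
after_conv = conv1d_valid_len(max_len, kernel_size=5)
after_pool = maxpool1d_valid_len(after_conv, pool_size=4)
print(f"embedding: {max_len} steps -> conv: {after_conv} -> pool: {after_pool}")
```

The bidirectional LSTM therefore processes 36 time steps of 64 features instead of 150 steps of 100 features. This is one plausible reason why the per-step time in the training log for this model is lower than for the simple LSTM at the same batch size.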

In [11]:
# Early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# training the model
history_lstm_cnn = model_lstm_cnn.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_val, y_val), callbacks=[early_stopping], verbose=1)

loss_lstm_cnn, accuracy_lstm_cnn = model_lstm_cnn.evaluate(X_test, y_test, verbose=0)
print(f'Test Accuracy (LSTM-CNN): {accuracy_lstm_cnn:.4f}')

Epoch 1/5
610/610 ━━━━━━━━━━━━━━━━━━━━ 81s 107ms/step - accuracy: 0.8899 - loss: 0.2301 - val_accuracy: 0.9807 - val_loss: 0.0580
Epoch 2/5
610/610 ━━━━━━━━━━━━━━━━━━━━ 64s 105ms/step - accuracy: 0.9888 - loss: 0.0360 - val_accuracy: 0.9881 - val_loss: 0.0432
Epoch 3/5
610/610 ━━━━━━━━━━━━━━━━━━━━ 66s 109ms/step - accuracy: 0.9932 - loss: 0.0219 - val_accuracy: 0.9850 - val_loss: 0.0554
Epoch 4/5
610/610 ━━━━━━━━━━━━━━━━━━━━ 65s 106ms/step - accuracy: 0.9956 - loss: 0.0167 - val_accuracy: 0.9776 - val_loss: 0.0709
Epoch 5/5
610/610 ━━━━━━━━━━━━━━━━━━━━ 64s 105ms/step - accuracy: 0.9960 - loss: 0.0129 - val_accuracy: 0.9881 - val_loss: 0.0480
Test Accuracy (LSTM-CNN): 0.9887


Once again, the model is close to perfect.
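
All three runs use `EarlyStopping(monitor='val_loss', patience=3)` with only 5 epochs, so it is worth checking whether the callback could act at all. The sketch below replays the `val_loss` values from the three training logs through a simplified version of the patience rule. It assumes strict improvement and a `min_delta` of 0, and the function `simulate_early_stopping` is ours, not the Keras implementation.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Replay the patience rule; return the stopping epoch (or None) and the best epoch."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return {"stopped_epoch": epoch, "best_epoch": best_epoch}
    return {"stopped_epoch": None, "best_epoch": best_epoch}

# val_loss per epoch, copied from the three training logs in this notebook.
logs = {
    "simple LSTM": [0.0889, 0.2573, 0.0726, 0.0661, 0.0542],
    "bidirectional LSTM": [0.0671, 0.0506, 0.0585, 0.0506, 0.0420],
    "LSTM-CNN": [0.0580, 0.0432, 0.0554, 0.0709, 0.0480],
}
results = {name: simulate_early_stopping(losses) for name, losses in logs.items()}
for name, outcome in results.items():
    print(name, outcome)
```

Under these assumptions only the LSTM-CNN run triggers the rule, at epoch 5, with its best validation loss at epoch 2. With `epochs=5` and `patience=3`, early stopping can only trigger when the best epoch is 1 or 2, so more epochs would be needed for it to play a real role.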

# Conclusion:
The conclusion is not very informative because, unfortunately, I did not solve the problem with my data. All 3 models score almost perfectly (test accuracy of 0.9858, 0.9885 and 0.9887), which I attribute to a problem in the data rather than to the models. We therefore cannot rank them, even though one would normally expect the bidirectional and CNN variants to outperform the basic LSTM.

If the data had been sound, we would have varied the embedding size, dropout rate, batch size and other hyperparameters to find the best configuration for our dataset.
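
As an illustration of what that search could look like, the sketch below enumerates a small grid of configurations. The candidate values are hypothetical placeholders, not values tested in this notebook, and each configuration would still have to be passed to a function that builds and trains the chosen model.

```python
from itertools import product

# Hypothetical candidate values; none of these were evaluated in this notebook.
search_space = {
    "embedding_dim": [50, 100, 200],
    "dropout": [0.2, 0.3, 0.5],
    "batch_size": [32, 64, 128],
}

def iter_configs(space):
    """Yield every combination of the candidate values as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(search_space))
print(len(configs), "configurations, first one:", configs[0])
```

Even this small grid gives 27 training runs, which at the epoch times logged above would take several hours. A random subset of the grid would be a cheaper alternative.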