# AuTexTification

https://sites.google.com/view/autextification/home

The new era of automatic content generation has been driven by powerful causal language models such as the Generative Pre-trained Transformer (GPT) family (Radford et al., 2019; Ouyang et al., 2022), the Pathways Language Model (PaLM) (Chowdhery et al., 2022), BLOOM (Scao et al., 2022), and ChatGPT. Most of these models are publicly available, which boosts research and the development of cutting-edge applications. However, they can also be used by malicious users or bots to spread untruthful news, reviews, or opinions (Jawahar et al., 2020). It is therefore imperative to develop technology that automatically detects generated text for content moderation, including the detection of fake news (Deng et al., 2022), bots in online environments (Tourille et al., 2022), and technical research (Rodríguez et al., 2022). Moreover, in some legal and security applications, merely identifying machine-generated text may not be sufficient: the text must also be attributed to a generation model, e.g., to notify the developers of that model, to protect intellectual property, or to assign responsibility (Uchendu et al., 2020). The malicious potential of generated text is already a reality, and it has led some conferences, such as ICML, to explicitly ban content generated by language models. In the not-so-distant future, advances in automatic text generation could take opinion spam to the next level, posing an imminent threat to companies, consumers, and readers.

## Download the data

This step will download the `AuTexTificationDataset` folder to your session.

The dataset consists of two folders, one for each subtask: `subtask_1` (machine-generated text detection) and `subtask_2` (model attribution). Each subtask folder also contains two subfolders: `en` for the English subset and `es` for the Spanish subset. Inside these folders you will find the files `train.tsv` and `test.tsv`, which correspond to the train and test splits.

Each `train.tsv` file contains six columns: `id`, `prompt`, `text`, `label`, `model`, and `domain`. This notebook relies mainly on the `id`, `text`, and `label` columns: the `id` column is a unique identifier for each text, the `text` column holds the text itself, and the `label` column is the ground-truth label (either `generated` or `human`). The other three columns are metadata from the generation process used to build `AuTexTification` (but you can play with them to try to improve your results 😉).
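As a rough illustration of the column layout described above, the sketch below parses a two-row sample and keeps only the three main columns. The sample rows are invented for illustration and are not taken from the dataset; `sample_tsv`, `df_sample`, and `df_small` are hypothetical names.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the six-column layout of train.tsv.
# The rows are invented for illustration; they are not taken from the dataset.
sample_tsv = (
    "\tid\tprompt\ttext\tlabel\tmodel\tdomain\n"
    "0\t1\tNO-PROMPT\tA text written by a person.\thuman\tNO-MODEL\ttweets\n"
    "1\t2\tSome prompt\tA text produced by a model.\tgenerated\tA\tnews\n"
)

# index_col=0 turns the unnamed first column into the DataFrame index.
df_sample = pd.read_csv(io.StringIO(sample_tsv), delimiter="\t", index_col=0)

# Keep only the three main columns.
df_small = df_sample[["id", "text", "label"]]
print(df_small)
print(df_small["label"].value_counts())
```

Reading the real files works the same way, with the file path in place of the in-memory buffer.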

This notebook works only with `subtask_1` in Spanish (`language = "es"` in the cells below), but the same code also works for `subtask_2` and for English.

Take a look at the folder structure and the files of `AuTexTificationDataset` in the left panel to get a better idea of the dataset.

In [1]:
!pip install gdown
import gdown
from google.colab import data_table
data_table.enable_dataframe_formatter()
DATASET_URL = "https://drive.google.com/drive/folders/17rMvLszfo-DQoIvzG1CusIQNwhnLKpB_"
gdown.download_folder(DATASET_URL, quiet=False)
### Annotations:
#V_ARRAY = VECT_TRAIN.toarray()  # densify the sparse vectorized matrix
#my_features = np.hstack((V_ARRAY, otras_features))  # or np.vstack --> join other features to the vectorized variable.
#perplexity -> compare long texts vs. short texts.

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Retrieving folder list


Retrieving folder 1a1tYlGlQmadf7zsYz3PPnTrplSHjM8lU subtask_1
Retrieving folder 12Fe7jpVeenEZGckFFQQA7-CVgPMBMIK3 en
Processing file 1dQxym7dPsw38VStCYPqsKGByDuq06XPY test.tsv
Processing file 1syrlebhgbvTg9tCZ-S2MOpwhFBNtGXXD train.tsv
Retrieving folder 1qQoUs57x_G1O-lP_OMPuuLlD_1TF-WoI es
Processing file 1DL0MZB8St7yAmsh70Y3sH8NQaoSzdZYs test.tsv
Processing file 1r6hAZp3mPldfhhKHrIMgQ33HiwzvB5O3 train.tsv
Retrieving folder 1m4yAowbTXz_IBvNRHOuDGQqZnza298Qd subtask_2
Retrieving folder 1_nNywlj-RYMW1tfkFVoEDLJLcVWNX9a6 en
Processing file 1-MEoYsUqdub_zTt8O6-9fGQtdSQloGHA test.tsv
Processing file 1nLCyHGDs8PiS15ZkwtCQ0_GscFhvR5RB train.tsv
Retrieving folder 1dwAWhHaWVaoiptD-JR0rL98XHJ04gbLP es
Processing file 1efLhja6Lr5B3ALu_X19ULpzYVj-8TN3Q test.tsv
Processing file 1F5mhR6tIRZCHhCwvzSEwXiEytNrpc6W9 train.tsv
Building directory structure completed


Retrieving folder list completed
Building directory structure
Downloading...
From: https://drive.google.com/uc?id=1dQxym7dPsw38VStCYPqsKGByDuq06XPY
To: /content/AuTexTificationDataset/subtask_1/en/test.tsv
100%|██████████| 7.92M/7.92M [00:00<00:00, 29.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=1syrlebhgbvTg9tCZ-S2MOpwhFBNtGXXD
To: /content/AuTexTificationDataset/subtask_1/en/train.tsv
100%|██████████| 13.1M/13.1M [00:00<00:00, 43.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=1DL0MZB8St7yAmsh70Y3sH8NQaoSzdZYs
To: /content/AuTexTificationDataset/subtask_1/es/test.tsv
100%|██████████| 7.80M/7.80M [00:00<00:00, 61.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1r6hAZp3mPldfhhKHrIMgQ33HiwzvB5O3
To: /content/AuTexTificationDataset/subtask_1/es/train.tsv
100%|██████████| 12.8M/12.8M [00:00<00:00, 115MB/s]
Downloading...
From: https://drive.google.com/uc?id=1-MEoYsUqdub_zTt8O6-9fGQtdSQloGHA
To: /content/AuTexTificationDataset/subtask_2/en/test.tsv
100%|

['/content/AuTexTificationDataset/subtask_1/en/test.tsv',
 '/content/AuTexTificationDataset/subtask_1/en/train.tsv',
 '/content/AuTexTificationDataset/subtask_1/es/test.tsv',
 '/content/AuTexTificationDataset/subtask_1/es/train.tsv',
 '/content/AuTexTificationDataset/subtask_2/en/test.tsv',
 '/content/AuTexTificationDataset/subtask_2/en/train.tsv',
 '/content/AuTexTificationDataset/subtask_2/es/test.tsv',
 '/content/AuTexTificationDataset/subtask_2/es/train.tsv']

## Import libraries

We will work with Pandas to load the tsv files, and scikit-learn to perform vectorization (bag of words/chars), classification, and evaluation.

Add to this cell any libraries you need for further experiments.

In [2]:
import pandas as pd
import re  # Used for regular expressions
# Vectorizers that measure how relevant a term is in a document
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# Tools to evaluate our models
from sklearn.metrics import confusion_matrix, classification_report, f1_score

## Load the dataset

In [3]:
# Define the subtask and language you want to address.
subtask = "subtask_1"
language = "es"

In [4]:
# Load the tsv files as dataframes
df_train_raw = pd.read_csv(
    f"AuTexTificationDataset/{subtask}/{language}/train.tsv", delimiter="\t"
)
df_test = pd.read_csv(
    f"AuTexTificationDataset/{subtask}/{language}/test.tsv", delimiter="\t"
)
# Work on a copy so that df_train_raw stays untouched for the text-based models
df_train = df_train_raw.copy()

In [5]:
# Take a look at the training dataframe
df_train.head()

Unnamed: 0,id,prompt,text,label,model,domain
0,5464,NO-PROMPT,Entrada en vigor. La presente Directiva entrar...,human,NO-MODEL,legal
1,30129,"Estos podrían ser preguntas, categorías de inf...",Preguntas: 1. ¿Cuáles son los principales argu...,generated,F,wiki
2,19553,-¿Desea algo? -Póngame una caja,¿Desea algo? Póngame una caja de madera. ¿Qué ...,generated,E,tweets
3,13005,NO-PROMPT,"@victor28088 1665 Tweets no originales, que as...",human,NO-MODEL,tweets
4,16919,NO-PROMPT,De pequeño Dios me dio a elegir entre tener un...,human,NO-MODEL,tweets


In [None]:
df_test.head()

Unnamed: 0,id,text,domain
0,17414,Buscábamos tranquilidad y la encontramos. Me t...,reviews
1,16938,"Nos sorprendió la cena, si vas con media pensi...",reviews
2,17379,Servicio atento y magnificas vistas al rio.,reviews
3,5391,La Oficina Nacional de Estadísticas de China d...,news
4,17310,Pero no puedes tener a una sola persona sirvie...,reviews


Analysis of the variables and creation of discretized (synthetic) variables

In [None]:
# Return 1 if the first whitespace-delimited token of a text starts with a lowercase letter, else 0
def verificar_primer_palabra(texto):
    # First run of non-whitespace characters in the text
    primer_palabra = re.findall(r'\S+', texto)[0]
    if primer_palabra[0].islower():  # Check whether the first character is lowercase
        return 1
    else:
        return 0

# Apply the function to the 'text' column and create the new column 'Comienza_Minuscula'
df_train['Comienza_Minuscula'] = df_train['text'].apply(verificar_primer_palabra)


In [None]:
# `str.count` treats its argument as a regular expression, so '.', '?' and '$' must be escaped.
df_train['puntos'] = df_train['text'].str.count(r'\.')
df_train['hashtags'] = df_train['text'].str.count('#')
df_train['interrogacion_abierto'] = df_train['text'].str.count('¿')
df_train['interrogacion_cerrado'] = df_train['text'].str.count(r'\?')
# Caution: the next two lines reuse the two column names above, so the '¡' and '!' counts
# overwrite the '¿' and '?' counts. Give them their own column names to keep both.
df_train['interrogacion_abierto'] = df_train['text'].str.count('¡')
df_train['interrogacion_cerrado'] = df_train['text'].str.count('!')
df_train['@'] = df_train['text'].str.count('@')
df_train['porcentaje'] = df_train['text'].str.count('%')
df_train['dolar'] = df_train['text'].str.count(r'\$')
df_train['comas'] = df_train['text'].str.count(',')
df_train['puntos_comas'] = df_train['text'].str.count(';')
# Caution: r'\...' matches a dot followed by any two characters, not only '...';
# r'\.\.\.' would count literal ellipses.
df_train['puntos_supensivos'] = df_train['text'].str.count(r'\...')
df_train['barra_baja'] = df_train['text'].str.count('_')
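Because `Series.str.count` interprets its pattern as a regular expression, small differences in escaping change what is counted. The sketch below compares three patterns on invented texts and keeps question and exclamation marks in separate columns. The column name `exclamacion_abierto` is hypothetical and does not appear in the notebook.

```python
import pandas as pd

# Invented example texts (not from the dataset).
texts = pd.Series(["Wait... what? No.", "a.bc", "¡Hola! ¿Qué tal?"])

# r"\." counts literal dots.
dots = texts.str.count(r"\.")

# r"\..." is an escaped dot followed by two wildcards, so it matches any dot
# that has at least two characters after it, not only "...".
dot_plus_two = texts.str.count(r"\...")

# r"\.\.\." matches three consecutive literal dots.
ellipses = texts.str.count(r"\.\.\.")

# Separate columns for opening question and exclamation marks.
# "exclamacion_abierto" is a hypothetical column name.
marks = pd.DataFrame({
    "interrogacion_abierto": texts.str.count("¿"),
    "exclamacion_abierto": texts.str.count("¡"),
})

print(pd.DataFrame({"text": texts, "dots": dots,
                    "dot_plus_two": dot_plus_two, "ellipses": ellipses}))
print(marks)
```

On the second text, `a.bc`, the wildcard pattern reports one match even though the text contains no ellipsis.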


We encode the `domain` variable as integer category codes.

In [None]:
df_train['domain'] = df_train['domain'].astype('category').cat.codes

In [None]:
df_train



Unnamed: 0,id,prompt,text,label,model,domain,Comienza_Minuscula,puntos,hashtags,interrogacion_abierto,interrogacion_cerrado,@,porcentaje,dolar,comas,puntos_comas,puntos_supensivos,barra_baja
0,5464,NO-PROMPT,Entrada en vigor. La presente Directiva entrar...,human,NO-MODEL,0,0,6,0,0,0,0,0,0,1,0,5,0
1,30129,"Estos podrían ser preguntas, categorías de inf...",Preguntas: 1. ¿Cuáles son los principales argu...,generated,F,2,0,5,0,0,0,0,0,0,0,0,5,0
2,19553,-¿Desea algo? -Póngame una caja,¿Desea algo? Póngame una caja de madera. ¿Qué ...,generated,E,1,0,1,0,2,2,0,0,0,0,0,1,0
3,13005,NO-PROMPT,"@victor28088 1665 Tweets no originales, que as...",human,NO-MODEL,1,0,1,0,0,1,1,0,0,2,0,0,0
4,16919,NO-PROMPT,De pequeño Dios me dio a elegir entre tener un...,human,NO-MODEL,1,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32057,16850,NO-PROMPT,"Mamá, ¿por qué no me despertaste? Te hable 5 v...",human,NO-MODEL,1,0,0,0,0,0,0,0,0,3,0,0,0
32058,6265,NO-PROMPT,. Artículo 2. Los Estados miembros aplicarán l...,human,NO-MODEL,0,0,6,0,0,0,0,0,0,1,0,5,0
32059,11284,Mi memoria es:  5%,Mi memoria es:  5% de los médicos tienen una ...,generated,B,1,0,0,0,0,0,0,1,0,1,0,0,0
32060,860,HA ADOPTADO LA PRESENTE DECISIÓN:. Artículo 1.,APROBAR el proyecto de resolución que se adjun...,generated,B,0,0,4,0,0,0,0,0,0,4,0,4,0


## Training / validation split
Drop the text and categorical columns and keep only the numeric ones (the synthetic variables plus the encoded `domain`).

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(df_train.drop(['id','prompt','text','label','model'],axis=1), df_train["label"], stratify=df_train["label"], random_state=42, test_size=0.3 )
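The `stratify` argument used in the split above keeps the class proportions the same in both partitions. A minimal sketch on invented, deliberately imbalanced labels (`toy` and the other names here are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented, imbalanced toy data: 80 "human" and 20 "generated" rows.
toy = pd.DataFrame({
    "feature": range(100),
    "label": ["human"] * 80 + ["generated"] * 20,
})

X_tr, X_va, y_tr, y_va = train_test_split(
    toy[["feature"]], toy["label"],
    stratify=toy["label"], test_size=0.3, random_state=42,
)

# With stratify, both partitions keep the 80/20 class ratio.
print(y_tr.value_counts(normalize=True))
print(y_va.value_counts(normalize=True))
```

Without `stratify`, a random split of a small or imbalanced dataset can leave one partition with noticeably fewer examples of the minority class.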

In [None]:
X_train.head()

Unnamed: 0,domain,Comienza_Minuscula,puntos,hashtags,interrogacion_abierto,interrogacion_cerrado,@,porcentaje,dolar,comas,puntos_comas,puntos_supensivos,barra_baja
24284,1,0,0,0,1,1,0,0,0,0,0,0,0
21740,2,0,2,0,0,0,0,0,0,0,0,1,0
18482,0,0,4,0,0,0,0,0,0,2,0,4,0
9942,2,0,4,0,0,0,0,0,0,3,0,4,0
12645,1,0,1,1,0,5,0,0,0,0,1,1,0


In [None]:
y_train.head()

24284        human
21740    generated
18482    generated
9942         human
12645        human
Name: label, dtype: object

In [None]:
X_val.head()

Unnamed: 0,domain,Comienza_Minuscula,puntos,hashtags,interrogacion_abierto,interrogacion_cerrado,@,porcentaje,dolar,comas,puntos_comas,puntos_supensivos,barra_baja
18613,0,0,5,0,0,0,0,0,0,2,0,4,0
11869,2,0,7,0,0,0,0,0,0,2,0,7,0
8698,0,0,6,0,0,0,0,0,0,1,0,5,0
21587,2,0,2,0,0,0,0,0,0,2,0,2,0
5521,0,0,2,0,0,0,0,0,0,1,0,2,0


In [None]:
y_val.head()

18613        human
11869        human
8698         human
21587    generated
5521         human
Name: label, dtype: object

Fit a Random Forest model to estimate the importance of these variables.

In [None]:
# Instantiate the model
model = RandomForestClassifier(max_depth=1000,random_state=42)
# Fit the model on the synthetic features. You can play with hyperparameters such as
# the number of trees (n_estimators) or max_depth to get better solutions.
model.fit(X_train, y_train)

In [None]:
# Then, we can predict the labels for each text in the validation set
# calling `model.predict`
preds = model.predict(X_val)

In [None]:
# Inspect the predictions
preds

array(['human', 'human', 'human', ..., 'generated', 'generated', 'human'],
      dtype=object)

In [None]:
y_val.head()

18613        human
11869        human
8698         human
21587    generated
5521         human
Name: label, dtype: object

Evaluate the model and compute the feature importances

In [None]:
# Compute the macro-F1 score
mf1 = f1_score(y_true=y_val, y_pred=preds, average="macro")

# Compute the confusion matrix
conf_matrix = confusion_matrix(
    y_true=y_val,
    y_pred=preds,
    labels=["generated", "human"]
    if subtask == "subtask_1"
    else ["A", "B", "C", "D", "E", "F"],
)
# Compute a classification report
clf_report = classification_report(y_true=y_val, y_pred=preds)

In [None]:
# Importance of each variable
importance = model.feature_importances_
# Create a DataFrame with the variables and their importances
importance_df = pd.DataFrame({'Variable': X_train.columns, 'Importance': importance})
# Sort the variables by importance in descending order
importance_df = importance_df.sort_values('Importance', ascending=False)

Print the macro-F1 score, the confusion matrix, the classification report, and the feature importances

In [None]:
print(f"Macro-F1: {mf1}\n")
print(f"Confusion Matrix:\n{conf_matrix}\n")
print(f"Classification report:\n{clf_report}\n")
print(f"gini:\n{importance_df}\n")

Macro-F1: 0.6823825463035007

Confusion Matrix:
[[3069 1814]
 [1235 3501]]

Classification report:
              precision    recall  f1-score   support

   generated       0.71      0.63      0.67      4883
       human       0.66      0.74      0.70      4736

    accuracy                           0.68      9619
   macro avg       0.69      0.68      0.68      9619
weighted avg       0.69      0.68      0.68      9619


gini:
                 Variable  Importance
11      puntos_supensivos    0.265698
2                  puntos    0.256582
9                   comas    0.179784
0                  domain    0.122903
3                hashtags    0.028834
4   interrogacion_abierto    0.027729
5   interrogacion_cerrado    0.027524
6                       @    0.026579
10           puntos_comas    0.022450
7              porcentaje    0.017006
1      Comienza_Minuscula    0.011612
12             barra_baja    0.006939
8                   dolar    0.006359
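The values printed under `gini` are scikit-learn's impurity-based importances, which are normalized so that they sum to 1. The sketch below shows this on synthetic data in which only one column carries signal. The data and the names `X_demo`, `y_demo`, and `forest` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only the "signal" column determines the label.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 3)),
                      columns=["signal", "noise_1", "noise_2"])
y_demo = (X_demo["signal"] > 0).map({True: "generated", False: "human"})

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Same construction as in the notebook: one importance value per column.
importance_demo = pd.DataFrame(
    {"Variable": X_demo.columns, "Importance": forest.feature_importances_}
).sort_values("Importance", ascending=False)

print(importance_demo)
print("Sum of importances:", importance_demo["Importance"].sum())
```

Because the importances are relative shares, a high value only means that a variable is used more than the others, not that the model as a whole is accurate.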



## Logistic Regression

## Training / validation split

In [None]:
from sklearn.model_selection import train_test_split
X_train1, X_val1, y_train1, y_val1 = train_test_split(df_train_raw["text"], df_train_raw["label"], stratify=df_train_raw["label"], random_state=42, test_size=0.3 )

## Vectorize the text

We will vectorize the texts using bag-of-words (frequencies). You can play with other alternatives like presence instead of frequencies, word/char n-grams, TF-IDF weighting, preprocessing the text before vectorization etc.

In [19]:
# First, fit the vectorizer on the training set (using the test set for this is cheating)
# Take a look at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# or https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
# for more information about the vectorizer classes.
# Instantiate the vectorizer with BagOfWords
#vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 2), max_features=10000)
# Instantiate the vectorizer with TFIDF
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4), max_features=10000)
# Fit vectorizer on the texts from the training set.
# For simplicity, we call `fit_transform` that fits
# the vectorizer and then transforms the texts into
# vectors.
# Note that `fit_transform` and `transform` return a sparse matrix.
# If you need to work with dense vectors, cast it as:
# vect_train1 = vect_train1.toarray()
vect_train1 = vectorizer.fit_transform(X_train1)


Inspect the learned n-gram vocabulary and the data type of the vectorized matrix

In [None]:
print(vectorizer.vocabulary_)

{'ha': 4396, 'ab': 1795, 'bí': 2554, 'ía': 9817, 'a ': 1612, ' u': 943, 'un': 9118, 'na': 6002, ' v': 961, 've': 9315, 'ez': 4113, 'z ': 9555, ' t': 900, 'to': 8703, 'or': 6841, 'rt': 7831, 'tu': 8796, 'ug': 9050, 'gu': 4358, 'ui': 9053, 'it': 4960, 'ta': 8495, ' q': 802, 'qu': 7253, 'ue': 8979, 'e ': 3258, ' f': 439, 'fu': 4217, ' a': 118, ' s': 841, 'su': 8441, 'u ': 8875, ' p': 748, 'pr': 7181, 'ri': 7616, 'im': 4757, 'me': 5662, 'er': 3839, 'r ': 7280, ' d': 331, 'dí': 3250, 'de': 3020, ' c': 269, 'cl': 2771, 'la': 5239, 'as': 2241, 'se': 8190, 'es': 3955, 's ': 7928, ' y': 999, 'y ': 9462, 'cu': 2894, 'ua': 8909, 'an': 2055, 'nd': 6092, 'do': 3163, 'o ': 6408, ' l': 572, 'll': 5428, 'le': 5319, 'eg': 3576, 'ó ': 9913, ' ¡': 1032, 'ya': 9529, ' e': 378, 'st': 8389, 'ba': 2400, 'n ': 5862, 'va': 9284, 'ac': 1821, 'ca': 2567, 'ci': 2712, 'io': 4844, 'on': 6748, 'ne': 6121, 's!': 8077, 'hab': 4404, 'abí': 1819, 'bía': 2555, 'ía ': 9818, 'a u': 1743, ' un': 948, 'una': 9137, 'na ': 600

In [None]:
print(vect_train1.dtype)

float64
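The vectorized matrix is sparse and of type `float64`, so hand-crafted numeric features such as the punctuation counts built earlier can be appended to it as extra columns. A minimal sketch, assuming `scipy.sparse.hstack` is used instead of densifying the matrix. The texts and variable names are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy texts.
texts = ["Hola, mundo.", "Otro texto... sin comas", "Uno, dos, tres."]

tfidf = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
X_text = tfidf.fit_transform(texts)  # sparse matrix, float64

# Two hand-crafted features per text: number of commas and number of dots.
extra = np.array([[t.count(","), t.count(".")] for t in texts], dtype=np.float64)

# scipy.sparse.hstack keeps the result sparse; np.hstack would first require
# a dense copy of the text matrix via X_text.toarray().
X_combined = hstack([X_text, csr_matrix(extra)]).tocsr()

print(X_text.shape, extra.shape, X_combined.shape)
```

The combined matrix can be passed to `fit` and `predict` like any other feature matrix, as long as the same columns are appended in the same order at prediction time.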


## Train a model

With scikit-learn, it is really easy to train a supervised model and use it to perform classification or regression on unseen data. Scikit-learn provides a wide set of models for supervised learning, which you can see here: https://scikit-learn.org/stable/supervised_learning.html

All the models can be used in the same way, since they expose the same interfaces (fit/predict):

1) Instantiate the model, e.g., `model = LogisticRegression()`

2) Train the model: `model.fit(data, labels)`

3) Predict using the trained model: `model.predict(data)`


Here we will use a LogisticRegression model as an example, but you can use any of the models provided by scikit-learn, such as Support Vector Machines, Random Forests, Decision Trees, Nearest Neighbors, or Naive Bayes. You can even play with ensembles of models or neural networks. For neural networks, however, dedicated libraries such as PyTorch or TensorFlow are strongly preferred over scikit-learn.
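Since all estimators share the fit/predict interface, swapping models only changes the line that instantiates them. The sketch below runs the three steps listed above for several classifiers on synthetic data, which stands in for the vectorized texts. The data and the `candidates` dictionary are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the vectorized texts.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# 1) Instantiate the models.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=3),
    "tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)            # 2) Train the model.
    preds = clf.predict(X_va)      # 3) Predict with the trained model.
    scores[name] = f1_score(y_va, preds, average="macro")

for name, score in scores.items():
    print(f"{name}: macro-F1 = {score:.3f}")
```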

In [None]:
# Instantiate the model
model = LogisticRegression()
# Fit the model using the vectorized text of the training set and the ground truth labels.
# You can get convergence warnings when using some models. Don't worry, this is frequent
# in classification problems where the labels are not linearly separable. Your model is already
# trained, but you can play with the tolerance or the number of iterations to get better solutions.
model.fit(vect_train1, y_train1)

## Predict using a trained model

In [None]:
# First, we need to vectorize the validation set using the vectorizer
# we fit some cells above.
vect_test1 = vectorizer.transform(X_val1)
# Then, we can predict the labels for each text in the validation set
# calling `model.predict`
preds1 = model.predict(vect_test1)

In [None]:
# Inspect the predictions
preds1

array(['human', 'human', 'human', ..., 'generated', 'generated', 'human'],
      dtype=object)

In [None]:
y_val1.head()

18613        human
11869        human
8698         human
21587    generated
5521         human
Name: label, dtype: object

## Evaluate your model

Once the predictions for the validation set are computed, you need to know how good your approach is. To quantify its quality, you can use the evaluation metrics from `sklearn.metrics`, which also provides methods to compute confusion matrices and to generate reports with metrics such as precision, recall, $F_1$, and accuracy.

The metric we used to evaluate approaches in AuTexTification was the Macro-F$_1$. We will use it in this notebook to evaluate your approach.
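Macro-F$_1$ is the unweighted mean of the per-class $F_1$ scores, so every class counts equally regardless of its size. The sketch below derives it by hand from a confusion matrix and compares it with `f1_score`. The labels are invented for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Invented ground truth and predictions.
y_true = ["generated", "generated", "generated", "human", "human", "human", "human"]
y_pred = ["generated", "generated", "human", "human", "human", "human", "generated"]
labels = ["generated", "human"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

tp = np.diag(cm)                    # correct predictions per class
precision = tp / cm.sum(axis=0)     # column sums: everything predicted as the class
recall = tp / cm.sum(axis=1)        # row sums: everything that truly is the class
f1_per_class = 2 * precision * recall / (precision + recall)
macro_f1 = f1_per_class.mean()

print(cm)
print(dict(zip(labels, f1_per_class)))
print(macro_f1, f1_score(y_true, y_pred, average="macro"))
```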

In [None]:
# Compute the macro-F1 score
mf11 = f1_score(y_true=y_val1, y_pred=preds1, average="macro")

# Compute the confusion matrix
conf_matrix1 = confusion_matrix(
    y_true=y_val1,
    y_pred=preds1,
    labels=["generated", "human"]
    if subtask == "subtask_1"
    else ["A", "B", "C", "D", "E", "F"],
)
# Compute a classification report
clf_report1 = classification_report(y_true=y_val1, y_pred=preds1)

In [None]:
print(f"Macro-F1: {mf11}\n")
print(f"Confusion Matrix:\n{conf_matrix1}\n")
print(f"Classification report:\n{clf_report1}\n")

Macro-F1: 0.8076969045885459

Confusion Matrix:
[[4034  849]
 [ 999 3737]]

Classification report:
              precision    recall  f1-score   support

   generated       0.80      0.83      0.81      4883
       human       0.81      0.79      0.80      4736

    accuracy                           0.81      9619
   macro avg       0.81      0.81      0.81      9619
weighted avg       0.81      0.81      0.81      9619




# Random Forest Model


## Training / validation split for Random Forest

In [None]:
from sklearn.model_selection import train_test_split
X_train4, X_val4, y_train4, y_val4 = train_test_split(df_train_raw["text"], df_train_raw["label"], stratify=df_train_raw["label"], random_state=42, test_size=0.3 )

## Vectorize the text for Random Forest

In [None]:
vectorizer4 = TfidfVectorizer(analyzer="char", ngram_range=(2, 4), max_features=10000)
vect_train4 = vectorizer4.fit_transform(X_train4)

## Train a Random Forest model

In [None]:
# Instantiate the model
modelrf = RandomForestClassifier(max_depth=1000,random_state=42)
# Fit the model on the vectorized training texts. You can play with hyperparameters such as
# the number of trees (n_estimators) or max_depth to get better solutions.
modelrf.fit(vect_train4, y_train4)

## Predict using the Random Forest model

In [None]:
# First, we need to vectorize the validation set using the vectorizer
# we fit some cells above.
vect_test4 = vectorizer4.transform(X_val4)
# Then, we can predict the labels for each text in the validation set
# calling `modelrf.predict`
preds4 = modelrf.predict(vect_test4)

## Evaluate the Random Forest model

In [None]:
# Compute the macro-F1 score
mf4 = f1_score(y_true=y_val4, y_pred=preds4, average="macro")

# Compute the confusion matrix
conf_matrix4 = confusion_matrix(
    y_true=y_val4,
    y_pred=preds4,
    labels=["generated", "human"]
    if subtask == "subtask_1"
    else ["A", "B", "C", "D", "E", "F"],
)
# Compute a classification report
clf_report4 = classification_report(y_true=y_val4, y_pred=preds4)

In [None]:
print(f"Macro-F1: {mf4}\n")
print(f"Confusion Matrix:\n{conf_matrix4}\n")
print(f"Classification report:\n{clf_report4}\n")

Macro-F1: 0.7954186512158166

Confusion Matrix:
[[4205  678]
 [1278 3458]]

Classification report:
              precision    recall  f1-score   support

   generated       0.77      0.86      0.81      4883
       human       0.84      0.73      0.78      4736

    accuracy                           0.80      9619
   macro avg       0.80      0.80      0.80      9619
weighted avg       0.80      0.80      0.80      9619




## Evaluate Random Forest with GridSearch

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split


# Split the synthetic-feature data into training and validation partitions
X_train7, X_val7, y_train7, y_val7 = train_test_split(df_train.drop(['id','prompt','text','label','model'],axis=1), df_train["label"], stratify=df_train["label"], test_size=0.3, random_state=42)

# Define the hyperparameters to tune
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

# Create the Random Forest model
rf_model = RandomForestClassifier()

# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5)

# Fit the grid search on the training data
grid_search.fit(X_train7, y_train7)

# Get the best hyperparameter configuration and the best cross-validation score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Retrain a model with the best configuration and evaluate it on the validation data
best_model = RandomForestClassifier(**best_params)
best_model.fit(X_train7, y_train7)
test_accuracy = best_model.score(X_val7, y_val7)

# Print the results
print("Best hyperparameters found:")
print(best_params)
print("Best cross-validation score:")
print(best_score)
print("Accuracy on the validation data with the best configuration:")
print(test_accuracy)
# Importance of each variable
importance7 = best_model.feature_importances_
# Create a DataFrame with the variables and their importances
importance_df7 = pd.DataFrame({'Variable': X_train7.columns, 'Importance': importance7})
# Sort the variables by importance in descending order
importance_df7 = importance_df7.sort_values('Importance', ascending=False)
print(f"gini:\n{importance_df7}\n")

Best hyperparameters found:
{'max_depth': None, 'min_samples_split': 10, 'n_estimators': 50}
Best cross-validation score:
0.6884107279072751
Accuracy on the validation data with the best configuration:
0.6881172679072669
gini:
                 Variable  Importance
11      puntos_supensivos    0.284627
2                  puntos    0.255078
9                   comas    0.167135
0                  domain    0.130568
3                hashtags    0.028837
4   interrogacion_abierto    0.025976
6                       @    0.023908
5   interrogacion_cerrado    0.023389
10           puntos_comas    0.020419
7              porcentaje    0.016198
1      Comienza_Minuscula    0.011138
12             barra_baja    0.006540
8                   dolar    0.006186



In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split


# Split the raw texts into training and validation partitions
X_train8, X_val8, y_train8, y_val8 = train_test_split(df_train_raw['text'], df_train_raw["label"], stratify=df_train_raw["label"], test_size=0.3, random_state=42)

vectorizer8 = TfidfVectorizer(analyzer="char", ngram_range=(2, 4), max_features=10000)
vect_train8 = vectorizer8.fit_transform(X_train8)

# Define the hyperparameters to tune
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

# Create the Random Forest model
rf_model = RandomForestClassifier()

# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5)

# Fit the grid search on the vectorized training data
grid_search.fit(vect_train8, y_train8)

# Get the best hyperparameter configuration and the best cross-validation score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Retrain a model with the best configuration on the vectorized training data
# and evaluate it on the vectorized validation data
best_model = RandomForestClassifier(**best_params)
vect_test8 = vectorizer8.transform(X_val8)

best_model.fit(vect_train8, y_train8)
test_accuracy = best_model.score(vect_test8, y_val8)

# Print the results
print("Best hyperparameters found:")
print(best_params)
print("Best cross-validation score:")
print(best_score)
print("Accuracy on the validation data with the best configuration:")
print(test_accuracy)
# Importance of each feature: one value per n-gram in the vectorizer vocabulary
importance8 = best_model.feature_importances_
# Create a DataFrame with the n-grams and their importances
importance_df8 = pd.DataFrame({'Variable': vectorizer8.get_feature_names_out(), 'Importance': importance8})
# Sort the n-grams by importance in descending order
importance_df8 = importance_df8.sort_values('Importance', ascending=False)
print(f"gini:\n{importance_df8}\n")

# KNN Model

## Training / validation split for KNN

In [None]:
from sklearn.model_selection import train_test_split
X_train5, X_val5, y_train5, y_val5 = train_test_split(df_train_raw["text"], df_train_raw["label"], stratify=df_train_raw["label"], random_state=42, test_size=0.3 )

## Vectorize the text for KNN

In [None]:
vectorizer5 = TfidfVectorizer(analyzer="char", ngram_range=(2, 4), max_features=10000)
vect_train5 = vectorizer5.fit_transform(X_train5)

## Train a KNN model

In [None]:
model_knn = KNeighborsClassifier(n_neighbors=3)
model_knn.fit(vect_train5, y_train5)

## Predict using the KNN model

In [None]:
vect_test5 = vectorizer5.transform(X_val5)
preds5 = model_knn.predict(vect_test5)

## Evaluate the KNN model

In [None]:
# Compute the macro-F1 score for KNN
mf5 = f1_score(y_true=y_val5, y_pred=preds5, average="macro")

# Compute the confusion matrix for KNN
conf_matrix5 = confusion_matrix(
    y_true=y_val5,
    y_pred=preds5,
    labels=["generated", "human"]
    if subtask == "subtask_1"
    else ["A", "B", "C", "D", "E", "F"],
)
# Compute a classification report for KNN
clf_report5 = classification_report(y_true=y_val5, y_pred=preds5)
print(f"Macro-F1: {mf5}\n")
print(f"Confusion Matrix:\n{conf_matrix5}\n")
print(f"Classification report:\n{clf_report5}\n")

# Support Vector Machine Model

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

## Training / validation split for SVM

In [None]:
X_train2, X_val2, y_train2, y_val2 = train_test_split(df_train_raw["text"], df_train_raw["label"], stratify=df_train_raw["label"], random_state=42, test_size=0.3 )

## Vectorize the text for SVM

In [None]:
vectorizer2 = TfidfVectorizer(analyzer="char", ngram_range=(2, 4), max_features=10000)
vect_train2 = vectorizer2.fit_transform(X_train2)

## Train an SVM model

In [None]:
# Configurations tried, with the macro-F1 obtained for each:
#modelsvc = SVC(kernel='poly',degree=8)  # Macro-F1: 0.6008068757254312
#modelsvc = SVC(C=2,kernel='rbf')        # Macro-F1: 0.8499128702686488
#modelsvc = SVC(C=5,kernel='rbf')        # Macro-F1: 0.8507489629641684 <-- selected
modelsvc = SVC(C=5,kernel='rbf')
modelsvc.fit(vect_train2, y_train2)

## Predict using the SVM model

In [None]:
vect_test2 = vectorizer2.transform(X_val2)
preds2 = modelsvc.predict(vect_test2)

## Evaluate the SVM model

In [None]:
# Compute the macro-F1 score SVM
mf2 = f1_score(y_true=y_val2, y_pred=preds2, average="macro")

# Compute the confusion matrix SVM
conf_matrix2 = confusion_matrix(
    y_true=y_val2,
    y_pred=preds2,
    labels=["generated", "human"]
    if subtask == "subtask_1"
    else ["A", "B", "C", "D", "E", "F"],
)
# Compute a classification report SVM
clf_report2 = classification_report(y_true=y_val2, y_pred=preds2)
print(f"Macro-F1: {mf2}\n")
print(f"Confusion Matrix:\n{conf_matrix2}\n")
print(f"Classification report:\n{clf_report2}\n")

## Tuned SVM model trained on all the training data

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
X_train_completo = df_train_raw['text']
vectorizer_c = TfidfVectorizer(analyzer="char", ngram_range=(2, 4), max_features=10000)
vect_train_completo = vectorizer_c.fit_transform(X_train_completo)
y_train_completo = df_train_raw['label']
# Tuned SVM model, using the configuration selected above
modelsvc = SVC(C=5,kernel='rbf')
# Train the model on all the training data
modelsvc.fit(vect_train_completo, y_train_completo)

## Predict on all the test data - Test 1 ES & Test 1 EN

In [None]:
X_test_t1 = df_test['text']
# Predictions
vect_test_completo = vectorizer_c.transform(X_test_t1)
y_pred_t1 = modelsvc.predict(vect_test_completo)

## Write the results to a CSV file

In [None]:
# Create a DataFrame for results.
df_resultado = pd.DataFrame()
df_resultado['id'] = df_test['id']
df_resultado['label'] = y_pred_t1

In [None]:
# Write the results as tab-separated values (the file keeps the .csv extension)
df_resultado[['id', 'label']].to_csv('resultados.csv', sep='\t', index=False)
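
A quick way to check a results file before submitting it is to read it back with the same separator. The sketch below does this on an invented toy DataFrame written to a temporary folder; `toy_result`, `toy_path` and `toy_check` are hypothetical names, and the checks shown are only a suggestion.

```python
import os
import tempfile

import pandas as pd

# Toy predictions, invented for illustration only
toy_result = pd.DataFrame({"id": [101, 102, 103], "label": ["human", "generated", "human"]})

# Write to a temporary folder so the real results file is not overwritten
toy_path = os.path.join(tempfile.mkdtemp(), "toy_resultados.csv")
toy_result.to_csv(toy_path, sep="\t", index=False)

# Read the file back with the same separator and inspect columns and row count
toy_check = pd.read_csv(toy_path, sep="\t")
print(toy_check.columns.tolist())
print(len(toy_check), toy_check["id"].is_unique)
```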

# Failed attempts

## GridSearch attempt

We tried to find the best hyperparameters for our prediction model with a grid search, but the computational cost was too high.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report, confusion_matrix


X_train3, X_val3, y_train3, y_val3 = train_test_split(df_train_raw["text"], df_train_raw["model"], stratify=df_train_raw["model"], random_state=42, test_size=0.3 )
vectorizer3 = TfidfVectorizer(analyzer="char", ngram_range=(2, 4), max_features=10000)
vect_train3 = vectorizer3.fit_transform(X_train3)

# Define the parameter grid to evaluate
param_grid = [
    {'C': [0.1, 1, 10, 100], 'kernel': ['linear']},
    {'C': [0.1, 1, 10, 100], 'gamma': [0.1, 0.01, 0.001], 'kernel': ['rbf']},
    {'C': [0.1, 1, 10, 100], 'degree': [2, 3, 4], 'kernel': ['poly']}
]
### Part 1 - SVM model
# Create an SVM model
svm_clf = SVC()

# Search for the best parameters (computationally expensive)
svm_grid_search = GridSearchCV(svm_clf, param_grid, cv=5)
svm_grid_search.fit(vect_train3, y_train3)

# Best parameters
best_svm_params = svm_grid_search.best_params_
best_svm_clf = svm_grid_search.best_estimator_

# Predict using the best parameters (transform with vectorizer3, the one fitted on X_train3)
vect_test3 = vectorizer3.transform(X_val3)
preds3 = best_svm_clf.predict(vect_test3)

# Display results
# Compute the macro-F1 score
mf3 = f1_score(y_true=y_val3, y_pred=preds3, average="macro")
# Compute a classification report
clf_report3 = classification_report(y_true=y_val3, y_pred=preds3)
# Compute the confusion matrix
conf_matrix3 = confusion_matrix(
    y_true=y_val3,
    y_pred=preds3,
    labels=["generated", "human"]
    if subtask == "subtask_1"
    else ["A", "B", "C", "D", "E", "F"],
)
print(f"Macro-F1: {mf3}\n")
print(f"Confusion Matrix:\n{conf_matrix3}\n")
print(f"Classification report:\n{clf_report3}\n")
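### Part 2 - Logistic Regression

# Create a Logistic Regression model
logreg_clf3 = LogisticRegression()

# LogisticRegression has no 'kernel', 'gamma' or 'degree' parameters,
# so it needs its own grid: only 'C' is searched here
logreg_param_grid = {'C': [0.1, 1, 10, 100]}

# Search for the best parameters (computationally expensive)
logreg_grid_search = GridSearchCV(logreg_clf3, logreg_param_grid, cv=5)
logreg_grid_search.fit(vect_train3, y_train3)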

# Best parameters
best_logreg_params = logreg_grid_search.best_params_
best_logreg_clf = logreg_grid_search.best_estimator_

# Predict using the best parameters (transform with vectorizer3, the one fitted on X_train3)
vect_test31 = vectorizer3.transform(X_val3)
preds31 = best_logreg_clf.predict(vect_test31)

# Display results
# Compute the macro-F1 score
mf31 = f1_score(y_true=y_val3, y_pred=preds31, average="macro")
# Compute a classification report
clf_report31 = classification_report(y_true=y_val3, y_pred=preds31)
# Compute the confusion matrix
conf_matrix31 = confusion_matrix(
    y_true=y_val3,
    y_pred=preds31,
    labels=["generated", "human"]
    if subtask == "subtask_1"
    else ["A", "B", "C", "D", "E", "F"],
)
print(f"Macro-F1: {mf31}\n")
print(f"Confusion Matrix:\n{conf_matrix31}\n")
print(f"Classification report:\n{clf_report31}\n")
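
The SVM grid above has 4 + 12 + 12 = 28 parameter combinations, which with `cv=5` means 140 `SVC` fits before the final refit. One cheaper alternative, sketched here under assumptions and not something the notebook ran, is to sample a few candidates with `RandomizedSearchCV` over a linear SVM (`LinearSVC`) and score them with macro-F1. The toy data and the names `search_pipeline`, `param_distributions` and `random_search` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only; in the notebook this could be a
# subsample of the training texts and labels
toy_texts = ["free money now", "win a prize today", "claim your reward", "cheap pills online",
             "see you at lunch", "meeting moved to noon", "call me tonight", "happy birthday mum"] * 3
toy_labels = (["generated"] * 4 + ["human"] * 4) * 3

# Vectorizer and classifier in one pipeline, so each CV fold refits both
search_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 4), max_features=10000)),
    ("clf", LinearSVC()),
])

# Sample a few values of C instead of trying every combination
param_distributions = {"clf__C": [0.1, 1, 10, 100]}
random_search = RandomizedSearchCV(
    search_pipeline, param_distributions, n_iter=3, cv=3,
    scoring="f1_macro", random_state=42,
)
random_search.fit(toy_texts, toy_labels)
print(random_search.best_params_, random_search.best_score_)
```

Whether a linear kernel matches the RBF results reported earlier would have to be checked on the real validation split.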

## Threshold evaluation attempt

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score, recall_score

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(df_train["text"], df_train["label"], stratify=df_train["label"], random_state=42, test_size=0.3)
# Vectorize the text with character n-gram TF-IDF
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4), max_features=10000)
vect_train = vectorizer.fit_transform(X_train)
vect_val = vectorizer.transform(X_val)

# Instantiate and fit the logistic regression model
model = LogisticRegression()
model.fit(vect_train, y_train)

# Predicted probability of the second class (model.classes_[1]) on the validation set
y_pred_prob_val = model.predict_proba(vect_val)[:, 1]

# Compute F1, accuracy, recall and true positives over a range of thresholds
def maximize_metrics(y_true, y_pred_prob):
    thresholds = np.arange(0.05, 1.0, 0.05)
    f1_scores = []
    accuracy_scores = []
    recall_scores = []
    true_positives = []

    for threshold in thresholds:
        y_pred = (y_pred_prob >= threshold).astype(int)
        f1 = f1_score(y_true, y_pred)
        accuracy = accuracy_score(y_true, y_pred)
        recall = recall_score(y_true, y_pred)

        f1_scores.append(f1)
        accuracy_scores.append(accuracy)
        recall_scores.append(recall)
        true_positives.append(sum((y_true == 1) & (y_pred == 1)))

    return thresholds, true_positives, accuracy_scores, recall_scores, f1_scores

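# Binarize the true labels so they match the 0/1 thresholded predictions:
# 1 for the class whose probability was taken (model.classes_[1]), 0 otherwise
y_val_bin = (y_val == model.classes_[1]).astype(int)

# Call maximize_metrics with the binary true labels and the predicted probabilities
thresholds, true_positives, accuracy_scores, recall_scores, f1_scores = maximize_metrics(y_val_bin, y_pred_prob_val)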

# Build a DataFrame with the results
results_df = pd.DataFrame({'Threshold': thresholds, 'True Positives': true_positives, 'Accuracy': accuracy_scores, 'Recall': recall_scores, 'F1-Score': f1_scores})

# Sort the results by threshold
results_df = results_df.sort_values(by='Threshold')

# Express accuracy, recall and F1-score as percentages
results_df['Accuracy (%)'] = results_df['Accuracy'] * 100
results_df['Recall (%)'] = results_df['Recall'] * 100
results_df['F1-Score (%)'] = results_df['F1-Score'] * 100

# Keep the desired columns in the specified order
results_df = results_df[['Threshold', 'True Positives', 'Accuracy (%)', 'Recall (%)', 'F1-Score (%)']]

# Print the DataFrame
print(results_df)
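
The sweep above checks a fixed grid of thresholds in steps of 0.05. As an alternative sketch, scikit-learn's `precision_recall_curve` returns precision and recall at every distinct score, from which the F1-maximizing threshold can be read off. The toy labels and scores below are invented for illustration, and `f1_per_threshold`, `best_index` and `best_threshold` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy binary labels and predicted scores, invented for illustration only
toy_y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
toy_scores = np.array([0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

# precision and recall have one more entry than thresholds (the final point is
# precision=1, recall=0), so drop the last entry before combining them
precision, recall, pr_thresholds = precision_recall_curve(toy_y_true, toy_scores)
f1_per_threshold = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)

# Threshold with the highest F1 for the positive class
best_index = int(np.argmax(f1_per_threshold))
best_threshold = pr_thresholds[best_index]
print(best_threshold, f1_per_threshold[best_index])
```

A threshold chosen this way is fitted to the validation set, so it should be confirmed on held-out data before being trusted.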

## Sentiment analysis attempt

We tried to extract sentiment scores for the paragraphs or parts of each text and add them to the training dataframe, but the computational cost was too high.

In [None]:
!pip install --user nltk


In [None]:
import nltk
nltk.download('vader_lexicon')
nltk.download('punkt')

In [None]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Spanish sentence tokenizer (punkt)
tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
# Create the analyzer once instead of once per sentence
analizador = SentimentIntensityAnalyzer()
valores_compound = []
for text in df_train['text']:
    # Score only the sentences of this text, paragraph by paragraph
    compounds = []
    for parrafo in text.split('\n'):
        for sentence in tokenizer.tokenize(parrafo):
            scores = analizador.polarity_scores(sentence)
            compounds.append(scores['compound'])
    # One mean compound score per text; 0.0 when the text has no sentences
    valores_compound.append(sum(compounds) / len(compounds) if compounds else 0.0)


In [None]:
df_train_raw['feelings'] = valores_compound
# Save the dataframes so other models can reuse them without recomputing
df_train_raw.to_csv('df_train.tsv',sep='\t', index=False)
df_test.to_csv('df_test.tsv',sep='\t', index=False)