![Team](intro.png)

## Introduction and Problem Statement

The goal of this task is to develop a machine learning model that accurately predicts the author of a text based solely on the presence or absence of specific words.

## Dataset Description

The training data is a table in which each row represents one text sample. All columns but one indicate whether a specific word is present in the text; the remaining column names the author of that text.

### • Word Columns
These columns represent individual words. Each entry is a binary value, where 1 indicates that the word appears in the text, and 0 indicates that it does not.

### • author Column
This column contains the name of the author who wrote the text. It serves as the target variable that the model must predict.

### • Example Structure of the Dataset

| word_1 | word_2 | ... | word_n | author        |
|:------:|:------:|:---:|:------:|:--------------|
|   0    |   1    | ... |   1    | Mason Reed    |
|   1    |   1    | ... |   1    | Ava Thompson  |
|   0    |   1    | ... |   0    | Liam Carter   |
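
To make the encoding concrete, here is a minimal sketch of how one row of word columns could be derived from raw text. The dataset's actual tokenization is not documented, so the function `encode_presence`, its regex tokenizer, and the sample sentence are assumptions for illustration only; the four vocabulary words are borrowed from column names that appear in the test data preview.

```python
import re

def encode_presence(text, vocabulary):
    """Return a 0/1 list: 1 if the vocabulary word occurs in the text, else 0."""
    # Assumed tokenization: lowercase the text and keep runs of letters.
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return [1 if word in tokens else 0 for word in vocabulary]

# Hypothetical vocabulary and sample sentence.
vocabulary = ["rain", "hair", "oil", "run"]
sample = "Rain fell all night, and the oil lamp burned low."

row = encode_presence(sample, vocabulary)
print(dict(zip(vocabulary, row)))
# {'rain': 1, 'hair': 0, 'oil': 1, 'run': 0}
```

Note that a presence encoding discards word counts and word order: a text that mentions "rain" once and one that mentions it ten times produce the same row.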


The test dataset follows the same structure as the training dataset, except that it does not include the author column. It contains 2765 rows.

### Importing Required Libraries

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

### Loading Training and Test Datasets

In [2]:
# Forward slashes keep the relative paths portable across operating systems.
train_df = pd.read_csv('../data/train.csv')
test_df = pd.read_csv('../data/test.csv')
test_df.head(10)

Unnamed: 0,lung,council,solution,quite,rain,hair,skill,difficulty,add,pull,...,stocking,near,oil,dive,many,run,tender,asleep,eat,sweep
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Preprocessing and Feature Engineering

In [3]:
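# Separate the word-presence features from the target column.
x = train_df.drop(columns='author')
y = train_df['author']

# Encode author names as integer class labels.
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Hold out 20% for validation; stratify to preserve the author distribution.
x_train, x_val, y_train, y_val = train_test_split(
    x, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)

# The test set has no author column, so all of its columns are features.
x_test = test_df.copy()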

### Building and Training the Neural Network Model

In [21]:
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras import Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

# --- Scaling ---
# Fit the scaler on the training split only, then reuse it for validation and test.
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_val = scaler.transform(x_val)
x_test = scaler.transform(x_test)

# One output unit per author known to the label encoder.
num_classes = len(label_encoder.classes_)

# --- Model Architecture ---
model = Sequential([
    Input(shape=(x_train.shape[1],)),  
    Dense(256, activation='relu'),
    Dropout(0.4),
    Dense(128, activation='relu'),
    Dropout(0.4),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(num_classes, activation='softmax')
])

# --- Compile ---
optimizer = Adam(learning_rate=0.0005)
model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# --- Early Stopping ---
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)

# --- Train ---
model.fit(
    x_train, y_train,
    epochs=100,
    batch_size=32,
    validation_data=(x_val, y_val),
    callbacks=[early_stop]
)


Epoch 1/100
80/80 ━━━━━━━━━━━━━━━━━━━━ 2s 9ms/step - accuracy: 0.3822 - loss: 1.4628 - val_accuracy: 0.6375 - val_loss: 1.1442
Epoch 2/100
80/80 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.6307 - loss: 1.0299 - val_accuracy: 0.7844 - val_loss: 0.6946
Epoch 3/100
80/80 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7171 - loss: 0.7706 - val_accuracy: 0.7984 - val_loss: 0.5754
Epoch 4/100
80/80 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.7668 - loss: 0.6657 - val_accuracy: 0.8031 - val_loss: 0.5494
Epoch 5/100
80/80 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.7876 - loss: 0.5925 - val_accuracy: 0.8219 - val_loss: 0.5392
Epoch 6/100
80/80 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.8005 - loss: 0.5569 - val_accuracy: 0.8188 - val_loss: 0.5188
Epoch 7/100
...


### Evaluation Metric of the Competition

Submissions are evaluated using the macro-averaged F1 score.
F1 is the harmonic mean of precision and recall. Macro averaging computes it separately for each author and takes the unweighted mean, so every author counts equally regardless of how many texts they wrote.
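
As an illustration of macro averaging, the sketch below computes a per-class F1 from true positives, false positives, and false negatives, then takes the unweighted mean over classes. The function `macro_f1` and the toy labels are hypothetical and written from scratch for clarity; in practice `sklearn.metrics.f1_score(..., average='macro')` does this for you.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (hypothetical): three classes, two samples each.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]

# Per-class F1: A = 0.5, B = 0.8, C = 2/3
print(round(macro_f1(y_true, y_pred), 3))  # 0.656
```

Because each class contributes equally to the mean, a model that ignores a rarely seen author is penalized as heavily as one that ignores a frequent author.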

The final score is calculated using the formula:

$$
\text{score} = \text{round}(\text{f1score},\, 3) \times 100
$$


The model's F1 score is rounded to three decimal places and multiplied by 100 to give the final score.
The maximum possible score is 100, and the minimum acceptable score is 40.
If your model achieves an F1 score below 0.40, the final score is 0.
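
The scoring rule can be sketched as a small helper. The name `competition_score` is hypothetical, and two details are assumptions not stated in the task: the 0.40 cutoff is applied to the unrounded F1, and the result is rounded once more after scaling to suppress floating-point noise.

```python
def competition_score(f1):
    """Convert a macro F1 in [0, 1] to the final score described above."""
    # Assumption: the cutoff is checked against the raw, unrounded F1.
    if f1 < 0.40:
        return 0.0
    # Second rounding (an assumption) avoids results like 81.69999999999999.
    return round(round(f1, 3) * 100, 1)

print(competition_score(0.8169))  # 81.7
print(competition_score(0.35))    # 0.0
print(competition_score(1.0))     # 100.0
```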

Before submitting, evaluate your model with this metric on the training or validation set to estimate how it is likely to score.

In [24]:
from sklearn.metrics import f1_score

# Predict class probabilities, then take the most likely class per sample.
val_preds = model.predict(x_val)
val_pred_labels = val_preds.argmax(axis=1)

f1 = f1_score(y_val, val_pred_labels, average='macro')
f1_rounded = round(f1, 3)
print("F1 Score (macro):", f1_rounded)
# Round again after scaling to avoid floating-point noise such as 81.69999999999999.
print("Final Score:", round(f1_rounded * 100, 1))

20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
F1 Score (macro): 0.817
Final Score: 81.7


## Prediction for Test Data and Output

Save your model's predictions on the test data in a dataframe.
This dataframe must contain a single column named author, where the i-th row is your prediction for the i-th row of the test dataset.

|Column|Description|
|------|---|
|author|Predicted author of the text|
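
A quick check of the file before submitting can catch format mistakes such as an extra index column or a wrong row count. The sketch below uses only the standard library; the function `check_submission` and the sample CSV strings are hypothetical, and the expected row count should be taken from your own test data.

```python
import csv
import io

def check_submission(csv_text, expected_rows):
    """Return a list of problems found; an empty list means the file looks valid."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    rows = list(reader)
    problems = []
    if header != ["author"]:
        problems.append(f"expected a single 'author' column, got {header}")
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(rows)}")
    if any(len(r) != 1 or not r[0].strip() for r in rows):
        problems.append("found empty or malformed predictions")
    return problems

# Hypothetical file contents, using author names from the example table.
good = "author\nMason Reed\nAva Thompson\nLiam Carter\n"
bad = "id,author\n0,Mason Reed\n"

print(check_submission(good, expected_rows=3))  # []
print(check_submission(bad, expected_rows=3))   # three problems reported
```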

In [26]:
# Predict class probabilities for the test set and pick the most likely class.
y_test_pred_probs = model.predict(x_test)
y_test_pred_labels = y_test_pred_probs.argmax(axis=1)

# Map the integer labels back to author names.
y_test_pred_names = label_encoder.inverse_transform(y_test_pred_labels)

submission_df = pd.DataFrame({
    'author': y_test_pred_names
})

# index=False keeps the file to the single required column.
submission_df.to_csv('../data/submission.csv', index=False)

25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step


![Team](outro.png)