Dataset Introduction
A structured dataset is provided where each row represents a text, and the columns indicate either the presence of specific words in the text or the author of the text.

Word Columns:
These columns represent individual words. Each entry in these columns is binary, where 1 indicates the presence of the word in the text and 0 indicates its absence.

author Column:
This column contains the name of the author who wrote the text. It serves as the target variable that your model should predict.

Example Dataset Structure:
word_1	word_2	...	word_n	author
0	1	...	1	Mason Reed
1	1	...	1	Ava Thompson
0	1	...	0	Liam Carter

Test Dataset:
The test dataset follows the same structure as the training set, except it does not include the author column (the target variable). The test dataset contains 2,765 rows.
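The source does not say how the word columns were produced. As a minimal sketch, assuming simple whitespace tokenization with basic punctuation stripping, a binary presence row like those shown above could be derived from raw text as follows. The vocabulary, the texts, and the `presence_row` helper are hypothetical and only illustrate the encoding.

```python
# Hypothetical vocabulary and texts, for illustration only.
vocabulary = ["rain", "hair", "oil", "run"]
texts = [
    "The rain fell on her hair.",
    "They run when the oil lamp goes out.",
    "Nothing relevant here.",
]


def presence_row(text, vocabulary):
    """Return a 0/1 list marking which vocabulary words occur in the text."""
    # Lowercase and strip basic punctuation so "hair." matches "hair".
    tokens = {token.strip(".,;:!?\"'()").lower() for token in text.split()}
    return [1 if word in tokens else 0 for word in vocabulary]


rows = [presence_row(text, vocabulary) for text in texts]
for row in rows:
    print(row)
```

Each row records only whether a word occurs, not how often, which matches the 0/1 entries described for the word columns.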

# LOAD DATA

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Reading/Loading the dataset files
df = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
df

Unnamed: 0,author,lung,council,solution,quite,rain,hair,skill,difficulty,add,...,stocking,near,oil,dive,many,run,tender,asleep,eat,sweep
0,Mason Reed,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Ava Thompson,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Liam Carter,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Mason Reed,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Olivia Bennett,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3191,Liam Carter,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3192,Liam Carter,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3193,Ava Thompson,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3194,Ethan Brooks,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# PREPROCESSING

In [5]:
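# Encode author names as integer class labels
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['author_encoded'] = label_encoder.fit_transform(df['author'])

# Features: every column except the target and its encoded copy
X = df.drop(columns=['author', 'author_encoded'])
y = df['author_encoded']

# Cast to float and treat missing entries as "word absent"
X = X.astype(float).fillna(0)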
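To make the label encoding step concrete, here is a small sketch of the round trip that `LabelEncoder` performs. The `authors` list is a hypothetical stand-in for the `author` column, reusing names that appear in the training data.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical author column, using names that appear in the training data.
authors = ["Mason Reed", "Ava Thompson", "Liam Carter", "Mason Reed"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(authors)

# classes_ holds the unique labels in sorted order; each integer code is an index into it.
print(list(encoder.classes_))
print(encoded.tolist())

# inverse_transform maps integer codes back to the original names.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

The same fitted encoder has to be kept around after training, since it is the only record of which integer corresponds to which author when predictions are decoded.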

# MODEL

In [6]:
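from sklearn.metrics import f1_score
from sklearn.neural_network import MLPClassifier

# Two-hidden-layer MLP; 10% of the training rows are held out internally for early stopping
model = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    activation='relu',
    solver='adam',
    alpha=0.001,  # L2 regularization strength
    batch_size=32,
    learning_rate='adaptive',  # only used by solver='sgd'; no effect with 'adam'
    max_iter=100,
    early_stopping=True,
    validation_fraction=0.1,
    random_state=42
)

model.fit(X, y)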

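# Evaluate the model. Note: this scores the same rows the model was fit on,
# so it is a training score, not an estimate of performance on unseen texts.
y_pred = model.predict(X)
# Use a name other than f1_score so the imported function is not overwritten
f1 = f1_score(y, y_pred, average='macro')
print(f"f1 score: {f1}")
print(f"main score: {np.round((f1*100),3)}")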


f1 score: 0.9220045842217356
main score: 92.2
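The score above is computed on the rows used for fitting. One common way to get a less optimistic estimate is to hold out a stratified validation split and compute macro F1 there. The sketch below assumes that approach and runs on synthetic binary data, so `X_demo`, `y_demo`, and the resulting numbers are illustrative only and say nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the real data: 300 texts, 20 binary word columns, 3 authors.
rng = np.random.default_rng(42)
y_demo = rng.integers(0, 3, size=300)

# Each author gets 5 "favourite" words that appear with higher probability.
probs = np.full((300, 20), 0.1)
for c in range(3):
    probs[y_demo == c, 5 * c : 5 * c + 5] = 0.7
X_demo = (rng.random((300, 20)) < probs).astype(float)

# Hold out 20% of the rows, keeping the author proportions the same in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=200, random_state=42)
clf.fit(X_train, y_train)

# Macro F1 averages the per-class F1 scores with equal weight per author.
train_f1 = f1_score(y_train, clf.predict(X_train), average="macro")
val_f1 = f1_score(y_val, clf.predict(X_val), average="macro")
print(f"train macro F1: {train_f1:.3f}")
print(f"validation macro F1: {val_f1:.3f}")
```

A noticeable gap between the two numbers would suggest overfitting; the validation figure is the one to compare across model configurations.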


# TEST DATA

In [7]:
# Predict test samples (assumes test.csv has the same feature columns as X)
X_test = df_test.astype(float).fillna(0)
y_test_pred = model.predict(X_test)
# Map integer predictions back to author names
test_predictions_decoded = label_encoder.inverse_transform(y_test_pred)
submission = pd.DataFrame(test_predictions_decoded, columns=['author'])
submission

Unnamed: 0,author
0,Olivia Bennett
1,Ethan Brooks
2,Liam Carter
3,Liam Carter
4,Olivia Bennett
...,...
794,Olivia Bennett
795,Liam Carter
796,Olivia Bennett
797,Ava Thompson
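The prediction cell relies on the test file having the same feature columns, in the same order, as the training features. As a defensive sketch, assuming columns should be matched by name, `DataFrame.reindex` can align a test frame to the training layout before predicting. The two frames below are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical frames standing in for the training features and the test file.
X_train_demo = pd.DataFrame({"rain": [0, 1], "hair": [1, 0], "oil": [0, 0]})
test_demo = pd.DataFrame({"oil": [1], "rain": [0], "extra": [1]})

# Reorder test columns to match training; missing columns are added as 0
# and columns unknown to the training set are dropped.
X_test_demo = test_demo.reindex(
    columns=X_train_demo.columns, fill_value=0
).astype(float)

print(list(X_test_demo.columns))
print(X_test_demo.iloc[0].tolist())
```

Filling a missing word column with 0 treats the word as absent, which is consistent with how missing values are handled in the preprocessing step.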
