# Sorting Hat: Mystery Rule Competition

During the [main workshop](https://www.kaggle.com/code/pulljosh/workshop-1-the-mystery-of-the-sorting-hat), we trained a model to correctly learn the Sorting Hat's rule: People with a vowel as the second letter of their name go into group A, and people with a consonant as the second letter of their name go into group B.
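As a reminder, the workshop rule can be written as a few lines of code. This is only a minimal sketch: the function name `workshop_rule` is hypothetical, it assumes every name has at least two letters, and it assumes that only a, e, i, o and u count as vowels (the workshop may have treated `y` differently).

```python
VOWELS = set("aeiou")  # assumption: 'y' is not counted as a vowel

def workshop_rule(name):
    """Return 'A' if the second letter of name is a vowel, otherwise 'B'."""
    second_letter = name[1].lower()
    return "A" if second_letter in VOWELS else "B"

print(workshop_rule("Marian"))  # second letter 'a' is a vowel
print(workshop_rule("Grove"))   # second letter 'r' is a consonant
```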

**But now the Sorting Hat has a new challenge. There's a new categorization rule, and you must deduce what it is.** (This time, the rule sorts people into three groups, not just two.)

- If you can make a model that learns the rule, that's great! You can submit your results to the Kaggle competition and compete against other AI club members to train the most successful model.
- **Even harder:** Can you figure out what the rule is and explain it to other people in words? This is harder than it sounds, but perhaps you can figure it out.

# Load labeled training data & unlabeled competition data

The following code loads the data from `given.csv` and `unknown.csv`.

In [4]:
from glob import glob
from pathlib import Path

import numpy as np
import pandas as pd

# input_directory = Path(glob('/kaggle/input/msu-ai-club*')[0])
df_given = pd.read_csv('given.csv')
df_unknown = pd.read_csv('unknown.csv')

### Preview `df_given`
This is a table of names with labels that the Sorting Hat has already classified for you. You should use this data to deduce what the categorization rule is. (Notice that this time there are three different labels, not just two!)

In [5]:
df_given["label"].value_counts()

0    1470
2     846
1     184
Name: label, dtype: int64
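The counts show that the three labels are far from evenly distributed, which is worth keeping in mind when judging a model. As a rough yardstick, the sketch below works out how accurate you would be on the labeled data by always guessing the most common label. The counts are copied from the output above; the variable names are illustrative.

```python
# Label counts copied from the value_counts() output above
label_counts = {0: 1470, 2: 846, 1: 184}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_label] / total

print(total)              # 2500
print(majority_label)     # 0
print(baseline_accuracy)  # 0.588
```

A model that scores around 59% on this data is therefore probably doing little more than guessing label 0.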

### Preview `df_unknown`
This table contains unlabeled names that the Sorting Hat has not yet classified. To prove that you have deduced the rule, you need to label these names yourself. The more accurately you label, the higher your score will be! You can compete against other MSU AI Club members by submitting your predictions to the Kaggle competition.

In [6]:
df_unknown

| Unnamed: 0 | name |
|---|---|
| 0 | Odelia |
| 1 | Marian |
| 2 | Grove |
| 3 | Beverly |
| 4 | Raul |
| ... | ... |
| 4777 | Trevon |
| 4778 | Tameka |
| 4779 | Yadira |
| 4780 | Fisher |


# YOUR TASK: Train a model/deduce the rule
Use this space to determine what the categorization rule is! You could train a model that learns the rule, or try to figure it out yourself without using a model.
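If you want to try deducing the rule without a model, one possible approach is to pick a candidate feature of a name and count how the labels are distributed for each value of that feature. The sketch below is only an illustration: the helper `tabulate` is hypothetical, the names and labels in `toy_rows` are made up (they are not from `given.csv`), and the two candidate features are guesses, not the actual rule.

```python
from collections import Counter, defaultdict

# Made-up (name, label) pairs for illustration only; not taken from given.csv
toy_rows = [("Anna", 0), ("Bob", 1), ("Christopher", 2), ("Eve", 1), ("Dmitri", 0)]

def tabulate(rows, feature):
    """Count how often each label occurs for each value of feature(name)."""
    table = defaultdict(Counter)
    for name, label in rows:
        table[feature(name)][label] += 1
    return {value: dict(counts) for value, counts in table.items()}

# Two hypothetical candidate features; the real rule may use neither
print(tabulate(toy_rows, len))
print(tabulate(toy_rows, lambda name: name[0].lower() in "aeiou"))
```

With the real data, the rows could come from `zip(df_given["name"], df_given["label"])`. A feature whose values each map almost entirely to a single label is a promising lead.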

In [14]:
from sklearn.model_selection import train_test_split
def standard_name(name):
    # Pad with spaces or truncate so every name is exactly 10 characters long
    return name.ljust(10)[:10]

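def letter_to_numbers(letter):
    # One-hot encode a lowercase letter as a length-26 vector.
    # Anything that is not a-z (such as the padding space) stays all zeros.
    letter_number = ord(letter) - 97  # ord('a') == 97

    numbers = np.zeros(26)
    if 0 <= letter_number < 26:
        numbers[letter_number] = 1

    return numbers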

def name_to_numbers(name):
    # Encode a name as 10 one-hot letters, giving 10 * 26 = 260 features
    name = standard_name(name)
    name = name.lower()

    return np.concatenate([letter_to_numbers(letter) for letter in name])

    
names = np.array(df_given["name"])
labels = np.array(df_given["label"])

# Split names and labels into training and testing sets
train_names, test_names, train_labels, test_labels = train_test_split(
    names, labels, test_size=0.25, random_state=0
)

train_data = [name_to_numbers(name) for name in train_names]
test_data = [name_to_numbers(name) for name in test_names]

# print("train_names:", train_names)
# print("train_labels:", train_labels)
# print("test_names:", test_names)
# print("test_labels:", test_labels)

from sklearn.neural_network import MLPClassifier

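# Create the Multi-Layer Perceptron (MLP) model
# Note: hidden_layer_sizes=(1,1) means two hidden layers with one neuron each,
# so all 260 input features are squeezed through a single number.
classifier = MLPClassifier(
    solver='lbfgs',
    alpha=1e-5,
    hidden_layer_sizes=(1,1)
)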

# Train the model using the training data
classifier.fit(train_data, train_labels)

# prediction = classifier.predict(name_to_numbers(name).reshape(1, -1))

# test_results = pd.DataFrame({
#     "Name": test_names,
#     "Actual": test_labels,
#     "Computer's Prediction": classifier.predict(test_data),
# })
# test_results["Correct"] = ["True" if correct else "FALSE" for correct in test_results["Actual"] == test_results["Computer's Prediction"]]


# test_results.to_csv("test_results.csv",index=False)

# Encode the competition names and predict a label for each one
df_submission = df_unknown.copy()
sub_names = np.array(df_submission["name"])
sub_data = [name_to_numbers(name) for name in sub_names]
sub_label = list(classifier.predict(sub_data))

print(sub_names)
print(sub_label)

# Save the names together with their predicted labels
df_submission["label"] = sub_label
df_submission.to_csv("submission.csv", index=False)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


['Odelia' 'Marian' 'Grove' ... 'Yadira' 'Fisher' 'Simeon']
[0, 0, 0, 2, 2, 0, 0, 2, 0, 2, 1, 0, 0, 0, 2, 0, 1, 2, 1, 2, 0, 2, 0, 0, 0, 2, 0, 2, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 2, 2, 2, 0, 2, 0, 2, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 2, 0, 0, 2, 2, 2, 0, 2, 2, 0, 2, 2, 0, 2, 0, 0, 0, 2, 0, 2, 2, 2, 0, 0, 0, 0, 2, 0, 2, 2, 0, 0, 0, 0, 2, 2, 2, 0, 0, 2, 0, 1, 0, 1, 2, 0, 0, 1, 0, 2, 0, 2, 2, 1, 0, 0, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 2, 0, 0, 2, 2, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 2, 2, 0, 0, 0, 2, 0, 0, 0, 0, 2, 2, 2, 0, 2, 1, 2, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 1, 0, 2, 2, 1, 2, 2, 0, 1, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 1, 2, 0, 2, 2, 2, 1, 2, 2, 2, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 2, 2, 2, 0, 0, 0, 2, 0, 0, 2, 0, 0, 2, 2, 2, 0, 0, 0, 0, 2, 2, 0, 2, 0, 2, 0, 0, 0, 2, 2, 0, 0, 0, 2, 1
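The cell above sets aside a test split, but the code that would score the model on it is commented out, so there is no estimate of how well these predictions are likely to do. One possible way to check is sketched below. The helper `per_label_accuracy` is hypothetical and the example labels are made up; it reports overall accuracy plus accuracy for each actual label, which helps show whether the rarer labels are being ignored.

```python
from collections import Counter

def per_label_accuracy(actual, predicted):
    """Return overall accuracy and a dict of accuracy for each actual label."""
    totals = Counter()
    hits = Counter()
    for a, p in zip(actual, predicted):
        totals[a] += 1
        if a == p:
            hits[a] += 1
    overall = sum(hits.values()) / sum(totals.values())
    by_label = {label: hits[label] / totals[label] for label in totals}
    return overall, by_label

# Made-up labels for illustration only
actual = [0, 0, 2, 1, 2, 0]
predicted = [0, 2, 2, 0, 2, 0]
print(per_label_accuracy(actual, predicted))
```

In this notebook, calling it with `test_labels` and `classifier.predict(test_data)` would score the model on the 25% of labeled names it never saw during training.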

**Good luck deducing the rule!** Remember: Training a model is good. Understanding the rule and being able to explain it is even better.