# Categorical Feature Encoding Challenge Top 1 Solution

## This is a simple modeling notebook using Logistic Regression. The model ranks 1st on the private leaderboard. If you find it useful, please upvote 🙂
## I also shared a [basic EDA notebook for everyone](https://www.kaggle.com/werooring/powerful-and-simple-eda-for-everyone)

#### I applied feature encoding and feature scaling to rank 1st on the private leaderboard

- [Competition Link](https://www.kaggle.com/c/cat-in-the-dat/)
- [Modeling Notebook Reference Link](https://www.kaggle.com/dkomyagin/cat-in-the-dat-0-80285-private-lb-solution)

In [None]:
import pandas as pd

train = pd.read_csv('/kaggle/input/cat-in-the-dat/train.csv', index_col='id')
test = pd.read_csv('/kaggle/input/cat-in-the-dat/test.csv', index_col='id')
submission = pd.read_csv('/kaggle/input/cat-in-the-dat/sample_submission.csv', index_col='id')

In [None]:
# Combine training data with test data
all_data = pd.concat([train, test], ignore_index=True)
all_data = all_data.drop('target', axis=1) # Drop target value
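
The positional split used later in the notebook relies on `pd.concat` keeping the training rows first and in their original order. A minimal sketch on toy frames; the `toy_*` names and values are invented for illustration and are not part of the competition data:

```python
import pandas as pd

# Toy stand-ins for the competition files (values are invented)
toy_train = pd.DataFrame({'bin_0': [0, 1, 0], 'target': [1, 0, 1]})
toy_test = pd.DataFrame({'bin_0': [1, 1]})

# Stack train on top of test, then drop the target column
toy_all = pd.concat([toy_train, toy_test], ignore_index=True)
toy_all = toy_all.drop('target', axis=1)

# The first len(toy_train) rows are the training rows, in their original order
num_toy_train = len(toy_train)
recovered_train = toy_all.iloc[:num_toy_train]
recovered_test = toy_all.iloc[num_toy_train:]

print(recovered_train['bin_0'].tolist())
print(recovered_test['bin_0'].tolist())
```

Encoding the combined frame means train and test share the same category-to-column mapping, which is presumably why the data is merged before encoding.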

## Feature Encoding

### Binary Feature Encoding

In [None]:
all_data['bin_3'] = all_data['bin_3'].map({'F':0, 'T':1})
all_data['bin_4'] = all_data['bin_4'].map({'N':0, 'Y':1})
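
One thing worth knowing about `Series.map` with a dictionary: any value missing from the dictionary becomes `NaN` rather than raising an error. A small sketch with an invented series (the `'X'` value is a deliberate outlier, not something observed in the data):

```python
import pandas as pd

toy = pd.Series(['F', 'T', 'T', 'X'])  # 'X' is a deliberately unexpected value

encoded = toy.map({'F': 0, 'T': 1})

# Values missing from the dictionary become NaN, so count them after mapping
num_unmapped = int(encoded.isna().sum())
print(encoded.tolist())
print(num_unmapped)
```

Running a check like `all_data['bin_3'].isna().sum()` after mapping would confirm that every value was covered.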

### Ordinal Feature Encoding

In [None]:
ord1dict = {'Novice':0, 'Contributor':1, 
            'Expert':2, 'Master':3, 'Grandmaster':4}
ord2dict = {'Freezing':0, 'Cold':1, 'Warm':2, 
            'Hot':3, 'Boiling Hot':4, 'Lava Hot':5}

all_data['ord_1'] = all_data['ord_1'].map(ord1dict)
all_data['ord_2'] = all_data['ord_2'].map(ord2dict)

In [None]:
from sklearn.preprocessing import OrdinalEncoder

ord_345 = ['ord_3', 'ord_4', 'ord_5']

ord_encoder = OrdinalEncoder() # Create OrdinalEncoder object
# Apply ordinal encoding
all_data[ord_345] = ord_encoder.fit_transform(all_data[ord_345])

# Print encoding order by feature
for col, categories in zip(ord_345, ord_encoder.categories_):
    print(col)
    print(categories)
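
By default `OrdinalEncoder` sorts the categories of each column and numbers them in that order. For strings this is a code-point sort, so uppercase letters come before lowercase ones. A toy sketch (the column names and values are invented):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({'letters': ['c', 'a', 'b', 'a'],
                    'mixed': ['b', 'B', 'a', 'A']})

toy_encoder = OrdinalEncoder()
toy_encoded = toy_encoder.fit_transform(toy)

# Categories are sorted per column; uppercase letters sort before lowercase
for col, categories in zip(toy.columns, toy_encoder.categories_):
    print(col, list(categories))

print(toy_encoded[:, 0].tolist())
```

If a feature needs a different order, `OrdinalEncoder` accepts an explicit `categories` argument.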

### Nominal Feature Encoding

In [None]:
# Drop the first three characters of each value in nom_5 to nom_9
for col in ['nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9']:
    all_data[col] = all_data[col].str[3:]
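
To illustrate what the `.str[3:]` slice does: it removes the first three characters, so values that differ only in that prefix end up identical. The strings below are invented; whether such merges actually occur in `nom_5` to `nom_9` is not checked here.

```python
import pandas as pd

# Invented hash-like strings; the real nom_5..nom_9 values differ
toy = pd.Series(['50f116bcf', 'b3b4d25d0', 'aaa116bcf'])

trimmed = toy.str[3:]
print(trimmed.tolist())

# Two values that differ only in their first three characters now coincide
print(toy.nunique(), trimmed.nunique())
```

Comparing `nunique()` before and after the slice on the real columns would show how much the cardinality changes.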

In [None]:
nom_cols = ['nom_' + str(i) for i in range(10)] # Nominal features

In [None]:
from sklearn.preprocessing import OneHotEncoder

nom_onehot_encoder = OneHotEncoder(drop='first') # Create OneHotEncoder object
# Apply one-hot encoding
encoded_nom_matrix = nom_onehot_encoder.fit_transform(all_data[nom_cols])

all_data = all_data.drop(nom_cols, axis=1) # Drop original nominal features

encoded_nom_matrix
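
A quick sketch of what `drop='first'` changes: one column per feature is removed, so a feature with k categories yields k - 1 columns. The toy frame below is invented for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'color': ['Red', 'Green', 'Blue', 'Green'],
                    'shape': ['Circle', 'Square', 'Circle', 'Circle']})

full = OneHotEncoder().fit_transform(toy)
reduced = OneHotEncoder(drop='first').fit_transform(toy)

# 3 + 2 categories give 5 columns; dropping one per feature leaves 3
print(full.shape, reduced.shape)
```

Both results are SciPy sparse matrices, which keeps memory use manageable for high-cardinality features.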

### Date Feature Encoding

In [None]:
date_cols = ['day', 'month'] # Date features

date_onehot_encoder = OneHotEncoder() # Create OneHotEncoder object
# Apply one-hot encoding
encoded_date_matrix = date_onehot_encoder.fit_transform(all_data[date_cols])

all_data = all_data.drop(date_cols, axis=1) # Drop original date features

encoded_date_matrix

## Feature Scaling

### Apply scaling to ordinal features

In [None]:
from sklearn.preprocessing import MinMaxScaler

ord_cols = ['ord_' + str(i) for i in range(6)] # Ordinal features
# Min-max normalization
all_data[ord_cols] = MinMaxScaler().fit_transform(all_data[ord_cols])
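
Min-max normalization rescales each column to the range [0, 1] using `(x - min) / (max - min)`. A small sketch comparing `MinMaxScaler` with the formula computed by hand on an invented column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

toy = np.array([[1.0], [2.0], [3.0], [5.0]])

scaled = MinMaxScaler().fit_transform(toy)

# Same result by hand: (x - min) / (max - min), computed per column
manual = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))

print(scaled.ravel().tolist())
print(np.allclose(scaled, manual))
```

Scaling puts the ordinal features on the same 0 to 1 range as the binary and one-hot columns.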

### Combine encoded and scaled features

In [None]:
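import scipy.sparse # Import the submodule explicitly; older SciPy versions do not load it with `import scipy`

# Combine the encoded and scaled features into one sparse matrix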
all_data_sprs = scipy.sparse.hstack([scipy.sparse.csr_matrix(all_data),
                                     encoded_nom_matrix,
                                     encoded_date_matrix],
                                    format='csr')

In [None]:
all_data_sprs
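
As a sketch of how the horizontal stacking behaves: every block must have the same number of rows, and the column counts add up. The arrays below are invented stand-ins for the scaled and one-hot encoded parts:

```python
import numpy as np
from scipy import sparse

dense_part = np.array([[0.0, 1.0], [0.5, 0.0], [1.0, 1.0]])  # e.g. scaled columns
onehot_part = sparse.csr_matrix(np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]]))

# hstack needs the same number of rows in every block
combined = sparse.hstack([sparse.csr_matrix(dense_part), onehot_part],
                         format='csr')

print(combined.shape)   # rows are preserved, columns are added up
print(combined.format)
```

CSR format supports efficient row slicing, which suits the positional train/test split that follows.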

## Model Training/Submission

### Split into training and test data

In [None]:
num_train = train.shape[0] # Number of training rows

# Split the combined matrix back into training and test data
X_train = all_data_sprs[:num_train]
X_test = all_data_sprs[num_train:]

y_train = train['target']

### Model Training

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=0.125, solver='lbfgs', max_iter=800, verbose=0, n_jobs=-1)
clf.fit(X_train, y_train)
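
The notebook fits on all training rows and does not include a validation step. One possible way to estimate performance before submitting is a hold-out split scored with ROC AUC. The sketch below is an assumption, not part of the original solution: it uses synthetic sparse data and hypothetical names (`X_toy`, `y_toy`, `toy_clf`) only so that it runs on its own.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic sparse features and labels, only to make the sketch runnable
rng = np.random.default_rng(0)
X_toy = sparse.random(2000, 50, density=0.1, format='csr', random_state=0)
true_weights = rng.normal(size=50)
noise = rng.normal(scale=0.1, size=2000)
y_toy = (X_toy @ true_weights + noise > 0).astype(int)

# Hold out 20% of the rows, keeping the class balance in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)

# Same regularization and solver settings as the notebook's model
toy_clf = LogisticRegression(C=0.125, solver='lbfgs', max_iter=800)
toy_clf.fit(X_tr, y_tr)

# Score the predicted probability of the positive class
val_auc = roc_auc_score(y_val, toy_clf.predict_proba(X_val)[:, 1])
print(f'Validation ROC AUC: {val_auc:.3f}')
```

With the real data, `X_train` and `y_train` would take the place of `X_toy` and `y_toy`.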

### Prediction and Submission

In [None]:
y_preds = clf.predict_proba(X_test)[:,1]

submission['target'] = y_preds
submission.to_csv('submission.csv')