# 🏢 OFFICE CATEGORY PREDICTION - BASELINE TEMPLATE

This notebook provides a simple baseline for the Office Classification challenge.

**Your task:** Improve upon this baseline by trying different approaches!

## 💡 Hints on What to Try

1. **Feature Engineering** - Create interactions, polynomials, ratios
2. **Different Models** - Random Forest, XGBoost, Neural Networks
3. **Hyperparameter Tuning** - Optimize model parameters
4. **Ensemble Methods** - Combine multiple models
5. **Handle Missing Values Better** - Try different imputation strategies
6. **Encode Categoricals Differently** - One-hot encoding, target encoding


**Good luck!** 🚀

In [1]:
# ============================================================================
# STEP 1: IMPORT LIBRARIES
# ============================================================================

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')

In [3]:
# ============================================================================
# STEP 2: LOAD DATA
# ============================================================================

# Load training data
train = pd.read_csv('office_train.csv')

# Separate features and target
X = train.drop('OfficeCategory', axis=1)
y = train['OfficeCategory']

print("Dataset loaded successfully!")
print(f"Shape: {X.shape}")
print(f"Target distribution:\n{y.value_counts().sort_index()}")

# TODO: Explore the data here
# - Check for missing values: X.isnull().sum()
# - Look at feature distributions: X.describe()
# - Visualize relationships: Use matplotlib/seaborn
# - Understand which features matter most

Dataset loaded successfully!
Shape: (35000, 79)
Target distribution:
OfficeCategory
0    6675
1    7314
2    6906
3    7013
4    7092
Name: count, dtype: int64
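
The exploration TODO above can start from a per-column summary. The following is a minimal sketch, not part of the original baseline: `summarize_columns` is a hypothetical helper, and the toy frame uses made-up columns in place of the real `X`.

```python
import numpy as np
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing percent, unique values."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(2),
        "n_unique": df.nunique(dropna=True),
    })
    # Columns with the most missing data come first
    return summary.sort_values("n_missing", ascending=False)

# Toy frame standing in for the real training features (hypothetical columns)
toy = pd.DataFrame({
    "OfficeSpace": [1200.0, np.nan, 950.0, 1800.0],
    "ZoneType": ["A", "B", None, "A"],
    "Floors": [2, 3, 1, 2],
})
report = summarize_columns(toy)
print(report)
```

On the real data, `summarize_columns(X)` would show which of the 79 features need imputation and which categoricals have many distinct levels.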


In [4]:
# ============================================================================
# STEP 3: SIMPLE PREPROCESSING
# ============================================================================

def simple_preprocess(X_train, X_test=None):
    """
    Basic preprocessing: Fill missing values and encode categoricals

    TODO: Improve this function!
    Ideas:
    - Try different imputation strategies (mean, mode, KNN)
    - Create new features (interactions, ratios, polynomials)
    - Try one-hot encoding instead of label encoding
    - Handle outliers
    - Scale/normalize features
    """

    # Make copies
    X_train = X_train.copy()
    if X_test is not None:
        X_test = X_test.copy()

    # Identify feature types
    numeric_features = X_train.select_dtypes(include=[np.number]).columns
    categorical_features = X_train.select_dtypes(include=['object']).columns

    print(f"Numeric features: {len(numeric_features)}")
    print(f"Categorical features: {len(categorical_features)}")

    # Fill missing values - NUMERIC (median)
    for col in numeric_features:
        median_val = X_train[col].median()
        X_train[col] = X_train[col].fillna(median_val)
        if X_test is not None:
            X_test[col] = X_test[col].fillna(median_val)

    # Fill missing values - CATEGORICAL (mode)
    for col in categorical_features:
        mode_val = X_train[col].mode()[0] if len(X_train[col].mode()) > 0 else 'Missing'
        X_train[col] = X_train[col].fillna(mode_val)
        if X_test is not None:
            X_test[col] = X_test[col].fillna(mode_val)

    # Encode categorical features (label encoding)
    for col in categorical_features:
        le = LabelEncoder()
        if X_test is not None:
            # Fit on combined data
            combined = pd.concat([X_train[col].astype(str), X_test[col].astype(str)])
            le.fit(combined)
            X_train[col] = le.transform(X_train[col].astype(str))
            X_test[col] = le.transform(X_test[col].astype(str))
        else:
            X_train[col] = le.fit_transform(X_train[col].astype(str))

    # Final safety check
    X_train = X_train.fillna(0)
    if X_test is not None:
        X_test = X_test.fillna(0)

    if X_test is not None:
        return X_train, X_test
    return X_train

# Apply preprocessing
X_processed = simple_preprocess(X)

print(f"\nAfter preprocessing:")
print(f"Shape: {X_processed.shape}")
print(f"Missing values: {X_processed.isnull().sum().sum()}")

Numeric features: 37
Categorical features: 42

After preprocessing:
Shape: (35000, 79)
Missing values: 0
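
One way to act on the TODO ideas in `simple_preprocess` is to move imputation, encoding, and scaling into a scikit-learn `ColumnTransformer`. This is a sketch under assumptions rather than a drop-in replacement: `make_preprocessor` is a hypothetical helper, the toy columns are invented, and one-hot encoding is swapped in for the label encoding used above. Because the transformer is fit on training data only, categories that first appear at prediction time are encoded as all zeros instead of shifting the codes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def make_preprocessor(df: pd.DataFrame) -> ColumnTransformer:
    """Build an unfitted preprocessor from the column dtypes of df."""
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    categorical_cols = df.select_dtypes(exclude=[np.number]).columns.tolist()

    numeric_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Unseen categories become an all-zero row instead of raising an error
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer(
        [("num", numeric_pipe, numeric_cols), ("cat", categorical_pipe, categorical_cols)],
        sparse_threshold=0,  # always return a dense array
    )

# Toy data with hypothetical columns
toy_train = pd.DataFrame({
    "OfficeSpace": [1200.0, np.nan, 950.0, 1800.0],
    "ZoneType": ["A", "B", np.nan, "A"],
})
toy_test = pd.DataFrame({
    "OfficeSpace": [np.nan, 1000.0],
    "ZoneType": ["C", "B"],  # "C" never appears in training
})

preprocessor = make_preprocessor(toy_train)
train_matrix = preprocessor.fit_transform(toy_train)  # statistics learned here only
test_matrix = preprocessor.transform(toy_test)        # reuses the training statistics
print(train_matrix.shape, test_matrix.shape)
```

With 42 categorical features, one-hot encoding may produce many columns, so it is worth checking the output width before training.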


In [5]:
# ============================================================================
# STEP 4: TRAIN-VALIDATION SPLIT
# ============================================================================

X_train, X_val, y_train, y_val = train_test_split(
    X_processed, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(f"\nTrain set: {X_train.shape}")
print(f"Validation set: {X_val.shape}")


Train set: (28000, 79)
Validation set: (7000, 79)


In [6]:
# ============================================================================
# STEP 5: TRAIN BASELINE MODEL (LOGISTIC REGRESSION)
# ============================================================================

# TODO: Try different models!
# - RandomForestClassifier
# - XGBClassifier
# - GradientBoostingClassifier
# - Neural Networks (MLPClassifier)
# - Ensemble methods (VotingClassifier, StackingClassifier)

print("\n" + "="*70)
print("TRAINING BASELINE MODEL: LOGISTIC REGRESSION")
print("="*70)

# Scale features (important for logistic regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Train logistic regression
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_scaled, y_train)


TRAINING BASELINE MODEL: LOGISTIC REGRESSION


In [9]:
# ============================================================================
# STEP 5 (VARIANT): TRAIN XGBOOST
# ============================================================================

# TODO: Try different models!
# - RandomForestClassifier
# - XGBClassifier
# - GradientBoostingClassifier
# - Neural Networks (MLPClassifier)
# - Ensemble methods (VotingClassifier, StackingClassifier)

from xgboost import XGBClassifier


print("\n" + "="*70)
print("TRAINING MODEL: XGBOOST")
print("="*70)

# Train XGBoost (tree models do not need scaled inputs; the scaled arrays are reused for convenience)
model = XGBClassifier(random_state=42)
model.fit(X_train_scaled, y_train)


TRAINING MODEL: XGBOOST


In [13]:
! pip install catboost

Collecting catboost
  Downloading catboost-1.2.8-cp312-cp312-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.8-cp312-cp312-manylinux2014_x86_64.whl (99.2 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.8


In [14]:
# ============================================================================
# STEP 5 (VARIANT): TRAIN CATBOOST
# ============================================================================

# TODO: Try different models!
# - RandomForestClassifier
# - XGBClassifier
# - GradientBoostingClassifier
# - Neural Networks (MLPClassifier)
# - Ensemble methods (VotingClassifier, StackingClassifier)

from catboost import CatBoostClassifier

print("\n" + "="*70)
print("TRAINING MODEL: CATBOOST")
print("="*70)

# Train CatBoost (the evaluation cell below reports this model's scores)
model = CatBoostClassifier(random_state=42)
model.fit(X_train_scaled, y_train)


TRAINING MODEL: CATBOOST
Learning rate set to 0.093784
0:	learn: 1.4835226	total: 131ms	remaining: 2m 10s
1:	learn: 1.3924813	total: 242ms	remaining: 2m
2:	learn: 1.3204038	total: 318ms	remaining: 1m 45s
3:	learn: 1.2634377	total: 418ms	remaining: 1m 44s
4:	learn: 1.2155667	total: 550ms	remaining: 1m 49s
5:	learn: 1.1691897	total: 674ms	remaining: 1m 51s
6:	learn: 1.1301259	total: 764ms	remaining: 1m 48s
7:	learn: 1.0981268	total: 853ms	remaining: 1m 45s
8:	learn: 1.0668474	total: 913ms	remaining: 1m 40s
9:	learn: 1.0381357	total: 970ms	remaining: 1m 36s
10:	learn: 1.0134671	total: 1.03s	remaining: 1m 32s
11:	learn: 0.9913765	total: 1.13s	remaining: 1m 33s
12:	learn: 0.9693068	total: 1.25s	remaining: 1m 34s
13:	learn: 0.9495444	total: 1.34s	remaining: 1m 34s
14:	learn: 0.9317452	total: 1.45s	remaining: 1m 35s
15:	learn: 0.9146031	total: 1.51s	remaining: 1m 33s
16:	learn: 0.8997651	total: 1.57s	remaining: 1m 30s
17:	learn: 0.8848572	total: 1.63s	remaining: 1m 29s
18

<catboost.core.CatBoostClassifier at 0x7e8f4d60a870>

In [15]:
# ============================================================================
# STEP 6: EVALUATE MODEL
# ============================================================================

# Predictions
y_train_pred = model.predict(X_train_scaled)
y_val_pred = model.predict(X_val_scaled)

# Calculate accuracy
train_acc = accuracy_score(y_train, y_train_pred)
val_acc = accuracy_score(y_val, y_val_pred)

print(f"\nTrain Accuracy: {train_acc*100:.2f}%")
print(f"Validation Accuracy: {val_acc*100:.2f}%")

# Detailed classification report
print("\nClassification Report (Validation):")
print(classification_report(y_val, y_val_pred))

# TODO: Add more evaluation metrics
# - Confusion matrix
# - Per-class accuracy
# - Cross-validation scores
# - Feature importance (for tree models)


Train Accuracy: 93.29%
Validation Accuracy: 85.56%

Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.86      0.87      0.86      1335
           1       0.77      0.78      0.78      1463
           2       0.82      0.82      0.82      1381
           3       0.87      0.88      0.87      1403
           4       0.97      0.94      0.95      1418

    accuracy                           0.86      7000
   macro avg       0.86      0.86      0.86      7000
weighted avg       0.86      0.86      0.86      7000
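
The evaluation TODO mentions a confusion matrix and per-class accuracy. A minimal sketch of both follows; `per_class_recall` is a hypothetical helper and the toy labels are invented, so the numbers say nothing about the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_recall(y_true, y_pred, labels):
    """Return the confusion matrix and the fraction of each true class predicted correctly."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    support = cm.sum(axis=1)  # number of true samples per class (row sums)
    recall = np.divide(
        cm.diagonal(), support,
        out=np.zeros(len(labels)), where=support > 0,  # avoid dividing by zero
    )
    return cm, dict(zip(labels, recall))

# Toy labels for illustration only
y_true_toy = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 2, 2, 2, 0]

cm, recall = per_class_recall(y_true_toy, y_pred_toy, labels=[0, 1, 2])
print(cm)      # rows are true classes, columns are predicted classes
print(recall)
```

Applied to `y_val` and `y_val_pred`, the off-diagonal cells would show which categories get confused with each other. The report above suggests class 1 is the weakest, with precision 0.77 and recall 0.78.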



In [30]:
# ============================================================================
# STEP 7: MAKE PREDICTIONS ON TEST SET (OPTIONAL)
# ============================================================================

# The submission.csv written at the end of this cell is the file to upload to Kaggle

# Load test data
test = pd.read_csv('office_test.csv')

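# Preprocess test data (use same preprocessing as training)
# NOTE: when a test set is passed in, the label encoders are fit on train + test
# combined, so the integer codes can differ from those in X_processed (which the
# model was trained on) if the test set contains categories absent from train.
X_test_processed = simple_preprocess(X, test)[1]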

# Scale test data
X_test_scaled = scaler.transform(X_test_processed)

# Make predictions
test_predictions = model.predict(X_test_scaled)

# CatBoost returns multiclass predictions as an (n, 1) array; flatten to a 1-D list
test_predictions_new = np.ravel(test_predictions).tolist()

# Save predictions
submission = pd.DataFrame({
    'Id': range(len(test_predictions)),
    'OfficeCategory': test_predictions_new
})
submission.to_csv('submission.csv', index=False)

print("Predictions saved to submission.csv")

Numeric features: 37
Categorical features: 42
Predictions saved to submission.csv
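
The flattening and DataFrame construction in the cell above can be wrapped in a small helper that works for both 1-D and column-shaped predictions. This is a sketch: `build_submission` is a hypothetical name, and the default `Id` scheme (0 to n-1) simply mirrors the cell above, so confirm it against the competition's sample submission.

```python
import numpy as np
import pandas as pd

def build_submission(predictions, ids=None) -> pd.DataFrame:
    """Flatten predictions to 1-D and pair them with ids (default: 0..n-1)."""
    flat = np.asarray(predictions).ravel().astype(int)
    if ids is None:
        ids = np.arange(len(flat))
    if len(ids) != len(flat):
        raise ValueError("ids and predictions must have the same length")
    return pd.DataFrame({"Id": ids, "OfficeCategory": flat})

# Column-shaped toy predictions, like those CatBoost returns for multiclass problems
toy_predictions = np.array([[3], [2], [0]])
toy_submission = build_submission(toy_predictions)
print(toy_submission)
```

Calling `build_submission(test_predictions).to_csv('submission.csv', index=False)` would write the same file as the cell above.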


In [29]:
test_predictions_new = []

for tst in test_predictions.tolist():
  test_predictions_new.append(tst[0])

test_predictions_new

[3,
 2,
 3,
 0,
 3,
 4,
 0,
 1,
 1,
 3,
 1,
 4,
 1,
 3,
 2,
 1,
 3,
 0,
 1,
 2,
 2,
 2,
 3,
 1,
 3,
 2,
 1,
 0,
 0,
 3,
 4,
 2,
 4,
 2,
 1,
 4,
 1,
 0,
 1,
 2,
 1,
 2,
 4,
 1,
 4,
 1,
 3,
 0,
 1,
 3,
 3,
 2,
 1,
 3,
 2,
 1,
 0,
 4,
 1,
 0,
 1,
 4,
 3,
 0,
 0,
 3,
 4,
 4,
 0,
 2,
 4,
 3,
 2,
 0,
 1,
 0,
 4,
 3,
 1,
 0,
 3,
 4,
 1,
 1,
 2,
 0,
 0,
 4,
 3,
 1,
 3,
 0,
 0,
 1,
 2,
 4,
 3,
 0,
 0,
 1,
 0,
 1,
 4,
 1,
 4,
 2,
 0,
 0,
 2,
 4,
 4,
 3,
 1,
 4,
 2,
 3,
 3,
 4,
 1,
 1,
 4,
 2,
 4,
 3,
 2,
 2,
 1,
 3,
 1,
 0,
 3,
 2,
 4,
 1,
 3,
 3,
 4,
 1,
 1,
 2,
 2,
 2,
 1,
 4,
 4,
 2,
 3,
 3,
 2,
 4,
 0,
 1,
 1,
 4,
 4,
 4,
 0,
 4,
 1,
 3,
 4,
 4,
 0,
 2,
 0,
 2,
 4,
 2,
 1,
 2,
 4,
 0,
 0,
 1,
 3,
 0,
 3,
 2,
 2,
 0,
 1,
 1,
 3,
 1,
 1,
 0,
 3,
 4,
 1,
 1,
 2,
 3,
 3,
 1,
 1,
 0,
 3,
 4,
 0,
 0,
 4,
 2,
 4,
 3,
 1,
 3,
 1,
 0,
 1,
 3,
 2,
 3,
 2,
 0,
 4,
 0,
 1,
 1,
 4,
 3,
 4,
 4,
 2,
 2,
 3,
 4,
 0,
 3,
 1,
 2,
 1,
 1,
 4,
 1,
 2,
 1,
 1,
 1,
 4,
 2,
 2,
 3,
 3,
 4,
 2,
 0,
 3,
 3,
 4,
 4,


# 🎯 IDEAS TO TRY - IMPROVE YOUR MODEL!

## 1. 🔧 Feature Engineering

Create new features that capture relationships between variables:

```python
# Interaction features (Quality × Size effect)
X['Quality_Size'] = X['BuildingGrade'] * X['OfficeSpace']

# Polynomial features (Non-linear relationships)
X['OfficeSpace_squared'] = X['OfficeSpace'] ** 2
X['BuildingGrade_squared'] = X['BuildingGrade'] ** 2

# Ratio features (Relative measurements)
X['Space_Plot_Ratio'] = X['OfficeSpace'] / (X['PlotSize'] + 1)
X['Restroom_Meeting_Ratio'] = X['Restrooms'] / (X['MeetingRooms'] + 1)

# Aggregated features
X['TotalArea'] = X['OfficeSpace'] + X['BasementArea'] + X['ParkingArea']
```

---

## 2. 🌲 Different Models

Try tree-based models (often better than linear models for this type of data):

```python
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Random Forest
model = RandomForestClassifier(
    n_estimators=200,
    max_depth=20,
    random_state=42,
    n_jobs=-1
)
model.fit(X_train, y_train)

# XGBoost (often a strong choice for tabular data)
model = XGBClassifier(
    n_estimators=200,
    max_depth=8,
    learning_rate=0.1,
    random_state=42
)
model.fit(X_train, y_train)
```

---

## 3. 🎛️ Hyperparameter Tuning

Optimize your model's parameters:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 15, 20, 25],
    'min_samples_split': [5, 10, 20]
}

# Grid search with cross-validation
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid.fit(X_train, y_train)

# Best parameters
print(f"Best parameters: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.4f}")
```
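
The grid above has 3 × 4 × 3 = 36 combinations, so 5-fold cross-validation means 180 model fits plus a final refit. When that is too slow, a randomized search over the same kind of space is a common alternative. The sketch below is illustrative only: it runs on synthetic data from `make_classification`, and the smaller parameter values are chosen so it finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

param_distributions = {
    "n_estimators": [20, 40, 60],
    "max_depth": [3, 5, 8, None],
    "min_samples_split": [2, 5, 10],
}

# Sample 5 of the 36 possible combinations instead of trying them all
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_demo, y_demo)

print(f"Best parameters: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.4f}")
```

`n_iter` sets the search budget directly, which makes the run time easier to predict than with a full grid.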

---

## 4. 🤝 Ensemble Methods

Combine multiple models for better predictions:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier

# Create individual models
model1 = RandomForestClassifier(n_estimators=200, random_state=42)
model2 = XGBClassifier(n_estimators=200, random_state=42)

# Voting ensemble (combines predictions)
ensemble = VotingClassifier(
    estimators=[('rf', model1), ('xgb', model2)],
    voting='soft',  # Average probabilities
    weights=[1, 1.2]  # Slightly favor XGBoost
)
ensemble.fit(X_train, y_train)
```

---

## 5. 🎯 Feature Selection

Select only the most important features:

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Select top 50 features
selector = SelectKBest(f_classif, k=50)
X_train_selected = selector.fit_transform(X_train, y_train)
X_val_selected = selector.transform(X_val)

# Get selected feature names
selected_features = X_train.columns[selector.get_support()]
print(f"Selected features: {list(selected_features)}")
```
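
As an alternative to the univariate F-test above, features can be ranked by a tree model's impurity-based importances, which the evaluation TODO also mentions. The following sketch uses synthetic data with placeholder feature names, so it illustrates the mechanics only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the preprocessed training frame
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_demo, y_demo)

# Importances sum to 1; higher means the feature was used for more useful splits
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances.head(5))
```

Impurity-based importances can favor features with many distinct values, so treat the ranking as a starting point rather than a final verdict.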

---

## 6. 📈 Cross-Validation

Get more reliable performance estimates:

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
scores = cross_val_score(
    model,
    X_train,
    y_train,
    cv=5,
    scoring='accuracy'
)

print(f"CV Accuracy: {scores.mean():.4f} (±{scores.std():.4f})")
```
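
When the model needs scaled inputs, as logistic regression does, the scaler can be placed inside a pipeline so it is refit on each training fold instead of seeing the validation fold. The sketch below runs on synthetic data and uses an explicit `StratifiedKFold`; the dataset and the names `X_demo` and `y_demo` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_processed / y
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Scaling happens inside each fold, so no validation data leaks into the scaler
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Shuffled, stratified folds keep the class balance similar across splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"CV Accuracy: {scores.mean():.4f} (±{scores.std():.4f})")
```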