# Preprocessing Steps (Must Be Run Before Any Modeling)

This notebook defines the standard preprocessing pipeline to be run on the raw dataset before training any model. It ensures all models are trained on a consistent set of rows and features.

> **Note**: This notebook does **not** modify the original dataset or save new files. It provides code to run as the first step in any modeling notebook.


## Imports

Make sure to import the following libraries before running the preprocessing steps.


In [8]:
import sys
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Automatically add the project root (1 level up) to the Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if project_root not in sys.path:
    sys.path.append(project_root)

from feature_engineer import (
    VandalismScorer,
    is_IP,
    account_age,
    comment_empty,
    word_count,
)

## Preprocessing & Feature Engineering

The following code prepares the dataset by:

- Removing invalid rows whose `added_lines` or `deleted_lines` value is `"BAD REQUEST"`.
- Adding custom features to enhance model performance and interpretability:
  - `comment_empty`: Whether the comment field is empty.
  - `account_age`: Age of the account at the time of the edit.
  - `is_IP`: Whether the editor is anonymous (based on IP address).
  - `word_count_added` / `word_count_deleted`: Number of words added or removed in the edit.
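
To make the feature definitions above concrete, here is a minimal sketch of how an IP check and a word count could be written. The helpers `looks_like_ip` and `count_words` are hypothetical names used only for illustration. They are not the project's `is_IP` or `word_count` functions from `feature_engineer`, whose actual logic may differ.

```python
import ipaddress


def looks_like_ip(username):
    """Return True if the username parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(str(username).strip())
        return True
    except ValueError:
        return False


def count_words(text):
    """Count whitespace-separated tokens; missing values count as zero words."""
    if not isinstance(text, str):
        return 0
    return len(text.split())
```

Treating non-string values as zero words is one possible way to handle missing text. Check how the project helpers treat missing values before relying on this behavior.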

In [9]:
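# Adjust this path to point at your local copy of train.csv
df = pd.read_csv("/Users/danielmilanesperez/Downloads/train.csv")

# Removing bad requests; .copy() avoids SettingWithCopyWarning when adding columns below
df = df[
    ~((df["added_lines"] == "BAD REQUEST") | (df["deleted_lines"] == "BAD REQUEST"))
].copy()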

# Adding the "comment_empty" feature
df["comment_empty"] = comment_empty(df)

# Adding the "account_age" feature
df["account_age"] = df.apply(account_age, axis=1)

# Adding the "is_IP" feature
df["is_IP"] = df.apply(is_IP, axis=1)

# Adding the "word_count_added" and "word_count_deleted" features
df["word_count_added"], df["word_count_deleted"] = zip(*df.apply(word_count, axis=1))

### `vandalism_score`

The following code creates a new feature called `vandalism_score`, which estimates the likelihood that an edit is vandalism. This score is computed using a simple Naive Bayes-style method based on the **words added** in each edit.

#### How It Works

- Words that are frequently seen in past **vandalism edits** (e.g., profanity, repeated characters) increase the score.
- Words that are common in **non-vandalism edits** (e.g., neutral or technical terms) lower the score.
- The final score reflects how similar the added words are to patterns typically seen in vandalism.
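
The idea in the list above can be sketched as a sum of per-word log-likelihood ratios with add-one smoothing. This is a toy illustration under stated assumptions, not the implementation of `VandalismScorer`: the function names, the whitespace tokenization, the smoothing constant `alpha`, and the example texts are all invented here, and the class prior is left out for brevity.

```python
import math
from collections import Counter


def fit_word_counts(texts, labels):
    """Count word occurrences separately for vandalism (1) and regular (0) edits."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in zip(texts, labels):
        counts[int(label)].update(text.lower().split())
    return counts


def log_odds_score(text, counts, alpha=1.0):
    """Sum of per-word log-likelihood ratios with add-alpha (Laplace) smoothing."""
    vocab_size = len(set(counts[0]) | set(counts[1]))
    total_vandal = sum(counts[1].values())
    total_regular = sum(counts[0].values())
    score = 0.0
    for word in text.lower().split():
        p_vandal = (counts[1][word] + alpha) / (total_vandal + alpha * vocab_size)
        p_regular = (counts[0][word] + alpha) / (total_regular + alpha * vocab_size)
        score += math.log(p_vandal / p_regular)
    return score


# Toy data, invented purely for illustration.
texts = ["lol lol stupid", "added citation for source", "stupid page lol", "fixed citation format"]
labels = [1, 0, 1, 0]
counts = fit_word_counts(texts, labels)

vandal_like = log_odds_score("lol stupid", counts)        # positive: words seen in vandalism
regular_like = log_odds_score("citation format", counts)  # negative: words seen in regular edits
```

A positive score means the added words look more like the vandalism examples, a negative score means they look more like regular edits, and an edit with no words scores zero.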

#### Avoiding Data Leakage

Because this feature uses the `isvandalism` label during training, we need to avoid data leakage when computing it. To do this, we use **cross-validation**:

1. Split the training data into `K` folds.
2. For each fold:
   - Train the scoring model on the other `K - 1` folds.
   - Compute the `vandalism_score` for the current fold using this trained model.
3. Repeat for all folds so that each score is generated without access to its own label.

This ensures that each row's `vandalism_score` is computed by a model that never saw that row's label, so the feature can be used as an input to any downstream machine learning model without leaking the target.

In [10]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

df["vandalism_score"] = np.zeros(df.shape[0], dtype=float)

for train_index, test_index in cv.split(df, df["isvandalism"]):
    train_df = df.iloc[train_index]
    test_df = df.iloc[test_index]

    scorer = VandalismScorer()
    scorer.fit(
        train_df["added_lines"], train_df["deleted_lines"], train_df["isvandalism"]
    )
    test_scores = scorer.score(test_df["added_lines"], test_df["deleted_lines"])

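    # Assign by index label: after filtering, df's index has gaps, so the
    # positional test_index values would not match the row labels
    df.loc[test_df.index, "vandalism_score"] = test_scores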
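
As a sanity check on the out-of-fold scheme, the sketch below confirms on synthetic labels that the folds never overlap and that every row lands in exactly one scoring fold. The label vector is invented for illustration. On the real data, the same check could be run with `df` and `df["isvandalism"]`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels, invented for illustration: 100 rows, 20% positive.
y = np.array([1] * 20 + [0] * 80)
X = np.zeros((len(y), 1))  # placeholder features; the split depends only on y

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

times_scored = np.zeros(len(y), dtype=int)
for train_index, test_index in cv.split(X, y):
    # A row must never be in both the fitting set and the scoring set.
    assert np.intersect1d(train_index, test_index).size == 0
    times_scored[test_index] += 1

every_row_scored_once = bool((times_scored == 1).all())
```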

## Suggestions

- When applying a log transformation to numerical features, use `log(1 + x)` to safely handle zero values and avoid undefined results.
- Many numerical features are highly right-skewed and contain outliers. Ensure that the model you choose can handle this distribution, or consider removing or transforming outliers before training.
- For additional insights on feature importance and correlations, please refer to the EDA notebook located at `EDA/EDA.ipynb`. The comments there may help guide your feature selection and modeling decisions.
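
The first two suggestions can be sketched on a toy series as follows. The values are invented, and the 95th-percentile clipping threshold is an arbitrary choice for illustration rather than a recommendation from this project.

```python
import numpy as np
import pandas as pd

# Toy right-skewed feature with a zero and one extreme value (invented data).
counts = pd.Series([0, 1, 2, 3, 5, 8, 13, 5000], name="word_count_added")

# log(1 + x): defined at zero, compresses the long right tail.
log_counts = np.log1p(counts)

# Alternative: clip at a high quantile instead of dropping rows.
upper = counts.quantile(0.95)
clipped = counts.clip(upper=upper)
```

If you clip, compute the threshold on the training rows only and reuse it for validation and test data, so the transformation does not depend on held-out rows.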
