# Preprocessing Steps (Must Be Run Before Any Modeling)

This notebook defines the standard preprocessing pipeline to be run on the raw dataset before training any model. It ensures all models are trained on a consistent set of rows and features.

> **Note**: This notebook does **not** modify the original dataset or save new files. It provides instructions to be run as the first step in any modeling notebook.


## Imports

Make sure to import the following libraries before running the preprocessing steps.


In [1]:
import sys
import os
import pandas as pd

# Automatically add the project root (1 level up) to the Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if project_root not in sys.path:
    sys.path.append(project_root)

from feature_engineer import preprocessor

## Preprocessing & Feature Engineering


In [2]:
df = pd.read_csv(os.path.join(project_root, "Data", "train.csv"))
preprocessor(df)

## Pipeline construction

As we discussed earlier, vandalism scores cannot simply be pre-computed once when running cross-validation. They must be recomputed for each iteration of cross-validation.

The robust way to do this is to create a pipeline with a `VandalismScorer` transformer class, followed by the model to be trained.
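The effect of placing a target-dependent transformer inside a pipeline can be sketched on synthetic data. `ToyTargetScorer` below is a hypothetical stand-in, not the project's `VandalismScorer` (whose internals are not shown here); the class-level counter exists only to show that `cross_val_score` refits the transformer on the training portion of every fold.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


class ToyTargetScorer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: appends a score column learned from the labels."""

    n_fits = 0  # class-level counter, used only to count calls to fit()

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        # Learn a label-dependent statistic from the rows passed to fit() only
        self.pos_mean_ = X[y == 1, 0].mean()
        ToyTargetScorer.n_fits += 1
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Score: closeness of feature 0 to the positive-class mean
        score = -np.abs(X[:, 0] - self.pos_mean_)
        return np.column_stack([X, score])


# Synthetic data (assumption: not the notebook's dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

pipe = Pipeline([("scorer", ToyTargetScorer()), ("model", LogisticRegression())])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv)

print("fits:", ToyTargetScorer.n_fits, "fold accuracies:", scores.round(3))
```

Because the scorer is refit inside each fold, the held-out rows never contribute their labels to the score column they are evaluated on.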

In [3]:
from feature_engineer import VandalismScorer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

Example usage:

In [4]:
scorer = VandalismScorer(n_splits=4, random_state=42)

`n_splits` and `random_state` are passed to the `StratifiedKFold` object within `scorer`. Strictly speaking, `n_splits` is a hyperparameter that can be tuned. However, to stick with what we discussed, use the following guidelines:

 - During cross-validation, the `n_splits` passed to `VandalismScorer()` should be ***one less than*** the `n_splits` used in the overall cross-validation. Hence, with 5-fold cross-validation overall, the default for `n_splits` here is 4.

 - During final testing, `n_splits` should be set ***equal to*** the `n_splits` used in cross-validation for consistency.
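One way to read these two guidelines is that they keep the scorer's internal folds the same size relative to the full training set. The helper below, `inner_fold_fraction`, is a hypothetical name introduced only for this arithmetic; the interpretation itself is an assumption, not something stated by `VandalismScorer`.

```python
from fractions import Fraction


def inner_fold_fraction(outer_splits, inner_splits, final_test=False):
    """Fraction of the full training set that lands in each internal scorer fold."""
    # During cross-validation the scorer is fit on (outer_splits - 1) of the outer folds;
    # during final testing it is fit on the whole training set.
    seen = Fraction(1) if final_test else Fraction(outer_splits - 1, outer_splits)
    return seen / inner_splits


# 5-fold cross-validation overall, scorer uses 4 internal splits
cv_case = inner_fold_fraction(outer_splits=5, inner_splits=4)

# Final testing: scorer uses the same 5 splits as cross-validation
final_case = inner_fold_fraction(outer_splits=5, inner_splits=5, final_test=True)

print(cv_case, final_case)
```

Under this reading, both settings give internal folds that each hold one fifth of the full training data.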

In [5]:
scorer.fit(df, df['isvandalism'])

In [6]:
df_transformed = scorer.transform(df)

In [7]:
df_transformed

Unnamed: 0,EditType,EditID,comment,user,user_edit_count,user_distinct_pages,user_warns,user_reg_time,prev_user,common,...,previous_timestamp,isvandalism,num_edits_5d_before,is_person,comment_empty,account_age,is_IP,word_count_added,word_count_deleted,vandalism_score
0,change,329595189,,Nryan30,66,13,0,1259891940,219.78.124.42,,...,1259856305,False,1,0,True,0,False,131,1,2.897910e-25
1,change,232199357,/* Penis */,89.242.200.212,4,2,2,20080815230001,66.75.235.255,,...,1218816231,True,4,1,False,1,True,4,202,9.925523e-01
2,change,329877752,Reverted edits by [[Special:Contributions/71.2...,Chamal N,18697,0,2,1208605428,71.208.113.72,,...,1260025104,False,3,0,False,595,False,34,50,2.569290e-01
3,change,253129486,,Animaldudeyay1009,3,1,2,1227241317,J.delanoy,,...,1227241120,True,2,0,True,0,False,94,836,1.000000e+00
4,change,394520551,Adding Persondata using [[Project:AWB|AWB]] (7...,RjwilmsiBot,1602950,1309238,0,1257977968,LobãoV,,...,1285262356,False,0,1,False,356,False,34,0,1.736015e-22
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25482,change,327368981,Reverted 1 edit by [[Special:Contributions/68....,TreyGeek,15458,4978,2,1203859836,68.40.112.72,,...,1258926849,False,1,0,False,637,False,26,26,5.000000e-01
25484,change,234810735,/* History */,59.180.151.222,1,1,0,20080828164140,68.50.79.137,,...,1219280292,True,0,0,False,1,True,86,79,9.543287e-01
25485,change,329132348,/* In Tamil Nadu */,66.184.61.179,1,1,0,20091201230141,RJFJR,,...,1259002677,False,0,0,False,1,True,340,337,4.304860e-03
25486,change,240599711,/* Biography */,75.157.130.175,6,1,2,20080924030549,J.delanoy,,...,1222225537,True,11,1,False,1,True,2,209,9.893988e-01


When training a model that needs `vandalism_score` as a feature, put the `VandalismScorer` object and the model object in a pipeline, along with any other pre-processing steps. This ensures there is no data leakage when computing `vandalism_score` during cross-validation.

For example, with logistic regression:

In [8]:
cols = ["user_edit_count", "user_warns", "num_recent_reversions", "num_edits_5d_before", "is_person", "added_lines", "deleted_lines"]

In [9]:
model_pipe = Pipeline([('scorer', VandalismScorer(n_splits=5)), ('log', LogisticRegression())])

In [10]:
model_pipe.fit(df[cols], df['isvandalism'])

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [11]:
model_pipe.predict_proba(df[cols])

array([[8.22382612e-01, 1.77617388e-01],
       [5.62116331e-02, 9.43788367e-01],
       [9.99997131e-01, 2.86906793e-06],
       ...,
       [8.13952451e-01, 1.86047549e-01],
       [2.89475416e-02, 9.71052458e-01],
       [9.88399959e-01, 1.16000407e-02]])