# Preprocessing Steps (Must Be Run Before Any Modeling)

This notebook defines the standard preprocessing pipeline to be run on the raw dataset before training any model. It ensures all models are trained on a consistent set of rows and features.

> **Note**: This notebook does **not** modify the original dataset or save new files. It provides instructions to be run as the first step in any modeling notebook.


## Imports

Make sure to import the following libraries before running the preprocessing steps.


In [1]:
import sys
import os
import pandas as pd

# Automatically add the project root (1 level up) to the Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if project_root not in sys.path:
    sys.path.append(project_root)

from feature_engineer import preprocessor

## Preprocessing & Feature Engineering


In [2]:
df = pd.read_csv(os.path.join(project_root, "Data", "train.csv"))
preprocessor(df)

## Pipeline construction

As discussed earlier, vandalism scores cannot be pre-computed once for the whole dataset when running cross-validation: `VandalismScorer` is fit on the labels, so pre-computed scores would leak validation labels into training. The scores must instead be recomputed for each cross-validation iteration.

The robust way to do this is to build a pipeline with a `VandalismScorer` transformer followed by the model to be trained.
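The sketch below illustrates why a pipeline takes care of the per-fold recomputation. `RecordingScorer` is a hypothetical stand-in for `VandalismScorer` (it only records how many rows it was fit on and passes the data through), and the data are synthetic. Under these assumptions, `cross_val_score` refits the whole pipeline on the training portion of each fold, so the transformer never sees the validation rows during `fit`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

fit_sizes = []  # number of rows seen by each call to fit


class RecordingScorer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for VandalismScorer: records fit sizes, passes data through."""

    def fit(self, X, y=None):
        fit_sizes.append(len(X))
        return self

    def transform(self, X):
        return X


# Synthetic data: 100 rows, balanced binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0, 1] * 50)

pipe = Pipeline([("scorer", RecordingScorer()), ("log", LogisticRegression())])
scores = cross_val_score(pipe, X, y, cv=5)

print(fit_sizes)  # one entry per fold, each the size of that fold's training portion
```

With 100 rows and 5 folds, every recorded fit uses only the 80 training rows of its fold.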

In [3]:
from feature_engineer import VandalismScorer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

Example usage:

In [4]:
scorer = VandalismScorer(n_splits=4, random_state=42)

`n_splits` and `random_state` are passed to the `StratifiedKFold` object within `scorer`. `n_splits` is, strictly speaking, a hyper-parameter that can be tuned. However, to stick with what we discussed, use the following guidelines:

 - During cross-validation, the `n_splits` passed to `VandalismScorer()` should be ***one less than*** the `n_splits` used in the overall cross-validation. Hence, the default for `n_splits` here is 4, which matches 5-fold cross-validation.

 - During final testing, `n_splits` should be set ***equal to*** the `n_splits` used in cross-validation, for consistency.
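The notebook does not spell out the reasoning behind the "one less than" rule. One plausible reading, sketched here on toy labels and assuming the overall cross-validation is a 5-fold `StratifiedKFold`, is that it keeps the scorer's internal held-out blocks the same size as one outer validation fold. The variable names are illustrative and not part of the project code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 100 rows, balanced binary labels (illustrative only).
y = np.array([0, 1] * 50)
X = np.zeros((100, 1))

outer_splits = 5                 # assumed n_splits of the overall cross-validation
inner_splits = outer_splits - 1  # value passed to VandalismScorer during cross-validation

# Take the first outer fold: 4/5 of the rows for training, 1/5 for validation.
outer_cv = StratifiedKFold(n_splits=outer_splits, shuffle=True, random_state=42)
train_idx, valid_idx = next(iter(outer_cv.split(X, y)))

# Split the outer training portion again, as the scorer would do internally.
inner_cv = StratifiedKFold(n_splits=inner_splits, shuffle=True, random_state=42)
inner_block_sizes = [
    len(held_out) for _, held_out in inner_cv.split(X[train_idx], y[train_idx])
]

print(len(valid_idx))     # size of one outer validation fold
print(inner_block_sizes)  # sizes of the inner held-out blocks
```

In this toy setting, the outer validation fold and each inner held-out block contain the same number of rows.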

In [5]:
scorer.fit(df, df['isvandalism'])

When training a model that needs `vandalism_score` as a feature, put the `VandalismScorer` object and the model object in a pipeline, along with any other pre-processing steps. This ensures there is no data leakage when computing `vandalism_score` during cross-validation.

`scorer.fit(df, labels)` and `scorer.transform(df)` **both require** the dataframe `df` to contain the columns `"added_lines"`, `"deleted_lines"`, and `"EditID"`. The `transform(df)` method returns a DataFrame with:
 - the `"added_lines"`, `"deleted_lines"`, and `"EditID"` columns dropped
 - a `"vandalism_score"` column added
 - all other columns passed through unchanged
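Since a missing column would otherwise surface only inside `fit` or `transform`, a small check before building the pipeline can help. The helper below is a hypothetical convenience function, not part of `feature_engineer`; it relies only on the column names listed above.

```python
import pandas as pd

# Columns that VandalismScorer.fit and .transform require, per the description above.
REQUIRED_COLUMNS = ("added_lines", "deleted_lines", "EditID")


def missing_scorer_columns(df: pd.DataFrame) -> list:
    """Return the required columns that are absent from df (empty list if none)."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]


# Toy frame that lacks "EditID".
toy = pd.DataFrame({"added_lines": ["foo"], "deleted_lines": ["bar"], "user_warns": [0]})
print(missing_scorer_columns(toy))
```

An empty list means the frame can be passed to the scorer as far as column names are concerned.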

For example, with logistic regression:

In [23]:
cols = ["user_edit_count", "user_warns", "num_recent_reversions", "num_edits_5d_before", "is_person", "added_lines", "deleted_lines", "EditID"]

In [24]:
model_pipe = Pipeline([('scorer', VandalismScorer(n_splits=5)), ('log', LogisticRegression())])

In [25]:
df_train, df_test, labels_train, labels_test = train_test_split(df[cols], df['isvandalism'], shuffle=True)


In [19]:
df_train

Unnamed: 0,user_edit_count,user_warns,num_recent_reversions,num_edits_5d_before,is_person,added_lines,deleted_lines
6276,75,0,0,0,0,*[http://www.panynj.gov/bridges-tunnels/bayonn...,*[http://www.panynj.gov/CommutingTravel/bridge...
21848,59461,7,0,0,0,"[[File:Green Action logo.png|right]],",[[Image:Green Action logo.png|right]]
6679,1712760,14,0,0,0,<!-- Additional parameters for this template ...,<!-- Additional parameters for this template ...
5197,4,1,0,6,0,"""'''Geometry''' ({{lang-grc|γεωμετρία}}; ''geo...","""'''Geometry''' ({{lang-grc|γεωμετρία}}; ''geo..."
2410,85,0,0,9,1,suck madd dick sebastian,"{{Infobox musical artist,| Name = C..."
...,...,...,...,...,...,...,...
5309,1,0,0,1,0,| caption = she had sex with steve mcDonald,| caption =
17985,4,2,0,3,0,<nowiki><nowiki>Insert non-formatted text here...,"==Road signs==,""There is a nearby [[hamlet (pl..."
17044,6,3,1,4,1,my name in pat lomitzer from fairfield prep,
22771,1,0,0,4,0,"""About 850 [[species]]&nbsp;<ref>{{cite journa...","""About 850 [[species]]&nbsp;<ref>{{cite journa..."


In [20]:
df_test

Unnamed: 0,user_edit_count,user_warns,num_recent_reversions,num_edits_5d_before,is_person,added_lines,deleted_lines
12760,2065,0,0,1,1,| ru_position = [[Rugby union positions#13. Ou...,"""| ru_position = Wing, Centre, Full Back"""
17919,2,1,0,3,1,| name = Pedro Pinchak,| name = Jimmy Pinchak
19613,416,0,0,0,0,"""The movie bears some resemblance to the exper...","""The movie bears some resemblance to the exper..."
4104,1,0,0,0,0,"""The '''Kentucky State Fair''' is the [[state ...","""The '''Kentucky State Fair''' is the [[state ..."
2783,190013,4,0,1,0,"| image2=Qingdao metro m3.jpg,| imagesize2=300...","|track_gauge = 1435,""'''Qingdao Metro''' ({{zh..."
...,...,...,...,...,...,...,...
17910,4,0,0,2,0,• Enrollment in AP classes has grown over the...,• Enrollment in AP classes has grown over the...
17945,31315,1,0,1,0,"""'''Substance over form''' is an [[accounting]...","""'''Substance over form''' is an [[accounting]..."
2308,13,0,0,21,0,"""|ShortSummary=Cartoon Network stars build foo...","""|ShortSummary=Cartoon Network stars build foo..."
16763,3,1,0,3,0,"""""""'''All Summer in a Day'''"""" is a [[short st...",",""""""'''All Summer in a Day'''"""" is a [[short s..."


In [26]:
model_pipe.fit(df_train, labels_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
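The output above is scikit-learn's convergence warning from the default `lbfgs` solver. As the message suggests, scaling the features or raising `max_iter` usually resolves it. Below is a minimal sketch on synthetic data with features on very different scales. In the project pipeline, the `StandardScaler` step would sit between `VandalismScorer` and the model, which assumes that every column returned by the scorer is numeric (not verified here).

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, loosely mimicking edit counts and flags.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 2_000_000, size=500),  # large counts
    rng.integers(0, 30, size=500),         # small counts
    rng.integers(0, 2, size=500),          # binary flag
]).astype(float)
y = (X[:, 1] + rng.normal(0, 5, size=500) > 15).astype(int)

scaled_pipe = Pipeline([
    ("scale", StandardScaler()),                  # zero mean, unit variance per column
    ("log", LogisticRegression(max_iter=1000)),   # more iterations than the default 100
])

# Turn a ConvergenceWarning into an error so a failed fit cannot go unnoticed.
with warnings.catch_warnings():
    warnings.simplefilter("error", category=ConvergenceWarning)
    scaled_pipe.fit(X, y)

print(scaled_pipe.named_steps["log"].n_iter_)
```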


In [28]:
model_pipe['scorer'].transform(df_train)

Unnamed: 0,user_edit_count,user_warns,num_recent_reversions,num_edits_5d_before,is_person,vandalism_score
0,212,23,0,3,0,7.582304e-01
1,930819,4,0,0,0,5.000000e-01
2,5,2,0,6,0,8.954977e-02
3,52062,0,0,1,1,2.423162e-40
4,1712760,14,0,0,0,4.680693e-04
...,...,...,...,...,...,...
19066,161,32,0,2,1,9.087259e-01
19067,3,1,0,9,0,1.286427e-01
19068,211,25,0,4,0,6.733326e-01
19069,1,1,0,11,0,8.494990e-01


In [39]:
test_pred = pd.Series(model_pipe.predict(df_test))

In [46]:
test_pred.shape

(6357,)

In [47]:
labels_test.shape

(6357,)

In [48]:
test_pred.index

RangeIndex(start=0, stop=6357, step=1)

In [50]:
labels_test.index

Index([ 5132,  8909,  2782, 10224,   797,  9717, 23739, 11237,  1036,  1884,
       ...
         156, 14207, 13342,  4004,  7387, 12435,  4561,  6920, 22394, 13641],
      dtype='int64', length=6357)

In [51]:
test_pred_vs_true = pd.DataFrame({'score_log_pipe_pred': test_pred, 'ground_truth': labels_test.reset_index(drop=True)})
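The `reset_index(drop=True)` in the cell above matters because pandas aligns Series by index label when building a DataFrame: `test_pred` has a `RangeIndex`, while `labels_test` keeps the shuffled row labels of the original data. The toy sketch below (made-up values, illustrative names) shows the misalignment and two ways to avoid it.

```python
import pandas as pd

# Toy stand-ins: predictions carry a RangeIndex, labels keep shuffled row labels.
pred = pd.Series([True, False, True])
truth = pd.Series([True, False, False], index=[5132, 8909, 2782])

# Without resetting the index, rows are aligned by label; no labels match,
# so the result has the union of both indexes and a NaN in every row.
misaligned = pd.DataFrame({"pred": pred, "truth": truth})

# Option 1 (as in the cell above): drop the original labels and align by position.
by_position = pd.DataFrame({"pred": pred, "truth": truth.reset_index(drop=True)})

# Option 2: keep the original row labels by attaching them to the predictions.
by_label = pd.DataFrame(
    {"pred": pd.Series(pred.to_numpy(), index=truth.index), "truth": truth}
)

print(misaligned.shape, by_position.shape, by_label.shape)
```

Option 2 can be convenient when predictions need to be traced back to rows of the original dataframe.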

In [52]:
test_pred_vs_true

Unnamed: 0,score_log_pipe_pred,ground_truth
0,True,True
1,False,False
2,False,False
3,False,False
4,True,False
...,...,...
6352,True,True
6353,True,True
6354,False,False
6355,False,False
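A table like the one above is easier to judge with a summary. The sketch below computes accuracy and a confusion table with plain pandas; the five rows are copied from the first visible rows of the output above purely as toy input, so the numbers say nothing about the full test set.

```python
import pandas as pd

# First five visible rows of the table above, used as toy input only.
toy = pd.DataFrame({
    "score_log_pipe_pred": [True, False, False, False, True],
    "ground_truth": [True, False, False, False, False],
})

# Fraction of rows where prediction and ground truth agree.
accuracy = (toy["score_log_pipe_pred"] == toy["ground_truth"]).mean()

# Rows: ground truth, columns: prediction.
confusion = pd.crosstab(toy["ground_truth"], toy["score_log_pipe_pred"])

print(accuracy)
print(confusion)
```

The same two lines can be applied to `test_pred_vs_true` directly.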
