# Preprocessing Steps (Must Be Run Before Any Modeling)

This notebook defines the standard preprocessing pipeline to be run on the raw dataset before training any model. It ensures all models are trained on a consistent set of rows and features.

> **Note**: This notebook does **not** modify the original dataset or save new files. It provides instructions to be run as the first step in any modeling notebook.


## Imports

Make sure to import the following libraries before running the preprocessing steps.


In [1]:
import sys
import os
import pandas as pd

# Automatically add the project root (1 level up) to the Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if project_root not in sys.path:
    sys.path.append(project_root)

from feature_engineer import preprocessor

## Preprocessing & Feature Engineering


In [2]:
df = pd.read_csv(os.path.join(project_root, "Data", "train.csv"))
preprocessor(df)

## Pipeline construction

As we discussed earlier, vandalism scores cannot be pre-computed once and then reused across cross-validation: scores computed on the full training set would leak information from each validation fold. They must be recomputed for each iteration of cross-validation.

The robust way to do this is to create a pipeline with a `VandalismScorer` transformer, followed by the model to be trained.
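To make the idea concrete, here is a minimal sketch of a score-producing transformer placed ahead of a model inside a `Pipeline`, evaluated with `cross_val_score`. `ToyTextScorer`, the `text_score` column, and the toy data are all hypothetical stand-ins invented for this illustration; this is not the project's `VandalismScorer`, and it deliberately omits the inner `StratifiedKFold` that the real class uses.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class ToyTextScorer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for VandalismScorer (NOT the project's class).

    Replaces a text column with a Naive Bayes probability learned only
    from the rows passed to fit().
    """

    def __init__(self, text_col="added_lines"):
        self.text_col = text_col

    def fit(self, X, y):
        self.vectorizer_ = CountVectorizer()
        counts = self.vectorizer_.fit_transform(X[self.text_col])
        self.clf_ = MultinomialNB().fit(counts, y)
        return self

    def transform(self, X):
        counts = self.vectorizer_.transform(X[self.text_col])
        out = X.drop(columns=[self.text_col]).copy()
        # Column 1 corresponds to the True class (classes_ is [False, True])
        out["text_score"] = self.clf_.predict_proba(counts)[:, 1]
        return out


# Tiny synthetic data set, invented for illustration only
toy = pd.DataFrame({
    "added_lines": ["you are stupid", "added a citation", "stupid stupid page",
                    "fixed a typo", "this is garbage", "updated the infobox"] * 5,
    "user_warns": [3, 0, 2, 0, 4, 0] * 5,
})
labels = np.array([True, False, True, False, True, False] * 5)

# The scorer is refit on the training part of every fold, so no validation
# row ever contributes to the scores it is evaluated with
pipe = Pipeline([("scorer", ToyTextScorer()), ("model", LogisticRegression())])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, toy, labels, cv=cv)
print(len(scores))
```

Because `cross_val_score` clones and refits the whole pipeline per fold, the scoring step never sees the held-out rows during fitting.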

In [3]:
from feature_engineer import VandalismScorer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

Example usage:

In [4]:
scorer = VandalismScorer(n_splits=4, random_state=42)

`n_splits` and `random_state` are passed to the `StratifiedKFold` object within `scorer`. `n_splits` is, strictly speaking, a hyper-parameter that can be tuned. However, to stick with what we discussed, use the following guidelines:

 - During cross-validation, the `n_splits` passed to `VandalismScorer()` should be ***one less than*** the `n_splits` used in the overall cross-validation. With 5-fold cross-validation this gives 4, hence the default of 4 here.

 - During final testing, `n_splits` should be set ***equal to*** the `n_splits` used in cross-validation for consistency.
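One way to read these guidelines, offered as an interpretation rather than the notebook's stated rationale, is that they keep the size of each inner held-out fold constant. The sketch below checks this on 100 synthetic labels with a 5-fold outer split; the data and variable names are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0, 1] * 50)      # 100 synthetic labels, balanced
X = np.zeros((len(y), 1))      # placeholder features

outer_splits = 5
outer = StratifiedKFold(n_splits=outer_splits, shuffle=True, random_state=42)
train_idx, val_idx = next(outer.split(X, y))   # first outer fold: 80 train, 20 validation

# During cross-validation: inner n_splits = outer n_splits - 1
inner = StratifiedKFold(n_splits=outer_splits - 1, shuffle=True, random_state=42)
inner_fold_sizes = [len(held_out)
                    for _, held_out in inner.split(X[train_idx], y[train_idx])]

# Final testing: all training rows, inner n_splits = outer n_splits
final = StratifiedKFold(n_splits=outer_splits, shuffle=True, random_state=42)
final_fold_sizes = [len(held_out) for _, held_out in final.split(X, y)]

print(len(val_idx), inner_fold_sizes, final_fold_sizes)
# 20 [20, 20, 20, 20] [20, 20, 20, 20, 20]
```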

In [5]:
scorer.fit(df, df['isvandalism'])

In [64]:
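import numpy as np
# Draw 10 random edits; replace NaN with '' so every text field can be vectorized
sample = df.sample(n = 10).replace(np.nan, '').reset_index(drop=True)
# Second reset stores the positional index in an 'index' column (used below to pick rows of sample_features)
sample.reset_index(inplace=True)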
sample['added_lines']

0    "Luna (Moon Lake), in the crater of the Nevado...
1    The school lunches are disgusting but its not ...
2              |Birth_name = Thien Thanh Thi Nguyen yo
3                             ,I HEARD IT'S REALLY 219
4                                                     
5                              Are a load of bollocks.
6    '''David Eldar''' is an [[Australia]]n [[Scrab...
7        *High Schools: [[Ruben S. Ayala High School]]
8    "'''LeBron Raymone James''' (born [[December 3...
9    ,"She is also a racist who hates all latino im...
Name: added_lines, dtype: object

In [65]:
sample['index']

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
Name: index, dtype: int64

In [66]:
sample['classifier_index'] = sample['EditID'].map(lambda x: scorer.EditID_to_classifier_index[x])
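The lookup above relies on `scorer.EditID_to_classifier_index`, whose construction is not shown in this notebook. As a rough sketch only, and assuming each edit is assigned to the fold classifier that did not train on it, such a mapping could be built from a `StratifiedKFold` split as follows. The IDs, labels, and the name `edit_id_to_classifier_index` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

edit_ids = np.arange(1000, 1024)       # hypothetical EditID values
y = np.array([True, False] * 12)       # synthetic labels

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

edit_id_to_classifier_index = {}
for clf_index, (train_idx, held_out_idx) in enumerate(skf.split(edit_ids, y)):
    # Classifier `clf_index` would be trained on train_idx only, so it can
    # score the held-out edits without ever having seen them
    for edit_id in edit_ids[held_out_idx]:
        edit_id_to_classifier_index[int(edit_id)] = clf_index

print(len(edit_id_to_classifier_index))
```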

In [67]:
# Vectorized added_lines minus vectorized deleted_lines, with negative entries clipped to zero
sample_features = (scorer.vectorizer_.transform(sample['added_lines']) - scorer.vectorizer_.transform(sample['deleted_lines'])).maximum(0)
sample_features.eliminate_zeros()
sample_features

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 121 stored elements and shape (10, 247639)>

In [70]:
sample.iloc[0]['index'].dtype

dtype('int64')

In [81]:
def score_row(row):
    clf = scorer.nb_classifiers[row['classifier_index']]
    # classes_ is used as a boolean mask to keep the column for the True (vandalism) class
    return clf.predict_proba(sample_features[row['index']])[:, clf.classes_].item()

sample['vandalism_score'] = sample[['index', 'classifier_index']].apply(score_row, axis=1)

In [82]:
sample[['classifier_index', 'vandalism_score', 'isvandalism']]

   classifier_index  vandalism_score  isvandalism
0                 0              1.0         True
1                 2         0.990113         True
2                 1          0.92121         True
3                 0         0.948237         True
4                 0              0.5         True
5                 0         0.484313         True
6                 0         0.014285         True
7                 1          0.09987        False
8                 1         0.934783         True
9                 0         0.908093         True
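The score computation above picks the positive-class probability by indexing `predict_proba` with `classes_`, which works as a boolean mask only when the labels are `True`/`False`. A small sketch on synthetic data shows that this matches looking up the column position of `True` explicitly, which may be easier to read. The data here is invented and `MultinomialNB` is an assumption about the classifier type.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Synthetic count features and boolean labels
X = np.array([[2, 0], [3, 1], [0, 2], [1, 3]])
y = np.array([True, True, False, False])
clf = MultinomialNB().fit(X, y)

proba = clf.predict_proba(X)

# Approach used in the notebook: classes_ ([False, True]) acts as a boolean mask
masked = proba[:, clf.classes_].ravel()

# Explicit alternative: find the column that belongs to the True class
pos_col = list(clf.classes_).index(True)
by_index = proba[:, pos_col]

print(pos_col, np.allclose(masked, by_index))
# 1 True
```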


In [6]:
df_transformed = scorer.transform(df)

In [13]:
df_transformed.columns

Index(['EditType', 'EditID', 'comment', 'user', 'user_edit_count',
       'user_distinct_pages', 'user_warns', 'user_reg_time', 'prev_user',
       'common', 'current', 'previous', 'page_made_time', 'title', 'namespace',
       'creator', 'num_recent_edits', 'num_recent_reversions', 'current_minor',
       'current_timestamp', 'previous_timestamp', 'isvandalism',
       'num_edits_5d_before', 'is_person', 'comment_empty', 'account_age',
       'is_IP', 'word_count_added', 'word_count_deleted', 'vandalism_score'],
      dtype='object')

In [26]:
df_transformed.reset_index(inplace = True)
df_transformed.index

RangeIndex(start=0, stop=25428, step=1)

When training a model that needs `vandalism_score` as a feature, put the `VandalismScorer` object and the model object in a pipeline, along with any other pre-processing steps. This ensures there is no data leakage when computing `vandalism_score` during cross-validation.

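For example, the preprocessing part of a KNN pipeline (the scorer followed by a scaler, with the classifier itself not yet attached) looks like this: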

In [8]:
cols = ["user_edit_count", "user_warns", "num_recent_reversions", "num_edits_5d_before", "is_person", "added_lines", "deleted_lines"]

In [9]:
model_pipe = Pipeline([('scorer', VandalismScorer(n_splits = 5)), ('scale', StandardScaler())])

In [10]:
model_pipe.fit(df[cols], df['isvandalism'])

In [11]:
model_pipe.predict_proba(df[cols])

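AttributeError: This 'Pipeline' has no attribute 'predict_proba'

This error is expected: the pipeline above ends with `StandardScaler`, a transformer, so it has no `predict_proba`. To get predictions, append a classifier such as `KNeighborsClassifier` as the final step.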

In [12]:
model_pipe['scale'].feature_names_in_

array(['user_edit_count', 'user_warns', 'num_recent_reversions',
       'num_edits_5d_before', 'is_person', 'vandalism_score'],
      dtype=object)
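The `AttributeError` above comes from the pipeline's last step being a transformer. The following sketch, on synthetic data, contrasts a preprocessing-only pipeline with one that ends in `KNeighborsClassifier`. The project-specific `VandalismScorer` step is left out so the example stays self-contained; in the real pipeline it would come first. `n_neighbors=5` is simply the scikit-learn default, not a tuned value.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features with two separable classes
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 3))
y = np.arange(40) % 2 == 0
X[y, 0] += 3.0

# Ends in a transformer: predict_proba is not exposed
prep_only = Pipeline([("scale", StandardScaler())])
has_proba_prep = hasattr(prep_only, "predict_proba")

# Ends in a classifier: predict_proba becomes available
full = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
full.fit(X, y)
proba = full.predict_proba(X)

print(has_proba_prep, proba.shape)
# False (40, 2)
```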