In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('Wikipedia.csv')
df.head()

,Vandal,Minor,LoggedIn,HTTP,NumWordsAdded,NumWordsRemoved
0,0,1,1,1,96,0
1,0,1,1,0,3,1
2,0,0,1,0,0,4
3,0,1,0,0,10,92
4,0,1,1,1,94,10


In [None]:
# (a) exploring dataset

vandalism_count = df['Vandal'].sum()
print("Number of vandalism cases:", vandalism_count)

average_words_added = df['NumWordsAdded'].mean()
average_words_removed = df['NumWordsRemoved'].mean()
print("Average number of words added:", average_words_added)
print("Average number of words removed:", average_words_removed)

correlations = df.corr()['Vandal'].drop('Vandal')
most_correlated_variable = correlations.abs().idxmax()
correlation_value = correlations[most_correlated_variable]
print("Most correlated variable with Vandal:", most_correlated_variable)
print("Correlation value:", correlation_value)

df.corr()

Number of vandalism cases: 1815
Average number of words added: 4.050051599587204
Average number of words removed: 3.5128998968008256
Most correlated variable with Vandal: LoggedIn
Correlation value: -0.4292545748989881


,Vandal,Minor,LoggedIn,HTTP,NumWordsAdded,NumWordsRemoved
Vandal,1.0,-0.213995,-0.429255,0.151554,-0.000729,0.03636
Minor,-0.213995,1.0,0.445166,-0.084297,-0.007726,-0.037629
LoggedIn,-0.429255,0.445166,1.0,-0.110633,0.026223,-0.036422
HTTP,0.151554,-0.084297,-0.110633,1.0,0.114421,-0.039866
NumWordsAdded,-0.000729,-0.007726,0.026223,0.114421,1.0,0.025235
NumWordsRemoved,0.03636,-0.037629,-0.036422,-0.039866,0.025235,1.0


In [None]:
from sklearn.model_selection import train_test_split
import statsmodels.formula.api as smf

In [None]:
# checking proportions on 0 & 1 to decide whether to use stratify

print(np.sum(df['Vandal']))
print(np.sum(df['Vandal'] == 0))

# counts are not so different, did not use stratify

1815
2061
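
To make the "not so different" judgment concrete, the two counts printed above can be turned into class shares. This is only arithmetic on the printed output; the variable names are illustrative.

```python
# Counts copied from the output above.
n_vandal = 1815
n_non_vandal = 2061
n_total = n_vandal + n_non_vandal

share_vandal = n_vandal / n_total
share_non_vandal = n_non_vandal / n_total

print(f"Total edits: {n_total}")
print(f"Vandal share: {share_vandal:.3f}")
print(f"Non-vandal share: {share_non_vandal:.3f}")
```

With roughly a 47/53 split, an unstratified random split is unlikely to distort the class balance much, though passing `stratify=df['Vandal']` to `train_test_split` would preserve it exactly.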


In [None]:
# (b) Split the dataset & calculate baseline accuracy

df_train, df_test = train_test_split(df, test_size=0.3, random_state=42)

# Baseline: always predict the majority class (0, i.e. not vandalism)
accuracy_baseline = np.sum(df_test['Vandal'] == 0) / df_test.shape[0]
print("Baseline accuracy on testing set:", accuracy_baseline)

Baseline accuracy on testing set: 0.5442820292347378
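
The baseline above predicts class 0 for every test edit, which is the majority class in the full dataset. A slightly more careful sketch would pick the majority class from the training labels only, so that no test information is used. The labels below are toy values, not the Wikipedia data, and `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)

# Toy labels for illustration only (0 = not vandalism, 1 = vandalism).
toy_train = [0, 0, 0, 1, 1, 0, 1, 0]
toy_test = [0, 1, 0, 0, 1]

label, acc = majority_baseline_accuracy(toy_train, toy_test)
print("Majority class in training data:", label)
print("Baseline accuracy on toy test set:", acc)
```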


In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

In [None]:
# (c) CART model

X = df.drop(columns=['Vandal'])
y = df['Vandal']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train_1, X_val, y_train_1, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)

dt_10 = DecisionTreeClassifier(min_samples_leaf=10, random_state=42)
dt_50 = DecisionTreeClassifier(min_samples_leaf=50, random_state=42)
dt_100 = DecisionTreeClassifier(min_samples_leaf=100, random_state=42)

dt_10.fit(X_train_1, y_train_1)
dt_50.fit(X_train_1, y_train_1)
dt_100.fit(X_train_1, y_train_1)

accuracy_10 = accuracy_score(y_val, dt_10.predict(X_val))
accuracy_50 = accuracy_score(y_val, dt_50.predict(X_val))
accuracy_100 = accuracy_score(y_val, dt_100.predict(X_val))

accuracies = [accuracy_10, accuracy_50, accuracy_100]
min_samples_leaf_values = [10, 50, 100]
optimal_min_samples_leaf = min_samples_leaf_values[accuracies.index(max(accuracies))]

print("Validation Accuracies:")
print("min_samples_leaf=10:", accuracy_10)
print("min_samples_leaf=50:", accuracy_50)
print("min_samples_leaf=100:", accuracy_100)
print("Optimal min_samples_leaf:", optimal_min_samples_leaf)

final_tree = DecisionTreeClassifier(min_samples_leaf=optimal_min_samples_leaf, random_state=42)
final_tree.fit(X_train, y_train)

plt.figure(figsize=(100,50))
# Pass feature names as a plain list; some scikit-learn versions reject a pandas Index here
plot_tree(final_tree, feature_names=list(X.columns), filled=True, rounded=True, class_names=["Non-Vandal", "Vandal"])
plt.title("CART Model to Predict Vandalism")
plt.show()

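
One detail of the selection step above: `accuracies.index(max(accuracies))` returns the first maximum, so a tie is resolved in favour of the smallest `min_samples_leaf`, which is the most complex tree. A minimal sketch of an alternative that prefers the simpler tree on ties is shown below. The accuracies are made-up placeholders rather than the notebook's results, and `pick_min_samples_leaf` is a hypothetical helper.

```python
def pick_min_samples_leaf(val_accuracy_by_leaf):
    """Return the candidate with the best validation accuracy.

    Ties go to the larger min_samples_leaf, which yields a smaller tree.
    """
    return max(
        val_accuracy_by_leaf,
        key=lambda leaf: (val_accuracy_by_leaf[leaf], leaf),
    )

# Placeholder accuracies for illustration; not the notebook's results.
placeholder_scores = {10: 0.74, 50: 0.74, 100: 0.71}
chosen = pick_min_samples_leaf(placeholder_scores)
print("Chosen min_samples_leaf:", chosen)
```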

In [None]:
# (c) continued: variable importances and test-set accuracy

importances = final_tree.feature_importances_
feature_names = X.columns
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
print("Variables used with importance (most significant to least):")
for index, row in importance_df.iterrows():
    print(f"{row['Feature']}: {row['Importance']:.4f}")

test_accuracy = accuracy_score(y_test, final_tree.predict(X_test))
print("Test set accuracy of the final model:", test_accuracy)

Variables used with importance (most significant to least):
LoggedIn: 0.5608
NumWordsAdded: 0.2843
NumWordsRemoved: 0.1093
Minor: 0.0348
HTTP: 0.0108
Test set accuracy of the final model: 0.7497850386930353


(e)

(i) **Do you think the model you built could be useful to Wikipedia to detect vandalism? Why or why not?**

The CART model could be moderately useful for detecting vandalism on Wikipedia. It reaches a test accuracy of about 75%, compared with about 54% for the baseline that labels every edit as non-vandalism. The most informative variable is whether the user was logged in: LoggedIn has the strongest correlation with Vandal (about -0.43) and the highest importance in the tree (about 0.56), which matches the observation that anonymous edits are more often problematic. Other features, such as the presence of HTTP links, are only weakly correlated with vandalism. However, the model has limitations. With only five simple predictors, it may miss forms of vandalism that do not fit these patterns, and roughly one edit in four is still misclassified. Although min_samples_leaf = 10 was selected as the best setting, CART would most likely work best as one component of a larger system, alongside complementary approaches such as natural language processing or anomaly detection, which could catch subtler cases and improve reliability. Combining approaches in this way would better account for the variety of vandalism behaviors on Wikipedia.
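
Accuracy alone does not show how many vandal edits the model catches versus how many legitimate edits it flags. A minimal sketch of computing recall and precision for the vandalism class is given below; the labels are toy values, and `recall_precision` is a hypothetical helper that is not part of the notebook.

```python
def recall_precision(y_true, y_pred, positive=1):
    """Recall and precision for the positive (vandalism) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Toy labels for illustration only (1 = vandalism).
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 0, 0, 1]

recall, precision = recall_precision(toy_true, toy_pred)
print(f"Recall: {recall:.2f}, precision: {precision:.2f}")
```

On the real data, the underlying counts could be obtained with `sklearn.metrics.confusion_matrix(y_test, final_tree.predict(X_test))`.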

(ii) **If you could collect more data about the edits, what variables would you want? Why?**

Additional variables that could enhance the model:

* Edit Duration (since vandalism might involve quick edits)
* User History (like past edits or a reputation score, as new or anonymous users could be more prone to vandalism)
* Time of Day (certain hours might see more vandalism)
* Edit Length Change (large changes may signal malicious edits)
* Reversion History (users with many reverted edits might be higher risk)
* Edit Summary (vandals may skip or use vague terms)
* Section Edited (sections like "Introduction" may attract more vandalism)
* IP Location (certain regions might have higher vandalism rates)
* Edit Overlap (multiple edits on the same content could indicate controversial areas)
* Grammar/Spelling Check (errors can be flags for vandalism)
* Content Context (NLP features such as sentiment or toxicity to catch negative language)
* User Role (admins/bots are less likely to vandalize compared to regular anonymous users)
* Device Type (mobile edits may be quicker and more prone to errors)
* Days Since Last Edit (infrequent editors may be higher risk)
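
Some of the variables above, such as Time of Day and Days Since Last Edit, could be derived from an edit log rather than collected separately. The sketch below assumes a hypothetical log of `(user, timestamp)` pairs; the field names and sample rows are invented for illustration.

```python
from datetime import datetime

def derive_time_features(edit_log):
    """Add hour of day and days since the same user's previous edit.

    edit_log: list of (user, ISO-format timestamp string), in any order.
    Returns a list of dicts sorted by timestamp.
    """
    last_seen = {}
    rows = []
    ordered = sorted(edit_log, key=lambda e: datetime.fromisoformat(e[1]))
    for user, stamp in ordered:
        when = datetime.fromisoformat(stamp)
        previous = last_seen.get(user)
        # None marks the first edit seen for this user.
        days_since = (when - previous).days if previous is not None else None
        rows.append({"user": user, "hour": when.hour,
                     "days_since_last_edit": days_since})
        last_seen[user] = when
    return rows

# Invented sample log, for illustration only.
sample_log = [
    ("alice", "2024-01-10T14:30:00"),
    ("bob", "2024-01-03T02:15:00"),
    ("alice", "2024-01-01T09:00:00"),
]
feature_rows = derive_time_features(sample_log)
for row in feature_rows:
    print(row)
```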

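Before adding variables to the model, it is worth checking for strongly correlated ones. Redundant variables add little predictive value to a tree and make its importance scores harder to interpret, and they would cause multicollinearity problems if a regression model were used instead.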
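
As a rough illustration of such a correlation check, the sketch below scans the pairwise correlations among the existing predictors (values copied from the `df.corr()` output earlier in the notebook) and lists pairs whose absolute correlation exceeds a threshold. The threshold of 0.4 is an arbitrary assumption, and `flag_correlated_pairs` is a hypothetical helper.

```python
# Pairwise correlations among predictors, copied from the df.corr() output.
predictor_corr = {
    ("Minor", "LoggedIn"): 0.445166,
    ("Minor", "HTTP"): -0.084297,
    ("Minor", "NumWordsAdded"): -0.007726,
    ("Minor", "NumWordsRemoved"): -0.037629,
    ("LoggedIn", "HTTP"): -0.110633,
    ("LoggedIn", "NumWordsAdded"): 0.026223,
    ("LoggedIn", "NumWordsRemoved"): -0.036422,
    ("HTTP", "NumWordsAdded"): 0.114421,
    ("HTTP", "NumWordsRemoved"): -0.039866,
    ("NumWordsAdded", "NumWordsRemoved"): 0.025235,
}

def flag_correlated_pairs(corr, threshold):
    """Return (a, b, r) for pairs with |r| above threshold, strongest first."""
    flagged = [(a, b, r) for (a, b), r in corr.items() if abs(r) > threshold]
    return sorted(flagged, key=lambda item: -abs(item[2]))

# 0.4 is an arbitrary cut-off chosen for illustration.
flagged = flag_correlated_pairs(predictor_corr, threshold=0.4)
print(flagged)
```

Only Minor and LoggedIn clear this threshold, and at about 0.45 the relationship is moderate rather than redundant.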

(iii) **The data used in this problem was collected from the page about Language. Do you think this model would easily extend to other pages? Why or why not?**

The model would probably not extend easily to other Wikipedia pages, because editing patterns can vary significantly across topics. For instance, high-profile or controversial pages may attract more sophisticated and more varied vandalism than niche pages do. Editing behaviors that indicate vandalism on the "Language" page, such as the typical number of words added or removed, may not carry the same signal elsewhere. The model could become much more generalizable if it were retrained or adapted on more diverse data covering a wide variety of Wikipedia pages, so that the patterns it learns remain relevant in different contexts.
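
If labeled edits from several pages were available, one way to test this concern would be to report accuracy separately per page rather than pooled. The sketch below assumes hypothetical records of `(page, true_label, predicted_label)`; the page names and labels are invented, and `accuracy_by_page` is a hypothetical helper.

```python
from collections import defaultdict

def accuracy_by_page(records):
    """Fraction of correct predictions for each page."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for page, y_true, y_pred in records:
        totals[page] += 1
        correct[page] += int(y_true == y_pred)
    return {page: correct[page] / totals[page] for page in totals}

# Invented records for illustration only.
toy_records = [
    ("Language", 1, 1), ("Language", 0, 0),
    ("Language", 0, 0), ("Language", 1, 0),
    ("OtherPage", 1, 0), ("OtherPage", 0, 1),
    ("OtherPage", 0, 0), ("OtherPage", 1, 1),
]
per_page = accuracy_by_page(toy_records)
for page, acc in per_page.items():
    print(f"{page}: {acc:.2f}")
```

A large gap between pages in such a report would suggest that the model has learned page-specific patterns and needs retraining on broader data.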