# Copyright False Positive Detection using SGD Classifier

Earlier, I only experimented with the SGD Classifier model in isolation. In this phase, I tried replacing the SVM model in Safaa with an SGD Classifier. To do this, I duplicated all of Safaa's steps that are essential for model training:

1.   Preprocessing
      - ensure_list_of_strings()
      - replace_entities(): uses the same entity_recognizer model as Safaa.
      - perform_text_substitutions()
2.   Vectorization: uses the same vectorizer model as Safaa.

After these steps, instead of the SVM model, I trained an SGD Classifier model, with the aim of introducing incremental learning in Safaa.

3.   Trained SGD Classifier model
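
The incremental-learning idea behind this choice can be sketched with `SGDClassifier.partial_fit`, which updates an existing model batch by batch instead of retraining from scratch. The snippet below is only an illustration on synthetic numeric data (the arrays, batch size, and seed are assumptions, not part of this notebook); in Safaa the inputs would be the vectorized copyright statements.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy, linearly separable data standing in for vectorized statements (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all classes must be declared on the first call

# Feed the data in batches; each call updates the same model in place
for start in range(0, len(X), 50):
    batch = slice(start, start + 50)
    clf.partial_fit(X[batch], y[batch], classes=classes)

print(clf.predict(X[:5]))
```

Each later batch of newly labelled data could be passed to `partial_fit` in the same way, which is what a one-shot `fit` cannot do.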





In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import os
import re
import spacy
from joblib import load, dump
import pkg_resources
import shutil

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report

In [4]:
import pandas as pd

### Load Dataset

In [5]:
df = pd.read_csv("/content/drive/MyDrive/Google Colab/gsoc/datasets/false_positive_detection_dataset.csv")
data = df['copyright']
labels = df['falsePositive']

### Load NER and Vectorizer Models

In [13]:
entity_recognizer = spacy.load("/content/drive/MyDrive/Google Colab/gsoc/models/entity_recognizer")
vectorizer = load("/content/drive/MyDrive/Google Colab/gsoc/models/false_positive_detection_vectorizer.pkl")

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


In [12]:
# List the components of the saved spaCy pipeline
print(os.listdir("/content/drive/MyDrive/Google Colab/gsoc/models/entity_recognizer"))

['config.cfg', 'tokenizer', 'meta.json', 'tok2vec', 'vocab', 'ner']


### Preprocessing

In [14]:
def preprocess_data(data):
    """
    Preprocesses the given data by performing various text cleaning and
    transformation tasks.

    Parameters:
    data (iterable): The data to preprocess.

    Returns:
    data (list): List of preprocessed strings.
    """

    # Ensure the data is a list of strings
    data = ensure_list_of_strings(data)

    # Replace copyright holder entities in the data
    data = replace_entities(data)

    # Perform text substitutions for dates, numbers, symbols, emails, etc.
    data = perform_text_substitutions(data)

    return data

In [15]:
def ensure_list_of_strings(data):
    """
    Ensures the data is a list of strings.

    If the input data is not a list, attempts to convert it to a list.
    Then, ensures each element of the list is a string.

    Parameters:
    data (iterable): The data to be converted to a list of strings.

    Returns:
    list: A list of strings.
    """

    # If data is not a list, convert it to one (works for any iterable,
    # including a pandas Series)
    if not isinstance(data, list):
        data = list(data)
    # Ensure each item in the list is a string
    return [str(item) for item in data]


In [16]:
def replace_entities(data):
    """
    Replaces detected copyright holder entities with ' ENTITY '.

    Uses the entity_recognizer model to identify copyright holder entities,
    which are often name or organization entities, and replaces them with
    the string ' ENTITY '.

    Parameters:
    data (list): A list of strings.

    Returns:
    list: A list of strings with copyright holder entities replaced.
    """

    new_data = []
    for sentence in data:
        # Process the sentence using the entity recognizer
        doc = entity_recognizer(sentence)
        new_sentence = doc.text
        for entity in doc.ents:
            # If the entity is a copyright holder entity, replace it with
            # ' ENTITY '
            if entity.label_ == 'ENT':
                new_sentence = re.sub(re.escape(entity.text), ' ENTITY ',
                                      new_sentence)
        new_data.append(new_sentence)
    return new_data

In [17]:
def perform_text_substitutions(data):
    """
    Performs a series of text substitutions to clean and standardize the
    data.

    This includes:
    - Replacing four-digit numbers (assumed to be years) with ' DATE '.
    - Removing all other numbers.
    - Replacing copyright symbols with ' COPYRIGHTSYMBOL '.
    - Replacing emails with ' EMAIL '.
    - Removing any special characters not already replaced or removed.
    - Converting text to lowercase.
    - Stripping leading and trailing whitespace from the text.

    Parameters:
    data (list): A list of strings.

    Returns:
    list: A list of cleaned and standardized strings.
    """

    # Define the substitution patterns and their replacements
    subs = [
        (r'\d{4}', ' DATE '),
        (r'\d+', ' '),
        (r'©', ' COPYRIGHTSYMBOL '),
        (r'\(c\)', ' COPYRIGHTSYMBOL '),
        (r'\(C\)', ' COPYRIGHTSYMBOL '),
        (
        r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])""",
        ' EMAIL '),
        (r'[^a-zA-Z0-9]', ' ')
    ]
    # Perform the substitutions for each pattern in the list
    for pattern, replacement in subs:
        data = [re.sub(pattern, replacement, sentence) for sentence in data]
    # Convert text to lowercase and strip leading/trailing whitespace
    return [sentence.lower().strip() for sentence in data]
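
To make the effect of the substitution order concrete, here is a small walk-through on a made-up sentence. It uses a simplified copy of the pattern list (the long email pattern is left out for brevity, and the `substitute` helper name is hypothetical), so treat it as a sketch of the behaviour rather than the exact function above.

```python
import re

# Simplified copy of the substitution list (the long email pattern is left out)
subs = [
    (r'\d{4}', ' DATE '),
    (r'\d+', ' '),
    (r'©', ' COPYRIGHTSYMBOL '),
    (r'\(c\)', ' COPYRIGHTSYMBOL '),
    (r'\(C\)', ' COPYRIGHTSYMBOL '),
    (r'[^a-zA-Z0-9]', ' '),
]

def substitute(sentence):
    # Apply the patterns in list order, then lowercase and trim both ends
    for pattern, replacement in subs:
        sentence = re.sub(pattern, replacement, sentence)
    return sentence.lower().strip()

out = substitute("Copyright (c) 2019 Foo Inc.")
print(out.split())  # ['copyright', 'copyrightsymbol', 'date', 'foo', 'inc']
print("  " in out)  # True: runs of spaces inside the string are kept
```

Two details follow from this: only the ends of each string are trimmed, so the preprocessed text still contains runs of spaces, and because `\d{4}` runs before `\d+`, the first four digits of a longer number also become `DATE`.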

In [18]:
# Preprocess the data before training
preprocessed_data = preprocess_data(data)

### Train SGD Classifier Model

In [None]:
# def train_false_positive_detector_model(self, data, labels):
#     """
#     Trains the false positive detector model from scratch.

#     Parameters:
#     data (iterable): The data to train the model on.
#     labels (iterable): The labels for the training data.
#     """

#     # Preprocess the data before training
#     preprocessed_data = self.preprocess_data(data)
#     # Fit the vectorizer to the preprocessed data
#     vectorized_data = self.vectorizer.fit_transform(preprocessed_data)
#     # Train the false positive detector model
#     self.false_positive_detector.fit(vectorized_data, labels)

In [19]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(preprocessed_data, labels, test_size=0.2, random_state=42)

In [20]:
# Refit the loaded vectorizer on the training split only, then reuse it to transform the test split
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
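
The difference between `fit_transform` and `transform` can be illustrated with a tiny example. The type of the loaded vectorizer is not shown in this notebook, so the sketch assumes a `TfidfVectorizer`, and the sample strings are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["copyright date entity", "copyright agent tests testdata"]
test_texts = ["copyright date entity inc"]  # "inc" never appears in training

vec = TfidfVectorizer()                 # assumption: stands in for the loaded vectorizer
X_tr = vec.fit_transform(train_texts)   # learns the vocabulary and IDF weights
X_te = vec.transform(test_texts)        # reuses them; unseen tokens are dropped

print(X_tr.shape, X_te.shape)           # same number of columns in both
print("inc" in vec.vocabulary_)         # False
```

Note that calling `fit_transform` on an already fitted vectorizer relearns its vocabulary from the data passed in, so whatever the loaded vectorizer had learned before is replaced by the training split here.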

In [21]:
false_positive_detector = SGDClassifier()
false_positive_detector.fit(X_train_vectorized, y_train)

### Testing and experimenting

In [35]:
# Predicting on the test data
y_pred = false_positive_detector.predict(X_test_vectorized)


In [44]:
report_x = classification_report(y_test, y_pred)
print(report_x)

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      3261
           1       0.97      0.98      0.97      1093

    accuracy                           0.99      4354
   macro avg       0.98      0.98      0.98      4354
weighted avg       0.99      0.99      0.99      4354



In [40]:
report = classification_report(y_test, y_pred, output_dict=True)
print(report)

{'0': {'precision': 0.9935265104808878, 'recall': 0.9883471327813554, 'f1-score': 0.9909300538047656, 'support': 3261}, '1': {'precision': 0.9657657657657658, 'recall': 0.9807868252516011, 'f1-score': 0.9732183386291421, 'support': 1093}, 'accuracy': 0.9864492420762517, 'macro avg': {'precision': 0.9796461381233268, 'recall': 0.9845669790164783, 'f1-score': 0.9820741962169539, 'support': 4354}, 'weighted avg': {'precision': 0.9865576326734399, 'recall': 0.9864492420762517, 'f1-score': 0.9864838193796494, 'support': 4354}}


In [28]:
X_test_df = pd.DataFrame(X_test)

In [64]:
X_test

['copyright   ecc  return response',
 'copyright  copyrightsymbol   date   entity   inc',
 'copyright by the  entity   england',
 'copyrightsymbol  you may not rent  lease  lend or encumber software   d  unless enforcement is prohibited by applicable law  you may not decompile  or reverse engineer software   e  the terms and conditions of this  entity  apply to any  entity   provided to you at sun s discretion  that replace an',
 'copyright  copyrightsymbol   date   date   entity   entity',
 'copyright  copyrightsymbol   date   entity',
 'copyright copyright  entity',
 'copyright or rights arising from limitations or exceptions that are provided for in connection with the copyright protection under copyright law or other applicable laws',
 'copyright agent tests testdata testdata  license  nolicenseconcluded comment  scanners found  bsd   clause and gpl',
 'copyright  copyrightsymbol   date   entity   entity   email    date',
 'copyright  copyrightsymbol   date    entity    entity    e

In [65]:
X_test_df

      0
0     copyright ecc return response
1     copyright copyrightsymbol date entity inc
2     copyright by the entity england
3     copyrightsymbol you may not rent lease lend...
4     copyright copyrightsymbol date date ent...
...   ...
4349  copyright copyrightsymbol date date by ...
4350  copyright agent uses regular expressions to fi...
4351  copyright copyrightsymbol date entity e...
4352  copyright copyrightsymbol date entity a...


In [66]:
y_test_series = pd.Series(y_test).reset_index(drop=True)
y_pred_series = pd.Series(y_pred).reset_index(drop=True)

In [67]:
misclassified = X_test_df.loc[y_test_series != y_pred_series]
len(misclassified)

59
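
The `reset_index(drop=True)` calls matter because `y_test` keeps the shuffled row labels from the original DataFrame, while `X_test_df` and the predictions are numbered from 0. A minimal sketch with made-up values (all names and data here are illustrative) shows the alignment:

```python
import pandas as pd

texts = pd.DataFrame(["a", "b", "c"])            # positional index 0, 1, 2
y_true = pd.Series([0, 1, 1], index=[17, 4, 9])  # shuffled index, as after a split
y_hat = [0, 0, 1]                                # predictions come back without an index

# Put both label series on the same 0..n-1 index before comparing
y_true_pos = y_true.reset_index(drop=True)
y_hat_pos = pd.Series(y_hat)

mask = y_true_pos != y_hat_pos                   # True where the prediction is wrong
wrong = texts.loc[mask]
print(wrong.index.tolist())                      # [1]
```

Without the reset, the comparison would try to match rows by label rather than by position.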

Misclassified samples whose true label is '0' (support minus correctly recalled samples)

In [42]:
report['0']['support'] - round(report['0']['recall'] * report['0']['support'])

38

Misclassified samples whose true label is '1' (support minus correctly recalled samples)

In [43]:
report['1']['support'] - round(report['1']['recall'] * report['1']['support'])

21
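
As a sanity check, the two per-class counts can be cross-checked against the other figures in the report. The sketch below uses only values copied from the printed report; the variable names are illustrative.

```python
# Values copied from the printed classification report
support = {0: 3261, 1: 1093}
recall = {0: 0.9883471327813554, 1: 0.9807868252516011}

# Correctly classified samples per true class (recall * support)
correct = {c: round(recall[c] * support[c]) for c in support}
missed = {c: support[c] - correct[c] for c in support}
print(missed)  # {0: 38, 1: 21}

# Class-0 samples that were missed were predicted as class 1, so they are
# the false positives in class 1's precision
precision_1 = correct[1] / (correct[1] + missed[0])
accuracy = (correct[0] + correct[1]) / sum(support.values())
print(round(precision_1, 4), round(accuracy, 4))  # 0.9658 0.9864
```

The two counts add up to the 59 misclassified rows found earlier, and the derived precision and accuracy agree with the reported values.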

In [45]:
classes = [0,1]

[(1, 0, 1)]

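In [None]:
# Collect the misclassified samples, optionally with their row indices
return_index = True
# Row positions of the misclassified samples (X_test_df uses a 0..n-1 index)
misclassified_rows = misclassified.index
if return_index:
    # (predicted label, row index, preprocessed text)
    results = [(y_pred_series[i], i, X_test_df.iloc[i, 0]) for i in misclassified_rows]
else:
    # (predicted label, preprocessed text)
    results = [(y_pred_series[i], X_test_df.iloc[i, 0]) for i in misclassified_rows]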