This notebook classifies copyright notices as genuine or false positive using scikit-learn's `SGDClassifier`.
The dataset is available [*here*](https://github.com/ShreyaGautamm/gsoc_24/blob/11eb67190007842a872cf08e34cec0940ed1e0ae/files/datasets/fossology-master.csv).

In [1]:
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

In [2]:
from sklearn.model_selection import train_test_split

In [3]:
df1 = pd.read_csv('/content/fossology-master.csv')
df1

Unnamed: 0,"Copyright Law. Subject to the following terms, Fedora Project grants to\nthe user (\""User\"") a license to this collective work pursuant to the GNU\nGeneral Public License version 2. By downloading, installing or using\nthe Software, User agrees to the terms of this agreement.\n\n1. THE SOFTWARE.",scan_code_copyrights,copyright,falsePositive,important
0,fossology-master/.dockerignore,-,copyright/agent_tests/Unit/test_copyright src/...,1,
1,fossology-master/.dockerignore,-,copyright/VERSION-copyright src/spdx2/agent_te...,1,
2,fossology-master/.dockerignore,-,copyright_list src/cli/fo_folder src/cli/fo_no...,1,1.0
3,fossology-master/.dockerignore,-,copyright/VERSION-keyword,1,
4,fossology-master/.dockerignore,-,copyright/VERSION-ecc,1,
...,...,...,...,...,...
43738,fossology-master/utils/runBuild_v2.0.php,© 2011-2012 Hewlett-Packard Development Compan...,© 2011-2012 Hewlett-Packard Development Compan...,0,
43739,fossology-master/utils/schemaspy.run,© Fossology contributors,© Fossology contributors,0,
43740,fossology-master/utils/template.pod,© Fossology contributors,© Fossology contributors,0,
43741,fossology-master/utils/unique.php,"© 2014 Hewlett-Packard Development Company, L.P.","© 2014 Hewlett-Packard Development Company, L.P.",0,


In [4]:
X = df1["copyright"]
y = df1["falsePositive"]

In [5]:
y.value_counts()

falsePositive
1    20016
0    15526
3     7450
6      399
5      221
4      131
Name: count, dtype: int64

In [6]:
# Fold label 3 into class 0.
df1['falsePositive'] = df1['falsePositive'].replace(3, 0)

In [7]:
# Fold labels 4, 5, and 6 into class 1, leaving a binary target.
df1['falsePositive'] = df1['falsePositive'].replace({4: 1, 5: 1, 6: 1})

In [8]:
# Re-read the target so that y reflects the remapped labels.
y = df1["falsePositive"]

In [9]:
y.value_counts()

falsePositive
0    22976
1    20767
Name: count, dtype: int64
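
The relabeling in cells 6 and 7 can also be expressed as a single dictionary lookup, which keeps the whole mapping in one place. The sketch below is only illustrative: the toy `labels` series and the `collapse_labels` helper are hypothetical names, and the mapping is inferred from the `replace` calls above rather than from any documentation of the label values.

```python
import pandas as pd

# Hypothetical toy labels standing in for df1["falsePositive"].
labels = pd.Series([1, 0, 3, 6, 5, 4, 0], name="falsePositive")

# Mapping inferred from cells 6 and 7: 3 -> 0, and 4, 5, 6 -> 1.
LABEL_MAP = {3: 0, 4: 1, 5: 1, 6: 1}


def collapse_labels(s: pd.Series) -> pd.Series:
    """Return a copy of s with the extra label values folded into 0 or 1."""
    return s.replace(LABEL_MAP)


binary = collapse_labels(labels)
print(binary.value_counts().sort_index())
```

Values that are not keys in the dictionary (here 0 and 1) pass through unchanged.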

In [10]:
# Drop duplicate notice texts, then keep only the labels of the surviving rows.
X = X.drop_duplicates()
y = y[X.index]

In [11]:
y.value_counts()

falsePositive
0    14304
1     5163
Name: count, dtype: int64
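
Deduplicating `X` and then re-indexing `y` works, but the same step can be done on the frame itself so that texts and labels stay aligned by construction. This is a minimal sketch on hypothetical toy data (the `toy` frame and its rows are made up); it also counts texts that appear with more than one label, which is an extra check and not something the notebook above does.

```python
import pandas as pd

# Hypothetical toy frame standing in for df1.
toy = pd.DataFrame({
    "copyright": ["(c) A", "(c) A", "(c) B", "not a notice", "(c) B"],
    "falsePositive": [0, 0, 0, 1, 0],
})

# Count texts that occur with conflicting labels before dropping duplicates.
labels_per_text = toy.groupby("copyright")["falsePositive"].nunique()
n_conflicts = int((labels_per_text > 1).sum())

# Keep the first occurrence of each text; the label column follows along.
deduped = toy.drop_duplicates(subset="copyright", keep="first")
X_toy = deduped["copyright"]
y_toy = deduped["falsePositive"]

print(len(deduped), "unique notices,", n_conflicts, "with conflicting labels")
```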

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
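
After deduplication the classes are imbalanced (14304 vs. 5163), so one option is to pass `stratify` to `train_test_split` so that both splits keep roughly the same class ratio. The sketch below uses hypothetical toy data and is a possible variation, not what the cell above does.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 15 samples of class 0 and 5 of class 1.
texts = [f"notice {i}" for i in range(20)]
labels = [0] * 15 + [1] * 5

# stratify=labels preserves the 3:1 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(sum(y_te), "of", len(y_te), "test samples are class 1")
```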

In [13]:
# Fit the TF-IDF vocabulary on the training split only, then reuse it for the test split.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

## SGDClassifier


In [14]:
sgd = SGDClassifier()
sgd.fit(X_train_vec, y_train)
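
The vectorizer and classifier can also be bundled into one scikit-learn pipeline, so raw strings go in and predictions come out without handling the intermediate matrices. The sketch below assumes a handful of made-up training strings and a fixed `random_state`; neither comes from the notebook, and the toy model says nothing about performance on the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data: 0 = genuine notice, 1 = false positive.
texts = [
    "Copyright (c) 2014 Example Corp",
    "(c) 2011 Some Author",
    "Copyright 2001 Another Company",
    "copyright under the same terms as the license",
    "copyright table column definitions",
    "see the copyright file for details",
]
labels = [0, 0, 0, 1, 1, 1]

# TF-IDF features feed directly into the linear classifier.
pipe = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=42))
pipe.fit(texts, labels)

preds = pipe.predict(["Copyright (c) 2020 Example Corp"])
print(preds)
```

Because the pipeline refits the vectorizer inside `fit`, it can be passed to cross-validation utilities without leaking test vocabulary into training.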

In [15]:
y_pred = sgd.predict(X_test_vec)

In [16]:
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      2878
           1       0.96      0.98      0.97      1016

    accuracy                           0.98      3894
   macro avg       0.98      0.98      0.98      3894
weighted avg       0.98      0.98      0.98      3894
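
A confusion matrix shows the same errors as raw counts, which makes it easier to see which direction the misclassifications go. The sketch below uses made-up label vectors (`y_true_toy`, `y_pred_toy`), not the predictions from this notebook.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels.
y_true_toy = [0, 0, 0, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])
print(cm)
```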



In [17]:
# Select the test notices whose predicted label differs from the true label.
misclassified = X_test.loc[y_test != y_pred]
len(misclassified)

66

In [18]:
print(misclassified)

29680    copyright under the same terms as Perl</s>, th...
2939         Copyright (c) 199\tAdobe Multiple Master font
36520    copyright-software-19980720">previous version<...
16332    (C) CHARGE ANY FEE IN CONNECTION WITH THE SOFT...
35107    copyrighted software distributed under the ter...
                               ...                        
34018    Copyright Treaty of 1996, the WIPO Performance...
9306     Copyright Notice<a class="top" href="#releaseH...
32280    copyright (copyright_pk, agent_fk, pfile_fk, c...
16638                            (c) TCK Use Restrictions.
16670    Copyright (c) [YEAR] W3C® (MIT, ERCIM, Keio, B...
Name: copyright, Length: 66, dtype: object
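
The printed series truncates the text and omits the labels. One way to review the errors is to put text, true label, and predicted label side by side and print without column truncation. The sketch below uses hypothetical toy stand-ins (`X_test_toy`, `y_test_toy`, `y_pred_toy`) for the notebook's variables.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_test, y_test (pandas Series) and y_pred (NumPy array).
idx = [10, 42, 7]
X_test_toy = pd.Series(
    ["(c) A", "see copyright file", "Copyright 2001 B"], index=idx, name="copyright"
)
y_test_toy = pd.Series([0, 1, 0], index=idx, name="falsePositive")
y_pred_toy = np.array([0, 0, 1])

# Give the predictions the same index as the test set so the columns line up.
errors = pd.DataFrame({
    "text": X_test_toy,
    "true": y_test_toy,
    "pred": pd.Series(y_pred_toy, index=y_test_toy.index),
})
errors = errors[errors["true"] != errors["pred"]]

# Show the full text of each misclassified notice.
with pd.option_context("display.max_colwidth", None, "display.max_rows", None):
    print(errors)
```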
