<a href="https://colab.research.google.com/github/L3peha/internshipVK/blob/main/Model_DetectingSpam.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
'''
If you are using Google Colab, uncomment the following lines to install fastText and download the datasets
'''

#!pip install fasttext
#!pip install scikit-plot

#!wget https://raw.githubusercontent.com/L3peha/internshipVK/main/raw/test_spam.csv
#!wget https://raw.githubusercontent.com/L3peha/internshipVK/main/raw/train_spam.csv

Import all the required libraries and functions

In [145]:
import fasttext
import pandas as pd
import numpy as np
import csv
import re
import scikitplot
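import sklearn
import sklearn.metrics  #imported explicitly because roc_auc_score is called as sklearn.metrics.roc_auc_score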
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

This function prints the ROC AUC score for a model's predicted probabilities

In [146]:
def roc_score_calc(X):
  #X holds one row of two class probabilities per test text (output of predict_proba)
  #y_test is taken from the notebook's global scope
  #the smaller of the two probabilities in each row is used as the score
  X_data = [min(i[0], i[1]) for i in X]
  score = sklearn.metrics.roc_auc_score(y_test, X_data)
  #if the labels are swapped the score falls below 0.5, so 1 - score is printed instead
  print(score if score > 0.5 else 1 - score)
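The label-swap check in the function above relies on the fact that reversing the order of the scores turns an AUC of `a` into `1 - a`. The pure-Python sketch below is an illustration only, not the notebook's method: it computes ROC AUC as the fraction of (spam, ham) pairs ranked correctly. The function name `roc_auc`, the toy labels, and the toy probabilities are all assumptions. It also compares three scores: the probability of one fixed class (the conventional input to ROC AUC), the probability of the other class, and the smaller of the two probabilities per row.

```python
def roc_auc(y_true, scores, positive):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of examples,
    which is fine for a small illustration.
    """
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data (hypothetical): labels and made-up P(spam) values.
y_toy = ["ham", "spam", "ham", "spam"]
p_spam = [0.1, 0.8, 0.4, 0.35]

# Score = probability of the positive class.
auc_spam = roc_auc(y_toy, p_spam, positive="spam")  # 0.75

# Score = probability of the other class: every pairwise comparison flips.
auc_swapped = roc_auc(y_toy, [1 - p for p in p_spam], positive="spam")  # 0.25

# Score = smaller of the two class probabilities in each row.
auc_min = roc_auc(y_toy, [min(p, 1 - p) for p in p_spam], positive="spam")  # 0.5
```

On this toy data the first two values sum to 1, which is the identity the swap check depends on. The row-wise minimum gives a different value because it measures how uncertain the model is about a text rather than how likely the text is to be spam.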

In [147]:
#preparing data for models
dataset = pd.read_csv('train_spam.csv', delimiter=',', header=None).values
data = dataset[1:, 1]
target = dataset[1:, 0]

#splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.25)

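#creating a file for the fastText library, which expects one "__label__<label> <text>" example per line
with open('TrainX.txt', "w") as f_out:
  for i in range(len(X_train)):
    #a newline inside a message would split it across several lines, so it is replaced with a space
    f_out.write("__label__" + y_train[i] + " " + re.sub('\n', ' ', X_train[i]) + '\n')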

In [148]:
#training FastText model
model = fasttext.train_supervised(input = "TrainX.txt", lr=0.1)

#getting the predicted label and its probability via .predict for every line of test data
labels, probabilities = model.predict([re.sub('\n', ' ', i) for i in X_test])


In [149]:
print('ROC AUC score for FastText:')
print(1-sklearn.metrics.roc_auc_score(y_test, probabilities))

ROC AUC score for FastText:
0.6951716596786224
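With its default `k=1`, `model.predict` returns only the top label for each text together with the probability of that label, so each value in `probabilities` refers to whichever class was predicted for that text. If a score for one fixed class is wanted, one option is to request both labels with `k=2` and pick out the probability of the class of interest. The helper below is a sketch under assumptions: the name `positive_class_scores` is hypothetical, and the label `__label__spam` is a guess, since the actual label values in `train_spam.csv` are not shown here. It works on plain Python sequences shaped like the `(labels, probabilities)` pair returned by `model.predict(texts, k=2)`, so it can be tried without a trained model.

```python
def positive_class_scores(labels, probs, positive="__label__spam"):
    """Return one score per text: the probability assigned to `positive`.

    `labels` and `probs` are parallel per-text sequences, shaped like the
    result of `model.predict(texts, k=2)`. A text whose predictions do not
    include `positive` gets a score of 0.0.
    """
    scores = []
    for text_labels, text_probs in zip(labels, probs):
        by_label = dict(zip(text_labels, text_probs))
        scores.append(float(by_label.get(positive, 0.0)))
    return scores


# Hand-written stand-in for the output of model.predict(texts, k=2).
toy_labels = [("__label__ham", "__label__spam"), ("__label__spam", "__label__ham")]
toy_probs = [(0.9, 0.1), (0.7, 0.3)]

spam_scores = positive_class_scores(toy_labels, toy_probs)  # [0.1, 0.7]
```

The resulting list could then be passed to `sklearn.metrics.roc_auc_score` in place of the top-label probabilities.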


In [150]:
#transforming train data via CountVectorizer()
cv = CountVectorizer()
X_train_cVec = cv.fit_transform(X_train)

#creating and fitting svm model with train data
model1 = svm.SVC(probability=True)
model1.fit(X_train_cVec,y_train)

#transforming test data for roc auc score
X_test_cVec = cv.transform(X_test)

In [151]:
print('ROC AUC score for CountVectorizer() with svm:')
roc_score_calc(model1.predict_proba(X_test_cVec))

ROC AUC score for CountVectorizer() with svm:
0.5703030566357454


The SVM model takes about 4 minutes to train, which is much longer than the other models.

In [152]:
#creating LR model and fitting with vectorized train data
lr_basic = LogisticRegression(solver='saga', tol=1e-3, max_iter=500)
lr_basic.fit(X_train_cVec, y_train)

In [153]:
print('ROC AUC score for CountVectorizer() with LogisticRegression:')
roc_score_calc(lr_basic.predict_proba(X_test_cVec))

ROC AUC score for CountVectorizer() with LogisticRegression:
0.5501166605113945


In [154]:
#transforming train data via TfidfVectorizer()
tfid_vec = TfidfVectorizer()
X_train_tfid = tfid_vec.fit_transform(X_train)

#creating LR model and fitting with vectorized train data
clf = LogisticRegression(solver='saga', tol=1e-3, max_iter=500)
clf.fit(X_train_tfid, y_train)

#transforming test data for roc auc score
X_test_tfid = tfid_vec.transform(X_test)

In [155]:
print('ROC AUC score for TfidfVectorizer() with LogisticRegression:')
roc_score_calc(clf.predict_proba(X_test_tfid))

ROC AUC score for TfidfVectorizer() with LogisticRegression:
0.741270737836096


As we can see, the model with `TfidfVectorizer()` and LogisticRegression gives the best ROC AUC score, `0.741270737836096`, so I will use this model to create the final file.
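For intuition about how TF-IDF features differ from the raw counts produced by `CountVectorizer()`, here is a minimal pure-Python sketch of the weighting. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row). The function name `tfidf_rows` and the toy messages are hypothetical, and tokenisation is simplified to lowercasing and splitting on whitespace, which is not the library's tokenizer.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


toy_docs = ["free prize call now", "call me now", "see you now"]
rows = tfidf_rows(toy_docs)
first = rows[0]  # weights for "free prize call now"
```

Every word in the first message occurs once, so raw counts would weight them equally. With TF-IDF, `free` (found in one message) gets a larger weight than `call` (two messages), which in turn outweighs `now` (all three messages).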

In [156]:
#using all data from train_spam.csv to train the final model
X_train_fin = tfid_vec.fit_transform(data)

finMod = LogisticRegression()
finMod.fit(X_train_fin, target)

#getting data from test_spam.csv
dataset_test = pd.read_csv('test_spam.csv', delimiter=',', header=None).values
data_test = dataset_test[1:, 0]

#predicting labels for test_data
answ = finMod.predict(tfid_vec.transform(data_test))

#creating and writing score and text into .csv file
with open('Answer.csv', 'w', newline='') as csvf:
  names = ['score', 'text']
  writer = csv.DictWriter(csvf, fieldnames = names)

  writer.writeheader()
  for i in range(len(answ)):
    writer.writerow({'score':answ[i],'text':data_test[i]})
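Messages can contain commas, quotes, or newlines, so it may be worth checking that a file written this way reads back unchanged. The sketch below writes two made-up rows to an in-memory buffer instead of `Answer.csv` and reads them back with `csv.DictReader`. The labels and texts are hypothetical.

```python
import csv
import io

# Hypothetical predictions and messages; the second text contains a comma,
# double quotes, and a newline to exercise the CSV quoting rules.
toy_scores = ["ham", "spam"]
toy_texts = ["see you at 5", 'WIN "cash", now!\nreply YES']

# newline="" mirrors the open(..., newline='') call used for the real file.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["score", "text"])
writer.writeheader()
for score, text in zip(toy_scores, toy_texts):
    writer.writerow({"score": score, "text": text})

# Read the CSV back and compare it with what was written.
buffer.seek(0)
rows_back = list(csv.DictReader(buffer))
round_trip_ok = [(r["score"], r["text"]) for r in rows_back] == list(zip(toy_scores, toy_texts))
```

The same check could be run against `Answer.csv` by opening it with `newline=''` and comparing the rows with `answ` and `data_test`.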