**Skeptic vs paranormal subreddits**

**Task**: Classify a Reddit post as coming either from the Skeptic subreddit or from one of the "paranormal" subreddits (Paranormal, UFOs, TheTruthIsHere, Ghosts, Glitch-in-the-Matrix, conspiracytheories).

The solution uses a Count Vectorizer for text features and Logistic Regression as the classifier.

*Adam Mickiewicz University*

*Faculty of Mathematics and Computer Science*

*Subject: Intelligent information systems*


In [None]:
!git clone git://gonito.net/paranormal-or-skeptic 


# Loading Data


In [42]:
!xzcat /content/paranormal-or-skeptic/train/in.tsv.xz | wc -l

289579


In [18]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from scipy.sparse import hstack
import csv
import datetime

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

In [20]:
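def load_set(path, isTest):
  # Each input row is tab-separated: the post text and a Unix timestamp.
  dataset = pd.read_csv(path+"/in.tsv.xz", delimiter="\t", header=None, names=["text","date"], quoting=csv.QUOTE_NONE)
  # Convert the timestamps to datetimes (in the machine's local time zone).
  dataset["date"] = pd.to_datetime(dataset["date"].apply(lambda x: datetime.datetime.fromtimestamp(x).isoformat()))
  if not isTest:
    # Train and dev sets come with expected labels; the test set does not.
    expected = pd.read_csv(path+"/expected.tsv", header=None, names=["class"], dtype="category")
    return dataset, expected
  return dataset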
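
The timestamp conversion in `load_set` goes through Python `datetime` objects and ISO strings row by row. A minimal sketch of a vectorized alternative follows, assuming the `date` column holds Unix epoch seconds (inferred from the use of `fromtimestamp`). The sample values are made up for illustration. Note that `pd.to_datetime(..., unit="s")` interprets the values as UTC, whereas `fromtimestamp` uses the machine's local time zone, so the two agree only when the machine runs in UTC.

```python
import pandas as pd

# Hypothetical epoch-second timestamps, for illustration only.
sample = pd.DataFrame({"date": [0, 86400, 1301515931]})

# Vectorized conversion: values are read as seconds since 1970-01-01 UTC.
sample["date"] = pd.to_datetime(sample["date"], unit="s")

print(sample["date"].dt.year.tolist())
```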

**Load all sets**

In [21]:
train_set, expected_train = load_set("/content/paranormal-or-skeptic/train", False)
dev_set, expected_dev = load_set("/content/paranormal-or-skeptic/dev-0", False)
test_set = load_set("/content/paranormal-or-skeptic/test-A", True)

# Prepare data

In [22]:
def prepare_data(data):
  data["day"] = data["date"].dt.day
  data["month"] = data["date"].dt.month
  data["year"] = data["date"].dt.year
  return data

In [23]:
train_set = prepare_data(train_set)

In [24]:
train_set.sample(5)

index,text,date,day,month,year
254825,"A while ago, I wrote a program that takes rand...",2011-03-30 20:12:11,30,3,2011
266656,I've had this done to me and I didn't know he ...,2012-01-30 14:46:19,30,1,2012
283655,I've watched a lot of his comedy and considere...,2012-05-18 02:01:25,18,5,2012
140212,[Will Storr v The Supernatural](http://www.ama...,2011-12-27 04:57:11,27,12,2011
184800,I do not see anything in this video\n\nnothing,2012-10-24 15:26:10,24,10,2012


# Train

In [25]:
vectorize = CountVectorizer(stop_words='english',ngram_range=(1,3),strip_accents='ascii')
vectorized = vectorize.fit_transform(train_set["text"])
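
To make the vectorizer settings concrete, here is a small sketch of the features that the same `CountVectorizer` configuration produces for a single made-up sentence. The sentence and the `toy_*` names are illustrative assumptions, not part of the dataset. English stop words are removed before n-grams are built, so an n-gram can span a position where a stop word used to be.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example text, not taken from the dataset.
toy_docs = ["the ghost in the haunted house"]

toy_vectorizer = CountVectorizer(
    stop_words="english", ngram_range=(1, 3), strip_accents="ascii"
)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each extracted n-gram to its column index.
features = sorted(toy_vectorizer.vocabulary_)
print(features)
```

With unigrams, bigrams, and trigrams all enabled, the vocabulary grows quickly on a corpus of this size, which is why the result is kept as a sparse matrix.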

In [26]:
X = vectorized
y = expected_train["class"]
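
Note that `X` contains only the text features: the `day`, `month`, and `year` columns created by `prepare_data` and the imported `hstack` are not used. As a sketch only, assuming one wanted to append those columns to the sparse text matrix, it could look like the following. The toy frame and all `toy_*` names are hypothetical, and whether such features help on this task is not established by the notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical miniature data frame with the same column names as train_set.
toy = pd.DataFrame({
    "text": ["ghost in the attic", "no evidence for this claim"],
    "day": [30, 18],
    "month": [3, 5],
    "year": [2011, 2012],
})

toy_text_features = CountVectorizer().fit_transform(toy["text"])
# Wrap the dense date columns in a sparse matrix so hstack keeps everything sparse.
toy_date_features = csr_matrix(toy[["day", "month", "year"]].to_numpy())

# Column-wise concatenation: text columns first, then the three date columns.
toy_X = hstack([toy_text_features, toy_date_features]).tocsr()
print(toy_X.shape)
```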

In [27]:
model = LogisticRegression(max_iter=500)
model.fit(X,y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
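
`Pipeline` is imported earlier but not used; the vectorizer and the model are fitted as two separate objects. A minimal sketch of how the same two steps could be chained into one estimator is shown below, on made-up texts with hypothetical label names (the dataset's actual label values are not shown in this notebook).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training texts and hypothetical labels, for illustration only.
toy_texts = [
    "ghost sighting in the haunted attic",
    "ufo lights hovering over the field",
    "the study was peer reviewed and replicated",
    "no evidence supports this claim",
]
toy_labels = ["paranormal", "paranormal", "skeptic", "skeptic"]

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english", ngram_range=(1, 3), strip_accents="ascii")),
    ("classifier", LogisticRegression(max_iter=500)),
])
toy_pipeline.fit(toy_texts, toy_labels)

# predict_proba vectorizes the raw text and applies the classifier in one call.
toy_probabilities = toy_pipeline.predict_proba(["a ghost in the attic"])
print(toy_pipeline.classes_, toy_probabilities)
```

The practical benefit is that prediction code no longer has to remember to call `transform` with the same fitted vectorizer.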

# Predict and evaluate

In [28]:
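def predict_data(data):
  # Adds day/month/year columns in place; the model itself only uses the text.
  prepare_data(data)
  vectorized = vectorize.transform(data["text"])
  # Probability of the second class in model.classes_.
  predicted = model.predict_proba(vectorized)[:,1]
  # Clip to [0.05, 0.95] so that no prediction is fully confident.
  predicted[predicted < 0.05] = 0.05
  predicted[predicted > 0.95] = 0.95
  return predicted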
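
The two threshold assignments in `predict_data` are equivalent to a single `np.clip` call. The sketch below shows this on made-up probabilities. One plausible motivation for clipping, which the notebook does not state, is to bound the penalty for a confidently wrong prediction under a likelihood-based metric; the last line computes that bound for a natural-log loss as an illustration.

```python
import numpy as np

# Made-up probabilities, for illustration only.
raw_probabilities = np.array([0.001, 0.05, 0.4, 0.95, 0.999])

# Same effect as the two masked assignments in predict_data.
clipped = np.clip(raw_probabilities, 0.05, 0.95)

# Largest per-example log-loss once probabilities are kept inside [0.05, 0.95].
worst_case_penalty = -np.log(0.05)
print(clipped, worst_case_penalty)
```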

In [29]:
dev_predicted = predict_data(dev_set)

In [30]:
dev_predicted

array([0.05      , 0.75847969, 0.86484399, ..., 0.0650311 , 0.95      ,
       0.37791457])

In [31]:
test_predicted = predict_data(test_set)

**Save to file**


In [35]:
np.savetxt('/content/paranormal-or-skeptic/test-A/out.tsv', test_predicted, '%f')
np.savetxt('/content/paranormal-or-skeptic/dev-0/out.tsv', dev_predicted, '%f')

**Check geval output**

In [36]:
!wget https://gonito.net/get/bin/geval
!chmod u+x geval

--2021-02-06 12:19:02--  https://gonito.net/get/bin/geval
Resolving gonito.net (gonito.net)... 178.216.200.70
Connecting to gonito.net (gonito.net)|178.216.200.70|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12638056 (12M) [application/octet-stream]
Saving to: ‘geval’


2021-02-06 12:19:04 (9.17 MB/s) - ‘geval’ saved [12638056/12638056]



In [40]:
!./geval -t "/content/paranormal-or-skeptic/dev-0" --metric Accuracy

0.8150606980273141


In [41]:
!./geval -t "/content/paranormal-or-skeptic/dev-0" --metric F2.0

0.689655172413793
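
For reference, similar metrics can be computed directly in Python. The sketch below assumes that probabilities are turned into hard labels with a 0.5 threshold and that `F2.0` denotes the F-beta score with beta = 2; both are assumptions about how `geval` behaves, not confirmed by this notebook. The arrays are made up and do not reproduce the scores above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, fbeta_score

# Made-up gold labels (1 = positive class) and predicted probabilities.
toy_expected = np.array([1, 0, 1, 1, 0, 0])
toy_probabilities = np.array([0.9, 0.2, 0.4, 0.8, 0.6, 0.1])

# Assumed decision rule: predict the positive class at probability >= 0.5.
toy_predicted = (toy_probabilities >= 0.5).astype(int)

accuracy = accuracy_score(toy_expected, toy_predicted)
# beta=2 weights recall more heavily than precision.
f2 = fbeta_score(toy_expected, toy_predicted, beta=2)
print(accuracy, f2)
```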
