### Dataset

Name of the data set: **30Columnists**

The dataset consists of 50 articles for each of the 30 authors, 1,500 articles in total.

[Here](http://www.kemik.yildiz.edu.tr/veri_kumelerimiz.html) is the link to the dataset.

### Load Dataset

> Load the dataset, unzip it, remove the files that are not needed, and build a dataframe with the columns **[author, article]**.

> **You can either download the dataset from its source with the curl command *(1)* or upload it yourself *(2)*.**

In [1]:
# 1
!curl http://www.kemik.yildiz.edu.tr/data/File/30Columnists.zip -o /content/30Columnists.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7208k  100 7208k    0     0   431k      0  0:00:16  0:00:16 --:--:--  454k


In [2]:
# 2
#from google.colab import files
#uploaded = files.upload()

In [3]:
!unzip /content/30Columnists.zip -d /content > /dev/null

In [4]:
# Remove unnecessary files and directories
!rm -f /content/30Columnists/*.doc
!rm -f /content/30Columnists/*.tmp
!rm -rf /content/30Columnists/arff_files
!mv /content/30Columnists/raw_texts/* /content/30Columnists
!rmdir /content/30Columnists/raw_texts

In [5]:
# Get authors (one numbered directory per author)
import os
PATH = "/content/30Columnists"

# Sort numerically so that "10" comes after "9"
authors = sorted(os.listdir(PATH), key=int)
print(authors)

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30']


In [6]:
# Read every article of each author and append to dataset
import re
dataset = []  # [[author, article], [author, article], ...]
for author in authors:
  author_path = os.path.join(PATH, author)
  for article_file in os.listdir(author_path):
    if article_file.endswith(".txt"):
      article_path = os.path.join(author_path, article_file)
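      with open(article_path, "rb") as f:
        # Decode the raw bytes as ISO-8859-9
        article = f.read().decode("iso-8859-9")

        # Strip line breaks, form feeds and the stray \x96 control character
        article = re.sub(r"[\n\r\f\x96]", "", article)
        dataset.append([author, article])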

In [7]:
# Count the articles read for each author
from collections import Counter
article_count = Counter(row[0] for row in dataset)

for author in authors:
  print(f"{author:>2}\t{article_count[author]}")

 1	50
 2	50
 3	50
 4	50
 5	50
 6	50
 7	50
 8	50
 9	50
10	50
11	50
12	50
13	50
14	50
15	50
16	50
17	50
18	50
19	50
20	50
21	50
22	50
23	50
24	50
25	50
26	50
27	50
28	50
29	50
30	50


In [8]:
import pandas as pd

dataset = pd.DataFrame(dataset, columns=["author", "article"])
print(dataset)

     author                                            article
0         1  IT'S been a dreadful week for hearing the deta...
1         1  TODAY the Royal Bank of Scotland announced a r...
2         1  THERE was a time when the nanny state seemed a...
3         1  HAVING, in the past, been to only Scottish Con...
4         1  WHY is it  that some politicians think that if...
...     ...                                                ...
1495     30  It was mainly a social occasion. An elegant bl...
1496     30  Seven score and five years after Abraham Linco...
1497     30  Americans are waking up this Christmas Day to ...
1498     30  The real estate market isn't pain-free when yo...
1499     30  It was a different kind of pioneering. As coal...

[1500 rows x 2 columns]
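
The loading steps above can also be sketched without shell commands. The helper below is a hypothetical illustration, not the notebook's code: `read_articles` is an invented name, and it assumes a layout of `root/<author number>/*.txt`. Unlike the cell above, it replaces line breaks with a single space rather than deleting them, so words on adjacent lines are not fused. A temporary toy directory stands in for the unzipped dataset.

```python
import re
import tempfile
from pathlib import Path

def read_articles(root, encoding="iso-8859-9"):
    """Return [author, article] rows from root/<author>/*.txt (hypothetical helper)."""
    rows = []
    # Author folders are named with integers, so sort them numerically.
    author_dirs = sorted(
        (p for p in Path(root).iterdir() if p.is_dir() and p.name.isdigit()),
        key=lambda p: int(p.name),
    )
    for author_dir in author_dirs:
        for article_file in sorted(author_dir.glob("*.txt")):
            text = article_file.read_text(encoding=encoding)
            # Assumption: collapse whitespace runs to one space instead of
            # deleting line breaks outright as the notebook cell does.
            text = re.sub(r"\s+", " ", text.replace("\x96", " ")).strip()
            rows.append([author_dir.name, text])
    return rows

# Toy directory tree standing in for the unzipped dataset.
with tempfile.TemporaryDirectory() as tmp:
    for author, body in [("2", "second\r\nauthor"), ("10", "tenth"), ("1", "first\nauthor")]:
        folder = Path(tmp) / author
        folder.mkdir()
        (folder / "a.txt").write_bytes(body.encode("iso-8859-9"))
    rows = read_articles(tmp)

print(rows)
```

Switching to this whitespace handling would change the token counts reported later, so it is shown only as an alternative.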


### Tokenization, Punctuation, Stop Words, Case Folding

> Tokenize, remove punctuation and stopwords, apply case folding.

In [9]:
# nltk library for tokenizing, punctuations and stopwords
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [10]:
from nltk.corpus import stopwords
nltk.download("stopwords")
stop_words = stopwords.words("english")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [11]:
print(len(stop_words), stop_words)

179 ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than

In [12]:
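from nltk import word_tokenize
import re
# Tokenize, keep alphabetic tokens only (drops punctuation and numbers),
# remove stopwords, and apply case folding
stop_word_set = set(stop_words)  # set membership is faster than list lookup
for idx, row in dataset.iterrows():
  text = re.sub(r"\.", " . ", row["article"])  # pad periods so they split off as separate tokens
  dataset.at[idx, "article"] = [token.lower() for token in word_tokenize(text) if token.isalpha() and token.lower() not in stop_word_set]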

In [13]:
print(dataset)

     author                                            article
0         1  [dreadful, week, hearing, details, court, case...
1         1  [today, royal, bank, scotland, announced, reco...
2         1  [time, nanny, state, seemed, attractive, idea,...
3         1  [past, scottish, conservative, election, parti...
4         1  [politicians, think, ban, something, make, dif...
...     ...                                                ...
1495     30  [mainly, social, occasion, elegant, black, tie...
1496     30  [seven, score, five, years, abraham, lincoln, ...
1497     30  [americans, waking, christmas, day, big, packa...
1498     30  [real, estate, market, house, may, cut, millio...
1499     30  [different, kind, pioneering, coal, mines, wes...

[1500 rows x 2 columns]
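
To make the filtering rule concrete, here is a minimal sketch on a single toy sentence. It is not the notebook's pipeline: the regular-expression tokenizer is a stand-in for NLTK's `word_tokenize` (which splits contractions and some punctuation differently), and `TOY_STOP_WORDS` is a three-word illustrative list rather than NLTK's English list.

```python
import re

# Tiny illustrative stopword list (assumption; the notebook uses NLTK's list).
TOY_STOP_WORDS = {"the", "a", "of"}

def preprocess(text, stop_words=TOY_STOP_WORDS):
    # Stand-in tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Keep alphabetic tokens that are not stopwords, lowercased (case folding).
    return [t.lower() for t in tokens if t.isalpha() and t.lower() not in stop_words]

tokens = preprocess("TODAY the bank announced a record loss of 24 billion pounds!")
print(tokens)
# ['today', 'bank', 'announced', 'record', 'loss', 'billion', 'pounds']
```

The number `24` and the exclamation mark fail `isalpha()`, which is how the same test removes both punctuation and numerals.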


### Train-Test Splits

> Split the dataset into two parts: a **train dataset** (80%) and a **test dataset** (20%).

In [14]:
# Train Dataset
train_dataset = dataset.sample(frac=0.8, random_state=1)

In [15]:
train_dataset

     author                                            article
91        2  [governments, holyrood, westminster, skirmishi...
75        2  [thought, safe, go, back, normal, politics, ne...
1264     26  [arsène, wenger, cuts, increasingly, aesthetic...
330       7  [wanted, hear, binyam, mohamed, repeated, alle...
1349     27  [know, going, give, vitamin, c, yet, another, ...
...     ...                                                ...
652      14  [even, defeat, belarus, rio, ferdinand, given,...
70        2  [opposition, politicians, warned, snp, guarant...
610      13  [sunday, night, carlos, delgado, plate, facing...
1174     24  [everyone, time, develop, chronic, disease, un...


In [16]:
len(train_dataset)

1200

In [17]:
train_dataset["author"].value_counts()

author
23    46
2     44
8     44
16    44
7     43
1     43
17    42
6     42
21    41
12    41
19    41
5     41
26    41
11    40
20    40
18    40
30    40
25    40
14    40
9     39
3     39
24    39
4     38
22    38
15    38
29    37
10    36
27    36
13    36
28    31
Name: count, dtype: int64

In [18]:
# Test Dataset
test_dataset = dataset.drop(train_dataset.index)

In [19]:
test_dataset

     author                                            article
15        1  [always, risk, writing, state, worlds, financi...
20        1  [good, news, week, see, deployment, police, of...
21        1  [ronnie, reagan, tony, blair, seems, first, mi...
24        1  [help, never, thought, writing, political, col...
25        1  [going, well, man, plan, sit, back, wait, mayb...
...     ...                                                ...
1478     30  [volkswagen, returning, western, pennsylvania,...
1485     30  [books, unforgettable, trying, remember, full,...
1495     30  [mainly, social, occasion, elegant, black, tie...
1498     30  [real, estate, market, house, may, cut, millio...


In [20]:
len(test_dataset)

300
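
Because the split above samples rows at random across the whole dataframe, the authors end up unevenly represented: author 28 has 31 training articles and 19 test articles, while author 23 has 46 and 4. One possible alternative is a stratified split that draws the same share from every author. The sketch below is a pure-Python illustration on toy rows; `stratified_split` is a hypothetical helper, not part of the notebook.

```python
import random
from collections import defaultdict

def stratified_split(rows, train_frac=0.8, seed=1):
    """Split [author, article] rows so each author keeps the same train share."""
    by_author = defaultdict(list)
    for row in rows:
        by_author[row[0]].append(row)
    rng = random.Random(seed)  # fixed seed for a reproducible shuffle
    train, test = [], []
    for author in by_author:
        items = by_author[author][:]
        rng.shuffle(items)
        cut = round(len(items) * train_frac)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test

# Toy data: 3 authors with 50 placeholder articles each.
rows = [[str(a), f"article {i}"] for a in range(1, 4) for i in range(50)]
train, test = stratified_split(rows)
print(len(train), len(test))  # 120 30
```

With pandas 1.1 or later, `dataset.groupby("author").sample(frac=0.8, random_state=1)` should give a comparable per-author draw directly on the dataframe.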

### Build Vocabulary

> Create a token set that holds every token in the train dataset plus a special token, ***\<UNKNOWN>***, which stands in for tokens that appear in the test dataset but not in the train dataset.

In [21]:
total_tokens = 0
token_set = set()
for _, tokens in train_dataset["article"].items():
  total_tokens += len(tokens)
  for token in tokens:
    token_set.add(token)

# Add <UNKNOWN> to the token set for words that do not occur in the train corpus.
token_set.add("<UNKNOWN>")

In [22]:
print(f"Corpus size: {total_tokens}, Vocabulary size: {len(token_set)}\n")
print("First 10 tokens:")
i = 0
for token in token_set:
  if i >= 10:
    break
  print(token)
  i += 1

Corpus size: 452966, Vocabulary size: 32375

First 10 tokens:
gawande
charmingly
rugelach
pelts
lineout
colossally
naughty
ceremonies
inaction
seated


### Vectorization


> token2idx: {token: token_idx}

> Create train_vector_df with the structure [label, vector], where the label is the author.
Each vector holds the term frequency of every vocabulary token in the article, so its size equals the vocabulary size (vocab_size).

In [23]:
# token to index mapping
token2idx = {term: idx for idx, term in enumerate(token_set)}

In [24]:
print(f"Size: {len(token2idx)}\n\nTerms and Indexes:")
i = 0
for key in token2idx:
  if i >= 10:
    break
  print(f"{key}: {token2idx[key]}")
  i += 1

Size: 32375

Terms and Indexes:
gawande: 0
charmingly: 1
rugelach: 2
pelts: 3
lineout: 4
colossally: 5
naughty: 6
ceremonies: 7
inaction: 8
seated: 9


In [25]:
# Document vectors of Train Set
train_vector_df = None
vocab_size = len(token_set)
article_vectors = []
for _, row in train_dataset.iterrows():
  vector = [0] * vocab_size
  for token in row["article"]:
    vector[token2idx[token]] += 1
  article_vectors.append([row["author"], vector])

train_vector_df = pd.DataFrame(article_vectors, columns=["author", "vector"])
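
The mapping and the term-frequency vectors can be illustrated on a toy corpus. This is a minimal sketch under two assumptions that differ from the cells above: the vocabulary is sorted, so index assignment is reproducible across runs (iteration order of a Python set of strings is not guaranteed to be), and `tf_vector` is a hypothetical helper name.

```python
from collections import Counter

# Toy tokenized training documents.
train_docs = [["bank", "loss", "bank"], ["match", "goal"]]

# Sorting the vocabulary makes the index assignment reproducible across runs.
vocab = sorted({tok for doc in train_docs for tok in doc} | {"<UNKNOWN>"})
token2idx = {tok: i for i, tok in enumerate(vocab)}

def tf_vector(tokens, token2idx):
    """Count each token at its vocabulary index; unseen tokens go to <UNKNOWN>."""
    unk = token2idx["<UNKNOWN>"]
    vector = [0] * len(token2idx)
    for tok, count in Counter(tokens).items():
        vector[token2idx.get(tok, unk)] += count
    return vector

print(token2idx)
# {'<UNKNOWN>': 0, 'bank': 1, 'goal': 2, 'loss': 3, 'match': 4}
print(tf_vector(["bank", "goal", "referee"], token2idx))
# [1, 1, 1, 0, 0]
```

Here `referee` never occurs in the toy training documents, so it is counted at the `<UNKNOWN>` index.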

### Model Training

> Create a dictionary that holds one vector per author, **{author: vector}**. Each author vector is the element-wise sum of that author's article vectors in the train dataset.

In [26]:
import numpy as np

# create a dictionary for author vectors, {author: author_vector}
author_vectors = {}
for author in authors:
  author_vectors[author] = np.zeros(vocab_size, dtype=int)

# convert train dataset vectors to numpy arrays
# calculate author vectors using element-wise addition for each author
for _, row in train_vector_df.iterrows():
  vector = np.array(row["vector"])
  author_vectors[row["author"]] += vector

In [27]:
# First 20 elements of each author's vector
for author in authors:
  print(f"Author {author}: {author_vectors[author][:20]}\n")

Author 1: [0 0 0 0 0 0 2 0 0 0 0 0 0 1 0 0 1 0 0 0]

Author 2: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]

Author 3: [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0]

Author 4: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Author 5: [0 0 0 1 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0]

Author 6: [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 0]

Author 7: [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0]

Author 8: [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 2]

Author 9: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1]

Author 10: [0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 2 0 0 2]

Author 11: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Author 12: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Author 13: [0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0]

Author 14: [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0]

Author 15: [0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1 0]

Author 16: [2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 4]

Author 17: [0 0 0 0 0 1 0 0 0 0 3 0 0 0 1 0 1 0 0 1]

Author 18: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 0 0 0]

Author 19: [0 0 0 0 0 0 0 0 0 0 0 0 1
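
The training step amounts to summing article vectors per author. A minimal pure-Python sketch on three toy vectors follows; `build_author_vectors` is a hypothetical helper, and plain lists replace the NumPy arrays used in the notebook.

```python
from collections import defaultdict

def build_author_vectors(labelled_vectors, size):
    """Sum the term-frequency vectors of each author element-wise."""
    totals = defaultdict(lambda: [0] * size)
    for author, vector in labelled_vectors:
        totals[author] = [a + b for a, b in zip(totals[author], vector)]
    return dict(totals)

# Toy (author, vector) pairs over a 3-token vocabulary.
toy = [("1", [1, 0, 2]), ("2", [0, 1, 0]), ("1", [0, 3, 1])]
toy_author_vectors = build_author_vectors(toy, size=3)
print(toy_author_vectors)
# {'1': [1, 3, 3], '2': [0, 1, 0]}
```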

### Similarity Measure: Cosine

> Define a function that calculates the cosine similarity between two vectors of the same dimension.

In [28]:
def cosine_sim(vector1, vector2):
  dot_product = np.dot(vector1, vector2)  # dot product of the vectors
  magnitude1 = np.linalg.norm(vector1)    # length of vector1
  magnitude2 = np.linalg.norm(vector2)    # length of vector2

  cosine_similarity = dot_product / (magnitude1 * magnitude2)
  return cosine_similarity
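
The function above divides by the product of the two norms, so an all-zero vector (for example, an article with no alphabetic non-stopword tokens) would produce a division by zero. The sketch below is one possible guarded variant in pure Python. Returning `0.0` for a zero vector is an assumption made here for illustration, and `cosine_sim_safe` is a hypothetical name.

```python
import math

def cosine_sim_safe(v1, v2):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    if n1 == 0 or n2 == 0:
        return 0.0  # assumption: an empty vector is treated as dissimilar to everything
    return dot / (n1 * n2)

parallel = cosine_sim_safe([1, 2, 0], [2, 4, 0])    # same direction, different length
orthogonal = cosine_sim_safe([1, 0], [0, 1])        # no shared terms
empty = cosine_sim_safe([0, 0], [1, 1])             # zero vector
print(parallel, orthogonal, empty)
```

The first call returns a value that is 1 up to floating-point rounding, which shows that cosine similarity ignores vector length. As a consequence, summing an author's article vectors and averaging them give the same similarity ranking, even though authors have different numbers of training articles.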

### Testing and Result

> Calculate a term-frequency vector for each article in test_dataset. Predict the author of every article using cosine similarity. Finally, compare the predictions with the true authors, calculate the success rate, and print the wrong predictions paired with the correct authors.

In [29]:
# Document vectors of Test Set
test_vector_df = None
vocab_size = len(token_set)
test_article_vectors = []
for _, row in test_dataset.iterrows():
  vector = [0] * vocab_size
  for token in row["article"]:
    idx = token2idx.get(token, token2idx.get("<UNKNOWN>"))
    vector[idx] += 1
  test_article_vectors.append([row["author"], vector])

test_vector_df = pd.DataFrame(test_article_vectors, columns=["author", "vector"])

In [30]:
# test_authors: true authors of articles
test_authors, test_vectors  = test_vector_df.iloc[:,0], test_vector_df.iloc[:,1]
test_authors.value_counts()

author
28    19
13    14
10    14
27    14
29    13
22    12
4     12
15    12
24    11
9     11
3     11
11    10
25    10
14    10
18    10
20    10
30    10
12     9
5      9
19     9
21     9
26     9
6      8
17     8
1      7
7      7
2      6
8      6
16     6
23     4
Name: count, dtype: int64

In [31]:
# make all predictions
predicted_authors = []  # list of predictions
for vector in test_vectors:
  vector = np.array(vector)
  # Pick the author whose vector is most similar to the article vector.
  # Using the dictionary key directly avoids assuming that position + 1 equals the author label.
  best_author = max(author_vectors, key=lambda author: cosine_sim(vector, author_vectors[author]))
  predicted_authors.append(best_author)

In [32]:
correct_predictions = 0
for idx, predicted_author in enumerate(predicted_authors):
  if predicted_author == test_authors[idx]:
    correct_predictions += 1

success_rate = correct_predictions / len(predicted_authors)
print(f"Success rate: {success_rate:.4g}")

Success rate: 0.8067
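
A single success rate hides how unevenly the errors are spread across authors. As a possible follow-up, the sketch below computes accuracy per author on toy labels; `per_author_accuracy` is a hypothetical helper, and the toy lists are not taken from the notebook's results.

```python
from collections import Counter

def per_author_accuracy(true_authors, predicted_authors):
    """Return {author: share of that author's articles predicted correctly}."""
    totals, hits = Counter(), Counter()
    for truth, pred in zip(true_authors, predicted_authors):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return {author: hits[author] / totals[author] for author in totals}

# Toy labels for illustration only.
toy_true = ["1", "1", "2", "2", "2", "3"]
toy_pred = ["1", "4", "2", "2", "1", "3"]
accuracy = per_author_accuracy(toy_true, toy_pred)
print(accuracy)
```

In the notebook it could be called as `per_author_accuracy(test_authors, predicted_authors)`, since both sequences are in the same order.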


In [33]:
# Incorrect predictions
print("Author\tPrediction")
for idx, predicted_author in enumerate(predicted_authors):
  if predicted_author != test_authors[idx]:
    print(f"{test_authors[idx]:^6}\t{predicted_author:^10}")

Author	Prediction
  1   	    4     
  1   	    10    
  1   	    5     
  2   	    7     
  3   	    2     
  3   	    10    
  3   	    2     
  3   	    19    
  3   	    8     
  4   	    1     
  4   	    7     
  4   	    10    
  4   	    7     
  4   	    1     
  6   	    10    
  6   	    21    
  6   	    10    
  6   	    10    
  6   	    21    
  7   	    20    
  7   	    2     
  7   	    8     
  9   	    16    
  9   	    16    
  9   	    11    
  9   	    11    
  10  	    5     
  11  	    27    
  11  	    9     
  13  	    14    
  14  	    26    
  14  	    22    
  14  	    12    
  14  	    8     
  15  	    13    
  16  	    29    
  16  	    25    
  18  	    17    
  18  	    19    
  18  	    19    
  21  	    6     
  21  	    20    
  24  	    25    
  24  	    16    
  27  	    9     
  27  	    11    
  27  	    11    
  28  	    12    
  28  	    8     
  28  	    8     
  28  	    26    
  28  	    26    
  28  	    12    
  28  	    8     
  28  	   