# **Article Categorization with TF-IDF Score**

Text classification and article categorization are some of the basic use cases in text analytics. In the project presented in this notebook, we are going to define a custom function to rate text documents based on TF-IDF score. 

**Import Libraries** \
As usual, I start by importing the libraries that will be used later on.


In [None]:
import numpy as np
import re
import sqlite3
import pandas as pd
from collections import Counter
from math import log

import dfply

**Loading the Data**   
The data for this project is provided as a SQLite database. This database contains a single Article table with three columns: 


*   **id** - the primary key of the table, the unique identifier for each article
*   **category** -  column consisting of predefined article categories, the label
*   **raw_text** - the actual text of each article

To load this data, we use the sqlite3.connect function and query all records from the table into a pandas dataframe.





In [2]:
conn = sqlite3.connect('Project 01 - Database.db')
sql = 'SELECT * FROM Article'
df = pd.read_sql_query(sql, conn, index_col='id')
conn.close()
#------------------------------
df.head(10)
df.shape

(2225, 2)

**Cleaning the Data** \
As the first step of our analysis, we prepare the data by cleaning the raw text of each article. This involves several steps: lowercasing, removing digits, underscores and punctuation, and collapsing excess whitespace.

In [3]:
def clean_text(raw_text):
  #convert the raw text to lowercase
  text = raw_text.lower()
  #remove all numbers from the text using a regular expression
  text = re.sub(r'[0-9]', ' ', text)
  #remove all underscores from the text
  text = re.sub(r'\_', ' ', text)
  #remove anything else in the text that isn't a word character or a space (e.g., punctuation, special symbols, etc.)
  text = re.sub(r'[^\w\s]', ' ', text)
  #collapse any run of whitespace (spaces, tabs, newlines) into a single space
  text = re.sub(r'\s+', ' ', text)
  #remove any leading or trailing space characters
  text = text.strip()
  #return the clean text
  return text

#------------------------------------
df['clean_text'] = [clean_text(raw_text) for raw_text in df.raw_text]

df.head(4)

| id | category | raw_text | clean_text |
|---|---|---|---|
| 6347 | Politics | Hiding women away in the home hidden behind ve... | hiding women away in the home hidden behind ve... |
| 13840 | Sports | Celtic brushed aside Clyde to secure their pla... | celtic brushed aside clyde to secure their pla... |
| 14775 | Unknown | If you have finished Doom 3, Half Life 2 and H... | if you have finished doom half life and halo d... |
| 16641 | Unknown | Controversial new UK casinos will be banned fr... | controversial new uk casinos will be banned fr... |
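
To make the effect of each cleaning step visible, here is a minimal trace on a made-up string (the `sample` text is hypothetical and not taken from the database). The whitespace step uses a single regular expression, which is one possible way to collapse runs of spaces.

```python
import re

# Hypothetical input, chosen to contain digits, an underscore and punctuation
sample = "Doom 3, Half_Life 2 -- out   NOW!"

step1 = sample.lower()                        # lowercase everything
step2 = re.sub(r'[0-9]', ' ', step1)          # digits become spaces
step3 = re.sub(r'_', ' ', step2)              # underscores become spaces
step4 = re.sub(r'[^\w\s]', ' ', step3)        # punctuation becomes spaces
cleaned = re.sub(r'\s+', ' ', step4).strip()  # collapse whitespace and trim

print(cleaned)
```

Because every removed character is replaced by a space rather than deleted, neighbouring words are never glued together; the final step then tidies up the extra spaces.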


**Building Vocabulary** \
Next, we will combine the text from all articles and create a vocabulary of words and their frequencies.

In [4]:
#build a vocabulary of words
all_text = ' '.join(df.clean_text) #join all of the English texts into one big string
words = all_text.split() #split the text into words
word_frequencies = Counter(words) #count all words in the text
vocabulary = list(word_frequencies.keys()) #get a list of all unique words

len(vocabulary)

vocabulary.sort()

# **Calculating TF-IDF**
The TF-IDF score is calculated for each word in each document as the Term Frequency multiplied by the Inverse Document Frequency.\
Term Frequency - number of times a word appears in a document / total number of words in that document.\
Inverse Document Frequency - log( total number of documents in the corpus / number of documents in the corpus containing that word )
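
As a quick illustration of these two definitions, the sketch below scores three words on a tiny made-up corpus. The documents and the helper names `tf`, `idf` and `tf_idf` are assumptions for this example only; they are not part of the project code.

```python
from collections import Counter
from math import log

# Toy corpus (hypothetical, not taken from the Article table)
docs = [
    "the match ended in a draw",
    "the market closed higher",
    "the team won the match",
]
tokenized = [doc.split() for doc in docs]
N = len(tokenized)  # total number of documents

def tf(word, tokens):
    # occurrences of the word divided by the number of words in the document
    return Counter(tokens)[word] / len(tokens)

def idf(word):
    # log of (total documents / documents containing the word)
    n_w = sum(1 for tokens in tokenized if word in tokens)
    return log(N / n_w)

def tf_idf(word, tokens):
    return tf(word, tokens) * idf(word)

last_doc = tokenized[2]
score_the = tf_idf("the", last_doc)      # "the" occurs in every document
score_match = tf_idf("match", last_doc)  # "match" occurs in 2 of 3 documents
score_team = tf_idf("team", last_doc)    # "team" occurs in 1 of 3 documents
print(score_the, score_match, score_team)
```

A word that occurs in every document gets an IDF of log(1) = 0, so its TF-IDF is zero no matter how often it appears, while rarer words are weighted up.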

**Computing Inverse Document Frequency (IDF)**\
First, we iterate over the entries of the vocabulary and calculate the IDF for each word.


In [5]:
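vocab = pd.DataFrame(vocabulary, columns=['words'])
vocab["IDF"] = ""
N = df.shape[0]  # Total Number of Documents in the Corpus
for word in vocab.words:
  Nw = 0  # Number of Documents containing the word
  for row in df.itertuples():
    # NOTE: `in` on a string is a substring test, so a short word such as 'aa'
    # is also counted when it only occurs inside a longer word.
    # Testing against set(row.clean_text.split()) would match whole words only.
    if word in row.clean_text:
      Nw += 1
  idf = log(N/Nw)
  vocab.loc[vocab.words==word, 'IDF'] = idf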

#------------------- Checking the length of Vocab and IDF
vocab.head(5)

# Code for checking IDF for a word
# vocab.loc[vocab.words=="the"]

|  | words | IDF |
|---|---|---|
| 0 | a | 0.0 |
| 1 | aa | 3.10234 |
| 2 | aaa | 4.99946 |
| 3 | aaas | 5.7616 |
| 4 | aac | 6.6089 |
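
Python's `in` operator on a string matches substrings, so a document-frequency count based on it can differ from a whole-word count. The sketch below shows one possible whole-word variant that counts each word once per document using sets; the `clean_texts` list is a hypothetical stand-in for `df.clean_text`, and this is an alternative for comparison, not the computation that produced the table above.

```python
from collections import Counter
from math import log

# Hypothetical cleaned texts standing in for df.clean_text
clean_texts = [
    "a band played a song",
    "the bank raised rates",
    "a new song topped the charts",
]

# Whole-word document frequency: each word is counted once per document
doc_freq = Counter()
for text in clean_texts:
    doc_freq.update(set(text.split()))

N = len(clean_texts)
idf = {word: log(N / n_w) for word, n_w in doc_freq.items()}

# For comparison: substring matching also finds "a" inside "bank" and "raised"
substring_count = sum(1 for text in clean_texts if "a" in text)

print(doc_freq["a"], substring_count)
```

This version also makes a single pass over the documents instead of one pass per vocabulary word, which matters when the vocabulary is large.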


**Defining an Article Class**\
Here, we use a Python class to store the attributes of each article in a tidier and more accessible form.

In [6]:
class Article:
  def __init__(self, document_id, category, Term_Freq, total_words, TF_IDF, estimated_topic):
    self.id = document_id #the document's unique ID number
    self.category = category #the document's topic
    self.total_words = total_words #the total number of words in the document
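    self.Term_Freq = Term_Freq #the term frequency of every vocabulary word in the document
    self.TF_IDF = TF_IDF #the document's TF-IDF vector (filled in later)
    self.estimated_topic = estimated_topic #the topic estimated for the document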
    # self.word_probabilities = None

**Calculating Term Frequency for each Article**\
Next, we go over every article in our dataframe and compute the term frequency of each vocabulary word in that article.

In [7]:
articles= []

for row in df.itertuples():
  words = row.clean_text.split()
  article_word_freq = Counter(words)
  Nd = sum(article_word_freq.values())
  TF = []
  for vocab_word in vocab.words:
    Fwd = 0
    if vocab_word in article_word_freq:
      Fwd = article_word_freq[vocab_word]
    term_freq = Fwd / Nd
    TF.append(term_freq) 

  articles.append(Article(row.Index, row.category, TF, Nd, 0, ""))

**Computing TF-IDF**\
Now that we have IDF values for our vocabulary, and the Term Frequency related to each article, we can calculate the TF-IDF score for each article.

In [8]:
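#IDF values as a float array, in the same order as the vocabulary
idf_vector = vocab.IDF.to_numpy(dtype=float)
for article in articles:
  #element-wise product of the article's term frequencies and the IDF values
  article.TF_IDF = np.array(article.Term_Freq) * idf_vector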
  

**Computing Average TF_IDF**\
To be able to classify the category of a new article, we need to define each category as a vector that represents the average TF-IDF score of all articles from that category.

In [9]:
def Average_TF_IDF(category):
  #collect the TF-IDF vectors of every article labelled with this category
  vectors = [article.TF_IDF for article in articles if article.category == category]
  #average them element-wise to get one representative vector for the category
  return np.mean(np.array(vectors, dtype=float), axis=0)


topics = ['Business', 'Sports', 'Politics', 'Technology', 'Entertainment']

#average TF-IDF vector for each known topic
topics_TFIDF = {topic: Average_TF_IDF(category=topic) for topic in topics}

**Computing Distance**\
Let us define a simple function to get the Euclidean distance between two TF-IDF vectors.

In [10]:
def get_distance(point1, point2):
  return np.sqrt(np.sum(np.square(point1 - point2)))
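
As a sanity check, the function can be tried on two small made-up vectors where the answer is easy to verify by hand. The vectors `a` and `b` are hypothetical; the comparison with `np.linalg.norm` is included only to show that both give the same value.

```python
import numpy as np

def get_distance(point1, point2):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum(np.square(point1 - point2)))

# Hypothetical vectors forming a 3-4-5 right triangle
a = np.array([0.0, 3.0, 0.0])
b = np.array([4.0, 0.0, 0.0])

d = get_distance(a, b)
d_norm = np.linalg.norm(a - b)  # NumPy's built-in Euclidean norm of the difference

print(d, d_norm)
```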

**Estimating Unknown Topics**\
Now, for each article with a category defined as ***Unknown***, we can compute the distance between the article's TF-IDF vector and the average TF-IDF vector of each defined category.\
Finally, the estimated category (topic) of the unknown article is chosen as the category with the smallest distance.

In [11]:
for article in articles:
  if article.category == 'Unknown':
    distance_from_topics = {'Business':0, 'Politics':0, 'Sports':0 , 'Technology':0 ,'Entertainment':0 }
    for topic in ['Business', 'Politics', 'Sports' , 'Technology' ,'Entertainment' ]:
      distance_from_topics[topic] = get_distance(article.TF_IDF, topics_TFIDF[topic] )
    article.estimated_topic = min(distance_from_topics, key= distance_from_topics.get)
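
The same nearest-average rule can be seen in isolation on a toy example. Everything below is made up for illustration: a three-word vocabulary, three invented category vectors in `centroids`, and a helper `nearest_topic` that is not part of the notebook.

```python
import numpy as np

# Hypothetical average TF-IDF vectors over a 3-word vocabulary
centroids = {
    "Sports": np.array([0.30, 0.02, 0.01]),
    "Business": np.array([0.01, 0.25, 0.03]),
    "Technology": np.array([0.02, 0.03, 0.28]),
}

def nearest_topic(vector, centroids):
    # Euclidean distance from the vector to every category average
    distances = {topic: np.linalg.norm(vector - center)
                 for topic, center in centroids.items()}
    # the key with the smallest distance is the estimated topic
    return min(distances, key=distances.get), distances

unknown = np.array([0.05, 0.20, 0.04])  # made-up article vector
topic, distances = nearest_topic(unknown, centroids)

print(topic)
```

Passing `distances.get` as the `key` makes `min` compare the dictionary values while returning the corresponding key, which is the same pattern used in the cell above.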


**Creating a Dataframe with Estimates**

In [12]:
#DataFrame.append was removed in pandas 2.0, so the single-row frames are combined with pd.concat
estimate_rows = []
for article in articles:
  if article.category == 'Unknown':
    estimate_rows.append(pd.DataFrame({'Article_Id':[article.id],'Article_Category':article.estimated_topic} ))
Estimates = pd.concat(estimate_rows)
    
 

In [13]:
Estimates.shape
Estimates.head(3)

|  | Article_Id | Article_Category |
|---|---|---|
| 0 | 14775.0 | Technology |
| 0 | 16641.0 | Politics |
| 0 | 17511.0 | Sports |


**Saving CSV File**

In [14]:
with open('Dadvand_Kouhi, Sina.csv', 'w') as csvfile:
  for row in Estimates.itertuples():
    csvfile.write('{},{}\n'.format(int(row.Article_Id), row.Article_Category))

**Testing the Process on Known Categories**

In [15]:
#DataFrame.append was removed in pandas 2.0, so the single-row frames are combined with pd.concat
validation_rows = []
for article in articles:
  if article.category != 'Unknown':
    distance_from_topics = {'Business':0, 'Politics':0, 'Sports':0 , 'Technology':0 ,'Entertainment':0 }
    for topic in ['Business', 'Politics', 'Sports' , 'Technology' ,'Entertainment' ]:
      distance_from_topics[topic] = get_distance(article.TF_IDF, topics_TFIDF[topic] )
    article.estimated_topic = min(distance_from_topics, key= distance_from_topics.get)
    validation_rows.append(pd.DataFrame({'Article_Id':[article.id],'Article_Category':article.category,'Estimated_Category':article.estimated_topic} ))
Validation_dataframe = pd.concat(validation_rows)



**Confusion Matrix and Accuracy**

In [16]:
from sklearn import metrics
print(metrics.confusion_matrix(Validation_dataframe.Article_Category, Validation_dataframe.Estimated_Category, labels=['Business', 'Politics', 'Sports' , 'Technology' ,'Entertainment' ]))

print(metrics.classification_report(Validation_dataframe.Article_Category, Validation_dataframe.Estimated_Category, labels=['Business', 'Politics', 'Sports' , 'Technology' ,'Entertainment' ]))


[[259   6   0   5   0]
 [  0 238   0   1   0]
 [  0   0 294   0   0]
 [  0   0   0 222   3]
 [  1   1   0   0 195]]
               precision    recall  f1-score   support

     Business       1.00      0.96      0.98       270
     Politics       0.97      1.00      0.98       239
       Sports       1.00      1.00      1.00       294
   Technology       0.97      0.99      0.98       225
Entertainment       0.98      0.99      0.99       197

     accuracy                           0.99      1225
    macro avg       0.99      0.99      0.99      1225
 weighted avg       0.99      0.99      0.99      1225
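
The figures in the report can be recomputed by hand from the confusion matrix. The sketch below copies the matrix printed above into a NumPy array and derives accuracy, recall and precision, assuming scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

labels = ['Business', 'Politics', 'Sports', 'Technology', 'Entertainment']

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([
    [259,   6,   0,   5,   0],
    [  0, 238,   0,   1,   0],
    [  0,   0, 294,   0,   0],
    [  0,   0,   0, 222,   3],
    [  1,   1,   0,   0, 195],
])

correct = np.trace(cm)                    # correctly classified articles (diagonal)
accuracy = correct / cm.sum()             # share of all articles classified correctly
recall = np.diag(cm) / cm.sum(axis=1)     # per true category
precision = np.diag(cm) / cm.sum(axis=0)  # per predicted category

for name, p, r in zip(labels, precision, recall):
    print(f"{name:>13}  precision={p:.2f}  recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Of the 1225 labelled articles, 1208 fall on the diagonal, which gives the 0.99 accuracy shown in the report.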



**Dataset for Later**
https://www.kaggle.com/c/learn-ai-bbc/overview

In [17]:
# IDF
# df.iloc[0].clean_text
# a=0
# for row in df.itertuples():
#   if 'women2' in row.clean_text:
#     print('found')
#     a += 1
# a

#------------------- Checking the length of Vocab and IDF
# len(vocabulary)
# len(IDF)
# vocab2 = vocab
# vocab2['original_idf']= IDF
# vocab2
# word
# vocab2.loc[vocab2.words==word, 'original_idf'] = 1258


text2 = df.iloc[0].clean_text.split()

count = Counter(text2)
count['the']
vocab_word = 'the'
if 'al' in count:
  print(count[vocab_word])
df.head(2)
for row in df.itertuples():
    print (row)

Pandas(Index=7866912, category='Sports', raw_text='A US judge has set a preliminary trial date for the Balco steroid distribution case which has rocked athletics. US district court judge Susan Ilston rejected an attempt by the defence team to have the case dismissed at a pre-trial hearing in San Francisco. And she set a March date for the case of the four men accused of distributing illegal performance-enhancing drugs to elite athletes to be heard. A firm decision on whether the trial takes place is expected in January. The judge said that she may conduct hearings in January into whether federal agents illegally searched the Balco headquarters and wrongfully obtained statements from the company\'s founder Victor Conte and its vice-president James Valente. The two men - along with personal trainer Greg Anderson and athletics coach Remy Korchemny - were all indicted earlier this year but have pleaded their innocence. The outcome of those hearings could result in some or all of the charge