# DBpedia dataset multiclass classification problem

## Description of the data
The DBpedia ontology classification dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014. They are listed in classes.txt. From each of these 14 ontology classes, we randomly choose 40,000 training samples and 5,000 testing samples. Therefore, the total size of the training dataset is 560,000 and of the testing dataset 70,000.

The files train.csv and test.csv contain the training and testing samples, respectively, as comma-separated values. Each file has 3 columns: class index (1 to 14), title, and content. The title and content are enclosed in double quotes ("), and any internal double quote is escaped by 2 double quotes (""). There are no new lines in title or content.
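The quoting rule described above can be checked on a couple of rows before loading the full files. The snippet below is a minimal sketch using Python's standard `csv` module; the two rows and the helper name `parse_rows` are made up for illustration and do not come from the dataset.

```python
import csv
import io

# Two made-up rows in the layout described above: class index, title, content.
# The second row contains internal double quotes written as "".
raw = (
    '1,"Example Co","Example Co is a hypothetical company."\n'
    '14,"A ""Quoted"" Title","A book whose title contains quotes, and a comma."\n'
)

def parse_rows(text):
    """Parse DBpedia-style CSV text into (class_index, title, content) tuples."""
    reader = csv.reader(io.StringIO(text), quotechar='"', doublequote=True)
    return [(int(cls), title, content) for cls, title, content in reader]

rows = parse_rows(raw)
for row in rows:
    print(row)
```

With `doublequote=True`, the doubled quotes collapse back to single quote characters and the comma inside the quoted content does not split the field.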

In [1]:
from sklearn import naive_bayes, model_selection, preprocessing, svm, metrics, linear_model, neighbors
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble


In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import xgboost as xg
import string as strng

In [3]:
from keras.preprocessing import sequence,text
from keras import layers, optimizers, models

Using TensorFlow backend.


## Reading train and test data

In [4]:
train_data1 = pd.read_csv('train.csv')
test_data1 = pd.read_csv('test.csv')

## Exploratory data analysis

In [5]:
train_data1.describe

<bound method NDFrame.describe of         class                              title  \
0           1                   E. D. Abbott Ltd   
1           1                     Schwan-Stabilo   
2           1                         Q-workshop   
3           1  Marvell Software Solutions Israel   
4           1        Bergan Mercy Medical Center   
...       ...                                ...   
559995     14                   Barking in Essex   
559996     14                   Science & Spirit   
559997     14             The Blithedale Romance   
559998     14                Razadarit Ayedawbon   
559999     14           The Vinyl Cafe Notebooks   

                                                  content  
0        Abbott of Farnham E D Abbott Limited was a Br...  
1        Schwan-STABILO is a German maker of pens for ...  
2        Q-workshop is a Polish company located in Poz...  
3        Marvell Software Solutions Israel known as RA...  
4        Bergan Mercy Medical Center is a

In [6]:
test_data1.describe

<bound method NDFrame.describe of        class                     title  \
0          1                     TY KU   
1          1     Odd Lot Entertainment   
2          1                    Henkel   
3          1                GOAT Store   
4          1  RagWing Aircraft Designs   
...      ...                       ...   
69995     14            Energy Victory   
69996     14                 Bestiario   
69997     14         Wuthering Heights   
69998     14             L'Indépendant   
69999     14      The Prophecy (novel)   

                                                 content  
0       TY KU /taɪkuː/ is an American alcoholic bever...  
1       OddLot Entertainment founded in 2001 by longt...  
2       Henkel AG & Company KGaA operates worldwide w...  
3       The GOAT Store (Games Of All Type Store) LLC ...  
4       RagWing Aircraft Designs (also called the Rag...  
...                                                  ...  
69995   Energy Victory: Winning the War on Terro

### Checking whether the data represents all the classes equally

In [7]:
group_by_class = train_data1.groupby(['class'])
group_by_class.size()

class
1     40000
2     40000
3     40000
4     40000
5     40000
6     40000
7     40000
8     40000
9     40000
10    40000
11    40000
12    40000
13    40000
14    40000
dtype: int64

In [8]:
group_by_class = test_data1.groupby(['class'])
group_by_class.size()

class
1     5000
2     5000
3     5000
4     5000
5     5000
6     5000
7     5000
8     5000
9     5000
10    5000
11    5000
12    5000
13    5000
14    5000
dtype: int64

## Taking subset of train data and test data to work with
Experimentation showed that fitting the models on the full dataset takes a long time and sometimes kills the kernel. To work around this, we take a subset of the original data. Taking the first few records in sequence would give us only a few classes, because the files are ordered by class, so the idea is to draw a random fraction of the original data instead.

In [9]:
# The data is huge and processing takes a long time, so we randomly take 15% of
# the train data and 15% of the test data and work on those samples only.

train_data = train_data1.sample(frac=0.15)  # random 15% of the training rows
train_data.describe

<bound method NDFrame.describe of         class                                              title  \
549526     14                                    L'Arrêt de mort   
331944      9                                 Anjir Siah-e Sofla   
64819       2                          North Garland High School   
198353      5                                       Trần Văn Tuý   
362141     10                                         Peropteryx   
...       ...                                                ...   
109645      3                              Francis George Fowler   
430730     11                                      Persea campii   
557761     14  Holy Hell: A Memoir of Faith Devotion and Pure...   
113031      3                                     Alexander Hall   
78612       2  Central Institute of Plastics Engineering & Te...   

                                                  content  
549526   Death Sentence (French: L'Arrêt de mort) is a...  
331944   Anjir Siah-e Sofla (

In [10]:
group_by_class = train_data.groupby(['class'])
group_by_class.size()

class
1     5979
2     5942
3     5924
4     5944
5     6202
6     5805
7     6047
8     5824
9     6024
10    6020
11    6001
12    6019
13    6126
14    6143
dtype: int64
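The random sample above is roughly, but not exactly, balanced across the 14 classes. If exact balance matters, one option is to sample the same fraction from every class separately (a stratified sample). The snippet below is a small pure-Python sketch of that idea; `stratified_sample` is a hypothetical helper, not something the notebook uses, and the toy labels stand in for the `class` column.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(labels, frac, seed=0):
    """Return sorted row indices holding `frac` of each class, drawn at random."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        k = round(len(indices) * frac)  # rows to keep for this class
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)

# Toy labels: 3 classes with 100 rows each, stored in class order like train.csv.
labels = [c for c in (1, 2, 3) for _ in range(100)]
picked = stratified_sample(labels, frac=0.15)
counts = Counter(labels[i] for i in picked)
print(counts)
```

Recent pandas versions offer a similar per-group draw through `groupby(...).sample(frac=...)`; whether it is available depends on the installed pandas version, so treat that as an assumption to verify.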

In [11]:
test_data = test_data1.sample(frac=0.15)  # random 15% of the test rows
test_data.describe

<bound method NDFrame.describe of        class                          title  \
15434      4                    Heiner Dopp   
18577      4                Shannon Welcome   
51716     11        Andropogon benthamianus   
67367     14               A Spot of Bother   
1240       1  Maianbar Bundeena Bus Service   
...      ...                            ...   
26149      6             USS Nausett (1865)   
25475      6   Metropolitan Railway F Class   
41498      9                         Kožuhe   
43295      9              Firuzabad Mashhad   
35793      8                  Kasilof River   

                                                 content  
15434   Heiner Dopp (born 27 June 1956 in Bad Dürkhei...  
18577   Shannon Roy Welcome Warren (born 22 November ...  
51716   Andropogon benthamianus is a species of grass...  
67367   A Spot of Bother is the second adult novel by...  
1240    Mainanbar Bundeena Bus Service is an Australi...  
...                                            

In [12]:
group_by_class = test_data.groupby(['class'])
group_by_class.size()

class
1     786
2     768
3     713
4     758
5     757
6     739
7     761
8     758
9     724
10    735
11    766
12    712
13    762
14    761
dtype: int64

In [13]:
x_train_data =train_data[['content']]
y_train_data = train_data[['class']]
x_test_data = test_data[['content']]
y_test_data = test_data[['class']]

In [14]:
x_train_list = x_train_data['content'].tolist()
y_train_list = y_train_data['class'].tolist()
x_test_list = x_test_data['content'].tolist()
y_test_list = y_test_data['class'].tolist()

## Preprocessing 
* We use the tf-idf vectorizer.
* We compute tf-idf at the word level, one word (unigram) at a time.
* We remove English stop words.
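To make the tf-idf step concrete, here is a minimal pure-Python sketch of the weighting. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is my understanding of scikit-learn's default settings. The function `tfidf` and the three short documents are illustrative only; the sketch uses plain whitespace tokenisation and skips stop-word removal and the `max_features` cap.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf (assumed to match scikit-learn's default behaviour).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = [
    "german maker of pens",
    "polish maker of dice",
    "american maker of beverages",
]
vectors = tfidf(docs)
print(vectors[0])
```

Words that appear in every document ("maker", "of") receive a lower weight than words unique to one document ("german", "pens"), which is the behaviour the classifiers below rely on.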

In [15]:
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000,stop_words='english')
tfidf_vect.fit(x_train_list)
xtrain_tfidf =  tfidf_vect.transform(x_train_list)
xvalid_tfidf =  tfidf_vect.transform(x_test_list)


The Naive Bayes classifier assumes that the features (here, the tf-idf term weights) are conditionally independent given the class.
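The independence assumption is what lets the classifier score a document by simply adding per-word log-likelihoods. The sketch below is a toy multinomial Naive Bayes on raw word counts with Laplace smoothing; it is not scikit-learn's implementation, and `train_nb`, `predict_nb`, and the four tiny documents are hypothetical names and data chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit class priors and Laplace-smoothed word likelihoods (in log space)."""
    vocab = {word for doc in docs for word in doc.split()}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Pick the class maximising log P(c) + sum of log P(word | c)."""
    scores = {}
    for c in log_prior:
        # Independence assumption: per-word log-likelihoods simply add up.
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in doc.split() if w in log_like[c]
        )
    return max(scores, key=scores.get)

docs = ["company maker pens", "company software israel", "novel book author", "book memoir faith"]
labels = [1, 1, 14, 14]
log_prior, log_like = train_nb(docs, labels)
pred_company = predict_nb("software company", log_prior, log_like)
pred_book = predict_nb("memoir book", log_prior, log_like)
print(pred_company, pred_book)
```

In the notebook the same idea is applied to tf-idf weights instead of raw counts, which `MultinomialNB` accepts in practice even though the model is derived for counts.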

In [32]:
# Multinomial Naive Bayes model
x = naive_bayes.MultinomialNB().fit(xtrain_tfidf, y_train_list)
predictions = x.predict(xvalid_tfidf)
accuracy = metrics.accuracy_score(predictions, y_test_list)
print(accuracy)
# Note: predictions are passed first here, so the rows of this matrix are
# predicted classes and the columns are true classes.
confusion_matrix = metrics.confusion_matrix(predictions, y_test_list)
confusion_matrix


0.9444761904761905


array([[682,   6,   5,   0,   3,   8,  16,   0,   0,   0,   6,   0,   0,
         12],
       [  7, 746,   1,   0,   3,   1,  32,   0,   4,   0,   0,   0,   0,
          0],
       [  5,   1, 585,   5,  12,   1,   2,   0,   1,   3,   0,   2,   2,
          7],
       [  3,   3,   3, 748,   3,   1,   0,   0,   0,   9,   0,   0,   0,
          4],
       [  2,   4,  14,   3, 727,   0,   2,   0,   0,   3,   1,   0,   0,
          3],
       [ 17,   2,   0,   0,   3, 725,   1,   1,   1,   1,   0,   0,   0,
          1],
       [ 14,   5,   1,   0,   2,   1, 697,   6,   6,   0,   1,   0,   0,
          1],
       [  1,   0,   0,   0,   0,   1,   7, 750,  11,   3,   0,   0,   0,
          0],
       [  0,   1,   0,   0,   0,   0,   0,   0, 700,   0,   0,   0,   0,
          0],
       [  0,   0,   0,   0,   0,   0,   1,   0,   1, 659,  18,   0,   0,
          3],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,  57, 739,   0,   0,
          0],
       [ 27,   0,  41,   0,   0,   0,   0, 

In [33]:
print (metrics.f1_score(predictions, y_test_list,average='macro'))
print (metrics.f1_score(predictions, y_test_list,average='micro'))
print (metrics.f1_score(predictions, y_test_list,average='weighted'))

0.9439932789827418
0.9444761904761906
0.9449425861774863
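The three averages printed above differ only in how the per-class F1 scores are combined. The following pure-Python sketch computes them from scratch on a made-up 3-class example; `f1_averages` is a hypothetical helper, not a scikit-learn function.

```python
def f1_averages(y_true, y_pred):
    """Return (macro, micro, weighted) F1 for two equal-length label lists."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class_f1, support = {}, {}
    tp_total = fp_total = fn_total = 0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class_f1[c] = 2 * tp / denom if denom else 0.0
        support[c] = tp + fn  # number of true examples of class c
        tp_total, fp_total, fn_total = tp_total + tp, fp_total + fp, fn_total + fn
    # Macro: plain mean of the per-class scores.
    macro = sum(per_class_f1.values()) / len(classes)
    # Micro: pool all counts first, then compute a single F1.
    micro = 2 * tp_total / (2 * tp_total + fp_total + fn_total)
    # Weighted: mean of per-class scores weighted by true-class support.
    weighted = sum(per_class_f1[c] * support[c] for c in classes) / len(y_true)
    return macro, micro, weighted

y_true = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]
y_pred = [1, 1, 1, 2, 2, 2, 3, 3, 1, 3]
macro, micro, weighted = f1_averages(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
swapped_macro, _, swapped_weighted = f1_averages(y_pred, y_true)
print(macro, micro, weighted, accuracy)
```

For single-label multiclass data the micro average equals accuracy, which matches the output above. Macro F1 is unchanged when the two arguments are swapped, but the weighted average is not, because the weights come from whichever list is treated as the true labels. The Naive Bayes cell passes `predictions` first, so its weighted F1 is weighted by predicted-class counts.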


In [17]:
#Logistic Regression 

x=linear_model.LogisticRegression().fit(xtrain_tfidf,y_train_list)
predictions = x.predict(xvalid_tfidf)
predictions_proba = x.predict_proba(xvalid_tfidf)

accuracy=metrics.accuracy_score(predictions, y_test_list)
print("the accuracy with Logistic regression model is",accuracy)



the accuracy with Logistic regression model is 0.9709523809523809


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [19]:
print("F1-score for logistic regression with macro average",metrics.f1_score(y_test_list, predictions, average='macro'))
print("F1-score for logistic regression with micro average",metrics.f1_score(y_test_list, predictions, average='micro'))
print("F1-score for logistic regression with weighted average",metrics.f1_score(y_test_list, predictions, average='weighted'))


F1-score for logistic regression with macro average 0.9710350443802884
F1-score for logistic regression with micro average 0.9709523809523809
F1-score for logistic regression with weighted average 0.9709543337491622


In [21]:
# KNN 

x=neighbors.KNeighborsClassifier().fit(xtrain_tfidf,y_train_list)
predictions3 = x.predict(xvalid_tfidf)
accuracy=metrics.accuracy_score(predictions3, y_test_list)
accuracy

0.42933333333333334

In [22]:
print("F1-score for KNN with macro average",metrics.f1_score(y_test_list, predictions3, average='macro'))
print("F1-score for KNN with micro average",metrics.f1_score(y_test_list, predictions3, average='micro'))
print("F1-score for KNN with weighted average",metrics.f1_score(y_test_list, predictions3, average='weighted'))


F1-score for KNN with macro average 0.48188868562474907
F1-score for KNN with micro average 0.4293333333333333
F1-score for KNN with weighted average 0.4820763641736975


In [23]:
#SVM
x=svm.SVC().fit(xtrain_tfidf,y_train_list)
predictions4 = x.predict(xvalid_tfidf)
accuracy=metrics.accuracy_score(predictions4, y_test_list)
accuracy

0.9722857142857143

In [25]:
print("F1-score for SVM with macro average",metrics.f1_score(y_test_list, predictions4, average='macro'))
print("F1-score for SVM with micro average",metrics.f1_score(y_test_list, predictions4, average='micro'))
print("F1-score for SVM with weighted average",metrics.f1_score(y_test_list, predictions4, average='weighted'))


F1-score for SVM with macro average 0.9723869933922656
F1-score for SVM with micro average 0.9722857142857143
F1-score for SVM with weighted average 0.9722981188738296


In [26]:
# Random Forest Classifier
x=ensemble.RandomForestClassifier().fit(xtrain_tfidf,y_train_list)
predictions5 = x.predict(xvalid_tfidf)
accuracy=metrics.accuracy_score(predictions5, y_test_list)
accuracy

0.95

In [27]:
print("F1-score for Random_forest with macro average",metrics.f1_score(y_test_list, predictions5, average='macro'))
print("F1-score for Random_forest with micro average",metrics.f1_score(y_test_list, predictions5, average='micro'))
print("F1-score for Random_forest with weighted average",metrics.f1_score(y_test_list, predictions5, average='weighted'))


F1-score for Random_forest with macro average 0.9498809150604893
F1-score for Random_forest with micro average 0.9500000000000001
F1-score for Random_forest with weighted average 0.949819327115977


In [28]:
# XG Boost classifier
x=xg.XGBClassifier().fit(xtrain_tfidf,y_train_list)
predictions6 = x.predict(xvalid_tfidf)
accuracy=metrics.accuracy_score(predictions6, y_test_list)
accuracy

0.9305714285714286

In [29]:
print("F1-score for Xgboost with macro average",metrics.f1_score(y_test_list, predictions6, average='macro'))
print("F1-score for Xgboost  with micro average",metrics.f1_score(y_test_list, predictions6, average='micro'))
print("F1-score for Xgboost with weighted average",metrics.f1_score(y_test_list, predictions6, average='weighted'))


F1-score for Xgboost with macro average 0.9309263178654514
F1-score for Xgboost  with micro average 0.9305714285714286
F1-score for Xgboost with weighted average 0.9307439320986206


## Summary 
* The data consists of 560,000 training observations and 70,000 test observations.
* Each observation in both sets belongs to one of the 14 classes.
* Checked the class balance of the data.
* Working with the whole dataset took a long time, and the kernel sometimes died.
* So, took a random sample of 15% of the original data (train data = 84,000 observations, test data = 10,500 observations).
* After sampling, checked again whether the data was still balanced.
* Pre-processed the text with tf-idf vectorization and stop-word removal.
* Took Naive Bayes as the baseline model.
* Different models and their accuracies and F1-scores:


| Model | Accuracy (%) | Macro F1-Score (%) |
| ---------------------- | ---------------------- | ---------------------- |
| Naive Bayes (MultinomialNB) | 94.45 | 94.40 |
| LogisticRegression | 97.10 | 97.10 |
| KNeighborsClassifier | 42.93 | 48.19 |
| __SVM (SVC, default RBF kernel)__ | __97.23__ | __97.24__ |
| RandomForestClassifier | 95.00 | 94.99 |
| XGBClassifier | 93.06 | 93.09 |


* __Conclusion__:

    The SVM (`svm.SVC()` with default settings, which uses an RBF kernel rather than a linear one) and Logistic Regression models give the highest accuracy and macro F1-score, about 97.2% and 97.1% respectively. On the 15% sample used here, SVM is the best model by a small margin. However, the accuracies might change when the whole dataset is used.