# Language detection: Text mining with XBRL-AI

This is a very small, inspirational implementation of text mining using XBRL-AI (https://github.com/Niels-Peter/XBRL-AI). The focus of this page is to show a simple case using the library. Hopefully it can work as a quick introduction to the library.

The case builds a simple model to detect whether a text from an auditor's statement is in Danish or English.

It is done in these steps:
* load data from the Danish cloud holding annual reports in XBRL
* convert them into Python dictionaries
* read text from the auditors' statements into the y and X vectors
* train an ML model to detect whether a text is in Danish or English
* evaluate the model
* use the model


But before we can get started, we need to import the library. In this case we are working on Danish data from www.cvr.dk. Therefore we need both:

* xbrl_ai and
* xbrl_local.xbrl_ai_dk


In [35]:
from xbrl_ai import xbrlinstance_to_dict
from xbrl_local.xbrl_ai_dk import xbrldict_to_xbrl_dk_64, scanscroll_fetchlist_dk


The next thing to do is to get a list of annual reports. The <font color='red'>scanscroll_fetchlist_dk</font> function goes straight into the Danish XBRL cloud and loads a list of reports.

To keep it simple, we just load the annual reports published on 28 February 2018.



In [36]:
# Fetch a list of annual reports from the cloud
input_data = scanscroll_fetchlist_dk('2018-02-28', format = 'publishTime')

Before loading the data, we define the two vectors for our training data.

* y holds the labels that we want the machine to learn. Here they are 'da' and 'en', the language tags.
* X holds the text of the elements in the auditors' statements.

So our goal is to teach the machine to "predict" y (the language tag) from the text in X, by presenting it with examples from annual reports that are already language-tagged.

In [37]:
# Prepare vectors y and X for the training data
y=[]
X=[]

### Here comes the tricky part!

Next we have to get the data for our training. To do this, we run through the list that we got from <font color='red'>scanscroll_fetchlist_dk</font>. We fetch each instance in the list with <font color='red'>requests</font>, and convert it first with <font color='red'>xbrlinstance_to_dict</font> and then with <font color='red'>xbrldict_to_xbrl_dk_64</font>.

Having done that, we can go through each element of each XBRL document, extract the text that has a language tag and put it into X, and put the language tag into y.
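
The filtering described above can be sketched on a toy dictionary. This is only an illustration: the key layout below is an assumption based on the two positions the loop actually reads (position 0 for the element name, position 5 for the language tag), the element names are made up, and the values are plain text so no HTML stripping is needed.

```python
# Hypothetical stand-in for the dictionary returned by xbrldict_to_xbrl_dk_64.
# Only positions 0 and 5 of each key are used by the filter; the remaining
# positions are unknown here and simply filled with None.
toy_report = {
    ('arr:ExampleStatement', None, None, None, None, 'lang:da'):
        ['Vi har revideret årsregnskabet.'],
    ('arr:ExampleStatement', None, None, None, None, 'lang:EN'):
        ['We have audited the financial statements.'],
    ('fsa:ExampleNote', None, None, None, None, 'lang:da'):
        ['Ikke fra revisors erklæring.'],
    ('arr:ExampleNumber', None, None, None, None, None):
        [42],
}

texts, labels = [], []
for key, value in toy_report.items():
    tag = key[5]
    # Skip elements without a language tag
    if not isinstance(tag, str) or not tag.startswith('lang:'):
        continue
    # Skip elements that do not come from the 'arr:' namespace
    if not key[0].startswith('arr:'):
        continue
    labels.append(tag[len('lang:'):].lower())
    texts.append(value[0])

print(labels)  # ['da', 'en']
print(texts)
```

The last toy key has no language tag at all. Slicing such a value directly would raise an error, which is one reason a real loop over the documents needs some form of error handling.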

Warning: <font color='red'>This step takes some time!</font>

It is a good idea to pickle your data after this step, so you do not have to download everything again.


In [38]:
from bs4 import BeautifulSoup
import numpy as np
import requests

# Go through the list of annual reports one by one
for instance in input_data:
    targeturl = instance['dokumentUrl']

    # Make sure the URL is not null,
    # then convert all data to a dictionary in xbrl_as_dk_64 format
    if targeturl is not None:
        xbrldoc_as_dict = xbrlinstance_to_dict(requests.get(targeturl).content)
        xbrl_as_dk_64 = xbrldict_to_xbrl_dk_64(xbrldoc_as_dict)

        # Go through each annual report element by element
        for element in xbrl_as_dk_64:

            # Use only elements with a language tag and from audit (= arr)
            try:
                if element[5][0:5] == 'lang:' and element[0][0:4] == 'arr:':

                    # Strip the HTML and add the data to vectors y and X
                    tekst = BeautifulSoup(xbrl_as_dk_64[element][0], 'lxml').get_text(' ')
                    y = np.append(y, [element[5][5:].lower()], axis=0)
                    X = np.append(X, [tekst], axis=0)
            except Exception:
                # Skip elements that do not have the expected structure
                pass
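
A minimal sketch of the pickling advice, using only the standard library. The file name and the two short demo lists are assumptions for illustration; in the notebook you would dump the real `X` and `y` instead (NumPy arrays can be pickled in the same way).

```python
import os
import pickle
import tempfile

# Stand-ins for the real X and y built by the loop above
X_demo = ['Vi har revideret årsregnskabet.',
          'We have audited the financial statements.']
y_demo = ['da', 'en']

# Hypothetical file location; pick any path that suits your setup
path = os.path.join(tempfile.gettempdir(), 'xbrl_training_data.pkl')

# Save both vectors together in one file
with open(path, 'wb') as f:
    pickle.dump({'X': X_demo, 'y': y_demo}, f)

# Later: load them again instead of re-downloading the reports
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored['y'])  # ['da', 'en']
```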


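Before we can move on, we have to do a small cleaning of the Danish tags: some texts are tagged 'dk' (the country code) instead of 'da' (the language code), so we map both to 'da'.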

In [39]:
# Harmonise the Danish tags: replace 'dk' with 'da'
y[y=='dk'] = 'da'
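
The line above relies on `y` being a NumPy array, which it is after the calls to `np.append`. A small sketch on made-up labels shows what the boolean mask does:

```python
import numpy as np

# Toy labels (assumed values) containing both spellings of the Danish tag
y_toy = np.array(['da', 'dk', 'en', 'dk'])

# Comparing an array with a string gives one True/False per element
mask = y_toy == 'dk'
print(mask)  # [False  True False  True]

# Assigning through the mask only changes the elements where it is True
y_toy[mask] = 'da'
print(y_toy)  # ['da' 'da' 'en' 'da']
```

On a plain Python list the same expression does not work this way, because `y == 'dk'` is then a single `False` rather than a mask.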


### Let's do some simple machine learning

The first thing to do is to split the training data in two: one part for training and one for testing.


In [40]:
from sklearn.model_selection import train_test_split
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.33)
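
The split above is random, so the numbers change from run to run. As an optional variation (not what the notebook does), `train_test_split` also accepts `random_state` for a reproducible split and `stratify` to keep the share of each language the same in both parts. A sketch on made-up data:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (assumed values): more Danish than English texts
X_toy = [f'tekst {i}' for i in range(8)] + [f'text {i}' for i in range(4)]
y_toy = ['da'] * 8 + ['en'] * 4

X_tr_toy, X_test_toy, y_tr_toy, y_test_toy = train_test_split(
    X_toy, y_toy,
    test_size=0.25,    # 3 of the 12 examples go to the test set
    stratify=y_toy,    # keep the da/en ratio in both parts
    random_state=0,    # same split on every run
)

print(Counter(y_tr_toy))
print(Counter(y_test_toy))
```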


Now we can train a very simple text pipeline: word counts, TF-IDF weighting and a random forest classifier. Nothing fancy, but good enough for our simple case.


In [41]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier

model = Pipeline([
    ('vect', CountVectorizer(ngram_range = (1, 1))),
    ('tfidf', TfidfTransformer(use_idf = True)),
    ('clf_RF', RandomForestClassifier()),
])

y_pred = model.fit(X_tr, y_tr).predict(X_test)
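
To get a feeling for what the first two steps of the pipeline hand to the classifier, here is a small sketch on two made-up sentences. It uses the same settings as above, but runs the steps outside a `Pipeline` so the intermediate results can be inspected.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Two toy sentences (assumed examples), one per language
texts = [
    'Vi har revideret årsregnskabet',
    'We have audited the financial statements',
]

# Step 1: count how often each word occurs in each text
vect = CountVectorizer(ngram_range=(1, 1))
counts = vect.fit_transform(texts)
print(sorted(vect.vocabulary_))
print(counts.shape)  # (2, 10): two texts, ten distinct words

# Step 2: reweight the counts with TF-IDF; by default every row is
# scaled to unit length, so long and short texts become comparable
tfidf = TfidfTransformer(use_idf=True).fit_transform(counts)
dense = tfidf.toarray()
row_norms = (dense ** 2).sum(axis=1)
print(row_norms)
```

The two sentences share no words, which is the kind of signal that makes telling the languages apart an easy task for the classifier.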


Let's look at the confusion matrix and the classification report:

In [43]:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[745   0]
 [  3 106]]
             precision    recall  f1-score   support

         da       1.00      1.00      1.00       745
         en       1.00      0.97      0.99       109

avg / total       1.00      1.00      1.00       854



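Pretty nice! Only 3 of the 109 English texts in the test set were classified as Danish, and no Danish text was classified as English.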
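
The numbers in the report can be recomputed by hand from the confusion matrix, which may help when reading it. In scikit-learn's confusion matrix the rows are the true labels and the columns the predicted labels, here in the order 'da', 'en'. The sketch below uses only the figures printed above.

```python
# Confusion matrix copied from the output above
cm = [[745, 0],
      [3, 106]]
labels = ['da', 'en']

report = {}
for i, label in enumerate(labels):
    true_pos = cm[i][i]
    support = sum(cm[i])                    # texts that really are this language
    predicted = sum(row[i] for row in cm)   # texts predicted as this language
    precision = true_pos / predicted
    recall = true_pos / support
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1, support)
    print(f'{label}: precision={precision:.2f} recall={recall:.2f} '
          f'f1={f1:.2f} support={support}')

# da: precision=1.00 recall=1.00 f1=1.00 support=745
# en: precision=1.00 recall=0.97 f1=0.99 support=109
```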



So let's try it out on a Danish and an English text:

In [44]:
print(model.predict(['Den af os udarbejdede årsrapport er aflagt i overensstemmelse med dansk regnskabslovgivning.']))
print(model.predict(["Our opinion on the annual accounts does not cover the management’s review"]))


['da']
['en']
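
Besides the predicted label, a pipeline that ends in a random forest can also report how sure it is through `predict_proba`. The sketch below trains a separate toy model on a handful of made-up sentences so that it runs on its own; the sentences, `toy_model` and its settings are assumptions, and with the trained `model` from above you would call `model.predict_proba(...)` in the same way.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Tiny made-up training set, far too small for real use
train_texts = [
    'Vi har revideret årsregnskabet for selskabet',
    'Det er vores opfattelse at årsregnskabet giver et retvisende billede',
    'Ledelsen har ansvaret for udarbejdelsen af et årsregnskab',
    'Revisionen har ikke givet anledning til forbehold',
    'We have audited the financial statements of the company',
    'In our opinion the financial statements give a true and fair view',
    'Management is responsible for the preparation of financial statements',
    'Our audit has not resulted in any qualification',
]
train_labels = ['da'] * 4 + ['en'] * 4

toy_model = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier(n_estimators=50, random_state=0)),
])
toy_model.fit(train_texts, train_labels)

# One probability per class, in the order given by classes_
classes = toy_model.named_steps['clf'].classes_
proba = toy_model.predict_proba(['Vi har revideret regnskabet'])
for label, p in zip(classes, proba[0]):
    print(label, round(p, 2))
```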
