## LDA for document clustering 


In order to cluster the documents into the different types we will use an LDA (Latent Dirichlet Allocation) as our clustering model. 
However, before the application of the LDA we need to do some data processing to clean the content of the scrapped dowcuments. 

The process will be the following: 
- Import data (.csv) 
- Tokenitzation (transform information into words) 
- Remove punctuation marks 
- Reduce words to root 
- Build our dictionary 
- Change all text to one single vector with the count of words
- Normalization 
- Build the final matrix of MxN size (M: number of ducments, N: dictionary length) 
- Model with LDA and clustering 


## 1. Packages

In [10]:
#Packages
import numpy as np 
import pandas as pd 
from nltk.tokenize import word_tokenize
import nltk
import numpy as np 
import lda 
import lda.datasets
import matplotlib as plt
from textblob import TextBlob

## 1. Pre-processar el text 

1. Import 
2. Token
3. Puntuaction marks removal 
4. Reduce words to root

In [11]:
## Load documents 
data = pd.read_csv('dogc_1.csv',encoding='utf-8')
data.head()
textos = data.Text
textos.shape #Es un array de 204 files per una columna que conté els textos 

(204,)

In [12]:
## Check 
textos.head()

0    Expedient 5/O/2015. Extinció de les llicències...
1    I. L’apartat 5 de l’article 29 de l’Estatut or...
2    I L’apartat 5 de l’article 29 de l’Estatut org...
3    L'article 20 de la Llei Orgànica 15/99 de 13 d...
4    El Col·legi de Dietistes – Nutricionistes de C...
Name: Text, dtype: object

In [13]:
## Translate docs from catalan to english to be able to 
## use the all-ready made dictionary and functionalities 
text_translated = []
translation = []
for text in textos:
    text_TextBlob=TextBlob(text)
    translation = text_TextBlob.translate(from_lang="ca", to='en')
    text_translated.append(translation)

In [18]:
np.shape(text_translated)

(204,)

In [19]:
text_new=[]
for text in text_translated:
    text2=str(text)
    text_new.append(text2)

texts=pd.DataFrame(text_new)
texts.to_csv('DOGC_english_text.csv',encoding='utf-8')
english_texts = pd.read_csv('DOGC_english_text.csv',encoding='utf-8')

In [20]:
english_texts.head()
texts_en=english_texts['0']
texts_en.head()

0    File 5 / O / 2015. Termination of licenses for...
1    I. Section 5 of Article 29 of the Organic Stat...
2    I Paragraph 5 of Article 29 of the Organic Sta...
3    Article 20 of the Law 15/99 of December 13, Pr...
4    The College of Dietitians - Nutritionists Cata...
Name: 0, dtype: object

In [8]:
## Tokenize 
from textblob import TextBlob
from nltk.tokenize import TabTokenizer
tokenizer = TabTokenizer()
tokenized_docs=[]
for text in texts_en:
    blob = TextBlob(text, tokenizer=tokenizer)
    tokenized_docs.append(blob.words)

len(tokenized_docs)

204

In [9]:
## Removing puntuation 
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [10]:
import re
import string
regex = re.compile('[%s]' % re.escape(string.punctuation)) 

tokenized_docs_no_punctuation = []

for review in tokenized_docs:
    
    new_review = []
    for token in review: 
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review.append(new_token)
    
    tokenized_docs_no_punctuation.append(new_review)
    
len(tokenized_docs_no_punctuation)

204

In [11]:
# Remove numbers and short words 
non_short_words=[]
for text in tokenized_docs_no_punctuation:  
    new_text3 = []
    for token in text:
        if len(token) >= 3:
            new_text3.append(token)
    non_short_words.append(new_text3)

print np.shape(non_short_words)
non_short_words[0][0:10]

(204L,)


[u'File',
 u'2015',
 u'Termination',
 u'licenses',
 u'for',
 u'the',
 u'provision',
 u'digital',
 u'terrestrial',
 u'television']

In [12]:
# Lemmatitzar
from textblob import Word

preprocessed_docs = []
for doc in non_short_words:
    final_doc = []
    for word in doc:
        w=Word(word)
        final_doc.append(w.lemmatize())
    preprocessed_docs.append(final_doc)

In [13]:
print tokenized_docs_no_punctuation[0][1:100]
print '\n\n', preprocessed_docs[0][1:100]

[u'5', u'O', u'2015', u'Termination', u'of', u'licenses', u'for', u'the', u'provision', u'of', u'digital', u'terrestrial', u'television', u'broadcasting', u'local', u'corresponding', u'to', u'the', u'districts', u'of', u'Cornella', u'de', u'Llobregat', u'and', u'Tarragona', u'resignation', u'of', u'the', u'holder', u'society', u'Tele', u'Taxi', u'Digital', u'Television', u'SLU', u'Since', u'the', u'company', u'Tele', u'Taxi', u'Television', u'Digital', u'SLU', u'represented', u'by', u'Mr', u'Justo', u'Molinero', u'Calero', u'presented', u'a', u'letter', u'and', u'other', u'documents', u'which', u'formalize', u'the', u'renunciation', u'of', u'licenses', u'for', u'the', u'provision', u'of', u'broadcasting', u'digital', u'terrestrial', u'television', u'd', u'local', u'level', u'corresponding', u'to', u'the', u'districts', u'of', u'Cornella', u'de', u'Llobregat', u'and', u'Tarragona', u'this', u'Council', u'should', u'proceed', u'to', u'declare', u'them', u'extinguished', u'and', u'inscriu

In [14]:
## To lower case 
new_preprocessed_docs=[]
for doc in preprocessed_docs:
    doc_nl=[]
    for word in doc:
        word2=word.lower()
        doc_nl.append(word2)
    new_preprocessed_docs.append(doc_nl)

In [15]:
## Remove empty information words 
new_preprocessed_docs2=[]
for doc in new_preprocessed_docs:
    filtered = [ v for v in doc if not v.startswith('the') ]
    new_preprocessed_docs2.append(filtered)

In [112]:
## Manera manual pero que va be
new_preprocessed_docs39=[]
for doc in new_preprocessed_docs38:
    filtered = [ v for v in doc if not v.startswith('who') ] #anar canviant la paraula i el numero del doc
    new_preprocessed_docs39.append(filtered)

In [113]:
print '\n\n', new_preprocessed_docs39[200][1:100]



[u'provision', u'article', u'organic', u'statute', u'functioning', u'council', u'selection', u'organization', u'staff', u'whereas', u'agreement', u'plenary', u'council', u'december', u'2000', u'approved', u'classification', u'job', u'size', u'staff', u'council', u'subsequent', u'resolution', u'amend', u'update', u'list', u'post', u'work', u'dated', u'january', u'february', u'march', u'2001', u'february', u'september', u'2002', u'february', u'october', u'2003', u'november', u'2004', u'regard', u'favorable', u'report', u'comptroller', u'audiovisual', u'council', u'proposal', u'secretarygeneral', u'adopted', u'following', u'resolution', u'approve', u'job', u'offer', u'public', u'broadcasting', u'council', u'year', u'2005', u'place', u'listed', u'annex', u'agreement', u'offer', u'public', u'employment', u'vacancy', u'mentioned', u'include', u'provision', u'equipped', u'budget', u'year', u'considered', u'necessary', u'proper', u'operation', u'service', u'expense', u'arising', u'provision'

## 2. Dictionary

In [70]:
## Build Dictionary 
import string 
    
def build_lexicon(corpus): # define a set with all possible words included in all the sentences or "corpus"
    lexicon = set()
    for doc in corpus:
        lexicon.update(word for word in doc)
    return lexicon

def tf(term, document):
  return freq(term, document)

def freq(term, document):
  return document.split().count(term)


In [115]:
vocabulary = build_lexicon(new_preprocessed_docs39)

In [114]:
texts_39=pd.DataFrame(new_preprocessed_docs39)
texts_39.to_csv('DOGC_english_text_filtered.csv',encoding='utf-8')

In [116]:
len(vocabulary)

7669

In [117]:
pattern=r'[0-9]'
vocabulary_nonumeric = list()
for word in vocabulary: 
    match = re.search(pattern, word)
    if not match:
        vocabulary_nonumeric.append(word)

In [118]:
len(vocabulary_nonumeric)

6493

In [119]:
vocabulary_big = list()
for word in vocabulary_nonumeric: 
    if len(word)>=3: 
        vocabulary_big.append(word)

In [120]:
vocabulary_array = np.unique(vocabulary_big)
vocabulary_array.shape

(6478L,)

In [121]:
def tf(term, document):
  return freq(term, document)

def freq(term, document):
  return document.count(term)

doc_term_matrix = []
for doc in preprocessed_docs:
    #print 'The doc is "' + doc + '"'
    tf_vector = [tf(word, doc) for word in vocabulary_array]
    tf_vector_string = ', '.join(format(freq, 'd') for freq in tf_vector)
    #print 'The tf vector for Document %d is [%s]' % ((DOGC_text.index(doc)+1), tf_vector_string)
    doc_term_matrix.append(tf_vector)

In [79]:
len(doc_term_matrix)

204

In [122]:
##Normalitzem la matriu 
import math
import numpy as np

def l2_normalizer(vec):
    denom = np.sum([el**2 for el in vec])
    return [(el / math.sqrt(denom)) for el in vec]

doc_term_matrix_l2 = []
for vec in doc_term_matrix:
    doc_term_matrix_l2.append(l2_normalizer(vec))

In [123]:
print 'Shape sense norm: ', np.shape(doc_term_matrix)
print 'Shape amb norm: ', np.shape(doc_term_matrix_l2)

Shape sense norm:  (204L, 6478L)
Shape amb norm:  (204L, 6478L)


In [124]:
type(doc_term_matrix_l2)
doc_matrix = np.asarray(doc_term_matrix_l2)
doc_matrix.shape

(204L, 6478L)

-----------

## LDA 

Use LDA (Latent Dirichlet Allocation) with the normalized matrix

In [125]:
X = np.copy(doc_term_matrix)

In [138]:
#Afagem vocab i titles 
vocab = vocabulary_array
print 'Exemple vocab: ', vocab[30:40] #Nom de les columnes (bàsicament definiran els vectors binomics)

titles = text_translated #Nom de les mostres (títols del text que contenen le sparaules) 
print 'Exemple titles: ', titles[0][0:100]
print 'Mida vocab: ', len(vocab) 
print 'Mida titles: ', len(titles)

Exemple vocab:  [u'accedir' u'accept' u'acceptance' u'accepted' u'accepts' u'access'
 u'accessed' u'accessibility' u'accessible' u'accessing']
Exemple titles:  File 5 / O / 2015. Termination of licenses for the provision of digital terrestrial television broad
Mida vocab:  6478
Mida titles:  204


In [127]:
model = lda.LDA(n_topics=5, n_iter=1500, random_state=1) ## 5 topics 

In [128]:
model.fit(X)



<lda.lda.LDA instance at 0x0000000026E1BF88>

In [129]:
##Topic word
topic_word=model.topic_word_
print topic_word[1,:].sum()

1.0


In [133]:
n_top_words = 10 #Definim el nombre màxim de paraules
topics = list()
#Serveix per treure les 10 paraules més importants de cada topic 
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topics.append(topic_words)
    print'Topic ',i,': ',topic_words

Topic  0 :  [u'public' u'provision' u'article' u'procedure' u'administrative'
 u'agreement' u'document' u'accordance' u'service' u'appeal']
Topic  1 :  [u'staff' u'function' u'job' u'management' u'service' u'approved' u'care'
 u'provision' u'member' u'accordance']
Topic  2 :  [u'prize' u'school' u'research' u'work' u'audiovisual' u'award' u'jury'
 u'euro' u'communication' u'grant']
Topic  3 :  [u'test' u'selection' u'court' u'public' u'board' u'applicant' u'call'
 u'exercise' u'process' u'place']
Topic  4 :  [u'data' u'file' u'professional' u'exercise' u'information' u'right'
 u'service' u'address' u'access' u'number']


In [131]:
n_top_words = 30 #Definim el nombre màxim de paraules
topics = list()
#Serveix per treure les 10 paraules més importants de cada topic 
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topics.append(topic_words)
    print'Topic ',i,': ',topic_words

Topic  0 :  [u'public' u'provision' u'article' u'procedure' u'administrative'
 u'agreement' u'document' u'accordance' u'service' u'appeal' u'legal'
 u'resolution' u'contract' u'month' u'administration' u'case'
 u'corresponding' u'established' u'body' u'broadcasting' u'vote'
 u'publication' u'law' u'day' u'into' u'term' u'following' u'general'
 u'decision' u'two']
Topic  1 :  [u'staff' u'function' u'job' u'management' u'service' u'approved' u'care'
 u'provision' u'member' u'accordance' u'agreement' u'health' u'following'
 u'primary' u'structure' u'temporary' u'effect' u'director'
 u'organizational' u'proposal' u'adopted' u'body' u'approving'
 u'administrative' u'functioning' u'assigned' u'rule' u'professional'
 u'published' u'article']
Topic  2 :  [u'prize' u'school' u'research' u'work' u'audiovisual' u'award' u'jury'
 u'euro' u'communication' u'grant' u'education' u'following' u'right'
 u'author' u'project' u'rule' u'broadcasting' u'public' u'published'
 u'cash' u'student' u'awarded' u

In [134]:
topics = pd.DataFrame(topics)
topics.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,public,provision,article,procedure,administrative,agreement,document,accordance,service,appeal
1,staff,function,job,management,service,approved,care,provision,member,accordance
2,prize,school,research,work,audiovisual,award,jury,euro,communication,grant
3,test,selection,court,public,board,applicant,call,exercise,process,place
4,data,file,professional,exercise,information,right,service,address,access,number


In [139]:
#També es pot treure els topis dels documents 
doc_topic = model.doc_topic_ #Igual que l'anterior, una matriu amb el nombre de topics per cada títol
#cada topic té un pes i el que té major pes serà el que definirà el document. 
for i in range(5):
    print 'Topic: ', doc_topic[i].argmax(), '\nText: ', titles[i][0:100]

Topic:  0 
Text:  File 5 / O / 2015. Termination of licenses for the provision of digital terrestrial television broad
Topic:  2 
Text:  I. Section 5 of Article 29 of the Organic Statute and functioning of the Audiovisual Council of Cata
Topic:  2 
Text:  I Paragraph 5 of Article 29 of the Organic Statute and functioning of the Audiovisual Council of Cat
Topic:  4 
Text:  Article 20 of the Law 15/99 of December 13, Protection of Personal Data (Act), states that the creat
Topic:  4 
Text:  The College of Dietitians - Nutritionists Catalonia, for the exercise of its powers and functions re


In [140]:
classificacio = list() 
for i in range(doc_topic.shape[0]): 
    classificacio.append(doc_topic[i].argmax())

In [141]:
#Afegir nova columna amb la classificació
data['classificacio']=classificacio

In [142]:
#Guardar taules importants
classificacio = pd.DataFrame(classificacio)
#titles.to_csv('titles_i_class_21022016.csv',encoding='utf-8')
topics.to_csv('topics_27022016.csv', encoding='utf-8')
data.to_csv('data_angles_classificat_27022016.csv',encoding='utf-8')

In [145]:
data=pd.read_csv('data_angles_classificat_27022016.csv',encoding='utf-8')

In [146]:
data.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Area,Cerca,Data del document,Identificador,Organisme Emisor,Text,Tipus Document,Titol,URL,_id,classificacio
0,0,0,ACORD,Acord,18/11/2015,147_2015,Consell de l'Audiovisual de Catalunya,Expedient 5/O/2015. Extinció de les llicències...,Acord,"ACORD 147/2015, de 18 de novembre, del Ple del...",http://dogc.gencat.cat/ca/pdogc_canals_interns...,56bf19249a603e040a683e50,0
1,1,1,ACORD,Acord,18/11/2015,145_2015,Consell de l'Audiovisual de Catalunya,I. L’apartat 5 de l’article 29 de l’Estatut or...,Acord,"ACORD 145/2015, de 18 de novembre, del Ple del...",http://dogc.gencat.cat/ca/pdogc_canals_interns...,56bf192b9a603e040a683e51,2
2,2,2,ACORD,Acord,18/11/2015,146_2015,Consell de l'Audiovisual de Catalunya,I L’apartat 5 de l’article 29 de l’Estatut org...,Acord,"ACORD 146/2015, de 18 de novembre, del Ple del...",http://dogc.gencat.cat/ca/pdogc_canals_interns...,56bf19349a603e040a683e52,2
3,3,3,ACORD,Acord,04/11/2015,sobre la creació i supressió de fitxers de dad...,Col·legi d'Economistes de Catalunya,L'article 20 de la Llei Orgànica 15/99 de 13 d...,Acord,ACORD sobre la creació i supressió de fitxers ...,http://dogc.gencat.cat/ca/pdogc_canals_interns...,56bf19419a603e040a683e53,4
4,4,4,ACORD,Acord,10/11/2015,sobre creació d'un fitxer públic que conté dad...,Col·legi de Dietistes-Nutricionistes de Catalunya,El Col·legi de Dietistes – Nutricionistes de C...,Acord,ACORD sobre creació d'un fitxer públic que con...,http://dogc.gencat.cat/ca/pdogc_canals_interns...,56bf194b9a603e040a683e54,4


In [147]:
data_doc=data.groupby('Tipus Document')
data_class=data.groupby('classificacio')
count_class=data_class.Identificador.count()

In [148]:
#Text segons la classificació
Text_0=data[data.classificacio ==0].Text
Text_1=data[data.classificacio ==1].Text
Text_2=data[data.classificacio ==2].Text
Text_3=data[data.classificacio ==3].Text
Text_4=data[data.classificacio ==4].Text
Text_5=data[data.classificacio ==5].Text

In [149]:
print "Classificació de textos amb LDA \n"
print "Class 0=", len(np.unique(Text_0))
print "Class 1=", len(np.unique(Text_1))
print "Class 2=", len(np.unique(Text_2))
print "Class 3=", len(np.unique(Text_3))
print "Class 4=", len(np.unique(Text_4))
print "Total=", data.Text.count()


Classificació de textos amb LDA 

Class 0= 49
Class 1= 67
Class 2= 68
Class 3= 8
Class 4= 11
Total= 204


In [150]:
texts=data.Text

In [269]:
text_0=new_preprocessed_docs21[0]
text_0=str(text_0)
text_0.find('daniel sirera')

-1

In [327]:
find_persona=[]
count=0
for text in new_preprocessed_docs21:   
    text=str(text)
    if (text.find('lluis') != -1) & (text.find('recoder') != -1) : 
        find_persona.append(count)
    count +=1    

In [328]:
find_persona

[129]

In [330]:
new_preprocessed_docs21[129]
textos[129]

u"De conformitat amb el disposat a l'article 108.6 de la Llei org\xe0nica 5/1985, del r\xe8gim electoral general, en relaci\xf3 amb la disposici\xf3 transit\xf2ria segona de l'Estatut d'autonomia de Catalunya aprovat per Llei org\xe0nica 6/2006, de 19 de juliol, la Junta Electoral Central, en la seva reuni\xf3 de 16 de desembre de 2010, ha acordat la publicaci\xf3 en el Diari Oficial de la Generalitat de Catalunya dels resultats generals i per circumscripcions de les eleccions al Parlament de Catalunya convocades per Decret del President de la Generalitat de Catalunya 132/2010, de 4 d'octubre, i celebrades el 28 de novembre de 2010, d'acord amb les actes d'escrutini general i de proclamaci\xf3 d'electes remeses per les Juntes Electorals Provincials de la Comunitat Aut\xf2noma de Catalunya, recollint les dades que figuren en les esmentades actes.\nLa publicaci\xf3 s'ordena de la manera seg\xfcent:\nQuadre 1. Resum general. Electors (total d'electors: residents en Espanya, residents abse