# Arabic Topic Modeling with Textacy and spacy-udpipe
> Textacy is a powerful library for NLP, but if you open the docs you'll find it only supports the languages that spaCy supports. This notebook is a simple introduction to using textacy for Arabic, or for any other language that spaCy does not have a model for yet.

- toc: true
- branch: master
- badges: true
- comments: true
- author: Esraa Khaled
- categories: [fastpages, jupyter]

In [1]:
%%capture
!pip install spacy-udpipe
!pip install textacy

## Downloading udpipe model

> First, we have to download the Arabic model from this link: [Models](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3131) and upload it to our Colab notebook.

In [2]:
import spacy_udpipe
import textacy
import textacy.tm
import pandas as pd



In [None]:
#spacy_udpipe.download("ar")

Already downloaded a model for the 'ar' language


In [3]:
# Load the uploaded UDPipe model file and wrap it as a spaCy-style pipeline.
nlp = spacy_udpipe.load_from_path(lang="ar",
                                  path="./arabic-padt-ud-2.5-191206.udpipe",
                                  meta={"description": "Custom 'ar' model"})
text = "القاهرة هي المكان المفضل لدي"

doc = nlp(text)

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

القاهرة قَاهِرَة NOUN nsubj
هي هُوَ PRON nmod
المكان مَكَان NOUN ROOT
المفضل المفضل ADJ amod
لدي لَدَى ADP case
ي هُوَ PRON nmod


> Now we have our model as `nlp`, and we can use it with many other libraries.

In [4]:
df = pd.read_csv("/content/Without_namesAndSW.csv")
#df.info()

## Topic Modeling

To get the topics we need to go through these steps:

- Convert our data to a textacy corpus, so we can use the options textacy offers.
- Get the tokens of every document.
- Specify the vectorizer we want.
- Build the doc-term matrix. **Note: this matrix can be used with gensim models if you transpose it.**
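
To make the last step concrete, here is a minimal plain-Python sketch of what a doc-term matrix looks like and what transposing it does. The toy tokens and the helper names `build_doc_term_matrix` and `transpose` are made up for illustration; textacy builds a SciPy sparse matrix instead of nested lists. Rows are documents and columns are terms, so the transpose has one column per document, which is the orientation gensim's `matutils.Sparse2Corpus` expects by default as far as I know.

```python
from collections import Counter


def build_doc_term_matrix(tokenized_docs):
    """Return (matrix, vocabulary); matrix[i][j] counts term j in document i."""
    docs = [list(doc) for doc in tokenized_docs]
    vocabulary = sorted({term for doc in docs for term in doc})
    term_to_id = {term: idx for idx, term in enumerate(vocabulary)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for term, count in Counter(doc).items():
            row[term_to_id[term]] = count
        matrix.append(row)
    return matrix, vocabulary


def transpose(matrix):
    """Swap rows and columns: documents x terms becomes terms x documents."""
    return [list(column) for column in zip(*matrix)]


# Toy stand-in for the lemmas of three documents (hypothetical data).
toy_docs = [
    ["bank", "account", "bank"],
    ["rent", "tenant"],
    ["bank", "rent"],
]

doc_term, vocab = build_doc_term_matrix(toy_docs)
term_doc = transpose(doc_term)

print(vocab)
print(doc_term)   # one row per document
print(term_doc)   # one row per term
```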

In [5]:
corpus = textacy.Corpus(nlp, data=df['No_stopWords'])

In [6]:
print(corpus)

Corpus(29 docs, 44526 tokens)


In [7]:
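# One generator of lemmas per document (unigrams plus named entities).
# Note: this is a generator, so it can only be consumed once.
tokenized_docs = (
    (term.lemma_ for term in textacy.extract.terms(doc, ngs=1, ents=True))
    for doc in corpus)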

In [8]:
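# TF-IDF weighting with L2-normalized rows; drop terms that appear in fewer
# than 3 documents or in more than 95% of them.
vectorizer = textacy.representations.vectorizers.Vectorizer(
    tf_type="linear", idf_type="smooth", norm="l2",
    min_df=3, max_df=0.95)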
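
To see what these settings do, here is a rough plain-Python sketch of the weighting. It is not textacy's implementation: the helper `tfidf_rows` and the toy documents are hypothetical, and the smooth idf formula `log((n_docs + 1) / (df + 1)) + 1` is my reading of the textacy docs, so treat it as an assumption. An integer `min_df` or `max_df` is taken as an absolute document count and a float as a fraction of the documents.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs, min_df=1, max_df=1.0):
    """Return (rows, vocabulary) with linear tf, smooth idf and L2-normalized rows."""
    docs = [list(doc) for doc in tokenized_docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Integer bounds are absolute counts, float bounds are fractions of n_docs.
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    vocabulary = sorted(t for t, n in df.items() if low <= n <= high)
    # Assumed smooth idf formula (see the note above the code).
    idf = {t: math.log((n_docs + 1) / (df[t] + 1)) + 1.0 for t in vocabulary}
    rows = []
    for doc in docs:
        tf = Counter(t for t in doc if t in idf)
        weights = [tf[t] * idf[t] for t in vocabulary]
        length = math.sqrt(sum(w * w for w in weights))
        rows.append([w / length if length else 0.0 for w in weights])
    return rows, vocabulary


# Toy documents (hypothetical): "court" is in every document, so max_df drops it;
# "account" and "tenant" are in only one document, so min_df drops them.
toy_docs = [
    ["bank", "account", "court"],
    ["bank", "rent", "court"],
    ["rent", "tenant", "court"],
    ["bank", "rent", "court"],
]

rows, vocab = tfidf_rows(toy_docs, min_df=2, max_df=0.95)
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

With only 29 documents in our corpus, `max_df=0.95` removes terms that occur in 28 or more of them, and `min_df=3` removes terms that occur in only one or two.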

**Another note: you can also get the id2word dictionary from the vectorizer and use it in your own code. The mapping is only populated once the vectorizer has been fit.**

In [None]:
#collapse-hide

id2word = vectorizer.id_to_term

In [9]:
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)

In [10]:
doc_term_matrix

<29x1185 sparse matrix of type '<class 'numpy.float64'>'
	with 9628 stored elements in Compressed Sparse Row format>

In [11]:
# Fit a 4-topic NMF model on the doc-term matrix.
model = textacy.tm.topic_model.TopicModel("nmf", n_topics=4)
model.fit(doc_term_matrix)



In [12]:
model

TopicModel(n_topics=4, model=NMF)
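
For intuition about what NMF does with the doc-term matrix, here is a small plain-Python sketch. It factorizes a non-negative matrix `V` (documents x terms) into `W` (documents x topics) and `H` (topics x terms) so that `W @ H` approximates `V`. Note the assumptions: textacy's `TopicModel("nmf")` hands the work to scikit-learn, while this sketch uses the classic multiplicative-update rules, which are simpler to write down and not necessarily the solver scikit-learn runs. The function names and the toy matrix are made up for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - W H, i.e. how far the factorization is from V."""
    wh = matmul(w, h)
    squared = sum(
        (v[i][j] - wh[i][j]) ** 2
        for i in range(len(v))
        for j in range(len(v[0]))
    )
    return squared ** 0.5


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorize V into non-negative W and H using multiplicative updates."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        numer = matmul(wt, v)
        denom = matmul(matmul(wt, w), h)
        h = [
            [h[k][j] * numer[k][j] / (denom[k][j] + eps) for j in range(n_terms)]
            for k in range(n_topics)
        ]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        numer = matmul(v, ht)
        denom = matmul(w, matmul(h, ht))
        w = [
            [w[i][k] * numer[i][k] / (denom[i][k] + eps) for k in range(n_topics)]
            for i in range(n_docs)
        ]
    return w, h


# Toy matrix (hypothetical): documents 0-1 use terms 0-1, documents 2-3 use terms 2-3.
v = [
    [2.0, 1.0, 0.0, 0.0],
    [4.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 2.0, 6.0],
]

doc_topic, topic_term = nmf(v, n_topics=2)
print("reconstruction error:", round(frobenius_error(v, doc_topic, topic_term), 4))
for row in doc_topic:
    print([round(x, 3) for x in row])
```

In this picture, `doc_topic` plays the role of the doc-topic matrix we get from `model.transform`, and the rows of `topic_term` are what the top terms of each topic are read from.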

## Model Inspection:

**Top Topic Terms:**

In [13]:
# Project every document onto the topics, then list the top terms of each topic.
doc_topic_matrix = model.transform(doc_term_matrix)
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, topics=[0, 1, 2, 3]):
    print("topic", topic_idx, ":", "   ".join(top_terms))

topic 0 : بَنك   حِسَاب   المدعى   اِعتِمَاد   مَبلَغ   فَائِدَة   شَرِكَة   تَارِيخ   خَبِير   مَديُونِيَّة
topic 1 : أُجرَة   ـ   وَفَاء   مَستاجَر   بالاجرة   تَكلِيف   تَكرَار   إِعلَان   اِستِئنَاف   إِخلَاء
topic 2 : شَرِكَة   أَوَّل   قَرَار   جَمعِيَّة   شَرِيك   تَصفِيَة   عَمَل   إِدَارَة   ثَانِي   87
topic 3 : اَلَّذِي   2002   تَابَع   يكفى   دَفع   قَول   تَحقِيق   قِسم   اِثنَان   وكفايت




**Topic Weights:**

In [14]:
for i, val in enumerate(model.topic_weights(doc_topic_matrix)):
    print(i, val)

0 0.33786064340985056
1 0.4085472464974949
2 0.13763662553716402
3 0.11595548455549054
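
The four weights above add up to 1, which fits the idea that a topic's weight is its share of the total mass in the doc-topic matrix. Here is a minimal sketch of that computation, assuming the weights are column sums divided by the grand total; the function name `column_share` and the toy matrix are hypothetical and only illustrate the idea.

```python
def column_share(doc_topic_matrix):
    """Return each topic's share of the total weight across all documents."""
    col_sums = [sum(col) for col in zip(*doc_topic_matrix)]
    total = sum(col_sums)
    return [s / total for s in col_sums]


# Toy doc-topic matrix (hypothetical): 3 documents, 3 topics.
toy_doc_topic = [
    [0.5, 0.0, 0.5],
    [1.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
]

weights = column_share(toy_doc_topic)
for i, val in enumerate(weights):
    print(i, val)
```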


**Document Topics:**

In [None]:
for doc_idx, topics in model.top_doc_topics(doc_topic_matrix):
    print("ID: ", doc_idx, ":", topics)


ID:  0 : (0, 3, 2)
ID:  1 : (0, 2, 3)
ID:  2 : (0, 2, 3)
ID:  3 : (0, 3, 2)
ID:  4 : (0, 3, 2)
ID:  5 : (0, 3, 2)
ID:  6 : (0, 3, 2)
ID:  7 : (0, 3, 2)
ID:  8 : (0, 3, 2)
ID:  9 : (0, 2, 3)
ID:  10 : (3, 2, 0)
ID:  11 : (0, 2, 3)
ID:  12 : (0, 2, 3)
ID:  13 : (3, 2, 0)
ID:  14 : (0, 3, 2)
ID:  15 : (0, 1, 3)
ID:  16 : (0, 1, 3)
ID:  17 : (0, 1, 3)
ID:  18 : (0, 1, 3)
ID:  19 : (0, 1, 2)
ID:  20 : (0, 1, 3)
ID:  21 : (0, 2, 1)
ID:  22 : (0, 1, 3)
ID:  23 : (0, 1, 3)
ID:  24 : (0, 1, 3)
ID:  25 : (0, 1, 3)
ID:  26 : (0, 1, 2)
ID:  27 : (0, 2, 1)
ID:  28 : (0, 2, 1)
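
Each tuple above lists a document's strongest topics in descending order. A minimal sketch of that ranking, assuming it is simply a sort of each row of the doc-topic matrix by weight; `rank_topics` and the toy rows are made up for illustration and are not textacy's code.

```python
def rank_topics(doc_topic_matrix, top_n=3):
    """Yield (doc_idx, topic indices) with the heaviest topics first."""
    for doc_idx, weights in enumerate(doc_topic_matrix):
        ranked = sorted(range(len(weights)), key=lambda k: weights[k], reverse=True)
        yield doc_idx, tuple(ranked[:top_n])


# Toy doc-topic matrix (hypothetical): 2 documents, 4 topics.
toy_doc_topic = [
    [0.7, 0.0, 0.1, 0.2],
    [0.1, 0.6, 0.05, 0.25],
]

ranking = list(rank_topics(toy_doc_topic))
for doc_idx, topics in ranking:
    print("ID: ", doc_idx, ":", topics)
```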


In [None]:
# model.termite_plot(doc_term_matrix, vectorizer.id_to_term,
#                    topics=-1,  n_terms=25, sort_terms_by="seriation")

In [None]:
model.save("nmf-4topics.pkl")