- a software system that retrieves information from a collection of web content consisting of news articles and blogs

- smart search engines can be considered unsupervised learning approaches, because they cluster related information without any labels

- Search Engines have evolved **from a text input and output service to** an experience that cuts across voice, video, documents, and conversations


- an **infinite problem** to solve


- **related** to information retrieval, language understanding


- the **value that an effective search tool can bring to a business is enormous**; it can be a key piece of intellectual property. Often a search bar is the main interface between customers and the business.
    - it can create a competitive advantage by delivering an improved user experience.
    


popular search engine approaches:
- manual implementation with dataframe + tf-idf
- Elastic Search + BM25
- BM25 + Azure Cognitive Search

Requirements:
- Search index for storing each document, reflecting relevant and up-to-date information
    - data can be reorganized by date (suggestion)
- Query understanding
    - takes sentence and preprocessed data information **directly without much context**
    - we can extract words or tokens from the query to match **article_type** (suggestion)
        - query to match tags (done)
    - we can filter the search by either blog or News (suggestion)
        - or add multiple results available (blog, News, or both)
    - BM25 + Azure Cognitive search
- Query ranking
    - by cosine similarity
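
Ranking by cosine similarity can be sketched without any dependencies. The following is a minimal illustration, not the notebook's implementation. It assumes documents and queries are already cleaned and whitespace-tokenized, it uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, and the helper names (`tfidf_vectors`, `cosine`, `rank`) are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (idf per term, one sparse TF-IDF dict per document)."""
    n_docs = len(tokenized_docs)
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in tokenized_docs]
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_tokens, tokenized_docs):
    """Return (doc_index, score) pairs sorted from most to least similar."""
    idf, doc_vectors = tfidf_vectors(tokenized_docs)
    # query terms that never occur in the corpus are ignored
    q_vec = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    scores = [(i, cosine(q_vec, d)) for i, d in enumerate(doc_vectors)]
    return sorted(scores, key=lambda s: s[1], reverse=True)


# toy English corpus standing in for the cleaned Arabic documents
docs = [
    "school year starts in september".split(),
    "ministry announces new school buildings".split(),
    "football match postponed".split(),
]
ranking = rank("new school".split(), docs)
```

The document that contains both query terms ranks first, and a document that shares no term with the query gets a score of zero.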

## 1- Library and Data Imports

In [64]:
import numpy as np
import pandas as pd
import time

# for text cleaning and preprocessing
import re
from nltk.corpus import stopwords
import string 
from sklearn.feature_extraction.text import TfidfVectorizer

In [65]:
docs_df = pd.read_json('../Data/husna.json')

## 2- Data Preparation

#### 2.1 preparing data for cleaning

In [66]:
# drop metadata columns that are not used for search (columns= already implies axis=1)
docs_df = docs_df.drop(columns=['publisher', 'crawled_at', 'published_at'])

In [67]:
# drop rows that have neither content nor a title, then rebuild a contiguous index
empty_rows = docs_df[(docs_df['content'].str.len() == 0) & (docs_df['title'] == '')].index
docs_df = docs_df.drop(index=empty_rows).reset_index(drop=True)

In [68]:
docs_df['text'] = docs_df['content'].apply(lambda x: " ".join(x))

## 3- Data Cleaning

important data cleaning functions:
- remove punctuation
- tokenization 
- stem words

**cleaning functions not implemented**: removing repeating characters, stop words, emoji, hashtags

In [69]:
def show_info_text(df_col):
    # summarize the given text column: document count and a preview of the first 20 documents
    print(f"-> Number of Documents: {df_col.shape[0]}")
    print('-' * 50, end='\n\n')

    print('-> Documents - First 150 letters')
    print()
    for i, document_i in enumerate(df_col[:20]):
        print(f"Document Number {i+1}: {document_i[:150]}..")
        print()

    print('-' * 50)
    
def data_preprocessing(df_col):
    # Instantiate a TfidfVectorizer object
    vectorizer = TfidfVectorizer()
    
    # Fit the vocabulary and transform the documents into a sparse TF-IDF matrix
    X = vectorizer.fit_transform(df_col)
    # Transpose and densify into a (terms x documents) array
    X = X.T.toarray()
    # Create a DataFrame with the vocabulary as the index
    # (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
    df = pd.DataFrame(X, index=vectorizer.get_feature_names_out())
    return df, vectorizer

### 3.1 data cleaning (ver.1)

handle:
- removing mentions
- removing punctuation
- removing Arabic diacritics (short vowels and other harakat)
    - harakat and shadda (حركات وشد)
- removing elongation
    - madd (مد)
- removing stopwords (which is available in NLTK corpus)
    - normal stopwords (not specific to arabic)
- remove words from languages other than arabic and english
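
The steps listed above can be sketched with the standard `re` module. This is only an illustration under stated assumptions, not the ver.1 code itself, which is not shown here. It treats "elongation" as the tatweel character (U+0640), takes the stop-word set as an argument, and `clean_v1` is a hypothetical name.

```python
import re

# tanween, short vowels, shadda, sukun (U+064B..U+0652) and dagger alef (U+0670)
ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"                    # elongation character (kashida)
MENTION = re.compile(r"@\w+")         # @username style mentions
PUNCTUATION = re.compile(r"[^\w\s]")  # anything that is not a word character or whitespace


def clean_v1(text, stopwords=frozenset()):
    text = MENTION.sub(" ", text)           # mentions first, while '@' is still present
    text = ARABIC_DIACRITICS.sub("", text)  # diacritics before punctuation, so words are not split
    text = text.replace(TATWEEL, "")
    text = PUNCTUATION.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


cleaned = clean_v1("مَرْحَبـــاً @user، كيف الحال؟")
```

The order matters: combining marks are not word characters, so stripping punctuation before diacritics would break each vocalized word into fragments.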

**NOTE** 
- ~6000 source documents -> ~5000 documents -> ~89,000 tokens
- Clean time for 'text' column: ~111 seconds
- **Problems**:
    - may not be normalized enough
    - words from other languages
    - confusing numbers (remove or keep?)
        - remove english numbers? or arabic numbers? or both?
        - should we remove words of letters mixed with numbers (E.g. COVID19)
    - links (remove or keep?)
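
One way to make the open questions about numbers and links explicit is to turn each decision into a flag. The sketch below is a suggestion rather than the notebook's behaviour. It assumes whitespace tokenization, maps Arabic-Indic digits to ASCII digits so both forms match each other, and keeps mixed tokens such as COVID19. The function and flag names are hypothetical.

```python
import re

URL = re.compile(r"https?://\S+|www\.\S+")
# Arabic-Indic digits U+0660..U+0669 mapped to ASCII 0..9
ARABIC_INDIC_TO_ASCII = str.maketrans(
    "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669", "0123456789"
)
PURE_NUMBER = re.compile(r"^\d+$")


def normalize_numbers_and_links(text, drop_links=True, drop_pure_numbers=False):
    if drop_links:
        text = URL.sub(" ", text)
    text = text.translate(ARABIC_INDIC_TO_ASCII)
    tokens = text.split()
    if drop_pure_numbers:
        # only tokens made entirely of digits are removed; 'COVID19' is kept
        tokens = [t for t in tokens if not PURE_NUMBER.match(t)]
    return " ".join(tokens)


sample = "COVID19 \u0664\u0662 https://husna.fm/x 15"
kept = normalize_numbers_and_links(sample)
dropped = normalize_numbers_and_links(sample, drop_pure_numbers=True)
```

Running both variants over the corpus and comparing vocabulary sizes would give a concrete basis for the remove-or-keep decision.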
        

### 3.2 data cleaning (ver.2)

handle:
- removing Arabic diacritics (short vowels and other harakahs)
- variation by form and spelling, based on context (Orthographic Ambiguity)
- existence of many forms for the same word (Morphological Richness)
- dialects (Dialectal Variation)
- different ways to write the same word when writing in dialectal Arabic, for which there is no agreed-upon standard
    - Orthographic Inconsistency
- removing elongation and stop words
- remove words from languages other than arabic and english
   
Left unhandled, these problems can lead to immensely large vocabularies.
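
To see why spelling variation inflates the vocabulary, the sketch below collapses a few common letter variants with `str.translate`. It uses the same substitutions that the regex step in `text_clean2` applies. The helper name `normalize_letters` is hypothetical, and the word list is made up for illustration.

```python
# Each key is a variant letter, each value the canonical letter it is mapped to
VARIANTS = str.maketrans({
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0649": "\u064A",  # alef maksura          -> yeh
    "\u0624": "\u0621",  # waw with hamza        -> hamza
    "\u0626": "\u0621",  # yeh with hamza        -> hamza
    "\u0629": "\u0647",  # teh marbuta           -> heh
    "\u06AF": "\u0643",  # gaf                   -> kaf
})


def normalize_letters(text):
    return text.translate(VARIANTS)


# two spellings each of "Ahmad" and "school"
words = [
    "\u0623\u062D\u0645\u062F", "\u0627\u062D\u0645\u062F",
    "\u0645\u062F\u0631\u0633\u0629", "\u0645\u062F\u0631\u0633\u0647",
]
vocab_before = set(words)
vocab_after = {normalize_letters(w) for w in words}
```

Here four surface forms reduce to two vocabulary entries. The trade-off is that some genuinely different words may also be merged.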

In [70]:
docs_df_cleaned2 = docs_df.copy()

In [71]:
# import the dediacritization tool
from camel_tools.utils.dediac import dediac_ar

# Reducing Orthographic Ambiguity
from camel_tools.utils.normalize import normalize_alef_maksura_ar
from camel_tools.utils.normalize import normalize_alef_ar
from camel_tools.utils.normalize import normalize_teh_marbuta_ar

# tokenization
from camel_tools.tokenizers.word import simple_word_tokenize

# Morphological Disambiguation (Maximum Likelihood Disambiguator)
from camel_tools.disambig.mle import MLEDisambiguator
mle = MLEDisambiguator.pretrained() # instantiation of the MLE disambiguator

# tokenization / lemmatization (choosing approach that best fit the project)
from camel_tools.tokenizers.morphological import MorphologicalTokenizer

import re
from nltk.corpus import stopwords

In [72]:
stop_word_list = pd.read_csv('../Data/stop_words/list.csv')['words'].to_list()
tokenizer = MorphologicalTokenizer(mle, scheme='atbtok', diac=False) # 'atbtok' scheme
def text_clean2(txt):
    # remove stopwords
    txt = ' '.join(word for word in txt.split() if word not in stop_word_list)
    
    # dediacritization
    txt = dediac_ar(txt)
    
    # normalization: Reduce Orthographic Ambiguity and Dialectal Variation
    txt = normalize_alef_maksura_ar(txt)
    txt = normalize_alef_ar(txt)
    txt = normalize_teh_marbuta_ar(txt)
    
    # normalization: Reducing Morphological Variation
    tokens = simple_word_tokenize(txt)
    disambig = mle.disambiguate(tokens)
    lemmas = [d.analyses[0].analysis['lex'] for d in disambig]
    tokens = tokenizer.tokenize(lemmas)
    txt = ' '.join(tokens)
    
    # letter normalization: unify alef, yeh, hamza, teh marbuta and gaf variants
    txt = re.sub("[إأآا]", "ا", txt)
    txt = re.sub("ى", "ي", txt)
    txt = re.sub("ؤ", "ء", txt)
    txt = re.sub("ئ", "ء", txt)
    txt = re.sub("ة", "ه", txt)
    txt = re.sub("گ", "ك", txt)
    
    # replace every run of characters that are not Arabic, English letters, digits, whitespace, or '.' with a space
    txt = re.sub(r'[^a-zA-Z\s0-9\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f\ufd50-\ufd8f\ufe70-\ufefc\uFDF0-\uFDFD.]+',
                 ' ', txt)
    
    return txt

In [73]:
# apply to your text column
docs_df_cleaned2 = docs_df.drop(columns=['_id', 'summary', 'content'])
start_time = time.time()
docs_df_cleaned2['text_clean'] = docs_df['text'].apply(text_clean2)
docs_df_cleaned2['title_clean'] = docs_df['title'].apply(text_clean2)
time_measure = (time.time() - start_time) * 10**3

In [74]:
docs_df_cleaned2['content_clean'] = docs_df_cleaned2['title_clean'] + " " + docs_df_cleaned2['text_clean']
docs_df_cleaned2['doc_id'] = docs_df_cleaned2.index

In [75]:
docs_df_cleaned2.head(2)

Unnamed: 0,title,url,tags,article_type,text,text_clean,title_clean,content_clean,doc_id
0,التربية: تحويل 42 مدرسة إلى نظام الفترتين واستئجار 15 مدرسة هذا العام,https://husna.fm/%D9%85%D8%AD%D9%84%D9%8A/%D8%A7%D9%84%D8%AA%D8%B1%D8%A8%D9%...,"[التربية والتعليم, وزارة التربية والتعليم]",News,أكدت أمين عام وزارة التربية والتعليم للشؤون المالية والإدارية الدكتورة نجو...,اكد امين وزاره تربيه تعليم شان مالي اداري دكتور نجوي القبيلات ان توسع دوام ط...,تربيه تحويل 42 مدرسه الي نظام فتره استءجار 15 مدرسه عام,تربيه تحويل 42 مدرسه الي نظام فتره استءجار 15 مدرسه عام اكد امين وزاره ترب...,0
1,تكريما للمعلمين زيادة منح أبناء المعلمين 550 مقعدا إضافيا بالجامعات,https://husna.fm/%D9%85%D8%AD%D9%84%D9%8A/%D8%B2%D9%8A%D8%A7%D8%AF%D8%A9-%D9...,"[مكرمة أبناء المعلمين, وزارة التربية والتعليم]",News,احتفلت وزارة التربية والتعليم بيوم المعلم بتكريمها نخبة من المعلمات والمعل...,احتفل وزاره تربيه تعليم يوم معلم تكريم نخبه معلم معلم مختلف مديريه تربيه تعل...,تكريم معلم زياده منح ابناء معلم 550 مقعد اضافي جامعه,تكريم معلم زياده منح ابناء معلم 550 مقعد اضافي جامعه احتفل وزاره تربيه تعليم...,1


In [20]:
# docs_df_cleaned2.to_csv('../Data/processed/SE_data4.csv', index=False)

In [185]:
# fit + transform time

start_time = time.time()
text_clean_enc_df, clean2_vect = data_preprocessing(docs_df_cleaned2['text_clean'])
# text_clean_enc_df = clean2_vect.transform(docs_df_cleaned2['text_clean'])
time_measure = (time.time() - start_time) * 10**3

print('time measure: ', time_measure)
# text_clean_enc_df

time measure:  1535.5424880981445




- ~6000 source documents -> ~5000 documents -> ~23,000 tokens
- Clean time for 'text' column: ~180 seconds
- Problems:
    - stopwords (remove or keep?)
    - normalization may have cut out too many tokens 
    - confusing numbers (remove or keep?)
        - remove english numbers? or arabic numbers? or both?
    - should we remove words of letters mixed with numbers (E.g. COVID19)
    - links (remove or keep?)
    
**NOTE** discuss with instructor before proceeding

### 3.3 check results

In [186]:
# display(docs_df_cleaned.head())
i=0

In [224]:
# check results
print(f'--> {i}')
display(text_clean_enc_df.index[50*i:50*(i+1)])
display(text_clean_enc_df.index[-50*(i+1):(-50*i)-1])
i += 1

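# note: time_measure was last set by the TF-IDF fit/transform cell, so this reports vectorization time
print('clean time: {:.2f} seconds'.format(time_measure * 10**-3))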

--> 37


Index(['732811', '733', '7331', '734', '734811', '735', '7354', '73600', '737',
       '7375659', '737888', '738', '7383', '739', '739015', '73997', '74',
       '7407053', '741', '7413', '742521', '743', '743331', '74413', '745667',
       '746', '7469', '747', '748', '749046', '74915', '75', '750', '7500',
       '7508948', '751', '7520', '753', '754', '755', '7565', '758', '75921',
       '75zdsxi5bf', '76', '760', '76003', '761', '762', '763'],
      dtype='object')

Index(['هجمه', 'هجهوج', 'هجو', 'هجوبا', 'هجوم', 'هجوي', 'هجي', 'هجين', 'هد',
       'هدا', 'هداء', 'هدار', 'هداريم', 'هداف', 'هدام', 'هدايه', 'هدد', 'هدر',
       'هدرا', 'هدف', 'هدم', 'هدنه', 'هدهد', 'هدوء', 'هدوءا', 'هدول', 'هدي',
       'هديب', 'هدير', 'هديل', 'هديه', 'هذ', 'هذا', 'هذلول', 'هراء', 'هراوه',
       'هرب', 'هرتزليا', 'هرتسليا', 'هرتسوج', 'هرتسوغ', 'هرس', 'هرسك', 'هرش',
       'هرطقات', 'هرطقه', 'هرع', 'هرف', 'هرقل'],
      dtype='object')

clean time: 1.54 seconds


In [None]:
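# vocabulary_ maps each term to its column index, not to a frequency
vocab_ = clean2_vect.vocabulary_
print(f"number of unique words: {len(vocab_)}")
# text_clean_enc_df is (terms x documents), so summing each row gives a term's total TF-IDF weight
term_weights = text_clean_enc_df.sum(axis=1)
top_word, top_weight = term_weights.idxmax(), term_weights.max()
print('highest-weighted word is --> {} (total TF-IDF weight {:.2f})'.format(top_word, top_weight))
score = len(vocab_) / top_weight
print('Ratio: {:.3f}'.format(score))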
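
Because `TfidfVectorizer.vocabulary_` maps terms to column indices rather than counts, raw term frequencies have to be computed separately. Below is a minimal stdlib sketch. It assumes the cleaned documents are whitespace-separated strings such as those produced by `text_clean2`, and `vocabulary_stats` is a hypothetical helper name. Counts may differ slightly from the vectorizer's vocabulary, since its default token pattern ignores single-character tokens.

```python
from collections import Counter


def vocabulary_stats(cleaned_texts):
    """Count token occurrences over all documents and summarize the vocabulary."""
    counts = Counter(tok for text in cleaned_texts for tok in text.split())
    top_word, top_count = counts.most_common(1)[0]
    return {
        "unique_words": len(counts),
        "top_word": top_word,
        "top_count": top_count,
        # vocabulary size relative to the frequency of the most common word
        "ratio": len(counts) / top_count,
    }


stats = vocabulary_stats(["a b a", "a c"])
```

In the notebook this could be called as `vocabulary_stats(docs_df_cleaned2['text_clean'])`.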

---

## 4- Apply Cleaning on Query

In [157]:
stop_word_list = pd.read_csv('../Data/stop_words/list.csv')['words'].to_list()
tokenizer = MorphologicalTokenizer(mle, scheme='atbtok', diac=False) # 'atbtok' scheme
def text_clean2_steps(txt):
    # remove stopwords
    txt = ' '.join(word for word in txt.split() if word not in stop_word_list)
    print('stopwords', txt)
    
    # dediacritization
    txt = dediac_ar(txt)
    print('dediacritization', txt)
    
    # normalization: Reduce Orthographic Ambiguity and Dialectal Variation
    txt = normalize_alef_maksura_ar(txt)
    print('Reduce Orthographic Ambiguity and Dialectal Variation', txt)
    txt = normalize_alef_ar(txt)
    print('Reduce Orthographic Ambiguity and Dialectal Variation', txt)
    txt = normalize_teh_marbuta_ar(txt)
    print('Reduce Orthographic Ambiguity and Dialectal Variation', txt)
    
    # normalization: Reducing Morphological Variation
    tokens = simple_word_tokenize(txt)
    disambig = mle.disambiguate(tokens)
    lemmas = [d.analyses[0].analysis['lex'] for d in disambig]
    tokens = tokenizer.tokenize(lemmas)
    txt = ' '.join(tokens)
    print('Reducing Morphological Variation', txt)
    
    # letter normalization: unify alef, yeh, hamza, teh marbuta and gaf variants
    txt = re.sub("[إأآا]", "ا", txt)
    txt = re.sub("ى", "ي", txt)
    txt = re.sub("ؤ", "ء", txt)
    txt = re.sub("ئ", "ء", txt)
    txt = re.sub("ة", "ه", txt)
    txt = re.sub("گ", "ك", txt)
    print('remove longation', txt)
    
    # remove non-arabic words, or non-numbers, or non-english words in the text
    txt = re.sub(r'[^a-zA-Z\s0-9\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f\ufd50-\ufd8f\ufd50-\ufd8f\ufe70-\ufefc\uFDF0-\uFDFD.0-9]+'
                 ,' ', txt)
    print('remove non-arabic/non-english/non-number words', txt)
    
    return txt

In [164]:
from farasa.stemmer import FarasaStemmer

In [226]:
stemmer = FarasaStemmer()

query_test = 'يذهب'
stemmed_text = stemmer.stem(query_test)                                     
print(stemmed_text)

ذهب


In [182]:
query_test = '،ذهب'
query_test_cleaned = text_clean2_steps(query_test)
query_test_cleaned

stopwords ،ذهب
dediacritization ،ذهب
Reduce Orthographic Ambiguity and Dialectal Variation ،ذهب
Reduce Orthographic Ambiguity and Dialectal Variation ،ذهب
Reduce Orthographic Ambiguity and Dialectal Variation ،ذهب
Reducing Morphological Variation ، ذهب
remove longation ، ذهب
remove non-arabic/non-english/non-number words ، ذهب


'، ذهب'

## 5- Building Search Engine with Haystack

`Haystack` is an end-to-end framework that enables you to build powerful and production-ready pipelines for different search use cases. 

We'll use Haystack to build a scalable semantic search engine using state-of-the-art NLP models. It is built in a modular fashion so that you can combine the best technology from other open-source projects like Hugging Face's Transformers, Elasticsearch, or Milvus.

There are 3 major components to Haystack:

- **Document Store**: Database storing the documents for our search. The Haystack documentation recommends Elasticsearch, but more lightweight options (SQL or in-memory) are available for fast prototyping.

- **Retriever**: Fast, simple algorithm that identifies candidate passages from a large collection of documents. Algorithms include TF-IDF or BM25, custom Elasticsearch queries, and embedding-based approaches. The Retriever helps to narrow down the scope for the Reader to smaller units of text where a given question could be answered.

- **Reader**: Powerful neural model that reads through texts in detail to find an answer. It can use diverse models like BERT, RoBERTa or XLNet trained via FARM or Transformers on SQuAD-like tasks. The Reader takes multiple passages of text as input and returns the top-n answers with corresponding confidence scores. You can load a pretrained model from Hugging Face's model hub or fine-tune it on your own domain data.

In [145]:
from haystack.pipelines import DocumentSearchPipeline

In [3]:
# importing necessary dependencies

from haystack.pipelines import ExtractiveQAPipeline
from haystack.utils.cleaning import clean_wiki_text
from haystack.utils import convert_files_to_docs, fetch_archive_from_http  
from haystack.nodes import FARMReader 
# from haystack.utils import print_answers



Haystack finds answers to queries within the documents stored in a `DocumentStore`. The current implementations of DocumentStore include `ElasticsearchDocumentStore`, `SQLDocumentStore`, and `InMemoryDocumentStore`.

The Haystack documentation recommends `ElasticsearchDocumentStore` because it **comes preloaded** with features like full-text queries, BM25 retrieval, and vector storage for text embeddings.

In [125]:
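import os
from haystack.document_stores import ElasticsearchDocumentStore

# read the password from the environment instead of hard-coding it (the variable name is an assumption)
ELASTIC_PASSWORD = os.environ["ELASTIC_PASSWORD"]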
document_store = ElasticsearchDocumentStore(host="localhost", port="9200", scheme='https',
                                            username="elastic", password=ELASTIC_PASSWORD, 
                                            index="se_shai_haystack", ca_certs="../Certs/http_ca.crt")

In [126]:
# delete all added documents (from previous runs)
document_store.delete_all_documents()

In [128]:
# write the dictionaries containing documents to our DB.
document_store.write_documents(docs_df_cleaned2[['content_clean', 'title', 'url', 'tags', 'doc_id']]
                               .rename(columns={'content_clean': 'content'}).to_dict(orient='records'))

`Retrievers` help narrow down the scope for the Reader to smaller units of text where a given question could be answered. Here we use Elasticsearch's default BM25 algorithm.
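
To make the BM25 ranking less of a black box, here is a minimal pure-Python sketch of the Okapi BM25 score. It illustrates the formula and is not Elasticsearch's implementation. It assumes a non-empty list of pre-tokenized documents and uses `k1=1.2` and `b=0.75`, which are Elasticsearch's default values. The function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    # document frequency: in how many documents each term appears
    doc_freq = Counter(t for d in tokenized_docs for t in set(d))

    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # rarer terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # term frequency saturates with k1 and is normalized by document length via b
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


# cleaned tokens in the style of the indexed documents
corpus = [
    "عطله مولد نبوي شريف".split(),
    "تربيه تحويل مدرسه نظام".split(),
]
scores = bm25_scores("مولد نبوي".split(), corpus)
```

Unlike plain TF-IDF, repeated occurrences of a term add progressively less to the score, and longer documents are penalized relative to the average length.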

In [112]:
from haystack.nodes import BM25Retriever
 
retriever = BM25Retriever(document_store=document_store)

A `Reader` scans the texts returned by retrievers in detail and extracts the k best answers. They are based on powerful, but slower deep learning models.

`Haystack` currently supports Readers based on the frameworks `FARM` and `Transformers`. With both you can either load a local model or one from `Hugging Face's` model hub (https://huggingface.co/models).

In [141]:
# RoBERTa-based extractive QA model loaded through a FARM-based Reader; RoBERTa-base is an English model,
# while the indexed text is Arabic. The commented-out lines are alternative models.
# reader = FARMReader(model_name_or_path="aubmindlab/bert-base-arabertv2", use_gpu=True, context_window_size=1000)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2-covid", use_gpu=True, context_window_size=1000)
# reader = FARMReader(model_name_or_path="sentence-transformers/all-MiniLM-L6-v2", use_gpu=True, context_window_size=1000)

# reader = FARMReader(model_name_or_path="ZeyadAhmed/AraElectra-Arabic-SQuADv2-QA", use_gpu=True, context_window_size=1000)
# reader = FARMReader(model_name_or_path="wissamantoun/araelectra-base-artydiqa", use_gpu=True, context_window_size=1000)

**FARMReader**: FARM makes Transfer Learning with BERT & Co simple, fast and enterprise-ready. It's built upon transformers and provides additional features to simplify the life of developers: Parallelized preprocessing, highly modular design, multi-task learning, experiment tracking, easy debugging and close integration with AWS SageMaker.

With a `Haystack Pipeline` you can stick together your building blocks to a search pipeline. Under the hood, Pipelines are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases. To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the ExtractiveQAPipeline that combines a retriever and a reader to answer our questions.
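
The retriever-then-reader flow can be pictured with a toy chain of two nodes. The classes below are purely illustrative stand-ins and are not Haystack's API: the "retriever" is a keyword-overlap filter, the "reader" returns the first matching sentence, and every name is hypothetical. Only the shape of the `params` dictionary mirrors the `pipe.run(...)` calls used in this notebook.

```python
class KeywordRetriever:
    """Toy retriever: keeps documents that share at least one token with the query."""

    def __init__(self, documents):
        self.documents = documents

    def run(self, query, top_k=10):
        q = set(query.split())
        hits = [(len(q & set(d["content"].split())), d) for d in self.documents]
        hits = [h for h in hits if h[0] > 0]
        hits.sort(key=lambda h: h[0], reverse=True)
        return [d for _, d in hits[:top_k]]


class FirstSentenceReader:
    """Toy reader: returns the first sentence of each document that contains a query token."""

    def run(self, query, documents, top_k=1):
        q = set(query.split())
        answers = []
        for d in documents:
            for sentence in d["content"].split("."):
                if q & set(sentence.split()):
                    answers.append({"answer": sentence.strip(), "url": d["url"]})
                    break
        return answers[:top_k]


class ToyPipeline:
    """Runs the retriever first, then passes its documents on to the reader."""

    def __init__(self, reader, retriever):
        self.reader = reader
        self.retriever = retriever

    def run(self, query, params):
        docs = self.retriever.run(query, **params.get("Retriever", {}))
        answers = self.reader.run(query, docs, **params.get("Reader", {}))
        return {"query": query, "documents": docs, "answers": answers}


documents = [
    {"content": "holiday announced. schools closed on holiday", "url": "u1"},
    {"content": "football match", "url": "u2"},
]
toy_pipe = ToyPipeline(FirstSentenceReader(), KeywordRetriever(documents))
result = toy_pipe.run("holiday", params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 1}})
```

The point of the split is cost: the cheap first stage narrows thousands of documents down to a handful, so the expensive second stage only runs on those.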

In [172]:
# finder = Finder(reader, retriever)
pipe = ExtractiveQAPipeline(reader, retriever)
# pipe = DocumentSearchPipeline(retriever)

### results

In [152]:
SE_data4 = pd.read_csv('../Data/processed/SE_data4.csv')

In [149]:
start_time = time.time()


q = "مولد نبوي"
q_processed = text_clean2(q)
number_of_answers_to_fetch = 5
print()

# prediction = finder.get_answers(question=question, top_k_retriever=20, top_k_reader=number_of_answers_to_fetch)
prediction = pipe.run(
    query=q_processed, params=
    {"Retriever": {"top_k": 10}}
)

print(f"Query: {prediction['query']}")
print("\n")
for i in range(len(prediction['documents'])):
    print(f"#{i+1}")
#     print(prediction)
#     print(f"Answer: {prediction['answers'][i]['answer']}")
    print(f"Url: {prediction['documents'][i].meta['url']}")    # from data stored in db
    print(f"Title: {prediction['documents'][i].meta['title']}")  # from data stored in db
    print(f"Text: {prediction['documents'][i].content}")    # from data stored in db
#     print(f"Context: {prediction['answers'][i]['context']}")  # a 1000 words context around the answer
    print('\n\n')
    
print('search duration: ', time.time() - start_time, 's')  # time.time() differences are in seconds


Query: مولد نبوي


#1
Url: https://husna.fm/%D9%85%D8%AD%D9%84%D9%8A/%D8%A7%D9%84%D8%B9%D9%85%D9%84-%D8%B9%D8%B7%D9%84%D8%A9-%D8%A7%D9%84%D9%85%D9%88%D9%84%D8%AF-%D8%A7%D9%84%D9%86%D8%A8%D9%88%D9%8A
Title: العمل: عطلة المولد النبوي الشريف تشمل القطاع الخاص
Text: عمل   عطله مولد نبوي شريف شمل قطاع خاص اكد وزاره عمل ان عطل رسمي ماجور عامل مءسسه قطاع خاص اذا عمل استحق اجره اضافي واقع   150     جري معتاد وفق حكم ماده 59   ب قانون عمل اردني رقم   8   سنه 1996 تعديل . جاء توضيح ضوء بلاغ صادر رءيس وزير تعطيل وزاره داءره رسمي مءسسه هيءه عام احتفاء ذكري مولد نبوي شريف صادف سبت موافق ثامن اول ، مءكد ان بلاغ شمل قطاع خاص .



#2
Url: https://husna.fm/%D9%85%D8%AD%D9%84%D9%8A/%D8%A7%D9%84%D9%85%D9%88%D9%84%D8%AF-%D8%A7%D9%84%D9%86%D8%A8%D9%88%D9%8A-%D8%B9%D8%B7%D9%84%D8%A9-%D8%B1%D8%B3%D9%85%D9%8A%D8%A9
Title: تعطيل الوزارات والدوائر الرسمية احتفاء بذكرى المولد النبوي
Text: تعطيل وزاره داءره رسمي احتفاء ذكري مولد نبوي تقرر تعطيل وزاره داءره رسمي مءسسه هيءه عام جامعه رسمي بلديه مجلس خدمه مشترك اما

In [176]:
start_time = time.time()


q = "مولد نبوي"
q_processed = text_clean2(q)
number_of_answers_to_fetch = 5
print()

# prediction = finder.get_answers(question=question, top_k_retriever=20, top_k_reader=number_of_answers_to_fetch)
prediction = pipe.run(
    query=q_processed, params=
    {"Retriever": {"top_k": 10}, "Reader": {"top_k": number_of_answers_to_fetch}}
)

# prediction['documents'][0].meta
# prediction['answers']
print(f"Query: {prediction['query']}")
print("\n")
# for i in range(len(prediction)):
#     print(f"#{i+1}")
# #     print(prediction)
# #     print(f"Answer: {prediction['answers'][i]['answer']}")
#     print(f"Url: {prediction['documents'][i].meta['url']}")    # from data stored in db
#     print(f"Title: {prediction['documents'][i].meta['title']}")  # from data stored in db
#     print(f"Text: {prediction['documents'][i].content}")    # from data stored in db
# #     print(f"Context: {prediction['answers'][i]['context']}")  # a 1000 words context around the answer
#     print('\n\n')
    
print('search duration: ', time.time() - start_time, 's')  # time.time() differences are in seconds


Inferencing Samples: 100%|█████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.56 Batches/s]

Query: مولد نبوي


search duration:  1.3164136409759521 s





In [32]:
# prediction['documents']

In [179]:
prediction['answers'][0]

[<Answer {'answer': 'ي', 'type': 'extractive', 'score': 0.2571229934692383, 'context': 'تعطيل وزاره داءره رسمي احتفاء ذكري مولد نبوي تقرر تعطيل وزاره داءره رسمي مءسسه هيءه عام جامعه رسمي بلديه مجلس خدمه مشترك امان عمان كبري شركه مملوك كامل حكومه ، ثلاثاء مقبل ، شهر ربيع اول 1443 هجري ، موافق تاسع شهر اول سنه 2021 ميلاد ، احتفاء ذكري مولد نبوي شريف . جاء قرار رءيس وزير دكتور شر الخصاونه ان بلاغ استثني وزاره داءره مءسسه رسمي اقتضي طبيعه عمل خلاف ذلك . اكد رءيس وزير وزاره داءره رسمي مءسسه هيءه عام اسهام ابراز مناسبه جليل ظهر ما لاق ب .', 'offsets_in_document': [{'start': 245, 'end': 246}], 'offsets_in_context': [{'start': 245, 'end': 246}], 'document_id': '90ed7478671f3438e4ed0047912f501c', 'meta': {'title': 'تعطيل الوزارات والدوائر الرسمية احتفاء بذكرى المولد النبوي', 'url': 'https://husna.fm/%D9%85%D8%AD%D9%84%D9%8A/%D8%A7%D9%84%D9%85%D9%88%D9%84%D8%AF-%D8%A7%D9%84%D9%86%D8%A8%D9%88%D9%8A-%D8%B9%D8%B7%D9%84%D8%A9-%D8%B1%D8%B3%D9%85%D9%8A%D8%A9', 'tags': ['المولد النبوي', 'النبي محمد صلّ

In [104]:
# # different approach
# from haystack.nodes import EmbeddingRetriever
# from haystack.pipelines import FAQPipeline #initialize a pipeline and ask questions

# ELASTIC_PASSWORD = os.environ["ELASTIC_PASSWORD"]  # read from the environment rather than hard-coding it
# document_store = ElasticsearchDocumentStore(host="localhost", port="9200", scheme='https',
#                                             username="elastic", password=ELASTIC_PASSWORD, 
#                                             index="se_shai_haystack2", ca_certs="../Certs/http_ca.crt",
#                                             embedding_dim=384, embedding_field="content_emb",  
# #                                             excluded_meta_data=["content"],
#                                             similarity='cosine' )
# document_store.delete_all_documents()

# retriever = EmbeddingRetriever( 
#     document_store=document_store, embedding_model="sentence-transformers/all-MiniLM-L6-v2", 
#     use_gpu=True , model_format='sentence_transformers'
#     #instead of retrieving via Elasticsearch's plain BM25, we want to use vector similarity of 
#     #the questions (user question vs. FAQ ones).
# )
# document_store.write_documents(docs_df_cleaned2[['content_clean', 'title', 'url', 'tags', 'doc_id']]
#                                .rename(columns={'content_clean': 'content'}).to_dict(orient='records'))
# document_store.update_embeddings(retriever=retriever)

# pipe = FAQPipeline(retriever=retriever) 




# q = "نبي"
# q_processed = text_clean2(q)
# number_of_answers_to_fetch = 10
# print(q_processed)

# # prediction = finder.get_answers(question=question, top_k_retriever=20, top_k_reader=number_of_answers_to_fetch)
# # prediction = pipe.run(
# #     query=q_processed, params=
# #     {"Retriever": {"top_k": 10}}
# # )
# prediction = retriever.retrieve(q_processed, top_k = 10)