- a software system that retrieves information from the web

- smart search engines can be considered unsupervised learning approaches, because they cluster related information without any labels

- Search Engines have evolved **from a text input and output service to** an experience that cuts across voice, video, documents, and conversations


- an **infinite problem** to solve


- **related** to information retrieval, language understanding


- the **value that an effective search tool can bring to a business is enormous**; a key piece of intellectual property. Often a search bar is the main interface between customers and the business. 
    - create a competitive advantage by delivering an improved user experience.
    


popular search engine approaches:
- Elasticsearch + TF-IDF
- BM25 + Azure Cognitive Search
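
As a rough illustration of the BM25 scoring named in the list above, here is a minimal pure-Python sketch. The function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the toy corpus are assumptions made for illustration; this is not how Elasticsearch or Azure Cognitive Search implement ranking internally.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # document frequency: number of documents that contain each term
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # term-frequency saturation with document-length normalization
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# hypothetical toy corpus, already tokenized
docs = [["gas", "europe", "winter"], ["school", "summer", "time"], ["gas", "price"]]
scores = bm25_scores(["gas"], docs)
```

Compared with plain TF-IDF, the length normalization means a short document that mentions the query term ranks above a longer one with the same term count.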

Requirements:
- Search index for storing each document, reflecting relevant and up-to-date information
    - data can be reorganized by date (suggestion)
- Query understanding
    - takes the query sentence and the preprocessed data **directly, without much context**
    - we can extract words or tokens from the query to match **article_type** (suggestion)
        - query to match tags (done)
    - we can filter the search by either blog or News (suggestion)
        - or add multiple results available (blog, News, or both)
    - BM25 + Azure Cognitive search
- Query ranking
    - by cosine similarity

## 1- Library and Data Imports

In [1]:
import numpy as np
import pandas as pd
import time

# for text cleaning and preprocessing
import re
from nltk.corpus import stopwords
import string 
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
docs_df = pd.read_json('../Data/husna.json')

## 2- Data Preparation

#### 2.1 preparing data for cleaning

In [3]:
# MODIFIED
docs_df = docs_df.drop(columns=['publisher', 'crawled_at', 'url', 'published_at'])

In [4]:
# drop rows that have neither content nor a title, then renumber the index
empty_rows = (docs_df['content'].str.len() == 0) & (docs_df['title'] == '')
docs_df = docs_df[~empty_rows].reset_index(drop=True)

In [5]:
docs_df['text'] = docs_df['content'].apply(lambda x: " ".join(x))

## 3- Data Cleaning

important data cleaning functions:
- remove punctuation
- tokenization 
- stem words

**cleaning functions not implemented**: removing repeating characters, stop words, emoji, hashtags
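
The steps listed as not implemented could be sketched roughly as below. The helper name `extra_clean`, the emoji code-point ranges (which are not exhaustive), and the choice to collapse three or more repeated letters into one are all assumptions, not decisions taken in this notebook.

```python
import re

# common emoji and symbol blocks (an approximation, not a complete emoji list)
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]+")
HASHTAG_RE = re.compile(r"#(\w+)")
# a letter (not a digit) repeated three or more times in a row
REPEAT_RE = re.compile(r"([^\W\d])\1{2,}")

def extra_clean(txt):
    txt = EMOJI_RE.sub(" ", txt)        # drop emoji
    txt = HASHTAG_RE.sub(r"\1", txt)    # keep the hashtag word, drop the '#'
    txt = REPEAT_RE.sub(r"\1", txt)     # collapse elongated letters, leave numbers alone
    return " ".join(txt.split())        # squeeze whitespace

print(extra_clean("جمييييل #news 😀 1000"))
```

Digits are excluded from the repeat rule so that a number such as 1000 is not shortened.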

In [6]:
def show_info_text(df_col):
    """Print the document count and the first 150 characters of the first 20 documents."""
    print(f"-> Number of Documents: {df_col.shape[0]}")
    print('-' * 50, end='\n\n')

    print('-> Documents - First 150 letters')
    print()
    # use the column that was passed in instead of the global docs_df
    for i, document_i in enumerate(df_col[:20]):
        print(f"Document Number {i+1}: {document_i[:150]}..")
        print()

    print('-' * 50)
    
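def data_preprocessing(df_col):
    # Instantiate a TfidfVectorizer object
    vectorizer = TfidfVectorizer()
    
    # Fit on the documents and transform them into a sparse document-term matrix
    X = vectorizer.fit_transform(df_col)
    # Transpose to term-by-document and densify (memory heavy for large corpora)
    X = X.T.toarray()
    # Create a DataFrame with the vocabulary as the index and one column per document
    # (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs >= 1.0)
    df = pd.DataFrame(X, index=vectorizer.get_feature_names_out())
    return df, vectorizer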

### 3.1 data cleaning (ver.1)

handle:
- removing mentions
- removing punctuation
- removing Arabic diacritics (short vowels and other harakahs) 
    - حركات وشد
- removing elongation 
    - مد
- removing stopwords (which is available in NLTK corpus)
    - normal stopwords (not specific to arabic)
- remove words from languages other than arabic and english

In [7]:
docs_df_cleaned = docs_df.copy()

In [8]:
# punctuation symbols
punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ''' + string.punctuation

# stop words for every language in the NLTK corpus (not Arabic only), kept in a set for fast lookups
stop_words = set(stopwords.words())
arabic_diacritics = re.compile("""
                             ّ    | # Shadda
                             َ    | # Fatha
                             ً    | # Tanwin Fath
                             ُ    | # Damma
                             ٌ    | # Tanwin Damm
                             ِ    | # Kasra
                             ٍ    | # Tanwin Kasr
                             ْ    | # Sukun
                             ـ     # Tatwil/Kashida
                         """, re.VERBOSE)

def clean_text(txt): 
    #remove punctuations
    translator = str.maketrans('', '', punctuations)
    txt = txt.translate(translator)
    
    # remove Tashkeel
    txt = re.sub(arabic_diacritics, '', txt)
    
    # normalize letter variants (alef, yaa, hamza, taa marbuta, gaf)
    txt = re.sub("[إأآا]", "ا", txt)
    txt = re.sub("ى", "ي", txt)
    txt = re.sub("ؤ", "ء", txt)
    txt = re.sub("ئ", "ء", txt)
    txt = re.sub("ة", "ه", txt)
    txt = re.sub("گ", "ك", txt)
    
    # remove stopwords
    txt = ' '.join(word for word in txt.split() if word not in stop_words)
    
    # remove non-arabic words, or non-numbers, or non-english words in the text
    txt = re.sub(r'[^a-zA-Z\s0-9\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f\ufd50-\ufd8f\ufd50-\ufd8f\ufe70-\ufefc\uFDF0-\uFDFD.0-9]+'
                 ,' ', txt)
    
    return txt
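
For comparison, the same diacritics can also be matched with a single character class over their code points. This is only a sketch: the name `strip_diacritics` is hypothetical, and the range U+064B to U+0652 plus U+0640 (tatweel) is assumed to cover the marks listed in the verbose pattern above.

```python
import re

# U+064B..U+0652: tanwin forms, fatha, damma, kasra, shadda, sukun; U+0640: tatweel
DIACRITICS_RE = re.compile("[\u064B-\u0652\u0640]")

def strip_diacritics(txt):
    """Remove Arabic short-vowel marks and tatweel from a string."""
    return DIACRITICS_RE.sub("", txt)

# the name "Muhammad" written with harakat, spelled out by code point
sample = "\u0645\u064f\u062d\u064e\u0645\u0651\u064e\u062f"
print(strip_diacritics(sample))
```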

In [9]:
docs_df_cleaned = docs_df.drop(columns=['_id', 'title', 'summary', 'content', 'text'])

start_time = time.time()
docs_df_cleaned['text_clean'] = docs_df['text'].apply(clean_text)

# docs_df_cleaned['summary_clean'] = docs_df['summary'].apply(clean_text) # no need for now
docs_df_cleaned['title_clean'] = docs_df['title'].apply(clean_text)
text_clean_enc_df, clean1_vect = data_preprocessing(docs_df_cleaned['text_clean'])
time_measure = (time.time() - start_time) * 10**3

text_clean_enc_df 



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,5074,5075,5076,5077,5078,5079,5080,5081,5082,5083
00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
00000015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ہر,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ہمارے,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ہولوکاسٹ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ہےکہ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
start_time = time.time()
text_clean_enc_df, clean1_vect = data_preprocessing(docs_df_cleaned['text_clean'])
# text_clean_enc_df = vectorizer.fit_transform(docs_df_cleaned2['text_clean'])
time_measure = (time.time() - start_time) * 10**3

print(f"preprocessing time taken: {time_measure}")

preprocessing time taken: 4406.821727752686




**NOTE** 
- ~6000 source documents -> ~5000 documents -> ~89,000 tokens
- Clean time for 'text' column: ~111 seconds
- **Problems**:
    - may not be normalized enough
    - words from other languages
    - confusing numbers (remove or keep?)
        - remove english numbers? or arabic numbers? or both?
        - should we remove tokens that mix letters and numbers (e.g. COVID19)?
    - links (remove or keep?)
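
One possible way to settle the number and link questions above is sketched here. It assumes, purely for illustration, that links are dropped, Arabic-Indic digits are mapped to ASCII, pure-number tokens are optionally removed, and mixed tokens such as COVID19 are kept. The function name and the `drop_pure_numbers` flag are hypothetical.

```python
import re

# Arabic-Indic (U+0660..U+0669) and Eastern Arabic-Indic (U+06F0..U+06F9) digits -> ASCII
_ARABIC_DIGITS = "".join(chr(c) for c in range(0x0660, 0x066A))
_EASTERN_DIGITS = "".join(chr(c) for c in range(0x06F0, 0x06FA))
DIGIT_MAP = str.maketrans(_ARABIC_DIGITS + _EASTERN_DIGITS, "0123456789" * 2)
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def normalize_numbers_and_links(txt, drop_pure_numbers=True):
    txt = URL_RE.sub(" ", txt)          # remove links
    txt = txt.translate(DIGIT_MAP)      # unify digit scripts
    tokens = txt.split()
    if drop_pure_numbers:
        # keeps mixed tokens such as "COVID19", drops tokens made only of digits
        tokens = [t for t in tokens if not t.isdigit()]
    return " ".join(tokens)

print(normalize_numbers_and_links("COVID19 \u0662\u0660\u0662\u0662 https://example.com/a test"))
```

Unifying the digit scripts first means the same number does not end up as two vocabulary entries.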
        

### 3.2 data cleaning (ver.2)

handle:
- removing Arabic diacritics (short vowels and other harakahs)
- variation by form and spelling, based on context (Orthographic Ambiguity)
- existence of many forms for the same word (Morphological Richness)
- dialects (Dialectal Variation)
- different ways to write the same word when writing in dialectal Arabic, for which there is no agreed-upon standard
    - Orthographic Inconsistency
- removing elongation and stop words
- remove words from languages other than arabic and english
   
these problems can lead to immensely large vocabularies.

In [11]:
docs_df_cleaned2 = docs_df.copy()

In [12]:
# import the dediacritization tool
from camel_tools.utils.dediac import dediac_ar

# Reducing Orthographic Ambiguity
from camel_tools.utils.normalize import normalize_alef_maksura_ar
from camel_tools.utils.normalize import normalize_alef_ar
from camel_tools.utils.normalize import normalize_teh_marbuta_ar

# tokenization
from camel_tools.tokenizers.word import simple_word_tokenize

# Morphological Disambiguation (Maximum Likelihood Disambiguator)
from camel_tools.disambig.mle import MLEDisambiguator
mle = MLEDisambiguator.pretrained() # instantiation of the MLE disambiguator

# tokenization / lemmatization (choosing approach that best fit the project)
from camel_tools.tokenizers.morphological import MorphologicalTokenizer

import re
from nltk.corpus import stopwords

In [13]:
stop_word_list = pd.read_csv('../Data/stop_words/list.csv')['words'].to_list()
tokenizer = MorphologicalTokenizer(mle, scheme='atbtok', diac=False) # 'atbtok' scheme ('atbseg' is an alternative)
def text_clean2(txt):
    # remove stopwords
    txt = ' '.join(word for word in txt.split() if word not in stop_word_list)
    
    # dediacritization
    txt = dediac_ar(txt)
    
    # normalization: Reduce Orthographic Ambiguity and Dialectal Variation
    txt = normalize_alef_maksura_ar(txt)
    txt = normalize_alef_ar(txt)
    txt = normalize_teh_marbuta_ar(txt)
    
    # normalization: Reducing Morphological Variation
    tokens = simple_word_tokenize(txt)
    disambig = mle.disambiguate(tokens)
    lemmas = [d.analyses[0].analysis['lex'] for d in disambig]
    tokens = tokenizer.tokenize(lemmas)
    txt = ' '.join(tokens)
    
    # normalize letter variants (alef, yaa, hamza, taa marbuta, gaf)
    txt = re.sub("[إأآا]", "ا", txt)
    txt = re.sub("ى", "ي", txt)
    txt = re.sub("ؤ", "ء", txt)
    txt = re.sub("ئ", "ء", txt)
    txt = re.sub("ة", "ه", txt)
    txt = re.sub("گ", "ك", txt)
    
    # remove non-arabic words, or non-numbers, or non-english words in the text
    txt = re.sub(r'[^a-zA-Z\s0-9\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f\ufd50-\ufd8f\ufd50-\ufd8f\ufe70-\ufefc\uFDF0-\uFDFD.0-9]+'
                 ,' ', txt)
    
    return txt

In [14]:
# apply to your text column
docs_df_cleaned2 = docs_df.drop(columns=['_id', 'title', 'summary', 'content', 'text'])
start_time = time.time()
docs_df_cleaned2['text_clean'] = docs_df['text'].apply(text_clean2)
time_measure = (time.time() - start_time) * 10**3

In [15]:
# text_clean_enc_df = data_preprocessing(docs_df_cleaned2['text_clean'])

# text_clean_enc_df, clean2_vect = data_preprocessing(docs_df_cleaned['text_clean'])
start_time = time.time()
text_clean_enc_df, clean2_vect = data_preprocessing(docs_df_cleaned2['text_clean'])
# text_clean_enc_df = clean2_vect.fit_transform(docs_df_cleaned2['text_clean'])

time_measure = (time.time() - start_time) * 10**3

text_clean_enc_df



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,5074,5075,5076,5077,5078,5079,5080,5081,5082,5083
00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0000015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ہر,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ہمارے,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ہولوکاسٹ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ہےکہ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
# fit + transform time

start_time = time.time()
text_clean_enc_df, clean2_vect = data_preprocessing(docs_df_cleaned2['text_clean'])
# text_clean_enc_df = clean2_vect.transform(docs_df_cleaned2['text_clean'])
time_measure = (time.time() - start_time) * 10**3

print('time measure: ', time_measure)
# text_clean_enc_df

time measure:  1032.4792861938477




In [17]:
# using vectorizer time 


- ~6000 source documents -> ~5000 documents -> ~23,000 tokens
- Clean time for 'text' column: ~180 seconds
- Problems:
    - stopwords (remove or keep?)
    - normalization may have cut out too many tokens 
    - confusing numbers (remove or keep?)
        - remove english numbers? or arabic numbers? or both?
    - should we remove tokens that mix letters and numbers (e.g. COVID19)?
    - links (remove or keep?)
    
**NOTE** discuss with instructor before proceeding

### 3.3 check results

In [18]:
# display(docs_df_cleaned.head())
i=0

In [19]:
# check results
print(f'--> {i}')
display(text_clean_enc_df.index[50*i:50*(i+1)])
# NOTE: the trailing -1 leaves the final entry of each 50-item window out of the display
display(text_clean_enc_df.index[-50*(i+1):(-50*i)-1])
i += 1

print('clean time: {:.2f} seconds'.format(time_measure * 10**-3))

--> 0


Index(['00', '000', '0000', '0000015', '001', '0013', '0016', '00249915491874',
       '00380939361181', '004', '0040732947388', '0040737373273',
       '0041766636358', '00436507716412', '004366565513092', '00436766846092',
       '00436767388014', '007', '00905384419586', '0092512833310',
       '00962795497777', '0096893921146', '00971567436777', '01', '010', '012',
       '013', '014', '015', '0167017', '018', '018ر1', '0191487964410400001',
       '02', '0211', '026', '029', '02ر626', '03', '030', '034', '037', '039',
       '03ر13', '04', '040', '0404', '041', '044', '048'],
      dtype='object')

Index(['٤٠١٠', '٤٤', '٤٥', '٤٥٠', '٤٨', '٥٠', '٥٠٠', '٥١', '٥٢', '٥٣٧', '٥٥',
       '٦٠', '٦٠٠', '٦٠١', '٦٥', '٧٠', '٧٢', '٧٥', '٧٨', '٨٠', '٨٠٠', '٨٠٠م٢',
       '٨٤', '٨٥', '٨٧', '٨٨', '٩٠', '٩٥', '٩٧', '٩٨', '٩٩', 'ٱثنا', 'ٱثنان',
       'ٱثنتان', 'پاک', 'پیوند', 'کابل', 'کر', 'کو', 'کی', 'کیا', 'کیخلاف',
       'کیلءے', 'کےذریعےجان', 'کےفروغ', 'ہر', 'ہمارے', 'ہولوکاسٹ', 'ہےکہ'],
      dtype='object')

clean time: 1.03 seconds


In [20]:
# `vectorizer` only exists inside data_preprocessing; use the fitted ver.2 vectorizer
vocab_ = clean2_vect.vocabulary_
print(f"number of unique words: {len(vocab_)}")
# NOTE: vocabulary_ maps each term to its column index, not to a count,
# so this is the term with the highest index rather than the most frequent word
most_freq_word = sorted(vocab_.items(), key=lambda x: x[1], reverse=True)[:1][0]
print('term with the highest index --> {} (index {})'.format(most_freq_word[0], most_freq_word[1]))
score = len(vocab_) / most_freq_word[1]
print('Ratio: {:.3f}'.format(score))
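
One caveat worth checking here: in scikit-learn, a fitted vectorizer's `vocabulary_` maps each term to its column index, not to an occurrence count. If the aim is to find the most common term, counting document frequency directly from the cleaned strings may be closer to the intent. The sketch below assumes whitespace-separated tokens; the helper name `document_frequency` and the toy input are hypothetical.

```python
from collections import Counter

def document_frequency(cleaned_docs):
    """Count in how many documents each whitespace-separated token appears."""
    df = Counter()
    for doc in cleaned_docs:
        df.update(set(doc.split()))   # set(): count each token once per document
    return df

toy_docs = ["gas europe winter", "gas price", "school time"]
df = document_frequency(toy_docs)
term, count = df.most_common(1)[0]
print(term, count, len(df))
```

On the real data this would be called on a column such as `docs_df_cleaned2['text_clean']`.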

## 4- Apply Cleaning on Query

In [120]:
stop_word_list = pd.read_csv('../Data/stop_words/list.csv')['words'].to_list()
tokenizer = MorphologicalTokenizer(mle, scheme='atbtok', diac=False) # 'atbtok' scheme ('atbseg' is an alternative)
def text_clean2_steps(txt):
    # remove stopwords
    txt = ' '.join(word for word in txt.split() if word not in stop_word_list)
    print('stopwords', txt)
    
    # dediacritization
    txt = dediac_ar(txt)
    print('dediacritization', txt)
    
    # normalization: Reduce Orthographic Ambiguity and Dialectal Variation
    txt = normalize_alef_maksura_ar(txt)
    print('Reduce Orthographic Ambiguity and Dialectal Variation', txt)
    txt = normalize_alef_ar(txt)
    print('Reduce Orthographic Ambiguity and Dialectal Variation', txt)
    txt = normalize_teh_marbuta_ar(txt)
    print('Reduce Orthographic Ambiguity and Dialectal Variation', txt)
    
    # normalization: Reducing Morphological Variation
    tokens = simple_word_tokenize(txt)
    disambig = mle.disambiguate(tokens)
    lemmas = [d.analyses[0].analysis['lex'] for d in disambig]
    tokens = tokenizer.tokenize(lemmas)
    txt = ' '.join(tokens)
    print('Reducing Morphological Variation', txt)
    
    # remove longation
    txt = re.sub("[إأآا]", "ا", txt)
    txt = re.sub("ى", "ي", txt)
    txt = re.sub("ؤ", "ء", txt)
    txt = re.sub("ئ", "ء", txt)
    txt = re.sub("ة", "ه", txt)
    txt = re.sub("گ", "ك", txt)
    print('remove longation', txt)
    
    # remove non-arabic words, or non-numbers, or non-english words in the text
    txt = re.sub(r'[^a-zA-Z\s0-9\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f\ufd50-\ufd8f\ufd50-\ufd8f\ufe70-\ufefc\uFDF0-\uFDFD.0-9]+'
                 ,' ', txt)
    print('remove non-arabic/non-english/non-number words', txt)
    
    return txt

In [121]:
query_test = 'ضياء'
query_test_cleaned = text_clean2_steps(query_test)
query_test_cleaned

stopwords ضياء
dediacritization ضياء
Reduce Orthographic Ambiguity and Dialectal Variation ضياء
Reduce Orthographic Ambiguity and Dialectal Variation ضياء
Reduce Orthographic Ambiguity and Dialectal Variation ضياء
Reducing Morphological Variation ضياء
remove longation ضياء
remove non-arabic/non-english/non-number words ضياء


'ضياء'

## 5- Calculating Similarities

In [21]:
display(text_clean_enc_df)
temp_x = text_clean_enc_df.values
display(temp_x)
display(text_clean_enc_df.loc[:, 0])
display(np.shape(temp_x[:, 0]))

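# NOTE: q1_cleaned is created in a later cell (q1_cleaned = text_clean2(q1));
# run that cell first, otherwise this line raises the NameError shown below
q_temp = [q1_cleaned]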
q_temp = clean2_vect.transform(q_temp).toarray()
q_temp = np.tile(np.array(q_temp).transpose(), (1, text_clean_enc_df.shape[1]))
q_temp = q_temp.reshape(-1, text_clean_enc_df.shape[0])
# print(q_temp)
# np.dot(text_clean_enc_df, q_temp)
# tile(array([[1,2,3]]).transpose(), (1, 3))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,5074,5075,5076,5077,5078,5079,5080,5081,5082,5083
00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0000015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ہر,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ہمارے,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ہولوکاسٹ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ہےکہ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

00          0.0
000         0.0
0000        0.0
0000015     0.0
001         0.0
           ... 
ہر          0.0
ہمارے       0.0
ہولوکاسٹ    0.0
ہےکہ        0.0
یقول        0.0
Name: 0, Length: 23737, dtype: float64

(23737,)

NameError: name 'q1_cleaned' is not defined

In [16]:
import numba

#### optimization 1

In [17]:
# optimized 1
from numba import njit
import numba

def get_similar_articles(q, df, vectorizer):
    # Convert the query become a vector
    q = [q]
    q_vec = vectorizer.transform(q).toarray().reshape(df.shape[0],)
    df = df.values
    
    # Calculate the similarity
    sim = list(range(df.shape[1]))
    
    @njit
    def calc_sim(val):
        x = val
        num = np.linalg.norm(df[:, val]) * np.linalg.norm(q_vec)
        if num == 0:
            y = 0
        else:
            y = np.dot(df[:, val], q_vec) / num
        return x, y 
    
    # prepare result: (document index, score) pairs, best match first
    sim = list(map(calc_sim, sim))
    sim_sorted = sorted(sim, key=lambda x: x[1], reverse=True)
    return sim_sorted

In [24]:
q1 = 'مولد النبي بييبس'
q1_cleaned = text_clean2(q1)
q1_cleaned = [q1_cleaned]

q_vec_test = clean2_vect.transform(q1_cleaned).toarray().reshape(text_clean_enc_df.shape[0],)
# print(q_vec_test)
np.linalg.norm(q_vec_test)

1.0

#### optimization 3

In [22]:
# optimized 3
# sorted(np.linalg.norm(q_vec_test) * np.linalg.norm(text_clean_enc_df.values, axis=0), reverse=True)
start_time = time.time()

# %time
np_data = text_clean_enc_df.values
# product of the query norm and each document (column) norm
norms = np.linalg.norm(q_vec_test) * np.linalg.norm(np_data, axis=0)
dots = np.dot(np_data.T, q_vec_test)
# cosine similarity; documents with a zero norm get a score of 0 instead of NaN
nums = np.divide(dots, norms, out=np.zeros_like(dots), where=norms != 0)
nums_id = list(enumerate(nums))
nums_id_sorted = sorted(nums_id, key=lambda x: x[1], reverse=True)
print((time.time() - start_time) * 10 ** 3)

In [133]:
nums_id_sorted[:7]

[(4600, 0.20582365181966394),
 (2608, 0.20073743494298743),
 (2148, 0.2000789504882992),
 (1697, 0.19007115734405167),
 (3739, 0.16069003395532222),
 (2152, 0.1579702725541781),
 (4734, 0.15792581158425886)]

In [274]:
np_data.shape

(23737, 5084)

In [28]:
q_vec_test.shape

(23737,)

In [27]:
np_data.T.shape

(5084, 23737)

In [25]:
# sorted(np.linalg.norm(q_vec_test) * np.linalg.norm(text_clean_enc_df.values, axis=0), reverse=True)
start_time = time.time()

# %time
# TfidfVectorizer L2-normalizes each document by default and the query norm is 1.0,
# so the plain dot product already equals the cosine similarity
np_data = text_clean_enc_df.values
nums = np.dot(np_data.T, q_vec_test)
nums_id = list(enumerate(nums))
nums_id_sorted = sorted(nums_id, key=lambda x: x[1], reverse=True)
print((time.time() - start_time) * 10 ** 3)

139.9993896484375


In [268]:
nums_id_sorted[:7]

[(4600, 0.205823651819664),
 (2608, 0.20073743494298743),
 (2148, 0.20007895048829927),
 (1697, 0.1900711573440516),
 (3739, 0.1606900339553222),
 (2152, 0.15797027255417811),
 (4734, 0.15792581158425886)]
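
The run with the norm division and the run with only the dot product return the same scores, up to floating-point noise. That is expected when both vectors have unit length, which is the case for `TfidfVectorizer` output with its default `norm='l2'`. A small standard-library check of that identity, using made-up vectors:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0

# hypothetical document and query vectors
doc = l2_normalize([3.0, 0.0, 4.0])
query = l2_normalize([1.0, 2.0, 2.0])
print(dot(doc, query), cosine(doc, query))
```

Skipping the norm computation saves a full pass over the matrix, but it is only safe while the vectorizer keeps its L2 normalization.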

#### optimization 2

In [124]:
# optimized 2
from numba import njit, prange

@njit(parallel=True, cache=True)
def compute_calc(np_data, q_vec, sim, n):
    for i in prange(n):
        num = np.linalg.norm(np_data[:, i]) * np.linalg.norm(q_vec)
        if num == 0:
            sim[i] = 0
        else:
            sim[i] = np.dot(np_data[:, i], q_vec) / num
    return sim

def get_similar_articles(q, df, vectorizer):
    # Convert the query into a vector
    start_time = time.time()
    q_vec = vectorizer.transform([q]).toarray().reshape(df.shape[0],)
    print('vectorizer use Time Duration: ', time.time()- start_time)
    
    # Calculate the similarity of the query to every document (column of df)
    np_data = df.values
    n_docs = np_data.shape[1]
    sim = np.zeros(n_docs)  # sized from the df argument, not the global DataFrame
    sim = compute_calc(np_data, q_vec, sim, n_docs)
    
    # prepare result: top 10 (document index, score) pairs, best first
    sim = list(enumerate(sim))
    sim_sorted = sorted(sim, key=lambda x: x[1], reverse=True)[:10]
    return sim_sorted
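
Since only the ten best documents are kept, a full sort of every score is not strictly needed. As a sketch, `heapq.nlargest` from the standard library selects the top `k` pairs directly; the helper name `top_k` and the toy scores are assumptions.

```python
import heapq

def top_k(scores, k=10):
    """Return the k (doc_index, score) pairs with the highest score, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

print(top_k([0.1, 0.7, 0.0, 0.4], k=2))  # [(1, 0.7), (3, 0.4)]
```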


In [125]:
# Add The Query
q1 = 'مولد النبي'
q1_cleaned = text_clean2(q1)

# Measures
time_measure = None
most_freq_measure = None  

# q_preprocessed = vectorizer.fit_transform([q1]).transpose()
# text_clean_enc_df_2 = vectorizer.fit_transform(docs_df_cleaned['text_clean'])
start_time = time.time()
sorted_docs_with_scores = get_similar_articles(q1_cleaned, text_clean_enc_df, vectorizer=clean2_vect)  # call function
# awesome_cossim_top(text_clean_enc_df, q_preprocessed, 3, lower_bound=0)
time_measure = (time.time() - start_time) * 10**3

print('matching time taken', time_measure)

print('result', sorted(sorted_docs_with_scores, key=lambda x: x[1], reverse=True)[:10])

print()
print('correct result', [(4600, 0.20582365181966394),
 (2608, 0.20073743494298743),
 (2148, 0.2000789504882992),
 (1697, 0.19007115734405167),
 (3739, 0.16069003395532222),
 (2152, 0.1579702725541781),
 (4734, 0.15792581158425886),
 (2202, 0.1468077541726054),
 (546, 0.13791008813162506),
 (565, 0.13109829367759931)])

vectorizer use Time Duration:  0.001982450485229492
matching time taken 622.8823661804199
result [(4600, 0.20582365181966394), (2608, 0.20073743494298746), (2148, 0.2000789504882992), (1697, 0.19007115734405167), (3739, 0.16069003395532222), (2152, 0.1579702725541781), (4734, 0.15792581158425886), (2202, 0.1468077541726054), (546, 0.13791008813162506), (565, 0.1310982936775993)]

correct result [(4600, 0.20582365181966394), (2608, 0.20073743494298743), (2148, 0.2000789504882992), (1697, 0.19007115734405167), (3739, 0.16069003395532222), (2152, 0.1579702725541781), (4734, 0.15792581158425886), (2202, 0.1468077541726054), (546, 0.13791008813162506), (565, 0.13109829367759931)]


In [None]:
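# `vectorizer` only exists inside data_preprocessing; use the fitted ver.2 vectorizer
vocab_ = clean2_vect.vocabulary_
print(f"number of unique words: {len(vocab_)}")
# NOTE: vocabulary_ maps each term to its column index, not to a count,
# so this is the term with the highest index rather than the most frequent word
most_freq_word = sorted(vocab_.items(), key=lambda x: x[1], reverse=True)[:1][0]
print('term with the highest index --> {} (index {})'.format(most_freq_word[0], most_freq_word[1]))
score = len(vocab_) / most_freq_word[1]
print('Ratio: {:.3f}'.format(score))

most_freq_measure = most_freq_word[1]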

In [49]:
docs_df[docs_df['title'] == 'الغاز الروسي يضع أوروبا خلال الشتاء المقبل أمام اختبار تاريخي']

Unnamed: 0,_id,title,summary,content,tags,article_type,text
2,633fdfd13ffae8229d05cb35,الغاز الروسي يضع أوروبا خلال الشتاء المقبل أما...,,[يشهد العالم أول أزمة طاقة عالمية حقيقية في ال...,"[الغاز الروسي, أوروبا, أزمة الطاقة]",News,يشهد العالم أول أزمة طاقة عالمية حقيقية في الت...


In [46]:
# most_freq_measure = None  
# start_time = time.time()

# doc_ids = list(self.docs_df_cleaned2['tags'].index)
# q_list = np.array(q.split(' '))
# sim_score = list(np.zeros(self.docs_df_cleaned2.shape[0]))
q = 'التربية والتعليم'
def match_func(txt):
    # 1 if the query string occurs in the document's space-joined tags, else 0
    return int(q in ' '.join(txt))

vals = docs_df_cleaned2['tags'].tolist()
sim_non_sorted = list(map(match_func, vals))
sim_non_sorted = list(enumerate(sim_non_sorted))
sim_non_sorted
# for i, tag in enumerate(self.docs_df_cleaned2['tags']):
#     for str_tag in tag:
#         q_list_map = np.vectorize(lambda x: x in str_tag)(q_list) 
#         if True in q_list_map:
#             sim_score[i] += 1

# sim_non_sorted = list(zip(doc_ids, sim_score))
# sorted_docs_with_scores_content = sim_non_sorted

[(0, 1),
 (1, 1),
 (2, 0),
 (3, 1),
 (4, 0),
 (5, 0),
 (6, 0),
 (7, 0),
 (8, 1),
 (9, 0),
 (10, 1),
 (11, 1),
 (12, 0),
 (13, 0),
 (14, 0),
 (15, 0),
 (16, 0),
 (17, 0),
 (18, 0),
 (19, 0),
 (20, 0),
 (21, 0),
 (22, 0),
 (23, 0),
 (24, 0),
 (25, 0),
 (26, 0),
 (27, 0),
 (28, 0),
 (29, 0),
 (30, 0),
 (31, 0),
 (32, 0),
 (33, 0),
 (34, 0),
 (35, 0),
 (36, 0),
 (37, 0),
 (38, 0),
 (39, 0),
 (40, 0),
 (41, 0),
 (42, 0),
 (43, 0),
 (44, 0),
 (45, 0),
 (46, 0),
 (47, 0),
 (48, 0),
 (49, 0),
 (50, 0),
 (51, 0),
 (52, 0),
 (53, 0),
 (54, 0),
 (55, 0),
 (56, 0),
 (57, 0),
 (58, 0),
 (59, 0),
 (60, 0),
 (61, 0),
 (62, 0),
 (63, 0),
 (64, 0),
 (65, 0),
 (66, 0),
 (67, 0),
 (68, 1),
 (69, 0),
 (70, 0),
 (71, 0),
 (72, 0),
 (73, 0),
 (74, 0),
 (75, 0),
 (76, 0),
 (77, 0),
 (78, 0),
 (79, 0),
 (80, 0),
 (81, 0),
 (82, 0),
 (83, 0),
 (84, 0),
 (85, 0),
 (86, 0),
 (87, 0),
 (88, 1),
 (89, 0),
 (90, 0),
 (91, 0),
 (92, 0),
 (93, 0),
 (94, 0),
 (95, 0),
 (96, 0),
 (97, 0),
 (98, 0),
 (99, 0),
 (100, 1),

In [42]:
docs_df_cleaned2['tags'].tolist()

[['التربية والتعليم', 'وزارة التربية والتعليم'],
 ['مكرمة أبناء المعلمين', 'وزارة التربية والتعليم'],
 ['الغاز الروسي', 'أوروبا', 'أزمة الطاقة'],
 ['التربية والتعليم', 'التوقيت الصيفي', 'المدارس'],
 ['وزارة العمل', 'المولد النبوي'],
 ['الربط الكهربائي الأردني - العراقي', 'الأردن'],
 ['الباص السريع'],
 ['المخدرات', 'الأطفال', 'المدارس'],
 ['وزارة التربية', 'وزارة التربية والتعليم'],
 ['وزارة المياه', 'السدود'],
 ['وزارة التربية والتعليم'],
 ['أبناء قطاع غزة',
  'أبناء الأردنيات',
  'اللاجئين السوريين',
  'وزارة التربية والتعليم'],
 ['المركز الوطني لحقوق الإنسان', 'الربيع العربي'],
 ['التهاب الكبد الوبائي', 'منظمة الصحة العالمية', 'الكبد', 'الأطفال'],
 ['جائحة كورونا', 'المدارس', 'فحص كورونا', 'البرتوكول العلاجي'],
 ['خط الديسي', 'وزارة المياه'],
 ['إعادة حركة السير', 'الباص السريع'],
 ['أمانة عمان الكبرى',
  'التقاطعات',
  'المرورية',
  'الباص السريع',
  'جسر المدينة الرياضية العلوي'],
 ['أفغانستان', 'طالبان'],
 ['الأسرى', 'جنين', 'المقاومة الفلسطنية', 'فرار أسرى'],
 ['درعا', 'روسيا', '

In [443]:
q = 'مولد النبي'
time_measure = None
most_freq_measure = None  
start_time = time.time()

doc_ids = list(docs_df_cleaned2['tags'].index)
q_list = np.array(q.split(' '))
sim_score = list(np.zeros(docs_df_cleaned2.shape[0]))

for i, tag in enumerate(docs_df_cleaned2['tags']):
    for str_tag in tag:
        q_list_map = np.vectorize(lambda x: x in str_tag)(q_list) 
        if True in q_list_map:
            sim_score[i] += 1

sim_non_sorted = list(zip(doc_ids, sim_score))
sorted_docs_with_scores_content = sim_non_sorted

time_measure = (time.time() - start_time) * 10**3
score = 0
print('Ratio: {:.3f}'.format(score))
print()
print('time measure:', time_measure)

Ratio: 0.000

time measure: 164.09802436828613
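
The tag-scoring loop above can also be written without `np.vectorize`, using plain Python substring checks. The sketch below is meant to reproduce the same rule (one point per tag that contains at least one query word); the function name `tag_match_scores` and the toy tags are assumptions.

```python
def tag_match_scores(query, tags_per_doc):
    """For each document, count the tags that contain at least one query word."""
    words = query.split()
    return [
        sum(any(word in tag for word in words) for tag in tags)
        for tags in tags_per_doc
    ]

# two hypothetical documents with their tag lists
toy_tags = [["وزارة العمل", "المولد النبوي"], ["الباص السريع"]]
print(tag_match_scores("مولد النبي", toy_tags))
```

On the real data the second argument would be `docs_df_cleaned2['tags']`.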


## 6- Getting Top Documents

In [272]:
sorted_docs_with_scores = sorted(sorted_docs_with_scores, key=lambda x: x[1], reverse=True)
top_5_docs = np.array(sorted_docs_with_scores, dtype='int32')[:5, 0]
top_5_docs

array([4600, 2608, 2148, 1697, 3739])

In [62]:
# results 
print('time measure:', time_measure)
print('frequency measure:', most_freq_measure)
print('score %.3f' % score)

time measure: 648.0000019073486
frequency measure: 89123
score 1.000


## References

Done:
- Pre-processing Arabic text for machine-learning using the camel-tools Python package
    - https://towardsdatascience.com/arabic-nlp-unique-challenges-and-their-solutions-d99e8a87893d
- faster and more optimized processes
    - https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536
    - https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c

Not Done:
- elastic search with tf-idf + newer search mechanism approaches
    - https://towardsdatascience.com/how-to-build-a-search-engine-9f8ffa405eac#:~:text=use%20Elasticsearch.%20This%20search%20engine%20was%20powered%20by%20incredibly%20simple%20term%2Dfrequency%2C%20inverse%20document%20frequency%20(or%20tf%2Didf)
- model which can handle out-of-vocabulary words + word vector approaches (word-2-vec)
    - https://towardsdatascience.com/supercharging-word-vectors-be80ee5513d#:~:text=model%20which%20can%20handle%20out%2Dof%2Dvocabulary%20words
- faster and smarter search engine
    - https://www.kaggle.com/code/greegtitan/creating-simple-search-engine
    - https://towardsdatascience.com/how-to-build-a-smart-search-engine-a86fca0d0795