- a software system that retrieves information from the web

- smart search engines can be considered an unsupervised learning approach, because they cluster related information without any labels in hand

- Search Engines have evolved **from a text input and output service to** an experience that cuts across voice, video, documents, and conversations


- an **infinite problem** to solve


- **related** to information retrieval, language understanding


- the **value that an effective search tool can bring to a business is enormous**; a key piece of intellectual property. Often a search bar is the main interface between customers and the business. 
    - create a competitive advantage by delivering an improved user experience.
    


Popular search engine approaches:
- manual implementation with a dataframe + TF-IDF
- Elasticsearch + BM25
- BM25 + Azure Cognitive Search
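
Two of the approaches above rely on BM25 for ranking. As a rough illustration of what that scoring does, here is a minimal pure-Python sketch of Okapi BM25 over pre-tokenized documents. The function name `bm25_scores`, the parameter defaults (`k1=1.5`, `b=0.75`) and the toy documents are assumptions made for this example. It is not the implementation used by Elasticsearch or Azure Cognitive Search.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # IDF variant with +1 inside the log, so it never goes negative
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # term-frequency saturation with document-length normalization
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# toy corpus of already-cleaned tokens (illustrative only)
docs = [
    "وزاره تربيه تعليم مدرسه".split(),
    "غاز روسي أوربا شتاء".split(),
    "مدرسه دوام توقيت صيفي مدرسه".split(),
]
scores = bm25_scores("مدرسه".split(), docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # document indices, best match first
```

Unlike plain TF-IDF, the score saturates as a term repeats and is normalized by document length, which is the usual reason BM25 is preferred for ranking articles of very different sizes.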

Requirements:
- Search index that stores each document and keeps the information relevant and up to date
    - data can be reorganized by date (suggestion)
- Query understanding
    - takes the query sentence and the preprocessed data **directly, without much context**
    - we can extract words or tokens from the query to match **article_type** (suggestion)
        - query to match tags (done)
    - we can filter the search by either blog or News (suggestion)
        - or add multiple results available (blog, News, or both)
    - BM25 + Azure Cognitive search
- Query ranking
    - by cosine similarity
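
The ranking and filtering requirements above can be sketched together. The example below is a pure-Python stand-in, not the notebook's implementation: it builds sparse TF-IDF vectors, ranks by cosine similarity, and optionally keeps only one `article_type`. The names `tfidf_vectors`, `cosine`, `search`, the record layout, and the sample records are all hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs_tokens):
    """Build sparse TF-IDF vectors (dicts) plus the IDF table used to build them."""
    n_docs = len(docs_tokens)
    df = Counter(t for doc in docs_tokens for t in set(doc))
    # smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs_tokens]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def search(query_tokens, records, article_type=None, top_k=5):
    """Rank records by cosine similarity, optionally keeping one article_type."""
    vectors, idf = tfidf_vectors([r["tokens"] for r in records])
    # query terms that are not in the corpus vocabulary are ignored
    q_vec = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    hits = []
    for rec, vec in zip(records, vectors):
        if article_type is not None and rec["article_type"] != article_type:
            continue
        score = cosine(q_vec, vec)
        if score > 0:
            hits.append((rec["doc_id"], score))
    return sorted(hits, key=lambda h: h[1], reverse=True)[:top_k]

# hypothetical records; tokens are assumed to be cleaned already
records = [
    {"doc_id": 0, "article_type": "News", "tokens": "تربيه تحويل مدرسه نظام".split()},
    {"doc_id": 1, "article_type": "blog", "tokens": "مدرسه معلم كتاب".split()},
    {"doc_id": 2, "article_type": "News", "tokens": "غاز روسي شتاء".split()},
]
print(search(["مدرسه"], records))
print(search(["مدرسه"], records, article_type="News"))
```

Filtering before scoring keeps the idea simple. In a real index the filter would more likely be a metadata constraint applied by the search backend.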

## 1- Library and Data Imports

In [1]:
import numpy as np
import pandas as pd
import time

# for text cleaning and preprocessing
import re
from nltk.corpus import stopwords
import string 
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
docs_df = pd.read_json('../Data/husna.json')

## 2- Data Preparation

#### 2.1 preparing data for cleaning

In [4]:
# drop metadata columns that are not used for search
docs_df = docs_df.drop(columns=['publisher', 'crawled_at', 'published_at'])

In [5]:
# drop rows that have neither content nor a title, then renumber the index
empty_rows = (docs_df['content'].str.len() == 0) & (docs_df['title'] == '')
docs_df_dropped = docs_df[~empty_rows].reset_index(drop=True)
docs_df = docs_df_dropped

In [6]:
# `content` is assumed to hold a list of text fragments per article; join them into one string
docs_df['text'] = docs_df['content'].apply(lambda x: " ".join(x))

## 3- Data Cleaning

important data cleaning functions:
- remove punctuation
- tokenization 
- stem words

**cleaning functions not implemented**: removing repeating characters, stop words, emoji, hashtags
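
The unimplemented steps listed above could look roughly like the following. This is a sketch under stated assumptions, not part of the notebook's pipeline: the function names are hypothetical, and the emoji ranges approximate the main emoji blocks rather than covering every emoji.

```python
import re

# Approximate emoji coverage; not an exhaustive list of emoji code points.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)

def remove_emoji(text):
    """Replace runs of emoji with a single space."""
    return EMOJI_RE.sub(" ", text)

def unwrap_hashtags(text):
    """Turn '#some_tag' into 'some tag' so the words stay searchable."""
    return re.sub(r"#(\w+)", lambda m: m.group(1).replace("_", " "), text)

def collapse_repeats(text):
    """Drop tatweel, then squeeze any letter repeated 3+ times down to one.

    Digits are left alone so that numbers such as 1000 are not altered.
    """
    text = text.replace("\u0640", "")
    return re.sub(r"([^\W\d])\1{2,}", r"\1", text)

print(collapse_repeats("جمييييل"))
print(unwrap_hashtags("#أزمة_الطاقة"))
print(" ".join(remove_emoji("خبر 🔥 عاجل").split()))
```

If these were added to the pipeline, they would most naturally run before stop-word removal, so that words freed from hashtags are filtered like any other token.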

### data cleaning (ver.2)

In [7]:
docs_df_cleaned2 = docs_df.copy()

In [15]:
# import the dediacritization tool
from camel_tools.utils.dediac import dediac_ar

# Reducing Orthographic Ambiguity
from camel_tools.utils.normalize import normalize_alef_maksura_ar
from camel_tools.utils.normalize import normalize_alef_ar
from camel_tools.utils.normalize import normalize_teh_marbuta_ar

# tokenization
from camel_tools.tokenizers.word import simple_word_tokenize

# Morphological Disambiguation (Maximum Likelihood Disambiguator)
from camel_tools.disambig.mle import MLEDisambiguator
mle = MLEDisambiguator.pretrained() # instantiation of the MLE disambiguator

# tokenization / lemmatization (choosing approach that best fit the project)
from camel_tools.tokenizers.morphological import MorphologicalTokenizer

import re
from nltk.corpus import stopwords
from bs4 import BeautifulSoup

In [16]:
# 4
def remove_urls(text):
    return re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', ' ', text)

# 5
def remove_html(text):
    return BeautifulSoup(text, "html.parser").text

# removing symbols
symb_re = re.compile(r"""[!"#$%&\'()*+,-./:;<=>?@[\\\]^_`{|}~،؟…«“\":\"…”]""")
def remove_symbols(text: str) -> str:
    return symb_re.sub(repl="", string=text)

# 10
multiple_space_re = re.compile(r"\s{2,}")
def remove_multiple_whitespace(text):
    return multiple_space_re.sub(repl=" ", string=text)

In [101]:
stop_word_list = pd.read_csv('../Data/stop_words/list.csv')['words'].to_list()
tokenizer = MorphologicalTokenizer(mle, scheme='atbtok', diac=False) # 'atbtok' scheme, undiacritized output
def text_clean2(txt):
    txt = remove_urls(txt)
    txt = remove_html(txt)
    
    # remove stopwords
    txt = ' '.join(word for word in txt.split() if word not in stop_word_list)
    
    # dediacritization
    txt = dediac_ar(txt)
    
    # normalization: Reduce Orthographic Ambiguity and Dialectal Variation
    txt = normalize_alef_maksura_ar(txt)
    txt = normalize_alef_ar(txt)
    txt = normalize_teh_marbuta_ar(txt)
    
    # normalization: Reducing Morphological Variation
    tokens = simple_word_tokenize(txt)
    disambig = mle.disambiguate(tokens)
    lemmas = [d.analyses[0].analysis['lex'] for d in disambig]
    tokens = tokenizer.tokenize(lemmas)
    txt = ' '.join(tokens)
    
    # letter normalization (most rules disabled); lemmas can reintroduce ة, so map it to ه again
#     txt = re.sub("[إأآا]", "ا", txt)
#     txt = re.sub("ى", "ي", txt)
#     txt = re.sub("ؤ", "ء", txt)
#     txt = re.sub("ئ", "ء", txt)
    txt = re.sub("ة", "ه", txt)
#     txt = re.sub("گ", "ك", txt)
    
    # replace any run of characters that are not Arabic, Latin letters, digits, whitespace or '.' with a space
    txt = re.sub(r'[^a-zA-Z\s0-9\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f\ufd50-\ufd8f\ufd50-\ufd8f\ufe70-\ufefc\uFDF0-\uFDFD.0-9]+'
                 ,' ', txt)
    
    # remove symbols
    txt = remove_symbols(txt)
    
    # remove multiple whitespace
    txt = remove_multiple_whitespace(txt)
    
    
    return txt
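
To make the orthographic steps in the function above easier to follow, here is a rough standard-library approximation of what dediacritization and the three normalizers do to a string. It is an illustration only and is not the `camel_tools` implementation. The names `approx_dediac` and `approx_normalize` are hypothetical, and the exact character sets handled by the library may differ.

```python
import re

# tanween, short vowels, shadda, sukun, dagger alef and tatweel
DIACRITICS_RE = re.compile("[\u064B-\u0652\u0670\u0640]")

def approx_dediac(text):
    """Strip Arabic diacritical marks (approximation of dediacritization)."""
    return DIACRITICS_RE.sub("", text)

def approx_normalize(text):
    """Collapse common orthographic variants onto one letter each."""
    text = re.sub("[\u0622\u0623\u0625\u0671]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")                      # alef maksura -> yeh
    text = text.replace("\u0629", "\u0647")                      # teh marbuta -> heh
    return text

print(approx_dediac("كِتَابٌ"))
print(approx_normalize("مدرسة"))
```

The point of these steps is that spelling variants of the same word end up as one vocabulary entry, so a query typed without hamza or diacritics still matches. Both documents and queries need the same normalization for that to hold.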

In [102]:
# apply to your text column
start_time = time.time()
docs_df_cleaned2 = docs_df.drop(columns=['_id', 'summary', 'content'])
docs_df_cleaned2['text_clean'] = docs_df['text'].apply(text_clean2)
docs_df_cleaned2['title_clean'] = docs_df['title'].apply(text_clean2)
docs_df_cleaned2['content_clean'] = docs_df_cleaned2['title_clean'] + " " + docs_df_cleaned2['text_clean']
docs_df_cleaned2['doc_id'] = docs_df_cleaned2.index
time_measure = (time.time() - start_time) * 10**3
print('Data Cleaning Time: {} ms'.format(time_measure))



Data Cleaning Time: 27089.78033065796 ms
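
One cheap optimization to consider for the cleaning step: the stop-word filter tests membership in a Python list for every token, which is a linear scan each time. A set gives the same result with constant-time lookups. The sketch below uses made-up stand-in data to show that the two are equivalent. It does not measure the notebook's corpus, and morphological disambiguation may well account for most of the 27 seconds.

```python
import timeit

# hypothetical stand-ins for the stop-word file and a tokenized document
stop_word_list = [f"stop{i}" for i in range(500)]
stop_word_set = frozenset(stop_word_list)
tokens = [f"word{i}" for i in range(2000)] + stop_word_list[:50]

def filter_with(container):
    """Keep only the tokens that are not in the given stop-word container."""
    return [t for t in tokens if t not in container]

# both containers give exactly the same filtered output
assert filter_with(stop_word_list) == filter_with(stop_word_set)

list_time = timeit.timeit(lambda: filter_with(stop_word_list), number=20)
set_time = timeit.timeit(lambda: filter_with(stop_word_set), number=20)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

In the notebook this would amount to wrapping the loaded stop words in `set(...)` once, before the cleaning function is applied.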


In [109]:
vectorizer = TfidfVectorizer()
clean2_vect = vectorizer.fit_transform(docs_df_cleaned2['content_clean'])

In [113]:
docs_df_cleaned2['content_word_count'] = docs_df_cleaned2['content_clean'].apply(lambda x: len(x.split()))

In [114]:
no_words_corpora = docs_df_cleaned2.content_word_count.sum()
# the vocabulary lives on the fitted vectorizer, not on the sparse matrix it returns
no_words_vocab = len(vectorizer.vocabulary_)

print(f"Number of words in corpora: {no_words_corpora}")
print(f"Number of words in vocabulary {no_words_vocab}")

Number of words in corpora: 1144292
Number of words in vocabulary 23799
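
The two figures above are counted in slightly different ways: the corpus total comes from a whitespace split, while the vocabulary comes from `TfidfVectorizer`, whose default `token_pattern` keeps only tokens of two or more word characters after lowercasing. With these numbers, each vocabulary entry occurs about 48 times on average. A small standard-library sketch of the difference follows. The helper `corpus_stats` and the sample texts are made up for illustration.

```python
import re
from collections import Counter

# same pattern as TfidfVectorizer's default token_pattern
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def corpus_stats(texts):
    """Return (whitespace token count, vectorizer-style token count, vocabulary size)."""
    whitespace_tokens = sum(len(t.split()) for t in texts)
    vocab = Counter(tok for t in texts for tok in TOKEN_RE.findall(t.lower()))
    return whitespace_tokens, sum(vocab.values()), len(vocab)

texts = ["تحويل 42 مدرسه و نظام", "مدرسه 5 نظام فتره"]
print(corpus_stats(texts))
```

Single-character tokens such as a lone digit or a one-letter particle are counted by the whitespace split but never enter the vocabulary.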


In [117]:
docs_df_cleaned2 = docs_df_cleaned2.drop(columns=['text_clean', 'title_clean', 'content_word_count'])

In [119]:
docs_df_cleaned2.head()

| | title | url | tags | article_type | text | content_clean | doc_id |
|---|---|---|---|---|---|---|---|
| 0 | التربية: تحويل 42 مدرسة إلى نظام الفترتين واست... | https://husna.fm/%D9%85%D8%AD%D9%84%D9%8A/%D8%... | [التربية والتعليم, وزارة التربية والتعليم] | News | أكدت أمين عام وزارة التربية والتعليم للشؤون ... | تربيه تحويل 42 مدرسه إلى نظام فتره استئجار 15 ... | 0 |
| 1 | تكريما للمعلمين زيادة منح أبناء المعلمين 550 م... | https://husna.fm/%D9%85%D8%AD%D9%84%D9%8A/%D8%... | [مكرمة أبناء المعلمين, وزارة التربية والتعليم] | News | احتفلت وزارة التربية والتعليم بيوم المعلم بت... | تكريم معلم زياده منح أبناء معلم 550 مقعد إضافي... | 1 |
| 2 | الغاز الروسي يضع أوروبا خلال الشتاء المقبل أما... | https://husna.fm/%D9%85%D9%84%D9%81%D8%A7%D8%A... | [الغاز الروسي, أوروبا, أزمة الطاقة] | News | يشهد العالم أول أزمة طاقة عالمية حقيقية في الت... | غاز روسي وضع أوربا شتاء أمام اختبار تاريخي شهد... | 2 |
| 3 | تفاصيل دوام المدارس بعد قرار تثبيت التوقيت الصيفي | https://husna.fm/%D9%85%D8%AD%D9%84%D9%8A/%D8%... | [التربية والتعليم, التوقيت الصيفي, المدارس] | News | كشفت أمين عام وزارة التربية والتعليم للشؤون ال... | تفصيل دوام مدرسه قرار تثبيت توقيت صيفي كشف أمي... | 3 |
| 4 | العمل: عطلة المولد النبوي الشريف تشمل القطاع ا... | https://husna.fm/%D9%85%D8%AD%D9%84%D9%8A/%D8%... | [وزارة العمل, المولد النبوي] | News | أكدت وزارة العمل أن العطل الرسمية تكون مأجور... | عمل عطله مولد نبوي شريف شمل قطاع خاص أكد وزاره... | 4 |


### preparing input for word vector

In [84]:
stop_word_list = pd.read_csv('../Data/stop_words/list.csv')['words'].to_list()
tokenizer = MorphologicalTokenizer(mle, scheme='atbtok', diac=False) # 'atbtok' scheme, undiacritized output
def text_clean_wv(txt):
    txt = remove_urls(txt)
    txt = remove_html(txt)
    
    # remove stopwords
    txt = ' '.join(word for word in txt.split() if word not in stop_word_list)
    
    # dediacritization
    txt = dediac_ar(txt)
    
    # normalization: Reduce Orthographic Ambiguity and Dialectal Variation
    txt = normalize_alef_maksura_ar(txt)
    txt = normalize_alef_ar(txt)
    txt = normalize_teh_marbuta_ar(txt)
    
    # normalization: Reducing Morphological Variation
    tokens = simple_word_tokenize(txt)
    disambig = mle.disambiguate(tokens)
    lemmas = [d.analyses[0].analysis['lex'] for d in disambig]
    tokens = tokenizer.tokenize(lemmas)
    txt = ' '.join(tokens)
    
    # letter normalization (most rules EXCLUDED); lemmas can reintroduce ة, so map it to ه again
#     txt = re.sub("[إأآا]", "ا", txt)
#     txt = re.sub("ى", "ي", txt)
#     txt = re.sub("ؤ", "ء", txt)
#     txt = re.sub("ئ", "ء", txt)
    txt = re.sub("ة", "ه", txt)
#     txt = re.sub("گ", "ك", txt)
    
    # replace any run of characters that are not Arabic, Latin letters, digits, whitespace or '.' with a space
    txt = re.sub(r'[^a-zA-Z\s0-9\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f\ufd50-\ufd8f\ufd50-\ufd8f\ufe70-\ufefc\uFDF0-\uFDFD.0-9]+'
                 ,' ', txt)
    
    # remove symbols
    txt = remove_symbols(txt)
    
    # remove multiple whitespace
    txt = remove_multiple_whitespace(txt)
    
    
    return txt

In [85]:
start_time = time.time()
word_vector_input = docs_df.drop(columns=['_id', 'summary', 'content'])
word_vector_input['text_clean'] = docs_df['text'].apply(text_clean_wv)
word_vector_input['title_clean'] = docs_df['title'].apply(text_clean_wv)
word_vector_input['content_clean'] = word_vector_input['title_clean'] + " " + word_vector_input['text_clean']
word_vector_input['doc_id'] = word_vector_input.index
time_measure = (time.time() - start_time) * 10**3



In [88]:
word_vector_input.to_csv('../Data/processed/SE_data4.csv', index=False, encoding='utf-8-sig')
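
A point to keep in mind when this file is read back: CSV has no list type, so the list-valued `tags` column is written as the string form of each list, and `utf-8-sig` prepends a byte-order mark so that tools such as Excel detect the encoding. The sketch below reproduces both effects with the standard library only. The sample row is taken from the preview above, and the round trip is an illustration rather than the notebook's own loading code.

```python
import ast
import csv
import io

row = {"doc_id": 4, "tags": ["وزارة العمل", "المولد النبوي"]}

# write one row; the list is stored as its str() representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["doc_id", "tags"])
writer.writeheader()
writer.writerow(row)

# read it back: every field is now a plain string
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["tags"]).__name__, loaded["tags"])

# turn the string back into a real list
tags = ast.literal_eval(loaded["tags"])
print(tags)

# utf-8-sig starts the file with the three BOM bytes
bom = "doc_id".encode("utf-8-sig")[:3]
print(bom)
```

When loading the file with pandas, the same `ast.literal_eval` conversion can be applied to the `tags` column, and reading with `encoding='utf-8-sig'` strips the byte-order mark again.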

### Apply Cleaning on Query

In [82]:
## Method 1 (the cleaning used above, printing the text after each step)

stop_word_list = pd.read_csv('../Data/stop_words/list.csv')['words'].to_list()
tokenizer = MorphologicalTokenizer(mle, scheme='atbtok', diac=False) # 'atbtok' scheme, undiacritized output
def text_clean2_steps(txt):
    # remove stopwords
    txt = ' '.join(word for word in txt.split() if word not in stop_word_list)
    print('stopwords', txt)
    
    # dediacritization
    txt = dediac_ar(txt)
    print('dediacritization', txt)
    
    # normalization: Reduce Orthographic Ambiguity and Dialectal Variation
    txt = normalize_alef_maksura_ar(txt)
    print('Reduce Orthographic Ambiguity and Dialectal Variation', txt)
    txt = normalize_alef_ar(txt)
    print('Reduce Orthographic Ambiguity and Dialectal Variation', txt)
    txt = normalize_teh_marbuta_ar(txt)
    print('Reduce Orthographic Ambiguity and Dialectal Variation', txt)
    
    # normalization: Reducing Morphological Variation
    tokens = simple_word_tokenize(txt)
    disambig = mle.disambiguate(tokens)
    lemmas = [d.analyses[0].analysis['lex'] for d in disambig]
    tokens = tokenizer.tokenize(lemmas)
    txt = ' '.join(tokens)
    print('Reducing Morphological Variation', txt)
    
    # letter normalization (most rules disabled); lemmas can reintroduce ة, so map it to ه again
#     txt = re.sub("[إأآا]", "ا", txt)
#     txt = re.sub("ى", "ي", txt)
#     txt = re.sub("ؤ", "ء", txt)
#     txt = re.sub("ئ", "ء", txt)
    txt = re.sub("ة", "ه", txt)
#     txt = re.sub("گ", "ك", txt)
    print('remove longation', txt)
    
    # replace any run of characters that are not Arabic, Latin letters, digits, whitespace or '.' with a space
    txt = re.sub(r'[^a-zA-Z\s0-9\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f\ufd50-\ufd8f\ufd50-\ufd8f\ufe70-\ufefc\uFDF0-\uFDFD.0-9]+'
                 ,' ', txt)
    print('remove non-arabic/non-english/non-number words', txt)
    
    return txt

In [89]:
query_test = 'كتاب'
query_test_cleaned = text_clean2_steps(query_test)
query_test_cleaned

stopwords كتاب
dediacritization كتاب
Reduce Orthographic Ambiguity and Dialectal Variation كتاب
Reduce Orthographic Ambiguity and Dialectal Variation كتاب
Reduce Orthographic Ambiguity and Dialectal Variation كتاب
Reducing Morphological Variation كتاب
remove longation كتاب
remove non-arabic/non-english/non-number words كتاب


'كتاب'

In [68]:
# cleaning with Farasa

from farasa.stemmer import FarasaStemmer
stemmer = FarasaStemmer()

query_test = 'فئات'
stemmed_text = stemmer.stem(query_test)                                     
print(stemmed_text)

فئة
