<a href="https://colab.research.google.com/github/MhmDSmdi/Text-Similarity/blob/master/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this project we measure the similarity between Persian sentences from the ophthalmology domain. First, we download the pre-trained fastText model for Persian and unzip it with the following commands.

In [0]:
# !pip install -q hazm
# !wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.fa.300.bin.gz
# !gunzip cc.fa.300.bin.gz

  

Next, we load our ophthalmology dataset from Google Drive so that we can fine-tune the general pre-trained model.

In [0]:
# from google.colab import drive
# drive.mount('/content/drive')

The first step in natural language processing is data pre-processing, which cleans the data by normalizing it and then stemming or lemmatizing each word. One of the most important cleaning steps is removing stop words from the corpus.
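One reason normalization matters for Persian text is that the same word can be typed with Arabic or Persian code points (Arabic yeh and kaf versus their Persian forms), so otherwise identical words would fail to match. The sketch below is a toy stand-in for that idea in plain Python, not hazm's actual implementation; `ARABIC_TO_PERSIAN`, `TOY_STOPS`, and `toy_clean` are hypothetical names, and the stop-word list is only a three-word sample.

```python
# Toy illustration only: unify two Arabic letters with their Persian forms,
# then drop stop words. hazm's Normalizer and stopwords_list() do much more.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian keheh
})

# Hypothetical mini stop-word list, normalized the same way as the input.
TOY_STOPS = {w.translate(ARABIC_TO_PERSIAN) for w in ("آیا", "برای", "است")}


def toy_clean(statement):
    """Return the statement's tokens after character unification and stop-word removal."""
    unified = statement.translate(ARABIC_TO_PERSIAN)
    # Treat the Persian question mark (U+061F) as a separator.
    tokens = unified.replace("\u061f", " ").split()
    return [token for token in tokens if token not in TOY_STOPS]


# A sentence typed with Arabic yeh/kaf ends up with the Persian forms.
print(toy_clean("هزينه عمل ليزيك چند است؟"))
```

The real pipeline in the next cell uses hazm for both steps and also lemmatizes each token.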

In [0]:
import pickle

from hazm import Normalizer, Stemmer, Lemmatizer, sent_tokenize, word_tokenize, stopwords_list

stops = set(stopwords_list())


def load_dataset(file_name, column_name='question'):
    # Use a context manager so the pickle file is closed after reading.
    with open(file_name, "rb") as f:
        data = pickle.load(f)
    # Keep only the requested field of each record.
    return [data[i][column_name] for i in range(len(data))]


def statement_pre_processing(input_statement):
    normalizer = Normalizer()
    lemmatizer = Lemmatizer()
    input_statement = normalizer.normalize(input_statement)
    input_statement = [lemmatizer.lemmatize(word) for word in word_tokenize(input_statement) if word not in stops]
    return input_statement


def dataset_cleaner(dataset):
    statements = []
    normalizer = Normalizer()
    lemmatizer = Lemmatizer()
    for i in range(len(dataset)):
        normalized_statement = normalizer.normalize(dataset[i])
        # for sentence in sent_tokenize(dataset[i]):
        word_list = [lemmatizer.lemmatize(word) for word in word_tokenize(normalized_statement) if word not in stops]
        statements.append(word_list)
    return statements


In [0]:
import multiprocessing
import gensim

from gensim.models import Phrases, Word2Vec, FastText
from gensim.models.phrases import Phraser
from gensim.similarities import WmdSimilarity
from gensim.test.utils import datapath


def load_pre_trained_model(file_name, encoding='utf-8'):
    # model = gensim.models.KeyedVectors.load_word2vec_format(file_name)
    model = FastText.load_fasttext_format(file_name)
    # model.save('fasttext_fa_model')
    return model
  

In [0]:
def train_word2vec_bigram(word_statements, name='word2vec_fa_model'):
    phrases = Phrases(word_statements, min_count=30, progress_per=10000)
    bigram = Phraser(phrases)
    sentences = bigram[word_statements]
    num_cores = multiprocessing.cpu_count()
    w2v_model = Word2Vec(min_count=20,
                         window=2,
                         size=300,
                         sample=6e-5,
                         alpha=0.03,
                         min_alpha=0.0007,
                         negative=20,
                         workers=num_cores - 1)
    w2v_model.build_vocab(sentences, progress_per=10000)
    w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)
    w2v_model.save(name)
    w2v_model.init_sims(replace=True)
    return w2v_model
  
 

In [0]:
def progbar(curr, full_progbar):
    frac = curr / full_progbar
    filled_progbar = round(frac * full_progbar)
    print('\r', '#' * filled_progbar + '-' * (full_progbar - filled_progbar), '[{:>7.2%}]'.format(frac), end='')
    

In [6]:
progbar(0, 100)
medical_questions = load_dataset("./drive/My Drive/Colab Notebooks/data_all.pickle")
medical_questions_words = dataset_cleaner(medical_questions)
progbar(45, 100)
model = load_pre_trained_model('./cc.fa.300.bin')
progbar(65, 100)




 #############################################------------------------------------------------------- [ 45.00%]


 #################################################################----------------------------------- [ 65.00%]

So far we have loaded fastText's pre-trained model and our ophthalmology dataset. As I said above, to get more accurate results we should continue training the pre-trained fastText model, because it is very general and needs to be tuned for ophthalmology and the medical field.
After training the fastText model, we use Word Mover's Distance (WMD) to measure the similarity between the user's query and our medical corpus.
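To make the WMD idea concrete, here is a minimal sketch in plain Python. It does not solve the full optimal-transport problem that defines WMD; instead it computes the relaxed lower bound in which every word simply travels to its nearest word in the other sentence. The 2-D vectors and the names `toy_vectors`, `relaxed_wmd`, and `distance_to_similarity` are made up for illustration. The last function mirrors how, as far as I know, gensim's `WmdSimilarity` turns a distance into a score, namely 1 / (1 + distance), so an exact match (distance 0) scores 1.0.

```python
import math

# Toy 2-D "embeddings"; the values are invented purely for illustration.
toy_vectors = {
    "eye": (1.0, 0.0),
    "vision": (0.9, 0.1),
    "surgery": (0.0, 1.0),
    "operation": (0.1, 0.9),
    "football": (-1.0, -1.0),
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_wmd(doc_a, doc_b, vectors):
    """Relaxed WMD: a lower bound on the true Word Mover's Distance.

    Each word carries equal weight and moves entirely to its nearest word
    in the other document; the larger of the two directions is returned.
    """
    def one_way(src, dst):
        total = sum(min(euclidean(vectors[s], vectors[d]) for d in dst) for s in src)
        return total / len(src)

    return max(one_way(doc_a, doc_b), one_way(doc_b, doc_a))


def distance_to_similarity(distance):
    """Map a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + distance)


related = relaxed_wmd(["eye", "surgery"], ["vision", "operation"], toy_vectors)
unrelated = relaxed_wmd(["eye", "surgery"], ["football"], toy_vectors)
print(distance_to_similarity(related), distance_to_similarity(unrelated))
```

Sentences whose words sit close together in the embedding space get a small distance and therefore a similarity near 1, while an off-topic sentence is pushed toward 0.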

**This requires gensim 3.7 or higher, so if your gensim version is older, update it with the following command:**

```
!pip install gensim==3.8  # or a later 3.x release
```



In [8]:
model.build_vocab(medical_questions_words, update=True)
model.train(medical_questions_words, total_examples=len(medical_questions_words), epochs=model.epochs)
progbar(70, 100)
instance = WmdSimilarity(medical_questions_words, model, num_best=10)

 ######################################################################------------------------------ [ 70.00%]

Finally, we just need to input a sentence; the output is the most similar sentences from the corpus together with their similarity scores.
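With `num_best` set, the lookup returns a list of (document index, similarity) pairs ordered from best to worst and capped at `num_best` entries. A small stand-alone sketch of that selection step, using a hypothetical helper `top_matches` and made-up scores rather than gensim itself:

```python
import heapq


def top_matches(similarities, num_best=10):
    """Return up to num_best (index, score) pairs, highest score first."""
    return heapq.nlargest(num_best, enumerate(similarities), key=lambda pair: pair[1])


# Made-up similarity scores for a five-document corpus.
scores = [0.20, 0.90, 0.50, 0.70, 0.10]
for doc_index, score in top_matches(scores, num_best=3):
    print(doc_index, '%.4f' % score)
```

Because the list can be shorter than `num_best` when the corpus is small, it is safer to iterate over the returned pairs than over a fixed `range(10)`.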

In [10]:
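# English glosses of the test queries, in order:
#   1. "Does one have to fast for LASIK surgery?"
#   2. "Should one not smoke for LASIK surgery?"
#   3. "Hello, thanks for your work; my eye hurts and I wanted to know what I should do?"
#   4. "Football is an exciting sport" (unrelated to ophthalmology)
user_question = ['آیا برای عمل لیزیک باید ناشتا بود؟',
                 'برای عمل لیزیک نباید سیگار کشید؟',
                 'سلام خسته نباشید چشمم درد میکنه خواستم بدونم باید چیکار کنم؟',
                 'فوتبال ورزش پر هیجانی است']

for i, question in enumerate(user_question):
    query = statement_pre_processing(question)
    # sims is a list of (document index, similarity) pairs, best match first.
    sims = instance[query]
    print('Query: ' + question)
    # Iterate over the returned pairs so fewer than 10 results cannot raise IndexError.
    for doc_index, score in sims:
        print(medical_questions[doc_index] + "(sim = %.4f)" % score)
    print()
    progbar(i * 5 + 75, 100)
progbar(100, 100)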


Query: آیا برای عمل لیزیک باید ناشتا بود؟
آیا برای عمل لیزیک باید ناشتا بود؟(sim = 1.0000)
آیا انجام پیریمتری قبل از لیزیک لازم است؟(sim = 0.7721)
هزينه عمل ليزيك چند است؟(sim = 0.7717)
با سلام. آیا امکان عمل لیزیک برای همه افراد وجود دارد؟(sim = 0.7445)
انجام عمل لیزیک چه شرایطی دارد؟(sim = 0.7435)
آيا با داشتن آستيكمات بالا ميتوان عمل ليزيك انجام داد؟(sim = 0.7287)
با سلام. درد عمل لیزیک به چه علت است؟(sim = 0.7258)
با سلاماگر ممكن است بفرماييد كه عمل ليزيك بهتر است يا prk ؟(sim = 0.7218)
ايا با وجود استيگمات بودن چشم امكان عمل ليزيك وجود دارد؟(sim = 0.7200)
عمل فمتو لیزیک  چیست؟(sim = 0.7176)

 ###########################################################################------------------------- [ 75.00%]
Query: برای عمل لیزیک نباید سیگار کشید؟
آیا کشیدن سیگار بعد از عمل لیزیک منعی دارد؟(sim = 0.7998)
انجام عمل لیزیک چه شرایطی دارد؟(sim = 0.7969)
با سلام. آیا امکان عمل لیزیک برای همه افراد وجود دارد؟(sim = 0.7943)
هزينه عمل ليزيك چند است؟(sim = 0.7845)
آيا با داشتن آستيكمات بالا ميتوان