The main objective is to determine the similarity between two sentences from different aspects. Using a corpus obtained from Quora, we measure word similarity between two sentences from four different aspects. We found that accuracy alone is not enough to judge efficiency and that more measures are required. Experiments show that using Naïve Bayes to determine the similarity between two sentences comes closer to people's comprehension of the meaning of a sentence, and gives higher accuracy and efficiency, than cosine similarity.
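For readers unfamiliar with the cosine similarity baseline, the following is a minimal sketch of how it can be computed for two sentences. It assumes a plain bag-of-words representation with lowercased, whitespace-split tokens; the function name `cosine_similarity` and this tokenization are illustrative assumptions and may differ from the preprocessing used in the project code.

```python
import math
from collections import Counter


def cosine_similarity(sentence_a, sentence_b):
    """Cosine similarity between two sentences using raw word counts.

    Assumption: tokens are lowercased and split on whitespace. The
    project itself may use different preprocessing or weighting.
    """
    counts_a = Counter(sentence_a.lower().split())
    counts_b = Counter(sentence_b.lower().split())

    # Dot product over the words the two sentences share.
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared_words)

    # Euclidean length of each count vector.
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))

    # An empty sentence has no direction, so treat it as dissimilar.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With this sketch, "how do i learn python" and "how do i learn java" share four of their five words and score about 0.8. A pair would be labelled a duplicate when its score exceeds a chosen threshold; the threshold value is not stated here, so it is left out of the sketch.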
In this work, we use two methods for detecting duplicate questions and compare them in depth using accuracy, precision, recall, and f-measure. We found the accuracy of the Naïve Bayes classifier to be slightly higher than that of the cosine similarity approach. Looking at the confusion matrices for both approaches during our experiments led us to conclude that accuracy alone is not the best measure for this task. We then experimented with several measures and finally settled on f-measure, precision, and recall. Together with accuracy, they provide a good evaluation for our experiment. Comparing the two approaches, we found that Naïve Bayes has a significantly better recall than cosine similarity. Consequently, it also has a higher f-measure. This led us to conclude that Naïve Bayes is much better suited to this classification task than cosine similarity.
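The four measures can be derived directly from the counts in a confusion matrix. The sketch below uses the standard definitions, with "duplicate" treated as the positive class; the function name `classification_metrics` and its argument names are illustrative and are not taken from the project code.

```python
def classification_metrics(true_pos, false_pos, false_neg, true_neg):
    """Accuracy, precision, recall, and f-measure from confusion-matrix counts.

    Assumption: "duplicate" is the positive class. Ratios with a zero
    denominator are reported as 0.0 rather than raising an error.
    """
    total = true_pos + false_pos + false_neg + true_neg
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg

    accuracy = (true_pos + true_neg) / total if total else 0.0
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0

    # f-measure (F1) is the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

As a purely hypothetical illustration (these counts are made up and are not results from our experiments), `classification_metrics(10, 10, 40, 140)` gives an accuracy of 0.75 but a recall of only 0.2, which shows how a classifier that misses most duplicates can still look acceptable on accuracy alone.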
The complete report for the project can be accessed here: https://github.com/NavneetPrakashSingh/natural-python/blob/master/report.pdf
The complete code for the project can be accessed here: https://github.com/NavneetPrakashSingh/natural-python/tree/master/code
The presentation slides can be accessed here: https://github.com/NavneetPrakashSingh/natural-python/tree/master/presentation