The problem is the identification of similar questions on Quora.
With millions of questions posted, it can be challenging for the user to find similar questions that have already been answered or addressed.
To solve this problem, I employed natural language processing to clean and preprocess the data, and a Siamese Neural Network with LSTM layers and Manhattan distance metric to measure the similarity between pairs of questions.
- Data import and Exploratory data analysis
- Splitting the data,cleaning & preprocessing (Tokenize sentences, Remove capital letters, Remove stopwords, Remove non-alphanumeric characters, Lemmatize the tokens)
- Vectorizing the Train and Test Sets using Word2Vec.
- Train the Siamese Neural Network model & Test.
- Visualization of the preformance.