# What this code is going to do
- We have a set of FAQ questions and answers in a CSV file.
- The code converts the text of the questions into a tensor of embeddings, one row per question.
- Then we take in the user's query.
- The query is also converted into a tensor.
- Finally, we compare the query tensor to the questions tensor and select the best match.
- This _match_ is the question in the CSV that is closest to the one given by the user.
- We then output the answer corresponding to the matching question.
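
The steps above can be sketched end to end without any of the libraries used later. This is only an illustration: `toy_embed` is a hypothetical bag-of-words stand-in for the sentence-embedding model, and the two-row `faq` list is made up for the example.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for the real model: a bag-of-words count vector (NOT a sentence embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def answer(query, faq):
    """Return the answer whose question is most similar to the query."""
    query_vec = toy_embed(query)
    scores = [cosine(query_vec, toy_embed(question)) for question, _ in faq]
    best = max(range(len(faq)), key=scores.__getitem__)
    return faq[best][1]


# Hypothetical two-row FAQ, used only for illustration.
faq = [
    ("what is kandi", "answer about kandi"),
    ("what components does kandi cover", "answer about components"),
]
print(answer("kandi components cover", faq))
```

The notebook follows the same shape, with a pretrained model in place of `toy_embed` and a tensor operation in place of the Python loop.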

# Code

## Libraries Used

`numpy` is used only for `argmax`, i.e. finding the index of the largest value.

In [1]:
import numpy as np

`pandas` is used for making R-like data frames.

In [6]:
import pandas as pd

`lingualytics` and `texthero` are used for cleaning up text.

In [9]:

from lingualytics.preprocessing import remove_lessthan, remove_punctuation, remove_stopwords
from lingualytics.stopwords import hi_stopwords,en_stopwords
from texthero.preprocessing import remove_digits

`sentence_transformers` converts text into a tensor.

In [4]:

from sentence_transformers import SentenceTransformer

`torch` is used to compute how similar two tensors are.

In [5]:

import torch

## Model and Similarity Function

`model` is a `SentenceTransformer` instance which will be used to convert the text into a tensor.

In [6]:
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')


`torch.nn.CosineSimilarity` is a module from `torch` that scores how similar two tensors are. It is used to find the needed question by comparing the query tensor with each question tensor.

In [7]:
cos_fn = torch.nn.CosineSimilarity(dim = 1, eps = 1e-6) 
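
To make the comparison concrete, here is a dependency-free sketch of what a row-wise cosine similarity computes when one query vector is compared against several rows. The function name `cosine_rows` and the toy vectors are hypothetical, and clamping the denominator at `eps` is an assumption about how the small constant guards against division by zero; the library's exact clamping may differ.

```python
import math


def cosine_rows(query, rows, eps=1e-6):
    """Cosine similarity of one query vector against each row vector."""
    q_norm = math.sqrt(sum(x * x for x in query))
    scores = []
    for row in rows:
        dot = sum(q * r for q, r in zip(query, row))
        r_norm = math.sqrt(sum(x * x for x in row))
        # Assumption: clamp the denominator at eps so zero vectors do not divide by zero.
        scores.append(dot / max(q_norm * r_norm, eps))
    return scores


query = [1.0, 0.0]
rows = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
print(cosine_rows(query, rows))
```

A row pointing in the same direction as the query scores 1, a perpendicular row scores 0, and the vector lengths do not affect the score.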

# The Code

## Getting, Cleaning and Embedding all the text

Creating a data frame from the questions and answers in the csv file.

In [7]:

DATASET_PATH = 'faq.csv'
df = pd.read_csv(DATASET_PATH, encoding_errors = "ignore")
df


Unnamed: 0,Q,A
0,What is kandi?,kandi (pronounced kandee) is a platform that h...
1,Have feedback or want to know more?,We are a passionate set of application focused...
2,What components does kandi cover?,kandi helps you select software components acr...
3,How do I use kandi?,kandi provides two simplified experiences to h...
4,How do I shortlist components on kandi?,You can use the below filters to shortlist com...
5,How do I implement the components that I have ...,The component listing and detailed insights pa...


The questions are cleaned below: digits, punctuation, and English and Hindi stopwords are removed, and `remove_lessthan` filters words by length.

In [19]:
df['clean_Q'] = df['Q'].pipe(remove_digits)\
    .pipe(remove_punctuation)\
    .pipe(remove_lessthan,length=100)\
    .pipe(remove_stopwords,stopwords=en_stopwords.union(hi_stopwords))
df



Unnamed: 0,Q,A,clean_Q
0,What is kandi?,kandi (pronounced kandee) is a platform that h...,What kandi
1,Have feedback or want to know more?,We are a passionate set of application focused...,Have feedback want know
2,What components does kandi cover?,kandi helps you select software components acr...,What components kandi cover
3,How do I use kandi?,kandi provides two simplified experiences to h...,kandi
4,How do I shortlist components on kandi?,You can use the below filters to shortlist com...,shortlist components kandi
5,How do I implement the components that I have ...,The component listing and detailed insights pa...,implement components selected kandi
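
For readers without `lingualytics` or `texthero` installed, the cleaning step can be roughly approximated with the standard library. This is a sketch only, not those libraries' implementation: the `clean` helper and the tiny `STOPWORDS` set are hypothetical, and the length filter is left out.

```python
import re
import string

# Hypothetical, tiny stopword list; the notebook uses the much larger
# en_stopwords and hi_stopwords sets.
STOPWORDS = {"is", "do", "does", "how", "i", "the", "on", "to", "or", "use"}


def clean(text, stopwords=STOPWORDS):
    """Remove digits, punctuation and stopwords from a string."""
    text = re.sub(r"\d+", " ", text)  # drop digits
    text = re.sub(f"[{re.escape(string.punctuation)}]+", " ", text)  # drop punctuation
    words = [w for w in text.split() if w.lower() not in stopwords]
    return " ".join(words)


print(clean("What components does kandi cover?"))
```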


Now the cleaned questions are converted into a tensor with one 384-dimensional embedding per question.

In [10]:
q_embs = model.encode(df['clean_Q'], convert_to_tensor = True)
print(q_embs)
print(q_embs.shape)

tensor([[-0.0344,  0.2531, -0.9012,  ...,  0.4821, -0.0692,  0.2219],
        [-0.1312, -0.1116, -0.0634,  ..., -0.0702, -0.9116, -0.0137],
        [-0.3552,  0.4154, -0.7785,  ...,  0.1249,  0.2409,  0.0764],
        [ 0.1192,  0.3432, -0.4598,  ...,  0.3603, -0.2396,  0.1668],
        [-0.4885,  0.2131, -0.6421,  ...,  0.1549, -0.5106,  0.1832],
        [-0.3025,  0.2400, -0.6605,  ...,  0.2106,  0.1297,  0.1079]])
torch.Size([6, 384])


Here, the user query is taken and a data frame is created for it.

In [11]:
user_q = "kandi is components cover?!"
df_q = pd.DataFrame([user_q], columns = ['user_query'])
df_q


Unnamed: 0,user_query
0,kandi is components cover?!


Now, the data frame for the user query is converted into a tensor.

In [12]:
user_q_emb = model.encode(df_q['user_query'], convert_to_tensor=True)
print(user_q_emb.shape)

torch.Size([1, 384])


## Comparing Tensors

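Compute the cosine similarity between the query embedding and each question embedding. The result holds one score per question.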

In [13]:
cos_fn(user_q_emb, q_embs)

tensor([0.6007, 0.0868, 0.8953, 0.5715, 0.6218, 0.6855])

Get the index of the question with the highest similarity score.

In [14]:
ans = np.argmax(cos_fn(user_q_emb, q_embs)).item()
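
The selection step can be checked by hand with plain Python, using the similarity scores printed above. The `ranked` list is an addition for illustration; it shows how the runner-up questions could be inspected as well.

```python
# Similarity scores copied from the notebook output above.
scores = [0.6007, 0.0868, 0.8953, 0.5715, 0.6218, 0.6855]

# Index of the highest score, equivalent to an argmax.
best = max(range(len(scores)), key=scores.__getitem__)

# All indices ordered from most to least similar.
ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)

print(best)        # 2
print(ranked[:3])  # [2, 5, 4]
```

Index 2 is the row for "What components does kandi cover?", which is the question closest to the user's query.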

## Output the answer

In [15]:
print(df['A'][ans])

kandi helps you select software components across:
Packages from all package managers and repositories
Source Code across all major code repositories
Cloud Functions and APIs across all hyperscale cloud providers
