<a href="https://colab.research.google.com/github/KornSiwat/cassava-related-questions-recommendation/blob/main/cassava_related_questions_recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cassava Related Questions Recommendation
Having questions?    
Contact me:    
  - email: siwatponpued@gmail.com
  - line: k41763

Needed Files:
  - faq-list.csv
  - grouping-word-list.txt (domain specific words)
  - normalization-word-list.txt


## Project Setup

### Install Dependencies

"pythainlp": NLP package for the Thai language

In [1]:
!pip install pythainlp

Collecting pythainlp
  Downloading pythainlp-2.3.2-py3-none-any.whl (11.0 MB)
Collecting python-crfsuite>=0.9.6
  Downloading python_crfsuite-0.9.7-cp37-cp37m-manylinux1_x86_64.whl (743 kB)
Collecting tinydb>=3.0
  Downloading tinydb-4.5.2-py3-none-any.whl (23 kB)
Collecting typing-extensions<4.0.0,>=3.10.0
  Downloading typing_extensions-3.10.0.2-py3-none-any.whl (26 kB)
Installing collected packages: typing-extensions, tinydb, python-crfsuite, pythainlp
  Attempting uninstall: typing-extensions
    Found existing installation: typing-extensions 3.7.4.3
    Uninstalling typing-extensions-3.7.4.3:
      Successfully uninstalled typing-extensions-3.7.4.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.6.0 requir

"deepcut": tokenizer engine used through the pythainlp tokenizer

In [2]:
!pip install deepcut

Collecting deepcut
  Downloading deepcut-0.7.0.0-py3-none-any.whl (2.0 MB)
Collecting typing-extensions~=3.7.4
  Downloading typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Installing collected packages: typing-extensions, deepcut
  Attempting uninstall: typing-extensions
    Found existing installation: typing-extensions 3.10.0.2
    Uninstalling typing-extensions-3.10.0.2:
      Successfully uninstalled typing-extensions-3.10.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tinydb 4.5.2 requires typing-extensions<4.0.0,>=3.10.0; python_version <= "3.7", but you have typing-extensions 3.7.4.3 which is incompatible.
Successfully installed deepcut-0.7.0.0 typing-extensions-3.7.4.3


"pandas" for reading/writing files and processing data


In [3]:
!pip install pandas



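"scikit-learn" (imported as "sklearn") for creating vectors

In [4]:
# The PyPI package is named scikit-learn; the "sklearn" package name is deprecated
!pip install scikit-learn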



### Import Dependencies

"pandas" for reading/writing files and processing data


In [5]:
import pandas as pd

"pythainlp" for Thai Language tokenization and stopwords reference

In [6]:
from pythainlp.tokenize import word_tokenize as pythainlp_tokenize
from pythainlp.corpus import thai_stopwords

"re" for string splitting

In [7]:
import re

Types for type hinting

In [8]:
from typing import List

"sklearn" for turning text into vectors and computing cosine similarity

In [9]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## Code

### Model

Text wraps a string together with an ignore_tokenization flag. The tokenization preprocessor sets the flag on domain-specific words so that the tokenizer, normalization, and stopword removal leave them intact.

In [10]:
class Text:
    def __init__(self,
                 value: str,
                 ignore_tokenization: bool):
        self.value = value
        self.ignore_tokenization = ignore_tokenization

    def __repr__(self) -> str:
        return str(self.value)

### Configs & Constants

pythainlp tokenizer engine names

In [11]:
MM_ENGINE='mm' # Thai word segmentation with the Maximum Matching algorithm (old API)
NEWMM_ENGINE='newmm' # Thai word segmentation with the Maximum Matching algorithm, newer code developed further from Korakot Chaovavanich's code at https://www.facebook.com/groups/408004796247683/permalink/431283740586455/
DEEPCUT_ENGINE='deepcut' # Thai word segmentation with deepcut from https://github.com/rkcosmos/deepcut

grouping words

In [12]:
def read_grouping_words(filename: str):
    def process_line(line: str):
        return line.strip()

    with open(filename) as f:
        lines = f.readlines()

    processed_lines = list(map(process_line, lines))

    grouping_words = processed_lines

    return grouping_words


grouping_words = read_grouping_words('grouping-word-list.txt')

print(grouping_words)


['โรคใบไหม้', 'Cassava bacterial blight', 'โรคใบจุดสีน้ำตาล', 'brown leaf spot disease', 'โรคใบด่างมันสำปะหลัง', 'cassava mosaic disease', 'โรคแอนแทรคโนส', 'anthracnose', 'โรครากเน่าและหัวเน่า', 'root and tuber rot', 'โรครากปมมันสำปะหลัง', 'cassava root-knot', 'โรคลำต้นไหม้', 'stem blight disease', 'โรคพุ่มแจ้', 'witches broom', 'โรคเน่าเปียก', 'wet rot', 'เพลี้ยแป้ง', 'เพลี้ยแป้งลาย', 'เพลี้ยแป้งแจ๊คเบียดเลย์', 'เพลี้ยแป้งสีเทา', 'เพลี้ยแป้งเขียว', 'เพลี้ยแป้งสีชมพู', 'ไร', 'ไรแมงมุมคันซาบา', 't. kanzawai kishida', 'ไรแมงมุม', 'oligonychus biharensis hirst', 'ไรแดงหม่อน', 'ไรแดงมันสำปะหลัง', 'แมลงหวี่ขาว', 'แมลงหวี่ขาวยาสูบ', 'Bemisia tabaci', 'Gennadius', 'silverleaf whitefly', 'แมลงหวี่ขาวใยเกลียว', 'Aleurodicus disperses Russell', 'เพลี้ยหอย', 'เพลี้ยหอยเกล็ด', 'เพลี้ยหอยขาว', 'ปลวก', 'แมลงนูนหลวง', 'Lepidiota stigma Fabricius', 'มันสำปะหลัง', 'แคสซาวา', 'ทาปิโอก้า', 'ทาปิโอกา', 'Cassava', 'Tapioca', 'แมนิออค', 'Manioc', 'เกษตรศาสตร์ 50', 'เกษตรศาสตร์', 'KU50', 'เกษตรศาสตร์ 72', 'K

normalization words

In [13]:
def read_normalization_words(filename: str):
    def process_line(line: str, separator: str = "|"):
        return line.strip().split(separator)

    with open(filename) as f:
        lines = f.readlines()

    processed_lines = list(map(process_line, lines))

    normalization_words = dict()

    for line in processed_lines:
        normalized_word = line[0]
        to_be_normalized_words = line[1:]

        for to_be_normalized_word in to_be_normalized_words:
            normalization_words[to_be_normalized_word] = normalized_word

    return normalization_words


normalization_words = read_normalization_words('normalization-word-list.txt')

print(normalization_words)


{'มั๊ย': 'หรือไม่', 'ไหม': 'หรือไม่', 'หรือเปล่า': 'หรือไม่', 'รึเปล่า': 'หรือไม่', 'เท่าไร': 'เท่าไหร่', 'เท่าใด': 'เท่าไหร่', 'เมื่อใด': 'เมื่อไร', 'เมื่อไหร่': 'เมื่อไร', 'ตอนไหน': 'เมื่อไร', 'ตอนใด': 'เมื่อไร', 'ยังไง': 'อย่างไร', 'จริงเหรอ': 'จริงไหม', 'จริงหรือ': 'จริงไหม', 'จริงหรือไม่': 'จริงไหม', 'จริงหรือเปล่า': 'จริงไหม', 'อะไร': 'ใด', 'มาก': 'สูง', 'เยอะ': 'สูง', 'ที่สุด': 'สุด', 'แปลง': 'ไร่', 'เพาะปลูก': 'ปลูก', 'ยาฆ่าหญ้า': 'สารกำจัดวัชพืช', 'ยากำจัดวัชพืช': 'สารกำจัดวัชพืช', 'หญ้า': 'วัชพืช', 'Cassava bacterial blight': 'โรคใบไหม้', 'brown leaf spot disease': 'โรคใบจุดสีน้ำตาล', 'cassava mosaic disease': 'โรคใบด่างมันสำปะหลัง', 'anthracnose': 'โรคแอนแทรคโนส', 'root and tuber rot': 'โรครากเน่าและหัวเน่า', 'cassava root-knot': 'โรครากปมมันสำปะหลัง', 'stem blight disease': 'โรคลำต้นไหม้', 'witches broom': 'โรคพุ่มแจ้', 'wet rot': 'โรคเน่าเปียก', 't. kanzawai kishida': 'ไรแมงมุมคันซาบา', 'oligonychus biharensis hirst': 'ไรแมงมุม', 'ไรแดงมันสำปะหลัง': 'ไรแดงหม่อน', 'Bemisia 
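
The parser above implies a simple file layout: each line starts with the normalized form, followed by the variants that should map to it, separated by `|`. A minimal sketch of that mapping on in-memory lines, so no file is needed. `parse_normalization_lines` and the sample lines are illustrative assumptions, not the contents of normalization-word-list.txt:

```python
from typing import Dict, List


def parse_normalization_lines(lines: List[str], separator: str = "|") -> Dict[str, str]:
    """Map every variant on a line to the first (normalized) word of that line."""
    mapping: Dict[str, str] = {}
    for line in lines:
        normalized_word, *variants = line.strip().split(separator)
        for variant in variants:
            mapping[variant] = normalized_word
    return mapping


# Illustrative lines in the assumed "normalized|variant|variant" layout
sample_lines = ["หรือไม่|ไหม|หรือเปล่า\n", "เท่าไหร่|เท่าไร|เท่าใด\n"]
sample_mapping = parse_normalization_lines(sample_lines)

print(sample_mapping)
```

Each variant becomes a key, so lookups during normalization go from the informal spelling to the canonical one.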

stopwords

In [14]:
stopwords = thai_stopwords()

frequently used

In [15]:
WHITE_SPACE = ' '

### Preprocessors

In [16]:
def preprocess_questions(questions: List[str],
                         tokenizer_engine: str,
                         grouping_words: List[str],
                         normalization_words: dict,
                         stopwords: List[str]) -> List[str]:
    return list(
        map(lambda question:
            preprocess_question(question=question,
                                tokenizer_engine=tokenizer_engine,
                                grouping_words=grouping_words,
                                normalization_words=normalization_words,
                                stopwords=stopwords
                                ),
            questions))


def preprocess_question(question: str,
                        tokenizer_engine: str,
                        grouping_words: List[str],
                        normalization_words: dict,
                        stopwords: List[str]) -> str:
    tokenized_words = tokenize(text=question,
                               tokenizer_engine=tokenizer_engine,
                               grouping_words=grouping_words)

    normalized_words = list(
        map(lambda word:
            normalize(text=word,
                      normalization_words=normalization_words),
            tokenized_words))

    removed_stopword_words = remove_stopword(texts=normalized_words,
                                             stopwords=stopwords)

    space_separated_words = " ".join(removed_stopword_words)

    return space_separated_words


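def create_vectors_fit_transform(questions: List[str], vectorizer) -> pd.DataFrame:
    # Learn the vocabulary from the questions, then vectorize them
    sparse_matrix = vectorizer.fit_transform(questions)

    return sparse_matrix_to_data_frame(sparse_matrix, vectorizer)


def create_vectors_transform(questions: List[str], vectorizer) -> pd.DataFrame:
    # Vectorize with the vocabulary already learned by fit_transform
    sparse_matrix = vectorizer.transform(questions)

    return sparse_matrix_to_data_frame(sparse_matrix, vectorizer)


def sparse_matrix_to_data_frame(sparse_matrix, vectorizer) -> pd.DataFrame:
    # Column names come from the vectorizer that produced the matrix,
    # not from a global variable that happens to be named "vectorizer"
    doc_term_matrix = sparse_matrix.toarray()
    # get_feature_names_out() needs scikit-learn >= 1.0; it replaces
    # get_feature_names(), which was removed in scikit-learn 1.2
    df = pd.DataFrame(doc_term_matrix,
                      columns=vectorizer.get_feature_names_out(),
                      )

    return df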


def tokenize(text: str,
             tokenizer_engine: str,
             grouping_words: List[str]) -> List[Text]:

    before_split_by_grouping_words_texts = [
        Text(value=text, ignore_tokenization=False)]
    after_split_by_grouping_words_texts = []

    for grouping_word in grouping_words:
        for before_tokenize_text in before_split_by_grouping_words_texts:
            for word in split_by_grouping_word(
                    text=before_tokenize_text,
                    grouping_word=grouping_word):
                after_split_by_grouping_words_texts.append(word)
        before_split_by_grouping_words_texts = after_split_by_grouping_words_texts
        after_split_by_grouping_words_texts = []
    splitted_by_grouping_words_texts = before_split_by_grouping_words_texts

    tokenize_result = []

    for text in splitted_by_grouping_words_texts:
        if text.ignore_tokenization:
            tokenize_result.append(text)
        else:
            for word in pythainlp_tokenize(text.value,
                                           engine=tokenizer_engine):
                if word != WHITE_SPACE:
                    tokenize_result.append(
                        Text(word, ignore_tokenization=False))

    return tokenize_result


def normalize(text: Text, normalization_words: dict) -> Text:
    if text.ignore_tokenization:
        return text

    result = text

    for original_word in normalization_words:
        new_word = normalization_words[original_word]
        result.value = result.value.replace(original_word, new_word)

    return result


def remove_stopword(texts: List[Text], stopwords: List[str]) -> List[str]:
    return [text.value for text in texts if (text.value not in stopwords) or text.ignore_tokenization]


def split_by_grouping_word(text: Text, grouping_word: str):
    if text.ignore_tokenization:
        return [text]

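    # re.escape keeps regex metacharacters in the grouping word (for example
    # the "." in "t. kanzawai kishida") from being read as a pattern
    split_results = re.split(f'({re.escape(grouping_word)})', text.value)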

    return list(map(
        lambda split_result:
        Text(value=split_result,
             ignore_tokenization=split_result == grouping_word),
        split_results))


def white_space_tokenizer(x: str):
    return x.split(WHITE_SPACE)
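
The grouping step relies on `re.split` with a capturing group: the group makes the matched grouping word appear in the result instead of being discarded. A minimal standalone sketch of that behavior; `split_keeping_word` is a hypothetical helper, and the sketch passes the word through `re.escape` so characters such as `.` are matched literally:

```python
import re
from typing import List


def split_keeping_word(text: str, word: str) -> List[str]:
    """Split text around word, keeping word itself as a separate piece."""
    # The capturing group makes re.split return the delimiter as well
    pieces = re.split(f"({re.escape(word)})", text)
    # re.split yields empty strings when the word sits at either end
    return [piece for piece in pieces if piece]


print(split_keeping_word("fertilizer N-P-K ratio", "N-P-K"))
# ['fertilizer ', 'N-P-K', ' ratio']
```

Only the middle piece equals the grouping word, which is the piece that gets the ignore_tokenization flag in the preprocessor.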


### Test Preprocessors

test tokenize

In [17]:
test_tokenize_result = tokenize(text="การใส่ปุ๋ย N-P-K เท่าใดและสูตรอะไรถึงจะเหมาะสำหรับการปลูกมันสำปะหลัง",
         tokenizer_engine='deepcut',
         grouping_words=["N-P-K"])

print(test_tokenize_result)

[การ, ใส่, ปุ๋ย, N-P-K, เท่า, ใด, และ, สูตร, อะไร, ถึง, จะ, เหมาะ, สำหรับ, การ, ปลูก, มันสำปะหลัง]


test normalize

In [18]:
normalization_words = {
    "เหมาะ": "เหมาะสม",
}

test_normalize_result = list(
        map(lambda word:
            normalize(text=word,
                      normalization_words=normalization_words),
            test_tokenize_result))

print(test_normalize_result)

[การ, ใส่, ปุ๋ย, N-P-K, เท่า, ใด, และ, สูตร, อะไร, ถึง, จะ, เหมาะสม, สำหรับ, การ, ปลูก, มันสำปะหลัง]
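
Because normalize uses `str.replace`, a key also matches inside longer tokens. With the mapping from this test, a token that already reads เหมาะสม would receive the suffix a second time. A small sketch contrasting substring replacement with an exact-match lookup; `normalize_substring` and `normalize_exact` are hypothetical names used only here, and whether exact matching suits the real word list is an assumption to check against the data:

```python
from typing import Dict

mapping: Dict[str, str] = {"เหมาะ": "เหมาะสม"}


def normalize_substring(token: str, normalization_words: Dict[str, str]) -> str:
    """Replace every occurrence of each key, even inside a longer token."""
    for original_word, new_word in normalization_words.items():
        token = token.replace(original_word, new_word)
    return token


def normalize_exact(token: str, normalization_words: Dict[str, str]) -> str:
    """Replace the token only when the whole token equals a key."""
    return normalization_words.get(token, token)


already_normalized = "เหมาะสม"

print(normalize_substring(already_normalized, mapping))  # suffix is duplicated
print(normalize_exact(already_normalized, mapping))      # token is left unchanged
```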


test remove stopword

In [19]:
# stopwords = ["จะ", "อยาก", "การ"]

test_remove_stopword_result = remove_stopword(texts=test_normalize_result,
                stopwords=stopwords)

print(" ".join(test_remove_stopword_result))

ใส่ ปุ๋ย N-P-K ใด สูตร เหมาะสม สำหรับ ปลูก มันสำปะหลัง


Test preprocess questions

In [20]:
questions = ["การใส่ปุ๋ย N-P-K เท่าใดและสูตรอะไรถึงจะเหมาะสำหรับการปลูกมันสำปะหลัง",
             "สูตรปุ๋ย N-P-K ที่เหมาะกับมันสำปะหลังคืออะไร",
             "ปลูกมันสำปะหลังควรใส่ปุ๋ย N-P-K แบบไหน"]

normalization_words = {
    "เหมาะ": "เหมาะสม",
}

test_preprocess_questions_result = preprocess_questions(questions=questions,
                            tokenizer_engine=DEEPCUT_ENGINE,
                            grouping_words=[
                                "N-P-K"],
                            normalization_words=normalization_words,
                            stopwords=stopwords
                            )

print(test_preprocess_questions_result)

['ใส่ ปุ๋ย N-P-K ใด สูตร เหมาะสม สำหรับ ปลูก มันสำปะหลัง', 'สูตร ปุ๋ย N-P-K เหมาะสม มันสำปะหลัง', 'ปลูก มันสำปะหลัง ใส่ ปุ๋ย N-P-K']


test create vector with count vectorizer

In [21]:
vectorizer = CountVectorizer(tokenizer=white_space_tokenizer,
                             min_df=1)

create_vectors_fit_transform(test_preprocess_questions_result, vectorizer)

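|   | n-p-k | ปลูก | ปุ๋ย | มันสำปะหลัง | สำหรับ | สูตร | เหมาะสม | ใด | ใส่ |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |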


In [22]:
vectorizer = TfidfVectorizer(tokenizer=white_space_tokenizer,
                             min_df=1)

create_vectors_fit_transform(test_preprocess_questions_result, vectorizer)

|   | n-p-k | ปลูก | ปุ๋ย | มันสำปะหลัง | สำหรับ | สูตร | เหมาะสม | ใด | ใส่ |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.255105 | 0.328495 | 0.255105 | 0.255105 | 0.431931 | 0.328495 | 0.328495 | 0.431931 | 0.328495 |
| 1 | 0.397897 | 0.0 | 0.397897 | 0.397897 | 0.0 | 0.512364 | 0.512364 | 0.0 | 0.0 |
| 2 | 0.397897 | 0.512364 | 0.397897 | 0.397897 | 0.0 | 0.0 | 0.0 | 0.0 | 0.512364 |
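
The TF-IDF weights for question 0 can be reproduced by hand. With its default settings, TfidfVectorizer uses a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, and then scales each row to unit Euclidean length. A short sketch using only the standard library; the document frequencies are read off the count table from the previous cell, and `smooth_idf` is a hypothetical helper name:

```python
import math

n_docs = 3

# Number of questions containing each of the nine terms, in column order
doc_freqs = [3, 2, 3, 3, 1, 2, 2, 1, 2]


def smooth_idf(doc_freq: int, n: int) -> float:
    """Smoothed idf, as used by TfidfVectorizer's default settings."""
    return math.log((1 + n) / (1 + doc_freq)) + 1


# Every term occurs once in question 0, so tf * idf is just the idf
raw_weights = [smooth_idf(df, n_docs) for df in doc_freqs]

# L2 normalization: divide by the Euclidean length of the row
row_length = math.sqrt(sum(weight * weight for weight in raw_weights))
weights = [round(weight / row_length, 6) for weight in raw_weights]

print(weights)
```

Terms that appear in all three questions get the smallest weight in the row, and terms unique to question 0 get the largest.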


### Cosine Similarity Analysis Processing Pipeline

In [38]:
def perform_cosine_similarity_analysis(filename: str,
                                       id_colomn_name: str,
                                       reference_question_column_name: str,
                                       query_question_column_name: str,
                                       normalization_words: dict,
                                       grouping_words: List[str],
                                       stopwords: List[str],
                                       vectorizer,
                                       tokenizer_engine: str = 'deepcut',
                                       top_n: int = 1,
                                       ):
    faqs = pd.read_csv(filename)

    # Reference questions: one entry per distinct main question
    reference_questions = preprocess_questions(questions=faqs[reference_question_column_name].drop_duplicates(),
                     tokenizer_engine=tokenizer_engine,
                     grouping_words=grouping_words,
                     normalization_words=normalization_words,
                     stopwords=stopwords)

    # Query questions: every paraphrase, matched against the references
    query_questions = preprocess_questions(questions=faqs[query_question_column_name],
                     tokenizer_engine=tokenizer_engine,
                     grouping_words=grouping_words,
                     normalization_words=normalization_words,
                     stopwords=stopwords)

    reference_question_vectors = create_vectors_fit_transform(questions=reference_questions,
                                                              vectorizer=vectorizer)
    query_question_vectors = create_vectors_transform(questions=query_questions,
                                                      vectorizer=vectorizer)

    similarity_matrix = cosine_similarity(query_question_vectors,
                                          reference_question_vectors)

    def process_similarity_matrix_row(row):
        index_value_pairs = [(index, value) for index, value in enumerate(row)]

        return sorted(index_value_pairs,
                      key=lambda x: x[1], reverse=True)

    cosine_similarity_result = list(map(process_similarity_matrix_row,
                                        similarity_matrix))

    result_checker = dict()

    for query_question_index, i in enumerate(faqs[id_colomn_name].astype(int)):
        reference_question_index = i - 1

        result_checker[query_question_index] = reference_question_index

    def is_result_correct(query_question_index,
                          reference_question_index,
                          result_checker):
        return result_checker[query_question_index] == reference_question_index

    matching_result = []

    for query_question_index, results in enumerate(cosine_similarity_result):
        top_matchings = results[0:top_n]

        for top_matching in top_matchings:
            top_matching_reference_question_index = top_matching[0]
            if (is_result_correct(query_question_index=query_question_index,
                                    reference_question_index=top_matching_reference_question_index,
                                    result_checker=result_checker)):
                matching_result.append(True)
                break
        else:
            matching_result.append(False)
        
    number_of_query_questions = len(faqs)
    correct_matching = len(list(filter(lambda x: x, matching_result)))
    incorrect_matching = len(list(filter(lambda x: not x, matching_result)))
    accuracy = correct_matching / number_of_query_questions

    print(f'Number of all query question: {number_of_query_questions}')
    print(f'Correct Matching: {correct_matching}')
    print(f'Incorrect Matching: {incorrect_matching}')
    print(f'Accuracy: {accuracy * 100}%')
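
The evaluation above counts a query as correct when its expected reference question appears among the top_n most similar ones. A compact sketch of the same idea on a toy similarity matrix; `top_n_accuracy` and the toy numbers are illustrative assumptions, not part of the notebook's pipeline:

```python
from typing import Sequence


def top_n_accuracy(similarity_rows: Sequence[Sequence[float]],
                   expected_indices: Sequence[int],
                   top_n: int = 1) -> float:
    """Share of queries whose expected reference is among the top_n matches."""
    hits = 0
    for row, expected in zip(similarity_rows, expected_indices):
        # Reference indices ordered from most to least similar
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if expected in ranked[:top_n]:
            hits += 1
    return hits / len(expected_indices)


toy_similarities = [
    [0.9, 0.1, 0.3],  # best match is reference 0
    [0.2, 0.5, 0.6],  # best match is reference 2, runner-up is reference 1
    [0.4, 0.3, 0.1],  # best match is reference 0
]
toy_expected = [0, 1, 2]

print(top_n_accuracy(toy_similarities, toy_expected, top_n=1))
print(top_n_accuracy(toy_similarities, toy_expected, top_n=2))
```

Raising top_n can only keep or increase the score, which matches the pattern in the reported results.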

### Cosine Similarity Analysis Processing Pipeline Test

read the frequently asked questions list

"frequently asked questions" is shortened to "faqs"

In [24]:
test_faqs = pd.read_csv("faq-list.csv")

print(test_faqs)

     ID Main Q  ...                                          Similar Q
0            1  ...               ต้องไถดินกี่ครั้งก่อนปลูกมันสำปะหลัง
1            1  ...  การเตรียมดินปลูกมันสำปะหลัง ควรไถดินกี่ครั้งก่...
2            1  ...               ก่อนปลูกมันสำปะหลังต้องไถดินกี่ครั้ง
3            1  ...      จำนวนครั้งในการไถเตรียมดินก่อนปลูกมันสำปะหลัง
4            1  ...           การเตรียมดินปลูกมันสำปะหลังต้องทำอย่างไร
..         ...  ...                                                ...
496         35  ...                     ทำยังไงไม่โดนหักเงินเวลาขายมัน
497         35  ...                        ทำไมต้องหักเงินที่ขายมันได้
498         35  ...      เพราะอะไรเวลาเอามันสำปะหลังไปขายถึงโดนหักเงิน
499         35  ...    เวลาเอามันสำปะหลังไปขายมักจะโดนหักเงินเพราะอะไร
500         35  ...  ปัจจัยที่โดนหักเงินตอนเอามันสำปะหลังไปขายมีอะไ...

[501 rows x 3 columns]


preprocess both reference and query questions from faqs

In [25]:
test_reference_questions = preprocess_questions(questions=test_faqs["Main Q"].drop_duplicates(),
                     tokenizer_engine=DEEPCUT_ENGINE,
                     grouping_words=grouping_words,
                     normalization_words=normalization_words,
                     stopwords=stopwords)

test_query_questions = preprocess_questions(questions=test_faqs["Similar Q"],
                     tokenizer_engine=DEEPCUT_ENGINE,
                     grouping_words=grouping_words,
                     normalization_words=normalization_words,
                     stopwords=stopwords)

create vectors for both the reference and the query questions from faqs

In [30]:
test_vectorizer = TfidfVectorizer(tokenizer=white_space_tokenizer,
                             min_df=1)

test_reference_question_vectors = create_vectors_fit_transform(questions=test_reference_questions,
                                                               vectorizer=test_vectorizer)
test_query_question_vectors = create_vectors_transform(questions=test_query_questions,
                                                       vectorizer=test_vectorizer)

Cosine Similarity Analysis

compare query questions to reference questions

In [31]:
test_similarity_matrix = cosine_similarity(test_query_question_vectors,
                                           test_reference_question_vectors)

print(test_similarity_matrix)

[[0.85022634 0.74170657 0.11615272 ... 0.06948483 0.078478   0.02071538]
 [0.95728868 0.72564227 0.13137376 ... 0.07859035 0.08876202 0.01379331]
 [0.85022634 0.74170657 0.11615272 ... 0.06948483 0.078478   0.02071538]
 ...
 [0.01727714 0.02588144 0.02345407 ... 0.05729445 0.01584662 0.94775148]
 [0.01727714 0.02588144 0.02345407 ... 0.05729445 0.01584662 0.94775148]
 [0.01893844 0.02837009 0.02570931 ... 0.06280364 0.01737037 0.86461383]]
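
Each entry of this matrix is the cosine of the angle between one query vector and one reference vector: the dot product divided by the product of the two vector lengths. A hand-rolled sketch on two count vectors taken from the CountVectorizer test earlier (questions 1 and 2); `cosine` is a hypothetical helper and is not how scikit-learn implements `cosine_similarity`:

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity: dot product over the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    length_u = math.sqrt(sum(a * a for a in u))
    length_v = math.sqrt(sum(b * b for b in v))
    return dot / (length_u * length_v)


# Count vectors of questions 1 and 2 from the CountVectorizer test
question_1 = [1, 0, 1, 1, 0, 1, 1, 0, 0]
question_2 = [1, 1, 1, 1, 0, 0, 0, 0, 1]

similarity = cosine(question_1, question_2)
print(similarity)
```

The two questions share three of their five terms each, so the similarity comes out at roughly 3 / 5.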


process each row to show the most similar reference question first.

store data as tuple -> (index of reference question, cosine similarity analysis value)

In [33]:
def test_process_similarity_matrix_row(row):
    index_value_pairs = [(index, value) for index, value in enumerate(row)]

    return sorted(index_value_pairs, key=lambda x: x[1], reverse=True)


test_cosine_similarity_result = list(map(test_process_similarity_matrix_row,
                                         test_similarity_matrix))

print(test_cosine_similarity_result)

[[(0, 0.8502263424673482), (1, 0.7417065658073387), (25, 0.36813645398211486), (24, 0.30314843799626157), (27, 0.13086143212104479), (2, 0.11615271796466531), (18, 0.11615271796466531), (28, 0.11615271796466531), (26, 0.11379548315614876), (3, 0.10700762527321679), (4, 0.09235860040959323), (9, 0.09113511383046508), (33, 0.07847799702548766), (32, 0.06948482794343147), (23, 0.06600461296391061), (17, 0.049535997980670717), (22, 0.036399682215862664), (15, 0.035010890021776216), (12, 0.03120629125745677), (6, 0.03055241167993575), (19, 0.02816064913611533), (29, 0.027058997867162978), (30, 0.027058997867162975), (8, 0.02603795479807635), (7, 0.02589082243213649), (10, 0.02576046855141177), (14, 0.02343926317713063), (31, 0.022781913855256224), (34, 0.020715382224982037), (16, 0.01691351963196778), (5, 0.0), (11, 0.0), (13, 0.0), (20, 0.0), (21, 0.0)], [(0, 0.9572886783594878), (1, 0.7256422727566196), (25, 0.46576366006169917), (24, 0.38354127795772586), (27, 0.13391301713890558), (2, 0

Result Checker.   
dictionary containing:     
  - key=query question index 
  - value=index of the correct matching reference question

In [34]:
# check if id column contains any null value
test_faqs["ID Main Q"].isnull().values.any()

False

In [35]:
test_result_checker = dict()

for query_question_index, i in enumerate(test_faqs["ID Main Q"].astype(int)):
    reference_question_index = i - 1

    test_result_checker[query_question_index] = reference_question_index

print(test_result_checker)

{0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1, 22: 1, 23: 1, 24: 1, 25: 1, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1, 32: 1, 33: 2, 34: 2, 35: 2, 36: 2, 37: 2, 38: 2, 39: 2, 40: 2, 41: 2, 42: 2, 43: 2, 44: 2, 45: 2, 46: 2, 47: 2, 48: 2, 49: 2, 50: 2, 51: 2, 52: 2, 53: 2, 54: 2, 55: 2, 56: 3, 57: 3, 58: 3, 59: 3, 60: 3, 61: 3, 62: 3, 63: 3, 64: 3, 65: 3, 66: 3, 67: 3, 68: 3, 69: 3, 70: 3, 71: 3, 72: 3, 73: 3, 74: 3, 75: 3, 76: 3, 77: 4, 78: 4, 79: 4, 80: 4, 81: 4, 82: 4, 83: 4, 84: 4, 85: 4, 86: 4, 87: 4, 88: 4, 89: 4, 90: 4, 91: 4, 92: 4, 93: 4, 94: 4, 95: 4, 96: 4, 97: 4, 98: 4, 99: 4, 100: 5, 101: 5, 102: 5, 103: 5, 104: 5, 105: 5, 106: 5, 107: 5, 108: 5, 109: 5, 110: 5, 111: 5, 112: 5, 113: 5, 114: 5, 115: 5, 116: 5, 117: 5, 118: 5, 119: 5, 120: 5, 121: 5, 122: 5, 123: 5, 124: 5, 125: 5, 126: 5, 127: 5, 128: 5, 129: 5, 130: 5, 131: 5, 132: 5, 133: 5, 134: 5, 135: 5, 136: 5, 137: 5, 138: 
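
The mapping above subtracts 1 from each ID, which assumes that IDs start at 1, have no gaps, and appear in the same order as the deduplicated reference questions. If a file does not satisfy that, one alternative sketch is to number the IDs by order of first appearance. `build_result_checker` is a hypothetical helper, and it assumes every ID corresponds to exactly one distinct main question:

```python
from typing import Dict, List


def build_result_checker(ids: List[int]) -> Dict[int, int]:
    """Map each query row to the position of its ID among the distinct IDs."""
    position_of_id: Dict[int, int] = {}
    for id_value in ids:
        # The first time an ID is seen, it takes the next free position
        position_of_id.setdefault(id_value, len(position_of_id))

    return {query_index: position_of_id[id_value]
            for query_index, id_value in enumerate(ids)}


# IDs with a gap (3 and 4 are missing) still map to positions 0, 1, 2
print(build_result_checker([1, 1, 2, 5, 5]))
# {0: 0, 1: 0, 2: 1, 3: 2, 4: 2}
```

For IDs that already run 1, 2, 3, ... without gaps, this gives the same result as subtracting 1.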

In [36]:
def test_is_result_correct(query_question_index, reference_question_index, result_checker):
    return result_checker[query_question_index] == reference_question_index

test_matching_result = []

for test_query_question_index, test_results in enumerate(test_cosine_similarity_result):
    test_top_matching = test_results[0]
    test_top_matching_reference_question_index = test_top_matching[0]

    test_matching_result.append(test_is_result_correct(query_question_index=test_query_question_index,
                            reference_question_index=test_top_matching_reference_question_index,
                            result_checker=test_result_checker))

print(test_matching_result)

[True, True, True, True, True, False, True, True, False, False, False, False, True, True, True, False, False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, False, True, True, True, True, True, False, False, True, True, True, True, False, True, False, True, True, True, True, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, True, True, False, False, False, False, False, True, True, True, False, False, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True,

Matching Summary

In [37]:
test_number_of_query_questions = len(test_faqs)
test_correct_matching = len(list(filter(lambda x: x, test_matching_result)))
test_incorrect_matching = len(list(filter(lambda x: not x, test_matching_result)))
test_accuracy = test_correct_matching / test_number_of_query_questions

print(f'Number of all query question: {test_number_of_query_questions}')
print(f'Correct Matching: {test_correct_matching}')
print(f'Incorrect Matching: {test_incorrect_matching}')
print(f'Accuracy: {test_accuracy * 100}%')

Number of all query question: 501
Correct Matching: 390
Incorrect Matching: 111
Accuracy: 77.84431137724552%


## Cosine Similarity Result

### Using Count Vectorizer

Count Vectorizer Top 1

In [39]:
print("Cosine Similarity using CountVectorizer Evaluate Top 1")

count_vectorizer = CountVectorizer(tokenizer=white_space_tokenizer,
                             min_df=1)

perform_cosine_similarity_analysis(filename="faq-list.csv",
                                   id_colomn_name="ID Main Q",
                                   reference_question_column_name="Main Q",
                                   query_question_column_name="Similar Q",
                                   normalization_words=normalization_words,
                                   grouping_words=grouping_words,
                                   stopwords=stopwords,
                                   tokenizer_engine=DEEPCUT_ENGINE,
                                   vectorizer=count_vectorizer,
                                   top_n=1)

Cosine Similarity using CountVectorizer Evaluate Top 1
Number of all query question: 501
Correct Matching: 387
Incorrect Matching: 114
Accuracy: 77.24550898203593%


Count Vectorizer Top 3

In [40]:
print("Cosine Similarity using CountVectorizer Evaluate Top 3")

count_vectorizer = CountVectorizer(tokenizer=white_space_tokenizer,
                             min_df=1)

perform_cosine_similarity_analysis(filename="faq-list.csv",
                                   id_colomn_name="ID Main Q",
                                   reference_question_column_name="Main Q",
                                   query_question_column_name="Similar Q",
                                   normalization_words=normalization_words,
                                   grouping_words=grouping_words,
                                   stopwords=stopwords,
                                   tokenizer_engine=DEEPCUT_ENGINE,
                                   vectorizer=count_vectorizer,
                                   top_n=3)

Cosine Similarity using CountVectorizer Evaluate Top 3
Number of all query question: 501
Correct Matching: 445
Incorrect Matching: 56
Accuracy: 88.82235528942117%


Count Vectorizer Top 5

In [41]:
print("Cosine Similarity using CountVectorizer Evaluate Top 5")

count_vectorizer = CountVectorizer(tokenizer=white_space_tokenizer,
                             min_df=1)

perform_cosine_similarity_analysis(filename="faq-list.csv",
                                   id_colomn_name="ID Main Q",
                                   reference_question_column_name="Main Q",
                                   query_question_column_name="Similar Q",
                                   normalization_words=normalization_words,
                                   grouping_words=grouping_words,
                                   stopwords=stopwords,
                                   tokenizer_engine=DEEPCUT_ENGINE,
                                   vectorizer=count_vectorizer,
                                   top_n=5)

Cosine Similarity using CountVectorizer Evaluate Top 5
Number of all query question: 501
Correct Matching: 453
Incorrect Matching: 48
Accuracy: 90.41916167664671%


### Using TfidfVectorizer

Tfidf Vectorizer Top 1

In [42]:
print("Cosine Similarity using Tfidf Vectorizer Evaluate Top 1")

tfidf_vectorizer = TfidfVectorizer(tokenizer=white_space_tokenizer,
                             min_df=1)

perform_cosine_similarity_analysis(filename="faq-list.csv",
                                   id_colomn_name="ID Main Q",
                                   reference_question_column_name="Main Q",
                                   query_question_column_name="Similar Q",
                                   normalization_words=normalization_words,
                                   grouping_words=grouping_words,
                                   stopwords=stopwords,
                                   tokenizer_engine=DEEPCUT_ENGINE,
                                   vectorizer=tfidf_vectorizer,
                                   top_n=1)

Cosine Similarity using Tfidf Vectorizer Evaluate Top 1
Number of all query question: 501
Correct Matching: 390
Incorrect Matching: 111
Accuracy: 77.84431137724552%


Tfidf Vectorizer Top 3

In [43]:
print("Cosine Similarity using Tfidf Vectorizer Evaluate Top 3")

tfidf_vectorizer = TfidfVectorizer(tokenizer=white_space_tokenizer,
                             min_df=1)

perform_cosine_similarity_analysis(filename="faq-list.csv",
                                   id_colomn_name="ID Main Q",
                                   reference_question_column_name="Main Q",
                                   query_question_column_name="Similar Q",
                                   normalization_words=normalization_words,
                                   grouping_words=grouping_words,
                                   stopwords=stopwords,
                                   tokenizer_engine=DEEPCUT_ENGINE,
                                   vectorizer=tfidf_vectorizer,
                                   top_n=3)

Cosine Similarity using Tfidf Vectorizer Evaluate Top 3
Number of all query question: 501
Correct Matching: 452
Incorrect Matching: 49
Accuracy: 90.21956087824351%


Tfidf Vectorizer Top 5

In [44]:
print("Cosine Similarity using Tfidf Vectorizer Evaluate Top 5")

tfidf_vectorizer = TfidfVectorizer(tokenizer=white_space_tokenizer,
                             min_df=1)

perform_cosine_similarity_analysis(filename="faq-list.csv",
                                   id_colomn_name="ID Main Q",
                                   reference_question_column_name="Main Q",
                                   query_question_column_name="Similar Q",
                                   normalization_words=normalization_words,
                                   grouping_words=grouping_words,
                                   stopwords=stopwords,
                                   tokenizer_engine=DEEPCUT_ENGINE,
                                   vectorizer=tfidf_vectorizer,
                                   top_n=5)

Cosine Similarity using Tfidf Vectorizer Evaluate Top 5
Number of all query question: 501
Correct Matching: 464
Incorrect Matching: 37
Accuracy: 92.61477045908184%
