# Code Snippet Retrieval

* Shiva Sai Pavan INJA



___
___
___

# Information Retrieval Problem:
## Code Search and Snippet Retrieval

<br/><br/>

* **Description**:
    * Develop a code search engine that allows users to search for and retrieve relevant code snippets and programming resources based on their queries.
    * The results can include code snippets, libraries, and solutions to programming problems.
    * We can present results by code similarity and by using ranking algorithms.

<br/><br/>

* **Objective**: 
    * To develop a code search engine that can take both free-form user queries and code snippets.



___

* **Importance**:
    1. *Code Quality Improvement*: By exploring various code snippets and solutions, developers can discover better coding practices, leading to improved code quality and maintainability.
    
    2. *Enhanced Productivity*: A code search engine significantly accelerates the software development process. Developers can quickly locate and reuse existing code snippets, libraries, and solutions, reducing redundant work and saving valuable time.
    
    3. *Bug Resolution*: When encountering bugs or issues in their code, developers can search for similar problems and solutions, which can help them troubleshoot and resolve issues more effectively.


___

* **Summary of changes and updates since the last assignment submission**:
    * <span style="color:green">Added new ideas from the literature.</span>
    * <span style="color:green">Added 5 new literature papers from journals and conferences.</span>
    * <span style="color:green">Updated the chapter on hardware, software, and data.</span>
    * <span style="color:green">Updated the Tasks Accomplished, and Work Planned.</span>
    * <span style="color:green">Wrote the software code along with test cases.</span>


___
___
___

# Overview of Past and Current Solution Ideas

## State-of-the-Art Solutions

### New ideas from the literature:
* Neural Code Search (NCS + TF-IDF + Cosine Similarity)
* GloVe embedding model
* Word2vec model
* Doc2vec model

### From the literature:
* Aroma
* FaCoY: **F**ind **a** **C**ode **o**ther than **Y**ours
* CoCaBu: **CO**de vo**CABU**lary
* GitSearch
* DGMS: **D**eep **G**raph **M**atching and **S**earching
* COSAL: **Co**de-to-Code **S**earch **A**cross **L**anguages
* Search4Code

_We will explain these tools in detail as we discuss the literature..._

___

### Available online:

* Google Code Search

* TranS^3

* SLAMPA

_In the interest of time and space, please see the "Other Material" in the "Appendix" for details..._

___
___

## Solution Ideas from Journal and Conference Papers

___


* **Aroma `[Luan:2019:OOPSLA]`**:
    * Takes a partial code snippet as input and searches the corpus for method bodies that contain it.
    * Designed for real-time use: it first retrieves a small set of snippets using approximate search, then prunes and clusters this set.
    * Recommends code snippets for a given query from a large corpus containing millions of methods, running on a multi-core server machine.
    * Implemented in *C++* for four programming languages: 
        * *Hack*, *Java*, *JavaScript*, *Python*.
        <!-- * [Julien Verlaguet and Alok Menghrajani. 2014. Hack: a new programming language for HHVM. https://code.fb.com/developer-tools/hack-a-new-programming-language-for-hhvm],  -->

<center><img src="Figures/aroma_overview.png" width="800"/></center>
<!-- ![aroma_overview.png](attachment:da5fb8ba-ecf0-4d30-8f82-4a76ee664615.png) -->

___

* **FaCoY `[Kim:2018:ICSE]`**:
    * A code-to-code search tool that statically finds code fragments that may be semantically similar to the user's input code.
    * Fully static: it relies solely on source code and does not require runtime information.
    * Based on query alternation: after extracting structural code elements from a code fragment to build a query, it builds alternate queries from code fragments whose descriptions are similar to that of the initial fragment.
    * Implemented indices for *Java* files collected from *GitHub* and Q&A posts from *StackOverflow*.

<center><img src="Figures/facoy_overview.png" width="800"/></center>
<!-- ![facoy_overview.png](attachment:cc95e454-c9db-47a7-8a4a-ba171cabb381.png) -->

___

* **CoCaBu `[Sirres:2018:ICSE]`**: 
<!-- Keyword-based code search; built GITSEARCH, a code search engine, on top of GitHub and Stack Overflow. -->
    * ``CoCaBu`` addresses the vocabulary mismatch problem: it finds relevant code from free-form query terms that describe the task, without requiring the user to know which keywords or APIs to search for.
    * ``CoCaBu``’s Code Query Generator then creates another query which includes not only the initial user query terms but also program elements, such as method and class names, from the extracted snippets.
    * ``GitSearch`` is a free-form search engine for GitHub: it instantiates the CoCaBu approach using indices of Java files built from *GitHub* and Q&A posts from *StackOverflow* to find the most relevant source code examples for developer queries.
    * ``GitSearch`` collects textual information in addition to structural entities. It treats source code as text and applies pre-processing such as *tokenization* (e.g., splitting camel case), *stop word removal*, and *stemming*.

<center><img src="Figures/cocabu_overview.png" width="800"/></center>
<!-- ![cocabu_overview.png](attachment:755f8571-9a44-40ae-833c-b7b136a3f680.png) -->

___

* **DeCo `[Cambronero:2019:ESEC/FSE]`**: 
    * This method embeds code into vectors and queries into vectors for semantic correlation, with supervised and unsupervised techniques.
    * The query and the candidate code snippets are mapped to a shared vector space.
    * Vector representations can be learned in an unsupervised manner, which just uses code, or in a supervised manner, which exploits an aligned corpus of code and their corresponding natural language descriptions.

<center><img src="Figures/deeplearning.png" width="1200"/></center>
<!-- ![deeplearning.png](attachment:8e1a3898-05f4-4e66-91cb-529ecca9358f.png) -->

___

* **COSAL `[Mathew:2021:ESEC/FSE]`**: 
  * Code snippets are ranked using non-dominated sorting based on code token similarity, structural similarity, and behavioral similarity.
  * It uses two static similarity measures based on extracted tokens from source code, a tree edit distance based on a generic Abstract Syntax Tree (AST), and one dynamic similarity measure to compute IO similarity.

<center><img src="Figures/COSAL.png" width="1200"/></center>
<!-- ![deeplearning.png](attachment:8e1a3898-05f4-4e66-91cb-529ecca9358f.png) -->
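
The non-dominated sorting step described for COSAL can be illustrated with a small sketch. This is not COSAL's implementation: the snippet names and the three similarity scores per snippet are made-up values, and the code only shows the general idea of grouping candidates into Pareto fronts when no single score should dominate the ranking.

```python
from typing import Dict, List, Tuple

# (token similarity, structural similarity, behavioral similarity); higher is better
Scores = Tuple[float, float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every score and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_sort(candidates: Dict[str, Scores]) -> List[List[str]]:
    """Group candidates into Pareto fronts; front 0 holds the best trade-offs."""
    remaining = dict(candidates)
    fronts: List[List[str]] = []
    while remaining:
        # A candidate belongs to the current front if nothing left dominates it.
        front = sorted(
            name
            for name, score in remaining.items()
            if not any(
                dominates(other_score, score)
                for other, other_score in remaining.items()
                if other != name
            )
        )
        fronts.append(front)
        for name in front:
            del remaining[name]
    return fronts


# Hypothetical similarity scores for four candidate snippets.
toy_scores = {
    "snippet_a": (0.9, 0.8, 0.7),
    "snippet_b": (0.6, 0.9, 0.5),
    "snippet_c": (0.5, 0.5, 0.5),
    "snippet_d": (0.4, 0.3, 0.6),
}
fronts = non_dominated_sort(toy_scores)
print(fronts)
```

Here `snippet_a` and `snippet_b` land in the first front because each is better than the other on at least one measure, while the remaining two are dominated by `snippet_a`.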

___

* **DGMS `[Ling:2021:TKDD]`**: 
   * ``DGMS`` is an end-to-end matching model based on graph neural networks for semantic code retrieval.
   * ``DGMS`` makes better use of the rich structural information in source codes and query texts as well as the interaction semantic relations between each other.

<center><img src="Figures/DGMS.png" width="1200"/></center>
<!-- ![DGMS.png](attachment:003ea40f-3b20-4276-9c47-dcdfe3e3ff30.png) -->

___

* **Effective Reformulation of Query for Code Search using Crowdsourced Knowledge and Extra-Large Data Analytics** `[Rahman:2018:ICSME]`
    * It is a technique that automatically identifies relevant and specific API classes from the StackOverflow Q&A site for a programming task written as a natural language query, then reformulates the query for improved code search.
    * A total of 656538 Q&A threads are collected from StackOverflow, and then standard natural language preprocessing is performed on each thread to normalize the content.
    * The baseline idea is to extract appropriate API classes from them using appropriate selection methods and use them for query reformulation.
    * API Class Rank calculation and Borda score calculation are used to achieve the goals.
 
<center><img src="Figures/NLP2API.png" width="800"/></center>
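
To make the Borda score idea concrete, here is a minimal sketch of a classic Borda count used to merge several ranked lists of API classes and append the winners to a query. The exact scoring formula of the paper may differ; the function names, the example rankings, and the `top_k` parameter are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List


def borda_scores(rankings: List[List[str]]) -> Dict[str, int]:
    """Classic Borda count: in a list of n items, position p (0-based) earns n - p points."""
    scores: Dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, api_class in enumerate(ranking):
            scores[api_class] += n - position
    return dict(scores)


def reformulate(query: str, rankings: List[List[str]], top_k: int = 2) -> str:
    """Append the top_k API classes (by Borda score, ties broken by name) to the query."""
    scores = borda_scores(rankings)
    best = sorted(scores, key=lambda c: (-scores[c], c))[:top_k]
    return query + " " + " ".join(best)


# Hypothetical ranked lists, e.g. produced by two different selection heuristics.
rankings = [
    ["HashMap", "ArrayList", "Iterator"],
    ["HashMap", "List", "ArrayList"],
]
scores = borda_scores(rankings)
new_query = reformulate("iterate over map entries", rankings)
print(scores)
print(new_query)
```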

___

* **Fast code recommendation via approximate sub-tree matching** `[Shao:2021:FITEE]`
    * It takes programming context as input and recommends relevant code snippets to assist developers in software development.
    * A new code recommendation algorithm based on sub-tree hashing and the SW algorithm is introduced.
    * Experimental results show that it has good performance in terms of time consumption and accuracy for different recommending tasks.
    * Code recommendation tool implemented for C and Java Language.
    * The SW algorithm is used for sequence alignment to find similar fragments between the two sequences and determine their similarity.

<center><img src="Figures/fast_code_recommendation.png" width="800"/></center>
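
The SW (Smith-Waterman) alignment step can be sketched as follows. This is a generic textbook version over plain token sequences, not the paper's implementation: in the paper the sequences come from hashed sub-trees, while here ordinary code tokens stand in for them, and the `match`, `mismatch`, and `gap` values are arbitrary choices.

```python
from typing import Sequence


def smith_waterman(a: Sequence[str], b: Sequence[str],
                   match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score between two token sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, which is what makes the alignment local.
            H[i][j] = max(0, diagonal, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


context = ["for", "i", "in", "range", "n"]
candidate = ["x", "for", "i", "in", "items"]
score = smith_waterman(context, candidate)
print(score)
```

The shared run `for i in` contributes three matches, so higher scores indicate longer or closer shared fragments between the programming context and a candidate snippet.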

___

* **Search4Code: Code Search Intent Classification Using Weak Supervision** `[Rao:2021:MSR]`
    * A novel weak supervision-based model to detect code search intent in queries, and this code recommendation works well at method and file level with good performance.
    * A large-scale dataset of queries, mined from the Bing web search engine, used for code search research.
    * A CNN-based model for code search intent classification has been developed for C# and Java search queries mined from the Bing web search engine.
    * Implemented for the programming languages C# and Java.
    * This method requires additional storage to store the hash representation of AST nodes for code snippets, and this space grows linearly with the size of the code database.
 
<center><img src="Figures/search4code.png" width="800"/></center>

___

* **Neural query expansion for code search** `[Liu:2019:MAPL]`
    * The Neural Code Search (NCS) system facilitates code searching through natural language inquiries.
    * It utilizes the proximity between vectorized representations of both code snippets and search terms to execute the search process.
    * It incorporates a supervised learning component to align words in the search queries with terms found in the actual code.
    * It employs a specialized ranking method to address the dispersal of search outcomes that often occurs with searches based on vector similarity.
    * The capabilities of NCS are showcased through its ability to respond to programming queries sourced from Stack Overflow using a collection of code obtained from GitHub.


___

* **Code search: A survey of techniques for finding code** `[Grazia:2023:ACMCS]`
    * The paper reviews three decades of advancements in the field of code search research.
    * Here are their key observations:
        * The research indicates that while free-form queries are convenient to input and offer a broad range of expression, they might introduce ambiguity and lack the specificity of other structured query formats.
        * Queries using programming languages don't require users to learn new syntax, yet the detail and expressiveness of such code-based queries fluctuate based on the user's search goal and the particularities of the search engine used.
        * User interfaces have the potential to assist in the preprocessing of queries either transparently or through user feedback mechanisms.
        * Code search engines typically refine queries by drawing on parallels between the search terms and code identifiers, as well as databases that pair natural language with code.
        * Common methods for refining queries include adjusting the weight of search terms, adding new terms or substituting them, as well as enhancing queries with more sophisticated representations.
        * Studies of developers reveal that code searches are routinely conducted to fulfill a variety of objectives, such as understanding existing code, locating reusable code segments, and efficiently navigating to familiar code.


___

* **GloVe `[Pennington:2014:EMNLP]`**:
    * GloVe is short for Global Vectors for Word Representation. It employs a weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of corpus statistics.
    * The model captures both local and global aspects of word semantics by aggregating global co-occurrence statistics from a corpus. GloVe vectors relate directly to the words’ co-occurrence probabilities, which allows a rich representation of word meaning and of nuanced differences between words.
    * Efficient to load and use: we can use a pre-trained model to convert the "description" and "code" fields of code datasets into meaningful embeddings for later querying.
<center><img src="Figures/glove.png" width="800"/></center>
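
The weighted least squares model mentioned above can be written out explicitly. As given in the GloVe paper `[Pennington:2014:EMNLP]`, with $X_{ij}$ the number of times word $j$ occurs in the context of word $i$, $V$ the vocabulary size, $w_i$ and $\tilde{w}_j$ the word and context vectors, and $b_i$, $\tilde{b}_j$ bias terms, the objective is:

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
$$

The weighting function $f$ keeps very frequent co-occurrences from dominating the loss and gives zero weight to pairs that never co-occur; the paper reports using $x_{\max} = 100$ and $\alpha = 3/4$.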

___

* **word2vec `[Mikolov:2013:ANIPS]`**:
    * Word2vec utilizes a shallow neural network architecture, either through a Continuous Bag of Words (CBOW) or Skip-gram model, focusing on word prediction either from context (CBOW) or predicting context from a word (Skip-gram).
    * Enables the encoding of words into a vector space where semantic meaning is reflected in the distance and directionality between vectors. This allows for operations such as vector addition and subtraction to reveal semantic relationships between words, like solving analogies.
    * It performs well on tasks such as word analogy and word similarity, and it learns from individual local context windows rather than from aggregated global corpus statistics.
<center><img src="Figures/word2vec.png" width="800"/></center>

___

* **doc2vec `[Le:2014:PMLR]`**:
    * Extends the word2vec paradigm by not only learning word embeddings but also learning document-level embeddings, capturing the essence of entire sentences, paragraphs, or documents in a continuous vector space.
    * It utilizes a "Paragraph Vector" framework which maintains a unique vector for each document that is trained to predict words in the document, enabling nuanced understanding of document semantics beyond individual word meanings.
    * Facilitates a variety of document-level tasks such as document classification, sentiment analysis, and information retrieval by encoding the contextual information of words within a document, setting it apart from word-level embeddings.
<center><img src="Figures/doc2vec.png" width="800"/></center>

___
___

## Solution Ideas Helpful
* Treat source code as text, and conduct pre-processing such as tokenization (e.g., splitting camel case), stop word removal, and stemming.
* Semantic vector idea: embed code into vectors like GloVe or word2vec.
* Develop a comprehensive similarity determination method: based on token similarity, structural similarity, and behavioral similarity.
* Graph neural networks on deep graph matching and searching (DGMS) model.
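
As a rough illustration of the first idea (treating code as text), the sketch below splits identifiers on underscores and camel case and drops a few stop tokens. The regular expression, the tiny `STOP_TOKENS` set, and the function names are assumptions made for this example; stemming is left out to keep it dependency-free.

```python
import re
from typing import List

# Illustrative stop tokens only; a real list would be much longer.
STOP_TOKENS = {"the", "a", "an", "of", "to", "in"}

# Matches, in order: an acronym followed by a capitalized word ("JSON" in "JSONResponse"),
# a capitalized or lowercase word, a trailing acronym, or a run of digits.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_identifier(identifier: str) -> List[str]:
    """Split snake_case and camelCase identifiers into lowercase word tokens."""
    words: List[str] = []
    for part in re.split(r"[_\W]+", identifier):
        words.extend(_WORD.findall(part))
    return [w.lower() for w in words]


def preprocess(text: str) -> List[str]:
    """Tokenize text or code into words and drop stop tokens."""
    return [tok for tok in split_identifier(text) if tok not in STOP_TOKENS]


print(split_identifier("parseJSONResponse"))
print(split_identifier("get_user_id2"))
print(preprocess("readTheFile"))
```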

<!-- > * Pick three best ideas from the literature afrter brainstorming
> * detail which aspects of the papers we could use specificly 
> * Solution Idea 1
> * Solution Idea 2 -->

## Solution Ideas Helpful

* **Pre-processing for Code Context**: Applies text pre-processing techniques to the code and its descriptions, such as tokenization (e.g., splitting identifiers and camelCase), removing less informative tokens (akin to stop words in natural language), and stemming to reduce words to their base or root form.
* **Semantic Embedding of Code and Description**: Utilizes the GloVe model to embed code tokens into a vector space, taking advantage of the model's strength in capturing co-occurrence statistics within the corpus to reflect the semantic meaning of code components.
* **Semantic Similarity for Search**: Establishes a method for computing similarity that encompasses semantic closeness between code descriptions, leveraging the dense vector representations from GloVe to match search queries with the most relevant code snippets based on their descriptions.
* **Comprehensive Code Similarity Framework**: Constructs a similarity framework that accounts for the broader document-level context of code, using doc2vec embeddings to improve the matching of complex queries to relevant code documents, considering both structure and meaning.
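
A minimal sketch of the semantic similarity idea: each description is represented by the average of its word vectors, and a query is matched to the description with the highest cosine similarity. The 3-dimensional `TOY_VECTORS` are invented for illustration (real GloVe vectors would be loaded from a pre-trained file), and averaging is only one simple way to combine word vectors.

```python
import math
from typing import Dict, List, Optional

# Made-up 3-d vectors standing in for pre-trained embeddings.
TOY_VECTORS: Dict[str, List[float]] = {
    "reverse": [0.9, 0.1, 0.0],
    "invert": [0.8, 0.2, 0.1],
    "string": [0.1, 0.9, 0.0],
    "text": [0.2, 0.8, 0.1],
    "sort": [0.0, 0.1, 0.9],
    "list": [0.1, 0.2, 0.8],
}


def embed(tokens: List[str], vectors: Dict[str, List[float]]) -> Optional[List[float]]:
    """Average the vectors of known tokens; None if no token is in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two dense vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


descriptions = {
    "reverse a string": ["reverse", "string"],
    "sort a list": ["sort", "list"],
}
query_vec = embed(["invert", "text"], TOY_VECTORS)
best = max(
    descriptions,
    key=lambda d: cosine(query_vec, embed(descriptions[d], TOY_VECTORS)),
)
print(best)
```

The query shares no words with either description, yet it is matched to "reverse a string" because its vectors point in a similar direction, which is the behavior a keyword-only search cannot provide.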

___
___
___

# New Solution Ideas

___

* Design a model that accepts free-form query terms and performs a static (on pre-processed data within our corpus) search. 
* Propose a semantic vector model based on both code and comments.
* Pre-process code into graph representations first, then apply novel matching algorithms to compute code similarity.
    

___

* **Below are some key features of our solution:** *(we marked changes with colors and strikes)*
1. ***Code Corpus:*** Collect and curate a vast repository of code snippets and programming resources from various sources, ~including public code repositories like GitHub, and Stack Overflow.~ <span style="color:green">like GeeksForGeeks, W3Schools, TutorialsPoint, and ChatGPT.</span> Ensure that the codebase covers a wide range of ~programming languages and~ domains.
2. ***Indexing and Parsing:*** Implement an indexing system to process and parse the collected codebase. Extract metadata, comments, and code structure to build an efficient search index. Tokenize the code to enable keyword-based searches.
3. ***Query Interface:*** Develop a user-friendly query interface where users can input their programming queries. Allow for both text-based queries (e.g., "Python JSON parsing") and code-based queries (e.g., code snippet or function signature).
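
The indexing step can be pictured as a simple inverted index from tokens to entry ids. The sketch below is an assumption about one possible design, not the project's actual index: it indexes the description and tags of each entry and answers boolean AND queries. The second entry is a hypothetical example added for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(entries: List[dict]) -> Dict[str, Set[int]]:
    """Map each lowercase token from descriptions and tags to the ids of entries containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, entry in enumerate(entries):
        text = entry["description"] + " " + " ".join(entry["tags"])
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def lookup(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return the ids of entries that contain every query token (boolean AND)."""
    postings = [index.get(token, set()) for token in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


entries = [
    {"description": "Reverse a string in Python", "code": "string[::-1]",
     "tags": ["python", "string", "reverse"]},
    # Hypothetical second entry, for illustration only.
    {"description": "Sort a list in Python", "code": "sorted(items)",
     "tags": ["python", "list", "sort"]},
]
index = build_index(entries)
print(lookup(index, "python string"))
print(lookup(index, "python"))
```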



___

4. ***Search Algorithms:*** Implement search algorithms that can retrieve relevant code snippets and resources based on the user's query. Explore techniques like vector embeddings, semantic search, and natural language processing to improve search accuracy.
5. ***Code Similarity:*** Calculate code similarity metrics to ensure that retrieved code snippets are not only relevant but also similar in functionality to what the user needs. Techniques such as code diffing and code embeddings can be useful here.
6. ***Ranking and Relevance:*** Develop a ranking algorithm to sort search results by relevance. Take into account factors such as code quality, community feedback (e.g., upvotes on Stack Overflow), and recency of code updates.

<br/><br/>

* **Optional**
7. ***User Feedback:*** Allow users to provide feedback on the relevance and usefulness of retrieved code snippets. Use this feedback to continuously improve search results.

___



<!--     8. *Code Visualization:* Provide a code visualization feature that allows users to see how the retrieved code snippets fit into larger code structures or projects. This can help users understand context. -->
<!--     9. *Code Execution Sandbox:* Optionally, provide a sandbox environment where users can test and run code snippets within the search interface, enhancing the usability of the tool. -->


* **Challenges and considerations:**
    - *Data Privacy and Security:* Ensure that sensitive or proprietary code is not included in the code corpus. Implement security measures to protect user data.
    - *Scalability:* Building and maintaining a large code corpus can be resource-intensive. Consider strategies for efficient storage and updates.
    - *Code Quality:* Assess the quality of retrieved code snippets to avoid promoting poorly written or insecure code.
    - *Legal Considerations:* Be mindful of copyright and licensing issues when collecting and distributing code snippets.

<br/>

* **Evaluation:** 
    * Evaluate the effectiveness of our code search engine using metrics such as **precision**, **recall**, and **user satisfaction** surveys. 
    * Conduct user testing to gather feedback and make improvements.
    * Performance in terms of response time to queries.
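
The evaluation metrics listed above could be computed along these lines. This is a sketch under assumptions: `fake_search` is a hypothetical stand-in for a real search function, and the relevant ids are invented; precision and recall are measured over the top `k` results and response time with `time.perf_counter`.

```python
import time
from typing import Callable, List, Set, Tuple


def precision_recall_at_k(retrieved: List[int], relevant: Set[int], k: int) -> Tuple[float, float]:
    """Precision and recall computed over the first k retrieved ids."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def timed_search(search_fn: Callable[[str], List[int]], query: str) -> Tuple[List[int], float]:
    """Run a search function and return its results with the elapsed time in seconds."""
    start = time.perf_counter()
    results = search_fn(query)
    return results, time.perf_counter() - start


def fake_search(query: str) -> List[int]:
    """Hypothetical search function returning a fixed ranking of entry ids."""
    return [3, 7, 1, 9, 4]


results, seconds = timed_search(fake_search, "reverse a string")
precision, recall = precision_recall_at_k(results, relevant={1, 3, 5}, k=3)
print(f"P@3={precision:.2f} R@3={recall:.2f} time={seconds:.6f}s")
```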

___
___
___

# Hardware, Software, and Data

___
___

## Hardware, Software, and Data Needs

We could use our laptops: 
* M1 Pro; 16 GB memory, 512 GB storage. MacOS 13.4.1.
* 2.4 GHz 8-Core Intel Core i9 with Radeon Pro Vega 20 4 GB GPU; 32 GB memory, 2 TB storage. MacOS 14.1.1.


___

## Software System Diagram

* Diagram of the flow of data and control for the software:

<center><img src="Figures/IR_Project.png" width="800"/></center>

___

## Significant Software Tasks Accomplished

* Curated the first hundred entries for the dataset using websites like GeeksForGeeks, W3Schools, and TutorialsPoint, as well as ChatGPT. We manually ensured that the codebase covers a wide range of domains.
* Defined the JSON format for the data, containing description, code, and tags fields. Below is an example entry from our dataset:
```
    {
      "description":"Reverse a string in Python",
      "code":"string[::-1]",
      "tags":[
        "python",
        "string",
        "reverse"
      ]
    }
```
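
A small check like the following could guard the dataset format while new entries are being curated. It is a sketch, not part of the project code: the `validate_entry` helper and its messages are assumptions, and only the three fields named above are checked.

```python
import json
from typing import List

# Field names come from the dataset format above; the expected types are assumptions.
REQUIRED_FIELDS = {"description": str, "code": str, "tags": list}


def validate_entry(entry: dict) -> List[str]:
    """Return a list of problems found in one dataset entry (empty if it looks valid)."""
    problems: List[str] = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    tags = entry.get("tags")
    if isinstance(tags, list) and not all(isinstance(t, str) for t in tags):
        problems.append("tags should contain only strings")
    return problems


good = json.loads(
    '{"description": "Reverse a string in Python", "code": "string[::-1]",'
    ' "tags": ["python", "string", "reverse"]}'
)
bad = {"description": "x", "tags": "python"}
print(validate_entry(good))
print(validate_entry(bad))
```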

___

* Implemented an indexing system to process and parse the collected codebase, and built an index from our descriptions.
* Preprocessed and tokenized the dataset and queries to enable free-form search. Below are some of the preprocessing ideas that we implemented:
    * Case Standardization
    * Tokenization
    * Whitespace Removal
    * Removal of non-ascii Characters
    * Punctuation Removal
    * Remove urls
    * Stop-word Removal
    * Common-word Removal
    * Question-word Removal
    * WordNetLemmatizer
    * PorterStemmer
    * SnowballStemmer

___

* Wrote three search and ranking methods:
    * The initial method uses TF-IDF and cosine similarity scores.
    * The second method uses Global Vectors for Word Representation (GloVe).
        * GloVe is constructed by aggregating global word-word co-occurrence statistics from a corpus. 
        * The model effectively captures linear substructures in the vector space by leveraging both global statistics and local context, making it capable of providing insight into word analogies and semantic relationships. 
        * As a result, words with similar meanings are placed closer together in the vector space, reflecting their semantic similarity.
    * The third method uses the Doc2Vec model. Doc2Vec, also known as Paragraph Vector, extends the Word2Vec approach to unsupervised learning of continuous representations for larger blocks of text, such as sentences, paragraphs, and entire documents.
        * Developed by researchers at Google, Doc2Vec represents documents in a high-dimensional vector space.
        * This method not only considers the context of words within a document but also maintains a unique vector for the document itself. 
        * These document vectors are trained to predict words in a context, enabling the model to capture the semantic meaning of the text. As a result, semantically similar documents are mapped to proximate points in the vector space, facilitating tasks such as document similarity analysis, classification, and clustering.
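
To show what the first method computes, here is a dependency-free sketch of TF-IDF weighting with cosine-similarity ranking over tokenized descriptions. It is an illustration rather than the notebook's implementation, which imports `TfidfVectorizer` and `cosine_similarity` from scikit-learn; the smoothed IDF formula, the toy documents, and the function names are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_vectors(docs: List[List[str]]) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Build sparse TF-IDF vectors (token -> weight) using a smoothed IDF."""
    n = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    vectors = [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def search(query_tokens: List[str], docs: List[List[str]]) -> List[int]:
    """Return ids of documents with non-zero similarity, best match first."""
    vectors, idf = tfidf_vectors(docs)
    # Query tokens that never occur in the corpus carry no weight.
    query = {tok: tf * idf[tok] for tok, tf in Counter(query_tokens).items() if tok in idf}
    scored = [(cosine(query, vec), i) for i, vec in enumerate(vectors)]
    return [i for score, i in sorted(scored, key=lambda p: (-p[0], p[1])) if score > 0]


# Toy pre-processed descriptions (hypothetical).
docs = [
    ["reverse", "string", "python"],
    ["sort", "list", "python"],
    ["read", "file", "python"],
]
print(search(["reverse", "string"], docs))
print(search(["python", "list"], docs))
```

Because "python" appears in every document, it receives the lowest weight, so the rarer token "list" decides which document ranks first for the second query.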

___
___
___


# Appendix

___
___

## Reference List

* Reference 1: `[Luan:2019:OOPSLA]` Luan, Sifei, et al. "Aroma: Code recommendation via structural code search." Proceedings of the ACM on Programming Languages 3.OOPSLA (2019): 1-28. https://dl.acm.org/doi/abs/10.1145/3360578

* Reference 2: `[Kim:2018:ICSE]` Kim, Kisub, et al. "FaCoY: a code-to-code search engine." Proceedings of the 40th International Conference on Software Engineering. 2018. https://dl.acm.org/doi/abs/10.1145/3180155.3180187

* Reference 3: `[Sirres:2018:ICSE]` Sirres, Raphael, et al. "Augmenting and structuring user queries to support efficient free-form code search." Proceedings of the 40th International Conference on Software Engineering. 2018.


___

* Reference 4: `[Cambronero:2019:ESEC/FSE]` Cambronero, Jose, et al. "When deep learning met code search." Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2019. https://doi.org/10.1145/3338906.3340458

* Reference 5: `[Mathew:2021:ESEC/FSE]` Mathew, George, and Kathryn T. Stolee. "Cross-language code search using static and dynamic analyses." Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2021. https://doi.org/10.1145/3468264.3468538

* Reference 6: `[Ling:2021:TKDD]` Ling, Xiang, et al. "Deep graph matching and searching for semantic code retrieval." ACM Transactions on Knowledge Discovery from Data (TKDD) 15.5 (2021): 1-21. https://doi.org/10.1145/3447571


___

* Reference 7: `[Rahman:2018:ICSME]` Rahman, Mohammad Masudur, and Chanchal Roy. "Effective reformulation of query for code search using crowdsourced knowledge and extra-large data analytics." 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2018. https://doi.org/10.1109/ICSME.2018.00057

* Reference 8: `[Shao:2021:FITEE]` Shao, Yichao, et al. "Fast code recommendation via approximate sub-tree matching." Frontiers of Information Technology & Electronic Engineering 23.8 (2022): 1205-1216. https://doi.org/10.1631/FITEE.2100379

* Reference 9: `[Rao:2021:MSR]` Rao, Nikitha, Chetan Bansal, and Joe Guan. "Search4Code: Code search intent classification using weak supervision." 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR). IEEE, 2021. https://doi.org/10.1109/MSR52588.2021.00077


___

* Reference 10: `[Liu:2019:MAPL]` Liu, Jason, et al. "Neural query expansion for code search." Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 2019. https://doi.org/10.1145/3315508.3329975

* Reference 11: `[Grazia:2023:ACMCS]` Di Grazia, Luca, and Michael Pradel. "Code search: A survey of techniques for finding code." ACM Computing Surveys 55.11 (2023): 1-31. https://doi.org/10.1145/3565971

* Reference 12: `[Pennington:2014:EMNLP]` Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global vectors for word representation." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.

* Reference 13: `[Mikolov:2013:ANIPS]` Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems 26 (2013).

* Reference 14: `[Le:2014:PMLR]` Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." International conference on machine learning. PMLR, 2014.


___
___

## Other Material
* Google Code Search: https://code.google.com.

<!-- * Search4Code, Microsoft: https://github.com/microsoft/Search4Code. -->

* Wenhua Wang, Yuqun Zhang, Zhengran Zeng, and Guandong Xu. 2020. TranS^3: A Transformer-based Framework for Unifying Code Summarization and Code Search. CoRR abs/2003.03238 (2020). arXiv:2003.03238 https://arxiv.org/abs/2003.03238.

* S. Zhou, H. Zhong, and B. Shen. 2018. SLAMPA: Recommending Code Snippets with Statistical Language Model. In 2018 25th Asia-Pacific Software Engineering Conference (APSEC). 79–88.

* GloVe project website: https://nlp.stanford.edu/projects/glove/


___
___
___

# Software

In [5]:
# Cells for the well organized and documented code 
# --------------------------------------------------------------------------------
# Preamble script block to identify host, user, and kernel
import sys
! hostname
! whoami
print(sys.executable)
print(sys.version)
print(sys.version_info)
# --------------------------------------------------------------------------------

DESKTOP-G2A0SJV
desktop-g2a0sjv\i.shiva sai pavan
C:\Users\I.Shiva Sai Pavan\anaconda3\python.exe
3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:20:11) [MSC v.1938 64 bit (AMD64)]
sys.version_info(major=3, minor=12, micro=3, releaselevel='final', serial=0)


In [2]:
# --------------------------------------------------------------------------------
import json 

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

import re

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import numpy as np
from numpy import dot
from numpy.linalg import norm

import gensim
import gensim.downloader as api
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

import time
# --------------------------------------------------------------------------------

In [3]:
# --------------------------------------------------------------------------------
# Uncomment below if you do not have these on your system
# nltk.download('stopwords')
# nltk.download('punkt')
# --------------------------------------------------------------------------------

In [4]:
# --------------------------------------------------------------------------------
# BEWARE: Slow to load!
# import gensim.downloader as api
# --------------------------------------------------------------------------------
# load glove vectors from twitter 2015
# model_glove = api.load("glove-twitter-100") 
# --------------------------------------------------------------------------------

In [5]:
# --------------------------------------------------------------------------------
# Read a pre-trained glove model
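def load_glove_model(File):
    """Parse a GloVe text file into a dict mapping each word to its vector."""
    print("Loading Glove Model")
    model_glove = {}
    # GloVe text files are UTF-8; an explicit encoding avoids platform-dependent defaults.
    with open(File, 'r', encoding='utf-8') as f:
        for line in f:
            split_line = line.split()                                # "<word> <v1> <v2> ..."
            word = split_line[0]
            embedding = np.array(split_line[1:], dtype=np.float64)
            model_glove[word] = embedding
    print(f"{len(model_glove)} words loaded!")
    return model_glove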

model_glove = load_glove_model("Models/glove.6B.50d.txt")
# --------------------------------------------------------------------------------

Loading Glove Model
400000 words loaded!


In [6]:
# --------------------------------------------------------------------------------
# BEWARE: Slow to load!
# --------------------------------------------------------------------------------
# text8_dataset = api.load("text8")  # load dataset as iterable (Cleaned small sample from wikipedia)
# --------------------------------------------------------------------------------
# model_w2v = Word2Vec(text8_dataset)  # train w2v model
# model_w2v.wv.save_word2vec_format('test_w2v.txt', binary=False)
# --------------------------------------------------------------------------------

In [7]:
# --------------------------------------------------------------------------------
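# Read a pre-trained word2vec model
# model_w2v = KeyedVectors.load_word2vec_format("Models/test_w2v.txt")
# Alternative: parse the word2vec text file by hand into a dict of word -> vector.
# NOTE: files written by save_word2vec_format start with a "<vocab_size> <vector_size>"
# header line, which this loader reads as if it were a regular entry.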
def load_word2vec_model(File):
    print("Loading Word2vec Model")
    model_w2v = {}
    with open(File,'r') as f:
        for line in f:
            split_line = line.split()
            word = split_line[0]
            embedding = np.array(split_line[1:], dtype=np.float64)
            model_w2v[word] = embedding
    print(f"{len(model_w2v)} words loaded!")
    return model_w2v

model_w2v = load_word2vec_model("Models/test_w2v.txt")
# --------------------------------------------------------------------------------
# --------------------------------------------------------------------------------

Loading Word2vec Model
71291 words loaded!


In [8]:
# --------------------------------------------------------------------------------
# Utility Functions
# pretty dictionary printer: https://stackoverflow.com/questions/3229419/how-to-pretty-print-nested-dictionaries
def pretty(d, indent=0):
    for key, value in d.items():
        print('\t' * indent + str(key))
        if isinstance(value, dict):
            pretty(value, indent+1)
        else:
            print('\t' * (indent+1) + str(value))

def printResults(results, n):
    if n > len(results):
        print(f"ERROR: Number of results requested ({n}) is greater than available ({len(results)}) results!")
    else:
        # Number the results from 1 for display
        for i, r in enumerate(results[0:n], start=1):
            print(f"Result #{i}:\n")
            pretty(r)
            print('--------------------------------------------------------------------------------\n')
# --------------------------------------------------------------------------------

In [9]:
# --------------------------------------------------------------------------------
# Dataset / Code Corpus

with open('Dataset/dataset.json') as f:
    d = json.load(f)
    # print(d)
    # jdata = json.loads(f)
    # for d in jdata:
    #     for key, value in d.iteritems():
    #         print(key, value)
    
print("type(d) is : " + str(type(d)))
print("type(d['dataset']) is : " + str(type(d['dataset'])))
print("len(d['dataset']) is : " + str(len(d['dataset'])))

dataset = d['dataset']
    
# --------------------------------------------------------------------------------

type(d) is : <class 'dict'>
type(d['dataset']) is : <class 'list'>
len(d['dataset']) is : 100


In [10]:
# --------------------------------------------------------------------------------
# Preprocessing

stop_words = set(stopwords.words('english'))

question_words = {"what", "who", "whom", "whose", "which", "where", "when", "why", "how"}

common_words = {"please", "code", "program", "write"}

# def preprocess_text(text):
#     return preprocess_text(text, 3, "SNOWBALL")

def preprocess_text(text, removal, method):
    text = text.lower()                                                 # Case standardization
    
    tokens = word_tokenize(text)                                        # Tokenize 
    
    tokens = [token.strip() for token in tokens if token.strip()]       # Remove whitespace

    tokens = [re.sub(r'[^\x00-\x7f]',r' ', token) for token in tokens]  # Replace non-ASCII characters (anything outside x00 to x7f) with a space
    
    tokens = [re.sub(r'[^\w\s]', '', token) for token in tokens]        # Remove punctuation (everything not a word character or space)

    tokens = [re.sub(r'https', '', token) for token in tokens]          # Strip the literal text "https" (this does not remove whole URLs)
    
    # Remove irrelevant text; the removal levels are cumulative
    if removal in (1, 2, 3):
        tokens = [i for i in tokens if i not in stop_words]             # stop-word removal
    if removal in (2, 3):
        tokens = [i for i in tokens if i not in question_words]         # question-word removal
    if removal == 3:
        tokens = [i for i in tokens if i not in common_words]           # common-word removal

    # Normalization: lemmatization (typically preferred) or stemming
    if method == 'LEMMER':
        lemmer = WordNetLemmatizer()
        tokens = [lemmer.lemmatize(word) for word in tokens]
    elif method == 'PORTER':
        porter = PorterStemmer()
        tokens = [porter.stem(token) for token in tokens]
    elif method == 'SNOWBALL':
        snowball = SnowballStemmer("english")
        tokens = [snowball.stem(token) for token in tokens]     
        
    return " ".join(tokens)

# --------------------------------------------------------------------------------
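
The three removal levels build on each other: level 1 drops stop words, level 2 also drops question words, and level 3 also drops common request words. The sketch below illustrates that idea using only the standard library. The word lists are tiny stand-ins (the notebook uses NLTK's stop-word list), the regex tokenizer replaces `word_tokenize`, and `filter_tokens` is a hypothetical name.

```python
import re

# Tiny illustrative word lists (assumptions; the notebook uses NLTK's stop words).
STOP_WORDS = {"a", "i", "can", "to", "the", "in"}
QUESTION_WORDS = {"what", "who", "which", "where", "when", "why", "how"}
COMMON_WORDS = {"please", "code", "program", "write"}

def filter_tokens(text, removal):
    """Lowercase, tokenize on letters and digits, then apply cumulative removal levels."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    excluded = set()
    if removal >= 1:
        excluded |= STOP_WORDS
    if removal >= 2:
        excluded |= QUESTION_WORDS
    if removal >= 3:
        excluded |= COMMON_WORDS
    return [token for token in tokens if token not in excluded]

for level in range(4):
    print(level, filter_tokens("How can I write code to reverse a string?", level))
```

Each higher level leaves fewer tokens, which makes it easy to see what every filter contributes before stemming is applied.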

In [11]:
# --------------------------------------------------------------------------------
print("--------------------------------------------------------------------------------")
print(stop_words)
print("--------------------------------------------------------------------------------")
print(question_words)
print("--------------------------------------------------------------------------------")
print(common_words)
print("--------------------------------------------------------------------------------")
# --------------------------------------------------------------------------------

--------------------------------------------------------------------------------
{'been', 'she', "that'll", 'in', 'will', 'did', 're', 'out', 'was', 'than', "you'll", 'their', 'until', 'a', 'ma', 'ourselves', 'it', "shouldn't", 'from', 'these', 'again', 'my', "isn't", 'won', 'why', 'through', 'yourselves', 'down', "should've", 'had', 'by', 'hasn', 'wouldn', "mustn't", "didn't", 'most', 'me', 'which', 'with', 'about', 'before', 'has', 'themselves', "she's", 'him', 'shan', 'be', 'what', "aren't", "won't", 'and', 'can', 'the', 'further', 'there', 'very', 'hadn', 'ours', 'its', "you've", 'wasn', 'off', 'this', 'but', 'into', 'once', 'more', 'just', "couldn't", 'd', 'aren', 'i', 'against', 'shouldn', 'ain', 'of', "it's", 'for', 'you', "you'd", 'up', 'few', 'at', 'same', 'your', 'under', 'how', 'having', 'if', 'only', "haven't", "hasn't", "you're", 'didn', 'or', 'they', "don't", 'he', 'yours', 'whom', 'needn', 'himself', 'while', 'are', 'that', 'don', "needn't", 'her', 'where', 'above', 'her

In [12]:
# --------------------------------------------------------------------------------

# X_train, X_test, Y_train, Y_test = train_test_split([entry['description'] for entry in dataset], 
#                                                     [entry['code'] for entry in dataset],
#                                                     test_size=0.3,
#                                                     random_state=0)
# print(X_train.shape, X_test.shape)

# print("Length of X_train:\t" + str(len(X_train)))

# print("Length of X_test:\t" + str(len(X_test)))

# --------------------------------------------------------------------------------

In [13]:
# --------------------------------------------------------------------------------
# Search & Ranking:

# Vector Representation: 
# Method 1: TF-IDF.
# --------------------------------------------------------------------------------
# Combine descriptions and code to create a corpus
corpus = [entry["description"] + " " + entry["code"] for entry in dataset]
corpus_description = [entry["description"] for entry in dataset]
vectorizer = TfidfVectorizer().fit(corpus)
# --------------------------------------------------------------------------------
# def tfidf_search(query):
#     return tfidf_search(query, 3, "SNOWBALL")
# --------------------------------------------------------------------------------
def tfidf_search(query, removal, method):
    # Preprocess the query
    processed_query = preprocess_text(query, removal, method)

    print("Searching for processed query of : \n\t" + str(processed_query) + "\n")
    
    # Convert the query to a vector
    query_vector = vectorizer.transform([processed_query])
    
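    # Compute similarity scores
    # Note: the corpus is vectorized without preprocessing, so stemmed query terms (e.g. "revers") may not match corpus terms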
    similarity_scores = cosine_similarity(query_vector, vectorizer.transform(corpus))
    
    # Rank the results
    ranked_indices = similarity_scores.argsort()[0][::-1]

    print("Best match scored cosine similarity of " + str(similarity_scores.max()) + "; the worst match scored " + str(similarity_scores.min()) + ".\n")
    
    return [dataset[i] for i in ranked_indices]
# --------------------------------------------------------------------------------
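
To make the ranking step concrete, here is a dependency-free sketch of TF-IDF weighting followed by cosine similarity. It is a plain TF-IDF variant written for illustration, not a reimplementation of `TfidfVectorizer`; the whitespace tokenizer, the function names `tfidf_vectors` and `rank`, and the three sample documents are all assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one L2-normalized {term: weight} dict per document, plus the idf table."""
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors, idf

def rank(query, documents):
    """Rank document indices by cosine similarity to the query, best first."""
    vectors, idf = tfidf_vectors(documents)
    # Query terms that never occur in the corpus carry no weight.
    counts = Counter(t for t in query.lower().split() if t in idf)
    q = {term: count * idf[term] for term, count in counts.items()}
    q_norm = math.sqrt(sum(w * w for w in q.values())) or 1.0
    scores = [sum(w * vec.get(term, 0.0) for term, w in q.items()) / q_norm for vec in vectors]
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return order, scores

docs = ["reverse a string in python", "join a list of strings", "binary search in a sorted array"]
order, scores = rank("reverse string", docs)
print(order[0], [round(s, 3) for s in scores])
```

Only exact term matches contribute to the score here: "strings" in the second document does not match the query term "string", which is why query and corpus should go through the same normalization.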

In [14]:
# --------------------------------------------------------------------------------
# Search & Ranking:

# Vector Representation: 
# Method 2: GloVe.
# GloVe, which stands for Global Vectors for Word Representation,
# is an unsupervised learning model. 
# GloVe is constructed by aggregating global word-word co-occurrence statistics from a corpus. 
# The model effectively captures linear substructures in the vector space by leveraging both global statistics and local context, making it capable of providing insight into word analogies and semantic relationships. 
# As a result, words with similar meanings are placed closer together in the vector space, reflecting their semantic similarity.
# --------------------------------------------------------------------------------
def sentence_to_glove_vec(sentence, model):
    # Lowercase the sentence
    words = sentence.lower().split()
    
    # Remove out-of-vocabulary words
    words = [word for word in words if word in model.keys()]
    
    # If no words in the model, return a zero vector
    if not words:
        return np.zeros(50)
    
    # Sum up the word vectors
    sentence_vector = np.sum([model[word] for word in words], axis=0)
    
    # Take the mean of the vectors
    return sentence_vector / len(words)
# --------------------------------------------------------------------------------
# def GloVe_search(query):
#     return GloVe_search(query, 3, "SNOWBALL")
# --------------------------------------------------------------------------------
def GloVe_search(query, removal, method):
    # Preprocess the query
    processed_query = preprocess_text(query, removal, method)

    print("Searching for processed query of : \n\t" + str(processed_query) + "\n")
    
    # Convert the query to a vector
    query_vector = sentence_to_glove_vec(processed_query, model_glove)
    
    # Compute similarity scores
    sim_scores = []
    for entry in dataset:
        description = entry["description"]
        processed_des = preprocess_text(description, removal, method)
        des_vector = sentence_to_glove_vec(processed_des, model_glove)
        sim_scores.append(np.dot(query_vector, des_vector) / (np.linalg.norm(query_vector)*np.linalg.norm(des_vector)))
        
    # Rank the results; undefined (NaN) scores from zero vectors are ranked last
    similarity_scores = np.array(sim_scores)
    ranked_indices = np.nan_to_num(similarity_scores, nan=-np.inf).argsort()[::-1]

    print("Best match scored cosine similarity of " + str(np.nanmax(similarity_scores)) + "; the worst match scored " + str(np.nanmin(similarity_scores)) + ".\n")
    
    return [dataset[i] for i in ranked_indices]
# --------------------------------------------------------------------------------
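
The sentence vector above is the mean of the word vectors, and when no word is in the vocabulary it is all zeros, which makes the cosine formula divide by zero. The following sketch shows the averaging step together with a guarded cosine similarity. The 3-dimensional embeddings are made-up values, and `mean_vector` and `safe_cosine` are hypothetical helper names.

```python
import math

# Toy 3-dimensional embeddings (made-up values for illustration only).
EMBEDDINGS = {
    "reverse": [1.0, 0.0, 0.0],
    "string": [0.0, 1.0, 0.0],
    "list": [0.0, 0.8, 0.6],
}

def mean_vector(sentence, embeddings, dim=3):
    """Average the vectors of in-vocabulary words; return a zero vector if none match."""
    known = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

def safe_cosine(a, b):
    """Cosine similarity that returns 0.0 instead of NaN when either norm is zero."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

query = mean_vector("reverse string", EMBEDDINGS)
related = safe_cosine(query, mean_vector("string list", EMBEDDINGS))
unknown = safe_cosine(query, mean_vector("unknown words only", EMBEDDINGS))
print(round(related, 4), unknown)
```

Returning 0.0 for an all-zero vector is one possible convention; it keeps such entries in the ranking without letting undefined values disturb the sort.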

In [15]:
# --------------------------------------------------------------------------------
# Vector Representation: 
# Method 3: word2vec.
# Word2Vec is a neural network-based technique developed by Google to convert words into numerical form, 
# where semantically similar words are mapped to proximate points in a multidimensional space.
# --------------------------------------------------------------------------------
def sentence_to_vec(sentence, model):
    # Lowercase the sentence
    words = sentence.lower().split()
    
    # Remove out-of-vocabulary words
    words = [word for word in words if word in model.keys()]
    
    # If no words in the model, return a zero vector
    if not words:
        return np.zeros(100)
    
    # Sum up the word vectors
    sentence_vector = np.sum([model[word] for word in words], axis=0)
    
    # Take the mean of the vectors
    return sentence_vector / len(words)
# --------------------------------------------------------------------------------
# def word2vec_search(query):
#     return word2vec_search(query, 3, "SNOWBALL")
# --------------------------------------------------------------------------------
def word2vec_search(query, removal, method):
    processed_query = preprocess_text(query, removal, method)

    print("Searching for processed query of : \n\t" + str(processed_query) + "\n")
    
    # Convert the query to a vector
    query_vector = sentence_to_vec(processed_query, model_w2v)
    
    # Compute similarity scores
    sim_scores = []
    for entry in dataset:
        description = entry["description"]
        processed_des = preprocess_text(description, removal, method)
        des_vector = sentence_to_vec(processed_des, model_w2v)
        sim_scores.append(np.dot(query_vector, des_vector) / (np.linalg.norm(query_vector)*np.linalg.norm(des_vector)))
    
    # Rank the results

    similarity_scores = np.array(sim_scores)
    similarity_scores_without_nan = similarity_scores[~np.isnan(similarity_scores)]
    similarity_scores_with_inf = np.nan_to_num(similarity_scores, nan=-np.inf)
    # Get the indices that would sort the array, then slice to get the last five entries
    sorted_indices = np.argsort(similarity_scores_with_inf)[-5:]
    # Since we want the largest values, we reverse the sorted indices
    ranked_indices = sorted_indices[::-1]

    print("Best match scored cosine similarity of " + str(similarity_scores_without_nan.max()) + "; the worst match scored " + str(similarity_scores_without_nan.min()) + ".\n")
    
    return [dataset[i] for i in ranked_indices]
# --------------------------------------------------------------------------------
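
The ranking step above has to cope with NaN scores before it picks the best entries. A minimal standard-library sketch of the same idea follows; unlike the function above, which maps NaN to negative infinity and keeps those entries, this version drops them entirely. The name `top_k_indices` and the sample scores are assumptions.

```python
import math

def top_k_indices(scores, k):
    """Return indices of the k highest scores, best first, ignoring NaN scores."""
    valid = [i for i, s in enumerate(scores) if not math.isnan(s)]
    valid.sort(key=lambda i: scores[i], reverse=True)
    return valid[:k]

scores = [0.2, float("nan"), 0.9, -0.1, 0.5]
print(top_k_indices(scores, 3))
```

Passing `k` as a parameter instead of hard-coding five lets the caller request as many results as it intends to print.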

In [16]:
# --------------------------------------------------------------------------------
# Vector Representation: 
# Method 4: Doc2Vec.
# Doc2Vec, also known as Paragraph Vector,
# is an extension of the Word2Vec approach to unsupervised learning of continuous representations for larger blocks of text,
# such as sentences, paragraphs, and entire documents.
# Developed by researchers at Google, Doc2Vec represents documents in a high-dimensional vector space.
# This method not only considers the context of words within a document but also maintains a unique vector for the document itself.
# These document vectors are trained to predict words in a context, enabling the model to capture the semantic meaning of the text.
# As a result, semantically similar documents are mapped to proximate points in the vector space, facilitating tasks such as document similarity analysis, classification, and clustering.
# --------------------------------------------------------------------------------
# NOT SUITABLE FOR THE LENGTH OF OUR DESCRIPTIONS
# --------------------------------------------------------------------------------

In [17]:
# --------------------------------------------------------------------------------
# Test Session
start_time = time.time()
results = tfidf_search("How can I reverse a string?", 3, "SNOWBALL")
print(" --------------------------------------------- %s seconds ---" % (time.time() - start_time))
printResults(results, 3)
print('\n================================================================================\n')
start_time = time.time()
results = GloVe_search("How can I reverse a string?", 3, "SNOWBALL")
print(" --------------------------------------------- %s seconds ---" % (time.time() - start_time))
printResults(results, 3)
print('\n================================================================================\n')
start_time = time.time()
results = word2vec_search("How can I reverse a string?", 3, "SNOWBALL")
print(" --------------------------------------------- %s seconds ---" % (time.time() - start_time))
printResults(results, 3)

# --------------------------------------------------------------------------------

Searching for processed query of : 
	revers string 

Best match scored cosine similarity of 0.8038446020918463; the worst match scored 0.0.

 --------------------------------------------- 0.015004158020019531 seconds ---
Result #1:

description
	Reverse a string in Python
code
	string[::-1]
tags
	['python', 'string', 'reverse']
--------------------------------------------------------------------------------

Result #2:

description
	Convert a list of strings into a single string
code
	joined_string = ' '.join(my_list_of_strings)
tags
	['python', 'string', 'join', 'list']
--------------------------------------------------------------------------------

Result #3:

description
	Determine if a string has all unique characters.
code
	
def is_unique(s):
    return len(set(s)) == len(s)

# Example Usage:
unique_string = "abcdef"
not_unique_string = "aabbcc"
is_unique(unique_string)  # Returns True
is_unique(not_unique_string)  # Returns False
        
tags
	['string', 'unique characters']
--

  sim_scores.append(np.dot(query_vector, des_vector) / (np.linalg.norm(query_vector)*np.linalg.norm(des_vector)))


In [18]:
def search_utility(query, removal, method, top):
    print("User given query:\n\t" + str(query) + "")
    
    print('\n\n === TF-IDF SEARCH ==============================================================\n')
    start_time = time.time()
    results = tfidf_search(query, removal, method)
    print(" ------------------------------------------------------------- %s seconds ---" % round((time.time() - start_time), 4))
    printResults(results, top)
    print('\n\n === GloVe SEARCH ===============================================================\n')
    start_time = time.time()
    results = GloVe_search(query, removal, method)
    print(" ------------------------------------------------------------- %s seconds ---" % round((time.time() - start_time), 4))
    printResults(results, top)
    print('\n\n === word2vec SEARCH ============================================================\n')
    start_time = time.time()
    results = word2vec_search(query, removal, method)
    print(" ------------------------------------------------------------- %s seconds ---" % round((time.time() - start_time), 4))
    printResults(results, top)

def search_compare(query, top):
    search_utility(query, 3, "SNOWBALL", top)
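
The three search calls in `search_utility` repeat the same start/stop timing pattern. One way to factor that out is a small wrapper, sketched below; `timed` and `dummy_search` are hypothetical names, and `dummy_search` merely stands in for `tfidf_search`, `GloVe_search`, or `word2vec_search`.

```python
import time

def timed(search_fn, *args):
    """Call search_fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = search_fn(*args)
    return result, time.perf_counter() - start

def dummy_search(query):
    # Stand-in for one of the notebook's search functions.
    return [query.upper()]

results, elapsed = timed(dummy_search, "reverse a string")
print(results, f"{elapsed:.4f} seconds")
```

`time.perf_counter()` is used here because it is intended for measuring short intervals, whereas `time.time()` reports wall-clock time.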


In [19]:
# --------------------------------------------------------------------------------
# Test

search_compare("How to write a search algorithm?", 3)

# results = tfidf_search("How to write a search algorithm?", 3, "SNOWBALL")
# printResults(results, 5)
# print('\n================================================================================\n')
# results = GloVe_search("How to write a search algorithm?", 3, "SNOWBALL")
# printResults(results, 5)
# print('\n================================================================================\n')
# results = word2vec_search("How to write a search algorithm?", 3, "SNOWBALL")
# printResults(results, 5)

# --------------------------------------------------------------------------------

User given query:
	How to write a search algorithm?



Searching for processed query of : 
	search algorithm 

Best match scored cosine similarity of 0.14987489537815366; the worst match scored 0.0.

 ------------------------------------------------------------- 0.006 seconds ---
Result #1:

description
	Binary Search: Search a sorted array by repeatedly dividing the search interval in half.
code
	def binary_search(arr, target):
    low = 0
    high = len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        guess = arr[mid]
        if guess == target:
            return mid
        if guess > target:
            high = mid - 1
        else:
            low = mid + 1
    return None
tags
	['algorithm', 'binary search', 'search']
--------------------------------------------------------------------------------

Result #2:

description
	Implement a basic binary search algorithm
code
	
def binary_search(arr, x):
    low, high = 0, len(arr) - 1
    while low <= high:
     



In [20]:
# --------------------------------------------------------------------------------
# Test

search_compare("write a tree traversal", 3)

print('\n||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||\n')

search_compare("breadth-first search", 3)

# --------------------------------------------------------------------------------

User given query:
	write a tree traversal



Searching for processed query of : 
	tree travers

Best match scored cosine similarity of 0.19982010957894894; the worst match scored 0.0.

 ------------------------------------------------------------- 0.0053 seconds ---
Result #1:

description
	Post-order traversal of a binary tree
code
	
def post_order_traversal(root):
    return post_order_traversal(root.left) + post_order_traversal(root.right) + [root.val] if root else []

# Example Usage: Use the binary tree from the in-order traversal example
post_order = post_order_traversal(root)
        
tags
	['tree', 'binary tree', 'post-order']
--------------------------------------------------------------------------------

Result #2:

description
	Pre-order traversal of a binary tree
code
	
def pre_order_traversal(root):
    return [root.val] + pre_order_traversal(root.left) + pre_order_traversal(root.right) if root else []

# Example Usage: Use the binary tree from the in-order traversal exam



Best match scored cosine similarity of 0.9999999999999998; the worst match scored 0.03458672056691623.

 ------------------------------------------------------------- 0.0303 seconds ---
Result #1:

description
	Post-order traversal of a binary tree
code
	
def post_order_traversal(root):
    return post_order_traversal(root.left) + post_order_traversal(root.right) + [root.val] if root else []

# Example Usage: Use the binary tree from the in-order traversal example
post_order = post_order_traversal(root)
        
tags
	['tree', 'binary tree', 'post-order']
--------------------------------------------------------------------------------

Result #2:

description
	Pre-order traversal of a binary tree
code
	
def pre_order_traversal(root):
    return [root.val] + pre_order_traversal(root.left) + pre_order_traversal(root.right) if root else []

# Example Usage: Use the binary tree from the in-order traversal example
pre_order = pre_order_traversal(root)
        
tags
	['tree', 'binary tree', 

In [21]:
# --------------------------------------------------------------------------------
# Test

search_compare("how to get time in python?", 3)

# --------------------------------------------------------------------------------

User given query:
	how to get time in python?



Searching for processed query of : 
	get time python 

Best match scored cosine similarity of 0.42800010888082557; the worst match scored 0.0.

 ------------------------------------------------------------- 0.0044 seconds ---
Result #1:

description
	Decorator to time the execution of a function
code
	
import time

def timer(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f'Function {func.__name__} executed in {end - start:.4f} seconds')
        return result
    return wrapper

@timer
def example_function():
    time.sleep(2)
        
tags
	['python', 'decorator', 'timer', 'function']
--------------------------------------------------------------------------------

Result #2:

description
	Function to add two numbers in Python
code
	def add(a, b): return a + b
tags
	['python', 'function', 'math', 'addition']
------------------------------



Best match scored cosine similarity of 0.8581071615550888; the worst match scored -0.22575424805927483.

 ------------------------------------------------------------- 0.0315 seconds ---
Result #1:

description
	Get the current date and time using datetime
code
	
from datetime import datetime
current_datetime = datetime.now()
        
tags
	['python', 'datetime', 'current time']
--------------------------------------------------------------------------------

Result #2:

description
	Merge Sort: Sort an array by dividing it into halves, sorting the halves, and merging them back together.
code
	def merge_sort(arr):
    if len(arr) > 1:
        mid = len(arr) // 2
        L = arr[:mid]
        R = arr[mid:]
        merge_sort(L)
        merge_sort(R)
        i = j = k = 0
        while i < len(L) and j < len(R):
            if L[i] < R[j]:
                arr[k] = L[i]
                i += 1
            else:
                arr[k] = R[j]
                j += 1
            k += 1
       

In [22]:
# --------------------------------------------------------------------------------
# Test

search_compare("how to write breadth first search?", 3)

# --------------------------------------------------------------------------------



User given query:
	how to write breadth first search?



Searching for processed query of : 
	breadth first search 

Best match scored cosine similarity of 0.1690434742544628; the worst match scored 0.0.

 ------------------------------------------------------------- 0.0057 seconds ---
Result #1:

description
	Implement Breadth-First Search (BFS) for a graph
code
	
def bfs(graph, start):
    visited = []
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if vertex not in visited:
            visited.append(vertex)
            queue.extend(graph[vertex] - set(visited))
    return visited

# Example Usage: Use the graph from the DFS example
bfs_visited = bfs(graph, 'A')
        
tags
	['graph', 'bfs']
--------------------------------------------------------------------------------

Result #2:

description
	Perform breadth-first search (BFS) on a graph
code
	
from collections import deque

def bfs(graph, start):
    visited = set()
    queue = deque([star

In [23]:
# # --------------------------------------------------------------------------------
# # EVALUATING:

# # --------------------------------------------------------------------------------

# def transform_model_data_w_tfidf_vectorizer(preprocessed_text, Y_train,  X_test, Y_test, model_type):
#     #vectorize dataset 
#     tfidf = TfidfVectorizer() 
#     vectorized_data = tfidf.fit_transform(preprocessed_text)

#     #define model
#     if model_type == 0:
#         model = TfidfVectorizer()
#     elif model_type == 1:
#         model = MultinomialNB(alpha=0.1)
#     else:
#         model = BernoulliNB(alpha=0.1)

#     model.fit(vectorized_data, Y_train)
 
#     #evaluate model
#     predictions = model.predict(tfidf.transform(X_test))

#     accuracy = accuracy_score( Y_test, predictions)
#     balanced_accuracy = balanced_accuracy_score(Y_test, predictions)
#     precision = precision_score(Y_test, predictions)

#     print("Accuracy:",round(100*accuracy,2),'%')
#     print("Balanced accuracy:",round(100*balanced_accuracy,2),'%')
#     print("Precision:", round(100*precision,2),'%')
#     return predictions, model

# # --------------------------------------------------------------------------------

# # preprocessed_text_1 = [text_clean(text, 'L', True) for text in X_train]
# # processed_query = preprocess_text(query, removal, method)

# preprocessed_text_1 = [preprocess_text(text, 3, "SNOWBALL") for text in X_train]

# transform_model_data_w_tfidf_vectorizer(preprocessed_text_1, Y_train,  X_test, Y_test, 0)
    
# # --------------------------------------------------------------------------------