## Tutorial: Exploration of full-text indexing
We'll read in some files, then index the "important" words in their contents, and finally search for some of those words.

In [1]:
import set_path      # Importing this module will add the project's home directory to sys.path

Added 'D:\Docs\- MY CODE\Brain Annex\BA-Win7' to sys.path


In [2]:
import os
import sys
import getpass

from neoaccess import NeoAccess

from BrainAnnex.modules.neo_schema.neo_schema import NeoSchema
from BrainAnnex.modules.full_text_indexing.full_text_indexing import FullTextIndexing
from BrainAnnex.modules.media_manager.media_manager import MediaManager

# Connect to the database
#### You can use a free local install of the Neo4j database, or a remote one on a virtual machine under your control, or a hosted solution, or simply the FREE "Sandbox" : [instructions here](https://julianspolymathexplorations.blogspot.com/2023/03/neo4j-sandbox-tutorial-cypher.html)
NOTE: This tutorial was tested on version 4 of the Neo4j database, but will probably also work on the newer version 5.

In [3]:
# Save your credentials here - or use the prompts given by the next cell
host = ""             # EXAMPLES:  bolt://123.456.789.012   OR   neo4j://localhost
password = ""

In [4]:
print("To create a database connection, enter the host IP, but leave out the port number: (EXAMPLES:  bolt://123.456.789.012  OR  neo4j://localhost )\n")

host = input("Enter host IP WITHOUT the port number.  EXAMPLE: bolt://123.456.789.012 ")
host += ":7687"    # EXAMPLE of host value:  "bolt://123.456.789.012:7687"

password = getpass.getpass("Enter the database password:")

print(f"\n=> Will be using: host='{host}', username='neo4j', password=**********")

To create a database connection, enter the host IP, but leave out the port number: (EXAMPLES:  bolt://123.456.789.012  OR  neo4j://localhost )



Enter host IP WITHOUT the port number.  EXAMPLE: bolt://123.456.789.012  bolt://123.456.789.012
Enter the database password: ········



=> Will be using: host='bolt://123.456.789.012:7687', username='neo4j', password=**********


In [5]:
db = NeoAccess(host=host,
               credentials=("neo4j", password), debug=False)   # Notice the debug option being OFF

Attempting to connect to Neo4j database


In [6]:
print("Version of the Neo4j driver: ", db.version())

Version of the Neo4j driver:  4.4.11


# Explorations of Indexing

In [7]:
# Check whether the database is empty by counting its nodes  (if necessary, use db.empty_dbase()  to clear it)
q = "MATCH (n) RETURN COUNT(n) AS number_nodes"

db.query(q, single_cell="number_nodes")

41

#### Initialize the indexing

In [8]:
NeoSchema.set_database(db)
FullTextIndexing.db = db

In [9]:
MediaManager.set_media_folder("D:/tmp/")   # CHANGE AS NEEDED on your system

In [10]:
db.empty_dbase()                           # WARNING: USE WITH CAUTION!!!

In [11]:
FullTextIndexing.initialize_schema()

#### Read in 2 files (stored in the "media folder" specified above), and index them

In [12]:
filename = "test1.txt"      # 1st FILE
file_contents = MediaManager.get_from_file(filename)
file_contents

'hello to the world !!! ?  Welcome to learning how she cooks with potatoes...'

In [13]:
word_list = FullTextIndexing.extract_unique_good_words(file_contents)
word_list

['hello', 'world', 'welcome', 'learning', 'cooks', 'potatoes']

#### Note that many common words get dropped...
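To make the filtering step more concrete, here is a minimal, self-contained sketch of the general idea. It is not the actual implementation of `FullTextIndexing.extract_unique_good_words`; the function name `extract_unique_words_sketch` and the tiny `COMMON_WORDS` set are hypothetical, chosen only for illustration.

```python
import re
from typing import List

# Hypothetical, very small list of common words to ignore (an assumption for this sketch)
COMMON_WORDS = {"to", "the", "how", "she", "with", "a", "and", "of"}

def extract_unique_words_sketch(text: str) -> List[str]:
    """Lower-case the text, split it into alphabetic words,
    drop common words and duplicates, and preserve the original order"""
    words = re.findall(r"[a-z]+", text.lower())
    seen = set()
    result = []
    for w in words:
        if w in COMMON_WORDS or w in seen:
            continue        # Skip common words and repeated words
        seen.add(w)
        result.append(w)
    return result

sample = 'hello to the world !!! ?  Welcome to learning how she cooks with potatoes...'
print(extract_unique_words_sketch(sample))
```

With this particular stop-word set, the sketch gives the same list as the cell above; a real indexer would need a far larger list of common words.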

In [14]:
content_item_id = NeoSchema.create_data_node(class_node="Content Item", properties = {"name": filename})
content_item_id

0

In [15]:
# Index the chosen words for this first Content Item
FullTextIndexing.new_indexing(content_item_id = content_item_id, unique_words = word_list)

#### Process the 2nd Content Item

In [16]:
filename = "test2.htm"     # 2nd FILE
file_contents = MediaManager.get_from_file(filename)
file_contents

"<p>Let's make a <i>much better world</i>, shall we?  What do you say to that enticing prospect?</p>\n\n<p>Starting on a small scale &ndash; we&rsquo;ll learn cooking a potato well.</p>"

In [17]:
word_list = FullTextIndexing.extract_unique_good_words(file_contents)
word_list

['world', 'say', 'enticing', 'prospect', 'scale', 'learn', 'cooking', 'potato']
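The list above contains no tag names (such as `p` or `i`) and no entity names (such as `ndash`), which suggests that markup is stripped before the words are extracted. The following is only a rough sketch of how such a clean-up could be done with the standard library; the helper name `strip_html_sketch` is hypothetical, and the regular expression is a simplification that would not handle every HTML document.

```python
import html
import re

def strip_html_sketch(markup: str) -> str:
    """Replace every HTML tag with a blank, then convert entities
    such as &ndash; into the characters they stand for"""
    no_tags = re.sub(r"<[^>]+>", " ", markup)   # Crude tag removal; adequate for simple markup only
    return html.unescape(no_tags)

sample = ("<p>Let's make a <i>much better world</i>, shall we?</p>\n\n"
          "<p>Starting on a small scale &ndash; we&rsquo;ll learn cooking a potato well.</p>")

plain = strip_html_sketch(sample)
print(plain)
```

The resulting plain text can then go through the same word extraction as an ordinary text file.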

In [18]:
content_item_id = NeoSchema.create_data_node(class_node="Content Item", properties = {"name": filename})
content_item_id

8

In [19]:
# Index the chosen words for this 2nd Content Item
FullTextIndexing.new_indexing(content_item_id = content_item_id, unique_words = word_list)

_Here's what we have created so far:_

![Full Text Indexing](../BrainAnnex/docs/tutorial_full_text_indexing.png)

In [20]:
def search_word(word :str) -> [str]:
    """
    Look up any stored words that contain the requested one (ignoring case).
    Then locate the Content Items that are indexed by any of those words.
    Return a list of the values of the "name" attributes in all the found Content Items.
    """
    q= f'''MATCH (w:Word)-[:occurs]->(:Indexer)<-[:has_index]-(ci:`Content Item`)
         WHERE w.name CONTAINS toLower('{word}')
         RETURN ci.name AS content_name
         '''
    result = db.query(q, single_column="content_name")
    return result
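The function above pastes the search term straight into the Cypher text, so a term containing a single quote (for example `it's`) would produce an invalid query. A common alternative is to leave a `$word` placeholder in the query and pass the value separately. The sketch below only builds the query string and the parameter dictionary; the helper name `build_search_query` is hypothetical, and how the dictionary is handed over depends on the library in use (with the official Neo4j Python driver it is the `parameters` argument of `session.run`; check the NeoAccess documentation for its equivalent).

```python
def build_search_query(word: str):
    """Return a Cypher query with a $word placeholder,
    together with the dictionary of parameter values to bind to it"""
    q = '''MATCH (w:Word)-[:occurs]->(:Indexer)<-[:has_index]-(ci:`Content Item`)
           WHERE w.name CONTAINS toLower($word)
           RETURN ci.name AS content_name
        '''
    return q, {"word": word}

q, params = build_search_query("it's")
print(q)
print(params)
```

Because the search term never becomes part of the query text, quotes and other special characters in it cannot alter the query.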

# Now we can finally try out some word searches

In [21]:
search_word("hello")

['test1.txt']

In [22]:
search_word("world")

['test1.txt', 'test2.htm']

### Make sure to search for the word STEMS, in order to find all variants!!
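For example, search for "potato" in order to find both "potato" and "potatoes". This works because the query matches every stored word that contains the search term, regardless of the case in which the term is typed.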

In [23]:
search_word("POTATO")

['test1.txt', 'test2.htm']

In [24]:
search_word("POTATOES")

['test1.txt']

In [25]:
search_word("Learn")

['test1.txt', 'test2.htm']

In [26]:
search_word("Learning")

['test1.txt']
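The search results above follow from substring matching on the stored words. As a database-free illustration, here is a small in-memory analogue built on a plain dictionary; it is a sketch of the idea, not of how the graph database stores the index, and the names `index`, `add_to_index` and `search_sketch` are hypothetical.

```python
from collections import defaultdict

# Maps each stored (lower-case) word to the set of items that contain it
index = defaultdict(set)

def add_to_index(item_name: str, words: list) -> None:
    """Register the given words as belonging to the named item"""
    for w in words:
        index[w.lower()].add(item_name)

def search_sketch(term: str) -> list:
    """Return the sorted names of all items indexed by any stored word
    that contains the search term (ignoring the case of the term)"""
    term = term.lower()
    found = set()
    for stored_word, items in index.items():
        if term in stored_word:      # Substring match, like CONTAINS in the Cypher query
            found |= items
    return sorted(found)

add_to_index("test1.txt", ['hello', 'world', 'welcome', 'learning', 'cooks', 'potatoes'])
add_to_index("test2.htm", ['world', 'say', 'enticing', 'prospect', 'scale', 'learn', 'cooking', 'potato'])

print(search_sketch("POTATO"))
print(search_sketch("POTATOES"))
print(search_sketch("Learn"))
```

One caveat of substring matching: a short term can also match unrelated words. For instance, "ear" would match both "learning" and "learn".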