# Build Example Directory

This project is a search engine for finding lost files. To test the code, I need a directory that contains many files. This notebook takes an existing directory of empty subdirectories and populates it with text files.

## Get subdirectories inside of the dir

In [25]:
import os

print(os.getcwd())
print(os.listdir())

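# Refer to the example directory by name rather than by its position in os.listdir(),
# since the listing order changes as files are added to the project folder
example_dir = os.path.join(os.getcwd(), "example_directory")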
print("\nEmpty Example Dir: ", example_dir)

c:\Users\hunte\OneDrive\Documents\Coding Projects\Imprecision-Search
['Build-Corpus.py', 'build_example_dir.ipynb', 'example_directory']

Empty Example Dir:  c:\Users\hunte\OneDrive\Documents\Coding Projects\Imprecision-Search\example_directory


In [63]:
import os

def gather_subdirs(directory, subdir_list=None):
    '''
    This function recursively gathers all subdirectories of a 
    given directory. It returns a list of all subdirectories.
    '''
    # If no list is given, create a new list
    if subdir_list is None:
        subdir_list = []
        
    # Loop over all files and directories in the given directory
    for name in os.listdir(directory):

        # Create the full path to the file or directory
        path = os.path.join(directory, name)

        # If the path is a directory, append it to the list
        if os.path.isdir(path):
            subdir_list.append(path)

            # Call the function recursively with the new path
            gather_subdirs(path, subdir_list)
            
    return subdir_list



# Call the function
subdirs = gather_subdirs(example_dir)
print("\nSubdirectories: ", subdirs[:2])



Subdirectories:  ['c:\\Users\\hunte\\OneDrive\\Documents\\Coding Projects\\Imprecision-Search\\example_directory\\subdir', 'c:\\Users\\hunte\\OneDrive\\Documents\\Coding Projects\\Imprecision-Search\\example_directory\\subdir\\subdir-recursion-level-1']
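
For comparison, the same kind of listing can be sketched with the standard library's `os.walk`, which avoids writing the recursion by hand. The helper name `gather_subdirs_walk` and the throwaway directory tree are hypothetical and exist only so the sketch runs on its own. Note that `os.walk` visits directories level by level, so the order of the returned paths may differ from the recursive function above.

```python
import os
import tempfile

def gather_subdirs_walk(directory):
    """Return every subdirectory below `directory` using os.walk."""
    subdir_list = []
    # os.walk yields (dirpath, dirnames, filenames) for each directory it visits
    for dirpath, dirnames, _filenames in os.walk(directory):
        for name in dirnames:
            subdir_list.append(os.path.join(dirpath, name))
    return subdir_list

# Build a small temporary tree (hypothetical names) to try the helper on
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "subdir", "subdir-recursion-level-1"))
    os.makedirs(os.path.join(root, "other"))
    found = gather_subdirs_walk(root)
    # Show paths relative to the root, sorted so the order is predictable
    relative = sorted(os.path.relpath(p, root) for p in found)
    print(relative)
```

In the notebook itself, `gather_subdirs_walk(example_dir)` should find the same set of paths as `gather_subdirs(example_dir)`.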


## Get Examples to Fill Dir With

I am going to populate the subdirectories gathered above with examples from the Microsoft Research Paraphrase Corpus (MRPC), which is part of the GLUE benchmark.

In [43]:
from datasets import load_dataset

# loading in the General Language Understanding Evaluation (GLUE) dataset
# mrpc stands for Microsoft Research Paraphrase Corpus
mrpc = load_dataset("glue", "mrpc")

mrpc

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [57]:
import pandas as pd

# Use the training split of the `mrpc` DatasetDict loaded above
train_dataset = mrpc['train']

# Convert the 'sentence1' column to a Pandas Series
examples = pd.Series(train_dataset['sentence1'])

# Show the first few entries
examples.head()

0    Amrozi accused his brother , whom he called " ...
1    Yucaipa owned Dominick 's before selling the c...
2    They had published an advertisement on the Int...
3    Around 0335 GMT , Tab shares were up 19 cents ...
4    The stock rose $ 2.11 , or about 11 percent , ...
dtype: object

## Write three single-example txt files inside each dir

In [65]:
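print(len(subdirs))

# For every subdirectory, write three text files, each holding one example
for i, subdir in enumerate(subdirs):
    for j in range(3):
        # Use 3 * i + j so that no example or file name is reused across
        # subdirectories (i + j would overlap between neighbouring subdirs)
        idx = 3 * i + j
        with open(os.path.join(subdir, f"document{idx}.txt"), "w", encoding="utf-8") as f:
            f.write(examples[idx])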

39
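
To confirm that the write step did what was intended, one option is to count the `.txt` files in each subdirectory. The sketch below is only an illustration: `count_txt_files` is a hypothetical helper, and the temporary directories and placeholder text stand in for the real `subdirs` and MRPC examples so that the block runs on its own.

```python
import os
import tempfile

def count_txt_files(subdirs):
    """Map each directory path to the number of .txt files directly inside it."""
    counts = {}
    for subdir in subdirs:
        counts[subdir] = sum(
            1
            for name in os.listdir(subdir)
            if name.endswith(".txt") and os.path.isfile(os.path.join(subdir, name))
        )
    return counts

# Stand-in data: two temporary subdirectories with three small files each
with tempfile.TemporaryDirectory() as root:
    demo_subdirs = [os.path.join(root, f"subdir{i}") for i in range(2)]
    for subdir in demo_subdirs:
        os.makedirs(subdir)
        for j in range(3):
            with open(os.path.join(subdir, f"doc{j}.txt"), "w", encoding="utf-8") as f:
                f.write(f"placeholder text {j}")
    counts = count_txt_files(demo_subdirs)
    print(sorted(counts.values()))
```

In the notebook, calling `count_txt_files(subdirs)` after the write loop should report three files for every subdirectory, provided the directories were empty beforehand.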
