# Applying Concept Search to the Eighteenth-Century Dataset - Complete Notebook

Finally, we've reached the last stage of our project, where we put together everything we've learned to search for concepts across our entire dataset. Back in the simple counting notebook, we counted up single tokens (most importantly, the token "principle") in a blunt, one-word approximation of the technique we'll use here. Concept search involves three key advances over other types of search:

 1. Many words are included in a search instead of one, so that we retrieve texts based on how well they represent an idea, rather than just a single token;
 2. Document scores are derived from TF-IDF statistics, so that the relevance of rare and common words is balanced out;
 3. Volumes are broken into smaller chunks of 1000 words, so that we identify short passages inside a text that discuss the concept in question. This has the added benefit of giving us much more precise results.
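
The chunking described in point 3 can be sketched in a few lines. This is only an illustration, not the code that produced the chunked files used below: the function name `chunk_tokens` and the dummy volume are assumptions made for the example.

```python
def chunk_tokens(tokens, chunk_size=1000):
    """Split a list of tokens into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

# Hypothetical stand-in for a volume: 2,500 dummy tokens
volume_tokens = ("word " * 2500).split()
chunks = chunk_tokens(volume_tokens)

# The last chunk holds whatever is left over
print([len(chunk) for chunk in chunks])
```

Each chunk can then be scored on its own, which is what lets the search point to a specific passage rather than a whole volume.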

As we've said, there are a number of ways to start this sort of analysis, from using the results of topic modelling or word embeddings to a more hands-on technique, which involves feeding in a set of loading passages and evaluating them for their most frequent words.
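
As a rough sketch of the hands-on technique, the snippet below counts the most frequent non-stopword terms in a couple of passages. The passages, the tiny stopword set, and the variable names are all invented for illustration; in practice you would reuse the cleaning steps and stopword list defined later in this notebook.

```python
from collections import Counter

# Hypothetical loading passages, invented for illustration
loading_passages = [
    "the principle of morality rests on reason",
    "reason supplies the principle and the motive",
]
# A tiny illustrative stopword set; the notebook uses a much longer list
small_stopwords = {"the", "of", "on", "and"}

counts = Counter(
    word
    for passage in loading_passages
    for word in passage.lower().split()
    if word not in small_stopwords
)

# Keep the two most frequent terms as candidate query words
candidate_terms = [word for word, _ in counts.most_common(2)]
print(candidate_terms)
```

The resulting list would still need the same manual pruning as any other query before being used in a search.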

## 1. Build a Query from Your Vector-Model Results

Let's start with the simplest way to build a search, which is to use the cluster of terms nearest to `principle` that we derived in the previous lesson. Make sure to go through and remove any words that you would prefer not to include. Keep in mind, too, that this first run may not be the most effective demonstration of the possibilities of concept search: most of the words in this list are near synonyms of our concept of interest, rather than words that mean something different but tend to appear alongside it.

In [None]:
query = ['principles', 'system', 'supposition', 'motive', 'conviction', 'proposition', 'concept', 'doctrine', 'basis', 'morality', 'reasoning', 'notion', 'criterion', 'precept', 'unity', 'theory']

We'll need to run all our usual cleaning steps on the query, so that its terms match the tokens found in the texts we search.

In [None]:
from string import punctuation
punctuation += "“”‘’↩"
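import nltk
nltk.download("stopwords")
# The tokenizer and lemmatizer used below also need these resources if they are
# not already installed (recent NLTK releases may ask for "punkt_tab" instead of "punkt")
nltk.download("punkt")
nltk.download("wordnet")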
from nltk.corpus import stopwords
more_stopwords = (stopwords.words('english')) + ["0", "1", "10", "100", "11", "12", "13", "14", "15", "16", "17", "18", "19", "2", "3", "4", "5", "6", "7", "8", "9", "a", "able", "adam", "also", "also", "although", "among", "another", "away", "b", "began", "c", "came", "could", "d", "de", "done", "e", "eight", "et", "even", "even", "ever", "every", "every", "f", "first", "five", "found", "four", "g", "gave", "give", "go", "good", "great", "h", "high", "however", "i", "ii", "iii", "indeed", "j", "john", "k", "know", "l", "la", "le", "left", "let", "life", "like", "little", "long", "m", "made", "made", "make", "make", "man", "many", "may", "may", "men", "might", "mr", "much", "much", "must", "must", "n", "near", "never", "nine", "nothing", "o", "often", "one", "one", "p", "p", "part", "per", "place", "put", "q", "r", "s", "said", "said", "saw", "sect", "see", "self", "seven", "several", "shall", "shall", "sir", "six", "soon", "t", "take", "ten", "th", "thee", "therefore", "thing", "things", "thou", "though", "though", "three", "thus", "thy", "till", "time", "told", "took", "two", "two", "u", "u", "upon", "upon", "us", "v", "v", "vol", "w", "way", "well", "went", "whether", "without", "without", "would", "would", "x", "y", "yet", "z"]
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

In [None]:
usefulWordList = [word for word in query if word not in more_stopwords]
query_lemmas = [wordnet_lemmatizer.lemmatize(word) for word in usefulWordList]

In [None]:
query_lemmas

## 2. Assemble a Toy Corpus to Use for Practice

In [None]:
import os
from pathlib import Path
home = str(Path.home())

textdirectory = home + '/dh2/corpora_and_metadata/chunked_files_kant/'

os.chdir(textdirectory)
print(os.getcwd())

In [None]:
# Get list of filenames
import glob
filenames = glob.glob("*.txt")
print(filenames)

In [None]:
corpus = []
for filename in filenames:
    # The with-statement closes the file automatically
    with open(filename, 'r') as file:
        readFile = file.read()
    # Tokenize, drop stopwords, then lemmatize, as we did for the query
    words = nltk.tokenize.word_tokenize(readFile)
    usefulWordList = [word for word in words if word not in more_stopwords]
    lemma_text = [wordnet_lemmatizer.lemmatize(word) for word in usefulWordList]
    corpus.append(lemma_text)

In [None]:
len(corpus)

## 3. Create a Dictionary of Word Counts for Your Toy Corpus

In [None]:
corpus_dict = {}
for i in range(len(filenames)):
    corpus_dict[filenames[i]] = corpus[i]

In [None]:
len(corpus_dict)

In [None]:
word_freq = {}
for i in corpus_dict:
    for lemma in corpus_dict[i]:
        if lemma not in word_freq.keys():
            word_freq[lemma] = 1
        else:
            word_freq[lemma] += 1

In [None]:
word_freq["principle"]

In [None]:
for lemma in sorted(word_freq, key=word_freq.get, reverse=True):
    print(lemma, word_freq[lemma])

## 4. Derive TF-IDF Scores for Your Toy Corpus

In [None]:
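import numpy as np

lemma_idfs = {}
for lemma in word_freq:
    # Count the chunks in which this lemma appears at least once
    doc_containing_lemma = 0
    for i in corpus_dict:
        if lemma in corpus_dict[i]:
            doc_containing_lemma += 1
    # Inverse document frequency: log(number of chunks / (1 + chunks containing the lemma))
    lemma_idfs[lemma] = np.log(len(corpus)/(1 + doc_containing_lemma))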

In [None]:
print(lemma_idfs)

In [None]:
print(lemma_idfs["principle"])
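
The IDF formula used in this section, log(N / (1 + df)), can be checked by hand on a toy corpus. The chunk names, tokens, and the helper `toy_idf` below are invented for illustration, and the sketch uses `math.log` rather than NumPy so that it stands alone.

```python
import math

# Hypothetical three-chunk corpus, already tokenized
toy_corpus = {
    "chunk_a": ["principle", "reason", "principle"],
    "chunk_b": ["reason", "motive"],
    "chunk_c": ["reason", "doctrine"],
}

def toy_idf(term, docs):
    """IDF as computed in this notebook: log(N / (1 + document frequency))."""
    doc_freq = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / (1 + doc_freq))

print(toy_idf("principle", toy_corpus))  # one chunk of three: log(3/2)
print(toy_idf("reason", toy_corpus))     # all three chunks: log(3/4)
```

Note that because of the 1 added to the denominator, a lemma that appears in every chunk receives a slightly negative IDF, so very common words pull a chunk's score down a little instead of contributing nothing.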

In [None]:
complete_corpus_tfidfs = {}
for i in corpus_dict: 
    doc_word_counts = {}
    doc_tfidfs = {}
    for lemma in corpus_dict[i]:
        if lemma not in doc_word_counts:
            doc_word_counts[lemma] = 1
        else:
            doc_word_counts[lemma] += 1
        doc_tfidfs[lemma] = doc_word_counts[lemma]*lemma_idfs[lemma]
    complete_corpus_tfidfs[i] = doc_tfidfs

In [None]:
complete_corpus_tfidfs

In [None]:
complete_corpus_tfidfs["E000049.001.0000.txt"]["principle"]

## 5. Finally, Write a Script to Perform Your Search

In [None]:
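search_results = {}
for i in corpus_dict:
    doc_score = 0
    # Add up the TF-IDF weight of every query lemma that appears in this chunk
    for lemma in query_lemmas:
        if lemma in complete_corpus_tfidfs[i]:
            doc_score += complete_corpus_tfidfs[i][lemma]
    # Record the score once per chunk, after all query lemmas have been checked
    search_results[i] = doc_score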
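
To see what the scoring loop does, here is a toy version with made-up numbers. The chunk names, weights, and query are hypothetical; only the logic (sum the TF-IDF weights of the query lemmas present in each chunk, then rank) mirrors the cell above.

```python
# Hypothetical TF-IDF weights for two chunks (values invented for illustration)
toy_tfidfs = {
    "chunk_a": {"principle": 0.8, "reason": 0.1},
    "chunk_b": {"doctrine": 0.5, "motive": 0.4, "reason": 0.1},
}
toy_query = ["principle", "doctrine", "motive"]

# A query lemma missing from a chunk contributes nothing to that chunk's score
toy_scores = {
    name: sum(weights.get(lemma, 0) for lemma in toy_query)
    for name, weights in toy_tfidfs.items()
}

# Rank chunk names from highest to lowest aggregate score
ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)
print(ranked)
```

Here `chunk_b` outranks `chunk_a` even though it never mentions "principle", because two other query terms appear in it. This is the sense in which concept search retrieves passages by idea rather than by a single token.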

In [None]:
import pandas as pd

search_results_df = pd.DataFrame(search_results, index=[0])

In [None]:
search_results_df = search_results_df.transpose()

In [None]:
search_results_df.reset_index(level=0, inplace=True)

In [None]:
search_results_df = search_results_df.rename(columns={"index":"ChunkName", 0:"agg_tfidf"})

In [None]:
search_results_df.sort_values(by=['agg_tfidf'], ascending=False)

In [None]:
# The with-statement closes the file automatically
with open("E000049.001.0105.txt", 'r') as result:
    readResult = result.read()

In [None]:
print(readResult)