In [166]:
import requests
from bs4 import BeautifulSoup
import re
import os

film_link = "https://en.wikipedia.org/wiki/List_of_Star_Wars_films"
series_link = "https://en.wikipedia.org/wiki/List_of_Star_Wars_television_series"
characters_link = "https://en.wikipedia.org/wiki/List_of_Star_Wars_characters"

# Film processing

In this section we extract information about the Star Wars films, together with the links to the individual film pages, which will be processed later.

In [117]:
film_page = requests.get(film_link)
soup_film = BeautifulSoup(film_page.content, "html.parser")

In [118]:
div_content = soup_film.find("div", {"class": "mw-content-ltr mw-parser-output"})

In [119]:
print(div_content.prettify())

<div class="mw-content-ltr mw-parser-output" dir="ltr" lang="en">
 <p class="mw-empty-elt">
 </p>
 <p class="mw-empty-elt">
 </p>
 <style data-mw-deduplicate="TemplateStyles:r1218072481">
  .mw-parser-output .infobox-subbox{padding:0;border:none;margin:-3px;width:auto;min-width:100%;font-size:100%;clear:none;float:none;background-color:transparent}.mw-parser-output .infobox-3cols-child{margin:auto}.mw-parser-output .infobox .navbar{font-size:100%}body.skin-minerva .mw-parser-output .infobox-header,body.skin-minerva .mw-parser-output .infobox-subheader,body.skin-minerva .mw-parser-output .infobox-above,body.skin-minerva .mw-parser-output .infobox-title,body.skin-minerva .mw-parser-output .infobox-image,body.skin-minerva .mw-parser-output .infobox-full-data,body.skin-minerva .mw-parser-output .infobox-below{text-align:center}html.skin-theme-clientpref-night .mw-parser-output .infobox-full-data div{background:#1f1f23!important;color:#f8f9fa}@media(prefers-color-scheme:dark){html.skin-them

In [22]:
title_tags = ["h1", "h2", "h3", "h4", "h5"]
ignore_headers = ["Reception", "Unproduced and abandoned projects", "Documentaries", "Notes", "See also", "References", "External links"]

regex_headers = "|".join(ignore_headers)

Next, we convert the page into text, merging consecutive <p> tags within the same section into a single paragraph. We keep the section titles but drop the class attributes on their tags, which carry no useful information.

Since not all of the information is relevant, we discard some sections; their headers are listed in ignore_headers.
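To make the header filter concrete, here is a minimal sketch of how a list of section titles becomes one alternation pattern. The `re.escape` call, the helper name `is_ignored` and the sample titles are additions for illustration and are not part of the notebook. Note that `re.search` matches substrings, so a short entry such as "Notes" also matches a header like "Footnotes".

```python
import re

ignore_headers = ["Reception", "Notes", "See also", "References", "External links"]

# Escape each title so characters such as "(" or "." are matched literally,
# then join the titles into a single alternation pattern.
regex_headers = "|".join(re.escape(header) for header in ignore_headers)


def is_ignored(header_text):
    """Return True if the header text contains any of the ignored titles."""
    return re.search(regex_headers, header_text, flags=re.IGNORECASE) is not None


print(is_ignored("See also"))        # True
print(is_ignored("external LINKS"))  # True (case-insensitive)
print(is_ignored("Skywalker Saga"))  # False
print(is_ignored("Footnotes"))       # True: substring match on "Notes"
```

If substring matches are not wanted, anchoring the pattern (for example with `re.fullmatch` on the stripped header text) is one possible alternative.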

In [95]:
def convert_page_to_text(html_tag):

    full_page_text = ""
    last_tag_name = None
    ignore_paragraph = False
    
    for child_tag in html_tag.children:
        if child_tag.name == "p":
            if not ignore_paragraph:
                # In the case the last tag was not a paragraph, add the start of a paragraph tag <p>
                if last_tag_name is None or last_tag_name != "p":
                    full_page_text += "<p>"
                    
                full_page_text += child_tag.text
                last_tag_name = "p"
        else:
            # If now we're reading a tag that is not p and the last tag was p, close the tag.
            if last_tag_name is not None and last_tag_name == "p":
                full_page_text += "</p>"
            last_tag_name = child_tag.name
            
            # If the header of the paragraph is to be ignored, just skip it.
            if child_tag.name == "h2" and re.search(regex_headers, child_tag.text, flags=re.IGNORECASE):
                ignore_paragraph = True
            elif child_tag.name == "h2":
                ignore_paragraph = False
                
            # Keep the header tag     
            if not ignore_paragraph and child_tag.name in title_tags:
                    full_page_text += f'<{child_tag.name}>{child_tag.text}</{child_tag.name}>'
                
    return full_page_text
    

Let's now print the result.

In [42]:
full_page_text = convert_page_to_text(div_content)

print(full_page_text)

<p>

</p><p>The Star Wars franchise involves multiple live-action and animated films. The film series started with a trilogy set in medias res that was later expanded to a trilogy of trilogies, known as the "Skywalker Saga".
The 1977 self-titled film, later subtitled Episode IV – A New Hope, was followed by the sequels The Empire Strikes Back (1980) and Return of the Jedi (1983), subtitled onscreen as Episode V and Episode VI; these films form the original trilogy. Sixteen years later, the prequel trilogy was released, consisting of The Phantom Menace (1999), Attack of the Clones (2002), and Revenge of the Sith (2005). After creator George Lucas sold Lucasfilm to Disney in 2012, a sequel trilogy consisting of Episodes VII through IX was released, consisting of The Force Awakens (2015), The Last Jedi (2017), and The Rise of Skywalker (2019).
The first three spin-off films produced were the made-for-television Star Wars Holiday Special (1978), The Ewok Adventure (1984) and Ewoks: The Bat

As one can observe, the text contains a lot of noise, mostly from the "edit" section links and the citation markers. We will remove both with a regular expression.

In [43]:
cleaned_text = re.sub(r"\[.+?\]", "", full_page_text)
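
As a small illustration of what this pattern removes, the sketch below applies it to a made-up snippet (the sample string is an assumption, not text taken from the page). The non-greedy `.+?` matters: the greedy variant would delete everything between the first `[` and the last `]`.

```python
import re

sample = (
    "<h2>Plot[edit]</h2><p>An immediate prequel to Star Wars (1977).[a] "
    "It was released in 2016.[5]</p>"
)

# Non-greedy: each "[...]" group is removed on its own.
cleaned = re.sub(r"\[.+?\]", "", sample)
print(cleaned)
# <h2>Plot</h2><p>An immediate prequel to Star Wars (1977). It was released in 2016.</p>

# Greedy, for comparison: one match spans from the first "[" to the last "]".
too_greedy = re.sub(r"\[.+\]", "", sample)
print(too_greedy)
# <h2>Plot</p>
```

The pattern also removes any legitimate bracketed text, which seems acceptable for this page but is worth keeping in mind.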

In [44]:
print(cleaned_text)

<p>

</p><p>The Star Wars franchise involves multiple live-action and animated films. The film series started with a trilogy set in medias res that was later expanded to a trilogy of trilogies, known as the "Skywalker Saga".
The 1977 self-titled film, later subtitled Episode IV – A New Hope, was followed by the sequels The Empire Strikes Back (1980) and Return of the Jedi (1983), subtitled onscreen as Episode V and Episode VI; these films form the original trilogy. Sixteen years later, the prequel trilogy was released, consisting of The Phantom Menace (1999), Attack of the Clones (2002), and Revenge of the Sith (2005). After creator George Lucas sold Lucasfilm to Disney in 2012, a sequel trilogy consisting of Episodes VII through IX was released, consisting of The Force Awakens (2015), The Last Jedi (2017), and The Rise of Skywalker (2019).
The first three spin-off films produced were the made-for-television Star Wars Holiday Special (1978), The Ewok Adventure (1984) and Ewoks: The Bat

### Links extraction

We now have the cleaned full text of the page, but the text on its own is not all we care about: the links inside it matter as well, since they lead to pages with more information on, for example, the individual movies.

The approach will be the following: 
- Get all the links in the page content
- Retrieve some link names with some heuristics (e.g. they contain Episode or Star Wars in the link)
- Fine-tune the remaining links

In [89]:
links = div_content.find_all("a")
link_dict = {tag.text : tag.get("href") for tag in links}

link_dict

{'': '/wiki/File:Solo_A_Star_Wars_Story_Japan_Premiere_Red_Carpet_Alden_Ehrenreich_(41008143870).jpg',
 'George Lucas': '/wiki/George_Lucas',
 'Irvin Kershner': '/wiki/Irvin_Kershner',
 'Richard Marquand': '/wiki/Richard_Marquand',
 'J. J. Abrams': '/wiki/J._J._Abrams',
 'Rian Johnson': '/wiki/Rian_Johnson',
 'Dave Filoni': '/wiki/Dave_Filoni',
 'Gareth Edwards': '/wiki/Gareth_Edwards_(director)',
 'Ron Howard': '/wiki/Ron_Howard',
 'Jon Favreau': '/wiki/Jon_Favreau',
 'Gary Kurtz': '/wiki/Gary_Kurtz',
 'Howard Kazanjian': '/wiki/Howard_Kazanjian',
 'Rick McCallum': '/wiki/Rick_McCallum',
 'Catherine Winder': '/wiki/Catherine_Winder',
 'Kathleen Kennedy': '/wiki/Kathleen_Kennedy_(producer)',
 'Bryan Burk': '/wiki/Bryan_Burk',
 'Allison Shearmur': '/wiki/Allison_Shearmur',
 'Ram Bergman': '/wiki/Ram_Bergman',
 'Michelle Rejwan': '/wiki/Michelle_Rejwan',
 'Lucasfilm': '/wiki/Lucasfilm',
 'Lucasfilm Animation': '/wiki/Lucasfilm_Animation',
 '20th Century Fox': '/wiki/20th_Century_Fox',
 '

There are far too many links, and not all of them point to movies.
First, let's remove all the links that do not refer to a Wikipedia page (their href does not contain "wiki").

In [90]:
compiled_wiki_regex = re.compile("wiki")
remove_key = []

for key, link in link_dict.items():
    if link is None or not re.search(compiled_wiki_regex, link):
        remove_key.append(key)
        
for k in remove_key:
    del link_dict[k]
    
link_dict

{'': '/wiki/File:Solo_A_Star_Wars_Story_Japan_Premiere_Red_Carpet_Alden_Ehrenreich_(41008143870).jpg',
 'George Lucas': '/wiki/George_Lucas',
 'Irvin Kershner': '/wiki/Irvin_Kershner',
 'Richard Marquand': '/wiki/Richard_Marquand',
 'J. J. Abrams': '/wiki/J._J._Abrams',
 'Rian Johnson': '/wiki/Rian_Johnson',
 'Dave Filoni': '/wiki/Dave_Filoni',
 'Gareth Edwards': '/wiki/Gareth_Edwards_(director)',
 'Ron Howard': '/wiki/Ron_Howard',
 'Jon Favreau': '/wiki/Jon_Favreau',
 'Gary Kurtz': '/wiki/Gary_Kurtz',
 'Howard Kazanjian': '/wiki/Howard_Kazanjian',
 'Rick McCallum': '/wiki/Rick_McCallum',
 'Catherine Winder': '/wiki/Catherine_Winder',
 'Kathleen Kennedy': '/wiki/Kathleen_Kennedy_(producer)',
 'Bryan Burk': '/wiki/Bryan_Burk',
 'Allison Shearmur': '/wiki/Allison_Shearmur',
 'Ram Bergman': '/wiki/Ram_Bergman',
 'Michelle Rejwan': '/wiki/Michelle_Rejwan',
 'Lucasfilm': '/wiki/Lucasfilm',
 'Lucasfilm Animation': '/wiki/Lucasfilm_Animation',
 '20th Century Fox': '/wiki/20th_Century_Fox',
 '
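
As the output shows, a substring test on "wiki" still keeps entries such as the `/wiki/File:` link. A stricter variant could look like the sketch below; the sample dictionary, the helper `is_article_link` and the list of namespace prefixes are assumptions made for illustration.

```python
# Hypothetical sample of link text -> href pairs, mimicking link_dict.
link_dict = {
    "": "/wiki/File:Solo_premiere.jpg",
    "George Lucas": "/wiki/George_Lucas",
    "edit": "/w/index.php?title=List_of_Star_Wars_films&action=edit",
    "Official site": "https://www.starwars.com/",
    "anchor only": None,
}

# Namespaces that live under /wiki/ but are not articles (assumed, non-exhaustive list).
NON_ARTICLE_PREFIXES = ("File:", "Help:", "Special:", "Category:", "Template:")


def is_article_link(href):
    """Keep only internal article links such as /wiki/George_Lucas."""
    if href is None or not href.startswith("/wiki/"):
        return False
    page_name = href[len("/wiki/"):]
    return not page_name.startswith(NON_ARTICLE_PREFIXES)


# Build a new dict instead of collecting keys and deleting them afterwards.
article_links = {text: href for text, href in link_dict.items() if is_article_link(href)}
print(article_links)  # {'George Lucas': '/wiki/George_Lucas'}
```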

From the remaining links, let's select those whose text looks like a film title.

In [91]:
film_title_regex = re.compile("Episode|Rogue One|The Clone Wars|^Solo: A Star Wars Story$", flags=re.IGNORECASE)
relevant_links = []

for key, value in link_dict.items():
    if re.search(film_title_regex, key):
        relevant_links.append(value)
        
relevant_links

Star Wars: The Clone Wars
Rogue One
Solo: A Star Wars Story
Episode I – The Phantom Menace
Episode IX – The Rise of Skywalker
Episode II – Attack of the Clones
Episode III – Revenge of the Sith
Episode IV – A New Hope
Episode V – The Empire Strikes Back
Episode VI – Return of the Jedi
Episode VII – The Force Awakens
Episode VIII – The Last Jedi
The Clone Wars
Star Wars: The Clone Wars (film)
Rogue One: A Star Wars Story
episodes
Star Wars: Episode I – The Phantom Menace
Star Wars: Episode II – Attack of the Clones
Star Wars: Episode III – Revenge of the Sith


['/wiki/Star_Wars:_The_Clone_Wars_(2008_TV_series)',
 '/wiki/Rogue_One_(soundtrack)',
 '/wiki/Solo:_A_Star_Wars_Story#Box_office',
 '/wiki/Star_Wars:_Episode_I_%E2%80%93_The_Phantom_Menace',
 '/wiki/Star_Wars:_The_Rise_of_Skywalker',
 '/wiki/Star_Wars:_Episode_II_%E2%80%93_Attack_of_the_Clones',
 '/wiki/Star_Wars:_Episode_III_%E2%80%93_Revenge_of_the_Sith',
 '/wiki/Star_Wars_(film)',
 '/wiki/The_Empire_Strikes_Back',
 '/wiki/Return_of_the_Jedi',
 '/wiki/Star_Wars:_The_Force_Awakens',
 '/wiki/Star_Wars:_The_Last_Jedi',
 '/wiki/Star_Wars:_The_Clone_Wars_(film)#Soundtrack',
 '/wiki/Star_Wars:_The_Clone_Wars_(film)',
 '/wiki/Rogue_One#Box_office',
 '/wiki/List_of_Star_Wars_Rebels_episodes',
 '/wiki/Star_Wars:_Episode_I_%E2%80%93_The_Phantom_Menace',
 '/wiki/Star_Wars:_Episode_II_%E2%80%93_Attack_of_the_Clones',
 '/wiki/Star_Wars:_Episode_III_%E2%80%93_Revenge_of_the_Sith']

In [93]:
for idx, link in enumerate(relevant_links):
    relevant_links[idx] = re.sub("#(.+)$", "", link)
    
#Remove duplicates
relevant_links = list(set(relevant_links))
relevant_links

['/wiki/Rogue_One',
 '/wiki/Star_Wars:_The_Clone_Wars_(film)',
 '/wiki/Star_Wars:_Episode_I_%E2%80%93_The_Phantom_Menace',
 '/wiki/List_of_Star_Wars_Rebels_episodes',
 '/wiki/Star_Wars:_Episode_II_%E2%80%93_Attack_of_the_Clones',
 '/wiki/Solo:_A_Star_Wars_Story',
 '/wiki/Star_Wars:_The_Clone_Wars_(2008_TV_series)',
 '/wiki/Return_of_the_Jedi',
 '/wiki/Star_Wars:_The_Force_Awakens',
 '/wiki/Rogue_One_(soundtrack)',
 '/wiki/Star_Wars:_The_Rise_of_Skywalker',
 '/wiki/Star_Wars:_Episode_III_%E2%80%93_Revenge_of_the_Sith',
 '/wiki/Star_Wars:_The_Last_Jedi',
 '/wiki/Star_Wars_(film)',
 '/wiki/The_Empire_Strikes_Back']
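
The same clean-up can also be sketched with the standard library instead of a regular expression. `urldefrag` drops the `#section` part, and `dict.fromkeys` removes duplicates while preserving the original order, which `set()` does not guarantee. The short input list below is a hand-picked sample, not the full result.

```python
from urllib.parse import urldefrag

links = [
    "/wiki/Solo:_A_Star_Wars_Story#Box_office",
    "/wiki/Rogue_One#Box_office",
    "/wiki/Rogue_One",
    "/wiki/Star_Wars:_The_Clone_Wars_(film)#Soundtrack",
    "/wiki/Star_Wars:_The_Clone_Wars_(film)",
]

# urldefrag splits "page#section" into (url, fragment); keep only the url part.
without_fragments = [urldefrag(link).url for link in links]

# dict.fromkeys keeps the first occurrence of each key in insertion order.
unique_links = list(dict.fromkeys(without_fragments))
print(unique_links)
# ['/wiki/Solo:_A_Star_Wars_Story', '/wiki/Rogue_One', '/wiki/Star_Wars:_The_Clone_Wars_(film)']
```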

Finally, we have all the links; we just need to add the Wikipedia prefix.

In [94]:
for idx, link in enumerate(relevant_links):
    relevant_links[idx] = "https://en.wikipedia.org" + link
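
Plain string concatenation depends on whether the base ends with a slash and the path starts with one. As a hedged alternative, `urllib.parse.urljoin` resolves the path against the base; the names `BASE_URL` and `relative_links` below are illustrative only.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/"
relative_links = ["/wiki/Rogue_One", "/wiki/Solo:_A_Star_Wars_Story"]

# urljoin resolves an absolute path against the host, so the slash between
# host and path is never doubled.
absolute_links = [urljoin(BASE_URL, link) for link in relative_links]
print(absolute_links)
# ['https://en.wikipedia.org/wiki/Rogue_One', 'https://en.wikipedia.org/wiki/Solo:_A_Star_Wars_Story']
```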

Now relevant_links contains all the links needed to scrape the individual film pages.

In [96]:
web_pages_texts = []

for link in relevant_links:
    page = requests.get(link)
    soup = BeautifulSoup(page.content, "html.parser")
    div_content = soup.find("div", {"class": "mw-content-ltr mw-parser-output"})

    web_pages_texts.append(convert_page_to_text(div_content))
    
web_pages_texts

["<p>\n</p><p>Rogue One (or Rogue One: A Star Wars Story) is a 2016 American epic space opera film directed by Gareth Edwards. The screenplay by Chris Weitz and Tony Gilroy is from a story by John Knoll and Gary Whitta. It was produced by Lucasfilm and distributed by Walt Disney Studios Motion Pictures. It is the first installment of the Star Wars anthology series, and an immediate prequel to Star Wars (1977).[a] The main cast consists of Felicity Jones, Diego Luna, Ben Mendelsohn, Donnie Yen, Mads Mikkelsen, Alan Tudyk, Riz Ahmed, Jiang Wen, and Forest Whitaker. Set a week before the events of Star Wars, the plot follows a group of rebels who band together to steal plans of the Death Star, the ultimate weapon of the Galactic Empire. It details the Rebel Alliance's first effective victory against the Empire, first referenced in Star Wars' opening crawl.[5]\nBased on an idea first pitched by Knoll ten years before it entered development, the film was made to be different in tone and sty

Let's redefine the function that converts a web page to text so that it keeps only the most relevant information and also incorporates the cleaning steps we performed afterwards.

In [184]:
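def convert_page_to_text(html_tag, title_tags, keep_headers, headers, page_title=None):
    """
    Convert the content of a Wikipedia page into lightly tagged text.

    :param html_tag: tag whose direct children make up the page content
    :param title_tags: names of the heading tags to keep (e.g. "h2", "h3")
    :param keep_headers: True to keep only the sections listed in headers, False to ignore those sections
    :param headers: section titles to keep or ignore, depending on keep_headers
    :param page_title: optional page title, emitted as an <h1> at the top of the text
    :return: the page text with bracketed markers such as [edit] or [5] removed
    """
    regex_headers = "|".join(headers)
    # Unwrap meta tags
    meta_tags = html_tag.find_all("meta")
    for meta_tag in meta_tags:
        meta_tag.unwrap()

    full_page_text = ""
    last_tag_name = None
    ignore_paragraph = False

    # The title is optional, so callers that do not have one can omit it.
    if page_title is not None:
        full_page_text += f"<h1>{page_title}</h1>"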

    for child_tag in html_tag.children:
        if child_tag.name == "p":
            if not ignore_paragraph:
                # In the case the last tag was not a paragraph, add the start of a paragraph tag <p>
                if last_tag_name is None or last_tag_name != "p":
                    full_page_text += "<p>"

                full_page_text += child_tag.text
                last_tag_name = "p"
        else:
            # If now we're reading a tag that is not p and the last tag was p, close the tag.
            if last_tag_name is not None and last_tag_name == "p":
                full_page_text += "</p>"
            last_tag_name = child_tag.name

            # If the header of the paragraph is to be ignored, just skip it.
            if child_tag.name == "h2" and re.search(regex_headers, child_tag.text, flags=re.IGNORECASE):
                ignore_paragraph = not keep_headers
            elif child_tag.name == "h2":
                ignore_paragraph = keep_headers

            # Keep the header tag
            if not ignore_paragraph and child_tag.name in title_tags:
                full_page_text += f'<{child_tag.name}>{child_tag.text}</{child_tag.name}>'

    full_page_text = re.sub(r"\[.+?\]", "", full_page_text)
    return full_page_text
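
The keep/ignore switch in this function can be easier to follow on plain data. The sketch below reproduces only that logic on a list of `(tag, text)` pairs; `select_sections` and the sample blocks are hypothetical and simplify the real function, which works on BeautifulSoup tags.

```python
import re


def select_sections(blocks, headers, keep_headers):
    """Filter (tag, text) blocks by their enclosing <h2> section.

    keep_headers=True keeps only the sections listed in headers;
    keep_headers=False drops exactly those sections.
    Blocks before the first <h2> are always kept.
    """
    pattern = "|".join(re.escape(header) for header in headers)
    ignore_section = False
    kept = []
    for tag, text in blocks:
        if tag == "h2":
            listed = re.search(pattern, text, flags=re.IGNORECASE) is not None
            # Listed + keep -> not ignored; listed + drop -> ignored; and vice versa.
            ignore_section = listed != keep_headers
        if not ignore_section:
            kept.append((tag, text))
    return kept


blocks = [
    ("p", "Intro"),
    ("h2", "Plot"), ("p", "Plot text"),
    ("h2", "Reception"), ("p", "Reception text"),
    ("h2", "Cast"), ("p", "Cast text"),
]

print([text for _, text in select_sections(blocks, ["Plot", "Cast"], True)])
# ['Intro', 'Plot', 'Plot text', 'Cast', 'Cast text']
print([text for _, text in select_sections(blocks, ["Reception"], False)])
# ['Intro', 'Plot', 'Plot text', 'Cast', 'Cast text']
```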

In [114]:
keep_headers = ["Plot", "Cast"]

web_pages_texts = []

for link in relevant_links:
    page = requests.get(link)
    soup = BeautifulSoup(page.content, "html.parser")
    div_content = soup.find("div", {"class": "mw-content-ltr mw-parser-output"})

    web_pages_texts.append(convert_page_to_text(div_content, title_tags, True, keep_headers))
    
web_pages_texts

['<p>\n</p><p>Rogue One (or Rogue One: A Star Wars Story) is a 2016 American epic space opera film directed by Gareth Edwards. The screenplay by Chris Weitz and Tony Gilroy is from a story by John Knoll and Gary Whitta. It was produced by Lucasfilm and distributed by Walt Disney Studios Motion Pictures. It is the first installment of the Star Wars anthology series, and an immediate prequel to Star Wars (1977). The main cast consists of Felicity Jones, Diego Luna, Ben Mendelsohn, Donnie Yen, Mads Mikkelsen, Alan Tudyk, Riz Ahmed, Jiang Wen, and Forest Whitaker. Set a week before the events of Star Wars, the plot follows a group of rebels who band together to steal plans of the Death Star, the ultimate weapon of the Galactic Empire. It details the Rebel Alliance\'s first effective victory against the Empire, first referenced in Star Wars\' opening crawl.\nBased on an idea first pitched by Knoll ten years before it entered development, the film was made to be different in tone and style f

In [115]:
overall_len = sum(len(string) for string in web_pages_texts)

And finally we save the files so that we can retrieve them later when performing chunking.

In [177]:
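directory = "./web_pages/"
# Create the output directory once, before the loop.
os.makedirs(directory, exist_ok=True)

for idx, page in enumerate(web_pages_texts):
    link_name = relevant_links[idx]
    # Use the page name (the part after /wiki/) as file name, without colons.
    file_name = re.findall("/wiki/(.+)", link_name)[0]
    file_name = re.sub(":", "", file_name)
    file_name = directory + file_name + ".html"

    # Write as UTF-8 explicitly, since the platform default encoding may differ.
    with open(file_name, "w", encoding="utf-8") as html_file:
        print("saving file")
        html_file.write(page)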

saving file
saving file
saving file
saving file
saving file
saving file
saving file
saving file
saving file
saving file
saving file
saving file
saving file
saving file
saving file
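
For the later chunking step, the saved pages need to be read back. A minimal round-trip sketch with `pathlib` is shown below; it writes to a temporary directory, and the `pages` dictionary is made-up sample data rather than the scraped content.

```python
import tempfile
from pathlib import Path

# Hypothetical page name -> HTML text pairs standing in for the scraped pages.
pages = {
    "Rogue_One": "<h1>Rogue One</h1><p>Plot text.</p>",
    "Star_Wars_(film)": "<h1>Star Wars</h1><p>Plot text.</p>",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    directory = Path(tmp_dir) / "web_pages"
    directory.mkdir(parents=True, exist_ok=True)

    for name, html in pages.items():
        (directory / f"{name}.html").write_text(html, encoding="utf-8")

    # Later, e.g. in the chunking step: load every saved page back by file name.
    loaded = {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(directory.glob("*.html"))
    }

print(sorted(loaded))  # ['Rogue_One', 'Star_Wars_(film)']
```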


## Extract characters information

In [186]:
characters_page = requests.get(characters_link)
soup_chars = BeautifulSoup(characters_page.content, "html.parser")

div_content = soup_chars.find("div", {"class": "mw-content-ltr mw-parser-output"})

title_tags = ["h1", "h2", "h3", "h4", "h5"]
ignore_headers = ["References", "External Links"]

full_page_text = convert_page_to_text(div_content, title_tags, False, ignore_headers, "List of Star Wars characters")



In [189]:
div_link_container = div_content.find_all("div", {"role": "note"})

links = []
for div in div_link_container:
    link_to_resource = div.find("a")["href"]
    full_link = "https://en.wikipedia.org/" + link_to_resource
    links.append(full_link)


['https://en.wikipedia.org//wiki/Star_Wars',
 'https://en.wikipedia.org//wiki/Admiral_Ackbar',
 'https://en.wikipedia.org//wiki/Padm%C3%A9_Amidala',
 'https://en.wikipedia.org//wiki/Cassian_Andor',
 'https://en.wikipedia.org//wiki/The_Armorer',
 'https://en.wikipedia.org//wiki/Wedge_Antilles',
 'https://en.wikipedia.org//wiki/Doctor_Aphra',
 'https://en.wikipedia.org//wiki/Cad_Bane',
 'https://en.wikipedia.org//wiki/Darth_Bane',
 'https://en.wikipedia.org//wiki/Tobias_Beckett',
 'https://en.wikipedia.org//wiki/Jar_Jar_Binks',
 'https://en.wikipedia.org//wiki/Ezra_Bridger',
 'https://en.wikipedia.org//wiki/Lando_Calrissian',
 'https://en.wikipedia.org//wiki/Chewbacca',
 'https://en.wikipedia.org//wiki/The_Client_(Star_Wars)',
 'https://en.wikipedia.org//wiki/Poe_Dameron',
 'https://en.wikipedia.org//wiki/The_Mandalorian_(character)',
 'https://en.wikipedia.org//wiki/Count_Dooku',
 'https://en.wikipedia.org//wiki/Kanan_Jarrus',
 'https://en.wikipedia.org//wiki/Cara_Dune',
 'https://en.wi

## Extract series information

In [190]:
series_page = requests.get(series_link)
soup_series = BeautifulSoup(series_page.content, "html.parser")

div_content = soup_series.find("div", {"class": "mw-content-ltr mw-parser-output"})

title_tags = ["h1", "h2", "h3", "h4", "h5"]
ignore_headers = ["References", "External Links"]

full_page_text = convert_page_to_text(div_content, title_tags, False, ignore_headers, "List of Star Wars series")

In [193]:
div_link_container = div_content.find_all("div", {"role": "note"})

links = []
for div in div_link_container:
    link_to_resource = div.find("a")
    
    if link_to_resource is None:
        continue
        
    link_to_resource = link_to_resource["href"]
    full_link = "https://en.wikipedia.org" + link_to_resource
    links.append(full_link)

In [200]:
def convert_page_series_to_text(html_tag, page_title):
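    full_page_text = f"<h1>{page_title}</h1>"
    full_page_text += "<p>"

    # Keep only the introductory paragraphs, i.e. those before the first <h2>.
    for child_tag in html_tag.children:
        if child_tag.name == "h2":
            break

        if child_tag.name == "p":
            full_page_text += child_tag.text

    # Close the paragraph after the loop so it is closed even on pages without an <h2>.
    full_page_text += "</p>"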
    
    episode_titles = html_tag.find_all("td", {"class": "summary"})
    episode_summaries = html_tag.find_all("td", {"class": "description"})
    
    for title, summary in zip(episode_titles, episode_summaries):
        title_tag = f"<title>{title.text}</title>"
        summary_tag = f"<p>{summary.text}</p>"
        full_page_text += title_tag + summary_tag
        
    return full_page_text
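
One caveat of pairing titles and summaries with `zip` is that it stops at the shorter list, so an episode without a description cell is dropped without warning. The sketch below uses placeholder titles and summaries (not real episode data) to show the effect and one possible alternative, `itertools.zip_longest`.

```python
from itertools import zip_longest

# Placeholder data: three titles but only two summaries.
episode_titles = ["Episode 1", "Episode 2", "Episode 3"]
episode_summaries = ["Summary 1", "Summary 2"]

# zip() stops at the shorter list, so the third title is silently lost.
print(len(list(zip(episode_titles, episode_summaries))))  # 2

# zip_longest() keeps every title and fills the missing summary with "".
pairs = list(zip_longest(episode_titles, episode_summaries, fillvalue=""))
page_text = "".join(f"<title>{title}</title><p>{summary}</p>" for title, summary in pairs)
print(page_text)
# <title>Episode 1</title><p>Summary 1</p><title>Episode 2</title><p>Summary 2</p><title>Episode 3</title><p></p>
```

Note that a missing cell in the middle of the table would still shift the pairing; matching each title to the summary in its own table row would be more robust, at the cost of more parsing code.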

In [201]:
pages_texts = []

for link in links:
    page = requests.get(link)
    soup = BeautifulSoup(page.content, "html.parser")
    
    div_content = soup.find("div", {"class": "mw-content-ltr mw-parser-output"})
    page_title = soup.find("h1", {"id": "firstHeading"}).text
    
    page_text = convert_page_series_to_text(div_content, page_title)
    pages_texts.append(page_text)

In [202]:
pages_texts[0]

'<h1>Star Wars: Droids</h1><p>American-Canadian animated television series\n\n\n.mw-parser-output .infobox-subbox{padding:0;border:none;margin:-3px;width:auto;min-width:100%;font-size:100%;clear:none;float:none;background-color:transparent}.mw-parser-output .infobox-3cols-child{margin:auto}.mw-parser-output .infobox .navbar{font-size:100%}body.skin-minerva .mw-parser-output .infobox-header,body.skin-minerva .mw-parser-output .infobox-subheader,body.skin-minerva .mw-parser-output .infobox-above,body.skin-minerva .mw-parser-output .infobox-title,body.skin-minerva .mw-parser-output .infobox-image,body.skin-minerva .mw-parser-output .infobox-full-data,body.skin-minerva .mw-parser-output .infobox-below{text-align:center}html.skin-theme-clientpref-night .mw-parser-output .infobox-full-data div{background:#1f1f23!important;color:#f8f9fa}@media(prefers-color-scheme:dark){html.skin-theme-clientpref-os .mw-parser-output .infobox-full-data div{background:#1f1f23!important;color:#f8f9fa}}.mw-parse

## Chunking

Chunking is the process of dividing the text into smaller parts so that each part fits inside the context window of our LLM.
There are different chunking strategies, such as naive splitting, recursive character splitting and semantic splitting.

We are going to use the last two. For the semantic one, we use a splitter that exploits what we know about the HTML structure of the text. This is useful because we don't want to treat titles and body text in the same way: a title can provide relevant context for the text that follows it.
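
For reference, the naive strategy mentioned above can be sketched in a few lines. This is only an illustration of what `chunk_size` and `chunk_overlap` mean; it is not how the recursive or HTML-aware splitters work, and `naive_chunks` is a hypothetical helper.

```python
def naive_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


text = "abcdefghijklmnopqrstuvwxyz"
chunks = naive_chunks(text, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one. The naive approach cuts words and sentences at arbitrary positions, which is the main reason to prefer splitters that respect separators or document structure.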

In [158]:
from langchain_text_splitters import HTMLSectionSplitter, RecursiveCharacterTextSplitter

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3"),("h4", "Header 4")]
sample_document = web_pages_texts[0]

html_splitter = HTMLSectionSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(sample_document)

chunk_size = 800
chunk_overlap = 100
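# Note: the lookbehind pattern is matched literally unless is_separator_regex=True is passed to the splitter.
separators = ["\n\n", "\n", r"(?<=\. )", " ", ""]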
rec_char_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, separators=separators)
recursive_header_split = rec_char_splitter.split_documents(html_header_splits)

recursive_header_split

[Document(page_content="Rogue One (or Rogue One: A Star Wars Story) is a 2016 American epic space opera film directed by Gareth Edwards. The screenplay by Chris Weitz and Tony Gilroy is from a story by John Knoll and Gary Whitta. It was produced by Lucasfilm and distributed by Walt Disney Studios Motion Pictures. It is the first installment of the Star Wars anthology series, and an immediate prequel to Star Wars (1977). The main cast consists of Felicity Jones, Diego Luna, Ben Mendelsohn, Donnie Yen, Mads Mikkelsen, Alan Tudyk, Riz Ahmed, Jiang Wen, and Forest Whitaker. Set a week before the events of Star Wars, the plot follows a group of rebels who band together to steal plans of the Death Star, the ultimate weapon of the Galactic Empire. It details the Rebel Alliance's first effective victory against the", metadata={'Header 1': '#TITLE#'}),
 Document(page_content="weapon of the Galactic Empire. It details the Rebel Alliance's first effective victory against the Empire, first referen

In [159]:
len(recursive_header_split)

16

### Vector Store

Once we have generated the chunks together with their metadata, we need to store them somewhere.
To do so, we convert our documents into embeddings and store them in an appropriate vector index or vector database.
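
Conceptually, a vector store maps each chunk to a vector and answers a query by returning the chunks whose vectors are closest to the query vector. The toy sketch below uses made-up 3-dimensional vectors and cosine similarity; both are assumptions for illustration, since a real embedding model produces much larger vectors and the index may use a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical chunk -> embedding pairs (values invented for the example).
index = {
    "chunk about the plot": [0.9, 0.1, 0.0],
    "chunk about the director": [0.1, 0.9, 0.1],
    "chunk about the soundtrack": [0.0, 0.2, 0.9],
}


def search(query_vector, k):
    """Return the k chunks most similar to the query vector."""
    ranked = sorted(
        index,
        key=lambda chunk: cosine_similarity(index[chunk], query_vector),
        reverse=True,
    )
    return ranked[:k]


print(search([0.8, 0.2, 0.1], k=2))
# ['chunk about the plot', 'chunk about the director']
```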

In [183]:
from langchain.embeddings import CacheBackedEmbeddings, HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore

store = LocalFileStore("./cache/")

embed_model_id = 'intfloat/e5-small-v2'
model_kwargs = {"device": "cpu", "trust_remote_code": True}

embeddings_model = HuggingFaceEmbeddings(model_name=embed_model_id, model_kwargs=model_kwargs)

embedder = CacheBackedEmbeddings.from_bytes_store(embeddings_model, store, namespace=embed_model_id)

vector_store = FAISS.from_documents(recursive_header_split, embedder)

In [182]:
query = "What did Jyn retrieve?"

embedding_vector = embeddings_model.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k=2)

TypeError: SentenceTransformer.encode() got an unexpected keyword argument 'dimensionality'

Jyn proposes a plan to steal the Death Star schematics to the Rebel fleet but fails to gain approval from the Alliance Council, who feel victory against the Empire is now impossible. Frustrated at their inaction, Jyn's group leads a small squad of Rebel volunteers to raid the databank; the group arrives on Scarif in the stolen Imperial shuttle (which Rook dubs "Rogue One") after having been granted access through the planet's shield by feigning as legitimate traffic. Jyn, Cassian, and K-2SO infiltrate the base while the other Rebels attack the Imperial garrison as a diversion.
