# <a id='toc1_'></a>[NLP Final Project - Wikipedia Search Engine](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [NLP Final Project - Wikipedia Search Engine](#toc1_)    
  - [1 - Scraping Wikipedia](#toc1_1_)    
    - [Scraping class](#toc1_1_1_)    
    - [Saving the links](#toc1_1_2_)    
    - [Adding the ids of the paragraphs](#toc1_1_3_)    
  - [2 - Cleaning the data](#toc1_2_)    
  - [3 - Creating the n-grams](#toc1_3_)    


All the main parts were developed individually and then merged together. The scraping part was not run before saving this notebook, as it takes a long time to run.

## <a id='toc1_1_'></a>[1 - Scraping Wikipedia](#toc0_)

### <a id='toc1_1_1_'></a>[Scraping class](#toc0_)

In [1]:
import requests
from bs4 import BeautifulSoup
import json
import time

class WikiPage():
    """
    This class represents a Wikipedia page and provides methods to fetch and parse the page content.
    
    Attributes:
        url: The URL of the Wikipedia page.
        soup: A BeautifulSoup object representing the parsed HTML content of the page.
        links: A list of Wikipedia links from the page content.
        title: The title of the Wikipedia page.
        content: The content of the Wikipedia page.
        summary: The summary of the Wikipedia page.
    """
    def __init__(self, url) -> None:
        """
        Initializes a new instance of the WikiPage class with the specified URL.
        """
        self.url = url
        self.soup = self.get_soup()
        self.links = self.get_links()
        self.title = self.get_h1()
        self.content = self.get_content()[1:]
        self.summary = self.get_summary()

    def get_wiki_page(self) -> str:
        """
        Fetches the HTML content of the Wikipedia page.
        Returns:
            The HTML content of the page as a string.
        """
        try:
            # a timeout prevents the crawl from hanging on an unresponsive server
            response = requests.get(self.url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException as e:
            print(f"Error fetching {self.url}: {e}")
            return None
        
    def get_soup(self) -> BeautifulSoup:
        """
        Parses the HTML content of the Wikipedia page using BeautifulSoup.
        Returns:
            A BeautifulSoup object representing the parsed HTML.
        """
        page = self.get_wiki_page()
        soup = BeautifulSoup(page, "html.parser")
        return soup
    
    def get_links(self) -> list:
        """
        Retrieves a list of Wikipedia links from the page content.
        Returns:
            A list of Wikipedia links as strings.
        """
        allLinks = self.soup.find(id="bodyContent").find_all("a")
        links = [link["href"] for link in allLinks
                # check if it has an href attribute
                if link.has_attr("href") 
                # check if it is a wikipedia link
                and link["href"].startswith("/wiki/")
                # check if it is not a file 
                and not link["href"].endswith((".jpg", ".png", ".svg"))
                # check if it is not a special page
                and "Special:" not in link["href"]
                # check if it is not a help page
                and "Help:" not in link["href"] 
                # check if it is not a project page of the "Wikipedia:" namespace
                and "Wikipedia:" not in link["href"]]
        return links
    
    # Parser
    def get_h1(self) -> str:
        """
        Retrieves the title of the Wikipedia page.
        Returns:
            The title of the page as a string.
        """
        return self.soup.find(id="firstHeading").text
    
    def get_summary(self) -> list:
        """
        Retrieves the summary of the Wikipedia page.
        Returns:
            A list of paragraphs representing the summary.
        """
        try:
            if self.soup.find("table"):
                return [p.text for p in self.soup.find("table").find_next_siblings("p")]
            else:
                # if there is no table we must get the summary in another way
                # get the first 10 paragraphs
                summary = [p.text for p in self.soup.find_all("p")[:10]]
                # get the paragraphs of the first content section kept in self.content
                first_p = self.content[0]["paragraphs"]
                # keep only the paragraphs that do not belong to that section
                return [p for p in summary if p not in first_p]
        except Exception as e:
            print(f"Error getting summary for {self.url}: {e}")
            return None
    
    def get_content(self) -> list:
        """
        Retrieves the content of the Wikipedia page.
        Returns:
            A list of dictionaries representing the content sections.
        """
        content = []
        tags = [f"h{n}" for n in range(2, 4)]
        # find all h2 or h3 tags
        for tag in self.soup.find_all(tags):
            paragraph = []
            # find all siblings of the tag until we find a new heading tag
            for sibling in tag.next_siblings:
                if sibling.name in tags:
                    break
                if sibling.name == "p":
                    paragraph.append(sibling.text)
            content.append({
                "type": tag.name,
                "title": tag.text,
                "paragraphs": paragraph
            })
        return content

    def wiki_page_to_dict(self):
        """
        Converts the WikiPage object to a dictionary.
        Returns:
            A dictionary representation of the WikiPage object.
        """
        return {
            "url": self.url,
            "title": self.title,
            "summary": self.summary,
            "content": self.content,
        }
    
    def to_json(self, filename="data/raw_wiki.json"):
        """
        Serializes the WikiPage object to JSON and appends it to a file.
        The file must already exist and contain a JSON list.
        Args:
            filename: The path to the file to append the JSON data to.
        """
        with open(filename, "r") as f:
            data = json.load(f)
            data.append(self.wiki_page_to_dict())
        with open(filename, "w") as f:
            json.dump(data, f, indent=4)

        

    def links_to_json(self, filename="data/links.json"):
        """
        Serializes the links of the WikiPage object to JSON and appends them to a file.
        The file must already exist and contain a JSON list.
        Args:
            filename: The path to the file to append the JSON data to.
        """
        with open(filename, "r") as f:
            data = json.load(f)
            data.append({
                "url": self.url,
                "links": self.links
            })
        with open(filename, "w") as f:
            json.dump(data, f, indent=4)  
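
The filtering rules used in `get_links` can be tried offline on a handful of `href` values. The sketch below mirrors those rules in a standalone helper; `is_article_link` and the sample links are hypothetical and only serve as an illustration, they are not part of the class.

```python
def is_article_link(href: str) -> bool:
    """Return True if href passes the same rules as WikiPage.get_links (illustrative helper)."""
    return (
        href.startswith("/wiki/")
        and not href.endswith((".jpg", ".png", ".svg"))
        and "Special:" not in href
        and "Help:" not in href
        and "Wikipedia:" not in href
    )

# hypothetical href values, as they could appear in a page body
sample_hrefs = [
    "/wiki/Guido_van_Rossum",
    "/wiki/File:Python_logo.svg",
    "/wiki/Help:Contents",
    "/wiki/Special:Random",
    "https://www.python.org/",
    "#cite_note-1",
]
kept = [href for href in sample_hrefs if is_article_link(href)]
print(kept)
```

Note that links to other namespaces, such as `Category:` pages, still pass these rules, so they may end up in the crawl queue.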

In [None]:
start = time.time()
print("Starting scraping...")

# create the root page
root_url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
root_page = WikiPage(root_url)
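# NOTE: to_json() and links_to_json() expect both files to already contain a
# JSON list; uncomment the lines below to initialise them on a first run
# write the root page to the json file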
# with open("data/raw_wiki.json", "w") as f:
#     json.dump([root_page.wiki_page_to_dict()], f, indent=4)
# # write the root page links to the json file
# with open("data/links.json", "w") as f:
#     json.dump([{
#         "url": root_url,
#         "links": root_page.links
#     }], f, indent=4)
done = [root_url]
fifo_links = root_page.links

# continue until 5000 pages are scraped or the queue is empty
while fifo_links and len(done) < 5000:
    link = fifo_links.pop(0)
    try:
        link_url = "https://en.wikipedia.org" + link
        # skip the pages that have already been scraped
        if link_url in done:
            continue
        # create the page
        page = WikiPage(link_url)
        # write the page to the json file
        page.to_json()
        # write the page links to the json file
        page.links_to_json()
        # add the page to the done list
        done.append(page.url)
        # only enqueue new links while queue + done stay under 5000 entries
        if len(fifo_links) + len(done) < 5000:
            fifo_links = fifo_links + page.links
    except Exception as e:
        # report the failure instead of silently ignoring it, then move on
        print(f"Error processing {link}: {e}")

print(f"Scraped {len(done)} pages in {time.time() - start:.1f} s")
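
The loop above is a breadth-first traversal of the link graph: pages are taken from the front of the queue and their links are added at the back. As a minimal sketch of the same idea that runs without network access, the version below uses a `deque` and a `set`, which avoid the linear cost of `list.pop(0)` and of `in` on a list. The `crawl` function, the `get_links` callback and `toy_graph` are assumptions made for this example only.

```python
from collections import deque

def crawl(start, get_links, max_pages):
    """Breadth-first traversal that visits at most max_pages pages."""
    done = []            # pages in the order they were visited
    seen = {start}       # set membership tests are O(1), unlike a list
    queue = deque([start])
    while queue and len(done) < max_pages:
        page = queue.popleft()   # O(1), unlike list.pop(0)
        done.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return done

# toy link graph standing in for Wikipedia (hypothetical data)
toy_graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["D"],
    "D": ["A"],
}
visited = crawl("A", lambda page: toy_graph.get(page, []), max_pages=3)
print(visited)
```

In the real crawl, `get_links` would correspond to building a `WikiPage` and reading its `links` attribute.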

### <a id='toc1_1_2_'></a>[Saving the links](#toc0_)

Since we also scraped the links each article points to, we can rank the most cited articles in our dataset. We build a dictionary that counts how many times each article is linked, then sort it by that count. A list of the most cited articles could be useful for our PageRank algorithm, as these articles are probably the most important ones.

In [2]:
# load the links
with open("data/links.json", "r") as f:
    data = json.load(f)

# count how many times each link appears across all pages
links = {}
for item in data:
    for link in item["links"]:
        links[link] = links.get(link, 0) + 1

# sort the dict by the number of times they appear (most cited first)
sorted_links = dict(sorted(links.items(), key=lambda item: item[1], reverse=True))

# save the sorted links
with open("data/sorted_links.json", "w") as f:
    json.dump(sorted_links, f, indent=4)

In the future, these counts could help us compute a better PageRank score for the most cited pages in the dataset. The idea is to give more importance to well-referenced articles. This is similar to how scientific papers are often ranked: the more a paper is cited, the more important it is considered.
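
To make the idea concrete, here is a minimal sketch of a textbook power-iteration PageRank on a toy link graph. It is not the implementation used in this project, which is not shown in this notebook; the `pagerank` function, the damping factor of 0.85, the number of iterations and `toy_graph` are all assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [linked pages]}."""
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, links in graph.items():
            # only follow links that point inside the dataset
            targets = [link for link in links if link in graph]
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # a page without outgoing links spreads its rank evenly
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank

# toy link graph (hypothetical data): page -> links found on that page
toy_graph = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/C"],
    "/wiki/C": ["/wiki/A"],
    "/wiki/D": ["/wiki/C"],
}
scores = pagerank(toy_graph)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Unlike the raw counts saved above, this score also depends on how important the citing pages are, so a link from a highly ranked page weighs more than a link from an obscure one.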

### <a id='toc1_1_3_'></a>[Adding the ids of the paragraphs](#toc0_)

During the scraping part, we forgot to assign sequential ids to the paragraphs. We now add them to the dataset.

In [None]:
import json
with open('../data/raw_wiki.json', 'r') as f:
    data = json.load(f)

In [None]:
# add a unique, sequential id to every paragraph of the dataset
index = 0
for page in data:
    if page['summary']:
        # summary_ids[i] is the id of the i-th summary paragraph
        page['summary_ids'] = []
        for _ in page['summary']:
            page['summary_ids'].append(index)
            index += 1
    for section in page['content']:
        # ids[i] is the id of the i-th paragraph of the section
        section['ids'] = []
        for _ in section['paragraphs']:
            section['ids'].append(index)
            index += 1

In [None]:
# save the data with the ids
with open('../data/raw_wiki_with_ids.json', 'w') as f:
    json.dump(data, f, indent=4)
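
Once every paragraph has an id, a search result expressed as an id can be mapped back to its page and text. The sketch below shows one possible reverse lookup; `build_id_index` and `toy_data` are hypothetical names introduced for this example, and the toy data only imitates the shape of `raw_wiki_with_ids.json`.

```python
def build_id_index(data):
    """Map every paragraph id to its page URL and raw text (illustrative helper)."""
    id_index = {}
    for page in data:
        # pages without a summary have no 'summary_ids' field
        for par_id, text in zip(page.get("summary_ids", []), page.get("summary") or []):
            id_index[par_id] = {"url": page["url"], "text": text}
        for section in page["content"]:
            for par_id, text in zip(section["ids"], section["paragraphs"]):
                id_index[par_id] = {"url": page["url"], "text": text}
    return id_index

# toy data with the same shape as raw_wiki_with_ids.json (hypothetical content)
toy_data = [{
    "url": "https://en.wikipedia.org/wiki/Example",
    "title": "Example",
    "summary": ["First summary paragraph."],
    "summary_ids": [0],
    "content": [
        {"type": "h2", "title": "History", "paragraphs": ["A.", "B."], "ids": [1, 2]},
    ],
}]
id_index = build_id_index(toy_data)
print(id_index[2]["text"])
```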

## <a id='toc1_2_'></a>[2 - Cleaning the data](#toc0_)

In [None]:
import re
from nltk.corpus import stopwords
import json
from nltk.stem import PorterStemmer

# requires the NLTK stop word corpus: run nltk.download('stopwords') once
STOPWORDS = set(stopwords.words('english'))
# create the stemmer once instead of once per word
STEMMER = PorterStemmer()

def clean_text(text: str) -> str:
    # convert to lowercase
    lower_cased = text.lower()
    # remove references in square brackets
    no_references = re.sub(r'\[.*?\]', '', lower_cased)
    # replace every run of non-word characters (anything but letters, digits and underscores) with a space
    alphanumeric = re.sub(r'\W+', ' ', no_references)
    # remove the remaining non-ASCII characters
    alphanumeric = re.sub(r'[^\x00-\x7F]+', '', alphanumeric)
    # remove stopwords
    no_stopwords = [word for word in alphanumeric.split() if word not in STOPWORDS]
    # stem each remaining word
    stemmed = ' '.join(STEMMER.stem(word) for word in no_stopwords)
    return stemmed

def clean_page(page: dict) -> dict:
    # apply the clean_text function to each paragraph in the page
    cleaned_page = {}
    # keep the URL as-is and clean the title
    cleaned_page['url'], cleaned_page['title'] = page['url'], clean_text(page['title'])
    # clean the summary paragraphs
    if page['summary']:
        cleaned_page['summary_ids'] = page['summary_ids']
        cleaned_page['summary'] = [clean_text(paragraph) for paragraph in page['summary']]
    # clean the content sections
    cleaned_page['content'] = []
    for section in page['content']:
        if not section['paragraphs']:
            continue
        cleaned_section = {}
        cleaned_section['type'] = section['type']
        # clean the section title
        cleaned_section['title'] = clean_text(section['title'])
        # clean each paragraph once and drop the ones that end up empty,
        # dropping their ids too so that paragraphs[i] still matches ids[i]
        cleaned = [(clean_text(paragraph), par_id)
                   for paragraph, par_id in zip(section['paragraphs'], section['ids'])]
        cleaned_section['paragraphs'] = [text for text, _ in cleaned if text]
        cleaned_section['ids'] = [par_id for text, par_id in cleaned if text]
        cleaned_page['content'].append(cleaned_section)
    return cleaned_page

In [None]:
# load the data_file
with open('../data/raw_wiki_with_ids.json', 'r') as f:
    data = json.load(f)
# clean the data
cleaned_data = [clean_page(page) for page in data]

# save the cleaned data in a new file
with open('../data/cleaned_wiki.json', 'w') as f:
    json.dump(cleaned_data, f, indent=4)
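
To see what the regular expressions of `clean_text` do on their own, the sketch below applies only those steps, without NLTK, stop word removal or stemming. `strip_markup` and the sample sentence are hypothetical and exist only for this illustration.

```python
import re

def strip_markup(text: str) -> str:
    """Apply only the regex steps of clean_text (no stop word removal, no stemming)."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)        # drop bracketed references such as [12]
    text = re.sub(r'\W+', ' ', text)           # runs of non-word characters become one space
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # drop the remaining non-ASCII characters
    return ' '.join(text.split())              # normalise whitespace, as the later split/join does

sample = "Python[12] was created by Guido van Rossum (1991). Déjà vu!"
print(strip_markup(sample))
```

One side effect is visible on the last words of the sample: accented letters are deleted rather than transliterated, so words containing them are shortened before stemming.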

## <a id='toc1_3_'></a>[3 - Creating the n-grams](#toc0_)

In [1]:
import json
with open('../data/cleaned_wiki.json', 'r') as f:
    data = json.load(f)

In [2]:
def get_n_gram(n, text):
    # split the text into words
    words = text.split()
    n_gram = []
    # create the n-grams
    for i in range(len(words) - n + 1):
        # join the words to create the n-gram
        n_gram.append(' '.join(words[i:i+n]))
    return n_gram
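
As a quick illustration of the sliding window, the sketch below builds the same n-grams with `zip` over shifted copies of the word list. The name `n_grams` and the sample sentence are assumptions for this example; for `n >= 1` it should return the same lists as `get_n_gram`.

```python
def n_grams(n, text):
    """Same result as get_n_gram for n >= 1, written with zip over shifted word lists."""
    words = text.split()
    return [' '.join(gram) for gram in zip(*(words[i:] for i in range(n)))]

# already cleaned and stemmed text (hypothetical sample)
sentence = "python high level program languag"
unigrams = n_grams(1, sentence)
bigrams = n_grams(2, sentence)
too_long = n_grams(6, sentence)  # fewer than 6 words, so no 6-gram exists
print(unigrams)
print(bigrams)
print(too_long)
```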

In [12]:
def save_n_grams(n, data):
    # map each n-gram to the list of ids of the paragraphs that contain it
    n_gram_dict = {}
    for page in data:
        # pages without a summary have no 'summary' key
        if 'summary' in page:
            for i, paragraph in enumerate(page['summary']):
                for n_gram in get_n_gram(n, paragraph):
                    n_gram_dict.setdefault(n_gram, []).append(page['summary_ids'][i])

        for section in page['content']:
            for i, paragraph in enumerate(section['paragraphs']):
                for n_gram in get_n_gram(n, paragraph):
                    n_gram_dict.setdefault(n_gram, []).append(section['ids'][i])
    # sort the n-grams from the most to the least frequent
    n_gram_dict = dict(sorted(n_gram_dict.items(), key=lambda item: len(item[1]), reverse=True))
    with open(f'../data/{n}_grams.json', 'w') as f:
        json.dump(n_gram_dict, f)

# save the 1-grams to 4-grams
for i in range(1, 5):
    save_n_grams(i, data)
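
Each saved file is an inverted index that maps an n-gram to the list of ids of the paragraphs containing it, with one entry per occurrence. How the search engine scores paragraphs is not shown in this notebook; the sketch below is only one possible way to use such an index, ranking paragraphs by the number of query n-gram occurrences. The `search` function, `top_k` and `toy_index` are assumptions, and the query is expected to be cleaned with the same steps as the corpus before its n-grams are extracted.

```python
from collections import Counter

def search(query_n_grams, n_gram_index, top_k=3):
    """Rank paragraph ids by how many query n-gram occurrences they contain."""
    hits = Counter()
    for n_gram in query_n_grams:
        # every id in the posting list is one occurrence of the n-gram
        hits.update(n_gram_index.get(n_gram, []))
    return [par_id for par_id, _ in hits.most_common(top_k)]

# toy index with the same shape as the saved files: n-gram -> list of paragraph ids
toy_index = {
    "python": [0, 0, 3],
    "languag": [0, 5],
    "snake": [7],
}
result = search(["python", "languag"], toy_index)
print(result)
```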

