
# Information Retrieval Coursework (STW7071CEM)

Task 1: Search Engine
Create a vertical search engine comparable to Google Scholar, but specialized in retrieving only papers and books published by members of Coventry University's Research Centre for Health and Life Sciences (RCHL).


## Install and Import the Required Packages
To crawl websites with Python and Beautiful Soup in Google Colab, first install the necessary packages and then import them into the notebook.





In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Required Python Packages for Web Crawling

Package Uses:
  Scrapy: A powerful web crawling framework for extracting data from websites.
  Requests: A simple HTTP library for making HTTP requests in Python.
  BeautifulSoup4: A library for parsing HTML and XML documents, commonly used for web scraping tasks.
  NLTK: Natural Language Toolkit, a library for natural language processing tasks such as tokenization, stemming, tagging, parsing, and more.
  Gensim: A Python library for topic modeling, document indexing, and similarity retrieval with large corpora.
  XGBoost: An optimized distributed gradient boosting library designed for efficiency, flexibility, and portability.
  Pandastable: A GUI (Graphical User Interface) widget for displaying and analyzing dataframes in Python using Pandas.

In [None]:
# To install the packages below, remove the leading '#'

# !pip install scrapy
# !pip install requests
# !pip install BeautifulSoup4
# !pip install nltk
# !pip install gensim
# !pip install xgboost
# !pip install pandastable

# The installed packages are imported in the next cell



In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import datetime
import string
import json
import nltk
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# 1. Information Retrieval Engine or Crawler Component

In [3]:
# The initial URL: the entry point from which the crawler starts exploring the site.
URL = "https://pureportal.coventry.ac.uk/en/organisations/coventry-university/persons/"

# Base URL for the profile pages of individuals associated with Coventry University.
profile_url = "https://pureportal.coventry.ac.uk/en/persons/"

In [4]:
def retrieve_max_page_number():

    first = requests.get(URL)
    soup = BeautifulSoup(first.text, 'html.parser')
    final_page = soup.select('#main-content > div > section > nav > ul > li:nth-child(12) > a')[0]['href']
    fp = final_page.split('=')[-1]
    return int(fp)

max = retrieve_max_page_number()
max

38
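The `li:nth-child(12)` selector above only works while the pagination bar has exactly that layout. As a minimal sketch of a less position-dependent approach, the last page number could be taken as the largest `page` value among the pagination links. This assumes the links carry a `page=N` query parameter (as the `?page=` URLs built later suggest); the helper name `last_page_from_hrefs` and the sample hrefs are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def last_page_from_hrefs(hrefs):
    """Return the largest `page` query value found in a list of hrefs, or 0."""
    largest = 0
    for href in hrefs:
        query = parse_qs(urlparse(href).query)
        for value in query.get("page", []):
            # Ignore anything that is not a plain non-negative integer
            if value.isdigit() and int(value) > largest:
                largest = int(value)
    return largest

# Hypothetical hrefs standing in for the pagination links on the listing page
sample_hrefs = [
    "/en/organisations/coventry-university/persons/?page=1",
    "/en/organisations/coventry-university/persons/?page=2",
    "/en/organisations/coventry-university/persons/?page=38",
]
print(last_page_from_hrefs(sample_hrefs))
```

The loop tracks the largest value by hand instead of calling the built-in `max`, because this notebook rebinds the name `max` to the page count.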

In [5]:
def verify_department(researcher):
    # Return the researcher's name if they belong to the target centre, else None
    l1 = researcher.find('div', class_='rendering_person_short')
    if l1 is None:
        return None

    for span in l1.find_all('span'):
        # Check department
        if span.text == 'Centre for Health and Life Sciences':
            return researcher.find('h3', class_='title').find('span').text
    return None

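def generate_csv_file():
    # Create an empty CSV with the expected columns (overwrites any existing file)
    database = pd.DataFrame(columns=['Title', 'Authors', 'Publication Year', 'Publication Link'])
    database.to_csv('Crawling_database.csv')

def append_to_csv(database):
    # Despite its name, this only loads and returns the existing CSV
    current_data = pd.read_csv(database, index_col="Unnamed: 0")
    return current_data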

def enter_each_researchers_publication(researcher, url, df):
    # Build the researcher's publications URL from their name (spaces become hyphens)
    new_url = url + str(researcher).replace(' ','-').lower() + '/publications/'
    page = requests.get(new_url)
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find(id="main-content")
    if results is None:
        return
    papers = results.find_all("li", class_="list-result-item")

    for paper in papers:
        title = paper.find('h3', class_='title')
        if title is not None:
            title_span = title.find('span')
            title_text = title_span.text if title_span is not None else "N/A"
        else:
            title_text = "N/A"

        # Skip entries that have no linked author
        author = paper.find('a', class_='link person')
        if author is None:
            continue
        author_span = author.find('span')
        author_text = author_span.text if author_span is not None else "N/A"

        date = paper.find('span', class_="date")
        date_text = date.text if date is not None else "N/A"

        # The publication link lives inside the title heading
        link = title.find('a', href=True) if title is not None else None
        link_href = link['href'] if link is not None else "N/A"

        # Retrieve data from the existing Crawling_database.csv file
        opening = pd.read_csv('Crawling_database.csv', index_col="Unnamed: 0")

        # Generate a new DataFrame containing the data to be added,
        # using the None-safe text values extracted above
        new_row = pd.DataFrame({'Title': [title_text],
                                'Authors': [author_text],
                                'Publication Year': [date_text],
                                'Publication Link': [link_href]})

        # Concatenate the existing DataFrame and the new row DataFrame
        opening = pd.concat([opening, new_row], ignore_index=True)

        # Save the updated DataFrame to Crawling_database.csv
        opening.to_csv('Crawling_database.csv')
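The function above rereads and rewrites the whole CSV once per publication, so the work grows with every record already stored. A minimal sketch of an alternative, not the notebook's method, is to collect the rows in a list and write them in a single pass. It uses the standard-library `csv` module so that it runs on its own; the helper name `write_rows_once` and the sample rows are hypothetical. The same idea applies with pandas by building one DataFrame from the list of dictionaries and saving it once.

```python
import csv
import io

FIELDS = ['Title', 'Authors', 'Publication Year', 'Publication Link']

def write_rows_once(rows, fileobj):
    """Write all collected publication rows to `fileobj` in a single pass."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical rows standing in for scraped publications
rows = [
    {'Title': 'Paper A', 'Authors': 'Author One',
     'Publication Year': '2021', 'Publication Link': 'https://example.org/a'},
    {'Title': 'Paper B', 'Authors': 'Author Two',
     'Publication Year': '2022', 'Publication Link': 'https://example.org/b'},
]

# An in-memory buffer is used here; a real file opened with newline='' works the same way
buffer = io.StringIO()
write_rows_once(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```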



In [6]:
## Scrape function
def scrape(mx):
    # NOTE: `mx` (the last listing page) is currently unused; the crawl is
    # hard-capped at the first 18 listing pages (page indices 0 to 17).
    df = append_to_csv('Crawling_database.csv')
    for i in range(18):
        # The first listing page has no ?page= parameter
        url = URL + '?page=' + str(i) if i > 0 else URL

        page = requests.get(url)
        soup = BeautifulSoup(page.content, "html.parser")
        results = soup.find(id="main-content")
        researchers = results.find_all("li", class_="grid-result-item")

        for researcher in researchers:
            # Check if researcher has any papers
            check = researcher.find('div', class_='stacked-trend-widget')
            if check:
                name = verify_department(researcher)
                if name is not None:
                    enter_each_researchers_publication(name, profile_url, df)

In [None]:
generate_csv_file()
append_to_csv(database='Crawling_database.csv') #Generate_csv

%time scrape(max)

In [9]:
test_db = pd.read_csv('Crawling_database.csv').rename(columns={'Unnamed: 0':'SN'})
test_db
print(f'{test_db.shape[0]} records were scraped')

237 records were scraped


# Indexing Component: Efficient Data Retrieval System

In [None]:
crawled_db = pd.read_csv('Crawling_database.csv').rename(columns={'Unnamed: 0':'SN'}).reset_index(drop=True)
crawled_db.head()
# crawled_db = pd.read_csv('Crawling_database.csv', index_col=0)

In [None]:
crawled_db.tail()

In [None]:
individual_row = crawled_db.loc[138,:].copy()
individual_row

In [None]:
crawled_db.Title.unique()

In [None]:
crawled_db.Authors.value_counts()



In [None]:
crawled_db.head(7)
# To inspect duplicated titles, uncomment:
#ids = crawled_db["Title"]
#crawled_db[ids.isin(ids[ids.duplicated()])]


# Text Preprocessing / Data Analysis and Preprocessing

In [15]:
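# Stop words to remove, and the lemmatizer used to reduce words to their base form
sw = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def tp1(txt):
    # Full text-preprocessing pipeline for a title or a query
    txt = txt.lower()   # Make lowercase
    txt = txt.translate(str.maketrans('','',string.punctuation))   # Remove punctuation marks
    txt = lematize(txt)   # Drop stop words and lemmatize
    return txt


def fwpt(word):
    # Map the first letter of the word's POS tag to a WordNet POS constant (default: noun)
    tag = pos_tag([word])[0][1][0].upper()
    hash_tag = {"V": wordnet.VERB, "R": wordnet.ADV,"N": wordnet.NOUN,"J": wordnet.ADJ}
    return hash_tag.get(tag, wordnet.NOUN)

def lematize(text):
    # Tokenize, drop stop words, and lemmatize each remaining token;
    # the result is a space-separated string with a trailing space
    tkns = nltk.word_tokenize(text)
    ax = ""
    for each in tkns:
        if each not in sw:
            ax += lemmatizer.lemmatize(each, fwpt(each)) + " "
    return ax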


In [None]:
# Show the title and authors of the sample row
individual_row[['Title', 'Authors']]

In [None]:
# Illustration of the full preprocessing: lowercase, strip punctuation, remove stop words, lemmatize
tp1(individual_row['Title'])

In [None]:
# Example of lemmatization in action (tp1 already lemmatizes, so the outer call is a second pass).

lematize(tp1(individual_row['Title']))
#lematize(individual_row['Title'])
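To make the normalization steps easier to follow in isolation, here is a minimal standard-library sketch of the lowercase, punctuation-removal, and stop-word steps. It deliberately leaves out lemmatization, which the notebook performs with NLTK. The function name `normalize`, the tiny stop-word set, and the sample title are assumptions made for illustration only.

```python
import string

# Tiny illustrative stop-word set (an assumption; the notebook uses NLTK's English list)
TOY_STOP_WORDS = {"a", "an", "and", "of", "the", "in", "for"}

def normalize(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, strip punctuation, and drop stop words; returns a token list."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return [tok for tok in text.split() if tok not in stop_words]

# Made-up title used only to show the effect of each step
tokens = normalize("Detergent-free membrane protein purification: a review of the methods")
print(tokens)
# ['detergentfree', 'membrane', 'protein', 'purification', 'review', 'methods']
```

Note that removing punctuation joins hyphenated words ("Detergent-free" becomes "detergentfree"), so queries must go through the same normalization to match.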

#### Unprocessed

In [None]:
crawled_db[['Title', 'Authors']].iloc[131]

#### Processed

In [None]:
crawled_db[['Title','Authors']].iloc[120]

# Preparing the DataFrame for Analysis

In [None]:
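def preprocess_df(df):
    # Work on a copy so the caller's DataFrame (e.g. crawled_db) keeps its original titles
    df = df.copy()
    # Titles read back from the CSV as NaN are treated as empty strings
    df['Title'] = df['Title'].fillna('').apply(tp1)
    df['Author'] = df['Authors'].str.lower()
    df = df.drop(columns=['Authors', 'Publication Year'])
    return df

# Keep the returned DataFrame; the function no longer modifies its argument
processed_db = preprocess_df(crawled_db)
processed_db.head()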

# Data Indexing and Extraction

In [None]:
single = processed_db.loc[0,:].copy()
print(single)
indexing_trial = {}

words = single.Title.split()
SN = single.SN
word = words[0]
example = {word: [SN]}

print('=====================================================================')
print('Sample index')
print(example)

In [None]:
crawled_db['Publication Link']

# Indexing Function Execution

In [23]:
#Indexing function
def execute_indexing(inputs, index):
    words = inputs.Title.split()
    SN = int(inputs.SN)
    for word in words:
        if word in index.keys():
            if SN not in index[word]:
                index[word].append(SN)
        else:
            index[word] = [SN]
    return index

indx = execute_indexing(inputs=single, index= {})
#print(indx)
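The structure built above is an inverted index: each word maps to the list of record numbers whose title contains it. A minimal sketch with plain dictionaries, independent of pandas, shows the resulting shape. The helper name `add_to_index` and the three toy documents are hypothetical.

```python
def add_to_index(index, doc_id, text):
    """Add every word of `text` to `index`, mapping word -> list of doc ids."""
    for word in text.split():
        postings = index.setdefault(word, [])
        # Avoid listing the same document twice for a repeated word
        if doc_id not in postings:
            postings.append(doc_id)
    return index

# Toy, already-preprocessed titles keyed by a serial number
docs = {
    0: "membrane protein purification",
    1: "protein folding",
    2: "membrane transport protein",
}

toy_index = {}
for doc_id, text in docs.items():
    add_to_index(toy_index, doc_id, text)

print(toy_index)
# {'membrane': [0, 2], 'protein': [0, 1, 2], 'purification': [0],
#  'folding': [1], 'transport': [2]}
```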

# Full Index Construction

In [None]:
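def full_index(df, index):
    # Add every row of df to the index. iloc avoids relying on a 0..n-1 row
    # index, and returning `index` also works when df is empty.
    for x in range(len(df)):
        inpt = df.iloc[x, :]
        index = execute_indexing(inputs=inpt, index=index)
    return index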

def construct_index(df, index):
    queue = preprocess_df(df)
    ind = full_index(df=queue, index=index)
    return ind

indexed = full_index(processed_db,
                     index = {})


indexes = construct_index(df=crawled_db,
                          index = {})

# Index File Management

In [25]:
with open('index_data.json', 'w') as new_f:
    json.dump(indexes, new_f, sort_keys=True, indent=4)

with open('index_data.json', 'r') as file:
    data = json.load(file)

def index_2(df, x_path):
    if len(df) > 0:
        with open(x_path, 'r') as file:
            prior_index = json.load(file)
        new_index = construct_index(df = df, index = prior_index)
        with open(x_path, 'w') as new_f:
            json.dump(new_index, new_f, sort_keys=True, indent=4)

In [None]:
len(data)

In [None]:
data

# Query Handler

In [28]:
def show_query_processing():
    sample = input('Please Input Search Terms: ')
    processed_query = tp1(sample)
    #print(f'User Search Query: {sample}')
    print(f'Processed Search Query: {processed_query}')
    return processed_query

#show_query_processing()

# Separate Query into Individual Terms

In [29]:
def separate_query(terms):
    each = tp1(terms)
    return each.split()

dqp = show_query_processing()
dqp
print(f'Separate Query: {separate_query(dqp)}')

Please Input Search Terms: detergentfree membrane protein purification
Processed Search Query: detergentfree membrane protein purification 
Separate Query: ['detergentfree', 'membrane', 'protein', 'purification']


# Boolean Operations

In [30]:
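def union(lists):
    # Every SN that appears in at least one postings list, sorted
    merged = list(set.union(*map(set, lists)))
    merged.sort()
    return merged

def intersection(lists):
    # Only the SNs that appear in every postings list, sorted
    common = list(set.intersection(*map(set, lists)))
    common.sort()
    return common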
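As a small worked example of these two operations on postings lists, the sketch below uses separately named helpers so it runs on its own; `postings_and` and `postings_or` are hypothetical names, and unlike the functions above they also accept an empty list of postings.

```python
def postings_or(lists):
    """Ids present in at least one postings list (Boolean OR), sorted."""
    return sorted(set().union(*lists))

def postings_and(lists):
    """Ids present in every postings list (Boolean AND), sorted."""
    if not lists:
        return []
    return sorted(set(lists[0]).intersection(*lists[1:]))

# Toy postings lists for three query terms
postings = [[0, 2, 5], [2, 5, 7], [1, 2, 5]]

print(postings_and(postings))  # [2, 5]
print(postings_or(postings))   # [0, 1, 2, 5, 7]
```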

# Search Function

In [31]:
def vertical_search_handler(df, query, index=indexes):
    query_separate = separate_query(query)
    retrieved = []
    for word in query_separate:
        if word in index.keys():
            retrieved.append(index[word])


    # Ranked Retrieval
    if len(retrieved)>0:
        high_rank_result = intersection(retrieved)
        low_rank_result = union(retrieved)
        c = [x for x in low_rank_result if x not in high_rank_result]
        high_rank_result.extend(c)
        result = high_rank_result

        final_output = df[df.SN.isin(result)].reset_index(drop=True)

        # Return result in order of Intersection ----> Union
        dummy = pd.Series(result, name = 'SN').to_frame()
        result = pd.merge(dummy, final_output, on='SN', how = 'left')

    else:
        result = 'No result found'

    return result
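The handler above ranks results in two tiers: records matching every query term first, then records matching any term. A finer-grained alternative, offered here only as a sketch and not as the notebook's method, orders records by how many distinct query terms they match. The function name `rank_by_term_matches` and the sample postings are hypothetical.

```python
from collections import Counter

def rank_by_term_matches(postings_lists):
    """Order ids by the number of query terms they match, most matches first."""
    counts = Counter()
    for postings in postings_lists:
        # set() guards against an id being counted twice for one term
        counts.update(set(postings))
    # Ties are broken by ascending id so the order is deterministic
    return sorted(counts, key=lambda doc_id: (-counts[doc_id], doc_id))

# One postings list per query term found in the index
retrieved = [[3, 8, 12], [8, 12, 20], [8, 40]]
ranked = rank_by_term_matches(retrieved)
print(ranked)  # [8, 12, 3, 20, 40]
```

Record 8 matches all three terms, record 12 matches two, and the remaining records match one each.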

# Test Search Function

In [None]:
def test_search_engine():
    xtest = crawled_db.copy()
    query = input("Please provide your search query: ")
    return vertical_search_handler(xtest, query, indexed)

test_search_engine()

# Final Search Function

In [34]:
def final_search_engine(results):
    # vertical_search_handler returns a DataFrame of matches, or the string
    # 'No result found' when nothing matched; pass that message through.
    if not isinstance(results, pd.DataFrame):
        return results
    for i in range(len(results)):
        printout = results.loc[i, :]
        print(printout['Title'])
        print(printout['Authors'])
        print(printout['Publication Year'])
        print(printout['Publication Link'])
        print('')

In [None]:
crawled_db['Authors'].iloc[24]

In [None]:
final_search_engine(test_search_engine())

## 4. Schedule the Crawler Weekly (or as a Cron Job)

To demonstrate weekly scheduled crawling, the following parameters are defined:




In [None]:
# days = 0
# interval = 7   # days between crawls
# while days <= 1:
#     scrape(max)
#     print(f"Crawled at {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
#     print(f'Next crawl scheduled after {interval} days')
#     time.sleep(interval * 24 * 60 * 60)   # time.sleep() takes seconds
#     days = days + 1
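A minimal sketch of the scheduling loop as a reusable function follows. It assumes a blocking loop is acceptable for a demonstration; for an unattended weekly crawl an external scheduler such as cron would normally start the script instead. The name `run_periodically` and the stand-in job are hypothetical, and the `sleep` parameter exists only so the pause can be replaced while trying the code out.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60

def run_periodically(job, interval_seconds, runs, sleep=time.sleep):
    """Call `job` `runs` times, pausing `interval_seconds` between calls."""
    results = []
    for n in range(runs):
        results.append(job())
        # No pause is needed after the final run
        if n < runs - 1:
            sleep(interval_seconds)
    return results

# Demonstration with a stand-in job; pauses are recorded instead of actually slept
pauses = []
outcome = run_periodically(lambda: "crawled", 7 * SECONDS_PER_DAY,
                           runs=2, sleep=pauses.append)
print(outcome, pauses)  # ['crawled', 'crawled'] [604800]
```

In the notebook, `job` would be a call such as `lambda: scrape(max)` and `sleep` would be left at its default.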