Before attempting to scrape any website, it is crucial to review and adhere to the website’s terms of service, robots.txt file, and any other guidelines or restrictions they may have in place.
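
As a rough illustration of the robots.txt check mentioned above, the sketch below uses Python's standard `urllib.robotparser`. The helper name `is_allowed` and the sample rules are assumptions made up for this example; they are not the actual rules of any site scraped in this notebook.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only (not a real site's robots.txt)
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_RULES, "https://example.org/faculty/"))
print(is_allowed(SAMPLE_RULES, "https://example.org/private/data"))
```

To check a live site instead of a pasted string, `RobotFileParser` also offers `set_url()` and `read()` to download the file before calling `can_fetch()`. A robots.txt check does not replace reading the terms of service.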

In [20]:
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from bs4 import BeautifulSoup
from selenium.common.exceptions import NoSuchElementException  
import pandas as pd

# Setup Chrome WebDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Navigate to the website
URL = "https://sherni.inflibnet.ac.in/"
driver.get(URL)

# Find every "View Profile" link on the landing page
view_buttons = driver.find_elements(By.LINK_TEXT, "View Profile")

# Collect one row per profile and count the profiles that had to be skipped
data = []
error_count = 0

# Iterate through each button and click it using JavaScript
for button in view_buttons:
    try:
        driver.execute_script("arguments[0].click();", button)
        time.sleep(2)  # Wait for the profile page to load
        
        # Switch to the newly opened tab or window
        driver.switch_to.window(driver.window_handles[-1])
        
        time.sleep(2)
        
        # Get the HTML content of the page
        page_source = driver.page_source

        # Parse the HTML content with BeautifulSoup
        soup = BeautifulSoup(page_source, "html.parser")

        card = driver.find_element(By.CLASS_NAME, "name-location")

        time.sleep(3)

        # Extract name, designation, and institute name
        name = card.find_element(By.TAG_NAME, "h1").text

        uni = card.find_elements(By.TAG_NAME, "li")[-1].text
        #uni = ', '.join([element.text for element in uni_element])

        expertise = driver.find_element(By.ID, "e_expertise").text
        prof = driver.find_element(By.CSS_SELECTOR, ".cbp_tmlabel h2").text

        # Scroll down so the publications section loads
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(10)  # Wait for the publications to load
        
        try:
            paper = driver.find_element(By.CSS_SELECTOR, "#publication_div h2").text
        except NoSuchElementException:
            paper = "No paper found"
        
        try:
            educ = driver.find_element(By.ID, "qualification-view").text.replace('\n', ' ')
        except NoSuchElementException:
            educ = "Educational qualification not available"
            
        # Append the extracted details to the data list
        data.append([name, uni, expertise, prof, paper, educ])

        # Close the current tab or window and switch back to the main page
        driver.close()
        driver.switch_to.window(driver.window_handles[0])
        
    except NoSuchElementException:
        # Increment error count if NoSuchElementException occurs
        error_count += 1
        print("Error: NoSuchElementException encountered. Skipping this profile.")
        # Close the current tab or window and switch back to the main page
        driver.close()
        driver.switch_to.window(driver.window_handles[0])

# Close the browser
driver.quit()

# Create a DataFrame from the collected data
df = pd.DataFrame(data, columns=['Faculty Name', 'Institute Name', 'Subject Expertise', 'Designation', 'Most Recent Work', 'Education Qualification'])

# Print the DataFrame
df

Error: NoSuchElementException encountered. Skipping this profile.


Unnamed: 0,Faculty Name,Institute Name,Subject Expertise,Designation,Most Recent Work,Education Qualification
0,Prof Rekha S Singhal,"Institute of Chemical Technology, Mumbai",Food Science and Technology,Professor,Co-extraction of marigold flowers (Tagetes ere...,Ph.D.
1,Prof Madhoolika Agrawal,Banaras Hindu University,Environmental Sciences,Professor,Secondary metabolites responses of plants expo...,"1982 Ph.D Banaras Hindu University, Varanasi"
2,Prof Anushree Malik,Indian Institute of Technology Delhi,Agricultural Engineering,Professor,An integration of algae-mediated wastewater tr...,"2000 PhD Indian Institute of Technology, Delhi"
3,Prof Paramjit Khurana,University of Delhi,Plant Sciences,Professor,Identification of universal stress proteins in...,1983 Ph.D University of Delhi
4,Dr. Seghal Kiran G.,Pondicherry University,Food Science and Technology,Assistant Professor,Genomic investigation unveils high-risk ESBL p...,Educational qualification not available
5,Prof Purnima Singh,Indian Institute of Technology Delhi,"Humanities, Multidisciplinary",Professor (HAG),Haptoglobin Gene Expression and Anthracycline-...,"D.Phil. University of Allahabad, Allahabad"
6,Dr Kavita Singh,Jawaharlal Nehru University,Art,Professor,Exploring synergies between India's climate ch...,1996 Ph.D Panjab University
7,Ms Pooja Tyagi,Noida Institute of Engineering and Technology,Language and Linguistics,Assistant Professor,Development of copper nanoparticles and their ...,"2002 MA D G college, Kanpur"
8,Dr Ravi Kiran,Thapar Institute of Engineering and Technology,"Humanities, Multidisciplinary",Professor,Understanding the relevance of experiential le...,1999 Ph.D Thapar Institute of Engineering and ...
9,Dr Devika J,Centre for Development Studies,Literary Theory and Criticism,Associate Professor,The kiss of love protests: A report on resista...,"2003 Ph.D (History) Mahatma Gandhi University,..."


In [22]:
# Save the DataFrame to a CSV file
df.to_csv('SHERNI_faculty_profiles.csv', index=False)

## Steps Taken & Challenges Faced:

- Using Selenium WebDriver, the script first navigates to the SHERNI platform.
- The script identifies the buttons that open individual profiles by finding elements with the link text "View Profile".
- It iterates through each "View Profile" button and clicks it through JavaScript execution.
- Clicking a profile button opens a new tab or window, so the script switches to it before extracting profile information.
- Each piece of information is located in its own way:
  - **Name, Designation, and Institute Name:** Extracted directly from specific elements on the page.
  - **Subject Expertise:** Located by its unique ID and extracted.
  - **Most Recent Work:** Extracted from the "Publications" section, if available.
  - **Educational Qualifications:** Extracted from the "Qualifications" section, if available.
- To ensure complete data retrieval, the script scrolls to the bottom of the page so that additional content loads.
- `NoSuchElementException` is handled for information that may not be present on every profile: missing publications or qualifications fall back to a placeholder string, and any other missing element causes the profile to be skipped.
- After extracting information from a profile, the script closes the current tab or window and switches back to the main page to proceed to the next profile.
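
The script above waits for content with fixed `time.sleep` calls, which either waste time or end too early when a page is slow. A minimal sketch of the alternative, polling until a condition holds, is shown below in plain Python so that it runs without a browser. The helper `wait_until` and the stand-in `fake_probe` are hypothetical names introduced for this example; Selenium itself ships `WebDriverWait` for the same purpose.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(probe: Callable[[], T], timeout: float = 10.0, interval: float = 0.5) -> T:
    """Call `probe` repeatedly until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a page whose content only appears on the third check
calls = {"count": 0}


def fake_probe():
    calls["count"] += 1
    return "Sample publication title" if calls["count"] >= 3 else None


title = wait_until(fake_probe, timeout=2.0, interval=0.01)
print(title, "after", calls["count"], "checks")
```

In the scraper, the probe could be something like `lambda: driver.find_elements(By.CSS_SELECTOR, "#publication_div h2")`, since `find_elements` returns an empty (falsy) list until the element exists. Whether this is faster than the fixed 10-second sleep depends on how quickly the site responds.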

In [87]:
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from bs4 import BeautifulSoup
from selenium.common.exceptions import NoSuchElementException  
import pandas as pd
from selenium.webdriver.common.action_chains import ActionChains

# Setup Chrome WebDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Navigate to the webpage
URL = "https://www.iima.ac.in/"
driver.get(URL)

time.sleep(3)

# Find the element with ID "nav-icon3" and click it
nav_icon = driver.find_element(By.ID, "nav-icon3")
nav_icon.click()

faculty_and_research_option = driver.find_element(By.LINK_TEXT, "Faculty & Research")
# Hover over the "Faculty & Research" option
action = ActionChains(driver)
action.move_to_element(faculty_and_research_option).perform()

# Add a brief pause to see the cursor movement (optional)
time.sleep(2)  # Adjust the sleep duration as needed

# Scroll to the right to view more options inside it
#driver.execute_script("arguments[0].scrollIntoView();", faculty_and_research_option)
# Scroll to the right by a percentage of the document width
driver.execute_script("window.scrollBy(document.body.scrollWidth * 0.5, 0);")

# After expanding the menu and moving the cursor to view more options

# Find the "Our Faculty" option by link text
our_faculty_option = driver.find_element(By.LINK_TEXT, "Our Faculty")

# Click on the "Our Faculty" option
our_faculty_option.click()

time.sleep(2)

faculty_info = []

# Outer loop to navigate through multiple pages
while True:
    # Scroll down so that all faculty cards on the page load
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
    cards = driver.find_elements(By.CLASS_NAME, "faculty-card-detail")

    for card in cards:
        driver.execute_script("arguments[0].click();", card)
    
        time.sleep(2)
        # Find the name of the person
        name = driver.find_element(By.TAG_NAME,"h3").text
    
        # Profession
        prof = driver.find_element(By.TAG_NAME, "p").text
    
        # Find the parent div element
        parent_div = driver.find_element(By.CLASS_NAME, "employee-area-box")

        # Find the p tag inside the parent div and get its text
        expertise = parent_div.find_element(By.TAG_NAME, "p").text.split(":")[-1].strip()
    
        # Scroll down so the rest of the profile loads
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
        # Wait for 2 secs
        time.sleep(2)
    
        # For Education background and Research Work, class is "field elements"
        # Scrape educational qualifications
        try:
            educ_parent_div = driver.find_element(By.CLASS_NAME, "field--name-field-academic-degrees-education")
            educ_items = educ_parent_div.find_elements(By.TAG_NAME, "p")

            educational_qualifications = ""
            for educ_item in educ_items:
                educational_qualifications += educ_item.text.strip() + " "
        except NoSuchElementException:
            educational_qualifications = "Not Found"
    
        # Scrape research area
        try:
            research_parent_div = driver.find_element(By.CLASS_NAME, 'last-p-margin-0')
            research_items = research_parent_div.find_elements(By.TAG_NAME, "p")
            research_areas = ""
            for research_item in research_items:
                research_areas += research_item.text.strip() + " "
        except NoSuchElementException:
            research_areas = "Not Found"
        
        # Append faculty information to the list
        faculty_info.append({
            "Name": name,
            "Profession": prof,
            "Area of Expertise": expertise,
            "Educational Background": educational_qualifications,
            "Research Area": research_areas
        })
        
        driver.back()
    
    # Find all the page elements
    pages = driver.find_elements(By.CLASS_NAME, "pager__item")
    
    # Check if there is a "Next" page
    next_page_link = None
    for page in pages:
        if "Next" in page.text:
            next_page_link = page.find_element(By.TAG_NAME, "a")
            break
    
    # If there is no "Next" page, exit the loop
    if not next_page_link:
        break
    
    # Scroll to the next page link and click it
    driver.execute_script("arguments[0].scrollIntoView();", next_page_link)
    
    # Click on the next page link
    driver.execute_script("arguments[0].click();", next_page_link)
    
# Close the driver
driver.quit()

# Convert the list of dictionaries to a DataFrame
df = pd.DataFrame(faculty_info)

df

Unnamed: 0,Name,Profession,Area of Expertise,Educational Background,Research Area
0,Saravanan A,Assistant Professor of Strategy,Strategy,"Ph.D., International Investment Law, Indian In...",International Investment Law Intellectual Prop...
1,Anurag Agarwal,Associate Professor of Strategy,,"LL.D. Doctor of Laws , (Lucknow University) – ...",International Business Dispute Resolution Arbi...
2,Promila Agarwal,Associate Professor of Human Resources Management,,"Ph.D, Faculty of Management Studies, Universit...",High-Performance Work Systems Dark Triad Perso...
3,Sobhesh Agarwalla,Professor of Finance and Accounting,Finance and Accounting,"Ph.D., IIMA FCA ACS ACMA",
4,Anirban Banerjee,Assistant Professor of Finance and Accounting,Finance and Accounting,"Ph.D, Finance, Indian Institute of Management ...",Not Found
...,...,...,...,...,...
99,Prahalad Venkateshan,Associate Professor of Operations and Decision...,Operations and Decision Sciences,"Bachelor of Engineering, College of Engineerin...",Vehicle Routing Facility Location Network Desi...
100,Sanjay Verma,Associate Professor of Information Systems,,"Fellow , Indian Institute of Management Calcut...",E-Governance Knowledge Management Multiple Res...
101,Akshaya Vijayalakshmi,Associate Professor of Marketing,,Not Found,Broadly: Marketing and Public Policy Specifica...
102,Vineet Virmani,Associate Professor of Finance and Accounting,,"Fellow, Economics, IIMA B. Tech, Mech, IT-BHU",Not Found


In [90]:
df.to_csv('IIMA_faculty_profiles.csv', index=False)

## Steps Taken & Challenges Faced:

- Once the driver is set up and the IIMA home page opens, a three-bar navigation icon is visible in the top-right corner of the page. The code uses Selenium to locate this icon by its ID and click it.
- A drop-down menu opens, and the code hovers over the "Faculty & Research" option to make the next drop-down visible. This was challenging because the two lists almost overlap each other; I resolved it by scrolling the page to the right once the cursor reaches the "Faculty & Research" option.
- From there, the code clicks the "Our Faculty" link.
- On the faculty listing page, the code loops through all the result pages (the page numbers are shown at the bottom of the page) by following the "Next" link until none remains.
- For each page, the code opens each faculty profile and scrapes the name, profession, area of expertise, educational background, and research areas.
- Other challenges I resolved included scrolling each faculty page so that all the elements needed for scraping load, adding wait times, and finding unique class and tag names.
- Another challenge was that some faculty pages had no education or research area elements, so I wrapped those lookups in `try`/`except NoSuchElementException` blocks that fall back to "Not Found", which lets the loop run to the end.
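
The fallback for missing elements described in the last point could be factored into a small helper, sketched below. The name `text_or_default` and the dictionary standing in for a profile page are assumptions for illustration; the notebook itself repeats the `try`/`except` inline.

```python
def text_or_default(lookup, default="Not Found", exceptions=(LookupError,)):
    """Run `lookup()` and return its result, or `default` if it raises one of `exceptions`."""
    try:
        return lookup()
    except exceptions:
        return default


# A dict stands in for a profile page that has no research section
fake_profile = {"education": "Ph.D., Example University"}

education = text_or_default(lambda: fake_profile["education"])
research = text_or_default(lambda: fake_profile["research"])

print(education)
print(research)
```

With Selenium, the same helper would be called with `exceptions=(NoSuchElementException,)` and a lookup such as `lambda: driver.find_element(By.CLASS_NAME, "last-p-margin-0").text`, so that each optional field takes one line instead of a four-line block.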

**NOTE: Unlike the SHERNI website, there wasn't any specific "year" or "university" element that I could scrape separately. The educational background element has the combined information.**