#### **Step 1-** Importing libraries and accessing the URL

https://www.isb.edu/en/study-isb/advanced-management-programmes.html

In [133]:
import selenium
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Specify the correct path to your chromedriver executable
chromedriver_path = "C:/Program Files (x86)/chromedriver.exe"  

# Initialize the Chrome WebDriver using the Service class
service = Service(executable_path=chromedriver_path)

In [134]:
from selenium.webdriver.common.keys import Keys
# Keys provides constants for special keyboard keys such as RETURN, F1, and ALT.
# It is useful when you need to simulate keypresses in a Selenium script,
# for example pressing Enter to submit a form after filling out a text field.
from selenium.webdriver.common.by import By
# By provides the locator strategies used to find elements within a page.
# Pass it to find_element() or find_elements() to specify how Selenium
# should search: by ID, name, XPath, CSS selector, link text,
# partial link text, tag name, or class name.
# Using By makes your code more readable and maintainable.

In [135]:
driver = webdriver.Chrome(service=service) 
# This starts a new Chrome browser session.
# You should see a new Chrome window open.
# You can now control browser actions from code through
# the driver object you just created.

# driver.quit()   # closes the whole browser and ends the session
# driver.close()  # closes only the current tab or window
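
A forgotten `driver.quit()` leaves a browser process running, so one option is to wrap the session in `try`/`finally`. The helper below is a minimal sketch of that pattern; `run_with_driver`, `make_driver`, and `task` are hypothetical names and are not part of Selenium.

```python
def run_with_driver(make_driver, task):
    """Create a driver, run task(driver), and always quit the driver afterwards."""
    driver = make_driver()
    try:
        return task(driver)
    finally:
        # Runs even if task() raises, so the browser is not left open.
        driver.quit()

# Possible use with the objects defined in this notebook (not executed here):
# page_title = run_with_driver(
#     lambda: webdriver.Chrome(service=service),
#     lambda d: (d.get("https://www.isb.edu/en/study-isb/advanced-management-programmes.html"), d.title)[1],
# )
```

In an exploratory notebook it is often more convenient to keep the driver open across cells, as done here, and call `driver.quit()` manually at the end.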

In [136]:
driver.get("https://www.isb.edu/en/study-isb/advanced-management-programmes.html") 
# This tells the driver to load the webpage at the given URL.
# With the default page load strategy, driver.get() returns once the
# browser reports that the page has finished loading. Content that
# JavaScript adds after that point may not be present yet.

#### **Step 2-** Finding Elements and Extracting HTML

<span style='color:Blue'> We need to extract the following information about each program. <br>
(i) Title <br>
(ii) Brief description <br>
(iii) Duration <br>
(iv) Work Experience  </span>

In [137]:
# Inspect the page's HTML. All the data we need is inside the 'body' element,
# which has an id attribute. An id should be unique within a page, so we can
# use it to locate 'body'.
body = driver.find_element(By.ID, "page-7ea680bf20")

In [138]:
# Let us try to understand this object
help(body)

Help on WebElement in module selenium.webdriver.remote.webelement object:

class WebElement(BaseWebElement)
 |  WebElement(parent, id_) -> 'None'
 |
 |  Represents a DOM element.
 |
 |  Generally, all interesting operations that interact with a document will be
 |  performed through this interface.
 |
 |  All method calls will do a freshness check to ensure that the element
 |  reference is still valid.  This essentially determines whether the
 |  element is still attached to the DOM.  If this test fails, then an
 |  ``StaleElementReferenceException`` is thrown, and all future calls to this
 |  instance will fail.
 |
 |  Method resolution order:
 |      WebElement
 |      BaseWebElement
 |      builtins.object
 |
 |  Methods defined here:
 |
 |  __eq__(self, element)
 |      Return self==value.
 |
 |  __hash__(self) -> 'int'
 |      Return hash(self).
 |
 |  __init__(self, parent, id_) -> 'None'
 |      Initialize self.  See help(type(self)) for accurate signature.
 |
 |  __ne__(self, 

In [139]:
# To get the full HTML of the body element, we can ask for the outerHTML attribute
html_body = body.get_attribute('outerHTML')
print(html_body)

<body class="page basicpage chrome chrome127" id="page-7ea680bf20" data-cmp-data-layer-enabled="">
        <script>
          window.adobeDataLayer = window.adobeDataLayer || [];
          adobeDataLayer.push({
              page: JSON.parse("{\x22page\u002D7ea680bf20\x22:{\x22@type\x22:\x22isb\/components\/structure\/page\x22,\x22repo:modifyDate\x22:\x222024\u002D05\u002D21T09:47:58Z\x22,\x22dc:title\x22:\x22Advanced Management Programme\x22,\x22xdm:template\x22:\x22\/conf\/isb\u002Duniversity\u002Dtemplates\/settings\/wcm\/templates\/content\u002Dpage\x22,\x22xdm:language\x22:\x22en\x22,\x22xdm:tags\x22:[],\x22repo:path\x22:\x22\/content\/sites\/isb\/en\/study\u002Disb\/advanced\u002Dmanagement\u002Dprogrammes.html\x22}}"),
              event:'cmp:show',
              eventInfo: {
                  path: 'page.page\u002D7ea680bf20'
              }
          });
        </script>
        
        
            




            



            
<!-- Google Tag Manager (noscript) -->
<n

#### **Step 3-** Extracting data from HTML

In [140]:
Prog_info = body.find_elements(By.CSS_SELECTOR, 'li.cmp-list__item')
# find_elements returns a list of every <li> element inside body whose class includes "cmp-list__item"
for item in Prog_info:
    print(item.text)

Advanced Management Programme in Business Analytics (AMPBA)
Designed for professionals who want to build or enhance their understanding of data & analytics to inform business decision-making
Duration: 12 months + 3 months of Capstone Project
Work Experience: 2+ years
Location: Hyderabad | Mohali
APPLY
KNOW MORE
Advanced Management Programme for Healthcare (AMPH)
Meant to deliver specialised management education to executives from the healthcare-delivery industry, and those who want to build domain expertise
Duration: 12 months
Work Experience: 3+ years
Location: Mohali | Hyderabad
APPLY
KNOW MORE
Advanced Management Programme for Infrastructure (AMPI)
Specialised programme for professionals who want to build or enhance expertise in the multidisciplinary sector of infrastructure, with focus on emerging economies
Duration: 12 months
Work Experience: 5+ years
Location: Mohali and Hyderabad
APPLY
KNOW MORE
Advanced Management Programme in Operations and Supply Chain (AMPOS)
Designed to mee
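
The printed text already has a regular shape: a title line, a description line, then `Label: value` lines. As an alternative to parsing HTML, a minimal sketch of turning one such block into a dictionary is shown below. It assumes every card follows this layout; `parse_card_text` and `SAMPLE_TEXT` are hypothetical names, and the sample text is copied from the first card in the output above.

```python
SAMPLE_TEXT = """Advanced Management Programme in Business Analytics (AMPBA)
Designed for professionals who want to build or enhance their understanding of data & analytics to inform business decision-making
Duration: 12 months + 3 months of Capstone Project
Work Experience: 2+ years
Location: Hyderabad | Mohali
APPLY
KNOW MORE"""

def parse_card_text(text):
    """Turn the .text of one programme card into a dict of fields."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Assumption: the first two lines are always title and description.
    record = {"Title": lines[0], "Description": lines[1]}
    for ln in lines[2:]:
        # Keep only "Label: value" lines; button text such as APPLY is skipped.
        if ":" in ln:
            key, value = ln.split(":", 1)
            record[key.strip()] = value.strip()
    return record

record = parse_card_text(SAMPLE_TEXT)
print(record["Duration"])
print(record["Work Experience"])
```

With the live page this could be applied as `[parse_card_text(item.text) for item in Prog_info]`.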

In [141]:
len(Prog_info)

5

In [142]:
# Save the outer HTML of each element in a list, using a list comprehension
Prog_info_html = [prog.get_attribute('outerHTML') for prog in Prog_info]

In [143]:
# How many items do we have in this temp object?
len(Prog_info_html)

# There are 5 programs on this page.

5

In [144]:
# Let's inspect the first element in the list
Prog_info_html[0]

'<li class="cmp-list__item">\n   \t   <div class="row card-4-container">\n         <div class="col-md-12 col-sm-12">\n            <div class="card-4">\n               <figure>\n                  <img loading="lazy" class="img-responsive" src="/content/dam/sites/isb/study-isb/advanced-management-programmes/AMP-AMPBA-Common-Thumbs.png" alt="imageAltText">\n               </figure>\n               <div class="card-detail">\n                  <span class="award-icon"><img loading="lazy" src="/content/dam/sites/isb/images/award-ic.png" alt=""></span>\n                  <div class="head-bx">\n                     <div class="tag"></div>\n                     <h3>Advanced Management Programme in Business Analytics (AMPBA)</h3>\n                     <h4>Designed for professionals who want to build or enhance their understanding of data &amp; analytics to inform business decision-making</h4>\n                  </div>\n                  <div class="info-bx">\n                     <div class="row

In [145]:
# Convert each HTML string in the list into a BeautifulSoup object,
# again using a list comprehension.
# Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning.

from bs4 import BeautifulSoup
prog_soup = [BeautifulSoup(prog_html, "html.parser") for prog_html in Prog_info_html]

In [146]:
# Extract titles (the <h3> in each card)
titles = [prog.find("h3").text for prog in prog_soup]
titles

# Extract brief descriptions (the <h4> in each card)
Description = [prog.find("h4").text for prog in prog_soup]
Description

# Extract durations (the first <strong> in each card)
Duration = [prog.find("strong").text for prog in prog_soup]
Duration

# Extract work experience (the second <strong> in each card)
Work_Exp = [prog.find_all("strong")[1].text for prog in prog_soup]
Work_Exp

['2+ years ', '3+ years ', '5+ years', '5+ years', '5+ years']
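
Picking values by position (`find_all("strong")[1]`) works only while every card lists its fields in the same order. A label-based lookup is one possible way to make this less fragile. The sketch below is built on an assumed HTML fragment: the notebook's printed HTML is truncated before the `<strong>` tags, so the `<p>` wrappers and label placement in `SAMPLE` are guesses, and `value_for_label` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Assumed markup: each value sits in a <strong> whose parent text starts with the label.
SAMPLE = """
<li class="cmp-list__item">
  <h3>Advanced Management Programme for Healthcare (AMPH)</h3>
  <h4>Meant to deliver specialised management education</h4>
  <p>Duration: <strong>12 months</strong></p>
  <p>Work Experience: <strong>3+ years </strong></p>
</li>
"""

def value_for_label(soup, label):
    """Return the text of the first <strong> whose parent text starts with label."""
    for strong in soup.find_all("strong"):
        parent_text = strong.parent.get_text(" ", strip=True)
        if parent_text.startswith(label):
            # strip=True also removes trailing spaces such as in '3+ years '.
            return strong.get_text(strip=True)
    return None

soup = BeautifulSoup(SAMPLE, "html.parser")
print(value_for_label(soup, "Duration"))
print(value_for_label(soup, "Work Experience"))
```

If the real markup differs, the `startswith` check would need to be adapted to wherever the label text actually lives.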

#### **Step 4-** Creation of dataframe

<span style='color:Blue'> As per **Question 2 (part 2)**, we need to create a dataframe with the following fields. <br>
(i) Title <br>
(ii) Description <br>
(iii) Duration (in months) <br>
(iv) Capstone Project (Yes/ No) <br>
(v) Work Experience  </span>

In [147]:
# Create a dataframe from the extracted lists
import pandas as pd

data = {'Titles': titles, 'Description': Description, 'Duration': Duration, 'Work_Exp': Work_Exp}
df = pd.DataFrame(data)
df

|   | Titles | Description | Duration | Work_Exp |
|---|---|---|---|---|
| 0 | Advanced Management Programme in Business Analytics (AMPBA) | Designed for professionals who want to build or enhance their understanding ... | 12 months + 3 months of Capstone Project | 2+ years |
| 1 | Advanced Management Programme for Healthcare (AMPH) | Meant to deliver specialised management education to executives from the hea... | 12 months | 3+ years |
| 2 | Advanced Management Programme for Infrastructure (AMPI) | Specialised programme for professionals who want to build or enhance experti... | 12 months | 5+ years |
| 3 | Advanced Management Programme in Operations and Supply Chain (AMPOS) | Designed to meet the increasing need for specialised executives working in m... | 12 months | 5+ years |
| 4 | Advanced Management Programme in Public Policy (AMPPP) | Specialised programme for the needs of mid-career and senior-level professio... | 12 months | 5+ years |


In [148]:
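# Split Duration on '+': the part before it stays in Duration, and the part after it
# (present only for programmes with a capstone) goes into a new Capstone column
df[['Duration', 'Capstone']] = df['Duration'].str.split('+', expand=True)
# Rows without a '+' get a missing value in Capstone; mark them as 'No'
df['Capstone'] = df['Capstone'].fillna('No')
df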

|   | Titles | Description | Duration | Work_Exp | Capstone |
|---|---|---|---|---|---|
| 0 | Advanced Management Programme in Business Analytics (AMPBA) | Designed for professionals who want to build or enhance their understanding ... | 12 months | 2+ years | 3 months of Capstone Project |
| 1 | Advanced Management Programme for Healthcare (AMPH) | Meant to deliver specialised management education to executives from the hea... | 12 months | 3+ years | No |
| 2 | Advanced Management Programme for Infrastructure (AMPI) | Specialised programme for professionals who want to build or enhance experti... | 12 months | 5+ years | No |
| 3 | Advanced Management Programme in Operations and Supply Chain (AMPOS) | Designed to meet the increasing need for specialised executives working in m... | 12 months | 5+ years | No |
| 4 | Advanced Management Programme in Public Policy (AMPPP) | Specialised programme for the needs of mid-career and senior-level professio... | 12 months | 5+ years | No |


In [149]:
# Convert Capstone to Yes/No and Duration to an integer number of months
df['Capstone'] = df['Capstone'].apply(lambda x: 'No' if x == "No" else 'Yes')
pattern = r'(\d+)\s+months'  # capture the number that precedes "months"
df['Duration'] = df['Duration'].str.extract(pattern)
df['Duration'] = df['Duration'].astype(int)
df['Capstone'] = pd.Categorical(df['Capstone'], categories=['No', 'Yes'])
df.rename(columns={'Duration': 'Duration (in months)', 'Capstone': 'Capstone Project (Yes/No)'}, inplace=True)
df  # display the dataframe

|   | Titles | Description | Duration (in months) | Work_Exp | Capstone Project (Yes/No) |
|---|---|---|---|---|---|
| 0 | Advanced Management Programme in Business Analytics (AMPBA) | Designed for professionals who want to build or enhance their understanding ... | 12 | 2+ years | Yes |
| 1 | Advanced Management Programme for Healthcare (AMPH) | Meant to deliver specialised management education to executives from the hea... | 12 | 3+ years | No |
| 2 | Advanced Management Programme for Infrastructure (AMPI) | Specialised programme for professionals who want to build or enhance experti... | 12 | 5+ years | No |
| 3 | Advanced Management Programme in Operations and Supply Chain (AMPOS) | Designed to meet the increasing need for specialised executives working in m... | 12 | 5+ years | No |
| 4 | Advanced Management Programme in Public Policy (AMPPP) | Specialised programme for the needs of mid-career and senior-level professio... | 12 | 5+ years | No |
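
The split, fill, and extract steps above could also be collapsed into two vectorised string operations on the raw duration text. The sketch below is an alternative under the assumption that every value starts with "N months" and mentions "Capstone" only when a capstone project exists; the `raw` Series is illustrative sample data that mirrors two of the scraped values.

```python
import pandas as pd

# Illustrative sample mirroring the scraped Duration strings.
raw = pd.Series(["12 months + 3 months of Capstone Project", "12 months"])

# First number followed by "months" becomes the duration in months.
months = raw.str.extract(r"(\d+)\s*months", expand=False).astype(int)

# Any mention of "Capstone" in the raw text marks the programme as Yes.
capstone = raw.str.contains("Capstone").map({True: "Yes", False: "No"})

result = pd.DataFrame({
    "Duration (in months)": months,
    "Capstone Project (Yes/No)": capstone,
})
print(result)
```

This variant reads the capstone flag from the original text instead of from the result of the split, so it does not depend on the `+` separator being present.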


In [150]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   Titles                     5 non-null      object  
 1   Description                5 non-null      object  
 2   Duration (in months)       5 non-null      int32   
 3   Work_Exp                   5 non-null      object  
 4   Capstone Project (Yes/No)  5 non-null      category
dtypes: category(1), int32(1), object(3)
memory usage: 401.0+ bytes
