
### WEB SCRAPING THE NATIONAL HEART, LUNG, AND BLOOD INSTITUTE WEBSITE

This project scrapes the health topics listing at https://www.nhlbi.nih.gov/health. It collects each condition's name, summary information, symptoms, and diagnosis details, then exports the combined result as a CSV dataset.

In [None]:
#Importing the necessary libraries

import pandas as pd
import requests
from bs4 import BeautifulSoup

In [None]:
#The URL of the listing to scrape; a page number is appended to it later

url = 'https://www.nhlbi.nih.gov/health?page='
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
print(f'{response.status_code}, Access granted successfully')

200, Access granted successfully
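
The print above reports success whatever the status code is. A minimal sketch of a stricter check follows. `ensure_ok` is a hypothetical helper (not part of `requests`), and the stand-in objects only assume that a real response exposes `.status_code` and `.text`. With a real `requests` response, `response.raise_for_status()` serves a similar purpose.

```python
from types import SimpleNamespace


def ensure_ok(response):
    """Return the response body, or raise if the status code is not 2xx."""
    if not 200 <= response.status_code < 300:
        raise RuntimeError(f"Request failed with status {response.status_code}")
    return response.text


# Stand-in objects so the sketch runs without network access (assumption:
# they mimic the two attributes of a real response that are used above).
ok = SimpleNamespace(status_code=200, text="<html></html>")
missing = SimpleNamespace(status_code=404, text="Not found")

print(ensure_ok(ok))
try:
    ensure_ok(missing)
except RuntimeError as err:
    print(err)
```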


In [None]:
#To scrape topic name
health_topics = []

#Fetching the HTML of every listing page, from 0 to 12.
for page in range(13):
    page_url = url + str(page)
    res = requests.get(page_url)
    page_soup = BeautifulSoup(res.text, 'html.parser')

    #To Scrape All Topics
    for name in page_soup.find_all('h4', class_="field field--name-title field--type-string field--label-hidden field__item"):

        name_topic = name.get_text(strip=True)

        health_topics.append(name_topic)


In [None]:
health_topics

['Acute Respiratory Distress Syndrome',
 'Alpha-1 Antitrypsin Deficiency',
 'Anemia',
 'Angina (Chest Pain)',
 'Antiphospholipid Syndrome (APS)',
 'Aortic Aneurysm',
 'Aplastic Anemia',
 'Arrhythmias',
 'Asthma',
 'Atherosclerosis',
 'Atrial Fibrillation',
 'Bleeding Disorders',
 'Blood Cholesterol',
 'Blood Clotting Disorders',
 'Blood Tests',
 'Bronchiectasis',
 'Bronchitis',
 'Bronchopulmonary Dysplasia (BPD)',
 'Cardiac Arrest',
 'Cardiac Catheterization',
 'Cardiogenic Shock',
 'Cardiomyopathy',
 'Childhood Interstitial Lung Disease',
 'Circadian Rhythm Disorders',
 'Clinical Trials',
 'Conduction Disorders',
 'Congenital Heart Defects',
 'COPD',
 'Coronary Artery Bypass Grafting',
 'Coronary Heart Disease',
 'CPAP',
 'Cystic Fibrosis',
 'Defibrillators',
 'Disseminated Intravascular Coagulation (DIC)',
 'Genetic Therapies',
 'Heart Attack',
 'Heart Failure',
 'Heart Inflammation',
 'Heart Surgery',
 'Heart Tests',
 'Heart Treatments',
 'Heart Valve Diseases',
 'Heart-Healthy Livi

In [None]:
#Checking the length

len(health_topics)

102
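
The loop hardcodes 13 pages, which will miss topics if the listing grows. One possible alternative, sketched here, is to keep requesting pages until one comes back empty. `collect_all_pages` and the fake listing are hypothetical, and whether the real site returns an empty listing past its last page is an assumption that has not been checked.

```python
def collect_all_pages(fetch_page, max_pages=100):
    """Call fetch_page(0), fetch_page(1), ... until a page returns no items."""
    items = []
    for page in range(max_pages):
        page_items = fetch_page(page)
        if not page_items:  # an empty page is taken as the end of the listing
            break
        items.extend(page_items)
    return items


# Fake listing standing in for the real site: three pages, then nothing.
fake_site = {0: ["Anemia", "Asthma"], 1: ["COPD"], 2: ["CPAP"]}
topics = collect_all_pages(lambda page: fake_site.get(page, []))
print(topics)
```

In the notebook, `fetch_page` would wrap the request and the `find_all` call for one page; `max_pages` is only a safety cap against an endless loop.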

In [None]:
topic_link = []
symptom_link=[]
symptoms = []

#Fetching the HTML of every listing page, from 0 to 12.
for page in range(13):
    page_url = url + str(page)
    res = requests.get(page_url)
    page_soup = BeautifulSoup(res.text, 'html.parser')

    #To scrape all topic links
    for link in page_soup.find_all('a', class_="title-link", href=True):
        health_link = 'https://www.nhlbi.nih.gov'+ link['href']
        topic_link.append(health_link)

#Remove duplicate links (keeping their order) so each topic is requested only once.
topic_link = list(dict.fromkeys(topic_link))

#Build the symptoms page URL for each topic
for s_link in topic_link:
    symptom_link.append(s_link + '/symptoms')

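#Scrape the symptoms content of each page
#(the loop variable avoids the name `sys`, which is also a standard library module)
for symptom_url in symptom_link:
        print('Scraping content:', symptom_url)
        sys_res = requests.get(symptom_url)
        sys_soup = BeautifulSoup(sys_res.text,'html.parser')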

        # Locate all content divs with the specific class
        content_div = sys_soup.find_all('div', class_="field field--name-field-component-section-content field--type-text-long field--label-hidden clearfix field__item")

        # Default every section to 'N/A', then fill in whichever divs exist.
        # Sections are assigned by position, not by their headings on the page.
        content = {
            'Symptoms':'N/A',
            'Common Symptoms':'N/A',
            'Other Symptoms':'N/A',
        }
        for key, div in zip(content, content_div):
            content[key] = div.get_text(strip=True, separator="\n")

        # Append the symptoms for this topic
        symptoms.append(content)


#Create a dataframe for the symptoms
pd_symptom = pd.DataFrame(symptoms)


Scraping content: https://www.nhlbi.nih.gov/health/ards/symptoms
Scraping content: https://www.nhlbi.nih.gov/health/alpha-1-antitrypsin-deficiency/symptoms
Scraping content: https://www.nhlbi.nih.gov/health/anemia/symptoms
Scraping content: https://www.nhlbi.nih.gov/health/angina/symptoms
Scraping content: https://www.nhlbi.nih.gov/health/antiphospholipid-syndrome/symptoms
Scraping content: https://www.nhlbi.nih.gov/health/aortic-aneurysm/symptoms
Scraping content: https://www.nhlbi.nih.gov/health/anemia/aplastic-anemia/symptoms
Scraping content: https://www.nhlbi.nih.gov/health/arrhythmias/symptoms
Scraping content: https://www.nhlbi.nih.gov/health/asthma/symptoms
Scraping content: https://www.nhlbi.nih.gov/health/atherosclerosis/symptoms
Scraping content: https://www.nhlbi.nih.gov/health/atrial-fibrillation/symptoms
Scraping content: https://www.nhlbi.nih.gov/health/bleeding-disorders/symptoms
Scraping content: https://www.nhlbi.nih.gov/health/blood-cholesterol/symptoms
Scraping cont

In [None]:
pd_symptom.head(5)

Unnamed: 0,Symptoms,Common Symptoms,Other Symptoms
0,Difficulty breathing is usually the first symp...,,
1,,,
2,,,
3,Symptoms vary based on the\ntype of angina\nyo...,The main symptom of angina is chest pain or di...,Symptoms of angina can be different for women ...
4,,,


In [None]:
pd_symptom.shape

(102, 3)

In [None]:
len(symptoms)

102
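
The positional mapping used in the cell above can be sketched on its own: each label takes the text at the same position, and labels without a matching section fall back to 'N/A'. `label_sections` is a hypothetical helper, and the sample text is shortened from the output above.

```python
def label_sections(texts, labels, fill="N/A"):
    """Pair each label with the text at the same position, padding with fill."""
    kept = list(texts[:len(labels)])             # ignore sections beyond the labels
    padding = [fill] * (len(labels) - len(kept))  # fill in missing sections
    return dict(zip(labels, kept + padding))


SYMPTOM_LABELS = ["Symptoms", "Common Symptoms", "Other Symptoms"]

row = label_sections(["Difficulty breathing is usually the first symptom"], SYMPTOM_LABELS)
print(row)
```

Because the assignment is purely positional, a label such as 'Common Symptoms' only describes the second content block on the page; it does not guarantee that the block sits under a heading with that name.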

In [None]:
topic_link = []
diagnosis_link=[]
diagnosis = []

#Fetching the HTML of every listing page, from 0 to 12.
for page in range(13):
    page_url = url + str(page)
    res = requests.get(page_url)
    page_soup = BeautifulSoup(res.text, 'html.parser')

    #To scrape all topic links
    for link in page_soup.find_all('a', class_="title-link", href=True):
        health_link = 'https://www.nhlbi.nih.gov'+ link['href']
        topic_link.append(health_link)

#Remove duplicate links (keeping their order) so each topic is requested only once.
topic_link = list(dict.fromkeys(topic_link))

#Build the diagnosis page URL for each topic
for d_link in topic_link:
    diagnosis_link.append(d_link + '/diagnosis')

#Scrape the diagnosis content of each page
for dyd in diagnosis_link:
        print('Scraping content:', dyd)
        dyd_res = requests.get(dyd)
        dyd_soup = BeautifulSoup(dyd_res.text,'html.parser')

        # Locate all content divs with the specific class
        content_di = dyd_soup.find_all('div', class_="field field--name-field-component-section-content field--type-text-long field--label-hidden clearfix field__item")

        # Default every section to 'N/A', then fill in whichever divs exist.
        # Sections are assigned by position, so a label may not match the
        # heading the text appeared under on the page.
        cont = {
            'Medical History':'N/A',
            'Physical Exam':'N/A',
            'Diagnostic Test':'N/A',
            'Other Test':'N/A'
        }
        for key, div in zip(cont, content_di):
            cont[key] = div.get_text(strip=True, separator="\n")

        # Append the diagnosis for this topic
        diagnosis.append(cont)


#Create a dataframe for the diagnosis
pd_diagnosis = pd.DataFrame(diagnosis)


Scraping content: https://www.nhlbi.nih.gov/health/ards/diagnosis
Scraping content: https://www.nhlbi.nih.gov/health/alpha-1-antitrypsin-deficiency/diagnosis
Scraping content: https://www.nhlbi.nih.gov/health/anemia/diagnosis
Scraping content: https://www.nhlbi.nih.gov/health/angina/diagnosis
Scraping content: https://www.nhlbi.nih.gov/health/antiphospholipid-syndrome/diagnosis
Scraping content: https://www.nhlbi.nih.gov/health/aortic-aneurysm/diagnosis
Scraping content: https://www.nhlbi.nih.gov/health/anemia/aplastic-anemia/diagnosis
Scraping content: https://www.nhlbi.nih.gov/health/arrhythmias/diagnosis
Scraping content: https://www.nhlbi.nih.gov/health/asthma/diagnosis
Scraping content: https://www.nhlbi.nih.gov/health/atherosclerosis/diagnosis
Scraping content: https://www.nhlbi.nih.gov/health/atrial-fibrillation/diagnosis
Scraping content: https://www.nhlbi.nih.gov/health/bleeding-disorders/diagnosis
Scraping content: https://www.nhlbi.nih.gov/health/blood-cholesterol/diagnosis


In [None]:
pd_diagnosis.head(5)

Unnamed: 0,Medical History,Physical Exam,Diagnostic Test,Other Test
0,Your doctor will diagnose acute respiratory di...,"To help diagnose ARDS, your doctor may ask you...",Your doctor will examine you for signs of ARDS...,"To diagnose ARDS, your doctor may have you und..."
1,,,,
2,Blood test results\nThe table below shows some...,,,
3,Your healthcare provider may diagnose angina b...,Your healthcare provider will want to learn ab...,"As part of a physical exam, your healthcare pr...","Depending on your symptoms and risk factors, y..."
4,,,,


In [None]:
len(diagnosis)

102

In [None]:
pd_diagnosis.shape

(102, 4)
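
The symptoms and diagnosis cells both rebuild the topic links and then append a suffix to each one. A small helper could cover both cases once the links have been collected. This is only a sketch; `subpage_urls` is a hypothetical name, and the sample links are taken from the output above.

```python
def subpage_urls(topic_links, suffix):
    """Drop repeated links (keeping first-seen order) and append the suffix."""
    unique_links = dict.fromkeys(topic_links)  # dict keys preserve insertion order
    return [f"{link.rstrip('/')}/{suffix}" for link in unique_links]


links = [
    "https://www.nhlbi.nih.gov/health/ards",
    "https://www.nhlbi.nih.gov/health/anemia",
    "https://www.nhlbi.nih.gov/health/ards",  # duplicate, dropped below
]

symptom_urls = subpage_urls(links, "symptoms")
diagnosis_urls = subpage_urls(links, "diagnosis")
print(diagnosis_urls)
```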

In [None]:
#Scrape the summary information for each topic
topic_info = []

#Fetching the HTML of every listing page, from 0 to 12.
for page in range(13):
    page_url = url + str(page)
    res = requests.get(page_url)
    page_soup = BeautifulSoup(res.text, 'html.parser')

    #To scrape all info
    for info in page_soup.find_all('div', class_="field field--name-field-health-topic-summary field--type-text-long field--label-hidden field__item"):
        info_topic = info.get_text(strip=True, separator='\n')

        topic_info.append(info_topic)


In [None]:
len(topic_info)

102
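
The `find_all` calls in this notebook pass the full class string, which Beautiful Soup compares against the exact value of the `class` attribute. The sketch below uses made-up HTML (not copied from the NHLBI site) to show how such a match can miss an element whose classes appear in a different order, and how a CSS selector on a single class avoids that. Whether the real markup ever varies this way is not known.

```python
from bs4 import BeautifulSoup

# Made-up markup: the same classes, listed in a different order on the second tag.
html = """
<h4 class="field field--name-title field__item">Asthma</h4>
<h4 class="field__item field field--name-title">COPD</h4>
"""
soup = BeautifulSoup(html, "html.parser")

# A multi-class string must equal the attribute value exactly,
# so the reordered second heading is not matched.
exact = soup.find_all("h4", class_="field field--name-title field__item")
exact_names = [h.get_text(strip=True) for h in exact]

# A CSS selector on one class matches regardless of order or extra classes.
by_class = soup.select("h4.field--name-title")
by_class_names = [h.get_text(strip=True) for h in by_class]

print(exact_names)
print(by_class_names)
```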

In [None]:
#Creating a dataframe for the topic name, topic link and topic info
data = pd.DataFrame()
data['Condition Name']=health_topics
data['Condition Link']=topic_link
data['Condition Info']=topic_info


In [None]:
data.head(5)

Unnamed: 0,Condition Name,Condition Link,Condition Info
0,Acute Respiratory Distress Syndrome,https://www.nhlbi.nih.gov/health/ards,Acute Respiratory Distress Syndrome (ARDS) is ...
1,Alpha-1 Antitrypsin Deficiency,https://www.nhlbi.nih.gov/health/alpha-1-antit...,Alpha-1 antitrypsin deficiency is an inherited...
2,Anemia,https://www.nhlbi.nih.gov/health/anemia,Anemia is a condition in which the blood has a...
3,Angina (Chest Pain),https://www.nhlbi.nih.gov/health/angina,Angina is chest pain or discomfort that happen...
4,Antiphospholipid Syndrome (APS),https://www.nhlbi.nih.gov/health/antiphospholi...,Antiphospholipid syndrome (APS) is an autoimmu...


In [None]:
data.shape

(102, 3)
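
The dataframe above is assembled from three lists gathered in separate passes, so it relies on those lists having the same length and order. A minimal sketch of a length check that could run before assembling the columns; `check_aligned` is a hypothetical helper. Equal lengths are necessary for the rows to line up, but they do not prove that row i of each list describes the same topic.

```python
def check_aligned(**columns):
    """Return the shared length of the lists, or raise if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


names = ["Anemia", "Asthma"]
links = ["https://www.nhlbi.nih.gov/health/anemia",
         "https://www.nhlbi.nih.gov/health/asthma"]

row_count = check_aligned(names=names, links=links)
print(row_count)

try:
    check_aligned(names=names, links=links[:1])
except ValueError as err:
    print(err)
```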

In [None]:
#Concatenating the three dataframes (data, pd_symptom, and pd_diagnosis) column-wise into one dataset.

new_data = pd.concat([data,pd_symptom,pd_diagnosis], axis=1)

In [None]:
new_data.head(5)

Unnamed: 0,Condition Name,Condition Link,Condition Info,Symptoms,Common Symptoms,Other Symptoms,Medical History,Physical Exam,Diagnostic Test,Other Test
0,Acute Respiratory Distress Syndrome,https://www.nhlbi.nih.gov/health/ards,Acute Respiratory Distress Syndrome (ARDS) is ...,Difficulty breathing is usually the first symp...,,,Your doctor will diagnose acute respiratory di...,"To help diagnose ARDS, your doctor may ask you...",Your doctor will examine you for signs of ARDS...,"To diagnose ARDS, your doctor may have you und..."
1,Alpha-1 Antitrypsin Deficiency,https://www.nhlbi.nih.gov/health/alpha-1-antit...,Alpha-1 antitrypsin deficiency is an inherited...,,,,,,,
2,Anemia,https://www.nhlbi.nih.gov/health/anemia,Anemia is a condition in which the blood has a...,,,,Blood test results\nThe table below shows some...,,,
3,Angina (Chest Pain),https://www.nhlbi.nih.gov/health/angina,Angina is chest pain or discomfort that happen...,Symptoms vary based on the\ntype of angina\nyo...,The main symptom of angina is chest pain or di...,Symptoms of angina can be different for women ...,Your healthcare provider may diagnose angina b...,Your healthcare provider will want to learn ab...,"As part of a physical exam, your healthcare pr...","Depending on your symptoms and risk factors, y..."
4,Antiphospholipid Syndrome (APS),https://www.nhlbi.nih.gov/health/antiphospholi...,Antiphospholipid syndrome (APS) is an autoimmu...,,,,,,,


In [None]:
new_data.shape

(102, 10)
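
`pd.concat(..., axis=1)` pairs rows by index label rather than by position. In this notebook all three dataframes carry the default 0 to 101 index, so the two coincide. The toy frames below (made-up values) sketch what happens when they do not, and how resetting the index restores positional pairing.

```python
import pandas as pd

left = pd.DataFrame({"Condition Name": ["Anemia", "Asthma"]})
# Same two rows of data, but with the index labels swapped.
right = pd.DataFrame({"Symptoms": ["first row", "second row"]}, index=[1, 0])

# Rows are matched on index labels: label 0 of `left` is paired with
# label 0 of `right`, which is its second row.
by_label = pd.concat([left, right], axis=1)

# Dropping the old index first pairs the rows purely by position.
by_position = pd.concat([left, right.reset_index(drop=True)], axis=1)

print(by_label.loc[0, "Symptoms"])
print(by_position.loc[0, "Symptoms"])
```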

In [None]:
#Export the combined data to a CSV file, without the index column

new_data.to_csv('nhlbi.csv', index=False)
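
The scraped text contains embedded newlines (from `separator="\n"`), so a single record can span several physical lines in the CSV file. The sketch below uses the standard `csv` module and an in-memory buffer, rather than pandas and a real file, to show that a quoted multi-line field survives a write and read round trip. The sample row is shortened from the output above.

```python
import csv
import io

rows = [
    {"Condition Name": "Angina (Chest Pain)",
     "Symptoms": "Symptoms vary based on the\ntype of angina"},
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Condition Name", "Symptoms"])
writer.writeheader()
writer.writerows(rows)

# Read the buffer back: the field containing a newline was quoted on write,
# so it is restored as one value.
buffer.seek(0)
restored = list(csv.DictReader(buffer))

print(restored[0]["Symptoms"])
```

Tools that read the file line by line, rather than with a CSV parser, will see such a record as several lines.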