# Notebook Introduction

- This notebook demonstrates several tasks that involve web scraping and file handling using Python libraries such as requests, BeautifulSoup, and urllib. The code consists of three main tasks:



     1. Extracting and saving webpage content using BeautifulSoup:
    
        We'll load a webpage, extract paragraph content, and save it into a local file.
    
     2. Downloading and saving a text file from a URL:
    
        This demonstrates how to download content from a given URL and save it locally.
    
     3. Scraping links from dynamically generated URLs:
    
        We'll generate URLs based on specific characters, extract the links from each page, then filter and deduplicate them.

# Import Libraries

In [1]:
import requests

from bs4 import BeautifulSoup

# Extracting and Saving Data from a Webpage using requests and BeautifulSoup

In [2]:
def GetPage(Link, FileName):
    # Fetch the page and parse its HTML.
    page = requests.get(Link)
    soup = BeautifulSoup(page.content, 'html.parser')

    # Collect the paragraphs once and reuse the list.
    paragraphs = soup.find_all('p')
    print(f'Number of paragraphs is :  {len(paragraphs)}')

    # Print the title only if the page has an element with id="firstHeading".
    title = soup.find(id="firstHeading")
    if title is not None:
        print(f'Page title is :   {title.string}')

    if len(paragraphs) == 0:
        return None

    # The with block closes the file even if a write fails.
    with open(FileName, 'w', encoding='utf-8') as f:
        for p in paragraphs:
            f.write(p.get_text())
            f.write('\n')

    print(f'Page saved in {FileName}')

Explanation:



1. Importing Libraries:



     - requests: Used to send HTTP requests to the URL and fetch the webpage content.
    
     - BeautifulSoup: Used for parsing the HTML content and extracting specific elements from the webpage.

2. GetPage Function:



      - This function takes a URL (Link) and a file name (FileName) as input.
    
      - We use requests.get(Link) to download the webpage content.
    
      - BeautifulSoup parses the page content and makes it easier to navigate and extract data.
    
      - We count the number of paragraph (p) tags in the page using soup.find_all("p").
    
      - If the page has a title (with id="firstHeading"), we print it.
    
      - If the page contains no paragraphs, the function returns None.
    
      - Otherwise, it opens a file and writes the extracted text of each paragraph to the file.

3. Example Usage:

     - The next cell runs GetPage on the "Ahly Club" Wikipedia page and saves the extracted content in a file called Ahly.text.
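
To see what the parsing steps in GetPage do without making a network request, here is a minimal sketch that runs the same BeautifulSoup calls on a small inline HTML string. The sample markup and the helper name extract_paragraphs are illustrative assumptions, not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h1 id="firstHeading">Sample title</h1>
  <p>First paragraph.</p>
  <p>Second <b>paragraph</b>.</p>
</body></html>
"""

def extract_paragraphs(html):
    """Return the page title (or None) and the text of every <p> tag."""
    soup = BeautifulSoup(html, 'html.parser')
    heading = soup.find(id="firstHeading")
    title = heading.get_text() if heading is not None else None
    # get_text() flattens nested tags such as <b> into plain text.
    paragraphs = [p.get_text() for p in soup.find_all('p')]
    return title, paragraphs

title, paragraphs = extract_paragraphs(SAMPLE_HTML)
print(title)
print(paragraphs)
```

Returning the values instead of writing them straight to a file makes the parsing step easy to check on its own; the file-writing step can then consume the returned list.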

In [3]:
GetPage('https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D9%86%D8%A7%D8%AF%D9%8A_%D8%A7%D9%84%D8%A3%D9%87%D9%84%D9%8A_(%D9%85%D8%B5%D8%B1)','Ahly.text')

Number of paragraphs is :  123
Page title is :   النادي الأهلي (مصر)
Page saved in Ahly.text


# Downloading and Saving a Text File from a URL

In [4]:
import urllib.request

url = 'https://raw.githubusercontent.com/HeshamAsem/NLTK/master/Files/HardTimes.txt'

response = urllib.request.urlopen(url)

data = response.read().decode('utf8')

print(data)

Thomas Gradgrind, sir.  A man of realities.  A man of facts and calculations.  A man who proceeds upon the principle that two and two are four, and nothing over, and who is not to be talked into allowing for anything over.  Thomas Gradgrind, sir—peremptorily Thomas—Thomas Gradgrind.  With a rule and a pair of scales, and the multiplication table always in his pocket, sir, ready to weigh and measure any parcel of human nature, and tell you exactly what it comes to.  It is a mere question of figures, a case of simple arithmetic.  You might hope to get some other nonsensical belief into the head of George Gradgrind, or Augustus Gradgrind, or John Gradgrind, or Joseph Gradgrind (all supposititious, non-existent persons), but into the head of Thomas Gradgrind—no, sir!

In such terms Mr. Gradgrind always mentally introduced himself, whether to his private circle of acquaintance, or to the public in general.  In such terms, no doubt, substituting the words ‘boys and girls,’ for ‘sir,’ Thoma

In [5]:
def download_and_save_file(url, file_name):
    try:
        # Download the raw bytes and decode them as UTF-8 text.
        with urllib.request.urlopen(url) as response:
            data = response.read().decode('utf-8')

        # Write the decoded text to a local file.
        with open(file_name, 'w', encoding='utf-8') as file:
            file.write(data)

        print(f"Data has been successfully saved to '{file_name}'")
    except urllib.error.URLError as e:
        # Covers network failures and HTTP errors (HTTPError subclasses URLError).
        print(f"Failed to download the file: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")




Explanation:



1. Using urllib to Download Content:



     - urllib.request.urlopen(url) opens the URL and returns a response object; response.read() then returns the content as bytes.
    
     - The data is then decoded from bytes to a string using .decode('utf-8').

2. download_and_save_file Function:



     - This function downloads content from a URL and saves it to a local file.
    
     - It handles exceptions like network errors (urllib.error.URLError) and other unforeseen errors.
    
     - It opens a file (file_name) in write mode and saves the content.

3. Example Usage:

     - We download a text file from a GitHub URL and save it locally as HardTimes.txt.
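
The download-and-decode step can be tried offline. The sketch below is an illustration under stated assumptions: fetch_text is a hypothetical helper (not part of the notebook), and a data: URL, which carries its content inline, stands in for the GitHub link so that no network access is needed.

```python
import urllib.error
import urllib.request

def fetch_text(url, encoding='utf-8'):
    """Return the decoded body of url, or None if the request fails."""
    try:
        # The with block closes the response once the bytes are read.
        with urllib.request.urlopen(url) as response:
            return response.read().decode(encoding)
    except urllib.error.URLError as e:
        print(f"Failed to download: {e}")
        return None

# A data: URL embeds its content, so this call needs no network.
print(fetch_text('data:text/plain;charset=utf-8,Hard%20Times'))

# A made-up scheme makes urlopen raise URLError, which fetch_text reports.
print(fetch_text('nosuchscheme://example'))
```

Returning None on failure lets the caller decide whether to skip the file or stop, rather than only printing a message.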




In [6]:
download_and_save_file('https://raw.githubusercontent.com/HeshamAsem/NLTK/master/Files/HardTimes.txt', 'HardTimes.txt')

Data has been successfully saved to 'HardTimes.txt'


# Scraping Links Based on Specific Characters (Arabic Characters)

In [7]:
T = 'دجحخهعغفقثصضذشسيبلاتنمكطظزوةىلارؤءئ'

T = list(set(T))

len(T)

33

In [8]:
urls = []

for t in T:
    # Build the list-page URL for this character.
    url = f'https://www.webteb.com/drug/list/{t}'

    try:
        reqs = requests.get(url)
        reqs.raise_for_status()  # raise an HTTPError for 4xx/5xx responses

        soup = BeautifulSoup(reqs.text, 'html.parser')

        # Collect the href of every anchor tag that has one.
        for link in soup.find_all('a', href=True):
            urls.append(link['href'])

    except requests.exceptions.RequestException as e:
        print(f"Failed to process {url}: {e}")

urls


['/',
 'https://webteb.miavitals.com/',
 'https://webteb.miavitals.com/',
 '/',
 'https://twitter.com/WebTeb_com',
 'https://www.facebook.com/Webteb.net',
 'https://www.instagram.com/webteb/',
 '/medical',
 '/lifestyle',
 'https://baby.webteb.com',
 '/diseases',
 '/drug',
 '/testyourself',
 'https://webteb.miavitals.com/',
 'https://www.webteb.com/medical',
 'https://www.webteb.com/body-organs',
 'https://www.webteb.com/dental-health',
 'https://www.webteb.com/heart',
 'https://www.webteb.com/alternative-medicine',
 'https://www.webteb.com/woman-health',
 'https://www.webteb.com/cancer',
 'https://www.webteb.com/eye-health',
 'https://www.webteb.com/sex-education',
 'https://www.webteb.com/mental-health',
 'https://www.webteb.com/symptoms',
 'https://www.webteb.com/diabetes',
 'https://www.webteb.com/medical-technology',
 'https://news.webteb.com',
 'https://baby.webteb.com',
 'https://baby.webteb.com/حاسبة-الحمل-وموعد-الولادة',
 'https://baby.webteb.com/baby-names',
 'https://baby.web

Explanation:



1. List T:



 - We start with a string T containing Arabic characters. set(T) removes any duplicates, and list(set(T)) converts it back to a list.

2. Looping over T:



 - For each character t in T, the f-string builds a separate URL (e.g., https://www.webteb.com/drug/list/د).

3. Making Requests:



 - requests.get(url) sends a GET request to the generated URL, and reqs.raise_for_status() raises an exception if the response is an HTTP error (4xx or 5xx).

 - soup = BeautifulSoup(reqs.text, 'html.parser') parses the page content.

4. Extracting Links:



 - We use soup.find_all('a', href=True) to find all anchor tags with href attributes, and append the URLs to the urls list.

5. Exception Handling:



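 - The try-except block catches requests.exceptions.RequestException, which covers connection failures as well as the HTTP errors raised by raise_for_status(). The failing URL is printed and the loop moves on to the next character.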
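
Two details of the loop can be illustrated without any requests: how an Arabic character looks once it is percent-encoded in a URL, and how relative href values such as '/drug' relate to the page they came from. This is a minimal sketch using only urllib.parse; the helper name build_list_url and the short sample string are assumptions made for illustration.

```python
from urllib.parse import quote, urljoin

BASE = 'https://www.webteb.com/drug/list/'

def build_list_url(char):
    """Build the list-page URL for one character, percent-encoded as UTF-8."""
    return BASE + quote(char)

# sorted() gives a stable order; a bare set has no guaranteed order.
chars = sorted(set('دجدح'))
for c in chars:
    print(build_list_url(c))

# urljoin resolves relative hrefs against the page URL and
# leaves absolute hrefs unchanged.
page_url = build_list_url('د')
for href in ['/drug', 'https://baby.webteb.com']:
    print(urljoin(page_url, href))
```

Resolving every href with urljoin before storing it would put relative and absolute links into one comparable form, which matters for any later filtering by URL prefix.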

In [9]:
len(urls)

8430

### The next cells keep only the drug URLs collected from the WebTeb website

In [10]:
U = [i for i in urls if 'https://www.webteb.com/drug' in i]

len(U)

3153
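
The substring test above keeps any link that contains the drug prefix. As a possible alternative, the sketch below parses each link and checks the host and path separately. The helper name is_drug_page is hypothetical, and the rule it encodes (host www.webteb.com, path starting with /drug/) is an assumption about the site's URL layout, not something the site documents.

```python
from urllib.parse import urlparse

def is_drug_page(url):
    """True for absolute links of the form https://www.webteb.com/drug/<name>."""
    parts = urlparse(url)
    return parts.netloc == 'www.webteb.com' and parts.path.startswith('/drug/')

# A few links in the shapes that appear in the scraped output.
sample = [
    '/drug',
    'https://www.webteb.com/medical',
    'https://www.webteb.com/drug/druginteractions',
    'https://baby.webteb.com',
]
print([u for u in sample if is_drug_page(u)])
```

A path check alone cannot tell a drug page from a tool page such as druginteractions, so a real filter would likely need an extra exclusion rule for those.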

In [11]:
len(list(set(U)))

700
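
Converting to a set removes duplicates but does not keep the links in the order they were scraped. If that order matters, one option is dict.fromkeys, sketched here on a few placeholder URLs (the /drug/a and /drug/b paths are made up for illustration).

```python
# Placeholder links; the first one is repeated.
links = [
    'https://www.webteb.com/drug/a',
    'https://www.webteb.com/drug/b',
    'https://www.webteb.com/drug/a',
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats while preserving first-seen order.
unique_links = list(dict.fromkeys(links))
print(unique_links)
```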

In [12]:
U = list(set(U))

U

['https://www.webteb.com/drug/سيفترياكسون',
 'https://www.webteb.com/drug/druginteractions',
 'https://www.webteb.com/drug/فينيتوين',
 'https://www.webteb.com/drug/اليسكيرين',
 'https://www.webteb.com/drug/ابومورفين',
 'https://www.webteb.com/drug/دانازول',
 'https://www.webteb.com/drug/بروبنسيد',
 'https://www.webteb.com/drug/سوكرالفات',
 'https://www.webteb.com/drug/ايكونازول',
 'https://www.webteb.com/drug/اكسيد-الزرنيخ-الثلاثي',
 'https://www.webteb.com/drug/سيروليموس',
 'https://www.webteb.com/drug/سيسبلاتين',
 'https://www.webteb.com/drug/كالسيتونين',
 'https://www.webteb.com/drug/سيتيريزين',
 'https://www.webteb.com/drug/دروسبيرينون',
 'https://www.webteb.com/drug/ليفوكبستين',
 'https://www.webteb.com/drug/سولفاديازين-الفضة',
 'https://www.webteb.com/drug/سوماتروبين',
 'https://www.webteb.com/drug/كربونات-الكالسيوم',
 'https://www.webteb.com/drug/اينفوبيرتيد',
 'https://www.webteb.com/drug/ليفوفلوكساسين',
 'https://www.webteb.com/drug/ناليديكسيك-اسيد',
 'https://www.webteb.com/d