The updated line `response.encoding = response.apparent_encoding or response.encoding` works because it gets the **character encoding** right before the page reaches the parser, and correct decoding is what the later **HTML structure handling** depends on.

### 1. **Character Encoding**
Web pages can use different character encodings (like `UTF-8`, `GB2312`, or others), especially for non-Latin scripts like Chinese. If the encoding is not correctly handled, you may encounter garbled text or errors. Here's why the approach works:

- **HTTP Response Encoding**:  
  When `requests.get(url)` fetches a page, `response.encoding` is taken from the `charset` in the HTTP `Content-Type` header. If the header declares a `text/*` type without a charset, `requests` falls back to `ISO-8859-1`, which garbles Chinese text.

- **`response.apparent_encoding`**:  
  `response.apparent_encoding` is the encoding `requests` guesses from the response bytes themselves, independent of what the headers say. The detection is done by `chardet` or `charset-normalizer`, whichever is installed, and it is a heuristic, so it can occasionally be wrong.

- **Setting `response.encoding`**:  
  Assigning `response.apparent_encoding` to `response.encoding` makes `response.text` decode the bytes with the detected encoding, so the Chinese (or other) characters come out readable. If `response.apparent_encoding` is `None`, the `or` keeps the encoding `requests` had already chosen from the headers.
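
To see why the chosen encoding matters, the decoding step can be reproduced without any network access. The sketch below is illustrative only: it assumes the page is served as UTF-8 and decodes the same bytes once with `ISO-8859-1` (the fallback `requests` uses for `text/*` responses with no declared charset) and once with `UTF-8`.

```python
# Bytes as a server might send them, assuming the page is UTF-8 encoded.
raw = "PDF下载".encode("utf-8")

# Decoding with the wrong codec does not raise; it silently yields mojibake.
garbled = raw.decode("iso-8859-1")

# Decoding with the codec that matches the bytes restores the text.
correct = raw.decode("utf-8")

print(len(raw))               # 9: three ASCII bytes plus three bytes per Chinese character
print(garbled == "PDF下载")   # False
print(correct == "PDF下载")   # True
```

Because the wrong decode succeeds without an error, the problem usually only shows up later, when a text comparison fails.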

### 2. **HTML Parsing with BeautifulSoup**
Once the content is decoded correctly, `BeautifulSoup` can handle and parse the HTML structure effectively. Here's how it works:

- **Chinese Text Support**:  
  `BeautifulSoup` doesn't require special handling for Chinese or other scripts once the HTML is correctly decoded. It works on the parsed Unicode representation of the content.

- **Tag and Text Matching**:  
  When you use `soup.find('a', string='PDF下载')` (the `string` argument was called `text` in older versions), `BeautifulSoup`:
  - Searches for the first `<a>` tag whose text is exactly `'PDF下载'`.
  - Compares against the decoded text (now readable Chinese), so the match only succeeds if the decoding was correct.
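
Because a plain string passed to `string=` is compared exactly, surrounding whitespace in the link text is enough to miss the tag. The sketch below uses a made-up HTML fragment (the `href` values are hypothetical) to contrast the exact match with a more tolerant regular-expression match.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment: the second link pads the same text with spaces.
html = (
    '<p>'
    '<a href="first.pdf">PDF下载</a>'
    '<a href="second.pdf"> PDF下载 </a>'
    '<a href="index.html">首页</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the tag's text exactly, so only the first link matches.
exact = soup.find_all("a", string="PDF下载")

# A compiled pattern is applied as a search, so the padded text matches as well.
loose = soup.find_all("a", string=re.compile("PDF下载"))

print([a["href"] for a in exact])  # ['first.pdf']
print([a["href"] for a in loose])  # ['first.pdf', 'second.pdf']
```

If a page is known to pad its link text, the pattern form may be the safer choice; otherwise the exact match keeps the lookup strict.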

### Why It Matters
If the encoding isn't correctly handled, the Chinese characters (like `PDF下载`) might appear as unreadable symbols (e.g., `æŒ‰è¦ä½¿ç”¨`) in `response.text`. `soup.find` would then return `None`, because the garbled text never equals `'PDF下载'`.
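
This failure can be reproduced with a small sketch. It builds a hypothetical one-link page as UTF-8 bytes, decodes it with a matching and a mismatched codec, and runs the same `find` call on both; the `href` value is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, encoded as UTF-8 the way a server might send it.
page_bytes = '<a href="example.pdf">PDF下载</a>'.encode("utf-8")

# Same bytes, decoded with the matching codec and with a mismatched one.
good_soup = BeautifulSoup(page_bytes.decode("utf-8"), "html.parser")
bad_soup = BeautifulSoup(page_bytes.decode("iso-8859-1"), "html.parser")

good_tag = good_soup.find("a", string="PDF下载")
bad_tag = bad_soup.find("a", string="PDF下载")

print(good_tag["href"])  # example.pdf
print(bad_tag)           # None
```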

### Summary of Why This Works
1. **Accurate Encoding Detection**:
   - Ensures the response content is decoded correctly, preserving Chinese characters and other special symbols.

2. **Robust Parsing**:
   - `BeautifulSoup` can reliably locate the desired elements in the properly decoded HTML structure.

3. **Seamless Integration**:
   - `requests` handles the decoding and `BeautifulSoup` handles the parsing, so no extra code is needed for non-Latin scripts.

This combination lets the script work on pages containing Chinese or other non-Latin text, as long as the detected encoding is correct.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Function to extract the PDF link from the HTML
def get_pdf_link(url):
    try:
        # Fetch the webpage content; the timeout keeps the call from hanging indefinitely
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Check for HTTP request errors

        # Properly decode the response content
        # Use response.apparent_encoding to handle Chinese characters if needed
        response.encoding = response.apparent_encoding or response.encoding

        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')
        # print(soup.prettify())  # Uncomment to inspect the parsed HTML

        # Locate the <a> tag with the "PDF下载" text
        pdf_tag = soup.find('a', string='PDF下载')
        if pdf_tag:
            # Construct the absolute URL for the PDF link
            pdf_url = urljoin(url, pdf_tag['href'])
            return pdf_url
        else:
            return "PDF link not found."
    except requests.exceptions.RequestException as e:
        return f"Error occurred: {e}"

# Example usage
url = "http://paper.people.com.cn/rmrb/pc/layout/202412/25/node_01.html"
pdf_link = get_pdf_link(url)
print("PDF Link:", pdf_link)
```

Output:

```
PDF Link: http://paper.people.com.cn/rmrb/pc/attachement/202412/25/58d97d37-6229-4f3b-872d-3911f85cb11b.pdf
```
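
The `urljoin` call in the function above is what turns the link's `href` into a full URL. As a minimal sketch, assuming the page uses a relative `href` of the form shown (the paths and file names below are hypothetical, not copied from the site), the resolution works like this:

```python
from urllib.parse import urljoin

base = "http://paper.people.com.cn/rmrb/pc/layout/202412/25/node_01.html"

# Hypothetical relative href: each "../" climbs one directory from the page's folder.
relative = urljoin(base, "../../../attachement/202412/25/example.pdf")

# Hypothetical absolute href: urljoin returns it unchanged.
absolute = urljoin(base, "http://example.com/files/example.pdf")

print(relative)  # http://paper.people.com.cn/rmrb/pc/attachement/202412/25/example.pdf
print(absolute)  # http://example.com/files/example.pdf
```

Passing every `href` through `urljoin` means the function does not need to know in advance whether the site uses relative or absolute links.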
