[SITES] https://www.bbc.com #625

Open
Sriram629009746 opened this issue Mar 18, 2024 · 7 comments

Comments

@Sriram629009746

First please check that it is really an issue with the library, and not some special case of website:

[ ] There is no paywall
[ ] You do not have to be logged in to see the articles
[ ] You tried using a common browser user agent in your configuration / call
[ ] The website is not in the list of well known problematic sites

Your report as follows:

Website that does not parse correctly:

https://www.bbc.com

Some sample urls that I have tried

https://www.bbc.com/news/world-australia-67832905
https://www.bbc.com/news/business-67470876

The exact code I used to test these articles/website

import newspaper

url = "https://www.bbc.com/news/business-67470876"
result = newspaper.article(url)
print(result.text)
# the text is empty

What parts of the article are missing / not parsed correctly:
[ ] Text Content

Other information, remarks, messages, etc:
It was working until a few days ago. I am using the package with version 0.9.2

@AndyTheFactory
Owner

Hi there,

Make sure that you are not blocked by the BBC. Try:

import requests
response = requests.get("https://www.bbc.com/news/business-67470876")
print(response.status_code)
print(response.text)

Also check whether the article object has some content in its html property.
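
A quick way to run that check is to report which fields of the parsed article are populated. This is only a minimal sketch: `populated_fields` is a hypothetical helper (not part of newspaper), and the stand-in object below merely mimics the situation described in this thread, where everything except text is filled in.

```python
from types import SimpleNamespace

# Fields mentioned in this thread; adjust the tuple as needed.
FIELDS = ("title", "authors", "html", "text")


def populated_fields(article):
    """Map each field name to True if the article has a non-empty value for it."""
    return {name: bool(getattr(article, name, None)) for name in FIELDS}


# Stand-in object so the sketch runs offline. With the library installed,
# one would pass the result of newspaper.article(url) instead.
fake_article = SimpleNamespace(
    title="Some title",
    authors=["Some author"],
    html="<html><body><p>Body</p></body></html>",
    text="",
)

for name, is_set in populated_fields(fake_article).items():
    print(f"{name}: {'populated' if is_set else 'EMPTY'}")
```

If html is populated but text is empty, the download itself probably worked and the problem lies in the extraction step.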

Can you also try the same code with v0.9.3?

@Sriram629009746
Author

Thank you for responding.

I am not blocked by the BBC. Most other fields are non empty like title, authors and even html which seems to have the text of the article. Just the 'text' field in the output which used to be populated earlier is now empty.

I noticed that the website UI of the BBC has changed. I think that could be the reason for this issue.

I am unable to import newspaper after installing the v 0.9.3. There seems to be some dependency issue which I am trying to figure out. But I don't think this could be the problem.

@AndyTheFactory
Owner

> I am unable to import newspaper after installing the v 0.9.3. There seems to be some dependency issue which I am trying to figure out. But I don't think this could be the problem.

What's the error? I did in fact change how dependencies are installed.

> I am not blocked by the BBC. Most other fields are non empty like title, authors and even html which seems to have the text of the article. Just the 'text' field in the output which used to be populated earlier is now empty.

You are right, there seems to be an issue with the BBC site. I will investigate it.

@AndyTheFactory
Owner

AndyTheFactory commented Mar 18, 2024

Yeah, there is a problem. It seems that bbc.com is now rendered dynamically: the page is constructed with JavaScript after it loads. Here you can see that there are no text elements to render without JavaScript: https://www.textise.net/showText.aspx?strURL=https%253A//www.bbc.com/news/business-67470876

Quick Fixes:

Anyway, I will think about alternative solutions.

@Sriram629009746
Author

The html component of the response seems to have the text content although not in a contiguous paragraph form. Maybe that is something to look at.
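
One way to look at this is to pull the paragraph text straight out of the html field. The following is a minimal sketch using only the Python standard library. It assumes the article body sits in `<p>` elements, which is not confirmed in this thread, and `ParagraphCollector` and `paragraphs_to_text` are hypothetical names, not part of newspaper.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text content of every <p> element in a document."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # An unclosed previous <p> ends when a new one starts.
            self._flush()
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        # Text inside inline tags such as <b> is still inside the paragraph.
        if self._in_paragraph:
            self._parts.append(data)

    def _flush(self):
        # Collapse runs of whitespace and drop empty paragraphs.
        text = " ".join("".join(self._parts).split())
        if text:
            self.paragraphs.append(text)
        self._parts = []
        self._in_paragraph = False


def paragraphs_to_text(html):
    """Join all paragraph texts with blank lines between them."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector._flush()  # pick up a trailing unclosed <p>, if any
    return "\n\n".join(collector.paragraphs)


sample = "<div><p>First <b>bold</b> part.</p><span>skip</span><p>Second\n  part.</p></div>"
print(paragraphs_to_text(sample))
```

This is cruder than the library's own extraction (it would also pick up paragraphs from navigation or footers), so it is only meant for inspecting what the html field actually contains.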

I tried v0.9.3 on Google Colab. Regarding the error while importing newspaper, this is what I got when I ran pip install:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 0.24.0 requires pandas<2.1.4,>=1.5.0, but you have pandas 2.2.1 which is incompatible.
google-colab 1.0.0 requires pandas==1.5.3, but you have pandas 2.2.1 which is incompatible.
Successfully installed feedparser-6.0.11 newspaper4k-0.9.3 numpy-1.26.4 pandas-2.2.1 requests-file-2.0.0 sgmllib3k-1.0.0 tldextract-5.1.1 tzdata-2024.1

When I try to import after installation, I get this error:
[image: import error, not reproduced here]

@2dareis2do

On the subject of the BBC and dates: why does a BBC article now prepend the date to the _text string?

e.g.

Published\n\n8 March\n\n

Source: https://www.bbc.co.uk/news/uk-england-london-68511760

@2dareis2do

Interesting about Playwright and Textise.

Content for the BBC seems to be in the main page render, and also attached to a window object inside a script tag that carries a nonce.

e.g.

<script nonce>
window.__INITIAL_DATA__={}
</script>
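
If the payload really is assigned like that, it could be read without running any JavaScript. Below is a minimal sketch, assuming the assigned value is valid JSON; `extract_initial_data` is a hypothetical helper name, and the second decode is a guess that covers the case where the payload is stored as a JSON-encoded string.

```python
import json
import re

# The variable name is taken from the snippet above.
_ASSIGNMENT = re.compile(r"window\.__INITIAL_DATA__\s*=\s*")


def extract_initial_data(html):
    """Return the value assigned to window.__INITIAL_DATA__, or None if not found."""
    match = _ASSIGNMENT.search(html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores what follows it (a trailing ';' or the closing </script>).
        value, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    # Assumption: the payload may itself be a JSON-encoded string.
    if isinstance(value, str):
        try:
            value = json.loads(value)
        except json.JSONDecodeError:
            pass
    return value


sample = '<script nonce>\nwindow.__INITIAL_DATA__={"a": {"b": [1, 2]}};\n</script>'
print(extract_initial_data(sample))
```

Where the article text lives inside that structure is not stated in this thread, so the result would still need to be inspected by hand.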
