
fix(crawler): using newspaper and fixed recursive by merging content #955

Merged
merged 2 commits on Aug 15, 2023

Conversation

StanGirard
Collaborator

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.
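
The PR title indicates the crawler now uses the newspaper library for content extraction. As a rough sketch only, this is what extraction with newspaper3k typically looks like; the function name and structure here are assumptions, not the PR's actual crawler code.

from newspaper import Article

def extract_with_newspaper(url: str) -> str:
    # Download and parse the page, then return its plain-text body.
    # Error handling is omitted here; see the review comments below.
    article = Article(url)
    article.download()
    article.parse()
    return article.text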

Checklist before requesting a review

Please delete options that are not relevant.

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented hard-to-understand areas
  • I have ideally added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged

Screenshots (if appropriate):


@github-actions
Contributor

github-actions bot commented Aug 15, 2023

Risk Level 2 - /home/runner/work/quivr/quivr/backend/core/crawl/crawler.py

  1. The print(e) statement in the exception handling blocks is not a good practice. It's better to use logging instead of print statements. This will give you more control over what gets logged, where it goes, and how it's formatted. For example:

import logging

try:
    ...  # some code
except Exception as e:
    logging.error(e)
    raise

  2. The extract_content method returns None when an exception occurs. This could potentially cause issues if the return value is used without checking for None. It would be better to raise an exception and handle it in the calling code (a sketch of this follows the list).

  3. The checkGithub method could be improved by using a more robust method to check if the URL is a GitHub URL. Currently, it simply checks if 'github.com' is in the URL string, which could lead to false positives. Consider using the urlparse function from the urllib.parse module to parse the URL and check the netloc attribute:

from urllib.parse import urlparse

def checkGithub(self):
    parsed_url = urlparse(self.url)
    return parsed_url.netloc == 'github.com'

  4. The slugify function could be improved by using a library like python-slugify, which handles unicode and special characters better (a second sketch follows the list).
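
A minimal sketch of point 2, raising instead of returning None; the class shape, helper name, and exception type here are assumptions, not the crawler's actual code.

import logging

logger = logging.getLogger(__name__)

class ContentExtractionError(Exception):
    """Raised when a page's content cannot be extracted."""

class Crawler:
    def __init__(self, url: str):
        self.url = url

    def extract_content(self) -> str:
        try:
            # hypothetical extraction step; the real implementation differs
            return self._download_and_parse(self.url)
        except Exception as e:
            logger.error("Failed to extract content from %s: %s", self.url, e)
            # raise instead of returning None so callers must handle failure
            raise ContentExtractionError(self.url) from e

    def _download_and_parse(self, url: str) -> str:
        # placeholder for the newspaper-based extraction
        raise NotImplementedError

Callers can then catch ContentExtractionError explicitly or let it propagate, instead of silently passing None around.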

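A minimal sketch of point 4, delegating to python-slugify; the wrapper name and example string are assumptions.

from slugify import slugify  # pip install python-slugify

def make_slug(title: str) -> str:
    # python-slugify transliterates unicode and strips special characters,
    # e.g. make_slug("Crème brûlée!") -> "creme-brulee"
    return slugify(title)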


Powered by Code Review GPT

@sentry-io

sentry-io bot commented Aug 16, 2023

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ TypeError: write() argument must be str, not None (in /crawl)
  • ‼️ TypeError: unsupported operand type(s) for +=: 'NoneType' and 'str' (in /crawl)
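
Both tracebacks are consistent with a None return from extract_content (point 2 in the review above). A hypothetical illustration of a guard that avoids them; every name here is assumed rather than taken from the crawler:

def save_page(extract_content, file_path: str) -> None:
    # extract_content is any callable that may return None on failure
    content = extract_content()
    if content is None:
        # without this check, writing or concatenating the None produces
        # exactly the two TypeErrors listed above
        raise ValueError("content extraction failed")
    with open(file_path, "w") as f:
        f.write(content)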


StanGirard added a commit that referenced this pull request Sep 12, 2023
…955)

* fix(crawler): using newspaper and fixed recursive by merging content

* feat(code-review): added feedback from code review