
fix(crawler): using newspaper and fixed recursive by merging content #955

Merged
merged 2 commits on Aug 15, 2023

Conversation

StanGirard
Collaborator

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.
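
The PR title indicates the crawler now uses the newspaper library for content extraction. As a rough sketch only, this is what extraction with newspaper3k typically looks like; the function name and structure here are assumptions, not the PR's actual crawler code.

from newspaper import Article

def extract_with_newspaper(url: str) -> str:
    # Download and parse the page, then return its plain-text body.
    # Error handling is omitted here; see the review comments below.
    article = Article(url)
    article.download()
    article.parse()
    return article.text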

Checklist before requesting a review

Please delete options that are not relevant.

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented hard-to-understand areas
  • I have ideally added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged

Screenshots (if appropriate):


@github-actions
Contributor

github-actions bot commented Aug 15, 2023

Risk Level 2 - /home/runner/work/quivr/quivr/backend/core/crawl/crawler.py

  1. The print(e) statement in the exception handling blocks is not a good practice. It's better to use logging instead of print statements. This will give you more control over what gets logged, where it goes, and how it's formatted. For example:

import logging

try:
    ...  # some code
except Exception as e:
    logging.error(e)
    raise

  2. The extract_content method returns None when an exception occurs. This could potentially cause issues if the return value is used without checking for None. It would be better to raise an exception and handle it in the calling code (a sketch of this follows the list).

  3. The checkGithub method could be improved by using a more robust method to check if the URL is a GitHub URL. Currently, it simply checks if 'github.com' is in the URL string, which could lead to false positives. Consider using the urlparse function from the urllib.parse module to parse the URL and check the netloc attribute:

from urllib.parse import urlparse

def checkGithub(self):
    parsed_url = urlparse(self.url)
    return parsed_url.netloc == 'github.com'

  4. The slugify function could be improved by using a library like python-slugify, which handles unicode and special characters better (a second sketch follows the list).
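
A minimal sketch of point 2, raising instead of returning None; the class shape, helper name, and exception type here are assumptions, not the crawler's actual code.

import logging

logger = logging.getLogger(__name__)

class ContentExtractionError(Exception):
    """Raised when a page's content cannot be extracted."""

class Crawler:
    def __init__(self, url: str):
        self.url = url

    def extract_content(self) -> str:
        try:
            # hypothetical extraction step; the real implementation differs
            return self._download_and_parse(self.url)
        except Exception as e:
            logger.error("Failed to extract content from %s: %s", self.url, e)
            # raise instead of returning None so callers must handle failure
            raise ContentExtractionError(self.url) from e

    def _download_and_parse(self, url: str) -> str:
        # placeholder for the newspaper-based extraction
        raise NotImplementedError

Callers can then catch ContentExtractionError explicitly or let it propagate, instead of silently passing None around.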

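A minimal sketch of point 4, delegating to python-slugify; the wrapper name and example string are assumptions.

from slugify import slugify  # pip install python-slugify

def make_slug(title: str) -> str:
    # python-slugify transliterates unicode and strips special characters,
    # e.g. make_slug("Crème brûlée!") -> "creme-brulee"
    return slugify(title)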


Powered by Code Review GPT

@sentry-io

sentry-io bot commented Aug 16, 2023

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ TypeError: write() argument must be str, not None (in /crawl)
  • ‼️ TypeError: unsupported operand type(s) for +=: 'NoneType' and 'str' (in /crawl)
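
Both tracebacks are consistent with a None return from extract_content (point 2 in the review above). A hypothetical illustration of a guard that avoids them; every name here is assumed rather than taken from the crawler:

def save_page(extract_content, file_path: str) -> None:
    # extract_content is any callable that may return None on failure
    content = extract_content()
    if content is None:
        # without this check, writing or concatenating the None produces
        # exactly the two TypeErrors listed above
        raise ValueError("content extraction failed")
    with open(file_path, "w") as f:
        f.write(content)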


StanGirard added a commit that referenced this pull request Sep 12, 2023
…955)

* fix(crawler): using newspaper and fixed recursive by merging content

* feat(code-review): added feedback from code review