# __Webscraper to Obtain Article Content__
### __Author__: Lucas Vicente Nascimento
### __Date__: 03/19/2024
### __Introduction__
> The purpose of this project was to practice my data scraping skills while also applying other tools and good practices in programming, documentation, portfolio development, and skill progression.

> __Disclaimer:__ Since I was free to choose the article, the code developed here is based on the HTML structure of one specific URL. Therefore, it will not work as-is on another website.
The article I used for the project is from a Brazilian financial market news page, InfoMoney.
Here is the URL: [From FAAMG to Seven Wonders or Fantastic Four: are the big techs really that expensive?](https://www.infomoney.com.br/colunistas/convidados/de-faamg-para-sete-maravilhas-ou-quarteto-fantastico-afinal-as-big-techs-estao-mesmo-caras/)

### __Development Process__
##### 1. __Understanding the expected goal__
To build a simple web scraper that will return the content of a news article when provided a specific URL. Some examples of real products using similar technologies include price tracking websites and SEO audit tools that can extract top search results.

##### 2. __Understanding the requirements__
* Choose a news site. Given a specific URL of an article from the chosen site, return the title and content of the article to the user.
* __For an extra challenge__: Parse information such as the article's title, update date, and author, and return each of them separately to the user.

##### 3. __Choosing tools__
* Python (version: 3.11.8)
    * Requests (to access the page)
    * BeautifulSoup (for parsing the page)
* BeautifulSoup Documentation
* ChatGPT
* Git
* GitHub
* VSCode
* Jupyter Notebook

##### 4. __Definition of a timeline__
> Initially, I set a deadline of 3 days, but in fact I managed to finish in 10 hours!

##### 5. __Algorithm Draft__
1. Enter URL;
2. Load page;
3. Extract Title;
4. Extract Date and Time;
5. Extract Author's Name;
6. Extract Content;

##### 6. __Coding and Testing__

In [None]:
# Import libraries for accessing content and parsing
import requests
from bs4 import BeautifulSoup

# Function to receive URL
def get_article(url):

    # Loading page
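    response = requests.get(url, timeout=10) # Avoid hanging forever if the site does not answer
    response.raise_for_status() # Stop early on HTTP errors such as 404
    page = response.text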
    soup = BeautifulSoup(page, 'html.parser')

    # Scraping information from the page
    title = soup.find('h1') # Starting with the title, of course :)
    print(f'Title: {title.text}')

    date = soup.find('time') # The date the article was written
    print(f'Date: {date.text}')

    # Extract information about the author
    author_div = soup.find("div", class_="single__author-info") # Finding the <div> tag
    author_span = author_div.find("span", class_="typography__body--5") # Finding the <span> tag
    author_name = author_span.get_text(separator=" ", strip=True) # Finally getting the text we want
    print(f'Author: {author_name}')

    # Extract article content
    # The article body has two sections: first the disclaimer, then the actual content.
    content_div = soup.find('div', class_='single__content')
    disclaimer_div = content_div.find('div', class_="single__disclaimer")
    disclaimer_text = disclaimer_div.get_text(separator=" ", strip=True)
    print(f"Disclaimer:\n{disclaimer_text}") # Disclaimer

    content_section = content_div.find('div', class_="element-border--bottom spacing--pb4")
    content_text = content_section.get_text(separator=" ", strip=True)
    print(f"Content:\n{content_text}") # Full content

In [None]:
# Calling the function and testing
get_article('https://www.infomoney.com.br/colunistas/convidados/de-faamg-para-sete-maravilhas-ou-quarteto-fantastico-afinal-as-big-techs-estao-mesmo-caras/')
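
The requirements ask for the fields to be returned to the user, while the function above prints them. As a minimal sketch of one alternative, and not the version used in this project, the parsing can be separated from the download and return a dictionary. The name `parse_article`, the inner helper `text_of`, and the `SAMPLE_HTML` markup are hypothetical; the sample only imitates the class names used above so the code can run offline.

```python
from bs4 import BeautifulSoup


def parse_article(html):
    """Return the article fields as a dict instead of printing them."""
    soup = BeautifulSoup(html, "html.parser")

    def text_of(tag):
        # find() returns None when nothing matches, so guard before reading the text
        if tag is None:
            return None
        return tag.get_text(separator=" ", strip=True)

    author_div = soup.find("div", class_="single__author-info")
    author_span = None
    if author_div is not None:
        author_span = author_div.find("span", class_="typography__body--5")

    return {
        "title": text_of(soup.find("h1")),
        "date": text_of(soup.find("time")),
        "author": text_of(author_span),
        "content": text_of(soup.find("div", class_="single__content")),
    }


# Hypothetical, simplified markup used only to exercise the function offline
SAMPLE_HTML = """
<h1>Example title</h1>
<time>19 Mar 2024</time>
<div class="single__author-info"><span class="typography__body--5">Jane Doe</span></div>
<div class="single__content"><p>First paragraph.</p><p>Second paragraph.</p></div>
"""

article = parse_article(SAMPLE_HTML)
print(article["title"])
print(article["author"])
```

Returning a dictionary lets the caller decide what to do with each field (print it, save it, compare it), and a missing element shows up as `None` instead of stopping the script.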

##### 7. __Challenges Faced and Solutions__
- __Correct identification of HTML elements for data collection;__
> I don't have a background in HTML and I'm a beginner in programming, so this was a tough challenge: frustrating when you don't know what you're doing, but rewarding when you have an "EUREKA!" moment. A basic lesson I learned about tags is to pay close attention to nesting, since there are so many _divs_ inside _divs_ that contain a _span_, and then the text turns out to be in an _a_ inside a _p_ hahahahaha.

> __Solution:__ I read the documentation thoroughly and asked ChatGPT how tags work, all to make sure my code was finding the right tags.

![image.png](attachment:image.png)
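
To illustrate the nesting problem described above, here is a small sketch on made-up markup (the class names `outer` and `inner` are hypothetical and do not come from the InfoMoney page). It shows two ways BeautifulSoup can reach a deeply nested tag: chaining `find` calls one level at a time, or describing the whole path with a CSS selector through `select_one`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the nesting: div > div > span, with the text in p > a
html = """
<div class="outer">
  <div class="inner">
    <span>label</span>
    <p><a href="#">the text we want</a></p>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: walk down one level at a time with find()
outer = soup.find("div", class_="outer")
inner = outer.find("div", class_="inner")
link = inner.find("p").find("a")
print(link.get_text(strip=True))

# Option 2: describe the whole path at once with a CSS selector
same_link = soup.select_one("div.outer div.inner p > a")
print(same_link.get_text(strip=True))
```

Chained `find` calls make each step easy to inspect while debugging; the selector version is shorter once the path is known.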

- __Handling errors during testing;__
> I owe you the screenshots of the countless errors I received in the console, a thousand apologies!

> __Solution:__ I drew inspiration from code I had written for Coursera exercises; the difference is that there we extracted tables rather than text.

![image-2.png](attachment:image-2.png)
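
The original console errors are not shown, so the following is only an assumption about one common cause in scripts like this: `find` returns `None` when no tag matches, and reading `.text` from `None` raises an `AttributeError`. A minimal sketch of a helper that fails with a more readable message instead (the name `require` is hypothetical):

```python
from bs4 import BeautifulSoup


def require(parent, name, **attrs):
    """Like parent.find(), but fail with a readable message when nothing matches."""
    tag = parent.find(name, **attrs)
    if tag is None:
        raise LookupError(f"No <{name}> matching {attrs} was found")
    return tag


# Tiny made-up page that has a title but no content <div>
soup = BeautifulSoup("<h1>Example title</h1>", "html.parser")

print(require(soup, "h1").get_text())

try:
    require(soup, "div", class_="single__content")
except LookupError as error:
    print(f"Scraping problem: {error}")
```

With a message that names the missing tag, it is quicker to tell whether the selector is wrong or the page layout has changed.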

### __Results and Impacts__
- Knowledge and practice of HTML tags;
- Practical application of good programming practices;
- Improvement in algorithm development, allowing for better code structuring.

### __Documentation, Uploads, and Posts__
__Relevant Links__
- [Project GitHub Repository](https://github.com/Lucas-Nascimento-Tech/Dev-Projects-Repo)
- [Project Documentation](https://github.com/Lucas-Nascimento-Tech/Dev-Projects-Repo/blob/main/README-(PT-BR).md)
- [My portfolio containing other projects and other information](https://lucasnascimentoportfolio.carrd.co/#)