- This project implements a web scraper that finds news on a topic entered by the user.
- The user types a topic, and the scraper returns the latest news headlines with links to the articles.
- The news is scraped from Google search results.
- Written in JavaScript and HTML/CSS.
- Implemented as a client-server application using socket.io and expressJS.
- Also uses the Puppeteer and CheerioJS libraries for scraping.
- All the code associated with the client side can be found inside the public directory.
- main.html contains the HTML code for the interface, which displays the news headlines and links once the user searches for a topic.
- styles.css contains all the styles required for main.html.
- process.js processes the topic entered by the user and sends a request (using socket.io) to the server-side script, which then returns the news resources. It renders the news in a readable format.
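The client-side flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in process.js: the event names `topic` and `news`, the `{ headline, link }` item shape, and the helper names are all assumptions made for the example.

```javascript
// Hypothetical event names; check process.js and index.js for the real ones.
const TOPIC_EVENT = 'topic';
const NEWS_EVENT = 'news';

// Escape text before placing it in HTML so scraped headlines cannot inject markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Turn [{ headline, link }, ...] into a list of clickable headlines.
function renderNews(items) {
  const rows = items.map(
    (item) =>
      `<li><a href="${escapeHtml(item.link)}" target="_blank">${escapeHtml(item.headline)}</a></li>`
  );
  return `<ul>${rows.join('')}</ul>`;
}

// Wire a socket (for example the one returned by io() in the browser) to a
// render callback. Returns a function that sends a topic to the server.
function wireClient(socket, onRender) {
  socket.on(NEWS_EVENT, (items) => onRender(renderNews(items)));
  return (topic) => socket.emit(TOPIC_EVENT, topic.trim());
}
```

In the browser, `wireClient(io(), (html) => { container.innerHTML = html; })` would return a `search(topic)` function to call from the search box, where `container` is whatever element main.html uses for the results.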
- index.js receives the topic entered by the user from the client file (process.js).
- Using the Puppeteer and CheerioJS libraries, it scrapes the news and returns the headlines and their links to the client file using socket.io.
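A rough sketch of the server-side wiring might look like this. It is not the actual index.js: the event names `topic` and `news`, the port `3000`, serving the public directory through `express.static`, and the `scrapeNews` callback are assumptions for illustration, and the `Server` import assumes socket.io v3 or later.

```javascript
// Register the socket handlers. `io` is a socket.io server and `scrapeNews`
// is any async function (topic) => [{ headline, link }, ...].
// The event names 'topic' and 'news' are hypothetical.
function registerScraper(io, scrapeNews) {
  io.on('connection', (socket) => {
    console.log(`Connected to socket ${socket.id}`);
    socket.on('topic', async (topic) => {
      try {
        socket.emit('news', await scrapeNews(topic));
      } catch (err) {
        console.error(`Scrape failed for "${topic}":`, err.message);
        socket.emit('news', []); // let the client stop waiting
      }
    });
  });
}

// Start an Express app with socket.io attached. The requires are kept inside
// the function so registerScraper can be exercised without these packages.
function startServer(scrapeNews, port = 3000) {
  const express = require('express');
  const http = require('http');
  const { Server } = require('socket.io');

  const app = express();
  app.use(express.static('public')); // serve main.html, styles.css, process.js
  const server = http.createServer(app);
  registerScraper(new Server(server), scrapeNews);
  server.listen(port, () => console.log(`Listening on port ${port}`));
  return server;
}
```

Keeping the scraping function as a parameter means the socket handling can be tried out with a stub before Puppeteer is involved.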
- expressJS: used to implement the server side. It facilitates the use of sockets.
- socket.io: to implement sockets on the client and server files to exchange information.
- Puppeteer: to connect to the web, and perform simple functions such as heading to a website, typing, searching, etc. It was used to perform a Google search for the news on the topic entered by the user, and locate the "News" tab, which displays many "News-cards" or news headlines on the topic. It was also used to retrieve the HTML content of the destination web page.
- CheerioJS: to query the HTML contents of the webpage retrieved using Puppeteer. It works much like jQuery, and elements of the HTML webpage can be retrieved using the same syntax. It was thus useful in accessing the "News-card" elements and their sub-elements, such as news headlines and links.
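The division of labor between Puppeteer and CheerioJS can be sketched as below. This is an illustration under stated assumptions, not the repository's code: it opens a Google News results URL directly (using the `tbm=nws` query parameter) instead of typing the query and clicking the "News" tab as the project does, and the selectors passed to `extractNews` are hypothetical placeholders that must be replaced after inspecting the live page.

```javascript
// Build a Google News search URL. Going straight to this URL is a shortcut
// assumed for this sketch; the project types the query and opens the News tab.
function buildNewsUrl(topic) {
  return `https://www.google.com/search?q=${encodeURIComponent(topic)}&tbm=nws`;
}

// Load the results page in a headless browser and return its HTML.
// `puppeteer` is passed in, e.g. require('puppeteer').
async function fetchNewsHtml(puppeteer, topic) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(buildNewsUrl(topic), { waitUntil: 'networkidle2' });
    return await page.content();
  } finally {
    await browser.close(); // always release the browser
  }
}

// Pull headline/link pairs out of the HTML with jQuery-style selectors.
// `cheerio` is passed in, e.g. require('cheerio'). The selectors are
// hypothetical: inspect the destination page to find the real ones.
function extractNews(cheerio, html, selectors) {
  const $ = cheerio.load(html);
  const items = [];
  $(selectors.card).each((_, card) => {
    const headline = $(card).find(selectors.headline).text().trim();
    const link = $(card).find(selectors.link).attr('href');
    if (headline && link) items.push({ headline, link });
  });
  return items;
}
```

A call might look like `extractNews(require('cheerio'), html, { card: '.news-card', headline: 'h3', link: 'a' })`, with the selector values swapped for the ones found on the actual page.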
- Make sure that the above-mentioned packages have been installed.
- Open a terminal and navigate to the directory where the repository is saved/cloned. Then run node index.js.
- Once the code is running, open the HTML file in a web browser. The code running in the terminal will confirm the connection by printing "Connected to socket" followed by its ID.
- You're good to go! Type in the topic for the news, and wait a few seconds for the results. There you'll have it!
- Read the comments in the code for a better understanding of how it works behind the scenes. Inspecting the destination webpage (Ctrl+Shift+I) is highly recommended to better understand its HTML structure.
- Running the code in the console. The user is notified once the HTML page is opened in the browser.
- Searching for a topic. The results might take a few seconds to load.
- Clicking on a news item takes the user to the news article's page.
- Go through the documentation/tutorials for the libraries used.
- Read the comments in the code for a better understanding of how it works behind the scenes.
- Check the terminal console where the code is executed (for index.js, the server file) and the browser console (for process.js, the client file) to keep track of the events taking place in the server and client files during execution.
- Inspecting the destination webpage (Ctrl+Shift+I) is highly recommended to better understand its HTML structure.
- Set the headless option of Puppeteer to false to watch its actions in the browser and see which elements are clicked on. This option is passed when Puppeteer is launched.
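As a small illustration of the last tip, the launch options could be switched with a flag. The helper name `launchOptions` and the `slowMo` value of 100 ms are assumptions for this sketch; `headless` and `slowMo` themselves are standard Puppeteer launch options.

```javascript
// Build Puppeteer launch options. With debug on, the browser window is visible
// and each action is delayed so clicks and typing can be followed by eye.
function launchOptions(debug) {
  return debug
    ? { headless: false, slowMo: 100 } // slowMo is in milliseconds
    : { headless: true };
}

// Usage (assumes puppeteer is installed):
//   const browser = await require('puppeteer').launch(launchOptions(true));
```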