An eclectic news search engine that helps with deduplication of news articles and enriches your reading experience.
Contributors and Contact Information: [Sharan Babu, sharanbabu2001@gmail.com / 19211a05q9@bvrit.ac.in, www.sharanbabu.ml]
Problem Statement Addressed: [Reduce The Noise Of News Search (No. 7)]
Description:
Time is of the essence, so it is important to spend it consuming only the information you need. This project addresses that problem in the domain of news search. Gemini is an eclectic news search engine that helps with deduplication of news articles and enriches your reading experience. It enables readers to dig deep into a particular opinion or explore multiple facets of the news at hand.
Submission Video : Link
Current Solution Export: Link
Fun Fact: 'Gemini' is synonymous with the word 'clone' or 'twin', so I thought it would be a cool play on words for a project that deals with deduplication.
Productivity and value for time are ever so important in today's fast-moving world. We are constantly in search of tools that bolster this notion, sometimes without realizing that we might be sacrificing something important in the process. Take news search as an example. As end consumers, we want quick access to happenings around the world, and the phones in our pockets readily fulfil this requirement. But in the quest for speed, are we getting the entire picture of the event (news) at hand? Probably not. That is where Gemini comes in. With the power of graph technology, Gemini provides meaningful and otherwise overlooked insights that can change your viewpoint on a matter. With Gemini, you not only save time by not reading the same content over and over again, but can also answer interesting questions like:
- Which countries hold which opinion about an event?
- What inherent bias exists in the subject matter being discussed?
- How are two entities (like countries) related to an event?
- How are keywords and topics connected to news from different sources?
- How central is a particular opinion?
and many more such questions. The revelations are often eye-opening and help lead to healthy conclusions.
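As a rough illustration of the first question, the sketch below averages a sentiment score per country over a handful of made-up article records. The field names (`country`, `sentiment`), the score range and the sample values are all assumptions for illustration, not Gemini's actual schema or data. Gemini itself answers such questions by querying the graph, so treat this only as a plain-Python picture of the aggregation involved.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical article records: field names, the [-1, 1] sentiment scale and
# the values themselves are assumptions for illustration only.
articles = [
    {"country": "US", "sentiment": 0.5},
    {"country": "US", "sentiment": 0.25},
    {"country": "IN", "sentiment": -0.5},
    {"country": "GB", "sentiment": 0.0},
]


def opinion_by_country(records):
    """Return the mean sentiment score for each country in `records`."""
    scores = defaultdict(list)
    for record in records:
        scores[record["country"]].append(record["sentiment"])
    return {country: mean(values) for country, values in scores.items()}


summary = opinion_by_country(articles)
for country, score in sorted(summary.items()):
    print(f"{country}: {score:+.3f}")
```

A positive mean suggests broadly favourable coverage from that country and a negative mean the opposite. A real analysis would also want the number of articles behind each mean.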
Gemini uses a hyper-node based schema. The hyper-nodes (called 'ANCHOR NEWS' in the context of this project) lead to related children nodes and are themselves linked to each other with a semantic similarity score. This models news deduplication in such a way that deduplication is inherent, and the schema supports a wide range of custom queries for obtaining varied and meaningful insights efficiently. The schema is easily extendable with new properties/attributes for news articles.
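The idea can be pictured with a small in-memory sketch. It is not the project's code: the `AnchorNews` class, the `build_graph` function, the 0.6 threshold and the sample titles are hypothetical, and a simple word-overlap (Jaccard) score stands in for the semantic similarity model. An incoming article that is close enough to an existing anchor becomes one of its children. Otherwise it becomes a new anchor, linked to earlier anchors by their similarity score.

```python
from dataclasses import dataclass, field


@dataclass
class AnchorNews:
    """A hyper-node: one representative article plus its near-duplicates."""
    title: str
    children: list = field(default_factory=list)
    similar: dict = field(default_factory=dict)  # other anchor index -> score


def similarity(a, b):
    """Toy stand-in for a semantic model: Jaccard overlap of lowercase words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def build_graph(titles, duplicate_threshold=0.6):
    """Group titles under anchors and link anchors by similarity score."""
    anchors = []
    for title in titles:
        scores = [similarity(title, anchor.title) for anchor in anchors]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= duplicate_threshold:
            # Near-duplicate: attach it to the closest existing anchor.
            anchors[best].children.append(title)
            continue
        # Otherwise create a new anchor and link it to related anchors.
        new_index = len(anchors)
        new_anchor = AnchorNews(title)
        for index, score in enumerate(scores):
            if score > 0:
                new_anchor.similar[index] = score
                anchors[index].similar[new_index] = score
        anchors.append(new_anchor)
    return anchors


titles = [
    "Central bank raises interest rates",
    "Central bank raises interest rates again",
    "Markets react to interest rates decision",
    "Local team wins championship",
]
anchors = build_graph(titles)
for index, anchor in enumerate(anchors):
    print(index, anchor.title, "| children:", len(anchor.children),
          "| linked anchors:", sorted(anchor.similar))
```

Because duplicates are attached as children at insertion time, a reader browsing anchors never sees the same story twice, while the anchor-to-anchor scores remain available for exploring related coverage.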
The way the graph has been modelled allows for it to be extended for non-news use cases as well. It is magic when you are able to explore the populated graph in a Graph Studio or visualization tool of your choice. The connections induce new ideas as to how the graph can be queried in creative ways to mine new patterns.
Gemini is trying to solve a pressing and ever-existent problem. This project has the potential to help people form concerted and unbiased opinions thereby leading to a safe and just society. Gemini is highly scalable and generalizable as well.
Schema:
Gemini 👥 helps address a lot of use cases and its easy schema extensibility makes it convenient and simple to add new dynamics to the existing graph. Examples of use cases:
- Normal user browsing a news topic
- Organization checking public reception of its latest developments in the news
- Organization checking how its competitors are performing in the eyes of the public
- Empathetic understanding of the problem
- Exploratory analysis (eclectic surfing) & subject-matter deep dive (related news drill-down)
Other additions:
- Data: Initial data fetched from https://newscatcherapi.com/ and later enriched (metadata creation) using different NLP models.
- Technology Stack: Python, TigerGraph, Dataframe Processing, Semantic Search, Sentiment Analysis, Keyword Generation, Website (Streamlit)
- Visuals/Project Images: Link
- Project Link: Link
This project was built using Python. Required libraries can be found in the requirements.txt file of this repository.
- Open this Google Colab Link or download the Gemini.ipynb file from this repository and open it using a Jupyter Notebook (you can also simply open the Gemini.ipynb file in GitHub itself and click on the 'Open in Colab' badge at the top). Data is fetched from the newscatcher API, and my API key is hardcoded in the notebook. You can use the same key while executing or get your own at https://app.newscatcherapi.com/auth/register .
- Simply execute all the cells in the Colaboratory notebook in order. Each cell documents the purpose it serves and, if it takes a while to run, its execution time. Only 3 cells in the entire notebook involve values that may need changing:
2a. Enter your preferred news search term in this cell before executing.
2b. In this cell, we establish a connection to our TigerGraph instance. Create an instance with the Blank Starter Kit and replace the connection parameters and credentials in this cell with yours.
2c. The following cell requires no change if the above 2 cells were changed accordingly.
- All the other cells can be executed as they are. The Colab notebook is structured so that the flow through it is smooth for the person running it, and each component of this project is executed separately with an example to show how the project was built.
- If you used the Colab notebook for execution, download the files called anchor_vertices_list and children_news_info from the left pane.
- Clone this repository.
git clone https://github.com/Sharan-Babu/Gemini.git
- Ensure that you have the files anchor_vertices, children_news_info and gemini.py in the same folder. If you changed the search term, replace the anchor_vertices and children_news_info files in the cloned repository with the ones you downloaded from the Colab notebook. Now we are ready to spin up the website.
- Install Streamlit
pip install streamlit
- Run the Web Application
streamlit run gemini.py
- You can now view the website. You can also visit your instance's GraphStudio to view all the populated data and explore the graph.
The solution and the way it has been built can be extended into a general-purpose search engine that handles regular (non-news) websites as well, across different search terms.
I explored and used graph technology for the first time, and it was a very fulfilling experience. I learned how existing news search engines work, their strengths and weaknesses, and how cutting-edge advances in NLP (keyword generation, semantic search) can make news search better for readers.
Participating in the hackathon was a pleasant experience. Ample resources, time and support were provided.
How was it built?
- Fetched news articles from newscatcherapi.com. Used the pyTigerGraph Python library to interact with the TigerGraph instance.
- Used 3 different NLP models for Semantic Search, Keyword Generation and Sentiment Analysis respectively. These enrich the news articles with additional metadata.
- Later, loaded all data into the TigerGraph instance and processed/explored it in several ways using custom GSQL queries as well as graph algorithms such as centrality and similarity.
- Finally, results are shown to the end-user in a Streamlit web application.
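To make the loading step a little more concrete, here is a minimal sketch of building a request body in the nested JSON shape that TigerGraph's upsert endpoint accepts (vertices keyed by type and id, edges keyed by source type, source id, edge type, target type and target id, with each attribute wrapped as `{"value": ...}`). The vertex and edge type names (`Anchor_News`, `Child_News`, `HAS_CHILD`), the attribute names and the helper `build_upsert_payload` are hypothetical. The project's real schema and loading code live in the notebook and may differ.

```python
import json


def build_upsert_payload(anchor_id, anchor_attrs, children):
    """Build an upsert body for one anchor vertex and its child vertices.

    `children` maps each child vertex id to a dict of its attributes.
    All type names used here are illustrative assumptions.
    """
    def wrap(attrs):
        # The upsert format wraps every attribute as {"value": ...}.
        return {name: {"value": value} for name, value in attrs.items()}

    vertices = {
        "Anchor_News": {anchor_id: wrap(anchor_attrs)},
        "Child_News": {cid: wrap(attrs) for cid, attrs in children.items()},
    }
    edges = {
        "Anchor_News": {
            anchor_id: {
                "HAS_CHILD": {
                    # Edges without attributes take an empty dict.
                    "Child_News": {cid: {} for cid in children}
                }
            }
        }
    }
    return {"vertices": vertices, "edges": edges}


payload = build_upsert_payload(
    "a1",
    {"title": "Central bank raises interest rates", "sentiment": 0.25},
    {"c1": {"title": "Central bank raises interest rates again"}},
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary would then be sent to the instance's upsert endpoint, or the same data could be loaded through pyTigerGraph's upsert helpers instead of hand-built JSON.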
Link to my TigerGraph notes: https://github.com/Sharan-Babu/Gemini/blob/main/TigerGraph%20Notes.pdf
Helpful Code Snippets for learning TigerGraph: https://github.com/Sharan-Babu/Gemini/tree/main/code_snippets
Useful Reference Links:
- https://docs.tigergraph.com/cloud/start/overview
- https://www.tigergraph.com/graphstudio/
- https://docs.tigergraph.com/graph-ml/current/intro/
- https://colab.research.google.com/drive/1JhYcnGVWT51KswcXZzyPzKqCoPP5htcC
- https://pytigergraph.github.io/pyTigerGraph/
- https://docs.tigergraph.com/tigergraph-server/current/api/built-in-endpoints#_upsert_data_to_graph
- https://newscatcherapi.com/
- https://github.com/MaartenGr/KeyBERT
- https://graphforall.devpost.com/details/inspiration#h_1353131838381643228912926
Special thanks to Ashleigh Faith for the great problem statement, detailed description and attached resources.