## **Section 1: General Knowledge**

### 1. **What is a graph database, and how does it differ from relational databases?**
- A graph database is a type of NoSQL database designed to store and query data as a graph of nodes and the relationships between them.
#### Differences from Relational Databases

1. **Data Representation**:
   - **Graph Databases**: Data is stored as a network of interconnected nodes and edges, making it easier to model complex relationships.
   - **Relational Databases**: Data is stored in tables with rows and columns. Relationships are represented through foreign keys and join operations.

2. **Query Performance**:
   - **Graph Databases**: Efficient for queries involving complex relationships and traversals, as they use index-free adjacency (each node directly points to its related nodes).
   - **Relational Databases**: Can become inefficient for complex queries involving multiple joins, as they require multiple table lookups¹².

3. **Schema Flexibility**:
   - **Graph Databases**: Schema-less or schema-flexible, allowing for dynamic changes in data structure without significant overhead.
   - **Relational Databases**: Schema-based, requiring predefined schemas and often needing alterations for structural changes¹².

4. **Use Cases**:
   - **Graph Databases**: Ideal for applications with complex relationships, such as social networks, fraud detection, and recommendation engines.
   - **Relational Databases**: Suitable for applications requiring strong consistency and structured data, such as financial systems and e-commerce platforms³⁴.
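
The query-performance difference can be illustrated with a minimal sketch in plain Python. This is not how either kind of database engine is implemented, and `follows_rows` and `adjacency` are made-up data: the relational style answers a two-hop question by passing over a join table twice, while the graph style follows direct neighbor references. Real relational engines use indexes rather than full scans, so treat this only as an intuition aid.

```python
# Illustrative data only: the same "follows" relationships in two layouts.
follows_rows = [("alice", "bob"), ("bob", "carol"), ("carol", "dave")]

# Graph-style layout: each node keeps direct references to its neighbors.
adjacency = {}
for src, dst in follows_rows:
    adjacency.setdefault(src, []).append(dst)


def two_hops_relational(start):
    """Two-hop lookup via two passes over the join table (a self-join)."""
    first = [dst for src, dst in follows_rows if src == start]
    return sorted({dst for src, dst in follows_rows if src in first})


def two_hops_graph(start):
    """Two-hop lookup by following neighbor references directly."""
    return sorted({fof for friend in adjacency.get(start, [])
                   for fof in adjacency.get(friend, [])})


result_relational = two_hops_relational("alice")
result_graph = two_hops_graph("alice")
```

Both functions return the same answer; the difference is that the graph version only touches the neighbors of the nodes it visits.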

### 2. **Describe the key components of a graph (nodes, relationships, properties) and provide examples using this dataset.**
- A graph is a set of nodes and edges.
- Nodes represent entities. Two nodes connected by a single edge are called neighbors.
- An edge is a connection between two nodes of a graph. There are two types of edges:
    1. Directed edge: has a direction and is drawn as an arrow.
    2. Undirected edge: has no direction and is drawn as a simple line.
- The number of edges attached to a node is called its degree. An edge that connects a node to itself is called a loop.
- Nodes are joined into paths by edges; a path that starts and ends at the same node is called a cycle.
- Properties are key-value pairs that provide additional information about nodes and edges.
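
These terms can be checked on a tiny example. The sketch below uses a hypothetical undirected edge list (the node names are made up) and computes neighbors, degrees, and loops; the edges `a-b`, `b-c`, `c-a` also form a cycle.

```python
from collections import defaultdict

# Hypothetical undirected edges; ("d", "d") is a loop.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "d")]

neighbors = defaultdict(set)
degree = defaultdict(int)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)
    # Each end of an edge adds one to the degree, so a loop contributes 2.
    degree[u] += 1
    degree[v] += 1

# A loop is an edge whose two ends are the same node.
loops = [(u, v) for u, v in edges if u == v]
```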

***We can understand the key components of a graph via the dataset given to me:***
#### Nodes
- **Source**: The origin of the text (e.g., person, organization, platform).
- **Therapy**: Represents the primary therapy mentioned.
- **Alternative Therapy**: Represents other possible therapies or alternatives discussed.
- **Sentiment**: Can be treated as a node if you want to categorize the overall sentiment as a distinct entity (e.g., positive, neutral, negative).

#### Edges
- **Text to Source**: The text is created by the source, so an edge can link a piece of text (or its id) to the source node.
- **Text to Therapy**: Each text may discuss or reference a specific therapy, so an edge can link the text to the therapy node.
- **Text to Alternative Therapy**: If alternative therapies are mentioned, an edge can link the text to the alternative therapy node.
- **Text to Sentiment**: The text has a sentiment (e.g., positive, negative), so an edge can connect the text to the sentiment node.
- **Comparison Aspect**: Links between different therapies or insights, indicating comparative relationships discussed in the text.

#### Properties
- **Text**: The content or message being analyzed.
- **Timestamp**: When the text was created, linked to either the text or source.
- **URL**: A property linked to the source or text, providing an external reference.
- **HCP**: A property that could represent healthcare professional-related discussions.
- **Insight**: Qualitative insights drawn from the text, could be added as a property to the text node.
- **Comparison Aspect**: The specific aspect being compared between therapies, could be linked to both therapies.

### 3. **What is the property graph model used by Neo4j?**
The property graph model used by Neo4j is a flexible and expressive way to represent data. Here are the key components:

1. **Nodes**: These represent entities or objects in the graph. For example, a node could represent a person, a product, or a location.
2. **Relationships**: These describe how nodes are connected. Each relationship has a direction (from one node to another) and a type (e.g., "FRIENDS_WITH", "PURCHASED").
3. **Properties**: Both nodes and relationships can have properties, which are key-value pairs that store additional information. For example, a person node might have properties like `name` and `age`, while a relationship might have a property like `since` to indicate when the relationship started¹².
4. **Cypher Query Language**: Although not part of the data model itself, Cypher is Neo4j's declarative query language for matching patterns of nodes and relationships in the property graph.

This model allows for a highly intuitive and visual way to represent complex data and relationships, making it particularly useful for applications like social networks, recommendation systems, and fraud detection³.
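
As a rough illustration of the model (not Neo4j's API), the components above can be sketched with two small Python dataclasses. The class names, fields, and sample values here are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set                                       # e.g. {"Person"}
    properties: dict = field(default_factory=dict)    # key-value pairs


@dataclass
class Relationship:
    start: Node                                       # relationships are directed: start -> end
    end: Node
    type: str                                         # e.g. "FRIENDS_WITH"
    properties: dict = field(default_factory=dict)


# Hypothetical sample data.
alice = Node({"Person"}, {"name": "Alice", "age": 30})
bob = Node({"Person"}, {"name": "Bob", "age": 32})
friendship = Relationship(alice, bob, "FRIENDS_WITH", {"since": 2020})
```

Both nodes and the relationship carry their own properties, which is the feature that distinguishes a property graph from a plain node-and-edge graph.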

### 4. **List three common use cases for graph databases. Briefly explain how one of these use cases could be applied to the provided dataset.**
**Here are three common use cases for graph databases**:

1. **Social Networks**: Graph databases are excellent for modeling and querying social networks, where relationships between users (e.g., friendships, followers) are crucial.
2. **Fraud Detection**: They can identify complex patterns and relationships in transaction data to detect fraudulent activities.
3. **Recommendation Engines**: By analyzing user preferences and behaviors, graph databases can provide personalized recommendations for products, content, or services¹².

### Applying Knowledge Graph to the Dataset

Given our dataset with columns like `pid`, `text`, `source`, `timestamp`, `url`, `hcp`, `therapy`, `alternative_therapy`, `insight`, `comparison_aspect`, `discussion`, and `sentiment`, here is how a knowledge graph could be applied:

1. **Nodes**: Represent different entities such as `hcp` (healthcare professionals), `therapy`, `alternative_therapy`, and `insight`.
2. **Relationships**: Connect these nodes based on interactions or discussions. For example, a relationship could be "DISCUSSES" between `hcp` and `therapy`.
3. **Properties**: Store additional information such as `timestamp`, `sentiment`, and `source`.

#### Example Analysis:
1. **Enhanced Search and Retrieval**:
   - Use the knowledge graph to perform advanced searches, such as finding all discussions by a specific HCP about a particular therapy within a given timeframe.
   - Improve retrieval accuracy by leveraging the relationships and properties in the graph.
2. **Contextual Understanding**:
   - Gain deeper insights into how different therapies are perceived by analyzing the sentiment of discussions and the context provided by related insights.
   - Understand the connections between different therapies and alternative therapies through the comparison aspects.
3. **Decision Support**:
   - Use the knowledge graph to support decision-making by identifying key insights and trends in the data.
   - For example, determine which therapies are most frequently discussed positively by influential HCPs, aiding in strategic planning and marketing efforts.
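
The first analysis (all discussions by a given HCP about a given therapy within a timeframe) can be prototyped in pandas before any graph exists. The sketch below is only an illustration: the rows are invented, and the column names `hcp_name`, `therapy_name`, and `sentiment_type` are assumed to come from a flattened version of the dataset.

```python
import pandas as pd

# Hypothetical flattened rows; values are illustrative only.
posts = pd.DataFrame({
    "hcp_name": ["Dr. A", "Dr. A", "Dr. B"],
    "therapy_name": ["KarXT", "KarXT", "KarXT"],
    "timestamp": ["2024-01-10T08:00:00Z", "2024-07-05T09:30:00Z", "2024-07-06T10:00:00Z"],
    "sentiment_type": ["positive", "mixed", "positive"],
})
posts["timestamp"] = pd.to_datetime(posts["timestamp"], utc=True)


def find_discussions(df, hcp, therapy, start, end):
    """Return rows where `hcp` discussed `therapy` between `start` and `end` (inclusive)."""
    start_ts = pd.Timestamp(start, tz="UTC")
    end_ts = pd.Timestamp(end, tz="UTC")
    mask = (
        (df["hcp_name"] == hcp)
        & (df["therapy_name"] == therapy)
        & df["timestamp"].between(start_ts, end_ts)
    )
    return df[mask]


matches = find_discussions(posts, "Dr. A", "KarXT", "2024-06-01", "2024-12-31")
```

In a graph, the same question becomes a pattern match from an HCP node to a therapy node with a filter on the timestamp property.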

## **Section 2: Data Manipulation and Preprocessing**

### 1. **Data Import and Cleaning Task:**  
   - **Task:** Pre-process the provided dataset to handle missing values in key fields like `therapy`, `sentiment`, and `hcp`. Use `pandas` to clean the data and ensure proper formatting before importing into Neo4j.

In [10]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [11]:
df = pd.read_excel('C:\\Users\\karng\\Desktop\\Graph-Data-Science-Assessment\\KG_technical_skills_assessment_data.xlsx')
df

Unnamed: 0,pid,text,source,timestamp,url,hcp,therapy,alternative_therapy,insight,comparison_aspect,discussion,sentiment
0,b047f473-36e7-45ee-b179-338d41491b9f,𝐂𝐥𝐢𝐜𝐤 𝐡𝐞𝐫𝐞 𝐓𝐨 https://lnkd.in/g9XvDJir 𝐠𝐞𝐭 𝐰𝐞𝐥...,linkedin,2024-07-05T09:30:07.936Z,https://www.linkedin.com/pulse/global-olanzapi...,"{'name': '', 'specialty': '', 'affiliation': '...",[],"{'name': '', 'type': '', 'indication': '', 'ma...","{'type': '', 'source': '', 'timestamp': '', 'i...",[],"{'platform': '', 'date': '', 'sentiment': '', ...","{'type': '', 'intensity': ''}"
1,32d7a40b-7b5c-4902-8833-9860636d93b1,#innovation #management #digitalmarketing #tec...,linkedin,2024-08-23T17:30:05.856Z,https://www.linkedin.com/pulse/south-korea-ari...,"{'name': '', 'specialty': '', 'affiliation': '...",[],"{'name': '', 'type': '', 'indication': '', 'ma...","{'type': '', 'source': '', 'timestamp': '', 'i...",[],"{'platform': '', 'date': '', 'sentiment': '', ...","{'type': '', 'intensity': ''}"
2,ed634429-90c3-4bfa-9825-e3bfb6684f58,🚀 TOP 10 #Biopharma Deals of the Year! 🚀\n\nht...,linkedin,2024-06-28T18:15:55.805Z,https://www.linkedin.com/feed/update/urn:li:ac...,"{'name': 'Bristol Myers Squibb', 'specialty': ...","[{'name': 'KarXT', 'type': 'drug', 'indication...","{'name': '', 'type': '', 'indication': '', 'ma...","{'type': 'market impact', 'source': 'industry ...","[{'aspect_name': 'sales potential', 'descripti...","{'platform': 'industry report', 'date': '2023-...","{'type': 'positive', 'intensity': 'high'}"
3,ea5461d0-afde-4ea1-8baf-6bddd33742e7,New in The Value Science Weekly! 📰\n\nIn this ...,linkedin,2023-12-06T12:35:29.186Z,https://www.linkedin.com/pulse/using-artificia...,"{'name': '', 'specialty': '', 'affiliation': '...","[{'name': 'Valbenazine', 'type': 'drug', 'indi...","{'name': '', 'type': '', 'indication': '', 'ma...","{'type': 'AI in healthcare', 'source': 'Value ...","[{'aspect_name': 'value-based pricing', 'descr...","{'platform': 'Value Science Weekly', 'date': '...","{'type': 'mixed', 'intensity': 'moderate'}"
4,a27dadaa-c58d-4259-9742-0a2f7e9e5dff,#innovation #management #digitalmarketing #tec...,linkedin,2024-08-23T01:30:07.894Z,https://www.linkedin.com/pulse/paliperidone-ex...,"{'name': '', 'specialty': '', 'affiliation': '...",[],"{'name': '', 'type': '', 'indication': '', 'ma...","{'type': '', 'source': '', 'timestamp': '', 'i...",[],"{'platform': '', 'date': '', 'sentiment': '', ...","{'type': '', 'intensity': ''}"
...,...,...,...,...,...,...,...,...,...,...,...,...
95,50bb3d9f-7027-46db-b4c2-9686f0a45f17,"Teva Pharmaceuticals, a U.S. affiliate of Teva...",linkedin,2024-06-02T15:39:35.891Z,https://www.linkedin.com/feed/update/urn:li:ac...,"{'name': 'Eric A. Hughes, MD, PhD', 'specialty...","[{'name': 'UZEDY', 'type': 'long-acting inject...","{'name': 'Invega Sustenna', 'type': 'long-acti...","{'type': 'clinical strategy', 'source': 'Psych...","[{'aspect_name': 'pharmacokinetics', 'descript...",{'platform': 'Psych Congress Elevate 2024 Annu...,"{'type': 'positive', 'intensity': 'moderate'}"
96,d7dfaace-8522-41c4-b8eb-20687fbdb97d,I would like to share some insight about using...,linkedin,2023-01-18T14:16:02.074Z,https://www.linkedin.com/pulse/paliperidone-pa...,"{'name': 'Not specified', 'specialty': 'Psychi...","[{'name': 'Paliperidone Palmitate', 'type': 'D...","{'name': 'Not applicable', 'type': 'Not applic...","{'type': 'comparison', 'source': 'Not specifie...","[{'aspect_name': 'Efficacy', 'description': 'C...","{'platform': 'Not specified', 'date': 'Not spe...","{'type': 'Not specified', 'intensity': 'Not sp..."
97,ba25d77c-a06d-4fd8-8b2a-e1d5415b4897,𝐂𝐥𝐢𝐜𝐤 𝐡𝐞𝐫𝐞 𝐓𝐨 https://lnkd.in/dKScmk59 𝐠𝐞𝐭 𝐰𝐞𝐥...,linkedin,2024-08-09T15:30:03.375Z,https://www.linkedin.com/pulse/europe-risperid...,"{'name': '', 'specialty': '', 'affiliation': '...","[{'name': 'Risperidone', 'type': 'drug', 'indi...","{'name': '', 'type': '', 'indication': '', 'ma...","{'type': 'market impact', 'source': 'LinkedIn'...","[{'aspect_name': 'Market Share', 'description'...","{'platform': 'LinkedIn', 'date': '2023-10-01',...","{'type': 'positive', 'intensity': 'moderate'}"
98,83fa1b2a-a38d-4255-9ade-13ccd4b96cb8,A true milestone for any PhD: the first paper ...,linkedin,2024-04-16T12:25:21.693Z,https://www.linkedin.com/feed/update/urn:li:ac...,"{'name': 'Roos van Westrhenen', 'specialty': '...","[{'name': 'Aripiprazole', 'type': 'drug', 'ind...","{'name': 'Unknown', 'type': 'Unknown', 'indica...","{'type': 'systematic review', 'source': 'Journ...","[{'aspect_name': 'Side Effect Profile', 'descr...",{'platform': 'Journal of Psychiatric Research'...,"{'type': 'positive', 'intensity': 'high'}"


In [12]:
df.isnull().sum()
# There are 13 missing values in the 'text' column. The 'therapy', 'sentiment', and 'hcp' columns have no null values, so I am cleaning the data based on what the dataset shows.

pid                     0
text                   13
source                  0
timestamp               0
url                     0
hcp                     0
therapy                 0
alternative_therapy     0
insight                 0
comparison_aspect       0
discussion              0
sentiment               0
dtype: int64
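
`isnull()` only counts true NaN values, while the displayed rows show nested fields filled with empty strings, empty lists, and placeholders such as 'Not specified'. A minimal sketch for counting those "effectively empty" cells follows. It assumes the nested columns are stored as stringified Python literals, and both the placeholder list and the two-row sample are assumptions for illustration.

```python
import ast
import pandas as pd

# Placeholder strings seen in the displayed rows; extend as needed (assumption).
PLACEHOLDERS = {"", "not specified", "not applicable", "unknown"}


def is_effectively_empty(cell):
    """True if a stringified dict/list cell holds no real values."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return True  # NaN or unparseable text is treated as missing
    items = value if isinstance(value, list) else [value]
    for item in items:
        values = item.values() if isinstance(item, dict) else [item]
        if any(str(v).strip().lower() not in PLACEHOLDERS for v in values):
            return False
    return True


# Hypothetical two-row sample shaped like the `sentiment` and `therapy` columns.
sample = pd.DataFrame({
    "sentiment": ["{'type': '', 'intensity': ''}", "{'type': 'positive', 'intensity': 'high'}"],
    "therapy": ["[]", "[{'name': 'KarXT', 'type': 'drug'}]"],
})
empty_counts = sample.apply(lambda col: col.map(is_effectively_empty)).sum()
```

Applied to the real DataFrame, counts like these would show how many rows lack a usable `hcp`, `therapy`, or `sentiment` even though `isnull()` reports zero.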

In [13]:
df = df.dropna()
df

Unnamed: 0,pid,text,source,timestamp,url,hcp,therapy,alternative_therapy,insight,comparison_aspect,discussion,sentiment
0,b047f473-36e7-45ee-b179-338d41491b9f,𝐂𝐥𝐢𝐜𝐤 𝐡𝐞𝐫𝐞 𝐓𝐨 https://lnkd.in/g9XvDJir 𝐠𝐞𝐭 𝐰𝐞𝐥...,linkedin,2024-07-05T09:30:07.936Z,https://www.linkedin.com/pulse/global-olanzapi...,"{'name': '', 'specialty': '', 'affiliation': '...",[],"{'name': '', 'type': '', 'indication': '', 'ma...","{'type': '', 'source': '', 'timestamp': '', 'i...",[],"{'platform': '', 'date': '', 'sentiment': '', ...","{'type': '', 'intensity': ''}"
1,32d7a40b-7b5c-4902-8833-9860636d93b1,#innovation #management #digitalmarketing #tec...,linkedin,2024-08-23T17:30:05.856Z,https://www.linkedin.com/pulse/south-korea-ari...,"{'name': '', 'specialty': '', 'affiliation': '...",[],"{'name': '', 'type': '', 'indication': '', 'ma...","{'type': '', 'source': '', 'timestamp': '', 'i...",[],"{'platform': '', 'date': '', 'sentiment': '', ...","{'type': '', 'intensity': ''}"
2,ed634429-90c3-4bfa-9825-e3bfb6684f58,🚀 TOP 10 #Biopharma Deals of the Year! 🚀\n\nht...,linkedin,2024-06-28T18:15:55.805Z,https://www.linkedin.com/feed/update/urn:li:ac...,"{'name': 'Bristol Myers Squibb', 'specialty': ...","[{'name': 'KarXT', 'type': 'drug', 'indication...","{'name': '', 'type': '', 'indication': '', 'ma...","{'type': 'market impact', 'source': 'industry ...","[{'aspect_name': 'sales potential', 'descripti...","{'platform': 'industry report', 'date': '2023-...","{'type': 'positive', 'intensity': 'high'}"
3,ea5461d0-afde-4ea1-8baf-6bddd33742e7,New in The Value Science Weekly! 📰\n\nIn this ...,linkedin,2023-12-06T12:35:29.186Z,https://www.linkedin.com/pulse/using-artificia...,"{'name': '', 'specialty': '', 'affiliation': '...","[{'name': 'Valbenazine', 'type': 'drug', 'indi...","{'name': '', 'type': '', 'indication': '', 'ma...","{'type': 'AI in healthcare', 'source': 'Value ...","[{'aspect_name': 'value-based pricing', 'descr...","{'platform': 'Value Science Weekly', 'date': '...","{'type': 'mixed', 'intensity': 'moderate'}"
4,a27dadaa-c58d-4259-9742-0a2f7e9e5dff,#innovation #management #digitalmarketing #tec...,linkedin,2024-08-23T01:30:07.894Z,https://www.linkedin.com/pulse/paliperidone-ex...,"{'name': '', 'specialty': '', 'affiliation': '...",[],"{'name': '', 'type': '', 'indication': '', 'ma...","{'type': '', 'source': '', 'timestamp': '', 'i...",[],"{'platform': '', 'date': '', 'sentiment': '', ...","{'type': '', 'intensity': ''}"
...,...,...,...,...,...,...,...,...,...,...,...,...
95,50bb3d9f-7027-46db-b4c2-9686f0a45f17,"Teva Pharmaceuticals, a U.S. affiliate of Teva...",linkedin,2024-06-02T15:39:35.891Z,https://www.linkedin.com/feed/update/urn:li:ac...,"{'name': 'Eric A. Hughes, MD, PhD', 'specialty...","[{'name': 'UZEDY', 'type': 'long-acting inject...","{'name': 'Invega Sustenna', 'type': 'long-acti...","{'type': 'clinical strategy', 'source': 'Psych...","[{'aspect_name': 'pharmacokinetics', 'descript...",{'platform': 'Psych Congress Elevate 2024 Annu...,"{'type': 'positive', 'intensity': 'moderate'}"
96,d7dfaace-8522-41c4-b8eb-20687fbdb97d,I would like to share some insight about using...,linkedin,2023-01-18T14:16:02.074Z,https://www.linkedin.com/pulse/paliperidone-pa...,"{'name': 'Not specified', 'specialty': 'Psychi...","[{'name': 'Paliperidone Palmitate', 'type': 'D...","{'name': 'Not applicable', 'type': 'Not applic...","{'type': 'comparison', 'source': 'Not specifie...","[{'aspect_name': 'Efficacy', 'description': 'C...","{'platform': 'Not specified', 'date': 'Not spe...","{'type': 'Not specified', 'intensity': 'Not sp..."
97,ba25d77c-a06d-4fd8-8b2a-e1d5415b4897,𝐂𝐥𝐢𝐜𝐤 𝐡𝐞𝐫𝐞 𝐓𝐨 https://lnkd.in/dKScmk59 𝐠𝐞𝐭 𝐰𝐞𝐥...,linkedin,2024-08-09T15:30:03.375Z,https://www.linkedin.com/pulse/europe-risperid...,"{'name': '', 'specialty': '', 'affiliation': '...","[{'name': 'Risperidone', 'type': 'drug', 'indi...","{'name': '', 'type': '', 'indication': '', 'ma...","{'type': 'market impact', 'source': 'LinkedIn'...","[{'aspect_name': 'Market Share', 'description'...","{'platform': 'LinkedIn', 'date': '2023-10-01',...","{'type': 'positive', 'intensity': 'moderate'}"
98,83fa1b2a-a38d-4255-9ade-13ccd4b96cb8,A true milestone for any PhD: the first paper ...,linkedin,2024-04-16T12:25:21.693Z,https://www.linkedin.com/feed/update/urn:li:ac...,"{'name': 'Roos van Westrhenen', 'specialty': '...","[{'name': 'Aripiprazole', 'type': 'drug', 'ind...","{'name': 'Unknown', 'type': 'Unknown', 'indica...","{'type': 'systematic review', 'source': 'Journ...","[{'aspect_name': 'Side Effect Profile', 'descr...",{'platform': 'Journal of Psychiatric Research'...,"{'type': 'positive', 'intensity': 'high'}"


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 87 entries, 0 to 99
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   pid                  87 non-null     object
 1   text                 87 non-null     object
 2   source               87 non-null     object
 3   timestamp            87 non-null     object
 4   url                  87 non-null     object
 5   hcp                  87 non-null     object
 6   therapy              87 non-null     object
 7   alternative_therapy  87 non-null     object
 8   insight              87 non-null     object
 9   comparison_aspect    87 non-null     object
 10  discussion           87 non-null     object
 11  sentiment            87 non-null     object
dtypes: object(12)
memory usage: 8.8+ KB


In [19]:
# Drop the columns that are not needed for the graph import
df = df.drop(columns=['pid', 'text', 'comparison_aspect', 'discussion'])
df

Unnamed: 0,source,timestamp,url,hcp,therapy,alternative_therapy,insight,sentiment
0,linkedin,2024-07-05T09:30:07.936Z,https://www.linkedin.com/pulse/global-olanzapi...,"{'name': '', 'specialty': '', 'affiliation': '...",[],"{'name': '', 'type': '', 'indication': '', 'ma...","{'type': '', 'source': '', 'timestamp': '', 'i...","{'type': '', 'intensity': ''}"
1,linkedin,2024-08-23T17:30:05.856Z,https://www.linkedin.com/pulse/south-korea-ari...,"{'name': '', 'specialty': '', 'affiliation': '...",[],"{'name': '', 'type': '', 'indication': '', 'ma...","{'type': '', 'source': '', 'timestamp': '', 'i...","{'type': '', 'intensity': ''}"
2,linkedin,2024-06-28T18:15:55.805Z,https://www.linkedin.com/feed/update/urn:li:ac...,"{'name': 'Bristol Myers Squibb', 'specialty': ...","[{'name': 'KarXT', 'type': 'drug', 'indication...","{'name': '', 'type': '', 'indication': '', 'ma...","{'type': 'market impact', 'source': 'industry ...","{'type': 'positive', 'intensity': 'high'}"
3,linkedin,2023-12-06T12:35:29.186Z,https://www.linkedin.com/pulse/using-artificia...,"{'name': '', 'specialty': '', 'affiliation': '...","[{'name': 'Valbenazine', 'type': 'drug', 'indi...","{'name': '', 'type': '', 'indication': '', 'ma...","{'type': 'AI in healthcare', 'source': 'Value ...","{'type': 'mixed', 'intensity': 'moderate'}"
4,linkedin,2024-08-23T01:30:07.894Z,https://www.linkedin.com/pulse/paliperidone-ex...,"{'name': '', 'specialty': '', 'affiliation': '...",[],"{'name': '', 'type': '', 'indication': '', 'ma...","{'type': '', 'source': '', 'timestamp': '', 'i...","{'type': '', 'intensity': ''}"
...,...,...,...,...,...,...,...,...
95,linkedin,2024-06-02T15:39:35.891Z,https://www.linkedin.com/feed/update/urn:li:ac...,"{'name': 'Eric A. Hughes, MD, PhD', 'specialty...","[{'name': 'UZEDY', 'type': 'long-acting inject...","{'name': 'Invega Sustenna', 'type': 'long-acti...","{'type': 'clinical strategy', 'source': 'Psych...","{'type': 'positive', 'intensity': 'moderate'}"
96,linkedin,2023-01-18T14:16:02.074Z,https://www.linkedin.com/pulse/paliperidone-pa...,"{'name': 'Not specified', 'specialty': 'Psychi...","[{'name': 'Paliperidone Palmitate', 'type': 'D...","{'name': 'Not applicable', 'type': 'Not applic...","{'type': 'comparison', 'source': 'Not specifie...","{'type': 'Not specified', 'intensity': 'Not sp..."
97,linkedin,2024-08-09T15:30:03.375Z,https://www.linkedin.com/pulse/europe-risperid...,"{'name': '', 'specialty': '', 'affiliation': '...","[{'name': 'Risperidone', 'type': 'drug', 'indi...","{'name': '', 'type': '', 'indication': '', 'ma...","{'type': 'market impact', 'source': 'LinkedIn'...","{'type': 'positive', 'intensity': 'moderate'}"
98,linkedin,2024-04-16T12:25:21.693Z,https://www.linkedin.com/feed/update/urn:li:ac...,"{'name': 'Roos van Westrhenen', 'specialty': '...","[{'name': 'Aripiprazole', 'type': 'drug', 'ind...","{'name': 'Unknown', 'type': 'Unknown', 'indica...","{'type': 'systematic review', 'source': 'Journ...","{'type': 'positive', 'intensity': 'high'}"


In [21]:
df = df.drop(['therapy'],axis = 1)
df

Unnamed: 0,source,timestamp,url,hcp,alternative_therapy,insight,sentiment
0,linkedin,2024-07-05T09:30:07.936Z,https://www.linkedin.com/pulse/global-olanzapi...,"{'name': '', 'specialty': '', 'affiliation': '...","{'name': '', 'type': '', 'indication': '', 'ma...","{'type': '', 'source': '', 'timestamp': '', 'i...","{'type': '', 'intensity': ''}"
1,linkedin,2024-08-23T17:30:05.856Z,https://www.linkedin.com/pulse/south-korea-ari...,"{'name': '', 'specialty': '', 'affiliation': '...","{'name': '', 'type': '', 'indication': '', 'ma...","{'type': '', 'source': '', 'timestamp': '', 'i...","{'type': '', 'intensity': ''}"
2,linkedin,2024-06-28T18:15:55.805Z,https://www.linkedin.com/feed/update/urn:li:ac...,"{'name': 'Bristol Myers Squibb', 'specialty': ...","{'name': '', 'type': '', 'indication': '', 'ma...","{'type': 'market impact', 'source': 'industry ...","{'type': 'positive', 'intensity': 'high'}"
3,linkedin,2023-12-06T12:35:29.186Z,https://www.linkedin.com/pulse/using-artificia...,"{'name': '', 'specialty': '', 'affiliation': '...","{'name': '', 'type': '', 'indication': '', 'ma...","{'type': 'AI in healthcare', 'source': 'Value ...","{'type': 'mixed', 'intensity': 'moderate'}"
4,linkedin,2024-08-23T01:30:07.894Z,https://www.linkedin.com/pulse/paliperidone-ex...,"{'name': '', 'specialty': '', 'affiliation': '...","{'name': '', 'type': '', 'indication': '', 'ma...","{'type': '', 'source': '', 'timestamp': '', 'i...","{'type': '', 'intensity': ''}"
...,...,...,...,...,...,...,...
95,linkedin,2024-06-02T15:39:35.891Z,https://www.linkedin.com/feed/update/urn:li:ac...,"{'name': 'Eric A. Hughes, MD, PhD', 'specialty...","{'name': 'Invega Sustenna', 'type': 'long-acti...","{'type': 'clinical strategy', 'source': 'Psych...","{'type': 'positive', 'intensity': 'moderate'}"
96,linkedin,2023-01-18T14:16:02.074Z,https://www.linkedin.com/pulse/paliperidone-pa...,"{'name': 'Not specified', 'specialty': 'Psychi...","{'name': 'Not applicable', 'type': 'Not applic...","{'type': 'comparison', 'source': 'Not specifie...","{'type': 'Not specified', 'intensity': 'Not sp..."
97,linkedin,2024-08-09T15:30:03.375Z,https://www.linkedin.com/pulse/europe-risperid...,"{'name': '', 'specialty': '', 'affiliation': '...","{'name': '', 'type': '', 'indication': '', 'ma...","{'type': 'market impact', 'source': 'LinkedIn'...","{'type': 'positive', 'intensity': 'moderate'}"
98,linkedin,2024-04-16T12:25:21.693Z,https://www.linkedin.com/feed/update/urn:li:ac...,"{'name': 'Roos van Westrhenen', 'specialty': '...","{'name': 'Unknown', 'type': 'Unknown', 'indica...","{'type': 'systematic review', 'source': 'Journ...","{'type': 'positive', 'intensity': 'high'}"


In [23]:
# Export the modified DataFrame so it can be worked on in Neo4j
from IPython.display import FileLink

df.to_csv('modified_dataframe.csv', index=False)
FileLink('modified_dataframe.csv')

In [24]:
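# Flatten the nested columns
import ast

# Safely parse a stringified Python dict; fall back to {} for anything else
def parse_json_field(field):
    try:
        parsed = ast.literal_eval(field)
    except (ValueError, SyntaxError, TypeError):
        return {}
    # The .get() calls used on the parsed columns require a dict
    return parsed if isinstance(parsed, dict) else {}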

# Apply this function to your nested columns
df['hcp_parsed'] = df['hcp'].apply(parse_json_field)
df['alternative_therapy_parsed'] = df['alternative_therapy'].apply(parse_json_field)
df['insight_parsed'] = df['insight'].apply(parse_json_field)
df['sentiment_parsed'] = df['sentiment'].apply(parse_json_field)


In [25]:
# Extract fields from the parsed columns
df['hcp_name'] = df['hcp_parsed'].apply(lambda x: x.get('name', ''))
df['hcp_specialty'] = df['hcp_parsed'].apply(lambda x: x.get('specialty', ''))
df['hcp_affiliation'] = df['hcp_parsed'].apply(lambda x: x.get('affiliation', ''))

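# Note: the therapy_* columns below are taken from `alternative_therapy`;
# the primary `therapy` column was dropped in an earlier cell.
df['therapy_name'] = df['alternative_therapy_parsed'].apply(lambda x: x.get('name', ''))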
df['therapy_type'] = df['alternative_therapy_parsed'].apply(lambda x: x.get('type', ''))
df['therapy_indication'] = df['alternative_therapy_parsed'].apply(lambda x: x.get('indication', ''))

df['insight_type'] = df['insight_parsed'].apply(lambda x: x.get('type', ''))
df['insight_source'] = df['insight_parsed'].apply(lambda x: x.get('source', ''))
df['insight_timestamp'] = df['insight_parsed'].apply(lambda x: x.get('timestamp', ''))

df['sentiment_type'] = df['sentiment_parsed'].apply(lambda x: x.get('type', ''))
df['sentiment_intensity'] = df['sentiment_parsed'].apply(lambda x: x.get('intensity', ''))
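
The primary `therapy` column holds a list of dicts rather than a single dict, so the `.get()` pattern used for the other columns does not apply to it directly. If that column were kept, one possible way to flatten it is to emit one row per therapy with `explode` and then expand the dicts with `pd.json_normalize`. The sample rows and the `pid` values below are made up for illustration.

```python
import ast
import pandas as pd

# Hypothetical sample shaped like the `pid` and `therapy` columns shown earlier.
sample = pd.DataFrame({
    "pid": ["p1", "p2", "p3"],
    "therapy": [
        "[{'name': 'KarXT', 'type': 'drug'}, {'name': 'UZEDY', 'type': 'drug'}]",
        "[]",
        "[{'name': 'Risperidone', 'type': 'drug'}]",
    ],
})

# Parse each cell into a Python list, then emit one row per therapy.
sample["therapy"] = sample["therapy"].apply(ast.literal_eval)
exploded = sample.explode("therapy").dropna(subset=["therapy"]).reset_index(drop=True)

# Expand each therapy dict into its own columns next to the post id.
therapies = pd.concat(
    [exploded[["pid"]],
     pd.json_normalize(exploded["therapy"].tolist()).add_prefix("therapy_")],
    axis=1,
)
```

Each resulting row pairs one post with one therapy, which maps naturally onto one `DISCUSSES` relationship per row. Posts with an empty list produce no rows.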


In [26]:
df_cleaned = df.drop(columns=[
    'hcp', 'alternative_therapy', 'insight', 'sentiment',
    'hcp_parsed', 'alternative_therapy_parsed', 'insight_parsed', 'sentiment_parsed',
])


In [27]:
# Save the cleaned DataFrame to a CSV file
df_cleaned.to_csv('flattened_data.csv', index=False)

In [30]:
# Load the new CSV to verify.
# to_csv wrote a comma-separated file, so read it with the default separator;
# reading it with sep='\t' would load every line into a single column.
df_cleaned = pd.read_csv('flattened_data.csv')

print(df_cleaned.head())

In [31]:
# I tried to format the dataset but could not fix all the errors, so I was unable to import it into Neo4j.
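
One formatting pitfall worth checking when a CSV will not import cleanly is a separator mismatch. `to_csv` writes commas by default, so reading the same file with `sep='\t'` leaves each line as one wide column. The sketch below reproduces this with an in-memory buffer; the one-row frame is illustrative only.

```python
import io
import pandas as pd

# Small illustrative frame; column names mirror the flattened file.
frame = pd.DataFrame({
    "source": ["linkedin"],
    "hcp_name": ["Dr. A"],
    "sentiment_type": ["positive"],
})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)  # to_csv writes commas by default
csv_text = buffer.getvalue()

# No tabs in the file, so every line becomes a single column.
wrong = pd.read_csv(io.StringIO(csv_text), sep="\t")
# The default sep="," matches what to_csv wrote.
right = pd.read_csv(io.StringIO(csv_text))
```

Comparing `wrong.shape` with `right.shape` is a quick way to confirm that a file was parsed with the separator it was written with.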

## **Section 3: Practical Skills with Neo4j**


### 1. **Cypher Query Writing:**  
   - **Task:** Write Cypher queries to perform the following operations:  
     a) Retrieve all posts that mention a specific therapy (e.g., KarXT).  
     b) Find the shortest path between two articles based on shared authors or discussed therapies.  
     c) Update the properties of nodes based on conditions (e.g., if `therapy` is "KarXT" and `sentiment` is empty, set it to "neutral").

Here are the Cypher queries to perform the specified operations:

#### a) Retrieve All Posts That Mention a Specific Therapy (e.g., KarXT)
```cypher
MATCH (post:Post)-[:DISCUSSES]->(therapy:Therapy {name: 'KarXT'})
RETURN post;
```

#### b) Find the Shortest Path Between Two Articles Based on Shared Authors or Discussed Therapies
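Assuming the articles are represented by `Post` nodes that are connected to `HCP` (author) nodes by `AUTHORED` relationships and to `Therapy` nodes by `DISCUSSES` relationships:
```cypher
MATCH (post1:Post {id: 'article1_id'}), (post2:Post {id: 'article2_id'}),
      p = shortestPath((post1)-[:AUTHORED|DISCUSSES*..10]-(post2))
RETURN p;
```
This query finds the shortest path between two articles (`article1_id` and `article2_id`) through shared authors or discussed therapies. Restricting the relationship types keeps the path from running through unrelated connections, and the upper bound of 10 hops (an arbitrary choice) limits the search.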
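
For intuition about what a shortest-path search returns, an unweighted shortest path can be found with breadth-first search. The sketch below is plain Python over a hypothetical adjacency dictionary. It is not how Neo4j implements `shortestPath`, and all node names are made up.

```python
from collections import deque

# Hypothetical graph: posts link to the HCPs and therapies they share.
graph = {
    "post1": ["hcp:Dr. A", "therapy:KarXT"],
    "post2": ["therapy:KarXT", "therapy:UZEDY"],
    "post3": ["therapy:UZEDY"],
    "hcp:Dr. A": ["post1"],
    "therapy:KarXT": ["post1", "post2"],
    "therapy:UZEDY": ["post2", "post3"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list of nodes, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


path = shortest_path(graph, "post1", "post3")
```

Here `post1` and `post3` share nothing directly, so the path runs through `post2`, which discusses a therapy in common with each of them.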

#### c) Update the Properties of Nodes Based on Conditions
If `therapy` is "KarXT" and `sentiment` is empty (missing or an empty string), set it to "neutral":
```cypher
MATCH (post:Post)-[:DISCUSSES]->(therapy:Therapy {name: 'KarXT'})
WHERE post.sentiment IS NULL OR post.sentiment = ''
SET post.sentiment = 'neutral'
RETURN post;
```


### 2. **Graph Algorithms Application:**  
   - **Task:** Use the Neo4j Graph Data Science (GDS) library to run **Node Similarity** between different articles/posts based on shared topics, authors, or therapies. Interpret the results and discuss how this can be useful for identifying similar content.

### Node Similarity Algorithm in Neo4j GDS

The **Node Similarity** algorithm in the Neo4j Graph Data Science (GDS) library compares nodes based on their connections to other nodes. It calculates pairwise similarities using metrics like Jaccard Similarity, Overlap Coefficient, and Cosine Similarity. This algorithm is particularly useful for identifying nodes that share many common neighbors, making it ideal for finding similar articles/posts based on shared topics, authors, or therapies¹.

#### Graph Model:
1. **Nodes**:
   - **Posts**: Represented by the `id`, `text`, `source`, `timestamp`, `url`, and `sentiment` columns.
   - **Authors (HCPs)**: Represented by the `hcp` column.
   - **Therapies**: Represented by the `therapy` and `alternative_therapy` columns.
   - **Insights**: Represented by the `insight` column.

2. **Relationships**:
   - **"AUTHORED"**: Connects Authors (HCPs) to Posts.
   - **"DISCUSSES"**: Connects Posts to Therapies and Alternative Therapies.
   - **"MENTIONS"**: Connects Posts to Insights.

### Running the Node Similarity Algorithm

#### Step 1: Create the Graph
First, create the graph projection in Neo4j:
```cypher
CALL gds.graph.project(
  'postGraph',
  ['Post', 'HCP', 'Therapy', 'Insight'],
  {
    AUTHORED: { type: 'AUTHORED', orientation: 'UNDIRECTED' },
    DISCUSSES: { type: 'DISCUSSES', orientation: 'UNDIRECTED' },
    MENTIONS: { type: 'MENTIONS', orientation: 'UNDIRECTED' }
  }
)
```

#### Step 2: Run the Node Similarity Algorithm
Run the Node Similarity algorithm on the projected graph and keep only pairs of posts. The named graph already defines the node and relationship projections, so only algorithm settings such as `similarityCutoff` go in the configuration map:
```cypher
CALL gds.nodeSimilarity.stream('postGraph', {
  similarityCutoff: 0.5
})
YIELD node1, node2, similarity
WITH gds.util.asNode(node1) AS post1, gds.util.asNode(node2) AS post2, similarity
WHERE post1:Post AND post2:Post
RETURN post1.id AS Post1, post2.id AS Post2, similarity
ORDER BY similarity DESC
```

### Interpreting the Results
The results will show pairs of posts (`Post1` and `Post2`) along with their similarity scores. Higher similarity scores indicate that the posts share more common neighbors (e.g., authors, therapies, insights).
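
To make the scores easier to read, here is a minimal sketch of the Jaccard metric mentioned earlier, computed by hand in Python. The neighbor sets are hypothetical and the 0.5 threshold simply mirrors the `similarityCutoff` used in the query.

```python
from itertools import combinations

# Hypothetical neighbor sets: the HCPs, therapies, and insights each post connects to.
post_neighbors = {
    "post1": {"hcp:Dr. A", "therapy:KarXT", "insight:market impact"},
    "post2": {"therapy:KarXT", "insight:market impact"},
    "post3": {"therapy:UZEDY"},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Score every unordered pair of posts.
scores = {
    (p, q): jaccard(post_neighbors[p], post_neighbors[q])
    for p, q in combinations(sorted(post_neighbors), 2)
}

# Keep only the pairs at or above the cutoff.
similar = {pair: score for pair, score in scores.items() if score >= 0.5}
```

`post1` and `post2` share two of their three distinct neighbors, giving a score of 2/3, while `post3` shares nothing with either and scores 0.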

### Usefulness for Identifying Similar Content
1. **Content Recommendation**: By identifying similar posts, you can recommend related articles to users, enhancing their reading experience.
2. **Content Clustering**: Group similar posts together to identify trends and common themes in discussions.
3. **Author Collaboration**: Discover which authors frequently discuss similar topics, potentially fostering collaboration opportunities.
4. **Therapy Insights**: Identify therapies that are commonly discussed in relation to specific insights.

### 3. **Performance Optimization:**  
   - **Task:** Suggest possible ways to optimize query performance in Neo4j for this dataset (e.g., using indexes, refactoring queries).

Optimizing query performance in Neo4j involves several strategies, including using indexes, refactoring queries, and leveraging Neo4j's built-in tools. Here are some key techniques:

### 1. Use Indexes
Indexes can significantly speed up query performance by reducing the amount of data Neo4j needs to scan.

#### Creating Indexes
Create indexes on frequently queried properties, such as the `id` and `timestamp` of posts and the `name` of HCP and therapy nodes.
```cypher
CREATE INDEX FOR (p:Post) ON (p.id);
CREATE INDEX FOR (h:HCP) ON (h.name);
CREATE INDEX FOR (t:Therapy) ON (t.name);
CREATE INDEX FOR (p:Post) ON (p.timestamp);
```

### 2. Refactor Queries
Refactor the queries to be more efficient by:
- **Filtering Early**: Apply filters as early as possible in the query to reduce the number of rows carried into later steps.
- **Avoiding Cartesian Products**: Make sure query patterns do not unintentionally create Cartesian products (for example, by matching two disconnected patterns in one `MATCH`), which can be very expensive.

#### Example of Filtering Early
```cypher
MATCH (p:Post)-[:DISCUSSES]->(t:Therapy)
WHERE t.name = 'KarXT'
RETURN p;
```

### 3. Use Query Profiling Tools
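Use `EXPLAIN` and `PROFILE` to understand how queries are executed and to identify bottlenecks: `EXPLAIN` shows the plan without running the query, while `PROFILE` runs it and records what each operator actually did.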

#### Using PROFILE
```cypher
PROFILE MATCH (p:Post)-[:DISCUSSES]->(t:Therapy)
WHERE t.name = 'KarXT'
RETURN p;
```
The resulting plan shows the rows and database hits for each operator, which indicates where most of the work is done.

### 4. Optimize Data Model
Ensure the data model matches the queries you run most often. For example, if you frequently traverse from `Post` to `Therapy`, model that link as a dedicated relationship type (such as `DISCUSSES`) and index the properties used to find the starting nodes.

### 5. Use Parameters
Use parameters instead of literal values so that Neo4j can cache and reuse execution plans, reducing the overhead of query planning.

#### Example with Parameters
```cypher
MATCH (p:Post)-[:DISCUSSES]->(t:Therapy)
WHERE t.name = $therapyName
RETURN p;
```
And then set the parameter in your application code:
```python
therapyName = 'KarXT'
```
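As a fuller sketch of how the value reaches the query, the parameter can be passed alongside the query text when it is executed. The example assumes the official `neo4j` Python driver, whose sessions expose `run(query, parameters)`; the function name, the returned `post_id` alias, and the connection details in the comments are illustrative assumptions.

```python
THERAPY_QUERY = """
MATCH (p:Post)-[:DISCUSSES]->(t:Therapy)
WHERE t.name = $therapyName
RETURN p.id AS post_id
"""

def posts_for_therapy(session, therapy_name):
    """Run the parameterized query and return the matching post IDs.

    `session` is any object with a driver-style run(query, parameters) method.
    """
    result = session.run(THERAPY_QUERY, {"therapyName": therapy_name})
    return [record["post_id"] for record in result]

# Assumed wiring with the official driver (URI and credentials are placeholders):
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
# with driver.session() as session:
#     print(posts_for_therapy(session, "KarXT"))
```

Because the query text stays identical across calls and only the parameter value changes, the same cached plan can be reused for every therapy name.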

### 6. Limit Results
Use `LIMIT` to restrict the number of results returned, which can improve performance for large datasets.

#### Example with LIMIT
```cypher
MATCH (p:Post)-[:DISCUSSES]->(t:Therapy)
WHERE t.name = 'KarXT'
RETURN p
LIMIT 10;
```

### 7. Regular Maintenance
Periodically check the state of your indexes with `SHOW INDEXES` (older Neo4j releases used `CALL db.indexes()`), and refresh the index statistics used by the query planner with `CALL db.resampleOutdatedIndexes()`.

### Example of Combining Techniques
Here's an example query that combines several optimization techniques:
```cypher
PROFILE
MATCH (p:Post)-[:DISCUSSES]->(t:Therapy)
WHERE t.name = $therapyName
RETURN p
LIMIT 10;
```


## **Section 4: Model Development and Analysis**


### **Sentiment Classification Task:**  
   - **Task:** Using the sentiment data from the `sentiment` column, develop a simple sentiment classification model. You can use the cleaned dataset and assign sentiment categories (positive, neutral, negative).

To develop a simple sentiment classification model using the sentiment data from your dataset, we can follow these steps:

### Steps to Develop a Sentiment Classification Model

1. **Data Preparation**:
   - Load the dataset.
   - Clean the data (e.g., handle missing values, remove special characters).
   - Encode the sentiment labels (positive, neutral, negative).

2. **Feature Extraction**:
   - Convert text data into numerical features using techniques like TF-IDF (Term Frequency-Inverse Document Frequency).

3. **Model Training**:
   - Split the data into training and testing sets.
   - Train a machine learning model (e.g., Logistic Regression, Naive Bayes).

4. **Model Evaluation**:
   - Evaluate the model using metrics like accuracy, precision, recall, and F1-score.

5. **Prediction**:
   - Use the trained model to predict sentiment on new data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import nltk
from nltk.corpus import stopwords
import re

# The NLTK stop-word list must be downloaded once before first use
nltk.download('stopwords', quiet=True)
```

```python
# Load the dataset
df = pd.read_excel('C:/Users/karng/Desktop/Graph-Data-Science-Assessment/KG_technical_skills_assessment_data.xlsx')
```

```python
# Data Cleaning
df.dropna(subset=['text', 'sentiment'], inplace=True)
df['text'] = df['text'].apply(lambda x: re.sub(r'\W', ' ', str(x)))
df['text'] = df['text'].apply(lambda x: re.sub(r'\s+', ' ', x))
df['text'] = df['text'].apply(lambda x: x.lower())
```

```python
# Handle NaN values in sentiment column: impute with 'neutral'
# (assigning the result back avoids chained in-place modification of a column)
df['sentiment'] = df['sentiment'].fillna('neutral')
```

```python
# Encode sentiment labels; normalize case and whitespace first so that every label maps
df['sentiment'] = df['sentiment'].str.strip().str.lower().map({'positive': 1, 'neutral': 0, 'negative': -1})
```

```python
# Feature Extraction
tfidf = TfidfVectorizer(max_features=5000, stop_words=stopwords.words('english'))
X = tfidf.fit_transform(df['text']).toarray()
y = df['sentiment']
```

```python
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

```python
# Model Training (a higher max_iter gives the solver room to converge on TF-IDF features)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```

```python
# Model Evaluation
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
```

```python
# Prediction on new data
new_texts = ["This therapy is amazing!", "I don't like this treatment.", "It's okay, not great."]
new_texts_cleaned = [re.sub(r'\W', ' ', text.lower()) for text in new_texts]
new_texts_tfidf = tfidf.transform(new_texts_cleaned).toarray()
predictions = model.predict(new_texts_tfidf)
print("Predictions:", predictions)
```

### Explanation

1. **Data Cleaning**: 
   - Remove rows with missing values in `text` and `sentiment`.
   - Clean the text by removing special characters and converting to lowercase.

2. **Feature Extraction**:
   - Use TF-IDF to convert text data into numerical features.

3. **Model Training**:
   - Split the data into training and testing sets.
   - Train a Logistic Regression model on the training data.

4. **Model Evaluation**:
   - Evaluate the model's performance using accuracy and classification report.

5. **Prediction**:
   - Predict sentiment for new text data.

### Interpretation

The model will classify the sentiment of the text data into positive, neutral, or negative categories. This can be useful for analyzing the overall sentiment in discussions, identifying trends, and making data-driven decisions.
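The metrics reported by `classification_report` can be reproduced by hand, which helps when reading its output. The following is a minimal sketch in plain Python, using made-up labels (1 = positive, 0 = neutral, -1 = negative) rather than results from the model above:

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall and F1 for one label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical true and predicted labels for six posts.
y_true = [1, 1, 0, -1, -1, 0]
y_pred = [1, 0, 0, -1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", round(accuracy, 2))
for label in (1, 0, -1):
    print(label, [round(v, 2) for v in precision_recall_f1(y_true, y_pred, label)])
```

Accuracy alone can hide weak classes: in this toy example four of six predictions are correct, yet only half of the truly negative posts are recovered, which is why the per-class figures are worth checking on an imbalanced sentiment column.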

## **Section 5: Research and Innovation (10 Points)**

### **Research Task:**  
   - **Task:** Review the article [How to Implement Graph RAG Using Knowledge Graphs and Vector Databases](https://towardsdatascience.com/how-to-implement-graph-rag-using-knowledge-graphs-and-vector-databases-60bb69a22759?gi=375294af5855). Summarize the key findings and propose how these insights could improve the analysis of sentiment and topics in this dataset using a knowledge graph approach.

The article "How to Implement Graph RAG Using Knowledge Graphs and Vector Databases" on Towards Data Science provides a comprehensive guide on leveraging knowledge graphs and vector databases to enhance Retrieval-Augmented Generation (RAG) applications. Here’s an in-depth analysis of the key points and insights from the article:

### Key Points

1. **Introduction to RAG**:
   - **RAG Overview**: RAG combines information retrieval with language generation to improve the accuracy and relevance of AI responses. It uses a two-step process: retrieving relevant documents and generating responses based on those documents¹.
   - **Role of Knowledge Graphs**: Knowledge graphs play a crucial role in RAG by providing structured, interconnected data that enhances the retrieval process¹.

2. **Knowledge Graphs vs. Vector Databases**:
   - **Knowledge Graphs**: Represent information as nodes (entities) and edges (relationships). They are ideal for structured data with clear relationships¹.
   - **Vector Databases**: Handle unstructured text and embeddings, making them suitable for tasks involving semantic similarity and clustering¹.
   - **When to Use Each**: The article discusses scenarios where knowledge graphs are more beneficial than vector databases and vice versa¹.

3. **Creating a Knowledge Graph**:
   - **Data Preparation**: The article outlines steps to prepare text data for creating a knowledge graph. This includes extracting entities and relationships from the text¹.
   - **Storing in Neo4j**: It provides a detailed guide on setting up a Neo4j instance and storing the knowledge graph in the database¹.
   - **Querying the Graph**: Techniques for querying the knowledge graph to retrieve relevant information for user queries are discussed¹.

4. **Integrating with RAG**:
   - **Combining with LangChain**: The article explains how to integrate the knowledge graph with LangChain to support RAG applications. This involves using the graph to retrieve relevant documents and then generating responses based on those documents¹.
   - **Handling Different Data Types**: It also covers how to expand the approach to handle various data types and file formats beyond plain text¹.

### Insights for Sentiment and Topic Analysis

1. **Enhanced Contextual Understanding**:
   - **Entity Relationships**: By capturing relationships between entities (e.g., words, phrases, topics), knowledge graphs provide a deeper understanding of the context, leading to more accurate sentiment and topic analysis¹.
   - **Contextual Queries**: Knowledge graphs enable complex queries that consider the context of entities, improving the relevance of sentiment and topic classification¹.

2. **Integration of Structured and Unstructured Data**:
   - **Holistic View**: Combining structured data (e.g., metadata) with unstructured text (e.g., reviews) in a knowledge graph offers a comprehensive view of the data¹.
   - **Rich Insights**: This integration can reveal nuanced insights into sentiment and topics that might be missed when analyzing data in isolation¹.

3. **Efficient Querying and Analysis**:
   - **Real-Time Analysis**: Knowledge graphs allow for efficient querying of interconnected data, facilitating real-time sentiment and topic analysis¹.
   - **Advanced Analytics**: Techniques like vector similarity searches can be used to cluster similar sentiments or topics, enhancing the overall analysis¹.

4. **Scalability and Flexibility**:
   - **Scalable Solutions**: Knowledge graphs can scale to handle large datasets, making them suitable for extensive sentiment and topic analysis¹.
   - **Flexible Data Handling**: They can accommodate various data types and formats, providing flexibility in how data is analyzed and interpreted¹.
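One way to picture how these ideas could apply to this dataset is a two-stage retrieval: a vector search finds posts whose text is close to a question, and the graph then supplies the therapies and sentiment attached to each hit. The sketch below is not taken from the article; the embeddings, post IDs, and field names are made up for illustration, and a real system would use an embedding model and Neo4j instead of in-memory dictionaries.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical 3-dimensional embeddings of post text.
embeddings = {
    "post_1": [0.9, 0.1, 0.0],
    "post_2": [0.8, 0.2, 0.1],
    "post_3": [0.0, 0.1, 0.9],
}
# Hypothetical graph context: therapies each post DISCUSSES, plus its sentiment.
graph = {
    "post_1": {"therapies": ["KarXT"], "sentiment": "positive"},
    "post_2": {"therapies": ["KarXT", "clozapine"], "sentiment": "neutral"},
    "post_3": {"therapies": ["clozapine"], "sentiment": "negative"},
}

def retrieve(query_vec, k=2):
    """Stage 1: rank posts by vector similarity. Stage 2: attach graph context."""
    ranked = sorted(embeddings, key=lambda pid: cosine(query_vec, embeddings[pid]), reverse=True)
    return [{"post": pid, **graph[pid]} for pid in ranked[:k]]

for hit in retrieve([1.0, 0.0, 0.0]):
    print(hit)
```

The vector stage decides which posts are relevant, while the graph stage explains them in terms of entities and relationships, which is the context a language model would receive when answering a question about sentiment toward a therapy.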

# **Bonus Tasks:**

## 1. **Named Entity Recognition (NER) Extraction Task:**  
   - **Task:** Use an NER tool to extract important entities (e.g., organizations, people, drug names) from the `text` field in the dataset. Propose how this additional NER data could be integrated into the graph to improve the dataset's structure and relationships.

Here’s a guide on how to do this and how to integrate the extracted entities into our knowledge graph:

### Guide to Extract Entities Using NER

1. **Choose an NER Tool**:
   - Popular NER tools include spaCy, NLTK, and Stanford NER. For this example, we'll use spaCy.

2. **Install spaCy and Download the Model**:
    ```python
    !pip install spacy
    !python -m spacy download en_core_web_sm
    ```

3. **Load the spaCy Model and Process the Text**:
    ```python
    import spacy

    # Load the spaCy model
    nlp = spacy.load("en_core_web_sm")

    # Sample text
    text = "Apple is looking at buying U.K. startup for $1 billion. Elon Musk is the CEO of SpaceX."

    # Process the text
    doc = nlp(text)

    # Extract entities
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    print(entities)
    ```

    This will output entities like:
    ```
    [('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY'), ('Elon Musk', 'PERSON'), ('SpaceX', 'ORG')]
    ```

### Integrating NER Data into the Knowledge Graph

1. **Define Node Types and Relationships**:
   - **Node Types**: Create nodes for different entity types such as `Person`, `Organization`, `Location`, `Money`, etc.
   - **Relationships**: Define relationships between these entities based on the context in which they appear. For example, `Elon Musk` (Person) is the `CEO` (relationship) of `SpaceX` (Organization).

2. **Create Nodes and Relationships in Neo4j**:
    ```python
    from py2neo import Graph, Node, Relationship

    # Connect to Neo4j
    graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

    # Create nodes
    person = Node("Person", name="Elon Musk")
    organization = Node("Organization", name="SpaceX")
    graph.create(person)
    graph.create(organization)

    # Create relationship
    ceo_relationship = Relationship(person, "CEO_OF", organization)
    graph.create(ceo_relationship)
    ```

3. **Enhance Data Structure and Relationships**:
   - **Contextual Links**: Use the context in which entities appear to create more meaningful relationships. For example, if `Apple` is mentioned in the context of acquiring a startup, create an `ACQUIRES` relationship.
   - **Temporal Information**: If dates or time expressions are extracted, link them to events or entities to provide a temporal dimension to your data.

4. **Querying the Enhanced Graph**:
   - With the enriched graph, you can perform more complex queries to analyze sentiment and topics. For example, you can query all organizations associated with a particular person or all events linked to a specific date.
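For this dataset, the same idea could attach entities to the posts they come from. The following is a minimal sketch, not the author's pipeline: it turns `(text, label)` pairs like those in the spaCy output above into parameter rows for a single `UNWIND ... MERGE` statement. The `Entity` label, the `MENTIONS_ENTITY` relationship type, and the post ID are hypothetical names chosen for illustration.

```python
# Hypothetical Cypher: MERGE is used so that reruns do not duplicate nodes.
MERGE_ENTITIES = """
UNWIND $rows AS row
MATCH (p:Post {id: row.post_id})
MERGE (e:Entity {name: row.name, type: row.type})
MERGE (p)-[:MENTIONS_ENTITY]->(e)
"""

KEEP_LABELS = {"ORG", "PERSON", "GPE"}  # entity types worth storing as nodes

def entity_rows(post_id, entities):
    """Convert (text, label) pairs into de-duplicated parameter rows."""
    seen = set()
    rows = []
    for text, label in entities:
        key = (text.strip().lower(), label)
        if label in KEEP_LABELS and key not in seen:
            seen.add(key)
            rows.append({"post_id": post_id, "name": text.strip(), "type": label})
    return rows

rows = entity_rows(
    42,
    [("Apple", "ORG"), ("$1 billion", "MONEY"), ("apple", "ORG"), ("Elon Musk", "PERSON")],
)
print(rows)
# The rows could then be sent in one call, e.g. session.run(MERGE_ENTITIES, {"rows": rows})
```

Batching the rows through `UNWIND` keeps the write to one round trip per post, and filtering by label keeps low-value entities such as money amounts out of the graph.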

### Benefits of Integrating NER Data

1. **Improved Data Structure**: Adding entities and their relationships enhances the structure of your dataset, making it more informative and easier to query.
2. **Enhanced Contextual Analysis**: By capturing the context in which entities appear, you can perform more nuanced sentiment and topic analysis.
3. **Efficient Information Retrieval**: The enriched graph allows for more efficient and relevant information retrieval, improving the overall analysis process.

## 2. **Entity Normalization Task:**  
   - **Task:** Perform entity normalization on fields such as `therapy` to ensure consistency in naming conventions (e.g., standardize drug names and organization names). Describe how this normalization process could improve data analysis and usability, particularly in the context of graph queries.

Entity normalization is the process of standardizing names and terms within your dataset to ensure consistency. This is particularly important for fields where drug names and organization names might have multiple variations. Here’s how you can perform entity normalization and how it can improve data analysis and usability. The examples below use sample values.

### Steps for Entity Normalization

1. **Identify Variations**:
   - Use NER tools to extract entities from your dataset.
   - Identify different variations of the same entity (e.g., "ibuprofen" vs. "Ibuprofen" vs. "Advil").

2. **Create a Standardized Dictionary**:
   - Develop a dictionary that maps all variations to a standardized name.
   - For example:
     ```python
     # Keys are lowercase so that lookups can be case-insensitive
     normalization_dict = {
         "ibuprofen": "Ibuprofen",
         "advil": "Ibuprofen",
         "tylenol": "Acetaminophen",
         "paracetamol": "Acetaminophen"
     }
     ```

3. **Apply Normalization**:
   - Replace variations in your dataset with the standardized names using the dictionary.
     ```python
     def normalize_entity(entity, normalization_dict):
         return normalization_dict.get(entity.lower(), entity)

     # Example usage with sample therapy names
     sample_names = ["ibuprofen", "Advil", "Paracetamol"]
     normalized_entities = [normalize_entity(name, normalization_dict) for name in sample_names]
     ```

4. **Update the Knowledge Graph**:
   - Ensure that the nodes and relationships in your knowledge graph use the standardized names.
   - This might involve updating existing nodes or creating new ones with the normalized names.
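The graph update in step 4 can be planned before touching the database. The sketch below assumes a mapping with lowercase lookup keys and hypothetical therapy names; it lists which existing node names would be renamed to which standardized name, so that duplicates can be reviewed before they are merged.

```python
from collections import defaultdict

# Hypothetical mapping with lowercase keys, so lookups are case-insensitive.
canonical = {"ibuprofen": "Ibuprofen", "advil": "Ibuprofen", "paracetamol": "Acetaminophen"}

def rename_plan(node_names, mapping):
    """Group existing node names by the standardized name they map to.

    Names without a mapping entry, or already in standardized form,
    are left out of the plan.
    """
    plan = defaultdict(list)
    for name in node_names:
        target = mapping.get(name.strip().lower())
        if target is not None and name != target:
            plan[target].append(name)
    return dict(plan)

existing = ["ibuprofen", "Ibuprofen", "Advil", "KarXT", "Paracetamol "]
print(rename_plan(existing, canonical))
```

Each key in the plan is a node to keep, and each listed variant is a node whose relationships would need to be moved onto it before the variant is removed.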

### Benefits of Entity Normalization

1. **Improved Data Consistency**:
   - **Consistency**: Ensures that all references to the same entity are uniform, reducing confusion and errors.
   - **Accuracy**: Enhances the accuracy of data analysis by eliminating discrepancies caused by different naming conventions.

2. **Enhanced Query Performance**:
   - **Efficient Queries**: With standardized names, graph queries become more efficient and straightforward, as you don’t need to account for multiple variations of the same entity.
   - **Simplified Analysis**: Simplifies the process of writing and maintaining queries, as you can use a single term to refer to each entity.

3. **Better Data Integration**:
   - **Interoperability**: Facilitates the integration of data from different sources, as standardized names ensure that entities are recognized and matched correctly.
   - **Comprehensive Insights**: Enables more comprehensive analysis by combining data from various sources without inconsistencies.

4. **Enhanced Relationship Mapping**:
   - **Accurate Relationships**: Ensures that relationships between entities are accurately represented, as all entities are consistently named.
   - **Contextual Understanding**: Improves the contextual understanding of relationships, leading to more meaningful insights.

### Example in the Context of Graph Queries

Consider a scenario where you are analyzing drug interactions in a medical dataset. Without normalization, you might have multiple nodes for the same drug, making it difficult to accurately query and analyze interactions. After normalization:

- **Before Normalization**:
  ```cypher
  MATCH (d:Drug)-[:INTERACTS_WITH]->(d2:Drug)
  WHERE d.name IN ["ibuprofen", "Ibuprofen", "Advil"]
  RETURN d, d2
  ```

- **After Normalization**:
  ```cypher
  MATCH (d:Drug)-[:INTERACTS_WITH]->(d2:Drug)
  WHERE d.name = "Ibuprofen"
  RETURN d, d2
  ```

This streamlined query is easier to write, understand, and maintain, and it ensures that all interactions involving "Ibuprofen" are accurately captured.