# **Section 1: General Knowledge**

### 1. **What is a graph database, and how does it differ from relational databases?**
- A graph database is a type of NoSQL database designed to store and query data as a graph: entities are stored as nodes, and the connections between them are stored as edges (relationships).

#### Differences from Relational Databases

1. **Data Representation**:
   - **Graph Databases**: Data is stored as a network of interconnected nodes and edges, making it easier to model complex relationships.
   - **Relational Databases**: Data is stored in tables with rows and columns. Relationships are represented through foreign keys and join operations.

2. **Query Performance**:
   - **Graph Databases**: Efficient for queries involving complex relationships and traversals, as they use index-free adjacency (each node directly points to its related nodes).
   - **Relational Databases**: Can become inefficient for complex queries involving multiple joins, as they require multiple table lookups¹².

3. **Schema Flexibility**:
   - **Graph Databases**: Schema-less or schema-flexible, allowing for dynamic changes in data structure without significant overhead.
   - **Relational Databases**: Schema-based, requiring predefined schemas and often needing alterations for structural changes¹².

4. **Use Cases**:
   - **Graph Databases**: Ideal for applications with complex relationships, such as social networks, fraud detection, and recommendation engines.
   - **Relational Databases**: Suitable for applications requiring strong consistency and structured data, such as financial systems and e-commerce platforms³⁴.
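
The difference in data representation and traversal described above can be sketched in plain Python. This is a toy illustration, not how either kind of database is implemented: the `follows_rows` data and all function names are hypothetical, and a real relational engine would normally use indexes rather than the full scan shown here.

```python
# Relational style: relationships live in a separate "join table" of rows.
# (Hypothetical data: who follows whom.)
follows_rows = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "dave"),
]

def followed_by_relational(user):
    """Scan the join table to find everyone `user` follows."""
    return [dst for src, dst in follows_rows if src == user]

# Graph style: each node keeps direct references to its neighbours,
# which is the idea behind index-free adjacency.
adjacency = {}
for src, dst in follows_rows:
    adjacency.setdefault(src, []).append(dst)
    adjacency.setdefault(dst, [])

def followed_by_graph(user):
    """Follow the node's own neighbour list; no table scan needed."""
    return adjacency.get(user, [])

def reachable_within(user, hops):
    """Users reachable from `user` in at most `hops` steps."""
    frontier, seen = {user}, set()
    for _ in range(hops):
        # Expand one hop, skipping nodes that were already reached.
        frontier = {n for u in frontier for n in adjacency.get(u, [])} - seen - {user}
        seen |= frontier
    return seen

print(followed_by_relational("alice"))
print(followed_by_graph("alice"))
print(sorted(reachable_within("alice", 2)))
```

In the relational sketch, every extra hop is another pass over the join table, whereas the graph sketch only touches the neighbours of the nodes it has already reached.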

### 2. **Describe the key components of a graph (nodes, relationships, properties) and provide examples using this dataset.**
- A graph is a set of nodes (vertices) and the edges that connect them.
- Nodes represent entities. Two nodes joined by an edge are called neighbors.
- An edge is a connection between two nodes of a graph. There are two types of edges:
    1. Directed edge: This edge has a direction. It is represented by an arrow.
    2. Undirected edge: This edge has no direction. It is represented by a simple line.
- The number of edges attached to a node is called its degree. An edge that connects a node to itself is called a loop.
- A sequence of nodes joined by edges is a path. A path that ends at the node where it started is called a cycle.
- Properties are key-value pairs that provide additional information about nodes and edges.
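
The terms above (neighbors, degree, loop, cycle) can be made concrete with a small undirected graph. This is a minimal sketch: the node names and the `edges` list are made up for illustration, and the degree function follows the common convention that a loop counts twice.

```python
from collections import defaultdict

# A small undirected graph; ("d", "d") is a loop. (Hypothetical data.)
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "d")]

# Neighbour sets: two nodes joined by an edge are neighbours of each other.
neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

def degree(node):
    """Count edge endpoints at `node`; a loop contributes two."""
    return sum((u == node) + (v == node) for u, v in edges)

# Loops are edges whose two endpoints are the same node.
loops = [u for u, v in edges if u == v]

def has_cycle():
    """Depth-first search for a cycle of length >= 3 (loops are ignored)."""
    visited = set()

    def dfs(node, parent):
        visited.add(node)
        for nxt in neighbors[node]:
            if nxt == node or nxt == parent:
                continue  # skip loops and the edge we arrived on
            if nxt in visited or dfs(nxt, node):
                return True
        return False

    return any(dfs(n, None) for n in list(neighbors) if n not in visited)

print(degree("c"), loops, has_cycle())
```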

***The key components of a graph can be illustrated using the provided dataset:***
#### Nodes
- **Source**: The origin of the text (e.g., person, organization, platform).
- **Therapy**: Represents the primary therapy mentioned.
- **Alternative Therapy**: Represents other possible therapies or alternatives discussed.
- **Sentiment**: Can be treated as a node if you want to categorize the overall sentiment as a distinct entity (e.g., positive, neutral, negative).

#### Edges
- **Text to Source**: The text is created by the source, so an edge can link a piece of text (or its id) to the source node.
- **Text to Therapy**: Each text may discuss or reference a specific therapy, so an edge can link the text to the therapy node.
- **Text to Alternative Therapy**: If alternative therapies are mentioned, an edge can link the text to the alternative therapy node.
- **Text to Sentiment**: The text has a sentiment (e.g., positive, negative), so an edge can connect the text to the sentiment node.
- **Comparison Aspect**: Links between different therapies or insights, indicating comparative relationships discussed in the text.

#### Properties
- **Text**: The content or message being analyzed.
- **Timestamp**: When the text was created, linked to either the text or source.
- **URL**: A property linked to the source or text, providing an external reference.
- **HCP**: A property that could represent healthcare professional-related discussions.
- **Insight**: Qualitative insights drawn from the text; these could be added as a property of the text node.
- **Comparison Aspect**: The specific aspect being compared between therapies; it could be stored on the relationship that links the two therapies.
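
One way to picture the mapping above is to turn a single dataset row into nodes, edges, and properties. This is only a sketch under assumptions: the row values are invented, and the relationship names (`POSTED_BY`, `MENTIONS`, `MENTIONS_ALTERNATIVE`, `HAS_SENTIMENT`, `COMPARED_WITH`) are hypothetical labels chosen for illustration, not names taken from the dataset.

```python
# A made-up row using the dataset's column names.
row = {
    "id": 1,
    "text": "Example post comparing two therapies.",
    "source": "forum",
    "timestamp": "2024-01-15T10:30:00",
    "url": "https://example.com/post/1",
    "hcp": "yes",
    "therapy": "Therapy A",
    "alternative_therapy": "Therapy B",
    "insight": "Example insight",
    "comparison_aspect": "cost",
    "sentiment": "positive",
}

# (column, node label, relationship type) for each text-to-X edge.
# The relationship type names are hypothetical.
LINKS = [
    ("source", "Source", "POSTED_BY"),
    ("therapy", "Therapy", "MENTIONS"),
    ("alternative_therapy", "AlternativeTherapy", "MENTIONS_ALTERNATIVE"),
    ("sentiment", "Sentiment", "HAS_SENTIMENT"),
]

def row_to_graph(row):
    """Return (nodes, edges) for one row.

    nodes: {(label, key): properties}
    edges: [(start_node, relationship_type, end_node, properties)]
    """
    text_node = ("Text", row["id"])
    text_props = {k: row[k] for k in ("text", "timestamp", "url", "hcp", "insight")}
    nodes = {text_node: text_props}
    edges = []
    for column, label, rel_type in LINKS:
        value = row.get(column)
        if value:  # skip missing values instead of creating empty nodes
            target = (label, value)
            nodes.setdefault(target, {})
            edges.append((text_node, rel_type, target, {}))
    # The comparison aspect is stored on the edge between the two therapies.
    if row.get("therapy") and row.get("alternative_therapy"):
        edges.append((
            ("Therapy", row["therapy"]),
            "COMPARED_WITH",
            ("AlternativeTherapy", row["alternative_therapy"]),
            {"aspect": row.get("comparison_aspect")},
        ))
    return nodes, edges

nodes, edges = row_to_graph(row)
for edge in edges:
    print(edge)
```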

### 3. **What is the property graph model used by Neo4j?**
The property graph model used by Neo4j is a flexible and expressive way to represent data. Here are the key components:

1. **Nodes**: These represent entities or objects in the graph. For example, a node could represent a person, a product, or a location.
2. **Relationships**: These describe how nodes are connected. Each relationship has a direction (from one node to another) and a type (e.g., "FRIENDS_WITH", "PURCHASED").
3. **Properties**: Both nodes and relationships can have properties, which are key-value pairs that store additional information. For example, a person node might have properties like `name` and `age`, while a relationship might have a property like `since` to indicate when the relationship started¹².
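4. **Cypher Query Language**: Strictly a companion to the model rather than a part of it, Cypher is Neo4j's declarative query language. It expresses nodes and relationships as patterns, which makes relationship-heavy queries more concise than the equivalent SQL joins.
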
This model allows for a highly intuitive and visual way to represent complex data and relationships, making it particularly useful for applications like social networks, recommendation systems, and fraud detection³.
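
The components listed above can be sketched as two small Python data classes. This is an illustrative model only, not Neo4j's internal representation or driver API: the `Node` and `Relationship` classes, the `match` helper, and the sample people are all hypothetical. The Cypher pattern in the docstring is shown for comparison and is not executed.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    labels: set                       # e.g. {"Person"}
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    start: Node                       # relationships are directed: start -> end
    type: str                         # exactly one type, e.g. "FRIENDS_WITH"
    end: Node
    properties: dict = field(default_factory=dict)

alice = Node({"Person"}, {"name": "Alice", "age": 30})
bob = Node({"Person"}, {"name": "Bob", "age": 32})
relationships = [Relationship(alice, "FRIENDS_WITH", bob, {"since": 2020})]

def match(start_label, rel_type, end_label):
    """Rough analogue of the Cypher pattern
    MATCH (a:StartLabel)-[r:REL_TYPE]->(b:EndLabel) RETURN a, r, b
    """
    return [
        (r.start, r, r.end)
        for r in relationships
        if r.type == rel_type
        and start_label in r.start.labels
        and end_label in r.end.labels
    ]

for a, r, b in match("Person", "FRIENDS_WITH", "Person"):
    print(a.properties["name"], r.type, b.properties["name"], r.properties)
```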

### 4. **List three common use cases for graph databases. Briefly explain how one of these use cases could be applied to the provided dataset.**
**Here are three common use cases for graph databases:**

1. **Social Networks**: Graph databases are excellent for modeling and querying social networks, where relationships between users (e.g., friendships, followers) are crucial.
2. **Fraud Detection**: They can identify complex patterns and relationships in transaction data to detect fraudulent activities.
3. **Recommendation Engines**: By analyzing user preferences and behaviors, graph databases can provide personalized recommendations for products, content, or services¹².

### Applying a Knowledge Graph to the Dataset

Given our dataset with columns like `id`, `text`, `source`, `timestamp`, `url`, `hcp`, `therapy`, `alternative_therapy`, `insight`, `comparison_aspect`, `discussion`, and `sentiment`, here's how a knowledge graph could be applied:

1. **Nodes**: Represent different entities such as `hcp` (healthcare professionals), `therapy`, `alternative_therapy`, and `insight`.
2. **Relationships**: Connect these nodes based on interactions or discussions. For example, a relationship could be "DISCUSSES" between `hcp` and `therapy`.
3. **Properties**: Store additional information such as `timestamp`, `sentiment`, and `source`.

#### Example Analysis:
1. **Enhanced Search and Retrieval**:
   - Use the knowledge graph to perform advanced searches, such as finding all discussions by a specific HCP about a particular therapy within a given timeframe.
   - Improve retrieval accuracy by leveraging the relationships and properties in the graph.
2. **Contextual Understanding**:
   - Gain deeper insights into how different therapies are perceived by analyzing the sentiment of discussions and the context provided by related insights.
   - Understand the connections between different therapies and alternative therapies through the comparison aspects.
3. **Decision Support**:
   - Use the knowledge graph to support decision-making by identifying key insights and trends in the data.
   - For example, determine which therapies are most frequently discussed positively by influential HCPs, aiding in strategic planning and marketing efforts.
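
The two example analyses above (searching by HCP, therapy, and timeframe, and counting positively discussed therapies) can be sketched over a few in-memory records. Everything here is assumed for illustration: the `discussions` records, the HCP identifiers, and the function names are hypothetical, and in practice the same questions would be asked of the graph through queries.

```python
from collections import Counter
from datetime import datetime

# Hypothetical "DISCUSSES" relationships flattened into records.
discussions = [
    {"hcp": "HCP-1", "therapy": "Therapy A", "sentiment": "positive",
     "timestamp": datetime(2024, 1, 10)},
    {"hcp": "HCP-1", "therapy": "Therapy A", "sentiment": "negative",
     "timestamp": datetime(2024, 3, 5)},
    {"hcp": "HCP-2", "therapy": "Therapy A", "sentiment": "positive",
     "timestamp": datetime(2024, 2, 1)},
    {"hcp": "HCP-2", "therapy": "Therapy B", "sentiment": "positive",
     "timestamp": datetime(2024, 2, 20)},
]

def find_discussions(hcp, therapy, start, end):
    """Discussions by one HCP about one therapy within [start, end]."""
    return [
        d for d in discussions
        if d["hcp"] == hcp
        and d["therapy"] == therapy
        and start <= d["timestamp"] <= end
    ]

def positive_counts():
    """Number of positive discussions per therapy."""
    return Counter(d["therapy"] for d in discussions if d["sentiment"] == "positive")

january = find_discussions("HCP-1", "Therapy A",
                           datetime(2024, 1, 1), datetime(2024, 1, 31))
print(len(january))
print(positive_counts().most_common())
```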

# **Section 2: Data Manipulation and Preprocessing (30 Points)**

### 1. **Data Import and Cleaning Task:**  
   - **Task:** Pre-process the provided dataset to handle missing values in key fields like `therapy`, `sentiment`, and `hcp`. Use `pandas` to clean the data and ensure proper formatting before importing into Neo4j.