# Big Data Modeling and Management Assignment - Homework 1

# Submission

GROUP NUMBER: **XXXXXX** - please add your group number into the file name

GROUP MEMBERS:

|STUDENT NAME|STUDENT NUMBER|
|---|---|
|XXXXXX|XXXXXX|
|XXXXXX|XXXXXX|
|XXXXXX|XXXXXX|
|XXXXXX|XXXXXX|

## 🍺 The Beer project  🍺 

As shown in class, graph databases are a natural way of navigating related information. For this first project we will use a graph database to analyse beers and breweries!   

The project datasets are based on [kaggle](https://www.kaggle.com/ehallmar/beers-breweries-and-beer-reviews), released by Evan Hallmark. 

### Problem description

Imagine you are working in the Data Management department of an Analytics company.
Explore the database via the Python neo4j connector and/or the graphical tool on the Neo4j webpage. Answer the questions while adjusting the database to meet the needs of your colleagues.
Please record and keep track of your database changes, and submit the file with all cells run and with the output shown.

### Questions

1. Explore the database: get familiar with current schema, elements and other important database parameters. [1 point]
2. Adjust the database and mention reasoning behind: e.g. clean errors, remove redundancies, adjust schema as necessary. Visualize the final version of database schema. [4 points]
3. Analytics department requires the following information for the biweekly reporting: [5 points]
    1. How many reviews has the beer with the most reviews?
    2. Which three users wrote the most reviews about beers?
    3. Find all beers that are described with following words: 'fruit', 'complex', 'nutty', 'dark'.
    4. Which top three breweries produce the largest variety of beer styles?
    5. Which country produces the most beer styles?
4. The Market Analysis department in your company accesses and updates the trends data on a daily basis. Given that, consider how you need to optimize the database and its performance so that the following queries are efficient. Measure performance to communicate your improvements using PROFILE before the final query. Answer the following: [4 points]
    1. Using the ABV score, find the five strongest beers and display their ABV score and the corresponding brewery. Keep in mind that the strongest known beer is Snake Venom, and deal with the error entries in the database.
    2. Using the answer from question 2, find the top 5 distinct beer styles with the highest average score of smell + feel that were reviewed by the third most productive user. Keep in mind that cleaning the database earlier should ensure correct results.
5. Answer **two out of four** of the following questions using Graph Algorithms (gds): [NB: make sure to clear the graph before using it again] For the quarterly report, the Analytics department requires the following information. [6 points]
    1. Which two countries are most similar when it comes to their top five most produced beer styles?
    2. Which beer is the most popular when considering the number of users who reviewed it? 
    3. Users are connected together by their reviews of beers, taking into consideration the "smell" score they assign as a weight, how many communities are formed from these relationships? How many users are in the three largest communities? 
    4. Which user is the most influential when it comes to reviews of distinct beers by style?
 
### Groups  

Groups should have 4 people maximum. Please mark which group you are here: https://shorturl.at/zE0QP 

### Submission      

The code used to produce the results and to-the-point explanations should be uploaded to Moodle. They should have a clear reference to the group, either in the file name or in the document itself. Preferably one Jupyter notebook per group.

Delivery date: Until the **midnight of March 18, 2025**

### Evaluation   

This will be 20% of the final grade.   
Each solution will be evaluated on 2 components: correctness of results and efficiency of the query (based on database schema).  
All code will go through automated plagiarism checks. Groups with the same code will undergo investigation.

## Loading the Database

#### Be sure that you **don't have** the neo4j docker container from the classes running (you can Stop it in the desktop app or with the command "`docker stop Neo4JLab`")


The default container does not have any data whatsoever, we will have to load a database into our docker image:
- Download and unzip the `Neo4JHWData` file provided in Moodle.
- Copy the path of the `Neo4JHWData` folder of the unzipped file, e.g. `C:/PATH/Neo4JHWData/data`.
- Download and unzip the `Neo4JPlugins` file provided in Moodle.
- Copy the path of the `Neo4JPlugins` folder of the unzipped file, e.g. `C:/PATH/Neo4JPlugins`.
- Change the code below accordingly. As you might have noticed, you do not have a user called `nunoa`, please use the appropriate path that you got from the previous step. Be sure that you have a neo4j docker container running: \

`docker run --name Neo4JHW2025 -p 7474:7474 -p 7687:7687 -d -v "c:\PATH\Neo4JPlugins":/plugins -v "c:\PATH\Neo4JHWData\data":/data --env NEO4J_AUTH=neo4j/test --env NEO4J_dbms_connector_https_advertised__address="localhost:7473" --env NEO4J_dbms_connector_http_advertised__address="localhost:7474" --env NEO4J_dbms_connector_bolt_advertised__address="localhost:7687" --env NEO4J_dbms_security_procedures_unrestricted=gds.* --env NEO4J_dbms_security_procedures_allowlist="gds.*" neo4j:4.4.5`

- Since Neo4j is trying to recognize a new database folder, this might take a bit (let's say 3 minutes), so don't worry.
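
Because the database may need a few minutes to come up, a notebook can poll until the server answers instead of failing on its first query. The following is a minimal sketch and not part of the assignment: `wait_until_ready` is a hypothetical helper, and the timeout and interval defaults are assumptions. With the official Python driver, the check passed in could be `driver.verify_connectivity`.

```python
import time


def wait_until_ready(check, timeout_s=180.0, interval_s=5.0, sleep=time.sleep):
    """Call `check()` until it stops raising or `timeout_s` elapses.

    `check` is any zero-argument callable that raises while the server is
    still starting (hypothetical usage: `driver.verify_connectivity`).
    Returns the number of attempts that were needed.
    """
    deadline = time.monotonic() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        try:
            check()
            return attempts
        except Exception:
            # Re-raise once the deadline has passed; otherwise wait and retry.
            if time.monotonic() >= deadline:
                raise
            sleep(interval_s)
```

The `sleep` parameter is injectable only so the loop can be exercised without real waiting.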

If the neo4j browser fails to load gds plugins, run the following in the Command Prompt before creating the container again:
`// Remove stopped containers //
docker container prune -f
// Remove unused images //
docker image prune -a -f
// Remove unused volumes //
docker volume prune -f
// Remove unused networks //
docker network prune -f
// Remove all unused resources in one command //
docker system prune -a -f`

In [23]:
from neo4j import GraphDatabase
from pprint import pprint

In [24]:
NEO4J_URI="neo4j://localhost:7687"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="test"

In [25]:
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD), )

In [26]:
def execute_read(driver, query):    
    with driver.session(database="neo4j") as session:
        result = session.execute_read(lambda tx, query: list(tx.run(query)), query)
    return result

<div style="background-color:rgb(243, 247, 243); border-left: 10px solid rgb(17, 126, 46); padding: 10px; border-radius: 10px;">
<b>1.</b> Explore the database: get familiar with current schema, elements and other important database parameters.
</div>

In [27]:
query = """
       CALL db.labels();
    """

result = execute_read(driver, query)

pprint(result)

[<Record label='COUNTRIES'>,
 <Record label='CITIES'>,
 <Record label='BREWERIES'>,
 <Record label='BEERS'>,
 <Record label='REVIEWS'>,
 <Record label='STYLE'>,
 <Record label='USER'>]


In [28]:
query = """
        CALL db.relationshipTypes();
    """

result = execute_read(driver, query)

pprint(result)


[<Record relationshipType='REVIEWED'>,
 <Record relationshipType='BREWED'>,
 <Record relationshipType='IN'>,
 <Record relationshipType='HAS_STYLE'>,
 <Record relationshipType='POSTED'>]


In [29]:
labels = ["COUNTRIES", "CITIES", "BREWERIES", "BEERS", "REVIEWS", "STYLE", "USER"]

for label in labels:
    query = f"""
            MATCH (n:{label})
            RETURN keys(n) AS properties
            LIMIT 5
    """

    result = execute_read(driver, query)

    print(f"Properties for {label}: {result}\n")


Properties for COUNTRIES: [<Record properties=['name']>, <Record properties=['name']>, <Record properties=['name']>, <Record properties=['name']>, <Record properties=['name']>]

Properties for CITIES: [<Record properties=['name']>, <Record properties=['name']>, <Record properties=['name']>, <Record properties=['name']>, <Record properties=['name']>]

Properties for BREWERIES: [<Record properties=['notes', 'types', 'id', 'name', 'state']>, <Record properties=['notes', 'id', 'types', 'state', 'name']>, <Record properties=['notes', 'types', 'id', 'name', 'state']>, <Record properties=['notes', 'types', 'id', 'name', 'state']>, <Record properties=['state', 'id', 'name', 'types', 'notes']>]

Properties for BEERS: [<Record properties=['notes', 'abv', 'name', 'state', 'id', 'retired', 'availability', 'brewery_id']>, <Record properties=['id', 'abv', 'notes', 'state', 'name', 'retired', 'availability', 'brewery_id']>, <Record properties=['notes', 'abv', 'name', 'state', 'id', 'retired', 'availa

In [30]:
relationships = ["REVIEWED", "BREWED", "IN", "HAS_STYLE", "POSTED"]

for rel in relationships:
    query = f"""
            MATCH ()-[r:{rel}]->()
            RETURN keys(r) AS properties
            LIMIT 5
    """
    
    result = execute_read(driver, query)
    
    print(f"Properties for relationship {rel}: {result}\n")


Properties for relationship REVIEWED: [<Record properties=[]>, <Record properties=[]>, <Record properties=[]>, <Record properties=[]>, <Record properties=[]>]

Properties for relationship BREWED: [<Record properties=[]>, <Record properties=[]>, <Record properties=[]>, <Record properties=[]>, <Record properties=[]>]

Properties for relationship IN: [<Record properties=[]>, <Record properties=[]>, <Record properties=[]>, <Record properties=[]>, <Record properties=[]>]

Properties for relationship HAS_STYLE: [<Record properties=[]>, <Record properties=[]>, <Record properties=[]>, <Record properties=[]>, <Record properties=[]>]

Properties for relationship POSTED: [<Record properties=[]>, <Record properties=[]>, <Record properties=[]>, <Record properties=[]>, <Record properties=[]>]



In [31]:
for label in labels:
    query = f"""
        MATCH (n:{label})
        RETURN count(n) AS total_nodes
        """
    
    result = execute_read(driver, query)

    print(f"Total {label}: {result}\n")

Total COUNTRIES: [<Record total_nodes=400>]

Total CITIES: [<Record total_nodes=23330>]

Total BREWERIES: [<Record total_nodes=100694>]

Total BEERS: [<Record total_nodes=417746>]

Total REVIEWS: [<Record total_nodes=2549271>]

Total STYLE: [<Record total_nodes=113>]

Total USER: [<Record total_nodes=123935>]
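
The loops above interpolate each label into the query text with an f-string, because Cypher parameters cannot stand in for labels in Neo4j 4.4. A minimal sketch of guarding that interpolation is shown here; `count_query` and `KNOWN_LABELS` are hypothetical names introduced only for illustration.

```python
import re

# Labels seen in the db.labels() output above.
KNOWN_LABELS = {"COUNTRIES", "CITIES", "BREWERIES", "BEERS", "REVIEWS", "STYLE", "USER"}


def count_query(label):
    """Build the node-count query for one label, rejecting unexpected input."""
    if label not in KNOWN_LABELS or not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", label):
        raise ValueError(f"unexpected label: {label!r}")
    return f"MATCH (n:{label}) RETURN count(n) AS total_nodes"


print(count_query("BEERS"))  # MATCH (n:BEERS) RETURN count(n) AS total_nodes
```

Checking against a fixed set keeps arbitrary text out of the query string, which matters as soon as the label comes from anywhere other than a hard-coded list.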



In [34]:
query = """
        MATCH ()-[r]->() 
        RETURN type(r), count(r) AS rel_count 
        ORDER BY rel_count DESC
"""

result = execute_read(driver, query)
pprint(result)

[<Record type(r)='POSTED' rel_count=2538044>,
 <Record type(r)='REVIEWED' rel_count=2537991>,
 <Record type(r)='BREWED' rel_count=358873>,
 <Record type(r)='HAS_STYLE' rel_count=358873>,
 <Record type(r)='IN' rel_count=62424>]


<div style="background-color:rgb(243, 247, 243); border-left: 10px solid rgb(29, 213, 78); padding: 10px; border-radius: 10px;">


| **Node Labels**  | **Properties**                                                     |**Total number of nodes**|
|------------------|--------------------------------------------------------------------|:------------------------:|
| COUNTRIES        | name                                                               |400                       |
| CITIES           | name                                                               |23330                     |
| BREWERIES        | notes, types, id, name, state                                      |100694                    |
| BEERS            | notes, abv, name, state, id, retired, availability, brewery_id     |417746                    |
| REVIEWS          | text, smell, look, taste, feel, overall, beer_id, id, date, score  |2549271                   | 
| STYLE            | name                                                               |113                       |
| USER             | name                                                               |123935                    |

<br><br>

| **Relationship Types** | **Total number of relationships** |
|------------------------|:---------------------------------:|
| REVIEWED |2537991|
| BREWED   |358873|
| IN |62424|
| HAS_STYLE|358873|
| POSTED | 2538044|

</div>
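
The two tables can be cross-checked with simple arithmetic. The sketch below uses only counts copied from the tables; the variable names are illustrative. Since one relationship can cover at most one node at each end, the differences are lower bounds rather than exact figures.

```python
# Counts copied from the tables above.
node_counts = {"REVIEWS": 2549271, "BEERS": 417746}
rel_counts = {"POSTED": 2538044, "BREWED": 358873}

# Each POSTED relationship covers at most one review, so this is a lower bound
# on REVIEWS nodes that have no POSTED relationship at all.
min_reviews_without_user = node_counts["REVIEWS"] - rel_counts["POSTED"]

# Same argument for BEERS nodes that are not the target of any BREWED relationship.
min_beers_without_brewery = node_counts["BEERS"] - rel_counts["BREWED"]

print(min_reviews_without_user)   # 11227
print(min_beers_without_brewery)  # 58873
```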


In [11]:
query = """
        MATCH (c:COUNTRIES)
        WHERE NOT c.code IN ['AD', 'AE', 'AF', 'AG', 'AI', 'AL', 'AM', 'AO', 'AQ', 'AR', 'AS', 'AT', 'AU', 'AW', 
            'AX', 'AZ', 'BA', 'BB', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BJ', 'BL', 'BM', 'BN', 'BO', 'BQ', 'BR', 
            'BS', 'BT', 'BV', 'BW', 'BY', 'BZ', 'CA', 'CC', 'CD', 'CF', 'CG', 'CH', 'CI', 'CK', 'CL', 'CM', 'CN', 
            'CO', 'CR', 'CU', 'CV', 'CW', 'CX', 'CY', 'CZ', 'DE', 'DJ', 'DK', 'DM', 'DO', 'DZ', 'EC', 'EE', 'EG', 
            'EH', 'ER', 'ES', 'ET', 'FI', 'FJ', 'FK', 'FM', 'FO', 'FR', 'GA', 'GB', 'GD', 'GE', 'GF', 'GG', 'GH', 
            'GI', 'GL', 'GM', 'GN', 'GP', 'GQ', 'GR', 'GS', 'GT', 'GU', 'GW', 'GY', 'HK', 'HM', 'HN', 'HR', 'HT', 
            'HU', 'ID', 'IE', 'IL', 'IM', 'IN', 'IO', 'IQ', 'IR', 'IS', 'IT', 'JE', 'JM', 'JO', 'JP', 'KE', 'KG', 
            'KH', 'KI', 'KM', 'KN', 'KP', 'KR', 'KW', 'KY', 'KZ', 'LA', 'LB', 'LC', 'LI', 'LK', 'LR', 'LS', 'LT', 
            'LU', 'LV', 'LY', 'MA', 'MC', 'MD', 'ME', 'MF', 'MG', 'MH', 'MK', 'ML', 'MM', 'MN', 'MO', 'MP', 'MQ', 
            'MR', 'MS', 'MT', 'MU', 'MV', 'MW', 'MX', 'MY', 'MZ', 'NA', 'NC', 'NE', 'NF', 'NG', 'NI', 'NL', 'NO', 
            'NP', 'NR', 'NU', 'NZ', 'OM', 'PA', 'PE', 'PF', 'PG', 'PH', 'PK', 'PL', 'PM', 'PN', 'PR', 'PS', 'PT', 
            'PW', 'PY', 'QA', 'RE', 'RO', 'RS', 'RU', 'RW', 'SA', 'SB', 'SC', 'SD', 'SE', 'SG', 'SH', 'SI', 'SJ', 
            'SK', 'SL', 'SM', 'SN', 'SO', 'SR', 'SS', 'ST', 'SV', 'SX', 'SY', 'SZ', 'TC', 'TD', 'TF', 'TG', 'TH', 
            'TJ', 'TK', 'TL', 'TM', 'TN', 'TO', 'TR', 'TT', 'TV', 'TW', 'TZ', 'UA', 'UG', 'UM', 'US', 'UY', 'UZ', 
            'VA', 'VC', 'VE', 'VG', 'VI', 'VN', 'VU', 'WF', 'WS', 'YE', 'YT', 'ZA', 'ZM', 'ZW']
        RETURN c.name
"""

result = execute_read(driver, query)

pprint(result)

[]


<div style="background-color:rgb(243, 247, 243); border-left: 10px solid rgb(29, 213, 78); padding: 10px; border-radius: 10px;">
COUNTRIES:<br>
- The name used for countries is <a href="https://www.iban.com/country-codes" target="_blank">Alpha-2</a><br>
- Countries are only connected to Cities (y:CITIES)-[r:IN]->(c:COUNTRIES)<br>
- There are countries without connections
</div>
<div style="background-color:rgb(251, 227, 227); border-left: 10px solid rgb(185, 29, 9); padding: 10px; border-radius: 10px;">

- Cities are also connected to Breweries: (r:BREWERIES)-[:IN]->(:CITIES)
- BREWERIES are connected to BEERS (each BREWERIES has at least one BEERS): -[r:BREWED]->(n:BEERS)
- (n:BEERS)-[r:REVIEWED]->(j:REVIEWS)
- (n:BEERS)-[r:HAS_STYLE]->(j:STYLE) (different beers can have the same style; there is no STYLE that is not connected to some BEERS)
- (n:REVIEWS)-[r:POSTED]->(j:USER) (there are reviews without a user)

- No BREWERIES is in more than one CITIES
- Each BREWERIES can have more than one BEER
- Each BEER can have more than one REVIEW
- The same BEER cannot have more than one STYLE
- There are ids without a label
- There are REVIEWS without a USER
- A given REVIEW cannot be written by more than one USER
- There are disconnected USER, BEERS, CITIES, COUNTRIES and BREWERIES nodes
- There are no repeated STYLES
</div>


In [12]:
query = """
        MATCH (r:REVIEWS)
        WHERE NOT (r)-[:POSTED]->(:USER)
        OPTIONAL MATCH (r)-[rel]->(n)
        RETURN r, rel, n
        LIMIT 25;
"""

result = execute_read(driver, query)

pprint(result)




[<Record r=<Node element_id='921375' labels=frozenset() properties={}> rel=None n=None>,
 <Record r=<Node element_id='921921' labels=frozenset() properties={}> rel=None n=None>,
 <Record r=<Node element_id='922467' labels=frozenset() properties={}> rel=None n=None>,
 <Record r=<Node element_id='923013' labels=frozenset() properties={}> rel=None n=None>,
 <Record r=<Node element_id='923559' labels=frozenset() properties={}> rel=None n=None>,
 <Record r=<Node element_id='924105' labels=frozenset() properties={}> rel=None n=None>,
 <Record r=<Node element_id='924651' labels=frozenset() properties={}> rel=None n=None>,
 <Record r=<Node element_id='925197' labels=frozenset() properties={}> rel=None n=None>,
 <Record r=<Node element_id='925743' labels=frozenset() properties={}> rel=None n=None>,
 <Record r=<Node element_id='926289' labels=frozenset() properties={}> rel=None n=None>,
 <Record r=<Node element_id='926835' labels=frozenset() properties={}> rel=None n=None>,
 <Record r=<Node elem

In [13]:
query = """
        CALL db.stats.retrieve('GRAPH COUNTS');
"""

result = execute_read(driver, query)

pprint(result)

[<Record section='GRAPH COUNTS' data={'relationships': [{'count': 5856205}, {'relationshipType': 'REVIEWED', 'count': 2537991}, {'relationshipType': 'REVIEWED', 'startLabel': 'BEERS', 'count': 2537991}, {'relationshipType': 'REVIEWED', 'count': 2537991, 'endLabel': 'REVIEWS'}, {'relationshipType': 'BREWED', 'count': 358873}, {'relationshipType': 'BREWED', 'startLabel': 'BREWERIES', 'count': 358873}, {'relationshipType': 'BREWED', 'count': 358873, 'endLabel': 'BEERS'}, {'relationshipType': 'IN', 'count': 62424}, {'relationshipType': 'IN', 'count': 12077, 'endLabel': 'COUNTRIES'}, {'relationshipType': 'IN', 'startLabel': 'CITIES', 'count': 12077}, {'relationshipType': 'IN', 'count': 50347, 'endLabel': 'CITIES'}, {'relationshipType': 'IN', 'startLabel': 'BREWERIES', 'count': 50347}, {'relationshipType': 'HAS_STYLE', 'count': 358873}, {'relationshipType': 'HAS_STYLE', 'startLabel': 'BEERS', 'count': 358873}, {'relationshipType': 'HAS_STYLE', 'count': 358873, 'endLabel': 'STYLE'}, {'relatio

<div style="background-color:rgb(251, 227, 227); border-left: 10px solid rgb(185, 29, 9); padding: 10px; border-radius: 10px;">

Relationships: The count of all relationships in the database, broken down by relationship type and between specific node labels. For example, there are 5,856,205 relationships in total, and it provides counts for specific relationship types like REVIEWED, BREWED, IN, HAS_STYLE, and POSTED.

Nodes: The count of nodes in the database, broken down by node label. For example, there are 3,215,489 nodes in total, with detailed counts for specific labels like COUNTRIES, CITIES, BREWERIES, BEERS, REVIEWS, STYLE, and USER.

Indexes: Information on the indexes in use within the graph, including: The type of index (e.g., LOOKUP or BTREE). The size of the index and the properties or labels it covers.

Constraints: This field is empty, indicating that no constraints are defined on the graph.
</div>
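
The `relationships` list in the GRAPH COUNTS output mixes three kinds of entries: a grand total, per-type totals, and per-label breakdowns. A minimal sketch of pulling out just the per-type totals follows; `totals_by_type` is a hypothetical helper, and it assumes the list has been extracted from the record first (for example with `result[0]["data"]["relationships"]`).

```python
def totals_by_type(relationship_entries):
    """Keep only per-type totals from the 'relationships' list of GRAPH COUNTS.

    Entries that also carry 'startLabel' or 'endLabel' are per-label
    breakdowns, and the entry without 'relationshipType' is the grand total.
    """
    totals = {}
    for entry in relationship_entries:
        if "relationshipType" not in entry:
            continue
        if "startLabel" in entry or "endLabel" in entry:
            continue
        totals[entry["relationshipType"]] = entry["count"]
    return totals


# Sample entries copied from the (truncated) output above.
sample = [
    {"count": 5856205},
    {"relationshipType": "IN", "count": 62424},
    {"relationshipType": "IN", "count": 12077, "endLabel": "COUNTRIES"},
    {"relationshipType": "IN", "startLabel": "BREWERIES", "count": 50347},
]
print(totals_by_type(sample))  # {'IN': 62424}
```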

In [14]:
query = """
        MATCH ()-[r]->() 
        RETURN type(r), count(r) 
        ORDER BY count(r) DESC;

"""

result = execute_read(driver, query)

pprint(result)

[<Record type(r)='POSTED' count(r)=2538044>,
 <Record type(r)='REVIEWED' count(r)=2537991>,
 <Record type(r)='BREWED' count(r)=358873>,
 <Record type(r)='HAS_STYLE' count(r)=358873>,
 <Record type(r)='IN' count(r)=62424>]


In [15]:
query = """
        MATCH (b:BEERS) 
        WHERE b.abv IS NULL 
        RETURN count(b);
"""

result = execute_read(driver, query)

pprint(result)

[<Record count(b)=0>]


In [16]:
# Are USER nodes that are not connected to anything of any interest?
query = """
        MATCH (n)  
        WHERE NOT (n)--()  
        RETURN labels(n), properties(n)  
        LIMIT 10;
"""

result = execute_read(driver, query)

pprint(result)



[<Record labels(n)=['USER'] properties(n)={'name': 'Rick_Ereth'}>,
 <Record labels(n)=['USER'] properties(n)={'name': 'matttyt'}>,
 <Record labels(n)=['USER'] properties(n)={'name': 'ChaBrah'}>,
 <Record labels(n)=['USER'] properties(n)={'name': 'bbc0202'}>,
 <Record labels(n)=['USER'] properties(n)={'name': 'Kbenoit16'}>,
 <Record labels(n)=['USER'] properties(n)={'name': 'Jonathan101'}>,
 <Record labels(n)=['USER'] properties(n)={'name': 'RicketyCrix'}>,
 <Record labels(n)=['USER'] properties(n)={'name': 'Ddeck212'}>,
 <Record labels(n)=['USER'] properties(n)={'name': 'TommyWiseau22'}>,
 <Record labels(n)=['USER'] properties(n)={'name': 'KyleWalker081'}>]


In [17]:
query = """
        MATCH (b1:BREWERIES), (b2:BREWERIES)
        WHERE b1.name = b2.name AND id(b1) <> id(b2)
        RETURN b1.name, count(*)
        ORDER BY count(*) DESC;
"""

result = execute_read(driver, query)

pprint(result)



[<Record b1.name='Whole Foods Market' count(*)=104652>,
 <Record b1.name='Total Wine & More' count(*)=86142>,
 <Record b1.name='Cost Plus World Market' count(*)=55460>,
 <Record b1.name='Mellow Mushroom' count(*)=51756>,
 <Record b1.name="Trader Joe's" count(*)=30800>,
 <Record b1.name='Old Chicago' count(*)=24180>,
 <Record b1.name="BJ's Restaurant & Brewhouse" count(*)=18360>,
 <Record b1.name='World of Beer' count(*)=17292>,
 <Record b1.name='Yard House' count(*)=14762>,
 <Record b1.name='Wegmans' count(*)=13340>,
 <Record b1.name='Beverages & more!' count(*)=11990>,
 <Record b1.name='Buffalo Wild Wings' count(*)=7656>,
 <Record b1.name='Giant Eagle' count(*)=6972>,
 <Record b1.name='ABC Fine Wine & Spirits' count(*)=6320>,
 <Record b1.name='Hy-Vee Wine & Spirits' count(*)=6006>,
 <Record b1.name='Granite City Food & Brewery' count(*)=5112>,
 <Record b1.name='The Brass Tap' count(*)=4032>,
 <Record b1.name="BJ's Restaurant" count(*)=4032>,
 <Record b1.name="Binny's Beverage Depot" c

<div style="background-color:rgb(243, 247, 243); border-left: 10px solid rgb(29, 213, 78); padding: 10px; border-radius: 10px;">
    The beer with the highest alcohol by volume (ABV) is Brewmeister’s <a href="https://www.coalitionbrewing.com/which-ipa-has-the-highest-alcohol-content/" target="_blank">“Snake Venom”</a> at an eye-watering 67.5% ABV.
</div>

In [18]:
query = """
        MATCH (b:BEERS) 
        WHERE toFloat(b.abv) > 67.5 OR toFloat(b.abv) < 0 
        RETURN b.name, b.abv;
"""

result = execute_read(driver, query)

pprint(result)


[<Record b.name="Earache: World's Shortest Album" b.abv='100.0'>,
 <Record b.name='Dark Reckoning' b.abv='80.0'>,
 <Record b.name='Radiohead - OK Computer' b.abv='100.0'>,
 <Record b.name='water' b.abv='100.0'>]
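
The same plausibility rule can be expressed on the Python side, for example to check values before writing them back. This is a minimal sketch: `parse_valid_abv` is a hypothetical helper, and it assumes `abv` arrives as a string, as in the output above.

```python
MAX_KNOWN_ABV = 67.5  # Snake Venom, per the note above


def parse_valid_abv(raw):
    """Return the ABV as a float, or None if it is missing or implausible."""
    try:
        abv = float(raw)
    except (TypeError, ValueError):
        return None
    # Comparisons with NaN are False, so NaN also falls through to None.
    return abv if 0 <= abv <= MAX_KNOWN_ABV else None


print(parse_valid_abv("100.0"))  # None (e.g. the 'water' entry above)
print(parse_valid_abv("67.5"))   # 67.5
```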


In [19]:
query = """
        MATCH (b:BEERS)
        WHERE b.retired <> 'f' AND b.retired <> 't'
        RETURN b LIMIT 25;
"""

result = execute_read(driver, query)

pprint(result)

[]


In [19]:
query = """
        MATCH (r:REVIEWS)
        WHERE toFloat(r.feel) > 5 OR toFloat(r.look) > 5 OR toFloat(r.overall) > 5 OR toFloat(r.score) > 5 OR toFloat(r.smell) > 5 OR toFloat(r.taste) > 5
        RETURN r LIMIT 25;
"""

result = execute_read(driver, query)

pprint(result)

[]


<div style="background-color:rgb(243, 247, 243); border-left: 10px solid rgb(17, 126, 46); padding: 10px; border-radius: 10px;">
<b>2.</b> Adjust the database and mention reasoning behind: e.g. clean errors, remove redundancies, adjust schema as necessary. Visualize the final version of database schema.
</div>

<div style="background-color:rgb(243, 247, 243); border-left: 10px solid rgb(17, 126, 46); padding: 10px; border-radius: 10px;">
<b>3.1.</b> How many reviews has the beer with the most reviews?
</div>

In [21]:
query = """
        MATCH (b:BEERS)-[:REVIEWED]->(r:REVIEWS)  
        RETURN b.name, COUNT(r) AS nr_of_review  
        ORDER BY nr_of_review DESC  
        LIMIT 1;
"""

result = execute_read(driver, query)

pprint(result)


[<Record b.name='IPA' nr_of_review=8771>]


 <div style="background-color:rgb(243, 247, 243); border-left: 10px solid rgb(17, 126, 46); padding: 10px; border-radius: 10px;">
<b>3.2.</b> Which three users wrote the most reviews about beers?
</div>

In [22]:
query = """
        MATCH (b:BEERS)-[:REVIEWED]->(r:REVIEWS)-[:POSTED]->(u:USER)  
        RETURN u.name, COUNT(r) AS nr_of_review  
        ORDER BY nr_of_review DESC  
        LIMIT 3;
"""

result = execute_read(driver, query)

pprint(result)

[<Record u.name='Sammy' nr_of_review=3756>,
 <Record u.name='acurtis' nr_of_review=3403>,
 <Record u.name='kylehay2004' nr_of_review=3368>]


<div style="background-color:rgb(243, 247, 243); border-left: 10px solid rgb(17, 126, 46); padding: 10px; border-radius: 10px;">
<b>3.3.</b> Find all beers that are described with following words: 'fruit', 'complex', 'nutty', 'dark'.
</div>

In [23]:
query = """
        MATCH (b:BEERS)  
        WHERE b.notes =~ '.*fruit.*'  
        OR b.notes =~ '.*complex.*'  
        OR b.notes =~ '.*nutty.*'  
        OR b.notes =~ '.*dark.*'  
        RETURN b.name, b.notes  
        LIMIT 50;
"""

result = execute_read(driver, query)

pprint(result)

[<Record b.name='Hefeweizen' b.notes='A pale, spicy, fruity, refreshing Hefeweizen originating in Southern Germany. The fast-maturing beer is lightly hopped with hallertau and shows a unique banana-and-clove yeast character. This is a specialty for summer consumption but enjoyed year-round by all. Prost!'>,
 <Record b.name='Wheat Ale' b.notes='A hefe-weizen with exotic top notes to the aromas. Predominantly banana with floral edges. On the palate the fruit notes continue with some caramel coming through on the rich elegant finish. Retains a good mousse throughout.'>,
 <Record b.name='Philadelphia Porter' b.notes='Originally enjoyed by the working class of England, porters inevitably became a staple in American brew pubs. Our interpretation is brewed with Caramel malt for a rich, toffee-like sweetness and a touch of East Kent Goldings hops. However, it’s the robust flavor of the black and chocolate malt that take center stage in this dark, full-bodied ale.'>,
 <Record b.name='Black Cher
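
The `=~` operator is case-sensitive by default, so the query above would miss notes that mention 'Dark' or 'Fruit' with a capital letter. A minimal sketch of building a single case-insensitive pattern for all four words is shown here; `keyword_pattern` is a hypothetical helper, and like the original query it matches notes containing any one of the words.

```python
import re

KEYWORDS = ["fruit", "complex", "nutty", "dark"]


def keyword_pattern(words):
    """Build one case-insensitive pattern that matches any of the words."""
    alternatives = "|".join(re.escape(word) for word in words)
    return f"(?i).*({alternatives}).*"


pattern = keyword_pattern(KEYWORDS)
print(pattern)  # (?i).*(fruit|complex|nutty|dark).*

# Hypothetical use, with the pattern placed inline in the query text:
# query = f"MATCH (b:BEERS) WHERE b.notes =~ '{pattern}' RETURN b.name, b.notes"
```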

<div style="background-color:rgb(243, 247, 243); border-left: 10px solid rgb(17, 126, 46); padding: 10px; border-radius: 10px;">
<b>3.4.</b> Which top three breweries produce the largest variety of beer styles?
</div>

In [24]:
query = """
        MATCH (br:BREWERIES)-[:BREWED]->(be:BEERS)-[:HAS_STYLE]->(s:STYLE)  
        RETURN br.name, COUNT(DISTINCT s) AS count  
        ORDER BY count DESC  
        LIMIT 3;
"""

result = execute_read(driver, query)

pprint(result)

[<Record br.name='Iron Hill Brewery & Restaurant' count=94>,
 <Record br.name='Rock Bottom Restaurant & Brewery' count=93>,
 <Record br.name='Goose Island Beer Co.' count=88>]


<div style="background-color:rgb(243, 247, 243); border-left: 10px solid rgb(17, 126, 46); padding: 10px; border-radius: 10px;">
<b>3.5.</b> Which country produces the most beer styles?
</div>

In [25]:
query = """
        MATCH (co:COUNTRIES)<-[:IN]-(ci:CITIES)<-[:IN]-(br:BREWERIES)-[:BREWED]->(be:BEERS)-[:HAS_STYLE]->(s:STYLE)
        RETURN co.name AS country, COUNT(DISTINCT s) AS style_count
        ORDER BY style_count DESC
        LIMIT 1;
"""

result = execute_read(driver, query)

pprint(result)

[<Record country='US' style_count=113>]


4. The Market Analysis department in your company accesses and updates the trends data on a daily basis. Given that, consider how you need to optimize the database and its performance so that the following queries are efficient. Measure performance to communicate your improvements using PROFILE before the final query. Answer the following:
    1. Using the ABV score, find the five strongest beers and display their ABV score and the corresponding brewery. Keep in mind that the strongest known beer is Snake Venom, and deal with the error entries in the database.

In [None]:
// Create an index to improve query performance
CREATE INDEX FOR (b:BEERS) ON (b.abv);

// Remove entries with invalid ABV values: missing, negative, or above
// Snake Venom's 67.5% (abv is stored as a string, hence toFloat)
MATCH (b:BEERS)
WHERE toFloat(b.abv) IS NULL OR toFloat(b.abv) < 0 OR toFloat(b.abv) > 67.5
DETACH DELETE b;

// Check that "Snake Venom" has the correct value
MATCH (b:BEERS)
WHERE b.name = "Snake Venom"
RETURN b.abv;

// Find the 5 strongest beers and their respective breweries
PROFILE
MATCH (brew:BREWERIES)-[:BREWED]->(b:BEERS)
WHERE toFloat(b.abv) IS NOT NULL
RETURN b.name AS Beer, toFloat(b.abv) AS ABV, brew.name AS Brewery
ORDER BY ABV DESC
LIMIT 5;
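
To communicate the effect of an index, the total database hits of a PROFILE plan can be compared before and after the change. The following is a minimal sketch, assuming the plan is available as a nested dictionary with `dbHits` and `children` keys. That appears to be the shape exposed by the Python driver's `ResultSummary.profile`, but the key names should be verified against the driver version in use. `total_db_hits` is a hypothetical helper, and the example plan is made up for illustration.

```python
def total_db_hits(plan):
    """Sum 'dbHits' over a profiled plan and all of its child operators."""
    hits = plan.get("dbHits", 0)
    for child in plan.get("children", []):
        hits += total_db_hits(child)
    return hits


# Made-up plan for illustration only; the numbers are not measurements.
example_plan = {
    "operatorType": "ProduceResults",
    "dbHits": 0,
    "children": [
        {"operatorType": "Top", "dbHits": 5, "children": [
            {"operatorType": "NodeByLabelScan", "dbHits": 100, "children": []},
        ]},
    ],
}
print(total_db_hits(example_plan))  # 105
```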

4. The Market Analysis department in your company accesses and updates the trends data on a daily basis. Given that, consider how you need to optimize the database and its performance so that the following queries are efficient. Measure performance to communicate your improvements using PROFILE before the final query. Answer the following:
    2. Using the answer from question 2, find the top 5 distinct beer styles with the highest average score of smell + feel that were reviewed by the third most productive user. Keep in mind that cleaning the database earlier should ensure correct results.

In [None]:
// Create indexes to improve query performance
// (USER nodes only have a name property, so the index is on name)
CREATE INDEX FOR (r:REVIEWS) ON (r.smell, r.feel);
CREATE INDEX FOR (u:USER) ON (u.name);
CREATE INDEX FOR (s:STYLE) ON (s.name);

// Find the third most productive user (the one who wrote the third most reviews)
PROFILE
MATCH (u:USER)<-[:POSTED]-(r:REVIEWS)
WITH u, count(r) AS review_count
ORDER BY review_count DESC
SKIP 2 LIMIT 1

// Find the 5 distinct styles with the highest average (smell + feel) for that user,
// following the schema directions (BEERS)-[:REVIEWED]->(REVIEWS)-[:POSTED]->(USER)
MATCH (u)<-[:POSTED]-(r:REVIEWS)<-[:REVIEWED]-(b:BEERS)-[:HAS_STYLE]->(s:STYLE)
WHERE toFloat(r.smell) IS NOT NULL AND toFloat(r.feel) IS NOT NULL
WITH s.name AS Beer_Style, avg(toFloat(r.smell) + toFloat(r.feel)) AS Avg_Score
ORDER BY Avg_Score DESC
LIMIT 5
RETURN Beer_Style, Avg_Score;


In [37]:
query = """
        MATCH (n)  
        WHERE size(keys(n)) = 0 
        RETURN n LIMIT 25;
"""

result = execute_read(driver, query)

pprint(result)

[]
