# Elasticsearch

In [None]:
import os
from elasticsearch import Elasticsearch, helpers, NotFoundError
import json
from datetime import datetime

In [None]:
USER = "elastic"
PWD = "mXwp5dz4"
API_KEY = "a1JVeVo1VUJBdUJsalpERXYwNXg6RnpOeEZvUnRTTC0xZVJDQ0ZacHhRdw=="

In [None]:
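client = Elasticsearch("http://localhost:9200",
                       basic_auth=(USER, PWD))

# Index name used by the queries below; assumed to match the literal
# "nyc_restaurants" that Q5 passes to `search`.
nyc_index = "nyc_restaurants"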

In [None]:
client.info()

### New York City Restaurants JSON dataset

### Full Query DSL (Domain Specific Language)

### `search` method

- Executes a search query and returns the hits that match it. The query can be supplied through the `q` query string parameter or through the request body.
- documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html

#### (Simple search) Q1: Find all restaurants in Manhattan that were subject to inspection.

In [None]:
response = client.search(
    index=nyc_index,
    body={
        "query": {
            "match": {
                "BORO": "Manhattan"
            }
        },
        "size": 20
    }
)

#response

for hit in response['hits']['hits']:
    print(hit['_source']['DBA'], hit['_source']['BORO'],
          hit['_source']['BUILDING'], hit['_source']['STREET'])

### Controlling size of the `hits` and the `_source` fields

In [None]:
response = client.search(
    index=nyc_index,
    body=
)

print(response['hits']['total']['value'])

# for hit in response['hits']['hits']:
#     # print(hit['_source']['DBA'], hit['_source']['BORO'], \
#     #      hit['_source']['BUILDING'], hit['_source']['STREET'])
#     print(hit['_source'])
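
One way to fill in the body above is to combine a query with `size` (how many hits come back) and `_source` (which fields each hit carries). This is a minimal sketch, not the notebook's original answer: the Manhattan query is reused from Q1, and the limit of 5 and the variable name `sized_body` are arbitrary choices.

```python
# Sketch only: return at most 5 hits and trim each hit's _source to four fields.
# The query, the size value, and the field list are illustrative assumptions.
sized_body = {
    "query": {"match": {"BORO": "Manhattan"}},
    "size": 5,
    "_source": ["DBA", "BORO", "BUILDING", "STREET"],
}
```

Passed as `body=sized_body`, the response carries five hits, while `response['hits']['total']['value']` still describes all matching documents (by default it is tracked accurately only up to 10,000).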

#### (Simple search) Q2: Find all inspected "Pizza" restaurants.

In [None]:
response = client.search(
    index=nyc_index,
    body=
)

for hit in response['hits']['hits']:
    print(hit['_source']['DBA'])
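
A possible body for Q2, assuming the restaurant name lives in `DBA` (the field the loop prints) and that `DBA` is mapped as an analyzed `text` field, so the match is case-insensitive. The name `pizza_body` and the `size` value are illustrative.

```python
# Sketch only: full-text match on the restaurant name.
# Assumes DBA is an analyzed text field; size is an arbitrary choice.
pizza_body = {
    "query": {"match": {"DBA": "Pizza"}},
    "size": 20,
}
```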

#### (Fuzzy search) Q3: Find all inspected restaurants whose names are similar to "Mamma Mia".

In [None]:
response = client.search(
    index=nyc_index,
    body=
)

for hit in response['hits']['hits']:
    print(hit['_source']['DBA'])
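
A possible body for Q3. The sketch uses a `match` query with the `fuzziness` parameter rather than the term-level `fuzzy` query, because "Mamma Mia" is two words and `match` analyzes the text into terms before applying the edit distance. Treating `DBA` as the name field is an assumption, as is the name `fuzzy_body`.

```python
# Sketch only: fuzzy full-text match on the restaurant name.
# "AUTO" lets Elasticsearch pick the allowed edit distance from each term's length.
fuzzy_body = {
    "query": {
        "match": {
            "DBA": {
                "query": "Mamma Mia",
                "fuzziness": "AUTO",
            }
        }
    }
}
```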



#### (Phrase match search) Q4: Find all inspected restaurants whose violations contain the phrase "food worker".

In [None]:
response = client.search(
    index=nyc_index,
    body=
)

for hit in response['hits']['hits']:
    print(hit['_source']['DBA'],
          hit['_source']['VIOLATION DESCRIPTION'])
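
A possible body for Q4. `match_phrase` requires the terms to appear adjacent and in order, which is what separates it from a plain `match`. The field name comes from the loop above; the name `phrase_body` is illustrative.

```python
# Sketch only: the terms "food" and "worker" must occur next to each other, in this order.
phrase_body = {
    "query": {
        "match_phrase": {
            "VIOLATION DESCRIPTION": "food worker"
        }
    }
}
```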

#### (Multi field match search) Q5: Find all restaurants that have "Pizza" or "Pasta" in either their name or their "CUISINE DESCRIPTION".

In [None]:
response = client.search(
    index="nyc_restaurants",
    body=
)

for hit in response['hits']['hits']:
    print(f"{hit['_source']['DBA']}: {hit['_source']['CUISINE DESCRIPTION']}")
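
A possible body for Q5. `multi_match` runs the same query text against several fields, and its default operator is OR, so a document matching either "Pizza" or "Pasta" in either field qualifies. Treating `DBA` as the name field is an assumption; `pizza_pasta_body` is an illustrative name.

```python
# Sketch only: one query string, two fields, default OR between the terms.
pizza_pasta_body = {
    "query": {
        "multi_match": {
            "query": "Pizza Pasta",
            "fields": ["DBA", "CUISINE DESCRIPTION"],
        }
    }
}
```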

#### Q6: Find all inspected restaurants that have Italian or Mexican in their "CUISINE DESCRIPTION".

Using `multi_match`.

In [None]:
query = {
    
}

response = client.search(index=nyc_index, body=query)
print(response['hits']['total']['value'])
for hit in response['hits']['hits']:
    print(hit['_source']['DBA'], hit['_source']['CUISINE DESCRIPTION'])
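
For the `multi_match` variant of Q6, the query can look like the sketch below. It is a guess at the intended answer: with a single field in `fields`, `multi_match` behaves like a plain `match` with the default OR operator. The name `cuisine_multi_match` is illustrative.

```python
# Sketch only: matches documents whose cuisine contains "Italian" or "Mexican".
cuisine_multi_match = {
    "query": {
        "multi_match": {
            "query": "Italian Mexican",
            "fields": ["CUISINE DESCRIPTION"],
        }
    }
}
```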

Using `simple_query_string`.

In [None]:
query = {
    
}
response = client.search(index=nyc_index, body=query)
print(response['hits']['total']['value'])

for hit in response['hits']['hits']:
    print(hit['_source']['DBA'], ":", hit['_source']['CUISINE DESCRIPTION'])
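
For the `simple_query_string` variant, the OR can be written explicitly with the `|` operator of that query's mini-syntax. A sketch under the same field assumption; `cuisine_sqs` is an illustrative name.

```python
# Sketch only: "|" is the OR operator in simple_query_string syntax.
cuisine_sqs = {
    "query": {
        "simple_query_string": {
            "query": "Italian | Mexican",
            "fields": ["CUISINE DESCRIPTION"],
        }
    }
}
```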


### Boosting

- Boosting is the process by which you can modify the relevance of a document.
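- There are two types of boosting: boosting at index time and boosting at query time. The exercises here boost at query time; index-time boosts on field mappings have been deprecated since Elasticsearch 5.0.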
- Reading: https://weng.gitbooks.io/elasticsearch-in-action/content/chapter6_searching_with_relevancy/63boosting.html 

#### Q7: Find all inspected restaurants that have Italian or Mexican in their "CUISINE DESCRIPTION" with higher scoring for Mexican (10.0).

In [None]:
query = {
    
}
response = client.search(index=nyc_index, body=query)
print(response['hits']['total']['value'])

for hit in response['hits']['hits']:
    print(hit['_source']['DBA'], ":", hit['_source']['CUISINE DESCRIPTION'])


### Why does Italian still get ranked higher than Mexican?

1. Term frequency: BM25, Elasticsearch's default ranking algorithm, scores a document higher the more often the query term occurs in the field. Documents that repeat "Italian" can therefore gain enough score to offset the boost.

2. Document length: BM25 normalizes by field length, so a match in a short field counts for more than the same match in a long one. If the fields mentioning "Mexican" are much longer, they may score lower than shorter fields mentioning "Italian".

3. Field analysis: The field `CUISINE DESCRIPTION` might be tokenized in a way that affects how terms are matched and scored. For example, stemming or normalization could influence relevance.

4. Scoring nuances: The BM25 score combines term frequency (TF), inverse document frequency (IDF), and field length. A term that is rarer across the index receives a higher IDF weight, so the balance between "Italian" and "Mexican" also depends on how the two terms are distributed in the indexed documents.

### How can we re-write the query to make sure `Mexican` gets boosted?

### Disjunction max query

- Returns documents matching one or more wrapped queries, called query clauses or clauses.
- If a returned document matches multiple query clauses, the `dis_max` query assigns the document the highest relevance score from any matching clause, plus a tie-breaking increment for any additional matching subqueries.
- documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-dis-max-query.html 

In [None]:
query = {
    
}

response = client.search(index=nyc_index, body=query)
print(response['hits']['total']['value'])
for hit in response['hits']['hits']:
    print(hit['_source']['DBA'], ":", hit['_source']['CUISINE DESCRIPTION'])
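
The `dis_max` description above can be sketched as the body below. Each cuisine gets its own clause, so the boost applies to the Mexican clause alone. The `tie_breaker` of 0.3 is an arbitrary value chosen for illustration (the default is 0.0), and `dis_max_body` is an illustrative name rather than the notebook's original answer.

```python
# Sketch only: two separate clauses; the document's score is the best clause score
# plus tie_breaker times the scores of any other matching clauses.
dis_max_body = {
    "query": {
        "dis_max": {
            "queries": [
                {"match": {"CUISINE DESCRIPTION": "Italian"}},
                {"match": {"CUISINE DESCRIPTION": {"query": "Mexican", "boost": 10.0}}},
            ],
            "tie_breaker": 0.3,  # assumed value
        }
    }
}
```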


### Highlighting 

- During the search process, Elasticsearch extracts the text from the fields you want to highlight.
- It then marks the matching terms in the retrieved documents, usually by wrapping them in HTML tags (like `<em>` or `<strong>`), making them visually distinct.

#### Q8: Find all inspected restaurants that have Mexican in "CUISINE DESCRIPTION" and highlight the "CUISINE DESCRIPTION" field in the results.

In [None]:
query = {

}

response = client.search(index=nyc_index, body=query)
print(response)

In [None]:
query = {
    "query": {
        "match_phrase": {
            "CUISINE DESCRIPTION": "Mexican"
        }
    },
    "_source": ["DBA", "CUISINE DESCRIPTION"],
    "highlight": {
        "fields": {
            "CUISINE DESCRIPTION": {
                "pre_tags": ["<strong>"], 
                "post_tags": ["</strong>"]
            }
        }
    }
}

response = client.search(index=nyc_index, body=query)
print(response)
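
Printing the whole response makes the highlighted fragments hard to spot. Each hit that matched carries a `highlight` object mapping field names to lists of fragments, so they can be pulled out directly. The sketch below runs against a hand-written stand-in for a response; the restaurant name is made up and the helper `highlighted_fragments` is hypothetical.

```python
# Hand-written stand-in shaped like a search response; not real data.
sample_response = {
    "hits": {"hits": [{
        "_source": {"DBA": "EXAMPLE TAQUERIA", "CUISINE DESCRIPTION": "Mexican"},
        "highlight": {"CUISINE DESCRIPTION": ["<strong>Mexican</strong>"]},
    }]}
}

def highlighted_fragments(response, field):
    """Return (DBA, fragments) pairs for hits that carry a highlight on `field`."""
    pairs = []
    for hit in response["hits"]["hits"]:
        # Hits without a highlight section, or without this field, are skipped.
        fragments = hit.get("highlight", {}).get(field, [])
        if fragments:
            pairs.append((hit["_source"]["DBA"], fragments))
    return pairs

for name, fragments in highlighted_fragments(sample_response, "CUISINE DESCRIPTION"):
    print(name, ":", " ... ".join(fragments))
```

The same helper can be called with the real `response` from the cell above.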

### Benefits of Highlighting
- Improved User Experience: Highlighting helps users quickly identify relevant sections of text that match their search terms, making it easier to evaluate the results.
- Enhanced Readability: By drawing attention to specific keywords or phrases, you improve the overall readability of search results.
- Customization: You can customize how highlighting appears (e.g., using different tags or styles) to fit the design of your application.

### Boolean operators

- `must` operator: AND
- `should` operator: OR
- `must_not` operator: NOT

#### Q9: Find all Italian restaurants that were inspected in Manhattan.

In [None]:
query = {
    
}

response = client.search(index=nyc_index, body=query)
print(response['hits']['total']['value'])
for hit in response['hits']['hits']:
    print(hit['_source'])
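
A possible body for Q9, combining two `match` clauses under `must` so that both have to hold. Field names come from earlier cells; `italian_manhattan` is an illustrative name.

```python
# Sketch only: must = AND, so a hit has to be Italian and in Manhattan.
italian_manhattan = {
    "query": {
        "bool": {
            "must": [
                {"match": {"CUISINE DESCRIPTION": "Italian"}},
                {"match": {"BORO": "Manhattan"}},
            ]
        }
    }
}
```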

#### Q10: Find all inspected restaurants that have Italian or Mexican in their "CUISINE DESCRIPTION".

In [None]:
query = {
    
}

response = client.search(index=nyc_index, body=query)
print(response['hits']['total']['value'])
for hit in response['hits']['hits']:
    print(hit['_source'])
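
A possible body for Q10 using `should`. When a `bool` query has no `must` or `filter` clause, at least one `should` clause has to match; the sketch spells that out with `minimum_should_match` for clarity. `italian_or_mexican` is an illustrative name.

```python
# Sketch only: should = OR; minimum_should_match makes the "at least one" rule explicit.
italian_or_mexican = {
    "query": {
        "bool": {
            "should": [
                {"match": {"CUISINE DESCRIPTION": "Italian"}},
                {"match": {"CUISINE DESCRIPTION": "Mexican"}},
            ],
            "minimum_should_match": 1,
        }
    }
}
```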

#### Q11: Find all inspected restaurants that are not in the Bronx.

In [None]:
query = {
    
}

response = client.search(index=nyc_index, body=query)
print(response['hits']['total']['value'])
for hit in response['hits']['hits']:
    print(hit['_source'])
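
A possible body for Q11. A `bool` query containing only `must_not` matches every document except those matching the excluded clause. `not_bronx` is an illustrative name.

```python
# Sketch only: must_not = NOT, excluding documents whose borough matches "Bronx".
not_bronx = {
    "query": {
        "bool": {
            "must_not": [
                {"match": {"BORO": "Bronx"}}
            ]
        }
    }
}
```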

### Aggregations

#### Q12: How many restaurants are listed in the dataset?

In [None]:
query = {
    
}

response = client.search(index=nyc_index, body=query)
total_count = response['aggregations']['total_restaurants']['value']
print(f"Total number of restaurants: {total_count}")
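
A possible body for Q12, built around the aggregation name `total_restaurants` that the cell reads back. It is a sketch under two loud assumptions: that a `value_count` over the name field is the intended count, and that `DBA` has a `keyword` sub-field (as the default dynamic mapping for strings creates). Neither is confirmed by the notebook.

```python
# Sketch only: size 0 skips the hits and returns just the aggregation.
# Assumes a DBA.keyword sub-field exists; value_count counts values, not distinct names.
count_body = {
    "size": 0,
    "aggs": {
        "total_restaurants": {
            "value_count": {"field": "DBA.keyword"}
        }
    },
}
```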

#### Q13: Find the total score of all restaurant inspections in the dataset.

In [None]:
query = {

}

response = client.search(index=nyc_index, body=query)
print(response['aggregations']['total_score']['value'])

#### Q14: What is the average score of restaurant inspections?

In [None]:
query = {

}

response = client.search(index=nyc_index, body=query)
print(response['aggregations']['average_score']['value'])

#### Q15: What is the minimum score of restaurant inspections?

In [None]:
query = {

}

response = client.search(index=nyc_index, body=query)
print(response['aggregations']['min_score']['value'])

#### Q16: What is the maximum score of restaurant inspections?

In [None]:
query = {

}

response = client.search(index=nyc_index, body=query)
print(response['aggregations']['max_score']['value'])
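
Q13 to Q16 all read a single `value` from a named aggregation, so their bodies can share one shape. The sketch below builds them with a small hypothetical helper, `metric_agg`; the aggregation names come from the cells above, while the numeric field name `SCORE` is an assumption about the dataset.

```python
def metric_agg(name, kind, field="SCORE"):
    """Build a size-0 search body with one single-value metric aggregation.

    `kind` is the aggregation type: "sum", "avg", "min", or "max".
    The default field name "SCORE" is an assumption about the index mapping.
    """
    return {"size": 0, "aggs": {name: {kind: {"field": field}}}}

total_score_body = metric_agg("total_score", "sum")      # Q13
average_score_body = metric_agg("average_score", "avg")  # Q14
min_score_body = metric_agg("min_score", "min")          # Q15
max_score_body = metric_agg("max_score", "max")          # Q16
```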

#### Q17: What are the top 20 most common cuisine types among the inspected restaurants?

In [None]:
query = {

}

response = client.search(index=nyc_index, body=query)
for bucket in response['aggregations']['cuisine_count']['buckets']:
    print(bucket['key'], ":", bucket['doc_count'])
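
A possible body for Q17: a `terms` aggregation returns one bucket per distinct value, ordered by document count, and `size` caps the number of buckets. The `.keyword` sub-field is an assumption about the mapping, since `terms` needs an un-analyzed field; `top_cuisines_body` is an illustrative name.

```python
# Sketch only: top 20 cuisine values by document count.
# Assumes "CUISINE DESCRIPTION" has a keyword sub-field.
top_cuisines_body = {
    "size": 0,
    "aggs": {
        "cuisine_count": {
            "terms": {"field": "CUISINE DESCRIPTION.keyword", "size": 20}
        }
    },
}
```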

#### Q18: How many restaurants fall into each inspection score range (intervals of 5)?

In [None]:
query = {

}

response = client.search(index=nyc_index, body=query)
for bucket in response['aggregations']['score_histogram']['buckets']:
    print(bucket['key'], ":", bucket['doc_count'])
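
A possible body for Q18: a `histogram` aggregation groups numeric values into fixed-width buckets, here of width 5. The field name `SCORE` is an assumption about the dataset; `score_hist_body` is an illustrative name.

```python
# Sketch only: buckets keyed 0, 5, 10, ... each counting the documents in that range.
# Assumes the inspection score is stored in a numeric field called SCORE.
score_hist_body = {
    "size": 0,
    "aggs": {
        "score_histogram": {
            "histogram": {"field": "SCORE", "interval": 5}
        }
    },
}
```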

#### Q19: How many unique cuisine types have been inspected?

In [None]:
query = {

}

response = client.search(index=nyc_index, body=query)
print(response['aggregations']['unique_cuisines']['value'])
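
A possible body for this last question: a `cardinality` aggregation returns an approximate count of distinct values. As with the `terms` sketch, the `.keyword` sub-field is an assumption about the mapping, and `unique_cuisines_body` is an illustrative name.

```python
# Sketch only: approximate distinct count of cuisine values.
# Assumes "CUISINE DESCRIPTION" has a keyword sub-field.
unique_cuisines_body = {
    "size": 0,
    "aggs": {
        "unique_cuisines": {
            "cardinality": {"field": "CUISINE DESCRIPTION.keyword"}
        }
    },
}
```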