# Indexing documents with Elasticsearch

In this notebook we will download a dataset of newspaper articles and index it in the Elasticsearch (ES) instance we set up earlier. For that purpose we will use the [ES Python client](https://pypi.org/project/elasticsearch/). You will first need to install it in your Python environment.

Let's now connect through Python to the ES instance we created earlier and check its info (exactly as we did in the browser).

In [1]:
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")
dict(client.info())

ConnectionError: Connection error caused by: ConnectionError(Connection error caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x000002E67E923D90>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it))
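
A `ConnectionError` like the one above typically means that nothing is listening on `localhost:9200` yet, for example because the ES instance is still starting or is not running. As a minimal sketch, the helper below retries `client.ping()` (which returns `True` or `False`) a few times before giving up. The name `wait_for_elasticsearch` and its `retries` and `delay` parameters are assumptions made for this example, not part of the ES client.

```python
import time


def wait_for_elasticsearch(client, retries=5, delay=2.0):
    """Return True as soon as client.ping() succeeds, False if every attempt fails."""
    for attempt in range(retries):
        if client.ping():
            return True
        if attempt < retries - 1:
            # Give the instance some time to start before trying again.
            time.sleep(delay)
    return False


# Usage, assuming `client` is the Elasticsearch client created above:
# if not wait_for_elasticsearch(client):
#     print("Elasticsearch is not reachable, is the instance running?")
```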

With the following command we can check which indices are already stored in the ES instance. Since we have not created any index yet, it should not return any result!

In [2]:
list(client.indices.get_alias())

ConnectionError: Connection error caused by: ConnectionError(Connection error caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x000002E67F1994D0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it))

### The Dataset

For this project we will use the [News Category Dataset](https://www.kaggle.com/datasets/rmisra/news-category-dataset) from Kaggle. Take some time to understand its structure and its specificities.

Then you can download it, unzip it and specify below the path of the `JSON` file on your computer. We can then have a look at the structure of one article.

In [None]:
input_file = "/YOUR_PATH_TO_THE_DATASET"

In [None]:
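import json

# The file contains one JSON object per line, so we parse it line by line.
articles = []
with open(input_file, encoding="utf-8") as f:
    for line in f:
        articles.append(json.loads(line))

articles[0]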
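
The loading step can also be written as a small reusable function. The sketch below reads any iterable of JSON lines and skips blank lines. The function name `read_json_lines` is an assumption for this example, and the sample record is made up: it only reuses the six field names of the dataset, and its values are purely illustrative.

```python
import io
import json


def read_json_lines(lines):
    """Parse an iterable of JSON Lines text into a list of dicts, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Made-up sample (illustrative values only), followed by a blank line.
sample = io.StringIO(
    '{"link": "https://example.com/a", "headline": "Example headline", '
    '"category": "TECH", "short_description": "Example description.", '
    '"authors": "Jane Doe", "date": "2022-01-01"}\n'
    "\n"
)
records = read_json_lines(sample)
print(sorted(records[0]))  # field names of the first record
```

With the real dataset you could call it as `read_json_lines(open(input_file, encoding="utf-8"))`, ideally inside a `with` block.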

Based on that structure we will create a `mapping`. A mapping defines how a document, and the fields it contains, are stored and indexed. Think of how you create a table in SQL: this is the same idea. You will find more info about mapping [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html).

In our case, the documents have six fields, which we can classify as follows:

- Full text fields (relevant to query in a search engine):
    - `headline`
    - `short_description`
- Datetime
    - `date`
- Keyword (string for which we don't want to do complex search queries):
    - `link`
    - `category`
    - `authors`

For the full text fields we can specify an [analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer.html) to be applied to the text. The analyzer defines how a text field is indexed and searched. Elasticsearch has a few built-in analyzers (including language-based analyzers) but we can also create our own, which can be useful for very specific content (*e.g.* tweets, because they contain hashtags that should not be considered by the full text search). In this case we will use the built-in [english analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#english-analyzer).
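
To get a feel for what an analyzer does, you can ask ES which tokens it produces for a given text. The helper below is a sketch that assumes the 8.x-style client call `client.indices.analyze(analyzer=..., text=...)`. The name `preview_tokens` is introduced here for illustration only.

```python
def preview_tokens(client, text, analyzer="english"):
    """Return the list of tokens that `analyzer` produces for `text`."""
    response = client.indices.analyze(analyzer=analyzer, text=text)
    # Each entry of response["tokens"] describes one token; we keep only its text.
    return [token["token"] for token in response["tokens"]]


# Usage, assuming `client` is the connected Elasticsearch client from above:
# preview_tokens(client, "The foxes are running")
```

Comparing the result with the output of the `standard` analyzer on the same sentence is a quick way to see the effect of stemming and stop-word removal.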

We can now build the following mapping:

In [None]:
mapping = {
    "properties": {
        "headline": {
            "type": "text",
            "analyzer": "english"
        },
        "short_description": {
            "type": "text",
            "analyzer": "english"
        },
        "date": {
            "type": "date"
        },
        "category": {
            "type": "keyword"
        },
        "link": {
            "type": "keyword"
        },
        "authors": {
            "type": "keyword"
        }
    }
}
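
Before indexing, it can be worth checking that every field of a document is covered by the mapping: by default, ES maps unknown fields dynamically, which may not give the type you intended. The following sketch compares the keys of a document with the keys of `mapping["properties"]`. The helper name `unmapped_fields` and the tiny demo mapping and document are assumptions made for this example.

```python
def unmapped_fields(document, mapping):
    """Return the document keys that have no entry in mapping['properties']."""
    return sorted(set(document) - set(mapping["properties"]))


# Tiny stand-in mapping and document, for illustration only.
demo_mapping = {"properties": {"headline": {"type": "text"}, "date": {"type": "date"}}}
demo_document = {"headline": "Example", "date": "2022-01-01", "extra": 1}

print(unmapped_fields(demo_document, demo_mapping))
```

With the real data, `unmapped_fields(articles[0], mapping)` should return an empty list if the mapping covers all six fields.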

### Indexing

We will now create an index named `articles` based on the mapping we have defined. Before creating it, we check whether the index already exists and, if it does, delete it first.

In [None]:
# Start from a clean slate: drop any existing index, then (re)create it.
if client.indices.exists(index="articles"):
    client.indices.delete(index="articles")

client.indices.create(index="articles", mappings=mapping)


By running the command that lists the indices on our ES instance again, we can now see that the `articles` index has been created:

In [None]:
list(client.indices.get_alias())
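
To confirm that the mapping was applied as intended, one option is to read it back from the index. The sketch below assumes the 8.x-style client call `client.indices.get_mapping(index=...)`, whose response is keyed by index name. The helper name `field_types` is introduced here for illustration.

```python
def field_types(client, index_name="articles"):
    """Return a {field name: field type} dict read back from the index mapping."""
    response = client.indices.get_mapping(index=index_name)
    properties = response[index_name]["mappings"]["properties"]
    # Object fields have no "type" key, hence .get() rather than [].
    return {field: spec.get("type") for field, spec in properties.items()}


# Usage, assuming `client` is the connected Elasticsearch client from above:
# field_types(client)
```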

We can now index our articles one by one. Let's try to index the first 10,000 documents. We wrap the loop in `tqdm` to display progress and see how long it takes. If the library is not installed yet, install it first.

In [None]:
from tqdm import tqdm

for article in tqdm(articles[:10000]):
    client.index(index="articles", id=article["link"], document=article)


As you can see, the process is quite slow. We can speed it up by indexing batches of documents.

Let's try it on the same sample:

In [None]:
from datetime import datetime
from elasticsearch import helpers

start = datetime.now()

bulk = []
for article in articles[:10000]:
    bulk.append({
        "_index": "articles",
        # Use the article URL as the unique identifier: if we reindex the same
        # article, the existing document is overwritten.
        "_id": article["link"],
        "_source": article
    })

helpers.bulk(client, bulk)
print(f"Done in {datetime.now() - start}")

Much faster, isn't it? We leave it to you to modify the code above in order to index the full list of articles.
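
If you want a hint, here is one possible approach, as a sketch rather than the reference solution. Instead of building the whole list of actions in memory, a generator yields them one by one, and `helpers.bulk` sends them in chunks. The function names `generate_actions` and `index_all` and the chosen `chunk_size` are assumptions for this example. `chunk_size` and `raise_on_error` are keyword arguments that `helpers.bulk` forwards to `helpers.streaming_bulk`.

```python
def generate_actions(articles, index_name="articles"):
    """Yield one bulk action per article, using the article URL as document id."""
    for article in articles:
        yield {
            "_index": index_name,
            "_id": article["link"],
            "_source": article,
        }


def index_all(client, articles, index_name="articles", chunk_size=1000):
    """Bulk-index all articles; return (number of successes, list of errors)."""
    # Imported here so that generate_actions can be used without the ES library.
    from elasticsearch import helpers

    return helpers.bulk(
        client,
        generate_actions(articles, index_name),
        chunk_size=chunk_size,
        raise_on_error=False,  # collect errors instead of stopping at the first one
    )


# Usage, assuming `client` and `articles` are defined as above:
# success, errors = index_all(client, articles)
```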

As soon as it is done, we can check that the documents have been indexed correctly by counting them. We can check it either through the browser (http://localhost:9200/articles/_count) or by using the Python client:

In [None]:
client.count(index='articles').get('count')
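
Keep in mind that we used the article URL as document id, so articles that share the same link (if there are any) overwrite each other. The count returned by ES should therefore match the number of distinct links rather than `len(articles)`. A minimal sketch of that check, where `expected_document_count` is a helper name introduced for this example and the sample list is made up:

```python
def expected_document_count(articles):
    """Number of distinct links, i.e. the number of distinct document ids."""
    return len({article["link"] for article in articles})


# Made-up sample in which two entries share the same link.
sample_articles = [
    {"link": "https://example.com/a"},
    {"link": "https://example.com/b"},
    {"link": "https://example.com/a"},
]
print(expected_document_count(sample_articles))
```

If the count from ES is still lower than expected right after indexing, calling `client.indices.refresh(index="articles")` before counting may help, since newly indexed documents only become visible after a refresh.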

## Well Done! 

Your first Elasticsearch index is up and running! Let's now try some queries in [the next notebook](4.Queries.ipynb).