# Solr Setup for Haunted Places Dataset (Assignment 3)

## Overview
This section documents the process for setting up Apache Solr and ingesting the Haunted Places dataset for DSCI 550 Assignment 3. 

We created a Solr core named `assignment3`, uploaded the processed JSON dataset, and exported the core as a `.tgz` file for submission.


## Steps

### 1. Start Solr via Docker
We pulled the latest Solr Docker image and started a Solr container.

```bash
docker pull solr
docker run -d -p 8983:8983 --name solr-container solr
```

### 2. Create a Core
We created a new Solr core called `assignment3`.

```bash
docker exec -it solr-container solr create_core -c assignment3
```


### 3. Convert TSV to Solr-compatible JSON Array
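We wrote a Python script to convert the original `final_haunted_places.tsv` file into a JSON array (`final_haunted_places_array.json`) for Solr ingestion. Empty fields are omitted from each document. The script is reproduced at the end of this document.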



### 4. Ingest JSON into Solr
We posted the JSON array to Solr with `curl`, passing `json.array=true` and `commit=true` so the documents become searchable immediately. The `@Dataset/...` path is relative, so run the command from the directory that contains `Dataset/`.

```bash
curl 'http://localhost:8983/solr/assignment3/update?commit=true&json.array=true' --header "Content-Type: application/json" --data-binary @Dataset/final_haunted_places_array.json
```

After the upload, we verified the data in the Solr Admin UI by running the query `*:*`, which should return the Haunted Places documents.
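
The same `*:*` check can also be scripted instead of run through the Admin UI. The following is a minimal sketch, not part of the original workflow: it assumes the container from step 1 is reachable at `http://localhost:8983`, and the helper names (`build_select_url`, `num_found`, `count_documents`) are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_BASE = "http://localhost:8983/solr"  # assumed: port mapping from step 1
CORE = "assignment3"


def build_select_url(base: str, core: str, query: str = "*:*", rows: int = 0) -> str:
    """Build a /select URL; rows=0 asks only for the match count, not the documents."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base}/{core}/select?{params}"


def num_found(payload: dict) -> int:
    """Extract the total hit count from a Solr select response."""
    return payload["response"]["numFound"]


def count_documents(base: str = SOLR_BASE, core: str = CORE) -> int:
    """Query the running core and return how many documents it holds."""
    with urlopen(build_select_url(base, core)) as resp:
        return num_found(json.load(resp))
```

Calling `count_documents()` while the container is running should return the number of indexed documents, which can be compared against the number of rows in the TSV file.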

### 5. Save the Core
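We exported the `assignment3` core directory to a `.tgz` archive inside the container, then copied the archive to the host. Because `tar` strips the leading `/` from member names, entries are stored under `var/solr/data/assignment3/`.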

```bash
docker exec -it solr-container tar czf /var/solr/data/assignment3.tgz /var/solr/data/assignment3
docker cp solr-container:/var/solr/data/assignment3.tgz 3_ApacheSolr-ElasticSearch/
```

We confirmed the archive contents with:
```bash
tar -tzf 3_ApacheSolr-ElasticSearch/assignment3.tgz
```

The archive contains the core's configuration and index files, which is everything needed to restore the core.
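
The listing can also be checked programmatically. This is a rough sketch under stated assumptions: it assumes the usual layout of a standalone core directory (`core.properties`, `conf/`, and `data/index/`), and `summarize_core_archive` is a hypothetical helper name.

```python
import tarfile


def summarize_core_archive(path: str, core: str = "assignment3") -> dict:
    """Report whether the archive holds the pieces a core directory normally has."""
    with tarfile.open(path, mode="r:gz") as archive:
        names = archive.getnames()
    marker = f"{core}/"
    # Keep only the part of each member path that follows ".../assignment3/"
    inside = [name.split(marker, 1)[1] for name in names if marker in name]
    return {
        "members": len(names),
        "has_core_properties": "core.properties" in inside,
        "has_conf": any(p.startswith("conf/") for p in inside),
        "has_index": any(p.startswith("data/index") for p in inside),
    }
```

For example, `summarize_core_archive("3_ApacheSolr-ElasticSearch/assignment3.tgz")` returns a small dictionary; any `False` value suggests the archive is incomplete.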


## Instructions to Restore the Core

1. Start a Solr container named `solr-container` (see step 1).
2. Copy the `.tgz` archive into the container:

    ```bash
    docker cp assignment3.tgz solr-container:/var/solr/data/
    ```
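3. Open a shell in the container and extract the archive. The archive stores paths relative to the filesystem root (`var/solr/data/assignment3/...`), so extract with `-C /`, then exit the shell and restart the container:

    ```bash
    docker exec -it solr-container bash
    tar -xzf /var/solr/data/assignment3.tgz -C /
    exit
    docker restart solr-container
    ```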
After the restart, the `assignment3` core is restored and available at http://localhost:8983/solr/#/assignment3/query.
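
To confirm the restore without opening the Admin UI, the CoreAdmin `STATUS` action can be queried. The sketch below is illustrative only: it assumes the response carries a `status` object keyed by core name, with an empty entry when the core is not loaded, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def core_status_url(base: str, core: str) -> str:
    """Build the CoreAdmin STATUS URL for a single core."""
    params = urlencode({"action": "STATUS", "core": core, "wt": "json"})
    return f"{base}/admin/cores?{params}"


def core_is_loaded(payload: dict, core: str) -> bool:
    """Assumed response shape: an empty or missing entry means the core is not loaded."""
    return bool(payload.get("status", {}).get(core))


def check_core(base: str = "http://localhost:8983/solr", core: str = "assignment3") -> bool:
    """Ask the running Solr instance whether the core is loaded."""
    with urlopen(core_status_url(base, core)) as resp:
        return core_is_loaded(json.load(resp), core)
```

Running `check_core()` after `docker restart solr-container` should return `True` once Solr has finished starting.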



Conversion script used in step 3:

```python
import csv
import json

input_tsv = '../Dataset/final_haunted_places.tsv'
output_json = '../Dataset/final_haunted_places_array.json'

# Read the TSV, dropping empty fields from each row
data = []
with open(input_tsv, mode='r', encoding='utf-8') as tsv_file:
    reader = csv.DictReader(tsv_file, delimiter='\t')
    for row in reader:
        clean_row = {k: v for k, v in row.items() if v != ''}
        data.append(clean_row)

# Write as a JSON array
with open(output_json, mode='w', encoding='utf-8') as json_file:
    json.dump(data, json_file, indent=2)

print(f"Successfully created {output_json} (in array format)")
```

Output:

```
Successfully created ../Dataset/final_haunted_places_array.json (in array format)
```
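
To illustrate what the conversion does with empty fields, here is a small in-memory sketch. The column names and rows are made up for illustration and are not taken from the dataset, and `tsv_to_records` is a hypothetical helper. Unlike the script above, it also skips the `None` values that `csv.DictReader` produces for rows with fewer fields than the header.

```python
import csv
import io
import json


def tsv_to_records(tsv_text: str) -> list:
    """Parse TSV text into a list of dicts, omitting empty or missing fields."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    records = []
    for row in reader:
        # DictReader yields '' for empty cells and None for missing trailing cells
        records.append({k: v for k, v in row.items() if k is not None and v not in ("", None)})
    return records


# Hypothetical sample: the first row has an empty description
sample = "city\tstate\tdescription\nExampleville\tCA\t\nSampletown\tNY\tA cold spot\n"
records = tsv_to_records(sample)
print(json.dumps(records, indent=2))
```

The first record ends up with only `city` and `state`, so the document posted to Solr carries no empty `description` field.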
