# Storage in Kubernetes

By default, containers running in pods on Kubernetes have access to the filesystem, but as we're about to find out, there are some big limitations to this.

The `api` application in SynergyChat can be configured to save its data (the messages) to a file on the filesystem. That way, even if the program is restarted, the messages will still be there.

Take a look at the logs of the api pod with:

```bash
kubectl logs <podname>
```

Let's do a quick test:

1. Open the webpage
2. Create a few messages
3. Refresh the page
4. All the messages should still be there
5. Delete the `api` pod
6. Once K8s replaces the deleted pod with a new one, refresh the page again

OK, now all the messages are gone. Why? Because a container's filesystem is ephemeral: anything the application writes inside the pod lives only as long as that pod does. When a pod is deleted, its data is gone with it.

This reflects a core idea behind K8s, and containers in general: when something goes wrong, we throw the broken instance away and start a replacement from a blank state.

So what if we want to keep our data instead of losing it? That's what persistent storage is for.

# Ephemeral Volume

On-disk files in a container are ephemeral as we saw before. This presents some problems for applications that want to save long-lived data across restarts. For example, user data in a database.

The Kubernetes [volume](https://kubernetes.io/docs/concepts/storage/volumes/) abstraction solves two primary problems:
- Data persistence
- Data sharing across containers

A quick glance at the documentation shows that there are a lot of different types of "volumes" in Kubernetes. Some are even ephemeral as well, just like a container's standard filesystem. The primary reason for using an ephemeral volume is to share data between containers in a pod.

It's time to shift our focus back to the crawler service. The crawler service continuously crawls Project Gutenberg and exposes the information that it finds via a JSON API. That data is then made available via slash commands in the chat application.

Let's check how it has been doing so far:

```bash
kubectl logs <crawler-podname>
```

You should see some logs with timestamps that show you the crawler's progress.

Let's update the crawler deployment to use a volume that will be shared across all containers in the crawler pod, and scale up the number of containers in the pod. First, add a `volumes` section to the pod spec (`spec.template.spec`):

```yaml
volumes:
  - name: cache-volume
    emptyDir: {}
```

Add a new `volumeMounts` section to the container entry. This will mount the volume we just created at the `/cache` path.

```yaml
volumeMounts:
  - name: cache-volume
    mountPath: /cache
```

Duplicate the entire first entry in the `containers` list twice, so there are three containers in total. Update the name of each:
1. `synergychat-crawler-1`
2. `synergychat-crawler-2`
3. `synergychat-crawler-3`

Now all the containers in the pod will share the same volume at `/cache`. It's just an empty directory, but the crawler will use it to store its data.
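Put together, the pod template might look roughly like the fragment below. This is only a sketch: it assumes the `volumes` list sits next to `containers` under `spec.template.spec`, and the comment lines stand in for whatever `image`, `envFrom`, and other fields your existing container entry already has.

```yaml
# Fragment of the crawler Deployment (not a complete manifest).
spec:
  template:
    spec:
      containers:
        - name: synergychat-crawler-1
          # image, envFrom, etc. copied unchanged from your existing entry
          volumeMounts:
            - name: cache-volume   # must match the volume name below
              mountPath: /cache
        - name: synergychat-crawler-2
          # same image and settings as the first container
          volumeMounts:
            - name: cache-volume
              mountPath: /cache
        - name: synergychat-crawler-3
          # same image and settings as the first container
          volumeMounts:
            - name: cache-volume
              mountPath: /cache
      volumes:
        - name: cache-volume
          emptyDir: {}   # starts empty and is removed together with the pod
```

Each container gets its own `volumeMounts` entry, but they all point at the single `cache-volume` defined once at the pod level.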

Add a `CRAWLER_DB_PATH` environment variable to the crawler's ConfigMap. Set it to `/cache/db`. The crawler will use a directory called `db` inside the volume to store its data.

Apply the new ConfigMap and Deployment, and check the status of your new pod.

You should notice that there's a problem with the pod! Only 1/3 of containers should be "ready". Use the `logs` command to get the logs for all 3 containers:

```bash
kubectl logs <podname> --all-containers
```

You should see something like this:

```
listen tcp :8080: bind: address already in use
```

Because all containers in a pod share the same network namespace, they can't all bind to the same port! Hmm... let's put a band-aid on this by binding each container to a different port. `8080` is the only one that will be exposed via the service, but that's okay for now. We can add redundancy later.

Add two new values to the crawler's ConfigMap (quote them, because ConfigMap values must be strings):

1. `CRAWLER_PORT_2: "8081"`
2. `CRAWLER_PORT_3: "8082"`
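
The crawler's ConfigMap could then look something like this. The `metadata.name` here is a hypothetical placeholder (use whatever your ConfigMap is already called), and any keys you already have should stay as they are. The port values are quoted because ConfigMap `data` values must be strings; an unquoted `8081` is parsed by YAML as a number and rejected.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: synergychat-crawler-configmap   # hypothetical name; keep your existing one
data:
  # existing keys (e.g. CRAWLER_KEYWORDS) stay unchanged
  CRAWLER_DB_PATH: /cache/db
  CRAWLER_PORT_2: "8081"   # quoted so YAML treats it as a string, not a number
  CRAWLER_PORT_3: "8082"
```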

Now update the crawler deployment.

Change the second and third containers to map `CRAWLER_PORT_2 -> CRAWLER_PORT` and `CRAWLER_PORT_3 -> CRAWLER_PORT` respectively (the Docker image expects a variable named `CRAWLER_PORT`). I'm not going to give you the code, but know that it's going to be a bit tedious, because you need to use `env:` instead of `envFrom:` for the second and third containers. Don't forget to keep exposing the `CRAWLER_KEYWORDS` and `CRAWLER_DB_PATH` environment variables to all containers.

# Containers in Pods

It's important to remember that while it's common for a pod to run just a single container, multiple containers can run in a single pod. This is useful when you have containers that need to share resources. In other words, we can scale up the instances of an application either at the container level or at the pod level.

In our situation, there will be:

| Application | Pods | Containers |
| ----------- | ---- | ---------- |
| Web         | 3    | 3          |
| Crawler     | 1    | 3          |


# Persistence

All the volumes we've worked with so far have been ephemeral, meaning when the associated pod is deleted the volume is deleted as well. This is fine for some use cases, but for most CRUD apps we want to persist data even if the pod is deleted.

If you think about it, it's not even just when pods are explicitly deleted with `kubectl` that we need to worry about data loss. Pods can be deleted for several reasons:
- The node they're running on could fail
- A new version of the image was published (code was updated, etc)
- A new node was added to the cluster and the pod was rescheduled

In all of these cases, we want to make sure that our data is still available. [Persistent volumes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) allow us to do this.

### Persistent Volumes (PV)

Instead of simply adding a volume to a deployment, a persistent volume is a cluster-level resource that is created separately from the pod and then attached to the pod. It's similar to a ConfigMap in that way.

PVs can be created statically or dynamically.
- Static PVs are created manually by a cluster admin
- Dynamic PVs are created automatically when a pod requests a volume that doesn't exist yet

Generally speaking, and especially in the cloud-native world, we want to use dynamic PVs. It's less work and more flexible.
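
To make the difference concrete, a statically provisioned PV is just a manifest that an admin writes by hand. The sketch below is illustrative only: the name `example-static-pv` and the `hostPath` location are made-up values, and `hostPath` is really only suitable for single-node test clusters.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-static-pv   # hypothetical name
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain   # keep the data after the claim is released
  hostPath:
    path: /mnt/data   # hypothetical directory on the node; for local testing only
```

With dynamic provisioning you never write this object yourself; the cluster generates an equivalent one when a claim asks for it.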

### Persistent Volume Claims (PVC)

A [persistent volume claim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims) is a request for a persistent volume. When using dynamic provisioning, a PVC will automatically create a PV if one doesn't exist that matches the claim.

The PVC is then attached to a pod, just like a volume would be.


Create a new file called `api-pvc.yaml` and add the following:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: synergychat-api-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```

This creates a new PVC called `synergychat-api-pvc`. The `ReadWriteOnce` access mode means the volume can be mounted as read-write by a single node at a time (pods running on that same node can share it). The claim also requests 1Gi (one gibibyte) of storage.

Apply it, then check the result with:

```bash
kubectl get pvc
kubectl get pv
```

# Attach Persistence

So far all we've done is create an empty persistent volume. Let's get the `api` application to use it.

Create a new volume in the `api` deployment that references your PVC, the same way you added a volume to the `crawler` deployment:

```yaml
volumes:
  - name: synergychat-api-volume
    persistentVolumeClaim:
      claimName: synergychat-api-pvc
```

Then mount it in the container under the `/persist` directory:

```yaml
volumeMounts:
  - name: synergychat-api-volume
    mountPath: /persist
```

Update the `API_DB_FILEPATH` environment variable you added earlier to use the new mount path instead: `/persist/db.json`.
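
For orientation, here is a rough sketch of how the pieces sit together in the `api` deployment's pod template. The container name is an assumption, and the comment line stands in for the fields your container entry already has; only the volume name, claim name, and mount path come from the steps above.

```yaml
# Fragment of the api Deployment (not a complete manifest).
spec:
  template:
    spec:
      containers:
        - name: synergychat-api   # assumed container name; keep your own
          # image, envFrom, etc. unchanged
          volumeMounts:
            - name: synergychat-api-volume   # must match the volume name below
              mountPath: /persist            # API_DB_FILEPATH points inside this path
      volumes:
        - name: synergychat-api-volume
          persistentVolumeClaim:
            claimName: synergychat-api-pvc   # the PVC created in api-pvc.yaml
```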

Apply the changes, then check to make sure all your pods are healthy.

With your tunnel running, open the web page in your browser and:

1. Send some messages.
2. Delete the api pod.
3. Once the new pod is running, refresh the page and make sure your messages are still there. If they are, your persistent volume is working! (The new pod may take a little while to come up, since it is starting cold.)

# Databases

Running application databases inside Kubernetes is not something you should do by default in every situation. Kubernetes is fundamentally optimized for managing stateless workloads, while databases are stateful systems that require careful handling of storage, backups, performance tuning, and failure recovery.

For local development, testing, learning environments, or small non-critical systems, hosting a database inside Kubernetes can be practical and even beneficial because it simplifies environment setup and keeps everything self-contained.

However, for production workloads, especially those that are business-critical, data-intensive, or require high availability, it is generally better to run databases outside the Kubernetes cluster using managed database services or dedicated database infrastructure.