# Kubernetes Storage & Stateful Sets

Kubernetes has transformed the way we deploy and manage applications. While it excels at handling *stateless* workloads, dealing with *stateful* applications introduces unique challenges. 

**Stateless applications** do not rely on persistent data or state that must be maintained across instances. Any instance of the application can handle any request without needing to access or store instance-specific data. **Stateful applications**, on the other hand, maintain state that must persist across instances and survive scaling or failover events, which introduces a unique set of challenges in Kubernetes.

In this lesson, we will delve into two critical aspects of Kubernetes that address these challenges: *Kubernetes Storage* and *Stateful Sets*.

## Volumes in Kubernetes

Volumes in Kubernetes play a crucial role in managing data within containers, but they differ in several ways from their counterparts in Docker. In Docker, volumes are directories located on the host machine's disk or even within another container. These volumes can be mounted into containers during runtime, facilitating data sharing between containers or even across different host machines using drivers. While Docker volumes are valuable, they have limitations, especially when it comes to large-scale deployments.

Kubernetes Volumes offer a more versatile and robust solution for managing data in containerized applications. Here are some high-level features that set Kubernetes Volumes apart:

1. **Simultaneous Mounting**: Kubernetes Volumes can be mounted simultaneously into multiple containers within the same Pod. This capability is particularly useful when you have multiple containers that need access to the same data.

2. **Ephemeral and Persistent Volumes**: In Kubernetes, Volumes can be categorized as either ephemeral or persistent. *Ephemeral Volumes* have lifetimes tied to their Pods. Their data survives container restarts inside the Pod, but when the Pod is deleted or recreated on a different node, its ephemeral Volumes are destroyed and created anew. On the other hand, *persistent Volumes* have lifetimes independent of their Pods. They offer durability and data retention even when Pods come and go.

3. **Automatic Data Availability**: Kubernetes keeps Volume data available across container restarts. When `kubelet`, the Kubernetes node agent, restarts a crashed container, the container starts with a clean filesystem, but the data in its mounted Volumes is preserved for as long as the Volume exists.

## Kubernetes Volume Types

### `EmptyDir` Volume

> *`EmptyDir`* is the simplest volume type in Kubernetes (declared with the `emptyDir` field in a Pod spec). It creates a temporary, initially empty storage volume that is tied to a Pod. The data survives container restarts within the Pod, but when the Pod is deleted or rescheduled to another node, the data stored in `EmptyDir` is lost. This volume is suitable for ephemeral data that is needed within a single Pod and does not need to outlive it.

Common use cases include storing temporary files or caches or sharing data among containers within the same Pod.
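
As a minimal sketch of the data-sharing use case, the Pod below mounts one `emptyDir` volume into two containers. The Pod, container, and volume names, the `busybox` image, and the shell commands are illustrative assumptions, not part of this lesson's setup.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: emptydir-pod            # hypothetical name
spec:
  containers:
  - name: writer
    image: busybox:1.36
    # Append a timestamp to a file in the shared volume every 5 seconds
    command: ["sh", "-c", "while true; do date >> /cache/out.log; sleep 5; done"]
    volumeMounts:
    - name: cache-volume
      mountPath: /cache
  - name: reader
    image: busybox:1.36
    # Follow the file written by the other container (-F retries until it exists)
    command: ["sh", "-c", "tail -F /cache/out.log"]
    volumeMounts:
    - name: cache-volume
      mountPath: /cache
  volumes:
  - name: cache-volume
    emptyDir: {}                # removed together with the Pod
```

Both containers see the same files under `/cache`. Deleting the Pod discards the volume and its contents.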

### `HostPath` Volume

> *`HostPath`* allows you to use a directory on the host machine as a source of data for a volume. It offers data persistence, but it comes with some limitations. The data stored with `HostPath` is specific to the host where the Pod is running. This means that data may not be available if the Pod is rescheduled to a different node.

Common use cases include accessing host-specific data and running stateful applications that rely on host-specific resources.

Let's have a look at an example `HostPath`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-pod
spec:
  containers:
  - name: my-container
    image: my-application-image
    volumeMounts:
    - name: hostpath-volume
      mountPath: /app-config
  volumes:
  - name: hostpath-volume
    hostPath:
      path: /data/host-specific-data

```
In this example, in the `volumes` section, we define the `hostpath-volume`. It uses the `hostPath` volume type and specifies the path as `/data/host-specific-data`. This path points to the directory on the host machine that contains the configuration files we need for our application.

In the `volumeMounts` section under the container, which allows us to mount a volume into the container, we mount the `hostpath-volume` to the path `/app-config` within the container. This means that the contents of the `host-specific-data` directory on the host will be accessible to the container at `/app-config`.
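
Because a `hostPath` volume exposes part of the node's filesystem, it is often worth restricting it. The variation below is a sketch of the same Pod with two optional fields added: `type: Directory`, which makes the mount fail if the directory does not already exist on the node, and `readOnly: true` on the mount. The Pod name is hypothetical, and whether read-only access is sufficient depends on your application.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-readonly-pod   # hypothetical name
spec:
  containers:
  - name: my-container
    image: my-application-image
    volumeMounts:
    - name: hostpath-volume
      mountPath: /app-config
      readOnly: true            # the container cannot modify host files
  volumes:
  - name: hostpath-volume
    hostPath:
      path: /data/host-specific-data
      type: Directory           # the directory must already exist on the node
```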

### Persistent Volumes

While `HostPath` volumes are primarily used for data persistence within a single node, *Persistent Volumes* are designed to provide data persistence that transcends the lifecycle of individual pods. Kubernetes provides two key resources for managing persistent storage: `PersistentVolume` and `PersistentVolumeClaim`. These resources abstract the provisioning and utilization of storage in your cluster. Common use cases for these volumes are running stateful applications that require persistent data, and managing shared storage resources for multiple Pods.

> A `PersistentVolume` (PV) represents a piece of storage in a cluster, either provisioned in advance by a cluster administrator or provisioned dynamically through a `StorageClass`.

These volumes are cluster resources, comparable to physical storage devices attached to the cluster. Importantly, they exist independently of any specific pod's lifecycle. Once bound to a claim and mounted by a pod, they function similarly to regular volumes, offering a reliable means of data storage.

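> In contrast, a `PersistentVolumeClaim` (PVC) is a user's request for storage. The cluster satisfies it either by binding the claim to an existing `PersistentVolume` that matches, or by provisioning a new one on the user's behalf.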

Conceptually, PVCs share similarities with pods in the sense that they consume resources. However, while pods request resources like CPU and RAM, PVCs specify storage requirements, including size and access modes (e.g., `ReadWriteOnce` or `ReadOnlyMany`).
    
#### Provisioning Persistent Volumes

The provisioning of Persistent Volumes can occur through two primary methods:

- **Static Provisioning**: In this approach, cluster administrators pre-create PVs that are available for users to consume. Users create PVCs, and the control plane binds each claim to an available PV that satisfies its requirements.

- **Dynamic Provisioning**: Dynamic provisioning is a more automated approach. When no existing PV matches a user's PVC, the cluster attempts to provision an appropriate PV on demand, based on the `StorageClass` the PVC requests. Because the PV is created for that specific claim, it matches the requirements the claim specifies.

#### Reclaim Policy

Once a user is finished with a volume, they can initiate the deletion of the associated PVC, which subsequently triggers the reclamation of the underlying resource. Three reclaim policies govern how this resource reclamation occurs:

- **Retain**: The *retain policy* leaves the data intact, preserving both the `PersistentVolume` and any external storage. An administrator must then clean up and reclaim the volume manually.

- **Delete**: The *delete policy* removes not only the `PersistentVolume` but also the external storage associated with it.

- **Recycle** (Deprecated): The *recycle policy* performs a basic scrub of the volume's contents and makes the volume available for a new claim. It has been deprecated in favor of dynamic provisioning, which is now the recommended approach for creating and managing `PersistentVolumes`.
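
As a small sketch of where the policy is declared, a PV manifest sets it through the `spec.persistentVolumeReclaimPolicy` field. The PV name, capacity, and host path below are hypothetical values chosen for illustration.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: retained-pv                       # hypothetical name
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain   # keep the data after the claim is deleted
  hostPath:
    path: /mnt/retained-data              # hypothetical path on the node
```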

`PersistentVolumes` can exist in one of four states:

- **Available**: The `PersistentVolume` is a free resource, ready to be bound, but not yet bound to any claim.

- **Bound**: The `PersistentVolume` is bound to a PVC, and pods that reference that PVC can mount it.

- **Released**: The `PersistentVolumeClaim` has been deleted, but the resource has not yet been reclaimed by the cluster.

- **Failed**: The automatic reclamation of the volume has failed.

#### Specifying `PersistentVolume` Configuration

Similar to pods and other Kubernetes resources, PV objects are defined using `.yaml` configuration files. Below is an example configuration file illustrating the key attributes you can specify when creating a `PersistentVolume`:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: task-pv-volume
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/data"
```

Let's break down the example above:

- `.spec.capacity.storage`: This attribute indicates the storage capacity that the volume offers. In this example, the PV provides 10Gi of storage. Kubernetes interprets the quantity using the units of its resource model, where `Gi` is a binary suffix (10Gi = 10 × 2^30 bytes).

- `.spec.accessModes`: Specifies how the `PersistentVolume` can be mounted. The modes are:

  - `ReadWriteOnce`: Allows a single Node to mount the volume with read-write access. Multiple pods can still share the volume if they run on that same node.

  - `ReadOnlyMany`: Permits multiple Nodes to mount the volume, but only read access is granted. Useful for scenarios where multiple pods need read-only access to the data.

  - `ReadWriteMany`: Similar to `ReadOnlyMany`, but grants read-write access to multiple nodes. Applications using this mode should handle possible data races.

  - `ReadWriteOncePod`: Provides read-write access, but only for a single pod in the whole cluster.

- `.spec.storageClassName`: Specifies the associated `StorageClass` for the `PersistentVolume`. If unspecified, no `StorageClass` is assigned. A PV of a particular class can only be bound to PVCs that request that class.

![](./images/modes_providers.png)


#### Hands-On

1. Create a `.yaml` file with the above configuration for creating a `PersistentVolume` resource
2. Use the necessary `kubectl` command to apply the configuration and create the `PersistentVolume` resource 
3. Observe the status and properties of the `PersistentVolume` to ensure it's successfully provisioned and available for use

In [8]:
# apply the configuration
!kubectl

persistentvolume/task-pv-volume created


In [11]:
# observe the status of the volume
!kubectl

NAME             CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
task-pv-volume   10Gi       RWO            Retain           Available           manual                  13m


> Now, let's explain how this PV configuration allows data to transcend the lifecycle of individual pods:

- When you create a PV like this, it represents a persistent storage resource that exists independently of any specific pod's lifecycle. The data stored in the directory `/mnt/data` on the host machine is associated with the PV.

- A user requests this storage by creating a `PersistentVolumeClaim` (PVC) that specifies the same `storageClassName`, a compatible access mode, and a storage request no larger than the PV's capacity. The control plane then binds the PVC to this PV. No new volume is created, since this PV was provisioned statically.

- Once the PVC is bound to the PV, any pod that needs access to this persistent storage references the PVC to mount the PV. Because of the `ReadWriteOnce` access mode, the volume can be mounted read-write by only one node at a time.

- Even if pods are deleted and recreated, the data in the PV remains intact. This is how the PV transcends the lifecycle of individual pods. The data is not tied to any specific pod and persists as long as the PV exists. Note that this example PV is backed by `hostPath`, so the data lives on one node's disk, and a pod rescheduled to a different node would not find it. That is acceptable on a single-node cluster such as Minikube; multi-node clusters typically use network or cloud storage instead.
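
The claim-and-mount flow described above can be sketched as follows. This assumes the `task-pv-volume` PV from earlier exists; the claim name, Pod name, `nginx` image, mount path, and the 3Gi request are illustrative assumptions.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: task-pv-claim            # hypothetical name
spec:
  storageClassName: manual       # must match the PV's storageClassName
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi               # must not exceed the PV's 10Gi capacity
---
apiVersion: v1
kind: Pod
metadata:
  name: task-pv-pod              # hypothetical name
spec:
  containers:
  - name: task-pv-container
    image: nginx
    volumeMounts:
    - name: task-pv-storage
      mountPath: /usr/share/nginx/html
  volumes:
  - name: task-pv-storage
    persistentVolumeClaim:
      claimName: task-pv-claim   # the Pod refers to the claim, not to the PV
```

The Pod never names the PV directly. It only references the claim, which is what lets the underlying storage be swapped without changing the workload definition.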

### Storage Classes

> *Storage Classes* provide a level of abstraction over PVs, making it easier to dynamically provision storage resources. Each storage class defines characteristics like performance, availability, and replication. When a PVC requests a specific storage class, Kubernetes automatically provisions the appropriate storage based on the class's definition.

Common use cases include dynamically provisioning storage based on application requirements, and simplifying the management of different storage types.

#### Defining a `StorageClass`

Storage Classes are typically defined using `.yaml` configuration files, and they are referenced by name in the `storageClassName` field of `PersistentVolume` and `PersistentVolumeClaim` configuration files. Let's consider an example of a `StorageClass` definition:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: k8s.io/minikube-hostpath
parameters:
  type: local
  location: /mnt/disks/ssd
reclaimPolicy: Retain

```

In the example above:

- The `provisioner` field specifies the type of storage provisioner to use. Different provisioners are used for various storage backends. In this case, the provisioner `k8s.io/minikube-hostpath` is often used for local, single-node testing clusters like Minikube.

- The `parameters` section allows you to provide additional configuration specific to the chosen provisioner. In this example, `type: local` indicates the use of local storage, and `location: /mnt/disks/ssd` specifies the path on the host machine where PVs will be created.

- The `reclaimPolicy` field determines what happens to a dynamically provisioned PV when the associated PVC is deleted. If the field is omitted, the policy defaults to `Delete`.
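
To show how a class is consumed, here is a sketch of a PVC that requests storage from the `standard` class defined above. No PV is written by hand; the provisioner is expected to create one when the claim is submitted. The claim name and the 2Gi size are assumptions.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamic-claim            # hypothetical name
spec:
  storageClassName: standard     # selects the StorageClass by name
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi               # size of the volume to provision
```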

> We have seen the local persistent volumes, but Kubernetes supports various storage provisioners, each tailored to specific storage backends or cloud providers. 

Here are some common provisioner types:

- **NFS (Network File System) Provisioner**: NFS provisioner enables the use of NFS-based network storage. It allows you to dynamically provision PVs backed by NFS servers. This is useful for scenarios where you need shared network storage for your applications.

- **Amazon Elastic Block Store (EBS) Provisioner**: EBS provisioner is designed to work with Amazon Web Services (AWS) EBS volumes. It allows you to dynamically provision EBS-backed PVs, making it easy to integrate Kubernetes with AWS storage.

- **Google Persistent Disk (GPD) Provisioner**: GPD provisioner is specific to Google Cloud Platform (GCP) and works with Google Persistent Disks. It enables the dynamic provisioning of PVs using GCP's block storage.

- **Azure Disk Provisioner**: Azure Disk provisioner is used for creating PVs backed by Azure Managed Disks in Microsoft Azure. It streamlines the process of provisioning and managing persistent storage in Azure-based Kubernetes clusters.

### Expandable Volumes

The ability to dynamically expand volumes has been enabled by default since Kubernetes version `1.11`. This feature allows you to increase the storage capacity of a persistent volume claim (PVC) in response to changing requirements, eliminating the need to manually create a larger volume and migrate data to it, and helping your applications keep the storage space they require.

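> The easiest way to take advantage of this dynamic volume expansion capability is to use an internal cloud provider that supports it, and the process is straightforward: you simply need to set `allowVolumeExpansion: true` in the `StorageClass` definition. A volume is then expanded by increasing the storage request of an existing PVC. Volumes can only be grown, not shrunk.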

For example:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Retain
allowVolumeExpansion: true
mountOptions:
  - debug
volumeBindingMode: Immediate
```

Many internal cloud providers support expandable volumes. Here is a list of some commonly used providers:

![](./images/expandable-volumes.png)
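
As a sketch of what an expansion request looks like, assume a PVC named `data-claim` (a hypothetical name) was originally created with a 10Gi request against the expandable `standard` class above. Re-applying the claim with a larger request asks the cluster to grow the volume:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim               # hypothetical existing claim
spec:
  storageClassName: standard     # class with allowVolumeExpansion: true
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi              # previously 10Gi; only increases are accepted
```

Depending on the storage backend, the filesystem may only be resized once a pod using the claim is restarted.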

## Storage Summary

This summary provides a guideline for managing storage in Kubernetes:

- **Data Sharing Consideration**:

  - Determine whether your data needs to be shared between containers or pods

  - If sharing data between containers within the same pod, consider using ephemeral volumes for simple data exchange between applications

  - If data needs to be shared between pods or preserved after pod termination, use `PersistentVolume`s

- **Dynamic Provisioning Decision**:

  - Decide whether dynamic provisioning is required. This is particularly useful when it's challenging to predict the exact storage requirements in advance.

- **Without Dynamic Provisioning**:

  - Create a `PersistentVolume` `.yaml` configuration file (defines how a volume is created)

  - Create a `PersistentVolumeClaim` `.yaml` configuration file (specifies how a volume is requested)

  - Create a `.yaml` configuration file for your application (the workload resource). Avoid using bare pods in your configurations.

- **With Dynamic Provisioning**:

  - Create a `StorageClass` `.yaml` configuration file (acts as a template for providing `PersistentVolume`s to pods as needed). Pods still request storage through a `PersistentVolumeClaim` that references the `StorageClass`.

- **Expandable Volumes**:

  - Consider using expandable volumes when you know the maximum number of pods that will run at any given time and the exact amount of storage required for each pod is unknown

Dynamic provisioning is beneficial for large-scale applications, where manual provisioning is impractical due to the following reasons:

  - A high number of pods are requesting significant storage allocations

  - It's challenging to predict in advance how many pods will run

> In large deployments, cloud-storage providers are often preferred over local storage, as local storage may prove insufficient.

## Stateful Sets
   
In Kubernetes, workload resources are typically used for stateless applications, where pods are interchangeable and keep no data that must outlive them. However, when it comes to managing stateful applications, Kubernetes provides a powerful resource called *Stateful Sets*.

> *Stateful Sets* are a specific type of workload resource in Kubernetes that gives each pod it manages a stable, ordinal-based name and network identity and, through `volumeClaimTemplates`, its own storage volume. This unique characteristic allows all data associated with a pod to be stored persistently and reattached when the pod is recreated. This is particularly useful for applications like databases, where data persistence is critical between pod restarts.

### Limitations

- **Storage Provisioning**: The storage used by Stateful Sets must be provisioned either by a cluster administrator or dynamically using Storage Classes

- **Storage Preservation**: Deleting a Stateful Set does not automatically delete the associated storage. Storage preservation is prioritized to prevent accidental data loss.

- **Headless Service Requirement**: Stateful Sets require the presence of a *headless Service*. A headless Service is a Service whose `clusterIP` is set to `None`: it provides no load balancing and no cluster IP, and is used to give each pod a stable network identity.

### Components of a Stateful Set

To create a Stateful Set, you'll need specific `.yaml` definitions. Here's an overview of the necessary components:

1. Headless Service

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
```

2. Stateful Set

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```

In this example:

- A headless Service named `nginx` is created to manage the network domain for pods

- The `StatefulSet` named `web` is defined to manage the deployment of pods running the `nginx` container. The Stateful Set configuration includes the following:

  - `serviceName`: Specifies the name of the headless Service created earlier, which is `nginx`. This associates the `StatefulSet` with the headless Service.

  - `replicas`: Sets the desired number of replicas (pods) to be managed by the `StatefulSet`. In this case, it's set to 2 replicas.

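  - `volumeClaimTemplates`: This section defines templates for `PersistentVolumeClaims` (PVCs). The `StatefulSet` creates one PVC per pod from the template (here `www-web-0` and `www-web-1`), and each PVC is bound to its own `PersistentVolume` (PV). Because the template does not set a `storageClassName`, the PVs are expected to come from the cluster's default `StorageClass`, if one is configured.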
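
To illustrate the stable network identity that the headless Service gives each pod, here is a small, hypothetical test Pod that looks up the DNS name of the first replica (`<pod-name>.<service-name>`). It assumes the Service and `StatefulSet` above are running in the same namespace as the test Pod; the Pod name and the `busybox:1.28` image are assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-test                 # hypothetical name
spec:
  restartPolicy: Never           # run the lookup once and exit
  containers:
  - name: dns-test
    image: busybox:1.28
    # Resolve the per-pod DNS record published through the headless Service
    command: ["nslookup", "web-0.nginx"]
```

The Pod's logs should show the address of `web-0`. The name stays the same if that replica is deleted and recreated, even though its IP address may change.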

### Hands-On

1. Create a `.yaml` file with the above configuration for creating a `StatefulSet` and its associated resources

2. Apply the configuration using the correct `kubectl` command

3. Access the Kubernetes Minikube dashboard and observe the `PersistentVolumeClaims` (PVCs). You should see that the volumes are already bound and available for use.

In [15]:
!kubectl

service/nginx created
statefulset.apps/web created


The dashboard's appearance should be similar to that shown below:

<p align=center><img src=images/PVC.png></p>

## Key Takeaways

- Volumes in Kubernetes provide a way to store and share data within containers. Volumes can be mounted simultaneously, making it easy to share data between containers within the same pod or between pods.
- `EmptyDir` volumes provide a way to create temporary storage within a pod. They are primarily used for sharing files and data between containers running within the same pod.
- Data stored in an `EmptyDir` volume is ephemeral and tied to the lifecycle of the pod. When the pod is deleted or rescheduled to a different node, the data is lost.
- `HostPath` volumes allow pods to use a directory on the host machine as a source of data for a volume. Data is persisted on the host machine.
- `HostPath` volumes are used for accessing host-specific data or resources. The data stored with `HostPath` is specific to the host where the pod is running.
- Persistent Volumes (PVs) are storage resources in the cluster, provisioned by administrators or dynamically through Storage Classes, while Persistent Volume Claims (PVCs) are requests for storage made by users and consumed by pods.
- Storage Classes define the classes of storage offered by administrators and can be used to dynamically provision storage. Different provisioners can be used with Storage Classes, enabling flexibility in choosing storage solutions.
- Dynamic provisioning allows Kubernetes to automatically create PVs based on PVC requirements, making it suitable for large-scale deployments. It eliminates the need for manual provisioning and ensures that storage matches the demands of pods.
- Expandable volumes can be used when the maximum number of pods running simultaneously is known, but the exact storage requirements per pod are uncertain. 
- Stateful Sets are workload resources used for stateful applications that require persistent storage. They manage pods with storage volumes, ensuring data persistence even across pod restarts.
- Stateful Sets require a headless Service to manage network identity
- Deleting a Stateful Set does not automatically delete associated storage to preserve data