# Using probability distributions to fill the gaps in our climate data
<img src="util/prob_distrib_weather.png" style="width: 400px; float:right"><h>When there are gaps in climate data—due to missing records for certain days, months, or locations—we can use probability distributions to estimate the missing information. Probability distributions are mathematical functions that describe how a particular variable, such as temperature or rainfall, is expected to vary based on past observations. By analyzing historical climate data, we can determine the likely patterns and behaviors of these variables.

For example, if we have temperature data for several years but lack information for a few specific months, we can use a probability distribution (such as the Normal distribution) to estimate the likely temperature during those missing periods. This approach relies on statistical techniques to create a model that represents how the variable typically behaves, considering factors like seasonality and trends. 
    
In this notebook we are going to fill the gaps in daily temperature data.

## As always, first we need to import the necessary libraries

In [None]:
import numpy as np  # for numerical operations, especially with arrays
import pandas as pd  # for data manipulation and analysis
import matplotlib.pyplot as plt  # for data visualization

## Load climate data with gaps
We load the data from an Excel file using the pandas library. The data correspond to Montefrio (Granada, Spain) and cover the period 2005-2021.

In [None]:
# Step 1: Load the real daily temperature data
temp_data = pd.read_excel('datos/Daily_Temp_2005_2021.xlsx',index_col=0)
temp_data.head()

In [None]:
temp_data.index

In [None]:
# Plot temperature data
plt.figure(figsize = (15,4))
plt.plot(temp_data.index,temp_data['temp'])

## Fill the gaps
### A. First method: mean value
First of all, we need to find the NaN (missing) values.

In [None]:
temp_data['temp'].isnull()

In [None]:
missing_temp_dates = temp_data[temp_data['temp'].isnull()].index
missing_temp_dates
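
Before choosing a filling method it can help to know how many values are missing and how long the gaps are. The sketch below is only an illustration: it uses a small hypothetical series (`toy`) in place of `temp_data['temp']`, and the names `is_gap`, `run_id` and `gap_lengths` are assumptions introduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series standing in for temp_data['temp']
dates = pd.date_range('2020-01-01', periods=10, freq='D')
toy = pd.Series(
    [10.0, np.nan, np.nan, 12.0, 11.0, np.nan, 13.0, 12.5, np.nan, 11.5],
    index=dates,
)

is_gap = toy.isnull()
n_missing = int(is_gap.sum())   # total number of missing days
share_missing = is_gap.mean()   # fraction of the record that is missing

# Label consecutive runs: the label increases whenever is_gap changes value
run_id = (is_gap != is_gap.shift()).cumsum()
# Length of each run of consecutive missing days
gap_lengths = is_gap[is_gap].groupby(run_id[is_gap]).size()

print(f'missing days: {n_missing} ({share_missing:.0%} of the record)')
print('gap lengths:', gap_lengths.tolist())
```

Short, isolated gaps are generally easier to fill with the simple methods below than long runs of consecutive missing days.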

In [None]:
mean_temp = temp_data['temp'].mean()  # NaN values are skipped by default
print(mean_temp)

In [None]:
filled_temp_data = temp_data.copy()
filled_temp_data.loc[missing_temp_dates, 'temp'] = mean_temp

In [None]:
# Plot the filled temperature data
plt.figure(figsize = (15,4))
plt.plot(filled_temp_data.index,filled_temp_data['temp'])
plt.title('Method 1: mean temperature')
plt.ylabel('degC')

### B. Second method: random normal values
<img src="util/normal_distribution.png" style="width: 400px; float:right"><h>
    
Temperature data is often symmetrically distributed around a central value, which makes the normal (Gaussian) distribution a suitable candidate. This distribution is particularly useful for modeling daily temperatures, which tend to **vary within a relatively narrow range around a seasonal average**.

The probability density function (PDF) of a normal distribution is given by:

$$
f(x; \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}
$$

Where:
- $ x $ is the temperature,
- $ \mu $ is the mean (average temperature),
- $ \sigma $ is the standard deviation, indicating the typical variability around the mean.
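
As a quick check of the formula, the PDF can be written directly in NumPy. This is a minimal sketch: the function name `normal_pdf` and the values of `mu` and `sigma` are illustrative assumptions, not values taken from the Montefrio data.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Evaluate the normal PDF given above at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

mu, sigma = 15.0, 7.0  # illustrative values in degC (assumed, not from the data)
x = np.linspace(mu - 6 * sigma, mu + 6 * sigma, 2001)
density = normal_pdf(x, mu, sigma)

# Approximate the integrals with a simple Riemann sum
dx = x[1] - x[0]
total_area = density.sum() * dx                                  # close to 1
within_one_sigma = density[np.abs(x - mu) <= sigma].sum() * dx   # close to 0.68

print(f'total area: {total_area:.4f}')
print(f'probability within one sigma: {within_one_sigma:.4f}')
```

The second number illustrates the role of $ \sigma $: roughly two thirds of the values are expected to fall within one standard deviation of the mean.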

### Why the normal distribution?

1. **Symmetric values:** Temperature data typically exhibits a symmetric distribution around an average, making the normal distribution an appropriate model.
2. **Central tendency:** The normal distribution captures the tendency of temperature values to cluster around a central value, with decreasing probabilities as values move farther from the mean.
3. **Flexible variability:** With its two parameters (mean and standard deviation), the normal distribution can represent different climates and seasonal variations in temperature data, adapting to both average levels and fluctuations.
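
Whether a sample is roughly symmetric can be checked numerically before assuming normality. The sketch below uses synthetic data (the series `sample` is a hypothetical stand-in for real temperatures) and three simple diagnostics: skewness, the difference between mean and median, and the share of values within one standard deviation of the mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for daily temperatures (assumed values, not real data)
sample = pd.Series(rng.normal(loc=8.0, scale=3.0, size=500))

skewness = sample.skew()                             # near 0 for symmetric data
mean_minus_median = sample.mean() - sample.median()  # also near 0 when symmetric

# Share of values within one standard deviation of the mean (about 0.68 if normal)
within_one_std = ((sample - sample.mean()).abs() <= sample.std()).mean()

print(f'skewness: {skewness:.2f}')
print(f'mean - median: {mean_minus_median:.2f}')
print(f'share within one std: {within_one_std:.2f}')
```

Applied to `temp_data['temp']`, values far from these references would suggest that a single normal distribution is a poor description of the data.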

In [None]:
std_temp = temp_data['temp'].std()  # sample standard deviation (ddof=1), as used for the monthly values below
print(std_temp)

In [None]:
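# Start again from the original data so the mean values from Method 1 are discarded
filled_temp_data = temp_data.copy()
for date in missing_temp_dates:
    # Draw one random value from N(mean_temp, std_temp) for each missing day
    filled_temp_data.loc[date, 'temp'] = np.random.normal(mean_temp, std_temp)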

In [None]:
# Plot the filled temperature data
plt.figure(figsize = (15,4))
plt.plot(filled_temp_data.index,filled_temp_data['temp'])
plt.title('Method 2: random normal values')
plt.ylabel('degC')

The normal distribution is often a good fit for filling gaps in daily mean temperature data, but it’s not always perfect. Daily mean temperatures tend to follow a pattern that can approximate a bell curve, especially over short time periods, like a single season. This means the normal distribution can capture the average and variability reasonably well.

Fitting a normal distribution to the entire temperature data has some limitations because real daily temperatures usually show a cycle that follows the seasons, which isn’t captured well by a single normal distribution. For example, summer temperatures tend to be warmer, and winter temperatures cooler.
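
The effect of seasonality on the fitted spread can be illustrated with synthetic data. In the sketch below, the annual cosine cycle, its amplitude and the noise level are all assumptions chosen for illustration; the point is only that the standard deviation of the whole record is much larger than the standard deviation within a single month.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dates = pd.date_range('2005-01-01', '2008-12-31', freq='D')

# Synthetic temperatures: annual cycle (coldest in mid-January) plus daily noise
day_of_year = dates.dayofyear.to_numpy()
seasonal = 15.0 - 9.0 * np.cos(2 * np.pi * (day_of_year - 15) / 365.25)
synthetic = pd.Series(seasonal + rng.normal(0.0, 3.0, size=len(dates)), index=dates)

overall_std = synthetic.std()
monthly_std = synthetic.groupby(synthetic.index.month).std()

print(f'whole-record std: {overall_std:.2f}')
print(f'mean within-month std: {monthly_std.mean():.2f}')
```

A single normal distribution fitted to the whole record therefore produces random values that are too spread out for any given season, which motivates working month by month.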

### C. Third method: monthly mean value

In [None]:
monthly_mean_temp = np.zeros(12)

for m in np.arange(12):
    # Calculate the mean temperature for each month (m+1)
    monthly_mean_temp[m] = temp_data[temp_data.index.month == m + 1]['temp'].mean()
    print(f'for the month {m+1} the mean temperature = {monthly_mean_temp[m]:.2f}')

In [None]:
monthly_mean_temp
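
The same monthly means can also be obtained without an explicit loop by grouping on the month of the index. This is an alternative sketch, not the method used in this notebook; the DataFrame `toy` is a hypothetical example built so that the expected result is obvious (month `m` has temperature `10 + m`).

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2020-01-01', '2021-12-31', freq='D')
# Hypothetical data: every day in month m has temperature 10 + m
toy = pd.DataFrame({'temp': 10.0 + dates.month.to_numpy().astype(float)}, index=dates)
toy.iloc[5, 0] = np.nan  # introduce one gap in January

# Group by calendar month; NaN values are skipped by default
monthly_mean = toy['temp'].groupby(toy.index.month).mean()
monthly_mean_array = monthly_mean.to_numpy()  # position m holds month m + 1

print(monthly_mean)
```

The resulting array has the same layout as `monthly_mean_temp`, with position `m` holding the mean for month `m + 1`.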

In [None]:
filled_temp_data = temp_data.copy()
for date in missing_temp_dates:
    # Use the mean of the calendar month the missing day belongs to
    filled_temp_data.loc[date, 'temp'] = monthly_mean_temp[date.month - 1]

In [None]:
# Plot the filled temperature data
plt.figure(figsize = (15,4))
plt.plot(filled_temp_data.index,filled_temp_data['temp'])
plt.title('Method 3: monthly mean temperature')
plt.ylabel('degC')

### D. Fourth method: monthly random normal values

In [None]:
monthly_std_temp = np.zeros(12)

for m in np.arange(12):
    # Calculate the standard deviation of the temperature for each month (m+1)
    monthly_std_temp[m] = temp_data[temp_data.index.month == m + 1]['temp'].std()
    print(f'for the month {m+1} the standard deviation = {monthly_std_temp[m]:.2f}')

In [None]:
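# Start again from the original data so the monthly means from Method 3 are discarded
filled_temp_data = temp_data.copy()
for date in missing_temp_dates:
    m = date.month - 1  # position of this month in the monthly arrays
    # Draw one random value from that month's normal distribution
    filled_temp_data.loc[date, 'temp'] = np.random.normal(monthly_mean_temp[m], monthly_std_temp[m])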
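
For long records, the loop above can be replaced by a vectorized version. The sketch below is one possible way to do it, shown on a hypothetical DataFrame `toy`; it also uses a seeded random generator (`np.random.default_rng`) so that the filled values are reproducible, which is an addition not present in the cells above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)  # fixed seed so the fill is reproducible
dates = pd.date_range('2019-01-01', '2020-12-31', freq='D')
# Hypothetical data standing in for temp_data
toy = pd.DataFrame({'temp': 15.0 + rng.normal(0.0, 2.0, size=len(dates))}, index=dates)
toy.iloc[[10, 40, 200], 0] = np.nan  # introduce three gaps

months = toy.index.month
# For every day, the mean and std of the calendar month it belongs to
day_mean = toy['temp'].groupby(months).transform('mean')
day_std = toy['temp'].groupby(months).transform('std')

missing = toy['temp'].isnull()
filled = toy.copy()
# One random draw per missing day, from that day's monthly normal distribution
filled.loc[missing, 'temp'] = rng.normal(day_mean[missing], day_std[missing])

print(filled.loc[missing, 'temp'])
```

Because the values are random, each run without a fixed seed gives a different filled series; fixing the seed makes the result repeatable.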

In [None]:
# Plot the filled temperature data
plt.figure(figsize = (15,4))
plt.plot(filled_temp_data.index,filled_temp_data['temp'])
plt.title('Method 4: monthly random normal values')
plt.ylabel('degC')