# Introduction

In [1]:
# Import required packages
import torch
import math
import numpy as np
import matplotlib.pyplot as plt
import h5py
import pandas as pd

Open the `lotka_volterra_data.h5` file in the notebook.

In [2]:
with h5py.File('lotka_volterra_data.h5', 'r') as f:
    # Access the full dataset
    trajectories = f['trajectories'][:]
    time_points = f['time'][:]

    # Access a single trajectory
    system_id = 0 # First system

Checking the shape of the dataset: we expect `trajectories` to have shape $(1000 \times 100 \times 2)$ and `time_points` to have shape $(100,)$.

In [3]:
print('Time points shape:', time_points.shape)
print('')
print('Trajectory shape (prey/predator):', trajectories.shape)

Time points shape: (100,)

Trajectory shape (prey/predator): (1000, 100, 2)


In [4]:

num_systems, num_time_steps, num_variables = trajectories.shape
# Create a DataFrame
df_traj = pd.DataFrame({
    "system_id": np.repeat(np.arange(num_systems), num_time_steps),  # 0-999, each repeated 100 times
    "time_step": np.repeat(time_points[np.arange(num_time_steps)], num_systems),    # Note: repeats each time point 1000 times; np.tile(time_points, num_systems) would cycle the time grid once per system
    "prey": trajectories[:, :, 0].flatten(),  # Flatten prey values
    "predator": trajectories[:, :, 1].flatten()  # Flatten predator values
})


In [5]:
df_traj # Visualising data in dataframe format

| | system_id | time_step | prey | predator |
|---|---|---|---|---|
| 0 | 0 | 0.0 | 0.949917 | 1.040624 |
| 1 | 0 | 0.0 | 0.740551 | 0.779542 |
| 2 | 0 | 0.0 | 0.682246 | 0.564390 |
| 3 | 0 | 0.0 | 0.716674 | 0.407644 |
| 4 | 0 | 0.0 | 0.824511 | 0.300283 |
| ... | ... | ... | ... | ... |
| 99995 | 999 | 200.0 | 0.901549 | 0.579420 |
| 99996 | 999 | 200.0 | 0.957527 | 0.539055 |
| 99997 | 999 | 200.0 | 1.036460 | 0.515615 |
| 99998 | 999 | 200.0 | 1.129212 | 0.510619 |


In [6]:
time_step = df_traj['time_step'].to_numpy()

print(time_step[4925])

8.080808080808081
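
The value printed above is `time_points[4]`, whereas row 4925 belongs to system 49 at time index 25. This happens because `np.repeat` holds each time point for `num_systems` consecutive rows instead of cycling through the time grid once per system. A minimal sketch of the difference, using small stand-in arrays (`toy_time` and `toy_traj` are illustrative names, not part of the notebook):

```python
import numpy as np

num_systems, num_time_steps = 3, 4
toy_time = np.linspace(0.0, 3.0, num_time_steps)   # stand-in for time_points
toy_traj = np.arange(num_systems * num_time_steps * 2, dtype=float).reshape(
    num_systems, num_time_steps, 2)                # stand-in for trajectories

# System labels: 0,0,0,0,1,1,1,1,2,2,2,2
system_id = np.repeat(np.arange(num_systems), num_time_steps)
# Time grid cycled once per system: 0,1,2,3,0,1,2,3,...
time_tiled = np.tile(toy_time, num_systems)
# Each time point held for num_systems rows: 0,0,0,1,1,1,...
time_repeated = np.repeat(toy_time, num_systems)

# Row-major flattening lists all time steps of system 0 first, then system 1, ...
prey = toy_traj[:, :, 0].flatten()

row = 4  # first time step of system 1
print(system_id[row], time_tiled[row], time_repeated[row])
```

With `np.tile`, the time column lines up with the flattened `prey` and `predator` columns; with `np.repeat` it does not.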


# Part 2 (a)

Grouping prey and predator values into arrays to determine the maximum values for the scaling procedure.

In [7]:
prey_array = df_traj['prey'].to_numpy() # Converting to numpy array
predator_array = df_traj['predator'].to_numpy()

### 2.10 Scaling Dataset `lotka_volterra_data.h5`

As we will see in the table presented below, the values in the original dataset vary significantly. To standardise the numeric range, we are going to use [quantiles](https://en.wikipedia.org/wiki/Quantile). A quantile is a cut point below which a given percentage of observations fall; for example, 90% of the data lie below the 0.9 quantile. The project instructions advise applying a simple scaling:
$$
x_t' = \frac{x_t}{\alpha}
$$
where $\alpha$ should be chosen based on the distribution of the dataset `lotka_volterra_data.h5`.

In our particular case we want most of our dataset to be in the range $[0,10]$. This is coded in the [`preprocessor.py`](https://github.com/MatteoMancini01/M2_Cw/blob/main/src/preprocessor.py) file, which includes appropriate docstrings.



#### `numpy.quantile()`

To scale our dataset we use [`numpy.quantile()`](https://numpy.org/doc/2.1/reference/generated/numpy.quantile.html). The `numpy.quantile()` function computes the quantiles of a given NumPy array. Quantiles are cut points that divide the data into intervals with equal probability. Thus `numpy.quantile()` can be used to scale our dataset dynamically, deriving $\alpha$ from the data itself rather than fixing it by hand.
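
To make the idea concrete, here is a minimal sketch of quantile-based scaling on synthetic data. It assumes that $\alpha$ is the $q$-quantile divided by the desired upper bound; the helper name `quantile_scale` is hypothetical, and the actual `scaling_operator` in `preprocessor.py` may be implemented differently.

```python
import numpy as np

def quantile_scale(data, q=0.9, target=10.0):
    """Hypothetical helper: scale so that the q-quantile of `data` maps to `target`."""
    alpha = np.quantile(data, q) / target   # scaling factor alpha in x' = x / alpha
    return data / alpha, alpha

# Synthetic, strictly positive data (not the Lotka-Volterra dataset)
rng = np.random.default_rng(0)
toy = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

scaled, alpha = quantile_scale(toy, q=0.9, target=10.0)
frac_in_range = np.mean(scaled <= 10.0)     # close to q by construction
print(f"alpha = {alpha:.4f}, fraction <= 10: {frac_in_range:.3f}")
```

Under this assumption, about a fraction $q$ of the scaled values falls in $[0, 10]$ whatever the original magnitude of the data.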

In [8]:
# Import class Preprocessor from src/preprocessor.py
from src.preprocessor import Preprocessor

# Set scaling_operator to function 
scaling_operator = Preprocessor.scaling_operator

Scaling data

In [9]:
trajectories_scaled, scaling_factor = scaling_operator(trajectories, 0.9, 10)
print('Scaling factor:', scaling_factor)

Scaling factor: 0.25283724


We collect the scaled data into a `pandas.DataFrame`, a $100000\times 4$ table (number of rows $= 1000 \times 100$). Three of the four columns are `time_step`, `prey` and `predator`; the additional column `system_id` separates the $1000$ different systems and will later be used to convert our time series data into string format.

In [10]:
num_systems_scaled, num_time_steps_scaled, num_variables_scaled = trajectories_scaled.shape

# Create a DataFrame
df_traj_scaled = pd.DataFrame({
    "system_id": np.repeat(np.arange(num_systems_scaled), num_time_steps_scaled),  # 0-999, each repeated 100 times
    "time_step": np.repeat(time_points[np.arange(num_time_steps_scaled)], num_systems_scaled),  # Note: repeats each of the 100 time points (0-200) 1000 times; np.tile(time_points, num_systems_scaled) would cycle the time grid once per system
    "prey": trajectories_scaled[:, :, 0].flatten(),  # Flatten prey values
    "predator": trajectories_scaled[:, :, 1].flatten()  # Flatten predator values
})

Visualising `df_traj_scaled`.

In [11]:
df_traj_scaled

| | system_id | time_step | prey | predator |
|---|---|---|---|---|
| 0 | 0 | 0.0 | 3.757031 | 4.115786 |
| 1 | 0 | 0.0 | 2.928965 | 3.083177 |
| 2 | 0 | 0.0 | 2.698359 | 2.232228 |
| 3 | 0 | 0.0 | 2.834528 | 1.612280 |
| 4 | 0 | 0.0 | 3.261036 | 1.187654 |
| ... | ... | ... | ... | ... |
| 99995 | 999 | 200.0 | 3.565728 | 2.291671 |
| 99996 | 999 | 200.0 | 3.787127 | 2.132025 |
| 99997 | 999 | 200.0 | 4.099317 | 2.039314 |
| 99998 | 999 | 200.0 | 4.466160 | 2.019556 |


Converting `prey` and `predator` columns into array using [`pandas.DataFrame.to_numpy`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html)

In [12]:
prey_array_scaled = df_traj_scaled['prey'].to_numpy() # Converting to numpy array
predator_array_scaled = df_traj_scaled['predator'].to_numpy()

We define a function that calculates the percentage of values in an array that fall outside a given range. (This may seem redundant, since the quantile passed to `scaling_operator`, e.g. $q = 0.9$, already implies that only $10\%$ of the values will be outside our custom range. However, the function is also used to measure what percentage of data points in the original dataset lies outside a specific range.)

In [13]:
def scaling_measure(arr, min_val, max_val):
    
    """
    Calculates the percentage of values in an array that fall outside a given range.

    Parameters:
    -----------
    arr : array-like
        The input numerical data.
    min_val : float
        The minimum acceptable value.
    max_val : float
        The maximum acceptable value.

    Returns:
    --------
    str
        The percentage of values outside the range, formatted as a string.
    """

    # Count values outside the range [min_val, max_val]
    outside_count = np.sum((arr < min_val)|(arr > max_val))

    # Calculate the percentage of values outside the range
    percentage_outside = (outside_count/arr.size)*100

    return f'{percentage_outside:.2f}%'
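
As a quick sanity check of the logic, the same computation can be run on a small array that is easy to verify by hand. The helper `percent_outside` below is an illustrative numeric variant that returns a float rather than a formatted string; it is not part of the notebook.

```python
import numpy as np

def percent_outside(arr, min_val, max_val):
    """Illustrative variant of scaling_measure that returns a float."""
    arr = np.asarray(arr)
    outside = (arr < min_val) | (arr > max_val)   # boolean mask of out-of-range values
    return 100.0 * outside.mean()

sample = np.array([-0.5, 0.2, 0.8, 1.0, 1.5, 3.0])
# -0.5, 1.5 and 3.0 lie outside [0, 1]; the bounds themselves count as inside
print(f"{percent_outside(sample, 0, 1):.2f}%")    # 50.00%
```

Returning a number instead of a string makes it easier to compare percentages programmatically; the formatting can then be applied only when building the table.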

Collecting scaling information into a Table using `pandas.DataFrame`.

In [14]:
min_val = 0
max_val = 1
Table_1 = pd.DataFrame({

    'Prey': [max(prey_array), np.mean(prey_array), min(prey_array), scaling_measure(prey_array, min_val, max_val)],
    'Prey after scaling': [max(prey_array_scaled), np.mean(prey_array_scaled), min(prey_array_scaled), scaling_measure(prey_array_scaled, min_val, max_val)],
    'Predator': [max(predator_array), np.mean(predator_array), min(predator_array), scaling_measure(predator_array, min_val, max_val)],
    'Predator after scaling': [max(predator_array_scaled), np.mean(predator_array_scaled), min(predator_array_scaled), scaling_measure(predator_array_scaled, min_val, max_val)],
    
})
Table_1.index = ["Maximum Value", "Mean Value", "Minimum Value", f"Values outside the range {min_val}-{max_val}"] # Adding index for each row

The table below shows the effect of scaling. We measure how many data points fall outside the range $[0,1]$ because many values in the original (pre-scaling) dataset are very small, of order $10^{-3}$ or smaller, which may affect the tokenisation process.

In [15]:
Table_1

| | Prey | Prey after scaling | Predator | Predator after scaling |
|---|---|---|---|---|
| Maximum Value | 13.740113 | 54.343708 | 4.76849 | 18.859921 |
| Mean Value | 1.698114 | 6.716232 | 0.569606 | 2.252856 |
| Minimum Value | 0.002077 | 0.008216 | 0.000037 | 0.000148 |
| Values outside the range 0-1 | 63.11% | 93.83% | 12.21% | 77.33% |


Looking at the last row, the share of values outside the range $[0,1]$ rises from $63.11\%$ to $93.83\%$ for `prey` (about $31$ percentage points) and from $12.21\%$ to $77.33\%$ for `predator` (about $65$ percentage points). Far fewer values are now small decimals below $1$, which is the effect we wanted from scaling. Now we can proceed with the next step, i.e. converting the scaled dataset to strings, for compatibility with [Qwen2.5](https://github.com/QwenLM/Qwen2.5).

### 2.11 Loading Qwen2.5

Below is a short demonstration of how to use `load_qwen()` from `src.qwen`.

In [16]:
from src.qwen import load_qwen # Import load_qwen
model, tokenizer = load_qwen() # Load the model and its tokenizer


We first try the examples provided at the end of the project instructions; see [LLMTIME Preprocessing Scheme](https://github.com/MatteoMancini01/M2_Cw/blob/main/instructions/main.pdf).

In [17]:
print(tokenizer("1.23", return_tensors="pt")["input_ids"].tolist()[0])
print('')
print(tokenizer("1 . 2 3", return_tensors="pt")["input_ids"].tolist()[0])

[16, 13, 17, 18]

[16, 659, 220, 17, 220, 18]


Trying to tokenise $[0.25,1.50;0.27,1.47;0.31,1.42]$

In [18]:
print(tokenizer("0.25,1.50;0.27,1.47;0.31,1.42", return_tensors='pt')["input_ids"].tolist()[0])

[15, 13, 17, 20, 11, 16, 13, 20, 15, 26, 15, 13, 17, 22, 11, 16, 13, 19, 22, 26, 15, 13, 18, 16, 11, 16, 13, 19, 17]
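
The ids above can be checked by hand: pairing each character of the input with its id shows that this string is encoded with exactly one token per character. The sketch below only uses the string and the ids copied from the output above; it does not query the tokenizer, and the resulting mapping covers only the characters that appear in this example.

```python
text = "0.25,1.50;0.27,1.47;0.31,1.42"
ids = [15, 13, 17, 20, 11, 16, 13, 20, 15, 26, 15, 13, 17, 22, 11,
       16, 13, 19, 22, 26, 15, 13, 18, 16, 11, 16, 13, 19, 17]

# One token per character
assert len(text) == len(ids)

char_to_id = {}
for ch, tok in zip(text, ids):
    # setdefault stores the first id seen; later occurrences must agree
    assert char_to_id.setdefault(ch, tok) == tok

print(dict(sorted(char_to_id.items())))

# Invert the mapping and rebuild the string from the ids
id_to_char = {tok: ch for ch, tok in char_to_id.items()}
decoded = "".join(id_to_char[t] for t in ids)
print(decoded == text)
```

Each digit, the decimal point, the comma and the semicolon map to a single, consistent id, which is the behaviour the LLMTIME-style formatting relies on.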


An example of text generation with the model and tokenizer returned by `load_qwen()`:

In [19]:
text = 'Hello, world' # Define input text
input_ids = tokenizer(text, return_tensors='pt').input_ids # Tokenize text 
output = model.generate(input_ids, max_length = 50) # Generate output

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


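In line 3 of the code above (`output`), the argument `max_length` caps the total length of the returned sequence, counted in tokens and including the prompt. With `text = 'Hello, world'` and `max_length = 50`, the model therefore returns at most 50 tokens overall, as we can see from the output below. To bound only the newly generated tokens, `generate` also accepts `max_new_tokens`.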

In [20]:
output

tensor([[ 9707,    11,  1879,  2219,  1986,   374,   264,  4285, 23811,  1879,
          2025,   304, 13027,    13,  1084, 23473,   330,  9707,    11,  1879,
          8958,   311,   279,  2339,   382, 73594, 12669,   198,  1350,   445,
          9707,    11,  1879, 22988, 13874, 19324,  4498,   498,  1598,   419,
          2038,    11,   432,   686,  2550,  1447, 13874,  3989,  9707,    11]])

In the tensor above, the first 3 tokens encode our input text; the remaining 47 tokens are the continuation generated by the model, as we will see below when decoding `output`.

In [21]:
print(tokenizer.decode(output[0], skip_special_tokens=True)) # Decoding output

Hello, world!

This is a simple hello world program in Python. It prints "Hello, world!" to the console.

```python
print("Hello, world!")
```

When you run this code, it will output:

```
Hello,


### 2.12 Converting Scaled Dataset into Strings

We have now seen how tokenisation works for text. There is a small issue: the Qwen2.5 tokenizer only converts text, i.e. Python strings, to tokens, while our dataset is a time series of 2 variables, prey and predator, over 100 time points, repeated for 1000 samples. Thus, before we proceed with tokenisation, we need to convert the time series data into strings. To do so we define a function `array_to_string(data)`, and a function to convert a string back to an array, `string_to_array(formatted_string)` (both functions are in [preprocessor.py](https://github.com/MatteoMancini01/M2_Cw/blob/main/src/preprocessor.py)).

⚠ Note: The function `array_to_string` is specifically designed for the dataset `lotka_volterra_data.h5`, after `trajectories` has been converted into a `pandas.DataFrame` with a column `system_id` (labelling each system from 0 to 999) and columns `prey` and `predator`, each holding 100 data points for every `system_id`.
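
The target format follows the tokenizer example in section 2.11: the two variables at one time step are joined by a comma, and time steps are separated by a semicolon. The sketch below illustrates that round trip for a single trajectory. The function names, the fixed number of decimals and the toy values are assumptions made for illustration; the actual implementations in `preprocessor.py` operate on the DataFrame and may differ.

```python
import numpy as np

def array_to_string_sketch(pairs, decimals=3):
    """Hypothetical: encode (prey, predator) rows as 'prey,predator;prey,predator;...'."""
    return ";".join(",".join(f"{v:.{decimals}f}" for v in row) for row in pairs)

def string_to_array_sketch(formatted_string):
    """Hypothetical inverse: parse the string back into an (n_steps, 2) float array."""
    rows = [[float(v) for v in step.split(",")]
            for step in formatted_string.split(";") if step]
    return np.array(rows)

# Toy trajectory with three time steps (illustrative values)
toy = np.array([[3.757, 4.116], [2.929, 3.083], [2.698, 2.232]])

encoded = array_to_string_sketch(toy)
restored = string_to_array_sketch(encoded)
print(encoded)
print(np.allclose(restored, toy))
```

Fixing the number of decimals keeps the token count per value predictable, at the cost of a small rounding error when converting back.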

In [22]:
from src.preprocessor import Preprocessor

array_to_string = Preprocessor.array_to_string # Importing array_to_string(data) to convert timeseries to string
string_to_array = Preprocessor.string_to_array # Importing string_to_array(formatted_string) to convert strings back to arrays

traject_scaled_string = array_to_string(df_traj_scaled) # Converting df_traj_scaled into string format

Checking result post-conversion

In [23]:
print(traject_scaled_string) # Print output

system_id
0      3.7570314,4.115786;2.9289646,3.0831769;2.69835...
1      3.8422916,3.976525;4.266382,3.2503378;4.986718...
2      4.2447176,4.401468;3.3779166,3.5601249;3.03805...
3      4.1148176,4.5674887;2.6275814,4.4058175;1.7460...
4      3.275598,3.167274;3.533418,2.2473524;4.0886607...
                             ...                        
995    3.918201,4.624609;2.1393397,3.2536037;1.521137...
996    3.594898,4.6532717;2.2493894,3.7967243;1.68079...
997    4.4646792,4.432619;4.054759,4.030011;3.8997226...
998    4.475788,4.0170417;3.1674376,3.1674619;2.56437...
999    4.0352407,4.488609;3.0280378,4.1001153;2.48077...
Length: 1000, dtype: object


We also want to test the function `string_to_array`. This is done below for the string of the first system, i.e. `system_id` $= 0$.

In [24]:
print(string_to_array(traject_scaled_string[0]))

[[ 3.7570314   4.115786  ]
 [ 2.9289646   3.0831769 ]
 [ 2.698359    2.232228  ]
 [ 2.834528    1.6122797 ]
 [ 3.261036    1.1876544 ]
 [ 3.9731696   0.9090103 ]
 [ 4.9950223   0.7362934 ]
 [ 6.3499713   0.6416597 ]
 [ 8.033252    0.61052966]
 [ 9.978284    0.6402782 ]
 [12.027031    0.74106705]
 [13.914932    0.9378812 ]
 [15.283247    1.2709625 ]
 [15.732348    1.7848765 ]
 [14.932756    2.4919257 ]
 [12.876417    3.2815378 ]
 [10.089222    3.8815145 ]
 [ 7.451006    4.014958  ]
 [ 5.5736156   3.6663675 ]
 [ 4.519748    3.0661838 ]
 [ 4.0797853   2.4484437 ]
 [ 4.0676417   1.9291508 ]
 [ 4.384644    1.5358958 ]
 [ 4.9906063   1.2594512 ]
 [ 5.869442    1.081379  ]
 [ 7.00247     0.98556626]
 [ 8.343905    0.9630018 ]
 [ 9.800556    1.0129801 ]
 [11.211844    1.1445497 ]
 [12.355572    1.3731911 ]
 [12.978617    1.7145051 ]
 [12.864242    2.164892  ]
 [11.940796    2.671792  ]
 [10.378936    3.1176598 ]
 [ 8.59046     3.3508794 ]
 [ 7.0140243   3.2962737 ]
 [ 5.8966465   3.0044987 ]
 

As we can observe from the above output, we successfully converted the string back to an array.

### 2.13 Tokenisation 

We provided a few basic examples of how to use `load_qwen()` in section 2.11, with some text and numbers (in string form). We now want to tokenise our data. To achieve this, we designed a function for our particular needs, `tokenize_time_series`, that uses `model, tokenizer = load_qwen()`.


In [25]:
from src.qwen import tokenize_time_series


In [26]:
tokenised_data = tokenize_time_series(traject_scaled_string)

Visualising data before tokenisation:

In [39]:
traject_scaled_string

system_id
0      3.7570314,4.115786;2.9289646,3.0831769;2.69835...
1      3.8422916,3.976525;4.266382,3.2503378;4.986718...
2      4.2447176,4.401468;3.3779166,3.5601249;3.03805...
3      4.1148176,4.5674887;2.6275814,4.4058175;1.7460...
4      3.275598,3.167274;3.533418,2.2473524;4.0886607...
                             ...                        
995    3.918201,4.624609;2.1393397,3.2536037;1.521137...
996    3.594898,4.6532717;2.2493894,3.7967243;1.68079...
997    4.4646792,4.432619;4.054759,4.030011;3.8997226...
998    4.475788,4.0170417;3.1674376,3.1674619;2.56437...
999    4.0352407,4.488609;3.0280378,4.1001153;2.48077...
Length: 1000, dtype: object

After tokenisation:

In [27]:
tokenised_data

system_id
0      [input_ids, attention_mask]
1      [input_ids, attention_mask]
2      [input_ids, attention_mask]
3      [input_ids, attention_mask]
4      [input_ids, attention_mask]
                  ...             
995    [input_ids, attention_mask]
996    [input_ids, attention_mask]
997    [input_ids, attention_mask]
998    [input_ids, attention_mask]
999    [input_ids, attention_mask]
Length: 1000, dtype: object
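
Each entry above exposes `input_ids` and `attention_mask`, which suggests that the tokenizer is applied to every string separately. The implementation of `tokenize_time_series` in `src/qwen.py` is not shown here, so the following is only a sketch of that pattern: the helper name `tokenize_series_sketch` is hypothetical, and `toy_tokenizer` is a stand-in that assigns one id per character instead of calling the real Qwen2.5 tokenizer.

```python
import pandas as pd

def tokenize_series_sketch(series, tokenizer):
    """Hypothetical: apply a tokenizer callable to every string in a pandas Series."""
    return pd.Series([tokenizer(s) for s in series], index=series.index)

def toy_tokenizer(text):
    """Stand-in for the real tokenizer: one id per character (its code point)."""
    ids = [ord(ch) for ch in text]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}

strings = pd.Series(["0.25,1.50;0.27,1.47", "3.1,4.2"])
tokens = tokenize_series_sketch(strings, toy_tokenizer)

# Each entry holds the ids and a mask of the same length
for idx, entry in tokens.items():
    print(idx, len(entry["input_ids"]), len(entry["attention_mask"]))
```

With the real tokenizer, the callable would be something like `lambda s: tokenizer(s, return_tensors="pt")`, which returns tensors rather than plain lists.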

A closer look at two examples:

In [43]:
# Print the preprocessed string and tokenised output for systems 3 and 990
print('Two examples of tokens from tokenised_data:')
print('')
print('Preprocessed data:')
print(traject_scaled_string[3])
print('')
print('After tokenisation:')
print(tokenised_data.iloc[3]["input_ids"].squeeze().tolist())  # Tokenised tensor
print('Length of the above token:',len(tokenised_data.iloc[3]["input_ids"].squeeze().tolist()))  # Tokenised tensor
print('')
print('')
print('Preprocessed data:')
print(traject_scaled_string[990])
print('')
print('After tokenisation:')
print(tokenised_data.iloc[990]["input_ids"].squeeze().tolist())
print('Length of the above token:',len(tokenised_data.iloc[990]["input_ids"].squeeze().tolist()))


Two examples of tokens from tokenised_data:

Preprocessed data:
4.1148176,4.5674887;2.6275814,4.4058175;1.7460897,3.9027684;1.2831641,3.2811728;1.0638798,2.682386;0.984722,2.1671355;1.0008674,1.7468183;1.0973107,1.4147936;1.2760373,1.1590365;1.5503457,0.9673673;1.9425601,0.8293;2.4818568,0.73716146;3.200081,0.68669236;4.1233573,0.67780405;5.256911,0.7159537;6.561579,0.81451684;7.924198,0.9982614;9.129895,1.3068261;9.859524,1.7921007;9.748706,2.4942725;8.56405,3.3738594;6.5229297,4.2074757;4.3476043,4.6512256;2.7189922,4.5382977;1.7658302,4.0314283;1.271206,3.3866417;1.0345033,2.7637167;0.9446259,2.2259717;0.9511616,1.7870439;1.0363959,1.4404546;1.2006862,1.1733989;1.4562578,0.9726263;1.8251879,0.82680446;2.336653,0.72754854;3.0239303,0.66978234;3.9170804,0.6524396;5.029446,0.67964417;6.335076,0.76294285;7.738619,0.92477435;9.045132,1.2025427;9.947493,1.6488174;10.061644,2.3158267;9.090996,3.1920273;7.1202655,4.0990167;4.8138604,4.6784763;2.9853754,4.6766305;1.890149,4.2113376;1.3143833