# Data preparation module

Available methods:

- get_points_within_area: prepares points data for further processing,
- prepare_areal_shapefile: prepares areal shapefile for processing and transforms it into numpy array,
- read_point_data: reads data and converts it into numpy array,
- select_values_in_range: selects the set of values which are greater than (lag - step size) and less than (lag + step size),
- set_areal_weights: prepares array for weighted semivariance calculation.

## get_points_within_area

```python
pyinterpolate.data_processing.data_preparation.get_points_within_area.get_points_within_area(
    area_shapefile, points_shapefile, areal_id_col_name, points_val_col_name,
    dropna=True, points_geometry_col_name='geometry', nans_to_zero=True)
```

Function prepares points data for further processing.

INPUT:

- **area_shapefile**: (```string```) areal data ```shapefile``` address,
- **points_shapefile**: (```string```) points data ```shapefile``` address,
- **areal_id_col_name**: (```string```) name of the column with id of areas,
- **points_val_col_name**: (```string```) name of the value column of each point,
- **dropna**: (```bool```) if ```True``` then rows with ```NaN``` are deleted (areas without any points),
- **points_geometry_col_name**: (```string```) default is ```'geometry'``` as in ```GeoPandas GeoDataFrames```,
- **nans_to_zero**: (```bool```) if ```True``` then all ```NaN``` values are cast to 0.


OUTPUT:

- ```numpy array``` of area id and array with point coordinates and values:

```python
[
    area_id,
    [point_position_x, point_position_y, value]
]
```
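The nested layout above can be walked row by row. The following is a minimal sketch, not output produced by pyinterpolate: the sample data is hand-built, all ids, coordinates and values are invented, the helper `total_per_area` is hypothetical, and it assumes the second element of each row holds one `[x, y, value]` entry per point.

```python
import numpy as np

# Hand-built stand-in for the function's output (assumption: one row per area,
# the second element is an array with one [x, y, value] entry per point).
sample_output = [
    ["area_a", np.array([[0.0, 0.0, 10.0], [1.0, 0.5, 5.0]])],
    ["area_b", np.array([[4.0, 2.0, 7.0]])],
]


def total_per_area(areas_with_points):
    """Hypothetical helper: sum the point values (third column) per area id."""
    totals = {}
    for area_id, points in areas_with_points:
        points = np.asarray(points, dtype=float)
        totals[area_id] = float(points[:, 2].sum())
    return totals


totals = total_per_area(sample_output)
```

Here `totals` maps each area id to the sum of the values of the points assigned to it.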

***

## prepare_areal_shapefile

```python
pyinterpolate.data_processing.data_preparation.prepare_areal_shapefile.prepare_areal_shapefile(
    areal_file_address, id_column_name=None, value_column_name=None,
    geometry_column_name='geometry', dropnans=True)
```

Function prepares areal shapefile for processing and transforms it into ```numpy array```. Function returns two lists.

INPUT:

- **areal_file_address**: (```string```) path to the shapefile with area data,
- **id_column_name**: (```string```) id column name, if not provided then index column is treated as the id,
- **value_column_name**: (```string```) value column name, if not provided then all values are set to ```NaN```,
- **geometry_column_name**: (```string```) default is ```'geometry'``` as in ```GeoPandas GeoDataFrames```,
- **dropnans**: (```bool```) if ```True``` then rows with ```NaN``` are dropped.


OUTPUT:

- ```numpy array``` of area id, area geometry, coordinate of centroid x, coordinate of centroid y, value:

```python
[area_id, area_geometry, centroid coordinate x, centroid coordinate y, value]
```

***

## read_point_data

```python
pyinterpolate.data_processing.data_preparation.read_data.read_point_data(path, data_type)
```

Function reads data from a text file and converts it into ```numpy array```.

INPUT:

- **path**: (```string```) path to the file,
- **data_type**: (```string```) data type, available types: ```'txt'``` for txt files.


OUTPUT:

- ```numpy array``` of coordinates and their values.
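The expected layout of the text file is not described here. As an illustration only, the sketch below assumes one point per line written as comma-separated `x,y,value`, and uses plain `numpy.loadtxt` rather than the library's own reader. The helper `read_txt_points` and the sample values are hypothetical.

```python
import io

import numpy as np

# Invented sample content; the comma-separated x,y,value layout is an assumption.
raw_text = "15.1,52.7,91.3\n15.2,52.6,4.5\n15.3,52.8,60.0\n"


def read_txt_points(source, delimiter=","):
    """Hypothetical helper: load rows of x, y, value into a 2D float array."""
    return np.loadtxt(source, delimiter=delimiter, ndmin=2)


# A file path could be passed instead of the in-memory buffer used here.
points = read_txt_points(io.StringIO(raw_text))
```

Each row of `points` then holds the two coordinates followed by the observed value.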

***

## select_values_in_range

```python
pyinterpolate.data_processing.data_preparation.select_values_in_range.select_values_in_range(
    data, lag, step_size)
```

Function selects the set of values which are greater than (```lag - step size```) and less than (```lag + step size```).

INPUT:

- **data**: array of distances,
- **lag**: (```float```) lag within which areas are included,
- **step_size**: (```float```) step between lags. Usually it is constant across iterations and equal to ```0.5 * lag```.


OUTPUT:

- ```numpy array``` mask with distances within specified radius.
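The selection rule can be sketched with plain NumPy comparisons. This is an illustration of the rule as described, not the library's implementation: the helper `select_in_range_sketch` is hypothetical, it returns a boolean mask, and the real function may return index arrays instead.

```python
import numpy as np


def select_in_range_sketch(data, lag, step_size):
    """Hypothetical helper: True where lag - step_size < distance < lag + step_size."""
    data = np.asarray(data, dtype=float)
    return (data > lag - step_size) & (data < lag + step_size)


# Invented symmetric distance matrix between three points.
distances = np.array([
    [0.0, 1.2, 2.6],
    [1.2, 0.0, 1.9],
    [2.6, 1.9, 0.0],
])

# step_size follows the usual 0.5 * lag convention mentioned above.
mask = select_in_range_sketch(distances, lag=2.0, step_size=1.0)
selected = distances[mask]
```

Both bounds are strict, so a distance exactly equal to `lag - step_size` or `lag + step_size` is excluded in this sketch.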

***

## set_areal_weights

```python
pyinterpolate.data_processing.data_preparation.set_areal_weights.set_areal_weights(
    areal_data, areal_points)
```

Function prepares array for _weighted semivariance_ calculation.

INPUT:

- **areal_data**: (```numpy array```) of areas in the form:

```python
[area_id, areal_polygon, centroid coordinate x, centroid coordinate y, value]
```

- **areal_points**: (```numpy array```) of points within areas in the form:

```python
[area_id, [point_position_x, point_position_y, value]]
```


OUTPUT:

- ```numpy array``` of weighted points.