![*INTERTECHNICA - SOLON EDUCATIONAL PROGRAMS - TECHNOLOGY LINE*](https://solon.intertechnica.com/assets/IntertechnicaSolonEducationalPrograms-TechnologyLine.png)

# Data Manipulation with Python - Advanced Data Manipulation - Homework

*Basic initialization of the workspace.*

In [1]:
!python -m pip install numpy
import numpy as np
print ("NumPy installed at version: {}".format(np.__version__))

NumPy installed at version: 1.21.5


In [2]:
!python -m pip install pandas
import pandas as pd
print ("Pandas installed at version: {}".format(pd.__version__))

# adjust pandas DataFrame display so wide frames are not wrapped
pd.set_option('display.expand_frame_repr', False)

# disable warnings for chained assignment
pd.set_option('mode.chained_assignment', None)

Pandas installed at version: 1.3.5


In [3]:
# the PyPI package is named scikit-learn; "sklearn" is only the import name
!python -m pip install scikit-learn
import sklearn as skl
import sklearn.base as sklb
import sklearn.preprocessing as sklpre
import sklearn.pipeline as sklpipe
import sklearn.compose as sklcompose

print ("Sklearn installed at version: {}".format(skl.__version__))

Sklearn installed at version: 1.0.2


In [4]:
import warnings

# suppress all warnings (RuntimeWarnings included) that are not relevant here
warnings.filterwarnings("ignore")

## 1. Loading and exploring data

We will use a dataset of RON exchange rates against three currencies: EUR, USD and CHF. The data covers the years 2010 through 2021.

We will load the data into a DataFrame object in order to perform further processing.

In [5]:
# load the data into a dataframe object
# no index column is expected
raw_data = pd.read_csv(
    "https://raw.githubusercontent.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/main/Common/data/RON_Exchange_Rates.csv",
    index_col = None
)
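
Before converting anything, it can help to look at what was loaded. The following is a minimal sketch rather than part of the original notebook: it uses a five-row inline sample (values taken from the output shown later in this notebook) instead of downloading the file, and it assumes the DATE column is stored as ISO-formatted text. The name `sample_data` is hypothetical.

```python
import io

import pandas as pd

# Assumption: a small inline sample with the same columns as the course CSV;
# the real file may format its DATE column differently.
sample_csv = io.StringIO(
    "DATE,EUR,USD,CHF\n"
    "2010-01-04,4.2265,2.9401,2.8419\n"
    "2010-01-05,4.2077,2.9186,2.8345\n"
    "2010-01-06,4.1620,2.8987,2.8051\n"
    "2010-01-07,4.1721,2.9089,2.8158\n"
    "2010-01-08,4.1679,2.9143,2.8134\n"
)
sample_data = pd.read_csv(sample_csv, index_col=None)

# first records, column types, summary statistics and missing-value counts
print(sample_data.head())
print(sample_data.dtypes)
print(sample_data.describe())
print(sample_data.isna().sum())
```

In this sketch the DATE column is read as text, not as datetime values, which is what motivates the conversion step in the next section.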

## 2. Cleaning and converting data

Let's clean and convert the data so that it is usable in later steps. We will convert the DATE column from string values to datetime values.

In [6]:
# create a convert date column transformer which transforms string data
# into datetime data
class ConvertDateColumnTransformer(sklb.BaseEstimator, sklb.TransformerMixin) :
  """
  Converts the specified columns to datetime values
  """
  def __init__(self, features_to_convert):
    # keep the constructor parameter under its own name, so that
    # get_params() and clone() inherited from BaseEstimator can find it
    self.features_to_convert = features_to_convert
    self._feature_names_out = None

  def fit(self, X, Y = None, ** fit_params) :
    # nothing to learn from the data
    return self

  def transform(self, X, Y = None, ** fit_params) :
    # work on a copy so the input data is left untouched
    result = X.copy()
    
    for feature in self.features_to_convert :
      result[feature] = pd.to_datetime(result[feature])

    # remember the resulting column names for get_feature_names_out
    self._feature_names_out = result.columns.values

    return result

  def get_feature_names_out(self, input_features=None) :
    return self._feature_names_out


Let's test the transformer via a transformation pipeline.

In [7]:
# create a transformation pipeline
# and transform the associated data
transformation_pipeline = sklpipe.Pipeline(
    steps = [
       ("data conversion", ConvertDateColumnTransformer(["DATE"]))
    ]
)
transformed_clean_data = transformation_pipeline.fit_transform(raw_data)

# print the transformed results
print(
        "Sample records from the transformed data are \n{}\n\
with generated feature names \n{}\n".format(
          transformed_clean_data,
          transformation_pipeline.get_feature_names_out()
        )
    )

Sample records from the transformed data are 
           DATE     EUR     USD     CHF
0    2010-01-04  4.2265  2.9401  2.8419
1    2010-01-05  4.2077  2.9186  2.8345
2    2010-01-06  4.1620  2.8987  2.8051
3    2010-01-07  4.1721  2.9089  2.8158
4    2010-01-08  4.1679  2.9143  2.8134
...         ...     ...     ...     ...
3025 2021-12-27  4.9492  4.3725  4.7604
3026 2021-12-28  4.9491  4.3683  4.7668
3027 2021-12-29  4.9490  4.3849  4.7722
3028 2021-12-30  4.9486  4.3735  4.7713
3029 2021-12-31  4.9481  4.3707  4.7884

[3030 rows x 4 columns]
with generated feature names 
['DATE' 'EUR' 'USD' 'CHF']
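
The printed frame looks the same whether DATE holds strings or datetime values, so it is worth checking the column type explicitly. A minimal sketch, assuming a two-row sample with ISO-formatted date strings (the names `sample` and `converted` are hypothetical), applies the same `pd.to_datetime` call the transformer uses:

```python
import pandas as pd

# Assumption: a tiny stand-in for the raw data, with DATE stored as text
sample = pd.DataFrame(
    {"DATE": ["2010-01-04", "2010-01-05"], "EUR": [4.2265, 4.2077]}
)

# same conversion as in ConvertDateColumnTransformer.transform
converted = sample.copy()
converted["DATE"] = pd.to_datetime(converted["DATE"])

# the DATE column now has a datetime dtype
print(converted.dtypes)

# datetime values expose the .dt accessor for date-based features
print(converted["DATE"].dt.year.tolist())
print(converted["DATE"].dt.day_name().tolist())
```

Once the column is a datetime type, accessors such as `.dt.year` become available, which plain strings do not offer.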



In [8]:
# HOMEWORK: Create a column selection transformer that allows selection 
# of columns from the input data.
# The columns to be selected are specified in the constructor's parameters.
# Test the transformer with a dedicated pipeline.

## 3. Performing feature engineering

We will add further features to the dataset, starting with daily ratios between currencies.

First, let's create a transformer that calculates the ratio between two currency columns.

In [9]:
# create a columns ratio transformer that calculates the ratio values
# for two columns
class CalculateColumnsRatio(sklb.BaseEstimator, sklb.TransformerMixin) :
  """
  Calculates the ratio between a numerator currency and denominator currency
  """
  def __init__(self, numerator_feature, denominator_feature):
    # keep the constructor parameters under their own names, so that
    # get_params() and clone() inherited from BaseEstimator can find them
    self.numerator_feature = numerator_feature
    self.denominator_feature = denominator_feature
    self._feature_names_out = None 

  def fit(self, X, Y = None, ** fit_params) :
    # nothing to learn from the data
    return self

  def transform(self, X, Y = None, ** fit_params) :
    # element-wise division of the two columns
    raw_result = X[self.numerator_feature] / X[self.denominator_feature]   
    
    # the generated column is named <numerator>_<denominator>_RATIO
    self._feature_names_out = ["{}_{}_RATIO".format(
        self.numerator_feature, 
        self.denominator_feature
      )]

    result = pd.DataFrame(
        {self._feature_names_out[0] : raw_result}
    )

    return result

  def get_feature_names_out(self, input_features=None) :
    return self._feature_names_out
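
The calculation inside the transformer is plain element-wise division of two columns. As a rough check, the sketch below reproduces it outside any pipeline for the first two rows of the dataset (values taken from the sample output shown earlier). The names `sample` and `ratios` are hypothetical.

```python
import pandas as pd

# first two rows of the dataset, as printed in the earlier sample output
sample = pd.DataFrame(
    {
        "EUR": [4.2265, 4.2077],
        "USD": [2.9401, 2.9186],
        "CHF": [2.8419, 2.8345],
    }
)

# element-wise division, named with the <numerator>_<denominator>_RATIO pattern
ratios = pd.DataFrame(
    {
        "EUR_USD_RATIO": sample["EUR"] / sample["USD"],
        "EUR_CHF_RATIO": sample["EUR"] / sample["CHF"],
    }
)

print(ratios.round(6))
```

A ratio above 1 here means that one EUR costs more RON than one unit of the denominator currency on that day.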

We can combine the column ratio transformers with the min-max scaling transformer that Scikit-Learn provides out of the box.
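
Min-max scaling rescales a column to the [0, 1] range using (x - min) / (max - min). The following minimal sketch compares `MinMaxScaler` with that formula on the first five EUR values of the dataset. Because the minimum and maximum come from this small sample only, the scaled values differ from those obtained on the full dataset. The names `eur`, `scaled` and `manual` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# first five EUR values of the dataset, shaped as a single-column matrix
eur = np.array([[4.2265], [4.2077], [4.1620], [4.1721], [4.1679]])

# scaling done by scikit-learn
scaler = MinMaxScaler()
scaled = scaler.fit_transform(eur)

# the same scaling written out by hand
manual = (eur - eur.min()) / (eur.max() - eur.min())

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The smallest value of the sample maps to 0 and the largest to 1; everything else falls in between.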



In [10]:
# create a transformation pipeline
# and transform data
transformation_pipeline = sklpipe.Pipeline(
    steps = [
        (
          "Date Column Transformer", 
          ConvertDateColumnTransformer(["DATE"])
        ),
        (
          "Perform Feature Union",
          sklpipe.FeatureUnion(
            [
              ("CURRENCY_RATIO_TRANSFORMER_EUR_USD", CalculateColumnsRatio("EUR", "USD")),
              ("CURRENCY_RATIO_TRANSFORMER_EUR_CHF", CalculateColumnsRatio("EUR", "CHF")),
              ("EUR_SCALER", sklcompose.ColumnTransformer(
                  transformers = [ 
                    (
                      "EUR_MIN_MAX_SCALER", 
                      sklpre.MinMaxScaler(), 
                      ["EUR"]
                    )
                  ]
                )
              )        
            ]
          )
        )
    ]
)
transformed_data = transformation_pipeline.fit_transform(raw_data)

# print the transformed results
print(
        "Sample records from the transformed data are \n{}\n\
with generated feature names \n{}\n".format(
          transformed_data,
          transformation_pipeline.get_feature_names_out()
        )
    )

Sample records from the transformed data are 
[[1.43753614 1.48720926 0.18231169]
 [1.44168437 1.48445934 0.16104954]
 [1.43581606 1.48372607 0.1093644 ]
 ...
 [1.12864604 1.0370479  0.99943452]
 [1.13149651 1.03715968 0.99898213]
 [1.13210699 1.03335143 0.99841665]]
with generated feature names 
['CURRENCY_RATIO_TRANSFORMER_EUR_USD__EUR_USD_RATIO'
 'CURRENCY_RATIO_TRANSFORMER_EUR_CHF__EUR_CHF_RATIO'
 'EUR_SCALER__EUR_MIN_MAX_SCALER__EUR']



In [11]:
# HOMEWORK: Add currency ratio transformers for USD and CHF currencies

In [12]:
# HOMEWORK: Include the column selector transformer in the pipeline so that 
# the DATE, EUR, USD and CHF columns are included in the generated dataset 

In [13]:
# HOMEWORK: Save the generated dataset