![*INTERTECHNICA - SOLON EDUCATIONAL PROGRAMS - TECHNOLOGY LINE*](https://solon.intertechnica.com/assets/IntertechnicaSolonEducationalPrograms-TechnologyLine.png)

# Data Manipulation with Python - The NumPy Library - Homework

*Basic initialization of the workspace.*

In [1]:
!python -m pip install numpy
import numpy as np
print("NumPy installed at version: {}".format(np.__version__))

NumPy installed at version: 1.21.5


In [2]:
from numpy.lib import recfunctions as rfn

## 1. Loading and exploring data

We will use a dataset of RON exchange rates against three currencies: EUR, USD and CHF.
The data covers the years 2010 to 2021.


First, let's load the data:

In [3]:
# import packages for remote data load
import requests
import io

# read data remotely
data_url = "https://raw.githubusercontent.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/main/Common/data/RON_Exchange_Rates.csv"
response = requests.get(data_url)

# load the string data into a structured array
raw_data = np.loadtxt(
    io.StringIO(response.text), 
    skiprows = 1, 
    delimiter = ",", 
    dtype = {"names" : ("DATE", "EUR", "USD", "CHF"),
            "formats": ("U20", "float64", "float64", "float64")}
)
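
The same loading pattern can be tried offline on a small in-memory sample. The sketch below assumes a header row named `DATE,EUR,USD,CHF` and dates written as month/day/year (the layout the conversion step in this notebook expects); the three rows reuse the first values printed later in this notebook, and `sample_csv` and `raw_sample` are illustrative names.

```python
import io

import numpy as np

# Illustrative sample: the header and date layout are assumptions,
# the rates are the first three rows printed later in this notebook.
sample_csv = (
    "DATE,EUR,USD,CHF\n"
    "01/04/2010,4.2265,2.9401,2.8419\n"
    "01/05/2010,4.2077,2.9186,2.8345\n"
    "01/06/2010,4.1620,2.8987,2.8051\n"
)

# A structured dtype gives every column a name and a type.
raw_sample = np.loadtxt(
    io.StringIO(sample_csv),
    skiprows=1,
    delimiter=",",
    dtype={"names": ("DATE", "EUR", "USD", "CHF"),
           "formats": ("U20", "float64", "float64", "float64")},
)

print(raw_sample.dtype.names)  # field names of the structured array
print(raw_sample["EUR"])       # one column, selected by field name
```

Working on a local string like this makes it easy to check the dtype definition before fetching the full file.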


The loaded data is not yet directly usable for data processing: the DATE field is stored as text.

We need to convert this column to a proper date type (more precisely, to the NumPy datetime64 format).

In [4]:
# we will use datetime to convert textual values to dates
from datetime import datetime
raw_date_string = [datetime.strptime(date, '%m/%d/%Y').strftime("%Y-%m-%d") for date in raw_data["DATE"]]

# converting standard date strings to numpy datetime
date_data = np.array(raw_date_string, np.datetime64)
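
To see what this conversion produces, here is a minimal sketch on three hand-written date strings (`sample_dates` is an illustrative name, not part of the dataset). It follows the same two steps: parse the text with `strptime`, then build a `datetime64` array with day precision.

```python
from datetime import datetime

import numpy as np

# Illustrative input in month/day/year form.
sample_dates = ["01/04/2010", "01/05/2010", "12/31/2021"]

# Step 1: normalise the text to ISO format (YYYY-MM-DD).
iso_dates = [
    datetime.strptime(text, "%m/%d/%Y").strftime("%Y-%m-%d")
    for text in sample_dates
]

# Step 2: let NumPy parse the ISO strings with day precision.
converted = np.array(iso_dates, dtype="datetime64[D]")

# datetime64 values support comparisons and date arithmetic.
print(converted.min(), converted.max())
print(converted[1] - converted[0])
```

Unlike the original strings, the converted values sort chronologically and can be subtracted from one another.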

Once the dates are converted, we create a new structured array in which every field has its target type.

In [5]:
# create a new array containing data formatted to the target format
data = np.array(
    # create a tuple list based on various pieces of information 
    # from the loaded data
    list(zip(date_data, raw_data["EUR"], raw_data["USD"], raw_data["CHF"])),
    dtype = [("DATE",'M8[D]'), ("EUR", "float64"), ("USD", "float64"), ("CHF", "float64")]
)
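
A reduced sketch of the same construction may help when experimenting. It assumes only three columns and two rows (values taken from the output printed later in this notebook) and shows how the resulting structured array can be read by column, by row, or through a condition.

```python
import numpy as np

# Illustrative columns: two rows, three fields.
dates = np.array(["2010-01-04", "2010-01-05"], dtype="datetime64[D]")
eur = np.array([4.2265, 4.2077])
usd = np.array([2.9401, 2.9186])

# zip() pairs the columns row by row; the dtype names and types each field.
sample = np.array(
    list(zip(dates, eur, usd)),
    dtype=[("DATE", "M8[D]"), ("EUR", "float64"), ("USD", "float64")],
)

print(sample["EUR"])                         # a whole column
print(sample[0])                             # a whole row
print(sample[sample["EUR"] > 4.21]["DATE"])  # dates of the rows matching a condition
```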

Let's explore basic information regarding the data.

In [6]:
# exploring the number of records
print(
  "The number of records is {}.".format(data.shape[0])
)

The number of records is 3030.


In [7]:
# explore the DATE information
print(
    "The earliest date is {} and the latest is {}".format(
        np.min(data["DATE"]),
        np.max(data["DATE"])
    )
)

The earliest date is 2010-01-04 and the latest is 2021-12-31


In [8]:
# explore the EUR information
print(
    "The statistical information for EUR is: \n\n\
        min \t: {} \n\
        max \t: {} \n\
        mean \t: {}\n\
        median \t: {} \n\
    ".format(
        np.min(data["EUR"]),
        np.max(data["EUR"]),
        np.average(data["EUR"]),
        np.median(data["EUR"])
    )
)

The statistical information for EUR is: 

        min 	: 4.0653 
        max 	: 4.9495 
        mean 	: 4.534709141914191
        median 	: 4.4979499999999994 
    


In [9]:
# HOMEWORK: explore the USD information

In [10]:
# HOMEWORK: explore the CHF information
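
Since the same four statistics are needed for every currency, one possible approach is a small helper that works on any numeric column. The function name `summarize` and the three-value input are illustrative assumptions; in the notebook the helper would be applied to a field of `data`.

```python
import numpy as np


def summarize(values):
    """Return min, max, mean and median of a 1-D numeric array."""
    return {
        "min": float(np.min(values)),
        "max": float(np.max(values)),
        "mean": float(np.mean(values)),
        "median": float(np.median(values)),
    }


# Illustrative input: the first three USD rates shown in this notebook.
usd_sample = np.array([2.9401, 2.9186, 2.8987])
summary = summarize(usd_sample)

for name, value in summary.items():
    print("{:<7}: {:.4f}".format(name, value))
```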

## 2. Expanding the dataset

We would like to expand the dataset by adding the year, month and day as separate fields.

In [11]:
# extracting the years information
years_data = [date.astype(object).year for date in data["DATE"]]
years = np.array(years_data, dtype=[("YEAR", "i8")])

# merging information in the record array
data = rfn.merge_arrays (
  (data, years),
  asrecarray = True, 
  flatten = True
)
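
The merge step can be examined in isolation with a two-row sketch. The arrays `base` and `years` below are illustrative; the new column is built with `np.zeros` plus a field assignment, which is one explicit way to obtain a one-field structured array.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Illustrative base array with two fields.
base = np.array(
    [(4.2265, 2.9401), (4.9481, 4.3707)],
    dtype=[("EUR", "float64"), ("USD", "float64")],
)

# New column, stored as a one-field structured array of the same length.
years = np.zeros(base.shape[0], dtype=[("YEAR", "i8")])
years["YEAR"] = [2010, 2021]

# flatten=True puts all fields at the same level;
# asrecarray=True additionally allows attribute access (merged.YEAR).
merged = rfn.merge_arrays((base, years), asrecarray=True, flatten=True)

print(merged.dtype.names)
print(merged.YEAR)
```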

In [12]:
# use calendar library for month names
import calendar

# extracting the months information
months_data = [date.astype(object).month for date in data["DATE"]]
month_names_data = [calendar.month_name[month] for month in months_data]

months = np.array(months_data, dtype=[("MONTH", "i8")])
month_names = np.array(month_names_data, dtype=[("MONTH_NAME", "U20")])

# merging information in the record array
data = rfn.merge_arrays (
  (data, months, month_names),
  asrecarray=True, 
  flatten=True
)

In [13]:
# HOMEWORK: insert the days information as well
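
The cells above convert each value to a Python date object in a list comprehension. As an alternative sketch (not the approach used in this notebook), the date parts can also be derived with array operations only, by truncating `datetime64` values to coarser units. The sample dates are illustrative.

```python
import numpy as np

# Illustrative dates with day precision.
dates = np.array(
    ["2010-01-04", "2016-02-29", "2021-12-31"], dtype="datetime64[D]"
)

# Truncating to year or month precision drops the finer parts of each date.
as_years = dates.astype("datetime64[Y]")
as_months = dates.astype("datetime64[M]")

# datetime64 counts from 1970-01-01, so the integer views are offsets from 1970.
years = as_years.astype("int64") + 1970
months = as_months.astype("int64") % 12 + 1

# Day of month = days elapsed since the first day of that month, plus one.
first_of_month = as_months.astype("datetime64[D]")
days = (dates - first_of_month).astype("int64") + 1

print(years, months, days)
```

This avoids a Python-level loop, which tends to matter more as the number of records grows.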

In [14]:
# print the expanded dataset
print(
    "The expanded dataset is: \n{}\n with the columns: \n{}\n".format(
        data,
        data.dtype.names
    )
)

The expanded dataset is: 
[('2010-01-04', 4.2265, 2.9401, 2.8419, 2010,  1, 'January')
 ('2010-01-05', 4.2077, 2.9186, 2.8345, 2010,  1, 'January')
 ('2010-01-06', 4.162 , 2.8987, 2.8051, 2010,  1, 'January') ...
 ('2021-12-29', 4.949 , 4.3849, 4.7722, 2021, 12, 'December')
 ('2021-12-30', 4.9486, 4.3735, 4.7713, 2021, 12, 'December')
 ('2021-12-31', 4.9481, 4.3707, 4.7884, 2021, 12, 'December')]
 with the columns: 
('DATE', 'EUR', 'USD', 'CHF', 'YEAR', 'MONTH', 'MONTH_NAME')



## 3. Additional feature engineering

To further improve the relevance of the dataset, we will enhance it with additional features.

In [15]:
# calculate the EUR_USD and EUR_CHF ratios
EUR_USD_ratio = np.array(data["EUR"] / data["USD"], dtype = [("EUR_USD_RATIO", "float64")])
EUR_CHF_ratio = np.array(data["EUR"] / data["CHF"], dtype = [("EUR_CHF_RATIO", "float64")])

# merging information in the record array
data = rfn.merge_arrays (
  (data, EUR_USD_ratio, EUR_CHF_ratio),
  asrecarray=True, 
  flatten=True
)
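
For a single derived column, `rfn.append_fields` is another option worth knowing: it accepts a plain array and a field name, so no intermediate one-field structured array is needed. The sketch below uses an illustrative two-row array named `rates`.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Illustrative two-row array with the two fields needed for the ratio.
rates = np.array(
    [(4.2265, 2.9401), (4.2077, 2.9186)],
    dtype=[("EUR", "float64"), ("USD", "float64")],
)

# Element-wise division of two fields yields a plain float array.
eur_usd = rates["EUR"] / rates["USD"]

# append_fields adds the plain array as a new named field;
# usemask=False returns a regular (non-masked) structured array.
extended = rfn.append_fields(rates, "EUR_USD_RATIO", eur_usd, usemask=False)

print(extended.dtype.names)
print(extended["EUR_USD_RATIO"])
```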

In [16]:
# HOMEWORK: add the USD_EUR and USD_CHF ratios

In [17]:
# HOMEWORK: add the CHF_EUR and CHF_USD ratios

In [18]:
# print the feature engineered dataset
print(
    "The dataset with engineered features is: \n{}\n with the columns: \n{}\n".format(
        data,
        data.dtype.names
    )
)

The dataset with engineered features is: 
[('2010-01-04', 4.2265, 2.9401, 2.8419, 2010,  1, 'January', 1.43753614, 1.48720926)
 ('2010-01-05', 4.2077, 2.9186, 2.8345, 2010,  1, 'January', 1.44168437, 1.48445934)
 ('2010-01-06', 4.162 , 2.8987, 2.8051, 2010,  1, 'January', 1.43581606, 1.48372607)
 ...
 ('2021-12-29', 4.949 , 4.3849, 4.7722, 2021, 12, 'December', 1.12864604, 1.0370479 )
 ('2021-12-30', 4.9486, 4.3735, 4.7713, 2021, 12, 'December', 1.13149651, 1.03715968)
 ('2021-12-31', 4.9481, 4.3707, 4.7884, 2021, 12, 'December', 1.13210699, 1.03335143)]
 with the columns: 
('DATE', 'EUR', 'USD', 'CHF', 'YEAR', 'MONTH', 'MONTH_NAME', 'EUR_USD_RATIO', 'EUR_CHF_RATIO')



## 4. Data consolidation

We will consolidate the data by calculating aggregations at different levels.

In [19]:
# calculate the min, max, average values for EUR at year level
dataset_years = set(data["YEAR"])
dataset_EUR = []

# build one consolidated EUR record per year
for year in dataset_years :
  # obtain the dataset associated to the year 
  year_dataset = data[data["YEAR"] == year] 

  # extract the consolidated dataset for the year
  record_EUR = (
      year, 
      # get max value for EUR
      np.max(year_dataset["EUR"]),
      # get min value for EUR
      np.min(year_dataset["EUR"]),
      # get average value for EUR
      np.average(year_dataset["EUR"]),
      # get the month in which EUR had its maximum value
      year_dataset[np.argmax(year_dataset["EUR"])]["MONTH"],
      # get the name of the month in which EUR had its maximum value
      year_dataset[np.argmax(year_dataset["EUR"])]["MONTH_NAME"],
      # get the month in which EUR had its minimum value
      year_dataset[np.argmin(year_dataset["EUR"])]["MONTH"],
      # get the name of the month in which EUR had its minimum value
      year_dataset[np.argmin(year_dataset["EUR"])]["MONTH_NAME"],
    )
  
  # add data to the dataset
  dataset_EUR.append(record_EUR)  

# create a consolidated dataset
dataset_consolidated = np.array(
    dataset_EUR,
    dtype = [
             ("YEAR", "i8"), 
             ("MAX_EUR", "float64"),
             ("MIN_EUR", "float64"),
             ("AVG_EUR", "float64"),
             ("MONTH_EUR_MAX", "i8"),
             ("MONTH_NAME_EUR_MAX", "U20"),
             ("MONTH_EUR_MIN", "i8"),
             ("MONTH_NAME_EUR_MIN", "U20"),
            ]  
  )

# sort dataset by year (in place)
dataset_consolidated.sort(order = "YEAR")
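
The grouping logic can be checked on a handful of rows before running it on the full dataset. The sketch below uses six illustrative (YEAR, MONTH, EUR) rows taken from the printed dataset and iterates over `np.unique` instead of a `set`; because `np.unique` returns the years in ascending order, the result needs no separate sort.

```python
import numpy as np

# Illustrative rows (YEAR, MONTH, EUR), deliberately out of order.
sample = np.array(
    [(2021, 12, 4.9490), (2010, 1, 4.2265), (2021, 12, 4.9486),
     (2010, 1, 4.2077), (2010, 1, 4.1620), (2021, 12, 4.9481)],
    dtype=[("YEAR", "i8"), ("MONTH", "i8"), ("EUR", "float64")],
)

records = []
# np.unique returns the distinct years already sorted in ascending order.
for year in np.unique(sample["YEAR"]):
    rows = sample[sample["YEAR"] == year]   # boolean mask selects one year
    peak = rows[np.argmax(rows["EUR"])]     # row holding the yearly maximum
    records.append((year, rows["EUR"].max(), rows["EUR"].min(),
                    rows["EUR"].mean(), peak["MONTH"]))

yearly = np.array(
    records,
    dtype=[("YEAR", "i8"), ("MAX_EUR", "float64"), ("MIN_EUR", "float64"),
           ("AVG_EUR", "float64"), ("MONTH_EUR_MAX", "i8")],
)
print(yearly)
```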

In [20]:
# HOMEWORK: add the USD information to the consolidated dataset

In [21]:
# HOMEWORK: add the CHF information to the consolidated dataset

In [22]:
# print the consolidated array
print(
    "The consolidated dataset is: \n{}\n with the columns: \n{}\n".format(
        dataset_consolidated,
        dataset_consolidated.dtype.names
    )
)

The consolidated dataset is: 
[(2010, 4.3688, 4.0653, 4.21098949,  6, 'June', 3, 'March')
 (2011, 4.362 , 4.0735, 4.23767804, 11, 'November', 4, 'April')
 (2012, 4.6481, 4.3219, 4.45725952,  8, 'August', 1, 'January')
 (2013, 4.5535, 4.3072, 4.41860119,  6, 'June', 5, 'May')
 (2014, 4.5447, 4.3845, 4.4440381 ,  1, 'January', 7, 'July')
 (2015, 4.5381, 4.3965, 4.44457154, 12, 'December', 4, 'April')
 (2016, 4.5411, 4.4444, 4.49002835, 12, 'December', 7, 'July')
 (2017, 4.6597, 4.4888, 4.56798996, 12, 'December', 2, 'February')
 (2018, 4.6695, 4.6206, 4.65352892,  6, 'June', 8, 'August')
 (2019, 4.7808, 4.6634, 4.74536653, 11, 'November', 1, 'January')
 (2020, 4.875 , 4.7642, 4.83761673,  9, 'September', 2, 'February')
 (2021, 4.9495, 4.8691, 4.92075827,  9, 'September', 1, 'January')]
 with the columns: 
('YEAR', 'MAX_EUR', 'MIN_EUR', 'AVG_EUR', 'MONTH_EUR_MAX', 'MONTH_NAME_EUR_MAX', 'MONTH_EUR_MIN', 'MONTH_NAME_EUR_MIN')



In [23]:
# OPTIONAL HOMEWORK: consolidate the information about min, max, average currency rate values
# at the year and month level

## 5. Extract data insights

Determine the months in which the EUR rate reached its yearly maximum, and how many times this happened for each such month.

In [24]:
# determine the months where EUR rate had its maximum value in each year
# along with the number of years in which each of these months held the maximum
max_EUR_rate_months, count_max_EUR_rate_months = np.unique(
    dataset_consolidated["MONTH_NAME_EUR_MAX"], return_counts = True
  )

# display the information
displayable_counts = [str(count) + " time(s)" for count in count_max_EUR_rate_months ]
print(
    "The months where EUR rate had its maximum yearly value \
along with the frequency of such occurrences are: \n{}".format(
      dict(zip(max_EUR_rate_months, displayable_counts))
    )
  )

The months where EUR rate had its maximum yearly value along with the frequency of such occurrences are: 
{'August': '1 time(s)', 'December': '3 time(s)', 'January': '1 time(s)', 'June': '3 time(s)', 'November': '2 time(s)', 'September': '2 time(s)'}
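
`np.unique` returns the months in alphabetical order, which makes the most frequent ones hard to spot. As a small sketch, the counts can be ranked with `np.argsort`; the `peak_months` array below is copied by hand from the MONTH_NAME_EUR_MAX column of the consolidated output above.

```python
import numpy as np

# Month of the yearly EUR maximum for 2010 to 2021, in year order.
peak_months = np.array([
    "June", "November", "August", "June", "January", "December",
    "December", "December", "June", "November", "September", "September",
])

months, counts = np.unique(peak_months, return_counts=True)

# Sort by descending count; a stable sort keeps alphabetical order among ties.
order = np.argsort(-counts, kind="stable")
ranking = [(str(months[i]), int(counts[i])) for i in order]

for month, count in ranking:
    print("{:<10} {} time(s)".format(month, count))
```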


In [25]:
# HOMEWORK: determine the months where EUR rate had its minimum value in each year.

In [26]:
# OPTIONAL HOMEWORK: determine the same information for the USD and CHF currencies