In [7]:
import pandas as pd
import gzip
from datetime import datetime

# read in the sample data
def read_forex_data(file_path):
    with gzip.open(file_path, 'rt') as f:
        df = pd.read_csv(f, parse_dates=['datetime'])
    return df

# run checks for column validity, data types, completeness and length for comparison
def validate_raw_data(df):

    print(df.describe())

    # Ensure required columns are present
    required_columns = ['datetime', 'currency_pair', 'bid', 'ask', 'volume']
    if not all(column in df.columns for column in required_columns):
        raise ValueError("Missing required columns in the dataset")
     
    # Data Correctness
    # Ensure bid and ask prices are non-negative. NaN compares as False here,
    # so missing prices are left to the completeness check below.
    if ((df['bid'] < 0) | (df['ask'] < 0)).any():
        raise ValueError("Negative values are in this dataset")
    
    # Check data types
    print("Data Types:\n", df.dtypes)

    # Check currency pair values: every entry must be exactly 6 characters long
    # (missing or non-string entries fail the check)
    correct_length = df['currency_pair'].str.len().eq(6).all()
    if not correct_length:
        raise ValueError("Invalid currency_pair provided")

    # Data Completeness
    # Check for missing values
    missing_values = df.isnull().sum()
    print("\nMissing Values:\n", missing_values)

    # Length of DataFrame
    length_of_df = len(df)
    print("\nLength of DataFrame:", length_of_df)

    # Example output of df
    print("\nDataFrame:\n", df)
    
    return df

file_list = ['sample_fx_data_A.csv.gz','sample_fx_data_B.csv.gz', 'sample_fx_data_C.csv.gz']
    
for file in file_list:
    print(str(file))
    # Step 1: Read the Forex dataset
    raw_data = read_forex_data(file)
    
    # Step 2: Validate the raw data
    validated_data = validate_raw_data(raw_data)


sample_fx_data_A.csv.gz
                            datetime           bid           ask        volume
count                        6847154  6.847154e+06  6.847154e+06  6.847154e+06
mean   2024-01-15 04:09:00.313161728  1.373710e+02  1.373810e+02  5.000257e+03
min              2023-11-30 23:59:58  9.375900e+01  9.378500e+01  1.000000e+00
25%              2023-12-21 04:39:02  9.794900e+01  9.795900e+01  2.501000e+03
50%              2024-01-16 08:19:48  1.479940e+02  1.480010e+02  5.000000e+03
75%    2024-02-06 19:28:25.249999872  1.582430e+02  1.582570e+02  7.499000e+03
max              2024-02-29 23:59:59  1.637160e+02  1.637220e+02  9.999000e+03
std                              NaN  2.602497e+01  2.602510e+01  2.886028e+03
Data Types:
 datetime         datetime64[ns]
bid                     float64
ask                     float64
currency_pair            object
volume                    int64
dtype: object

Missing Values:
 datetime         0
bid              0
ask              0
cur

### Analysis of sample data 

To determine which data provider to select, each sample file was checked for correctness, completeness and length, and for the presence of the required fields. I also ran summary statistics on each dataset to get an idea of the differences in values for bid, ask and volume.

#### Ruling out data with missing values

Sample file B was judged incomplete because it has 5092 and 51364 missing values for bid and ask respectively. These gaps also clearly affected the summary statistics for the file. The assumption was made that missing data signifies reduced reliability, and the gaps also made it difficult to compare the counts and statistical spread of B with the other two sample sets.
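To put missing-value counts like these in proportion, they can be expressed as a share of the file's rows. The following is a minimal sketch, not part of the original notebook: the helper name `missing_summary` and the toy frame are hypothetical and stand in for the real sample files.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})

# Made-up data for illustration only (not taken from the sample files)
toy = pd.DataFrame({
    "bid": [1.10, np.nan, 1.12, 1.13],
    "ask": [1.11, 1.12, np.nan, np.nan],
    "volume": [100, 200, 300, 400],
})

summary = missing_summary(toy)
print(summary)
```

Running the same helper on each provider's DataFrame would give one small table per file, which makes the gaps easier to compare than raw counts alone.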

#### Length of the dataframe

Sample file C had a smaller record count than the other two. This does not necessarily mean that the data is invalid, but A and B had the same record count, which suggests that C is indeed missing records. This could lead to inaccuracies in reporting and calculations in the second stage of the exercise. A and B were also more similar to each other in the statistical spread of the data, so despite the small portion of missing values in B, the two broadly agree in their statistical description.
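The row-count and spread comparison described here can be collected into a single table rather than read off separate `describe()` printouts. This is a sketch under assumptions: `compare_providers` is a hypothetical helper, and the three small frames are made-up stand-ins for the sample files.

```python
import pandas as pd

def compare_providers(frames: dict) -> pd.DataFrame:
    """Build one row per provider with its record count and bid/ask spread."""
    rows = {}
    for name, df in frames.items():
        rows[name] = {
            "rows": len(df),
            "bid_mean": df["bid"].mean(),
            "bid_std": df["bid"].std(),
            "ask_mean": df["ask"].mean(),
            "ask_std": df["ask"].std(),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Made-up data for illustration only (not taken from the sample files)
frames = {
    "A": pd.DataFrame({"bid": [1.0, 2.0, 3.0, 4.0], "ask": [1.1, 2.1, 3.1, 4.1]}),
    "B": pd.DataFrame({"bid": [1.0, 2.0, 3.0, 4.0], "ask": [1.1, 2.1, 3.1, 4.1]}),
    "C": pd.DataFrame({"bid": [1.0, 2.0, 3.0], "ask": [1.1, 2.1, 3.1]}),
}

comparison = compare_providers(frames)
print(comparison)
```

In the notebook, `frames` could be filled by calling `read_forex_data` on each file in `file_list`; a provider whose `rows` value differs from the others stands out immediately.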

### Conclusion

Sample A was therefore selected as the most appropriate data provider, based on its completeness and its agreement with sample B. It also passed all of the value error and validity checks.
