# Applications of Artificial Intelligence  
# Graded Assignment 1 - Data Ingestion

(submission template, version June 2022)

Please follow this template when preparing your submission. It aids manual marking and minimises the chance of violating the assignment specifications, which would result in an unfortunate loss of points.  

Please note that this assignment is **not** autograded, so you may introduce additional cells where necessary. 

Also please submit the notebook **already executed** wherever possible - this is particularly important in Task 2 where the computed statistics guide your discussion. The marker will also re-run cells to validate outputs.

## Task 1 CSV parser 

### 1a
**Is the CSV parser implementation wrapped in a function that takes only a
filename as input and returns data?  <br>
And does it not make use of any
existing CSV reading and parsing library for file handling? <br>**
*Note: passing this section is required for subsequent marks* 


In [418]:
import pandas as pd
import ast
from datetime import datetime

pd.set_option('display.max_rows', None) 
pd.set_option('display.max_columns', None)  


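def csv_parser(data_file):
    # utf-8-sig strips a leading byte-order mark if the file has one
    with open(data_file, 'r', encoding="utf-8-sig") as file:
        read_data = file.read().split('\n')
        # Parse every non-empty line into a list of typed entries
        rows = [row_parser(line) for line in read_data if line]
        # Column names come from the first line, with surrounding quotes removed
        header = read_data[0].split(',')  
        columns = [column.strip('"') for column in header]
    
    # Pad short rows with empty strings and truncate long rows so that
    # every row has exactly one entry per column
    modified_rows = []
    for row in rows:
        if len(row) < len(columns):
            row += [''] * (len(columns) - len(row))
        elif len(row) > len(columns):
            row = row[:len(columns)]
        modified_rows.append(row)
    
    # The first parsed row is the header, so the data starts at index 1
    df = pd.DataFrame(columns=columns, data=modified_rows[1:])
    # Use the first column as a datetime index
    datetime_column = df.columns[0] 
    df[datetime_column] = df[datetime_column].apply(datetime_parser)
    df.set_index(datetime_column, inplace=True, drop=True)

    return df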



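def datetime_parser(value):
    # Formats are tried in order and the first match wins, so an ambiguous
    # date such as 03/04/2021 is read as month/day/year.
    datetime_formats = [
        "%Y-%m-%dT%H:%M:%S",
        "%Y-%m-%d %H:%M:%S",
        "%Y-%m-%d",
        "%m/%d/%Y",
        "%d/%m/%Y",
        "%m/%d/%Y %I:%M:%S %p",
        "%d/%m/%Y %H:%M:%S",
        "%Y-%m-%dT%H:%M:%SZ",
        "%Y-%m-%dT%H:%M:%S%z",
        "%b %d, %Y %I:%M:%S %p %Z",
        "%B %d, %Y",
        "%A, %B %d, %Y",
        "%Y-%m-%d %H:%M",
        "%I %p",
        "%Y-W%W",
        "%Y-%j",
        "%H:%M"
    ]
    for fmt in datetime_formats:
        try:
            return datetime.strptime(value, fmt)
        except (ValueError, TypeError):
            # ValueError: the format does not match; TypeError: value is not a string
            continue
    # No format matched: return the value unchanged
    return value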



def row_parser(line):
    entries = []
    entry = ''
    string_flag = False
    set_flag = False
    list_flag = False
    dict_flag = False
    escape_flag = False
    

    for idx, char in enumerate(line):
        if idx == 0 and char == '#':
            continue 
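        if char == '\\':
            # Backslashes are dropped from the entry. Note that escape_flag is
            # set here but never read, so the next character is not treated specially.
            escape_flag = True 
            continue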
        if char == '"' and dict_flag == False and set_flag == False and list_flag == False:
            if string_flag == True:
                entry += char 
            if len(line) == (idx + 1):
                entries.append(entry)
                entry = ''
            string_flag = not string_flag
        elif char == '(':
            entry += char 
            if string_flag == False:
                set_flag = True 
        elif char == ')':
            entry += char 
            if string_flag == False:
                set_flag = False 
                entries.append(entry)
                entry = ''
        elif char == '[':
            entry += char 
            if string_flag == False:
                list_flag = True 
        elif char == ']': 
            entry += char 
            if string_flag == False:
                list_flag = False 
                entries.append(entry)
                entry = ''
        elif char == '{':
            entry += char 
            if string_flag == False:
                dict_flag = True 
        elif char == '}':
            entry += char 
            if string_flag == False:
                dict_flag = False
                entries.append(entry)
                entry = ''
        elif char != ',':
            entry += char
            if len(line) == (idx + 1):
                entries.append(entry)
                entry = ''
        elif dict_flag == False and set_flag == False and list_flag == False and string_flag == False:
            entries.append(entry)
            entry = ''
        else:
            entry += char


    converted_entries = []
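    for entry in entries:
        # Drop the trailing quote kept by the loop above; restore it if the
        # entry still contains an embedded quote.
        if entry.endswith('"'):
            entry = entry[:-1]
            if '"' in entry:
                entry = entry + '"'
        try:
            # Convert Python literals (numbers, lists, dicts, tuples) to typed values
            converted_entry = ast.literal_eval(entry)
        except (ValueError, SyntaxError):
            # Not a Python literal: keep the raw string
            converted_entry = entry
        converted_entries.append(converted_entry)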

    return converted_entries
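
The type-recognition step at the end of `row_parser` can be illustrated in isolation. The following is a minimal sketch, not part of the graded parser; the helper name `convert_entry` is hypothetical. It shows how `ast.literal_eval` turns the text of a field into a Python value, and how text that is not a valid Python literal is kept as a string.

```python
import ast


def convert_entry(text):
    """Return the Python value described by `text`, or `text` itself.

    Hypothetical helper for illustration only.
    """
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a Python literal (for example a date or free text): keep the string.
        return text


samples = ["54", "21.93", "[1, 2, 3]", "{'a': 1}", "2016-10-09", "hello world"]
converted = [convert_entry(s) for s in samples]
for raw, value in zip(samples, converted):
    print(f"{raw!r:>15} -> {value!r} ({type(value).__name__})")
```

Dates stay as strings at this stage, which is why `csv_parser` applies `datetime_parser` to the first column afterwards.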

### 1b 

**Does the program correctly parse the provided CSV files into a suitable Python data structure, recognising and properly handling data
types**?


You **must** include the following four cells with function calls to run your parser on each provided csv file one by one and to print/visualise your data structure with parsed data for each input file. 

It must be clear from the visualised outputs that the parser correctly recognises and processes data types in the input files. <br>

In [405]:
data_structure_rainfall = csv_parser('rainfall-1617.csv')
data_structure_rainfall[:10]

| DateTime | mm |
|---|---|
| 2016-10-09 | 0.0 |
| 2016-10-10 | 0.0 |
| 2016-10-11 | 0.0 |
| 2016-10-12 | 0.0 |
| 2016-10-13 | 0.0 |
| 2016-10-14 | 1.1 |
| 2016-10-15 | 2.1 |
| 2016-10-16 | 8.4 |
| 2016-10-17 | 1.1 |
| 2016-10-18 | 3.1 |


In [406]:
data_structure_barometer = csv_parser('barometer-1617.csv')
data_structure_barometer[:10]

| DateTime | Baro |
|---|---|
| 2016-10-09 | 1021.9 |
| 2016-10-10 | 1019.9 |
| 2016-10-11 | 1015.8 |
| 2016-10-12 | 1013.2 |
| 2016-10-13 | 1005.9 |
| 2016-10-14 | 998.6 |
| 2016-10-15 | 998.0 |
| 2016-10-16 | 1002.2 |
| 2016-10-17 | 1009.8 |
| 2016-10-18 | 1013.4 |


In [407]:
data_structure_temp_indoor = csv_parser('indoor-temperature-1617.csv')
data_structure_temp_indoor[:10]

| DateTime | Humidity | Temperature | Temperature_range (low) | Temperature_range (high) |
|---|---|---|---|---|
| 2016-10-09 | 54 | 21.93 | 21.0 | 22.8 |
| 2016-10-10 | 52 | 21.77 | 20.4 | 23.6 |
| 2016-10-11 | 51 | 21.36 | 19.9 | 23.0 |
| 2016-10-12 | 51 | 21.44 | 20.0 | 23.6 |
| 2016-10-13 | 52 | 21.22 | 20.1 | 22.3 |
| 2016-10-14 | 52 | 21.02 | 19.6 | 22.6 |
| 2016-10-15 | 53 | 21.4 | 20.3 | 22.5 |
| 2016-10-16 | 53 | 21.43 | 20.0 | 23.0 |
| 2016-10-17 | 53 | 21.67 | 20.5 | 22.7 |
| 2016-10-18 | 54 | 21.75 | 20.6 | 23.1 |


In [408]:
data_structure_temp_outside = csv_parser('outside-temperature-1617.csv')
data_structure_temp_outside[:10]

| DateTime | Temperature | Temperature_range (low) | Temperature_range (high) |
|---|---|---|---|
| 2016-10-09 | 10.66 | 7.2 | 13.8 |
| 2016-10-10 | 8.94 | 5.6 | 12.8 |
| 2016-10-11 | 8.69 | 5.3 | 14.3 |
| 2016-10-12 | 11.55 | 9.0 | 14.9 |
| 2016-10-13 | 9.4 | 6.0 | 13.3 |
| 2016-10-14 | 9.85 | 6.8 | 13.3 |
| 2016-10-15 | 10.72 | 8.2 | 14.7 |
| 2016-10-16 | 11.28 | 7.8 | 14.5 |
| 2016-10-17 | 11.84 | 10.0 | 15.0 |
| 2016-10-18 | 10.24 | 8.2 | 12.7 |


### 1c

**Does the program correctly parse the first hidden CSV file into a
suitable Python data structure recognising and properly handling
data types? <br> This file will in particular test how well your code can
handle a variety of different strings**.

Again, you need to include the following cell to allow the marker to call your parser on the first hidden file and assess correctness of the parser's performance.

In [326]:
data_structure_hidden1 = csv_parser('hidden_file_1')
data_structure_hidden1

### 1d

**Does the program correctly parse the second hidden CSV file into
a suitable Python data structure recognising and properly handling
data types? This file will in particular test how well your code can
handle manually inputted dates**.


Similarly, for the second hidden file:

In [327]:
data_structure_hidden2 = csv_parser('hidden_file_2')
data_structure_hidden2
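
Manually entered dates are often ambiguous between day-first and month-first order. The sketch below is a hypothetical illustration using a two-format subset of the list in `datetime_parser`; the function name `parse_with_formats` is not part of the submission. It shows that when formats are tried in order the first match wins, so `03/04/2021` is read as March 4, while `25/12/2021` only matches the day-first format.

```python
from datetime import datetime


def parse_with_formats(text, formats):
    """Try each format in order and return the first successful parse."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return text  # no format matched: leave the value unchanged


formats = ["%m/%d/%Y", "%d/%m/%Y"]  # month-first is tried before day-first

ambiguous = parse_with_formats("03/04/2021", formats)
day_first_only = parse_with_formats("25/12/2021", formats)
unparsed = parse_with_formats("not a date", formats)

print(ambiguous)
print(day_first_only)
print(unparsed)
```

If the hidden file uses day-first dates throughout, the order of these two formats would need to be swapped; that choice cannot be made reliably from a single ambiguous value.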

### 1e

**Does the program give a sensible partial result for the hidden malformed CSV file, which contains some valid data, without throwing unhandled exceptions? Please note that correct, informative messages generated by the parser about CSV format violations detected in the file will also be credited under this criterion.**


Similarly, for the malformed file:

In [328]:
data_structure_malformed = csv_parser('malformed_file.csv')
data_structure_malformed

If your parser generates any informative error messages on detected csv-format violations in this file (as part of the sensible partial result), these messages should be printed by the parser function at runtime (not stored in any additional data structure or any class attribute).
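
One way to produce such runtime messages is to report every row whose field count differs from the header before padding or truncating it. The following is a minimal sketch under that assumption; the helper name `normalise_row`, its signature, and the message wording are hypothetical and are not part of the `csv_parser` shown in Task 1a.

```python
def normalise_row(row, n_columns, line_number):
    """Pad or truncate `row` to `n_columns`, printing a message on mismatch.

    Hypothetical helper: name, signature, and message wording are assumptions.
    """
    if len(row) < n_columns:
        print(f"Line {line_number}: expected {n_columns} fields, found {len(row)}; "
              "padding with empty strings.")
        return row + [''] * (n_columns - len(row))
    if len(row) > n_columns:
        print(f"Line {line_number}: expected {n_columns} fields, found {len(row)}; "
              "dropping extra fields.")
        return row[:n_columns]
    return row


short = normalise_row(['2016-10-09'], 3, line_number=2)
long = normalise_row(['2016-10-10', 1.0, 2.0, 3.0], 3, line_number=3)
exact = normalise_row(['2016-10-11', 1.0, 2.0], 3, line_number=4)
```

The messages are printed as each row is processed, so nothing extra is stored in a data structure or class attribute.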

## Task 2 Data Wrangling 

### 2a

**Does the program correctly output the minimum, maximum, mean,
and standard deviation for every weather data component in the provided CSV
files?**

In [344]:
def calculate_statistics(data_file):

    df = csv_parser(data_file) 
    
    rows = []
    for column in df.columns:
        # Coerce non-numeric entries to NaN so that they are skipped by the statistics
        values = pd.to_numeric(df[column], errors='coerce')
        rows.append({
            'Field': column,
            'Min': values.min(),
            'Max': values.max(),
            'Mean': round(values.mean(), 3),
            'StdDev': round(values.std(), 3),  # sample standard deviation (ddof=1)
        })

    # Build the table once instead of concatenating onto an empty DataFrame
    stats_table = pd.DataFrame(rows, columns=['Field', 'Min', 'Max', 'Mean', 'StdDev'])
    stats_table = stats_table.set_index('Field', drop=True)
    
    return stats_table

You **must** provide the following calls to your function to compute statistics for each provided data file in turn and a suitable print statement to visualise your data structure with the results:

In [409]:
data_structure_statistics_rainfall = calculate_statistics('rainfall-1617.csv')
data_structure_statistics_rainfall 

| Field | Min | Max | Mean | StdDev |
|---|---|---|---|---|
| mm | 0.0 | 23.2 | 1.549 | 3.325 |


In [410]:
data_structure_statistics_barometer = calculate_statistics('barometer-1617.csv')
data_structure_statistics_barometer 

| Field | Min | Max | Mean | StdDev |
|---|---|---|---|---|
| Baro | 979.6 | 1035.6 | 1009.999 | 9.87 |


In [411]:
data_structure_statistics_temp1 = calculate_statistics('indoor-temperature-1617.csv')
data_structure_statistics_temp1 

| Field | Min | Max | Mean | StdDev |
|---|---|---|---|---|
| Humidity | 37.0 | 59.0 | 48.52 | 5.189 |
| Temperature | 18.04 | 29.21 | 21.828 | 2.058 |
| Temperature_range (low) | 14.9 | 28.2 | 20.556 | 2.405 |
| Temperature_range (high) | 19.7 | 31.1 | 23.534 | 1.701 |


In [412]:
data_structure_statistics_temp2 = calculate_statistics('outside-temperature-1617.csv')
data_structure_statistics_temp2 

| Field | Min | Max | Mean | StdDev |
|---|---|---|---|---|
| Temperature | -1.81 | 26.38 | 11.139 | 5.355 |
| Temperature_range (low) | -4.1 | 18.7 | 7.866 | 4.879 |
| Temperature_range (high) | 1.5 | 38.5 | 15.524 | 7.034 |


### 2b

**Has one of the CSV files been modified to include some plausible
incorrect data, and does the rationale in the document support the
choice?**

**I introduced a systematic bias into the indoor temperature data: 2 degrees are added to every reading in the `Temperature` column. This is a plausible form of data corruption because a temperature sensor can drift out of calibration as it degrades.**
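
A constant offset of this kind can be applied with a few lines of plain string handling. The following is a minimal sketch, not the code used to create `modified.csv`: the helper name `add_bias` is hypothetical, the header format of the real file is an assumption, and the helper only handles simple comma-separated fields without quoted commas. The sample rows follow the shape of the indoor temperature output shown in Task 1b.

```python
def add_bias(lines, column, bias):
    """Return CSV lines with `bias` added to every value in `column`.

    Hypothetical helper; assumes simple comma-separated fields with no quoted commas.
    """
    header = [name.strip('"') for name in lines[0].split(',')]
    col = header.index(column)
    biased = [lines[0]]  # the header line is kept unchanged
    for line in lines[1:]:
        fields = line.split(',')
        # Round to 2 decimals to avoid floating-point noise in the written value
        fields[col] = str(round(float(fields[col]) + bias, 2))
        biased.append(','.join(fields))
    return biased


# Sample rows for illustration; the header format is an assumption.
sample = [
    '"DateTime","Humidity","Temperature"',
    '2016-10-09,54,21.93',
    '2016-10-10,52,21.77',
]
modified = add_bias(sample, "Temperature", 2.0)
print('\n'.join(modified))
```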

### 2c

**Does the program correctly output the summary statistics for the new
CSV file?**


As usual, include the relevant function calls here:

In [416]:
data_structure_statistics_modified = calculate_statistics('modified.csv')
display(data_structure_statistics_modified)

| Field | Min | Max | Mean | StdDev |
|---|---|---|---|---|
| Humidity | 37.0 | 59.0 | 48.52 | 5.189 |
| Temperature | 20.04 | 31.21 | 23.828 | 2.058 |
| Temperature_range (low) | 14.9 | 28.2 | 20.556 | 2.405 |
| Temperature_range (high) | 19.7 | 31.1 | 23.534 | 1.701 |


### 2d

**Does the jupyter notebook present a discussion of the differences between
  the statistics, and give a convincing analysis as to whether they would be sufficient to identify the incorrect data?**

In [422]:
difference_df = data_structure_statistics_modified - data_structure_statistics_temp1
difference_df

| Field | Min | Max | Mean | StdDev |
|---|---|---|---|---|
| Humidity | 0.0 | 0.0 | 0.0 | 0.0 |
| Temperature | 2.0 | 2.0 | 2.0 | 0.0 |
| Temperature_range (low) | 0.0 | 0.0 | 0.0 | 0.0 |
| Temperature_range (high) | 0.0 | 0.0 | 0.0 | 0.0 |


After the bias was applied, the minimum and maximum of `Temperature` both increased by exactly 2 degrees. This directly reflects the constant offset added to every reading: the whole distribution is shifted, so its extremes shift by the same amount.

The mean also increased by exactly 2 degrees. Adding a constant to every value raises the mean by that constant, so the change in the mean mirrors the applied bias.

The standard deviation is unchanged. This is expected: standard deviation measures the dispersion of the data around the mean, and adding a constant to every value shifts the data without changing its spread. An unchanged standard deviation together with an identical shift in Min, Max, and Mean points to a systematic offset rather than random noise or natural variation.

Taken together, these statistics are sufficient to identify the incorrect data when the statistics of the original file are available for comparison: an identical 2-degree increase in Min, Max, and Mean, with an unchanged standard deviation, is the signature of a constant offset added to every temperature reading. The modified statistics also hint at the problem on their own: the maximum `Temperature` (31.21) is higher than the maximum of `Temperature_range (high)` (31.1), and the mean `Temperature` (23.828) is higher than the mean of `Temperature_range (high)` (23.534), which is not expected if the high end of the range bounds the temperature reading. Basic summary statistics can therefore be effective at detecting corruption caused by a systematic bias.
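
The effect of a constant offset on the four statistics can be checked on a small example. The following is a minimal sketch using made-up readings and Python's `statistics` module rather than the assignment data; the variable names are illustrative only.

```python
import statistics

readings = [18.0, 20.5, 21.0, 23.5, 29.0]  # made-up values for illustration
bias = 2.0
shifted = [x + bias for x in readings]


def summary(values):
    """Return (min, max, mean, sample standard deviation)."""
    return (min(values), max(values),
            statistics.mean(values), statistics.stdev(values))


before = summary(readings)
after = summary(shifted)

# Difference per statistic, rounded to hide floating-point noise
differences = [round(a - b, 9) for a, b in zip(after, before)]
print(differences)
```

The first three differences equal the bias and the last is zero, which matches the pattern in the `Temperature` row of the difference table.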