<a href="https://colab.research.google.com/github/TheNeuvillette/Data-Science-Fundamentals-DCBP/blob/main/TheNeuvillette_CodingTask1_V01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Data Science Fundamentals for DCBP, by TheNeuvilette

## Task summary (maximum 12.5 points)

- **CT-1.1** Write a method (function) which removes the header information in the datafile and saves that information into a separate textfile. [0.5 points]
- **CT-1.2** Write a function which reduces the data resolution by merging/averaging columns such that there is only one column per 1 nm. [2.0 points]  
- **CT-1.3** Now generalize the above merging so that it works for any number of nm per column. [2.0 points]
- **CT-1.4** Implement the saving of the reduced dataset to a file. Do this (1) by writing an explicit loop (write line by line) and (2) by using pandas methods. Measure the running times of the two approaches. [2.0 points]
- **CT-1.5** Write a function which takes two wavelengths as input and plots the difference of the data at these lengths over time. [2.0 points]
- **CT-1.6** Let the x-axis be in seconds, label the axes with names and units, label the plot with color and legend, make a title for the plot. Save the plot to a file. [2.0 points]
- **CT-1.7** Plot the full and some reduced datasets in the same figure. Is there any visual difference? [2.0 points]

## Coding Task 1:



**Pre-CT:** Importing all libraries needed during the coding task.



In [None]:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt



**Pre-CT:** Mounting Google Drive in Google Colab.



In [None]:
from google.colab import drive
drive.mount('/content/drive')



**Pre-CT:** Defining the paths to all files used.



In [None]:
input_file_path = '/content/drive/MyDrive/Data_Science/CCD-Data.csv'
output_file_path = '/content/drive/MyDrive/Data_Science/CCD-Data_without_header.csv'
header_file_path = '/content/drive/MyDrive/Data_Science/CCD-Header.txt'
file_path_loop = "/content/drive/MyDrive/Data_Science/Loop_Red_CCD-Data.csv"
file_path_pandas = "/content/drive/MyDrive/Data_Science/Pandas_Red_CCD-Data.csv"

**Pre-CT:** Loading the raw CCD data into a DataFrame that later cells reuse.

In [None]:
df = pd.read_csv(input_file_path,delimiter=',')

*   **CT-1.1**: Write a method (function) which removes the header information in the datafile and saves that information into a separate textfile. [0.5 points]



In [None]:
#1.1

def CCD_remove_header(input_file_path, output_file_path, header_file_path):
    """
    Takes a csv-file (CCD-Data.csv) and creates two new files:
    One file containing the header (CCD-Header.txt) and one containing the data (CCD-Data_without_header.csv).
    The original csv-file (CCD-Data.csv) is not modified or deleted.

    Args:
        input_file_path: Path to the csv-file (CCD-Data.csv).
        output_file_path: Path of the newly created data-file (CCD-Data_without_header.csv)
        header_file_path: Path of the newly created header-file (CCD-Header.txt)
    """
    # Identify CCD header and data.
    with open(input_file_path, 'r') as input_file:
        lines = input_file.readlines()
    header = lines.pop(0)

    # Save the CCD data in a new CCD file
    with open(output_file_path, 'w') as output_file:
        output_file.writelines(lines)

    # Save the CCD header to a separate text file
    with open(header_file_path, 'w') as header_file:
        header_file.write(header)

CCD_remove_header(input_file_path, output_file_path, header_file_path)
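
The function above reads the whole file into memory before splitting it. For larger files, a streaming variant is one option. The following is only a sketch: the function name `split_header_streaming`, the parameter `n_header_lines`, and the demo file contents are made up for illustration and are not part of the task or the real CCD file.

```python
import os
import tempfile


def split_header_streaming(input_path, data_path, header_path, n_header_lines=1):
    """Copy the first n_header_lines to header_path and all other lines to data_path."""
    with open(input_path, "r") as src, \
         open(header_path, "w") as header_out, \
         open(data_path, "w") as data_out:
        for line_number, line in enumerate(src):
            # Route each line to the header file or the data file
            target = header_out if line_number < n_header_lines else data_out
            target.write(line)


# Self-contained check on made-up content (not the real CCD file)
tmp_dir = tempfile.mkdtemp()
demo_in = os.path.join(tmp_dir, "demo.csv")
demo_data = os.path.join(tmp_dir, "demo_data.csv")
demo_header = os.path.join(tmp_dir, "demo_header.txt")

with open(demo_in, "w") as f:
    f.write("time,316.2,316.7\n0,1.0,2.0\n1,3.0,4.0\n")

split_header_streaming(demo_in, demo_data, demo_header)

with open(demo_header) as f:
    header_text = f.read()
with open(demo_data) as f:
    data_text = f.read()
```

As in `CCD_remove_header`, the input file is only read, never modified.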

*   **CT-1.2**: Write a function which reduces the data resolution by merging/averaging columns such that there is only one column per 1 nm. [2.0 points]  




In [None]:
#1.2

def Reduce_resolution_to_1_nm(df):
    """
    Reduces the data resolution by merging/averaging columns such that there is only one column per 1 nm
    Args:
        df (pd.DataFrame): Input DataFrame.
    Returns:
        pd.DataFrame: A new DataFrame containing the averaged values.
    """

    # Select columns from the 4th to the second-to-last column
    selected_columns = df.iloc[:, 3:-1]

    # Convert column names to integers
    new_columns = {col: math.floor(float(col)) for col in selected_columns.columns}
    selected_columns = selected_columns.rename(columns=new_columns)

    # Calculate mean value for each group of columns with the same integer name
    # (transposing first, because groupby(axis=1) is deprecated in recent pandas versions)
    df_red = selected_columns.T.groupby(level=0).mean().T

    return df_red

df = pd.read_csv(input_file_path,delimiter=',')
df_red = Reduce_resolution_to_1_nm(df)
df_red.head()
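
To see what the floor-and-average step does, it can help to run it on a tiny frame. The sketch below uses made-up column names and values, and it assumes the same layout as the code above (three leading non-wavelength columns and one trailing column). It groups on the transposed frame because `groupby(axis=1)` is deprecated in recent pandas versions.

```python
import math

import pandas as pd

# Made-up frame: three leading columns, four wavelength columns, one trailing column
demo = pd.DataFrame({
    "time": [0, 1],
    "meta_a": [0, 0],
    "meta_b": [0, 0],
    "316.2": [1.0, 3.0],
    "316.8": [3.0, 5.0],
    "317.1": [10.0, 20.0],
    "317.9": [20.0, 40.0],
    "trailing": [0, 0],
})

wavelengths = demo.iloc[:, 3:-1].copy()
# Map every column label to its integer nanometre
wavelengths.columns = [math.floor(float(c)) for c in wavelengths.columns]
# Average all columns that now share the same label
demo_red = wavelengths.T.groupby(level=0).mean().T
```

Here the columns `316.2` and `316.8` collapse into a single column `316`, and `317.1` and `317.9` into `317`.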

- **CT-1.3**: Now generalize the above merging so that it works for any number of nm per column. [2.0 points]

In [None]:
#1.3

def Reduce_resolution_advanced(df, start_wavelength=316, nm_steps=1):
    """
    Reduces the data resolution by merging/averaging columns in a user-defined way.
    Args:
        df (pd.DataFrame): Input DataFrame.
        start_wavelength: The wavelength (in nm) at which the first merged bin starts.
        nm_steps: Width of each merged bin in nm, i.e. the difference between two consecutive output columns.
    Returns:
        pd.DataFrame: A new DataFrame containing the averaged values.
    """

    # Reduce columns to 1 nm (same as the function "Reduce_resolution_to_1_nm").
    selected_columns = df.iloc[:, 3:-1]
    new_columns = {col: math.floor(float(col)) for col in selected_columns.columns}
    selected_columns = selected_columns.rename(columns=new_columns)
    # Group on the transposed frame, because groupby(axis=1) is deprecated in recent pandas versions.
    selected_columns = selected_columns.T.groupby(level=0).mean().T

    # Merge nm_steps neighbouring 1 nm columns into one column
    new_columns = {col: (float(col)-start_wavelength)//nm_steps for col in selected_columns.columns}
    selected_columns = selected_columns.rename(columns=new_columns)
    selected_columns = selected_columns.T.groupby(level=0).mean().T

    # Relabel each merged column with the lower edge of its bin
    new_columns = {col: int((float(col)*nm_steps+start_wavelength)) for col in selected_columns.columns}
    selected_columns = selected_columns.rename(columns=new_columns)
    return selected_columns

df_red = Reduce_resolution_advanced(df, 317, 3)
df_red.head()
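
The relabelling arithmetic in the function above can be checked in isolation. The helper `bin_labels` below is a hypothetical name introduced only for this sketch; it applies the same floor division and back-conversion to a list of wavelengths.

```python
import numpy as np


def bin_labels(wavelengths, start_wavelength=316, nm_steps=1):
    """Return, for each wavelength, the lower edge of the bin it falls into."""
    wavelengths = np.asarray(wavelengths, dtype=float)
    # Same idea as (col - start_wavelength) // nm_steps in the function above
    bin_index = np.floor((wavelengths - start_wavelength) / nm_steps)
    # Convert the bin index back to a wavelength label
    return (bin_index * nm_steps + start_wavelength).astype(int)


labels = bin_labels([317, 318, 319, 320, 321, 322, 323],
                    start_wavelength=317, nm_steps=3)
below_start = bin_labels([316], start_wavelength=317, nm_steps=3)
```

With `start_wavelength=317` and `nm_steps=3`, the wavelengths 317 to 319 share the label 317, 320 to 322 share 320, and 323 opens a new bin. Because floor division rounds towards negative infinity, a wavelength below the start value (316 here) is not dropped but lands in a bin labelled 314.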

*  **CT-1.4**: Implement the saving of the reduced dataset to a file. Do this (1) by writing an explicit loop (write line by line) and (2) by using pandas methods. Measure the running times of the two approaches. [2.0 points]


In [None]:
%%time
# Important: CT-1.2 or CT-1.3 must be run first so that df_red exists.
# Method 1: By using explicit loop:
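with open(file_path_loop, 'w') as file:
    # Write the column labels first so the file matches the pandas output
    file.write(','.join(map(str, df_red.columns)) + '\n')
    for index, row in df_red.iterrows():
        file.write(','.join(map(str, row.values)) + '\n')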

print(f"Data saved to {file_path_loop} using explicit loop.")

In [None]:
%%time
# Method 2: By using pandas:
df_red.to_csv(file_path_pandas, index=False)
print(f"Data saved to {file_path_pandas} using Pandas method.")
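
The `%%time` cell magic only works inside a notebook. Outside of one, the two approaches could be timed with `time.perf_counter`, roughly as sketched below. The small frame `demo_red` is made-up data standing in for `df_red`, and the temporary file paths are placeholders.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Made-up data standing in for df_red
demo_red = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=[316, 317, 318])

tmp_dir = tempfile.mkdtemp()
loop_path = os.path.join(tmp_dir, "loop.csv")
pandas_path = os.path.join(tmp_dir, "pandas.csv")

# Method 1: explicit loop, writing the header row first so both files match
start = time.perf_counter()
with open(loop_path, "w") as f:
    f.write(",".join(map(str, demo_red.columns)) + "\n")
    for row in demo_red.itertuples(index=False):
        f.write(",".join(map(str, row)) + "\n")
loop_seconds = time.perf_counter() - start

# Method 2: pandas
start = time.perf_counter()
demo_red.to_csv(pandas_path, index=False)
pandas_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.6f} s, pandas: {pandas_seconds:.6f} s")

# Read both files back to confirm they hold the same table
loop_df = pd.read_csv(loop_path)
pandas_df = pd.read_csv(pandas_path)
```

On a frame this small the measured times are dominated by overhead, so a meaningful comparison needs the real reduced dataset.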

* **CT-1.5:** Write a function which takes two wavelengths as input and plots the difference of the data at these lengths over time. [2.0 points]

In [None]:
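def plot_wavelength_difference(df, a, b):
    """Plot df column a minus column b against the first column (time).

    Note: a and b are positional column indices, not wavelength values.
    """
    x = df.iloc[:, 0]
    y = df.iloc[:, a] - df.iloc[:, b]
    plt.plot(x, y)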

plot_wavelength_difference(df, 1000, 1950)
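
The function above selects columns by position. If the inputs should be actual wavelengths in nm, as the task wording suggests, one possible approach is to look up the column whose label is numerically closest to the requested wavelength. The sketch below assumes the wavelength columns sit between the third and the last column, as in CT-1.2; the helper names and the demo data are made up.

```python
import numpy as np
import pandas as pd


def nearest_wavelength_column(df, wavelength, first_col=3, last_col=-1):
    """Return the label of the column whose wavelength is closest to `wavelength`."""
    labels = df.columns[first_col:last_col]
    values = np.array([float(label) for label in labels])
    return labels[np.argmin(np.abs(values - wavelength))]


def wavelength_difference(df, wavelength_a, wavelength_b):
    """Return df[a] - df[b] for the columns nearest to the two wavelengths."""
    col_a = nearest_wavelength_column(df, wavelength_a)
    col_b = nearest_wavelength_column(df, wavelength_b)
    return df[col_a] - df[col_b]


# Made-up frame with the assumed layout
demo = pd.DataFrame({
    "time": [0, 1, 2], "meta_a": 0, "meta_b": 0,
    "316.2": [1.0, 2.0, 3.0], "400.5": [0.5, 0.5, 0.5], "trailing": 0,
})
diff = wavelength_difference(demo, 316, 400)
```

The returned series can then be passed to `plt.plot` together with the time column.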

- **CT-1.6:** Let the x-axis be in seconds, label the axes with names and units, label the plot with color and legend, make a title for the plot. Save the plot to a file. [2.0 points]


In [None]:
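def plot_wavelength_difference_advanced(df, a, b):
    # a and b are positional column indices; column 0 is assumed to hold the time in seconds
    x = df.iloc[:, 0]
    y = df.iloc[:, a] - df.iloc[:, b]
    plt.plot(x, y, c="g", label=f"column {a} - column {b}")
    plt.xlabel('Time [s]')
    # Unit is an assumption: [nm] is a wavelength unit, not an intensity unit
    plt.ylabel('Difference of intensity [a.u.]')
    plt.title('Difference of Data at Two Wavelengths Over Time')
    plt.grid(True)
    plt.legend()
    plt.show()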

plot_wavelength_difference_advanced(df, 1000, 1950)
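
The task also asks for the plot to be saved to a file. A minimal sketch of that step is shown below, using made-up data and a temporary output path (both are placeholders). The `matplotlib.use("Agg")` line is only there so the sketch runs without a display; in Colab it can be left out.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up values standing in for the time column and the intensity difference
time_s = np.arange(5)
difference = np.array([0.0, 1.0, 0.5, 1.5, 1.0])

fig, ax = plt.subplots()
ax.plot(time_s, difference, color="g", label="made-up difference")
ax.set_xlabel("Time [s]")
ax.set_ylabel("Intensity difference")
ax.set_title("Difference of Data at Two Wavelengths Over Time")
ax.legend()

# Save before calling plt.show(), then release the figure
plot_path = os.path.join(tempfile.mkdtemp(), "difference.png")
fig.savefig(plot_path, dpi=150)
plt.close(fig)
```

Saving through the figure object before any `plt.show()` call avoids writing an empty image, which can happen once the figure has been shown and closed.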

- **CT-1.7:** Plot the full and some reduced datasets in the same figure. Is there any visual difference? [2.0 points]
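
One way to approach this is to draw both versions on the same axes. The sketch below does not use the CCD data: it builds a made-up noisy spectrum with four samples per nm, averages every four neighbouring points, and plots both curves. All names and values are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; can be left out in Colab
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
wavelengths = np.arange(316.0, 326.0, 0.25)  # 40 made-up wavelength samples
spectrum = np.sin(wavelengths) + 0.1 * rng.standard_normal(wavelengths.size)

# Average every 4 neighbouring points, giving one value per 1 nm
reduced_wavelengths = wavelengths.reshape(-1, 4).mean(axis=1)
reduced_spectrum = spectrum.reshape(-1, 4).mean(axis=1)

fig, ax = plt.subplots()
ax.plot(wavelengths, spectrum, label="full resolution")
ax.plot(reduced_wavelengths, reduced_spectrum, marker="o", label="reduced to 1 nm")
ax.set_xlabel("Wavelength [nm]")
ax.set_ylabel("Intensity")
ax.legend()
```

With the real data, the same pattern would apply to `df` and the outputs of `Reduce_resolution_advanced` for a few values of `nm_steps`.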