# **Data Collection**

## Objective
The primary objective of this task is to collect and process spectral data from one or more CSV files, consolidate it into a single dataframe, inspect the data for accuracy and integrity, and then save the consolidated dataset into a new CSV file.

## Inputs
- Raw CSV files containing spectral data, named for instance "spectrum1.csv", "spectrum2.csv", and so on. The data are spread across separate files, so they need to be consolidated before analysis.
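
As a rough illustration of the expected input, the sketch below builds one such file in memory and reads it with pandas. The layout (a header row of channel labels followed by one row of intensities, separated by semicolons) is an assumption inferred from the consolidated output later in this notebook, not a documented specification.

```python
import io

import pandas as pd

# Hypothetical contents of "spectrum1.csv": the layout is assumed, not documented
raw_text = "1;2;3;4;5;6\n5;10;20;30;40;50\n"

# Semicolon-delimited, first row used as column headers
spectrum = pd.read_csv(io.StringIO(raw_text), sep=";")
print(spectrum)
```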

## Outputs
1. **Consolidated Dataframe**: A single dataframe that combines and processes the spectral data from the input CSV files.
2. **CSV File**: A CSV file containing the consolidated and processed data.

---

# Install Python packages in the notebook

In [1]:
pip install -r /workspaces/Spectral-data/requirements.txt

Note: you may need to restart the kernel to use updated packages.
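
After installing, it can be useful to confirm that the packages the notebook relies on are importable before running the remaining cells. The following is a minimal sketch; `missing_packages` is a hypothetical helper, and the list of names passed to it is an assumption about what requirements.txt contains.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# "pandas" is the only third-party import used later in this notebook
print(missing_packages(["pandas"]))
```

An empty list means every listed package was found; any name that is printed still needs to be installed.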


---

# Change working directory

* We assume the notebooks are stored in a subfolder of the repository, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/Spectral-data'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspaces'
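
In the run recorded above, the notebook was already started from the repository root, so moving to the parent directory ended in /workspaces rather than in the project folder. One way to make this step independent of where the notebook is launched is to walk upwards until a known marker file is found. The sketch below is an alternative under stated assumptions, not the notebook's method: `find_project_root` is a hypothetical helper, and using requirements.txt as the marker assumes that file sits at the repository root, as the install cell suggests.

```python
from pathlib import Path


def find_project_root(start, marker="requirements.txt"):
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} at or above {start}")


try:
    root = find_project_root(Path.cwd())
    print(f"Project root: {root}")  # pass this to os.chdir() to switch to it
except FileNotFoundError as error:
    print(error)
```

Because the search starts at the current directory itself, running the cell a second time does not move one level further up.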

---

# Load and Inspect Spectral Data

## Consolidate the raw CSV files

In [5]:
import pandas as pd

# Function to process a single CSV file
def process_csv_file(file_path):
    try:
        # Read the CSV file, using its first row as the column headers.
        # The delimiter here is a semicolon; change `sep` if your files use another one.
        df = pd.read_csv(file_path, sep=';')

        # Add a new column with the file name
        df['sample'] = os.path.basename(file_path)

        return df
    except Exception as e:
        print(f"Error processing {file_path}: {e}")
        return None

# Function to process multiple CSV files
def process_csv_files(directory):
    results = []
    for filename in os.listdir(directory):
        if filename.endswith(".csv"):
            file_path = os.path.join(directory, filename)
            result = process_csv_file(file_path)
            if result is not None:
                results.append(result)
    return results

if __name__ == "__main__":
    # Set the directory where your CSV files are located
    csv_directory = "/workspaces/Spectral-data/inputs/datasets/raw"

    # Process the CSV files in the directory
    processed_dataframes = process_csv_files(csv_directory)

    # Concatenate the processed DataFrames into one final DataFrame
    final_df = pd.concat(processed_dataframes, ignore_index=True)
    
    # Move the "sample" column to the first position
    cols = final_df.columns.tolist()
    cols = ['sample'] + [col for col in cols if col != 'sample']
    final_df = final_df[cols]

    # Display the final DataFrame
    print(final_df)

          sample   1   2   3   4   5   6
0  spectrum2.csv  20  30  40  50  60  70
1  spectrum1.csv   5  10  20  30  40  50
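
The row order above follows os.listdir(), which returns names in arbitrary order, and pd.concat() raises a ValueError when it is given an empty list. A variant that sorts the file names and fails with a clearer message when nothing was loaded might look like the sketch below. It is an illustration under assumptions rather than a replacement for the cell above: `consolidate_spectra` is a hypothetical name, and the files are assumed to be semicolon-delimited with a header row.

```python
from pathlib import Path

import pandas as pd


def consolidate_spectra(directory, sep=";"):
    """Read every .csv file in `directory` and stack them into one DataFrame."""
    frames = []
    for path in sorted(Path(directory).glob("*.csv")):  # sorted for a reproducible row order
        frame = pd.read_csv(path, sep=sep)
        frame.insert(0, "sample", path.name)  # put the source file name in the first column
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"No .csv files found in {directory}")
    return pd.concat(frames, ignore_index=True)
```

Unlike the cell above, this version does not skip unreadable files: a file that fails to parse stops the run instead of being silently left out.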


## Inspect the consolidated dataframe

In [8]:
final_df.head()

          sample   1   2   3   4   5   6
0  spectrum2.csv  20  30  40  50  60  70
1  spectrum1.csv   5  10  20  30  40  50


- Dataframe summary

In [9]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   sample  2 non-null      object
 1   1       2 non-null      int64 
 2   2       2 non-null      int64 
 3   3       2 non-null      int64 
 4   4       2 non-null      int64 
 5   5       2 non-null      int64 
 6   6       2 non-null      int64 
dtypes: int64(6), object(1)
memory usage: 240.0+ bytes
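
The summary shows no missing values and an integer dtype for every channel. For larger collections it may be easier to check these two properties programmatically than to read them off the summary. The sketch below is one possible check; `check_integrity` is a hypothetical helper, and treating every column other than `sample` as a numeric intensity is an assumption about the data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_integrity(df, id_column="sample"):
    """Return (columns with missing values, non-numeric data columns)."""
    data = df.drop(columns=[id_column])
    with_missing = [col for col in data.columns if data[col].isna().any()]
    non_numeric = [col for col in data.columns if not is_numeric_dtype(data[col])]
    return with_missing, non_numeric


# Small stand-in for final_df, using values from the output above
example = pd.DataFrame({"sample": ["spectrum2.csv", "spectrum1.csv"],
                        "1": [20, 5], "2": [30, 10]})
print(check_integrity(example))
```

Two empty lists indicate that every channel is fully populated and numeric.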


- Check for duplicates

Checking for duplicates matters for data quality: a spectrum that was loaded twice would be over-represented in any later analysis. The cell below lists every row that repeats an earlier row, so an empty result means no duplicates were found.

In [10]:
final_df[final_df.duplicated(subset=None, keep='first')]

sample  1  2  3  4  5  6
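
Because `sample` holds the file name, two files with identical intensities still count as different rows in the check above. To look for repeated spectra regardless of which file they came from, the comparison can be restricted to the intensity columns. A minimal sketch, assuming every column except `sample` is an intensity channel and using a small made-up frame rather than the real data:

```python
import pandas as pd

# Made-up example: the first and third rows share the same intensities
spectra = pd.DataFrame({
    "sample": ["a.csv", "b.csv", "c.csv"],
    "1": [5, 20, 5],
    "2": [10, 30, 10],
})

# Compare rows on the intensity columns only, ignoring the file name
intensity_columns = [col for col in spectra.columns if col != "sample"]
repeated = spectra[spectra.duplicated(subset=intensity_columns, keep="first")]
print(repeated)
```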


# Push files to Repo

* The consolidated data is saved as a CSV file in the repository

In [15]:
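import os

# Use absolute paths: the working directory was changed earlier in the notebook,
# so a relative 'outputs/...' path would not resolve inside the repository
output_dir = "/workspaces/Spectral-data/outputs/datasets/collection"
try:
    os.makedirs(output_dir, exist_ok=True)  # create the folder if it does not exist yet
except OSError as e:
    print(e)

final_df.to_csv(os.path.join(output_dir, "spectra.csv"), index=False)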
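
Once the file has been written, reading it back is a cheap way to confirm that nothing was lost on the way to disk. The sketch below shows such a round trip on a stand-in frame and a temporary folder; with the real data, `final_df` and the spectra.csv path from the cell above would take their place.

```python
import os
import tempfile

import pandas as pd

# Stand-in for final_df, using values shown earlier in the notebook
frame = pd.DataFrame({"sample": ["spectrum2.csv", "spectrum1.csv"],
                      "1": [20, 5], "2": [30, 10]})

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "spectra.csv")
    frame.to_csv(csv_path, index=False)  # index=False keeps the row index out of the file
    reloaded = pd.read_csv(csv_path)

# Compare what was written with what was read back
print(list(reloaded.columns) == list(frame.columns))
print(reloaded.shape == frame.shape)
```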
