# Dealing with large datasets


In [None]:
import pandas as pd
from datetime import datetime

## Download and process static data
This section of the notebook:
1. downloads static data sources,
2. loads them into memory,
3. cleans the data and saves a cleaned data product that can be used later for analysis.

### Download files from an online host
Including this step in the pipeline ensures that the data processing steps are reproducible. 
It also avoids the severe headache that comes from trying to share a particular cleaned data product.

The script below will:
1. Check if the data directory exists and clear it if it does.
2. Create a new data directory.
3. Download a list of specified data files from an online source into the directory.

This approach ensures that everyone working on this project has access to the same data in its original form, 
facilitating consistent results across different environments.

The first code cell uses the `%%bash` cell magic, which tells the notebook to run the cell's contents as a Bash script.
Bash is simple and efficient for straightforward file operations such as downloading data. 
Bash scripts are concise, fast for file system operations, and integrate seamlessly with the Unix/Linux environment, which makes them a good fit for basic tasks like creating directories or downloading files.

In [None]:
%%bash
# The %%bash cell magic must be the first line of the cell

# Define the directory where data will be stored
DATA_DIR="../data/camels/"

# Remove the data directory if it already exists to ensure a fresh start
if [ -d "$DATA_DIR" ]; then rm -rf "$DATA_DIR"; fi

# Create a new data directory (and any missing parent directories)
mkdir -p "$DATA_DIR"

# List of filenames to be downloaded
filenames=(camels_clim.txt camels_geol.txt camels_hydro.txt camels_name.txt camels_soil.txt camels_topo.txt camels_vege.txt)

# Loop through each file and download it to the data directory
for filename in "${filenames[@]}"
do 
    wget -O "${DATA_DIR}${filename}" "https://gdex.ucar.edu/dataset/camels/file/${filename}"
done
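
For readers working where Bash is not available, the same three steps can be sketched in Python using only the standard library. This is a minimal sketch, not part of the original pipeline: the function names `reset_data_dir` and `download_files` are hypothetical, and the URL and filenames are copied from the Bash cell above.

```python
import shutil
import urllib.request
from pathlib import Path

# Base URL and filenames copied from the Bash cell above
BASE_URL = "https://gdex.ucar.edu/dataset/camels/file/"
FILENAMES = [
    "camels_clim.txt", "camels_geol.txt", "camels_hydro.txt",
    "camels_name.txt", "camels_soil.txt", "camels_topo.txt", "camels_vege.txt",
]


def reset_data_dir(data_dir):
    """Delete data_dir if it exists, then recreate it empty (hypothetical helper)."""
    data_dir = Path(data_dir)
    if data_dir.is_dir():
        shutil.rmtree(data_dir)
    # parents=True also creates missing parent directories, like `mkdir -p`
    data_dir.mkdir(parents=True)
    return data_dir


def download_files(data_dir, filenames=FILENAMES, base_url=BASE_URL):
    """Download each file into data_dir; requires network access (hypothetical helper)."""
    data_dir = Path(data_dir)
    for filename in filenames:
        urllib.request.urlretrieve(base_url + filename, str(data_dir / filename))
```

Calling `download_files(reset_data_dir("../data/camels/"))` would then mirror the Bash cell: clear the directory, recreate it, and fetch every file.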

### Share the Processing Code, Not Just the Processed Data

Sharing the code used for data processing, rather than just the processed data, is crucial for ensuring reproducibility, especially in data science. While this approach might become challenging with very large datasets, it's essential for maintaining transparency and allowing others to understand and replicate your workflow. By sharing code, you provide insights into how raw data is transformed, cleaned, and made ready for analysis, which is invaluable for collaborative projects and scientific research.


In [None]:
# List of filenames to be loaded
filenames = ["camels_clim.txt", "camels_geol.txt", "camels_hydro.txt",
             "camels_name.txt", "camels_soil.txt", "camels_topo.txt", "camels_vege.txt"]

# Dictionary to store DataFrames for each file
dfs = {}

# Loop through each file, read it into a DataFrame, and store it in the dictionary
for filename in filenames:
    with open(f"../data/camels/{filename}", "r") as f:
        # Read the file using pandas, with ';' as the separator and 'gauge_id' as the index column
        dfs[filename] = pd.read_csv(f, sep=";", index_col="gauge_id")

# Concatenate all DataFrames along the columns
df = pd.concat([dfs[filename] for filename in filenames], axis=1)
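
To see what the column-wise concatenation does, the toy example below builds two tiny tables in the same `;`-separated layout and combines them. The values and column names are made up for illustration and are not taken from the CAMELS files.

```python
import io

import pandas as pd

# Two tiny, made-up tables that share a 'gauge_id' index (illustrative values only)
table_a_txt = "gauge_id;attr_a\n1001;3.1\n1002;2.4\n1003;4.0\n"
table_b_txt = "gauge_id;attr_b\n1003;250.0\n1001;92.7\n"

table_a = pd.read_csv(io.StringIO(table_a_txt), sep=";", index_col="gauge_id")
table_b = pd.read_csv(io.StringIO(table_b_txt), sep=";", index_col="gauge_id")

# axis=1 places the tables side by side and matches rows by index label,
# not by row position; labels missing from one table become NaN
toy = pd.concat([table_a, table_b], axis=1)
print(toy)
```

Because rows are aligned on `gauge_id`, the different row order in the second table does not matter, and gauge 1002, which is absent from the second table, receives a NaN in `attr_b`. This is one way NaN values can enter the combined DataFrame.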

### Handling Text Data and NaN Values

Once we consolidate all our data into a single DataFrame, we often encounter text data and NaN (Not a Number) values. Depending on our analysis goals, these may not be useful. For our current example, we require a dataset with complete information, meaning no missing values (NaNs), and our analysis will focus on continuous numerical data. Therefore, it's important to identify and appropriately handle text and NaN values to prepare our dataset for further analysis.

Let's start by taking a preliminary look at our DataFrame to understand its structure and the nature of the data it contains.


In [None]:
# Display the first few rows of the DataFrame to inspect its contents
df.head()
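
Beyond `head()`, a quick summary of missing values and column types can help decide what needs cleaning. The sketch below uses a small made-up DataFrame (the column names are hypothetical) rather than the CAMELS data, but the same calls apply to `df`.

```python
import numpy as np
import pandas as pd

# Made-up example data: one complete numeric column, one with a gap, one text column
toy = pd.DataFrame(
    {
        "attr_a": [1.0, 2.0, 3.0],
        "attr_b": [0.5, np.nan, 1.5],
        "attr_text": ["x", "y", "z"],
    },
    index=pd.Index([1, 2, 3], name="gauge_id"),
)

# Count the missing values in each column
nan_counts = toy.isna().sum()

# List the columns that pandas does not store as a numeric type
non_numeric = [
    column for column in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[column])
]

print(nan_counts[nan_counts > 0])
print(non_numeric)
```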

### Cleaning the Data for Analysis

In data analysis, the treatment of NaN values depends on the context and objectives of your study. NaNs can be acceptable or even meaningful in some scenarios and unsuitable in others. In our hypothetical analysis, we require a dataset without any missing values. Therefore, we will remove every column that contains at least one NaN value, so that the remaining dataset is complete and ready for analysis.

This step is crucial for maintaining the integrity and reliability of our analysis, as missing data can lead to biased or inaccurate results.


In [None]:
# Remove columns with any NaN values from the DataFrame
df = df.dropna(axis=1)
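
Dropping columns is only one way to get a complete dataset; dropping rows is the other. The toy comparison below, on made-up data, shows how the `axis` argument changes what is lost. Which one is appropriate depends on the analysis, so treat this as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up data: one complete column and one column with a single gap
toy = pd.DataFrame({"complete": [1.0, 2.0, 3.0], "one_gap": [1.0, np.nan, 3.0]})

by_column = toy.dropna(axis=1)  # removes columns with any NaN, keeps every row
by_row = toy.dropna(axis=0)     # removes rows with any NaN, keeps every column

# Report what each choice removed
dropped_columns = sorted(set(toy.columns) - set(by_column.columns))
print("columns dropped:", dropped_columns)
print("rows dropped:", len(toy) - len(by_row))
```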

### Further Data Cleaning: Removing Text (Categorical) Columns

In many analytical scenarios, categorical data represented as strings can be quite useful, often after it has been one-hot encoded. However, for our specific analysis, we need a dataset consisting solely of numerical values. Therefore, we will identify and remove columns that contain string data, which often represent categorical variables.

This step is crucial for aligning our dataset with the requirements of our analysis, ensuring that the data is in the correct format for the statistical or machine learning methods we plan to apply.


In [None]:
# Initialize a list to hold the names of columns to be dropped
drop_these_columns = []

# Iterate over each column in the DataFrame
for camels_data_column in df.columns:
    # Check if the first value in the column is a string.
    # This assumes every value in a column has the same type.
    if isinstance(df[camels_data_column].values[0], str):
        # If it is a string, add the column name to the list
        drop_these_columns.append(camels_data_column)

# Drop the identified columns from the DataFrame
df = df.drop(drop_these_columns, axis=1)
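
As an alternative to looping over the columns, pandas can select columns by their data type. The sketch below shows `select_dtypes` on a small made-up DataFrame; it is a swapped-in technique, not the approach used in the cell above, and it looks at the type of the whole column rather than at the first value only.

```python
import pandas as pd

# Made-up data: two numeric columns and one text column
toy = pd.DataFrame({"area": [10.5, 20.1], "count": [3, 7], "label": ["a", "b"]})

# Keep only the columns with a numeric dtype (integers and floats)
numeric_only = toy.select_dtypes(include="number")
print(numeric_only.columns.tolist())
```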


### Saving the Cleaned Data with a Unique Filename

When sharing and managing data files, especially in a collaborative environment, it's good to avoid confusion caused by multiple versions of the same file. To prevent issues related to version control and ensure traceability, we'll save our cleaned dataset with a unique and descriptive filename. This filename will include the current date and time, along with the initials of the person who processed the data. Such a naming convention makes it easier to track changes over time and understand the lineage of the dataset.


In [None]:
# Generate a timestamp string for the current date and time
nowstring = datetime.today().strftime("%d-%m-%Y_%H%M")

# Initials of the data processor (change as needed)
creator_initials = "jf"

# Save the DataFrame to a CSV file with a unique name
df.to_csv(f"../data/camels/camels_attributes_cleaned_{nowstring}_{creator_initials}.csv")
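
One possible variation on this naming scheme is to put the year first in the timestamp, so that filenames sort chronologically in a directory listing. The helper below is a hypothetical sketch of that idea; the function name `cleaned_filename` and the year-first format are assumptions, not part of the original notebook.

```python
from datetime import datetime


def cleaned_filename(creator_initials, when=None):
    """Build a filename with a year-first timestamp (hypothetical helper)."""
    # Fall back to the current time when no timestamp is supplied
    when = when or datetime.now()
    stamp = when.strftime("%Y-%m-%d_%H%M")
    return f"camels_attributes_cleaned_{stamp}_{creator_initials}.csv"


# Example with a fixed, made-up timestamp
example_name = cleaned_filename("jf", datetime(2024, 3, 5, 14, 30))
print(example_name)
```

With year-first timestamps, sorting the filenames alphabetically also sorts them by date, which is not the case for the day-first format.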

## TODO:

Implement the following sections:
 - Techniques for handling large datasets (e.g., chunking, streaming).
 - Data cleaning and preprocessing.
 - Efficient data storage formats (like HDF5, Parquet).
 - Use of databases for large datasets (SQL and NoSQL).
 - Parallel processing and distributed computing basics (e.g., using Dask, Spark).

Practical activities:
 - Demonstration of different data storage formats and their advantages.