# Artificial Dataset Generation

## The goal
The goal of this exercise is to work with statistical notions such as mean, standard deviation, and correlation.

Write a file named `artificial_dataset.py` that generates a numerical dataset with 300 datapoints (i.e. rows) and at least 6 columns, and saves it to a CSV file named `artificial_dataset.csv`.
The columns must satisfy the following requirements:
- they must all have a different mean.
- they must all have a different standard deviation (English for "écart type").
- at least one column should contain integers.
- at least one column should contain floats.
- one column must have a mean close to 2.5.
- some columns must be positively correlated.
- some columns must be negatively correlated.
- some columns must have a correlation close to 0.
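
As a quick reminder of the three notions involved, here is a minimal sketch on a hand-made table. The `toy` DataFrame and its values are illustrative assumptions, not part of the exercise. Note that pandas computes the sample standard deviation (`ddof=1`) by default, whereas `np.std` defaults to the population version (`ddof=0`).

```python
import numpy as np
import pandas as pd

# Tiny hand-made table; the values are illustrative only.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [8.0, 6.0, 4.0, 2.0]})

col_means = toy.mean()             # per-column arithmetic mean
sample_std = toy.std()             # pandas default: ddof=1 (sample standard deviation)
population_std = toy.std(ddof=0)   # same convention as np.std's default
corr_ab = toy["a"].corr(toy["b"])  # Pearson correlation; b = 10 - 2a, so it is -1

print(col_means, sample_std, population_std, corr_ab, sep="\n")
```

Because `b` is an exact decreasing linear function of `a`, the correlation sits at the negative extreme of the [-1, 1] range.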


## Import Libraries

First, let's import the necessary libraries for our data generation.

In [9]:
import pandas as pd
import numpy as np

## Data Generation Functions

Below are the functions that will be used to generate correlated data and create our dataset.


In [10]:
def generate_correlated_data(n: int, mean1: float, std1: float, mean2: float, std2: float, correlation: float) -> np.ndarray:
    """
    Generates a sample of n points with a specified correlation.
    
    :param n: Number of datapoints
    :param mean1: Mean of the first variable
    :param std1: Standard deviation of the first variable
    :param mean2: Mean of the second variable
    :param std2: Standard deviation of the second variable
    :param correlation: Target correlation between the two variables
    :return: n x 2 matrix with the correlated variables
    """
    cov_matrix = [[std1**2, correlation*std1*std2], [correlation*std1*std2, std2**2]]
    sample = np.random.multivariate_normal([mean1, mean2], cov_matrix, n)
    return sample
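
The covariance matrix used above relies on the identity cov(X, Y) = correlation × std1 × std2. The following sketch, with illustrative parameter values and a local `default_rng` generator (an assumption of this example; the notebook itself uses `np.random.seed`), checks that a large sample drawn this way has roughly the requested correlation and standard deviations.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator, used only in this sketch

std1, std2, rho = 2.0, 4.0, 0.8
# Diagonal: variances. Off-diagonal: cov(X, Y) = rho * std1 * std2.
cov = np.array([[std1**2, rho * std1 * std2],
                [rho * std1 * std2, std2**2]])

sample = rng.multivariate_normal([5.0, 10.0], cov, size=5000)

empirical_rho = np.corrcoef(sample, rowvar=False)[0, 1]  # correlation between the two columns
empirical_stds = sample.std(axis=0, ddof=1)              # per-column sample standard deviation
print(round(empirical_rho, 2), empirical_stds.round(2))
```

With only 300 points, as in the exercise, the empirical values deviate more from the targets than they do here, which is why the validation later uses tolerances rather than exact comparisons.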

Now, we define the function to create our full dataset.


In [11]:
def create_dataset(n: int) -> pd.DataFrame:
    """
    Creates a dataset with 6 feature columns.

    :param n: Number of datapoints
    :return: Pandas DataFrame with the dataset
    """
    np.random.seed(0)

    means = [1, 2.5, 5, 10, 20, 50]
    stds = [0.5, 1, 2, 4, 8, 16]

    # Create positively correlated data for columns 3 and 4
    columns = [None] * 6  # Initialize list for columns
    columns[2], columns[3] = generate_correlated_data(n, means[2], stds[2], means[3], stds[3], 0.8).T

    # Generate negatively correlated data for columns 2 and 5
    columns[1], columns[4] = generate_correlated_data(n, means[1], stds[1], means[4], stds[4], -0.8).T

    # Generate the rest of the columns with different means and standard deviations
    for i, (mean, std) in enumerate(zip(means, stds)):
        if columns[i] is None:
            columns[i] = np.random.normal(mean, std, n)
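    columns[0] = np.round(columns[0])  # Integer-valued, but still stored as float at this point

    # Create DataFrame; stacking the columns into one array gives every column a float dtype
    df = pd.DataFrame(data=np.array(columns).T, columns=[f'feature_{i}' for i in range(1, 7)])
    df['feature_1'] = df['feature_1'].astype(int)  # Restore an integer dtype so the CSV stores integers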

    # Shuffle the last column until its correlation with every other column is low
    threshold = 0.1
    max_attempts = 1000  # Upper bound on the number of shuffling attempts

    for _ in range(max_attempts):
        # Shuffle a copy and assign it back, so the result does not depend on
        # whether `.values` returns a view of the DataFrame's data
        shuffled = df['feature_6'].to_numpy(copy=True)
        np.random.shuffle(shuffled)
        df['feature_6'] = shuffled
        max_correlation = df.corr().abs().loc['feature_6'].drop('feature_6').max()
        if max_correlation < threshold:
            break
    return df
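
The shuffling step works because permuting a column changes only the order of its values: the mean and standard deviation stay the same, while any relationship with the other columns is broken. A minimal sketch of that idea, using synthetic `x` and `y` arrays that are assumptions of this example:

```python
import numpy as np

rng = np.random.default_rng(1)  # local generator, used only in this sketch
x = rng.normal(0.0, 1.0, 2000)
y = 3.0 * x + rng.normal(0.0, 0.5, 2000)  # strongly correlated with x by construction

y_shuffled = rng.permutation(y)  # same values, new order

corr_before = np.corrcoef(x, y)[0, 1]
corr_after = np.corrcoef(x, y_shuffled)[0, 1]
print(corr_before, corr_after)
print(y.mean(), y_shuffled.mean(), y.std(), y_shuffled.std())
```

A single shuffle usually brings the correlation near zero, but with small samples it can land above a strict threshold by chance, hence the retry loop with a maximum number of attempts.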

## Main Function

The main function orchestrates the data generation process and saves the dataset to a CSV file.


In [12]:
def main():
    """
    Main function to generate and save the dataset.
    """
    n = 300  # number of datapoints
    df = create_dataset(n)
    df.to_csv('artificial_dataset.csv', index=False)
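
Since `to_csv` writes each column according to its dtype, it can help to see what happens to integer data that has been stacked with floats in a single NumPy array. The sketch below uses an in-memory buffer instead of a file; the helper `round_trip` and the column names `i` and `f` are hypothetical names introduced for this illustration.

```python
import io

import numpy as np
import pandas as pd

ints = np.array([1, 2, 3])          # integer dtype on its own
floats = np.array([0.5, 1.5, 2.5])  # float64

# A single ndarray has a single dtype, so stacking upcasts the integers to float64.
stacked = np.array([ints, floats]).T
df_float = pd.DataFrame(stacked, columns=["i", "f"])

df_mixed = df_float.copy()
df_mixed["i"] = df_mixed["i"].astype(int)  # give that column an integer dtype again


def round_trip(df: pd.DataFrame) -> pd.DataFrame:
    """Write df to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    buffer.seek(0)
    return pd.read_csv(buffer)


print(round_trip(df_float).dtypes["i"])  # written as 1.0, 2.0, 3.0 and read back as float
print(round_trip(df_mixed).dtypes["i"])  # written as 1, 2, 3 and read back as integer
```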


## Execute Main

With everything set up, let's execute the main function to generate and save our dataset.


In [13]:
if __name__ == "__main__":
    main()

## Utility Classes and Functions

Before checking the dataset, we define a small utility class.
The `bcolors` class holds ANSI escape codes that let us print colored text in the notebook's output, which makes the results easier to scan.
The functions that check for integer data and print color-coded messages are defined in the next section.


In [14]:
class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKCYAN = '\033[96m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

## Data Validation Functions

The functions below check whether the dataset meets each requirement: distinct means and standard deviations, column data types, and correlations. `check_integers` and `print_colored_message` are small helpers used by `check_dataset_conditions`.


In [15]:
def check_integers(df: pd.DataFrame) -> bool:
    """
    :param df: DataFrame to check
    :return: True if at least one column contains integers, False otherwise
    """
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            return True
        if all(df[col] == df[col].astype(int)):
            return True
    return False

def print_colored_message(condition, true_message, false_message):
    """
    Prints a message in green if condition is True, in red otherwise.

    :param condition: Condition to check
    :param true_message: Message to print if condition is True
    :param false_message: Message to print if condition is False
    """
    if condition:
        print(f"{bcolors.OKGREEN}{true_message}{bcolors.ENDC}")
    else:
        print(f"{bcolors.FAIL}{false_message}{bcolors.ENDC}")

def check_dataset_conditions(file_path: str):
    """
    Checks the conditions of the dataset.
    :param file_path: Path to the dataset
    """
    # Load the dataset
    df = pd.read_csv(file_path)

    # Check for unique means
    means = df.mean()
    print(f"{bcolors.HEADER}\nMeans:\n{bcolors.ENDC}", means)
    all_means_different = len(set(means.round(3))) == len(means)
    print_colored_message(all_means_different, "All columns have different means: True", "All columns have different means: False")

    # Check for unique standard deviations
    stds = df.std()
    print(f"{bcolors.HEADER}\nStandard Deviations:\n{bcolors.ENDC}", stds)
    all_stds_different = len(set(stds.round(3))) == len(stds)
    print_colored_message(all_stds_different, "All columns have different standard deviations: True", "All columns have different standard deviations: False")

    # Check for at least one integer column
    has_integer_column = check_integers(df)
    print_colored_message(has_integer_column, "At least one column contains integers: True", "At least one column contains integers: False")

    # Check for at least one float column
    has_float_column = any(df.dtypes == 'float')
    print_colored_message(has_float_column, "At least one column contains floats: True", "At least one column contains floats: False")

    # Check for a column with mean close to 2.5
    mean_close_to_25 = any(np.isclose(means, 2.5, atol=0.1))
    print_colored_message(mean_close_to_25, "One column has a mean close to 2.5: True", "One column has a mean close to 2.5: False")

    # Check correlations
    corr_matrix = df.corr()
    print(f"{bcolors.HEADER}\nCorrelation Matrix:\n{bcolors.ENDC}", corr_matrix)
    positive_correlations = corr_matrix.values[np.triu_indices_from(corr_matrix.values, 1)].max() > 0.5
    negative_correlations = corr_matrix.values[np.triu_indices_from(corr_matrix.values, 1)].min() < -0.5
    zero_correlations = np.any(np.isclose(corr_matrix.values[np.triu_indices_from(corr_matrix.values, 1)], 0, atol=0.1))

    print_colored_message(positive_correlations, "Some columns are positively correlated: True", "Some columns are positively correlated: False")
    print_colored_message(negative_correlations, "Some columns are negatively correlated: True", "Some columns are negatively correlated: False")
    print_colored_message(zero_correlations, "At least two columns have a correlation close to 0: True", "At least two columns have a correlation close to 0: False")
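
The correlation checks rely on `np.triu_indices_from(..., 1)`, which selects the entries strictly above the diagonal so that each pair of columns is considered once and the diagonal of ones is ignored. A minimal sketch on a hand-written 3 × 3 correlation matrix (the values and the labels `a`, `b`, `c` are illustrative assumptions):

```python
import numpy as np
import pandas as pd

corr = pd.DataFrame(
    [[1.0, 0.8, 0.05],
     [0.8, 1.0, -0.7],
     [0.05, -0.7, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)

# Row and column indices of the entries strictly above the diagonal (k=1).
rows, cols = np.triu_indices_from(corr.values, k=1)

# Map each unordered pair of columns to its correlation.
pairs = {(corr.index[r], corr.columns[c]): corr.values[r, c] for r, c in zip(rows, cols)}
print(pairs)
```

Labeling the pairs this way also makes it possible to report which columns satisfy each condition, not just whether some pair does.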


## Validating the Generated Dataset

With the above utilities and functions, we will now load the dataset and check if it meets the specified conditions.


In [16]:
file_path = 'artificial_dataset.csv'
check_dataset_conditions(file_path)


Means:
 feature_1     0.986667
feature_2     2.427051
feature_3     5.172044
feature_4    10.329018
feature_5    20.493459
feature_6    50.303660
dtype: float64
All columns have different means: True

Standard Deviations:
 feature_1     0.529866
feature_2     0.944037
feature_3     2.019855
feature_4     4.047939
feature_5     7.508285
feature_6    15.967371
dtype: float64
All columns have different standard deviations: True
At least one column contains integers: True
At least one column contains floats: True
One column has a mean close to 2.5: True

Correlation Matrix:
             feature_1  feature_2  feature_3  feature_4  feature_5  feature_6
feature_1   1.000000   0.012291   0.074457   0.043412   0.011114  -0.055734
feature_2   0.012291   1.000000   0.049643   0.065258  -0.790640   0.045529
feature_3   0.074457   0.049643   1.000000   0.805541  -0.019949   0.093384
feature_4   0.043412   0.065258   0.805541   

## Conclusion

We have generated an artificial dataset with specified correlations and checked it for a variety of conditions:

- Unique means and standard deviations for each column
- Presence of both integer and float data types
- Specific means close to a given value (e.g., 2.5)
- Various types of correlations between columns

The utility functions provided us with an automated way to validate these conditions, and the color-coded output helped in quickly assessing the results.

### Findings
- **Means:** All six column means differ, ranging from 0.987 (`feature_1`) to 50.304 (`feature_6`).
- **Standard Deviations:** All six standard deviations differ, ranging from 0.530 (`feature_1`) to 15.967 (`feature_6`).
- **Data Types:** The checks for at least one integer column and at least one float column both passed.
- **Mean Value Check:** `feature_2` has a mean of 2.427, which is within the 0.1 tolerance around 2.5.
- **Correlations:** `feature_3` and `feature_4` are positively correlated (0.806), `feature_2` and `feature_5` are negatively correlated (-0.791), and every visible correlation involving `feature_1` is below 0.1 in absolute value.

In the run shown above, the dataset meets each condition whose result is visible in the output.

**Note:** The color-coding in the output is based on ANSI escape codes and may not render as expected in all Jupyter environments. If the color-coding does not appear in the output, consider adjusting the `bcolors` class or the `print_colored_message` function to be compatible with your specific Jupyter setup or remove the color functionality for universal compatibility.
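
One possible way to make the color functionality optional is sketched below. The `colorize` helper and the `use_color` flag are hypothetical names introduced here, and the terminal check is only a heuristic: Jupyter typically reports that its output is not a terminal even though it can render ANSI codes, so the flag may need to be set by hand.

```python
import sys

GREEN, RED, RESET = "\033[92m", "\033[91m", "\033[0m"


def colorize(message: str, ok: bool, use_color: bool) -> str:
    """Return message wrapped in ANSI codes, or with a plain-text tag when color is off."""
    if not use_color:
        return f"[{'PASS' if ok else 'FAIL'}] {message}"
    return f"{GREEN if ok else RED}{message}{RESET}"


# Heuristic (an assumption of this sketch): emit escape codes only when stdout is a terminal.
use_color = sys.stdout.isatty()
print(colorize("All columns have different means", True, use_color))
```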

By automating these checks, we can streamline the process of dataset generation and ensure that the data meets the necessary preconditions before it is used for further analysis or model training.
