### Data Generator

# Creating a Robust Test Data Set for Evaluating Classification Models: A Multi-Class Scenario with Missing Values
In this notebook we will create synthetic data for use in another notebook, where we will build a number of models and fit them to the synthesized data.

## Introduction

In the world of machine learning, the efficacy of classification models heavily relies on the quality and diversity of the data they are trained on. To ensure the robustness of these models, it's essential to subject them to rigorous testing against datasets that closely resemble real-world scenarios. This testing often requires the creation of synthetic datasets that emulate the complexities and challenges data scientists encounter in their day-to-day work.

We will focus on the process of synthesizing a data set specifically tailored for testing a classification model. Our objective is to design a dataset with a minimum of five input variables and a multi-class target variable, all while introducing a realistic element: missing values. Missing data is a common problem in real-world datasets, and understanding how your model handles it is crucial for producing reliable results.



# Step 1: Importing necessary packages

In [1]:
# import required packages
import numpy as np
import pandas as pd

In [2]:
# Set random seed for reproducibility
np.random.seed(42)
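
As a quick illustration of why the seed matters, the sketch below draws the same values twice from generators seeded identically, and different values from a generator with another seed. The helper `draw_sample` and the seed `7` are hypothetical names and values chosen for this example. It uses a local `np.random.RandomState` rather than `np.random.seed`, so running it should not disturb the notebook's global random state.

```python
import numpy as np

def draw_sample(seed, size=5):
    """Draw `size` standard-normal values from a local, freshly seeded generator."""
    local_rng = np.random.RandomState(seed)  # local generator, global state untouched
    return local_rng.normal(0, 1, size)

first = draw_sample(42)
second = draw_sample(42)
different = draw_sample(7)  # hypothetical second seed, for comparison only

same_seed_matches = np.array_equal(first, second)
different_seed_matches = np.array_equal(first, different)
print(same_seed_matches, different_seed_matches)
```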

# Step 2: Synthesize input features

First, let's define the parameters of our dataset. This is synthetic data that we create to assist in developing our understanding of modeling. Normally, the process that generates the data is hidden from us, and our job is to identify the best model we can.

In [3]:
# Define the total number of data samples in the dataset.
sample_size = 1000

# Specify the number of input features for each data point.
n_features = 5

# Set the number of distinct classes for the target variable.
n_classes = 3

# Determine the number of missing values to be introduced in the dataset.
n_missing_values = 10

# Specify the number of input features with missing values (in this case, 2).
n_features_with_missing = 2

To simulate random data, let's sample from a normal (aka Gaussian) distribution. This gives us a set of values that are centered around a mean value, with a standard deviation that we can control. This is a more realistic representation of data that we might encounter in the real world.

In [4]:
x_mean = 0
x_stdev = 15

In [5]:
# Start with an empty list to collect the feature columns
X = []

# Generate n_features (5) random features, each with sample_size values
for _ in range(n_features):
    feature_i = np.round(np.random.normal(x_mean, x_stdev, sample_size), 2)
    X.append(feature_i)

In [6]:
X = np.array(X).T  # Transpose the list of features to have columns as features
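
The loop-and-transpose approach above can also be written as a single call that draws the whole matrix at once. The following is a minimal sketch of that alternative, not the notebook's method: it uses a local `np.random.default_rng` generator (so the values will differ from those produced by the cells above), and the name `X_demo` is made up for illustration. It also checks that the sampled columns have roughly the requested mean and standard deviation.

```python
import numpy as np

# Local generator: keeps this sketch independent of the notebook's global seed
rng = np.random.default_rng(42)

sample_size, n_features = 1000, 5
x_mean, x_stdev = 0, 15

# Draw a (sample_size, n_features) matrix directly: rows are samples, columns are features
X_demo = np.round(rng.normal(x_mean, x_stdev, size=(sample_size, n_features)), 2)

print(X_demo.shape)
print(X_demo.mean(axis=0).round(2))  # each value should be close to x_mean
print(X_demo.std(axis=0).round(2))   # each value should be close to x_stdev
```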

# Step 3: Synthesize Target

In [7]:
# Defining target with n_classes
y = np.random.randint(0, n_classes, sample_size)
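
Because the labels are drawn uniformly at random, the classes should come out roughly balanced. The sketch below shows one way to check this with `np.bincount`. It uses its own local generator and the illustrative name `y_demo`, so its counts will not match the notebook's `y` exactly. Note also that labels drawn this way are independent of the features, so a model fit to such data would not be expected to beat chance accuracy (about `1 / n_classes`) on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator, for illustration only
n_classes, sample_size = 3, 1000

# Integer labels in [0, n_classes); the upper bound is exclusive
y_demo = rng.integers(0, n_classes, sample_size)

class_counts = np.bincount(y_demo, minlength=n_classes)
class_share = class_counts / sample_size
print(class_counts, class_share.round(3))
```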

# Step 4: Add Missing Values

To make the synthesized data closer to real-world data, we will add missing values to it.

In [8]:
# Calculate the number of rows to add missing values to (10% of the total)
num_rows_with_missing = int(0.10 * sample_size)

In [9]:
# Add missing values to two random features in a subset of rows
missing_feature_indices = np.random.choice(range(n_features), n_features_with_missing, replace=False)
rows_with_missing = np.random.choice(range(sample_size), num_rows_with_missing, replace=False)

for i in rows_with_missing:
    for j in missing_feature_indices:
        X[i, j] = np.nan
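
The nested loop above can be expressed as one indexed assignment using `np.ix_`, which selects every combination of the chosen rows and columns. This is a sketch under the same settings (100 rows, 2 features), run on a separate array `X_demo` with a local generator, so the affected rows and columns will differ from the notebook's. It also makes the resulting amount of missing data explicit: the same 100 rows lose both selected features, giving 200 `NaN` cells in total. The `n_missing_values` parameter defined earlier is not used by this step.

```python
import numpy as np

rng = np.random.default_rng(1)  # local generator, for illustration only
X_demo = rng.normal(0, 15, size=(1000, 5))

# Pick 2 distinct columns and 100 distinct rows
cols = rng.choice(5, size=2, replace=False)
rows = rng.choice(1000, size=100, replace=False)

# Blank out every (row, col) pair in one step
X_demo[np.ix_(rows, cols)] = np.nan

nan_per_column = np.isnan(X_demo).sum(axis=0)   # NaN count for each feature
nan_rows = np.isnan(X_demo).any(axis=1).sum()   # rows with at least one NaN
print(nan_per_column, nan_rows)
```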

# Step 5: Create a pandas dataframe from the data

Since our goal is to generate data that we can fit in another notebook, let's save this data to a CSV file.
First, we will create a dataframe from the data we just simulated.

In [10]:
# creating a pandas dataframe from the data
df = pd.DataFrame(X, columns=['feature1', 'feature2', 'feature3', 'feature4', 'feature5'])

In [11]:
# Add the target variable
df['Target'] = y

Let's take a look at the data by viewing the first few rows of the dataframe.

In [12]:
df.head()

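|   | feature1 | feature2 | feature3 | feature4 | feature5 | Target |
|---|----------|----------|----------|----------|----------|--------|
| 0 | 7.45 | 20.99 | -10.13 | -28.62 | -12.95 | 0 |
| 1 | -2.07 | 13.87 | -2.17 | -12.91 | -0.47 | 1 |
| 2 | 9.72 | 0.89 | -11.89 | -6.2 | 0.27 | 0 |
| 3 | 22.85 | -9.7 | -4.62 | 28.32 | 7.09 | 1 |
| 4 | -3.51 | 10.47 | -28.4 | 8.35 | -20.5 | 2 |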
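
The first five rows may not happen to contain any missing values, so it can help to count them per column before saving. The sketch below shows the idea on a small toy frame: the `demo` dataframe and the positions of its `NaN` entries are invented for illustration and are not the notebook's data. On the real dataframe, the same `isna().sum()` call should work unchanged.

```python
import numpy as np
import pandas as pd

# Toy frame with hand-placed missing values (illustrative only)
demo = pd.DataFrame({
    "feature1": [7.45, np.nan, 9.72, 22.85],
    "feature2": [20.99, 13.87, np.nan, -9.70],
    "Target": [0, 1, 0, 1],
})

missing_per_column = demo.isna().sum()  # NaN count for each column
complete_rows = len(demo.dropna())      # rows with no missing values
print(missing_per_column)
print(complete_rows)
```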


# Step 6: Save the data frame content to a csv

Lastly, let's save the data we created to a csv file. This saved data will be used in the next notebook.

In [13]:
df.to_csv(r'data_divas.csv')
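
One detail worth knowing when the file is read back in the next notebook: `to_csv` writes the row index as an unnamed first column by default, which `pd.read_csv` then loads as a column called `Unnamed: 0`. The sketch below demonstrates this on a toy frame using in-memory buffers instead of files. The `demo` frame is made up for illustration, and passing `index=False` is shown as one option, not as what the notebook above does.

```python
import io
import numpy as np
import pandas as pd

demo = pd.DataFrame({"feature1": [1.5, np.nan], "Target": [0, 2]})  # toy data

# Default behavior: the index is written as an unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# With index=False, only the data columns are written
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
restored = pd.read_csv(without_index)
cols_without_index = list(restored.columns)

print(cols_with_index)
print(cols_without_index)
print(restored["feature1"].isna().sum())  # missing values survive the round trip
```

Alternatively, the saved index column can be absorbed on load with `pd.read_csv(..., index_col=0)`.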