
# Generating dummy data for testing our application

## Defining the data structure:
| Column name | Data type | Range |
| ----------- | ----------- | ----------- |
| # | Int | 1 - 300 |
| geneticTestResult | Text | [XY, XX, XO, XXY, XXX, XYY, XXYY, XXXY, XXXX, Mosaic, FXS, PWS, AS, DUPI 15, EDS] |
| role | Text | [parent] |
| caregiverFirstName | Text | [English names] |
| username | Text | Random |
| password | Alphanumeric | Random |
| childFirstName | Text | [English child names] |
| dateOfBirth | Date | Random date of birth, age 0 to 2 years |
| country | Text | ['AUSTRALIA', 'Asia', 'EU', 'UK', 'USA', 'OTHER'] |
| diagnosis | Text | ['PRENAT', 'NEONAT'] |
| Q1 to Q20 | Int | [0, 1] |
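
The table above can double as a checklist for the generated rows. The following is a minimal sketch, using only the standard library, of how one row could be checked against it. The helper names `validate_record` and `age_in_years` are hypothetical, the key `role` follows the table's spelling, and "0 to 2 years" is assumed to mean an age of at most 2 whole years.

```python
from datetime import date

# Allowed values copied from the data-structure table.
GENETIC_RESULTS = {"XY", "XX", "XO", "XXY", "XXX", "XYY", "XXYY", "XXXY",
                   "XXXX", "Mosaic", "FXS", "PWS", "AS", "DUPI 15", "EDS"}
COUNTRIES = {"AUSTRALIA", "Asia", "EU", "UK", "USA", "OTHER"}
DIAGNOSES = {"PRENAT", "NEONAT"}


def age_in_years(dob, today):
    """Whole years elapsed between dob and today."""
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))


def validate_record(record, today=None):
    """Return a list of problems found in one row (hypothetical helper)."""
    today = today or date.today()
    problems = []
    if record.get("geneticTestResult") not in GENETIC_RESULTS:
        problems.append("geneticTestResult not in allowed list")
    if record.get("role") != "parent":
        problems.append("role must be 'parent'")
    if record.get("country") not in COUNTRIES:
        problems.append("country not in allowed list")
    if record.get("diagnosis") not in DIAGNOSES:
        problems.append("diagnosis not in allowed list")
    dob = record.get("dateOfBirth")
    if not isinstance(dob, date) or dob > today:
        problems.append("dateOfBirth missing or in the future")
    elif age_in_years(dob, today) > 2:
        # Assumption: "0 to 2 years" means at most 2 whole years old.
        problems.append("dateOfBirth implies an age above 2 years")
    for i in range(1, 21):
        if record.get(f"Q{i}") not in (0, 1):
            problems.append(f"Q{i} must be 0 or 1")
    return problems


# Illustrative row; the values are made up for this example.
example = {"geneticTestResult": "XXY", "role": "parent", "country": "UK",
           "diagnosis": "PRENAT", "dateOfBirth": date(2023, 6, 1)}
example.update({f"Q{i}": 0 for i in range(1, 21)})
print(validate_record(example, today=date(2024, 1, 15)))  # prints []
```

Passing `today` explicitly keeps the age check deterministic, which is convenient when the same dummy file is re-checked later.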


### The code below only illustrates the approach I am going to take to generate this dummy data

In [2]:
pip install Faker

Collecting Faker
  Downloading Faker-22.5.0-py3-none-any.whl (1.7 MB)
Installing collected packages: Faker
Successfully installed Faker-22.5.0


In [66]:
from faker import Faker
import pandas as pd
import random
import string

fake = Faker()
# defining the genetic variation list
variation_list = ['XY', 'XX', 'XO', 'XXY', 'XXX', 'XYY', 'XXYY', 'XXXY', 'XXXX', 'Mosaic', 'FXS', 'PWS', 'AS', 'DUPI 15', 'EDS']

# defining the country list
country_list = ['AUSTRALIA', 'Asia','EU', 'UK','USA','OTHER']

# defining the diagnosis time list
diagnosis_time = ['PRENAT','NEONAT']

# Function to generate random 6-letter username
def generate_username():
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(6))

# Function to generate random 6-digit password
def generate_password():
    return ''.join(random.choice(string.digits) for _ in range(6))


# Defining our data structure
data = {'caregiverFirstName': [fake.name() for _ in range(300)],
        'geneticTestResult': [random.choice(variation_list) for _ in range(300)],
        'Email': [fake.email() for _ in range(300)],
        'childFirstName' : [fake.unique.first_name() for _ in range(300)],
        'country' : [random.choice(country_list) for _ in range(300)],
        'dateOfBirth': [fake.date_of_birth(minimum_age=0, maximum_age=2) for _ in range(300)],
        'diagnosis' : [random.choice(diagnosis_time) for _ in range(300)],
        'username': [generate_username() for _ in range(300)],
        'password': [generate_password() for _ in range(300)]}

# Add 'Role' column with 'parent' value
data['Role'] = ['parent'] * 300

# Creating a DataFrame
df_one = pd.DataFrame(data)

# Save to CSV or use as needed
# df_one.to_csv('dummy_data.csv', index=False)
df_one

Unnamed: 0,caregiverFirstName,geneticTestResult,Email,childFirstName,country,dateOfBirth,diagnosis,username,password,Role
0,Victoria Johnson,XXX,jessica02@example.com,Deanna,USA,2022-07-23,NEONAT,ukatvz,539592,parent
1,James Rodriguez MD,XXXX,marymckinney@example.net,Courtney,AUSTRALIA,2021-02-10,PRENAT,flvsid,141489,parent
2,Gabriella Rogers,XY,vincentryan@example.net,Judith,Asia,2024-01-01,NEONAT,eonutx,207482,parent
3,Allison Peters,XXXX,laura83@example.org,James,AUSTRALIA,2021-05-04,NEONAT,fbwuot,177945,parent
4,Michael Barron,XXYY,keithstanley@example.org,Terry,OTHER,2022-07-18,NEONAT,orsquy,954820,parent
...,...,...,...,...,...,...,...,...,...,...
295,Amy Ray,EDS,ryan65@example.com,Monique,UK,2023-12-03,PRENAT,csskgh,705857,parent
296,Michael Carrillo,FXS,angelicajackson@example.com,Carolyn,OTHER,2022-06-17,NEONAT,qvkajv,283924,parent
297,Jason Wheeler,FXS,williamsullivan@example.net,Sandy,OTHER,2021-06-04,NEONAT,mkghxq,012395,parent
298,Meghan Ward,XYY,echung@example.org,Vanessa,UK,2022-08-16,NEONAT,bnfysc,652592,parent
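
The data-structure table lists `password` as alphanumeric, while `generate_password` above draws from digits only, as the sample output shows. If mixed letters and digits are wanted, a minimal sketch of a variant could look like this. The name `generate_alnum_password` and its `rng` parameter are hypothetical additions, and the result is meant for dummy data only, not for real credentials.

```python
import random
import string

# Pool of characters: upper- and lower-case ASCII letters plus digits.
ALPHANUMERIC = string.ascii_letters + string.digits


def generate_alnum_password(length=6, rng=random):
    """Hypothetical variant of generate_password mixing letters and digits."""
    return ''.join(rng.choice(ALPHANUMERIC) for _ in range(length))


# A dedicated, seeded generator makes the dummy passwords reproducible.
rng = random.Random(42)
sample_passwords = [generate_alnum_password(rng=rng) for _ in range(5)]
print(sample_passwords)
```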


# M-CHAT-R test data (dummy)

## Generating the 'score' column

The goal is a 'score' column of 300 integers whose average is close to 2.3. The code below generates 300 random integer features ranging from 1 to 20 and a noisy target centred on 2.3, trains a model to predict the target from the feature, and rounds the predictions to integers.

The model is a simple decision tree regressor (<code>DecisionTreeRegressor</code>) from the scikit-learn library. Please make sure you have scikit-learn installed <code>(pip install scikit-learn)</code> before running the code.

In [67]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Set a seed for reproducibility
np.random.seed(42)

# Number of items
num_items = 300

# Generate a row of 300 items ranging from 1 to 20
X = np.random.randint(1, 21, size=num_items)

# Create a target column with the desired average (2.3, 2.4, or 2.5)
target_average = 2.3
y = np.full(num_items, target_average)

# Introduce some noise to the target values
y += np.random.normal(scale=0.5, size=num_items)

# Reshape X for compatibility with scikit-learn
X = X.reshape(-1, 1)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree Regressor with limited depth and minimum split/leaf sizes
model = DecisionTreeRegressor(max_depth=8, min_samples_split=5, min_samples_leaf=3)
model.fit(X_train, y_train)

# Make predictions on the entire dataset
predictions = model.predict(X)

# Round the predictions to integers
rounded_predictions = np.round(predictions).astype(int)

# Create a DataFrame
data = {'score': rounded_predictions}
df_test = pd.DataFrame(data)

# Display the data frame table
print(df_test)

# Calculate the average
average_value = df_test['score'].mean()
print("\nAverage:", average_value)


     score
0        2
1        2
2        2
3        3
4        2
..     ...
295      2
296      2
297      2
298      2
299      2

[300 rows x 1 columns]

Average: 2.06


An average of 2.06 is reasonably close to the research average of 2.3 targeted above, so I am going ahead with it.
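
Because the feature `X` is random and unrelated to the target, the tree can only learn values near the overall mean of 2.3, which is why most predictions round to 2. A simpler alternative, swapped in here purely as an illustration and not the approach used in this notebook, is to sample each score directly as the number of successes in 20 independent yes/no trials with probability 2.3 / 20. The constant and function names below are hypothetical.

```python
import random

NUM_ROWS = 300
NUM_QUESTIONS = 20
TARGET_MEAN = 2.3

# Probability of a 1 per question so that the expected row sum is TARGET_MEAN.
P_ONE = TARGET_MEAN / NUM_QUESTIONS


def sample_score(rng, n=NUM_QUESTIONS, p=P_ONE):
    """Count successes in n independent trials (a binomial draw)."""
    return sum(rng.random() < p for _ in range(n))


rng = random.Random(42)  # seeded for reproducibility
scores = [sample_score(rng) for _ in range(NUM_ROWS)]
mean_score = sum(scores) / len(scores)
print(min(scores), max(scores), round(mean_score, 2))
```

Sampling this way spreads the scores over more values than the rounded tree predictions, while keeping the expected average at 2.3.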

## Generating 300 rows of 0s and 1s in 20 columns, 'Q1' to 'Q20', so that the sum of each row matches the integer in the 'score' column. In other words, 'score' is the number of 1s in columns 'Q1' to 'Q20' for that row.

In [68]:
import pandas as pd
import numpy as np

# Set a seed for reproducibility
np.random.seed(42)

# Number of rows
num_rows = 300

# Use the 'score' column from the previous dataset
items = df_test['score']

# Initialize a DataFrame for 'Q1' to 'Q20'
data = pd.DataFrame()

# Generate data for each 'Q' column
for i in range(1, 21):
    # Calculate the maximum possible value for the 'Q' column
    max_value = np.minimum(1, items - data.sum(axis=1))

    # Set the 'Q' column based on the difference
    data[f'Q{i}'] = (np.random.rand(num_rows) < max_value).astype(int)

# Shuffle the values within each row across columns Q2 to Q20 (the 1: slice leaves Q1 in place)
data.iloc[:, 1:] = np.apply_along_axis(np.random.permutation, axis=1, arr=data.iloc[:, 1:])

# Verify the sum of each row matches the 'score' column
row_sums = data.sum(axis=1)
assert np.all(row_sums == items), "Adjustment failed!"

# Concatenate the 'score' column with the generated data
result_data = pd.concat([items, data], axis=1)

# Display the resulting DataFrame
print(result_data)

     score  Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  ...  Q11  Q12  Q13  Q14  Q15  \
0        2   0   0   0   0   0   0   0   0   1  ...    0    0    0    0    0   
1        2   0   0   0   0   0   0   0   0   0  ...    0    0    0    0    1   
2        2   0   0   0   0   0   0   0   0   0  ...    0    0    0    0    0   
3        3   0   0   0   0   0   0   1   0   0  ...    0    0    0    0    0   
4        2   0   0   0   0   0   0   0   0   0  ...    0    0    0    1    0   
..     ...  ..  ..  ..  ..  ..  ..  ..  ..  ..  ...  ...  ...  ...  ...  ...   
295      2   0   0   0   0   0   0   0   1   0  ...    0    0    0    0    0   
296      2   0   0   0   0   0   0   0   0   0  ...    0    0    0    0    0   
297      2   0   0   0   0   0   0   1   0   0  ...    0    0    0    1    0   
298      2   0   0   0   0   0   0   0   1   0  ...    1    0    0    0    0   
299      2   0   0   0   0   0   0   0   0   0  ...    1    0    0    0    0   

     Q16  Q17  Q18  Q19  Q20  
0      0
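
In the output shown, Q1 is 0 in every displayed row, and the shuffle slice `iloc[:, 1:]` leaves that column out. A more direct way to build a row with an exact sum is to pick which question positions hold a 1. The following is a minimal standard-library sketch of that idea, with the hypothetical helper `answers_for_score`. It is an alternative for illustration, not the code that produced the output in this notebook.

```python
import random

NUM_QUESTIONS = 20


def answers_for_score(score, rng, n=NUM_QUESTIONS):
    """Return n answers (0 or 1) containing exactly `score` ones."""
    if not 0 <= score <= n:
        raise ValueError(f"score must be between 0 and {n}")
    # Choose `score` distinct positions, each position equally likely.
    ones = set(rng.sample(range(n), score))
    return [1 if i in ones else 0 for i in range(n)]


rng = random.Random(42)  # seeded for reproducibility
example_scores = [2, 2, 3, 0, 20]
rows = [answers_for_score(s, rng) for s in example_scores]

for s, row in zip(example_scores, rows):
    assert sum(row) == s  # each row sums to its score by construction
    print(s, row)
```

With this construction every column, including Q1, can receive a 1, and no separate shuffle step is needed.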

# <code>merge</code> to complete the dataset

In [69]:
# Merge based on row index
final_df = pd.merge(df_one, result_data, left_index=True, right_index=True, how='inner')

# Display the resulting combined dataframe
print(final_df)


     caregiverFirstName geneticTestResult                         Email  \
0      Victoria Johnson               XXX         jessica02@example.com   
1    James Rodriguez MD              XXXX      marymckinney@example.net   
2      Gabriella Rogers                XY       vincentryan@example.net   
3        Allison Peters              XXXX           laura83@example.org   
4        Michael Barron              XXYY      keithstanley@example.org   
..                  ...               ...                           ...   
295             Amy Ray               EDS            ryan65@example.com   
296    Michael Carrillo               FXS   angelicajackson@example.com   
297       Jason Wheeler               FXS   williamsullivan@example.net   
298         Meghan Ward               XYY            echung@example.org   
299        Carol Rhodes              XXYY  rodriguezcharles@example.net   

    childFirstName    country dateOfBirth diagnosis username password

In [70]:
from google.colab import files

final_df.to_csv('M-CHAT-R_Dummy300.csv')
files.download('M-CHAT-R_Dummy300.csv')
