> This dataset is a resource for researchers and professionals investigating the impact of maternal mental health difficulties on infant sleep. It includes 410 mothers of infants aged 3 to 12 months, providing insight into this important area.

> The data covers a variety of topics, including sociodemographic data (e.g., maternal age, marital status, and educational level) as well as maternal mental health data (such as CB-PTSD, depression and anxiety) and infant sleep patterns (e.g., duration and number of night awakenings).

> To make the most of this dataset, researchers can use descriptive statistics or regression models to study the association between sociodemographic characteristics or maternal mental health difficulties and infant sleep duration or quality. Potential mediators and moderators can be explored by examining specific subgroups within the sample in more depth, or by comparing results with other publicly available datasets that measure similar variables.

#### This dataset can also give further insight into key factors that might influence mothers' decisions about their children's sleeping habits (e.g., whether to share a bed with their baby)

- Investigating the relationship between maternal mental health difficulties,
    such as postpartum depression and anxiety, and infant sleep patterns.
- Examining whether sociodemographic characteristics (e.g., age, education level) moderate
    the relationship between maternal mental health difficulties and infant sleep patterns.
- Investigating how various factors (e.g., maternal age, sex of the baby) relate to infants’ sleeping habits
    and sleep quality between 3 and 12 months of age by comparing variables
    across the infant age groups included in this dataset.


> - Data authors <br>
>  This dataset is related to: Sandoz, V.; Lacroix, A.; Stuijfzand, S.; Bickle Graz, M.; Horsch, A.
>  Maternal Mental Health Symptom Profiles and Infant Sleep: A Cross-Sectional Survey. Diagnostics 2022, 12, 1625.

https://doi.org/10.3390/diagnostics12071625.

https://zenodo.org/record/5070945#.Y8OqatJBwUE

### Importing libraries for the analysis

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Importing dataset from a csv file

In [None]:
df = pd.read_csv('Dataset_maternal_mental_health_infant_sleep.csv', encoding = 'unicode_escape')

> The lines in the cell below make pandas display all rows and columns of a DataFrame.

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

#### Exploring the shape of the DataFrame

In [None]:
df.shape

In [None]:
df.info()

> All but one of the DataFrame's columns hold numerical values. Because almost all of the columns are coded answers to survey questions, their data types will need to be changed.

##### Example of the Data and summary stats

In [None]:
df.head(10)

In [None]:
df.describe().style.background_gradient()

### Variable exploration
#### Check for unique values in every column

In [None]:
df.nunique()

In [None]:
for col in df.columns:
    if col not in ['Participant_number', 'Gestationnal_age', 'Age']:
        print(f'{col} - {np.sort(df[col].unique())}')

#### Marital_status
- 1 = single; 2 = in a relationship; 3 = separated, divorced or widow; 6 = other. 
#### Marital_status_edit
- 1 = single; 2 = in a relationship; 3 = separated, divorced or widow.

> - The Participant_number column has a different value in every row (it is an identifier). It will be dropped because it adds no information to the analysis.
> - The Type_parents, Birth_1mth_M_inclusion, Birth_12mth_M_inclusion and Child_survey_participation columns each contain only one distinct value. They will be dropped because they add no information to the analysis.
> - Marital_status will be dropped in favor of Marital_status_edit.

In [None]:
df.drop(columns=['Participant_number', 'Type_parents', 'Birth_1mth_M_inclusion',
                     'Birth_12mth_M_inclusion', 'Marital_status', 'Child_survey_participation'], inplace=True)
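A quick way to confirm which columns carry no information is to compare `nunique()` against 1 and against the number of rows. The sketch below is only an illustration on a small made-up frame: `toy` and its column names are hypothetical and are not part of the survey data.

```python
import pandas as pd

# Toy frame standing in for the survey data (column names are illustrative only)
toy = pd.DataFrame({
    "participant_id": [101, 102, 103, 104],  # different in every row -> identifier
    "survey_wave": [1, 1, 1, 1],             # same value everywhere -> constant
    "age": [29, 34, 29, 41],                 # informative
})

n_unique = toy.nunique()

# Columns with a single distinct value cannot explain any variation
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns with one distinct value per row behave like identifiers
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print("constant:", constant_cols)
print("identifier-like:", id_like_cols)
```

Continuous measurements can also be unique in every row, so the identifier-like list is a prompt for manual review rather than an automatic drop list.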

In [None]:
df.nunique()

### Checking for Null Values

In [None]:
df.isnull().sum() 

## Imputing values
### Very Short Form of the Infant Behavior Questionnaire-Revised (Negative Emotionality dimension) columns

In [None]:
# Getting the percentage of null values
df.isnull().sum() / df.shape[0] * 100

### Making dataframes for each test (CBTS, EPDS, HADS and IBQ_R)

In [None]:
df_cbts = df.loc[:, 'CBTS_M_3': 'CBTS_22']
df_epds = df.loc[:, 'EPDS_1': 'EPDS_10']
df_hads = df.loc[:, 'HADS_1': 'HADS_13']
df_ibq_r = df.loc[:, 'IBQ_R_VSF_3_bb1': 'IBQ_R_VSF_33_bb1']

In [None]:
df_no_q = df.loc[:, ['Age', 'Marital_status_edit', 'Education', 'Gestationnal_age', 'Type_pregnancy',
        'sex_baby1', 'Age_bb', 'Sleep_night_duration_bb1', 'night_awakening_number_bb1', 'how_falling_asleep_bb1']]

- Columns "IBQ_R_VSF_28_bb1" and "IBQ_R_VSF_33_bb1" are related: both ask how the baby reacts to unfamiliar adults.
> - 28 - When you introduced him/her to an adult he/she did not know, how often did your baby refuse to go to this unknown person?
> - 33 - When he/she was in the presence of several unfamiliar adults, how often did your baby cling to a parent?
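One way to check how closely two ordinal items such as these move together is a rank (Spearman) correlation. The sketch below computes it as the Pearson correlation of the ranks. The answers are made up for illustration, and `item_28` and `item_33` are hypothetical stand-ins for the two columns, not values from the dataset.

```python
import pandas as pd

# Made-up answers on a 1-7 scale; None marks a missing answer
toy = pd.DataFrame({
    "item_28": [1, 2, 2, 4, 5, 6, 7, None],
    "item_33": [1, 3, 2, 4, 4, 6, 7, 5],
})

# Keep only respondents who answered both items
pair = toy.dropna()

# Spearman's rho is the Pearson correlation of the ranks (ties get average ranks)
spearman_rho = pair["item_28"].rank().corr(pair["item_33"].rank())
print(round(spearman_rho, 3))
```

On the real data the same two lines could be applied to `df[['IBQ_R_VSF_28_bb1', 'IBQ_R_VSF_33_bb1']]`; a high value would support the idea that the two items carry overlapping information.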

#### KNNImputer 
> KNNImputer is scikit-learn's implementation of k-nearest-neighbors imputation. For each observation with missing values, it finds the k nearest neighbors (5 by default, using a Euclidean distance that ignores missing coordinates) and fills each missing value with the mean of those neighbors' values for that feature.
> It operates on numerical data, so categorical variables have to be numerically encoded first; here the survey answers are already stored as numeric codes. It can be computationally expensive on large datasets, and the choice of k can affect the results.
#### Why use KNNImputer?
- It tends to preserve the distribution of the data better than mean or median imputation. Filling every gap in a column with the same mean or median shrinks that column's variance and ignores its relationship with the other columns. KNNImputer instead borrows values from respondents with similar answers, so the imputed values vary from row to row.
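The difference between the two approaches is easier to see on a tiny example. The sketch below is an illustration on a made-up array, not the survey data, and `n_neighbors=2` is chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up data: the second column is roughly twice the first; one value is missing
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [8.0, 16.0],
])

# KNN imputation: average the second column over the 2 rows closest in the first column
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

# Mean imputation for comparison: one value for every gap, whatever the row looks like
mean_fill = np.nanmean(X[:, 1])

print("KNN fill: ", knn_filled[1, 1])
print("mean fill:", mean_fill)
```

The KNN value is taken from the two rows whose first column is closest to 2, while the column mean is pulled upward by the distant fourth row.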

In [None]:
from sklearn.impute import KNNImputer

In [None]:
# Function to impute columns for dataframes
def column_knn_imputer(df):
    imputer = KNNImputer()
    columns_to_impute = list(df.columns[df.isnull().any()])
    
    df_imputed = df.copy(deep=True)
    # Impute using fit_transform on the df_imputed
    df_imputed.loc[:, columns_to_impute] = imputer.fit_transform(df.loc[:, columns_to_impute])
    
    return df_imputed

In [None]:
df_data = column_knn_imputer(df)

In [None]:
df_data.isnull().sum().sum()

In [None]:
df_data.info()

In [None]:
for col in ['IBQ_R_VSF_3_bb1', 'IBQ_R_VSF_4_bb1','IBQ_R_VSF_9_bb1', 'IBQ_R_VSF_10_bb1', 
            'IBQ_R_VSF_16_bb1', 'IBQ_R_VSF_17_bb1', 'IBQ_R_VSF_28_bb1', 'IBQ_R_VSF_29_bb1',
            'IBQ_R_VSF_32_bb1', 'IBQ_R_VSF_33_bb1']:
    # Round first: astype('int') on its own truncates the imputed means (e.g. 3.8 -> 3)
    df_data[col] = df_data[col].round().astype('int')

In [None]:
df_data.info()

In [None]:
for col in df_data.columns:
    if col not in ['Participant_number', 'Gestationnal_age', 'Age']:
        print(f'{col} - {np.sort(df_data[col].unique())}')

### Histogram of the data set

In [None]:
df_data.hist(figsize=(20,25));

### Convert columns to category because their values are coded answers to questions

In [None]:
# Treat every column except the continuous and count variables as categorical
for col in df_data.columns:
    if col not in ['Age', 'Gestationnal_age', 'night_awakening_number_bb1']:
        df_data[col] = df_data[col].astype('category')
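Questionnaire answers are usually ordinal, and a plain `category` dtype does not record that order. As an optional variation, an ordered categorical keeps it, which allows comparisons and `min`/`max`. The sketch below uses a made-up answer series and assumes a 0 to 3 response scale purely for illustration.

```python
import pandas as pd

# Made-up answers on an assumed 0-3 scale (not taken from the dataset)
answers = pd.Series([3, 0, 2, 1, 3, 0])

# Declare the full scale and mark it as ordered
likert = pd.CategoricalDtype(categories=[0, 1, 2, 3], ordered=True)
ordered_answers = answers.astype(likert)

# Ordered categoricals support comparisons against a category value
high = (ordered_answers >= 2).sum()
print(ordered_answers.dtype)
print("answers of 2 or more:", high)
```

Declaring the categories explicitly also keeps answer levels that nobody selected, so count plots and frequency tables show them as zero instead of omitting them.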

In [None]:
df_data.info()

### Replacing values on columns

In [None]:
# 1 => single (s), 2 => in a relationship (r), 3 => separated, divorced or widow (sep-d-w)

# Replacing the numeric codes with short labels
df_data.Marital_status_edit = df_data.Marital_status_edit.replace({1: 's', 2: 'r', 3: 'sep-d-w'})

In [None]:
# 1 => single pregnancy; 2 => twin pregnancy

df_data.Type_pregnancy = df_data.Type_pregnancy.replace({1: 'single', 2: 'twin'})

In [None]:
# 1 => ≥3 months to <6 months; 2 => ≥6 months to <9 months; 3 => ≥9 months to <12 months

df_data.Age_bb = df_data.Age_bb.replace({1: '3m to <6m', 2: '6m to <9m', 3: '9m to <12m'})

In [None]:
# 1 => while being fed; 2 => while being rocked; 3 => while being held; 4 => alone in the crib; 5 => in the crib with parental presence

df_data.how_falling_asleep_bb1 = df_data.how_falling_asleep_bb1.replace({1: 'while being fed', 2: 'while being rocked',
                                                                         3: 'while being held', 4: 'alone in the crib',
                                                                         5: 'in the crib with parental presence'})

In [None]:
# 1 => no education (n_e), 2 => high school (h_s), 3 => some university (s_U), 
# 4 => associate, certificate or Technology Degree (a_c_t), 5 => university (U)

df_data.Education = df_data.Education.replace({1: 'n_e', 2: 'h_s', 3: 's_U', 4: 'a_c_t', 5: 'U'})

In [None]:
df_data.sex_baby1 = df_data.sex_baby1.replace({1: 'girl', 2: 'boy'})

In [None]:
# Renaming columns
df_data = df_data.rename(columns={'Marital_status_edit': 'Marital_status',
                                 'sex_baby1': 'Sex'})
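Because the relabelled columns were converted to `category` beforehand, the value replacements above can also be written with `Series.cat.rename_categories`, which relabels the categories directly. Recent pandas releases may emit a deprecation warning when `replace` is applied to categorical data, so this can be a quieter alternative. The sketch below uses a made-up series with the marital status coding; `codes` and `labels` are illustrative names.

```python
import pandas as pd

# Made-up categorical series using the 1/2/3 marital status coding
codes = pd.Series([1, 2, 2, 3, 1], dtype="category")

# Same mapping as in the notebook: code -> short label
labels = {1: "s", 2: "r", 3: "sep-d-w"}

# Relabel the categories; the column stays categorical
relabelled = codes.cat.rename_categories(labels)
print(relabelled.value_counts())
```

Categories that are missing from the mapping are left unchanged.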

### EDA Summary Stats

#### Numerical values

In [None]:
# Function to get summary stats
def get_summary_stats_by_columns(df):
    column_name = df.columns
    new_df = pd.DataFrame(index=['Data type', 'Min', '25%', '50%', '75%','Max', 'Mean', 'Median', 'Mode', 'STD', 'Skewness', 'Kurtosis', 'Count'])
    for col in column_name:
        if pd.api.types.is_numeric_dtype(df[col]):
            new_df[col.upper()] = [df[col].dtype, df[col].min(), df[col].quantile(.25), df[col].quantile(.5), df[col].quantile(.75),df[col].max(), df[col].mean(), df[col].median(),
                                   df[col].mode()[0], df[col].std(), df[col].skew(), df[col].kurt() , df[col].count()]
    return new_df

In [None]:
get_summary_stats_by_columns(df_data)

In [None]:
df_data.hist(figsize=(6,6));

#### Categorical values

In [None]:
df_data.describe(exclude='number')

In [None]:
cat_column_list = ['Marital_status', 'Education', 'Type_pregnancy', 'Sex', 'Age_bb',
            'CBTS_M_3', 'CBTS_M_4', 'CBTS_M_5', 'CBTS_M_6', 'CBTS_M_7', 'CBTS_M_8', 'CBTS_M_9', 
            'CBTS_M_10', 'CBTS_M_11', 'CBTS_M_12', 'CBTS_13', 'CBTS_14', 'CBTS_15', 'CBTS_16',
            'CBTS_17', 'CBTS_18', 'CBTS_19', 'CBTS_20', 'CBTS_21', 'CBTS_22',
            'EPDS_1', 'EPDS_2', 'EPDS_3', 'EPDS_4', 'EPDS_5', 'EPDS_6', 'EPDS_7',
            'EPDS_8', 'EPDS_9', 'EPDS_10', 'HADS_1', 'HADS_3', 'HADS_5', 'HADS_7',
            'HADS_9', 'HADS_11', 'HADS_13']
len(cat_column_list)

In [None]:
fig, ax = plt.subplots(nrows = 14, ncols = 3, figsize =(15,70))
fig.tight_layout(pad=6.0)
i = 0
for row in range(14):
    for col in range(3):
        g = sns.countplot(x=cat_column_list[i], data=df_data, ax=ax[row, col])
        g.tick_params(axis='x', labelrotation=30)  # rotate labels without re-setting them
        for p in g.patches:
            g.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.15, p.get_height()+0.05))
        i += 1
plt.show()

In [None]:
cat_column_list_2 = ['Age_bb', 'IBQ_R_VSF_3_bb1', 'IBQ_R_VSF_4_bb1','IBQ_R_VSF_9_bb1', 'IBQ_R_VSF_10_bb1', 
                     'IBQ_R_VSF_16_bb1', 'IBQ_R_VSF_17_bb1', 'IBQ_R_VSF_28_bb1', 'IBQ_R_VSF_29_bb1',
                     'IBQ_R_VSF_32_bb1', 'IBQ_R_VSF_33_bb1', 'Sleep_night_duration_bb1']
len(cat_column_list_2)

In [None]:
fig, ax = plt.subplots(nrows = 4, ncols = 3, figsize =(15,20))
fig.tight_layout(pad=6.0)
i = 0
for row in range(4):
    for col in range(3):
        g = sns.countplot(x=cat_column_list_2[i], data=df_data, ax=ax[row, col])
        g.tick_params(axis='x', labelrotation=30)  # rotate labels without re-setting them
        for p in g.patches:
            g.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.15, p.get_height()+0.05))
        i += 1
plt.show()