In this notebook, I'll look into consolidating the breastfeeding data:

In [1]:
# import libraries:
import pandas as pd
import numpy as np

In [2]:
# import data:
NS = pd.read_csv('Cleaned_NS.csv')

In [3]:
# extracting only the breastfeeding data:
NS_breastfeeding = NS[['participant_ID','breastfeeding_yes','breastfeeding_no','breastfeeding_na']]

In [4]:
# Checking out the values:
NS_breastfeeding.describe()

Unnamed: 0,participant_ID,breastfeeding_yes,breastfeeding_no,breastfeeding_na
count,4054,280,1786,216
unique,4054,1,1,1
top,ns 757,yes,no,not applicable
freq,1,280,1786,216


First, I will consolidate the three answer columns into one and count how many answers each participant gave, to find out who provided multiple answers to the breastfeeding question.

In [5]:
def consolidate_row_exclude_first(row):
    """Join a row's non-empty answers (skipping the first column) and count them."""
    consolidated = []
    # Loop through the row's values, excluding the first one (participant_ID)
    for value in row.iloc[1:]:
        if pd.notna(value) and value != '':
            consolidated.append(value)

    return '; '.join(consolidated), len(consolidated)
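
To make the behaviour of the helper concrete, here is a small sketch on a made-up three-row frame (the `toy` data and the `p1`..`p3` IDs are invented for illustration, and the helper is restated in compact form only so the block runs on its own). It also shows that the answer count alone could be obtained without a row-wise loop, assuming missing answers are stored as NaN as they appear to be in this dataset.

```python
import numpy as np
import pandas as pd

def consolidate_row_exclude_first(row):
    # Same logic as the notebook's helper, restated so this block is self-contained
    consolidated = [v for v in row.iloc[1:] if pd.notna(v) and v != '']
    return '; '.join(consolidated), len(consolidated)

# Hypothetical toy data, not taken from Cleaned_NS.csv
toy = pd.DataFrame({
    'participant_ID': ['p1', 'p2', 'p3'],
    'breastfeeding_yes': ['yes', np.nan, np.nan],
    'breastfeeding_no': ['no', 'no', np.nan],
    'breastfeeding_na': [np.nan, np.nan, np.nan],
})

# Row-wise consolidation: one (joined answers, count) tuple per participant
results = [consolidate_row_exclude_first(row) for _, row in toy.iterrows()]

# Vectorised alternative for the count only: non-null cells per row
counts = toy.iloc[:, 1:].notna().sum(axis=1)

print(results)
print(counts.tolist())
```

A participant with no answers gets an empty string and a count of 0, which is why the consolidated column later shows no nulls.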

In [6]:
# Apply the function to each row and create new columns
NS_breastfeeding[['breastfeeding?', 'number_of_answers']] = NS_breastfeeding.apply(
    lambda row: pd.Series(consolidate_row_exclude_first(row)), axis=1
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  NS_breastfeeding[['breastfeeding?', 'number_of_answers']] = NS_breastfeeding.apply(


In [7]:
NS_breastfeeding.head()

Unnamed: 0,participant_ID,breastfeeding_yes,breastfeeding_no,breastfeeding_na,breastfeeding?,number_of_answers
0,ns 757,,,,,0
1,ns 318,,,,,0
2,ns 328,,,,,0
3,ns 646,,no,,no,1
4,ns 678,,,,,0
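
The SettingWithCopyWarning shown above appears because `NS_breastfeeding` was created by selecting columns from `NS`, so pandas cannot tell whether later assignments are meant to reach the original frame. One common way to avoid it, sketched here on invented data (`source`, `subset`, and `other_column` are hypothetical names), is to take an explicit `.copy()` when extracting the columns.

```python
import pandas as pd

# Hypothetical stand-in for the full survey frame
source = pd.DataFrame({
    'participant_ID': ['p1', 'p2'],
    'breastfeeding_yes': ['yes', None],
    'other_column': [1, 2],
})

# An explicit copy makes the subset an independent DataFrame,
# so adding columns to it does not raise SettingWithCopyWarning
subset = source[['participant_ID', 'breastfeeding_yes']].copy()
subset['number_of_answers'] = subset['breastfeeding_yes'].notna().astype(int)

print(subset)
```

The original frame is left untouched; only the copy gains the new column.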


Now, we can drop the three breastfeeding columns:

In [8]:
# Now, we can drop all the columns which have been consolidated:
NS_breastfeeding.drop(NS_breastfeeding.columns[1:4], axis = 1, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  NS_breastfeeding.drop(NS_breastfeeding.columns[1:4], axis = 1, inplace = True)


In [9]:
NS_breastfeeding.head()

Unnamed: 0,participant_ID,breastfeeding?,number_of_answers
0,ns 757,,0
1,ns 318,,0
2,ns 328,,0
3,ns 646,no,1
4,ns 678,,0


In [10]:
NS_breastfeeding.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4054 entries, 0 to 4053
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   participant_ID     4054 non-null   object
 1   breastfeeding?     4054 non-null   object
 2   number_of_answers  4054 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 95.1+ KB


The info above shows there are no null values, but that is only because participants who gave no answer have an empty string in the breastfeeding? column rather than NaN. Let's find those blanks and replace them with 'prefer not to answer'.

In [11]:
update_breastfeeding = NS_breastfeeding[NS_breastfeeding.number_of_answers == 0]

In [12]:
# replacing the empty answers with 'prefer not to answer'
NS_breastfeeding['breastfeeding?'].loc[update_breastfeeding.index] = 'prefer not to answer'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  NS_breastfeeding['breastfeeding?'].loc[update_breastfeeding.index] = 'prefer not to answer'
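
The assignment above is a chained one: it first selects the column and then indexes into the result with `.loc`, which is the pattern the warning is about. A single `.loc` call with a row mask and a column label is the form the pandas documentation recommends. The sketch below shows that form on invented data (`toy` and `no_answer` are hypothetical names); it is an alternative, not what the notebook ran.

```python
import pandas as pd

# Hypothetical toy data mirroring the consolidated frame's layout
toy = pd.DataFrame({
    'participant_ID': ['p1', 'p2', 'p3'],
    'breastfeeding?': ['', 'no', ''],
    'number_of_answers': [0, 1, 0],
})

# Boolean mask of participants who gave no answer
no_answer = toy['number_of_answers'] == 0

# One .loc call: rows from the mask, column by label
toy.loc[no_answer, 'breastfeeding?'] = 'prefer not to answer'

print(toy)
```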


In [13]:
NS_breastfeeding.head()

Unnamed: 0,participant_ID,breastfeeding?,number_of_answers
0,ns 757,prefer not to answer,0
1,ns 318,prefer not to answer,0
2,ns 328,prefer not to answer,0
3,ns 646,no,1
4,ns 678,prefer not to answer,0


In [14]:
NS_breastfeeding.number_of_answers.value_counts()

1    2262
0    1782
2      10
Name: number_of_answers, dtype: int64

We can see that some people provided 2 answers to the question. Let's take a look at them.

In [15]:
NS_breastfeeding[NS_breastfeeding.number_of_answers == 2]

Unnamed: 0,participant_ID,breastfeeding?,number_of_answers
358,ns 1400,yes; no,2
1287,ns 976,no; not applicable,2
1611,ns 2498,no; not applicable,2
1907,ns 2756,no; not applicable,2
1910,ns 2761,no; not applicable,2
2338,ns 433,no; not applicable,2
2348,ns 1533,yes; no,2
3195,ns 2707,no; not applicable,2
3386,ns 3829,yes; not applicable,2
3519,ns 3970,no; not applicable,2


In [16]:
# for those whose two answers include 'not applicable' (e.g. 'no; not applicable' or 'yes; not applicable'), we will replace the answer with 'not applicable':
not_applicable = NS_breastfeeding[(NS_breastfeeding.number_of_answers == 2) 
                                 &(NS_breastfeeding['breastfeeding?'].str.contains('not applicable'))]

In [17]:
# replace the combined answers with "not applicable":
NS_breastfeeding['breastfeeding?'].loc[not_applicable.index] = 'not applicable'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  NS_breastfeeding['breastfeeding?'].loc[not_applicable.index] = 'not applicable'


In [18]:
NS_breastfeeding[NS_breastfeeding.number_of_answers == 2]

Unnamed: 0,participant_ID,breastfeeding?,number_of_answers
358,ns 1400,yes; no,2
1287,ns 976,not applicable,2
1611,ns 2498,not applicable,2
1907,ns 2756,not applicable,2
1910,ns 2761,not applicable,2
2338,ns 433,not applicable,2
2348,ns 1533,yes; no,2
3195,ns 2707,not applicable,2
3386,ns 3829,not applicable,2
3519,ns 3970,not applicable,2


In [19]:
# replace the remaining "yes; no" answers with "prefer not to answer":
prefer_no_answer = NS_breastfeeding[(NS_breastfeeding["number_of_answers"] == 2)
                                    & (NS_breastfeeding['breastfeeding?'] != 'not applicable')]

In [20]:
NS_breastfeeding['breastfeeding?'].loc[prefer_no_answer.index] = 'prefer not to answer'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  NS_breastfeeding['breastfeeding?'].loc[prefer_no_answer.index] = 'prefer not to answer'
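
The three replacement rules applied so far (blank becomes 'prefer not to answer', any pair containing 'not applicable' becomes 'not applicable', and 'yes; no' becomes 'prefer not to answer') can be summarised in one function. This is a minimal sketch of those rules on invented inputs; `resolve_answer` and `raw` are hypothetical names, and it assumes answers are joined with '; ' as in the consolidation step.

```python
import pandas as pd

def resolve_answer(answer):
    """Collapse a consolidated answer string into a single category."""
    parts = [p for p in answer.split('; ') if p]
    if len(parts) == 0:
        # No answer given
        return 'prefer not to answer'
    if len(parts) == 1:
        # A single, unambiguous answer is kept as-is
        return parts[0]
    if 'not applicable' in parts:
        # Any combination that includes 'not applicable'
        return 'not applicable'
    # Contradictory answers such as 'yes; no'
    return 'prefer not to answer'

# Hypothetical examples covering each case seen in the notebook
raw = pd.Series(['', 'no', 'yes; no', 'no; not applicable', 'yes; not applicable'])
resolved = raw.map(resolve_answer)

print(resolved.tolist())
```

Keeping the rules in one place makes it easier to see that 'yes; not applicable' is treated the same way as 'no; not applicable'.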


In [21]:
# Check the data again:
NS_breastfeeding.describe(include= 'all')

Unnamed: 0,participant_ID,breastfeeding?,number_of_answers
count,4054,4054,4054.0
unique,4054,4,
top,ns 757,prefer not to answer,
freq,1,1784,
mean,,,0.562901
std,,,0.501038
min,,,0.0
25%,,,0.0
50%,,,1.0
75%,,,1.0


In [22]:
NS_breastfeeding.isna().sum()

participant_ID       0
breastfeeding?       0
number_of_answers    0
dtype: int64
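
Since empty strings do not show up in `isna()`, a null count alone would not have caught the blanks found earlier. A complementary check is to confirm that the column holds only the four expected categories. The sketch below does this on invented data; `EXPECTED` and `unexpected_answers` are hypothetical names, and the category list is taken from the values seen in this notebook.

```python
import pandas as pd

# The four categories the consolidated column is expected to contain
EXPECTED = {'yes', 'no', 'not applicable', 'prefer not to answer'}

def unexpected_answers(frame, column='breastfeeding?'):
    """Return any values in `column` that are not one of the expected categories."""
    return set(frame[column].unique()) - EXPECTED

# Hypothetical frames: one clean, one with a leftover combined answer
clean = pd.DataFrame({'breastfeeding?': ['yes', 'no', 'prefer not to answer']})
dirty = pd.DataFrame({'breastfeeding?': ['yes', 'yes; no', '']})

print(unexpected_answers(clean))
print(unexpected_answers(dirty))
```

An empty set means every row has been mapped to one of the four categories.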

# Save the data into a csv file:

In [23]:
# we will stop here for this and save the data into csv file.
NS_breastfeeding.to_csv('NS_breastfeeding_final.csv', index = False)