# **Data Preprocessing and Cleaning**

## Objectives

- Assessment and Handling of Missing Data:   
Identifying missing values in the dataset and making decisions about filling them in or removing them.

- Categorical Data Transformation:   
Applying encoding methods to convert categorical data into a format suitable for machine learning.

- Numerical Data Normalization:   
Processing numerical data, including scaling or normalization, to enhance the performance of machine learning models.

- Detection and Correction of Anomalies:   
Searching for and correcting data that may be erroneous or implausible.

- Documentation of the Preprocessing and Cleaning Process:   
Providing a detailed description of all steps and decisions taken during data preparation to ensure reproducibility and understanding of the process.

## Inputs

* outputs/datasets/collection/ChildrenAnemia.csv

## Outputs

* Write here which files, code or artefacts you generate by the end of the notebook 

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
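
One caveat with the cells above: running the `os.chdir` cell a second time moves the working directory up another level. The sketch below shows one re-run-safe alternative. It assumes the project root can be recognized by a marker folder such as `outputs` (the folder named under Inputs); the helper name `find_project_root` and the choice of marker are illustrative assumptions, not part of the original notebook.

```python
import os
from pathlib import Path


def find_project_root(start, marker="outputs"):
    """Walk upwards from `start` until a folder containing `marker` is found.

    `marker` is an assumed landmark for the project root; adjust as needed.
    Returns None if no ancestor contains it.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    return None


root = find_project_root(os.getcwd())
if root is not None:
    os.chdir(root)  # re-running is harmless: the root resolves to itself
print(os.getcwd())
```

If no marker is found, the working directory is left untouched, so the cell can be run any number of times without drifting further up the folder tree.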

# Section 1

Section 1 content

### Exploring Unique Values in All Columns

Before planning any preprocessing, we list the unique values in every column, both categorical and numerical. This shows the range of responses in each column, exposes irregularities and special cases (for example, placeholder answers or inconsistent labels), and indicates which columns will need encoding or type conversion before machine learning models can be trained on them.

In [None]:
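# Loading the dataset listed under Inputs (path is relative to the project root)
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/ChildrenAnemia.csv")

# Printing unique values for every column
for col in df.columns:
    print(f"Unique values in '{col}': {df[col].unique()}")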
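
Printing every unique value can be hard to scan for wide datasets. As a complement, the sketch below builds a compact per-column summary. It uses a small toy frame named `sample` with made-up column names so that it runs on its own; on the real data the same three expressions could be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for the real dataset; column names are made up.
sample = pd.DataFrame({
    "Smokes": ["No", "Yes", "No", "Don't know"],
    "Births in last five years": [1, 2, 1, 3],
})

# One row per column: data type, number of distinct values, number of missing values.
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "n_unique": sample.nunique(dropna=False),
    "n_missing": sample.isna().sum(),
})
print(summary)
```

Columns with very few distinct values are likely categorical, while unexpected extra categories (such as "Don't know") stand out immediately in the full unique-value listing.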

### Replacing 'Don't know' Responses with NaN

In certain contexts, a response of "Don't know" may not provide meaningful information for our analysis. Particularly in cases where such responses are equivalent to a lack of data, it's beneficial to replace them with NaN (Not a Number). This is the case for our dataset, where "Don't know" responses in certain columns, such as whether the child's mother took iron supplements, are effectively the same as missing data. Converting these responses to NaN will streamline the dataset for more accurate analysis and modeling.

In [None]:
# Replacing 'Don't know' responses with NaN to treat them as missing data
import numpy as np
df.replace("Don't know", np.nan, inplace=True)
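
The cell above replaces "Don't know" in every column. If the response should count as missing only in certain columns, the replacement can be restricted to those columns. The sketch below illustrates this on toy data; the column names and the `cols_to_clean` list are hypothetical and would need to be adapted to the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data; the column names are illustrative, not taken from the dataset.
sample = pd.DataFrame({
    "Took iron supplements": ["Yes", "Don't know", "No", "Don't know"],
    "Free-text note": ["Don't know", "ok", "ok", "ok"],
})

# Replace only in the columns where "Don't know" really means "missing".
cols_to_clean = ["Took iron supplements"]
sample[cols_to_clean] = sample[cols_to_clean].replace("Don't know", np.nan)

# Count the missing values per column after the replacement.
print(sample.isna().sum())
```

Checking `isna().sum()` right after the replacement also documents how many values each column lost, which is useful when deciding later whether to impute or drop them.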

### Processing 'When child put to breast' Data

We initially assumed the following about the 'When child put to breast' column: 'Immediately' means within the first hour after birth; 'Hours: 1' represents the first 60 minutes; and in the numeric codes the first digit is the day after birth (1 or 2), while the last two digits are the number of hours past the completed 24-hour periods. The data provider confirmed these interpretations. We therefore convert all values to a single unit, hours since birth, which gives this early-life health indicator a standardized format that is easier to analyze.

---

In [None]:
import pandas as pd

# Function to convert data to hours since birth
def convert_to_hours(value):
    # Missing values (NaN) are returned unchanged
    if pd.isna(value):
        return value
    text = str(value)
    # 'Immediately' is treated as 0 hours
    if text == 'Immediately':
        return 0
    # 'Hours: X' -> X hours
    if 'Hours' in text:
        return int(text.split(':')[1].strip())
    # 'Days: X' -> X * 24 hours (so 'Days: 1' means 24 hours after birth)
    if 'Days' in text:
        return int(text.split(':')[1].strip()) * 24
    # Numeric codes: the first digit is the day after birth (1 or 2) and the
    # last two digits are the hours past the completed 24-hour periods
    day, hours = divmod(int(float(text)), 100)
    return (day - 1) * 24 + hours

# Applying the function to the column
df['When child put to breast'] = df['When child put to breast'].apply(convert_to_hours)
df['When child put to breast'].head()
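
To make the numeric-code rule concrete, here is a small worked sketch. The helper name `numeric_code_to_hours` and the sample codes are illustrative and are not taken from the dataset; the arithmetic follows the interpretation described above (first digit is the day, last two digits are the hours).

```python
def numeric_code_to_hours(code):
    """Convert a three-digit code to hours since birth under the stated rule."""
    day, hours = divmod(int(code), 100)
    return (day - 1) * 24 + hours


# Illustrative codes only; expected results follow from (day - 1) * 24 + hours.
examples = {code: numeric_code_to_hours(code) for code in (100, 103, 123, 201, 210)}
for code, hours in examples.items():
    print(f"{code} -> {hours} hours")
# 100 -> 0, 103 -> 3, 123 -> 23, 201 -> 25, 210 -> 34
```

A quick check like this is a cheap way to confirm that the conversion matches the agreed interpretation before it is applied to the whole column.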

### Converting 'Age in 5-year groups' to Numerical Format

The 'Age in 5-year groups' column is currently categorical. To make it usable in numerical analysis and modeling, we represent each group by the midpoint of its age range (for example, '15-19' becomes 17).

In [None]:
# Dictionary to map age groups to the midpoint of each range
age_group_mapping = {'15-19': 17, '20-24': 22, '25-29': 27, '30-34': 32, 
                     '35-39': 37, '40-44': 42, '45-49': 47}

# Applying the mapping to the 'Age in 5-year groups' column
df['Age in 5-year groups'] = df['Age in 5-year groups'].map(age_group_mapping)
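
Note that `Series.map` returns NaN for any label that is not a key in the dictionary, so an unexpected or misspelled age group would silently become missing. The sketch below shows one way to guard against that; the `midpoint` helper and the toy `ages` series are assumptions made for illustration.

```python
import pandas as pd

age_group_mapping = {'15-19': 17, '20-24': 22, '25-29': 27, '30-34': 32,
                     '35-39': 37, '40-44': 42, '45-49': 47}


def midpoint(label):
    """Return the midpoint of a 'low-high' age-group label, e.g. '15-19' -> 17."""
    low, high = (int(part) for part in label.split('-'))
    return (low + high) // 2


# Every hand-written value should equal the midpoint of its label.
mismatches = {k: v for k, v in age_group_mapping.items() if midpoint(k) != v}
print("Mismatched entries:", mismatches)

# Toy series; '50-54' is deliberately absent from the mapping.
ages = pd.Series(['15-19', '30-34', '45-49', '50-54'])
mapped = ages.map(age_group_mapping)
unmapped = sorted(ages[mapped.isna() & ages.notna()].unique())
print("Unmapped labels:", unmapped)
```

Running a check like this before overwriting the column makes it clear whether any new NaN values come from the mapping rather than from the original data.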

# Section 2

Section 2 content

---

NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook cells should run top-down: do not create a workflow in which, at some point, you need to go back to an earlier cell to execute a task (for example, to refresh the content of a variable).

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder is specified
except Exception as e:
  print(e)
