# **Data Collection**

## Objectives

- Downloading and Initial Analysis of the ChildrenAnemia.csv Dataset:     
Acquiring raw data from Kaggle and performing an initial exploration to understand the dataset's structure, content, and basic statistical summaries using df.info() and other exploratory data analysis methods.

- Checking for Duplicate Column Names:  
Ensuring that the dataset does not have any columns with duplicate names, which could lead to confusion and errors in data analysis.

- Documentation of the Data Collection Process:  
Providing a comprehensive description of the steps and decisions taken during data collection, to guarantee the reproducibility and clarity of the process.


## Inputs

- Kaggle authentication token (kaggle.json) for access to datasets on Kaggle.


## Process

- Using the Kaggle API to download the ChildrenAnemia.csv dataset.
- Initial data analysis and preparation for further analysis.

## Outputs

- Generate Dataset: outputs/datasets/collection/ChildrenAnemia.csv

## Additional Comments

- In a professional setting, project data typically originates from various sources, often combining internal data (such as from company-owned data warehouses) and external data. Unlike this educational project where we are sourcing data from Kaggle, real-world projects involve a more complex data acquisition process.
- It's also important to note that in a business environment, due to confidentiality and security concerns, data is rarely, if ever, stored or shared through public repositories. This project, designed for learning purposes, is an exception where we use a public repository for ease of access and demonstration.
- In a real-world scenario, it is standard practice to add directories such as inputs/datasets/raw and outputs/datasets/ to the .gitignore file. This ensures that sensitive data, particularly data owned by a client, is not inadvertently pushed to a public repository without explicit consent. However, so that this project can be evaluated and reviewers can run the Jupyter notebooks without extra setup, these directories are kept under version control.


---

# Change working directory

We store our Jupyter notebooks in a subfolder of the project. Therefore, when we run the notebooks in the editor, we need to change the working directory. This is necessary to ensure proper access to data files and other project resources that might be located outside the notebook's subfolder.

We need to change the working directory from its current folder to its parent folder.
- To get the current working directory, we use the os.getcwd() command.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/ChildrenAnemiaRisk/jupyter_notebooks'

Then, we change the working directory from its current folder to its parent folder to facilitate the correct file path references within our notebooks.
* os.path.dirname() returns the parent directory of the given path
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/ChildrenAnemiaRisk'
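
The `os.chdir()` cell above moves up one level every time it runs, so running it twice would leave the project root. A minimal sketch of a guard against that, assuming the notebooks folder is named `jupyter_notebooks` as in the output above; the helper name `project_root_from` is hypothetical and not part of the original notebook:

```python
import os

def project_root_from(path, notebooks_dir="jupyter_notebooks"):
    """Return the parent of `path` only when `path` is the notebooks folder."""
    path = os.path.normpath(path)
    if os.path.basename(path) == notebooks_dir:
        return os.path.dirname(path)
    # Already outside the notebooks folder: keep the directory unchanged.
    return path

# Safe to re-run: the directory changes only while we are still inside the notebooks folder.
os.chdir(project_root_from(os.getcwd()))
print(os.getcwd())
```

With this guard, restarting the kernel and re-running all cells gives the same working directory as a single run.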

---

# Fetch data from Kaggle
Install the kaggle package to fetch the data.

In [4]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


Please note that to run this section, you must first upload your personal kaggle.json file into the workspace. This file is necessary for authenticating your requests to Kaggle. In this code block, we set the KAGGLE_CONFIG_DIR environment variable to the current working directory (the project root), which is where the Kaggle client will look for the token. Additionally, we set the file permissions of kaggle.json to 600, so that only the file's owner can read and write it. This keeps the credentials private and avoids the Kaggle client's warning about a token that other users can read.

In [30]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
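
The `chmod 600` call above can also be made from Python, which makes it easy to confirm the resulting mode. This is a minimal sketch rather than part of the original notebook; the helper name `restrict_to_owner` is hypothetical, and it assumes a POSIX file system where permission bits apply:

```python
import os
import stat

def restrict_to_owner(path):
    """Set mode 600 (owner read/write only) and return the resulting permission bits."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)

token_path = "kaggle.json"  # the token file expected in the project root
if os.path.isfile(token_path):
    # oct() shows the mode in the familiar chmod notation.
    print(oct(restrict_to_owner(token_path)))
else:
    print("kaggle.json not found; upload it before running the download cell")
```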

Now, we'll set the path for the Kaggle dataset and create a specific directory for it. Following that, we'll execute a command through Kaggle's interface to download the dataset into this newly created directory.

In [31]:
KaggleDatasetPath = "adeolaadesina/factors-affecting-children-anemia-level"
DestinationFolder = "inputs/datasets/raw"
# Create the destination folder; exist_ok avoids an error if it already exists
os.makedirs(DestinationFolder, exist_ok=True)

! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading factors-affecting-children-anemia-level.zip to inputs/datasets/raw
100%|████████████████████████████████████████| 258k/258k [00:00<00:00, 1.05MB/s]
100%|████████████████████████████████████████| 258k/258k [00:00<00:00, 1.04MB/s]


Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [32]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/factors-affecting-children-anemia-level.zip
  inflating: inputs/datasets/raw/children anemia.csv  
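
The shell command above relies on `unzip` and `rm` being available. Where they are not, the same extraction can be sketched with the standard library. The helper name `extract_and_remove_archives` is hypothetical, and this sketch does not delete kaggle.json, which would still need to be removed separately:

```python
import glob
import os
import zipfile

def extract_and_remove_archives(folder):
    """Extract every .zip in `folder` into it, delete the archives, return extracted names."""
    extracted = []
    for archive in glob.glob(os.path.join(folder, "*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Remove the archive only after it has been extracted and closed.
        os.remove(archive)
    return extracted

# Prints an empty list when the folder holds no archives.
print(extract_and_remove_archives("inputs/datasets/raw"))
```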


---

# Load and Inspect Kaggle data

In [4]:
import pandas as pd
pd.set_option('display.max_rows', None)
df = pd.read_csv("inputs/datasets/raw/children anemia.csv")
df.head()

Unnamed: 0,Age in 5-year groups,Type of place of residence,Highest educational level,Wealth index combined,Births in last five years,Age of respondent at 1st birth,Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal),Anemia level,Have mosquito bed net for sleeping (from household questionnaire),Smokes cigarettes,Current marital status,Currently residing with husband/partner,When child put to breast,Had fever in last two weeks,Hemoglobin level adjusted for altitude (g/dl - 1 decimal),Anemia level.1,"Taking iron pills, sprinkles or syrup"
0,40-44,Urban,Higher,Richest,1,22,,,Yes,No,Living with partner,Staying elsewhere,Immediately,No,,,Yes
1,35-39,Urban,Higher,Richest,1,28,,,Yes,No,Married,Living with her,Hours: 1,No,,,No
2,25-29,Urban,Higher,Richest,1,26,,,No,No,Married,Living with her,Immediately,No,,,No
3,25-29,Urban,Secondary,Richest,1,25,95.0,Moderate,Yes,No,Married,Living with her,105.0,No,114.0,Not anemic,No
4,20-24,Urban,Secondary,Richest,1,21,,,Yes,No,No longer living together/separated,,Immediately,No,,,No


### Check Data
This section provides a concise overview of the DataFrame. df.info() reports the number of rows and columns, the name and data type of each column, the count of non-null values per column, and the memory usage. This summary gives an initial understanding of the data's structure and shows which columns contain missing values.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33924 entries, 0 to 33923
Data columns (total 17 columns):
 #   Column                                                                 Non-Null Count  Dtype  
---  ------                                                                 --------------  -----  
 0   Age in 5-year groups                                                   33924 non-null  object 
 1   Type of place of residence                                             33924 non-null  object 
 2   Highest educational level                                              33924 non-null  object 
 3   Wealth index combined                                                  33924 non-null  object 
 4   Births in last five years                                              33924 non-null  int64  
 5   Age of respondent at 1st birth                                         33924 non-null  int64  
 6   Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)  13136 non-null 
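
The non-null counts reported by `df.info()` can be turned into a share of missing values per column, which is often easier to compare across columns. This is a minimal sketch on a small made-up DataFrame rather than the real dataset; the helper name `missing_share` is hypothetical:

```python
import pandas as pd

def missing_share(df):
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)

# Illustrative data only; with the notebook's DataFrame, call missing_share(df).
demo = pd.DataFrame({
    "Anemia level": ["Moderate", None, None, None],
    "Smokes cigarettes": ["No", "No", "No", "No"],
})
print(missing_share(demo))
```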

---

### Checking for Duplicate Column Names

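Before proceeding with data preprocessing, it's important to identify any duplicate column names in the dataset. Duplicate column names can cause confusion and errors in data analysis. This step ensures that each column has a unique identifier. Note that pandas.read_csv() by default renames repeated header names by appending a numeric suffix (for example 'Anemia level.1'), so this check runs on the already adjusted names.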

In [6]:
duplicate_column_names = df.columns[df.columns.duplicated()]

# Display the results
print(duplicate_column_names)

Index([], dtype='object')
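
The check above returns an empty index because `pd.read_csv()` has already renamed the repeated header, which is why 'Anemia level.1' appears in the `df.head()` output. To see duplicates as they appear in the raw file, the header row can be read before pandas touches it. This is a minimal sketch using the standard `csv` module; the helper name `duplicated_header_names` is hypothetical:

```python
import csv
import io
from collections import Counter

def duplicated_header_names(csv_file):
    """Return header names that occur more than once in the first row of an open CSV file."""
    header = next(csv.reader(csv_file))
    return [name for name, count in Counter(header).items() if count > 1]

# Small in-memory example; for the real file, pass open(path, newline="") instead.
sample = io.StringIO("Anemia level,Smokes cigarettes,Anemia level\nModerate,No,Not anemic\n")
print(duplicated_header_names(sample))
```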


### Renaming Columns for Clarity

Although the check above found no duplicate column names, two columns carry the label 'Anemia level' (loaded as 'Anemia level' and 'Anemia level.1'), which invites confusion. To avoid misinterpreting the data, we rename these columns. This decision was made after consulting with the data provider. The two 'Anemia level' columns are distinguished as mother and child, and the column 'Hemoglobin level adjusted for altitude' is marked as the child's measurement. These changes make the columns unambiguous for the later analysis.

In [7]:
# Rename columns for clarity
df.rename(columns={'Anemia level': 'Anemia level mother', 
                   'Anemia level.1': 'Anemia level child',
                   'Hemoglobin level adjusted for altitude (g/dl - 1 decimal)': 'Hemoglobin level child adjusted for altitude (g/dl - 1 decimal)'},
          inplace=True)
print(df.columns)

Index(['Age in 5-year groups', 'Type of place of residence',
       'Highest educational level', 'Wealth index combined',
       'Births in last five years', 'Age of respondent at 1st birth',
       'Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)',
       'Anemia level mother',
       'Have mosquito bed net for sleeping (from household questionnaire)',
       'Smokes cigarettes', 'Current marital status',
       'Currently residing with husband/partner', 'When child put to breast',
       'Had fever in last two weeks',
       'Hemoglobin level child adjusted for altitude (g/dl - 1 decimal)',
       'Anemia level child', 'Taking iron pills, sprinkles or syrup'],
      dtype='object')


### Rationale for Not Checking Duplicate Rows in the Dataset
In the context of our dataset, which focuses on various health and demographic factors related to anemia in children, it is important to consider the nature of the data when deciding whether to check for duplicate rows.

Each row in our dataset represents a unique individual, with their respective demographic and health-related characteristics. Even if the data in two rows appear to be identical, they still represent different individuals. In the realm of health and demographic studies, each individual case contributes to the overall statistical analysis, making every entry significant in its own right.

Therefore, checking for and removing duplicate rows would not be appropriate in this context. It could lead to the unintended exclusion of valid data points, which are crucial for a comprehensive statistical analysis. 

---

# Push data to Repo


Save the downloaded and minimally processed data to a file that will be pushed to the repo.

In [8]:
# Create the output directory if it does not exist
output_dir = 'outputs/datasets/collection'
os.makedirs(output_dir, exist_ok=True)

# Save the DataFrame as a CSV file, without the index column
df.to_csv(f"{output_dir}/ChildrenAnemia.csv", index=False)
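
After saving, it can be worth confirming that the file on disk carries the renamed columns. This is a minimal sketch that reads only the header row with the standard `csv` module; the helper name `saved_header` is hypothetical, and the expected names are the ones introduced in the renaming step:

```python
import csv
import os

def saved_header(path):
    """Return the header row of the CSV file at `path` as a list of column names."""
    with open(path, newline="") as f:
        return next(csv.reader(f))

saved_path = "outputs/datasets/collection/ChildrenAnemia.csv"
expected = ["Anemia level mother", "Anemia level child"]
if os.path.isfile(saved_path):
    header = saved_header(saved_path)
    # An empty list means every renamed column was written to the file.
    print([name for name in expected if name not in header])
else:
    print("Saved file not found; run the cell above first")
```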