# **Data Collection**

## Objectives

- Downloading and initial analysis of the ChildrenAnemia.csv dataset using data from Kaggle.

## Inputs

- Kaggle authentication token (kaggle.json) for access to datasets on Kaggle.

## Process

- Using the Kaggle API to download the ChildrenAnemia.csv dataset.
- Initial data analysis and preparation for further analysis.

## Outputs

- Detailed outputs will be added as the project progresses.

## Additional Comments

- In a professional setting, project data typically originates from various sources, often combining internal data (such as from company-owned data warehouses) and external data. Unlike this educational project where we are sourcing data from Kaggle, real-world projects involve a more complex data acquisition process.
- It's also important to note that in a business environment, due to confidentiality and security concerns, data is rarely, if ever, stored or shared through public repositories. This project, designed for learning purposes, is an exception where we use a public repository for ease of access and demonstration.
- In a real-world scenario, it is standard practice to add the directories such as inputs/datasets/raw and outputs/datasets/ to the .gitignore file. This is to ensure that sensitive data, particularly data owned by a client, is not inadvertently pushed to a public repository without explicit consent. However, for the purpose of this project's evaluation and to ensure the seamless operation of Jupyter notebooks for reviewers, these directories will be kept within the repository's version control.


---

# Change working directory

We store our Jupyter notebooks in a subfolder of the project. Therefore, when we run the notebooks in the editor, we need to change the working directory. This is necessary to ensure proper access to data files and other project resources that might be located outside the notebook's subfolder.

We need to change the working directory from its current folder to its parent folder.
- To access the current working directory, we use the os.getcwd() command.

In [9]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/ChildrenAnemiaRisk/jupyter_notebooks'

Then, we change the working directory from its current folder to its parent folder to facilitate the correct file path references within our notebooks.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [10]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [11]:
current_dir = os.getcwd()
current_dir

'/workspace/ChildrenAnemiaRisk'
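
Re-running the cell that calls os.chdir() moves one level further up each time, which can leave the notebook outside the project. A minimal sketch of a guard against that is shown below. The helper name `project_root` and the default folder name `jupyter_notebooks` (taken from the path printed above) are assumptions for illustration, not part of the original notebook, and the sketch assumes POSIX-style paths.

```python
import os

def project_root(current_dir, notebooks_folder="jupyter_notebooks"):
    """Return the parent of current_dir only when current_dir is the notebooks folder.

    Hypothetical helper: any other directory is returned unchanged,
    so calling it repeatedly never climbs above the project root.
    """
    normalized = os.path.normpath(current_dir)
    if os.path.basename(normalized) == notebooks_folder:
        return os.path.dirname(normalized)
    return current_dir

# Safe to run more than once: it is a no-op when we are already in the project root.
os.chdir(project_root(os.getcwd()))
```

With this guard, restarting the kernel or re-executing the cell gives the same working directory each time.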

---

# Fetch data from Kaggle
Install the kaggle package to fetch the data.

In [13]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


Please note that to run this section, you must first upload your personal kaggle.json file into the workspace. This file authenticates your requests to Kaggle. In this code block, we set the KAGGLE_CONFIG_DIR environment variable to point to the project's directory, so the Kaggle client looks for kaggle.json there. We also run chmod 600 on kaggle.json, which restricts read and write access to the file's owner. This keeps the API token private from other users on the system.

In [30]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
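
The same permission change can be made from Python, which also allows checking that the token file exists before the download step. The following is a sketch under stated assumptions: the helper name `restrict_to_owner` is hypothetical, and the mode bits only behave as described on POSIX systems.

```python
import os
import stat

def restrict_to_owner(path):
    """Set owner-only read/write (the equivalent of chmod 600) and return the resulting mode bits."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)

token_path = "kaggle.json"  # expected in the current working directory
if os.path.isfile(token_path):
    print(oct(restrict_to_owner(token_path)))
else:
    print(f"{token_path} not found; upload it before running the download step")
```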

Now, we'll set the path for the Kaggle dataset and create a specific directory for it. Following that, we'll execute a command through Kaggle's interface to download the dataset into this newly created directory.

In [31]:
KaggleDatasetPath = "adeolaadesina/factors-affecting-children-anemia-level"
DestinationFolder = "inputs/datasets/raw"
if not os.path.isdir(DestinationFolder):
    os.makedirs(DestinationFolder)

! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading factors-affecting-children-anemia-level.zip to inputs/datasets/raw
100%|████████████████████████████████████████| 258k/258k [00:00<00:00, 1.05MB/s]
100%|████████████████████████████████████████| 258k/258k [00:00<00:00, 1.04MB/s]


Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [32]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/factors-affecting-children-anemia-level.zip
  inflating: inputs/datasets/raw/children anemia.csv  
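
Where the unzip and rm shell commands are not available (for example on Windows), the extraction step could be done with Python's standard library instead. This is a minimal sketch, not the notebook's own method; the function name `extract_and_remove` is hypothetical, and unlike the shell cell it does not delete kaggle.json.

```python
import glob
import os
import zipfile

def extract_and_remove(folder):
    """Extract every .zip archive in folder into folder, delete the archives,
    and return the names of the extracted members."""
    extracted = []
    for zip_path in glob.glob(os.path.join(folder, "*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        os.remove(zip_path)  # remove the archive only after a successful extraction
    return extracted
```

A call such as `extract_and_remove(DestinationFolder)` would then stand in for the unzip and first rm commands.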


---

## Load and Inspect Kaggle data

In [33]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/children anemia.csv")
df.head()

Unnamed: 0,Age in 5-year groups,Type of place of residence,Highest educational level,Wealth index combined,Births in last five years,Age of respondent at 1st birth,Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal),Anemia level,Have mosquito bed net for sleeping (from household questionnaire),Smokes cigarettes,Current marital status,Currently residing with husband/partner,When child put to breast,Had fever in last two weeks,Hemoglobin level adjusted for altitude (g/dl - 1 decimal),Anemia level.1,"Taking iron pills, sprinkles or syrup"
0,40-44,Urban,Higher,Richest,1,22,,,Yes,No,Living with partner,Staying elsewhere,Immediately,No,,,Yes
1,35-39,Urban,Higher,Richest,1,28,,,Yes,No,Married,Living with her,Hours: 1,No,,,No
2,25-29,Urban,Higher,Richest,1,26,,,No,No,Married,Living with her,Immediately,No,,,No
3,25-29,Urban,Secondary,Richest,1,25,95.0,Moderate,Yes,No,Married,Living with her,105.0,No,114.0,Not anemic,No
4,20-24,Urban,Secondary,Richest,1,21,,,Yes,No,No longer living together/separated,,Immediately,No,,,No


`df.info()` provides a concise overview of the DataFrame: the number of rows and columns, the data type and non-null count of each column, and the memory usage. The non-null counts are especially useful here, because they show at a glance which columns have missing values.

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33924 entries, 0 to 33923
Data columns (total 17 columns):
 #   Column                                                                 Non-Null Count  Dtype  
---  ------                                                                 --------------  -----  
 0   Age in 5-year groups                                                   33924 non-null  object 
 1   Type of place of residence                                             33924 non-null  object 
 2   Highest educational level                                              33924 non-null  object 
 3   Wealth index combined                                                  33924 non-null  object 
 4   Births in last five years                                              33924 non-null  int64  
 5   Age of respondent at 1st birth                                         33924 non-null  int64  
 6   Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)  13136 non-null 
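
The non-null counts above show that some columns are only partly filled. One way to quantify that is a per-column missing-value summary. The helper `missing_summary` below is a hypothetical name introduced for illustration; it uses only `DataFrame.isna()`, `sum()`, and `sort_values()`.

```python
import pandas as pd

def missing_summary(frame):
    """Return missing-value counts and percentages per column, most missing first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)
```

Calling `missing_summary(df)` on the loaded DataFrame would list the columns with the most gaps at the top.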

---

Check Data Types  
Ensure that each column has the appropriate data type. Numeric data should be of type `int` or `float`, while categorical data should be of type `object` or `category`.

In [36]:
df.dtypes

Age in 5-year groups                                                      object
Type of place of residence                                                object
Highest educational level                                                 object
Wealth index combined                                                     object
Births in last five years                                                  int64
Age of respondent at 1st birth                                             int64
Hemoglobin level adjusted for altitude and smoking (g/dl - 1 decimal)    float64
Anemia level                                                              object
Have mosquito bed net for sleeping (from household questionnaire)         object
Smokes cigarettes                                                         object
Current marital status                                                    object
Currently residing with husband/partner                                   object
When child put to breast    
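
If the `category` type is preferred over `object` for the categorical columns, the conversion could look like the sketch below. It assumes the text columns were loaded with the `object` dtype, as in the output above; the function name `objects_to_category` is hypothetical.

```python
import pandas as pd

def objects_to_category(frame):
    """Return a copy of frame with every object-dtype column converted to category."""
    result = frame.copy()
    object_columns = result.select_dtypes(include="object").columns
    for column in object_columns:
        result[column] = result[column].astype("category")
    return result
```

The function works on a copy, so the original DataFrame keeps its dtypes until the result is assigned back.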

Convert Boolean Values  
Columns that hold only 'Yes' and 'No' answers can be converted into a numerical format (1 and 0).

In [40]:
df['Smokes cigarettes'] = df['Smokes cigarettes'].map({'Yes': 1, 'No': 0})

In [41]:
df['Have mosquito bed net for sleeping (from household questionnaire)'] = df['Have mosquito bed net for sleeping (from household questionnaire)'].map({'Yes': 1, 'No': 0})

In [42]:
df['Had fever in last two weeks'] = df['Had fever in last two weeks'].map({'Yes': 1, 'No': 0})

In [43]:
df['Taking iron pills, sprinkles or syrup'] = df['Taking iron pills, sprinkles or syrup'].map({'Yes': 1, 'No': 0})
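
Note that `Series.map()` with a dictionary turns every value that is not a key into NaN. Re-running the cells above on an already converted column would therefore replace all values with NaN. The sketch below guards against that by raising an error on unexpected values; the names `YES_NO` and `encode_yes_no` are hypothetical and not part of the original notebook.

```python
import pandas as pd

YES_NO = {"Yes": 1, "No": 0}

def encode_yes_no(series):
    """Map 'Yes'/'No' to 1/0; raise if any other non-missing value is present."""
    unexpected = set(series.dropna().unique()) - set(YES_NO)
    if unexpected:
        raise ValueError(f"Unexpected values: {sorted(map(str, unexpected))}")
    return series.map(YES_NO)  # missing values stay missing
```

A call such as `df['Smokes cigarettes'] = encode_yes_no(df['Smokes cigarettes'])` would then fail loudly on a second run instead of silently erasing the column.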

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder is created
except Exception as e:
  print(e)
