## Context

## Objective

- Explore and visualize the dataset (first part)


### Overall solution design

The potential solution design would look like this:

- Reviewing the data description to get basic summary statistics of the data.
- Univariate analysis to see how each variable is distributed and to identify outliers.
- Bivariate analysis to see how the attributes vary with the dependent variable.
- Outlier treatment, if needed.
- Missing value treatment using appropriate techniques.
- Feature engineering: transforming features and creating new features where possible.
- Choosing the model evaluation metrics: 1) R-squared, 2) RMSE, or any other metric suited to regression analysis.
- Splitting the data and proceeding with modeling.
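The two evaluation metrics named in the list can be sketched in a few lines. This is a minimal, dependency-free illustration; the function names `r_squared` and `rmse` and the sample values are hypothetical and do not come from the project data. In practice a library implementation would likely be used instead.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Hypothetical observed and predicted values, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

print(r_squared(y_true, y_pred))
print(rmse(y_true, y_pred))
```

R-squared is unitless and compares the model against a mean-only baseline, while RMSE reports the typical error size in the units of the dependent variable, so the two are usually read together.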


## Importing the necessary libraries and overview of the dataset

In [1]:
# Importing the basic libraries we will require for the project

# Import libraries for data manipulation
import pandas as pd
import numpy as np
import os
# Import libraries for data visualization
import matplotlib.pyplot as plt

# Slightly advanced library for data visualization            
import seaborn as sns      

# Import necessary modules
import geopandas as gpd
from geopy.exc import GeocoderTimedOut, GeocoderServiceError

# Import module for geocoding
from geopy.geocoders import Nominatim

# dython's associations() computes the strength of association between categorical and continuous columns.
# The same module also provides identify_nominal_columns(dataset) to identify the categorical variables in the dataset.
from dython.nominal import associations

# add sleep time
from time import sleep

import logging

# Set up the color scheme (classification schemes for maps):
import mapclassify as mc

# to compute zscores: https://pypi.org/project/cgmzscore/
# Resource R: https://rdrr.io/github/WorldHealthOrganization/anthroplus/man/anthroplus_zscores.html
#from cgmzscore.src.main import z_score_lhfa
#from cgmzscore.src.main import z_score_wfa
#import ast
#https://github.com/ewheeler/pygrowup
#from pygrowup import Observation
#from decimal import Decimal

import datetime
# Release memory using gc: the gc module lets us trigger garbage collection manually.
# Garbage collection frees memory that is no longer being used by the program.
# gc.collect() returns the number of unreachable objects it found.
import gc

gc.collect()


0

Activate R in Python, then install the *tidyverse* and *gtsummary* packages. The `rpy2` Python package must be installed for the `%load_ext rpy2.ipython` magic to load.

In [2]:
# activate R magic
%load_ext rpy2.ipython

ModuleNotFoundError: No module named 'rpy2'

In [None]:
%%R
install.packages(c("tidyverse", "gtsummary"))

library(tidyverse)
library(gtsummary)

## Loading the data

In [None]:
%%R
# Read the household Excel files for rounds 1 to 3 into R
# (paths are relative to the working directory)
df_r1_hh <- readxl::read_excel("output/data/r1_hh.xlsx")
df_r2_hh <- readxl::read_excel("output/data/r2_hh.xlsx")
df_r3_hh <- readxl::read_excel("output/data/r3_hh.xlsx")

In [None]:
# Folder holding the survey data; os.path.join keeps the paths portable across operating systems
data_dir = os.path.join(os.getcwd(), 'output', 'data')

# Round 1 datasets
df_r1_hh = pd.read_excel(os.path.join(data_dir, 'r1_hh.xlsx'))
df_r1_anthr1 = pd.read_excel(os.path.join(data_dir, 'r1_overfive.xlsx'))
df_r1_anthr2 = pd.read_excel(os.path.join(data_dir, 'r1_underfive.xlsx'))

# Round 2 datasets
df_r2_hh = pd.read_excel(os.path.join(data_dir, 'r2_hh.xlsx'))
df_r2_anthr1 = pd.read_excel(os.path.join(data_dir, 'r2_overfive.xlsx'))
df_r2_anthr2 = pd.read_excel(os.path.join(data_dir, 'r2_underfive.xlsx'))

# Round 3 datasets
df_r3_hh = pd.read_excel(os.path.join(data_dir, 'r3_hh.xlsx'))
df_r3_anthr1 = pd.read_excel(os.path.join(data_dir, 'r3_overfive.xlsx'))
df_r3_anthr2 = pd.read_excel(os.path.join(data_dir, 'r3_underfive.xlsx'))

# Importing the Bangladesh raw map: reading a shapefile containing administrative boundaries of Bangladesh
bgd_adm = gpd.read_file(os.path.join(os.getcwd(), 'input', 'shapefile_data', 'shapefile_zip', 'BGD_adm', 'BGD_adm3.shp'))
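Since the nine survey files follow one naming pattern (`r<round>_<kind>.xlsx` under `output/data`), the paths could also be generated rather than typed out. The sketch below is one possible way to do that, not the notebook's own approach; the helper `survey_paths`, the constants `ROUNDS` and `KINDS`, and the dictionary keys are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Survey rounds and file kinds, taken from the file names used in this notebook
ROUNDS = ("r1", "r2", "r3")
KINDS = ("hh", "overfive", "underfive")


def survey_paths(base_dir):
    """Map names such as 'r1_hh' to the matching Excel path under base_dir/output/data."""
    data_dir = Path(base_dir) / "output" / "data"
    return {
        f"{rnd}_{kind}": data_dir / f"{rnd}_{kind}.xlsx"
        for rnd in ROUNDS
        for kind in KINDS
    }


# Build the paths relative to the current working directory
paths = survey_paths(".")
for name, path in paths.items():
    print(name, path)
```

With pandas imported as `pd`, the files could then be read in one pass with `{name: pd.read_excel(path) for name, path in paths.items()}`. Because `pathlib` builds the paths, the same code works on Windows, macOS, and Linux.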