This assignment highlights some of the challenges you will encounter in real-world data analysis. The data you will be working with is artificially generated and contains several inconsistencies. The goal is to identify these inconsistencies, correct them, and combine the data. The description is intentionally incomplete, so that you can put your knowledge, creativity and skills to work.
Several labs across the globe have measured gene expression of groups of individuals specifically selected for the study.
The gene expression data has already been collected and merged; however, the anthropometric data from the different labs is delivered separately. To do the analysis, all data has to be combined into a single file containing all information.
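One possible way to combine the data is sketched below, using only the Python standard library. It is an illustration under assumptions, not the expected solution. The column names (`id`, `height`, `weight`, `gene_1`), the `lab` tag, and the in-memory sample rows are all hypothetical. The real files may use different names, units and separators, and the inconsistencies you are asked to find are not handled here.

```python
import csv
import io

# Hypothetical in-memory stand-ins for two lab files and genes.csv.
# Column names and values are made up for illustration only.
LAB_FILES = {
    "NL": "id,height,weight\n1,1.80,75\n2,1.65,60\n",
    "UK": "id,height,weight\n3,1.70,68\n",
}
GENES = "id,gene_1\n1,0.42\n2,0.17\n3,0.93\n"


def read_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def combine(lab_files, genes_text):
    """Stack the lab tables, tag each row with its lab, and join gene data on 'id'."""
    genes_by_id = {row["id"]: row for row in read_rows(genes_text)}
    combined = []
    for lab, text in lab_files.items():
        for row in read_rows(text):
            merged = {"lab": lab, **row}
            # Rows without matching gene data keep only their anthropometric fields.
            merged.update(genes_by_id.get(row["id"], {}))
            combined.append(merged)
    return combined


combined = combine(LAB_FILES, GENES)
```

Tagging each row with its lab of origin keeps the population information available for the later per-population analysis.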
Once the data is combined you will be able to carry out the analysis:
- Study the distribution of Body Mass Index (BMI) in different populations.
- Determine if BMI is related to the expression of certain genes.
Deliverables:
- A single properly merged dataset containing all information.
- A report (.ipynb) with two sections, i) data preprocessing, ii) data analysis.
In the data preprocessing part, please enumerate and briefly describe the inconsistencies you have resolved in the data.
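For the analysis part, the two quantities involved can be sketched as follows. BMI is the standard weight in kilograms divided by the square of height in metres. The Pearson correlation is only one of several ways to check for a relation between BMI and gene expression, and is an assumption here, not a prescribed method. The helper names and the sample values are made up for illustration.

```python
import math


def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values, purely for illustration.
weights = [60.0, 75.0, 90.0]      # kilograms
heights = [1.65, 1.80, 1.75]      # metres
expression = [0.2, 0.4, 0.9]      # expression level of one hypothetical gene

bmis = [bmi(w, h) for w, h in zip(weights, heights)]
r = pearson_r(bmis, expression)
```

Note that `bmi` assumes consistent units. If the source files mix units, they need to be converted first.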
Please create two separate subdirectories for the data:
- assignment_C/data: the raw data files provided with this assignment (separate zip file: assignment_C_data.zip)
- assignment_C/out: intermediate and final dataset(s) created by you
This way you can be sure that you will not modify the original dataset in assignment_C/data, and that files generated in assignment_C/out can safely be discarded whenever a re-analysis is needed.
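A minimal sketch of how a notebook could enforce this separation is shown below. The helper name `prepare_dirs` is hypothetical, and the default root folder name is an assumption that should be adjusted to your repository. The raw data folder is only referenced, never created or written to.

```python
from pathlib import Path


def prepare_dirs(root="assignment_C"):
    """Return (data_dir, out_dir) and make sure the output folder exists.

    The data folder is returned as-is and is treated as read-only.
    """
    root = Path(root)
    data_dir = root / "data"
    out_dir = root / "out"
    # Create only the output folder; exist_ok makes re-runs harmless.
    out_dir.mkdir(parents=True, exist_ok=True)
    return data_dir, out_dir
```

In the notebook, calling `data_dir, out_dir = prepare_dirs()` once at the top and writing every generated file under `out_dir` means the whole `out` folder can be deleted and regenerated at any time.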
The directory structure of your GitHub repository should look something like this:
.
assignment_C
├── assignment_C.ipynb
├── data
│   ├── NL.csv
│   ├── PL.csv
│   ├── UK.csv
│   ├── US.csv
│   └── genes.csv
└── out
    └── combined_dataset.csv