AC1817/Ribeiro_AssignmentC

Assignment C

This assignment highlights some of the challenges you will encounter in real-world data analysis. The data you will be working with is artificially generated and contains several inconsistencies. The goal is to identify these inconsistencies, correct them, and combine the data. The description is intentionally incomplete so that you can put your knowledge, creativity, and skills to work.

Data

Several labs across the globe have measured the gene expression of groups of individuals specifically selected for the study.
The gene expression data has already been collected and merged; the anthropometric data from the different labs, however, is delivered separately. To do the analysis, all data must be combined into a single file containing all information.
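
The combining step described above can be sketched as follows. This is only an illustration using the standard library: the shared identifier column `id`, the added `lab` column, and the helper names `read_rows` and `join_on_key` are assumptions, not part of the assignment, and the real files may need a different key and extra cleaning. pandas would work equally well.

```python
import csv
import io


def read_rows(handle, source=None):
    """Read one CSV stream into a list of dicts, optionally tagging rows with their lab."""
    rows = []
    for row in csv.DictReader(handle):
        # Strip stray whitespace around headers and values; skip overflow fields (key None).
        clean = {k.strip(): (v or "").strip() for k, v in row.items() if k is not None}
        if source is not None:
            clean["lab"] = source  # hypothetical column recording the originating lab
        rows.append(clean)
    return rows


def join_on_key(people, genes, key="id"):
    """Inner-join two lists of dicts on `key`; rows without a match are returned separately."""
    genes_by_key = {g[key]: g for g in genes}
    merged, unmatched = [], []
    for person in people:
        gene_row = genes_by_key.get(person[key])
        if gene_row is None:
            unmatched.append(person)  # keep for inspection instead of silently dropping
        else:
            merged.append({**gene_row, **person})
    return merged, unmatched


# Toy, made-up input purely to show the call pattern (not real assignment data).
people = read_rows(io.StringIO("id, height\nA1, 1.80\nA2,1.70\n"), source="NL")
genes = read_rows(io.StringIO("id,gene_1\nA1,0.5\n"))
merged, unmatched = join_on_key(people, genes)
```

Returning the unmatched rows separately makes it easier to spot identifiers that differ only in formatting between files, which is one kind of inconsistency worth checking for.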

The analysis

Once the data is combined you will be able to carry out the analysis:

  1. Study the distribution of Body Mass Index (BMI) in different populations.
  2. Determine if BMI is related to the expression of certain genes.
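
As a rough sketch of the two analysis steps, the functions below compute BMI and a Pearson correlation coefficient. BMI is weight in kilograms divided by the square of height in metres, so any values recorded in other units would have to be converted first. The function names are illustrative, and Pearson correlation is only one possible way to test for a relationship; the assignment does not prescribe a method.

```python
import math


def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

In the notebook, `pearson_r` would be called once per gene, with the list of BMI values as `xs` and that gene's expression values, in the same row order, as `ys`.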

Deliverables

  1. A single properly merged dataset containing all information.
  2. A report (.ipynb) with two sections, i) data preprocessing, ii) data analysis.

In the data preprocessing part, please enumerate and briefly describe the inconsistencies you have resolved in the data.

Organization of Files/Directories

Please create two separate subdirectories for the data:

  • assignment_c/data : the raw data files that are provided with this assignment (separate zip file: assignment_C_data.zip )
  • assignment_c/out : intermediate and final dataset(s) created by you

This way you can be sure that you never modify the original dataset in assignment_c/data, and that the files generated in assignment_c/out can safely be discarded whenever a re-analysis is needed.
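
One way to keep this separation explicit in the notebook is to define both paths once and write only to the output directory. The helper `prepare_dirs` below is a hypothetical name; it creates `out` if it is missing and never creates or modifies `data`.

```python
from pathlib import Path


def prepare_dirs(root="assignment_c"):
    """Return (data_dir, out_dir); create `out` if needed and leave `data` untouched."""
    root = Path(root)
    data_dir = root / "data"  # raw input files, treated as read-only
    out_dir = root / "out"    # intermediate and final datasets, safe to delete
    out_dir.mkdir(parents=True, exist_ok=True)
    return data_dir, out_dir
```

A notebook cell could then start with `data_dir, out_dir = prepare_dirs()` and save results with a path such as `out_dir / "combined_dataset.csv"`.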

The directory structure for your GitHub repository should look something like this:

.
assignment_C
├── assignment_C.ipynb
├── data
│   ├── NL.csv
│   ├── PL.csv
│   ├── UK.csv
│   ├── US.csv
│   └── genes.csv
└── out
    └── combined_dataset.csv

About

Assignment C for Essentials for Data Science, Leiden University MSc
