GitHub - AC1817/Ribeiro_AssignmentC: Assignment C for Essentials for Data Science Leiden University MSc

Assignment C

This assignment highlights some of the challenges you will encounter in real-world data analysis. The data you will be working with is artificially generated and contains several inconsistencies. The goal is to identify these inconsistencies, correct them, and combine the data. The description is intentionally incomplete so that you can put your knowledge, creativity and skills to work.

Data

Several labs across the globe have measured gene expression in groups of individuals specifically selected for the study.
The gene expression data has already been collected and merged; the anthropometric data from the different labs, however, is delivered separately. To carry out the analysis, all data must be combined into a single file containing all information.
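
The combination step described above could be sketched roughly as follows with pandas. This is only an illustration on toy data: the key column `id`, the columns `height`, `weight` and `gene_a`, and the helper `combine_labs` are assumptions, not the actual layout of the provided files, which have to be inspected (and cleaned) first.

```python
import pandas as pd

def combine_labs(lab_frames):
    """Stack per-lab tables and record which lab each row came from."""
    tagged = [df.assign(lab=name) for name, df in lab_frames.items()]
    return pd.concat(tagged, ignore_index=True)

# Toy stand-ins for the per-lab anthropometric files (hypothetical columns).
nl = pd.DataFrame({"id": [1, 2], "height": [1.80, 1.65], "weight": [75.0, 60.0]})
uk = pd.DataFrame({"id": [3], "height": [1.70], "weight": [68.0]})

# Toy stand-in for the already merged gene expression table.
genes = pd.DataFrame({"id": [1, 2, 3, 4], "gene_a": [0.1, 0.2, 0.3, 0.4]})

anthro = combine_labs({"NL": nl, "UK": uk})

# An outer join keeps unmatched rows on both sides; indicator=True adds a
# `_merge` column saying where each row came from, and validate raises an
# error if an identifier occurs more than once in either table.
merged = anthro.merge(
    genes, on="id", how="outer", validate="one_to_one", indicator=True
)

# Rows present in only one of the two sources deserve a closer look.
unmatched = merged[merged["_merge"] != "both"]
```

Starting with an outer join makes missing or mismatched identifiers visible instead of dropping them silently; whether such rows are kept in the final dataset is a decision to document in the preprocessing section.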

The analysis

Once the data is combined you will be able to carry out the analysis:

Study the distribution of Body Mass Index (BMI) in different populations.
Determine if BMI is related to the expression of certain genes.
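
As a rough sketch of both analysis steps: BMI is weight in kilograms divided by the square of height in metres, and a simple first look at its relation to gene expression is a Pearson correlation per gene. The column names (`weight_kg`, `height_m`, `country`, `gene_a`, `gene_b`) and the toy values below are assumptions for illustration only, and the sketch presumes that units have already been harmonised across labs.

```python
import pandas as pd

def add_bmi(df, weight_col="weight_kg", height_col="height_m"):
    """Return a copy of df with a `bmi` column (kg / m^2)."""
    out = df.copy()
    out["bmi"] = out[weight_col] / out[height_col] ** 2
    return out

def bmi_gene_correlations(df, gene_cols):
    """Pearson correlation of each gene column with BMI, strongest first."""
    corr = df[gene_cols].corrwith(df["bmi"])
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)

# Toy data with hypothetical column names.
people = pd.DataFrame({
    "country": ["NL", "NL", "UK"],
    "weight_kg": [60.0, 80.0, 100.0],
    "height_m": [2.0, 2.0, 2.0],
    "gene_a": [1.0, 2.0, 3.0],
    "gene_b": [3.0, 1.0, 2.0],
})

people = add_bmi(people)

# Summary of the BMI distribution per population.
bmi_by_country = people.groupby("country")["bmi"].agg(["count", "mean", "median"])

# Correlation of BMI with each gene.
corr = bmi_gene_correlations(people, ["gene_a", "gene_b"])
```

Pearson correlation only captures linear association; a rank-based measure or a regression model may be more appropriate depending on what the real data look like.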

Deliverables

A single properly merged dataset containing all information.
A report (.ipynb) with two sections, i) data preprocessing, ii) data analysis.

In the data preprocessing part, please enumerate and briefly describe the inconsistencies you've resolved in the data.

Organization of Files/Directories

Please create two separate subdirectories for the data:

assignment_c/data : the raw data files provided with this assignment (separate zip file: assignment_C_data.zip)
assignment_c/out : intermediate and final dataset(s) created by you

This way you can be sure that you will not modify the original dataset in assignment_c/data, and that files generated in assignment_c/out can safely be discarded whenever a re-analysis is needed.
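
One way to enforce this separation in the notebook is to define both locations once and route every write through the output directory. The sketch below is a minimal example under that assumption; the helper names `prepare_dirs` and `save_output` are hypothetical.

```python
from pathlib import Path

import pandas as pd

def prepare_dirs(root):
    """Return (data_dir, out_dir) under root, creating only the output directory."""
    root = Path(root)
    data_dir = root / "data"  # raw input: read from here, never write
    out_dir = root / "out"    # generated files: safe to delete and regenerate
    out_dir.mkdir(parents=True, exist_ok=True)
    return data_dir, out_dir

def save_output(df, out_dir, name="combined_dataset.csv"):
    """Write a DataFrame into the output directory and return the file path."""
    target = Path(out_dir) / name
    df.to_csv(target, index=False)
    return target
```

In the notebook this could be called as `data_dir, out_dir = prepare_dirs("assignment_c")`, after which raw files are only read from `data_dir` and results are only written with `save_output(..., out_dir)`.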

The directory structure of your GitHub repository should look something like this:

.
assignment_C
├── assignment_C.ipynb
├── data
│   ├── NL.csv
│   ├── PL.csv
│   ├── UK.csv
│   ├── US.csv
│   └── genes.csv
└── out
    └── combined_dataset.csv
