The underlying goal of this assignment is to understand how you think as a data scientist. We recommend that you do this assignment in either a Python notebook or an R markdown notebook - but feel free to use a tool of your choosing.
You have been given an anonymized data set containing 10 features and one target variable.
Your objectives are to:
- Perform data analysis and exploration of the data.
- Build a model to predict the target variable `target_var` using the variables `feature_1` to `feature_9`. You are free to use any modelling technique, but make sure to justify your reasoning.
You must submit a notebook containing your data exploration, final model, and a technical summary. Write down all your thoughts and reasoning within the notebook with technical readers in mind. You must also provide an executive summary of your findings at the end, aimed at non-technical readers.
Other information:
- A false negative (a prediction of 0 when the actual value in the original dataset was 1) costs the company R20 000 per case. A false positive (a prediction of 1 when the actual value was 0) costs the company R1000. It is up to you to decide how to account for this in the assignment.
- You should not make any assumptions about the underlying source of the data.
- You must design the best model you can while avoiding overfitting. You must be able to quantitatively prove this.
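One way to account for the asymmetric costs is to score predictions in rand rather than in accuracy, and to pick the decision threshold that minimises that cost on held-out data. The sketch below is only an illustration under stated assumptions: the helper names `total_cost` and `best_threshold` are hypothetical, the scores are assumed to be predicted probabilities of class 1, and the R1000 figure is treated as a per-case cost like the R20 000 one.

```python
# Costs taken from the brief; treating both as per-case costs is an assumption.
COST_FN = 20_000  # predicted 0, actual 1
COST_FP = 1_000   # predicted 1, actual 0


def total_cost(y_true, y_pred):
    """Total misclassification cost in rand for hard 0/1 predictions."""
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return false_negatives * COST_FN + false_positives * COST_FP


def best_threshold(y_true, scores, steps=100):
    """Scan thresholds 0.00, 0.01, ..., 1.00 and return (threshold, cost)
    for the first threshold with the lowest total cost."""
    best_t, best_cost = None, None
    for i in range(steps + 1):
        t = i / steps
        preds = [1 if s >= t else 0 for s in scores]
        cost = total_cost(y_true, preds)
        if best_cost is None or cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost


# With well-calibrated probabilities, predicting 1 is cheaper in expectation
# whenever p * COST_FN > (1 - p) * COST_FP, i.e. p > COST_FP / (COST_FP + COST_FN).
BREAK_EVEN = COST_FP / (COST_FP + COST_FN)  # roughly 0.048

# Toy example (made-up labels and scores, not from the assignment data).
y_true = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
threshold, cost = best_threshold(y_true, scores)
print(f"threshold={threshold:.2f}, cost=R{cost}, break-even={BREAK_EVEN:.3f}")
```

Because a false negative is twenty times as expensive as a false positive, the cost-minimising threshold will typically sit well below the default of 0.5. The threshold should be chosen on validation data and then reported on a separate test set, so the selection itself does not overfit.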
Bonus points are given if `pySpark` / `rSpark` are used to perform the modelling and data exploration/analysis. If you decide to use Apache Spark, you should pretend that you are working with millions of rows.
This was run in a Jupyter notebook. The model used is a Random Forest.
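A common way to show quantitatively that a Random Forest is not overfitting is to compare its training score with its cross-validated score: a large gap suggests the model is memorising the training folds. The following is a minimal sketch assuming scikit-learn is available. The data is a synthetic stand-in generated with `make_classification`, and the hyperparameters and the choice of ROC AUC as metric are illustrative assumptions, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the nine features and binary target
# (an assumption for illustration; replace with the real data).
X, y = make_classification(
    n_samples=600,
    n_features=9,
    n_informative=5,
    weights=[0.8, 0.2],
    random_state=0,
)

# Shallow trees and a minimum leaf size limit model capacity;
# class_weight="balanced" reweights the rarer class.
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    min_samples_leaf=5,
    class_weight="balanced",
    random_state=0,
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    model, X, y, cv=cv, scoring="roc_auc", return_train_score=True
)

train_auc = results["train_score"].mean()
valid_auc = results["test_score"].mean()
gap = train_auc - valid_auc

print(f"mean train AUC:      {train_auc:.3f}")
print(f"mean validation AUC: {valid_auc:.3f}")
print(f"train/validation gap: {gap:.3f}")
```

If the gap is large, reducing `max_depth` or increasing `min_samples_leaf` and re-running the comparison gives a simple, reportable record of how the gap responds.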