In this project, you will analyze a dataset and then communicate your findings about it. You will use the Python libraries NumPy, pandas, and Matplotlib to make your analysis easier.
Prepare for this project with: Intro to Data Analysis
You will need an installation of Python, plus the following libraries:
* pandas
* NumPy
* Matplotlib
* csv (part of the Python standard library, so it needs no separate installation)
We recommend installing Anaconda, which comes with all of the necessary packages, as well as IPython notebook.
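To confirm that the environment is ready before starting, a quick check along these lines may help. This is only a sketch: the helper name `check_libraries` is made up for illustration, and it relies on the standard-library function `importlib.util.find_spec` to report which modules Python can locate, without actually importing them.

```python
import importlib.util

# Import names of the libraries listed above; csv ships with Python itself.
REQUIRED = ["pandas", "numpy", "matplotlib", "csv"]


def check_libraries(names):
    """Return a dict mapping each module name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_libraries(REQUIRED)
for name, available in status.items():
    print(f"{name}: {'found' if available else 'MISSING'}")
```

Any library reported as missing can be installed separately, or you can install Anaconda as recommended above.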
For this project, you will conduct your own data analysis and create a file to share that documents your findings. You should start by taking a look at your dataset and brainstorming what questions you could answer using it. Then you should use pandas and NumPy to answer the questions you are most interested in, and create a report sharing the answers. You will not be required to use inferential statistics or machine learning to complete this project, but you should make it clear in your communications that your findings are tentative. This project is open-ended in that we are not looking for one right answer.
Click this link to open a document with links and information about data sets that you can investigate for this project. You must choose one of these datasets to complete the project.
Eventually you’ll want to submit your project (and share it with friends, family, and employers). Get organized before you begin. We recommend creating a single folder that will eventually contain:
- The report communicating your findings
- Any Python code you wrote as part of your analysis
- The data set you used (which you will not need to submit)
You may wish to use a Jupyter notebook, in which case you can submit both the code you wrote and the report of your findings in the same document. Otherwise, you will need to submit your report and code separately. If you would like a notebook template to help organize your investigation, you can click here.
Selected dataset: No-Show Appointments Dataset
Dataset Description: This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row.
You can find more details about the dataset at this link.
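A first look at the data might go roughly as follows. This is a minimal sketch that uses a few made-up rows in place of the real file. The column names (`patient_id`, `age`, `sms_received`, `no_show`) are assumptions chosen for illustration, so check the actual file's header before reusing any of this.

```python
import io

import pandas as pd

# Made-up rows for illustration only; the column names are assumptions,
# not the real dataset's schema.
SAMPLE_CSV = """patient_id,age,sms_received,no_show
1,34,0,No
2,61,1,Yes
3,,0,No
4,8,1,No
"""

# With the real file, pass its path to pd.read_csv instead of a StringIO.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)           # number of rows and columns
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # count of missing values per column

# Recode the outcome as a boolean so it is easier to aggregate later.
df["no_show"] = df["no_show"] == "Yes"
```

Checking shape, types, and missing values up front tends to surface cleaning steps (such as the missing age in the third row here) before they distort later results.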
Brainstorm some questions you could answer using the data set you chose, then start answering those questions. You can find some questions in the data set options to help you get started.
Try to suggest questions that promote looking at relationships between multiple variables. You should aim to analyze at least one dependent variable and three independent variables in your investigation. Make sure you use NumPy and pandas where they are appropriate!
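One way to relate a dependent variable to several independent variables is to compare group averages. The sketch below uses made-up data and assumed column names (`age`, `sms_received`, `gender` as independent variables, `no_show` as the dependent variable); none of the numbers are findings from the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration; the column names are assumptions.
df = pd.DataFrame({
    "age": [5, 17, 25, 40, 63, 71, 33, 52],
    "sms_received": [0, 1, 0, 1, 0, 1, 1, 0],
    "gender": ["F", "M", "F", "F", "M", "F", "M", "M"],
    "no_show": [True, False, True, False, False, False, True, False],
})

# The mean of a boolean column is the share of True values.
overall_rate = df["no_show"].mean()

# Independent variable 1: age, bucketed into (0, 18], (18, 60], (60, inf).
df["age_group"] = pd.cut(
    df["age"], bins=[0, 18, 60, np.inf], labels=["child", "adult", "senior"]
)
rate_by_age = df.groupby("age_group", observed=False)["no_show"].mean()

# Independent variables 2 and 3: reminder status and gender.
rate_by_sms = df.groupby("sms_received")["no_show"].mean()
rate_by_gender = df.groupby("gender")["no_show"].mean()

# Two independent variables at once, to look for interactions.
rate_by_gender_sms = df.groupby(["gender", "sms_received"])["no_show"].mean()

print(overall_rate)
print(rate_by_age, rate_by_sms, rate_by_gender, rate_by_gender_sms, sep="\n\n")
```

Differences between group rates like these are descriptive only. Without inferential statistics they do not show that one variable causes another, which is why the findings should be reported as tentative.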
Once you have finished analyzing the data, create a report that shares the findings you found most interesting. If you use a Jupyter notebook, share your findings alongside the code you used to perform the analysis. Make sure that your report text is contained in Markdown cells to clearly distinguish your comments and findings from your code work. You should also feel free to use other tools and software to craft your final report, but make sure that you can submit your report as an HTML or PDF file so that it can be opened easily.
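A labeled chart often communicates a finding more clearly than a table. The following sketch draws a bar chart with Matplotlib from made-up rates; the category labels and values are illustrative assumptions, not results. It selects the off-screen `Agg` backend so it can run as a plain script, which is unnecessary inside a notebook.

```python
import io

import matplotlib

matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rates for illustration only, not results from the real dataset.
rates = pd.Series({"no SMS": 0.5, "SMS": 0.25}, name="no_show_rate")

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(rates.index, rates.values)
ax.set_xlabel("Reminder status")
ax.set_ylabel("Share of missed appointments")
ax.set_title("No-show rate by reminder status (illustrative data)")
ax.set_ylim(0, 1)

# Save to an in-memory buffer; pass a file name instead to write a PNG to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", bbox_inches="tight")
plt.close(fig)
```

Giving every chart a title and axis labels lets a reader understand it without having to read the code.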
Use the Project Rubric to review your project. If you are happy with your submission, then you're ready to submit your project. If you see room for improvement, keep working to improve your project!
Supporting Materials
Investigate a Dataset - Template Notebook
The project walkthrough can be found here.