# Project Scoping

Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

## Reminder: Final Project 

The final project MUST have these elements:
- **Data collected using an API or web scraping** that is different from the data we used in class.
- **Visualizations** that describe the data or analyses that were run.
- An **analysis component** that either looks at relationships in data collected from multiple sources or uses text analysis to get a deeper understanding of the text data that you collected. Projects that involve a **data product** requiring more intensive work are also acceptable.

You must turn in your final project by **May 14, 2024**. Make sure you submit the following items:
- An HTML document with the report including all code used within the report.
- Any additional .py files that you used for your report. Make sure you use .py files as necessary to make your report clean!
- A CSV file with the data that was used for the project unless it cannot be included for some reason (for example, if it is too large). 

Read through the HTML document before you submit it to make sure everything is readable and the code is clear! If I cannot read it, you will not get credit for it!

## Starting out

What you do your final project on is completely up to you, and there are lots of different possibilities. It might seem daunting to start thinking about what an appropriate project might look like. In this section, we will talk about what constitutes a good research topic and how you should go about building up the final project.

## Step 1: Identifying a Research Question/Topic

A good research question or topic is **relevant** and **applicable** to people. This could be a relatively small group of people or it could be a wide range of people. Either way, it should be something that is **useful** in some way. This isn't a trivial thing, either! You might think you have a good topic, but you need to also make sure that the information is useful and applicable in some way.

> **Example:** Suppose you did an analysis that showed a relationship between someone's age and their income. This might seem interesting at first glance, but when you think about it more, it has a few issues. For one, this is a pretty intuitive result -- as you get older, you tend to make more money -- so it's not very insightful. For another, how would people even use this information? They can't make themselves older automatically (or get more experience quicker necessarily), so it's hard to see how people could make use of this information.   

Some questions to think about as you try to decide on the goal of your project:
- **What is the topic of the project?** Think about some topic areas that interest you. These can be very broad, such as "music" or more specific, such as "mental health of college students." 
- **What types of outcomes are you hoping to get and/or conclusions are you hoping to reach?** What will the outcome of this project be? Will it be a set of visualizations showing relationships? Do you want to try to identify topics or sentiment within text? Or are you putting together a dataset by scraping webpages and combining it with other data sources such as Census API data?
- **How is this useful?** What use does the project have for people? Can it be used to make better decisions? Does it provide insights beyond what you could have gotten before?

<font color ='red'>**Question 1: Make a list of some possible topics that you would want to explore. Don't worry about where the data will come from for now. Make sure you have an idea of what the output of the project might be used for. Discuss with your neighbors to see what they came up with.**</font>

Topics:
- 
-
-

## Step 2: Identifying Data Sources

There are lots of possible data sources. Examples of some possible data sources are provided below.
- Spotify API: https://developer.spotify.com/documentation/web-api (this requires an access token; see https://medium.com/@maxtingle/getting-started-with-spotifys-api-spotipy-197c3dc6353b for a guide on how this might be used)
- Reddit API: https://www.reddit.com/dev/api (see https://www.jcchouinard.com/reddit-api-without-api-credentials/ for an example of how this might work)
- Housing Market API: https://documenter.getpostman.com/view/9197254/UVsFz93V#quickstart
- Delphi CovidCast API: https://cmu-delphi.github.io/delphi-epidata/

These can help provide some motivation for the topics that you've identified above, but you can also look for data sources online that aren't from these APIs. Remember, there are lots of APIs available, and even if the data isn't available in an API, you can scrape data from webpages. 

Some questions to think about when trying to identify good data sources:
- **What type of data are you trying to collect?** Is it text data? What is the unit of observation? How big do you expect the data to be?
- **How will you obtain the data?** If the data might come from a big company or organization, then there's a good chance that it is available through an API. If it is something that isn't centrally organized and collected, then it might need to be scraped individually. 
- **What are possible restrictions on data access?** If you are looking for individual-level medical records, you will likely be very disappointed. This is because medical records are highly confidential and typically not released to the public. If, for example, you are interested in topics such as mental health, you might need to resort to other types of data, such as surveys and self-reports. 
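
To make the "how will you obtain the data" question more concrete, here is a minimal sketch of pulling from a JSON API and keeping only the fields you need. The endpoint `https://api.example.com/posts`, its parameters, and the sample payload are all made up for illustration; a real API will have its own URL, parameters, authentication, and response layout, so check its documentation first. The helper names (`build_url`, `fetch_json`, `to_rows`) are also just placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_url(base_url, params):
    """Attach query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url, timeout=10):
    """Download a URL and parse the response body as JSON."""
    request = Request(url, headers={"User-Agent": "course-project-sketch"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def to_rows(payload, fields):
    """Keep only the chosen fields from each record; absent fields become None."""
    return [{field: record.get(field) for field in fields} for record in payload]


# Hypothetical endpoint and parameters, for illustration only.
url = build_url("https://api.example.com/posts", {"q": "election", "limit": 2})

# A made-up payload standing in for what fetch_json(url) might return.
sample_payload = json.loads(
    '[{"id": 1, "title": "First post", "score": 10},'
    ' {"id": 2, "title": "Second post"}]'
)

rows = to_rows(sample_payload, ["id", "title", "score"])
print(url)
print(rows)
```

Each dictionary in `rows` is one unit of observation, so looking at a handful of them is a quick way to check whether the data is at the level (person, post, day, county) that your topic needs.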

> **Example:** Suppose you are interested in political discourse leading up to the 2020 presidential election. One possible source of data is polling results, which are available through a variety of different websites. But what else can you find? The New York Times API has Op-Eds that were published in the New York Times, which you could examine using text analysis. You might also consider scraping the text from some other sources, or looking at what has been posted on social media sites such as Reddit. These can all be combined temporally to see what the trends are like and to get a better idea of what might be an indicator of success, especially as it applies to the 2024 election.
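
One way to read "combined temporally" is to summarize each source by week and then line the weeks up. The sketch below does this with two tiny, made-up inputs (a few poll results and a few post dates); the numbers are placeholders and not real polling data, and grouping by ISO week is just one reasonable choice.

```python
from collections import defaultdict
from datetime import date

# Made-up values for illustration; real data would come from your sources.
polls = [
    {"date": "2020-09-01", "support": 48.0},
    {"date": "2020-09-03", "support": 50.0},
    {"date": "2020-09-09", "support": 51.0},
]
post_dates = ["2020-09-02", "2020-09-04", "2020-09-05", "2020-09-08"]


def week_key(day_string):
    """Return (ISO year, ISO week number) for a YYYY-MM-DD string."""
    iso = date.fromisoformat(day_string).isocalendar()
    return (iso[0], iso[1])


# Collect poll support values within each week.
support_by_week = defaultdict(list)
for poll in polls:
    support_by_week[week_key(poll["date"])].append(poll["support"])

# Count posts within each week.
posts_by_week = defaultdict(int)
for day in post_dates:
    posts_by_week[week_key(day)] += 1

# Keep only weeks that appear in both sources.
shared_weeks = sorted(set(support_by_week) & set(posts_by_week))
combined = [
    {
        "week": week,
        "mean_support": sum(support_by_week[week]) / len(support_by_week[week]),
        "n_posts": posts_by_week[week],
    }
    for week in shared_weeks
]

for row in combined:
    print(row)
```

Weeks that show up in only one source are dropped here. Whether that is appropriate depends on your question, so it is worth checking how many weeks you lose.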

<font color ='red'>**Question 2: Think about the topics that you came up with in Question 1. What are possible data sources you could use to answer those questions or address those issues? Are there ways to combine data sources to get the information you need? After you identify some data sources, discuss with your neighbors.**</font>

Topics with Data Sources:
- 
-
-

## Step 3: Explore the Data

Now that you have identified a possible data source, you might think you're all set for getting started. However, many times, you'll find that the data isn't exactly what you had hoped it would be, or that it doesn't exactly match up with what you had in mind for your project at the beginning. 

This might actually end up being an iterative process, where you explore the data, then go back and alter the project topic or question slightly so that it matches what you can answer with your data. That doesn't mean you have to completely change what you are doing, but **the data will dictate what is possible to do.** This also means that you should be very careful to **not make statements that aren't supported by the data.**

Ideally, you would do this after actually obtaining the data, but you can also explore the data before you even pull from the API or scrape it. The API documentation should have information about the data that is available, so you can use it to check what you can do with the data. 

At some point, you'll have to actually pull from the API or scrape from the website and try to use the data. When you do, you might find that it's not exactly what you expected, and you might need to make some revisions to the project again. Most commonly, this is due to missing data. Even if a field exists within the API documentation, there's no guarantee it will be clean and usable. Sometimes, you might even have to scrap that data source and go with other data sources.
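
A quick first check after pulling data is to see how much of each field is actually filled in. The sketch below assumes your records are a list of dictionaries and treats a field as missing when it is absent, `None`, or an empty string. The records and field names are made up for illustration, and `missing_share` is a hypothetical helper, not part of any library.

```python
def missing_share(records, fields):
    """Share of records where each field is absent, None, or an empty string."""
    total = len(records)
    shares = {}
    for field in fields:
        missing = sum(1 for record in records if record.get(field) in (None, ""))
        shares[field] = missing / total if total else 0.0
    return shares


# Made-up records standing in for data pulled from an API.
records = [
    {"artist": "A", "genre": "pop", "popularity": 71},
    {"artist": "B", "genre": "", "popularity": 64},
    {"artist": "C", "popularity": None},
    {"artist": "D", "genre": "rock", "popularity": 0},
]

shares = missing_share(records, ["artist", "genre", "popularity"])
for field, share in shares.items():
    print(f"{field}: {share:.0%} missing")
```

Note that a value of `0` is counted as present, not missing. If a field turns out to be mostly empty, that is a sign you may need to adjust the question or look for another source.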

<font color ='red'>**Question 3: Can you use the data in Question 2 to exactly answer the question in Question 1? Do you need to adjust the topic to match what you have the data for? Discuss with your neighbors.**</font>

Updated Topics with Data Sources:
- 
-
-