# Collecting Data

The foundation of your project is collecting appropriate data for the problem you are trying to solve. Machine learning models are remarkably versatile: you might predict the next social media trend, categorize cancer cells, better understand the relationship between temperature fluctuations and coffee bean growth rates, or create a chatbot.

Since we are very early in this course and most likely have not discussed many models yet, it may be difficult to know what you can do with a dataset or what is in scope for this course. In keeping with the course title, "Exploring Machine Learning", we will take an exploratory approach to your project.

The goal of this part of the project is to explore which datasets you might be interested in. The questions below will help guide you toward a category of data that you want to explore further.

## Identifying what data you want to explore

Data is everywhere, and there seems to be data on just about anything. You might know exactly what you want to dive deeper into, or you might have no idea. Either way, I invite you to answer the questions below.

Below, create a Python dictionary in which each key is a short summary of a topic of interest and each value explains your interest, such as why the topic interests you or why you feel a strong passion to understand it. A topic of interest could be research you are conducting, a topic you are studying at your job, a hobby, or a topic surrounding your identities.

List 5 topics, and for each topic write a description of at least 50 words.

For example, I might put:

```python
interests = {
    "Cats" : "I have two cats at home, and they are basically my children. I would like to learn more about cat behavior, health trends, and pet owner behavior. It may also be interesting to see industry trends among cat owners, or how they compare to dog owners. Later, I may want to write an app that recognizes cat breeds.",

    "Scuba" : "The study of scuba diving seems to be a 'soft' science, and there are general guidelines on when and how long you should do safety stops to avoid getting decompression sickness. Could there be links to human anatomy or behavior on how deep a person should go safely during a dive?",
}
```
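Before filling in the cell below, it can help to check a dictionary against the two requirements stated above (5 topics, at least 50 words each). The following is a minimal sketch, not part of the grading code: `check_interests` and its parameters are hypothetical names chosen for illustration.

```python
def check_interests(interests, min_topics=5, min_words=50):
    """Return a list of problems; an empty list means both requirements are met."""
    problems = []
    if len(interests) < min_topics:
        problems.append(
            f"expected at least {min_topics} topics, found {len(interests)}"
        )
    for topic, description in interests.items():
        # str.split() with no arguments splits on any run of whitespace,
        # so extra spaces inside a description do not inflate the count.
        word_count = len(description.split())
        if word_count < min_words:
            problems.append(
                f"'{topic}' has {word_count} words, needs at least {min_words}"
            )
    return problems


# A deliberately incomplete dictionary, so both checks report a problem.
sample = {"Cats": "I have two cats at home and want to learn more about them."}
for problem in check_interests(sample):
    print(problem)
```

Returning a list of messages rather than raising an error lets you see every problem in one pass.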

```python
# Do not edit the name of this function, it will be used for grading
def what_are_topics_you_are_interested_in():
    interests = {
        "Rock Climbing" : "I have a deep passion for rock climbing, I've been doing it for just about six years now. \
            I am constantly wondering about what optimal training would look like for climbing, and would love to learn more about how \
            different body types perform on certain routes. In the future I would like to be able to identify relationships between bodily factors \
            and climbing performance, as well as what training routines output the highest performance.",

        "Music" : "Music has been an integral part of my life for as long as I could remember - in fact, I'm listening to music as I type this. It \
            doesn't take a data scientist to recognize there's a correlation with personality and what kind of music a person listens to, but I \
            am curious if there would be a way to scientifically predict qualities of a person based on their top genres. I would be interested in \
            gathering data on academic success and the type of music someone listens to.",

        "Education" : "As a full-time student, education is at the forefront of my life. As I spend more time in school, I wonder how I could be \
            improving my efficiency. There are so many factors that play into academic performance, but one of the ones I'm most curious about is \
            note taking techniques. I wonder how important tools like Notion and Obsidian are, as well as strategies like Cornell notes and \
            mind-mapping.",

        "Company Optimization" : "As someone passionate about data science, I would like to complete a project that aligns more closely with what I \
            might be doing in industry. I would love to work with a local company and help them optimize their profits. I could imagine this could be \
            possible through predictive analytics, figuring out what times of the year are correlated with an increased demand for certain products. \
            I just looked into this a bit and it seems like digital analytics and customer targeting are two important ways to boost a small \
            company through analytics.",

        "Environmentalism" : "As someone who spends most of my free time in the outdoors, there are few things I like seeing less than litter. I \
            would be very interested in creating a project that positively contributes to the cleaning of our environment, and I believe this could be \
            accomplished through data science. I would be interested in analyzing the many factors in an area that may correlate with increased litter, \
            so volunteer organizations like the Carolina Climber's Coalition could designate volunteers accordingly."
    } # Fill out your interests
    return interests

# Note: you can use the \ symbol to continue your string on the next line, which keeps
# long lines readable. Any leading whitespace on the continued line becomes part of the
# string, as the output below shows.
# Example:
print("This is an \
      extended string ")
```

```
This is an       extended string
```



## Do datasets exist for my interests?

There is a lot of data out there, but not for everything. Below are some websites where you can look at available datasets. Go ahead and search for datasets related to your topics. Are there many datasets surrounding your topic? Are there many different types of data, such as categorical data, data suited to regression, or images? If datasets are limited, do you feel comfortable with the challenge of creating your own data? (Note: creating your own dataset to supplement existing datasets will increase your score on this assignment.)

> You can find a link to databases on the course page!

For 3 of your topics, find 3 datasets you might want to use for your project. Below, create a dictionary whose keys are the topic names you listed above and whose values are lists of links to the 3 datasets you would like to explore. If you would also like to make your own data, add the string "Create my own data" to the end of the list.

Note: if you have trouble finding datasets for your topic, you can make your topic more general or try a different topic. For example, I could expand my "Cats" topic to "Pets", "Pet Toy Sales", or "Pet Health Benefits".

You can always change your topic and dataset later, so don't feel that these decisions are permanent.

Did your search generate any ideas about interests or datasets you would like to explore? If so, you can add a topic to the dictionary above or replace one.

Example:

```python
datasets = {
    "Cats" : ["https://www.kaggle.com/datasets/ma7555/cat-breeds-dataset", "https://example.com", "https://example.com", "Create my own data"],

    "Second Topic": ["https://example.com", "https://example.com", "https://example.com"],
    
    "Third Topic" : ["https://example.com", "https://example.com", "https://example.com"]
}
```
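The structure described above (at least 3 topics, 3 links per topic, and an optional "Create my own data" string at the end of a list) can be checked with a short script. This is only a sketch under those assumptions: `check_datasets` and `OWN_DATA` are hypothetical names, and the check confirms that each entry looks like a web link, not that the page exists.

```python
from urllib.parse import urlparse

OWN_DATA = "Create my own data"


def check_datasets(datasets, interests=None, min_topics=3, min_links=3):
    """Return a list of problems; an empty list means the structure looks right."""
    problems = []
    if len(datasets) < min_topics:
        problems.append(f"expected at least {min_topics} topics, found {len(datasets)}")
    for topic, entries in datasets.items():
        # Optionally confirm that the topic was also listed as an interest.
        if interests is not None and topic not in interests:
            problems.append(f"'{topic}' is not a key in the interests dictionary")
        links = [entry for entry in entries if entry != OWN_DATA]
        for link in links:
            parsed = urlparse(link)
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                problems.append(f"'{topic}': '{link}' does not look like a web link")
        if len(links) < min_links:
            problems.append(f"'{topic}' has {len(links)} links, needs at least {min_links}")
        if OWN_DATA in entries and entries[-1] != OWN_DATA:
            problems.append(f"'{topic}': '{OWN_DATA}' should be the last item")
    return problems


# Placeholder links, as in the example above.
example = {
    "Cats": ["https://example.com"] * 3 + [OWN_DATA],
    "Second Topic": ["https://example.com"] * 3,
    "Third Topic": ["https://example.com"] * 3,
}
print(check_datasets(example))
```

Passing the `interests` dictionary as the second argument also flags topic names that do not match a key you listed earlier.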


```python
def find_some_datasets():
    datasets = {
        "Rock Climbing" : [
            "https://www.kaggle.com/datasets/brkurzawa/ifsc-sport-climbing-competition-results",
            "https://www.kaggle.com/datasets/tomasslama/indoor-climbing-gym-hold-segmentation",
            "https://www.kaggle.com/datasets/anthonygiorgio/ifsc-climbing-competition-results-1991-2019",
        ],

        "Music" : [
            "https://www.kaggle.com/code/amiraadel21000/students-performance",
            "https://www.kaggle.com/code/annastasy/predicting-students-grades",
            "https://www.kaggle.com/datasets/sids04/ai-generated-psychology-participants-personality",
        ],

        "Education" : [
            "https://www.kaggle.com/code/bhartiprasad17/student-academic-performance-analysis",
            "https://www.kaggle.com/code/mdismielhossenabir/student-exam-performance-prediction",
            "https://www.kaggle.com/datasets/asaniczka/wages-by-education-in-the-usa-1973-2022",
        ],

        "Company Optimization" : [
            "https://www.kaggle.com/datasets/gaurav9712/50-startups",
            "https://www.kaggle.com/datasets/ksabishek/product-sales-data",
            "https://www.kaggle.com/datasets/datazng/shopping-mall-customer-data-segmentation-analysis",
        ],

        "Environmentalism" : [
            "https://www.kaggle.com/datasets/milankalkenings/litter-on-forest-floor",
            "https://www.kaggle.com/datasets/humansintheloop/recycling-dataset",
            "https://www.kaggle.com/datasets/vencerlanz09/taco-dataset-yolo-format",
        ],
    }
    return datasets
```

## Asking questions about your dataset

Some questions you might want to ask about each dataset are:
- Who created this dataset?
- When was this dataset created?
- Could any biases have been introduced when this dataset was created?
- How was this data collected?
- Is this data representative of the problem I am trying to solve?
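
Some of these questions, such as who created the dataset, can only be answered from its documentation. Others, such as whether the data is representative, can be partly explored from the data itself. The sketch below uses only the standard library to count rows, missing values, and how often each category appears. The `profile_rows` helper, the column names, and the sample rows are all made up for illustration, so swap in a real file and a column that matters for your problem.

```python
import csv
import io
from collections import Counter


def profile_rows(rows, label_column):
    """Summarize row count, missing values per column, and label balance."""
    missing = Counter()
    labels = Counter()
    total = 0
    for row in rows:
        total += 1
        for column, value in row.items():
            # Treat absent fields and blank strings as missing.
            if value is None or (isinstance(value, str) and not value.strip()):
                missing[column] += 1
        labels[row.get(label_column) or "<missing>"] += 1
    return {"rows": total, "missing": dict(missing), "labels": dict(labels)}


# Made-up sample standing in for a downloaded CSV file.
sample_csv = """breed,age_years,weight_kg
Siamese,3,4.1
Siamese,5,
Maine Coon,2,6.8
,4,3.9
"""

summary = profile_rows(csv.DictReader(io.StringIO(sample_csv)), label_column="breed")
print(summary)
```

A heavily skewed label count or a column that is mostly missing is a hint that the dataset may not represent the problem well, although neither proves bias on its own.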
