# Collecting Data

The foundation of your project is collecting appropriate data for the problem you are trying to solve. Machine learning models are remarkably versatile: you might predict the next social media trend, categorize cancer cells, study the relationship between temperature fluctuations and coffee bean growth rates, or build a chatbot.

Since we are very early in this course and most likely have not discussed many models yet, it may be difficult to know what you can do with a dataset or what is in scope for this course. In keeping with the course title, "Exploring Machine Learning", we will take an exploratory approach to your project.

The goal of this part of the project is to explore which datasets you might be interested in. The questions below will help guide you toward a category of data that you want to explore further.

## Identifying what data you want to explore

Data is everywhere, and there seems to be data on almost anything. You might know exactly what you want to dive deeper into, or you might have no idea. Either way, I invite you to answer the questions below.

Below, create a Python dictionary in which each key is a short summary of a topic of interest and each value explains your interest, such as why the topic interests you or why you feel a strong passion to understand it. A topic of interest could be research you are conducting, a topic you are studying at your job, a hobby, or a topic surrounding your identities.

List 5 topics, and for each topic write a description of at least 50 words.

For example, I might put:

```python
interests = {
    "Cats" : "I have two cats at home, they are basically my children. I would generally like to learn more about cat behavior, health trends and pet owner behavior. It may be interesting to also see industry trends of cat owners, or how they compare to dog owners. Maybe later I want to start to write an app that recognizes cat breeds.",

    "Scuba" : "The study of scuba diving seems to be a 'soft' science, and there are general guidelines on when and how long you should do safety stops to avoid getting decompression sickness. Could there be links to human anatomy or behavior on how deep a person should go safely during a dive?",
}
```

```python
# Do not edit the name of this function, it will be used for grading
def what_are_topics_you_are_interested_in():
    # Adjacent string literals are joined automatically, which avoids the
    # stray indentation that backslash continuations leave inside the text.
    interests = {
        "Cats": (
            "I have two cats at home, they are basically my children. "
            "I would generally like to learn more about cat behavior, health trends and pet owner behavior. "
            "It may be interesting to also see industry trends of cat owners, or how they compare to dog owners. "
            "Maybe later I want to start to write an app that recognizes cat breeds."
        ),
        "Student Mental Health": (
            "This dataset is valuable for researchers and policymakers interested in student well-being, mental health, and academic success. "
            "Activities that students engage in to relieve stress. "
            "Academic details like degree level, major, academic year, and current CGPA."
        ),
        "Gold Price": (
            "Predicting the price of gold is a critical task in financial markets due to its significance as a "
            "stable store of value and hedge against economic uncertainties. "
            "We leverage a dataset containing various economic indicators and historical gold prices "
            "to develop predictive models that forecast the price of gold."
        ),
        "Cost of Living": (
            "Cost of Living Index by Country, 2024 mid-year data. "
            "The cost of living indices provided on this website are relative to New York City (NYC), with a baseline index of 100% for NYC. "
            "To study the cost of living of NYC."
        ),
        "Credit Card Fraud": (
            "This dataset contains credit card transactions made by European cardholders in the year 2023. "
            "It comprises over 550,000 records, and the data has been anonymized to protect the cardholders' identities. "
            "The primary objective of this dataset is to facilitate the development of fraud detection algorithms and models to identify potentially fraudulent transactions."
        ),
    }
    return interests
```
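
Before submitting, the two stated requirements (5 topics, at least 50 words per description) can be checked mechanically. The following is a minimal sketch, not the course grader: the function name `check_interests` and the sample dictionary are hypothetical, and the word count simply splits on whitespace, which may differ from how the grading script counts.

```python
def check_interests(interests, min_topics=5, min_words=50):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if len(interests) < min_topics:
        problems.append(
            f"Only {len(interests)} topics listed; expected at least {min_topics}."
        )
    for topic, description in interests.items():
        # str.split() with no arguments splits on any run of whitespace,
        # so line breaks and repeated spaces do not inflate the count.
        word_count = len(description.split())
        if word_count < min_words:
            problems.append(
                f"'{topic}' has {word_count} words; expected at least {min_words}."
            )
    return problems


# Hypothetical sample input, deliberately too short to pass.
sample = {"Cats": "I have two cats at home."}
for problem in check_interests(sample):
    print(problem)
```

Passing the dictionary returned by `what_are_topics_you_are_interested_in()` to this helper would list any descriptions that fall short.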

## Do datasets exist for my interests?

There is a lot of data out there, but not for everything. Below are some websites where you can look at available datasets. Go ahead and search for datasets related to your topic. Are there many datasets surrounding your topic? Are there different types of data, such as categorical labels, numeric values suitable for regression, or images? If datasets are limited, do you feel comfortable with the challenge of creating your own data? (Note: creating your own dataset to supplement existing datasets will increase your score on this assignment.)

> You can find a link to databases on the course page!

For 3 of your topics, find 3 datasets you might want to use for your project. Below, create a dictionary in which each key is one of the topics you listed above and each value is a list of links to the 3 datasets you would like to explore. If you would also like to make your own data, add the string "Create my own data" to the end of the list.

Note: if you have trouble finding datasets for your topic, you can make your topic more general or try a different topic. For example, I could expand my "Cats" topic to "Pets", "Pet Toy Sales", or "Pet Health Benefits".

You can always change your topic and dataset later, so don't feel that these decisions are permanent.

Did searching generate any ideas about interests or datasets you would like to explore? If so, you can add or replace a topic in the dictionary above!

Example:

```python
datasets = {
    "Cats" : ["https://www.kaggle.com/datasets/ma7555/cat-breeds-dataset", "https://example.com", "https://example.com", "Create my own data"],

    "Second Topic": ["https://example.com", "https://example.com", "https://example.com"],
    
    "Gold Price" : ["https://example.com", "https://example.com", "https://example.com"]
}
```
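
The shape described above (at least 3 topics, 3 links each, topic names matching the interests dictionary, and an optional "Create my own data" marker at the end of a list) can also be checked with a short helper. This is a sketch under assumptions: `check_datasets` is a hypothetical name, the sample data is invented for illustration, and treating "3 links" as exactly three is my reading of the instructions rather than a confirmed grading rule.

```python
OWN_DATA = "Create my own data"


def check_datasets(datasets, topics, min_topics=3, links_per_topic=3):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if len(datasets) < min_topics:
        problems.append(
            f"Only {len(datasets)} topics have datasets; expected at least {min_topics}."
        )
    for topic, entries in datasets.items():
        if topic not in topics:
            problems.append(f"'{topic}' is not one of the listed topics.")
        # Count only real links, not the optional own-data marker.
        links = [entry for entry in entries if entry != OWN_DATA]
        if len(links) != links_per_topic:
            problems.append(
                f"'{topic}' has {len(links)} links; expected {links_per_topic}."
            )
        if OWN_DATA in entries and entries[-1] != OWN_DATA:
            problems.append(f"'{topic}': '{OWN_DATA}' should be the last item.")
    return problems


# Hypothetical sample: "Dogs" is not a listed topic and has too few links.
topics = ["Cats", "Scuba", "Gold Price"]
sample = {
    "Cats": [
        "https://www.kaggle.com/datasets/ma7555/cat-breeds-dataset",
        "https://example.com",
        "https://example.com",
        "Create my own data",
    ],
    "Dogs": ["https://example.com"],
}
for problem in check_datasets(sample, topics):
    print(problem)
```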


```python
import os

from kaggle.api.kaggle_api_extended import KaggleApi


def find_some_datasets():
    # Dictionary mapping each topic to a list of Kaggle dataset URLs
    datasets = {
        "Cats": ["https://www.kaggle.com/datasets/ma7555/cat-breeds-dataset"],
        "Student Mental Health": ["https://www.kaggle.com/datasets/abdullahashfaqvirk/student-mental-health-survey"],
        "Gold Price": ["https://www.kaggle.com/datasets/cvergnolle/gold-price-and-relevant-metrics"],
        "Cost of Living": ["https://www.kaggle.com/datasets/myrios/cost-of-living-index-by-country-by-number-2024"],
        "Credit Card Fraud": ["https://www.kaggle.com/datasets/nelgiriyewithana/credit-card-fraud-detection-dataset-2023"],
    }
    return datasets


def download_datasets_from_kaggle(datasets, save_dir="datasets"):
    # Initialize the Kaggle API (needs Kaggle API credentials to be configured)
    api = KaggleApi()
    api.authenticate()

    # Dictionary to keep track of where datasets are downloaded
    downloaded_paths = {}

    # Loop over the dictionary of datasets
    for topic, urls in datasets.items():
        topic_path = os.path.join(save_dir, topic)

        # Create a directory for each topic if it does not already exist
        os.makedirs(topic_path, exist_ok=True)

        for url in urls:
            if "kaggle.com" not in url:
                print(f"Skipping non-Kaggle dataset: {url}")
                continue

            # Build the "owner/dataset-name" slug from the last two path
            # segments; rstrip guards against a trailing slash in the URL
            dataset_slug = "/".join(url.rstrip("/").split("/")[-2:])
            try:
                print(f"Downloading Kaggle dataset: {dataset_slug} for topic {topic}")
                # Download and unzip dataset
                api.dataset_download_files(dataset_slug, path=topic_path, unzip=True)
                print(f"Successfully downloaded: {dataset_slug} for topic {topic}")

                # Store the path of the downloaded dataset
                downloaded_paths[topic] = topic_path
            except Exception as e:
                print(f"Failed to download {dataset_slug}: {e}")

    return downloaded_paths


# Example usage
datasets = find_some_datasets()
downloaded_paths = download_datasets_from_kaggle(datasets)

# Print the paths of the downloaded datasets
print("Downloaded datasets are saved in the following locations:")
for topic, path in downloaded_paths.items():
    print(f"{topic}: {path}")
```
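
The slug extraction step in the download helper can be isolated and tested without calling the Kaggle API. The sketch below uses only the standard library; `kaggle_dataset_slug` is a hypothetical helper name, and it assumes dataset URLs have the path shape `/datasets/<owner>/<dataset-name>`, as the links above do.

```python
from urllib.parse import urlparse


def kaggle_dataset_slug(url):
    """Return 'owner/dataset-name' for a Kaggle dataset URL, else None."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host != "kaggle.com" and not host.endswith(".kaggle.com"):
        return None
    # Drop empty segments so a trailing slash does not shift the positions.
    parts = [part for part in parsed.path.split("/") if part]
    # Assumed path shape: /datasets/<owner>/<dataset-name>
    if len(parts) >= 3 and parts[0] == "datasets":
        return f"{parts[1]}/{parts[2]}"
    return None


print(kaggle_dataset_slug("https://www.kaggle.com/datasets/ma7555/cat-breeds-dataset/"))
print(kaggle_dataset_slug("https://example.com"))
```

Parsing the URL rather than checking for the substring `kaggle.com` also avoids treating unrelated hosts that merely contain that text as Kaggle links.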


## Asking questions about your dataset

Some questions you might want to ask for each dataset are:
- Who created this dataset?
- When was this dataset created?
- Could any biases have been introduced when this dataset was created?
- How was this data collected?
- Is this data representative of the problem I am trying to solve?
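
Some of these questions, particularly the ones about bias and representativeness, can be partly probed with simple counts once a dataset is downloaded. The sketch below is illustrative only: `summarize_csv` is a hypothetical helper, the CSV content is made up, and row counts, missing values, and label balance are just a starting point rather than a full answer to any of the questions.

```python
import csv
import io
from collections import Counter


def summarize_csv(text, label_column):
    """Count rows, empty cells per column, and rows per label value."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    # A cell counts as missing when it is absent or contains only whitespace.
    missing = {
        col: sum(1 for row in rows if not (row[col] or "").strip())
        for col in columns
    }
    label_counts = Counter(row[label_column] for row in rows)
    return {"rows": len(rows), "missing": missing, "label_counts": dict(label_counts)}


# Made-up data for illustration: one missing age, and an imbalanced label.
sample_csv = """breed,age_years,indoor
siamese,3,yes
siamese,,yes
persian,5,no
siamese,2,yes
"""

summary = summarize_csv(sample_csv, label_column="breed")
print(summary)
```

A heavily skewed `label_counts`, or a column with many missing values, is a prompt to go back and ask how the data was collected.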
