# Introduction

In this group project, we are tasked with exploring a binary classification problem or a regression problem. The project focuses on developing and comparing different models, with each group member contributing their own model submission. The models will be evaluated on left-out test data. A key aspect of our work will be to select, test, and agree upon a performance metric to assess these models.

Our goal is not only to build effective models but also to critically evaluate their performance in real-world scenarios. This will involve exploring various methods, including standard and advanced regression or classification techniques, and investigating the appropriateness of our chosen performance metrics. By using left-out test data, we aim to test whether our models generalise well to new data. Finally, we will compare the methods, highlight which approach performs best on the agreed metric, and reflect on the strengths and limitations of each.

Throughout this project, collaboration and learning from each other’s expertise will be vital. We will build upon what we have learnt in Project 0, taking into consideration the issues we faced and the things which worked well. We aim to deepen our understanding of model evaluation while applying data science techniques to real-world problems.

# Link to Project 0
In Project 0, we learned about the different types of datasets and explored several application domains. We also found several data sources. In this section, we provide links to Project 0, and summarise the relevant research done there. Establishing this as context, we explain how this will guide us in Project 1, with details on how we have chosen our classification problem.

Data is extremely important in various industries as it supports informed decision-making, pattern recognition, and forecasting. The wide range of applications leads to a wide variety of datasets. In Project 0, we found that datasets could be categorised as structured, unstructured or hybrid (though this categorisation was not uniform across the literature and online resources; see references [1], [2], and [3]).

In Project 0, we decided to use a structured dataset, as such datasets are easy to find. Having decided on the dataset type, we then chose an application domain. We saw that there were many application domains, including but not limited to healthcare, finance, manufacturing, education and e-commerce. We chose finance because finance data was abundant and highly varied. This made the data easy to find and provided opportunities to apply many different techniques.

In trying to find finance data, we stumbled upon many data sources. These are described in the next subsection.

## Sources of Data

Several repositories and databases offer access to a wealth of datasets for analysis. We give a list below:

- **[Kaggle](https://www.kaggle.com/datasets)**: A popular platform for data science competitions.

- **[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)**: A repository of datasets which can be used for classification and regression.

- **[Monash Time Series Forecasting Repository](https://forecastingdata.org/)**: A repository for time series.

- **[Google Dataset Search](https://datasetsearch.research.google.com/)**: A powerful tool for finding datasets across various domains, including finance. This search engine aggregates datasets from multiple sources, including academic institutions, government agencies, and commercial platforms.

- **[Yahoo Finance](https://finance.yahoo.com/)**: Offers real-time and historical data on stocks, bonds, and commodities, as well as financial news and economic indicators, commonly used in market analysis. Data can be accessed freely through its API (application programming interface); see, for example, the `quantmod` package in R.

- **[Quandl](https://www.quandl.com/)**: A data platform that provides access to financial datasets.

We chose to use Kaggle and the Yahoo Finance API. From Kaggle, we chose to analyse the Credit Card Fraud dataset, and from the Yahoo Finance API, we analysed Apple stock data. 

## Reflections on Project 0
The Apple stock data was interesting, as it exhibited several stylised facts which we could uncover and understand. However, we noted that the modelling process was complicated, with model fitting relying on quasi-likelihoods. For this reason, we decided not to pursue this dataset, or similar datasets, in Project 1.

On the other hand, the Credit Card Fraud dataset provided an interesting challenge because its classes were imbalanced. We therefore agreed to analyse a similar dataset for Project 1. We chose not to analyse the same data, as we wanted to showcase our ability to draw our own insights when the answers were not already presented to us. We believe that this self-imposed challenge will also help us develop our skills further.
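To illustrate why class imbalance matters when agreeing on a performance metric, the minimal sketch below compares accuracy with recall for a classifier that always predicts the majority class. The labels are toy values made up for illustration (they are not taken from the Credit Card Fraud dataset), and the function names are our own.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of all cases that were labelled correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of positive cases that were found (0.0 if there are none)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy labels (made up): 2 positive cases among 100 observations.
y_true = [1, 1] + [0] * 98
# A "model" that always predicts the majority (negative) class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.98
print(recall(y_true, y_pred))    # 0.0
```

Accuracy looks excellent here even though no positive case is detected, which is why a metric such as recall, precision or F1 may be more appropriate for imbalanced data.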

We encountered an issue with the Kaggle dataset, as it was too large to upload to GitHub, and we did not have any other file hosting solution. We will take this issue into account in our data search for Project 1 and attempt to find a solution.
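As a rough aid for screening candidate datasets, the sketch below checks file sizes against GitHub's per-file limit before we try to upload anything. It assumes the limit is 100 MiB per file (the size above which GitHub blocks a push); the helper names are hypothetical and of our own choosing.

```python
import os

# Assumption: GitHub blocks any single file larger than 100 MiB.
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def fits_on_github(path, limit_bytes=GITHUB_FILE_LIMIT_BYTES):
    """Return True if the file at `path` is within the per-file size limit."""
    return os.path.getsize(path) <= limit_bytes


def oversized_files(directory, limit_bytes=GITHUB_FILE_LIMIT_BYTES):
    """List (path, size_in_bytes) for files under `directory` that exceed the limit."""
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            if size > limit_bytes:
                found.append((path, size))
    return found
```

For example, calling `oversized_files("data")` on a hypothetical local `data` folder would list any files that need a different hosting solution.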

# Data Exploration

In Project 0, we searched a variety of data sources to find datasets for Project 1. I found several interesting datasets that present unique challenges like imbalanced data, missing values, and outliers, making them excellent candidates for providing learning opportunities.

During this search, I discovered additional sources, such as:

- [Dataset List](https://www.datasetlist.com/)
- [QuantumStat Dataset Index](https://index.quantumstat.com/)
- [Papers with Code - Image Classification](https://paperswithcode.com/datasets?task=image-classification)
- [Carnegie Mellon University Libraries Datasets](https://guides.library.cmu.edu/az/databases)

Some datasets, such as very large ones or those requiring advanced techniques (e.g., natural language processing), were not pursued further due to their size or complexity. In particular, after experiencing hosting issues with the Credit Card Fraud dataset in Project 0, we decided to exclude large datasets for practical reasons. We will continue to explore ways to overcome file hosting problems throughout the project. For this reason, some large datasets still appear in the list below, but we may not use them for analysis unless we resolve the file hosting issues.

Below are further sources we explored, whose datasets were more feasible to analyse:

- [Serokell’s Blog on Best ML Datasets](https://serokell.io/blog/best-machine-learning-datasets)
- [Papers with Code - Multi-label Classification](https://paperswithcode.com/datasets?task=multi-label-classification&page=1)
- [Scikit-learn Real-World Datasets](https://scikit-learn.org/1.5/datasets/real_world.html)

We present some interesting datasets in the next section.

# Datasets

The datasets below provide options for both **Classification** and **Regression** tasks. While we focus on classification, some regression datasets are included for the reader's interest. The datasets we selected present a wide range of challenges, including missing data, imbalanced classes, and time-series structures. This provides an excellent opportunity for testing various machine learning techniques.

Kaggle datasets can be downloaded and then uploaded to GitHub if the file size permits. The UCI datasets can be loaded easily: clicking the "Import to Python" button on a dataset's UCI page gives Python code to load the data. The scikit-learn datasets can also be loaded easily in Python, using the loading function listed with each dataset below.
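As a minimal sketch of the scikit-learn loading route, the example below loads the Digits dataset with `load_digits()`. It assumes scikit-learn is installed; the variable names are our own. We use this dataset for the illustration because it ships with scikit-learn and needs no download.

```python
from sklearn.datasets import load_digits

# load_digits() reads a dataset bundled with scikit-learn (no download needed).
digits = load_digits()
X, y = digits.data, digits.target

print(X.shape)  # (1797, 64): one row per image, one column per pixel
print(sorted(set(y.tolist())))  # the ten digit classes, 0 to 9
```

The `fetch_*` functions listed below (for example `fetch_covtype()` and `fetch_california_housing()`) are called in a similar way, but they download the data on first use and cache it locally.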

## Classification Problems

### Kaggle Datasets

1. **Titanic Survival Dataset**
   - **Description:** Predict passenger survival based on demographic data like age, gender, and class.
   - **Challenges:** **Missing data** (especially age) and **imbalanced data** (fewer survivors).
   - **Size:** 891 records with 12 features.
   - **Use Case:** Survival prediction, historical data analysis.
   - **Source:** [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic)

---

2. **Loan Prediction Dataset**
   - **Description:** Predict loan approval based on credit history, income, and other financial information.
   - **Challenges:** **Missing data** and slightly **imbalanced** outcomes.
   - **Size:** 614 records with 13 features, with a separate test set of 367 records.
   - **Use Case:** Financial decision-making.
   - **Source:** [Loan Prediction Dataset](https://www.kaggle.com/altruistdelhite04/loan-prediction-problem-dataset)

---

### UCI Datasets

3. **Breast Cancer Dataset**
   - **Description:** Predict whether a tumor is benign or malignant based on cell nuclei features.
   - **Challenges:** Slightly **imbalanced dataset** (more benign than malignant cases).
   - **Size:** 569 samples with 30 features.
   - **Use Case:** Medical diagnosis.
   - **Source:** [UCI Breast Cancer Dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)

---

4. **Adult Income Dataset (Census Data)**
   - **Description:** Predict whether an individual's income exceeds $50,000 based on demographic features.
   - **Challenges:** **Missing values** in some attributes like occupation, **imbalanced classes** (fewer high-income individuals).
   - **Size:** 48,842 records with 14 features.
   - **Use Case:** Social and economic analysis.
   - **Source:** [UCI Adult Dataset](https://archive.ics.uci.edu/ml/datasets/adult)

---

5. **Drug Consumption Dataset**
   - **Description:** Predict whether an individual consumes a drug based on personality traits and demographics.
   - **Challenges:** **Imbalanced data** and potential ethical concerns.
   - **Size:** 1,885 records with 12 features.
   - **Use Case:** Behavioral studies, addiction research.
   - **Source:** [UCI Drug Consumption Dataset](https://archive.ics.uci.edu/ml/datasets/Drug+consumption+%28quantified%29)

---

### scikit-learn Datasets

6. **Forest Cover Type Dataset**
   - **Description:** Classify forest cover type based on cartographic features like elevation and soil type.
   - **Challenges:** **Imbalanced data** across forest types.
   - **Size:** 581,012 records with 54 features.
   - **Use Case:** Environmental monitoring, land management.
   - **Method to Load:** `fetch_covtype()` (also available on [UCI ML Repository](https://archive.ics.uci.edu/dataset/31/covertype))

---

7. **Digits Dataset**
   - **Description:** Handwritten digit images (0–9) with pixel values, used for classification.
   - **Challenges:** Preprocessing steps like feature scaling and dimensionality reduction add complexity.
   - **Size:** 1,797 records, each with 64 features (an 8x8 pixel image).
   - **Use Case:** Optical character recognition (OCR), computer vision.
   - **Method to Load:** `load_digits()`

---

8. **20 Newsgroups Dataset**
   - **Description:** A text dataset for classification tasks, with the goal of classifying documents into one of 20 different newsgroups based on their content.
   - **Challenges:** Requires text preprocessing and natural language processing (NLP) techniques.
   - **Size:** About 18,000 posts from newsgroups.
   - **Use Case:** Text mining and topic modeling.
   - **Method to Load:** `fetch_20newsgroups()`

---

## Regression Problems

### Kaggle Datasets

1. **Medical Cost Personal Dataset**
   - **Description:** Predict individual medical costs based on factors like age, BMI, and smoking status.
   - **Challenges:** **Outliers** in the dataset require careful handling.
   - **Size:** 1,338 records with 7 features.
   - **Use Case:** Healthcare cost prediction.
   - **Source:** [Medical Cost Personal Dataset](https://www.kaggle.com/mirichoi0218/insurance)

---

### UCI Datasets

2. **Air Quality Dataset**
   - **Description:** Measure air quality in terms of pollutants like CO and NOx, with multivariate data.
   - **Challenges:** **Missing values** and a **time-series** structure add complexity.
   - **Size:** 9,358 records.
   - **Use Case:** Environmental monitoring, pollution prediction.
   - **Source:** [UCI Air Quality Dataset](https://archive.ics.uci.edu/ml/datasets/Air+Quality)

---

### scikit-learn Datasets

3. **California Housing Dataset**
   - **Description:** Predict median house values for California districts using features like median income, house age, and population.
   - **Challenges:** **Outliers** in per-household averages, such as rooms and occupancy, which can be very large in districts with few households.
   - **Size:** 20,640 samples with 8 numerical features.
   - **Use Case:** Real estate pricing predictions.
   - **Method to Load:** `fetch_california_housing()`

---

4. **OpenML Datasets**
   - **Description:** A broad collection of datasets from the OpenML platform that can be used for various machine learning tasks.
   - **Use Case:** Versatile datasets for classification, regression, clustering, etc.
   - **Method to Load:** `fetch_openml()`

For more information and additional datasets, please visit Kaggle, the UCI Machine Learning Repository, or [scikit-learn's real-world datasets page](https://scikit-learn.org/1.5/datasets/real_world.html).

---

The datasets which particularly stood out to me were the Forest Cover Type dataset and the Adult Income dataset. They present a variety of challenges and opportunities for applying classification techniques. They are also excellent for testing advanced methods for handling imbalanced data (e.g., the synthetic minority oversampling technique, SMOTE) and for handling missing data (e.g., imputation).
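The two ideas mentioned above can be sketched in plain Python. Note that this is a simplified illustration under our own assumptions: it uses mean imputation and random oversampling (duplicating minority rows), which is a simpler stand-in for SMOTE, since SMOTE creates new synthetic points by interpolating between minority-class neighbours. The data and function names are made up for illustration.

```python
import random


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra copies (with replacement) to reach the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data (made up): a feature with gaps, and a 4-to-1 class split.
ages = [22.0, None, 38.0, 30.0, None]
print(impute_mean(ages))  # [22.0, 30.0, 38.0, 30.0, 30.0]

rows = [[1.0], [2.0], [3.0], [4.0], [9.0]]
labels = [0, 0, 0, 0, 1]
new_rows, new_labels = random_oversample(rows, labels)
print(new_labels.count(0), new_labels.count(1))  # 4 4
```

In practice, resampling of this kind should be applied to the training data only, so that the left-out test data still reflects the original class balance.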

# References

The references for this section are provided below.

1. **[Blog on Dataset Types: Sprinkledata](https://www.sprinkledata.com/blogs/what-is-a-dataset-types-examples-and-the-techniques-involved)**: This allowed us to get an overview of the types of datasets.
2. **[Office for National Statistics](https://service-manual.ons.gov.uk/content/content-types/datasets)**: This source defines datasets differently from the way we do in this text.
3. **[Blog on Dataset Types: Brightdata](https://brightdata.com/blog/web-data/what-is-a-dataset)**: This helped us get an overview of different types of datasets, and explore the consistency or variation in the categorisation across resources.

Note: These links were accessed from 10:00 to 12:00 on 27/09/2024.

## Data sources

- **[Kaggle](https://www.kaggle.com/datasets)**

- **[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)**

- **[Google Dataset Search](https://datasetsearch.research.google.com/)**