# Tired of Cliché Datasets? Here are 18 Awesome Alternatives From All Domains
## Unique datasets ranging from microbiology to sports!
![](images/unsplash.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://unsplash.com/@dogukan?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'> Doğukan Şahin</a>
        on 
        <a href='https://unsplash.com/s/photos/tired?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Unsplash.</a> All images are by the author unless specified otherwise.
    </strong>
</figcaption>

### Setup

```python
import warnings

import datapane as dp
import gdown
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import rcParams

rcParams["xtick.labelsize"] = 15
rcParams["ytick.labelsize"] = 15

warnings.filterwarnings("ignore")
```

```python
def publish_report(dataset, name):
    dp.Report(dp.DataTable(dataset.head(10))).upload(name=name, open=True)
```

"I'm going to throw up over my RGB backlit-keyboard *so hard* if I see one more person using Titanic, Iris, Wine or Boston datasets!"

This is the (slightly exaggerated) feeling you might gradually develop after being a data science learner for a while. You just can't help it - everyone wants the easy thing. Beginners use these datasets because they are stupidly straightforward; most course creators and bloggers use them because they are just a single Google search away (or even bookmarked).

In the 85+ articles I have written, I honestly can't remember using any of those clichés (if my memory is letting me down and I *did* use them once or twice, I apologize!). This is mainly thanks to the many, many hours I have spent searching for good datasets so I could deliver content in novel ways to my precious audience.

Today, I have decided to share a list of curated example datasets I used in my posts and as part of my own learning. Some of them are even part of a bigger article I am working on (to be published soon). In the meantime, enjoy!
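
Each entry below reports a dataset's dimensions and whether it contains missing values. Here is a minimal sketch of how those two facts could be checked with pandas. The `summarize` helper and the toy frame are made up for illustration and are not part of the original notebook:

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Return the shape of a frame and whether any cell is missing."""
    return {
        "dimensions": df.shape,
        "missing_values": bool(df.isna().any().any()),
    }


# Toy stand-in for a real dataset loaded with pd.read_csv(...)
toy = pd.DataFrame({"carat": [0.23, 0.21, np.nan], "price": [326, 326, 327]})
summary = summarize(toy)
print(summary)
```

On a real download, you would pass the frame returned by `pd.read_csv` instead of `toy`.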

# Regression datasets

## 1️⃣. Diamond prices and carat regression

My favorite from this list is the diamonds dataset. It is an ideal size for practice (50k+ samples) and has multiple targets you can predict as a regression or a multi-class classification task:

https://datapane.com/u/bextuychiev/reports/diamonds/

**🎯 Targets: 'carat' or 'price'**

**🔗 Link: [Kaggle](https://www.kaggle.com/shivam2503/diamonds)**

**📦Dimensions: (53940, 10)**

**⚙Missing values: No**

**[📚Starter notebook](https://www.kaggle.com/fuzzywizard/diamonds-in-depth-analysis)**

## 2️⃣. Age of Abalone shells

This is a rather unique dataset from the field of zoology. The task is to predict the age of abalones (a type of mollusc) using several physical measurements. Traditionally, their age is found by cutting through the cone of the shell, staining it and counting the number of rings inside under a microscope.

For zoologists, this might be fun, but for data scientists, not so much:

https://datapane.com/u/bextuychiev/reports/abalone/

**🎯 Target: 'Rings'**

**🔗 Link: [Kaggle](https://www.kaggle.com/rodolfomendes/abalone-dataset)**

**📦Dimensions: (4177, 9)**

**⚙Missing values: No**

**[📚Starter notebook](https://www.kaggle.com/ragnisah/eda-abalone-age-prediction)**
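
According to the UCI description of this dataset, adding 1.5 to the ring count gives the age in years. Below is a small sketch of that conversion on made-up rows. The `Age` column name is hypothetical:

```python
import pandas as pd

# Made-up rows; the real file also has eight measurement columns
abalone = pd.DataFrame({"Rings": [15, 7, 9]})

# Age in years = number of rings + 1.5 (per the dataset description)
abalone["Age"] = abalone["Rings"] + 1.5
print(abalone)
```

Either column can serve as the regression target since they differ only by a constant.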

## 3️⃣. King county house sales

This is the dataset for those who are still interested in real estate and house price regression:

https://datapane.com/u/bextuychiev/reports/king/

**🎯 Target: 'price'**

**🔗 Link: [Kaggle](https://www.kaggle.com/harlfoxem/housesalesprediction)**

**📦Dimensions: (21613, 17)**

**⚙Missing values: Yes**

**[📚Starter notebook](https://www.kaggle.com/burhanykiyakoglu/predicting-house-prices)**

## 4️⃣. Cancer death rate

This dataset challenges you to predict the cancer mortality rate per 100,000 people using a number of demographic variables:

https://datapane.com/u/bextuychiev/reports/cancer/

**🎯 Target: 'TARGET_deathRate'**

**🔗 Link: [Data.world](https://data.world/nrippner/ols-regression-challenge)**

**📦Dimensions: (3047, 33)**

**⚙Missing values: Yes**

## 5️⃣. Life expectancy

How long will a person live? This is one of the hardest unanswered questions in science. Several studies have been undertaken to understand human life and longevity, and this dataset provided by WHO (World Health Organization) is one of them:

https://datapane.com/u/bextuychiev/reports/who/

**🎯 Target: 'Life expectancy'**

**🔗 Link: [Kaggle](https://www.kaggle.com/kumarajarshi/life-expectancy-who/)**

**📦Dimensions: (2938, 21)**

**⚙Missing values: Yes**

**[📚Starter notebook](https://www.kaggle.com/mathchi/life-expectancy-who-with-several-ml-techniques)**

## 6️⃣. Car prices

The title says it all - predict car prices using variables like mileage, fuel type, transmission and several domain-specific features. This is also an excellent dataset for flexing your feature engineering muscles:

https://datapane.com/u/bextuychiev/reports/cars/

**🎯 Target: 'selling_price'**

**🔗 Link: [Kaggle](https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho?ref=hackernoon.com&select=Car+details+v3.csv)**

**📦Dimensions: (8128, 12)**

**⚙Missing values: Yes**

**[📚Starter notebook](https://www.kaggle.com/mohaiminul101/car-price-prediction)**
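
As one possible feature engineering step, a few columns in this file appear to store numbers together with their units as text (for example, `mileage` as `"23.4 kmpl"`). That format is an assumption to verify on your download. Under it, here is a sketch for pulling out the numeric part on made-up rows, with hypothetical `_num` column names:

```python
import pandas as pd

# Made-up rows that mimic the assumed "number + unit" text format
cars = pd.DataFrame(
    {
        "mileage": ["23.4 kmpl", "21.14 kmpl", None],
        "engine": ["1248 CC", "1498 CC", "1497 CC"],
    }
)

for col in ["mileage", "engine"]:
    # Grab the first number in the text and convert it to float;
    # rows without a match (or missing rows) become NaN
    numbers = cars[col].str.extract(r"(\d+\.?\d*)", expand=False)
    cars[col + "_num"] = numbers.astype(float)

print(cars)
```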

# Binary classification

## 7️⃣. NBA rookie stats

The first binary classification dataset in the list requires you to predict if a rookie basketball player will last more than 5 years in the league:

https://datapane.com/u/bextuychiev/reports/nba/

**🎯 Target: 'TARGET_5Yrs'**

**🔗 Link: [Data.world](https://data.world/exercises/logistic-regression-exercise-1)**

**📦Dimensions: (8128, 12)**

**⚙Missing values: Yes**


## 8️⃣. Stroke prediction

Another medical dataset asks you to predict whether a patient will have a stroke or not based on their history. Very interesting features:

https://datapane.com/u/bextuychiev/reports/stroke/

**🎯 Target: 'stroke'**

**🔗 Link: [Kaggle](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset)**

**📦Dimensions: (5110, 11)**

**⚙Missing values: Yes**

**[📚Starter notebook](https://www.kaggle.com/joshuaswords/predicting-a-stroke-shap-lime-explainer-eli5)**

## 9️⃣. Water potability

Safe drinking water is the most basic human right and a major influence on health. Using this dataset, you should classify water bodies into *potable (drinkable)* and *not potable* using a number of chemical properties:

https://datapane.com/u/bextuychiev/reports/water/

**🎯 Target: 'Potability'**

**🔗 Link: [Kaggle](https://www.kaggle.com/adityakadiwal/water-potability)**

**📦Dimensions: (3276, 10)**

**⚙Missing values: Yes**

**[📚Starter notebook](https://www.kaggle.com/jaykumar1607/water-quality-analysis-plotly-and-modelling)**

## 🔟. Smart grid stability

This is an augmented version of the "Electrical Grid Stability Simulated Dataset" created by Vadim Arzamasov. It was donated to UCI and made available on Kaggle. You will be predicting the stability of 4-node smart grid systems (whatever that means):

https://datapane.com/u/bextuychiev/reports/grid/

**🎯 Target: 'stabf'**

**🔗 Link: [Kaggle](https://www.kaggle.com/pcbreviglieri/smart-grid-stability)**

**📦Dimensions: (60000, 13)**

**⚙Missing values: No**

**[📚Starter notebook](https://www.kaggle.com/pcbreviglieri/predicting-smart-grid-stability-with-deep-learning)**

## 1️⃣1️⃣. IBM HR analytics & employee attrition

This fictional dataset created by IBM data scientists tasks you with uncovering which factors lead to employee attrition (whether they will leave their role):

https://datapane.com/u/bextuychiev/reports/hr/

**🎯 Target: 'Attrition'**

**🔗 Link: [Kaggle](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset)**

**📦Dimensions: (1470, 35)**

**⚙Missing values: No**

**[📚Starter notebook](https://www.kaggle.com/janiobachmann/attrition-in-an-organization-why-workers-quit)**


## 1️⃣2️⃣. Can I eat this mushroom?

Another one-of-a-kind dataset asks you to classify mushrooms into edible and poisonous. It also presents a unique challenge - all features are categorical:

https://datapane.com/u/bextuychiev/reports/shroom/

**🎯 Target: 'class'**

**🔗 Link: [Kaggle](https://www.kaggle.com/uciml/mushroom-classification)**

**📦Dimensions: (8124, 23)**

**⚙Missing values: Yes**

**[📚Starter notebook](https://www.kaggle.com/alincijov/mushroom-classification-using-genetic-algorithm)**
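
Since every column holds category codes, the features need to be encoded before most models can use them. Below is a minimal sketch with one-hot encoding on made-up rows. It assumes the target uses `p` for poisonous and `e` for edible, and the two feature columns are only a small illustrative subset:

```python
import pandas as pd

# Made-up rows; every column, including the target, holds category codes
shrooms = pd.DataFrame(
    {
        "class": ["p", "e", "e", "p"],
        "odor": ["p", "a", "l", "p"],
        "cap-color": ["n", "y", "w", "w"],
    }
)

# Binary target: 1 for poisonous, 0 for edible (assumed coding)
y = (shrooms["class"] == "p").astype(int)

# One indicator column per category level of each feature
X = pd.get_dummies(shrooms.drop(columns="class"))
print(X.columns.tolist())
```

One-hot encoding is only one option here. Tree-based libraries with native categorical support can often take such columns more directly.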

## 1️⃣3️⃣. Banknote authentication

Even though this dataset has very few features, I wanted to include it because the task is really interesting - using physical attributes of banknotes, you should classify them into forged or original:

https://datapane.com/u/bextuychiev/reports/note/

**🎯 Target: 'class'**

**🔗 Link: [Kaggle](https://www.kaggle.com/ritesaluja/bank-note-authentication-uci-data)**

**📦Dimensions: (1372, 5)**

**⚙Missing values: No**

**[📚Starter notebook](https://www.kaggle.com/vivekgediya/banknote-authentication-analysis)**

## 1️⃣4️⃣. Adult income dataset

Predict whether a person will end up earning more than 50k using factors like age, education, background, gender, marital status, etc.:

https://datapane.com/u/bextuychiev/reports/adult/

**🎯 Target: 'income'**

**🔗 Link: [Kaggle](https://www.kaggle.com/wenruliu/adult-income-dataset)**

**📦Dimensions: (48842, 15)**

**⚙Missing values: Yes**

**[📚Starter notebook](https://www.kaggle.com/alokevil/simple-eda-for-beginners)**
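
In the copies of this dataset I am aware of, missing entries are written as the string `?` rather than left empty, and the target holds the labels `<=50K` and `>50K`. Treat both as assumptions to verify on your download. Under them, here is a sketch of the cleanup on made-up rows:

```python
import numpy as np
import pandas as pd

# Made-up rows using the assumed "?" placeholder and income labels
adult = pd.DataFrame(
    {
        "workclass": ["Private", "?", "Local-gov"],
        "income": ["<=50K", ">50K", "<=50K"],
    }
)

# Turn the "?" placeholder into a real missing value
adult = adult.replace("?", np.nan)

# Encode the target as 1 for ">50K" and 0 otherwise
adult["income"] = (adult["income"] == ">50K").astype(int)
print(adult.isna().sum())
```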

# Multi-class classification datasets

## 1️⃣5️⃣. Yeast classification

This dataset will give you a small taste of the world of microbiology. You are tasked with predicting where in the cell a yeast protein ends up (its localization site):

https://datapane.com/u/bextuychiev/reports/yeast/

**🎯 Target: 'class_protein_localization'**

**🔗 Link: [OpenML](https://www.openml.org/d/181)**

**📦Dimensions: (1484, 9)**

**⚙Missing values: No**

https://towardsdatascience.com/comprehensive-guide-to-multiclass-classification-with-sklearn-127cc500f362

## 1️⃣6️⃣. Kaggle TPS May 2021

Kaggle hosts monthly competitions called "Tabular Playground Series" with beginner-to-intermediate tasks. The most important point is that each month a new synthetic dataset of considerable size is created using the CTGAN framework. This one is from the May edition:

https://datapane.com/u/bextuychiev/reports/tps-may/

**🎯 Target: 'target'**

**🔗 Link: [Kaggle](https://www.kaggle.com/c/tabular-playground-series-may-2021/data?select=train.csv)**

**📦Dimensions: (100000, 52)**

**⚙Missing values: No**

**[📚Starter notebook](https://www.kaggle.com/subinium/tps-may-categorical-eda)**

https://towardsdatascience.com/comprehensive-guide-on-multiclass-classification-metrics-af94cfb83fbd

## 1️⃣7️⃣. Kaggle TPS June 2021

A similar dataset with more features and samples:

https://datapane.com/u/bextuychiev/reports/tps-june/

**🎯 Target: 'target'**

**🔗 Link: [Kaggle](https://www.kaggle.com/c/tabular-playground-series-jun-2021/data?select=train.csv)**

**📦Dimensions: (200000, 77)**

**⚙Missing values: No**

**[📚Starter notebook](https://www.kaggle.com/dwin183287/tps-june-2021-eda)**

## 1️⃣8️⃣. Diamonds, again

Just mentioning the diamonds dataset again, because it has three categorical features, which can be multi-class targets on their own:

https://datapane.com/u/bextuychiev/reports/diamonds2/

**🎯 Targets: 'cut', 'color', 'clarity'**

**🔗 Link: [Kaggle](https://www.kaggle.com/shivam2503/diamonds)**

**📦Dimensions: (53940, 10)**

**⚙Missing values: No**

**[📚Starter notebook](https://www.kaggle.com/fuzzywizard/diamonds-in-depth-analysis)**
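
Because these three columns are quality grades, their classes have a natural order. Here is a sketch for `cut` on made-up rows, assuming the usual worst-to-best grade order of Fair, Good, Very Good, Premium, Ideal. The `cut_label` column name is hypothetical:

```python
import pandas as pd

# Made-up rows covering each grade once
diamonds = pd.DataFrame({"cut": ["Ideal", "Premium", "Good", "Fair", "Very Good"]})

# Assumed grade order, from worst to best
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
diamonds["cut"] = pd.Categorical(diamonds["cut"], categories=cut_order, ordered=True)

# Integer class labels that respect the grade order (Fair -> 0, Ideal -> 4)
diamonds["cut_label"] = diamonds["cut"].cat.codes
print(diamonds)
```

The same pattern would apply to `color` and `clarity` once you settle on their grade order.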

## Summary

The majority of these datasets will be used in my upcoming article. The goal is to conduct an ultimate comparison of XGBoost, CatBoost, LightGBM and Sklearn on 21 datasets (7 regression, 7 binary and 7 multi-class tasks).

I am going deep and crazy - the post will include comparisons of the 4 libraries in every aspect imaginable (in terms of model performance). So, be sure to hit the green button below and subscribe!

![](images/cta.gif)