# Tired of Cliché Datasets? Here Are 20 Awesome Alternatives For Any Task
![](images/unsplash.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://unsplash.com/@dogukan?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'> Doğukan Şahin</a>
        on 
        <a href='https://unsplash.com/s/photos/tired?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Unsplash.</a> All images are by the author unless specified otherwise.
    </strong>
</figcaption>

### Setup

In [7]:
import warnings

import gdown
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import rcParams

rcParams["xtick.labelsize"] = 15
rcParams["ytick.labelsize"] = 15

warnings.filterwarnings("ignore")

"I'm going to puke over my RGB-backlit keyboard *so hard* if I see one more person using Titanic, Iris, Wine or Boston datasets!"

This is the feeling you might gradually develop after being a data science learner for a while. You just can't help it: everyone wants the easy thing. Beginners use these datasets because they are stupidly straightforward; most course creators and bloggers use them because they are a single Google search away.

# Regression datasets

## 1. Diamond prices and carat regression

My favorite from this list is the diamonds dataset. It is an ideal size (50k+ samples) and has multiple targets you can predict, either as a regression or as a multi-class classification task:

In [4]:
diamonds = sns.load_dataset("diamonds").sample(10).reset_index(drop=True)
diamonds

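| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.9 | Very Good | F | VS1 | 61.2 | 56.0 | 5324 | 6.18 | 6.23 | 3.8 |
| 1 | 0.41 | Good | J | VS2 | 64.0 | 59.0 | 732 | 4.71 | 4.64 | 2.99 |
| 2 | 0.92 | Very Good | H | VS1 | 62.9 | 58.0 | 4228 | 6.17 | 6.2 | 3.89 |
| 3 | 0.24 | Premium | H | VVS1 | 61.2 | 58.0 | 432 | 3.96 | 4.01 | 2.44 |
| 4 | 0.3 | Premium | E | VS2 | 59.2 | 59.0 | 844 | 4.41 | 4.37 | 2.6 |
| 5 | 0.93 | Very Good | F | VS2 | 62.7 | 58.0 | 4925 | 6.21 | 6.27 | 3.91 |
| 6 | 0.41 | Premium | J | VVS1 | 62.2 | 59.0 | 994 | 4.75 | 4.71 | 2.94 |
| 7 | 0.41 | Premium | G | VS1 | 60.5 | 58.0 | 899 | 4.76 | 4.8 | 2.89 |
| 8 | 0.3 | Ideal | G | VVS1 | 62.0 | 55.0 | 1013 | 4.34 | 4.31 | 2.68 |
| 9 | 0.71 | Very Good | H | VS2 | 61.8 | 58.0 | 2368 | 5.66 | 5.7 | 3.51 |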


In [5]:
sns.load_dataset("diamonds").shape

(53940, 10)

**🎯 Targets: 'carat', 'price' (regression); 'cut', 'color', 'clarity' (multi-class)**

**🔗 Link: [Kaggle](https://www.kaggle.com/shivam2503/diamonds)**

**📦Dimensions: (53940, 10)**

**⚙Missing values: No**

**[📚Starter notebook](https://www.kaggle.com/fuzzywizard/diamonds-in-depth-analysis)**
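
To make the "regression or multi-class" point a bit more concrete, here is a minimal sketch of how the same table can be framed as either task. It uses a handful of rows copied from the sample shown earlier so it runs without a download. The helper name `make_xy` is made up for this example, and one-hot encoding is just one reasonable way to handle the categorical columns.

```python
import pandas as pd

# A few rows copied from the sample shown earlier, so no download is needed.
sample = pd.DataFrame(
    {
        "carat": [0.90, 0.41, 0.92, 0.24, 0.30],
        "cut": ["Very Good", "Good", "Very Good", "Premium", "Premium"],
        "color": ["F", "J", "H", "H", "E"],
        "clarity": ["VS1", "VS2", "VS1", "VVS1", "VS2"],
        "depth": [61.2, 64.0, 62.9, 61.2, 59.2],
        "table": [56.0, 59.0, 58.0, 58.0, 59.0],
        "price": [5324, 732, 4228, 432, 844],
        "x": [6.18, 4.71, 6.17, 3.96, 4.41],
        "y": [6.23, 4.64, 6.20, 4.01, 4.37],
        "z": [3.80, 2.99, 3.89, 2.44, 2.60],
    }
)


def make_xy(df, target):
    """Hypothetical helper: split off `target`, one-hot encode remaining categoricals."""
    y = df[target]
    X = pd.get_dummies(df.drop(columns=[target]))
    return X, y


# Regression framing: the target is a numeric column.
X_reg, y_reg = make_xy(sample, "price")

# Multi-class framing: the target is one of the categorical columns.
X_clf, y_clf = make_xy(sample, "cut")

print(X_reg.columns.tolist())
print(y_clf.unique())
```

On the full data, you would pass `sns.load_dataset("diamonds")` to the same helper instead of the hand-typed sample.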

## 2. Age of Abalone shells

This is a rather unique dataset from the field of zoology. The task is to predict the age of abalone (a type of mollusc) from several physical measurements. Traditionally, their age is determined by cutting through the shell's cone, staining it, and counting the number of rings inside the shell under a microscope.

For zoologists, this might be fun, but for data scientists, not so much:

In [12]:
abalone = pd.read_csv("data/reg6_abalone.csv")
abalone.head(10)

| | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |
| 5 | I | 0.425 | 0.3 | 0.095 | 0.3515 | 0.141 | 0.0775 | 0.12 | 8 |
| 6 | F | 0.53 | 0.415 | 0.15 | 0.7775 | 0.237 | 0.1415 | 0.33 | 20 |
| 7 | F | 0.545 | 0.425 | 0.125 | 0.768 | 0.294 | 0.1495 | 0.26 | 16 |
| 8 | M | 0.475 | 0.37 | 0.125 | 0.5095 | 0.2165 | 0.1125 | 0.165 | 9 |
| 9 | F | 0.55 | 0.44 | 0.15 | 0.8945 | 0.3145 | 0.151 | 0.32 | 19 |


**🎯 Target: 'Rings'**

**🔗 Link: [Kaggle](https://www.kaggle.com/rodolfomendes/abalone-dataset)**

**📦Dimensions: (4177, 9)**

**⚙Missing values: No**

**[📚Starter notebook](https://www.kaggle.com/ragnisah/eda-abalone-age-prediction)**
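
Since the ring count is standing in for age, it can help to convert it before modeling. Below is a minimal sketch that assumes the convention from the original UCI description of this dataset, where age in years is roughly the ring count plus 1.5. The `Age` column name is introduced here for illustration, and the rows are a subset of the preview above.

```python
import pandas as pd

# First five rows from the preview above (subset of columns).
abalone_sample = pd.DataFrame(
    {
        "Sex": ["M", "M", "F", "M", "I"],
        "Length": [0.455, 0.350, 0.530, 0.440, 0.330],
        "Whole weight": [0.5140, 0.2255, 0.6770, 0.5160, 0.2050],
        "Rings": [15, 7, 9, 10, 7],
    }
)

# Assumed convention: age in years is about ring count + 1.5.
abalone_sample["Age"] = abalone_sample["Rings"] + 1.5

# 'Sex' takes three values here (M, F, I), so one-hot encode it before modeling.
features = pd.get_dummies(
    abalone_sample.drop(columns=["Rings", "Age"]), columns=["Sex"]
)
target = abalone_sample["Age"]

print(features.columns.tolist())
print(target.tolist())
```

Predicting `Rings` directly works just as well; the offset only changes how you read the predictions.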

## 3. King county house sales

This is the dataset for those who are still interested in real estate and house price regression:

In [15]:
king = pd.read_csv("data/reg1_house_sales_king_county.csv")
king.head(10)

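| | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | yr_built | yr_renovated | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 221900.0 | 3.0 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | 3 | 7 | 1180 | 1955 | NaN | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 538000.0 | 3.0 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | 3 | 7 | 2170 | 1951 | 1991.0 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 180000.0 | 2.0 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | 3 | 6 | 770 | 1933 | NaN | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 604000.0 | 4.0 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | 5 | 7 | 1050 | 1965 | NaN | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 510000.0 | 3.0 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | 3 | 8 | 1680 | 1987 | NaN | 47.6168 | -122.045 | 1800 | 7503 |
| 5 | 1225000.0 | 4.0 | 4.5 | 5420 | 101930 | 1.0 | 0 | 0 | 3 | 11 | 3890 | 2001 | NaN | 47.6561 | -122.005 | 4760 | 101930 |
| 6 | 257500.0 | 3.0 | 2.25 | 1715 | 6819 | 2.0 | 0 | 0 | 3 | 7 | 1715 | 1995 | NaN | 47.3097 | -122.327 | 2238 | 6819 |
| 7 | 291850.0 | 3.0 | 1.5 | 1060 | 9711 | 1.0 | 0 | 0 | 3 | 7 | 1060 | 1963 | NaN | 47.4095 | -122.315 | 1650 | 9711 |
| 8 | 229500.0 | 3.0 | 1.0 | 1780 | 7470 | 1.0 | 0 | 0 | 3 | 7 | 1050 | 1960 | NaN | 47.5123 | -122.337 | 1780 | 8113 |
| 9 | 323000.0 | 3.0 | 2.5 | 1890 | 6560 | 2.0 | 0 | 0 | 3 | 7 | 1890 | 2003 | NaN | 47.3684 | -122.031 | 2390 | 7570 |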


In [16]:
king.shape

(21613, 17)

**🎯 Target: 'price'**

**🔗 Link: [Kaggle](https://www.kaggle.com/harlfoxem/housesalesprediction)**

**📦Dimensions: (21613, 17)**

**⚙Missing values: Yes**

**[📚Starter notebook](https://www.kaggle.com/burhanykiyakoglu/predicting-house-prices)**
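
Because this one does have missing values, a quick audit is a sensible first step. The sketch below uses a few rows and columns from the preview above, where `yr_renovated` is mostly empty. The `renovated` flag is just one possible treatment, introduced here as an assumption rather than a recommendation from the dataset's authors.

```python
import numpy as np
import pandas as pd

# First five rows from the preview above (subset of columns).
king_sample = pd.DataFrame(
    {
        "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0],
        "sqft_living": [1180, 2570, 770, 1960, 1680],
        "yr_built": [1955, 1951, 1933, 1965, 1987],
        "yr_renovated": [np.nan, 1991.0, np.nan, np.nan, np.nan],
    }
)

# Count missing values per column and keep only the columns that have any.
missing = king_sample.isna().sum()
missing = missing[missing > 0]
print(missing)

# One possible treatment (an assumption): collapse the sparse renovation year
# into a 0/1 flag. Treating NaN as 0 first also covers files that encode
# "never renovated" as 0 instead of leaving the cell empty.
king_sample["renovated"] = (king_sample["yr_renovated"].fillna(0) > 0).astype(int)
print(king_sample[["yr_renovated", "renovated"]])
```

Running `king.isna().sum()` on the full frame gives the same audit across all 17 columns.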