# Tired of Cliché Datasets? Here Are 20 Awesome Alternatives For Any Task
![](images/unsplash.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://unsplash.com/@dogukan?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'> Doğukan Şahin</a>
        on 
        <a href='https://unsplash.com/s/photos/tired?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Unsplash.</a> All images are by the author unless specified otherwise.
    </strong>
</figcaption>

### Setup

In [11]:
import warnings

import gdown
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import rcParams

rcParams["xtick.labelsize"] = 15
rcParams["ytick.labelsize"] = 15

warnings.filterwarnings("ignore")

"I'm going to puke over my RGB backlit-keyboard *so hard* if I see one more person using Titanic, Iris, Wine or Boston datasets!"

This is the feeling you might gradually develop after being a data science learner for a while. You just can't help it - everyone wants the easy thing. Beginners use these datasets because they are stupidly straightforward; most course creators and bloggers use them because they are just a single Google search away.

# Regression datasets

## 1. Diamond prices and carat regression

My favorite from this list is the diamonds dataset. It is ideal in size (50k+ samples) and has multiple targets you can predict as a regression or a multi-class classification task:

In [12]:
diamonds = sns.load_dataset("diamonds").sample(10).reset_index(drop=True)
diamonds

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,2.15,Good,G,SI2,63.8,58.0,13317,8.23,8.08,5.2
1,1.01,Ideal,G,SI2,61.7,56.0,4260,6.47,6.43,3.98
2,0.51,Premium,E,VS2,60.3,59.0,1781,5.22,5.2,3.14
3,1.05,Ideal,F,VVS1,61.6,55.0,10872,6.57,6.53,4.04
4,0.3,Premium,G,VS1,62.5,59.0,787,4.32,4.25,2.68
5,1.06,Ideal,D,SI1,62.6,56.0,6426,6.49,6.52,4.07
6,0.25,Ideal,G,VS1,62.7,56.0,454,4.01,4.04,2.52
7,1.01,Ideal,J,SI2,61.6,55.0,3732,6.45,6.5,3.99
8,0.9,Very Good,F,SI1,62.5,59.0,3954,6.06,6.13,3.81
9,0.36,Ideal,G,VS1,60.6,57.0,761,4.58,4.62,2.79


In [13]:
sns.load_dataset("diamonds").shape

(53940, 10)

**🎯 Targets: 'carat', 'price' (regression); 'cut', 'color', 'clarity' (multi-class)**

**🔗 Link: [Kaggle](https://www.kaggle.com/shivam2503/diamonds)**

**📦Dimensions: (53940, 10)**

**⚙Missing values: No**

**[📚Starter notebook](https://www.kaggle.com/fuzzywizard/diamonds-in-depth-analysis)**
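
Since the same table can serve both task types, here is a minimal sketch of how the two setups might be prepared. The rows are copied from the sample output above; the worst-to-best grade orderings are an assumption taken from the usual description of this dataset, and the `_code` column names are made up for illustration.

```python
import pandas as pd

# A few rows copied from the sample output above.
sample = pd.DataFrame(
    {
        "carat": [2.15, 1.01, 0.51, 1.05, 0.90],
        "cut": ["Good", "Ideal", "Premium", "Ideal", "Very Good"],
        "color": ["G", "G", "E", "F", "F"],
        "clarity": ["SI2", "SI2", "VS2", "VVS1", "SI1"],
        "price": [13317, 4260, 1781, 10872, 3954],
    }
)

# Grade orderings from worst to best (assumed, not stated in this article).
orders = {
    "cut": ["Fair", "Good", "Very Good", "Premium", "Ideal"],
    "color": ["J", "I", "H", "G", "F", "E", "D"],
    "clarity": ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"],
}

encoded = sample.copy()
for col, levels in orders.items():
    # An ordered categorical keeps the ranking; .codes turns it into integers.
    encoded[f"{col}_code"] = pd.Categorical(
        sample[col], categories=levels, ordered=True
    ).codes

# Regression setup: predict price from carat and the encoded grades.
X_reg = encoded[["carat", "cut_code", "color_code", "clarity_code"]]
y_reg = encoded["price"]

# Multi-class setup: predict cut from the remaining columns.
X_clf = encoded[["carat", "price", "color_code", "clarity_code"]]
y_clf = encoded["cut_code"]

print(encoded[["cut", "cut_code", "color", "color_code", "clarity", "clarity_code"]])
```

Integer codes suit tree-based models; for linear models, one-hot encoding may be the safer choice because it does not assume equal spacing between grades.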

## 2. Age of Abalone shells

This is a rather unique dataset from the field of zoology. The task is to predict the age of abalone (a type of mollusc) from several physical measurements. Traditionally, their age is determined by cutting through the cone of the shell, staining it and counting the number of rings inside under a microscope.

For zoologists, this might be fun but for data scientists, not so much:

In [14]:
abalone = pd.read_csv("data/reg6_abalone.csv")
abalone.head(10)

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
5,I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8
6,F,0.53,0.415,0.15,0.7775,0.237,0.1415,0.33,20
7,F,0.545,0.425,0.125,0.768,0.294,0.1495,0.26,16
8,M,0.475,0.37,0.125,0.5095,0.2165,0.1125,0.165,9
9,F,0.55,0.44,0.15,0.8945,0.3145,0.151,0.32,19


**🎯 Target: 'Rings'**

**🔗 Link: [Kaggle](https://www.kaggle.com/rodolfomendes/abalone-dataset)**

**📦Dimensions: (4177, 9)**

**⚙Missing values: No**

**[📚Starter notebook](https://www.kaggle.com/ragnisah/eda-abalone-age-prediction)**
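
The target is a ring count, not an age. As a rough sketch, the snippet below converts rings to years and one-hot encodes `Sex`. The `+ 1.5` offset is an assumption based on the original UCI description of this dataset, and the `Age` column name is my own; the rows come from the output above.

```python
import pandas as pd

# A few rows and columns copied from the output above.
abalone_sample = pd.DataFrame(
    {
        "Sex": ["M", "M", "F", "M", "I"],
        "Length": [0.455, 0.35, 0.53, 0.44, 0.33],
        "Whole weight": [0.514, 0.2255, 0.677, 0.516, 0.205],
        "Rings": [15, 7, 9, 10, 7],
    }
)

# Assumption (UCI description): age in years is roughly rings + 1.5.
abalone_sample["Age"] = abalone_sample["Rings"] + 1.5

# 'Sex' has three levels (M, F and I for infant), so one-hot encode it.
features = pd.get_dummies(
    abalone_sample.drop(columns=["Rings", "Age"]), columns=["Sex"]
)
target = abalone_sample["Age"]

print(features.join(target))
```

Because rings are integers, the same column could also be treated as a multi-class or ordinal target instead of a regression target.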

## 3. King County house sales

This is the dataset for those who are still interested in real estate and house price regression:

In [15]:
king = pd.read_csv("data/reg1_house_sales_king_county.csv")
king.head(10)

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,yr_built,yr_renovated,lat,long,sqft_living15,sqft_lot15
0,221900.0,3.0,1.0,1180,5650,1.0,0,0,3,7,1180,1955,,47.5112,-122.257,1340,5650
1,538000.0,3.0,2.25,2570,7242,2.0,0,0,3,7,2170,1951,1991.0,47.721,-122.319,1690,7639
2,180000.0,2.0,1.0,770,10000,1.0,0,0,3,6,770,1933,,47.7379,-122.233,2720,8062
3,604000.0,4.0,3.0,1960,5000,1.0,0,0,5,7,1050,1965,,47.5208,-122.393,1360,5000
4,510000.0,3.0,2.0,1680,8080,1.0,0,0,3,8,1680,1987,,47.6168,-122.045,1800,7503
5,1225000.0,4.0,4.5,5420,101930,1.0,0,0,3,11,3890,2001,,47.6561,-122.005,4760,101930
6,257500.0,3.0,2.25,1715,6819,2.0,0,0,3,7,1715,1995,,47.3097,-122.327,2238,6819
7,291850.0,3.0,1.5,1060,9711,1.0,0,0,3,7,1060,1963,,47.4095,-122.315,1650,9711
8,229500.0,3.0,1.0,1780,7470,1.0,0,0,3,7,1050,1960,,47.5123,-122.337,1780,8113
9,323000.0,3.0,2.5,1890,6560,2.0,0,0,3,7,1890,2003,,47.3684,-122.031,2390,7570


In [16]:
king.shape

(21613, 17)

**🎯 Target: 'price'**

**🔗 Link: [Kaggle](https://www.kaggle.com/harlfoxem/housesalesprediction)**

**📦Dimensions: (21613, 17)**

**⚙Missing values: Yes**

**[📚Starter notebook](https://www.kaggle.com/burhanykiyakoglu/predicting-house-prices)**
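
Most of the missing values in the sample sit in `yr_renovated`. One possible way to handle them, assuming a missing value means no renovation was recorded (the article does not confirm this), is sketched below. The `was_renovated` and `last_update` names are hypothetical.

```python
import numpy as np
import pandas as pd

# A few rows and columns copied from the output above.
king_sample = pd.DataFrame(
    {
        "price": [221900.0, 538000.0, 180000.0, 604000.0],
        "yr_built": [1955, 1951, 1933, 1965],
        "yr_renovated": [np.nan, 1991.0, np.nan, np.nan],
    }
)

# Share of missing values per column, computed before any filling.
missing_share = king_sample.isna().mean()

# Assumption: NaN in yr_renovated means the house was never renovated.
king_sample["was_renovated"] = king_sample["yr_renovated"].notna().astype(int)

# Year of the last major work: renovation year if present, else build year.
king_sample["last_update"] = (
    king_sample["yr_renovated"].fillna(king_sample["yr_built"]).astype(int)
)

print(missing_share)
print(king_sample)
```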

## 4. Cancer death rate

This dataset challenges you to predict the cancer mortality rate per 100,000 people using a number of demographic variables:

In [17]:
cancer = pd.read_csv("data/reg3_cancer_death_rate.csv")
cancer.head(10)

Unnamed: 0,avgAnnCount,avgDeathsPerYear,TARGET_deathRate,incidenceRate,medIncome,popEst2015,povertyPercent,studyPerCap,binnedInc,MedianAge,...,PctPrivateCoverageAlone,PctEmpPrivCoverage,PctPublicCoverage,PctPublicCoverageAlone,PctWhite,PctBlack,PctAsian,PctOtherRace,PctMarriedHouseholds,BirthRate
0,1397.0,469,164.9,489.8,61898,260131,11.2,499.748204,"(61494.5, 125635]",39.3,...,,41.6,32.9,14.0,81.780529,2.594728,4.821857,1.843479,52.856076,6.118831
1,173.0,70,161.3,411.6,48127,43269,18.6,23.111234,"(48021.6, 51046.4]",33.0,...,53.8,43.6,31.1,15.3,89.228509,0.969102,2.246233,3.741352,45.3725,4.333096
2,102.0,50,174.7,349.7,49348,21026,14.6,47.560164,"(48021.6, 51046.4]",45.0,...,43.5,34.9,42.1,21.1,90.92219,0.739673,0.465898,2.747358,54.444868,3.729488
3,427.0,202,194.8,430.4,44243,75882,17.1,342.637253,"(42724.4, 45201]",42.8,...,40.3,35.0,45.3,25.0,91.744686,0.782626,1.161359,1.362643,51.021514,4.603841
4,57.0,26,144.4,350.1,49955,10321,12.5,0.0,"(48021.6, 51046.4]",48.3,...,43.9,35.1,44.0,22.7,94.104024,0.270192,0.66583,0.492135,54.02746,6.796657
5,428.0,152,176.0,505.4,52313,61023,15.6,180.259902,"(51046.4, 54545.6]",45.4,...,38.8,32.6,43.2,20.2,84.882631,1.653205,1.538057,3.314635,51.22036,4.964476
6,250.0,97,175.9,461.8,37782,41516,23.2,0.0,"(37413.8, 40362.7]",42.6,...,35.0,28.3,46.4,28.7,75.106455,0.616955,0.866157,8.356721,51.0139,4.204317
7,146.0,71,183.6,404.0,40189,20848,17.8,0.0,"(37413.8, 40362.7]",51.7,...,33.1,25.9,50.9,24.1,89.406636,0.305159,1.889077,2.286268,48.967033,5.889179
8,88.0,36,190.5,459.4,42579,13088,22.3,0.0,"(40362.7, 42724.4]",49.3,...,37.8,29.9,48.1,26.6,91.787477,0.185071,0.208205,0.616903,53.446998,5.587583
9,4025.0,1380,177.8,510.9,60397,843954,13.1,427.748432,"(54545.6, 61494.5]",35.8,...,,44.4,31.4,16.5,74.729668,6.710854,6.041472,2.699184,50.063573,5.53343


**🎯 Target: 'TARGET_deathRate'**

**🔗 Link: [Data.world](https://data.world/nrippner/ols-regression-challenge)**

**📦Dimensions: (3047, 33)**

**⚙Missing values: Yes**
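
The `binnedInc` column stores income brackets as text, so it needs parsing before most models can use it. A small sketch of one way to do that, using values copied from the output above; the `inc_low`, `inc_high` and `inc_mid` names are made up for illustration.

```python
import pandas as pd

# Bracket strings copied from the binnedInc column above.
binned = pd.Series(
    [
        "(61494.5, 125635]",
        "(48021.6, 51046.4]",
        "(42724.4, 45201]",
        "(37413.8, 40362.7]",
    ],
    name="binnedInc",
)

# Pull the two numbers out of each "(low, high]" string.
bounds = binned.str.extract(r"([\d.]+),\s*([\d.]+)").astype(float)
bounds.columns = ["inc_low", "inc_high"]

# The bracket midpoint is one simple numeric stand-in for the bin.
bounds["inc_mid"] = (bounds["inc_low"] + bounds["inc_high"]) / 2

print(bounds)
```

Since `medIncome` is already in the table, dropping `binnedInc` altogether is also a reasonable option.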

## 5. Life expectancy

How long will a person live? This remains one of the hardest unanswered questions in science. Several studies have been undertaken to understand human life expectancy, and this dataset provided by the WHO (World Health Organization) is one of them:

In [18]:
who = pd.read_csv("data/reg4_life_expectancy.csv")
who.head(10)

Unnamed: 0,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,19.1,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,18.6,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,18.1,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,17.6,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,17.2,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5
5,2010,Developing,58.8,279.0,74,0.01,79.679367,66.0,1989,16.7,...,66.0,9.2,66.0,0.1,553.32894,2883167.0,18.4,18.4,0.448,9.2
6,2009,Developing,58.6,281.0,77,0.01,56.762217,63.0,2861,16.2,...,63.0,9.42,63.0,0.1,445.893298,284331.0,18.6,18.7,0.434,8.9
7,2008,Developing,58.1,287.0,80,0.03,25.873925,64.0,1599,15.7,...,64.0,8.33,64.0,0.1,373.361116,2729431.0,18.8,18.9,0.433,8.7
8,2007,Developing,57.5,295.0,82,0.02,10.910156,63.0,1141,15.2,...,63.0,6.73,63.0,0.1,369.835796,26616792.0,19.0,19.1,0.415,8.4
9,2006,Developing,57.3,295.0,84,0.03,17.171518,64.0,1990,14.7,...,58.0,7.43,58.0,0.1,272.56377,2589345.0,19.2,19.3,0.405,8.1


**🎯 Target: 'Life expectancy'**

**🔗 Link: [Kaggle](https://www.kaggle.com/kumarajarshi/life-expectancy-who/)**

**📦Dimensions: (2938, 21)**

**⚙Missing values: Yes**

**[📚Starter notebook](https://www.kaggle.com/mathchi/life-expectancy-who-with-several-ml-techniques)**
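
The column names in this table mix cases, spaces and slashes, which makes them awkward to type. As a small sketch, they could be normalized to snake_case before modeling. The `to_snake_case` helper is hypothetical, and only a few of the columns shown above are reproduced.

```python
import re

import pandas as pd

# A few columns and rows copied from the output above.
who_sample = pd.DataFrame(
    {
        "Life expectancy": [65.0, 59.9],
        "Adult Mortality": [263.0, 271.0],
        "infant deaths": [62, 64],
        "HIV/AIDS": [0.1, 0.1],
        "thinness 1-19 years": [17.2, 17.5],
    }
)


def to_snake_case(name):
    """Lowercase a column name and collapse non-alphanumeric runs into '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


who_sample = who_sample.rename(columns=to_snake_case)

print(list(who_sample.columns))
```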

## 6. Car prices

The title says it all - predict car prices using variables like mileage, fuel type, transmission and several domain-specific features. This is also an excellent dataset for flexing your feature engineering muscles:

In [19]:
cars = pd.read_csv("data/reg5_vehicle_prices.csv")
cars.head(10)

Unnamed: 0,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.0,1248 CC,74.0,190Nm@ 2000rpm,5.0
1,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.0,1498 CC,103.0,250Nm@ 1500-2500rpm,5.0
2,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.0,1497 CC,78.0,"12.7@ 2,700(kgm@ rpm)",5.0
3,2010,225000,127000,Diesel,Individual,Manual,First Owner,23.0,1396 CC,90.0,22.4 kgm at 1750-2750rpm,5.0
4,2007,130000,120000,Petrol,Individual,Manual,First Owner,16.0,1298 CC,88.0,"11.5@ 4,500(kgm@ rpm)",5.0
5,2017,440000,45000,Petrol,Individual,Manual,First Owner,20.0,1197 CC,81.0,113.75nm@ 4000rpm,5.0
6,2007,96000,175000,LPG,Individual,Manual,First Owner,17.0,1061 CC,57.0,"7.8@ 4,500(kgm@ rpm)",5.0
7,2001,45000,5000,Petrol,Individual,Manual,Second Owner,16.0,796 CC,37.0,59Nm@ 2500rpm,4.0
8,2011,350000,90000,Diesel,Individual,Manual,First Owner,23.0,1364 CC,67.0,170Nm@ 1800-2400rpm,5.0
9,2013,200000,169000,Diesel,Individual,Manual,First Owner,20.0,1399 CC,68.0,160Nm@ 2000rpm,5.0


**🎯 Target: 'selling_price'**

**🔗 Link: [Kaggle](https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho?ref=hackernoon.com&select=Car+details+v3.csv)**

**📦Dimensions: (8128, 12)**

**⚙Missing values: Yes**

**[📚Starter notebook](https://www.kaggle.com/mohaiminul101/car-price-prediction)**
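
To give a flavor of the feature engineering involved, here is a rough sketch that turns the free-text `engine` and `torque` columns into numbers. The strings are copied from the output above. The `torque_to_nm` helper is hypothetical, and it assumes that any value whose string mentions "kgm" is in kilogram-force metres while everything else is already in newton metres.

```python
import re

import pandas as pd

# Values copied from the engine and torque columns above.
cars_sample = pd.DataFrame(
    {
        "engine": ["1248 CC", "1498 CC", "1497 CC", "1396 CC"],
        "torque": [
            "190Nm@ 2000rpm",
            "250Nm@ 1500-2500rpm",
            "12.7@ 2,700(kgm@ rpm)",
            "22.4 kgm at 1750-2750rpm",
        ],
    }
)

KGM_TO_NM = 9.80665  # 1 kgf·m expressed in N·m


def torque_to_nm(text):
    """Return the leading torque figure in N·m, or NaN if none is found."""
    if not isinstance(text, str):
        return float("nan")  # missing entries stay missing
    match = re.match(r"\s*([\d.]+)", text)
    if match is None:
        return float("nan")
    value = float(match.group(1))
    # Assumption: the unit is kgm only when the string says so.
    return value * KGM_TO_NM if "kgm" in text.lower() else value


# "1248 CC" -> 1248
cars_sample["engine_cc"] = (
    cars_sample["engine"].str.extract(r"(\d+)", expand=False).astype(int)
)
cars_sample["torque_nm"] = cars_sample["torque"].apply(torque_to_nm).round(1)

print(cars_sample[["engine_cc", "torque_nm"]])
```

The rpm part of the torque string is ignored here; it could be extracted into its own column in the same way.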

# Binary classification

## 7. NBA rookie stats

The first binary classification dataset in the list requires you to predict if a rookie basketball player will last more than 5 years in the league:

In [20]:
nba = pd.read_csv("data/bin1_nba.csv")
nba.head()

Unnamed: 0,GP,MIN,PTS,FGM,FGA,FG%,3P Made,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TOV,TARGET_5Yrs
0,36,27.4,7.4,2.6,7.6,34.7,0.5,2.1,25.0,1.6,2.3,69.9,0.7,3.4,4.1,1.9,0.4,0.4,1.3,0.0
1,35,26.9,7.2,2.0,6.7,29.6,0.7,2.8,23.5,2.6,3.4,76.5,0.5,2.0,2.4,3.7,1.1,0.5,1.6,0.0
2,74,15.3,5.2,2.0,4.7,42.2,0.4,1.7,24.4,0.9,1.3,67.0,0.5,1.7,2.2,1.0,0.5,0.3,1.0,0.0
3,58,11.6,5.7,2.3,5.5,42.6,0.1,0.5,22.6,0.9,1.3,68.9,1.0,0.9,1.9,0.8,0.6,0.1,1.0,1.0
4,48,11.5,4.5,1.6,3.0,52.4,0.0,0.1,0.0,1.3,1.9,67.4,1.0,1.5,2.5,0.3,0.3,0.4,0.8,1.0


**🎯 Target: 'TARGET_5Yrs'**

**🔗 Link: [Data.world](https://data.world/exercises/logistic-regression-exercise-1)**

**⚙Missing values: Yes**

## 8. Stroke prediction

Another medical dataset in the list asks you to predict whether a patient will have a stroke or not based on their history. Very interesting dataset:

In [21]:
stroke = pd.read_csv("data/bin2_stroke.csv")
stroke.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


**🎯 Target: 'stroke'**

**🔗 Link: [Kaggle](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset)**

**📦Dimensions: (5110, 11)**

**⚙Missing values: Yes**

**[📚Starter notebook](https://www.kaggle.com/joshuaswords/predicting-a-stroke-shap-lime-explainer-eli5)**
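
Before fitting a classifier, the missing `bmi` values and the text columns need some attention. A minimal preprocessing sketch on the rows shown above follows; median imputation is just one reasonable choice, and the `bmi_missing` and `is_urban` names are made up for illustration.

```python
import numpy as np
import pandas as pd

# A few columns copied from the output above.
stroke_sample = pd.DataFrame(
    {
        "gender": ["Male", "Female", "Male", "Female", "Female"],
        "ever_married": ["Yes", "Yes", "Yes", "Yes", "Yes"],
        "Residence_type": ["Urban", "Rural", "Rural", "Urban", "Rural"],
        "bmi": [36.6, np.nan, 32.5, 34.4, 24.0],
        "stroke": [1, 1, 1, 1, 1],
    }
)

prepared = stroke_sample.copy()

# Record which rows were missing before filling them in.
prepared["bmi_missing"] = prepared["bmi"].isna().astype(int)
prepared["bmi"] = prepared["bmi"].fillna(prepared["bmi"].median())

# Two-level text columns become 0/1 flags.
prepared["ever_married"] = prepared["ever_married"].map({"Yes": 1, "No": 0})
prepared["is_urban"] = (prepared["Residence_type"] == "Urban").astype(int)
prepared = prepared.drop(columns=["Residence_type"])

print(prepared)
```

In a real workflow the median should be computed on the training split only, so that no information leaks from the validation data.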