This Colab covers four different datasets. Please look through the details of each one here and answer the quiz at: https://forms.gle/G25ek83omZdkjey98

# Phoneme dataset

In [None]:
import sklearn
import pandas as pd
import numpy as np

In [None]:
from sklearn.datasets import fetch_openml
phoneme = fetch_openml('phoneme', version=1, as_frame=False)
phoneme.keys()

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'DESCR', 'details', 'categories', 'url'])

In [None]:
phoneme.target

array(['1', '1', '1', ..., '2', '1', '2'], dtype=object)

In [None]:
np.unique(phoneme.target, return_counts=True)

(array(['1', '2'], dtype=object), array([3818, 1586]))

In [None]:
phoneme.data

array([[ 0.489927, -0.451528, -1.04799 , -0.598693, -0.020418],
       [-0.641265,  0.109245,  0.29213 , -0.916804,  0.240223],
       [ 0.870593, -0.459862,  0.578159,  0.806634,  0.835248],
       ...,
       [ 0.246882, -0.793228,  1.190101,  1.423194, -1.303036],
       [-0.778907, -0.383111,  1.727029, -1.432389, -1.208085],
       [-0.794604, -0.640053,  0.632221,  0.72028 , -1.231182]])

In [None]:
phoneme.feature_names

['V1', 'V2', 'V3', 'V4', 'V5']

In [None]:
print(phoneme.DESCR)

**Author**: Dominique Van Cappel, THOMSON-SINTRA  
**Source**: [KEEL](http://sci2s.ugr.es/keel/dataset.php?cod=105#sub2), [ELENA](https://www.elen.ucl.ac.be/neural-nets/Research/Projects/ELENA/databases/REAL/phoneme/) - 1993  
**Please cite**: None  

The aim of this dataset is to distinguish between nasal (class 0) and oral sounds (class 1). Five different attributes were chosen to characterize each vowel: they are the amplitudes of the five first harmonics AHi, normalised by the total energy Ene (integrated on all the frequencies): AHi/Ene. The phonemes are transcribed as follows: sh as in she, dcl as in dark, iy as the vowel in she, aa as the vowel in dark, and ao as the first vowel in water.  

### Source

The current dataset was formatted by the KEEL repository, but originally hosted by the [ELENA Project](https://www.elen.ucl.ac.be/neural-nets/Research/Projects/ELENA/elena.htm#stuff). The dataset originates from the European ESPRIT 5516 project: ROARS. The aim of this project was

# Creditcard

In [None]:
from sklearn.datasets import fetch_openml
creditcard = fetch_openml('creditcard', version=1, as_frame=False)
creditcard.keys()

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'DESCR', 'details', 'categories', 'url'])

In [None]:
import pprint

In [None]:
pp = pprint.PrettyPrinter(width=80)

In [None]:
pp.pprint(creditcard.DESCR)

('**Author**: Andrea Dal Pozzolo, Olivier Caelen and Gianluca Bontempi  \n'
 '**Source**: Credit card fraud detection - Date 25th of June 2015  \n'
 '**Please cite**: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and '
 'Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced '
 'Classification. In Symposium on Computational Intelligence and Data Mining '
 '(CIDM), IEEE, 2015  \n'
 '\n'
 'The datasets contains transactions made by credit cards in September 2013 by '
 'european cardholders. This dataset present transactions that occurred in two '
 'days, where we have 492 frauds out of 284,807 transactions. The dataset is '
 'highly unbalanced, the positive class (frauds) account for 0.172% of all '
 'transactions.\n'
 '\n'
 'It contains only numerical input variables which are the result of a PCA '
 'transformation. Unfortunately, due to confidentiality issues, we cannot '
 'provide the original features and more background information about the '
 'data. Fea

In [None]:
creditcard.data[1]

array([ 1.19185711,  0.26615071,  0.16648011,  0.44815408,  0.06001765,
       -0.08236081, -0.07880298,  0.08510165, -0.25542513, -0.16697441,
        1.61272666,  1.06523531,  0.48909502, -0.1437723 ,  0.63555809,
        0.46391704, -0.11480466, -0.18336127, -0.14578304, -0.06908314,
       -0.22577525, -0.63867195,  0.10128802, -0.33984648,  0.1671704 ,
        0.12589453, -0.0089831 ,  0.01472417,  2.69      ])

In [None]:
np.unique(creditcard.target, return_counts=True)

(array(['0', '1'], dtype=object), array([284315,    492]))
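
The class counts above can be turned into the imbalance figure quoted in the description. The following is a minimal sketch that hard-codes the two printed counts rather than downloading the dataset. The variable names are illustrative only, and the "always predict `'0'`" baseline is an assumption added here to show why plain accuracy is a weak metric on data this skewed.

```python
# Counts copied from the np.unique output above: label '0' and label '1'.
n_legit = 284315
n_fraud = 492
n_total = n_legit + n_fraud

# Share of transactions labelled as fraud (the positive class).
fraud_rate = n_fraud / n_total

# Accuracy of a hypothetical classifier that always predicts '0' (no fraud).
baseline_accuracy = n_legit / n_total

print(f"transactions: {n_total}")
print(f"fraud rate: {fraud_rate:.4%}")
print(f"always-'0' accuracy: {baseline_accuracy:.4%}")
```

The fraud rate comes out at roughly 0.17% of all transactions, in line with the description, while the trivial baseline is already more than 99.8% accurate without detecting a single fraud.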

# Diabetes

In [None]:
diabetes = fetch_openml('diabetes_numeric', version=1, as_frame=False)
diabetes.keys()


dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'DESCR', 'details', 'categories', 'url'])

In [None]:
print(diabetes.DESCR)

**Author**:   
**Source**: Unknown -   
**Please cite**:   

This data set concerns the study of the factors affecting patterns of
 insulin-dependent diabetes mellitus in children.  The objective is to
 investigate the dependence of the level of serum C-peptide on the
 various other factors in order to understand the patterns of residual
 insulin secretion. The response measurement is the logarithm of
 C-peptide concentration (pmol/ml) at the diagnosis, and the predictor
 measurements age and base deficit, a measure of acidity.

 Source: collection of regression datasets by Luis Torgo (ltorgo@ncc.up.pt) at
 http://www.ncc.up.pt/~ltorgo/Regression/DataSets.html
 Original source: Book Generalized Additive Models (p.304) by Hastie &
 Tibshirani, Chapman & Hall.  
 Characteristics: 43 cases; 3 continuous variables

Downloaded from openml.org.


In [None]:
diabetes.target_names

['c_peptide']

In [None]:
diabetes.target[:5]

array([4.8, 4.1, 5.2, 5.5, 5. ])

In [None]:
diabetes.data[:10]

array([[  5.2,  -8.1],
       [  8.8, -16.1],
       [ 10.5,  -0.9],
       [ 10.6,  -7.8],
       [ 10.4, -29. ],
       [  1.8, -19.2],
       [ 12.7, -18.9],
       [ 15.6, -10.6],
       [  5.8,  -2.8],
       [  1.9, -25. ]])
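
To make the regression setup easier to read, the sketch below lines up the first five feature rows with the first five target values printed above. The column labels `age` and `base_deficit` are assumptions inferred from the description (the arrays themselves carry no names here); `c_peptide` is taken from `diabetes.target_names`.

```python
# First five rows of diabetes.data and diabetes.target, copied from the outputs above.
features = [[5.2, -8.1], [8.8, -16.1], [10.5, -0.9], [10.6, -7.8], [10.4, -29.0]]
targets = [4.8, 4.1, 5.2, 5.5, 5.0]

# Assumed column order: two predictors followed by the continuous target.
columns = ("age", "base_deficit", "c_peptide")

# Pair each feature row with its target value under the assumed labels.
rows = [dict(zip(columns, (*x, y))) for x, y in zip(features, targets)]
for row in rows:
    print(row)
```

Because the target is a continuous measurement rather than a class label, this dataset is a regression problem, unlike the two datasets before it.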

# Tic-Tac-Toe

In [None]:
import sklearn
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('https://www.openml.org/data/get_csv/50/dataset_50_tic-tac-toe.arff')

In [None]:
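def print_game(i):
  """Print the 3x3 board and the class label of game i."""
  row = df.iloc[i].values
  print(row[:9].reshape(3,3))  # first nine columns are the board squares
  print(row[9])                # tenth column is the class label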

In [None]:
print_game(4)

[['x' 'x' 'x']
 ['x' 'o' 'o']
 ['b' 'o' 'b']]
positive


In [None]:
print_game(-1)

[['o' 'o' 'x']
 ['x' 'x' 'o']
 ['o' 'x' 'x']]
negative


In [None]:
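# Load via OpenML by name (assumes the dataset is registered as 'tic-tac-toe', version 1)
tictactoe = fetch_openml('tic-tac-toe', version=1, as_frame=False)
print(tictactoe.DESCR)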

**Author**: David W. Aha    
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Tic-Tac-Toe+Endgame) - 1991   
**Please cite**: [UCI](http://archive.ics.uci.edu/ml/citation_policy.html)

**Tic-Tac-Toe Endgame database**  
This database encodes the complete set of possible board configurations at the end of tic-tac-toe games, where "x" is assumed to have played first.  The target concept is "win for x" (i.e., true when "x" has one of 8 possible ways to create a "three-in-a-row").  

### Attribute Information  

     (x=player x has taken, o=player o has taken, b=blank)
     1. top-left-square: {x,o,b}
     2. top-middle-square: {x,o,b}
     3. top-right-square: {x,o,b}
     4. middle-left-square: {x,o,b}
     5. middle-middle-square: {x,o,b}
     6. middle-right-square: {x,o,b}
     7. bottom-left-square: {x,o,b}
     8. bottom-middle-square: {x,o,b}
     9. bottom-right-square: {x,o,b}
    10. Class: {positive,negative}

Downloaded from openml.org.
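
The target concept described above ("win for x" along one of 8 possible lines) can be checked directly on a board. This is a minimal sketch, not part of the dataset tooling: `WIN_LINES` and `x_wins` are hypothetical names, and the two boards are copied from the `print_game(4)` and `print_game(-1)` outputs earlier in this section.

```python
# The 8 winning lines as index triples into a flat, row-major 9-square board.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]

def x_wins(cells):
    """Return True if 'x' occupies all three squares of at least one winning line."""
    return any(all(cells[i] == 'x' for i in line) for line in WIN_LINES)

# Boards from print_game(4) (labelled positive) and print_game(-1) (labelled negative).
game_4 = ['x', 'x', 'x', 'x', 'o', 'o', 'b', 'o', 'b']
last_game = ['o', 'o', 'x', 'x', 'x', 'o', 'o', 'x', 'x']

print(x_wins(game_4))     # x fills the top row
print(x_wins(last_game))  # no line is filled by x
```

Applied to a full row of `df`, the same check would take the first nine values as `cells`; its result should agree with the `positive` / `negative` label in the tenth column.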
