
In this tutorial we will explore the datasets that ship with scikit-learn (`sklearn`).  
They are a quick and easy way to start practicing machine learning.  
You do not need a data file of your own to get started.  
Just pick a suitable dataset from the `sklearn.datasets` module and start building a model.
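
To get a feel for what is available, the short sketch below lists the public helpers in `sklearn.datasets` grouped by name prefix. The grouping by `load_`, `fetch_` and `make_` relies on scikit-learn's naming convention, and the exact lists printed will depend on the installed version.

```python
from sklearn import datasets

# Group the public helpers in sklearn.datasets by their naming prefix.
loaders = sorted(n for n in dir(datasets) if n.startswith("load_"))     # small bundled datasets
fetchers = sorted(n for n in dir(datasets) if n.startswith("fetch_"))   # downloaded on first use
generators = sorted(n for n in dir(datasets) if n.startswith("make_"))  # synthetic data generators

print(loaders)
print(fetchers)
print(generators)
```

The `load_` functions need no network access, which makes them the easiest starting point.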

In [68]:
from sklearn import datasets

In [69]:
data = datasets.load_diabetes()
data.keys()

dict_keys(['data', 'target', 'DESCR', 'feature_names', 'data_filename', 'target_filename'])
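
The object returned by `load_diabetes()` is a scikit-learn `Bunch`, which behaves like a dictionary but also allows attribute access. A minimal sketch of both styles, with variable names `X` and `y` chosen here purely for illustration:

```python
from sklearn import datasets

data = datasets.load_diabetes()

# A Bunch supports both dictionary-style and attribute-style access.
X = data["data"]   # same array as data.data
y = data.target    # same array as data["target"]

print(X.shape)             # (442, 10): 442 patients, 10 features
print(y.shape)             # (442,)
print(data.feature_names)  # the ten column names
```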

In [70]:
print(data['DESCR'])

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - Age
      - Sex
      - Body mass index
      - Average blood pressure
      - S1
      - S2
      - S3
      - S4
      - S5
      - S6

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bra
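
The note in the description says each feature column is mean centered and scaled so that its sum of squares is 1. One way to check that claim yourself is sketched below; the tolerance values are arbitrary choices made for this example.

```python
import numpy as np
from sklearn import datasets

X = datasets.load_diabetes().data

col_means = X.mean(axis=0)         # should be close to 0 for every column
col_sum_sq = (X ** 2).sum(axis=0)  # should be close to 1 for every column

print(np.allclose(col_means, 0.0, atol=1e-6))
print(np.allclose(col_sum_sq, 1.0))
```

This also explains why the feature values look so small: they are not raw ages or blood pressure readings.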

Next, we will view the data as a pandas DataFrame.  Note that a DataFrame is not required to build a model.  In my experience, being comfortable working directly with NumPy arrays and tensors helps you understand machine learning much better.

In [71]:
import pandas as pd
import numpy as np

# Stack the 10 feature columns and the target side by side, then label the columns.
df = pd.DataFrame(
    np.c_[data['data'], data['target']],
    columns=data['feature_names'] + ['target'],
)
df.head()

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|-----|-----|-----|----|----|----|----|----|----|----|--------|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 | 135.0 |
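
To illustrate the earlier point that a DataFrame is optional, here is a minimal sketch that fits a model directly on the NumPy arrays. The choice of `LinearRegression`, the 80/20 split and the seed are assumptions made for this example, not part of the original notebook.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# return_X_y=True skips the Bunch and returns the two NumPy arrays directly.
X, y = load_diabetes(return_X_y=True)

# Hold out 20% of the rows for evaluation; the seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # coefficient of determination on unseen rows
print(f"R^2 on held-out data: {r2:.3f}")
```

No pandas import is needed at any point; the DataFrame is mainly a convenience for inspecting the data.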


You can also generate a synthetic dataset that has no link to real-world data.  
This gives you more control over the kind of data you work with, so you can experiment freely with your models.

In [79]:
data_inputs, data_targets = datasets.make_classification(n_samples=100, n_features=20, n_classes=2)

In [80]:
print(data_inputs.shape, data_targets.shape)

(100, 20) (100,)
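
The control mentioned above comes from the other arguments of `make_classification`. The sketch below shows a few of them; the specific values (5 informative features, an 80/20 class balance, seed 0) are illustrative assumptions rather than recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification

params = dict(
    n_samples=100,
    n_features=20,
    n_informative=5,     # features that actually carry class signal
    n_redundant=2,       # linear combinations of the informative features
    n_classes=2,
    weights=[0.8, 0.2],  # approximate class proportions (imbalanced)
    random_state=0,      # fixed seed so repeated calls return the same data
)
X, y = make_classification(**params)
X_again, y_again = make_classification(**params)

print(X.shape, y.shape)
print(np.bincount(y))              # number of samples in each class
print(np.array_equal(X, X_again))  # whether both calls produced identical features
```

Without `random_state`, each call produces a different dataset, which makes results hard to compare between runs.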
