In [None]:
'''
    Dataset -->

    A dataset is a collection of data, typically organized in a structured format, that is used for analysis,
    training machine learning models, or conducting experiments. Each data point (also known as an observation,
    sample, or instance) holds information about a specific entity, and this information is often stored in a tabular format.
'''

In [None]:
'''
    Key Components of a Dataset -->

    Features (Attributes or Variables) -->

    Definition : Features are the individual measurable properties or characteristics of the data.
    In machine learning, they are the inputs to a model.
    Example : In a dataset of housing prices, features could include house size, number of rooms, location, etc.

    Samples (Rows, Records, or Observations) -->

    Definition : Each sample or observation represents a single entry in the dataset, which is characterized
    by the values of the features.
    Example : In a dataset of car sales, each row might represent a specific car and contain information like
    the make, model, year, price, etc.

    Target Variable (Label or Output) -->

    Definition : In supervised learning tasks, the target variable is the value that the model is trained to predict or classify.
    It is the known, correct output that the model's predictions are compared against.
    Example : In a dataset for predicting house prices, the target variable could be the actual price of the house.

    Structure -->

    Tabular : Most common format, where data is arranged in rows and columns (similar to a spreadsheet).
    Non-tabular : Datasets may also come in formats such as images, audio, or text.
'''

In [None]:
'''
    Example of a Tabular Dataset -->

        ID    House Size (sq ft)    Bedrooms    Price ($)
        1     2000                  3           300,000
        2     1500                  2           200,000
        3     2500                  4           400,000

    Features --> "House Size" and "Bedrooms"
    Target Variable --> "Price"
    Samples --> Each row represents one house
'''
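
The housing table above can be rebuilt in pandas to make the split between features, target, and samples concrete. This is a minimal sketch; the variable names houses, X, and y are illustrative choices, not part of the original notes.

```python
import pandas as pd

# The three houses from the example table, indexed by their ID
houses = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "House Size (sq ft)": [2000, 1500, 2500],
        "Bedrooms": [3, 2, 4],
        "Price ($)": [300_000, 200_000, 400_000],
    }
).set_index("ID")

# Features: the input columns. Target: the column we want to predict.
X = houses[["House Size (sq ft)", "Bedrooms"]]
y = houses["Price ($)"]

print(X.shape)  # (rows, columns) = (samples, features)
print(y.shape)
```

Each row of X is one sample, and the value of y with the same ID is that sample's target.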

In [None]:
'''
    Types of Datasets -->

    Training Dataset -->
    Used to train a machine learning model by feeding it both features and the target variable.
    Example : A dataset of images and their corresponding labels for classifying objects in images.

    Testing Dataset -->
    Used to evaluate the performance of a trained machine learning model.
    The model makes predictions on this dataset, and the results are compared to the actual values to measure accuracy.

    Validation Dataset -->
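    Used to tune the model's hyperparameters and to compare candidate models during development.
    Because it is kept separate from the training data, it helps detect overfitting before the final test.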

    Unlabeled Dataset -->
    Contains only input data (features) without the corresponding output (target variable).
    Commonly used in unsupervised learning tasks like clustering or anomaly detection.

    Labeled Dataset -->  
    Contains both input data and the corresponding output, used in supervised learning tasks.
'''
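
One common way to obtain training, validation, and test sets is to split a single labeled dataset twice with scikit-learn's train_test_split. The sketch below assumes a 60/20/20 split and uses randomly generated toy data; both are illustrative assumptions, not something the notes prescribe.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 2 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = rng.integers(0, 2, size=100)

# Step 1: hold out 20% of the data as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: split the remaining 80% into training (75%) and validation (25%)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Fixing random_state makes the split reproducible, so the same rows land in the same subset on every run.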

In [None]:
'''
    Formats of Datasets -->

    CSV (Comma-Separated Values) : A common format for tabular data.
    Excel : Data stored in spreadsheets, often in .xls or .xlsx formats.
    Image files : Datasets of images, often stored as .jpg, .png, etc.
    JSON (JavaScript Object Notation) : Used for semi-structured data.
    SQL Databases : Data stored in relational databases.
'''
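
Several of these formats can be read straight into a DataFrame with pandas. The sketch below uses in-memory stand-ins (a CSV string, a JSON string, and an in-memory SQLite database) so that it runs without any files on disk; the table name houses and the column names are made up for illustration. Excel and image files are left out because they need extra packages.

```python
import io
import sqlite3

import pandas as pd

# CSV text standing in for a .csv file on disk
csv_text = "size,bedrooms,price\n2000,3,300000\n1500,2,200000\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON records standing in for a .json file
json_text = '[{"size": 2000, "bedrooms": 3}, {"size": 1500, "bedrooms": 2}]'
df_json = pd.read_json(io.StringIO(json_text))

# In-memory SQLite database standing in for a relational database
conn = sqlite3.connect(":memory:")
df_csv.to_sql("houses", conn, index=False)
df_sql = pd.read_sql_query(
    "SELECT size, price FROM houses WHERE bedrooms >= 3", conn
)
conn.close()

print(df_csv)
print(df_json)
print(df_sql)
```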

In [None]:
'''
    Types of Data in a Dataset -->

    Numerical Data -->
    Continuous : Data that can take any value within a range (e.g., house price, temperature).
    Discrete : Data that can only take distinct, countable values (e.g., number of bedrooms).

    Categorical Data -->
    Data that represents categories or groups (e.g., car make, gender).
    Can be nominal (no order) or ordinal (with a meaningful order).

    Textual Data -->
    Data in the form of text (e.g., product reviews, tweets).

    Image Data -->
    Images, often represented as arrays of pixel values.
'''
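
In pandas, these data types show up as column dtypes. The sketch below builds a small hypothetical DataFrame (the column names and values are invented for illustration) and marks one column as nominal and another as ordinal.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "price": [300000.0, 200000.5, 400000.0],     # continuous numerical
        "bedrooms": [3, 2, 4],                       # discrete numerical
        "city": ["Paris", "Madrid", "Berlin"],       # nominal categorical
        "condition": ["good", "poor", "excellent"],  # ordinal categorical
    }
)

# Nominal: categories without any order
df["city"] = df["city"].astype("category")

# Ordinal: categories with an explicit order, lowest first
df["condition"] = pd.Categorical(
    df["condition"], categories=["poor", "good", "excellent"], ordered=True
)

print(df.dtypes)
# Order-aware comparisons only work on ordered categoricals
print(df["condition"] > "poor")
```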

In [None]:
'''
    Importing a dataset -->

    In Python, datasets can be imported from various sources such as files (CSV, Excel, etc.),
    databases, or directly from web URLs. Below are common ways to import datasets using
    popular libraries like pandas and scikit-learn.

    Importing Datasets using pandas -->
    df = pd.read_csv('file')

    Importing Datasets from sklearn.datasets (Toy Datasets) -->
    from sklearn.datasets import load_iris
    scikit-learn ships several toy datasets (load_iris, load_wine, load_digits, ...); any of them can be loaded the same way.
'''
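
A toy dataset from sklearn.datasets can also be loaded directly as a pandas DataFrame by passing as_frame=True, which makes DataFrame methods such as head() available. This is a small sketch using the iris dataset; the variable names iris and iris_df are illustrative.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects
iris = load_iris(as_frame=True)

# iris.frame holds the four feature columns plus a "target" column
iris_df = iris.frame

print(iris_df.shape)
print(iris_df.head())
print(iris.target_names)  # class names behind the integer targets
```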

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

In [3]:
pd_data = pd.read_csv('Data/Data.csv')
sk_data = load_iris()

In [4]:
pd_data.head()

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes


In [None]:
# load_iris() returns a dict-like Bunch object, not a DataFrame, so it has no .head() method

sk_data

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  