Scikit-learn is a popular library for building machine learning models from data. Before you can fit a model with scikit-learn, your data has to be in a format it recognizes. Scikit-learn works well with numeric data stored in NumPy arrays, and you can convert data from other objects, such as pandas DataFrames, to NumPy arrays. In this video, I'll show you how to get your data into a format that scikit-learn accepts.

## Features Matrix and Target Vector

The first thing to understand is what scikit-learn expects for a features matrix and a target vector. In scikit-learn, a features matrix is a two-dimensional grid of data where rows represent samples and columns represent features. A target vector is usually one-dimensional and, in supervised learning, holds the values you want to predict from the data.
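
To make the two shapes concrete, here is a minimal sketch with made-up numbers. The names `X_toy` and `y_toy` and the values in them are hypothetical and are not part of the iris example that follows.

```python
import numpy as np

# Hypothetical toy data: 3 samples (rows) and 2 features (columns).
X_toy = np.array([
    [5.1, 3.5],
    [4.9, 3.0],
    [6.2, 2.9],
])

# One target value per sample, in the same order as the rows of X_toy.
y_toy = np.array([0, 0, 1])

print(X_toy.ndim, X_toy.shape)  # 2 (3, 2)
print(y_toy.ndim, y_toy.shape)  # 1 (3,)
```

The number of rows in the features matrix matches the length of the target vector, because each sample needs exactly one target value.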

![image.png](attachment:image.png)

Let's see an example. The image below is a pandas DataFrame showing the first 5 rows of the iris dataset. Each row represents a single flower, and the columns hold that flower's measurements. In this dataset, the species column is what you are trying to predict.

![image.png](attachment:image.png)

Let's now go over how to make sure your data is in an acceptable format.

## Import Libraries

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import load_iris

## Loading the Dataset

In [12]:
# Load the Iris dataset
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target
iris_df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


## Organizing Data into Features Matrix and Target Vector

In [14]:
# Create the features matrix (X) and target vector (y)
X = iris_df.drop(columns=['species']).values  # Features matrix
y = iris_df['species'].values                 # Target vector
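
As a side note, the pandas documentation recommends `.to_numpy()` over `.values` for this kind of conversion. The sketch below shows the same split on a small, hypothetical DataFrame (`toy_df`, `X_small`, and `y_small` are illustrative names, not part of the notebook above).

```python
import pandas as pd

# Hypothetical three-row DataFrame shaped like the iris table.
toy_df = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9, 4.7],
    "petal length (cm)": [1.4, 1.4, 1.3],
    "species": [0, 0, 0],
})

# Everything except the target column becomes the features matrix.
X_small = toy_df.drop(columns=["species"]).to_numpy()

# The target column alone becomes the target vector.
y_small = toy_df["species"].to_numpy()

print(type(X_small).__name__, X_small.shape)  # ndarray (3, 2)
print(type(y_small).__name__, y_small.shape)  # ndarray (3,)
```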

## Verifying Data Dimensions

In [15]:
# Verifying Data Dimensions
print("Features matrix shape:", X.shape)
print("Target vector shape:", y.shape)

Features matrix shape: (150, 4)
Target vector shape: (150,)
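
A common shape problem shows up when you have only one feature: selecting a single column gives a one-dimensional array, while scikit-learn estimators expect a two-dimensional features matrix. The sketch below, using hypothetical values, shows one way to add the missing dimension with `reshape`.

```python
import numpy as np

# Hypothetical single feature measured on 5 samples.
single_feature = np.array([1.4, 1.4, 1.3, 1.5, 1.4])
print(single_feature.shape)  # (5,)

# reshape(-1, 1) keeps all 5 samples as rows and makes 1 feature column.
X_single = single_feature.reshape(-1, 1)
print(X_single.shape)  # (5, 1)
```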


## Converting Data for Scikit-Learn

In [16]:
# Convert the feature columns to a NumPy array (.values returns an ndarray)
X = iris_df.drop(columns=['species']).values

In [25]:
# Convert the target column to a NumPy array
y = iris_df['species'].values

In [24]:
# Check dimensions
print("Features matrix shape:", X.shape)  # Expected output: (samples, features)
print("Target vector shape:", y.shape)    # Expected output: (samples,)

Features matrix shape: (150, 4)
Target vector shape: (150,)


## Final Steps for Data Readiness

In [27]:
# Ensuring Data Integrity
import numpy as np

# Check for missing values in features matrix and target vector
print("Missing values in X:", np.isnan(X).sum())
print("Missing values in y:", np.isnan(y).sum())

Missing values in X: 0
Missing values in y: 0
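
Note that `np.isnan` only works on numeric arrays and raises a `TypeError` on object arrays such as string labels. If your data still lives in a DataFrame with mixed types, a pandas-based check is one alternative. The sketch below uses a hypothetical `messy_df` with deliberately missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing number and one missing label.
messy_df = pd.DataFrame({
    "petal length (cm)": [1.4, np.nan, 1.3],
    "species": ["setosa", "setosa", None],
})

# isna() flags NaN and None alike; sum() counts them per column.
missing_per_column = messy_df.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```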


In [29]:
# Normalizing or Scaling Features (Optional)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Apply scaling
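
To see what the scaling step does, the sketch below applies `StandardScaler` to a small hypothetical array (`X_demo`) and checks the result. With its default settings, each column ends up with a mean of about 0 and a standard deviation of about 1, and the shape is unchanged.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
X_demo = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
])

scaler_demo = StandardScaler()
X_demo_scaled = scaler_demo.fit_transform(X_demo)

# Same shape as the input: scaling changes values, not dimensions.
print(X_demo_scaled.shape)  # (3, 2)

# Each column now has mean ~0 and standard deviation ~1.
print(np.allclose(X_demo_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X_demo_scaled.std(axis=0), 1.0))   # True
```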

In [30]:
# Final Verification of Data Shapes
# Scaling changes the values but not the shape, so X_scaled has the same shape as X
print("Final shape of X:", X.shape)
print("Final shape of y:", y.shape)

Final shape of X: (150, 4)
Final shape of y: (150,)
