# Learning Model Building in Scikit-learn

Scikit-learn is an open-source Python library that makes machine learning more accessible. It provides a straightforward, consistent interface for a variety of tasks such as classification, regression, clustering, data preprocessing and model evaluation.
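To illustrate what a consistent interface means in practice, here is a minimal sketch in which three different classifiers are trained and scored with the same `fit` and `score` calls. The choice of estimators is an assumption made for illustration, and the scores are measured on the training data only to show the shape of the API, not to judge model quality.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Three unrelated algorithms, chosen only as examples.
models = [
    LogisticRegression(max_iter=200),
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(random_state=0),
]

scores = {}
for model in models:
    model.fit(X, y)  # same training call for every estimator
    scores[type(model).__name__] = model.score(X, y)  # accuracy on the training data

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```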

Step 1: Loading a Dataset

In [8]:
from sklearn.datasets import load_iris
iris = load_iris()

X = iris.data
y = iris.target

feature_names = iris.feature_names
target_names = iris.target_names

print("Feature names:", feature_names)
print("Target names:", target_names)

print("\nOverall Type of X is:",type(X))
print("\nData type of the elements stored inside of X is:", X.dtype)

print("\nFirst 5 rows of X:\n", X[:5])

Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']

Overall Type of X is: <class 'numpy.ndarray'>

Data type of the elements stored inside of X is: float64

First 5 rows of X:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]


Step 2: Splitting the Dataset


In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

In [10]:
# Check the shapes of the split data to ensure both sets have the expected proportions
print("X_train Shape:", X_train.shape)
print("X_test Shape:", X_test.shape)
print("Y_train Shape:", y_train.shape)
print("Y_test Shape:", y_test.shape)

X_train Shape: (90, 4)
X_test Shape: (60, 4)
Y_train Shape: (90,)
Y_test Shape: (60,)
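The shapes confirm the 60/40 split, but they do not show how the three species are distributed across the two sets. As a hedged sketch, passing `stratify=y` (an option not used in the cell above) asks `train_test_split` to keep the class proportions the same in both sets. The variable names below are chosen for illustration.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps each species' share equal in the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, random_state=1, stratify=y
)

train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())

print("Training class counts:", dict(sorted(train_counts.items())))
print("Test class counts:", dict(sorted(test_counts.items())))
```

Because Iris has 50 samples per species, each class contributes 30 rows to the training set and 20 rows to the test set.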


Step 3: Handling Categorical Data

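Label Encoding: It converts each category into a unique integer. Integer codes suit categories with a meaningful order such as “Low”, “Medium” and “High”, but LabelEncoder assigns the integers in sorted (alphabetical) order rather than by meaning, and scikit-learn intends it for target labels. For an ordered input feature, OrdinalEncoder with an explicit category order keeps the intended ranking.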

In [11]:
from sklearn.preprocessing import LabelEncoder

categorical_feature = ['cat', 'dog', 'cat', 'bird']
encoder = LabelEncoder()
encoded_feature = encoder.fit_transform(categorical_feature)
print("Encoded feature:", encoded_feature)

Encoded feature: [1 2 1 0]
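The codes above follow alphabetical order (bird=0, cat=1, dog=2). For categories that do have a ranking, the following sketch compares `LabelEncoder` with `OrdinalEncoder` given an explicit order. The `sizes` list is made-up data for illustration only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordered feature, not part of the Iris data.
sizes = ["Low", "High", "Medium", "Low"]

# LabelEncoder sorts the classes, so High=0, Low=1, Medium=2.
label_codes = LabelEncoder().fit_transform(sizes)

# OrdinalEncoder with an explicit list preserves Low < Medium < High.
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordinal_codes = ordinal.fit_transform(np.array(sizes).reshape(-1, 1)).ravel()

print("LabelEncoder codes:  ", label_codes)
print("OrdinalEncoder codes:", ordinal_codes)
```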


One-Hot Encoding: It creates one binary column per category, with a 1 in the column matching each row's category and 0 elsewhere. This method suits categorical variables without any order, since it implies no numeric relationship between the categories.

In [14]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

categorical_feature = ['cat', 'dog', 'cat', 'bird']
categorical_feature = np.array(categorical_feature).reshape(-1, 1)
encoder = OneHotEncoder(sparse_output=False)
encoded_feature = encoder.fit_transform(categorical_feature)
print("OneHotEncoded feature:\n", encoded_feature)

OneHotEncoded feature:
 [[0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]


Step 4: Training the Model

In [15]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)
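When features sit on different scales, a common pattern is to bundle preprocessing and the model into a single estimator. The sketch below is one possible variant, not the approach used in this notebook: it assumes a `StandardScaler` step in front of the same logistic regression, so the scaler is fitted on the training data only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# The pipeline fits the scaler and the classifier together on the training set.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipeline.fit(X_train, y_train)

test_accuracy = pipeline.score(X_test, y_test)
print("Pipeline test accuracy:", test_accuracy)
```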

Step 5: Making Predictions

In [16]:
y_pred = log_reg.predict(X_test)

Step 6: Evaluating Model Accuracy

In [18]:
from sklearn import metrics
print("Logistic Regression model accuracy:", metrics.accuracy_score(y_test, y_pred))

Logistic Regression model accuracy: 0.9666666666666667
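Accuracy is a single number and does not show which species get confused with each other. As a sketch under the same split and model settings as above, a confusion matrix breaks the test predictions down per class; its diagonal holds the correct predictions, so the diagonal sum divided by the total equals the accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

model = LogisticRegression(max_iter=200).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
acc = accuracy_score(y_test, y_pred)

print("Confusion matrix:\n", cm)
print("Accuracy from the diagonal:", cm.trace() / cm.sum())
```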


In [19]:
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = log_reg.predict(sample)
# Convert NumPy strings to plain Python strings for cleaner output
pred_species = [str(iris.target_names[p]) for p in preds]
print("Predictions:", pred_species)

Predictions: ['virginica', 'virginica']
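The predicted labels alone do not say how confident the model is. As a hedged sketch using the same two sample rows, `predict_proba` returns one probability per species for each row; the predicted label is the class with the highest probability.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=1
)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)

sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
probabilities = model.predict_proba(sample)  # shape: (n_samples, n_classes)
predicted = model.predict(sample)

for row, label in zip(probabilities, predicted):
    print(str(iris.target_names[label]), np.round(row, 3))
```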
