# Data Loading and Understanding

In [185]:
# Importing necessary libraries
import sklearn.datasets as datasets
import pandas as pd

# Listing all attributes and methods in the datasets module
available_datasets = dir(datasets)
print(available_datasets)
df = pd.DataFrame(available_datasets)
print(df)

['__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__getattr__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_arff_parser', '_base', '_california_housing', '_covtype', '_kddcup99', '_lfw', '_olivetti_faces', '_openml', '_rcv1', '_samples_generator', '_species_distributions', '_svmlight_format_fast', '_svmlight_format_io', '_twenty_newsgroups', 'clear_data_home', 'data', 'descr', 'dump_svmlight_file', 'fetch_20newsgroups', 'fetch_20newsgroups_vectorized', 'fetch_california_housing', 'fetch_covtype', 'fetch_kddcup99', 'fetch_lfw_pairs', 'fetch_lfw_people', 'fetch_olivetti_faces', 'fetch_openml', 'fetch_rcv1', 'fetch_species_distributions', 'get_data_home', 'load_breast_cancer', 'load_diabetes', 'load_digits', 'load_files', 'load_iris', 'load_linnerud', 'load_sample_image', 'load_sample_images', 'load_svmlight_file', 'load_svmlight_files', 'load_wine', 'make_biclusters', 'make_blobs', 'make_checkerboard', 'make_circles', 'make_classification', 'make_fr

## Dataset guide
1. load_*: These are typically small, built-in toy datasets.
2. fetch_*: These are real-world or larger datasets, which are downloaded the first time they are requested and then cached locally.
3. make_*: They generate data programmatically and don't come from real-world datasets.
4. The ones with no prefix: These are typically utility functions for working with datasets or metadata, but they do not load or generate datasets directly.
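
As a quick check of this guide, the sketch below groups the public names in `sklearn.datasets` by prefix. The `groups` dictionary and the `"other"` key are names introduced here for illustration only. The prefix is just a naming convention: `load_files` and `load_svmlight_file`, for instance, read data from disk rather than returning a bundled dataset.

```python
import sklearn.datasets as datasets

# Group the public names of sklearn.datasets by their prefix.
prefixes = ("load_", "fetch_", "make_")
groups = {p: [n for n in dir(datasets) if n.startswith(p)] for p in prefixes}

# Every other public name (no leading underscore) goes into an "other" group.
groups["other"] = [
    n for n in dir(datasets)
    if not n.startswith("_") and not n.startswith(prefixes)
]

# Print the size of each group and a few sample names.
for prefix, names in groups.items():
    print(f"{prefix:<7} {len(names):>3} names, e.g. {names[:3]}")
```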

## From the list of available datasets, I chose the Iris dataset, which is loaded with "load_iris".
### This dataset contains sepal and petal measurements for 3 different species of Iris flowers and is used here to identify the species of a flower.


In [188]:
# Importing necessary libraries
import pandas as pd
from sklearn.datasets import load_iris

# Loading the load_iris() dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Checking dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB


In [190]:
# Displaying the dataset
print(df)

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
2                  4.7               3.2                1.3               0.2
3                  4.6               3.1                1.5               0.2
4                  5.0               3.6                1.4               0.2
..                 ...               ...                ...               ...
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8

[150 rows x 4 columns]


In [192]:
print(data.data)  # 2D array holding the sepal and petal length and width of each flower.
print(data.target)  # 1D array containing the integer target label of each sample.
print(data.target_names)  # 1D array containing the class names that correspond to the integer labels.

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.

### Here, we can see that 0 corresponds to Setosa, 1 to Versicolor and 2 to Virginica.
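
The label mapping can also be read directly from `target_names`. The following is a minimal sketch; the `label_names` dictionary and the `species` column are names introduced here and are not part of the notebook above.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()

# Map each integer label to its class name (converted to plain Python strings).
label_names = {i: str(name) for i, name in enumerate(data.target_names)}
print(label_names)

# Attach a readable species column to the feature table.
df = pd.DataFrame(data.data, columns=data.feature_names)
df["species"] = pd.Series(data.target).map(label_names)

# Count how many samples belong to each class.
print(df["species"].value_counts())
```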

# Data Preprocessing

### I have used 80% of the data for training the model and the remaining 20% for testing purposes. 

### I then scaled the training data input as follows:
#### 1. I used the StandardScaler class of the Scikit-learn library to perform the transformations.
#### 2. The fit() method calculates the mean and standard deviation for each feature based on the training data (X_train).
#### 3. The transform() method uses these values (mean and standard deviation) to standardize the data by subtracting the mean and dividing by the standard deviation for each feature. The data is now centered around 0 and scaled to have unit variance.

#### Formula used : z = (x−μ)/σ,  where μ = mean, σ = standard deviation and z = the standardized value.

### For the testing data input, transform() reuses the mean and standard deviation computed from the training data rather than recalculating them from the test data. This keeps information about the test set out of the preprocessing step and ensures both sets are scaled by the same rule.
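
To make the formula concrete, here is a small sketch that applies z = (x−μ)/σ by hand with NumPy and compares the result with StandardScaler. The names `mu`, `sigma`, `X_train_manual` and `X_test_manual` are introduced for illustration. It assumes the same split parameters as the notebook (`test_size=0.2`, `random_state=42`).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The statistics come from the training split only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# z = (x - mu) / sigma, applied to both splits with the training statistics.
X_train_manual = (X_train - mu) / sigma
X_test_manual = (X_test - mu) / sigma

# Compare the manual result with StandardScaler fitted on the training split.
scaler = StandardScaler().fit(X_train)
print(np.allclose(X_train_manual, scaler.transform(X_train)))
print(np.allclose(X_test_manual, scaler.transform(X_test)))

# The scaled test data is generally not exactly centered at 0,
# because it was scaled with the training statistics.
print(X_test_manual.mean(axis=0).round(3))
```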


In [194]:
# Importing necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Splitting the dataset into training and test sets
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [216]:
# Displaying the data input that will be used for training along with the respective outputs.

print(X_train)
print(y_train)

[[5.5 2.4 3.7 1. ]
 [6.3 2.8 5.1 1.5]
 [6.4 3.1 5.5 1.8]
 [6.6 3.  4.4 1.4]
 [7.2 3.6 6.1 2.5]
 [5.7 2.9 4.2 1.3]
 [7.6 3.  6.6 2.1]
 [5.6 3.  4.5 1.5]
 [5.1 3.5 1.4 0.2]
 [7.7 2.8 6.7 2. ]
 [5.8 2.7 4.1 1. ]
 [5.2 3.4 1.4 0.2]
 [5.  3.5 1.3 0.3]
 [5.1 3.8 1.9 0.4]
 [5.  2.  3.5 1. ]
 [6.3 2.7 4.9 1.8]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.1 3.3 1.7 0.5]
 [5.6 2.7 4.2 1.3]
 [5.1 3.4 1.5 0.2]
 [5.7 3.  4.2 1.2]
 [7.7 3.8 6.7 2.2]
 [4.6 3.2 1.4 0.2]
 [6.2 2.9 4.3 1.3]
 [5.7 2.5 5.  2. ]
 [5.5 4.2 1.4 0.2]
 [6.  3.  4.8 1.8]
 [5.8 2.7 5.1 1.9]
 [6.  2.2 4.  1. ]
 [5.4 3.  4.5 1.5]
 [6.2 3.4 5.4 2.3]
 [5.5 2.3 4.  1.3]
 [5.4 3.9 1.7 0.4]
 [5.  2.3 3.3 1. ]
 [6.4 2.7 5.3 1.9]
 [5.  3.3 1.4 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 2.4 3.8 1.1]
 [6.7 3.  5.  1.7]
 [4.9 3.1 1.5 0.2]
 [5.8 2.8 5.1 2.4]
 [5.  3.4 1.5 0.2]
 [5.  3.5 1.6 0.6]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.9 3.2 5.7 2.3]
 [6.  2.7 5.1 1.6]
 [6.1 2.6 5.6 1.4]
 [7.7 3.  6.1 2.3]
 [5.5 2.5 4.  1.3]
 [4.4 2.9 1.4 0.2]
 [4.3 3.  1.

In [218]:
# Displaying the data input that will be used for testing along with the expected outputs.

print(X_test)
print(y_test)

[[6.1 2.8 4.7 1.2]
 [5.7 3.8 1.7 0.3]
 [7.7 2.6 6.9 2.3]
 [6.  2.9 4.5 1.5]
 [6.8 2.8 4.8 1.4]
 [5.4 3.4 1.5 0.4]
 [5.6 2.9 3.6 1.3]
 [6.9 3.1 5.1 2.3]
 [6.2 2.2 4.5 1.5]
 [5.8 2.7 3.9 1.2]
 [6.5 3.2 5.1 2. ]
 [4.8 3.  1.4 0.1]
 [5.5 3.5 1.3 0.2]
 [4.9 3.1 1.5 0.1]
 [5.1 3.8 1.5 0.3]
 [6.3 3.3 4.7 1.6]
 [6.5 3.  5.8 2.2]
 [5.6 2.5 3.9 1.1]
 [5.7 2.8 4.5 1.3]
 [6.4 2.8 5.6 2.2]
 [4.7 3.2 1.6 0.2]
 [6.1 3.  4.9 1.8]
 [5.  3.4 1.6 0.4]
 [6.4 2.8 5.6 2.1]
 [7.9 3.8 6.4 2. ]
 [6.7 3.  5.2 2.3]
 [6.7 2.5 5.8 1.8]
 [6.8 3.2 5.9 2.3]
 [4.8 3.  1.4 0.3]
 [4.8 3.1 1.6 0.2]
 [4.6 3.6 1.  0.2]
 [5.7 4.4 1.5 0.4]
 [6.7 3.1 4.4 1.4]
 [4.8 3.4 1.6 0.2]
 [4.4 3.2 1.3 0.2]
 [6.3 2.5 5.  1.9]
 [6.4 3.2 4.5 1.5]
 [5.2 3.5 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.2 4.1 1.5 0.1]
 [5.8 2.7 5.1 1.9]
 [6.  3.4 4.5 1.6]
 [6.7 3.1 4.7 1.5]
 [5.4 3.9 1.3 0.4]
 [5.4 3.7 1.5 0.2]]
[1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1
 0 0 0 2 1 1 0 0]


In [220]:
# Displaying the training data input that has been scaled.
print(X_train_scaled)

[[-1.47393679  1.20365799 -1.56253475 -1.31260282]
 [-0.13307079  2.99237573 -1.27600637 -1.04563275]
 [ 1.08589829  0.08570939  0.38585821  0.28921757]
 [-1.23014297  0.75647855 -1.2187007  -1.31260282]
 [-1.7177306   0.30929911 -1.39061772 -1.31260282]
 [ 0.59831066 -1.25582892  0.72969227  0.95664273]
 [ 0.72020757  0.30929911  0.44316389  0.4227026 ]
 [-0.74255534  0.98006827 -1.27600637 -1.31260282]
 [-0.98634915  1.20365799 -1.33331205 -1.31260282]
 [-0.74255534  2.32160658 -1.27600637 -1.44608785]
 [-0.01117388 -0.80864948  0.78699794  0.95664273]
 [ 0.23261993  0.75647855  0.44316389  0.55618763]
 [ 1.08589829  0.08570939  0.55777524  0.4227026 ]
 [-0.49876152  1.87442714 -1.39061772 -1.04563275]
 [-0.49876152  1.4272477  -1.27600637 -1.31260282]
 [-0.37686461 -1.47941864 -0.01528151 -0.24472256]
 [ 0.59831066 -0.58505976  0.78699794  0.4227026 ]
 [ 0.72020757  0.08570939  1.01622064  0.8231577 ]
 [ 0.96400139 -0.13788033  0.38585821  0.28921757]
 [ 1.69538284  1.20365799  1.36

In [222]:
# Displaying the testing data input that has been scaled.
print(X_test_scaled)

[[ 0.35451684 -0.58505976  0.55777524  0.02224751]
 [-0.13307079  1.65083742 -1.16139502 -1.17911778]
 [ 2.30486738 -1.0322392   1.8185001   1.49058286]
 [ 0.23261993 -0.36147005  0.44316389  0.4227026 ]
 [ 1.2077952  -0.58505976  0.61508092  0.28921757]
 [-0.49876152  0.75647855 -1.27600637 -1.04563275]
 [-0.2549677  -0.36147005 -0.07258719  0.15573254]
 [ 1.32969211  0.08570939  0.78699794  1.49058286]
 [ 0.47641375 -1.92659808  0.44316389  0.4227026 ]
 [-0.01117388 -0.80864948  0.09932984  0.02224751]
 [ 0.84210448  0.30929911  0.78699794  1.09012776]
 [-1.23014297 -0.13788033 -1.33331205 -1.44608785]
 [-0.37686461  0.98006827 -1.39061772 -1.31260282]
 [-1.10824606  0.08570939 -1.27600637 -1.44608785]
 [-0.86445224  1.65083742 -1.27600637 -1.17911778]
 [ 0.59831066  0.53288883  0.55777524  0.55618763]
 [ 0.84210448 -0.13788033  1.18813767  1.35709783]
 [-0.2549677  -1.25582892  0.09932984 -0.11123753]
 [-0.13307079 -0.58505976  0.44316389  0.15573254]
 [ 0.72020757 -0.58505976  1.07

# Model Selection and Training 

## I have used two different Machine Learning algorithms for training my model - 

### 1. Logistic Regression : 
#### Performs classification by modelling the probability that a given input belongs to a particular class using a logistic (sigmoid) function. The model learns by finding the coefficients for each feature that maximize the likelihood of the observed class labels.

### 2. K-Nearest Neighbors : 
#### Classifies data points based on the majority class of their k-nearest neighbors in the feature space. It computes the distance between the input sample and every sample in the training set, then selects the k nearest neighbors (using a distance metric such as Euclidean distance). The class of the input sample is determined by the majority class among these k nearest neighbors.
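
The K-Nearest Neighbors procedure described above can be sketched by hand in a few lines. This is an illustration, not scikit-learn's implementation: `knn_predict_one` is a hypothetical helper, and its tie-breaking (the label encountered first among the nearest neighbors wins) is an assumption that may differ from KNeighborsClassifier.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)


def knn_predict_one(x, X_ref, y_ref, k=3):
    """Hypothetical helper: majority vote among the k nearest training points."""
    # Euclidean distance from x to every training sample.
    distances = np.linalg.norm(X_ref - x, axis=1)
    # Indices of the k smallest distances, nearest first.
    nearest = np.argsort(distances)[:k]
    # Majority class among those neighbors.
    votes = Counter(y_ref[nearest].tolist())
    return votes.most_common(1)[0][0]


manual_pred = np.array([knn_predict_one(x, X_train_s, y_train) for x in X_test_s])
print(manual_pred)
print("accuracy:", (manual_pred == y_test).mean())
```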

## Since I'm using two different models, I have used a strategy called "Soft Voting". 

### Soft voting is an Ensemble Learning Method that allows us to combine the outputs of multiple models to make a final decision. Each model provides probability estimates for each class, and the final class is based on the average of these probabilities.
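
As a rough sketch of what soft voting does, the example below averages the class probabilities of the two models by hand and compares the result with VotingClassifier. The variable names (`log_reg`, `knn`, `avg_proba`, `manual_pred`) are introduced for illustration. It assumes equal weights for both models, which is the default when no `weights` argument is passed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Fit the two models separately.
log_reg = LogisticRegression(max_iter=200).fit(X_train_s, y_train)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train_s, y_train)

# Soft voting: average the per-class probabilities, then pick the most likely class.
avg_proba = (log_reg.predict_proba(X_test_s) + knn.predict_proba(X_test_s)) / 2
manual_pred = log_reg.classes_[np.argmax(avg_proba, axis=1)]

# The same ensemble built with VotingClassifier.
voting = VotingClassifier(
    estimators=[
        ("logistic", LogisticRegression(max_iter=200)),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
    ],
    voting="soft",
).fit(X_train_s, y_train)

print(np.allclose(avg_proba, voting.predict_proba(X_test_s)))
print(np.array_equal(manual_pred, voting.predict(X_test_s)))
```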


In [251]:
# Importing necessary libraries
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import VotingClassifier

# Defining the individual classifiers that the ensemble will combine
models = {
    'Logistic Regression': LogisticRegression(max_iter=200),
    'KNN': KNeighborsClassifier(n_neighbors=3)
}

# Creating a soft voting classifier from the classifiers defined above
voting_clf = VotingClassifier(estimators=[
    ('logistic', models['Logistic Regression']),
    ('knn', models['KNN'])
], voting='soft')

# Creating a pipeline that standardizes the features and then applies the voting classifier
# (StandardScaler was imported in the preprocessing cell above)
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Standardize features
    ('classifier', voting_clf)
])

# Fitting the pipeline on the training data
pipeline.fit(X_train, y_train)

# Making predictions on the test set
y_pred = pipeline.predict(X_test)

# Evaluating and printing accuracy
print(f"Soft Voting Accuracy: {accuracy_score(y_test, y_pred):.2f}")

Soft Voting Accuracy: 1.00


### On testing the model, we see that the predictions match the expected labels for 'X_test', i.e.
### 'y_test': [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1
### 0 0 0 2 1 1 0 0]

In [253]:
# Printing the predicted data
print(y_pred)

[1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1
 0 0 0 2 1 1 0 0]
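
Rather than comparing the two arrays by eye, the predictions can be checked programmatically. The sketch below rebuilds the pipeline so that it runs on its own and assumes the split parameters from the preprocessing cell (`test_size=0.2`, `random_state=42`). The names `cm` and `mismatches` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Scaling followed by the soft voting ensemble.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", VotingClassifier(
        estimators=[
            ("logistic", LogisticRegression(max_iter=200)),
            ("knn", KNeighborsClassifier(n_neighbors=3)),
        ],
        voting="soft",
    )),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Positions in the test set where the prediction differs from the label.
mismatches = np.flatnonzero(y_pred != y_test)
print("misclassified test indices:", mismatches)
print("accuracy:", accuracy_score(y_test, y_pred))
```

A confusion matrix with non-zero entries only on its diagonal means every test sample was classified correctly.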


# Visualizing Our Pipeline

In [256]:
from sklearn import set_config
set_config(display="diagram")
pipeline

### I have also attached the visualization in my GitHub repo README file, in case it's not readable here.