## 1. Demonstrate the following using the sklearn library:

### a) Holdout Method 
The hold-out method splits the dataset into a training set and a test set. The model is trained on the training set, and the test set is used to estimate how well that model performs on unseen data. A common choice is to use 80% of the data for training and the remaining 20% for testing.

[dataset](https://www.kaggle.com/datasets/uciml/iris)

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [5]:
dataset = pd.read_csv('Iris.csv')
print("Total Rows: ",dataset.shape[0])

Total Rows:  150


In [7]:
x = dataset.drop(columns=['Species'])
y = dataset.Species

In [8]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)

In [11]:
print("Train Size: ",x_train.shape[0])
print("Test Size: ",x_test.shape[0])

Train Size:  120
Test Size:  30
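
The split above is random, so the rows in each set change on every run. A minimal sketch of a repeatable, class-balanced hold-out split follows. It assumes the built-in `load_iris` data in place of the Kaggle CSV, and the seed value `42` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: use scikit-learn's bundled Iris data instead of Iris.csv.
X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle so the split is repeatable;
# stratify=y keeps the class proportions the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train Size: ", X_train.shape[0])
print("Test Size: ", X_test.shape[0])
print("Test rows per class: ", np.bincount(y_test))
```

With 50 rows per class and a 20% test set, stratification places 10 rows of each class in the test set.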


### b) K-Fold Cross Validation

In [20]:
from sklearn.model_selection import KFold
import random
import numpy as np

df = pd.DataFrame(np.random.randint(0,100,size=(100,9)),columns=['A','B','C','D','E','F','G','H','I'])
df

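| | A | B | C | D | E | F | G | H | I |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 64 | 46 | 28 | 7 | 88 | 28 | 74 | 5 | 15 |
| 1 | 28 | 91 | 16 | 43 | 94 | 87 | 44 | 10 | 77 |
| 2 | 30 | 6 | 56 | 51 | 13 | 59 | 17 | 32 | 40 |
| 3 | 20 | 51 | 46 | 87 | 70 | 81 | 67 | 91 | 56 |
| 4 | 91 | 20 | 46 | 42 | 76 | 98 | 78 | 78 | 67 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 57 | 21 | 57 | 25 | 5 | 58 | 89 | 85 | 78 |
| 96 | 24 | 23 | 78 | 45 | 90 | 96 | 27 | 60 | 27 |
| 97 | 72 | 4 | 96 | 48 | 43 | 72 | 32 | 44 | 59 |
| 98 | 99 | 69 | 95 | 43 | 95 | 6 | 85 | 19 | 23 |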


In [31]:
cv = KFold(n_splits=10,random_state=1, shuffle=True)
ans = next(cv.split(df),None)

train = df.iloc[ans[0]]
test = df.iloc[ans[1]]

In [36]:
print("Train Size: ",train.shape)
print("Test Size: ",test.shape)

Train Size:  (90, 9)
Test Size:  (10, 9)
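
The cell above inspects only the first of the ten folds. As a rough sketch, iterating over every split shows that each row lands in the test set exactly once. The random frame is regenerated here with a seeded generator (an assumption, so the block runs on its own), and `test_rows_seen` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Assumption: rebuild a 100x9 random frame with a fixed seed for repeatability.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(100, 9)), columns=list("ABCDEFGHI"))

cv = KFold(n_splits=10, shuffle=True, random_state=1)

test_rows_seen = []  # collects the test indices of every fold
for fold, (train_idx, test_idx) in enumerate(cv.split(df)):
    train = df.iloc[train_idx]
    test = df.iloc[test_idx]
    test_rows_seen.extend(test_idx.tolist())
    print(f"Fold {fold}: train={train.shape}, test={test.shape}")

# Every row index appears in exactly one test fold.
print("Rows used for testing: ", len(set(test_rows_seen)))
```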


### c) Bootstrap Sampling

[ref.](https://www.digitalocean.com/community/tutorials/bootstrap-sampling-in-python)

In [43]:
import random

x = np.random.normal(loc= 300.0, size=1000)
print("Mean of Actual Data with 1K rows: ",np.mean(x))

# bootstrap sampling: draw 50 samples of size 4, each with replacement
sample_mean = []
for i in range(50):
    y = random.choices(x.tolist(), k=4)
    avg = np.mean(y)
    sample_mean.append(avg)

# mean using bootstrap sampling
print("Mean of sampled Data with 50 rows: ",np.mean(sample_mean))

Mean of Actual Data with 1K rows:  300.0298296538071
Mean of sampled Data with 50 rows:  300.0023012474654
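
Since this exercise is about sklearn, the same idea can be sketched with `sklearn.utils.resample`, which draws with replacement by default. The example below is an illustration under stated assumptions: 200 resamples, each the size of the original data, and a 95% percentile interval for the mean. None of these choices come from the cell above.

```python
import numpy as np
from sklearn.utils import resample

# Assumption: a seeded generator so the numbers are repeatable.
rng = np.random.RandomState(0)
x = rng.normal(loc=300.0, size=1000)

boot_means = []
for i in range(200):
    # Each bootstrap sample has the same size as x and is drawn with replacement.
    sample = resample(x, replace=True, n_samples=len(x), random_state=i)
    boot_means.append(sample.mean())

# Percentile interval: the middle 95% of the bootstrap means.
low, high = np.percentile(boot_means, [2.5, 97.5])
print("Mean of actual data: ", x.mean())
print("95% bootstrap interval for the mean: ", low, high)
```

The spread of `boot_means` gives an estimate of how much the sample mean would vary across repeated experiments, which is the usual reason for bootstrapping.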


## 2. Implement the kNN algorithm using the scikit-learn library on the Wine dataset. Find the accuracy and display the confusion matrix

In [4]:
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

In [53]:
# load dataset
dataset = load_wine()
x = dataset['data']
y = dataset['target']

print("shape of the data: ",x.shape)

shape of the data:  (178, 13)


In [54]:
# train-test split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=0)

In [55]:
k=3

In [56]:
# training
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(x_train, y_train)

KNeighborsClassifier(n_neighbors=3)

In [60]:
# validation against test set.
y_pred = knn.predict(x_test)

In [61]:
y_test

array([0, 2, 1, 0, 1, 1, 0, 2, 1, 1, 2, 2, 0, 1, 2, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 1, 1, 1, 1, 1, 2, 0, 0, 1, 0, 0, 0])

In [63]:
# measuring accuracy
acc = metrics.accuracy_score(y_test,y_pred)
print("Model accuracy: ",round(acc,2)*100)

Model accuracy:  78.0


In [64]:
# generate confusion matrix
metrics.confusion_matrix(y_test, y_pred)

array([[13,  0,  1],
       [ 1, 13,  2],
       [ 1,  3,  2]], dtype=int64)
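
kNN relies on distances between rows, and the Wine features are measured on very different scales, so features with large values can dominate the result. The following is a sketch, not part of the original run, that standardizes the features before applying the same `k=3` classifier. The pipeline and the name `scaled_knn` are assumptions added for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The scaler is fitted on the training data only, then applied to the test data.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(X_train, y_train)
y_pred = scaled_knn.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print("Model accuracy with scaling: ", round(acc * 100, 2))
print(cm)
```

Scaling often helps distance-based models on this kind of data, but the effect should be checked against the unscaled result above rather than assumed.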

## 3. Implement a Decision Tree Classifier for classification of the Iris dataset

a. Load the data set

b. Split the data set (70:30) to train and test sets

c. Train a Decision Tree using train set

d. Test the model using test set.

e. Find accuracy and confusion matrix

In [2]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn import metrics, tree
from sklearn.datasets import load_iris

In [6]:
df = load_iris()
dataset = pd.DataFrame(df.data,columns=df.feature_names)
dataset['target'] = df.target

In [7]:
x = dataset.drop(columns=['target'])
y = dataset.target

In [8]:
# train-test split (test_size=0.2 gives 80:20; the 70:30 split from step b would need test_size=0.3)
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=0)

In [9]:
# model training
classifier = DecisionTreeClassifier(criterion='entropy')
classifier.fit(x_train,y_train)

DecisionTreeClassifier(criterion='entropy')

In [10]:
# model validation against test set
y_pred = classifier.predict(x_test)

In [11]:
# accuracy measurement
acc = metrics.accuracy_score(y_test,y_pred)
print("Model Accuracy: ",round(acc,2)*100)

Model Accuracy:  100.0


In [12]:
y_pred

array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1,
       0, 0, 2, 0, 0, 1, 1, 0])

In [14]:
y_test.head(3)

114    2
62     1
33     0
Name: target, dtype: int32
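
Steps b and e ask for a 70:30 split and a confusion matrix, which the cells above do not show. A minimal sketch covering both follows. It assumes `load_iris(return_X_y=True)` for brevity and fixes `random_state=0` on the tree so the run is repeatable. Its numbers will differ from the 80:20 run above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# a. load the data set
X, y = load_iris(return_X_y=True)

# b. 70:30 split: 105 training rows and 45 test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# c. train the decision tree on the training set
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train, y_train)

# d. test the model on the test set
y_pred = clf.predict(X_test)

# e. accuracy and confusion matrix (rows: true class, columns: predicted class)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print("Model Accuracy: ", round(acc * 100, 2))
print(cm)
```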

## 4. Implement the SVM classification algorithm on the Breast Cancer dataset

In [102]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

In [112]:
df = load_breast_cancer()

In [119]:
dataset = df.data
dataset = pd.DataFrame(dataset,columns=df.feature_names)
print("Dataset shape: ",dataset.shape)

Dataset shape:  (569, 30)


In [121]:
x = dataset
y = df.target

In [122]:
# splitting data into train and test
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)

In [123]:
# Algorithm Implementation
classifier = SVC(kernel='linear', random_state=0)

In [124]:
classifier.fit(x_train,y_train)

SVC(kernel='linear', random_state=0)

In [125]:
y_pred = classifier.predict(x_test)

In [130]:
# Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print("------- Confusion Matrix ------- ")
print(cm)

# Accuracy score (arguments in the documented order: y_true, y_pred)
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("\n------- Accuracy Score -------")
print(round(accuracy*100,2),"%")

------- Confusion Matrix ------- 
[[46  1]
 [ 4 63]]

------- Accuracy Score -------
95.61 %
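
`StandardScaler` is imported at the top of this section but never applied. As a hedged sketch of how it could be used, the pipeline below standardizes the features before fitting the same linear SVC. The pipeline and the name `model` are assumptions for illustration, and the resulting score was not part of the original run.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling is learned from the training set only, which avoids leaking test data.
model = make_pipeline(StandardScaler(), SVC(kernel="linear", random_state=0))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print(cm)
print(round(accuracy * 100, 2), "%")
```

SVMs are sensitive to feature scale, so standardizing is a common preprocessing step. Whether it improves on the unscaled result above depends on the split and should be compared directly.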
