## Steps in a Machine Learning Project

 - A Machine Learning Project involves the following steps:

**Defining the Problem:**
    
 - Define a problem statement, which addresses a business problem.

**Obtaining the Source Data:**
    
 - The raw data required to build a model may reside in one or more sources, such as relational databases and social networking sites.

**Understanding Data Through Visualization:**
 - Look into the data and understand important properties of its features, such as their mean and spread.

**Preparing Data for Machine Learning Algorithms:**
 - Captured raw data usually cannot be fed directly to a Machine Learning algorithm. The raw datasets have to be manipulated or transformed through one or more pre-processing steps.

**Choosing an Algorithm:**
 - Based on the features of the data set, pick a suitable algorithm.

**Building the Model:**
 - Train the algorithm on the chosen training data set and verify its performance with a metric.

**Fine-tuning the Model:**
 - Identify the values of the key parameters of the chosen model that give better performance.

**Using the Best Model:**
 - Use the best-performing model to address the defined problem.

## Introduction to scikit-learn

scikit-learn is a Machine Learning toolkit for Python. The package contains efficient tools for data mining and data analysis.

It is built on the NumPy, SciPy, and matplotlib packages. It is open source and commercially usable under the BSD license.

### scikit-learn Utilities
- The scikit-learn library has many utilities for the following Machine Learning tasks:

  - Preprocessing
  - Model Selection
  - Classification
  - Regression
  - Clustering
  - Dimensionality Reduction

### Steps with scikit-learn
 - Typically, one performs the following steps while working on a Machine Learning problem with scikit-learn:

 1. Cleaning the raw data set.
 2. Transforming it further with scikit-learn pre-processing utilities.
 3. Splitting the data into train and test sets with the train_test_split utility.
 4. Creating a suitable model with default parameters.
 5. Training the model using the fit function.
 6. Evaluating the model and fine-tuning it.
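
The steps above can be sketched end to end on the bundled iris data. This is only a minimal illustration under a few assumptions of mine: the data needs no cleaning, StandardScaler stands in for the pre-processing step, KNeighborsClassifier stands in for the model, and the scaler is fitted on the training portion only so that no test information leaks into pre-processing.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Step 1: load a data set that is already clean
iris = load_iris()

# Step 3: split into train and test sets (the default test share is 25%)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=0
)

# Step 2: fit the transformer on the training data, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Steps 4 and 5: create a model with default parameters and train it
model = KNeighborsClassifier()
model.fit(X_train_scaled, y_train)

# Step 6: evaluate on the held-out data
test_accuracy = model.score(X_test_scaled, y_test)
print(X_train.shape, X_test.shape)
print(round(test_accuracy, 3))
```

Fine-tuning would then repeat steps 4 to 6 with different parameter values.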

### Reading Data for ML
 - Any Machine Learning algorithm requires data for building a model.

 - The data can be obtained from multiple sources such as HTTP and FTP repositories, databases, local repositories, etc.

 - Often, raw data read from a source cannot be used directly by an ML algorithm for building a model.

 - So, raw data has to be cleaned, processed, and transformed (if required) before it is passed to an ML algorithm.

### Example Data - Breast Cancer Dataset
 - The Breast Cancer data set is a popular one; it contains 30 features measured for each of 569 patients.

 - We will perform the following tasks to make the cancer data set ready for ML:

  - Reading raw data from the UCI archive
  - Extracting features from the raw data
  - Naming or labelling the features
  - Extracting target values from the raw data
  - Naming or labelling the target values

#### Reading Data from UCI Archive

 - The raw data set from the UCI archive can be read with the following code snippet.

In [1]:
import pandas as pd

cancer_set = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', 
                        header = None)
print(cancer_set.shape)

(569, 32)


 - The raw dataset that was read contains 32 columns.
 - The 1st column holds the patient ID, and the 2nd holds the tumor type, i.e. malignant or benign.
 - The remaining 30 columns represent the features obtained from each patient.

##### Extracting Features from Raw Set

 - All columns representing features are extracted with the following code snippet.

In [3]:
cancer_features = cancer_set.iloc[:,2:]

print(cancer_features.shape)
print(type(cancer_features))

(569, 30)
<class 'pandas.core.frame.DataFrame'>


 - cancer_features is a DataFrame. It is converted to a NumPy array with the code below.

In [4]:
cancer_features = cancer_features.values
print(cancer_features.shape)
print(type(cancer_features))

(569, 30)
<class 'numpy.ndarray'>


#### Naming features

In [5]:
cancer_features_names = ['mean radius', 
'mean texture', 'mean perimeter', 
'mean area', 'mean smoothness', 
'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry',
'mean fractal dimension','radius error',
'texture error','perimeter error',
'area error', 'smoothness error',
'compactness error','concavity error',
'concave points error','symmetry error',
'fractal dimension error','worst radius',
'worst texture', 'worst perimeter', 
'worst area','worst smoothness', 
'worst compactness', 'worst concavity',
'worst concave points','worst symmetry',
'worst fractal dimension']

#### Extracting target values from Raw data
 - The target value of each patient is extracted with the code snippet below.

In [7]:
cancer_target = cancer_set.iloc[:, 1]

# Replacing 'M' with 0 and 'B' with 1
cancer_target = cancer_target.replace(['M', 'B'], [0, 1])

# Converting to numpy array
cancer_target = cancer_target.values

print(type(cancer_target))
print(cancer_target.shape)

<class 'numpy.ndarray'>
(569,)


### scikit-learn Datasets
 - scikit-learn ships with a few popular datasets.

 - They can be loaded into your working environment and used.

### Reading Cancer Data from scikit-learn

In [15]:
from sklearn import datasets

breast_cancer = datasets.load_breast_cancer()

print(breast_cancer.data.shape)
print(breast_cancer.target.shape)

(569, 30)
(569,)


In [13]:
from sklearn import datasets

iris = datasets.load_iris()

print(type(iris))

<class 'sklearn.utils._bunch.Bunch'>


## Preprocessing - Introduction

**Preprocessing is a step, in which raw data is modified or transformed into a format, suitable for further downstream processing.**

scikit-learn provides many preprocessing utilities, such as:

 - Standardization (mean removal)
 - Scaling
 - Normalization
 - Binarization
 - One Hot Encoding
 - Label Encoding
 - Imputation

### 1. Standardization

**Standardization, or mean removal, rescales each feature so that it has mean 0 and variance 1. It shifts and scales the values but does not change the shape of their distribution.**

 - This can be achieved using StandardScaler.
 - An example with its output is shown below.

In [16]:
from sklearn import preprocessing
standardizer  = preprocessing.StandardScaler()
standardizer = standardizer.fit(breast_cancer.data)
breast_cancer_standardizer = standardizer.transform(breast_cancer.data)

print("Mean of each feature after Standardization : \n\n")
print(breast_cancer_standardizer.mean(axis=0))
print("\nStd. of each feature after Standardization : \n\n")
print(breast_cancer_standardizer.std(axis=0))

Mean of each feature after Standardization : 


[-3.16286735e-15 -6.53060890e-15 -7.07889127e-16 -8.79983452e-16
  6.13217737e-15 -1.12036918e-15 -4.42138027e-16  9.73249991e-16
 -1.97167024e-15 -1.45363120e-15 -9.07641468e-16 -8.85349205e-16
  1.77367396e-15 -8.29155139e-16 -7.54180940e-16 -3.92187747e-16
  7.91789988e-16 -2.73946068e-16 -3.10823423e-16 -3.36676596e-16
 -2.33322442e-15  1.76367415e-15 -1.19802625e-15  5.04966114e-16
 -5.21317026e-15 -2.17478837e-15  6.85645643e-16 -1.41265636e-16
 -2.28956670e-15  2.57517109e-15]

Std. of each feature after Standardization : 


[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]


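 - The axis argument of mean and std selects the direction along which the statistic is computed.
   - With axis=0, the statistic is computed down the rows, giving one value per feature (column).
   - With axis=1, it is computed across the columns, giving one value per sample (row).
 - StandardScaler standardizes each feature independently, which is why the check above uses axis=0.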
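
To make the per-feature computation concrete, here is a small hand-rolled sketch that uses only the Python standard library. The data values are made up, and the code illustrates the arithmetic only; it is not scikit-learn's implementation. It uses the population standard deviation, which is what StandardScaler uses as well.

```python
from statistics import fmean, pstdev

# Toy data: 3 samples (rows) x 2 features (columns); the values are made up
data = [[1.0, 10.0],
        [2.0, 20.0],
        [3.0, 60.0]]

# Work column by column: one mean and one standard deviation per feature
columns = list(zip(*data))
means = [fmean(col) for col in columns]
stds = [pstdev(col) for col in columns]   # population standard deviation

# Subtract the column mean and divide by the column standard deviation
standardized = [
    [(value - means[j]) / stds[j] for j, value in enumerate(row)]
    for row in data
]

# Check: every column now has mean 0 and standard deviation 1 (up to rounding)
new_columns = list(zip(*standardized))
print([round(fmean(col), 10) for col in new_columns])
print([round(pstdev(col), 10) for col in new_columns])
```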

### 2. Scaling


Scaling transforms existing data values to lie between a minimum and a maximum value.
  - MinMaxScaler transforms each feature to the range 0 to 1 by default.
  - MaxAbsScaler divides each feature by its maximum absolute value, so the result lies between -1 and 1.


#### 2.1 Using MinMaxScaler

In [17]:
min_max_scaler  = preprocessing.MinMaxScaler().fit(breast_cancer.data)
breast_cancer_minmaxscaled = min_max_scaler.transform(breast_cancer.data)

**By default, the data is transformed to the range 0 to 1. The range can be customized with the feature_range argument, as shown in the next example.**

 - MinMaxScaler with specified range

In [18]:
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0,10)).fit(breast_cancer.data)
breast_cancer_minmaxscaled10 = min_max_scaler.transform(breast_cancer.data)

 - In the above example, the data is transformed to the range 0 to 10.

#### 2.2 Using MaxAbsScaler

 - With MaxAbsScaler, each feature is divided by its maximum absolute value, so that this value becomes 1. It is intended for data that is already centered at zero or for sparse data.

In [19]:
max_abs_scaler = preprocessing.MaxAbsScaler().fit(breast_cancer.data)

breast_cancer_maxabsscaled = max_abs_scaler.transform(breast_cancer.data)

 - MaxAbsScaler transforms data to the range -1 to 1. For features with no negative values, the result lies between 0 and 1.
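
The arithmetic behind both scalers in this section can be sketched by hand for a single feature. The helper names `min_max_scale` and `max_abs_scale` are hypothetical and the values are made up; the sketch assumes the column is not constant and is not scikit-learn's implementation.

```python
def min_max_scale(column, feature_range=(0.0, 1.0)):
    """Linearly map a column so that its minimum and maximum hit feature_range."""
    low, high = feature_range
    col_min, col_max = min(column), max(column)
    span = col_max - col_min          # assumes the column is not constant
    return [low + (x - col_min) / span * (high - low) for x in column]


def max_abs_scale(column):
    """Divide by the largest absolute value; signs and zeros are preserved."""
    largest = max(abs(x) for x in column)
    return [x / largest for x in column]


column = [-2.0, 0.0, 1.0, 4.0]        # made-up values of one feature
print(min_max_scale(column))
print(min_max_scale(column, feature_range=(0.0, 10.0)))
print(max_abs_scale(column))
```

Note that max-abs scaling keeps zeros at zero, which is why it suits sparse data, whereas min-max scaling shifts them.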

### 3. Normalization

 - Normalization scales each sample (row) to have a unit norm.
 - Normalization can be achieved with the 'l1', 'l2', and 'max' norms.
 - The 'l1' norm makes the sum of absolute values of each row equal to 1, the 'l2' norm makes the sum of squares of each row equal to 1, and the 'max' norm makes the largest absolute value of each row equal to 1.
 - The 'l1' norm is less sensitive to outliers than the 'l2' norm.
 - By default, the 'l2' norm is used. Hence, removing outliers is recommended before applying it.

In [21]:
normalizer = preprocessing.Normalizer(norm='l1').fit(breast_cancer.data)
breast_cancer_normalized = normalizer.transform(breast_cancer.data)
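
The three norms can be worked through by hand on a single sample. The helper `normalize_row` is a hypothetical name and the sample values are made up; the sketch only illustrates the arithmetic and assumes the row is not all zeros.

```python
import math


def normalize_row(row, norm="l2"):
    """Scale one sample so that its chosen norm equals 1."""
    if norm == "l1":
        size = sum(abs(x) for x in row)
    elif norm == "l2":
        size = math.sqrt(sum(x * x for x in row))
    elif norm == "max":
        size = max(abs(x) for x in row)
    else:
        raise ValueError(f"unknown norm: {norm}")
    return [x / size for x in row]


row = [3.0, -4.0]                      # made-up sample with two features
l1_row = normalize_row(row, "l1")      # absolute values sum to 1
l2_row = normalize_row(row, "l2")      # squares sum to 1
max_row = normalize_row(row, "max")    # largest absolute value is 1
print(l1_row, l2_row, max_row)
```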

### 4. Binarization

Binarization is the process of transforming data points to 0 or 1 based on a given threshold.
 - Any value above the threshold is transformed to 1, and any value less than or equal to the threshold is transformed to 0.
 - By default, a threshold of 0 is used.

In [22]:
binarizer  = preprocessing.Binarizer(threshold=3.0).fit(breast_cancer.data)
breast_cancer_binarized = binarizer.transform(breast_cancer.data)
print(breast_cancer_binarized[:5,:5])

[[1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 0.]]
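
The thresholding rule itself fits in one line. The sketch below uses a hypothetical helper `binarize` on made-up values to show that a value equal to the threshold maps to 0, because the comparison is strict.

```python
def binarize(values, threshold=0.0):
    """Return 1.0 for values strictly greater than threshold, else 0.0."""
    return [1.0 if x > threshold else 0.0 for x in values]


sample = [2.9, 3.0, 3.1, -1.0]           # made-up values
print(binarize(sample, threshold=3.0))   # [0.0, 0.0, 1.0, 0.0]
print(binarize(sample))                  # [1.0, 1.0, 1.0, 0.0]
```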


### 5. OneHotEncoder

 - OneHotEncoder converts categorical values into one-hot vectors. In a one-hot vector, every category becomes a binary attribute that takes only the values 0 and 1, and exactly one attribute is 1 for each sample.

In [23]:
onehotencoder = preprocessing.OneHotEncoder()
onehotencoder = onehotencoder.fit([[1], [1], [1], [2], [2], [1]])

# Transforming category values 1 and 2 to one-hot vectors
print(onehotencoder.transform([[1]]).toarray())
print(onehotencoder.transform([[2]]).toarray())

[[1. 0.]]
[[0. 1.]]
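
The idea behind the output above can be reproduced by hand. This is a sketch with hypothetical helper names (`fit_categories`, `one_hot`), not scikit-learn's implementation: the distinct categories are collected in sorted order, and each value is turned into a vector with a single 1 at the position of its category.

```python
def fit_categories(values):
    """Collect the distinct categories in sorted order."""
    return sorted(set(values))


def one_hot(value, categories):
    """Return a vector with a single 1.0 at the position of value."""
    return [1.0 if value == category else 0.0 for category in categories]


categories = fit_categories([1, 1, 1, 2, 2, 1])
print(categories)               # [1, 2]
print(one_hot(1, categories))   # [1.0, 0.0]
print(one_hot(2, categories))   # [0.0, 1.0]
```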


### 6. Label Encoding

 - Label Encoding is a step in which categorical values are represented as integers. An example of transforming the categorical values ["benign", "malignant"] into [0, 1] is shown below.

In [24]:
labels = ['malignant', 'benign', 'malignant', 'benign']

labelencoder = preprocessing.LabelEncoder()

labelencoder = labelencoder.fit(labels)

bc_labelencoded = labelencoder.transform(breast_cancer.target_names)

print(bc_labelencoded)

[1 0]
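
The result [1 0] follows from the way the classes are ordered: the distinct labels are sorted, and each label is replaced by its position in that sorted list. A hand-rolled sketch of this mapping, with a hypothetical helper `fit_labels`, looks as follows.

```python
def fit_labels(labels):
    """Map each distinct label to its index in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


mapping = fit_labels(['malignant', 'benign', 'malignant', 'benign'])
print(mapping)                                              # {'benign': 0, 'malignant': 1}
print([mapping[name] for name in ['malignant', 'benign']])  # [1, 0]
```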


### 7. Imputation

 - Imputation replaces missing values with the mean, the median, or the most frequent value of the column in which they occur.

In [None]:
import numpy as np
from sklearn.impute import SimpleImputer

# Missing entries in a numeric array are marked with np.nan, not the string 'NaN'
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

imputer = imputer.fit(breast_cancer.data)
breast_cancer_imputed = imputer.transform(breast_cancer.data)
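
Mean imputation can also be sketched by hand to show what happens to each column. The helper `impute_mean` is a hypothetical name and the values are made up; the sketch assumes every column has at least one observed value.

```python
import math


def impute_mean(rows):
    """Replace NaN entries with the mean of the observed values in the same column."""
    fill_values = []
    for column in zip(*rows):
        observed = [x for x in column if not math.isnan(x)]
        fill_values.append(sum(observed) / len(observed))
    return [
        [fill_values[j] if math.isnan(x) else x for j, x in enumerate(row)]
        for row in rows
    ]


nan = float("nan")
rows = [[1.0, nan],
        [3.0, 4.0],
        [nan, 8.0]]            # made-up values with two missing entries
imputed = impute_mean(rows)
print(imputed)                 # [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```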


## Lab 1

In [31]:
#Write your code here
# Task 1
from sklearn import datasets
from sklearn import preprocessing
iris = datasets.load_iris()
normalizer = preprocessing.Normalizer(norm='l2').fit(iris.data)
iris_normalized = normalizer.transform(iris.data)
print(iris_normalized.mean(axis=0))

# Task 2

enc = preprocessing.OneHotEncoder()
iris_target_onehot = enc.fit_transform(iris.target.reshape(-1, 1))
print(iris_target_onehot.toarray()[[0,50,100]])

# Task 3

from sklearn.impute import SimpleImputer
import numpy as np
iris.data[:50, :] = np.nan
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
iris_imputed = imputer.fit_transform(iris.data)

print(iris_imputed.mean(axis=0))


[0.75140029 0.40517418 0.45478362 0.14107142]
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
[6.262 2.872 4.906 1.676]


In [33]:
import sklearn.preprocessing as preprocessing

regions = ['HYD', 'CHN', 'MUM', 'HYD', 'KOL', 'CHN']
print(preprocessing.LabelEncoder().fit(regions).transform(regions))

[1 0 3 1 2 0]


## Nearest Neighbors

The nearest neighbors method finds a predefined number of training points that are closest to a sample point and predicts its label from them.

 - sklearn.neighbors provides utilities for unsupervised and supervised neighbors-based learning methods.
 - scikit-learn implements two different nearest neighbors classifiers:
     - KNeighborsClassifier
     - RadiusNeighborsClassifier

### Nearest Neighbor Classifiers

 - KNeighborsClassifier classifies based on k nearest neighbors of every query point, where k is an integer value specified by the user.

 - RadiusNeighborsClassifier classifies based on the neighbors present within a fixed radius r of every query point, where r is a floating-point value specified by the user.
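
The k nearest neighbors vote can be sketched in a few lines of plain Python. The function `knn_predict` is a hypothetical helper, the points are made up, and plain Euclidean distance with an unweighted majority vote is assumed; a tie goes to whichever tied label is met first among the nearest points.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of query by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up 2-D training points that form two groups
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.2, 1.4)))   # a
print(knn_predict(train_points, train_labels, (6.8, 6.9)))   # b
```

A radius-based variant would keep every training point whose distance is at most r instead of taking the first k.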

### Nearest Neighbors Regression

scikit-learn implements the following two regressors:

 - KNeighborsRegressor predicts based on the k nearest neighbors of each query point.
 - RadiusNeighborsRegressor predicts based on the neighbors present in a fixed radius r of the query point.

### Demo of KNeighborsClassifier

In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
cancer = datasets.load_breast_cancer()

## Building a Model of KNN classifier

X_train, X_test, Y_train, Y_test = train_test_split(cancer.data, cancer.target,
                                                   stratify=cancer.target, random_state=42)

knn_classifier = KNeighborsClassifier()
knn_classifier = knn_classifier.fit(X_train, Y_train)

### Determining Accuracy of the Model

In [2]:
print("Accuracy of train data : ", knn_classifier.score(X_train,Y_train))
print("\nAccuracy of test data : ", knn_classifier.score(X_test,Y_test))

Accuracy of train data :  0.9460093896713615

Accuracy of test data :  0.9300699300699301


## Lab 2

In [12]:
# Task 1

from sklearn import datasets
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target,
                                                   stratify=iris.target, random_state=30)

print(X_train.shape)
print(X_test.shape)

# Task 2

from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
knn_clf = knn_clf.fit(X_train,Y_train)
print("Accuracy on train dataset : ", knn_clf.score(X_train,Y_train))
print("Accuracy on test dataset : ", knn_clf.score(X_test,Y_test))


(112, 4)
(38, 4)
Accuracy on train dataset :  0.9821428571428571
Accuracy on test dataset :  0.9473684210526315


In [11]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the iris dataset
iris = load_iris()

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Define the range of n_neighbors values to try
n_neighbors_range = range(3, 11)

# Initialize a dictionary to store the accuracy scores for each model
accuracy_scores = {}

# Loop over the n_neighbors values and fit a KNN model for each
for n_neighbors in n_neighbors_range:
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, Y_train)
    accuracy_scores[n_neighbors] = knn.score(X_test, Y_test)

# Find the n_neighbors value with the highest accuracy score
best_n_neighbors = max(accuracy_scores, key=accuracy_scores.get)

# Print the results
print("Accuracy scores for different n_neighbors values:")
for n_neighbors, accuracy in accuracy_scores.items():
    print(f"n_neighbors = {n_neighbors}: accuracy = {accuracy:.3f}")
    
print(f"The model with n_neighbors = {best_n_neighbors} had the highest accuracy score of {accuracy_scores[best_n_neighbors]:.3f}.")
print("The model with n_neighbors", best_n_neighbors, "had the highest accuracy score of", accuracy_scores[best_n_neighbors])

Accuracy scores for different n_neighbors values:
n_neighbors = 3: accuracy = 1.000
n_neighbors = 4: accuracy = 1.000
n_neighbors = 5: accuracy = 1.000
n_neighbors = 6: accuracy = 1.000
n_neighbors = 7: accuracy = 0.967
n_neighbors = 8: accuracy = 1.000
n_neighbors = 9: accuracy = 1.000
n_neighbors = 10: accuracy = 1.000
The model with n_neighbors = 3 had the highest accuracy score of 1.000.
The model with n_neighbors 3 had the highest accuracy score of 1.0


In [13]:
#Write your code here
# Task 1

from sklearn import datasets
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target,
                                                   stratify=iris.target, random_state=30)

print(X_train.shape)
print(X_test.shape)

# Task 2

from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
knn_clf = knn_clf.fit(X_train,Y_train)
print("Accuracy on train dataset : ", knn_clf.score(X_train,Y_train))
print("Accuracy on test dataset : ", knn_clf.score(X_test,Y_test))

# Task 3

# Define the range of n_neighbors values to try
n_neighbors_range = range(3, 11)

# Initialize a dictionary to store the accuracy scores for each model
accuracy_scores = {}

# Loop over the n_neighbors values and fit a KNN model for each
for n_neighbors in n_neighbors_range:
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, Y_train)
    accuracy_scores[n_neighbors] = knn.score(X_test, Y_test)

# Find the n_neighbors value with the highest accuracy score
best_n_neighbors = max(accuracy_scores, key=accuracy_scores.get)

# Print the results
print("Accuracy scores for different n_neighbors values:")
for n_neighbors, accuracy in accuracy_scores.items():
    print("n_neighbors = {}: accuracy = {:.3f}".format(n_neighbors, accuracy))
    

print("The model with n_neighbors = " + str(best_n_neighbors) + " had the highest accuracy score of " + str(accuracy_scores[best_n_neighbors]) + ".")

(112, 4)
(38, 4)
Accuracy on train dataset :  0.9821428571428571
Accuracy on test dataset :  0.9473684210526315
Accuracy scores for different n_neighbors values:
n_neighbors = 3: accuracy = 0.947
n_neighbors = 4: accuracy = 0.947
n_neighbors = 5: accuracy = 0.947
n_neighbors = 6: accuracy = 0.974
n_neighbors = 7: accuracy = 0.947
n_neighbors = 8: accuracy = 0.947
n_neighbors = 9: accuracy = 0.947
n_neighbors = 10: accuracy = 0.921
The model with n_neighbors = 6 had the highest accuracy score of 0.9736842105263158.


In [None]:
 - Split the iris dataset into two sets named X_train and X_test, and split iris.target into two sets named Y_train and Y_test.
   - Hint: use the train_test_split method from sklearn.model_selection, set random_state to 30, and perform stratified sampling.
 - Print the shape of the X_train dataset.
 - Print the shape of the X_test dataset.
 - Fit a k nearest neighbors model on the X_train data and Y_train labels with default parameters. Name the model knn_clf.
 - Evaluate the model accuracy on the training dataset and print its score.
 - Evaluate the model accuracy on the testing dataset and print its score.
 - Fit multiple k nearest neighbors models on the X_train data and Y_train labels, with the n_neighbors parameter value changing from 3 to 10.
 - Evaluate each model's accuracy on the testing dataset. Hint: make use of a for loop.
 - Print the n_neighbors value of the model with the highest accuracy. Use the iris data from sklearn.


In [14]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()

# Split the data into training and testing sets using stratified sampling
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=30, stratify=iris.target)

# Print the shape of the training and testing sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)


Shape of X_train: (120, 4)
Shape of X_test: (30, 4)


In [8]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the Iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, stratify=iris.target, random_state=30)

# Print the shape of the training and testing sets
print( X_train.shape)
print( X_test.shape)

# Fit a K-Nearest Neighbors model on the training data
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, Y_train)

# Evaluate the model on the training and testing data
train_acc = knn_clf.score(X_train, Y_train)
test_acc = knn_clf.score(X_test, Y_test)

print(train_acc)
print(test_acc)

# Fit multiple K-Nearest Neighbors models with different n_neighbors values and evaluate them on the testing data
best_acc = 0
best_k = 0
for k in range(3, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, Y_train)
    acc = knn_clf.score(X_test, Y_test)
    if acc > best_acc:
        best_acc = acc
        best_k = k
    

print(best_k)


(112, 4)
(38, 4)
0.9821428571428571
0.9473684210526315
6


## Decision Trees

 Decision Trees are another supervised learning method used for classification and regression.

 - Decision Trees learn simple decision rules from training data and build a Model.

 - DecisionTreeClassifier and DecisionTreeRegressor are the two utilities from sklearn.tree, which can be used for classification and regression respectively.

### Advantages of Decision Trees

 - Decision Trees are easy to understand.
 - They often do not require any preprocessing.
 - Decision Trees can learn from both numerical and categorical data.

### Disadvantages of Decision Trees
 - Decision trees sometimes become overly complex; such trees do not generalize well, which leads to overfitting. Overfitting can be addressed by setting the minimum number of samples required at a leaf node or by limiting the maximum depth of the tree.

 - A small variation in data can result in a completely different tree. This problem can be addressed by using decision trees within an ensemble.
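
As a rough illustration of the two controls mentioned above, the sketch below compares an unconstrained tree with one restricted through the max_depth and min_samples_leaf parameters of DecisionTreeClassifier. The parameter values 3 and 5 are arbitrary choices of mine, not tuned recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42
)

# Unconstrained tree: keeps splitting until the leaves are pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constrained tree: at most 3 levels, and every leaf keeps at least 5 samples
small_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X_train, y_train)

# Compare size and train/test accuracy of the two trees
for name, tree in [("full", full_tree), ("small", small_tree)]:
    print(name, tree.get_depth(), tree.get_n_leaves(),
          round(tree.score(X_train, y_train), 3),
          round(tree.score(X_test, y_test), 3))
```

A large gap between train and test accuracy for the full tree is the usual sign of overfitting; whether the constrained tree scores better on the test set depends on the data and on the chosen values.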

### Building a Decision Tree Classifier Model

In [18]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = datasets.load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(cancer.data, cancer.target,
                                                   stratify=cancer.target, random_state=42)

dt_classifier = DecisionTreeClassifier()
dt_classifier = dt_classifier.fit(X_train,Y_train)

print("Accuracy of train dataset : ", dt_classifier.score(X_train,Y_train))
print("Accuracy of test dataset : ", dt_classifier.score(X_test,Y_test))


Accuracy of train dataset :  1.0
Accuracy of test dataset :  0.916083916083916


### Fine Tuning the Model

In [19]:
dt_classifier = DecisionTreeClassifier(max_depth=2)
dt_classifier = dt_classifier.fit(X_train,Y_train)
print("Accuracy of train dataset :", dt_classifier.score(X_train,Y_train))
print("Accuracy of test dataset :", dt_classifier.score(X_test,Y_test))

Accuracy of train dataset : 0.9577464788732394
Accuracy of test dataset : 0.9090909090909091


## Lab 3

In [34]:
# Task 1
import numpy as np

from sklearn import datasets
from sklearn.model_selection import train_test_split
np.random.seed(100)

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
boston = datasets.load_boston()
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target,random_state=30)

print(X_train.shape)
print( X_test.shape)

# Task 2

from sklearn.tree import DecisionTreeRegressor

dt_reg = DecisionTreeRegressor()
dt_reg = dt_reg.fit(X_train,Y_train)

print(dt_reg.score(X_train,Y_train))
print(dt_reg.score(X_test,Y_test))
print(dt_reg.predict(X_test[:2]))

# Task 3

best_acc = 0
best_d = 0

for d in range(2,6):
    dt_reg = DecisionTreeRegressor(max_depth=d)
    dt_reg = dt_reg.fit(X_train,Y_train)
    acc = dt_reg.score(X_test,Y_test)
    if acc > best_acc:
        best_acc = acc
        best_d = d
        print(acc,d)

print(best_d)


(379, 13)
(127, 13)
1.0
0.8098834820264638
[18.2 13.9]
0.6876109752166819 2
0.6962264524668584 3
0.7086640885662667 4
4



    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np

        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_ho

## Ensemble Methods

Ensemble methods combine the predictions of several base estimators to improve generalization.

 - Ensemble methods are of two types:

   - Averaging Methods: They build several base estimators independently and then average their predictions.
     - E.g.: Bagging Methods, Forests of randomized trees
   - Boosting Methods: They build base estimators sequentially and try to reduce the bias of the combined estimator.
     - E.g.: AdaBoost, Gradient Tree Boosting

### Bagging Methods
 - Bagging Methods draw random subsets of the original dataset, build an estimator on each subset, and aggregate the individual predictions to form a final one.

 - BaggingClassifier and BaggingRegressor are the utilities from sklearn.ensemble to deal with Bagging.
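
The draw-fit-aggregate loop can be illustrated with a deliberately trivial base estimator. In this toy sketch, which is my own simplification and not how BaggingRegressor is implemented, each base estimator simply predicts the mean of the bootstrap sample it was fit on, and the ensemble averages those predictions. The targets are made up.

```python
import random
from statistics import fmean

rng = random.Random(0)             # fixed seed so the sketch is repeatable
targets = list(range(1, 101))      # made-up training targets: 1, 2, ..., 100


def bootstrap_sample(values):
    """Draw len(values) items with replacement."""
    return rng.choices(values, k=len(values))


# Toy base estimator: always predict the mean of the sample it was fit on
base_predictions = [fmean(bootstrap_sample(targets)) for _ in range(200)]

# Aggregate the individual results by averaging them
bagged_prediction = fmean(base_predictions)

print(round(min(base_predictions), 2), round(max(base_predictions), 2))
print(round(bagged_prediction, 2))
```

The individual predictions scatter around the true mean of 50.5, while their average lies much closer to it, which is the variance reduction that averaging methods rely on.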

### Randomized Trees
 - sklearn.ensemble offers two types of algorithms based on randomized trees: the Random Forest algorithm and the Extra-Trees (extremely randomized trees) method.

   - RandomForestClassifier and RandomForestRegressor classes are used to deal with random forests.
   - In random forests, each estimator is built from a sample drawn with replacement from the training set.


### Extremely Randomized Trees
 - ExtraTreesClassifier and ExtraTreesRegressor classes are used to deal with extremely randomized forests.

 - In extremely randomized forests, more randomness is introduced into the way splits are chosen, which usually reduces the variance of the model a bit further at the cost of a slight increase in bias.

### Boosting Methods
Boosting Methods combine several weak models to create an improved ensemble.

sklearn.ensemble also provides the following boosting algorithms:
 - AdaBoostClassifier
 - GradientBoostingClassifier

### Demo of Random Forest Classifier

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = datasets.load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(cancer.data, cancer.target,
                                                   stratify=cancer.target, random_state=42)

rf_classifier = RandomForestClassifier()
rf_classifier = rf_classifier.fit(X_train,Y_train)
print('Accuracy of Train Data :', rf_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', rf_classifier.score(X_test,Y_test))

Accuracy of Train Data : 1.0
Accuracy of Test Data : 0.958041958041958


## Lab 4

In [11]:
# Task 1
import numpy as np

from sklearn import datasets
from sklearn.model_selection import train_test_split
np.random.seed(100)

boston = datasets.load_boston()
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target,random_state=30)

print(X_train.shape)
print( X_test.shape)

# Task 2

from sklearn.ensemble import RandomForestRegressor
rf_reg  = RandomForestRegressor()
rf_reg = rf_reg.fit(X_train,Y_train)
print(rf_reg.score(X_train,Y_train))
print(rf_reg.score(X_test,Y_test))
print(rf_reg.predict(X_test[:2]))

# Task 3

best_acc = 0
best_r = 0
f = 0
best_e = [50,100,200]
for r in range(3,6):
    for e in best_e:
        rf_reg = RandomForestRegressor(max_depth=r, n_estimators=e)
        rf_reg = rf_reg.fit(X_train,Y_train)
        acc = rf_reg.score(X_test,Y_test)
        if acc > best_acc:
            best_acc = acc
            best_r = r
            f = e
# Print the best parameters found, not the last value of the loop variable r
print((best_r, f))
            


(379, 13)
(127, 13)
0.9805545439239387
0.88608530301534
[19.17   9.887]
(5, 50)


In [None]:
 - Build multiple Random Forest Regressor models on the X_train data and Y_train labels, with the max_depth parameter value changing from 3 to 5 and n_estimators set to one of the values 50, 100, 200.
 - Evaluate each model's score on the testing dataset (for a regressor, score returns the R^2 value rather than an accuracy). Hint: make use of a for loop.
 - Print the max_depth and n_estimators values of the model with the highest score. Use the boston data from sklearn.
 - Note: Print the parameter values in the form of a tuple (a, b), where a refers to the max_depth value and b refers to the n_estimators value.

In [10]:
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Load Boston dataset
boston = load_boston()

# Split the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, test_size=0.2, random_state=42)

# Define the max_depth and n_estimators values to iterate over
max_depths = [3, 4, 5]
n_estimators_values = [50, 100, 200]

# Initialize variables to store the best accuracy and the corresponding max_depth and n_estimators values
best_accuracy = 0
best_max_depth = 0
best_n_estimators = 0

# Loop over the max_depth and n_estimators values to build and evaluate the Random Forest Regressor models
for max_depth in max_depths:
    for n_estimators in n_estimators_values:
        
        # Build the Random Forest Regressor model with the current max_depth and n_estimators values
        rfr = RandomForestRegressor(max_depth=max_depth, n_estimators=n_estimators, random_state=42)
        rfr.fit(X_train, Y_train)
        
        # Evaluate the accuracy of the model on the testing dataset
        accuracy = rfr.score(X_test, Y_test)
        print(f"Accuracy for max_depth={max_depth}, n_estimators={n_estimators}: {accuracy}")
        
        # Update the best_accuracy and corresponding max_depth and n_estimators values if the current model has higher accuracy
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_max_depth = max_depth
            best_n_estimators = n_estimators

# Print the max_depth and n_estimators values of the model with highest accuracy
print(f"Best accuracy: {best_accuracy} with max_depth={best_max_depth}, n_estimators={best_n_estimators}")
# Print the max_depth and n_estimators values of the model with highest accuracy as a tuple
print(f"Best accuracy: {best_accuracy} with (max_depth={best_max_depth}, n_estimators={best_n_estimators})")




    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np

        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_ho

Accuracy for max_depth=3, n_estimators=50: 0.8298212292045748
Accuracy for max_depth=3, n_estimators=100: 0.8269148884724211
Accuracy for max_depth=3, n_estimators=200: 0.8170980159281125
Accuracy for max_depth=4, n_estimators=50: 0.8674612615279816
Accuracy for max_depth=4, n_estimators=100: 0.8601555677593314
Accuracy for max_depth=4, n_estimators=200: 0.8541638877397426
Accuracy for max_depth=5, n_estimators=50: 0.8852805002144195
Accuracy for max_depth=5, n_estimators=100: 0.8785132418642353
Accuracy for max_depth=5, n_estimators=200: 0.8704427213759016
Best accuracy: 0.8852805002144195 with max_depth=5, n_estimators=50
Best accuracy: 0.8852805002144195 with (max_depth=5, n_estimators=50)


## Understanding SVM

Support Vector Machines (SVMs) separate data points using decision planes, which separate objects belonging to different classes in a higher dimensional space.

 - An SVM relies on a suitable kernel function, chosen so that the data points can be separated into two or more classes.
 - Commonly used kernels are:

   - linear
   - polynomial
   - rbf
   - sigmoid
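
The effect of the kernel choice can be sketched as follows. This is only an illustration: the `make_moons` toy data, the split, and the variable names are assumptions made for the example, and the scores will vary with the data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data that is not linearly separable (illustrative choice)
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Fit one SVC per kernel and record its accuracy on the held-out data
kernel_scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel).fit(X_tr, y_tr)
    kernel_scores[kernel] = clf.score(X_te, y_te)
    print(kernel, round(kernel_scores[kernel], 3))
```

In scikit-learn the polynomial kernel is selected with `kernel="poly"`.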

### Support Vector Classification

scikit-learn provides the following three utilities for performing Support Vector Classification.

 - SVC
 - NuSVC: Same as SVC, but uses the nu parameter to control the number of support vectors.
 - LinearSVC: Similar to SVC with kernel='linear', but implemented with liblinear, so it scales better to a large number of samples.
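
A minimal sketch of the three classifiers side by side, assuming the iris data set and a linear kernel for the two libsvm-based models so that the comparison is like for like. The parameter values are illustrative, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC, NuSVC, LinearSVC

X, y = load_iris(return_X_y=True)

# The three classification utilities share the same fit/score interface
classifiers = {
    "SVC": SVC(kernel="linear"),
    "NuSVC": NuSVC(nu=0.5, kernel="linear"),
    "LinearSVC": LinearSVC(max_iter=10000),
}

train_accuracy = {}
for name, clf in classifiers.items():
    clf.fit(X, y)
    train_accuracy[name] = clf.score(X, y)  # accuracy on the training data
    print(name, round(train_accuracy[name], 3))
```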

### Support Vector Regression
scikit-learn provides the following three utilities for performing Support Vector Regression.

 - SVR
 - NuSVR
 - LinearSVR
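
The regression utilities follow the same pattern. The sketch below assumes a noisy sine curve as toy data; for regressors, `score` returns the R^2 coefficient rather than an accuracy.

```python
import numpy as np
from sklearn.svm import SVR, NuSVR, LinearSVR

# Toy 1-D regression problem: a sine curve with Gaussian noise (illustrative)
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

regressors = [
    ("SVR", SVR()),                          # rbf kernel by default
    ("NuSVR", NuSVR()),                      # rbf kernel by default
    ("LinearSVR", LinearSVR(max_iter=10000)),  # linear model only
]

svr_scores = {}
for name, model in regressors:
    svr_scores[name] = model.fit(X, y).score(X, y)  # R^2 on the training data
    print(name, round(svr_scores[name], 3))
```

LinearSVR can only fit a straight line, so on curved data like this it is expected to score lower than the kernel-based models.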

### Advantages of SVMs
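 - SVMs can separate classes effectively in a higher dimensional space.

 - SVMs are memory efficient, because the decision function uses only a subset of the training points (the support vectors).

 - SVMs are versatile: different kernel functions can be specified for the decision function.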

### Disadvantages of SVMs
 - SVMs do not scale well to large data sets: training a kernel SVM becomes slow as the number of samples grows.

 - SVMs are sensitive to feature scales and usually work well only with preprocessed (scaled) data.

 - They are harder to visualize.

### Demo of Support Vector Classification

In [13]:
from sklearn.svm import SVC
from sklearn import datasets
from sklearn.model_selection import train_test_split


cancer = datasets.load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(cancer.data, cancer.target,
                                                   stratify=cancer.target, random_state=42)

svm_classifier = SVC()

svm_classifier = svm_classifier.fit(X_train, Y_train) 

print('Accuracy of Train Data :', svm_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', svm_classifier.score(X_test,Y_test))

Accuracy of Train Data : 0.9178403755868545
Accuracy of Test Data : 0.9230769230769231


### Improving Accuracy Using Scaled Data

In [14]:
import sklearn.preprocessing as preprocessing

standardizer = preprocessing.StandardScaler()
standardizer = standardizer.fit(cancer.data)
cancer_standardized = standardizer.transform(cancer.data)

svm_classifier = SVC()
svm_classifier = svm_classifier.fit(X_train,Y_train)
print('Accuracy of Train Data :', svm_classifier.score(X_train,Y_train))
print('Accuracy of Test Data :', svm_classifier.score(X_test,Y_test))

Accuracy of Train Data : 0.9178403755868545
Accuracy of Test Data : 0.9230769230769231
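
The cell above standardizes `cancer.data` but still fits the classifier on the unscaled `X_train`, which is why the printed accuracies are identical to those of the previous demo. A minimal sketch of training on standardized features is given below. It assumes a `Pipeline`, so that the scaler is fitted on the training split only, and the name `scaled_svm` is introduced purely for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = datasets.load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# The pipeline standardizes the features, then fits the SVC on the scaled values;
# the same scaling is applied automatically when scoring or predicting
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_train, Y_train)

scaled_train_acc = scaled_svm.score(X_train, Y_train)
scaled_test_acc = scaled_svm.score(X_test, Y_test)
print('Accuracy of Train Data :', scaled_train_acc)
print('Accuracy of Test Data :', scaled_test_acc)
```

Fitting the scaler inside the pipeline also avoids leaking statistics of the test split into training.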


### Viewing the Classification Report

In [15]:
from sklearn import metrics

Y_pred = svm_classifier.predict(X_test)
print("Classification report : \n", metrics.classification_report(Y_test, Y_pred))

Classification report : 
               precision    recall  f1-score   support

           0       0.96      0.83      0.89        53
           1       0.91      0.98      0.94        90

    accuracy                           0.92       143
   macro avg       0.93      0.90      0.92       143
weighted avg       0.93      0.92      0.92       143



## Lab 5

In [21]:
# task 1

from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

X_train, X_test, Y_train, Y_test = train_test_split(digits.data, digits.target,
                                                   stratify=digits.target, random_state=30)

print(X_train.shape)
print(X_test.shape)

# task 2

from sklearn.svm import SVC
svm_clf = SVC()
svm_clf = svm_clf.fit(X_train,Y_train)
print(svm_clf.score(X_test,Y_test))

# task 3

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
digits_standardized = scaler.fit_transform(digits.data)

X_train, X_test, Y_train, Y_test = train_test_split(digits_standardized, digits.target, test_size=0.2, random_state=30, stratify=digits.target)

svm_clf2 = SVC()
svm_clf2 = svm_clf2.fit(X_train,Y_train)
print(svm_clf2.score(X_test,Y_test))


(1347, 64)
(450, 64)
0.9822222222222222
0.975


## Introduction to Clustering

 - Clustering is one of the unsupervised learning techniques.

 - The technique is typically used to group data points into clusters, so that points in the same cluster are more similar to each other than to points in other clusters.

 - Major clustering algorithms that can be implemented using scikit-learn are:

  - K-means Clustering
  - Agglomerative clustering
  - DBSCAN clustering
  - Mean-shift clustering
  - Affinity propagation
  - Spectral clustering

### K-Means Clustering
In K-means clustering, the entire data set is grouped into k clusters.

Steps involved are:

 - k centroids are chosen randomly.
 - The distance of each data point from the k centroids is calculated, and each data point is assigned to the cluster of its nearest centroid.
 - The centroids of the k clusters are recomputed.
 - The assignment and update steps are repeated until the cluster assignments stop changing, i.e. until the algorithm converges.
 - KMeans from sklearn.cluster can be used for K-means clustering.
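
The steps above can be sketched in plain NumPy. This is an illustrative sketch, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy data are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative K-means loop; returns (labels, centroids)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Step 2: distance of every point to every centroid, assign to the nearest
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop once the assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy data: two Gaussian blobs (illustrative)
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                    rng.normal(5, 0.5, size=(50, 2))])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(centroids_demo)
```

With purely random initial centroids the loop can settle in a poor local optimum, which is one reason `KMeans` in scikit-learn defaults to the `k-means++` initialization.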

### Agglomerative Hierarchical Clustering
Agglomerative Hierarchical Clustering is a bottom-up approach.

Steps involved are:

 - Each data point is treated as a single cluster at the beginning.

 - The distance between each pair of clusters is computed, and the two nearest clusters are merged together.

 - The above step is iterated till a single cluster is formed.

 - AgglomerativeClustering from sklearn.cluster can be used for achieving this.

 - The criterion for merging two clusters is set by the linkage type, which can be any of the following: ward, complete, or average.
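
A short sketch comparing the linkage types, assuming the iris data set and three clusters. With `n_clusters` set, merging stops once that many clusters remain instead of continuing down to a single cluster.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Run the same bottom-up clustering with each linkage criterion
linkage_labels = {}
for linkage in ["ward", "complete", "average"]:
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit(X)
    linkage_labels[linkage] = model.labels_
    # Cluster sizes show how the merge criterion changes the grouping
    print(linkage, np.bincount(model.labels_))
```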

### Mean Shift Clustering

Mean Shift Clustering aims at discovering dense areas.

Steps Involved:

 - Identify blob areas with randomly guessed centroids.
 - Calculate the centroid of each blob area and shift to a new one, if there is a difference.
 - Repeat the above step till the centroids converge.
 - make_blobs from sklearn.datasets can be used to generate sample blob data. MeanShift from sklearn.cluster can be used to perform Mean Shift clustering.
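
A minimal sketch of Mean Shift on synthetic blobs. The blob centers, the `quantile` value, and the use of `estimate_bandwidth` to pick the bandwidth are assumptions made for the example; note that `make_blobs` is imported from `sklearn.datasets`.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Three dense areas to discover (illustrative centers)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.5, random_state=0)

# The bandwidth sets the size of the region used to compute each shifted mean
bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=0)
ms = MeanShift(bandwidth=bandwidth).fit(X)

n_found = len(ms.cluster_centers_)
print("Clusters found:", n_found)
print(ms.cluster_centers_)
```

Unlike K-means, the number of clusters is not passed in; it follows from the bandwidth.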

### Affinity Propagation
Affinity Propagation generates clusters by passing messages between pairs of data points, until convergence.

The AffinityPropagation class from sklearn.cluster can be used.

The class can be controlled with two major parameters:

 - preference: It controls the number of exemplars to be chosen by the algorithm.
 - damping: It controls numerical oscillations while updating messages.
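
The two parameters can be tried out as sketched below. The toy blobs and the specific `preference` and `damping` values are assumptions for illustration; a lower preference typically yields fewer exemplars, but the exact counts depend on the data and on whether the algorithm converges.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.5, random_state=0)

n_exemplars = {}
for preference in [-50, -5]:
    # damping must lie in [0.5, 1.0); higher values update messages more cautiously
    ap = AffinityPropagation(damping=0.7, preference=preference,
                             random_state=0).fit(X)
    n_exemplars[preference] = len(ap.cluster_centers_indices_)
    print("preference =", preference, "-> exemplars:", n_exemplars[preference])
```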

### Spectral Clustering
Spectral Clustering is ideal for clustering data that is connected but may not lie in a compact space.

In general, the following steps are followed:

 - Build an affinity matrix of data points.
 - Embed data points in a lower dimensional space.
 - Use a clustering method like k-means to partition the points in the lower dimensional space.

The SpectralClustering class, or the spectral_clustering function, from sklearn.cluster can be used for achieving this.
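
A minimal sketch using the `SpectralClustering` class on two interleaved half-moons, a shape that is connected but not compact. The data set and parameter values are assumptions chosen for illustration.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

sc = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",  # affinity matrix built from a k-nearest-neighbor graph
    n_neighbors=10,
    assign_labels="kmeans",        # k-means on the low-dimensional embedding
    random_state=0,
)
spectral_labels = sc.fit_predict(X)
print(spectral_labels[:20])
```

scikit-learn may warn that the neighbor graph is not fully connected for data like this; the labels are still returned.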

### Demo of KMeans



In [None]:
from sklearn.cluster import KMeans

kmeans_cluster = KMeans(n_clusters=2)

kmeans_cluster = kmeans_cluster.fit(X_train) 

kmeans_cluster.predict(X_test)

### Evaluating a Clustering algorithm
A clustering algorithm is mainly evaluated using the following scores:

 - Homogeneity: Evaluates if each cluster contains only members of a single class.

 - Completeness: Evaluates if all members of a given class are assigned to the same cluster.

 - V-measure: Harmonic mean of Homogeneity and Completeness.

 - Adjusted Rand index: Measures the similarity of two assignments, adjusted for chance.
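
The difference between the scores can be seen on hand-made label lists. The label values below are invented for illustration; the metric functions take the true labels first and the predicted cluster labels second.

```python
from sklearn import metrics

labels_true = [0, 0, 1, 1]
split_pred = [0, 0, 1, 2]    # class 1 is split across two clusters
merged_pred = [0, 0, 0, 0]   # both classes end up in one cluster

# Splitting a class keeps every cluster pure but breaks completeness
h_split = metrics.homogeneity_score(labels_true, split_pred)
c_split = metrics.completeness_score(labels_true, split_pred)

# Merging everything keeps each class together but mixes the classes
h_merged = metrics.homogeneity_score(labels_true, merged_pred)
c_merged = metrics.completeness_score(labels_true, merged_pred)

# V-measure combines both; the adjusted Rand index ignores how clusters are numbered
v_split = metrics.v_measure_score(labels_true, split_pred)
ari_renamed = metrics.adjusted_rand_score(labels_true, [1, 1, 0, 0])

print(h_split, c_split, v_split)
print(h_merged, c_merged)
print(ari_renamed)
```

Because homogeneity and completeness swap roles when the two arguments are swapped, the argument order matters for those two scores.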

### Evaluation with scikit-learn

In [None]:
from sklearn import metrics

# Predict the cluster labels once and reuse them for every score
cluster_pred = kmeans_cluster.predict(X_test)

# The metric functions expect (labels_true, labels_pred), in that order
print(metrics.homogeneity_score(Y_test, cluster_pred))

print(metrics.completeness_score(Y_test, cluster_pred))

print(metrics.v_measure_score(Y_test, cluster_pred))

print(metrics.adjusted_rand_score(Y_test, cluster_pred))

## Lab 6

In [7]:
# task 1

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score

# load iris dataset
iris = load_iris()

# cluster iris data set into 3 clusters using k-means with default parameters
km_cls = KMeans(n_clusters=3).fit(iris.data)

# determine the homogeneity score of the model and print it
score = homogeneity_score(iris.target, km_cls.labels_)
print(score)

# task 2

from sklearn.cluster import AgglomerativeClustering
agg_cls = AgglomerativeClustering(n_clusters=3).fit(iris.data)
score = homogeneity_score(iris.target, agg_cls.labels_)
print(score)

# task 3

from sklearn.cluster import AffinityPropagation
af_cls = AffinityPropagation().fit(iris.data)
score = homogeneity_score(iris.target, af_cls.labels_)
print(score)







0.7514854021988339
0.7608008469718723
0.9149410296693684


In [None]:
Import the three modules sklearn.datasets, sklearn.cluster, and sklearn.metrics.
Load the popular iris dataset from the sklearn.datasets module and assign it to the variable iris.
Cluster iris.data into 3 clusters using k-means with default parameters. Name the model km_cls.
Hint: Import the required utility from sklearn.cluster.
Determine the homogeneity score of the model and print it.
Hint: Import the required utility from sklearn.metrics.

In [1]:
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score

# load iris dataset
iris = load_iris()

# cluster iris data set into 3 clusters using k-means with default parameters
km_cls = KMeans(n_clusters=3).fit(iris.data)

# determine the homogeneity score of the model and print it
score = homogeneity_score(iris.target, km_cls.labels_)
print("Homogeneity score of the KMeans model: %.2f" % score)




Homogeneity score of the KMeans model: 0.75


In [10]:
from sklearn import datasets

iris = datasets.load_iris()
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [11]:
from sklearn import datasets

iris = datasets.load_iris()

type(iris)

sklearn.utils._bunch.Bunch