# Question 1 : Classification using Naive Bayes

Can glucose and blood pressure data be used to classify whether a patient has diabetes? If so, which classification algorithm should you use?

The dataset **diabetes_classification.csv** has 3 columns and 995 entries containing this data.


1. Load the dataset.

In [None]:
import pandas as pd

# Load the dataset
dataset = pd.read_csv("diabetes_classification.csv")

# Display the dataset
print(dataset.head())

2. The dataset has two feature columns and one target column. Plot a bar graph or histogram showing the distribution of values in the feature columns (count of each value).

In [None]:
import matplotlib.pyplot as plt

# Plot the distribution of the feature columns
plt.figure(figsize=(10, 5))

# Plot for feature column 1
plt.subplot(1, 2, 1)
plt.hist(dataset['glucose'])
plt.title('Distribution of Glucose')
plt.xlabel('Glucose')
plt.ylabel('Count')

# Plot for feature column 2
plt.subplot(1, 2, 2)
plt.hist(dataset['bloodpressure'])
plt.title('Distribution of Blood Pressure')
plt.xlabel('Blood Pressure')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

The feature column **glucose** has a roughly Gaussian distribution, so we will try Gaussian Naive Bayes classification on the data using Scikit-Learn.
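
To make the modelling assumption concrete, here is a minimal sketch of what Gaussian Naive Bayes computes: a per-class mean and variance for each feature, combined with the class prior. It runs on synthetic two-feature data (an assumption, not the diabetes dataset), and `gaussian_nb_predict` is a hypothetical helper written only for illustration; it ignores the small variance smoothing that Scikit-Learn applies.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic stand-in for two features; NOT the real dataset
X0 = rng.normal(loc=[40.0, 85.0], scale=3.0, size=(100, 2))
X1 = rng.normal(loc=[55.0, 70.0], scale=3.0, size=(100, 2))
X_demo = np.vstack([X0, X1])
y_demo = np.array([0] * 100 + [1] * 100)


def gaussian_nb_predict(X_fit, y_fit, X_new):
    """Hypothetical helper: predict by maximising log prior + Gaussian log-likelihood."""
    classes = np.unique(y_fit)
    log_posteriors = []
    for c in classes:
        Xc = X_fit[y_fit == c]
        mean, var = Xc.mean(axis=0), Xc.var(axis=0)
        log_prior = np.log(len(Xc) / len(X_fit))
        # Features are treated as independent, so their log-likelihoods are summed
        log_lik = -0.5 * np.sum(
            np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var, axis=1
        )
        log_posteriors.append(log_prior + log_lik)
    return classes[np.argmax(np.column_stack(log_posteriors), axis=1)]


manual_pred = gaussian_nb_predict(X_demo, y_demo, X_demo)
sklearn_pred = GaussianNB().fit(X_demo, y_demo).predict(X_demo)
agreement = np.mean(manual_pred == sklearn_pred)
print("Agreement with sklearn:", agreement)
```

If a feature is far from bell-shaped, these per-class Gaussians fit poorly, which is why the histogram check comes first.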

3. Split the dataset.
4. Fit a Gaussian NB model on the data. Make predictions and find the accuracy score.

Optional:
5. Compare the model with other classification algorithms such as Logistic Regression, KNN and Decision Tree.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score


# Split the dataset into features and target variable
X = dataset[['glucose', 'bloodpressure']]
y = dataset['diabetes']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a Gaussian NB model
model = GaussianNB()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy score: {:.2f}".format(accuracy))
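
For the optional comparison, one possible approach is to loop over several classifiers with the same train/test split. The sketch below uses synthetic two-feature data from `make_classification` as a stand-in (an assumption); the names `X_demo`, `candidates` and `comparison` are illustrative and would be replaced by the real features and target.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature data standing in for the real dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0, random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

candidates = {
    "Gaussian NB": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Fit every model on the same split and record its test accuracy
comparison = {}
for name, clf in candidates.items():
    clf.fit(Xtr, ytr)
    comparison[name] = accuracy_score(yte, clf.predict(Xte))

for name, acc in sorted(comparison.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.2f}")
```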

# Question 2 : Regression using SVM and Tree Algorithms

In this question, we will use the **insurance.csv** file, which contains information on insurance charges along with the following attributes: age, sex, bmi, region, number of children and whether the person is a smoker or not. You need to predict the charges based on the information given.

### 1. Load the data.

In [None]:
import pandas as pd

data = pd.read_csv('insurance.csv')

### 2. Separate the numerical and categorical columns.
### 3. Label Encode the categorical columns.
### 4. Scale the numerical columns. (Scale the charges separately so that you can calculate errors afterwards.)

In [None]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
numerical_cols = ['age', 'bmi', 'children']
categorical_cols = ['sex', 'region', 'smoker']
target_col = 'charges'

numerical_data = data[numerical_cols]
categorical_data = data[categorical_cols]
target_data = data[target_col]

label_encoder = LabelEncoder()

categorical_data_encoded = categorical_data.copy()
for col in categorical_cols:
    categorical_data_encoded[col] = label_encoder.fit_transform(categorical_data[col])

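# Use separate scalers so the target scaling can be inverted later
# to report errors in the original units of 'charges'
feature_scaler = StandardScaler()
target_scaler = StandardScaler()

scaled_numerical_data = feature_scaler.fit_transform(numerical_data)
scaled_target_data = target_scaler.fit_transform(target_data.values.reshape(-1, 1))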
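
Scaling the charges separately matters because a scaler fitted only on the target can map predictions back to the original units. A minimal sketch, assuming a few made-up charge values and a hypothetical `charges_scaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical charges, for illustration only
charges_demo = np.array([1000.0, 5000.0, 12000.0, 30000.0]).reshape(-1, 1)

# A scaler dedicated to the target column
charges_scaler = StandardScaler()
charges_scaled = charges_scaler.fit_transform(charges_demo)

# Pretend these are model predictions in scaled space (each off by 0.1)
pred_scaled = charges_scaled + 0.1

# Map predictions back to the original units before computing the error
pred_original = charges_scaler.inverse_transform(pred_scaled)
mae_original = np.mean(np.abs(pred_original - charges_demo))
print("MAE in original units:", mae_original)
```

An error of 0.1 in scaled space corresponds to 0.1 standard deviations of the charges, which is what the printed value reflects.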


### 5. Split the data.

In [None]:
from sklearn.model_selection import train_test_split

# Combine the scaled numerical data and encoded categorical data into a single DataFrame
X = pd.concat([pd.DataFrame(scaled_numerical_data, columns=numerical_cols), categorical_data_encoded], axis=1)
y = scaled_target_data

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


### 6. Support Vector Regressor

Here, you will use the SVR model from sklearn.svm and fit it on the training data. Then predict on the test data and calculate the MAE and MSE. But...

The SVR class has many hyperparameters. For example, kernel can take the values linear, rbf, poly or sigmoid.

Use **RandomizedSearchCV** from sklearn.model_selection. Create a dictionary with the keys 'kernel' and 'gamma', each mapped to a list of candidate values. Run a 3-fold cross-validation (cv=3) and find the best parameters. Then instantiate the SVR model with those parameters.

In [None]:
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Define the hyperparameter grid
param_grid = {
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
    'gamma': [0.1, 0.01, 0.001, 0.0001]
}

# Create an instance of the SVR model
svr = SVR()

# Perform RandomizedSearchCV (random_state fixed so the sampled candidates are reproducible)
random_search = RandomizedSearchCV(svr, param_distributions=param_grid, cv=3, random_state=42)
random_search.fit(X_train, y_train.ravel())

# Get the best parameters found during the search
best_params = random_search.best_params_

# Initiate the SVR model with the best parameters
svr_best = SVR(kernel=best_params['kernel'], gamma=best_params['gamma'])
svr_best.fit(X_train, y_train.ravel())

# Make predictions on the test set
y_pred = svr_best.predict(X_test)

# Calculate MAE and MSE
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

# Print the best parameters and evaluation metrics
print("Best parameters:", best_params)
print("MAE:", mae)
print("MSE:", mse)
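
As a side note, with the default `refit=True` the search object already refits the best configuration on the full training data, so `best_estimator_` can be used directly instead of rebuilding the model by hand. A minimal sketch on synthetic regression data (an assumption; `X_demo`, `y_demo` and `search` are illustrative names):

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(42)

# Synthetic regression data, for illustration only
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

search = RandomizedSearchCV(
    SVR(),
    param_distributions={"kernel": ["linear", "rbf"], "gamma": [0.1, 0.01]},
    n_iter=4,          # the grid has only 4 combinations
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

# best_estimator_ is already refit on all of X_demo with the best parameters
best_svr = search.best_estimator_
demo_pred = best_svr.predict(X_demo[:5])
print("Best parameters:", search.best_params_)
```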


### 7. AdaBoost Regressor

We will do the same for AdaBoostRegressor from sklearn.ensemble. Here, the hyperparameters are n_estimators and loss.

Instead of RandomizedSearchCV, let's try GridSearchCV. Find the best parameters, then compute the errors on the test data using the model with those parameters.

In [None]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'loss': ['linear', 'square', 'exponential']
}

# Create an instance of the AdaBoostRegressor model
adaboost = AdaBoostRegressor()

# Perform GridSearchCV
grid_search = GridSearchCV(adaboost, param_grid, cv=3)
grid_search.fit(X_train, y_train.ravel())

# Get the best parameters found during the search
best_params = grid_search.best_params_

# Initiate the AdaBoostRegressor model with the best parameters
adaboost_best = AdaBoostRegressor(n_estimators=best_params['n_estimators'], loss=best_params['loss'])
adaboost_best.fit(X_train, y_train.ravel())

# Make predictions on the test set
y_pred = adaboost_best.predict(X_test)

# Calculate MAE and MSE
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

# Print the best parameters and evaluation metrics
print("Best parameters:", best_params)
print("MAE:", mae)
print("MSE:", mse)


8. Now carry out the same procedure for Random Forest Regressor and Gradient Boosting Regressor.
9. Finally, use <a href="https://xgboost.readthedocs.io/en/stable/get_started.html">XGBoost Regressor</a> and compare all the models. Comment on which model had the lowest error (MAE and MSE).
You will need to run <code>!pip install xgboost</code> to import xgboost models.

In [None]:
!pip install xgboost
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid for Random Forest Regressor
param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

# Create an instance of the Random Forest Regressor model
rf = RandomForestRegressor()

# Perform GridSearchCV for Random Forest Regressor
grid_search_rf = GridSearchCV(rf, param_grid_rf, cv=3)
grid_search_rf.fit(X_train, y_train.ravel())

# Get the best parameters found during the search
best_params_rf = grid_search_rf.best_params_

# Initiate the Random Forest Regressor model with the best parameters
rf_best = RandomForestRegressor(n_estimators=best_params_rf['n_estimators'],
                                max_depth=best_params_rf['max_depth'],
                                min_samples_split=best_params_rf['min_samples_split'])
rf_best.fit(X_train, y_train.ravel())

# Make predictions on the test set for Random Forest Regressor
y_pred_rf = rf_best.predict(X_test)

# Define the hyperparameter grid for Gradient Boosting Regression
param_grid_gb = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 1],
    'max_depth': [3, 5, 7]
}

# Create an instance of the Gradient Boosting Regression model
gb = GradientBoostingRegressor()

# Perform GridSearchCV for Gradient Boosting Regression
grid_search_gb = GridSearchCV(gb, param_grid_gb, cv=3)
grid_search_gb.fit(X_train, y_train.ravel())

# Get the best parameters found during the search
best_params_gb = grid_search_gb.best_params_

# Initiate the Gradient Boosting Regression model with the best parameters
gb_best = GradientBoostingRegressor(n_estimators=best_params_gb['n_estimators'],
                                    learning_rate=best_params_gb['learning_rate'],
                                    max_depth=best_params_gb['max_depth'])
gb_best.fit(X_train, y_train.ravel())

# Make predictions on the test set for Gradient Boosting Regression
y_pred_gb = gb_best.predict(X_test)

# Create an instance of the XGBoost Regressor model
xgb = XGBRegressor()

# Perform GridSearchCV for XGBoost Regressor
# (reuses the Gradient Boosting grid, since the parameter names are shared)
grid_search_xgb = GridSearchCV(xgb, param_grid_gb, cv=3)
grid_search_xgb.fit(X_train, y_train.ravel())

# Get the best parameters found during the search
best_params_xgb = grid_search_xgb.best_params_

# Initiate the XGBoost Regressor model with the best parameters
xgb_best = XGBRegressor(n_estimators=best_params_xgb['n_estimators'],
                        learning_rate=best_params_xgb['learning_rate'],
                        max_depth=best_params_xgb['max_depth'])
xgb_best.fit(X_train, y_train.ravel())

# Make predictions on the test set for XGBoost Regressor
y_pred_xgb = xgb_best.predict(X_test)


# Calculate MAE and MSE for Random Forest Regressor
mae_rf = mean_absolute_error(y_test, y_pred_rf)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Calculate MAE and MSE for Gradient Boosting Regression
mae_gb = mean_absolute_error(y_test, y_pred_gb)
mse_gb = mean_squared_error(y_test, y_pred_gb)

# Calculate MAE and MSE for XGBoost Regressor
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)

# Print the evaluation metrics for all models
print("Random Forest Regressor:")
print("MAE:", mae_rf)
print("MSE:", mse_rf)
print()

print("Gradient Boosting Regression:")
print("MAE:", mae_gb)
print("MSE:", mse_gb)
print()

print("XGBoost Regressor:")
print("MAE:", mae_xgb)
print("MSE:", mse_xgb)
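
To compare all the models at a glance, the metrics can be collected into one table and sorted. The sketch below uses tiny made-up predictions (`y_true_demo` and `predictions_demo` are hypothetical) rather than the real model outputs.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and predictions, for illustration only
y_true_demo = [1.0, 2.0, 3.0, 4.0]
predictions_demo = {
    "Model A": [1.0, 2.0, 3.0, 5.0],
    "Model B": [2.0, 3.0, 4.0, 5.0],
}

# One row per model, one column per metric
rows = {
    name: {
        "MAE": mean_absolute_error(y_true_demo, pred),
        "MSE": mean_squared_error(y_true_demo, pred),
    }
    for name, pred in predictions_demo.items()
}

# Lowest error first
results = pd.DataFrame.from_dict(rows, orient="index").sort_values("MAE")
print(results)
```

In the notebook, the dictionary would instead map each model name to its test-set predictions, for example `y_pred_rf`, `y_pred_gb` and `y_pred_xgb`.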

# Question 3 : Classification using SVM and Tree Algorithms

In this question, we will use the **bookmyshow_ads.csv** file, which contains information on whether a URL is spam or not based on 32 features. You need to classify each URL as spam or not spam based on the information given.

### 1. Load the data.

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv("bookmyshow_ads.csv")

# Display the first few rows of the dataset
print(df.head())

### 2. Split the data.

In [None]:
from sklearn.model_selection import train_test_split
X = df.drop("label", axis=1)
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### 3. Model Comparison

Similar to the previous question, use the following classifier models from sklearn and compare them:
1. Decision Tree
2. Random Forest
3. Adaboost
4. Gradient Boost
5. XGBoost

For each model, you may also try to find the best hyperparameters using GridSearch Cross Validation or RandomizedSearch Cross Validation.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Create the classifier models
models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "Gradient Boost": GradientBoostingClassifier(),
    "XGBoost": XGBClassifier()
}

# Define the hyperparameter grids for GridSearchCV
param_grids = {
    "Decision Tree": {
        "max_depth": [None, 5, 10],
        "min_samples_split": [2, 5, 10]
    },
    "Random Forest": {
        "n_estimators": [100, 200, 300],
        "max_depth": [None, 5, 10],
        "min_samples_split": [2, 5, 10]
    },
    "AdaBoost": {
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.1, 1, 10]
    },
    "Gradient Boost": {
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.1, 1, 10],
        "max_depth": [3, 5, 7]
    },
    "XGBoost": {
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.1, 1, 10],
        "max_depth": [3, 5, 7]
    }
}

# Perform GridSearchCV and evaluate models
for model_name, model in models.items():
    # Perform GridSearchCV with error_score='raise'
    grid_search = GridSearchCV(model, param_grids[model_name], cv=3, error_score='raise')
    try:
        grid_search.fit(X_train, y_train)
    except Exception as e:
        print(f"Error occurred during grid search for {model_name}:")
        print(e)
        continue
    
    # Get the best parameters found during the search
    best_params = grid_search.best_params_
    
    # Initiate the model with the best parameters
    model_best = model.set_params(**best_params)
    
    # Fit the model on the training data
    model_best.fit(X_train, y_train)
    
    # Make predictions on the test set
    y_pred = model_best.predict(X_test)
    
    # Calculate and print the accuracy score
    accuracy = accuracy_score(y_test, y_pred)
    print(model_name)
    print("Best Parameters:", best_params)
    print("Accuracy:", accuracy)
    print()
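
Accuracy alone can hide how a classifier treats the spam class, so it may help to look at the confusion matrix, precision and recall as well. A minimal sketch with made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical, with 1 standing for spam):

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = spam, 0 = not spam
y_true_demo = [0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)
prec = precision_score(y_true_demo, y_pred_demo)
rec = recall_score(y_true_demo, y_pred_demo)

print(cm)
print("Accuracy:", acc, "Precision:", prec, "Recall:", rec)
```

Here the accuracy is 5/6, yet only half of the spam examples are caught, which the recall makes visible.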


# Question 4 : Clustering

Customer Segmentation is the subdivision of a market into discrete customer groups that share similar characteristics. Customer Segmentation can be a powerful means to identify unsatisfied customer needs.

The csv file **segmentation data.csv** contains basic data about some customers like Customer ID, age, gender, annual income and spending score. You want to group the customers into different segments so that marketing strategy can be planned accordingly. How many different groups should be made? What should the approach be?

This is an Unsupervised Learning problem, since the labels (the groups) are not provided.

### 1. Import the necessary modules

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans


### 2. Read the csv file "segmentation data.csv" present in the Github repository as a Pandas DataFrame.

In [None]:
# Load the dataset
df = pd.read_csv("segmentation data.csv")

# Display the first few rows of the dataset
print(df.head())

### 3. Do the necessary preprocessing of the data.

> Drop unwanted columns.

> Check for null values.

> Scale the numerical columns.

> Additionally, you may convert the Age column to categorical values. How? Apply a function that bins ages into groups and maps every age in a group to a particular number!

Note: Don't do everything in a single code block! Do it step by step and show the output for each step.

In [None]:
# Drop the 'ID' column as it is not relevant for clustering
df = df.drop('ID', axis=1)
print(df.head())

# Check for null values in the DataFrame
print(df.isnull().sum())

from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Select the numerical columns to be scaled
numerical_columns = ['Income', 'Settlement size']

# Scale the numerical columns
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])
print(df.head())

# Define the age groups and corresponding numerical values
age_groups = {
    '18-30': 1,
    '31-40': 2,
    '41-50': 3,
    '51-60': 4,
    '61+': 5
}

# Function to assign the numerical value based on age group
def assign_age_group(age):
    if age <= 30:
        return age_groups['18-30']
    elif age <= 40:
        return age_groups['31-40']
    elif age <= 50:
        return age_groups['41-50']
    elif age <= 60:
        return age_groups['51-60']
    else:
        return age_groups['61+']

# Apply the age group function to the 'Age' column
df['Age'] = df['Age'].apply(assign_age_group)
print(df.head())
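
The same age binning can also be sketched with `pd.cut`, which avoids the chain of `if`/`elif` branches. The ages below are made up for illustration, and `ages_demo` and `age_codes` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, for illustration only
ages_demo = pd.Series([18, 30, 31, 45, 60, 61])

# Right-inclusive bins: (0, 30], (30, 40], (40, 50], (50, 60], (60, inf)
bins = [0, 30, 40, 50, 60, np.inf]
age_codes = pd.cut(ages_demo, bins=bins, labels=[1, 2, 3, 4, 5]).astype(int)

print(age_codes.tolist())
```

Because the bins are right-inclusive by default, an age of 30 falls in group 1 and 31 in group 2, matching the thresholds used in `assign_age_group`.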


### 4. KMeans Model Training - Scikit-Learn

First, let's implement KMeans Clustering using sklearn.cluster.KMeans.

How do we decide the value of 'K'?

Read the following blog. It describes different ways of evaluating clustering algorithms.

https://www.analyticsvidhya.com/blog/2021/05/k-mean-getting-the-optimal-number-of-clusters

We will look at two methods: the Elbow Method and Silhouette Analysis.

**Make a list of values for K, ranging from 2 to 10. For each K, fit a model and calculate the inertia and silhouette score. Plot them. Decide which value of K is optimal!**

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Define the range of values for K
k_values = range(2, 11)

# Initialize lists to store inertia and silhouette scores for each K
inertia_scores = []
silhouette_scores = []

# Iterate over each value of K
for k in k_values:
    # Initialize and fit the K-means model
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(df)
    
    # Calculate the inertia and silhouette scores
    inertia = kmeans.inertia_
    silhouette = silhouette_score(df, kmeans.labels_)
    
    # Append the scores to the respective lists
    inertia_scores.append(inertia)
    silhouette_scores.append(silhouette)

# Plot the inertia scores
plt.figure(figsize=(10, 5))
plt.plot(k_values, inertia_scores, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method - Inertia vs. Number of Clusters')
plt.show()

# Plot the silhouette scores
plt.figure(figsize=(10, 5))
plt.plot(k_values, silhouette_scores, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis - Score vs. Number of Clusters')
plt.show()
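
Besides reading the plots, the silhouette criterion can be applied programmatically by picking the K with the highest score. The sketch below does this on synthetic blobs with four well-separated centres (an assumption chosen so the answer is unambiguous); `X_demo`, `scores` and `best_k` are illustrative names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic clusters, for illustration only
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X_demo, _ = make_blobs(
    n_samples=400, centers=centers, cluster_std=0.5, random_state=42
)

# Silhouette score for each candidate K
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Higher silhouette means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print("Best K by silhouette:", best_k)
```

On real data the maximum is often less clear-cut, so it is worth checking it against the elbow in the inertia plot.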

### 5. KMeans Model Prediction

Once you have decided on the optimal K, fit a model again with that K value and store the silhouette score and the labels for the entire data.

It is observed that the optimal value of K is 4. So, let's store the silhouette score and labels for K=4.

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Fit the K-means model with K=4
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(df)

# Calculate the silhouette score for the entire data
silhouette = silhouette_score(df, kmeans.labels_)

# Store the silhouette score and labels
silhouette_score_4 = silhouette
labels_4 = kmeans.labels_

# Print the silhouette score for K=4
print("Silhouette Score for K=4:", silhouette_score_4)

### 6. KMeans Model Training - Scratch

Now, code the KMeans model from scratch. Train it on the data and try to find the run whose labels agree most closely with the labels of the Scikit-Learn model.

In [None]:
import numpy as np

class KMeansScratch:
    def __init__(self, n_clusters, max_iter=100):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
    
    def fit(self, X):
        # Start from randomly chosen data points; cast to float so means are not truncated
        self.centroids = X[np.random.choice(X.shape[0], size=self.n_clusters, replace=False)].astype(float)
        
        for _ in range(self.max_iter):
            # Assign points to the nearest centroid
            labels = self._assign_labels(X)
            
            # Update centroids based on assigned points
            self._update_centroids(X, labels)
        
        self.labels = labels
    
    def _assign_labels(self, X):
        distances = np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2)
        labels = np.argmin(distances, axis=1)
        return labels
    
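    def _update_centroids(self, X, labels):
        for i in range(self.n_clusters):
            members = X[labels == i]
            # Keep the previous centroid if a cluster is empty (its mean would be NaN)
            if len(members) > 0:
                self.centroids[i] = members.mean(axis=0)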
    
    def predict(self, X):
        distances = np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2)
        labels = np.argmin(distances, axis=1)
        return labels


# Initialize and fit the K-means model from scratch
kmeans_scratch = KMeansScratch(n_clusters=4)
kmeans_scratch.fit(df.values)

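# Cluster ids are arbitrary, so try every relabelling of the scratch clusters
# and keep the one that agrees most with the sklearn labels
from itertools import permutations

accuracy = max(
    np.mean(np.array(perm)[kmeans_scratch.labels] == labels_4)
    for perm in permutations(range(4))
)
print("Accuracy of K-means Scratch Model:", accuracy)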
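
When comparing two clusterings, keep in mind that cluster ids are arbitrary: the same grouping can be numbered differently by two models. A permutation-invariant measure such as the adjusted Rand index sidesteps this. A minimal sketch with made-up labels (`labels_a` and `labels_b` are hypothetical):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Identical groupings, but with the cluster ids shuffled
labels_a = np.array([0, 0, 1, 1, 2, 2])
labels_b = np.array([2, 2, 0, 0, 1, 1])

# Element-wise comparison is misled by the different numbering
naive_agreement = np.mean(labels_a == labels_b)

# The adjusted Rand index only looks at which points are grouped together
ari = adjusted_rand_score(labels_a, labels_b)

print("Element-wise agreement:", naive_agreement)
print("Adjusted Rand index:", ari)
```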

### 7. DBSCAN model training - Scikit-Learn

Using sklearn.cluster.DBSCAN, fit a model on the data.

Here we will tune two hyperparameters: epsilon and the minimum number of samples.

Make two lists, one with some probable values for epsilon and the other with probable values for min_samples.

Example: eps = [0.1, 0.2, 0.5, 1, 2], min_samples = [3, 4, 5, 6]

Run a nested loop. For each value of eps and min_samples, fit a DBSCAN model on the data and calculate the silhouette score. Find the parameters for which the silhouette score is maximum.

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Define the lists of probable values for epsilon and min_samples
eps_values = [0.1, 0.2, 0.5, 1, 2]
min_samples_values = [2, 3, 4, 5, 6]

best_eps = None
best_min_samples = None
max_silhouette_score = -1

# Iterate over each value of eps and min_samples
for eps in eps_values:
    for min_samples in min_samples_values:
        # Initialize and fit the DBSCAN model
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        dbscan.fit(df)
        
        # Check if more than one label is generated
        unique_labels = len(set(dbscan.labels_))
        if unique_labels > 1:
            # Calculate the silhouette score
            silhouette = silhouette_score(df, dbscan.labels_)
            
            # Update the best parameters if the silhouette score is higher
            if silhouette > max_silhouette_score:
                max_silhouette_score = silhouette
                best_eps = eps
                best_min_samples = min_samples

# Print the best parameters and the corresponding silhouette score
print("Best Parameters: eps =", best_eps, ", min_samples =", best_min_samples)
print("Max Silhouette Score:", max_silhouette_score)
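
One detail worth keeping in mind: DBSCAN labels noise points as -1, and `silhouette_score` treats that label like any other cluster. A possible variation is to score only the points that were assigned to a cluster. The sketch below assumes synthetic blobs plus two far-away outliers; all names are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three tight synthetic clusters plus two isolated outliers (illustration only)
centers = [[0, 0], [10, 0], [0, 10]]
X_blobs, _ = make_blobs(
    n_samples=150, centers=centers, cluster_std=0.3, random_state=42
)
outliers = np.array([[50.0, 50.0], [-50.0, -50.0]])
X_demo = np.vstack([X_blobs, outliers])

db = DBSCAN(eps=1.0, min_samples=5).fit(X_demo)

# Noise points carry the label -1
noise_mask = db.labels_ == -1
n_clusters = len(set(db.labels_[~noise_mask]))

# The silhouette score needs at least two clusters among the kept points
if n_clusters > 1:
    score_without_noise = silhouette_score(
        X_demo[~noise_mask], db.labels_[~noise_mask]
    )
    print("Clusters:", n_clusters, "Noise points:", noise_mask.sum())
    print("Silhouette without noise:", score_without_noise)
```

Whether to exclude noise is a judgment call; excluding it can favour parameter settings that discard many points, so the noise count should be reported alongside the score.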

### 8. DBSCAN model training - Scratch

Code the DBSCAN model from scratch. For the same epsilon and min_samples values, fit the model on the data. You should get the same silhouette score.

In [None]:
import numpy as np

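class DBSCANScratch:
    UNVISITED = -2  # sentinel distinct from noise (-1) and cluster ids (0, 1, ...)
    NOISE = -1

    def __init__(self, eps, min_samples):
        self.eps = eps
        self.min_samples = min_samples

    def fit(self, X):
        self.labels = np.full(len(X), self.UNVISITED, dtype=int)
        self.cluster_id = 0

        for i in range(len(X)):
            if self.labels[i] == self.UNVISITED:
                if self._expand_cluster(X, i):
                    self.cluster_id += 1

    def _expand_cluster(self, X, i):
        neighbors = self._region_query(X, i)

        # Not a core point: mark as noise (a later cluster may still claim it as a border point)
        if len(neighbors) < self.min_samples:
            self.labels[i] = self.NOISE
            return False

        self.labels[i] = self.cluster_id

        # Use a growing queue so that neighbours of newly found core points are processed too
        queue = list(neighbors)
        while queue:
            neighbor = queue.pop()
            if self.labels[neighbor] == self.NOISE:
                # Border point that was previously marked as noise
                self.labels[neighbor] = self.cluster_id
            elif self.labels[neighbor] == self.UNVISITED:
                self.labels[neighbor] = self.cluster_id
                neighbor_neighbors = self._region_query(X, neighbor)
                # Only core points extend the cluster further
                if len(neighbor_neighbors) >= self.min_samples:
                    queue.extend(neighbor_neighbors)

        return True

    def _region_query(self, X, i):
        # Indices of all points within eps of point i (including i itself)
        return np.where(np.linalg.norm(X - X[i], axis=1) <= self.eps)[0]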

# Fit the DBSCAN model from scratch with the same epsilon and min_samples values
dbscan_scratch = DBSCANScratch(eps=2, min_samples=4)
dbscan_scratch.fit(df.values)

# Calculate the silhouette score for the scratch model
silhouette_scratch = silhouette_score(df, dbscan_scratch.labels)
print("Silhouette Score (Scratch Model):", silhouette_scratch)