# RandomizedSearchCV in Scikit-Learn

## What is RandomizedSearchCV?

`RandomizedSearchCV` is a hyperparameter tuning method in Scikit-learn. It randomly samples a fixed number of hyperparameter combinations from a specified search space and evaluates them using cross-validation. This method allows you to search for the best hyperparameters for a model in a more efficient way compared to exhaustive methods like `GridSearchCV`.

## Why Use RandomizedSearchCV?

- **Efficiency**: Instead of evaluating all possible hyperparameter combinations (as in GridSearchCV), it randomly selects a subset to evaluate, saving computational time and resources.
- **Flexibility**: You can control the number of iterations (hyperparameter combinations) it evaluates by setting the `n_iter` parameter.
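- **Exploration**: Each candidate draws a fresh value for every hyperparameter, so a fixed budget of candidates covers more distinct values per hyperparameter than a grid of the same size, and continuous ranges can be sampled directly.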
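
To make the efficiency point concrete, the sketch below counts candidates for a grid and for a random sample. The search space is hypothetical and chosen only for illustration; `ParameterGrid` and `ParameterSampler` are the scikit-learn helpers that enumerate or draw candidates without fitting any model.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical grid: every combination of the listed values is a candidate.
grid_space = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
}
grid_candidates = list(ParameterGrid(grid_space))

# Hypothetical random space: the budget is fixed by n_iter, not by the space.
random_space = {
    "n_estimators": randint(50, 201),      # integers 50..200
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": randint(2, 11),   # integers 2..10
}
random_candidates = list(ParameterSampler(random_space, n_iter=10, random_state=0))

print("Grid candidates:", len(grid_candidates))      # 4 * 4 * 3 combinations
print("Random candidates:", len(random_candidates))  # equals n_iter
```

With 5-fold cross-validation, the grid above would need 240 fits, while the random sample needs 50.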

## Key Features

- **param_distributions**: A dictionary where each key is a hyperparameter name and its value is a distribution or a list of possible values.
- **n_iter**: The number of parameter settings that are sampled and evaluated (default 10). More iterations give a broader search, but the total number of model fits is `n_iter` multiplied by the number of folds.
- **cv**: The cross-validation strategy, usually given as the number of folds. A splitter object can be passed instead.
- **scoring**: The evaluation metric used to rank hyperparameter combinations, given as a scorer name (e.g., `'accuracy'` or `'neg_mean_squared_error'`). Higher is always better, which is why error metrics are negated.
- **random_state**: Controls the random sampling from `param_distributions`. Pass an integer to make the sampled candidates reproducible.
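
The SciPy distributions commonly passed to `param_distributions` have conventions that are easy to misread. The short sketch below checks the ranges that `uniform` and `randint` actually cover; the particular bounds are only illustrative.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers [loc, loc + scale], not [loc, scale].
c_dist = uniform(loc=0.1, scale=10)
c_low, c_high = c_dist.ppf(0), c_dist.ppf(1)
c_samples = c_dist.rvs(size=1000, random_state=0)

# randint(low, high) draws integers from low up to high - 1 (high is exclusive).
n_dist = randint(50, 200)
n_samples = n_dist.rvs(size=1000, random_state=0)

print("uniform(0.1, 10) range:", c_low, "to", c_high)
print("randint(50, 200) observed range:", n_samples.min(), "to", n_samples.max())
```

Any object with an `rvs` method can be used as a distribution, and a plain list is sampled uniformly.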

## Differences Between RandomizedSearchCV and GridSearchCV

| Feature                | RandomizedSearchCV                       | GridSearchCV                       |
|------------------------|------------------------------------------|------------------------------------|
| Search Method           | Random sampling from hyperparameter space | Exhaustive search over all combinations |
| Computation Time        | Generally faster                        | Generally slower                   |
| Number of Combinations  | User-specified (via `n_iter`)            | All possible combinations evaluated |
| Flexibility             | Can handle continuous distributions      | Limited to fixed lists of values   |
| Thoroughness            | Less thorough but efficient              | Thorough but computationally expensive |
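
The comparison above can be checked on a small problem. This is a minimal sketch that assumes a `DecisionTreeClassifier` on the Iris data and an arbitrary search space; neither comes from the examples in this notebook.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)

# Exhaustive search: 4 * 3 = 12 candidates, fixed by the grid itself.
grid = GridSearchCV(
    tree,
    param_grid={"max_depth": [2, 4, 6, 8], "min_samples_split": [2, 5, 10]},
    cv=5,
)
grid.fit(X, y)

# Randomized search: the number of candidates is set by n_iter.
rand = RandomizedSearchCV(
    tree,
    param_distributions={
        "max_depth": randint(2, 9),           # integers 2..8
        "min_samples_split": randint(2, 11),  # integers 2..10
    },
    n_iter=5,
    cv=5,
    random_state=0,
)
rand.fit(X, y)

n_grid = len(grid.cv_results_["params"])
n_rand = len(rand.cv_results_["params"])
print("GridSearchCV candidates:", n_grid)
print("RandomizedSearchCV candidates:", n_rand)
```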


## How It Works

1. **Parameter Space**: Define the hyperparameter space with distributions of possible values (either continuous or discrete).
2. **Random Sampling**: Randomly sample combinations from the parameter space.
3. **Cross-Validation**: For each sampled combination, perform cross-validation to evaluate model performance.
4. **Best Parameters**: Identify the combination of hyperparameters that yields the best performance.
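
The four steps above can be sketched by hand with `ParameterSampler` and `cross_val_score`. This is an illustration of the idea rather than the library's internal implementation, and the search space is an assumption made for the example.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterSampler, cross_val_score

X, y = load_iris(return_X_y=True)

# 1. Parameter space: a distribution and a list of discrete values.
space = {"n_estimators": randint(10, 50), "max_depth": [None, 3, 5]}

# 2. Random sampling: draw a fixed number of candidate settings.
candidates = list(ParameterSampler(space, n_iter=5, random_state=0))

# 3. Cross-validation: score every candidate with 5-fold CV.
results = []
for params in candidates:
    model = RandomForestClassifier(random_state=0, **params)
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results.append((scores.mean(), params))

# 4. Best parameters: keep the candidate with the highest mean score.
best_score, best_params = max(results, key=lambda item: item[0])
print("Best parameters:", best_params)
print("Best mean CV accuracy:", best_score)
```

`RandomizedSearchCV` additionally refits the best candidate on the full training data by default (`refit=True`), which is what makes `best_estimator_` available.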

# **RandomForestClassifier**

```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from scipy.stats import randint

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter distributions for RandomForestClassifier
param_dist = {
    'n_estimators': randint(50, 200),    # Integers from 50 to 199 (upper bound is exclusive)
    'max_depth': [None, 10, 20, 30],     # List of possible max_depth values
    'min_samples_split': randint(2, 11)  # Integers from 2 to 10
}

# Create a RandomForestClassifier model
model = RandomForestClassifier(random_state=42)

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=10, cv=5,
                                   scoring='accuracy', random_state=42, verbose=1)

# Fit the RandomizedSearchCV object to the training data
random_search.fit(X_train, y_train)

# Print the best parameters and the best mean cross-validated score
print("Best parameters found: ", random_search.best_params_)
print("Best score found: ", random_search.best_score_)

# Use the best model (refit on the full training set) to predict on the test set
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)

# Evaluate the model performance on the test set
accuracy = accuracy_score(y_test, y_pred)
print("Test set accuracy: ", accuracy)
```

```text
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters found:  {'max_depth': 20, 'min_samples_split': 5, 'n_estimators': 142}
Best score found:  0.9428571428571428
Test set accuracy:  1.0
```
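
Beyond `best_params_`, the fitted search object keeps the score of every sampled candidate in `cv_results_`. The sketch below lists the top candidates by rank; it uses a smaller, hypothetical search space than the cell above so that it runs quickly.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": randint(10, 50), "max_depth": [None, 3, 5]},
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)

# cv_results_ is a dict of arrays with one entry per sampled candidate.
results = search.cv_results_
order = np.argsort(results["rank_test_score"])  # rank 1 is the best candidate

for i in order[:3]:
    print(
        results["rank_test_score"][i],
        round(results["mean_test_score"][i], 4),
        round(results["std_test_score"][i], 4),
        results["params"][i],
    )
```

Looking at the mean and standard deviation together helps judge whether the winning candidate is clearly better or only marginally ahead of the others.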


# **SVC**

```python
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from scipy.stats import uniform

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter distributions for SVC
# Note: uniform(loc, scale) samples from [loc, loc + scale]
param_dist = {
    'C': uniform(0.1, 10),                # Regularization parameter 'C' drawn from [0.1, 10.1]
    'kernel': ['linear', 'rbf', 'poly'],  # Different kernel types
    'gamma': uniform(0.001, 0.1)          # Kernel coefficient drawn from [0.001, 0.101]; ignored by 'linear'
}

# Create a Support Vector Classifier model
model = SVC(random_state=42)

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=10, cv=5,
                                   scoring='accuracy', random_state=42, verbose=1)

# Fit the RandomizedSearchCV object to the training data
random_search.fit(X_train, y_train)

# Print the best parameters and the best mean cross-validated score
print("Best parameters found: ", random_search.best_params_)
print("Best score found: ", random_search.best_score_)

# Use the best model (refit on the full training set) to predict on the test set
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)

# Evaluate the model performance on the test set
accuracy = accuracy_score(y_test, y_pred)
print("Test set accuracy: ", accuracy)
```

```text
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters found:  {'C': 2.2233911067827616, 'gamma': 0.019182496720710065, 'kernel': 'linear'}
Best score found:  0.9714285714285715
Test set accuracy:  1.0
```
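
The cell above samples `C` and `gamma` on a linear scale. Because useful values of these parameters often span several orders of magnitude, a log-uniform distribution is a common alternative. The sketch below is a variation under that assumption, not a replacement for the result above; the bounds are illustrative and the `'poly'` kernel is left out to keep it quick.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# loguniform(a, b) makes each order of magnitude between a and b equally likely.
param_dist = {
    "C": loguniform(1e-2, 1e2),      # assumed range, for illustration only
    "gamma": loguniform(1e-4, 1e0),  # assumed range; ignored by the 'linear' kernel
    "kernel": ["linear", "rbf"],
}

search = RandomizedSearchCV(
    SVC(),
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_train, y_train)

print("Best parameters found:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
print("Test set accuracy:", search.score(X_test, y_test))
```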


# **KNeighborsClassifier**

```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from scipy.stats import randint

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter distributions for KNeighborsClassifier
param_dist = {
    'n_neighbors': randint(1, 31),       # Integers from 1 to 30
    'weights': ['uniform', 'distance'],  # Weight function used in prediction
    # Distance metric; with the default p=2, 'minkowski' is equivalent to 'euclidean'
    'metric': ['euclidean', 'manhattan', 'minkowski']
}

# Create a KNeighborsClassifier model
model = KNeighborsClassifier()

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=10, cv=5,
                                   scoring='accuracy', random_state=42, verbose=1)

# Fit the RandomizedSearchCV object to the training data
random_search.fit(X_train, y_train)

# Print the best parameters and the best mean cross-validated score
print("Best parameters found: ", random_search.best_params_)
print("Best score found: ", random_search.best_score_)

# Use the best model (refit on the full training set) to predict on the test set
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)

# Evaluate the model performance on the test set
accuracy = accuracy_score(y_test, y_pred)
print("Test set accuracy: ", accuracy)
```

```text
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters found:  {'metric': 'euclidean', 'n_neighbors': 1, 'weights': 'distance'}
Best score found:  0.9523809523809523
Test set accuracy:  1.0
```


# **GradientBoostingClassifier**

```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from scipy.stats import uniform, randint

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter distributions for GradientBoostingClassifier
# Note: uniform(loc, scale) samples from [loc, loc + scale]; randint excludes its upper bound
param_dist = {
    'n_estimators': randint(50, 200),     # Number of boosting stages: integers from 50 to 199
    'learning_rate': uniform(0.01, 0.3),  # Step size at each iteration, drawn from [0.01, 0.31]
    'max_depth': randint(3, 10),          # Maximum depth of the individual trees: integers from 3 to 9
    'subsample': uniform(0.7, 0.3)        # Fraction of samples per base learner, drawn from [0.7, 1.0]
}

# Create a GradientBoostingClassifier model
model = GradientBoostingClassifier(random_state=42)

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=10, cv=5,
                                   scoring='accuracy', random_state=42, verbose=1)

# Fit the RandomizedSearchCV object to the training data
random_search.fit(X_train, y_train)

# Print the best parameters and the best mean cross-validated score
print("Best parameters found: ", random_search.best_params_)
print("Best score found: ", random_search.best_score_)

# Use the best model (refit on the full training set) to predict on the test set
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)

# Evaluate the model performance on the test set
accuracy = accuracy_score(y_test, y_pred)
print("Test set accuracy: ", accuracy)
```

```text
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters found:  {'learning_rate': 0.19355586841671385, 'max_depth': 4, 'n_estimators': 64, 'subsample': 0.8368209952651107}
Best score found:  0.9428571428571428
Test set accuracy:  1.0
```


# **LogisticRegression**

```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from scipy.stats import uniform, randint

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter distributions for LogisticRegression with compatible combinations
param_dist = {
    'C': uniform(0.001, 10),  # Inverse of regularization strength, drawn from [0.001, 10.001]
    # Type of regularization; None disables it (scikit-learn 1.2+, earlier releases used the string 'none')
    'penalty': ['l2', None],
    'solver': ['newton-cg', 'lbfgs', 'saga'],  # Solvers that support both 'l2' and no penalty
    'max_iter': randint(50, 300)  # Maximum number of iterations: integers from 50 to 299
}

# Create a LogisticRegression model
model = LogisticRegression(random_state=42)

# Create the RandomizedSearchCV object (n_jobs=-1 runs the fits on all available cores)
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=10, cv=5,
                                   scoring='accuracy', random_state=42, verbose=1, n_jobs=-1)

# Fit the RandomizedSearchCV object to the training data
random_search.fit(X_train, y_train)

# Print the best parameters and the best mean cross-validated score
print("Best parameters found: ", random_search.best_params_)
print("Best score found: ", random_search.best_score_)

# Use the best model (refit on the full training set) to predict on the test set
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)

# Evaluate the model performance on the test set
accuracy = accuracy_score(y_test, y_pred)
print("Test set accuracy: ", accuracy)
```

```text
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters found:  {'C': 3.746401188473625, 'max_iter': 142, 'penalty': 'l2', 'solver': 'saga'}
Best score found:  0.9619047619047618
Test set accuracy:  1.0
```
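
When only some solver and penalty pairs are valid together, `param_distributions` can also be a list of dictionaries: one dictionary is picked at random for each candidate, then its parameters are sampled. The sketch below assumes a hypothetical split into an `'l2'` group and an `'l1'` group, which is not the search space used in the cell above.

```python
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Each dict is a self-consistent sub-space, so invalid pairs are never sampled.
param_dist = [
    {"penalty": ["l2"], "solver": ["lbfgs", "newton-cg"], "C": uniform(0.001, 10)},
    {"penalty": ["l1"], "solver": ["saga"], "C": uniform(0.001, 10)},
]

search = RandomizedSearchCV(
    LogisticRegression(max_iter=5000, random_state=42),
    param_distributions=param_dist,
    n_iter=8,
    cv=5,
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)

# Every sampled candidate comes from exactly one of the two dictionaries.
sampled_pairs = [(p["penalty"], p["solver"]) for p in search.cv_results_["params"]]
print(sampled_pairs)
print("Best parameters found:", search.best_params_)
```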




# **Ridge**

```python
# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from scipy.stats import uniform

# Load the California housing dataset (downloaded on first use)
california = fetch_california_housing()
X = california.data
y = california.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter distributions for Ridge regression
param_dist = {
    'alpha': uniform(0.001, 100),   # Regularization strength, drawn from [0.001, 100.001]
    'fit_intercept': [True, False]  # Whether to include an intercept term
}

# Create a Ridge regression model
model = Ridge(random_state=42)

# Create the RandomizedSearchCV object
# scikit-learn scorers are maximized, so MSE is used in its negated form
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=10, cv=5,
                                   scoring='neg_mean_squared_error', random_state=42, verbose=1)

# Fit the RandomizedSearchCV object to the training data
random_search.fit(X_train, y_train)

# Print the best parameters and the best mean cross-validated score
print("Best parameters found: ", random_search.best_params_)
print("Best score found: ", -random_search.best_score_)  # Negate to get positive MSE

# Use the best model (refit on the full training set) to predict on the test set
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)

# Evaluate the model performance on the test set
mse = mean_squared_error(y_test, y_pred)
print("Test set Mean Squared Error: ", mse)
```

```text
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters found:  {'alpha': 37.455011884736244, 'fit_intercept': True}
Best score found:  0.527062792279338
Test set Mean Squared Error:  0.5286047409176791
```
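
The sign handling in the Ridge example follows from the scorer convention that higher is always better. The sketch below shows it on a small synthetic dataset from `make_regression`, which is an assumption made here to avoid a download, and also converts the result to RMSE so it is in the units of the target.

```python
import numpy as np
from scipy.stats import uniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data, used only for illustration.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

search = RandomizedSearchCV(
    Ridge(),
    param_distributions={"alpha": uniform(0.001, 100)},
    n_iter=10,
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X, y)

# best_score_ is the highest (least negative) mean score across candidates.
cv_mse = -search.best_score_
cv_rmse = np.sqrt(cv_mse)

print("Raw best_score_ (negated MSE):", search.best_score_)
print("Cross-validated MSE:", cv_mse)
print("Cross-validated RMSE:", cv_rmse)
```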
