## Level 1: Exploratory Data Analysis on the Iris Dataset
### Data manipulation and visualization techniques.
### Tasks:
- Load the Iris dataset.
- Perform basic checks (shape, data types, missing values).
- Create visualizations (scatter plots, histograms) to understand the distribution of each feature and the relationships between them.
- Summarize key findings.

In [1]:
# Importing necessary libraries
import pandas as pd
import plotly.express as px
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Basic checks
# Checking the shape of the dataset
shape = df.shape
print("Shape:", shape)

# Checking data types
data_types = df.dtypes
print(data_types)

# Checking for missing values
missing_values = df.isnull().sum()
print(missing_values)

# Visualizations
# Scatter matrix (similar to pairplot in seaborn)
scatter_matrix = px.scatter_matrix(df, dimensions=iris.feature_names, color="species")
scatter_matrix.update_layout(title="Scatter Matrix of Iris Dataset Features")
scatter_matrix.show()

# Histograms for each feature
for col in iris.feature_names:
    fig = px.histogram(df, x=col, color="species", marginal="box")
    fig.update_layout(title=f"Histogram of {col}")
    fig.show()

1. The Iris dataset has 150 rows and 5 columns (four measurements plus the species label).
2. The four features are numeric (float64); 'species' is a categorical variable with three classes of 50 samples each.
3. There are no missing values in the dataset.
4. The scatter matrix shows setosa as a clearly separate cluster, especially on the petal measurements, while versicolor and virginica partly overlap.
5. The histograms and box plots show the distribution and spread of each feature by species; petal length and petal width differ most between species, while the sepal measurements overlap more.
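The numbers behind these findings can be printed directly rather than read off the plots. The following is a minimal sketch that rebuilds the same DataFrame and reports shape, missing values, class counts, and per-species means; the variable names `n_rows`, `n_cols`, `missing_total`, and `class_counts` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame the same way as the Level 1 cell
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

n_rows, n_cols = df.shape
missing_total = int(df.isnull().sum().sum())
class_counts = df["species"].value_counts()

print(f"The Iris dataset has {n_rows} rows and {n_cols} columns.")
print(f"Total missing values: {missing_total}")
print(df.dtypes)
print(class_counts)

# Per-species means help quantify the separation seen in the plots
print(df.groupby("species", observed=True)[iris.feature_names].mean())
```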

## Level 2: Basic Classification with Iris Dataset
### Apply basic machine learning models.
### Tasks:
- Split the data into training and testing sets.
- Apply a simple classifier like Logistic Regression.
- Evaluate the model's performance using accuracy and a confusion matrix.

In [3]:
# Importing additional necessary libraries for model training and evaluation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Split the data into training and testing sets
X = df[iris.feature_names]  # Features
y = df['species']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Apply a simple classifier: Logistic Regression
logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train, y_train)

# Predict on the test set
y_pred = logreg.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Output results
print("Accuracy: ", accuracy)
px.imshow(conf_matrix, text_auto=True)


Accuracy:  1.0
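A perfect score on a single 45-sample test split can be partly luck of the split. As a hedged cross-check, the sketch below estimates accuracy with stratified 5-fold cross-validation instead; `max_iter=1000` and the shuffled `StratifiedKFold` are assumptions chosen here, not settings from the cell above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)

# max_iter=1000 is an assumption, set generously to avoid convergence warnings
model = LogisticRegression(max_iter=1000)

# Stratified folds keep the three species balanced in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_iris, y_iris, cv=cv, scoring="accuracy")

print("Fold accuracies:", cv_scores)
print(f"Mean accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

The spread across folds gives a rough sense of how much the single-split accuracy could vary.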


## Level 3: Feature Engineering and PCA

### Dimensionality reduction and feature engineering.

### Tasks:
- Apply PCA to the Iris dataset and reduce its dimensions.
- Visualize the data in the new feature space.
- Use a classifier on the transformed data and compare its performance with the original data's classifier.

In [5]:
# Importing necessary libraries for PCA and visualization
from sklearn.decomposition import PCA
import plotly.express as px

# Apply PCA to the Iris dataset and reduce its dimensions to 2
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Create a DataFrame for the PCA transformed data
df_pca = pd.DataFrame(data=X_pca, columns=['PCA1', 'PCA2'])
df_pca['species'] = y

# Visualize the data in the new feature space
fig = px.scatter(df_pca, x='PCA1', y='PCA2', color='species')
fig.update_layout(title="PCA of Iris Dataset")
fig.show()

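# Use a classifier on the transformed data
# Note: PCA above was fitted on the full dataset, so the test rows also shaped the projection.
# The split reuses random_state=42, so y_train and y_test match the Level 2 split.
X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=42)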
logreg_pca = LogisticRegression(max_iter=200)
logreg_pca.fit(X_train_pca, y_train)

# Predict on the test set
y_pred_pca = logreg_pca.predict(X_test_pca)

# Evaluate the model's performance
accuracy_pca = accuracy_score(y_test, y_pred_pca)
conf_matrix_pca = confusion_matrix(y_test, y_pred_pca)

# Compare with the original data's classifier performance
original_accuracy = accuracy  # From the Level 2 classifier
print("Accuracy with PCA (2 components):", accuracy_pca)
print("Accuracy on original features:", original_accuracy)

px.imshow(conf_matrix_pca, text_auto=True)
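Two follow-up questions are how much variance the two components retain, and whether the comparison still holds when PCA is fitted on the training rows only. The sketch below is one possible way to check both, assuming the same 70/30 split; wrapping PCA and the classifier in `make_pipeline` is a choice made here, not the notebook's original approach.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42
)

# PCA is fitted inside the pipeline, i.e. on the training rows only
pca_logreg = make_pipeline(PCA(n_components=2), LogisticRegression(max_iter=200))
pca_logreg.fit(X_tr, y_tr)

# make_pipeline names each step after its lowercased class name
explained = pca_logreg.named_steps["pca"].explained_variance_ratio_
pca_test_accuracy = pca_logreg.score(X_te, y_te)

print("Explained variance ratio per component:", explained)
print("Total variance retained:", explained.sum())
print("Test accuracy with PCA in the pipeline:", pca_test_accuracy)
```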



## Level 4: Advanced Model Application

### Explore different models and their applications.

### Tasks:
- Use different classifiers (e.g., SVM, Decision Trees, K-Nearest Neighbors) on the Iris dataset.
- Experiment with different hyperparameters for each model.
- Compare the performance of these models using cross-validation.

In [6]:
# Importing necessary libraries for different classifiers and cross-validation
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Define different classifiers with different hyperparameters
classifiers = {
    "SVM": SVC(kernel='linear'),
    "SVM with RBF Kernel": SVC(kernel='rbf', C=1, gamma=0.1),
    "Decision Tree": DecisionTreeClassifier(max_depth=3),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5)
}

# Dictionary to hold cross-validation results
cv_results = {}

# Perform cross-validation for each classifier
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    cv_results[name] = np.mean(scores)

# Output results
cv_results

{'SVM': 0.9800000000000001,
 'SVM with RBF Kernel': 0.9800000000000001,
 'Decision Tree': 0.9733333333333334,
 'K-Nearest Neighbors': 0.9733333333333334}
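The comparison above uses one hand-picked setting per model. To experiment with hyperparameters more systematically, a small grid search per classifier is one option. In this sketch the search spaces in `search_spaces` are hypothetical, illustrative values rather than tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Hypothetical search spaces; the values are illustrative only
search_spaces = {
    "SVM": (SVC(), {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}),
    "Decision Tree": (
        DecisionTreeClassifier(random_state=42),
        {"max_depth": [2, 3, 5, None]},
    ),
    "K-Nearest Neighbors": (
        KNeighborsClassifier(),
        {"n_neighbors": [3, 5, 7, 9]},
    ),
}

tuning_results = {}
for name, (estimator, grid) in search_spaces.items():
    # 5-fold cross-validation over every combination in the grid
    search = GridSearchCV(estimator, grid, cv=5)
    search.fit(X_iris, y_iris)
    tuning_results[name] = (search.best_params_, search.best_score_)

for name, (params, score) in tuning_results.items():
    print(f"{name}: best params {params}, mean CV accuracy {score:.3f}")
```

Because the best score is selected on the same folds it is measured on, it is a slightly optimistic estimate; a held-out test set or nested cross-validation would give a cleaner number.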

## Level 5: Experimenting with Different Scalers

### Understand the impact of feature scaling.

### Tasks:
- Apply different scalers (StandardScaler, MinMaxScaler, RobustScaler) to the Iris dataset.
- Use a consistent classifier to evaluate the impact of scaling on model performance.

In [7]:
# Importing necessary libraries for scaling and model evaluation
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Define different scalers
scalers = {
    "Standard Scaler": StandardScaler(),
    "MinMax Scaler": MinMaxScaler(),
    "Robust Scaler": RobustScaler()
}

# Using a consistent classifier for evaluation
classifier = LogisticRegression(max_iter=200)

# Dictionary to hold cross-validation results for each scaler
scaler_results = {}

# Apply each scaler and evaluate the model performance
for scaler_name, scaler in scalers.items():
    # Creating a pipeline that first scales the data and then applies the classifier
    pipeline = make_pipeline(scaler, classifier)

    # Perform cross-validation and store the mean accuracy
    scores = cross_val_score(pipeline, X, y, cv=5)
    scaler_results[scaler_name] = np.mean(scores)

# Output results
scaler_results

{'Standard Scaler': 0.9600000000000002,
 'MinMax Scaler': 0.9266666666666665,
 'Robust Scaler': 0.9333333333333333}
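To see why the scalers can lead to different scores, it helps to look at what each one does to the features. StandardScaler centers each feature to zero mean and unit variance, MinMaxScaler maps each feature to the range [0, 1], and RobustScaler centers on the median and scales by the interquartile range. The sketch below applies each to the Iris features and prints summary statistics; the dictionary name `scaled` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X_iris, _ = load_iris(return_X_y=True)

scaled = {
    "Standard Scaler": StandardScaler().fit_transform(X_iris),
    "MinMax Scaler": MinMaxScaler().fit_transform(X_iris),
    "Robust Scaler": RobustScaler().fit_transform(X_iris),
}

# Per-feature statistics after each transformation
for name, X_scaled in scaled.items():
    print(name)
    print("  mean:  ", np.round(X_scaled.mean(axis=0), 3))
    print("  std:   ", np.round(X_scaled.std(axis=0), 3))
    print("  median:", np.round(np.median(X_scaled, axis=0), 3))
    print("  min:   ", np.round(X_scaled.min(axis=0), 3))
    print("  max:   ", np.round(X_scaled.max(axis=0), 3))
```

Since the regularization in LogisticRegression penalizes coefficient size, the feature scale changes the effective strength of that penalty, which is one plausible reason the three scores differ.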

## Level 6: Regression with the California Housing Dataset
### Regression problem.
### Tasks:
- Load and explore the California Housing dataset.
- Apply regression models (Linear Regression, Ridge, Lasso).
- Evaluate models using metrics like RMSE (Root Mean Square Error).
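
As a quick illustration of the metric, RMSE is the square root of the mean of the squared residuals, so it is expressed in the same units as the target (for California Housing, the median house value in units of $100,000). The sketch below computes it by hand and via scikit-learn on hypothetical toy values that are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical toy values, chosen only to make the arithmetic easy to follow
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# RMSE by hand: square the residuals, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print("Manual RMSE:      ", rmse_manual)
print("scikit-learn RMSE:", rmse_sklearn)
```

Here the squared residuals are 0.25, 0.25, 0, and 1, so the MSE is 0.375 and the RMSE is its square root.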

In [None]:
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
import numpy as np

def rmse(true_values, predicted_values):
    return np.sqrt(mean_squared_error(true_values, predicted_values))

# load dataset
housing = fetch_california_housing(as_frame=True)
housing_data = housing.data
housing_target = housing.target

# exploration
print(housing_data.head())
print(housing_data.describe())

# preprocessing
X_train, X_test, y_train, y_test = train_test_split(housing_data, housing_target, test_size=0.2, random_state=42)

# Linear Regression
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
linear_predictions = linear_model.predict(X_test)

# Ridge Regression
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)
ridge_predictions = ridge_model.predict(X_test)

# Lasso Regression
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)
lasso_predictions = lasso_model.predict(X_test)

# Evaluate Models using RMSE
print("Linear Regression RMSE:", rmse(y_test, linear_predictions))
print("Ridge Regression RMSE:", rmse(y_test, ridge_predictions))
print("Lasso Regression RMSE:", rmse(y_test, lasso_predictions))


## Level 7: Hyperparameter Tuning
### Fine-tune model parameters.
### Tasks:
- Use GridSearchCV or RandomizedSearchCV for hyperparameter tuning on a model of your choice.
- Analyze the impact of hyperparameter tuning on model performance.

In [13]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import fetch_california_housing
import numpy as np

# Load the dataset
housing = fetch_california_housing(as_frame=True)
X = housing.data
y = housing.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set up the parameter grid for Random Forest
param_grid = {
    'n_estimators': [100, 200],
    'max_features': ['log2', 'sqrt'],
    'max_depth': [1, 3, 5, None],
    'min_samples_split': [2, 10],
    'min_samples_leaf': [1, 5]
    #'bootstrap': [True, False]
}

# Create a base model
rf = RandomForestRegressor()

# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Find the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Evaluate the best model
best_rf = grid_search.best_estimator_
predictions = best_rf.predict(X_test)
# Named best_rmse so it does not shadow the rmse() helper defined in Level 6
best_rmse = np.sqrt(mean_squared_error(y_test, predictions))

best_rmse, best_params

Fitting 3 folds for each of 64 candidates, totalling 192 fits


[CV] END max_depth=1, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.4s
[CV] END max_depth=1, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.6s
[CV] END max_depth=1, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.6s
[CV] END max_depth=1, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   5.8s
[CV] END max_depth=1, max_features=log2, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   3.5s
[CV] END max_depth=1, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   6.7s
[CV] END max_depth=1, max_features=log2, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   3.5s
[CV] END max_depth=1, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   6.9s
[CV] END max_depth=1, max_features=log2, min_s

(0.4900485185611654,
 {'max_depth': None,
  'max_features': 'log2',
  'min_samples_leaf': 1,
  'min_samples_split': 2,
  'n_estimators': 200})
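
To analyze the impact of tuning, the tuned model needs a baseline to compare against, and `cv_results_` shows how each candidate fared. The sketch below illustrates both steps on synthetic stand-in data from `make_regression` so that it runs quickly and offline; the grid `demo_grid`, the reduced `n_estimators=50`, and the RMSE-based `scoring` are assumptions made here, not the settings of the search above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption) so the sketch runs quickly and offline
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline: a forest with otherwise default hyperparameters
baseline = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)
baseline_rmse = np.sqrt(mean_squared_error(y_te, baseline.predict(X_te)))

# Small illustrative grid, scored by negative RMSE
demo_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    demo_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_tr, y_tr)
tuned_rmse = np.sqrt(mean_squared_error(y_te, search.best_estimator_.predict(X_te)))

# cv_results_ holds the cross-validated score of every candidate
summary = (
    pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(summary.to_string(index=False))
print(f"Baseline RMSE: {baseline_rmse:.3f}  Tuned RMSE: {tuned_rmse:.3f}")
```

The same pattern applies to the California Housing search: fit an untuned `RandomForestRegressor` on `X_train`, compute its test RMSE, and compare it with the tuned value.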

## Level 8: Designing Pipelines for Different Models
### Automating workflows.
### Tasks:
- Design a pipeline for a classification model on the Iris dataset (including preprocessing steps).
- Create a pipeline for a regression model on the California Housing dataset.
- Implement a pipeline for a more complex dataset/model of your choice, integrating advanced preprocessing and model selection steps.

In [None]:
from sklearn.datasets import load_iris, fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import classification_report, mean_squared_error
from sklearn.pipeline import Pipeline
import numpy as np

# Load datasets
iris = load_iris()
cali_housing = fetch_california_housing()

# Split datasets into training and testing sets
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
X_train_cali, X_test_cali, y_train_cali, y_test_cali = train_test_split(cali_housing.data, cali_housing.target, test_size=0.2, random_state=42)

# Define pipelines
pipeline_iris = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression())
])
pipeline_cali = Pipeline([
    ("scaler", StandardScaler()),
    ("reg", LinearRegression())  # named "reg" because this step is a regressor
])

# Fit and evaluate iris dataset
pipeline_iris.fit(X_train_iris, y_train_iris)
iris_predictions = pipeline_iris.predict(X_test_iris)
iris_report = classification_report(y_test_iris, iris_predictions)

# Fit and evaluate california housing dataset
pipeline_cali.fit(X_train_cali, y_train_cali)
cali_predictions = pipeline_cali.predict(X_test_cali)
cali_mse = mean_squared_error(y_test_cali, cali_predictions)
cali_rmse = np.sqrt(cali_mse)

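# Print the report so its line breaks render instead of appearing as "\n"
print(iris_report)
print("California Housing RMSE:", cali_rmse)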
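
The third task, a pipeline with more advanced preprocessing and model selection, is not covered by the cell above. One possible shape for it is sketched below on hypothetical toy data with numeric and categorical columns and injected missing values. The column names (`age`, `income`, `city`), the target rule, and the candidate models are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns, one categorical column
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["north", "south", "east"], n),
})
# Made-up binary target, computed before missing values are injected
toy_target = (toy["income"] + 500 * toy["age"] > 70_000).astype(int)
toy.loc[rng.choice(n, 20, replace=False), "age"] = np.nan

# Numeric columns: impute the median, then standardize
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: one-hot encode, ignoring categories unseen during fit
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

pipeline_toy = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Model selection: the "model" step itself is searched like a hyperparameter
param_grid = [
    {"model": [LogisticRegression(max_iter=1000)], "model__C": [0.1, 1, 10]},
    {"model": [RandomForestClassifier(random_state=42)], "model__n_estimators": [50, 100]},
]

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_target, test_size=0.2, random_state=42, stratify=toy_target
)
search = GridSearchCV(pipeline_toy, param_grid, cv=3)
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)
print("Best configuration:", search.best_params_)
print("Test accuracy:", test_accuracy)
```

Keeping imputation, scaling, and encoding inside the pipeline means they are re-fitted on each training fold during the search, so no information from validation folds leaks into preprocessing.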