### Short Coding Project: Logistic Regression

#### Project Overview

In this project, you will apply logistic regression to predict whether culverts need repair based on various environmental and physical attributes using the Augmented Culvert Dataset. You will preprocess the data, handle categorical variables, perform feature scaling, apply feature selection, build and evaluate a logistic regression model, and explore advanced topics such as ROC curves and experimenting with different solvers.

- Replace each `# YOUR CODE HERE` comment with your own code.
- **Do not change** the variable names.

### Load the Dataset

Start by loading the Augmented Culvert Dataset and examining its structure.

In [None]:
# Import necessary libraries
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Load the dataset
url = 'https://raw.githubusercontent.com/CyConProject/Lab/main/Datasets/Augmented%20Culvert%20Dataset.csv'
data = pd.read_csv(url)

# Display the first few rows of the dataset
data.head()

### Question 1: Data Exploration and Preprocessing

1. **Identify columns with missing values** and the number of missing entries in each.

2. **Handle missing values** in the dataset:

   - For the `'Flooding_Frequency'` column, replace missing values with `'No Flooding'`.

3. **Convert the `'Cul_rating'` column to a binary target variable** where ratings of 0 or 1 are mapped to `0` (needs repair), and ratings of 2, 3, or 4 are mapped to `1` (satisfactory to good condition).

4. **Display the number of instances in each class** of the binary target variable.

**Hint for Part 2**: Use the `fillna()` method with a specified value.
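
As a rough illustration of the methods involved, the sketch below counts missing values, fills them with a fixed label, and maps a rating to a binary class. It uses a small made-up table; the column names `soil`, `length`, and `rating` are hypothetical and do not come from the culvert dataset.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical columns (not the culvert dataset)
toy = pd.DataFrame({
    "soil": ["clay", None, "sand", "clay"],
    "length": [10.0, 12.5, np.nan, 8.0],
    "rating": [0, 3, 1, 4],
})

# Count missing entries per column and keep only columns that have any
missing = toy.isnull().sum()
missing = missing[missing > 0]

# Replace missing values in one column with a fixed label
toy["soil"] = toy["soil"].fillna("unknown")

# Map a multi-level rating to a binary label, then count each class
toy["rating"] = toy["rating"].apply(lambda r: 0 if r <= 1 else 1)
counts = toy["rating"].value_counts()

print(missing)
print(counts)
```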

In [None]:
# Identify columns with missing values
missing_values_count = # YOUR CODE HERE
print("Missing Values:\n", missing_values_count)

# Fill missing values in 'Flooding_Frequency' with 'No Flooding'
# YOUR CODE HERE

# Convert 'Cul_rating' to binary target variable
def convert_rating(rating):
    # YOUR CODE HERE

data['Cul_rating'] = data['Cul_rating'].apply(convert_rating)

# Display the number of instances in each class
class_counts = # YOUR CODE HERE

cleaned_data = data.copy()
print("Class Counts:\n", class_counts)

### Question 2: Encode Categorical Variables

Machine learning algorithms require numerical input data. Therefore, we need to encode categorical variables into numerical form. One common method is **one-hot encoding**, which converts categorical variables into a set of binary columns, each representing a unique category with 1s and 0s.

1. **Identify the categorical columns** that need to be encoded.

2. **Encode the categorical variables** using one-hot encoding (use `pd.get_dummies()`).

3. **Display the first few rows** of the updated dataset to verify the changes.


**Hint for part 1:** To identify categorical columns, you can use `data.select_dtypes(include=['object'])`, which selects columns stored with the `object` dtype. Pandas typically uses this dtype for text data, so the call is a quick way to find the non-numeric columns that need encoding. If you want to know more about this method, refer to the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html).

**Hint for part 2:** Use the `columns` parameter in `pd.get_dummies()` to specify the columns you want to encode and set `drop_first=True` to drop the first category of each variable. Dropping the first category helps avoid redundancy since the remaining columns provide all the necessary information. Also, don't forget to set `dtype` to ensure the encoded columns are integers. You can learn more in the [Pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html).
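
The two hints can be sketched on a small made-up table. The column names (`material`, `shape`, `length`) are hypothetical, and the text columns are cast to `object` explicitly so the example behaves the same across pandas versions.

```python
import pandas as pd

# Toy data with hypothetical columns (not the culvert dataset)
toy = pd.DataFrame({
    "material": ["steel", "concrete", "plastic", "steel"],
    "shape": ["round", "box", "round", "round"],
    "length": [10.0, 12.5, 9.0, 8.0],
}).astype({"material": object, "shape": object})

# Columns stored with the object dtype are treated as categorical here
cat_cols = toy.select_dtypes(include=["object"]).columns.tolist()

# One-hot encode them; drop_first removes one redundant column per variable
encoded = pd.get_dummies(toy, columns=cat_cols, drop_first=True, dtype=int)

print(cat_cols)
print(encoded)
```

With `drop_first=True`, a row whose dummy columns are all 0 belongs to the dropped category (here `concrete` for `material` and `box` for `shape`).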

In [None]:
# Identify categorical columns
categorical_columns = # YOUR CODE HERE

# Encode categorical variables using one-hot encoding
data_encoded = # YOUR CODE HERE

# Display the updated dataset
data_encoded.head()

### Question 3: Feature Scaling

We will perform feature scaling to ensure that all features contribute equally to the model.

1. **Normalize the feature data** using `MinMaxScaler`.

2. **Display the shape of the feature matrix** after scaling.

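**Note**: Scaling puts all features on a comparable range, which helps the logistic regression solver converge. The non-negative output of `MinMaxScaler` is also required by the chi-squared test used in Question 4.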
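
For intuition, here is a minimal sketch of min-max scaling on a made-up array (the values are arbitrary). Each column is rescaled independently so that its minimum becomes 0 and its maximum becomes 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two toy features on very different ranges
X_toy = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 800.0]])

scaler_toy = MinMaxScaler()
X_toy_scaled = scaler_toy.fit_transform(X_toy)

print(X_toy_scaled)        # first column becomes 0.0, 0.5, 1.0
print(X_toy_scaled.shape)  # (3, 2): scaling does not change the shape
```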

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Separate features and target variable
X = data_encoded.drop('Cul_rating', axis=1)
y = data_encoded['Cul_rating']

# Normalize the feature data
scaler = # YOUR CODE HERE
X_scaled = # YOUR CODE HERE
X_scaled_shape = # YOUR CODE HERE

# Display the shape of X_scaled
print("Shape of X_scaled:", X_scaled_shape)

### Question 4: Feature Selection

1. **Use SelectKBest** with the chi-squared (`chi2`) statistical test to select the top 15 features from the scaled dataset that are most relevant to the target variable. For more details, refer to the [SelectKBest Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html).


2. **Reduce the scaled feature matrix** to these selected features. The train/test split in the next section uses this reduced matrix.

3. **Display the names of the selected features**.

**Hint**: Use `SelectKBest` from `sklearn.feature_selection`.
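
The following sketch shows how `SelectKBest`, `chi2`, and the boolean mask from `get_support()` fit together. The data is synthetic and the feature names (`signal`, `noise_a`, `noise_b`) are made up for illustration. Note that `chi2` expects non-negative feature values.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y_toy = np.array([0, 1] * 20)

# Hypothetical non-negative features: one tracks the target, two are random
X_toy = pd.DataFrame({
    "signal": y_toy * 0.8 + 0.1,
    "noise_a": rng.random(40),
    "noise_b": rng.random(40),
})

# Keep the single highest-scoring feature
selector_toy = SelectKBest(score_func=chi2, k=1)
X_toy_selected = selector_toy.fit_transform(X_toy, y_toy)

# Boolean mask over the original columns, used to recover the names
mask_toy = selector_toy.get_support()
names = X_toy.columns[mask_toy].tolist()

print(names)
print(X_toy_selected.shape)
```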

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

# Apply SelectKBest with chi-squared test
selector = # YOUR CODE HERE
X_selected = # YOUR CODE HERE

# Get the boolean mask of selected features
mask = # YOUR CODE HERE

# Get the feature names
selected_features = # YOUR CODE HERE

# Display selected feature names
print("Selected Features:\n", selected_features)

### Split the Data into Training and Testing Sets

The cell below splits the dataset into training and testing sets, using 80% of the data for training and 20% for testing.

In [None]:
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.20, random_state=42)

# Print the shapes
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

### Question 5: Train and Evaluate the Logistic Regression Model

1. **Initialize** the logistic regression model with default parameters, except for `max_iter=1000`.

2. **Train** the model on the selected features from the training data.

3. **Calculate the accuracy scores** for both the training and testing sets.

4. **Print the classification report** for the test data.
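
The steps above follow the usual scikit-learn fit/predict/score pattern. Here is a minimal sketch of that pattern on synthetic data from `make_classification`; the dataset and all variable names are stand-ins, not the culvert data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption: stands in for real features)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, n_clusters_per_class=1, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42)

# Fit on the training split only
clf = LogisticRegression(max_iter=1000)
clf.fit(Xtr, ytr)

# Compare accuracy on data the model has seen and has not seen
train_acc = accuracy_score(ytr, clf.predict(Xtr))
test_acc = accuracy_score(yte, clf.predict(Xte))
toy_report = classification_report(yte, clf.predict(Xte))

print(f"train={train_acc:.2f} test={test_acc:.2f}")
print(toy_report)
```

A training accuracy far above the test accuracy would suggest overfitting; similar values suggest the model generalizes about as well as it fits.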

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize the Logistic Regression model
model = # YOUR CODE HERE

# Train the model on selected features
# YOUR CODE HERE

# Make predictions on training and testing data
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate accuracy scores
train_accuracy = # YOUR CODE HERE
test_accuracy = # YOUR CODE HERE

# Print classification report for test data
report = # YOUR CODE HERE

# Display the results
print(f"Training Accuracy: {train_accuracy:.2f}")
print(f"Test Accuracy: {test_accuracy:.2f}")
print("Classification Report:\n", report)

### Question 6: Advanced Evaluation Metrics

The ROC (Receiver Operating Characteristic) curve is a graph that shows the performance of a classification model at all classification thresholds. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR). The AUC (Area Under the Curve) quantifies the overall ability of the model to distinguish between classes. A higher AUC indicates better model performance.

Please watch [this video](https://www.youtube.com/watch?v=4jRBRDbJemM&t=907s) to understand the ROC curve and AUC in detail. Additionally, read the official [scikit-learn documentation](https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.roc_auc_score.html) to learn more about the ROC AUC score.


**Calculate the ROC AUC score** from the predicted probabilities of the positive class on the test set. The cell then plots the ROC curve.

**Hint**: Use `roc_auc_score` from `sklearn.metrics`.
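
To make the metric concrete, here is a small hand-checkable sketch with four made-up scores. With a fitted classifier, such scores would typically be the positive-class column of `predict_proba`, i.e. `predict_proba(X)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and positive-class scores
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# AUC is the fraction of (positive, negative) pairs ranked correctly:
# 3 of the 4 pairs here have the positive scored above the negative
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75

# The curve itself: one (FPR, TPR) point per threshold
fpr_toy, tpr_toy, thresholds_toy = roc_curve(y_true, y_score)
print(fpr_toy, tpr_toy)
```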

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Calculate probabilities for ROC curve
y_test_probability = # YOUR CODE HERE

# Calculate ROC AUC score
roc_auc_sco = # YOUR CODE HERE
print("ROC AUC Score:", roc_auc_sco)

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_test_probability)

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % roc_auc_sco)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

### Question 7: Experiment with Different Solvers (Advanced)

Logistic regression solvers are algorithms that optimize the model's parameters. Here are some examples of the solvers:

- **`'liblinear'`**: A good choice for small datasets, using a coordinate descent algorithm.
- **`'saga'`**: Effective for large datasets and supports L1 and L2 regularization.
- **`'lbfgs'`**: The scikit-learn default. Uses the Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm, efficient for multiclass problems.
- **`'newton-cg'`**: An iterative solver that approximates the Newton-Raphson method, well-suited for larger datasets.

**Documentation:** You can read more in the [Logistic Regression Solvers Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).


1. **Train logistic regression models** using different solvers (`'liblinear'`, `'saga'`, `'lbfgs'`, `'newton-cg'`).

2. **Evaluate each model** by calculating the test accuracy and ROC AUC score.

3. **Create a comparison table** showing the solver used, test accuracy, and ROC AUC score.

**Hint:** Set `max_iter=1000` to allow sufficient iterations for the solvers to converge.

In [None]:
solvers = ['liblinear', 'saga', 'lbfgs', 'newton-cg']
results = []

for solver in solvers:
    # Initialize and train the model
    model = # YOUR CODE HERE
    model.fit(X_train, y_train)
    
    # Evaluate the model
    y_test_pred = # YOUR CODE HERE
    y_test_proba = # YOUR CODE HERE
    accuracy = # YOUR CODE HERE
    roc_auc = # YOUR CODE HERE
    
    # Append results
    results.append({
        'Solver': # YOUR CODE HERE
        'Test Accuracy': # YOUR CODE HERE
        'ROC AUC Score': # YOUR CODE HERE
    })
    print(f"Solver: {solver}, Test Accuracy: {accuracy:.2f}, ROC AUC Score: {roc_auc:.2f}")

# Create a DataFrame to display the results
results_df = pd.DataFrame(results)
print("\nComparison of Different Solvers:")
print(results_df)

The outputs are the same across solvers because **logistic regression with the default L2 penalty is a convex optimization problem with a single global minimum**. Every solver minimizes the same cost function, so given enough iterations to converge, each one arrives at nearly the same coefficients and therefore the same predictions. Small differences can remain because solvers stop at a numerical tolerance, and `'liblinear'` also regularizes the intercept.
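
One way to check this claim is to compare the fitted coefficients directly rather than only the scores. The sketch below does so for two solvers on synthetic data; the dataset, the tight `tol=1e-8`, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption: stands in for the selected culvert features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    random_state=1)

coefs = {}
for name in ["lbfgs", "newton-cg"]:
    # A tight tolerance lets both solvers get very close to the minimum
    clf = LogisticRegression(solver=name, max_iter=1000, tol=1e-8)
    clf.fit(X_demo, y_demo)
    # Stack weights and intercept into one vector for comparison
    coefs[name] = np.append(clf.coef_.ravel(), clf.intercept_)

# Largest absolute difference between the two solutions
max_gap = np.max(np.abs(coefs["lbfgs"] - coefs["newton-cg"]))
print(max_gap)
```

A gap close to zero indicates that both solvers reached the same minimum, which is why their accuracy and ROC AUC scores match.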