# Task
Generate Python code in Google Colab to perform crop recommendation using an RFXG ensemble model. The code should handle dataset upload, preprocessing, model training, evaluation, saving/loading, and prediction based on user input.

## Dataset upload and loading

### Subtask:
Generate code to upload a CSV file and load it into a Pandas DataFrame.


**Reasoning**:
The subtask is to upload a CSV file and load it into a Pandas DataFrame. The first step is to import necessary libraries and upload the file.



In [1]:
from google.colab import files
import pandas as pd

uploaded = files.upload()

Saving Crop_recommendation.xlsx to Crop_recommendation.xlsx


**Reasoning**:
The file has been uploaded. Now, I need to load the uploaded Excel file into a pandas DataFrame, as the uploaded file is 'Crop_recommendation.xlsx'.



In [2]:
df = pd.read_excel('Crop_recommendation.xlsx')
df.head()

,N,P,K,temperature,humidity,ph,rainfall,label
0,90,42,43,20.879744,82.002744,6.502985,202.935536,rice
1,85,58,41,21.770462,80.319644,7.038096,226.655537,rice
2,60,55,44,23.004459,82.320763,7.840207,263.964248,rice
3,74,35,40,26.491096,80.158363,6.980401,242.864034,rice
4,78,42,42,20.130175,81.604873,7.628473,262.71734,rice
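
The subtask mentions a CSV file, but the uploaded file turned out to be an Excel workbook. A minimal sketch of a loader that picks the reader from the file extension is shown below. `load_table` is a hypothetical helper, not part of the notebook, and reading `.xlsx` files assumes an Excel engine such as `openpyxl` is installed.

```python
from pathlib import Path

import pandas as pd


def load_table(path):
    """Load a CSV or Excel file into a DataFrame based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".xlsx", ".xls"}:
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

In Colab, `files.upload()` returns a dict keyed by file name, so something like `df = load_table(next(iter(uploaded)))` would load whichever file was uploaded.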


## Data preprocessing

### Subtask:
Generate code to handle categorical labels, check for missing values, and split data into features and target.


**Reasoning**:
Handle categorical labels using one-hot encoding, check for and handle missing values, and split data into features and target.



In [3]:
df_encoded = pd.get_dummies(df, columns=['label'], prefix='label')

if df_encoded.isnull().sum().sum() > 0:
    print("Missing values found. Handling by dropping rows with missing values.")
    df_encoded.dropna(inplace=True)
else:
    print("No missing values found.")

X = df_encoded.drop(columns=[col for col in df_encoded.columns if col.startswith('label_')])
y = df_encoded[[col for col in df_encoded.columns if col.startswith('label_')]]

display(X.head())
display(y.head())

No missing values found.


,N,P,K,temperature,humidity,ph,rainfall
0,90,42,43,20.879744,82.002744,6.502985,202.935536
1,85,58,41,21.770462,80.319644,7.038096,226.655537
2,60,55,44,23.004459,82.320763,7.840207,263.964248
3,74,35,40,26.491096,80.158363,6.980401,242.864034
4,78,42,42,20.130175,81.604873,7.628473,262.71734


,label_apple,label_banana,label_blackgram,label_chickpea,label_coconut,label_coffee,label_cotton,label_grapes,label_jute,label_kidneybeans,...,label_mango,label_mothbeans,label_mungbean,label_muskmelon,label_orange,label_papaya,label_pigeonpeas,label_pomegranate,label_rice,label_watermelon
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
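
One-hot encoding turns the single `label` column into 22 indicator columns, which scikit-learn then treats as a multilabel target. Since each row has exactly one crop, an alternative is to keep the target as one column of integer class codes. The sketch below illustrates this with `LabelEncoder` on a short hand-written list of labels. The list is illustrative only, and this is not the encoding the notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only; in the notebook the column would be df['label'].
labels = pd.Series(["rice", "maize", "rice", "apple"], name="label")

encoder = LabelEncoder()
y_codes = encoder.fit_transform(labels)  # 1-D integer array, one code per row

print(list(encoder.classes_))                    # class names in sorted order
print(y_codes)                                   # integer code for each row
print(list(encoder.inverse_transform(y_codes)))  # codes mapped back to crop names
```

With a 1-D target like this, the problem is ordinary multiclass classification, and `encoder.inverse_transform` recovers the crop name from a predicted code.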


## Train-test split

### Subtask:
Generate code to split the data into training and testing sets.


**Reasoning**:
Split the data into training and testing sets using train_test_split and print the shapes.



In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (1760, 7)
Shape of X_test: (440, 7)
Shape of y_train: (1760, 22)
Shape of y_test: (440, 22)
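
The split above is purely random, so individual crops can end up unevenly represented in the test set. A hedged variant is to pass `stratify=` so that each crop keeps roughly the same share in both sets. The sketch uses a small synthetic frame (`X_demo`, `y_demo`, and `crop` are hypothetical stand-ins for the notebook's data) and recovers one class index per row from the one-hot columns with `argmax`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's X and one-hot y (3 crops, 20 rows each).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(60, 7)),
    columns=["N", "P", "K", "temperature", "humidity", "ph", "rainfall"],
)
crop = pd.Series(np.repeat(["rice", "maize", "apple"], 20), name="label")
y_demo = pd.get_dummies(crop, prefix="label")

# One class index per row, taken from the position of the single True column.
strata = y_demo.to_numpy().argmax(axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=strata
)

# Number of test rows per crop; stratification keeps these balanced.
print(y_te.to_numpy().sum(axis=0))
```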


## Model building

**Reasoning**:
Since `VotingClassifier` does not support multilabel-indicator targets such as the one-hot encoded `y`, even when its base estimators are wrapped in `MultiOutputClassifier`, a different approach is needed. A common strategy is to train each base classifier inside its own `MultiOutputClassifier` and then combine their predictions. I will train the two `MultiOutputClassifier` models separately and then define a custom prediction method that averages their probabilities.
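
The averaging step hinges on the shape of what `MultiOutputClassifier.predict_proba` returns: a list with one array per label, each of shape `(n_samples, 2)` when both classes occur in training, holding the probabilities of the negative and positive class. The sketch below uses hand-written NumPy arrays as stand-ins for the two models' outputs (the numbers are illustrative, not model results) and shows how averaging the positive-class column and stacking yields an `(n_samples, n_labels)` prediction matrix.

```python
import numpy as np

# Stand-ins for predict_proba output: one (n_samples, 2) array per label.
# Here: 3 samples, 2 labels. Column 0 = P(label absent), column 1 = P(label present).
rf_proba = [
    np.array([[0.2, 0.8], [0.9, 0.1], [0.6, 0.4]]),  # label 0
    np.array([[0.7, 0.3], [0.1, 0.9], [0.5, 0.5]]),  # label 1
]
xgb_proba = [
    np.array([[0.4, 0.6], [0.8, 0.2], [0.2, 0.8]]),  # label 0
    np.array([[0.9, 0.1], [0.3, 0.7], [0.7, 0.3]]),  # label 1
]

# Average the positive-class column per label, then arrange as (n_samples, n_labels).
avg_positive = np.column_stack(
    [(rf_p[:, 1] + xgb_p[:, 1]) / 2 for rf_p, xgb_p in zip(rf_proba, xgb_proba)]
)
predictions = (avg_positive > 0.5).astype(int)

print(avg_positive.shape)  # (3, 2)
print(predictions)
```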



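In [7]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from xgboost import XGBClassifier

# Base models; MultiOutputClassifier fits one binary classifier per label column
multioutput_rf = MultiOutputClassifier(RandomForestClassifier(random_state=42), n_jobs=-1)
multioutput_xgb = MultiOutputClassifier(
    XGBClassifier(use_label_encoder=False, eval_metric='mlogloss'), n_jobs=-1)

multioutput_rf.fit(X_train, y_train)
multioutput_xgb.fit(X_train, y_train)

class RFXGEnsemble:
    def __init__(self, rf_model, xgb_model):
        self.rf_model = rf_model
        self.xgb_model = xgb_model

    def predict_proba(self, X):
        rf_proba = self.rf_model.predict_proba(X)
        xgb_proba = self.xgb_model.predict_proba(X)

        # predict_proba returns a list of (n_samples, 2) arrays for MultiOutputClassifier,
        # one per label, so average the two models' probabilities label by label
        avg_proba = [(rf_p + xgb_p) / 2 for rf_p, xgb_p in zip(rf_proba, xgb_proba)]
        return avg_proba

    def predict(self, X):
        avg_proba = self.predict_proba(X)
        # Convert probabilities to class labels (threshold of 0.5 for multilabel)
        predictions = [(p > 0.5).astype(int) for p in avg_proba]
        # NOTE: each p is (n_samples, 2), so this stacks to the wrong shape;
        # the evaluation cell below redefines the class with the fix
        return np.vstack(predictions).T

ensemble_model = RFXGEnsemble(multioutput_rf, multioutput_xgb)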

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


## Model evaluation

### Subtask:
Generate code to evaluate the model using accuracy, precision, recall, F1-score, classification report, confusion matrix, and cross-validation.


**Reasoning**:
Evaluate the ensemble model using various metrics and cross-validation.



In [10]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix, multilabel_confusion_matrix
from sklearn.model_selection import cross_val_score, KFold
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Re-define the ensemble model with the corrected predict method
class RFXGEnsemble:
    def __init__(self, rf_model, xgb_model):
        self.rf_model = rf_model
        self.xgb_model = xgb_model

    def predict_proba(self, X):
        rf_proba = self.rf_model.predict_proba(X)
        xgb_proba = self.xgb_model.predict_proba(X)

        # predict_proba returns a list of arrays for MultiOutputClassifier,
        # where each array is (n_samples, n_classes) for that output.
        # For multilabel, n_classes is typically 2 (True/False).
        # We need to average the probabilities for the positive class (index 1) for each output.
        avg_proba = [(rf_p[:, 1] + xgb_p[:, 1]) / 2 for rf_p, xgb_p in zip(rf_proba, xgb_proba)]

        # Stack the averaged probabilities. This will result in a shape (n_labels, n_samples)
        return np.vstack(avg_proba)

    def predict(self, X):
        # Get the averaged probabilities with shape (n_labels, n_samples)
        avg_proba_stacked = self.predict_proba(X)

        # Convert probabilities to class labels (assuming threshold of 0.5 for multilabel)
        # The comparison and astype(int) should be applied element-wise to the stacked probabilities
        predictions = (avg_proba_stacked > 0.5).astype(int)

        # Transpose the predictions array to get the correct shape (n_samples, n_labels)
        return predictions.T

# Re-instantiate and re-train the base models and ensemble
rf_clf = RandomForestClassifier(random_state=42)
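# use_label_encoder is dropped because XGBoost reports it as unused (see the warnings printed earlier);
# each label is a binary problem, so 'logloss' is the matching metric
xgb_clf = XGBClassifier(eval_metric='logloss')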

multioutput_rf = MultiOutputClassifier(rf_clf, n_jobs=-1)
multioutput_xgb = MultiOutputClassifier(xgb_clf, n_jobs=-1)

multioutput_rf.fit(X_train, y_train)
multioutput_xgb.fit(X_train, y_train)

ensemble_model = RFXGEnsemble(multioutput_rf, multioutput_xgb)


# Now, re-attempt evaluation with the corrected ensemble model
y_pred = ensemble_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")

print("\nClassification Report:")
# classification_report expects y_true and y_pred to be of the same shape
print(classification_report(y_test, y_pred, target_names=y_test.columns))

print("\nConfusion Matrix:")
# Multilabel confusion matrix returns a confusion matrix for each label
conf_matrices = multilabel_confusion_matrix(y_test, y_pred)
# Print or visualize the confusion matrices for each label as needed
# For simplicity, we'll just print the shape and the first few
print("Shape of multilabel confusion matrix:", conf_matrices.shape)
print("First 2 confusion matrices:\n", conf_matrices[:2])


# Cross-validation (using Random Forest base model for demonstration):
print("\nCross-validation (using Random Forest base model for demonstration):")
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# Use the MultiOutputClassifier wrapped Random Forest model
cv_results = cross_val_score(multioutput_rf, X, y, cv=kfold, scoring='accuracy')

print(f"Cross-validation scores: {cv_results}")
print(f"Mean cross-validation accuracy: {np.mean(cv_results)}")

Accuracy: 0.9636363636363636
Precision: 0.9865530303030302
Recall: 0.9681818181818181
F1-score: 0.9757841470474776

Classification Report:
                   precision    recall  f1-score   support

      label_apple       1.00      1.00      1.00        23
     label_banana       1.00      1.00      1.00        21
  label_blackgram       1.00      0.95      0.97        20
   label_chickpea       1.00      1.00      1.00        26
    label_coconut       1.00      0.96      0.98        27
     label_coffee       1.00      1.00      1.00        17
     label_cotton       0.94      1.00      0.97        17
     label_grapes       1.00      1.00      1.00        14
       label_jute       0.91      0.91      0.91        23
label_kidneybeans       1.00      1.00      1.00        20
     label_lentil       0.92      1.00      0.96        11
      label_maize       1.00      1.00      1.00        21
      label_mango       1.00      1.00      1.00        19
  label_mothbeans       0.96      

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Cross-validation scores: [0.96818182 0.975      0.97954545 0.98636364 0.97727273]
Mean cross-validation accuracy: 0.9772727272727273
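
Thresholding each label at 0.5 can leave a row with no predicted crop, or with several, and either case counts as a miss because `accuracy_score` on a multilabel target measures exact-match (subset) accuracy. Since every sample has exactly one true crop, one alternative is to pick the label with the highest averaged probability and evaluate the result as an ordinary multiclass problem. The sketch below compares the two on made-up probabilities (illustrative values, not results from the notebook).

```python
import numpy as np

# Rows are samples, columns are labels; values mimic averaged positive-class probabilities.
avg_proba = np.array([
    [0.45, 0.30, 0.10],   # no label exceeds 0.5
    [0.10, 0.85, 0.05],
    [0.60, 0.05, 0.55],   # two labels exceed 0.5
    [0.05, 0.15, 0.90],
])
y_true_onehot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
])

# Thresholding: a row is correct only if every label matches (exact match).
thresholded = (avg_proba > 0.5).astype(int)
exact_match = np.mean(np.all(thresholded == y_true_onehot, axis=1))

# Argmax: exactly one predicted label per row.
y_true = y_true_onehot.argmax(axis=1)
y_pred = avg_proba.argmax(axis=1)
argmax_accuracy = np.mean(y_true == y_pred)

# Multiclass confusion matrix: rows = true class, columns = predicted class.
n_classes = avg_proba.shape[1]
confusion = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(confusion, (y_true, y_pred), 1)

print(exact_match, argmax_accuracy)
print(confusion)
```

In this toy case the first and third rows fail the exact-match test even though the highest-probability label is the right one, which is the kind of gap that argmax decoding removes.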


## Save and load model

### Subtask:
Generate code to save the trained model using `joblib` and provide code to load it.


**Reasoning**:
Save and load the trained ensemble model using joblib.



In [11]:
import joblib

# Save the model
joblib.dump(ensemble_model, 'rfxq_ensemble_model.pkl')

# Load the model
loaded_model = joblib.load('rfxq_ensemble_model.pkl')

print("Ensemble model saved to 'rfxq_ensemble_model.pkl' and loaded successfully.")

Ensemble model saved to 'rfxq_ensemble_model.pkl' and loaded successfully.
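
Two practical points are worth noting about this step. Unpickling a custom class such as `RFXGEnsemble` requires the class definition to be available in the session that loads the file, and `recommend_crop` in the next section reads the label names from the global `y_train`, which will not exist in a fresh session. One option is to save the label names alongside the model in a single bundle. The sketch below shows the idea with a small scikit-learn model standing in for the ensemble; the bundle keys, the label names, and the file name are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model; in the notebook this would be ensemble_model.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 7))
y_demo = np.repeat([0, 1], 20)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Bundle the model with the metadata needed at prediction time.
bundle = {
    "model": model,
    "feature_names": ["N", "P", "K", "temperature", "humidity", "ph", "rainfall"],
    "label_names": ["maize", "rice"],  # illustrative; e.g. derived from y_train.columns
}

path = os.path.join(tempfile.mkdtemp(), "crop_bundle.joblib")  # hypothetical file name
joblib.dump(bundle, path)

restored = joblib.load(path)
print(restored["label_names"])
print(restored["model"].predict(X_demo[:2]))
```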


## User input and prediction

### Subtask:
Generate code to allow user input for soil parameters and predict the recommended crop using the loaded model.


**Reasoning**:
Define a function to get user input, convert it to a DataFrame, predict using the loaded model, and print the recommended crops.



In [12]:
import pandas as pd
import numpy as np

def recommend_crop(model):
    """Gets user input for soil parameters and predicts recommended crop."""

    print("Please enter the soil parameters:")

    while True:
        try:
            n = float(input("Nitrogen (N): "))
            p = float(input("Phosphorus (P): "))
            k = float(input("Potassium (K): "))
            temperature = float(input("Temperature (°C): "))
            humidity = float(input("Humidity (%): "))
            ph = float(input("pH level: "))
            rainfall = float(input("Rainfall (mm): "))
            break
        except ValueError:
            print("Invalid input. Please enter numerical values.")

    # Create a DataFrame from user input
    user_data = pd.DataFrame([[n, p, k, temperature, humidity, ph, rainfall]],
                             columns=['N', 'P', 'K', 'temperature', 'humidity', 'ph', 'rainfall'])

    # Predict probabilities using the loaded model
    # The custom ensemble's predict_proba returns a single stacked array of
    # positive-class probabilities with shape (n_labels, n_samples)
    probabilities_stacked = model.predict_proba(user_data)

    # Transpose to get shape (n_samples, n_labels)
    probabilities = probabilities_stacked.T

    # Crop label names come from the global y_train defined earlier in the notebook
    crop_labels = y_train.columns

    # Determine recommended crops (using a threshold of 0.5)
    recommended_crops_indices = np.where(probabilities[0] > 0.5)[0]

    if len(recommended_crops_indices) > 0:
        print("\nRecommended crop(s):")
        for index in recommended_crops_indices:
            print(f"- {crop_labels[index].replace('label_', '')}")
    else:
        # If no crop is above the threshold, recommend the one with the highest probability
        highest_prob_index = np.argmax(probabilities[0])
        print(f"\nNo specific recommendation above threshold. The most likely crop is: {crop_labels[highest_prob_index].replace('label_', '')} with probability {probabilities[0, highest_prob_index]:.2f}")


# Call the function with the loaded model
recommend_crop(loaded_model)

Please enter the soil parameters:
Nitrogen (N): 90
Phosphorus (P): 42
Potassium (K): 43
Temperature (°C): 25
Humidity (%): 82
pH level: 6.5
Rainfall (mm): 202

Recommended crop(s):
- rice
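
The input loop above only checks that each value parses as a number, so a pH of 70 or a negative rainfall would still be passed to the model. A minimal sketch of a range check is shown below. `validate_soil_inputs` and the bounds in `VALID_RANGES` are assumptions chosen for illustration and should be adjusted to the data actually used.

```python
# Hypothetical bounds: (minimum, maximum); None means no upper limit is enforced.
VALID_RANGES = {
    "N": (0.0, None),
    "P": (0.0, None),
    "K": (0.0, None),
    "temperature": (-50.0, 60.0),  # assumed plausible range in °C
    "humidity": (0.0, 100.0),      # percent
    "ph": (0.0, 14.0),             # conventional pH scale
    "rainfall": (0.0, None),       # mm
}


def validate_soil_inputs(values):
    """Return a list of error messages for missing or out-of-range inputs."""
    errors = []
    for name, (low, high) in VALID_RANGES.items():
        if name not in values:
            errors.append(f"{name}: missing")
            continue
        value = values[name]
        if value < low or (high is not None and value > high):
            upper = "inf" if high is None else high
            errors.append(f"{name}: {value} is outside [{low}, {upper}]")
    return errors


sample = {"N": 90, "P": 42, "K": 43, "temperature": 25,
          "humidity": 82, "ph": 6.5, "rainfall": 202}
print(validate_soil_inputs(sample))                # no errors for this input
print(validate_soil_inputs({**sample, "ph": 70}))  # reports the pH value
```

A function like this could be called inside the `while True` loop, re-prompting whenever the returned list is not empty.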


## Summary:

### Data Analysis Key Findings

*   The dataset used is an Excel file (`Crop_recommendation.xlsx`), not a CSV, requiring `pd.read_excel` for loading.
*   The target variable 'label' contains categorical data and was successfully one-hot encoded.
*   No missing values were found in the dataset after one-hot encoding.
*   The dataset was split into training and testing sets with an 80/20 ratio.
*   The initial attempt to use `VotingClassifier` for the RFXG ensemble failed because `VotingClassifier` does not support multilabel-indicator targets, which is the format the one-hot encoded target takes (each row still has exactly one crop).
*   A custom ensemble class was implemented by training `RandomForestClassifier` and `XGBClassifier` wrapped in `MultiOutputClassifier` separately and averaging their predicted probabilities.
*   An error in the custom ensemble's `predict` method regarding the shape of predictions was identified and corrected, ensuring compatibility with scikit-learn evaluation metrics.
*   The corrected custom ensemble model was successfully evaluated using accuracy, precision, recall, F1-score, classification report, and multilabel confusion matrix.
*   Cross-validation was demonstrated using one of the base models (`MultiOutputClassifier(RandomForestClassifier)`) due to the custom nature of the ensemble.
*   The trained custom ensemble model was successfully saved and loaded using `joblib`.
*   A function was created to take user input for soil parameters, use the loaded model to predict crop probabilities, and recommend crops based on a probability threshold of 0.5, or the highest probability if none exceed the threshold.

### Insights or Next Steps

*   Consider implementing a more sophisticated ensemble strategy for multilabel classification, such as stacking or blending, rather than simple probability averaging, potentially using libraries that support multilabel ensembles.
*   For the custom ensemble model, consider implementing scikit-learn's `BaseEstimator` and `ClassifierMixin` interfaces to make it compatible with standard scikit-learn utilities like cross-validation and grid search.
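
The second suggestion can be sketched as follows. This is a minimal illustration, not the notebook's model: it assumes a single-column target of class labels rather than the one-hot matrix, uses `GradientBoostingClassifier` as a stand-in for `XGBClassifier` so the example depends only on scikit-learn, and runs on synthetic data from `make_classification`. The class name `AveragingEnsemble` and its parameter names are hypothetical.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score


class AveragingEnsemble(ClassifierMixin, BaseEstimator):
    """Average the class probabilities of two classifiers (soft voting)."""

    def __init__(self, rf_model, boost_model):
        # Store parameters unchanged so get_params/clone work as expected.
        self.rf_model = rf_model
        self.boost_model = boost_model

    def fit(self, X, y):
        # Fit fresh copies so the estimators passed in are left untouched.
        self.rf_model_ = clone(self.rf_model).fit(X, y)
        self.boost_model_ = clone(self.boost_model).fit(X, y)
        self.classes_ = self.rf_model_.classes_
        return self

    def predict_proba(self, X):
        return (self.rf_model_.predict_proba(X) + self.boost_model_.predict_proba(X)) / 2

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


# Synthetic 3-class data with 7 features, standing in for the crop dataset.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=7, n_informative=5, n_classes=3, random_state=0
)

model = AveragingEnsemble(
    RandomForestClassifier(n_estimators=20, random_state=42),
    GradientBoostingClassifier(n_estimators=20, random_state=42),
)

# Because the class follows the estimator interface, cross_val_score accepts it directly.
scores = cross_val_score(model, X_demo, y_demo, cv=3)
print(scores)
```

Written this way, the whole ensemble can be cross-validated, rather than only one of its base models.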
