In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler

# Model Training and Evaluation
In this notebook, we begin evaluating machine learning models on the cleaned dataset, starting with a RandomForestClassifier. We split the data into training and test sets, scale the features, train the model, and evaluate it using accuracy, a confusion matrix, and a classification report.

In [2]:
# Load the cleaned data
df = pd.read_csv('data/processed/cleaned_data.csv')
df.head()

### Split the Data into Features and Target
We separate the feature columns from the target variable before training. The code below assumes the target column is named 'target'; replace it with the actual column name in your dataset.

In [3]:
X = df.drop(columns=['target'])  # Replace 'target' with the actual column name
y = df['target']  # Replace 'target' with the actual target column name
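Before moving on, it can help to confirm that the target column exists and to look at how the classes are distributed, because a strongly imbalanced target makes plain accuracy hard to interpret. The following is a minimal sketch on a small made-up DataFrame; `demo_df`, its column names, and `target_col` are illustrative assumptions, not part of the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataset; column names are illustrative.
demo_df = pd.DataFrame({
    "feature_a": [0.1, 0.4, 0.35, 0.8, 0.9, 0.2],
    "feature_b": [1, 0, 1, 1, 0, 0],
    "target": [0, 0, 1, 1, 1, 0],
})

target_col = "target"  # assumed name; replace with the real target column

# Fail early with a readable message if the column is missing.
if target_col not in demo_df.columns:
    raise KeyError(f"Column {target_col!r} not found in {list(demo_df.columns)}")

X_demo = demo_df.drop(columns=[target_col])
y_demo = demo_df[target_col]

# Relative class frequencies; a strong imbalance makes accuracy misleading.
class_share = y_demo.value_counts(normalize=True)
print(class_share)
```

On the real data, the same check would run on df with the actual target column name.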

### Split the Data into Training and Test Sets
We'll split the dataset into training and test sets (80% training, 20% testing). Fixing random_state makes the split reproducible across runs.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
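If the target classes are imbalanced, a purely random split can leave the test set with too few examples of the minority class. One option is the `stratify` argument of `train_test_split`, which keeps the class proportions roughly equal in both splits. This is a sketch on synthetic labels (90 negatives, 10 positives), not on the notebook's data; `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data for illustration only: 90 negatives, 10 positives.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 90 + [1] * 10)

# stratify=y_demo preserves the 90/10 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train shape:", X_tr.shape, "test shape:", X_te.shape)
print("positive share in train:", y_tr.mean())
print("positive share in test:", y_te.mean())
```

Whether stratification is needed here depends on the class balance of the actual target, which this notebook does not show.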

### Feature Scaling
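Scaling helps models that are sensitive to feature magnitude, such as linear models, support vector machines, and k-nearest neighbors, especially when the feature values have very different ranges. Tree-based models such as Random Forest are largely insensitive to feature scale, so this step is optional for the model trained below. The scaler is fit on the training set only and then applied to the test set, so that no test-set statistics leak into training.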

In [5]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
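An alternative to scaling by hand is to bundle the scaler and the model in a scikit-learn `Pipeline`, which fits the scaler on the training data only and reapplies it automatically at prediction time. The sketch below uses synthetic data from `make_classification` as a stand-in for the cleaned dataset; the step names "scaler" and "clf" are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned dataset (assumption, for illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on X_tr only, then trains the classifier
# on the scaled training data.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipe.fit(X_tr, y_tr)

# score() scales X_te with the training statistics before predicting.
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.4f}")
```

This design also keeps preprocessing leak-free inside cross-validation, since the scaler is refit on each training fold.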

### Train the Random Forest Classifier
We train a Random Forest model with 100 trees on the scaled training data and generate predictions for the test set.

In [6]:
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
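A fitted Random Forest also exposes `feature_importances_`, which gives a rough view of which features the trees relied on. The sketch below shows one way to pair those values with column names; the data is synthetic and the `feature_0` to `feature_4` names are placeholders. In this notebook, the equivalent would pair model.feature_importances_ with X.columns, since scaling keeps the column order.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names (assumption, for illustration).
X_arr, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(5)]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; sort so the strongest features come first.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality features, so they are best read as a first hint rather than a definitive ranking.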

### Evaluate the Model
We'll evaluate the predictions using accuracy, a confusion matrix, and a classification report, which lists per-class precision, recall, and F1-score.

In [7]:
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(class_report)
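The raw confusion matrix printed above has no labels, which makes it easy to misread. In scikit-learn, rows correspond to true classes and columns to predicted classes. The sketch below wraps a matrix in a labeled DataFrame; the label lists are hand-written toy values for a binary problem, not results from this notebook's model.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy binary labels, made up purely to illustrate the layout.
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 1, 0, 1, 0]

labels = [0, 1]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

# Rows are true classes, columns are predicted classes.
cm_df = pd.DataFrame(
    cm,
    index=[f"true_{label}" for label in labels],
    columns=[f"pred_{label}" for label in labels],
)
print(cm_df)

# For a binary problem, ravel() yields the four cells in this order.
tn, fp, fn, tp = (int(v) for v in cm.ravel())
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

The same wrapping works for conf_matrix, using the class labels present in y_test.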

### Conclusion
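Based on these metrics, we can judge whether the Random Forest Classifier is suitable for our task. Accuracy alone can be misleading when the classes are imbalanced, so the per-class precision and recall in the classification report should be checked as well. Possible next steps are to compare other models or to improve this one through hyperparameter tuning.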
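As a starting point for hyperparameter tuning, a small grid search with cross-validation could look like the sketch below. The data is synthetic, and the grid values for `n_estimators` and `max_depth` are arbitrary examples rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption, for illustration only).
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=42)

# Deliberately tiny grid so the search runs quickly; values are examples.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                # 3-fold cross-validation on the supplied data
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")
```

In the notebook, the search would be fit on the training split only, keeping the test set for a final, one-time evaluation of search.best_estimator_.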