### K-Fold Cross Validation
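K-fold cross-validation is a technique for estimating how well a machine learning model performs on unseen data. The dataset is split into k folds of roughly equal size; the model is trained on k-1 folds and evaluated on the remaining one, and this is repeated k times so that every fold serves as the test set exactly once. Averaging the k scores gives a more reliable estimate of generalization than a single train-test split and makes overfitting to one particular split easier to detect.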
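To make the splitting concrete, here is a minimal sketch of how scikit-learn's `KFold` partitions sample indices. The ten-row array `X_toy` and the variable names are illustrative assumptions, not part of the dataset used later in this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, purely illustrative (not the review dataset used below).
X_toy = np.arange(10).reshape(-1, 1)

# Five folds; shuffling with a fixed seed keeps the split reproducible.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

folds = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_toy), start=1):
    folds.append((train_idx, test_idx))
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

# Collect every held-out index to confirm each sample is tested exactly once.
held_out = np.sort(np.concatenate([test for _, test in folds]))
print("Held-out indices across all folds:", held_out.tolist())
```

Each fold holds out two of the ten samples, and the union of the held-out sets covers the whole dataset with no overlap.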

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [3]:


# Load your dataset
data = pd.read_csv("D:/Associate - Junior DS Assessment/Junior (A - L2) Data Science/Data/final_ds_nlp/modified_final_file.csv")
# Drop rows that are missing a rating or a sentiment value
data = data.dropna(subset=['raw_rating', 'sentiment2'])
# Select features and labels (for simplicity, 'raw_rating' is the only feature and 'Sentiment' is the label)
X = data[['raw_rating']]  # Features
y = data['Sentiment']  # Target labels

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a simple Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy without K-Fold Cross-Validation: {accuracy:.2f}")


Accuracy without K-Fold Cross-Validation: 0.79


In [12]:
from sklearn.model_selection import cross_val_score

# Apply K-Fold Cross-Validation with 10 folds (cv=10)
k_fold_accuracy = cross_val_score(model, X, y, cv=10, scoring='accuracy')

# Print the accuracy for each fold
print(f"Accuracy for each fold: {k_fold_accuracy}")

# Calculate the mean accuracy across all folds
mean_accuracy = k_fold_accuracy.mean()
print(f"Mean accuracy with K-Fold Cross-Validation: {mean_accuracy:.2f}")


Accuracy for each fold: [0.79414032 0.78026214 0.79182729 0.77949113 0.79336931 0.79182729
 0.76869699 0.7925983  0.78874325 0.79243827]
Mean accuracy with K-Fold Cross-Validation: 0.79


### Why Accuracy May Not Improve

- **Model Stability:** K-fold cross-validation is an evaluation method; it does not change how the model is trained, so it is not expected to raise accuracy. If the model is stable and the dataset is representative, accuracy will be similar across different splits, as with the 0.79 obtained both ways here.

- **Data Size:** With a small dataset, each test fold contains few samples, so per-fold accuracy is noisier and individual folds can score noticeably lower or higher than a single split.

- **Overfitting:** If the initial train-test split happened to be favorable, or the model was tuned against that one test set, K-Fold Cross-Validation reveals it through a lower average accuracy across the folds.
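The points above can be illustrated by comparing several single train-test splits with one cross-validated estimate. This is a rough sketch on synthetic data from `make_classification` (an assumption, standing in for the notebook's dataset); `X_demo`, `y_demo`, and the chosen sample size are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (an assumption; not the dataset from this notebook).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Accuracy from several single 80/20 splits that differ only in the random seed.
single_split_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    single_split_scores.append(accuracy_score(y_te, clf.predict(X_te)))

# One 5-fold cross-validated estimate on the same data.
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, scoring="accuracy"
)

print("Single-split accuracies:", np.round(single_split_scores, 3))
print(f"5-fold CV: mean={cv_scores.mean():.3f}, std={cv_scores.std():.3f}")
```

The single-split numbers shift with the seed, while the cross-validated mean summarizes performance over every sample. Neither procedure changes the model itself, which is why the headline accuracy need not go up.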

### What to Focus On

- **Consistency:** Check whether accuracy is consistent across folds. Here the ten fold scores range from about 0.77 to 0.79, a narrow spread that indicates a stable model that generalizes consistently.

- **Bias-Variance Tradeoff:** K-Fold Cross-Validation helps you see whether your model has high variance (performing well on some folds but poorly on others) or high bias (underperforming on every fold, which suggests the model is too simple for the data).
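One way to check both points is to report the spread of fold scores together with the gap between training and test accuracy. The sketch below uses scikit-learn's `cross_validate` with `return_train_score=True` on synthetic data; the data and variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (an assumption; not the dataset from this notebook).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Collect both training and held-out accuracy for each of the 5 folds.
results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=5,
    scoring="accuracy",
    return_train_score=True,
)
train_scores = results["train_score"]
test_scores = results["test_score"]

print(f"Train accuracy: mean={train_scores.mean():.3f}")
print(f"Test accuracy:  mean={test_scores.mean():.3f}, std={test_scores.std():.3f}")
print(f"Train/test gap: {train_scores.mean() - test_scores.mean():.3f}")
```

As a rule of thumb, a large standard deviation or a large train/test gap points toward high variance, while low accuracy on both training and test folds points toward high bias.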
