# Project Part III – Model Selection and Regularization

This notebook performs:
- Model selection using 5-fold cross-validation
- Logistic regression with L1, L2, and Elastic Net regularization
- Feature engineering and data preprocessing


In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report
import warnings
warnings.filterwarnings('ignore')


## Part A: Model Selection

### 1. Data Loading and Preprocessing
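We load the dataset, drop rows with missing values and duplicate rows, and derive two features: `Total_toxic_replies` (direct plus nested toxic replies) and the binary target `Toxic_conversation` (1 if a conversation has at least one toxic reply).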


In [2]:
def load_and_preprocess_data():
    print("=" * 60)
    print("PART A: MODEL SELECTION")
    print("=" * 60)
    
    print("1. Loading and preprocessing data...")
    df = pd.read_csv('down_data.csv')
    df = df.dropna()
    df = df.drop_duplicates()
    
    df['Total_toxic_replies'] = df['Num_toxic_direct_replies'] + df['Num_toxic_nested_replies']
    df['Toxic_conversation'] = (df['Total_toxic_replies'] > 0).astype(int)
    
    print(f"Dataset shape after preprocessing: {df.shape}")
    print(f"Columns: {df.columns.tolist()}")
    print("Toxic conversation distribution:")
    print(df['Toxic_conversation'].value_counts())
    
    return df

df = load_and_preprocess_data()


PART A: MODEL SELECTION
1. Loading and preprocessing data...
Dataset shape after preprocessing: (28818, 20)
Columns: ['Unnamed: 0', 'Tweet', 'Followers', 'Friends', 'Num_tweets', 'Verified', 'Listed_count', 'Location', 'Age', 'Length', 'Num_users', 'Num_author_replies', 'TOXICITY_x', 'Num_toxic_direct_replies', 'Num_toxic_nested_replies', 'Num_author_toxic_replies', 'Num_toxic_replies', 'Toxic', 'Total_toxic_replies', 'Toxic_conversation']
Toxic conversation distribution:
Toxic_conversation
0    21613
1     7205
Name: count, dtype: int64
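
As a reference point for the scores reported later, the class counts above imply two trivial baselines. The calculation below is an added illustration, not part of the original notebook: it hard-codes the counts from the output above and assumes a classifier that always predicts a single class.

```python
# Class counts copied from the value_counts() output above.
n_negative = 21613  # Toxic_conversation == 0
n_positive = 7205   # Toxic_conversation == 1
n_total = n_negative + n_positive

# Baseline 1: always predict the majority class (0).
majority_accuracy = n_negative / n_total

# Baseline 2: always predict the positive class (1).
# Precision then equals the positive rate and recall is 1.
precision = n_positive / n_total
recall = 1.0
always_positive_f1 = 2 * precision * recall / (precision + recall)

print(f"Majority-class accuracy: {majority_accuracy:.4f}")
print(f"Always-positive F1:      {always_positive_f1:.4f}")
```

These come out at roughly 0.75 and 0.40, so a useful model should beat the first on accuracy and the second on F1.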


### 2. Model Training and Cross-Validation

We train three models using 5-fold cross-validation:
- Logistic Regression
- Support Vector Machine
- Random Forest

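We report mean accuracy and mean F1 across the folds, select the model with the highest mean F1, and evaluate that model on the held-out test set.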
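
The selection procedure can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the notebook's own cell: the data is synthetic (`make_classification` stands in for `down_data.csv`), the `demo_*` names are hypothetical, and the `StandardScaler` step is an addition that the notebook does not use. It also relies on `cross_validate`, which computes several metrics in one pass instead of calling `cross_val_score` once per metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's data: 6 features, about 25% positives.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.75, 0.25], random_state=42,
)

# Assumption: features are standardized inside each fold before fitting.
demo_models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC(random_state=42)),
}

demo_results = {}
for name, model in demo_models.items():
    # One 5-fold run per model yields both metrics on the same folds.
    cv = cross_validate(model, X_demo, y_demo, cv=5, scoring=["accuracy", "f1"])
    demo_results[name] = {
        "avg_accuracy": cv["test_accuracy"].mean(),
        "avg_f1": cv["test_f1"].mean(),
    }
    print(f"{name}: accuracy={demo_results[name]['avg_accuracy']:.4f}, "
          f"f1={demo_results[name]['avg_f1']:.4f}")

# Pick the model with the highest mean F1, as in the procedure described above.
best_demo_model = max(demo_results, key=lambda n: demo_results[n]["avg_f1"])
print(f"Best by F1: {best_demo_model}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into preprocessing.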


In [None]:
def part_a_model_selection(df):
    print("\n2. Splitting data (80% training, 20% testing)...")
    features_a = ['Length', 'Num_users', 'TOXICITY_x', 'Num_author_replies', 'Verified', 'Age']
    # Copy the slice so the cast below does not write to a view of df.
    X = df[features_a].copy()
    y = df['Toxic_conversation']
    X['Verified'] = X['Verified'].astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
        'Support Vector Machine': SVC(random_state=42),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
    }
    
    results = {}
    for name, model in models.items():
        print(f"\n{name}:")
        acc = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
        f1 = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
        print(f"Average Accuracy: {acc.mean():.4f}")
        print(f"Average F1 Score: {f1.mean():.4f}")
        results[name] = {'model': model, 'avg_accuracy': acc.mean(), 'avg_f1': f1.mean()}
    
    best_model_name = max(results.keys(), key=lambda x: results[x]['avg_f1'])
    best_model = results[best_model_name]['model']
    best_model.fit(X_train, y_train)
    y_pred = best_model.predict(X_test)
    
    print(f"\nBest Model: {best_model_name}")
    print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(f"Test F1 Score: {f1_score(y_test, y_pred):.4f}")
    print(classification_report(y_test, y_pred))
    
    return results, best_model_name

results_a, best_model_name = part_a_model_selection(df)



2. Splitting data (80% training, 20% testing)...

Logistic Regression:
Average Accuracy: 0.7786
Average F1 Score: 0.3038

Support Vector Machine:


## Part B: Regularization Techniques

We explore:
- Unregularized Logistic Regression
- L1 (Lasso)
- L2 (Ridge)
- Elastic Net

We use a wider set of 11 features and compare cross-validated accuracy and F1 across the four variants.
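
One detail worth checking when setting up an unregularized baseline: scikit-learn's `LogisticRegression` applies an L2 penalty with `C=1.0` unless told otherwise. The sketch below illustrates this on synthetic data. The dataset and variable names are assumptions for the example, and a very large `C` is used here only as a stand-in for "almost no penalty".

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to compare penalty settings.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

# Default settings versus an explicitly requested L2 penalty with C=1.0.
default_model = LogisticRegression(max_iter=5000).fit(X_demo, y_demo)
explicit_l2 = LogisticRegression(penalty="l2", C=1.0, max_iter=5000).fit(X_demo, y_demo)

# A very large C makes the penalty negligible (approximately unregularized).
weak_penalty = LogisticRegression(C=1e6, max_iter=5000).fit(X_demo, y_demo)

same_as_l2 = np.allclose(default_model.coef_, explicit_l2.coef_)
default_norm = np.linalg.norm(default_model.coef_)
weak_norm = np.linalg.norm(weak_penalty.coef_)

print(f"Default equals explicit L2: {same_as_l2}")
print(f"Coefficient norm, default:      {default_norm:.4f}")
print(f"Coefficient norm, weak penalty: {weak_norm:.4f}")
```

The weakly penalized fit is expected to have an equal or larger coefficient norm than the default, since the default shrinks coefficients toward zero.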


In [None]:
def part_b_regularization(df):
    features_b = ['Length', 'Num_users', 'TOXICITY_x', 'Num_author_replies', 
                  'Verified', 'Age', 'Followers', 'Friends', 'Num_tweets', 
                  'Location', 'Listed_count']
    
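    # Copy the slice so the casts below do not write to a view of df.
    X = df[features_b].copy()
    y = df['Toxic_conversation']
    X['Verified'] = X['Verified'].astype(int)
    # Encode each distinct location string as an integer code.
    X['Location'] = X['Location'].astype('category').cat.codes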

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    models = {
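        # Note: scikit-learn's default is penalty='l2' with C=1.0, so this entry is
        # identical to 'L2' below; a truly unregularized fit needs penalty=None
        # (scikit-learn >= 1.2).
        'Unregularized': LogisticRegression(max_iter=1000, random_state=42),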
        'L1': LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000, random_state=42),
        'L2': LogisticRegression(penalty='l2', max_iter=1000, random_state=42),
        'ElasticNet': LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, 
                                         max_iter=1000, random_state=42)
    }

    scores = {}
    for name, model in models.items():
        acc = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
        f1 = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
        print(f"\n{name} - Accuracy: {acc.mean():.4f}, F1 Score: {f1.mean():.4f}")
        scores[name] = (acc.mean(), f1.mean())

    return scores

results_b = part_b_regularization(df)



Unregularized - Accuracy: 0.7710, F1 Score: 0.2481

L1 - Accuracy: 0.7788, F1 Score: 0.3074

L2 - Accuracy: 0.7710, F1 Score: 0.2481

ElasticNet - Accuracy: 0.7517, F1 Score: 0.0604


## Discussion: Regularization Techniques

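- **L1 (Lasso)**: Penalizes the sum of absolute coefficient values, which drives some coefficients to exactly zero and so acts as built-in feature selection.
- **L2 (Ridge)**: Penalizes the sum of squared coefficients, shrinking all of them toward zero without eliminating any, which helps reduce overfitting.
- **Elastic Net**: Mixes the L1 and L2 penalties (here in equal parts, `l1_ratio=0.5`), which is useful when features are correlated.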
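
The difference between L1 and L2 shrinkage can be seen by counting exactly-zero coefficients. The following is a small sketch on synthetic data, not a result from this project: the dataset (3 informative features out of 20), the strong penalty `C=0.02`, and the variable names are all assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 20 features, of which only 3 carry signal; the other 17 are noise.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=3, n_redundant=0, random_state=0
)

# Same solver and penalty strength, differing only in the penalty type.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.02).fit(X_demo, y_demo)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.02).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print(f"Zero coefficients with L1: {n_zero_l1} of {l1_model.coef_.size}")
print(f"Zero coefficients with L2: {n_zero_l2} of {l2_model.coef_.size}")
```

With L1, most of the noise features should drop out entirely, while L2 keeps every coefficient small but non-zero.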


## Conclusion

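- **Best Model in Part A**: Chosen by highest mean cross-validated F1. Logistic Regression scored 0.7786 accuracy and 0.3038 F1; the output above does not include scores for the other two models.
- **Best Regularization**: L1, with the highest cross-validated F1 (0.3074) and accuracy (0.7788). The "Unregularized" and L2 variants score identically (0.7710 accuracy, 0.2481 F1) because scikit-learn's `LogisticRegression` applies an L2 penalty by default. Elastic Net scores lowest (0.7517 accuracy, 0.0604 F1).

These results guide the trade-off between accuracy, model simplicity, and generalization.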
