# Part 4: Build the Best Classifier

In this notebook, you will build the best classifier you can for either the **binary** or **multiclass** task from the previous notebooks.

- **Binary task**: Predict whether a bill was assigned to the "Housing and Economic Development" committee (`data/y.json`)
- **Multiclass task**: Predict which committee a bill was assigned to (`data/y_multi.json`)

You may use any scikit-learn estimator, pipeline, or preprocessing technique. You are not required to implement anything from scratch.

**Grading**: Your code must run end-to-end without errors, and your written responses must be substantive and demonstrate your reasoning. Raw performance numbers are not graded — we care about your process and justification.

**It may be tempting to spend a lot of time on this question trying to eke out the best possible performance. Don't do that! Commit to a model and tune the hyperparameters, then focus on writing your analysis.**

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
# Load the features and binary labels from the data directory
X_df = pd.read_json('data/X.json')
y_df = pd.read_json('data/y.json')
X = np.array(X_df['text_embedding'].tolist())
y = y_df['committee_bool'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=6140, stratify=y
)
# Build the pipeline: StandardScaler followed by LogisticRegression with class_weight='balanced'
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(random_state=42, max_iter=1000, class_weight='balanced'))
])
# Hyperparameter combinations to search over
param_grid = {
    'clf__C': [0.01, 0.1, 1, 10],
    'clf__solver': ['lbfgs', 'liblinear']
}
# Run the grid search with 3-fold cross-validation, scored on F1
grid = GridSearchCV(pipeline, param_grid, cv=3, scoring='f1', n_jobs=-1)
grid.fit(X_train, y_train)
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
# Print the results
print(f'Best Parameters: {grid.best_params_}')
print(f'Test F1-Score: {f1_score(y_test, y_pred):.4f}')
print('Classification Report:')
print(classification_report(y_test, y_pred))


Best Parameters: {'clf__C': 0.1, 'clf__solver': 'lbfgs'}
Test F1-Score: 0.6977
Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.97      0.97       241
           1       0.65      0.75      0.70        20

    accuracy                           0.95       261
   macro avg       0.82      0.86      0.84       261
weighted avg       0.95      0.95      0.95       261



## Written Responses

Answer each question below. Aim for 2-4 sentences per question — be specific and reference your results where relevant.

### Task & Setup

**Q1.** Which task did you pick (binary or multiclass), and why?

**Q2.** What preprocessing steps did you apply to the features, if any? Why did you make those choices?

1)  I selected the binary classification task. Identifying one specific category ('Housing and Economic Development') gives a clear binary target, which let me concentrate on recall and F1-score for that class. It also mirrors a real-world scenario in which an analyst filters legislative content for a single topic.

2)  I used the text_embedding features rather than the title embeddings because the full text carries more context. I applied StandardScaler because the regularization penalty in Logistic Regression shrinks every coefficient by the same rule, so features on different scales would be penalized unevenly, and gradient-based solvers such as lbfgs converge more reliably on standardized inputs.
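As a minimal sketch of what the scaling step does, the example below standardizes a small synthetic matrix whose columns have very different spreads. The data and the names `X_demo`, `col_means`, and `col_stds` are assumptions made for illustration; they are not the bill embeddings.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix (hypothetical, not the real embeddings):
# three columns with the same mean but very different standard deviations.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=[0.1, 1.0, 10.0], size=(200, 3))

# StandardScaler subtracts each column's mean and divides by its standard deviation.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print('means:', col_means.round(6))
print('stds: ', col_stds.round(6))
```

Placing the scaler inside the `Pipeline`, as the notebook does, means it is refit on the training portion of each cross-validation fold, so no statistics from held-out data leak into training.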


### Model Selection

**Q3.** Which model(s) did you try? Which did you select as your final model, and why did it outperform the others?

I initially considered a RandomForestClassifier but committed to LogisticRegression as my final model. The 384-dimensional neural text embeddings tend to separate classes roughly linearly, with few of the complex feature interactions that trees are good at capturing. A linear decision boundary from Logistic Regression is efficient on dense, high-dimensional embeddings like these and, with regularization, is less prone to severe overfitting.
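The notebook only fits the logistic regression, so the comparison with a random forest was not actually run. One way it might be checked is sketched below on synthetic, imbalanced data from `make_classification`; the dataset, the `candidates` dictionary, and all settings are assumptions for illustration, and results on the real embeddings could differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced problem (hypothetical stand-in for the bill embeddings).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=30, n_informative=10,
    weights=[0.9, 0.1], random_state=0
)

# Two candidate models evaluated with the same folds and the same metric.
candidates = {
    'logreg': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(max_iter=1000, class_weight='balanced')),
    ]),
    'forest': RandomForestClassifier(
        n_estimators=100, class_weight='balanced', random_state=0
    ),
}

# Mean 3-fold cross-validated F1 for each candidate.
cv_f1 = {
    name: cross_val_score(model, X_demo, y_demo, cv=3, scoring='f1').mean()
    for name, model in candidates.items()
}
for name, score in cv_f1.items():
    print(f'{name}: mean CV F1 = {score:.3f}')
```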


### Hyperparameter Tuning

**Q4.** How did you tune your model's hyperparameters? What values or settings worked best?

I used GridSearchCV with 3-fold cross-validation, scored on F1, to compare regularization settings C (0.01, 0.1, 1, 10) and solvers (lbfgs, liblinear). I also fixed class_weight='balanced' since housing bills are a minority class. The grid search selected the lbfgs solver with C=0.1. Because C is the inverse of the regularization strength, this is stronger regularization than the default C=1, which helps keep the weights from overfitting the training split.
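To see how close the other settings came rather than only the winner, the fitted search's `cv_results_` can be turned into a table. The sketch below reruns the same grid on synthetic data so that it is self-contained; `X_demo`, `y_demo`, `grid_demo`, and `results` are names assumed for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (hypothetical stand-in for the bill embeddings).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0
)

pipeline_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced')),
])
param_grid_demo = {
    'clf__C': [0.01, 0.1, 1, 10],
    'clf__solver': ['lbfgs', 'liblinear'],
}

grid_demo = GridSearchCV(pipeline_demo, param_grid_demo, cv=3, scoring='f1')
grid_demo.fit(X_demo, y_demo)

# One row per hyperparameter combination, best-ranked first.
results = (
    pd.DataFrame(grid_demo.cv_results_)[[
        'param_clf__C', 'param_clf__solver',
        'mean_test_score', 'std_test_score', 'rank_test_score',
    ]]
    .sort_values('rank_test_score')
    .reset_index(drop=True)
)
print(results)
```

If the top few rows differ by less than their `std_test_score`, the choice among them is not strongly supported by the data.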


### Evaluation

**Q5.** Which metric(s) did you use to evaluate your model? Why are they appropriate for your chosen task and dataset?


I evaluated the model using F1-score as the primary optimization target in the grid search. Because the class of interest (Housing bills) is a minority, a naive model could reach very high accuracy by predicting the negative class for every bill. F1 is the harmonic mean of precision and recall on the positive class, so it rewards the model only for how well it retrieves these rare positive bills.
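The accuracy problem can be made concrete with the test-set class counts from the classification report above (241 negatives, 20 positives). The sketch below scores a majority-class baseline on labels with those counts; the placeholder feature matrix is an assumption, since `DummyClassifier` ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Labels with the same class counts as the test set in the report: 241 zeros, 20 ones.
y_true = np.array([0] * 241 + [1] * 20)

# DummyClassifier ignores the features, so a placeholder column is enough.
X_placeholder = np.zeros((len(y_true), 1))

# Always predict the most frequent class (0).
baseline = DummyClassifier(strategy='most_frequent').fit(X_placeholder, y_true)
y_base = baseline.predict(X_placeholder)

baseline_accuracy = accuracy_score(y_true, y_base)
baseline_f1 = f1_score(y_true, y_base, zero_division=0)
print(f'accuracy = {baseline_accuracy:.4f}')  # 241 / 261 = 0.9234
print(f'F1       = {baseline_f1:.4f}')        # 0.0000: no positives are retrieved
```

The baseline's accuracy is only about two points below the tuned model's 0.95, while its F1 is zero, which is why F1 is the more informative metric here.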


### Reflection

**Q6.** What would you try next if you had more time? Are there modeling choices, features, or techniques you were curious about but didn't get to?

With more time, I would explore feed-forward neural networks (MLPClassifier) to see if capturing nonlinear relationships between text embeddings yields better classification boundaries. Since scikit-learn's MLPClassifier does not provide dropout, I would regularize it with its L2 penalty (alpha) and early stopping, or move to a deep learning framework if dropout layers turned out to matter. Additionally, I would experiment with passing the embeddings through dimensionality reduction such as PCA or UMAP as a pipeline step to see if compressing the feature dimensions first helps generalization.
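A rough sketch of the PCA idea is shown below: the reduction sits between the scaler and the classifier so that it is refit inside each cross-validation fold, and the number of components is tuned like any other hyperparameter. The synthetic data, the grid values, and the names `pca_pipeline` and `grid_pca` are assumptions for illustration. UMAP is left out because it is not part of scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (hypothetical stand-in for the 384-dimensional embeddings).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=40, n_informative=10,
    weights=[0.9, 0.1], random_state=0
)

# Scale, reduce dimensionality, then classify.
pca_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(random_state=0)),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced')),
])

# Tune the number of retained components together with the regularization setting.
grid_pca = GridSearchCV(
    pca_pipeline,
    {'pca__n_components': [5, 10, 20], 'clf__C': [0.1, 1]},
    cv=3,
    scoring='f1',
)
grid_pca.fit(X_demo, y_demo)

print('Best parameters:', grid_pca.best_params_)
print(f'Best mean CV F1: {grid_pca.best_score_:.3f}')
```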
