
## Project Overview: **MovieSentiment-ML**

This project demonstrates **sentiment analysis** on the Cornell movie review polarity dataset (v2.0, 2000 reviews: 1000 positive and 1000 negative) using scikit-learn.

### Key Components:

**1. Dataset Parser**
- Loads positive/negative movie reviews
- Returns raw text and labels {-1, +1}

**2. Feature Engineering**
- **BOW (Bag of Words)**: Binary word presence/absence
- **TF-IDF**: Term frequency × inverse document frequency weighting

**3. Optimization Methods**
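- **SGD-Optimal**: scikit-learn's `optimal` schedule, where the step size decays as 1 / (alpha · (t + t0)) and t0 is chosen by a heuristic
- **SGD-Adaptive**: the step size stays at `eta0` while the training loss keeps improving and is divided by 5 when it stalls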
- **SVM-SMO**: Sequential Minimal Optimization
- **SVM-RBF**: Radial Basis Function kernel

**4. Evaluation**
- **5-fold cross-validation** for robust performance estimation
- Comparison across feature × optimizer combinations

### Results Analysis:
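The experiments compare feature × classifier combinations to find the best-performing setup for movie review sentiment classification. In the 5-fold comparisons below, most linear models land between 82% and 86% accuracy, and the best grid-searched configuration is TF-IDF with logistic regression at 86.9%.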

# Part 1: Parsing the dataset

**Implementation task:** Implement a parser for the dataset. The output should be a list/array of strings (`X_raw`) and a list/array of labels (`y`) encoded as {-1,1}.

Dataset structure (Review Polarity v2.0):

- Positive reviews: 1000
- Negative reviews: 1000

In [1]:
import os
import numpy as np

def parser(dataset_path='./txt_sentoken'):
    X_raw, y = [], []
    
    pos_folder_path = os.path.join(dataset_path, 'pos')
    for filename in os.listdir(pos_folder_path):
        file_path = os.path.join(pos_folder_path, filename)
        with open(file_path, "r", encoding="utf-8", errors='ignore') as file:
            X_raw.append(file.read().strip().lower())
        y.append(1)
    

    neg_folder_path = os.path.join(dataset_path, 'neg')
    for filename in os.listdir(neg_folder_path):
        file_path = os.path.join(neg_folder_path, filename)
        with open(file_path, "r", encoding="utf-8", errors='ignore') as file:
            X_raw.append(file.read().strip().lower())
        y.append(-1)
    
    y = np.array(y)

    print(f"Total samples: {len(X_raw)}")
    print(f"Positive samples: {sum(y == 1)}")
    print(f"Negative samples: {sum(y == -1)}")

    return X_raw, y

if __name__ == "__main__":
    X_raw, y = parser()
    print(X_raw[0])  
    print(y[0])  


Total samples: 2000
Positive samples: 1000
Negative samples: 1000
films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . 
in other words , don't dismiss this film because of its source . 
if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hug
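
`os.listdir` returns entries in arbitrary order, so which review ends up as `X_raw[0]` can differ between machines. A minimal sketch of an order-stable variant, tried on a throwaway directory, is shown here. `load_reviews`, the two file names, and the assumption that review files end in `.txt` are illustrative and not part of the original notebook.

```python
import tempfile
from pathlib import Path

def load_reviews(dataset_path):
    """Return (texts, labels) with labels in {-1, +1}; files are read in sorted order."""
    texts, labels = [], []
    for folder, label in (("pos", 1), ("neg", -1)):
        # sorted() makes the order independent of the file system
        for file_path in sorted(Path(dataset_path, folder).glob("*.txt")):
            text = file_path.read_text(encoding="utf-8", errors="ignore")
            texts.append(text.strip().lower())
            labels.append(label)
    return texts, labels

# Build a tiny stand-in for txt_sentoken/ so the function can run without the real data.
with tempfile.TemporaryDirectory() as tmp:
    samples = (("pos", "a.txt", "A Great Film ."), ("neg", "b.txt", "Dull and slow ."))
    for folder, name, body in samples:
        Path(tmp, folder).mkdir()
        Path(tmp, folder, name).write_text(body, encoding="utf-8")
    demo_texts, demo_labels = load_reviews(tmp)

print(demo_texts)
print(demo_labels)
```

The same two-folder loop also removes the duplicated positive/negative blocks in the notebook's `parser`.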

# Part 2: Feature extraction


## Why use binary bag-of-words (BOW)?

Advantages:

- **Reduces the impact of high-frequency words**: If "movie" appears 100 times in a review, it still contributes a single 1, so it cannot overshadow less frequent but more meaningful words such as "great."
- **Well suited to classification**: The representation records whether a word is present rather than how often it occurs.
- **Limits the effect of common words**: Function words such as "the," "is," and "and" carry no more weight than any other word that appears in the review.

Disadvantages:

- **Loses frequency information**: This can reduce accuracy in tasks that rely on word frequency, such as topic modeling or information retrieval.
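
The difference between count and binary features can be seen on a single made-up review. This is a minimal sketch in plain Python; the helper names, the example sentence, and the extra vocabulary word "bad" are assumptions for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def count_vector(tokens, vocabulary):
    """Number of occurrences of each vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def binary_vector(tokens, vocabulary):
    """1 if the vocabulary word occurs at least once, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

review = "the movie was great the movie was really great".split()
# "bad" is added so the vocabulary contains a word absent from the review
vocabulary = sorted(set(review) | {"bad"})

print(vocabulary)                         # ['bad', 'great', 'movie', 'really', 'the', 'was']
print(count_vector(review, vocabulary))   # [0, 2, 2, 1, 2, 2]
print(binary_vector(review, vocabulary))  # [0, 1, 1, 1, 1, 1]
```

In the binary vector, "really" (one occurrence) and "the" (two occurrences) carry the same weight.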

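## CountVectorizer

The cell below uses `CountVectorizer` with its default settings, so the entries are raw term counts rather than binary indicators (values such as 14 and 46 appear in the first review's vector). `binary=True` is only passed from Part 3 onward.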

In [8]:
from sklearn.feature_extraction.text import CountVectorizer as SklearnCountVectorizer
if __name__ == "__main__":
    X_raw, y = parser() 
    sklearn_countvec= SklearnCountVectorizer()
    X_sklearn = sklearn_countvec.fit_transform(X_raw)
    print("Sparse Matrix Shape (Sklearn):", X_sklearn.shape)
    first_sentence_sklearn_vector = X_sklearn[0]
    print("First sentence vector (Sklearn, Sparse):", first_sentence_sklearn_vector)
    first_sentence_dense_sklearn = first_sentence_sklearn_vector.toarray()
    print("\nFirst sentence vectorized using dense matrix (Sklearn):\n", first_sentence_dense_sklearn)

Total samples: 2000
Positive samples: 1000
Negative samples: 1000
Sparse Matrix Shape (Sklearn): (2000, 39659)
First sentence vector (Sklearn, Sparse):   (0, 13196)	1
  (0, 1014)	1
  (0, 14073)	8
  (0, 7039)	5
  (0, 4366)	1
  (0, 16028)	2
  (0, 15656)	3
  (0, 26439)	1
  (0, 24386)	14
  (0, 34108)	1
  (0, 38707)	1
  (0, 35351)	2
  (0, 28303)	1
  (0, 750)	4
  (0, 34291)	1
  (0, 3308)	1
  (0, 34299)	1
  (0, 32906)	1
  (0, 24635)	3
  (0, 14473)	1
  (0, 35949)	1
  (0, 19446)	1
  (0, 5688)	1
  (0, 35280)	46
  (0, 2359)	1
  :	:
  (0, 20760)	1
  (0, 15246)	1
  (0, 34350)	1
  (0, 29846)	1
  (0, 3820)	1
  (0, 34399)	1
  (0, 8266)	1
  (0, 35611)	1
  (0, 31471)	1
  (0, 24559)	1
  (0, 16315)	2
  (0, 23100)	1
  (0, 17382)	1
  (0, 2628)	1
  (0, 18535)	1
  (0, 992)	1
  (0, 15706)	1
  (0, 2954)	1
  (0, 16926)	1
  (0, 0)	1
  (0, 15055)	1
  (0, 31327)	1
  (0, 19929)	1
  (0, 10792)	1
  (0, 7632)	1

First sentence vectorized using dense matrix (Sklearn):
 [[1 0 0 ... 0 0 0]]
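
Each line of the sparse output above reads as `(row, column)` followed by the stored count. A toy reconstruction of that layout is sketched below under two assumptions: whitespace tokenization replaces `CountVectorizer`'s own tokenizer, and columns are numbered in alphabetical order of the vocabulary. The two documents and the helper `sparse_row` are made up for illustration.

```python
from collections import Counter

docs = ["fine film", "dull dull film"]

# Column indices follow the alphabetical order of the vocabulary.
all_words = sorted({word for doc in docs for word in doc.split()})
vocabulary = {word: col for col, word in enumerate(all_words)}

def sparse_row(row, doc):
    """Return the non-zero entries of one document as ((row, col), count) pairs."""
    counts = Counter(doc.split())
    return sorted(((row, vocabulary[word]), n) for word, n in counts.items())

for row, doc in enumerate(docs):
    for (r, c), n in sparse_row(row, doc):
        print(f"  ({r}, {c})\t{n}")
```

Only non-zero entries are stored, which is why a 39659-column row for one review prints as a short list of pairs.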


In [6]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def train_and_test(vectorizer, X_train, X_test, y_train, y_test):
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)
    
    model = SGDClassifier(loss='hinge', random_state=42)
    model.fit(X_train_vec, y_train)
    
    y_pred = model.predict(X_test_vec)
    acc = accuracy_score(y_test, y_pred)
    return acc

def sklearn_baseline():
    X_raw, y = parser()
    y = (y + 1) // 2 
    
    X_train, X_test, y_train, y_test = train_test_split(X_raw, y, test_size=0.2, random_state=42)
    

    bow_acc = train_and_test(CountVectorizer(), X_train, X_test, y_train, y_test)
    tfidf_acc = train_and_test(TfidfVectorizer(), X_train, X_test, y_train, y_test)
    
    print(f"BOW accuracy: {bow_acc:.4f}")
    print(f"TF-IDF accuracy: {tfidf_acc:.4f}")

if __name__ == "__main__":
    sklearn_baseline()


Total samples: 2000
Positive samples: 1000
Negative samples: 1000
BOW accuracy: 0.8200
TF-IDF accuracy: 0.8175
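
The two accuracies above differ by a single review out of the 400 in the test split. A rough way to judge such a gap is the binomial standard error of an accuracy estimate, sketched here under the usual normal-approximation assumption. The helper name is hypothetical.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

n_test = 400                      # 20% of the 2000 reviews
bow_acc, tfidf_acc = 0.8200, 0.8175

se = accuracy_standard_error(bow_acc, n_test)
gap_in_reviews = round((bow_acc - tfidf_acc) * n_test)

print(f"standard error: {se:.4f}")
print(f"gap: {gap_in_reviews} review(s)")
```

With a standard error of roughly 0.019, a gap of 0.0025 on one split does not separate the two feature sets, which is one motivation for the cross-validated comparisons later in the notebook.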


# Part 3: Learning framework

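The learning framework has three components: a linear model, a loss function, and gradient-descent optimization, applied iteratively until the model converges. The cell below covers all three with scikit-learn's `SGDClassifier(loss='hinge')`, evaluates BOW and TF-IDF features on a held-out 20% test split, and adds 5-fold cross-validation scores.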
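
For reference, the three pieces (linear model, hinge loss, stochastic gradient step) can be written out by hand. The following is a minimal sketch on a four-point toy set, not the implementation scikit-learn uses; the function names, the L2 penalty strength `lam`, and the toy data are assumptions.

```python
import random

def hinge_sgd(X, y, lr=0.01, lam=0.01, epochs=100, seed=0):
    """Train a linear classifier sign(w.x + b) with per-sample hinge-loss updates."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new order every epoch
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # L2 regularisation shrinks the weights on every step
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # Subgradient of max(0, 1 - margin) is -y * x when the margin is violated
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy data: positives have a larger first feature, negatives a larger second one.
X_toy = [[2, 0], [3, 1], [0, 2], [1, 3]]
y_toy = [1, 1, -1, -1]

w, b = hinge_sgd(X_toy, y_toy)
predictions = [predict(w, b, x) for x in X_toy]
print(predictions)
```

The labels stay in {-1, +1} here because the hinge loss is defined on the signed margin; the notebook's conversion to {0, 1} is only needed for the scikit-learn metric functions.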


In [7]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.pipeline import make_pipeline
import os
import numpy as np

def parser(dataset_path='./txt_sentoken'):
    X_raw, y = [], []
    pos_folder = os.path.join(dataset_path, 'pos')
    for filename in os.listdir(pos_folder):
        with open(os.path.join(pos_folder, filename), "r", encoding="utf-8", errors='ignore') as file:
            X_raw.append(file.read().strip().lower())
        y.append(1)
    neg_folder = os.path.join(dataset_path, 'neg')
    for filename in os.listdir(neg_folder):
        with open(os.path.join(neg_folder, filename), "r", encoding="utf-8", errors='ignore') as file:
            X_raw.append(file.read().strip().lower())
        y.append(-1)
    return X_raw, np.array(y)

def sklearn_implementation():
    X_raw, y = parser()
    y_binary = (y + 1) // 2

    X_train, X_test, y_train, y_test = train_test_split(X_raw, y_binary, test_size=0.2, random_state=42)

    bow_pipeline = make_pipeline(
        CountVectorizer(binary=True, max_features=10000),
        SGDClassifier(loss='hinge', random_state=42)
    )
    tfidf_pipeline = make_pipeline(
        TfidfVectorizer(max_features=10000),
        SGDClassifier(loss='hinge', random_state=42)
    )

    bow_pipeline.fit(X_train, y_train)
    tfidf_pipeline.fit(X_train, y_train)
    bow_pred = bow_pipeline.predict(X_test)
    tfidf_pred = tfidf_pipeline.predict(X_test)

    cv_bow_scores = cross_val_score(bow_pipeline, X_raw, y_binary, cv=5)
    cv_tfidf_scores = cross_val_score(tfidf_pipeline, X_raw, y_binary, cv=5)

    print("===== DETAILED EVALUATION =====")
    print("\nBOW Results:")
    print(f"Accuracy:  {accuracy_score(y_test, bow_pred):.4f}")
    print(f"Precision: {precision_score(y_test, bow_pred):.4f}")
    print(f"Recall:    {recall_score(y_test, bow_pred):.4f}")
    print(f"F1 Score:  {f1_score(y_test, bow_pred):.4f}")
    print(f"CV Mean:   {cv_bow_scores.mean():.4f} ± {cv_bow_scores.std():.4f}")

    print("\nTF-IDF Results:")
    print(f"Accuracy:  {accuracy_score(y_test, tfidf_pred):.4f}")
    print(f"Precision: {precision_score(y_test, tfidf_pred):.4f}")
    print(f"Recall:    {recall_score(y_test, tfidf_pred):.4f}")
    print(f"F1 Score:  {f1_score(y_test, tfidf_pred):.4f}")
    print(f"CV Mean:   {cv_tfidf_scores.mean():.4f} ± {cv_tfidf_scores.std():.4f}")

    print("\nConfusion Matrix - BOW:")
    print(confusion_matrix(y_test, bow_pred))
    print("\nConfusion Matrix - TF-IDF:")
    print(confusion_matrix(y_test, tfidf_pred))

    return bow_pred, tfidf_pred

if __name__ == "__main__":
    sklearn_implementation()




===== DETAILED EVALUATION =====

BOW Results:
Accuracy:  0.8275
Precision: 0.8385
Recall:    0.8090
F1 Score:  0.8235
CV Mean:   0.8425 ± 0.0131

TF-IDF Results:
Accuracy:  0.8250
Precision: 0.8146
Recall:    0.8392
F1 Score:  0.8267
CV Mean:   0.8535 ± 0.0041

Confusion Matrix - BOW:
[[170  31]
 [ 38 161]]

Confusion Matrix - TF-IDF:
[[163  38]
 [ 32 167]]
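
The reported accuracy, precision, recall, and F1 values follow directly from the two confusion matrices above. A short check is sketched below, assuming scikit-learn's layout of rows as true labels and columns as predictions, i.e. `[[TN, FP], [FN, TP]]`. The helper name is hypothetical.

```python
def binary_metrics(confusion):
    """Derive metrics from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

bow = binary_metrics([[170, 31], [38, 161]])
tfidf = binary_metrics([[163, 38], [32, 167]])

for name, metrics in (("BOW", bow), ("TF-IDF", tfidf)):
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

BOW makes fewer false positives (31 vs 38) while TF-IDF makes fewer false negatives (32 vs 38), which is the precision/recall trade-off visible in the printed results.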


# Part 4: Exploring hyperparameters

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
import numpy as np

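X_raw, y = parser()
y = (y + 1) // 2  # map labels {-1, +1} to {0, 1}
# Note: the vocabulary is fit on all 2000 reviews before the split, so
# test-set words influence which features are kept; placing the vectorizer
# in a pipeline inside the search would avoid this.
X = CountVectorizer(binary=True, max_features=10000).fit_transform(X_raw)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Sample 10 of the 4 x 10 x 10 = 400 parameter combinations and score each
# with the search's default cross-validation (5-fold in current scikit-learn).
# eta0 is only used by the 'constant', 'invscaling', and 'adaptive' schedules.
model = RandomizedSearchCV(
    SGDClassifier(loss='hinge', random_state=42),
    {'learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive'],
     'eta0': np.logspace(-4, np.log10(3), 10),
     'alpha': np.logspace(-4, np.log10(3), 10)},
    n_iter=10, random_state=42
).fit(X_train, y_train)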

print("Best parameters:")
for k, v in model.best_params_.items():
    print(f"  {k}: {v:.5f}" if isinstance(v, float) else f"  {k}: {v}")
print(f"Best CV score: {model.best_score_ * 100:.1f}%")
print(f"Test accuracy: {model.score(X_test, y_test) * 100:.1f}%")


Best parameters:
  learning_rate: constant
  eta0: 0.00010
  alpha: 0.30353
Best CV score: 86.2%
Test accuracy: 84.0%
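
The search space behind these results is a log-spaced grid from 1e-4 to 3 for both `eta0` and `alpha`. The sketch below rebuilds that grid in plain Python to show which values were available and how few of the combinations a 10-iteration random search visits. `logspace` is a hypothetical stand-in for `np.logspace`, and the random draw does not reproduce scikit-learn's own sampling.

```python
import math
import random
from itertools import product

def logspace(start_exp, stop_exp, num):
    """num values spaced evenly on a log10 scale from 10**start_exp to 10**stop_exp."""
    step = (stop_exp - start_exp) / (num - 1)
    return [10 ** (start_exp + i * step) for i in range(num)]

grid = logspace(-4, math.log10(3), 10)
print([f"{value:.5f}" for value in grid])

schedules = ['constant', 'optimal', 'invscaling', 'adaptive']
all_combinations = list(product(schedules, grid, grid))
print(len(all_combinations))  # 4 * 10 * 10 = 400

# Draw 10 distinct combinations, as a random search with n_iter=10 would.
rng = random.Random(42)
sampled = rng.sample(all_combinations, 10)
for schedule, eta0, alpha in sampled:
    print(f"{schedule:<10} eta0={eta0:.5f} alpha={alpha:.5f}")
```

The reported best values `eta0 = 0.00010` and `alpha = 0.30353` are the first and eighth grid points. Since only 10 of 400 combinations are tried, a larger `n_iter` could plausibly find a better setting.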


# VG Part



## 1. SGD
1. Implement the optimization as *stochastic* gradient descent (SGD).



In [17]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.pipeline import make_pipeline  # needed by train_and_evaluate below
import numpy as np
import pandas as pd
import os

def parser(dataset_path='./txt_sentoken'):
    X_raw, y = [], []
    
    pos_folder_path = os.path.join(dataset_path, 'pos')
    for filename in os.listdir(pos_folder_path):
        file_path = os.path.join(pos_folder_path, filename)
        with open(file_path, "r", encoding="utf-8", errors='ignore') as file:
            X_raw.append(file.read().strip().lower())
        y.append(1)
    
    neg_folder_path = os.path.join(dataset_path, 'neg')
    for filename in os.listdir(neg_folder_path):
        file_path = os.path.join(neg_folder_path, filename)
        with open(file_path, "r", encoding="utf-8", errors='ignore') as file:
            X_raw.append(file.read().strip().lower())
        y.append(-1)
    
    y = np.array(y)
    
    print(f"Total samples: {len(X_raw)}")
    print(f"Positive samples: {sum(y == 1)}")
    print(f"Negative samples: {sum(y == -1)}")
    
    return X_raw, y

def train_and_evaluate(X_train, X_test, y_train, y_test):
    pipeline = make_pipeline(
        CountVectorizer(binary=True, max_features=10000),
        SGDClassifier(loss='hinge', random_state=42)
    )
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    classification_report = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, zero_division=0),
        'recall': recall_score(y_test, y_pred, zero_division=0),
        'f1_score': f1_score(y_test, y_pred, zero_division=0)}
    
    return classification_report



if __name__ == "__main__":
    X_raw, y = parser()
    y_binary = (y + 1) // 2
    X_train, X_test, y_train, y_test = train_test_split(X_raw, y_binary, test_size=0.2, random_state=42)
    report = train_and_evaluate(X_train, X_test, y_train, y_test)
    report_df = pd.DataFrame([report])
    print(report_df)


Total samples: 2000
Positive samples: 1000
Negative samples: 1000
   accuracy  precision    recall  f1_score
0    0.8275   0.838542  0.809045  0.823529


    
## 2. TF-IDF
Implement a [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) feature model, and compare classification performance to bag-of-words (this should also be briefly discussed in your analysis). Choose your preferred formulation of tf-idf from the literature, *briefly* motivating your choice.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer  # used by train_and_evaluate below
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.pipeline import make_pipeline
import numpy as np
import pandas as pd
import os

def parser(dataset_path='./txt_sentoken'):
    X_raw, y = [], []
    
    pos_folder_path = os.path.join(dataset_path, 'pos')
    for filename in os.listdir(pos_folder_path):
        file_path = os.path.join(pos_folder_path, filename)
        with open(file_path, "r", encoding="utf-8", errors='ignore') as file:
            X_raw.append(file.read().strip().lower())
        y.append(1)
    
    neg_folder_path = os.path.join(dataset_path, 'neg')
    for filename in os.listdir(neg_folder_path):
        file_path = os.path.join(neg_folder_path, filename)
        with open(file_path, "r", encoding="utf-8", errors='ignore') as file:
            X_raw.append(file.read().strip().lower())
        y.append(-1)
    
    y = np.array(y)
    
    print(f"Total samples: {len(X_raw)}")
    print(f"Positive samples: {sum(y == 1)}")
    print(f"Negative samples: {sum(y == -1)}")
    
    return X_raw, y

def train_and_evaluate(X_train, X_test, y_train, y_test):
    pipeline = make_pipeline(
        # binary=True replaces raw term counts with 0/1 before the IDF weighting
        TfidfVectorizer(binary=True, max_features=10000),
        SGDClassifier(loss='hinge', random_state=42)
    )
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    classification_report = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, zero_division=0),
        'recall': recall_score(y_test, y_pred, zero_division=0),
        'f1_score': f1_score(y_test, y_pred, zero_division=0)}
    
    return classification_report

if __name__ == "__main__":
    X_raw, y = parser()
    y_binary = (y + 1) // 2
    X_train, X_test, y_train, y_test = train_test_split(X_raw, y_binary, test_size=0.2, random_state=42)
    report = train_and_evaluate(X_train, X_test, y_train, y_test)
    report_df = pd.DataFrame([report])
    print(report_df)

Total samples: 2000
Positive samples: 1000
Negative samples: 1000
   accuracy  precision    recall  f1_score
0    0.8525   0.864583  0.834171  0.849105
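
One common tf-idf formulation is the smoothed variant that scikit-learn documents as the `TfidfVectorizer` default: idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. A minimal sketch of that formula is given here, with whitespace tokenization and three made-up documents as assumptions. The cell above additionally passes `binary=True`, which would replace the raw term frequency used below with 0/1.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Smoothed tf-idf with L2-normalised rows; returns (vocabulary, vectors)."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocabulary}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(value * value for value in raw)) or 1.0
        vectors.append([value / norm for value in raw])
    return vocabulary, vectors

docs = ["great film great cast", "dull film", "great fun"]
vocabulary, vectors = tfidf_vectors(docs)

print(vocabulary)
for vector in vectors:
    print([round(value, 3) for value in vector])
```

In the second document, "dull" (one document) receives a higher weight than "film" (two documents), which is the down-weighting of widespread words that distinguishes tf-idf from plain counts.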


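## 3. Adam
3. Implement an extension to the SGD optimization of your choice, e.g. from [this list on Wikipedia](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Extensions_and_variants).

The cell below does not implement Adam itself; it compares the learning-rate schedules built into `SGDClassifier` (`optimal`, `adaptive`, `constant`, `invscaling`) with several other classifiers, all on TF-IDF features with 5-fold cross-validation.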

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
import pandas as pd
import os
import numpy as np

def parser(dataset_path='./txt_sentoken'):
    X_raw, y = [], []
    pos_folder_path = os.path.join(dataset_path, 'pos')
    for filename in os.listdir(pos_folder_path):
        file_path = os.path.join(pos_folder_path, filename)
        with open(file_path, "r", encoding="utf-8", errors='ignore') as file:
            X_raw.append(file.read().strip().lower())
        y.append(1)
    neg_folder_path = os.path.join(dataset_path, 'neg')
    for filename in os.listdir(neg_folder_path):
        file_path = os.path.join(neg_folder_path, filename)
        with open(file_path, "r", encoding="utf-8", errors='ignore') as file:
            X_raw.append(file.read().strip().lower())
        y.append(0)
    y = np.array(y)
    print(f"Total samples: {len(X_raw)}")
    print(f"Positive: {sum(y == 1)}, Negative: {sum(y == 0)}")
    return X_raw, y

def compare_text_classifiers():
    X_raw, y = parser()
    models = {
        'SGD + Optimal': make_pipeline(TfidfVectorizer(max_features=5000),
                                       SGDClassifier(loss='hinge', learning_rate='optimal', max_iter=1000, random_state=42)),

        'SGD + Adaptive': make_pipeline(TfidfVectorizer(max_features=5000),
                                        SGDClassifier(loss='hinge', learning_rate='adaptive', eta0=0.01, max_iter=1000, random_state=42)),

        'SGD + Constant': make_pipeline(TfidfVectorizer(max_features=5000),
                                        SGDClassifier(loss='hinge', learning_rate='constant', eta0=0.01, max_iter=1000, random_state=42)),

        'SGD + Invscaling': make_pipeline(TfidfVectorizer(max_features=5000),
                                          SGDClassifier(loss='hinge', learning_rate='invscaling', eta0=0.01, max_iter=1000, random_state=42)),

        'LinearSVC': make_pipeline(TfidfVectorizer(max_features=5000),
                                   LinearSVC(C=1.0, random_state=42)),

        'SVM (RBF Kernel)': make_pipeline(TfidfVectorizer(max_features=5000),
                                          SVC(kernel='rbf', C=1.0, random_state=42)),

        'SVM (Poly Kernel)': make_pipeline(TfidfVectorizer(max_features=5000),
                                           SVC(kernel='poly', degree=3, C=1.0, random_state=42)),

        'Logistic Regression': make_pipeline(TfidfVectorizer(max_features=5000),
                                             LogisticRegression(max_iter=1000, random_state=42)),

        'Multinomial NB': make_pipeline(TfidfVectorizer(max_features=5000),
                                        MultinomialNB()),

        'Random Forest': make_pipeline(TfidfVectorizer(max_features=5000),
                                       RandomForestClassifier(n_estimators=100, random_state=42)),

        'Decision Tree': make_pipeline(TfidfVectorizer(max_features=5000),
                                       DecisionTreeClassifier(random_state=42)),

        'KNN': make_pipeline(TfidfVectorizer(max_features=5000),
                             KNeighborsClassifier(n_neighbors=5))
    }

    results = []
    for name, model in models.items():
        scores = cross_val_score(model, X_raw, y, cv=5, scoring='accuracy')
        results.append({
            'Model': name,
            'CV Mean': scores.mean(),
            'CV Std': scores.std()
        })
        print(f"{name}: {scores.mean():.4f} ± {scores.std():.4f}")

    df = pd.DataFrame(results).sort_values(by='CV Mean', ascending=False)
    print("\nFinal Comparison:")
    print(df.round(4))
    return df

if __name__ == "__main__":
    df_results = compare_text_classifiers()

Total samples: 2000
Positive: 1000, Negative: 1000
SGD + Optimal: 0.8520 ± 0.0076
SGD + Adaptive: 0.8585 ± 0.0073
SGD + Constant: 0.8605 ± 0.0066
SGD + Invscaling: 0.5420 ± 0.0418
LinearSVC: 0.8600 ± 0.0035
SVM (RBF Kernel): 0.8415 ± 0.0044
SVM (Poly Kernel): 0.8365 ± 0.0093
Logistic Regression: 0.8250 ± 0.0072
Multinomial NB: 0.8220 ± 0.0099
Random Forest: 0.8060 ± 0.0161
Decision Tree: 0.6425 ± 0.0195
KNN: 0.6245 ± 0.0283

Final Comparison:
                  Model  CV Mean  CV Std
2        SGD + Constant   0.8605  0.0066
4             LinearSVC   0.8600  0.0035
1        SGD + Adaptive   0.8585  0.0073
0         SGD + Optimal   0.8520  0.0076
5      SVM (RBF Kernel)   0.8415  0.0044
6     SVM (Poly Kernel)   0.8365  0.0093
7   Logistic Regression   0.8250  0.0072
8        Multinomial NB   0.8220  0.0099
9         Random Forest   0.8060  0.0161
10        Decision Tree   0.6425  0.0195
11                  KNN   0.6245  0.0283
3      SGD + Invscaling   0.5420  0.0418
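
`SGDClassifier` offers learning-rate schedules but no Adam optimizer, so the comparison above does not include one. For reference, an Adam-style update for the hinge loss can be sketched in plain Python as follows. The toy data, function names, and hyperparameter values are assumptions, and the sketch omits regularization for brevity.

```python
import math

def adam_hinge(X, y, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, epochs=100):
    """Per-sample hinge-loss training with Adam moment estimates."""
    params = [0.0] * (len(X[0]) + 1)   # weights followed by the bias
    m = [0.0] * len(params)            # first-moment (mean) estimates
    v = [0.0] * len(params)            # second-moment (uncentred variance) estimates
    t = 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            t += 1
            # zip stops at the shorter sequence, so only the weights pair with x_i
            score = sum(p * xj for p, xj in zip(params, x_i)) + params[-1]
            if y_i * score < 1.0:
                grad = [-y_i * xj for xj in x_i] + [-y_i]
            else:
                grad = [0.0] * len(params)
            for j, g in enumerate(grad):
                m[j] = beta1 * m[j] + (1 - beta1) * g
                v[j] = beta2 * v[j] + (1 - beta2) * g * g
                m_hat = m[j] / (1 - beta1 ** t)   # bias correction
                v_hat = v[j] / (1 - beta2 ** t)
                params[j] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params[:-1], params[-1]

def mean_hinge_loss(w, b, X, y):
    scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for x in X]
    return sum(max(0.0, 1.0 - yi * s) for yi, s in zip(y, scores)) / len(y)

X_toy = [[2, 0], [3, 1], [0, 2], [1, 3]]
y_toy = [1, 1, -1, -1]

w, b = adam_hinge(X_toy, y_toy)
predictions = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1 for x in X_toy]
print(predictions, round(mean_hinge_loss(w, b, X_toy, y_toy), 4))
```

Because each coordinate's step is scaled by its own gradient history, Adam is less sensitive to the choice of a single global step size, which is the setting where `invscaling` with `eta0=0.01` performed poorly above.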



## 4. k-fold cross-validation
4. Implement [k-fold cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation) for evaluating and comparing your model variants.
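
The splitting step of k-fold cross-validation can be sketched without scikit-learn. The version below is stratified, which matters here because the parser returns all positive reviews before the negative ones; an unshuffled, unstratified split would produce single-class test folds. `cross_val_score` with an integer `cv` and a classifier uses stratified folds as well. The function name and the 20-label toy list are assumptions.

```python
import random

def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each test fold keeps the class balance."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)   # deal indices round-robin across folds
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, test_idx

labels = [1] * 10 + [-1] * 10   # positives first, as the parser returns them
splits = list(stratified_k_fold(labels, k=5))

for train_idx, test_idx in splits:
    print(len(train_idx), test_idx)
```

Each model variant is then trained on `train_idx` and scored on `test_idx`, and the k scores are averaged, which is what the mean ± standard deviation figures in this notebook report.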

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
import pandas as pd
import os
import numpy as np

def parser(dataset_path='./txt_sentoken'):
    X_raw, y = [], []
    pos_folder_path = os.path.join(dataset_path, 'pos')
    for filename in os.listdir(pos_folder_path):
        file_path = os.path.join(pos_folder_path, filename)
        with open(file_path, "r", encoding="utf-8", errors='ignore') as file:
            X_raw.append(file.read().strip().lower())
        y.append(1)
    neg_folder_path = os.path.join(dataset_path, 'neg')
    for filename in os.listdir(neg_folder_path):
        file_path = os.path.join(neg_folder_path, filename)
        with open(file_path, "r", encoding="utf-8", errors='ignore') as file:
            X_raw.append(file.read().strip().lower())
        y.append(-1)
    y = np.array(y)
    print(f"Total samples: {len(X_raw)}")
    print(f"Positive: {sum(y == 1)}, Negative: {sum(y == -1)}")
    return X_raw, y

def compare_all_methods():
    X_raw, y = parser()
    

    classifiers = {
        # BOW (the SGD pipelines in this cell omit random_state, so their scores vary between runs)
        'BOW+SGD-Constant': make_pipeline(CountVectorizer(binary=True, max_features=5000), SGDClassifier(loss='hinge', learning_rate='constant', eta0=0.01, max_iter=1000)),
        'BOW+SGD-Optimal': make_pipeline(CountVectorizer(binary=True, max_features=5000), SGDClassifier(loss='hinge', learning_rate='optimal', max_iter=1000)),
        'BOW+SGD-Adaptive': make_pipeline(CountVectorizer(binary=True, max_features=5000), SGDClassifier(loss='hinge', learning_rate='adaptive', eta0=0.01, max_iter=1000)),
        'BOW+SGD-Invscaling': make_pipeline(CountVectorizer(binary=True, max_features=5000), SGDClassifier(loss='hinge', learning_rate='invscaling', eta0=0.01, max_iter=1000)),
        'BOW+LinearSVC': make_pipeline(CountVectorizer(binary=True, max_features=5000), LinearSVC(C=1.0, random_state=42)),
        'BOW+SVM-Linear': make_pipeline(CountVectorizer(binary=True, max_features=5000), SVC(kernel='linear', C=1.0, random_state=42)),
        'BOW+SVM-RBF': make_pipeline(CountVectorizer(binary=True, max_features=5000), SVC(kernel='rbf', C=1.0, random_state=42)),
        'BOW+SVM-Poly': make_pipeline(CountVectorizer(binary=True, max_features=5000), SVC(kernel='poly', degree=3, C=1.0, random_state=42)),
        'BOW+LogisticRegression': make_pipeline(CountVectorizer(binary=True, max_features=5000), LogisticRegression(max_iter=1000, random_state=42)),
        'BOW+Multinomial NB': make_pipeline(CountVectorizer(binary=True, max_features=5000), MultinomialNB()),
        'BOW+RandomForest': make_pipeline(CountVectorizer(binary=True, max_features=5000), RandomForestClassifier(n_estimators=100, random_state=42)),
        'BOW+DecisionTree': make_pipeline(CountVectorizer(binary=True, max_features=5000), DecisionTreeClassifier(random_state=42)),
        'BOW+KNN': make_pipeline(CountVectorizer(binary=True, max_features=5000), KNeighborsClassifier(n_neighbors=5)),
        
        # TF-IDF
        'TF-IDF+SGD-Constant': make_pipeline(TfidfVectorizer(max_features=5000), SGDClassifier(loss='hinge', learning_rate='constant', eta0=0.01, max_iter=1000)),
        'TF-IDF+SGD-Optimal': make_pipeline(TfidfVectorizer(max_features=5000), SGDClassifier(loss='hinge', learning_rate='optimal', max_iter=1000)),
        'TF-IDF+SGD-Adaptive': make_pipeline(TfidfVectorizer(max_features=5000), SGDClassifier(loss='hinge', learning_rate='adaptive', eta0=0.01, max_iter=1000)),
        'TF-IDF+SGD-Invscaling': make_pipeline(TfidfVectorizer(max_features=5000), SGDClassifier(loss='hinge', learning_rate='invscaling', eta0=0.01, max_iter=1000)),
        'TF-IDF+LinearSVC': make_pipeline(TfidfVectorizer(max_features=5000), LinearSVC(C=1.0, random_state=42)),
        'TF-IDF+SVM-Linear': make_pipeline(TfidfVectorizer(max_features=5000), SVC(kernel='linear', C=1.0, random_state=42)),
        'TF-IDF+SVM-RBF': make_pipeline(TfidfVectorizer(max_features=5000), SVC(kernel='rbf', C=1.0, random_state=42)),
        'TF-IDF+SVM-Poly': make_pipeline(TfidfVectorizer(max_features=5000), SVC(kernel='poly', degree=3, C=1.0, random_state=42)),
        'TF-IDF+LogisticRegression': make_pipeline(TfidfVectorizer(max_features=5000), LogisticRegression(max_iter=1000, random_state=42)),
        'TF-IDF+Multinomial NB': make_pipeline(TfidfVectorizer(max_features=5000), MultinomialNB()),
        'TF-IDF+RandomForest': make_pipeline(TfidfVectorizer(max_features=5000), RandomForestClassifier(n_estimators=100, random_state=42)),
        'TF-IDF+DecisionTree': make_pipeline(TfidfVectorizer(max_features=5000), DecisionTreeClassifier(random_state=42)),
        'TF-IDF+KNN': make_pipeline(TfidfVectorizer(max_features=5000), KNeighborsClassifier(n_neighbors=5))
    }

    results = []
    print("===== Model Comparison =====")
    

    for clf_name, classifier in classifiers.items():
        cv_scores = cross_val_score(classifier, X_raw, y, cv=5, scoring='accuracy')
        results.append({
            'Model': clf_name,
            'CV_Mean': cv_scores.mean(),
            'CV_Std': cv_scores.std()
        })
        print(f"{clf_name}: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
    
    df = pd.DataFrame(results)
    print("\n===== Complete Results =====")
    print(df.round(4))

    best = df.loc[df['CV_Mean'].idxmax()]
    print(f"\nBest: {best['Model']}")
    print(f"Accuracy: {best['CV_Mean']:.4f} ± {best['CV_Std']:.4f}")
    
    return df

if __name__ == "__main__":
    results = compare_all_methods()



Total samples: 2000
Positive: 1000, Negative: 1000
===== Model Comparison =====
BOW+SGD-Constant: 0.8430 ± 0.0099
BOW+SGD-Optimal: 0.8360 ± 0.0129
BOW+SGD-Adaptive: 0.8440 ± 0.0142
BOW+SGD-Invscaling: 0.8550 ± 0.0164
BOW+LinearSVC: 0.8405 ± 0.0091
BOW+SVM-Linear: 0.8360 ± 0.0066
BOW+SVM-RBF: 0.8540 ± 0.0216
BOW+SVM-Poly: 0.8085 ± 0.0385
BOW+LogisticRegression: 0.8575 ± 0.0076
BOW+Multinomial NB: 0.8415 ± 0.0097
BOW+RandomForest: 0.8150 ± 0.0148
BOW+DecisionTree: 0.6150 ± 0.0173
BOW+KNN: 0.5535 ± 0.0212
TF-IDF+SGD-Constant: 0.8595 ± 0.0064
TF-IDF+SGD-Optimal: 0.8470 ± 0.0075
TF-IDF+SGD-Adaptive: 0.8595 ± 0.0060
TF-IDF+SGD-Invscaling: 0.6390 ± 0.0861
TF-IDF+LinearSVC: 0.8600 ± 0.0035
TF-IDF+SVM-Linear: 0.8480 ± 0.0046
TF-IDF+SVM-RBF: 0.8415 ± 0.0044
TF-IDF+SVM-Poly: 0.8365 ± 0.0093
TF-IDF+LogisticRegression: 0.8250 ± 0.0072
TF-IDF+Multinomial NB: 0.8220 ± 0.0099
TF-IDF+RandomForest: 0.8060 ± 0.0161
TF-IDF+DecisionTree: 0.6425 ± 0.0195
TF-IDF+KNN: 0.6245 ± 0.0283

===== Complete Results =====
                        Model  CV_Mean  CV_Std
0            BOW+SGD-Constant   0.8430  0.0099
1             BOW+SGD-Optimal   0.8360  0.0129
2            BOW+SGD-Adaptive   0.8440  0.0142
3          BOW+SGD-Invsc

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
import pandas as pd
import os
import numpy as np
import warnings
warnings.filterwarnings('ignore')

def parser(dataset_path='./txt_sentoken'):
    X_raw, y = [], []
    pos_folder_path = os.path.join(dataset_path, 'pos')
    for filename in os.listdir(pos_folder_path):
        file_path = os.path.join(pos_folder_path, filename)
        with open(file_path, "r", encoding="utf-8", errors='ignore') as file:
            X_raw.append(file.read().strip().lower())
        y.append(1)
    neg_folder_path = os.path.join(dataset_path, 'neg')
    for filename in os.listdir(neg_folder_path):
        file_path = os.path.join(neg_folder_path, filename)
        with open(file_path, "r", encoding="utf-8", errors='ignore') as file:
            X_raw.append(file.read().strip().lower())
        y.append(-1)
    return X_raw, np.array(y)

def simple_grid_search():
    X_raw, y = parser()
    print(f"Total samples: {len(X_raw)}")
    

    models = {
        'TF-IDF+SGD': {
            'pipeline': make_pipeline(TfidfVectorizer(), SGDClassifier(max_iter=2000)),
            'params': {
                'tfidfvectorizer__max_features': [5000, 10000],
                'tfidfvectorizer__ngram_range': [(1,1), (1,2)],
                'sgdclassifier__alpha': [0.0001, 0.001],
                'sgdclassifier__eta0': [0.01]
            }
        },
        'TF-IDF+LR': {
            'pipeline': make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=2000)),
            'params': {
                'tfidfvectorizer__max_features': [5000, 10000],
                'tfidfvectorizer__ngram_range': [(1,1), (1,2)],
                'logisticregression__C': [1.0, 10.0]
            }
        },
        'BOW+SVM': {
            'pipeline': make_pipeline(CountVectorizer(binary=True), SVC()),
            'params': {
                'countvectorizer__max_features': [5000],
                'svc__kernel': ['linear', 'rbf'],
                'svc__C': [1.0, 10.0]
            }
        }
    }
    
    results = []
    for name, config in models.items():
        print(f"\n===== {name} =====")
        grid = GridSearchCV(config['pipeline'], config['params'], cv=5, scoring='accuracy')
        grid.fit(X_raw, y)
        results.append((name, grid.best_score_, grid.best_params_))
        print(f"Best Score: {grid.best_score_:.4f}")
        print(f"Best Parameters: {grid.best_params_}")
    

    print("\n===== final ranking =====")
    for name, score, params in sorted(results, key=lambda x: x[1], reverse=True):
        print(f"{name}: {score:.4f}")
    
    return results

if __name__ == "__main__":
    results = simple_grid_search()


Total samples: 2000

===== TF-IDF+SGD =====
Best Score: 0.8635
Best Parameters: {'sgdclassifier__alpha': 0.0001, 'sgdclassifier__eta0': 0.01, 'tfidfvectorizer__max_features': 10000, 'tfidfvectorizer__ngram_range': (1, 2)}

===== TF-IDF+LR =====
Best Score: 0.8690
Best Parameters: {'logisticregression__C': 10.0, 'tfidfvectorizer__max_features': 10000, 'tfidfvectorizer__ngram_range': (1, 2)}

===== BOW+SVM =====
Best Score: 0.8565
Best Parameters: {'countvectorizer__max_features': 5000, 'svc__C': 10.0, 'svc__kernel': 'rbf'}

===== final ranking =====
TF-IDF+LR: 0.8690
TF-IDF+SGD: 0.8635
BOW+SVM: 0.8565
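
An exhaustive grid search evaluates every combination of the listed values, so its cost is the product of the list lengths times the number of folds. The sketch below enumerates the TF-IDF+SGD grid from the cell above with `itertools.product`; it is a stand-in for illustration and not how `GridSearchCV` is implemented. The `step__parameter` keys follow the lowercase class names that `make_pipeline` assigns to its steps.

```python
from itertools import product

param_grid = {
    'tfidfvectorizer__max_features': [5000, 10000],
    'tfidfvectorizer__ngram_range': [(1, 1), (1, 2)],
    'sgdclassifier__alpha': [0.0001, 0.001],
    'sgdclassifier__eta0': [0.01],
}

names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

n_folds = 5
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)  # 8 candidates, 40 cross-validation fits
for candidate in candidates[:2]:
    print(candidate)
```

Note that `eta0` is not used by the `optimal` schedule, which is the `SGDClassifier` default, so listing it in this grid adds no real search dimension unless `learning_rate` is also changed.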
