# Final Project

Social Media Bias Predictor by Jay Chadha, Tyler Lynch, Robert Mustachio, and Kiyan Zewer

---

Import dependencies.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

## Preprocessing

Import the political_social_media dataset and preprocess the `text` and `bias` columns using `CountVectorizer()` and `LabelEncoder`.

Make sure to upload `political_social_media.csv` to the `/content/sample_data` folder in the runtime file explorer, since that is the path the cell below reads from.

In [None]:
politicalBias = pd.read_csv("/content/sample_data/political_social_media.csv")

# Get dataframe columns
bias = politicalBias["bias"]
text = politicalBias["text"]

print(politicalBias.head())

# Convert text into vectors
vectorizerText = CountVectorizer()
text_matrix = vectorizerText.fit_transform(text)

# Label Encode Bias column
biasLabelEncoder = LabelEncoder()
bias_encoded = biasLabelEncoder.fit_transform(bias)

       bias                                               text
0  partisan  RT @nowthisnews: Rep. Trey Radel (R- #FL) slam...
1  partisan  VIDEO - #Obamacare:  Full of Higher Costs and ...
2   neutral  Please join me today in remembering our fallen...
3   neutral  RT @SenatorLeahy: 1st step toward Senate debat...
4  partisan  .@amazon delivery #drones show need to update ...
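To make the two preprocessing steps concrete, here is a minimal sketch on a toy corpus. The three sentences and their labels are hypothetical and are not taken from `political_social_media.csv`; the `toy_*` names are only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, not rows from the real dataset
toy_text = ["vote for the bill", "join me today", "vote against the bill"]
toy_bias = ["partisan", "neutral", "partisan"]

# Bag-of-words: one column per distinct token, one row per document
toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_text)

# Map string labels to integers (classes are sorted alphabetically)
toy_encoder = LabelEncoder()
toy_labels = toy_encoder.fit_transform(toy_bias)

print(sorted(toy_vectorizer.vocabulary_))       # learned vocabulary
print(toy_matrix.toarray())                     # dense view of the sparse counts
print(list(toy_encoder.classes_), toy_labels)   # label mapping and encoded labels
```

The count matrix is sparse, which is why it can be passed directly to the scikit-learn classifiers used later without converting it to a dense array.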


## Data Split

Split the data into 80% training, 10% test, and 10% validation.

In [None]:
# Split the data into training, test, and validation data
x_train, x_testValid, y_train, y_testValid = train_test_split(text_matrix, bias, test_size=0.2)

# Split test data into test and validation
x_test, x_valid, y_test, y_valid = train_test_split(x_testValid, y_testValid, test_size=0.5)
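The same two-step 80/10/10 split can be made reproducible and class-balanced. The sketch below assumes synthetic data and adds `random_state` and `stratify`, neither of which is used in the cell above, so treat it as one possible variation rather than the method used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 75% class 0 and 25% class 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 75 + [1] * 25)

# Step 1: hold out 20% of the data, keeping class proportions
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Step 2: split the held-out 20% in half for test and validation
X_te, X_va, y_te, y_va = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=0
)

print(len(y_tr), len(y_te), len(y_va))
```

Fixing `random_state` means the accuracies reported later would not change from run to run, which makes comparisons between models easier to interpret.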

## Classifiers

### Logistic Regression

Train and evaluate our logistic regression model on our test and validation sets.

In [None]:
# Train logistic regression classifier
logreg = LogisticRegression()
logreg.fit(x_train, y_train)

# Evaluate classifier on training set
accuracy_train = logreg.score(x_train, y_train)
print("Training accuracy:", accuracy_train)

# Evaluate classifier on test set
accuracy_test = logreg.score(x_test, y_test)
print("Test accuracy:", accuracy_test)

# Evaluate classifier on validation set
accuracy_val = logreg.score(x_valid, y_valid)
print("Validation accuracy:", accuracy_val)

Training accuracy: 0.989
Test accuracy: 0.756
Validation accuracy: 0.73


### SVM

Train and evaluate our SVM model on our test and validation sets.

In [None]:
from sklearn import svm
from sklearn.metrics import accuracy_score

# Build the classifier: take in the training data and the test X values
def classifier(XtrnInput, YtrnInput, Xtst, kernelInput, gammaVal):
    
    # Create SVM classifier with the specified kernel and gamma values
    classif = svm.SVC(kernel=kernelInput, gamma=gammaVal)
    classif.fit(XtrnInput, YtrnInput)
    
    # return prediction on test data
    return classif.predict(Xtst)

# Call the classifier
predictionTest = classifier(x_train, y_train, x_test, "rbf", 1)

# Evaluate classifier on training set
predictionTrain = classifier(x_train, y_train, x_train, "rbf", 1)
trainingAcc = accuracy_score(y_train, predictionTrain)
print("Training accuracy:", trainingAcc)

# find the accuracy
accTest = accuracy_score(predictionTest, y_test)
print("Test accuracy:", accTest)

# Validation
predictionValidation = classifier(x_train, y_train, x_valid, "rbf", 1)

# find the accuracy
acc_h = accuracy_score(predictionValidation, y_valid)
print("Validation set accuracy:", acc_h)

Training accuracy: 0.99975
Test accuracy: 0.744
Validation set accuracy: 0.728
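The `classifier` helper above refits the SVM every time it is called. A possible alternative, sketched here on synthetic data, returns the fitted model so it can be scored on several splits after a single fit. The `fit_svm` name and the `make_classification` data are assumptions made for illustration.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized tweets
X_syn, y_syn = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

def fit_svm(X, y, kernel="rbf", gamma=1):
    """Fit an SVC once and return the fitted model (hypothetical helper)."""
    model = svm.SVC(kernel=kernel, gamma=gamma)
    model.fit(X, y)
    return model

# One fit, then reuse the same model for every evaluation
model = fit_svm(X_tr, y_tr, gamma=0.01)
train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)
print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)
```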


## Hyperparameter Tuning

### Logistic Regression

In [None]:
# Create parameter grid to search over
param_grid = {
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'penalty': ['l1', 'l2'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100]
}

# Create the grid search object
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5)

# Fit the grid search to the data
grid_search.fit(x_train, y_train)

# Print the best hyperparameters found
print("Best hyperparameters:", grid_search.best_params_)

# Train a new logistic regression model using the best hyperparameters found
logreg_tuned = LogisticRegression(**grid_search.best_params_)
logreg_tuned.fit(x_train, y_train)

# Evaluate the tuned model on the training, test, and validation sets
accuracy_train_tuned = logreg_tuned.score(x_train, y_train)
print("Tuned training accuracy:", accuracy_train_tuned)
accuracy_test_tuned = logreg_tuned.score(x_test, y_test)
print("Tuned test accuracy:", accuracy_test_tuned)
validation_test_tuned = logreg_tuned.score(x_valid, y_valid)
print("Tuned validation accuracy:", validation_test_tuned)


Best hyperparameters: {'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}
Tuned training accuracy: 0.866
Tuned test accuracy: 0.75
Tuned validation accuracy: 0.744
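Not every solver in the grid above supports every penalty: `newton-cg`, `lbfgs`, and `sag` accept `l2` but not `l1`, so those combinations fail to fit during the search. One way to search only valid combinations is to pass `GridSearchCV` a list of grids. The sketch below assumes synthetic data and a smaller `C` range than the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=0)

# Each dict is searched separately, so only compatible pairs are tried
valid_grid = [
    {"solver": ["lbfgs", "newton-cg", "sag"], "penalty": ["l2"], "C": [0.01, 0.1, 1]},
    {"solver": ["liblinear", "saga"], "penalty": ["l1", "l2"], "C": [0.01, 0.1, 1]},
]

search = GridSearchCV(LogisticRegression(max_iter=1000), valid_grid, cv=5)
search.fit(X_syn, y_syn)

n_candidates = len(search.cv_results_["params"])
print("Candidates evaluated:", n_candidates)
print("Best hyperparameters:", search.best_params_)
```

Because `GridSearchCV` refits the best estimator by default, `search.best_estimator_` can be scored directly without constructing a second model.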


### SVM

In [None]:
# Call the classifier
predictionTest = classifier(x_train, y_train, x_test, "rbf", 0.01)

# find the accuracy
accTest = accuracy_score(predictionTest, y_test)
print("Test accuracy:", accTest)

# Validation
predictionValidation = classifier(x_train, y_train, x_valid, "rbf", 0.01)

# find the accuracy
acc_h = accuracy_score(predictionValidation, y_valid)
print("Validation set accuracy:", acc_h)

# Call the classifier
predictionTest = classifier(x_train, y_train, x_test, "rbf", 0.001)

# find the accuracy
accTest = accuracy_score(predictionTest, y_test)
print("Test accuracy:", accTest)

# Validation
predictionValidation = classifier(x_train, y_train, x_valid, "rbf", 0.001)

# find the accuracy
acc_h = accuracy_score(predictionValidation, y_valid)
print("Validation set accuracy:", acc_h)

Test accuracy: 0.754
Validation set accuracy: 0.736
Test accuracy: 0.744
Validation set accuracy: 0.728


In [None]:
# Call the classifier
predictionTest = classifier(x_train, y_train, x_test, "rbf", 100)

# find the accuracy
accTest = accuracy_score(predictionTest, y_test)
print("Test accuracy:", accTest)

# Validation
predictionValidation = classifier(x_train, y_train, x_valid, "rbf", 100)

# find the accuracy
acc_h = accuracy_score(predictionValidation, y_valid)
print("Validation set accuracy:", acc_h)

Test accuracy: 0.744
Validation set accuracy: 0.728


In [None]:
from sklearn.model_selection import GridSearchCV
  
# defining parameter range
param_grid = {'C': [0.1, 1, 10,], 
              'gamma': [10, 1, 0.1, 0.01,],
              'kernel': ['rbf', 'sigmoid']} 
  
grid = GridSearchCV(svm.SVC(), param_grid, refit = True, verbose = 3)
  
# fitting the model for grid search
grid.fit(x_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV 1/5] END .......C=0.1, gamma=10, kernel=rbf;, score=0.739 total time=   4.1s
[CV 2/5] END .......C=0.1, gamma=10, kernel=rbf;, score=0.739 total time=   3.9s
[CV 3/5] END .......C=0.1, gamma=10, kernel=rbf;, score=0.739 total time=   3.7s
[CV 4/5] END .......C=0.1, gamma=10, kernel=rbf;, score=0.738 total time=   3.6s
[CV 5/5] END .......C=0.1, gamma=10, kernel=rbf;, score=0.738 total time=   4.2s
[CV 1/5] END ...C=0.1, gamma=10, kernel=sigmoid;, score=0.636 total time=   1.9s
[CV 2/5] END ...C=0.1, gamma=10, kernel=sigmoid;, score=0.590 total time=   1.9s
[CV 3/5] END ...C=0.1, gamma=10, kernel=sigmoid;, score=0.642 total time=   1.9s
[CV 4/5] END ...C=0.1, gamma=10, kernel=sigmoid;, score=0.611 total time=   1.8s
[CV 5/5] END ...C=0.1, gamma=10, kernel=sigmoid;, score=0.624 total time=   1.8s
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.739 total time=   4.1s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf

In [None]:
grid.fit(x_test, y_test)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV 1/5] END .......C=0.1, gamma=10, kernel=rbf;, score=0.740 total time=   0.1s
[CV 2/5] END .......C=0.1, gamma=10, kernel=rbf;, score=0.740 total time=   0.1s
[CV 3/5] END .......C=0.1, gamma=10, kernel=rbf;, score=0.740 total time=   0.1s
[CV 4/5] END .......C=0.1, gamma=10, kernel=rbf;, score=0.750 total time=   0.1s
[CV 5/5] END .......C=0.1, gamma=10, kernel=rbf;, score=0.750 total time=   0.1s
[CV 1/5] END ...C=0.1, gamma=10, kernel=sigmoid;, score=0.720 total time=   0.0s
[CV 2/5] END ...C=0.1, gamma=10, kernel=sigmoid;, score=0.740 total time=   0.0s
[CV 3/5] END ...C=0.1, gamma=10, kernel=sigmoid;, score=0.700 total time=   0.0s
[CV 4/5] END ...C=0.1, gamma=10, kernel=sigmoid;, score=0.730 total time=   0.0s
[CV 5/5] END ...C=0.1, gamma=10, kernel=sigmoid;, score=0.740 total time=   0.0s
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.740 total time=   0.1s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf

In [None]:
grid.fit(x_valid, y_valid)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV 1/5] END .......C=0.1, gamma=10, kernel=rbf;, score=0.730 total time=   0.1s
[CV 2/5] END .......C=0.1, gamma=10, kernel=rbf;, score=0.730 total time=   0.1s
[CV 3/5] END .......C=0.1, gamma=10, kernel=rbf;, score=0.730 total time=   0.1s
[CV 4/5] END .......C=0.1, gamma=10, kernel=rbf;, score=0.730 total time=   0.1s
[CV 5/5] END .......C=0.1, gamma=10, kernel=rbf;, score=0.720 total time=   0.1s
[CV 1/5] END ...C=0.1, gamma=10, kernel=sigmoid;, score=0.710 total time=   0.0s
[CV 2/5] END ...C=0.1, gamma=10, kernel=sigmoid;, score=0.740 total time=   0.0s
[CV 3/5] END ...C=0.1, gamma=10, kernel=sigmoid;, score=0.750 total time=   0.0s
[CV 4/5] END ...C=0.1, gamma=10, kernel=sigmoid;, score=0.720 total time=   0.0s
[CV 5/5] END ...C=0.1, gamma=10, kernel=sigmoid;, score=0.720 total time=   0.0s
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.730 total time=   0.1s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf
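Calling `grid.fit` on a split reruns the whole search on that split. An alternative that keeps the test data out of the search is to fit once on the training data and then call `score` on the held-out data. This is a minimal sketch on synthetic data with a reduced grid, not a rerun of the search above.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# Reduced grid, chosen only to keep the example fast
small_grid = {"C": [0.1, 1, 10], "gamma": [0.1, 0.01], "kernel": ["rbf"]}

search = GridSearchCV(svm.SVC(), small_grid, cv=5, refit=True)
search.fit(X_tr, y_tr)                  # search uses training data only

print("Best hyperparameters:", search.best_params_)
print("Mean CV accuracy:", search.best_score_)

# score() uses the refit best estimator; no new search is run
held_out_acc = search.score(X_te, y_te)
print("Held-out accuracy:", held_out_acc)
```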

## Analysis
### 1. Based on the results above, which classifier is better, and why?
Based on our results, neither classifier is clearly better than the other. Both returned almost identical test and validation accuracies (roughly 0.73 to 0.76), both before and after hyperparameter tuning. We used a grid search to look for the best parameters for each classifier, and even after tuning the accuracies remained almost identical, which supports the claim that neither classifier is much better than the other.
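When two models land on nearly the same accuracy, it can help to compare both against a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on a hypothetical 74/26 label mix chosen only for illustration; it does not report the class balance of the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical label mix, not measured from political_social_media.csv
y_demo = np.array(["partisan"] * 26 + ["neutral"] * 74)
X_demo = np.zeros((len(y_demo), 1))     # features are ignored by the baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_acc = baseline.score(X_demo, y_demo)
print("Majority-class baseline accuracy:", baseline_acc)
```

If a trained classifier scores close to this baseline, accuracy alone says little about what it has learned, and per-class metrics such as precision and recall become more informative.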

### 2. For further improvement on classification accuracy, what strategies can you use, and why do you think they will be helpful?
To improve the classification accuracy, methods such as outlier management and regularization can be used.

Outlier management can help because outliers can have a large impact on classification accuracy: points that deviate sharply from the rest of the data can pull the model toward a poor fit and lead to inaccurate predictions. Assigning a lower weight to outliers reduces their impact. Done properly, this keeps a few extreme points from dominating the model; done incorrectly, it can bias the data instead. In SVM, outliers can distort the decision boundary and cause misclassifications. The choice of kernel function, such as linear, polynomial, or RBF, can affect how sensitive the model is to outliers. One-class SVM can also be used to identify outliers so that they have less impact on the classifier.
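One way the down-weighting idea could look in code is sketched below: a `OneClassSVM` flags unusual points, and those points get a smaller `sample_weight` when the `SVC` is fit. The synthetic data, the `nu=0.05` setting, and the 0.2 weight are all assumptions picked for illustration, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC, OneClassSVM

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=0)

# Flag unusual points: fit_predict returns +1 for inliers, -1 for outliers
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
flags = detector.fit_predict(X_syn)

# Give flagged points a lower weight instead of dropping them (0.2 is arbitrary)
weights = np.where(flags == -1, 0.2, 1.0)

clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X_syn, y_syn, sample_weight=weights)

print("Points flagged as outliers:", int(np.sum(flags == -1)))
print("Training accuracy:", clf.score(X_syn, y_syn))
```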

Another method of improving classification accuracy is regularization. Regularization reduces overfitting by adding a penalty term to the loss function. Two types of regularization are L1 and L2. L1 regularization adds a penalty proportional to the sum of the absolute values of the model's coefficients. It is useful for reducing model complexity because it can drive some coefficients to exactly zero, so fewer features are used and the model becomes simpler. L2 regularization adds a penalty proportional to the sum of the squared coefficients, which pushes the coefficients toward small values without making them exactly zero. This is useful when all features are needed in the model, since none of them are eliminated. Overall, regularization can help generalize and simplify models to improve accuracy. Too much regularization can lead to underfitting, so it is important to choose its strength carefully.
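The difference between the two penalties can be seen by counting zero coefficients. This is a small sketch on synthetic data with mostly uninformative features; the `C=0.05` value is an assumption chosen to make the regularization strong, and the exact counts will vary with the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 50 features, only 5 of them informative
X_syn, y_syn = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=0
)

# Same solver and strength, different penalty (smaller C = stronger penalty)
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X_syn, y_syn)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.05).fit(X_syn, y_syn)

# Count coefficients that are exactly zero under each penalty
l1_zeros = int(np.sum(l1_model.coef_ == 0))
l2_zeros = int(np.sum(l2_model.coef_ == 0))
print("Zero coefficients with L1:", l1_zeros)
print("Zero coefficients with L2:", l2_zeros)
```

In scikit-learn, `C` is the inverse of the regularization strength, so the tuned value `C = 0.1` found earlier corresponds to stronger regularization than the default `C = 1`.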