## Testing prediction models

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


In [2]:
# Load dataset
data = pd.read_csv("test_data cs 1.csv")  # Replace with actual dataset path

In [3]:
# Separate the target column ('compare_text') from the feature columns
target = 'compare_text'
features = [col for col in data.columns if col != target]

In [4]:
print(data.columns)

Index(['text', 'screen_name', 'user_lang', 'lang', 'time_zone', 'location',
       'verified', 'friends_count', 'compare_text', 'source', 'created_at',
       'favourites_count', 'listed_count', 'statuses_count', 'followers_count',
       'label', 'cred_score', 'eye_truth'],
      dtype='object')


In [5]:
# Drop every row that has at least one missing value
data.dropna(inplace=True)

In [6]:
print(data[features].dtypes)

text                 object
screen_name          object
user_lang            object
lang                 object
time_zone            object
location             object
verified             object
friends_count         int64
source               object
created_at           object
favourites_count      int64
listed_count          int64
statuses_count        int64
followers_count       int64
label                object
cred_score            int64
eye_truth           float64
dtype: object


In [7]:
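from sklearn.preprocessing import LabelEncoder

# Replace each string column with integer codes, using a fresh encoder per column.
# The codes follow the sorted order of the values, so their magnitude is arbitrary.
for col in data.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])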
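
A small illustration of what the encoding step does may help when reading the results later. The sketch below uses a hypothetical toy list (`langs`) standing in for a column such as `lang`; it is not taken from the dataset. It shows that the integer codes simply follow alphabetical order, so a model that treats them as magnitudes (a linear model or an MLP) sees an ordering that carries no meaning.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy values standing in for a categorical column such as 'lang'
langs = ["en", "es", "en", "fr", "es"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(langs)

# Classes are sorted, then numbered 0..n-1
print(le_demo.classes_.tolist())  # ['en', 'es', 'fr']
print(codes.tolist())             # [0, 1, 0, 2, 1]
```

Tree-based models are usually less sensitive to this arbitrary ordering, which may be one reason they cope better with features encoded this way.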

In [8]:
# Safeguard: force every feature column to a numeric dtype
for col in features:
    data[col] = pd.to_numeric(data[col], errors='coerce')  # Invalid strings become NaN
data.dropna(inplace=True)  # Drop rows where the conversion produced NaN


In [9]:
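# Standardize features to zero mean and unit variance (helps the neural network converge).
# Note: the scaler is fit on all rows before the train/test split, so statistics
# from the test rows leak into the training data.
scaler = StandardScaler()
data[features] = scaler.fit_transform(data[features])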

In [10]:
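# Feature selection: keep the 10 highest-scoring features.
# Note: f_classif is an ANOVA F-test meant for categorical targets. 'compare_text' is
# continuous, so f_regression is the matching score function for this task.
selector = SelectKBest(f_classif, k=10)
data_selected = selector.fit_transform(data[features], data[target])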
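
Because the target `compare_text` is continuous, a regression-oriented score function is the more natural choice for `SelectKBest`. The following is a minimal sketch on synthetic data, not a rerun of the cell above: the arrays `X_demo` and `y_demo` and the choice `k=2` are assumptions made only for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Hypothetical continuous target that depends only on columns 0 and 3
y_demo = 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# f_regression scores each feature by a univariate linear F-test against y
selector_demo = SelectKBest(score_func=f_regression, k=2)
X_demo_selected = selector_demo.fit_transform(X_demo, y_demo)

# Indices of the columns that were kept
kept = selector_demo.get_support(indices=True)
print(kept, X_demo_selected.shape)
```

`get_support(indices=True)` is also a convenient way to see which of the original columns survived selection, which the notebook does not currently report.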

In [11]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(data_selected, data[target], test_size=0.2, random_state=42)
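
The scaler and the selector above are fit on every row before this split, so the held-out rows influence the preprocessing. One way to avoid that, sketched here on synthetic data with hypothetical names (`X_toy`, `y_toy`, `pipe`), is to split first and wrap the preprocessing and the model in a scikit-learn pipeline. The pipeline then refits the scaler and selector on the training portion only, including inside each cross-validation fold.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical data: 4 features, one of which (column 2) has no effect on the target
X_toy = rng.normal(loc=5.0, scale=3.0, size=(300, 4))
y_toy = X_toy @ np.array([0.5, -0.2, 0.0, 0.1]) + rng.normal(scale=0.5, size=300)

# Split BEFORE any fitting so the test rows stay unseen
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Scaling and selection are learned from the training rows only
pipe = make_pipeline(StandardScaler(), SelectKBest(f_regression, k=3), LinearRegression())
pipe.fit(X_tr, y_tr)

test_r2 = pipe.score(X_te, y_te)  # R² on the held-out rows
cv_r2 = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring="r2").mean()
print(f"test R2 = {test_r2:.3f}, CV R2 = {cv_r2:.3f}")
```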

In [12]:
print(y_train.dtype)
print(y_train[:5])  # Preview first 5 values

float64
87456    0.813037
4168     0.830008
41962    0.821274
78294    0.845077
13945    0.835596
Name: compare_text, dtype: float64


In [13]:
print(X_train.dtype)
print(X_train[:5])  # Preview first 5 values

float64
[[ 0.18719771  0.18719771  0.54938101 -1.03329507 -1.13296478  0.02950662
   1.29604484 -1.30342201 -0.09644502 -0.44429257]
 [ 1.68201181  1.68201181  1.26643951 -1.03329507 -1.13296478 -0.50661619
  -1.57484965 -1.30342201 -0.09644502 -1.12313561]
 [ 1.63493105  1.63493105  1.26643951  0.96777777  0.88263997 -1.30392703
  -0.29631364  0.46425248 -0.09644502 -0.3304408 ]
 [-0.16590798 -0.16590798 -0.16767749 -1.03329507  0.88263997  0.01575988
   1.00546847  0.46425248 -0.09644502 -0.31531499]
 [ 0.69331587  0.69331587 -0.88473599 -1.03329507 -1.13296478 -0.12170751
  -1.24940412  1.34808972 -0.09644502 -0.30331522]]


In [14]:
# Define regression models
models = {
    "Neural Network": MLPRegressor(hidden_layer_sizes=(100,), max_iter=500, random_state=42),
    "Boosted Decision Tree": GradientBoostingRegressor(n_estimators=100, random_state=42),
    "Linear Regression": LinearRegression()
}

# Train, score, and evaluate models
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Evaluation metrics for regression
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    cv_score = np.mean(cross_val_score(model, X_train, y_train, cv=5, scoring='r2'))

    print(f"{name}: MSE = {mse:.4f}, R² = {r2:.4f}, Cross-Validation Score = {cv_score:.4f}")


Neural Network: MSE = 0.0001, R² = 0.5650, Cross-Validation Score = 0.4614
Boosted Decision Tree: MSE = 0.0001, R² = 0.6362, Cross-Validation Score = 0.6464
Linear Regression: MSE = 0.0001, R² = 0.0650, Cross-Validation Score = 0.0612


### Understanding the metrics

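MSE (Mean Squared Error): The average squared difference between predicted and actual values, in squared units of the target. Lower is better, but the value is only meaningful relative to the variance of the target.

R² (R-Squared): The fraction of the target's variance that the model explains. A value of 1 is a perfect fit, 0 is no better than always predicting the mean, and negative values are worse than that.

Cross-Validation Score: The mean R² over 5 folds of the training set (`cv=5`). Each fold is held out once while the model is refit on the other four, which gives an estimate of generalization that does not depend on a single split.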
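
The two metrics are linked by R² = 1 − MSE / Var(y), where Var(y) is the variance of the actual values. The sketch below uses made-up numbers on a narrow scale, similar in range to the previewed `compare_text` values but not taken from the dataset, to show how an MSE that looks tiny can still correspond to a modest R².

```python
import numpy as np

# Hypothetical actual and predicted values on a narrow scale
y_true = np.array([0.81, 0.83, 0.82, 0.85, 0.84])
y_hat = np.array([0.82, 0.82, 0.83, 0.84, 0.83])

mse_demo = np.mean((y_true - y_hat) ** 2)           # average squared error
var_demo = np.mean((y_true - y_true.mean()) ** 2)   # variance of the actual values
r2_demo = 1.0 - mse_demo / var_demo                 # same definition as sklearn's r2_score

print(f"MSE = {mse_demo:.4f}, R2 = {r2_demo:.4f}")  # MSE = 0.0001, R2 = 0.5000
```

When the target varies this little, printing the MSE in scientific notation (for example `{mse:.2e}`) keeps the differences between models visible.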

### Model Comparisons

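Neural Network (MLPRegressor)

MSE = 0.0001 (Small in absolute terms, but the previewed target values span only about 0.81 to 0.85, and all three models round to the same value)
R² = 0.5650 (Explains 56.5% of the variance in the test set)
CV Score = 0.4614 (About 0.10 below the test R²)
Interpretation: The neural network performs decently and captures a good share of the structure in the data. The gap between its cross-validation score and its test R² suggests that its results vary from split to split. This is not evidence of overfitting by itself: overfitting would show up as a training score well above the held-out score, and no training score is reported here.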
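
Overfitting is usually diagnosed by comparing the score on the training rows with the score on held-out rows, and split-to-split stability by looking at the individual fold scores rather than only their mean. A minimal sketch on synthetic data follows; the arrays `X_syn` and `y_syn` and the choice of estimator are assumptions for illustration, and the same three lines of scoring could be applied to any of the fitted models above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
# Hypothetical nonlinear data: 3 features, the third is pure noise
X_syn = rng.uniform(-3, 3, size=(400, 3))
y_syn = np.sin(X_syn[:, 0]) + 0.5 * X_syn[:, 1] ** 2 + rng.normal(scale=0.3, size=400)

X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

model_demo = GradientBoostingRegressor(n_estimators=100, random_state=42)
model_demo.fit(X_a, y_a)

train_r2 = model_demo.score(X_a, y_a)  # R² on the rows the model was fit on
test_r2 = model_demo.score(X_b, y_b)   # R² on unseen rows
fold_r2 = cross_val_score(model_demo, X_a, y_a, cv=5, scoring="r2")  # one R² per fold

print(f"train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
print(f"CV R2 = {fold_r2.mean():.3f} +/- {fold_r2.std():.3f}")
```

A large gap between the training and test R² points to overfitting, while a large standard deviation across folds points to instability between splits.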

Boosted Decision Tree (Gradient Boosting Regressor)

MSE = 0.0001 (Same as the neural network at four decimal places)
R² = 0.6362 (Explains 63.6% of the variance in the test set)
CV Score = 0.6464 (Very close to the test R², so the result is stable across splits)
Interpretation: This is the best-performing model. It explains more variance than the neural network, and its cross-validation score agrees with its test score, so it appears to generalize well.

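Linear Regression

MSE = 0.0001 (Same as the other models at four decimal places, so it does not separate them)
R² = 0.0650 (Explains only 6.5% of the variance, which is very poor)
CV Score = 0.0612 (Consistently poor across splits)
Interpretation: The linear model is weak at capturing patterns in the data. Its R² is close to 0 while the two nonlinear models reach 0.57 to 0.64 on the same features, which suggests that the relationships in the dataset are nonlinear or depend on feature interactions.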

### Takeaways

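Gradient Boosting performed best: it explains the most variance (R² = 0.64) and its cross-validation score matches its test score.
The neural network is reasonable (R² = 0.57), but its scores vary more between splits. Whether it overfits cannot be judged without a training score.
Linear Regression is not a good fit (R² = 0.07), likely because the data has nonlinear relationships that it cannot model.
MSE rounds to 0.0001 for all three models, so R² is the metric that separates them here.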