# Model Definition and Evaluation
## Table of Contents
1. [Model Selection](#model-selection)
2. [Feature Engineering](#feature-engineering)
3. [Hyperparameter Tuning](#hyperparameter-tuning)
4. [Implementation](#implementation)
5. [Evaluation Metrics](#evaluation-metrics)
6. [Comparative Analysis](#comparative-analysis)


In [None]:
# Import necessary libraries
import pandas as pd  # Import pandas for data manipulation and analysis.
import matplotlib.pyplot as plt  # Import matplotlib for plotting graphs and visualizations.
from sklearn.feature_extraction.text import CountVectorizer  # Import CountVectorizer for converting text to numerical data.
from sklearn.utils import shuffle  # Import shuffle to randomize the order of data.
from scipy.sparse import hstack  # Import hstack to combine sparse matrices.
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score  # Import methods for model validation and splitting data.
from sklearn.metrics import accuracy_score, classification_report  # Import metrics to evaluate model performance.
import tensorflow as tf  # Import TensorFlow for deep learning tasks.
from tensorflow.keras.models import Sequential  # Import Sequential for building neural network models layer by layer.
from tensorflow.keras.layers import Dense  # Import Dense to add fully connected layers to the model.
from tensorflow.keras.optimizers import Adam  # Import Adam optimizer for training the model.
import numpy as np  # Import NumPy for numerical operations and array manipulation.
import gzip  # Import gzip for reading and writing compressed files.
from sklearn.linear_model import LogisticRegression  # Import LogisticRegression for classification tasks.
#from sklearn.metrics import make_scorer, accuracy_score
#from scikeras.wrappers import KerasClassifier  # Import KerasClassifier


## Model Selection

[Discuss the type(s) of models you consider for this task, and justify the selection.]



## Feature Engineering

[Describe any additional feature engineering you've performed beyond what was done for the baseline model.]
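
The feature step in this notebook builds one bag-of-words matrix per text column and stacks the matrices side by side. The following is a minimal sketch of that idea on a toy frame; the column names match the notebook, but the two rows are made up purely for illustration.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-row frame; only the column names come from the notebook.
toy = pd.DataFrame({
    "title": ["deep learning survey", "graph methods"],
    "abstract": ["we survey deep learning", "we study graph methods for text"],
    "keyword": ["deep learning", "graph"],
})

blocks = []
for column in ["abstract", "title", "keyword"]:
    vectorizer = CountVectorizer()  # one vocabulary per column
    blocks.append(vectorizer.fit_transform(toy[column]))

# Columns of the stacked matrix are the three vocabularies placed side by side.
X_toy = hstack(blocks).tocsr()
print(X_toy.shape)
```

Keeping a separate vectorizer per column means the same word gets a different feature depending on whether it appears in the title, the abstract, or the keywords.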


In [None]:
# Load the dataset
# Replace 'your_dataset.csv' with the path to your actual dataset
df = pd.read_csv('/content/Combined-Text-Dataset.csv')

# Feature selection
# Example: Selecting only two features for a simple baseline model
# X = df[['feature1', 'feature2']]
# y = df['target_variable']

# Splitting the dataset
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize a separate `CountVectorizer` for each of the `abstract`, `title`, and `keyword` columns. Each vectorizer converts its column's text into numerical features suitable for machine learning models.
vectorizer_abstract = CountVectorizer()  # Initialize a CountVectorizer for the 'abstract' column to convert text data into a matrix of token counts.
vectorizer_title = CountVectorizer()  # Initialize a CountVectorizer for the 'title' column for the same purpose.
vectorizer_keyword = CountVectorizer()  # Initialize a CountVectorizer for the 'keyword' column to convert keyword text data into numerical features.


## Transforming Text Data into Numerical Features
X_abstract = vectorizer_abstract.fit_transform(df['abstract'])  # Fit the CountVectorizer on the 'abstract' column and transform the text data into a numerical feature matrix.
X_title = vectorizer_title.fit_transform(df['title'])  # Fit the CountVectorizer on the 'title' column and transform the text data into a numerical feature matrix.
X_keyword = vectorizer_keyword.fit_transform(df['keyword'])  # Fit the CountVectorizer on the 'keyword' column and transform the text data into a numerical feature matrix.

# Combine the feature matrices from 'abstract', 'title', and 'keyword' into a single sparse matrix using horizontal stacking.
X = hstack([X_abstract, X_title, X_keyword])

# Define the target variable 'y' as the 'is_human' column, which contains labels for human-written (1) and AI-generated (0) text.
y = df['is_human']

# Splitting the Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Split the data into training and testing sets.
# 'test_size=0.2' means 20% of the data will be used for testing, while 80% will be used for training.
# 'random_state=42' ensures reproducibility by fixing the randomness in the split.

X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
# Split the original training data into a smaller training set (X_train_split, y_train_split) and a validation set (X_val, y_val).
# 'test_size=0.2' means 20% of the original training data is set aside for validation.
# 'random_state=42' ensures that the split is reproducible.

X_train_split_dense = X_train_split.toarray()  # Convert the sparse matrix of the training split into a dense NumPy array.
X_val_dense = X_val.toarray()  # Convert the sparse matrix of the validation set into a dense NumPy array.

## Hyperparameter Tuning

[Discuss any hyperparameter tuning methods you've applied, such as Grid Search or Random Search, and the rationale behind them.]
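
As a rough illustration of the difference between the two search strategies named above, the sketch below runs both on synthetic data. The dataset (`make_classification`), the small grid, and the variable names are assumptions chosen for the example, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data; the real notebook uses the stacked text features.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Illustrative grid: 3 values of C x 2 solvers = 6 candidate settings.
param_grid = {"C": [0.1, 1, 10], "solver": ["liblinear", "lbfgs"]}
base = LogisticRegression(max_iter=1000)

# Grid search evaluates every combination in the grid.
grid = GridSearchCV(base, param_grid, cv=5, scoring="accuracy")
grid.fit(X_demo, y_demo)

# Randomized search evaluates only n_iter sampled combinations.
rand = RandomizedSearchCV(base, param_grid, n_iter=4, cv=5,
                          scoring="accuracy", random_state=0)
rand.fit(X_demo, y_demo)

print(grid.best_params_, rand.best_params_)
```

Grid search is exhaustive and therefore more expensive; randomized search trades completeness for a fixed budget of `n_iter` candidates.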


In [None]:

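from sklearn.tree import DecisionTreeClassifier  # Import DecisionTreeClassifier (used only in the commented-out example below).
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV  # Import the search classes used in the tuning cell below.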


In [None]:
# Implement hyperparameter tuning
# Example using GridSearchCV with a DecisionTreeClassifier

# param_grid = {'max_depth': [2, 4, 6, 8]}
# grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
# grid_search.fit(X_train, y_train)

param_grid = {
    'C': [0.1, 1, 10],  # Inverse of regularization strength (smaller values regularize more)
    'penalty': ['l1', 'l2'],  # Regularization type
    'solver': ['liblinear', 'saga']  # Algorithm to use in the optimization problem
}

# For Grid Search
grid_search = GridSearchCV(estimator=LogisticRegression(), param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# For Random Search
random_search = RandomizedSearchCV(estimator=LogisticRegression(), param_distributions=param_grid, n_iter=10, cv=5, scoring='accuracy')
random_search.fit(X_train, y_train)


best_params = grid_search.best_params_  # or random_search.best_params_
best_model = grid_search.best_estimator_  # or random_search.best_estimator_





In [None]:
!pip install keras-tuner --upgrade

import keras_tuner as kt

def model_builder(hp):
    model = Sequential()
    model.add(Dense(units=hp.Int('units', min_value=32, max_value=512, step=32),
                    activation='relu', input_dim=X_train_split_dense.shape[1]))
    model.add(Dense(1, activation='sigmoid'))

    # Tune the learning rate for the optimizer
    # Choose an optimal value from 0.01, 0.001, or 0.0001
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])

    model.compile(optimizer=Adam(learning_rate=hp_learning_rate),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    return model

tuner = kt.Hyperband(model_builder,
                     objective='val_accuracy',
                     max_epochs=10,
                     factor=3,
                     directory='my_dir',
                     project_name='intro_to_kt')

tuner.search(X_train_split_dense, y_train_split, epochs=10, validation_data=(X_val_dense, y_val))

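best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]  # Retrieve the best hyperparameter set found by the search.
model = tuner.hypermodel.build(best_hps)  # Build a fresh, untrained model with those hyperparameters.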

Trial 30 Complete [00h 00m 32s]
val_accuracy: 0.9610678553581238

Best val_accuracy So Far: 0.9766407012939453
Total elapsed time: 00h 07m 50s


## Implementation

[Implement the final model(s) you've selected based on the above steps.]


In [None]:
# Implement the final model(s)
# Example: model = YourChosenModel(best_hyperparameters)
# model.fit(X_train, y_train)

def create_model():
    nn_model = Sequential([
        Dense(64, input_dim=X_train.shape[1], activation='relu'),
        Dense(32, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    return nn_model
nn_model=create_model()
nn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
nn_model.fit(X_train_split_dense, y_train_split, epochs=10, batch_size=32, validation_data=(X_val_dense, y_val))

# The network above is built with the Sequential API from Keras and consists of three Dense layers:
# - a first hidden layer with 64 neurons and ReLU activation, with input dimension equal to the number of features;
# - a second hidden layer with 32 neurons and ReLU activation;
# - an output layer with a single neuron and sigmoid activation for binary classification.
#
# The model is compiled with the Adam optimizer, which adapts the step size of each parameter during training.
# The loss function is 'binary_crossentropy', appropriate for binary classification tasks.
# 'accuracy' is tracked as a metric to evaluate the model's performance during training.
#
# Training uses the dense training data (X_train_split_dense) and labels (y_train_split).
# It runs for 10 epochs with a batch size of 32, meaning the model updates weights after every 32 samples.
# The model is also validated after each epoch on the validation data (X_val_dense, y_val).




Epoch 1/10
113/113 ━━━━━━━━━━━━━━━━━━━━ 5s 28ms/step - accuracy: 0.8784 - loss: 0.3445 - val_accuracy: 0.9611 - val_loss: 0.0962
Epoch 2/10
113/113 ━━━━━━━━━━━━━━━━━━━━ 2s 11ms/step - accuracy: 0.9965 - loss: 0.0108 - val_accuracy: 0.9555 - val_loss: 0.0998
Epoch 3/10
113/113 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - accuracy: 1.0000 - loss: 0.0014 - val_accuracy: 0.9533 - val_loss: 0.1018
Epoch 4/10
113/113 ━━━━━━━━━━━━━━━━━━━━ 3s 13ms/step - accuracy: 1.0000 - loss: 4.7432e-04 - val_accuracy: 0.9533 - val_loss: 0.1023
Epoch 5/10
113/113 ━━━━━━━━━━━━━━━━━━━━ 2s 13ms/step - accuracy: 1.0000 - loss: 2.8188e-04 - val_accuracy: 0.9533 - val_loss: 0.1031
Epoch 6/10
113/113 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - accuracy: 1.0000 - loss: 1.9343e-04 - val_accuracy: 0.9533 - val_loss: 0.1046
Epoch 7/10

<keras.src.callbacks.history.History at 0x79d0f68ea6d0>

In [None]:
nn_model.evaluate(X_val_dense, y_val)

8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.9476 - loss: 0.1015


[0.14175210893154144, 0.9375]

In [None]:
y_pred_nn = nn_model.predict(X_val_dense)


8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 24ms/step


In [None]:
results = {}  # Initialize an empty dictionary to store the results of different models.

# evaluation
y_pred_nn_binary = (y_pred_nn > 0.5).astype(int)  # Threshold the sigmoid outputs at 0.5 to obtain binary class labels.
accuracy = accuracy_score(y_val, y_pred_nn_binary)  # Calculate the accuracy by comparing the predicted labels (y_pred_nn_binary) to the validation labels (y_val).

# results
results['CNN'] = accuracy  # Store the neural network's validation accuracy under the key 'CNN' (the model is a fully connected network, not a convolutional one).
report1 = classification_report(y_val, y_pred_nn_binary)  # Generate a classification report that includes precision, recall, F1-score, and support for each class.
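
A model ending in `Dense(1, activation='sigmoid')` returns predictions of shape `(n_samples, 1)`, so the thresholded labels are a column vector as well. The NumPy-only sketch below, using made-up probabilities and labels, shows the thresholding step with an explicit flatten before scoring; all values here are hypothetical.

```python
import numpy as np

# Hypothetical sigmoid outputs shaped like Keras predict() output: (n_samples, 1).
probs = np.array([[0.91], [0.12], [0.50], [0.73]])
labels = np.array([1, 0, 1, 1])  # hypothetical ground truth

# Strict ">" assigns a probability of exactly 0.5 to class 0.
preds = (probs > 0.5).astype(int).ravel()

# Fraction of predictions that match the labels.
accuracy = float((preds == labels).mean())
print(preds, accuracy)
```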



In [None]:
#!pip install scikeras
#!pip uninstall -y scikit-learn
#!pip install scikit-learn==1.3.1

Collecting scikeras
  Downloading scikeras-0.13.0-py3-none-any.whl.metadata (3.1 kB)
Downloading scikeras-0.13.0-py3-none-any.whl (26 kB)
Installing collected packages: scikeras
Successfully installed scikeras-0.13.0


In [None]:
from sklearn.metrics import make_scorer, accuracy_score
#Import the correct version of KerasClassifier
from scikeras.wrappers import KerasClassifier  # Import KerasClassifier


In [None]:
# Wrap the Keras model with KerasClassifier so that it can be used with scikit-learn utilities such as cross_val_score.
nn_model_wrapped = KerasClassifier(model=nn_model, epochs=10, batch_size=32, verbose=0)


scoring = make_scorer(accuracy_score)  # Build an accuracy scorer for cross-validation.

cv_scores1 = cross_val_score(nn_model_wrapped, X.toarray(), y, cv=10, scoring=scoring)
# Perform 10-fold cross-validation on the wrapped neural network using the full dataset (X, y); X is first converted to a dense array.
# The 'cv=10' parameter indicates that the data is split into 10 parts, and the model is trained and tested 10 times, each time on a different part.
# The result is an array of accuracy scores from each of the 10 folds, stored in 'cv_scores1'.

mean_cv_score1 = np.mean(cv_scores1)  # Calculate the mean accuracy score from the cross-validation results.
std_cv_score1 = np.std(cv_scores1)  # Calculate the standard deviation of the accuracy scores from the cross-validation.
standard_error1 = std_cv_score1 / np.sqrt(len(cv_scores1))  # Compute the standard error of the mean, which indicates the precision of the cross-validation results.



In [None]:
# Logistic Regression
# training
# results = {}  # Initialize an empty dictionary to store the results of different models.
model = LogisticRegression()  # Create an instance of the LogisticRegression model.
model.fit(X_train, y_train)  # Train the logistic regression model using the training data (X_train and y_train).

# Predictions
y_pred = model.predict(X_test)  # Use the trained logistic regression model to predict labels for the test data (X_test).

# evaluation
accuracy = accuracy_score(y_test, y_pred)  # Calculate the accuracy of the model by comparing the predicted labels (y_pred) to the actual labels (y_test).


cv_scores2 = cross_val_score(model, X, y, cv=10)
# Perform 10-fold cross-validation on the logistic regression model using the full dataset (X, y).
# The 'cv=10' parameter indicates that the data is split into 10 parts, and the model is trained and tested 10 times, each time on a different part.
# The result is an array of accuracy scores from each of the 10 folds, stored in 'cv_scores2'.

#results
results['Logistic_Regression'] = accuracy  # Store the accuracy of the logistic regression model in the 'results' dictionary under the key 'Logistic_Regression'.
report2 = classification_report(y_test, y_pred)  # Generate a classification report that includes precision, recall, F1-score, and support for each class.

mean_cv_score2 = np.mean(cv_scores2)  # Calculate the mean accuracy score from the cross-validation results.
std_cv_score2 = np.std(cv_scores2)  # Calculate the standard deviation of the accuracy scores from the cross-validation.
standard_error2 = std_cv_score2 / np.sqrt(len(cv_scores2))  # Compute the standard error of the mean, which indicates the precision of the cross-validation results.


## Evaluation Metrics

[Clearly specify which metrics you'll use to evaluate the model performance, and why you've chosen these metrics.]
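
The cells in this notebook summarize cross-validation with a mean score and a standard error computed as `np.std(scores) / np.sqrt(n)`. The sketch below repeats that calculation on made-up fold scores and, as an assumption worth checking for your own use, also shows the variant with `ddof=1` (the sample standard deviation), which gives a slightly larger estimate for a small number of folds.

```python
import numpy as np

# Hypothetical accuracy scores from five folds.
fold_scores = np.array([0.94, 0.96, 0.95, 0.93, 0.97])
n_folds = len(fold_scores)

mean_score = fold_scores.mean()

# np.std defaults to ddof=0 (population standard deviation), as used in the notebook.
se_population = fold_scores.std(ddof=0) / np.sqrt(n_folds)

# ddof=1 uses the sample standard deviation instead.
se_sample = fold_scores.std(ddof=1) / np.sqrt(n_folds)

print(f"{mean_score:.4f} {se_population:.4f} {se_sample:.4f}")
```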


In [None]:
# Evaluate the model using your chosen metrics
# Example for classification
# y_pred = model.predict(X_test)
# print(classification_report(y_test, y_pred))

# Example for regression
# mse = mean_squared_error(y_test, y_pred)

# Your evaluation code here
print(f"CNN Accuracy: {results['CNN']}")  # Print the neural network's accuracy on the validation set.
print(f"Classification Report:\n{report2}")  # Print the classification report of the logistic regression model on the test set (report2).

print(f"Mean Cross-Validation Score: {mean_cv_score2:.4f}")  # Print the logistic regression model's mean cross-validation accuracy, formatted to four decimal places.
print(f"Standard Error of Cross-Validation Score: {standard_error2:.4f}")  # Print the standard error of that cross-validation score, also formatted to four decimal places.


CNN Accuracy: 0.9375
Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.96      0.95       148
           1       0.96      0.95      0.95       152

    accuracy                           0.95       300
   macro avg       0.95      0.95      0.95       300
weighted avg       0.95      0.95      0.95       300

Mean Cross-Validation Score: 0.9393
Standard Error of Cross-Validation Score: 0.0075


## Comparative Analysis

[Compare the performance of your model(s) against the baseline model. Discuss any improvements or setbacks and the reasons behind them.]
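
One rough way to judge whether a gap between two mean cross-validation scores is larger than the noise in those scores is to divide the gap by the combined standard error. The helper below is a hypothetical sketch with made-up numbers; it treats the two score sets as independent, which is only an approximation when both models are evaluated on the same folds.

```python
import math


def compare_cv(mean_a, se_a, mean_b, se_b):
    """Return (gap, gap / combined standard error) for two CV summaries."""
    gap = mean_b - mean_a
    combined_se = math.sqrt(se_a ** 2 + se_b ** 2)
    return gap, gap / combined_se


# Hypothetical summaries: model A at 0.90 +/- 0.010, model B at 0.93 +/- 0.010.
gap, ratio = compare_cv(0.90, 0.010, 0.93, 0.010)
print(f"gap={gap:.3f}, ratio={ratio:.2f}")
```

A ratio well above 2 suggests the gap is unlikely to be fold-to-fold noise alone; a ratio near or below 1 suggests the two models are hard to distinguish on this evidence.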


In [None]:
# Comparative Analysis code (if applicable)
# Example: comparing accuracy of the baseline model and the new model
# print(f"Baseline Model Accuracy: {baseline_accuracy}, New Model Accuracy: {new_model_accuracy}")

print("Baseline Model Accuracy")
# results
print(f"Logistic_Regression Accuracy: {results['Logistic_Regression']}")  # Print the baseline's accuracy on the test set.
#print(f"Classification Report:\n{report2}")  # Print the baseline's classification report.

print(f"Mean Cross-Validation Score: {mean_cv_score2:.4f}")  # mean_cv_score2 comes from the logistic regression cross-validation.
print(f"Standard Error of Cross-Validation Score: {standard_error2:.4f}")  # Standard error of the logistic regression cross-validation score.


print("new_model_accuracy")
# results
print(f"CNN Accuracy: {results['CNN']}")  # Print the neural network's accuracy on the validation set.
#print(f"Classification Report:\n{report1}")  # Print the neural network's classification report.

print(f"Mean Cross-Validation Score: {mean_cv_score1:.4f}")  # mean_cv_score1 comes from the neural network cross-validation.
print(f"Standard Error of Cross-Validation Score: {standard_error1:.4f}")  # Standard error of the neural network cross-validation score.



Baseline Model Accuracy
Logistic_Regression Accuracy: 0.9533333333333334
Mean Cross-Validation Score: 0.9393
Standard Error of Cross-Validation Score: 0.0075
new_model_accuracy
CNN Accuracy: 0.9375
Mean Cross-Validation Score: 0.9733
Standard Error of Cross-Validation Score: 0.0035
