# Model Definition and Evaluation
## Table of Contents
1. [Model Selection](#model-selection)
2. [Feature Engineering](#feature-engineering)
3. [Hyperparameter Tuning](#hyperparameter-tuning)
4. [Implementation](#implementation)
5. [Evaluation Metrics](#evaluation-metrics)
6. [Comparative Analysis](#comparative-analysis)


In [2]:
# Import necessary libraries
import pandas as pd  # Import pandas for data manipulation and analysis.
import matplotlib.pyplot as plt  # Import matplotlib for plotting graphs and visualizations.
from sklearn.feature_extraction.text import CountVectorizer  # Import CountVectorizer for converting text to numerical data.
from sklearn.utils import shuffle  # Import shuffle to randomize the order of data.
from scipy.sparse import hstack  # Import hstack to combine sparse matrices.
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score  # Import methods for model validation and splitting data.
from sklearn.metrics import accuracy_score, classification_report  # Import metrics to evaluate model performance.
import tensorflow as tf  # Import TensorFlow for deep learning tasks.
from tensorflow.keras.models import Sequential  # Import Sequential for building neural network models layer by layer.
from tensorflow.keras.layers import Dense  # Import Dense to add fully connected layers to the model.
from tensorflow.keras.optimizers import Adam  # Import Adam optimizer for training the model.
import numpy as np  # Import NumPy for numerical operations and array manipulation.
import gzip  # Import gzip for reading and writing compressed files.
from sklearn.linear_model import LogisticRegression  # Import LogisticRegression for classification tasks.


## Model Selection

[Discuss the type(s) of models you consider for this task, and justify the selection.]



## Feature Engineering

[Describe any additional feature engineering you've performed beyond what was done for the baseline model.]


In [3]:
# Load the dataset
# Replace 'your_dataset.csv' with the path to your actual dataset
df = pd.read_csv('/content/Combined-Text-Dataset.csv')

# Feature selection
# Example: Selecting only two features for a simple baseline model
# X = df[['feature1', 'feature2']]
# y = df['target_variable']

# Splitting the dataset
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize a separate CountVectorizer for each of the 'abstract', 'title', and 'keyword' columns.
# Each vectorizer converts the text in its column into numerical features suitable for machine learning models.
vectorizer_abstract = CountVectorizer()  # Initialize a CountVectorizer for the 'abstract' column to convert text data into a matrix of token counts.
vectorizer_title = CountVectorizer()  # Initialize a CountVectorizer for the 'title' column for the same purpose.
vectorizer_keyword = CountVectorizer()  # Initialize a CountVectorizer for the 'keyword' column to convert keyword text data into numerical features.


# Transform the text data into numerical features
X_abstract = vectorizer_abstract.fit_transform(df['abstract'])  # Fit the CountVectorizer on the 'abstract' column and transform the text data into a numerical feature matrix.
X_title = vectorizer_title.fit_transform(df['title'])  # Fit the CountVectorizer on the 'title' column and transform the text data into a numerical feature matrix.
X_keyword = vectorizer_keyword.fit_transform(df['keyword'])  # Fit the CountVectorizer on the 'keyword' column and transform the text data into a numerical feature matrix.

# Combine the feature matrices from 'abstract', 'title', and 'keyword' into a single sparse matrix using horizontal stacking.
X = hstack([X_abstract, X_title, X_keyword])

# Define the target variable 'y' as the 'is_human' column, which contains labels for human-written (1) and AI-generated (0) text.
y = df['is_human']

# Splitting the Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Split the data into training and testing sets.
# 'test_size=0.2' means 20% of the data will be used for testing, while 80% will be used for training.
# 'random_state=42' ensures reproducibility by fixing the randomness in the split.

X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
# Split the original training data into a smaller training set (X_train_split, y_train_split) and a validation set (X_val, y_val).
# 'test_size=0.2' means 20% of the original training data is set aside for validation.
# 'random_state=42' ensures that the split is reproducible.

X_train_split_dense = X_train_split.toarray()  # Convert the sparse matrix of the training split into a dense NumPy array.
X_val_dense = X_val.toarray()  # Convert the sparse matrix of the validation set into a dense NumPy array.
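
The cell above fits each `CountVectorizer` on the full DataFrame before splitting, so the vocabulary is learned from rows that later end up in the validation and test sets. A minimal sketch of one alternative, assuming the same three text columns: split the rows first, then fit the vectorizers on the training rows only. The toy DataFrame and the helper `vectorize_columns` are hypothetical and only stand in for the real dataset.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real dataset.
toy_df = pd.DataFrame({
    "abstract": ["deep learning for text", "a study of birds", "neural text detection",
                 "bird migration patterns", "language model outputs", "field notes on owls"],
    "title": ["text models", "birds", "detection", "migration", "outputs", "owls"],
    "keyword": ["nlp", "biology", "nlp", "biology", "nlp", "biology"],
    "is_human": [0, 1, 0, 1, 0, 1],
})

TEXT_COLUMNS = ["abstract", "title", "keyword"]


def vectorize_columns(train_df, other_df, columns=TEXT_COLUMNS):
    """Fit one CountVectorizer per column on train_df only, then transform both frames."""
    train_parts, other_parts = [], []
    for col in columns:
        vec = CountVectorizer()
        train_parts.append(vec.fit_transform(train_df[col]))  # Vocabulary comes from training rows only.
        other_parts.append(vec.transform(other_df[col]))      # Held-out rows reuse that vocabulary.
    return hstack(train_parts).tocsr(), hstack(other_parts).tocsr()


# Split the rows first, keeping the class balance in both parts.
train_df, test_df = train_test_split(
    toy_df, test_size=0.33, random_state=42, stratify=toy_df["is_human"]
)
X_train_toy, X_test_toy = vectorize_columns(train_df, test_df)
print(X_train_toy.shape, X_test_toy.shape)
```

Words that appear only in the held-out rows are ignored by `transform`, which mirrors what happens when the model meets unseen text.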

## Hyperparameter Tuning

[Discuss any hyperparameter tuning methods you've applied, such as Grid Search or Random Search, and the rationale behind them.]


In [10]:
# Implement hyperparameter tuning
# Example using GridSearchCV with a DecisionTreeClassifier

# param_grid = {'max_depth': [2, 4, 6, 8]}
# grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
# grid_search.fit(X_train, y_train)




## Implementation

[Implement the final model(s) you've selected based on the above steps.]


In [4]:
# Implement the final model(s)
# Example: model = YourChosenModel(best_hyperparameters)
# model.fit(X_train, y_train)

# We are constructing a simple neural network model using the Sequential API from Keras. The model consists of three layers:
# two hidden layers with ReLU activation and an output layer with sigmoid activation.
nn_model = Sequential([
    Dense(64, input_dim=X_train.shape[1], activation='relu'),  # The first hidden layer with 64 neurons, using ReLU activation, and input dimension equal to the number of features.
    Dense(32, activation='relu'),  # The second hidden layer with 32 neurons and ReLU activation.
    Dense(1, activation='sigmoid')  # The output layer with a single neuron and sigmoid activation for binary classification.
])

nn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Compile the neural network model using the Adam optimizer, which adjusts the learning rate during training.
# The loss function is 'binary_crossentropy', appropriate for binary classification tasks.
# We are also tracking 'accuracy' as a metric to evaluate the model's performance during training.

nn_model.fit(X_train_split_dense, y_train_split, epochs=10, batch_size=32, validation_data=(X_val_dense, y_val))
# Train the neural network model using the dense training data (X_train_split_dense) and labels (y_train_split).
# The training runs for 10 epochs, with a batch size of 32, meaning the model updates weights after every 32 samples.
# We are also validating the model's performance after each epoch using the validation data (X_val_dense, y_val).




Epoch 1/10
207/207 ━━━━━━━━━━━━━━━━━━━━ 20s 86ms/step - accuracy: 0.8945 - loss: 0.2808 - val_accuracy: 0.9614 - val_loss: 0.0989
Epoch 2/10
207/207 ━━━━━━━━━━━━━━━━━━━━ 19s 78ms/step - accuracy: 0.9970 - loss: 0.0091 - val_accuracy: 0.9595 - val_loss: 0.1084
Epoch 3/10
207/207 ━━━━━━━━━━━━━━━━━━━━ 17s 81ms/step - accuracy: 0.9997 - loss: 0.0018 - val_accuracy: 0.9607 - val_loss: 0.0992
Epoch 4/10
207/207 ━━━━━━━━━━━━━━━━━━━━ 20s 80ms/step - accuracy: 1.0000 - loss: 3.5228e-04 - val_accuracy: 0.9626 - val_loss: 0.1016
Epoch 5/10
207/207 ━━━━━━━━━━━━━━━━━━━━ 21s 80ms/step - accuracy: 1.0000 - loss: 2.0313e-04 - val_accuracy: 0.9632 - val_loss: 0.1029
Epoch 6/10
207/207 ━━━━━━━━━━━━━━━━━━━━ 19s 75ms/step - accuracy: 1.0000 - loss: 1.3504e-04 - val_accuracy: 0.9632 - val_loss: 0.1046
Epoc

<keras.src.callbacks.history.History at 0x7d3e07ff1650>

In [26]:
nn_model.evaluate(X_val_dense, y_val)

52/52 ━━━━━━━━━━━━━━━━━━━━ 2s 31ms/step - accuracy: 0.9660 - loss: 0.1146


[0.10908212512731552, 0.9649758338928223]
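
In the training log above, `val_loss` is lowest in epoch 1 (0.0989), while training accuracy reaches 1.0000 by epoch 4. This suggests that the later epochs mostly fit the training set more tightly without improving validation loss. The sketch below applies patience-based early stopping logic to the six `val_loss` values recorded in that log. The function name `best_epoch_with_patience` and the patience of 3 are hypothetical choices for illustration. In Keras, the same idea is available through the `tf.keras.callbacks.EarlyStopping` callback passed to `fit`.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve on the
    best validation loss seen so far. stop_epoch is None if that never happens.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, None


# val_loss values copied from the six complete epochs in the log above.
recorded_val_loss = [0.0989, 0.1084, 0.0992, 0.1016, 0.1029, 0.1046]
best_epoch, stop_epoch = best_epoch_with_patience(recorded_val_loss, patience=3)
print(f"Best epoch: {best_epoch}, training would stop after epoch: {stop_epoch}")
```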

In [27]:
y_pred_nn = nn_model.predict(X_val_dense)


52/52 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step


In [28]:
results = {}  # Initialize an empty dictionary to store the results of different models.

In [29]:
# Evaluation
y_pred_nn_binary = (y_pred_nn > 0.5).astype(int)  # Threshold the sigmoid outputs at 0.5 to obtain 0/1 class labels.
accuracy = accuracy_score(y_val, y_pred_nn_binary)  # Compare the predicted labels (y_pred_nn_binary) with the validation labels (y_val).

# Results
# The key 'CNN' refers to the dense (fully connected) network defined above; the model has no convolutional layers.
print(f"CNN Accuracy: {accuracy}")  # Print the validation accuracy of the neural network.
results['CNN'] = accuracy  # Store the neural network's accuracy in the 'results' dictionary under the key 'CNN'.
report = classification_report(y_val, y_pred_nn_binary)  # Generate a classification report with precision, recall, F1-score, and support for each class.
print(f"Classification Report:\n{report}")  # Print the classification report for the validation set.


CNN Accuracy: 0.964975845410628
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.97      0.96       807
           1       0.97      0.96      0.97       849

    accuracy                           0.96      1656
   macro avg       0.96      0.97      0.96      1656
weighted avg       0.97      0.96      0.96      1656



In [30]:
# Logistic Regression
# Training
# The 'results' dictionary created earlier is reused here so that the neural network's accuracy is kept.
model = LogisticRegression()  # Create an instance of the LogisticRegression model.
model.fit(X_train, y_train)  # Train the logistic regression model on the training data (X_train and y_train).

# Predictions
y_pred = model.predict(X_test)  # Use the trained model to predict labels for the test data (X_test).

# Evaluation
accuracy = accuracy_score(y_test, y_pred)  # Compare the predicted labels (y_pred) with the actual test labels (y_test).

# Results
print(f"Logistic_Regression Accuracy: {accuracy}")  # Print the test accuracy of the logistic regression model.
results['Logistic_Regression'] = accuracy  # Store the accuracy in the 'results' dictionary under the key 'Logistic_Regression'.
report = classification_report(y_test, y_pred)  # Generate a classification report with precision, recall, F1-score, and support for each class.
print(f"Classification Report:\n{report}")  # Print the classification report for the test set.


# Optional (currently disabled): 10-fold cross-validation of the logistic regression model on the full dataset (X, y).
# With 'cv=10' the data is split into 10 parts, and the model is trained and tested 10 times, each time on a different part.
# The result is an array of accuracy scores, one per fold, stored in 'cv_scores'.
# cv_scores = cross_val_score(model, X, y, cv=10)

# mean_cv_score = np.mean(cv_scores)  # Mean accuracy across the folds.
# std_cv_score = np.std(cv_scores)  # Standard deviation of the fold accuracies.
# standard_error = std_cv_score / np.sqrt(len(cv_scores))  # Standard error of the mean, which indicates the precision of the cross-validation estimate.

# print(f"Mean Cross-Validation Score: {mean_cv_score:.4f}")  # Mean accuracy, formatted to four decimal places.
# print(f"Standard Error of Cross-Validation Score: {standard_error:.4f}")  # Standard error, formatted to four decimal places.


Logistic_Regression Accuracy: 0.9685990338164251
Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.97      0.97      1058
           1       0.97      0.97      0.97      1012

    accuracy                           0.97      2070
   macro avg       0.97      0.97      0.97      2070
weighted avg       0.97      0.97      0.97      2070
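
The commented-out cross-validation in the cell above runs on features that were vectorized once on the full dataset. A hedged sketch of one alternative is shown below: wrapping the vectorizer and the classifier in a single pipeline so that the vocabulary is re-fit inside each fold. The toy corpus, the single text column, and `cv=5` are assumptions for the example. The standard deviation uses `ddof=1` (sample standard deviation), which is the usual choice when computing a standard error from a handful of fold scores; `np.std` defaults to `ddof=0`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 10 documents per class, not taken from the real dataset.
human_like = [f"we observed sample {i} in the field study" for i in range(10)]
machine_like = [f"this paper proposes novel framework {i} for tasks" for i in range(10)]
texts = human_like + machine_like
labels = np.array([1] * 10 + [0] * 10)

# The vectorizer is part of the pipeline, so each fold fits it on that fold's training texts only.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(pipeline, texts, labels, cv=5)

mean_cv_score = cv_scores.mean()
std_cv_score = cv_scores.std(ddof=1)                      # Sample standard deviation across folds.
standard_error = std_cv_score / np.sqrt(len(cv_scores))   # Standard error of the mean fold accuracy.

print(f"Mean Cross-Validation Score: {mean_cv_score:.4f}")
print(f"Standard Error of Cross-Validation Score: {standard_error:.4f}")
```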



## Evaluation Metrics

[Clearly specify which metrics you'll use to evaluate the model performance, and why you've chosen these metrics.]


In [20]:
# Evaluate the model using your chosen metrics
# Example for classification
# y_pred = model.predict(X_test)
# print(classification_report(y_test, y_pred))

# Example for regression
# mse = mean_squared_error(y_test, y_pred)

# Your evaluation code here
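
The classification reports earlier in this notebook list precision, recall, and F1-score per class. As an illustration of how those numbers relate to the confusion matrix, the sketch below computes them on a small set of hypothetical labels (not taken from the dataset), once by hand and once with the scikit-learn helpers.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels for illustration: 1 = human-written, 0 = AI-generated.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_manual = tp / (tp + fp)  # Share of predicted positives that are correct.
recall_manual = tp / (tp + fn)     # Share of actual positives that are found.
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision: {precision_manual:.2f} (sklearn: {precision_score(y_true_demo, y_pred_demo):.2f})")
print(f"Recall:    {recall_manual:.2f} (sklearn: {recall_score(y_true_demo, y_pred_demo):.2f})")
print(f"F1-score:  {f1_manual:.2f} (sklearn: {f1_score(y_true_demo, y_pred_demo):.2f})")
```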


## Comparative Analysis

[Compare the performance of your model(s) against the baseline model. Discuss any improvements or setbacks and the reasons behind them.]


In [None]:
# Comparative Analysis code (if applicable)
# Example: comparing accuracy of the baseline model and the new model
# print(f"Baseline Model Accuracy: {baseline_accuracy}, New Model Accuracy: {new_model_accuracy}")
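
The two accuracies reported above are close: 0.9650 for the neural network on 1656 validation rows and 0.9686 for logistic regression on 2070 test rows. One rough way to judge whether such a gap is meaningful is a normal-approximation confidence interval for each accuracy. The sketch below is only indicative, since the two scores were measured on different splits; the helper `accuracy_interval` and the 95% level (`z=1.96`) are assumptions for the example.

```python
import math


def accuracy_interval(accuracy, n_samples, z=1.96):
    """Normal-approximation confidence interval for an accuracy measured on n_samples."""
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_samples)
    return accuracy - z * std_err, accuracy + z * std_err


# Accuracies and sample counts copied from the outputs earlier in this notebook.
nn_low, nn_high = accuracy_interval(0.964975845410628, 1656)
lr_low, lr_high = accuracy_interval(0.9685990338164251, 2070)

# Two intervals overlap when each one starts before the other ends.
intervals_overlap = nn_low <= lr_high and lr_low <= nn_high

print(f"Neural network:      [{nn_low:.4f}, {nn_high:.4f}]")
print(f"Logistic regression: [{lr_low:.4f}, {lr_high:.4f}]")
print(f"Intervals overlap: {intervals_overlap}")
```

Overlapping intervals suggest that the difference between the two models may be within sampling noise; evaluating both models on the same held-out split would give a more direct comparison.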
