In [1]:
!pip install numpy pandas scikit-learn tensorflow


### Data Preparation and Feature Extraction

The code begins by importing the necessary libraries: pandas for data manipulation, train_test_split from sklearn.model_selection for splitting the dataset, TfidfVectorizer for text feature extraction, accuracy_score for model evaluation, and LogisticRegression for the classification model.

It then loads the training and test data from CSV files. The training data is split 80/20 into training and validation sets with train_test_split, using the 'sentence' column as the feature and the 'difficulty' column as the label; random_state=42 makes the split reproducible. The text is vectorized with TfidfVectorizer, capped at 5000 features, to convert it into a numeric format suitable for model training. The vectorizer is fit on the training text only and then applied to the validation text, so no vocabulary or IDF statistics leak from the validation set.

In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

# Load the data
train_data = pd.read_csv('../1_DATA/1_0_PROJECT_DATA/training_data.csv')
test_data = pd.read_csv('../1_DATA/1_0_PROJECT_DATA/unlabelled_test_data.csv')


# Split the data - 80% for training and 20% for validation
X_train, X_val, y_train, y_val = train_test_split(
    train_data['sentence'], 
    train_data['difficulty'], 
    test_size=0.2, 
    random_state=42
)

# Vectorize the text data
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)
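
To make the fit/transform distinction concrete, here is a minimal sketch on hypothetical toy sentences (they are stand-ins for the 'sentence' column, not rows from the real dataset). It shows that the vocabulary is learned from the training text alone and that a word seen only at validation time simply gets no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy sentences standing in for the 'sentence' column
train_sentences = ["the cat sleeps", "the dog eats", "the cat eats"]
val_sentences = ["the cat runs"]  # "runs" never appears in the training text

toy_vectorizer = TfidfVectorizer(max_features=5000)

# fit_transform learns the vocabulary and IDF weights from the training text
train_matrix = toy_vectorizer.fit_transform(train_sentences)

# transform reuses what was learned; unknown words are silently dropped
val_matrix = toy_vectorizer.transform(val_sentences)

vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)
print(train_matrix.shape, val_matrix.shape)
print(val_matrix.nnz)  # number of non-zero entries in the validation row
```

Both matrices have the same number of columns, which is what allows a model trained on one to score the other.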


### Model Fitting
A logistic regression model with default hyperparameters is fit on the vectorized training data.

In [21]:

# Initialize the model
model = LogisticRegression()
model.fit(X_train_vec, y_train)
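
The notebook loads `unlabelled_test_data.csv`, but the cells shown here never score it. The following is a minimal sketch of how a fitted vectorizer and model could be applied to unseen sentences. The toy sentences and the labels "easy" and "hard" are hypothetical placeholders, not values from the real 'difficulty' column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data; "easy" and "hard" are placeholder labels
toy_sentences = [
    "short text", "tiny text", "short note",
    "an exceedingly convoluted sentence",
    "convoluted and exceedingly verbose prose",
    "exceedingly verbose sentence",
]
toy_labels = ["easy", "easy", "easy", "hard", "hard", "hard"]

toy_vectorizer = TfidfVectorizer(max_features=5000)
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_sentences), toy_labels)

# Unseen text goes through transform(), never fit_transform(), so that its
# columns line up with the features the model was trained on
unseen = ["short tiny note", "exceedingly verbose prose"]
unseen_vec = toy_vectorizer.transform(unseen)

predictions = toy_model.predict(unseen_vec)
# predict_proba returns one column per entry in toy_model.classes_
probabilities = toy_model.predict_proba(unseen_vec)

print(list(toy_model.classes_))
print(list(predictions))
print(probabilities.round(3))
```

With the real data, the same two calls would presumably be `vectorizer.transform(test_data['sentence'])` followed by `model.predict(...)`, assuming the test file has a 'sentence' column like the training file.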


### Model Evaluation
The code evaluates the fitted model with scikit-learn metrics. It predicts labels and computes precision, recall, F1-score, and accuracy for both the training (**X_train_vec**, **y_train**) and validation (**X_val_vec**, **y_val**) sets. Precision, recall, and F1-score use the 'weighted' average, which weights each class's score by its number of true instances. The results for both sets are then printed.

In [22]:
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score

# Predictions on training set
train_predictions = model.predict(X_train_vec)


train_precision = precision_score(y_train, train_predictions, average='weighted')
train_recall = recall_score(y_train, train_predictions, average='weighted')
train_f1_score = f1_score(y_train, train_predictions, average='weighted')
train_accuracy = accuracy_score(y_train, train_predictions)

# Predictions on validation set
val_predictions = model.predict(X_val_vec)


val_precision = precision_score(y_val, val_predictions, average='weighted')
val_recall = recall_score(y_val, val_predictions, average='weighted')
val_f1_score = f1_score(y_val, val_predictions, average='weighted')
val_accuracy = accuracy_score(y_val, val_predictions)

# Print the performance metrics
print("-------- Training --------")
print(f"Training Precision: {train_precision}")
print(f"Training Recall: {train_recall}")
print(f"Training Accuracy: {train_accuracy}")
print(f"Training F1-Score: {train_f1_score}")
print("")
print("-------- Validation --------")
print(f"Validation Precision: {val_precision}")
print(f"Validation Recall: {val_recall}")
print(f"Validation F1-Score: {val_f1_score}")
print(f"Validation Accuracy: {val_accuracy}")

-------- Training --------
Training Precision: 0.8366764232114121
Training Recall: 0.8364583333333333
Training Accuracy: 0.8364583333333333
Training F1-Score: 0.8357248369723251

-------- Validation --------
Validation Precision: 0.44703858484758374
Validation Recall: 0.45416666666666666
Validation F1-Score: 0.44461812410118495
Validation Accuracy: 0.45416666666666666
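
In the output above, recall and accuracy are identical on both sets. This is expected: with `average='weighted'`, each class's recall is weighted by its share of the true labels, which sums to the overall fraction of correct predictions. A small sketch with hypothetical labels (the class names "A", "B", "C" are placeholders) illustrates this and contrasts it with the 'macro' average.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical true labels and predictions for three imbalanced classes
y_true = ["A", "A", "A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "A", "B", "B", "C", "C", "A"]

accuracy = accuracy_score(y_true, y_pred)

# Per-class recall: A = 3/4, B = 1/2, C = 1/2
# weighted: (4*0.75 + 2*0.5 + 2*0.5) / 8, which equals accuracy
weighted_recall = recall_score(y_true, y_pred, average="weighted")

# macro: plain mean of the per-class recalls, ignoring class sizes
macro_recall = recall_score(y_true, y_pred, average="macro")

print(accuracy, weighted_recall, macro_recall)
```

Because weighted recall carries no information beyond accuracy, the macro average or per-class scores may be more informative when the classes are imbalanced.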


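### Model Fine-Tuning

The baseline reaches roughly 0.84 accuracy on the training split but only about 0.45 on the validation split, which points to overfitting. The grid search below tunes the TF-IDF and logistic regression hyperparameters with 5-fold cross-validation on the full training data.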

In [3]:
import pandas as pd  # needed for pd.read_csv below
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Create a pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression())
])

parameters = {
    'tfidf__max_features': (5000, 10000, None),
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__min_df': [1, 5],
    'clf__C': [0.1, 1, 10],
    'clf__solver': ['lbfgs', 'saga']
}

from sklearn.model_selection import GridSearchCV

# Define Grid Search
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, cv=5)

# Load your data
train_data = pd.read_csv('training_data.csv')
X_train = train_data['sentence']
y_train = train_data['difficulty']

# Fit Grid Search to the data
grid_search.fit(X_train, y_train)

print("Best Score: %s" % grid_search.best_score_)
print("Best Parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Best Score: 0.4610416666666667
Best Parameters set:
	clf__C: 10
	clf__solver: 'saga'
	tfidf__max_features: None
	tfidf__min_df: 1
	tfidf__ngram_range: (1, 2)
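
The `STOP: TOTAL NO. of ITERATIONS REACHED LIMIT` message means that some fits hit the solver's iteration cap (the default `max_iter` for `LogisticRegression` is 100) before converging, so some of the scores compared by the grid search may come from partially optimized models. TF-IDF rows are already L2-normalized by default, so raising `max_iter` is probably the more natural of the two remedies the warning suggests. The sketch below shows that adjustment on a hypothetical toy corpus with a reduced grid; the value 1000, the placeholder labels, and `cv=2` are illustrative assumptions, not settings from the notebook.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with placeholder labels
sentences = [
    "short text", "tiny text", "short note",
    "an exceedingly convoluted sentence",
    "convoluted and exceedingly verbose prose",
    "exceedingly verbose sentence",
]
labels = ["easy", "easy", "easy", "hard", "hard", "hard"]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # Assumption: 1000 is an illustrative iteration budget, not a tuned value
    ("clf", LogisticRegression(max_iter=1000)),
])

# Reduced grid to keep the sketch fast; cv=2 only because the corpus is tiny
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [1, 10],
}
search = GridSearchCV(toy_pipeline, param_grid, cv=2)

# Record warnings raised during fitting so non-convergence can be counted
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    search.fit(sentences, labels)

n_convergence_warnings = sum(
    issubclass(w.category, ConvergenceWarning) for w in caught
)
print("Convergence warnings:", n_convergence_warnings)
print("Best parameters:", search.best_params_)

# With the default refit=True, the search object predicts with the best pipeline
print(search.predict(["short tiny note"]))
```

Note that warnings are only captured this way when the search runs in the current process; with `n_jobs=-1`, as in the notebook, worker processes emit them separately.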
