In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Load the data
data = pd.read_csv('../data/cleaned_data.csv')

# Extract additional features from the date
data['event_date'] = pd.to_datetime(data['event_date'])  # Convert the 'event_date' column to datetime format
data['day'] = data['event_date'].dt.day  # Day
data['month'] = data['event_date'].dt.month  # Month
data['year'] = data['event_date'].dt.year  # Year

# Drop the original 'event_date' column
data.drop('event_date', axis=1, inplace=True)

# Convert categorical variables to numerical format
# (one encoder per column, so each fitted mapping stays available for later decoding)
encoders = {}
for column in ['region', 'actor1', 'actor2']:
    encoders[column] = LabelEncoder()
    data[column + '_encoded'] = encoders[column].fit_transform(data[column])

# Select features and target variable
X = data[['region_encoded', 'day', 'month', 'year', 'actor1_encoded', 'actor2_encoded', 'fatalities']]  # Select the features you want to use
y = data['event_type']  # Target variable (event type)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the logistic regression model
model = LogisticRegression(max_iter=3000)  # Initialize the model
model.fit(X_train, y_train)  # Train the model on the training data

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)




Accuracy: 0.9868217054263566



This model classifies events by type using parameters available in the dataset. It predicts the event type (e.g., "Explosions/Remote violence", "Violence against civilians") from the region, the date (day, month, year), the actors involved, and the number of fatalities. The goal is to see how well these factors capture the patterns that distinguish one event type from another.

The accuracy of the model is approximately 98.68%, indicating a high level of agreement between the predicted and actual classes in the test dataset. On its own, however, this figure does not show that the model has learned the relationships between the input features and the output classes: if the classes are imbalanced, a model that always predicts the most frequent class can reach a similarly high accuracy. It is therefore advisable to also consider other performance metrics, such as precision, recall, and F1-score, before drawing conclusions about predictive capability.
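To illustrate why accuracy can mislead on imbalanced data, here is a minimal sketch of a majority-class baseline. The labels are hypothetical (990 events of one class and 10 of another) and are not taken from the dataset; `DummyClassifier` ignores the features entirely.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 990 majority-class events and 10 minority-class events
y_true = np.array(["majority"] * 990 + ["minority"] * 10)
X_dummy = np.zeros((len(y_true), 1))  # placeholder features, ignored by the baseline

# Baseline that always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_true)

baseline_accuracy = accuracy_score(y_true, baseline.predict(X_dummy))
print("Baseline accuracy:", baseline_accuracy)
```

In this toy case the baseline scores 99% without learning anything, so a trained model's accuracy is best read against such a reference.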

To assess model performance on imbalanced data, the following additional metrics are useful:

1. Precision: It represents the proportion of correctly classified instances of a specific class out of all instances predicted by the model for that class. It's computed as:

Precision = TP / (TP + FP)

where TP denotes true positives and FP stands for false positives.

2. Recall: This metric measures the ratio of correctly classified instances of a specific class to the total number of instances that actually belong to that class in the evaluated data. It's calculated as:

Recall = TP / (TP + FN)

where FN denotes false negatives.

3. F1-score: It's the harmonic mean of precision and recall, offering a balanced evaluation of the model. The formula is:

F1 = 2 * (Precision * Recall) / (Precision + Recall)
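The three formulas above can be sketched as a small helper. The function name `precision_recall_f1` and the counts below are illustrative assumptions for a rare class, not values taken from the model's confusion matrix. Returning 0.0 when a denominator is zero is one possible convention.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for one class from raw counts."""
    # Convention assumed here: an undefined ratio (zero denominator) is reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# Illustrative counts: 5 true positives, 4 false positives, 199 false negatives
precision, recall, f1 = precision_recall_f1(tp=5, fp=4, fn=199)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
# prints: Precision: 0.56, Recall: 0.02, F1: 0.05
```

Because F1 is a harmonic mean, it stays close to the smaller of the two values: a moderate precision cannot compensate for a very low recall.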


In [5]:
from sklearn.metrics import classification_report

# Getting model predictions for the test dataset
y_pred = model.predict(X_test)

# Printing a summary report with classification metrics
# zero_division=1 reports a metric as 1.0 when it is undefined (division by zero)
print(classification_report(y_test, y_pred, zero_division=1))



                            precision    recall  f1-score   support

Explosions/Remote violence       0.99      1.00      0.99     15275
                     Riots       1.00      0.00      0.00         1
Violence against civilians       0.56      0.02      0.05       204

                  accuracy                           0.99     15480
                 macro avg       0.85      0.34      0.35     15480
              weighted avg       0.98      0.99      0.98     15480
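The support column allows a quick sanity check of the accuracy figure. The sketch below uses only numbers printed in the outputs above (the per-class supports, the rounded per-class recalls, and the accuracy). The variable names are illustrative.

```python
# Per-class support and rounded recall, copied from the report above
support = {
    "Explosions/Remote violence": 15275,
    "Riots": 1,
    "Violence against civilians": 204,
}
recall_by_class = {
    "Explosions/Remote violence": 1.00,
    "Riots": 0.00,
    "Violence against civilians": 0.02,
}
reported_accuracy = 0.9868217054263566  # printed by the earlier cell

total = sum(support.values())
# Accuracy of a model that always predicts the most frequent class
majority_baseline = max(support.values()) / total
# Number of correct test predictions implied by the reported accuracy
correct_predictions = round(reported_accuracy * total)
# Macro-averaged recall: unweighted mean of the per-class recalls
macro_recall = sum(recall_by_class.values()) / len(recall_by_class)

print("Test events:", total)
print("Majority-class baseline accuracy:", round(majority_baseline, 4))
print("Correct predictions implied by accuracy:", correct_predictions)
print("Macro-averaged recall:", round(macro_recall, 2))
```

The baseline of always predicting "Explosions/Remote violence" would get 15,275 of 15,480 events right, while the reported accuracy corresponds to 15,276 correct predictions, so the model improves on that baseline by a single event.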




Precision is high for "Explosions/Remote violence" (0.99) but much lower for "Violence against civilians" (0.56): only about half of the events the model labels as "Violence against civilians" actually belong to that class. The value of 1.00 for "Riots" is not evidence of good performance. The model never predicted "Riots", so precision for that class is undefined, and the zero_division=1 argument reports it as 1.00.

Recall is high only for "Explosions/Remote violence" (1.00). For "Violence against civilians" it is 0.02, meaning the model finds roughly 2% of the 204 events of that class in the test set, and for "Riots" it is 0.00, because the single "Riots" event in the test set was misclassified. The model therefore captures the dominant class almost perfectly while missing nearly all events of the two minority classes.

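The F1-score, as the harmonic mean of precision and recall, summarizes the same pattern: 0.99 for "Explosions/Remote violence", but only 0.05 for "Violence against civilians" and 0.00 for "Riots". The gap between the macro-averaged F1 (0.35) and the weighted-average F1 (0.98) shows how strongly the overall figures are driven by the dominant class.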

Taking these metrics together, the model performs well only on the dominant class, which accounts for 15,275 of the 15,480 test events, and the high overall accuracy largely reflects that imbalance. It's crucial to understand the reasons behind the weak minority-class results and to refine the model or explore alternative approaches, for example class weighting, resampling, or more informative features.
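One possible refinement, sketched here on synthetic data rather than on the actual dataset, is to combine a stratified split with class weighting. The data from `make_classification`, the class proportions, and all variable names are assumptions made for illustration. Whether this helps on the real data would need to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 3 classes with a strong imbalance
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=7, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, weights=[0.95, 0.04, 0.01],
    random_state=42,
)

# stratify keeps the class proportions the same in the training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# class_weight='balanced' weights errors inversely to class frequency
weighted_model = LogisticRegression(class_weight="balanced", max_iter=3000)
weighted_model.fit(X_tr, y_tr)
y_hat = weighted_model.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat, labels=[0, 1, 2])
# Balanced accuracy is the mean of per-class recall, so rare classes count equally
bal_acc = balanced_accuracy_score(y_te, y_hat)
print(cm)
print("Balanced accuracy:", bal_acc)
```

Class weighting typically trades some precision on the dominant class for higher recall on the rare classes, so the confusion matrix and balanced accuracy are more informative here than plain accuracy.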