# **Data Cleaning and Preparation for Original Dataset**
### **1. Introduction**

This report details the systematic data cleaning and feature selection process applied to the raw Global Terrorism Database (GTD), a large dataset of terrorism incidents from 1970 to 2020 with 135 columns in its raw form (globalterrorismdb_0522dist_ML.csv). The goal is to produce a clean, model-ready dataset (globalterrorismdb_cleaned.csv) with a focused set of features for predicting the terrorist group responsible for an incident (GroupName).

### **2. The Raw Dataset: Initial State**
The raw dataset is complex, with many columns containing >20% missing values, inconsistent naming, and irrelevant features for predicting GroupName. Data cleaning is essential to:
- **Reduce Complexity**: Drop columns with high missing values to make the dataset manageable (from 135 to 47 columns, then to 11 after feature selection).
- **Eliminate Noise**: Remove unreliable columns to prevent bias in modeling and visualization.
- **Ensure Relevance**: Focus on 10 predictive features (Year, Month, Country, Region, Latitude, Longitude, AttackType, TargetType, WeaponType, suicide) and the target (GroupName).
- **Standardize Data**: Rename columns for clarity (e.g., country_txt to Country) and impute missing values to ensure a complete dataset.


#### ***Step 1: Load and Rename Data***
- **Purpose**: Load the raw dataset and rename columns for clarity and consistency.
- **Why**: The original column names (e.g., iyear, country_txt, attacktype1_txt) are terse and inconsistent. Renaming them to intuitive names (e.g., Year, Country, AttackType) improves readability in downstream analysis. The latin-1 encoding handles the non-ASCII characters in the data, and low_memory=False makes pandas read the file in a single pass rather than in chunks, which avoids mixed-dtype warnings on this wide file at the cost of higher memory use.
- **Rationale**: Clear column names reduce errors in coding and make the dataset more accessible for analysts. Standardizing names early ensures consistency across all subsequent steps.

In [None]:
import pandas as pd

print("--- Step 1: Loading and Renaming ---")
# Load the dataset
df = pd.read_csv('globalterrorismdb_0522dist_ML.csv', encoding='latin-1', low_memory=False)

# Rename key columns for clarity and easy access
df.rename(columns={
    'iyear':'Year', 'imonth':'Month', 'gname':'GroupName', 'country_txt':'Country',
    'region_txt': 'Region', 'latitude':'Latitude', 'longitude':'Longitude',
    'attacktype1_txt':'AttackType', 'targtype1_txt':'TargetType',
    'weaptype1_txt':'WeaponType', 'nkill':'Killed', 'nwound':'Wounded'
}, inplace=True)
print("Dataset loaded and columns renamed.")


--- Step 1: Loading and Renaming ---
Dataset loaded and columns renamed.
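
One caveat with the rename above: by default, `DataFrame.rename` silently ignores keys that do not match any column, so a misspelled source name would go unnoticed. The following is a minimal sketch on a toy DataFrame (the values are hypothetical stand-ins, not GTD rows) showing how `errors="raise"` turns such a typo into a `KeyError`.

```python
import pandas as pd

# Toy frame standing in for the raw GTD file (hypothetical values).
raw = pd.DataFrame({"iyear": [1970], "imonth": [7], "gname": ["Unknown"]})

rename_map = {"iyear": "Year", "imonth": "Month", "gname": "GroupName"}

# errors="raise" makes pandas fail loudly if a key is absent from the columns.
renamed = raw.rename(columns=rename_map, errors="raise")

# A misspelled key is ignored by default, but raises KeyError here.
try:
    raw.rename(columns={"attacktype_txt": "AttackType"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(list(renamed.columns), typo_detected)
```

The same argument could be passed to the `df.rename(...)` call in Step 1 if stricter checking is wanted.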


#### ***Step 2: Define Features and Target***
- **Purpose**: Select a subset of features (Year, Month, Country, Region, Latitude, Longitude, AttackType, TargetType, WeaponType, suicide) and the target variable (GroupName) for the prediction task.
- **Why**: The goal is to predict the terrorist group responsible for an incident based on its characteristics. These 10 features capture the temporal, geographical, and attack-specific attributes likely to be predictive of GroupName. Other columns (e.g., IDs, narrative text) are excluded to avoid data leakage or irrelevant information.
- **Rationale**: Focusing on a small, relevant feature set reduces computational complexity, mitigates the risk of overfitting, and aligns with the modeling objective. Explicitly defining the target and features ensures clarity in the pipeline.

In [None]:

print("\n--- Step 2: Defining Features based on the Model Goal ---")
# For predicting the 'GroupName', we select features that describe the attack's signature.

# These are the CORE features we want to keep.
features_to_keep = [
    'Year', 'Month', 'Country', 'Region', 'Latitude', 'Longitude',
    'AttackType', 'TargetType', 'WeaponType', 'suicide'
]

# This is our prediction target.
target_variable = 'GroupName'

print(f"Prediction Target (y): {target_variable}")
print(f"Core Predictor Features (X): {features_to_keep}")



--- Step 2: Defining Features based on the Model Goal ---
Prediction Target (y): GroupName
Core Predictor Features (X): ['Year', 'Month', 'Country', 'Region', 'Latitude', 'Longitude', 'AttackType', 'TargetType', 'WeaponType', 'suicide']


#### ***Step 3: Automated Cleaning (Drop High-Missing Columns)***
- **Purpose**: Drop columns with more than 20% missing values to reduce noise and complexity.
- **Why**: The raw dataset has 135 columns, many of which are more than 20% empty. Retaining these columns would require extensive imputation, which could introduce bias or unreliable data into the model. Dropping them reduces the dataset to 47 columns, making it more manageable while keeping all 10 selected features and the target.
- **Rationale**: A 20% threshold balances retaining useful data with eliminating unreliable columns. This automated approach is efficient for large datasets, as manually inspecting 135 columns is impractical. Dropping high-missing columns early prevents downstream issues in modeling and visualization.

In [None]:

print("\n--- Step 3: Automated Cleaning ---")
print(f"Original number of columns: {df.shape[1]}")

# Rule: Drop any column with more than 20% missing values.
missing_percentage = df.isnull().sum() / len(df) * 100
cols_to_drop_auto = missing_percentage[missing_percentage > 20].index
df.drop(columns=cols_to_drop_auto, inplace=True)

print(f"Columns dropped for >20% missing. New number of columns: {df.shape[1]}")


--- Step 3: Automated Cleaning ---
Original number of columns: 135
Columns dropped for >20% missing. New number of columns: 47
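
Because the threshold rule runs before the feature selection in Step 4, it could in principle drop a column the model needs. The sketch below shows one possible guard, using a hypothetical helper `drop_sparse_columns` and toy data (neither is part of the original notebook): it refuses to drop any column listed as protected.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, threshold_pct=20.0, protected=()):
    """Drop columns whose missing share exceeds threshold_pct.

    Raises ValueError if a protected column would be dropped.
    (Hypothetical helper, not part of the original pipeline.)
    """
    missing_pct = frame.isna().mean() * 100
    sparse = missing_pct[missing_pct > threshold_pct].index
    at_risk = [col for col in protected if col in sparse]
    if at_risk:
        raise ValueError(f"Protected columns exceed the threshold: {at_risk}")
    return frame.drop(columns=sparse)

# Toy data: 'Latitude' is 10% missing (kept), 'motive' is 90% missing (dropped).
toy = pd.DataFrame({
    "Year": list(range(1970, 1980)),
    "Latitude": [18.4] * 9 + [np.nan],
    "motive": ["x"] + [None] * 9,
})

cleaned = drop_sparse_columns(toy, protected=["Year", "Latitude"])

# Protecting a sparse column makes the helper raise instead of dropping it.
try:
    drop_sparse_columns(toy, protected=["motive"])
    guard_triggered = False
except ValueError:
    guard_triggered = True

print(list(cleaned.columns), guard_triggered)
```

In this notebook the guard would not trigger, since all selected features survive the 20% rule, but it makes that assumption explicit.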


#### ***Step 4: Create Final Focused Dataset***
- **Purpose**: Create a new DataFrame (model_df) containing only the 10 selected features and the target (GroupName), implicitly dropping all other columns.
- **Why**: After automated cleaning, the dataset still contains 47 columns, many of which are irrelevant for predicting GroupName. Selecting only the 11 needed columns (10 features + 1 target) ensures a focused dataset, reducing memory usage and eliminating potential data leakage (e.g., from summary text or ID columns).
- **Rationale**: Calling copy() makes model_df an independent DataFrame rather than a slice of df, so later assignments do not trigger SettingWithCopyWarning. This step streamlines the dataset for EDA and modeling, focusing on features that describe the attack's signature.

In [None]:

print("\n--- Step 4: Creating the Final Model-Ready Dataset ---")

# We create our final dataset using only the features we need.
# This implicitly drops all other columns (IDs, text summaries, data leakage columns, etc.)
all_needed_cols = features_to_keep + [target_variable]
model_df = df[all_needed_cols].copy() # Use .copy() to avoid SettingWithCopyWarning

print(f"Created a new DataFrame with only the {len(model_df.columns)} required columns.")



--- Step 4: Creating the Final Model-Ready Dataset ---
Created a new DataFrame with only the 11 required columns.


#### ***Step 5: Final Imputation***
- **Purpose**: Handle remaining missing values in the selected features by imputing numerical columns (Latitude, Longitude) with their mean and categorical columns (Country, Region, AttackType, TargetType, WeaponType, GroupName) with their mode.
- **Why**: Even after dropping high-missing columns, some selected features may have a small number of missing values. Imputing numerical columns with the mean leaves each column's overall mean unchanged, although an imputed coordinate pair will not necessarily correspond to a real incident location. Using the mode for categorical columns assigns the most common category, which is a reasonable default when only a small share of values is missing.
- **Rationale**: Simple imputation methods (mean, mode) are chosen to avoid introducing complex assumptions, given the large dataset size (209,706 rows). This ensures a complete dataset with no missing values, ready for modeling and visualization.

In [None]:

print("\n--- Step 5: Handling Remaining Missing Values ---")
# Even after dropping columns, some of our selected features might have a few missing values.
# We will impute them with simple, robust methods.

# For numerical columns like Latitude/Longitude, fill with the mean.
# Assigning the result back avoids the chained-assignment FutureWarning
# that model_df[col].fillna(..., inplace=True) raises in recent pandas.
model_df['Latitude'] = model_df['Latitude'].fillna(model_df['Latitude'].mean())
model_df['Longitude'] = model_df['Longitude'].fillna(model_df['Longitude'].mean())

# For any categorical columns, fill with the mode (most frequent value).
for col in model_df.select_dtypes(include='object').columns:
    model_df[col] = model_df[col].fillna(model_df[col].mode()[0])

print("Final imputation complete.")



--- Step 5: Handling Remaining Missing Values ---
Final imputation complete.
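
To check that the missing values being imputed really are minor, it can help to record the missing share per column before filling. The following is a small sketch on toy data (hypothetical values, not GTD rows) that reports the percentages and then applies the same mean and mode imputation.

```python
import numpy as np
import pandas as pd

# Toy stand-in for model_df (hypothetical values).
toy = pd.DataFrame({
    "Latitude": [18.4, np.nan, 19.3, 15.0],
    "Country": ["Mexico", "Greece", None, "Greece"],
})

# Share of missing values per column, as a percentage.
missing_before = toy.isna().mean().mul(100).round(1)

# Same imputation rules as Step 5: mean for numeric, mode for categorical.
filled = toy.copy()
filled["Latitude"] = filled["Latitude"].fillna(filled["Latitude"].mean())
filled["Country"] = filled["Country"].fillna(filled["Country"].mode()[0])

print(missing_before)
print(filled)
```

Running the equivalent of `missing_before` on model_df before Step 5 would show how many values each column had imputed.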


#### ***Step 6: Verify the Cleaned Dataset***
- **Purpose**: Confirm that the final dataset has no missing values, correct data types, and the expected structure (11 columns, 209,706 rows).
- **Why**: Verification ensures the preprocessing steps were successful, checking for data integrity (e.g., no missing values) and correct formatting (e.g., suicide as int, Latitude as float). Displaying the first 5 rows provides a quick visual check of the data.
- **Rationale**: This step catches any errors introduced during preprocessing, ensuring the dataset is model-ready and suitable for EDA or machine learning.

In [None]:

print("\n--- Step 6: Verifying the Final Dataset ---")
print("Dataset Info (should show no missing values):")
model_df.info()

print("\nFirst 5 rows of the final, model-ready dataset:")
print(model_df.head())

print("\nDataset is now clean, formatted, and ready for model training.")


--- Step 6: Verifying the Final Dataset ---
Dataset Info (should show no missing values):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209706 entries, 0 to 209705
Data columns (total 11 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Year        209706 non-null  int64  
 1   Month       209706 non-null  int64  
 2   Country     209706 non-null  object 
 3   Region      209706 non-null  object 
 4   Latitude    209706 non-null  float64
 5   Longitude   209706 non-null  float64
 6   AttackType  209706 non-null  object 
 7   TargetType  209706 non-null  object 
 8   WeaponType  209706 non-null  object 
 9   suicide     209706 non-null  int64  
 10  GroupName   209706 non-null  object 
dtypes: float64(2), int64(3), object(6)
memory usage: 17.6+ MB

First 5 rows of the final, model-ready dataset:
   Year  Month             Country                       Region   Latitude  \
0  1970      7  Dominican Republic  Central America & Caribbea
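
The visual check above can be complemented by a programmatic one. Here is a minimal sketch, assuming a hypothetical helper `verify_model_frame` (not part of the original notebook), that returns a list of problems instead of relying on reading the `info()` output by eye.

```python
import pandas as pd

def verify_model_frame(frame, expected_columns):
    """Return a list of human-readable problems; an empty list means OK.

    (Hypothetical helper for illustration.)
    """
    problems = []
    absent = [col for col in expected_columns if col not in frame.columns]
    if absent:
        problems.append(f"missing columns: {absent}")
    null_total = int(frame.isna().sum().sum())
    if null_total > 0:
        problems.append(f"{null_total} missing values remain")
    return problems

# Toy frames standing in for model_df (hypothetical values).
good = pd.DataFrame({"Year": [1970], "GroupName": ["Unknown"]})
bad = pd.DataFrame({"Year": [None]})

good_report = verify_model_frame(good, ["Year", "GroupName"])
bad_report = verify_model_frame(bad, ["Year", "GroupName"])

print(good_report)
print(bad_report)
```

Applied to model_df with `features_to_keep + [target_variable]` as the expected columns, an empty list would confirm the structure described in this step.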

#### ***Step 7: Save the Cleaned Dataset***
- **Purpose**: Save the cleaned DataFrame (model_df) to a new CSV file (globalterrorismdb_cleaned.csv) without the index column.
- **Why**: The cleaned dataset is saved for use in subsequent analysis or modeling tasks, ensuring reproducibility and ease of access. Excluding the index prevents unnecessary columns in the output file.
- **Rationale**: Saving the dataset preserves the preprocessing work, allowing analysts to load the clean dataset directly for further tasks, such as EDA or model training.

In [9]:
# Define the name for your new, clean CSV file
cleaned_file_name = 'globalterrorismdb_cleaned.csv'

# Save the model_df DataFrame to the new file
# index=False prevents pandas from writing the DataFrame index as a column
model_df.to_csv(cleaned_file_name, index=False)

print(f"\n✅ Cleaned data has been successfully saved as '{cleaned_file_name}'")


✅ Cleaned data has been successfully saved as 'globalterrorismdb_cleaned.csv'
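
A quick round-trip read is one way to confirm the saved file matches the DataFrame. The sketch below uses a toy frame and a temporary directory (both are illustrative assumptions, so it does not touch the real output file).

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for model_df (hypothetical values).
toy = pd.DataFrame({"Year": [1970, 1971], "GroupName": ["Unknown", "Group A"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_cleaned.csv")
    # index=False keeps the row index out of the file, as in Step 7.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)

print(same_shape, same_columns)
```

If the index had been written, the reloaded frame would carry an extra unnamed column and the shape comparison would fail.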


#### ***Why Dropping Columns with High Missing Values is Critical***
- **Handling Large Datasets**: With 135 columns and 209,706 rows, the dataset is computationally intensive. Columns with >20% missing values (e.g., detailed text fields or obscure metadata) are often incomplete and add little predictive value, increasing processing time and memory usage.
- **Reducing Noise**: High-missing columns can introduce noise or bias if imputed improperly, leading to unreliable model predictions or misleading visualizations.
- **Efficiency**: Dropping irrelevant or sparse columns reduces the dataset to a manageable size (47 columns initially, then 11 after feature selection), enabling faster processing and analysis.
- **Focus on Predictive Features**: The selected features (Year, Month, Country, etc.) are chosen for their relevance to predicting GroupName. Dropping other columns keeps the dataset tailored to the modeling goal, which can improve model performance and interpretability.
This preprocessing pipeline transforms a large, messy dataset into a clean, focused, model-ready dataset that supports EDA and machine learning models for predicting GroupName.