## Step 1: Load and Explore the Dataset
Start by loading the dataset into your Jupyter Notebook. Perform basic exploratory data analysis (EDA) to understand the data structure, check for missing values, and identify any issues.

In [21]:
import pandas as pd

# Load the dataset
data = pd.read_csv(r'C:\Users\user\Downloads/African_crises_dataset (1).csv')


In [22]:
# Display the first few rows of the dataset
print(data.head())

# Display general information about the dataset
# (info() prints its summary directly and returns None, so no print() is needed)
data.info()

# Check for missing values
print(data.isnull().sum())


   country_number country_code  country  year  systemic_crisis  exch_usd  \
0               1          DZA  Algeria  1870                1  0.052264   
1               1          DZA  Algeria  1871                0  0.052798   
2               1          DZA  Algeria  1872                0  0.052274   
3               1          DZA  Algeria  1873                0  0.051680   
4               1          DZA  Algeria  1874                0  0.051308   

   domestic_debt_in_default  sovereign_external_debt_default  \
0                         0                                0   
1                         0                                0   
2                         0                                0   
3                         0                                0   
4                         0                                0   

   gdp_weighted_default  inflation_annual_cpi  independence  currency_crises  \
0                   0.0              3.441456             0                0  

In [36]:
# Check class counts in the target column
print(data['banking_crisis'].value_counts())

# Check class counts in the systemic_crisis feature
print(data['systemic_crisis'].value_counts())

# Summary statistics
print(data.describe())

banking_crisis
0    965
1     94
Name: count, dtype: int64
systemic_crisis
0    977
1     82
Name: count, dtype: int64
       country_number         year  systemic_crisis     exch_usd  \
count     1059.000000  1059.000000      1059.000000  1059.000000   
mean        35.613787  1967.767705         0.077432    43.140831   
std         23.692402    33.530632         0.267401   111.475380   
min          1.000000  1860.000000         0.000000     0.000000   
25%         15.000000  1951.000000         0.000000     0.195350   
50%         38.000000  1973.000000         0.000000     0.868400   
75%         56.000000  1994.000000         0.000000     8.462750   
max         70.000000  2014.000000         1.000000   744.306139   

       domestic_debt_in_default  sovereign_external_debt_default  \
count               1059.000000                      1059.000000   
mean                   0.039660                         0.152975   
std                    0.195251                         0.360133
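The counts above show that crisis years are rare. As a rough sketch, the snippet below turns those counts into proportions and into the accuracy a trivial majority-class model would reach. It uses a toy Series (the hypothetical name `toy_target`) rebuilt from the printed counts rather than the CSV itself, so it runs without the file.

```python
import pandas as pd

# Toy target with the same class counts as the value_counts() output above
# (965 zeros, 94 ones); in the notebook this would be data['banking_crisis'].
toy_target = pd.Series([0] * 965 + [1] * 94, name="banking_crisis")

# normalize=True returns proportions instead of raw counts
class_share = toy_target.value_counts(normalize=True)
print(class_share.round(4))

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

With roughly 91% of rows in class 0, accuracy alone can look high even for a model that never predicts a crisis, which is worth keeping in mind when reading the evaluation in Step 6.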

## Step 2: Data Cleaning and Encoding the Target Column

Clean the dataset by addressing any issues identified in Step 1, such as missing values. Then encode the target column (banking_crisis) as 0 and 1, since most machine learning workflows expect numeric values.

In [32]:
# Encode the target column
data['banking_crisis'] = data['banking_crisis'].map({'crisis': 1, 'no_crisis': 0})

In [34]:
# Verify the changes
print(data['banking_crisis'].value_counts())  # Confirm encoding

banking_crisis
0    965
1     94
Name: count, dtype: int64
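One caveat with the mapping cell: `Series.map` returns NaN for any value that is not a key in the dictionary, so running the cell a second time on the already encoded column would silently replace every label with NaN. A minimal sketch of a re-runnable version follows. The helper name `encode_banking_crisis` and the toy Series are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd


def encode_banking_crisis(series: pd.Series) -> pd.Series:
    """Map 'crisis'/'no_crisis' to 1/0; leave an already encoded column unchanged."""
    # If every value is already 0 or 1, there is nothing left to map
    if set(series.unique()) <= {0, 1}:
        return series.astype(int)

    encoded = series.map({"crisis": 1, "no_crisis": 0})

    # map() yields NaN for labels missing from the dictionary, so fail loudly
    if encoded.isnull().any():
        unexpected = series[encoded.isnull()].unique()
        raise ValueError(f"Unexpected labels: {list(unexpected)!r}")
    return encoded.astype(int)


# Toy stand-in for data['banking_crisis']
toy = pd.Series(["crisis", "no_crisis", "no_crisis"], name="banking_crisis")
once = encode_banking_crisis(toy)
twice = encode_banking_crisis(once)  # safe to run again

print(once.tolist())   # → [1, 0, 0]
print(twice.tolist())  # → [1, 0, 0]
```

In the notebook, the call would look like `data['banking_crisis'] = encode_banking_crisis(data['banking_crisis'])`.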


## Step 3: Feature Selection
Now, we need to identify which columns to include as features (X) and which to exclude. Typically, we exclude irrelevant columns such as IDs, redundant features, or columns that leak target information.

In [37]:
# Define selected features and target
features = ['currency_crises', 'inflation_crises', 'systemic_crisis', 
            'exch_usd', 'inflation_annual_cpi', 'domestic_debt_in_default']
target = 'banking_crisis'

# Subset the dataset to include only the selected features and target
X = data[features]  # Features
y = data[target]    # Target

# Check the shape of X and y
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

# Verify the feature columns
print("Selected features:", X.columns)


Features shape: (1059, 6)
Target shape: (1059,)
Selected features: Index(['currency_crises', 'inflation_crises', 'systemic_crisis', 'exch_usd',
       'inflation_annual_cpi', 'domestic_debt_in_default'],
      dtype='object')
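One quick, informal way to spot columns that may leak target information is to look at how strongly each candidate feature correlates with the target. The sketch below uses a small invented DataFrame (the values are made up, only the column names come from the notebook) and pandas' `DataFrame.corrwith`. A correlation near 1 or -1 is a prompt to think about whether the feature would really be known before the target, not proof of leakage.

```python
import pandas as pd

# Invented toy rows; only the column names match the notebook
toy = pd.DataFrame({
    "systemic_crisis": [0, 0, 1, 1, 0, 1],
    "exch_usd":        [0.5, 1.2, 0.9, 3.0, 2.2, 0.7],
    "banking_crisis":  [0, 0, 1, 1, 0, 1],
})

toy_features = toy.drop(columns="banking_crisis")

# Pearson correlation of every feature column with the target
corr_with_target = toy_features.corrwith(toy["banking_crisis"])

# Sort by absolute strength so the most suspicious columns come first
print(corr_with_target.sort_values(key=abs, ascending=False))
```

In this toy data `systemic_crisis` is an exact copy of the target, so its correlation is 1. On the real dataset the same call would be `X.corrwith(y)`.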


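## Step 4: Scaling Features
Standardize each feature to zero mean and unit variance so that columns on very different scales, such as exch_usd (0 to about 744) and the binary crisis flags, contribute comparably to the model.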

In [38]:
from sklearn.preprocessing import StandardScaler

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Verify scaled features
print("\nScaled feature sample:")
print(X_scaled[:5])



Scaled feature sample:
[[-0.37805817 -0.38547376  3.45175812 -0.38671252 -0.03086348 -0.20321893]
 [-0.37805817 -0.38547376 -0.28970744 -0.38670773 -0.03084763 -0.20321893]
 [-0.37805817 -0.38547376 -0.28970744 -0.38671243 -0.03087408 -0.20321893]
 [-0.37805817 -0.38547376 -0.28970744 -0.38671776 -0.03085199 -0.20321893]
 [-0.37805817 -0.38547376 -0.28970744 -0.3867211  -0.03087427 -0.20321893]]
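To make the scaled numbers less opaque: `StandardScaler` subtracts each column's mean and divides by its population standard deviation (`ddof=0`). The sketch below checks that equivalence with NumPy on a small invented array (`toy_X` is a hypothetical stand-in for `X`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented values: one wide-ranging column and one binary flag
toy_X = np.array([
    [0.05, 0.0],
    [0.87, 1.0],
    [8.46, 0.0],
    [744.31, 0.0],
])

# Manual z-score: (x - column mean) / column standard deviation (ddof=0)
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

# The same transformation through scikit-learn
scaled = StandardScaler().fit_transform(toy_X)

print(np.allclose(manual, scaled))  # → True
```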


## Step 5: Splitting the Dataset
Now that your features are scaled, the next step is to split the dataset into training and testing sets. This ensures your model is evaluated on unseen data to measure its performance effectively. We'll use the train_test_split function from sklearn.

In [40]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

# Verify the sizes of the splits
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")


Training set size: 847 samples
Test set size: 212 samples
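Two things are worth checking around the split. First, `stratify=y` should keep the share of crisis years roughly equal in both sets. Second, the notebook fits the scaler on all rows before splitting, which lets test-set statistics influence the scaling. A common alternative, shown here only as a sketch and not as the notebook's method, is to split first and fit the scaler on the training rows alone. The data below is synthetic (random features, class counts copied from Step 1), and all `toy_` and `_tr`/`_te` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 1059 rows, 6 random features, 94 positives
rng = np.random.default_rng(42)
toy_X = rng.normal(size=(1059, 6))
toy_y = np.array([0] * 965 + [1] * 94)

# Split the unscaled data first, keeping class proportions with stratify
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(f"Positive share overall: {toy_y.mean():.3f}")
print(f"Positive share train:   {y_tr.mean():.3f}")
print(f"Positive share test:    {y_te.mean():.3f}")

# Fit the scaler on training rows only, then apply it to both sets
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
```

With this ordering the test set stays unseen by every fitted step, at the cost of test features no longer having exactly zero mean and unit variance.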


## Step 6: Building and Training the Model
Now that your data is split into training and testing sets, it's time to build and train the model. For this classification task, let's start with a Logistic Regression model, which is commonly used for binary classification.



In [42]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Initialize the Logistic Regression model
model = LogisticRegression(random_state=42)

# Fit the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Print the classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Print the confusion matrix (rows: actual class, columns: predicted class)
print(confusion_matrix(y_test, y_pred))


Accuracy: 96.70%

Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.99      0.98       193
           1       0.93      0.68      0.79        19

    accuracy                           0.97       212
   macro avg       0.95      0.84      0.88       212
weighted avg       0.97      0.97      0.96       212

[[192   1]
 [  6  13]]
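The headline accuracy hides how the model does on the rare crisis class. As a worked check, the class-1 figures in the report can be recomputed by hand from the confusion matrix, where rows are actual classes and columns are predicted classes:

```python
# Confusion matrix from the output above
tn, fp = 192, 1   # actual 0: predicted 0, predicted 1
fn, tp = 6, 13    # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted crises, how many were real
recall = tp / (tp + fn)      # of the real crises, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# → accuracy=0.9670 precision=0.93 recall=0.68 f1=0.79
```

So the model misses 6 of the 19 crises in the test set. For comparison, a model that always predicts 0 would already score 193/212, about 91% accuracy, so recall and F1 on class 1 are the more informative numbers here.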
