<a href="https://colab.research.google.com/github/Nebil1/UNDP-FTL-AI/blob/main/Task_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [19]:
!pip install pandas scikit-learn matplotlib seaborn



In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report, confusion_matrix

Load the dataset from Google Drive; df.head() then shows the first 5 rows.

In [21]:
file_url = "https://drive.google.com/uc?id=1zIk9JOdJEu9YF7Xuv2C8f2Q8ySfG3nHd"
df = pd.read_csv(file_url)

In [22]:
print(df.shape)
df.head()

(165, 14)


Unnamed: 0,Country or Administrative area,Area [km2],Coast length [km],Rainfall [mm year -1],Factor L/A [-],Factor (L/A) *P [-],P[E] [%],MPW (metric tons year -1),M[E] (metric tons year -1),Ratio Me/MPW,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13
0,Albania,28'486,362,1'117,0.01,14.0,1.56%,69'833,1'565,2.24%,,,,
1,Algeria,2'316'559,998,80,0.0004,0.0,0.09%,764'578,5'774,0.76%,,,,
2,Angola,1'247'357,1'600,1'025,0.001,1.0,0.09%,236'946,860,0.36%,,,,
3,Antigua and Barbuda,443,153,996,0.3,344.0,3.08%,627,2,0.29%,,,,
4,Argentina,2'779'705,4'989,567,0.002,1.0,0.26%,465'808,4'137,0.89%,,,,


Count missing values per column

In [23]:
# Count missing per column
missing_counts = df.isna().sum()
print(missing_counts)

Country or Administrative area      2
Area [km2]                          2
Coast length [km]                   2
Rainfall [mm year -1]               2
Factor L/A [-]                      2
Factor (L/A) *P [-]                 2
P[E] [%]                            2
MPW (metric tons year -1)           2
M[E] (metric tons year -1)          2
Ratio Me/MPW                        2
Unnamed: 10                       165
Unnamed: 11                       165
Unnamed: 12                       165
Unnamed: 13                       165
dtype: int64


In [24]:
empty_cols = ['Unnamed: 10','Unnamed: 11','Unnamed: 12','Unnamed: 13']
df = df.drop(columns=empty_cols)
print("After dropping empty columns:", df.shape)

After dropping empty columns: (165, 10)
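As an alternative to listing the empty columns by name, the same cleanup can be sketched with `dropna`. The toy frame below is a hypothetical stand-in for the real data. The row-wise drop is an optional extra step that the notebook itself does not perform; it is included because the missing-value counts above (2 per real column) suggest there may be fully empty rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two real columns, two fully empty "Unnamed" columns,
# and one fully empty row.
toy = pd.DataFrame({
    "Country or Administrative area": ["Albania", "Algeria", None],
    "Coast length [km]": ["362", "998", None],
    "Unnamed: 10": [np.nan, np.nan, np.nan],
    "Unnamed: 11": [np.nan, np.nan, np.nan],
})

# Drop every column in which all values are missing
cleaned = toy.dropna(axis=1, how="all")

# Optionally drop every row in which all remaining values are missing
cleaned = cleaned.dropna(axis=0, how="all")

print("After dropping empty columns and rows:", cleaned.shape)
```

This avoids hard-coding column names, at the cost of also removing any real column that happens to be entirely empty.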


In [25]:
df.head()

Unnamed: 0,Country or Administrative area,Area [km2],Coast length [km],Rainfall [mm year -1],Factor L/A [-],Factor (L/A) *P [-],P[E] [%],MPW (metric tons year -1),M[E] (metric tons year -1),Ratio Me/MPW
0,Albania,28'486,362,1'117,0.01,14.0,1.56%,69'833,1'565,2.24%
1,Algeria,2'316'559,998,80,0.0004,0.0,0.09%,764'578,5'774,0.76%
2,Angola,1'247'357,1'600,1'025,0.001,1.0,0.09%,236'946,860,0.36%
3,Antigua and Barbuda,443,153,996,0.3,344.0,3.08%,627,2,0.29%
4,Argentina,2'779'705,4'989,567,0.002,1.0,0.26%,465'808,4'137,0.89%


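Verify the data type of the M[E] column. The execution count In [34] indicates this cell was last run after the conversion cells below, which would explain the float64 result.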

In [34]:
print(df['M[E] (metric tons year -1)'].dtype)

float64


In [27]:
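# Convert to numeric (any parse errors → NaN).
# Caution: values with apostrophe thousands separators (e.g. 1'565) do not
# parse and become NaN, so they are later replaced by the column mean.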
df['M[E] (metric tons year -1)'] = pd.to_numeric(
    df['M[E] (metric tons year -1)'],
    errors='coerce'
)

In [28]:
# Verify the conversion
print("After conversion:", df['M[E] (metric tons year -1)'].dtype)

After conversion: float64
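The raw table writes thousands with apostrophes (for example 1'565), and `pd.to_numeric(..., errors='coerce')` turns such strings into NaN. A minimal sketch of one way to handle this, assuming the apostrophe is only ever a thousands separator, is to strip it before converting. The sample values are the first five M[E] entries from the table above; the variable names are illustrative.

```python
import pandas as pd

# First five raw M[E] values as they appear in the table above
raw = pd.Series(["1'565", "5'774", "860", "2", "4'137"])

# Direct coercion: every entry containing an apostrophe becomes NaN
naive = pd.to_numeric(raw, errors="coerce")

# Remove the apostrophes first, then convert
cleaned = pd.to_numeric(raw.str.replace("'", "", regex=False), errors="coerce")

print("NaNs after direct coercion:", naive.isna().sum())
print("Values after stripping apostrophes:", cleaned.tolist())
```

With the separators removed, far fewer values should need to be imputed, so the mean fill would only touch entries that are genuinely missing.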


Ensure the key column is numeric, then fill any parsing NaNs

In [29]:
col  = 'M[E] (metric tons year -1)'
df[col] = pd.to_numeric(df[col], errors='coerce')    # non-numbers → NaN
mean = df[col].mean()                                 # compute mean
df[col] = df[col].fillna(mean)                        # fill NaNs with mean (assign back; chained inplace is deprecated)
print(f"Missing in '{col}' after imputation:", df[col].isna().sum())

Missing in 'M[E] (metric tons year -1)' after imputation: 0


Create the binary label: 0 = high polluter (>6008), 1 = low (≤6008)

In [30]:
# Create the 'plastic_contribution' column
threshold = 6008
df['plastic_contribution'] = (df['M[E] (metric tons year -1)'] <= threshold).astype(int)

# Display the first few rows with the new column
display(df.head())

Unnamed: 0,Country or Administrative area,Area [km2],Coast length [km],Rainfall [mm year -1],Factor L/A [-],Factor (L/A) *P [-],P[E] [%],MPW (metric tons year -1),M[E] (metric tons year -1),Ratio Me/MPW,plastic_contribution
0,Albania,28'486,362,1'117,0.01,14.0,1.56%,69'833,188.135593,2.24%,1
1,Algeria,2'316'559,998,80,0.0004,0.0,0.09%,764'578,188.135593,0.76%,1
2,Angola,1'247'357,1'600,1'025,0.001,1.0,0.09%,236'946,860.0,0.36%,1
3,Antigua and Barbuda,443,153,996,0.3,344.0,3.08%,627,2.0,0.29%,1
4,Argentina,2'779'705,4'989,567,0.002,1.0,0.26%,465'808,188.135593,0.89%,1
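Before splitting, it can help to check how balanced the two classes are, since the split later stratifies on this label. The sketch below applies the same threshold rule to a small series. The first five values come from the raw table above, the last three are hypothetical and only there to exercise both sides of the threshold.

```python
import pandas as pd

threshold = 6008

# First five values are from the raw table; the last three are hypothetical
me = pd.Series([1565.0, 5774.0, 860.0, 2.0, 4137.0, 70000.0, 6008.0, 6009.0])

# Same rule as the notebook: 1 = low (<= threshold), 0 = high (> threshold)
labels = (me <= threshold).astype(int)

# Absolute and relative class frequencies
counts = labels.value_counts().sort_index()
shares = labels.value_counts(normalize=True).sort_index()
print(counts)
print(shares)
```

On the real data the equivalent check would be `df['plastic_contribution'].value_counts()`.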


Build the feature matrix (X) and target (y)

In [31]:
X = df.drop([col, 'plastic_contribution', 'Country or Administrative area'], axis=1)
#   Keep only numeric features
X = X.select_dtypes(include=[np.number])
y = df['plastic_contribution']

print("Features shape:", X.shape, "Labels shape:", y.shape)

Features shape: (165, 2) Labels shape: (165,)
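Only two features remain because `select_dtypes` discards every column still stored as text, which includes the columns with apostrophe separators and percent signs. The following sketch, on a toy frame built from the first rows of the table above, shows one possible way to convert such columns so they survive the numeric filter. Rescaling percentages to fractions is an assumption of this sketch, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame using values from the first rows of the table above
toy = pd.DataFrame({
    "Area [km2]": ["28'486", "2'316'559", "443"],
    "P[E] [%]": ["1.56%", "0.09%", "3.08%"],
    "Factor L/A [-]": [0.01, 0.0004, 0.3],
})

# Before conversion only the float column passes the numeric filter
before = toy.select_dtypes(include=[np.number]).columns.tolist()

converted = toy.copy()
# Strip apostrophe thousands separators, then convert
converted["Area [km2]"] = pd.to_numeric(
    converted["Area [km2]"].str.replace("'", "", regex=False), errors="coerce"
)
# Strip the percent sign and rescale to a fraction (sketch assumption)
converted["P[E] [%]"] = pd.to_numeric(
    converted["P[E] [%]"].str.rstrip("%"), errors="coerce"
) / 100

after = converted.select_dtypes(include=[np.number]).columns.tolist()
print("Numeric columns before:", before)
print("Numeric columns after: ", after)
```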


Split into train/test (80/20), stratifying to keep class balance

In [32]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
print("Train:", X_train.shape, "Test:", X_test.shape)

Train: (132, 2) Test: (33, 2)
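To see what `stratify=y` does, the sketch below repeats the split on synthetic data of the same shape and compares the class share in each part. The 20% share of class 0 is a made-up figure for illustration, not the real class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 165 samples, 2 features, hypothetical 20% class 0
rng = np.random.default_rng(42)
X = rng.normal(size=(165, 2))
y = np.array([0] * 33 + [1] * 132)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)

print("Train:", X_train.shape, "Test:", X_test.shape)
# With stratification the share of class 1 is nearly identical in both parts
print("Share of class 1 in train:", y_train.mean())
print("Share of class 1 in test: ", y_test.mean())
```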


Scale the features

In [33]:
# Scale features so each has mean=0 and std=1
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn & apply on train
X_test_scaled  = scaler.transform(X_test)       # apply the train statistics to test

Train the Logistic Regression model

In [None]:
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)
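The metric functions imported at the top are not used yet. A minimal sketch of how the fitted model might be evaluated follows. It runs on synthetic data from `make_classification` (an assumption, so that the example is self-contained), so the resulting numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and labels (165 rows, 2 features)
X, y = make_classification(
    n_samples=165, n_features=2, n_informative=2, n_redundant=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Same preprocessing and model as in the notebook
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)

# Predict on the held-out set and compute the imported metrics
y_pred = model.predict(X_test_scaled)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, zero_division=0)
rec = recall_score(y_test, y_pred, zero_division=0)
cm = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {acc:.3f}  Precision: {prec:.3f}  Recall: {rec:.3f}")
print(cm)
```

In the notebook the same calls would take `y_test` and `model.predict(X_test_scaled)` from the cells above.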