Introduction

This notebook analyzes the Nomao dataset to solve a deduplication problem using machine learning. The goal is to predict whether two records refer to the same place.

In [1]:
!pip install pandas numpy matplotlib seaborn scikit-learn



In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

Data Loading and Exploration

We begin by loading the dataset and exploring the data types, missing values, and summary statistics.

In [4]:
# Load Data 
df = pd.read_csv("../data/Nomao/Nomao.data")  



In [5]:
# Basic EDA
print("Initial Shape:", df.shape)
print(df.info())
print(df.describe())

Initial Shape: (34464, 120)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34464 entries, 0 to 34463
Columns: 120 entries, 0#1 to +1
dtypes: float64(6), int64(1), object(113)
memory usage: 31.6+ MB
None
                  1           1.1           1.2           1.3           1.4  \
count  34464.000000  34464.000000  34464.000000  34464.000000  34464.000000   
mean       0.636467      0.494792      0.626262      0.560934      0.534188   
std        0.424384      0.380138      0.305664      0.369693      0.325739   
min        0.000000      0.000000      0.000000      0.000000      0.000000   
25%        0.000000      0.000000      0.361111      0.218692      0.240000   
50%        1.000000      0.500000      0.666667      0.666667      0.473684   
75%        1.000000      1.000000      1.000000      1.000000      0.875000   
max        1.000000      1.000000      1.000000      1.000000      1.000000   

                1.5            +1  
count  34464.000000  34464.000000  
mean      
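The summary above lists column names such as `0#1` and `+1` and reports 113 `object` columns, which suggests that the first record was consumed as a header and that missing values are stored as a non-numeric token. A minimal sketch of an alternative load, assuming the file has no header row and marks missing values with `?`; the three sample rows are made up for illustration and stand in for the real file:

```python
import io

import pandas as pd

# Hypothetical three-row excerpt in the assumed layout: id, features, label.
sample = io.StringIO(
    "0#1,1,?,0.75,+1\n"
    "2#3,0.5,0.25,?,-1\n"
    "4#5,?,1,0.5,+1\n"
)

# header=None keeps the first record as data; na_values="?" converts the
# assumed missing-value token to NaN so feature columns parse as floats.
df_sample = pd.read_csv(sample, header=None, na_values="?")

print(df_sample.shape)
print(df_sample.dtypes)
print(df_sample.isna().sum())
```

With this kind of load the feature columns arrive as `float64`, so they are not excluded later when only numeric columns are kept.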

Data Preprocessing

Impute missing values in numeric columns with the column mean

Standardize numeric features using StandardScaler

Drop non-numeric columns before modeling instead of encoding them

In [7]:
# Handling Missing Values
numeric_cols = df.select_dtypes(include=[np.number]).columns
non_numeric_cols = df.select_dtypes(exclude=[np.number]).columns

# Impute only numeric columns
imputer = SimpleImputer(strategy='mean')
df_numeric = pd.DataFrame(imputer.fit_transform(df[numeric_cols]), columns=numeric_cols)

# Combine back with non-numeric columns (if any)
df_imputed = pd.concat([df_numeric, df[non_numeric_cols].reset_index(drop=True)], axis=1)


In [8]:
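# Feature / Target Separation
# After the concat above, the last column is a non-numeric feature,
# so select the label column "+1" by name rather than by position.
X = df_imputed.drop(columns=["+1"])
y = df_imputed["+1"].astype(int)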

Modeling

Train/test split

Train Logistic Regression and Random Forest classifiers

Evaluate models with accuracy, precision, recall, and F1-score

In [9]:
# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [10]:
# Numeric Feature Selection, Split, and Scaling

# Keep only numeric columns for modeling (non-numeric columns are dropped)
numeric_cols = X.select_dtypes(include=[np.number]).columns
X_numeric = X[numeric_cols]

# Redo the train/test split on the numeric features only
X_train, X_test, y_train, y_test = train_test_split(X_numeric, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
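The imputer in cell In [7] is fitted on the full dataset before the split, so test rows contribute to the imputation means. One way to keep every preprocessing statistic learned from training rows only is to chain the steps in a `Pipeline`. The sketch below is an illustration on synthetic data, not the notebook's method; `make_classification`, the injected missing values, and the names `X_demo`, `y_demo`, and `pipe` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric features (not the real Nomao data).
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # inject roughly 10% missing values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Imputer means and scaler statistics are computed from X_tr only,
# then reused unchanged when the pipeline transforms X_te.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)
print("Held-out accuracy:", pipe.score(X_te, y_te))
```

The same pipeline object can be passed to cross-validation utilities, which refit the imputer and scaler inside each fold.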



Evaluation

Compare model performance using metrics and confusion matrices
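To make the link between a confusion matrix and the reported metrics concrete, the sketch below derives precision, recall, F1, and accuracy from the four cell counts and checks them against scikit-learn. The labels are made up for illustration, assuming `1` marks a pair that refers to the same place and `-1` a pair that does not.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for illustration only.
y_true = np.array([1, 1, 1, 1, -1, -1, -1, -1, -1, 1])
y_pred = np.array([1, 1, -1, 1, -1, 1, -1, -1, -1, 1])

# With labels=[-1, 1], rows are actual classes and columns are predictions,
# so ravel() yields tn, fp, fn, tp for the positive class 1.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()

precision = tp / (tp + fp)  # of the predicted matches, how many are real
recall = tp / (tp + fn)     # of the real matches, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")

# The hand-computed values agree with the library functions.
assert np.isclose(precision, precision_score(y_true, y_pred, pos_label=1))
assert np.isclose(recall, recall_score(y_true, y_pred, pos_label=1))
assert np.isclose(f1, f1_score(y_true, y_pred, pos_label=1))
```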


In [12]:
# Logistic Regression
lr = LogisticRegression(max_iter=20, solver='saga', n_jobs=-1, class_weight='balanced', verbose=1)
lr.fit(X_train_scaled, y_train)
lr_preds = lr.predict(X_test_scaled)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.


Epoch 17, change: 0.024854762
Epoch 1, change: 1
Epoch 18, change: 0.019685692
Epoch 2, change: 0.50121554
Epoch 19, change: 0.016009834
Epoch 3, change: 0.49786032
Epoch 20, change: 0.014264126
Epoch 4, change: 0.44503256
Epoch 21, change: 0.012716885
Epoch 5, change: 0.38937869
Epoch 22, change: 0.011525611
Epoch 6, change: 0.39384479
Epoch 23, change: 0.010477661
Epoch 7, change: 0.22897759
Epoch 24, change: 0.0094697218
Epoch 8, change: 0.22551613
Epoch 25, change: 0.0086294332
Epoch 9, change: 0.23593694
Epoch 26, change: 0.0078399636
Epoch 10, change: 0.1812172
Epoch 27, change: 0.0071092426
Epoch 11, change: 0.10723151
Epoch 28, change: 0.0064558732
Epoch 12, change: 0.090130161
Epoch 29, change: 0.0058807658
Epoch 13, change: 0.06878339
Epoch 30, change: 0.0053320256
Epoch 14, change: 0.050517528
Epoch 31, change: 0.0048585338
Epoch 15, change: 0.035746777
Epoch 32, change: 0.0044035565
Epoch 16, change: 0.025277543
Epoch 33, change: 0.0040145247
Epoch 17, change: 0.018644835
...



Epoch 37, change: 0.0027684977
Epoch 38, change: 0.0025143655
Epoch 39, change: 0.002286579
Epoch 40, change: 0.0020807698
Epoch 41, change: 0.0019037741


In [None]:
# Random Forest
rf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)



In [None]:
# Confusion Matrix
plt.figure(figsize=(6,4))
sns.heatmap(confusion_matrix(y_test, rf_preds), annot=True, fmt='d')
plt.title("Random Forest Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

In [None]:
# Evaluation
print("Logistic Regression Results:")
print(classification_report(y_test, lr_preds))
print("Random Forest Results:")
print(classification_report(y_test, rf_preds))

Evaluation

Random Forest generally performs better than Logistic Regression
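A comparison like this is easiest to check when both models are scored with the same metrics on the same held-out split. The sketch below shows one way to collect those numbers into a single table. It runs on synthetic data, so its output says nothing about Nomao; `summarize` is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split


def summarize(name, y_true, y_pred):
    """Hypothetical helper: one row of macro-averaged metrics for a model."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Synthetic stand-in data (an assumption, not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each model on the same training split and score it on the same test split.
rows = [summarize(name, y_te, model.fit(X_tr, y_tr).predict(X_te)) for name, model in models.items()]
comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```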

Feature Importance

Use feature importances from Random Forest

Visualize top contributing features with horizontal bar chart

In [None]:
# Feature Importance (Random Forest)
importances = rf.feature_importances_
indices = np.argsort(importances)[-10:]  # up to ten largest, in ascending order
plt.figure(figsize=(10, 6))
plt.title("Top 10 Feature Importances - Random Forest")
plt.barh(range(len(indices)), importances[indices], align='center')
# Label bars with the actual column names instead of positional indices
plt.yticks(range(len(indices)), X_train.columns[indices])
plt.xlabel("Relative Importance")
plt.show()
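Pairing each importance with its column name makes the ranking easier to read than positional indices. A minimal sketch on synthetic data, assuming a DataFrame of features; the column names `f0` to `f11` and the variables `X_demo`, `forest`, and `top` are made up for illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with named columns (not the Nomao features).
X_arr, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=4, random_state=1
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(12)])

forest = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_demo, y_demo)

# Index the importances by column name and keep the ten largest, sorted descending.
top = pd.Series(forest.feature_importances_, index=X_demo.columns).nlargest(10)
print(top)
```

Calling `top.sort_values().plot.barh()` would draw the same kind of horizontal bar chart as the cell above, with the names already attached.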

Conclusion

Random Forest is effective for deduplication tasks

Key features influencing deduplication were identified

Future work includes trying XGBoost and improving preprocessing