## Entity Resolution
 Submitted by : Sanketh Nagarajan (sn2692)
 
 User Name in Leaderboard : SankethNagarajan
 
 Number of team members : 1 (Solo participant)
 
 Email ID: sn2692@columbia.edu

In [1]:
import pandas as pd
import numpy as np
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.svm import SVC
from sklearn.metrics import classification_report

Let's start by reading the given files.

In [2]:
#Reading Files
amazon = np.asarray(pd.read_csv("amazon.csv", encoding = "ISO-8859-1"))
rotten_tomatoes = np.asarray(pd.read_csv("rotten_tomatoes.csv", encoding = "ISO-8859-1"))
train = np.asarray(pd.read_csv("train.csv", encoding = "ISO-8859-1"))
test = np.asarray(pd.read_csv("test.csv", encoding = "ISO-8859-1"))
holdout = np.asarray(pd.read_csv("holdout.csv", encoding = "ISO-8859-1"))

On inspection, amazon.csv has date values in place of the runtime column for some entries (with runtimes given in the "star" column). We correct them next.

In [3]:
#Processing Rows which have date values in Run time (for amazon)
index = []
b = amazon[:,1]
for i in range(0,amazon.shape[0]):
    if("/" in str(b[i])):
        index.append(i)
#Replacing dates with run time & deleting runtime from star name (now empty valued)
amazon1 = np.copy(amazon)
amazon1[index,1] = amazon1[index,3]
amazon1[index,3] = "" 

Next, we remove the unwanted columns (such as "remarks" and "year") from both movie datasets.

In [4]:
#Deleting unwanted columns in amazon & rotten_tomatoes
amazon1 = amazon1[:,0:4]
rotten_tomatoes1 = np.copy(rotten_tomatoes[:,0:10])
rotten_tomatoes1 = np.delete(rotten_tomatoes1,3,1)

The "runtime" field holds its values as strings. Let's convert them into seconds.

In [5]:
#Converting Runtime to seconds & Int data type
#For Amazon
b = amazon1[:,1]
amazon2 = np.copy(amazon1)
for i in range(0,amazon1.shape[0]):
    s = 0
    if(str(b[i]) != 'nan'):
        x = str(b[i]).split(",")
        for j in range(0,len(x)):
            k = str(x[j]).strip()
            l = k.split(" ")
            if ("hour" in str(l[1])):
                m = (int(l[0]) * 60 * 60)
                s = s + m
            if ("min" in str(l[1])):
                s = s + (int(l[0]) * 60)
            if ("sec" in  str(l[1])):
                s = s + int(l[0])
    amazon2[i,1] = int(s) 

#For Rotten Tomatoes
b = rotten_tomatoes1[:,1]
rotten_tomatoes2 = np.copy(rotten_tomatoes1)
for i in range(0,len(b)):
    s = 0
    if (str(b[i])!='nan'):
        a = rotten_tomatoes2[i,1]
        a = str(a)[:-1]
        c = a.split(".")
        for j in range(0,len(c)):
            k = str(c[j]).strip()
            l = k.split(" ")
            if ("hr" in str(l[1])):
                m = (int(l[0]) * 60 * 60)
                s = s + m
            if ("min" in str(l[1])):
                s = s + (int(l[0]) * 60)
    rotten_tomatoes2[i,1] = int(s)   
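
The two loops above differ only in their separators and unit spellings, so the conversion can also be sketched as a single regular-expression helper. This is a minimal sketch, assuming runtimes look like "1 hour, 30 minutes" (Amazon) or "1 hr. 30 min." (Rotten Tomatoes), which is inferred from the parsing code rather than checked against the files. The name `runtime_to_seconds` is hypothetical.

```python
import re

# Seconds per unit; keys are the unit prefixes the loops above test for.
UNIT_SECONDS = {"hour": 3600, "hr": 3600, "min": 60, "sec": 1}
RUNTIME_PATTERN = re.compile(r"(\d+)\s*(hour|hr|min|sec)")

def runtime_to_seconds(text):
    """Convert a runtime string to seconds; missing values (NaN) become 0."""
    if text is None or str(text) == "nan":
        return 0
    parts = RUNTIME_PATTERN.findall(str(text).lower())
    return sum(int(amount) * UNIT_SECONDS[unit] for amount, unit in parts)

amazon_example = runtime_to_seconds("1 hour, 30 minutes")
rotten_example = runtime_to_seconds("1 hr. 30 min.")
```

Both example strings describe the same duration, so they map to the same number of seconds and Feature 1 would be 0 for such a pair.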

On inspecting the matching entries in train.csv, three features were found to be influential for classification.

We now define four functions (one string-similarity helper and one function per feature) that compute the three features used to train our machine learning classifier. The features are defined below.

Feature 1: Absolute difference between the two movie runtimes (in seconds)

Feature 2: String similarity between the director names (between 0 and 1)

Feature 3: String similarity score between movie stars. It is calculated as follows:
- For each star in the "star" column of amazon.csv, calculate similarity scores against the 6 "star" columns of rotten_tomatoes.csv.
- Keep track of the highest score found so far and add it to a running total once per Amazon star. When the best match is found for the first star, the total equals the highest score multiplied by the number of stars listed in amazon.csv for that movie; otherwise it is somewhat lower. This total is the third feature.
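
To make the features concrete, here is a small worked example using `difflib.SequenceMatcher`, the same class the notebook imports. The names and runtimes are made-up illustrations, not rows from the datasets.

```python
from difflib import SequenceMatcher

def similar(a, b):
    # ratio() returns 2*M/T, where M is the number of matched characters
    # and T is the total length of both strings, so it lies in [0, 1].
    return SequenceMatcher(None, a, b).ratio()

# Feature 2 style comparisons on hypothetical director names
same = similar("Steven Spielberg", "Steven Spielberg")
close = similar("Christopher Nolan", "Chris Nolan")
different = similar("Christopher Nolan", "Steven Spielberg")

# Feature 1 style comparison on hypothetical runtimes (in seconds)
runtime_gap = abs(5400 - 5520)
```

Identical names score 1.0, an abbreviated name scores lower but still well above an unrelated one, and the runtime gap here is 120 seconds.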



In [6]:
#Function to calculate similarity score between 2 strings a & b
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

#Creating Feature1 (Absolute Difference between movie run times)
def feature1(id1,id2):
    ida = np.where(amazon2[:,0]==id1)[0][0]
    idr = np.where(rotten_tomatoes2[:,0]==id2)[0][0]
    return(abs(amazon2[ida,1] - rotten_tomatoes2[idr,1]))

#Creating Feature2 (Similarity between Director Names)
def feature2(id1,id2):
    ida = np.where(amazon2[:,0]==id1)[0][0]
    idr = np.where(rotten_tomatoes2[:,0]==id2)[0][0]
    return(similar(amazon2[ida,2],rotten_tomatoes2[idr,2]))

#Creating Feature3 (Similarity between Star names)
def feature3(id1,id2):
    ida = np.where(amazon2[:,0]==id1)[0][0]
    idr = np.where(rotten_tomatoes2[:,0]==id2)[0][0] 
    a = str(amazon2[ida,3])
    b = a.split(",")
    r = rotten_tomatoes2[idr,3:9]
    av = 0
    h = 0
    m = 0
    for i in range(0,len(b)):
        k = str(b[i]).strip()
        if(k!='nan'):    
            for j in range(0,len(r)):
                l = str(r[j]).strip()
                if(l!='nan'):
                    s = similar(k,l)
                    if(s > h):
                        h = s
            m = m + h
    if (len(b) > 0):
        av = m
    
    return(av)
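
Note that in `feature3` the variable `h` is not reset between Amazon stars, so each star contributes the best score seen so far rather than its own best score. The sketch below mirrors that loop on plain strings to show the effect. The function name `star_score` and the star names are hypothetical, and NaN handling is left out for brevity.

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def star_score(amazon_stars, rt_stars):
    """Mirror of the feature3 loop on toy inputs (no NaN handling)."""
    best_so_far = 0.0  # plays the role of h: never reset between Amazon stars
    total = 0.0        # plays the role of m
    for star in (s.strip() for s in amazon_stars.split(",")):
        for other in rt_stars:
            best_so_far = max(best_so_far, similar(star, other.strip()))
        total += best_so_far
    return total

rt_stars = ["Meg Ryan", "Bill Pullman"]
exact_first = star_score("Meg Ryan, Tom Hanks", rt_stars)  # exact match seen first
exact_last = star_score("Tom Hanks, Meg Ryan", rt_stars)   # exact match seen last
```

Here `exact_first` is exactly 2.0 (the highest score times the number of Amazon stars), while `exact_last` is lower because the first star only contributes its own best score. The feature therefore depends on the order in which the stars are listed.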

Now let us transform the train, test & holdout datasets according to the features we just defined.

In [7]:
#Constructing the training data (according to the defined features)
train_new = np.zeros((train.shape[0],4))
for i in range(0,train.shape[0]):
    id1 = train[i,0]
    id2 = train[i,1]
    res = train[i,2]
    train_new[i,0] = feature1(id1,id2)
    train_new[i,1] = feature2(id1,id2)
    train_new[i,2] = feature3(id1,id2)
    train_new[i,3] = res
    
#Constructing the testing data (according to the defined features)
test_new = np.zeros((test.shape[0],3))
for i in range(0,test.shape[0]):
    id1 = test[i,0]
    id2 = test[i,1]
    test_new[i,0] = feature1(id1,id2)
    test_new[i,1] = feature2(id1,id2)
    test_new[i,2] = feature3(id1,id2)

#Constructing the holdout data (according to the defined features)
holdout_new = np.zeros((holdout.shape[0],3))
for i in range(0,holdout.shape[0]):
    id1 = holdout[i,0]
    id2 = holdout[i,1]
    holdout_new[i,0] = feature1(id1,id2)
    holdout_new[i,1] = feature2(id1,id2)
    holdout_new[i,2] = feature3(id1,id2)
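
Each call to `feature1`, `feature2` and `feature3` scans the whole id column with `np.where`. As a minimal sketch, assuming ids are hashable and the first occurrence of an id is the one wanted (which is what `np.where(...)[0][0]` returns), the row positions could instead be computed once and reused. The helper name `build_row_index` and the toy rows are hypothetical.

```python
def build_row_index(rows):
    """Map each id (first column) to the position of its first occurrence."""
    index = {}
    for pos, row in enumerate(rows):
        # setdefault keeps the first occurrence, like np.where(...)[0][0]
        index.setdefault(row[0], pos)
    return index

# Toy stand-in for amazon2: (id, runtime in seconds)
toy_amazon = [(101, 5400), (205, 7200), (101, 9999)]
amazon_pos = build_row_index(toy_amazon)

# Constant-time lookup instead of a full column scan per pair
row = toy_amazon[amazon_pos[205]]
```

The features themselves are unchanged by this; only the lookup cost per pair drops.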

We have now modelled our entity resolution problem as a machine learning classification problem. The best classifier was found to be a Support Vector Machine classifier with an "rbf" kernel. To check for overfitting, we split the training data into two sets (training and testing): 10-fold cross-validation on the training split measures model accuracy, and the held-out split provides a second check.

From the training data we can see that the classes are imbalanced (only 28 of the 249 pairs in the train.csv dataset are matching movies). I used Edited Nearest Neighbours as an undersampling technique to enable the model to learn more from the matching examples.
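
The idea behind Edited Nearest Neighbours can be sketched in plain Python: a majority-class sample is dropped when most of its nearest neighbours belong to another class, which, as far as I understand, is the majority-vote rule selected by `kind_sel="mode"`. This is a toy 1-D illustration with `k=3` and made-up points, not imblearn's implementation. The function name `enn_keep_indices` is hypothetical.

```python
from collections import Counter

def enn_keep_indices(points, labels, majority_label, k=3):
    """Return indices kept after a toy Edited Nearest Neighbours pass."""
    kept = []
    for i, (x, y) in enumerate(zip(points, labels)):
        if y != majority_label:
            kept.append(i)  # minority samples are never removed here
            continue
        # k nearest neighbours of sample i, excluding the sample itself
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: abs(points[j] - x))[:k]
        vote = Counter(labels[j] for j in others).most_common(1)[0][0]
        if vote == y:
            kept.append(i)  # neighbourhood agrees with the sample's class
    return kept

# Five clean majority points, one majority point sitting inside the minority cluster
points = [0.0, 0.1, 0.2, 0.3, 0.4, 5.05, 5.0, 5.1, 5.2, 5.3]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
kept = enn_keep_indices(points, labels, majority_label=0)
```

Only the majority point at 5.05 (index 5) is removed, since all three of its nearest neighbours are minority samples. Cleaning such points away from the minority region is what lets the classifier draw a boundary that favours the matching class.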

In [8]:
X  = train_new[:,0:3]
Y = train_new[:,3]

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,stratify=Y,random_state=0) 

svc_pipe = make_pipeline(EditedNearestNeighbours(kind_sel="mode", n_neighbors=5), StandardScaler(), SVC(kernel='rbf'))
score = cross_val_score(svc_pipe, X_train, Y_train, cv=10)
print(np.mean(score))

0.940614035088


The average cross-validation accuracy looks promising. Let's calculate the accuracy on the test set which was split from the training set.

In [9]:
svc_pipe.fit(X_train,Y_train)
svc_pipe.score(X_test,Y_test)

0.96825396825396826

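The model is 96.8% accurate on our custom-made test set. Let's now calculate the precision, recall & F1 score for the model fit on our whole train.csv dataset. Note that these scores are computed on the same data the model was fit on, so they are likely to be optimistic.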

In [10]:
svc_pipe.fit(X,Y)
y_new = svc_pipe.predict(X)

print(classification_report(Y, y_new,target_names=["not matching", "matching"]))

              precision    recall  f1-score   support

not matching       0.96      0.99      0.97       221
    matching       0.86      0.64      0.73        28

 avg / total       0.95      0.95      0.94       249



The support-weighted averages of precision and recall over the two classes are both 0.95, and the weighted average F1 score is 0.94.

The recall score for matching examples is a bit low (0.64), which can be attributed to the highly imbalanced training data.
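
To see how these numbers relate, the sketch below recomputes them from a confusion matrix. The four counts are inferred from the rounded report and the class supports (221 and 28); they are not printed by the notebook, so treat them as a reconstruction.

```python
# Counts inferred from the rounded report (an assumption, not notebook output)
tn, fp, fn, tp = 218, 3, 10, 18
total = tn + fp + fn + tp

precision = tp / (tp + fp)  # of the pairs predicted as matching, how many are
recall = tp / (tp + fn)     # of the true matching pairs, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total

# Accuracy of always predicting "not matching", for comparison
majority_baseline = (tn + fp) / total

print(round(precision, 2), round(recall, 2), round(f1, 2))
print(round(accuracy, 2), round(majority_baseline, 2))
```

Under these counts the matching-class scores round to 0.86, 0.64 and 0.73, as in the report. A classifier that always predicts "not matching" would already reach about 0.89 accuracy, which is why the per-class recall is the more informative number here.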

Since the results are good, we now create the gold.csv file, which contains predictions for the test.csv examples, and the holdout_gold.csv file, which contains predictions for the holdout.csv examples.

In [11]:
svc_pipe.fit(X,Y)
gold = svc_pipe.predict(test_new)
d = {'gold':gold}
out_test = pd.DataFrame(data=d)
out_test.to_csv("gold.csv", index=False)

gold = svc_pipe.predict(holdout_new)
d = {'gold':gold}
out_test = pd.DataFrame(data=d)
out_test.to_csv("holdout_gold.csv", index=False)

Exhaustive pair-wise comparison was avoided by building a model that only needs the given pair of entities (identified by their ids) to tell with confidence whether they represent the same entity or not.

Any other technique, such as ranking match scores between entities, needs the Cartesian product of both datasets to decide which entity in one dataset matches a given entity in the other (by choosing the highest match score).
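
The difference in work can be sketched on toy data. Scoring only the given pairs costs one comparison per pair, whereas picking the best match by score requires comparing every entity in one dataset with every entity in the other. The ids and director names below are hypothetical, and a single similarity score stands in for the trained classifier.

```python
from difflib import SequenceMatcher
from itertools import product

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical id -> director name tables
amazon = {1: "Christopher Nolan", 2: "Steven Spielberg", 3: "Greta Gerwig"}
rotten = {10: "Chris Nolan", 20: "Steven Spielberg",
          30: "Greta Gerwig", 40: "Sofia Coppola"}

# Approach used here: score only the pairs we are asked about
candidate_pairs = [(1, 10), (2, 30)]
pair_scores = {(a, r): similar(amazon[a], rotten[r]) for a, r in candidate_pairs}

# Best-match approach: needs the full Cartesian product
best_match = {}
comparisons = 0
for a, r in product(amazon, rotten):
    comparisons += 1
    score = similar(amazon[a], rotten[r])
    if a not in best_match or score > best_match[a][1]:
        best_match[a] = (r, score)
```

With 3 and 4 entities the best-match search already needs 12 comparisons against 2 for the given pairs, and the gap grows with the product of the dataset sizes.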

In [12]:
logit = make_pipeline(EditedNearestNeighbours(kind_sel="mode", n_neighbors=5), StandardScaler(), LogisticRegressionCV())
logit.fit(X,Y)
y_new = logit.predict(X)

print(classification_report(Y, y_new,target_names=["not matching", "matching"]))

              precision    recall  f1-score   support

not matching       0.96      0.99      0.97       221
    matching       0.86      0.64      0.73        28

 avg / total       0.95      0.95      0.94       249

