# NLP Research Project

This project develops a classifier that distinguishes human translations from machine translations. The translations are from Mandarin to English. For each example, the model is given the original sentence in Mandarin, a human reference translation of that sentence, and a candidate translation. The model's goal is to predict whether the candidate translation was produced by a machine or by a human.

This model is based on a 1998 research paper by Professor Thorsten Joachims on using SVMs in natural language processing. The method tokenizes and lemmatizes the sentences to extract useful information about each one. It relies on the property that SVMs tend to resist overfitting even with extremely high-dimensional feature vectors. I apply tokenization and lemmatization to each of the three sentences provided for each example, and combine the resulting features with the additional quality score supplied for the translation. The model uses an SVM with a high C value, so that it behaves approximately like a hard-margin SVM, and reaches an accuracy of 78.16% on the test set.

In [1]:
import numpy as np
import pandas as pd
import os
import sklearn
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
vectorizer = CountVectorizer()

In [2]:
for dirname, _, filenames in os.walk('//Users/sidharthvasudev/Desktop/NLP research test'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

//Users/sidharthvasudev/Desktop/NLP research test/Untitled.ipynb
//Users/sidharthvasudev/Desktop/NLP research test/train.txt
//Users/sidharthvasudev/Desktop/NLP research test/~$dependent Study with LIL Lab.docx
//Users/sidharthvasudev/Desktop/NLP research test/SVM-Joachims (1998).ipynb
//Users/sidharthvasudev/Desktop/NLP research test/Independent Study with LIL Lab.docx
//Users/sidharthvasudev/Desktop/NLP research test/test.txt
//Users/sidharthvasudev/Desktop/NLP research test/quicktask.pdf
//Users/sidharthvasudev/Desktop/NLP research test/.ipynb_checkpoints/SVM-Joachims (1998)-checkpoint.ipynb
//Users/sidharthvasudev/Desktop/NLP research test/.ipynb_checkpoints/Untitled-checkpoint.ipynb


In [3]:
# Read the raw training file (closing it afterwards) and split it into lines
with open("//Users/sidharthvasudev/Desktop/NLP research test/train.txt", "r") as f:
    train = f.read()
train_sep = train.split("\n")

# Cleaning the dataset 

The cells below clean the dataset. Reading the file and converting it to an array leaves some irregular blank entries, which are removed in the two steps below.

In [4]:
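# Remove the extra entry that follows each group of five fields
i = 5
while i < 3504:
    try:
        train_sep.pop(i)
        i = i + 5
    except IndexError:
        # Ran past the end of the shortened list: stop
        break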
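
As a hedged aside, the pop loop above can also be written as a single extended-slice deletion. The sketch below assumes the layout the loop implies (five fields followed by one separator entry, repeated); the function names and the sample data are hypothetical and only illustrate that both approaches give the same result.

```python
def drop_separators_loop(lines):
    """Mimic the pop loop: remove index 5, 10, 15, ... from a copy of the list."""
    result = list(lines)
    i = 5
    while i < len(result):
        result.pop(i)
        i += 5
    return result


def drop_separators_slice(lines):
    """Remove every sixth entry (original indices 5, 11, 17, ...) in one step."""
    result = list(lines)
    del result[5::6]
    return result


# Hypothetical sample: two records, each followed by an empty separator entry
raw = ["zh1", "ref1", "cand1", "0.5", "H", "",
       "zh2", "ref2", "cand2", "0.25", "M", ""]

cleaned_loop = drop_separators_loop(raw)
cleaned_slice = drop_separators_slice(raw)
```

Both helpers work on a copy, so the original list is left untouched.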

In [5]:
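# Cleaning the dataset: collect records whose label ("H" or "M") is not where expected
modified = []
x = 4  # index of the first label; labels should then appear every 5 entries
while x <= 2920:
    if train_sep[x] in ("H", "M"):
        x = x + 5
    else:
        # Misaligned record: scan forward until the next label is found
        y = x
        a = True
        while a:
            if train_sep[y] in ("H", "M"):
                add = train_sep[x-4: y+1]
                add.append((x, y))  # record where the irregular span starts and ends
                modified.append(add)
                x = y
                a = False
            else:
                y = y + 1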

In [6]:
df = pd.DataFrame(train_sep)
train_res = pd.DataFrame(df.values.reshape(584,5))


The same process used above for the train set is applied to the test set. It could be abstracted into a shared function, but the two are kept separate here for clarity. This is left as future work.
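
One possible shape for such a shared function is sketched below. It is not the notebook's code: the name `parse_records`, the constant `FIELDS_PER_RECORD`, and the sample text are hypothetical, and the sketch assumes every record has exactly five non-empty fields (source, reference, candidate, score, label) separated by blank lines.

```python
from typing import List

FIELDS_PER_RECORD = 5  # source, reference, candidate, score, label (assumed order)


def parse_records(text: str) -> List[List[str]]:
    """Split raw file text into records of five fields each."""
    lines = [line.strip() for line in text.split("\n")]
    lines = [line for line in lines if line]  # drop blank separator lines
    if len(lines) % FIELDS_PER_RECORD != 0:
        raise ValueError(
            f"expected a multiple of {FIELDS_PER_RECORD} non-empty lines, got {len(lines)}"
        )
    return [
        lines[i:i + FIELDS_PER_RECORD]
        for i in range(0, len(lines), FIELDS_PER_RECORD)
    ]


# Hypothetical sample in the assumed layout
sample = "src one\nref one\ncand one\n0.5\nH\n\nsrc two\nref two\ncand two\n0.25\nM\n"
records = parse_records(sample)
```

Under these assumptions, the returned list of rows could be passed straight to `pd.DataFrame(...)` for both the train and the test file, replacing the hard-coded lengths.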

In [7]:
# Read the raw test file (closing it afterwards) and split it into lines
with open("//Users/sidharthvasudev/Desktop/NLP research test/test.txt", "r") as f:
    test = f.read()
test_sep = test.split("\n")

In [8]:
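# Remove the extra entry that follows each group of five fields
i = 5
while i < 1044:
    try:
        test_sep.pop(i)
        i = i + 5
    except IndexError:
        # Ran past the end of the shortened list: stop
        break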

In [9]:
df = pd.DataFrame(test_sep)
test_res = pd.DataFrame(df.values.reshape(174,5))

# Training the model

In the first part of this section, we define the part-of-speech tags we pay attention to. With more time, this set could be expanded. Next, we tokenize and lemmatize the sentences and convert them into numerical inputs for the model. The result is a high-dimensional, sparse feature matrix.

In [10]:
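# Map the first letter of a Penn Treebank POS tag to a WordNet POS; default to noun
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV  # adverb tags (RB, RBR, RBS) start with 'R'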
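
To illustrate how such a tag map behaves, here is a small sketch that does not need NLTK. It relies on the fact that the WordNet POS constants are single-character strings ("n", "a", "v", "r"). The helper names are hypothetical. Penn Treebank adverb tags (RB, RBR, RBS) begin with "R", so the sketch keys the adverb entry on "R"; any unlisted tag, such as the determiner tag DT, falls back to noun.

```python
from collections import defaultdict

# These literals equal wn.NOUN, wn.ADJ, wn.VERB and wn.ADV in nltk.corpus.wordnet
NOUN, ADJ, VERB, ADV = "n", "a", "v", "r"


def build_tag_map():
    """Return a map from the first letter of a Treebank tag to a WordNet POS."""
    tag_map = defaultdict(lambda: NOUN)  # anything unlisted is treated as a noun
    tag_map["J"] = ADJ
    tag_map["V"] = VERB
    tag_map["R"] = ADV
    return tag_map


def wordnet_pos(treebank_tag, tag_map):
    """Look up the WordNet POS using only the first letter of the tag."""
    return tag_map[treebank_tag[0]]


tag_map_sketch = build_tag_map()
examples = {tag: wordnet_pos(tag, tag_map_sketch) for tag in ["NN", "VBD", "JJ", "RB", "DT"]}
```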

In [11]:
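def tokens(text):
    """Lowercase, tokenize, and lemmatize a sentence; return the lemmas as a string."""
    text = text.lower()
    text_list = word_tokenize(text)
    Final = []
    wordlemmatization = WordNetLemmatizer()
    for word, tag in pos_tag(text_list):
        # The first letter of the POS tag selects the WordNet part of speech
        word_lem = wordlemmatization.lemmatize(word, tag_map[tag[0]])
        Final.append(word_lem)
    # The string form of the list is re-tokenized later by the TF-IDF vectorizer
    return str(Final)

# Process train and test together, then split them again by row position
all_data = pd.concat([train_res, test_res], ignore_index=True)
all_data["0n"] = all_data[0].apply(tokens)
all_data["1n"] = all_data[1].apply(tokens)
all_data["2n"] = all_data[2].apply(tokens)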

In [12]:
Train = all_data.iloc[0:584,:]
Train_X = Train[["0n","1n","2n",3]]
Train_Y = Train[4]
Test = all_data.iloc[584:,:]
Test_X = Test[["0n","1n","2n",3]]
Test_Y = Test[4]

Here we convert the labels ("H" and "M") into the integers 0 and 1.

In [13]:
Encoder = LabelEncoder()
Encoder.fit(Train_Y)
Train_Y = Encoder.transform(Train_Y)
Test_Y = Encoder.transform(Test_Y)
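
For intuition, the encoding step can be mimicked in plain Python. `LabelEncoder` assigns integer codes to the sorted unique labels, so with the labels "H" and "M" the mapping is H to 0 and M to 1. The helper names below are hypothetical and the sketch only mirrors that behavior; it is not scikit-learn's implementation.

```python
def fit_label_mapping(labels):
    """Map each distinct label to its index in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


def encode_labels(labels, mapping):
    """Replace each label by its integer code; unseen labels raise KeyError."""
    return [mapping[label] for label in labels]


# Fit on (hypothetical) training labels, then reuse the same mapping for test labels
mapping = fit_label_mapping(["H", "M", "M", "H"])
encoded = encode_labels(["M", "H", "H"], mapping)
```

Fitting on the training labels and reusing the mapping for the test labels keeps the two sets consistent, which is what the cell above does.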

This is where we take the lemmatized sentences and vectorize them using TF-IDF. Each vectorizer is fitted on the train set only, so that the test data is mapped into the same feature space.

In [14]:
Tfidf_0 = TfidfVectorizer()
Tfidf_0.fit(Train_X["0n"])
Train_X0 = Tfidf_0.transform(Train_X["0n"]).toarray()
Test_X0 = Tfidf_0.transform(Test_X["0n"]).toarray()
Tfidf_1 = TfidfVectorizer()
Tfidf_1.fit(Train_X["1n"])
Train_X1 = Tfidf_1.transform(Train_X["1n"]).toarray()
Test_X1 = Tfidf_1.transform(Test_X["1n"]).toarray()
Tfidf_2 = TfidfVectorizer()
Tfidf_2.fit(Train_X["2n"])
Train_X2 = Tfidf_2.transform(Train_X["2n"]).toarray()
Test_X2 = Tfidf_2.transform(Test_X["2n"]).toarray()
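
The fit-on-train, transform-on-test pattern can be illustrated with a small pure-Python TF-IDF. This is a simplified sketch, not scikit-learn's code: it assumes the default `TfidfVectorizer` settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization), and the function names and example sentences are hypothetical.

```python
import math
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # words of two or more characters


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def fit_tfidf(train_docs):
    """Learn the vocabulary and IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(tokenize(doc)))
    vocab = sorted(doc_freq)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    return vocab, idf


def transform_tfidf(doc, vocab, idf):
    """Project a document onto the learned vocabulary; unseen words are ignored."""
    counts = Counter(tokenize(doc))
    weights = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]


vocab, idf = fit_tfidf(["the cat sit", "the dog run"])
test_vec = transform_tfidf("the bird sit", vocab, idf)  # "bird" is not in the vocabulary
```

Because the vocabulary comes from the training documents alone, every test vector has the same length as the training vectors, and words seen only at test time contribute nothing.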

In [15]:
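# Append the translation quality score (column 3) as one extra numeric feature
TrX = np.concatenate((Train_X0, Train_X1, Train_X2, Train_X[3].astype(float).values.reshape(-1, 1)), axis=1)
TeX = np.concatenate((Test_X0, Test_X1, Test_X2, Test_X[3].astype(float).values.reshape(-1, 1)), axis=1)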

Finally, this is where we train the model. I have kept it simple and used an SVM with a high C value, which makes it behave approximately like a hard-margin SVM. I have not added any other features, because I believe this to be the most efficient and effective version.
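
For reference, the standard soft-margin SVM objective shows why a large C pushes the model toward hard-margin behavior. The notation below ($w$, $b$, slack variables $\xi_i$, feature map $\phi$, labels $y_i \in \{-1, +1\}$) is the usual textbook formulation and is not taken from this notebook.

```latex
\[
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} \phi(x_i) + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0 .
\]
```

The larger C is, the more each margin violation $\xi_i$ costs, so the optimizer tolerates fewer misclassified training points. In the limit of very large C, the problem approaches the hard-margin formulation, where all $\xi_i$ must be zero.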

In [16]:
svclassifier = SVC(C=15)
svclassifier.fit(TrX, Train_Y)
y_pred = svclassifier.predict(TeX)
# accuracy_score expects the true labels first, then the predictions
print("SVM Accuracy: ", accuracy_score(Test_Y, y_pred) * 100)
y_pred

SVM Accuracy:  78.16091954022988


array([1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
       0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0])
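
As a quick check on the figure above, accuracy is the fraction of predictions that match the true labels. With 174 test examples, an accuracy of 78.16% corresponds to 136 correct predictions (136 / 174). The sketch below is a plain-Python illustration with a hypothetical helper name and made-up toy labels; it is not scikit-learn's implementation.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where prediction and true label agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy example: three of four predictions are correct
toy_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])

# 136 correct predictions out of 174 test examples, as a percentage
reported_percent = 136 / 174 * 100
```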

In [17]:
# Accuracy on the training set itself, for comparison with the test accuracy
y_predtr = svclassifier.predict(TrX)
print("SVM Accuracy: ", accuracy_score(Train_Y, y_predtr) * 100)

SVM Accuracy:  100.0
