# Text Mining Project

The goal of this project is to develop a metric that predicts the quality of a translation using the reference. 

Your metric should correlate well with the existing quality assessments that you have in the above corpus

In [137]:
pip install inflect

Collecting inflect
  Downloading inflect-5.3.0-py3-none-any.whl (32 kB)
Installing collected packages: inflect
Successfully installed inflect-5.3.0
Note: you may need to restart the kernel to use updated packages.


In [159]:
#### Load file

import pandas as pd

#### Preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize, RegexpTokenizer, sent_tokenize
from nltk.corpus import wordnet
from nltk.stem import SnowballStemmer
from nltk.stem import LancasterStemmer

import re, string, unicodedata
#import contractions
import inflect ## convert the numbers into words

from bs4 import BeautifulSoup


from tqdm import tqdm_notebook as tqdm #Customisable progressbar decorator for iterators

###
from sklearn.model_selection import train_test_split

### 1- Data Collection

In [170]:
corpus = pd.read_csv("de_eng_scores.csv")

In [171]:
corpus

Unnamed: 0,source,reference,translation,z-score,avg-score,annotators
0,"Ihr Zeitlupentempo maßen sie, als sie vor Spit...",Her timeless pace measures them when they equi...,Their slow speed was measured by researchers o...,-0.345024,76.0,1
1,"Er sagte, dass die Bereiche ruhige Treffpunkte...",He said the areas offer quiet meeting points b...,He said the spaces provided calm meeting point...,0.903800,97.5,2
2,Für die Geschäftsleute an der B 27 ist es nur ...,"For businessmen at the B 27, it's only a small...",This is only a small consolation for businesse...,0.700503,94.0,1
3,Diese Fähigkeit sei möglicherweise angeboren o...,This ability may be born or developed with gen...,"This ability may be innate, or may develop as ...",-1.256572,51.5,2
4,Weil sie Wassertemperaturen um die sechs Grad ...,Because they prefer water temperatures around ...,They generally only come to the surface in win...,0.293909,87.0,2
...,...,...,...,...,...,...
21699,"Lt. Cmdr. Patrick Evans, ein Pressesprecher de...","Lt. Cmdr. Patrick Evans, a press officer at th...","Lt. Cmdr. Patrick Evans, a Pentagon spokesman,...",1.246459,100.0,1
21700,"""Um ein Beispiel zu geben: Wenn ich ihn etwas ...","""To give an example: If I ask him something th...","""To give an example: If I ask him what happene...",0.792878,98.0,1
21701,"Ein Grund dafür, dass nicht alle Nachbarn das ...",One reason that not all neighbours view this a...,One reason for not all neighbours seeing this ...,0.597068,76.0,1
21702,Der Gewinn vor Zinsen und Steuern erhöhte sich...,Profit before interest and tax increased from ...,Profits before interest and taxes increased fr...,-0.305719,61.0,1


### 2-  Data  Preprocessing:
(Noise Removal and Normalization)

 - converting all letters to lower
 - converting numbers into words
 - removing punctuations, accent marks and other diacritics
 - removing white spaces
 - expanding abbreviations
 - removing stop words, sparse terms, and particular words
 - Collocations ( word combinations occurring together more often than would be expected by chance)
 - canonical form ( ex: b4 -> before)

In [172]:
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/joanarafael/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/joanarafael/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [175]:
def preprocessing(dataframe,stopword_type):
    
    processed_corpus = []
    
    stop_words = set(stopwords.words(str(stopword_type)))
    
    
    
    for i in tqdm(range(len(dataframe))):
        
        text = dataframe.iloc[i]
        
        #Remove punctuations
        
        
        if stopword_type == "german":
            text=re.sub('[^a-zA-ZäöüÄÖÜß]]', ' ', text)
        else:
            text = re.sub('[^a-zA-Z]', ' ', text)
            
        
        # Convert the numbers into words
        
        if text.isdigit():
            text= text.inflect.engine

        #Convert to lowercase
        text = text.lower()

        #Remove tags ( Denoise)
        text = BeautifulSoup(text).get_text()
        
        # Convert to list from string
        text = text.split()

        #Lemmatisation
        
        lem = WordNetLemmatizer()
        
        text = [lem.lemmatize(word) for word in text if not word in stop_words] 
        
        text = " ".join(text)
        
    
        processed_corpus.append(text)
        
        
    return processed_corpus

In [176]:
corpus["reference"]=preprocessing(corpus["reference"], "english")
corpus["source"]=preprocessing(corpus["source"], "german")
corpus["translation"]=preprocessing(corpus["translation"], "english")

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for i in tqdm(range(len(dataframe))):


HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=21704.0), HTML(value='')))




HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=21704.0), HTML(value='')))




HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=21704.0), HTML(value='')))




In [177]:
corpus

Unnamed: 0,source,reference,translation,z-score,avg-score,annotators
0,"zeitlupentempo maßen sie, spitzbergen sechs ti...",timeless pace measure equipped six animal broa...,slow speed measured researcher svalbard fitted...,-0.345024,76.0,1
1,"sagte, bereiche ruhige treffpunkte flüchtlinge...",said area offer quiet meeting point refugee vo...,said space provided calm meeting point refugee...,0.903800,97.5,2
2,"geschäftsleute b 27 kleiner trost, kunden rott...",businessmen b small consolation customer rotte...,small consolation business located along b roa...,0.700503,94.0,1
3,fähigkeit sei möglicherweise angeboren entwick...,ability may born developed gender maturity,ability may innate may develop animal reach se...,-1.256572,51.5,2
4,wassertemperaturen sechs grad celsius bevorzug...,prefer water temperature around six degree cel...,generally come surface winter prefer water tem...,0.293909,87.0,2
...,...,...,...,...,...,...
21699,"lt. cmdr. patrick evans, pressesprecher pentag...",lt cmdr patrick evans press officer pentagon a...,lt cmdr patrick evans pentagon spokesman said ...,1.246459,100.0,1
21700,"""um beispiel geben: frage, zwei jahre zurückli...",give example ask something two year back sang ...,give example ask happened two year ago sang mi...,0.792878,98.0,1
21701,"grund dafür, nachbarn problem ansehen, sein, k...",one reason neighbour view problem may necessar...,one reason neighbour seeing problem could dire...,0.597068,76.0,1
21702,gewinn zinsen steuern erhöhte laut zwischenbil...,profit interest tax increased million million ...,profit interest tax increased million euro mil...,-0.305719,61.0,1


### 3- Data Exploration / Visualization



3.1 Create a BOW

In [163]:
from sklearn.feature_extraction.text import CountVectorizer


cv = CountVectorizer(
    max_df=0.8,
    stop_words="english", 
    max_features=10000, 
    ngram_range=(1,3)
)

In [165]:
X = cv.fit_transform(corpus["reference"])

In [166]:
list(cv.vocabulary_.keys())[:10]

['timeless',
 'pace',
 'measure',
 'equipped',
 'animal',
 'broadcaster',
 'spitsbergen',
 'said',
 'area',
 'offer']

###  4- Models Building

### Quality of Estimation
- #### Levels of Granularity
    - Word level, QE is concerned with predicting binary labels for words based on whether they were translated correctly or not.
    - Phrase-level QE aims to predict the quality of translated phrases and is derived from word-level results.
    - Sentence-level QE aims at assigning a score to a translated sentence based on the number of words that need to be changed in order to match the text provided in the nearest reference translation

### 5- Models Evaluation

#### Bilingual Evaluation Understudy (BLEU)