# Cleaning the data


# 1)- Importing key modules

In [1]:
# support both Python 2 and Python 3 with minimal overhead.
from __future__ import absolute_import, division, print_function

# Suppress warnings to keep the notebook output uncluttered; errors still surface.
import warnings
warnings.filterwarnings('ignore')

In [2]:
import re    # for regular expressions 
import nltk  # for text manipulation 
import string 
import numpy as np 
import pandas as pd 

#For Visuals
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from matplotlib import rcParams
rcParams['figure.figsize'] = 11, 8
%config InlineBackend.figure_format = 'svg'
%matplotlib inline

In [3]:
%reload_ext version_information
%version_information pandas,numpy, nltk, seaborn, matplotlib

Software,Version
Python,3.7.4 64bit [Clang 4.0.1 (tags/RELEASE_401/final)]
IPython,7.8.0
OS,Darwin 19.0.0 x86_64 i386 64bit
pandas,0.25.1
numpy,1.17.2
nltk,3.4.5
seaborn,0.9.0
matplotlib,3.1.1
Mon Dec 09 11:35:50 2019 CET,Mon Dec 09 11:35:50 2019 CET


# 2)- Loading Dataset

In [4]:
data=pd.read_csv('final_data_w_classes_only_contracts.csv',index_col=[0])

In [5]:
data.shape

(8932, 2)

In [6]:
data.head()

Unnamed: 0,text,class
0,Supplier shall update the Documentation on a r...,1.0
1,"major release upgrades of Software, change of ...",1.0
2,Accept incident severity as set by E.ON Servic...,1.0
3,"Supplier shall provide all tools, documentatio...",1.0
4,For smaller Projects a deviation can be agreed...,1.0


In [7]:
data.tail()

Unnamed: 0,text,class
8927,EnsurethatSupplier’sperformancerequirementsast...,0.0
8928,Establishandexecutetheaccountmanagementdiscipl...,0.0
8929,Reviewofconsolidatedforecast/demandreportscove...,0.0
8930,OnceE.ON'sContractManagerdecidestoproceedwitha...,0.0
8931,"maymaketemporaryOperationalChanges,incaseitisa...",0.0


In [8]:
data['class'].value_counts()

0.0    4393
1.0    3550
Name: class, dtype: int64

**Class 0 is other text; class 1 is Deliverables and Obligations.**
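
The two counts above add up to 7,943 rather than 8,932 because `value_counts` skips missing labels by default. A minimal sketch on an invented frame (`toy` is hypothetical, not the contract data) shows how `dropna=False` makes the gap visible:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the contract data: two labelled classes plus gaps.
toy = pd.DataFrame({"class": [1.0, 1.0, np.nan, 0.0, 0.0, 0.0, np.nan]})

# Default behaviour: NaN labels are silently left out of the tally.
labelled_only = toy["class"].value_counts()

# dropna=False adds a NaN row, so the counts sum to the number of rows.
with_missing = toy["class"].value_counts(dropna=False)

print(labelled_only.sum(), with_missing.sum(), len(toy))
```

Comparing the sum of the counts with the row count is a quick way to spot unlabelled rows before any modelling.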

In [9]:
# What the Deliverables and Obligations class (1.0) looks like
data[data["class"]==1.0].head(10)

Unnamed: 0,text,class
0,Supplier shall update the Documentation on a r...,1.0
1,"major release upgrades of Software, change of ...",1.0
2,Accept incident severity as set by E.ON Servic...,1.0
3,"Supplier shall provide all tools, documentatio...",1.0
4,For smaller Projects a deviation can be agreed...,1.0
5,Supplier shall provide any hardware or testing...,1.0
6,Supplier is obliged to install the work on the...,1.0
7,If the Benchmarking Results indicate that the ...,1.0
8,Within thirty () days or such other period agr...,1.0
9,E.ON may request and the Supplier shall provid...,1.0


In [10]:
# Let's see what the Other class (0.0) looks like
data[data["class"]==0.0].head(10)

Unnamed: 0,text,class
4539,"Furthermore, Supplier shall not be liable for ...",0.0
4540,Intervention control (all unauthorized interve...,0.0
4541,Supplier shall provide and maintain test and p...,0.0
4542,Configuration Management ensures that relevant...,0.0
4543,The procedural documentation contains transpar...,0.0
4544,Job control (it must be ensured that the data ...,0.0
4545,Fixed Charge invoiced on the first day of each...,0.0
4546,Participate in cross-delivery service provider...,0.0
4547,E.ON shall agree with the internal and externa...,0.0
4548,There is a concept for the creation and implem...,0.0


In [11]:
# Length of the Series' printed summary (not of the underlying text)
len(str(data['text']))

677
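
If the goal is to inspect how long the clauses themselves are, a per-row character count is probably closer to the intent. The sketch below uses an invented two-row Series (`toy` is hypothetical) and the pandas `.str.len()` accessor:

```python
import pandas as pd

toy = pd.Series(["Supplier shall update the Documentation.",
                 "major release upgrades of Software"], name="text")

# Character count of each individual text entry.
char_lengths = toy.str.len()

# Total number of characters across all entries.
total_chars = char_lengths.sum()

print(char_lengths.tolist(), total_chars)
```

On the real data, `data['text'].str.len().describe()` would give a summary of clause lengths.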

### check missing values

In [12]:
# Summarise missing values per column
def summary_missing(dataset):
    n_miss = dataset.isnull().sum()    # missing count per column
    n_obs = dataset.shape[0]           # number of rows
    n_miss_per = n_miss/n_obs*100      # missing share per column, in percent
    n_miss_tbl = pd.concat([n_miss, n_miss_per], axis = 1).sort_values(1, ascending = False).round(1)
    n_miss_tbl = n_miss_tbl[n_miss_tbl[1] != 0]    # keep only columns with missing values
    print('No. of fields: ', dataset.shape[0])     # note: this prints the row count
    print('No. of missing fields: ', n_miss_tbl.shape[0])    # columns with at least one missing value
    n_miss_tbl = n_miss_tbl.rename(columns = {0:'No. of mising Value', 1:'%age of missing Value'})
    return n_miss_tbl

In [13]:
summary_missing(data)

No. of fields:  8932
No. of missing fields:  1


Unnamed: 0,No. of mising Value,%age of missing Value
class,989,11.1


In [14]:
data_class=data['class']

What should we do with the missing labels: drop those rows or forward-fill them?
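
To make the trade-off concrete, here is a small sketch on invented data. The `toy` frame is hypothetical but shaped like this dataset: rows labelled 1.0, then unlabelled rows, then rows labelled 0.0. Forward filling keeps every row but copies the preceding label into the gap; dropping discards the unlabelled rows instead.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f"],
    "class": [1.0, 1.0, np.nan, np.nan, 0.0, 0.0],
})

# Option 1: forward fill copies the last seen label into the gap.
filled = toy.ffill()

# Option 2: drop the rows whose label is missing.
dropped = toy.dropna(subset=["class"])

print(filled["class"].value_counts().to_dict())
print(dropped["class"].value_counts().to_dict())
```

Forward fill shifts the class balance towards whichever label precedes the gap, so it is worth checking that those rows really belong to that class.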

In [15]:
data_class.unique()

array([ 1., nan,  0.])

In [16]:
# Where do the missing values occur?

data_class.loc[data_class.isnull()]

3550   NaN
3551   NaN
3552   NaN
3553   NaN
3554   NaN
        ..
4534   NaN
4535   NaN
4536   NaN
4537   NaN
4538   NaN
Name: class, Length: 989, dtype: float64

In [17]:
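# Forward-fill the missing labels. The gap (rows 3550-4538) directly follows
# the class 1.0 rows, so every filled label becomes 1.0.
data = data.ffill()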

In [18]:
summary_missing(data)

No. of fields:  8932
No. of missing fields:  0


Unnamed: 0,No. of mising Value,%age of missing Value


# 3)- Text Cleaning

### 3.1)-remove unwanted text patterns

In [19]:
def remove_pattern(input_txt, pattern):
    # Apply the pattern directly; substituting each match as if it were a regex
    # would misbehave on matches that contain special characters.
    return re.sub(pattern, '', input_txt)

Which patterns are unwanted? As an example, let's remove @-mentions.

In [20]:
data['clean'] = np.vectorize(remove_pattern)(data['text'], r"@[\w]*")
data.head()

Unnamed: 0,text,class,clean
0,Supplier shall update the Documentation on a r...,1.0,Supplier shall update the Documentation on a r...
1,"major release upgrades of Software, change of ...",1.0,"major release upgrades of Software, change of ..."
2,Accept incident severity as set by E.ON Servic...,1.0,Accept incident severity as set by E.ON Servic...
3,"Supplier shall provide all tools, documentatio...",1.0,"Supplier shall provide all tools, documentatio..."
4,For smaller Projects a deviation can be agreed...,1.0,For smaller Projects a deviation can be agreed...


In [21]:
data.text[0]

'Supplier shall update the Documentation on a regular basis but at least:Once every half calendar year after the respective Service Commencement Date; andFollowing every update of the Services (e.g,.'

In [22]:
data.clean[0]

'Supplier shall update the Documentation on a regular basis but at least:Once every half calendar year after the respective Service Commencement Date; andFollowing every update of the Services (e.g,.'
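
The two outputs are identical because this clause contains no `@` token. To see the pattern take effect, here is a minimal sketch on an invented string; `strip_pattern` is a hypothetical helper that applies the regular expression directly with `re.sub`:

```python
import re

def strip_pattern(text, pattern):
    """Remove every match of `pattern` from `text`."""
    return re.sub(pattern, "", text)

# Invented example; the clause shown above has no @-mentions.
sample = "Contact @helpdesk or @it_support for access."
cleaned = strip_pattern(sample, r"@\w*")

print(repr(cleaned))
```

Each removed mention leaves its surrounding spaces behind, so the result contains double spaces until whitespace is normalised later.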

### 3.2)-Removing Punctuations, Numbers, and Special Characters

In [23]:
data['clean'] = data['clean'].str.replace("[^a-zA-Z#]", " ", regex=True)
data.head()

Unnamed: 0,text,class,clean
0,Supplier shall update the Documentation on a r...,1.0,Supplier shall update the Documentation on a r...
1,"major release upgrades of Software, change of ...",1.0,major release upgrades of Software change of ...
2,Accept incident severity as set by E.ON Servic...,1.0,Accept incident severity as set by E ON Servic...
3,"Supplier shall provide all tools, documentatio...",1.0,Supplier shall provide all tools documentatio...
4,For smaller Projects a deviation can be agreed...,1.0,For smaller Projects a deviation can be agreed...


In [24]:
data.text[0]

'Supplier shall update the Documentation on a regular basis but at least:Once every half calendar year after the respective Service Commencement Date; andFollowing every update of the Services (e.g,.'

In [25]:
data.clean[0]

'Supplier shall update the Documentation on a regular basis but at least Once every half calendar year after the respective Service Commencement Date  andFollowing every update of the Services  e g  '
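
Replacing each unwanted character with a space leaves runs of spaces behind, as in `Date  andFollowing` above. A possible follow-up, sketched on a shortened, invented string, collapses them; `squeeze_spaces` is a hypothetical helper, not part of the original notebook:

```python
import re

def squeeze_spaces(text):
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

raw = "Commencement Date; and every update of the Services (e.g,."
no_punct = re.sub(r"[^a-zA-Z#]", " ", raw)   # same pattern as the cell above
tidy = squeeze_spaces(no_punct)

print(repr(no_punct))
print(repr(tidy))
```

The split-and-join step in section 3.3 has the same effect, so this only matters if the column is used before that step.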

### 3.3)-Text Normalization

In [26]:
tokenized_text = data['clean'].apply(lambda x: x.split()) # tokenizing 
tokenized_text.head()

0    [Supplier, shall, update, the, Documentation, ...
1    [major, release, upgrades, of, Software, chang...
2    [Accept, incident, severity, as, set, by, E, O...
3    [Supplier, shall, provide, all, tools, documen...
4    [For, smaller, Projects, a, deviation, can, be...
Name: clean, dtype: object

In [27]:
print(tokenized_text[0])

['Supplier', 'shall', 'update', 'the', 'Documentation', 'on', 'a', 'regular', 'basis', 'but', 'at', 'least', 'Once', 'every', 'half', 'calendar', 'year', 'after', 'the', 'respective', 'Service', 'Commencement', 'Date', 'andFollowing', 'every', 'update', 'of', 'the', 'Services', 'e', 'g']


In [28]:
print(tokenized_text[1])

['major', 'release', 'upgrades', 'of', 'Software', 'change', 'of', 'Equipment', 'implementation', 'of', 'Improvements']


In [29]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/hassansherwani/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [30]:
from nltk.stem import WordNetLemmatizer
lemm = WordNetLemmatizer()

tokenized_text = tokenized_text.apply(lambda x: [lemm.lemmatize(i) for i in x])

In [31]:
print(tokenized_text[0])

['Supplier', 'shall', 'update', 'the', 'Documentation', 'on', 'a', 'regular', 'basis', 'but', 'at', 'least', 'Once', 'every', 'half', 'calendar', 'year', 'after', 'the', 'respective', 'Service', 'Commencement', 'Date', 'andFollowing', 'every', 'update', 'of', 'the', 'Services', 'e', 'g']


In [32]:
print(tokenized_text[5])

['Supplier', 'shall', 'provide', 'any', 'hardware', 'or', 'testing', 'environment', 'for', 'the', 'testing', 'of', 'the', 'work']


In [33]:
# Stitch the tokens back together into one string per row.
data['clean'] = tokenized_text.apply(' '.join)

In [34]:
data.text[3]

'Supplier shall provide all tools, documentation and other material reasonably necessary for E.ON to conduct the Acceptance Test at least ten () Business Days’ prior to the commencement of the Acceptance Test, together with to the extent reasonable a list of the Project Services and works to be accepted and the related documentation.'

In [35]:
data.clean[3]

'Supplier shall provide all tool documentation and other material reasonably necessary for E ON to conduct the Acceptance Test at least ten Business Days prior to the commencement of the Acceptance Test together with to the extent reasonable a list of the Project Services and work to be accepted and the related documentation'

In [36]:
data.head()

Unnamed: 0,text,class,clean
0,Supplier shall update the Documentation on a r...,1.0,Supplier shall update the Documentation on a r...
1,"major release upgrades of Software, change of ...",1.0,major release upgrade of Software change of Eq...
2,Accept incident severity as set by E.ON Servic...,1.0,Accept incident severity a set by E ON Service...
3,"Supplier shall provide all tools, documentatio...",1.0,Supplier shall provide all tool documentation ...
4,For smaller Projects a deviation can be agreed...,1.0,For smaller Projects a deviation can be agreed...


### All in one quick command

In [37]:
from nltk.corpus import stopwords
# Note: stopwords.words('english') needs the NLTK 'stopwords' corpus (nltk.download('stopwords')).
lemm = WordNetLemmatizer()
stopword_set = set(stopwords.words('english'))  # build once, outside the loop
clean_text = []  # cleaned documents, one per row
for raw_text in data['text']:
    processed_text = re.sub('[^a-zA-Z]', ' ', raw_text)  # keep letters only
    processed_text = processed_text.lower().split()      # lowercase, then tokenize
    # lemmatize each token and drop English stopwords
    processed_text = [lemm.lemmatize(word) for word in processed_text if word not in stopword_set]
    clean_text.append(' '.join(processed_text))

In [38]:
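# Intended to remove numbers; the regex above has already stripped every digit,
# so this filter leaves the list unchanged.
clean_text = [word for word in clean_text if not word.isnumeric()]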
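
The list comprehension in this cell iterates over whole documents, not words, so it could only ever drop an entry whose entire string is numeric. If numeric tokens do need removing (for instance with a pattern that keeps digits), the filter has to run per token. A sketch with invented sentences; `drop_numeric_tokens` is a hypothetical helper:

```python
def drop_numeric_tokens(text):
    """Remove whitespace-separated tokens that consist only of digits."""
    return " ".join(tok for tok in text.split() if not tok.isnumeric())

docs = ["within 30 days of notice", "2019", "supplier shall deliver"]

# Document-level filter: drops the whole entry "2019" and shortens the list,
# so it would no longer line up with the DataFrame rows.
doc_level = [d for d in docs if not d.isnumeric()]

# Token-level filter: keeps exactly one entry per document.
token_level = [drop_numeric_tokens(d) for d in docs]

print(doc_level)
print(token_level)
```

Keeping one output per input row matters here because the cleaned list is later assigned back to the DataFrame as a column.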

In [39]:
type(clean_text)

list

In [40]:
# convert to series to check words
text_corpus = pd.Series(clean_text)

In [41]:
data['clean2'] = text_corpus
data.head()

Unnamed: 0,text,class,clean,clean2
0,Supplier shall update the Documentation on a r...,1.0,Supplier shall update the Documentation on a r...,supplier shall update documentation regular ba...
1,"major release upgrades of Software, change of ...",1.0,major release upgrade of Software change of Eq...,major release upgrade software change equipmen...
2,Accept incident severity as set by E.ON Servic...,1.0,Accept incident severity a set by E ON Service...,accept incident severity set e service desk ce...
3,"Supplier shall provide all tools, documentatio...",1.0,Supplier shall provide all tool documentation ...,supplier shall provide tool documentation mate...
4,For smaller Projects a deviation can be agreed...,1.0,For smaller Projects a deviation can be agreed...,smaller project deviation agreed within projec...


In [42]:
data.clean2[2]

'accept incident severity set e service desk central service integrator'

In [43]:
data.clean2[0]

'supplier shall update documentation regular basis least every half calendar year respective service commencement date andfollowing every update service e g'

In [44]:
data.clean2[3]

'supplier shall provide tool documentation material reasonably necessary e conduct acceptance test least ten business day prior commencement acceptance test together extent reasonable list project service work accepted related documentation'

In [45]:
data.text[3]

'Supplier shall provide all tools, documentation and other material reasonably necessary for E.ON to conduct the Acceptance Test at least ten () Business Days’ prior to the commencement of the Acceptance Test, together with to the extent reasonable a list of the Project Services and works to be accepted and the related documentation.'

In [46]:
data.clean[3]

'Supplier shall provide all tool documentation and other material reasonably necessary for E ON to conduct the Acceptance Test at least ten Business Days prior to the commencement of the Acceptance Test together with to the extent reasonable a list of the Project Services and work to be accepted and the related documentation'

We can see that the "clean" column is cleaned better than "clean2". For example, E.ON becomes "E ON" in "clean" but only "e" in "clean2". The reason is that the second pipeline lowercases the text, so "on" is treated as a stopword and removed.
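
One way to keep the company name intact in a lowercased, stopword-filtered pipeline is to rewrite it as a single token before cleaning. The sketch below is only an illustration under stated assumptions: `protect_names`, the `eon` placeholder, and the tiny `STOPWORDS` set are invented for the example (the notebook itself uses NLTK's English stopword list).

```python
import re

# Hypothetical mini stopword list; the notebook uses stopwords.words('english').
STOPWORDS = {"as", "by", "on", "the"}

def protect_names(text):
    """Rewrite 'E.ON' as one token so punctuation stripping cannot split it."""
    return re.sub(r"E\.ON", "eon", text)

def clean(text):
    """Protect names, keep letters only, lowercase, and drop stopwords."""
    text = protect_names(text)
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    return " ".join(w for w in text.split() if w not in STOPWORDS)

result = clean("Accept incident severity as set by E.ON Service Desk")
print(result)
```

The same idea extends to other names that contain punctuation, at the cost of maintaining a list of protected terms.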

In [47]:
# Save the cleaned data (DataFrame.to_pickle needs no explicit pickle import)
data.to_pickle('file_clean.pkl')

In [48]:
data_pickle=pd.read_pickle('file_clean.pkl')
data_pickle.head()

Unnamed: 0,text,class,clean,clean2
0,Supplier shall update the Documentation on a r...,1.0,Supplier shall update the Documentation on a r...,supplier shall update documentation regular ba...
1,"major release upgrades of Software, change of ...",1.0,major release upgrade of Software change of Eq...,major release upgrade software change equipmen...
2,Accept incident severity as set by E.ON Servic...,1.0,Accept incident severity a set by E ON Service...,accept incident severity set e service desk ce...
3,"Supplier shall provide all tools, documentatio...",1.0,Supplier shall provide all tool documentation ...,supplier shall provide tool documentation mate...
4,For smaller Projects a deviation can be agreed...,1.0,For smaller Projects a deviation can be agreed...,smaller project deviation agreed within projec...
