<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-package" data-toc-modified-id="Import-package-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import package</a></span></li><li><span><a href="#Load-CSV" data-toc-modified-id="Load-CSV-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load CSV</a></span><ul class="toc-item"><li><span><a href="#Text-preprocessing" data-toc-modified-id="Text-preprocessing-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Text preprocessing</a></span></li></ul></li><li><span><a href="#Save-data" data-toc-modified-id="Save-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Save data</a></span></li></ul></div>

# Import package

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Load CSV

In [2]:
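def read_data(dataset_path):
    """Load a comma-separated dataset into a DataFrame."""
    df = pd.read_csv(dataset_path, sep=',')
    # Optionally keep only the text and label columns:
    # df = df[['rca_content', 'root_cause_category']]
    return df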

In [3]:
train_path = 'data_generated_ml/p1Rca.csv'
test_path = 'data_generated_ml/p1Rca_test.csv'

train_data = read_data(train_path)
test_data = read_data(test_path)


## Text preprocessing

In [4]:
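import re
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# The cells below also need the stopwords and wordnet corpora
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

from nltk.corpus import stopwords
from bs4 import BeautifulSoup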

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\i0438403\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\i0438403\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [5]:
def apply_regex(text):
    # Strip HTML markup and keep only the visible text
    text = BeautifulSoup(text, "lxml").text

    text = text.lower()

    # Windows folder paths (drive letter, colon, backslash)
    text = re.sub(r'[a-z]:\\\S+', r' ', text)
    # Non-ASCII characters
    text = re.sub(r'[^\x00-\x7F]+', r' ', text)
    # Punctuation
    text = re.sub(r',|\.|\(|\)|\:|\-|\_|\&|\?|\*|\>|\<', r' ', text)
    # Numbers
    text = re.sub(r'\d+', r' ', text)
    # Single letters surrounded by spaces
    text = re.sub(r' [a-z] ', r' ', text)
    # NLTK English stopwords
    text = re.sub(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*', r' ', text)

    return text
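
The regex steps can be tried in isolation with the standard library alone. The sketch below is a minimal approximation, not the notebook's function: it skips the BeautifulSoup step, and `STOPWORDS` is a hypothetical five-word stand-in for NLTK's English list. The sample sentence is made up.

```python
import re

# Hypothetical mini stop-word list; the notebook uses nltk's English list instead.
STOPWORDS = ["the", "is", "on", "in", "a"]
STOPWORD_RE = re.compile(r'\b(' + r'|'.join(map(re.escape, STOPWORDS)) + r')\b\s*')

def regex_steps(text):
    """Apply the same substitutions as apply_regex, minus the HTML stripping."""
    text = text.lower()
    text = re.sub(r'[a-z]:\\\S+', ' ', text)        # Windows folder paths
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)      # non-ASCII characters
    text = re.sub(r',|\.|\(|\)|\:|\-|\_|\&|\?|\*|\>|\<', ' ', text)  # punctuation
    text = re.sub(r'\d+', ' ', text)                # numbers
    text = re.sub(r' [a-z] ', ' ', text)            # single letters
    text = STOPWORD_RE.sub(' ', text)               # stop words
    return text

sample = r"The log is in C:\temp\app.log (error 42)."
# Collapse the runs of spaces left behind by the substitutions
cleaned = " ".join(regex_steps(sample).split())
print(cleaned)
```

Each substitution inserts a space rather than an empty string, so the raw result contains runs of whitespace; in the notebook the tokenizer in the next cell absorbs them.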

In [6]:
def apply_tokenize(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) <= 2:
                continue
            tokens.append(word.lower())
    return tokens

In [7]:
# Lemmatize with POS tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(word):
    """Map the word's POS tag to the WordNet constant that lemmatize() accepts"""
    # First letter of the Penn Treebank tag: J (adjective), N, V, R (adverb)
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    # Fall back to noun for any other tag
    return tag_dict.get(tag, wordnet.NOUN)

def apply_lemmatize(text):
    """Lemmatize a list of tokens, tagging each token on its own"""
    lemmatizer = WordNetLemmatizer()
    text = [lemmatizer.lemmatize(t, get_wordnet_pos(t)) for t in text]
    return text

In [8]:
def apply_stopwordsLocal(text, stopw):
    text =  [t for t in text if t not in stopw]
    return text

In [9]:
stopwordsLocal = ['inc', 'msc', 'net', 'investigation', 'detail', 'details', 'action',
                  'performed', 'date', 'draft', 'rca', 'root',
                  'cause', 'findings', 'investigated', 'by']


In [23]:
def cleanText(text):
    text = apply_regex(text)
    text = apply_tokenize(text)
    text = apply_lemmatize(text)
    text = apply_stopwordsLocal(text, stopwordsLocal)
    text = ' '.join(map(str, text))
    return text
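
Because `cleanText` applies `stopwordsLocal` after lemmatization, each entry is compared against lemmas, not against the original word forms. The sketch below uses made-up tokens and two hypothetical stop-word sets to illustrate the effect: an inflected entry such as `'performed'` does not remove the lemma `'perform'`, which may explain why `perform` and `finding` still appear in the cleaned column.

```python
# Tokens as they might look after lemmatization (illustrative, not real data)
lemmatized = ["perform", "oracle", "team", "reveal", "issue", "detail"]

# Hypothetical stop-word sets: one with inflected forms, one with lemmas
inflected_list = {"performed", "details", "detail"}
lemma_list = {"perform", "detail"}

def drop_stopwords(tokens, stopw):
    """Keep the tokens that are not in the stop-word collection."""
    return [t for t in tokens if t not in stopw]

kept_inflected = drop_stopwords(lemmatized, inflected_list)
kept_lemma = drop_stopwords(lemmatized, lemma_list)

print(kept_inflected)  # 'perform' survives: only 'performed' is listed
print(kept_lemma)      # 'perform' is removed
```

Listing lemma forms, or filtering before lemmatization, are two ways to make the list behave as intended; which one fits depends on the downstream model.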

In [24]:
train_data['rca_content_nlp'] = train_data['rca_content'].apply(cleanText)
test_data['rca_content_nlp'] = test_data['rca_content'].apply(cleanText)
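
If `rca_content` ever contains missing values, `apply` would hand them to `cleanText` as `NaN` floats instead of strings. Whether that happens in this dataset is not checked here, so the following is only a defensive sketch: `clean_or_empty` is a hypothetical helper, and `str.lower` stands in for `cleanText` so the block runs on its own.

```python
import math

def clean_or_empty(value, cleaner):
    """Apply `cleaner` to text values; return '' for None or NaN (hypothetical helper)."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ''
    return cleaner(str(value))

# Stand-in cleaner for the demo; the notebook would pass cleanText instead.
demo_cleaner = str.lower
results = [clean_or_empty(v, demo_cleaner) for v in ["Disk FULL", float("nan"), None]]
print(results)
```

In the notebook this could be used as `train_data['rca_content'].apply(lambda v: clean_or_empty(v, cleanText))`.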

In [25]:
train_data

| Unnamed: 0 | problem_id | Number | rca_content | root_cause | root_cause_type | root_cause_category | rca_content_nlp |
|---|---|---|---|---|---|---|---|
| 0 | ITS-PRB0002883 | ITS-INC2525678 | Siemens Ticket has been open (IR 9370883) : ... | Server failure | Hardware | Ressources unavailability | siemens ticket open problem communication orac... |
| 1 | ITS-PRB0006202 | ITS-INC2001737 | The PO files are missing in the IDEA path in t... | Missing Setting/Wrong setting | Configuration Issue | Change Control | file miss idea path server luzsappidap miss fi... |
| 2 | ITS-PRB0006221 | ITS-INC2006583 | Draft RCA: Infor EAM could not communicate to... | Archive log space full | Resource Issue | Ressources unavailability | infor eam could communicate oracle server orac... |
| 3 | ITS-PRB0006240 | ITS-INC2012930 | Root Cause - There were faulty hardware comp... | server crash | Hardware | Ressources unavailability | faulty hardware component sfp cable vcm perfor... |
| 4 | ITS-PRB0006294 | ITS-INC2040279 | * analysis by Support-Scallog: Indeed, res... | Disk space too high | Resource Issue | Ressources unavailability | analysis support scallog indeed restart applic... |
| 5 | ITS-PRB0006298 | ITS-INC2040201 | slots were blocked on 09/01/2019 at 5:36 pm ... | Wrong application usage | Human Error | Usage | slot block perform tray release unlocked |
| 6 | ITS-PRB0006322 | ITS-INC2047668 | The root cause of the problem with moving fi... | Wrong Job schedule | Scheduling | Scheduling | problem move file directory /xfp/sanofi/etudes... |
| 7 | ITS-PRB0006350 | ITS-INC2069839 | Geode+ team Confirmed document (0901ec618be196... | Infrastructure Obsolescence | Outdated Component | obSolescence | geode+ team confirm document xml publish end a... |
| 8 | ITS-PRB0006378 | ITS-INC2076909 | Unable to go on consumption in PL5 for TRPG p... | Root Cause Unidentified | Cause Unidentified | Investigation failed | unable consumption trpg plant waighing idoc in... |
| 9 | ITS-PRB0006381 | ITS-INC2049973 | Due to change implemented by MES which caused... | regression from application planned change | Application | Change Control | due change implement me issue incorrect quanti... |


In [26]:
test_data

| Unnamed: 0 | Number | problem_id | rca_content | root_cause | root_cause_type | root_cause_category | rca_content_nlp |
|---|---|---|---|---|---|---|---|
| 0 | ITS-INC3648484 | ITS-PRB0009342 | * Primary issue is on XSPW10W182S as the pri... | Root Cause Unidentified | Cause Unidentified | Investigation failed | primary issue xspw print service start automat... |
| 1 | ITS-INC3644991 | ITS-PRB0009712 | Cluster service failed for unknown reason at 5... | Infrastructure Obsolescence | Outdated Component | obSolescence | cluster service fail unknown reason cluster ru... |
| 2 | ITS-INC3654989 | ITS-PRB0009718 | Investigation performed by Oracle team reveale... | Infrastructure Obsolescence | Outdated Component | obSolescence | perform oracle team reveal issue disc inaccess... |
| 3 | ITS-INC3676684 | ITS-PRB0009745 | Master applications use LDAP web service URL f... | Failure on Server | Hardware | Ressources unavailability | master application use ldap web service url us... |
| 4 | ITS-INC3696405 | ITS-PRB0009758 | Investigation Details: Why Investigation F... | Standard Editor Defect | Change Control | Change Control | finding investigate production case serial go ... |
| 5 | ITS-INC3735137 | ITS-PRB0009862 | Root Cause: \t VTOM agent was down as File Sy... | Network issue | Hardware | Ressources unavailability | vtom agent file system /products/exploit vtom ... |
| 6 | ITS-INC3758732 | ITS-PRB0009868 | Investigation Details: Why Investigation F... | Application Obsolescence | Outdated Component | obSolescence | finding investigate production work order 'rr ... |
| 7 | ITS-INC3762244 | ITS-PRB0009879 | The issue comes from a problem with WebService... | regression from application change | Application | Change Control | issue come problem webservice use sap phenix t... |
| 8 | ITS-INC3800514 | ITS-PRB0009940 | * Issue is caused because of Program Error. ... | regression from application change | Application | Change Control | issue program error part demand ritm change ma... |
| 9 | ITS-INC3806935 | ITS-PRB0009958 | Root Cause: The USERS tablespace on databas... | Overload activity on server | Resource Issue | Ressources unavailability | user tablespace database suzxibp fill till due... |


# Save data

In [27]:
train_data.to_csv('data_generated_ml/p1Rca_nlp.csv')
test_data.to_csv('data_generated_ml/p1Rca_test_nlp.csv')
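
One detail worth knowing when saving: `to_csv` writes the row index as an extra unnamed column by default, and `read_csv` then loads it back as `Unnamed: 0`. The sketch below shows the round trip on a tiny made-up frame using in-memory buffers; the column values are illustrative only, and whether to pass `index=False` here depends on whether later notebooks rely on that column.

```python
import io
import pandas as pd

# Tiny illustrative frame; the values are made up
df = pd.DataFrame({"Number": ["A1", "B2"],
                   "rca_content_nlp": ["disk full", "job fail"]})

# Default behaviour: the index is written as a first, unnamed column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# index=False keeps only the real columns
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```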