# A61H CPC Classification

<font size="3">

#### About the Module:
<I><span style="font-family:Arial">This module contains the functions, library imports and initial text processing needed for CPC classification.</span></I>

#### Input needed: 
<I><span style="font-family:Arial">This module uses Excel files extracted from Derwent Innovation to train and test the algorithms.<br> The file named "Dataset for EPO Automation" is used to train the models, while the Excel file "Testdata_EPO Automation" contains the patents whose classes need to be identified. <br> The Excel files contain the columns: <br> Publication Number, INPADOC Family Members, INPADOC Family ID, Title (English), Title - DWPI, Title Terms - DWPI, Abstract - DWPI, Abstract (English), Abstract - DWPI Use, Abstract - DWPI Advantage, Abstract - DWPI Drawing Description, Claims (English), Description, CPC applied by clarivate, IPC Current Full, Added by (self/client). </span></I>

#### Output expected: 
<I><span style="font-family:Arial">This module will provide functions and processed text data to other modules. No output is expected by executing this module alone.</span></I>

#### Related modules: 
<I><span style="font-family:Arial">This is the first module in the list and does not call any other module. It is called by the following modules: <br> A61H0100, A61H0300, A61H0500, A61H0700, A61H0900, A61H1100, A61H1300, A61H1500, A61H1900, A61H2100, A61H2300, A61H3100, A61H3300, A61H3500, A61H3600, A61H3700, A61H3900 and final_file</span></I>

#### Who and when: 
<I><span style="font-family:Arial">Last Modified by : Nishant Chauhan</span><br>
<span style="font-family:Arial">Last Modified on : 20-July-2020</span><br>
<span style="font-family:Arial">Version no       : 2</span><br>
<span style="font-family:Arial">Developed by     : Nishant Chauhan </span><br></I></font>


In [1]:
%load_ext autoreload
%autoreload 2

## Libraries

In [2]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import re

In [3]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import LancasterStemmer
from nltk.corpus import wordnet

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
stemmer = LancasterStemmer()

## Functions

In [6]:
def processed_text(text_data):
    
    '''Process text data of the patent. Remove stop words, non-alphabetic text and lemmatize'''
    
    processed = text_data.str.lower()
    processed = processed.apply(word_tokenize)    
    processed = processed.apply(lambda x: ' '.join(term for term in x))
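    # regex=True is required on recent pandas versions, where str.replace treats the pattern literally by default
    processed = processed.str.replace(r'[0-9/-]', ' ', regex=True)    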
    processed = processed.apply(lambda x: ' '.join(term for term in x.split() if term.isalpha()))
    processed = processed.apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))
    processed = processed.apply(lambda x: ' '.join(term for term in x.split() if len(term) > 2))
    
    processed = processed.apply(lambda x: ' '.join(lemmatizer.lemmatize(term , wordnet.VERB) for term in x.split()))
    processed = processed.apply(lambda x: ' '.join(lemmatizer.lemmatize(term , wordnet.NOUN) for term in x.split()))
    processed = processed.apply(lambda x: ' '.join(lemmatizer.lemmatize(term , wordnet.ADJ) for term in x.split()))
    processed = processed.apply(lambda x: ' '.join(lemmatizer.lemmatize(term , wordnet.ADV) for term in x.split()))
    
    #processed = processed.apply(lambda x: ' '.join(stemmer.stem(term) for term in x.split()))
    
    return processed

In [7]:
def processed_text_no_lemma(text_data):
    
    '''Process text data of the patent.
       Remove stop words and non-alphabetic text, but skip lemmatization. Can be used when the program needs to run quickly.'''
    
    processed = text_data.str.lower()
    # Tokenize the lower-cased text (not the raw input), otherwise the lower-casing above is discarded
    processed = processed.apply(word_tokenize)    
    processed = processed.apply(lambda x: ' '.join(term for term in x))
    processed = processed.str.replace(r'[0-9/-]', ' ', regex=True)    
    processed = processed.apply(lambda x: ' '.join(term for term in x.split() if term.isalpha()))
    processed = processed.apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))
    processed = processed.apply(lambda x: ' '.join(term for term in x.split() if len(term) > 2))
    
    #processed = processed.apply(lambda x: ' '.join(stemmer.stem(term) for term in x.split()))
    
    return processed
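
The filtering steps in the two functions above can be traced on a single string. The sketch below is a simplified, dependency-free illustration, not the module's implementation: it splits on whitespace instead of calling `word_tokenize`, and `DEMO_STOP_WORDS` and `clean_text_demo` are hypothetical names standing in for the NLTK stop word list and the pandas pipeline.

```python
import re

# Hypothetical, tiny stop word list; the module uses NLTK's English list instead.
DEMO_STOP_WORDS = {"the", "a", "an", "of", "for", "and", "is", "with"}

def clean_text_demo(text):
    """Lower-case, blank out digits, hyphens and slashes, then keep
    alphabetic tokens that are not stop words and are longer than 2 characters."""
    text = text.lower()
    text = re.sub(r"[0-9/-]", " ", text)
    # Whitespace split: unlike word_tokenize, punctuation stays attached to words.
    tokens = text.split()
    tokens = [t for t in tokens if t.isalpha()]
    tokens = [t for t in tokens if t not in DEMO_STOP_WORDS]
    tokens = [t for t in tokens if len(t) > 2]
    return " ".join(tokens)

cleaned = clean_text_demo("A Hand-held Massage Device for the 2 legs with 3 rollers")
print(cleaned)  # hand held massage device legs rollers
```

Note that the hyphen is replaced by a space, so "hand-held" becomes two separate tokens rather than being dropped by the `isalpha` filter.
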

In [8]:
def find(s, ch):
    
    '''Return every position at which ch occurs in the sequence s. This function is used in the near_operator function'''
    
    return [i for i, ltr in enumerate(s) if ltr == ch]
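
As a quick illustration of `find`, the sketch below repeats the definition so that it runs on its own and applies it to a token list, which is how `near_operator` calls it, and to a plain string. The sample inputs are made up for the example.

```python
def find(s, ch):
    """Return the index of every element of s that equals ch."""
    return [i for i, ltr in enumerate(s) if ltr == ch]

tokens = "massage device with massage roller".split(' ')
positions = find(tokens, "massage")
print(positions)        # [0, 3]

# On a string, the elements compared are single characters.
char_positions = find("banana", "a")
print(char_positions)   # [1, 3, 5]
```
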

In [9]:
def and_operator(text, lista=()):
    
    """Check whether words are present in the textual data.
        Works like the 'AND' operator in Derwent Innovation: returns True if, for at least
        one entry of lista, every space-separated word of that entry occurs in text."""
    
    isin = False
    for string in lista:
        sub_isin = True
        for substr in string.split(' '):
            sub_isin = sub_isin and (substr in text)

        isin = isin or sub_isin
    return isin
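
A short usage sketch may help make the semantics of `and_operator` concrete. The definition is repeated so the block runs on its own, and the sample text and queries are invented for illustration. Each list entry is an AND group, and the entries are combined with OR; matching is by substring, so a word stem also matches.

```python
def and_operator(text, lista=()):
    """True if, for at least one entry, all of its space-separated words occur in text."""
    isin = False
    for string in lista:
        sub_isin = True
        for substr in string.split(' '):
            sub_isin = sub_isin and (substr in text)
        isin = isin or sub_isin
    return isin

text = "portable massage chair with vibration motor"

both_present = and_operator(text, ["massage vibration", "bath"])  # first group matches
one_missing = and_operator(text, ["massage bath"])                # "bath" is absent
stem_match = and_operator(text, ["vibrat"])                       # substring of "vibration"

print(both_present, one_missing, stem_match)  # True False True
```
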

In [10]:
def near_operator(text, list1=(), list2=(), near=5):
    
    """Check whether a word from list1 occurs within `near` words of a word from list2.
        Works like the 'NEAR' operator in Derwent Innovation. The default value of NEAR is 5."""
    
    isin = 0
    str_list = text.split(' ')
    for word in list1:
        # Positions of every occurrence of word in the token list
        for word_count in find(str_list, word):
            # Clamp the lower bound: a negative start index would count from the end of the list
            start = max(0, word_count - near)
            sub_str = ' '.join(str_list[start:word_count + near + 1])

            for second_word in list2:
                if second_word in sub_str:
                    isin = 1

    return isin
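
One detail is worth checking whenever a window of `near` words is sliced around a match: in Python, a negative start index counts from the end of the list, so a match close to the beginning of the text can yield an empty window unless the lower bound is clamped at zero. The sketch below shows the difference on an invented sentence.

```python
tokens = "massage roller for the back and the neck of a user".split(' ')
position, near = 1, 5   # "roller" is token 1

# The start index is 1 - 5 = -4, which Python interprets as len(tokens) - 4 = 7.
unclamped = tokens[position - near:position + near + 1]

# Clamping at 0 keeps the window anchored at the start of the text.
clamped = tokens[max(0, position - near):position + near + 1]

print(unclamped)  # []
print(clamped)    # ['massage', 'roller', 'for', 'the', 'back', 'and', 'the']
```
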

In [11]:
def ssto(text,list1=[]):
    
    """Check whether any of the phrases in list1 occurs verbatim in the textual data.
        Works like a double-quoted (exact phrase) search in Derwent Innovation."""
    
    isin = 0
    for word in list1:
        if word in text:
            isin = 1
    
    return isin
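
The difference between `ssto` and an AND-style check is that the phrase has to appear as one contiguous string. A minimal sketch with made-up inputs, repeating the definition so it runs on its own:

```python
def ssto(text, list1=()):
    """Return 1 if any phrase in list1 occurs verbatim in text, else 0."""
    isin = 0
    for word in list1:
        if word in text:
            isin = 1
    return isin

text = "foot massage bath"

adjacent = ssto(text, ["foot massage", "hand massage"])  # "foot massage" is contiguous
split_up = ssto(text, ["foot bath"])                     # both words occur, but not side by side

print(adjacent, split_up)  # 1 0
```
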

In [12]:
def train_model(X_train_dtm, y_train):
    
    '''Train a soft-voting ensemble (Logistic Regression, Random Forest, Multinomial Naive Bayes and SVM) and cross-validate it.'''
    
    clf1 = LogisticRegression(solver='liblinear',random_state=0)
    clf2 = RandomForestClassifier(n_estimators=50, random_state=0)
    clf3 = MultinomialNB()
    clf4 = svm.SVC(kernel='linear', C=1, random_state=0, probability = True)
    
    eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3), ('svm',clf4)], voting='soft')
    eclf = eclf.fit(X_train_dtm, y_train)
    
    scores = cross_val_score(eclf, X_train_dtm, y_train, cv=5)
    
    return eclf, scores
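
With `voting='soft'`, the ensemble averages the class probabilities predicted by its members and picks the class with the highest mean, which is why the SVM needs `probability = True`. The sketch below illustrates the idea in plain Python with hypothetical probabilities for a single sample; `soft_vote` and `hard_vote` are illustrative helpers, not scikit-learn functions.

```python
from collections import Counter

# Hypothetical [P(class 0), P(class 1)] from four classifiers for one sample.
probas = [
    [0.90, 0.10],
    [0.45, 0.55],
    [0.45, 0.55],
    [0.40, 0.60],
]

def soft_vote(probas):
    """Average the probabilities per class and return the class with the highest mean."""
    n_classes = len(probas[0])
    mean = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: mean[c])

def hard_vote(probas):
    """Let every classifier vote for its most likely class and return the majority."""
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in probas]
    return Counter(votes).most_common(1)[0][0]

soft = soft_vote(probas)
hard = hard_vote(probas)
print(soft, hard)  # 0 1
```

Here one very confident classifier outweighs three lukewarm ones under soft voting, while a majority vote would go the other way.
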

In [13]:
def predict_values(eclf, y_test, X_test_dtm):
    
    '''This function will provide predicted class and its accuracy using ensemble algorithm'''
    
    y_pred_class = eclf.predict(X_test_dtm)
    accuracy = metrics.accuracy_score(y_test, y_pred_class)
    matrix = metrics.confusion_matrix(y_test, y_pred_class)
    matrix_df = pd.DataFrame(matrix, columns = [['predicted', 'predicted'], ['Non-Relevant', 'Relevant']],
                             index = [['actual', 'actual'], ['Non-Relevant', 'Relevant']])
    
    return accuracy,matrix_df, y_pred_class

In [14]:
def precision_recall_scores(df, predicted, actual):
    
    ''' This function will generate accuracy score and confusion matrix'''
    
    nr00 = len(df[(df[predicted] == 0) & (df[actual] == 0)])
    nr01 = len(df[(df[predicted] == 1) & (df[actual] == 0)])
    nr10 = len(df[(df[predicted] == 0) & (df[actual] == 1)])
    nr11 = len(df[(df[predicted] == 1) & (df[actual] == 1)])
    
    accuracy = (1- (nr01+nr10)/(nr00+nr01+nr10+nr11))*100
    score = pd.DataFrame([(nr00,nr01),(nr10,nr11)],
                        index = [['actual', 'actual'], ['Non-Relevant', 'Relevant']],
                        columns = [['predicted', 'predicted'], ['Non-Relevant', 'Relevant']])
    return score, accuracy
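
The accuracy formula above can be checked by hand. In the sketch below the four counts are invented; following the definitions in the function, `nr01` counts false positives (predicted 1, actual 0) and `nr10` counts false negatives (predicted 0, actual 1). Precision and recall are not computed by the function itself and are added here only as an assumption about what a reader might want alongside accuracy.

```python
# Hypothetical confusion-matrix counts.
nr00, nr01, nr10, nr11 = 50, 5, 10, 35

total = nr00 + nr01 + nr10 + nr11
accuracy = (1 - (nr01 + nr10) / total) * 100   # share of correct predictions, in percent

precision = nr11 / (nr11 + nr01)   # of the patents predicted relevant, how many are relevant
recall = nr11 / (nr11 + nr10)      # of the relevant patents, how many were found

print(round(accuracy, 2), round(precision, 3), round(recall, 3))
```
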

In [15]:
def to_get_final_output(y_test, y_pred_class, X_test, df, cpc_class):
    
    '''This function will provide formatted dataframe of final predicted values of cpc classes'''
    
    combined_value = np.array([y_test, y_pred_class, X_test])
    df_predicted = pd.DataFrame(combined_value.T, columns= ['Actual','Predicted', 'Text_Data'], index=X_test.index)   
    select_text = df_predicted[(df_predicted['Predicted'] == 0) & (df_predicted['Actual'] == 1)]['Text_Data']
    
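    # Note: is_in_text is not defined in this module as shown; it has to exist in the namespace where this function runs.
    for text in select_text.index:
        check = is_in_text(select_text[text],cpc_class)
        if check:
            # Use .loc instead of chained indexing so the assignment reliably modifies df_predicted
            df_predicted.loc[text, 'Predicted'] = 1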
    
    intermediate_df = pd.concat([df, df_predicted], axis=1, join='inner')
    
    return intermediate_df

In [16]:
def final_scores(processed_df,CPCclass):
    
    ''' This function will generate final table, accuracy score and confusion matrix of final product'''
    
    nr00 = len(processed_df[(processed_df['Predicted'] == 0) & (processed_df['Actual'] == 0)])
    nr01 = len(processed_df[(processed_df['Predicted'] == 1) & (processed_df['Actual'] == 0)])
    nr10 = len(processed_df[(processed_df['Predicted'] == 0) & (processed_df['Actual'] == 1)])
    nr11 = len(processed_df[(processed_df['Predicted'] == 1) & (processed_df['Actual'] == 1)])
    accuracy = (1- (nr01+nr10)/(nr00+nr01+nr10+nr11))*100
    
    final_df = processed_df[['number', 'title', 'Predicted']].copy()
    final_df = final_df.rename(columns={'Predicted' : CPCclass})
    
    score_df = pd.DataFrame([(nr00,nr01),(nr10,nr11)],
                        index = [['actual', 'actual'], ['Non-Relevant', 'Relevant']],
                        columns = [['predicted', 'predicted'], ['Non-Relevant', 'Relevant']])
    
    return final_df, score_df, accuracy

## Training data processing

In [17]:
df_initial = pd.read_excel(r'C:\Users\u6033331\EPO dataset.xlsx')

df_initial.fillna('na', inplace=True)

df_initial['title'] = (df_initial['Title (English)'] + ' ' + 
                       df_initial['Title - DWPI'] + ' ' + 
                       df_initial['Title Terms - DWPI'])

df_initial['abstract'] = (df_initial['Abstract - DWPI'] + ' ' + 
                          df_initial['Abstract (English)'] + ' ' + 
                          df_initial['Abstract - DWPI Use'] + ' ' + 
                          df_initial['Abstract - DWPI Advantage'] + ' ' + 
                          df_initial['Abstract - DWPI Drawing Description'])


In [18]:
cols = ['Publication Number', 
        'title', 
        'abstract',
        'INPADOC Family Members',
        'INPADOC Family ID', 
        'Claims (English)', 
        'Description', 
        'CPC applied by clarivate', 
        'IPC Current Full', 
        'Added by'
        ]

# Take an explicit copy so that later column assignments do not act on a view of df_initial
df = df_initial[cols].copy()

In [19]:
df = df.rename(columns={'INPADOC Family Members': 'family_member', 
                        'INPADOC Family ID': 'family_ID',
                        'Claims (English)':'claim',
                        'Description': 'desc', 
                        'CPC applied by clarivate': 'CPC',
                        'IPC Current Full': 'IPC',
                        'Publication Number': 'number'})


In [20]:
processed_title = processed_text(df['title'])

In [21]:
processed_abstract = processed_text(df['abstract'])

In [22]:
processed_claim = processed_text(df['claim'])

In [23]:
processed_desc = processed_text(df['desc'])

In [24]:
df['title'] = processed_title
df['abstract'] = processed_abstract
df['claim'] = processed_claim
df['desc'] = processed_desc

## Test Data Processing

In [25]:
df_test_initial = pd.read_excel(r'C:\Users\u6033331\Testdata_EPO Automation.xlsx')

df_test_initial.fillna('na', inplace=True)

df_test_initial['title'] = (df_test_initial['Title (English)'] + ' ' + 
                       df_test_initial['Title - DWPI'] + ' ' + 
                       df_test_initial['Title Terms - DWPI'])

df_test_initial['abstract'] = (df_test_initial['Abstract - DWPI'] + ' ' + 
                          df_test_initial['Abstract (English)'] + ' ' + 
                          df_test_initial['Abstract - DWPI Use'] + ' ' + 
                          df_test_initial['Abstract - DWPI Advantage'] + ' ' + 
                          df_test_initial['Abstract - DWPI Drawing Description'])

In [26]:
cols_test = ['Publication Number', 
        'title', 
        'abstract',
        'INPADOC Family Members',
        'INPADOC Family ID', 
        'Claims (English)', 
        'Description']

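# Take an explicit copy so that later column assignments do not act on a view of df_test_initial
df_test = df_test_initial[cols_test].copy()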

In [27]:
df_test = df_test.rename(columns={'INPADOC Family Members': 'family_member', 
                        'INPADOC Family ID': 'family_ID',
                        'Claims (English)':'claim',
                        'Description': 'desc', 
                        'Publication Number': 'number'})


In [28]:
processed_title_test = processed_text(df_test['title'])
processed_abstract_test = processed_text(df_test['abstract'])
processed_claim_test = processed_text(df_test['claim'])
processed_desc_test = processed_text(df_test['desc'])

In [29]:
df_test['title'] = processed_title_test
df_test['abstract'] = processed_abstract_test
df_test['claim'] = processed_claim_test
df_test['desc'] = processed_desc_test

## Model Training

In [30]:
X_title = df['title']
X_abstract = df['abstract']

X_test_title = df_test['title']
X_test_abstract = df_test['abstract']

X_tab = df_test['title'] + ' ' + df_test['abstract']
X_ctb = df_test['title'] + ' ' + df_test['abstract'] + ' ' + df_test['claim']
X_all = df_test['title'] + ' ' + df_test['abstract'] + ' ' + df_test['claim'] + ' ' + df_test['desc']

In [31]:
vect_title = TfidfVectorizer(stop_words = 'english', max_features=2500)

X_title_dtm = vect_title.fit_transform(X_title)
X_title_test_dtm = vect_title.transform(X_test_title)

In [32]:
vect_abstract = TfidfVectorizer(stop_words = 'english', max_features=2500)

X_abstract_dtm = vect_abstract.fit_transform(X_abstract)
X_abstract_test_dtm = vect_abstract.transform(X_test_abstract)
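
The split between `fit_transform` on the training text and `transform` on the test text matters because the vocabulary is learned from the training data only; words that appear only in the test data are ignored. A minimal sketch with invented documents, assuming scikit-learn is installed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented training and test documents.
train_docs = ["massage chair vibration", "leg bath massage"]
test_docs = ["massage glove heating"]

vect = TfidfVectorizer(stop_words='english', max_features=2500)

# fit_transform learns the vocabulary and IDF weights from the training documents.
X_train_dtm = vect.fit_transform(train_docs)

# transform reuses that vocabulary; "glove" and "heating" were never seen, so they are dropped.
X_test_dtm = vect.transform(test_docs)

vocabulary = sorted(vect.vocabulary_)
print(vocabulary)        # ['bath', 'chair', 'leg', 'massage', 'vibration']
print(X_test_dtm.shape)  # (1, 5)
print(X_test_dtm.nnz)    # 1
```
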

In [33]:
# Abstract text encoded with the title vocabulary (vect_title), not with vect_abstract
X_tab_dtm = vect_title.transform(X_abstract)
X_tab_test_dtm = vect_title.transform(X_test_abstract)


In [34]:
from datetime import datetime
dt_string = datetime.now().strftime("%d/%b/%Y - %H:%M %p")
print("Module Initial Processing is successfully loaded on",dt_string)

Module Initial Processing is successfully loaded on 03/Aug/2020 - 15:15 PM
