# Predict the Cooperative Patent Class (CPC) by means of NLP
> The application of data science techniques to patent analysis is growing rapidly, thanks to the availability of large quantities of data. Data science leverages these data to create better business intelligence and to facilitate decision-making processes. This post proposes a machine-learning-driven classification of patent publications based on a predictive model trained on text data. It finds that a LinearSVC achieves the best results, with an accuracy of 67%.

- toc: true 
- badges: true
- hide_binder_badge: true
- comments: true
- categories: [NLP, Multi-class text classification, Intellectual property, Patent data, Text Vectorization models, Bag of Words model, Multinomial Naive Bayes model, Multi-model selection]

## Import of packages


In [None]:
import pandas as pd # data analysis
from io import StringIO 
import matplotlib.pyplot as plt 
import numpy as np
import seaborn as sns
%matplotlib inline 
sns.set(color_codes=True)

## Load the data
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/DSBA Patent Project /Paper /EDA & NLP /patbase_export_274777127.csv')

col = ['Cooperative Patent Class', '1st Main Claim'] # we only use these two columns
df = df[col]
df = df.dropna()

The patent-level data come from PatBase, a product offered by MineSoft, a patent solutions provider founded in 1996 that offers online products and services such as patent research, monitoring, and analysis, as well as other intellectual property services. The only two features needed in this prediction exercise are the Cooperative Patent Class and the text of the 1st Main Claim.

## Making CPC label
We can see that each patent publication, i.e., each row, has multiple CPC labels. For convenience, we only use the first label in order of appearance, reduced to its first character (the CPC section letter).

In [None]:
df['cpc'] = df['Cooperative Patent Class'].str.extract(r'(^.{0,1})')
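
To make the effect of the pattern concrete, here is a minimal sketch on made-up values. The sample strings and the `;` separator are assumptions for illustration; the real PatBase export may format this column differently. `expand=False` is used here only so that the result is a Series.

```python
import pandas as pd

# Hypothetical sample values; not taken from the PatBase export.
sample = pd.DataFrame({
    "Cooperative Patent Class": [
        "A61B 5/0205; G16H 40/63",
        "H04L 9/3247",
        "G06F 16/245; H04L 67/10",
    ]
})

# Same pattern as above: capture at most one character at the start of the string.
sample["cpc"] = sample["Cooperative Patent Class"].str.extract(r"(^.{0,1})", expand=False)

print(sample["cpc"].tolist())
```

Only the section letter of the first listed label survives, so every later label in the same row is ignored by the classifier.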

## RegEx preprocessing

In [None]:
df['main_claim'] = df['1st Main Claim'].str.replace(r'\[EN\]\s1.\s', '', regex=True) # strip the leading "[EN] 1. " prefix

df['main_claim'] = df['main_claim'].str.lower()

df['main_claim'] = df['main_claim'].str.replace(r'\d+', '', regex=True) # remove digits (regex=True is required in pandas >= 2.0)
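
The three cleaning steps can be checked on a single string. This is a small sketch with an invented claim text; only the `[EN] 1. ` prefix mirrors the pattern used above.

```python
import pandas as pd

# Hypothetical claim text, for illustration only.
claims = pd.Series(["[EN] 1. A device comprising 2 sensors and 10 Valves."])

cleaned = (
    claims
    .str.replace(r"\[EN\]\s1.\s", "", regex=True)  # drop the language tag and claim number
    .str.lower()                                    # normalise case
    .str.replace(r"\d+", "", regex=True)            # remove runs of digits
)

print(repr(cleaned.iloc[0]))
```

Removing digits leaves double spaces behind; this is harmless here because the vectorizers used later split on word boundaries.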



## Building word count vectors with scikit-learn

In [None]:
# Import the necessary modules
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

In [None]:
# Create a series to store the labels: y
y = df['cpc']

In [None]:
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['main_claim'],y,test_size=0.33,random_state=53)

### CountVectorizer for text classification

In [None]:
# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words='english')

# Fit the vectorizer on the training claims and transform them: count_train 
count_train = count_vectorizer.fit_transform(X_train)

# Transform the test claims with the fitted vocabulary: count_test 
count_test = count_vectorizer.transform(X_test)
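
The difference between `fit_transform` and `transform` is easier to see on a toy corpus. The documents below are invented for illustration and are not part of the patent data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the cleaned claims (illustrative only).
train_docs = ["a device comprising a sensor", "a method comprising heating a metal"]
test_docs = ["a sensor for heating water"]

vec = CountVectorizer(stop_words="english")
train_counts = vec.fit_transform(train_docs)  # learns the vocabulary from the training text only
test_counts = vec.transform(test_docs)        # reuses that vocabulary; unseen words are dropped

print(sorted(vec.vocabulary_))
print(test_counts.toarray())
```

The word "water" never appears in the training text, so it has no column in the test matrix; the vocabulary is fixed at fit time.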

### TfidfVectorizer for text classification

In [None]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english',
                                   sublinear_tf= True, 
                                   min_df=5, 
                                   ngram_range= (1,2),
                                   norm='l2', 
                                   encoding='latin-1')
 
# Transform the training data: tfidf_train 
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
 
# Transform the test data: tfidf_test 
tfidf_test = tfidf_vectorizer.transform(X_test)

## Naive Bayes: Text classification model

### Training and testing the labelling model with CountVectorizer

In [None]:
# Import the necessary modules
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB(alpha = 2.6)
 
# Fit the classifier to the training data
nb_classifier.fit(count_train, y_train)
 
# Create the predicted tags: predct
predct = nb_classifier.predict(count_test)

In [None]:
# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test,predct)
print(score)

0.6304475278483487


### Training and testing the labelling model with TfidfVectorizer

In [None]:
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(tfidf_train, y_train)
 
# Create the predicted tags: predtf
predtf = nb_classifier.predict(tfidf_test)

In [None]:
# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test,predtf)
print(score)

## Attempts at improving the model

### Experimenting with different alpha parameters for the Multinomial Naive Bayes model

In [None]:
# Create the list of alphas: alphas
alphas = np.arange(2,3,0.1)
 
# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha)
    # Fit to the training data
    nb_classifier.fit(count_train,y_train)
    # Predict the labels: pred
    pred = nb_classifier.predict(count_test)
    # Compute accuracy: score
    score = metrics.accuracy_score(y_test,pred)
    return score
 
# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()

Alpha:  2.0
Score:  0.6294703928082861

Alpha:  2.1
Score:  0.6294703928082861

Alpha:  2.2
Score:  0.6296658198162987

Alpha:  2.3000000000000003
Score:  0.6302521008403361

Alpha:  2.4000000000000004
Score:  0.6306429548563611

Alpha:  2.5000000000000004
Score:  0.6306429548563611

Alpha:  2.6000000000000005
Score:  0.6304475278483487

Alpha:  2.7000000000000006
Score:  0.6294703928082861

Alpha:  2.8000000000000007
Score:  0.6300566738323237

Alpha:  2.900000000000001
Score:  0.6290795387922611
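
The loop above scores every alpha on the test set, so the selected value is partly tuned to the test data. One possible alternative, which is not what this post does, is to pick alpha by cross-validation on the training data only. A minimal sketch with `GridSearchCV`, using synthetic counts (`X_demo`, `y_demo` are hypothetical stand-ins for `count_train` and `y_train`):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic count matrix and labels, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.poisson(lam=1.0, size=(200, 30))
y_demo = rng.integers(0, 3, size=200)

# Same alpha grid as the loop above.
param_grid = {"alpha": np.arange(2, 3, 0.1)}

# Each alpha is scored by 5-fold cross-validation on the training data.
search = GridSearchCV(MultinomialNB(), param_grid, scoring="accuracy", cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

With this setup the test set is touched only once, to report the accuracy of the final model.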



### Inspecting count vectorizing Multinomial Naive Bayes model

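The result of the inspection is not immediately clear to me: the feature lists printed for classes B and G are identical. This happens because `feat_with_weights` is built from the first row of the coefficient array only, i.e. the row for the first class, so printing it next to another class label does not change its contents. The first print differs only because it shows the 20 lowest-weighted entries, whereas the later ones show the 20 highest-weighted.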

In [None]:
# Get the class labels: class_labels
class_labels = nb_classifier.classes_

In [None]:
# Extract the features: feature_names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead)
feature_names = count_vectorizer.get_feature_names_out()



In [None]:
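# Zip the feature names together with the first class's weights and sort by weight: feat_with_weights
# (newer scikit-learn releases drop MultinomialNB.coef_; feature_log_prob_ holds the same values here)
feat_with_weights = sorted(zip(nb_classifier.feature_log_prob_[0], feature_names))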



In [None]:
# Print the first class label and the 20 lowest-weighted feat_with_weights entries
print(class_labels[0], feat_with_weights[:20])

A [(-12.073243863160105, '_n'), (-12.073243863160105, 'aabb'), (-12.073243863160105, 'aabbco'), (-12.073243863160105, 'aabbs'), (-12.073243863160105, 'aacmm'), (-12.073243863160105, 'aad'), (-12.073243863160105, 'abandon'), (-12.073243863160105, 'abatement'), (-12.073243863160105, 'abbe'), (-12.073243863160105, 'aberration'), (-12.073243863160105, 'abiotically'), (-12.073243863160105, 'abl'), (-12.073243863160105, 'ablating'), (-12.073243863160105, 'ablative'), (-12.073243863160105, 'abnormalities'), (-12.073243863160105, 'abnormality'), (-12.073243863160105, 'abnormally'), (-12.073243863160105, 'abnormity'), (-12.073243863160105, 'abort'), (-12.073243863160105, 'aborted')]


In [None]:
# Print the second class label and the 20 highest-weighted feat_with_weights entries
print(class_labels[1], feat_with_weights[-20:])

B [(-5.706773415428667, 'including'), (-5.668015405129263, 'coupled'), (-5.654878927223893, 'base'), (-5.632297322527184, 'assembly'), (-5.5704538172444815, 'extending'), (-5.516465507002063, 'anda'), (-5.513628625666863, 'position'), (-5.396160401912969, 'user'), (-5.247783826904798, 'body'), (-5.230560580921683, 'member'), (-5.103453193258515, 'device'), (-4.969921800633992, 'plurality'), (-4.92175839925537, 'surface'), (-4.863164234989317, 'having'), (-4.716325620804084, 'configured'), (-4.604730591663768, 'end'), (-4.5411557196183825, 'portion'), (-4.397233931131217, 'comprising'), (-4.065543850276079, 'said'), (-3.9583208889555124, 'second')]


In [None]:
print(class_labels)

['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H']


In [None]:
print(class_labels[6], feat_with_weights[-20:])

G [(-5.706773415428667, 'including'), (-5.668015405129263, 'coupled'), (-5.654878927223893, 'base'), (-5.632297322527184, 'assembly'), (-5.5704538172444815, 'extending'), (-5.516465507002063, 'anda'), (-5.513628625666863, 'position'), (-5.396160401912969, 'user'), (-5.247783826904798, 'body'), (-5.230560580921683, 'member'), (-5.103453193258515, 'device'), (-4.969921800633992, 'plurality'), (-4.92175839925537, 'surface'), (-4.863164234989317, 'having'), (-4.716325620804084, 'configured'), (-4.604730591663768, 'end'), (-4.5411557196183825, 'portion'), (-4.397233931131217, 'comprising'), (-4.065543850276079, 'said'), (-3.9583208889555124, 'second')]


## Selection of various models with count vectorizing

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

In [None]:
models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),
    MultinomialNB(alpha = 2.6),
    LogisticRegression(random_state=0),
    ]

In [None]:
CV = 5
cv_df = pd.DataFrame(index=range(CV * len(models)))

entries = []
for model in models:
  model_name = model.__class__.__name__
  accuracies = cross_val_score(model, count_train, y_train, scoring='accuracy', cv=CV)
  for fold_idx, accuracy in enumerate(accuracies):
    entries.append((model_name, fold_idx, accuracy))

cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])

In [None]:
cv_df.groupby('model_name').accuracy.mean()

model_name
LinearSVC                 0.574743
LogisticRegression        0.610743
MultinomialNB             0.625468
RandomForestClassifier    0.374146
Name: accuracy, dtype: float64

## Selection of various models with tfidf vectorizing

In [None]:
CV = 5
cv_df = pd.DataFrame(index=range(CV * len(models)))

entries = []
for model in models:
  model_name = model.__class__.__name__
  accuracies = cross_val_score(model, tfidf_train, y_train, scoring='accuracy', cv=CV)
  for fold_idx, accuracy in enumerate(accuracies):
    entries.append((model_name, fold_idx, accuracy))

cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])

In [None]:
cv_df.groupby('model_name').accuracy.mean()

model_name
LinearSVC                 0.674175
LogisticRegression        0.658003
MultinomialNB             0.547213
RandomForestClassifier    0.374916
Name: accuracy, dtype: float64
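
In the cross-validation above, the vectorizer was fitted once on the whole training set, so each validation fold has already contributed to the vocabulary and document frequencies. A possible refinement, sketched here on an invented corpus and not used for the numbers reported in this post, is to wrap the vectorizer and the classifier in a pipeline so that both are refitted inside every fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus standing in for X_train / y_train (illustrative only).
texts = [
    "needle syringe tube", "surgical needle", "syringe for injection", "medical tube needle",
    "antenna signal receiver", "circuit signal", "receiver circuit board", "antenna circuit",
]
labels = ["A"] * 4 + ["H"] * 4

# The vectorizer is part of the estimator, so it is refitted on each training fold.
pipe = make_pipeline(
    TfidfVectorizer(stop_words="english", sublinear_tf=True),
    LinearSVC(),
)
scores = cross_val_score(pipe, texts, labels, scoring="accuracy", cv=2)

print(scores.mean())
```

The raw text, not the precomputed matrix, is passed to `cross_val_score`, which is what allows the vectorizer to be fitted per fold.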

With TF-IDF features, LinearSVC and Logistic Regression perform better than the other two classifiers, with LinearSVC having a slight advantage at a mean cross-validated accuracy of about 67%. Focusing on the LinearSVC model, which performed best, I report its confusion matrix to show the discrepancies between predicted and actual labels.