# Data analysis

**Execute the cell below. By running this cell, a dataset will be loaded from `patents.csv` file. There are three numpy arrays in this dataset:**
- `category`: the category to which a patent belongs 
- `patent_number`: a unique identifier for each patetnt
- `patent features`: a vector of 16 features describing several properties of each patent


In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('patents.csv')
patent_features = df['patent_embedding'].to_numpy()
features_array = []
for i in range(patent_features.size):
    feature = str(patent_features[i])
    feature = feature.replace(r'\n', '')
    features_array.append(
        np.array(feature.split()[1:-1], dtype='float')[:16]
    )
patent_features = np.stack(features_array)

patent_category = df['category'].to_numpy() # TODO category column of dataset
patent_number = df['publication_number'].to_numpy() #TODO (publication_number column of dataset)

df

Unnamed: 0,publication_number,title,cpc_code,patent_embedding,category
0,US-2019250858-A1,memory controller and operating method thereof,G06F3/061,[ 0.00135472 0.01564001 -0.04858465 0.039866...,1
1,US-1000462-A,corn planter,A01C9/00,[-4.44490612e-02 2.48770583e-02 -5.62837869e-...,6
2,KR-200146416-Y1,antitheft vehicle security system,B60R25/209,[-2.53110677e-02 -2.04547048e-02 8.63679312e-...,0
3,KR-0160422-B1,a door opening and shutting apparatus and meth...,D06F37/42,[ 1.21761542e-02 1.97522007e-02 -6.62921891e-...,1
4,US-952306-A,spray burner,B05B1/3033,[-0.00214472 0.01606156 -0.09518531 0.060160...,0
...,...,...,...,...,...
15684,AT-415717-T,method and device for produce a low pressure w...,H01M8/04104,[ 1.77878514e-02 3.53233777e-02 -3.37363742e-...,1
15685,AT-424202-T,substitute _NUMBER_ thio _NUMBER_ _NUMBER_ dic...,C07D417/12,[-0.03664465 -0.01075565 -0.02483719 -0.033502...,5
15686,CA-2952951-A1,end tip for a vehicle wiper blade,B60S1/3894,[-4.39246558e-02 2.96350904e-02 -2.31920835e-...,0
15687,CH-608317-A,process for the compressive shrinkage of a web...,D06C21/00,[-3.34328553e-03 1.02757774e-02 -2.01825500e-...,6


<hr />

1- Which patent has the highest norm? (Eucledian distance from origin)


In [3]:
df.iloc[np.argmax(pd.DataFrame(patent_features).apply(np.linalg.norm, axis=1))]['title']

'penicillanylaldehydes'

2- Find the two patents that are the farthest from eachother.

In [4]:
from scipy.spatial.distance import pdist, squareform

pairwise_distances = squareform(pdist(patent_features, 'euclidean'))

max_distance_indices = np.unravel_index(np.argmax(pairwise_distances), pairwise_distances.shape)
patent1 = df.iloc[max_distance_indices[0]]
print(patent1)
patent2 = df.iloc[max_distance_indices[1]]
print(patent2)

publication_number                                      KR-100793527-B1
title                                                          abrasive
cpc_code                                                       C09G1/02
patent_embedding      [-3.89078408e-02 -3.91889922e-02 -1.55463070e-...
category                                                              5
Name: 1661, dtype: object
publication_number                                         US-3240764-A
title                                     polythiosemicarbazide chelate
cpc_code                                                      C08G73/08
patent_embedding      [ 0.01685247  0.02782189 -0.08498514 -0.033868...
category                                                              2
Name: 9236, dtype: object


3- Write a function that, given a patent number, finds its nearest neighbour.


In [5]:
from scipy.spatial.distance import cdist

def find_nearest(patent_n):
    # find the index
    patent_index = df[df['publication_number'] == patent_n].index[0]
    
    # print(patent_index)
    
    # find the distance between each patent with given patent 
    distances = cdist(patent_features[patent_index:patent_index+1], patent_features, 'euclidean')
    
    distances[0, patent_index] = np.inf
    
    # print(distances.argmin())
    
    nearest_index = np.argmin(distances[0])
    
    # print(nearest_index)
    
    nearest_neighbor = df.iloc[nearest_index]
    
    return nearest_neighbor



4- For each patent category, find the cluster center. This quantity is computed by taking average of all patents associated with each cluster.

In [6]:
# group patents by category
grouped = df.groupby('category')

for category, group_df in grouped:
    
    category_features = patent_features[group_df.index]
    
    # print(category_features)
    
    
    cluster_center = np.mean(category_features, axis=0)
    
    
    print(category, cluster_center)
    

0 [ 0.01086092 -0.02427292  0.06917166 -0.04593048 -0.02812299 -0.0124727
 -0.04987288  0.00655626  0.0098301  -0.01550384  0.00122531  0.00426678
  0.00017979  0.02210309 -0.02753392 -0.00829946]
1 [ 0.01021772  0.0140427  -0.03571764  0.05286253 -0.04302765 -0.00263517
  0.02233755 -0.04675915  0.01272022  0.03165236  0.01146286 -0.00024609
  0.01377522  0.00555212  0.02024696 -0.04467966]
2 [ 0.01844678  0.00991557 -0.05545595  0.02615103 -0.07078419 -0.0115121
  0.04539117 -0.05906673 -0.02173693  0.00203886  0.00052992  0.02329754
 -0.03247548  0.03103352  0.0140693  -0.06104154]
3 [ 0.01717531  0.01595333 -0.03129371  0.05920419 -0.05942006 -0.03559038
 -0.01542298 -0.05486974  0.00243557  0.004506   -0.02005723  0.00059813
 -0.00323446  0.00388401  0.01666861 -0.02052029]
4 [ 0.01498087  0.02345642 -0.00569218  0.04002896 -0.03471142  0.00468704
  0.01612199 -0.03838371  0.00732594  0.00352215  0.00011503  0.01232852
 -0.01395763  0.00333184  0.04570635 -0.0292569 ]
5 [ 8.795190

5- How many patents have a nearest neighbour that is in the same category?

In [7]:
df['has_same_category'] = False

grouped = df.groupby('category')

for category, group_df in grouped:
    # iterating through patents in same category
    for index, row in group_df.iterrows():
        # find the nearest neighbor
        nearest = find_nearest(row['publication_number'])
        
        if nearest['category'] == category:
            df.at[index, 'has_same_category'] = True


number_of_neighbors_with_same_category = df['has_same_category'].sum()       

print(number_of_neighbors_with_same_category)

13000


6- What is the average and std of distances between every pair of patents?


In [8]:
pairwise_distances = pdist(patent_features, 'euclidean')

average_distance = np.mean(pairwise_distances)
std_distance = np.std(pairwise_distances)

print(average_distance, std_distance)

0.17748927186252916 0.06170723377632848


7- What is the average and std of distances between every pair of patents within a category?
Using these calculated quantities, which cluster do you think is more condensed? Which one is more scattered?

In [12]:

grouped = df.groupby('category')

avg_distances_per_category = []
std_distances_per_category = []

for category, group_df in grouped:
    category_features = patent_features[group_df.index]
    print(group_df.shape[0])

    pairwise_distances = pdist(category_features, 'euclidean')

    avg_distance = np.mean(pairwise_distances)
    std_distance = np.std(pairwise_distances)
    
    print(f'average_distance: {average_distance}       category:{category}')
    print(f'std_distance: {std_distance}          category:{category}')
    print("*" * 20)

    avg_distances_per_category.append(avg_distance)
    std_distances_per_category.append(std_distance)


most_condensed_category = np.argmin(std_distances_per_category)
most_scattered_category = np.argmax(std_distances_per_category)

# most condensed category
print(list(grouped.groups.keys())[most_condensed_category])
# most scattered category
print(list(grouped.groups.keys())[most_scattered_category])

1948
average_distance: 0.17748927186252916       category:0
std_distance: 0.03962927024874084          category:0
********************
2702
average_distance: 0.17748927186252916       category:1
std_distance: 0.030502801225921056          category:1
********************
919
average_distance: 0.17748927186252916       category:2
std_distance: 0.04381173385220193          category:2
********************
1020
average_distance: 0.17748927186252916       category:3
std_distance: 0.04464178403289032          category:3
********************
1277
average_distance: 0.17748927186252916       category:4
std_distance: 0.04010400709341744          category:4
********************
1370
average_distance: 0.17748927186252916       category:5
std_distance: 0.042275628699896206          category:5
********************
4260
average_distance: 0.17748927186252916       category:6
std_distance: 0.03176592709185818          category:6
********************
2193
average_distance: 0.17748927186252916       categ

8 - What is your analysis from this dataset?

با توجه به عدد بالای تعداد حق اختراعاتی که به هم نزدیک و در یم خوشه قرار میگیرند، میتوان گفت که تشابهات آنهایی که در یک خوشه قرار دارند نسبتا نزدیک و 
تعداد فابل توجهی از داده ها در هر خوشه وجود دارند

رده 1: این دسته دارای کمترین انحراف معیار است که نشان می دهد پتنت های این دسته نسبتاً مشابه یکدیگر هستند. میانگین فاصله در این دسته تقریباً 0.177 است که با میانگین کل مجموعه داده مطابقت دارد.

دسته 3: این دسته دارای بالاترین انحراف معیار است که نشان می دهد پتنت های این دسته دارای تنوع ویژگی های گسترده تری هستند. میانگین فاصله در این دسته نیز تقریباً 0.177 است که مشابه میانگین کل مجموعه داده است. با این حال، انحراف معیار بالاتر نشان دهنده تنوع بیشتر است.

رده 1 به عنوان متراکم ترین خوشه شناخته می شود، به این معنی که پتنت های این دسته به طور متوسط شباهت زیادی به یکدیگر دارند. دسته 3 به عنوان پراکنده ترین خوشه شناسایی می شود، که نشان می دهد که پتنت های این دسته تنوع بیشتری را در ویژگی های خود دارند.