# Stage 1: Enhanced Data Cleaning, Preprocessing, and Exploratory Analysis


## 1. Data Collection and Cleaning


### Overview of JSON Parsing Issues with the Initial Method

Original Method:
* The first attempt loaded the JSON file with `pandas.read_json`, which raised errors because of how the file's contents are structured.

Issues and Solutions:

1. **Trailing Data Error**:
   - **Problem**: `pandas.read_json` raised a `ValueError` reporting "trailing data".
   - **Solution**: Inspected the JSON file's formatting and structure to confirm whether unexpected data followed the first JSON value.

2. **Character Encoding Error**:
   - **Problem**: After resolving the first issue, a `UnicodeDecodeError` was raised, indicating that the file was being decoded with an incompatible character encoding.
   - **Solution**: Modified the file-reading command to pass `encoding='utf-8'` explicitly.

3. **Multiple JSON Objects Error**:
   - **Problem**: Following the encoding fix, a `JSONDecodeError` reporting "extra data" indicated that the file contains multiple JSON objects concatenated without an enclosing array.
   - **Solution**:

     - If each line is a separate JSON object, parse each line individually.
     - If multiple objects are concatenated in a single stream, split the stream and parse each object separately.
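For the second case (objects concatenated in one stream), a minimal sketch using only the standard library is shown here. The helper name `iter_concatenated_json` and the sample string are hypothetical and not part of the original notebook; the sketch relies on `json.JSONDecoder.raw_decode`, which parses one value and reports where it ended.

```python
import json


def iter_concatenated_json(text):
    """Yield each top-level JSON value found in `text`, in order."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over it here
        if text[pos].isspace():
            pos += 1
            continue
        # Parse one value starting at pos; the returned index is where it ended
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample: objects joined both with and without a newline
sample = '{"id": 1}{"id": 2}\n{"id": 3}'
records = list(iter_concatenated_json(sample))
print(records)  # [{'id': 1}, {'id': 2}, {'id': 3}]
```

This handles newline-delimited files as well, at the cost of reading the whole file into memory first.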

The steps implemented below address these issues, enhancing the robustness of JSON data handling and parsing in diverse file structures.


In [316]:
import json
import pandas as pd

data = []

# Read the JSON file line by line, assuming each line is a complete JSON object
with open("patent_data.json", "r", encoding="utf-8") as file:
    for line in file:
        # Convert each line to a Python dictionary
        json_obj = json.loads(line)
        data.append(json_obj)

# Convert the list of dictionaries to a DataFrame
cleantech_patent_data = pd.DataFrame(data)

# Print the shape of the cleantech_patent_data dataframe
print(cleantech_patent_data.shape)

# Display the cleantech_patent_data dataframe
cleantech_patent_data

(30000, 8)


Unnamed: 0,publication_number,application_number,country_code,title_localized,abstract_localized,publication_date,inventor,cpc
0,US-2022239235-A1,US-202217717397-A,US,[{'text': 'Adaptable DC-AC Inverter Drive Syst...,[{'text': 'Disclosed is an adaptable DC-AC inv...,20220728,[],"[{'code': 'H02M7/5395', 'inventive': True, 'fi..."
1,US-2022239251-A1,US-202217580956-A,US,[{'text': 'System for providing the energy fro...,[{'text': 'In accordance with an example embod...,20220728,[],"[{'code': 'H02S40/38', 'inventive': True, 'fir..."
2,EP-4033090-A1,EP-21152924-A,EP,[{'text': 'Verfahren zum steuern einer windene...,[{'text': 'Verfahren zum Steuern einer Windene...,20220727,"[Schaper, Ulf, von Aswege, Enno, Gerke Funcke,...","[{'code': 'F03D7/0276', 'inventive': True, 'fi..."
3,EP-4033090-A1,EP-21152924-A,EP,[{'text': 'Verfahren zum steuern einer windene...,[{'text': 'Verfahren zum Steuern einer Windene...,20220727,"[Schaper, Ulf, von Aswege, Enno, Gerke Funcke,...","[{'code': 'F03D7/0276', 'inventive': True, 'fi..."
4,US-11396827-B2,US-202117606042-A,US,[{'text': 'Control method for optimizing solar...,[{'text': 'A control method for optimizing a s...,20220726,[],"[{'code': 'F24S50/00', 'inventive': True, 'fir..."
...,...,...,...,...,...,...,...,...
29995,CN-214835218-U,CN-202121309411-U,CN,"[{'text': '一种具有充电功能的无人机机库', 'language': 'zh', ...",[{'text': 'The utility model relates to an unm...,20211123,"[ZHANG ANZHI, ZHANG WANYONG, HUANG HUIYONG, XU...","[{'code': 'Y02E10/50', 'inventive': False, 'fi..."
29996,CN-113690940-A,CN-202111107806-A,CN,"[{'text': '一种风电供电控制方法', 'language': 'zh', 'tru...",[{'text': 'The invention provides a wind power...,20211123,"[ZHANG YUCHUAN, LONG HAIYANG, TAN CUNZHEN]","[{'code': 'Y02E10/76', 'inventive': False, 'fi..."
29997,CN-113683210-A,CN-202110998828-A,CN,"[{'text': '一种基于无线控制的漂浮式太阳能曝气装置', 'language': '...",[{'text': '本发明公开了一种基于无线控制的漂浮式太阳能曝气装置，包括漂浮板、太阳能...,20211123,[An Zhixia],"[{'code': 'C02F2209/22', 'inventive': False, '..."
29998,CN-113683210-A,CN-202110998828-A,CN,"[{'text': '一种基于无线控制的漂浮式太阳能曝气装置', 'language': '...",[{'text': '本发明公开了一种基于无线控制的漂浮式太阳能曝气装置，包括漂浮板、太阳能...,20211123,[An Zhixia],"[{'code': 'C02F2209/22', 'inventive': False, '..."


In [317]:
# Check for missing values

cleantech_patent_data.isnull().sum()


publication_number    0
application_number    0
country_code          0
title_localized       0
abstract_localized    0
publication_date      0
inventor              0
cpc                   0
dtype: int64

In [318]:
# Inspect the data type of the first element in each column

for col in cleantech_patent_data.columns:
    print(
        col,
        ": ",
        type(cleantech_patent_data[col].iloc[0]),
        cleantech_patent_data[col].iloc[0],
    )

publication_number :  <class 'str'> US-2022239235-A1
application_number :  <class 'str'> US-202217717397-A
country_code :  <class 'str'> US
title_localized :  <class 'list'> [{'text': 'Adaptable DC-AC Inverter Drive System and Operation', 'language': 'en', 'truncated': False}]
abstract_localized :  <class 'list'> [{'text': 'Disclosed is an adaptable DC-AC inverter system and its operation. The system includes multiple DC input sources as input to provide a stable operation under various conditions. DC input sources may be added to the system or removed from the system without impacting the functionality of the system. The disclosed system is suited for solar energy harvesting in grid-connected or off-grid modes of operation.', 'language': 'en', 'truncated': False}]
publication_date :  <class 'str'> 20220728
inventor :  <class 'list'> []
cpc :  <class 'list'> [{'code': 'H02M7/5395', 'inventive': True, 'first': False, 'tree': []}, {'code': 'H02J3/32', 'inventive': False, 'first': False, 

This step normalizes the nested JSON structures. Fields such as 'title_localized' and 'abstract_localized' each hold a list of dictionaries with 'text', 'language', and 'truncated' keys, so we extract the 'text' of the first entry and flatten it into a more accessible DataFrame format.

- Normalize Nested Structures: The extracted text goes into separate 'title' and 'abstract' columns, and rows whose list is empty receive `None`. This makes these fields easier to access and manipulate in subsequent analyses.
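Taking the first entry of each list works when the list holds a single entry. As a hedged variation, the sketch below prefers an entry in a given language and falls back to the first one. The helper `pick_text` is hypothetical, the English title in the sample is invented for illustration, and whether any list in this dataset actually holds more than one language is an assumption the displayed rows do not confirm.

```python
def pick_text(entries, preferred_language="en"):
    """Return the text in the preferred language, else the first text, else None."""
    if not entries:
        return None
    for entry in entries:
        if entry.get("language") == preferred_language:
            return entry.get("text")
    # No entry in the preferred language: fall back to the first one
    return entries[0].get("text")


# Hypothetical record with two localized titles (the English text is made up)
sample_title = [
    {"text": "一种风电供电控制方法", "language": "zh", "truncated": False},
    {"text": "Wind power supply control method", "language": "en", "truncated": False},
]

print(pick_text(sample_title))                           # English entry
print(pick_text(sample_title, preferred_language="de"))  # falls back to the first entry
print(pick_text([]))                                     # None
```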


In [319]:
# Extracting data from nested fields
cleantech_patent_data["publication_number"] = [
    entry["publication_number"] for entry in data
]
cleantech_patent_data["application_number"] = [
    entry["application_number"] for entry in data
]
cleantech_patent_data["country_code"] = [
    entry["country_code"] for entry in data]
cleantech_patent_data["title"] = [
    entry["title_localized"][0]["text"] if entry["title_localized"] else None
    for entry in data
]
cleantech_patent_data["abstract"] = [
    entry["abstract_localized"][0]["text"] if entry["abstract_localized"] else None
    for entry in data
]
cleantech_patent_data["publication_date"] = [
    entry["publication_date"] for entry in data
]
cleantech_patent_data["inventors"] = [entry["inventor"] for entry in data]

# Display the cleantech_patent_data dataframe
cleantech_patent_data

Unnamed: 0,publication_number,application_number,country_code,title_localized,abstract_localized,publication_date,inventor,cpc,title,abstract,inventors
0,US-2022239235-A1,US-202217717397-A,US,[{'text': 'Adaptable DC-AC Inverter Drive Syst...,[{'text': 'Disclosed is an adaptable DC-AC inv...,20220728,[],"[{'code': 'H02M7/5395', 'inventive': True, 'fi...",Adaptable DC-AC Inverter Drive System and Oper...,Disclosed is an adaptable DC-AC inverter syste...,[]
1,US-2022239251-A1,US-202217580956-A,US,[{'text': 'System for providing the energy fro...,[{'text': 'In accordance with an example embod...,20220728,[],"[{'code': 'H02S40/38', 'inventive': True, 'fir...",System for providing the energy from a single ...,"In accordance with an example embodiment, a so...",[]
2,EP-4033090-A1,EP-21152924-A,EP,[{'text': 'Verfahren zum steuern einer windene...,[{'text': 'Verfahren zum Steuern einer Windene...,20220727,"[Schaper, Ulf, von Aswege, Enno, Gerke Funcke,...","[{'code': 'F03D7/0276', 'inventive': True, 'fi...",Verfahren zum steuern einer windenergieanlage,Verfahren zum Steuern einer Windenergieanlage ...,"[Schaper, Ulf, von Aswege, Enno, Gerke Funcke,..."
3,EP-4033090-A1,EP-21152924-A,EP,[{'text': 'Verfahren zum steuern einer windene...,[{'text': 'Verfahren zum Steuern einer Windene...,20220727,"[Schaper, Ulf, von Aswege, Enno, Gerke Funcke,...","[{'code': 'F03D7/0276', 'inventive': True, 'fi...",Verfahren zum steuern einer windenergieanlage,Verfahren zum Steuern einer Windenergieanlage ...,"[Schaper, Ulf, von Aswege, Enno, Gerke Funcke,..."
4,US-11396827-B2,US-202117606042-A,US,[{'text': 'Control method for optimizing solar...,[{'text': 'A control method for optimizing a s...,20220726,[],"[{'code': 'F24S50/00', 'inventive': True, 'fir...",Control method for optimizing solar-to-power e...,A control method for optimizing a solar-to-pow...,[]
...,...,...,...,...,...,...,...,...,...,...,...
29995,CN-214835218-U,CN-202121309411-U,CN,"[{'text': '一种具有充电功能的无人机机库', 'language': 'zh', ...",[{'text': 'The utility model relates to an unm...,20211123,"[ZHANG ANZHI, ZHANG WANYONG, HUANG HUIYONG, XU...","[{'code': 'Y02E10/50', 'inventive': False, 'fi...",一种具有充电功能的无人机机库,The utility model relates to an unmanned air v...,"[ZHANG ANZHI, ZHANG WANYONG, HUANG HUIYONG, XU..."
29996,CN-113690940-A,CN-202111107806-A,CN,"[{'text': '一种风电供电控制方法', 'language': 'zh', 'tru...",[{'text': 'The invention provides a wind power...,20211123,"[ZHANG YUCHUAN, LONG HAIYANG, TAN CUNZHEN]","[{'code': 'Y02E10/76', 'inventive': False, 'fi...",一种风电供电控制方法,The invention provides a wind power supply con...,"[ZHANG YUCHUAN, LONG HAIYANG, TAN CUNZHEN]"
29997,CN-113683210-A,CN-202110998828-A,CN,"[{'text': '一种基于无线控制的漂浮式太阳能曝气装置', 'language': '...",[{'text': '本发明公开了一种基于无线控制的漂浮式太阳能曝气装置，包括漂浮板、太阳能...,20211123,[An Zhixia],"[{'code': 'C02F2209/22', 'inventive': False, '...",一种基于无线控制的漂浮式太阳能曝气装置,本发明公开了一种基于无线控制的漂浮式太阳能曝气装置，包括漂浮板、太阳能转化机构以及输气泵，所...,[An Zhixia]
29998,CN-113683210-A,CN-202110998828-A,CN,"[{'text': '一种基于无线控制的漂浮式太阳能曝气装置', 'language': '...",[{'text': '本发明公开了一种基于无线控制的漂浮式太阳能曝气装置，包括漂浮板、太阳能...,20211123,[An Zhixia],"[{'code': 'C02F2209/22', 'inventive': False, '...",一种基于无线控制的漂浮式太阳能曝气装置,本发明公开了一种基于无线控制的漂浮式太阳能曝气装置，包括漂浮板、太阳能转化机构以及输气泵，所...,[An Zhixia]


In [320]:
# Inspect the data type of the first element in each column of cleantech_patent_data

for col in cleantech_patent_data.columns:
    print(
        col,
        ": ",
        type(cleantech_patent_data[col].iloc[0]),
        cleantech_patent_data[col].iloc[0],
    )


publication_number :  <class 'str'> US-2022239235-A1
application_number :  <class 'str'> US-202217717397-A
country_code :  <class 'str'> US
title_localized :  <class 'list'> [{'text': 'Adaptable DC-AC Inverter Drive System and Operation', 'language': 'en', 'truncated': False}]
abstract_localized :  <class 'list'> [{'text': 'Disclosed is an adaptable DC-AC inverter system and its operation. The system includes multiple DC input sources as input to provide a stable operation under various conditions. DC input sources may be added to the system or removed from the system without impacting the functionality of the system. The disclosed system is suited for solar energy harvesting in grid-connected or off-grid modes of operation.', 'language': 'en', 'truncated': False}]
publication_date :  <class 'str'> 20220728
inventor :  <class 'list'> []
cpc :  <class 'list'> [{'code': 'H02M7/5395', 'inventive': True, 'first': False, 'tree': []}, {'code': 'H02J3/32', 'inventive': False, 'first': False, 

We have successfully extracted the nested data into separate columns. Now, we will process the nested data in the 'cpc' column.

In [321]:
# The 'cpc' column is a list of dictionaries: flatten it into a separate DataFrame,
# keeping publication_number so each code can be traced back to its patent
cpc_data = pd.json_normalize(
    data, record_path=["cpc"], meta=["publication_number"], errors="ignore"
)

# Display the cpc_data dataframe

cpc_data


Unnamed: 0,code,inventive,first,tree,publication_number
0,H02M7/5395,True,False,[],US-2022239235-A1
1,H02J3/32,False,False,[],US-2022239235-A1
2,H02M1/32,True,False,[],US-2022239235-A1
3,H02J1/10,True,False,[],US-2022239235-A1
4,H02J3/381,True,False,[],US-2022239235-A1
...,...,...,...,...,...
92280,C02F2201/008,False,False,[],CN-113683210-A
92281,Y02W10/10,False,False,[],CN-113683210-A
92282,B64G4/00,True,False,[],CN-113682495-A
92283,B64G1/401,True,False,[],CN-113682495-A


In [322]:
# Inspect the data type of the first element in each column of cpc_data

for col in cpc_data.columns:
    print(
        col,
        ": ",
        type(cpc_data[col].iloc[0]),
        cpc_data[col].iloc[0],
    )

code :  <class 'str'> H02M7/5395
inventive :  <class 'numpy.bool_'> True
first :  <class 'numpy.bool_'> False
tree :  <class 'list'> []
publication_number :  <class 'str'> US-2022239235-A1


In [323]:
# Print the index of every row whose 'tree' list is not empty

for i in range(len(cpc_data)):
    if len(cpc_data["tree"].iloc[i]) != 0:
        print(i)


Nothing is printed, so every list in the "tree" column is empty. We can therefore safely drop the "tree" column from cpc_data and then merge cpc_data with cleantech_patent_data on the "publication_number" column.
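The same emptiness check can be written without an explicit loop. The sketch below uses a small made-up frame (`toy`) standing in for cpc_data, so the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for cpc_data: every 'tree' value is an empty list
toy = pd.DataFrame({"code": ["H02J3/32", "Y02E10/50"], "tree": [[], []]})

# Rows whose 'tree' list has at least one element
non_empty = toy[toy["tree"].map(len) > 0].index.tolist()
all_empty = len(non_empty) == 0

print(non_empty, all_empty)  # [] True
```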

In [324]:
# Print the shapes of cpc_data and cleantech_patent_data
print(cpc_data.shape, cleantech_patent_data.shape)

print(
    "---------------------------------------------------------------------------",
)

# Drop the tree column

cpc_data.drop(["tree"], axis=1, inplace=True)

# Display the first rows of cpc_data
print(cpc_data.head())

# Merge the cpc_data and cleantech_patent_data dataframes

cleantech_patent_data = cleantech_patent_data.merge(
    cpc_data, on="publication_number")

print(
    "---------------------------------------------------------------------------",
)

# Display the cleantech_patent_data dataframe
print(cleantech_patent_data.head(2))

# Print the shapes of both dataframes to check that they merged correctly
print(
    "---------------------------------------------------------------------------",
)
print(cpc_data.shape, cleantech_patent_data.shape)

# Print colnames of cleantech_patent_data
print(cleantech_patent_data.columns)

(92285, 5) (30000, 11)
---------------------------------------------------------------------------
         code  inventive  first publication_number
0  H02M7/5395       True  False   US-2022239235-A1
1    H02J3/32      False  False   US-2022239235-A1
2    H02M1/32       True  False   US-2022239235-A1
3    H02J1/10       True  False   US-2022239235-A1
4   H02J3/381       True  False   US-2022239235-A1
---------------------------------------------------------------------------
  publication_number application_number country_code  \
0   US-2022239235-A1  US-202217717397-A           US   
1   US-2022239235-A1  US-202217717397-A           US   

                                     title_localized  \
0  [{'text': 'Adaptable DC-AC Inverter Drive Syst...   
1  [{'text': 'Adaptable DC-AC Inverter Drive Syst...   

                                  abstract_localized publication_date  \
0  [{'text': 'Disclosed is an adaptable DC-AC inv...         20220728   
1  [{'text': 'Disclosed is an adapt

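The merge succeeded: the result has 14 columns, the 11 from cleantech_patent_data plus 'code', 'inventive', and 'first' from cpc_data. Because "publication_number" repeats in both dataframes, the merge is many-to-many and the row count grows well beyond either input.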
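A merge on a key that repeats on both sides multiplies the matching rows. The sketch below shows this on made-up data and then uses the `validate` argument of `DataFrame.merge` to assert the expected relationship once the left side is de-duplicated. The frames `patents` and `codes` are hypothetical stand-ins, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins: P1 appears twice on the left, as duplicated source rows would
patents = pd.DataFrame(
    {"publication_number": ["P1", "P1", "P2"], "title": ["a", "a", "b"]}
)
codes = pd.DataFrame(
    {"publication_number": ["P1", "P1", "P2"], "code": ["X", "Y", "Z"]}
)

# Many-to-many: P1 yields 2 x 2 rows, P2 yields 1 x 1
merged = patents.merge(codes, on="publication_number")
print(len(merged))  # 5

# De-duplicate the left side first, then let pandas verify the relationship;
# validate raises pandas.errors.MergeError if the left keys are not unique
unique_patents = patents.drop_duplicates(subset="publication_number")
checked = unique_patents.merge(
    codes, on="publication_number", validate="one_to_many"
)
print(len(checked))  # 3
```

De-duplicating before the merge is one possible ordering; the notebook instead merges first and removes duplicates afterwards.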

In [325]:
# Now remove the nested fields and the redundant columns

cleantech_patent_data.drop(
    ["title_localized", "abstract_localized", "inventor", "cpc"], axis=1, inplace=True
)

# Print cleantech_patent_data columns
print(cleantech_patent_data.columns)


# Display the cleantech_patent_data dataframe

cleantech_patent_data

Index(['publication_number', 'application_number', 'country_code',
       'publication_date', 'title', 'abstract', 'inventors', 'code',
       'inventive', 'first'],
      dtype='object')


Unnamed: 0,publication_number,application_number,country_code,publication_date,title,abstract,inventors,code,inventive,first
0,US-2022239235-A1,US-202217717397-A,US,20220728,Adaptable DC-AC Inverter Drive System and Oper...,Disclosed is an adaptable DC-AC inverter syste...,[],H02M7/5395,True,False
1,US-2022239235-A1,US-202217717397-A,US,20220728,Adaptable DC-AC Inverter Drive System and Oper...,Disclosed is an adaptable DC-AC inverter syste...,[],H02J3/32,False,False
2,US-2022239235-A1,US-202217717397-A,US,20220728,Adaptable DC-AC Inverter Drive System and Oper...,Disclosed is an adaptable DC-AC inverter syste...,[],H02M1/32,True,False
3,US-2022239235-A1,US-202217717397-A,US,20220728,Adaptable DC-AC Inverter Drive System and Oper...,Disclosed is an adaptable DC-AC inverter syste...,[],H02J1/10,True,False
4,US-2022239235-A1,US-202217717397-A,US,20220728,Adaptable DC-AC Inverter Drive System and Oper...,Disclosed is an adaptable DC-AC inverter syste...,[],H02J3/381,True,False
...,...,...,...,...,...,...,...,...,...,...
251034,CN-113683210-A,CN-202110998828-A,CN,20211123,一种基于无线控制的漂浮式太阳能曝气装置,本发明公开了一种基于无线控制的漂浮式太阳能曝气装置，包括漂浮板、太阳能转化机构以及输气泵，所...,[An Zhixia],C02F2209/22,False,False
251035,CN-113683210-A,CN-202110998828-A,CN,20211123,一种基于无线控制的漂浮式太阳能曝气装置,本发明公开了一种基于无线控制的漂浮式太阳能曝气装置，包括漂浮板、太阳能转化机构以及输气泵，所...,[An Zhixia],C02F2201/009,False,False
251036,CN-113683210-A,CN-202110998828-A,CN,20211123,一种基于无线控制的漂浮式太阳能曝气装置,本发明公开了一种基于无线控制的漂浮式太阳能曝气装置，包括漂浮板、太阳能转化机构以及输气泵，所...,[An Zhixia],C02F7/00,True,True
251037,CN-113683210-A,CN-202110998828-A,CN,20211123,一种基于无线控制的漂浮式太阳能曝气装置,本发明公开了一种基于无线控制的漂浮式太阳能曝气装置，包括漂浮板、太阳能转化机构以及输气泵，所...,[An Zhixia],C02F2201/008,False,False


Inspecting the cleantech_patent_data dataframe as displayed before the merge, rows 2 and 3 share the same publication number and application number and appear to be duplicates. We check for duplicate rows in the next step.

In [326]:
# Count duplicates column by column: some columns hold lists, which are unhashable,
# so "cleantech_patent_data.duplicated().sum()" on the whole dataframe raises an error

for col in cleantech_patent_data.columns:
    print(col, ": ", cleantech_patent_data.duplicated(col).sum())

print(
    "---------------------------------------------------------------------------",
)

# Count the unique values in publication_number and application_number

print(
    "Unique values in publication_number: ",
    cleantech_patent_data["publication_number"].nunique(),
)
print(
    "Unique values in application_number: ",
    cleantech_patent_data["application_number"].nunique(),
)


publication_number :  242494
application_number :  242616
country_code :  251008
publication_date :  250876
title :  242757
abstract :  242607
inventors :  246031
code :  243942
inventive :  251037
first :  251037
---------------------------------------------------------------------------
Unique values in publication_number:  8545
Unique values in application_number:  8423
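If a whole-row duplicate count is wanted despite the list-valued column, one option is to convert the lists to tuples, which are hashable. This is a sketch on made-up data, not a step the notebook performs; the frame `df` and its values are hypothetical.

```python
import pandas as pd

# Toy frame with a list-valued column, like 'inventors' in the notebook
df = pd.DataFrame(
    {
        "publication_number": ["P1", "P1", "P2"],
        "inventors": [["A", "B"], ["A", "B"], []],
    }
)

# Lists are unhashable, so convert them to tuples on a copy before comparing rows
hashable = df.copy()
hashable["inventors"] = hashable["inventors"].apply(tuple)

n_dupes = int(hashable.duplicated().sum())
print(n_dupes)  # 1: the second P1 row repeats the first
```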


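We treat the publication number as the unique identifier, since it has more distinct values (8545) than the application number (8423), and therefore drop duplicate rows based on the "publication_number" column. Note that this keeps only the first row for each publication number, so only the first CPC code of each patent is retained.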

In [327]:
# Drop the duplicates using publication_number

cleantech_patent_data.drop_duplicates(
    subset=["publication_number"], inplace=True)

# Reset the index to account for the dropped rows
cleantech_patent_data.reset_index(drop=True, inplace=True)

# Display the cleantech_patent_data dataframe
cleantech_patent_data

Unnamed: 0,publication_number,application_number,country_code,publication_date,title,abstract,inventors,code,inventive,first
0,US-2022239235-A1,US-202217717397-A,US,20220728,Adaptable DC-AC Inverter Drive System and Oper...,Disclosed is an adaptable DC-AC inverter syste...,[],H02M7/5395,True,False
1,US-2022239251-A1,US-202217580956-A,US,20220728,System for providing the energy from a single ...,"In accordance with an example embodiment, a so...",[],H02S40/38,True,False
2,EP-4033090-A1,EP-21152924-A,EP,20220727,Verfahren zum steuern einer windenergieanlage,Verfahren zum Steuern einer Windenergieanlage ...,"[Schaper, Ulf, von Aswege, Enno, Gerke Funcke,...",F03D7/0276,True,True
3,US-11396827-B2,US-202117606042-A,US,20220726,Control method for optimizing solar-to-power e...,A control method for optimizing a solar-to-pow...,[],F24S50/00,True,False
4,US-2022228549-A1,US-202117153853-A,US,20220721,Mutually supporting hydropower systems,The mutually supporting hydropower systems inc...,"[CHEN, CHUN-HUEI]",F03B3/00,True,False
...,...,...,...,...,...,...,...,...,...,...
8540,CN-214850044-U,CN-202120980740-U,CN,20211123,一种可调节太阳能板朝向的箱变,The utility model relates to a case becomes te...,"[WANG HAI, WANG YANG]",Y02E10/50,False,False
8541,CN-214840532-U,CN-202121225269-U,CN,20211123,Garden waste anaerobic fermentation coupling m...,The utility model discloses a gardens discarde...,"[Gao Zhelu, Dai Jiangyue, LIU ZIYI, YAN CONGQI...",Y02E60/50,False,False
8542,CN-214835218-U,CN-202121309411-U,CN,20211123,一种具有充电功能的无人机机库,The utility model relates to an unmanned air v...,"[ZHANG ANZHI, ZHANG WANYONG, HUANG HUIYONG, XU...",Y02E10/50,False,False
8543,CN-113690940-A,CN-202111107806-A,CN,20211123,一种风电供电控制方法,The invention provides a wind power supply con...,"[ZHANG YUCHUAN, LONG HAIYANG, TAN CUNZHEN]",Y02E10/76,False,False


After dropping the duplicates, we have 8545 rows left, out of the 30000 rows originally loaded.
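If later analysis needs every CPC code of a patent rather than one code per row, an alternative is to collapse the codes into a list per publication number before dropping duplicates. The sketch below uses made-up data; `merged_toy` and the `codes` column name are hypothetical.

```python
import pandas as pd

# Toy stand-in for the merged frame: one row per (patent, code) pair,
# with a repeated pair as a many-to-many merge would produce
merged_toy = pd.DataFrame(
    {
        "publication_number": ["P1", "P1", "P1", "P2"],
        "title": ["a", "a", "a", "b"],
        "code": ["X", "Y", "X", "Z"],
    }
)

# Collect the distinct codes of each patent, preserving first-seen order
codes_per_patent = (
    merged_toy.groupby("publication_number", sort=False)["code"]
    .agg(lambda s: list(dict.fromkeys(s)))
    .rename("codes")
    .reset_index()
)

# One row per patent, with all of its codes attached
one_row_per_patent = (
    merged_toy.drop_duplicates(subset="publication_number")
    .drop(columns="code")
    .merge(codes_per_patent, on="publication_number")
)

result = one_row_per_patent.set_index("publication_number")["codes"].to_dict()
print(result)  # {'P1': ['X', 'Y'], 'P2': ['Z']}
```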