DCRS is the Dale-Chall Readability Score, PDW is the Percentage of Difficult Words (words that do not appear on the Dale-Chall list of familiar words), and ASL is the Average Sentence Length in words.
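As a worked illustration of how PDW and ASL combine, the sketch below uses the commonly published coefficients of the new Dale-Chall formula (0.1579 for PDW, 0.0496 for ASL, plus 3.6365 when PDW exceeds 5%). The function name `dale_chall_score` is hypothetical, and library implementations such as textstat may differ in how they count words and sentences.

```python
def dale_chall_score(pdw: float, asl: float) -> float:
    """Return the Dale-Chall score for a text.

    pdw: percentage of difficult words, on a 0-100 scale.
    asl: average sentence length, in words per sentence.
    """
    score = 0.1579 * pdw + 0.0496 * asl
    # The adjustment constant is only added when more than 5% of the
    # words are difficult.
    if pdw > 5.0:
        score += 3.6365
    return score


# 20% difficult words, 15 words per sentence on average:
# 0.1579 * 20 + 0.0496 * 15 + 3.6365
print(round(dale_chall_score(20.0, 15.0), 4))  # 7.5385
```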

The resulting score can be interpreted as follows:

4.9 or below: Easily understandable by an average 4th-grade student or lower.

5.0–5.9: Easily understandable by an average 5th or 6th-grade student.

6.0–6.9: Easily understandable by an average 7th or 8th-grade student.

7.0–7.9: Easily understandable by an average 9th or 10th-grade student.

8.0–8.9: Easily understandable by an average 11th or 12th-grade student.

9.0–9.9: Easily understandable by an average college student.

10.0 or above: Only easily understandable by graduates or individuals with a higher level of education.
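The bands above can be turned into a small lookup, which is handy for labelling averaged scores. This is a minimal sketch: the function name `dale_chall_band` and the label strings are assumptions, and scores that fall between two listed ranges (for example 4.95) are assigned to the lower band.

```python
from bisect import bisect_right

# Lower bounds of every band after the first one.
_CUTOFFS = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
_BANDS = [
    "4th grade or lower",
    "5th or 6th grade",
    "7th or 8th grade",
    "9th or 10th grade",
    "11th or 12th grade",
    "college student",
    "college graduate or higher",
]


def dale_chall_band(score: float) -> str:
    """Map a Dale-Chall score to the reading level it corresponds to."""
    return _BANDS[bisect_right(_CUTOFFS, score)]


print(dale_chall_band(4.9))   # 4th grade or lower
print(dale_chall_band(7.5))   # 9th or 10th grade
print(dale_chall_band(10.0))  # college graduate or higher
```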


In [1]:
!pip install textstat

Collecting textstat
  Downloading textstat-0.7.3-py3-none-any.whl (105 kB)
Collecting pyphen (from textstat)
  Downloading pyphen-0.14.0-py3-none-any.whl (2.0 MB)
Installing collected packages: pyphen, textstat
Successfully installed pyphen-0.14.0 textstat-0.7.3


In [2]:
import pandas as pd
from tqdm import tqdm
import textstat
import re

# Load the DataFrame from the Excel file
input_file = "Abstract_Summary_t5_base_file.xlsx"
df = pd.read_excel(input_file)

# Define a list to store the Dale-Chall Readability scores
dcr_scores = []

# Iterate over the rows in the DataFrame with a progress bar
for index, row in tqdm(df.iterrows(), total=len(df), desc="Calculating Scores"):
    # Get the abstract summary
    abstract_summary = str(row['Abstract_Summary_t5_base'])

    # Clean the abstract summary
    abstract_summary = re.sub(r'[^\x00-\x7F]+', '', abstract_summary)

    # Calculate Dale-Chall Readability score
    dcr_score = textstat.dale_chall_readability_score(abstract_summary)

    # Append the score to the list
    dcr_scores.append(dcr_score)

# Add the scores to the DataFrame
df['Dale_Chall_Readability_Score'] = dcr_scores

# Save the updated DataFrame to a new Excel file
output_file = "Dale_Chall_Readability_Scores_t5_base_abstract_total.xlsx"
df.to_excel(output_file, index=False)

# Print the average Dale-Chall Readability Score
print("\nAverage Dale-Chall Readability Score:", sum(dcr_scores) / len(dcr_scores))


Calculating Scores: 100%|██████████| 1630/1630 [00:01<00:00, 1385.90it/s]



Average Dale-Chall Readability Score: 10.5803926380368
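The cleaning step in the cell above deletes every non-ASCII character, which can also remove letters from accented words and so change which words count as difficult. The sketch below shows that behaviour next to a gentler alternative that first decomposes accented letters with `unicodedata.normalize`. Both helper names are hypothetical, and the second variant is an illustration, not the cleaning used for the scores reported here.

```python
import re
import unicodedata


def strip_non_ascii(text: str) -> str:
    """Delete every non-ASCII character, as the cell above does."""
    return re.sub(r'[^\x00-\x7F]+', '', text)


def fold_to_ascii(text: str) -> str:
    """Split accented letters into base letter + accent, then drop the accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


sample = "na\u00efve caf\u00e9"  # "naive cafe" written with accented letters
print(strip_non_ascii(sample))  # nave caf
print(fold_to_ascii(sample))    # naive cafe
```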


In [3]:
import pandas as pd
from tqdm import tqdm
import textstat
import re

# Load the DataFrame from the Excel file
input_file = "Claims_Summary_t5_base_file.xlsx"
df = pd.read_excel(input_file)

# Define a list to store the Dale-Chall Readability scores
dcr_scores = []

# Iterate over the rows in the DataFrame with a progress bar
for index, row in tqdm(df.iterrows(), total=len(df), desc="Calculating Scores"):
    # Get the claims summary
    claims_summary = str(row['Claims_Summary_t5_base'])

    # Clean the claims summary by stripping non-ASCII characters
    claims_summary = re.sub(r'[^\x00-\x7F]+', '', claims_summary)

    # Calculate Dale-Chall Readability score
    dcr_score = textstat.dale_chall_readability_score(claims_summary)

    # Append the score to the list
    dcr_scores.append(dcr_score)

# Add the scores to the DataFrame
df['Dale_Chall_Readability_Score'] = dcr_scores

# Save the updated DataFrame to a new Excel file
output_file = "Dale_Chall_Readability_Scores_t5_base_claims_total.xlsx"
df.to_excel(output_file, index=False)

# Print the average Dale-Chall Readability Score
print("\nAverage Dale-Chall Readability Score:", sum(dcr_scores) / len(dcr_scores))


Calculating Scores: 100%|██████████| 1630/1630 [00:00<00:00, 3430.44it/s]



Average Dale-Chall Readability Score: 9.873447852760737


In [4]:
import pandas as pd
from tqdm import tqdm
import textstat
import re

# Load the DataFrame from the Excel file
input_file = "Combined_Google_patent_Summary_t5_base_file.xlsx"
df = pd.read_excel(input_file)

# Define a list to store the Dale-Chall Readability scores
dcr_scores = []

# Iterate over the rows in the DataFrame with a progress bar
for index, row in tqdm(df.iterrows(), total=len(df), desc="Calculating Scores"):
    # Get the combined summary
    combined_summary = str(row['Combined_Summary'])

    # Clean the combined summary by stripping non-ASCII characters
    combined_summary = re.sub(r'[^\x00-\x7F]+', '', combined_summary)

    # Calculate Dale-Chall Readability score
    dcr_score = textstat.dale_chall_readability_score(combined_summary)

    # Append the score to the list
    dcr_scores.append(dcr_score)

# Add the scores to the DataFrame
df['Dale_Chall_Readability_Score'] = dcr_scores

# Save the updated DataFrame to a new Excel file
output_file = "Dale_Chall_Readability_Scores_t5_base_combined_total.xlsx"
df.to_excel(output_file, index=False)

# Print the average Dale-Chall Readability Score
print("\nAverage Dale-Chall Readability Score:", sum(dcr_scores) / len(dcr_scores))


Calculating Scores: 100%|██████████| 1630/1630 [00:00<00:00, 2737.99it/s]



Average Dale-Chall Readability Score: 10.660993865030667
