The Coleman-Liau Index (CLI) is a readability test for English text that approximates the U.S. grade level needed to understand it. The formula for calculating the CLI is:

CLI = 0.0588 * L - 0.296 * S - 15.8

where:

L is the average number of letters per 100 words, and
S is the average number of sentences per 100 words.
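
As a rough illustration of the formula, the sketch below computes the CLI from raw counts. The function name `coleman_liau_from_counts` and the sample counts are hypothetical, chosen only to show the arithmetic.

```python
def coleman_liau_from_counts(num_letters, num_words, num_sentences):
    """Return the Coleman-Liau Index for the given raw counts."""
    if num_words == 0:
        raise ValueError("CLI is undefined for text with no words")
    # L and S are both normalised to a 100-word sample.
    L = num_letters / num_words * 100
    S = num_sentences / num_words * 100
    return 0.0588 * L - 0.296 * S - 15.8


# Hypothetical sample: 100 words, 500 letters, 5 sentences.
# L = 500 and S = 5, so CLI = 29.4 - 1.48 - 15.8 = 12.12.
score = coleman_liau_from_counts(num_letters=500, num_words=100, num_sentences=5)
print(round(score, 2))
```

Longer words raise L and push the score up, while shorter sentences raise S and pull it down.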

The score usually ranges from around 0 to 16, where 0 represents the reading level of a kindergartner and 16 corresponds to a college graduate's reading level.

Here is a rough interpretation of the scores:

0 - Kindergarten

1-6 - Elementary School (1st to 6th grade)

7-8 - Middle School (7th to 8th grade)

9-12 - High School (9th to 12th grade)

13-16 - College level and above
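
The bands above can be sketched as a small lookup. This is only an illustration: the function name `grade_band` is hypothetical, and rounding to the nearest grade and grouping out-of-range scores with the closest band are assumptions, since the list does not say how fractional or out-of-range scores should be treated.

```python
def grade_band(cli_score):
    """Map a CLI score to the rough bands listed above."""
    # Assumption: fractional scores are rounded to the nearest grade.
    grade = round(cli_score)
    if grade < 1:
        # Assumption: anything below grade 1 is grouped with Kindergarten.
        return "Kindergarten"
    if grade <= 6:
        return "Elementary School"
    if grade <= 8:
        return "Middle School"
    if grade <= 12:
        return "High School"
    # Assumption: scores above 16 stay in the top band.
    return "College level and above"


for example in (0.2, 5.1, 7.8, 11.0, 13.7):
    print(example, "->", grade_band(example))
```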

With average CLI scores of roughly 13.4 to 13.7 across the three sets of summaries scored below, the text is estimated to be at the reading level of a college student. This means the text is fairly complex and may be hard to follow for readers without some college education.

If the intended audience for your abstract summaries is scholars, researchers, or people with higher education, this score could be appropriate. However, if you're targeting a general audience, you might want to simplify the text to make it more accessible.

In [9]:
#abstract code
import pandas as pd
from tqdm import tqdm
import re

# Load the DataFrame from the Excel file
input_file = "Abstract_Summary_t5_base_file.xlsx"
df = pd.read_excel(input_file)

# List to store the CLI score of each summary
cli_scores = []

# Iterate over the rows in the DataFrame with a progress bar
for index, row in tqdm(df.iterrows(), total=len(df), desc="Calculating CLI Scores"):
    # Get the generated summary
    generated_summary = str(row['Abstract_Summary_t5_base'])

    # Clean the generated summary by stripping non-ASCII characters
    generated_summary = re.sub(r'[^\x00-\x7F]+', '', generated_summary)

    # Count letters, sentence-ending punctuation marks (., !, ?) and whitespace-separated words
    num_letters = sum(c.isalpha() for c in generated_summary)
    num_sentences = generated_summary.count('.') + generated_summary.count('!') + generated_summary.count('?')
    num_words = len(generated_summary.split())

    # Calculate average letters and sentences per 100 words
    L = (num_letters / num_words) * 100 if num_words > 0 else 0
    S = (num_sentences / num_words) * 100 if num_words > 0 else 0

    # Calculate the Coleman-Liau Index
    cli_score = 0.0588 * L - 0.296 * S - 15.8

    # Append the CLI score to the list
    cli_scores.append(cli_score)

# Add the CLI scores to the DataFrame
df['CLI_Score'] = cli_scores

# Save the updated DataFrame to a new Excel file
output_file = "CLI_Scores_t5_base_abstract_total.xlsx"
df.to_excel(output_file, index=False)

# Print the average CLI score
print("\nAverage CLI Score:", sum(cli_scores) / len(cli_scores) if len(cli_scores) > 0 else 0)


Calculating CLI Scores: 100%|██████████| 1630/1630 [00:00<00:00, 6682.18it/s]



Average CLI Score: 13.730044964442536


In [10]:
#claims code
import pandas as pd
from tqdm import tqdm
import re

# Load the DataFrame from the Excel file
input_file = "Claims_Summary_t5_base_file.xlsx"
df = pd.read_excel(input_file)

# List to store the CLI score of each summary
cli_scores = []

# Iterate over the rows in the DataFrame with a progress bar
for index, row in tqdm(df.iterrows(), total=len(df), desc="Calculating CLI Scores"):
    # Get the generated summary
    claim_Summary = str(row['Claims_Summary_t5_base'])

    # Clean the claims summary by stripping non-ASCII characters
    generated_summary = re.sub(r'[^\x00-\x7F]+', '', claim_Summary)

    # Count letters, sentence-ending punctuation marks (., !, ?) and whitespace-separated words
    num_letters = sum(c.isalpha() for c in generated_summary)
    num_sentences = generated_summary.count('.') + generated_summary.count('!') + generated_summary.count('?')
    num_words = len(generated_summary.split())

    # Calculate average letters and sentences per 100 words
    L = (num_letters / num_words) * 100 if num_words > 0 else 0
    S = (num_sentences / num_words) * 100 if num_words > 0 else 0

    # Calculate the Coleman-Liau Index
    cli_score = 0.0588 * L - 0.296 * S - 15.8

    # Append the CLI score to the list
    cli_scores.append(cli_score)

# Add the CLI scores to the DataFrame
df['CLI_Score'] = cli_scores

# Save the updated DataFrame to a new Excel file
output_file = "CLI_Scores_t5_base_claims_total.xlsx"
df.to_excel(output_file, index=False)

# Print the average CLI score
print("\nAverage CLI Score:", sum(cli_scores) / len(cli_scores) if len(cli_scores) > 0 else 0)


Calculating CLI Scores: 100%|██████████| 1630/1630 [00:00<00:00, 6198.79it/s]



Average CLI Score: 13.371598078429653


In [11]:
#combined code
import pandas as pd
from tqdm import tqdm
import re

# Load the DataFrame from the Excel file
input_file = "Combined_Google_patent_Summary_t5_base_file.xlsx"
df = pd.read_excel(input_file)

# List to store the CLI score of each summary
cli_scores = []

# Iterate over the rows in the DataFrame with a progress bar
for index, row in tqdm(df.iterrows(), total=len(df), desc="Calculating CLI Scores"):
    # Get the generated summary
    Combined_Summary = str(row['Combined_Summary'])

    # Clean the combined summary by stripping non-ASCII characters
    generated_summary = re.sub(r'[^\x00-\x7F]+', '', Combined_Summary)

    # Count letters, sentence-ending punctuation marks (., !, ?) and whitespace-separated words
    num_letters = sum(c.isalpha() for c in generated_summary)
    num_sentences = generated_summary.count('.') + generated_summary.count('!') + generated_summary.count('?')
    num_words = len(generated_summary.split())

    # Calculate average letters and sentences per 100 words
    L = (num_letters / num_words) * 100 if num_words > 0 else 0
    S = (num_sentences / num_words) * 100 if num_words > 0 else 0

    # Calculate the Coleman-Liau Index
    cli_score = 0.0588 * L - 0.296 * S - 15.8

    # Append the CLI score to the list
    cli_scores.append(cli_score)

# Add the CLI scores to the DataFrame
df['CLI_Score'] = cli_scores

# Save the updated DataFrame to a new Excel file
output_file = "CLI_Scores_t5_base_combined_total.xlsx"
df.to_excel(output_file, index=False)

# Print the average CLI score
print("\nAverage CLI Score:", sum(cli_scores) / len(cli_scores) if len(cli_scores) > 0 else 0)


Calculating CLI Scores: 100%|██████████| 1630/1630 [00:00<00:00, 10433.77it/s]



Average CLI Score: 13.595977196087055
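
The three cells above differ only in the input file, the column name, and the output file, so the scoring loop could be factored into one helper. The sketch below is one possible refactoring, not the notebook's original code. The names `cli_for_text` and `average_cli` are hypothetical, and the helper works on plain strings so it can be tried without the Excel files.

```python
import re


def cli_for_text(text):
    """Coleman-Liau Index using the same counting rules as the cells above."""
    # Drop non-ASCII characters, as the notebook does.
    text = re.sub(r'[^\x00-\x7F]+', '', str(text))
    num_letters = sum(c.isalpha() for c in text)
    # Sentences are approximated by counting '.', '!' and '?'.
    num_sentences = sum(text.count(mark) for mark in '.!?')
    num_words = len(text.split())
    if num_words == 0:
        # Mirrors the notebook: L and S fall back to 0, giving -15.8.
        return -15.8
    L = num_letters / num_words * 100
    S = num_sentences / num_words * 100
    return 0.0588 * L - 0.296 * S - 15.8


def average_cli(texts):
    """Return (per-text scores, mean score) for an iterable of strings."""
    scores = [cli_for_text(t) for t in texts]
    mean = sum(scores) / len(scores) if scores else 0
    return scores, mean


# Made-up sentences, used only to exercise the helper.
scores, mean = average_cli(["The cat sat on the mat.", "Dogs bark! Do cats?"])
print(scores, mean)
```

With pandas, the per-row scores for one file could then be obtained with something like `df['CLI_Score'] = df['Abstract_Summary_t5_base'].map(cli_for_text)`. Note that, as in the cells above, an empty summary scores -15.8 and so lowers the average; whether such rows should be excluded is a separate decision.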
