## Errors addressed in Joshua's code

1. Translate data to English


2. Remove multiple commas


3. Correct truncated Questions at the end of a sentence


4. Correct invalid URL


5. Add space before open bracket


6. Correct truncated questions at the begining of a sentence or word


7. Capitalise first word in a sentence


8. Add Fullstop at the end of a sentence


9. Capitalise acronyms


10. Format number with commas


11. Remove whitespace

## Note, Please install java on your system

# Description
This script constitutes a data cleaning pipeline designed for processing the initial Q&A dataset. It accepts the unrefined Q&A dataset as input and yields a refined version suitable for further applications.

Following is a list of addressed errors (Please keep adding any additional errors addressed in future) -
1. Translation to English
2. Replace multiple commas
3. Extra character at the start of a Q/A
4. Numbers not formatted ("....1236337km2..." to "1,236,337km2...")
5. Invalid URL



Solved using Language Tool function
6. Answer not started in UpperCase

7. Missing apostrophe
8. Compound words without hyphen
9. Proper Noun
10. Incorrect capitalisation
11. Missing space (Done?)

Overall, the code aims to clean, correct, and preprocess data in the Q&A DataFrame, focusing on specific errors.

# <b>Instructions for running the script </b>

1. Install all neccessary python packages.


2. Make sure the source dataset is a .CSV file with a single table format that includes a 'Question' and an 'Answer' column.


3. Change the directory/path to your directory in 'def main()' method for the variable 'CSV_files'(source file path), 'filtered_data_file'(unclean/eliminated records) and 'cleaned_df_dir' (final clean dataset)


<b>NOTE: Please change the version number in the title, incase of any additions/changes to script inorder to identify the latest version of the script</b>.

### Install required python packages

In [1]:
#!pip install language_tool_python spacy textblob

In [2]:
#!python -m spacy download en_core_web_sm

In [3]:
#!pip install googletrans==4.0.0-rc1

In [3]:
#!pip install pyenchant

In [2]:
#!pip uninstall language-tool-python
#!pip install language-tool-python

## Install necessary libraries

In [4]:
import pandas as pd
import os
import csv
import re
import language_tool_python
import time
import cProfile
from textblob import TextBlob
import spacy
from urllib.parse import urlparse, urlunparse
import locale
import nltk
from nltk.corpus import stopwords
import re
from googletrans import Translator
from tqdm import tqdm

## Activate the java

In [6]:
# Set the JAVA_HOME environment variable to the path of your Java installation
os.environ["JAVA_HOME"] = "C:/Path/To/Java"

# import the language_tool_python module
tool = language_tool_python.LanguageTool('en-GB')

Downloading LanguageTool 5.7: 100%|█████████████████████████████████████████████████| 225M/225M [03:36<00:00, 1.04MB/s]
Unzipping C:\Users\rosem\AppData\Local\Temp\tmpnpr7_967.zip to C:\Users\rosem\.cache\language_tool_python.
Downloaded https://www.languagetool.org/download/LanguageTool-5.7.zip to C:\Users\rosem\.cache\language_tool_python.


In [8]:
# Read the data
df = pd.read_csv('solar_panel_modified_data.csv')
df.head(2)

Unnamed: 0,id,Question,Answer
0,1,What is the main purpose of this paper?,The main purpose of this paper is to present a...
1,2,What are the key words associated with this pa...,The key words associated with this paper are R...


In [10]:
# Check the number of rows and column
df.shape

(3368, 3)

## Translate data to English

In [9]:
def translate_to_english(qa_df):
    translator = Translator()
    count = 0
    qa_list = []
    translated_records = []

    # Select Q&A from each record
    for i in range(qa_df.shape[0]):
        qa_list.append([qa_df['Question'][i], qa_df['Answer'][i]])
    
    print("Translating dataset:")
    for i, sent in tqdm(enumerate(qa_list), total=len(qa_list)):
        detected_language_Q = 'en'
        detected_language_A = 'en'

        try:
            # Detect the language of the text
            detected_language_Q = translator.detect(sent[0]).lang
            detected_language_A = translator.detect(sent[1]).lang

            if detected_language_Q != 'en':
                count += 1
                # Translate the text to English
                translation_Q = translator.translate(sent[0], src=detected_language_Q, dest='en')
                qa_list[i][0] = translation_Q.text

            if detected_language_A != 'en':
                count += 1
                # Translate the text to English
                translation_A = translator.translate(sent[1], src=detected_language_A, dest='en')
                qa_list[i][1] = translation_A.text

        except Exception as e:
            # Future scope to find another package that can cover unidentified languages
            pass

    # Unzip the list of tuples into separate lists for each column
    Question, Answer = zip(*qa_list)

    # Replace the existing columns with the new translated values
    qa_df['Question'] = Question
    qa_df['Answer'] = Answer

    return qa_df, count, translated_records

# Call the function
translated_df, translation_count, _ = translate_to_english(df)

# Output
print("\nTranslated DataFrame:")
print(translated_df)
print("\nNumber of Records Translated:", translation_count)

Translating dataset:


100%|██████████████████████████████████████████████████████████████████████████████| 3368/3368 [35:41<00:00,  1.57it/s]


Translated DataFrame:
        id                                           Question  \
0        1            What is the main purpose of this paper?   
1        2  What are the key words associated with this pa...   
2        3                                     What is MIVES?   
3        4             What is the advantage of PV/T over PV?   
4        7  What are the measures necessary to reduce ener...   
...    ...                                                ...   
3363  4286  What algorithm was used in Maghraoui et al. (2...   
3364  4287  What is the DOI for the journal article "RMSE ...   
3365  4288  What is the title of the paper presented at th...   
3366  4289  What model did Sauer et al. apply to forecasti...   
3367  4290  How many machine-learning methods were studied...   

                                                 Answer  
0     The main purpose of this paper is to present a...  
1     The key words associated with this paper are R...  
2     MIVES stands for




## Save the translated data

In [2]:
# Save the data translated
df.to_csv('solar_trans_data.csv', index=False)

## Define the errors and solution

In [8]:
def count_multiple_commas(text):
    """
    This function counts the occurrences of consecutive commas in a given text.

    Args:
        text (str): The input text to be analyzed.

    Returns:
        int: The count of consecutive commas.

        If the input is not a string, it returns the input as is.
    """
    if isinstance(text, str): # Checking if the input is a string
        pattern = re.compile(r',{2,}') # Defining a regular expression pattern to find consecutive commas
        return len(re.findall(pattern, text)) # Counting the occurrences of the pattern and returning the count
    else:
        return text  # If the input is not a string, return it as is
    
    
def replace_multiple_commas(text):
    """
    This function replaces consecutive commas with a single comma followed by a space in a given text.

    Args:
        text (str): The input text to be processed.

    Returns:
        str: The modified text with consecutive commas replaced.

        If the input is not a string, it returns the input as is.
    """
    if isinstance(text, str):  # Checking if the input is a string
        pattern = re.compile(r',{2,}')  # Defining a regular expression pattern to find consecutive commas
        corrected_text = re.sub(pattern, ', ', text)  # Replacing consecutive commas with ', ' in the text

        if corrected_text != text:  # Checking if any replacement was made
            print(f"\nOriginal text: {text}")
            print(f"Corrected text: {corrected_text}")

        return corrected_text  # Returning the modified text
    else:
        return text  # If the input is not a string, return it as is



'''
def comma_cluster_removal_df(df):
    total_comma_count = 0

    for column in df.columns:
        df[column] = df[column].apply(replace_multiple_commas)
        total_comma_count += df[column].apply(count_multiple_commas).sum()

    print(f"Total comma count: {total_comma_count}")
    return df, total_comma_count
'''


def comma_cluster_removal_df(df):
    """
    This function processes a DataFrame to remove comma clusters in its columns.

    Args:
        df (pd.DataFrame): The DataFrame to be processed.

    Returns:
        pd.DataFrame: The modified DataFrame with comma clusters removed.
        int: The total count of commas removed.

    Note:
        This function assumes that 'replace_multiple_commas' and 'count_multiple_commas' functions are defined elsewhere.
    """
    total_comma_count = 0  # Initializing a variable to keep track of the total comma count

    # Loop through each column in the DataFrame
    for column in df.columns:
        df[column] = df[column].apply(replace_multiple_commas)  # Apply 'replace_multiple_commas' to the column
        total_comma_count += df[column].apply(count_multiple_commas).sum()  # Count commas and add to total

    print(f"Total comma count: {total_comma_count}")  # Print the total comma count
    return df, total_comma_count  # Return the modified DataFrame and total comma count



def is_truncated(sentence):
    """
    This function checks if a sentence is truncated, meaning it lacks proper sentence-ending punctuation.

    Args:
        sentence (str): The input sentence to be analyzed.

    Returns:
        bool: True if the sentence is truncated, False otherwise.
    """
    # Ensure the sentence is converted to a string
    sentence = str(sentence)
    # Define a list of sentence-ending punctuation marks
    sentence_endings = ['.', '!', '?', '."', '!"', '?"', '.”', '!”', '?”']

    # Check if the sentence is empty or very short
    if len(sentence) < 2:
        return False  # Not truncated

    # Check if the last character of the sentence is a sentence-ending punctuation mark
    if sentence[-1] in sentence_endings:
        return False  # Not truncated
    else:
        print ("\n" + sentence+"\n")
        return True   # Truncated
    
    
'''
def is_undesirable_question_to_count(question):
    if isinstance(question, str):
        count_phrases = [

            "what is the title",
            "what is the research topic",
            "what data is used in the study",
            "what data sets are collected in this study",
            "how did the study",
            "what does the arrow in figure 6 represent",
            "what was the publication date of the study",
            "what is the title of the paper",
            "what is the purpose of this study",
            "what was the goal of the study",
            "what data was utilized in the study",
            "what is the source of funding for this study",
            "what was the aim of this study",
            "what are the objectives of the study",
            "what was the main focus of the study",
            "what were the main conclusions of the study",
            "what methods were used in the study",
            "what ratio was used for the analysis",
            "what models are shown in fig 4",
            "what is the conclusion of this study",
            "what do the innovations of this study enable",
            "what is the article about",
            "what is the doi number for the article",
            "where can the tool be accessed",
            "what data has been used",
            "what were the results of the study",
            "what are the key findings of this study",
            "what data sources were used in this study",
            "what are the limitations of this study",
            "where was the research conducted",
            "what are the key words for this article",
            "what is the main objective of this study",
            "what evidence supports the research",
            # Add more phrases here
        ]
        return any(phrase in question.lower() for phrase in count_phrases)
    return False
'''


def is_undesirable_question_to_count(question):
    """
    This function checks if a given question is considered an 'undesirable' question for counting purposes.

    Args:
        question (str): The input question to be analyzed.

    Returns:
        bool: True if the question is undesirable for counting, False otherwise.
    """
    if isinstance(question, str):  # Checking if the input is a string
        count_phrases = [
            # List of phrases that indicate undesirable questions
            # If any of these phrases are found in the question (case insensitive), the function returns True
            "what is the title",
            "what is the research topic",
            "what data is used in the study",
            "what data sets are collected in this study",
            "how did the study",
            "what does the arrow in figure 6 represent",
            "what was the publication date of the study",
            "what is the title of the paper",
            "what is the purpose of this study",
            "what was the goal of the study",
            "what data was utilized in the study",
            "what is the source of funding for this study",
            "what was the aim of this study",
            "what are the objectives of the study",
            "what was the main focus of the study",
            "what were the main conclusions of the study",
            "what methods were used in the study",
            "what ratio was used for the analysis",
            "what models are shown in fig 4",
            "what is the conclusion of this study",
            "what do the innovations of this study enable",
            "what is the article about",
            "what is the doi number for the article",
            "where can the tool be accessed",
            "what data has been used",
            "what were the results of the study",
            "what are the key findings of this study",
            "what data sources were used in this study",
            "what are the limitations of this study",
            "where was the research conducted",
            "what are the key words for this article",
            "what is the main objective of this study",
            "what evidence supports the research",

            # Add more phrases here as needed
        ]
        return any(phrase in question.lower() for phrase in count_phrases)  # Checking if any phrase is present in the question
    return False  # If the input is not a string, return False



# Assuming you have defined your functions is_undesirable_question_to_count and is_truncated properly
def count_truncated_questions_and_answers_in_df(df, filtered_data_file):
    """
    This function processes a DataFrame to count truncated questions and answers, and filter out undesirable data.

    Args:
        df (pd.DataFrame): The DataFrame to be processed.
        filtered_data_file (str): The file path to save the filtered data.

    Returns:
        int: The count of undesirable questions.
        int: The count of truncated questions.
        int: The count of truncated answers.
        pd.DataFrame: The DataFrame containing remaining data without truncated questions and answers.
    """
    #df = pd.read_csv(file_path)
    df.info()
    columns_with_spaces = df.columns.tolist()
    print(columns_with_spaces)
    questions = df['Question']
    answers = df['Answer']

    questions_count = df['Question'].apply(is_undesirable_question_to_count).sum()

    # Count truncated questions
    truncated_questions_count = df['Question'].apply(is_truncated).sum()

    # Count truncated answers
    truncated_answers_count = df['Answer'].apply(is_truncated).sum()

    # Filter out truncated rows
    truncated_questions = []
    truncated_answers = []

    # Filter out truncated rows
    not_truncated_indices = []
    for i in range(len(df)):
        if not (is_truncated(questions[i]) or is_truncated(answers[i])):
            not_truncated_indices.append(i)
        else:
            if is_truncated(questions[i]):
                truncated_questions.append(questions[i])
                print(f"Truncated Question {i}: {questions[i]}")
            if is_truncated(answers[i]):
                truncated_answers.append(answers[i])
                print(f"Corresponding Question {i}: {questions[i]}")
                print(f"Truncated Answer  {i}: {answers[i]} \n")
    df = df.iloc[not_truncated_indices]

    # Filter out questions and their corresponding answers
    filtered_indices = [i for i, question in enumerate(df['Question']) if is_undesirable_question_to_count(question)]
    filtered_data = pd.DataFrame({
        'Question': df['Question'].iloc[filtered_indices],
        'Answer': df['Answer'].iloc[filtered_indices]
    })

    # Print the number of rows in the filtered data
    print("Number of rows in filtered data:", len(filtered_data))

    # Create a new DataFrame for remaining data without truncated QnA
    remaining_indices = [i for i in range(len(df)) if i not in filtered_indices]
    remaining_data = df.iloc[remaining_indices]

    # Print the number of rows in the remaining data
    print("Number of rows in remaining data:", len(remaining_data))

    # Save the filtered data
    filtered_data.to_csv(filtered_data_file, index=False, encoding='utf-8')

    # Save the remaining data
    #remaining_data.to_csv(remaining_data_file, index=False, encoding='utf-8')

    return questions_count, truncated_questions_count, truncated_answers_count, remaining_data



def save_filtered_data(file_path, filtered_data):
    """
    This function saves filtered data to a CSV file.

    Args:
        file_path (str): The file path where the filtered data will be saved.
        filtered_data (list): List of dictionaries containing 'Question' and 'Answer' pairs.

    Returns:
        None
    """
    with open(file_path, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['Question', 'Answer']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()  # Write the header row
        writer.writerows(filtered_data)  # Write the filtered data
        

def count_errors(text):
    """
    This function counts the number of errors in a given text.

    Args:
        text (str): The input text to be checked for errors.

    Returns:
        int: The number of errors found in the text.
    """
    tool = language_tool_python.LanguageTool('en-GB')  # Initialize a language checking tool
    matches = tool.check(text)  # Check the text for errors
    return len(matches)  # Return the number of errors found



def LanguageTool_df(df):
    """
    This function applies language correction using LanguageTool to the cells of a DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame to be processed.

    Returns:
        pd.DataFrame: The DataFrame with language corrections applied.
    """
    tool = language_tool_python.LanguageTool('en-GB')  # Initialize a language checking tool

    corrected_content = []  # List to store corrected content

    #original_error_count = 0
    #corrected_error_count = 0

    for i in tqdm(range(len(df))):
        corrected_row = []  # List to store corrected row content
        for cell in df.iloc[i]:  # Iterate through each cell in the row
            if isinstance(cell, str):  # Check if the cell contains a string
                corrected_cell = tool.correct(cell)  # Apply language correction
                corrected_cell = corrected_cell.replace(' Answer', 'Answer')  # Additional specific correction
            else:
                corrected_cell = cell  # If not a string, keep the cell as is
            corrected_row.append(corrected_cell)  # Add the corrected cell to the row
        corrected_content.append(corrected_row)  # Add the corrected row to the content list

    LanguageTool_corrected_df = pd.DataFrame(corrected_content, columns=df.columns)  # Create a new DataFrame with corrections

    return LanguageTool_corrected_df  # Return the DataFrame with language corrections



# Assuming you have a function count_errors defined elsewhere in your code
# Example usage:
# corrected_df, original_errors, corrected_errors = LanguageTool_df(your_dataframe)

def validate_and_correct_url_in_dataframe(df):
    """
    This function validates and corrects URLs in a DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame to be processed.

    Returns:
        pd.DataFrame: The DataFrame with validated and corrected URLs.
    """
    def validate_and_correct_url(url):
        """
        This nested function validates and corrects a single URL.

        Args:
            url (str): The input URL to be validated and corrected.

        Returns:
            str: The corrected URL.
        """
        if isinstance(url, str):
            try:
                parsed_url = urlparse(url)
                if not parsed_url.scheme:
                    # If the URL doesn't have a scheme (e.g., http://), add "http://"
                    corrected_url = 'http://' + url
                    parsed_url = urlparse(corrected_url)

                if not parsed_url.netloc:
                    # If the URL doesn't have a valid network location, return it as is
                    return url

                # Replace "www" with "www." if it doesn't have a dot after it
                corrected_netloc = re.sub(r'www(?!\.)', 'www.', parsed_url.netloc)

                # Replace "com" with ".com" if it doesn't have a dot before it, and is not followed by a forward slash
                corrected_netloc = re.sub(r'(?<!\.)com(?!/)', '.com', corrected_netloc)

                 # Replace "com" with ".org" if it doesn't have a dot before it, and is not followed by a forward slash
                corrected_netloc = re.sub(r'(?<!\.)org(?!/)', '.org', corrected_netloc)


               # Remove dots from ".com" in the middle of the string
                corrected_netloc = re.sub(r'(?<=\w)\.com(?=\w)', 'com', corrected_netloc)

                # Remove dots from ".org" in the middle of the string
                corrected_netloc = re.sub(r'(?<=\w)\.org(?=\w)', 'org', corrected_netloc)


                 # Replace ".com" at the start of a word with "com"
                corrected_netloc = re.sub(r'\.com(?=\w)', 'com', corrected_netloc)

                 # Replace ".com" at the start of a word with "org"
                corrected_netloc = re.sub(r'\.org(?=\w)', 'org', corrected_netloc)


                # Reassemble the URL with the corrected netloc
                parsed_url = parsed_url._replace(netloc=corrected_netloc)
                corrected_url = urlunparse(parsed_url)

                return corrected_url

            except ValueError:
                # Invalid URL, return it as is
                return url
        else:
            # If the input is not a string (e.g., it's a float or other non-string type),
            # return it as is
            return url

    if 'Full_Text_URL' in df.columns:
        df['Full_Text_URL'] = df['Full_Text_URL'].apply(validate_and_correct_url)

    return df



def add_space_before_opening_bracket(df):
    """
    This function adds a space before an opening bracket in string cells of a DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame to be processed.

    Returns:
        pd.DataFrame: The DataFrame with spaces added before opening brackets.
    """
    #df = pd.read_csv(input_file_path)

    for column in df.columns:
        df[column] = df[column].apply(lambda cell: re.sub(r'([A-Za-z])\(', r'\1 (', cell) if isinstance(cell, str) else cell)

    #df.to_csv(ammended_space_file, index=False)

    return df   # Return the output file path after processing



# Function to correct the truncated starting word of the answers
def correct_truncated_start_word(text: str) -> str:
    if isinstance(text, str):
        # Load the spaCy English model
        nlp = spacy.load("en_core_web_sm")
        doc = nlp(text)

        # Checking if the start token is capitalized or not
        if doc[0].is_title == False:
            first_token = doc[0].text
            if first_token[0].isdigit() and not first_token[-1].isalnum():
                # Correct the word without modifying the '%' symbol
                corrected_word = ''.join([str(TextBlob(part).correct()) if not part.isalnum() else part for part in first_token.split('%')])
                corrected_text = corrected_word + " " + " ".join([token.text for token in doc[1:]])
                return corrected_text.capitalize()

        return text
    else:
        return text

def add_full_stops(df):
# function to add full stops to the end of each “Answer” column.
    df['Answer']=df['Answer'].apply(lambda answer: answer.strip()+  '.' if isinstance(answer,str) else answer)
    return df



# Download NLTK data for part-of-speech tagging
nltk.download('punkt')
nltk.download('stopwords')

# POS tagger
tagger = nltk.pos_tag

# Function to check if a word is a noun or other parts of speech
def is_noun_or_other(word):
    tagged = tagger([word])
    pos_tags = ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'JJ', 'JJR', 'JJS', 'WP', 'IN']
    return tagged[0][1] in pos_tags

# Function to identify acronyms that are not stopwords and are nouns or other parts of speech
def identify_and_capitalize_acronyms(df, acronyms_df, persons_df):
    # Preprocess acronyms: remove whitespace and lowercase
    acronyms_df['Acronyms'] = acronyms_df['Acronyms'].str.replace(r'\s+', '').str.lower()
    acronyms = set(acronyms_df['Acronyms'])

    # Get the set of English stopwords
    english_stopwords = set(stopwords.words('english'))

    # Preprocess DataFrame
    df['Question'] = df['Question'].str.lower()
    df['Answer'] = df['Answer'].str.lower()

    # Load names of persons into a set
    persons_set = set(persons_df['name'].str.lower())

    for index, row in df.iterrows():
        q_words = re.findall(r'\b\w+\b', row['Question'])
        a_words = re.findall(r'\b\w+\b', row['Answer'])

        q_matches = [word for word in q_words if word in acronyms and word not in english_stopwords and is_noun_or_other(word) and word not in persons_set]
        a_matches = [word for word in a_words if word in acronyms and word not in english_stopwords and is_noun_or_other(word) and word not in persons_set]

        # Capitalize acronyms
        for word in q_matches:
            row['Question'] = row['Question'].replace(word, word.upper())

        for word in a_matches:
            row['Answer'] = row['Answer'].replace(word, word.upper())

         # Capitalize names from persons_df
        for word in q_words:
            if word in persons_set:
                row['Question'] = row['Question'].replace(word, word.title())

        for word in a_words:
            if word in persons_set:
                row['Answer'] = row['Answer'].replace(word, word.title())

    return df



def punc_removal_at_start(input_df):
    """
    This function removes non-alphanumeric characters from the beginning of strings in the 'Question' and 'Answer' columns.

    Args:
        input_df (pd.DataFrame): The input DataFrame.

    Returns:
        pd.DataFrame: A new DataFrame with cleaned 'Question' and 'Answer' columns.
    """
    try:
        # Create a copy of the input DataFrame to avoid modifying the original
        df = input_df.copy()

        # Define a function to clean a single string by removing non-alphanumeric characters
        def clean_text(text):
            if isinstance(text, str):
                # Find and remove non-alphanumeric characters from the beginning of the text
                cleaned_text = re.sub(r'^[^a-zA-Z0-9\s]+', '', text)
                return cleaned_text
            else:
                return text  # Return non-string values as is

        # Apply the clean_text function to the 'Question' and 'Answer' columns
        df['Question'] = df['Question'].apply(clean_text)
        df['Answer'] = df['Answer'].apply(clean_text)

        return df

    except Exception as e:
        print("An error occurred:", str(e))
        
        
        
# Function to format numbers with commas, preserving non-numeric characters
def format_number_with_commas(number_str):
    try:
        # Check if the input is a numeric value
        if isinstance(number_str, (int, float)):
            return locale.format_string("%d", int(number_str), grouping=True)

        # Use regular expression to find and format numbers in the string
        formatted_str = re.sub(r'(\d[\d,]*)', lambda x: locale.format_string("%d", int(x.group(0).replace(',', '')), grouping=True), number_str)
        return formatted_str
    except (ValueError, TypeError):
        return number_str

# Function to format numbers in a DataFrame
def format_numbers_in_dataframe(df):
    df_copy = df.copy()  # Create a copy of the DataFrame to avoid modifying the original
    columns_to_format = ['Question', 'Answer']  # Specify the columns to format
    for column in columns_to_format:
        df_copy[column] = df_copy[column].apply(format_number_with_commas)

    return df_copy

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rosem\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rosem\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
def main():
    # List of CSV files to process
    start_time = time.time()

    csv_files = [
       # "C:/Users/HP/Documents/Sam/Data Science Voluteer/CSV/Geothermal_energy_wellcome_20230717.csv",
        #"C:/Users/Joshua Giwa/Downloads/QnAs_generated_27_07_2023/Solar power_wellcome_20230717.csv",
        #"C:/Users/Joshua Giwa/Downloads/QnAs_generated_27_07_2023/Solar power_gatesopen_20230717.csv",
        #"C:/Users/Joshua Giwa/Downloads/QnAs_generated_27_07_2023/Geothermal+energy_f1000_20230717.csv",
       # "C:/Users/Joshua Giwa/Downloads/QnAs_generated_27_07_2023/Geothermal_energy_wellcome_20230717.csv",
      #  "C:/Users/Joshua Giwa/Downloads/QnAs_generated_27_07_2023/Carbon+footprint_wellcome_20230717.csv"
      #  "C:/Users/Joshua Giwa/Downloads/test_dataset.csv"
        #"C:/Users/LolaPwasanga/Downloads/datarecords.csv"
       "C:/Users/rosem/Documents/TiiQi/solar_trans_data.csv"
        #'C:/Users/Joshua Giwa/Downloads/QnAs_generated_26_07_2023/QnA_Biomass_f1000_20230717.csv'
       #'/content/test_8086.csv'
        ## Add more file paths here as needed
    ]

    total_questions_count = 0
    total_truncated_questions_count = 0
    total_truncated_answers_count = 0


    for csv_file in csv_files:
        df = pd.read_csv(csv_file)
        # Translate the Q&As in the df to English and return translated df and count
        #df, translated_count, translated_records= translate_to_english(df)

        #print(df.columns)

        #df = add_full_stops(df)

        # Removing any punctuation if any from the start of the answer
        df['Answer'] = df['Answer'].str.replace(r'^[\.\:\- ]*', '', regex=True)     # regex to replace the punctuations from the start of the answer

        # Correcting answers that are truncated in the start of the answer
        df['Answer'] = df['Answer'].apply(correct_truncated_start_word)

       # Load AcronymsFile_2.csv into a DataFrame
        #acronyms_file = 'C:/Users/Joshua Giwa/Downloads/AcronymsFile_2.csv'
        #acronyms_df = pd.read_csv(acronyms_file)

        # Load names_of_persons.csv into a DataFrame
        #persons_file = 'C:/Users/Joshua Giwa/Downloads/names_of_persons.csv'
        #persons_df = pd.read_csv(persons_file)

       # result_df = identify_and_capitalize_acronyms(df, acronyms_df, persons_df)

        #locale.setlocale(locale.LC_ALL, 'en_GB.UTF-8')
        locale.setlocale(locale.LC_ALL, 'en_GB') #modified_ri
        formatted_df = format_numbers_in_dataframe(df)


        #df['Full_Text_URL'] = df['Full_Text_URL'].apply(validate_and_correct_url)

        # corrected_file_path = os.path.join(os.path.expanduser("~"), "Downloads", os.path.basename(csv_file).replace('.csv', '_corrected.csv'))
        #remaining_data_file = os.path.join(os.path.expanduser("~"), "Downloads", os.path.basename(csv_file).replace('.csv', '_remaining_data.csv'))
        #filtered_data_file = os.path.join(os.path.expanduser("~"), "content", os.path.basename(csv_file).replace('.csv', '_filtered_questions_non_optimized.csv'))
        filtered_data_file = os.path.join(os.path.basename(csv_file).replace('.csv', '_filtered_questions_non_optimized.csv')) #modified_riya

        #ammended_space_file = os.path.join(os.path.expanduser("~"), "Downloads", os.path.basename(csv_file).replace('.csv', 'bracket_spaced.csv'))
        #print(filtered_data_file)
        punc_removal_at_start_df = punc_removal_at_start(formatted_df)
        questions_count, truncated_questions, truncated_answers, remaining_data = count_truncated_questions_and_answers_in_df(punc_removal_at_start_df, filtered_data_file)
        comma_cluster_removed_df, total_comma_count = comma_cluster_removal_df(remaining_data)
        space_before_bracket_ammended_df = add_space_before_opening_bracket(comma_cluster_removed_df)
        print("LanguageTool:")
        LanguageTool_corrected_df= LanguageTool_df(space_before_bracket_ammended_df)
        validate_and_correct_url_in_df = validate_and_correct_url_in_dataframe(LanguageTool_corrected_df)



        # Save the processed DataFrame with "updated" added to the name
        cleaned_df =  validate_and_correct_url_in_df.copy()
        cleaned_df_filename = os.path.basename(csv_file).replace('.csv', '_cleaned.csv')

        # Assuming that you have a directory where you want to save the updated DataFrames
        #cleaned_df_dir = os.path.join(os.path.expanduser("~"), "content", "cleaned_QnAs_non_optimized")
        cleaned_df_dir = os.path.join("cleaned_QnAs_non_optimized") #modified_riya

        if not os.path.exists(cleaned_df_dir):
            os.makedirs(cleaned_df_dir)

        cleaned_df_file_path = os.path.join(cleaned_df_dir, cleaned_df_filename)

        # Save the updated DataFrame as a CSV file
        cleaned_df.to_csv(cleaned_df_file_path, index=False, encoding='utf-8')


        # Now, you can use the updated DataFrame for further processing if needed



        total_questions_count += questions_count
        total_truncated_questions_count += truncated_questions
        total_truncated_answers_count += truncated_answers


        print(f"File: {csv_file}\n")
        #print(f"Total translated Q&A: {translated_count}")
        print(f"Total undesirable questions: {questions_count}")
        print(f"Total Truncated Questions: {truncated_questions}")
        print(f"Total Truncated Answers: {truncated_answers}")
       #print(f"Original typo/error found: {original_error_count}")
        print(f"excessive comma occurence: {total_comma_count}")
        #print(f"Corrected typo/error count: {corrected_error_count}")


        print(f"(undesirable_questions + Truncated Questions + Truncated Answers + excessive comma occurence): {questions_count + truncated_questions + truncated_answers +  total_comma_count}\n")

  #  print(f"Total Questions Count: {total_questions_count}")
  #  print(f"Total Truncated Questions Count: {total_truncated_questions_count}")
  #  print(f"Total Truncated Answers Count: {total_truncated_answers_count}")
  #  print(f"Total (Questions + Truncated Questions + Truncated Answers): {total_questions_count + total_truncated_questions_count + total_truncated_answers_count}")

    end_time = time.time()

    elapsed_time_seconds = end_time - start_time
    elapsed_time_minutes = elapsed_time_seconds / 60

    print(f"Script ran for {elapsed_time_minutes:.2f} minutes.")

if __name__ == "__main__":
    main()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3368 entries, 0 to 3367
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3368 non-null   int64 
 1   Question  3368 non-null   object
 2   Answer    3330 non-null   object
dtypes: int64(1), object(2)
memory usage: 79.1+ KB
['id', 'Question', 'Answer']

What kind of short-term support strategies are commonly used for


What is the DOI of the article "ISO 14,043


How can building integrated photovoltaics (BIPV)


What techniques were compared in the statistical and


What was the modified collector efficiency increased upto according to Md.R. Al-Mamun et al. Solar Energy 264 (2,023) 111,998


What is the purpose of the research paper "Adsorption of gold(III) ions from wastewater using Mg-doped hybrid nanomaterial


What parameters were used to calculate the velocity f′(η), temperature θ(η), and entropy generation NG, drag coefficient (Cf),


What journal published 

100%|██████████████████████████████████████████████████████████████████████████████| 2787/2787 [04:22<00:00, 10.62it/s]

File: C:/Users/rosem/Documents/TiiQi/solar_trans_data.csv

Total undesirable questions: 68
Total Truncated Questions: 87
Total Truncated Answers: 542
excessive comma occurence: 5941378
(undesirable_questions + Truncated Questions + Truncated Answers + excessive comma occurence): 5942075

Script ran for 39.34 minutes.





### !jupyter notebook list


In [None]:
from jupyter_core.paths import jupyter_config_dir
jupyter_dir = jupyter_config_dir()
jupyter_dir


'C:\\Users\\Joshua Giwa\\.jupyter'