# Final DataSet Creation
-------------------

## Planning

   1. First, read all available data/parquet files.
   2. Manipulate the data using `Gemini`:
      
      2.1 Select one line from the human-generated data and rewrite it using `Gemini`. This is how we create text that combines LLM-generated and human-generated content.

      2.2 Using a dedicated function, count the number of words generated by the `LLM (Gemini)`; the word count of the `Human` text is already available.

      2.3 Store the percentage of `LLM`-generated text in a new column, using the formula:
          
          llm_generated_perc = word count of LLM-generated text / total word count of the text

      2.4 Repeat the same function for different numbers of lines.
      Example: first regenerate one line of the human-generated text, then two lines, then three, and so on. This should give the model examples with varying proportions of LLM-generated text.

   3. Finally, combine all the functions to create the final dataset.
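
Steps 2.1 to 2.4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's final implementation: `count_words`, `mix_text`, and the `rewrite` argument are hypothetical names, `rewrite` stands in for the Gemini call so the sketch runs offline, words are counted by splitting on whitespace, and the first `k` sentences are the ones rewritten.

```python
def count_words(text):
    """Count whitespace-separated words (an assumed definition of 'word')."""
    return len(text.split())


def mix_text(sentences, k, rewrite):
    """Rewrite the first k sentences and report the LLM share of the words.

    `rewrite` is a stand-in for the Gemini call (hypothetical helper).
    Returns (mixed_text, llm_generated_perc).
    """
    k = min(k, len(sentences))
    llm_part = [rewrite(s) for s in sentences[:k]]
    human_part = sentences[k:]
    llm_words = sum(count_words(s) for s in llm_part)
    total_words = llm_words + sum(count_words(s) for s in human_part)
    perc = llm_words / total_words if total_words else 0.0
    return " ".join(llm_part + human_part), perc


# Offline stand-in: upper-casing keeps the word count unchanged.
fake_rewrite = str.upper
sentences = ["The sky is blue.", "Birds fly south.", "Winter is near."]

# Step 2.4: repeat with 1, 2, 3, ... rewritten sentences.
results = [mix_text(sentences, k, fake_rewrite) for k in range(1, len(sentences) + 1)]
for k, (text, perc) in enumerate(results, start=1):
    print(k, round(perc, 2), text)
```

Note that the formula as written yields a fraction between 0 and 1; multiply by 100 if a percentage is wanted in the column.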

   





### 1. Loading the dataset
--------------------

In [None]:
%%capture
!pip install fastparquet
!pip install langchain_google_genai
!pip install langchain

In [None]:
## importing necessary libraries

import pandas as pd
import numpy as np

from tqdm import tqdm

## importing libraries for generative ai functionality
from langchain_google_genai import GoogleGenerativeAI
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import PromptTemplate

In [None]:
## setting up the environment

from google.colab import userdata
GENAI_API_KEY = userdata.get("GENAI_API_KEY")

In [None]:
## loading the data

detectllm_data = pd.read_parquet("/content/drive/MyDrive/BTP 8th SEM/Data/Dataset/DetectLLM.parquet", engine = 'fastparquet')
detectllm_data.shape

In [None]:
## take a look at the data

detectllm_data.head()

In [None]:
## take a look at the last rows of the data

detectllm_data.tail()

#### 1.1 Separating Human-Written Data Only
-------------

In [None]:
## human data (copied so that new columns can be added later without a SettingWithCopyWarning)
detectllm_data_human = detectllm_data[detectllm_data['Label'] == 'Human'].copy()

## finally checking the output data
# detectllm_data_human.shape
detectllm_data_human.Label.value_counts()

### 2. Data Manipulation
---------------------

In [None]:
'''
   1. Set up a function, sentence_splitter, that splits the text on full stops and returns a list of the resulting sentences.
   2. Next, a function called regenerate_sentence takes such a list as input and regenerates each sentence using an LLM.
   3. For every list, each element is rewritten and the rewritten elements are stored in a new column, regenerated_text.
'''


'''
   Initializing the model and the chain
   using GoogleGenerativeAI
'''

## defining safety_settings so that blocked responses do not raise errors
# safety_settings = [
#   {
#   "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
#   "threshold": "BLOCK_NONE"
#   },
#   {
#   "category": "HARM_CATEGORY_HATE_SPEECH",
#   "threshold": "BLOCK_NONE"
#   },
#   {
#   "category": "HARM_CATEGORY_HARASSMENT",
#   "threshold": "BLOCK_NONE"
#   },
#   {
#   "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
#   "threshold": "BLOCK_NONE"
#    },
# ]

safety_settings = {
    "HARM_CATEGORY_SEXUALLY_EXPLICIT": {"threshold": "BLOCK_NONE"},
    "HARM_CATEGORY_HATE_SPEECH": {"threshold": "BLOCK_NONE"},
    "HARM_CATEGORY_HARASSMENT": {"threshold": "BLOCK_NONE"},
    "HARM_CATEGORY_DANGEROUS_CONTENT": {"threshold": "BLOCK_NONE"}
}

## NOTE: safety_settings above is defined but is not passed to the model here
model = GoogleGenerativeAI(model='models/gemini-pro',
                           google_api_key=GENAI_API_KEY)

## initializing the chain and the parser for better result
parser = StrOutputParser()


## Using a prompt template
template = """
        Rewrite the sentence; there is no constraint on the number of lines.
        Sentence : {sentence}
"""
## creating prompt
prompt = PromptTemplate.from_template(template)

chain = prompt | model | parser  ## chain will be used for further text generation


# Step 1: Splitting the text column into sentences and storing the result in a new column
def sentence_splitter(text):
    """
    Splits the input text into sentences on a full stop followed by a space ('. ').
    Args:
        text (str): Input text containing sentences separated by full stops.
    Returns:
        list: List of individual sentences.
    """
    text = text.replace('[', '').replace(']', '').replace('"', "")
    sentences = text.split('. ')
    return sentences

def split_text_column(df):
    """
    Splits the 'Text' column into sentences and stores the result in a new column 'splitted_text'.
    Args:
        df (DataFrame): DataFrame containing 'Text' column with text data.
    Returns:
        DataFrame: DataFrame with an additional column 'splitted_text'.
    """
    df['splitted_text'] = df['Text'].apply(sentence_splitter)
    return df

# Step 2: Regenerate sentences using an LLM
def regenerate_sentence(sentences):
    """
    Regenerates sentences using a generative AI model.
    Args:
        sentences (list): List of sentences.
    Returns:
        list: List of regenerated sentences (non-string items are skipped).
    """
    regenerated_sentences = []
    for sentence in sentences:
        if isinstance(sentence, str):
            # pass the template variable explicitly by name
            generated_sentence = chain.invoke({"sentence": sentence})
            regenerated_sentences.append(generated_sentence)
    return regenerated_sentences


def regenerate_sentences_in_dataframe(df):
    """
    Regenerates each split sentence using an LLM and stores the regenerated
    sentences in a new column 'regenerated_text'.
    Args:
        df (DataFrame): DataFrame containing 'splitted_text' column with lists of sentences.
    Returns:
        DataFrame: DataFrame with an additional column 'regenerated_text'.
    """
    df['regenerated_text'] = df['splitted_text'].apply(regenerate_sentence)
    return df

## Step 3 : Regenerate the whole Text
def regenerate_text(text):
    """
    Regenerates the entire text using a generative AI model.
    Args:
        text (str): The input text to be regenerated.
    Returns:
        str: The regenerated text.
    """
    text = text.replace('[', '').replace(']', '').replace('"', "")
    # pass the string by name; wrapping it in a list would put the list's
    # brackets and quotes back into the prompt
    regenerated_text = chain.invoke({"sentence": text})
    return regenerated_text

def regenerate_text_in_dataframe(df):
    """
    Regenerates the cleaned text in the 'char_less_text' column of the DataFrame using an LLM.
    Args:
        df (DataFrame): DataFrame containing a 'char_less_text' column (see char_remover_column).
    Returns:
        DataFrame: DataFrame with an additional column 'regenerated_text'.
    """
    df.loc[:, 'regenerated_text'] = df['char_less_text'].apply(regenerate_text)
    return df


# Step 4: Remove extra characters
def char_remover(text):
    """
    Removes list brackets and quote characters from the input text.
    Args:
        text (str): Input text, possibly wrapped as "['...']".
    Returns:
        str: Text without the "['" and "']" wrappers and without quotes.
    """
    text = text.replace("['", '').replace("']", '').replace('"', "").replace("'", "")
    return text

def char_remover_column(df):
    """
    Cleans the 'Text' column and stores the result in a new column 'char_less_text'.
    Args:
        df (DataFrame): DataFrame containing 'Text' column with text data.
    Returns:
        DataFrame: DataFrame with an additional column 'char_less_text'.
    """
    df['char_less_text'] = df['Text'].apply(char_remover)
    return df



In [None]:
# detectllm_data['Text'][0].split(".")

In [None]:
# sentence_splitter(detectllm_data['Text'][0])

In [None]:
## applying the functions [split_text_column] to split the sentence

# split_text_column(df=detectllm_data)
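
One detail of the splitting step is worth seeing on a small input: `split('. ')` drops the separator, so every sentence except the last loses its full stop. The sketch below repeats the notebook's `sentence_splitter` logic so that it is self-contained; `join_sentences` is a hypothetical helper, not part of the notebook, that shows one way to put the pieces back together.

```python
def sentence_splitter(text):
    # Same logic as the notebook's helper, repeated to keep the sketch self-contained.
    text = text.replace('[', '').replace(']', '').replace('"', "")
    return text.split('. ')


def join_sentences(sentences):
    # Hypothetical inverse helper: re-insert the separator removed by split('. ').
    return '. '.join(sentences)


sample = '["It rained all day. The match was cancelled. Fans went home."]'
parts = sentence_splitter(sample)
print(parts)
print(join_sentences(parts))
```

If individual sentences are rewritten by the model, the rewritten pieces may or may not end with punctuation, so the rejoined text should be checked before word counts are computed.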

### 3. Generating New Data [Applying the function]

In [None]:
detectllm_data_human.head()

In [None]:
## extra characters are removed first
splitted_data = char_remover_column(detectllm_data_human)

In [None]:
## testing splitted data
splitted_data.tail()

In [None]:
## applying the second function
# regenerated_data = regenerate_text_in_dataframe(splitted_data)

In [None]:
# regenerated_data

In [None]:
## use positional indexing: after filtering, the row label 0 may no longer exist
regenerate_text(splitted_data['char_less_text'].iloc[0])


In [None]:
## adding a new column using the `assign` method of the dataframe
## (`assign` returns a new DataFrame, so the result is stored back)

splitted_data = splitted_data.assign(regenerated_text=None)

In [None]:
## generating new data


regenerate_text_in_dataframe(splitted_data)
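
Because the call above sends one request per row, a single failed request would abort the whole `apply`. The following is a minimal sketch of one way to guard against that, assuming it is acceptable to store `None` for rows that cannot be regenerated. `safe_regenerate`, `flaky_rewrite`, and `always_fail` are hypothetical names; the stand-in functions simulate failures so the sketch runs without an API key.

```python
import time


def safe_regenerate(text, rewrite, retries=3, wait_seconds=0.0):
    """Call rewrite(text); return None if every attempt raises."""
    for attempt in range(retries):
        try:
            return rewrite(text)
        except Exception as err:  # broad on purpose: the failure type is not known here
            print(f"attempt {attempt + 1} failed: {err}")
            time.sleep(wait_seconds)
    return None


# Stand-in rewriter that fails once, then succeeds (for demonstration only).
calls = {"n": 0}


def flaky_rewrite(text):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated API error")
    return text.upper()


def always_fail(text):
    raise RuntimeError("simulated blocked response")


first = safe_regenerate("hello world", flaky_rewrite)
second = safe_regenerate("hello world", always_fail, retries=2)
print(first, second)
```

In the notebook, `rewrite` could be the existing `regenerate_text` function, for example `splitted_data['char_less_text'].apply(lambda t: safe_regenerate(t, regenerate_text))`; rows left as `None` can then be retried or dropped later.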