# LGBM 15 fold + 5 fold Deberta Explained  💨

### Introduction:
Welcome to this Jupyter notebook for the Learning Agency Lab - Automated Essay Scoring 2.0 competition. It is designed to help you take part in the competition, whose goal is to develop automated techniques that improve on existing essay scoring algorithms and, in turn, student learning outcomes.



### Inspiration and Credits 🙌
This notebook is inspired by the work of SiddhVR, available at [this Kaggle project](https://www.kaggle.com/code/siddhvr/aes-2-0-deberta-lgbm-baseline). I extend my gratitude to SiddhVR for sharing their insights and code publicly.


**Model Components:**

1. **Light Gradient Boosting Machine (LGBM):** LGBM is a powerful gradient boosting framework that efficiently handles large datasets and provides high accuracy. It is employed as the primary machine learning algorithm in our model for essay scoring.

2. **Deberta Transformer Embeddings:** Deberta is a state-of-the-art transformer-based language model. We utilize Deberta embeddings to capture rich semantic information from essays, enhancing the model's understanding of the text.

3. **TF-IDF Vectorization:** Term Frequency-Inverse Document Frequency (TF-IDF) vectorization is utilized to convert text into numerical representations while considering the importance of words in essays. It helps in capturing the significance of words and phrases in the scoring process.

4. **CountVectorizer:** CountVectorizer is used to convert text into a matrix of token counts. It enables us to extract features based on word frequencies, which contribute to the overall scoring mechanism.

5. **Ensemble Learning:** Our model employs ensemble learning techniques, combining predictions from multiple models to improve the overall accuracy and robustness of the scoring process.

**🌟 Explore my profile and other public projects, and don't forget to share your feedback!**

## 👉 [Visit my Profile](https://www.kaggle.com/code/zulqarnainalipk) 👈


**Working of the Model:**

1. **Data Preprocessing:** The essay text undergoes preprocessing steps, including lowercasing, HTML tag removal, punctuation removal, and spelling error detection. A contraction-expansion helper is also defined, but its call is commented out in the preprocessing function.

2. **Feature Engineering:** Features are extracted from the preprocessed text, including paragraph length, sentence count, word count, and other linguistic features.

3. **Model Training:** The preprocessed features are fed into the LGBM model along with Deberta embeddings, TF-IDF vectors, and CountVectorizer representations. The model is trained using a combination of these features to learn the scoring patterns.

4. **Cross-Validation:** The model is validated with stratified k-fold cross-validation to check that its scoring is robust and reliable.

5. **Ensemble Prediction:** Predictions from multiple models trained on different folds are averaged to obtain the final predicted scores for the essays.

6. **Submission Generation:** The predicted scores are then used to generate a submission file in CSV format, which can be used for evaluation or submission in essay scoring competitions.


## Acknowledgments 🙏
I acknowledge The Learning Agency Lab organizers for providing the dataset and the competition platform.

Let's get started! Feel free to reach out if you have any questions or need assistance along the way.
👉 [Visit my Profile](https://www.kaggle.com/zulqarnainalipk) 👈


In [None]:
import gc
import torch
import copy
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments,DataCollatorWithPadding
import nltk
from glob import glob
import numpy as np 
import pandas as pd
import polars as pl
import re
import random
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from scipy.special import softmax
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier,GradientBoostingClassifier,BaggingClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB,MultinomialNB,ComplementNB
from sklearn.neural_network import MLPClassifier
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.pipeline import Pipeline
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, f1_score
from sklearn.metrics import cohen_kappa_score
from lightgbm import log_evaluation, early_stopping
import lightgbm as lgb
nltk.download('wordnet')

---

**Explanation:**


1. `MAX_LENGTH`: Maximum length of the input sequence, set to 1024 tokens.

2. `TEST_DATA_PATH`: Path to the test data CSV file. 

3. `MODEL_PATH`: Glob pattern matching the directories that contain the trained models.

4. `EVAL_BATCH_SIZE`: Batch size used for evaluation, set to 1 so that each evaluation batch contains a single sample.



In [None]:
MAX_LENGTH = 1024
TEST_DATA_PATH = "/kaggle/input/learning-agency-lab-automated-essay-scoring-2/test.csv"
MODEL_PATH = '/kaggle/input/aes2-400-20240419134941/*/*'
EVAL_BATCH_SIZE = 1

## Tokenization and Dataset Preparation

---
 **Explanation:**


- **Model Loading:**
  - The `glob` function retrieves the list of model directories matching the **MODEL_PATH** pattern and stores it in **models**.

- **Tokenizer Initialization:**
  - The **tokenizer** is initialized using `AutoTokenizer.from_pretrained(models[0])`, where `models[0]` represents the first model in the list.

- **Tokenization Function Definition:**
  - A `tokenize` function is defined that takes a `sample` as input and tokenizes its "full_text" field with the initialized **tokenizer**. Sequences are limited to **MAX_LENGTH** tokens, and longer texts are truncated.

- **Test Data Processing:**
  - The test data CSV file located at **TEST_DATA_PATH** is read into a pandas DataFrame named **df_test**.

- **Hugging Face Dataset Creation:**
  - The pandas DataFrame **df_test** is converted into a Hugging Face Dataset named **ds**. The `tokenize` function is applied to tokenize the "full_text" column of each sample, and the "essay_id" and "full_text" columns are removed from the dataset.

- **Training Arguments Initialization:**
  - The **TrainingArguments** object **args** is initialized with the output directory for the Trainer (".", the current directory), the evaluation batch size (`per_device_eval_batch_size`), and `report_to="none"`, which turns off reporting to external experiment trackers.

- **Model Evaluation Loop:**
  - Iterates over each **model** in the **models** list.
  
- **Model Loading and Evaluation:**
  - Loads the current **model** for sequence classification using `AutoModelForSequenceClassification.from_pretrained(model)`.
  - Initializes a **Trainer** object with the loaded **model**, **args**, a `DataCollatorWithPadding` object initialized with the **tokenizer**, and the **tokenizer** itself.
  - Uses the **Trainer** to make predictions on the test dataset **ds** and obtains the raw **predictions**.
  - Applies softmax to the raw predictions along the last axis to obtain **class probabilities**.
  
- **Memory Management:**
  - Deletes the **model** and **Trainer** objects to release memory using `del`.
  - Clears the GPU memory cache using `torch.cuda.empty_cache()`.
  - Performs garbage collection using `gc.collect()` to free up memory resources.

Each model is evaluated on the test dataset in turn, and its predicted class probabilities are stored for the averaging step.

In [None]:
models = glob(MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(models[0])

def tokenize(sample):
    return tokenizer(sample['full_text'], max_length=MAX_LENGTH, truncation=True)

df_test = pd.read_csv(TEST_DATA_PATH)
ds = Dataset.from_pandas(df_test).map(tokenize).remove_columns(['essay_id', 'full_text'])

args = TrainingArguments(
    ".", 
    per_device_eval_batch_size=EVAL_BATCH_SIZE, 
    report_to="none"
)

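predictions = []
for model in models:
    # Load one trained checkpoint at a time to keep memory usage low.
    model = AutoModelForSequenceClassification.from_pretrained(model)
    trainer = Trainer(
        model=model, 
        args=args, 
        data_collator=DataCollatorWithPadding(tokenizer),  # pads each batch to its longest sequence
        tokenizer=tokenizer
    )    
    # Raw logits for every test essay, converted to class probabilities.
    preds = trainer.predict(ds).predictions
    predictions.append(softmax(preds, axis=-1))
    # Release the model before loading the next one.
    del model, trainer
    torch.cuda.empty_cache()
    gc.collect()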
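
To make the softmax step concrete, here is a minimal sketch in plain Python. The notebook itself calls `scipy.special.softmax` on the array returned by `trainer.predict`; the function below and the logits are illustrative assumptions, not model output.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one essay over six score classes.
logits = [0.2, 1.5, 3.1, 2.4, 0.3, -1.0]
probs = softmax(logits)
print([round(p, 3) for p in probs])
```

The largest logit always receives the largest probability, so softmax changes the scale of the outputs but not their ranking.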

## Averaging Predictions Across Models

---

**Explanation:**


- **Score Calculation:**
  - A variable named **predicted_score** is initialized to 0. This variable will be used to aggregate the predicted scores from all models.
  
- **Aggregation Loop:**
  - Iterates over each **p** in **predictions**, where **predictions** is the list containing the predicted probabilities from all models.
  
- **Summation of Predictions:**
  - Adds each set of predicted probabilities **p** to the **predicted_score** variable.
  
- **Normalization:**
  - Divides the **predicted_score** by the total number of models (**len(predictions)**) to obtain the average predicted score across all models.

We aggregate the predicted probabilities from all models by summing them and then dividing by the number of models. The resulting **predicted_score** holds the average class probabilities for each essay across all models. 

In [None]:
predicted_score = 0.
for p in predictions:
    predicted_score += p
    
predicted_score /= len(predictions)
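
As a small sanity check of the averaging logic, the sketch below runs the same loop on made-up probability arrays and compares it with a vectorized `np.mean`. The arrays `model_a` and `model_b` are hypothetical stand-ins for the per-model outputs.

```python
import numpy as np

# Hypothetical probabilities from two models for two essays and three classes.
model_a = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
model_b = np.array([[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]])
predictions = [model_a, model_b]

# Running sum followed by division, as in the cell above.
predicted_score = 0.
for p in predictions:
    predicted_score += p
predicted_score /= len(predictions)

# Equivalent one-liner: stack the arrays and average over the model axis.
vectorized = np.mean(predictions, axis=0)
print(predicted_score)
```

Both forms give every model equal weight; a weighted average would need explicit weights.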

---

**Explanation:**


- **Score Assignment:**
  - Adds a new column named **'score'** to the DataFrame **df_test**.
  - Assigns the predicted score for each essay by finding the index of the maximum value in the **predicted_score** array along the last axis (-1) using `.argmax(-1)`.
  - The argmax yields a 0-based class index, while essay scores start at 1, so 1 is added to the index to get the actual score.
  - This assigns the predicted scores to the 'score' column in the DataFrame.


We assign the predicted scores to the 'score' column in the DataFrame **df_test** by taking, for each essay, the class with the highest averaged probability. 

## Generating Predictions for Test Data

In [None]:
df_test['score'] = predicted_score.argmax(-1) + 1
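
The index-to-score mapping can be sketched without NumPy as follows. The helper `probs_to_score` and the probability rows are hypothetical and only illustrate the `argmax(-1) + 1` step.

```python
def probs_to_score(probs):
    """Return the 1-based score for one essay's class probabilities."""
    best_index = max(range(len(probs)), key=probs.__getitem__)  # 0-based argmax
    return best_index + 1

# Hypothetical averaged probabilities for two essays over six classes.
averaged = [
    [0.05, 0.10, 0.60, 0.15, 0.07, 0.03],
    [0.02, 0.03, 0.10, 0.25, 0.40, 0.20],
]
scores = [probs_to_score(row) for row in averaged]
print(scores)  # [3, 5]
```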


---

**Explanation:**


- **DataFrame Selection:**
  - Selects the 'essay_id' and 'score' columns from the DataFrame **df_test** using `df_test[['essay_id', 'score']]`.
  
- **CSV Export:**
  - Writes the selected columns to a CSV file named **'submission1.csv'** using the `.to_csv()` function.
  - The parameter `index=False` is used to exclude the DataFrame index from being written to the CSV file.

This exports the 'essay_id' and 'score' columns from the DataFrame **df_test** to a CSV file named **'submission1.csv'**, which can be used for submission or further analysis. 

In [None]:
df_test[['essay_id', 'score']].to_csv('submission1.csv', index=False)

# Data Loading



---

**Explanation:**


- **Column Definition:**
  - Defines a list named **columns** containing a single Polars expression (the surrounding parentheses only group the expression; they do not create a tuple).
  - The expression splits the values in the "full_text" column on the literal separator "\n\n" (a blank line) using `.str.split(by="\n\n")`. This effectively splits each essay into paragraphs.
  - The result is aliased as "paragraph".

- **File Paths:**
  - Defines a variable named **PATH** containing the path to the directory where the training and test CSV files are located.

- **Data Loading:**
  - Loads the training data from the CSV file named "train.csv" located in the specified **PATH**.
  - After loading, the "full_text" column is split into paragraphs using the defined **columns**.

- **Test Data Loading:**
  - Loads the test data from the CSV file named "test.csv" located in the specified **PATH**.
  - Similar to the training data, the "full_text" column is split into paragraphs using the defined **columns**.

- **Display:**
  - The first row of the training data can be inspected with `train.head(1)`; the cell below does not include this call.

We prepare the training and test datasets by splitting the "full_text" column into paragraphs, which can be useful for further analysis or model training.

In [None]:
columns = [  
    (
        pl.col("full_text").str.split(by="\n\n").alias("paragraph")
    ),
]
PATH = "/kaggle/input/learning-agency-lab-automated-essay-scoring-2/"

train = pl.read_csv(PATH + "train.csv").with_columns(columns)
test = pl.read_csv(PATH + "test.csv").with_columns(columns)
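
For intuition, the paragraph split behaves like Python's own `str.split` with a literal separator. The sketch below uses a made-up essay string and plain Python rather than Polars.

```python
essay = "First paragraph.\n\nSecond paragraph, a bit longer.\n\nThird."

# Splitting on the literal "\n\n" separator, one list element per paragraph.
paragraphs = essay.split("\n\n")
print(len(paragraphs))
print(paragraphs)

# A single newline is not a separator here, so it stays inside the paragraph.
single_break = "Line one.\nLine two."
print(single_break.split("\n\n"))
```

Essays that separate paragraphs with a single newline would therefore end up as one long paragraph.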



## Spelling Error Count Function

---

**Explanation:**


- **Spacy Initialization:**
  - Imports the `spacy` library, which is used for natural language processing tasks.
  - Loads the English language model "en_core_web_sm" using `spacy.load("en_core_web_sm")`. This model is commonly used for basic NLP tasks.

- **English Vocabulary Loading:**
  - Opens the file located at '/kaggle/input/english-word-hx/words.txt' in read mode.
  - Reads the contents of the file and stores each word as a lowercase string in a set named **english_vocab** after stripping any leading or trailing whitespace.

- **Spelling Error Counting Function:**
  - Defines a function named **count_spelling_errors** that takes a **text** input as an argument.
  
- **Text Processing:**
  - Processes the input **text** using the loaded Spacy **nlp** object, which tokenizes and lemmatizes the text.
  - Lemmatized tokens are generated by extracting the lowercase lemmas of each token in the processed document and storing them in the list **lemmatized_tokens**.

- **Spelling Error Counting:**
  - Counts the number of spelling errors in the **text** by iterating over each token in **lemmatized_tokens** and checking if it is not present in the **english_vocab** set.
  - The total count of tokens not present in the English vocabulary is stored in the variable **spelling_errors**.

- **Result Return:**
  - Returns the **spelling_errors** count as the output of the function.

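We define a function **count_spelling_errors** that estimates the number of spelling errors in a text by counting the lemmatized tokens that are missing from the English vocabulary set. Any token outside that word list, such as a proper noun absent from it, is counted as an error, so the result is best read as a rough proxy. 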

In [None]:
import spacy

# Small English pipeline, used here for tokenization and lemmatization.
nlp = spacy.load("en_core_web_sm")

# Reference vocabulary: one lowercase word per line of words.txt.
with open('/kaggle/input/english-word-hx/words.txt', 'r') as file:
    english_vocab = set(word.strip().lower() for word in file)

def count_spelling_errors(text):
    doc = nlp(text)
    lemmatized_tokens = [token.lemma_.lower() for token in doc]
    # Any lemma missing from the vocabulary counts as a spelling error.
    spelling_errors = sum(1 for token in lemmatized_tokens if token not in english_vocab)
    return spelling_errors
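
The counting idea can be tried without spaCy or the word-list file. This is a simplified sketch under stated assumptions: `toy_vocab` and `count_unknown_tokens` are hypothetical, and whitespace splitting replaces lemmatization.

```python
# Toy vocabulary standing in for the words.txt file (an assumption for this sketch).
toy_vocab = {"the", "cat", "sit", "sat", "on", "mat", "a"}

def count_unknown_tokens(text, vocab):
    """Count whitespace-separated tokens that are missing from vocab."""
    tokens = text.lower().split()
    return sum(1 for token in tokens if token not in vocab)

print(count_unknown_tokens("The cat sat on teh mat", toy_vocab))     # 1 ("teh")
print(count_unknown_tokens("The cat sat on Kaggle mat", toy_vocab))  # 1 ("kaggle")
```

The second call shows the main limitation of a vocabulary lookup: a correctly spelled name that is not in the list is counted the same way as a typo.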



# Contraction Expansion and Text Preprocessing Functions 

---

**Explanation:**


- **Contractions Expansion:**
  - A dictionary named **cList** is defined, containing mappings from common contractions to their expanded forms.
  - A regular expression pattern **c_re** is created using `re.compile()` to match any contraction in the text.
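  - The function **expandContractions** takes a **text** input and uses the regular expression pattern to find contractions and replace them with their expanded forms. Its call inside **dataPreprocessing** is commented out, so contractions are not actually expanded in this pipeline.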

- **HTML Tag Removal:**
  - The function **removeHTML** removes any HTML tags from the text using a regular expression pattern.

- **Data Preprocessing:**
  - The function **dataPreprocessing** performs several preprocessing steps on the input **x**:
    - Converts the text to lowercase using `.lower()`.
    - Removes HTML tags from the text using the **removeHTML** function.
    - Removes Twitter handles by substituting any word starting with '@' with an empty string.
    - Removes apostrophes followed by digits.
    - Removes any digits from the text.
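    - Removes "http" together with any word characters that immediately follow it. The pattern `http\w+` stops at the first non-word character such as ':', so the remainder of a URL is left in the text.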
    - Replaces multiple consecutive whitespaces with a single whitespace using `re.sub()`.
    - Removes excessive periods and commas by replacing consecutive occurrences with a single instance.
    - Strips any leading or trailing whitespaces from the text.
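
One detail worth knowing when a single alternation pattern is built from a contraction table: regex alternation tries its branches from left to right, so a short key listed before a longer key that contains it will win. The sketch below uses a hypothetical three-entry table and helper names to show the effect.

```python
import re

# A three-entry subset of a contraction table, enough to show the ordering effect.
contractions = {"can't": "cannot", "can't've": "cannot have", "it's": "it is"}

def build_pattern(keys):
    """Join keys into one alternation, escaping any regex metacharacters."""
    return re.compile("|".join(re.escape(k) for k in keys))

def expand(text, pattern):
    """Replace every matched contraction with its expansion."""
    return pattern.sub(lambda m: contractions[m.group(0)], text)

insertion_order = build_pattern(contractions)
longest_first = build_pattern(sorted(contractions, key=len, reverse=True))

sentence = "it's odd that they can't've seen it"
print(expand(sentence, insertion_order))  # it is odd that they cannot've seen it
print(expand(sentence, longest_first))    # it is odd that they cannot have seen it
```

Sorting the keys by length, longest first, is a simple way to make sure compound contractions are matched as a whole.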


In [None]:
cList = {
  "ain't": "am not","aren't": "are not","can't": "cannot","can't've": "cannot have","'cause": "because",  "could've": "could have","couldn't": "could not","couldn't've": "could not have","didn't": "did not","doesn't": "does not","don't": "do not","hadn't": "had not","hadn't've": "had not have","hasn't": "has not",
  "haven't": "have not","he'd": "he would","he'd've": "he would have","he'll": "he will","he'll've": "he will have","he's": "he is",
  "how'd": "how did","how'd'y": "how do you","how'll": "how will","how's": "how is","I'd": "I would","I'd've": "I would have","I'll": "I will","I'll've": "I will have","I'm": "I am","I've": "I have",
  "isn't": "is not","it'd": "it had","it'd've": "it would have","it'll": "it will", "it'll've": "it will have","it's": "it is","let's": "let us","ma'am": "madam","mayn't": "may not",
  "might've": "might have","mightn't": "might not","mightn't've": "might not have","must've": "must have","mustn't": "must not","mustn't've": "must not have","needn't": "need not","needn't've": "need not have","o'clock": "of the clock","oughtn't": "ought not","oughtn't've": "ought not have","shan't": "shall not","sha'n't": "shall not",
  "shan't've": "shall not have","she'd": "she would","she'd've": "she would have","she'll": "she will","she'll've": "she will have","she's": "she is",
  "should've": "should have","shouldn't": "should not","shouldn't've": "should not have","so've": "so have","so's": "so is","that'd": "that would","that'd've": "that would have","that's": "that is","there'd": "there had","there'd've": "there would have","there's": "there is","they'd": "they would","they'd've": "they would have","they'll": "they will","they'll've": "they will have","they're": "they are","they've": "they have","to've": "to have","wasn't": "was not","we'd": "we had",
  "we'd've": "we would have","we'll": "we will","we'll've": "we will have","we're": "we are","we've": "we have",
  "weren't": "were not","what'll": "what will","what'll've": "what will have",
  "what're": "what are","what's": "what is","what've": "what have","when's": "when is","when've": "when have",
  "where'd": "where did","where's": "where is","where've": "where have","who'll": "who will","who'll've": "who will have","who's": "who is","who've": "who have","why's": "why is",
  "why've": "why have","will've": "will have","won't": "will not","won't've": "will not have","would've": "would have","wouldn't": "would not",
  "wouldn't've": "would not have","y'all": "you all","y'alls": "you alls","y'all'd": "you all would",
  "y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you had","you'd've": "you would have","you'll": "you will","you'll've": "you will have","you're": "you are",  "you've": "you have"
   }

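# Longest keys first so that "can't've" is tried before "can't"; re.escape guards against regex metacharacters.
c_re = re.compile('(%s)' % '|'.join(re.escape(k) for k in sorted(cList, key=len, reverse=True)))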

def expandContractions(text, c_re=c_re):
    def replace(match):
        return cList[match.group(0)]
    return c_re.sub(replace, text)

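HTML_TAG_RE = re.compile(r'<.*?>')

def removeHTML(x):
    return HTML_TAG_RE.sub('', x)

def dataPreprocessing(x):
    x = x.lower()
    x = removeHTML(x)
    x = re.sub(r"@\w+", '', x)      # @mentions
    x = re.sub(r"'\d+", '', x)      # apostrophe followed by digits, e.g. '90
    x = re.sub(r"\d+", '', x)       # remaining digits
    x = re.sub(r"http\w+", '', x)   # "http" plus trailing word characters (does not consume "://...")
    x = re.sub(r"\s+", " ", x)      # collapse whitespace runs
#     x = expandContractions(x)
    x = re.sub(r"\.+", ".", x)      # collapse repeated periods
    x = re.sub(r"\,+", ",", x)      # collapse repeated commas
    x = x.strip()
    return x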
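
To see what the `http\w+` substitution does on real-looking input, here is a short standalone check. The helper names are hypothetical, and `strip_full_urls` is an alternative pattern that the notebook does not use.

```python
import re

def strip_http_tokens(text):
    """Apply only the URL-related substitution from dataPreprocessing."""
    return re.sub(r"http\w+", "", text)

def strip_full_urls(text):
    """Alternative (not used in the notebook): remove the whole URL."""
    return re.sub(r"https?://\S+", "", text)

print(strip_http_tokens("see https://example.com now"))  # see ://example.com now
print(strip_http_tokens("see http://example.com now"))   # see http://example.com now
print(strip_full_urls("see https://example.com now"))
```

In the first call only "https" is removed; in the second nothing matches, because "http" is followed directly by ":" and `\w+` needs at least one word character.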

# Punctuation Removal Function

---

**Explanation:**

Function **remove_punctuation** is defined to remove punctuation from a given text:

- **Punctuation Removal:**
  - The function takes a **text** input as an argument.
  - A translator is created using `str.maketrans('', '', string.punctuation)`, which generates a translation table that maps each character in the string `string.punctuation` to `None`. This effectively removes all punctuation characters.
  - The translation table is then applied to the input **text** using `text.translate(translator)`, which removes all punctuation characters from the text.
  - The processed text with removed punctuation is returned as the output of the function.

- **Example Usage:**
  - An example **text** string "Hello, world! This is a test." is provided.
  - The **remove_punctuation** function is called with the example **text** as input, and the processed text with removed punctuation is printed.



In [None]:
import string

def remove_punctuation(text):

    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

text = "Hello, world! This is a test."
print(remove_punctuation(text))


# Paragraph Preprocessing

---

**Explanation:**

We define two functions for paragraph preprocessing and feature engineering:

- **Paragraph Preprocessing:**
  - The function **Paragraph_Preprocess** takes a DataFrame **tmp** as input.
  - The DataFrame **tmp** is exploded on the 'paragraph' column, which means that each row containing a list of paragraphs is expanded into multiple rows, each containing a single paragraph.
  - Text preprocessing operations are applied to each paragraph in the DataFrame:
    - The **dataPreprocessing** function is applied to clean the text data.
    - The **remove_punctuation** function is applied to produce a punctuation-free copy of each paragraph, stored in a separate column.
    - The **count_spelling_errors** function is applied to that punctuation-free copy to count the spelling errors in each paragraph.
    - The length of each paragraph in characters, sentences, and words is calculated.
  - The resulting DataFrame **tmp** contains processed paragraph data.

- **Paragraph Feature Engineering:**
  - The function **Paragraph_Eng** takes a DataFrame **train_tmp** as input.
  - The DataFrame is grouped by the 'essay_id' column, and various aggregate statistics are computed for the paragraph-level features.
  - Aggregate statistics include counts of paragraphs above or below given length thresholds, plus maximum, mean, minimum, sum, first, last, kurtosis, and quartiles (Q1 and Q3) for paragraph length, sentence count, word count, and spelling error count.
  - The resulting DataFrame **df** is converted to a pandas DataFrame and returned.
  - Outside the function, the result is stored as **train_feats**, and the 'score' column from the original training data is appended to it.

- **Feature Names Extraction:**
  - Extracts the feature names from the DataFrame **train_feats** excluding 'essay_id' and 'score' columns.

- **Output Display:**
  - Prints the number of features extracted.
  - The cell below does not display any rows of **train_feats**; call `train_feats.head(3)` to inspect the first three rows.

These functions preprocess paragraphs and perform feature engineering to extract relevant features from the text data. 

In [None]:

def Paragraph_Preprocess(tmp):
    tmp = tmp.explode('paragraph')
    tmp = tmp.with_columns(pl.col('paragraph').map_elements(dataPreprocessing))
    tmp = tmp.with_columns(pl.col('paragraph').map_elements(remove_punctuation).alias('paragraph_no_punctuation'))
    tmp = tmp.with_columns(pl.col('paragraph_no_punctuation').map_elements(count_spelling_errors).alias("paragraph_error_num"))
    tmp = tmp.with_columns(pl.col('paragraph').map_elements(lambda x: len(x)).alias("paragraph_len"))
    tmp = tmp.with_columns(pl.col('paragraph').map_elements(lambda x: len(x.split('.'))).alias("paragraph_sentence_cnt"),
                    pl.col('paragraph').map_elements(lambda x: len(x.split(' '))).alias("paragraph_word_cnt"),)

    return tmp
# feature_eng
paragraph_fea = ['paragraph_len','paragraph_sentence_cnt','paragraph_word_cnt']
paragraph_fea2 = ['paragraph_error_num'] + paragraph_fea
def Paragraph_Eng(train_tmp):
    # Character-length thresholds for the "at least this long" paragraph counts.
    num_list2 = [0, 50,75,100,125,150,175,200,250,300,350,400,500,600,700]
    aggs = [
        *[pl.col('paragraph').filter(pl.col('paragraph_len') >= i).count().alias(f"paragraph_>{i}_cnt") for i in num_list2], 
        *[pl.col('paragraph').filter(pl.col('paragraph_len') <= i).count().alias(f"paragraph_<{i}_cnt") for i in [25,49]], 

        *[pl.col(fea).max().alias(f"{fea}_max") for fea in paragraph_fea2],
        *[pl.col(fea).mean().alias(f"{fea}_mean") for fea in paragraph_fea2],
        *[pl.col(fea).min().alias(f"{fea}_min") for fea in paragraph_fea2],
        *[pl.col(fea).sum().alias(f"{fea}_sum") for fea in paragraph_fea2],
        *[pl.col(fea).first().alias(f"{fea}_first") for fea in paragraph_fea2],
        *[pl.col(fea).last().alias(f"{fea}_last") for fea in paragraph_fea2],
        *[pl.col(fea).kurtosis().alias(f"{fea}_kurtosis") for fea in paragraph_fea2],
        *[pl.col(fea).quantile(0.25).alias(f"{fea}_q1") for fea in paragraph_fea2],  # first quartile (25th percentile)
        *[pl.col(fea).quantile(0.75).alias(f"{fea}_q3") for fea in paragraph_fea2],  # third quartile (75th percentile)
    
        ]
    df = train_tmp.group_by(['essay_id'], maintain_order=True).agg(aggs).sort("essay_id")
    df = df.to_pandas()
    return df
tmp = Paragraph_Preprocess(train)
train_feats = Paragraph_Eng(tmp)
train_feats['score'] = train['score']

feature_names = list(filter(lambda x: x not in ['essay_id','score'], train_feats.columns))
print('Features Number: ',len(feature_names))
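
The group-and-aggregate step can be pictured with plain Python. The sketch below is not the Polars implementation; the essay IDs, lengths, thresholds, and the helper `paragraph_features` are made up to show how per-paragraph values collapse into one feature row per essay.

```python
from statistics import mean

# Hypothetical paragraph lengths (in characters) for two essays.
paragraph_lens = {
    "essay_a": [120, 340, 45],
    "essay_b": [610, 80],
}

def paragraph_features(lens, thresholds=(0, 50, 100, 300)):
    """Aggregate one essay's paragraph lengths into a flat feature dict."""
    feats = {f"paragraph_>{t}_cnt": sum(1 for n in lens if n >= t) for t in thresholds}
    feats.update({
        "paragraph_len_max": max(lens),
        "paragraph_len_mean": mean(lens),
        "paragraph_len_min": min(lens),
        "paragraph_len_sum": sum(lens),
        "paragraph_len_first": lens[0],
        "paragraph_len_last": lens[-1],
    })
    return feats

features = {essay_id: paragraph_features(lens) for essay_id, lens in paragraph_lens.items()}
print(features["essay_a"])
```

Each essay ends up with the same set of columns regardless of how many paragraphs it has, which is what lets the features feed a tabular model.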


# Sentence Preprocessing

---

**Explanation:**

We define two additional functions for sentence-level preprocessing and feature engineering:

- **Sentence Preprocessing:**
  - The function **Sentence_Preprocess** takes a DataFrame **tmp** as input.
  - The 'full_text' column is cleaned with **dataPreprocessing** and then split into sentences using the period ('.') as the delimiter.
  - The resulting DataFrame **tmp** contains the 'sentence' column with each element representing a sentence.
  - The 'sentence' column is then exploded to create multiple rows, each containing a single sentence.
  - The length of each sentence in terms of characters and words is calculated.

- **Sentence Feature Engineering:**
  - The function **Sentence_Eng** takes a DataFrame **train_tmp** as input.
  - The DataFrame is grouped by the 'essay_id' column, and various aggregate statistics are computed for the sentence-level features.
  - Aggregate statistics include count, maximum, mean, minimum, sum, first, last, kurtosis, and quantiles for features such as sentence length and word count.
  - The resulting DataFrame **df** is converted to a pandas DataFrame.
  - Outside the function, the sentence-level features are merged into **train_feats** on the 'essay_id' column.

- **Feature Names Extraction:**
  - Extracts the feature names from the DataFrame **train_feats** excluding 'essay_id' and 'score' columns.

- **Output Display:**
  - Prints the number of features extracted after merging sentence-level features.
  
These functions preprocess sentences and perform feature engineering to extract relevant features from the text data at the sentence level. 

In [None]:
def Sentence_Preprocess(tmp):
    tmp = tmp.with_columns(pl.col('full_text').map_elements(dataPreprocessing).str.split(by=".").alias("sentence"))
    tmp = tmp.explode('sentence')
    tmp = tmp.with_columns(pl.col('sentence').map_elements(lambda x: len(x)).alias("sentence_len"))

    tmp = tmp.with_columns(pl.col('sentence').map_elements(lambda x: len(x.split(' '))).alias("sentence_word_cnt"))
    
    return tmp
sentence_fea = ['sentence_len','sentence_word_cnt']
def Sentence_Eng(train_tmp):
    aggs = [
        *[pl.col('sentence').filter(pl.col('sentence_len') >= i).count().alias(f"sentence_>{i}_cnt") for i in [0,15,50,100,150,200,250,300] ], 
        *[pl.col('sentence').filter(pl.col('sentence_len') <= i).count().alias(f"sentence_<{i}_cnt") for i in [15,50] ], 

        *[pl.col(fea).max().alias(f"{fea}_max") for fea in sentence_fea],
        *[pl.col(fea).mean().alias(f"{fea}_mean") for fea in sentence_fea],
        *[pl.col(fea).min().alias(f"{fea}_min") for fea in sentence_fea],
        *[pl.col(fea).sum().alias(f"{fea}_sum") for fea in sentence_fea],
        *[pl.col(fea).first().alias(f"{fea}_first") for fea in sentence_fea],
        *[pl.col(fea).last().alias(f"{fea}_last") for fea in sentence_fea],
        *[pl.col(fea).kurtosis().alias(f"{fea}_kurtosis") for fea in sentence_fea],
        *[pl.col(fea).quantile(0.25).alias(f"{fea}_q1") for fea in sentence_fea],  
        *[pl.col(fea).quantile(0.75).alias(f"{fea}_q3") for fea in sentence_fea],  
    
        ]
    df = train_tmp.group_by(['essay_id'], maintain_order=True).agg(aggs).sort("essay_id")
    df = df.to_pandas()
    return df

tmp = Sentence_Preprocess(train)
train_feats = train_feats.merge(Sentence_Eng(tmp), on='essay_id', how='left')

feature_names = list(filter(lambda x: x not in ['essay_id','score'], train_feats.columns))
print('Features Number: ',len(feature_names))
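
Splitting on '.' has a side effect that is easy to miss, shown here with a plain Python analogue on a made-up sentence pair: text that ends with a period produces a trailing empty piece.

```python
text = "i like dogs. they are loyal."

sentences = text.split(".")
print(sentences)       # ['i like dogs', ' they are loyal', '']
print(len(sentences))  # 3, although the text has two sentences

# Character and word counts computed the same way as in Sentence_Preprocess.
sentence_len = [len(s) for s in sentences]
sentence_word_cnt = [len(s.split(" ")) for s in sentences]
print(sentence_len)       # [11, 15, 0]
print(sentence_word_cnt)  # [3, 4, 1]
```

The empty piece has length 0 but a word count of 1, and the leading space of the second sentence adds an empty "word", so these counts are approximate rather than exact.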


# Word Preprocessing

---

**Explanation:**

We define two more functions for word-level preprocessing and feature engineering:

- **Word Preprocessing:**
  - The function **Word_Preprocess** takes a DataFrame **tmp** as input.
  - The 'full_text' column is cleaned with **dataPreprocessing** and then split into words using a space (' ') as the delimiter.
  - The resulting DataFrame **tmp** contains the 'word' column with each element representing a word.
  - The 'word' column is then exploded to create multiple rows, each containing a single word.
  - The length of each word in terms of characters is calculated.
  - Any rows with word lengths equal to 0 are filtered out to remove empty words.

- **Word Feature Engineering:**
  - The function **Word_Eng** takes a DataFrame **train_tmp** as input.
  - The DataFrame is grouped by the 'essay_id' column, and various aggregate statistics are computed for the word-level features.
  - Aggregate statistics include, for each k from 1 to 15, the count of words with at least k characters, along with the maximum, mean, and standard deviation of word length and its quartiles (Q1, median, Q3).
  - The resulting DataFrame **df** is converted to a pandas DataFrame.
  - Outside the function, the word-level features are merged into **train_feats** on the 'essay_id' column.

- **Feature Names Extraction:**
  - Extracts the feature names from the DataFrame **train_feats** excluding 'essay_id' and 'score' columns.

- **Output Display:**
  - Prints the number of features extracted after merging word-level features.
  
These functions preprocess words and perform feature engineering to extract relevant features from the text data at the word level. 

In [None]:
def Word_Preprocess(tmp):
    tmp = tmp.with_columns(pl.col('full_text').map_elements(dataPreprocessing).str.split(by=" ").alias("word"))
    tmp = tmp.explode('word')
    tmp = tmp.with_columns(pl.col('word').map_elements(lambda x: len(x)).alias("word_len"))
    tmp = tmp.filter(pl.col('word_len')!=0)
    
    return tmp
def Word_Eng(train_tmp):
    aggs = [
        *[pl.col('word').filter(pl.col('word_len') >= i+1).count().alias(f"word_{i+1}_cnt") for i in range(15) ], 
        pl.col('word_len').max().alias(f"word_len_max"),
        pl.col('word_len').mean().alias(f"word_len_mean"),
        pl.col('word_len').std().alias(f"word_len_std"),
        pl.col('word_len').quantile(0.25).alias(f"word_len_q1"),
        pl.col('word_len').quantile(0.50).alias(f"word_len_q2"),
        pl.col('word_len').quantile(0.75).alias(f"word_len_q3"),
        ]
    df = train_tmp.group_by(['essay_id'], maintain_order=True).agg(aggs).sort("essay_id")
    df = df.to_pandas()
    return df

tmp = Word_Preprocess(train)
train_feats = train_feats.merge(Word_Eng(tmp), on='essay_id', how='left')

feature_names = list(filter(lambda x: x not in ['essay_id','score'], train_feats.columns))
print('Features Number: ',len(feature_names))
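
Because the filter in `Word_Eng` uses `>=`, each `word_{k}_cnt` feature is a cumulative count. A small plain-Python sketch on a made-up sentence shows the difference from exact-length counts.

```python
words = "the quick brown fox jumps over lazy dogs".split(" ")
word_lens = [len(w) for w in words]  # [3, 5, 5, 3, 5, 4, 4, 4]

# word_{k}_cnt counts words with at least k characters, matching the `>=` filter.
at_least = {f"word_{k}_cnt": sum(1 for n in word_lens if n >= k) for k in range(1, 7)}
print(at_least)

# Counts of words with exactly k characters follow by differencing neighbours.
exactly_4 = at_least["word_4_cnt"] - at_least["word_5_cnt"]
print(exactly_4)  # 3
```

The cumulative counts never increase with k, so neighbouring features are strongly correlated.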


# Tf-idf Vectorizer Setup and Feature Engineering

---

**Explanation:**

A TF-IDF vectorizer is used to convert the essays into numerical features, followed by merging these features with the existing feature set:

- **TF-IDF Vectorization:**
  - The **TfidfVectorizer** class is initialized with the following parameters:
    - **tokenizer**: The identity function `lambda x: x` is passed, so scikit-learn performs no tokenization of its own. Note that `train['full_text']` holds raw strings rather than token lists, so the n-grams below are built over the characters of each essay, not over words.
    - **preprocessor**: The identity function is passed again, so documents are left as they are (no lowercasing).
    - **token_pattern**: None is provided, since a custom tokenizer is supplied and the pattern would be ignored.
    - **strip_accents**: 'unicode' is set, but scikit-learn skips its built-in accent stripping when a custom preprocessor is supplied, so this setting has no effect here.
    - **analyzer**: 'word' is specified, so n-grams are formed from whatever sequence the tokenizer returns.
    - **ngram_range**: The range of n-grams considered is set to (3,6), meaning it will consider n-grams of sizes 3 to 6.
    - **min_df**: The minimum document frequency is set to 0.05, meaning n-grams that occur in fewer than 5% of the documents are ignored.
    - **max_df**: The maximum document frequency is set to 0.95, meaning n-grams that occur in more than 95% of the documents are ignored.
    - **sublinear_tf**: True applies sublinear term-frequency scaling, replacing tf with 1 + log(tf), which dampens the influence of n-grams that repeat many times within one essay.
  - The **fit_transform** method is called on the vectorizer with the list comprehension `[i for i in train['full_text']]` to fit the vectorizer to the training data and transform it into a TF-IDF matrix.
  - The resulting TF-IDF matrix is converted to a dense matrix using the **toarray** method.
  - A DataFrame **df** is created from the dense matrix, where each column represents a TF-IDF feature.
  - Column names are assigned as 'tfid_0', 'tfid_1', ..., 'tfid_{n-1}', where n is the number of features.
  - The 'essay_id' column is added to the DataFrame to facilitate merging with other features.

- **Merging Features:**
  - The TF-IDF features DataFrame **df** is merged with the existing feature DataFrame **train_feats** based on the 'essay_id' column.

- **Feature Names Extraction:**
  - Extracts the feature names from the DataFrame **train_feats** excluding 'essay_id' and 'score' columns.

- **Output Display:**
  - Prints the number of features extracted after merging TF-IDF features.

We add TF-IDF features derived from the essays to the existing feature set. These features capture the importance of n-grams in each essay and are used as input to the LightGBM model.

In [None]:
vectorizer = TfidfVectorizer(
            tokenizer=lambda x: x,
            preprocessor=lambda x: x,
            token_pattern=None,
            strip_accents='unicode',
            analyzer = 'word',
            ngram_range=(3,6),
            min_df=0.05,
            max_df=0.95,
            sublinear_tf=True,
)

train_tfid = vectorizer.fit_transform([i for i in train['full_text']])
dense_matrix = train_tfid.toarray()
df = pd.DataFrame(dense_matrix)
tfid_columns = [ f'tfid_{i}' for i in range(len(df.columns))]
df.columns = tfid_columns
df['essay_id'] = train_feats['essay_id']
train_feats = train_feats.merge(df, on='essay_id', how='left')
feature_names = list(filter(lambda x: x not in ['essay_id','score'], train_feats.columns))
print('Number of Features: ',len(feature_names))


# CountVectorizer Setup and Feature Engineering

---

**Explanation:**

A CountVectorizer is used to convert the essays into numerical features, followed by merging these features with the existing feature set:

- **CountVectorizer Configuration:**
  - The **CountVectorizer** class is initialized with similar parameters to the TF-IDF vectorizer:
    - **tokenizer**: The identity function `lambda x: x` is passed, so no additional tokenization is applied.
    - **preprocessor**: The identity function is passed, so documents are left as they are.
    - **token_pattern**: None is provided, since a custom tokenizer is supplied.
    - **strip_accents**: 'unicode' is set, but it has no effect because a custom preprocessor is supplied.
    - **analyzer**: 'word' is specified, so n-grams are formed from whatever sequence the tokenizer returns.
    - **ngram_range**: The range of n-grams considered is set to (2,3), meaning it will consider n-grams of sizes 2 to 3.
    - **min_df**: The minimum document frequency is set to 0.10, meaning n-grams that occur in fewer than 10% of the documents are ignored.
    - **max_df**: The maximum document frequency is set to 0.85, meaning n-grams that occur in more than 85% of the documents are ignored.

- **Vectorization and Dense Matrix Conversion:**
  - The **fit_transform** method is called on the CountVectorizer with the list comprehension `[i for i in train['full_text']]` to fit the vectorizer to the training data and transform it into a count matrix.
  - The resulting count matrix is converted to a dense matrix using the **toarray** method.

- **DataFrame Creation and Column Naming:**
  - A DataFrame **df** is created from the dense matrix, where each column represents a count feature.
  - Column names are assigned as 'tfid_cnt_0', 'tfid_cnt_1', ..., 'tfid_cnt_{n-1}', where n is the number of features.

- **Merging Features:**
  - The count features DataFrame **df** is merged with the existing feature DataFrame **train_feats** based on the 'essay_id' column.

We add count features derived from the essays to the existing feature set. These features represent how often specific n-grams occur in each essay.
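
The `min_df` and `max_df` thresholds can be illustrated without scikit-learn. The following is a simplified pure-Python re-implementation of document-frequency filtering on toy token lists. The function names and data are hypothetical, and it is a sketch of the idea rather than scikit-learn's actual code.

```python
def bigrams(tokens):
    # Join each pair of neighbouring tokens into one bigram string.
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep bigrams whose document frequency lies within [min_df, max_df] (as proportions)."""
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        # set() so a bigram is counted once per document, however often it repeats.
        for gram in set(bigrams(tokens)):
            doc_freq[gram] = doc_freq.get(gram, 0) + 1
    return sorted(
        gram for gram, count in doc_freq.items()
        if min_df * n_docs <= count <= max_df * n_docs
    )

docs = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "cat", "hid"],
    ["a", "dog", "sat"],
]

# "the cat" appears in 3 of 4 documents (75%); every other bigram appears in 1 (25%).
kept = filter_by_document_frequency(docs, min_df=0.5, max_df=0.85)
kept_strict = filter_by_document_frequency(docs, min_df=0.5, max_df=0.7)
print(kept, kept_strict)
```

Tightening `max_df` below 75% removes the one bigram that survived the `min_df` cut, which shows how the two thresholds bracket the vocabulary from both sides.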

In [None]:
vectorizer_cnt = CountVectorizer(
            tokenizer=lambda x: x,
            preprocessor=lambda x: x,
            token_pattern=None,
            strip_accents='unicode',
            analyzer = 'word',
            ngram_range=(2,3),
            min_df=0.10,
            max_df=0.85,
)
train_tfid = vectorizer_cnt.fit_transform([i for i in train['full_text']])
dense_matrix = train_tfid.toarray()
df = pd.DataFrame(dense_matrix)
tfid_columns = [ f'tfid_cnt_{i}' for i in range(len(df.columns))]
df.columns = tfid_columns
df['essay_id'] = train_feats['essay_id']
train_feats = train_feats.merge(df, on='essay_id', how='left')

# Deberta Embeddings and Model Evaluation Setup

---

**Explanation:**

Out-of-fold (OOF) predictions from a pre-trained DeBERTa model are loaded and added as features to the existing feature set:

- **Loading DeBERTa Out-of-Fold Predictions:**
  - The **joblib.load** function is used to load the out-of-fold predictions stored in the file '/kaggle/input/aes2-400-20240419134941/oof.pkl'.
  - The variable **deberta_oof** now contains the loaded OOF predictions.

- **Adding DeBERTa OOF Predictions as Features:**
  - A loop iterates over `range(6)`, adding one feature for each of the six columns of the DeBERTa output (one per score category, 1 to 6).
  - For each prediction index **i**, a new column named **f'deberta_oof_{i}'** is added to the DataFrame **train_feats**. The values for these columns are taken from the **deberta_oof** array.

- **Feature Names Extraction:**
  - After adding DeBERTa OOF predictions as features, the script extracts feature names from the DataFrame **train_feats**, excluding the 'essay_id' and 'score' columns.

- **Output Display:**
  - Prints the number of features extracted after adding DeBERTa OOF predictions.

- **DataFrame Shape:**
  - The shape of the DataFrame **train_feats** is printed to display the number of rows and columns after adding the new features.

This cell loads the DeBERTa out-of-fold predictions and adds them as features to the existing feature set. These features represent the predictions made by the DeBERTa model on the training data.
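
The column assignment is positional: row `j` of the OOF array is attached to row `j` of `train_feats`, with no join on 'essay_id'. The sketch below uses a made-up array in place of the loaded file to show the pattern together with a shape check. The toy data and the name `oof_array` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_feats.
feats = pd.DataFrame({"essay_id": ["a", "b", "c"], "score": [2, 4, 3]})

# Hypothetical stand-in for the loaded OOF array: one row per essay, six columns.
oof_array = np.arange(18, dtype=np.float32).reshape(3, 6)

# Guard against a silent mismatch, since rows are matched by position only.
assert oof_array.shape == (len(feats), 6)

for i in range(oof_array.shape[1]):
    feats[f"deberta_oof_{i}"] = oof_array[:, i]

print(feats.shape)
```

A check like this only confirms the row count. It cannot confirm that the rows are in the same essay order, so the OOF file needs to have been written in the order of the training data.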

In [None]:
import joblib

deberta_oof = joblib.load('/kaggle/input/aes2-400-20240419134941/oof.pkl')
print(deberta_oof.shape, train_feats.shape)

for i in range(6):
    train_feats[f'deberta_oof_{i}'] = deberta_oof[:, i]

feature_names = list(filter(lambda x: x not in ['essay_id','score'], train_feats.columns))
print('Features Number: ',len(feature_names))    

train_feats.shape

# QWK Metric Definitions and Parameters 

---

**Explanation:**

We define functions related to the Quadratic Weighted Kappa (QWK) metric and an objective function for gradient boosting models:

- **Quadratic Weighted Kappa Function (`quadratic_weighted_kappa`):**
  - This function calculates the Quadratic Weighted Kappa (QWK) metric between the true labels and predicted labels.
  - It takes two arguments: `y_true` (true labels) and `y_pred` (predicted labels).
  - It first adds the constant `a` back to both labels and predictions, undoing the shift applied to the targets before training (`y = score - a`), so both are on the original 1 to 6 scale.
  - The predictions are clipped to the range 1 to 6 and rounded to the nearest integer so that they are valid scores.
  - The `cohen_kappa_score` function from scikit-learn is used to compute the QWK score with quadratic weights.
  - The function returns a tuple ('QWK', qwk, True), where 'QWK' is a string indicating the name of the metric, `qwk` is the computed QWK score, and True indicates that higher values of the metric are better.

- **Objective Function for Gradient Boosting (`qwk_obj`):**
  - This function defines a custom objective for gradient boosting that serves as a differentiable surrogate for the QWK metric.
  - It takes two arguments: `y_true` (true labels) and `y_pred` (predicted labels).
  - As in the previous function, `a` is added back to both labels and predictions to return them to the original score scale.
  - Predictions are clipped between 1 and 6.
  - The function computes the gradient of the ratio `f/g` with respect to each prediction, where `f` is half the sum of squared errors and `g` is half the sum of `(preds - a)**2 + b`; the gradient is then scaled by the number of samples.
  - The Hessian is not derived analytically; it is set to a vector of ones.
  - The result is returned as the tuple `(grad, hess)`, which LightGBM uses during training.

- **Constants `a` and `b`:**
  - Constants `a` and `b` are defined with values 2.998 and 1.092, respectively.
  - `a` is the offset subtracted from the scores to form the training target; both functions add it back to recover the original scale.
  - `b` appears only in the denominator term `g` of the objective function.
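
The gradient used in `qwk_obj` can be read off from the quotient rule. The notation below is introduced here for illustration: $p_i$ is the clipped prediction and $y_i$ the label, both after adding `a`, and $n$ is the number of samples.

```latex
f(p) = \frac{1}{2}\sum_{i=1}^{n} (p_i - y_i)^2,
\qquad
g(p) = \frac{1}{2}\sum_{i=1}^{n} \bigl((p_i - a)^2 + b\bigr),
\qquad
\mathcal{L}(p) = \frac{f(p)}{g(p)}

\frac{\partial \mathcal{L}}{\partial p_i}
  = \frac{p_i - y_i}{g} - \frac{f\,(p_i - a)}{g^{2}},
\qquad
\texttt{grad}_i = n \cdot \frac{\partial \mathcal{L}}{\partial p_i},
\qquad
\texttt{hess}_i = 1
```

The terms `df = preds - labels` and `dg = preds - a` in the code are the partial derivatives of $f$ and $g$. The constant Hessian is a simplification chosen in the code, not the true second derivative of $\mathcal{L}$.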



In [None]:
def quadratic_weighted_kappa(y_true, y_pred):
    y_true = y_true + a
    y_pred = (y_pred + a).clip(1, 6).round()
    qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
    return 'QWK', qwk, True
def qwk_obj(y_true, y_pred):
    labels = y_true + a
    preds = y_pred + a
    preds = preds.clip(1, 6)
    f = 1/2*np.sum((preds-labels)**2)
    g = 1/2*np.sum((preds-a)**2+b)
    df = preds - labels
    dg = preds - a
    grad = (df/g - f*dg/g**2)*len(labels)
    hess = np.ones(len(labels))
    return grad, hess
a = 2.998
b = 1.092
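
One way to gain confidence in the gradient formula is a finite-difference check. The sketch below restates the loss as a scalar function `ratio_loss` (a name introduced here, not in the notebook) and compares its numerical derivative with the analytic expression, leaving out the `len(labels)` scaling. The toy predictions are chosen so that clipping is inactive.

```python
import numpy as np

a, b = 2.998, 1.092  # constants from the notebook

def ratio_loss(y_true, y_pred):
    # Scalar loss f/g whose gradient qwk_obj is based on (hypothetical helper).
    labels = y_true + a
    preds = (y_pred + a).clip(1, 6)
    f = 0.5 * np.sum((preds - labels) ** 2)
    g = 0.5 * np.sum((preds - a) ** 2 + b)
    return f / g

def analytic_grad(y_true, y_pred):
    # Same expression as in qwk_obj, without the len(labels) factor.
    labels = y_true + a
    preds = (y_pred + a).clip(1, 6)
    f = 0.5 * np.sum((preds - labels) ** 2)
    g = 0.5 * np.sum((preds - a) ** 2 + b)
    return (preds - labels) / g - f * (preds - a) / g ** 2

y_true = np.array([1.0, 3.0, 4.0, 6.0]) - a   # shifted labels
y_pred = np.array([-1.0, 0.2, 0.5, 2.0])      # raw outputs, inside the clip range after + a

# Central differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(y_pred)
for i in range(len(y_pred)):
    step = np.zeros_like(y_pred)
    step[i] = eps
    numeric[i] = (ratio_loss(y_true, y_pred + step)
                  - ratio_loss(y_true, y_pred - step)) / (2 * eps)

analytic = analytic_grad(y_true, y_pred)
print(np.allclose(numeric, analytic, atol=1e-6))
```

Where a prediction is clipped at 1 or 6, the true derivative of the clipped loss with respect to that prediction is zero, while the code's formula is not; the check above avoids that region on purpose.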


## Data Preparation for Model Training

---

**Explanation:**

The feature matrix `X`, the target arrays `y_split` and `y`, and the out-of-fold buffer `oof` are prepared for training and evaluation:

- **Feature Matrix `X`:**
  - The feature matrix `X` is extracted from the DataFrame `train_feats`. Only the features are selected for training, and the column names stored in the list `feature_names` are used to index the DataFrame.
  - The `.astype(np.float32)` method is used to cast the feature values to 32-bit floating-point numbers to reduce memory usage.
  - The `.values` attribute is used to extract the values from the DataFrame, resulting in a NumPy array.

- **Split Target Variable `y_split`:**
  - The target variable `y_split` is extracted from the 'score' column of the DataFrame `train_feats`. It is cast to an integer data type using `.astype(int)`.
  - This variable is used for splitting the dataset into training and validation sets.

- **Adjusted Target Variable `y`:**
  - The target variable `y` is derived from the 'score' column of the DataFrame `train_feats`.
  - The constant `a` (2.998) is subtracted from the 'score' values, so scores 1 to 6 become roughly -2 to 3. The metric and objective functions add `a` back before comparing predictions with labels.

- **Out-of-Fold Target Variable `oof`:**
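  - The array `oof` is initialised as an integer copy of the 'score' column and is later overwritten, fold by fold, with each model's predictions on its held-out fold.
  - Because the array has an integer dtype, the float predictions are truncated when stored. OOF predictions are typically used for model evaluation and ensemble methods.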
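
The following sketch, on made-up scores, shows the target shift and what happens when float predictions are written into an integer buffer versus a float one. The float buffer `oof_float` is an alternative shown for comparison, not something the notebook defines.

```python
import numpy as np

a = 2.998
scores = np.array([1, 3, 6])

# Shifted regression target, as in y = score - a.
y = scores.astype(np.float32) - a

# Integer buffer (as in the notebook) versus a float buffer (alternative).
oof_int = scores.astype(int).copy()
oof_float = np.zeros(len(scores), dtype=np.float32)

fold_idx = np.array([0, 2])          # indices of a hypothetical held-out fold
fold_preds = np.array([1.7, 5.9])    # hypothetical predictions on the score scale

oof_int[fold_idx] = fold_preds       # values are truncated to integers
oof_float[fold_idx] = fold_preds     # values are kept as floats

print(y, oof_int, oof_float)
```

If the OOF values are later used for threshold tuning or stacking, a float buffer preserves the continuous predictions.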


In [None]:
X = train_feats[feature_names].astype(np.float32).values

y_split = train_feats['score'].astype(int).values
y = train_feats['score'].astype(np.float32).values-a
oof = train_feats['score'].astype(int).values

In [None]:
len(feature_names)

# Feature Selection Wrapper Function

---

**Explanation:**

The `feature_select_wrapper` function performs feature selection using LightGBM regression models trained within a Stratified K-Fold cross-validation loop. Here's a breakdown of the function:

- **Feature Selection Process:**
  - The function starts by defining a list of features to be considered for selection, obtained from the `feature_names` variable.
  - It initializes a Series `fse` of zeros, indexed by feature name, to accumulate importance scores.

- **Stratified K-Fold Cross-Validation:**
  - The function uses Stratified K-Fold cross-validation (`StratifiedKFold`) with 5 folds (`n_splits=5`). This ensures that class distributions are approximately equal in each fold.
  - Within each fold, the training data is split into train and test sets (`X_train_fold`, `X_test_fold`, `y_train_fold`, `y_test_fold`, `y_test_fold_int`) using the indices generated by `skf.split`.
  - A LightGBM regression model is instantiated (`lgb.LGBMRegressor`) with specified hyperparameters for regression.
  - The model is trained on the training data (`X_train_fold`, `y_train_fold`) and evaluated on both the training and validation sets (`eval_set`), using the Quadratic Weighted Kappa as the evaluation metric (`eval_metric=quadratic_weighted_kappa`).
  - The model's predictions on the test set (`X_test_fold`) are computed and rounded to the nearest integer within the range [1, 6].
  - Performance metrics such as F1 score (`f1_fold`) and Cohen's kappa score (`kappa_fold`) are computed and printed for each fold.
  - Additionally, a confusion matrix (`cm`) is generated and displayed for visual inspection of model performance.

- **Feature Importance Calculation:**
  - Feature importance scores (`predictor.feature_importances_`) from each fold are aggregated in the `fse` Series, where each score is added to the corresponding feature's entry.
  - After processing all folds, the feature importance scores in `fse` are sorted in descending order, and the top 13,000 features are selected (`feature_select`).

- **Return Value:**
  - The function returns the list of selected features (`feature_select`).

This function selects features based on the importance scores of LightGBM models trained in a cross-validation framework. Summing the importances across folds makes the selection less dependent on any single train/validation split.
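
The accumulation step can be sketched in isolation. Here the per-fold importance arrays are invented numbers standing in for `predictor.feature_importances_`, and `top_k` plays the role of the 13,000 cut-off.

```python
import numpy as np
import pandas as pd

features = ["f0", "f1", "f2", "f3"]

# Hypothetical importances from three folds (one array per fitted model).
fold_importances = [
    np.array([5, 0, 2, 1]),
    np.array([4, 1, 3, 0]),
    np.array([6, 0, 1, 2]),
]

# Start from zeros and add each fold's importances, aligned by feature name.
fse = pd.Series(0, index=features)
for importances in fold_importances:
    fse += pd.Series(importances, features)

# Rank by total importance and keep the top_k feature names.
top_k = 2
feature_select = fse.sort_values(ascending=False).index.tolist()[:top_k]
print(fse.to_dict(), feature_select)
```

If the cut-off is larger than the number of features, the slice simply returns every feature in ranked order.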

In [None]:
def feature_select_wrapper():

    features = feature_names

    
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    fse = pd.Series(0, index=features)
    
        
        
    for train_index, test_index in skf.split(X, y_split):

        X_train_fold, X_test_fold = X[train_index], X[test_index]


        y_train_fold, y_test_fold, y_test_fold_int = y[train_index], y[test_index], y_split[test_index]

        model = lgb.LGBMRegressor(
                    objective = qwk_obj,
                    metrics = 'None',
                    learning_rate = 0.01,
                    max_depth = 5,
                    num_leaves = 10,
                    colsample_bytree=0.3,
                    reg_alpha = 0.7,
                    reg_lambda = 0.1,
                    n_estimators=700,
                    random_state=412,
                    extra_trees=True,
                    class_weight='balanced',
                    verbosity = - 1)

        predictor = model.fit(X_train_fold,
                                      y_train_fold,
                                      eval_names=['train', 'valid'],
                                      eval_set=[(X_train_fold, y_train_fold), (X_test_fold, y_test_fold)],
                                      eval_metric=quadratic_weighted_kappa,
                                      callbacks=callbacks,)
        models.append(predictor)
        predictions_fold = predictor.predict(X_test_fold)
        predictions_fold = predictions_fold + a
        oof[test_index]=predictions_fold
        predictions_fold = predictions_fold.clip(1, 6).round()
        predictions.append(predictions_fold)
        f1_fold = f1_score(y_test_fold_int, predictions_fold, average='weighted')
        f1_scores.append(f1_fold)


        kappa_fold = cohen_kappa_score(y_test_fold_int, predictions_fold, weights='quadratic')
        kappa_scores.append(kappa_fold)

        cm = confusion_matrix(y_test_fold_int, predictions_fold, labels=[x for x in range(1,7)])

        disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                                      display_labels=[x for x in range(1,7)])
        disp.plot()
        plt.show()
        print(f'F1 score across fold: {f1_fold}')
        print(f'Cohen kappa score across fold: {kappa_fold}')

        fse += pd.Series(predictor.feature_importances_, features)
    
    
    
    feature_select = fse.sort_values(ascending=False).index.tolist()[:13000]
    return feature_select

## Model Training and Evaluation Loop

In [None]:
f1_scores = []
kappa_scores = []
models = []
predictions = []
callbacks = [log_evaluation(period=25), early_stopping(stopping_rounds=75,first_metric_only=True)]
feature_select = feature_select_wrapper()

In [None]:
X = train_feats[feature_select].astype(np.float32).values

# Model Evaluation Metrics Calculation

___

**Explanation:**

We conduct model training and evaluation using a LightGBM regressor within a Stratified K-Fold cross-validation loop.

- **Setting up Cross-Validation:**
  - The number of splits for Stratified K-Fold cross-validation is set to 15 (`n_splits = 15`). This means the dataset will be split into 15 folds.
  - Stratified K-Fold cross-validation is initialized with the specified parameters.

- **Model Training and Evaluation:**
  - Within the loop, each fold is iterated over.
  - Training and testing data are split based on the current fold indices.
  - A LightGBM regressor model is instantiated with specified hyperparameters for regression.
  - The model is trained on the training data and evaluated on both the training and validation sets.
  - Predictions are made on the test set, and performance metrics such as F1 score and Cohen's kappa score are computed and printed for each fold.
  - A confusion matrix is generated and displayed to visualize the model's performance on each fold.

- **Aggregating Results:**
  - F1 scores and Cohen's kappa scores for each fold are stored in the respective lists (`f1_scores` and `kappa_scores`).
  - After iterating over all folds, the mean F1 score and mean Cohen's kappa score across all folds are computed and printed.
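
As a small illustration of what stratification does, the sketch below splits a toy set of integer scores into three folds and counts the scores in each held-out fold. The data and the fold count are made up; the notebook itself uses 15 folds on the real scores.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy integer scores: 6 essays scored 1, 9 scored 2, 3 scored 3.
y_toy = np.array([1] * 6 + [2] * 9 + [3] * 3)
X_toy = np.zeros((len(y_toy), 1))  # features do not influence the split

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

fold_counts = []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    # Count how many essays of each score land in the held-out fold.
    counts = np.bincount(y_toy[test_idx], minlength=4)[1:]
    fold_counts.append(counts.tolist())

print(fold_counts)
```

Each held-out fold keeps the same score proportions as the full set, which is why the notebook stratifies on the integer `y_split` rather than on the shifted float target `y`.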



In [None]:
n_splits = 15

skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)

f1_scores = []
kappa_scores = []
models = []
predictions = []
callbacks = [log_evaluation(period=25), early_stopping(stopping_rounds=75,first_metric_only=True)]

i=1
for train_index, test_index in skf.split(X, y_split):
   
    print('fold',i)
    X_train_fold, X_test_fold = X[train_index], X[test_index]
    
   
    y_train_fold, y_test_fold, y_test_fold_int = y[train_index], y[test_index], y_split[test_index]
    
    model = lgb.LGBMRegressor(
                objective = qwk_obj,
                metrics = 'None',
                learning_rate = 0.01,
                max_depth = 5,
                num_leaves = 10,
                colsample_bytree=0.3,
                reg_alpha = 0.7,
                reg_lambda = 0.1,
                n_estimators=700,
                random_state=42,
                extra_trees=True,
                class_weight='balanced',
                verbosity = - 1)

    predictor = model.fit(X_train_fold,
                                  y_train_fold,
                                  eval_names=['train', 'valid'],
                                  eval_set=[(X_train_fold, y_train_fold), (X_test_fold, y_test_fold)],
                                  eval_metric=quadratic_weighted_kappa,
                                  callbacks=callbacks,)
    models.append(predictor)
    predictions_fold = predictor.predict(X_test_fold)
    predictions_fold = predictions_fold + a
    oof[test_index]=predictions_fold
    predictions_fold = predictions_fold.clip(1, 6).round()
    predictions.append(predictions_fold)
    f1_fold = f1_score(y_test_fold_int, predictions_fold, average='weighted')
    f1_scores.append(f1_fold)
    
    
    kappa_fold = cohen_kappa_score(y_test_fold_int, predictions_fold, weights='quadratic')
    kappa_scores.append(kappa_fold)
    
    cm = confusion_matrix(y_test_fold_int, predictions_fold, labels=[x for x in range(1,7)])

    disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                                  display_labels=[x for x in range(1,7)])
    disp.plot()
    plt.show()
    print(f'F1 score across fold: {f1_fold}')
    print(f'Cohen kappa score across fold: {kappa_fold}')
    i+=1

mean_f1_score = np.mean(f1_scores)
mean_kappa_score = np.mean(kappa_scores)

print(f'Mean F1 score across {n_splits} folds: {mean_f1_score}')
print(f'Mean Cohen kappa score across {n_splits} folds: {mean_kappa_score}')

In [None]:
import pickle

with open('models.pkl', 'wb') as f:
    pickle.dump(models, f)

In [None]:
with open('models.pkl', 'rb') as f:
    models = pickle.load(f)

# **Inference**

___

**Explanation:**
- **Paragraph Preprocessing:**
  - The `Paragraph_Preprocess` function is applied to the test data (`test`) to preprocess the paragraphs.
  - The resulting processed paragraphs are then used to extract paragraph features using the `Paragraph_Eng` function.
  - The extracted paragraph features are stored in `test_feats`.

- **Sentence Preprocessing:**
  - The `Sentence_Preprocess` function is applied to the test data to preprocess the sentences within the essays.
  - Features related to sentence length and word count are extracted using the `Sentence_Eng` function.
  - The extracted sentence features are merged with `test_feats`.

- **Word Preprocessing:**
  - The `Word_Preprocess` function is applied to the test data to preprocess individual words.
  - Features related to word length and word count are extracted using the `Word_Eng` function.
  - The extracted word features are merged with `test_feats`.

- **TfidfVectorizer:**
  - The TfidfVectorizer (`vectorizer`) is applied to transform the test essays into TF-IDF features.
  - The TF-IDF features are merged with `test_feats`.

- **CountVectorizer:**
  - The CountVectorizer (`vectorizer_cnt`) is applied to transform the test essays into count-based features.
  - The count-based features are merged with `test_feats`.

- **Deberta Out-of-Fold Predictions:**
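  - The DeBERTa predictions for the test essays (`predicted_score`, defined outside this section) are assigned to the `deberta_oof_{i}` columns of `test_feats`, one per score category. The column names match the training features even though these are test-set predictions rather than out-of-fold ones.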

- **Feature Number:**
  - The number of features extracted is calculated by excluding the 'essay_id' and 'score' columns from `test_feats`.
  - The total number of features is printed.





In [None]:
# Paragraph
tmp = Paragraph_Preprocess(test)
test_feats = Paragraph_Eng(tmp)
# Sentence
tmp = Sentence_Preprocess(test)
test_feats = test_feats.merge(Sentence_Eng(tmp), on='essay_id', how='left')
# Word
tmp = Word_Preprocess(test)
test_feats = test_feats.merge(Word_Eng(tmp), on='essay_id', how='left')

# Tfidf
test_tfid = vectorizer.transform([i for i in test['full_text']])
dense_matrix = test_tfid.toarray()
df = pd.DataFrame(dense_matrix)
tfid_columns = [ f'tfid_{i}' for i in range(len(df.columns))]
df.columns = tfid_columns
df['essay_id'] = test_feats['essay_id']
test_feats = test_feats.merge(df, on='essay_id', how='left')

# CountVectorizer
test_tfid = vectorizer_cnt.transform([i for i in test['full_text']])
dense_matrix = test_tfid.toarray()
df = pd.DataFrame(dense_matrix)
tfid_columns = [ f'tfid_cnt_{i}' for i in range(len(df.columns))]
df.columns = tfid_columns
df['essay_id'] = test_feats['essay_id']
test_feats = test_feats.merge(df, on='essay_id', how='left')

for i in range(6):
    test_feats[f'deberta_oof_{i}'] = predicted_score[:, i]

# Features number
feature_names = list(filter(lambda x: x not in ['essay_id','score'], test_feats.columns))
print('Features number: ',len(feature_names))


## Generating Submission File


___

**Explanation:**
- **Ensemble Prediction:**
  - For each model in the list `models`, predictions are made using the selected features (`feature_select`) from the test data (`test_feats`).
  - The predictions are adjusted by adding `a` to align them with the scoring scale.
  - The adjusted predictions from all models are stored in the `probabilities` list (despite the name, these are regression outputs on the score scale, not probabilities).
  
- **Aggregating Predictions:**
  - The predictions from all models are averaged element-wise to obtain one continuous score per test essay.
  
- **Rounding Predictions:**
  - The averaged scores are first clipped to the valid range, so values below 1 become 1 and values above 6 become 6.
  - The clipped scores are then rounded to the nearest integer to get the final predicted scores.
  
- **Displaying Predictions:**
  - The final predictions are printed to the console.
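
The averaging, clipping, and rounding steps can be traced on a few invented numbers. In the sketch below each array stands in for one model's raw output on three essays; the values are hypothetical.

```python
import numpy as np

a = 2.998

# Hypothetical raw outputs of three models for three essays (shifted scale).
raw_preds = [
    np.array([-2.5, 0.1, 3.4]),
    np.array([-2.3, 0.3, 3.6]),
    np.array([-2.2, -0.1, 3.2]),
]

# Shift each model's output back to the 1-6 score scale.
shifted = [p + a for p in raw_preds]

# Average across models, clip to the valid range, then round to integers.
mean_scores = np.mean(shifted, axis=0)
final_scores = np.round(mean_scores.clip(1, 6)).astype(int)

print(mean_scores, final_scores)
```

The first and last essays fall outside the valid range after averaging and are pulled back to 1 and 6 by the clip before rounding.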



In [None]:
probabilities = []
for model in models:
    proba= model.predict(test_feats[feature_select])+ a
    probabilities.append(proba)

predictions = np.mean(probabilities, axis=0)

predictions = np.round(predictions.clip(1, 6))

print(predictions)

In [None]:
submission=pd.read_csv("/kaggle/input/learning-agency-lab-automated-essay-scoring-2/sample_submission.csv")
submission['score']=predictions
submission['score']=submission['score'].astype(int)
submission.to_csv("submission.csv",index=None)


## Keep Exploring! 👀

Thank you for delving into this notebook! If you found it insightful or beneficial, I encourage you to explore more of my projects and contributions on my profile.

👉 [Visit my Profile](https://www.kaggle.com/zulqarnainalipk) 👈

[GitHub]( https://github.com/zulqarnainalipk) |
[LinkedIn]( https://www.linkedin.com/in/zulqarnainalipk/)

## Share Your Thoughts! 🙏

Your feedback is invaluable! Your insights and suggestions drive our ongoing improvement. If you have any comments, questions, or ideas to contribute, please feel free to reach out.

📬 Contact me via email: [zulqar445ali@gmail.com](mailto:zulqar445ali@gmail.com)

I extend my sincere gratitude for your time and engagement. Your support inspires me to create even more valuable content.
Happy coding and best of luck in your data science endeavors! 🚀

