# TEST 1: Leveraging LLMs for Feature Generation and Classification

As a rule of thumb, if our data has $N$ features, we need around $10N$ data items to reach peak performance with classic classifiers like Logistic Regression. Therefore, if our vocabulary has 10,000 words, we would need around 100,000 items in the training set to reach peak performance.

An interesting idea to address this was explored in [Balek, V., Sýkora, L., Sklenák, V., & Kliegr, T. (2024). LLM-based feature generation from text for interpretable machine learning. ArXiv, abs/2409.07132](https://arxiv.org/abs/2409.07132). The idea is to use an LLM to generate meaningful and interpretable features from text, and then use Logistic Regression for classification.

For example, in the movie plots dataset, we could have features like:
- "Is the protagonist an animal?" (0 or 1)
- "Does the plot indicate psychological suffering?" (0 or 1)

With a reasonable number of these features, our model could make predictions based on meaningful features instead of raw words.
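
As a rough illustration of this idea, the sketch below turns a string of yes/no answers returned by an LLM into a named binary feature vector. The feature names, the helper `answers_to_features`, and the `"10"` answer format are assumptions made for this example; they are not prescribed by Balek et al.

```python
# Hypothetical feature names; each one corresponds to a yes/no question.
FEATURE_NAMES = ["protagonist_is_animal", "psychological_suffering"]


def answers_to_features(answer: str, names=FEATURE_NAMES) -> dict:
    """Map a string such as "10" to one binary feature per question."""
    answer = answer.strip()
    # Reject anything that is not exactly one binary digit per question.
    if len(answer) != len(names) or any(ch not in "01" for ch in answer):
        raise ValueError(f"Expected {len(names)} binary digits, got {answer!r}")
    return {name: int(ch) for name, ch in zip(names, answer)}


features = answers_to_features("10")
print(features)
```

A Logistic Regression model trained on such vectors has one coefficient per question, which is what makes the resulting classifier easier to interpret than one trained on thousands of word counts.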

## Objectives
* Perform feature extraction for a particular dataset
* Compare performance and explainability of classifiers with different approaches. 

## Rules

I highlight a few elements of our usual rules:

* You are **NOT ALLOWED** to use AI to generate any code you are asked to write yourself. This includes ChatGPT, Copilot, and all similar generators.
* You are **NOT ALLOWED** to use Google or any other search engine.
* You are **ALLOWED** to use the official documentation for these libraries and tools:
    * [sklearn](https://scikit-learn.org/)
    * [numpy](https://numpy.org/)
    * [matplotlib](https://matplotlib.org/)
    * [Google AI Studio](https://aistudio.google.com/)
* You are **ALLOWED** to use previous code from this course as basis.
* You **MUST** use the university's proctoring software to show you are complying with these rules
* This task is **INDIVIDUAL**. DO NOT share your code or results with anyone else.

## Tasks and Deliverables

* You may refer to [Balek et al.](https://arxiv.org/abs/2409.07132) at any point.
* Write well-commented code to solve each of the tasks below.
* Each task will be evaluated as:
    * Insufficient: task is not done, off-topic, or low-effort
    * In process: task is incomplete, done with a clear conceptual error, or comments are insufficient
    * Proficient: everything works and comments are enough to understand what is being done
    * Advanced: everything works, comments are enough to understand what is being done, and code is well organized and formatted using functions, dataclasses, and other adequate structures.
* This task should be finished by the end of the class.
* After you are finished, submit the executed notebook to our LMS.

### 1. Dataset Preparation:
Adapting Balek et al.'s strategy to our movie plot classification case, create a dataset with at least 100 labeled items and at least 5 meaningful features. None of the features can be the class itself ("is this a drama plot?"). Use a clear strategy to avoid exceeding free tier quotas. Store data locally in a format of your choice.

### 2. Classification:
Use the generated features to train a Logistic Regression model. Use cross-validation to select the best hyperparameters. Report accuracy and f1-score for your classifier.

### 3. Performance Comparison
Compare the performance of the following approaches:
1. Traditional Bag-of-Words
2. LLM-generated features with Logistic Regression
3. Direct classification using LLM

Use a bar plot to show the performance differences (choose either accuracy or F1-score).

### 4. Improvement Strategies
Determine whether labeling more items would improve system performance. Use data to justify your answer.
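
One common way to approach this question is a learning curve: train on growing subsets of the data and check whether the validation score is still rising at the largest size. The sketch below uses scikit-learn's `learning_curve` on synthetic placeholder data (`X`, `y`, and the 20% label noise are assumptions for this example), so the numbers it prints say nothing about the movie plots dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic placeholder data: 100 items with 5 binary features.
# The label follows the second feature, with about 20% of labels flipped.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 5))
flip = rng.random(100) < 0.2
y = np.where(flip, 1 - X[:, 1], X[:, 1])

# Train on 20%, 40%, ..., 100% of each training fold (5-fold cross-validation).
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{int(size):3d} training items -> mean validation accuracy {score:.2f}")
```

If the validation score has flattened by the largest training size, labeling more items is unlikely to help much; if it is still climbing, more labels are probably worth the cost.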


# **ANSWERS**

## 1. Dataset Preparation:

First, I formulate questions that I believe help distinguish a comedy plot from a drama plot. My initial questions were:

- Is the wording descriptive?

- Does the text indicate suffering?

- Do the characters sound goofy or funny?

- Does the text use more complex wording?

- Does the text leave unanswered questions?

The prompt I formulated, which rewords or replaces some of these questions, is the following:

In [22]:
import os
from dotenv import load_dotenv
import google.generativeai as genai
import pandas as pd
import time

load_dotenv()
GEMINI_API_KEY = os.getenv('GEMINI_API_KEY')

genai.configure(api_key=GEMINI_API_KEY)

prompt = """Gemini, I have a text that I need to classify based on questions. For the text, answer 5 questions with yes or no (use 1 for yes and 0 for no).
You should not leave any questions unanswered, always give a yes or no. If you are unsure, use your best judgement. The questions are as follows:
1. Does the text use very descriptive and formal language?
2. Does the text indicate some kind of suffering?
3. Do the characters in the text sound goofy or funny?
4. Does the text indicate a theme of subversion of expectations?
5. Does the text indicate a profound narrative?

The output should be binary, 1 for 'yes' and 0 for 'no'. Please follow this format when answering:

<QUESTION_1_ANSWER><QUESTION_2_ANSWER><QUESTION_3_ANSWER><QUESTION_4_ANSWER><QUESTION_5_ANSWER>

"""

df = pd.read_csv('https://raw.githubusercontent.com/tiagoft/NLP/main/wiki_movie_plots_drama_comedy.csv')

In [None]:
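generation_config = genai.GenerationConfig(
    max_output_tokens=5,  # the expected answer is only five digits
)

# Send the same prompt once per sampled plot
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

# Balanced sample: 50 drama plots and 50 comedy plots
df_A = df[df['Genre'] == 'drama'].sample(n=50, random_state=42)
df_B = df[df['Genre'] == 'comedy'].sample(n=50, random_state=42)
df_sampled = pd.concat([df_A, df_B])

rows = []  # one dict per successfully labeled plot

i = 0

while i < len(df_sampled):
    try:
        text = df_sampled['Plot'].iloc[i]
        genre = df_sampled['Genre'].iloc[i]

        response = model.generate_content([prompt, text],
                                        generation_config=generation_config)

        print(response.text)

        # Expect exactly five binary digits, one per question in the prompt
        answer = response.text.strip()
        if len(answer) != 5 or set(answer) - {"0", "1"}:
            raise ValueError(f"Unexpected answer format: {answer!r}")

        rows.append({'Plot': text, 'Q1': answer[0], 'Q2': answer[1], 'Q3': answer[2],
                     'Q4': answer[3], 'Q5': answer[4], 'Genre': genre})

        i += 1
    except Exception as e:
        # Most failures here are quota errors (HTTP 429): wait for the
        # per-minute quota to reset, then retry the same item
        print(e)
        time.sleep(62)

# Build the table once at the end so that each row gets its own index
df_processed = pd.DataFrame(rows, columns=["Plot", "Q1", "Q2", "Q3", "Q4", "Q5", "Genre"])
df_processed.to_csv('movies_processed.csv', index=False)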

01001
00001
00001
00001
00001
00001
00001
01001
10011
01001
00001
00010
01001
429 Resource has been exhausted (e.g. check quota).
01001
00011
00011
00011
00000
00001
00010
00100
00001
00010
00001
00001
00010
00011
00011
429 Resource has been exhausted (e.g. check quota).
01001
00010
00010
00001
00001
00101
00001
00101
01001
00000
00010
00011
00001
00010
00011
00001
429 Resource has been exhausted (e.g. check quota).
00101
00010
00101
00010
00010
00101
00001
00001
00001
00010
00011
00010
00011
00001
00001
00010
429 Resource has been exhausted (e.g. check quota).
00000
00000
00001
00000
00001
00010
01001
00010
00011
00011
00011
00001
00001
00001
00000
01001
00011
01001
00001
00001
00000
00011
429 Resource has been exhausted (e.g. check quota).
00100
00001
00011
00000
00011
00101
00011
00010
00001
01001
00011
00000
00001
00011
00001
00010
00010
429 Resource has been exhausted (e.g. check quota).
00001
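
The `429` lines in the output above show the free-tier quota being hit during the run. A minimal sketch of one way to wait and retry a bounded number of times, instead of retrying indefinitely, is shown below. The helper `call_with_backoff`, its parameters, and the stand-in `fake_llm` function are hypothetical names for this example; in the notebook, the wrapped callable would be the call to `model.generate_content`.

```python
import time


def call_with_backoff(func, *args, max_retries=3, wait_seconds=62, sleep=time.sleep):
    """Call func(*args); on failure wait and retry, up to max_retries times.

    `sleep` is injectable so the waiting can be skipped when testing.
    """
    for attempt in range(max_retries + 1):
        try:
            return func(*args)
        except Exception as error:  # in practice, catch only the quota error
            if attempt == max_retries:
                raise  # give up instead of looping forever
            print(f"Attempt {attempt + 1} failed ({error}); retrying")
            sleep(wait_seconds)


# Stand-in for the real API call: fails once, then answers.
calls = {"count": 0}


def fake_llm(text):
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("429 Resource has been exhausted")
    return "01001"


result = call_with_backoff(fake_llm, "some plot", sleep=lambda seconds: None)
print(result)
```

Bounding the retries means that a persistent problem, such as a malformed answer or an invalid API key, stops the run with an error rather than sleeping repeatedly.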


In [34]:
df_processed.head()

| Unnamed: 0 | Plot | Q1 | Q2 | Q3 | Q4 | Q5 | Genre |
|---|---|---|---|---|---|---|---|
| 0 | The film is about a family who move to the sub... | 0 | 1 | 0 | 0 | 1 | comedy |
| 0 | Before heading out to a baseball game at a nea... | 0 | 0 | 0 | 0 | 1 | comedy |
| 0 | The plot is that of a black woman going to the... | 0 | 0 | 0 | 0 | 1 | comedy |
| 0 | On a beautiful summer day a father and mother ... | 0 | 0 | 0 | 0 | 1 | drama |
| 0 | A thug accosts a girl as she leaves her workpl... | 0 | 0 | 0 | 0 | 1 | drama |


In [25]:
df_processed["Genre"].value_counts()

Genre
drama     55
comedy    45
Name: count, dtype: int64

In [26]:
df_processed["Q1"].value_counts()

Q1
0    99
1     1
Name: count, dtype: int64

In [27]:
df_processed["Q2"].value_counts()

Q2
0    89
1    11
Name: count, dtype: int64

In [28]:
df_processed["Q3"].value_counts()

Q3
0    92
1     8
Name: count, dtype: int64

In [29]:
df_processed["Q4"].value_counts()

Q4
0    61
1    39
Name: count, dtype: int64

In [30]:
df_processed["Q5"].value_counts()

Q5
1    70
0    30
Name: count, dtype: int64

In [31]:
df_processed = df_processed.astype({"Q1": int, "Q2": int, "Q3": int, "Q4": int, "Q5": int,})

## 2. Classification:

In [32]:
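from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Features are the five LLM answers; the target is the genre.
# Hold out 20% of the items as a test set.
# Note: no random_state is set, so the split changes between runs.
X_train, X_test, y_train, y_test = train_test_split(df_processed[['Q1', 'Q2', 'Q3', 'Q4', 'Q5']], df_processed['Genre'], test_size=0.2)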

In [33]:
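# Fit Logistic Regression with scikit-learn's default hyperparameters.
# Named `classifier` to avoid overwriting the Gemini `model` defined earlier.
classifier = LogisticRegression()

classifier.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))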

              precision    recall  f1-score   support

      comedy       0.75      0.60      0.67        10
       drama       0.67      0.80      0.73        10

    accuracy                           0.70        20
   macro avg       0.71      0.70      0.70        20
weighted avg       0.71      0.70      0.70        20
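
Task 2 also asks for hyperparameter selection through cross-validation, which the single train/test split above does not cover. A minimal sketch using `GridSearchCV` follows. The synthetic `X` and `y` are placeholders standing in for the `Q1` to `Q5` columns and the `Genre` labels, and the grid of `C` values is an arbitrary assumption, so the printed scores are not results for this dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic placeholder data: 100 items with 5 binary features.
# In the notebook, X would be the Q1..Q5 columns and y the Genre column.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 5))
flip = rng.random(100) < 0.2
signal = np.where(flip, 1 - X[:, 1], X[:, 1])
y = np.where(signal == 1, "drama", "comedy")

# 5-fold cross-validation over the inverse regularization strength C.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="f1_macro",
)
search.fit(X, y)

print("Best C:", search.best_params_["C"])
print("Mean cross-validated macro F1:", round(search.best_score_, 3))
```

With only 100 items, averaging over folds gives a steadier estimate than a single 20-item test set, where one misclassified item changes accuracy by five percentage points.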

