### Task summary

**The goal**: given a prompt and two responses R and R*, where R* is a **"good response"**, decide whether R is as good as R*.
> ⬆️ Comparison

More **precisely**:
1. Do the two responses have points in common? `analysis`
2. Where do they differ? `analysis`
3. Do they complement each other, so that combining them would give an even better response? `generation + evaluation`
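
As a first intuition for questions 1 and 2, the vocabulary of the two responses can be split into shared and response-specific words. The snippet below is only a minimal sketch: the helper names `tokenize` and `compare_responses` and the two example sentences are invented for illustration and are not part of the notebook.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lower-case a text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def compare_responses(r: str, r_star: str) -> dict[str, set[str]]:
    """Split the vocabulary of R and R* into shared and response-specific words."""
    r_tokens, star_tokens = tokenize(r), tokenize(r_star)
    return {
        "common": r_tokens & star_tokens,        # points in common
        "only_r": r_tokens - star_tokens,        # what R adds
        "only_r_star": star_tokens - r_tokens,   # what R is missing
    }


result = compare_responses(
    "The pharaoh ruled Egypt",
    "The pharaoh was at the top of the social pyramid",
)
print(sorted(result["common"]))  # ['pharaoh', 'the']
```

A word-level comparison ignores paraphrases, which is why the notebook also relies on sentence embeddings further down.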



> - Analyse the problem, propose a solution and implement it.
> - There is no need to look for the best possible model (or to test dozens of models).
> - Focus on the analysis of the problem, the relevance of the proposed solution and the quality of the Python code that implements it.


Analysing the problem:
- First, how should the criteria for R, and especially for **R***, be determined? In other words, how should the content scores be divided into quality levels?
- In which respects can similarities and differences between R and R* be found?
- How can R and R* be merged to generate a new response?
- How should the new response(s) be evaluated?

---

### 1️⃣ Data import

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
from transformers import AutoTokenizer
import re
from nltk.tokenize import word_tokenize
from nltk import ngrams
from sentence_transformers import SentenceTransformer, util
import torch


In [3]:
# load the data
prompts_train_df = pd.read_csv('data/prompts_train.csv')
reponses_train_df = pd.read_csv('data/summaries_train.csv')

In [4]:
# check the size of the data
reponses_train_df.shape

(7165, 5)

In [6]:
# look at the first rows to get an idea of the structure
reponses_train_df.head()

| | student_id | prompt_id | text | content | wording |
|---|---|---|---|---|---|
| 0 | 000e8c3c7ddb | 814d6b | The third wave was an experimentto see how peo... | 0.205683 | 0.380538 |
| 1 | 0020ae56ffbf | ebad26 | They would rub it up with soda to make the sme... | -0.548304 | 0.506755 |
| 2 | 004e978e639e | 3b9047 | In Egypt, there were many occupations and soci... | 3.128928 | 4.231226 |
| 3 | 005ab0199905 | 3b9047 | The highest class was Pharaohs these people we... | -0.210614 | -0.471415 |
| 4 | 0070c9e7af47 | 814d6b | The Third Wave developed rapidly because the ... | 3.272894 | 3.219757 |


---

### 2️⃣ EDA (Exploratory Data Analysis) and feature engineering

In [7]:
# number of unique responses and unique prompts
unique_rep = reponses_train_df['student_id'].nunique()
unique_prompts = reponses_train_df['prompt_id'].nunique()

# descriptive statistics of the "content" scores
content_score_stats = reponses_train_df['content'].describe()
print('unique responses:', unique_rep)
print('unique prompts:', unique_prompts)
print('*'*90)
print('Content score statistics:')
print(content_score_stats)
print('*'*90)

unique responses: 7165
unique prompts: 4
******************************************************************************************
Content score statistics:
count    7165.000000
mean       -0.014853
std         1.043569
min        -1.729859
25%        -0.799545
50%        -0.093814
75%         0.499660
max         3.900326
Name: content, dtype: float64
******************************************************************************************


1. More than half of the responses have a negative content score (the median is about -0.09).
2. Some responses are of high quality (scores up to 3.90).
3. The quality of the content varies widely across responses (standard deviation of about 1.04).

---

Next, the following operations are applied to the two DataFrames `reponses_train_df` and `prompts_train_df`:
1. Merge the two tables
2. Drop the "wording" column
3. Add further features

In [None]:
# Merge the two DataFrames: attach the prompt text and question to every response
merged_df = reponses_train_df.merge(prompts_train_df[['prompt_id', 'prompt_text', 'prompt_question']], on='prompt_id', how='left')


# Text transformation
# Lower-case every string value in merged_df
merged_df = merged_df.applymap(lambda s: s.lower() if isinstance(s, str) else s)

# Make sure all text fields are strings
merged_df["text"] = merged_df["text"].astype(str)
merged_df["prompt_text"] = merged_df["prompt_text"].astype(str)

# Punctuation removal is deliberately disabled: punctuation is useful for detecting quotations
# merged_df["text"]=merged_df.text.apply(lambda x: re.sub('[^A-Za-z0-9 ]+', ' ', x))
# merged_df["prompt_text"]=merged_df.prompt_text.apply(lambda x: re.sub('[^A-Za-z0-9 ]+', ' ', x))


# Drop the "wording" column to focus on the content score
merged_df = merged_df.drop(columns=['wording'])

merged_df.head()

merged_df.to_csv('data/tood.csv', index=False)

In [None]:
merged_df.head()

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

In [None]:
embeddings_model = SentenceTransformer('sentence-transformers/all-distilroberta-v1', device=device)

In [None]:
prompt_to_emb_dict = merged_df.groupby('prompt_id')['prompt_text'].first().transform(lambda x: embeddings_model.encode(x, batch_size=1, show_progress_bar=False)).to_dict()

In [None]:
def semantic_similarity(row, model=embeddings_model, prompt_embeddings=prompt_to_emb_dict):
    prompt_vector = prompt_embeddings[row['prompt_id']]
    summary_vector = model.encode(row['text'], batch_size=1, show_progress_bar=False)
    return util.cos_sim([prompt_vector], [summary_vector]).item()
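
The `semantic_similarity` function above reduces each text to an embedding vector and compares the two vectors with cosine similarity. To make that last step concrete, here is a minimal pure-Python sketch of the cosine formula. The function name `cosine_similarity` and the two toy vectors are illustrative assumptions; real embeddings have several hundred dimensions.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings" (made up for the example)
prompt_vec = [1.0, 0.0, 1.0]
summary_vec = [1.0, 1.0, 0.0]
similarity = cosine_similarity(prompt_vec, summary_vec)
print(round(similarity, 3))  # 0.5
```

Values close to 1 mean that the summary points in the same semantic direction as the prompt text, while values near 0 mean the two are unrelated.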

In [None]:
merged_df['semantic_similarity'] = merged_df.apply(semantic_similarity, axis=1)

In [None]:
# Add new features (response length, prompt length and several overlap measures) to the DataFrame

def feature_engineering(df):
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def count_text_length(df, col, tokenizer):
        longueur_col = df[col].apply(lambda x: len(tokenizer.encode(x)))
        return longueur_col
    
    df['reponse_longueur'] = count_text_length(df, 'text',tokenizer)
    df['prompt_longueur'] = count_text_length(df, 'prompt_text',tokenizer)
    df['rep/prompt_ratio'] = df['reponse_longueur'] / df['prompt_longueur']
    df['vocabulary_richness'] = df['text'].apply(lambda x: len(set(x.split())))
    df['word_overlap'] = df.apply(lambda row: len(set(row['prompt_text'].split()) & set(row['text'].split())), axis=1)
    
    def quotes_count(row):
        summary = row['text']
        text    = row['prompt_text']
        quotes_from_summary = re.findall(r'"([^"]*)"', summary)
        return [quote in text for quote in quotes_from_summary].count(True)

    df['quotes_num'] = df.apply(quotes_count, axis=1)



    def word_ngram_overlap(row, n):
        original_tokens = row['prompt_text'].split()
        summary_tokens = row['text'].split()

        original_ngrams = set(ngrams(original_tokens, n))
        summary_ngrams = set(ngrams(summary_tokens, n))
        
        common_ngrams = original_ngrams.intersection(summary_ngrams)
        
        return len(common_ngrams) / len(summary_ngrams) if len(summary_ngrams) else 0, len(common_ngrams)


    df['bigram_overlap']  = df.apply(lambda x: word_ngram_overlap(x,2)[1], axis=1 )
    df['trigram_overlap'] = df.apply(lambda x: word_ngram_overlap(x,3)[1], axis=1 )
    df['jaccard_similarity'] = df.apply(lambda row: len(set(word_tokenize(row['prompt_text'])) & set(word_tokenize(row['text']))) / len(set(word_tokenize(row['prompt_text'])) | set(word_tokenize(row['text']))), axis=1)


    return df
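
To make the overlap features easier to check by hand, the sketch below recomputes the bigram overlap count and the Jaccard similarity on two short sentences without pandas or nltk. The helper names (`word_ngrams`, `ngram_overlap`, `jaccard`) and the sentences are made up for this illustration, and whitespace splitting stands in for `word_tokenize`.

```python
def word_ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams of a token list (empty if the list is too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(source: str, summary: str, n: int) -> int:
    """Number of distinct n-grams shared by the source text and the summary."""
    return len(word_ngrams(source.split(), n) & word_ngrams(summary.split(), n))


def jaccard(source: str, summary: str) -> float:
    """Shared vocabulary divided by combined vocabulary."""
    a, b = set(source.split()), set(summary.split())
    return len(a & b) / len(a | b) if a | b else 0.0


source = "the third wave was an experiment"
summary = "the third wave grew fast"
print(ngram_overlap(source, summary, 2))  # 2  -> ('the', 'third') and ('third', 'wave')
print(jaccard(source, summary))           # 0.375  -> 3 shared words out of 8 distinct words
```

A high n-gram overlap suggests copying from the prompt text, whereas the Jaccard score measures how much of the vocabulary is shared regardless of word order.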


In [None]:
merged_df_featured = feature_engineering(merged_df)

In [None]:
merged_df_featured.head(20)

In [None]:
# Save the features
merged_df_featured.to_csv('data/merged_df_featured.csv', index=False)


In [None]:
# Setting up the figure and axes for two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(6,3), dpi=100)

# Scatter plot: content score vs. response length (in tokens)
ax1.scatter(merged_df_featured['reponse_longueur'], merged_df_featured['content'], alpha=0.5, color='blue')
ax1.set_title('Content score vs reponse_longueur')
ax1.set_xlabel('reponse_longueur')
ax1.set_ylabel('Content Score')

# Scatter plot: content score vs. number of distinct words
ax2.scatter(merged_df_featured['vocabulary_richness'], merged_df_featured['content'], alpha=0.5, color='blue')
ax2.set_title('Content score vs vocabulary_richness')
ax2.set_xlabel('vocabulary_richness')
ax2.set_ylabel('Content Score')

# Adjusting the layout to prevent overlap
plt.tight_layout()

# Displaying the plots
plt.show()

# Calculating the correlation between summary word count and content score
word_count_content_corr = merged_df_featured['reponse_longueur'].corr(merged_df_featured['content'])

# Calculating the correlation between unique word count and content score
vocabulary_richness_corr = merged_df_featured['vocabulary_richness'].corr(merged_df_featured['content'])

(word_count_content_corr, vocabulary_richness_corr)

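1. Observation: the content score is clearly positively correlated with the length of the summary (measured in tokens), so longer summaries tend to receive higher content scores. The correlation coefficient is about 0.793.
2. Observation, content score versus vocabulary size: the number of distinct words used in a summary is strongly positively correlated with the content score, so summaries that use a wider range of words tend to score higher. The correlation coefficient is about 0.807. It is slightly higher than the one for length alone, which suggests that lexical variety is more closely associated with the content score than length by itself.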
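
The coefficients quoted above come from `Series.corr`, which computes the Pearson correlation by default. For reference, the formula can be sketched in plain Python as follows. The function name `pearson` and the five toy data points are invented for the example and are not taken from the dataset.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data: summary lengths and content scores (made up)
lengths = [20, 35, 50, 80, 120]
scores = [-1.2, -0.4, 0.1, 1.0, 2.3]
r = pearson(lengths, scores)
print(round(r, 3))
```

A coefficient close to 1 only indicates a linear association. It does not show that writing longer summaries causes higher scores.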

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
selected_vars = ['content', 'reponse_longueur', 'prompt_longueur', 'rep/prompt_ratio', 'vocabulary_richness', 'word_overlap', 'quotes_num', 'bigram_overlap', 'trigram_overlap', 'jaccard_similarity']

# Create a subset DataFrame with selected variables
subset_df = merged_df[selected_vars]

# Calculate the correlation matrix for the subset
correlation_matrice = subset_df.corr()

# Set up the matplotlib figure
plt.figure(figsize=(5, 5))


# Create a heatmap of the correlation matrix
sns.heatmap(correlation_matrice, annot=True, cmap='coolwarm', center=0)
plt.title("Correlation matrix of the selected variables")
plt.show()

## Next step: identifying R and R*

In [None]:
import matplotlib.pyplot as plt

# Setting up the figure and axes
fig, ax = plt.subplots(1, 2, figsize=(6, 3))

# Plotting histograms
ax[0].hist(reponses_train_df['content'], bins=30, color='green', edgecolor='black')
ax[0].set_title('Content Score Distribution')
ax[0].set_xlabel('Content Score')
ax[0].set_ylabel('Frequency')

ax[1].hist(merged_df['reponse_longueur'], bins=30, color='blue', edgecolor='black')
ax[1].set_title('Summary Word Count Distribution')
ax[1].set_xlabel('Word Count')
ax[1].set_ylabel('Frequency')

# Adjusting the layout to prevent overlap
plt.tight_layout()

# Displaying the plots
plt.show()

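Content score distribution:

The distribution is roughly normal but slightly right-skewed: its peak lies a little below zero and its tail extends towards high scores (up to 3.90). A considerable share of the summaries therefore received a negative content score.

Summary length distribution:

The distribution is strongly right-skewed: most summaries are short, and only a few are very long.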
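
The direction of a skew can be checked numerically instead of being read off a histogram. The sketch below computes the moment-based sample skewness in plain Python: a positive value means a longer right tail, a negative value a longer left tail. The function name `skewness` and the toy lists are assumptions made for the illustration.

```python
def skewness(values: list[float]) -> float:
    """Moment-based sample skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 else 0.0


symmetric = [1, 2, 3]
right_tailed = [1, 1, 2, 2, 3, 10]   # a few very large values, like the summary lengths
print(skewness(symmetric))           # 0.0
print(skewness(right_tailed) > 0)    # True
```

On the real data, the same quantity is available through the `skew()` method of a pandas Series, which applies a small bias correction.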

mean       -0.014853
std         1.043569
min        -1.729859
25%        -0.799545
50%        -0.093814
75%         0.499660
max         3.900326

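Combining the content score histogram above with the descriptive statistics, the dataset can be split into four quality levels (n is the content score):

1. Poor response: n <= -1
2. Average response: -1 < n < 1
3. Good response: 1 <= n <= 3
4. Excellent response: n > 3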

In [None]:
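def categorize_score(n):
    # Cut-offs as implemented: poor n <= 0, average 0 < n <= 1, good 1 < n <= 3, excellent n > 3.
    # Note that the lowest cut-off used here is 0, not the -1 given in the text above.
    if n <= 0:
        return 'mauvaise réponse'
    elif n <= 1:
        return 'réponse moyenne'
    elif n <= 3:
        return 'bonne réponse'
    else: # n > 3
        return 'excellente réponse'

# Apply the function to the 'content' column and store the result in a new column 'quality_category'
merged_df['quality_category'] = merged_df['content'].apply(categorize_score)

# Show the updated DataFrame
merged_df.head(20)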

In [None]:
# Getting the counts of each category
category_counts = merged_df['quality_category'].value_counts()
category_counts

In [None]:
# Next, draw one summary rated "bonne réponse" or "excellente réponse" (R*) and one random summary for the same prompt (R).

# First, randomly draw one record among the good or excellent responses
good_answers_df = merged_df[merged_df['quality_category'].isin(['bonne réponse', 'excellente réponse'])]
selected_good_answer = good_answers_df.sample(n=1)

# Then keep only the rows of merged_df that share the prompt_id of selected_good_answer
# and randomly draw another record from them (excluding the first one)
selected_random_answer = merged_df[merged_df['prompt_id'] == selected_good_answer['prompt_id'].values[0]].drop(selected_good_answer.index).sample(n=1)

# Keep the two records as separate DataFrames
df_good_answer = selected_good_answer.copy()
df_random_answer = selected_random_answer.copy()

# Remove both records from the original DataFrame
merged_df = merged_df.drop(selected_good_answer.index)
merged_df = merged_df.drop(selected_random_answer.index)

# Show the result
df_good_answer

# Similarities can be described through word embeddings or TF-IDF keyword extraction
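
The sampling logic of this cell (pick R* among the well-rated responses, then pick R among the other responses to the same prompt) can be sketched without pandas. The version below also fixes a random seed, which makes the draw reproducible. The function name `pick_pair`, the `seed` parameter, the threshold default and the four toy rows are all illustrative assumptions.

```python
import random


def pick_pair(rows: list[dict], good_threshold: float = 1.0, seed: int = 0):
    """Return (R*, R): a well-rated response and another response to the same prompt."""
    rng = random.Random(seed)  # fixed seed so the same pair is drawn on every run
    good = [row for row in rows if row["content"] > good_threshold]
    r_star = rng.choice(good)
    candidates = [
        row for row in rows
        if row["prompt_id"] == r_star["prompt_id"]
        and row["student_id"] != r_star["student_id"]
    ]
    return r_star, rng.choice(candidates)


# Toy records (made up)
rows = [
    {"student_id": "s1", "prompt_id": "p1", "content": 2.1},
    {"student_id": "s2", "prompt_id": "p1", "content": -0.5},
    {"student_id": "s3", "prompt_id": "p2", "content": 0.3},
    {"student_id": "s4", "prompt_id": "p1", "content": 0.8},
]
r_star, r = pick_pair(rows)
print(r_star["student_id"], r["student_id"])
```

With pandas, the equivalent is to pass `random_state` to `DataFrame.sample`.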

In [None]:
df_random_answer

In [None]:
merged_df.shape

In [None]:
# Extract the text field from df_good_answer and df_random_answer and serialise both as JSON
import json

good_answer_text = df_good_answer['text'].iloc[0]
random_answer_text = df_random_answer['text'].iloc[0]

# json.dumps escapes quotes and line breaks, which a hand-written f-string template would not
texts_json_str = json.dumps({"A": good_answer_text, "B": random_answer_text}, indent=4)


texts_json_str
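
Student summaries often contain double quotes or line breaks, so it is worth checking how the JSON payload behaves on such input. The sketch below contrasts a hand-built template with `json.dumps` on a made-up example. The variables `text_a`, `text_b`, `naive` and `payload` are hypothetical names used only here.

```python
import json

# Made-up texts containing a quotation and a line break
text_a = 'He said "obey" and they did.\nThe group grew quickly.'
text_b = "It was an experiment."

# Hand-built template: the inner quotes end the JSON string too early
naive = f'{{"A": "{text_a}", "B": "{text_b}"}}'
try:
    json.loads(naive)
    naive_is_valid = True
except json.JSONDecodeError:
    naive_is_valid = False

# json.dumps escapes the quotes and the newline, so the round trip is lossless
payload = json.dumps({"A": text_a, "B": text_b})
restored = json.loads(payload)

print(naive_is_valid)           # False
print(restored["A"] == text_a)  # True
```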

### Merging R and R* with the OpenAI API

In [None]:
import os
from openai import OpenAI
import json
import torch
from transformers import RobertaTokenizer, RobertaModel
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.multioutput import MultiOutputRegressor


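# The API key is read from the OPENAI_API_KEY environment variable; a secret key must never be hard-coded in a notebook
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable before running this cell.")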

client = OpenAI()

In [None]:


# Send both texts to the model as a single JSON string
texts = texts_json_str

completion = client.chat.completions.create(
    model="gpt-3.5-turbo-0301",
    messages=[
        {"role": "system", "content":
         "Given 2 texts, A and B, in JSON format. You need to integrate the key points of A into B without changing B's content too much, in order to return a new text."},
        {"role": "system", "content":
         "The length of the new text should be less than the sum of the lengths of the two texts. Your output should be valid JSON, with the key 'combined_text' and the new text as its value."},
        {"role": "system", "content":
         "You should focus on the content."},
        {"role": "user",
         "content": texts}
    ],
    temperature=0.3
)
# print(completion.choices[0].message)
# print(dict(completion).get('usage'))

data = completion.model_dump_json(indent=2)
json_data = json.loads(data)

In [None]:
text = json_data['choices'][0]['message']['content']
text

In [None]:
returned_text = json.loads(text)
print(returned_text)

print(returned_text["combined_text"])
print(len(returned_text["combined_text"]))
print(len(texts))
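
`json.loads(text)` assumes that the model returned nothing but a valid JSON object, which is not guaranteed: the reply may be wrapped in extra prose or may not be JSON at all. A more defensive parser could look like the sketch below. The function `extract_combined_text` is a hypothetical helper, not part of the OpenAI client, and the sample replies are invented.

```python
import json
import re
from typing import Optional


def extract_combined_text(reply: str, key: str = "combined_text") -> Optional[str]:
    """Return reply[key] if the reply is (or contains) a JSON object, else None."""
    try:
        return json.loads(reply)[key]
    except (json.JSONDecodeError, KeyError, TypeError):
        pass
    # Fallback: look for a {...} block embedded in surrounding prose
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0)).get(key)
        except json.JSONDecodeError:
            return None
    return None


print(extract_combined_text('{"combined_text": "merged answer"}'))               # merged answer
print(extract_combined_text('Here you go: {"combined_text": "merged answer"}'))  # merged answer
print(extract_combined_text("Sorry, I cannot do that."))                         # None
```

Returning `None` instead of raising lets the caller decide whether to retry the request or skip the pair.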


In [None]:
import pandas as pd
# Build a dictionary holding the student ID, prompt ID and text of the merged response
stu_id= '000000ffffff'
prompt_id = 'def789'
text_test = returned_text["combined_text"]

data_4_test = {
    'student_id': [stu_id],
    'prompt_id': [prompt_id],
    'text': [text_test]
}

# Create a DataFrame from the dictionary
test_data = pd.DataFrame(data_4_test)

# Display the DataFrame
test_data


In [None]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')
nltk.download('stopwords')

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

from xgboost import XGBRegressor

In [None]:

print(f'prompts_train shape: {prompts_train_df.shape}')
print(f'summaries_train shape: {merged_df.shape}')
print('-'*90)
print(f'prompts_train missing values: {prompts_train_df.isnull().sum().sum()}')
print(f'summaries_train missing values: {merged_df.isnull().sum().sum()}')
print('-'*90)
merged_df.head()

In [None]:
import pickle


def preprocess_text(text):
    tokens = nltk.word_tokenize(text)
    
    tokens = [token.lower() for token in tokens]
    
    tokens = [token for token in tokens if token.isalnum()]
    
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    
    preprocessed_text = ' '.join(tokens)
    
    return preprocessed_text

def extract_features_ngrams(texts):
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    
    features = tfidf_vectorizer.fit_transform(texts)
    with open('tfidf_vectorizer.pkl', 'wb') as f:
        pickle.dump(tfidf_vectorizer, f)
    
    return features


def extract_features_features(merged_df):
    # The quotation count is stored under 'quotes_num' by feature_engineering
    feature_columns = ['reponse_longueur', 'rep/prompt_ratio', 'vocabulary_richness', 'word_overlap', 'jaccard_similarity', 'quotes_num']
    train_features = merged_df[feature_columns]
    return train_features
    



def extract_features_tfidf(texts):
    tfidf_vectorizer = TfidfVectorizer()
    
    features = tfidf_vectorizer.fit_transform(texts)
    with open('tfidf_vectorizer.pkl', 'wb') as f:
        pickle.dump(tfidf_vectorizer, f)
    
    return features
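
To show what the TF-IDF features encode, here is a small pure-Python sketch of one common weighting variant: raw term counts multiplied by a smoothed inverse document frequency, followed by L2 normalisation. This is the same scheme that `TfidfVectorizer` applies by default, although the tokenisation here is a plain whitespace split. The function name `tfidf` and the three toy documents are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


docs = ["wave experiment school", "wave discipline", "egypt pharaoh"]
vectors = tfidf(docs)
# "wave" occurs in two documents, so it weighs less than "experiment" in the first one
print(vectors[0]["wave"] < vectors[0]["experiment"])  # True
```

Terms that appear in many summaries therefore contribute less to a summary's vector than terms that are specific to it.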

In [None]:
new_features = extract_features_features(merged_df)

In [None]:
train_text_summaries = merged_df.text
test_text_summaries = test_data.text
train_text_summaries

In [None]:
preprocessed_summaries_train = [preprocess_text(summary) for summary in train_text_summaries]
preprocessed_summaries_test = [preprocess_text(summary) for summary in test_text_summaries]

train_tfidf_features = extract_features_tfidf(preprocessed_summaries_train)


In [None]:
with open('tfidf_vectorizer.pkl', 'rb') as f:
    loaded_tfidf_vectorizer = pickle.load(f)

test_tfidf_features = loaded_tfidf_vectorizer.transform(preprocessed_summaries_test)


In [None]:
new_features.shape


In [None]:
test_tfidf_features.shape


In [None]:
target_labels = merged_df[['content']]

In [None]:
regressor = XGBRegressor()


In [None]:
X_train, X_test, y_train, y_test = train_test_split(new_features, target_labels, test_size=0.001, random_state=42)

In [None]:
regressor.fit(X_train, y_train)

In [None]:
predictions = regressor.predict(X_test)
print(predictions)
print(X_train.shape)

In [None]:
mse = mean_squared_error(y_test, predictions, multioutput='raw_values')


In [None]:
print(mse)
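
The value printed above is the mean squared error: the average of the squared differences between true and predicted content scores. With `test_size=0.001` the held-out set contains only a handful of rows, so this number is a noisy estimate. A minimal sketch of the formula follows. The function name `mean_squared_error_py` and the toy values are illustrative.

```python
def mean_squared_error_py(y_true: list[float], y_pred: list[float]) -> float:
    """Average of the squared prediction errors."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy scores (made up): errors of 0.5, 0.0 and 1.0
mse_toy = mean_squared_error_py([1.0, 0.0, -1.0], [0.5, 0.0, -2.0])
print(round(mse_toy, 4))  # 0.4167
```

Taking the square root gives the RMSE, which is expressed in the same unit as the content score and is therefore easier to interpret.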

In [None]:
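# The regressor was fitted on the handcrafted features, so the merged text needs the same columns rather than TF-IDF vectors
test_featured = feature_engineering(test_data.assign(text=test_data['text'].str.lower(), prompt_text=df_good_answer['prompt_text'].iloc[0]))
new_predictions = regressor.predict(extract_features_features(test_featured))
print(new_predictions)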

In [None]:
import string  # needed by count_punctuation below


def clean_text(text: str) -> str:
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.lower()
    return text

def count_total_words(text: str) -> int:
    words = text.split()
    total_words = len(words)
    return total_words

def count_stopwords(text: str) -> int:
    stopword_list = set(stopwords.words('english'))
    words = text.split()
    stopwords_count = sum(1 for word in words if word.lower() in stopword_list)
    return stopwords_count

def count_punctuation(text: str) -> int:
    punctuation_set = set(string.punctuation)
    punctuation_count = sum(1 for char in text if char in punctuation_set)
    return punctuation_count

def count_numbers(text: str) -> int:
    numbers = re.findall(r'\d+', text)
    numbers_count = len(numbers)
    return numbers_count

def feature_engineer(dataframe: pd.DataFrame, feature: str = 'text') -> pd.DataFrame:
    dataframe[f'{feature}_length'] = dataframe[feature].apply(lambda x: len(x))
    dataframe[f'{feature}_word_cnt'] = dataframe[feature].apply(lambda x: count_total_words(x))
    dataframe[f'{feature}_stopword_cnt'] = dataframe[feature].apply(lambda x: count_stopwords(x))
    dataframe[f'{feature}_punct_cnt'] = dataframe[feature].apply(lambda x: count_punctuation(x))
    dataframe[f'{feature}_number_cnt'] = dataframe[feature].apply(lambda x: count_numbers(x))
    return dataframe

In [None]:
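# summaries_train / summaries_test are not defined in this notebook; the training and test frames here are merged_df and test_data
merged_df = feature_engineer(merged_df)
test_data = feature_engineer(test_data)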