This repo is for the Kaggle Linking Writing Processes to Writing Quality competition.
pip install -r requirements.txt
export KAGGLE_USERNAME="your_kaggle_username"
export KAGGLE_KEY="your_api_key"
kaggle competitions download -c linking-writing-processes-to-writing-quality
unzip linking-writing-processes-to-writing-quality.zip
kaggle datasets download -d hiarsl/writing-quality-challenge-constructed-essays
unzip writing-quality-challenge-constructed-essays.zip
cd models
python {file_name}.py
- Models will be saved in the main directory.
python feature_generator.py
- The final train CSV will be saved as `features.csv` in the main directory.
- Feature names will be saved as a list in `columns.txt` in the main directory.
- See the exploratory data analysis in `features_eda.ipynb`.
cd classifiers
python {file_name}.py
I am very pleased to have participated in this meaningful competition. Although I did not win a medal after the shakeup, I learned a lot from this tabular competition, and I hope to apply what I have learned next time and achieve better results. My thoughts are not of much reference value; they simply put a definitive end to this competition for me and share a few findings.
Thanks to these excellent public notebooks: Feature Engineering: Sentence & paragraph features, Silver Bullet | Single Model | 165 Features, LGBM (X2) + NN.
- Important features: `sentence_features`, `word_features`, `paragraph_features`
**Use word counts with different length thresholds:**

```python
for word_l in [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]:
    # Count words of length >= word_l per essay id, as the `_ge_` name indicates.
    word_agg_df[f'word_len_ge_{word_l}_count'] = (
        df[df['word_len'] >= word_l].groupby(['id']).count().iloc[:, 0]
    )
    word_agg_df[f'word_len_ge_{word_l}_count'] = word_agg_df[f'word_len_ge_{word_l}_count'].fillna(0)
```
`pause_time_features`

**Only one new feature added** (a fragment from inside a Polars `agg` call):

```python
pauses_zero_sec=pl.col('time_diff').filter(pl.col('time_diff') < 0.5).count(),
```
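For context, a minimal sketch of how this expression might sit in a per-id aggregation; the DataFrame `df` with a `time_diff` column is assumed from the snippet above:

```python
import polars as pl

# Hypothetical wrapper: count near-zero pauses (under 0.5 s) per essay id.
pause_feats = df.group_by('id').agg(
    pauses_zero_sec=pl.col('time_diff').filter(pl.col('time_diff') < 0.5).count(),
)
```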
Time-related features, word count features, cursor position features:
```python
temp = df.group_by("id").agg(
    pl.sum('action_time').suffix('_sum'),
    pl.mean(num_cols).suffix('_mean'),
    pl.std(num_cols).suffix('_std'),
    pl.median(num_cols).suffix('_median'),
    pl.min(num_cols).suffix('_min'),
    pl.max(num_cols).suffix('_max'),
    pl.quantile(num_cols, 0.25).suffix('_quantile25'),
    pl.quantile(num_cols, 0.75).suffix('_quantile75'),
)
```
```python
gaps = [1]
```
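`gaps` controls shift-based difference features. A hedged sketch of how such a gap list is typically consumed; the `cursor_position` and `word_count` columns come from the competition logs, but the exact feature definitions here are an assumption:

```python
import polars as pl

gaps = [1]
# Hypothetical shift-difference features: per-id change in cursor position
# and word count over each gap.
df = df.with_columns(
    [
        (pl.col('cursor_position').shift(-gap) - pl.col('cursor_position'))
        .over('id')
        .alias(f'cursor_position_change{gap}')
        for gap in gaps
    ]
    + [
        (pl.col('word_count').shift(-gap) - pl.col('word_count'))
        .over('id')
        .alias(f'word_count_change{gap}')
        for gap in gaps
    ]
)
```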
Punctuation count:

```python
from collections import Counter

import pandas as pd
from tqdm import tqdm

def match_punctuations(self, df):
    # Collect each essay's down_event sequence, then count punctuation events.
    tmp_df = df.groupby('id').agg({'down_event': list}).reset_index()
    ret = list()
    for li in tqdm(tmp_df['down_event'].values):
        cnt = 0
        for k, v in Counter(li).items():
            if k in self.punctuations:
                cnt += v
        ret.append(cnt)
    return pd.DataFrame({'punct_cnt': ret})
```
- Features with less obvious effects: activity count, `event_count` (I tried both TF-IDF and regular proportion calculations; there was basically no difference), and larger gap lists such as `gaps = [10, 20, ..., 50, 100]`.
In the end, I used about 210 features. I tried more (for example, sets of 300+, 600+, and 700+ features), but their leaderboard scores were poor, only around 0.595+, so I did not adopt them in the final model. With only about 2,500 training samples, there shouldn't be too many features.
- Used LGBM, XGB, and CB, three traditional tree models, blended with equal weights (see the sketch after this list).
- I was unable to obtain the desired results with NN and TabNet.
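A minimal sketch of the equal-weight blend, assuming each model's predictions are already in hand (the prediction array names are illustrative):

```python
import numpy as np

def blend_equal(preds):
    """Equal-weight blend: simply average the models' predictions."""
    return np.mean(preds, axis=0)

# Hypothetical usage with LGBM / XGB / CatBoost test predictions:
# final_pred = blend_equal([pred_lgbm, pred_xgb, pred_cb])
```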
After reading a creative discussion here, I tried using a classification model to assist in adjusting the regression model's predictions.
- Binary classification

```python
count_label0 = 0
count_label1 = 0

def create_binary_score(score):
    """Label marginal scores (<= 1.5 or >= 4.5) as 0 and middle scores as 1."""
    global count_label0, count_label1
    if score <= 1.5 or score >= 4.5:
        count_label0 += 1
        return 0
    else:
        count_label1 += 1
        return 1
```
I hoped binary classification could distinguish marginal scores from middle scores, but the final accuracy was only around 82%. After combining it with the regression model, the results were not satisfactory, so I ultimately abandoned this approach.
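The exact adjustment rule is not spelled out above, so the following is purely illustrative: nudge regression predictions away from the middle when the classifier believes the score is marginal.

```python
import numpy as np

def adjust_with_classifier(reg_pred, p_marginal, strength=0.25):
    # Illustrative only: push predictions the classifier flags as marginal
    # (label 0 above) away from the middle of the score range (3.0 here).
    direction = np.sign(reg_pred - 3.0)
    return reg_pred + strength * p_marginal * direction
```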
- Five-category classification

```python
def convert_score_to_category(score):
    if score <= 2.5:
        return 0
    elif score == 3.0:
        return 1
    elif score == 3.5:
        return 2
    elif score == 4.0:
        return 3
    elif score == 4.5:
        return 4
    elif score >= 5.0:
        return 5
```
To balance the number of samples per label, I set up the split as above. Previously, I tried treating each score as a separate category and added class weights to reduce the impact of sample imbalance, but the discrepancy was so large that the model was ultimately unable to train properly. Even with the split above, the final classification accuracy was only just over 50%; a sketch of the class-weight attempt follows.
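For reference, a minimal sketch of the balanced class-weight setup described above; `y_cat` (the categorical labels) and the `model.fit` call are assumptions:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical: balanced weights for the per-score classification attempt.
classes = np.unique(y_cat)
weights = compute_class_weight('balanced', classes=classes, y=y_cat)
class_to_w = dict(zip(classes, weights))
sample_weight = np.array([class_to_w[c] for c in y_cat])
# model.fit(X, y_cat, sample_weight=sample_weight)
```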
After reviewing others' solutions, it seems that no one used this idea, which suggests the method indeed does not work well.
- In this competition, it seems that features are not particularly important. Many high-scoring solutions are also based on making minor modifications to the baseline.
- Extracting more information from the text itself, and even using language models to construct features, is a very effective approach.
- Building a trustworthy CV is crucial. In this competition, my CV consistently lacked a strong correlation with the LB, which directly led to the shakeup (see the sketch below).
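One common way to build a more stable CV for data like this is to stratify on the discrete scores; this sketch assumes a `train` DataFrame with a `score` column and is not necessarily the setup used here:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical: scores come in 0.5 steps, so double them to get integer bins.
strata = (train['score'] * 2).astype(int)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for fold, (tr_idx, va_idx) in enumerate(skf.split(train, strata)):
    # train / validate per fold here
    pass
```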
Thank you to Kaggle and The Learning Agency Lab for hosting a very meaningful competition. This tabular competition has been a process of accumulating experience, and I have learned a lot along the way. Wish everyone good luck.