## Prepare packages

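Importing the necessary libraries: NumPy and pandas for data manipulation, train_test_split from scikit-learn for splitting the data, and TPOTClassifier from the TPOT library for automated machine learning. FutureWarning and UserWarning messages are suppressed to keep the output readable.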

In [3]:
import numpy as np
import pandas as pd


In [4]:

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
from tpot import TPOTClassifier

## Read data

Loading the dataset from a CSV file named dataset_aggr.csv into a pandas DataFrame called data.

In [7]:
data = pd.read_csv('dataset_aggr.csv')

Adding a new column named label with all values set to 1 to indicate positive labels.

In [8]:
data['label'] = 1

We iterate over each row in the dataset (data.iterrows()).
For each row, we randomly sample another row that has both a different UserID and a different ItemID to serve as a candidate.
We combine the user and context features of the current row with the item features of the candidate to create a new row, setting its label to 0 and leaving the average rating (avg_rating) as None.
The new rows are appended to the dataset DataFrame, giving one negative example for every positive example.

In [12]:
dataset = data.copy()

# User and context columns, taken from the current (positive) row.
left_k = ['UserID', 'number_of_unique_songs',
   'number_of_unique_genres', 'genre_ratio', 'main_genre_dominance',
   'no_stimulus_points', 'stimulus_points',
   'driving_style_relaxed_driving', 'driving_style_sport_driving',
   'landscape_coast_line', 'landscape_country_side', 'landscape_mountains',
   'landscape_urban', 'mood_active', 'mood_happy', 'mood_lazy', 'mood_sad',
   'natural_phenomena_afternoon', 'natural_phenomena_day_time',
   'natural_phenomena_morning', 'natural_phenomena_night',
   'road_type_city', 'road_type_highway', 'road_type_serpentine',
   'sleepiness_awake', 'sleepiness_sleepy', 'traffic_conditions_free_road',
   'traffic_conditions_lots_of_cars', 'traffic_conditions_traffic_jam',
   'weather_cloudy', 'weather_rainy', 'weather_snowing', 'weather_sunny',
   'dominant_genre_blues',
   'dominant_genre_pop', 'dominant_genre_rock', 'second_dominant_blues',
   'second_dominant_blues_classical_disco',
   'second_dominant_blues_classicalsecond_dominant_hh',
   'second_dominant_blues_disco_rock', 'second_dominant_blues_hh',
   'second_dominant_blues_metal_reggae', 'second_dominant_classical',
   'second_dominant_classical_country',
   'second_dominant_classical_country_disco_hh',
   'second_dominant_classical_country_disco_hh_jazz_metal_rock',
   'second_dominant_classical_disco',
   'second_dominant_classical_disco_reggae',
   'second_dominant_classical_hh_rock', 'second_dominant_country',
   'second_dominant_country_disco_rock',
   'second_dominant_country_jazz_rock', 'second_dominant_disco',
   'second_dominant_disco_hh', 'second_dominant_jazz',
   'second_dominant_metal']

# Item columns, taken from the randomly sampled candidate row.
right_k = ['ItemID', 'category_name_blues', 'category_name_classical',
   'category_name_country', 'category_name_disco', 'category_name_hip_hop',
   'category_name_jazz', 'category_name_metal', 'category_name_pop',
   'category_name_reggae', 'category_name_rock']

negative_rows = []
for idx, row in data.iterrows():
    # Sample one row that differs from the current row in both user and item.
    candidate = data[(data.ItemID != row['ItemID']) & (data.UserID != row['UserID'])].sample(n=1)
    candidate = candidate.iloc[0].to_dict()

    left = {k: row[k] for k in left_k}
    right = {k: candidate[k] for k in right_k}

    # Negative example: the sampled pairing carries no rating.
    new_row = {**left, **right}
    new_row['avg_rating'] = None
    new_row['label'] = 0
    negative_rows.append(new_row)

# Concatenate once instead of growing the DataFrame inside the loop.
dataset = pd.concat([dataset, pd.DataFrame(negative_rows)], ignore_index=True)
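
To make the sampling step above concrete, here is a minimal sketch on a hypothetical four-row interaction table. The columns user_feature and item_feature and the helper sample_negative are illustrative assumptions, not part of the real dataset, and the seed is fixed only so that the toy run is repeatable.

```python
import pandas as pd

# Hypothetical toy interactions: every observed (user, item) pair is a positive.
toy = pd.DataFrame({
    "UserID": [1, 1, 2, 3],
    "ItemID": [10, 11, 12, 13],
    "user_feature": [0.1, 0.1, 0.5, 0.9],
    "item_feature": [7, 8, 9, 6],
})
toy["label"] = 1


def sample_negative(row, frame, seed):
    """Pair the user side of `row` with the item side of a random other row."""
    # Candidates must differ from `row` in both user and item.
    pool = frame[(frame.ItemID != row["ItemID"]) & (frame.UserID != row["UserID"])]
    candidate = pool.sample(n=1, random_state=seed).iloc[0]
    return {
        "UserID": row["UserID"],
        "user_feature": row["user_feature"],
        "ItemID": candidate["ItemID"],
        "item_feature": candidate["item_feature"],
        "label": 0,
    }


negatives = [sample_negative(row, toy, seed=idx) for idx, row in toy.iterrows()]
augmented = pd.concat([toy, pd.DataFrame(negatives)], ignore_index=True)

# One negative per positive, so the two classes end up the same size.
print(augmented["label"].value_counts().to_dict())
```

Note that the filter only compares the candidate against the current row. It does not check whether the same user appears with the sampled item in some other row, so a sampled negative could occasionally coincide with an observed pair.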
  

    

In [20]:
dataset.describe()

| | UserID | number_of_unique_songs | number_of_unique_genres | genre_ratio | main_genre_dominance | no_stimulus_points | stimulus_points | driving_style_relaxed_driving | driving_style_sport_driving | landscape_coast_line | ... | category_name_classical | category_name_country | category_name_disco | category_name_hip_hop | category_name_jazz | category_name_metal | category_name_pop | category_name_reggae | category_name_rock | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 930.0 | 930.0 | 930.0 | 930.0 | 930.0 | 930.0 | 930.0 | 930.0 | 930.0 | 930.0 | ... | 930.0 | 930.0 | 930.0 | 930.0 | 930.0 | 930.0 | 930.0 | 930.0 | 930.0 | 930.0 |
| mean | 1019.14086 | 64.984946 | 8.132258 | 0.394187 | 0.589208 | 0.986022 | 1.088172 | 0.194624 | 0.184946 | 0.134409 | ... | 0.035484 | 0.039785 | 0.091398 | 0.041935 | 0.039785 | 0.03871 | 0.602151 | 0.025806 | 0.03871 | 0.0 |
| std | 11.322104 | 48.325411 | 2.641218 | 0.271515 | 0.10208 | 0.86855 | 0.862301 | 0.404194 | 0.396689 | 0.341274 | ... | 0.185099 | 0.195559 | 0.288329 | 0.20055 | 0.195559 | 0.193006 | 0.489717 | 0.158643 | 0.193006 | 0.0 |
| min | 1001.0 | 1.0 | 1.0 | 0.010101 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1009.0 | 20.0 | 6.0 | 0.08 | 0.542857 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1019.0 | 70.0 | 10.0 | 0.542857 | 0.567308 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 75% | 1031.75 | 116.0 | 10.0 | 0.6 | 0.637931 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| max | 1042.0 | 139.0 | 10.0 | 1.0 | 1.0 | 5.0 | 5.0 | 2.0 | 2.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |


## Prepare data

Splitting the dataset into features (X) and labels (Y), excluding the UserID, ItemID, label, and avg_rating columns from the features.
Further splitting the data into training and testing sets using an 80-20 split, stratified on the label so that the class distribution is the same in both sets.

In [16]:
X = dataset.drop(['UserID', 'ItemID', 'label', 'avg_rating'], axis=1).values
Y = dataset.label.values

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=42, stratify=Y)
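
As a quick illustration of what stratify does, the following sketch splits hypothetical balanced toy data with the same test_size and random_state and checks the class ratio in each part. The arrays X_toy and y_toy are made up for this example and stand in for the real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 50 positives, 50 negatives, 3 random features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 50 + [0] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)

# 80 rows go to training and 20 to testing.
print(X_tr.shape, X_te.shape)

# With stratify, each part keeps the 50/50 class ratio of the full data.
print(y_tr.mean(), y_te.mean())
```

Without stratify, the test set's class ratio would vary with the random seed, which matters more on small datasets like this one.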

## Build AutoML solution

Initializing a TPOTClassifier that runs for 8 generations with a population of 30 pipelines, uses 4 parallel jobs, and ranks pipelines by F1-score.
Fitting the TPOTClassifier to the training data, allowing it to automatically search for the best classifier pipeline.

In [17]:
tpot = TPOTClassifier(generations=8, population_size=30, verbosity=2, 
                      n_jobs=4,
                      scoring="f1")
tpot.fit(X_train, y_train)

                                                                             
Generation 1 - Current best internal CV score: 0.748775036140025
Generation 2 - Current best internal CV score: 0.7527942743808952
Generation 3 - Current best internal CV score: 0.7711318666581434
Generation 4 - Current best internal CV score: 0.7711318666581434
Generation 5 - Current best internal CV score: 0.7711318666581434
Generation 6 - Current best internal CV score: 0.7717138640580462
Generation 7 - Current best internal CV score: 0.77171
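
The settings passed to TPOTClassifier above also fix the size of the search. According to the TPOT documentation, a run evaluates population_size + generations × offspring_size pipelines in total. The sketch below works this out, assuming offspring_size and cv were left at their defaults (offspring_size equal to population_size, and 5-fold cross-validation); neither is set explicitly in the call, so treat both as assumptions.

```python
# Settings taken from the TPOTClassifier call.
generations = 8
population_size = 30

# Assumption: offspring_size is left at its default, which equals population_size.
offspring_size = population_size

# Initial population plus the offspring produced in each generation.
pipelines_evaluated = population_size + generations * offspring_size
print(pipelines_evaluated)  # 270

# Assumption: cv is left at its default of 5 folds, so each pipeline is fitted 5 times.
cv_folds = 5
print(pipelines_evaluated * cv_folds)  # 1350
```

This is an upper bound on the work done rather than an exact count, since the search may skip pipelines it has already evaluated.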

Once TPOT completes its search, we evaluate the best pipeline's performance on the held-out testing data using the F1-score.

In [18]:
f'Best F1-score found: {tpot.score(X_test, y_test)}'

'Best F1-score found: 0.7865168539325843'
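
For reference, the F1-score is the harmonic mean of precision and recall for the positive class (label 1). The sketch below computes it by hand on hypothetical toy labels; the helper f1_from_labels and the two lists are illustrative and unrelated to the actual test set.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1) from two equal-length label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

# Precision and recall are both 2/3, so F1 is also 2/3.
print(f1_from_labels(y_true, y_pred))
```

Because F1 ignores true negatives, it reflects how well the pipeline recovers the observed (label 1) pairs rather than overall accuracy.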

Exporting the Python code for the best pipeline found by TPOT to a file named tpot_car_music.py.

In [19]:
tpot.export('tpot_car_music.py')