In this notebook I train models on data from <br> https://www.kaggle.com/datasets/ujjwalaggarwal402/medicine-dataset/data <br>
to predict the category of a medicine (antidiabetic, antibiotic, etc.). <br>
I intend to use the following classifiers:
<ul>
<li>Logistic regression</li>
<li>K nearest neighbors</li>
<li>Decision tree</li>
<li>Random forests</li>
<li>Naive Bayes classifier</li>
<li>Support vector machine</li>
</ul>
I intend to train the models on different amounts of input data and then compare the classifiers' results as a function of the training set size. <br>
In the previous notebook I got low scores of about 12.5%, which is roughly what random guessing would yield for eight balanced categories. In this one I'll try to improve the prediction scores.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px

In [2]:
# Load the dataset from Kaggle after connecting the notebook to Google Drive
# https://www.kaggle.com/datasets/ujjwalaggarwal402/medicine-dataset/data
df_raw = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/machine_learning/ML_Projects/data/medicine_dataset.csv')
df_raw.head()

Unnamed: 0,Name,Category,Dosage Form,Strength,Manufacturer,Indication,Classification
0,Acetocillin,Antidiabetic,Cream,938 mg,Roche Holding AG,Virus,Over-the-Counter
1,Ibuprocillin,Antiviral,Injection,337 mg,CSL Limited,Infection,Over-the-Counter
2,Dextrophen,Antibiotic,Ointment,333 mg,Johnson & Johnson,Wound,Prescription
3,Clarinazole,Antifungal,Syrup,362 mg,AbbVie Inc.,Pain,Prescription
4,Amoxicillin,Antifungal,Tablet,802 mg,Teva Pharmaceutical Industries Ltd.,Wound,Over-the-Counter


In [3]:
df = df_raw.copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Name            50000 non-null  object
 1   Category        50000 non-null  object
 2   Dosage Form     50000 non-null  object
 3   Strength        50000 non-null  object
 4   Manufacturer    50000 non-null  object
 5   Indication      50000 non-null  object
 6   Classification  50000 non-null  object
dtypes: object(7)
memory usage: 2.7+ MB


In [4]:
df.describe(include='object')

Unnamed: 0,Name,Category,Dosage Form,Strength,Manufacturer,Indication,Classification
count,50000,50000,50000,50000,50000,50000,50000
unique,64,8,8,999,20,8,2
top,Metostatin,Antidepressant,Inhaler,347 mg,Boehringer Ingelheim GmbH,Infection,Over-the-Counter
freq,860,6354,6364,77,2587,6393,25015


In [7]:
df.isna().sum()

Unnamed: 0,0
Name,0
Category,0
Dosage Form,0
Strength,0
Manufacturer,0
Indication,0
Classification,0


In [9]:
df.duplicated().sum()

0

In [13]:
df = df[['Category', 'Strength', 'Indication']]
df.head()

Unnamed: 0,Category,Strength,Indication
0,6,938 mg,Virus
1,5,337 mg,Infection
2,7,333 mg,Wound
3,3,362 mg,Pain
4,3,802 mg,Wound


In [11]:
df['Category'].value_counts()

Unnamed: 0_level_0,count
Category,Unnamed: 1_level_1
Antidepressant,6354
Analgesic,6340
Antiseptic,6315
Antifungal,6289
Antipyretic,6280
Antiviral,6185
Antidiabetic,6171
Antibiotic,6066
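
With eight categories of nearly equal size, it helps to know what score a model with no real signal would reach. A minimal sketch using the counts printed above; the variable names are illustrative and not part of the notebook:

```python
# Class counts copied from the value_counts() output above.
counts = {
    'Antidepressant': 6354,
    'Analgesic': 6340,
    'Antiseptic': 6315,
    'Antifungal': 6289,
    'Antipyretic': 6280,
    'Antiviral': 6185,
    'Antidiabetic': 6171,
    'Antibiotic': 6066,
}

total = sum(counts.values())

# Accuracy of guessing uniformly at random among the classes.
uniform_chance = 1 / len(counts)

# Accuracy of always predicting the most frequent class.
majority_baseline = max(counts.values()) / total

print(f'uniform chance:    {uniform_chance:.4f}')
print(f'majority baseline: {majority_baseline:.4f}')
```

Any classifier score close to these two numbers is hard to distinguish from guessing.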


In [12]:
mapped_values = {
    'Antidepressant': 0,
    'Analgesic': 1,
    'Antiseptic': 2,
    'Antifungal': 3,
    'Antipyretic': 4,
    'Antiviral': 5,
    'Antidiabetic': 6,
    'Antibiotic': 7
    }
df['Category'] = df['Category'].map(mapped_values)
df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Category'] = df['Category'].map(mapped_values)


Unnamed: 0,Category,Dosage Form,Strength,Indication,Classification
0,6,Cream,938 mg,Virus,Over-the-Counter
1,5,Injection,337 mg,Infection,Over-the-Counter
2,7,Ointment,333 mg,Wound,Prescription
3,3,Syrup,362 mg,Pain,Prescription
4,3,Tablet,802 mg,Wound,Over-the-Counter
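
The SettingWithCopyWarning shown in this cell typically appears when a column is assigned on a DataFrame that was itself created by selecting columns from another one. A minimal sketch of one common way to avoid it, taking an explicit `.copy()` before mapping; the toy frame and the names `raw` and `subset` are assumptions made for illustration:

```python
import pandas as pd

# Toy stand-in for the raw data (hypothetical rows, same column names).
raw = pd.DataFrame({
    'Name': ['Acetocillin', 'Ibuprocillin'],
    'Category': ['Antidiabetic', 'Antiviral'],
    'Strength': ['938 mg', '337 mg'],
})

# .copy() makes the subset an independent DataFrame, so the assignment
# below no longer targets a possible view of `raw`.
subset = raw[['Category', 'Strength']].copy()

mapped_values = {'Antidiabetic': 6, 'Antiviral': 5}
subset['Category'] = subset['Category'].map(mapped_values)

# .map() yields NaN for labels missing from the dict, so it is worth checking.
unmapped = int(subset['Category'].isna().sum())
print(subset)
print('unmapped labels:', unmapped)
```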


In [14]:
df['Strength'] = df['Strength'].str.replace(' mg', '').astype(int)
df = pd.get_dummies(df, columns=['Indication'], drop_first=True)
df.head()

Unnamed: 0,Category,Strength,Indication_Diabetes,Indication_Fever,Indication_Fungus,Indication_Infection,Indication_Pain,Indication_Virus,Indication_Wound
0,6,938,False,False,False,False,False,True,False
1,5,337,False,False,False,True,False,False,False
2,7,333,False,False,False,False,False,False,True
3,3,362,False,False,False,False,True,False,False
4,3,802,False,False,False,False,False,False,True
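
The two preprocessing steps above can be tried on a tiny frame to see what they do. This is a sketch on made-up rows, and it uses `str.extract` with a digit pattern as an alternative to `str.replace`, which is my own substitution rather than the notebook's method:

```python
import pandas as pd

toy = pd.DataFrame({
    'Strength': ['938 mg', '337 mg', '333 mg'],
    'Indication': ['Virus', 'Infection', 'Wound'],
})

# Pull out the leading digits; this tolerates extra whitespace or a
# differently spelled unit, unlike a literal ' mg' replacement.
toy['Strength'] = toy['Strength'].str.extract(r'(\d+)', expand=False).astype(int)

# drop_first=True drops the alphabetically first level ('Infection' here),
# which then acts as the implicit reference category.
encoded = pd.get_dummies(toy, columns=['Indication'], drop_first=True)
print(encoded)
```

Because of `drop_first=True`, a row with all dummy columns equal to zero stands for the dropped level.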


In [15]:
# Set Category as the target to predict, then convert the remaining features to float.
target = df.pop('Category')
features = df.astype(float)
features.head()

Unnamed: 0,Strength,Indication_Diabetes,Indication_Fever,Indication_Fungus,Indication_Infection,Indication_Pain,Indication_Virus,Indication_Wound
0,938.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,337.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,333.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,362.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,802.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
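
In the table above, Strength takes values in the hundreds while the dummy columns are 0 or 1. Distance-based models such as KNN and SVM are sensitive to such scale differences, so as long as Strength is among the features it may be worth standardizing. A minimal sketch on random stand-in data, assuming scikit-learn's `StandardScaler` inside a pipeline (this is not something the notebook does):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Random stand-in features: one wide-range column and one 0/1 column.
X = np.column_stack([
    rng.integers(1, 1000, size=200).astype(float),
    rng.integers(0, 2, size=200).astype(float),
])
y = rng.integers(0, 8, size=200)

# The scaler is fitted only on the data passed to fit(), so inside a
# grid search it would be re-fitted on each training fold.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X, y)

scaled = model.named_steps['standardscaler'].transform(X)
print('column means after scaling:', scaled.mean(axis=0))
print('column stds after scaling: ', scaled.std(axis=0))
```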


### Function that trains the classifiers and compares their scores

In [22]:
def complete_scores(features, target, train_size):
    """Fit six classifiers on one train/test split and return {model name: test accuracy}."""

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import classification_report


    classification_scores = {}
    features_train, features_test, target_train, target_test = train_test_split(features, target, train_size=train_size)

    print('features_train shape:', features_train.shape)
    print('features_test shape:', features_test.shape)
    print('target_train shape:', target_train.shape)
    print('target_test shape:', target_test.shape)
    print()
    print()

    # Logistic regression
    lr = LogisticRegression()
    lr_params = {
        'max_iter': [5000, 7500],
        'solver': ['lbfgs', 'newton-cg', 'sag', 'saga']
    }
    lr_grid = GridSearchCV(lr, lr_params, cv=3)
    lr_grid.fit(features_train, target_train)
    lr_pred = lr_grid.predict(features_test)
    lr_score = lr_grid.score(features_test, target_test)
    classification_scores['Logistic regression'] = lr_score
    print('************************')
    print(f'Logistic regression best params: {lr_grid.best_params_}')
    print('************************')
    print('Logistic regression classification report')
    print(classification_report(target_test, lr_pred))
    print('************************')
    print('************************')
    print('************************')

    # KNN
    knn = KNeighborsClassifier()
    knn_params = {
        'n_neighbors': range(1, 11),
        'metric': ['minkowski', 'euclidean', 'manhattan', 'cosine']
    }
    knn_grid = GridSearchCV(knn, knn_params, cv=3)
    knn_grid.fit(features_train, target_train)
    knn_pred = knn_grid.predict(features_test)
    knn_score = knn_grid.score(features_test, target_test)
    classification_scores['KNN'] = knn_score
    print('************************')
    print(f'KNN best params: {knn_grid.best_params_}')
    print('************************')
    print('KNN classification report')
    print(classification_report(target_test, knn_pred))
    print('************************')
    print('************************')
    print('************************')

    # Decision tree
    dt = DecisionTreeClassifier()
    dt_params = {
        'criterion': ['gini', 'entropy'],
        'max_depth': range(3, 10),
        'min_samples_split': range(2, 6),
        'min_samples_leaf': range(1, 6)
    }
    dt_grid = GridSearchCV(dt, dt_params, cv=3)
    dt_grid.fit(features_train, target_train)
    dt_pred = dt_grid.predict(features_test)
    dt_score = dt_grid.score(features_test, target_test)
    classification_scores['Decision Tree'] = dt_score
    print('************************')
    print(f'Decision tree best params: {dt_grid.best_params_}')
    print('************************')
    print(f'Decision tree feature importances: {dt_grid.best_estimator_.feature_importances_}')
    print('************************')
    print('************************')
    print('Decision tree classification report')
    print(classification_report(target_test, dt_pred))
    print('************************')
    print('************************')
    print('************************')

    # Random forest
    rf = RandomForestClassifier()
    rf_params = {
        'criterion': ['gini', 'entropy'],
        'n_estimators': range(50, 101, 10),
        'max_depth': range(3, 16),
        'min_samples_split': range(3, 6),
        'min_samples_leaf': range(1, 6)
    }
    rf_grid = GridSearchCV(rf, rf_params, cv=3)
    rf_grid.fit(features_train, target_train)
    rf_pred = rf_grid.predict(features_test)
    rf_score = rf_grid.score(features_test, target_test)
    classification_scores['Random Forest'] = rf_score
    print('************************')
    print(f'Random forest best params: {rf_grid.best_params_}')
    print('************************')
    print(f'Random forest feature importances: {rf_grid.best_estimator_.feature_importances_}')
    print('************************')
    print('Random forest classification report')
    print(classification_report(target_test, rf_pred))
    print('************************')
    print('************************')
    print('************************')

    # Naive Bayes
    bayes = GaussianNB()
    bayes.fit(features_train, target_train)
    bayes_pred = bayes.predict(features_test)
    bayes_score = bayes.score(features_test, target_test)
    classification_scores['Naive Bayes'] = bayes_score
    print('************************')
    print('Naive Bayes classification report')
    print(classification_report(target_test, bayes_pred))
    print('************************')
    print('************************')
    print('************************')

    # Support Vector Machine
    svc = SVC()
    svc_params = {
        'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    }
    # Based on my previous experience, this classifier takes a long time to fit,
    # so only 2 cross-validation folds are used
    svc_grid = GridSearchCV(svc, svc_params, cv=2)
    svc_grid.fit(features_train, target_train)
    svc_pred = svc_grid.predict(features_test)
    svc_score = svc_grid.score(features_test, target_test)
    classification_scores['Support Vector Machine'] = svc_score
    print('************************')
    print(f'Support vector machine best params: {svc_grid.best_params_}')
    print('************************')
    print('Support vector machine classification report')
    print(classification_report(target_test, svc_pred))
    print('************************')
    print('************************')
    print('************************')

    return classification_scores
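
The function above repeats the same fit, predict and score steps for each classifier. A compact sketch of the same idea as a loop over a dictionary of estimators and parameter grids; `score_models`, the reduced grids and the synthetic data are assumptions for illustration, not a drop-in replacement:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def score_models(models, X_train, X_test, y_train, y_test, cv=3):
    """Grid-search each estimator and return {name: test accuracy}."""
    scores = {}
    for name, (estimator, param_grid) in models.items():
        grid = GridSearchCV(estimator, param_grid, cv=cv)
        grid.fit(X_train, y_train)
        scores[name] = grid.score(X_test, y_test)
        print(f'{name} best params: {grid.best_params_}')
    return scores


# Small synthetic problem so the sketch runs quickly.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0)

models = {
    'KNN': (KNeighborsClassifier(), {'n_neighbors': [3, 5]}),
    'Decision Tree': (DecisionTreeClassifier(random_state=0), {'max_depth': [3, 5]}),
}
scores = score_models(models, X_train, X_test, y_train, y_test)
print(scores)
```

Adding another classifier then means adding one dictionary entry instead of another copied block.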

In [20]:
scores_1_percent_train_size = complete_scores(features, target, 0.01)
print(scores_1_percent_train_size)

features_train shape: (500, 8)
features_test shape: (49500, 8)
target_train shape: (500,)
target_test shape: (49500,)
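
With only 500 training rows and eight classes, an unstratified random split can leave some classes under-represented, and without a fixed seed each run gives a different split. A minimal sketch of a stratified, seeded split on synthetic labels; `stratify` and `random_state` are real `train_test_split` parameters, but using them here is my suggestion, not what the notebook does:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 8 balanced classes with 1000 rows each.
y = np.repeat(np.arange(8), 1000)
X = np.arange(len(y), dtype=float).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    train_size=0.01,   # same fraction as in the run above
    stratify=y,        # keep class proportions equal in both parts
    random_state=42,   # make the split reproducible
)

train_counts = np.bincount(y_train, minlength=8)
print('train rows:', X_train.shape[0])
print('rows per class in train:', train_counts)
```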




  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


************************
Logistic regression best params: {'max_iter': 5000, 'solver': 'saga'}
************************
Logistic regression classification report
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      6289
           1       0.00      0.00      0.00      6284
           2       0.13      0.03      0.04      6249
           3       0.14      0.00      0.01      6235
           4       0.13      0.29      0.18      6209
           5       0.12      0.68      0.21      6113
           6       0.00      0.00      0.00      6115
           7       0.00      0.00      0.00      6006

    accuracy                           0.12     49500
   macro avg       0.06      0.13      0.05     49500
weighted avg       0.06      0.12      0.05     49500

************************
************************
************************
************************
KNN best params: {'metric': 'cosine', 'n_neighbors': 3}
************************
KNN cl

  _data = np.array(data, dtype=dtype, copy=copy,


************************
Decision tree best params: {'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 2}
************************
Decision tree feature importances: [0.82152447 0.         0.         0.04913929 0.05596083 0.
 0.         0.07337541]
************************
************************
Decision tree classification report
              precision    recall  f1-score   support

           0       0.13      0.56      0.21      6289
           1       0.12      0.04      0.06      6284
           2       0.13      0.08      0.10      6249
           3       0.13      0.25      0.17      6235
           4       0.11      0.00      0.01      6209
           5       0.12      0.05      0.08      6113
           6       0.00      0.00      0.00      6115
           7       0.11      0.01      0.02      6006

    accuracy                           0.13     49500
   macro avg       0.11      0.13      0.08     49500
weighted avg       0.11      0.13     



              precision    recall  f1-score   support

           0       0.13      0.04      0.07      6289
           1       0.12      0.02      0.04      6284
           2       0.13      0.31      0.18      6249
           3       0.12      0.11      0.12      6235
           4       0.13      0.23      0.16      6209
           5       0.12      0.14      0.13      6113
           6       0.13      0.13      0.13      6115
           7       0.00      0.00      0.00      6006

    accuracy                           0.12     49500
   macro avg       0.11      0.12      0.10     49500
weighted avg       0.11      0.12      0.10     49500

************************
************************
************************
************************
Support vector machine best params: {'kernel': 'sigmoid'}
************************
Support vector machine classification report
              precision    recall  f1-score   support

           0       0.13      0.51      0.20      6289
           1



In [25]:
scores_df_1 = pd.DataFrame(scores_1_percent_train_size.items(), columns=['Model', 'Score'])
scores_df_1

Unnamed: 0,Model,Score
0,Logistic regression,0.124162
1,KNN,0.123515
2,Decision Tree,0.123879
3,Random Forest,0.126424
4,Naive Bayes,0.124162
5,Support Vector Machine,0.126727
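
To judge whether any of these scores differs meaningfully from guessing, one can compare them with the sampling noise expected on 49,500 test rows. A rough sketch, assuming balanced classes and independent test rows so that the accuracy of a pure guesser follows a binomial distribution with p = 1/8:

```python
import math

n_test = 49_500
chance = 1 / 8

# Scores copied from the table above.
scores = {
    'Logistic regression': 0.124162,
    'KNN': 0.123515,
    'Decision Tree': 0.123879,
    'Random Forest': 0.126424,
    'Naive Bayes': 0.124162,
    'Support Vector Machine': 0.126727,
}

# Standard error of the accuracy of a guesser under the binomial assumption.
standard_error = math.sqrt(chance * (1 - chance) / n_test)

# How many standard errors each score lies from chance level.
z_scores = {model: (score - chance) / standard_error
            for model, score in scores.items()}

print(f'standard error: {standard_error:.5f}')
for model, z in z_scores.items():
    print(f'{model:<24} z = {z:+.2f}')
```

Under these assumptions every model lands within about two standard errors of 0.125, which is consistent with none of them having learned a usable signal.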


In [26]:
px.bar(scores_df_1, x='Model', y='Score', title='Classification models scores, 50k samples, train size=0.01')

### Strength has the biggest impact on models like the decision tree and random forest, so let's remove this column from the data.

In [21]:
features.drop('Strength', axis=1, inplace=True)
features.head()

Unnamed: 0,Indication_Diabetes,Indication_Fever,Indication_Fungus,Indication_Infection,Indication_Pain,Indication_Virus,Indication_Wound
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,1.0
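
After this step only the one-hot Indication columns remain, so it may be worth checking whether Indication tells us anything about Category at all. One possible check is mutual information between two label columns. A minimal sketch on random stand-in labels, using scikit-learn's `mutual_info_score` (the check itself is my suggestion, not part of the notebook):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)

# Stand-in labels: two unrelated columns with 8 levels each.
category = rng.integers(0, 8, size=5000)
unrelated = rng.integers(0, 8, size=5000)

# Mutual information (in nats) is close to 0 for unrelated columns
# and equals the entropy of the column when compared with itself.
mi_unrelated = mutual_info_score(category, unrelated)
mi_identical = mutual_info_score(category, category)

print(f'unrelated columns: {mi_unrelated:.4f}')
print(f'identical columns: {mi_identical:.4f}')
```

Applied to the real data, something like `mutual_info_score(df_raw['Indication'], df_raw['Category'])` returning a value near zero would suggest that no classifier can do much better than chance with this feature.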


In [24]:
# Run the comparison again, this time without the Strength column
scores_1_percent_train_size = complete_scores(features, target, 0.01)
print(scores_1_percent_train_size)

features_train shape: (500, 7)
features_test shape: (49500, 7)
target_train shape: (500,)
target_test shape: (49500,)


************************
Logistic regression best params: {'max_iter': 5000, 'solver': 'lbfgs'}
************************
Logistic regression classification report




              precision    recall  f1-score   support

           0       0.13      0.37      0.19      6289
           1       0.12      0.24      0.16      6268
           2       0.12      0.12      0.12      6249
           3       0.12      0.12      0.12      6229
           4       0.00      0.00      0.00      6215
           5       0.00      0.00      0.00      6129
           6       0.12      0.13      0.12      6114
           7       0.00      0.00      0.00      6007

    accuracy                           0.12     49500
   macro avg       0.08      0.12      0.09     49500
weighted avg       0.08      0.12      0.09     49500

************************
************************
************************
************************
KNN best params: {'metric': 'minkowski', 'n_neighbors': 1}
************************
KNN classification report
              precision    recall  f1-score   support

           0       0.13      0.25      0.17      6289
           1       0.13      0



************************
Decision tree best params: {'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2}
************************
Decision tree feature importances: [0.16215365 0.09751311 0.138288   0.         0.16078696 0.
 0.44125827]
************************
************************
Decision tree classification report
              precision    recall  f1-score   support

           0       0.12      0.49      0.20      6289
           1       0.12      0.12      0.12      6268
           2       0.12      0.12      0.12      6249
           3       0.12      0.12      0.12      6229
           4       0.00      0.00      0.00      6215
           5       0.00      0.00      0.00      6129
           6       0.12      0.13      0.12      6114
           7       0.00      0.00      0.00      6007

    accuracy                           0.12     49500
   macro avg       0.08      0.12      0.09     49500
weighted avg       0.08      0.12      0.09     4



************************
Random forest best params: {'criterion': 'gini', 'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 3, 'n_estimators': 60}
************************
Random forest feature importances: [0.18239409 0.15063406 0.17529479 0.08183264 0.08710178 0.10648198
 0.21626065]
************************
Random forest classification report
              precision    recall  f1-score   support

           0       0.13      0.13      0.13      6289
           1       0.13      0.38      0.19      6268
           2       0.12      0.24      0.16      6249
           3       0.12      0.12      0.12      6229
           4       0.13      0.13      0.13      6215
           5       0.00      0.00      0.00      6129
           6       0.00      0.00      0.00      6114
           7       0.00      0.00      0.00      6007

    accuracy                           0.13     49500
   macro avg       0.08      0.13      0.09     49500
weighted avg       0.08      0.13      0.09   



              precision    recall  f1-score   support

           0       0.13      0.37      0.19      6289
           1       0.12      0.24      0.16      6268
           2       0.12      0.12      0.12      6249
           3       0.12      0.12      0.12      6229
           4       0.00      0.00      0.00      6215
           5       0.00      0.00      0.00      6129
           6       0.12      0.13      0.12      6114
           7       0.00      0.00      0.00      6007

    accuracy                           0.12     49500
   macro avg       0.08      0.12      0.09     49500
weighted avg       0.08      0.12      0.09     49500

************************
************************
************************
************************
Support vector machine best params: {'kernel': 'sigmoid'}
************************
Support vector machine classification report
              precision    recall  f1-score   support

           0       0.13      0.25      0.17      6289
           1



In [27]:
scores_df_1 = pd.DataFrame(scores_1_percent_train_size.items(), columns=['Model', 'Score'])
px.bar(scores_df_1, x='Model', y='Score', title='Classification models scores, 50k samples, train size=0.01')

In [28]:
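# Note: train_size=0.5 uses 50% of the rows for training (see the shapes below),
# not 5% as the variable name suggests; pass 0.05 for a 5% split.
scores_5_percent_train_size = complete_scores(features, target, 0.5)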
print(scores_5_percent_train_size)

features_train shape: (25000, 7)
features_test shape: (25000, 7)
target_train shape: (25000,)
target_test shape: (25000,)


************************
Logistic regression best params: {'max_iter': 5000, 'solver': 'newton-cg'}
************************
Logistic regression classification report
              precision    recall  f1-score   support

           0       0.13      0.13      0.13      3153
           1       0.13      0.25      0.17      3175
           2       0.13      0.25      0.17      3156
           3       0.12      0.12      0.12      3165
           4       0.00      0.00      0.00      3134
           5       0.13      0.25      0.17      3105
           6       0.00      0.00      0.00      3086
           7       0.00      0.00      0.00      3026

    accuracy                           0.13     25000
   macro avg       0.08      0.13      0.09     25000
weighted avg       0.08      0.13      0.10     25000

************************
************************
********


Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
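
The warning above is raised when a class is never predicted, so its precision would be 0/0. A minimal sketch of the `zero_division` parameter it mentions, on made-up labels; setting it to 0 keeps the reported value at 0.0 but without the warning:

```python
from sklearn.metrics import classification_report

# Made-up labels in which class 1 is never predicted.
y_true = [0, 1, 2, 2]
y_pred = [0, 0, 2, 2]

# zero_division=0 reports 0.0 for the undefined precision instead of warning.
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)

print('precision for class 1:', report['1']['precision'])
print('recall for class 1:   ', report['1']['recall'])
```

This only changes how the undefined case is reported; a class with zero predictions is still a sign that the model ignores it.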





************************
KNN best params: {'metric': 'cosine', 'n_neighbors': 6}
************************
KNN classification report
              precision    recall  f1-score   support

           0       0.12      0.50      0.20      3153
           1       0.00      0.00      0.00      3175
           2       0.13      0.50      0.20      3156
           3       0.00      0.00      0.00      3165
           4       0.00      0.00      0.00      3134
           5       0.00      0.00      0.00      3105
           6       0.00      0.00      0.00      3086
           7       0.00      0.00      0.00      3026

    accuracy                           0.13     25000
   macro avg       0.03      0.13      0.05     25000
weighted avg       0.03      0.13      0.05     25000

************************
************************
************************





invalid value encountered in cast





************************
Decision tree best params: {'criterion': 'gini', 'max_depth': 7, 'min_samples_leaf': 1, 'min_samples_split': 2}
************************
Decision tree feature importances: [0.08552252 0.13721001 0.0899065  0.16230485 0.12928086 0.22822169
 0.16755357]
************************
************************
Decision tree classification report
              precision    recall  f1-score   support

           0       0.13      0.13      0.13      3153
           1       0.13      0.25      0.17      3175
           2       0.13      0.25      0.17      3156
           3       0.12      0.12      0.12      3165
           4       0.00      0.00      0.00      3134
           5       0.13      0.25      0.17      3105
           6       0.00      0.00      0.00      3086
           7       0.00      0.00      0.00      3026

    accuracy                           0.13     25000
   macro avg       0.08      0.13      0.09     25000
weighted avg       0.08      0.13      0.





************************
Random forest best params: {'criterion': 'gini', 'max_depth': 12, 'min_samples_leaf': 1, 'min_samples_split': 4, 'n_estimators': 50}
************************
Random forest feature importances: [0.1357288  0.16876935 0.09016636 0.14626566 0.13386786 0.17620341
 0.14899856]
************************
Random forest classification report
              precision    recall  f1-score   support

           0       0.13      0.25      0.17      3153
           1       0.13      0.13      0.13      3175
           2       0.13      0.25      0.17      3156
           3       0.12      0.12      0.12      3165
           4       0.00      0.00      0.00      3134
           5       0.13      0.25      0.17      3105
           6       0.00      0.00      0.00      3086
           7       0.00      0.00      0.00      3026

    accuracy                           0.13     25000
   macro avg       0.08      0.13      0.09     25000
weighted avg       0.08      0.13      0.09  





************************
Support vector machine best params: {'kernel': 'poly'}
************************
Support vector machine classification report
              precision    recall  f1-score   support

           0       0.13      0.13      0.13      3153
           1       0.13      0.25      0.17      3175
           2       0.13      0.25      0.17      3156
           3       0.12      0.12      0.12      3165
           4       0.00      0.00      0.00      3134
           5       0.13      0.25      0.17      3105
           6       0.00      0.00      0.00      3086
           7       0.00      0.00      0.00      3026

    accuracy                           0.13     25000
   macro avg       0.08      0.13      0.09     25000
weighted avg       0.08      0.13      0.10     25000

************************
************************
************************
{'Logistic regression': 0.12416161616161617, 'KNN': 0.12351515151515151, 'Decision Tree': 0.12387878787878788, 'Random Fores





In [29]:
scores_df_5 = pd.DataFrame(scores_5_percent_train_size.items(), columns=['Model', 'Score'])
px.bar(scores_df_5, x='Model', y='Score', title='Classification models scores, 50k samples, train size=0.5')
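
Since the goal is to compare the classifiers across training set sizes, it can be convenient to gather all runs into one long table instead of one DataFrame per run. A minimal sketch; `collect_scores` and `placeholder_scores` are hypothetical helpers, and the numbers returned by the placeholder are dummy values, not results:

```python
import pandas as pd


def collect_scores(score_fn, train_sizes):
    """Call score_fn(train_size) for each size and stack the results."""
    rows = []
    for size in train_sizes:
        for model, score in score_fn(size).items():
            rows.append({'Train size': size, 'Model': model, 'Score': score})
    return pd.DataFrame(rows)


def placeholder_scores(train_size):
    # Stand-in for a call such as complete_scores(features, target, train_size);
    # the values below are dummies used only to show the table layout.
    return {'Logistic regression': 0.0, 'KNN': 0.0}


long_df = collect_scores(placeholder_scores, [0.01, 0.1, 0.5])
print(long_df)
```

A grouped chart could then be drawn with something like `px.bar(long_df, x='Model', y='Score', color='Train size', barmode='group')`, casting `Train size` to string first so Plotly treats it as a category.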

In [30]:
scores_10_percent_train_size = complete_scores(features, target, 0.1)
print(scores_10_percent_train_size)

features_train shape: (5000, 7)
features_test shape: (45000, 7)
target_train shape: (5000,)
target_test shape: (45000,)


************************
Logistic regression best params: {'max_iter': 5000, 'solver': 'lbfgs'}
************************
Logistic regression classification report






              precision    recall  f1-score   support

           0       0.13      0.13      0.13      5716
           1       0.13      0.13      0.13      5669
           2       0.13      0.12      0.13      5669
           3       0.00      0.00      0.00      5680
           4       0.12      0.37      0.18      5629
           5       0.00      0.00      0.00      5623
           6       0.12      0.12      0.12      5570
           7       0.12      0.13      0.13      5444

    accuracy                           0.13     45000
   macro avg       0.09      0.13      0.10     45000
weighted avg       0.09      0.13      0.10     45000

************************
************************
************************
************************
KNN best params: {'metric': 'minkowski', 'n_neighbors': 10}
************************
KNN classification report
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      5716
           1       0.13      




invalid value encountered in cast





************************
Decision tree best params: {'criterion': 'gini', 'max_depth': 7, 'min_samples_leaf': 1, 'min_samples_split': 2}
************************
Decision tree feature importances: [0.20378507 0.06606691 0.02816306 0.15745    0.18435194 0.22111854
 0.13906449]
************************
************************
Decision tree classification report
              precision    recall  f1-score   support

           0       0.13      0.13      0.13      5716
           1       0.13      0.13      0.13      5669
           2       0.13      0.12      0.13      5669
           3       0.00      0.00      0.00      5680
           4       0.12      0.37      0.18      5629
           5       0.00      0.00      0.00      5623
           6       0.12      0.12      0.12      5570
           7       0.12      0.13      0.13      5444

    accuracy                           0.13     45000
   macro avg       0.09      0.13      0.10     45000
weighted avg       0.09      0.13      0.





************************
Random forest best params: {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 4, 'n_estimators': 90}
************************
Random forest feature importances: [0.16395968 0.14263796 0.10579784 0.12604288 0.12755889 0.19744122
 0.13656153]
************************
Random forest classification report
              precision    recall  f1-score   support

           0       0.13      0.13      0.13      5716
           1       0.13      0.50      0.20      5669
           2       0.13      0.12      0.13      5669
           3       0.00      0.00      0.00      5680
           4       0.12      0.25      0.17      5629
           5       0.00      0.00      0.00      5623
           6       0.00      0.00      0.00      5570
           7       0.00      0.00      0.00      5444

    accuracy                           0.13     45000
   macro avg       0.06      0.12      0.08     45000
weighted avg       0.06      0.13      0.08





              precision    recall  f1-score   support

           0       0.13      0.13      0.13      5716
           1       0.13      0.13      0.13      5669
           2       0.13      0.12      0.13      5669
           3       0.00      0.00      0.00      5680
           4       0.12      0.37      0.18      5629
           5       0.00      0.00      0.00      5623
           6       0.12      0.12      0.12      5570
           7       0.12      0.13      0.13      5444

    accuracy                           0.13     45000
   macro avg       0.09      0.13      0.10     45000
weighted avg       0.09      0.13      0.10     45000

************************
************************
************************
************************
Support vector machine best params: {'kernel': 'linear'}
************************
Support vector machine classification report
              precision    recall  f1-score   support

           0       0.13      0.13      0.13      5716
           1 





In [31]:
scores_df_10 = pd.DataFrame(scores_10_percent_train_size.items(), columns=['Model', 'Score'])
px.bar(scores_df_10, x='Model', y='Score', title='Classification models scores, 50k samples, train size=0.1')
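The "Precision is ill-defined" warning in the output above appears whenever a class is never predicted, which is the case for every row showing 0.00 precision. As a minimal sketch (toy labels, not the notebook's data), passing `zero_division=0` to `classification_report` keeps the same 0.00 values but states the choice explicitly and stops the warning from flooding the output:

```python
import warnings
from sklearn.metrics import classification_report

# Toy labels (hypothetical): class 2 is never predicted, so its precision is undefined
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0]

with warnings.catch_warnings():
    # Turn any warning into an error to show that none is raised
    warnings.simplefilter("error")
    report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)

# Precision of the never-predicted class is reported as 0.0, without a warning
print(report["2"]["precision"])
```

The warning itself is still a useful signal here: several classes receive no predictions at all, so the models collapse onto a few labels instead of separating the categories.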

In [32]:
scores_25_percent_train_size = complete_scores(features, target, 0.25)
print(scores_25_percent_train_size)

features_train shape: (12500, 7)
features_test shape: (37500, 7)
target_train shape: (12500,)
target_test shape: (37500,)


************************
Logistic regression best params: {'max_iter': 5000, 'solver': 'lbfgs'}
************************
Logistic regression classification report
              precision    recall  f1-score   support

           0       0.12      0.12      0.12      4761
           1       0.13      0.25      0.17      4728
           2       0.13      0.38      0.19      4685
           3       0.13      0.12      0.12      4792
           4       0.00      0.00      0.00      4753
           5       0.13      0.13      0.13      4622
           6       0.00      0.00      0.00      4632
           7       0.00      0.00      0.00      4527

    accuracy                           0.13     37500
   macro avg       0.08      0.13      0.09     37500
weighted avg       0.08      0.13      0.09     37500

************************
************************
************





************************
KNN best params: {'metric': 'cosine', 'n_neighbors': 9}
************************
KNN classification report
              precision    recall  f1-score   support

           0       0.12      0.12      0.12      4761
           1       0.13      0.37      0.19      4728
           2       0.00      0.00      0.00      4685
           3       0.00      0.00      0.00      4792
           4       0.13      0.25      0.17      4753
           5       0.13      0.27      0.17      4622
           6       0.00      0.00      0.00      4632
           7       0.00      0.00      0.00      4527

    accuracy                           0.13     37500
   macro avg       0.06      0.13      0.08     37500
weighted avg       0.06      0.13      0.08     37500

************************
************************
************************






************************
Decision tree best params: {'criterion': 'gini', 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2}
************************
Decision tree feature importances: [0.         0.         0.         0.3034784  0.19867203 0.49784957
 0.        ]
************************
************************
Decision tree classification report
              precision    recall  f1-score   support

           0       0.12      0.12      0.12      4761
           1       0.13      0.13      0.13      4728
           2       0.13      0.63      0.21      4685
           3       0.00      0.00      0.00      4792
           4       0.00      0.00      0.00      4753
           5       0.13      0.13      0.13      4622
           6       0.00      0.00      0.00      4632
           7       0.00      0.00      0.00      4527

    accuracy                           0.13     37500
   macro avg       0.06      0.13      0.07     37500
weighted avg       0.06      0.13      0.





************************
Random forest best params: {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 4, 'min_samples_split': 3, 'n_estimators': 70}
************************
Random forest feature importances: [0.08766042 0.09432259 0.09165222 0.16358399 0.17517679 0.31463419
 0.0729698 ]
************************
Random forest classification report
              precision    recall  f1-score   support

           0       0.12      0.12      0.12      4761
           1       0.13      0.25      0.17      4728
           2       0.12      0.62      0.21      4685
           3       0.00      0.00      0.00      4792
           4       0.00      0.00      0.00      4753
           5       0.00      0.00      0.00      4622
           6       0.00      0.00      0.00      4632
           7       0.00      0.00      0.00      4527

    accuracy                           0.13     37500
   macro avg       0.05      0.12      0.06     37500
weighted avg       0.05      0.13      0.06





              precision    recall  f1-score   support

           0       0.12      0.12      0.12      4761
           1       0.13      0.25      0.17      4728
           2       0.13      0.25      0.17      4685
           3       0.12      0.24      0.16      4792
           4       0.00      0.00      0.00      4753
           5       0.13      0.13      0.13      4622
           6       0.00      0.00      0.00      4632
           7       0.00      0.00      0.00      4527

    accuracy                           0.13     37500
   macro avg       0.08      0.12      0.09     37500
weighted avg       0.08      0.13      0.09     37500

************************
************************
************************
************************
Support vector machine best params: {'kernel': 'linear'}
************************
Support vector machine classification report
              precision    recall  f1-score   support

           0       0.12      0.12      0.12      4761
           1 





In [33]:
scores_df_25 = pd.DataFrame(scores_25_percent_train_size.items(), columns=['Model', 'Score'])
px.bar(scores_df_25, x='Model', y='Score', title='Classification models scores, 50k samples, train size=0.25')

In [34]:
scores_50_percent_train_size = complete_scores(features, target, 0.5)
print(scores_50_percent_train_size)

features_train shape: (25000, 7)
features_test shape: (25000, 7)
target_train shape: (25000,)
target_test shape: (25000,)


************************
Logistic regression best params: {'max_iter': 5000, 'solver': 'lbfgs'}
************************
Logistic regression classification report
              precision    recall  f1-score   support

           0       0.13      0.13      0.13      3211
           1       0.13      0.26      0.17      3187
           2       0.12      0.37      0.18      3091
           3       0.13      0.12      0.13      3120
           4       0.12      0.12      0.12      3165
           5       0.00      0.00      0.00      3081
           6       0.00      0.00      0.00      3095
           7       0.00      0.00      0.00      3050

    accuracy                           0.13     25000
   macro avg       0.08      0.12      0.09     25000
weighted avg       0.08      0.13      0.09     25000

************************
************************
************





************************
KNN best params: {'metric': 'cosine', 'n_neighbors': 9}
************************
KNN classification report
              precision    recall  f1-score   support

           0       0.13      0.75      0.22      3211
           1       0.00      0.00      0.00      3187
           2       0.12      0.13      0.12      3091
           3       0.00      0.00      0.00      3120
           4       0.12      0.12      0.12      3165
           5       0.00      0.00      0.00      3081
           6       0.00      0.00      0.00      3095
           7       0.00      0.00      0.00      3050

    accuracy                           0.13     25000
   macro avg       0.05      0.12      0.06     25000
weighted avg       0.05      0.13      0.06     25000

************************
************************
************************






************************
Decision tree best params: {'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2}
************************
Decision tree feature importances: [0.         0.         0.15117076 0.32653208 0.12392463 0.16237168
 0.23600085]
************************
************************
Decision tree classification report
              precision    recall  f1-score   support

           0       0.13      0.13      0.13      3211
           1       0.13      0.50      0.20      3187
           2       0.12      0.24      0.16      3091
           3       0.13      0.12      0.13      3120
           4       0.00      0.00      0.00      3165
           5       0.00      0.00      0.00      3081
           6       0.00      0.00      0.00      3095
           7       0.00      0.00      0.00      3050

    accuracy                           0.13     25000
   macro avg       0.06      0.12      0.08     25000
weighted avg       0.06      0.13      0.





************************
Random forest best params: {'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 3, 'min_samples_split': 5, 'n_estimators': 60}
************************
Random forest feature importances: [0.09963176 0.12162092 0.09803025 0.19589559 0.1612189  0.15898144
 0.16462114]
************************
Random forest classification report
              precision    recall  f1-score   support

           0       0.13      0.13      0.13      3211
           1       0.13      0.26      0.17      3187
           2       0.12      0.50      0.20      3091
           3       0.13      0.12      0.13      3120
           4       0.00      0.00      0.00      3165
           5       0.00      0.00      0.00      3081
           6       0.00      0.00      0.00      3095
           7       0.00      0.00      0.00      3050

    accuracy                           0.13     25000
   macro avg       0.06      0.13      0.08     25000
weighted avg       0.06      0.13      0.08   





************************
Support vector machine best params: {'kernel': 'linear'}
************************
Support vector machine classification report
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      3211
           1       0.13      0.26      0.17      3187
           2       0.12      0.37      0.18      3091
           3       0.13      0.12      0.13      3120
           4       0.12      0.12      0.12      3165
           5       0.13      0.13      0.13      3081
           6       0.00      0.00      0.00      3095
           7       0.00      0.00      0.00      3050

    accuracy                           0.13     25000
   macro avg       0.08      0.13      0.09     25000
weighted avg       0.08      0.13      0.09     25000

************************
************************
************************
{'Logistic regression': 0.12516, 'KNN': 0.12736, 'Decision Tree': 0.12592, 'Random Forest': 0.12644, 'Naive Bayes': 0.1227





In [35]:
scores_df_50 = pd.DataFrame(scores_50_percent_train_size.items(), columns=['Model', 'Score'])
px.bar(scores_df_50, x='Model', y='Score', title='Classification models scores, 50k samples, train size=0.5')

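### This was the second attempt to train a model capable of predicting medicine categories. Unfortunately, it failed again.

Every classifier reaches an accuracy of about 0.13 at train sizes of 0.1, 0.25 and 0.5. With 8 almost equally frequent categories, random guessing gives 1/8 = 0.125, so none of the models learned anything useful from these features.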
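To put the scores in context, the sketch below compares them with two trivial baselines: guessing uniformly among the 8 classes, and always predicting the most frequent class. The supports are copied from the classification reports of the 50% split and the accuracies from the printed score dictionary; entries that are cut off in the output are left out. The variable names are only illustrative.

```python
# Per-class test-set supports from the classification reports (train size = 0.5)
supports = [3211, 3187, 3091, 3120, 3165, 3081, 3095, 3050]

# Accuracies from the printed score dictionary (truncated entries omitted)
scores = {
    'Logistic regression': 0.12516,
    'KNN': 0.12736,
    'Decision Tree': 0.12592,
    'Random Forest': 0.12644,
}

n_test = sum(supports)
uniform_chance = 1 / len(supports)          # random guess among 8 classes
majority_baseline = max(supports) / n_test  # always predict the largest class

print(f"uniform chance:    {uniform_chance:.5f}")
print(f"majority baseline: {majority_baseline:.5f}")

# Flag which models do better than always predicting the largest class
beats_baseline = {model: acc > majority_baseline for model, acc in scores.items()}
for model, acc in scores.items():
    print(f"{model}: {acc:.5f}, beats majority baseline: {beats_baseline[model]}")
```

None of the listed models beats the majority-class baseline, which suggests the problem lies in the features (or in the dataset itself) rather than in the choice of classifier or hyperparameters.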