# Building the text classifier

In this phase, we focus on NLP processing and vectorization to prepare the cleaned data for classification. We start by transforming the text into a format suitable for machine learning algorithms. The key steps are tokenization, stop-word removal, lemmatization, and TF-IDF feature extraction. We then train conventional and ensemble classification models to learn patterns in the text, and compare their accuracy. Finally, the trained model is serialized with pickle so that it can be reused and deployed later.


In [2]:
#importing basic package
import pandas as pd

In [3]:
# loading the dataset
data = pd.read_csv('filtered_data.csv')
data.head()

Unnamed: 0,Text,Classification,Filtered_Text
0,i didnt feel humiliated,sadness,'i didnt feel humiliated'
1,i can go from feeling so hopeless to so damned...,sadness,'i can go from feeling so hopeless to so damne...
2,im grabbing a minute to post i feel greedy wrong,anger,'im grabbing a minute to post i feel greedy wr...
3,i am ever feeling nostalgic about the fireplac...,love,'i am ever feeling nostalgic about the firepla...
4,i am feeling grouchy,anger,'i am feeling grouchy'


In [4]:
#seeing the type of labels and their count
data['Classification'].value_counts()

joy         6041
sadness     5243
anger       2429
fear        2157
love        1456
surprise     632
Name: Classification, dtype: int64

In [5]:
#function to remove single quotes
def remove_single_quotes(dataset, column_name):
    dataset[column_name] = dataset[column_name].str.replace("'", "")
    return dataset

In [6]:
#removing unnecessary column 
data=data.drop(['Text'],axis=1)
data.head()



Unnamed: 0,Classification,Filtered_Text
0,sadness,'i didnt feel humiliated'
1,sadness,'i can go from feeling so hopeless to so damne...
2,anger,'im grabbing a minute to post i feel greedy wr...
3,love,'i am ever feeling nostalgic about the firepla...
4,anger,'i am feeling grouchy'


In [7]:
#cleaning the text 
data = remove_single_quotes(data, 'Filtered_Text')
data.head()

Unnamed: 0,Classification,Filtered_Text
0,sadness,i didnt feel humiliated
1,sadness,i can go from feeling so hopeless to so damned...
2,anger,im grabbing a minute to post i feel greedy wrong
3,love,i am ever feeling nostalgic about the fireplac...
4,anger,i am feeling grouchy


In [8]:
# Rearranging the columns so that text --> Labels
data = data[['Filtered_Text', 'Classification']]
data.head()

Unnamed: 0,Filtered_Text,Classification
0,i didnt feel humiliated,sadness
1,i can go from feeling so hopeless to so damned...,sadness
2,im grabbing a minute to post i feel greedy wrong,anger
3,i am ever feeling nostalgic about the fireplac...,love
4,i am feeling grouchy,anger


**Bonus task: Marking incomplete/grammatically incorrect sentences**

We will use the spaCy package to analyze each sentence and flag those that appear incomplete or grammatically incorrect.

*Note: This approach relies on pre-trained language models and may not catch all nuances of grammatical correctness.*

In [9]:
#getting spacy
import spacy
from spacy.tokens import Doc

In [10]:
# Loading spaCy English language model
nlp = spacy.load("en_core_web_sm")

In [11]:
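# Function to check grammatical correctness
# Note: has_vector only reports whether spaCy has a vector for each sentence span;
# it does not analyze the grammar of the sentence
def is_grammatically_correct(text):
    doc = nlp(text)
    return all(sent.has_vector for sent in doc.sents)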

In [12]:
# Splitting the records into grammatically correct and incorrect sets

def segregate_grammatically_correct(data, text_column):
    # Run the (slow) spaCy check once and reuse the boolean mask
    is_correct = data[text_column].apply(is_grammatically_correct)
    grammatically_correct = data[is_correct]
    grammatically_incorrect = data[~is_correct]

    return grammatically_correct, grammatically_incorrect

In [13]:
#applying the function and segregating 
grammatically_correct, grammatically_incorrect = segregate_grammatically_correct(data, 'Filtered_Text')

print("Grammatically Correct:")
print(grammatically_correct)

print("\nGrammatically Incorrect:")
print(grammatically_incorrect)

Grammatically Correct:
                                           Filtered_Text Classification
0                                i didnt feel humiliated        sadness
1      i can go from feeling so hopeless to so damned...        sadness
2       im grabbing a minute to post i feel greedy wrong          anger
3      i am ever feeling nostalgic about the fireplac...           love
4                                   i am feeling grouchy          anger
...                                                  ...            ...
17953  i just keep feeling like someone is being unki...          anger
17954  im feeling a little cranky negative after this...          anger
17955  i feel that i am useful to my people and that ...            joy
17956  im feeling more comfortable with derby i feel ...            joy
17957  i feel all weird when i have to meet w people ...           fear

[17958 rows x 2 columns]

Grammatically Incorrect:
Empty DataFrame
Columns: [Filtered_Text, Classification]
Inde

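*Remarks: The check flags no text in the dataset as grammatically incorrect. Note that `has_vector` only tests whether spaCy could produce a vector for each sentence, so this result does not by itself confirm that every sentence is grammatical.*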

In [14]:
#importing data pre-processing packages
import re
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# NLTK resources needed below (no-op if they are already downloaded)
nltk.download('stopwords')
nltk.download('wordnet')

In [15]:
# Loading spaCy English language model
nlp = spacy.load("en_core_web_sm")

In [16]:
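# Function for text preprocessing
# Build the stop-word set and the lemmatizer once instead of once per row
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_dataset(dataset, text_column):

    def preprocess_text(text):
        # Tokenize with spaCy
        doc = nlp(text)
        tokens = [token.text for token in doc]

        # Drop English stop words (case-insensitive)
        tokens = [word for word in tokens if word.lower() not in stop_words]

        # Lemmatize the remaining tokens with WordNet
        tokens = [lemmatizer.lemmatize(word) for word in tokens]

        return ' '.join(tokens)

    # Note: the column is overwritten in place, so the DataFrame passed in is modified too
    dataset[text_column] = dataset[text_column].apply(preprocess_text)

    return dataset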


In [17]:
# applying the function
preprocessed_dataset = preprocess_dataset(data, 'Filtered_Text')
preprocessed_dataset.head()

Unnamed: 0,Filtered_Text,Classification
0,nt feel humiliated,sadness
1,go feeling hopeless damned hopeful around some...,sadness
2,grabbing minute post feel greedy wrong,anger
3,ever feeling nostalgic fireplace know still pr...,love
4,feeling grouchy,anger
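
The first row above shows a side effect of stop-word removal: `i didnt feel humiliated` becomes `nt feel humiliated`, because `i` and `did` are both in NLTK's English stop-word list (as are negations such as `not`). For emotion classification this can blur or remove negation. The sketch below reproduces the effect with a small hand-written stop list and pre-split tokens; `STOP_WORDS`, `filter_stop_words` and the token split are illustrative assumptions, not the notebook's actual objects.

```python
# Illustrative subset of a stop-word list (assumption; NLTK's real list is much longer).
STOP_WORDS = {"i", "did", "not", "am", "so", "to"}

def filter_stop_words(tokens):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

# Assumed tokenization of "i didnt feel humiliated" (the tokenizer splits "didnt").
tokens = ["i", "did", "nt", "feel", "humiliated"]
filtered = filter_stop_words(tokens)
print(" ".join(filtered))  # nt feel humiliated
```

If negation turns out to matter for the labels, one option is to take words such as `not` and `no` out of the stop-word set before filtering.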


In [18]:
# getting the preprocessed text
preprocessed_text = preprocessed_dataset['Filtered_Text']
preprocessed_text.head()

0                                   nt feel humiliated
1    go feeling hopeless damned hopeful around some...
2               grabbing minute post feel greedy wrong
3    ever feeling nostalgic fireplace know still pr...
4                                      feeling grouchy
Name: Filtered_Text, dtype: object

In [19]:
# Initializing the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

In [20]:
# Fitting and transforming the preprocessed text
tfidf_matrix = tfidf_vectorizer.fit_transform(preprocessed_text)
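
To make the numbers in the TF-IDF matrix less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper `tfidf` and the three sample documents are illustrative assumptions; the sketch splits on whitespace, whereas scikit-learn uses its own token pattern, and it ignores `max_features`.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = ["feeling grouchy", "feeling hopeless", "feel humiliated"]
vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    weights = {tok: round(w, 3) for tok, w in zip(vocab, row) if w > 0}
    print(doc, "->", weights)
```

In the first document, `grouchy` receives a larger weight than `feeling` because it occurs in only one document, which is the behaviour that makes rare, emotion-specific words stand out.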


In [21]:
# Converting the TF-IDF matrix to a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())


print(tfidf_df.head())

    aa  abandon  abandoned  abandoning  abandonment  abc  abdomen  abide  \
0  0.0      0.0        0.0         0.0          0.0  0.0      0.0    0.0   
1  0.0      0.0        0.0         0.0          0.0  0.0      0.0    0.0   
2  0.0      0.0        0.0         0.0          0.0  0.0      0.0    0.0   
3  0.0      0.0        0.0         0.0          0.0  0.0      0.0    0.0   
4  0.0      0.0        0.0         0.0          0.0  0.0      0.0    0.0   

   ability  abit  ...  youthful  youtube  yuuki  zach  zealand  zero  zombie  \
0      0.0   0.0  ...       0.0      0.0    0.0   0.0      0.0   0.0     0.0   
1      0.0   0.0  ...       0.0      0.0    0.0   0.0      0.0   0.0     0.0   
2      0.0   0.0  ...       0.0      0.0    0.0   0.0      0.0   0.0     0.0   
3      0.0   0.0  ...       0.0      0.0    0.0   0.0      0.0   0.0     0.0   
4      0.0   0.0  ...       0.0      0.0    0.0   0.0      0.0   0.0     0.0   

   zone  zoom  zumba  
0   0.0   0.0    0.0  
1   0.0   0.0   
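
Calling `toarray()` converts the sparse TF-IDF matrix into a dense one, which stores every zero explicitly. A rough size estimate, assuming 17,958 rows (the row count printed earlier), all 5,000 features and 8-byte floats:

```python
# Back-of-the-envelope memory estimate for the dense TF-IDF table.
n_rows = 17958        # rows in the dataset, as printed earlier
n_cols = 5000         # max_features passed to TfidfVectorizer
bytes_per_value = 8   # assumption: float64 values

dense_bytes = n_rows * n_cols * bytes_per_value
print(f"{dense_bytes / 1e6:.0f} MB")  # 718 MB
```

Most rows contain only a handful of non-zero entries, so keeping the matrix sparse is far cheaper; scikit-learn estimators generally accept the sparse matrix directly.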

In [22]:
#importing model building/testing tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

In [23]:
# Append the TF-IDF features to the main dataset
data = pd.concat([data, tfidf_df], axis=1)
data.head()

Unnamed: 0,Filtered_Text,Classification,aa,abandon,abandoned,abandoning,abandonment,abc,abdomen,abide,...,youthful,youtube,yuuki,zach,zealand,zero,zombie,zone,zoom,zumba
0,nt feel humiliated,sadness,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,go feeling hopeless damned hopeful around some...,sadness,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,grabbing minute post feel greedy wrong,anger,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,ever feeling nostalgic fireplace know still pr...,love,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,feeling grouchy,anger,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
# Encoding the target variable
label_encoder = LabelEncoder()
data['target_class_encoded'] = label_encoder.fit_transform(data['Classification'])
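
`LabelEncoder` assigns integer codes in sorted order of the class names, which is what the numeric labels in the classification reports later in this notebook refer to. A small pure-Python sketch of that mapping (the names `classes` and `label_to_code` are illustrative):

```python
# The six emotion labels found in the 'Classification' column.
labels = ["joy", "sadness", "anger", "fear", "love", "surprise"]

# LabelEncoder sorts the unique labels and numbers them from 0.
classes = sorted(set(labels))
label_to_code = {label: code for code, label in enumerate(classes)}

for label, code in label_to_code.items():
    print(code, label)
```

With the fitted encoder itself, `label_encoder.classes_` gives the same ordering and `label_encoder.inverse_transform` maps predicted codes back to emotion names.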

In [25]:
# Defining features (X) and target variable (y)
X = data.drop(['target_class_encoded', 'Filtered_Text','Classification'], axis=1)  
y = data['target_class_encoded']

In [26]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
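
Because the classes are imbalanced (632 `surprise` rows against 6,041 `joy` rows), a purely random split can leave the rare classes slightly over- or under-represented in the test set. `train_test_split` accepts a `stratify=y` argument to keep class proportions equal in both parts. The sketch below shows the idea in plain Python; `stratified_split` and the toy labels are hypothetical and only illustrate the technique.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so each class contributes about test_size of its rows to the test set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy, imbalanced label list (assumption for illustration).
labels = ["joy"] * 60 + ["surprise"] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.2)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```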

In [26]:
# Training and evaluating conventional classifiers
classifiers = {
    'Multinomial Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(),
    'Support Vector Classifier': SVC()
}

for name, classifier in classifiers.items():
    # Fit on the training split
    classifier.fit(X_train, y_train)

    # Predict on the held-out test split
    y_pred = classifier.predict(X_test)

    # Overall accuracy plus per-class precision, recall and F1
    accuracy = accuracy_score(y_test, y_pred)
    classification_rep = classification_report(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {accuracy:.4f}")
    print("Classification Report:\n", classification_rep)

   


Classifier: Multinomial Naive Bayes
Accuracy: 0.7422
Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.56      0.70       474
           1       0.87      0.44      0.58       410
           2       0.67      0.97      0.79      1157
           3       0.95      0.13      0.23       300
           4       0.76      0.94      0.84      1118
           5       1.00      0.02      0.04       133

    accuracy                           0.74      3592
   macro avg       0.86      0.51      0.53      3592
weighted avg       0.79      0.74      0.70      3592



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



Classifier: Logistic Regression
Accuracy: 0.8641
Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.83      0.86       474
           1       0.83      0.76      0.79       410
           2       0.82      0.95      0.88      1157
           3       0.86      0.61      0.71       300
           4       0.90      0.95      0.92      1118
           5       0.91      0.46      0.61       133

    accuracy                           0.86      3592
   macro avg       0.87      0.76      0.80      3592
weighted avg       0.87      0.86      0.86      3592


Classifier: Support Vector Classifier
Accuracy: 0.8583
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.81      0.85       474
           1       0.82      0.76      0.79       410
           2       0.81      0.96      0.88      1157
           3       0.85      0.58      0.69       300
           4       0.91      0.9

*Remark: Of the three conventional models we tried, Logistic Regression achieved the highest accuracy, around 86.4%. We will now try ensemble learning methods to see whether the accuracy can be improved.*
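
Accuracy alone hides how unevenly the classes are handled. Taking the per-class recall and support from the Multinomial Naive Bayes report, the sketch below recomputes the macro average (every class counts equally) and the weighted average (classes count in proportion to their support). The variable names are illustrative; the numbers are copied from the report.

```python
# Per-class recall and support from the Multinomial Naive Bayes report (classes 0-5).
recall = [0.56, 0.44, 0.97, 0.13, 0.94, 0.02]
support = [474, 410, 1157, 300, 1118, 133]

# Macro average: plain mean over classes.
macro_recall = sum(recall) / len(recall)

# Weighted average: each class weighted by its number of test rows.
weighted_recall = sum(r * s for r, s in zip(recall, support)) / sum(support)

print(f"macro recall:    {macro_recall:.2f}")     # 0.51
print(f"weighted recall: {weighted_recall:.2f}")  # 0.74
```

The gap between 0.51 and 0.74 comes from the small classes (3 and 5), which the model almost never predicts, so macro-averaged scores are a useful companion to accuracy on this dataset.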

In [27]:
# Training and evaluating ensemble classifiers
ensemble_classifiers = {
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'XGBoost': XGBClassifier()
}

for name, classifier in ensemble_classifiers.items():
    # Fit on the training split
    classifier.fit(X_train, y_train)

    # Predict on the held-out test split
    y_pred = classifier.predict(X_test)

    # Overall accuracy plus per-class precision, recall and F1
    accuracy = accuracy_score(y_test, y_pred)
    classification_rep = classification_report(y_test, y_pred)

    print(f"\nEnsemble Classifier: {name}")
    print(f"Accuracy: {accuracy:.4f}")
    print("Classification Report:\n", classification_rep)


Ensemble Classifier: Random Forest
Accuracy: 0.8767
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.87      0.87       474
           1       0.79      0.88      0.83       410
           2       0.88      0.91      0.90      1157
           3       0.80      0.74      0.77       300
           4       0.94      0.90      0.92      1118
           5       0.79      0.67      0.73       133

    accuracy                           0.88      3592
   macro avg       0.85      0.83      0.84      3592
weighted avg       0.88      0.88      0.88      3592


Ensemble Classifier: Gradient Boosting
Accuracy: 0.8394
Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.78      0.85       474
           1       0.87      0.73      0.79       410
           2       0.75      0.94      0.83      1157
           3       0.81      0.70      0.75       300
           4       0.95     

*Remarks: The XGBoost classifier gave the highest accuracy so far, 88.4%, a modest improvement over Logistic Regression (86.4%). We will now move on to hyperparameter tuning to see whether it can be improved further.*

In [28]:
#getting the packages to build an optimised ensemble model
from sklearn.experimental import enable_halving_search_cv  # must be imported before HalvingGridSearchCV
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.feature_selection import SelectKBest, chi2
from tensorflow.keras.callbacks import EarlyStopping

In [29]:
# Using chi-squared statistic for important feature selection
k_best = 1500
selector = SelectKBest(chi2, k=k_best)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
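
For intuition, the chi-squared score compares how much of a feature's total weight falls into each class with what would be expected if the feature were independent of the class. The pure-Python sketch below follows that observed-versus-expected computation, which we believe mirrors what scikit-learn's `chi2` does for non-negative features; `chi2_scores`, `select_k_best` and the toy data are illustrative assumptions.

```python
def chi2_scores(X, y):
    """Chi-squared score per feature for non-negative features X and class labels y."""
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    feature_total = [sum(row[f] for row in X) for f in range(n_features)]

    scores = []
    for f in range(n_features):
        score = 0.0
        for c in classes:
            # Observed: feature weight that actually falls in class c.
            observed = sum(row[f] for row, label in zip(X, y) if label == c)
            # Expected: share of the total weight if feature and class were independent.
            class_prob = sum(1 for label in y if label == c) / n_samples
            expected = class_prob * feature_total[f]
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

def select_k_best(scores, k):
    """Indices of the k highest-scoring features, in column order."""
    ranked = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
    return sorted(ranked[:k])

# Feature 0 appears only in class 0; feature 1 appears equally in both classes.
X = [[1.0, 1.0], [1.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
y = [0, 0, 1, 1]
scores = chi2_scores(X, y)
print(scores)                    # [2.0, 0.0]
print(select_k_best(scores, 1))  # [0]
```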

In [30]:
# Defining the XGBClassifier
xgb_classifier = XGBClassifier()



In [37]:
# Defining the hyperparameter grid
param_grid = {
    'learning_rate': [0.05, 0.1, 0.2],
    'n_estimators': [100, 150, 200, 250,300],
    'max_depth': [3, 4, 5],
    'min_child_weight': [0,1, 2, 3],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
    'gamma': [0, 0.1, 0.2],
}
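
It helps to know how large this grid is before searching it. The number of candidate configurations is the product of the list lengths:

```python
import math

# Same grid as defined in the notebook cell.
param_grid = {
    'learning_rate': [0.05, 0.1, 0.2],
    'n_estimators': [100, 150, 200, 250, 300],
    'max_depth': [3, 4, 5],
    'min_child_weight': [0, 1, 2, 3],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
    'gamma': [0, 0.1, 0.2],
}

n_candidates = math.prod(len(values) for values in param_grid.values())
print(n_candidates)  # 4860
```

Fitting 4,860 configurations with full cross-validation would be expensive, which is the motivation for a successive-halving search that discards weak candidates early on small subsets of the data.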

In [38]:
#Building the custom search
from sklearn.base import clone

def custom_grid_search(model, param_grid, X_train, y_train, X_test, y_test, max_iterations=20, improvement_threshold=0.01, baseline_accuracy=0.88):
    best_model = None
    best_accuracy = 0
    no_improvement_count = 0

    # Successive-halving search over the grid, using a fresh copy of the supplied model
    grid_search = HalvingGridSearchCV(estimator=clone(model), param_grid=param_grid, scoring='accuracy', cv=2, n_jobs=-1)
    grid_search.fit(X_train, y_train)

    best_params = grid_search.best_params_

    # Refit the best configuration on the full training set
    best_model = clone(model).set_params(**best_params)
    best_model.fit(X_train, y_train)

    best_accuracy = accuracy_score(y_test, best_model.predict(X_test))

    print(f"Initial Accuracy: {best_accuracy:.4f}")
    print(f"Best Parameters: {best_params}")

    if best_accuracy >= baseline_accuracy:
        print("Baseline accuracy achieved.")
        return best_model, best_accuracy

    # Note: every pass refits the same parameters, so the accuracy only changes
    # if training is non-deterministic
    for iteration in range(1, max_iterations):
        # XGBoost's built-in early stopping, set on the estimator
        # (passing it to fit() is deprecated from XGBoost 1.6 on)
        current_best_model = clone(model).set_params(early_stopping_rounds=5, **best_params)
        # The evaluation set is the training data, so early stopping tracks training loss only
        current_best_model.fit(X_train, y_train, eval_set=[(X_train, y_train)])

        y_pred = current_best_model.predict(X_test)

        current_accuracy = accuracy_score(y_test, y_pred)

        print(f"\nIteration {iteration}/{max_iterations}")
        print(f"Current Accuracy: {current_accuracy:.4f}")
        print(f"Best Parameters: {best_params}")

        if current_accuracy >= baseline_accuracy:
            print("Baseline accuracy achieved.")
            return current_best_model, current_accuracy

        if current_accuracy > best_accuracy + improvement_threshold:
            best_model = current_best_model
            best_accuracy = current_accuracy
            no_improvement_count = 0
        else:
            no_improvement_count += 1

        if no_improvement_count >= 10:
            print(f"\nStopping early as no improvement observed for the last {no_improvement_count} iterations.")
            break

    return best_model, best_accuracy

In [39]:
# Performing the grid search

best_model, best_accuracy = custom_grid_search(xgb_classifier, param_grid, X_train_selected, y_train, X_test_selected, y_test, max_iterations=20, improvement_threshold=0.01, baseline_accuracy=0.88)
print(f"\nBest Model Accuracy: {best_accuracy:.4f}")


3888 fits failed out of a total of 7776.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3888 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\sujoydutta\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\sujoydutta\anaconda3\lib\site-packages\xgboost\core.py", line 620, in inner_f
    return func(**kwargs)
  File "C:\Users\sujoydutta\anaconda3\lib\site-packages\xgboost\sklearn.py", line 1440, in fit
    raise ValueError(
ValueError: Invalid classes inferred from unique values of `y`.  Expected: [0 1 2 3], got [1 2 3 4]
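
The error message suggests that some of the small training subsets used in the early halving rounds did not contain every class: XGBoost expects the labels it sees to be the consecutive integers `0..k-1`, and a subset with classes `[1 2 3 4]` breaks that. The helper below is a hypothetical illustration of that condition, not XGBoost's own code.

```python
def labels_are_contiguous(y_subset):
    """True if the distinct labels are exactly 0, 1, ..., k-1."""
    present = sorted(set(y_subset))
    return present == list(range(len(present)))

full_sample = [0, 1, 2, 3, 4, 5, 2, 4]
small_sample = [1, 2, 3, 4, 2, 1]  # classes 0 and 5 happen to be missing

print(labels_are_contiguous(full_sample))   # True
print(labels_are_contiguous(small_sample))  # False
```

One way to make such failures less likely is to give the first halving round more data through the `min_resources` argument of `HalvingGridSearchCV`, so that each subset has a better chance of containing all six classes.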



Initial Accuracy: 0.8516
Best Parameters: {'colsample_bytree': 0.8, 'gamma': 0.1, 'learning_rate': 0.1, 'max_depth': 3, 'min_child_weight': 3, 'n_estimators': 250, 'subsample': 0.7}




[0]	validation_0-mlogloss:1.75207
[1]	validation_0-mlogloss:1.71673
[2]	validation_0-mlogloss:1.68486
[3]	validation_0-mlogloss:1.65568
[4]	validation_0-mlogloss:1.62860
[5]	validation_0-mlogloss:1.60371
[6]	validation_0-mlogloss:1.58134
[7]	validation_0-mlogloss:1.56046
[8]	validation_0-mlogloss:1.54089
[9]	validation_0-mlogloss:1.52249
[10]	validation_0-mlogloss:1.50558
[11]	validation_0-mlogloss:1.48956
[12]	validation_0-mlogloss:1.47429
[13]	validation_0-mlogloss:1.45984
[14]	validation_0-mlogloss:1.44609
[15]	validation_0-mlogloss:1.43315
[16]	validation_0-mlogloss:1.42071
[17]	validation_0-mlogloss:1.40881
[18]	validation_0-mlogloss:1.39737
[19]	validation_0-mlogloss:1.38649
[20]	validation_0-mlogloss:1.37604
[21]	validation_0-mlogloss:1.36576
[22]	validation_0-mlogloss:1.35608
[23]	validation_0-mlogloss:1.34660
[24]	validation_0-mlogloss:1.33728
[25]	validation_0-mlogloss:1.32856
[26]	validation_0-mlogloss:1.32008
[27]	validation_0-mlogloss:1.31152
[28]	validation_0-mlogloss:1.3




In [66]:

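# Final parameters, based on the grid search (colsample_bytree, learning_rate and
# n_estimators differ from the values the search printed above)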
best_params = {
    'colsample_bytree': 0.9,
    'gamma': 0.1,
    'learning_rate': 0.6,
    'max_depth': 3,
    'min_child_weight': 3,
    'n_estimators': 300,
    'subsample': 0.7
}




In [67]:
# Create the XGBoost classifier with the best parameters
best_xgb_classifier = XGBClassifier(**best_params)


In [68]:
# Fitting the classifier on the training data
best_xgb_classifier.fit(X_train, y_train)

In [69]:
# Making predictions on the test set
y_pred = best_xgb_classifier.predict(X_test)

In [70]:
# Calculating and printing the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.8845


*Note: Despite extensive hyperparameter tuning, the tuned XGBoost model reaches 88.45% accuracy, only a marginal improvement over the untuned XGBoost model (88.4%) and short of the gain we were aiming for. Given the constraints on time and computing resources, we proceed with this model rather than tuning further.*

In [71]:
#getting the pickle package
import pickle

In [72]:
# Saving the model to a file
with open('best_xgb_classifier.pkl', 'wb') as file:
    pickle.dump(best_xgb_classifier, file)

print("Model has been pickled.")

Model has been pickled.
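
To classify new text later, the saved model is not enough on its own: the same fitted TF-IDF vectorizer is needed to turn text into features, and the label encoder to turn predicted codes back into emotion names. One option is to pickle them together. The sketch below shows the round trip with simple stand-in objects and an in-memory buffer; the bundle keys and stand-in values are assumptions for illustration, and in the notebook they would be `tfidf_vectorizer`, `label_encoder` and `best_xgb_classifier` written to a file.

```python
import io
import pickle

# Stand-ins for the fitted objects (assumption: replace with the real ones).
bundle = {
    "vectorizer": {"vocabulary": ["feel", "grouchy", "humiliated"]},
    "label_classes": ["anger", "fear", "joy", "love", "sadness", "surprise"],
    "model": {"name": "stand-in for best_xgb_classifier"},
}

# Serialize to an in-memory buffer; a real script would open a file in 'wb' mode.
buffer = io.BytesIO()
pickle.dump(bundle, buffer)

# Load it back, as a deployment script would do with open(..., 'rb').
buffer.seek(0)
restored = pickle.load(buffer)

print(restored["label_classes"][2])  # joy
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.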
