### Problem Statement ###

    Build a classifier to classify product titles. We want to classify an input product title into categories.
    Training data to classify this product into categories is provided with the problem. The categories
    assigned in the training data are the only categories to be considered.
    Also provided is a set of product titles where your classifier should predict the relevant labels. We
    will internally evaluate the classifier on the prediction on the test data set.
    Please provide all the relevant pre-processing, model development and tuning code.
    A few points:
    1. It’s a tab-separated file. The first column is the title, the second column contains labels.
    2. There are multiple labels in the 2nd column, but for this exercise you can consider only one
    label. If you want, you can set it up as a multi-label classification problem, but it's OK if you do
    not.
    3. Simplify the problem if need be.
    4. In evaluation please generate a probability score for each label as well.

### Solution Approach ###

#### Dataset Understanding ####

The training dataset has two columns:
    1. Titles - text description of the product.
    2. Labels - multiple labels separated by tabs.
        2.1 The first value is the **Category** the product belongs to.
        2.2 The values after the first one are **Macro categories** that the product can also be tagged with.

#### Problem Approach ####

Breakdown of the problem:
    1. The first step is to separate the labels into the **Major Category** and the macro categories.
          1.1 The function `separateLabels` in the notebook separates the labels into categories and macro categories.
          
    2. Once the macro labels are separated out, the next step is to convert them into a one-hot encoding. Because a product can be tagged with multiple macro labels, a one-hot representation is the best choice here.
          The function `convertIntoMultiLabel` builds the one-hot representation of the macro labels and appends it to the original dataset.
          
    3. The dataset contains a lot of duplicate records. **All duplicate records were removed.**
            `df.drop_duplicates().reset_index()` does the trick.
            
    4. The major step is to preprocess the input data, i.e. the "titles" column. The text is preprocessed as follows:
        4.1 Convert all text to lower case.
        4.2 Replace the symbols matched by REPLACE_BY_SPACE_RE with a space.
        4.3 Remove the symbols matched by BAD_SYMBOLS_RE.
        4.4 Remove stop words.
      
      After these steps, the text is free of stop words, special characters and numbers. I dropped the numbers because they only represent the volume of the product, and judging from the data, volume does not help to categorize the products.
     
    5. Modeling the data
         5.1 For now, the problem is treated only as a **Multi-Class Text Classification** problem.
         5.2 Check the word vocabulary of the dataset. Our dataset has 20109 unique words.
         5.3 Check the class counts and the class imbalance.
         5.4 Convert the text into vectors (or sequences of integers) and pad the sequences that are too short.
         5.5 Convert the target label **category** into numerical values (label encoding).
         5.6 Split the data into train and test sets.
         5.7 Choose the machine learning algorithm that gives the best model performance.
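
The text-cleaning steps in item 4 can be sketched roughly as follows. This is only an illustration: the names `REPLACE_BY_SPACE_RE` and `BAD_SYMBOLS_RE` come from the notebook, but the regular expressions, the tiny stop-word set and the helper name `clean_title` are assumptions, since the notebook's actual definitions are not shown here.

```python
import re

# Assumed patterns: the names come from the notebook, the regexes are guesses.
REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]|@,;\-]")
BAD_SYMBOLS_RE = re.compile(r"[^a-z #+_]")  # everything else, including digits
# Tiny illustrative stop-word set; the notebook's actual list is not shown.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "with"}


def clean_title(title: str) -> str:
    """Lower-case a title, strip symbols and digits, and drop stop words."""
    text = title.lower()                       # 4.1 lower case
    text = REPLACE_BY_SPACE_RE.sub(" ", text)  # 4.2 separators become spaces
    text = BAD_SYMBOLS_RE.sub("", text)        # 4.3 remove remaining symbols
    words = [w for w in text.split() if w not in STOPWORDS]  # 4.4 stop words
    return " ".join(words)


print(clean_title("Salted Butter (Pasteurised), 500 g - Pack of 2"))
```

Unit fragments such as a lone `g` survive this cleaning; whether to drop very short tokens as well is a separate design choice.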
         

### Dataset Behaviour ###

In our dataset, there are 254 categories as targets, and their proportions are highly imbalanced. For example, the category **Snacks** accounts for 26% of the training data, while the other categories have different shares.

Exactly **108** categories have fewer than 10 records, so the imbalance is huge. Without thinking much further, my first approach was to **capture the features of each category**. To capture the important features per category, I used a **unigram and bigram bag-of-words (BOW) model**.

After building the bag of words, I inspected whether each class has distinctive features that would make the classes linearly separable. The **BOW** output can be seen in the text file **word_grams_to_category.txt**. It shows that some categories have unique keywords that do not appear in any other category. In such a case, the **Multinomial Naive Bayes** algorithm should work well.
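
One possible way to collect the most frequent unigrams and bigrams per category is sketched below. It uses only the standard library; the helper names `ngrams` and `top_ngrams_per_category` and the sample titles are hypothetical, and the notebook may have produced **word_grams_to_category.txt** differently.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams_per_category(titles, categories, k=3):
    """Count unigrams and bigrams per category and keep the k most frequent."""
    counts = defaultdict(Counter)
    for title, category in zip(titles, categories):
        tokens = title.split()
        counts[category].update(ngrams(tokens, 1) + ngrams(tokens, 2))
    return {cat: [g for g, _ in c.most_common(k)] for cat, c in counts.items()}


# Hypothetical, already-cleaned titles.
titles = ["dark chocolate bar", "milk chocolate bar", "basmati rice", "brown rice"]
categories = ["chocolate", "chocolate", "staples", "staples"]
top = top_ngrams_per_category(titles, categories)
for cat, grams in top.items():
    print(cat, "->", grams)
```

Keywords that show up at the top of one category and nowhere else are the kind of class-specific features described above.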

#### First Approach ####

The first approach applies the Multinomial Naive Bayes algorithm. The text is first transformed (i.e. with a `Pipeline` of `CountVectorizer` followed by TF-IDF weighting), and Naive Bayes is then trained on the result.

Naive Bayes treats the words as independent of each other given the class, which is a reasonable fit for this case.
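
A minimal sketch of such a pipeline is shown below, assuming scikit-learn and a handful of made-up titles. The step names and the use of `TfidfTransformer` for the TF-IDF step are assumptions; the notebook's exact configuration may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical, already-cleaned titles and their categories.
train_titles = [
    "dark chocolate bar", "milk chocolate bar", "chocolate wafer",
    "basmati rice", "brown rice", "toor dal",
]
train_labels = ["chocolate", "chocolate", "chocolate", "staples", "staples", "staples"]

nb_model = Pipeline([
    ("counts", CountVectorizer()),   # bag-of-words term counts
    ("tfidf", TfidfTransformer()),   # re-weight the counts by TF-IDF
    ("clf", MultinomialNB()),        # treats words as independent given the class
])
nb_model.fit(train_titles, train_labels)

predictions = nb_model.predict(["white chocolate bar", "sona masoori rice"])
print(list(predictions))
```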

The model reached a weighted-average precision of 0.59 on the test data (accuracy 0.54). However, it fails to identify the categories that have few instances in the dataset.

                                        
                                        precision    recall  f1-score   support

                     branded grocery       0.39      0.83      0.53      4023
                              snacks       0.68      0.45      0.54       912
                           home care       0.65      0.72      0.68      2276
                   breakfast cereals       0.73      0.36      0.48       680
              sweets & confectionery       0.27      0.04      0.07       674
                           cosmetics       0.71      0.25      0.37      1056
                        ready-to-eat       0.00      0.00      0.00        67
                 fruits & vegetables       0.91      0.79      0.84       907
                              fryums       0.00      0.00      0.00         6
                           beverages       0.50      0.58      0.54       790
                        instant food       1.00      0.00      0.01       236
                             staples       0.75      0.54      0.63      1107
                           chocolate       0.62      0.82      0.70       935
                       personal care       0.44      0.92      0.60      2637
                          deodorants       0.70      0.35      0.46       362
                          edible oil       0.58      0.25      0.35       774
                              spices       0.71      0.72      0.71       703
                          healthcare       0.69      0.14      0.23       328
                       otc medicines       1.00      0.03      0.05        77
                           hair care       0.80      0.05      0.10       372
                            biscuits       0.70      0.56      0.63       439
                               dairy       0.69      0.40      0.51       457
                           baby care       0.68      0.44      0.53       386
                              bakery       0.83      0.51      0.63       440
                             spreads       0.86      0.14      0.25       290
                              coffee       0.83      0.54      0.66       298
                 dried fruits & nuts       0.75      0.27      0.39       381
       cleaning agents & accessories       1.00      0.03      0.05       252
                         frozen food       0.00      0.00      0.00        94
                               cream       0.00      0.00      0.00        65
                       ready-to-cook       0.81      0.38      0.51       401
                               soaps       0.89      0.07      0.14       325
                                eggs       0.77      0.90      0.83       448
                            pet care       0.90      0.20      0.32        92
                               water       0.54      0.68      0.60       546
                            lip balm       0.00      0.00      0.00         8
                        cold storage       0.00      0.00      0.00        71
                    detergent powder       0.00      0.00      0.00       114
                  ayurvedic products       0.00      0.00      0.00        49
                              sweets       1.00      0.01      0.01       156
                             pickles       0.78      0.64      0.71       472
                       sanitary pads       0.00      0.00      0.00         6
                                milk       0.00      0.00      0.00        94
                       fresh chicken       0.78      0.33      0.47       129
          frozen fruits & vegetables       1.00      0.05      0.10        92
              stationery accessories       0.00      0.00      0.00         5
                         canned food       0.95      0.21      0.35       183
                               olive       0.00      0.00      0.00         5
                    other condiments       0.00      0.00      0.00        51
                           fragrance       0.00      0.00      0.00        10
                 shaving accessories       0.00      0.00      0.00        90
                          toothpaste       0.00      0.00      0.00         6
                               bread       0.00      0.00      0.00        45
                              squash       0.00      0.00      0.00         6
                       instant mixes       0.00      0.00      0.00        26
               shampoo & conditioner       0.00      0.00      0.00         7
                      sanitary needs       0.00      0.00      0.00        34
                       ground coffee       0.00      0.00      0.00         5
                    organic products       0.00      0.00      0.00        14
                           oral care       0.88      0.04      0.07       179
                               honey       0.00      0.00      0.00         5
                             cashews       0.00      0.00      0.00         5
                       health drinks       0.00      0.00      0.00        72
                             namkeen       0.00      0.00      0.00        17
                              lotion       0.00      0.00      0.00        10
                                salt       0.00      0.00      0.00        55
                      frozen seafood       0.00      0.00      0.00        16
                           nutrition       0.00      0.00      0.00        32
                         conditioner       0.00      0.00      0.00        12
                                 bun       0.00      0.00      0.00         4
                          vegetables       0.00      0.00      0.00        43
                               chips       0.00      0.00      0.00         8
               multi-purpose cleaner       0.00      0.00      0.00         5
                        wafer sticks       0.00      0.00      0.00         7
                            desserts       0.00      0.00      0.00        11
                        fruit juices       0.00      0.00      0.00        10
                         soft drinks       0.00      0.00      0.00         6
                           skin care       0.00      0.00      0.00         9
                           face mask       0.00      0.00      0.00         6
                        basmati rice       0.00      0.00      0.00        10
                           face wash       0.00      0.00      0.00         7
                                cake       0.00      0.00      0.00         9
                         lime pickle       0.00      0.00      0.00         6
                            dog food       0.00      0.00      0.00         4
                       energy drinks       0.00      0.00      0.00         6
                              yogurt       0.00      0.00      0.00         4
                          vermicelli       0.00      0.00      0.00         5
                  hair removal cream       0.00      0.00      0.00         7
                       shaving cream       0.00      0.00      0.00        10
                           pain balm       0.00      0.00      0.00         5
                       cooking pasta       0.00      0.00      0.00         6
                           olive oil       0.00      0.00      0.00         6
                     dishwash liquid       0.00      0.00      0.00         6
               agricultural products       0.00      0.00      0.00         5
                          hair color       0.00      0.00      0.00         7
                           body wash       0.00      0.00      0.00         6
                               brush       0.00      0.00      0.00         4
                                rusk       0.00      0.00      0.00         4
                         after shave       0.00      0.00      0.00         5
                       air freshener       0.00      0.00      0.00         4
                     dishwash powder       0.00      0.00      0.00         4
                        garbage bags       0.00      0.00      0.00         5
                            pastries       0.00      0.00      0.00        21
                              pulses       0.00      0.00      0.00        13
                          mayonnaise       0.00      0.00      0.00         6
                              fruits       0.00      0.00      0.00        20
                        concentrates       0.00      0.00      0.00         6
                              blades       0.00      0.00      0.00        12
                        dishwash bar       0.00      0.00      0.00         4
                         chewing gum       0.00      0.00      0.00         8
                             popcorn       0.00      0.00      0.00         7
                       floor cleaner       0.00      0.00      0.00         8
                              muesli       0.00      0.00      0.00         7
                         fabric care       0.00      0.00      0.00         5
                           baby food       0.00      0.00      0.00        19
                         food colour       0.00      0.00      0.00         4
                                ghee       0.00      0.00      0.00         6
                             almonds       0.00      0.00      0.00         4
                                 tea       0.00      0.00      0.00        21
                              chikki       0.00      0.00      0.00         7
                      cream biscuits       0.00      0.00      0.00         6
                        mango pickle       0.00      0.00      0.00         6
                        whole spices       0.00      0.00      0.00         8
                       cooking paste       0.00      0.00      0.00         7
                      toilet cleaner       0.00      0.00      0.00         4
                                 jam       0.00      0.00      0.00         6
                            veg soup       0.00      0.00      0.00         6
                             perfume       0.00      0.00      0.00         6
                               papad       0.00      0.00      0.00         4
                           baby soap       0.00      0.00      0.00         5
                             noodles       0.00      0.00      0.00        19
                      instant coffee       0.00      0.00      0.00         6
   pharmaceutical & Medical Supplies       0.00      0.00      0.00         6
                    toilet freshener       0.00      0.00      0.00         4
                            cat food       0.00      0.00      0.00         4
                              cheese       0.00      0.00      0.00         7
                        insecticides       0.00      0.00      0.00         7
                         corn flakes       0.00      0.00      0.00         5
                   car air freshener       0.00      0.00      0.00         6
                                oats       0.00      0.00      0.00         7
               baby care accessories       0.00      0.00      0.00         5
                                atta       0.00      0.00      0.00        16
                             cookies       0.00      0.00      0.00         5
                               dates       0.00      0.00      0.00         4
                          toothbrush       0.00      0.00      0.00         6
                         masala nuts       0.00      0.00      0.00         7

                            accuracy                           0.54     26511
                           macro avg       0.19      0.10      0.11     26511
                        weighted avg       0.59      0.54      0.49     26511


For some classes, the model cannot classify a single instance correctly. At first, I thought these minority classes were hurting the overall performance, so I removed the classes with fewer than 10 records and applied the same model again. No improvement. I WAS WRONG :|
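
The filtering step described above could look roughly like this. It is a dependency-free sketch; the helper name `drop_rare_classes` and the sample data are hypothetical.

```python
from collections import Counter


def drop_rare_classes(titles, labels, min_count=10):
    """Keep only the rows whose label occurs at least min_count times."""
    counts = Counter(labels)
    kept = [(t, y) for t, y in zip(titles, labels) if counts[y] >= min_count]
    kept_titles = [t for t, _ in kept]
    kept_labels = [y for _, y in kept]
    return kept_titles, kept_labels


# Hypothetical data: 12 "snacks" rows and 2 "olive" rows.
titles = [f"snack {i}" for i in range(12)] + ["olive jar", "olive tin"]
labels = ["snacks"] * 12 + ["olive"] * 2
kept_titles, kept_labels = drop_rare_classes(titles, labels)
print(Counter(kept_labels))
```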

#### Second Approach ####

In the first attempt, I trained the model without n-gram features. This time, I added n-grams as features and trained an SGD classifier on the transformed text. I noticed a small improvement:
accuracy rose from 0.54 to 0.65, and weighted-average precision from 0.59 to 0.64.
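
As a rough illustration of this setup, the sketch below feeds unigram and bigram TF-IDF features into scikit-learn's `SGDClassifier`. The sample titles, the `ngram_range=(1, 2)` setting and the default hinge loss are assumptions rather than the notebook's confirmed configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical, already-cleaned titles and their categories.
train_titles = [
    "dark chocolate bar", "milk chocolate bar", "chocolate wafer",
    "basmati rice", "brown rice", "toor dal",
]
train_labels = ["chocolate", "chocolate", "chocolate", "staples", "staples", "staples"]

sgd_model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams and bigrams
    ("clf", SGDClassifier(random_state=42)),         # linear model, hinge loss by default
])
sgd_model.fit(train_titles, train_labels)

vocabulary = sgd_model.named_steps["tfidf"].vocabulary_
sgd_predictions = sgd_model.predict(["white chocolate bar", "sona masoori rice"])
print(len(vocabulary), list(sgd_predictions))
```
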

                                         precision    recall  f1-score   support

                     branded grocery       0.54      0.80      0.65      4023
                              snacks       0.77      0.54      0.63       912
                           home care       0.70      0.73      0.72      2276
                   breakfast cereals       0.71      0.63      0.67       680
              sweets & confectionery       0.46      0.13      0.20       674
                           cosmetics       0.70      0.51      0.59      1056
                        ready-to-eat       1.00      0.06      0.11        67
                 fruits & vegetables       0.87      0.87      0.87       907
                              fryums       1.00      0.17      0.29         6
                           beverages       0.56      0.66      0.61       790
                        instant food       0.48      0.06      0.11       236
                             staples       0.76      0.62      0.68      1107
                           chocolate       0.68      0.82      0.75       935
                       personal care       0.59      0.82      0.69      2637
                          deodorants       0.70      0.79      0.74       362
                          edible oil       0.65      0.40      0.50       774
                              spices       0.63      0.87      0.73       703
                          healthcare       0.60      0.31      0.41       328
                       otc medicines       0.43      0.13      0.20        77
                           hair care       0.57      0.50      0.53       372
                            biscuits       0.68      0.71      0.70       439
                               dairy       0.62      0.63      0.63       457
                           baby care       0.68      0.75      0.71       386
                              bakery       0.76      0.76      0.76       440
                             spreads       0.71      0.54      0.62       290
                              coffee       0.73      0.82      0.77       298
                 dried fruits & nuts       0.65      0.52      0.58       381
       cleaning agents & accessories       0.90      0.25      0.40       252
                         frozen food       0.54      0.16      0.25        94
                               cream       0.00      0.00      0.00        65
                       ready-to-cook       0.69      0.67      0.68       401
                               soaps       0.75      0.73      0.74       325
                                eggs       0.79      0.95      0.86       448
                            pet care       0.84      0.58      0.68        92
                               water       0.60      0.86      0.70       546
                            lip balm       0.00      0.00      0.00         8
                        cold storage       0.79      0.15      0.26        71
                    detergent powder       0.84      0.18      0.30       114
                  ayurvedic products       0.33      0.04      0.07        49
                              sweets       0.59      0.27      0.37       156
                             pickles       0.78      0.85      0.82       472
                       sanitary pads       0.00      0.00      0.00         6
                                milk       0.53      0.49      0.51        94
                       fresh chicken       0.71      0.51      0.59       129
          frozen fruits & vegetables       0.74      0.37      0.49        92
              stationery accessories       0.00      0.00      0.00         5
                         canned food       0.81      0.50      0.61       183
                               olive       0.00      0.00      0.00         5
                    other condiments       0.20      0.02      0.04        51
                           fragrance       0.00      0.00      0.00        10
                 shaving accessories       0.77      0.38      0.51        90
                          toothpaste       0.00      0.00      0.00         6
                               bread       1.00      0.27      0.42        45
                              squash       0.00      0.00      0.00         6
                       instant mixes       0.00      0.00      0.00        26
               shampoo & conditioner       0.00      0.00      0.00         7
                      sanitary needs       0.25      0.06      0.10        34
                       ground coffee       0.00      0.00      0.00         5
                    organic products       0.18      0.14      0.16        14
                           oral care       0.78      0.52      0.62       179
                               honey       0.00      0.00      0.00         5
                             cashews       0.00      0.00      0.00         5
                       health drinks       0.81      0.31      0.44        72
                             namkeen       0.00      0.00      0.00        17
                              lotion       0.00      0.00      0.00        10
                                salt       0.62      0.38      0.47        55
                      frozen seafood       1.00      0.19      0.32        16
                           nutrition       1.00      0.09      0.17        32
                         conditioner       0.00      0.00      0.00        12
                                 bun       0.00      0.00      0.00         4
                          vegetables       1.00      0.14      0.24        43
                               chips       0.00      0.00      0.00         8
               multi-purpose cleaner       0.00      0.00      0.00         5
                        wafer sticks       0.86      0.86      0.86         7
                            desserts       0.00      0.00      0.00        11
                        fruit juices       0.00      0.00      0.00        10
                         soft drinks       0.00      0.00      0.00         6
                           skin care       0.00      0.00      0.00         9
                           face mask       0.00      0.00      0.00         6
                        basmati rice       0.00      0.00      0.00        10
                           face wash       0.00      0.00      0.00         7
                                cake       1.00      0.11      0.20         9
                         lime pickle       0.00      0.00      0.00         6
                            dog food       0.00      0.00      0.00         4
                       energy drinks       0.00      0.00      0.00         6
                              yogurt       0.00      0.00      0.00         4
                          vermicelli       0.00      0.00      0.00         5
                  hair removal cream       0.00      0.00      0.00         7
                       shaving cream       0.00      0.00      0.00        10
                           pain balm       0.00      0.00      0.00         5
                       cooking pasta       0.00      0.00      0.00         6
                           olive oil       0.00      0.00      0.00         6
                     dishwash liquid       0.00      0.00      0.00         6
               agricultural products       1.00      0.20      0.33         5
                          hair color       0.00      0.00      0.00         7
                           body wash       0.00      0.00      0.00         6
                               brush       0.00      0.00      0.00         4
                                rusk       0.00      0.00      0.00         4
                         after shave       0.00      0.00      0.00         5
                       air freshener       0.00      0.00      0.00         4
                     dishwash powder       0.00      0.00      0.00         4
                        garbage bags       0.00      0.00      0.00         5
                            pastries       1.00      0.43      0.60        21
                              pulses       0.00      0.00      0.00        13
                          mayonnaise       0.00      0.00      0.00         6
                              fruits       0.00      0.00      0.00        20
                        concentrates       0.00      0.00      0.00         6
                              blades       1.00      0.08      0.15        12
                        dishwash bar       0.00      0.00      0.00         4
                         chewing gum       0.00      0.00      0.00         8
                             popcorn       1.00      0.43      0.60         7
                       floor cleaner       0.00      0.00      0.00         8
                              muesli       0.00      0.00      0.00         7
                         fabric care       0.00      0.00      0.00         5
                           baby food       1.00      0.05      0.10        19
                         food colour       1.00      0.50      0.67         4
                                ghee       0.00      0.00      0.00         6
                             almonds       0.00      0.00      0.00         4
                                 tea       0.00      0.00      0.00        21
                              chikki       1.00      0.14      0.25         7
                      cream biscuits       0.00      0.00      0.00         6
                        mango pickle       0.00      0.00      0.00         6
                        whole spices       0.00      0.00      0.00         8
                       cooking paste       0.00      0.00      0.00         7
                      toilet cleaner       0.00      0.00      0.00         4
                                 jam       0.00      0.00      0.00         6
                            veg soup       0.00      0.00      0.00         6
                             perfume       1.00      0.17      0.29         6
                               papad       0.00      0.00      0.00         4
                           baby soap       0.00      0.00      0.00         5
                             noodles       0.00      0.00      0.00        19
                      instant coffee       0.00      0.00      0.00         6
pharmaceuticals and medical supplies       0.00      0.00      0.00         6
                    toilet freshener       0.00      0.00      0.00         4
                            cat food       1.00      1.00      1.00         4
                              cheese       0.00      0.00      0.00         7
                        insecticides       0.00      0.00      0.00         7
                         corn flakes       0.00      0.00      0.00         5
                   car air freshener       0.00      0.00      0.00         6
                                oats       0.00      0.00      0.00         7
               baby care accessories       0.00      0.00      0.00         5
                                atta       0.00      0.00      0.00        16
                             cookies       0.00      0.00      0.00         5
                               dates       0.00      0.00      0.00         4
                          toothbrush       0.00      0.00      0.00         6
                         masala nuts       0.00      0.00      0.00         7

                            accuracy                           0.65     26511
                           macro avg       0.33      0.20      0.22     26511
                        weighted avg       0.64      0.65      0.62     26511


Compared to the previous model, the second one captures some categories that the previous approach failed to capture, for example the **cat food** category.

### Tried Solutions Without Major Improvement ###

The SGD classifier works well on categories that have enough instances for the model to learn from. To improve further, I tried **auto weights**, where the model automatically assigns larger weights to the minority classes before training. I still don't know exactly how the sklearn wrapper assigns weights across this many classes, and I am reading up on the concept to understand it better.
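
For reference, if "auto weights" refers to scikit-learn's `class_weight="balanced"` option (an assumption), each class is weighted by `n_samples / (n_classes * class_count)`, so rare classes receive proportionally larger weights regardless of how many classes there are. The sketch below reproduces that heuristic in plain Python; the helper name `balanced_class_weights` and the sample labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: a majority class and a minority class.
labels = ["snacks"] * 8 + ["olive"] * 2
weights = balanced_class_weights(labels)
print(weights)
```

A class with only a few records therefore gets a large weight, which can make the model chase a handful of possibly noisy examples.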

Resampling the dataset is another option. However, with this many classes I am not sure how the resampling would behave. I have previously tried sampling on a multi-class problem (N=3) and saw improvements in model performance.
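
One simple form of resampling is random oversampling, where rows of small classes are duplicated until each class reaches a target size. The sketch below is an illustration under that assumption, not the approach used in the notebook; the helper name `oversample` and the `target` parameter are hypothetical.

```python
import random
from collections import Counter, defaultdict


def oversample(titles, labels, target, seed=0):
    """Randomly duplicate rows of classes that have fewer than `target` rows."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for title, label in zip(titles, labels):
        by_class[label].append(title)
    out_titles, out_labels = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(max(0, target - len(rows)))]
        out_titles.extend(rows + extra)
        out_labels.extend([label] * (len(rows) + len(extra)))
    return out_titles, out_labels


# Hypothetical data: 6 "snacks" rows and 2 "olive" rows.
titles = [f"snack {i}" for i in range(6)] + ["olive jar", "olive tin"]
labels = ["snacks"] * 6 + ["olive"] * 2
new_titles, new_labels = oversample(titles, labels, target=5)
print(Counter(new_labels))
```

Oversampling should be applied to the training split only, after the train/test split, so that duplicated rows do not leak into the test set.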

Tried SVM. It takes a very long time to run on this dataset. In general, kernel SVMs do not scale well to a large number of classes and samples: a one-vs-one scheme trains N(N-1)/2 binary classifiers for N classes (one-vs-rest trains N), each in an M-dimensional feature space.

Tried a tree-based algorithm (XGBoost) on this dataset, but the results were not good: the model started overfitting.

Tried an LSTM network, but the model started overfitting as well.

### Environment ###

All of the work was done in **Colab**. I first trained several models in my local environment, but the kernel died partway through, so I switched to Colab.

### Next Steps ###

NOTE - In the given time, I managed to implement all of the algorithms above and achieved small improvements to the classifier, but I could not complete the entire solution. Once a model with stable performance is found, predicting the evaluation instances with a probability score should be straightforward.
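
As a sketch of what that last step might look like, the example below returns a probability for every label using `predict_proba`. It assumes scikit-learn and uses a Naive Bayes pipeline as a stand-in, because an `SGDClassifier` with the default hinge loss does not provide `predict_proba`. The sample data and the helper name `label_probabilities` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned training titles and categories.
train_titles = [
    "dark chocolate bar", "milk chocolate bar",
    "basmati rice", "brown rice",
    "green tea", "masala tea",
]
train_labels = ["chocolate", "chocolate", "staples", "staples", "beverages", "beverages"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_titles, train_labels)


def label_probabilities(fitted_model, title):
    """Return {label: probability} for one title, most likely label first."""
    probs = fitted_model.predict_proba([title])[0]
    ranked = sorted(zip(fitted_model.classes_, probs), key=lambda p: p[1], reverse=True)
    return {str(label): float(prob) for label, prob in ranked}


scores = label_probabilities(model, "white chocolate bar")
for label, prob in scores.items():
    print(f"{label}\t{prob:.3f}")
```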

**K Nearest Neighbors** - Since each category has somewhat different features, I think the KNN algorithm might work in this case. One caveat: KNN is a distance-based algorithm, and as the dimensionality increases, it might overfit.
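
A first experiment along these lines could look like the sketch below, which pairs TF-IDF features with scikit-learn's `KNeighborsClassifier` and cosine distance. The sample titles, `n_neighbors=3` and the cosine metric are assumptions chosen for illustration; none of this has been run on the actual dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned titles and their categories.
train_titles = [
    "dark chocolate bar", "milk chocolate bar", "chocolate wafer",
    "basmati rice", "brown rice", "toor dal",
]
train_labels = ["chocolate", "chocolate", "chocolate", "staples", "staples", "staples"]

knn_model = make_pipeline(
    TfidfVectorizer(),
    # Cosine distance compares the direction of TF-IDF vectors, not their length.
    KNeighborsClassifier(n_neighbors=3, metric="cosine"),
)
knn_model.fit(train_titles, train_labels)

knn_predictions = knn_model.predict(["white chocolate bar", "sona masoori rice"])
print(list(knn_predictions))
```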

Read papers related to this problem and implement the most promising ideas.