## Hackathon _ PS _ 3

* Develop a text classification pipeline to identify product categories.
* Companies often face the problem of cataloguing products effectively so that customers can navigate to the product they need. Each class of product is assigned an index that can be used to track its type.

* The task is to classify each product from its description and title and to identify the correct index.

In [49]:
import logging
import pandas as pd
import numpy as np
from numpy import random
import gensim
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
import re
from bs4 import BeautifulSoup




* The data was uploaded to Google Colab because the file is larger than 25 MB and could not be pushed to GitHub. To run the model, upload the file to Colab first.
* Both loading methods (the upload widget and a direct file path) are kept below so that either can be used.

In [50]:
# from google.colab import files
# uploaded = files.upload()

In [51]:
# import pandas as pd
# import io
# df = pd.read_csv(io.BytesIO(uploaded['PS3_train.csv']))


In [52]:
df = pd.read_csv('/content/sample_data/PS3_train.csv')
df1=df.drop('uid',axis=1)
df1.head()


Unnamed: 0,content,title,target_ind
0,Premium quality five pocket jean from Wrangler...,Amazon.com: Wrangler Men's Rugged Wear Relaxed...,247
1,If you're looking for a different kind of anim...,Sakura Diaries - Complete Series Collector's E...,453
2,"First things first: Yes, Thinking XXX features...",Thinking XXX (Extended Cut) (2006),228
3,Feathertouch. 100% Polyester Machine Wash Warm...,Amazon.com: Petite Feathertouch Pull-On Pant: ...,223
4,"When you need outstanding fuel delivery, easy ...",ACDelco EP386 Fuel Pump,312


### Manipulating the columns for a better data structure
* The ID is unique for every row and hence does not need to be added to the model.
* Using the whole title would unnecessarily increase the computation, because about 59 percent of the rows have a unique title (20599 unique titles in 35112 rows).
* However, many titles start with very similar words, so another column, `tit`, is added that holds only the first six characters of the title (roughly its first word).
* All spaces are then removed from the `tit` column so that each value is a single token.
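
The difference between a six-character prefix and the first whole word can be seen on a toy frame. This is only a minimal sketch: the titles are copied from the preview rows in this notebook except for one made-up entry, and the `first_word` column is a hypothetical alternative that the notebook itself does not use.

```python
import pandas as pd

# Toy titles copied from the preview rows shown in this notebook.
titles = pd.DataFrame({"title": [
    "Amazon.com: Wrangler Men's Rugged Wear Relaxed...",
    "Sakura Diaries - Complete Series Collector's E...",
    "ACDelco EP386 Fuel Pump",
    "Ten Tigers",  # hypothetical title with a short first word
]})

# Six-character prefix with spaces removed, as done for the `tit` column.
titles["tit"] = titles["title"].str[:6].str.replace(" ", "", regex=False)

# Hypothetical alternative: keep the whole first whitespace-separated word.
titles["first_word"] = titles["title"].str.split().str[0]

print(titles[["tit", "first_word"]])
```

With the prefix approach a short first word runs into the next one (`Ten Ti` becomes `TenTi`), while the first-word variant keeps punctuation such as `Amazon.com:`; which of the two generalizes better would have to be checked on the real data.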

In [53]:
df1['tit']=df1['title'].str[:6]
print(len(df1['title'].unique()))
print(len(df1['tit'].unique()))
df2=df1.drop('title',axis=1)
df2.head()

20599
8138


Unnamed: 0,content,target_ind,tit
0,Premium quality five pocket jean from Wrangler...,247,Amazon
1,If you're looking for a different kind of anim...,453,Sakura
2,"First things first: Yes, Thinking XXX features...",228,Thinki
3,Feathertouch. 100% Polyester Machine Wash Warm...,223,Amazon
4,"When you need outstanding fuel delivery, easy ...",312,ACDelc


In [54]:
df2['tit'].unique()

array(['Amazon', 'Sakura', 'Thinki', ..., 'Ten Ti', 'Resona', '460 Bu'],
      dtype=object)

In [55]:
df2['tit'] = df2['tit'].str.replace(' ', '')
df2.tail()

Unnamed: 0,content,target_ind,tit
35107,"SArah Walker, Anthony Rolfe Johnson, Jean Rigb...",473,Britte
35108,Buy Teenage Mutant Ninja Turtles 4: Turtles in...,74,Teenag
35109,A 10 movie collection of women action flicks. ...,460,Women
35110,This huarache sandal is designed for all-day c...,158,Softsp
35111,The Wrangler Cowboy Cut jean is a Western Wear...,340,Amazon


In [56]:
print(len(df1['target_ind'].unique()))
print(len(df1))

500
35112


### Data Pre-Processing
* Remove all stopwords so that the total number of words decreases.
* Remove symbols and punctuation, because they contribute little to the model.
* Convert all letters to lower case so that the same word is not treated as different tokens during training.

In [57]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [58]:
print(df1['content'].apply(lambda x: len(x.split(' '))).sum())

5174176


In [59]:
my_tags=df1['target_ind'].unique()
my_tags

array([247, 453, 228, 223, 312,  85, 179, 162,  83, 195, 150, 497, 327,
       210, 340, 345, 229, 225, 286, 347, 366, 128, 305, 147, 300, 311,
       350, 349, 151,  67,  81,  58, 135, 372,  61, 306, 348,  18, 295,
       357, 488, 456, 160, 204, 155, 112, 287, 368, 490, 105, 351, 244,
       303, 440, 202, 353, 379, 158, 369, 197,  80, 457, 342, 278, 265,
       352, 280, 274, 484, 361, 397,  33, 498, 425, 221,  52, 356, 455,
       146, 242, 318, 132, 299, 402,  86, 399,  68, 418, 222, 401, 404,
       378,  88, 424, 227, 273, 269, 355, 322, 470, 298, 183, 439, 392,
       473, 431, 152, 377, 363, 138, 175, 409, 172, 212, 166, 495,  89,
       124, 310, 246, 257, 474, 296, 248, 436,  66, 354, 410, 127, 331,
       262, 134,  95, 307, 430, 444, 464, 245, 375, 216, 476, 360, 496,
       193,  40, 149, 384, 234, 214, 189,  78, 390, 398,  76, 364,   6,
       215, 338,  56, 168, 481, 406, 148, 469,  50, 333, 329, 302, 320,
       491,  16,  72, 313, 328, 140, 435, 185,  79, 130, 412, 31

In [60]:
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # raw string avoids invalid-escape warnings
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = BeautifulSoup(text, "lxml").text # HTML decoding
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # delete stopwords from text
    return text
    
df1['content'] = df1['content'].apply(clean_text)
df1.head()




Unnamed: 0,content,title,target_ind,tit
0,premium quality five pocket jean wrangler rugg...,Amazon.com: Wrangler Men's Rugged Wear Relaxed...,247,Amazon
1,youre looking different kind anime sakura diar...,Sakura Diaries - Complete Series Collector's E...,453,Sakura
2,first things first yes thinking xxx features a...,Thinking XXX (Extended Cut) (2006),228,Thinki
3,feathertouch 100 polyester machine wash warm g...,Amazon.com: Petite Feathertouch Pull-On Pant: ...,223,Amazon
4,need outstanding fuel delivery easy installati...,ACDelco EP386 Fuel Pump,312,ACDelc


In [61]:
print(df1['content'].apply(lambda x: len(x.split(' '))).sum())

3171816
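
The effect of the cleaning step can be quantified from the two word counts printed above. A minimal sketch of that arithmetic, using only the numbers from this notebook (the counts come from splitting on single spaces, so they are approximate token counts):

```python
# Word counts printed by the notebook before and after clean_text.
words_before = 5174176
words_after = 3171816

removed = words_before - words_after
reduction = removed / words_before

print(f"removed {removed} words ({reduction:.1%} of the original)")
```

So roughly 39 percent of the whitespace-separated tokens are dropped by the cleaning step.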


## Multinomial Naive Bayes Classifier 


In [62]:
X = df1.content
y = df1.target_ind
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 2)

In [63]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

nb = Pipeline([('vect', CountVectorizer()),
               ('tfidf', TfidfTransformer()),
               ('clf', MultinomialNB()),
              ])
nb.fit(X_train, y_train);



In [64]:

from sklearn.metrics import classification_report
y_pred = nb.predict(X_test)
accuracy_score(y_test, y_pred)  # (y_true, y_pred) order; the accuracy value is unchanged


0.24868990658464343
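
The `CountVectorizer` plus `TfidfTransformer` pair used in the pipeline can also be written as a single `TfidfVectorizer`, which is imported at the top of this notebook but never used. The sketch below checks the equivalence on a few made-up documents; the documents and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.pipeline import Pipeline

# Toy documents (hypothetical), standing in for the cleaned `content` column.
docs = [
    "premium quality five pocket jean",
    "outstanding fuel delivery easy installation",
    "relaxed fit jean five pocket",
]

# Two-step vectorization, as in the pipelines of this notebook.
two_step = Pipeline([("vect", CountVectorizer()),
                     ("tfidf", TfidfTransformer())])
# One-step equivalent with default parameters.
one_step = TfidfVectorizer()

a = two_step.fit_transform(docs).toarray()
b = one_step.fit_transform(docs).toarray()

same = np.allclose(a, b)
print(same)
```

Either form exposes the vectorizer parameters (for example `max_features` or `ngram_range`), so they could be tuned inside the pipeline if desired.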

## Support Vector Machine (SVM)


In [65]:
from sklearn.linear_model import SGDClassifier

sgd = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42, max_iter=5, tol=None)),
               ])
sgd.fit(X_train, y_train)

In [66]:
y_pred_2 = sgd.predict(X_test)

print('accuracy %s' % accuracy_score(y_test, y_pred_2))

accuracy 0.38471177944862156


## Neural Network(ANN)
* The two previous models take the text directly and vectorize it inside the pipeline, so the vectorization was not tuned separately.
* Here I tokenize and vectorize each part of the text separately, as best suits it, and train a neural network with 3 hidden layers.

In [67]:
import itertools
import os

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from sklearn.metrics import confusion_matrix

from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.preprocessing import text, sequence
from keras import utils

train_size = int(len(df2) * .75)
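# NOTE: df2 was created from df1 before clean_text was applied, so df2['content'] is still the raw text
train_posts = df2['content'][:train_size]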
train_tags = df2['target_ind'][:train_size]

test_posts = df2['content'][train_size:]
test_tags = df2['target_ind'][train_size:]

max_words = 4000
tokenize = text.Tokenizer(num_words=max_words, char_level=False)
tokenize.fit_on_texts(train_posts) # only fit on train

x_train_1 = tokenize.texts_to_matrix(train_posts)
x_test_1 = tokenize.texts_to_matrix(test_posts)

##
train_posts_2 = df2['tit'][:train_size]
test_posts_2 = df2['tit'][train_size:]

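# NOTE: with num_words=1 the Tokenizer keeps no words (index 0 is reserved),
# so the `tit` feature only adds a single all-zero column (4000 + 1 = 4001 features)
max_words = 1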
tokenize = text.Tokenizer(num_words=max_words, char_level=False)
tokenize.fit_on_texts(train_posts_2) # only fit on train

x_train_2 = tokenize.texts_to_matrix(train_posts_2)
x_test_2 = tokenize.texts_to_matrix(test_posts_2)

x_train=np.concatenate((x_train_1,x_train_2),axis=1)
x_test =np.concatenate((x_test_1,x_test_2),axis=1)



In [68]:
x_train[0].shape

(4001,)

In [69]:
encoder = LabelEncoder()
encoder.fit(train_tags)
y_train = encoder.transform(train_tags)
y_test = encoder.transform(test_tags)
num_classes = np.max(y_train) + 1
y_train = utils.to_categorical(y_train, num_classes)
y_test = utils.to_categorical(y_test, num_classes)
batch_size = 16
epochs = 9

# Build the model
model = Sequential()
model.add(Dense(512, input_shape=x_train[0].shape))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5)) # Reduces Over fitting
model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
              
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1, # Style of Showing
                    validation_split=0.2)

Epoch 1/9
Epoch 2/9
Epoch 3/9
Epoch 4/9
Epoch 5/9
Epoch 6/9
Epoch 7/9
Epoch 8/9
Epoch 9/9


In [70]:
score = model.evaluate(x_test, y_test,
                       batch_size=batch_size, verbose=1)




### Test Case 
* First, the test data has to be preprocessed as well, because the model was trained on preprocessed data.
* The predicted output must also be de-categorized (argmax) and inverse-transformed, because these operations were applied to the labels while training the model.

In [71]:
# process the test data in the same way as our training data set
df3=pd.read_csv('/content/sample_data/PS3_test.csv')
df3=df3.drop('uid',axis=1)
df3['content'] = df3['content'].apply(clean_text)
df3['tit']=df3['title'].str[:6]
df3['tit'] = df3['tit'].str.replace(' ', '')
df3=df3.drop('title',axis=1)
df3.head()
norm_model_test=df3['content']

In [72]:
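# NOTE: the tokenizers below are re-fitted on the test data, so their word indices
# do not match the ones the model was trained on; the tokenizer fitted on the
# training posts should be reused here instead.
max_words = 4000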
train_posts=df3['content']
tokenize = text.Tokenizer(num_words=max_words, char_level=False)
tokenize.fit_on_texts(train_posts)
x_data_1 = tokenize.texts_to_matrix(train_posts)


max_words = 1
train_posts_1=df3['tit']
tokenize = text.Tokenizer(num_words=max_words, char_level=False)
tokenize.fit_on_texts(train_posts_1)
x_data_2=tokenize.texts_to_matrix(train_posts_1)

x_data=np.concatenate((x_data_1,x_data_2),axis=1)
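
A vectorizer should be fitted once on the training texts and then only applied to the test texts; otherwise the same column can stand for different words in the two matrices. The sketch below shows that pattern. It uses scikit-learn's `CountVectorizer` as a stand-in for the Keras `Tokenizer` (an assumption made to keep the example small), and the texts and variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the train and test `content` columns.
train_texts = ["five pocket jean", "fuel pump easy installation", "relaxed fit jean"]
test_texts = ["fuel pump", "pocket watch"]

# Fit the vocabulary on the training texts only...
vectorizer = CountVectorizer(max_features=4000, binary=True)
x_train_demo = vectorizer.fit_transform(train_texts).toarray()

# ...and reuse the fitted vocabulary for the test texts, so that each
# column means the same word in both matrices.
x_test_demo = vectorizer.transform(test_texts).toarray()

print(x_train_demo.shape, x_test_demo.shape)
```

Words that never occurred in the training texts (here `watch`) are simply ignored at test time. With the Keras tokenizer the same idea would mean keeping the training tokenizer under its own name and calling `texts_to_matrix` on the test posts without calling `fit_on_texts` again.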

### Predictions from ANN

In [73]:
y_data=(model.predict(x_data))
y_data=np.argmax(y_data, axis=1)
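# NOTE: inverse_transform returns a new array and its result is discarded here,
# so y_data (shown below) still holds encoded class indices, not target_ind values
encoder.inverse_transform(y_data)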
y_data



array([318,  82, 348, ..., 101, 197, 252])
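
To map network outputs back to the original `target_ind` values, the result of `inverse_transform` has to be kept. A minimal sketch of the argmax and inverse-transform round trip follows; the labels, the probabilities and the names `encoder_demo`, `probs`, `encoded` and `decoded` are all made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels with gaps, so encoded indices differ from the labels.
train_labels = np.array([247, 453, 228, 247, 312])
encoder_demo = LabelEncoder().fit(train_labels)   # classes_: [228 247 312 453]

# Hypothetical softmax outputs for three samples over the four classes.
probs = np.array([
    [0.1, 0.7, 0.1, 0.1],
    [0.0, 0.1, 0.1, 0.8],
    [0.6, 0.2, 0.1, 0.1],
])

encoded = np.argmax(probs, axis=1)                 # positions in classes_
decoded = encoder_demo.inverse_transform(encoded)  # original label values

print(encoded, decoded)
```

If the training labels happen to be exactly the integers 0 to 499, the encoded and decoded values coincide, but assigning the decoded array is the safer habit.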

In [78]:
df5=pd.DataFrame(y_data)
df5.head()

Unnamed: 0,0
0,318
1,82
2,348
3,197
4,82


In [79]:
df5.to_csv('/content/sample_data/PS_3_OUTPUT.csv')

### Predictions from SVM

In [80]:
y_data_2=sgd.predict(norm_model_test)
y_data_2

array([361, 351, 390, ..., 431,   9, 494])

In [82]:
df6=pd.DataFrame(y_data_2)
df6.head()

Unnamed: 0,0
0,361
1,351
2,390
3,244
4,351


In [83]:
df6.to_csv('/content/sample_data/PS_3_OUTPUT_2.csv')

### Predictions from Naive Bayes

In [84]:
y_data_3=nb.predict(norm_model_test)
y_data_3

array([361, 351, 348, ..., 351, 348, 348])

In [85]:
df7=pd.DataFrame(y_data_3)
df7.to_csv('/content/sample_data/PS_3_OUTPUT_3.csv')

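* On the held-out split, the SVM (accuracy about 0.38) clearly beats Naive Bayes (about 0.25); of the three models, the SVM and the neural network appear to give the best output.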
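
For context, the accuracies printed earlier can be compared with the chance level for 500 classes. This is a rough sketch that assumes uniform random guessing as the baseline; the neural network is left out because its test score is not printed in this notebook.

```python
# Accuracies printed earlier in this notebook (rounded to four decimals).
accuracies = {"naive_bayes": 0.2487, "svm_sgd": 0.3847}

num_classes = 500          # number of distinct target_ind values
chance = 1 / num_classes   # assumption: uniform random guessing

for name, acc in accuracies.items():
    print(f"{name}: accuracy {acc:.4f}, about {acc / chance:.0f}x chance level")
```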