# Text-Classification with Amazon review data

## First Step: Gathering Dataset

In [2]:
import csv
import json

input_file = "Clothing_Shoes_and_Jewelry_5.json"
input_json = open(input_file, "r", encoding="utf-8")

output_file = "Clothing_Shoes_and_Jewelry_5.csv"
with open(output_file, "w", encoding="utf-8") as output_csv:
    csv_writer = csv.writer(output_csv)
    flag = 0
    for line in input_json.readlines():
        dic = json.loads(line)
        # writing headline in the beginning
        if flag == 0:
            csv_writer.writerow(dic)
            flag = 1
        csv_writer.writerow(dic.values())

print("Done")

Done


In [9]:
import pandas as pd
import string

input_data = pd.read_csv("Clothing_Shoes_and_Jewelry_5.csv")
input_data['overall'] = input_data['overall'].astype(object) # fix datatype error
input_data['reviewText'] = input_data['reviewText'].astype(object) # fix datatype error

dataset = {"reviewText": input_data["reviewText"], "overall": input_data["overall"]  }
dataset = pd.DataFrame(data = dataset)
dataset = dataset.dropna()

dataset = dataset[dataset["overall"] != '3'] # need datatype=object
dataset["label"] = dataset["overall"].apply(lambda rating : +1 if str(rating) > '3' else -1)

In [10]:
dataset.head()

Unnamed: 0,reviewText,overall,label
0,This is a great tutu and at a really great pri...,5.0,1
1,I bought this for my 4 yr old daughter for dan...,5.0,1
2,What can I say... my daughters have it in oran...,5.0,1
3,"We bought several tutus at once, and they are ...",5.0,1
4,Thank you Halo Heaven great product for Little...,5.0,1


In [11]:
dataset.label

0         1
1         1
2         1
3         1
4         1
         ..
278672    1
278673    1
278674    1
278675    1
278676    1
Name: label, Length: 278653, dtype: int64

## Splitting Dataset into Training and Testing sets

In [35]:
from sklearn.model_selection import train_test_split

# Añadir dos columnas, el txtclean, donde va a trajar el texto y vamos a limpiar los textos de espacios y demás
X = pd.DataFrame(dataset, columns = ["reviewText"])
y = pd.DataFrame(dataset, columns = ["label"])

#Separa en train y test los datos, random_state es la semilla de un numero aletrorio
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=50)

In [36]:
print(X,y)

                                               reviewText
0       This is a great tutu and at a really great pri...
1       I bought this for my 4 yr old daughter for dan...
2       What can I say... my daughters have it in oran...
3       We bought several tutus at once, and they are ...
4       Thank you Halo Heaven great product for Little...
...                                                   ...
278672  I don't normally go ga-ga over a product very ...
278673  I've been traveling back and forth to England ...
278674  These are very nice packing cubes and the 18 x...
278675  I am on vacation with my family of four and th...
278676  When I signed up to receive a free set of Shac...

[278653 rows x 1 columns]         label
0           1
1           1
2           1
3           1
4           1
...       ...
278672      1
278673      1
278674      1
278675      1
278676      1

[278653 rows x 1 columns]


In [37]:
print(train_X, test_X)

                                               reviewText
250767  well made and nice looking too. I lovee this I...
110585  I have a wide foot, New Balance is the only co...
46414   I love doc martens but these are so stiff and ...
126255  I wasn't really sure which size I would be I a...
96195   Quality of material very poor.  I should have ...
...                                                   ...
165976  This item is not a sweater, but rather a fairl...
186480  I recently bought 10-15 Patty items, after rea...
153726  I have been looking for a fleece &#34;closed b...
239522  These earrings have a nice shape, but I had th...
103917  I like the New Balance MW978. the only Issue t...

[208989 rows x 1 columns]                                                reviewText
2640    The tiny screws that hold the lid on fell out ...
238515  As like I said before Gerber has great product...
172625  These are a little big but cute i still rock t...
220153  Horrible pants, I can't stand the pol

In [18]:
? train_test_split

In [30]:
print(train_y, test_y)

        label
250767      1
110585      1
46414      -1
126255      1
96195      -1
...       ...
165976      1
186480      1
153726      1
239522      1
103917      1

[208989 rows x 1 columns]         label
2640       -1
238515      1
172625      1
220153     -1
192134      1
...       ...
249026      1
233452      1
59957       1
270455      1
97014       1

[69664 rows x 1 columns]


## Second Step: Text Data Processing
The second and the most important step — clean dataset

### CountVectorizer
In scikit-learn, CountVectorizer is a good tool to help us construct the Bag-of-words model that encoding data into the vector form.

Back to amazon dataset, ConuntVectorizer will do the pre-processing on text data before creating the vector, which we mentioned at the beginning of this section. Thus, we don’t need to clean data by ourselves.

In [38]:
from sklearn.feature_extraction.text import CountVectorizer

# take a word as a token.
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b') 
# Learn the vocabulary dictionary and return term-document matrix.
train_vector = vectorizer.fit_transform(train_X["reviewText"])
print(train_vector)
# the vocabulary dictionary
test_vector = vectorizer.transform(test_X["reviewText"])


  (0, 65859)	1
  (0, 36598)	1
  (0, 5997)	1
  (0, 40329)	1
  (0, 35894)	2
  (0, 61066)	1
  (0, 30559)	2
  (0, 36134)	1
  (0, 60126)	2
  (0, 10581)	1
  (0, 1866)	1
  (0, 41680)	2
  (0, 31085)	1
  (0, 20918)	1
  (0, 14586)	1
  (0, 67104)	1
  (0, 4831)	1
  (0, 6366)	1
  (0, 25374)	1
  (0, 3893)	1
  (0, 27782)	1
  (0, 52861)	1
  (0, 60850)	1
  (0, 26754)	1
  (0, 38671)	1
  :	:
  (208988, 32476)	1
  (208988, 28397)	1
  (208988, 32452)	1
  (208988, 11712)	1
  (208988, 35223)	1
  (208988, 53637)	1
  (208988, 59931)	1
  (208988, 33336)	1
  (208988, 35508)	1
  (208988, 66626)	1
  (208988, 9373)	1
  (208988, 59347)	1
  (208988, 2089)	4
  (208988, 39779)	1
  (208988, 24409)	1
  (208988, 61999)	1
  (208988, 3101)	1
  (208988, 10412)	1
  (208988, 29546)	1
  (208988, 23365)	1
  (208988, 65028)	1
  (208988, 12165)	1
  (208988, 46727)	1
  (208988, 2989)	1
  (208988, 39563)	1


In [43]:
from sklearn.feature_extraction.text import CountVectorizer

train_X1 = ["John likes to watch movies",
           "Mary likes movies too", 
           "Joe only likes horror movies and action movies"]

vectorizer1 = CountVectorizer(token_pattern=r'\b\w+\b') # take a word as a token.
train_vector1 = vectorizer1.fit_transform(train_X1) # Learn the vocabulary dictionary and return term-document matrix.
token_set1 = vectorizer1.get_feature_names() # the vocabulary dictionary: ['action', 'and', 'horror', 'joe', 'john', 'likes', 'mary', 'movies', 'only', 'to', 'too', 'watch']
print(token_set1)
print(train_vector1)


['action', 'and', 'horror', 'joe', 'john', 'likes', 'mary', 'movies', 'only', 'to', 'too', 'watch']
  (0, 4)	1
  (0, 5)	1
  (0, 9)	1
  (0, 11)	1
  (0, 7)	1
  (1, 5)	1
  (1, 7)	1
  (1, 6)	1
  (1, 10)	1
  (2, 5)	1
  (2, 7)	2
  (2, 3)	1
  (2, 8)	1
  (2, 2)	1
  (2, 1)	1
  (2, 0)	1
  (0, 5)	1
  (0, 7)	1


In [44]:
test_X1 = ["Jay likes romantic movies"]
test_vector1 = vectorizer1.transform(test_X1)
print(test_vector1)

  (0, 5)	1
  (0, 7)	1


## Final Step — Model Constructing
For the classification problem, we use the popular model — Logistic Regression for demonstration. Below shows how we utilize the model.

In [40]:
from sklearn.linear_model import LogisticRegression

clr = LogisticRegression()
clr.fit(train_vector, train_y.values.ravel())
scores = clr.score(test_vector, test_y) # accuracy
print(scores)

0.9313275149288011


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
