In [1]:
import pandas as pd
import numpy as np

## Examples of Text Classification

1. Topic identification
2. Spam Detection
3. Sentiment analysis
4. Spelling correction
5. Authorship identification
6. ...

## Identify features from text

- Text is unusual in that all the information you need is in the text itself

1. Words
 - The most common class of features
 - Handling commonly occurring words : Stop words
 - Normalization : Lower case vs leave as is
    - US vs us
    - White House vs white house
 - Stemming / Lemmatization
 - Characteristics of words : capitalization
 - Parts of speech of words in a sentence : whether vs weather
 - Grammatical structure, sentence parsing : 'weather' fits more than 'whether' after 'the'
 - Grouping words of similar meaning, semantics : {buy, purchase}, {Mr. Ms. Dr. Prof.}

2. Others
 - Depending on classification tasks, features may come from inside words and word sequences : n-grams, {-ing, -ion}
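
A few of the feature types listed above can be sketched in plain Python. This is only an illustration: the stop-word list, the suffix list, and the `extract_features` helper are hypothetical names invented for this example, not part of any library.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "is", "to", "of"}
# Suffixes used as "inside the word" features, as in {-ing, -ion}.
SUFFIXES = ("ing", "ion")

def extract_features(text, lowercase=True):
    """Return a dict of simple word-level features for one document."""
    tokens = re.findall(r"[A-Za-z']+", text)
    features = {}
    for tok in tokens:
        lowered = tok.lower()
        if lowered in STOP_WORDS:
            continue  # stop words carry little signal, so skip them
        word = lowered if lowercase else tok  # normalization choice
        key = "word=" + word
        features[key] = features.get(key, 0) + 1
        if tok[0].isupper():
            features["has_capitalized"] = 1  # capitalization as a feature
        for suf in SUFFIXES:
            if lowered.endswith(suf):
                skey = "suffix=" + suf
                features[skey] = features.get(skey, 0) + 1
    return features

feats = extract_features("The White House is reviewing the decision")
print(feats)
```

Passing `lowercase=False` keeps `White` and `House` distinct from `white` and `house`, which is the trade-off the normalization bullet describes.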
 

## Naive Bayes Classifier

[Wiki](https://ko.wikipedia.org/wiki/%EB%82%98%EC%9D%B4%EB%B8%8C_%EB%B2%A0%EC%9D%B4%EC%A6%88_%EB%B6%84%EB%A5%98#%EB%AA%A8%EC%88%98%EC%B6%94%EC%A0%95%EA%B3%BC_%EC%9D%B4%EB%B2%A4%ED%8A%B8_%EB%AA%A8%EB%8D%B8)

In [3]:
from sklearn import naive_bayes

clfrNB = naive_bayes.MultinomialNB()
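
To make the idea behind a multinomial Naive Bayes classifier concrete, here is a minimal pure-Python sketch. It assumes whitespace tokenization and add-one (Laplace) smoothing; `train_nb`, `predict_nb`, and the toy reviews are hypothetical and are not how scikit-learn implements `MultinomialNB` internally.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Collect class counts and per-class word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, alpha

def predict_nb(model, doc):
    """Pick the class with the highest log prior + summed log likelihoods."""
    class_counts, word_counts, vocab, alpha = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / n_docs)  # log prior
        total = sum(word_counts[label].values())
        for tok in doc.lower().split():
            if tok not in vocab:
                continue  # words unseen in training are ignored
            # smoothed estimate of P(word | class)
            score += math.log((word_counts[label][tok] + alpha)
                              / (total + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up toy data: 1 = positive, 0 = negative.
toy_docs = ["great phone love it", "love the screen",
            "terrible battery", "worst phone ever"]
toy_labels = [1, 1, 0, 0]
nb = train_nb(toy_docs, toy_labels)
print(predict_nb(nb, "love this phone"), predict_nb(nb, "terrible phone"))
```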

Typically, the test data set is not labeled: you train a model on a labeled set and then apply it to unlabeled test data. To estimate how well a model performs, you therefore need to set aside some of the labeled data. This matters especially when you are comparing models or tuning one. For example, an SVM has the C parameter, and you need to know what a good value of C is. Choosing between models and parameter values in this way is called the **model selection problem**.

For model selection

1. Hold out part of the labeled training data set
2. Cross Validation
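
The first option, holding out part of the labeled data, might look like the following sketch. `holdout_split`, its `valid_fraction` parameter, and the toy data are assumptions made for this example; in practice `train_test_split` from scikit-learn does the same job.

```python
import random

def holdout_split(items, labels, valid_fraction=0.25, seed=0):
    """Shuffle indices once, then carve off a validation slice."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_valid = int(len(indices) * valid_fraction)
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, valid_idx),
            pick(labels, train_idx), pick(labels, valid_idx))

# Made-up toy data.
items = list("abcdefgh")
labels = [0, 1] * 4
X_tr, X_va, y_tr, y_va = holdout_split(items, labels)
print(len(X_tr), len(X_va))
```

The model is fit on the training part only, and candidate parameter values are compared on the validation part.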

In [4]:
#Held-out data (not the final test data) is used to tune the model
#NLTK also has ML models such as a naive Bayes classifier, plus SklearnClassifier and WekaClassifier wrappers
#which link it to other ML packages

In [5]:
from sklearn.model_selection import KFold

X = np.random.randint(0,5,10)
kf = KFold()
for train, test in kf.split(X):
    print('{} {}'.format(train,test))

[2 3 4 5 6 7 8 9] [0 1]
[0 1 4 5 6 7 8 9] [2 3]
[0 1 2 3 6 7 8 9] [4 5]
[0 1 2 3 4 5 8 9] [6 7]
[0 1 2 3 4 5 6 7] [8 9]
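
The split printed above can be reproduced without scikit-learn. The sketch below assumes no shuffling and contiguous folds, with any remainder samples going to the first folds; `kfold_indices` is a hypothetical helper written for this note.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop

folds = list(kfold_indices(10, n_splits=5))
for train, test in folds:
    print(train, test)
```

Each sample lands in the test part exactly once, so every labeled example is used for both training and evaluation across the five rounds.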


## Sentiment Analysis

In [6]:
#One vs One
#One vs Rest
df = pd.read_csv('Amazon_Unlocked_Mobile.csv')

df

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0
...,...,...,...,...,...,...
413835,Samsung Convoy U640 Phone for Verizon Wireless...,Samsung,79.95,5,another great deal great price,0.0
413836,Samsung Convoy U640 Phone for Verizon Wireless...,Samsung,79.95,3,Ok,0.0
413837,Samsung Convoy U640 Phone for Verizon Wireless...,Samsung,79.95,5,Passes every drop test onto porcelain tile!,0.0
413838,Samsung Convoy U640 Phone for Verizon Wireless...,Samsung,79.95,3,I returned it because it did not meet my needs...,0.0


In [7]:
df = df.dropna()
df = df[df['Rating'] != 3]
df['Positive Rating'] = np.where(df['Rating']>3,1,0) #So this is how it's done: 1 if Rating > 3, else 0
df.head(10)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positive Rating
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,1
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,1
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,1
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,1
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,1
5,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,1,I already had a phone with problems... I know ...,1.0,0
6,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,2,The charging port was loose. I got that solder...,0.0,0
7,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,2,"Phone looks good but wouldn't stay charged, ha...",0.0,0
8,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I originally was using the Samsung S2 Galaxy f...,0.0,1
11,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,This is a great product it came after two days...,0.0,1


In [8]:
from sklearn.model_selection import train_test_split

X_train,X_test,Y_train,Y_test = train_test_split(df['Reviews'],df['Positive Rating'],
                                                 random_state=0)

In [9]:
X_train

97039     I bought a BB Black and was deliveried a White...
243783    overall i am very happy so far with this phone...
88792     the keyboard stutters! after i made a research...
388802    excellent smart phone, good performance. all p...
161607    I received my new Blu Vivo 5 Smartphone 3 days...
                                ...                        
159246                                            excellent
408347    Works great. Just waiting for my upgrade so I ...
197432    Although I'm only 26 I'm kind of a backwoods h...
153503              for the money not bad, but cheaply made
410159    broke it to quick tho now i need to get anothe...
Name: Reviews, Length: 231207, dtype: object

In [10]:
#Change text data to numeric data for sklearn

### Count Vectorizer

[Count Vectorize](http://theyoonicon.com/scikit-learn%EC%9D%84-%EC%9D%B4%EC%9A%A9%ED%95%9C-%EA%B5%B0%EC%A7%91%ED%99%94%EA%B4%80%EB%A0%A8-%EA%B2%8C%EC%8B%9C%EB%AC%BC-%EC%B0%BE%EA%B8%B0/)

The bag-of-words approach is a simple and commonly used way to represent text for machine learning: it ignores structure and only counts how often each word occurs. CountVectorizer implements the bag-of-words approach by converting a collection of text documents into a matrix of token counts.
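
As a rough illustration of what such a count matrix contains, here is a pure-Python sketch. It assumes lowercasing and tokens of two or more word characters, which mirrors CountVectorizer's default token pattern; `bag_of_words` and the two example sentences are made up for this note.

```python
import re
from collections import Counter

def bag_of_words(documents):
    """Return (sorted vocabulary, list of count rows), one row per document."""
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    vocabulary = sorted(set(tok for toks in tokenized for tok in toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

vocab_demo, matrix_demo = bag_of_words(
    ["Phone is working", "Phone is not working, not at all"])
print(vocab_demo)
for row in matrix_demo:
    print(row)
```

Each column corresponds to one vocabulary term, so the number of columns grows with the number of distinct tokens in the corpus.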

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
#Convert everything to lowercase -> Find words -> Build word vocabulary ->
#Make count matrix over the vocabulary (number of times each word appears) ->
#Calculate distance
vect = CountVectorizer().fit(X_train)

In [12]:
vect.get_feature_names()[::500] #Doing it this way, the dimensionality inevitably gets high

['00',
 '1425067051',
 '2048those',
 '312',
 '4less',
 '700ma',
 '99303',
 'accompanying',
 'adr6275',
 'aliasing',
 'andentering',
 'applet',
 'assignment',
 'away',
 'bandwidth',
 'behemoth',
 'blazingly',
 'bouts',
 'bullets',
 'cambio',
 'cassettes',
 'cheapoverall',
 'cleary',
 'commenter',
 'condishion',
 'contrariado',
 'cpl',
 'cusmter',
 'debi',
 'denin',
 'deғιnιтely',
 'discontinue',
 'dollarsshipping',
 'durationi',
 'ele',
 'ensuring',
 'esteem',
 'exclusive',
 'eyeglasses',
 'featuresof',
 'flashy',
 'fragle',
 'fusion2',
 'getappstore',
 'gorila',
 'guardwho',
 'hasbro',
 'hijack',
 'human',
 'imidietly',
 'inefficiencies',
 'intentional',
 'irullu',
 'jinx',
 'kinds',
 'lava',
 'like',
 'looooooove',
 'makeup',
 'md',
 'microsaudered',
 'mobiletronix',
 'msgi',
 'nav',
 'nightmarish',
 'nui',
 'oldy',
 'origen',
 'p770',
 'pd53100',
 'phalet',
 'piso',
 'poori',
 'presentacion',
 'productsaid',
 'pugged',
 'quirky',
 'reappearing',
 'rediculoius',
 'remotely',
 'respons

In [13]:
dir(vect)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_char_ngrams',
 '_char_wb_ngrams',
 '_check_n_features',
 '_check_stop_words_consistency',
 '_check_vocabulary',
 '_count_vocab',
 '_get_param_names',
 '_get_tags',
 '_limit_features',
 '_more_tags',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_sort_features',
 '_stop_words_id',
 '_validate_data',
 '_validate_params',
 '_validate_vocabulary',
 '_warn_for_unused_params',
 '_white_spaces',
 '_word_ngrams',
 'analyzer',
 'binary',
 'build_analyzer',
 'build_preprocessor',
 'build_tokenizer',
 'decode',
 'decode_error',
 'dtype',
 'encoding',
 'fit',
 'fit_transform',
 'fixe

In [71]:
X_train_vectorized = vect.transform(X_train)
type(X_train_vectorized) #Count vector is a scipy sparse matrix -> naturally, it is sparse
X_train_vectorized

<231207x53216 sparse matrix of type '<class 'numpy.int64'>'
	with 6117776 stored elements in Compressed Sparse Row format>

In [72]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=2000)
model.fit(X_train_vectorized,Y_train)

LogisticRegression(max_iter=2000)

C:\Users\netis\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:

    https://scikit-learn.org/stable/modules/preprocessing.html
    
Please also refer to the documentation for alternative solver options:

    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
    
  n_iter_i = _check_optimize_result(

[Convergence Warning](https://stackoverflow.com/questions/62658215/convergencewarning-lbfgs-failed-to-converge-status-1-stop-total-no-of-iter)

[lbfgs](https://wikidocs.net/22155)

In [73]:
X_test_vectorized = vect.transform(X_test)

In [74]:
from sklearn.metrics import roc_auc_score

predictions = model.predict(X_test_vectorized) #Words that don't appear in the training data are ignored
roc_auc_score(Y_test,predictions)

0.9305452091767882
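
The cell above passes hard 0/1 predictions to `roc_auc_score`. AUC is more commonly computed from continuous scores, such as those returned by `decision_function` or `predict_proba`. The sketch below uses a hypothetical pairwise `roc_auc` helper and made-up labels to show the difference: with hard labels the value reduces to the mean of the true-positive and true-negative rates.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up example: same model, thresholded vs. raw scores.
y = [1, 1, 1, 0, 0]
hard = [1, 1, 0, 0, 1]            # thresholded predictions
soft = [0.9, 0.8, 0.4, 0.3, 0.6]  # underlying scores

auc_hard = roc_auc(y, hard)
auc_soft = roc_auc(y, soft)
tpr, tnr = 2 / 3, 1 / 2  # rates implied by the hard predictions
print(auc_hard, (tpr + tnr) / 2, auc_soft)
```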

In [75]:
model.predict(vect.transform(['Phone is working','Phone is not working']))

array([1, 0])

In [25]:
dir(model)

['C',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_check_n_features',
 '_estimator_type',
 '_get_param_names',
 '_get_tags',
 '_more_tags',
 '_predict_proba_lr',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_validate_data',
 'class_weight',
 'classes_',
 'coef_',
 'decision_function',
 'densify',
 'dual',
 'fit',
 'fit_intercept',
 'get_params',
 'intercept_',
 'intercept_scaling',
 'l1_ratio',
 'max_iter',
 'multi_class',
 'n_features_in_',
 'n_iter_',
 'n_jobs',
 'penalty',
 'predict',
 'predict_log_proba',
 'predict_proba',
 'random_state',
 'score',
 'set_params',
 'solver',
 'sparsify',
 'tol',
 'verbose',
 'w

In [42]:
feature_names = np.array(vect.get_feature_names())
sorted_coef_index = model.coef_[0].argsort()

In [43]:
print('Smallest : {}'.format(feature_names[sorted_coef_index[:10]]))
print('Largest : {}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest : ['mony' 'worst' 'false' 'worthless' 'horribly' 'messing' 'unsatisfied'
 'blacklist' 'junk' 'superthin']
Largest : ['excelent' 'excelente' '4eeeks' 'exelente' 'efficient' 'excellent'
 'loving' 'pleasantly' 'loves' 'mn8k2ll']
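
The two slices used above, `[:10]` and `[:-11:-1]`, pick the ten most negative and the ten most positive coefficients from the argsort order. A small pure-Python sketch of the same indexing, using made-up coefficients and a window of two instead of ten:

```python
# Made-up coefficients for illustration only.
coefs = {"worst": -2.1, "junk": -1.4, "ok": 0.1, "loves": 1.2, "excellent": 1.9}
names = list(coefs)
values = list(coefs.values())

# Indices that would sort the values ascending (what argsort returns).
order = sorted(range(len(values)), key=values.__getitem__)

smallest = [names[i] for i in order[:2]]     # first two: most negative
largest = [names[i] for i in order[:-3:-1]]  # last two, walking backwards

print("Smallest :", smallest)
print("Largest :", largest)
```

The negative step walks the sorted order from the end, so the largest coefficient comes first.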


### Tfidf (Term frequency inverse document frequency)

Tf–idf, or Term frequency-inverse document frequency, allows us to weight terms based on how important they are to a document.
High weight is given to terms that appear often in a particular document, but don't appear often in the corpus. Features with low tf–idf are either commonly used across all documents or rarely used and only occur in long documents.

[TF-IDF](https://wiserloner.tistory.com/646)


#### Algorithm

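tf = Term frequency: the number of times a term appears in a given document

df = Document frequency: the number of documents that contain the term

idf = ln((1+n)/(1+df))+1, where n is the total number of documents. This is the smoothed form that scikit-learn's TfidfVectorizer uses by default; the hand-written `idf` function below omits the final +1.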
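
A quick numeric check of the idf formula, as a sketch. `smoothed_idf` and its `add_one` flag are hypothetical names; the flag switches between the form with the trailing +1 and the form without it.

```python
from math import log

def smoothed_idf(n_docs, doc_freq, add_one=True):
    """ln((1 + n) / (1 + df)), optionally followed by + 1."""
    value = log((1 + n_docs) / (1 + doc_freq))
    return value + 1 if add_one else value

# With n = 3 documents:
in_all = smoothed_idf(3, 3)                 # term in every document
in_one = smoothed_idf(3, 1)                 # term in a single document
in_one_no_plus = smoothed_idf(3, 1, add_one=False)
in_two_no_plus = smoothed_idf(3, 2, add_one=False)
print(in_all, in_one, in_one_no_plus, in_two_no_plus)
```

Without the +1, a term that occurs in every document gets an idf of exactly 0 and drops out of the product tf * idf; with the +1 it keeps a weight of 1.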

In [161]:
import nltk #word_tokenize needs the NLTK 'punkt' tokenizer data to be downloaded

docs = ['I go to my home my home is very large',
    'I went out my home I go to the market',
    'I bought a yellow lemon I go back to home']
token = [nltk.word_tokenize(text) for text in docs]
vocab = list(set(token[0] + token[1] + token[2]))

In [162]:
from math import log
import pandas as pd
N = len(docs)

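def idf(t):
    df = 0
    for doc in docs:
        df += t in doc #Substring test on the raw string, so 'a' also matches 'large' and 'market'
    return log((1+N)/(1+df)) #Smoothed idf without the final +1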

In [163]:
result = []

for i in range(len(docs)):
    result.append([])
    for j in range(len(vocab)):
        result[-1].append(docs[i].count(vocab[j])) #str.count counts substrings, so 'a' is also counted inside 'back'
        
tf = pd.DataFrame(result,columns=vocab)
tf

Unnamed: 0,yellow,my,the,back,large,out,bought,a,is,I,very,home,market,went,to,lemon,go
0,0,2,0,0,1,0,0,1,1,1,1,2,0,0,1,0,1
1,0,1,1,0,0,1,0,1,0,2,0,1,1,1,1,0,1
2,1,0,0,1,0,0,1,2,0,2,0,1,0,0,1,1,1


In [164]:
result = []
for j in range(len(vocab)):
    t = vocab[j]
    result.append(idf(t))

idf_ = pd.DataFrame(result, index = vocab, columns = ["IDF"])
idf_

Unnamed: 0,IDF
yellow,0.693147
my,0.287682
the,0.693147
back,0.693147
large,0.693147
out,0.693147
bought,0.693147
a,0.0
is,0.693147
I,0.0


In [168]:
idf_['IDF'] * tf

Unnamed: 0,yellow,my,the,back,large,out,bought,a,is,I,very,home,market,went,to,lemon,go
0,0.0,0.575364,0.0,0.0,0.693147,0.0,0.0,0.0,0.693147,0.0,0.693147,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.287682,0.693147,0.0,0.0,0.693147,0.0,0.0,0.0,0.0,0.0,0.0,0.693147,0.693147,0.0,0.0,0.0
2,0.693147,0.0,0.0,0.693147,0.0,0.0,0.693147,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.693147,0.0
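
The cells above count substrings of the raw strings, which is why `a` is found in every document. A token-based variant is sketched below for comparison. It assumes plain whitespace tokenization and the same idf without the +1; `docs_demo`, `idf_tok`, and `tfidf_demo` are names made up for this example.

```python
from math import log

docs_demo = ['I go to my home my home is very large',
             'I went out my home I go to the market',
             'I bought a yellow lemon I go back to home']

tokenized = [doc.split() for doc in docs_demo]
vocab_tok = sorted(set(tok for toks in tokenized for tok in toks))
n_docs = len(docs_demo)

def idf_tok(term):
    """Document frequency counts whole tokens, not substrings."""
    doc_freq = sum(term in toks for toks in tokenized)
    return log((1 + n_docs) / (1 + doc_freq))

# One dict per document: term -> tf * idf
tfidf_demo = [{term: toks.count(term) * idf_tok(term) for term in vocab_tok}
              for toks in tokenized]

print(tfidf_demo[2]['a'], tfidf_demo[0]['my'], tfidf_demo[0]['home'])
```

With whole-token matching, `a` occurs only in the third document and receives a non-zero weight there, while `home` still scores 0 because it appears in all three documents.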


In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect_tfidf = TfidfVectorizer().fit(X_train)
vect_tfidf_mindf = TfidfVectorizer(min_df=5).fit(X_train) #Reduce features by ignoring terms that appear in fewer than 5 documents

len(vect_tfidf.get_feature_names()), len(vect_tfidf_mindf.get_feature_names())

(53216, 17951)
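
The effect of `min_df` can be illustrated with a small sketch: a term is kept only if it occurs in at least `min_df` documents, regardless of how often it repeats inside one document. `prune_vocabulary` and the toy reviews are hypothetical.

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, min_df=2):
    """Keep terms whose document frequency is at least min_df."""
    doc_freq = Counter()
    for toks in tokenized_docs:
        doc_freq.update(set(toks))  # count each term once per document
    return sorted(t for t, df in doc_freq.items() if df >= min_df)

# Made-up, already tokenized reviews.
reviews = [["great", "phone"],
           ["great", "great", "price"],
           ["phone", "broke"],
           ["looooooove", "it"]]
kept = prune_vocabulary(reviews, min_df=2)
print(kept)
```

Rare misspellings and one-off tokens such as `looooooove` fall below the threshold, which is how the vocabulary above shrinks from 53216 to 17951 terms.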

In [52]:
vect_tfidf_mindf.get_feature_names()

['00',
 '000',
 '0000',
 '000000',
 '000mah',
 '007',
 '00pm',
 '01',
 '02',
 '03',
 '032g',
 '04',
 '04th',
 '05',
 '051',
 '06',
 '07',
 '08',
 '09',
 '0a',
 '0c',
 '0ghz',
 '0hd',
 '0ii',
 '0k',
 '0l',
 '0mp',
 '0s',
 '0stars',
 '10',
 '100',
 '1000',
 '10000',
 '1001multi',
 '100gb',
 '100hours',
 '100mb',
 '100s',
 '100x',
 '101',
 '102',
 '1020',
 '103',
 '104',
 '105',
 '106',
 '107',
 '1080',
 '1080i',
 '1080p',
 '1080x1920',
 '109',
 '10am',
 '10gb',
 '10mbps',
 '10min',
 '10mins',
 '10pm',
 '10screen',
 '10th',
 '10x',
 '10year',
 '10yo',
 '11',
 '110',
 '11059mem',
 '1109miami',
 '110v',
 '111',
 '112',
 '1122',
 '113',
 '114',
 '115',
 '118',
 '119',
 '119gb',
 '11gb',
 '11pm',
 '11th',
 '11yr',
 '12',
 '120',
 '1200',
 '120fps',
 '120mb',
 '1217asus',
 '123',
 '124',
 '124gb',
 '125',
 '126',
 '128',
 '1280',
 '1280x720',
 '128g',
 '128gb',
 '128mb',
 '129',
 '12gb',
 '12hrs',
 '12mm',
 '12mp',
 '12pm',
 '12th',
 '13',
 '130',
 '1300',
 '1320',
 '133',
 '1334',
 '135',
 '1

In [55]:
X_train_vectorized = vect_tfidf_mindf.transform(X_train)

model = LogisticRegression(max_iter=2000)
model.fit(X_train_vectorized,Y_train)

predictions = model.predict(vect_tfidf_mindf.transform(X_test))
roc_auc_score(Y_test,predictions)

0.9266357077003247

In [56]:
feature_names = np.array(vect_tfidf_mindf.get_feature_names())
sorted_coef_index = model.coef_[0].argsort()

print('Smallest : {}'.format(feature_names[sorted_coef_index[:10]]))
print('Largest : {}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest : ['not' 'worst' 'useless' 'disappointed' 'terrible' 'return' 'waste' 'poor'
 'horrible' 'doesn']
Largest : ['love' 'great' 'excellent' 'perfect' 'amazing' 'awesome' 'perfectly'
 'easy' 'best' 'loves']


In [62]:
model.predict(vect_tfidf_mindf.transform(['Phone is working','Phone is not working']))

array([1, 0])

One problem with our previous bag-of-words approach is that word order is disregarded. So "not an issue, phone is working" is seen as the same as "an issue, phone is not working".
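
This can be checked directly: the two sentences contain exactly the same words, so their word counts are identical. The `bow` helper below is a hypothetical one-liner using a simple `\w+` tokenizer.

```python
import re
from collections import Counter

def bow(text):
    """Word counts with order discarded."""
    return Counter(re.findall(r"\w+", text.lower()))

counts_ok = bow("not an issue, phone is working")
counts_bad = bow("an issue, phone is not working")
print(counts_ok == counts_bad)
```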

### n-grams

n-grams take context into account by treating sequences of adjacent tokens as features.

In [57]:
word='Swallow'
list(zip(*[word[i:] for i in range(3)]))

[('S', 'w', 'a'),
 ('w', 'a', 'l'),
 ('a', 'l', 'l'),
 ('l', 'l', 'o'),
 ('l', 'o', 'w')]

In [None]:
def ngram(word, n):
    #Character n-grams of word, joined back into strings
    grams = zip(*[word[i:] for i in range(n)])
    return set(''.join(x) for x in grams)
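
The same sliding-window idea applies to words instead of characters. The sketch below assumes whitespace tokenization with commas stripped; `word_ngrams` is a hypothetical helper. It shows that bigrams tell apart two sentences that share the same set of words.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams in reading order."""
    tokens = text.lower().replace(",", "").split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams_ok = set(word_ngrams("not an issue, phone is working", 2))
bigrams_bad = set(word_ngrams("an issue, phone is not working", 2))

print(sorted(bigrams_ok - bigrams_bad))
print(sorted(bigrams_bad - bigrams_ok))
```

Features such as `is working` and `not working` give a classifier the local context that single words cannot.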

In [61]:
vect_bigram = CountVectorizer(min_df=5, ngram_range = (1,2)).fit(X_train)
len(vect_bigram.get_feature_names()), vect_bigram.get_feature_names()

(198917,
 ['00',
  '00 activation',
  '00 also',
  '00 am',
  '00 and',
  '00 as',
  '00 bucks',
  '00 but',
  '00 cheaper',
  '00 compared',
  '00 dlls',
  '00 dollars',
  '00 for',
  '00 however',
  '00 if',
  '00 in',
  '00 is',
  '00 it',
  '00 less',
  '00 mo',
  '00 month',
  '00 more',
  '00 no',
  '00 not',
  '00 on',
  '00 or',
  '00 per',
  '00 phone',
  '00 phones',
  '00 plus',
  '00 pm',
  '00 price',
  '00 smartwatch',
  '00 so',
  '00 that',
  '00 the',
  '00 this',
  '00 to',
  '00 too',
  '00 total',
  '00 unlocked',
  '00 usd',
  '00 with',
  '00 you',
  '000',
  '000 200',
  '000 but',
  '000 colors',
  '000 feet',
  '000 for',
  '000 mah',
  '000 models',
  '000 on',
  '000 would',
  '0000',
  '000000',
  '000mah',
  '000mah battery',
  '007',
  '007 james',
  '00pm',
  '01',
  '01 16',
  '01 24',
  '01 and',
  '01 day',
  '01 is',
  '01 un',
  '02',
  '02 and',
  '02 lolipop',
  '03',
  '032g',
  '032g gn6ma',
  '04',
  '04 12',
  '04 2016',
  '04th',
  '04th of',


Keep in mind that, although n-grams can be powerful in capturing meaning, longer sequences can cause an explosion of the number of features.
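
A rough way to see this growth is to count distinct n-grams as the maximum length increases. The sketch below uses three made-up sentences and a hypothetical `count_distinct_ngrams` helper. On a real corpus the growth is far steeper, as the jump to 198917 features above suggests.

```python
def count_distinct_ngrams(texts, max_n):
    """Number of distinct word n-grams for all lengths 1..max_n."""
    seen = set()
    for text in texts:
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
    return len(seen)

# Made-up mini corpus.
sample = ["the phone is great",
          "the phone is not great",
          "great phone for the price"]
sizes = [count_distinct_ngrams(sample, n) for n in (1, 2, 3)]
print(sizes)
```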