In [1]:
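import numpy as np  # used below for np.where; imported explicitly rather than relying on the star import
import pandas as pd
from sentiment_classification import *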

## Data Review & Train/Test Splitting

In [2]:
df = pd.read_csv('data/review_w_spam_label.csv')

# Encode the label as a binary target: 1 for spam, 0 otherwise
df['target'] = np.where(df['target']=='spam',1,0)
spam = df[df['target'] == 1]

print(f'{(len(spam)/len(df))*100} percent of data is spam.')
df.head(10)

13.406317300789663 percent of data is spam.


Unnamed: 0,text,target
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0
5,FreeMsg Hey there darling it's been 3 week's n...,1
6,Even my brother is not like to speak with me. ...,0
7,As per your request 'Melle Melle (Oru Minnamin...,0
8,WINNER!! As a valued network customer you have...,1
9,Had your mobile 11 months or more? U R entitle...,1


### Instantiate the class to split the data into training and test sets

In [3]:
SentClass = SentimentClassification(df, 'text', 'target')

## Model Evaluation 
1. Multinomial **Naive Bayes classifier** with a Count Vectorizer
2. Multinomial **Naive Bayes classifier** with a TF-IDF Vectorizer (minimum document frequency = **3**)
3. **SVM** with a TF-IDF Vectorizer (minimum document frequency = **5**) and document length as an additional feature
4. **Logistic Regression** with a TF-IDF Vectorizer (minimum document frequency = **5**, using **word n-grams from n=1 to n=3**: unigrams, bigrams, and trigrams), plus the following additional features:
    - the length of the document (number of characters)
    - **number of digits per document**
5. **Logistic Regression** with a Count Vectorizer
    - Use only the first 2000 rows of the training data `X_train`
    - Ignore terms that have a document frequency strictly lower than **5** and use **character n-grams from n=2 to n=5**
    - Combine this document-term matrix with the following additional features:
      * the length of the document (number of characters)
      * number of digits per document
      * **number of non-word characters (anything other than a letter, digit, or underscore)**
    - Fit a Logistic Regression model with regularization `C=100` and `max_iter=1000`
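The character n-gram vocabulary described for the last model can be sketched in plain Python. This is only an illustration of the idea (n-grams of 2 to 5 characters, kept when they appear in enough documents); the helper names `char_ngrams` and `build_vocabulary` are hypothetical, the toy documents are made up, and the notebook's own class presumably relies on a library vectorizer instead.

```python
import re
from collections import Counter


def char_ngrams(text, n_min=2, n_max=5):
    """Return every character n-gram of a lowercased, whitespace-collapsed text."""
    text = re.sub(r"\s+", " ", text.lower())
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def build_vocabulary(docs, min_df=5, n_min=2, n_max=5):
    """Keep the n-grams that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        # set(): a document contributes at most 1 to each n-gram's document frequency
        doc_freq.update(set(char_ngrams(doc, n_min, n_max)))
    return sorted(gram for gram, df in doc_freq.items() if df >= min_df)


# Toy documents for illustration only (not rows from the dataset)
docs = ["Free entry", "free txt", "Ok lar"]
vocab = build_vocabulary(docs, min_df=2)
print(vocab)
```

With `min_df=2`, only n-grams shared by the first two toy documents survive (for example `'fr'` and `'free'`), which shows how the document-frequency threshold prunes rare n-grams.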

In [4]:
# AUC, multinomial Naive Bayes classifier model w/ Count Vectorizer
count_vect = SentClass.NB_classifier_count_vectorizer()

# AUC, multinomial Naive Bayes classifier model w/ TF-IDF Vectorizer (minimum document frequency = 3)
tfidf_vect, tfidf_X_train_vectorized = SentClass.get_tfidf_vectorizer(3)
SentClass.NB_classfier_tfidf(tfidf_vect, tfidf_X_train_vectorized)

# AUC, SVM using TF-IDF and document length as additional feature
SentClass.Tfidf_Vector_with_SVM(5)

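# AUC, Logistic Regression using TF-IDF, with document length and number of digits per document as additional features
SentClass.logistic_regression_Tfidf_ngrams()

# AUC, Logistic Regression using Count Vectorizer (character n-grams) on the first 2000 training rows,
# with three additional features; keep the vectorizer and model for the coefficient review below
count_vect_ngrams, LR_model_w_count = SentClass.logistic_regressions_count_vect_ngrams(2000)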

AUC, multinomial Naive Bayes classifier model w/ Count Vectorizer 0.991545422134696
AUC, multinomial Naive Bayes classifier model w/ TFI-DF Vectorizer 0.9954968337775665
AUC, SVM using TF-IDF and document length as additional feature: 0.9963202213809143
AUC, Logistic Regression w/ Tfidf and document length + num of digits per document as additional features 0.9938882569648405
AUC, Logistic Regressions w/ Count Vectorizer, adding 3 features: 0.9918552535524506 
Features added - 1) length of document 2) num of digits per document 3) num of non-word characters
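Every score above is an AUC (area under the ROC curve). As a reminder of what that number measures, here is a minimal sketch that computes it as the fraction of (spam, non-spam) pairs in which the spam message receives the higher score, counting ties as half. The function name `roc_auc` and the toy scores are illustrative assumptions; this is not the code the class uses, and the quadratic loop is only suitable for small inputs.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 3 of the 4 (positive, negative) pairs are ordered correctly -> 0.75
```

Because AUC depends only on how examples are ranked, it is not distorted by the class imbalance seen earlier (about 13% spam) in the way plain accuracy would be.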


## Feature Review
### TF-IDF Features
- Take each feature's **maximum** TF-IDF value across all documents, then print the features with the smallest and the largest of these maxima

In [5]:
SentClass.get_tfidf_features(tfidf_vect, 20, 20)

(moral              0.204078
 36504              0.215685
 sum1               0.216737
 100percent         0.217856
 genuine            0.217856
 showing            0.218333
 crack              0.219528
 havnt              0.219528
 honeybee           0.219528
 laughed            0.219528
 sweetest           0.219528
 norm150p           0.219653
 w45wq              0.219653
 minnaminunginte    0.220250
 nurungu            0.220250
 vettam             0.220250
 affection          0.224637
 bcums              0.224637
 kettoda            0.224637
 manda              0.224637
 dtype: float64,
 yup        1.0
 you        1.0
 yo         1.0
 where      1.0
 unsold     1.0
 type       1.0
 towards    1.0
 too        1.0
 those      1.0
 thanx      1.0
 thank      1.0
 space      1.0
 say        1.0
 sad        1.0
 right      1.0
 out        1.0
 or         1.0
 okie       1.0
 ok         1.0
 nite       1.0
 dtype: float64)
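The many features with a maximum of exactly 1.0 are what one would expect if rows are L2-normalized: a message whose only in-vocabulary term is, say, `ok` ends up with a single weight of 1. The sketch below reproduces that effect in plain Python using smoothed IDF and L2 row normalization, which are scikit-learn's documented `TfidfVectorizer` defaults. That `get_tfidf_features` uses those defaults is an assumption, and the helper names and toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """TF-IDF rows with smoothed IDF and L2 normalization (assumed settings)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in tokenized_docs:
        weights = {t: tf * idf[t] for t, tf in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


def max_tfidf_per_term(rows):
    """Largest TF-IDF weight each term reaches in any document."""
    best = {}
    for row in rows:
        for term, weight in row.items():
            best[term] = max(best.get(term, 0.0), weight)
    return best


# Toy tokenized documents for illustration only
docs = [["ok"], ["free", "entry", "win"], ["ok", "free"]]
max_tfidf = max_tfidf_per_term(tfidf_rows(docs))
ranked = sorted(max_tfidf.items(), key=lambda kv: kv[1])  # smallest maximum first
print(ranked)
```

Here `ok` reaches 1.0 because it is the only term in the first toy document, while terms that always appear alongside others stay below 1.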

### Review non-word characters in dataset 
- Non-word characters: anything other than a letter, digit or underscore
- Use the RegEx `\w` and `\W` character classes
- Calculate the average number of non-word characters per document for non-spam and spam documents
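A minimal sketch of this calculation using only the standard library is shown here. The function name `avg_nonword_by_label` and the toy messages are illustrative assumptions, not the notebook's `calculate_avg_nonword_char` implementation. Note that `\W` also matches whitespace, so spaces count as non-word characters in this sketch.

```python
import re
from statistics import mean

NON_WORD = re.compile(r"\W")  # anything other than a letter, digit or underscore


def avg_nonword_by_label(texts, labels):
    """Average number of non-word characters per document, grouped by label."""
    counts = {}
    for text, label in zip(texts, labels):
        counts.setdefault(label, []).append(len(NON_WORD.findall(text)))
    return {label: mean(values) for label, values in counts.items()}


# Toy messages for illustration only; 1 = spam, 0 = non-spam
texts = ["WINNER!! Call now.", "Ok lar", "Free entry: txt 2 win!", "see u"]
labels = [1, 0, 1, 0]
averages = avg_nonword_by_label(texts, labels)
print(averages)
```

In this toy data the spam messages average 5.5 non-word characters against 1 for non-spam, the same direction as the gap reported for the real dataset.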

In [6]:
nonword_in_spam, nonword_in_nonspam = SentClass.calculate_avg_nonword_char('text')

Average number of non-word characters in spam data 29.041499330655956 
Average number of non-word characters in non-spam data 17.29181347150259


In [7]:
# Review the non-word characters found in spam 
nonword_in_spam.head(10)

Unnamed: 0,nonword_char
2,. ( )&' '
5,"' ' ! ' ? ! , £."
8,!! £ ! . . .
9,? !
11,"! , > . /, , +"
12,"! £, ! : : & .."
15,": , >> ://. .?="
19,"- / . :, /. +"
34,£/ .
42,- - = + .


### Find the 10 smallest and 10 largest coefficients from the model

- The list of the 10 smallest coefficients should be sorted smallest first; the list of the 10 largest coefficients should be sorted largest first.


In [8]:
SentClass.return_coef(count_vect_ngrams, LR_model_w_count)

Smallest coefficients: ['..', '. ', ' m', ' i', 'he', 'at', ' d', ' h', 'us', 'th'] 
Largest coefficients: ['00', '50', ' 0', '0 ', 'xt', ' 1', '09', 'å£', 'xt ', ' å']
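The ordering rule for these two lists can be sketched as follows. `extreme_features` is a hypothetical helper, and the names and coefficient values are made up for illustration; the sketch assumes the feature names are aligned one-to-one with the coefficients, so any extra features appended to the document-term matrix would need labels of their own.

```python
def extreme_features(feature_names, coefficients, k=10):
    """Return (k smallest, k largest) feature names by coefficient value.

    The smallest list is sorted smallest first; the largest list is sorted largest first.
    """
    if len(feature_names) != len(coefficients):
        raise ValueError("feature_names and coefficients must have the same length")
    ranked = sorted(zip(coefficients, feature_names))  # ascending by coefficient
    smallest = [name for _, name in ranked[:k]]
    largest = [name for _, name in ranked[::-1][:k]]
    return smallest, largest


# Made-up names and coefficients for illustration only
names = ["..", "00", "he", "xt", " i"]
coefs = [-1.2, 2.5, -0.4, 0.9, -0.7]
smallest, largest = extreme_features(names, coefs, k=2)
print(smallest, largest)
```

Read this way, the output above suggests that digit-heavy n-grams such as `'00'` and `'50'` push a message toward the spam class, while common conversational fragments such as `'..'` and `' i'` push it away.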
