## Libraries for text classification problems
1. Scikit-learn  
2. nltk (interacts with Weka and other ML libraries)

### 1. Scikit-learn  


# Use of scikit-learn
## Naive bayes
from sklearn import naive_bayes, metrics  

###  Multinomial Naive Bayes (Bernoulli Naive Bayes is also available)  
#### Create a Multinomial NB object  
NB_clfr = naive_bayes.MultinomialNB()  
#### Call fit with the training data and the target labels
NB_clfr.fit(train_data, train_labels)  
#### Predict labels for the test data  
predicted_labels = NB_clfr.predict(test_data)  
#### Validate on test_data using a metric of choice  
metrics.f1_score(test_labels, predicted_labels, average='micro')  

##### The averaging type can be 'micro' or 'macro', among others
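
A minimal end-to-end sketch of the Naive Bayes steps above. The tiny count matrix and labels are made up purely for illustration and stand in for real `train_data` and `test_data`.

```python
import numpy as np
from sklearn import metrics, naive_bayes

# Made-up word-count features: rows are documents, columns are words.
train_data = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
train_labels = np.array([1, 1, 0, 0])
test_data = np.array([[4, 0, 0], [0, 5, 1]])
test_labels = np.array([1, 0])

# Fit, predict, then score with two different averaging types.
NB_clfr = naive_bayes.MultinomialNB()
NB_clfr.fit(train_data, train_labels)
predicted_labels = NB_clfr.predict(test_data)

micro_f1 = metrics.f1_score(test_labels, predicted_labels, average='micro')
macro_f1 = metrics.f1_score(test_labels, predicted_labels, average='macro')
print(predicted_labels, micro_f1, macro_f1)
```

Micro averaging pools all decisions before computing the score, while macro averaging computes the score per class and then takes the unweighted mean, so the two can differ when classes are imbalanced.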


### sklearn's SVM classifier
from sklearn import svm

#### Build an SVM object
svm_clfr = svm.SVC(kernel='linear', C=0.1)  
###### The default kernel is 'rbf' and the default C is 1.0


#### Use the object to call methods to train and predict
svm_clfr.fit(train_data, train_labels)  
svm_clfr.predict(test_data)  
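
A small sketch of the SVM workflow, assuming made-up and clearly separable 2-D points in place of real features. It also reads the defaults back from a fresh `SVC` object rather than relying on memory.

```python
import numpy as np
from sklearn import svm

# Defaults can be checked directly on a freshly constructed estimator.
default_params = svm.SVC().get_params()
print(default_params['kernel'], default_params['C'])

# Made-up points: class 0 sits near the origin, class 1 near (10, 10).
train_data = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
train_labels = np.array([0, 0, 0, 1, 1, 1])
test_data = np.array([[0.5, 0.5], [10.5, 10.5]])

# Linear kernel with a smaller C, i.e. stronger regularization than the default.
svm_clfr = svm.SVC(kernel='linear', C=0.1)
svm_clfr.fit(train_data, train_labels)
svm_predictions = svm_clfr.predict(test_data)
print(svm_predictions)
```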



## Model Selection 
An SVM has tuning parameters such as C. Model selection can be done using **repeated** k-fold cross-validation (CV).  
k-fold CV lets every observation be used for both training and validation, so models can be compared without cutting out a separate validation/tuning set.

from sklearn import model_selection  
predict_cv = model_selection.cross_val_predict(svm_clfr, X=train_data, y=train_labels, cv=5)
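
One way to sketch repeated k-fold model selection over C. The synthetic data from `make_classification`, the candidate C values, and the `f1_micro` scoring choice are all assumptions for illustration. `cross_val_score` is used here because `cross_val_predict` expects every sample to fall in exactly one test fold, which repeated k-fold does not satisfy.

```python
from sklearn import model_selection, svm
from sklearn.datasets import make_classification

# Synthetic stand-in for real training data.
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# 5 folds repeated 3 times gives 15 scores per candidate model.
cv = model_selection.RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

mean_scores = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    clf = svm.SVC(kernel='linear', C=C)
    scores = model_selection.cross_val_score(clf, X, y, cv=cv, scoring='f1_micro')
    mean_scores[C] = scores.mean()

# Keep the C with the best average score across all folds and repeats.
best_C = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_C)
```

`model_selection.GridSearchCV` accepts the same `cv` object and automates this loop when several parameters are tuned together.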

### 2. nltk   
nltk has its own classifiers  
- NaiveBayesClassifier  
- DecisionTreeClassifier  
- ConditionalExponentialClassifier (an alias for MaxentClassifier)  
- MaxentClassifier (maximum entropy)  

It also provides wrappers for calling scikit-learn and Weka classifiers  
- WekaClassifier  
- SklearnClassifier

from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)

#### Get the label for a single featureset
classifier.classify(score_set[0])

#### Get labels for a list of featuresets (one label per item; this is not multi-label classification)
classifier.classify_many(score_set)

#### Generate metrics after scoring
import nltk
nltk.classify.util.accuracy(classifier, test_set)

#### Show the most informative features; pass the no. of features to display
classifier.show_most_informative_features(10)
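
A minimal sketch of the nltk workflow above. nltk classifiers take featuresets as dicts paired with a label; the feature names and the tiny train and test sets here are made up for illustration.

```python
from nltk import NaiveBayesClassifier
from nltk.classify.util import accuracy

# Each training item is a (featureset dict, label) pair.
train_set = [
    ({'contains_great': True, 'contains_bad': False}, 'pos'),
    ({'contains_great': True, 'contains_bad': False}, 'pos'),
    ({'contains_great': False, 'contains_bad': True}, 'neg'),
    ({'contains_great': False, 'contains_bad': True}, 'neg'),
]
test_set = [
    ({'contains_great': True, 'contains_bad': False}, 'pos'),
    ({'contains_great': False, 'contains_bad': True}, 'neg'),
]

classifier = NaiveBayesClassifier.train(train_set)

# classify takes one featureset; classify_many takes a list of them.
single_label = classifier.classify(test_set[0][0])
many_labels = classifier.classify_many([features for features, _ in test_set])
test_accuracy = accuracy(classifier, test_set)

print(single_label, many_labels, test_accuracy)
classifier.show_most_informative_features(2)
```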


## nltk has no native SVM classifier, but scikit-learn's can be used through SklearnClassifier
TBD
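
A possible sketch of wrapping a scikit-learn SVM in nltk's `SklearnClassifier`. The choice of `LinearSVC` and the word-count featuresets are assumptions for illustration, not a prescribed setup.

```python
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import LinearSVC

# Made-up featuresets: word counts per review, paired with a label.
train_set = [
    ({'great': 1, 'love': 1}, 'pos'),
    ({'great': 2}, 'pos'),
    ({'bad': 1, 'awful': 1}, 'neg'),
    ({'bad': 2}, 'neg'),
]

# The wrapper vectorizes the dicts and encodes the labels before fitting.
svm_classifier = SklearnClassifier(LinearSVC()).train(train_set)

# The wrapped model exposes the usual nltk classifier methods.
svm_labels = svm_classifier.classify_many([{'great': 1}, {'bad': 1}])
print(svm_labels)
```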

## Example of Sentiment Analysis 

#### Data 
Amazon reviews; dataset structure:   
1. Product Name   
2. Price  
3. Brand  
4. Rating  
5. Review Votes  
6. Review Comments  

#### Task: Identify positive or negative sentiment toward the product using the review comments  
Ratings above 3 are labeled positive; all other ratings are labeled negative.  
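
The labeling rule can be sketched as follows, assuming the ratings are available as a numeric array (the values here are made up, and 1/0 is an assumed encoding for positive/negative).

```python
import numpy as np

# Made-up ratings on a 1 to 5 scale.
ratings = np.array([5, 4, 3, 1, 2, 4])

# 1 marks a positive review (rating above 3), 0 marks a negative one.
positively_rated = (ratings > 3).astype(int)
print(positively_rated)
```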

#### 1. Split the data into train and test sets for now
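
A minimal sketch of the split using `train_test_split`. The review texts, the 25% test size, and the stratification are assumptions chosen for illustration.

```python
from sklearn.model_selection import train_test_split

# Made-up review comments with their sentiment labels (1 = positive).
reviews = [
    "great phone", "love the battery", "excellent screen", "works very well",
    "terrible phone", "hate the battery", "awful screen", "stopped working",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Stratifying keeps the positive/negative ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, random_state=0, stratify=labels
)
print(len(X_train), len(X_test))
```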

#### 2. Feature Engineering - Features can be made using 
1. Word count  
2. TF-IDF, but we first need to form a bag of words/vocabulary that identifies with positive or negative sentiment.  
3. Can even make n-grams as features, but that will bulk up the feature space  
4. Other custom ways  

#### a. CountVectorizer can build the bag of words/vocabulary first. By default it does the following steps - 
1. Tokenizes each document into words. A token is a run of two or more word characters; punctuation only acts as a separator, so sentence boundaries are not tracked.  
2. Normalizes the words by converting them to lower case.  
3. Does **not** stem or lemmatize; a custom tokenizer or analyzer is needed for that.  
4. Prepares a matrix where each column is one of the tokens, each row is a document, and each value is the count of occurrences of the word in that document. This matrix is understandably sparse, with lots of 0's.
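
The default behaviour described in the steps above can be checked on a toy corpus. The two sentences are made up; the point is to see lowercasing, the dropping of one-character tokens, and the absence of stemming.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love this phone. Love it!", "Running is not the same as a run."]

toy_vect = CountVectorizer().fit(docs)
vocab = sorted(toy_vect.vocabulary_)
print(vocab)

# Rows are documents, columns are vocabulary entries, values are counts.
dtm = toy_vect.transform(docs)
love_count = dtm[0, toy_vect.vocabulary_['love']]
print(dtm.shape, love_count)
```

Both 'run' and 'running' stay in the vocabulary as separate columns, while 'I' and 'a' are dropped because they are single characters.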

from sklearn.feature_extraction.text import CountVectorizer

# Fit the CountVectorizer to the training data, this gives the bag of words
vect = CountVectorizer().fit(X_train)


#### b. Build a document-term matrix (DTM) from the bag of words  

Transform the documents in the training data into a document-term matrix  
X_train_vectorized = vect.transform(X_train)  
X_train_vectorized   

**This will be a sparse matrix with a very large number of features.**


#### 3. Fit a model using a method  
**Use Logistic Regression first**   

from sklearn.linear_model import LogisticRegression  

Train the model  
model = LogisticRegression()  
model.fit(X_train_vectorized, y_train)  

**Generate predictions after transforming the test set into a DTM as well**  

from sklearn.metrics import roc_auc_score  

**Predict the transformed test documents**  
predictions = model.predict(vect.transform(X_test))  

**Any words in the test set that do not appear in the training set are simply ignored.**

**Evaluate Predictions**  

print('AUC: ', roc_auc_score(y_test, predictions))  
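
A compact sketch of the whole fit, transform, predict, and evaluate sequence on a made-up corpus. It also computes the AUC from predicted probabilities, since `roc_auc_score` ranks examples and hard 0/1 predictions discard that ranking. The word 'gadget' is a deliberately unseen test word.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Made-up reviews; 1 = positive, 0 = negative.
X_train = [
    "great phone love it", "love the battery great", "excellent screen great value",
    "terrible phone hate it", "bad battery awful", "awful screen bad value",
]
y_train = [1, 1, 1, 0, 0, 0]
X_test = ["great value love it gadget", "bad phone awful gadget"]
y_test = [1, 0]

vect = CountVectorizer().fit(X_train)
model = LogisticRegression().fit(vect.transform(X_train), y_train)

# 'gadget' never appeared in training, so it gets no column and is ignored.
X_test_vectorized = vect.transform(X_test)
gadget_in_vocab = 'gadget' in vect.vocabulary_

# Probability of the positive class gives a ranking for the AUC.
scores = model.predict_proba(X_test_vectorized)[:, 1]
auc_from_scores = roc_auc_score(y_test, scores)
auc_from_labels = roc_auc_score(y_test, model.predict(X_test_vectorized))
print(gadget_in_vocab, auc_from_scores, auc_from_labels)
```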



#### 4. Observe the features with largest and smallest coefficients  
**Get the feature names as a numpy array**   
import numpy as np  
feature_names = np.array(vect.get_feature_names_out())  

**Sort the coefficients from the model**  
sorted_coef_index = model.coef_[0].argsort()

**Find the 10 smallest and 10 largest coefficients. The 10 largest coefficients are indexed using [:-11:-1], so the list returned is ordered from largest to smallest.**  
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))  
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))  

#### Feature Engineering with TF-IDF 
TF is term frequency, i.e.   
no. of times a word occurs in a doc / total words in the doc  
IDF is inverse document frequency, a weight that favors words that are comparatively rare in the corpus of documents.   
IDF = log(1 / (no. of documents having the word / total documents)) = log(total documents / no. of documents having the word)  
Since log is increasing and positive for arguments above 1, the weight increases as the word gets rarer.   

**So, TF-IDF can be expected to be high for words that occur in few documents of the corpus but are repeated often within those documents. It weighs down words that appear in almost every document, such as stop words, toward 0.**
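
The formulas above can be worked through by hand on a made-up, pre-tokenized corpus. This follows the plain definitions given here; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math

# Three made-up documents, already split into tokens.
docs = [
    ["the", "phone", "is", "great"],
    ["the", "battery", "is", "bad"],
    ["the", "phone", "phone", "rocks"],
]

def tf(word, doc):
    # Share of the document's tokens that are this word.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log(total documents / documents containing the word).
    doc_freq = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / doc_freq)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# 'the' is in every document, so its weight is log(3/3) = 0.
print(tf_idf("the", docs[0], docs))
# 'phone' is repeated in the third document and missing from one other.
print(tf_idf("phone", docs[2], docs))
# 'rocks' appears in a single document, so it gets the largest IDF.
print(idf("rocks", docs))
```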

#### We can derive the TF-IDF DTM using TfidfVectorizer. 

**TfidfVectorizer gives the same features as the count-based matrix, with the counts replaced by TF-IDF weights, so the matrix is just as wide. To trim it, we can keep only   
those features that occur in at least a certain no. of documents, using the min_df argument. The same argument can be used with CountVectorizer as well.  
It ensures that only words occurring in a minimum no. of documents become part of the vocabulary.**

from sklearn.feature_extraction.text import TfidfVectorizer

#### Fit the TfidfVectorizer to the training data, specifying a minimum document frequency of 5
vect = TfidfVectorizer(min_df=5).fit(X_train)
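
A small sketch of what `min_df` does, using a made-up four-document corpus and `min_df=2` so the effect is visible at this scale. It also compares the count and TF-IDF matrices built on the same documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["great phone", "great battery", "bad phone", "bad screen glitch"]

# Same vocabulary and the same non-zero cells; only the stored values differ.
count_dtm = CountVectorizer().fit_transform(docs)
tfidf_dtm = TfidfVectorizer().fit_transform(docs)
print(count_dtm.shape, count_dtm.nnz, tfidf_dtm.shape, tfidf_dtm.nnz)

# min_df=2 keeps only words that occur in at least two documents.
full_vocab = sorted(TfidfVectorizer().fit(docs).vocabulary_)
trimmed_vocab = sorted(TfidfVectorizer(min_df=2).fit(docs).vocabulary_)
print(full_vocab)
print(trimmed_vocab)
```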

#### Feature Engineering with n-grams
Using 2-grams or 3-grams, rather than just single words, captures some of the context around a word. For example, a review could  
have phrases like 'not a fan' or 'not excited', whose negative sentiment is lost when each word is counted on its own.  

**Fit the CountVectorizer to the training data, specifying a minimum document frequency of 5 and extracting 1-grams and 2-grams**  
 
**The min_df argument really helps to remove features that could be just noise rather than a pattern.**

vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

X_train_vectorized = vect.transform(X_train)
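
To see what the 2-grams add, here is a sketch on two made-up phrases, leaving out `min_df` because the corpus is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["not excited", "very excited"]

# Single words only: both documents share 'excited' and differ by one word.
unigram_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)

# Adding 2-grams keeps 'not excited' as a feature of its own.
bigram_vocab = sorted(CountVectorizer(ngram_range=(1, 2)).fit(docs).vocabulary_)

print(unigram_vocab)
print(bigram_vocab)
```

The feature count grows from 3 to 5 even on this tiny example, which is why a document-frequency cutoff matters more once n-grams are included.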

**The sklearn vectorizers are highly configurable; read the documentation.**