# Text Classification
## 1.  Problem definition
I am going to take the following steps to carry out sentiment analysis as a text classification task, framed as a multi-class classification problem.
1.  Collection of reviews with labels
  *  a. Packages - We will use text features such as tf-idf and probabilistic inference models such as Naïve Bayes, both available in sklearn. NLTK (Natural Language Toolkit) is one of the best-documented Python packages for language processing.
  *  b. Install packages - Install the packages mentioned above.
  *  c. nltk.download - Download the data.
  *  d. Get the positive and negative reviews - 1 is assigned to positive and 0 to negative.
  *  e. Load data
  *  f. Generate labels - "1" represents positive and "0" represents negative.
2.  Split data into train and test set
3.  Extract word count features - Calculate the tf-idf features from the training set and check the dataset size.
4.  Build the Naive Bayes classifier from training set – Using Multinomial Naïve Bayes (discrete data) to train the model.
5.  Apply Classifier on test data set.
6.  Evaluate the performance -  Accuracy, precision, recall, f1-score.
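
The steps above can be sketched end to end as follows. This is only a minimal illustration under stated assumptions: the eight reviews are a hypothetical toy corpus (not the NLTK data mentioned in step 1), and the labels assume 1 = positive and 0 = negative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Step 1: hypothetical toy reviews; assumed labels: 1 = positive, 0 = negative
reviews = [
    "a wonderful and moving film",
    "brilliant acting and a great story",
    "I loved every minute of it",
    "an excellent, funny and warm movie",
    "a dull and boring film",
    "terrible acting and a weak story",
    "I hated every minute of it",
    "an awful, slow and cold movie",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Step 2: split into train and test sets, keeping both classes in each part
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, stratify=labels, random_state=0)

# Step 3: fit tf-idf on the training set only and check the matrix size
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
print(X_train_tfidf.shape)

# Step 4: train a Multinomial Naive Bayes classifier
model = MultinomialNB().fit(X_train_tfidf, y_train)

# Step 5: apply the classifier to the test set (transform only, no refit)
X_test_tfidf = vectorizer.transform(X_test)
y_pred = model.predict(X_test_tfidf)

# Step 6: accuracy, precision, recall and f1-score
print(classification_report(y_test, y_pred,
                            target_names=["negative", "positive"],
                            zero_division=0))
```

With so few documents the scores themselves are not meaningful; the sketch only shows the order of the steps and that the vectorizer is fitted on the training set and reused on the test set.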

## 2. Solution Implementation (Model 1)

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Get the training dataset for the specified categories
categories = ['rec.sport.hockey', 'talk.religion.misc',
              'comp.graphics', 'sci.space']
training_data = fetch_20newsgroups(subset='train', categories=categories)

In [None]:
# Create the tf-idf transformer
tfidf = TfidfVectorizer(use_idf=True)
# tfidf = TfidfVectorizer(use_idf=False)
training_tfidf = tfidf.fit_transform(training_data.data)
print(training_tfidf.shape)

# Train a Multinomial Naive Bayes classifier
classifier = MultinomialNB().fit(training_tfidf, training_data.target)


(2154, 35956)
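
The printed shape reads as (number of documents, number of distinct terms in the fitted vocabulary). A minimal sketch on a hypothetical three-document corpus (the sentences are made up for illustration) shows the same relationship on a scale that can be checked by hand:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus, one string per document
docs = [
    "the puck hit the goal post",
    "the shuttle reached orbit",
    "render the image with ray tracing",
]

vectorizer = TfidfVectorizer(use_idf=True)
matrix = vectorizer.fit_transform(docs)

# Rows are documents, columns are the distinct terms learned from the corpus
print(matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

The repeated word "the" takes up a single column, so the column count is the vocabulary size rather than the total word count.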


In [None]:
from sklearn import metrics

testing_data = fetch_20newsgroups(subset='test', categories=categories)
testing_tfidf = tfidf.transform(testing_data.data)
predictions = classifier.predict(testing_tfidf)
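# Use the dataset's own target_names: label indices follow that (alphabetical) order,
# not the order of the `categories` list
print(metrics.classification_report(testing_data.target, predictions,
                                    target_names=testing_data.target_names))


                    precision    recall  f1-score   support

     comp.graphics       0.97      0.90      0.93       389
  rec.sport.hockey       0.93      0.99      0.96       399
         sci.space       0.83      0.98      0.90       394
talk.religion.misc       1.00      0.72      0.84       251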

          accuracy                           0.92      1433
         macro avg       0.93      0.90      0.91      1433
      weighted avg       0.93      0.92      0.92      1433



In [None]:
# Collect the indices of the misclassified test posts
errors = [i for i in range(len(predictions)) if predictions[i] != testing_data.target[i]]

# Show the first five errors as "true label --> predicted label", followed by the post
for post_id in errors[:5]:
    print("------------------------------------------------------------------")
    print("%s --> %s\n" % (testing_data.target_names[testing_data.target[post_id]],
                          testing_data.target_names[predictions[post_id]]))
    print(testing_data.data[post_id])


------------------------------------------------------------------
comp.graphics --> sci.space

From: robert@slipknot.rain.com (Robert Reed)
Subject: Re: ACM SIGGRAPH (and ACM in general)
Reply-To: Robert Reed <robert@slipknot.rain.com>
Organization: Home Animation Ltd.
Lines: 50

In article <1993Apr29.023508.11556@koko.csustan.edu> rsc@altair.csustan.edu (Steve Cunningham) writes:
|
|And no, SIGGRAPH 93 has not skipped town -- we're preparing the best
|SIGGRAPH conference yet!

Speaking of SIGGRAPH, I just went through the ordeal of my annual registration
for SIGGRAPH and re-upping of membership in the ACM last night, and was I ever
grossed out!  The new prices for membership are almost highway robbery!

For example:

	SIGGRAPH basic fee went from $26 last year to $59 this year for the same
	thing, a 127% increase.  Those facile enough to arrange a trip to the
	annual conference could reduce this to $27 by selecting SIGGRAPH Lite,
	which means SIGGRAPH is charging an additional $32 (o
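
Reading individual misclassified posts shows why a particular error happened; a confusion matrix summarizes how often each pair of classes is confused. The sketch below uses hypothetical label arrays and placeholder class names, not the notebook's results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical class names and labels, for illustration only
class_names = ["class_a", "class_b", "class_c"]
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# List every off-diagonal cell as "true --> predicted: count"
for t in range(len(class_names)):
    for p in range(len(class_names)):
        if t != p and cm[t, p] > 0:
            print("%s --> %s: %d" % (class_names[t], class_names[p], cm[t, p]))
```

With the notebook's variables, the equivalent call would be `confusion_matrix(testing_data.target, predictions)`, with `testing_data.target_names` supplying the row and column names.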

## Solution Implementation (Model 2)

In [None]:
# Get the training dataset for the specified categories
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
training_data = fetch_20newsgroups(subset='train', categories=categories)

In [None]:
# Create the tf-idf transformer
tfidf = TfidfVectorizer(use_idf=True)
# tfidf = TfidfVectorizer(use_idf=False)
training_tfidf = tfidf.fit_transform(training_data.data)
print(training_tfidf.shape)

# Train a Multinomial Naive Bayes classifier
classifier = MultinomialNB().fit(training_tfidf, training_data.target)

(2257, 35788)


In [None]:
from sklearn import metrics

testing_data = fetch_20newsgroups(subset='test', categories=categories)
testing_tfidf = tfidf.transform(testing_data.data)
predictions = classifier.predict(testing_tfidf)
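# Use the dataset's own target_names: label indices follow that (alphabetical) order,
# not the order of the `categories` list
print(metrics.classification_report(testing_data.target, predictions,
                                    target_names=testing_data.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398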

              accuracy                           0.83      1502
             macro avg       0.89      0.82      0.83      1502
          weighted avg       0.88      0.83      0.84      1502



In [None]:
# Collect the indices of the misclassified test posts
errors = [i for i in range(len(predictions)) if predictions[i] != testing_data.target[i]]

# Show the first five errors as "true label --> predicted label", followed by the post
for post_id in errors[:5]:
    print("------------------------------------------------------------------")
    print("%s --> %s\n" % (testing_data.target_names[testing_data.target[post_id]],
                          testing_data.target_names[predictions[post_id]]))
    print(testing_data.data[post_id])

------------------------------------------------------------------
sci.med --> soc.religion.christian

From: adwright@iastate.edu ()
Subject: Re: centi- and milli- pedes
Organization: Iowa State University, Ames IA
Lines: 37

In <1993Apr29.112642.1@vms.ocom.okstate.edu> chorley@vms.ocom.okstate.edu writes:

>In article <35004@castle.ed.ac.uk>, gtclark@festival.ed.ac.uk (G T Clark) writes:
>> msnyder@nmt.edu (Rebecca Snyder) writes:
>> 
>>>Does anyone know how posionous centipedes and millipedes are? If someone
>>>was bitten, how soon would medical treatment be needed, and what would
>>>be liable to happen to the person?
>> 
>>>(Just for clarification - I have NOT been bitten by one of these,  but my
>>>house seems to be infested, and I want to know 'just in case'.)
>> 
>>>Rebecca
>> 
>> 
>> 	Millipedes, I understand, are vegetarian, and therefore almost
>> certainly will not bite and are not poisonous. Centipedes are
>> carnivorous, and although I don't have any absolute knowledge on t

## 3. Model Evaluation
To evaluate model performance, we interpret the classification reports printed for the two test sets. Looking at the main classification metrics we can say:

  *  **Accuracy** - Model 1's accuracy (0.92) is higher than Model 2's (0.83). The test set is larger for Model 2 (1,502 posts) than for Model 1 (1,433 posts). Both datasets are reasonably balanced, although one class in Model 1 has noticeably fewer test posts (251) than the others (389 to 399).
  *  **Macro Avg** - This is the unweighted mean of a metric over the classes, computed separately for precision, recall and F1 score. It does not take class imbalance into account, but the classes here are roughly balanced. Model 1's figures (0.93, 0.90, 0.91) are higher than Model 2's (0.89, 0.82, 0.83).
  *  **Weighted Avg** - This is the mean of each metric (precision, recall, F1 score) over the classes, weighted by each class's support, so classes with more samples contribute more. The weighted and macro averages are almost the same for both models, which indicates that the classes are roughly balanced.
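
The two averages can be reproduced from the per-class rows of the reports above. The sketch below does this for recall only, using the rounded per-class values and supports as printed (in row order); the helper name `macro_and_weighted` is introduced here for illustration.

```python
import numpy as np

# Per-class recall and support, copied from the two reports in row order
recall_1 = np.array([0.90, 0.99, 0.98, 0.72])
support_1 = np.array([389, 399, 394, 251])
recall_2 = np.array([0.60, 0.89, 0.81, 0.99])
support_2 = np.array([319, 389, 396, 398])


def macro_and_weighted(values, support):
    """Return (unweighted mean, support-weighted mean) of a per-class metric."""
    return float(np.mean(values)), float(np.average(values, weights=support))


macro_1, weighted_1 = macro_and_weighted(recall_1, support_1)
macro_2, weighted_2 = macro_and_weighted(recall_2, support_2)
print("Model 1 recall: macro %.2f, weighted %.2f" % (macro_1, weighted_1))
print("Model 2 recall: macro %.2f, weighted %.2f" % (macro_2, weighted_2))
```

Rounded to two decimals, these values agree with the macro avg and weighted avg recall columns of the reports. The support-weighted recall also matches the reported accuracy, since both count the share of correctly classified test posts.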

**Conclusion** - Model 1 has the higher weighted averages (0.93 precision, 0.92 recall and 0.92 F1 score, versus 0.88, 0.83 and 0.84 for Model 2), which indicates that its performance across the classes is better than that of Model 2.