# Brown corpus

Question: Write a Python program to count the occurrences of articles and determiners in the Brown corpus.

**Downloading the Brown corpus with NLTK**

In [None]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

**Occurrence of determiners**

In [None]:
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist((genre, word) for genre in brown.categories() for word in brown.words(categories=genre))

genres = ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
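# Note: these samples are personal pronouns and modal verbs rather than determiners in the strict sense
determiners = ['he','she','it','they','can', 'could', 'may', 'might', 'must', 'will']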
cfd.tabulate(conditions=genres, samples=determiners)

                   he   she    it  they   can could   may might  must  will 
      adventure   761   240   492   206    46   151     5    58    27    50 
 belles_lettres  1174   178  1059   488   246   213   207   113   170   236 
      editorial   268    41   386   148   121    56    74    39    53   233 
        fiction   813   280   458   230    37   166     8    44    55    52 
     government   120     0   218    92   117    38   153    13   102   244 
        hobbies   155    21   476   177   268    58   131    22    83   264 
          humor   146    58   162    70    16    30     8     8     9    13 
        learned   328    54   856   338   365   159   324   128   202   340 
           lore   541   232   566   303   170   141   165    49    96   175 
        mystery   670   219   515   106    42   141    13    57    30    20 
           news   451    42   363   205    93    86    66    38    50   389 
       religion   137    10   264   115    82    59    78    12    54    71 
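
The sample list above holds pronouns and modal verbs. One way to count actual determiners is to filter on part-of-speech tags. The sketch below does that on a small hand-made tagged sample; the `toy` data and the `determiner_counts` helper are hypothetical and not part of the notebook. With the real corpus, `brown.tagged_words(categories=genre, tagset='universal')` could supply the tagged words, assuming the `universal_tagset` resource has also been downloaded.

```python
import nltk

def determiner_counts(tagged_by_genre, tag="DET"):
    """Count lowercased words carrying `tag`, conditioned on genre."""
    return nltk.ConditionalFreqDist(
        (genre, word.lower())
        for genre, tagged_words in tagged_by_genre.items()
        for word, word_tag in tagged_words
        if word_tag == tag
    )

# Hypothetical toy data standing in for tagged Brown words
toy = {
    "news": [("The", "DET"), ("jury", "NOUN"), ("said", "VERB"),
             ("the", "DET"), ("vote", "NOUN")],
    "fiction": [("This", "DET"), ("was", "VERB"), ("a", "DET"), ("dream", "NOUN")],
}

cfd_det = determiner_counts(toy)
cfd_det.tabulate(conditions=["news", "fiction"], samples=["the", "a", "this"])
```

Filtering on the tag rather than on a fixed word list also picks up determiners such as "this" or "every" without having to enumerate them.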

**Occurrence of articles**

In [None]:
from nltk.corpus import brown

print(brown.categories())

cfd = nltk.ConditionalFreqDist((genre, word) for genre in brown.categories() for word in brown.words(categories=genre))

genres = ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
articles = ['a','an','the']
cfd.tabulate(conditions=genres, samples=articles)

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
                    a    an   the 
      adventure  1354   159  3370 
 belles_lettres  3308   583  9726 
      editorial  1095   184  3508 
        fiction  1281   156  3423 
     government   867   208  4143 
        hobbies  1737   226  4300 
          humor   505    75   930 
        learned  3215   695 11079 
           lore  2304   364  6328 
        mystery  1136   125  2573 
           news  1993   300  5580 
       religion   655   119  2295 
        reviews   874   163  2048 
        romance  1335   152  2758 
science_fiction   222    33   652 
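
Raw counts are hard to compare across genres because the genres differ in size. As a rough illustration, the sketch below turns the counts from the table above into the share each article has among the three articles, for three of the genres. The helper name `article_shares` is an assumption made for this example.

```python
# Counts copied from the table above for three genres
article_counts = {
    "adventure": {"a": 1354, "an": 159, "the": 3370},
    "news": {"a": 1993, "an": 300, "the": 5580},
    "science_fiction": {"a": 222, "an": 33, "the": 652},
}

def article_shares(counts):
    """Return each article's count divided by the total article count."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

shares = {genre: article_shares(c) for genre, c in article_counts.items()}
for genre, genre_shares in shares.items():
    print(genre, {word: round(p, 3) for word, p in genre_shares.items()})
```

Dividing by the total number of words in each genre instead would give per-genre relative frequencies; those totals are not shown in the table, so they are left out here.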


# Classifiers

Questions:

1. Build a Naive Bayes classifier using NLTK on the training data and report its classification accuracy on the test data. Use any benchmark dataset.
2. List the most significant features of the dataset.
3. Apply any five supervised classification algorithms from scikit-learn to the same problem.
4. Explore supervised classification using spaCy.

**Dataset Description:**

Dataset used: IMDB Dataset

Dataset description:
1. The IMDB dataset contains 50K movie reviews for natural language processing or text analytics.
2. It is a binary sentiment classification dataset.
3. Total instances: 50,000.
4. The task is to classify each review as either positive or negative using classification algorithms.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Import necessary libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize

import sklearn.model_selection
import sklearn.metrics
from sklearn.metrics import classification_report,confusion_matrix

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Load the data**

In [None]:
df_eng = pd.read_csv('/content/drive/MyDrive/data/IMDB.csv')
df_eng

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


**Pre-processing**

In [None]:
import string

def clean_text(text):
    # Strip leading/trailing whitespace and convert the text to lowercase
    return text.strip().lower()

def remove_punctuations(txt):
    # Drop every character listed in string.punctuation
    return "".join(c for c in txt if c not in string.punctuation)

# Remove punctuation first, then normalise surrounding whitespace and case
df_eng['review'] = df_eng['review'].apply(remove_punctuations)
df_eng['review'] = df_eng['review'].apply(clean_text)
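
The reviews contain HTML line breaks such as `<br />`. Removing punctuation alone turns each of them into a stray `br` token. A minimal sketch of one way to handle this, assuming a simple regular expression is good enough for these tags, is to strip the tags before the punctuation step. The sample sentence and the helper names below are hypothetical.

```python
import re
import string

TAG_RE = re.compile(r"<[^>]+>")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def punctuation_only(text):
    """Mirror the notebook's cleaning: drop punctuation, strip, lowercase."""
    return text.translate(PUNCT_TABLE).strip().lower()

def strip_tags_then_clean(text):
    """Replace HTML tags with spaces first, then clean and collapse whitespace."""
    without_tags = TAG_RE.sub(" ", text)
    return " ".join(punctuation_only(without_tags).split())

# Hypothetical review fragment in the style of the dataset
sample = "A wonderful little production. <br /><br />The end."

print(punctuation_only(sample))
print(strip_tags_then_clean(sample))
```

Applied to the notebook's data, the second function could be passed to `df_eng['review'].apply(...)` in place of the two separate steps.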



**Feature extraction**

In [None]:
from sklearn.preprocessing import LabelEncoder
Encoder = LabelEncoder()
df_eng['sentiment']=Encoder.fit_transform(df_eng['sentiment'])
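
`LabelEncoder` assigns integer codes in sorted order of the class names, so it helps to check which sentiment ends up as 0 and which as 1 before reading the reports. The sketch below does this on a three-item toy list rather than on the full dataframe.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["positive", "negative", "positive"])

# Map each class name to its integer code
mapping = {label: int(code)
           for label, code in zip(encoder.classes_, encoder.transform(encoder.classes_))}

print(mapping)
print(list(codes))
```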

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_eng['review'], df_eng['sentiment'], test_size=0.3)
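
The split above is unseeded, so every rerun of the cell produces a different partition, and classifiers evaluated after a rerun are no longer scored on the same test set. A sketch of a reproducible, class-balanced split on toy data follows; the seed value 42 and the toy `texts` and `labels` are arbitrary choices for illustration.

```python
from sklearn.model_selection import train_test_split

texts = [f"review {i}" for i in range(10)]
labels = [0, 1] * 5

def split(seed):
    """Return a stratified 70/30 split that depends only on the seed."""
    return train_test_split(
        texts, labels, test_size=0.3, random_state=seed, stratify=labels
    )

first = split(42)
second = split(42)

print(first[1] == second[1])        # same test texts on both calls
print(len(first[0]), len(first[1])) # train size, test size
```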

In [None]:
tfidf = TfidfVectorizer(max_features=20000)
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)
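
The vectorizer is fitted on the training reviews only and then applied to the test reviews, so the vocabulary never sees test data. The toy example below, with made-up documents, shows the consequence: words that appear only in the test set are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["good movie", "bad movie"]
test_docs = ["good unseen word"]

vectorizer = TfidfVectorizer()
X_tr = vectorizer.fit_transform(train_docs)   # learns the vocabulary and IDF weights
X_te = vectorizer.transform(test_docs)        # reuses them unchanged

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(X_te.shape, X_te.nnz)
```

Only "good" from the test document is in the learned vocabulary, so the test row has a single non-zero entry.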

**Naive Bayes Classification**

In [None]:
from sklearn.naive_bayes import MultinomialNB

naive_bayes = MultinomialNB()
naive_bayes.fit(X_train, y_train)
y_pred = naive_bayes.predict(X_test)
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, y_pred))
print("Confusion Matrix:",confusion_matrix(y_test,y_pred)) 
print("Classification report")
print(classification_report(y_test,y_pred))

Accuracy score: 0.8615333333333334
Confusion Matrix: [[6485 1003]
 [1074 6438]]
Classification report
              precision    recall  f1-score   support

           0       0.86      0.87      0.86      7488
           1       0.87      0.86      0.86      7512

    accuracy                           0.86     15000
   macro avg       0.86      0.86      0.86     15000
weighted avg       0.86      0.86      0.86     15000
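
The cell above uses scikit-learn's `MultinomialNB`. For the NLTK part of the question, a minimal sketch with `nltk.NaiveBayesClassifier` might look as follows. It runs on a handful of invented reviews rather than on the IMDB data, and the `bag_of_words` feature extractor is a hypothetical helper. `most_informative_features` is one way to approach the "most significant features" question.

```python
import nltk

def bag_of_words(text):
    """Represent a text as {word: True} for every word it contains."""
    return {word: True for word in text.lower().split()}

# Invented toy reviews, not taken from the IMDB dataset
train = [
    ("wonderful touching film", "positive"),
    ("great acting wonderful story", "positive"),
    ("loved this great film", "positive"),
    ("terrible boring film", "negative"),
    ("awful acting boring story", "negative"),
    ("hated this terrible film", "negative"),
]
test = [
    ("wonderful great story", "positive"),
    ("boring terrible acting", "negative"),
]

train_set = [(bag_of_words(text), label) for text, label in train]
test_set = [(bag_of_words(text), label) for text, label in test]

classifier = nltk.NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.accuracy(classifier, test_set)
top_features = classifier.most_informative_features(5)

print("Accuracy:", accuracy)
classifier.show_most_informative_features(5)
```

On the full dataset the same pattern applies, with the feature dictionaries built from the tokenized reviews.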



**Classifier 1: MLP Classifier**

In [None]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(hidden_layer_sizes=(100,100), max_iter=10, alpha=0.0001,solver='sgd', verbose=10,  random_state=21,tol=0.000000001)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('MLPClassifier')
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, y_pred))
print("Confusion Matrix:",confusion_matrix(y_test,y_pred)) 
print("Classification report")
print(classification_report(y_test,y_pred))


Iteration 1, loss = 0.69300070
Iteration 2, loss = 0.69249483
Iteration 3, loss = 0.69187788
Iteration 4, loss = 0.69119441
Iteration 5, loss = 0.69050472
Iteration 6, loss = 0.68977390
Iteration 7, loss = 0.68899275
Iteration 8, loss = 0.68809264
Iteration 9, loss = 0.68710589
Iteration 10, loss = 0.68605717
Iteration 11, loss = 0.68492359
Iteration 12, loss = 0.68368507
Iteration 13, loss = 0.68237763
Iteration 14, loss = 0.68088815
Iteration 15, loss = 0.67921352
Iteration 16, loss = 0.67739863
Iteration 17, loss = 0.67541404
Iteration 18, loss = 0.67314292
Iteration 19, loss = 0.67066106
Iteration 20, loss = 0.66785268
Iteration 21, loss = 0.66465688
Iteration 22, loss = 0.66110032
Iteration 23, loss = 0.65709077
Iteration 24, loss = 0.65255240
Iteration 25, loss = 0.64742947
Iteration 26, loss = 0.64166895
Iteration 27, loss = 0.63518922
Iteration 28, loss = 0.62793042
Iteration 29, loss = 0.61980871
Iteration 30, loss = 0.61081340
Iteration 31, loss = 0.60095203
Iteration 32, los



MLPClassifier
Accuracy score: 0.8344666666666667
Confusion Matrix: [[6186 1302]
 [1181 6331]]
Classification report
              precision    recall  f1-score   support

           0       0.84      0.83      0.83      7488
           1       0.83      0.84      0.84      7512

    accuracy                           0.83     15000
   macro avg       0.83      0.83      0.83     15000
weighted avg       0.83      0.83      0.83     15000
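
The figures in the report can be recomputed by hand from the confusion matrix, whose rows are true labels and whose columns are predicted labels. The sketch below does this for the MLP matrix above, treating class 1 as the positive class.

```python
# Confusion matrix reported for the MLP classifier
tn, fp = 6186, 1302   # true class 0: predicted 0, predicted 1
fn, tp = 1181, 6331   # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```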



**Classifier 2 : Support Vector Classifier**

In [None]:
from sklearn.svm import SVC # "Support Vector Classifier" 
clf = SVC(kernel='linear') 
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('SVC')
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, y_pred))
print("Confusion Matrix:",confusion_matrix(y_test,y_pred)) 
print("Classification report")
print(classification_report(y_test,y_pred))

SVC
Accuracy score: 0.8968
Confusion Matrix: [[6575  797]
 [ 751 6877]]
Classification report
              precision    recall  f1-score   support

           0       0.90      0.89      0.89      7372
           1       0.90      0.90      0.90      7628

    accuracy                           0.90     15000
   macro avg       0.90      0.90      0.90     15000
weighted avg       0.90      0.90      0.90     15000



**Classifier 3 : Decision Tree Classifier**

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print('DecisionTreeClassifier')

print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, y_pred))
print("Confusion Matrix:",confusion_matrix(y_test,y_pred)) 
print("Classification report")
print(classification_report(y_test,y_pred))

DecisionTreeClassifier
Accuracy score: 0.7123333333333334
Confusion Matrix: [[5251 2121]
 [2194 5434]]
Classification report
              precision    recall  f1-score   support

           0       0.71      0.71      0.71      7372
           1       0.72      0.71      0.72      7628

    accuracy                           0.71     15000
   macro avg       0.71      0.71      0.71     15000
weighted avg       0.71      0.71      0.71     15000



**Classifier 4 : KNN**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
clf=KNeighborsClassifier(3)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('KNeighborsClassifier')

print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, y_pred))
print("Confusion Matrix:",confusion_matrix(y_test,y_pred)) 
print("Classification report")
print(classification_report(y_test,y_pred))

KNeighborsClassifier
Accuracy score: 0.7413333333333333
Confusion Matrix: [[5149 2223]
 [1657 5971]]
Classification report
              precision    recall  f1-score   support

           0       0.76      0.70      0.73      7372
           1       0.73      0.78      0.75      7628

    accuracy                           0.74     15000
   macro avg       0.74      0.74      0.74     15000
weighted avg       0.74      0.74      0.74     15000



**Classifier 5: Random Forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier 
clf= RandomForestClassifier(max_depth=6, n_estimators=12, max_features=5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print('RandomForestClassifier')
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, y_pred))
print("Confusion Matrix:",confusion_matrix(y_test,y_pred)) 
print("Classification report")
print(classification_report(y_test,y_pred))

RandomForestClassifier
Accuracy score: 0.5221333333333333
Confusion Matrix: [[6880  492]
 [6676  952]]
Classification report
              precision    recall  f1-score   support

           0       0.51      0.93      0.66      7372
           1       0.66      0.12      0.21      7628

    accuracy                           0.52     15000
   macro avg       0.58      0.53      0.43     15000
weighted avg       0.58      0.52      0.43     15000



**Classifier 6: AdaBoost**

In [None]:
from sklearn.ensemble import AdaBoostClassifier
clf= AdaBoostClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print('AdaBoostClassifier')
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, y_pred))
print("Confusion Matrix:",confusion_matrix(y_test,y_pred)) 
print("Classification report")
print(classification_report(y_test,y_pred))



AdaBoostClassifier
Accuracy score: 0.7966
Confusion Matrix: [[5623 1749]
 [1302 6326]]
Classification report
              precision    recall  f1-score   support

           0       0.81      0.76      0.79      7372
           1       0.78      0.83      0.81      7628

    accuracy                           0.80     15000
   macro avg       0.80      0.80      0.80     15000
weighted avg       0.80      0.80      0.80     15000
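
Each classifier above repeats the same fit, predict, and report steps. One possible way to compare models on a single fixed split is to loop over a dictionary of estimators, as sketched below on invented toy reviews. `LinearSVC` is swapped in here for `SVC(kernel='linear')` as a faster linear SVM, and `LogisticRegression` is included only as a further example; neither choice comes from the cells above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy data: 1 = positive, 0 = negative
train_texts = ["great film", "wonderful acting", "loved the story",
               "terrible film", "boring acting", "hated the story"]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["great story", "boring film"]
test_labels = [1, 0]

models = {
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    # Each pipeline fits its own vectorizer on the training texts only
    pipeline = make_pipeline(TfidfVectorizer(), model)
    pipeline.fit(train_texts, train_labels)
    scores[name] = accuracy_score(test_labels, pipeline.predict(test_texts))

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```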



# spaCy Classifier: Logistic Regression

In [None]:
import spacy
spacy.load('en_core_web_sm')

<spacy.lang.en.English at 0x7fb2d2c65e50>

In [None]:
# Import pandas & read csv file
import pandas as pd
reviews=pd.read_csv("https://raw.githubusercontent.com/hanzhang0420/Women-Clothing-E-commerce/master/Womens%20Clothing%20E-Commerce%20Reviews.csv")

# Extract desired columns and view the dataframe 
df_amazon = reviews[['Review Text','Recommended IND']].dropna()
df_amazon.head(10)

Unnamed: 0,Review Text,Recommended IND
0,Absolutely wonderful - silky and sexy and comf...,1
1,Love this dress! it's sooo pretty. i happene...,1
2,I had such high hopes for this dress and reall...,0
3,"I love, love, love this jumpsuit. it's fun, fl...",1
4,This shirt is very flattering to all due to th...,1
5,"I love tracy reese dresses, but this one is no...",0
6,I aded this in my basket at hte last mintue to...,1
7,"I ordered this in carbon for store pick up, an...",1
8,I love this dress. i usually get an xs but it ...,1
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",1


In [None]:
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Create our list of punctuation marks
punctuations = string.punctuation

# Create our set of stop words
stop_words = STOP_WORDS

# Load the English pipeline (tokenizer, tagger, lemmatizer, parser, NER).
# spaCy 3 removed the 'en' shortcut, so the full model name is used, and the
# loaded pipeline (not a blank English()) is what provides lemmas.
parser = spacy.load('en_core_web_sm')

# Creating our tokenizer function
def spacy_tokenizer(sentence):
    """Accept a sentence and process it into tokens: lemmatize, lowercase,
    and remove stop words and punctuation."""

    # Create the Doc object, which holds the tokens with their linguistic annotations
    mytokens = parser(sentence)

    # Lemmatize each token and convert it to lowercase.
    # Note that spaCy 2.x uses '-PRON-' as the lemma for all personal pronouns (me, I, etc.)
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Remove stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations]

    # Return the preprocessed list of tokens
    return mytokens

In [None]:
from sklearn.base import TransformerMixin

# Basic function to clean the text
def clean_text(text):
    """Strip surrounding whitespace and convert the text to lowercase"""
    return text.strip().lower()

# Custom transformer that applies clean_text to every document
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        """Override the transform method to clean text"""
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}
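
As an alternative sketch, inheriting from `BaseEstimator` as well as `TransformerMixin` supplies `get_params` automatically, so it does not have to be written by hand. The class name `TextCleaner` and the sample strings are hypothetical.

```python
from sklearn.base import BaseEstimator, TransformerMixin

class TextCleaner(BaseEstimator, TransformerMixin):
    """Stateless transformer that strips and lowercases each document."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step works inside a Pipeline
        return self

    def transform(self, X):
        return [text.strip().lower() for text in X]

cleaner = TextCleaner()
cleaned = cleaner.fit_transform(["  Love this DRESS!  ", "Runs small "])

print(cleaned)
print(cleaner.get_params())
```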

In [None]:
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range = (1,1))
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)

In [None]:
from sklearn.model_selection import train_test_split

X = df_amazon['Review Text'] # The features we want to analyse
ylabels = df_amazon['Recommended IND'] # The labels, in this case feedback

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size = 0.3, random_state = 1)
print(f'X_train dimension: {X_train.shape}')
print(f'y_train dimension: {y_train.shape}')
print(f'X_test dimension: {X_test.shape}')
print(f'y_test dimension: {y_test.shape}')

X_train dimension: (15848,)
y_train dimension: (15848,)
X_test dimension: (6793,)
y_test dimension: (6793,)


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

classifier = LogisticRegression()

# Create pipeline using Bag of Words
pipe = Pipeline([("cleaner", predictors()),
                 ("vectorizer", bow_vector),
                 ("classifier", classifier)])

# Model generation
pipe.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Pipeline(memory=None,
         steps=[('cleaner', <__main__.predictors object at 0x7fb2e4126350>),
                ('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 t...\b\\w\\w+\\b',
                                 tokenizer=<function spacy_tokenizer at 0x7fb2e45d1a70>,
                                 vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
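
The convergence warning above suggests raising `max_iter`. A minimal sketch of that change on invented toy data follows; the value 1000 is an arbitrary choice, and whether it is enough for the full review set is not verified here. The fitted model's `n_iter_` attribute shows how many iterations the solver actually used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented toy reviews: 1 = recommended, 0 = not recommended
docs = ["love this dress", "fits perfectly and looks great",
        "poor quality fabric", "returned it, too small"]
labels = [1, 1, 0, 0]

toy_pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
toy_pipe.fit(docs, labels)

# Iterations used by the solver; a value below max_iter means it converged
n_iter = int(toy_pipe.named_steps["classifier"].n_iter_[0])
print(n_iter)
```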
             

In [None]:


from sklearn import metrics

# Predicting with test dataset
predicted = pipe.predict(X_test)

# Model accuracy score
print(f'Logistic Regression Accuracy: {metrics.accuracy_score(y_test, predicted)}')
print(f'Logistic Regression Precision: {metrics.precision_score(y_test, predicted)}')
print(f'Logistic Regression Recall: {metrics.recall_score(y_test, predicted)}')



Logistic Regression Accuracy: 0.8875312822022671
Logistic Regression Precision: 0.9121517689283883
Logistic Regression Recall: 0.9552532665115446
