
In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Loading the Data from a Data Set**

This example uses the Sentiment Labelled Sentences Data Set, available from the UCI Machine Learning Repository.

In [None]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/NLP-Natural-Language-Processing-Methods/sentiment labelled sentences/amazon_cells_labelled.txt', names=['review', 'sentiment'], sep='\t')
df.head()

,review,sentiment
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1
2,Great for the jawbone.,1
3,Tied to charger for conversations lasting more...,0
4,The mic is great.,1


**Splitting the Data Set into a Training Set and a Test Set**

In [None]:
from sklearn.model_selection import train_test_split
reviews = df['review'].values
labels = df['sentiment'].values
reviews_train, reviews_test, y_train, y_test = train_test_split(reviews, labels, test_size=0.2, random_state=1000)

To train an ML model and then test it, we need a way to represent the text data numerically.

This can be done with a technique known as Bag of Words (BoW). You can generate a BoW matrix for text data with sklearn's CountVectorizer class, which converts text into numerical feature vectors by tokenizing each document and counting how often each token occurs; it can also filter stop words when configured to do so. CountVectorizer tokenizes using either its default tokenizer or a custom one.
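To make the idea concrete, here is a minimal pure-Python sketch of a Bag of Words matrix. It is not how CountVectorizer is implemented; the helper names `fit_vocabulary` and `transform` and the toy corpus are made up for illustration, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Map each distinct lowercased word to a column index (sorted for stability)."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def transform(documents, vocabulary):
    """Turn each document into a row of word counts; unseen words are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

# Toy corpus (made up for illustration, not taken from the data set).
corpus = ["good case good value", "the mic is great"]
vocabulary = fit_vocabulary(corpus)
bow_matrix = transform(corpus, vocabulary)

print(vocabulary)   # column index of each word
print(bow_matrix)   # one row of counts per document
```

Each row is one document and each column one vocabulary word. Words that were not seen when the vocabulary was built simply get no column, which is why the vocabulary is normally learned from the training set only.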


**Create a Custom Tokenizer Using spaCy**

In [None]:
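import string
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
punctuations = string.punctuation
parser = English()  # blank English pipeline: tokenizer only, no trained components
stopwords = set(STOP_WORDS)  # a set makes the membership test below fast
def spacy_tokenizer(utterance):
    """Tokenize, drop stop words and punctuation, and return lowercased lemmas."""
    tokens = parser(utterance)
    # Depending on the spaCy version, a blank pipeline may leave token.lemma_
    # empty, so fall back to the raw token text in that case.
    return [
        (token.lemma_ or token.text).lower().strip()
        for token in tokens
        if token.text.lower().strip() not in stopwords and token.text not in punctuations
    ]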

**Transforming Text into Numerical Feature Vectors**

As mentioned, transforming text into feature vectors can be done with sklearn's CountVectorizer. In the example below, we use the custom spaCy tokenizer created in the previous section. Alternatively, you can use the default tokenizer by passing no parameters to CountVectorizer().

In [None]:
# Use the custom spaCy tokenizer and count single words (unigrams) only
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1, 1))


In [None]:
#By default, the vectorizer might be created as follows:
#vectorizer = CountVectorizer()
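
For comparison, the default behaviour can be approximated with the standard library alone. The sketch below assumes the default settings `lowercase=True` and `token_pattern='(?u)\b\w\w+\b'`; the function name `default_like_tokenizer` is hypothetical and this is an approximation, not sklearn's actual code path.

```python
import re

# Default token_pattern of CountVectorizer: runs of two or more word characters.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_like_tokenizer(text):
    """Approximate the default analyzer: lowercase, then regex tokenization."""
    return DEFAULT_TOKEN_PATTERN.findall(text.lower())

print(default_like_tokenizer("The mic is great, I love it!"))
# ['the', 'mic', 'is', 'great', 'love', 'it']
```

Unlike the custom spaCy tokenizer, this keeps stop words such as "the" and "is", performs no lemmatization, and drops single-character tokens such as "I".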

In [None]:
vectorizer.fit(reviews_train)



CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<function spacy_tokenizer at 0x7f6c83bd1b70>,
                vocabulary=None)

**Transform the text into numerical feature vectors:**

In [None]:
X_train = vectorizer.transform(reviews_train)
X_test = vectorizer.transform(reviews_test)

**Training the Model**

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

**Evaluation of the Model**

In [None]:
accuracy = classifier.score(X_test, y_test)
print(f"Accuracy: {accuracy}")

Accuracy: 0.76


This means that our model classifies 76% of the reviews in the test set correctly.
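
For a classifier, `score()` returns the mean accuracy, that is, the fraction of test samples whose predicted label matches the true label. A minimal sketch of that calculation, using a hypothetical helper `accuracy_score_manual` and made-up labels:

```python
def accuracy_score_manual(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)

# Made-up labels for illustration only (not the notebook's test set).
y_true_example = [1, 0, 1, 1, 0]
y_pred_example = [1, 0, 0, 1, 1]
print(accuracy_score_manual(y_true_example, y_pred_example))  # 3 of 5 correct: 0.6
```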

**Predictions on New Data**

In [None]:
new_reviews = ['Old version of python useless', 'Very good effort, but not five stars', 'Clear and concise']
X_new = vectorizer.transform(new_reviews)
classifier.predict(X_new)

array([0, 1, 1])

As you can see, the predictions look plausible for these reviews: the first is classified as negative (0) and the other two as positive (1).
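
Hard labels hide how confident the model is. One way to inspect this is `predict_proba()`. The sketch below is self-contained and therefore trains on a tiny made-up corpus with the default vectorizer rather than on the notebook's data; the names `toy_reviews`, `toy_labels` and `toy_model` are hypothetical, and the probabilities it prints say nothing about the model trained above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, for illustration only.
toy_reviews = ["great phone", "excellent value", "terrible battery", "poor quality"]
toy_labels = [1, 1, 0, 0]

# Chain the default vectorizer and the classifier into a single estimator.
toy_model = make_pipeline(CountVectorizer(), LogisticRegression())
toy_model.fit(toy_reviews, toy_labels)

# Each row holds one probability per class, in the order of toy_model.classes_.
probabilities = toy_model.predict_proba(["great value", "poor battery"])
print(toy_model.classes_)
print(probabilities)
```

With the notebook's own objects, the equivalent call would be `classifier.predict_proba(X_new)`. Probabilities close to 0.5 indicate reviews the model is unsure about.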