<a href="https://colab.research.google.com/github/Towfique1311/SentimentAnalysis/blob/main/SentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#X_train, Y_train and X_test are our dataset. We can add more data if needed

In [None]:
# Training data
X_train = ["This was awesome and tasty food",
           "Great food! I like it a lot",
           "Happy Ending! awesome acting by the actress",
           "Adore it! it was truly great experience",
           "Bad not upto the hype",
           "Could have been better",
           "Surely a Disapointing meal"]

Y_train = [1,1,1,1,0,0,0] # 1 is Positive, 0 is Negative class

#testing data
X_test = ["It was awesome and I loved the meal and now I am happy"," I saw the movie  and it was bad"]
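Since more data can be added, it may help to confirm that every sentence still has exactly one label. The following is a minimal sketch; the helper name check_dataset is hypothetical and not part of the original notebook.

```python
from collections import Counter

def check_dataset(texts, labels):
    """Return per-class counts after verifying texts and labels line up."""
    if len(texts) != len(labels):
        raise ValueError(f"{len(texts)} texts but {len(labels)} labels")
    unknown = set(labels) - {0, 1}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return Counter(labels)

# Same sentences and labels as the training data above.
X_train = ["This was awesome and tasty food",
           "Great food! I like it a lot",
           "Happy Ending! awesome acting by the actress",
           "Adore it! it was truly great experience",
           "Bad not upto the hype",
           "Could have been better",
           "Surely a Disapointing meal"]
Y_train = [1, 1, 1, 1, 0, 0, 0]

counts = check_dataset(X_train, Y_train)
print(counts)  # 4 positive examples, 3 negative examples
```

Running a check like this after each addition catches a forgotten label before it reaches the model.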

# In this part we clean our data: tokenization, stopword removal and stemming.

In [None]:
# here we are importing our necessary libraries for data cleaning 
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

In [None]:
#Importing nltk and downloading the stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
#creating the tokenizer object
#the regular expression \w+ matches runs of word characters, so punctuation is dropped

tokenizer = RegexpTokenizer(r'\w+')

#we are selecting stopwords for english
eng_stopwords = set(stopwords.words('english'))

#we are creating porterstemmer
ps = PorterStemmer()
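To see what the \w+ pattern keeps and discards, the same pattern can be tried with Python's built-in re module. For this pattern, re.findall should give the same tokens as RegexpTokenizer. This is a sketch, and the helper name word_tokens is made up for illustration.

```python
import re

def word_tokens(text):
    # r"\w+" matches maximal runs of letters, digits and underscores,
    # so spaces and punctuation act as separators and are discarded.
    return re.findall(r"\w+", text.lower())

tokens = word_tokens("Great food! I like it a lot")
print(tokens)      # ['great', 'food', 'i', 'like', 'it', 'a', 'lot']

# An apostrophe is not a word character, so contractions are split in two.
apostrophe = word_tokens("don't")
print(apostrophe)  # ['don', 't']
```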


In [None]:
#here we are creating a function for cleaning data
def getCleanedText(text):
  text = text.lower()

  #now we are performing tokenization
  tokens = tokenizer.tokenize(text)
  #here we are removing the stopwords from the tokens
  new_tokens = [token for token in tokens if token not in eng_stopwords]

  #here we are doing stemming
  stemmed_tokens = [ps.stem(token) for token in new_tokens]

  clean_text = " ".join(stemmed_tokens)
  return clean_text
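The first two stages of the function can be traced without NLTK. The sketch below uses a tiny hand-picked stopword set as a stand-in (an assumption: NLTK's English list is much longer) and leaves out stemming. The names DEMO_STOPWORDS and clean_without_stemming are hypothetical.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption, not the real list).
DEMO_STOPWORDS = {"this", "was", "and", "it", "i", "a", "the"}

def clean_without_stemming(text):
    # Lowercase, split into word tokens, then drop the stopwords.
    tokens = re.findall(r"\w+", text.lower())
    kept = [token for token in tokens if token not in DEMO_STOPWORDS]
    return " ".join(kept)

cleaned = clean_without_stemming("This was awesome and tasty food")
print(cleaned)  # awesome tasty food
```

The Porter stemmer then shortens the surviving words, which is why the notebook's cleaned output for this sentence reads 'awesom tasti food'.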




#Using the above defined function, we are cleaning our training and testing data 

In [None]:
X_clean = [getCleanedText(i) for i in X_train]
Xt_clean = [getCleanedText(i) for i in X_test]
            


In [None]:
X_clean

['awesom tasti food',
 'great food like lot',
 'happi end awesom act actress',
 'ador truli great experi',
 'bad upto hype',
 'could better',
 'sure disapoint meal']

#In this portion, we are doing vectorization

In [None]:
from sklearn.feature_extraction.text import CountVectorizer 

In [None]:
#we are creating the instance of the CountVectorizer and setting the ngram range
cv = CountVectorizer(ngram_range=(1,2))

In [None]:
#we are vectorizing our input and converting it into an array
X_vec = cv.fit_transform(X_clean).toarray()

In [None]:
X_vec

array([[0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
        1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
        0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0]])

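Each row of X_vec corresponds to one training sentence and each column to one feature name (a single word or a pair of adjacent words). The value is how many times that feature occurs in the sentence.
This representation is also referred to as a bag of words model.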
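The counting behind one row can be sketched in plain Python. This is a simplified illustration, not CountVectorizer's actual implementation: it splits on whitespace, whereas CountVectorizer's default tokenizer also drops one-character tokens. The function name ngram_counts is hypothetical.

```python
from collections import Counter

def ngram_counts(text, ngram_range=(1, 2)):
    """Count word n-grams for every length in ngram_range (inclusive)."""
    words = text.split()
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        # Slide a window of n words across the sentence.
        for start in range(len(words) - n + 1):
            counts[" ".join(words[start:start + n])] += 1
    return counts

row = ngram_counts("awesom tasti food")
print(sorted(row.items()))
# [('awesom', 1), ('awesom tasti', 1), ('food', 1), ('tasti', 1), ('tasti food', 1)]
```

These five features are the five columns set to 1 in the first row of X_vec.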

In [None]:
print(cv.get_feature_names_out())

['act' 'act actress' 'actress' 'ador' 'ador truli' 'awesom' 'awesom act'
 'awesom tasti' 'bad' 'bad upto' 'better' 'could' 'could better'
 'disapoint' 'disapoint meal' 'end' 'end awesom' 'experi' 'food'
 'food like' 'great' 'great experi' 'great food' 'happi' 'happi end'
 'hype' 'like' 'like lot' 'lot' 'meal' 'sure' 'sure disapoint' 'tasti'
 'tasti food' 'truli' 'truli great' 'upto' 'upto hype']


In [None]:
Xt_vect = cv.transform(Xt_clean).toarray()

In [None]:
Xt_vect

array([[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
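The test data is vectorized with transform rather than fit_transform, so the vocabulary learned from the training data stays fixed and any word or word pair outside it is ignored. A minimal sketch of that idea follows. It assumes the cleaned first test sentence is 'awesom love meal happi' (the notebook does not print Xt_clean) and uses only a small slice of the feature names. The function name vectorize is hypothetical.

```python
def vectorize(text, vocabulary):
    """Count known unigrams and bigrams; features outside vocabulary are ignored."""
    words = text.split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    index = {feature: i for i, feature in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for feature in words + bigrams:
        if feature in index:
            vector[index[feature]] += 1
    return vector

# A small slice of the fitted feature names, for illustration only.
vocabulary = ["awesom", "awesom tasti", "bad", "happi", "meal"]

# Assumed cleaned form of the first test sentence (not shown in the notebook).
vector = vectorize("awesom love meal happi", vocabulary)
print(vector)  # [1, 0, 0, 1, 1]
```

The word 'love' never appeared in training, so it contributes nothing, which is consistent with the first row of Xt_vect having only three non-zero entries.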

#Naive Bayes
For this sentiment analysis we are using Multinomial Naive Bayes.

In [None]:
from sklearn.naive_bayes import MultinomialNB
#we are creating an instance of the classifier
mn = MultinomialNB()
#we fit our model on the training vectors and labels
mn.fit(X_vec, Y_train)
#fit returns the fitted estimator, which is what the output below shows

MultinomialNB()
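To make the model less of a black box, here is a from-scratch sketch of multinomial Naive Bayes with additive (Laplace) smoothing. It is a simplification under stated assumptions, not scikit-learn's code: it uses single words only (no word pairs), alpha=1.0 to mirror the MultinomialNB default, and assumed cleaned forms of the two test sentences. The names train_nb and predict_nb are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Fit a unigram multinomial Naive Bayes model with additive smoothing."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        # Prior: fraction of training documents that belong to class c.
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(word for doc in class_docs for word in doc.split())
        total = sum(counts.values()) + alpha * len(vocab)
        # Smoothed probability of each vocabulary word given class c.
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def predict_nb(model, doc):
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped, as with a fitted vocabulary.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Cleaned training sentences as printed in the notebook's X_clean output.
docs = ["awesom tasti food", "great food like lot",
        "happi end awesom act actress", "ador truli great experi",
        "bad upto hype", "could better", "sure disapoint meal"]
labels = [1, 1, 1, 1, 0, 0, 0]

model = train_nb(docs, labels)
# Assumed cleaned forms of the two test sentences.
predictions = [predict_nb(model, d) for d in ["awesom love meal happi", "saw movi bad"]]
print(predictions)  # [1, 0]
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that is absent from one class from forcing that class's probability to zero.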

Now we perform our prediction on the test data.

In [None]:
#we are passing the test vector value to make a prediction
y_pred = mn.predict(Xt_vect)
y_pred

array([1, 0])

In the output y_pred, we see that the array is [1, 0].

The first sentence in our testing data is a positive sentence, and the first value of the array is 1.

The second sentence in our testing data is a negative sentence, and the second value of the array is 0.

As we defined in our dataset, positive -> 1 and negative -> 0.

So the classifier labels both test sentences correctly, although two sentences are too few to measure its accuracy.
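The pairing of predictions and sentences described above can also be printed directly. This is a small sketch: the y_pred values are copied from the notebook output rather than recomputed, and the name LABEL_NAMES is hypothetical.

```python
LABEL_NAMES = {1: "Positive", 0: "Negative"}

X_test = ["It was awesome and I loved the meal and now I am happy",
          " I saw the movie  and it was bad"]
y_pred = [1, 0]  # values copied from the notebook output

# Pair each test sentence with the readable name of its predicted class.
report = [(sentence.strip(), LABEL_NAMES[int(label)])
          for sentence, label in zip(X_test, y_pred)]
for sentence, name in report:
    print(f"{name}: {sentence}")
```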