## NLP in Machine Learning

In [1]:
import pandas as pd
data=pd.read_csv("SMSSpamCollection.csv",sep="\t",header=None,names=["labels","messages"])

In [2]:
data.head()

| | labels | messages |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
data.shape

(5572, 2)

#### Text Classification

Text classification needs a few extra modules, such as `nltk`, `spaCy`, and `re` (regular expressions).
- **NLTK** - Natural Language Toolkit

In [4]:
import nltk
import re

In [5]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Rakesh_PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [7]:
data["messages"][0]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

**We need to remove stopwords from each sentence to clean it.**

- A regular expression (sometimes called a rational expression) is a sequence of characters that defines a search pattern. It is used mainly for pattern matching on strings, i.e. "find" and "find and replace" operations.
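
As a small illustration of these "find" and "find and replace" operations, the sketch below uses only the standard-library `re` module. The sample sentence and the variable names are made up for this example.

```python
import re

# Sample text, made up for this illustration
text = "Free entry in 2 a wkly comp to win FA Cup!"

# [^a-zA-Z] is a negated character class: it matches any single character
# that is NOT an ASCII letter (digits, punctuation and spaces included).
letters_only = re.sub(r"[^a-zA-Z]", " ", text)

# \s+ matches a run of one or more whitespace characters, so this
# collapses the gaps left behind by the previous substitution.
collapsed = re.sub(r"\s+", " ", letters_only).strip()

# findall returns every non-overlapping match as a list of strings.
digits = re.findall(r"\d+", text)

print(collapsed)
print(digits)
```

Collapsing whitespace is optional here, because `str.split()` with no arguments already ignores repeated spaces.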

In [8]:
# replace every character that is not a letter with a space
re.sub('[^a-zA-Z]',' ',data["messages"][0])

'Go until jurong point  crazy   Available only in bugis n great world la e buffet    Cine there got amore wat   '

In [9]:
# convert to lowercase
re.sub('[^a-zA-Z]',' ',data["messages"][0]).lower()

'go until jurong point  crazy   available only in bugis n great world la e buffet    cine there got amore wat   '

In [10]:
words = re.sub('[^a-zA-Z]',' ',data["messages"][0]).lower().split()

In [11]:
words

['go',
 'until',
 'jurong',
 'point',
 'crazy',
 'available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet',
 'cine',
 'there',
 'got',
 'amore',
 'wat']

In [12]:
[word for word in words if word not in stopwords.words('english')]

['go',
 'jurong',
 'point',
 'crazy',
 'available',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet',
 'cine',
 'got',
 'amore',
 'wat']
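
`stopwords.words('english')` returns a plain list, so the comprehension above rebuilds that list and scans it for every word. A minimal sketch of the same filtering step with a set is shown below. The tiny `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical stand-ins so that the example runs without NLTK data; in the notebook the set would be built from the NLTK list.

```python
# Hypothetical, tiny stop-word set used only for this illustration.
# In the notebook this would be: set(stopwords.words('english'))
STOP_WORDS = frozenset({"until", "only", "in", "there", "the", "a"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving their order."""
    return [tok for tok in tokens if tok not in stop_words]


tokens = ["go", "until", "jurong", "point", "only", "in", "bugis"]
filtered = remove_stop_words(tokens)
print(filtered)
```

Set membership tests take constant time on average, which matters once the same filter runs over all 5572 messages.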

> **Porter stemmer** is a suffix stripping algorithm. It uses predefined rules to convert words into their root forms. The algorithm removes and replaces well-known suffixes of English words.
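
To give a feel for rule-based suffix stripping, here is a deliberately simplified sketch. It is not the Porter algorithm: the rule table, the `toy_stem` name and the `min_stem_len` guard are assumptions made for illustration, and the real stemmer applies many more rules in several phases.

```python
# Toy suffix rules, tried in order; the first rule that applies wins.
# This is NOT the full Porter rule set, only an illustration of the idea.
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),   # keep a double "s" so that "class" is left alone
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def toy_stem(word, min_stem_len=3):
    """Strip or replace one known suffix if enough of the word remains."""
    for suffix, replacement in SUFFIX_RULES:
        remaining = len(word) - len(suffix)
        if word.endswith(suffix) and remaining >= min_stem_len:
            return word[:remaining] + replacement
    return word


for w in ["caresses", "ponies", "wanted", "walking", "class", "go"]:
    print(w, "->", toy_stem(w))
```

As with the real stemmer, the result is a root form that need not be a dictionary word, for example `poni` for `ponies`.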

In [13]:
ps = PorterStemmer()

In [14]:
[ps.stem(word) for word in words if word not in stopwords.words('english')]

['go',
 'jurong',
 'point',
 'crazi',
 'avail',
 'bugi',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet',
 'cine',
 'got',
 'amor',
 'wat']

In [15]:
" ".join([ps.stem(word) for word in words if word not in stopwords.words('english')])

'go jurong point crazi avail bugi n great world la e buffet cine got amor wat'

#### Regular Expression

In [16]:
string = "hello, world"
pattern = r"hello"

result = re.search(pattern,string)

In [17]:
result

<re.Match object; span=(0, 5), match='hello'>

In [18]:
if result:
    print("pattern found")
else:
    print("pattern not found")

pattern found


In [19]:
string = "hello, world"
pattern = r"world"
replacement = "python"

re.sub(pattern,replacement,string)

'hello, python'

In [20]:
corpus = []
# build the stop-word set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))
for i in range(len(data)):
    review = re.sub('[^a-zA-Z]', " ", data['messages'][i])  # keep letters only
    review = review.lower().split()                          # lowercase and tokenize
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(" ".join(review))

In [21]:
data['messages'][40]

'Pls go ahead with watts. I just wanted to be sure. Do have a great weekend. Abiola'

In [22]:
corpus[40]

'pl go ahead watt want sure great weekend abiola'

# Encoding Techniques
## Bag of Words
> Bag Of Words - **sklearn.feature_extraction.text.CountVectorizer()**

`CountVectorizer` is a class provided by the scikit-learn library. It learns a vocabulary from a collection of texts and transforms each text into a vector whose entries are the counts of each vocabulary word in that text.

```class sklearn.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)```
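
To make the counting idea concrete, here is a minimal pure-Python sketch of a bag-of-words encoder. It is an illustration, not scikit-learn's implementation, and the helper names `tokenize`, `fit_vocabulary` and `transform` are hypothetical. It assumes the defaults visible in the signature (lowercasing, tokens of two or more word characters) and an alphabetically indexed vocabulary.

```python
import re
from collections import Counter

# Same idea as the default token_pattern: tokens of 2+ word characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and return its tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Map every distinct token to a column index, in alphabetical order."""
    terms = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}


def transform(documents, vocabulary):
    """Encode each document as a list of per-term counts."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for term, n in counts.items():
            if term in vocabulary:      # ignore words unseen during fitting
                row[vocabulary[term]] = n
        rows.append(row)
    return rows


docs = ["free prize win win", "call me later", "win a free call"]
vocab = fit_vocabulary(docs)
matrix = transform(docs, vocab)
print(vocab)
print(matrix)
```

Note that the single-character word `a` is dropped by the token pattern, and word order inside each document is lost: only the counts remain.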

In [27]:
from sklearn.feature_extraction.text import CountVectorizer

document = ["One Geek helps Two Geeks",
            "Two Geeks help Four Geeks",
            "Each Geek helps many other Geeks at GeeksforGeeks"]

# Create a Vectorizer Object
vectorizer = CountVectorizer()

vectorizer.fit(document)

# Printing the identified Unique words along with their indices
print("Vocabulary: ", vectorizer.vocabulary_)

# Encode the Document
vector = vectorizer.transform(document)

# Summarizing the Encoded Texts
print("Encoded Document is:")

print(vector.toarray())


Vocabulary:  {'one': 9, 'geek': 3, 'helps': 7, 'two': 11, 'geeks': 4, 'help': 6, 'four': 2, 'each': 1, 'many': 8, 'other': 10, 'at': 0, 'geeksforgeeks': 5}
Encoded Document is:
[[0 0 0 1 1 0 0 1 0 1 0 1]
 [0 0 1 0 2 0 1 0 0 0 0 1]
 [1 1 0 1 1 1 0 1 1 0 1 0]]


In [24]:
from sklearn.feature_extraction.text import CountVectorizer

In [25]:
cv = CountVectorizer()

In [26]:
X = cv.fit_transform(corpus).toarray()

In [28]:
X.shape
# number of features (columns) = size of the vocabulary

(5572, 6296)

In [32]:
# limit the number of features
cv = CountVectorizer(max_features = 2500) # keep the 2500 most frequent terms across the corpus

In [33]:
X = cv.fit_transform(corpus).toarray()

In [34]:
X.shape

(5572, 2500)
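
The effect of `max_features` can be sketched in a few lines of plain Python: count every term across the whole corpus and keep only the `k` most frequent ones. The `top_k_vocabulary` helper and the alphabetical tie-breaking are assumptions made for this illustration and may differ from what scikit-learn does internally.

```python
from collections import Counter


def top_k_vocabulary(tokenized_docs, k):
    """Keep the k most frequent terms and index them alphabetically."""
    totals = Counter(tok for doc in tokenized_docs for tok in doc)
    # Sort by descending corpus frequency; break ties alphabetically
    # (an assumption of this sketch, chosen to keep the result deterministic).
    kept = sorted(totals, key=lambda term: (-totals[term], term))[:k]
    return {term: idx for idx, term in enumerate(sorted(kept))}


tokenized = [["win", "free", "win"], ["free", "call"], ["later", "win"]]
print(top_k_vocabulary(tokenized, 2))
print(top_k_vocabulary(tokenized, 3))
```

Rare terms are dropped first, which shrinks the feature matrix at the cost of ignoring words that appear in only a few messages.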

In [35]:
data.shape

(5572, 2)

In [36]:
data.labels

0        ham
1        ham
2       spam
3        ham
4        ham
        ... 
5567    spam
5568     ham
5569     ham
5570     ham
5571     ham
Name: labels, Length: 5572, dtype: object

In [37]:
y = pd.get_dummies(data['labels'],drop_first = True)

In [38]:
y

| | spam |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 5567 | 1 |
| 5568 | 0 |
| 5569 | 0 |
| 5570 | 0 |


In [39]:
from sklearn.model_selection import train_test_split

In [40]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state= 10)

In [41]:
# we will use multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

model = MultinomialNB()
# y_train is a one-column DataFrame; flatten it to 1-D to avoid a DataConversionWarning
model.fit(X_train, y_train.values.ravel())

print(f"Training data Accuracy:  {model.score(X_train,y_train)}")

y_pred = model.predict(X_test)

print(f"Test data Accuracy: {accuracy_score(y_test,y_pred)}")


Training data Accuracy:  0.9899497487437185
Test data Accuracy: 0.9770279971284996
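
For intuition about what the multinomial Naive Bayes model computes, here is a minimal pure-Python sketch with Laplace (add-one) smoothing. It is not scikit-learn's implementation: the function names, the dictionary-based model and the four-message training set are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log likelihoods."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    model = {"log_prior": {}, "log_likelihood": {}}
    for label, n_docs in class_counts.items():
        model["log_prior"][label] = math.log(n_docs / len(labels))
        # Denominator: tokens seen in this class plus alpha per vocabulary term
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            tok: math.log((token_counts[label][tok] + alpha) / total)
            for tok in vocab
        }
    return model


def predict(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        likelihood = model["log_likelihood"][label]
        # Tokens outside the training vocabulary are ignored
        scores[label] = log_prior + sum(
            likelihood[tok] for tok in doc if tok in likelihood
        )
    return max(scores, key=scores.get)


# Tiny made-up training set of already tokenized messages
train_docs = [
    ["free", "prize", "win"],
    ["win", "cash", "free"],
    ["see", "you", "later"],
    ["call", "you", "later"],
]
train_labels = ["spam", "spam", "ham", "ham"]

nb = train_multinomial_nb(train_docs, train_labels)
print(predict(nb, ["free", "win"]))
print(predict(nb, ["see", "you"]))
```

Working with log probabilities turns the product of many small likelihoods into a sum, which avoids numerical underflow on long messages.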


In [42]:
data="Thanks gaurav for the update. This provides clear visibility for the next release."

In [43]:
data

'Thanks gaurav for the update. This provides clear visibility for the next release.'

In [44]:
review=re.sub('[^a-zA-Z]'," ",data)
review=review.lower()
review=review.split()
review=[ps.stem(word) for word in review if word not in stopwords.words("english")]
review=" ".join(review)
    
corpus=[]
corpus.append(review)

In [45]:
corpus

['thank gaurav updat provid clear visibl next releas']

In [46]:
tfdata=cv.transform(corpus).toarray()

In [47]:
model.predict(tfdata)[0]

1