<h1><center>Natural Language Processing</center></h1>

<h3><center>NLP 08</center></h3>

![hM0xGrmJw](https://miro.medium.com/max/720/1*PGB0w1JZslqA-hM0xGrmJw.gif)

# Topics
1.	Introduction to Natural Language Processing
2.	Why learn NLP?
3.	Let's start playing with Python!
4.	Text Wrangling and Cleansing
 - Sentence splitter
 - Tokenization
 - Stemming         
 - Lemmatization    
 - Stop word removal
 - Diving into NLTK
5.	Vectorizing with Python
 - Count Vectorizer 
 - TF-IDF Vectorizer
6.	Modelling with Python <---------------------------------------------------- **This is where we are**
 - Classification
 - Clustering
 - Sentiment Analysis


# Reading data with pandas 

In [125]:
import pandas as pd
df = pd.read_csv('SPAM text.csv')

In [126]:
df.head()

| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


# Writing a function to clean out stopwords

In [127]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

my_stopwords = stopwords.words('english')


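def datacleaning(mystring):
    # Step 1 - Tokenize the string
    my_tokenized_string = word_tokenize(mystring)
    # ----------------- End of step 1 ----------------- #
    temp = []
    # Step 2 - Keep only the tokens that are not stopwords
    # (the check is case-sensitive, so a capitalised token like 'I' is kept)
    for i in my_tokenized_string:
        if i not in my_stopwords:
            temp.append(i)

    # Step 3 - Stitch the remaining tokens back together and return them
    my_new_string = ' '.join(temp)
    return my_new_string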

In [128]:
# Executing our function on our column
df['Clean_Message'] = df['Message'].apply(datacleaning)

In [129]:
df.head()

| | Category | Message | Clean_Message |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | Go jurong point , crazy .. Available bugis n g... |
| 1 | ham | Ok lar... Joking wif u oni... | Ok lar ... Joking wif u oni ... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | Free entry 2 wkly comp win FA Cup final tkts 2... |
| 3 | ham | U dun say so early hor... U c already then say... | U dun say early hor ... U c already say ... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | Nah I n't think goes usf , lives around though |
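Notice that `I` survives in row 4 even though NLTK's English stopword list contains `i`: the list is all lowercase and the `in` check is case-sensitive. Below is a minimal sketch of a case-insensitive variant. To stay self-contained it assumes a tiny hand-written stopword set and a plain `split()` in place of `stopwords.words('english')` and `word_tokenize`; `remove_stopwords_ci` is a hypothetical name, not part of the notebook.

```python
# Hypothetical mini stopword set standing in for stopwords.words('english')
demo_stopwords = {'i', 'he', 'to', 'the', 'until', 'only'}


def remove_stopwords_ci(text, stop_set=demo_stopwords):
    # Plain split() stands in for word_tokenize to keep the sketch dependency-free
    tokens = text.split()
    # Lowercase each token only for the comparison, so the original casing is kept
    kept = [tok for tok in tokens if tok.lower() not in stop_set]
    return ' '.join(kept)


print(remove_stopwords_ci('Nah I think he goes to usf'))
```

With the real NLTK pieces, the same idea would amount to comparing `i.lower()` against `my_stopwords` inside `datacleaning`.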


<h1><center>Remember</center></h1>


# Vectorizing string

In [130]:
from sklearn.feature_extraction.text import CountVectorizer

In [131]:
# We initialise it
vectorizer = CountVectorizer()

In [132]:
vectorizer.fit(['Whatever text we want','We give it as list'])

CountVectorizer()

### Getting all the vocabulary out.

In [133]:
vectorizer.get_feature_names_out()

array(['as', 'give', 'it', 'list', 'text', 'want', 'we', 'whatever'],
      dtype=object)

### Getting the vectors

In [134]:
vector = vectorizer.transform(['Whatever text we want','We give it as list'])

In [135]:
vector

<2x8 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

In [136]:
vector.toarray()

array([[0, 0, 0, 0, 1, 1, 1, 1],
       [1, 1, 1, 1, 0, 0, 1, 0]])

In [137]:
vectorizer.transform(['football']).toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0]])
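To see roughly what `CountVectorizer` does with its default settings, here is a pure-Python sketch: lowercase the text, keep tokens of two or more word characters (the default `token_pattern`), sort the vocabulary, and count. `tokenize`, `fit_vocabulary` and `to_counts` are hypothetical helper names; the real class does much more (sparse output, n-grams, and so on).

```python
import re

# Same regex as CountVectorizer's default token_pattern
TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def tokenize(text):
    # Lowercase first, then pull out tokens of 2+ word characters
    return re.findall(TOKEN_PATTERN, text.lower())


def fit_vocabulary(docs):
    # The vocabulary is the sorted set of all tokens seen during fitting
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def to_counts(docs, vocabulary):
    index = {word: i for i, word in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            # Words that were never seen during fitting are simply ignored
            if tok in index:
                row[index[tok]] += 1
        rows.append(row)
    return rows


docs = ['Whatever text we want', 'We give it as list']
vocab = fit_vocabulary(docs)
print(vocab)
print(to_counts(docs, vocab))
print(to_counts(['football'], vocab))
```

This reproduces the vocabulary and the two count vectors shown above, and it shows why `'football'` becomes a row of zeros: the word is not in the fitted vocabulary.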

<h1><center>Doing the same to a DataFrame Column</center></h1>

In [138]:
# We initialise it
vectorizer = CountVectorizer()

In [139]:
vectorizer.fit(df['Clean_Message'])

CountVectorizer()

### Getting all the vocabulary out.

In [140]:
vectorizer.get_feature_names_out()

array(['00', '000', '000pes', ..., 'èn', 'ú1', '〨ud'], dtype=object)

In [141]:
vectorizer.get_feature_names_out()[0:100]

array(['00', '000', '000pes', '008704050406', '0089', '0121',
       '01223585236', '01223585334', '0125698789', '02', '0207',
       '02072069400', '02073162414', '02085076972', '021', '03', '04',
       '0430', '05', '050703', '0578', '06', '07', '07008009200',
       '07046744435', '07090201529', '07090298926', '07099833605',
       '07123456789', '0721072', '07732584351', '07734396839',
       '07742676969', '07753741225', '0776xxxxxxx', '07781482378',
       '07786200117', '077xxx', '078', '07801543489', '07808',
       '07808247860', '07808726822', '07815296484', '07821230901',
       '078498', '07880867867', '0789xxxxxxx', '07946746291',
       '0796xxxxxx', '07973788240', '07xxxxxxxxx', '08', '0800',
       '08000407165', '08000776320', '08000839402', '08000930705',
       '08000938767', '08001950382', '08002888812', '08002986030',
       '08002986906', '08002988890', '08006344447', '0808', '08081263000',
       '08081560665', '0825', '083', '0844', '08448350055', '08448714184'

<h2><center>This is trash</center></h2>

![](https://media.giphy.com/media/ZNnBcUZinjJoGdZQkL/giphy.gif)

---
What are these column names? We can't move forward with this.

We need to fix this!

![](https://media.tenor.com/2HiYJsN72X8AAAAC/despicable-me-agnes.gif)


# Look at what happens if we don't fix it

In [142]:
len(vectorizer.get_feature_names_out())

8683

### Getting the vectors

In [143]:
vector = vectorizer.transform(df['Clean_Message'])

In [144]:
vector

<5572x8683 sparse matrix of type '<class 'numpy.int64'>'
	with 52566 stored elements in Compressed Sparse Row format>

In [145]:
vector.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
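The shape and stored-element count printed above explain why the matrix is kept in sparse form. Here is a quick back-of-the-envelope check, using only those printed numbers and assuming 8 bytes per `int64` cell:

```python
rows, cols = 5572, 8683        # shape reported by the sparse matrix
stored = 52566                 # stored (non-zero) entries reported by the sparse matrix

total_cells = rows * cols
density = stored / total_cells
dense_megabytes = total_cells * 8 / 1_000_000   # int64 = 8 bytes per cell

print(total_cells)
print(f'{density:.4%} of the cells are non-zero')
print(f'about {dense_megabytes:.0f} MB as a dense array')
```

Only around 0.1% of the cells hold a count, so calling `toarray()` mostly materialises zeros.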

<h2><center>Unacceptable</center></h2>

![](https://media.giphy.com/media/H1YGzkKcin3uPaUb58/giphy-downsized-large.gif)

# Remove the numbers

### Approach 1

In [146]:
numbers = '0123456789'

In [147]:
a = '100'

In [148]:
a.isdigit()

True

In [149]:
b = 'random'

In [150]:
b.isdigit()

False
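`str.isdigit()` is only `True` when the string is non-empty and every character in it is a digit, which matters for messy SMS text. A few quick checks (the sample strings are chosen for illustration):

```python
samples = ['100', 'this99', '08452810075over18', '3.14', '-5', '']

for s in samples:
    # Only a non-empty string made up purely of digits returns True
    print(repr(s), s.isdigit())
```

Only the first sample counts as a number; anything with a letter, a dot or a sign attached does not.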

### Approach 2

In [151]:
for i in numbers:
    print(i)

0
1
2
3
4
5
6
7
8
9


In [152]:
'@' in '0123456789@#$%^&*)'

True
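The membership check above is the building block for Approach 2: walk through a string character by character and keep only the characters that are not in `numbers`. Here is a minimal sketch, assuming we want to drop every digit wherever it appears (`strip_digit_chars` is a hypothetical name):

```python
numbers = '0123456789'


def strip_digit_chars(text):
    # Keep every character that is not one of the ten digits
    return ''.join(ch for ch in text if ch not in numbers)


print(strip_digit_chars('this99 is another num123ber'))
```

A token made only of digits leaves an empty gap behind, so you may want to tidy up the whitespace afterwards.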

In [153]:
# test = 'This is a test string'

## So, let's write a function that removes numbers from our text

In [154]:
def remove_numbers(x):
    # Step 1 - Tokenize the string
    my_tokenized_string = word_tokenize(x)    
    # ----------------- End of step 1 ----------------- #
    temp = []
    # Step 2 - Loop over your tokens
    for i in my_tokenized_string:
        # Step 3 - Check if your words are numbers or not
        if i.isdigit():
            pass
        else:
            # Step 4 - If they are not numbers, save them
            temp.append(i)

    # Step 5 - Once saved, stitch them back together
    my_new_string = ' '.join(temp)
    
    # Step 6 - Return the stitched sentence back
    return my_new_string

Let's see if our function worked.

In [155]:
remove_numbers('This is a test string 100 and this99 is another num123ber 1235')

'This is a test string and this99 is another num123ber'

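Works the way we would expect: standalone numbers such as `100` and `1235` are gone, while tokens that mix letters and digits (`this99`, `num123ber`) are left alone.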

### Let's clean our columns

In [156]:
df['Number_Clean_Message'] = df['Clean_Message'].apply(remove_numbers)

In [157]:
df

| | Category | Message | Clean_Message | Number_Clean_Message |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | Go jurong point , crazy .. Available bugis n g... | Go jurong point , crazy .. Available bugis n g... |
| 1 | ham | Ok lar... Joking wif u oni... | Ok lar ... Joking wif u oni ... | Ok lar ... Joking wif u oni ... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | Free entry 2 wkly comp win FA Cup final tkts 2... | Free entry wkly comp win FA Cup final tkts 21s... |
| 3 | ham | U dun say so early hor... U c already then say... | U dun say early hor ... U c already say ... | U dun say early hor ... U c already say ... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | Nah I n't think goes usf , lives around though | Nah I n't think goes usf , lives around though |
| ... | ... | ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... | This 2nd time tried 2 contact u. U £750 Pound ... | This 2nd time tried contact u. U £750 Pound pr... |
| 5568 | ham | Will ü b going to esplanade fr home? | Will ü b going esplanade fr home ? | Will ü b going esplanade fr home ? |
| 5569 | ham | Pity, \* was in mood for that. So...any other s... | Pity , \* mood . So ... suggestions ? | Pity , \* mood . So ... suggestions ? |
| 5570 | ham | The guy did some bitching but I acted like i'd... | The guy bitching I acted like 'd interested bu... | The guy bitching I acted like 'd interested bu... |


### Let's do the vectorization again

In [158]:
# We initialise it
vectorizer2 = CountVectorizer()

In [159]:
vectorizer2.fit(df['Number_Clean_Message'])

CountVectorizer()

### Getting all the vocabulary out.

In [160]:
vectorizer2.get_feature_names_out()

array(['00', '000', '000pes', ..., 'èn', 'ú1', '〨ud'], dtype=object)

In [161]:
vectorizer2.get_feature_names_out()[0:100]

array(['00', '000', '000pes', '02', '0207', '03', '04', '0430', '05',
       '06', '07', '0776xxxxxxx', '07781482378', '077xxx', '07880867867',
       '0789xxxxxxx', '07946746291', '0796xxxxxx', '07xxxxxxxxx', '08',
       '083', '08452810075over18', '08700435505150p', '08700469649',
       '08700621170150p', '08701417012150p', '0870141701216',
       '08702840625', '08704050406', '08704439680ts', '0870737910216yrs',
       '0870753331018', '0871', '087123002209am', '08712400602450p',
       '0871277810710p', '0871277810910p', '08714342399',
       '087147123779am', '0871750', '08717890890', '08718720201',
       '087187262701', '08718727870', '08718727870150ppm', '09',
       '09065171142', '09066649731from', '0a', '0quit', '10', '100',
       '1000', '1000call', '1000s', '100p', '100percent', '100txt',
       '10am', '10k', '10p', '10ppm', '10th', '11', '114', '118p',
       '11mths', '11pm', '12', '120p', '125', '1250', '125gift',
       '12hours', '12hrs', '12mths', '13', '14', '14

In [162]:
len(vectorizer2.get_feature_names_out())

8234
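One way to measure how much numeric residue is left is to count the vocabulary entries that still contain at least one digit. The sketch below runs on a handful of entries copied from the output above plus a few ordinary words added for contrast; on the real data you would pass `vectorizer2.get_feature_names_out()` instead. `has_digit` is a hypothetical helper.

```python
# First five entries come from the vocabulary output; the last three are
# ordinary words added for contrast
sample_features = ['00', '000pes', '0776xxxxxxx', '08452810075over18',
                   '10am', 'free', 'entry', 'win']


def has_digit(word):
    # True as soon as any single character is a digit
    return any(ch.isdigit() for ch in word)


with_digits = [w for w in sample_features if has_digit(w)]
print(len(with_digits), 'of', len(sample_features), 'entries contain a digit')
```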

# So, we went from 8683 columns to 8234

![](https://media.giphy.com/media/nbNWgtnMgIYpUSy3e9/giphy.gif)


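## That was a complete waste of my time.
Well, well. `isdigit()` only drops tokens that consist entirely of digits, so we lost just 449 columns and plenty of number-laden tokens such as `08452810075over18` slipped through. Looks like I will have to follow Approach 2 and nuke the whole thing: all the numbers gotta go.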


<h2><center>But wait</center></h2>

Why should I have all the fun?
- I have you for that.
- Don't we all want to have fun!

![](https://media.giphy.com/media/cXblnKXr2BQOaYnTni/giphy.gif)

Well, time for you to follow Approach 2 of removing numbers:
- Write a function
- Implement the function on your column
- Vectorize your new column
- See how many columns you have. 

![](https://media.tenor.com/tLMOOEUZ6tsAAAAC/good-luck-spongebob.gif)

<h1><center>Basic Terminology of Machine Learning</center></h1>

- Train-Test Split
- Accuracy
- RMSE
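
As a preview, here is a small pure-Python sketch of the three terms on toy data. All labels and numbers are invented for illustration, and `train_test_split_simple`, `accuracy` and `rmse` are hypothetical helpers; in practice a library such as scikit-learn provides equivalents.

```python
import math
import random


def train_test_split_simple(items, test_fraction=0.25, seed=42):
    # Shuffle a copy, then hold out the last `test_fraction` of the items
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    split = len(shuffled) - int(len(shuffled) * test_fraction)
    return shuffled[:split], shuffled[split:]


def accuracy(y_true, y_pred):
    # Share of predictions that match the true label (used for classification)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def rmse(y_true, y_pred):
    # Square root of the mean squared difference (used for regression)
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


train, test = train_test_split_simple(list(range(8)))
print(len(train), len(test))
print(accuracy(['ham', 'spam', 'ham', 'ham'], ['ham', 'spam', 'spam', 'ham']))
print(rmse([3.0, 5.0], [1.0, 5.0]))
```

Accuracy suits a spam/ham classifier like the one we are building towards, while RMSE is meant for numeric predictions.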