## Preprocessing Steps for Natural Language Processing (NLP)

Natural Language Processing (NLP) is the branch of data science that deals with text data.

Alongside numerical data, text data is widely available and can be used to analyze and solve business problems. Before it can be used for analysis or prediction, however, it has to be processed.

Text preprocessing prepares the raw text for model building and is the first step of most NLP projects.

It involves cleaning and transforming unstructured text so that it is ready for analysis. Typical steps include removing punctuation, removing URLs, removing stop words, lowercasing, tokenization, stemming, and lemmatization.

### SMS Spam Data

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,572 English messages, each tagged as either ham (legitimate) or spam.

The file contains one message per line. Each line has two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

In [1]:
# Detect the file's encoding so that the CSV can be read correctly
import chardet

def detect_encoding(csv_file_path):
    '''Read the first chunk of the file and guess its encoding.

    The file is opened in 'rb' ("read binary") mode so that the raw bytes
    are returned without being decoded first.'''

    with open(csv_file_path, 'rb') as f:
        # Read only the first 1024 bytes; a short sample can miss
        # non-ASCII characters that appear later in the file
        rawdata = f.read(1024)
    # chardet.detect returns a dict that includes 'encoding' and 'confidence'
    result = chardet.detect(rawdata)
    return result['encoding']


encoding_type = detect_encoding(r'C:\Users\ELITEBOOK 840 G3\OneDrive\Desktop\REGEX and NLP\spam.csv')
print(encoding_type)

ISO-8859-1


chardet is a library that guesses the character encoding of a file from its raw bytes.

Specifying the encoding: CSV files can be saved in different encodings, so it is important to pass the correct one when reading the file. If the encoding is missing or wrong, characters may be misinterpreted, which leads to corrupted text or decoding errors.

Passing encoding="ISO-8859-1" makes pandas read "spam.csv" with the encoding that chardet detected, so the bytes in the file are decoded consistently.
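To see why the encoding matters, the short sketch below decodes the same bytes in two ways. The sample text is made up for illustration and is not taken from spam.csv.

```python
# Hypothetical sample: the pound sign is a single byte (0xA3) in ISO-8859-1
raw = "Win £1000 now".encode("iso-8859-1")

# Decoding with the matching encoding recovers the original text
decoded_ok = raw.decode("iso-8859-1")
print(decoded_ok)

# Decoding with the wrong encoding fails, because 0xA3 on its own
# is not a valid UTF-8 sequence
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print(utf8_failed)
```

pandas raises the same kind of UnicodeDecodeError when read_csv is given an encoding that cannot decode the file.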

In [2]:
import pandas as pd

# load the dataset and view what the data looks like
data = pd.read_csv(r'C:\Users\ELITEBOOK 840 G3\OneDrive\Desktop\REGEX and NLP\spam.csv', encoding ='ISO-8859-1')
data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [3]:
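# Set display option to show the entire content of each cell
pd.set_option('display.max_colwidth', None)
# Keep only the label and text columns; .copy() makes df an independent
# DataFrame, so adding columns later does not trigger SettingWithCopyWarning
df = data.iloc[:, :2].copy()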
df.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


In [4]:
df.shape

(5572, 2)

The data has 5,572 rows and 2 columns.

Let's check how the dependent variable is distributed between spam and ham. The dependent variable is the label column v1, which is what the model will predict.

In [5]:
#checking the count of the dependent variable
df['v1'].value_counts()

v1
ham     4825
spam    747 
Name: count, dtype: int64
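The counts show that the classes are imbalanced. As a rough sketch, the shares can be computed directly from the two counts above (in pandas, df['v1'].value_counts(normalize=True) should give the same proportions):

```python
# Counts copied from the value_counts() output above
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
shares = {label: round(n / total, 3) for label, n in counts.items()}
print(total)
print(shares)
```

Roughly 87% of the messages are ham, so a model that always predicts ham would already reach about that accuracy. This is worth keeping in mind when reading the evaluation metrics later.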

### Steps to Clean the Data

#### 1. Punctuation Removal
In this step, all punctuation is removed from the text. Python's string library provides a predefined string of punctuation characters, string.punctuation: ‘!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~’

In [6]:
# re provides regular expressions; string provides the punctuation characters
import re
import string

#defining the function to remove punctuation
def remove_punctuation_from_text(text):
    """ 
    Remove punctuation characters from a text string.

    Parameters:
    text (str): Input text string containing punctuation.

    Returns:
    str: Text string with all punctuation characters removed.
    
    """
    # Define a regex pattern to match punctuation characters
    punctuation_pattern = '['+ re.escape(string.punctuation) +']'
    # Use regex to substitute punctuation characters with an empty string
    punctuation_free_text = re.sub(punctuation_pattern, '', text)
    return punctuation_free_text

#storing the punctuation-free text
df['clean_msg'] = df['v2'].apply(remove_punctuation_from_text)
df.head()


Unnamed: 0,v1,v2,clean_msg
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though


string.punctuation is a string provided by Python that contains all punctuation characters.

re.escape() is a function that escapes special characters in a string so they can be treated as literal characters in a regular expression.

The pattern [...] specifies a character set in a regular expression, where the characters inside the square brackets are the characters to match.

re.sub() replaces every match of a regular expression in a string; here each punctuation character is replaced with an empty string.
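The pieces described above can be tried on a single message. This is a minimal sketch that reuses the same pattern on one of the sample rows:

```python
import re
import string

# Character class that matches any single punctuation character;
# re.escape keeps characters such as ']' and '\' from being read as regex syntax
punctuation_pattern = "[" + re.escape(string.punctuation) + "]"

sample = "Ok lar... Joking wif u oni..."
cleaned = re.sub(punctuation_pattern, "", sample)
print(cleaned)  # Ok lar Joking wif u oni
```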

In [None]:
#Method 2 Try it!

In [None]:
#defining the function to remove punctuation
#def remove_punctuation(text):
    #"""
    #This function takes a text string as input and removes all punctuation characters from it.
    #It uses a list comprehension to iterate over each character in the input text and
    #keeps only the characters that are not present in string.punctuation.
    #"""
    #punctuationfree = "".join([i for i in text if i not in string.punctuation])
    #return punctuationfree

#storing the punctuation-free text
#df['clean_msg'] = df['v2'].apply(lambda x: remove_punctuation(x))
#df.head()

The list comprehension iterates over each character i in the text and keeps only the characters that are not present in string.punctuation. In other words, it builds a new list containing only the non-punctuation characters of text.

The join() method is then used to concatenate the characters from the filtered list into a single string. The empty string "" before the join() method specifies that we want to join the characters together without any separator between them.

The function is applied to each element of the 'v2' column in a DataFrame (df) using the apply method. The result is stored in a new column named 'clean_msg'.

The lambda function is used to apply remove_punctuation to each element of the 'v2' column.

All punctuation is removed from v2 and the result is stored in the clean_msg column.
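A third option, not used in this notebook, is str.translate, which deletes characters through a translation table. The sketch below compares it with the join-based approach; the function names are illustrative only.

```python
import string

def remove_punctuation_join(text):
    # Keep every character that is not a punctuation character
    return "".join(ch for ch in text if ch not in string.punctuation)

# With three arguments, str.maketrans maps each character of the
# third argument to None, which means "delete this character"
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def remove_punctuation_translate(text):
    return text.translate(_DELETE_PUNCT)

sample = "Nah I don't think he goes to usf, he lives around here though"
print(remove_punctuation_join(sample))
print(remove_punctuation_translate(sample))
```

Both calls print the same punctuation-free sentence, so the choice between them is mostly a matter of readability and speed.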

#### 2. Lowering the Text
Lowercasing is one of the most common text preprocessing steps: the text is converted to a single case, usually lower case. It is not required for every NLP problem, however, because in some cases lowercasing loses information.

For example, if a project deals with a person's emotions, words written in upper case can be a sign of frustration or excitement.

In [7]:
df['msg_lower']= df['clean_msg'].apply(lambda x: x.lower())
df.head()


Unnamed: 0,v1,v2,clean_msg,msg_lower
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s,free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,u dun say so early hor u c already then say
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,nah i dont think he goes to usf he lives around here though


All the text in the clean_msg column is converted to lower case and stored in the msg_lower column.

#### 3. Tokenization
Tokenization is the process of breaking down a piece of text into smaller units called tokens. These tokens could be words, phrases, symbols, or other meaningful elements depending on the context of the text and the requirements of the task at hand. 

In this step, the text is split into smaller units. Depending on the problem, we can use NLTK's sentence tokenization (sent_tokenize), its word tokenization (word_tokenize), or a regular expression. Here we use a regular expression, but word tokenization would also work.

In [8]:
#defining function for tokenization
import re
def tokenization(text):
    ''' This function takes a text string as input and returns a list of tokens (words) 
    extracted from the text.'''
    
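    # Raw string so that \w reaches the regex engine unchanged; the filter drops
    # the empty strings re.split returns when the text starts or ends with a separator
    tokens = [token for token in re.split(r'[^\w]+', text) if token]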
    return tokens
#applying function to the column
df['msg_tokenied']= df['msg_lower'].apply(lambda x: tokenization(x))
df.head()

Unnamed: 0,v1,v2,clean_msg,msg_lower,msg_tokenied
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s,free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]"
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,nah i dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]"


This function, tokenization, takes a text string as input and returns a list of tokens (words) extracted from the text.

The regular expression pattern '[^\w]+' matches one or more non-word characters (i.e., characters that are not letters, digits, or underscores).

The re.split() function splits the text wherever this pattern occurs, resulting in a list of tokens.
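One detail worth knowing: re.split returns empty strings when the text starts or ends with a separator. The sketch below shows this on a made-up string and compares it with re.findall, which collects the word runs directly.

```python
import re

sample = " hello, world! "

# Splitting on runs of non-word characters leaves empty strings at both ends
split_tokens = re.split(r"[^\w]+", sample)
print(split_tokens)  # ['', 'hello', 'world', '']

# Matching runs of word characters returns only the tokens themselves
found_tokens = re.findall(r"\w+", sample)
print(found_tokens)  # ['hello', 'world']
```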

**Try with word tokenization**

#### Stop Word Removal

Stop words are very common words that are usually removed from the text because they carry little or no meaning and add little value to the analysis.

The NLTK library provides a list of words that are considered stop words for English. Some of them are: [i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, ..., most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, ...]

The provided list does not have to be used as is; stop words should be chosen based on the project. For example, 'how' can be treated as a stop word for one model but may be important in another problem, such as analyzing customer queries. We can build a customized list of stop words for each problem.
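As a small illustration of a customized list, the sketch below starts from a tiny made-up base list (standing in for NLTK's English list) and keeps 'how' as a meaningful word.

```python
# Hypothetical base list for illustration; in the notebook this would be NLTK's list
base_stopwords = {"i", "me", "the", "a", "to", "how", "is"}

# For a customer-query task, 'how' carries meaning, so it is taken out of the list
custom_stopwords = base_stopwords - {"how"}

tokens = ["how", "to", "reset", "the", "password"]
kept = [token for token in tokens if token not in custom_stopwords]
print(kept)  # ['how', 'reset', 'password']
```

Storing the stop words in a set keeps each membership check fast, which helps on larger corpora.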

In [9]:
#importing nlp library
import nltk
# The stop word corpus must be available locally; run nltk.download('stopwords') once if it is not
#Stop words present in the library
stopwords = nltk.corpus.stopwords.words('english')
stopwords[0:10]


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [10]:
#defining the function to remove stopwords from tokenized text
import re
def remove_stopwords(tokens):
    # Define a regex pattern to match stopwords
    stopwords_pattern = r'\b(?:{})\b'.format('|'.join(stopwords))
    
    # Remove stopwords from the list of tokens using regex
    filtered_tokens = [token for token in tokens if not re.search(stopwords_pattern, token)]
    
    return filtered_tokens

# Apply remove_stopwords function to the 'msg_tokenied' column
df['msg_no_stopwords'] = df['msg_tokenied'].apply(remove_stopwords)
df.head()

Unnamed: 0,v1,v2,clean_msg,msg_lower,msg_tokenied,msg_no_stopwords
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s,free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]"
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,nah i dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]"


\b is a word boundary anchor that matches the position between a word character and a non-word character.

(?:{}) is a non-capturing group that contains the list of stopwords joined by the | (OR) operator.

The format() method inserts the list of stopwords into the pattern.
This pattern matches any word that is a stopword.

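_{} (curly braces)_: In regex syntax, curly braces specify an exact number or range of repetitions. In the pattern above, however, {} is a str.format() placeholder: it is replaced by the joined stopwords before the pattern reaches the regex engine.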

Capturing groups are used to capture and remember the text matched by the pattern inside the parentheses.

Non-capturing groups are defined using (?: ) syntax in a regular expression.
They are used when you want to match a pattern but don't need to capture the matched text as a separate group.

Non-capturing groups are especially useful when you want to use grouping for logical grouping or alternation (with |), but you don't want to create a capturing group.

re.search() checks whether a token matches the stopwords pattern. The list comprehension keeps only the tokens for which there is no match, which filters out the stopwords.
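The effect of the word boundaries is easier to see on a small case. The sketch below uses a short, hand-picked subset of stop words (an assumption made for illustration) on the tokens of the last sample row.

```python
import re

# Hand-picked subset of stop words, for illustration only
stop_subset = ["i", "he", "to", "don", "here"]
pattern = r"\b(?:{})\b".format("|".join(stop_subset))

tokens = ["nah", "i", "dont", "think", "he", "goes", "to", "usf",
          "he", "lives", "around", "here", "though"]

# 'dont' is kept: 'don' is followed by 't', so there is no word boundary after it
regex_result = [t for t in tokens if not re.search(pattern, t)]
print(regex_result)

# Because every token is a single word, a plain membership test gives the same result
lookup = set(stop_subset)
set_result = [t for t in tokens if t not in lookup]
print(set_result)
```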

In [11]:
stopwords_pattern = r'\b(?:{})\b'.format('|'.join(stopwords))
stopwords_pattern

"\\b(?:i|me|my|myself|we|our|ours|ourselves|you|you're|you've|you'll|you'd|your|yours|yourself|yourselves|he|him|his|himself|she|she's|her|hers|herself|it|it's|its|itself|they|them|their|theirs|themselves|what|which|who|whom|this|that|that'll|these|those|am|is|are|was|were|be|been|being|have|has|had|having|do|does|did|doing|a|an|the|and|but|if|or|because|as|until|while|of|at|by|for|with|about|against|between|into|through|during|before|after|above|below|to|from|up|down|in|out|on|off|over|under|again|further|then|once|here|there|when|where|why|how|all|any|both|each|few|more|most|other|some|such|no|nor|not|only|own|same|so|than|too|very|s|t|can|will|just|don|don't|should|should've|now|d|ll|m|o|re|ve|y|ain|aren|aren't|couldn|couldn't|didn|didn't|doesn|doesn't|hadn|hadn't|hasn|hasn't|haven|haven't|isn|isn't|ma|mightn|mightn't|mustn|mustn't|needn|needn't|shan|shan't|shouldn|shouldn't|wasn|wasn't|weren|weren't|won|won't|wouldn|wouldn't)\\b"

#### Stemming

Stemming, sometimes described as a text standardization step, reduces words to their root or base form.

The disadvantage of stemming is that the resulting stem may lose the meaning of the original word or may not be a proper English word.
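To make the disadvantage concrete, here is a deliberately naive suffix-stripping function. It is a toy written for this illustration, not the Snowball algorithm used in the notebook, and its suffix list is an arbitrary assumption.

```python
# Toy suffix list, checked in order; real stemmers use far more careful rules
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ["joking", "goes", "lives"]
stems = [toy_stem(word) for word in words]
print(stems)  # ['jok', 'goe', 'liv']
```

None of the three results is an English word, which is the kind of loss the paragraph above describes.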

In [13]:
from nltk.stem import SnowballStemmer
import nltk
nltk.download('punkt')  # punkt is used by NLTK tokenizers; SnowballStemmer itself needs no download

# Initialize the SnowballStemmer once with the desired language
stemmer = SnowballStemmer("english")

def perform_stemming(tokens):
    # Perform stemming on each token
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens

# Apply perform_stemming function to 'msg_no_stopwords' column
df['msg_stemmed'] = df['msg_no_stopwords'].apply(perform_stemming)
df.head()

[nltk_data] Downloading package punkt to C:\Users\ELITEBOOK 840
[nltk_data]     G3\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,v1,v2,clean_msg,msg_lower,msg_tokenied,msg_no_stopwords,msg_lemmatized,msg_stemmed
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]","[go, jurong, point, crazi, avail, bugi, n, great, world, la, e, buffet, cine, got, amor, wat]"
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joke, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s,free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entri, 2, wkli, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv, entri, questionstd, txt, ratetc, appli, 08452810075over18]"
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, early, hor, u, c, already, say]","[u, dun, say, early, hor, u, c, already, say]","[u, dun, say, earli, hor, u, c, alreadi, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,nah i dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, go, usf, life, around, though]","[nah, dont, think, goe, usf, live, around, though]"


In [None]:
#try it out with porter and lancaster stemming

#### Lemmatization
Lemmatization also reduces a word to a base form, but the result (the lemma) is a valid dictionary word, so the meaning is preserved. The lemmatizer looks words up in a predefined dictionary (WordNet in this notebook) instead of simply stripping suffixes.

In [14]:
from nltk.stem import WordNetLemmatizer
# The WordNet corpus must be available locally; run nltk.download('wordnet') once if it is not
#defining the object for Lemmatization
wordnet_lemmatizer = WordNetLemmatizer()

#defining the function for lemmatization
def lemmatizer(text):
    lemm_text = [wordnet_lemmatizer.lemmatize(word) for word in text]
    return lemm_text

df['msg_lemmatized']=df['msg_no_stopwords'].apply(lambda x:lemmatizer(x))
df.head()

Unnamed: 0,v1,v2,clean_msg,msg_lower,msg_tokenied,msg_no_stopwords,msg_lemmatized,msg_stemmed
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]","[go, jurong, point, crazi, avail, bugi, n, great, world, la, e, buffet, cine, got, amor, wat]"
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joke, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s,free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entri, 2, wkli, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv, entri, questionstd, txt, ratetc, appli, 08452810075over18]"
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, early, hor, u, c, already, say]","[u, dun, say, early, hor, u, c, already, say]","[u, dun, say, earli, hor, u, c, alreadi, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,nah i dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, go, usf, life, around, though]","[nah, dont, think, goe, usf, live, around, though]"


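In the last row, goes has changed to go. Note that lives has become life: WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, so verbs can be mapped to the wrong lemma.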

In [15]:
# Create a new DataFrame with the label and the lemmatized tokens.
# Columns are selected by name so the result does not depend on column order.
df1 = pd.DataFrame({'Classification': df['v1'], 'Context': df['msg_lemmatized']})

# Display the new DataFrame
df1.head()

Unnamed: 0,Classification,Context
0,ham,"[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]"
1,ham,"[ok, lar, joking, wif, u, oni]"
2,spam,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]"
3,ham,"[u, dun, say, early, hor, u, c, already, say]"
4,ham,"[nah, dont, think, go, usf, life, around, though]"


### Count Vectorizer

CountVectorizer is a feature extraction technique used in natural language processing (NLP) to convert a collection of text documents into a matrix of token counts. It is a part of the scikit-learn library in Python.

Here's how CountVectorizer works:

- Tokenization: It first tokenizes the text, splitting it into individual words or terms. By default it also converts the text to lowercase; it can optionally remove accents or apply custom tokenization rules.

- Counting: It then counts the frequency of each term in each document. The result is a matrix where each row represents a document, each column represents a unique term (word), and each cell contains the count of how many times the term appears in the corresponding document.

- Vectorization: The resulting matrix is mostly zeros because each document contains only a small subset of the vocabulary, so CountVectorizer returns it as a SciPy sparse matrix. Most scikit-learn estimators accept sparse input directly; call .toarray() only when a dense array is needed.
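The three steps can be sketched by hand with the standard library. This is only an illustration of the idea on made-up documents; CountVectorizer has its own tokenizer (whose default pattern also drops one-character tokens) and returns a sparse matrix instead of nested lists.

```python
from collections import Counter

# Made-up documents for illustration
docs = ["free entry free prize", "ok lar joking", "free prize ok"]

# Step 1: tokenize (a plain whitespace split stands in for the real tokenizer)
tokenized = [doc.split() for doc in docs]

# Step 2: build the vocabulary, one column per unique term
vocabulary = sorted({token for doc in tokenized for token in doc})

# Step 3: one row of counts per document
matrix = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

Most cells are already zero in this tiny example, which is why a sparse representation pays off on a real vocabulary with thousands of terms.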

###### Parameters

- max_features: Specifies the maximum number of features (terms) to keep. If set, only the most frequent terms across the corpus are kept.

- stop_words: Specifies a list of words to ignore when building the vocabulary, typically common words such as "the" and "and".

- ngram_range: Specifies the range of n-gram sizes to extract, as a (min_n, max_n) tuple. An n-gram is a contiguous sequence of n items (e.g., words or characters) from a given sample of text.

- binary: If True, only the presence or absence of a term is recorded rather than its frequency, so each term is represented as 1 (present in the document) or 0 (absent).
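To illustrate what ngram_range and binary mean, the sketch below builds word n-grams by hand and compares counts with presence flags. The helper word_ngrams is hypothetical and only mimics the idea; it is not part of scikit-learn.

```python
from collections import Counter

def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["free", "entry", "free", "prize"]

# Comparable in spirit to ngram_range=(1, 2): unigrams and bigrams
grams = word_ngrams(tokens, 1, 2)
print(grams)

# Default behaviour: term frequencies
counts = Counter(grams)
# Comparable in spirit to binary=True: 1 if the term occurs at all
presence = {term: 1 for term in counts}
print(counts["free"], presence["free"])  # 2 1
```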

**Try tuning the vectorizer with the above parameters!!!**

In [16]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Convert the lists of tokens into strings
df1['Context'] = df1['Context'].apply(lambda x: ' '.join(x))

# Convert the text tokens into numerical features using CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df1['Context'])

# Convert the sparse matrix to a DataFrame
X_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

# Display the DataFrame containing the count of each word
X_df.head()

Unnamed: 0,008704050406,0089my,0121,01223585236,01223585334,0125698789,02,020603,0207,02070836089,...,ìï,ìïll,ûthanks,ûªm,ûªt,ûªve,ûï,ûïharry,ûò,ûówell
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**Joining the 'Classification' column back to the DataFrame containing the word counts**

In [17]:
# Concatenate the Classification column with the DataFrame containing word counts
df3 = pd.concat([df1['Classification'], X_df], axis=1)

# Display the resulting DataFrame
df3.head()

Unnamed: 0,Classification,008704050406,0089my,0121,01223585236,01223585334,0125698789,02,020603,0207,...,ìï,ìïll,ûthanks,ûªm,ûªt,ûªve,ûï,ûïharry,ûò,ûówell
0,ham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,spam,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
df3.shape

(5572, 8830)

In [19]:
# Encode 'ham' as 1 and 'spam' as 0 in the 'Classification' column
# (map avoids the downcasting FutureWarning that replace can raise in recent pandas)
df3['Classification'] = df3['Classification'].map({'ham': 1, 'spam': 0})

# Display the DataFrame after replacing values
df3.head()

Unnamed: 0,Classification,008704050406,0089my,0121,01223585236,01223585334,0125698789,02,020603,0207,...,ìï,ìïll,ûthanks,ûªm,ûªt,ûªve,ûï,ûïharry,ûò,ûówell
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Logistic Regression Model

Logistic Regression is used for predicting the probability of a binary outcome (yes/no, 1/0, true/false) based on one or more predictor variables. Despite its name, logistic regression is actually a classification algorithm, not a regression algorithm.

- It squeezes the output into the range between 0 and 1, so the output can be read as a probability
- It has a point of inflection (at an output of 0.5) which can be used to separate the feature space into two distinct areas
- It is most effective when the data is linearly separable, or nearly so
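
The squeezing behaviour in the list above comes from the logistic (sigmoid) function. The following is a minimal sketch in plain Python; the `sigmoid` helper, the sample scores, and the 0.5 cut-off are illustrative assumptions rather than code from this notebook.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# A few illustrative scores (the linear combination of features and weights)
scores = [-6.0, -2.0, 0.0, 2.0, 6.0]
probabilities = [sigmoid(z) for z in scores]

for z, p in zip(scores, probabilities):
    # Assumed cut-off: predict class 1 only when the probability exceeds 0.5
    label = 1 if p > 0.5 else 0
    print(f"z = {z:+.1f} -> p = {p:.4f} -> predicted class {label}")
```

Large negative scores land close to 0, large positive scores land close to 1, and a score of exactly 0 sits at the inflection point of 0.5.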

**Linearly separable** refers to the property of a dataset where the classes (categories or labels) can be perfectly separated by a linear decision boundary. In other words, if you can draw a straight line (or hyperplane in higher dimensions) that completely separates the instances of one class from the instances of another class, then the dataset is said to be linearly separable.

#### Confusion Matrix

A confusion matrix is a table that is used to evaluate the performance of a classification model. It compares the actual labels of a dataset with the labels predicted by the model. The confusion matrix is especially useful for understanding the types of errors that a model is making and provides insights into its strengths and weaknesses.

In [20]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Split the data into features (X) and target variable (y)
X = df3.drop(columns=['Classification'])
y = df3['Classification']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the logistic regression model
logreg_model = LogisticRegression()
logreg_model.fit(X_train, y_train)

# Get the coefficients and intercept
coefficients = logreg_model.coef_
intercept = logreg_model.intercept_

# Predict the target variable for the testing data
y_pred = logreg_model.predict(X_test)

# Generate a classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Display the coefficients and intercept
print("Coefficients:", coefficients)
print("Intercept:", intercept)

# Get predicted probabilities for each class
y_proba = logreg_model.predict_proba(X_test)
print("Predicted Probabilities:")
print(y_proba)


Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.84      0.91       150
           1       0.98      1.00      0.99       965

    accuracy                           0.98      1115
   macro avg       0.99      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115

Coefficients: [[-0.01764915 -0.02244956 -0.28785433 ... -0.13109832  0.10314806
   0.00880045]]
Intercept: [4.23625106]
Predicted Probabilities:
[[0.0378138  0.9621862 ]
 [0.0353678  0.9646322 ]
 [0.47548274 0.52451726]
 ...
 [0.00390266 0.99609734]
 [0.00796229 0.99203771]
 [0.17390999 0.82609001]]


**Precision:** Precision measures, for each class, the proportion of instances predicted as that class that actually belong to it. For class 0 (spam), it is 1.00, indicating that all instances predicted as spam are indeed spam. For class 1 (ham), it is 0.98, indicating that 98% of instances predicted as ham are actually ham.

**Recall:** The recall measures the proportion of correctly predicted instances of a class among all instances of that class in the dataset. For class 0 (spam), it is 0.84, indicating that 84% of actual spam instances are correctly classified as spam. For class 1 (ham), it is 1.00, indicating that all actual ham instances are correctly classified as ham.

**F1-score:** The F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall. For class 0 (spam), it is 0.91, and for class 1 (ham), it is 0.99.

**Accuracy:** The accuracy measures the proportion of correctly predicted instances among all instances in the dataset. It is 0.98, indicating that 98% of instances are correctly classified overall.

**Support:** The support is the number of actual occurrences of each class in the test set: 150 spam and 965 ham messages, 1,115 in total.

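**Intercept:** The intercept is the log-odds of the positive class (ham) when all predictor variables are zero, that is, for a message containing none of the vocabulary words. It is not the decision boundary itself: the boundary is the set of points where the intercept plus the weighted sum of the features equals zero, which is where the predicted probability of ham equals 0.5. In this case, the intercept is approximately 4.24, which corresponds to a probability of about 0.99 for ham.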

**Coefficients:** Each coefficient represents the change in the log-odds of the positive class (ham) for a one-unit change in the corresponding predictor variable, holding all other variables constant. Here each coefficient corresponds to one word, and a one-unit change is one additional occurrence of that word in a message. The sign gives the direction and the magnitude gives the strength of its influence: positive coefficients push the prediction toward ham, while negative coefficients push it toward spam.
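
To make the log-odds interpretation concrete, the sketch below converts the printed intercept into a probability and one of the printed coefficients into an odds multiplier. The two numbers are copied from the output above; the `sigmoid` helper and the variable names are assumptions added for illustration.

```python
import math

# Values copied from the printed model output above
intercept = 4.23625106       # log-odds of ham when every word count is zero
example_coef = -0.28785433   # third value in the printed coefficient array

def sigmoid(z: float) -> float:
    """Convert log-odds into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Probability of ham for a message with no vocabulary words at all
p_ham_empty = sigmoid(intercept)

# exp(coefficient) is the factor by which the odds of ham are multiplied
# for each additional occurrence of the corresponding word
odds_ratio = math.exp(example_coef)

print(f"P(ham | all word counts are zero) = {p_ham_empty:.3f}")
print(f"Odds multiplier per extra occurrence = {odds_ratio:.3f}")
```

An odds multiplier below 1 means the word makes ham less likely, which matches the negative sign of the coefficient.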

In [21]:
from sklearn.metrics import confusion_matrix

# Assuming y_test and y_pred are the actual and predicted labels, respectively
# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Convert confusion matrix to DataFrame
confusion_df = pd.DataFrame(cm, index=['Actual 0', 'Actual 1'], columns=['Predicted 0', 'Predicted 1'])

# Display the DataFrame
print("Confusion Matrix:")
confusion_df

Confusion Matrix:


| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 126 | 24 |
| Actual 1 | 0 | 965 |


Predicted 0 corresponds to the model's predictions for the negative class (spam).

Predicted 1 corresponds to the model's predictions for the positive class (ham).

Actual 0 represents the actual instances of the negative class (spam).

Actual 1 represents the actual instances of the positive class (ham).

**Interpretation:**

True Negatives (TN): The model correctly predicted 126 instances as spam (negative class) when they were actually spam.

False Positives (FP): The model incorrectly predicted 24 instances as ham (positive class) when they were actually spam.

True Positives (TP): The model correctly predicted 965 instances as ham when they were actually ham.

False Negatives (FN): The model made no false-negative errors; none of the ham messages were predicted as spam (0 instances).
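
The four counts above are enough to recompute the figures in the classification report by hand. The following is a small sketch using only those counts, with ham as the positive class; the variable names and the `f1` helper are illustrative assumptions.

```python
# Counts read from the confusion matrix, with ham (1) as the positive class
tn, fp, fn, tp = 126, 24, 0, 965

# Ham (class 1) metrics
precision_ham = tp / (tp + fp)   # of everything predicted ham, how much is ham
recall_ham = tp / (tp + fn)      # of all actual ham, how much was found

# Spam (class 0) metrics: swap the roles of the two classes
precision_spam = tn / (tn + fn)  # of everything predicted spam, how much is spam
recall_spam = tn / (tn + fp)     # of all actual spam, how much was found

accuracy = (tp + tn) / (tp + tn + fp + fn)

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(f"spam: precision={precision_spam:.2f} recall={recall_spam:.2f} "
      f"f1={f1(precision_spam, recall_spam):.2f}")
print(f"ham:  precision={precision_ham:.2f} recall={recall_ham:.2f} "
      f"f1={f1(precision_ham, recall_ham):.2f}")
print(f"accuracy={accuracy:.2f}")
```

The rounded values agree with the classification report, and they show where the model is weak: the 24 spam messages that slipped through as ham are what pull spam recall down to 0.84.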