# Spam SMS Detection

In [2]:
import pandas as pd
df = pd.read_csv("spam.csv", encoding='ISO-8859-1')
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [3]:
df.groupby('v1').describe()


v1,v2 count,v2 unique,v2 top,v2 freq,Unnamed: 2 count,Unnamed: 2 unique,Unnamed: 2 top,Unnamed: 2 freq,Unnamed: 3 count,Unnamed: 3 unique,Unnamed: 3 top,Unnamed: 3 freq,Unnamed: 4 count,Unnamed: 4 unique,Unnamed: 4 top,Unnamed: 4 freq
ham,4825,4516,"Sorry, I'll call later",30,45,39,"bt not his girlfrnd... G o o d n i g h t . . .@""",3,10,9,GE,2,6,5,"GNT:-)""",2.0
spam,747,653,Please call our customer service representativ...,4,5,4,PO Box 5249,2,2,1,"MK17 92H. 450Ppw 16""",2,0,0,,


In [4]:
# Encode the label as a number: 1 for spam, 0 for ham
df['category'] = df['v1'].apply(lambda x: 1 if x == 'spam' else 0)
df.head(10)

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4,category
0,ham,"Go until jurong point, crazy.. Available only ...",,,,0
1,ham,Ok lar... Joking wif u oni...,,,,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,,1
3,ham,U dun say so early hor... U c already then say...,,,,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,,0
5,spam,FreeMsg Hey there darling it's been 3 week's n...,,,,1
6,ham,Even my brother is not like to speak with me. ...,,,,0
7,ham,As per your request 'Melle Melle (Oru Minnamin...,,,,0
8,spam,WINNER!! As a valued network customer you have...,,,,1
9,spam,Had your mobile 11 months or more? U R entitle...,,,,1


In [5]:
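# Keep only the message text and the numeric label; this drops 'v1' and the
# mostly empty 'Unnamed: 2', 'Unnamed: 3' and 'Unnamed: 4' columns.
# .copy() makes the result an independent DataFrame rather than a slice of the original.
df = df[['v2', 'category']].copy()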
df.head(10)

Unnamed: 0,v2,category
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0
5,FreeMsg Hey there darling it's been 3 week's n...,1
6,Even my brother is not like to speak with me. ...,0
7,As per your request 'Melle Melle (Oru Minnamin...,0
8,WINNER!! As a valued network customer you have...,1
9,Had your mobile 11 months or more? U R entitle...,1


In [6]:
# Give the text column a descriptive name (reassigning avoids an in-place edit on a sliced DataFrame)
df = df.rename(columns={'v2': 'sms'})
df.head(5)

Unnamed: 0,sms,category
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0


**Splitting the dataset into training and testing subsets**:
The **train_test_split()** function is called with the feature data (df.sms) and the target variable (df.category) as input.
The **test_size** parameter is set to 0.25, so 25% of the dataset is held out for testing and the remaining 75% is used for training.
The function returns four subsets:
**xtrain** is the training feature data (the message text).
**xtest** is the testing feature data.
**ytrain** holds the training labels.
**ytest** holds the testing labels.


In [7]:
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(df.sms, df.category, test_size=0.25)
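
Because no `random_state` is passed above, each run shuffles the rows differently, so the scores reported later can vary slightly between runs. The following is a minimal sketch of a repeatable, class-balanced split. It uses made-up toy data rather than the SMS dataset, and the seed `42` and the `stratify` argument are assumptions that are not part of the original notebook.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 8 ham (0) and 4 spam (1) messages
messages = [f"message {i}" for i in range(12)]
labels = [0] * 8 + [1] * 4

x_train, x_test, y_train, y_test = train_test_split(
    messages,
    labels,
    test_size=0.25,    # 25% of 12 rows -> 3 test rows
    random_state=42,   # fixed seed makes the split repeatable
    stratify=labels,   # keep the ham/spam ratio similar in both subsets
)

print(len(x_train), len(x_test))
```

Stratifying may matter for this dataset because spam is the minority class, and a purely random split could leave the test set with an unrepresentative share of spam.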

**Feature Extraction**:
The **CountVectorizer** class is a text preprocessing tool that converts a collection of text documents into a matrix of token counts.
The **fit_transform()** method fits the vectorizer to the data (building the vocabulary) and transforms the data into a numerical matrix in one step.
The **toarray()** method converts the resulting sparse matrix (xtraincount) into a dense NumPy array so that it can be displayed.
In summary, the code converts the training messages (xtrain.values) into a matrix of token counts (xtraincount) and displays its first three rows. Each row corresponds to one message and each column to one unique token (word) in the training vocabulary. Most entries are 0 because any single message contains only a few of those tokens.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
xtraincount = cv.fit_transform(xtrain.values)
xtraincount.toarray()[:3]  # show the first three rows as a dense array

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
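
To make the token-count matrix less abstract, here is a small from-scratch approximation of the bag-of-words idea in plain Python. It is a sketch for illustration, not scikit-learn's implementation: the helper names `tokenize`, `build_vocabulary` and `count_vector` are hypothetical, the three documents are made up, and the tokenizer only imitates CountVectorizer's default behaviour of lowercasing and keeping tokens of two or more word characters.

```python
import re

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = set()
    for doc in docs:
        tokens.update(tokenize(doc))
    return {token: idx for idx, token in enumerate(sorted(tokens))}

def count_vector(doc, vocabulary):
    """Count how often each vocabulary token occurs; unknown tokens are ignored."""
    row = [0] * len(vocabulary)
    for token in tokenize(doc):
        if token in vocabulary:
            row[vocabulary[token]] += 1
    return row

docs = ["Free entry to win a prize", "Ok see you later", "Win a FREE prize now"]
vocabulary = build_vocabulary(docs)                    # the "fit" step
matrix = [count_vector(doc, vocabulary) for doc in docs]  # the "transform" step

print(sorted(vocabulary))
# ['entry', 'free', 'later', 'now', 'ok', 'prize', 'see', 'to', 'win', 'you']
for row in matrix:
    print(row)
# [1, 1, 0, 0, 0, 1, 0, 1, 1, 0]
# [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]
# [0, 1, 0, 1, 0, 1, 0, 0, 1, 0]
```

With a vocabulary of thousands of tokens, each row would be almost entirely zeros, which is why the first three rows shown above contain only zeros in the visible columns.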

**Naive Bayes Model**:
A Naive Bayes model is a probabilistic algorithm that is based on Bayes' theorem and assumes that the features are conditionally independent given the class label. It is often used for classification tasks and is known for its simplicity and efficiency. Naive Bayes models are commonly used in natural language processing tasks, spam filtering, and sentiment analysis.
**Gaussian Naive Bayes**
Gaussian Naive Bayes is one of the classification models in the Naive Bayes family. It assumes that the features follow a Gaussian (normal) distribution within each class, which makes it suitable for continuous, real-valued features.
**Multinomial Naive Bayes**
Multinomial Naive Bayes is another classification model in the Naive Bayes family. It is specifically designed for discrete features, such as word counts or categorical data. This model assumes that the features follow a multinomial distribution. It is commonly used in text classification tasks, where the features represent the occurrence or frequency of words in documents.
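
For reference, the standard multinomial Naive Bayes decision rule with additive (Laplace) smoothing can be written as below. The symbols are introduced here for illustration and do not appear in the original notebook: $x_i$ is the count of token $i$ in the message being classified, $N_{c,i}$ is the total count of token $i$ in the training messages of class $c$, $N_c = \sum_i N_{c,i}$, $|V|$ is the vocabulary size, and $\alpha$ is the smoothing parameter (scikit-learn's `MultinomialNB` uses `alpha=1.0` by default).

$$
\hat{y} = \arg\max_{c}\left[\log P(c) + \sum_{i=1}^{|V|} x_i \log \theta_{c,i}\right],
\qquad
\theta_{c,i} = \frac{N_{c,i} + \alpha}{N_c + \alpha\,|V|}
$$

The smoothing term keeps $\theta_{c,i}$ strictly positive, so a token that never occurred in one class does not force that class's score to $-\infty$.
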
**Bernoulli Naive Bayes**
Bernoulli Naive Bayes is another variant of the Naive Bayes model. It is suitable for binary features, where each feature can take only two values (e.g., presence or absence of a particular attribute). This model assumes that the features follow a Bernoulli distribution. It is commonly used in document classification tasks, where binary term occurrence features are used.
**Creating and training the Multinomial Naive Bayes model**:
The **MultinomialNB** class implements the Naive Bayes algorithm for classification, assuming that the features follow a multinomial distribution.
The **fit()** method of the MultinomialNB object is then called to train the model.
**xtraincount** represents the training feature data, which is the matrix of token counts obtained from the previous step (xtraincount = cv.fit_transform(xtrain.values)).
**ytrain** represents the training target variable or labels.

In [9]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(xtraincount, ytrain)
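
What `fit()` estimates can be sketched in a few lines of plain Python. The version below is a simplified illustration under stated assumptions, not scikit-learn's code: the function names `train_multinomial_nb` and `predict` are hypothetical, the four training messages are made up, tokens are obtained by splitting on whitespace, and Laplace smoothing with `alpha=1.0` is assumed.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class token log-probabilities."""
    token_counts = defaultdict(Counter)   # class -> token -> count
    doc_counts = Counter(labels)          # class -> number of documents
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())
    vocabulary = sorted({t for counts in token_counts.values() for t in counts})
    log_prior = {c: math.log(n / len(docs)) for c, n in doc_counts.items()}
    log_likelihood = {}
    for c, counts in token_counts.items():
        total = sum(counts.values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            t: math.log((counts[t] + alpha) / total) for t in vocabulary
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in doc.split() if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)

train_docs = ["win free prize now", "free entry win cash",
              "see you later", "call me later ok"]
train_labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham

log_prior, log_likelihood = train_multinomial_nb(train_docs, train_labels)
print(predict("free cash prize", log_prior, log_likelihood))   # 1
print(predict("see you later", log_prior, log_likelihood))     # 0
```

Working in log space turns the product of many small probabilities into a sum, which avoids numerical underflow on longer messages.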

**Prediction**:
**cv.transform(emails)** uses the CountVectorizer fitted above (cv) to convert the new messages into a matrix of token counts. Only transform() is called, not fit_transform(), so the vocabulary learned from the training data is reused and words that were not seen during training are ignored.

**model.predict(emailscount)** applies the trained model to the transformed messages and returns a predicted label for each one: 1 for spam and 0 for ham.

To summarize, the code takes a few new messages, converts them into token counts with the fitted CountVectorizer, and uses the trained Multinomial Naive Bayes model to classify them.

In [10]:
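# Note: {} creates a set, which has no guaranteed iteration order, so the
# predictions below cannot be reliably matched to individual messages.
# Use a list ([...]) when the order matters.
emails = {
    'Hey bro, can we watch football together?',
    'upto 20% discount on parking, exclusive offer just for you',
    'I"ve a diccount of 50% can we go to the shopping mall'
}
emailscount = cv.transform(emails)
model.predict(emailscount)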

array([1, 0, 0], dtype=int64)
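
`predict` returns labels in the order the messages were passed in, so an ordered container such as a list makes it possible to pair each label with its message (a Python set has no guaranteed iteration order). The sketch below shows this pairing together with `predict_proba`, which reports how confident the model is. It is self-contained and uses a made-up toy corpus in place of the SMS dataset; the names `vectorizer`, `classifier` and `new_messages` are chosen for this example only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (hypothetical), standing in for the SMS dataset
train_texts = ["win a free prize now", "free entry win cash",
               "see you later", "call me later ok"]
train_labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham

vectorizer = CountVectorizer()
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(train_texts), train_labels)

# A list keeps its order, so prediction i belongs to message i
new_messages = ["free cash prize", "see you later ok"]
counts = vectorizer.transform(new_messages)
predictions = classifier.predict(counts)
probabilities = classifier.predict_proba(counts)  # columns follow classifier.classes_

for text, label, proba in zip(new_messages, predictions, probabilities):
    name = "spam" if label == 1 else "ham"
    print(f"{name} (P(spam)={proba[1]:.2f}): {text}")
```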

**Check Accuracy**:
The test data (xtest) is transformed with the fitted CountVectorizer (cv) into a matrix of token counts (xtestcount).
The score() method then compares the model's predictions on xtestcount with the true labels (ytest) and returns the mean accuracy.

In [11]:
xtestcount = cv.transform(xtest)
model.score(xtestcount, ytest)

0.9820531227566404
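
Accuracy alone can look optimistic on this dataset because the classes are imbalanced: with 4825 ham and 747 spam messages, a model that always predicts ham would already reach about 86.6% accuracy. Precision and recall for the spam class give a fuller picture. The sketch below computes them in plain Python; the function name `spam_metrics` is hypothetical and the label lists are made up for illustration, not taken from the notebook's results.

```python
def spam_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall, treating 1 (spam) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham passed through
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up labels: 6 ham and 4 spam messages
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

metrics = spam_metrics(y_true, y_pred)
print(metrics)
```

In a notebook that already uses scikit-learn, `precision_score`, `recall_score` and `confusion_matrix` from `sklearn.metrics` provide the same quantities for the real `ytest` and predicted labels.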

**Why Use a Pipeline**:
Without a pipeline, the vectorizer has to be applied by hand to every dataset before it reaches the model: fit_transform() on the training data, then transform() on the test data and on any new messages. Repeating these calls duplicates code and risks inconsistencies if a step is skipped or applied differently.

A pipeline ensures that the same preprocessing steps are applied to every dataset, maintaining consistency and reducing the risk of errors. It combines vectorizing the data with CountVectorizer and training a Multinomial Naive Bayes classifier into a single object.

**Creating the Pipeline**:

**Vectorizer**: The first step in the pipeline is defined as ('vectorizer', CountVectorizer()). Here, 'vectorizer' is a string identifier for this step, and CountVectorizer() creates an instance of the CountVectorizer class. The CountVectorizer is responsible for converting the raw text data into a numerical representation.

**Model**: The second step in the pipeline is defined as ('model', MultinomialNB()). Here, 'model' is a string identifier for this step, and MultinomialNB() creates an instance of the Multinomial Naive Bayes classifier. This classifier is commonly used for text classification tasks.

In [12]:
from sklearn.pipeline import Pipeline
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('model', MultinomialNB())
])

In [13]:
clf.fit(xtrain, ytrain)


In [14]:
clf.score(xtest, ytest)


0.9820531227566404

In [15]:
clf.predict(emails)


array([1, 0, 0], dtype=int64)
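
A fitted pipeline still exposes its individual steps through the names given when it was built, which is useful for inspecting the learned vocabulary or the underlying model. The following self-contained sketch shows this on a made-up toy corpus; the variable name `toy_clf` and the training messages are assumptions for illustration, while the step names match those used above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical), standing in for the SMS dataset
train_texts = ["win a free prize now", "free entry win cash",
               "see you later", "call me later ok"]
train_labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham

toy_clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('model', MultinomialNB()),
])
toy_clf.fit(train_texts, train_labels)

# Steps remain accessible by the names given when the pipeline was built
vocabulary = toy_clf.named_steps['vectorizer'].vocabulary_
fitted_model = toy_clf.named_steps['model']
print(len(vocabulary), fitted_model.classes_)

# Raw text goes straight in: the pipeline vectorizes before predicting
toy_predictions = toy_clf.predict(["free prize", "call me later"])
print(toy_predictions)
```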