# SVM MODEL
### Dataset = Spam - Ham
### Dataset URL = https://raw.githubusercontent.com/diazoniclabs/Machine-Learning-using-sklearn/master/Datasets/spam.tsv

# Exploratory Data Analysis

### Load the data and create a DataFrame

In [1]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/diazoniclabs/Machine-Learning-using-sklearn/master/Datasets/spam.tsv',sep='\t')
print(df)

     label                                            message  length  punct
0      ham  Go until jurong point, crazy.. Available only ...     111      9
1      ham                      Ok lar... Joking wif u oni...      29      6
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...     155      6
3      ham  U dun say so early hor... U c already then say...      49      6
4      ham  Nah I don't think he goes to usf, he lives aro...      61      2
...    ...                                                ...     ...    ...
5567  spam  This is the 2nd time we have tried 2 contact u...     160      8
5568   ham               Will ü b going to esplanade fr home?      36      1
5569   ham  Pity, * was in mood for that. So...any other s...      57      7
5570   ham  The guy did some bitching but I acted like i'd...     125      1
5571   ham                         Rofl. Its true to its name      26      1

[5572 rows x 4 columns]


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    5572 non-null   object
 1   message  5572 non-null   object
 2   length   5572 non-null   int64 
 3   punct    5572 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 174.3+ KB


In [3]:
df.shape

(5572, 4)

In [4]:
df.size

22288

In [5]:
df.describe()

            length        punct
count  5572.000000  5572.000000
mean     80.489950     4.177495
std      59.942907     4.623919
min       2.000000     0.000000
25%      36.000000     2.000000
50%      62.000000     3.000000
75%     122.000000     6.000000
max     910.000000   133.000000


### Label and Value Counts of The Dataset

In [6]:
# Count the exact number of spam and ham messages
df['label'].value_counts()

label
ham     4825
spam     747
Name: count, dtype: int64
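
The counts above show that ham messages heavily outnumber spam. As a rough sketch (plain Python, with the two counts copied by hand from the output above), the class shares and the accuracy of a trivial "always predict ham" baseline can be worked out like this. The name `majority_baseline` is only illustrative.

```python
# Counts copied from the value_counts() output above
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

# Share of each class in the whole dataset
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# A model that always answers with the most common class
# would reach this accuracy without learning anything
majority_baseline = max(shares.values())
print(f"majority-class baseline: {majority_baseline:.1%}")
```

Any accuracy reported later is best read against this baseline rather than against 50%.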

### Divide the data into input and output

In [7]:

# Input = message
# Output = label
x = df.iloc[:,1].values # Column 1: the message text only
print(x)

['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
 'Ok lar... Joking wif u oni...'
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
 ... 'Pity, * was in mood for that. So...any other suggestions?'
 "The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free"
 'Rofl. Its true to its name']


In [8]:
#Output
y = df.iloc[:,0].values
print(y)

['ham' 'ham' 'spam' ... 'ham' 'ham' 'ham']


### Train/Test Split

In [9]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,random_state= 0)

In [10]:
# For x
print(x.shape)
print(x_train.shape)
print(x_test.shape)

(5572,)
(4179,)
(1393,)


In [11]:
#For y 
print(y.shape)
print(y_train.shape)
print(y_test.shape)

(5572,)
(4179,)
(1393,)
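
The 4179/1393 split comes from `train_test_split` holding out 25% of the rows when neither `test_size` nor `train_size` is given. The sketch below reproduces those shapes on placeholder data (integer IDs stand in for the messages; the label counts are copied from the value counts earlier). It also shows the optional `stratify` argument, which the notebook does not use, as one way to keep the spam share similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same size and label counts as the dataset
demo_labels = np.array(["ham"] * 4825 + ["spam"] * 747)
demo_messages = np.arange(demo_labels.size)  # stand-ins for the message text

# Default split: 75% train, 25% test
m_train, m_test, l_train, l_test = train_test_split(
    demo_messages, demo_labels, random_state=0
)
print(m_train.shape, m_test.shape)

# Optional variant (an assumption, not what the notebook does):
# stratify keeps the class proportions close in both parts
_, _, ls_train, ls_test = train_test_split(
    demo_messages, demo_labels, random_state=0, stratify=demo_labels
)
train_spam_share = (ls_train == "spam").mean()
test_spam_share = (ls_test == "spam").mean()
print(train_spam_share, test_spam_share)
```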


### Apply the TF-IDF Vectorizer

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
x_train_v = vect.fit_transform(x_train)
x_test_v = vect.transform(x_test)
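
Note that the vectorizer is fitted on the training messages only and then reused, unchanged, on the test messages. A small sketch on a made-up corpus (the sentences and the `demo_` names are invented for illustration) shows what that means: the vocabulary is fixed at fit time, and words first seen at transform time are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus, not taken from the dataset
demo_train = [
    "win a free prize now",
    "are you home yet",
    "free entry to win tickets",
]
demo_test = ["free pizza at home"]

demo_vect = TfidfVectorizer()
demo_train_m = demo_vect.fit_transform(demo_train)  # learns vocabulary and IDF weights
demo_test_m = demo_vect.transform(demo_test)        # reuses them, learns nothing new

# Both matrices have one column per word in the training vocabulary
print(demo_train_m.shape, demo_test_m.shape)

# "pizza" never appeared in training, so it has no column
print("pizza" in demo_vect.vocabulary_)

# Only the known words ("free", "home") produce non-zero entries
print(demo_test_m.nnz)
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the resulting columns would no longer line up with what the model was trained on.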

### Apply a Classifier (SVC)


In [13]:
from sklearn.svm import SVC
model = SVC()

### Model Fitting

In [14]:
model.fit(x_train_v,y_train) # We are training the model here

### Predicting the Output


In [15]:
y_pred = model.predict(x_test_v)
print(y_pred)

['ham' 'spam' 'ham' ... 'ham' 'ham' 'ham']


In [16]:
print(y_test)

['ham' 'spam' 'ham' ... 'spam' 'ham' 'ham']


### Accuracy


In [17]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)*100 # Arguments in sklearn's documented order: (y_true, y_pred)

98.56424982053123
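
Because ham dominates the dataset, a single accuracy figure can hide how well spam is actually caught. The following sketch uses a small hand-made set of labels (not the notebook's predictions) to show how `confusion_matrix` and `recall_score` break the result down per class. The same calls could be applied to `y_test` and `y_pred`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hand-made example: 8 ham and 2 spam, with one spam message missed
demo_true = ["ham"] * 8 + ["spam"] * 2
demo_pred = ["ham"] * 8 + ["ham", "spam"]

demo_acc = accuracy_score(demo_true, demo_pred)

# Rows are true classes, columns are predicted classes, in the given label order
demo_cm = confusion_matrix(demo_true, demo_pred, labels=["ham", "spam"])

# Fraction of real spam that was flagged as spam
demo_spam_recall = recall_score(demo_true, demo_pred, pos_label="spam")

print(demo_acc)
print(demo_cm)
print(demo_spam_recall)
```

Here accuracy is 0.9 even though only half of the spam was detected, which is the kind of gap the per-class numbers make visible.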

### Individual Prediction


In [18]:
a = df['message'][10]
print(a)

I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.


In [19]:
a = vect.transform([a])
model.predict(a)

array(['ham'], dtype=object)

In [20]:
b = df['message'][12]
print(b)

URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18


In [21]:
b = vect.transform([b])
model.predict(b)

array(['spam'], dtype=object)

In [22]:
c = df['message'][34]
print(c)

Thanks for your subscription to Ringtone UK your mobile will be charged £5/month Please confirm by replying YES or NO. If you reply NO you will not be charged


In [23]:
c = vect.transform([c])
model.predict(c)

array(['spam'], dtype=object)

### Evaluate a Custom Message


In [24]:
d = 'Win free tickets today'
print(d)

Win free tickets today


In [25]:
d = vect.transform([d])
model.predict(d)

array(['spam'], dtype=object)

## Summary of how we built the model
- Loaded the data into a DataFrame
- Inspected the data (info, shape, describe, label counts)
- Divided the data into input (message) and output (label)
- Split both into training and test sets with train_test_split
- Applied the TF-IDF vectorizer
- Trained an SVC classifier
- Predicted the output for the test set
- Measured the accuracy

## To create a web app, we need to pipeline the model
### Pipelining = combining two or more steps into a single estimator
### Here we pipeline the TF-IDF vectorizer and the SVC

In [26]:
# For pipelining
from sklearn.pipeline import make_pipeline
text_model = make_pipeline(TfidfVectorizer(),SVC())

In [27]:
#Now let us fit the pipelined model
text_model.fit(x_train,y_train)
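
To make concrete what the pipeline does, the sketch below trains the two steps by hand and as a pipeline on a tiny made-up corpus (sentences, labels, and `demo_` names are invented) and checks that both give the same predictions. The pipeline accepts raw strings because it runs the vectorizer internally before the classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny made-up corpus, not taken from the dataset
demo_texts = [
    "win a free prize now",
    "free entry to win tickets",
    "claim your free cash prize",
    "are you home yet",
    "see you at lunch tomorrow",
    "call me when you get home",
]
demo_y = ["spam", "spam", "spam", "ham", "ham", "ham"]
demo_new = ["free prize tickets", "see you at home"]

# Manual route: vectorize, then classify
demo_vect = TfidfVectorizer()
demo_clf = SVC().fit(demo_vect.fit_transform(demo_texts), demo_y)
manual_pred = demo_clf.predict(demo_vect.transform(demo_new))

# Pipeline route: one object, raw text in, labels out
demo_pipe = make_pipeline(TfidfVectorizer(), SVC()).fit(demo_texts, demo_y)
pipe_pred = demo_pipe.predict(demo_new)

print(list(manual_pred) == list(pipe_pred))

# make_pipeline names each step after its class, in lowercase
print(list(demo_pipe.named_steps))
```

For a web app this matters because only one object has to be saved and loaded, and the request handler can pass the user's text straight to `predict`.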

## Predict the output


In [28]:
y_pred1 = text_model.predict(x_test)
print(y_pred1)

['ham' 'spam' 'ham' ... 'ham' 'ham' 'ham']


In [29]:
print(y_test)

['ham' 'spam' 'ham' ... 'spam' 'ham' 'ham']


## Accuracy Score


In [30]:
accuracy_score(y_test,y_pred1)*100 # Arguments in sklearn's documented order: (y_true, y_pred)

98.56424982053123

## Individual Prediction


In [31]:
e = 'win free tickets today'
print(e)

win free tickets today


In [32]:
text_model.predict([e])

array(['spam'], dtype=object)

In [33]:
# joblib provides two functions: dump (save an object) and load (restore it)
import joblib
joblib.dump(text_model , 'spam-ham')
# Here we save the fitted pipeline to a file named spam-ham

['spam-ham']
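
The saved file can later be read back with `joblib.load`, for example inside the web app. The sketch below shows the round trip on a small made-up pipeline written to a temporary directory, so it does not touch the `spam-ham` file. The corpus, the `demo_` names, and the file name `spam-ham-demo.joblib` are all illustrative assumptions.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny made-up corpus, not taken from the dataset
demo_texts = [
    "win a free prize now",
    "free entry to win tickets",
    "are you home yet",
    "see you at lunch tomorrow",
]
demo_y = ["spam", "spam", "ham", "ham"]
demo_pipe = make_pipeline(TfidfVectorizer(), SVC()).fit(demo_texts, demo_y)

demo_new = ["free tickets", "see you at home"]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "spam-ham-demo.joblib")
    joblib.dump(demo_pipe, demo_path)    # save the fitted pipeline
    restored = joblib.load(demo_path)    # restore it as a ready-to-use object

# The restored pipeline predicts exactly like the original
round_trip_ok = list(restored.predict(demo_new)) == list(demo_pipe.predict(demo_new))
print(round_trip_ok)
```

In the notebook's case the equivalent call would be `joblib.load('spam-ham')`, ideally with the same scikit-learn version that was used to save the file.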