In [11]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

In [12]:
!unzip /content/smsspamcollection.zip

Archive:  /content/smsspamcollection.zip
replace SMSSpamCollection? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: SMSSpamCollection       
replace readme? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: readme                  


It is always useful to check how the data is stored before working with it. Inspecting the raw file makes the delimiter easy to identify, and that delimiter is then passed to the sep parameter (or its alias, delimiter) when reading the file.
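One way to do that inspection in code is to count candidate delimiters on the first few lines. The sketch below is only an illustration: the two sample lines, the `candidates` list, and the variable names are assumptions made for the example, not part of the dataset's tooling.

```python
import io

# Hypothetical two-line sample laid out like SMSSpamCollection:
# a label, a tab, then the message text.
sample = io.StringIO(
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tFree entry in 2 a wkly comp to win FA Cup final\n"
)
lines = sample.read().splitlines()

# Candidate delimiters to check (an assumed shortlist).
candidates = ["\t", ",", ";", "|"]

# Count each candidate on every line; a real delimiter tends to
# appear the same non-zero number of times on each line.
counts = {d: [line.count(d) for line in lines] for d in candidates}
consistent = [d for d, c in counts.items() if c[0] > 0 and len(set(c)) == 1]

print(repr(consistent))
```

With a real file, the same check could be run on the first handful of lines read with `open(...)` instead of the in-memory sample.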

In [14]:
df = pd.read_csv('SMSSpamCollection', delimiter='\t', header=None)
# header: row number(s) (int or list of ints) to use as the column names
# header=None: the file has no header row, so pandas labels the columns 0, 1, ...

Datasets are often distributed in .csv format. CSV stands for Comma Separated Values.
The commas that separate the values in a CSV file are known as delimiters.

A dataset in .csv format can also have its data items separated by a delimiter other than a comma, such as a semicolon, colon, tab, or vertical bar.

In such cases, we need to pass that delimiter through the sep parameter of the pd.read_csv() function.
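As a small illustration of the sep parameter, the sketch below reads a tab-separated and a semicolon-separated sample. The in-memory strings and the column names `label` and `message` are hypothetical stand-ins chosen for the example.

```python
import io
import pandas as pd

# Hypothetical in-memory samples standing in for files on disk.
tab_text = "ham\tOk lar...\nspam\tFree entry\n"
semi_text = "label;message\nham;Ok lar...\nspam;Free entry\n"

# Tab-separated, no header row: supply the column names ourselves.
df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t", header=None,
                     names=["label", "message"])

# Semicolon-separated with a header row: names come from the first line.
df_semi = pd.read_csv(io.StringIO(semi_text), sep=";")

print(df_tab)
print(df_semi)
```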

In [16]:
df

      0     1
0     ham   Go until jurong point, crazy.. Available only ...
1     ham   Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3     ham   U dun say so early hor... U c already then say...
4     ham   Nah I don't think he goes to usf, he lives aro...
...   ...   ...
5567  spam  This is the 2nd time we have tried 2 contact u...
5568  ham   Will ü b going to esplanade fr home?
5569  ham   Pity, * was in mood for that. So...any other s...
5570  ham   The guy did some bitching but I acted like i'd...


In [20]:
# Column 1 holds the message text, column 0 the ham/spam label
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])

In [24]:
vectorizer = TfidfVectorizer() 
X_train = vectorizer.fit_transform(X_train_raw) 

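Term frequency-inverse document frequency (TF-IDF) is a text vectorization scheme that transforms text into a usable numeric vector. It combines 2 concepts, Term Frequency (TF) and Inverse Document Frequency (IDF). The term frequency is the number of occurrences of a specific term in a document. The inverse document frequency down-weights terms that appear in many documents, so words that are common across the whole corpus contribute less.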
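To make the weighting concrete, here is a minimal sketch that recomputes TF-IDF by hand and compares it with TfidfVectorizer. It assumes scikit-learn's default settings (raw counts for TF, the smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row); the three toy documents are made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not taken from the SMS dataset).
docs = ["free prize now", "call me now", "free free entry"]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs).toarray()

# Terms in column order, recovered from the learned vocabulary.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Term frequency: raw count of each term in each document.
tokens = [d.split() for d in docs]
tf = np.array([[doc.count(t) for t in terms] for doc in tokens], dtype=float)

# Document frequency and smoothed IDF (the scikit-learn default).
n = len(docs)
df_counts = (tf > 0).sum(axis=0)
idf = np.log((1 + n) / (1 + df_counts)) + 1

# Multiply, then L2-normalize each row.
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

print(np.allclose(manual, tfidf))
```

In this toy corpus "free" occurs in two documents and "prize" in one, so "prize" receives the larger IDF weight.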

The fit() function learns the vocabulary and the IDF weights from the training text. The transform() function then applies those learned values to the data and returns the normalized TF-IDF vectors.

The fit_transform() function performs both in a single step. Note that the result is the same whether we call fit() and then transform() or call fit_transform() once.
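The equivalence can be checked on a small example. The sketch below uses a made-up corpus and also shows why the test messages are only passed to transform(): words not seen during fit() are ignored.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus for illustration.
docs = ["free prize now", "call me now", "free free entry"]

# One step: learn the vocabulary/IDF and vectorize together.
one_step = TfidfVectorizer().fit_transform(docs)

# Two steps: learn first, then vectorize.
fitted = TfidfVectorizer().fit(docs)
two_step = fitted.transform(docs)

same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)

# New text is mapped onto the vocabulary learned during fit();
# words that were never seen produce no non-zero entries.
unseen = fitted.transform(["totally unseen words"])
print(unseen.shape, unseen.nnz)
```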

In [23]:
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

In [25]:
X_test = vectorizer.transform( ['URGENT! Your Mobile No 1234 was awarded a Prize', 'Hey honey, whats up?'] )

CountVectorizer is a tool provided by the scikit-learn library in Python. It transforms a given text into a vector based on the frequency (count) of each word that occurs in it.

TF-IDF is often preferred over CountVectorizer because it considers not only how frequently a word occurs but also how informative it is: words that appear in most documents receive a lower weight.

We can then remove the words that are less important for analysis, which reduces the input dimensions and makes model building less complex.
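One possible way to drop less useful words is the min_df parameter, which both vectorizers accept. The sketch below is an assumption about how that pruning might be done, not the approach used in this notebook; the four toy documents are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus for illustration.
docs = ["free prize now", "call me now", "free free entry", "see you now"]

# CountVectorizer: raw word counts, full vocabulary.
counter = CountVectorizer()
counts = counter.fit_transform(docs)

# TfidfVectorizer with min_df=2: keep only terms that occur
# in at least two documents, shrinking the feature space.
pruned = TfidfVectorizer(min_df=2)
weights = pruned.fit_transform(docs)

print(counts.shape, sorted(counter.vocabulary_))
print(weights.shape, sorted(pruned.vocabulary_))
```

Other parameters such as max_df, max_features, and stop_words offer similar control; which one suits the SMS data would need to be checked with cross-validation.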

In [26]:
predictions = model.predict(X_test)
print(predictions)

['spam' 'ham']


In [27]:
model.predict_proba(X_test)

array([[0.22578622, 0.77421378],
       [0.96268468, 0.03731532]])

In [28]:
X_test

<2x7456 sparse matrix of type '<class 'numpy.float64'>'
	with 11 stored elements in Compressed Sparse Row format>

In [38]:
test = vectorizer.transform(['Ok lar... Joking wif u oni...'])

In [39]:
model.predict(test)

array(['ham'], dtype=object)

In [42]:
t = vectorizer.transform(["Did you hear about the new \"Divorce Barbie\"? It comes with all of Ken's stuff!"])

In [43]:
model.predict(t)

array(['ham'], dtype=object)