# Logistic regression for SMS spam classification


Each line of the data file `sms.txt`
contains a label---either "spam" or "ham" (i.e. non-spam)---followed
by a text message. Here are a few examples (line breaks added for readability):

    ham     Ok lar... Joking wif u oni...
    ham     Nah I don't think he goes to usf, he lives around here though
    spam    Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005.
            Text FA to 87121 to receive entry question(std txt rate)
            T&C's apply 08452810075over18's
    spam    WINNER!! As a valued network customer you have been
            selected to receivea £900 prize reward! To claim
            call 09061701461. Claim code KL341. Valid 12 hours only.

To create features suitable for logistic regression, use tools from `sklearn.feature_extraction.text` to:

* Convert words to lowercase.
* Remove punctuation and special characters (but convert the \$ and
  £ symbols to special tokens and keep them, because these are useful for predicting spam).
* Create a dictionary containing the 3000 words that appeared
  most frequently in the entire set of messages.
* Encode each message as a vector $\mathbf{x}^{(i)} \in
  \mathbb{R}^{3000}$. The entry $x^{(i)}_j$ is equal to the
  number of times the $j$th word in the dictionary appears in that
  message.
* Discard some ham messages to have an
  equal number of spam and ham messages.
* Split data into a training set of 1000 messages and a
  test set of 400 messages.
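
The steps above can be sketched roughly as follows. This is a minimal illustration, not the required solution: the token names `dollarsign` and `poundsign`, the helper names `clean` and `balance`, and the toy messages are all assumptions made for the example. Note that `CountVectorizer`'s default tokenizer drops one-character words.

```python
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def clean(text):
    """Lowercase, keep $ and £ as word tokens, replace other punctuation with spaces."""
    text = text.lower()
    text = text.replace("$", " dollarsign ").replace("£", " poundsign ")
    return re.sub(r"[^\w\s]", " ", text)

def balance(labels, seed=0):
    """Return sorted row indices containing equally many spam and ham messages."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    spam = np.flatnonzero(labels == "spam")
    ham = np.flatnonzero(labels == "ham")
    n = min(len(spam), len(ham))
    keep = np.concatenate([rng.choice(spam, n, replace=False),
                           rng.choice(ham, n, replace=False)])
    return np.sort(keep)

# Toy stand-in for the real data (hypothetical messages).
messages = ["WINNER!! Claim your £900 prize now",
            "Ok lar... Joking wif u oni...",
            "Nah I don't think he goes to usf",
            "Free entry, win $100 cash",
            "See you at home later"]
labels = ["spam", "ham", "ham", "spam", "ham"]

# max_features keeps the most frequent words across the whole corpus.
vectorizer = CountVectorizer(preprocessor=clean, max_features=3000)
X = vectorizer.fit_transform(messages)   # sparse matrix of word counts
keep = balance(labels)
X, y = X[keep], np.asarray(labels)[keep]
print(X.shape, y)
```

On the full data set, `train_test_split(X, y, train_size=1000, test_size=400)` from `sklearn.model_selection` is one way to produce the split described in the last step.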
  
Follow the instructions below to complete the implementation. You will be asked to: 

* Write code that implements the logistic regression algorithm (you may use the sklearn library for this, but doing so affects your score)
* Make predictions and report the accuracy on the test set
* Test out the classifier on a few of your own text messages

In [20]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import warnings
from IPython.display import display, HTML
warnings.filterwarnings("ignore")

# Build logistic regression classifier
For this part, you can refer to week 3 of Andrew Ng's Machine Learning course on Coursera.
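
As a starting point, here is a minimal sketch of logistic regression trained with batch gradient descent in NumPy. It assumes dense feature arrays and labels encoded as 1 (spam) and 0 (ham); the function names, learning rate, and iteration count are arbitrary choices for illustration, and no regularization is included.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping real scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=2000):
    """Batch gradient descent on the mean cross-entropy loss.

    X: (m, n) feature array, y: (m,) labels in {0, 1}.
    Returns the weight vector w and the bias b.
    """
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)            # predicted P(y = 1 | x)
        w -= lr * (X.T @ (p - y)) / m     # gradient with respect to w
        b -= lr * np.mean(p - y)          # gradient with respect to b
    return w, b

def predict(X, w, b):
    """Predict 1 when the estimated probability is at least 0.5."""
    return (sigmoid(X @ w + b) >= 0.5).astype(int)

# Toy check: the label is 1 when the first feature dominates the second.
X_toy = np.array([[3.0, 0.0], [2.0, 1.0], [0.0, 2.0], [1.0, 3.0]])
y_toy = np.array([1, 1, 0, 0])
w, b = fit_logistic(X_toy, y_toy)
print(predict(X_toy, w, b))
```

A sparse matrix from a vectorizer can be converted with `.toarray()` before being passed in; at 3000 features and about a thousand messages the dense array is still small.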

In [8]:
#code here

# Load and prep data
Using the provided instructions, load and preprocess the data.

In [8]:
path ='./sms.txt'
with open(path) as f:
    lines = f.readlines()

In [16]:
target, feature = [], []
for l in lines:
    label, text = l.rstrip('\n').split(None, 1)  # split label from message at the first whitespace
    target.append(label) ; feature.append(text)

In [76]:
df = pd.DataFrame(zip(feature,target),columns=['feature','target'])
df.head(2)

|   | feature | target |
|---|---------|--------|
| 0 | Go until jurong point, crazy.. Available only ... | ham |
| 1 | Ok lar... Joking wif u oni... | ham |


In [77]:
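df['feature'] = df['feature'].str.lower()
# Keep currency symbols as word tokens (token names are arbitrary) before stripping punctuation.
df['feature'] = df['feature'].str.replace('$', ' dollarsign ', regex=False)
df['feature'] = df['feature'].str.replace('£', ' poundsign ', regex=False)
df['feature'] = df['feature'].str.replace(r'\d+', '', regex=True)      # drop digits
df['feature'] = df['feature'].str.replace(r'[^\w\s]', '', regex=True)  # drop punctuation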

In [81]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
data = vectorizer.fit_transform(df['feature'])
#vectorizer.get_feature_names_out()

# Train logistic regression model
Using the logistic regression method, train the model on the training data.

In [None]:
#code here

# Make predictions on test set
Use the model fit in the previous cell to make predictions on the test set and compute the accuracy (percentage of messages in the test set that are classified correctly). You should be able to get accuracy above 95%.
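
Accuracy here is the fraction of test messages whose predicted label equals the true label. A minimal sketch, using made-up label arrays (`y_test_demo` and `y_pred_demo` are hypothetical stand-ins for the real test labels and predictions):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label matches the true label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical labels: 4 of the 5 predictions are correct.
y_test_demo = np.array(["spam", "ham", "ham", "spam", "ham"])
y_pred_demo = np.array(["spam", "ham", "spam", "spam", "ham"])
print(f"accuracy: {accuracy(y_test_demo, y_pred_demo):.0%}")  # accuracy: 80%
```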


In [1]:
#code here

# Inspect model parameters
Find which words are most common in spam messages and which are most common in ham messages.
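
One way to read the model parameters, assuming spam is the positive class, is to sort the learned weights: the largest positive weights push a message toward spam and the most negative ones toward ham. The sketch below uses a made-up vocabulary and weight vector; `top_words` is a hypothetical helper, not part of any library.

```python
import numpy as np

def top_words(weights, vocab, k=3):
    """Return the k words with the largest weights and the k with the smallest."""
    weights, vocab = np.asarray(weights), np.asarray(vocab)
    order = np.argsort(weights)                  # indices, ascending by weight
    spam_like = vocab[order[-k:][::-1]].tolist() # largest weights first
    ham_like = vocab[order[:k]].tolist()         # most negative weights first
    return spam_like, ham_like

# Hypothetical vocabulary and weights, for illustration only.
vocab_demo = ["call", "free", "home", "later", "prize", "sorry"]
w_demo = [1.2, 2.5, -1.1, -2.0, 3.1, -0.7]
spam_words, ham_words = top_words(w_demo, vocab_demo, k=2)
print("spam:", spam_words)
print("ham: ", ham_words)
```

With a fitted scikit-learn model, the weights would come from `classifier.coef_[0]` and the vocabulary from `vectorizer.get_feature_names_out()`; positive weights then point toward the class listed second in `classifier.classes_`.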

In [2]:
#code here

# Make predictions on new messages
Type a few of your own messages in below and make predictions. Are they ham or spam? Do the predictions make sense?

In [7]:
#code here

In [33]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Column 0 holds the label, column 1 the message text.
df = pd.read_csv('/Users/apple/Documents/SBU/Mine/SMSSpamCollection', delimiter='\t',header=None)

# Default split: 75% of the messages for training, 25% for testing.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1],df[0])

# Fit the TF-IDF vocabulary on the training messages only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
print(X_train)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Vectorize two new messages with the fitted vocabulary and classify them.
X_test = vectorizer.transform( ['URGENT! Your Mobile No 1234 was awarded a Prize', 'Hey honey, whats up?'] )
predictions = classifier.predict(X_test)
print(predictions)

  (0, 3194)	0.2925489569980067
  (0, 6982)	0.22034179063014236
  (0, 5146)	0.31809939238015034
  (0, 5038)	0.2053760358865043
  (0, 6033)	0.1510056744887204
  (0, 5294)	0.31809939238015034
  (0, 2887)	0.17178929067253457
  (0, 1251)	0.32578479081429085
  (0, 3939)	0.22707067827539176
  (0, 4802)	0.19548811471828356
  (0, 3491)	0.12225693876694367
  (0, 1688)	0.2425713765968942
  (0, 811)	0.2145406383109565
  (0, 1550)	0.1417063587391772
  (0, 7223)	0.31685024517647237
  (0, 6845)	0.3644091832327992
  (1, 2686)	0.2763568694085726
  (1, 4733)	0.1270308219072053
  (1, 3607)	0.12037903388605233
  (1, 3008)	0.18769507324999662
  (1, 4385)	0.2059706807268583
  (1, 7083)	0.1659985388362787
  (1, 4743)	0.1577680482224013
  (1, 2341)	0.3080332263777975
  (1, 6419)	0.24318490703404466
  :	:
  (4177, 7253)	0.12851619435128098
  (4177, 1572)	0.11836095234262708
  (4177, 6553)	0.28460868748890367
  (4177, 6532)	0.11486300667798416
  (4177, 3607)	0.1084257181473214
  (4177, 6536)	0.08985844883450542

In [32]:
X_train

<4179x7535 sparse matrix of type '<class 'numpy.float64'>'
	with 56076 stored elements in Compressed Sparse Row format>