# Feature Extraction and Modeling

Agenda:

- Feature Extraction: Bag of Words, TF-IDF, and other one-off features
- Modeling: basic classification modeling.

## Feature Extraction

**Bag of words**: Term Frequency (TF), how often a word appears in a document.
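As a small illustration of term frequency, the sketch below counts the words in one made-up message using only the standard library. The message and the variable names are invented for this example and are not part of the spam dataset.

```python
from collections import Counter

# A made-up, already-cleaned message (hypothetical, not from spam.csv).
document = "free prize call now to claim your free prize"

# Raw term frequency: how many times each word occurs in this document.
term_counts = Counter(document.split())

# Relative term frequency: raw count divided by the number of words.
n_words = sum(term_counts.values())
term_freq = {word: count / n_words for word, count in term_counts.items()}

print(term_counts.most_common(2))
print(round(term_freq["free"], 3))
```

Whether to use raw counts or relative frequencies is a design choice; relative frequencies keep long documents from dominating.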

**Term Frequency-Inverse Document Frequency (TF-IDF)**

Inverse Document Frequency (IDF): How much information a word provides, based on how commonly the word appears across multiple documents. The more documents a word appears in, the lower the IDF for that word will be.

$$
\mbox{idf}(\mbox{word})
=
\log\left(\frac{\mbox{\# of documents}}{\mbox{\# of documents containing the word}}\right)
$$
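The formula above can be checked by hand on a tiny corpus. This is a minimal sketch using three made-up documents; the helper name `idf` is chosen for this example, and it assumes the word occurs in at least one document (otherwise the division fails).

```python
import math

# Three made-up documents, each already split into words.
documents = [
    ["free", "prize", "call"],
    ["call", "me", "later"],
    ["free", "call", "now"],
]

def idf(word, docs):
    """log(# of documents / # of documents containing the word)."""
    n_containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n_containing)

print(idf("call", documents))   # "call" is in every document: log(3/3)
print(idf("prize", documents))  # "prize" is in one document: log(3/1)
```

A word that appears in every document gets an IDF of exactly zero, which is the sense in which it "provides no information".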

Term Frequency - Inverse Document Frequency (TF-IDF): The product of the two measures above. A word that has a high frequency in a document will have a high TF. If it also appears in many other documents, then the information the word provides, or the uniqueness of that word, is lowered. This is done mathematically by multiplying by the IDF, which approaches 0 as the number of documents containing the word increases.
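To make the multiplication concrete, here is a hedged, standard-library sketch that scores every word of one made-up document. The helper names (`tf`, `idf`, `tf_idf`) are invented for this example. Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and normalizes each row, so its numbers will differ from this plain version.

```python
import math
from collections import Counter

# Made-up documents, already cleaned and split into words.
docs = [
    "free prize call now".split(),
    "call me later".split(),
    "free call now".split(),
]

def tf(word, doc):
    # Relative frequency of the word within one document.
    return Counter(doc)[word] / len(doc)

def idf(word, docs):
    # Assumes the word occurs in at least one document.
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

first = docs[0]
scores = {w: round(tf_idf(w, first, docs), 3) for w in first}
print(scores)
```

In the first document every word has the same TF, so the ranking is driven entirely by IDF: the rarest word scores highest and the word found in every document scores zero.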

**Other features**, such as document_length: how many words appear in each document. 
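A one-off feature such as document length could be computed as sketched below. The toy DataFrame and the column name `document_length` are assumptions for illustration; the same expression would apply to a real text column.

```python
import pandas as pd

# Made-up messages standing in for the real dataset.
toy = pd.DataFrame({
    "label": ["ham", "spam"],
    "text": ["see you at lunch", "win a free prize call now to claim"],
})

# document_length: number of whitespace-separated words in each message.
toy["document_length"] = toy["text"].str.split().str.len()
print(toy)
```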

**Wrangle the data**



In [1]:
import pandas as pd

df = pd.read_csv('spam.csv', 
                 encoding='latin-1',
                 usecols=[0,1]) # use first 2 columns only to get rid of unnamed columns full of nans
df.columns = ['label', 'text']

Notice that the dataset is imbalanced: about 87% of the messages are ham.

In [2]:
labels = pd.concat([df.label.value_counts(), # get total counts of ham vs spam
                    df.label.value_counts(normalize=True)], axis=1) # getting the prop of ham vs. spam
labels.columns = ['n', 'percent']
labels

| | n | percent |
|---|---|---|
| ham | 4825 | 0.865937 |
| spam | 747 | 0.134063 |
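One practical consequence of the imbalance is that a model which always predicts the majority class already looks accurate. The sketch below computes that baseline from the counts shown above; the variable names are chosen for this example.

```python
# Class counts copied from the output above.
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 4))
```

Any classifier trained later should be judged against this baseline rather than against 50%.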


Clean and prep the text

In [9]:
import unicodedata
import re
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

def basic_clean(text):
    text = (unicodedata.normalize('NFKD', text.lower())
            .encode('ascii', 'ignore') # ascii to reduce noise
            .decode('utf-8', 'ignore') # decode using utf-8
           )
    return re.sub(r"[^a-z0-9\s]", '', text)

def tokenize(string):
    '''
    This function takes in a string and
    returns a tokenized string.
    '''
    # Create tokenizer.
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    # Use tokenizer
    string = tokenizer.tokenize(string, return_str=True)
    
    return string

def lemmatize(string):
    '''
    This function takes in a string and
    returns a string with words lemmatized.
    '''
    # Create the lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Use the lemmatizer on each word in the list of words we created by using split.
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    # Join our list of words into a string again and assign to a variable.
    string = ' '.join(lemmas)
    
    return string

def remove_stopwords(string, extra_words=[], exclude_words=[]):
    '''
    This function takes in a string, optional extra_words and exclude_words parameters
    with default empty lists and returns a string.
    '''
    # Create stopword_list.
    stopword_list = stopwords.words('english')
    
    # Remove 'exclude_words' from stopword_list to keep these in my text.
    stopword_list = set(stopword_list) - set(exclude_words)
    # Add in 'extra_words' to stopword_list.
    stopword_list = stopword_list.union(set(extra_words))
    
    # Split words in string.
    words = string.split()
    
    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_list]
    
    # Join words in the list back into strings and assign to a variable.
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords


def prep_spam_data(df, column, extra_words=[], exclude_words=[]):
    '''
    This function takes in a df and the string name of a text column, with the
    option to pass lists for extra_words and exclude_words, and returns a df with
    the label, the original text, the lemmatized text, the clean text (cleaned,
    tokenized, stopwords removed, lemmatized), and a list of words per document.
    '''
    df['clean'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)\
                            .apply(lemmatize)
      
    df['lemmatized'] = df[column].apply(basic_clean).apply(lemmatize)

    # Drop leftover single characters; substitute a space so neighboring words do not merge.
    df['words'] = [re.sub(r'([^a-z0-9\s]|\s.\s)', ' ', doc).split() for doc in df.lemmatized]
    
    return df[['label', column, 'lemmatized', 'clean', 'words']]

In [11]:
df = prep_spam_data(df, 'text')

**Split into train, validate and test**

In [25]:
from sklearn.model_selection import train_test_split

train_validate, test = train_test_split(df[['label', 'clean']], 
                                        stratify=df.label, 
                                        test_size=.2)

train, validate = train_test_split(train_validate, 
                                   stratify=train_validate.label, 
                                   test_size=.25)

In [26]:
print(train.label.value_counts())
print(validate.label.value_counts())
print(test.label.value_counts())
train.head()

ham     2894
spam     448
Name: label, dtype: int64
ham     965
spam    150
Name: label, dtype: int64
ham     966
spam    149
Name: label, dtype: int64


| | label | clean |
|---|---|---|
| 2281 | ham | hav almost reached call unable connect u |
| 5524 | spam | awarded sipix digital camera call 09061221061 ... |
| 3658 | ham | waiti come ltgt min |
| 2688 | ham | yes know cheesy song frosty snowman |
| 4832 | spam | new mobile 2004 must go txt nokia 89545 collec... |
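The two `test_size` values above combine into a 60/20/20 split, which is easy to verify. This sketch only does the arithmetic and compares it with the row counts printed above; the variable names are invented here. As a side note, passing `random_state` to `train_test_split` would make the split repeatable.

```python
test_size = 0.20      # share of all rows held out for test
validate_size = 0.25  # share of the remaining rows held out for validate

test_share = test_size
validate_share = (1 - test_size) * validate_size
train_share = (1 - test_size) * (1 - validate_size)

# Row counts (ham + spam) copied from the value_counts output above.
sizes = {"train": 2894 + 448, "validate": 965 + 150, "test": 966 + 149}
total = sum(sizes.values())

for name, n in sizes.items():
    print(name, n, round(n / total, 3))
```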


### Bag of Words

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html



In [39]:
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer, which builds the bag-of-words model.
# stop_words: language whose built-in stopword list is removed.
# min_df: ignore terms whose document frequency is strictly lower than
# the given threshold (also called the cut-off in the literature).
# A float is a proportion of documents; an integer is an absolute count.
# ngram_range: lower and upper bounds on n for the word n-grams to extract.
# binary: if True, record presence (1) rather than the number of occurrences.

vectorizer = CountVectorizer(stop_words='english', 
                             min_df=20, 
                             ngram_range=(1,2), 
                             binary=True)

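# Learn the vocabulary from the training documents.
vectorizer.fit(train.clean)

# List the learned terms. Note: scikit-learn 1.0+ offers get_feature_names_out();
# get_feature_names() was removed in 1.2.
vectorizer.get_feature_names()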

['150ppm',
 '16',
 '18',
 '1st',
 '500',
 'a1000',
 'a2000',
 'account',
 'afternoon',
 'amp',
 'ask',
 'awarded',
 'babe',
 'best',
 'better',
 'bit',
 'box',
 'buy',
 'called',
 'camera',
 'car',
 'care',
 'cash',
 'chance',
 'chat',
 'check',
 'claim',
 'class',
 'collect',
 'come',
 'coming',
 'contact',
 'cool',
 'cost',
 'customer',
 'da',
 'dat',
 'day',
 'dear',
 'didnt',
 'dont',
 'dont know',
 'draw',
 'dun',
 'dunno',
 'easy',
 'eat',
 'end',
 'enjoy',
 'feel',
 'fine',
 'finish',
 'free',
 'friend',
 'getting',
 'girl',
 'god',
 'going',
 'gonna',
 'good',
 'good morning',
 'got',
 'great',
 'guaranteed',
 'gud',
 'guess',
 'guy',
 'haha',
 'half',
 'happy',
 'havent',
 'heart',
 'hello',
 'help',
 'hey',
 'hi',
 'holiday',
 'home',
 'hope',
 'hour',
 'house',
 'hows',
 'ii',
 'ill',
 'ill later',
 'im',
 'ive',
 'job',
 'jus',
 'kiss',
 'know',
 'landline',
 'lar',
 'late',
 'later',
 'latest',
 'leave',
 'lesson',
 'let',
 'let know',
 'liao',
 'life',
 'like',
 'line',
 

In [40]:
# Transform each document into its bag-of-words vector.
bow = vectorizer.transform(train.clean)
bow_array = bow.toarray()
bow_array

array([[0, 0, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
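The effect of `min_df`, `ngram_range`, and `binary` is easier to see on a corpus small enough to read. The following is a sketch on three made-up documents; the names `toy_docs`, `toy_vectorizer`, and `toy_bow` are assumptions for this example, and `min_df=2` is used instead of 20 because the corpus is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, already cleaned.
toy_docs = [
    "free prize call now",
    "free prize inside",
    "call me now now",
]

# min_df=2 keeps only terms found in at least 2 documents;
# ngram_range=(1, 2) considers single words and adjacent word pairs;
# binary=True records presence (1) or absence (0) instead of counts.
toy_vectorizer = CountVectorizer(min_df=2, ngram_range=(1, 2), binary=True)
toy_bow = toy_vectorizer.fit_transform(toy_docs)

# Columns follow the alphabetical order of the kept terms.
print(sorted(toy_vectorizer.vocabulary_))
print(toy_bow.toarray())
```

The bigram "free prize" survives because it occurs in two documents, while "now" is recorded as 1 in the third document even though it appears twice there.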

In [43]:
# Show sentences and vector space representation.
for i, v in zip(train.clean, bow_array):
    print(i)
    print(v)

hav almost reached call unable connect u
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0]
...
whats coming hill monster hope great day thing r going fine busy though
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0]
...


### TF-IDF

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

We get back a sparse matrix, a matrix in which most entries are 0. SciPy has special sparse matrix types (scipy.sparse) that store only the nonzero entries, which saves memory and makes some operations faster.

Because our data set is pretty small, we can convert our sparse matrix to a dense one and put everything in a dataframe. If our data were larger, the operation below might take much longer and use much more memory.
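To give a rough sense of why the sparse representation matters, the sketch below builds a made-up matrix that is about 1% nonzero and compares the memory used by the dense array with the memory used by the CSR arrays. The matrix shape, the seed, and the variable names are all assumptions for illustration.

```python
import numpy as np
from scipy import sparse

# A made-up 1000 x 500 matrix with at most 5000 nonzero entries.
rng = np.random.default_rng(0)
dense = np.zeros((1000, 500))
rows = rng.integers(0, 1000, size=5000)
cols = rng.integers(0, 500, size=5000)
dense[rows, cols] = 1.0

# CSR stores only the nonzero values plus their column indices and row offsets.
csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(csr.nnz, dense_bytes, sparse_bytes)
```

Calling `.toarray()` or `.todense()` on a document-term matrix gives up this saving, which is why it is only reasonable for small data.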


In [52]:
from sklearn.feature_extraction.text import TfidfVectorizer

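tfidf = TfidfVectorizer(stop_words='english',
                        min_df=20,
                        ngram_range=(1,2),
                        binary=True)

tfidf_sparse_matrix = tfidf.fit_transform(train.clean)

# Convert to a dense DataFrame for display (fine for a small dataset).
# Note: scikit-learn 1.0+ offers get_feature_names_out(); get_feature_names() was removed in 1.2.
pd.DataFrame(tfidf_sparse_matrix.todense(), columns=tfidf.get_feature_names())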

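| | 150ppm | 16 | 18 | 1st | 500 | a1000 | a2000 | account | afternoon | amp | ... | work | world | worry | ya | yeah | year | yes | yo | youre | yup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
| 1 | 0.457546 | 0.426747 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
| 2 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
| 3 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.790334 | 0.0 | 0.0 | 0.0 |
| 4 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3337 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
| 3338 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
| 3339 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
| 3340 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |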


In [53]:
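# Get the vocabulary: a dict mapping each term to its column index.
# The output below assumes `vectorizer` is a TF-IDF vectorizer fit on train.clean
# without min_df (its column indices run into the thousands).
vectorizer.vocabulary_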

{'hav': 2840,
 'reached': 4730,
 'unable': 5977,
 'connect': 1722,
 'awarded': 1043,
 'sipix': 5217,
 'digital': 2011,
 'camera': 1429,
 '09061221061': 156,
 'landline': 3423,
 'delivery': 1945,
 '28days': 342,
 'box177': 1280,
 'm221bp': 3659,
 '2yr': 381,
 'warranty': 6189,
 '150ppm': 262,
 '16': 274,
 'pa399': 4285,
 'waiti': 6157,
 'come': 1666,
 'ltgt': 3632,
 'min': 3839,
 'yes': 6485,
 'know': 3381,
 'cheesy': 1548,
 'song': 5326,
 'frosty': 2565,
 'snowman': 5301,
 'new': 4040,
 'mobile': 3888,
 '2004': 310,
 'txt': 5947,
 'nokia': 4085,
 '89545': 633,
 'collect': 1655,
 'today': 5816,
 'a1': 662,
 'www4tcbiz': 6400,
 '2optout': 367,
 '08718726270150gbpmtmsg18': 114,
 'txtauction': 5949,
 'thats': 5716,
 'going': 2678,
 'ruin': 4954,
 'thesis': 5736,
 'yo': 6501,
 'sorry': 5338,
 'shower': 5176,
 'sup': 5564,
 'said': 4976,
 'mind': 3841,
 'bedroom': 1142,
 'minute': 3851,
 'ok': 4192,
 'sed': 5052,
 'sexy': 5103,
 'mood': 3919,
 'came': 1428,
 'minuts': 3852,
 'latr': 3448,
 ...

In [48]:
# Transform to document-term matrix
vector_spaces = vectorizer.transform(train.clean)
vector_spaces.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
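A dense printout like the one above hides the few nonzero scores. One way to read a single row is to map its nonzero column indices back to terms through `vocabulary_`. This is a sketch on made-up documents; `toy_tfidf`, `toy_matrix`, and `index_to_term` are names invented for the example, and default `TfidfVectorizer` settings are assumed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, already cleaned.
toy_docs = ["free prize call now", "call me later", "free call now"]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_docs)

# Invert vocabulary_ (term -> column index) so terms can be looked up by column.
index_to_term = {index: term for term, index in toy_tfidf.vocabulary_.items()}

# Row 0 of the sparse matrix: .indices holds column numbers, .data the scores.
row = toy_matrix[0]
pairs = sorted(zip(row.indices, row.data), key=lambda p: -p[1])
for column, score in pairs:
    print(index_to_term[column], round(score, 3))
```

Sorting by score puts the most distinctive term of the document first, which is often more informative than the raw index order.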

In [49]:
# Show each document and the nonzero entries of its TF-IDF vector.
#
# (A, B) C
# A : Row index (always 0 here, because each row is printed separately)
# B : Column (term) index in the vocabulary
# C : TF-IDF score
for i, v in zip(train.clean, vector_spaces):
    print(i)
    print(v)

hav almost reached call unable connect u
  (0, 5977)	0.5291640085167502
  (0, 4730)	0.5014015352399835
  (0, 2840)	0.43424348543179897
  (0, 1722)	0.5291640085167502
awarded sipix digital camera call 09061221061 landline delivery within 28days c box177 m221bp 2yr warranty 150ppm 16 p pa399
  (0, 6189)	0.28794421667425113
  (0, 5217)	0.2695947816343014
  (0, 4285)	0.28794421667425113
  (0, 3659)	0.28794421667425113
  (0, 3423)	0.2077110539912124
  (0, 2011)	0.2630455847907504
  (0, 1945)	0.23260968587471312
  (0, 1429)	0.21637386824676613
  (0, 1280)	0.28794421667425113
  (0, 1043)	0.21477711058278795
  (0, 381)	0.28794421667425113
  (0, 342)	0.28794421667425113
  (0, 274)	0.20180910227565804
  (0, 262)	0.21637386824676613
  (0, 156)	0.28794421667425113
waiti come ltgt min
  (0, 6157)	0.7277943235803661
  (0, 3839)	0.4445794650077219
  (0, 3632)	0.3776341095970367
  (0, 1666)	0.360634165223566
...


In [None]:
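# Show each cleaned sentence followed by its sparse TF-IDF vector.
# 
# Each printed entry has the form (A, B) C
# A : Row index within the printed matrix (always 0 here, since each v is a single-row matrix)
# B : Column index of the word in the TF-IDF vocabulary
# C : TF-IDF score of that word in the document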

for i, v in zip(train.clean, vector_spaces):
    print(i)
    print(v)

## Model

In [None]:
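from sklearn.linear_model import LogisticRegression

# Fit a logistic regression classifier on the TF-IDF features.
lm = LogisticRegression().fit(X_train, y_train)

# Store predictions next to the original rows for later evaluation.
train['predicted'] = lm.predict(X_train)
test['predicted'] = lm.predict(X_test)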
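Because the dataset is imbalanced (about 87% ham), raw accuracy is only meaningful next to the majority-class baseline. The sketch below shows one way to compare predictions against labels. It uses a small hypothetical frame named `toy` with an assumed `actual` column, so the numbers are illustrative and are not results from this notebook.

```python
import pandas as pd

# Hypothetical toy frame: `actual` stands in for the true label column.
toy = pd.DataFrame({
    "actual":    ["ham", "ham", "ham", "spam", "spam", "ham"],
    "predicted": ["ham", "ham", "ham", "spam", "ham",  "ham"],
})

# Share of rows where the prediction matches the label.
accuracy = (toy.actual == toy.predicted).mean()

# Accuracy of always guessing the most common class.
baseline = toy.actual.value_counts(normalize=True).max()

# Rows are predictions, columns are true labels.
confusion = pd.crosstab(toy.predicted, toy.actual)

# Of the true spam messages, how many were caught?
is_spam = toy.actual == "spam"
spam_recall = (is_spam & (toy.predicted == "spam")).sum() / is_spam.sum()

print(f"accuracy: {accuracy:.2f}, baseline: {baseline:.2f}")
print(f"spam recall: {spam_recall:.2f}")
print(confusion)
```

The same expressions should apply to the `train` and `test` frames once `actual` is swapped for the real label column.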

hav almost reached call unable connect u
  (0, 5977)	0.5291640085167502
  (0, 4730)	0.5014015352399835
  (0, 2840)	0.43424348543179897
  (0, 1722)	0.5291640085167502
awarded sipix digital camera call 09061221061 landline delivery within 28days c box177 m221bp 2yr warranty 150ppm 16 p pa399
  (0, 6189)	0.28794421667425113
  (0, 5217)	0.2695947816343014
  (0, 4285)	0.28794421667425113
  (0, 3659)	0.28794421667425113
  (0, 3423)	0.2077110539912124
  (0, 2011)	0.2630455847907504
  (0, 1945)	0.23260968587471312
  (0, 1429)	0.21637386824676613
  (0, 1280)	0.28794421667425113
  (0, 1043)	0.21477711058278795
  (0, 381)	0.28794421667425113
  (0, 342)	0.28794421667425113
  (0, 274)	0.20180910227565804
  (0, 262)	0.21637386824676613
  (0, 156)	0.28794421667425113
waiti come ltgt min
  (0, 6157)	0.7277943235803661
  (0, 3839)	0.4445794650077219
  (0, 3632)	0.3776341095970367
  (0, 1666)	0.360634165223566
yes know cheesy song frosty snowman
  (0, 6485)	0.30113560018115115
  (0, 5326)	0.391222954779

  (0, 6197)	0.6676486596391591
  (0, 2425)	0.744476505527229
urgent please call 09066612661 landline complimentary 4 lux costa del sol holiday a1000 cash await collection ppm 150 sae tc james 28 eh74rr
  (0, 6037)	0.17608382161472116
  (0, 5656)	0.18247411975548303
  (0, 5309)	0.24243748648854932
  (0, 4972)	0.20694938495271573
  (0, 4543)	0.24243748648854932
  (0, 3650)	0.2720361163823724
  (0, 3423)	0.18678753921069893
  (0, 3220)	0.2720361163823724
  (0, 2938)	0.187962310931594
  (0, 2201)	0.2720361163823724
  (0, 1936)	0.2272551180689277
  (0, 1764)	0.24243748648854932
  (0, 1693)	0.22004698733182698
  (0, 1658)	0.20105991331070522
  (0, 1472)	0.1707513062198966
  (0, 1039)	0.20917803836742127
  (0, 665)	0.1917670165330941
  (0, 341)	0.24964561722565004
  (0, 251)	0.21157192315055473
  (0, 194)	0.2589385140032611
project w frens lor
  (0, 4626)	0.602736496703402
  (0, 3590)	0.4136854006768192
  (0, 2544)	0.6823291762841228
haha dont angry take practice real thing
  (0, 5742)	0.3377

  (0, 2462)	0.6485487430438114
ah well confuses thing doesnt thought friend maybe wrong thing already sort invited tho may come co money
  (0, 6390)	0.28359025372349345
  (0, 5759)	0.23516741831776894
  (0, 5756)	0.2760000713833169
  (0, 5742)	0.412622368822695
  (0, 5341)	0.29843889756063835
  (0, 3909)	0.23026402272960056
  (0, 3762)	0.2568693979044248
  (0, 3157)	0.2879764905424434
  (0, 2551)	0.20440409927281197
  (0, 2065)	0.2568693979044248
  (0, 1719)	0.3505923539434547
  (0, 1666)	0.17372432952783426
  (0, 806)	0.2568693979044248
2 babe feel let 4get itboth try cheer upnot fit soo muchxxlove u locaxx
  (0, 6026)	0.3352444910378679
  (0, 5918)	0.21803144379689807
  (0, 5330)	0.3352444910378679
  (0, 3958)	0.3352444910378679
  (0, 3560)	0.3352444910378679
  (0, 3493)	0.1960545438593389
  (0, 3186)	0.3352444910378679
  (0, 2448)	0.30765149495447874
  (0, 2402)	0.2140472935941317
  (0, 1544)	0.2800584988710896
  (0, 1057)	0.209569738692079
  (0, 457)	0.29876853429311884
r still mee

  (0, 3119)	0.870653613097977
oh wow thats gay firmware update help
  (0, 6380)	0.386317838605869
  (0, 6019)	0.386317838605869
  (0, 5716)	0.2887741674940243
  (0, 4188)	0.2787603437591381
  (0, 2878)	0.32916429441378114
  (0, 2604)	0.4373319585615156
  (0, 2444)	0.4907248020928086
probably money worry thing coming due several outstanding invoice work two three month ago
  (0, 6368)	0.2970427486407553
  (0, 6354)	0.25378119729758475
  (0, 5742)	0.24968279755575765
  (0, 4607)	0.301243725691804
  (0, 4271)	0.3893728532206667
  (0, 3916)	0.2950667095839516
  (0, 3909)	0.2786711032444008
  (0, 3160)	0.42429536713689403
  (0, 1671)	0.2830549572208907
  (0, 803)	0.343207800930711
nope forgot show next week
  (0, 6232)	0.4718711708225913
  (0, 4100)	0.6442599782054265
  (0, 2506)	0.6018859348990143
sometimes put wall around heartsnot safe getting hurt find care enough break wall amp get closer goodnoon
  (0, 6170)	0.6041251260106112
  (0, 4973)	0.27036727119265014
  (0, 3022)	0.250401598577

  (0, 985)	0.47658501761303407
burger king wanna play footy top stadium get 2 burger king 1st sept go large super cocacola walk winner
  (0, 6304)	0.20454709567887783
  (0, 6179)	0.17835105879511148
  (0, 6165)	0.19945973398767514
  (0, 5565)	0.23535376368244998
  (0, 5417)	0.2564624388750137
  (0, 5087)	0.23535376368244998
  (0, 4447)	0.18078862986976996
  (0, 3435)	0.2564624388750137
  (0, 3358)	0.4571165761842237
  (0, 2495)	0.2564624388750137
  (0, 1642)	0.2564624388750137
  (0, 1366)	0.48822931089492194
  (0, 296)	0.18078862986976996
alright ill make sure car back tonight
  (0, 5847)	0.4079420679519368
  (0, 5578)	0.4136812922849868
  (0, 3701)	0.36646789159618653
  (0, 3067)	0.3135843668922743
  (0, 1447)	0.4423441872565709
  (0, 852)	0.48388925063351196
boy love u grl hogolo boy gold chain kodstini grl agalla boy necklace madstini grl agalla boy hogli 1 mutai eerulli kodthini grl love u kano
  (0, 4021)	0.15688063064944427
  (0, 3975)	0.15688063064944427
  (0, 3683)	0.1568806306

  (0, 5888)	0.509480367980275
  (0, 4944)	0.5551752208537301
  (0, 4849)	0.44279185488777545
  (0, 3632)	0.28806641300258196
  (0, 2872)	0.3913608868264235
number vivek
  (0, 6126)	0.8611431229290155
  (0, 4138)	0.5083625889383111
hiwhat think match
  (0, 5744)	0.39299587112568146
  (0, 3746)	0.5733670089954812
  (0, 2917)	0.7188911727610395
cant keep talking people sure pay agree price pls tell want really buy much willing pay
  (0, 6294)	0.30615758171005236
  (0, 6181)	0.18409491438396489
  (0, 5677)	0.20319848469837254
  (0, 5636)	0.3004547946068073
  (0, 5578)	0.23391038989313495
  (0, 4749)	0.2173499075536912
  (0, 4593)	0.27360851310338413
  (0, 4459)	0.20969071135330233
  (0, 4372)	0.24124967850256657
  (0, 4353)	0.5270258958905106
  (0, 1380)	0.23391038989313495
  (0, 804)	0.3596599107110481
aww must nearly deadwell jez iscoming todo workand whilltake forever
  (0, 6356)	0.3441594016820788
  (0, 6276)	0.3441594016820788
  (0, 5823)	0.3441594016820788
  (0, 4016)	0.3275893114374

  (0, 2144)	0.19739792498078282
time week ryan
  (0, 6232)	0.4533443055986053
  (0, 5785)	0.40136320969816114
  (0, 4965)	0.7958558377508507
yes fine love safe
  (0, 6485)	0.4582876654469948
  (0, 4973)	0.604597198354906
  (0, 3610)	0.39862016521909394
  (0, 2432)	0.5153024425725576
kkwhere youhow performed
  (0, 6510)	0.5866111703772874
  (0, 4376)	0.5866111703772874
  (0, 3373)	0.5583678622352636
wat time u wan 2 meet later
  (0, 6196)	0.4467731357119
  (0, 6176)	0.49959579768597345
  (0, 5785)	0.38167345917412177
  (0, 3781)	0.469358025582616
  (0, 3446)	0.4299141990257132
see knew giving break time woul lead always wanting miss curfew gonna gibe til one midnight movie gonna get til 2 need come home need getsleep anything need b studdying ear training
  (0, 6377)	0.22556964120676798
  (0, 6184)	0.22556964120676798
  (0, 5883)	0.18843768293308652
  (0, 5785)	0.11375848603570064
  (0, 5783)	0.34689669658827876
  (0, 5503)	0.22556964120676798
  (0, 4023)	0.35692105476715363
  (0, 3937)

  (0, 6485)	0.3283597735001394
  (0, 6194)	0.44056543523116687
  (0, 6061)	0.5019880744131688
  (0, 2511)	0.5019880744131688
  (0, 1691)	0.44056543523116687
happy new year princess
  (0, 6475)	0.5059951521572763
  (0, 4596)	0.5643165637900184
  (0, 4040)	0.44337128145207183
  (0, 2824)	0.4784742716271259
many lick take get center tootsie pop
  (0, 5854)	0.5231223009622206
  (0, 4508)	0.4800656304203813
  (0, 3501)	0.4979357632893393
  (0, 1498)	0.4979357632893393
say slowly godi love amp need youclean heart bloodsend ten special people amp u c miracle tomorrow itplspls
  (0, 6507)	0.29818169895034485
  (0, 5837)	0.20514154827198208
  (0, 5363)	0.23586045054619734
  (0, 5268)	0.28481362739114663
  (0, 5011)	0.19059062624323878
  (0, 4372)	0.22443081650134372
  (0, 4023)	0.17647290814283556
  (0, 3853)	0.28481362739114663
  (0, 3610)	0.1811988792964869
  (0, 3196)	0.29818169895034485
  (0, 2860)	0.2297360304765864
  (0, 2673)	0.29818169895034485
  (0, 1228)	0.29818169895034485
  (0, 868)

1 dont number 2 gonna massive pain as id rather get involved thats possible
  (0, 5716)	0.2600062441046056
  (0, 4519)	0.39376458413595367
  (0, 4294)	0.3265595362247746
  (0, 4138)	0.26083249214880844
  (0, 3742)	0.42056539093879647
  (0, 3161)	0.4418383880676147
  (0, 3047)	0.32372286342643625
  (0, 2687)	0.2833697833816119
  (0, 2083)	0.21326075584249934
im sorry bout last nite wasnaot ur fault spouse pmt sumthin u 4give think u shldxxxx
  (0, 6191)	0.33226139865440957
  (0, 6034)	0.1511650586812531
  (0, 5744)	0.18163716950942607
  (0, 5553)	0.33226139865440957
  (0, 5404)	0.33226139865440957
  (0, 5338)	0.1829597756190731
  (0, 5154)	0.33226139865440957
  (0, 4469)	0.33226139865440957
  (0, 4066)	0.2687625479765143
  (0, 3069)	0.1412942161389121
  (0, 2390)	0.29611001442808654
  (0, 1274)	0.2615692233873653
  (0, 458)	0.33226139865440957
ok sure time tho sure get library class try see point good eve
  (0, 5918)	0.2679443903987145
  (0, 5785)	0.20777358978962007
  (0, 5756)	0.32433

  (0, 4192)	0.4371653688842131
  (0, 3069)	0.3992158544680142
  (0, 2943)	0.5072391242516321
  (0, 1671)	0.6262759875790709
lol oops sorry fun
  (0, 5338)	0.397692279855508
  (0, 4228)	0.5932339877221547
  (0, 3572)	0.46319250425465747
  (0, 2573)	0.5247542189986639
work right
  (0, 6354)	0.7009187642782617
  (0, 4896)	0.7132411134270336
long get reply defer admission til next semester
  (0, 5783)	0.38157582696834585
  (0, 5069)	0.4076110312016201
  (0, 4843)	0.27961174568784375
  (0, 3579)	0.34287478165669016
  (0, 1930)	0.4962395043189571
  (0, 767)	0.4962395043189571
guai ii shd haf seen he naughty ii free today go jogging
  (0, 5816)	0.20311625270370373
  (0, 5130)	0.3177814315390312
  (0, 5056)	0.3177814315390312
  (0, 4006)	0.34258860235337785
  (0, 3264)	0.3553412416916377
  (0, 3061)	0.5036331266596265
  (0, 2790)	0.2938882917272683
  (0, 2761)	0.3733150773355626
  (0, 2532)	0.1837350311367895
hidid asked waheeda fathima leave
  (0, 6154)	0.507434435239255
  (0, 3471)	0.3332756