# Overview
#### Context

The dataset comes from the data center of ASTA funding.
It was exported from MySQL and saved as a .csv file.
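
The export step itself is not shown in this notebook. A minimal sketch of how a query result could be written to CSV through any DB-API connection is given below. The helper name `export_query_to_csv` and the table name `art_app` are hypothetical, and an in-memory `sqlite3` database stands in for MySQL so that the sketch runs without a server. A connection created with `pymysql.connect(...)` should work the same way, since it exposes the same cursor interface.

```python
import csv
import os
import sqlite3
import tempfile


def export_query_to_csv(conn, query, path):
    """Run `query` on a DB-API connection and write the rows to `path` as CSV.

    Returns the number of data rows written (the header is not counted).
    """
    cur = conn.cursor()
    try:
        cur.execute(query)
        header = [col[0] for col in cur.description]
        rows = cur.fetchall()
    finally:
        cur.close()

    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)


# Stand-in database: the table name `art_app` is an assumption for this demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE art_app (ART_APP_ID INTEGER, ART_APP_QUSORC TEXT)")
conn.executemany(
    "INSERT INTO art_app VALUES (?, ?)",
    [(69, None), (402, "Hello to all!")],
)

out_path = os.path.join(tempfile.gettempdir(), "art_app_demo.csv")
n_rows = export_query_to_csv(
    conn, "SELECT ART_APP_ID, ART_APP_QUSORC FROM art_app", out_path
)
print(n_rows, "rows written to", out_path)
```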

#### Content

The original file is a CSV file with 1570 rows and 14 columns. This project focuses on the 'ART_APP_QUSORC' column, which holds the raw text of each message.

# Approach
* Environment Configuration
* Loading Data
* Text Processing

## Environment Configuration

In [1]:
# Show the absolute path of the executable binary for the Python interpreter.
import sys
print(sys.executable)

/opt/anaconda3/bin/python


In [2]:
import pymysql
import pandas as pd
import numpy as np
import nltk
import re
import os
from wordcloud import WordCloud
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

## Loading Data

In [3]:
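# Load the exported CSV and keep only the ID and message columns.
# Malformed rows are skipped instead of raising an error.
art = pd.read_csv('xx.csv', encoding='latin-1', error_bad_lines=False)[['ART_APP_ID', 'ART_APP_QUSORC']]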

b'Skipping line 612: expected 13 fields, saw 16\nSkipping line 698: expected 13 fields, saw 16\n'
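
The `error_bad_lines` argument was deprecated in pandas 1.3 and removed in pandas 2.0. On newer versions, `on_bad_lines='skip'` gives the same skipping behaviour. The sketch below assumes pandas 1.3 or later and uses a small in-memory CSV (made-up rows, not the real export) with one row that has too many fields.

```python
import io

import pandas as pd

# Toy CSV: the fourth line has 4 fields where the header declares 2.
raw_csv = (
    "ART_APP_ID,ART_APP_QUSORC\n"
    "69,\n"
    "402,Hello to all!\n"
    "640,too,many,fields\n"
    "680,baise sur tabouret\n"
)

# Rows with more fields than expected are dropped instead of raising an error.
demo = pd.read_csv(io.StringIO(raw_csv), on_bad_lines="skip")
print(demo.shape)
print(demo["ART_APP_ID"].tolist())
```

Skipping is convenient for a first look at the data, but the skipped rows are lost, so it is worth checking how many lines the warning reports.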


In [4]:
art.head()

Unnamed: 0,ART_APP_ID,ART_APP_QUSORC
0,69,
1,402,\r\nhttp://bit.ly/2DLTOpV \r\n\r\nHello to al...
2,640,Disclaimer statement: We are not legally liabl...
3,680,baise sur tabouret <A HREF='https://rencontre-...
4,688,"Hello there, My name is Aly and I would like t..."


In [5]:
# size is the number of cells (rows x columns), not the number of rows.
art.size

1570
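
Since `art` holds two columns, a `size` of 1570 corresponds to 785 rows, which agrees with the 785 documents reported by the TF-IDF matrix later in the notebook. The toy frame below (invented values) shows how `size`, `shape`, and `len` differ.

```python
import pandas as pd

# Three rows, two columns, using the same column names as the notebook.
demo = pd.DataFrame(
    {
        "ART_APP_ID": [69, 402, 640],
        "ART_APP_QUSORC": [None, "Hello to all!", "Disclaimer statement"],
    }
)

n_cells = demo.size          # rows * columns
n_rows, n_cols = demo.shape  # (rows, columns)
print(n_cells, n_rows, n_cols, len(demo))
```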

## Text Processing

In [6]:
# Rename the columns and print a slice of the data.
art = art.rename(columns={"ART_APP_ID": "ID", "ART_APP_QUSORC": "Email"})
art[100:110]

Unnamed: 0,ID,Email
100,778,"\r\nDelivery Club â ??????????? ??????, ???..."
101,779,Melde dich Wen du in der N?Â¤he von lichtenber...
102,780,"Push the Download Now"" button to download <b>E..."
103,It will just take a few moments.,
104,<a href=https://crackpluskeygen.org/software?q...,
105,"A powerful audio synthesizer t""",
106,781,??? ??????? ?????????? ????????.<a href=http:/...
107,782,We would like to inform that you liked a comme...
108,783,hi there \r\nWe all know there are no tricks w...
109,784,\r\nhttp://prooknann.ru/Moskitnye-setki.html ...
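
Rows 103 to 105 in the slice above carry message text in the ID column and nothing in Email, which suggests that a multi-line message was split across records somewhere between export and parsing. One possible way to flag such rows, shown here as a sketch on shortened toy data rather than as part of the original pipeline, is to test whether the ID parses as a number.

```python
import pandas as pd

# Toy rows modelled on the slice above (text shortened).
demo = pd.DataFrame(
    {
        "ID": ["780", "It will just take a few moments.", "781"],
        "Email": ["Push the Download Now button", None, "We would like to inform"],
    }
)

# Non-numeric IDs become NaN, which marks the row as suspect.
numeric_id = pd.to_numeric(demo["ID"], errors="coerce")
suspect = demo[numeric_id.isna()]
clean = demo[numeric_id.notna()]

print(len(suspect), "suspect row(s)")
print(clean["ID"].tolist())
```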


In [7]:
# Build a corpus
corpus = []
ps = PorterStemmer()

In [8]:
# Show the first 10 original messages
for i in range(10):
    print(art['Email'][i])
    print('\r')

nan

 
http://bit.ly/2DLTOpV 

Hello to all! I'm looking for people who would like to start earning online! Start is very simple, you just need to install the browser at and use it as the main one. It is very easy, convenient and fast - you will love workin

Disclaimer statement: We are not legally liable for any losses or damages that you may incur due to the expiration of arthurfunding.com. Such losses may include but are not limited to: financial loss, deleted data, downgrade of search rankings, missed cu

baise sur tabouret <A HREF='https://rencontre-gratuite.oleificiodiseneghe.it/rencontre-gratuite-telephone.html'>rencontre gratuite telephone</A> pute 
en caravane <a href='https://site-de-rencontres.oleificiodiseneghe.it/site-de-rencontre-gitan.html'>

Hello there, My name is Aly and I would like to know if you would have any interest to have your website here at arthurfunding.com  promoted as a resource on our blog alychidesign.com ?

 We are  updating our do-follow bro

In [9]:
# Vectorize the text with TF-IDF
v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
vectors = v.fit_transform(art['Email'].apply(lambda x: np.str_(x)))
vectors.shape

(785, 3495)
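
The shape reads as (documents, vocabulary terms): 785 messages and 3495 distinct tokens. The toy corpus below illustrates this with the default `TfidfVectorizer` settings. It also shows that converting a missing value to a string yields the literal token `nan`, which then becomes part of the vocabulary. The example sentences are invented for the demonstration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "start earning online",
    "start is very simple",
    "earning online is easy",
    str(float("nan")),  # a missing message turned into the string "nan"
]

vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)

# One row per document, one column per distinct token.
print(matrix.shape)
print(sorted(vec.vocabulary_))
```

Dropping or filling missing messages before vectorizing would keep `nan` out of the vocabulary.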

#### Process the text.

In [10]:
# Build the stop-word set once instead of once per word.
stop_words = set(stopwords.words('english'))

for i in range(len(art.Email)):
    show = i < 2  # print intermediate results for the first two messages only

    # Regular-expression substitutions:
    #   email addresses -> 'emailaddr'
    #   URLs            -> 'httpaddr'
    #   money symbols   -> 'moneysymb'
    #   phone numbers   -> 'phonenumbr'
    #   numbers         -> 'numbr'
    # Note: every call reads the original text rather than msg, so each result
    # overwrites the previous one and only the punctuation removal survives.
    raw = str(art['Email'][i])
    msg = re.sub(r'\b[\w\-.]+?@\w+?\.\w{2,4}\b', 'emailaddr', raw)
    msg = re.sub(r'(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)', 'httpaddr', raw)
    msg = re.sub(r'£|\$', 'moneysymb', raw)
    msg = re.sub(r'\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b', 'phonenumbr', raw)
    msg = re.sub(r'\d+(\.\d+)?', 'numbr', raw)

    # Remove all punctuation
    msg = re.sub(r'[^\w\d\s]', ' ', raw)

    if show:
        print("\t\t\t\t\t MESSAGE ", i)
        print("\n After Regular Expression - Message ", i, " : ", msg)

    # Convert each word to lower case
    msg = msg.lower()
    if show:
        print("\n Lower case Message ", i, " : ", msg)

    # Split the message into tokens
    msg = msg.split()
    if show:
        print("\n After Splitting - Message ", i, " : ", msg)

    # Drop stop words, then stem the remaining tokens with PorterStemmer
    msg = [ps.stem(word) for word in msg if word not in stop_words]
    if show:
        print("\n After Stemming - Message ", i, " : ", msg)

    # Join the remaining tokens back into one string
    msg = ' '.join(msg)
    if show:
        print("\n Final Prepared - Message ", i, " : ", msg, "\n\n")

    # Add the prepared message to the corpus
    corpus.append(msg)
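
Each `re.sub` call in the loop above starts again from the original message text, so the placeholder replacements are discarded and the URL in message 1 comes out as `http bit ly 2DLTOpV` rather than `httpaddr`. If the intent is to apply all five replacements, the calls have to be chained on `msg`. A minimal sketch of that variant follows. It reuses the notebook's patterns as raw strings, and the function name `clean_message` is hypothetical.

```python
import re

# Patterns copied from the notebook, written as raw strings so that \b means
# "word boundary" instead of the backspace character.
EMAIL = re.compile(r"\b[\w\-.]+?@\w+?\.\w{2,4}\b")
URL = re.compile(r"(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)")
MONEY = re.compile(r"£|\$")
PHONE = re.compile(r"\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")
NUMBER = re.compile(r"\d+(\.\d+)?")
PUNCT = re.compile(r"[^\w\d\s]")


def clean_message(text):
    """Apply the placeholder substitutions in sequence, then normalise."""
    msg = str(text)
    msg = EMAIL.sub("emailaddr", msg)    # before URL: the URL pattern also matches domains
    msg = URL.sub("httpaddr", msg)
    msg = MONEY.sub("moneysymb", msg)
    msg = PHONE.sub("phonenumbr", msg)   # before NUMBER: phone numbers are digits too
    msg = NUMBER.sub("numbr", msg)
    msg = PUNCT.sub(" ", msg)
    # Lower-case and collapse runs of whitespace.
    return " ".join(msg.lower().split())


sample = "Mail bob@example.com or visit http://bit.ly/2DLTOpV, call 555-123-4567, pay $20!"
cleaned = clean_message(sample)
print(cleaned)
```

The order of the substitutions matters: the more specific patterns (email, phone) run before the more general ones (URL, number) that would otherwise consume part of the same text.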

					 MESSAGE  0

 After Regular Expression - Message  0  :  nan

 Lower case Message  0  :  nan

 After Splitting - Message  0  :  ['nan']

 After Stemming - Message  0  :  ['nan']

 Final Prepared - Message  0  :  nan 


					 MESSAGE  1

 After Regular Expression - Message  1  :   
http   bit ly 2DLTOpV 

Hello to all  I m looking for people who would like to start earning online  Start is very simple  you just need to install the browser at and use it as the main one  It is very easy  convenient and fast   you will love workin

 Lower case Message  1  :   
http   bit ly 2dltopv 

hello to all  i m looking for people who would like to start earning online  start is very simple  you just need to install the browser at and use it as the main one  it is very easy  convenient and fast   you will love workin

 After Splitting - Message  1  :  ['http', 'bit', 'ly', '2dltopv', 'hello', 'to', 'all', 'i', 'm', 'looking', 'for', 'people', 'who', 'would', 'like', 'to', 'start', 'earning'