# Overview
#### Context

Dataset comes from the data center of ASTA funding.
Use pymysql to connect to the database.

#### Content

The original dataset contains 1570 rows and 14 columns, in this project, we focus on the 'ART_APP_QUSORC' column for the raw text of the message. 

# Approach
* Environment Configuration


* Loading Data


* Text Processing

## Environment Configuration

In [1]:
# Show the absolute path of the executable binary for the Python interpreter.
import sys
print(sys.executable)

/opt/anaconda3/bin/python


In [2]:
import pymysql
import re
import functools 
import operator
import pymysql
import pandas as pd
import numpy as np
import nltk
import re
import os
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

## Loading Data

In [3]:
db = pymysql.connect(host='xx.xx.xx.xx', user='root', password='xxxx', db='x', charset='utf8mb4')

In [4]:
cursor = db.cursor()

In [5]:
cursor.execute("SELECT VERSION()")

1

In [6]:
data = cursor.fetchone()
print ("Database version : %s " % data)

Database version : 5.5.62-0ubuntu0.14.04.1 


In [7]:
sql = "select ART_APP_QUSORC from ART.ART_APP"
cursor.execute(sql)

593

In [8]:
results = cursor.fetchall()

In [9]:
for row in results:
    print(row)

(None,)
(" \r\nhttp://bit.ly/2DLTOpV \r\n\r\nHello to all! I'm looking for people who would like to start earning online! Start is very simple, you just need to install the browser at and use it as the main one. It is very easy, convenient and fast - you will love workin",)
('Disclaimer statement: We are not legally liable for any losses or damages that you may incur due to the expiration of arthurfunding.com. Such losses may include but are not limited to: financial loss, deleted data, downgrade of search rankings, missed cu',)
("baise sur tabouret <A HREF='https://rencontre-gratuite.oleificiodiseneghe.it/rencontre-gratuite-telephone.html'>rencontre gratuite telephone</A> pute \r\nen caravane <a href='https://site-de-rencontres.oleificiodiseneghe.it/site-de-rencontre-gitan.html'>",)
('Hello there, My name is Aly and I would like to know if you would have any interest to have your website here at arthurfunding.com  promoted as a resource on our blog alychidesign.com ?\r\n\r\n We are  u

In [10]:
type(results)

tuple

## Text Processing

In [11]:
art = list(results)

In [12]:
art

[(None,),
 (" \r\nhttp://bit.ly/2DLTOpV \r\n\r\nHello to all! I'm looking for people who would like to start earning online! Start is very simple, you just need to install the browser at and use it as the main one. It is very easy, convenient and fast - you will love workin",),
 ('Disclaimer statement: We are not legally liable for any losses or damages that you may incur due to the expiration of arthurfunding.com. Such losses may include but are not limited to: financial loss, deleted data, downgrade of search rankings, missed cu',),
 ("baise sur tabouret <A HREF='https://rencontre-gratuite.oleificiodiseneghe.it/rencontre-gratuite-telephone.html'>rencontre gratuite telephone</A> pute \r\nen caravane <a href='https://site-de-rencontres.oleificiodiseneghe.it/site-de-rencontre-gitan.html'>",),
 ('Hello there, My name is Aly and I would like to know if you would have any interest to have your website here at arthurfunding.com  promoted as a resource on our blog alychidesign.com ?\r\n\r\n 

In [13]:
# Build a corpus
corpus = []
ps = PorterStemmer()

In [14]:
# Show original Messages(top 10)
for i in range(10):
    print (art[i])
    print('\r')

(None,)

(" \r\nhttp://bit.ly/2DLTOpV \r\n\r\nHello to all! I'm looking for people who would like to start earning online! Start is very simple, you just need to install the browser at and use it as the main one. It is very easy, convenient and fast - you will love workin",)

('Disclaimer statement: We are not legally liable for any losses or damages that you may incur due to the expiration of arthurfunding.com. Such losses may include but are not limited to: financial loss, deleted data, downgrade of search rankings, missed cu',)

("baise sur tabouret <A HREF='https://rencontre-gratuite.oleificiodiseneghe.it/rencontre-gratuite-telephone.html'>rencontre gratuite telephone</A> pute \r\nen caravane <a href='https://site-de-rencontres.oleificiodiseneghe.it/site-de-rencontre-gitan.html'>",)

('Hello there, My name is Aly and I would like to know if you would have any interest to have your website here at arthurfunding.com  promoted as a resource on our blog alychidesign.com ?\r\n\r\n W

In [15]:
# Vectorized text
art = pd.DataFrame(art, columns=['a'])['a'].astype(str).str.zfill(11)
v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
vectors = v.fit_transform(art.apply(lambda x: np.str_(x)))
vectors.shape

(593, 3590)

#### Process the text.

In [16]:
for i in range(0, len(art)):
    
    # Applying Regular Expression
    
    '''
    Replace email addresses with 'emailaddr'
    Replace URLs with 'httpaddr'
    Replace money symbols with 'moneysymb'
    Replace phone numbers with 'phonenumbr'
    Replace numbers with 'numbr'
    '''
    msg = art[i]
    msg = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', 'httpaddr', str(art[i]))
    msg = re.sub('\b[\w\-.]+?@\w+?\.\w{2,4}\b', 'emailaddr', str(art[i]))
    msg = re.sub('£|\$', 'moneysymb', str(art[i]))
    msg = re.sub('\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b', 'phonenumbr', str(art[i]))
    msg = re.sub('\d+(\.\d+)?', 'numbr', str(art[i]))    
 
    
    ''' Remove all punctuations '''
    msg = re.sub('[^\w\d\s]', ' ', str(art[i]))
    
    if i in range(4):
        print("\t\t\t\t\t MESSAGE ", i)
    
    if i in range(4):
        print("\n After Regular Expression - Message ", i, " : ", msg)
    
    # Each word to lower case
    msg = msg.lower()    
    if i in range(4):
        print("\n Lower case Message ", i, " : ", msg)
    
    # Splitting words to Tokenize
    msg = msg.split()    
    if i in range(4):
        print("\n After Splitting - Message ", i, " : ", msg)
    
    # Stemming with PorterStemmer handling Stop Words
    msg = [ps.stem(word) for word in msg if not word in set(stopwords.words('english'))]
    if i in range(4):
        print("\n After Stemming - Message ", i, " : ", msg)
    
    # preparing Messages with Remaining Tokens
    msg = ' '.join(msg)
    if i in range(4):
        print("\n Final Prepared - Message ", i, " : ", msg, "\n\n")
    
    # Preparing WordVector Corpus
    corpus.append(msg)
    
    

					 MESSAGE  0

 After Regular Expression - Message  0  :  0000000None

 Lower case Message  0  :  0000000none

 After Splitting - Message  0  :  ['0000000none']

 After Stemming - Message  0  :  ['0000000none']

 Final Prepared - Message  0  :  0000000none 


					 MESSAGE  1

 After Regular Expression - Message  1  :   
http   bit ly 2DLTOpV 

Hello to all  I m looking for people who would like to start earning online  Start is very simple  you just need to install the browser at and use it as the main one  It is very easy  convenient and fast   you will love workin

 Lower case Message  1  :   
http   bit ly 2dltopv 

hello to all  i m looking for people who would like to start earning online  start is very simple  you just need to install the browser at and use it as the main one  it is very easy  convenient and fast   you will love workin

 After Splitting - Message  1  :  ['http', 'bit', 'ly', '2dltopv', 'hello', 'to', 'all', 'i', 'm', 'looking', 'for', 'people', 'who', '