# Amazon Comprehend - Classification Example
### Classify using Text Features

Objective: Train a model to identify tweets that require follow-up  

Input: Tweets  
Target: Binary. 0=Normal, 1=Followup



#### The labelled AWS Twitter dataset is available in this bucket: 
#### https://s3.console.aws.amazon.com/s3/buckets/aml-sample-data/?region=us-east-2
####   File:  social-media/aml_training_dataset.csv

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Download Twitter training data

In [3]:
!aws s3 cp s3://aml-sample-data/social-media/aml_training_dataset.csv .

download: s3://aml-sample-data/social-media/aml_training_dataset.csv to ./aml_training_dataset.csv


### Prepare Training and Test data 

In [4]:
df = pd.read_csv('aml_training_dataset.csv')

In [5]:
print('Rows: {0}, Columns: {1}'.format(df.shape[0],df.shape[1]))

Rows: 10000, Columns: 45


In [6]:
df.columns

Index(['created_at_in_seconds', 'description', 'favorite_count', 'favorited',
       'favourites_count', 'followers_count', 'friends_count', 'geo_enabled',
       'in_reply_to_screen_name', 'in_reply_to_status_id',
       'in_reply_to_user_id', 'location', 'r.created_at_in_seconds',
       'r.description', 'r.favorite_count', 'r.favorited',
       'r.favourites_count', 'r.followers_count', 'r.friends_count',
       'r.geo_enabled', 'r.in_reply_to_screen_name', 'r.in_reply_to_status_id',
       'r.in_reply_to_user_id', 'r.location', 'r.retweet_count',
       'r.screen_name', 'r.sid', 'r.statuses_count', 'r.text', 'r.time_zone',
       'r.uid', 'r.user.name', 'r.utc_offset', 'r.verified', 'retweet_count',
       'screen_name', 'sid', 'statuses_count', 'text', 'time_zone', 'uid',
       'user.name', 'utc_offset', 'verified', 'trainingLabel'],
      dtype='object')

In [7]:
df.head()

| | created_at_in_seconds | description | favorite_count | favorited | favourites_count | followers_count | friends_count | geo_enabled | in_reply_to_screen_name | in_reply_to_status_id | ... | screen_name | sid | statuses_count | text | time_zone | uid | user.name | utc_offset | verified | trainingLabel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1410411075 | GM for @theCUBE, the Leading Interview format ... | 0 | False | 1760 | 1265 | 2001 | True | JeffBezosAmazon | | ... | JeffFrick | 509927125139202048 | 13506 | @JeffBezosAmazon @amazon @awscloud -> Jeff get... | Pacific Time (US & Canada) | 19747132 | Jeff Frick | -25200.0 | False | 0 |
| 1 | 1410418911 | Everything is Content covers tech #conferences... | 0 | False | 22 | 183 | 154 | True | | | ... | contentpurveyor | 509959993781919744 | 1816 | Visit @awscloud Sept 11-16 #Amsterdam to learn... | Eastern Time (US & Canada) | 2465523308 | ContentPurveyor | -14400.0 | False | 1 |
| 2 | 1410422657 | Founder CEO @mailcloud, previously @zaggora. G... | 6 | False | 4870 | 3733 | 301 | True | | | ... | malcolmbell | 509975703421739008 | 4244 | Super excited to take this selfie at #seedcamp... | | 135183746 | Malcolm Bell | | False | 0 |
| 3 | 1410424594 | Accelerating startups from prototype to IPO wi... | 7 | False | 2393 | 50783 | 3259 | True | | | ... | seedcamp | 509983830275407872 | 5934 | The stage is set at @awscloud in Holburn for o... | London | 15317462 | seedcamp | 0.0 | False | 0 |
| 4 | 1410425306 | Online Investment Manager | 0 | False | 50 | 86 | 237 | False | | | ... | ZenAssets | 509986815164813312 | 120 | RT @seedcamp The stage is set at @awscloud in ... | Kyiv | 2207428117 | Zen Assets | 7200.0 | False | 0 |


In [8]:
df = df[['text','trainingLabel']]

In [9]:
# trainingLabel contains the class
# Valid values are: 
#  0 = Normal
#  1 = Followup

df.trainingLabel.value_counts()

0    8580
1    1420
Name: trainingLabel, dtype: int64
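
The counts above show an imbalanced dataset: 1,420 of the 10,000 tweets (14.2%) carry the Followup label. As a minimal sketch, the same proportions can be read off with `value_counts(normalize=True)`. The `labels` Series below is a synthetic stand-in built from the printed counts, not the real `df.trainingLabel` column.

```python
import pandas as pd

# Synthetic stand-in for df.trainingLabel, built from the counts printed above
labels = pd.Series([0] * 8580 + [1] * 1420, name='trainingLabel')

# normalize=True returns class fractions instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions)

# A classifier that always predicts the majority class is right this often
majority_baseline = proportions.max()
print('Majority-class baseline accuracy: {:.1%}'.format(majority_baseline))
```

With this much imbalance, plain accuracy can look good even for a model that never flags a tweet, so precision and recall on the Followup class are likely to be more informative.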

In [10]:
tweet_normal = df['trainingLabel'] == 0
tweet_followup = df['trainingLabel'] == 1
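
The two variables above are boolean Series aligned with the index of `df`, so they can be used directly as row filters. A small sketch of the idiom on a made-up frame (the `toy` data and the variable names are illustrative only):

```python
import pandas as pd

# Toy frame with the same two columns used in this notebook (contents are made up)
toy = pd.DataFrame({
    'text': ['great event', 'please raise my limit',
             'thanks for hosting', 'how do I fix this?'],
    'trainingLabel': [0, 1, 0, 1],
})

is_followup = toy['trainingLabel'] == 1   # boolean Series aligned with toy's index

# .loc takes the row mask and the column label in a single step
followup_text = toy.loc[is_followup, 'text']

# ~ inverts the mask, so a separate "normal" mask is optional
normal_text = toy.loc[~is_followup, 'text']

print(followup_text.tolist())
print(normal_text.tolist())
```

When printing a sample, `.head(n)` on the filtered Series returns at most `n` rows and does not raise if fewer rows match, unlike indexing with `.iloc[i]` in a fixed-length loop.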

In [11]:
# Some examples of tweets that are classified as requiring follow-up
for i in range(15):
    print(df[tweet_followup]['text'].iloc[i])
    print()

Visit @awscloud Sept 11-16 #Amsterdam to learn how their services can help YOU @IBCShow. http://t.co/110YOIlTbU

It's insulting that @awscloud marked an EBS volume limit increase support request as low severity but I can't do anything while I wait.

dear @awscloud it would be really great to have custom tabs on the aws console with searches on. I have too many instances already :)

@awscloud chat about a future guest spot on #MobileChat to discuss how infrastructure backend is just as important as mobile front end?

"RT @aneel: I wonder how much @eucalyptus's api thing with @awscloud was worth to @hphelioncloud https://t.co/tKUNboFkn1” << $$$

How to build scalable #applications on the @awscloud platform: http://t.co/WrCtaFU4jq

Seriously though, @awscloud - how is the request rate limit determined? How do I get it boosted?

@awscloud It would be cool if we received an alert that we are nearing a service limit before we actually hit it.

Wondering if @awscloud is still using #TrueCrypt

In [12]:
# Some examples of tweets that are classified as normal
for i in range(10):
    print(df[tweet_normal]['text'].iloc[i])
    print()

@JeffBezosAmazon @amazon @awscloud -> Jeff get's into the Act. http://t.co/QYVgBd8o0b  #ALSIceBucketChallenge

Super excited to take this selfie at #seedcamp today - thanks to @awscloud for hosting :) http://t.co/4849U96iJS

The stage is set at @awscloud in Holburn for our second day of mentoring, this time with investors #seedcamp http://t.co/868c0aDQNl

RT @seedcamp The stage is set at @awscloud in Holburn for our second day of mentoring, this time with investors! http://t.co/y88oKRYW04

Warming up for the final pitches at @seedcamp week London #investor day. Thanks to @awscloud for hosting us. http://t.co/7CNRlo9tYa

That is one full room at @awscloud as our mentoring day kicks off with an intro from @phylosopher #seedcamp http://t.co/70MEiflur4

Cloud Pioneer http://t.co/Kox4nZ8ajx @awscloud @cloudfoundry  @HELPLINE_Corp @JLU122 @Aco63

Standing room only @seedcamp London week, hosted at @awscloud. #seedcamp http://t.co/PYUtzSuVtY

Thanks to @AWScloud for hosting us today and havin

In [13]:
# Training, validation and test split
# The Comprehend service automatically splits the dataset it is given into training and validation sets (80-20 ratio).
# We still need to confirm the quality of the model independently, using a test set the service never sees.

# So, let's reserve 10% of the data for testing and provide the remaining 90% to the Comprehend service
# Training & Validation = 90% of the data
# Test                  = 10% of the data

# Randomize the dataset (the fixed seed makes the shuffle repeatable)
np.random.seed(5)
l = list(df.index)
np.random.shuffle(l)
df = df.iloc[l]
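
The cell above shuffles the list of index labels in place with a fixed seed and then reorders the rows. The sketch below repeats that approach on a made-up frame and checks that no rows are lost or duplicated. It also shows `DataFrame.sample(frac=1, random_state=...)` as an alternative, which is an assumption on my part rather than what this notebook uses. It yields a different row order, so the class counts printed in later cells would not match.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({'text': list('abcdefgh'),
                    'trainingLabel': [0, 0, 0, 1, 0, 1, 0, 0]})

# Same approach as the cell above: seed, shuffle the index labels, reorder
np.random.seed(5)
order = list(toy.index)
np.random.shuffle(order)
shuffled = toy.iloc[order]

# Alternative: let pandas shuffle; the seed is local to this call
shuffled_alt = toy.sample(frac=1, random_state=5)

# Both keep every row exactly once; only the order changes
print(sorted(shuffled.index) == list(toy.index))
print(sorted(shuffled_alt.index) == list(toy.index))
```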

In [14]:
rows = df.shape[0]
train = int(.9 * rows)
test = rows - train

In [15]:
rows, train, test

(10000, 9000, 1000)

In [16]:
df_train = df[:train]
df_test = df[train:]

In [17]:
df_train.trainingLabel.value_counts()

0    7729
1    1271
Name: trainingLabel, dtype: int64

In [18]:
df_test.trainingLabel.value_counts()

0    851
1    149
Name: trainingLabel, dtype: int64
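
With the random split, the Followup share is 1271/9000 (about 14.1%) in the training portion and 149/1000 (14.9%) in the test portion, which is reasonably close. If an exact match were wanted, one option is a stratified split. The sketch below is not what this notebook does: `stratified_split` is a hypothetical helper, the data is synthetic, and it assumes pandas 1.1 or later for `GroupBy.sample`.

```python
import pandas as pd

def stratified_split(frame, label_col, test_frac, seed):
    """Hypothetical helper: draw test_frac of the rows of each class for the test set."""
    test_part = (frame.groupby(label_col, group_keys=False)
                      .sample(frac=test_frac, random_state=seed))
    # Everything not drawn for the test set stays in the training set
    train_part = frame.drop(test_part.index)
    return train_part, test_part

# Synthetic data with the same class counts as the dataset above
toy = pd.DataFrame({
    'text': ['tweet {}'.format(i) for i in range(10000)],
    'trainingLabel': [0] * 8580 + [1] * 1420,
})

train_part, test_part = stratified_split(toy, 'trainingLabel', test_frac=0.1, seed=5)
print(len(train_part), len(test_part))
print(test_part['trainingLabel'].value_counts())
```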

In [19]:
df_train.columns

Index(['text', 'trainingLabel'], dtype='object')

In [20]:
df_train.to_csv('twitter_train.csv',
                index=False,
                header=False,
                columns=['trainingLabel','text'])

In [21]:
df_test.to_csv('twitter_test_with_label.csv',
                index=False,
                header=False,
                columns=['trainingLabel','text'])
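
The labelled copy of the test set is kept so that the model's predictions can be scored independently later. As a rough sketch of that scoring step, the function below computes precision and recall for the Followup class from two label lists. The `actual` and `predicted` values are made up for illustration, and no Comprehend call is shown.

```python
def precision_recall(actual, predicted, positive=1):
    """Precision and recall for the positive class from two equal-length label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels for illustration only; real values would come from
# twitter_test_with_label.csv and from the model's predictions
actual    = [0, 1, 0, 1, 1, 0, 0, 1]
predicted = [0, 1, 0, 0, 1, 1, 0, 1]

precision, recall = precision_recall(actual, predicted)
print('precision={:.2f} recall={:.2f}'.format(precision, recall))
```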

In [22]:
df_test.to_csv('twitter_test_without_label.csv',
                index=False,
                header=False,
                columns=['text'])
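
The files above are written without an index or header row, and the training file puts the label before the text. Tweets often contain commas and double quotes, so the layout relies on `to_csv` quoting those fields. A minimal sketch with made-up tweets, writing to an in-memory buffer instead of a file and reading the result back with the standard `csv` module:

```python
import csv
import io

import pandas as pd

# Made-up tweets containing a comma and embedded double quotes
toy = pd.DataFrame({
    'text': ['Thanks, @awscloud!', 'He said "how do I get it boosted?"'],
    'trainingLabel': [0, 1],
})

buffer = io.StringIO()
# Same options as the cells above: no index, no header, label column first
toy.to_csv(buffer, index=False, header=False, columns=['trainingLabel', 'text'])
print(buffer.getvalue())

# Reading the text back with a CSV parser recovers the original fields
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```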

### Upload to S3

### Specify your bucket name: replace 'chandra-ml-sagemaker' with the name of your own bucket

In [23]:
!aws s3 cp twitter_train.csv s3://chandra-ml-sagemaker/twitter/train/twitter_train.csv

upload: ./twitter_train.csv to s3://chandra-ml-sagemaker/twitter/train/twitter_train.csv


In [24]:
!aws s3 cp twitter_test_without_label.csv s3://chandra-ml-sagemaker/twitter/test/twitter_test_without_label.csv

upload: ./twitter_test_without_label.csv to s3://chandra-ml-sagemaker/twitter/test/twitter_test_without_label.csv
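
Because the bucket name is repeated in each upload command, it may be easier to keep it in one variable and build the destination URIs from it. This is a small sketch: `s3_uri` is a hypothetical helper and `your-bucket-name` is a placeholder, while the object keys are the ones used above.

```python
def s3_uri(bucket, key):
    """Hypothetical helper: join a bucket name and an object key into an s3:// URI."""
    return 's3://{}/{}'.format(bucket, key.lstrip('/'))

bucket_name = 'your-bucket-name'   # placeholder, replace with your own bucket

train_uri = s3_uri(bucket_name, 'twitter/train/twitter_train.csv')
test_uri = s3_uri(bucket_name, 'twitter/test/twitter_test_without_label.csv')

print(train_uri)
print(test_uri)
```

In a notebook cell, IPython expands Python variables wrapped in braces inside `!` commands, so the upload can then be written as `!aws s3 cp twitter_train.csv {train_uri}`.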
