# Identifying tweets related to customer service
Aim: automatically identify tweets related to customer service and route them to the managers of the customer service centers.

# Table of Contents
1. [Data Preparation](#data-preparation)
    1. [Labeling training data](#labeling-training-data)
2. [Building the model](#building-the-model)

## Data Preparation

In [1]:
import pandas as pd

df = pd.read_csv("tweets.csv")
df.head()

| Unnamed: 0 | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | name | retweet_count | text | tweet_created | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.70306e+17 | neutral | 1.0 | | | cairdin | 0 | @TAA What @dhepburn said. | 2/24/2015 11:35 | Eastern Time (US & Canada) |
| 1 | 5.70301e+17 | positive | 0.3486 | | 0.0 | jnardino | 0 | @TAA plus you've added commercials to the expe... | 2/24/2015 11:15 | Pacific Time (US & Canada) |
| 2 | 5.70301e+17 | neutral | 0.6837 | | | yvonnalynn | 0 | @TAA I didn't today... Must mean I need to tak... | 2/24/2015 11:15 | Central Time (US & Canada) |
| 3 | 5.70301e+17 | negative | 1.0 | Bad Flight | 0.7033 | jnardino | 0 | @TAA it's really aggressive to blast obnoxious... | 2/24/2015 11:15 | Pacific Time (US & Canada) |
| 4 | 5.70301e+17 | negative | 1.0 | Can't Tell | 1.0 | jnardino | 0 | @TAA and it's a really big bad thing about it | 2/24/2015 11:14 | Pacific Time (US & Canada) |


### Labeling training data
0 -> NOT related to customer service

1 -> Related to customer service

In [7]:
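import numpy as np

# Label a tweet 1 when its negative reason is 'Customer Service Issue', otherwise 0
df['Is_Customer_Service_Issue'] = np.where(df['negativereason'] == 'Customer Service Issue', 1, 0)

# Show the text of the tweets labeled as customer service issues
df.loc[df['Is_Customer_Service_Issue'] == 1, 'text']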

24       @TAA you guys messed up my seating.. I reserve...
25       @TAA status match program.  I applied and it's...
32       @TAA help, left expensive headphones on flight...
33       @TAA awaiting my return phone call, just would...
39       @TAA Your chat support is not working on your ...
                               ...                        
14620    @TAA I wait 2+ hrs for CS to call me back re w...
14621    @TAA I've been on hold for 55 mins about my Ca...
14629    @TAA How do I change my flight if the phone sy...
14636    @TAA leaving over 20 minutes Late Flight. No w...
14638    @TAA you have my money, you change my flight, ...
Name: text, Length: 2910, dtype: object
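
Before training on these labels, it can help to check how the rule treats missing reasons and how balanced the two classes are. The following is a minimal sketch on a hypothetical toy frame (the `toy` DataFrame and its rows are invented for illustration and are not taken from `tweets.csv`); it applies the same `np.where` rule as above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mirroring the `text` and `negativereason` columns.
toy = pd.DataFrame({
    "text": ["on hold for an hour", "great flight", "lost my bag", "no call back"],
    "negativereason": ["Customer Service Issue", np.nan, "Lost Luggage", "Customer Service Issue"],
})

# Same labeling rule as above. A missing reason (NaN) compares as False,
# so tweets without a negative reason receive label 0.
toy["Is_Customer_Service_Issue"] = np.where(
    toy["negativereason"] == "Customer Service Issue", 1, 0
)

# Count tweets per label and compute the share of positive (label 1) tweets.
label_counts = toy["Is_Customer_Service_Issue"].value_counts()
positive_share = toy["Is_Customer_Service_Issue"].mean()

print(label_counts)
print(f"share of customer service tweets: {positive_share:.2f}")
```

Running the same two checks on `df` would show how many of the tweets fall into the 2910 positive examples listed above versus label 0, which matters when choosing an evaluation metric later.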

## Building the model

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

text_data = df['text']

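# CountVectorizer converts text to lowercase by default.
# binary=True records whether a token occurs in a tweet, not how often.
# stop_words must be the string 'english' (not a set) to use the built-in stop-word list.
binary_vectorizer = CountVectorizer(binary=True, stop_words='english')
binary_vectorizer.fit(text_data)
X_binary = binary_vectorizer.transform(text_data)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
binary_vectorizer.get_feature_names_out().tolist()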



['00',
 '000',
 '000114',
 '000419',
 '000ft',
 '000lbs',
 '0011',
 '0016',
 '00a',
 '00am',
 '00p',
 '00pm',
 '01',
 '0162389030167',
 '0162424965446',
 '0162431184663',
 '0167560070877',
 '0185',
 '01ldxn3qqq',
 '01pm',
 '02',
 '0200',
 '03',
 '0316',
 '0372389047497',
 '04',
 '0400',
 '04sdytt7zd',
 '05',
 '0510',
 '0530',
 '05am',
 '05pm',
 '06',
 '0600',
 '0638',
 '0671',
 '07',
 '0736',
 '0769',
 '07p',
 '07xhcacjax',
 '08',
 '0985',
 '0_0',
 '0bjnz4eix5',
 '0cevy3p42b',
 '0ewj7oklji',
 '0hmmqczkcf',
 '0hxlnvzknp',
 '0jjt4x3yxg',
 '0jutcdrljl',
 '0kn7pjelzl',
 '0liwecasoe',
 '0pdntgbxc6',
 '0prgysvurm',
 '0wbjawx7xd',
 '0xjared',
 '10',
 '100',
 '1000',
 '1000cost',
 '1001',
 '1002',
 '1007',
 '1008',
 '101',
 '1016',
 '1019',
 '1020',
 '1024',
 '1025',
 '1027',
 '1028',
 '103',
 '1030pm',
 '1032',
 '1038',
 '104',
 '1041',
 '1046',
 '105',
 '1050',
 '1051',
 '1058',
 '106',
 '1065',
 '1071',
 '1074',
 '1079871763',
 '108',
 '1080',
 '1081',
 '1086',
 '108639',
 '1089',
 '1098',
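
To make the effect of `binary=True` and the stop-word list easier to see than in the full vocabulary above, here is a small sketch on a hypothetical two-sentence corpus (`toy_corpus` is invented for illustration and is not taken from `tweets.csv`). It assumes scikit-learn's built-in English stop-word list, selected with the string `'english'`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; the word "hold" appears three times in the first text.
toy_corpus = [
    "On hold on hold on hold",
    "The flight was late",
]

# binary=True caps every entry at 1 (presence/absence);
# stop_words="english" drops common words such as "on" and "the".
toy_vectorizer = CountVectorizer(binary=True, stop_words="english")
X_toy = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each kept token to its column index in the matrix.
vocab = sorted(toy_vectorizer.vocabulary_)
dense = X_toy.toarray()
hold_column = toy_vectorizer.vocabulary_["hold"]

print(vocab)
print(dense)
print("entry for 'hold' in the first text:", dense[0, hold_column])
```

Even though "hold" occurs three times, its entry is 1, so the resulting features only encode which words are present. Whether presence/absence or raw counts works better for short texts like tweets is something to verify on held-out data.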
