# NLP Project
In this notebook, we work through about 14,000 US airline tweets and their sentiment labels. Then, using NLP techniques, we predict the sentiment of the test tweets.

In [1]:
# Importing Libraries
import numpy as np
import pandas as pd

In [2]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

In [3]:
train_df.columns

Index(['tweet_id', 'airline_sentiment', 'airline', 'airline_sentiment_gold',
       'name', 'negativereason_gold', 'retweet_count', 'text', 'tweet_coord',
       'tweet_created', 'tweet_location', 'user_timezone'],
      dtype='object')

In [4]:
test_df.columns

Index(['tweet_id', 'airline', 'airline_sentiment_gold', 'name',
       'negativereason_gold', 'retweet_count', 'text', 'tweet_coord',
       'tweet_created', 'tweet_location', 'user_timezone'],
      dtype='object')

In [5]:
# Let's remove tweet_id column, and let's separate x and y in training data
train_df.drop(columns=['tweet_id'], inplace=True)
test_df.drop(columns=['tweet_id'], inplace=True)

y_train = train_df['airline_sentiment']

train_df.drop(columns=['airline_sentiment'], inplace=True)

In [6]:
y_train.head()

0    negative
1    positive
2    positive
3    negative
4    negative
Name: airline_sentiment, dtype: object

In [7]:
# Count percentage of NaN values in different columns
(train_df.isna().sum() / train_df.shape[0])*100

airline                    0.000000
airline_sentiment_gold    99.717668
name                       0.000000
negativereason_gold       99.781421
retweet_count              0.000000
text                       0.000000
tweet_coord               92.932605
tweet_created              0.000000
tweet_location            32.331512
user_timezone             32.577413
dtype: float64

As we can see, more than 90 percent of the data is missing in some columns, so we can't get much insight from them. Let's drop these columns.

In [8]:
train_df.drop(columns=['airline_sentiment_gold','negativereason_gold','tweet_coord'], inplace=True)
test_df.drop(columns=['airline_sentiment_gold','negativereason_gold','tweet_coord'], inplace=True)
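The same drop can also be derived from a threshold instead of hard-coded column names. The sketch below is only an illustration on a toy frame: the helper name `drop_sparse_columns` and the 90 percent cutoff are assumptions, not part of the original notebook. The columns are chosen from the training data only, and the same list would then be dropped from the test data.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_pct=90.0):
    """Return a copy of df without columns whose NaN share exceeds max_missing_pct."""
    missing_pct = df.isna().mean() * 100  # same quantity as isna().sum() / len(df) * 100
    sparse_cols = missing_pct[missing_pct > max_missing_pct].index.tolist()
    return df.drop(columns=sparse_cols), sparse_cols

# Toy stand-in for train_df (hypothetical values)
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "tweet_coord": [np.nan, np.nan, np.nan, np.nan],  # 100% missing
    "tweet_location": ["x", np.nan, "y", "z"],        # 25% missing
})
reduced, dropped = drop_sparse_columns(toy)
print(dropped)
```

Calling `test_df.drop(columns=dropped)` afterwards would keep the two frames aligned, since the decision is made once on the training data.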

In [9]:
train_df.columns

Index(['airline', 'name', 'retweet_count', 'text', 'tweet_created',
       'tweet_location', 'user_timezone'],
      dtype='object')

In [10]:
test_df.columns

Index(['airline', 'name', 'retweet_count', 'text', 'tweet_created',
       'tweet_location', 'user_timezone'],
      dtype='object')

In [11]:
# Count percentage of NaN values in different columns
(train_df.isna().sum() / train_df.shape[0])*100

airline            0.000000
name               0.000000
retweet_count      0.000000
text               0.000000
tweet_created      0.000000
tweet_location    32.331512
user_timezone     32.577413
dtype: float64

Let's look at tweet_location and user_timezone

In [12]:
train_df['tweet_location'].head()

0               Washington D.C.
1    Indianapolis, Indiana; USA
2                      Illinois
3                           NaN
4                           NaN
Name: tweet_location, dtype: object

In [13]:
train_df['user_timezone'].head()

0        Atlantic Time (Canada)
1    Central Time (US & Canada)
2    Central Time (US & Canada)
3        Atlantic Time (Canada)
4    Eastern Time (US & Canada)
Name: user_timezone, dtype: object

Let's see the different values taken by these columns

In [14]:
len(set(train_df['tweet_location'])), len(set(train_df['user_timezone']))

(2659, 79)
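As a side note, pandas has a built-in way to count distinct values that makes the treatment of missing values explicit. This is a small sketch on a toy Series with made-up values; it is not taken from the dataset.

```python
import numpy as np
import pandas as pd

col = pd.Series(["London", np.nan, "Paris", "London", np.nan])

distinct_without_nan = col.nunique()           # NaN is ignored by default
distinct_with_nan = col.nunique(dropna=False)  # NaN is counted as one extra value
print(distinct_without_nan, distinct_with_nan)
```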

Since tweet locations take so many different values, we will merge the tweet location into the tweet text itself. To make that possible, we first change the NaN values to empty strings.

In [15]:
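# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is not guaranteed to modify the DataFrame in recent pandas versions
train_df['tweet_location'] = train_df['tweet_location'].fillna('')
test_df['tweet_location'] = test_df['tweet_location'].fillna('')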
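To see why the NaN values have to be replaced before merging, here is a minimal sketch on a toy frame (the two rows are invented for illustration). Concatenating a missing location with text gives a missing result, whereas filling with an empty string keeps the tweet text.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "tweet_location": ["Illinois", np.nan],
    "text": ["@united thanks", "@united delayed"],
})

# Without filling, the second row becomes missing after concatenation
naive = toy["tweet_location"] + " " + toy["text"]

# With filling, the text survives; strip() removes the leading space left by an empty location
merged = (toy["tweet_location"].fillna("") + " " + toy["text"]).str.strip()
print(merged.tolist())
```

The `str.strip()` call is optional and is an addition here; the notebook keeps the leading space, which the tokenizer later ignores.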

In [16]:
train_df.head()

Unnamed: 0,airline,name,retweet_count,text,tweet_created,tweet_location,user_timezone
0,Southwest,ColeyGirouard,0,"@SouthwestAir I am scheduled for the morning, ...",2015-02-17 20:16:29 -0800,Washington D.C.,Atlantic Time (Canada)
1,Southwest,WalterFaddoul,0,@SouthwestAir seeing your workers time in and ...,2015-02-23 14:36:22 -0800,"Indianapolis, Indiana; USA",Central Time (US & Canada)
2,United,LocalKyle,0,@united Flew ORD to Miami and back and had gr...,2015-02-18 08:46:29 -0800,Illinois,Central Time (US & Canada)
3,Southwest,amccarthy19,0,@SouthwestAir @dultch97 that's horse radish 😤🐴,2015-02-20 16:20:26 -0800,,Atlantic Time (Canada)
4,United,J_Okayy,0,@united so our flight into ORD was delayed bec...,2015-02-19 18:13:11 -0800,,Eastern Time (US & Canada)


In [17]:
train_df.head().tweet_location + ' ' + train_df.head().text

0    Washington D.C. @SouthwestAir I am scheduled f...
1    Indianapolis, Indiana; USA @SouthwestAir seein...
2    Illinois @united Flew ORD to Miami and back an...
3       @SouthwestAir @dultch97 that's horse radish 😤🐴
4     @united so our flight into ORD was delayed be...
dtype: object

Let's merge the **text** and **tweet_location** columns into a new column named **text_revised**, and then drop the **tweet_location** and **text** columns

In [18]:
train_df['text_revised'] = train_df['tweet_location'] + ' ' + train_df['text']
test_df['text_revised'] = test_df['tweet_location'] + ' ' + test_df['text']

train_df.drop(columns=['tweet_location', 'text'], inplace=True)
test_df.drop(columns=['tweet_location', 'text'], inplace=True)

In [19]:
train_df.columns

Index(['airline', 'name', 'retweet_count', 'tweet_created', 'user_timezone',
       'text_revised'],
      dtype='object')

In [20]:
train_df.head().text_revised

0    Washington D.C. @SouthwestAir I am scheduled f...
1    Indianapolis, Indiana; USA @SouthwestAir seein...
2    Illinois @united Flew ORD to Miami and back an...
3       @SouthwestAir @dultch97 that's horse radish 😤🐴
4     @united so our flight into ORD was delayed be...
Name: text_revised, dtype: object

In [21]:
train_df.columns

Index(['airline', 'name', 'retweet_count', 'tweet_created', 'user_timezone',
       'text_revised'],
      dtype='object')

In [22]:
# Count percentage of NaN values in different columns
(train_df.isna().sum() / train_df.shape[0])*100

airline           0.000000
name              0.000000
retweet_count     0.000000
tweet_created     0.000000
user_timezone    32.577413
text_revised      0.000000
dtype: float64

In [23]:
train_df.head().user_timezone

0        Atlantic Time (Canada)
1    Central Time (US & Canada)
2    Central Time (US & Canada)
3        Atlantic Time (Canada)
4    Eastern Time (US & Canada)
Name: user_timezone, dtype: object

In [24]:
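# 'U' marks an unknown (missing) timezone; assign the result back instead of using inplace=True
train_df['user_timezone'] = train_df['user_timezone'].fillna('U')
test_df['user_timezone'] = test_df['user_timezone'].fillna('U')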

In [25]:
from sklearn.preprocessing import OneHotEncoder

In [26]:
encoder = OneHotEncoder()

In [27]:
encoder.fit(train_df.values[:, 4:5])

OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)

In [28]:
timezones = set(train_df.user_timezone)
timezones

{'Abu Dhabi',
 'Adelaide',
 'Alaska',
 'America/Atikokan',
 'America/Boise',
 'America/Chicago',
 'America/Los_Angeles',
 'America/New_York',
 'Amsterdam',
 'Arizona',
 'Athens',
 'Atlantic Time (Canada)',
 'Bangkok',
 'Beijing',
 'Berlin',
 'Bern',
 'Bogota',
 'Brasilia',
 'Brisbane',
 'Brussels',
 'Bucharest',
 'Buenos Aires',
 'Caracas',
 'Casablanca',
 'Central America',
 'Central Time (US & Canada)',
 'Copenhagen',
 'Dublin',
 'EST',
 'Eastern Time (US & Canada)',
 'Edinburgh',
 'Greenland',
 'Guadalajara',
 'Guam',
 'Hawaii',
 'Helsinki',
 'Hong Kong',
 'Indiana (East)',
 'Istanbul',
 'Jerusalem',
 'Kyiv',
 'La Paz',
 'Lima',
 'Lisbon',
 'London',
 'Madrid',
 'Mazatlan',
 'Melbourne',
 'Mexico City',
 'Mid-Atlantic',
 'Monterrey',
 'Mountain Time (US & Canada)',
 'Nairobi',
 'New Caledonia',
 'New Delhi',
 'Pacific Time (US & Canada)',
 'Paris',
 'Perth',
 'Prague',
 'Pretoria',
 'Quito',
 'Rome',
 'Santiago',
 'Sarajevo',
 'Saskatchewan',
 'Seoul',
 'Singapore',
 'Solomon Is.',


In [29]:
'Central Time (US & Canada)' in timezones

True

In [30]:
test_new_timezones = []

for tz in test_df.user_timezone :
    if tz in timezones:
        test_new_timezones.append(tz)
    else :
        test_new_timezones.append('U')

test_new_timezones = np.array(test_new_timezones)
test_new_timezones

array(['Central Time (US & Canada)', 'Central Time (US & Canada)',
       'Eastern Time (US & Canada)', ..., 'Caracas',
       'Central Time (US & Canada)', 'Eastern Time (US & Canada)'],
      dtype='<U27')

In [31]:
test_df.user_timezone = test_new_timezones
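The manual mapping above sends every timezone that was not seen in training to the 'U' category. A possible alternative, sketched here on toy arrays, is to let scikit-learn handle unseen categories with `handle_unknown="ignore"`. Note that this is a different behavior from the notebook's approach: an unseen value is encoded as an all-zero row rather than as 'U'.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for the user_timezone column (hypothetical values)
train_tz = np.array([["Eastern Time (US & Canada)"], ["London"], ["U"]], dtype=object)
test_tz = np.array([["London"], ["Not In Training"]], dtype=object)

encoder_ignore = OneHotEncoder(handle_unknown="ignore")
encoder_ignore.fit(train_tz)

# Categories are sorted: Eastern Time..., London, U
encoded_test = encoder_ignore.transform(test_tz).toarray()
print(encoded_test)
```

Which option is preferable depends on whether "unknown at test time" should share a column with "missing at training time"; the notebook's mapping to 'U' does share it.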

In [32]:
temp_df_train = pd.DataFrame(encoder.transform(train_df.values[:, 4:5]).toarray(), columns=encoder.categories_)
temp_df_test = pd.DataFrame(encoder.transform(test_df.values[:, 4:5]).toarray(), columns = encoder.categories_)

In [33]:
temp_df_train.head()

Unnamed: 0,Abu Dhabi,Adelaide,Alaska,America/Atikokan,America/Boise,America/Chicago,America/Los_Angeles,America/New_York,Amsterdam,Arizona,...,Sydney,Taipei,Tehran,Tijuana,Tokyo,U,Vienna,Warsaw,Wellington,West Central Africa
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [34]:
train_df.drop(columns=['user_timezone'], inplace=True)
test_df.drop(columns=['user_timezone'], inplace=True)

In [35]:
train_df = pd.concat([train_df, temp_df_train], axis=1)

In [36]:
test_df = pd.concat([test_df, temp_df_test], axis=1)

In [37]:
test_df.head()

Unnamed: 0,airline,name,retweet_count,tweet_created,text_revised,"(Abu Dhabi,)","(Adelaide,)","(Alaska,)","(America/Atikokan,)","(America/Boise,)",...,"(Sydney,)","(Taipei,)","(Tehran,)","(Tijuana,)","(Tokyo,)","(U,)","(Vienna,)","(Warsaw,)","(Wellington,)","(West Central Africa,)"
0,American,zsalim03,0,2015-02-22 18:15:50 -0800,Texas @AmericanAir In car gng to DFW. Pulled o...,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,American,sa_craig,0,2015-02-22 13:22:57 -0800,"College Station, TX @AmericanAir after all, th...",0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Southwest,DanaChristos,1,2015-02-17 18:52:31 -0800,CT @SouthwestAir can't believe how many paying...,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,US Airways,rossj987,0,2015-02-22 23:16:24 -0800,"Washington, D.C. @USAirways I can legitimately...",0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,American,tranpham18,0,2015-02-23 08:44:51 -0800,New York City @AmericanAir still no response f...,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
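The tuple-style column names such as `(Abu Dhabi,)` in the table above appear to come from passing `encoder.categories_` directly as column names: it is a list holding one array per encoded input column, so pandas builds a one-level MultiIndex from it. The sketch below, on toy data, shows the difference from selecting the first array with `categories_[0]`; it is an illustration, not a change to the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

values = np.array([["London"], ["Paris"], ["London"]], dtype=object)
enc = OneHotEncoder().fit(values)
dense = enc.transform(values).toarray()

# List of arrays -> one-level MultiIndex, shown as tuples after concatenation
nested = pd.DataFrame(dense, columns=enc.categories_)

# First (and only) array -> plain string column names
flat = pd.DataFrame(dense, columns=enc.categories_[0])
print(list(flat.columns))
```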


In [38]:
# Count the fraction of NaN values in different columns (not multiplied by 100 here)
(train_df.isna().sum() / train_df.shape[0])

airline                           0.0
name                              0.0
retweet_count                     0.0
tweet_created                     0.0
text_revised                      0.0
(Abu Dhabi,)                      0.0
(Adelaide,)                       0.0
(Alaska,)                         0.0
(America/Atikokan,)               0.0
(America/Boise,)                  0.0
(America/Chicago,)                0.0
(America/Los_Angeles,)            0.0
(America/New_York,)               0.0
(Amsterdam,)                      0.0
(Arizona,)                        0.0
(Athens,)                         0.0
(Atlantic Time (Canada),)         0.0
(Bangkok,)                        0.0
(Beijing,)                        0.0
(Berlin,)                         0.0
(Bern,)                           0.0
(Bogota,)                         0.0
(Brasilia,)                       0.0
(Brisbane,)                       0.0
(Brussels,)                       0.0
(Bucharest,)                      0.0
(Buenos Aire

In [39]:
from nltk.tokenize import sent_tokenize, word_tokenize  # Importing Sentence and Word Tokenizer

In [40]:
train_df.text_revised[1]

'Indianapolis, Indiana; USA @SouthwestAir seeing your workers time in and time out going above and beyond is why I love flying with you guys. Thank you!'

In [43]:
train_tweets = []

for i in range(len(train_df.text_revised)) :
    words = word_tokenize(train_df.text_revised[i])
    train_tweets.append((words, y_train[i]))

train_tweets[0:5]

[(['Washington',
   'D.C.',
   '@',
   'SouthwestAir',
   'I',
   'am',
   'scheduled',
   'for',
   'the',
   'morning',
   ',',
   '2',
   'days',
   'after',
   'the',
   'fact',
   ',',
   'yes..not',
   'sure',
   'why',
   'my',
   'evening',
   'flight',
   'was',
   'the',
   'only',
   'one',
   'Cancelled',
   'Flightled'],
  'negative'),
 (['Indianapolis',
   ',',
   'Indiana',
   ';',
   'USA',
   '@',
   'SouthwestAir',
   'seeing',
   'your',
   'workers',
   'time',
   'in',
   'and',
   'time',
   'out',
   'going',
   'above',
   'and',
   'beyond',
   'is',
   'why',
   'I',
   'love',
   'flying',
   'with',
   'you',
   'guys',
   '.',
   'Thank',
   'you',
   '!'],
  'positive'),
 (['Illinois',
   '@',
   'united',
   'Flew',
   'ORD',
   'to',
   'Miami',
   'and',
   'back',
   'and',
   'had',
   'great',
   'crew',
   ',',
   'service',
   'on',
   'both',
   'legs',
   '.',
   'THANKS'],
  'positive'),
 (['@',
   'SouthwestAir',
   '@',
   'dultch97',
   'that

In [46]:
# Making a list of stop words and punctuations
from nltk.corpus import stopwords
import string
stop_words = set(stopwords.words('english'))   # get a list of stop words in english
punctuations = list(string.punctuation)  # get a list of all punctuations
stop_words.update(punctuations)   # add punctuations in the set of stop words
stop_words

{'!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'need

In [47]:
from nltk import pos_tag     # Import pos_tag
from nltk.corpus import wordnet

# convert code from pos_tag to code that lemmatizer can understand
def get_simple_pos(tag) :
    if tag.startswith('J') :
        return wordnet.ADJ
    elif tag.startswith('V') :
        return wordnet.VERB
    elif tag.startswith('N') :
        return wordnet.NOUN
    elif tag.startswith('R') :
        return wordnet.ADV
    else :
        return wordnet.NOUN

In [48]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [51]:
s = 'This painting is nice'
pos_tag([word_tokenize(s)[1]])

[('painting', 'NN')]

In [52]:
def clean_tweet(words) :
    cleaned_words = []
    for w in words :
        if w.lower() not in stop_words :
            pos = pos_tag([w])
            clean_word = lemmatizer.lemmatize(w, pos = get_simple_pos(pos[0][1]))
            cleaned_words.append(clean_word.lower())
    return cleaned_words

In [55]:
# Testing if the function works
words = ['Jaikirat', 'is', 'good']
clean_tweet(words)

['jaikirat', 'good']
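To look at the filtering step in isolation, without NLTK's tagger and lemmatizer, here is a simplified sketch. The tiny stop-word set `TOY_STOP_WORDS` and the helper `filter_tokens` are hypothetical stand-ins for the NLTK stop-word list and `clean_tweet`; only the lowercasing and membership test are reproduced.

```python
import string

# Hypothetical, deliberately tiny stop-word set plus all ASCII punctuation marks
TOY_STOP_WORDS = {"is", "the", "for", "i", "am"} | set(string.punctuation)

def filter_tokens(words, stop_words=TOY_STOP_WORDS):
    """Lowercase each token and keep it only if it is not a stop word."""
    return [w.lower() for w in words if w.lower() not in stop_words]

filtered = filter_tokens(["Jaikirat", "is", "good", "!"])
print(filtered)
```

Because the comparison uses whole tokens, multi-character pieces such as `n't` pass through unless they are in the stop-word set themselves.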

In [56]:
train_tweets = [(clean_tweet(words), sentiment) for words, sentiment in train_tweets]

In [57]:
train_tweets[0]

(['washington',
  'd.c.',
  'southwestair',
  'schedule',
  'morning',
  '2',
  'day',
  'fact',
  'yes..not',
  'sure',
  'even',
  'flight',
  'one',
  'cancelled',
  'flightled'],
 'negative')

In [58]:
train_sentiments = [sentiment for words, sentiment in train_tweets]

In [59]:
train_sentiments

['negative',
 'positive',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'positive',
 'negative',
 'positive',
 'negative',
 'neutral',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'neutral',
 'negative',
 'negative',
 'positive',
 'neutral',
 'negative',
 'negative',
 'positive',
 'positive',
 'negative',
 'positive',
 'negative',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'neutral',
 'negative',
 'positive',
 'negative',
 'negative',
 'negative',
 'neutral',
 'positive',
 'positive',
 'negative',
 'neutral',
 'neutral',
 'negative',
 'negative',
 'negative',
 'negative',
 'positive',
 'negative',
 'negative',
 'negative',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'neutral',
 'negative',
 'neutral',
 'neutral',
 'negative',
 'negative',
 'neutral',
 'negative',
 'positive',
 'positive',
 'negative',
 'negative',
 'neutral',
 'positive'

In [60]:
train_text_tweets = [' '.join(words) for words, sentiments in train_tweets]

In [61]:
train_text_tweets[0:5]

['washington d.c. southwestair schedule morning 2 day fact yes..not sure even flight one cancelled flightled',
 'indianapolis indiana usa southwestair see worker time time go beyond love fly guy thank',
 'illinois united flew ord miami back great crew service leg thanks',
 "southwestair dultch97 's horse radish 😤🐴",
 'united flight ord delayed air force one last flight sbn 8:20 5 min land']

In [62]:
from sklearn.feature_extraction.text import CountVectorizer

In [63]:
count_vec = CountVectorizer(max_features=75)
x_train_features = count_vec.fit_transform(train_text_tweets).todense()
x_train_features

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 1, 0, ..., 0, 0, 0]], dtype=int64)

['agent',
 'airline',
 'airport',
 'americanair',
 'amp',
 'another',
 'back',
 'bad',
 'bag',
 'book',
 'boston',
 'ca',
 'call',
 'cancelled',
 'change',
 'check',
 'chicago',
 'city',
 'co',
 'could',
 'customer',
 'day',
 'dc',
 'delay',
 'delayed',
 'dm',
 'email',
 'even',
 'first',
 'flight',
 'flightled',
 'fly',
 'gate',
 'get',
 'give',
 'go',
 'good',
 'great',
 'guy',
 'help',
 'hold',
 'home',
 'hour',
 'hr',
 'http',
 'jetblue',
 'know',
 'last',
 'late',
 'like',
 'lose',
 'love',
 'make',
 'min',
 'minute',
 'miss',
 'need',
 'never',
 'new',
 'ny',
 'nyc',
 'one',
 'people',
 'phone',
 'plane',
 'please',
 're',
 'really',
 'san',
 'say',
 'seat',
 'see',
 'service',
 'southwestair',
 'still',
 'take',
 'thank',
 'thanks',
 'ticket',
 'time',
 'today',
 'tomorrow',
 'travel',
 'try',
 'tx',
 'united',
 'usa',
 'usairways',
 'use',
 've',
 'virginamerica',
 'wait',
 'want',
 'washington',
 'way',
 'weather',
 'well',
 'work',
 'would',
 'york']
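For intuition about what `CountVectorizer(max_features=...)` keeps, here is a small sketch on three invented sentences. With `max_features=2`, only the two terms with the highest total count across the corpus remain, and the columns follow alphabetical order of those terms. The sketch uses `toarray()` to get a plain NumPy array, which is a substitution for the `todense()` call used in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["flight delayed again", "great flight crew", "delayed flight"]

vec = CountVectorizer(max_features=2)
counts = vec.fit_transform(docs).toarray()

# vocabulary_ maps each kept term to its column index
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)
print(counts)
```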

In [64]:
test_tweets = []

for i in range(len(test_df.text_revised)) :
    test_tweets.append(word_tokenize(test_df.text_revised[i]))
    
test_tweets[0:5]

[['Texas',
  '@',
  'AmericanAir',
  'In',
  'car',
  'gng',
  'to',
  'DFW',
  '.',
  'Pulled',
  'over',
  '1hr',
  'ago',
  '-',
  'very',
  'icy',
  'roads',
  '.',
  'On-hold',
  'with',
  'AA',
  'since',
  '1hr',
  '.',
  'Ca',
  "n't",
  'reach',
  'arpt',
  'for',
  'AA2450',
  '.',
  'Wat',
  '2',
  'do',
  '?'],
 ['College',
  'Station',
  ',',
  'TX',
  '@',
  'AmericanAir',
  'after',
  'all',
  ',',
  'the',
  'plane',
  'didn',
  '’',
  't',
  'land',
  'in',
  'identical',
  'or',
  'worse',
  ')',
  'conditions',
  'at',
  'GRK',
  'according',
  'to',
  'METARs',
  '.'],
 ['CT',
  '@',
  'SouthwestAir',
  'ca',
  "n't",
  'believe',
  'how',
  'many',
  'paying',
  'customers',
  'you',
  'left',
  'high',
  'and',
  'dry',
  'with',
  'no',
  'reason',
  'for',
  'flight',
  'Cancelled',
  'Flightlations',
  'Monday',
  'out',
  'of',
  'BDL',
  '!',
  'Wow',
  '.'],
 ['Washington',
  ',',
  'D.C.',
  '@',
  'USAirways',
  'I',
  'can',
  'legitimately',
  'say',
  '

In [65]:
test_tweets = [clean_tweet(words) for words in test_tweets]

In [66]:
test_tweets[0:5]

[['texas',
  'americanair',
  'car',
  'gng',
  'dfw',
  'pulled',
  '1hr',
  'ago',
  'icy',
  'road',
  'on-hold',
  'aa',
  'since',
  '1hr',
  'ca',
  "n't",
  'reach',
  'arpt',
  'aa2450',
  'wat',
  '2'],
 ['college',
  'station',
  'tx',
  'americanair',
  'plane',
  '’',
  'land',
  'identical',
  'bad',
  'condition',
  'grk',
  'accord',
  'metars'],
 ['ct',
  'southwestair',
  'ca',
  "n't",
  'believe',
  'many',
  'pay',
  'customer',
  'left',
  'high',
  'dry',
  'reason',
  'flight',
  'cancelled',
  'flightlations',
  'monday',
  'bdl',
  'wow'],
 ['washington',
  'd.c.',
  'usairways',
  'legitimately',
  'say',
  'would',
  'rather',
  'driven',
  'cross',
  'country',
  'flown',
  'us',
  'airways'],
 ['new',
  'york',
  'city',
  'americanair',
  'still',
  'response',
  'aa',
  'great',
  'job',
  'guy']]

In [67]:
test_text_tweets = [' '.join(words) for words in test_tweets]

In [68]:
test_text_tweets[0:5]

["texas americanair car gng dfw pulled 1hr ago icy road on-hold aa since 1hr ca n't reach arpt aa2450 wat 2",
 'college station tx americanair plane ’ land identical bad condition grk accord metars',
 "ct southwestair ca n't believe many pay customer left high dry reason flight cancelled flightlations monday bdl wow",
 'washington d.c. usairways legitimately say would rather driven cross country flown us airways',
 'new york city americanair still response aa great job guy']

In [69]:
x_test_features = count_vec.transform(test_text_tweets).todense()

Now is the time for a classifier

In [70]:
from sklearn.svm import SVC

In [71]:
svc = SVC()

In [72]:
train_df.drop(columns=['text_revised'], inplace=True)
test_df.drop(columns=['text_revised'], inplace=True)

In [73]:
train_df.head()

Unnamed: 0,airline,name,retweet_count,tweet_created,"(Abu Dhabi,)","(Adelaide,)","(Alaska,)","(America/Atikokan,)","(America/Boise,)","(America/Chicago,)",...,"(Sydney,)","(Taipei,)","(Tehran,)","(Tijuana,)","(Tokyo,)","(U,)","(Vienna,)","(Warsaw,)","(Wellington,)","(West Central Africa,)"
0,Southwest,ColeyGirouard,0,2015-02-17 20:16:29 -0800,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Southwest,WalterFaddoul,0,2015-02-23 14:36:22 -0800,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,United,LocalKyle,0,2015-02-18 08:46:29 -0800,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Southwest,amccarthy19,0,2015-02-20 16:20:26 -0800,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,United,J_Okayy,0,2015-02-19 18:13:11 -0800,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [74]:
len(set(train_df.airline))

6

In [75]:
encoder = OneHotEncoder()
encoder.fit(train_df.values[:, 0:1])

OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)

In [76]:
airlines = set(train_df.airline)

In [77]:
test_airline = []

for ta in test_df.airline:
    if ta not in airlines:
        test_airline.append('U')
    else:
        test_airline.append(ta)
test_airline

['American',
 'American',
 'Southwest',
 'US Airways',
 'American',
 'United',
 'US Airways',
 'US Airways',
 'US Airways',
 'American',
 'United',
 'US Airways',
 'Southwest',
 'American',
 'Southwest',
 'US Airways',
 'United',
 'Delta',
 'United',
 'Delta',
 'Southwest',
 'United',
 'American',
 'American',
 'US Airways',
 'Delta',
 'American',
 'US Airways',
 'Delta',
 'United',
 'American',
 'Delta',
 'US Airways',
 'United',
 'US Airways',
 'US Airways',
 'United',
 'Southwest',
 'American',
 'US Airways',
 'Virgin America',
 'Delta',
 'American',
 'US Airways',
 'American',
 'Delta',
 'Delta',
 'United',
 'American',
 'United',
 'US Airways',
 'American',
 'American',
 'Southwest',
 'Southwest',
 'Southwest',
 'American',
 'Virgin America',
 'United',
 'United',
 'American',
 'Southwest',
 'US Airways',
 'American',
 'Southwest',
 'American',
 'Delta',
 'American',
 'Southwest',
 'American',
 'United',
 'United',
 'Southwest',
 'US Airways',
 'United',
 'Delta',
 'American',
 'U

In [78]:
test_df.airline = test_airline

In [79]:
temp_df_train = pd.DataFrame(encoder.transform(train_df.values[:, 0:1]).toarray(), columns=encoder.categories_)
temp_df_test = pd.DataFrame(encoder.transform(test_df.values[:, 0:1]).toarray(), columns=encoder.categories_)

In [80]:
temp_df_train.head()

Unnamed: 0,American,Delta,Southwest,US Airways,United,Virgin America
0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0


In [81]:
train_df.drop(columns=['airline'], inplace=True)
test_df.drop(columns=['airline'], inplace=True)

In [82]:
train_df = pd.concat([temp_df_train, train_df], axis=1)
test_df = pd.concat([temp_df_test, test_df], axis=1)

In [83]:
train_df.head()

Unnamed: 0,"(American,)","(Delta,)","(Southwest,)","(US Airways,)","(United,)","(Virgin America,)",name,retweet_count,tweet_created,"(Abu Dhabi,)",...,"(Sydney,)","(Taipei,)","(Tehran,)","(Tijuana,)","(Tokyo,)","(U,)","(Vienna,)","(Warsaw,)","(Wellington,)","(West Central Africa,)"
0,0.0,0.0,1.0,0.0,0.0,0.0,ColeyGirouard,0,2015-02-17 20:16:29 -0800,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,WalterFaddoul,0,2015-02-23 14:36:22 -0800,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,LocalKyle,0,2015-02-18 08:46:29 -0800,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,amccarthy19,0,2015-02-20 16:20:26 -0800,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,J_Okayy,0,2015-02-19 18:13:11 -0800,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [84]:
train_df.drop(columns=['name'], inplace=True)

In [85]:
test_df.drop(columns=['name'], inplace=True)

In [86]:
train_df.head()

Unnamed: 0,"(American,)","(Delta,)","(Southwest,)","(US Airways,)","(United,)","(Virgin America,)",retweet_count,tweet_created,"(Abu Dhabi,)","(Adelaide,)",...,"(Sydney,)","(Taipei,)","(Tehran,)","(Tijuana,)","(Tokyo,)","(U,)","(Vienna,)","(Warsaw,)","(Wellington,)","(West Central Africa,)"
0,0.0,0.0,1.0,0.0,0.0,0.0,0,2015-02-17 20:16:29 -0800,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0,2015-02-23 14:36:22 -0800,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0,2015-02-18 08:46:29 -0800,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0,2015-02-20 16:20:26 -0800,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0,2015-02-19 18:13:11 -0800,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [87]:
tweet_date = []
tweet_hour = []
tweet_min = []
for t in train_df.tweet_created :
    tweet_date.append(int(t[8:10]))
    tweet_hour.append(int(t[11:13]))
    tweet_min.append(int(t[14:16]))
    
tweet_date = np.array(tweet_date)
tweet_hour = np.array(tweet_hour)
tweet_min = np.array(tweet_min)
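The slicing above relies on every timestamp having exactly the same layout. As a hedged alternative, the same three fields can be read with the standard library's `datetime.strptime`, which raises an error on a malformed string instead of silently slicing the wrong characters. The helper name `split_timestamp` is an assumption for this sketch.

```python
from datetime import datetime

def split_timestamp(created):
    """Return (day, hour, minute) from a string like '2015-02-17 20:16:29 -0800'."""
    parsed = datetime.strptime(created, "%Y-%m-%d %H:%M:%S %z")
    # day/hour/minute are the wall-clock fields as written, matching the slices above
    return parsed.day, parsed.hour, parsed.minute

parts = split_timestamp("2015-02-17 20:16:29 -0800")
print(parts)  # (17, 20, 16)
```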

In [88]:
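# NOTE: attribute-style assignment does not create new DataFrame columns; pandas
# emits a UserWarning and only sets Python attributes on the object. Use
# train_df['tweet_date'] = tweet_date (and so on) to actually add the columns.
# The shapes reported further down show that these three features were not added.
train_df.tweet_date = tweet_date
train_df.tweet_hour = tweet_hour
train_df.tweet_min = tweet_min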
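The difference between attribute-style and bracket-style assignment can be checked on a toy frame. This sketch uses invented values and suppresses the warning pandas raises for the attribute form, so that only the effect on the columns is shown.

```python
import warnings

import numpy as np
import pandas as pd

hours = np.array([20, 14])

# Attribute assignment with a new name: no column is created
by_attribute = pd.DataFrame({"retweet_count": [0, 1]})
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    by_attribute.tweet_hour = hours
attribute_made_column = "tweet_hour" in by_attribute.columns

# Bracket assignment: the column is created
by_bracket = pd.DataFrame({"retweet_count": [0, 1]})
by_bracket["tweet_hour"] = hours
bracket_made_column = "tweet_hour" in by_bracket.columns

print(attribute_made_column, bracket_made_column)
```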


In [89]:
tweet_date = []
tweet_hour = []
tweet_min = []
for t in test_df.tweet_created :
    tweet_date.append(int(t[8:10]))
    tweet_hour.append(int(t[11:13]))
    tweet_min.append(int(t[14:16]))
    
tweet_date = np.array(tweet_date)
tweet_hour = np.array(tweet_hour)
tweet_min = np.array(tweet_min)

In [90]:
# NOTE: as with train_df, attribute-style assignment does not add columns;
# test_df['tweet_date'] = tweet_date (and so on) would be needed for that.
test_df.tweet_date = tweet_date
test_df.tweet_hour = tweet_hour
test_df.tweet_min = tweet_min


In [91]:
train_df.head()

Unnamed: 0,"(American,)","(Delta,)","(Southwest,)","(US Airways,)","(United,)","(Virgin America,)",retweet_count,tweet_created,"(Abu Dhabi,)","(Adelaide,)",...,"(Sydney,)","(Taipei,)","(Tehran,)","(Tijuana,)","(Tokyo,)","(U,)","(Vienna,)","(Warsaw,)","(Wellington,)","(West Central Africa,)"
0,0.0,0.0,1.0,0.0,0.0,0.0,0,2015-02-17 20:16:29 -0800,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0,2015-02-23 14:36:22 -0800,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0,2015-02-18 08:46:29 -0800,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0,2015-02-20 16:20:26 -0800,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0,2015-02-19 18:13:11 -0800,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [92]:
train_df.drop(columns=['tweet_created'], inplace=True)
test_df.drop(columns=['tweet_created'], inplace=True)

In [93]:
train_df.head()

Unnamed: 0,"(American,)","(Delta,)","(Southwest,)","(US Airways,)","(United,)","(Virgin America,)",retweet_count,"(Abu Dhabi,)","(Adelaide,)","(Alaska,)",...,"(Sydney,)","(Taipei,)","(Tehran,)","(Tijuana,)","(Tokyo,)","(U,)","(Vienna,)","(Warsaw,)","(Wellington,)","(West Central Africa,)"
0,0.0,0.0,1.0,0.0,0.0,0.0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [94]:
train_df.shape

(10980, 86)

In [95]:
x_train_features.shape

(10980, 75)

In [108]:
count_vec = CountVectorizer(max_features=100)
x_train_features = count_vec.fit_transform(train_text_tweets).todense()
x_test_features = count_vec.transform(test_text_tweets).todense()

In [109]:
type(x_train_features)

numpy.matrixlib.defmatrix.matrix

In [110]:
train_df = pd.concat([pd.DataFrame(x_train_features, columns=count_vec.get_feature_names()), train_df], axis=1)
test_df = pd.concat([pd.DataFrame(x_test_features, columns=count_vec.get_feature_names()), test_df], axis=1)

In [111]:
svc = SVC(C=3,gamma=0.2)

In [112]:
svc.fit(train_df, y_train)

SVC(C=3, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.2, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [113]:
svc.score(train_df, y_train)

0.9407103825136612
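The score above is measured on the same rows the model was fitted on, so it says little about how the classifier behaves on unseen tweets. One common way to get a more honest estimate is to hold out part of the training data. The sketch below does this on synthetic count-like data (the arrays and the labeling rule are invented); only the `SVC(C=3, gamma=0.2)` settings are taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the feature matrix and labels (not the tweet data)
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(300, 20)).astype(float)
y = np.where(X[:, 0] + X[:, 1] > 2, "negative", "positive")

# Keep 20% of the rows aside; stratify preserves the class proportions
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = SVC(C=3, gamma=0.2).fit(X_fit, y_fit)
train_acc = model.score(X_fit, y_fit)  # accuracy on rows seen during fitting
val_acc = model.score(X_val, y_val)    # accuracy on held-out rows
print(train_acc, val_acc)
```

A large gap between the two numbers would suggest overfitting; with the real data, `train_df` and `y_train` would take the place of `X` and `y`.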

In [114]:
y_predict = svc.predict(test_df)

In [163]:
y_predict

array(['negative', 'negative', 'negative', ..., 'neutral', 'positive',
       'negative'], dtype=object)

In [None]:
for y in y_predict :
    print(y)

negative
neutral
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
positive
negative
negative
neutral
negative
negative
neutral
negative
negative
negative
negative
negative
positive
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
positive
negative
neutral
positive
negative
negative
negative
negative
neutral
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
positive
positive
negative
negative
neutral
negative
negative
negative
negative
positive
negative
negative
negative
negative
negative
neutral
negative
negative
negative
negative
negative
neutral
negative
negative
negative
positive
negative
negative
negative
negative
positive
negative
neutral
negative
negative
negative
negative
positive
negative
neutral
negative
negative
negative
negative
negative
negative
negative
negative
ne

negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
positive
positive
positive
negative
negative
negative
negative
neutral
neutral
negative
negative
negative
neutral
negative
negative
negative
positive
neutral
negative
negative
negative
neutral
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
neutral
negative
negative
negative
negative
negative
negative
negative
negative
negative
neutral
positive
negative
positive
negative
negative
positive
negative
neutral
negative
negative
negative
positive
neutral
neutral
negative
negative
negative
neutral
neutral
negative
positive
neutral
negative
neutral
negative
neutral
negative
negative
negative
negative
negative
neutral
neutral
neutral
negative
negative
negative
negative
negative
negative
negative
positive
neutral
positive
negative
positive
negative
ne

negative
negative
negative
negative
neutral
negative
neutral
negative
positive
negative
negative
negative
negative
negative
neutral
negative
positive
neutral
negative
negative
negative
negative
negative
negative
positive
negative
negative
negative
negative
negative
negative
negative
neutral
negative
neutral
neutral
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
neutral
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
negative
positive
negative
negative
negative
neutral
negative
negative
negative
positive
neutral
negative
negative
negative
negative
neutral
negative
negative
negative
negative
negative
negative
positive
negative
neutral
negative
negative
positive
positive
negative
neutral
negative
negative
positive
negative
negative
negative
positive
neutral
negative
negative
negative
negative
neutral
positive
negative
negative
negative
negative
negative
negativ

In [169]:
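import csv
# Open in text mode with newline='' (the csv module's documented usage in Python 3);
# the original binary mode "wb" raised: TypeError: a bytes-like object is required, not 'str'
with open('sol.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for label in y_predict:
        # Wrap each label in a list so it is written as one field, not one character per column
        writer.writerow([label])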

In [170]:
test_df.shape

(3660, 186)