## Load the data and analyze it

In [19]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv("Tweets.csv")
data.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [6]:
data.shape

(14640, 15)

In [11]:
data.describe()

Unnamed: 0,tweet_id,airline_sentiment_confidence,negativereason_confidence,retweet_count
count,14640.0,14640.0,10522.0,14640.0
mean,5.692184e+17,0.900169,0.638298,0.08265
std,779111200000000.0,0.16283,0.33044,0.745778
min,5.675883e+17,0.335,0.0,0.0
25%,5.685592e+17,0.6923,0.3606,0.0
50%,5.694779e+17,1.0,0.6706,0.0
75%,5.698905e+17,1.0,1.0,0.0
max,5.703106e+17,1.0,1.0,44.0


In [9]:
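# .copy() gives an independent frame, so columns added later do not trigger SettingWithCopyWarning.
data_narrowed = data[["text","airline_sentiment"]].copy()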

In [12]:
data_narrowed.head()

Unnamed: 0,text,airline_sentiment
0,@VirginAmerica What @dhepburn said.,neutral
1,@VirginAmerica plus you've added commercials t...,positive
2,@VirginAmerica I didn't today... Must mean I n...,neutral
3,@VirginAmerica it's really aggressive to blast...,negative
4,@VirginAmerica and it's a really big bad thing...,negative


## Text Preprocessing 

### HTML removal, special character removal, lowercasing, lemmatizing, stemming, etc.

It makes more sense to apply all of these operations to each row in a single pass than to iterate over the data once per operation, which wastes processing time. The "clean_text" and "normalize" functions defined below call the helper functions that carry out each text cleaning task. Note that none of the helpers removes digits, so numeric tokens such as "30" remain in the vocabulary.

In [34]:
def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)        # Append processed words to new list.
    return new_words

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        new_word = word.lower()           # Converting to lowercase
        new_words.append(new_word)        # Append processed words to new list.
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)    # Append processed words to new list.
    return new_words

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    stop_words = set(stopwords.words('english'))   # Load the stop word list once per call, not once per word.
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        if word not in stop_words:
            new_words.append(word)        # Append processed words to new list.
    return new_words

def stem_words(words):
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    stems = []                            # Create empty list to store pre-processed words.
    for word in words:
        stem = stemmer.stem(word)
        stems.append(stem)                # Append processed words to new list.
    return stems

def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []                           # Create empty list to store pre-processed words.
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)              # Append processed words to new list.
    return lemmas

def normalize(words):
    """Apply every normalization step, in order, to a list of tokenized words"""
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords(words)
    words = stem_words(words)
    words = lemmatize_verbs(words)
    return words
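
To see what the first three steps of "normalize" do, the sketch below re-implements them compactly using only the standard library and runs them on a few made-up tokens. The helper name `basic_normalize` and the sample tokens are assumptions for illustration and are not part of the notebook. The stop word, stemming, and lemmatization steps are left out because they need NLTK data.

```python
import re
import unicodedata

def basic_normalize(words):
    """Hypothetical condensed version of the first three normalize() steps."""
    # Strip accents and other non-ASCII characters (same call as remove_non_ascii).
    words = [unicodedata.normalize('NFKD', w).encode('ascii', 'ignore').decode('utf-8', 'ignore')
             for w in words]
    # Lowercase every token (to_lowercase).
    words = [w.lower() for w in words]
    # Drop punctuation, then discard tokens that end up empty (remove_punctuation).
    words = [re.sub(r'[^\w\s]', '', w) for w in words]
    return [w for w in words if w != '']

sample = ["Café", "DELAYED", "!!!", "#fail"]   # made-up tokens for illustration
print(basic_normalize(sample))                # ['cafe', 'delayed', 'fail']
```

The token "!!!" disappears entirely because nothing is left once its punctuation is removed.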

In [37]:
from bs4 import BeautifulSoup
import contractions
import re, string, unicodedata
import nltk   
from nltk.corpus import stopwords, wordnet    # Stopwords, and wordnet corpus
from nltk.stem import LancasterStemmer, WordNetLemmatizer

def clean_text(text):
    """Strip HTML, expand contractions, tokenize, and normalize one tweet"""
    result = BeautifulSoup(text, "html.parser").get_text()   # Name the parser explicitly to avoid a bs4 warning.
    result = contractions.fix(result)
    result = nltk.word_tokenize(result)
    result = normalize(result)
    return result

data_narrowed["text_clean"] = data_narrowed["text"].apply(clean_text)
data_narrowed.head()



Unnamed: 0,text,airline_sentiment,text_clean
0,@VirginAmerica What @dhepburn said.,neutral,"[virginameric, dhepburn, say]"
1,@VirginAmerica plus you've added commercials t...,positive,"[virginameric, plu, ad, commerc, expery, tacky]"
2,@VirginAmerica I didn't today... Must mean I n...,neutral,"[virginameric, today, must, mean, nee, tak, an..."
3,@VirginAmerica it's really aggressive to blast...,negative,"[virginameric, real, aggress, blast, obnoxy, e..."
4,@VirginAmerica and it's a really big bad thing...,negative,"[virginameric, real, big, bad, thing]"


In [43]:
# Join each token list into one comma-separated string for the vectorizers below.
data_narrowed['liststring'] = [','.join(tokens) for tokens in data_narrowed['text_clean']]
data_narrowed.head()

Unnamed: 0,text,airline_sentiment,text_clean,liststring
0,@VirginAmerica What @dhepburn said.,neutral,"[virginameric, dhepburn, say]","virginameric,dhepburn,say"
1,@VirginAmerica plus you've added commercials t...,positive,"[virginameric, plu, ad, commerc, expery, tacky]","virginameric,plu,ad,commerc,expery,tacky"
2,@VirginAmerica I didn't today... Must mean I n...,neutral,"[virginameric, today, must, mean, nee, tak, an...","virginameric,today,must,mean,nee,tak,anoth,trip"
3,@VirginAmerica it's really aggressive to blast...,negative,"[virginameric, real, aggress, blast, obnoxy, e...","virginameric,real,aggress,blast,obnoxy,enterta..."
4,@VirginAmerica and it's a really big bad thing...,negative,"[virginameric, real, big, bad, thing]","virginameric,real,big,bad,thing"


## Vectorization

In [45]:
from sklearn.feature_extraction.text import CountVectorizer          

cv = CountVectorizer()  
X = cv.fit_transform(data_narrowed.liststring)
print(cv.vocabulary_)
print(X.shape)
print(type(X))
print(X.toarray())

{'virginameric': 11659, 'dhepburn': 3913, 'say': 8655, 'plu': 7903, 'ad': 1909, 'commerc': 3388, 'expery': 4483, 'tacky': 9474, 'today': 10952, 'must': 7032, 'mean': 6737, 'nee': 7127, 'tak': 9484, 'anoth': 2171, 'trip': 11067, 'real': 8231, 'aggress': 1962, 'blast': 2678, 'obnoxy': 7415, 'entertain': 4340, 'guest': 5292, 'fac': 4526, 'littl': 6437, 'recours': 8267, 'big': 2635, 'bad': 2440, 'thing': 10833, 'sery': 8791, 'would': 12038, 'pay': 7711, '30': 817, 'flight': 4715, 'seat': 8729, 'play': 7879, 'fly': 4803, 'va': 11577, 'ye': 12110, 'near': 7124, 'every': 4416, 'tim': 10900, 'vx': 11698, 'ear': 4199, 'worm': 12015, 'go': 5148, 'away': 2379, 'miss': 6882, 'prim': 8050, 'opportun': 7511, 'men': 6778, 'without': 11961, 'hat': 5369, 'parody': 7665, 'https': 5605, 'tcomwpg7grezp': 10212, 'wel': 11821, 'notbut': 7336, 'amaz': 2090, 'ar': 2237, 'hour': 5578, 'good': 5175, 'know': 6201, 'suicid': 9342, 'second': 8740, 'lead': 6326, 'dea': 3770, 'among': 2117, 'teen': 10721, '1024': 57
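
CountVectorizer's default settings split each comma-joined string back into tokens. The sketch below mimics that behavior with the default token_pattern, `(?u)\b\w\w+\b`, which keeps runs of two or more word characters, so single-character tokens never reach the vocabulary. The helper name and the sample strings are assumptions for illustration only.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def sklearn_like_tokens(text):
    """Hypothetical helper: lowercase, then tokenize like the default CountVectorizer."""
    return TOKEN_PATTERN.findall(text.lower())

print(sklearn_like_tokens("virginameric,dhepburn,say"))  # ['virginameric', 'dhepburn', 'say']
print(sklearn_like_tokens("fly,u,va"))                   # ['fly', 'va']  ("u" is dropped)
```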

In [46]:
print(X.shape)

(14640, 12194)
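
A sparse matrix matters at this size. As a rough back-of-the-envelope check, assuming CountVectorizer's default int64 counts (8 bytes per entry), the arithmetic below estimates how much memory the dense array built by X.toarray() would occupy.

```python
rows, cols = 14640, 12194        # X.shape from the output above
bytes_per_entry = 8              # assumption: default int64 counts

dense_entries = rows * cols
dense_bytes = dense_entries * bytes_per_entry

print(f"{dense_entries:,} entries")            # 178,520,160 entries
print(f"{dense_bytes / 1024**3:.2f} GiB")      # 1.33 GiB
```

The sparse format stores only the nonzero counts, which is far smaller here because each tweet contains just a handful of the 12194 terms.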


In [47]:
print(type(X))

<class 'scipy.sparse.csr.csr_matrix'>


In [57]:
myvocabulary = cv.vocabulary_
myvocabulary

{'virginameric': 11659,
 'dhepburn': 3913,
 'say': 8655,
 'plu': 7903,
 'ad': 1909,
 'commerc': 3388,
 'expery': 4483,
 'tacky': 9474,
 'today': 10952,
 'must': 7032,
 'mean': 6737,
 'nee': 7127,
 'tak': 9484,
 'anoth': 2171,
 'trip': 11067,
 'real': 8231,
 'aggress': 1962,
 'blast': 2678,
 'obnoxy': 7415,
 'entertain': 4340,
 'guest': 5292,
 'fac': 4526,
 'littl': 6437,
 'recours': 8267,
 'big': 2635,
 'bad': 2440,
 'thing': 10833,
 'sery': 8791,
 'would': 12038,
 'pay': 7711,
 '30': 817,
 'flight': 4715,
 'seat': 8729,
 'play': 7879,
 'fly': 4803,
 'va': 11577,
 'ye': 12110,
 'near': 7124,
 'every': 4416,
 'tim': 10900,
 'vx': 11698,
 'ear': 4199,
 'worm': 12015,
 'go': 5148,
 'away': 2379,
 'miss': 6882,
 'prim': 8050,
 'opportun': 7511,
 'men': 6778,
 'without': 11961,
 'hat': 5369,
 'parody': 7665,
 'https': 5605,
 'tcomwpg7grezp': 10212,
 'wel': 11821,
 'notbut': 7336,
 'amaz': 2090,
 'ar': 2237,
 'hour': 5578,
 'good': 5175,
 'know': 6201,
 'suicid': 9342,
 'second': 8740,
 'lea

### TFIDF

In [63]:
from sklearn.feature_extraction.text import TfidfVectorizer

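# The vocabulary is fixed to the unigrams learned above, so ngram_range = (1,2)
# adds no bigram features: n-grams missing from the vocabulary are ignored.
tfidf = TfidfVectorizer(vocabulary = myvocabulary, ngram_range = (1,2))
tfs = tfidf.fit_transform(data_narrowed.liststring)
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 call it instead.
feature_names = tfidf.get_feature_names_out().tolist()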
# corpus_index = [n for n in data_narrowed]

# df = pd.DataFrame(tfs.T.todense(), index=feature_names, columns=corpus_index)
print(feature_names)
# print(df)

['0011', '0016', '006', '0162389030167', '0162424965446', '0162431184663', '0167560070877', '0214', '021mbps', '022015', '0223', '02272015', '02282015', '03', '0303', '03032015', '0316', '0372389047497', '0400', '0510', '0530', '0600', '0638', '0671', '0736', '0769', '0_0', '0xjared', '10', '100', '1000', '10000', '10000lbs', '1000cost', '1000p', '1000pm', '1001', '1001pm', '1002', '1005am', '1005pm', '1007', '1007p', '1008', '100pm', '101', '1010', '101030', '1014am', '1015', '1015am', '1016', '1019', '102', '1020', '1020pm', '10215', '1024', '1025', '1027', '1028', '103', '1030', '1030a', '1030pm', '1031', '1032', '1035', '1038', '1039', '104', '1041', '1045', '1045pm', '1046', '105', '1050', '1050am', '1051', '1051pm', '1055', '1055pm', '1058', '106', '1065', '1071', '1074', '1079871763', '108', '1080', '1081', '1086', '108639', '1089', '1098', '1099', '10a', '10am', '10d', '10f', '10hour', '10hrs', '10m', '10min', '10mins', '10minute', '10p', '10pm', '10th', '10voucherwhatajoke', '

In [62]:
print(tfidf.vocabulary_)

{'virginameric': 11659, 'dhepburn': 3913, 'say': 8655, 'plu': 7903, 'ad': 1909, 'commerc': 3388, 'expery': 4483, 'tacky': 9474, 'today': 10952, 'must': 7032, 'mean': 6737, 'nee': 7127, 'tak': 9484, 'anoth': 2171, 'trip': 11067, 'real': 8231, 'aggress': 1962, 'blast': 2678, 'obnoxy': 7415, 'entertain': 4340, 'guest': 5292, 'fac': 4526, 'littl': 6437, 'recours': 8267, 'big': 2635, 'bad': 2440, 'thing': 10833, 'sery': 8791, 'would': 12038, 'pay': 7711, '30': 817, 'flight': 4715, 'seat': 8729, 'play': 7879, 'fly': 4803, 'va': 11577, 'ye': 12110, 'near': 7124, 'every': 4416, 'tim': 10900, 'vx': 11698, 'ear': 4199, 'worm': 12015, 'go': 5148, 'away': 2379, 'miss': 6882, 'prim': 8050, 'opportun': 7511, 'men': 6778, 'without': 11961, 'hat': 5369, 'parody': 7665, 'https': 5605, 'tcomwpg7grezp': 10212, 'wel': 11821, 'notbut': 7336, 'amaz': 2090, 'ar': 2237, 'hour': 5578, 'good': 5175, 'know': 6201, 'suicid': 9342, 'second': 8740, 'lead': 6326, 'dea': 3770, 'among': 2117, 'teen': 10721, '1024': 57

In [64]:
tfidf.idf_

array([9.89843391, 9.89843391, 9.89843391, ..., 9.89843391, 8.98214318,
       9.89843391])
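
The two distinct values in the idf_ output can be reproduced by hand. Assuming the TfidfVectorizer default of smooth_idf=True, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents that contain term t. The document frequencies 1 and 4 used below are inferred from the printed values, not read from the data.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 14640                               # number of tweets (rows of tfs)
print(round(smoothed_idf(n_docs, 1), 8))     # 9.89843391, a term found in a single tweet
print(round(smoothed_idf(n_docs, 4), 8))     # 8.98214318, a term found in four tweets
```

The rarer the term, the larger its idf, so most entries in the array correspond to terms that occur in only one tweet.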

In [54]:
print(tfs.shape)

(14640, 12194)


## VADER analyzer

In [80]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

def get_sentiment(text, k):
    """Return the VADER 'pos' or 'neg' score for text, scaled from 0-1 to 0-10"""
    sentiment_score = sid.polarity_scores(text)
    positive_meter = round((sentiment_score['pos'] * 10), 2)
    negative_meter = round((sentiment_score['neg'] * 10), 2)
    return positive_meter if k == 'positive' else negative_meter


data_narrowed['positive'] = data_narrowed.text.apply(get_sentiment, k='positive')
data_narrowed['negative'] = data_narrowed.text.apply(get_sentiment, k='negative')

for index, row in data_narrowed.iterrows():
    print("Positive : {}, Negative : {}, Sentiment: {}".format(row['positive'], row['negative'], row['airline_sentiment']))

Positive : 0.0, Negative : 0.0, Sentiment: neutral
Positive : 0.0, Negative : 0.0, Sentiment: positive
Positive : 0.0, Negative : 0.0, Sentiment: neutral
Positive : 0.0, Negative : 2.46, Sentiment: negative
Positive : 0.0, Negative : 3.21, Sentiment: negative
Positive : 0.74, Negative : 2.56, Sentiment: negative
Positive : 3.22, Negative : 0.0, Sentiment: positive
Positive : 1.97, Negative : 1.6, Sentiment: neutral
Positive : 2.18, Negative : 3.67, Sentiment: positive
Positive : 3.58, Negative : 0.0, Sentiment: positive
Positive : 0.0, Negative : 3.75, Sentiment: neutral
Positive : 5.65, Negative : 0.0, Sentiment: positive
Positive : 1.69, Negative : 0.0, Sentiment: positive
Positive : 0.0, Negative : 0.0, Sentiment: positive
Positive : 7.61, Negative : 0.0, Sentiment: positive
Positive : 0.0, Negative : 3.7, Sentiment: negative
Positive : 0.7, Negative : 1.69, Sentiment: positive
Positive : 1.3, Negative : 0.0, Sentiment: negative
Positive : 0.0, Negative : 0.0, Sentiment: positive
Po

## Summary of NLP and text analysis

Overall, preparing the data takes a lot of work. Even after all the programmatic cleaning above, the vocabulary output shows that further work remains. A human who understands English could do that work more easily, but nowhere near as fast as a computer could if we were able to teach it both proper and slang English. Issues still present in the data include abbreviations, slang words, misspelled words (there are packages that try to correct these specifically), and the mood or tone of the text itself, which a computer simply can't pick up on (for example, sarcasm in a movie review). Because of these factors, our model can only ever get so accurate.

In the end, the VADER model did alright. Sometimes it assigns a slight negative or positive score to a neutral tweet. It does a great job when the sentiment value is strong in either direction. For example, when it scores the positive sentiment at 7 and the negative at 0, the prediction consistently agrees with the true label. In other words, when the model is very confident, it is also very accurate. It struggles more with neutral tweets, which makes sense because it is hard to define where to draw the line between positive, negative, and neutral.
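
One way to make "where to draw the line" concrete is to turn the two meters into a label with an explicit margin and count how often that label agrees with the true one. The sketch below does this for the first 19 rows printed above. The function name and the margin of 1.0 are hypothetical choices for illustration, not a tuned threshold.

```python
def label_from_meters(positive, negative, margin=1.0):
    """Hypothetical rule: neutral unless one meter leads the other by more than margin."""
    if positive - negative > margin:
        return "positive"
    if negative - positive > margin:
        return "negative"
    return "neutral"

# (positive meter, negative meter, true label), copied from the first rows of the output above.
rows = [
    (0.0, 0.0, "neutral"), (0.0, 0.0, "positive"), (0.0, 0.0, "neutral"),
    (0.0, 2.46, "negative"), (0.0, 3.21, "negative"), (0.74, 2.56, "negative"),
    (3.22, 0.0, "positive"), (1.97, 1.6, "neutral"), (2.18, 3.67, "positive"),
    (3.58, 0.0, "positive"), (0.0, 3.75, "neutral"), (5.65, 0.0, "positive"),
    (1.69, 0.0, "positive"), (0.0, 0.0, "positive"), (7.61, 0.0, "positive"),
    (0.0, 3.7, "negative"), (0.7, 1.69, "positive"), (1.3, 0.0, "negative"),
    (0.0, 0.0, "positive"),
]

matches = sum(label_from_meters(p, n) == truth for p, n, truth in rows)
print(f"{matches} of {len(rows)} labels agree")   # 12 of 19 labels agree
```

Three of the seven misses in this small sample are tweets with both meters at 0.0 whose true label is positive, which no choice of margin can fix.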

The TF-IDF vectorization strategy could have a great application to legal documents, for example helping staff organize documents based on what each document, contract, or form contains. TF-IDF gives lower weight to words that appear in most documents and higher weight to words concentrated in a few, so the less frequent words that help determine how a document should be organized or processed come to the front. We would not want to organize or analyze documents based on words that are very frequent and exist in every file; those extremely frequent words should carry less importance in this case.
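
As a minimal sketch of that idea, the example below weights the terms of three made-up, already-tokenized documents by raw count times smoothed IDF. The documents and the helper name are hypothetical, and the L2 normalization that TfidfVectorizer applies by default is omitted to keep the arithmetic visible.

```python
import math
from collections import Counter

# Made-up, already-tokenized documents (hypothetical examples).
docs = {
    "lease":   ["this", "agreement", "tenant", "rent", "landlord"],
    "nda":     ["this", "agreement", "confidential", "information", "recipient"],
    "invoice": ["this", "invoice", "amount", "due", "customer"],
}

n_docs = len(docs)
# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))

def tfidf_weights(tokens):
    """Raw term count times smoothed IDF, ln((1 + n) / (1 + df)) + 1."""
    counts = Counter(tokens)
    return {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, c in counts.items()}

weights = {name: tfidf_weights(tokens) for name, tokens in docs.items()}
for name, w in weights.items():
    # Sort by descending weight, then alphabetically so ties print deterministically.
    ranked = sorted(w, key=lambda t: (-w[t], t))
    print(name, ranked)
```

The word "this" occurs in every document and ends up last in each ranking, while terms unique to one document, such as "tenant" or "invoice", rank first.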