### Ideas

- Look at the top words used in positive and negative tweets. You could use word clouds to show what the top words are (a bar chart's second choice). You can use the wordcloud library for this. 

1.   Towards Data Science, ["A Complete Exploratory Data Analysis and Visualization for Text Data"](https://towardsdatascience.com/a-complete-exploratory-data-analysis-and-visualization-for-text-data-29fb1b96fb6a)
2.   YouTube, [Natural Language Processing (Part 2): Data Cleaning & Text Pre-Processing in Python](https://www.youtube.com/watch?v=iQ1bfDMCv_c), see 20:22 for code on text cleaning in cell 19.
3. Keep watching the series at this link: https://www.youtube.com/watch?v=N9CT6Ggh0oE

# Twitter NLP Project

The purpose of this project is to: 

1. Use NLP to analyze customer support tweets on Twitter to classify the sentiment of the tweets as negative or positive. 
2. And use clustering to look for patterns among consumers and and the companies and gain insights from that information. 

Why this is valuable: 

For a company that currently uses Twitter for customer support, this could help them get a deeper understanding of their customer service interactions.

In the case of companies that aren’t using Twitter, but are considering it, this would help them build a customer service strategy by giving them a better sense of how other brands provide service and which brands have the most positive interactions with their customers.

Further, while many consumers still use traditional customer service avenues, like speaking to customer support over the phone, one-third of millennials use social media to connect with brands. And as more digital natives are born, that trend will likely continue and make customer service provided on social media increasingly important. ([Steinmetz](https://time.com/4894182/twitter-company-complaints/))

### Table of Contents

1. Import Statements.
2. Importing the Dataset. 
3. Exploratory Data Analysis (EDA).
4. Analyzing the Data with NLP Techniques.
5. Using Clustering to Draw Insights.
6. Key Takeaways.
7. Next Steps.
8. Resources.
9. Appendix

### 1. Import Statements

In [0]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

### 2. Importing the Dataset

In [96]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
path = '/content/drive/My Drive/Colab Notebooks/Datasets for Data Science Projects/twcs.csv'
twitter_df = pd.read_csv(path)

### 3. Exploratory Data Analysis (EDA)

Because this is a new data set, it'll be helpful to use EDA to get a feel for the data, add features, clean it up and explore the data with visuals.

*Looking at the Dataframe*

In [98]:
twitter_df.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,1,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2.0,3.0
1,2,115712,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0
2,3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1.0,4.0
3,4,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3.0,5.0
4,5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4.0,6.0


In [99]:
twitter_df['author_id'].nunique()

702777

In [100]:
print(twitter_df['created_at'].min())
print(twitter_df['created_at'].max())

Fri Apr 01 17:37:48 +0000 2016
Wed Sep 28 18:06:15 +0000 2016


Intial observations from looking at the dataframe: 

- The dataset's 2,811,774 rows by seven columns, which is large. 
- There are 702,777 unique authors in this dataset and there are an average of four tweets/author.
- The tweets are from the beginning of April through late September in 2016.

*3.2 Feature Engineering*

I'd like to add features related to the companies, like which industries they belong to, and I'll work on that in that in this section.

To get started, I need to isolate the values in the *author_id* column so that it only includes the names of companies.

In [109]:
# I'll make a new dataframe to keep track of the changes that I'm making.
authors_df = twitter_df.copy()

# I'll also make a new variable, called author_alphas, and add it to the authors_dataframe. 
# This variable will show which variables are alphabetic. And since individuals are represented by numbers, this will tell us which author_ids are companies and which aren't.
authors_df['author_alphas'] = authors_df['author_id'].str.isalpha()

print(authors_df['author_alphas'])

0           True
1          False
2          False
3           True
4          False
           ...  
2811769     True
2811770    False
2811771    False
2811772     True
2811773    False
Name: author_alphas, Length: 2811774, dtype: bool


In [110]:
# I'll make another variable, called 'author_is_company', to help me drop the individuals from the dataframe.
author_is_individual = authors_df[authors_df['author_alphas'] == False].index

authors_df.drop(author_is_individual, inplace=True)

authors_df.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id,author_alphas
0,1,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2.0,3.0,True
3,4,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3.0,5.0,True
5,6,sprintcare,False,Tue Oct 31 21:46:24 +0000 2017,@115712 Can you please send us a private messa...,57.0,8.0,True
7,11,sprintcare,False,Tue Oct 31 22:10:35 +0000 2017,@115713 This is saddening to hear. Please shoo...,,12.0,True
9,15,sprintcare,False,Tue Oct 31 20:03:31 +0000 2017,@115713 We understand your concerns and we'd l...,12.0,16.0,True


In [111]:
# From looking at the dataframe, my sense is that the table only has companies listed, but I'll double-check that by looking at the values in that column. 
authors_df['author_alphas'].unique()

array([ True])

Now that the *author_id* column only has companies listed, I'm closer to being able to do feature engineering with the company information. But there's still more to do, like getting a list of all the company names and seeing how many companies there are.

In [113]:
# This will give me the list of company names.
authors_df['author_id'].unique()

array(['sprintcare', 'VerizonSupport', 'ChipotleTweets', 'AskPlayStation',
       'marksandspencer', 'MicrosoftHelps', 'ATVIAssist', 'AdobeCare',
       'AmazonHelp', 'XboxSupport', 'AirbnbHelp', 'nationalrailenq',
       'AirAsiaSupport', 'Morrisons', 'NikeSupport', 'AskAmex',
       'McDonalds', 'YahooCare', 'AskLyft', 'UPSHelp', 'Delta',
       'AppleSupport', 'Tesco', 'SpotifyCares', 'comcastcares',
       'AmericanAir', 'TMobileHelp', 'VirginTrains', 'SouthwestAir',
       'AskeBay', 'GWRHelp', 'sainsburys', 'AskPayPal', 'HPSupport',
       'ChaseSupport', 'CoxHelp', 'DropboxSupport', 'VirginAtlantic',
       'AzureSupport', 'AlaskaAir', 'ArgosHelpers', 'AskTarget',
       'GoDaddyHelp', 'CenturyLinkHelp', 'AskPapaJohns', 'askpanera',
       'Walmart', 'USCellularCares', 'AsurionCares', 'GloCare',
       'NeweggService', 'VirginAmerica', 'DunkinDonuts', 'TfL',
       'asksalesforce', 'Kimpton', 'AskCiti', 'IHGService',
       'LondonMidland', 'JetBlue', 'BoostCare', 'JackBox', 'Al

In [114]:
authors_df['author_id'].nunique()

91

I'll use feature engineering to add these variables to the dataframe by building off the list of company names: 

1. Industry.
  - This will list which sector each company's in. This includes banks, retail, restaurants and so on.
  - To get the sector information, I went to the US Chamber of Commerce and saw that on their [frequently asked questions page](https://www.uschamber.com/about/about-the-us-chamber/frequently-asked-questions#3) they recommend using [Salesgenie](https://www.salesgenie.com/) to research how companies are classified. From there, I made a Salesgenie profile and used their free, three-day trial to gather data.
  - Note: Salesgenie only has information on US companies, but some of the companies in the dataset were from other countries. In those cases, I assigned sectors to companies based off how their American counter-parts had been classified. For instance, I listed the Royal Bank of Canada as a bank because Citibank is also listed as a bank.
  - For a full list of non-US companies and the industries they were assigned to, please reference the Appendix.

2. NASDAQ Listing.
  - Please note that: 
  1. The NASDAQ information was gathered from [Yahoo! Finance](https://finance.yahoo.com/). For your reference, here's an example of a search where I took the [Adjusted Close Number for Adobe's stock from May 31, 2016](https://finance.yahoo.com/quote/ADBE/history?period1=1462078800&period2=1464757200&interval=1mo&filter=history&frequency=1mo).
  2. Some companies that are listed on NASDAQ now, like Uber, weren't listed on the exchange in 2016 because they hadn't gone public yet. 
  3. In some cases a company's not listed on NASDAQ because it's privately held and may never go public.

3. NASDAQ Price (in USD).
  - If a company was listed on NASDAQ on May 31, 2016 this will show the adjusted value of how much the stock cost on that day.

4. Major Event.
  - If a company experienced an event that impacted thousands of people, such as a service outtage, and the event was covered by multiple news sources I recorded that and assigned it a value of one; however, if a company didn't have a major event, it was given a zero.
  - For your reference, [here's an example of a search]((https://www.google.com/search?biw=1920&bih=969&tbs=cdr%3A1%2Ccd_min%3A4%2F1%2F2016%2Ccd_max%3A9%2F28%2F2016&tbm=nws&ei=MznTXcm0Eoi8tgWlqqZY&q=london+midland&oq=london+midland&gs_l=psy-ab.3..0l2.39991.41189.0.41514.14.11.0.3.3.0.165.1405.3j8.11.0....0...1c.1.64.psy-ab..1.13.1263...0i131k1.0.X-aRa17UL9k) where I found information about a train derailment that impacted London Midland's services. In the 'major_event' column, I said that a, "train derailment caused disruption in service".

5. Event Month.
  - Because the dataframe covers April 1st through September 28th, I looked for events that occurred during that timeframe and noted the month as a number. For instance, April was four and May was five. In cases where there wasn't a major event, I put down a zero.

6. Event Highlights.
  - To provide insights, I wrote down a highlight about event.

7. Company.
  - To make it easier to tell companies and individuals apart, I'll add a column to show if the author of a tweet is a company or an individual. If the author's a company, they'll get assigned a value of one. If it's an individual, the value will be zero.

Adding the new features to the dataframe:

In [0]:
# I'll make a new dataframe for the feature engineering.
regex_df = twitter_df.copy()

In [117]:
list1 = []

for i in regex_df['author_id']:
 try:
   i = float(i)
 except:
   pass
 if type(i) in [int, float]:
   list1.append('non-company')
 else:
   list1.append(i)

#print list1
regex_df['new_column'] = list1
print(regex_df)

         tweet_id   author_id  ...  in_response_to_tweet_id   new_column
0               1  sprintcare  ...                      3.0   sprintcare
1               2      115712  ...                      1.0  non-company
2               3      115712  ...                      4.0  non-company
3               4  sprintcare  ...                      5.0   sprintcare
4               5      115712  ...                      6.0  non-company
...           ...         ...  ...                      ...          ...
2811769   2987947  sprintcare  ...                2987948.0   sprintcare
2811770   2987948      823869  ...                      NaN  non-company
2811771   2812240      121673  ...                2812239.0  non-company
2811772   2987949      AldiUK  ...                2987950.0       AldiUK
2811773   2987950      823870  ...                      NaN  non-company

[2811774 rows x 8 columns]


In [0]:
# I'll drop the 'author_id' column since 'new_column' has been added.
regex_df = regex_df[['tweet_id', 'inbound', 'created_at', 'text', 'response_tweet_id', 'in_response_to_tweet_id', 'new_column']]

In [119]:
# Renaming 'new_column' as 'author_id'.
regex_df = regex_df.rename(columns={'new_column': 'author_id'})
regex_df.head()

Unnamed: 0,tweet_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id,author_id
0,1,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2.0,3.0,sprintcare
1,2,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0,non-company
2,3,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1.0,4.0,non-company
3,4,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3.0,5.0,sprintcare
4,5,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4.0,6.0,non-company


In [0]:
# Importing the 'companies_df' from Google Drive.
path = '/content/drive/My Drive/Colab Notebooks/Industry List Spreadsheet.xlsx'
companies_df = pd.read_excel(path)

In [121]:
companies_df.head()

Unnamed: 0,author_id,industry,company,nasdaq_listing,nasdaq_price,major_event,event_month,event_highlights
0,AdobeCare,Computer - Software Developers,1,1,95.79,1,5,Launched new service.
1,AirAsiaSupport,Airline Companies,1,0,0.0,0,0,
2,AirbnbHelp,Marketing Consultants,1,1,22.68,1,9,Adopted anti-discrimination policies.
3,AlaskaAir,Airline Companies,1,1,54.74,1,4,Bought Virgin America.
4,AldiUK,Grocers - Retail,1,0,0.0,0,0,


In [0]:
# Merging the 'regex_df' and 'companies_df' dataframes.
regex_df = regex_df.merge(companies_df, on='author_id', how='left')

In [123]:
regex_df.head()

Unnamed: 0,tweet_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id,author_id,industry,company,nasdaq_listing,nasdaq_price,major_event,event_month,event_highlights
0,1,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2.0,3.0,sprintcare,Cellular Telephones (Services),1.0,1.0,4.53,0.0,0.0,
1,2,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0,non-company,non-company,0.0,0.0,0.0,0.0,0.0,
2,3,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1.0,4.0,non-company,non-company,0.0,0.0,0.0,0.0,0.0,
3,4,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3.0,5.0,sprintcare,Cellular Telephones (Services),1.0,1.0,4.53,0.0,0.0,
4,5,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4.0,6.0,non-company,non-company,0.0,0.0,0.0,0.0,0.0,


In [124]:
# I'll compare the shape of the original and new dataframes to make sure nothing was lost.
print(twitter_df.shape)
print(regex_df.shape)

(2811774, 7)
(2811774, 14)


*3.3 Data Cleaning*

This dataframe has string and integer data that needs to be cleaned, so a couple of different approaches will be needed for this step. To get this step started, I'll make a new dataframe.

In [0]:
clean_df = regex_df.copy()

*Cleaning the Numbers Data*

The values in these columns need to be cleaned:

* 'nasdaq_listing.'
* 'nasdaq_price.'
* 'major_event.'
* 'event_month.'
* 'company.'

In [0]:
# The numbers columns can be cleaned with the same method, so I'll define a function for this step.
def clean_numbers(string, int):
    clean_df.loc[(clean_df[string] != int), string] = 0
    return;

In [0]:
# Cleaning 'nasdaq_listing'.
clean_numbers('nasdaq_listing', 1)

In [0]:
# Cleaning the 'nasdaq_price'.
clean_numbers('nasdaq_price', 4-1146)

In [0]:
# Cleaning 'major_event'.
clean_numbers('major_event', 1)

In [0]:
# Cleaning 'event_month'.
clean_numbers('event_month', 4-9)

In [0]:
# Cleaning 'company'.
clean_numbers('company', 1)

*Language Parsing*

With the numbers data taken care of, it's time to clean these columns:

* 'text'.
* 'industry'.
* 'event_highlights'.
* 'created_at'.



In [0]:
import re
import string

In [0]:
# First, the columns with text need to get converted from floats to strings.
def make_string(df, string):
  df[string] = df[string].astype(str)
  return;

In [0]:
make_string(clean_df, 'text')

In [0]:
make_string(clean_df, 'industry')

In [0]:
make_string(clean_df, 'event_highlights')

In [0]:
make_string(clean_df, 'created_at')

In [0]:
# The 'clean_text' function will: make text lowercase, remove text in the square brackets, remove punctuation and words that have numbers.
def clean_text(text):
    text = text.lower()
    text = re.sub('\[.*?\'\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\w*\d\w*', '', text)
    text = re.sub('[‘’“”…]', '', text)
    return text

round1 = lambda x: clean_text(x)

In [0]:
# Let's take a look at the updated text
clean_df['text'] = pd.DataFrame(clean_df['text'].apply(round1))

In [0]:
clean_df['industry'] = pd.DataFrame(clean_df['industry'].apply(round1))

In [0]:
clean_df['event_highlights'] = pd.DataFrame(clean_df['event_highlights'].apply(round1))

In [0]:
clean_df['created_at'] = pd.DataFrame(clean_df['created_at'].apply(round1))

*Reviewing the Clean Dataframe*

With the data cleaning and language parsing done, I'll look at the updated dataframe to make sure it looks right.

In [143]:
clean_df.head()

Unnamed: 0,tweet_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id,author_id,industry,company,nasdaq_listing,nasdaq_price,major_event,event_month,event_highlights
0,1,False,tue oct,i understand i would like to assist you we wo...,2.0,3.0,sprintcare,cellular telephones services,1.0,1.0,0.0,0.0,0.0,
1,2,True,tue oct,sprintcare and how do you propose we do that,,1.0,non-company,noncompany,0.0,0.0,0.0,0.0,0.0,
2,3,True,tue oct,sprintcare i have sent several private message...,1.0,4.0,non-company,noncompany,0.0,0.0,0.0,0.0,0.0,
3,4,False,tue oct,please send us a private message so that we c...,3.0,5.0,sprintcare,cellular telephones services,1.0,1.0,0.0,0.0,0.0,
4,5,True,tue oct,sprintcare i did,4.0,6.0,non-company,noncompany,0.0,0.0,0.0,0.0,0.0,


*Visualizing the Data*



### 4. Analyzing Data with Supervised NLP



In [168]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk

!python -m spacy download en

[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')
[38;5;2m✔ Linking successful[0m
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


Resource: https://towardsdatascience.com/machine-learning-text-processing-1d5a2d638958/

In [164]:
# Extracting features from text files
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(sentences_df)
X_train_counts.shape

(2, 2)

In [165]:
# TF-IDF
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2, 2)

In [166]:
# Machine Learning
# Training Naive Bayes (NB) classifier on training data.
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, sentences_df.company)

ValueError: ignored

*Trying Another Approach to BoW Using BoW from the Function Defined Below, in the Topic Modeling Section*

In [0]:
sentences = clean_df[['text', 'company']]

In [0]:
sentences.head()

Unnamed: 0,text,company
0,i understand i would like to assist you we wo...,1
1,sprintcare and how do you propose we do that,0
2,sprintcare i have sent several private message...,0
3,please send us a private message so that we c...,1
4,sprintcare i did,0


In [0]:
# Utility function to create a list of the 2000 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(2000)]

In [0]:
# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.

# def bow_features(sentences, common_words):
def bow_features(sentences):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame()
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                for token in sentence]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

In [0]:
# Set up the bags.
twitter_words = bag_of_words(sentences)

AttributeError: ignored

In [0]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split

# bow_corpus

rfc = ensemble.RandomForestClassifier()
Y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence','text_source'], 1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)
train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

### 5. Topic Modeling (Unsupervised NLP)





In [0]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)

In [0]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [0]:
stemmer = SnowballStemmer("english")

In [0]:
# Write a function to perform the pre processing steps on the entire dataset

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize and lemmatize
def preprocess(text):
    result=[]
    for token in gensim.utils.simple_preprocess(text) :
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
            
    return result

In [0]:
# processed_text = twitter_df['text'].sample(n=50000, random_state=1)
# print(processed_text)

In [0]:
processed_docs = []

for doc in twitter_df['text'].sample(n=50000, random_state=1):
    processed_docs.append(preprocess(doc))

In [0]:
print(processed_docs[:2])

[['atviassist'], ['applesupport', 'iphon']]


In [0]:
# Create a dictionary from 'processed_docs' containing the number of times a word appears 
# in the training set using gensim.corpora.Dictionary and call it 'dictionary'

dictionary = gensim.corpora.Dictionary(processed_docs)

In [0]:
# Checking dictionary created

count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 atviassist
1 applesupport
2 iphon
3 americanair
4 dumb
5 know
6 store
7 think
8 pain
9 info
10 send


- Gensim filter_extremes

- filter_extremes(no_below=5, no_above=0.5, keep_n=100000)

- Filter out tokens that appear in:
- less than no_below documents (absolute number) or
- more than no_above documents (fraction of total corpus size, not absolute number).
- after (1) and (2), keep only the first keep_n most frequent tokens (or keep all if None).

In [0]:
# OPTIONAL STEP
# Remove very rare and very common words:
# - words appearing less than 15 times
# - words appearing in more than 10% of all documents

dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n= 100000)

- Gensim doc2bow

- doc2bow(document)

- Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). 
- No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.

In [0]:
# Create the Bag-of-words model for each document i.e for each document we create a dictionary reporting how many
# words and how many times those words appear. Save this to 'bow_corpus'

bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [0]:
# Preview BOW for our sample preprocessed document

document_num = 20
bow_doc_x = bow_corpus[document_num]

for i in range(len(bow_doc_x)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_x[i][0], 
                                                     dictionary[bow_doc_x[i][0]], 
                                                     bow_doc_x[i][1]))

Word 86 ("kind") appears 1 time.
Word 87 ("respons") appears 1 time.
Word 88 ("uber_support") appears 1 time.


We are going for 10 topics in the document corpus.

We will be running LDA using all CPU cores to parallelize and speed up model training.

Some of the parameters we will be tweaking are:

num_topics is the number of requested latent topics to be extracted from the training corpus.
id2word is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
workers is the number of extra processes to use for parallelization. Uses all available cores by default.
alpha and eta are hyperparameters that affect sparsity of the document-topic (theta) and topic-word (lambda) distributions. We will let these be the default values for now(default value is 1/num_topics)

Alpha is the per document topic distribution.

High alpha: Every document has a mixture of all topics(documents appear similar to each other).
Low alpha: Every document has a mixture of very few topics
Eta is the per topic word distribution.

High eta: Each topic has a mixture of most words(topics appear similar to each other).
Low eta: Each topic has a mixture of few words.
passes is the number of training passes through the corpus. For example, if the training corpus has 50,000 documents, chunksize is 10,000, passes is 2, then online training is done in 10 updates:

- 1 documents 0-9,999
- 2 documents 10,000-19,999
- 3 documents 20,000-29,999
- 4 documents 30,000-39,999
- 5 documents 40,000-49,999
- 6 documents 0-9,999
- 7 documents 10,000-19,999
- 8 documents 20,000-29,999
- 9 documents 30,000-39,999
- 10 documents 40,000-49,999

In [0]:
# LDA mono-core -- fallback code in case LdaMulticore throws an error on your machine
# lda_model = gensim.models.LdaModel(bow_corpus, 
#                                    num_topics = 10, 
#                                    id2word = dictionary,                                    
#                                    passes = 50)

# LDA multicore 
'''
Train your lda model using gensim.models.LdaMulticore and save it to 'lda_model'
'''
# TODO
lda_model =  gensim.models.LdaMulticore(bow_corpus, 
                                   num_topics = 8, 
                                   id2word = dictionary,                                    
                                   passes = 10,
                                   workers = 2)

In [0]:
# For each topic, we will explore the words occuring in that topic and its relative weight

for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Topic: 0 
Words: 0.042*"flight" + 0.025*"americanair" + 0.024*"delta" + 0.019*"book" + 0.019*"virgintrain" + 0.017*"southwestair" + 0.016*"british_airway" + 0.015*"hour" + 0.014*"time" + 0.013*"travel"


Topic: 1 
Words: 0.055*"send" + 0.049*"look" + 0.038*"number" + 0.037*"assist" + 0.035*"account" + 0.034*"address" + 0.032*"email" + 0.030*"sorri" + 0.026*"detail" + 0.024*"issu"


Topic: 2 
Words: 0.026*"devic" + 0.023*"know" + 0.020*"work" + 0.018*"version" + 0.017*"start" + 0.015*"gdrqu" + 0.013*"like" + 0.013*"want" + 0.012*"upgrad" + 0.012*"get"


Topic: 3 
Words: 0.026*"servic" + 0.019*"time" + 0.018*"custom" + 0.018*"tesco" + 0.017*"uber_support" + 0.015*"go" + 0.014*"issu" + 0.013*"work" + 0.012*"charg" + 0.012*"phone"


Topic: 4 
Words: 0.032*"store" + 0.023*"card" + 0.018*"account" + 0.017*"xboxsupport" + 0.017*"option" + 0.015*"onlin" + 0.014*"purchas" + 0.012*"check" + 0.012*"tri" + 0.012*"websit"


Topic: 5 
Words: 0.047*"team" + 0.031*"sorri" + 0.027*"know" + 0.026*"hear"

In [0]:
# Take the top five words(ish) for each category. Using expertise, label cateogry in interesting ways, that'll be your topic name.
# If words are too general, add them to your stop words and see what shows up.
# Goal: It'd be nice for the top few topics to be clean and explainable.

# LDA gives 1. topics that are represented. 2. for each of the documents you gave it, it'll tell you how strong it is in each of the topics. Look for the document weightings for each topic.
# 2. Get a matrix that has a column for each topic and row for each document. Might have more than one topic that's relatively high (or there could be one).  
# 2. Then you can look on a per company basis and see what the top topics that companies are tweeting about.

### 6. Sentiment Analysis

In [0]:
from textblob import TextBlob

In [0]:
sample_text_df = clean_df[['text', 'company', 'author_id']].copy()

In [0]:
sample_text_df['company'].unique()

array([1, 0])

In [0]:
sample_text_df = sample_text_df.sample(n=50000, random_state=1)

In [0]:
def find_pol(review):
    return TextBlob(review).sentiment.polarity

sample_text_df['polarity'] = sample_text_df['text'].apply(find_pol)

In [0]:
def find_pol(subjectivity):
    return TextBlob(subjectivity).sentiment.subjectivity

sample_text_df['subjectivity'] = sample_text_df['text'].apply(find_pol)

In [0]:
# Don't re-run LDA - need to aggregate documents according to the story you want to tell.
# Could talk about what's happening on a company level, overall company's topics vs. non-company topics. Groupby() can be useful here, with an aggregation function like describe(), avg() or mean().

In [0]:
sample_text_df['company'].unique()

array([0, 1])

In [0]:
sample_text_df

In [0]:
sample_text_df.head()

Unnamed: 0,text,company,author_id,polarity,subjectivity
2360834,atviassist pls fix hqs,0,719820,0.0,0.0
375495,applesupport iphone with ios,0,216966,0.0,0.0
177509,either im dumb bc i dont know how to use the a...,0,165872,-0.375,0.5
1497308,its such a pain mine is doing it too,0,502403,0.0,0.5
2179157,sorry we got it wrong could you send us a dm ...,0,KFC_UKI_Help,-0.166667,0.8


In [0]:
# Two-dimensional scatterplot to show where each company fits on the plot.
# What's their average on those two dimensions.
# Some will be mildly positive, or the other end and be negative.
# Look at example tweets that are at the high and low ranges to generalize what the scores mean.

In [0]:
sample_text_df['company'] = sample_text_df['company'] != 0

In [0]:
sample_text_df['company'].unique()

array([ True])

In [0]:
sample_text_df.head()

Unnamed: 0,text,company,author_id,polarity,subjectivity
2360834,atviassist pls fix hqs,True,719820,0.0,0.0
375495,applesupport iphone with ios,True,216966,0.0,0.0
177509,either im dumb bc i dont know how to use the a...,True,165872,-0.375,0.5
1497308,its such a pain mine is doing it too,True,502403,0.0,0.5
2179157,sorry we got it wrong could you send us a dm ...,True,KFC_UKI_Help,-0.166667,0.8


### 6. Key Takeaways

### 7. Next Steps



- Use clustering to analyze patterns in the companies.
- Look at things by industry.

### 8. Resources

This is the list of resources that were used for this analysis:

1. [Google News](https://news.google.com/?hl=en-US&gl=US&ceid=US:en).
2. [Salesgenie.com](https://www.salesgenie.com/).
3. [Steinmetz, Katy](https://time.com/4894182/twitter-company-complaints/). Time - Tech. "Does Tweeting at Companies Really Work?"
4. [US Chamber of Commerce](https://www.uschamber.com/about/about-the-us-chamber/frequently-asked-questions#3). "Frequently Asked Questions."
5. [Yahoo! Finance](https://finance.yahoo.com/).

- LDA resource: https://github.com/priya-dwivedi/Deep-Learning/blob/master/topic_modeling/LDA_Newsgroup.ipynb
- Additional LDA resource: https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925

- Kaggle dataset: https://www.kaggle.com/thoughtvector/customer-support-on-twitter/data

- Sentiment analysis: https://github.com/adashofdata/nlp-in-python-tutorial/blob/master/3-Sentiment-Analysis.ipynb

- Sentiment Polarity: https://stackabuse.com/python-for-nlp-introduction-to-the-textblob-library/
- Sentiment Analysis from YouTube series: https://github.com/adashofdata/nlp-in-python-tutorial/blob/master/3-Sentiment-Analysis.ipynb

- Uploading content to Google Colab: https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92

- Machine Learning - Text Processing: https://towardsdatascience.com/machine-learning-text-processing-1d5a2d638958

- Text Classification: The First Step Towards Machine Learning Mastery: https://medium.com/data-from-the-trenches/text-classification-the-first-step-toward-nlp-mastery-f5f95d525d73

# 9. Appendix



For your reference, here's the list of non-US companies that were in the dataframe and which industry they were assigned to:

1. Aldi UK, Grocers - Retail.
2. Argos, Retail.
3. Dollar Shave Club, Retail.
4. Glo, Cellular Telephones (Services).
5. Great Western Railway, Commuter & Passenger Rail Service.
6. Greggs, Restaurants.
7. InterContinental Hotel Group, Hotel & Motel Management.
8. London Midland, Commuter & Passenger Rail Service.
9. Marks and Spencer, Grocers - Retail.
10. Morrisons, Grocers - Retail.
11. National Rail, Commuter & Passenger Rail Service.
12. OPPO Care India, Cellular Telephones (Services).
13. Pearson, Education.
14. PlayStation, Video Games - Manufacturer.
15. Royal Bank of Canada, Banks.
16. Sainsburys, Grocers - Retail.
17. Size?, Retail.
18. SoundCloud, Radio Stations & Broadcasting Companies.
19. Spotify, Radio Stations & Broadcasting Companies.
20. Tesco, Grocers - Retail.
21. Tigo Ghana, Cellular Telephones (Services).
22. Transport for London, Commuter & Passenger Rail Service.
23. Virgin America, Airline Companies.
24. Virgin Atlantic, Airline Companies.
25. Virgin Mobile USA, Cellular Telephones (Services).
26. Virgin Money, Banks.
27. Virgin Trains, Commuter & Passenger Rail Service.

