## Analysis, Step by Step

In [1]:
# Modules:
import pandas as pd
import numpy as np
# Load data:
if 'df_main' not in globals():
    # This cell mainly displays data; the guard prevents re-loading the CSV on re-run:
    df_main = pd.read_csv('data.csv', header=None)
    df_main.columns = ['text']

# Form of data:
print(f"Shape of data: {df_main.shape}")

# Check for missing values:
print(f"Data has nans: {df_main.isnull().values.any()}")

# Display five consecutive rows starting at a random index:
idx = np.random.randint(len(df_main))
df_main[idx:].head()

Shape of data: (2811774, 1)
Data has nans: False


Unnamed: 0,text
501467,@252761 Hey there! Sorry about that. We’d reco...
501468,@SpotifyCares @159550 Can you help us? our EP ...
501469,@252762 Hey Peter! Can you DM us the link so w...
501470,@SpotifyCares https://t.co/ZLrT5waWaC
501471,"@252762 Thanks, we'll get this reported. If yo..."


#### First Impressions

The data consists of a collection of tweets, mostly customer support exchanges, sent by users or companies in dialogue form. The data is largely unstructured. One useful feature is that a tweet can be explicitly directed at a company via "@company" or at an anonymized user via "@12345". This is the most useful structural information found in the data so far, although a help request that addresses a specific employee is anonymized in the same way.
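
The `process_tags` helpers used in the next cell are not shown in this notebook. As a rough illustration of how such mention detection could work, here is a minimal sketch. It assumes that anonymized users always appear as purely numeric handles and that company handles start with a letter or underscore; the function names `find_user_mentions` and `find_company_mentions` are hypothetical and are not the helpers from `process_tags`.

```python
import re
from typing import List

# Assumption: anonymized users are purely numeric handles (e.g. "@12345"),
# while company accounts keep a handle starting with a letter or underscore.
# The lookbehind skips "@" inside e-mail addresses such as "a@b.com".
USER_MENTION = re.compile(r"(?<!\w)@(\d+)\b")
COMPANY_MENTION = re.compile(r"(?<!\w)@([A-Za-z_]\w*)")


def find_user_mentions(text: str) -> List[str]:
    """Return the numeric handles mentioned in a tweet (hypothetical helper)."""
    return USER_MENTION.findall(text)


def find_company_mentions(text: str) -> List[str]:
    """Return the non-numeric handles mentioned in a tweet (hypothetical helper)."""
    return COMPANY_MENTION.findall(text)


tweet = ".@VerizonSupport @115725 @115726 When can we expect a fix ?"
print(find_user_mentions(tweet))     # ['115725', '115726']
print(find_company_mentions(tweet))  # ['VerizonSupport']
```

Counting mentions is then just `len(...)` on either list. Real company handles that begin with a digit would be missed under this assumption.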

In [2]:
from process_tags import *

print("--Can a tweet be directed at both a user and a company?")
for i in df_main['text']:
    if has_user_name(i) and has_company_name(i):
        print('yes ->' + i)
        break

print("--Can a tweet be directed at multiple companies?")
for i in df_main['text']:
    if count_company_names(i) > 1:
        print('yes ->' + i)
        break

print("--Can a tweet be directed at multiple users?")
for i in df_main['text']:
    if count_user_names(i) > 1:
        print('yes ->' + i)
        break

--Can a tweet be directed at both a user and a company?
yes ->@ChipotleTweets @28 
I don't fit in my Veggie Burrito costume #Halloween https://t.co/7tJDVpzLWn
--Can a tweet be directed at multiple companies?
yes ->@MicrosoftHelps @XboxSupport Brilliant, thank you!
--Can a tweet be directed at multiple users?
yes ->.@VerizonSupport @115725 @115726                                                 &gt;All of VERIZON IS DOWN&lt;
When can we expect a fix ?


#### Format of Data
Intermediate conclusions are somewhat pessimistic. There seems to be no easy, foolproof way to tell whether a tweet comes from a user or a company. An additional difficulty is that some tweets form part of a conversation, and there is no easy way to determine which tweet initiated the conversation or when the conversation ended.

#### Types of Tweets
Most tweets are help requests, though some are simply sent to show support for a company.

#### Intermediate Conclusion
The most promising way to continue the analysis is to apply unsupervised learning, in the hope that it reveals structure that is easier to process.

## The Bigger Picture - How Can Value Be Extracted?
Now that we have some idea of the type of data we are dealing with, the question becomes: "How can value be extracted?" I think the most useful things for a company to know are:
1. When have I received a request for support?
2. When has the customer support been completed?
3. Can these things be derived from the contents of a tweet?

#### Topic Discovery
The assignment asks us to discover 10 topics. In my view, the next logical step is to use unsupervised learning.

In [3]:
from textprocessing import *
import os
import pickle
jn      = os.path.join
is_f    = os.path.isfile

# Create folder structure - if it does not exist:
prepare_setup()

# Prepare data - or load:
path = jn('artifacts','data','df_unsupervised.pkl')
if not os.path.isfile(path):
    df_unsupervised = df_main.sample(frac = .01)
    df_unsupervised['text'] = df_unsupervised['text'].apply(process_sentence)
    df_unsupervised.to_pickle(path)
else:
    df_unsupervised = pd.read_pickle(path)

# Create latent dirichlet model - or load:
if is_f(jn('artifacts', 'models', 'lda.pkl')) and is_f(jn('artifacts', 'models', 'countvec.pkl')):
    # Load:
    with open(jn('artifacts', 'models', 'countvec.pkl'), 'rb') as fs:
        countvec = pickle.load(fs)
    with open(jn('artifacts', 'models', 'lda.pkl'), 'rb') as fs:
        lda = pickle.load(fs)
else:
    countvec, lda = train_lda_model(df_unsupervised['text'])
    # Save each object to its own file:
    with open(jn('artifacts', 'models', 'lda.pkl'), 'wb') as fs:
        pickle.dump(lda, fs)
    with open(jn('artifacts', 'models', 'countvec.pkl'), 'wb') as fs:
        pickle.dump(countvec, fs)

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /home/vscode/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /home/vscode/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /home/vscode/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /home/vscode/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /home/vscode/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_

In [39]:
import matplotlib.pyplot as plt

# Example usage: topic distribution for a single processed sentence.
#lda.transform(countvec.transform([process_sentence("delayed flight")]))

# Rank the whole vocabulary by weight, for the first topic only:
components = lda.components_
words = countvec.get_feature_names_out()
for i in range(1):
    component = components[i, :]
    # argsort on the negated weights yields descending order:
    top_words = words[np.argsort(-component)]
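
The cell above ranks the entire vocabulary for a single topic. To compare topics side by side, it can help to keep only the few highest-weighted words of every topic. The sketch below shows one way to do that with NumPy. The helper name `top_words_per_topic` and the toy weights are hypothetical; in the notebook the inputs would be `lda.components_` and `countvec.get_feature_names_out()`.

```python
import numpy as np


def top_words_per_topic(components, vocabulary, n_top=10):
    """Return the n_top highest-weighted words for each row of components."""
    vocabulary = np.asarray(vocabulary)
    # Sort ascending, reverse each row, then keep the first n_top column indices.
    order = np.argsort(components, axis=1)[:, ::-1][:, :n_top]
    return [vocabulary[row].tolist() for row in order]


# Toy weights standing in for lda.components_ (2 topics, 4 words):
toy_components = np.array([[0.1, 3.0, 0.5, 2.0],
                           [4.0, 0.2, 1.0, 0.3]])
toy_words = ["app", "flight", "refund", "delay"]

for topic_id, top in enumerate(top_words_per_topic(toy_components, toy_words, n_top=2)):
    print(topic_id, top)
# 0 ['flight', 'delay']
# 1 ['app', 'refund']
```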

In [40]:
top_words

array(['issue', 'like', 'thanks', 'would', 'try', 'feedback', 'us', 'day',
       'device', 'share', 'always', 'update', 'resolved', 'problem',
       'hope', 'better', 'experience', 'review', 'resolve', 'got', 'know',
       'able', 'network', 'end', 'appreciate', 'reply', 'latest', 'kind',
       'moment', 'goes', 'see', 'please', 'great', 'sort', 'additional',
       'make', 'assist', 'sure', 'provide', 'want', 'location',
       'response', 'article', 'kindly', 'console', 'trip', 'love',
       'power', 'record', 'understand', 'looking', 'version', 'start',
       'leave', 'system', 'reaching', 'improve', 'next', 'attention',
       'glad', 'new', 'different', 'browser', 'affected', 'something',
       'use', 'link', 'apologize', 'evening', 'mac', 'aware', 'help',
       'read', 'log', 'forward', 'run', 'button', 'slow', 'recent',
       'locator', 'test', 'section', 'good', 'situation', 'dont',
       'either', 'concern', 'chance', 'concerning', 'morning',
       'following', 'soo