# Session 3: Text analysis approaches

\#\#\# __DRAFT__ \#\#\#

Text analysis is a classic problem in computing and data science.

Compared with regression and classification on continuous and categorical datasets, deriving distinct insights from text data is a far more complicated task. Text data, and especially free text (text fields in sentence form), is typically classed as unstructured data because of the nuances that natural language introduces.

Ever-increasing computational power has been accompanied by steady improvements in approaches to text analysis.

Topic modelling, which uses statistical models to identify abstract 'topics' within a collection of documents (a corpus), was first described in 1998. Probabilistic latent semantic analysis (PLSA) was outlined in 1999, and latent Dirichlet allocation (LDA) was published in the Journal of Machine Learning Research in [2003](http://jmlr.csail.mit.edu/papers/v3/blei03a.html). LDA has since become one of the most commonly used topic modelling approaches, although many extensions of it have been proposed.
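
Topic models such as LDA treat each document as a bag of words: word order is discarded and only per-document word counts are kept. The sketch below builds that representation by hand for a tiny corpus. The three example sentences and the whitespace tokeniser are assumptions made purely for illustration; they are not part of the dataset used in this session, and this is not a topic model in itself, only the input one would work from.

```python
from collections import Counter

# Hypothetical mini-corpus, invented for illustration only.
docs = [
    "my flight was delayed again",
    "the flight crew was great",
    "my internet is down again",
]

# Naive tokenisation: lowercase and split on whitespace.
# Real tweets would need more careful cleaning (mentions, URLs, emoji).
tokenised = [doc.lower().split() for doc in docs]

# Vocabulary: every distinct word, sorted so column order is reproducible.
vocab = sorted({word for doc in tokenised for word in doc})

# Document-term matrix: one row per document, one count per vocabulary word.
counts = [Counter(doc) for doc in tokenised]
doc_term = [[c[word] for word in vocab] for c in counts]

for doc, row in zip(docs, doc_term):
    print(row, doc)
```

Documents that share vocabulary (here, the two that mention "flight") end up with overlapping non-zero columns, and a topic model exploits exactly this kind of co-occurrence pattern.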

In [1]:
import numpy as np
import pandas as pd
import gensim
import matplotlib.pyplot as plt

In [2]:
# import the dataset

twitter_data = pd.read_csv('data/twcs/twcs.csv')

twitter_data.head()

|   | tweet_id | author_id | inbound | created_at | text | response_tweet_id | in_response_to_tweet_id |
|---|---|---|---|---|---|---|---|
| 0 | 1 | sprintcare | False | Tue Oct 31 22:10:47 +0000 2017 | @115712 I understand. I would like to assist y... | 2.0 | 3.0 |
| 1 | 2 | 115712 | True | Tue Oct 31 22:11:45 +0000 2017 | @sprintcare and how do you propose we do that |  | 1.0 |
| 2 | 3 | 115712 | True | Tue Oct 31 22:08:27 +0000 2017 | @sprintcare I have sent several private messag... | 1.0 | 4.0 |
| 3 | 4 | sprintcare | False | Tue Oct 31 21:54:49 +0000 2017 | @115712 Please send us a Private Message so th... | 3.0 | 5.0 |
| 4 | 5 | 115712 | True | Tue Oct 31 21:49:35 +0000 2017 | @sprintcare I did. | 4.0 | 6.0 |


In [3]:
# keep only inbound tweets, i.e. those sent by customers to a support account
inbound_dat = twitter_data[twitter_data['inbound']]

inbound_dat.head()

|   | tweet_id | author_id | inbound | created_at | text | response_tweet_id | in_response_to_tweet_id |
|---|---|---|---|---|---|---|---|
| 1 | 2 | 115712 | True | Tue Oct 31 22:11:45 +0000 2017 | @sprintcare and how do you propose we do that |  | 1.0 |
| 2 | 3 | 115712 | True | Tue Oct 31 22:08:27 +0000 2017 | @sprintcare I have sent several private messag... | 1.0 | 4.0 |
| 4 | 5 | 115712 | True | Tue Oct 31 21:49:35 +0000 2017 | @sprintcare I did. | 4.0 | 6.0 |
| 6 | 8 | 115712 | True | Tue Oct 31 21:45:10 +0000 2017 | @sprintcare is the worst customer service | 9610.0 |  |
| 8 | 12 | 115713 | True | Tue Oct 31 22:04:47 +0000 2017 | @sprintcare You gonna magically change your co... | 111314.0 | 15.0 |


In [4]:
# number of distinct authors among the inbound tweets
inbound_dat.author_id.unique().shape

(702669,)

In [5]:
# combine each author's tweets into a single comma-separated string
user_tweets = inbound_dat.groupby('author_id')['text'].apply(','.join)
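
The grouping step above joins each author's tweets with a comma. Because tweets often contain commas themselves, the boundaries between tweets cannot be recovered afterwards. The sketch below shows the same aggregation on a toy frame next to an alternative that keeps each tweet as a separate list element. The `toy` DataFrame and its rows are invented for illustration, and which form is preferable depends on how the text is processed later.

```python
import pandas as pd

# Toy stand-in for inbound_dat; the rows are invented for illustration.
toy = pd.DataFrame({
    "author_id": ["a1", "a2", "a1"],
    "text": ["Hello, world", "My bag is lost", "Still waiting"],
})

# Same aggregation as above: one comma-joined string per author.
joined = toy.groupby("author_id")["text"].agg(",".join)

# Alternative: keep each tweet as its own list element, so commas
# inside a tweet cannot be confused with the separator.
as_lists = toy.groupby("author_id")["text"].agg(list)

print(joined["a1"])
print(as_lists["a1"])
```

In the joined string the comma inside "Hello, world" is indistinguishable from the separator, whereas the list form keeps the two tweets apart.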

In [6]:
user_tweets = user_tweets.reset_index()

user_tweets.text[:10].tolist()

['Screw you @116016 and your stupid Blueprint program. I never signed up for this crap and now you’re going to charge me interest fees? https://t.co/WwBzUIhSbG,@ChaseSupport Actually it just doesn’t work in Safari, but that’s still pretty bad.,Dear @ChaseSupport, it’s kinda hard to pay my bills when the entire payment section of your site is unavailable 🤦🏻\u200d♀️',
 "Now the flight @Delta is sending our bag back on just got delayed two hours. So mad right now, I can't even.",
 '@MOO Big thanks to Quentin for the exceptional service! Just ordered our 3rd round of #businesscards 👍,The ribotRainbow! New #businesscards thanks to @moo 😊#rainbow #ourteamrocks https://t.co/nqMMUqYzKt https://t.co/gVtJDEoGFu',
 'Yup https://t.co/GpkFa9MfHQ,same. https://t.co/gxkJt8BNV6',
 '@comcastcares Is it possible to get business class internet at a residence, and if so are there any restrictions/limitations?',
 '@Delta I just sent you a DM,@Delta I will never fly your airline again',
 'Wow. Used to think