<a href="https://colab.research.google.com/github/LeandroMAcosta/Twitter-s-customer-service-threads-analysis/blob/main/Twitter's_customer_service_threads_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Goal: detect segments of social media conversations (Twitter or Reddit) in which the turns form an argumentative conversational structure, that is, each turn replies to another or elaborates on the topic raised, rather than being disconnected.

- Find a dataset.
- Think about which features could indicate coherence: similarity between turns (they discuss the same topic), use of connectives, etc.
- Apply a self-learning approach to discover new features that are useful for identifying coherence.
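
The candidate features listed above (topical similarity between turns and use of connectives) could be prototyped roughly as follows. This is only a sketch under stated assumptions: the `CONNECTIVES` list, the `tokenize`, `jaccard` and `coherence_features` helpers, and the two example strings are all illustrative and do not come from the notebook.

```python
import re

# Assumption: a tiny illustrative list of English discourse connectives.
CONNECTIVES = {"but", "because", "so", "however", "therefore", "also", "then"}


def tokenize(text):
    """Lowercase a tweet and return its word tokens, skipping @mentions."""
    text = re.sub(r"@\w+", " ", text.lower())
    return re.findall(r"[a-z']+", text)


def jaccard(tokens_a, tokens_b):
    """Jaccard overlap between the vocabularies of two turns (0.0 to 1.0)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def coherence_features(previous_turn, reply):
    """Return two toy features for a (previous turn, reply) pair."""
    prev_tokens, reply_tokens = tokenize(previous_turn), tokenize(reply)
    return {
        "lexical_overlap": jaccard(prev_tokens, reply_tokens),
        "connective_count": sum(tok in CONNECTIVES for tok in reply_tokens),
    }


# Made-up example turns, loosely modeled on customer-support tweets.
features = coherence_features(
    "@sprintcare I have sent several private messages",
    "@115712 Please send us a private message so we can help",
)
print(features)
```

Exact token overlap is a crude proxy for "talking about the same topic" (here "messages" and "message" do not match); lemmatization or embedding similarity would probably be a better fit, but that choice is left open.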

Dataset:
https://www.kaggle.com/thoughtvector/customer-support-on-twitter

In [35]:
!pip install -U spacy==3.1.2
!pip install umap-learn
!python -m spacy download es_core_news_md

Collecting es-core-news-md==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_md-3.1.0/es_core_news_md-3.1.0-py3-none-any.whl (42.7 MB)
     |████████████████████████████████| 42.7 MB 38 kB/s
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_md')


In [36]:
import json
import pandas as pd
import numpy as np
from google.colab import drive
from tqdm.notebook import tqdm

from collections import defaultdict

drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [37]:
DATASET_PATH = "/content/drive/MyDrive/facu/textmining/"
DATASET_NAME = "twcs.csv"
MAX_ROWS = 1000000
THRESHOLD_THREADS = 10

df = pd.read_csv(DATASET_PATH + DATASET_NAME, nrows=MAX_ROWS)   
df

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,1,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2,3.0
1,2,115712,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0
2,3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1,4.0
3,4,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3,5.0
4,5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4,6.0
...,...,...,...,...,...,...,...
999995,1108226,AmazonHelp,False,Wed Nov 08 16:47:00 +0000 2017,"@381550 If the issue continues, kindly contact...",,1108224.0
999996,1108227,AmazonHelp,False,Wed Nov 08 16:47:00 +0000 2017,"@381550 Seems like a bug, I kindly request you...",11082281108229,1108224.0
999997,1108228,381550,True,Wed Nov 08 16:59:59 +0000 2017,@AmazonHelp are u kidding with me same thing i...,1108230,1108227.0
999998,1108230,AmazonHelp,False,Wed Nov 08 17:24:00 +0000 2017,@381550 That's quite a comment. We'd like to h...,,1108228.0


In [38]:
# Preprocess csv
df.set_index("tweet_id", inplace = True)
df.in_response_to_tweet_id = df.in_response_to_tweet_id.astype("Int64")
df

tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
1,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2,3
2,115712,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1
3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1,4
4,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3,5
5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4,6
...,...,...,...,...,...,...
1108226,AmazonHelp,False,Wed Nov 08 16:47:00 +0000 2017,"@381550 If the issue continues, kindly contact...",,1108224
1108227,AmazonHelp,False,Wed Nov 08 16:47:00 +0000 2017,"@381550 Seems like a bug, I kindly request you...",11082281108229,1108224
1108228,381550,True,Wed Nov 08 16:59:59 +0000 2017,@AmazonHelp are u kidding with me same thing i...,1108230,1108227
1108230,AmazonHelp,False,Wed Nov 08 17:24:00 +0000 2017,@381550 That's quite a comment. We'd like to h...,,1108228


In [39]:
tweets_ids = set(df.index)
roots = set(df[df["in_response_to_tweet_id"].isnull()].index)
leafs = set(df[df["response_tweet_id"].isnull()].index)

In [40]:
print(len(leafs), len(roots), len(tweets_ids))

347194 267691 1000000


In [41]:
def dfs_tweets(tweet_id, threads, thread):
    """Walk from a leaf up to its root, collecting the tweet ids on the way."""
    if tweet_id not in tweets_ids:
        # If any tweet between the leaf and the root is missing from the
        # DataFrame, discard the whole thread
        return

    thread.append(tweet_id)
    if tweet_id in roots:
        # Root reached: store the thread in root-to-leaf order
        threads.append(thread[::-1])
    else:
        parent = df.loc[tweet_id, "in_response_to_tweet_id"]
        dfs_tweets(parent, threads, thread)
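
As an aside, the same leaf-to-root walk can be written as a loop over a plain dictionary, which avoids recursion and the per-step `df.loc` lookup. The sketch below is not the notebook's implementation: `build_thread`, `parent_of` and the toy ids are hypothetical, and leaves are derived here from the parent links rather than from `response_tweet_id`.

```python
def build_thread(leaf_id, parent_of):
    """Follow parent links from a leaf to its root, without recursion.

    `parent_of` maps tweet id -> parent id, with None for root tweets.
    Returns the thread in root-to-leaf order, or None if a tweet is missing.
    """
    thread = []
    current = leaf_id
    while current is not None:
        if current not in parent_of:
            return None  # broken chain: some tweet is not in the data
        thread.append(current)
        current = parent_of[current]
    return thread[::-1]


# Hypothetical toy data: 4 replies to 5, 5 replies to root 6;
# 9 replies to tweet 8, which is missing from the data.
parent_of = {6: None, 5: 6, 4: 5, 9: 8}

# A leaf is a tweet that no other tweet replies to.
has_reply = {pid for pid in parent_of.values() if pid is not None}
leaves = [tid for tid in parent_of if tid not in has_reply]

toy_threads = [t for t in (build_thread(leaf, parent_of) for leaf in leaves) if t]
print(toy_threads)
```

On the real data, a mapping like `parent_of` could presumably be built once from the `in_response_to_tweet_id` column, with missing values converted to `None`.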


In [42]:
# Build Threads

threads = []
for leaf in tqdm(leafs):
    dfs_tweets(leaf, threads, [])



In [53]:
# Show the first 10 threads
for thread in threads[:10]:
    for tweet_id in thread:
        print("(@" + df.loc[tweet_id]["author_id"] + ")", df.loc[tweet_id]["text"])
    print("---------------------------------")

(@367816) It's a shock if 24 hrs later @safaricom_care, cannot be able to top up a scratch card for me even after giving them serial number 3times.
(@Safaricom_Care) @367816 Hello, share the voucher serial number, amount and your mobile number via DM we check and advise. ^WN
(@367816) @Safaricom_Care Have asked what's the lead time of topping up a scratch card thru' availing serial number and your team ain't responding @119433!
(@Safaricom_Care) @367816 @119433 Hi, we are unable to top up due to a slight system issue, we will get back to you soonest. Apologies. ^KD
(@367816) @Safaricom_Care @119433 An issue raised yesterday @ 8am and hasn't been resolved over 24hrs later you underplay that as a "slight system issue" ? ? Jokers!
(@Safaricom_Care) @367816 @119433 As per our conversation, we are dealing with your issue on ticket number 1-D8B011A. We will get back to you. ^KD
(@367816) @Safaricom_Care @119433 Let me know the outcome of the ticket raised last Friday!
(@Safaricom_Care) @3678

In [45]:
# Keep only threads with at least `THRESHOLD_THREADS` tweets
print(len(threads))
threads = list(filter(lambda thread: len(thread) >= THRESHOLD_THREADS, threads))
print(len(threads))

345271
12161
