# Capstone Project: Topic Modeling on Amazon Customer Support

## Problem Statement 

Amazon is one of the world's biggest e-commerce platforms. Apart from its own platform, Amazon offers customer support through several other channels, and Twitter is one of them: Amazon runs a customer support account there called @AmazonHelp. Unsurprisingly, the volume of customer requests is huge. In 2017 there were nearly 100K tweets to AmazonHelp. With such a high volume, maintaining the quality of customer support is challenging.

This project aims to improve the allocation of the customer support workforce by identifying customer intent. Customer requests will be categorized into topics, and the customer support team can then be restructured into sub-teams based on these topics. Ultimately, the goal is to triage and route customer support requests to the appropriate sub-team in order to provide more efficient support.

## Executive Summary

No business can survive without customers. This is reflected in the mission of Amazon, one of the world's biggest e-commerce platforms, "to be Earth's most customer-centric company", and the award-winning Customer Service team is an essential part of this mission. This is why maintaining the quality of customer support is crucial to Amazon.

### Why Twitter?
Firstly, Twitter can be a big showcase for Amazon customer support. It is a globally well-known social media platform with more than 321 million monthly active users as of 2018. Unlike on Amazon's own customer support platform, everyone can see other people's complaints and requests on AmazonHelp (Amazon's customer support on Twitter), because Twitter's primary purpose is to connect people and let them share their thoughts with a big audience. Providing good, efficient customer support there also contributes to a positive brand image.

Secondly, Twitter aims to create highly skimmable content for our tech-heavy, attention-deficit modern world, so tweets are short (historically capped at 140 characters). These short conversations are similar to live chat. Users contact customer support to have a specific problem solved, and the range of problems discussed is relatively small, especially compared to unconstrained conversational datasets like the Reddit corpus. Understanding tweet patterns helps us understand how users communicate nowadays. This project treats Twitter as a pilot: we can start with Twitter and then extend to other customer support channels such as Facebook and email.

### Key findings

1. Non-English tweets (for example Japanese, Spanish, and German) are present, because AmazonHelp supports multiple languages.
2. Spam tweets exist: a single user contributed 417 tweets, spread across 2015-2017.
3. 99% of the tweets were sent in 2017.
4. October, November, and December have the most tweets, so these three months should be prioritized.
5. The response time of AmazonHelp is around 13 minutes, which can serve as the baseline for an SLA (Service Level Agreement). When tweet volume is high in October, November, and December, the SLA target should be shorter than 13 minutes.
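
A response-time figure like the one in finding 5 can be estimated from paired request and reply timestamps. The sketch below is a minimal illustration, assuming the merged column names `created_at_x` (customer tweet) and `created_at_y` (AmazonHelp reply) that appear later in this notebook. The three rows are copied from the preview output further down, the helper name `response_minutes` is hypothetical, and the median is used here only as one plausible summary statistic.

```python
import pandas as pd

# Timestamp layout used by the dataset, e.g. "Tue Oct 31 22:16:32 +0000 2017"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def response_minutes(pairs: pd.DataFrame) -> pd.Series:
    """Minutes between each customer tweet and the company reply."""
    asked = pd.to_datetime(pairs["created_at_x"], format=TWITTER_TIME_FORMAT)
    answered = pd.to_datetime(pairs["created_at_y"], format=TWITTER_TIME_FORMAT)
    return (answered - asked).dt.total_seconds() / 60


pairs = pd.DataFrame({
    "created_at_x": [
        "Wed Nov 22 09:14:39 +0000 2017",
        "Wed Nov 22 08:55:35 +0000 2017",
        "Tue Oct 31 22:16:32 +0000 2017",
    ],
    "created_at_y": [
        "Wed Nov 22 09:23:01 +0000 2017",
        "Wed Nov 22 09:06:00 +0000 2017",
        "Tue Oct 31 22:29:00 +0000 2017",
    ],
})

minutes = response_minutes(pairs)
median_minutes = minutes.median()

# Number of requests per calendar month, useful for spotting the peak season
requests_per_month = (
    pd.to_datetime(pairs["created_at_x"], format=TWITTER_TIME_FORMAT)
    .dt.month.value_counts()
    .sort_index()
)
```

On the full data the same two steps would give a baseline response time and the monthly volume that findings 4 and 5 refer to.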

### Dataset
The dataset `Customer Support on Twitter` is from Kaggle (https://www.kaggle.com/thoughtvector/customer-support-on-twitter). It offers a large corpus of modern (mostly English) conversations between consumers and customer support agents on Twitter.

The dataset contains tweets referring to 108 different companies. AmazonHelp has the highest number of tweets. In this project, I focus only on @AmazonHelp's tweets.

### Conclusion and Recommendation 
LDA identified 3 main topics, which are inferred to be:

Topic 1: Carrier - especially USPS (the United States Postal Service)

Topic 2: Account issues - email follow-up

Topic 3: Delivery - with Prime accounts

### Recommendation
Restructure the customer support team into sub-teams based on these 3 topics.
For example, one sub-team can deal with carrier issues and especially keep an eye on USPS. The second team will be responsible for incoming emails. The third team will deal with Prime accounts. Because the peak season is October, November, and December, staffing levels and the SLA need particular attention in those months.




In [1]:
import re
import pandas as pd
import numpy as np
import string
from langdetect import detect

pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

from termcolor import colored

# Import the dataset

In [47]:
all_twitter = pd.read_csv('./datasets/twcs.csv')

# First glance of the dataset

In [12]:
# Define a helper function for basic EDA
from IPython.display import display

def data_explore(df):
    # First five rows
    print("First five rows of data:")
    display(df.head())
    print()
    # Print shape of dataframe
    print(colored(f"Shape: {df.shape}",'blue',attrs=['bold']))
    print()
    # Print datatypes
    print(colored("Columns & Datatypes: ",'blue',attrs=['bold']))
    df.info()
    print()
    # Check for null values
    print(colored("Null values:",'blue',attrs=['bold']))
    if not df.isnull().values.any():
        print("None in Dataframe.")
    else:
        for col in df:
            print(f"{col}:{df[col].isnull().sum()}")
    print()
    # Count of distinct values (NaN counts as one distinct value)
    print(colored("Unique values (by Columns)",'blue',attrs=['bold']))
    for col in df:
        print(f"{col}:{df[col].nunique(dropna=False)}")
  

In [13]:
data_explore(all_twitter)

First five rows of data:


Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,1,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2.0,3.0
1,2,115712,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0
2,3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1.0,4.0
3,4,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3.0,5.0
4,5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4.0,6.0



[1m[34mShape: (2811774, 7)[0m

[1m[34mColumns & Datatypes: [0m
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2811774 entries, 0 to 2811773
Data columns (total 7 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   tweet_id                 int64  
 1   author_id                object 
 2   inbound                  bool   
 3   created_at               object 
 4   text                     object 
 5   response_tweet_id        object 
 6   in_response_to_tweet_id  float64
dtypes: bool(1), float64(1), int64(1), object(4)
memory usage: 131.4+ MB

[1m[34mNull values:[0m
tweet_id:0
author_id:0
inbound:0
created_at:0
text:0
response_tweet_id:1040629
in_response_to_tweet_id:794335

[1m[34mUnique values (by Columns)[0m
tweet_id:2811774
author_id:702777
inbound:2
created_at:2061666
text:2782618
response_tweet_id:1771146
in_response_to_tweet_id:1774823



# Examine dataset structure - Inbound and Response Tweets

In this section, I reshape the data into a form that is useful for further exploration: the first messages from consumers to companies, paired with the companies' responses.
The starter code below is from the creator of the Customer Support on Twitter dataset,
retrieved from https://www.kaggle.com/soaxelbrooke/first-inbound-and-response-tweets

In [11]:
# Pick only inbound tweets that aren't in reply to anything...
first_inbound = all_twitter[pd.isnull(all_twitter.in_response_to_tweet_id) & all_twitter.inbound]
print('Found {} first inbound messages.'.format(len(first_inbound)))

# Merge in all tweets in response
inbounds_and_outbounds = pd.merge(first_inbound, all_twitter, left_on='tweet_id', 
                                  right_on='in_response_to_tweet_id')
print("Found {} responses.".format(len(inbounds_and_outbounds)))


# Filter out cases where the reply tweet isn't from a company (keep rows where inbound_y is False)
inbounds_and_outbounds = inbounds_and_outbounds[~inbounds_and_outbounds.inbound_y]

# print the finding
print("Found {} responses from companies.".format(len(inbounds_and_outbounds)))
print("Tweets Preview:")
print(inbounds_and_outbounds)


Found 787346 first inbound messages.
Found 875292 responses.
Found 794299 responses from companies.
Tweets Preview:
        tweet_id_x author_id_x  inbound_x                    created_at_x  \
0                8      115712       True  Tue Oct 31 21:45:10 +0000 2017   
1                8      115712       True  Tue Oct 31 21:45:10 +0000 2017   
2                8      115712       True  Tue Oct 31 21:45:10 +0000 2017   
3               18      115713       True  Tue Oct 31 19:56:01 +0000 2017   
4               20      115715       True  Tue Oct 31 22:03:34 +0000 2017   
...            ...         ...        ...                             ...   
875287     2987942      823867       True  Wed Nov 22 07:30:39 +0000 2017   
875288     2987944      823868       True  Wed Nov 22 07:43:36 +0000 2017   
875289     2987946      524544       True  Wed Nov 22 08:25:48 +0000 2017   
875290     2987948      823869       True  Wed Nov 22 08:35:16 +0000 2017   
875291     2987950      823870       

Short summary of the above:

In this dataset, there are 787346 first inbound messages, i.e. customer tweets that aren't in reply to anything.
These messages received 875292 response tweets, of which 794299 came from companies.
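
The response count exceeds the number of first inbound messages because one tweet can receive several replies, and the merge emits one row per (request, reply) pair (tweet 8 appears three times in the preview above). The toy frame below is invented for illustration and is not taken from the dataset; it only sketches how the `_x`/`_y` suffixes and the company filter behave.

```python
import pandas as pd

# Hypothetical miniature of the tweets table (values are made up)
tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4, 5],
    "author_id": ["u1", "AmazonHelp", "u9", "u2", "AmazonHelp"],
    "inbound": [True, False, True, True, False],
    "in_response_to_tweet_id": [None, 1.0, 1.0, None, 4.0],
})

# First inbound tweets: written by customers and not a reply to anything
first_inbound = tweets[tweets["in_response_to_tweet_id"].isna() & tweets["inbound"]]

# One output row per (first tweet, reply) pair; shared column names get _x/_y
pairs = pd.merge(first_inbound, tweets,
                 left_on="tweet_id", right_on="in_response_to_tweet_id")

# Keep only replies written by a company account
company_pairs = pairs[~pairs["inbound_y"]]
```

Here tweet 1 has two replies (one from the company, one from another user), so two first inbound tweets produce three pairs, of which two are company replies.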

We can see from `author_id` that the dataset includes other companies' customer support Twitter accounts, such as 'sprintcare' and 'AirAsiaSupport'. In this project, I focus only on Amazon's Twitter account, 'AmazonHelp'.

The columns `in_response_to_tweet_id_x` and `response_tweet_id_y` have missing values. I do not need these columns in this project, so I will drop the columns related to response tweet ids.


In [48]:
# Keep the requests from Twitter users together with the responses from AmazonHelp
amazon_tweets = inbounds_and_outbounds \
    .loc[inbounds_and_outbounds.author_id_y == 'AmazonHelp'] 

In [49]:
# Reset the index of the new dataframe
amazon_tweets = amazon_tweets.reset_index(drop=True)

In [50]:
# There are non-English tweets, as AmazonHelp also supports other languages
data_explore(amazon_tweets)

First five rows of data:


Unnamed: 0,tweet_id_x,author_id_x,inbound_x,created_at_x,text_x,response_tweet_id_x,in_response_to_tweet_id_x,tweet_id_y,author_id_y,inbound_y,created_at_y,text_y,response_tweet_id_y,in_response_to_tweet_id_y
0,272,115770,True,Wed Nov 22 09:14:39 +0000 2017,amazonのfireTVstickが見れない😢,269,,269,AmazonHelp,False,Wed Nov 22 09:23:01 +0000 2017,@115770 こんにちは、アマゾン公式です。Fire TV Stickが見れないというのは...,270271.0,272.0
1,325,115792,True,Wed Nov 22 08:55:35 +0000 2017,amazonプライムビデオ、再生エラーが多いです,324,,324,AmazonHelp,False,Wed Nov 22 09:06:00 +0000 2017,@115792 ご不便をおかけしております。アプリをご利用でしょうか。強制停止&gt;端末の...,,325.0
2,617,115820,True,Tue Oct 31 22:16:32 +0000 2017,Way to drop the ball on customer service @1158...,615,,615,AmazonHelp,False,Tue Oct 31 22:29:00 +0000 2017,@115820 I'm sorry we've let you down! Without ...,616.0,617.0
3,621,115822,True,Tue Oct 31 22:19:34 +0000 2017,@115823 I want my amazon payments account CLOS...,620,,620,AmazonHelp,False,Tue Oct 31 22:28:34 +0000 2017,@115822 I am unable to affect your account via...,,621.0
4,624,115824,True,Tue Oct 31 22:12:37 +0000 2017,"@115825 also, beim Addams Family-Film in Prime...",622,,622,AmazonHelp,False,Tue Oct 31 22:28:00 +0000 2017,"@115824 Hi, wir erhalten die Filme/Serien so v...",623.0,624.0



[1m[34mShape: (84637, 14)[0m

[1m[34mColumns & Datatypes: [0m
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84637 entries, 0 to 84636
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   tweet_id_x                 84637 non-null  int64  
 1   author_id_x                84637 non-null  object 
 2   inbound_x                  84637 non-null  bool   
 3   created_at_x               84637 non-null  object 
 4   text_x                     84637 non-null  object 
 5   response_tweet_id_x        84637 non-null  object 
 6   in_response_to_tweet_id_x  0 non-null      float64
 7   tweet_id_y                 84637 non-null  int64  
 8   author_id_y                84637 non-null  object 
 9   inbound_y                  84637 non-null  bool   
 10  created_at_y               84637 non-null  object 
 11  text_y                     84637 non-null  object 
 12  response_tweet_id_y        41190 

### Dilemma: Should I focus on the user requests only, or also include the response tweets from AmazonHelp?

In [52]:
# Look at the requests from users:
# keep inbound tweets whose text mentions @AmazonHelp
amazon_request = all_twitter.loc[all_twitter.inbound & all_twitter.text.str.contains('@AmazonHelp')]

In [54]:
# Reset the index of the new dataframe
amazon_request = amazon_request.reset_index(drop=True)

In [56]:
# First glance at the dataframe
data_explore(amazon_request)

First five rows of data:


Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,270,115770,True,Wed Nov 22 09:24:30 +0000 2017,@AmazonHelp ありがとうございます。\n今、電話で主人が対応していただいてます。,,269.0
1,271,115770,True,Wed Nov 22 09:30:36 +0000 2017,@AmazonHelp 電話で対応してもらいましたが改良されませんでした。\n保証期間も過ぎ...,273.0,269.0
2,274,115770,True,Wed Nov 22 09:44:04 +0000 2017,@AmazonHelp こちらこそありがとうございました。,275.0,273.0
3,616,115820,True,Tue Oct 31 23:22:08 +0000 2017,@AmazonHelp 3 different people have given 3 di...,618.0,615.0
4,619,115820,True,Tue Oct 31 23:32:26 +0000 2017,@AmazonHelp I frankly don't have the patience ...,,618.0



[1m[34mShape: (134725, 7)[0m

[1m[34mColumns & Datatypes: [0m
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134725 entries, 0 to 134724
Data columns (total 7 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tweet_id                 134725 non-null  int64  
 1   author_id                134725 non-null  object 
 2   inbound                  134725 non-null  bool   
 3   created_at               134725 non-null  object 
 4   text                     134725 non-null  object 
 5   response_tweet_id        99765 non-null   object 
 6   in_response_to_tweet_id  112744 non-null  float64
dtypes: bool(1), float64(1), int64(1), object(4)
memory usage: 6.3+ MB

[1m[34mNull values:[0m
tweet_id:0
author_id:0
inbound:0
created_at:0
text:0
response_tweet_id:34960
in_response_to_tweet_id:21981

[1m[34mUnique values (by Columns)[0m
tweet_id:134725
author_id:48044
inbound:1
created_at:132751
text:131142
response

My decision:

Now I have two dataframes:

1. `amazon_tweets`, which combines the user requests and Amazon's responses

2. `amazon_request`, which includes only the user requests

After examining the dataframes, I decided to use `amazon_request`, since my goal is to build a topic model to triage user requests and I am interested in understanding the intent of the customer.
Also, `amazon_request` is much larger (134725 rows, versus 84637 rows in `amazon_tweets`),
and I would like to have a larger dataset for my modelling.

Thus, I will use the `amazon_request` dataframe from now on.
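
One caveat about how `amazon_request` was selected: `str.contains('@AmazonHelp')` is a case-sensitive pattern match, so a mention typed as `@amazonhelp` would be skipped. Whether such variants occur in this dataset is not checked here. The sketch below uses made-up tweets to illustrate an assumed alternative with `case=False` and `regex=False`.

```python
import pandas as pd

# Made-up tweets, for illustration only
tweets = pd.DataFrame({
    "inbound": [True, True, True, False],
    "text": [
        "@AmazonHelp where is my order?",
        "@115821 @amazonhelp my parcel is late",
        "@sprintcare no signal again",
        "@115820 I'm sorry we've let you down!",
    ],
})

# Case-sensitive match, as in the selection above
strict = tweets[tweets["inbound"] & tweets["text"].str.contains("@AmazonHelp")]

# Literal, case-insensitive match
relaxed = tweets[
    tweets["inbound"]
    & tweets["text"].str.contains("@AmazonHelp", case=False, regex=False)
]
```

Comparing `len(strict)` with `len(relaxed)` on the real data would show whether the stricter match loses any requests.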


In [60]:
# The response tweet ids are not needed, so drop the related columns
amazon_request = amazon_request.drop(['response_tweet_id', 'in_response_to_tweet_id'],axis=1)

In [61]:
#ensure the columns are dropped
amazon_request.head() 

Unnamed: 0,tweet_id,author_id,inbound,created_at,text
0,270,115770,True,Wed Nov 22 09:24:30 +0000 2017,@AmazonHelp ありがとうございます。\n今、電話で主人が対応していただいてます。
1,271,115770,True,Wed Nov 22 09:30:36 +0000 2017,@AmazonHelp 電話で対応してもらいましたが改良されませんでした。\n保証期間も過ぎ...
2,274,115770,True,Wed Nov 22 09:44:04 +0000 2017,@AmazonHelp こちらこそありがとうございました。
3,616,115820,True,Tue Oct 31 23:22:08 +0000 2017,@AmazonHelp 3 different people have given 3 di...
4,619,115820,True,Tue Oct 31 23:32:26 +0000 2017,@AmazonHelp I frankly don't have the patience ...


## Filter out non-English tweets

After looking into the dataset, we can see that there are non-English tweets, as AmazonHelp also supports other languages.
In this section, I filter out the non-English tweets; only English tweets are examined in this project.

In [62]:
# Create a small subset of the data to debug and test the non-English filter
# (i.e. shorter running time)
amazon_sample = amazon_request.head(100)


In [63]:
# Make sure this sample dataframe contains non-English tweets
amazon_sample.head(40)


Unnamed: 0,tweet_id,author_id,inbound,created_at,text
0,270,115770,True,Wed Nov 22 09:24:30 +0000 2017,@AmazonHelp ありがとうございます。\n今、電話で主人が対応していただいてます。
1,271,115770,True,Wed Nov 22 09:30:36 +0000 2017,@AmazonHelp 電話で対応してもらいましたが改良されませんでした。\n保証期間も過ぎ...
2,274,115770,True,Wed Nov 22 09:44:04 +0000 2017,@AmazonHelp こちらこそありがとうございました。
3,616,115820,True,Tue Oct 31 23:22:08 +0000 2017,@AmazonHelp 3 different people have given 3 di...
4,619,115820,True,Tue Oct 31 23:32:26 +0000 2017,@AmazonHelp I frankly don't have the patience ...
5,623,115824,True,Tue Oct 31 22:32:07 +0000 2017,"@AmazonHelp Okay, danke für die Info"
6,627,115827,True,Wed Nov 01 12:50:18 +0000 2017,@AmazonHelp @115826 Yeah this is crazy we’re l...
7,634,115831,True,Tue Oct 31 21:39:58 +0000 2017,@115821 @AmazonHelp why is my order at my loca...
8,638,115834,True,Tue Oct 31 22:19:56 +0000 2017,@AmazonHelp Hi ready for some help
9,640,115834,True,Tue Oct 31 01:03:01 +0000 2017,@AmazonHelp Is the Echo Show no longer supported?


### Filter non-English tweets - test on a subset (first 100 tweets)

In [65]:
# Drop every tweet that contains a non-ASCII character (tested on the sample data).
# A boolean mask avoids in-place drops on a slice of amazon_request.
non_ascii = amazon_sample['text'].str.contains(r'[^\x00-\x7F]')   # [^\x00-\x7F] matches any non-ASCII character
amazon_sample = amazon_sample[~non_ascii].reset_index(drop=True)


In [66]:
amazon_sample

Unnamed: 0,tweet_id,author_id,inbound,created_at,text
0,616,115820,True,Tue Oct 31 23:22:08 +0000 2017,@AmazonHelp 3 different people have given 3 di...
1,619,115820,True,Tue Oct 31 23:32:26 +0000 2017,@AmazonHelp I frankly don't have the patience ...
2,638,115834,True,Tue Oct 31 22:19:56 +0000 2017,@AmazonHelp Hi ready for some help
3,640,115834,True,Tue Oct 31 01:03:01 +0000 2017,@AmazonHelp Is the Echo Show no longer supported?
4,641,115834,True,Tue Oct 31 01:02:39 +0000 2017,@AmazonHelp Nothing there helped me with the E...
5,646,115835,True,Tue Oct 31 21:40:30 +0000 2017,.@AmazonHelp Item has not been delivered but t...
6,651,115838,True,Tue Oct 31 22:25:58 +0000 2017,@AmazonHelp I don't want a form to fill out th...
7,654,115838,True,Tue Oct 31 22:53:50 +0000 2017,@AmazonHelp Already started the return. UPS ge...
8,652,115838,True,Tue Oct 31 22:11:15 +0000 2017,@AmazonHelp Is it possible to prevent AMZL fro...
9,657,115839,True,Tue Oct 31 22:18:42 +0000 2017,"@AmazonHelp Already handled, just venting. It ..."


### Further detection of non-English languages

Now we can see that the Japanese tweets were dropped by the filter.
However, there are still German and Spanish tweets in the subset,
for example at index 50: 'Ach schau an. Meine Bestellung war unzustellba...'
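
The ASCII check works on characters rather than on language, which explains both the leftovers and some losses. A minimal sketch with short strings loosely modelled on tweets in the sample (the variable names are illustrative): German text passes whenever it happens to contain no non-ASCII letter, while an English tweet with a typographic apostrophe is dropped.

```python
import pandas as pd

NON_ASCII = r"[^\x00-\x7F]"

samples = pd.Series([
    "@AmazonHelp Hi ready for some help",         # plain English: kept
    "@AmazonHelp ありがとうございます。",             # Japanese: dropped
    "@AmazonHelp Ach schau an. Meine Bestellung",  # German without umlauts: kept
    "@AmazonHelp Okay, danke für die Info",       # German with an umlaut: dropped
    "@AmazonHelp it’s still not here",            # English with a curly apostrophe: dropped
])

has_non_ascii = samples.str.contains(NON_ASCII, regex=True)
kept = samples[~has_non_ascii].reset_index(drop=True)
```

So the filter is best read as a fast first pass: it removes scripts such as Japanese reliably, but it is neither a complete nor a lossless English filter.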

Since the dataset is large, I will use a library called `langdetect`
to check whether any non-English tweets remain.

https://pypi.org/project/langdetect/ 

In [68]:
# Test on the subset: collect the tweets that langdetect does not label as English
non_eng = [x for x in amazon_sample['text'] if detect(x) != 'en']

In [69]:
#checking the non english tweets
non_eng

['@AmazonHelp Where is my order? https://t.co/pXnKSCo2ex',
 '@AmazonHelp Sadly yes']

In [70]:
lang = detect('Sadly yes')
lang

'tr'

__Short summary of the language detection:__

Here we can see that the tweet 'Sadly yes' is actually English but was misdetected as `tr` (Turkish), so the detection does not seem very accurate.

At this stage, I will first use the non-ASCII filter to remove non-English tweets from the Amazon requests,
then run the language detection again after data cleaning, or look for another way to further check for non-English tweets.
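
Short texts dominated by a handle or a URL give a language detector little to work with, which may explain misfires like the two above. One possible mitigation, sketched here under assumptions, is to strip mentions and links and to skip very short remainders before calling the detector. The function names, the `min_words` threshold, and the injected `detector` callable are all hypothetical. In the notebook the callable would be `langdetect.detect`; the stand-in below exists only so the example runs without that dependency.

```python
import re
from typing import Callable, Optional

MENTION_OR_URL = re.compile(r"@\w+|https?://\S+")


def strip_noise(text: str) -> str:
    """Remove @mentions and URLs, then collapse whitespace."""
    return " ".join(MENTION_OR_URL.sub(" ", text).split())


def guess_language(text: str,
                   detector: Callable[[str], str],
                   min_words: int = 3) -> Optional[str]:
    """Return a language code, or None when the text is too short to judge."""
    cleaned = strip_noise(text)
    if len(cleaned.split()) < min_words:
        return None  # undecided: do not label the tweet as non-English
    return detector(cleaned)


def fake_detector(text: str) -> str:
    # Stand-in for langdetect.detect so the sketch has no external dependency
    return "en"


cleaned = strip_noise("@AmazonHelp Where is my order? https://t.co/pXnKSCo2ex")
short = guess_language("@AmazonHelp Sadly yes", fake_detector)
longer = guess_language(
    "@AmazonHelp Where is my order? https://t.co/pXnKSCo2ex", fake_detector
)
```

The langdetect README also notes that results can vary between runs unless `DetectorFactory.seed` is set, which is worth doing before comparing outputs.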


### Filter out non-English tweets

In [71]:
# Drop every tweet in the full Amazon request data that contains a non-ASCII character
non_ascii = amazon_request['text'].str.contains(r'[^\x00-\x7F]')   # [^\x00-\x7F] matches any non-ASCII character
amazon_request = amazon_request[~non_ascii].reset_index(drop=True)

In [72]:
# Check for duplicates based on the 'text' content, i.e. the requests from Twitter users
amazon_request[amazon_request.duplicated(['text'])].shape

(3328, 5)

In [74]:
amazon_request.shape

(97011, 5)

In [76]:
# Drop the duplicates (the first occurrence of each text is kept)
amazon_request.drop_duplicates(subset='text',inplace=True)

In [77]:
# Reset the index of the new dataframe
amazon_request = amazon_request.reset_index(drop=True)

In [78]:
#shape after dropping duplicate 
amazon_request.shape

(93683, 5)
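
The three shapes above are consistent: 97011 - 3328 = 93683. `duplicated` flags every repeat after the first occurrence, so `drop_duplicates` removes exactly the flagged rows. A toy check with invented data:

```python
import pandas as pd

# Invented requests; three of them share the same text
requests = pd.DataFrame({
    "author_id": ["a", "b", "c", "d"],
    "text": [
        "where is my order",
        "refund please",
        "where is my order",
        "where is my order",
    ],
})

# Rows flagged as repeats of an earlier text (the first occurrence is not flagged)
n_flagged = int(requests.duplicated(subset="text").sum())

# Same rule applied as a filter: the first occurrence of each text survives
deduplicated = requests.drop_duplicates(subset="text").reset_index(drop=True)
```

Because the first occurrence is kept, the surviving `author_id` for a repeated text is whoever sent it first in row order.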

In [79]:
amazon_request.head(10)

Unnamed: 0,tweet_id,author_id,inbound,created_at,text
0,616,115820,True,Tue Oct 31 23:22:08 +0000 2017,@AmazonHelp 3 different people have given 3 di...
1,619,115820,True,Tue Oct 31 23:32:26 +0000 2017,@AmazonHelp I frankly don't have the patience ...
2,638,115834,True,Tue Oct 31 22:19:56 +0000 2017,@AmazonHelp Hi ready for some help
3,640,115834,True,Tue Oct 31 01:03:01 +0000 2017,@AmazonHelp Is the Echo Show no longer supported?
4,641,115834,True,Tue Oct 31 01:02:39 +0000 2017,@AmazonHelp Nothing there helped me with the E...
5,646,115835,True,Tue Oct 31 21:40:30 +0000 2017,.@AmazonHelp Item has not been delivered but t...
6,651,115838,True,Tue Oct 31 22:25:58 +0000 2017,@AmazonHelp I don't want a form to fill out th...
7,654,115838,True,Tue Oct 31 22:53:50 +0000 2017,@AmazonHelp Already started the return. UPS ge...
8,652,115838,True,Tue Oct 31 22:11:15 +0000 2017,@AmazonHelp Is it possible to prevent AMZL fro...
9,657,115839,True,Tue Oct 31 22:18:42 +0000 2017,"@AmazonHelp Already handled, just venting. It ..."


In [80]:
data_explore(amazon_request)

First five rows of data:


Unnamed: 0,tweet_id,author_id,inbound,created_at,text
0,616,115820,True,Tue Oct 31 23:22:08 +0000 2017,@AmazonHelp 3 different people have given 3 di...
1,619,115820,True,Tue Oct 31 23:32:26 +0000 2017,@AmazonHelp I frankly don't have the patience ...
2,638,115834,True,Tue Oct 31 22:19:56 +0000 2017,@AmazonHelp Hi ready for some help
3,640,115834,True,Tue Oct 31 01:03:01 +0000 2017,@AmazonHelp Is the Echo Show no longer supported?
4,641,115834,True,Tue Oct 31 01:02:39 +0000 2017,@AmazonHelp Nothing there helped me with the E...



[1m[34mShape: (93683, 5)[0m

[1m[34mColumns & Datatypes: [0m
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93683 entries, 0 to 93682
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   tweet_id    93683 non-null  int64 
 1   author_id   93683 non-null  object
 2   inbound     93683 non-null  bool  
 3   created_at  93683 non-null  object
 4   text        93683 non-null  object
dtypes: bool(1), int64(1), object(3)
memory usage: 2.9+ MB

[1m[34mNull values:[0m
None in Dataframe.

[1m[34mUnique values (by Columns)[0m
tweet_id:93683
author_id:36107
inbound:1
created_at:92722
text:93683



In [81]:
# Export to CSV to avoid rerunning the non-English filter every time
amazon_request.to_csv('./datasets/amazon_request_en.csv', index=False)
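
To make the "avoid rerunning" intent explicit, a later session could load the saved file when it exists and rebuild it only otherwise. A minimal sketch, assuming the export path above; the helper `load_or_build` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_build(path: str, build: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Read the cached CSV if present; otherwise build the frame and save it."""
    cache = Path(path)
    if cache.exists():
        return pd.read_csv(cache)
    frame = build()
    cache.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(cache, index=False)
    return frame
```

Usage would look like `load_or_build('./datasets/amazon_request_en.csv', build_filtered_requests)`, where `build_filtered_requests` is a placeholder for a function wrapping the filtering steps above. Note that a CSV round trip re-infers dtypes, so columns such as `author_id` may come back with a different dtype.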