<a href="https://www.kaggle.com/code/nishkoder/customer-support-on-twitter-data-preprocessing?scriptVersionId=161706840" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Customer Support on Twitter: Data Preprocessing

### Introduction
#### 1.1 [Gather Data](#1)
#### 1.2 [Remove HTML tags](#2)
#### 1.3 [Remove URLs](#3)
#### 1.4 [Convert to lowercase](#4)
#### 1.5 [Remove emojis](#5)
#### 1.6 [Remove punctuation](#6)
#### 1.7 [Remove stopwords](#7)
#### 1.8 [Handle abbreviations/slang (example, customize your dictionary)](#8)
#### 1.9 [Stemming](#9)
#### 1.10 [Spelling correction](#10)
#### 1.11 [Tokenize](#11)

#### 1.1 Gather Data <a id='1'></a>

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/customer-support-on-twitter/sample.csv
/kaggle/input/customer-support-on-twitter/twcs/twcs.csv


In [2]:
import numpy as np
import pandas as pd
import re
import nltk
import spacy
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

pd.options.mode.chained_assignment = None

df = pd.read_csv('/kaggle/input/customer-support-on-twitter/twcs/twcs.csv',
                nrows = 500)

In [3]:
df.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,1,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2.0,3.0
1,2,115712,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0
2,3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1.0,4.0
3,4,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3.0,5.0
4,5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4.0,6.0


In [4]:
df = df[['text']].copy()
df['text'] = df['text'].astype(str)
df.head()

Unnamed: 0,text
0,@115712 I understand. I would like to assist y...
1,@sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messag...
3,@115712 Please send us a Private Message so th...
4,@sprintcare I did.


In [5]:
df.shape

(500, 1)

#### 1.2 Remove HTML tags <a id='2'></a>

In [6]:
def remove_html_tags(text):
    """
    Remove HTML tags from a string.

    Args:
        text (str): The string containing HTML tags.

    Returns:
        str: The string with HTML tags removed.
    """
    clean_text = re.sub(r'<.*?>', '', text)
    return clean_text

In [7]:
df['text']=df['text'].apply(remove_html_tags)
df

Unnamed: 0,text
0,@115712 I understand. I would like to assist y...
1,@sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messag...
3,@115712 Please send us a Private Message so th...
4,@sprintcare I did.
...,...
495,"@115884 Oh, no! Please speak to a member of th..."
496,.@delta this has been my inflight studio exper...
497,@115885 2/2 https://t.co/6iDGBJAc2m
498,@Delta Is that not what I’ve done already?
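
Stripping tags does not decode HTML entities such as `&amp;`, which can appear in text collected from web APIs. Whether this dataset contains them is an assumption worth checking. The sketch below handles both cases with the standard-library `html.unescape`; `strip_html` is a hypothetical name, not part of the notebook.

```python
import html
import re

def strip_html(text: str) -> str:
    """Drop anything that looks like a tag, then decode HTML entities."""
    no_tags = re.sub(r'<.*?>', '', text)
    return html.unescape(no_tags)

print(strip_html("AT&amp;T <b>support</b> replied"))
```

Tags are removed before decoding so that an entity such as `&lt;` is not turned into a literal `<` and then mistaken for the start of a tag.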


#### 1.3 Remove URLs <a id='3'></a>

In [8]:
def remove_urls(text):
    """
    Remove URLs from a string.

    Args:
        text (str): The string containing URLs.

    Returns:
        str: The string with URLs removed.
    """
    # This pattern matches most URLs
    url_pattern = r'https?://\S+|www\.\S+'
    clean_text = re.sub(url_pattern, '', text)
    return clean_text

In [9]:
df['text']=df['text'].apply(remove_urls)
df

Unnamed: 0,text
0,@115712 I understand. I would like to assist y...
1,@sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messag...
3,@115712 Please send us a Private Message so th...
4,@sprintcare I did.
...,...
495,"@115884 Oh, no! Please speak to a member of th..."
496,.@delta this has been my inflight studio exper...
497,@115885 2/2
498,@Delta Is that not what I’ve done already?


#### 1.4 Convert to lowercase <a id='4'></a>

In [10]:
df['text']=df['text'].str.lower()
df

Unnamed: 0,text
0,@115712 i understand. i would like to assist y...
1,@sprintcare and how do you propose we do that
2,@sprintcare i have sent several private messag...
3,@115712 please send us a private message so th...
4,@sprintcare i did.
...,...
495,"@115884 oh, no! please speak to a member of th..."
496,.@delta this has been my inflight studio exper...
497,@115885 2/2
498,@delta is that not what i’ve done already?


#### 1.5 Remove emojis <a id='5'></a>

In [11]:
def remove_emoji(text):
    """
    Remove emojis from a string.

    This function uses a regular expression pattern that matches the most
    common emoji code point ranges and removes them from the input text.

    Args:
        text (str): The string from which emojis will be removed.

    Returns:
        str: The string with matched emojis removed.
    """
    # Unicode ranges for emojis
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F700-\U0001F77F"  # alchemical symbols
                               u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                               u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                               u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                               u"\U0001FA00-\U0001FA6F"  # Chess Symbols
                               u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                               u"\U00002702-\U000027B0"  # Dingbats
                               # A single range from U+24C2 to U+1F251 would also strip
                               # CJK and many other scripts, so it is split up here
                               u"\U000024C2"             # circled M
                               u"\U00002600-\U000026FF"  # Miscellaneous Symbols
                               u"\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
                               u"\U0001F200-\U0001F251"  # Enclosed Ideographic Supplement
                               "]+", flags=re.UNICODE)
    clean_text = emoji_pattern.sub('', text)
    return clean_text


In [12]:
df['text']=df['text'].apply(remove_emoji)
df

Unnamed: 0,text
0,@115712 i understand. i would like to assist y...
1,@sprintcare and how do you propose we do that
2,@sprintcare i have sent several private messag...
3,@115712 please send us a private message so th...
4,@sprintcare i did.
...,...
495,"@115884 oh, no! please speak to a member of th..."
496,.@delta this has been my inflight studio exper...
497,@115885 2/2
498,@delta is that not what i’ve done already?
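
When removing emojis with a character class, it helps to confirm that ordinary non-Latin text survives. The sketch below uses a reduced, illustrative set of ranges (an assumption, not a complete emoji definition) and a hypothetical helper `strip_emoji` on made-up strings.

```python
import re

# Reduced, illustrative set of ranges; not a complete emoji definition
EMOJI_RE = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "]+"
)

def strip_emoji(text: str) -> str:
    """Remove code points that fall inside the ranges listed above."""
    return EMOJI_RE.sub('', text)

samples = [
    "thanks \U0001F600\U0001F44D",   # grinning face, thumbs up
    "\u65e5\u672c\u8a9e ok",         # Japanese text, should be kept
    "flight \u2708 delayed",         # airplane dingbat
]
for sample in samples:
    print(repr(strip_emoji(sample)))
```

Removing a symbol from the middle of a sentence leaves two adjacent spaces, which a later whitespace or tokenization step can normalize.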


#### 1.6 Remove punctuation <a id='6'></a>

In [13]:
def remove_punc(text):
    """
    Remove punctuation from a string.

    Args:
        text (str): The string from which to remove punctuation.

    Returns:
        str: The string with punctuation removed.
    """
    exclude = set(string.punctuation)
    for char in exclude:
        text = text.replace(char, "")
    return text

In [14]:
df['text']=df['text'].apply(remove_punc)
df

Unnamed: 0,text
0,115712 i understand i would like to assist you...
1,sprintcare and how do you propose we do that
2,sprintcare i have sent several private message...
3,115712 please send us a private message so tha...
4,sprintcare i did
...,...
495,115884 oh no please speak to a member of the f...
496,delta this has been my inflight studio experie...
497,115885 22
498,delta is that not what i’ve done already
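
Row 498 above still contains a typographic apostrophe (`’`), because `string.punctuation` covers ASCII punctuation only. One possible way to handle this, sketched below, is a translation table that also lists a few typographic marks. The extra characters and the name `remove_punc_fast` are assumptions for illustration, and the set is not exhaustive.

```python
import string

# ASCII punctuation plus a few typographic marks (assumed, non-exhaustive)
EXTRA_PUNCT = "\u2019\u2018\u201c\u201d\u2026"  # ’ ‘ “ ” …
PUNCT_TABLE = str.maketrans('', '', string.punctuation + EXTRA_PUNCT)

def remove_punc_fast(text: str) -> str:
    """Delete every character listed in PUNCT_TABLE in a single pass."""
    return text.translate(PUNCT_TABLE)

print(remove_punc_fast("@delta is that not what i\u2019ve done already?"))
```

`str.translate` scans the string once, whereas calling `replace` for each punctuation character scans it once per character.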


#### 1.7 Remove stopwords <a id='7'></a>

In [15]:
# Download the required NLTK resources once, rather than on every call
nltk.download('stopwords')
nltk.download('punkt')

def remove_stopwords(text, language='english'):
    """
    Remove stopwords from a given text.

    Args:
        text (str): The input string from which to remove stopwords.
        language (str): The language of the text and the stopwords to be removed. Defaults to 'english'.

    Returns:
        str: A string with stopwords removed.

    Example:
        >>> sample_text = "This is an example showing off stop word filtration."
        >>> remove_stopwords(sample_text)
        'example showing stop word filtration .'
    """
    # Load the stopwords for the specified language as a set for fast lookup
    stop_words = set(stopwords.words(language))

    # Tokenize the text into words
    word_tokens = word_tokenize(text)

    # Filter out the stopwords (the comparison is case-insensitive)
    filtered_text = [word for word in word_tokens if word.lower() not in stop_words]

    # Join the filtered words back into a string
    filtered_text = ' '.join(filtered_text)

    return filtered_text

In [16]:
df['text']=df['text'].apply(remove_stopwords)
df

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

Unnamed: 0,text
0,115712 understand would like assist would need...
1,sprintcare propose
2,sprintcare sent several private messages one r...
3,115712 please send us private message assist c...
4,sprintcare
...,...
495,115884 oh please speak member flt crew immedia...
496,delta inflight studio experience today nothing...
497,115885 22
498,delta ’ done already
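
Building the stopword set once and reusing it for every row avoids repeated work. The sketch below illustrates the idea with a small hard-coded set standing in for the NLTK list and plain whitespace splitting instead of `word_tokenize`. Both are assumptions made to keep the example self-contained, and `drop_stopwords` is a hypothetical name.

```python
# Stand-in for set(stopwords.words('english')); hypothetical and far from complete
STOP_WORDS = frozenset({"i", "and", "how", "do", "you", "we", "that", "is", "an"})

def drop_stopwords(text: str, stop_words=STOP_WORDS) -> str:
    """Keep whitespace-separated tokens whose lowercase form is not a stopword."""
    return ' '.join(tok for tok in text.split() if tok.lower() not in stop_words)

print(drop_stopwords("sprintcare and how do you propose we do that"))
```

Passing the set as a parameter keeps the function easy to test and lets callers swap in a domain-specific list.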


#### 1.8 Handle abbreviations/slang (example, customize your dictionary) <a id='8'></a>

In [17]:
abbr_dict={
  "AFAIK": "As Far As I Know",
  "AFK": "Away From Keyboard",
  "ASAP": "As Soon As Possible",
  "ATK": "At The Keyboard",
  "ATM": "At The Moment",
  "A3": "Anytime, Anywhere, Anyplace",
  "BAK": "Back At Keyboard",
  "BBL": "Be Back Later",
  "BBS": "Be Back Soon",
  "BFN": "Bye For Now",
  "B4N": "Bye For Now",
  "BRB": "Be Right Back",
  "BRT": "Be Right There",
  "BTW": "By The Way",
  "B4": "Before",
  "CU": "See You",
  "CUL8R": "See You Later",
  "CYA": "See You",
  "FAQ": "Frequently Asked Questions",
  "FC": "Fingers Crossed",
  "FWIW": "For What It's Worth",
  "FYI": "For Your Information",
  "GAL": "Get A Life",
  "GG": "Good Game",
  "GN": "Good Night",
  "GMTA": "Great Minds Think Alike",
  "GR8": "Great!",
  "G9": "Genius",
  "IC": "I See",
  "ICQ": "I Seek you (also a chat program)",
  "ILU": "I Love You",
  "IMHO": "In My Honest/Humble Opinion",
  "IMO": "In My Opinion",
  "IOW": "In Other Words",
  "IRL": "In Real Life",
  "KISS": "Keep It Simple, Stupid",
  "LDR": "Long Distance Relationship",
  "LMAO": "Laugh My A.. Off",
  "LOL": "Laughing Out Loud",
  "LTNS": "Long Time No See",
  "L8R": "Later",
  "MTE": "My Thoughts Exactly",
  "M8": "Mate",
  "NRN": "No Reply Necessary",
  "OIC": "Oh I See",
  "PITA": "Pain In The A..",
  "PRT": "Party",
  "PRW": "Parents Are Watching",
  "QPSA?": "Que Pasa?",
  "ROFL": "Rolling On The Floor Laughing",
  "ROFLOL": "Rolling On The Floor Laughing Out Loud",
  "ROTFLMAO": "Rolling On The Floor Laughing My A.. Off",
  "SK8": "Skate",
  "STATS": "Your sex and age",
  "ASL": "Age, Sex, Location",
  "THX": "Thank You",
  "TTFN": "Ta-Ta For Now!",
  "TTYL": "Talk To You Later",
  "U": "You",
  "U2": "You Too",
  "U4E": "Yours For Ever",
  "WB": "Welcome Back",
  "WTF": "What The F...",
  "WTG": "Way To Go!",
  "WUF": "Where Are You From?",
  "W8": "Wait...",
  "7K": "Sick:-D Laugher",
  "TFW": "That feeling when. TFW internet slang often goes in a caption to an image.",
  "MFW": "My face when",
  "MRW": "My reaction when",
  "IFYP": "I feel your pain",
  "TNTL": "Trying not to laugh",
  "JK": "Just kidding",
  "IDC": "I don’t care",
  "ILY": "I love you",
  "IMU": "I miss you",
  "ADIH": "Another day in hell",
  "ZZZ": "Sleeping, bored, tired",
  "WYWH": "Wish you were here",
  "TIME": "Tears in my eyes",
  "BAE": "Before anyone else",
  "FIMH": "Forever in my heart",
  "BSAAW": "Big smile and a wink",
  "BWL": "Bursting with laughter",
  "BFF": "Best friends forever",
  "CSL": "Can’t stop laughing"
}

In [18]:
def expand_text(text):
    """
    Expand abbreviations and slang in a given text based on a predefined dictionary.
    
    Args:
        text (str): The input string containing abbreviations and/or slang.
    
    Returns:
        str: The processed string with abbreviations and slang expanded.
    """
    words = text.split()
    # Dictionary keys are uppercase, so normalize each word before the lookup
    expanded_words = [abbr_dict.get(word.upper(), word) for word in words]
    expanded_text = ' '.join(expanded_words)
    return expanded_text

In [19]:
df['text']=df['text'].apply(expand_text)
df

Unnamed: 0,text
0,115712 understand would like assist would need...
1,sprintcare propose
2,sprintcare sent several private messages one r...
3,115712 please send us private message assist c...
4,sprintcare
...,...
495,115884 oh please speak member flt crew immedia...
496,delta inflight studio experience today nothing...
497,115885 22
498,delta ’ done already
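
The text was lowercased in step 1.4 while the dictionary keys are uppercase, so the lookup has to normalize case for any abbreviation to match. A minimal sketch with a three-entry dictionary follows; `SLANG` and `expand_slang` are hypothetical names used only for illustration.

```python
# Tiny illustrative mapping with uppercase keys, mirroring the style of abbr_dict
SLANG = {
    "ASAP": "As Soon As Possible",
    "THX": "Thank You",
    "BTW": "By The Way",
}

def expand_slang(text: str, mapping=SLANG) -> str:
    """Replace each token that matches a key (case-insensitively) with its expansion."""
    return ' '.join(mapping.get(tok.upper(), tok) for tok in text.split())

print(expand_slang("thx please reply asap"))
```

Case-insensitive matching also makes ambiguous keys fire on ordinary words (for example a key such as `TIME`), so the dictionary may need pruning for a given corpus.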


#### 1.9 Stemming <a id='9'></a>

In [20]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def stem_text(text):
    """
    Apply stemming to a given text.
    
    Args:
        text (str): The input string to be stemmed.
    
    Returns:
        str: The stemmed string.
    """
    # Initialize the Porter Stemmer
    stemmer = PorterStemmer()
    
    # Tokenize the text into individual words
    tokens = word_tokenize(text)
    
    # Stem each word in the text
    stemmed_words = [stemmer.stem(word) for word in tokens]
    
    # Join the stemmed words back into a string
    stemmed_text = ' '.join(stemmed_words)
    
    return stemmed_text

In [21]:
df['text']=df['text'].apply(stem_text)
df

Unnamed: 0,text
0,115712 understand would like assist would need...
1,sprintcar propos
2,sprintcar sent sever privat messag one respond...
3,115712 pleas send us privat messag assist clic...
4,sprintcar
...,...
495,115884 oh pleas speak member flt crew immedi a...
496,delta inflight studio experi today noth work e...
497,115885 22
498,delta ’ done alreadi


#### 1.10 Spelling correction <a id='10'></a>

In [22]:
from textblob import TextBlob
def spell_check_text(text):
    """
    Correct spelling in the input text.

    Args:
        text (str): The input text to be spell-checked and corrected.

    Returns:
        str: The corrected text.

    Example:
        >>> sample_text = "Speling erors in somthing writen can be eaisly overseen."
        >>> corrected_text = spell_check_text(sample_text)
        >>> print(corrected_text)
    """
    blob = TextBlob(text)
    corrected_text = str(blob.correct())
    
    return corrected_text

In [23]:
df['text']=df['text'].apply(spell_check_text)
df

Unnamed: 0,text
0,115712 understand would like assist would need...
1,sprinter propos
2,sprinter sent never privat message one respond...
3,115712 pleas send us privat message assist cli...
4,sprinter
...,...
495,115884 oh pleas speak member felt crew dimmed ...
496,felt flight studio expert today not work excep...
497,115885 22
498,felt ’ done already
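
In the output above, the spell checker rewrites brand names and domain terms (`delta` and `flt` both become `felt`). One possible mitigation is to skip tokens on a protected list. The sketch below shows the idea with a toy corrector standing in for a real spell checker. The mapping, the protected set, and the function names are all hypothetical.

```python
# Toy corrector standing in for a real spell checker (hypothetical mapping)
TOY_FIXES = {"delta": "felt", "flt": "felt", "alreadi": "already"}

def toy_corrector(token: str) -> str:
    """Return the toy 'correction' for a token, or the token unchanged."""
    return TOY_FIXES.get(token, token)

def correct_unprotected(text: str, corrector, protected) -> str:
    """Apply `corrector` to each token unless the token is in `protected`."""
    return ' '.join(
        tok if tok in protected else corrector(tok)
        for tok in text.split()
    )

PROTECTED = {"delta", "flt", "sprintcar"}
print(correct_unprotected("delta done alreadi", toy_corrector, PROTECTED))
```

The corrector is passed in as a callable, so the same wrapper could in principle sit around any per-token correction function.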


In [24]:
df

Unnamed: 0,text
0,115712 understand would like assist would need...
1,sprinter propos
2,sprinter sent never privat message one respond...
3,115712 pleas send us privat message assist cli...
4,sprinter
...,...
495,115884 oh pleas speak member felt crew dimmed ...
496,felt flight studio expert today not work excep...
497,115885 22
498,felt ’ done already


In [25]:
def whitespace_clean(text):
    """
    Remove leading and trailing white spaces from the input text.

    Args:
        text (str): The input text to be cleaned.

    Returns:
        str: The cleaned text with no leading or trailing white spaces.

    Example:
        >>> sample_text = "  Hello, world! This is an example.    "
        >>> clean_text = whitespace_clean(sample_text)
        >>> print(clean_text)
    """
    # Remove leading and trailing white spaces
    clean_text = text.strip()
    
    return clean_text

In [26]:
df['text']=df['text'].apply(whitespace_clean)
df

Unnamed: 0,text
0,115712 understand would like assist would need...
1,sprinter propos
2,sprinter sent never privat message one respond...
3,115712 pleas send us privat message assist cli...
4,sprinter
...,...
495,115884 oh pleas speak member felt crew dimmed ...
496,felt flight studio expert today not work excep...
497,115885 22
498,felt ’ done already
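
`strip()` trims only the ends of a string. If interior runs of spaces, tabs, or newlines also need to be normalized, a regular expression can collapse them. A minimal sketch, with `normalize_whitespace` as a hypothetical helper name:

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse any run of whitespace to one space and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()

print(repr(normalize_whitespace("  hello   world \n")))
```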


#### 1.11 Tokenize <a id='11'></a>

In [27]:
from nltk.tokenize import word_tokenize, sent_tokenize
def tokenize_texts(input_series):
    """
    Tokenize each text in the input Pandas Series into sentences and words.

    Args:
        input_series (pd.Series): A Pandas Series containing text data to be tokenized.

    Returns:
        pd.Series: A new Pandas Series where each element is a tuple containing two lists:
            - A list of sentences.
            - A list of words.

    Example:
        >>> input_series = pd.Series(["Hello, world! This is an example.", "Tokenizing this text into words and sentences."])
        >>> tokenized_series = tokenize_texts(input_series)
        >>> print(tokenized_series)
    """
    tokenized_data = input_series.apply(lambda text: (sent_tokenize(text), word_tokenize(text)))
    return tokenized_data

In [28]:
df['text']=tokenize_texts(df['text'])


In [29]:
df.to_csv('/kaggle/working/submission.csv')
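
Writing the tokenized column to CSV stores each tuple as its string form, so reading the file back yields strings rather than lists. The sketch below shows the round trip, using the standard-library `csv` module as a stand-in for pandas (an assumption to keep the example dependency-free) and `ast.literal_eval` to parse each cell.

```python
import ast
import csv
import io

# One tokenized row: (list of sentences, list of words)
rows = [(["hello world"], ["hello", "world"])]

# Write the tuples as text, which is what ends up in a CSV cell
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text"])
for row in rows:
    writer.writerow([str(row)])

# Read the file back and parse each cell into Python objects again
buffer.seek(0)
restored = [ast.literal_eval(record["text"]) for record in csv.DictReader(buffer)]
print(restored)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for parsing stored cells.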