### ***In any machine learning task, cleaning or preprocessing the data is important so that the computer can work effectively with human (natural) language, and for unstructured data like text this step is even more important.***

## TASK

### We will work on a sample dataset (a CSV file) and perform three of the common text cleaning / preprocessing steps listed below.

- Lower casing

- Removal of Punctuations

- Removal of Frequent words

- Removal of Rare words

- Stemming

- Lemmatization

- Removal of emojis

- Removal of emoticons

- Conversion of emoticons to words

- Conversion of emojis to words

- Removal of URLs

- Spelling correction

## Dataset Description


- This dataset contains `Tweets` to and from companies doing customer support on Twitter.

- It has 7 columns:

  - `tweet_id`: The unique ID of the tweet
  - `author_id`: The unique ID of the tweet author
  - `inbound`: Whether or not the tweet was sent to a company
  - `created_at`: When the tweet was created
  - `text`: The text content of the tweet
  - `response_tweet_id`: The tweet that responded to this one, if there is any
  - `in_response_to_tweet_id`: The tweet this tweet was in response to, if there is any

### ***The three text cleaning/ preprocessing steps we will use are:***

- Lower Casing

- Removal of punctuations

- Stemming


### IMPORTING LIBRARIES

In [30]:
#Import the libraries needed

import numpy as np #for numerical manipulation. 
import pandas as pd # data manipulation
import re
import nltk
import spacy
import string
pd.options.mode.chained_assignment = None

### EXPLANATION OF THE LIBRARIES


- `spacy` is an open-source natural language processing library for Python, designed to be fast and efficient. It offers pre-trained models for various languages that can perform tasks such as part-of-speech tagging, named entity recognition, dependency parsing, and more.

- `string` library is a built-in Python module that provides a collection of string constants and helper functions for working with strings. It includes useful constants like ASCII letters, digits, punctuation characters, and whitespace characters.

- `nltk` library is a leading platform for building Python programs to work with human language data. It is widely used in natural language processing (NLP) tasks, making it a valuable tool for text analysis and machine learning applications.

- `re` is Python's built-in regular expression module. It provides functions for matching, searching, and replacing text based on patterns.
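
As a small illustration of the two built-in modules described above, the sketch below prints a few `string` constants and uses `re` to tidy whitespace and find @-mentions. The sample strings are made up for the example and are not taken from the dataset.

```python
import re
import string

# Constants exposed by the built-in string module
print(string.punctuation)      # the ASCII punctuation characters
print(string.ascii_lowercase)  # abcdefghijklmnopqrstuvwxyz
print(string.digits)           # 0123456789

# re.sub replaces every match of a pattern; here, runs of whitespace
sample = "Hello,   world!\tHow are\nyou?"  # made-up string
collapsed = re.sub(r"\s+", " ", sample)
print(collapsed)  # Hello, world! How are you?

# re.findall returns all non-overlapping matches; here, @-mentions
mentions = re.findall(r"@\w+", "@AppleSupport thanks, cc @105835")
print(mentions)  # ['@AppleSupport', '@105835']
```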

In [31]:
#Loading dataset

data = pd.read_csv("sample.csv")

#view dataset
data.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,119237,105834,True,Wed Oct 11 06:55:44 +0000 2017,@AppleSupport causing the reply to be disregar...,119236.0,
1,119238,ChaseSupport,False,Wed Oct 11 13:25:49 +0000 2017,@105835 Your business means a lot to us. Pleas...,,119239.0
2,119239,105835,True,Wed Oct 11 13:00:09 +0000 2017,@76328 I really hope you all change but I'm su...,119238.0,
3,119240,VirginTrains,False,Tue Oct 10 15:16:08 +0000 2017,@105836 LiveChat is online at the moment - htt...,119241.0,119242.0
4,119241,105836,True,Tue Oct 10 15:17:21 +0000 2017,@VirginTrains see attached error message. I've...,119243.0,119240.0


In [32]:
#shape of dataset
data.shape

(93, 7)

### ***We are interested in the `text` column; that is the column we will preprocess / clean. The preprocessing steps need to be chosen carefully based on what we want to do with the text. We will run `lower casing`, `removal of punctuations` and `stemming` on the text data.***

In [33]:
#Calling out the text data column
#Creating a new dataframe called text_data

text_data = data[["text"]].astype(str)

#view the new dataset

text_data

Unnamed: 0,text
0,@AppleSupport causing the reply to be disregar...
1,@105835 Your business means a lot to us. Pleas...
2,@76328 I really hope you all change but I'm su...
3,@105836 LiveChat is online at the moment - htt...
4,@VirginTrains see attached error message. I've...
...,...
88,@105860 I wish Amazon had an option of where I...
89,They reschedule my shit for tomorrow https://t...
90,"@105861 Hey Sara, sorry to hear of the issues ..."
91,@Tesco bit of both - finding the layout cumber...


In [34]:
#data information
text_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93 entries, 0 to 92
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    93 non-null     object
dtypes: object(1)
memory usage: 876.0+ bytes


## TEXT CLEANING 

### Lower Casing

- `Lower Casing`: Lower casing is a common text preprocessing technique. The idea is to convert the input text to a single casing format so that variants of the same word, for instance 'come', 'Come', and 'COME', are all treated the same way. This standardizes the text and shrinks the vocabulary, which often helps downstream natural language processing tasks.
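
The effect on the three spellings mentioned above can be sketched in plain Python. This is only an illustration of the idea; the variable names are made up for the example.

```python
variants = ["come", "Come", "COME"]

# Before lower casing, each spelling counts as a distinct token
print(len(set(variants)))  # 3

# After lower casing, all three collapse into one token
lowered = [word.lower() for word in variants]
print(set(lowered))  # {'come'}
```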

In [35]:
# Create a column called text_lowercase in text_data

text_data["text_lowercase"] = text_data["text"].str.lower()

text_data

Unnamed: 0,text,text_lowercase
0,@AppleSupport causing the reply to be disregar...,@applesupport causing the reply to be disregar...
1,@105835 Your business means a lot to us. Pleas...,@105835 your business means a lot to us. pleas...
2,@76328 I really hope you all change but I'm su...,@76328 i really hope you all change but i'm su...
3,@105836 LiveChat is online at the moment - htt...,@105836 livechat is online at the moment - htt...
4,@VirginTrains see attached error message. I've...,@virgintrains see attached error message. i've...
...,...,...
88,@105860 I wish Amazon had an option of where I...,@105860 i wish amazon had an option of where i...
89,They reschedule my shit for tomorrow https://t...,they reschedule my shit for tomorrow https://t...
90,"@105861 Hey Sara, sorry to hear of the issues ...","@105861 hey sara, sorry to hear of the issues ..."
91,@Tesco bit of both - finding the layout cumber...,@tesco bit of both - finding the layout cumber...


### Removal Of Punctuations

- `Removal of Punctuations`: This is a common text cleaning / preprocessing step that removes punctuation marks from the text so that tokens such as 'angry' and 'angry!' are treated the same way. Here we use the constant `string.punctuation` from the `string` module, which holds the ASCII punctuation characters. Because it is an ordinary string, we can add or remove characters to control exactly which punctuation is stripped, so the list should be chosen with care for the task at hand.
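
As a hedged sketch of how the punctuation list can be customized, the snippet below keeps `@` and `#` and strips everything else in `string.punctuation`. Keeping those two characters is an assumption made for illustration (they mark mentions and hashtags in tweets), and `remove_custom_punctuation` is a hypothetical helper, not part of this notebook.

```python
import string

# Assumption: keep "@" and "#" because they mark mentions and hashtags
keep = "@#"
custom_punct = "".join(ch for ch in string.punctuation if ch not in keep)

# Translation table that deletes every character in custom_punct
table = str.maketrans("", "", custom_punct)

def remove_custom_punctuation(text):
    """Remove punctuation except the characters listed in `keep`."""
    return text.translate(table)

example = "@AppleSupport I'm so angry!!! #fail"  # made-up tweet
print(remove_custom_punctuation(example))  # @AppleSupport Im so angry #fail
```

Whether to keep such characters depends on the task: they carry meaning for mention or hashtag analysis but are noise for many other uses.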

In [36]:
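# Store the punctuation characters to remove (the string.punctuation constant)
Punct_to_remove = string.punctuation

# Define a function that strips those characters from a piece of text
def remove_punctuation(text):
    # The third argument of str.maketrans lists the characters to delete
    return text.translate(str.maketrans('', '', Punct_to_remove))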


In [37]:
# Apply the function to the original text column
text_data["clean_text"] = text_data["text"].apply(remove_punctuation)

# View the dataset to see the clean_text column
text_data

Unnamed: 0,text,text_lowercase,clean_text
0,@AppleSupport causing the reply to be disregar...,@applesupport causing the reply to be disregar...,AppleSupport causing the reply to be disregard...
1,@105835 Your business means a lot to us. Pleas...,@105835 your business means a lot to us. pleas...,105835 Your business means a lot to us Please ...
2,@76328 I really hope you all change but I'm su...,@76328 i really hope you all change but i'm su...,76328 I really hope you all change but Im sure...
3,@105836 LiveChat is online at the moment - htt...,@105836 livechat is online at the moment - htt...,105836 LiveChat is online at the moment https...
4,@VirginTrains see attached error message. I've...,@virgintrains see attached error message. i've...,VirginTrains see attached error message Ive tr...
...,...,...,...
88,@105860 I wish Amazon had an option of where I...,@105860 i wish amazon had an option of where i...,105860 I wish Amazon had an option of where I ...
89,They reschedule my shit for tomorrow https://t...,they reschedule my shit for tomorrow https://t...,They reschedule my shit for tomorrow httpstcoR...
90,"@105861 Hey Sara, sorry to hear of the issues ...","@105861 hey sara, sorry to hear of the issues ...",105861 Hey Sara sorry to hear of the issues yo...
91,@Tesco bit of both - finding the layout cumber...,@tesco bit of both - finding the layout cumber...,Tesco bit of both finding the layout cumberso...


### Stemming

- `Stemming`: Stemming is a text normalization technique in natural language processing that reduces words to their stem (root or base form) by stripping affixes. The resulting stem does not have to be a valid word.

Illustration: given words like `riding` or `rides`, stemming strips the suffix and reduces both to `ride`.

- Stemming can help text analysis tasks such as information retrieval, text classification, and sentiment analysis by treating different forms of the same word as a single entity. Note that it will not always produce valid words, because it applies rule-based suffix stripping rather than looking words up in a dictionary.
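
A minimal sketch of this behaviour on single words, using NLTK's `PorterStemmer`. The word list is chosen for illustration: the first two words come from the illustration above, and the rest appear in the sample tweets.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["riding", "rides", "business", "causing", "reply"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
# riding -> ride
# rides -> ride
# business -> busi
# causing -> caus
# reply -> repli
```

The last three stems (`busi`, `caus`, `repli`) are not valid English words, which is the trade-off described above.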

In [38]:
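from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Note: word_tokenize needs NLTK's Punkt tokenizer data; if it is missing,
# download it once with nltk.download('punkt') ('punkt_tab' on newer NLTK releases).

# Initialize the PorterStemmer
stemmer = PorterStemmer()

# Function to perform stemming on text
def stem_text(text):
    words = word_tokenize(text)  # split the text into word and punctuation tokens
    stemmed_words = [stemmer.stem(word) for word in words]  # stem() also lower-cases each token by default
    return ' '.join(stemmed_words)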

In [39]:
# Create a new column 'stemmed_text' with the stemmed version of our text
text_data['stemmed_text'] = text_data['text'].apply(stem_text)

#view the dataset to see the new column
text_data

Unnamed: 0,text,text_lowercase,clean_text,stemmed_text
0,@AppleSupport causing the reply to be disregar...,@applesupport causing the reply to be disregar...,AppleSupport causing the reply to be disregard...,@ applesupport caus the repli to be disregard ...
1,@105835 Your business means a lot to us. Pleas...,@105835 your business means a lot to us. pleas...,105835 Your business means a lot to us Please ...,@ 105835 your busi mean a lot to us . pleas dm...
2,@76328 I really hope you all change but I'm su...,@76328 i really hope you all change but i'm su...,76328 I really hope you all change but Im sure...,@ 76328 i realli hope you all chang but i 'm s...
3,@105836 LiveChat is online at the moment - htt...,@105836 livechat is online at the moment - htt...,105836 LiveChat is online at the moment https...,@ 105836 livechat is onlin at the moment - htt...
4,@VirginTrains see attached error message. I've...,@virgintrains see attached error message. i've...,VirginTrains see attached error message Ive tr...,@ virgintrain see attach error messag . i 've ...
...,...,...,...,...
88,@105860 I wish Amazon had an option of where I...,@105860 i wish amazon had an option of where i...,105860 I wish Amazon had an option of where I ...,@ 105860 i wish amazon had an option of where ...
89,They reschedule my shit for tomorrow https://t...,they reschedule my shit for tomorrow https://t...,They reschedule my shit for tomorrow httpstcoR...,they reschedul my shit for tomorrow http : //t...
90,"@105861 Hey Sara, sorry to hear of the issues ...","@105861 hey sara, sorry to hear of the issues ...",105861 Hey Sara sorry to hear of the issues yo...,"@ 105861 hey sara , sorri to hear of the issu ..."
91,@Tesco bit of both - finding the layout cumber...,@tesco bit of both - finding the layout cumber...,Tesco bit of both finding the layout cumberso...,@ tesco bit of both - find the layout cumberso...
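
Each of the three steps above was applied to the raw `text` column on its own, so `clean_text` keeps its capital letters and `stemmed_text` keeps its punctuation. As a minimal sketch, the steps could also be chained into a single function. The name `preprocess` is hypothetical, and plain whitespace splitting stands in for `word_tokenize` here (an assumption, so the snippet needs no extra NLTK downloads).

```python
import string
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
punct_table = str.maketrans("", "", string.punctuation)

def preprocess(text):
    """Lower-case, strip punctuation, then stem each whitespace-separated token."""
    text = text.lower()
    text = text.translate(punct_table)
    # Assumption: simple whitespace tokenization instead of word_tokenize
    return " ".join(stemmer.stem(token) for token in text.split())

example = "@105835 Your business means a lot to us."
result = preprocess(example)
print(result)  # 105835 your busi mean a lot to us
```

Under the same assumptions, the function could be applied to the whole column with `text_data["text"].apply(preprocess)`.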
