# The Cleaning Text Data Process

The first step is **normalization**, it converts messy data into a neat standard, which it can do by removing stop words like 'the' and 'a' as well as converting sentences to lowercase. Additional steps like stemming (converting words to their original form like going and gone to go) can also be applied during normalization. The next step is **tokenization**, which divides words into individual tokens in the dataset. The last step is **vectorization** (like word embedding), where each token is converted into a machine readable, useful for training machine learning models.


!["Text Data Cleaning Process"](./Cleaning%20Text%20Data%20Process.png)

# Import Libraries

In [19]:
import pandas as pd
import nltk #Tokenization Library
nltk.download('punkt') #To fix a bug

from sklearn.feature_extraction.text import CountVectorizer

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\MrIzzat\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Load Dataset

In [13]:
text_data = pd.read_csv('./text_data.csv')

# Normalizing the data

During this normalization, any unnecessary puncation is removed. Then, all the letters are converted to lower case.

In [14]:
text_data.head()

Unnamed: 0,sentiment,post
0,positive,"""I love technology, especially data science!"""
1,neutral,"""Data science is a part of newer technologies"""
2,negative,"""Technology is a hindrance to our development"""


## Converting all the letters to lowercase 

In [15]:
text_data['post'] = text_data['post'].str.capitalize()
text_data['post']

0     "i love technology, especially data science!"
1    "data science is a part of newer technologies"
2    "technology is a hindrance to our development"
Name: post, dtype: object

## Removing all puncation from the text

In [16]:
text_data['post'] = text_data['post'].str.replace(r'[^\w\s]','')
text_data['post']

0     "i love technology, especially data science!"
1    "data science is a part of newer technologies"
2    "technology is a hindrance to our development"
Name: post, dtype: object

## Finding the topic of each sentence and placing it into it's own seperate row

In [17]:
text_data.loc[text_data['post'].str.contains('data science'),'Topic'] = 'data science'
text_data.head()

Unnamed: 0,sentiment,post,Topic
0,positive,"""i love technology, especially data science!""",data science
1,neutral,"""data science is a part of newer technologies""",data science
2,negative,"""technology is a hindrance to our development""",


During normalization, extra things can be done to the text such as removing unnecessary stop words like "The", "a" etc. as well as stemming.

# Tokenizing the Data

During this step, each word will be stored seperately in a list. The list is placed into it's own column called `tokens`.

In [18]:
text_data['tokens'] = text_data['post'].apply(nltk.word_tokenize)
text_data

Unnamed: 0,sentiment,post,Topic,tokens
0,positive,"""i love technology, especially data science!""",data science,"[``, i, love, technology, ,, especially, data,..."
1,neutral,"""data science is a part of newer technologies""",data science,"[``, data, science, is, a, part, of, newer, te..."
2,negative,"""technology is a hindrance to our development""",,"[``, technology, is, a, hindrance, to, our, de..."


# Vectorizing the Data

During this step, each token is converted to a certain numeric value, using the `CountVectorizer()` function from the sklearn library

## Creating a vectorizer

In [24]:
vectorizer = CountVectorizer()

words_matrix = vectorizer.fit_transform(text_data['post'].values)
words_matrix.toarray()

array([[1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0],
       [1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0],
       [0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1]], dtype=int64)

## Getting the number of times each word appears per sentence

In [22]:
counts = pd.DataFrame(words_matrix.toarray(),
                     columns=vectorizer.get_feature_names_out())
counts

Unnamed: 0,data,development,especially,hindrance,is,love,newer,of,our,part,science,technologies,technology,to
0,1,0,1,0,0,1,0,0,0,0,1,0,1,0
1,1,0,0,0,1,0,1,1,0,1,1,1,0,0
2,0,1,0,1,1,0,0,0,1,0,0,0,1,1


## Viewing the vector to word dictionary

In [23]:
print(vectorizer.vocabulary_)

{'love': 5, 'technology': 12, 'especially': 2, 'data': 0, 'science': 10, 'is': 4, 'part': 9, 'of': 7, 'newer': 6, 'technologies': 11, 'hindrance': 3, 'to': 13, 'our': 8, 'development': 1}
