<a href="https://colab.research.google.com/github/ColstonBod-oy/sentiment-text-preprocessing/blob/main/Sentiment_Text_Preprocessing_using_NLTK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Text Preprocessing with Python using the NLTK Library**

NLTK (Natural Language Toolkit: https://www.nltk.org) is a powerful open-source library built for Python and is used to create systems for Natural Language Processing (NLP).

It has various tools for both preprocessing and analyzing text data, allowing one set of text to be analyzed using multiple tools to glean multiple insights into the data. **The version of NLTK we will be using for our activity is NLTK 3 for Python 3. Guides or commands written for NLTK 2 on Python 2 will not work in conjunction with our sample code.**

***Kindly go over the NLTK documentation (direct link here: https://nltk.readthedocs.io/en/latest/index.html) that we will be exploring for text data management. We will focus exclusively on text preprocessing for our sample code, but you might encounter NLP commands and tools which are meant for analysis. You do not have to use those tools for our sample code.***

Make sure to read all the steps one by one. DO NOT RUN THE ENTIRE CODE AS IS OR USE CTRL+F9 AS PARTS OF IT WILL NOT WORK TOGETHER. Only run each cell as needed once you have read and understood what each code block does. Take note that you will NEED to make modifications to the code before it will be able to import your data from Google Drive. **Make sure you click "File" > "Save a Copy in Drive" before making any modifications or running the code below.**

# Importing our Libraries

**We start by importing all the required libraries and tools that will be needed for our code to work.** We will be using many of these libraries in our sample code, so make sure that all of them are imported correctly, otherwise the sample code will not run as intended.

***These are the key libraries to take note of:***
1. **import re**: This imports the re library, which stands for Regular Expressions. Regular expressions use the backslash character ("\\") to indicate special forms or to allow special characters to be used without invoking their special meaning.
2. **import string**: This imports the string library, which as the name suggests is a module of string utilities. It provides ready-made constants of alphanumeric and punctuation characters, which are useful when doing text preprocessing as they provide a quick set of template values to start with.
3. **import pandas as pd**: This imports the pandas library, which is an open source data management library that will help us upload and process our text data. We rename it to **pd** to make it easier to refer to in our code.
4. **import nltk**: This imports the base dependencies for the NLTK library, which is what we will be using for the bulk of our text preprocessing tasks for this sample code. Take note that we will also be importing additional modules for other text preprocessing tasks that we will be using throughout the module, such as the tools for tokenizing and tagging.
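
To make the role of the backslash and of the string constants more concrete, here is a minimal sketch using only `re` and `string`. The sample text is made up for illustration and is not part of our datasets.

```python
import re
import string

# An unescaped "." matches any character, while "\." matches a literal period
any_char = re.findall(r".", "a.b")
literal_dot = re.findall(r"\.", "a.b")
print(any_char)      # ['a', '.', 'b']
print(literal_dot)   # ['.']

# string.punctuation is a ready-made template of ASCII punctuation characters;
# re.escape() adds the backslashes needed to use them safely inside a pattern
punctuation_pattern = "[" + re.escape(string.punctuation) + "]"

sample = "Price: $5.00 (approx.)"   # made-up sample text
stripped = re.sub(punctuation_pattern, "", sample)
print(stripped)      # Price 500 approx
```

Writing patterns as raw strings (the `r"..."` prefix) keeps Python from interpreting the backslashes before the `re` library sees them.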

In [1]:
import re
import string
import pandas as pd
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Importing our Text

**Our next step is uploading the text that we will be processing for the activity.** The text data that will be used for this activity can come from three main sources: online text repositories that have public text data such as Kaggle (https://www.kaggle.com/) and HuggingFace (https://huggingface.co/), new text data you have downloaded yourself using the web scraper tool from the previous activity, or a specialized API that you can use such as the Twitter/X API to scrape a specific site only.

The datasets that are used here can be found on these links:

Dataset 1 (https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment)

Dataset 2 (https://www.kaggle.com/datasets/kashishparmar02/social-media-sentiments-analysis-dataset)

**You can also make use of the sample data that is uploaded here: https://bit.ly/cc19datasets**



**Uploading using Google Drive**

We have two options for importing our data. The first is to upload our data to Google Drive, then link our Google Drive account to access our files here directly. You will need to authorize Colab to connect with your Google Drive account in order for this to work. **This is faster, but requires you to upload your files to Google Drive, which may not be possible if you have run out of space.**

In [None]:
from google.colab import drive
drive.mount('/content/drive')
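
Once Drive is mounted at `/content/drive`, your own files normally appear under the `MyDrive` folder. The sketch below only builds the path that you would later hand to `pd.read_csv`. The `datasets` folder name is a hypothetical example, so replace it with wherever you stored your file.

```python
from pathlib import PurePosixPath

# Google Drive contents appear under "MyDrive" inside the mount point
DRIVE_ROOT = PurePosixPath("/content/drive/MyDrive")

# "datasets" is a hypothetical folder name; replace it with your own folder
csv_path = DRIVE_ROOT / "datasets" / "Tweets.csv"
print(csv_path)   # /content/drive/MyDrive/datasets/Tweets.csv

# The resulting string can then be passed to pd.read_csv(str(csv_path))
```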

**Uploading directly to Google Colab**

Our second option for uploading data is to upload our files directly to Google Colab. **This is slower, but does not require you to set aside space in Google Drive.**

In [3]:
from google.colab import files
uploaded = files.upload()

Saving Tweets.csv to Tweets.csv


**Once we have uploaded our files, we will then assign our file to a Pandas DataFrame data type, which is what we will use to process our text.** Take note that the root folder directory is **/content/** for the file manager in colab.

We will upload our dataset (in csv form) to our DataFrame and then display the contents of the dataset to check if it was uploaded correctly. It should display the first five and last five entries in our dataset.

In [4]:
pd.set_option('display.max_colwidth', 50)
text_dataset = pd.read_csv('/content/Tweets.csv')
text_dataset_2 = pd.read_csv('/content/sentimentdataset.csv')
display(text_dataset)
display(text_dataset_2)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0000,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0000,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0000,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0000,Can't Tell,1.0000,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14635,569587686496825344,positive,0.3487,,0.0000,American,,KristenReenders,,0,@AmericanAir thank you we got on a different f...,,2015-02-22 12:01:01 -0800,,
14636,569587371693355008,negative,1.0000,Customer Service Issue,1.0000,American,,itsropes,,0,@AmericanAir leaving over 20 minutes Late Flig...,,2015-02-22 11:59:46 -0800,Texas,
14637,569587242672398336,neutral,1.0000,,,American,,sanyabun,,0,@AmericanAir Please bring American Airlines to...,,2015-02-22 11:59:15 -0800,"Nigeria,lagos",
14638,569587188687634433,negative,1.0000,Customer Service Issue,0.6659,American,,SraJackson,,0,"@AmericanAir you have my money, you change my ...",,2015-02-22 11:59:02 -0800,New Jersey,Eastern Time (US & Canada)


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Text,Sentiment,Timestamp,User,Platform,Hashtags,Retweets,Likes,Country,Year,Month,Day,Hour
0,0,0,Enjoying a beautiful day at the park! ...,Positive,2023-01-15 12:30:00,User123,Twitter,#Nature #Park,15.0,30.0,USA,2023,1,15,12
1,1,1,Traffic was terrible this morning. ...,Negative,2023-01-15 08:45:00,CommuterX,Twitter,#Traffic #Morning,5.0,10.0,Canada,2023,1,15,8
2,2,2,Just finished an amazing workout! 💪 ...,Positive,2023-01-15 15:45:00,FitnessFan,Instagram,#Fitness #Workout,20.0,40.0,USA,2023,1,15,15
3,3,3,Excited about the upcoming weekend getaway! ...,Positive,2023-01-15 18:20:00,AdventureX,Facebook,#Travel #Adventure,8.0,15.0,UK,2023,1,15,18
4,4,4,Trying out a new recipe for dinner tonight. ...,Neutral,2023-01-15 19:55:00,ChefCook,Instagram,#Cooking #Food,12.0,25.0,Australia,2023,1,15,19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
727,728,732,Collaborating on a science project that receiv...,Happy,2017-08-18 18:20:00,ScienceProjectSuccessHighSchool,Facebook,#ScienceFairWinner #HighSchoolScience,20.0,39.0,UK,2017,8,18,18
728,729,733,Attending a surprise birthday party organized ...,Happy,2018-06-22 14:15:00,BirthdayPartyJoyHighSchool,Instagram,#SurpriseCelebration #HighSchoolFriendship,25.0,48.0,USA,2018,6,22,14
729,730,734,Successfully fundraising for a school charity ...,Happy,2019-04-05 17:30:00,CharityFundraisingTriumphHighSchool,Twitter,#CommunityGiving #HighSchoolPhilanthropy,22.0,42.0,Canada,2019,4,5,17
730,731,735,"Participating in a multicultural festival, cel...",Happy,2020-02-29 20:45:00,MulticulturalFestivalJoyHighSchool,Facebook,#CulturalCelebration #HighSchoolUnity,21.0,43.0,UK,2020,2,29,20
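
Besides displaying the tables, a few quick checks can confirm that a CSV file was read correctly. The sketch below uses a tiny in-memory stand-in for the CSV file: only the column names mirror the real file, and the second and third rows are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for Tweets.csv; only the column names mirror the real file
csv_data = io.StringIO(
    "airline_sentiment,airline,text\n"
    "neutral,Virgin America,@VirginAmerica What @dhepburn said.\n"
    "positive,Example Air,Great flight today!\n"
    "negative,Example Air,\n"
)
sample_df = pd.read_csv(csv_data)

# Number of rows and columns that were read
print(sample_df.shape)
# Column names, useful for deciding which columns to keep later
print(sample_df.columns.tolist())
# First two rows only
print(sample_df.head(2))
# Count of missing values per column (the last row has no text)
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```

Checking for missing values early is helpful, since empty text cells are read in as missing values rather than as strings.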


# Text Preprocessing

**Now, we can start preprocessing our text with the tools that are available in NLTK.** We will only use a small set of core tools and techniques for our text preprocessing code, but you are ***encouraged to experiment with the different options and commands that are available for NLTK for processing data.***

**Make sure to check the documentation for detailed descriptions of all the Text Preprocessing and even Analysis tools available for NLTK.** We will only be looking at the simple preprocessing techniques that were discussed in the class, as many of the tools that are available in NLTK are focused on NLP, which is centered on analysis and will be covered in future topics.

***Cleaning our Text***

**Now that we have our dataset, the first important step is to clean our text data.** This will allow us to remove extra, unnecessary, and erroneous data from our dataset, making it easier to process later on.

**We start by choosing which columns we will keep from our original dataset.** The reason for this is that keeping all the columns will make the processing of data take longer, and we don't necessarily need all our data for analysis. You are free to keep more or fewer columns from your original dataset.

In [5]:
specific_text = text_dataset[["airline", "text"]]
specific_text_2 = text_dataset_2[["User", "Text"]]
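
As a small illustration of column selection, the sketch below uses a made-up three-column table. Adding `.copy()` is an optional habit, not something the cell above requires; it gives us an independent table, so new columns added to the selection never touch the original.

```python
import pandas as pd

# Made-up miniature dataset; the second row is hypothetical
full_df = pd.DataFrame({
    "airline": ["Virgin America", "American"],
    "text": ["@VirginAmerica What @dhepburn said.", "hypothetical second tweet"],
    "retweet_count": [0, 0],
})

# Double square brackets select a list of columns and return a DataFrame
subset = full_df[["airline", "text"]].copy()

# New columns on the copy do not appear in the original table
subset["text_length"] = subset["text"].str.len()

print(subset.columns.tolist())
print(full_df.columns.tolist())
```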

**We create a custom function that will allow us to apply multiple preprocessing techniques to our text.** While we can apply each technique to our text separately, using a custom function allows us to process the text multiple times in one go, making the text clean and outputting it into a new column. In this case, we will be using two sets of cleaning techniques for our text to make it easier to process. We will be **normalizing the text**, and **reducing noise in the text**.

This function accepts three inputs:
1. df: ***The text dataset we will be using***
2. col: ***The column of text that will be cleaned***
3. clean_col: ***The new column where the cleaned text will be placed***

**This is achieved by using lambda functions in Python, and the re library that we imported earlier.** This allows us to go through each line of text in our dataset one by one and edit it using specific techniques. *Stopwords were left in as they might affect the accuracy of the tagging process later on if they are removed.*

In [6]:
def text_cleaning(df, col, clean_col):
  # Create a copy of the input DataFrame to avoid modifying the original DataFrame
  cleaned_df = df.copy()
  # This changes the text to lower case and removes spaces on either side
  cleaned_df[clean_col] = cleaned_df[col].apply(lambda text: text.lower().strip())
  # This removes extra spaces in between the text
  cleaned_df[clean_col] = cleaned_df[clean_col].apply(lambda text: re.sub(r' +', ' ', text))
  # This removes the special characters from the text
  cleaned_df[clean_col] = cleaned_df[clean_col].apply(lambda text: re.sub(r'[^\w\s]', '', text))
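  # (Optional) This removes the stopwords from the text and stems the remaining words.
  # It is left commented out so that the stopwords stay available for tagging later on.
  # cleaned_df[clean_col] = cleaned_df[clean_col].apply(lambda text: ' '.join(stemmer.stem(word) for word in text.split() if word not in stop_words))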

  return cleaned_df
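
To see what each cleaning step does, it can help to try the same regular expressions on single strings. The sketch below is a hypothetical single-string variant (`clean_text` is not part of the notebook). It also changes the order slightly: special characters are removed first and whitespace is collapsed last, so that removing a symbol surrounded by spaces does not leave a double space behind.

```python
import re

def clean_text(text):
    # Normalize: lower case and trim spaces on either side
    text = text.lower().strip()
    # Reduce noise: drop everything that is not a word character or whitespace
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse any run of whitespace into a single space, then trim again
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_text("@VirginAmerica What @dhepburn said."))
# virginamerica what dhepburn said

print(clean_text("Just finished an amazing workout! 💪"))
# just finished an amazing workout

print(clean_text("Wait - what?!  "))
# wait what
```

The same function could be applied to a whole column with `df[col].apply(clean_text)`.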

**We then use our function by entering the correct data for the dataset, the column to be cleaned, and the new column to be created.** We then display our tables of data with the newly cleaned text. If you would like to clean multiple columns, then you would need to call this function multiple times and point to the new columns that you would like to clean.

In [7]:
specific_text = text_cleaning(specific_text, "text", "cleaned_text")
specific_text_2 = text_cleaning(specific_text_2, "Text", "cleaned_text")
display(specific_text)
display(specific_text_2)

Unnamed: 0,airline,text,cleaned_text
0,Virgin America,@VirginAmerica What @dhepburn said.,virginamerica what dhepburn said
1,Virgin America,@VirginAmerica plus you've added commercials t...,virginamerica plus youve added commercials to ...
2,Virgin America,@VirginAmerica I didn't today... Must mean I n...,virginamerica i didnt today must mean i need t...
3,Virgin America,@VirginAmerica it's really aggressive to blast...,virginamerica its really aggressive to blast o...
4,Virgin America,@VirginAmerica and it's a really big bad thing...,virginamerica and its a really big bad thing a...
...,...,...,...
14635,American,@AmericanAir thank you we got on a different f...,americanair thank you we got on a different fl...
14636,American,@AmericanAir leaving over 20 minutes Late Flig...,americanair leaving over 20 minutes late fligh...
14637,American,@AmericanAir Please bring American Airlines to...,americanair please bring american airlines to ...
14638,American,"@AmericanAir you have my money, you change my ...",americanair you have my money you change my fl...


Unnamed: 0,User,Text,cleaned_text
0,User123,Enjoying a beautiful day at the park! ...,enjoying a beautiful day at the park
1,CommuterX,Traffic was terrible this morning. ...,traffic was terrible this morning
2,FitnessFan,Just finished an amazing workout! 💪 ...,just finished an amazing workout
3,AdventureX,Excited about the upcoming weekend getaway! ...,excited about the upcoming weekend getaway
4,ChefCook,Trying out a new recipe for dinner tonight. ...,trying out a new recipe for dinner tonight
...,...,...,...
727,ScienceProjectSuccessHighSchool,Collaborating on a science project that receiv...,collaborating on a science project that receiv...
728,BirthdayPartyJoyHighSchool,Attending a surprise birthday party organized ...,attending a surprise birthday party organized ...
729,CharityFundraisingTriumphHighSchool,Successfully fundraising for a school charity ...,successfully fundraising for a school charity ...
730,MulticulturalFestivalJoyHighSchool,"Participating in a multicultural festival, cel...",participating in a multicultural festival cele...
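
If your later analysis does not depend on tagging, you may want to try the optional stopword removal and stemming step. The following is a minimal sketch under stated assumptions: `demo_stop_words` is a small hand-picked set used only to keep the example self-contained, and `remove_stopwords_and_stem` is a hypothetical helper name. In the notebook, the `stop_words` set built from `nltk.corpus.stopwords` would take the place of the demo set.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Hypothetical, hand-picked stopword list used only for this sketch;
# the notebook's stop_words set would normally be used instead
demo_stop_words = {"the", "were", "a", "an", "is"}

def remove_stopwords_and_stem(text):
    # Keep only the words that are not stopwords
    kept = [word for word in text.split() if word not in demo_stop_words]
    # Reduce each remaining word to its stem
    return " ".join(stemmer.stem(word) for word in kept)

result = remove_stopwords_and_stem("the flights were delayed")
print(result)   # flight delay
```

Note that stemming is applied word by word, after the stopwords have been filtered out.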


***Tokenizing our Text***

**Once we have cleaned our text, our next step is to tokenize our text.** This means that we will split our text into smaller chunks, which will be used for better analysis and tagging later on. In this case, we will be making use of **word tokenization** for our given text.

**In order to ensure that our tokenization method will work, we first need to convert our cleaned text to strings.** This is done to ensure compatibility with the tokenization method, since cleaning it will sometimes cause some of the data to change data type, or some of the data was a different data type to begin with.

In [8]:
specific_text['cleaned_text'] = specific_text['cleaned_text'].astype(str)
specific_text_2['cleaned_text'] = specific_text_2['cleaned_text'].astype(str)
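
Missing values deserve a little extra care during this conversion. The sketch below uses a made-up column that mixes text, a missing value, and a number. Filling the gaps with an empty string before converting is one possible approach, not the only one. Since calling `.lower()` on a missing value raises an error, you may also find it useful to do this before the cleaning step if your dataset has empty text cells.

```python
import pandas as pd

# Made-up column mixing a string, a missing value, and a number
mixed = pd.Series(["hello world", None, 42])

# Replace missing values with an empty string, then convert everything to text
as_text = mixed.fillna("").astype(str)

print(as_text.tolist())   # ['hello world', '', '42']
```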

**Once we have converted our cleaned text to strings, we can then tokenize our text.** Similar to the text cleaning process, we will then place the tokenized text into a new column so that it will be easier to keep track of. Take note that this process may take a while if you have a lot of text data, so be patient and don't cancel the process if it is still running. Once we have tokenized all the cleaned text, we then display the updated tables with the raw/original text, the cleaned text, and the tokenized text.

In [9]:
specific_text['tokenized_text'] = specific_text['cleaned_text'].apply(word_tokenize)
specific_text_2['tokenized_text'] = specific_text_2['cleaned_text'].apply(word_tokenize)
display(specific_text)
display(specific_text_2)

Unnamed: 0,airline,text,cleaned_text,tokenized_text
0,Virgin America,@VirginAmerica What @dhepburn said.,virginamerica what dhepburn said,"[virginamerica, what, dhepburn, said]"
1,Virgin America,@VirginAmerica plus you've added commercials t...,virginamerica plus youve added commercials to ...,"[virginamerica, plus, youve, added, commercial..."
2,Virgin America,@VirginAmerica I didn't today... Must mean I n...,virginamerica i didnt today must mean i need t...,"[virginamerica, i, didnt, today, must, mean, i..."
3,Virgin America,@VirginAmerica it's really aggressive to blast...,virginamerica its really aggressive to blast o...,"[virginamerica, its, really, aggressive, to, b..."
4,Virgin America,@VirginAmerica and it's a really big bad thing...,virginamerica and its a really big bad thing a...,"[virginamerica, and, its, a, really, big, bad,..."
...,...,...,...,...
14635,American,@AmericanAir thank you we got on a different f...,americanair thank you we got on a different fl...,"[americanair, thank, you, we, got, on, a, diff..."
14636,American,@AmericanAir leaving over 20 minutes Late Flig...,americanair leaving over 20 minutes late fligh...,"[americanair, leaving, over, 20, minutes, late..."
14637,American,@AmericanAir Please bring American Airlines to...,americanair please bring american airlines to ...,"[americanair, please, bring, american, airline..."
14638,American,"@AmericanAir you have my money, you change my ...",americanair you have my money you change my fl...,"[americanair, you, have, my, money, you, chang..."


Unnamed: 0,User,Text,cleaned_text,tokenized_text
0,User123,Enjoying a beautiful day at the park! ...,enjoying a beautiful day at the park,"[enjoying, a, beautiful, day, at, the, park]"
1,CommuterX,Traffic was terrible this morning. ...,traffic was terrible this morning,"[traffic, was, terrible, this, morning]"
2,FitnessFan,Just finished an amazing workout! 💪 ...,just finished an amazing workout,"[just, finished, an, amazing, workout]"
3,AdventureX,Excited about the upcoming weekend getaway! ...,excited about the upcoming weekend getaway,"[excited, about, the, upcoming, weekend, getaway]"
4,ChefCook,Trying out a new recipe for dinner tonight. ...,trying out a new recipe for dinner tonight,"[trying, out, a, new, recipe, for, dinner, ton..."
...,...,...,...,...
727,ScienceProjectSuccessHighSchool,Collaborating on a science project that receiv...,collaborating on a science project that receiv...,"[collaborating, on, a, science, project, that,..."
728,BirthdayPartyJoyHighSchool,Attending a surprise birthday party organized ...,attending a surprise birthday party organized ...,"[attending, a, surprise, birthday, party, orga..."
729,CharityFundraisingTriumphHighSchool,Successfully fundraising for a school charity ...,successfully fundraising for a school charity ...,"[successfully, fundraising, for, a, school, ch..."
730,MulticulturalFestivalJoyHighSchool,"Participating in a multicultural festival, cel...",participating in a multicultural festival cele...,"[participating, in, a, multicultural, festival..."
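
To get a feel for what word tokenization does to one line of text, here is a small sketch. It swaps in NLTK's `TreebankWordTokenizer` instead of `word_tokenize`, purely because it needs no downloaded data. The two are closely related but not identical, so treat this as an illustration rather than a replacement for the cell above.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# On cleaned text (lower case, no punctuation) the tokens are simply the words
cleaned = "virginamerica what dhepburn said"
tokens = tokenizer.tokenize(cleaned)
print(tokens)
print(tokens == cleaned.split())

# On raw text the tokenizer also has to decide how to treat symbols and punctuation
raw = "@VirginAmerica What @dhepburn said."
print(tokenizer.tokenize(raw))
```

This is one reason we clean the text first: with punctuation and symbols already removed, the tokens line up with the words we care about.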


***Tagging our Text***

**After tokenizing our text, the last step is to tag our text.** This will involve categorizing the words in our text data into a fixed set of categories based on their characteristics. **In our case, we will be using NLTK's pretrained tagger (the averaged perceptron tagger that we downloaded earlier) to conduct POS (Parts-of-Speech) tagging for our text.** This will tag each word in our dataset with the appropriate POS tag based on the word itself and its position or usage in the text that it is in.

This process will take the longest time to complete, as the NLTK model will have to go through each word individually and tag it based on what it has learned about POS tagging. Same as before, just let the process finish as long as Colab has not crashed or frozen. Once the tagging process is complete, we then display the final updated tables with the raw/original text, the cleaned text, the tokenized text, and finally the tagged text.

In [10]:
specific_text['tagged_text'] = specific_text['tokenized_text'].apply(pos_tag)
specific_text_2['tagged_text'] = specific_text_2['tokenized_text'].apply(pos_tag)
display(specific_text)
display(specific_text_2)

Unnamed: 0,airline,text,cleaned_text,tokenized_text,tagged_text
0,Virgin America,@VirginAmerica What @dhepburn said.,virginamerica what dhepburn said,"[virginamerica, what, dhepburn, said]","[(virginamerica, NN), (what, WP), (dhepburn, N..."
1,Virgin America,@VirginAmerica plus you've added commercials t...,virginamerica plus youve added commercials to ...,"[virginamerica, plus, youve, added, commercial...","[(virginamerica, NN), (plus, CC), (youve, NN),..."
2,Virgin America,@VirginAmerica I didn't today... Must mean I n...,virginamerica i didnt today must mean i need t...,"[virginamerica, i, didnt, today, must, mean, i...","[(virginamerica, NN), (i, NN), (didnt, VBP), (..."
3,Virgin America,@VirginAmerica it's really aggressive to blast...,virginamerica its really aggressive to blast o...,"[virginamerica, its, really, aggressive, to, b...","[(virginamerica, NN), (its, PRP$), (really, RB..."
4,Virgin America,@VirginAmerica and it's a really big bad thing...,virginamerica and its a really big bad thing a...,"[virginamerica, and, its, a, really, big, bad,...","[(virginamerica, NN), (and, CC), (its, PRP$), ..."
...,...,...,...,...,...
14635,American,@AmericanAir thank you we got on a different f...,americanair thank you we got on a different fl...,"[americanair, thank, you, we, got, on, a, diff...","[(americanair, NN), (thank, NN), (you, PRP), (..."
14636,American,@AmericanAir leaving over 20 minutes Late Flig...,americanair leaving over 20 minutes late fligh...,"[americanair, leaving, over, 20, minutes, late...","[(americanair, NN), (leaving, VBG), (over, IN)..."
14637,American,@AmericanAir Please bring American Airlines to...,americanair please bring american airlines to ...,"[americanair, please, bring, american, airline...","[(americanair, JJ), (please, NN), (bring, VB),..."
14638,American,"@AmericanAir you have my money, you change my ...",americanair you have my money you change my fl...,"[americanair, you, have, my, money, you, chang...","[(americanair, NN), (you, PRP), (have, VBP), (..."


Unnamed: 0,User,Text,cleaned_text,tokenized_text,tagged_text
0,User123,Enjoying a beautiful day at the park! ...,enjoying a beautiful day at the park,"[enjoying, a, beautiful, day, at, the, park]","[(enjoying, VBG), (a, DT), (beautiful, JJ), (d..."
1,CommuterX,Traffic was terrible this morning. ...,traffic was terrible this morning,"[traffic, was, terrible, this, morning]","[(traffic, NN), (was, VBD), (terrible, JJ), (t..."
2,FitnessFan,Just finished an amazing workout! 💪 ...,just finished an amazing workout,"[just, finished, an, amazing, workout]","[(just, RB), (finished, VBN), (an, DT), (amazi..."
3,AdventureX,Excited about the upcoming weekend getaway! ...,excited about the upcoming weekend getaway,"[excited, about, the, upcoming, weekend, getaway]","[(excited, VBN), (about, IN), (the, DT), (upco..."
4,ChefCook,Trying out a new recipe for dinner tonight. ...,trying out a new recipe for dinner tonight,"[trying, out, a, new, recipe, for, dinner, ton...","[(trying, VBG), (out, RP), (a, DT), (new, JJ),..."
...,...,...,...,...,...
727,ScienceProjectSuccessHighSchool,Collaborating on a science project that receiv...,collaborating on a science project that receiv...,"[collaborating, on, a, science, project, that,...","[(collaborating, VBG), (on, IN), (a, DT), (sci..."
728,BirthdayPartyJoyHighSchool,Attending a surprise birthday party organized ...,attending a surprise birthday party organized ...,"[attending, a, surprise, birthday, party, orga...","[(attending, VBG), (a, DT), (surprise, NN), (b..."
729,CharityFundraisingTriumphHighSchool,Successfully fundraising for a school charity ...,successfully fundraising for a school charity ...,"[successfully, fundraising, for, a, school, ch...","[(successfully, RB), (fundraising, VBG), (for,..."
730,MulticulturalFestivalJoyHighSchool,"Participating in a multicultural festival, cel...",participating in a multicultural festival cele...,"[participating, in, a, multicultural, festival...","[(participating, VBG), (in, IN), (a, DT), (mul..."
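
Each entry in the tagged column is a list of `(word, tag)` pairs, which makes it easy to work with in plain Python. The sketch below uses the first three pairs visible in the output above for "traffic was terrible this morning" and shows two simple ways to use them. The tag names follow the Penn Treebank convention, where tags starting with `JJ` mark adjectives.

```python
from collections import Counter

# First three (word, tag) pairs copied from the displayed output above
tagged = [("traffic", "NN"), ("was", "VBD"), ("terrible", "JJ")]

# Count how often each POS tag appears
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts)

# Keep only the adjectives, which often carry sentiment
adjectives = [word for word, tag in tagged if tag.startswith("JJ")]
print(adjectives)   # ['terrible']
```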


# Compiling and Downloading the Text

**Once we have preprocessed our text based on the needs and requirements of our data mining tools and methodology, the last step is to download our text data for later use.** This allows us to store our text data and use it for further processing and analysis as needed later on.

**To achieve this, we convert our tables back into a CSV file.** We turn our file into a CSV document which is then stored in the runtime of this Google Colab instance. You can change the file name to whatever name is appropriate for the data you are collecting.

In [11]:
FILE_NAME = "preprocessed_text_data.csv"
FILE_NAME_2 = "preprocessed_text_data_2.csv"
specific_text.to_csv(FILE_NAME, index=False)
specific_text_2.to_csv(FILE_NAME_2, index=False)
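
One thing to keep in mind is that a CSV file only stores text, so list columns such as the tokenized text come back as strings when the file is read again. The sketch below shows one possible way to restore them, using a one-row made-up table and an in-memory buffer in place of a real file.

```python
import ast
import io
import pandas as pd

# One-row made-up table with a list-valued column
demo_df = pd.DataFrame({"tokenized_text": [["traffic", "was", "terrible"]]})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip the list has become a plain string
print(type(reloaded["tokenized_text"].iloc[0]))

# ast.literal_eval safely turns the string back into a Python list
restored = reloaded["tokenized_text"].apply(ast.literal_eval)
print(restored.iloc[0])   # ['traffic', 'was', 'terrible']
```

The same idea applies to the tagged text column, since its lists of pairs are also saved as strings.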

**We then download our preprocessed text data.** We now have a new CSV file containing both the raw and various preprocessed text data ready to go. Make sure your internet connection is stable and working before running this code.

In [12]:
from google.colab import files
files.download(FILE_NAME)
files.download(FILE_NAME_2)
