* In this notebook we create a function for text preprocessing and apply it to a raw dataset of airline tweets intended for sentiment analysis.
* Sentiment analysis helps businesses, researchers, and individuals automatically understand and quantify the emotions expressed in large volumes of text, enabling data-driven decision-making.

In [2]:
# Importing the required python libraries for text preprocessing.

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

In [3]:
# Downloading necessary nltk data files

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [6]:
# Reading the Text data using pandas

data = pd.read_csv("/content/Tweets.csv")

In [8]:
# Displaying the first 5 rows of the data

data.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [10]:
# Printing all the column names.

data.columns

Index(['tweet_id', 'airline_sentiment', 'airline_sentiment_confidence',
       'negativereason', 'negativereason_confidence', 'airline',
       'airline_sentiment_gold', 'name', 'negativereason_gold',
       'retweet_count', 'text', 'tweet_coord', 'tweet_created',
       'tweet_location', 'user_timezone'],
      dtype='object')

In [24]:
# Keeping only the columns relevant to the sentiment analysis task.

useful_cols = ['airline_sentiment', 'airline','name','text']
data = data[useful_cols]

In [25]:
# Creating a function to tokenize and preprocess a single text string.

def preprocess(text):
  text = text.lower()                                                            # Converting the text to lowercase
  tokens = word_tokenize(text)                                                   # Tokenizing the text
  tokens = [word for word in tokens if word.isalpha()]                           # Keeping alphabetic tokens only (drops punctuation and numbers)
  stop_words = set(stopwords.words('english'))                                   # Loading the English stopwords
  tokens = [word for word in tokens if word not in stop_words]                   # Removing the stopwords
  lemmatizer = WordNetLemmatizer()
  tokens = [lemmatizer.lemmatize(word) for word in tokens]                       # Lemmatizing the tokens

  return tokens
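
To make the steps concrete, here is a minimal dependency-free sketch of a similar pipeline applied to the first tweet in the dataset. It is an approximation, not the function above: a regular expression stands in for `word_tokenize` plus the `isalpha()` filter, `TINY_STOP_WORDS` is a hypothetical hand-picked subset of an English stop word list, and lemmatization is omitted.

```python
import re

# Hypothetical, hand-picked subset of English stop words (assumption: the
# NLTK list used in the notebook is much longer).
TINY_STOP_WORDS = {"what", "and", "it", "a", "the", "to"}

def preprocess_sketch(text):
    # Lowercase, then keep runs of ASCII letters only; this roughly mimics
    # tokenizing followed by dropping non-alphabetic tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stop words.
    return [word for word in tokens if word not in TINY_STOP_WORDS]

print(preprocess_sketch("@VirginAmerica What @dhepburn said."))
# ['virginamerica', 'dhepburn', 'said']
```

The regular expression splits contractions differently from `word_tokenize`, so results on other tweets may not match the NLTK-based function exactly.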

In [26]:
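# Creating a tokenized dataset by applying the preprocessing function to every cell of the first 1000 rows (label column included).
# Note: DataFrame.map requires pandas 2.1 or later; older versions use DataFrame.applymap instead.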

tokenized_data = data[:1000].map(preprocess)
tokenized_data

Unnamed: 0,airline_sentiment,airline,name,text
0,[neutral],"[virgin, america]",[cairdin],"[virginamerica, dhepburn, said]"
1,[positive],"[virgin, america]",[jnardino],"[virginamerica, plus, added, commercial, exper..."
2,[neutral],"[virgin, america]",[yvonnalynn],"[virginamerica, today, must, mean, need, take,..."
3,[negative],"[virgin, america]",[jnardino],"[virginamerica, really, aggressive, blast, obn..."
4,[negative],"[virgin, america]",[jnardino],"[virginamerica, really, big, bad, thing]"
...,...,...,...,...
995,[negative],[united],[],"[united, time, finally, get, dallas, could, dr..."
996,[negative],[united],[cristobalwong],"[united, trying, get, final, destination, need..."
997,[negative],[united],[itsmetsforme],"[united, guy, really, customer, service, spent..."
998,[positive],[united],[swampynomo],"[united, priority, iove]"
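
The call above tokenizes every column, including the `airline_sentiment` labels. If only the tweet text should be tokenized, one option is to apply the function to the `text` column alone. The sketch below assumes a small hand-made frame and a hypothetical stand-in tokenizer (`simple_tokens`) so that it runs without the NLTK downloads; in the notebook, `preprocess` would take its place.

```python
import pandas as pd

def simple_tokens(text):
    # Hypothetical stand-in for preprocess(): lowercase and split on whitespace.
    return text.lower().split()

# Tiny hand-made sample (assumption: not taken from Tweets.csv).
sample = pd.DataFrame({
    "airline_sentiment": ["neutral", "negative"],
    "text": ["Flight was on time", "Lost my bag again"],
})

# Tokenize only the text column; the labels stay as plain strings.
sample["tokens"] = sample["text"].apply(simple_tokens)
print(sample[["airline_sentiment", "tokens"]])
```

Keeping the labels as plain strings avoids having to unwrap single-element lists such as `[neutral]` later on.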


SUMMARY:

* We analyzed the raw dataset and prepared it for text preprocessing.
* The dataset contains several text columns (features), so we applied text preprocessing techniques: tokenizing, removing stop words, and lemmatizing.
* We then created a tokenized dataset. For a sentiment analysis task, the next steps are to convert the tokens into vectors and use them for training and evaluation.
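
As a rough illustration of converting tokens into vectors, the sketch below builds simple bag-of-words count vectors with the standard library. The first token list is row 4 of the tokenized output; the second is made up for the example. `build_vocab` and `vectorize` are hypothetical helper names, and a real pipeline would more likely rely on a library vectorizer.

```python
from collections import Counter

def build_vocab(token_lists):
    # Map each distinct token to a column index, in sorted order.
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}

def vectorize(tokens, vocab):
    # Count how often each vocabulary token occurs in one document.
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocab]

docs = [
    ["virginamerica", "really", "big", "bad", "thing"],  # row 4 of the output
    ["united", "really", "bad", "bad", "service"],       # made-up example
]
vocab = build_vocab(docs)
vectors = [vectorize(doc, vocab) for doc in docs]
print(list(vocab))
# ['bad', 'big', 'really', 'service', 'thing', 'united', 'virginamerica']
print(vectors)
# [[1, 1, 1, 0, 1, 0, 1], [2, 0, 1, 1, 0, 1, 0]]
```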