# SENTIMENT ANALYSIS
- Sentiment analysis of public transport in Ireland from the producers' and consumers' points of view

# Social Media Data
Social media provides textual data such as posts and comments. The Reddit API is used to scrape text data about public transport in Ireland from the producers' and consumers' points of view.

The keyword lists used to collect posts and comments are as follows:

- Consumer (bus, public transport, train/rail, tram, luas)
- Producer (I am a driver, driving my car, driving in ireland)

## Reddit Data
To obtain data from Reddit, the following steps are required:

a) Create an application

b) Note down the personal use script (client ID) and secret tokens

c) Request a temporary OAuth token from Reddit using the account user name and password

d) Make requests directly or use PRAW

e) Read the data

## DATA SCRAPING USING REDDIT

In [1]:
#Import all the necessary libraries
from dotenv import load_dotenv
from os import getenv
import requests
import json

#!pip install praw
import praw
from datetime import datetime as dt
import warnings
warnings.filterwarnings('ignore')

In [2]:
#Load environment variables from .env file
load_dotenv()

#Reddit API credentials
APP_NAME = getenv('APP_NAME')
APP_ID = getenv("APP_ID")
APP_SECRET = getenv("APP_SECRET")
REDDIT_USERNAME = getenv('REDDIT_USERNAME')
REDDIT_PASSWORD = getenv('REDDIT_PASSWORD')

In [3]:
#Check that the credentials were loaded from the environment (this does not verify them with Reddit)
if APP_NAME and APP_ID and APP_SECRET and REDDIT_USERNAME:
    print("Success!")

Success!
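
The check above prints a single message and stays silent when something is missing. A minimal sketch of a more informative check, assuming the same variable names as in the `.env` file, could report which values are absent. The helper name `missing_credentials` is hypothetical, and like the check above it only confirms that the values are set, not that Reddit accepts them.

```python
import os

# Names of the environment variables read earlier in this notebook.
REQUIRED_VARS = ["APP_NAME", "APP_ID", "APP_SECRET", "REDDIT_USERNAME"]

def missing_credentials(env, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]

missing = missing_credentials(os.environ)
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All credentials are set.")
```

Passing the mapping as an argument keeps the helper easy to test with a plain dictionary instead of the real environment.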


In [4]:
#set up a Reddit instance without user-specific authentication
reddit = praw.Reddit(
    client_id=getenv("APP_ID"),
    client_secret=getenv("APP_SECRET"),
    user_agent=f"{getenv('APP_NAME')} u/{getenv('REDDIT_USERNAME')}",
)

In [5]:
#Print reddit.read_only
print(reddit.read_only)

True


In [6]:
#Print the reddit instance information
print(reddit)

<praw.reddit.Reddit object at 0x000002C92F63AD50>


In [7]:
#Access the subreddit named "irishtourism"
subreddit = reddit.subreddit("irishtourism")

In [8]:
#Display the name of the Subreddit
print(subreddit.display_name)

irishtourism


In [9]:
#Display the Reddit Title and Description
#print(subreddit.title)
#print(subreddit.description)

In [10]:
#Collect posts and comments related to transport in Ireland based on the specified keywords
keywords = ["using the bus","buses","using the coach","using the train","use the train",
            "use the dart", "using the dart", "I use the dart","bike",
            "rail", "bus eireann", "public transport", "tram","luas"]

#Create Lists to store the data
post_titles = []
post_bodies = []
comments = []

#Loop through subreddit submissions returned by the search query "transport in ireland"
for submission in subreddit.search("transport in ireland", limit=None):
    submission.comments.replace_more(limit=None)
    post_title = submission.title
    post_body = submission.selftext
    #Iterate through comments in the submission
    for comment in submission.comments.list():
        #Check if any keyword is present in lowercase form in the comment body
        if any(keyword.lower() in comment.body.lower() for keyword in keywords):
            #If a relevant keyword is found, store the submission details and comment
            post_titles.append(post_title)
            post_bodies.append(post_body)
            comments.append(comment.body)
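
The keyword filter above uses plain substring matching, so a short keyword such as "rail" also matches inside longer words such as "trail". A possible alternative, sketched here under the assumption that whole-word matches are wanted, uses a regular expression with word boundaries. The helper names are hypothetical, and this is not the filter that produced the results in this notebook.

```python
import re

def build_keyword_pattern(keywords):
    """Compile one case-insensitive pattern that matches any keyword as a whole word or phrase."""
    escaped = (re.escape(keyword) for keyword in keywords)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

def contains_keyword(text, pattern):
    """Return True if `text` contains at least one keyword."""
    return pattern.search(text) is not None

pattern = build_keyword_pattern(["rail", "luas", "public transport"])
print(contains_keyword("We took the Luas into town", pattern))
print(contains_keyword("A lovely hiking trail", pattern))
# Substring matching, as used in the cell above, accepts the same text.
print("rail" in "a lovely hiking trail")
```

Whole-word matching is stricter, so it would also skip plurals such as "trams" unless they are added to the keyword list.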

In [11]:
#Calculate max length among lists
max_length = max(len(post_titles), len(post_bodies), len(comments))

#Pad shorter lists with None so that all three have the same length
#(the lists are appended together in the loop above, so this is only a safeguard)
post_titles += [None] * (max_length - len(post_titles))
post_bodies += [None] * (max_length - len(post_bodies))
comments += [None] * (max_length - len(comments))

In [12]:
#Collect the lists in a dictionary for JSON export (a plain dict, not a pandas DataFrame)
ireland_df = {
    "Post Title": post_titles,
    "Post Body": post_bodies,
    "Comment": comments
}

# USING JSON DATA FOR SENTIMENT ANALYSIS OF IRELAND'S MODES OF PUBLIC TRANSPORT FROM THE CONSUMERS' POINT OF VIEW WITH VADER

In [13]:
#Save the data as JSON
with open('irishtourism_data.json', 'w') as outfile:
    json.dump(ireland_df, outfile, indent=4)

In [14]:
#Check keys/fields in the JSON data
print(ireland_df.keys())

dict_keys(['Post Title', 'Post Body', 'Comment'])


In [15]:
#Access specific elements eg post titles
#print(ireland_df['Post Title'])

In [16]:
import json
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
import nltk
#nltk.download('vader_lexicon')  #Run once if the VADER lexicon is not installed
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.model_selection import train_test_split

In [17]:
#Tokenize text using a regular expression tokenizer
token = RegexpTokenizer(r'[a-zA-Z0-9]+')

#Initialize CountVectorizer
cv = CountVectorizer(stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)

#Load the JSON data
with open('irishtourism_data.json', 'r') as file:
    data = json.load(file)

#Extract comments from JSON data
comments = data['Comment']

# Convert text data into a matrix of token counts
text_counts = cv.fit_transform(comments)

#Initialize VADER sentiment analyzer
sid = SentimentIntensityAnalyzer()

#Store comment sentiments and token counts in a list of dictionaries
comment_info = []

# Analyze sentiment for each comment
for idx, comment in enumerate(comments):
    comment_sentiment = sid.polarity_scores(comment)
    
    #Determine sentiment based on compound score
    if comment_sentiment['compound'] >= 0.05:
        sentiment = 'Positive'
    elif comment_sentiment['compound'] <= -0.05:
        sentiment = 'Negative'
    else:
        sentiment = 'Neutral'
    
    #Append sentiment and token counts to the list
    comment_info.append({
        'sentiment': sentiment,
        'token_counts': text_counts[idx].toarray().tolist()  # Add token counts to sentiment info
    })
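
The thresholding in the loop above can be isolated in a small function, which makes the ±0.05 cut-offs easier to test and reuse. This is a sketch only: the function name `label_from_compound` and the `threshold` parameter are illustrative assumptions, and the compound score is expected to come from `sid.polarity_scores(comment)['compound']`.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to 'Positive', 'Negative', or 'Neutral'."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Scores at and around the cut-offs.
for score in (0.6, 0.05, 0.0, -0.05, -0.4):
    print(score, label_from_compound(score))
```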


In [18]:
# Display sentiment and token counts for each comment
#for idx, info in enumerate(comment_info, 1):
    #print(f"Comment {idx} Sentiment:", info['sentiment'])
    #print("Token Counts:", info['token_counts'])
    #print("----")

In [19]:
#Extracting sentiment labels into a list
sentiments = [info['sentiment'] for info in comment_info]

#Splitting the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, sentiments, test_size=0.20, random_state=0)


In [20]:
#Training the model
from sklearn.naive_bayes import MultinomialNB
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

#Calculate the accuracy score of the model
#(the labels come from VADER, so this measures agreement with VADER rather than with human annotations)
from sklearn import metrics
predicted = MNB.predict(X_test)
accuracy = metrics.accuracy_score(Y_test, predicted)
print("Accuracy Score: ", accuracy)

Accuracy Score:  0.7608695652173914


# USING CSV

In [21]:
#Collect the lists in a dictionary for CSV export
ireland_df1 = {
    "Post Title": post_titles,
    "Post Body": post_bodies,
    "Comment": comments
}

In [22]:
import pandas as pd
#Convert dictionary to DataFrame
ireland_df1 = pd.DataFrame.from_dict(ireland_df1)

#Saving the DataFrame to a CSV file named 'Transport in ireland Reddit Posts.csv'
ireland_df1.to_csv('Transport in ireland Reddit Posts.csv', index=False)


## USING TF-IDF FOR SENTIMENT ANALYSIS OF IRELAND'S PUBLIC TRANSPORT FROM THE CONSUMERS' POINT OF VIEW

In [23]:
#Import the transport_ireland1 Data
transport_ireland1 = pd.read_csv("Transport in ireland Reddit Posts.csv")

In [24]:
#View the head of the dataset
transport_ireland1.head()

Unnamed: 0,Post Title,Post Body,Comment
0,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,Bus eireann extremely unreliable Allow a lot o...
1,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,Yeah even in towns like Athlone that got the n...
2,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,Last year I was supposed to get a car service ...
3,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,Irish Rail: Good but too overcrowded during pe...
4,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,I've been staying in Maynooth for about 2 mont...


In [25]:
#View the tail of the dataset
transport_ireland1.tail()

Unnamed: 0,Post Title,Post Body,Comment
452,Christmas alone in Dublin,"Because of... life, I need to go on a vacation...",99.9% of things are closed on Christmas day. P...
453,Christmas alone in Dublin,"Because of... life, I need to go on a vacation...","Honestly, it's not a great time to visit if yo..."
454,Christmas alone in Dublin,"Because of... life, I need to go on a vacation...",I am not really worrying about eating (I can c...
455,Christmas alone in Dublin,"Because of... life, I need to go on a vacation...","> but about ""what to do"" as I won't be able to..."
456,Transport - make your journey around the count...,**Bus**\n\n[Bus Eireann](http://www.buseireann...,That's pretty comprehensive. I won't asked wha...


In [26]:
#Count duplicates in the 'Comment' variable
comment_duplicates = transport_ireland1['Comment'].duplicated().sum()
comment_duplicates

0

In [27]:
#View the duplicate
duplicate_comments = transport_ireland1[transport_ireland1['Comment'].duplicated(keep=False)]
print(duplicate_comments)

Empty DataFrame
Columns: [Post Title, Post Body, Comment]
Index: []


In [28]:
#Drop the duplicated rows from transport_ireland1 based on the 'Comment' column
transport_ireland1 = transport_ireland1.drop_duplicates(subset=['Comment'])
transport_ireland1

Unnamed: 0,Post Title,Post Body,Comment
0,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,Bus eireann extremely unreliable Allow a lot o...
1,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,Yeah even in towns like Athlone that got the n...
2,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,Last year I was supposed to get a car service ...
3,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,Irish Rail: Good but too overcrowded during pe...
4,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,I've been staying in Maynooth for about 2 mont...
...,...,...,...
452,Christmas alone in Dublin,"Because of... life, I need to go on a vacation...",99.9% of things are closed on Christmas day. P...
453,Christmas alone in Dublin,"Because of... life, I need to go on a vacation...","Honestly, it's not a great time to visit if yo..."
454,Christmas alone in Dublin,"Because of... life, I need to go on a vacation...",I am not really worrying about eating (I can c...
455,Christmas alone in Dublin,"Because of... life, I need to go on a vacation...","> but about ""what to do"" as I won't be able to..."


In [29]:
#Check if the duplicates have been dropped
comment_duplicates = transport_ireland1['Comment'].duplicated().sum()
comment_duplicates

0

In [30]:
#Check the shape of the transport_ireland data
transport_ireland1.shape

(457, 3)

## Preprocessing text

In [31]:
#Count the number of words in the comment
transport_ireland1['word_count'] = transport_ireland1['Comment'].apply(lambda x: len(str(x).split(" ")))

In [32]:
transport_ireland1[['Comment','word_count']].head()

Unnamed: 0,Comment,word_count
0,Bus eireann extremely unreliable Allow a lot o...,59
1,Yeah even in towns like Athlone that got the n...,83
2,Last year I was supposed to get a car service ...,189
3,Irish Rail: Good but too overcrowded during pe...,40
4,I've been staying in Maynooth for about 2 mont...,149


In [33]:
#Find the maximum number of words in the 'word_count' variable of the 'transport_ireland1' DataFrame
largest_word_count = transport_ireland1["word_count"].max()
largest_word_count

940
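
The word count above uses `split(" ")`, which splits on single spaces only, so consecutive spaces produce empty strings that are counted as words. The short sketch below illustrates the difference from `split()` on a made-up sentence. Whether this affects the counts reported above depends on the whitespace in the scraped comments.

```python
sample = "Bus  eireann is   late"  # made-up text with repeated spaces

tokens_single_space = sample.split(" ")  # splits on every single space
tokens_whitespace = sample.split()       # splits on any run of whitespace

print(tokens_single_space)
print(len(tokens_single_space), len(tokens_whitespace))
```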

In [34]:
#Count the number of characters in the comment variable
transport_ireland1['char_count'] = transport_ireland1['Comment'].str.len() 

In [35]:
#View the head of the char_count and comment
transport_ireland1[['Comment','char_count']].head()

Unnamed: 0,Comment,char_count
0,Bus eireann extremely unreliable Allow a lot o...,355
1,Yeah even in towns like Athlone that got the n...,431
2,Last year I was supposed to get a car service ...,958
3,Irish Rail: Good but too overcrowded during pe...,251
4,I've been staying in Maynooth for about 2 mont...,826


In [36]:
#Find the maximum number of characters in the 'char_count' variable of the 'transport_ireland1' DataFrame
largest_char_count = transport_ireland1["char_count"].max()
largest_char_count

5212

In [37]:
#View the head of the transport_ireland1
transport_ireland1.head(5)

Unnamed: 0,Post Title,Post Body,Comment,word_count,char_count
0,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,Bus eireann extremely unreliable Allow a lot o...,59,355
1,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,Yeah even in towns like Athlone that got the n...,83,431
2,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,Last year I was supposed to get a car service ...,189,958
3,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,Irish Rail: Good but too overcrowded during pe...,40,251
4,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,I've been staying in Maynooth for about 2 mont...,149,826


In [38]:
#Define a function to calculate the average length of words in a sentence
def avg_word(sentence):
    words = sentence.split()
    return sum(len(word) for word in words) / len(words)

In [39]:
#Apply the avg_word function on comment
transport_ireland1['avg_word'] = transport_ireland1['Comment'].apply(lambda x: avg_word(x))

In [40]:
#View comment and avg_word
transport_ireland1[['Comment','avg_word']].head()

Unnamed: 0,Comment,avg_word
0,Bus eireann extremely unreliable Allow a lot o...,5.0
1,Yeah even in towns like Athlone that got the n...,4.204819
2,Last year I was supposed to get a car service ...,4.327684
3,Irish Rail: Good but too overcrowded during pe...,4.727273
4,I've been staying in Maynooth for about 2 mont...,4.536913


In [41]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Diana\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [42]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

In [43]:
#count number of stopwords in each comment and store in a variable called stopwords
transport_ireland1['stopwords'] = transport_ireland1['Comment'].apply(lambda x: len([x for x in x.split() if x in stop]))
transport_ireland1[['Comment','stopwords']].head()

Unnamed: 0,Comment,stopwords
0,Bus eireann extremely unreliable Allow a lot o...,21
1,Yeah even in towns like Athlone that got the n...,34
2,Last year I was supposed to get a car service ...,75
3,Irish Rail: Good but too overcrowded during pe...,13
4,I've been staying in Maynooth for about 2 mont...,65


In [44]:
#Count the number of purely numeric tokens in each comment and store the value in a column called numerics
transport_ireland1['numerics'] = transport_ireland1['Comment'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
transport_ireland1[['Comment','numerics']].head()

Unnamed: 0,Comment,numerics
0,Bus eireann extremely unreliable Allow a lot o...,0
1,Yeah even in towns like Athlone that got the n...,0
2,Last year I was supposed to get a car service ...,1
3,Irish Rail: Good but too overcrowded during pe...,0
4,I've been staying in Maynooth for about 2 mont...,1


In [45]:
#count number of uppercases in the comment variable and store the value in a variable called upper
transport_ireland1['upper'] = transport_ireland1['Comment'].apply(lambda x: len([x for x in x.split() if x.isupper()]))
transport_ireland1[['Comment','upper']].head()

Unnamed: 0,Comment,upper
0,Bus eireann extremely unreliable Allow a lot o...,0
1,Yeah even in towns like Athlone that got the n...,0
2,Last year I was supposed to get a car service ...,8
3,Irish Rail: Good but too overcrowded during pe...,0
4,I've been staying in Maynooth for about 2 mont...,3


In [46]:
#Convert all uppercase letters to lowercase
transport_ireland1['Comment'] = transport_ireland1['Comment'].apply(lambda x: " ".join(x.lower() for x in x.split()))
transport_ireland1['Comment'].head()

0    bus eireann extremely unreliable allow a lot o...
1    yeah even in towns like athlone that got the n...
2    last year i was supposed to get a car service ...
3    irish rail: good but too overcrowded during pe...
4    i've been staying in maynooth for about 2 mont...
Name: Comment, dtype: object

In [47]:
#Remove all special characters (regex=True is needed in pandas 2.0 and later, where the default is a literal match)
transport_ireland1['Comment'] = transport_ireland1['Comment'].str.replace(r'[^\w\s]', '', regex=True)
transport_ireland1['Comment'].head()

0    bus eireann extremely unreliable allow a lot o...
1    yeah even in towns like athlone that got the n...
2    last year i was supposed to get a car service ...
3    irish rail good but too overcrowded during pea...
4    ive been staying in maynooth for about 2 month...
Name: Comment, dtype: object
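
To show what the pattern `[^\w\s]` does, the sketch below applies the same regular expression to a single string with Python's `re` module. It removes every character that is neither a word character nor whitespace, which is why "i've" becomes "ive" in the output above. The function name `strip_punctuation` and the sample sentence are made up for illustration.

```python
import re

def strip_punctuation(text):
    """Remove every character that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", text)

print(strip_punctuation("irish rail: good, but i've seen better!"))
```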

In [48]:
#Remove all stopwords
from nltk.corpus import stopwords
stop = stopwords.words('english')
transport_ireland1['Comment'] = transport_ireland1['Comment'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
transport_ireland1['Comment'].head()

0    bus eireann extremely unreliable allow lot ext...
1    yeah even towns like athlone got new electric ...
2    last year supposed get car service pickup dubl...
3    irish rail good overcrowded peak hours dublin ...
4    ive staying maynooth 2 months short study ive ...
Name: Comment, dtype: object
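
One side effect worth checking is that stopword removal also drops negations, because words such as "not" are in NLTK's English stopword list. Sentiment is scored later on the cleaned text, so this can change the apparent meaning of a comment. The sketch below uses a small hand-picked stopword set (an assumption for illustration, not the full NLTK list) to show the effect.

```python
# Small illustrative subset; the notebook itself uses stopwords.words('english').
ILLUSTRATIVE_STOPWORDS = {"the", "is", "was", "not", "a", "on"}

def remove_stopwords(text, stop_set=ILLUSTRATIVE_STOPWORDS):
    """Drop every whitespace-separated token that appears in `stop_set`."""
    return " ".join(word for word in text.split() if word not in stop_set)

print(remove_stopwords("the bus was not reliable"))
```

If negations matter for the analysis, one option is to remove them from the stopword list before filtering.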

In [49]:
#list the 10 most frequent words
#freq = pd.Series(' '.join(transport_ireland1['Comment']).split()).value_counts()[:10]
#freq

In [50]:
#freq = list(freq.index)
#transport_ireland1['Comment'] = transport_ireland1['Comment'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
#transport_ireland1['Comment'].head()

In [51]:
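#Preview TextBlob spelling correction on the first 5 comments (the result is not assigned back to the DataFrame)
#Note: the correction can change meaning, e.g. 'unreliable' becomes 'reliable' and 'athlone' becomes 'alone' in the output below
from textblob import TextBlob
transport_ireland1['Comment'][:5].apply(lambda x: str(TextBlob(x).correct()))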

0    bus eireann extremely reliable allow lot extra...
1    yeah even towns like alone got new electric bu...
2    last year supposed get car service picked dubl...
3    irish rail good overcrowded peak hours dublin ...
4    give staying maynooth 2 months short study giv...
Name: Comment, dtype: object

In [52]:
#Use TextBlob to tokenize the words in the second comment in the 'Comment' column
nltk.download('punkt')
TextBlob(transport_ireland1['Comment'][1]).words

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Diana\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


WordList(['yeah', 'even', 'towns', 'like', 'athlone', 'got', 'new', 'electric', 'buses', 'theres', 'town', 'bus', 'supposed', 'going', 'every', 'half', 'hour', 'yet', 'people', 'left', 'standing', 'rain', 'knowing', 'come', 'would', 'nice', 'could', 'least', 'announce', 'bus', 'route', 'cancellations', 'delays', 'people', 'could', 'chance', 'make', 'alternative', 'plans', 'rather', 'waiting', 'bus', 'thats', 'due', 'show', 'another', 'hour', 'two'])

In [53]:
#Apply word stemming using PorterStemmer to the first 5 comments in the 'Comment' column
from nltk.stem import PorterStemmer
st = PorterStemmer()
transport_ireland1['Comment'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

0    bu eireann extrem unreli allow lot extra time ...
1    yeah even town like athlon got new electr buse...
2    last year suppos get car servic pickup dublin ...
3    irish rail good overcrowd peak hour dublin bu ...
4    ive stay maynooth 2 month short studi ive gott...
Name: Comment, dtype: object

In [54]:
#Lemmatize each word in the 'Comment' column using TextBlob's Word.lemmatize
nltk.download('wordnet')
from textblob import Word
transport_ireland1['Comment'] = transport_ireland1['Comment'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
transport_ireland1['Comment'].head()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Diana\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


0    bus eireann extremely unreliable allow lot ext...
1    yeah even town like athlone got new electric b...
2    last year supposed get car service pickup dubl...
3    irish rail good overcrowded peak hour dublin b...
4    ive staying maynooth 2 month short study ive g...
Name: Comment, dtype: object

In [55]:
#Use TextBlob to generate bigrams for the first comment in the 'Comment' column
TextBlob(transport_ireland1['Comment'][0]).ngrams(2)

[WordList(['bus', 'eireann']),
 WordList(['eireann', 'extremely']),
 WordList(['extremely', 'unreliable']),
 WordList(['unreliable', 'allow']),
 WordList(['allow', 'lot']),
 WordList(['lot', 'extra']),
 WordList(['extra', 'time']),
 WordList(['time', 'need']),
 WordList(['need', 'time']),
 WordList(['time', 'avoid']),
 WordList(['avoid', 'intercounty']),
 WordList(['intercounty', 'travel']),
 WordList(['travel', 'slow']),
 WordList(['slow', 'private']),
 WordList(['private', 'bus']),
 WordList(['bus', 'dublin']),
 WordList(['dublin', 'oversubscribed']),
 WordList(['oversubscribed', 'aircoach']),
 WordList(['aircoach', 'gobus']),
 WordList(['gobus', 'etc']),
 WordList(['etc', 'need']),
 WordList(['need', 'booked']),
 WordList(['booked', 'advance']),
 WordList(['advance', 'expressway']),
 WordList(['expressway', 'really']),
 WordList(['really', 'overpriced']),
 WordList(['overpriced', 'luas']),
 WordList(['luas', 'dublin']),
 WordList(['dublin', 'bus']),
 WordList(['bus', 'dart']),
 Word
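
Bigrams are simply consecutive word pairs. As a rough sketch of what `ngrams(2)` produces, the same pairs can be built with `zip` on a token list. The function name `bigrams` is hypothetical, and the result is a list of tuples rather than TextBlob `WordList` objects.

```python
def bigrams(tokens):
    """Return consecutive token pairs from a list of tokens."""
    return list(zip(tokens, tokens[1:]))

tokens = "bus eireann extremely unreliable".split()
print(bigrams(tokens))
```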

In [56]:
#Apply sentiment analysis using TextBlob to the first 5 comments in the 'Comment' column
transport_ireland1['Comment'][:5].apply(lambda x: TextBlob(x).sentiment)

0    (-0.0654761904761905, 0.48214285714285715)
1     (0.02411616161616162, 0.3993686868686868)
2                   (0.3925925925925926, 0.625)
3    (0.27346938775510204, 0.39693877551020407)
4     (0.06527056277056278, 0.3055546536796537)
Name: Comment, dtype: object

In [57]:
#Apply sentiment analysis using TextBlob to the 'Comment' column and storing the polarity score
transport_ireland1['sentiment'] = transport_ireland1['Comment'].apply(lambda x: TextBlob(x).sentiment[0] )
transport_ireland1[['Comment','sentiment']].head()

Unnamed: 0,Comment,sentiment
0,bus eireann extremely unreliable allow lot ext...,-0.065476
1,yeah even town like athlone got new electric b...,0.024116
2,last year supposed get car service pickup dubl...,0.392593
3,irish rail good overcrowded peak hour dublin b...,0.273469
4,ive staying maynooth 2 month short study ive g...,0.065271


In [58]:
#View the dataset
transport_ireland1.head(5)

Unnamed: 0,Post Title,Post Body,Comment,word_count,char_count,avg_word,stopwords,numerics,upper,sentiment
0,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,bus eireann extremely unreliable allow lot ext...,59,355,5.0,21,0,0,-0.065476
1,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,yeah even town like athlone got new electric b...,83,431,4.204819,34,0,0,0.024116
2,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,last year supposed get car service pickup dubl...,189,958,4.327684,75,1,8,0.392593
3,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,irish rail good overcrowded peak hour dublin b...,40,251,4.727273,13,0,0,0.273469
4,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,ive staying maynooth 2 month short study ive g...,149,826,4.536913,65,1,3,0.065271


In [59]:
#classify the sentiments as negative, positive or neutral using polarity score
from textblob import TextBlob

#Function to categorize sentiment polarities
def categorize_sentiment(polarity):
    if polarity > 0:
        return 'Positive'
    elif polarity < 0:
        return 'Negative'
    else:
        return 'Neutral'

#Compute the polarity score, then map it to a label (this overwrites the numeric 'sentiment' column from the previous cell)
transport_ireland1['sentiment_polarity'] = transport_ireland1['Comment'].apply(lambda x: TextBlob(x).sentiment.polarity)
transport_ireland1['sentiment'] = transport_ireland1['sentiment_polarity'].apply(categorize_sentiment)
print(transport_ireland1[['Comment', 'sentiment']].head())

                                             Comment sentiment
0  bus eireann extremely unreliable allow lot ext...  Negative
1  yeah even town like athlone got new electric b...  Positive
2  last year supposed get car service pickup dubl...  Positive
3  irish rail good overcrowded peak hour dublin b...  Positive
4  ive staying maynooth 2 month short study ive g...  Positive


In [60]:
#View the Dataset
transport_ireland1.head()

Unnamed: 0,Post Title,Post Body,Comment,word_count,char_count,avg_word,stopwords,numerics,upper,sentiment,sentiment_polarity
0,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,bus eireann extremely unreliable allow lot ext...,59,355,5.0,21,0,0,Negative,-0.065476
1,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,yeah even town like athlone got new electric b...,83,431,4.204819,34,0,0,Positive,0.024116
2,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,last year supposed get car service pickup dubl...,189,958,4.327684,75,1,8,Positive,0.392593
3,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,irish rail good overcrowded peak hour dublin b...,40,251,4.727273,13,0,0,Positive,0.273469
4,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,ive staying maynooth 2 month short study ive g...,149,826,4.536913,65,1,3,Positive,0.065271


In [61]:
#columns to drop before text classification
columns_to_drop = ['Post Title','Post Body','word_count','char_count','avg_word','stopwords','numerics','upper','sentiment_polarity']
transport_ireland1.drop(columns=columns_to_drop, inplace=True)

In [62]:
#View to check if they have been dropped
transport_ireland1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 457 entries, 0 to 456
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Comment    457 non-null    object
 1   sentiment  457 non-null    object
dtypes: object(2)
memory usage: 26.9+ KB


# Text Classification
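
In this section each comment is turned into a TF-IDF vector before the classifier is trained. As a rough illustration of the weighting, the sketch below implements in plain Python the smoothed formula that scikit-learn documents for its defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each document vector. It ignores stop words, `max_features`, and tokenisation details, and the function name `tfidf_vectors` and the toy corpus are made up for this example.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using smoothed IDF and L2 normalisation."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # Fall back to 1.0 so an empty document does not divide by zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

corpus = ["bus late again", "bus on time", "train late"]
for vector in tfidf_vectors(corpus):
    print({term: round(weight, 3) for term, weight in vector.items()})
```

Terms that appear in fewer documents, such as "again" in the toy corpus, receive a higher weight than terms shared across documents, such as "bus".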

In [63]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

#Vectorize the Text Data using TF-IDF
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word',
                        stop_words='english', ngram_range=(1,1))
X = tfidf.fit_transform(transport_ireland1['Comment'])
y = transport_ireland1['sentiment']

#Split the data into Train-Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Create a classifier for MNB
classifier = MultinomialNB()

#Train the Classifier
classifier.fit(X_train, y_train)

#Evaluate the Model
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00        14
     Neutral       0.00      0.00      0.00         8
    Positive       0.76      1.00      0.86        70

    accuracy                           0.76        92
   macro avg       0.25      0.33      0.29        92
weighted avg       0.58      0.76      0.66        92
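
The report shows that every test comment was predicted as Positive: recall is 1.00 for Positive and 0.00 for the other two classes. In that situation accuracy equals the share of the majority class. The sketch below recomputes this from the support counts in the report (14, 8, and 70). The helper name `majority_baseline_accuracy` is hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Test-set labels rebuilt from the support column of the report above.
test_labels = ["Negative"] * 14 + ["Neutral"] * 8 + ["Positive"] * 70
print(round(majority_baseline_accuracy(test_labels), 2))  # 0.76
```

Because the classifier matches this baseline exactly, the macro-averaged scores in the report (for example an F1 of 0.29) describe its behaviour on the minority classes better than accuracy does.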



# SENTIMENT ANALYSIS OF IRELAND PUBLIC TRANSPORT FROM PRODUCERS POINT OF VIEW

## DATA SCRAPING FROM REDDIT

In [64]:
#Access the subreddit named "irishtourism"
subreddit = reddit.subreddit("irishtourism")

In [65]:
#List the keywords related to producers (drivers) of transport in Ireland
keywords = ["driving in ireland","driving the bus","driving the tram/luas","as a driver","drive the bus","i am a driver", "I drive", "my car", "my vehicle", "my company", "drive my car","driving my car"]

#Create Lists to store the data
post_titles = []
post_bodies = []
comments = []

#Fetch posts/comments based on keywords
for submission in subreddit.search("transport in ireland", limit=None):
    submission.comments.replace_more(limit=None)
    post_title = submission.title
    post_body = submission.selftext
    for comment in submission.comments.list():
        if any(keyword.lower() in comment.body.lower() for keyword in keywords):
            post_titles.append(post_title)
            post_bodies.append(post_body)
            comments.append(comment.body)

In [66]:
#Calculate max length among lists
max_length = max(len(post_titles), len(post_bodies), len(comments))

#Padding shorter arrays with None values to match the length of the longest array
post_titles += [None] * (max_length - len(post_titles))
post_bodies += [None] * (max_length - len(post_bodies))
comments += [None] * (max_length - len(comments))

In [67]:
#Create DataFrame
ireland_producer = {
    "Post Title": post_titles,
    "Post Body": post_bodies,
    "Comment": comments
}

ireland_producer = pd.DataFrame(ireland_producer)

In [68]:
#View the ireland_producer dataframe
ireland_producer

Unnamed: 0,Post Title,Post Body,Comment
0,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,I use public transport quite a lot and it does...
1,Transportation for a week in Ireland?,Hello! We have a group of 13 that will be tour...,Thanks for the feedback! I’ll definitely be us...
2,Dublin and Skellig Michael city break,so I'll surprise my parents with a trip in Ire...,They could hire a private driver-guide if they...
3,Ireland Recommendations,Ireland Recommendations\n\nWe are traveling to...,Do you have any prior experience driving in Ir...
4,Getting around,Hello all!\n\nThank you in advance for your an...,"The island is NOT driveable within half a day,..."
5,Getting around,Hello all!\n\nThank you in advance for your an...,I found driving in Ireland to be fun. But I do...
6,Which route (by train/bus) would you fine folk...,"Hi everyone, \n\nIn May '24, my parents (75 yr...",Anything we have to be at in a timely fashion ...
7,Rolling luggage in Dublin and Belfast?,Hello! I'm super excited to visit Ireland in O...,"Are they not just ‘normal’ taxi drivers, under..."
8,Rolling luggage in Dublin and Belfast?,Hello! I'm super excited to visit Ireland in O...,Consistently much cheaper than what though? Th...
9,Rolling luggage in Dublin and Belfast?,Hello! I'm super excited to visit Ireland in O...,yeah but free now is as well. many people pref...


In [69]:
#Save the DataFrame ireland_producer to a CSV file
ireland_producer.to_csv("Transport in ireland producer view Reddit Posts.csv", index=False)

# SENTIMENT ANALYSIS OF IRELAND PUBLIC TRANSPORT FROM PRODUCERS POINT OF VIEW

In [70]:
#Import the dataset after saving it as csv
transport_producer = pd.read_csv("Transport in ireland producer view Reddit Posts.csv")

In [71]:
#View the head of the dataset
transport_producer.head()

Unnamed: 0,Post Title,Post Body,Comment
0,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,I use public transport quite a lot and it does...
1,Transportation for a week in Ireland?,Hello! We have a group of 13 that will be tour...,Thanks for the feedback! I’ll definitely be us...
2,Dublin and Skellig Michael city break,so I'll surprise my parents with a trip in Ire...,They could hire a private driver-guide if they...
3,Ireland Recommendations,Ireland Recommendations\n\nWe are traveling to...,Do you have any prior experience driving in Ir...
4,Getting around,Hello all!\n\nThank you in advance for your an...,"The island is NOT driveable within half a day,..."


In [72]:
#View the tail of the dataset
transport_producer.tail()

Unnamed: 0,Post Title,Post Body,Comment
18,Getting to Dublin from Trim,Hi - we're using Trim as our home base for our...,It should be fine to right into town tbh. You ...
19,Need advice on rental car versus buses for Dub...,"Hello all, my husband and I (in our 30s) are p...",I’ll second the YouTube visit. I watched sever...
20,Ireland car rental for young driver?,I’m travelling to Ireland for the second time ...,"Yes that’s what we’re going to do, put the ren..."
21,10 Day Itinerary - Would Love Feedback/Advice,Hello! My husband and I are going to Ireland f...,I'm a local so of course I feel this way but h...
22,Plan for 10 days trip with public transport + ...,Hello everyone! My girlfriend and I are planni...,"Wow at those prices, maybe I should rent out m..."


In [73]:
#View the info of the dataset
transport_producer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Post Title  23 non-null     object
 1   Post Body   23 non-null     object
 2   Comment     23 non-null     object
dtypes: object(3)
memory usage: 684.0+ bytes


In [74]:
#Check the characteristics
transport_producer.describe()

Unnamed: 0,Post Title,Post Body,Comment
count,23,23,23
unique,16,16,23
top,Rolling luggage in Dublin and Belfast?,Hello! I'm super excited to visit Ireland in O...,I use public transport quite a lot and it does...
freq,3,3,1


In [75]:
#Count duplicates in the 'Comment' variable
comment_duplicates = transport_producer['Comment'].duplicated().sum()
comment_duplicates

0

In [76]:
#View any duplicated comments (an empty DataFrame means there are none)
duplicate_comments = transport_producer[transport_producer['Comment'].duplicated(keep=False)]
print(duplicate_comments)

Empty DataFrame
Columns: [Post Title, Post Body, Comment]
Index: []


In [77]:
#Drop the duplicated rows from transport_producer based on the Comment variable
transport_producer = transport_producer.drop_duplicates(subset=['Comment'])
transport_producer

Unnamed: 0,Post Title,Post Body,Comment
0,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,I use public transport quite a lot and it does...
1,Transportation for a week in Ireland?,Hello! We have a group of 13 that will be tour...,Thanks for the feedback! I’ll definitely be us...
2,Dublin and Skellig Michael city break,so I'll surprise my parents with a trip in Ire...,They could hire a private driver-guide if they...
3,Ireland Recommendations,Ireland Recommendations\n\nWe are traveling to...,Do you have any prior experience driving in Ir...
4,Getting around,Hello all!\n\nThank you in advance for your an...,"The island is NOT driveable within half a day,..."
5,Getting around,Hello all!\n\nThank you in advance for your an...,I found driving in Ireland to be fun. But I do...
6,Which route (by train/bus) would you fine folk...,"Hi everyone, \n\nIn May '24, my parents (75 yr...",Anything we have to be at in a timely fashion ...
7,Rolling luggage in Dublin and Belfast?,Hello! I'm super excited to visit Ireland in O...,"Are they not just ‘normal’ taxi drivers, under..."
8,Rolling luggage in Dublin and Belfast?,Hello! I'm super excited to visit Ireland in O...,Consistently much cheaper than what though? Th...
9,Rolling luggage in Dublin and Belfast?,Hello! I'm super excited to visit Ireland in O...,yeah but free now is as well. many people pref...


In [78]:
#Check if they have been dropped
transport_producer['Comment'].duplicated().sum()

0

In [79]:
#Check for any missing data points from comment
transport_producer['Comment'].isnull().sum()

0

In [80]:
#Check the type of the object
type(transport_producer)

pandas.core.frame.DataFrame

In [81]:
#View the shape of the data
transport_producer.shape

(23, 3)

## USING VADER SENTIMENT ANALYSIS

In [82]:
#Create a VADER SentimentIntensityAnalyzer instance
transport_sentiment = SentimentIntensityAnalyzer()

#Create a function to get the full set of VADER polarity scores (neg, neu, pos, compound)
def get_sentiment_scores(text):
    return transport_sentiment.polarity_scores(text)

#Apply the function and store the resulting score dictionaries in vader_sentiment
transport_producer['vader_sentiment'] = transport_producer['Comment'].apply(get_sentiment_scores)

#Extract the 'neg', 'neu', 'pos', 'compound' scores from 'vader_sentiment'
transport_producer['vader_neg'] = transport_producer['vader_sentiment'].apply(lambda x: x['neg'])
transport_producer['vader_neu'] = transport_producer['vader_sentiment'].apply(lambda x: x['neu'])
transport_producer['vader_pos'] = transport_producer['vader_sentiment'].apply(lambda x: x['pos'])
transport_producer['vader_compound'] = transport_producer['vader_sentiment'].apply(lambda x: x['compound'])

#Create a function to categorize sentiments based on compound scores
def categorize_sentiment(vader_compound):
    if vader_compound >= 0.05:
        return 'positive'
    elif vader_compound <= -0.05:
        return 'negative'
    else:
        return 'neutral'

#Apply the categorization function to create the 'sentiment' column
transport_producer['sentiment'] = transport_producer['vader_compound'].apply(categorize_sentiment)

#Display the updated DataFrame with the 'sentiment' column
print(transport_producer[['Comment','vader_compound','sentiment']])

                                              Comment  vader_compound  \
0   I use public transport quite a lot and it does...         -0.9729   
1   Thanks for the feedback! I’ll definitely be us...         -0.3578   
2   They could hire a private driver-guide if they...          0.9680   
3   Do you have any prior experience driving in Ir...         -0.3400   
4   The island is NOT driveable within half a day,...          0.9671   
5   I found driving in Ireland to be fun. But I do...          0.8442   
6   Anything we have to be at in a timely fashion ...          0.7328   
7   Are they not just ‘normal’ taxi drivers, under...          0.0000   
8   Consistently much cheaper than what though? Th...          0.1680   
9   yeah but free now is as well. many people pref...          0.8271   
10  Yes we did 🤣🤣🤣  \nMy friend wanted to look at ...          0.9195   
11  > Our main issue is we will not have a car. Be...          0.5378   
12  I agree. If you’re already used to driving in .
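
As a quick check on the labelling rule above, the following sketch applies the same ±0.05 thresholds to the twelve compound scores visible in the printed output and tallies the resulting labels. The list `visible_scores` is copied by hand from that output, and `label_from_compound` is an illustrative name that does not appear in the notebook.

```python
from collections import Counter

# Compound scores copied from the first twelve rows of the printed output above.
visible_scores = [-0.9729, -0.3578, 0.9680, -0.3400, 0.9671, 0.8442,
                  0.7328, 0.0000, 0.1680, 0.8271, 0.9195, 0.5378]

def label_from_compound(score, threshold=0.05):
    """Map a VADER compound score to a label using symmetric thresholds."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# Tally how many of the visible rows fall into each class.
label_counts = Counter(label_from_compound(s) for s in visible_scores)
print(label_counts)
```

Most of the visible rows are labelled positive, so the classes are imbalanced. That is worth keeping in mind when reading accuracy figures from a very small test set.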

In [83]:
#View the head of the dataset
transport_producer.head()

Unnamed: 0,Post Title,Post Body,Comment,vader_sentiment,vader_neg,vader_neu,vader_pos,vader_compound,sentiment
0,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,I use public transport quite a lot and it does...,"{'neg': 0.072, 'neu': 0.878, 'pos': 0.05, 'com...",0.072,0.878,0.05,-0.9729,negative
1,Transportation for a week in Ireland?,Hello! We have a group of 13 that will be tour...,Thanks for the feedback! I’ll definitely be us...,"{'neg': 0.112, 'neu': 0.756, 'pos': 0.133, 'co...",0.112,0.756,0.133,-0.3578,negative
2,Dublin and Skellig Michael city break,so I'll surprise my parents with a trip in Ire...,They could hire a private driver-guide if they...,"{'neg': 0.019, 'neu': 0.83, 'pos': 0.151, 'com...",0.019,0.83,0.151,0.968,positive
3,Ireland Recommendations,Ireland Recommendations\n\nWe are traveling to...,Do you have any prior experience driving in Ir...,"{'neg': 0.043, 'neu': 0.957, 'pos': 0.0, 'comp...",0.043,0.957,0.0,-0.34,negative
4,Getting around,Hello all!\n\nThank you in advance for your an...,"The island is NOT driveable within half a day,...","{'neg': 0.046, 'neu': 0.857, 'pos': 0.097, 'co...",0.046,0.857,0.097,0.9671,positive


In [84]:
#columns to drop
columns_to_drop = ['Post Title','Post Body','vader_sentiment','vader_neg','vader_neu','vader_pos','vader_compound']
transport_producer.drop(columns=columns_to_drop, inplace=True)

In [85]:
#View the head of the dataset
transport_producer.head()

Unnamed: 0,Comment,sentiment
0,I use public transport quite a lot and it does...,negative
1,Thanks for the feedback! I’ll definitely be us...,negative
2,They could hire a private driver-guide if they...,positive
3,Do you have any prior experience driving in Ir...,negative
4,"The island is NOT driveable within half a day,...",positive


In [86]:
#Pre-Processing using Count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer

token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv    = CountVectorizer(stop_words = 'english',ngram_range = (1, 1),tokenizer = token.tokenize)
text_counts = cv.fit_transform(transport_producer['Comment'])

text_counts

<23x809 sparse matrix of type '<class 'numpy.int64'>'
	with 1230 stored elements in Compressed Sparse Row format>
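
The 23x809 matrix above has one row per comment and one column per distinct token. A minimal standard-library sketch of the same bag-of-words idea is shown below. It reuses the `[a-zA-Z0-9]+` token pattern from the cell above but, as a simplification, does not remove stop words. The function name `bag_of_words` and the two example sentences are made up for illustration.

```python
import re
from collections import Counter

# Same token pattern as the RegexpTokenizer in the cell above.
TOKEN_PATTERN = re.compile(r"[a-zA-Z0-9]+")

def bag_of_words(docs):
    """Return (sorted vocabulary, one row of token counts per document)."""
    tokenised = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocabulary = sorted({tok for toks in tokenised for tok in toks})
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

# Two made-up comments, used only for illustration.
docs = ["The bus was late, the train was not.", "Bus drivers drive the bus."]
vocabulary, rows = bag_of_words(docs)
print(vocabulary)
print(rows)
```

Each row is mostly zeros once the vocabulary grows, which is why scikit-learn stores the result as a sparse matrix.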

In [87]:
#Split the data into training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, transport_producer['sentiment'], test_size=0.25, random_state=5)

In [88]:
#View shape of the splits
text_counts.shape, X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((23, 809), (17, 809), (6, 809), (17,), (6,))
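
The shapes above can be reproduced with simple arithmetic. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of samples and that the remainder goes to training, which is consistent with the 17/6 split printed above. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

print(split_sizes(23, 0.25))
```

With only six test comments, every single prediction changes the accuracy by about 17 percentage points.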

# Text Classification

In [89]:
#Training the model
from sklearn.naive_bayes import MultinomialNB
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

#Calculate the accuracy score of the model
from sklearn import metrics
predicted = MNB.predict(X_test)
accuracy_score = metrics.accuracy_score(Y_test, predicted)
print("Accuracy Score: ",accuracy_score)

Accuracy Score:  0.5
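
An accuracy of 0.5 on six test comments is hard to judge on its own. One common reference point is the accuracy obtained by always predicting the most frequent training label. The sketch below shows that comparison using hypothetical label lists. They are NOT the notebook's actual split, and `majority_baseline_accuracy` is an illustrative helper.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)

# Hypothetical labels for illustration only (17 training and 6 test comments).
train_labels = ["positive"] * 12 + ["negative"] * 4 + ["neutral"]
test_labels = ["positive", "positive", "positive", "negative", "negative", "neutral"]

baseline = majority_baseline_accuracy(train_labels, test_labels)
print(baseline)
```

If the trained model does not beat this baseline on the real split, the classifier has likely learned little beyond the class frequencies.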


## USING TF-IDF FOR SENTIMENT ANALYSIS OF IRELAND PUBLIC TRANSPORT FROM PRODUCERS' POINT OF VIEW
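
TF-IDF scales each term count by how rare the term is across the comments, so words that appear in almost every comment carry less weight. The sketch below is a minimal standard-library version, assuming the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` and the L2 row normalisation that scikit-learn documents as the `TfidfVectorizer` defaults. The function name `tfidf_rows` and the toy documents are illustrative.

```python
import math
from collections import Counter

def tfidf_rows(tokenised_docs):
    """Return (vocabulary, L2-normalised TF-IDF rows) for pre-tokenised documents."""
    n_docs = len(tokenised_docs)
    vocabulary = sorted({tok for doc in tokenised_docs for tok in doc})
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(tok for doc in tokenised_docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocabulary}
    rows = []
    for doc in tokenised_docs:
        counts = Counter(doc)
        weights = [counts[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

# Toy documents, made up for illustration.
vocabulary, rows = tfidf_rows([["bus", "late"], ["bus", "train"]])
print(vocabulary)
print(rows)
```

In the toy example, "bus" occurs in both documents and therefore receives a lower weight than "late" or "train" within each row.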

In [90]:
#Import the dataset after saving it as csv
transport_producer1 = pd.read_csv("Transport in ireland producer view Reddit Posts.csv")

In [91]:
#View the head of the dataset
transport_producer1.head()

Unnamed: 0,Post Title,Post Body,Comment
0,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,I use public transport quite a lot and it does...
1,Transportation for a week in Ireland?,Hello! We have a group of 13 that will be tour...,Thanks for the feedback! I’ll definitely be us...
2,Dublin and Skellig Michael city break,so I'll surprise my parents with a trip in Ire...,They could hire a private driver-guide if they...
3,Ireland Recommendations,Ireland Recommendations\n\nWe are traveling to...,Do you have any prior experience driving in Ir...
4,Getting around,Hello all!\n\nThank you in advance for your an...,"The island is NOT driveable within half a day,..."


In [92]:
#View the tail of the dataset
transport_producer1.tail()

Unnamed: 0,Post Title,Post Body,Comment
18,Getting to Dublin from Trim,Hi - we're using Trim as our home base for our...,It should be fine to right into town tbh. You ...
19,Need advice on rental car versus buses for Dub...,"Hello all, my husband and I (in our 30s) are p...",I’ll second the YouTube visit. I watched sever...
20,Ireland car rental for young driver?,I’m travelling to Ireland for the second time ...,"Yes that’s what we’re going to do, put the ren..."
21,10 Day Itinerary - Would Love Feedback/Advice,Hello! My husband and I are going to Ireland f...,I'm a local so of course I feel this way but h...
22,Plan for 10 days trip with public transport + ...,Hello everyone! My girlfriend and I are planni...,"Wow at those prices, maybe I should rent out m..."


In [93]:
#Count duplicates in the 'Comment' variable
comment_duplicates = transport_producer1['Comment'].duplicated().sum()
comment_duplicates

0

In [94]:
#View any duplicated comments (an empty DataFrame means there are none)
duplicate_comments = transport_producer1[transport_producer1['Comment'].duplicated(keep=False)]
print(duplicate_comments)

Empty DataFrame
Columns: [Post Title, Post Body, Comment]
Index: []


In [95]:
#Drop the duplicated rows from transport_producer1 based on the Comment variable
transport_producer1 = transport_producer1.drop_duplicates(subset=['Comment'])
transport_producer1

Unnamed: 0,Post Title,Post Body,Comment
0,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,I use public transport quite a lot and it does...
1,Transportation for a week in Ireland?,Hello! We have a group of 13 that will be tour...,Thanks for the feedback! I’ll definitely be us...
2,Dublin and Skellig Michael city break,so I'll surprise my parents with a trip in Ire...,They could hire a private driver-guide if they...
3,Ireland Recommendations,Ireland Recommendations\n\nWe are traveling to...,Do you have any prior experience driving in Ir...
4,Getting around,Hello all!\n\nThank you in advance for your an...,"The island is NOT driveable within half a day,..."
5,Getting around,Hello all!\n\nThank you in advance for your an...,I found driving in Ireland to be fun. But I do...
6,Which route (by train/bus) would you fine folk...,"Hi everyone, \n\nIn May '24, my parents (75 yr...",Anything we have to be at in a timely fashion ...
7,Rolling luggage in Dublin and Belfast?,Hello! I'm super excited to visit Ireland in O...,"Are they not just ‘normal’ taxi drivers, under..."
8,Rolling luggage in Dublin and Belfast?,Hello! I'm super excited to visit Ireland in O...,Consistently much cheaper than what though? Th...
9,Rolling luggage in Dublin and Belfast?,Hello! I'm super excited to visit Ireland in O...,yeah but free now is as well. many people pref...


In [96]:
#Check for duplicates
comment_duplicates = transport_producer1['Comment'].duplicated().sum()
comment_duplicates

0

In [97]:
#Check the shape of the dataset
transport_producer1.shape

(23, 3)

## Preprocess text

In [98]:
#Count the number of words in the comment
transport_producer1['word_count'] = transport_producer1['Comment'].apply(lambda x: len(str(x).split(" ")))

In [99]:
#View comment and word_count
transport_producer1[['Comment','word_count']].head()

Unnamed: 0,Comment,word_count
0,I use public transport quite a lot and it does...,940
1,Thanks for the feedback! I’ll definitely be us...,29
2,They could hire a private driver-guide if they...,163
3,Do you have any prior experience driving in Ir...,60
4,"The island is NOT driveable within half a day,...",502


In [100]:
#Count the number of characters in each comment
transport_producer1['char_count'] = transport_producer1['Comment'].str.len() 

In [101]:
#View char_count and comment variables
transport_producer1[['Comment','char_count']].head()

Unnamed: 0,Comment,char_count
0,I use public transport quite a lot and it does...,5212
1,Thanks for the feedback! I’ll definitely be us...,163
2,They could hire a private driver-guide if they...,937
3,Do you have any prior experience driving in Ir...,306
4,"The island is NOT driveable within half a day,...",2693


In [102]:
#View the first 5 observations of the dataset
transport_producer1.head(5)

Unnamed: 0,Post Title,Post Body,Comment,word_count,char_count
0,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,I use public transport quite a lot and it does...,940,5212
1,Transportation for a week in Ireland?,Hello! We have a group of 13 that will be tour...,Thanks for the feedback! I’ll definitely be us...,29,163
2,Dublin and Skellig Michael city break,so I'll surprise my parents with a trip in Ire...,They could hire a private driver-guide if they...,163,937
3,Ireland Recommendations,Ireland Recommendations\n\nWe are traveling to...,Do you have any prior experience driving in Ir...,60,306
4,Getting around,Hello all!\n\nThank you in advance for your an...,"The island is NOT driveable within half a day,...",502,2693


In [103]:
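#Create a function for the average word length (in characters) of a comment
def avg_word(sentence):
    words = sentence.split()
    #Guard against empty comments to avoid dividing by zero
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)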

In [104]:
#Calculate the average word length of each comment and store it in the avg_word variable
transport_producer1['avg_word'] = transport_producer1['Comment'].apply(lambda x: avg_word(x))

In [105]:
#View comment and avg_word
transport_producer1[['Comment','avg_word']].head()

Unnamed: 0,Comment,avg_word
0,I use public transport quite a lot and it does...,4.51589
1,Thanks for the feedback! I’ll definitely be us...,4.655172
2,They could hire a private driver-guide if they...,4.754601
3,Do you have any prior experience driving in Ir...,4.116667
4,"The island is NOT driveable within half a day,...",4.465021


In [106]:
#Create stopwords
stop = stopwords.words('english')

In [107]:
#Count stopwords using stop and store value in stopwords variable
transport_producer1['stopwords'] = transport_producer1['Comment'].apply(lambda x: len([x for x in x.split() if x in stop]))
transport_producer1[['Comment','stopwords']].head()

Unnamed: 0,Comment,stopwords
0,I use public transport quite a lot and it does...,442
1,Thanks for the feedback! I’ll definitely be us...,15
2,They could hire a private driver-guide if they...,71
3,Do you have any prior experience driving in Ir...,31
4,"The island is NOT driveable within half a day,...",164


In [108]:
#Count the purely numeric tokens in each comment, store the count in numerics, then view Comment and numerics
transport_producer1['numerics'] = transport_producer1['Comment'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
transport_producer1[['Comment','numerics']].head()

Unnamed: 0,Comment,numerics
0,I use public transport quite a lot and it does...,2
1,Thanks for the feedback! I’ll definitely be us...,0
2,They could hire a private driver-guide if they...,0
3,Do you have any prior experience driving in Ir...,0
4,"The island is NOT driveable within half a day,...",3


In [109]:
#Count the all-uppercase words in each comment, store the count in upper, then view Comment and upper
transport_producer1['upper'] = transport_producer1['Comment'].apply(lambda x: len([x for x in x.split() if x.isupper()]))
transport_producer1[['Comment','upper']].head()

Unnamed: 0,Comment,upper
0,I use public transport quite a lot and it does...,14
1,Thanks for the feedback! I’ll definitely be us...,0
2,They could hire a private driver-guide if they...,1
3,Do you have any prior experience driving in Ir...,4
4,"The island is NOT driveable within half a day,...",23


In [110]:
#Convert the comments to lower case
transport_producer1['Comment'] = transport_producer1['Comment'].apply(lambda x: " ".join(x.lower() for x in x.split()))
transport_producer1['Comment'].head()

In [111]:
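#Remove special characters (regex=True is required on pandas 2.0+, where str.replace matches literally by default)
transport_producer1['Comment'] = transport_producer1['Comment'].str.replace(r'[^\w\s]', '', regex=True)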
transport_producer1['Comment'].head()

0    i use public transport quite a lot and it does...
1    thanks for the feedback ill definitely be usin...
2    they could hire a private driverguide if they ...
3    do you have any prior experience driving in ir...
4    the island is not driveable within half a day ...
Name: Comment, dtype: object
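
The pattern `[^\w\s]` deletes every character that is neither a word character nor whitespace, which also fuses contractions and hyphenated words. The sketch below shows the effect with Python's `re` module on a made-up sentence. `strip_punctuation` is an illustrative helper, not part of the notebook.

```python
import re

# Matches any character that is not a word character and not whitespace.
PUNCTUATION = re.compile(r"[^\w\s]")

def strip_punctuation(text):
    """Delete punctuation without inserting a replacement character."""
    return PUNCTUATION.sub("", text)

cleaned = strip_punctuation("i’ll hire a driver-guide!")
print(cleaned)
```

Note that "i’ll" becomes "ill", as in row 1 of the output above. A lexicon-based scorer may then read it as the unrelated word "ill", so expanding contractions before this step could be worth considering.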

In [112]:
#Remove stopwords
stop = stopwords.words('english')
transport_producer1['Comment'] = transport_producer1['Comment'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
transport_producer1['Comment'].head()

0    use public transport quite lot doesnt tend let...
1    thanks feedback ill definitely using cards ins...
2    could hire private driverguide dont mind spend...
3    prior experience driving ireland continent gro...
4    island driveable within half day unless litera...
Name: Comment, dtype: object

In [113]:
#Find the 10 most frequent words across all comments
freq = pd.Series(' '.join(transport_producer1['Comment']).split()).value_counts()[:10]
freq

driving    27
ireland    23
get        20
drive      19
dublin     18
dont       17
car        16
like       15
day        13
2          12
dtype: int64
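
The same frequency count can be sketched with `collections.Counter`, which avoids joining all comments into one large string first. `top_words` and the sample comments are hypothetical and used only for illustration.

```python
from collections import Counter

def top_words(comments, n=10):
    """Return the n most common whitespace-separated tokens with their counts."""
    counts = Counter(word for comment in comments for word in comment.split())
    return counts.most_common(n)

# Made-up, already cleaned comments.
sample_comments = ["driving ireland", "driving car", "car driving"]
print(top_words(sample_comments, n=2))
```

As in the output above, digits such as "2" survive the cleaning steps and are counted like any other token.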

In [114]:
#Optional step (left disabled): remove the 10 most frequent words from the comments
#freq = list(freq.index)
#transport_producer1['Comment'] = transport_producer1['Comment'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
#transport_producer1['Comment'].head()

In [115]:
#Tokenize the second comment into words with TextBlob
TextBlob(transport_producer1['Comment'][1]).words

WordList(['thanks', 'feedback', 'ill', 'definitely', 'using', 'cards', 'insurance', 'chase', 'sapphire', 'duped', 'fashion', 'past', 'getting', 'rental'])

In [116]:
#Preview Porter stemming on the first five comments (the result is not stored)
st = PorterStemmer()
transport_producer1['Comment'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

0    use public transport quit lot doesnt tend let ...
1    thank feedback ill definit use card insur chas...
2    could hire privat driverguid dont mind spend b...
3    prior experi drive ireland contin group size t...
4    island driveabl within half day unless liter w...
Name: Comment, dtype: object

In [117]:
#Lemmatize each word and store the result back in Comment
transport_producer1['Comment'] = transport_producer1['Comment'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
transport_producer1['Comment'].head()

0    use public transport quite lot doesnt tend let...
1    thanks feedback ill definitely using card insu...
2    could hire private driverguide dont mind spend...
3    prior experience driving ireland continent gro...
4    island driveable within half day unless litera...
Name: Comment, dtype: object

In [118]:
#List the bigrams (pairs of adjacent words) of the first comment
TextBlob(transport_producer1['Comment'][0]).ngrams(2)

[WordList(['use', 'public']),
 WordList(['public', 'transport']),
 WordList(['transport', 'quite']),
 WordList(['quite', 'lot']),
 WordList(['lot', 'doesnt']),
 WordList(['doesnt', 'tend']),
 WordList(['tend', 'let']),
 WordList(['let', 'much']),
 WordList(['much', 'think']),
 WordList(['think', 'prepared']),
 WordList(['prepared', 'bit']),
 WordList(['bit', 'research']),
 WordList(['research', 'adjust']),
 WordList(['adjust', 'plan']),
 WordList(['plan', 'okay']),
 WordList(['okay', 'experience']),
 WordList(['experience', 'really']),
 WordList(['really', 'shouldnt']),
 WordList(['shouldnt', 'point']),
 WordList(['point', 'happen']),
 WordList(['happen', 'government']),
 WordList(['government', 'transport']),
 WordList(['transport', 'company']),
 WordList(['company', 'look']),
 WordList(['look', 'passenger']),
 WordList(['passenger', 'number']),
 WordList(['number', 'think']),
 WordList(['think', 'high']),
 WordList(['high', 'rather']),
 WordList(['rather', 'realising']),
 WordList(['
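
A bigram list like the one above is simply every pair of adjacent words. A minimal standard-library sketch, assuming whitespace tokenisation of already cleaned text, is shown below. `bigrams` is an illustrative helper.

```python
def bigrams(text):
    """Return all pairs of adjacent whitespace-separated words."""
    words = text.split()
    return list(zip(words, words[1:]))

# The first five words of the cleaned first comment shown above.
pairs = bigrams("use public transport quite lot")
print(pairs)
```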

In [119]:
#Preview the TextBlob (polarity, subjectivity) pair for the first five comments
transport_producer1['Comment'][:5].apply(lambda x: TextBlob(x).sentiment)

0    (0.05311915235302331, 0.44942908346134147)
1                             (-0.1375, 0.4875)
2      (0.1379166666666667, 0.5283333333333333)
3    (0.14732142857142858, 0.35714285714285715)
4     (0.16056547619047618, 0.5881944444444444)
Name: Comment, dtype: object

In [120]:
#Store the TextBlob polarity of each comment in the sentiment column
transport_producer1['sentiment'] = transport_producer1['Comment'].apply(lambda x: TextBlob(x).sentiment[0] )
transport_producer1[['Comment','sentiment']].head()

Unnamed: 0,Comment,sentiment
0,use public transport quite lot doesnt tend let...,0.053119
1,thanks feedback ill definitely using card insu...,-0.1375
2,could hire private driverguide dont mind spend...,0.137917
3,prior experience driving ireland continent gro...,0.147321
4,island driveable within half day unless litera...,0.160565


In [121]:
transport_producer1.head(5)

Unnamed: 0,Post Title,Post Body,Comment,word_count,char_count,avg_word,stopwords,numerics,upper,sentiment
0,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,use public transport quite lot doesnt tend let...,940,5212,4.51589,442,2,14,0.053119
1,Transportation for a week in Ireland?,Hello! We have a group of 13 that will be tour...,thanks feedback ill definitely using card insu...,29,163,4.655172,15,0,0,-0.1375
2,Dublin and Skellig Michael city break,so I'll surprise my parents with a trip in Ire...,could hire private driverguide dont mind spend...,163,937,4.754601,71,0,1,0.137917
3,Ireland Recommendations,Ireland Recommendations\n\nWe are traveling to...,prior experience driving ireland continent gro...,60,306,4.116667,31,0,4,0.147321
4,Getting around,Hello all!\n\nThank you in advance for your an...,island driveable within half day unless litera...,502,2693,4.465021,164,3,23,0.160565


In [122]:
from textblob import TextBlob

# Function to categorize sentiment polarities
def categorize_sentiment(polarity):
    if polarity > 0:
        return 'Positive'
    elif polarity < 0:
        return 'Negative'
    else:
        return 'Neutral'

transport_producer1['sentiment_polarity'] = transport_producer1['Comment'].apply(lambda x: TextBlob(x).sentiment.polarity)
transport_producer1['sentiment'] = transport_producer1['sentiment_polarity'].apply(categorize_sentiment)
print(transport_producer1[['Comment', 'sentiment']].head())

                                             Comment sentiment
0  use public transport quite lot doesnt tend let...  Positive
1  thanks feedback ill definitely using card insu...  Negative
2  could hire private driverguide dont mind spend...  Positive
3  prior experience driving ireland continent gro...  Positive
4  island driveable within half day unless litera...  Positive
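
The TextBlob labels above use a cut-off at zero, whereas the earlier VADER labels used ±0.05, and the two tools score different versions of the text (raw versus cleaned). The sketch below compares the two labellings for the first five comments, using scores copied by hand from the outputs shown earlier. The helper names are illustrative.

```python
# Scores copied from the printed outputs for rows 0 to 4.
vader_compound = [-0.9729, -0.3578, 0.9680, -0.3400, 0.9671]
textblob_polarity = [0.053119, -0.1375, 0.137917, 0.147321, 0.160565]

def vader_label(compound):
    """Label with the symmetric 0.05 thresholds used in the VADER section."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

def textblob_label(polarity):
    """Label with the zero cut-off used in the TextBlob section."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

# Row indices where the two approaches assign different labels.
disagreements = [
    i for i, (c, p) in enumerate(zip(vader_compound, textblob_polarity))
    if vader_label(c) != textblob_label(p)
]
print(disagreements)
```

Rows where the tools disagree are good candidates for manual review before either label is used as ground truth for a classifier.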


In [123]:
transport_producer1.head(5)

Unnamed: 0,Post Title,Post Body,Comment,word_count,char_count,avg_word,stopwords,numerics,upper,sentiment,sentiment_polarity
0,Experiences using Public transport in Ireland,I’ve seen a lot of posts on this Reddit about ...,use public transport quite lot doesnt tend let...,940,5212,4.51589,442,2,14,Positive,0.053119
1,Transportation for a week in Ireland?,Hello! We have a group of 13 that will be tour...,thanks feedback ill definitely using card insu...,29,163,4.655172,15,0,0,Negative,-0.1375
2,Dublin and Skellig Michael city break,so I'll surprise my parents with a trip in Ire...,could hire private driverguide dont mind spend...,163,937,4.754601,71,0,1,Positive,0.137917
3,Ireland Recommendations,Ireland Recommendations\n\nWe are traveling to...,prior experience driving ireland continent gro...,60,306,4.116667,31,0,4,Positive,0.147321
4,Getting around,Hello all!\n\nThank you in advance for your an...,island driveable within half day unless litera...,502,2693,4.465021,164,3,23,Positive,0.160565


In [124]:
#columns to drop
columns_to_drop = ['Post Title','Post Body','word_count','char_count','avg_word','stopwords','numerics','upper','sentiment_polarity']
transport_producer1.drop(columns=columns_to_drop, inplace=True)

In [125]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

#Vectorize the Text Data
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word',
                        stop_words='english', ngram_range=(1,1))
X = tfidf.fit_transform(transport_producer1['Comment'])
y = transport_producer1['sentiment']

#Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#create a Classifier (Naive Bayes in this case)
classifier = MultinomialNB()

#Train the Classifier
classifier.fit(X_train, y_train)

#Evaluate the Model
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00         1
    Positive       0.80      1.00      0.89         4

    accuracy                           0.80         5
   macro avg       0.40      0.50      0.44         5
weighted avg       0.64      0.80      0.71         5
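
The report above is consistent with a classifier that predicts Positive for all five test comments: Positive recall is 1.00, Positive precision is 4/5, and the single Negative comment is never recovered. The sketch below recomputes the per-class figures by hand under that assumption. The label lists are reconstructed from the supports in the report, not taken from the notebook, and `precision_recall_f1` is an illustrative helper.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall and F1, returning 0.0 where a ratio is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    actual = sum(1 for t in y_true if t == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Reconstructed from the supports in the report: 1 Negative, 4 Positive.
y_true = ["Negative", "Positive", "Positive", "Positive", "Positive"]
# Assumption: the classifier never predicts Negative.
y_pred = ["Positive"] * 5

negative_scores = precision_recall_f1(y_true, y_pred, "Negative")
positive_scores = precision_recall_f1(y_true, y_pred, "Positive")
macro_f1 = (negative_scores[2] + positive_scores[2]) / 2
print(negative_scores, positive_scores, macro_f1)
```

The 0.80 accuracy therefore equals the share of Positive comments in the test set, and the macro-averaged F1 of about 0.44 gives a more cautious picture of the model.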

