# CA2_Notebook_4_TwitterAPI_GavinDavis_sba22311

This notebook contains the code used to query the Twitter API (through a Twitter developer account) for tweets about dairy research and to perform a sentiment analysis on them. Recent tweets were gathered with the recent search endpoint and stored in a CSV file named data.csv. The text was then cleaned and preprocessed with NLTK and passed to a machine learning algorithm to predict positive, negative and neutral sentiment surrounding research in dairy.

After the tweets were collected with the Twitter API, they were processed with NLTK to remove stopwords and symbols and to stem similar words, producing cleaned text data. TextBlob was then used to score the polarity and subjectivity of each tweet as a measure of sentiment. Tweets with a polarity greater than 0 were deemed positive, tweets with a polarity less than 0 were deemed negative, and tweets with a polarity equal to 0 were labelled neutral. After sentiment classification, the labelled data was used to train a multinomial naive Bayes classifier to predict the sentiment of tweet text. This model showed approximately 50% accuracy (0.47 on the 15-tweet test split).

In [53]:
#Importing the necessary libraries

# For sending GET requests to the API
import requests
# For saving access tokens and for file management when creating and adding to the dataset
import os
# For dealing with json responses we receive from the API
import json
# For displaying the data after
import pandas as pd
# For saving the response data in CSV format
import csv
# For parsing the dates received from twitter in readable formats
import datetime
import dateutil.parser
import unicodedata
#To add wait time between requests
import time
import numpy as np

#For creating interactive visualisations
import plotly.express as px


In [54]:
#Importing variables created with Twitter API access tokens created in a .env file for privacy purposes

from dotenv import dotenv_values

config =  dotenv_values(".env")

In [55]:
#Defining an authorization function which contains the Bearer token to access twitter API
def auth():
    return config["BEARER_TOKEN"]
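If the .env file is missing or has no BEARER_TOKEN entry, `config["BEARER_TOKEN"]` fails with a bare KeyError. A minimal sketch of a more defensive lookup follows. The helper name `get_bearer_token` and the fallback to an environment variable are assumptions, not part of the notebook.

```python
import os

def get_bearer_token(config, env=None):
    """Return the bearer token from a dotenv-style mapping, else from the environment."""
    env = os.environ if env is None else env
    token = config.get("BEARER_TOKEN") or env.get("BEARER_TOKEN")
    if not token:
        raise RuntimeError("BEARER_TOKEN not found in .env or environment variables")
    return token

# Usage with a plain dict standing in for dotenv_values(".env")
print(get_bearer_token({"BEARER_TOKEN": "example-token"}))
```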

In [57]:
#Defining a function which takes the bearer token and builds the Authorization header sent with each request
def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers

In [70]:
#Defining a function which returns the URL for Twitter recent search together with the query parameters
def create_url(keyword, max_results = 15):
    
    search_url = "https://api.twitter.com/2/tweets/search/recent" 
    
    
    query_params = {'query': keyword,
                    'max_results': max_results,
                    'expansions': 'author_id,in_reply_to_user_id,geo.place_id',
                    'tweet.fields': 'id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source',
                    'user.fields': 'id,name,username,created_at,description,public_metrics,verified',
                    'place.fields': 'full_name,id,country,country_code,geo,name,place_type',
                    'next_token': {}}
    return (search_url, query_params)

In [71]:
#Defining a function which sends the GET request (with an optional next_token for pagination), raises an exception unless the response code is 200, and returns the parsed JSON

def connect_to_endpoint(url, headers, params, next_token = None):
    params['next_token'] = next_token   #params object received from create_url function
    response = requests.request("GET", url, headers = headers, params = params)
    print("Endpoint Response Code: " + str(response.status_code))
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()
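The `next_token` parameter is what allows more than one page of results to be collected: each response from the recent search endpoint carries a `meta` object, which includes a `next_token` while further pages remain. A minimal sketch of such a loop is given below, under the assumption that the page-fetching step is passed in as a callable so the logic can be tried offline. `collect_pages` and the fake pages are illustrative names and data, not part of the notebook.

```python
import time

def collect_pages(fetch_page, max_pages=5, wait_seconds=0.0):
    """Follow next_token values until no pages remain or max_pages is reached.

    fetch_page is any callable that takes a next_token (or None) and
    returns the parsed JSON response as a dict.
    """
    tweets = []
    next_token = None
    for _ in range(max_pages):
        page = fetch_page(next_token)
        tweets.extend(page.get("data", []))
        next_token = page.get("meta", {}).get("next_token")
        if next_token is None:
            break
        time.sleep(wait_seconds)  # pause between requests to respect rate limits
    return tweets

# Hypothetical two-page response, used only to exercise the loop offline
fake_pages = {
    None: {"data": [{"id": "1"}, {"id": "2"}], "meta": {"next_token": "abc"}},
    "abc": {"data": [{"id": "3"}], "meta": {}},
}
all_tweets = collect_pages(lambda token: fake_pages[token])
print(len(all_tweets))  # 3
```

With the functions defined in this notebook, `fetch_page` could be `lambda token: connect_to_endpoint(url[0], headers, url[1], token)`, with a non-zero `wait_seconds` between calls.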

In [87]:
#Inputs for the request
bearer_token = auth()
headers = create_headers(bearer_token)
keyword = "Dairy research -is:retweet lang:en"
max_results = 100

In [88]:
url = create_url(keyword, max_results)
json_response = connect_to_endpoint(url[0], headers, url[1])

Endpoint Response Code: 200


In [89]:
print(json.dumps(json_response, indent=4, sort_keys=True))

{
    "data": [
        {
            "author_id": "802339170127446016",
            "conversation_id": "1608740063881605123",
            "created_at": "2022-12-30T08:21:34.000Z",
            "edit_history_tweet_ids": [
                "1608740063881605123"
            ],
            "id": "1608740063881605123",
            "lang": "en",
            "public_metrics": {
                "like_count": 2,
                "quote_count": 0,
                "reply_count": 1,
                "retweet_count": 0
            },
            "reply_settings": "everyone",
            "text": "My friend who works in women\u2019s health told me today that until about ten years ago, all lactation advice for human women was based on dairy industry research. Lol. Lmao."
        },
        {
            "author_id": "1602596759167029249",
            "conversation_id": "1608701213767499776",
            "created_at": "2022-12-30T05:47:12.000Z",
            "edit_history_tweet_ids": [
                "160

In [56]:
# Create file
csvFile = open("data.csv", "a", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)

#Create headers for the data we want to save; in this example, we only want to save these columns in our dataset
csvWriter.writerow(['author id', 'created_at', 'geo', 'id','lang', 'like_count', 'quote_count', 'reply_count','source','tweet'])
csvFile.close()

In [57]:
#Defining a function which will append the retrieved tweets to a csv file and defining the columns

def append_to_csv(json_response, fileName):

    #A counter variable
    counter = 0

    #Open OR create the target CSV file
    csvFile = open(fileName, "a", newline="", encoding='utf-8')
    csvWriter = csv.writer(csvFile)

    #Loop through each tweet
    for tweet in json_response['data']:
        
        # We will create a variable for each since some of the keys might not exist for some tweets
        # So we will account for that

        # 1. Author ID
        author_id = tweet['author_id']

        # 2. Time created
        created_at = dateutil.parser.parse(tweet['created_at'])

        # 3. Geolocation
        if ('geo' in tweet):   
            geo = tweet['geo']['place_id']
        else:
            geo = " "

        # 4. Tweet ID
        tweet_id = tweet['id']

        # 5. Language
        lang = tweet['lang']

        # 6. Tweet metrics
        
        reply_count = tweet['public_metrics']['reply_count']
        like_count = tweet['public_metrics']['like_count']
        quote_count = tweet['public_metrics']['quote_count']
        

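        # 7. Source (not present in every response, so fall back to an empty string)
        source = tweet.get('source', '')

        # 8. Tweet text
        text = tweet['text']
        
        # Assemble all data in a list, in the same order as the header row (10 columns)
        res = [author_id, created_at, geo, tweet_id, lang, like_count, quote_count, reply_count, source, text]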
        
        # Append the result to the CSV file
        csvWriter.writerow(res)
        counter += 1

    # When done, close the CSV file
    csvFile.close()

    # Print the number of tweets for this iteration
    print("# of Tweets added from this response: ", counter)
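Because each row is written as a plain list, a missing field silently shifts every later value into the wrong column, and reopening the file in append mode to write the header adds a second header row. A minimal sketch of one way to guard against both is given below, using `csv.DictWriter` with a fixed column list. The names `FIELDNAMES` and `append_rows` are assumptions for illustration.

```python
import csv
import os

FIELDNAMES = ['author id', 'created_at', 'geo', 'id', 'lang',
              'like_count', 'quote_count', 'reply_count', 'source', 'tweet']

def append_rows(rows, file_name):
    """Append dict rows to a CSV file, writing the header only if the file is new or empty."""
    is_new = not os.path.exists(file_name) or os.path.getsize(file_name) == 0
    with open(file_name, "a", newline="", encoding="utf-8") as csv_file:
        # restval fills any column missing from a row with an empty string
        writer = csv.DictWriter(csv_file, fieldnames=FIELDNAMES, restval="")
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
    return len(rows)

# Example row with no 'source' or 'geo' key: those cells are left empty
example = [{'author id': '1', 'id': '10', 'lang': 'en', 'tweet': 'Dairy research'}]
```

Calling `append_rows(example, "data.csv")` twice would leave one header row followed by two data rows, each with the tweet text in the last column.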

In [58]:
#Appending the tweets received from the twitter api to a csv to be used for sentiment analysis called data.csv
append_to_csv(json_response, "data.csv")

# of Tweets added from this response:  63


In [2]:
#Reading in the csv file of tweets

df = pd.read_csv('data.csv')

In [3]:
#Visualising the tweet data

df.head(10)

Unnamed: 0.1,Unnamed: 0,0,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9
0,0,json_response,,,,,,,,
1,author id,created_at,geo,id,lang,like_count,quote_count,reply_count,source,tweet
2,1.53347E+18,2022-12-12 22:20:19+00:00,,1.60243E+18,en,1,0,0,ContentStudio.io,↪️ https://t.co/wgpBfU45OQ\n\nDairy 🥛 p...
3,1.09347E+18,2022-12-12 21:42:05+00:00,,1.60242E+18,en,0,0,0,Sprout Social,Use industry experts to uncover business oppor...
4,397317974,2022-12-12 21:05:32+00:00,,1.60241E+18,en,0,0,0,dlvr.it,New Research: Taxonomic and predicted function...
5,8.27649E+17,2022-12-12 19:30:00+00:00,,1.60239E+18,en,0,0,0,TweetDeck,@wsuanimsci is pleased to welcome Visiting Ful...
6,1.59683E+18,2022-12-12 19:19:46+00:00,,1.60238E+18,en,0,0,0,Twitter Web App,@DianeOLeary I had HB 111 Crp 8 to 43 5wks bef...
7,8.85132E+17,2022-12-12 18:59:53+00:00,,1.60238E+18,en,9,0,0,Twitter Web App,@Gsinghxxx @Son_of_Space Do the research then....
8,83534499,2022-12-12 18:41:05+00:00,,1.60237E+18,en,6,0,0,Twitter Web App,Christine Kuo received the Humane Slaughter As...
9,1.5294E+18,2022-12-12 17:04:26+00:00,,1.60235E+18,en,3,0,0,Twitter Web App,@titwstomaslas @brian_lewis67 @pattherabbit3 @...
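The author id and id columns appear here in scientific notation (for example 1.60243E+18), so the original 19-digit IDs can no longer be recovered from these values. A short illustration of why tweet IDs are safer kept as strings, using the first tweet ID from the JSON response printed earlier in this notebook:

```python
tweet_id = "1608740063881605123"  # id taken from the JSON response printed earlier

as_float = float(tweet_id)   # what happens when the id is treated as a number
round_trip = int(as_float)   # converting back does not restore the original digits

print(round_trip == int(tweet_id))  # False: a 64-bit float cannot hold all 19 digits
```

When reading the file with pandas, passing a `dtype` of `str` for these two columns to `pd.read_csv` keeps them as text.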


In [5]:
#Renaming columns of the dataframe

df.rename(columns={'Unnamed: 0':'author id',
                   '0':'created_at',
                   'Unnamed: 2':'geo',
                   'Unnamed: 3':'id',
                   'Unnamed: 4':'lang',
                   'Unnamed: 5':'like_count',
                   'Unnamed: 6':'quote_count',
                   'Unnamed: 7':'reply_count',
                   'Unnamed: 8':'source',
                   'Unnamed: 9':'tweet'}, inplace = True)

In [6]:
#Visualising the renamed columns
df

Unnamed: 0,author id,created_at,geo,id,lang,like_count,quote_count,reply_count,source,tweet
0,0,json_response,,,,,,,,
1,author id,created_at,geo,id,lang,like_count,quote_count,reply_count,source,tweet
2,1.53347E+18,2022-12-12 22:20:19+00:00,,1.60243E+18,en,1,0,0,ContentStudio.io,↪️ https://t.co/wgpBfU45OQ\n\nDairy 🥛 p...
3,1.09347E+18,2022-12-12 21:42:05+00:00,,1.60242E+18,en,0,0,0,Sprout Social,Use industry experts to uncover business oppor...
4,397317974,2022-12-12 21:05:32+00:00,,1.60241E+18,en,0,0,0,dlvr.it,New Research: Taxonomic and predicted function...
...,...,...,...,...,...,...,...,...,...,...
227,9.9708E+17,2022-12-23 05:32:51+00:00,,1.60616E+18,en,0,0,1,Got MilQ? Fake Milk to Replace Dairy and Breas...,
228,138418838,2022-12-22 21:20:40+00:00,,1.60604E+18,en,0,0,0,TW : Danone North America Propels Research Fo...,
229,485747496,2022-12-22 19:10:03+00:00,,1.606E+18,en,1,0,0,Yet another reason to avoid processed foods! R...,
230,39261830,2022-12-22 18:48:00+00:00,,1.606E+18,en,4,1,0,"ICYMI: In three years, UW System’s Dairy Inn...",


In [7]:
#Dropping unnecessary rows (the placeholder row and the duplicated header row)

df.drop([0,1], axis=0, inplace=True)

In [8]:
#Visualising the dataframe

df

Unnamed: 0,author id,created_at,geo,id,lang,like_count,quote_count,reply_count,source,tweet
2,1.53347E+18,2022-12-12 22:20:19+00:00,,1.60243E+18,en,1,0,0,ContentStudio.io,↪️ https://t.co/wgpBfU45OQ\n\nDairy 🥛 p...
3,1.09347E+18,2022-12-12 21:42:05+00:00,,1.60242E+18,en,0,0,0,Sprout Social,Use industry experts to uncover business oppor...
4,397317974,2022-12-12 21:05:32+00:00,,1.60241E+18,en,0,0,0,dlvr.it,New Research: Taxonomic and predicted function...
5,8.27649E+17,2022-12-12 19:30:00+00:00,,1.60239E+18,en,0,0,0,TweetDeck,@wsuanimsci is pleased to welcome Visiting Ful...
6,1.59683E+18,2022-12-12 19:19:46+00:00,,1.60238E+18,en,0,0,0,Twitter Web App,@DianeOLeary I had HB 111 Crp 8 to 43 5wks bef...
...,...,...,...,...,...,...,...,...,...,...
227,9.9708E+17,2022-12-23 05:32:51+00:00,,1.60616E+18,en,0,0,1,Got MilQ? Fake Milk to Replace Dairy and Breas...,
228,138418838,2022-12-22 21:20:40+00:00,,1.60604E+18,en,0,0,0,TW : Danone North America Propels Research Fo...,
229,485747496,2022-12-22 19:10:03+00:00,,1.606E+18,en,1,0,0,Yet another reason to avoid processed foods! R...,
230,39261830,2022-12-22 18:48:00+00:00,,1.606E+18,en,4,1,0,"ICYMI: In three years, UW System’s Dairy Inn...",


In [9]:
#Dropping unnecessary columns for the sentiment analysis

df.drop(['author id', 'created_at', 'geo',
        'id', 'lang', 'like_count', 'quote_count',
        'reply_count', 'source'], axis=1, inplace=True)

In [10]:
#visualising dataframe with tweets only

df

Unnamed: 0,tweet
2,↪️ https://t.co/wgpBfU45OQ\n\nDairy 🥛 p...
3,Use industry experts to uncover business oppor...
4,New Research: Taxonomic and predicted function...
5,@wsuanimsci is pleased to welcome Visiting Ful...
6,@DianeOLeary I had HB 111 Crp 8 to 43 5wks bef...
...,...
227,
228,
229,
230,


In [11]:
#Resetting the index (this returns a new dataframe; df itself is unchanged because inplace is not set)

df.reset_index()

Unnamed: 0,index,tweet
0,2,↪️ https://t.co/wgpBfU45OQ\n\nDairy 🥛 p...
1,3,Use industry experts to uncover business oppor...
2,4,New Research: Taxonomic and predicted function...
3,5,@wsuanimsci is pleased to welcome Visiting Ful...
4,6,@DianeOLeary I had HB 111 Crp 8 to 43 5wks bef...
...,...,...
225,227,
226,228,
227,229,
228,230,


In [12]:
#Dropping rows with missing values (rows whose tweet column is empty)

df.dropna(inplace=True)

In [13]:
#The earlier reset_index was not applied in place, so the dataframe still has its old index

df

Unnamed: 0,tweet
2,↪️ https://t.co/wgpBfU45OQ\n\nDairy 🥛 p...
3,Use industry experts to uncover business oppor...
4,New Research: Taxonomic and predicted function...
5,@wsuanimsci is pleased to welcome Visiting Ful...
6,@DianeOLeary I had HB 111 Crp 8 to 43 5wks bef...
...,...
101,Digestive #Health #Food &amp; #Drink Market Si...
102,tweet
103,tweet
104,tweet


In [14]:
#Resetting the index with inplace=True, which solves the problem

df.reset_index(inplace=True)

In [15]:
#Defining variable X as the tweet data
X = df['tweet']

In [16]:
#Importing necessary libraries for natural language processing

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

import string
from nltk.stem import PorterStemmer

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/gavindavis/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
# Store the stopwords into the object named as "stop_words"
stop_words = stopwords.words('english')

# Store the string.punctuation into an object punct
punct = string.punctuation

# Initialise a PorterStemmer object for stemming
stemmer = PorterStemmer()

In [18]:
#Visualising stop words

stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [19]:
#Using regular expressions to remove symbols and numbers, then lower-casing, removing stopwords and stemming each tweet

import re

cleaned_data=[]

# Loop over every tweet in X; [^a-zA-Z] matches any character that is not a letter, so everything except letters is replaced with a space

for i in range(len(X)):
    tweet = re.sub('[^a-zA-Z]', ' ', X.iloc[i])
    tweet = tweet.lower().split()
    tweet = [stemmer.stem(word) for word in tweet if (word not in stop_words) and (word not in punct)]
    tweet = ' '.join(tweet)
    cleaned_data.append(tweet)
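In the cleaned output, links survive as fragments such as `http co wgpbfu`, because the regular expression splits each URL into separate runs of letters. A minimal sketch of a cleaning function that strips URLs and @mentions before the letter filter is given below. The function name, the two extra patterns and the tiny stop-word set are illustrative assumptions, and stemming is left out so the example runs without NLTK.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+')   # links such as https://t.co/...
MENTION_PATTERN = re.compile(r'@\w+')       # user mentions such as @user

def clean_tweet(text, stop_words):
    """Remove URLs, mentions and non-letters, lower-case, and drop stop words."""
    text = URL_PATTERN.sub(' ', text)
    text = MENTION_PATTERN.sub(' ', text)
    text = re.sub('[^a-zA-Z]', ' ', text)
    words = [word for word in text.lower().split() if word not in stop_words]
    return ' '.join(words)

# Small hypothetical stop-word set; the notebook uses NLTK's English list
example_stop_words = {'is', 'the', 'a'}
print(clean_tweet("@user Dairy research: https://t.co/abc is GREAT!", example_stop_words))
# prints: dairy research great
```

Passing a set rather than a list for the stop words also makes each membership check faster, which matters more as the number of tweets grows.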

In [20]:
#Visualising the cleaned data

cleaned_data

['http co wgpbfu oq dairi product milk chees yogurt contribut livestock industri ghg emiss come health research limit dairi consumpt mix lower risk cardiovascular diseas http co maprtthksr',
 'use industri expert uncov busi opportun provid global region outlook deliv dairi market analysi specif particular busi opportun countri product ingredi applic channel http co xywza ox research http co zkyumyzfyi',
 'new research taxonom predict function signatur reveal linkag rumen microbiota feed effici dairi cattl rais tropic area http co mpp bdmdfi microbiolog',
 'wsuanimsci pleas welcom visit fulbright scholar dr dinu gavojdian research develop institut bovin romania join dr adam progar research group work evalu health welfar dairi cattl wsu gocoug http co bmhsa uk',
 'dianeoleari hb crp wk xma get note month see mi ekg nh gp wont discuss agoni yr research onlin privat test stomach issu high gluten amp dairi sensit amp heavi metal doctor ignor amp dismiss patient symptom amp harm mani',
 'gsi

In [21]:
#Creating a numpy array of cleaned data

df = np.array(cleaned_data)

In [22]:
#Converting the cleaned data to a dataframe

df = pd.DataFrame(cleaned_data)

In [23]:
#Visualising the dataframe

df

Unnamed: 0,0
0,http co wgpbfu oq dairi product milk chees yog...
1,use industri expert uncov busi opportun provid...
2,new research taxonom predict function signatur...
3,wsuanimsci pleas welcom visit fulbright schola...
4,dianeoleari hb crp wk xma get note month see m...
...,...
99,digest health food amp drink market size estim...
100,tweet
101,tweet
102,tweet


In [24]:
#Renaming column 0 as tweet

df.rename(columns={0: 'tweet'}, inplace=True)

In [25]:
#Visualising the dataframe

df

Unnamed: 0,tweet
0,http co wgpbfu oq dairi product milk chees yog...
1,use industri expert uncov busi opportun provid...
2,new research taxonom predict function signatur...
3,wsuanimsci pleas welcom visit fulbright schola...
4,dianeoleari hb crp wk xma get note month see m...
...,...
99,digest health food amp drink market size estim...
100,tweet
101,tweet
102,tweet


In [26]:
#Importing textblob for sentiment analysis

from textblob import TextBlob

In [27]:
#Defining functions for subjectivity and polarity from textblob

def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity

def getPolarity(text):
    return TextBlob(text).sentiment.polarity

In [28]:
#Creating columns in the dataframe with subjectivity and polarity

df['tweet_Subjectivity'] = df['tweet'].apply(getSubjectivity)
df['tweet_Polarity'] = df['tweet'].apply(getPolarity)

In [29]:
#Visualising the new features in the dataframe which will be used for sentiment analysis

df


Unnamed: 0,tweet,tweet_Subjectivity,tweet_Polarity
0,http co wgpbfu oq dairi product milk chees yog...,0.000000,0.000000
1,use industri expert uncov busi opportun provid...,0.166667,0.083333
2,new research taxonom predict function signatur...,0.454545,0.136364
3,wsuanimsci pleas welcom visit fulbright schola...,0.000000,0.000000
4,dianeoleari hb crp wk xma get note month see m...,0.540000,0.160000
...,...,...,...
99,digest health food amp drink market size estim...,0.000000,0.000000
100,tweet,0.000000,0.000000
101,tweet,0.000000,0.000000
102,tweet,0.000000,0.000000


In [30]:
#Performing descriptive statistics on the new metrics to assess the spread of polarity

df.describe()

Unnamed: 0,tweet_Subjectivity,tweet_Polarity
count,104.0,104.0
mean,0.306194,0.018891
std,0.285673,0.271415
min,0.0,-0.8
25%,0.0,0.0
50%,0.305303,0.0
75%,0.5,0.126562
max,1.0,1.0


In [31]:
#Looking at the positive tweets with a polarity greater than 0

df[df.tweet_Polarity>0]

Unnamed: 0,tweet,tweet_Subjectivity,tweet_Polarity
1,use industri expert uncov busi opportun provid...,0.166667,0.083333
2,new research taxonom predict function signatur...,0.454545,0.136364
4,dianeoleari hb crp wk xma get note month see m...,0.54,0.16
7,titwstomasla brian lewi pattherabbit puzzl ric...,0.666667,0.125
8,lie amp expect research blindli put chemic aro...,0.5,0.3
11,mefrostyp fooliothegreat thequart iowadairyfar...,0.510417,0.077083
12,proud paper come time promot senior research c...,1.0,0.8
16,want farmer land scotland bill gate pump money...,0.5,0.3
18,kenya agricultur livestock research organis ka...,0.507071,0.234343
19,new research ilri feed amp forag team show use...,0.454545,0.136364


In [32]:
#Looking at the negative tweets with a polarity less than 0

df[df.tweet_Polarity<0]

Unnamed: 0,tweet,tweet_Subjectivity,tweet_Polarity
6,christin kuo receiv human slaughter associ stu...,0.066667,-0.033333
15,world futur go requir food innov amp plenti of...,1.0,-0.8
17,j hairyvegandud say meat dairi industri clearl...,0.1,-0.1
21,fire bottl research revers torpor focus diet b...,0.285714,-0.178571
28,climat vermont becom akin northeastern iran ar...,0.05,-0.05
29,vickiezisman carlheneghan nealbarnard meat dai...,0.5,-0.3
34,research microbiolog qualiti safeti tradit raw...,0.430769,-0.240385
58,talktv piersmorgan dread interview vegan activ...,0.75,-0.3
60,condensedmilk thick creami viscou liquid prepa...,0.475,-0.3
63,also gave overview research dairi calf perform...,0.375,-0.125


In [33]:
#Copying df before sentiment labelling so that later changes to df_test do not alter df

df_test = df.copy()

In [34]:
#Dropping last few rows as they just have "tweet" in the column

df_test.drop([100,101,102,103], axis =0, inplace=True)

In [35]:
#Visualising the df

df_test

Unnamed: 0,tweet,tweet_Subjectivity,tweet_Polarity
0,http co wgpbfu oq dairi product milk chees yog...,0.000000,0.000000
1,use industri expert uncov busi opportun provid...,0.166667,0.083333
2,new research taxonom predict function signatur...,0.454545,0.136364
3,wsuanimsci pleas welcom visit fulbright schola...,0.000000,0.000000
4,dianeoleari hb crp wk xma get note month see m...,0.540000,0.160000
...,...,...,...
95,covid pandem shift consum demand dairi base be...,1.000000,-0.800000
96,diet calori reduct fallaci canadian food guid ...,0.200000,0.100000
97,non dairi ice cream market non dairi ice cream...,0.687500,-0.462500
98,work implic retail deal fix life product vario...,0.066667,0.000000


In [36]:
#Labelling tweets with positive, negative or neutral based on polarity 
df_test.loc[df_test['tweet_Polarity']>0, 'sentiment'] = 'positive'
df_test.loc[df_test['tweet_Polarity']<0, 'sentiment'] = 'negative'
df_test.loc[df_test['tweet_Polarity']==0, 'sentiment'] = 'neutral'
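The three assignments can also be written as a single function, which makes the thresholds explicit and easy to test. A minimal sketch follows; the name `label_sentiment` is an assumption, and the example polarities are taken from the dataframe output shown earlier.

```python
def label_sentiment(polarity):
    """Map a TextBlob polarity score to a sentiment label."""
    if polarity > 0:
        return 'positive'
    if polarity < 0:
        return 'negative'
    return 'neutral'

print([label_sentiment(p) for p in (0.083333, -0.033333, 0.0)])
# prints: ['positive', 'negative', 'neutral']
```

With the dataframe used here, it could be applied as `df_test['tweet_Polarity'].apply(label_sentiment)`.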



In [37]:
#Visualising the df

df_test.head(10)

Unnamed: 0,tweet,tweet_Subjectivity,tweet_Polarity,sentiment
0,http co wgpbfu oq dairi product milk chees yog...,0.0,0.0,neutral
1,use industri expert uncov busi opportun provid...,0.166667,0.083333,positive
2,new research taxonom predict function signatur...,0.454545,0.136364,positive
3,wsuanimsci pleas welcom visit fulbright schola...,0.0,0.0,neutral
4,dianeoleari hb crp wk xma get note month see m...,0.54,0.16,positive
5,gsinghxxx son space research shot birth th uk ...,0.0,0.0,neutral
6,christin kuo receiv human slaughter associ stu...,0.066667,-0.033333,negative
7,titwstomasla brian lewi pattherabbit puzzl ric...,0.666667,0.125,positive
8,lie amp expect research blindli put chemic aro...,0.5,0.3,positive
9,research conduct knight frank found estat surv...,0.0,0.0,neutral


In [38]:
#Defining variable X as the tweet data for sentiment analysis

X = df_test['tweet']

In [39]:
#Defining variable y as the sentiment for sentiment analysis

y = df_test['sentiment']

In [40]:
# List the sentiment labels in the order of their integer codes: negative = 0, neutral = 1, positive = 2
sentiment_ordering = ['negative', 'neutral', 'positive']

# Replace each label in y with its integer code (its position in sentiment_ordering)
y = y.apply(lambda x: sentiment_ordering.index(x))

In [41]:
y.head(10)

0    1
1    2
2    2
3    1
4    2
5    1
6    0
7    2
8    2
9    1
Name: sentiment, dtype: int64

In [42]:
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate a CountVectorizer object cv, keeping at most the 3000 most frequent terms
cv    = CountVectorizer(max_features = 3000)

# Learn the vocabulary and transform the tweets into a dense array of word counts
X_fin = cv.fit_transform(X).toarray()

# Display the number of rows and columns
X_fin.shape

(100, 1249)
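The shape (100, 1249) means 100 tweets, each described by counts over a vocabulary of 1249 distinct terms. The sketch below is a simplified, standard-library illustration of the bag-of-words idea, not the scikit-learn implementation: CountVectorizer's default tokenizer also ignores single-character tokens and can cap the vocabulary with max_features. The function name `bag_of_words` is an assumption.

```python
def bag_of_words(documents):
    """Return (vocabulary, count_matrix) for whitespace-tokenised documents."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    column = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for word in doc.split():
            row[column[word]] += 1   # count each occurrence of the word
        matrix.append(row)
    return vocabulary, matrix

vocab, counts = bag_of_words(["dairi research dairi", "milk research"])
print(vocab)   # ['dairi', 'milk', 'research']
print(counts)  # [[2, 0, 1], [0, 1, 1]]
```

Each row corresponds to one document and each column to one vocabulary term, which is the layout the classifier receives.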

In [43]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# Instantiate a multinomial naive Bayes model
model = MultinomialNB()

In [44]:
# Split the dataset into training and testing parts
X_train, X_test, y_train, y_test = train_test_split(X_fin, y, test_size = 0.15)

In [45]:
# Train the model by calling a method fit()
model.fit(X_train,y_train)

In [46]:
# Call predict() method
y_pred = model.predict(X_test)

In [47]:
from sklearn.metrics import classification_report

# Build the classification report (precision, recall and F1-score for each class)
cf = classification_report(y_test, y_pred)

# Display the values of an object cf
print(cf)

              precision    recall  f1-score   support

           0       0.14      1.00      0.25         1
           1       0.83      0.56      0.67         9
           2       0.50      0.20      0.29         5

    accuracy                           0.47        15
   macro avg       0.49      0.59      0.40        15
weighted avg       0.68      0.47      0.51        15
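With only 15 test tweets, it helps to compare the 0.47 accuracy against a trivial baseline. The sketch below rebuilds the test labels from the support column of the report (1 negative, 9 neutral, 5 positive) and computes the accuracy obtained by always predicting the most common class. The helper name is an assumption.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Test labels reconstructed from the support column: 0 = negative, 1 = neutral, 2 = positive
y_test_labels = [0] * 1 + [1] * 9 + [2] * 5
baseline = majority_baseline_accuracy(y_test_labels)
print(round(baseline, 2))  # 0.6
```

On this particular split, always predicting neutral would score 0.60, so the reported accuracy should be read with caution; a larger labelled dataset would be needed for a firmer conclusion.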



In [90]:
#Creating a bar plot to show weight of positive, negative and neutral tweets

fig = px.bar(df_test, y='sentiment')

In [91]:
#Displaying the figure present in report file as Figure 2

fig
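To show how many tweets fall into each class, the labels can be counted before plotting. A minimal sketch with the standard library, using the ten labels displayed by df_test.head(10) earlier as example input:

```python
from collections import Counter

# Sentiment labels of the first ten rows of df_test, as displayed earlier
labels = ['neutral', 'positive', 'positive', 'neutral', 'positive',
          'neutral', 'negative', 'positive', 'positive', 'neutral']

counts = Counter(labels)
for sentiment in ('negative', 'neutral', 'positive'):
    print(sentiment, counts[sentiment])
```

For the full dataframe, `df_test['sentiment'].value_counts()` gives the same kind of tally, and the resulting counts could then be passed to a bar chart.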

# END OF FILE