Import the necessary libraries

In [26]:
import urllib.request
import os
import pandas as pd
import numpy as np
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression

import joblib

Reading the .csv file with Pandas

In [27]:
df = pd.read_csv('global-warming.csv', encoding='utf-8')
df.head()

   sentiment  message                                            tweetid
0          1  PolySciMajor EPA chief doesn't think carbon di...   625221
1          1  It's not like we lack evidence of anthropogeni...   126103
2          2  RT @RawStory: Researchers say we have three ye...    698562
3          1  #TodayinMaker# WIRED : 2016 was a pivotal year...   573736
4          1  RT @SoyNovioDeTodas: It's 2016, and a racist, ...   466954


In [28]:
df.shape

(15819, 3)

In [29]:
tokenizer = RegexpTokenizer(r'\w+')
en_stopwords = set(stopwords.words('english'))
ps = PorterStemmer()

def getStemmedReview(review):
    review = review.lower()
    review = review.replace("<br /><br />", " ")
    # Tokenize
    tokens = tokenizer.tokenize(review)
    # Drop stopwords, then stem what is left
    new_tokens = [token for token in tokens if token not in en_stopwords]
    stemmed_tokens = [ps.stem(token) for token in new_tokens]
    clean_review = ' '.join(stemmed_tokens)
    return clean_review

The function above lowercases the text, splits it into tokens, drops any token that appears in NLTK's predefined English stopword list, and stems the remaining tokens to their root form.

Finally, it joins the stemmed tokens back into a single string and returns the cleaned text.
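
As a quick illustration of these steps, here is a minimal sketch run on a single made-up sentence. The sentence, the helper name `clean_text`, and the tiny `demo_stopwords` set are assumptions chosen so the snippet runs without downloading the NLTK stopword corpus; the notebook itself uses the full English list.

```python
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer

tokenizer = RegexpTokenizer(r'\w+')
ps = PorterStemmer()

# Hypothetical stand-in for stopwords.words('english'), kept tiny on purpose
demo_stopwords = {"the", "are", "is", "and", "a", "of"}

def clean_text(text):
    # Lowercase, tokenize on word characters, drop stopwords, stem, rejoin
    tokens = tokenizer.tokenize(text.lower())
    kept = [t for t in tokens if t not in demo_stopwords]
    return ' '.join(ps.stem(t) for t in kept)

cleaned = clean_text("The glaciers are melting!")
print(cleaned)
```

Punctuation disappears because the `\w+` pattern only matches runs of word characters, and the two stopwords are removed before stemming.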

Cleaning all the messages and splitting the data into training and test sets.

In [30]:
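# apply() returns a new Series, so assign it back to keep the cleaned text
df['message'] = df['message'].apply(getStemmedReview)

# .loc slices include both endpoints, so stop at 11999 to keep row 12000 out of the training set
X_train = df.loc[:11999, 'message'].values
y_train = df.loc[:11999, 'sentiment'].values
X_test = df.loc[12000:, 'message'].values
y_test = df.loc[12000:, 'sentiment'].values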
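
Because the split relies on label-based slicing, it helps to remember that `.loc` includes both endpoints of a slice while `.iloc` excludes the stop position. The toy frame below is hypothetical and only illustrates that difference on a default integer index.

```python
import pandas as pd

toy = pd.DataFrame({'message': list('abcdef')})  # default RangeIndex 0..5

by_label = toy.loc[:3, 'message']       # labels 0, 1, 2, 3
by_position = toy.iloc[:3]['message']   # positions 0, 1, 2
rest = toy.loc[3:, 'message']           # labels 3, 4, 5

print(len(by_label), len(by_position), len(rest))

# Row labels that land in both the "first part" and the "rest" when using .loc
overlap = set(by_label.index) & set(rest.index)
print(overlap)
```

With `.loc`, using the same boundary label on both sides puts that row in both parts, so a train/test split needs either different boundary labels or positional slicing.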

Transforming words into feature vectors

To feed the data to a machine learning model, we have to convert non-numeric data, such as text, into a numerical form.

We use TfidfVectorizer for this purpose, which is available in the scikit-learn library.

In [31]:
vectorizer = TfidfVectorizer(sublinear_tf=True, encoding='utf-8', decode_error='ignore')
vectorizer.fit(X_train)
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)

Note that in the code above we call fit only on the training set. Once the vectorizer has learned its vocabulary and IDF weights from the training data, we use that same fitted vectorizer to transform the test data.
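
To make the fit/transform distinction concrete, here is a small sketch on a made-up corpus (the sentences and variable names are assumptions, not data from the notebook). It shows that words seen only at test time are ignored, and that both matrices share the columns learned from the training text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["climate change is real", "the climate is warming"]
test_docs = ["warming oceans threaten reefs"]

vec = TfidfVectorizer(sublinear_tf=True)
train_matrix = vec.fit_transform(train_docs)  # learn vocabulary + IDF, then transform
test_matrix = vec.transform(test_docs)        # reuse the learned vocabulary only

print(sorted(vec.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
print(test_matrix.nnz)  # number of non-zero entries in the test row
```

Only "warming" from the test sentence exists in the training vocabulary, so it is the only term that contributes a non-zero feature. Fitting on the test set as well would leak information about it into the features.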

Creating the model and checking the score on training and test data

In [32]:
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
print("Score on training data is: " + str(model.score(X_train, y_train)))
print("Score on testing data is: " + str(model.score(X_test, y_test)))

Score on training data is: 0.8190984084659612
Score on testing data is: 0.733961770096884
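
For a classifier, `score` reports mean accuracy, that is, the fraction of samples whose predicted label matches the true label. The sketch below checks this on synthetic data; the random features and the labelling rule are assumptions made up for the demonstration and have nothing to do with the tweet dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic two-class problem (hypothetical data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

clf = LogisticRegression(solver='liblinear').fit(X_toy, y_toy)

score = clf.score(X_toy, y_toy)
manual = accuracy_score(y_toy, clf.predict(X_toy))
print(score, manual)
```

Because accuracy weights every sample equally, it can look reasonable even when a rare class is predicted poorly, so per-class metrics are worth checking when the sentiment labels are imbalanced.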


In [33]:
joblib.dump(en_stopwords,'stopwords.pkl') 
joblib.dump(model,'model.pkl')
joblib.dump(vectorizer,'vectorizer.pkl')

['vectorizer.pkl']
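
The saved files can later be loaded with `joblib.load` and used for prediction. The following is a minimal round-trip sketch under stated assumptions: the four training sentences, their labels, and the temporary directory are all hypothetical, and a tiny model is trained in place so the snippet runs on its own.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical miniature training set
docs = ["climate change is real", "global warming is a hoax",
        "we must act on climate change", "warming is a hoax and a scam"]
labels = [1, -1, 1, -1]

vec = TfidfVectorizer().fit(docs)
clf = LogisticRegression(solver='liblinear').fit(vec.transform(docs), labels)

# Save both objects, then load them back as a separate script would
with tempfile.TemporaryDirectory() as tmp:
    joblib.dump(vec, os.path.join(tmp, 'vectorizer.pkl'))
    joblib.dump(clf, os.path.join(tmp, 'model.pkl'))
    loaded_vec = joblib.load(os.path.join(tmp, 'vectorizer.pkl'))
    loaded_clf = joblib.load(os.path.join(tmp, 'model.pkl'))

new_text = ["climate change is a hoax"]
before = clf.predict(vec.transform(new_text))
after = loaded_clf.predict(loaded_vec.transform(new_text))
print(before, after)
```

In the notebook's pipeline, new text would also need the same cleaning step (lowercasing, stopword removal, stemming) before it reaches the vectorizer, which is presumably why the stopword set is saved alongside the model.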