# Introduction

Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.

With this context, EDSA is challenging you during the Classification Sprint to create a Machine Learning model that classifies whether or not a person believes in climate change, based on their novel tweet data.

Providing an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic categories. This increases their insights and informs future marketing strategies.

# Getting Started

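Sentiment analysis involves natural language processing because it deals with human-written text. You'll need to install a few Python libraries to work with. The NLTK stopword corpus also has to be downloaded once (for example with `nltk.download('stopwords')`) before `stopwords.words('english')` will load.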

In [44]:
import pandas as pd
import numpy as np
import seaborn as sns
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split,GridSearchCV
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
# ML Libraries
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
# Global Parameters
stop_words = set(stopwords.words('english'))

import warnings
warnings.filterwarnings('ignore') 

# Load the Data

To train a machine learning model, we need data.

In [45]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

Check the data by calling `head()`.

In [46]:
# None disables column-width truncation (-1 is deprecated in newer pandas)
pd.set_option('display.max_colwidth', None)
train.head(1)

Unnamed: 0,sentiment,message,tweetid
0,1,"PolySciMajor EPA chief doesn't think carbon dioxide is main cause of global warming and.. wait, what!? https://t.co/yeLvcEFXkC via @mashable",625221


# Pre-processing Tweets

Pre-processing is one of the essential steps in any natural language processing (NLP) task. Data scientists rarely receive filtered, ready-to-use data, so a fair amount of processing is needed before the text becomes workable.

## Letter casing: 
Converting all letters to either upper case or lower case.
## Tokenizing: 
Splitting each tweet into tokens. In the simplest case, tokens are the words of a text separated by whitespace.
## Noise removal: 
Eliminating unwanted characters, such as HTML tags, punctuation marks, special characters, and extra whitespace.
## Stopword removal: 
Some words do not contribute much to the machine learning model, so it's good to remove them. A list of stopwords can be taken from the nltk library, or it can be business-specific.
## Normalization: 
Normalization generally refers to a series of related tasks meant to put all text on the same level. Converting text to lower case, removing special characters, and removing stopwords will remove basic inconsistencies. Normalization improves text matching.
## Stemming: 
Eliminating affixes (circumfixes, suffixes, prefixes, infixes) from a word in order to obtain a word stem. The Porter stemmer is a widely used choice because it is fast. In practice, stemming mostly chops off the end of a word, which works well enough in most cases.
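
The steps above can be sketched on a single string using only the standard library. This is a minimal illustration, not the notebook's exact code: the URL pattern is a simplified assumption, `TOY_STOPWORDS` is a hypothetical, deliberately tiny list standing in for NLTK's English stopwords, and stemming is left out.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list (the notebook loads NLTK's list instead).
TOY_STOPWORDS = {"is", "of", "and", "the", "a", "to", "via"}

# Simplified URL pattern (an assumption; the notebook uses a longer expression).
URL_PATTERN = re.compile(r"https?://\S+")

# Characters treated as noise: punctuation and digits.
NOISE = set(string.punctuation + "0123456789")


def preprocess(tweet):
    """Replace URLs, lowercase, strip noise, tokenize and drop stopwords."""
    text = URL_PATTERN.sub("url-web", tweet)             # replace links with a placeholder
    text = text.lower()                                  # letter casing
    text = "".join(ch for ch in text if ch not in NOISE) # noise removal
    tokens = text.split()                                # whitespace tokenizing
    return [t for t in tokens if t not in TOY_STOPWORDS] # stopword removal


print(preprocess("EPA chief doesn't think CO2 is the main cause of global warming https://t.co/yeLvcEFXkC via @mashable"))
```

Note that removing punctuation also turns the `url-web` placeholder into `urlweb`, which matches what the cleaned tweets in this notebook look like.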

In [47]:
## Replace URLs with the placeholder token 'url-web'
## The pattern is defined once and applied to both data sets
pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
subs_url = r'url-web'

print ('Removing URLs... for Train')
train['message'] = train['message'].replace(to_replace = pattern_url, value = subs_url, regex = True)

print ('Removing URLs...for Test data')
test['message'] = test['message'].replace(to_replace = pattern_url, value = subs_url, regex = True)


Removing URLs... for Train
Removing URLs...for Test data


In [48]:
# Make lower case
print ('Lowering case... Train')
train['message'] = train['message'].str.lower()

# Make lower case
print ('Lowering case... Test')
test['message'] = test['message'].str.lower()

Lowering case... Train
Lowering case... Test


In [49]:
#Noise removal:
print ('Cleaning punctuation... for Test and Train using the below function')
def remove_punctuation_numbers(post):
    punc_numbers = string.punctuation  + '0123456789'
    return ''.join([l for l in post if l not in punc_numbers])
train['message'] = train['message'].apply(remove_punctuation_numbers)
test['message'] = test['message'].apply(remove_punctuation_numbers)


Cleaning punctuation... for Test and Train using the below function


In [50]:
#Removed NonAscii
print ('removing NonAscii')
def _removeNonAscii(s): return "".join(i for i in s if ord(i)<128)
train['message'] = train['message'].apply(_removeNonAscii)
test['message'] = test['message'].apply(_removeNonAscii)

removing NonAscii
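
To see what this filter actually drops, here is a small standalone sketch. The function name `remove_non_ascii` and the sample string are illustrative assumptions; the logic mirrors the `ord(i) < 128` check used above.

```python
def remove_non_ascii(text):
    """Keep only characters whose code point is below 128."""
    return "".join(ch for ch in text if ord(ch) < 128)


# The sample contains an accented letter (U+00E9) and an emoji (U+1F30D).
sample = "caf\u00e9 \U0001F30D warming"
print(repr(remove_non_ascii(sample)))
```

Accented letters and emoji are deleted outright rather than transliterated, so a word like "café" loses its last letter. For English-language tweets this is usually acceptable, but it is worth keeping in mind if emoji are expected to carry sentiment.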


In [51]:
#train['message'] = train['message'].replace('\n', ' ').replace('\r', '')
## Replace newlines with spaces
print ('Replacing newlines... for Train and Test')
pattern_newline = r'\n'
subs_newline = r' '
train['message'] = train['message'].replace(to_replace = pattern_newline, value = subs_newline, regex = True)
test['message'] = test['message'].replace(to_replace = pattern_newline, value = subs_newline, regex = True)

Replacing newlines... for Train and Test


In [52]:
# None disables column-width truncation (-1 is deprecated in newer pandas)
pd.set_option('display.max_colwidth', None)
train.head(1)

Unnamed: 0,sentiment,message,tweetid
0,1,polyscimajor epa chief doesnt think carbon dioxide is main cause of global warming and wait what urlweb via mashable,625221


In [53]:
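#Eliminating affixes (circumfixes, suffixes, prefixes, infixes) from a word in order to obtain a word stem
#Stemming and lemmatization are left disabled in this run.
#Both work on single words, so each message has to be split into tokens first.
#ps = PorterStemmer()
#stemmed_train = train['message'].apply(lambda m: ' '.join(ps.stem(w) for w in m.split()))
#stemmed_test = test['message'].apply(lambda m: ' '.join(ps.stem(w) for w in m.split()))

#lemmatizer = WordNetLemmatizer()
#lemma_train = train['message'].apply(lambda m: ' '.join(lemmatizer.lemmatize(w, pos='a') for w in m.split()))
#lemma_test = test['message'].apply(lambda m: ' '.join(lemmatizer.lemmatize(w, pos='a') for w in m.split()))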

In [54]:
# None disables column-width truncation (-1 is deprecated in newer pandas)
pd.set_option('display.max_colwidth', None)
train.head(1)

Unnamed: 0,sentiment,message,tweetid
0,1,polyscimajor epa chief doesnt think carbon dioxide is main cause of global warming and wait what urlweb via mashable,625221


In [55]:
# Split the data into features (X) and labels (y)
X = train['message']
y = train['sentiment']

## Setting up testing and training sets

In [56]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### The TF-IDF vectorizer fitted during training is reused, unchanged, to score unseen tweets

## Training a Linear SVC model

In [57]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
# Linear SVC Model:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1,2))),('clf', LinearSVC()),])
text_clf_lsvc.fit(X_train, y_train)
predictions = text_clf_lsvc.predict(X_test)
print(f1_score(y_test, predictions,average="macro"))
print('accuracy_score',accuracy_score(y_test, predictions))

0.6805751909816737
accuracy_score 0.7756005056890013


## Checking the model's performance on the validation set

In [58]:
print(f1_score(y_test, predictions,average="macro"))

0.6805751909816737
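
The macro F1 score reported here is lower than the accuracy. Macro averaging gives every class the same weight regardless of how many tweets it contains, so the two metrics can diverge when some classes are rare or poorly predicted. The sketch below computes both by hand on made-up labels; the data and the helper names `macro_f1` and `accuracy` are illustrative assumptions, not results from this notebook.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); treated as 0 when the class is never hit
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels: class 1 dominates and class 0 is never predicted.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", macro_f1(y_true, y_pred))
```

In this toy case the classifier looks good on accuracy while macro F1 is pulled down by the class it never predicts, which is why macro F1 is a useful check when sentiment classes are unevenly represented.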


### The same fitted TF-IDF vectorizer is used to predict sentiment on the unseen test data

In [59]:
X_NT_test = test['message']
predicted = text_clf_lsvc.predict(X_NT_test)
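
Because the vectorizer sits inside the pipeline, calling `predict` reuses the vocabulary learned from the training tweets instead of fitting a new one. The idea can be sketched with a toy bag-of-words class. `ToyCountVectorizer` is a hypothetical stand-in that only counts words; it is not scikit-learn's `TfidfVectorizer` and applies no TF-IDF weighting.

```python
from collections import Counter


class ToyCountVectorizer:
    """Hypothetical stand-in: learns a vocabulary once, then reuses it."""

    def fit(self, docs):
        # The vocabulary is fixed here, from the training documents only.
        words = sorted({w for doc in docs for w in doc.split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for word, n in Counter(doc.split()).items():
                if word in self.vocabulary_:  # words never seen in training are dropped
                    row[self.vocabulary_[word]] = n
            rows.append(row)
        return rows


train_docs = ["climate change is real", "global warming is a hoax"]
vec = ToyCountVectorizer().fit(train_docs)

# "claims" was not in the training data, so it gets no column.
print(vec.transform(["climate hoax claims"]))
```

Every unseen tweet is mapped onto the same columns the classifier was trained on, which is what keeps the feature space consistent between training and prediction.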

### Making predictions on the unseen test data and adding a sentiment column to the original data

In [60]:
predicted

array([1, 1, 1, ..., 2, 0, 1], dtype=int64)

In [61]:
test['sentiment'] = predicted

In [62]:
test.head()

Unnamed: 0,message,tweetid,sentiment
0,europe will now be looking to china to make sure that it is not alone in fighting climate change urlweb,169760,1
1,combine this with the polling of staffers re climate change and womens rights and you have a fascist state urlweb,35326,1
2,the scary unimpeachable evidence that climate change is already here urlweb itstimetochange climatechange zeroco,224985,1
3,karoli morgfair osborneink dailykos putin got to you too jill trump doesnt believe in climate change at all thinks its s hoax,476263,1
4,rt fakewillmoore female orgasms cause global warming sarcastic republican,872928,0


### Creating an output CSV for submission

In [63]:
test[['tweetid','sentiment']].to_csv('testsubmissionLsvc.csv', index=False)