## 1. Introduction 

Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.

With this context, EDSA is challenging you during the Classification Sprint with the task of creating a Machine Learning model that is able to classify whether or not a person believes in climate change, based on their novel tweet data.

Providing an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment spanning multiple demographic and geographic categories, which increases their insights and informs future marketing strategies.

## 2. Importing libraries

**Importing a few Python libraries to work with**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [2]:
import re  # regular expressions
import string
import warnings

import nltk  # text manipulation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

pd.set_option("display.max_colwidth", 200)  # show longer tweets in DataFrame output
warnings.filterwarnings("ignore", category=DeprecationWarning)

### 3. Loading the data set from kaggle.com

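The collection of this data was funded by a Canada Foundation for Innovation JELF Grant to Chris Bauch, University of Waterloo. The dataset aggregates tweets pertaining to climate change collected between Apr 27, 2015 and Feb 21, 2018. In total, 43943 tweets were collected. Each tweet is labelled as one of the following classes:

- 2 (News): the tweet links to factual news about climate change
- 1 (Pro): the tweet supports the belief of man-made climate change
- 0 (Neutral): the tweet neither supports nor refutes the belief of man-made climate change
- -1 (Anti): the tweet does not believe in man-made climate change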

In [3]:
train_df = pd.read_csv('https://raw.githubusercontent.com/Bongani02/Bongani02-Climate_Change_Belief_Analysis_2020/main/train.csv')
test_df = pd.read_csv('https://raw.githubusercontent.com/Bongani02/Bongani02-Climate_Change_Belief_Analysis_2020/main/test.csv')

In [4]:
train_df.head()

Unnamed: 0,sentiment,message,tweetid
0,1,"PolySciMajor EPA chief doesn't think carbon dioxide is main cause of global warming and.. wait, what!? https://t.co/yeLvcEFXkC via @mashable",625221
1,1,It's not like we lack evidence of anthropogenic global warming,126103
2,2,RT @RawStory: Researchers say we have three years to act on climate change before it’s too late https://t.co/WdT0KdUr2f https://t.co/Z0ANPT…,698562
3,1,#TodayinMaker# WIRED : 2016 was a pivotal year in the war on climate change https://t.co/44wOTxTLcD,573736
4,1,"RT @SoyNovioDeTodas: It's 2016, and a racist, sexist, climate change denying bigot is leading in the polls. #ElectionNight",466954


In [5]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15819 entries, 0 to 15818
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  15819 non-null  int64 
 1   message    15819 non-null  object
 2   tweetid    15819 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 370.9+ KB


In [6]:
test_df.head()

Unnamed: 0,message,tweetid
0,Europe will now be looking to China to make sure that it is not alone in fighting climate change… https://t.co/O7T8rCgwDq,169760
1,Combine this with the polling of staffers re climate change and womens' rights and you have a fascist state. https://t.co/ifrm7eexpj,35326
2,"The scary, unimpeachable evidence that climate change is already here: https://t.co/yAedqcV9Ki #itstimetochange #climatechange @ZEROCO2_;..",224985
3,@Karoli @morgfair @OsborneInk @dailykos \nPutin got to you too Jill ! \nTrump doesn't believe in climate change at all \nThinks it's s hoax,476263
4,RT @FakeWillMoore: 'Female orgasms cause global warming!'\n-Sarcastic Republican,872928


In [7]:
print(train_df.shape)

(15819, 3)


### 4. Text Cleaning
**Removing Noise**

- Removing the web-urls
- Making everything lower case
- Removing punctuation

In [8]:
#Removing the web-urls
pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
subs_url = r'url-web'

train_df['message'] = train_df['message'].replace(to_replace = pattern_url, value = subs_url, regex = True)
test_df['message'] = test_df['message'].replace(to_replace = pattern_url, value = subs_url, regex = True)
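
To see what the URL pattern does on a single tweet, here is a minimal sketch using `re.sub` directly. It assumes that the pandas `replace(..., regex=True)` call above applies the same substitution to each message; the sample text is one of the training tweets shown earlier.

```python
import re

# Same pattern and placeholder as in the cell above.
pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
subs_url = 'url-web'

# Sample tweet copied from the train_df.head() output.
sample = "#TodayinMaker# WIRED : 2016 was a pivotal year in the war on climate change https://t.co/44wOTxTLcD"

# Every URL in the message is swapped for the placeholder token.
cleaned = re.sub(pattern_url, subs_url, sample)
print(cleaned)
```

Replacing links with a fixed token, rather than deleting them, keeps the information that the tweet contained a link.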

In [9]:
#Making everything lower case
train_df['message'] = train_df['message'].str.lower()
test_df['message'] = test_df['message'].str.lower()
train_df.head()

Unnamed: 0,sentiment,message,tweetid
0,1,"polyscimajor epa chief doesn't think carbon dioxide is main cause of global warming and.. wait, what!? url-web via @mashable",625221
1,1,it's not like we lack evidence of anthropogenic global warming,126103
2,2,rt @rawstory: researchers say we have three years to act on climate change before it’s too late url-web url-web…,698562
3,1,#todayinmaker# wired : 2016 was a pivotal year in the war on climate change url-web,573736
4,1,"rt @soynoviodetodas: it's 2016, and a racist, sexist, climate change denying bigot is leading in the polls. #electionnight",466954


Now let's remove the punctuation, using the list of punctuation characters in Python's built-in string module.

In [10]:
import string

#Removing punctuation
def remove_punctuations(message):
    for punctuation in string.punctuation:
        message = message.replace(punctuation, '')
    return message

train_df['message'] = train_df['message'].apply(remove_punctuations)
test_df['message'] = test_df['message'].apply(remove_punctuations)
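
The loop above only strips the ASCII characters listed in `string.punctuation`, so typographic characters such as `’` and `…` stay in the text. The sketch below shows an equivalent single-pass variant based on `str.translate`; the helper name `remove_ascii_punctuation` is hypothetical and not part of the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_ascii_punctuation(message: str) -> str:
    """Delete ASCII punctuation in one pass (hypothetical helper)."""
    return message.translate(_PUNCT_TABLE)

# Shortened, lower-cased sample in the style of the training tweets.
sample = "rt @rawstory: before it’s too late url-web…"
cleaned = remove_ascii_punctuation(sample)
print(cleaned)
```

The curly apostrophe and the ellipsis survive in `cleaned`, which matches what the cleaned rows in the notebook output look like.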

### Tokenisation
A tokeniser divides text into a sequence of tokens. We will split each tweet on whitespace and lemmatise the resulting tokens, making the data ready for analysis.
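
As a rough illustration of whitespace tokenisation, the sketch below uses plain `str.split`, which is assumed here to behave like a whitespace tokeniser for this kind of text. The sample string is made up in the style of the cleaned tweets.

```python
# Cleaned tweet with an embedded newline and a double space.
sample = "putin got to you too jill \ntrump doesnt believe in climate  change"

# Splitting on any run of whitespace yields the tokens.
tokens = sample.split()

# Joining with single spaces also normalises newlines and repeated spaces.
rejoined = " ".join(tokens)

print(tokens)
print(rejoined)
```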

In [11]:
train_df.head(10)

Unnamed: 0,sentiment,message,tweetid
0,1,polyscimajor epa chief doesnt think carbon dioxide is main cause of global warming and wait what urlweb via mashable,625221
1,1,its not like we lack evidence of anthropogenic global warming,126103
2,2,rt rawstory researchers say we have three years to act on climate change before it’s too late urlweb urlweb…,698562
3,1,todayinmaker wired 2016 was a pivotal year in the war on climate change urlweb,573736
4,1,rt soynoviodetodas its 2016 and a racist sexist climate change denying bigot is leading in the polls electionnight,466954
5,1,worth a read whether you do or dont believe in climate change urlweb urlweb,425577
6,1,rt thenation mike pence doesn’t believe in global warming or that smoking causes lung cancer urlweb,294933
7,1,rt makeandmendlife six big things we can all do today to fight climate change or how to be a climate activistã¢â‚¬â¦ urlweb hã¢â‚¬â¦,992717
8,1,aceofspadeshq my 8yo nephew is inconsolable he wants to die of old age like me but will perish in the fiery hellscape of climate change,664510
9,1,rt paigetweedy no offense… but like… how do you just not believe… in global warming………,260471


In [12]:
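# The lemmatizer needs the WordNet corpus; download it once if it is missing.
nltk.download("wordnet", quiet=True)

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # lemmatize() treats every token as a noun by default,
    # which is why "was" becomes "wa" in the output below.
    return ' '.join([lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)])

train_df['message'] = train_df['message'].apply(lemmatize_text)
test_df['message'] = test_df['message'].apply(lemmatize_text)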

In [13]:
test_df.head(10)

Unnamed: 0,message,tweetid
0,europe will now be looking to china to make sure that it is not alone in fighting climate change… urlweb,169760
1,combine this with the polling of staffer re climate change and woman right and you have a fascist state urlweb,35326
2,the scary unimpeachable evidence that climate change is already here urlweb itstimetochange climatechange zeroco2,224985
3,karoli morgfair osborneink dailykos putin got to you too jill trump doesnt believe in climate change at all think it s hoax,476263
4,rt fakewillmoore female orgasm cause global warming sarcastic republican,872928
5,rt nycjim trump muzzle employee of several gov’t agency in effort to suppress info on climate change amp the environment urlweb…,75639
6,bmastenbrook yes wrote that in 3rd yr comp sci ethic part wa told by climate change denying lecturer that i wa wrong amp marked down,211536
7,rt climatehawk1 indonesian farmer weather climate change w conservation agriculture ipsnews urlweb…,569434
8,rt guardian british scientist face a ‘huge hit’ if the u cut climate change research urlweb,315368
9,aid for agriculture sustainable agriculture and climate change adaptation for smallscale farmer urlweb via aid4ag,591733


### 5. Building the model on the cleaned data

In [14]:
y = train_df['sentiment']
X = train_df['message']

In [15]:
vectorizer = TfidfVectorizer(ngram_range=(1,2), min_df=2, stop_words="english")
X_vectorized = vectorizer.fit_transform(X)

In [16]:
X_train,X_val,y_train,y_val = train_test_split(X_vectorized,y,test_size=0.2,shuffle=True, stratify=y, random_state=42)
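
In the cells above the vectorizer is fitted on all training messages before the split, so the validation rows contribute to the TF-IDF vocabulary and weights. One possible alternative, which is not what this notebook does, is to split the raw text first and wrap both steps in a scikit-learn pipeline. The sketch below uses a small made-up data set in place of `train_df`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for train_df['message'] and train_df['sentiment'] (made-up data).
texts = (["climate change is real", "we must act on global warming"] * 5
         + ["climate change is a hoax", "global warming is fake news"] * 5)
labels = [1] * 10 + [-1] * 10

# Split the raw text first so the vectorizer never sees the validation rows.
X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# The pipeline fits TF-IDF on the training split only, then trains the SVM.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    LinearSVC(random_state=42),
)
pipeline.fit(X_tr, y_tr)

val_pred = pipeline.predict(X_va)
score = f1_score(y_va, val_pred, average="macro")
print(score)
```

The fitted pipeline can then be applied to raw test messages with `pipeline.predict`, without a separate `transform` call.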

In [17]:
model = LinearSVC()
model.fit(X_train, y_train)
svc_pred = model.predict(X_val)

In [18]:
f1_score(y_val, svc_pred, average="macro")

0.6540752169904428
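
The macro-averaged F1 score is the unweighted mean of the per-class F1 scores, so rare classes count as much as common ones. The sketch below works this out by hand on a small made-up example; `macro_f1` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)

# Made-up labels using the same sentiment codes as the data set.
y_true = [1, 1, 1, 0, 0, 2]
y_pred = [1, 1, 0, 0, 2, 2]

# Per-class F1: class 0 -> 0.5, class 1 -> 0.8, class 2 -> 2/3.
result = macro_f1(y_true, y_pred)
print(result)
```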

In [19]:
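testx = test_df['message']  # the test messages must go through the same vectorizer
test_vect = vectorizer.transform(testx)  # transform only; the vectorizer was fitted on the training data

In [20]:
y_pred = model.predict(test_vect)  # predict sentiment for the transformed test data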

In [21]:
test_df['sentiment'] = y_pred

In [22]:
test_df.head()

Unnamed: 0,message,tweetid,sentiment
0,europe will now be looking to china to make sure that it is not alone in fighting climate change… urlweb,169760,1
1,combine this with the polling of staffer re climate change and woman right and you have a fascist state urlweb,35326,1
2,the scary unimpeachable evidence that climate change is already here urlweb itstimetochange climatechange zeroco2,224985,1
3,karoli morgfair osborneink dailykos putin got to you too jill trump doesnt believe in climate change at all think it s hoax,476263,1
4,rt fakewillmoore female orgasm cause global warming sarcastic republican,872928,0


In [26]:
test_df[['tweetid','sentiment']].to_csv('Submission.csv', index=False) # create a CSV file that is aligned with the competition format

