# Classification Predict  Khensani Dlamini
© Explore Data Science Academy

## Climate Change Belief Analysis
Predict an individual’s belief in climate change based on historical tweet data

Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.

With this context, EDSA is challenging you during the Classification Sprint with the task of creating a Machine Learning model that is able to classify whether or not a person believes in climate change, based on their novel tweet data.

Providing an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic categories, thus increasing their insights and informing future marketing strategies.

### Imports

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt 

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier


from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn import metrics

import string
import nltk
from nltk import TreebankWordTokenizer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Data Review

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample_submission = pd.read_csv('sample_submission.csv')

In [3]:
train.head()

| | sentiment | message | tweetid |
|---|---|---|---|
| 0 | 1 | PolySciMajor EPA chief doesn't think carbon di... | 625221 |
| 1 | 1 | It's not like we lack evidence of anthropogeni... | 126103 |
| 2 | 2 | RT @RawStory: Researchers say we have three ye... | 698562 |
| 3 | 1 | #TodayinMaker# WIRED : 2016 was a pivotal year... | 573736 |
| 4 | 1 | RT @SoyNovioDeTodas: It's 2016, and a racist, ... | 466954 |


In [4]:
train.sentiment.value_counts()

 1    8530
 2    3640
 0    2353
-1    1296
Name: sentiment, dtype: int64
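
The counts above are uneven: class 1 accounts for more than half of the tweets. As a rough illustration (the counts are copied by hand from the output above rather than recomputed from `train.csv`), the class shares and the accuracy of an always-predict-the-majority baseline can be worked out like this:

```python
import pandas as pd

# Counts copied from the value_counts() output above (not re-read from train.csv)
counts = pd.Series({1: 8530, 2: 3640, 0: 2353, -1: 1296}, name="sentiment")

# Share of each class in the training data
proportions = (counts / counts.sum()).round(3)
print(proportions)

# Accuracy of a baseline that always predicts the most common class
majority_baseline = counts.max() / counts.sum()
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

A model that always predicts class 1 would already reach roughly 54% accuracy, which is one reason a class-averaged metric such as macro F1 is a more informative yardstick for this data than plain accuracy.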

In [5]:
test.head()

| | message | tweetid |
|---|---|---|
| 0 | Europe will now be looking to China to make su... | 169760 |
| 1 | Combine this with the polling of staffers re c... | 35326 |
| 2 | The scary, unimpeachable evidence that climate... | 224985 |
| 3 | @Karoli @morgfair @OsborneInk @dailykos \nPuti... | 476263 |
| 4 | RT @FakeWillMoore: 'Female orgasms cause globa... | 872928 |


In [6]:
sample_submission.head()

| | tweetid | sentiment |
|---|---|---|
| 0 | 169760 | 1 |
| 1 | 35326 | 1 |
| 2 | 224985 | 1 |
| 3 | 476263 | 1 |
| 4 | 872928 | 1 |


### Separate Data into Labels and Features

In [7]:
y = train['sentiment']
X = train['message']

In [8]:
X.head()

0    PolySciMajor EPA chief doesn't think carbon di...
1    It's not like we lack evidence of anthropogeni...
2    RT @RawStory: Researchers say we have three ye...
3    #TodayinMaker# WIRED : 2016 was a pivotal year...
4    RT @SoyNovioDeTodas: It's 2016, and a racist, ...
Name: message, dtype: object

In [9]:
y.head()

0    1
1    1
2    2
3    1
4    1
Name: sentiment, dtype: int64

### Data Cleaning 

In [10]:
train['tweetid'].isnull().sum()

train.dropna(inplace=True)

blanks = []  # start with an empty list

for i, sent, twt, tweet_id in train.itertuples():  # iterate over the DataFrame
    if isinstance(twt, str):      # avoid NaN values
        if twt.isspace():         # check tweets for whitespace (empty tweets)
            blanks.append(i)      # add matching index numbers to the list

train.drop(blanks, inplace=True)  # remove empty tweets

# re-derive labels and features so that any rows dropped above are excluded from both
y = train['sentiment']
X = train['message']

In [11]:
train['tweetid'].isnull().sum()
# train.head()

0

In [12]:
test['tweetid'].isnull().sum()

0

### Removing Stop Words and Punctuation

In [15]:
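def remove_punctuation(words):
    # lowercase each tweet, then strip punctuation characters one tweet at a time
    words = words.str.lower()
    return words.apply(lambda tweet: ''.join(ch for ch in tweet if ch not in string.punctuation))

In [16]:
X = remove_punctuation(X)

In [17]:
# tokenise each tweet, drop stop words, and rejoin so there is still one string per tweet
# (the execution counts suggest cells [13] and [14] were run before this cleaning,
#  so the score recorded further down may reflect the uncleaned text)
tokeniser = TreebankWordTokenizer()
stop_words = set(stopwords.words('english'))
X = X.apply(lambda tweet: ' '.join(word for word in tokeniser.tokenize(tweet) if word not in stop_words))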
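
For the cleaned text to stay aligned with the labels, the cleaning has to return one string per tweet rather than one long string or a flat list of words. The sketch below shows that idea on two made-up tweets. The `clean_tweet` helper and the tiny `TOY_STOP_WORDS` set are hypothetical stand-ins chosen so the example runs without downloading NLTK data; they are not the notebook's actual cleaning code.

```python
import string

import pandas as pd

# Hypothetical, deliberately tiny stop-word set (stand-in for NLTK's English list)
TOY_STOP_WORDS = {"the", "is", "a", "of", "and", "not"}

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(tweet):
    """Lowercase one tweet, strip punctuation, and drop stop words."""
    lowered = tweet.lower().translate(PUNCT_TABLE)
    return " ".join(word for word in lowered.split() if word not in TOY_STOP_WORDS)


# Made-up tweets for illustration only
toy = pd.Series(["Climate change is REAL!", "RT @user: It's not a hoax..."])
cleaned = toy.apply(clean_tweet)

print(cleaned.tolist())
print(len(cleaned) == len(toy))  # one cleaned string per original tweet
```

Because `apply` works tweet by tweet, the result has the same length and index as the input, so it can be passed to a vectoriser alongside the original labels.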

### Vectorise X values and split into train and test data

In [13]:
vectorizer = TfidfVectorizer(ngram_range=(1,2), min_df=2, stop_words="english")
X_vectorized = vectorizer.fit_transform(X)
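
To make the vectoriser settings more concrete, here is a small sketch on a made-up three-document corpus using the same parameters as the cell above. It shows how `ngram_range=(1,2)` adds two-word phrases, how `stop_words="english"` removes common words before n-grams are built, and how `min_df=2` discards terms that appear in fewer than two documents. The corpus is an assumption for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration only
corpus = [
    "climate change is real",
    "climate change is a hoax",
    "global warming is real",
]

# Same settings as the notebook cell above
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2, stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(corpus)

# Terms that survive stop-word removal and the min_df=2 cut-off
kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)
print(toy_matrix.shape)  # (number of documents, number of kept terms)
```

Only the unigrams and the bigram that occur in at least two of the three documents are kept; terms such as "hoax" or "global warming" appear once and are dropped.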

In [14]:
# Split into train and test
X_train,X_test,y_train,y_test = train_test_split(X_vectorized,y,test_size=.3, random_state=42)
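
Since the sentiment classes are imbalanced, one option worth considering (it is not used in the split above) is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. A minimal sketch on made-up labels with a 60/20/10/10 mix:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up features and labels: 100 samples with a 60/20/10/10 class mix
toy_X = [[i] for i in range(100)]
toy_y = [1] * 60 + [2] * 20 + [0] * 10 + [-1] * 10

# stratify=toy_y asks for the same class proportions in the train and test parts
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y
)

test_counts = Counter(toy_y_test)
print(test_counts)
```

With 30 held-out samples, each class keeps its share of the total, so the evaluation set contains examples of even the rarest class.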

### Fit Model and Predict

In [21]:
# create model instances; only the logistic regression is fitted and evaluated below
lm = LogisticRegression()
rfc = RandomForestClassifier()
dtc = DecisionTreeClassifier()
adbc = AdaBoostClassifier()

lm.fit(X_train, y_train)
lm_pred = lm.predict(X_test)  # predictions from the logistic regression model

In [22]:
f1_score(y_test, lm_pred, average="macro")

0.5618126863803741
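
The macro F1 score is the unweighted mean of the per-class F1 scores, so a class the model rarely predicts pulls the score down as much as the majority class lifts it. The sketch below illustrates this on made-up labels (not the notebook's predictions), where the predictions lean heavily towards class 1:

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for illustration: predictions favour the majority class 1
y_true = [1, 1, 1, 1, 2, 2, 0, -1]
y_hat = [1, 1, 1, 1, 1, 2, 1, 1]

labels = [-1, 0, 1, 2]

# F1 for each class separately; classes that are never predicted score 0
per_class = f1_score(y_true, y_hat, labels=labels, average=None, zero_division=0)

# Macro F1 is the plain average of those per-class scores
macro = f1_score(y_true, y_hat, labels=labels, average="macro", zero_division=0)
accuracy = accuracy_score(y_true, y_hat)

print(dict(zip(labels, per_class.round(3))))
print(f"accuracy={accuracy:.3f}, macro F1={macro:.3f}")
```

In this toy case the accuracy is 0.625 while the macro F1 is only about 0.35, because two of the four classes are never predicted correctly. Inspecting per-class scores (for example with `classification_report`, which is already imported) is a reasonable next step when the macro score looks low.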

### Making Submission

In [23]:
testx = test['message']
test_vect = vectorizer.transform(testx)

In [25]:
#Predict test values
y_pred = lm.predict(test_vect)

In [26]:
test['sentiment'] = y_pred
test.head()

| | message | tweetid | sentiment |
|---|---|---|---|
| 0 | Europe will now be looking to China to make su... | 169760 | 1 |
| 1 | Combine this with the polling of staffers re c... | 35326 | 1 |
| 2 | The scary, unimpeachable evidence that climate... | 224985 | 1 |
| 3 | @Karoli @morgfair @OsborneInk @dailykos \nPuti... | 476263 | 1 |
| 4 | RT @FakeWillMoore: 'Female orgasms cause globa... | 872928 | 0 |


In [None]:
test[['tweetid','sentiment']].to_csv('testsubmission.csv', index=False)
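
The `index=False` argument matters for the submission format: without it, pandas writes the row index as an extra unnamed column, which reads back as `Unnamed: 0`. The sketch below checks this on a made-up two-row frame written to an in-memory buffer; `toy_test` and the `csv_columns` helper are hypothetical names used only for this illustration.

```python
import io

import pandas as pd

# Toy stand-in for the `test` DataFrame after predictions have been added
toy_test = pd.DataFrame({
    "message": ["first tweet", "second tweet"],
    "tweetid": [169760, 35326],
    "sentiment": [1, 0],
})


def csv_columns(frame, **to_csv_kwargs):
    """Write `frame` to an in-memory CSV and return the column names read back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return list(pd.read_csv(buffer).columns)


submission = toy_test[["tweetid", "sentiment"]]
without_index = csv_columns(submission, index=False)
with_index = csv_columns(submission)

print(without_index)
print(with_index)
```

Comparing the written columns against those of `sample_submission` before uploading is a cheap way to catch a stray index column or a misnamed field.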

### Conclusion