# Classification Predict: Khensani Dlamini
© Explore Data Science Academy

## Climate Change Belief Analysis
Predict an individual’s belief in climate change based on historical tweet data

Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.

With this context, EDSA is challenging you during the Classification Sprint with the task of creating a Machine Learning model that is able to classify whether or not a person believes in climate change, based on their novel tweet data.

Providing an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic categories - thus increasing their insights and informing future marketing strategies.

### Imports

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt 

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier


from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn import metrics

import string
import nltk
from nltk import TreebankWordTokenizer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Data Review

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample_submission = pd.read_csv('sample_submission.csv')

In [3]:
train.head()

Unnamed: 0,sentiment,message,tweetid
0,1,PolySciMajor EPA chief doesn't think carbon di...,625221
1,1,It's not like we lack evidence of anthropogeni...,126103
2,2,RT @RawStory: Researchers say we have three ye...,698562
3,1,#TodayinMaker# WIRED : 2016 was a pivotal year...,573736
4,1,"RT @SoyNovioDeTodas: It's 2016, and a racist, ...",466954


In [4]:
train.sentiment.value_counts()

 1    8530
 2    3640
 0    2353
-1    1296
Name: sentiment, dtype: int64
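
The counts above are imbalanced: class 1 makes up more than half of the 15,819 training tweets. A minimal sketch, using only the counts printed above, of the class shares and of the accuracy a model would reach by always predicting the majority class:

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({1: 8530, 2: 3640, 0: 2353, -1: 1296}, name="sentiment")

# Share of each class in the training set
shares = (counts / counts.sum()).round(3)
print(shares)

# Accuracy of a baseline that always predicts the majority class
majority_baseline = counts.max() / counts.sum()
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

With this much imbalance, plain accuracy can look reasonable even for a model that ignores the minority classes, so a class-averaged metric is likely to be more informative.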

In [5]:
test.head()

Unnamed: 0,message,tweetid
0,Europe will now be looking to China to make su...,169760
1,Combine this with the polling of staffers re c...,35326
2,"The scary, unimpeachable evidence that climate...",224985
3,@Karoli @morgfair @OsborneInk @dailykos \nPuti...,476263
4,RT @FakeWillMoore: 'Female orgasms cause globa...,872928


In [6]:
sample_submission.head()

Unnamed: 0,tweetid,sentiment
0,169760,1
1,35326,1
2,224985,1
3,476263,1
4,872928,1


### Data Cleaning 

In [7]:
# def remove_blanks(df, column_name):
#     """Drop rows that are NaN or whose text in `column_name` is only whitespace."""

#     df = df.dropna()  # remove rows with missing values

#     blanks = []  # index labels of whitespace-only tweets

#     for idx, twt in df[column_name].items():  # iterate over (index, tweet) pairs
#         if isinstance(twt, str) and twt.isspace():  # whitespace-only (empty) tweet
#             blanks.append(idx)  # add the matching index label to the list

#     return df.drop(index=blanks)  # return the frame without the blank tweets

In [8]:
# train = remove_blanks(train,'message')

In [9]:
# test = remove_blanks(test,'message')

In [10]:
# train.head()

### Removing Stop Words and Punctuation

In [7]:
def remove_punctuation(text):

    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
        
    return text
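
To see what this function does to a typical tweet, here is a small sketch on a made-up string (the sample text is hypothetical, not taken from the data set). It also shows a one-pass equivalent based on `str.translate`, offered as an alternative rather than as the notebook's method:

```python
import string

def remove_punctuation(text):
    # Same behaviour as the function above: delete every ASCII punctuation character
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

# Hypothetical tweet, made up for illustration
sample = "RT @RawStory: It's 2016! #climate https://t.co/abc"
cleaned = remove_punctuation(sample)
print(cleaned)

# One-pass equivalent using a translation table
table = str.maketrans('', '', string.punctuation)
assert sample.translate(table) == cleaned
```

Note that the `@` and `#` markers are removed while the handle and hashtag text stay, and that URLs collapse into a single token because `:`, `/` and `.` are all punctuation.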
            

In [8]:
# remove_punctuation(train, 'message')
train["clean_message"] = train['message'].apply(remove_punctuation)

In [9]:
test["clean_message"] = test['message'].apply(remove_punctuation)

In [10]:
stop = set(stopwords.words('english'))  # a set gives fast membership tests

def remove_stop_words(text):
    # keep only the tokens that are not English stop words
    return ' '.join(word for word in text.split() if word not in stop)

# lowercase first so that capitalised stop words are matched too
train['clean_message'] = train['clean_message'].str.lower()
test['clean_message'] = test['clean_message'].str.lower()

train['cleaned_message'] = train['clean_message'].apply(remove_stop_words)
test['cleaned_message'] = test['clean_message'].apply(remove_stop_words)

# train['clean_message'].apply(lambda x: [item for item in x if item not in stop])
# test['clean_message'].apply(lambda x: [item for item in x if item not in stop])
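
Because punctuation is stripped before stop words are removed, contractions such as "doesn't" become "doesnt" and no longer match the stop list. The sketch below illustrates the effect of the ordering on a made-up sentence. The `stop` set here is a tiny hand-picked subset of the NLTK list, an assumption made so the example runs without downloading the corpus, and the helper names are hypothetical:

```python
import string

# Tiny illustrative subset of the NLTK English stop list (assumption for this sketch)
stop = {"doesn't", "we", "not", "of", "its", "it's", "the"}

def strip_punct(text):
    # delete every ASCII punctuation character in one pass
    return text.translate(str.maketrans('', '', string.punctuation))

def drop_stop_words(text):
    # keep only the tokens that are not in the stop set
    return ' '.join(w for w in text.split() if w not in stop)

raw = "the epa chief doesn't think"  # made-up lowercase example

# Order used in this notebook: punctuation first, then stop words
punct_first = drop_stop_words(strip_punct(raw))

# Alternative order: stop words first, then punctuation
stop_first = strip_punct(drop_stop_words(raw))

print(punct_first)
print(stop_first)
```

Whether leftover tokens such as "doesnt" help or hurt the classifier is an empirical question; the sketch only shows that the two orderings are not equivalent.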

In [11]:
train.head()

Unnamed: 0,sentiment,message,tweetid,clean_message,cleaned_message
0,1,PolySciMajor EPA chief doesn't think carbon di...,625221,polyscimajor epa chief doesnt think carbon dio...,polyscimajor epa chief doesnt think carbon dio...
1,1,It's not like we lack evidence of anthropogeni...,126103,its not like we lack evidence of anthropogenic...,like lack evidence anthropogenic global warming
2,2,RT @RawStory: Researchers say we have three ye...,698562,rt rawstory researchers say we have three year...,rt rawstory researchers say three years act cl...
3,1,#TodayinMaker# WIRED : 2016 was a pivotal year...,573736,todayinmaker wired 2016 was a pivotal year in...,todayinmaker wired 2016 pivotal year war clima...
4,1,"RT @SoyNovioDeTodas: It's 2016, and a racist, ...",466954,rt soynoviodetodas its 2016 and a racist sexis...,rt soynoviodetodas 2016 racist sexist climate ...


In [12]:
test.head()

Unnamed: 0,message,tweetid,clean_message,cleaned_message
0,Europe will now be looking to China to make su...,169760,europe will now be looking to china to make su...,europe looking china make sure alone fighting ...
1,Combine this with the polling of staffers re c...,35326,combine this with the polling of staffers re c...,combine polling staffers climate change womens...
2,"The scary, unimpeachable evidence that climate...",224985,the scary unimpeachable evidence that climate ...,scary unimpeachable evidence climate change al...
3,@Karoli @morgfair @OsborneInk @dailykos \nPuti...,476263,karoli morgfair osborneink dailykos \nputin go...,karoli morgfair osborneink dailykos putin got ...
4,RT @FakeWillMoore: 'Female orgasms cause globa...,872928,rt fakewillmoore female orgasms cause global w...,rt fakewillmoore female orgasms cause global w...


### Separate Data into Labels and Features

In [13]:
y = train['sentiment']        # labels
X = train['cleaned_message']  # features

In [14]:
X.head()

0    polyscimajor epa chief doesnt think carbon dio...
1      like lack evidence anthropogenic global warming
2    rt rawstory researchers say three years act cl...
3    todayinmaker wired 2016 pivotal year war clima...
4    rt soynoviodetodas 2016 racist sexist climate ...
Name: cleaned_message, dtype: object

In [15]:
y.head()

0    1
1    1
2    2
3    1
4    1
Name: sentiment, dtype: int64

### Vectorize X Values and Split into Train and Test Data

In [16]:
vectorizer = TfidfVectorizer(ngram_range=(1,2), min_df=2, stop_words="english")
X_vectorized = vectorizer.fit_transform(X)
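
To make the vectorizer settings concrete, here is a small sketch on a made-up three-document corpus with the same parameters. The documents and the `toy_` names are hypothetical; the point is only to show how `ngram_range=(1,2)` adds bigrams and how `min_df=2` drops terms that occur in a single document:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration
docs = [
    "climate change is science",
    "climate change is a hoax",
    "global warming science",
]

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2, stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(docs)

# Terms that survived stop-word removal and the min_df filter
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(toy_matrix.shape)  # (number of documents, number of kept terms)

# With the default norm='l2', every non-empty row has unit length
row_norms = np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))
print(row_norms)
```

On real tweets the same filter removes the long tail of one-off unigrams and bigrams, which keeps the feature matrix smaller.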

In [17]:
# Split into train and test
X_train,X_test,y_train,y_test = train_test_split(X_vectorized,y,test_size=.3, random_state=42)
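
In the cells above, the vectorizer is fitted on all of `X` before the split, so the held-out rows contribute to the vocabulary and IDF weights, and the split is not stratified. One possible alternative, sketched here on made-up toy data, is to split the raw text first with `stratify` and wrap the vectorizer and classifier in a scikit-learn `Pipeline`. The corpus, labels, and the `model` name are assumptions for illustration only:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up toy corpus standing in for train['cleaned_message'] and train['sentiment']
texts = ["climate change real threat", "global warming hoax scam"] * 10
labels = [1, -1] * 10

# Split the raw text first; stratify keeps class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training part only
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("clf", RandomForestClassifier(random_state=42)),
])
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

The fitted pipeline can then be applied directly to raw text with `model.predict(...)`, so the same preprocessing is used at training and prediction time.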

### Fit Model and Predict

In [18]:
# instantiate the candidate classifiers; only the random forest is fitted below
lm = LogisticRegression()
rfc = RandomForestClassifier()
dtc = DecisionTreeClassifier()
adbc = AdaBoostClassifier()

rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
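
Four classifiers are instantiated above, but only `rfc` is fitted. A minimal sketch of how the candidates could be compared on the same split, using made-up toy data and macro F1 as the score. The corpus, the `candidates` dictionary, and the hyperparameters are assumptions, not results from this notebook:

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up toy corpus for illustration
texts = ["climate change real threat", "global warming hoax scam"] * 10
labels = [1, -1] * 10

X_tr_txt, X_te_txt, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

# Fit the vectorizer on the training part only, then transform both parts
vec = TfidfVectorizer()
X_tr = vec.fit_transform(X_tr_txt)
X_te = vec.transform(X_te_txt)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=42),
    "decision tree": DecisionTreeClassifier(random_state=42),
    "adaboost": AdaBoostClassifier(random_state=42),
}

# Fit every candidate and record its macro-averaged F1 on the held-out part
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, model.predict(X_te), average="macro")

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

On the real data the ranking may differ from any toy example, so the loop would need to be run on the actual `X_train` and `X_test` matrices before drawing conclusions.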



In [19]:
f1_score(y_test, y_pred, average="macro")

0.5253643125595683
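
A single macro-averaged F1 hides which classes are hard. `classification_report` and `confusion_matrix` are imported at the top of the notebook but not used; the sketch below shows how they break a score down per class. The labels and predictions here are made up for illustration and are not the notebook's results:

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Made-up true labels and predictions for the four sentiment classes
y_true = [1, 1, 1, 1, 2, 2, 0, 0, -1, -1]
y_hat = [1, 1, 1, 1, 2, 1, 0, 1, 1, -1]
class_labels = [-1, 0, 1, 2]

# Rows are true classes, columns are predicted classes, in class_labels order
cm = confusion_matrix(y_true, y_hat, labels=class_labels)
print(cm)

# Precision, recall and F1 for each class
print(classification_report(y_true, y_hat, labels=class_labels))

# Macro F1 is the unweighted mean of the per-class F1 scores
macro_f1 = f1_score(y_true, y_hat, average="macro")
print(f"macro F1: {macro_f1:.4f}")
```

In this toy case every error is a minority-class tweet predicted as class 1, which is the kind of pattern the confusion matrix makes visible.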

### Making Submission

In [20]:
testx = test['cleaned_message']
test_vect = vectorizer.transform(testx)

In [21]:
#Predict test values
y_pred = rfc.predict(test_vect)

In [22]:
test['sentiment'] = y_pred
test.head()

Unnamed: 0,message,tweetid,clean_message,cleaned_message,sentiment
0,Europe will now be looking to China to make su...,169760,europe will now be looking to china to make su...,europe looking china make sure alone fighting ...,1
1,Combine this with the polling of staffers re c...,35326,combine this with the polling of staffers re c...,combine polling staffers climate change womens...,1
2,"The scary, unimpeachable evidence that climate...",224985,the scary unimpeachable evidence that climate ...,scary unimpeachable evidence climate change al...,1
3,@Karoli @morgfair @OsborneInk @dailykos \nPuti...,476263,karoli morgfair osborneink dailykos \nputin go...,karoli morgfair osborneink dailykos putin got ...,1
4,RT @FakeWillMoore: 'Female orgasms cause globa...,872928,rt fakewillmoore female orgasms cause global w...,rt fakewillmoore female orgasms cause global w...,0


In [23]:
test[['tweetid','sentiment']].to_csv('testsubmission.csv', index=False)
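
Before uploading, it can be worth checking that the file has the same layout as `sample_submission`. A minimal sketch, assuming the submission must match the sample's column names, column order, and `tweetid` values; `check_submission` is a hypothetical helper, and the frames below contain only the first rows shown in the previews above:

```python
import pandas as pd

# First rows copied from the sample_submission and test previews above
sample_submission = pd.DataFrame(
    {"tweetid": [169760, 35326, 224985], "sentiment": [1, 1, 1]}
)
submission = pd.DataFrame(
    {"tweetid": [169760, 35326, 224985], "sentiment": [1, 1, 1]}
)

def check_submission(submission, sample):
    # Hypothetical helper: raise AssertionError if the layout differs from the sample
    assert list(submission.columns) == list(sample.columns), "columns differ"
    assert len(submission) == len(sample), "row count differs"
    assert submission["tweetid"].equals(sample["tweetid"]), "tweetid order differs"
    assert submission["sentiment"].notna().all(), "missing predictions"
    return True

is_valid = check_submission(submission, sample_submission)
print(is_valid)
```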

### Conclusion