# Classification Predict

### Climate change tweet classsification 

This notebook is aimed at classifying if a person/custumer believe in climate change or not, this is to help companies with determining if customising their product for lessining carbon footprint and environmental impact, is a good idea or not.

## Importing the modules

Start by importing packages that will be needed

In [1]:
import nltk

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re

# set plot style
sns.set()

import warnings
warnings.filterwarnings('ignore')

Checking the fields of our data

In [2]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test_with_no_labels.csv')

## Getting an idea of the data on hand

In [3]:
train_df.head()

Unnamed: 0,sentiment,message,tweetid
0,1,PolySciMajor EPA chief doesn't think carbon di...,625221
1,1,It's not like we lack evidence of anthropogeni...,126103
2,2,RT @RawStory: Researchers say we have three ye...,698562
3,1,#TodayinMaker# WIRED : 2016 was a pivotal year...,573736
4,1,"RT @SoyNovioDeTodas: It's 2016, and a racist, ...",466954


In [4]:
test_df.head()

Unnamed: 0,message,tweetid
0,Europe will now be looking to China to make su...,169760
1,Combine this with the polling of staffers re c...,35326
2,"The scary, unimpeachable evidence that climate...",224985
3,@Karoli @morgfair @OsborneInk @dailykos \nPuti...,476263
4,RT @FakeWillMoore: 'Female orgasms cause globa...,872928


Check the number of rows and columns

In [5]:
train_df.shape

(15819, 3)

Check the data types contained on our data


In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15819 entries, 0 to 15818
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  15819 non-null  int64 
 1   message    15819 non-null  object
 2   tweetid    15819 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 370.9+ KB


Checking if we have null/ empty fields 

In [7]:
train_df.isnull().sum()

sentiment    0
message      0
tweetid      0
dtype: int64

Checking the numberof time each sentiment appears

In [8]:
train_df['sentiment'].value_counts()


 1    8530
 2    3640
 0    2353
-1    1296
Name: sentiment, dtype: int64

## Natural language processing

Construct a function which will clean our text or tweets

In [9]:
import string
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
from nlppreprocess import NLP
nlp = NLP()

from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

def clean(tweet):
    #remove mentions
    tweet = re.sub(r'@[A-Za-z0-9]+', '', tweet) 
    
    #remove hashtag
    tweet = re.sub(r'#', '', tweet) 
    
    #remove RT
    tweet = re.sub (r'RT[\s]+', '', tweet)
    
    #remove hyper link
    tweet = re.sub(r'https?:\/\/\S', '', tweet) 
    
    #turning the tweet to lowercase 
    tweet = tweet.lower() 
    
    #Removing the punctuations
    tweet = ''.join([l for l in tweet if l not in string.punctuation])
    
    #Tokenise the tweet
    tokeniser = TreebankWordTokenizer()
    tweet = tokeniser.tokenize(tweet)
    
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tweet = ' '.join(lemmatizer.lemmatize(word) for word in tweet)   
    
    #remove stop words
    stopwords = NLP(replace_words=True, remove_stopwords=True, 
                            remove_numbers=True, remove_punctuations=False) 
    tweet = stopwords.process(tweet)
    
    return tweet

train_df['message1'] = train_df['message'].apply(clean)
train_df.head()

Unnamed: 0,sentiment,message,tweetid,message1
0,1,PolySciMajor EPA chief doesn't think carbon di...,625221,polyscimajor epa chief not think carbon dioxid...
1,1,It's not like we lack evidence of anthropogeni...,126103,not like we lack evidence anthropogenic global...
2,2,RT @RawStory: Researchers say we have three ye...,698562,researcher say we three year act climate chang...
3,1,#TodayinMaker# WIRED : 2016 was a pivotal year...,573736,todayinmaker wired wa pivotal year in war cli...
4,1,"RT @SoyNovioDeTodas: It's 2016, and a racist, ...",466954,and racist sexist climate change denying bigo...


## Text to numeric

Since the model only recognise numeric data for predictions, the text data has to be converted to numeric data type and this is achieved using text frequency - inverse document frequency (TF-IDF).

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
#TF-IDF for coverting text to numeric data type
data = train_df['message1']

vectorizer=TfidfVectorizer(use_idf=True, max_df=0.95)
X_vectorized = vectorizer.fit_transform(data)

In [12]:
# Arranging the data into predictor variables and label
X = X_vectorized
y = train_df['sentiment']

## Models

Spliting the training data into test and training, or test and validation sets

In [13]:
from sklearn.model_selection import train_test_split

# Splitting the training data into 80% training and 20% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

Different models will be compared in this section of the notebook

#### Logistic regression

In [14]:
from sklearn.linear_model import LogisticRegression

In [15]:
#creating an instance
lr = LogisticRegression(random_state = 42)

In [16]:
lr.fit(X_train, y_train)

LogisticRegression(random_state=42)

In [17]:
y_pred = lr.predict(X_test)

In [18]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

          -1       0.84      0.26      0.40       278
           0       0.65      0.34      0.45       425
           1       0.72      0.91      0.81      1755
           2       0.75      0.70      0.73       706

    accuracy                           0.73      3164
   macro avg       0.74      0.55      0.59      3164
weighted avg       0.73      0.73      0.70      3164



## Testing data

In this section the model  will be applied on the testing data (unseen) to make predictions 

Checking the state of the testing data. Start with checking the number of columns and raws

In [28]:
test_df.shape

(10546, 3)

Check the data types on the testing data

In [25]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10546 entries, 0 to 10545
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   message   10546 non-null  object
 1   tweetid   10546 non-null  int64 
 2   message1  10546 non-null  object
dtypes: int64(1), object(2)
memory usage: 247.3+ KB


Check if there are any null/empty fields 

In [27]:
test_df.isnull().sum()

message     0
tweetid     0
message1    0
dtype: int64

Now clean the testing data in the same way the training data was cleaned

In [19]:
test_df['message1'] = test_df['message'].apply(clean)
test_df.head()

Unnamed: 0,message,tweetid,message1
0,Europe will now be looking to China to make su...,169760,europe will now looking china make sure not al...
1,Combine this with the polling of staffers re c...,35326,combine with polling staffer re climate change...
2,"The scary, unimpeachable evidence that climate...",224985,scary unimpeachable evidence climate change al...
3,@Karoli @morgfair @OsborneInk @dailykos \nPuti...,476263,putin got you too jill trump not believe in cl...
4,RT @FakeWillMoore: 'Female orgasms cause globa...,872928,female orgasm cause global warming sarcastic r...


Convert the cleaned data into a numeric data

In [20]:
# TF-IDF for text to numeric
test_data = test_df['message1']

vectorized_t = vectorizer.transform(test_data)
X_test_t = vectorized_t

In [21]:
y_test_pred = lr.predict(X_test_t)

## Saving the predictions

Tunning the predictions into a dataframe

In [22]:
submission = pd.DataFrame({'tweetid':test_df['tweetid'],
                          'sentiment':y_test_pred})

copying the predictions into a CSV file

In [23]:
submission.to_csv('classification2.csv',index=False)

Use the above saved CSV file on Kaggle for submission

## Saving the model

In [24]:
import pickle
model_save_path = "lr2.pkl"
with open(model_save_path,'wb') as file:
    pickle.dump(lr,file)