# Classification Predict on Climate Change

In [None]:
Problem statement:
In this project, I have been tasked with building a machine learning model that classifies, from the text of a tweet, whether or not a person believes in climate change.

In [None]:
Table of contents

1. Importing packages
2. Loading data sets
3. Exploratory Data Analysis (EDA)
4. Feature Engineering
5. Modeling
6. Model Performance
7. Model Explanation

In [None]:
1. Importing packages

I will first load the packages I need, followed by the data I have been provided:

In [1]:
# Libraries for data loading, data manipulation and data visualisation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Libraries for data preparation and model building
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score

In [None]:
2. Loading data sets

In this step I will load both data sets for training and testing my model.

In [4]:
# Load train and test datasets
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test_with_no_labels.csv')

In [None]:
3. Exploratory Data Analysis (EDA)

I will now take a look at the first few rows of the training and testing data sets to better understand their features and labels.

In [13]:
train_df.head()

Unnamed: 0,sentiment,message,tweetid
0,1,polyscimajor epa chief doesnt think carbon dio...,625221
1,1,its not like we lack evidence of anthropogenic...,126103
2,2,rt researchers say we have three years to act...,698562
3,1,todayinmaker wired 2016 was a pivotal year in...,573736
4,1,rt its 2016 and a racist sexist climate chang...,466954


In [None]:
The training data set has 3 columns: 1 label (sentiment) and 2 features (message and tweetid). The test data set has the same 2 features but no label.
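
Since `sentiment` is the label, a natural next check is how the classes are distributed. Below is a minimal sketch of that check. It uses a small made-up DataFrame (`toy_df`, a hypothetical stand-in for `train_df`), so the counts it prints say nothing about the real data.

```python
import pandas as pd

# Hypothetical stand-in for train_df; the real rows come from train.csv
toy_df = pd.DataFrame({
    "sentiment": [1, 1, 2, 1, 0, 1],
    "message": ["a", "b", "c", "d", "e", "f"],
    "tweetid": [1, 2, 3, 4, 5, 6],
})

# Count how many tweets fall into each sentiment class
class_counts = toy_df["sentiment"].value_counts()
print(class_counts)

# Share of each class, useful for spotting class imbalance
class_share = toy_df["sentiment"].value_counts(normalize=True)
print(class_share)
```

On the real data the same two calls would be run on `train_df["sentiment"]`. A strongly imbalanced label is one reason to prefer the F1 score over plain accuracy.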

In [17]:
test_df.head()

Unnamed: 0,message,tweetid
0,europe will now be looking to china to make su...,169760
1,combine this with the polling of staffers re c...,35326
2,the scary unimpeachable evidence that climate ...,224985
3,\nputin got to you too jill \ntrump doesn...,476263
4,rt female orgasms cause global warming\nsarca...,872928


In [None]:
4. Feature Engineering

In [None]:
I will now clean both data sets by removing URLs, usernames and punctuation from the messages and converting them to lowercase.

In [6]:
import re
import string

# Function to clean the text
def clean_text(text):
    # Removing URLs, usernames, and special characters
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()
    return text

# Apply text cleaning to the 'message' column
train_df['message'] = train_df['message'].apply(clean_text)
test_df['message'] = test_df['message'].apply(clean_text)
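
To see what the cleaning does, it can help to run it on a single example. The sketch below repeats the same steps as `clean_text` on a made-up tweet (the username and URL are invented for illustration). It also adds a hypothetical helper, `collapse_spaces`, that is not part of this notebook: removing tokens leaves double spaces behind, and the helper squeezes them into one.

```python
import re
import string

def clean_text(text):
    # Same steps as the cleaning cell: strip URLs, usernames, punctuation; lowercase
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.lower()

def collapse_spaces(text):
    # Hypothetical extra step: squeeze runs of whitespace into single spaces
    return " ".join(text.split())

# Made-up example tweet
sample = "RT @user: Climate change is REAL! https://t.co/abc #climate"

cleaned = clean_text(sample)
print(repr(cleaned))
print(repr(collapse_spaces(cleaned)))
```

The double spaces do not affect `TfidfVectorizer`, which splits on word boundaries, so the extra helper is cosmetic rather than required.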

In [19]:
print(train_df.head())
print(test_df.head())

   sentiment                                            message  tweetid
0          1  polyscimajor epa chief doesnt think carbon dio...   625221
1          1  its not like we lack evidence of anthropogenic...   126103
2          2  rt  researchers say we have three years to act...   698562
3          1  todayinmaker wired  2016 was a pivotal year in...   573736
4          1  rt  its 2016 and a racist sexist climate chang...   466954
                                             message  tweetid
0  europe will now be looking to china to make su...   169760
1  combine this with the polling of staffers re c...    35326
2  the scary unimpeachable evidence that climate ...   224985
3      \nputin got to you too jill  \ntrump doesn...   476263
4  rt  female orgasms cause global warming\nsarca...   872928


# Feature Engineering with TF-IDF

In [9]:
# Initialize TF-IDF vectorizer with a maximum of 1000 features
tfidf_vectorizer = TfidfVectorizer(max_features=1000)

# Fit TF-IDF on the training messages only, then transform the test
# messages with the same vocabulary and IDF weights
X_train_tfidf = tfidf_vectorizer.fit_transform(train_df['message'])
X_test_tfidf = tfidf_vectorizer.transform(test_df['message'])

# Define the target variable for training
y_train = train_df['sentiment']
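
The fit/transform split is easier to see on a tiny example. The sketch below uses two made-up lists of messages (hypothetical stand-ins for the train and test columns) and shows that the vocabulary is learned from the training messages only, so both matrices end up with the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpora standing in for the train and test messages
train_msgs = [
    "climate change is real",
    "global warming is a hoax",
    "climate action now",
]
test_msgs = [
    "climate hoax or real",
    "unseen words only here",
]

vectorizer = TfidfVectorizer(max_features=1000)

# Learn the vocabulary and IDF weights from the training messages only
X_train_demo = vectorizer.fit_transform(train_msgs)

# Reuse the learned vocabulary; words never seen in training are ignored
X_test_demo = vectorizer.transform(test_msgs)

print(X_train_demo.shape, X_test_demo.shape)
print(sorted(vectorizer.vocabulary_))
```

A test message made up entirely of unseen words becomes an all-zero row, which is why a vocabulary that is too small can hurt predictions on new tweets.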

In [None]:
5. Modeling

In [20]:
from sklearn.linear_model import LogisticRegression

# Features and labels
X_train = X_train_tfidf
y_train = train_df['sentiment']

# Initialize and train the model
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
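
The warning above says the solver stopped before it converged and suggests raising `max_iter`. Below is a minimal sketch of that fix on made-up toy data. The value `max_iter=1000` is an assumption, not one tuned for this data set, and the messages and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy messages and labels, not taken from train.csv
msgs = [
    "climate change is a hoax",
    "global warming is fake",
    "climate change is real",
    "we must act on climate now",
    "new report on climate summit",
    "study on sea level published",
]
labels = [0, 0, 1, 1, 2, 2]

X_demo = TfidfVectorizer().fit_transform(msgs)

# Give the solver more iterations than the default (assumed value)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_demo, labels)

# n_iter_ holds the iterations actually used; a value below max_iter
# means the solver stopped because it converged
print(clf.n_iter_)
```

On the real TF-IDF matrix the same argument would go into the `LogisticRegression()` call in the modeling cell; whether 1000 iterations is enough there would need to be checked by rerunning it.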

In [None]:
6. Model Performance

In [10]:
# Initialize Logistic Regression model
classifier = LogisticRegression()

# Train the model with TF-IDF features and sentiment labels
classifier.fit(X_train_tfidf, y_train)

# Optional: Check the model's training accuracy
train_accuracy = classifier.score(X_train_tfidf, y_train)
print("Training Accuracy:", train_accuracy)

Training Accuracy: 0.7506163474303054


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
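
Training accuracy is measured on data the model has already seen, so it tends to be optimistic. A common alternative is to hold out part of the labelled data and compute the F1 score on it, using the `train_test_split`, `f1_score` and `classification_report` imports from section 1. The sketch below shows the idea on made-up toy data. The messages, labels, split size and `max_iter` value are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split

# Hypothetical toy messages with three arbitrary classes (0, 1, 2)
msgs = [
    "climate change is a hoax", "global warming is fake news",
    "no proof of warming", "warming is a scam",
    "climate change is real", "we must act on climate now",
    "science shows warming is real", "act now to save the planet",
    "report on climate summit released", "new study on sea level published",
    "summit on emissions begins today", "agency publishes emissions report",
]
labels = [0] * 4 + [1] * 4 + [2] * 4

# Hold out 25% of the labelled data, keeping the class proportions
X_tr, X_val, y_tr, y_val = train_test_split(
    msgs, labels, test_size=0.25, random_state=42, stratify=labels
)

# Fit TF-IDF on the training part only, then transform the held-out part
vec = TfidfVectorizer(max_features=1000)
X_tr_vec = vec.fit_transform(X_tr)
X_val_vec = vec.transform(X_val)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr_vec, y_tr)

# Score on the held-out messages
val_pred = clf.predict(X_val_vec)
val_f1 = f1_score(y_val, val_pred, average="macro")
print("Validation macro F1:", val_f1)
print(classification_report(y_val, val_pred, zero_division=0))
```

The macro average weights every class equally; `average="weighted"` is the other common choice when the classes are imbalanced.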


# Predict and Generate Submission File

In [21]:
# Predict sentiment for test data
predictions = classifier.predict(X_test_tfidf)

# Prepare a DataFrame with tweetid and predicted sentiment
submission = pd.DataFrame({
    'tweetid': test_df['tweetid'],
    'sentiment': predictions
})

# Generate CSV file for submission
submission.to_csv('submission.csv', index=False)

# Check the CSV output
print(submission.head())

   tweetid  sentiment
0   169760          1
1    35326          1
2   224985          1
3   476263          1
4   872928          0


In [None]:
7. Model Explanation

In [None]:
I have chosen to go with the Logistic Regression model. The model came out with a high F1 score, although the cells above only compute the training accuracy (about 0.75).
I went through several stages of data cleaning and feature engineering to prepare the data, trained a machine learning model, and arrived at a model that performs well enough to predict unseen data from the outside world.