# Classification Predict Student Solution

© Explore Data Science Academy

Honour Code
We, {NM_1 Tech_Gurus}, confirm - by submitting this document - that the solutions in this notebook are a result of our own work and that we abide by the EDSA honour code.

Non-compliance with the honour code constitutes a material breach of contract.

Predict Overview: EA - Twitter Sentiment Classification
Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.

With this context, EDSA is challenging you during the Classification Sprint with the task of creating a Machine Learning model that is able to classify whether or not a person believes in climate change, based on their novel tweet data.

Providing an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic categories - thus increasing their insights and informing future marketing strategies.

# INTRODUCTION

In the modern business landscape, many organizations are passionately embedding eco-conscious principles into their foundation, seeking to mitigate environmental impacts and promote sustainability. These companies offer a variety of products and services that align seamlessly with their green ethos. A crucial part of their strategic marketing is understanding how the public perceives the looming threat of climate change and the extent to which they view it as an immediate concern. This knowledge will provide a critical lens through which to anticipate how their offerings may resonate with potential consumers.

Set against this backdrop, EDSA puts forth an exhilarating challenge during the Classification Sprint. Participants are tasked to construct a machine learning model with the ability to discern whether individuals believe in climate change or not, harnessing the power of unique tweet data for this purpose.

A comprehensive and precise solution to this challenge equips businesses with a key to unlock expansive consumer sentiment insights. These insights cut across a diverse array of demographic and geographic categories, paving the way for an enriched understanding that can underpin and shape their future marketing strategies.

In [7]:
from IPython.display import Image
image_url = 'https://media.proprofs.com/images/QM/user_images/2286127/1551760625.jpg'
Image(url=image_url)

# Problem Statement
Companies operating in various industries need a reliable solution to understand the sentiments expressed by Twitter users regarding climate change. This information is crucial for making informed decisions, developing effective marketing strategies, and aligning their messaging with public sentiment. However, without a robust classification model, companies face challenges in accurately predicting sentiment and extracting valuable insights from the vast amount of data available on social media platforms. To address this problem, companies require a comprehensive sentiment analysis model that can provide real-time monitoring, sentiment classification, and trend analysis to help them stay ahead of the curve and effectively respond to the evolving concerns and preferences of their target audience.

In this notebook we analyze tweets about climate change collected between Apr 27, 2015 and Feb 21, 2018. We then train a suitable, high-performing classification model to assign each tweet to its sentiment category regarding climate change.

# Table of Contents

1. [Importing Packages](#1.-Importing-Packages)
2. [Loading the Data](#2.-Loading-the-Data)
3. [Exploratory Data Analysis (EDA)](#3.-Exploratory-Data-Analysis-(EDA))
4. [Data Engineering](#4.-Data-Engineering)
5. [Modelling](#5.-Modelling)
6. [Model Performance](#6.-Model-Performance)
7. [Model Explanations](#7.-Model-Explanations)

# 1. Importing Packages

In [6]:
# Libraries for data loading, data manipulation, and data visualization
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # Library for basic plotting in Python
import seaborn as sns  # Library for statistical data visualization
import plotly.express as px  # Library for interactive visualizations
from IPython.display import HTML  # Library for rendering HTML content in Jupyter Notebook
import string

# Libraries for data preparation and model building
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# 2. Loading the Data
Back to Table of Contents

⚡ Description: Loading the data ⚡
In this section you are required to load the data from the train.csv and test_with_no_labels.csv files into the DataFrames df_train and df_test.

In [3]:
# Load the data
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test_with_no_labels.csv")

In [3]:
df_train.head()

|   | sentiment | message | tweetid |
|---|-----------|---------|---------|
| 0 | 1 | PolySciMajor EPA chief doesn't think carbon di... | 625221 |
| 1 | 1 | It's not like we lack evidence of anthropogeni... | 126103 |
| 2 | 2 | RT @RawStory: Researchers say we have three ye... | 698562 |
| 3 | 1 | #TodayinMaker# WIRED : 2016 was a pivotal year... | 573736 |
| 4 | 1 | RT @SoyNovioDeTodas: It's 2016, and a racist, ... | 466954 |



The code df_train.head() displays the first 5 rows of the DataFrame df_train.

In [4]:
df_train.tail()

|   | sentiment | message | tweetid |
|---|-----------|---------|---------|
| 15814 | 1 | RT @ezlusztig: They took down the material on ... | 22001 |
| 15815 | 2 | RT @washingtonpost: How climate change could b... | 17856 |
| 15816 | 0 | notiven: RT: nytimesworld :What does Trump act... | 384248 |
| 15817 | -1 | RT @sara8smiles: Hey liberals the climate chan... | 819732 |
| 15818 | 0 | RT @Chet_Cannon: .@kurteichenwald's 'climate c... | 806319 |


The code df_train.tail() displays the last 5 rows of the DataFrame df_train, together with the column names.

In [5]:
df_train.shape

(15819, 3)

The code df_train.shape returns the dimensions of the DataFrame df_train as a (rows, columns) tuple. Here the DataFrame has 15819 rows (observations) and 3 columns.

In [6]:
df_test.head()

|   | message | tweetid |
|---|---------|---------|
| 0 | Europe will now be looking to China to make su... | 169760 |
| 1 | Combine this with the polling of staffers re c... | 35326 |
| 2 | The scary, unimpeachable evidence that climate... | 224985 |
| 3 | @Karoli @morgfair @OsborneInk @dailykos \nPuti... | 476263 |
| 4 | RT @FakeWillMoore: 'Female orgasms cause globa... | 872928 |


In [7]:
df_test.shape

(10546, 2)

# 2.1 Data Cleaning

In [8]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15819 entries, 0 to 15818
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  15819 non-null  int64 
 1   message    15819 non-null  object
 2   tweetid    15819 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 370.9+ KB


The code df_train.info() summarises the structure of the DataFrame: the index range, each column's name, non-null count and data type, and the approximate memory usage. Here all 3 columns have 15819 non-null entries, so there are no missing values. The sentiment and tweetid columns are stored as int64 and the message column as object (text).

In [9]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10546 entries, 0 to 10545
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   message  10546 non-null  object
 1   tweetid  10546 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 164.9+ KB


The code df_test.info() gives the same summary for the DataFrame df_test: 10546 rows and 2 columns (message and tweetid), with no missing values. There is no sentiment column because the test set is unlabelled.

In [11]:
df_train.isnull().sum()

sentiment    0
message      0
tweetid      0
dtype: int64

The code df_train.isnull().sum() returns the number of null (missing) values in each column of the DataFrame df_train. The isnull() call produces a Boolean DataFrame that is True wherever a value is missing, and sum() counts the True values per column, returning a Series of counts. Here every count is 0, so no column has missing values.
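To see how the count is built, here is a minimal sketch on a small made-up DataFrame. The toy rows are hypothetical and are not taken from train.csv; they only serve to show a case where missing values do occur.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing cells, for illustration only
toy = pd.DataFrame({
    "sentiment": [1, 2, np.nan],
    "message": ["climate change is real", None, "RT global warming"],
    "tweetid": [101, 102, 103],
})

null_mask = toy.isnull()        # Boolean DataFrame: True where a value is missing
null_counts = null_mask.sum()   # column-wise sum; each True counts as 1
print(null_counts)
```

Columns with a non-zero count would need further handling, such as dropping or imputing the affected rows.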

In [12]:
df_test.isnull().sum()

message    0
tweetid    0
dtype: int64

* This code is a common way to quickly assess the presence and extent of missing data in a DataFrame.
* It's often used in data cleaning and preparation stages to identify columns that might need further handling or imputation of missing values.
* Understanding missing data is crucial for ensuring accurate analysis and modeling results.

# 3. Exploratory Data Analysis (EDA)
Back to Table of Contents

⚡ Description: Exploratory data analysis ⚡
In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame.
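One possible starting point, sketched under assumptions, is the class balance of the sentiment column. The toy DataFrame below is hypothetical and stands in for df_train; the labels -1, 0, 1 and 2 are the values visible in the rows printed in section 2.

```python
import pandas as pd

# Hypothetical stand-in for df_train; replace with the real DataFrame
toy_train = pd.DataFrame({
    "sentiment": [1, 1, 2, 1, 0, -1, 1, 2],
    "message": ["a", "b", "c", "d", "e", "f", "g", "h"],
})

# Absolute number of tweets per sentiment class, ordered by label
class_counts = toy_train["sentiment"].value_counts().sort_index()

# Share of each class, useful for spotting class imbalance
class_share = toy_train["sentiment"].value_counts(normalize=True).sort_index()

print(class_counts)
print(class_share.round(3))
```

If one class dominates, that is worth keeping in mind when choosing an evaluation metric later on.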

# 4. Data Engineering
Back to Table of Contents

⚡ Description: Data engineering ⚡
In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase.
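As an illustration of what cleaning might involve for tweet text, the sketch below lowercases a message and strips URLs, @mentions, the retweet marker and punctuation. The function name clean_tweet, the example tweet and the specific cleaning rules are assumptions made for this sketch, not steps prescribed by the notebook.

```python
import re
import string


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, mentions, the RT marker and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove @mentions
    text = re.sub(r"\brt\b", " ", text)         # remove the retweet marker
    # Drop punctuation; this keeps the word of a hashtag but removes the '#'
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())               # collapse repeated whitespace


# Hypothetical example tweet
example = "RT @example_user: Climate change is REAL!! #climate https://example.com/story"
print(clean_tweet(example))
```

On a DataFrame, such a function could be applied with df_train["message"].apply(clean_tweet), storing the result in a new column so the raw text stays available.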


# 5. Modelling
Back to Table of Contents

⚡ Description: Modelling ⚡
In this section, you are required to create one or more classification models that are able to accurately predict the sentiment class of a tweet about climate change.
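A minimal sketch of one possible text classifier follows. It combines the CountVectorizer and train_test_split imported in section 1 with a LogisticRegression model. The choice of logistic regression, the toy tweets and the two-label setup (assumed here to mean 1 for tweets supporting the belief in climate change and -1 for tweets rejecting it) are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy tweets and labels, for illustration only
messages = [
    "climate change is real and urgent",
    "global warming is a hoax",
    "we must act on climate change now",
    "climate change is a scam",
    "scientists agree the planet is warming",
    "there is no such thing as global warming",
    "rising seas show climate change is real",
    "climate alarmists are lying again",
]
labels = [1, -1, 1, -1, 1, -1, 1, -1]

# Hold out 25% of the data; stratify keeps the class ratio in both splits
X_train, X_val, y_train, y_val = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

# Bag-of-words counts feed a linear classifier
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

val_predictions = model.predict(X_val)
print(val_predictions)
```

Wrapping the vectorizer and the classifier in one pipeline means the vocabulary is learned from the training split only, which avoids leaking information from the validation tweets.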


# 6. Model Performance
Back to Table of Contents

⚡ Description: Model performance ⚡
In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on which model is the best and why.
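Such a comparison could be sketched as follows. The holdout labels and the predictions of model_a and model_b are hypothetical, and using the macro-averaged F1 score as the deciding metric is an assumption; it weighs every class equally, which can be informative when classes are imbalanced.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical holdout labels and predictions from two models
y_true = [1, 1, 2, 0, -1, 1, 2, 0]
preds = {
    "model_a": [1, 1, 2, 0, 1, 1, 2, 0],
    "model_b": [1, 1, 1, 1, 1, 1, 1, 1],  # always predicts the majority class
}

scores = {}
for name, y_pred in preds.items():
    scores[name] = {
        "accuracy": accuracy_score(y_true, y_pred),
        # zero_division=0 scores a class as 0 when it is never predicted
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }

# Pick the model with the highest macro-averaged F1 score
best = max(scores, key=lambda name: scores[name]["macro_f1"])
print(scores)
print("best:", best)
```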

# 7. Model Explanations
Back to Table of Contents

⚡ Description: Model explanation ⚡
In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings.
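If the best model turns out to be built on bag-of-words features (an assumption; the notebook does not yet name a final model), the intuition can be shown with the CountVectorizer imported in section 1: each tweet becomes a row of word counts, and the classifier learns which words point towards which sentiment. The two example sentences are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical tweets
docs = ["climate change is real", "climate change is a hoax"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)   # sparse matrix: one row per tweet

# Vocabulary in column order (columns are assigned alphabetically)
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(counts.toarray())
```

Note that the default tokenizer ignores single-character tokens, so the word "a" does not appear in the vocabulary.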