# Classification Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

I {**Team NM3**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

Team Members
1. Edidiong Udofia
2. Hlawulekani Rikhotso
3. Lesego Maponyane
4. Boitemogelo Tagane
5. Priscila Vhafuniwa Ndou
6. Fransisca Onyinyechukwu iloh
### Predict Overview: Climate Change Challenge

Many companies are built around lessening one’s environmental impact or carbon
footprint. They offer products and services that are environmentally friendly and
sustainable, in line with their values and ideals. They would like to determine how
people perceive climate change and whether or not they believe it is a real threat.
This would add to their market research efforts in gauging how their
product/service may be received.

With this context, EDSA is challenging you during the Classification Sprint with the
task of creating a Machine Learning model that is able to classify whether or not a
person believes in climate change, based on their novel tweet data.
Providing an accurate and robust solution to this task gives companies access to a
broad base of consumer sentiment, spanning multiple demographic and
geographic categories - thus increasing their insights and informing future
marketing strategies.

![gettyimages-586087414-2048x2048-smaller-scaled.jpg](attachment:gettyimages-586087414-2048x2048-smaller-scaled.jpg)

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Introduction</a>

<a href=#two>2. Problem Statement</a>

<a href=#three>3. Importing Packages</a>

<a href=#four>4. Loading Data</a>

<a href=#five>5. Pre-processing</a>

<a href=#six>6. Exploratory Data Analysis (EDA)</a>

<a href=#seven>7. Data Engineering</a>

<a href=#eight>8. Modeling</a>

<a href=#nine>9. Model Performance</a>

<a href=#ten>10. Model Explanations</a>

<a href=#eleven>11. Conclusion</a>


 <a id="one"></a>
## **1. Introduction**

In machine learning for sentiment analysis, classification methods are used to infer the opinion or emotion expressed in textual data. Like linear regression, these methods learn a mapping from input features to an output, but the output is a class label rather than a continuous value. Examples of these methods include support vector machines, neural networks, and transformer models.

Just as linear regression relates variables to reveal patterns in numerical data, sentiment analysis models relate the words in a text to the sentiment it expresses. For instance, a model may use techniques like word embeddings or transformer architectures to capture the contextual cues that distinguish positive, negative, and neutral sentiments.

These models rely on algorithms that analyze linguistic patterns rather than on a single closed-form formula. The goal is to capture the sentiment expressed, whether it be joy, discontent, or neutrality, so that useful information about opinion can be extracted from textual data.





![SEC_165384176-94a7.webp](attachment:SEC_165384176-94a7.webp)

 <a id="two"></a>
## 2. Problem Statement
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Problem Statement ⚡ |
| :--------------------------- |
| In this section you are required to introduce and elaborate on the problem statement or challenge you are required to solve. |


Given a collection of tweets authored by an individual, develop a machine learning model to predict the individual's belief in climate change. The model should be able to identify patterns in the individual's language and sentiment that are indicative of their stance on climate change. The model should be trained on a large dataset of labeled tweets to ensure its generalizability and accuracy.

Potential Applications:

The proposed model could have various applications, including:

- Understanding public opinion on climate change: The model could be used to analyze large volumes of social media data to understand the general public's sentiment towards climate change.

- Identifying individuals with strong opinions on climate change: The model could be used to identify individuals who hold strong opinions on climate change, either for or against, for further research or targeted communication campaigns.

- Analyzing the effectiveness of climate change communication: The model could be used to analyze the effectiveness of different communication strategies in influencing public opinion on climate change.


 <a id="three"></a>
## 3. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

![modules-and-packages-in-python.jpg](attachment:modules-and-packages-in-python.jpg)

The tools required are packages, or libraries, that provide specialised functionality for a particular task. We will focus on libraries for loading, manipulating, and visualising data, preparing it for analysis, building models, computing statistics, and handling categorical variables. Once the packages are loaded, we can begin working on the data. These packages are very useful for data scientists, as they make developing and deploying a dependable model much smoother.

In [None]:
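# Libraries for data loading, data manipulation and data visualisation
# (assumed choices: the template left these imports as placeholders)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Libraries for data preparation and model building
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Setting global constants to ensure notebook results are reproducible
PARAMETER_CONSTANT = 42  # assumed seed value; any fixed integer works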

<a id="four"></a>
## 4. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

In this section we load the data that will be used to train and test the model. This data helps us assess the reliability of the model, as it reveals the areas we need to alter to improve it.

In [None]:
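import pandas as pd  # repeated here so the cell runs on its own
df = pd.read_csv("df_train.csv")  # load the data; the file name and CSV format are assumed from `df_train` above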

<a id="five"></a>
## 5. Pre-processing
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Pre-processing ⚡ |
| :--------------------------- |
| In this section you are required to make the raw data suitable for modelling. |

![data-preprocessing-techniques-1.png](attachment:data-preprocessing-techniques-1.png)

### Description
Data preprocessing comprises the steps needed to transform or encode data so that a machine can parse it easily. For a model to be accurate and precise in its predictions, the algorithm must be able to interpret the data's features easily.

### Importance
Most real-world datasets for machine learning are highly susceptible to missing, inconsistent, and noisy values because of their heterogeneous origins.

Applying data mining algorithms to such noisy data would not give quality results, as they would fail to identify patterns effectively. Data preprocessing is therefore important for improving overall data quality:

- Duplicate or missing values may give an incorrect view of the overall statistics of the data.
- Outliers and inconsistent data points often disturb the model’s overall learning, leading to false predictions.
- Quality decisions must be based on quality data. Without data preprocessing to provide that quality, the result is a "Garbage In, Garbage Out" scenario.
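
As a rough illustration of the points above, the following sketch cleans a few tweets and then drops empty and duplicate entries. The sample tweets and the `clean_tweet` helper are hypothetical, and the specific cleaning rules (lowercasing, removing URLs, mentions, the retweet marker, and punctuation) are assumptions rather than the steps this notebook necessarily used.

```python
import re

# Hypothetical raw tweets: the second duplicates the first once cleaned, the last is empty.
raw_tweets = [
    "RT @user: Climate change is REAL! https://t.co/abc123 #ActOnClimate",
    "rt @user: climate change is real! #actonclimate",
    "   ",
]


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, mentions, the RT marker and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove @mentions
    text = re.sub(r"\brt\b", " ", text)        # remove the retweet marker
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation and symbols
    return " ".join(text.split())              # collapse repeated whitespace


cleaned = [clean_tweet(t) for t in raw_tweets]

# Drop empty (missing) tweets and exact duplicates while keeping the original order.
unique_cleaned = list(dict.fromkeys(t for t in cleaned if t))
print(unique_cleaned)
```

Whether to strip hashtags, mentions, or punctuation is a modelling choice: some of these tokens can carry sentiment, so it may be worth comparing model scores with and without each rule.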

In [None]:
# Pre-process the data

<a id="six"></a>
## 6. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


![1_Ra02AqsQlC0KV229EvM98g.jpg](attachment:1_Ra02AqsQlC0KV229EvM98g.jpg)

### What is EDA?
Exploratory Data Analysis (EDA) is the process of examining a dataset to uncover the features and trends that matter before machine learning or deep learning models are built. It has therefore become an essential step for anyone working in data science. Data science now plays an important role in the business world, since analyzing large volumes of collected data supports vital business decisions. Understanding the data thoroughly requires exploring it from every angle, and the most informative features are the ones that enable meaningful and beneficial decisions.

### Objectives of EDA
- Identifying and removing data outliers
- Identifying trends in time and space
- Uncovering patterns related to the target
- Creating hypotheses and testing them through experiments
- Identifying new sources of data

### Types of Exploratory Data Analysis
There are three main types of EDA:

1. Univariate 
2. Bivariate 
3. Multivariate 

### Univariate Analysis:
- <b>Examine individual variables:</b> Analyze each variable independently to understand its distribution, central tendency, spread, and outliers.
- <b>Visualize distributions:</b> Use histograms, box plots, and summary statistics to describe the data.

### Bivariate and Multivariate Analysis:
- <b>Explore relationships:</b> Investigate relationships between variables using scatter plots, correlation matrices, and cross-tabulations.
- <b>Identify patterns:</b> Look for patterns, trends, or associations between different variables.
- <b>Group comparisons:</b> Compare groups within categorical variables.
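
A minimal sketch of one univariate and one bivariate check on tweet data is shown below. The sample DataFrame, the column names `sentiment` and `message`, and the integer class labels are all assumptions made for illustration; the real training data may be laid out differently.

```python
import pandas as pd

# Hypothetical sample; the column names and label values are assumptions.
df_sample = pd.DataFrame({
    "sentiment": [1, 1, 0, -1, 2, 1],
    "message": [
        "climate change is real",
        "we must act on climate change now",
        "is it going to rain today",
        "global warming is a hoax",
        "new report on rising sea levels released",
        "protect our planet",
    ],
})

# Univariate: how many tweets fall into each class?
class_counts = df_sample["sentiment"].value_counts().sort_index()
print(class_counts)

# Bivariate: does tweet length (in words) differ between classes?
df_sample["word_count"] = df_sample["message"].str.split().str.len()
mean_words = df_sample.groupby("sentiment")["word_count"].mean()
print(mean_words)
```

The class counts are worth checking early, because a strongly imbalanced target affects both how the data should be split and which evaluation metric is meaningful.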

In [None]:
# look at data statistics

In [None]:
# plot relevant feature interactions

In [None]:
# evaluate correlation

In [None]:
# have a look at feature distributions

<a id="seven"></a>
## 7. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

![cb-dataengineervsscientist-060421-duplicate.png](attachment:cb-dataengineervsscientist-060421-duplicate.png)

### Feature Engineering
The feature engineering pipeline consists of the preprocessing steps that transform raw data into features that can be used in machine learning algorithms, such as predictive models. Feature engineering covers the creation, transformation, extraction, and selection of the features (also known as variables) that are most conducive to building an accurate ML model. These processes entail:
- <b>Feature Creation:</b><br> Creating features involves identifying the variables that will be most useful in the predictive model. This is a subjective process that requires human intervention and creativity. Existing features are combined via addition, subtraction, multiplication, and ratios to create new derived features that have greater predictive power.
- <b>Transformations:</b><br> Transformation involves manipulating the predictor variables to improve model performance, for example by ensuring the model is flexible in the variety of data it can ingest, putting variables on the same scale so the model is easier to understand, improving accuracy, and avoiding computational errors by keeping all features within an acceptable range for the model.
- <b>Feature Extraction:</b><br> Feature extraction is the automatic creation of new variables by extracting them from raw data. The purpose of this step is to automatically reduce the volume of data into a more manageable set for modeling. Some feature extraction methods include cluster analysis, text analytics, edge detection algorithms, and principal components analysis.
- <b>Feature Selection:</b><br> Feature selection algorithms essentially analyze, judge, and rank various features to determine which features are irrelevant and should be removed, which features are redundant and should be removed, and which features are most useful for the model and should be prioritized.

In [None]:
# remove missing values/ features

In [None]:
# create new features

In [None]:
# engineer existing features

<a id="eight"></a>
## 8. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more classification models that are able to accurately predict the climate change sentiment expressed in a tweet. |

---

![1682017713026.png](attachment:1682017713026.png)

### Model
Models are algorithms whose instructions are induced from a set of data and are then used to make predictions or recommendations, or to prescribe an action, based on a probabilistic assessment. The model uses algorithms to identify patterns in the data that relate the inputs to an output. In some settings, models can anticipate events more accurately than humans, such as catastrophic weather events or which hospital patients are at imminent risk of death.

### Key steps
- <b>Choosing a Model:</b><br>
A machine learning model determines the output you get after running a machine learning algorithm on the collected data. It is important to choose a model which is relevant to the task at hand. Over the years, scientists and engineers developed various models suited for different tasks like speech recognition, image recognition, prediction, etc. Apart from this, you also have to see if your model is suited for numerical or categorical data and choose accordingly.

- <b>Training the Model:</b><br>
Training is a central step in machine learning. In training, you pass the prepared data to your machine learning model so that it can find patterns and learn to make predictions. The model learns from the data so that it can accomplish the task set, and with more training it typically gets better at predicting.
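
The two steps above can be sketched as follows: hold out part of the data, then fit a model on the remainder. The tweets, binary labels, 25% test size, and the TF-IDF plus logistic regression pipeline are illustrative assumptions, not the exact configuration of this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical labelled tweets (1 = believes, 0 = does not believe).
tweets = [
    "climate change is real", "we must cut emissions now",
    "the planet is warming fast", "act on climate today",
    "climate change is a hoax", "global warming is a scam",
    "there is no warming at all", "the climate scare is fake",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the data; stratify so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.25, stratify=labels, random_state=42
)

# Choose a model (here: TF-IDF features + logistic regression) and train it.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Predict on the unseen tweets.
predictions = model.predict(X_test)
print(len(X_train), len(X_test), predictions)
```

Wrapping the vectorizer and classifier in one pipeline means the vocabulary is learned from the training tweets only, which keeps the held-out tweets genuinely unseen.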

In [None]:
# split data

In [None]:
# create targets and features dataset

In [None]:
# create one or more ML models

In [None]:
# evaluate one or more ML models

<a id="nine"></a>
## 9. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

![1_GmmzvXzqwkeX50HHSNzANg.png](attachment:1_GmmzvXzqwkeX50HHSNzANg.png)

### Evaluating the Model:

After training your model, you have to check how it is performing. This is done by testing the model on previously unseen data, namely the testing set that was split off from the data earlier. If testing were done on the same data used for training, you would not get an accurate measure, because the model has already seen that data and will find the same patterns in it as before. This would give a misleadingly high accuracy.

### Evaluation Methods
<b>Confusion matrix:</b><br>
A confusion matrix is an error matrix. It is presented as a table in which the predicted class is compared with the actual class. Understanding confusion matrices is of paramount importance for understanding classification metrics, such as recall and precision. The rows of a confusion matrix represent real values, while the columns represent predicted values.

Reading a confusion matrix is relatively simple: 

- True Positive (TP): you predicted positive, the real value was positive

- True Negative (TN): you predicted negative, the real value was negative

- False Positive (FP): you predicted positive, the real value was negative

- False Negative (FN): you predicted negative, the real value was positive
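
To make the four cells concrete, the sketch below counts them directly from a pair of label lists. The labels are hypothetical, and the matrix is laid out with rows as actual values and columns as predicted values, as described above.

```python
# Hypothetical binary labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 1)
tn = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 0)
fp = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 1)
fn = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 0)

# Rows are actual values (negative, positive); columns are predicted values.
confusion = [
    [tn, fp],
    [fn, tp],
]
for row in confusion:
    print(row)
```

In practice `sklearn.metrics.confusion_matrix(y_true, y_pred)` returns the same layout for binary labels sorted as `[0, 1]`, and it extends naturally to the multi-class case.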

<b>Precision:</b><br>
Precision is defined as the ratio of the True Positive count to the total number of positive predictions made by the model.

Precision = TP/(TP+FP)

Precision can be generated easily using the `precision_score()` function from the sklearn library.

The function takes 2 required parameters:
1) Correct Target labels
2) Predicted Target labels
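
A short example of the call, using hypothetical binary labels:

```python
from sklearn.metrics import precision_score

# Hypothetical binary labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

# Four positive predictions were made, three of which are correct.
precision = precision_score(y_true, y_pred)
print(precision)
```

With more than two classes, the optional `average` argument has to be set (for example `"macro"` or `"weighted"`), because the default setting only handles binary targets.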

<b>Recall:</b><br>
Recall is defined as the ratio of the True Positive count to the total Actual Positive count.

Recall = TP/(TP+FN)

Recall is also called “True Positive Rate” or “sensitivity”.

Recall can be generated easily using the `recall_score()` function from the sklearn library.

The function takes 2 required parameters:
1) Correct Target labels
2) Predicted Target labels
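
Because this project is a multi-class problem, the sketch below shows recall computed per class and then macro-averaged. The three-class labels are hypothetical.

```python
from sklearn.metrics import recall_score

# Hypothetical three-class labels.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]

# average=None returns one recall value per class, in sorted label order.
per_class = recall_score(y_true, y_pred, average=None)
print(per_class)

# average="macro" takes the unweighted mean of the per-class values.
macro = recall_score(y_true, y_pred, average="macro")
print(macro)
```

Macro averaging treats every class equally, whereas `average="weighted"` weights each class by its number of true instances; which one is appropriate depends on how much the minority classes matter.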

<b>F1 Score:</b><br>
The F1 score is one of the most widely used ways to score how well a classification model performs. It is the harmonic mean of precision and recall, as defined by the equation below.

F1 = 2 [(Recall * Precision) / (Recall + Precision)]
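
The formula above is the harmonic mean of the two values, which is pulled towards the smaller one. The sketch below, using a hypothetical helper `f1_from` and made-up precision and recall values, contrasts it with a plain arithmetic mean.

```python
def f1_from(precision: float, recall: float) -> float:
    """Return 2 * (precision * recall) / (precision + recall), or 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


balanced = f1_from(0.75, 0.75)     # precision and recall agree
lopsided = f1_from(0.9, 0.1)       # high precision, very low recall
arithmetic_mean = (0.9 + 0.1) / 2  # what a simple average would report

print(balanced, lopsided, arithmetic_mean)
```

A model cannot reach a high F1 score by doing well on only one of the two metrics, which is why F1 is often preferred over accuracy when classes are imbalanced.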

<b>Specificity (Selectivity, True Negative Rate):</b><br>
Specificity is similar to sensitivity, only the focus is on the negative class. It is the proportion of true negative cases which were correctly identified as such. The equation for specificity is:

Specificity = (True Negative) / (True Negative + False Positive)

<b>Fall-out (False Positive Rate):</b><br>
Fall-out is the probability of predicting a positive value when the actual value is negative. It is the proportion of actual negative cases that were incorrectly classified as positive. The equation for fall-out is:

Fall-out = (False Positive) / (True Negative + False Positive)

<b>Miss Rate (False Negative Rate):</b><br>
Miss rate can be defined as the proportion of positive values that were incorrectly classified as negative examples.

Miss Rate = (False negative) / (True positive + False negative)
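
The three rates above can be computed directly from the four confusion-matrix counts. The counts below are hypothetical.

```python
# Hypothetical confusion-matrix counts.
tp, tn, fp, fn = 3, 3, 1, 1

specificity = tn / (tn + fp)  # true negative rate
fall_out = fp / (tn + fp)     # false positive rate
miss_rate = fn / (tp + fn)    # false negative rate
recall = tp / (tp + fn)       # true positive rate, for comparison

print(specificity, fall_out, miss_rate, recall)
```

Fall-out is the complement of specificity and miss rate is the complement of recall, so reporting one metric of each pair is usually sufficient.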

In [None]:
# Compare model performance

In [None]:
# Choose best model and motivate why it is the best choice

<a id="ten"></a>
## 10. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

![20220721_185806_0000.png](attachment:20220721_185806_0000.png)

Considering the three main factors above, the best model is Logistic Regression. It has a comparatively short prediction time, which makes it more efficient in real-world use. Moreover, it achieves a good F1 score and accuracy, as shown by the F1 score it obtained on the unseen test data submitted to Kaggle.

Logistic Regression is a statistical algorithm used for binary classification tasks. It predicts binary outcomes by estimating the probability of an event occurring based on the input variables. Its advantages include simplicity, interpretability, and efficiency in handling large datasets. Because Logistic Regression is designed for binary classification and the project at hand is a multi-class problem, a variant such as One-vs-Rest is used. The model also has other hyperparameters that play a crucial role in its performance, namely `max_iter`, `C`, `penalty`, and `solver`.

`multi_class` specifies the strategy for handling multi-class classification. 'ovr' stands for "one-vs-rest", which means the model creates a separate binary classifier for each class, treating that class as the positive class and the rest as the negative class. `max_iter` sets the maximum number of iterations the solver may take to converge, that is, the maximum number of passes the algorithm makes over the data while searching for the optimal model parameters. In this case, it is set to 1000. `solver` determines the algorithm used for optimization. 'saga' is a variant of the Stochastic Average Gradient ('sag') solver; it works well with large datasets and supports both L1 and L2 regularization.

`penalty` specifies the type of regularization to be applied. 'l2' refers to L2 regularization, the penalty used in ridge regression. It adds the squared magnitudes of the coefficients to the loss function, encouraging the model to keep the coefficients small and reducing overfitting. `C` is the inverse of the regularization strength: a smaller value indicates stronger regularization, while a larger value indicates weaker regularization. In this case, `C` is set to 6, implying relatively weak regularization.
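
A minimal sketch of a model configured this way is shown below. It uses `OneVsRestClassifier` to apply the one-vs-rest strategy explicitly, which is an assumption made here because it behaves the same across scikit-learn versions; the notebook's own model may instead have passed the strategy to `LogisticRegression` directly. The tweets and class labels are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical three-class sample (label values are illustrative only).
tweets = [
    "climate change is real", "we must cut emissions",
    "climate change is a hoax", "global warming is a scam",
    "new climate report released today", "summit on climate starts monday",
]
labels = [1, 1, -1, -1, 2, 2]

# C=6, max_iter=1000 and solver="saga" follow the text; L2 is the default penalty.
base = LogisticRegression(C=6, max_iter=1000, solver="saga", random_state=42)

# One-vs-rest: one binary logistic regression is fitted per class.
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(base))
model.fit(tweets, labels)

ovr = model.named_steps["onevsrestclassifier"]
print(len(ovr.estimators_))
print(model.predict(["is climate change a hoax"]))
```

Each of the fitted binary classifiers exposes its own coefficients, so the words that push a tweet towards a given class can be inspected, which is part of what makes the model interpretable.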

In [None]:
# discuss chosen methods logic

<a id="eleven"></a>
## 11. Conclusion
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Conclusion ⚡ |
| :--------------------------- |
| In this section, you are required to conclude your findings and the project as a whole. |

---

In [None]:
# Conclusion