# Classification Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

I {**Buhle, Nonjojo**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

---

### Twitter Sentiment Classification Challenge
---
#### Introduction

As the world grapples with the urgent challenge of climate change, a growing number of companies are stepping up to offer solutions. These businesses, guided by a strong commitment to sustainability, create eco-friendly products and services that empower individuals to minimize their environmental impact. However, understanding how people view climate change and its severity is crucial for these companies to ensure their offerings resonate with the market. By conducting market research on climate change perceptions, these companies can gain valuable insights to tailor their products and services effectively, reaching consumers who share their values and desire to make a positive difference.

#### The Challenge

The EDSA Classification Sprint throws down a fascinating challenge: Can we harness the power of Machine Learning to predict someone's stance on climate change, using only their public Twitter data?

Cracking this code unlocks a treasure trove of real-time consumer sentiment. Imagine companies gaining unprecedented insights across diverse demographics and geographies, empowering them to develop products and marketing strategies that truly resonate with their target audience. This isn't just about classification; it's about opening a dialogue, understanding perspectives, and driving positive change on a global scale.

<a id="cont"></a>

## Table of Contents

><a href=#one>1. Importing Packages</a>

><a href=#two>2. Loading the Data</a>

><a href=#three>3. Exploratory Data Analysis (EDA)</a>

><a href=#four>4. Data Engineering and NLP Preprocessing</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section I will import, and briefly discuss, the libraries that will be used throughout our analysis and modelling. |

---

First, I list the packages that may need to be installed on your system:

In [36]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [37]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


Then import all necessary modules:

In [38]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.preprocessing import LabelEncoder

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section I will load the data from the files into a DataFrame. |

---

In [39]:
# Loading the datasets
df_train = pd.read_csv('train_set.csv')
df_test = pd.read_csv('test_set.csv')

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, I will perform an in-depth analysis of all the variables in the DataFrame. |

---


Let's start by viewing the first few rows of the DataFrame:

In [40]:
# A look at the first few rows of the data:
print(df_train.head())

  lang_id                                               text
0     xho  umgaqo-siseko wenza amalungiselelo kumaziko ax...
1     xho  i-dha iya kuba nobulumko bokubeka umsebenzi na...
2     eng  the province of kwazulu-natal department of tr...
3     nso  o netefatša gore o ba file dilo ka moka tše le...
4     ven  khomishini ya ndinganyiso ya mbeu yo ewa maana...


Let's now look at the datatypes contained within our dataset:

In [41]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   lang_id  33000 non-null  object
 1   text     33000 non-null  object
dtypes: object(2)
memory usage: 515.8+ KB


Next, let's check whether the dataset contains any missing values:

In [42]:
df_train.isnull().sum()

lang_id    0
text       0
dtype: int64
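
With no missing values, a natural next check is how many rows each language has, since class balance affects how the F1 averages used later should be read. A minimal sketch on a small made-up frame (the `toy` DataFrame and its rows are hypothetical; on the real data the same call would be applied to `df_train['lang_id']`):

```python
import pandas as pd

# Hypothetical stand-in for df_train, with the same two columns
toy = pd.DataFrame({
    'lang_id': ['xho', 'xho', 'eng', 'nso', 'ven', 'eng'],
    'text': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Number of rows per language
counts = toy['lang_id'].value_counts()

# Share of rows per language (the shares sum to 1)
shares = toy['lang_id'].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```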

<a id="four"></a>
## 4. Data Engineering and NLP Preprocessing
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section I will: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

In [43]:
# The 'text' feature needs to be vectorized for this task. A TF-IDF vectorizer is used for this.
vectorizer = TfidfVectorizer(sublinear_tf=True, encoding='utf-8', decode_error='ignore')
vectorizer.fit(df_train['text'])
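
To make the vectorization step more concrete, here is a small sketch on three made-up strings (the `toy_texts` list and the `toy_` names are hypothetical, not part of the dataset). With `sublinear_tf=True`, the raw term count `tf` is replaced by `1 + log(tf)`, and by default each row is scaled to unit length:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for df_train['text']
toy_texts = [
    'the province of kwazulu-natal',
    'the department of transport',
    'khomishini ya ndinganyiso',
]

toy_vectorizer = TfidfVectorizer(sublinear_tf=True)
toy_matrix = toy_vectorizer.fit_transform(toy_texts)

# One row per document, one column per distinct token
print(toy_matrix.shape)

# The learned vocabulary maps each token to its column index
print(sorted(toy_vectorizer.vocabulary_))
```

The default tokenizer keeps tokens of two or more word characters, so `kwazulu-natal` is split into `kwazulu` and `natal`.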

In [44]:
# Next, the 'lang_id' is encoded using LabelEncoder.
encoder = LabelEncoder()
df_train['lang_id'] = encoder.fit_transform(df_train['lang_id'].astype(str))
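
As a quick illustration of what the encoder does, the sketch below encodes a few made-up labels and decodes them again (`toy_labels` and `toy_encoder` are hypothetical names). The integer codes are simply positions in the sorted `classes_` array:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for df_train['lang_id']
toy_labels = ['xho', 'eng', 'nso', 'ven', 'xho']

toy_encoder = LabelEncoder()
codes = toy_encoder.fit_transform(toy_labels)

# classes_ holds the original labels in sorted order
print(list(toy_encoder.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original strings
restored = toy_encoder.inverse_transform(codes)
print(list(restored))
```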

In [45]:
# The training dataset is then split into training and validation sets.
X_train, X_valid, y_train, y_valid = train_test_split(df_train['text'], df_train['lang_id'], 
                                                      test_size=0.2, random_state=42)
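
The split above is purely random. One optional variation, shown here only as a sketch on made-up data, is to pass `stratify` so that each language keeps the same proportion in both subsets (the `toy` frame and the `X_tr`, `X_va`, `y_tr`, `y_va` names are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced data: 10 rows for each of two languages
toy = pd.DataFrame({
    'text': [f'doc {i}' for i in range(20)],
    'lang_id': ['xho'] * 10 + ['eng'] * 10,
})

# stratify keeps the class proportions equal in both subsets
X_tr, X_va, y_tr, y_va = train_test_split(
    toy['text'], toy['lang_id'],
    test_size=0.2, random_state=42,
    stratify=toy['lang_id'],
)

print(y_va.value_counts())
```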

In [46]:
# Next, the text in the training and validation sets is transformed into TF-IDF vectors.
X_train = vectorizer.transform(X_train)
X_valid = vectorizer.transform(X_valid)

In [47]:
# Now, a Logistic Regression model is trained.
model = LogisticRegression(random_state=0, C=2.5, max_iter=1000, n_jobs=-1)
model.fit(X_train, y_train)
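
Note that in the cells above the vectorizer is fitted on all of `df_train` before the split, so the validation texts also contribute to the vocabulary and IDF weights. One possible way to keep the validation set fully unseen, sketched here on made-up strings and not the approach used in this notebook, is to wrap both steps in a scikit-learn pipeline that is fitted on the training texts only (all names and strings below are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the strings are illustrative only
train_texts = ['the cat sat', 'the dog ran', 'umntu uyahamba', 'umntu uyatya']
train_labels = ['eng', 'eng', 'xho', 'xho']
valid_texts = ['the cat ran', 'umntu uyahamba kakhulu']

# The vectorizer is fitted inside the pipeline, on training texts only
pipe = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),
    LogisticRegression(C=2.5, max_iter=1000),
)
pipe.fit(train_texts, train_labels)

# Tokens seen only at validation time (e.g. 'kakhulu') are ignored
valid_pred = pipe.predict(valid_texts)
print(list(valid_pred))
```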

In [48]:
# The performance of the model can be evaluated using the validation set.
y_pred = model.predict(X_valid)
print('F1 Score:', f1_score(y_valid, y_pred, average='macro'))

F1 Score: 0.995552204421297
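
The score above uses `average='macro'`, which gives every class equal weight, whereas `average='weighted'` weights each class by its number of true samples. The two can differ noticeably when classes are imbalanced, as this small sketch with made-up labels shows (`y_true` and `y_hat` are hypothetical):

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced labels: eight samples of class 0, two of class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# Per-class F1: class 0 has one false positive, class 1 has one false negative
per_class = f1_score(y_true, y_hat, average=None)

# Macro: plain mean of the per-class scores
macro = f1_score(y_true, y_hat, average='macro')

# Weighted: mean of the per-class scores weighted by class support
weighted = f1_score(y_true, y_hat, average='weighted')

print(per_class, macro, weighted)
```

Here the minority class drags the macro score below the weighted one, so macro F1 is the stricter of the two when some languages are rarer than others.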


In [49]:
# Next, the test data is transformed into a TF-IDF vector and the trained model is used to predict the language of the texts in the test data.
X_test = vectorizer.transform(df_test['text'])
predictions = model.predict(X_test)

In [50]:
# The predicted language IDs are then decoded back into the original form.
predictions = encoder.inverse_transform(predictions)

In [51]:
# Create a submission dataframe with the index column from the test data
submission = pd.DataFrame({
    'index': df_test['index'],
    'lang_id': predictions
})

# Save the DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

Next, I try a second approach. To avoid repeating the same preprocessing steps on the training and test datasets, we first join them together:

In [52]:
# Discard all changes and reload the data
df_train = pd.read_csv('train_set.csv')
df_evaluate = pd.read_csv('test_set.csv')

# Combine the training and evaluation data
df_combined = pd.concat([df_train, df_evaluate], ignore_index=True)

In [53]:
# Data preprocessing: lowercase the text using the vectorized string method
df_combined['text'] = df_combined['text'].str.lower()

In [54]:
# Feature extraction
vectorizer = TfidfVectorizer(sublinear_tf=True, encoding='utf-8', decode_error='ignore', stop_words='english')
X = vectorizer.fit_transform(df_combined['text'])

In [55]:
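# Encode the labels as integers.
# Note: rows from the test set have no 'lang_id', and astype(str) turns those
# missing values into the string 'nan', which is encoded as an extra class.
encoder = LabelEncoder()
df_combined['lang_id'] = encoder.fit_transform(df_combined['lang_id'].astype(str))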

In [56]:
# Split the combined data back into training and evaluation data
df_train = df_combined.iloc[:len(df_train)]
df_evaluate = df_combined.iloc[len(df_train):]

y = df_train['lang_id']

In [57]:
# Split the training data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(X[:len(df_train)], y, test_size=0.2, random_state=42)

In [58]:
# Define and train the model

model = LinearSVC()
model.fit(X_train, y_train)



In [59]:
# Evaluate the model

y_pred = model.predict(X_valid)

# Print classification report
print('classification_report:\n', classification_report(y_valid, y_pred))

# Print confusion matrix
print('Confusion Matrix:\n', confusion_matrix(y_valid, y_pred))

# Print F1 score
print('F1 Score:', f1_score(y_valid, y_pred, average='weighted'))

classification_report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       583
           1       1.00      1.00      1.00       615
           3       0.99      0.99      0.99       583
           4       1.00      1.00      1.00       625
           5       1.00      1.00      1.00       618
           6       1.00      1.00      1.00       584
           7       1.00      1.00      1.00       598
           8       1.00      1.00      1.00       561
           9       1.00      1.00      1.00       634
          10       0.99      1.00      1.00       609
          11       0.99      0.99      0.99       590

    accuracy                           1.00      6600
   macro avg       1.00      1.00      1.00      6600
weighted avg       1.00      1.00      1.00      6600

Confusion Matrix:
 [[581   2   0   0   0   0   0   0   0   0   0]
 [  0 615   0   0   0   0   0   0   0   0   0]
 [  0   1 577   0   0   0   0   0   0   2   3]
 [ 
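
The raw matrix is easier to read when its rows and columns carry the language codes. A minimal sketch on made-up labels (`class_names`, `y_true` and `y_hat` are hypothetical; in this notebook the names could come from `encoder.classes_`):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions for three languages
class_names = ['eng', 'nso', 'xho']
y_true = ['eng', 'eng', 'nso', 'xho', 'xho', 'nso']
y_hat = ['eng', 'nso', 'nso', 'xho', 'xho', 'nso']

# Rows are true labels, columns are predicted labels, in class_names order
cm = confusion_matrix(y_true, y_hat, labels=class_names)

# Wrap the array in a DataFrame so each cell is labelled
cm_df = pd.DataFrame(cm, index=class_names, columns=class_names)
cm_df.index.name = 'true'
cm_df.columns.name = 'predicted'

print(cm_df)
```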

In [60]:
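# Predict on the evaluation rows (the part of X after the training rows)
predictions = model.predict(X[len(df_train):])

In [61]:
# Decode the predicted labels back into language codes
predictions = encoder.inverse_transform(predictions)

In [62]:
# Prepare a submission file using the 'index' column of the test set
# (concatenating with the training rows made it float, so cast back to int)
submission = pd.DataFrame({'index': df_evaluate['index'].astype(int), 'lang_id': predictions})
submission.to_csv('submission.csv', index=False)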