<a href="https://colab.research.google.com/github/Mike-Wazovsky/JetBrain-emotion-classification/blob/main/JetBrains_emotion_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **JetBrains**

What I've done:
- Extracted numerical features from the text
- Applied basic text-cleaning methods
- Represented the text data in matrix form
- Built a baseline model using the numerical features and the features extracted from the text


Materials I used in the process:

https://www.analyticsvidhya.com/blog/2021/04/a-guide-to-feature-engineering-in-nlp/

https://towardsdatascience.com/how-to-turn-text-into-features-478b57632e99

https://www.section.io/engineering-education/nlp-based-detection-model-using-neattext-and-scikit-learn/

https://medium.com/neuronio/from-sentiment-analysis-to-emotion-recognition-a-nlp-story-bcc9d6ff61ae

## **Preparation**

In [1]:
import numpy as np 
import pandas as pd
from typing import Tuple
import pickle
import re

import sklearn 
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin 
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

import nltk
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
nltk.download('stopwords')
nltk.download('punkt')

!pip install neattext
import neattext.functions as nfx

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting neattext
  Downloading neattext-0.1.3-py3-none-any.whl (114 kB)
Installing collected packages: neattext
Successfully installed neattext-0.1.3


In [2]:
print(nltk.__version__)
print(np.__version__)
print(pd.__version__)
print(sklearn.__version__)

3.8.1
1.22.4
1.4.4
1.2.2


The ten functions below each compute one numerical feature from the raw text.

In [3]:
# count number of characters 
def count_chars(text):
    return len(text)

# count number of words 
def count_words(text):
    return len(text.split())

# count number of capital characters
def count_capital_chars(text):
    count=0
    for i in text:
        if i.isupper():
            count+=1
    return count

# count number of capital words
def count_capital_words(text):
    return sum(map(str.isupper,text.split()))

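# count number of words in quotes
def count_words_in_quotes(text):
    # non-greedy match of any text between a pair of single or double quotes
    quoted = re.findall(r"'.*?'|\".*?\"", text)
    # strip the surrounding quotes before counting the words inside
    return sum(count_words(q[1:-1]) for q in quoted)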
    
# count number of sentences
def count_sent(text):
    return len(nltk.sent_tokenize(text))

# count number of unique words 
def count_unique_words(text):
    return len(set(text.split()))
    
# count of hashtags
def count_htags(text):
    x = re.findall(r'(\#\w[A-Za-z0-9]*)', text)
    return len(x)

# count of mentions
def count_mentions(text):
    x = re.findall(r'(\@\w[A-Za-z0-9]*)', text)
    return len(x)

# count of stopwords
def count_stopwords(text):
    stop_words = set(stopwords.words('english'))  
    word_tokens = word_tokenize(text)
    stopwords_x = [w for w in word_tokens if w in stop_words]
    return len(stopwords_x)
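
To make the features concrete, here is a minimal sketch that computes a few of them on a single post. The `sample` string is a made-up example, not a row from the dataset, and only the standard-library features are reproduced so that the snippet runs without NLTK data.

```python
import re

# Hypothetical post, used only for illustration
sample = "LOVE my new Kindle! Thanks @amazon #kindle #reading"

features = {
    "char_count": len(sample),
    "word_count": len(sample.split()),
    "capital_char_count": sum(ch.isupper() for ch in sample),
    # str.isupper is True only for tokens whose cased characters are all upper case
    "capital_word_count": sum(map(str.isupper, sample.split())),
    "unique_word_count": len(set(sample.split())),
    # same patterns as count_htags and count_mentions above
    "htag_count": len(re.findall(r"(\#\w[A-Za-z0-9]*)", sample)),
    "mention_count": len(re.findall(r"(\@\w[A-Za-z0-9]*)", sample)),
}
# ratio feature built from two counts, as done later for the whole DataFrame
features["avg_wordlength"] = features["char_count"] / features["word_count"]

for name, value in features.items():
    print(f"{name}: {value}")
```

Note that `avg_wordlength` divides by the word count, so an empty post would need special handling.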

## **Dataset processing**

In [4]:
# get dataset
df = pd.read_csv('fb_sentiment.csv')

In [5]:
# adding features in data
df['char_count'] = df["FBPost"].apply(count_chars)
df['word_count'] = df["FBPost"].apply(count_words)
df['sent_count'] = df["FBPost"].apply(count_sent)
df['capital_char_count'] = df["FBPost"].apply(count_capital_chars)
df['capital_word_count'] = df["FBPost"].apply(count_capital_words)
df['quoted_word_count'] = df["FBPost"].apply(count_words_in_quotes)
df['stopword_count'] = df["FBPost"].apply(count_stopwords)
df['unique_word_count'] = df["FBPost"].apply(count_unique_words)
df['htag_count'] = df["FBPost"].apply(count_htags)
df['mention_count'] = df["FBPost"].apply(count_mentions)
df['avg_wordlength']=df['char_count']/df['word_count']
df['avg_sentlength']=df['word_count']/df['sent_count']
df['unique_vs_words']=df['unique_word_count']/df['word_count']
df['stopwords_vs_words']=df['stopword_count']/df['word_count']

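# clean text
# note: SnowballStemmer.stem expects a single word, so applied to a whole post
# it lowercases the text and treats the entire string as one token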
snowball = SnowballStemmer(language="english")
df['FBPost'] = df['FBPost'].apply(snowball.stem)
df['FBPost'] = df['FBPost'].apply(nfx.remove_userhandles)
df['FBPost'] = df['FBPost'].apply(nfx.remove_stopwords)
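
`SnowballStemmer.stem` is written for one word at a time. If per-word stemming is what is wanted, one possible approach is to split each post into tokens, stem each token, and join the result. The helper name `stem_tokens` is hypothetical, and whitespace splitting is an assumption made here to keep the sketch free of tokenizer downloads.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer(language="english")

def stem_tokens(text):
    """Stem every whitespace-separated token and join the stems back together."""
    return " ".join(stemmer.stem(token) for token in text.split())

print(stem_tokens("running runners"))
```

With a helper like this, the cleaning step could become `df['FBPost'].apply(stem_tokens)`. The vocabulary size and the scores reported further down would then likely differ from the ones shown.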

In [6]:
# drop rows with missing values (dropna returns a new DataFrame)
df = df.dropna(axis = 0)

# set index
df.set_index('Unnamed: 0', inplace = True)
df.head()

| Unnamed: 0 | FBPost | Label | char_count | word_count | sent_count | capital_char_count | capital_word_count | quoted_word_count | stopword_count | unique_word_count | htag_count | mention_count | avg_wordlength | avg_sentlength | unique_vs_words | stopwords_vs_words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | drug runners u.s. senator murder http://www.am... | O | 241 | 26 | 3 | 35 | 4 | 0 | 7 | 26 | 0 | 0 | 9.269231 | 8.666667 | 1.0 | 0.269231 |
| 1 | heres single, add, kindle. read 19th century s... | O | 251 | 44 | 3 | 14 | 0 | 0 | 16 | 38 | 0 | 0 | 5.704545 | 14.666667 | 0.863636 | 0.363636 |
| 2 | tire non-fiction.. check http://www.amazon.com... | O | 146 | 8 | 2 | 5 | 0 | 0 | 3 | 8 | 0 | 0 | 18.25 | 4.0 | 1.0 | 0.375 |
| 3 | ghost round island supposedly nonfiction. | O | 47 | 7 | 1 | 3 | 0 | 0 | 2 | 7 | 0 | 0 | 6.714286 | 7.0 | 1.0 | 0.285714 |
| 4 | barnes nobles version kindle expensive kindle? | N | 86 | 16 | 1 | 5 | 0 | 0 | 8 | 15 | 0 | 0 | 5.375 | 16.0 | 0.9375 | 0.5 |


In [7]:
# Split the columns into all features (features) and numerical-only features (numeric_features)
features = [c for c in df.columns.values if c  not in ['Unnamed: 0', 'Label']]
numeric_features = [c for c in df.columns.values if c  not in ['Unnamed: 0', 'Label', 'FBPost']]

# set the label
target = 'Label'

# Train/Test split in a ratio 0.8:0.2
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size = 0.2, random_state = 42)
X_train.head()

| Unnamed: 0 | FBPost | char_count | word_count | sent_count | capital_char_count | capital_word_count | quoted_word_count | stopword_count | unique_word_count | htag_count | mention_count | avg_wordlength | avg_sentlength | unique_vs_words | stopwords_vs_words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29 | let straight...kindle gaming device? dont jump... | 103 | 17 | 2 | 4 | 0 | 0 | 6 | 17 | 0 | 0 | 6.058824 | 8.5 | 1.0 | 0.352941 |
| 535 | love kindle. new purse kindle fit. ever. read ... | 372 | 78 | 7 | 4 | 2 | 0 | 43 | 55 | 0 | 0 | 4.769231 | 11.142857 | 0.705128 | 0.551282 |
| 695 | favorite things kindle...i love dictionary fun... | 84 | 14 | 1 | 2 | 0 | 0 | 6 | 13 | 0 | 0 | 6.0 | 14.0 | 0.928571 | 0.428571 |
| 557 | love it!! hold my.children read!!! v day 2010.... | 134 | 22 | 5 | 4 | 1 | 0 | 7 | 22 | 0 | 0 | 6.090909 | 4.4 | 1.0 | 0.318182 |
| 836 | tell read kindle dark? internal lighting like ... | 195 | 39 | 3 | 8 | 2 | 0 | 20 | 34 | 0 | 0 | 5.0 | 13.0 | 0.871795 | 0.512821 |


## **Pipeline Implementation**

In [8]:
class TextSelector(BaseEstimator, TransformerMixin):
    """Transformer that selects one text column and returns it as a 1-D Series."""
    
    def __init__(self, key):
        self.key = key

    def fit(self, X, y = None):
        return self

    def transform(self, X):
        return X[self.key]
    
class NumberSelector(BaseEstimator, TransformerMixin):
    """Transformer that selects one numeric column and returns it as a one-column DataFrame."""
       
    def __init__(self, key):
        self.key = key

    def fit(self, X, y = None):
        return self

    def transform(self, X):
        return X[[self.key]]
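
The two selectors differ only in the shape of what they return, and that difference matters downstream: `CountVectorizer` expects a 1-D sequence of strings, while `StandardScaler` expects a 2-D array. The sketch below repeats the two classes so that it runs on its own, and applies them to a made-up two-row frame (`toy` is not part of the dataset).

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class TextSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]        # single brackets: 1-D Series

class NumberSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]      # double brackets: 2-D DataFrame

# Hypothetical toy data
toy = pd.DataFrame({"FBPost": ["love kindle", "kindle read"], "char_count": [11, 11]})

text_col = TextSelector(key="FBPost").transform(toy)
num_col = NumberSelector(key="char_count").transform(toy)

print(type(text_col).__name__, text_col.shape)
print(type(num_col).__name__, num_col.shape)
```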

In [9]:
text = Pipeline([
                ('selector', TextSelector(key = 'FBPost')),
                ('cv',CountVectorizer())
            ])
text.fit_transform(X_train)

<800x2430 sparse matrix of type '<class 'numpy.int64'>'
	with 7599 stored elements in Compressed Sparse Row format>

In [10]:
# Combine the results of several transformed variables into a single data set.
# We'll make a pipeline for each variable, then concatenate them.
char_count =  Pipeline([
                ('selector', NumberSelector(key = 'char_count')),
                ('standard', StandardScaler())
            ])
word_count =  Pipeline([
                ('selector', NumberSelector(key = 'word_count')),
                ('standard', StandardScaler())
            ])
capital_char_count =  Pipeline([
                ('selector', NumberSelector(key = 'capital_char_count')),
                ('standard', StandardScaler())
            ])
capital_word_count =  Pipeline([
                ('selector', NumberSelector(key = 'capital_word_count')),
                ('standard', StandardScaler())
            ])
quoted_word_count =  Pipeline([
                ('selector', NumberSelector(key = 'quoted_word_count')),
                ('standard', StandardScaler())
            ])
stopword_count =  Pipeline([
                ('selector', NumberSelector(key = 'stopword_count')),
                ('standard', StandardScaler())
            ])
unique_word_count =  Pipeline([
                ('selector', NumberSelector(key = 'unique_word_count')),
                ('standard', StandardScaler())
            ])
htag_count =  Pipeline([
                ('selector', NumberSelector(key = 'htag_count')),
                ('standard', StandardScaler())
            ])
mention_count =  Pipeline([
                ('selector', NumberSelector(key = 'mention_count')),
                ('standard', StandardScaler())
            ])
avg_wordlength =  Pipeline([
                ('selector', NumberSelector(key = 'avg_wordlength')),
                ('standard', StandardScaler())
            ])
avg_sentlength =  Pipeline([
                ('selector', NumberSelector(key = 'avg_sentlength')),
                ('standard', StandardScaler())
            ])
unique_vs_words =  Pipeline([
                ('selector', NumberSelector(key = 'unique_vs_words')),
                ('standard', StandardScaler())
            ])
stopwords_vs_words =  Pipeline([
                ('selector', NumberSelector(key = 'stopwords_vs_words')),
                ('standard', StandardScaler())
            ])


feats = FeatureUnion([('text', text),
                      ('char_count', char_count),             
                      ('word_count', word_count),
                      ('capital_char_count', capital_char_count),
                      ('capital_word_count', capital_word_count),
                      ('quoted_word_count', quoted_word_count),
                      ('stopword_count', stopword_count),
                      ('unique_word_count', unique_word_count),
                      ('htag_count', htag_count),
                      ('mention_count', mention_count),
                      ('avg_wordlength', avg_wordlength),
                      ('avg_sentlength', avg_sentlength),
                      ('unique_vs_words', unique_vs_words),
                      ('stopwords_vs_words', stopwords_vs_words),])
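
The per-column pipelines above are repetitive. As a possible alternative, and not what this notebook uses, scikit-learn's `ColumnTransformer` can express the same idea more compactly: vectorize the text column and scale a list of numeric columns in one object. The sketch below runs on a made-up three-row frame. The names `toy`, `numeric_cols` and `feats_ct` are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one text column and two numeric columns
toy = pd.DataFrame({
    "FBPost": ["love kindle", "kindle read read", "hate waiting"],
    "char_count": [11, 16, 12],
    "word_count": [2, 3, 2],
})
numeric_cols = ["char_count", "word_count"]

feats_ct = ColumnTransformer([
    # a plain string selects the column as 1-D input, as CountVectorizer requires
    ("text", CountVectorizer(), "FBPost"),
    # a list selects a 2-D block; StandardScaler scales each column independently
    ("numeric", StandardScaler(), numeric_cols),
])

X = feats_ct.fit_transform(toy)
print(X.shape)  # 5 vocabulary terms + 2 numeric columns
```

Because `StandardScaler` standardizes every column separately, scaling the numeric columns as one block should give the same values as the one-pipeline-per-column layout.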

In [11]:
feature_processing = Pipeline([('feats', feats)])
feature_processing.fit_transform(X_train)

<800x2443 sparse matrix of type '<class 'numpy.float64'>'
	with 17199 stored elements in Compressed Sparse Row format>

## **Model Train & Results**

In [12]:
# Feed the combined features into a logistic regression classifier
pipeline = Pipeline([
    ('features', feats),
    ('lr', LogisticRegression()),
])

# Now we can train the model
pipeline.fit(X_train, y_train)

In [13]:
# Output a classification report
preds = pipeline.predict(X_test)
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           N       0.20      0.17      0.18        12
           O       0.67      0.68      0.68        63
           P       0.84      0.85      0.84       125

    accuracy                           0.76       200
   macro avg       0.57      0.57      0.57       200
weighted avg       0.75      0.76      0.75       200
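
The gap between the macro and weighted averages comes from class imbalance: the macro average treats the three classes equally, while the weighted average weights each class by its support. A short check that recomputes both f1 averages from the rounded per-class values in the report above:

```python
# (f1-score, support) per class, copied from the classification report
report = {"N": (0.18, 12), "O": (0.68, 63), "P": (0.84, 125)}

total_support = sum(support for _, support in report.values())

# macro average: plain mean over classes
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# weighted average: mean over classes, weighted by the number of test samples
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(round(macro_f1, 2), round(weighted_f1, 2))
```

The weak score on class `N`, which has only 12 test samples, pulls the macro average down much more than the weighted one.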



In [14]:
# save the model to disk (the with-block closes the file handle)
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(pipeline, f)
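
Loading the model back is the mirror operation. The sketch below uses a small stand-in pipeline trained on made-up data (`toy_texts`, `toy_labels`) and a temporary directory, so that it runs on its own. For the real pipeline, the `TextSelector` and `NumberSelector` classes must be defined or importable before calling `pickle.load`, because pickle stores classes by reference.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real training set
toy_texts = ["love kindle", "great read", "hate waiting", "bad screen"]
toy_labels = ["P", "P", "N", "N"]

toy_pipeline = Pipeline([("cv", CountVectorizer()), ("lr", LogisticRegression())])
toy_pipeline.fit(toy_texts, toy_labels)

# Save to a temporary location, then load it back
path = os.path.join(tempfile.mkdtemp(), "finalized_model.sav")
with open(path, "wb") as f:
    pickle.dump(toy_pipeline, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored pipeline should reproduce the original predictions
same = list(restored.predict(toy_texts)) == list(toy_pipeline.predict(toy_texts))
print(same)
```

Only unpickle files from a trusted source, and preferably with the same scikit-learn version that was used to save them.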