<a href="https://colab.research.google.com/github/AndrewDavidRatnam/Feature-_Engineering_Bookcamp-/blob/main/Feature_Engineering_Natural_language_processing_Classifying_social_media_sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, mean_squared_error
from sklearn.pipeline import Pipeline
import time

In [2]:
def advanced_grid_search(x_train, y_train, x_test, y_test, ml_pipeline, params, cv=3, include_probas=False, is_regression=False):
  model_grid_search = GridSearchCV(ml_pipeline, param_grid=params, cv=cv, error_score=-1)
  start_time = time.time()

  model_grid_search.fit(x_train,y_train)

  train_time = time.time()

  print(f"Training the model{(train_time - start_time):.2f} seconds")

  best_model = model_grid_search.best_estimator_

  y_preds = best_model.predict(x_test)

  if is_regression:
    rmse = np.sqrt(mean_squared_error(y_pred=y_preds, y_true=y_test))
    print(f'RMSE:{rmse:.5f}')
  else:
    print(classification_report(y_true=y_test, y_pred=y_preds))
  print(f'Best params : {model_grid_search.best_params_}')

  end_time = time.time()
  print(f"Overall took{(end_time - start_time):.2f} seconds")

  if include_probas:
    y_probas = best_model.predict_proba(x_test).max(axis=1)
    return best_model, y_preds, y_probas

  return best_model, y_preds



In [3]:
!pip install ydata-profiling



# EDA

In [4]:
tweet_df = pd.read_csv('/content/cleaned_airline_tweets.csv')
tweet_df.head()

Unnamed: 0,text,sentiment
0,@VirginAmerica What @dhepburn said.,neutral
1,"@VirginAmerica it was amazing, and arrived an ...",positive
2,@VirginAmerica I &lt;3 pretty graphics. so muc...,positive
3,@VirginAmerica So excited for my first cross c...,positive
4,I ❤️ flying @VirginAmerica. ☺️👍,positive


In [5]:
tweet_df.columns

Index(['text', 'sentiment'], dtype='object')

In [6]:
tweet_df["sentiment"].unique()

array(['neutral', 'positive', 'negative'], dtype=object)

In [7]:
tweet_df["text"].nunique()

3860

In [8]:
tweet_df['sentiment'].value_counts(normalize=True)

Unnamed: 0_level_0,proportion
sentiment,Unnamed: 1_level_1
positive,0.348705
neutral,0.336528
negative,0.314767


The profiling report below is commented out because its rendered output can't be saved to GitHub.

In [9]:
# from  ydata_profiling import ProfileReport
# profile = ProfileReport(tweet_df, title="Tweets Report", explorative=True)
# profile

Splitting our data into training and testing sets

In [10]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(
    tweet_df , test_size=0.2, random_state=0,stratify=tweet_df["sentiment"]
)

In [11]:
print(f"Count of tweets in training set:{train.shape[0]:,}")
print(f'Count of tweets in testing set:{test.shape[0]:,}')


Count of tweets in training set:3,088
Count of tweets in testing set:772


## Problem Definition and Success definition

This is more or less a given: given a tweet, can we classify it into the right sentiment? Accuracy seems like the standard success metric, and it is a reasonable one here because the three classes are roughly balanced.
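To put any accuracy number in context, it helps to know what a trivial model would score. The sketch below reuses the class proportions printed by `value_counts(normalize=True)` above and assumes the stratified split keeps roughly the same proportions in the test set; the variable names are illustrative only.

```python
# Class proportions copied from the value_counts(normalize=True) output above.
class_proportions = {"positive": 0.348705, "neutral": 0.336528, "negative": 0.314767}

# A model that always predicts the most frequent class is right exactly as
# often as that class occurs, so this is the accuracy floor to beat.
majority_class = max(class_proportions, key=class_proportions.get)
baseline_accuracy = class_proportions[majority_class]

print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```

Any model worth keeping should land well above this floor of roughly 0.35.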

# Text Vectorization

3 main ways:
- bag of words
- count vectorization
- tf-idf

## Feature construction: Bag of words

`CountVectorizer(ngram_range=(1, 3))`

## Count vectorization

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
single_word = cv.fit_transform(train["text"])

print(single_word.shape)

(3088, 6018)


- scikit-learn’s `CountVectorizer` class converts text samples into vectors
- it transforms the corpus into a matrix of fixed-length vectors, one row per document and one column per token in the vocabulary

In [13]:
pd.DataFrame(single_word.todense(), columns=cv.get_feature_names_out())

Unnamed: 0,00,000,000114,000ft,00pm,0167560070877,02,0200,03,0400,...,zacks_com,zakkohane,zero,zf5wjgtxzt,zgoqoxjbqy,zj76,zone,zsdgzydnde,zukes,zv2pt6trk9
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3083,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3084,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3085,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3086,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Vectorize our text using only the 20 most common tokens

In [14]:
cv = CountVectorizer(max_features=20)
limited_vocab = cv.fit_transform(train["text"])
pd.DataFrame(limited_vocab.toarray(), index = train["text"], columns=cv.get_feature_names_out())

Unnamed: 0_level_0,americanair,and,flight,for,in,is,it,jetblue,me,my,of,on,southwestair,thanks,the,to,united,usairways,you,your
text,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
@JetBlue Maybe I'll just go to Cleveland instead.,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0
smh RT @JetBlue: Our fleet's on fleek. http://t.co/IRiXaIfJJX,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0
@SouthwestAir I would.,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
@USAirways trying to Cancelled Flight a flight urgently...get hung up on twice??? Sweet refund policy,0,0,2,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0
@AmericanAir you are beyond redemption. Jfk. Baggage claim looks like a luggage warehouse,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
“@JetBlue: Our fleet's on fleek. http://t.co/b5ttno68xu” I just 🙈,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0
"@united caught earlier flight to ORD. Gate checked bag, and you've lost it at O'Hare. original flight lands in 20minutes. #frustrating!",0,1,2,0,1,0,1,0,0,0,0,0,0,0,0,1,1,0,1,0
@AmericanAir hi when will your next set of flights be out for next year from Dublin???,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1
@SouthwestAir Finally! Integration w/ passbook is a great Valentine gift - better then chocoLate Flight. You do heart me.,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0


In [15]:
cv = CountVectorizer(ngram_range=(1,3))
more_ngrams = cv.fit_transform(train["text"])
print(more_ngrams.shape)
pd.DataFrame(more_ngrams.toarray(), index = train['text'], columns = cv.get_feature_names_out()).head()

(3088, 70613)


Unnamed: 0_level_0,00,00 phone,00 phone hold,00 pm,00 pm that,000,000 air,000 air miles,000 crewmembers,000 crewmembers embody,...,zj76 how,zj76 how did,zone,zone was,zone was after,zsdgzydnde,zukes,zukes non,zukes non vegan,zv2pt6trk9
text,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
@JetBlue Maybe I'll just go to Cleveland instead.,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
smh RT @JetBlue: Our fleet's on fleek. http://t.co/IRiXaIfJJX,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
@SouthwestAir I would.,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
@USAirways trying to Cancelled Flight a flight urgently...get hung up on twice??? Sweet refund policy,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
@AmericanAir you are beyond redemption. Jfk. Baggage claim looks like a luggage warehouse,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
cv = CountVectorizer(max_features=10)
cv.fit(train['text'])
cv.get_feature_names_out()

array(['and', 'flight', 'for', 'jetblue', 'on', 'southwestair', 'the',
       'to', 'united', 'you'], dtype=object)

Let's get rid of em stop words

In [17]:
cv = CountVectorizer(stop_words='english',max_features=10)
cv.fit(train["text"])
cv.get_feature_names_out()

array(['americanair', 'flight', 'http', 'jetblue', 'service',
       'southwestair', 'thank', 'thanks', 'united', 'usairways'],
      dtype=object)

Here we see `http`, which is noise on its own. We could scrape the linked website, but that requires manual processing. We can also use regex to extract or remove:
- URLs
- exclamation marks and other punctuation marks
- emoticons in the text, such as :) or :(
- capitalization ratio, etc.
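Two of the ideas in this list, the capitalization ratio and ASCII emoticons, are not implemented later in the notebook. A minimal sketch of how they could be computed follows; the function name, the emoticon pattern, and the sample tweet are all assumptions made for illustration.

```python
import re


def capitalization_ratio(text: str) -> float:
    """Share of alphabetic characters that are uppercase (0.0 if there are none)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)


# Hypothetical pattern for simple ASCII emoticons such as :) :( :-) ;D
emoticon_pattern = re.compile(r"[:;]-?[()DP]")

sample = "WORST flight ever :( http://t.co/abc"  # made-up tweet
print(capitalization_ratio(sample))
print(emoticon_pattern.findall(sample))
```

Either value could be added as an extra numeric or text column alongside the vectorized tweet.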


- The downside is that a token that is not in the training corpus is completely disregarded when it occurs in the test set.
- The upside is that we get interpretable features. Combined with a tree-based model, this gives an interpretable model, which can further guide feature selection and extraction.


What are we about to do?

1. Take in a pipeline that has both the feature engineering pipeline and the
model in it.
2. Run a cross-validated grid search on the pipeline as a whole, tuning parameters
for the model and the feature engineering algorithms at the same time. This is
run on the training set.
3. Pick the set of parameters that maximizes accuracy.
4. Print a classification report on the test set.
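Step 2 relies on scikit-learn's `<step name>__<parameter name>` convention for addressing parameters inside a pipeline. A minimal sketch of that convention, using a throwaway `demo_pipeline` (an illustrative name, not the pipeline tuned in this notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression()),
])

# Every tunable parameter is exposed as "<step name>__<parameter name>".
tunable = sorted(demo_pipeline.get_params().keys())
print([name for name in tunable if name.startswith("vectorizer__")][:5])

# set_params accepts the same keys; a grid search applies each candidate
# combination to a clone of the pipeline in the same way.
demo_pipeline.set_params(vectorizer__ngram_range=(1, 3), classifier__C=0.1)
print(demo_pipeline.named_steps["vectorizer"].ngram_range)
```

This is why the keys of a parameter grid must start with the step names chosen when the pipeline is built.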





Using the CountVectorizer’s features in our ML pipeline

In [18]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=3333)

ml_pipeline = Pipeline([
    ("vectorizer",CountVectorizer()),
    ("classifier",clf)
])

params = {
    "vectorizer__lowercase":[True,False],
    "vectorizer__stop_words":[None,"english"],
    "vectorizer__max_features":[100, 1000, 5000],
    "vectorizer__ngram_range":[(1,1),(1,3)],
    "classifier__C":[1e-1,1e0,1e1]
    }
print("Count Vectorizer + Logistic Regression ------------------------------------------------------------")
best_model, y_preds = advanced_grid_search(
    train["text"],train["sentiment"],test["text"],test["sentiment"],
    ml_pipeline, params
)



Count Vectorizer + Logistic Regression ------------------------------------------------------------
Training the model121.91 seconds
              precision    recall  f1-score   support

    negative       0.79      0.77      0.78       243
     neutral       0.75      0.78      0.77       260
    positive       0.85      0.83      0.84       269

    accuracy                           0.80       772
   macro avg       0.80      0.79      0.79       772
weighted avg       0.80      0.80      0.80       772

Best params : {'classifier__C': 1.0, 'vectorizer__lowercase': True, 'vectorizer__max_features': 5000, 'vectorizer__ngram_range': (1, 1), 'vectorizer__stop_words': None}
Overall took122.05 seconds


## TF-IDF vectorization

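$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$, where $\text{IDF}(t) = \ln\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1$, $n$ is the number of documents in the corpus, and $\text{df}(t)$ is the number of documents that contain the term $t$. This is the smoothed form that scikit-learn uses by default.

Essentially another bag-of-words vectorizer.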

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfid_vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)

tfid_vectorizer.fit(train["text"])
idf = pd.DataFrame({"feature_name":tfid_vectorizer.get_feature_names_out(), 'idf_weights':tfid_vectorizer.idf_})

In [20]:
idf.sort_values("idf_weights", ascending=True)

Unnamed: 0,feature_name,idf_weights
913,united,2.497463
347,flight,2.558630
821,southwestair,2.586713
495,jetblue,2.648723
66,americanair,2.695243
...,...,...
532,lies,7.649308
531,license,7.649308
388,girl,7.649308
485,ipad,7.649308


Let's use this in our ML pipeline

In [21]:
ml_pipeline = Pipeline([
    ('tfid_vectorizer',TfidfVectorizer()),
    ("classifier",clf)
])
params = {
    "tfid_vectorizer__lowercase":[True,False],
    "tfid_vectorizer__stop_words":[None,"english"],
    "tfid_vectorizer__max_features":[100, 1000, 5000],
    "tfid_vectorizer__ngram_range":[(1,1),(1,3)],
    "classifier__C":[1e-1,1e0,1e1]
    }

In [22]:
print("TF-IDF Vectorizer + Log Reg\n=====================")
advanced_grid_search(train['text'], train['sentiment'], test['text'], test['sentiment'],ml_pipeline, params)

TF-IDF Vectorizer + Log Reg
Training the model100.10 seconds
              precision    recall  f1-score   support

    negative       0.80      0.84      0.82       243
     neutral       0.83      0.80      0.81       260
    positive       0.88      0.87      0.88       269

    accuracy                           0.84       772
   macro avg       0.84      0.84      0.84       772
weighted avg       0.84      0.84      0.84       772

Best params : {'classifier__C': 1.0, 'tfid_vectorizer__lowercase': True, 'tfid_vectorizer__max_features': 5000, 'tfid_vectorizer__ngram_range': (1, 3), 'tfid_vectorizer__stop_words': None}
Overall took100.16 seconds


(Pipeline(steps=[('tfid_vectorizer',
                  TfidfVectorizer(max_features=5000, ngram_range=(1, 3))),
                 ('classifier', LogisticRegression(max_iter=3333))]),
 array(['negative', 'neutral', 'positive', 'neutral', 'neutral',
        'positive', 'neutral', 'negative', 'positive', 'negative',
        'neutral', 'negative', 'negative', 'positive', 'neutral',
        'negative', 'negative', 'negative', 'positive', 'negative',
        'positive', 'positive', 'positive', 'negative', 'neutral',
        'negative', 'neutral', 'neutral', 'positive', 'neutral',
        'negative', 'neutral', 'neutral', 'positive', 'positive',
        'negative', 'neutral', 'positive', 'neutral', 'positive',
        'positive', 'neutral', 'neutral', 'negative', 'neutral',
        'negative', 'negative', 'positive', 'negative', 'positive',
        'negative', 'negative', 'negative', 'positive', 'positive',
        'positive', 'positive', 'neutral', 'positive', 'positive',
        'positive', 

It looks like normalizing token counts to emphasize distinctive tokens
helps our model understand sentiment a bit better: test accuracy rises from 0.80 with count vectors to 0.84 with TF-IDF.

# Feature improvement

Setting `max_features` is almost enough, but we want to extract more features too.
- Remove hashtags, but record the hashtag values first, then try another text vectorizer (TF-IDF, count, hashing) on the new column
- Remove URLs, but see what can be extracted from them, such as the website name
- Remove the @ sign, but store the mentions in another column and run a count vectorizer on it
- Remove numbers, since we can't use LLMs or similar tools here to interpret their context
- Remove emojis, but record them in a separate column

## Cleaning noise from text

Let's clean out all the noise but extract the relevant information from it first

### Rough Work

In [23]:
import re

In [24]:
url_pattern = re.compile(r'https?://(?:www\.)?([^/\s]+)\S*')  # 's?' so https links match too

In [25]:
train["urls"] = train["text"].apply(lambda x: " ".join(url_pattern.findall(x)))
train.loc[train["urls"].apply(lambda x: len(x)>0)]["urls"].value_counts()

Unnamed: 0_level_0,count
urls,Unnamed: 1_level_1
t.co,273
t.co t.co,6
t.co t.co t.co,1


In [26]:
url_pattern = re.compile(r'https?://(?:www\.)?([^/\s]+)\S*')
hashtag_pattern = re.compile(r'#(\w+)')
train["hashtags"] = train["text"].apply(lambda x: " ".join(hashtag_pattern.findall(x)))
train.loc[train["hashtags"].apply(lambda x: len(x.split())>1)]["hashtags"].value_counts()  # more than one hashtag
train.loc[train["hashtags"].apply(lambda x: len(x)>0)]["hashtags"].value_counts()  # at least one hashtag

Unnamed: 0_level_0,count
hashtags,Unnamed: 1_level_1
DestinationDragons,26
customerservice,6
UnitedAirlines,6
fail,4
disappointed,4
...,...
thanks,1
bna,1
1786,1
SWfan,1


In [27]:
url_pattern = re.compile(r'https?://(?:www\.)?([^/\s]+)\S*')
hashtag_pattern = re.compile(r'#(\w+)')
mention_pattern = re.compile(r'@(\w+)')
train["mention"] = train["text"].apply(lambda x: " ".join(mention_pattern.findall(x)))
train.loc[train["mention"].apply(lambda x: len(x)>0)]["mention"].value_counts()
train.loc[train["mention"].apply(lambda x: len(x.split())>1)]["mention"].value_counts()

Unnamed: 0_level_0,count
mention,Unnamed: 1_level_1
USAirways AmericanAir,23
SouthwestAir FortuneMagazine,8
SouthwestAir Imaginedragons,7
AmericanAir dfwairport,4
JetBlue WSJ,4
...,...
VirginAmerica ladygaga carrieunderwood ladygaga carrieunderwood,1
united HeathrowAirport,1
SouthwestAir poisonpill76,1
AmericanAir cityandsand,1


In [33]:
url_pattern = re.compile(r'https?://(?:www\.)?([^/\s]+)\S*')
hashtag_pattern = re.compile(r'#(\w+)')
mention_pattern = re.compile(r'@(\w+)')
emoji_pattern = re.compile(r'[^\x00-\x7F]+')  # any run of non-ASCII characters, not only emojis
train["emojis"] = train["text"].apply(lambda x: " ".join(emoji_pattern.findall(x)))
#train.loc[train["emojis"].apply(lambda x: len(x)>0)]["emojis"].value_counts()
#train.loc[train["emojis"].apply(lambda x: len(x.split())>1)]["emojis"].value_counts()


In [29]:
number_pattern = re.compile(r'\d+')
train["numbers"] = train["text"].apply(lambda x: " ".join(number_pattern.findall(x)))
train.loc[train["numbers"].apply(lambda x: len(x)>0)]["numbers"].value_counts()
train.loc[train["numbers"].apply(lambda x: len(x.split())>1)][["numbers","text"]]#.value_counts()

Unnamed: 0,numbers,text
1030,10 9 17 20 15 7 8 7,"@united The guidelines say 10x9x17, my bag is ..."
1933,5 1 15,“@JetBlue: Our fleet's on fleek. http://t.co/X...
2648,2 4 2,@USAirways is okay for u 2 Cancelled Flight ch...
1097,9 180,@SouthwestAir has the smoooothest flight atten...
3800,3231 4 45,@AmericanAir 3231DTW to LAG at 4:45. Flight Ca...
...,...,...
2255,0 0 8,@JetBlue and The from @WSJ Team to Offer In-#F...
2441,2 3,@JetBlue our #FoodAllergy community. IF you wa...
2259,6 8,@JetBlue flight booked! Heading out to Califor...
680,10 1 2,@united - sitting in seat 10D on a flight back...


### Complete Pipeline

In [30]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

class TextFeatureExtractor(BaseEstimator, TransformerMixin):

  def __init__(self):
    self.number_pattern = re.compile(r'\d+')
    self.url_pattern = re.compile(r'https?://(?:www\.)?([^/\s]+)\S*')
    self.hashtag_pattern = re.compile(r'#(\w+)')
    self.mention_pattern = re.compile(r'@(\w+)')
    self.emoji_pattern = re.compile(r'[^\x00-\x7F]+')


  def fit(self, X, y=None):
    # Stateless transformer: nothing is learned from the data.
    return self

  def transform(self, X):
    s = pd.Series(X)
    features = pd.DataFrame()

    features["hashtags"] = s.apply(lambda x:" ".join(self.hashtag_pattern.findall(x)))
    features["urls"] = s.apply(lambda x:" ".join(self.url_pattern.findall(x)))
    features["mentions"] = s.apply(lambda x:" ".join(self.mention_pattern.findall(x)))
    features["excl_count"] = s.apply(lambda x:x.count("!"))
    features["ques_count"] = s.apply(lambda x:x.count("?"))
    features["comma_fullstop_count"] = s.apply(lambda x:x.count(".") + x.count(","))
    features["emojis"] = s.apply(lambda x:" ".join(self.emoji_pattern.findall(x)))

    cleaned_text = s.str.replace(self.url_pattern, '', regex=True) \
                      .str.replace(self.hashtag_pattern, '', regex=True) \
                      .str.replace(self.mention_pattern, '', regex=True) \
                      .str.replace(self.number_pattern, '', regex=True) \
                      .str.replace(self.emoji_pattern, '', regex=True) \
                      .str.strip()
    features["cleaned_text"] = cleaned_text
    return features


def get_col(df, col_name):
    # Helper that returns a single column; not used by the pipeline below.
    return df[col_name]

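Some extracted columns hold several space-separated values per tweet (for example 2 emojis, 2 hashtags, or 2 mentions), so each of those columns needs its own vectorizer to split the values into tokens.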

In [31]:
extractor = TextFeatureExtractor()
processor = ColumnTransformer([
    ('text_tfidf', TfidfVectorizer(max_features=1000), 'cleaned_text'),
    ('hash_vec', CountVectorizer(), 'hashtags'),
    ('mention_vec', CountVectorizer(), 'mentions'),
    ('url_vec', CountVectorizer(), 'urls'),
    ('emoji_vec', CountVectorizer(token_pattern=r"\S"), 'emojis'),
    ('excl_scalar', 'passthrough', ['excl_count']),
    ('ques_scalar', 'passthrough', ['ques_count']),
    ('comma_fullstop_scalar', 'passthrough', ['comma_fullstop_count'])
])
pipeline = Pipeline([
    ('extract', extractor),
    ('vectorize', processor),
    ('dim_reduction', TruncatedSVD(n_components=1000)),
    ("classifier",clf)
])

In [32]:
clf_model = pipeline.fit(train['text'],train['sentiment'])
y_preds= clf_model.predict(test['text'])
print(classification_report(y_true=test['sentiment'], y_pred=y_preds))

              precision    recall  f1-score   support

    negative       0.78      0.77      0.78       243
     neutral       0.79      0.80      0.79       260
    positive       0.86      0.86      0.86       269

    accuracy                           0.81       772
   macro avg       0.81      0.81      0.81       772
weighted avg       0.81      0.81      0.81       772



Not very good, but not bad either.<br>
Almost the same metrics as the author's, but at least 10 times faster to train.

# Feature extraction

Just use truncated SVD, as the pipeline above already does.
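For reference, here is a minimal sketch of what truncated SVD does to a sparse text matrix. The five-document corpus and `n_components=2` are assumptions chosen to keep the example tiny; the pipeline above uses 1000 components on the real features.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [  # hypothetical mini-corpus
    "flight delayed again",
    "flight cancelled no refund",
    "great crew great flight",
    "thanks for the upgrade",
    "lost my bag again",
]

sparse_features = TfidfVectorizer().fit_transform(docs)

# TruncatedSVD works directly on sparse input because it does not center
# the data, which is why it suits vectorizer output better than PCA.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(sparse_features)

print(sparse_features.shape, "->", reduced.shape)
print(svd.explained_variance_ratio_.sum())
```

The output is a dense matrix with one column per component, so `n_components` directly controls how wide the classifier's input is.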

# Summary
This notebook performs **feature engineering for sentiment classification** on a cleaned version of the well-known **Twitter US Airline Sentiment** dataset (`cleaned_airline_tweets.csv`). The task is multiclass classification: predict tweet sentiment (`positive`, `neutral`, `negative`) from the tweet `text`.  

The workflow starts with basic text vectorization baselines (**CountVectorizer** + **TfidfVectorizer** + Logistic Regression), then moves to a more sophisticated **custom feature extraction pipeline** that:

- extracts domain-relevant signals (hashtags, mentions, URLs, emojis, punctuation counts)
- cleans the main text by removing those elements
- applies different vectorizers to each extracted part
- reduces dimensionality with **TruncatedSVD**
- feeds everything into a classifier

The focus is on showing how thoughtful **pre-processing + feature construction** (beyond plain bag-of-words / TF-IDF) can enrich the signal for social media sentiment tasks.

### All Steps Involved + Why They Were Done

1. **Data Loading & Quick EDA**  
   - Read CSV, check columns, unique sentiments, class distribution, unique tweets.  
   - Stratified train/test split (80/20) on sentiment label.  
   **Why**: Check class balance (negative tweets often dominate airline data, but here the three classes are roughly balanced at about 35/34/31%), confirm text is the main input, ensure the test set mirrors the train distribution.

2. **Simple Text Vectorization Baselines**  
   - `CountVectorizer` experiments: single words, n-grams (1–3), stop words, max_features limit.  
   - `TfidfVectorizer` with similar hyperparameters.  
   - Wrapped in `Pipeline` + `LogisticRegression` → grid search over: lowercase, stop_words, max_features, ngram_range, C.  
   **Why**: Establish strong baseline using standard sparse text representations. Compare frequency counting vs. term importance weighting (TF-IDF usually wins slightly on sentiment tasks).

3. **Feature Construction – Noise Extraction & Cleaning**  
   - Regex patterns to extract:  
     - URLs (domain part)  
     - Hashtags (#word)  
     - Mentions (@user)  
     - Emojis (non-ASCII)  
     - Numbers  
   - Count punctuation: ! ? . ,  
   - Clean main text by removing all extracted elements.  
   **Why**: Social media contains strong sentiment signals in structure (lots of !, emojis, CAPS, mentions, hashtags). Removing them from main text prevents noise dilution, while keeping them as separate features preserves signal.

4. **Custom Transformer** (`TextFeatureExtractor`)  
   - Fits nothing (stateless), transforms Series of tweets → DataFrame with:  
     `cleaned_text`, `hashtags`, `urls`, `mentions`, `emojis`, `excl_count`, `ques_count`, `comma_fullstop_count`.  
   **Why**: Encapsulate regex-based feature engineering in scikit-learn-compatible object → reusable & pipeline-friendly.

5. **Advanced Feature Processing Pipeline**  
   - `ColumnTransformer` applies different vectorizers to different columns:  
     - `TfidfVectorizer(max_features=1000)` on `cleaned_text`  
     - `CountVectorizer` on `hashtags`, `mentions`, `urls`  
     - `CountVectorizer(token_pattern=r"\S")` on `emojis` (treats each emoji as token)  
     - `passthrough` on scalar counts (!, ?, punctuation)  
   - `TruncatedSVD(n_components=1000)` for dimensionality reduction.  
   - Final classifier (LogisticRegression in example).  
   **Why**: Heterogeneous feature types need specialized handling. SVD reduces curse of dimensionality after wide concatenation; helps especially with sparse multi-source vectors.

6. **Model Training & Evaluation**  
   - Fit pipeline end-to-end on train text → predict on test text.  
   - Print classification report.  
   **Why**: Show realistic end-to-end performance of rich feature set vs. plain TF-IDF baseline. (Notebook notes similar metrics but much faster training.)

7. **Comment on Feature Extraction**  
   - Brief mention of TruncatedSVD as simple & effective way to compress high-dimensional text features.  
   **Why**: Reminder that dimensionality reduction is often necessary after aggressive feature concatenation.

### Packages / Modules / Techniques Used

| Package / Module                          | Class / Function / Technique                              | Why It Was Used |
|-------------------------------------------|------------------------------------------------------------|-----------------|
| **pandas**                                | `read_csv`, `value_counts`, `apply`, `str.replace`, regex `.findall` | Data loading, grouping, custom regex-based feature extraction & cleaning |
| **numpy**                                 | (implicit)                                                 | Array operations (not heavily used here) |
| **matplotlib.pyplot**                     | (imported but not visibly used in code)                    | Potential for later plots (EDA) |
| **time**                                  | `time.time()`                                              | Measure grid-search training duration |
| **sklearn.model_selection**               | `train_test_split` (stratified), `GridSearchCV`            | Balanced split; hyperparameter tuning of vectorizer + model jointly |
| **sklearn.linear_model**                  | `LogisticRegression`                                       | Fast, interpretable linear baseline classifier for text |
| **sklearn.ensemble**                      | `ExtraTreesClassifier`, `RandomForestClassifier` (imported but not used in final pipeline) | Potential tree-based alternatives |
| **sklearn.pipeline**                      | `Pipeline`                                                 | Chain feature extraction → vectorization → reduction → classifier |
| **sklearn.compose**                       | `ColumnTransformer`                                        | Apply different transformers to different output columns of custom extractor |
| **sklearn.feature_extraction.text**       | `CountVectorizer`, `TfidfVectorizer`                       | Core sparse text vectorization (bag-of-words vs. tf-idf) |
| **sklearn.preprocessing**                 | `FunctionTransformer` (imported but not heavily used)      | Potential wrapper for custom functions |
| **sklearn.decomposition**                 | `TruncatedSVD`                                             | Linear dimensionality reduction on concatenated sparse matrix |
| **sklearn.metrics**                       | `classification_report`                                    | Detailed per-class precision/recall/F1 evaluation |
| **sklearn.base**                          | `BaseEstimator`, `TransformerMixin`                        | Create custom scikit-learn compatible `TextFeatureExtractor` |
| **re**                                    | `compile`, `findall`                                       | Regex-based extraction of URLs, hashtags, mentions, emojis, numbers |
| **ydata-profiling**                       | `ProfileReport` (commented out)                            | Automated EDA report (disabled for GitHub compatibility) |

**Key Technique Highlights**  
- **Custom `TextFeatureExtractor`**: the most important contribution, a regex-driven multi-aspect feature construction tailored to Twitter/social media noise.  
- **ColumnTransformer on heterogeneous text-derived columns**: a clean way to vectorize main cleaned text differently from metadata-like fields (hashtags, mentions, emojis, punctuation counts).  
- Joint grid search over vectorizer hyperparameters + model regularisation: a realistic way to tune the full pipeline.  
- Emphasis on cleaning while **preserving** structural sentiment cues rather than just removing everything.

This notebook provides a practical demonstration of **social-media-specific text feature engineering** beyond vanilla Count/TF-IDF, showing how domain knowledge (Twitter syntax) can be turned into engineered columns that enrich a classical ML pipeline.