# Modelling Overview 

## 1. Data Loading and Inspection

In [41]:
import pandas as pd
tweets_cleaned_df = pd.read_csv('data/cleaned_apple_tweets.csv')
tweets_cleaned_df.head()

Unnamed: 0,tweet,product,tokens,processed_tweet,sentiment
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,"['g', 'iphon', 'hr', 'tweet', 'dead', 'need', ...",g iphon hr tweet dead need upgrad plugin station,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,"['know', 'awesom', 'ipadiphon', 'app', 'youll'...",know awesom ipadiphon app youll like appreci d...,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,"['wait', 'also', 'sale']",wait also sale,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,"['hope', 'year', 'festiv', 'isnt', 'crashi', '...",hope year festiv isnt crashi year iphon app,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,"['great', 'stuff', 'fri', 'marissa', 'mayer', ...",great stuff fri marissa mayer googl tim oreill...,Positive emotion


In [42]:
tweets_cleaned_df['sentiment'].value_counts()

No emotion toward brand or product    5375
Positive emotion                      2970
Negative emotion                       569
I can't tell                           156
Name: sentiment, dtype: int64

## Data Preprocessing 

In [43]:
tweets_cleaned_df = tweets_cleaned_df.dropna(subset=['processed_tweet', 'sentiment'])  # Removes rows where text or label is missing

# Train-Test Split

In [44]:
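from sklearn.model_selection import train_test_split

X = tweets_cleaned_df['processed_tweet']
y = tweets_cleaned_df['sentiment']

# Hold out 20% of the tweets for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)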

# Building the pipeline 

### Import the necessary libraries so that all tools are available. We want a **pipeline structure** that keeps the code streamlined and modular.

In [45]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [46]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('classifier', LogisticRegression()),
])
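As a minimal sketch of why the named steps matter: each step's name becomes a prefix for its parameters, in the form `<step>__<parameter>`. The variable name `demo_pipeline` below is only illustrative and is not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Illustrative copy of the pipeline defined above (hypothetical name).
demo_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('classifier', LogisticRegression()),
])

# Step names, in the order the data flows through them.
step_names = [name for name, _ in demo_pipeline.steps]
print(step_names)

# Parameters of a step are addressed as "<step name>__<parameter name>".
demo_pipeline.set_params(classifier__C=10, tfidf__ngram_range=(1, 2))
print(demo_pipeline.named_steps['classifier'].C)
print(demo_pipeline.named_steps['tfidf'].ngram_range)
```

The same `step__parameter` naming is what a parameter grid uses to reach inside the pipeline.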

In [47]:
pipeline.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Pipeline(steps=[('tfidf', TfidfVectorizer(stop_words='english')),
                ('classifier', LogisticRegression())])
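The warning above recommends raising `max_iter`. The sketch below shows one way to do that and to check whether a `ConvergenceWarning` is still raised. The toy texts, labels, the value `max_iter=1000`, and the name `pipeline_more_iter` are all assumptions made for illustration; they are not the notebook's data or settings.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data, only here to make the sketch runnable.
toy_texts = ["love ipad", "iphon batteri dead", "great app", "app crash"]
toy_labels = ["Positive emotion", "Negative emotion",
              "Positive emotion", "Negative emotion"]

# Same structure as the pipeline above, with a larger iteration budget.
pipeline_more_iter = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Record any warnings raised while fitting.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    pipeline_more_iter.fit(toy_texts, toy_labels)

convergence_warnings = [w for w in caught
                        if issubclass(w.category, ConvergenceWarning)]
print("ConvergenceWarning raised:", bool(convergence_warnings))
```

Whether 1000 iterations is enough on the full tweet data would need to be checked by re-running the fit.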

# Predict and Evaluate Accuracy

In [48]:
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

Model Accuracy: 68.58%
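To put this accuracy in context, it can help to compare it with the share of the most frequent class, which is what a model that always predicts that class would score. The sketch below uses the class counts printed by `value_counts()` earlier, assuming they still hold after `dropna`; because the split is not stratified, the share in the test set may differ slightly.

```python
# Class counts copied from the value_counts() output earlier in this notebook.
class_counts = {
    "No emotion toward brand or product": 5375,
    "Positive emotion": 2970,
    "Negative emotion": 569,
    "I can't tell": 156,
}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)
majority_share = class_counts[majority_label] / total

print(f"Total tweets: {total}")
print(f"Most frequent class: {majority_label}")
print(f"Majority-class share: {majority_share * 100:.2f}%")
```

On these counts the majority class makes up roughly 59% of the tweets, so the accuracy above is an improvement over always predicting that class, but not a large one.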


# Hyperparameter Tuning with GridSearchCV 
### We need to find better parameters for our model in order to improve its performance. With a baseline accuracy of about **68%**, grid search will help uncover the best combination of `vectorization` and `logistic regression` parameters among those we try.
### The goal is more accurate predictions, pushing our `accuracy` higher.

## Parameters to tune:
1. `max_df`: ignores terms that appear in too many documents.
2. `ngram_range`: unigrams only vs. unigrams plus bigrams.
3. `C`: inverse of the regularization strength of logistic regression (smaller values mean stronger regularization).

### This explores combinations of text-feature granularity and model flexibility.
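A small sketch of how the two vectorizer settings change the vocabulary. The three toy documents and the helper `vocabulary` are hypothetical and exist only for illustration; the notebook's own data is not used here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; "app" appears in every document.
toy_docs = ["great ipad app", "bad iphone app", "app crashed again"]


def vocabulary(**kwargs):
    """Fit a TfidfVectorizer on the toy corpus and return its sorted terms."""
    vec = TfidfVectorizer(**kwargs)
    vec.fit(toy_docs)
    return sorted(vec.vocabulary_)


# Unigrams only: single words.
unigrams = vocabulary(ngram_range=(1, 1))

# Unigrams plus bigrams: single words and adjacent word pairs.
uni_and_bigrams = vocabulary(ngram_range=(1, 2))

# max_df=0.8 drops terms found in more than 80% of documents ("app" here).
pruned = vocabulary(ngram_range=(1, 1), max_df=0.8)

print("unigrams:         ", unigrams)
print("unigrams+bigrams: ", uni_and_bigrams)
print("max_df=0.8:       ", pruned)
```

Bigrams enlarge the feature space, which is one reason `C` is tuned alongside them: more features generally leave more room for overfitting.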

In [49]:
from sklearn.model_selection import GridSearchCV    

param_grid = {
    'tfidf__max_df': [0.8, 0.9, 1.0],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'classifier__C': [0.1, 1, 10],
}

## Setting up Grid Search 
### `cv=5` applies 5-fold **cross-validation**, `verbose=1` shows progress, and `n_jobs=-1` uses all available CPU cores to speed it up.

In [50]:
grid = GridSearchCV(pipeline, param_grid, cv=5, verbose=1, n_jobs=-1)
grid.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    7.3s
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:   42.3s finished
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tfidf',
                                        TfidfVectorizer(stop_words='english')),
                                       ('classifier', LogisticRegression())]),
             n_jobs=-1,
             param_grid={'classifier__C': [0.1, 1, 10],
                         'tfidf__max_df': [0.8, 0.9, 1.0],
                         'tfidf__ngram_range': [(1, 1), (1, 2)]},
             verbose=1)
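The log above reports 18 candidates and 90 fits. A minimal sketch of where those numbers come from, enumerating the same grid with plain Python (the names `cv_folds`, `candidates` and `n_fits` are illustrative):

```python
from itertools import product

# Same grid as in the notebook.
param_grid = {
    'tfidf__max_df': [0.8, 0.9, 1.0],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'classifier__C': [0.1, 1, 10],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate.
names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

# Each candidate is fitted once per cross-validation fold.
n_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", n_fits, "fits")
print("First candidate:", candidates[0])
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate once more on the full training set after the search.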

## Evaluate the best model 

In [52]:
from sklearn.metrics import classification_report, accuracy_score

print("Best Parameters:", grid.best_params_)

# Predict with the best estimator found by the grid search
y_pred = grid.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Best Parameters: {'classifier__C': 1, 'tfidf__max_df': 0.8, 'tfidf__ngram_range': (1, 2)}
Accuracy: 0.6984564498346196
Classification Report:
                                     precision    recall  f1-score   support

                      I can't tell       0.00      0.00      0.00        27
                  Negative emotion       0.78      0.06      0.11       124
No emotion toward brand or product       0.71      0.89      0.79      1091
                  Positive emotion       0.65      0.51      0.57       572

                          accuracy                           0.70      1814
                         macro avg       0.54      0.36      0.37      1814
                      weighted avg       0.69      0.70      0.66      1814



  _warn_prf(average, modifier, msg_start, len(result))


# Computing the Confusion Matrix

### Computing a confusion matrix gives a breakdown of how well our model classifies each **sentiment class**.

In [51]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Predict labels
y_pred = grid.predict(X_test)

# Create the confusion matrix (rows: true labels, columns: predicted labels)
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Confusion Matrix:
 [[  0   0  19   8]
 [  0   7  90  27]
 [  0   2 969 120]
 [  0   0 281 291]]
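To read the matrix, note that rows are true classes and columns are predicted classes, both in sorted label order. The sketch below recomputes per-class precision and recall and the overall accuracy directly from the printed matrix; the label order is assumed to match the classification report above.

```python
# Label order assumed to match the classification report (sorted labels).
labels = [
    "I can't tell",
    "Negative emotion",
    "No emotion toward brand or product",
    "Positive emotion",
]

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = [
    [0, 0, 19, 8],
    [0, 7, 90, 27],
    [0, 2, 969, 120],
    [0, 0, 281, 291],
]

n = len(labels)
row_totals = [sum(row) for row in cm]                                # true counts
col_totals = [sum(cm[i][j] for i in range(n)) for j in range(n)]    # predicted counts

# Recall: correct predictions divided by the number of true examples.
recall = [cm[i][i] / row_totals[i] if row_totals[i] else 0.0 for i in range(n)]
# Precision: correct predictions divided by the number of predictions made.
precision = [cm[i][i] / col_totals[i] if col_totals[i] else 0.0 for i in range(n)]
accuracy = sum(cm[i][i] for i in range(n)) / sum(row_totals)

for label, p, r in zip(labels, precision, recall):
    print(f"{label:<36} precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The first column is all zeros, so the model never predicts `I can't tell`, and most `Negative emotion` tweets (90 of 124) are predicted as `No emotion toward brand or product`.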
