
# Lab: Data-Centric vs Model-Centric approaches

This lab gives an introduction to data-centric vs model-centric approaches to machine learning problems, showing how data-centric approaches can outperform purely model-centric approaches.

In this lab, we'll build a classifier for product reviews (restricted to the magazine category), like:

> Excellent! I look forward to every issue. I had no idea just how much I didn't know.  The letters from the subscribers are educational, too.

Label: ⭐️⭐️⭐️⭐️⭐️ (good)

> My son waited and waited, it took the 6 weeks to get delivered that they said it would but when it got here he was so dissapointed, it only took him a few minutes to read it.

Label: ⭐️ (bad)

We'll work with a dataset that has some issues, and we'll see how we can squeeze only so much performance out of the model by being clever about model choice, searching for better hyperparameters, etc. Then, we'll take a look at the data (as any good data scientist should), develop an understanding of the issues, and use simple approaches to improve the data. Finally, we'll see how improving the data can improve results.

## Installing software

For this lab, you'll need to install [scikit-learn](https://scikit-learn.org/) and [pandas](https://pandas.pydata.org/). If you don't have them installed already, you can install them by running the following cell:

In [None]:
!pip install scikit-learn pandas



# Loading the data

First, let's load the train/test sets and take a look at the data.

In [1]:
import pandas as pd

In [2]:
train = pd.read_csv('reviews_train.csv')
test = pd.read_csv('reviews_test.csv')

test.sample(5)

Unnamed: 0,review,label
289,Love vogue to the brim,good
69,One of the best magazines I've ever bought. Fu...,good
629,Don't waste your time unless you have a kindle...,bad
350,THIS IS A MAGAZINE THAT HAS BEEN AROUND SINCE ...,good
574,magazine has too many adds not enough mens health,bad


# Training a baseline model

There are many ways to train a classifier for text data. In this lab, we're giving you code that mirrors what you'd find if you looked up [how to train a text classifier](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html): we'll train a linear SVM (scikit-learn's `SGDClassifier`, whose default hinge loss gives a linear SVM fit by stochastic gradient descent) on [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) features (numeric representations of each review based on word occurrences).

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

In [4]:
sgd_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

In [5]:
_ = sgd_clf.fit(train['review'], train['label'])
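
The `CountVectorizer` + `TfidfTransformer` pair used above can also be written as a single `TfidfVectorizer`. As a small sanity check, the sketch below compares the two on a few toy reviews (invented here for illustration, not taken from the lab dataset).

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import Pipeline

# Toy reviews, invented for illustration (not from the lab dataset).
docs = [
    "great magazine great articles",
    "too many ads",
    "great price but too many ads",
]

# Two-step featurizer: raw word counts, then tf-idf reweighting.
two_step = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])
X_two_step = two_step.fit_transform(docs)

# One-step featurizer with default settings.
X_one_step = TfidfVectorizer().fit_transform(docs)

# Both produce one row per review and one column per vocabulary word.
same = np.allclose(X_two_step.toarray(), X_one_step.toarray())
print(X_one_step.shape, same)
```

Either form works inside a `Pipeline`; the single-step version is just shorter.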

## Evaluating model accuracy

In [6]:
from sklearn import metrics

In [7]:
def evaluate(clf):
    pred = clf.predict(test['review'])
    acc = metrics.accuracy_score(test['label'], pred)
    print(f'Accuracy: {100*acc:.1f}%')

In [8]:
evaluate(sgd_clf)

Accuracy: 76.2%
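
A single accuracy number is easier to judge next to a trivial baseline and a breakdown by class. The sketch below shows one way to compute both with `sklearn.metrics`; the labels and predictions are hypothetical values made up for illustration, not results from the lab's model.

```python
from sklearn import metrics

# Hypothetical true labels and predictions, invented for illustration.
y_true = ['good', 'good', 'bad', 'good', 'bad', 'good']
y_pred = ['good', 'bad', 'bad', 'good', 'good', 'good']

acc = metrics.accuracy_score(y_true, y_pred)

# Baseline: always predict the most common class in y_true.
majority = max(set(y_true), key=y_true.count)
baseline_acc = metrics.accuracy_score(y_true, [majority] * len(y_true))

# Rows are true labels, columns are predicted labels, both in the order of `labels`.
cm = metrics.confusion_matrix(y_true, y_pred, labels=['bad', 'good'])

print(f'Accuracy: {100*acc:.1f}% (majority baseline: {100*baseline_acc:.1f}%)')
print(cm)
```

If a model barely beats the majority baseline, the accuracy figure says more about class balance than about the model.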


## Trying another model

76% accuracy is not great for this binary classification problem. Can you do better with a different model, or by tuning hyperparameters for the SVM trained with SGD?

# Exercise 1

Can you train a more accurate model on the dataset (without changing the dataset)? You might find this [scikit-learn classifier comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) handy, as well as the [documentation for supervised learning in scikit-learn](https://scikit-learn.org/stable/supervised_learning.html).

One idea for a model you could try is a [naive Bayes classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html).
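
As a rough sketch of the naive Bayes idea, the pipeline below swaps the classifier for `MultinomialNB`. The toy reviews and labels are invented so the block runs on its own; in the lab you would fit on `train['review']` and `train['label']` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration (not from the lab dataset).
toy_reviews = [
    "I love this magazine",
    "Great articles and great photos",
    "Too many ads",
    "Never arrived, waste of money",
]
toy_labels = ['good', 'good', 'bad', 'bad']

# Same shape as the baseline pipeline, with a different final step.
nb_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])
nb_clf.fit(toy_reviews, toy_labels)

nb_pred = nb_clf.predict(["love it, great articles"])
print(nb_pred)
```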

You could also try experimenting with different values of the model hyperparameters, perhaps tuning them via a [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).
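
One possible way to set up such a search is to tune the featurizer and the classifier together by passing a whole `Pipeline` to `GridSearchCV`; parameters are addressed as `<step name>__<parameter>`. The grid values and toy data below are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration (not from the lab dataset).
toy_reviews = [
    "I love this magazine",
    "Great articles and great photos",
    "Excellent read every month",
    "My favorite magazine ever",
    "Too many ads",
    "Never arrived, waste of money",
    "Boring and overpriced",
    "Nothing but ads, very disappointed",
]
toy_labels = ['good'] * 4 + ['bad'] * 4

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', SGDClassifier(random_state=0)),
])

# Illustrative grid: unigrams vs. unigrams+bigrams, and two regularization strengths.
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__alpha': [1e-4, 1e-3],
}

# cv=2 only because the toy set is tiny; a larger value suits the real training set.
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(toy_reviews, toy_labels)
print(search.best_params_)
```

Because the vectorizer sits inside the pipeline, it is refit on each cross-validation training fold, so the validation fold never influences the vocabulary.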

Or you can even try training multiple different models and [ensembling their predictions](https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier), a strategy often used to win prediction competitions like Kaggle.
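
A minimal sketch of the ensembling idea, assuming hard (majority) voting over three linear-style text classifiers. The choice of models and the toy data are illustrative assumptions, not a recipe known to work best on this dataset.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration (not from the lab dataset).
toy_reviews = [
    "I love this magazine",
    "Great articles and great photos",
    "Excellent read every month",
    "My favorite magazine ever",
    "Too many ads",
    "Never arrived, waste of money",
    "Boring and overpriced",
    "Nothing but ads, very disappointed",
]
toy_labels = ['good'] * 4 + ['bad'] * 4

ensemble = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('vote', VotingClassifier(
        estimators=[
            ('nb', MultinomialNB()),
            ('logreg', LogisticRegression()),
            ('sgd', SGDClassifier(random_state=0)),
        ],
        voting='hard',  # each model casts one vote; the majority label wins
    )),
])
ensemble.fit(toy_reviews, toy_labels)

ensemble_pred = ensemble.predict(["great articles, love it", "never arrived"])
print(ensemble_pred)
```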

**Advanced:** If you want to be more ambitious, you could try an even fancier model, like training a Transformer neural network. If you go with that, you'll want to fine-tune a pre-trained model. This [guide from HuggingFace](https://huggingface.co/docs/transformers/training) may be helpful.

In [9]:
# YOUR CODE HERE
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

dataTrain = pd.read_csv('reviews_train.csv')
dataTest = pd.read_csv('reviews_test.csv')

# Use 1-D Series for the labels; passing a one-column DataFrame as y
# makes scikit-learn emit a DataConversionWarning on every fit.
x_train, y_train = dataTrain['review'], dataTrain['label']
x_test, y_test = dataTest['review'], dataTest['label']

# Learn the tf-idf vocabulary on the training reviews only, then reuse it for the test set
vectorizer = TfidfVectorizer()
x_train_vec = vectorizer.fit_transform(x_train)
x_test_vec = vectorizer.transform(x_test)

# Grid-search a random forest with 5-fold cross-validation
rf = RandomForestClassifier()
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
}

grid_search = GridSearchCV(rf, param_grid, cv=5)
grid_search.fit(x_train_vec, y_train)

# evaluate your model and see if it does better
# than the ones we provided
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(x_test_vec)

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {100*accuracy:.1f}%')

Accuracy: 84.2%


## Taking a closer look at the training data

Let's actually take a look at some of the training data:

In [10]:
dataTrain.tail()

Unnamed: 0,review,label
6661,Wouldn't open on my tablet - can't say much mo...,bad
6662,"I ordered this item in July, it is not Decembe...",bad
6663,</li>My favorite magazine ever!,bad
6664,I love this magazine and I am very excited to ...,good
6665,Great magazine. Good reading and awesome reviews.,good


Zooming in on one particular data point:

In [11]:
print(dataTrain.iloc[0].to_dict())

{'review': "Based on all the negative comments about Taste of Home, I will not subscribeto the magazine. In the past it was a great read.\nSorry it, too, has gone the 'way of the wind'.<br>o-p28pass4 </br>", 'label': 'good'}


This data point is labeled "good", but it's clearly a negative review. Also, it looks like there's some funny HTML stuff at the end.

# Exercise 2

Take a look at some more examples in the dataset. Do you notice any patterns with bad data points?
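
One way to look for patterns is to flag rows with a simple rule and cross-tabulate the flag against the label. The sketch below assumes the rule "contains something that looks like an HTML tag"; the small frame is a stand-in for `train`, mixing one row seen above with made-up rows.

```python
import pandas as pd

# Stand-in for the training set; apart from the '</li>' row shown earlier,
# these rows are invented for illustration.
toy = pd.DataFrame({
    'review': [
        'Great magazine. Good reading and awesome reviews.',
        '</li>My favorite magazine ever!',
        'Never arrived.<br>',
        'Too many ads, not enough content.',
    ],
    'label': ['good', 'bad', 'good', 'bad'],
})

# Assumed heuristic: a '<', then one or more non-'>' characters, then a '>'.
toy['has_tag'] = toy['review'].str.contains(r'<[^>]+>', regex=True)

# Count flagged vs. unflagged rows for each label.
counts = pd.crosstab(toy['has_tag'], toy['label'])
print(counts)

# Read the flagged reviews in full to judge whether their labels make sense.
print(toy.loc[toy['has_tag'], 'review'].tolist())
```

On the real data, `pd.set_option('display.max_colwidth', None)` helps when reviews are truncated in the output.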

In [12]:
# YOUR CODE HERE
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    # Lowercase the text
    text = text.lower()
    # Replace every non-word character (punctuation, symbols, HTML angle brackets) with a space
    text = re.sub(r'\W', ' ', text)
    # Drop stop words and stem the remaining words
    text = ' '.join(stemmer.stem(word) for word in text.split() if word not in stop_words)
    return text

# Apply preprocessing to the review column
dataTrain['review'] = dataTrain['review'].apply(preprocess_text)
dataTest['review'] = dataTest['review'].apply(preprocess_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [13]:

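# Note: y_pred still holds the Exercise 1 predictions, made before the
# preprocessing above, so this re-prints the earlier accuracy. Refit the
# vectorizer and model on the preprocessed text to measure its effect.
accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {100*accuracy:.1f}%')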

Accuracy: 84.2%


## Issues in the data

It looks like there are some funny HTML tags in our dataset, and those data points have nonsense labels. Maybe this dataset was collected by scraping the internet, and the HTML wasn't parsed correctly in all cases.

# Exercise 3

To address this, a simple approach we might try is to throw out the bad data points, and train our model on only the "clean" data.

Come up with a simple heuristic to identify data points containing HTML, and filter out the bad data points to create a cleaned training set.

In [14]:
import re
import pandas as pd

dataTrain = pd.read_csv('reviews_train.csv')


def is_bad_data(review: str) -> bool:
    # Heuristic: treat any review containing something that looks like an HTML tag as bad
    html_pattern = r'<.*?>'
    return bool(re.search(html_pattern, review))

# Keep only the reviews without HTML, and renumber the rows from 0
cleaned_data = dataTrain[~dataTrain['review'].apply(is_bad_data)].reset_index(drop=True)

print(f'Original dataset size: {dataTrain.shape[0]}')
print(f'Cleaned dataset size: {cleaned_data.shape[0]}')

Original dataset size: 6666
Cleaned dataset size: 4018


## Creating the cleaned training set

In [15]:
train_clean = dataTrain[~dataTrain['review'].map(is_bad_data)]

In [16]:
train_clean.to_csv('cleaned_reviews.csv', index=False)

## Evaluating a model trained on the clean training set

In [17]:
import pandas as pd
import re
from sklearn.linear_model import SGDClassifier
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer

In [18]:
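train_clean = pd.read_csv('cleaned_reviews.csv')
print(train_clean.isnull().sum())

# Cast to str in case any review was read back from the CSV as a non-string value
x = train_clean['review'].astype(str)
y = train_clean['label']

# Fit a fresh tf-idf vocabulary on the cleaned reviews only
vectorizer = TfidfVectorizer()
x_vec = vectorizer.fit_transform(x)

# Note: this rebinds sgd_clf, replacing the baseline pipeline with a bare classifier
sgd_clf = SGDClassifier()
sgd_clf_clean = clone(sgd_clf)
sgd_clf_clean.fit(x_vec, y)

print("Model fitted successfully.")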

review    0
label     0
dtype: int64
Model fitted successfully.


In [19]:
print(train_clean.isnull().sum())  # Check for any missing values


review    0
label     0
dtype: int64


In [20]:
print(f"x_vec shape: {x_vec.shape}, y shape: {y.shape}")


x_vec shape: (4018, 4807), y shape: (4018,)


In [21]:
print(train_clean.dtypes)

review    object
label     object
dtype: object


In [22]:
print(train_clean.head())  # Display the first few rows of the dataframe


                                              review label
0  I still have not received this.  Obviously I c...   bad
1  This magazine is basically ads. Kindve worthle...   bad
2  The only thing I've recieved, so far, is the b...   bad
3  This is one magazine I really love. It has pri...  good
4                                     Did not. Open.   bad


In [23]:
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer


x_vec = vectorizer.transform(train_clean['review'])

_ = sgd_clf_clean.fit(x_vec, train_clean['label'])

This model should do significantly better:

In [29]:
from sklearn.metrics import accuracy_score


def evaluate(model, vectorizer, data):
    """Print the accuracy of `model` on `data`, using the already-fitted `vectorizer`."""
    x_test_vec = vectorizer.transform(data['review'].astype(str))  # Vectorize test data
    y = data['label']
    y_pred = model.predict(x_test_vec)  # Predict labels
    accuracy = accuracy_score(y, y_pred)
    print(f'Accuracy: {accuracy * 100:.2f}%')


test_data = pd.read_csv('reviews_test.csv')

# Retrain a fresh classifier on the cleaned training set
sgd_clf_clean = SGDClassifier()
x_vec = vectorizer.transform(train_clean['review'])
_ = sgd_clf_clean.fit(x_vec, train_clean['label'])

# Evaluate the model
evaluate(sgd_clf_clean, vectorizer, test_data)

Accuracy: 97.30%
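
To compare the original and cleaned training sets on equal footing, it can help to keep the vectorizer and classifier in one `Pipeline` and fit a fresh copy on each set. The sketch below assumes a hypothetical helper `fit_and_score` and tiny invented frames (one row is the `</li>` example shown earlier); it illustrates the comparison setup only and does not reproduce the numbers above.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# One unfitted template; every experiment gets its own clone.
base_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', SGDClassifier(random_state=0)),
])


def fit_and_score(train_df, test_df):
    """Hypothetical helper: fit a fresh copy of base_clf and return test accuracy."""
    clf = clone(base_clf)
    clf.fit(train_df['review'].astype(str), train_df['label'])
    pred = clf.predict(test_df['review'].astype(str))
    return accuracy_score(test_df['label'], pred)


# Toy frames, invented for illustration; the last two rows mimic the HTML-tagged, mislabeled data.
toy_noisy = pd.DataFrame({
    'review': [
        'I love this magazine',
        'Great articles every month',
        'Too many ads',
        'Never arrived, waste of money',
        '</li>My favorite magazine ever!',
        'Boring and overpriced<br>',
    ],
    'label': ['good', 'good', 'bad', 'bad', 'bad', 'good'],
})
toy_clean = toy_noisy[~toy_noisy['review'].str.contains(r'<.*?>', regex=True)]
toy_test = pd.DataFrame({
    'review': ['Great magazine, love it', 'Nothing but ads'],
    'label': ['good', 'bad'],
})

acc_noisy = fit_and_score(toy_noisy, toy_test)
acc_clean = fit_and_score(toy_clean, toy_test)
print(f'noisy: {100*acc_noisy:.1f}%  clean: {100*acc_clean:.1f}%')
```

Holding the model fixed and changing only the training data makes it clear that any difference in accuracy comes from the data.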
