# Class-Imbalanced Dataset

- A class-imbalanced dataset has a large difference in the number of samples between classes. In our example, the Restaurants category has far more reviews than the Nightlife category.

- Class imbalance is problematic for modelling because the model will expend most of its effort fitting the larger class.

> Solution: Since we have plenty of data in both classes, a good way to resolve the imbalance is to downsample the larger class (Restaurants) to roughly the same size as the smaller class (Nightlife).
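
As a minimal sketch of the downsampling idea, the snippet below uses a small made-up DataFrame (the `toy` table and its column values are assumptions for illustration, not the Yelp data). It keeps the smaller class whole and draws the same number of rows from the larger class.

```python
import pandas as pd

# Hypothetical toy table: 8 rows of the larger class (False), 3 of the smaller (True).
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(11)],
    "target": [False] * 8 + [True] * 3,
})

# Size of the smaller class.
n_minority = toy["target"].value_counts().min()

# Draw n_minority rows from every class: the larger class is downsampled,
# the smaller class is kept in full.
balanced = toy.groupby("target").sample(n=n_minority, random_state=123)

print(balanced["target"].value_counts())
```

Sampling per label value, rather than per category filter, guarantees that both classes end up with the same number of rows.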

In [1]:
# Import Library

import pandas as pd
import json

In [2]:
# Load Yelp business data (one JSON object per line)

with open('../../data/yelp_dataset/yelp_academic_dataset_business.json', encoding='utf8') as biz_f:
    biz_df = pd.DataFrame([json.loads(line) for line in biz_f])

# Load Yelp reviews data - first 1,000,000 reviews
with open('../../data/yelp_dataset/yelp_academic_dataset_review.json', encoding='utf8') as f:
    js = [json.loads(f.readline()) for _ in range(1000000)]

review_df = pd.DataFrame(js)

In [3]:
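# Drop businesses with a missing value in any column, so that the
# str.contains filters on `categories` below return only True/False
biz_df.dropna(inplace=True)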

In [4]:
# Pull out only the Nightlife and Restaurants businesses

two_biz = biz_df[(biz_df.categories.str.contains('Restaurants')) | (biz_df.categories.str.contains('Nightlife'))]
two_biz.shape

(45704, 15)

In [5]:
# Join with the reviews to get all reviews for the two business categories

twobiz_reviews = two_biz.merge(review_df, on='business_id', how='inner')

In [6]:
# Trim away the features we won't use
# ('stars_y' is the star rating from review_df; the merge adds the _y suffix)
twobiz_reviews = twobiz_reviews[['business_id',
                                 'name',
                                 'stars_y',
                                 'text',
                                 'categories']]

In [7]:
# Create the target column -- True for Nightlife businesses, and False otherwise

twobiz_reviews['target'] = twobiz_reviews.categories.str.contains('Nightlife')

In [8]:
twobiz_reviews.target.value_counts()

False    421891
True     147243
Name: target, dtype: int64

In [9]:
# Create a class-balanced classification dataset

nightlife = twobiz_reviews[(twobiz_reviews.categories.str.contains('Nightlife'))]
nightlife.shape

(147243, 6)

In [10]:
resto = twobiz_reviews[(twobiz_reviews.categories.str.contains('Restaurants'))]
resto.shape

(548799, 6)

## Creating a Balanced Classification Dataset

---
1. Take a random sample of 10% of the nightlife reviews and 2.68% of the restaurant reviews (percentages chosen so that the two samples are roughly the same size).


2. Create a 70/30 train-test split of this dataset. In this example, the training set ends up with 20,602 reviews, and the test set with 8,830 reviews.


3. The training data contains 34,029 unique words; this is the number of features in the bag-of-words representation.

In [11]:
nytlife_subset = nightlife.sample(frac=0.1, random_state=123)
resto_subset = resto.sample(frac=0.0268, random_state=123)

In [12]:
nytlife_subset.shape, resto_subset.shape

((14724, 6), (14708, 6))

In [13]:
combine = pd.concat([nytlife_subset, resto_subset])
combine.shape

(29432, 6)

In [14]:
combine.target.value_counts()

True     18208
False    11224
Name: target, dtype: int64
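
The label counts above are not equal even though the two subsets are almost the same size. A likely explanation, consistent with the earlier outputs, is that the `Restaurants` filter also matches businesses tagged as Nightlife: the two filtered frames hold 147,243 + 548,799 = 696,042 rows, but the merged table has only 421,891 + 147,243 = 569,134 rows, so 126,908 rows appear in both and carry `target == True`. The sketch below reproduces the effect on a made-up table (the `toy` data is an assumption for illustration) and shows that filtering on `target` itself gives disjoint groups.

```python
import pandas as pd

# Hypothetical toy data: some businesses carry both category labels.
toy = pd.DataFrame({
    "categories": ["Restaurants"] * 6
                  + ["Restaurants, Nightlife"] * 2
                  + ["Nightlife"] * 2,
})
toy["target"] = toy["categories"].str.contains("Nightlife")

is_nightlife = toy["categories"].str.contains("Nightlife")
is_resto = toy["categories"].str.contains("Restaurants")

# Rows matched by both category filters land in both subsets.
n_both = int((is_nightlife & is_resto).sum())

# Filtering on the label instead yields two groups with no shared rows.
positives = toy[toy["target"]]
negatives = toy[~toy["target"]]
n_shared = len(positives.index.intersection(negatives.index))

print(n_both, len(positives), len(negatives), n_shared)
```

Sampling from `positives` and `negatives` (rather than from the two category filters) is one way to get a dataset whose `target` counts are actually balanced.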

In [21]:
import sklearn.model_selection as model
from sklearn.feature_extraction import text

In [19]:
# Split into training and test datasets

training_data, test_data = model.train_test_split(combine, train_size=0.7, random_state=123)



In [20]:
training_data.shape, test_data.shape

((20602, 6), (8830, 6))
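
The split above is a plain random split. As an optional variation that this notebook does not use, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below assumes a made-up table with 70% positive labels.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy table: 70 positive and 30 negative rows.
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(100)],
    "target": [True] * 70 + [False] * 30,
})

# stratify=... asks for the same label proportions in train and test.
train, test = train_test_split(
    toy, train_size=0.7, random_state=123, stratify=toy["target"]
)

train_pos_share = train["target"].mean()
test_pos_share = test["target"].mean()
print(train.shape, test.shape, train_pos_share, test_pos_share)
```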

---

## Transform Features

In [23]:
# Represent the review text as a bag-of-words
# CountVectorizer tokenizes the text and counts how often each word occurs

bow_transform = text.CountVectorizer()
X_tr_bow = bow_transform.fit_transform(training_data['text'])
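
To make the bag-of-words step concrete, here is a small sketch on two made-up sentences (the `docs` list is an assumption for illustration). It shows the learned word-to-column mapping, the count matrix, and what happens to words that were not seen during `fit`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical reviews.
docs = ["the food was great", "great drinks great music"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # sparse matrix: one row per document

vocab = vectorizer.vocabulary_       # maps each word to its column index
counts = X.toarray()

# Words outside the fitted vocabulary are ignored at transform time,
# so new text always maps to the same number of columns.
unseen = vectorizer.transform(["amazing sushi"]).toarray()

print(vocab)
print(counts)
print(unseen)
```

This is why the test matrix later in the notebook has the same number of columns as the training matrix: the vocabulary is fixed by the training data.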

In [24]:
bow_transform.vocabulary_

{'friend': 12474,
 'had': 14001,
 'recommended': 24573,
 'jasmine': 16260,
 'months': 19793,
 'ago': 1270,
 'but': 4901,
 'just': 16577,
 'made': 18325,
 'it': 16109,
 'down': 9619,
 'here': 14520,
 'ful': 12616,
 'medamas': 18999,
 'fava': 11470,
 'bean': 3184,
 'dish': 9225,
 'served': 26799,
 'with': 33334,
 'pita': 22797,
 'bread': 4372,
 'started': 28675,
 'me': 18949,
 'off': 21026,
 'and': 1729,
 'all': 1463,
 'by': 4966,
 'itself': 16128,
 'was': 32798,
 'delicious': 8572,
 'memorable': 19086,
 'meal': 18954,
 'so': 27866,
 'many': 18584,
 'different': 8965,
 'flavors': 11916,
 'of': 21022,
 'them': 30255,
 'great': 13633,
 'followed': 12120,
 'wit': 33332,
 'tabouleh': 29651,
 'lamb': 17232,
 'shrimp': 27236,
 'kebab': 16728,
 'on': 21181,
 'yellow': 33695,
 'rice': 25356,
 'house': 14997,
 'salad': 25995,
 'along': 1534,
 'cold': 6696,
 'pink': 22736,
 'drink': 9744,
 'realized': 24447,
 'two': 31334,
 'problems': 23607,
 'couldn': 7519,
 'leave': 17484,
 'anything': 1922,
 ...}

In [25]:
len(bow_transform.vocabulary_)

34029

In [28]:
X_tr_bow.shape

(20602, 34029)

In [32]:
X_te_bow = bow_transform.transform(test_data['text'])

In [36]:
X_te_bow.shape

(8830, 34029)

In [33]:
y_tr = training_data['target']
y_te = test_data['target']

In [35]:
# Create the tf-idf representation using the bag-of-words matrix

tfidf_trfm = text.TfidfTransformer(norm=None)
X_tr_tfidf = tfidf_trfm.fit_transform(X_tr_bow)

In [37]:
X_te_tfidf = tfidf_trfm.transform(X_te_bow)
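
With `norm=None` and the other defaults left alone, `TfidfTransformer` multiplies each raw count by a smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the term. The sketch below checks that formula by hand on a made-up count matrix (the numbers are assumptions for illustration).

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents (rows) by 3 terms (columns).
counts = np.array([
    [2, 1, 0],
    [1, 0, 0],
    [1, 0, 3],
])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                 # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1     # smoothed idf (sklearn default)
manual = counts * idf                               # tf-idf without normalization

sk = TfidfTransformer(norm=None).fit_transform(csr_matrix(counts)).toarray()
matches = np.allclose(manual, sk)
print(idf, matches)
```

A term that occurs in every document gets an idf of exactly 1, so its tf-idf value equals its raw count; rarer terms are weighted up.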

In [38]:
import sklearn.preprocessing as preproc

In [39]:
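# Just for kicks, l2-normalize the bag-of-words representation.
# Note: axis=0 rescales each feature (column) to unit l2 norm, not each review,
# and the test matrix is scaled by its own column norms.

X_tr_l2 = preproc.normalize(X_tr_bow, axis=0)
X_te_l2 = preproc.normalize(X_te_bow, axis=0)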
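
The effect of the `axis` argument is easy to miss, so here is a small sketch on a made-up 2x2 matrix (the values are assumptions for illustration). It compares column-wise normalization, as used above, with row-wise normalization.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

# Hypothetical counts: 2 documents (rows) by 2 features (columns).
counts = csr_matrix(np.array([
    [3.0, 0.0],
    [4.0, 2.0],
]))

# axis=0: every feature column is divided by its own l2 norm.
by_column = normalize(counts, axis=0).toarray()
# axis=1: every document row is divided by its own l2 norm.
by_row = normalize(counts, axis=1).toarray()

col_norms = np.linalg.norm(by_column, axis=0)
row_norms = np.linalg.norm(by_row, axis=1)
print(by_column, col_norms)
print(by_row, row_norms)
```

With `axis=0` the first column `[3, 4]` becomes `[0.6, 0.8]`; with `axis=1` each review vector would have unit length instead.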