# Classification with Logistic Regression

- Logistic regression is a simple linear classifier. Because of its simplicity, it is often a good first classifier to try.


- It takes a weighted combination of the input features and passes it through a sigmoid function, which smoothly maps any real number x to a number between 0 and 1.


> A logistic classifier predicts the positive class if the sigmoid output is greater than 0.5, and the negative class otherwise.
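
To make the description above concrete, here is a minimal NumPy sketch of the sigmoid and the 0.5 decision rule. The weights `w`, the bias `b`, and the toy inputs are made-up values for illustration only; they are not taken from any trained model in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, b, threshold=0.5):
    """Return True (positive class) where sigmoid(X @ w + b) > threshold."""
    scores = X @ w + b  # weighted combination of the input features
    return sigmoid(scores) > threshold

# Hypothetical weights and two toy examples (assumptions, not learned values)
w = np.array([1.5, -2.0])
b = 0.25
X = np.array([[2.0, 0.5],   # score =  3.0 - 1.0 + 0.25 =  2.25 -> positive
              [0.0, 1.0]])  # score =  0.0 - 2.0 + 0.25 = -1.75 -> negative

print(sigmoid(0.0))  # 0.5, the decision boundary
print(predict(X, w, b))
```

Because the sigmoid is monotonic, thresholding its output at 0.5 is the same as checking whether the weighted score is greater than 0.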

In [1]:
# Import libraries

import pandas as pd
import json

In [2]:
# Load Yelp business data (one JSON object per line)

with open('../../data/yelp_dataset/yelp_academic_dataset_business.json', encoding='utf8') as biz_f:
    biz_df = pd.DataFrame([json.loads(line) for line in biz_f])

# Load Yelp reviews data: the first 1,000,000 reviews
with open('../../data/yelp_dataset/yelp_academic_dataset_review.json', encoding='utf8') as f:
    js = [json.loads(f.readline()) for _ in range(1000000)]

review_df = pd.DataFrame(js)

In [3]:
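# Drop businesses with any missing field, so that the string matching
# on `categories` below never sees NaN
biz_df.dropna(inplace=True)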

In [4]:
# Pull out only the Nightlife and Restaurants businesses

two_biz = biz_df[(biz_df.categories.str.contains('Restaurants')) | (biz_df.categories.str.contains('Nightlife'))]
two_biz.shape

(45704, 15)

In [5]:
# Join with the reviews to get all reviews for the two business categories

twobiz_reviews = two_biz.merge(review_df, on='business_id', how='inner')

In [6]:
# Trim away the features we won't use
twobiz_reviews = twobiz_reviews[['business_id',
                                 'name',
                                 'stars_y',
                                 'text',
                                 'categories']]
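
The `stars_y` column above comes from the join: both tables have a `stars` column, and `merge` disambiguates overlapping names with the default suffixes `_x` (left table) and `_y` (right table). Here is a small sketch with made-up toy frames standing in for the business and review data:

```python
import pandas as pd

# Toy stand-ins for the business and review tables (values are made up)
biz = pd.DataFrame({'business_id': ['a', 'b'], 'stars': [4.5, 3.0]})
reviews = pd.DataFrame({'business_id': ['a', 'a', 'b'], 'stars': [5, 4, 2]})

merged = biz.merge(reviews, on='business_id', how='inner')

# 'stars_x' is the business rating, 'stars_y' is the per-review rating
print(merged.columns.tolist())  # ['business_id', 'stars_x', 'stars_y']
```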

In [7]:
# Create the target column -- True for Nightlife businesses, and False otherwise

twobiz_reviews['target'] = twobiz_reviews.categories.str.contains('Nightlife')

In [8]:
# Create a class-balanced classification dataset
# Note: a business tagged with both categories ends up in both frames

nightlife = twobiz_reviews[(twobiz_reviews.categories.str.contains('Nightlife'))]
resto = twobiz_reviews[(twobiz_reviews.categories.str.contains('Restaurants'))]

---

## Balance the Dataset

In [9]:
nytlife_subset = nightlife.sample(frac=0.1, random_state=123)
resto_subset = resto.sample(frac=0.0268, random_state=123)

In [10]:
combine = pd.concat([nytlife_subset, resto_subset])
combine.shape

(29432, 6)
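
The notebook does not say how the two sampling fractions were chosen. One plausible way to get subsets of roughly equal size is to fix one fraction and scale the other by the ratio of the class sizes. The sketch below illustrates that idea on made-up toy frames; the sizes and variable names are assumptions.

```python
import pandas as pd

# Toy frames standing in for `nightlife` and `resto` (sizes are made up)
nightlife = pd.DataFrame({'text': ['n'] * 200})
resto = pd.DataFrame({'text': ['r'] * 1000})

night_frac = 0.1
# Scale the restaurant fraction so both subsets end up the same size
resto_frac = night_frac * len(nightlife) / len(resto)

night_subset = nightlife.sample(frac=night_frac, random_state=123)
resto_subset = resto.sample(frac=resto_frac, random_state=123)

print(len(night_subset), len(resto_subset))
```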

In [11]:
import sklearn.model_selection as model

In [12]:
# Split into training and test datasets

training_data, test_data = model.train_test_split(combine, train_size=0.7, random_state=123)



---

## Transform Features

In [13]:
from sklearn.feature_extraction import text

In [14]:
# Represent the review text as a bag-of-words
# CountVectorizer learns the vocabulary from the training text and counts word occurrences

bow_transform = text.CountVectorizer()
X_tr_bow = bow_transform.fit_transform(training_data['text'])

In [15]:
X_te_bow = bow_transform.transform(test_data['text'])
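
Note the difference between `fit_transform` on the training text and `transform` on the test text: the vocabulary is learned from the training set only, and words that appear only in the test set are ignored. Here is a toy sketch with made-up documents:

```python
from sklearn.feature_extraction import text

# Made-up documents for illustration
train_docs = ['great tacos', 'great cocktails']
test_docs = ['great sushi']  # 'sushi' never appears in training

bow = text.CountVectorizer()
X_train = bow.fit_transform(train_docs)  # learns the vocabulary
X_test = bow.transform(test_docs)        # reuses it; unknown words are dropped

print(sorted(bow.vocabulary_))  # ['cocktails', 'great', 'tacos']
print(X_test.toarray())         # [[0 1 0]]
```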

In [16]:
y_tr = training_data['target']
y_te = test_data['target']

In [17]:
# Create the tf-idf representation using the bag-of-words matrix

tfidf_trfm = text.TfidfTransformer(norm=None)
X_tr_tfidf = tfidf_trfm.fit_transform(X_tr_bow)

In [18]:
X_te_tfidf = tfidf_trfm.transform(X_te_bow)
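
With `norm=None`, `TfidfTransformer` simply multiplies each raw count by the inverse document frequency of its word, with no per-document rescaling afterwards. With scikit-learn's default smoothing, a word that appears in every document gets an idf of exactly 1. The toy corpus below is made up for illustration:

```python
import numpy as np
from sklearn.feature_extraction import text

corpus = ['good food', 'good drinks']  # toy corpus; 'good' is in every document

vectorizer = text.CountVectorizer()
bow = vectorizer.fit_transform(corpus)  # columns: drinks, food, good

trfm = text.TfidfTransformer(norm=None)
tfidf = trfm.fit_transform(bow)

# Default smoothing: idf = ln((1 + n_docs) / (1 + doc_freq)) + 1
print(trfm.idf_)

# With norm=None, tf-idf is just count * idf for every entry
print(np.allclose(tfidf.toarray(), bow.toarray() * trfm.idf_))  # True
```

Rarer words get a larger idf, so their counts are scaled up relative to words that occur everywhere.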

In [19]:
import sklearn.preprocessing as preproc

In [20]:
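# Just for kicks, l2-normalize the bag-of-words representation
# axis=0 scales each feature (column) to unit l2 norm, separately for the training and test matrices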

X_tr_l2 = preproc.normalize(X_tr_bow, axis=0)
X_te_l2 = preproc.normalize(X_te_bow, axis=0)
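
The `axis` argument decides what gets scaled to unit length. A small sketch on a made-up count matrix shows the difference between `axis=0`, as used above, and the default `axis=1`:

```python
import numpy as np
import sklearn.preprocessing as preproc

# Made-up counts: 2 documents (rows) x 2 words (columns)
counts = np.array([[3.0, 0.0],
                   [4.0, 2.0]])

by_feature = preproc.normalize(counts, axis=0)  # each column has unit l2 norm
by_sample = preproc.normalize(counts, axis=1)   # each row has unit l2 norm

print(by_feature)  # first column becomes [0.6, 0.8], second becomes [0, 1]
print(np.linalg.norm(by_sample, axis=1))  # every row norm is 1
```

With `axis=0`, the scaling factors depend on whichever matrix is passed in, so the training and test sets are each divided by their own column norms.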

--- 

### Now let's build some simple logistic regression classifiers

In [21]:
from sklearn.linear_model import LogisticRegression

In [22]:
def simple_logistic_classify(X_tr, y_tr, X_test, y_test, description, _C=1.0):
    """Train a logistic classifier and print its accuracy on the test data."""
    m = LogisticRegression(C=_C).fit(X_tr, y_tr)
    s = m.score(X_test, y_test)
    print('Test score with', description, 'features:', s)
    return m

In [23]:
m1 = simple_logistic_classify(X_tr_bow, y_tr, X_te_bow, y_te, 'bow')
m2 = simple_logistic_classify(X_tr_l2, y_tr, X_te_l2, y_te, 'l2-normalized')
m3 = simple_logistic_classify(X_tr_tfidf, y_tr, X_te_tfidf, y_te, 'tf-idf')



Test score with bow features: 0.7221970554926387
Test score with l2-normalized features: 0.7069082672706681
Test score with tf-idf features: 0.6768969422423556


> Paradoxically, the results show that the most accurate classifier is the one using plain BoW features.

> The reason is that the classifiers are not well tuned (all three use the default C=1.0), which is a common pitfall when comparing classifiers.
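
One common way to tune a logistic classifier is to search over the regularization strength `C` with cross-validation. The sketch below is a hedged illustration on synthetic data from `make_classification`, not a rerun of the experiment above. The grid values and `max_iter` are assumptions. In this notebook, one would pass each feature matrix (for example `X_tr_bow` with `y_tr`) instead of the synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption, used only to keep the sketch self-contained)
X, y = make_classification(n_samples=500, n_features=20, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=123)

# Candidate regularization strengths (hypothetical grid)
param_grid = {'C': [1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2]}

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_tr, y_tr)

print('Best C:', search.best_params_['C'])
print('Cross-validated accuracy:', search.best_score_)
print('Test accuracy:', search.score(X_te, y_te))
```

Comparing feature representations is fairer when each classifier uses its own best `C`, chosen on the training data only.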