---

### HW 1

---

**Course:** CSE158 / CSE258 / MGTA461 / DSC256

**Term:** Fall 25

**Due Date:** 2025-10-13



#### **Regression (week 1)**
* *First, using the book review data (see the “runner” code for the exact dataset names), let’s see whether ratings can be predicted as a function of review length, or by using temporal features associated with a review*

In [1]:
## import libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error
import gzip
import json
from datetime import datetime
import dateutil.parser
import random

## define json_gz_path
json_gz_path = "/home/scotty/dsc_256/fall_25/module_01/hw/data/datasets/fantasy_10000.json.gz"

## define read_json_gz function
def read_json_gz(json_gz_path):
    ## parse one JSON object per line of the gzipped file
    with gzip.open(json_gz_path) as f:
        dataset = [json.loads(l) for l in f]

    return dataset

## read fantasy_10000.json.gz and display 1st element of dataset
dataset = read_json_gz(json_gz_path)
dataset[0]

{'user_id': '8842281e1d1347389f2ab93d60773d4d',
 'book_id': '18245960',
 'review_id': 'dfdbb7b0eb5a7e4c26d59a937e2e5feb',
 'rating': 5,
 'review_text': 'This is a special book. It started slow for about the first third, then in the middle third it started to get interesting, then the last third blew my mind. This is what I love about good science fiction - it pushes your thinking about where things can go. \n It is a 2015 Hugo winner, and translated from its original Chinese, which made it interesting in just a different way from most things I\'ve read. For instance the intermixing of Chinese revolutionary history - how they kept accusing people of being "reactionaries", etc. \n It is a book about science, and aliens. The science described in the book is impressive - its a book grounded in physics and pretty accurate as far as I could tell. Though when it got to folding protons into 8 dimensions I think he was just making stuff up - interesting to think about though. \n But what would 

In [2]:
## list(dicts)-> pandas dataframe
df = pd.DataFrame(dataset)

## convert review_text to normalized length
df['review_len'] = df['review_text'].str.len()
df['review_len_norm'] = df['review_len']/df['review_len'].max()

## create model instance and define model parameters
predictor = LinearRegression(fit_intercept=True)
X = df['review_len_norm'].values.reshape(-1,1)
#X = [[1] + [f] for f in df['review_len_norm']]
#print(X[0:10])
y = df['rating']
#y = [r for r in df['rating']]

print(y[0:10])

## fit predictor and return theta_0(intercept), theta_1(slope), and MSE
predictor.fit(X,y)
y_pred = predictor.predict(X)
print(predictor.intercept_)
print(f"theta_1: {predictor.coef_}")
print(f"mse: {mean_squared_error(y,y_pred)}")

print(df[['rating','review_len','review_len_norm']].describe())

0    5
1    5
2    5
3    4
4    3
5    5
6    5
7    5
8    4
9    5
Name: rating, dtype: int64
3.685681355016952
theta_1: [0.98335392]
mse: 1.5522086622355378
             rating    review_len  review_len_norm
count  10000.000000  10000.000000     10000.000000
mean       3.740100    791.691700         0.055340
std        1.247921   1022.915566         0.071503
min        0.000000      0.000000         0.000000
25%        3.000000    157.000000         0.010974
50%        4.000000    429.000000         0.029987
75%        5.000000    983.000000         0.068712
max        5.000000  14306.000000         1.000000


#### 1. Train a simple predictor that estimates rating from review length:
$$\text{star rating} = \theta_{0} + \theta_{1} \cdot [\text{review length in characters}]$$

* Rather than using the review length directly, scale the feature to be between 0 and 1 by dividing by the maximum review length in the dataset.
* Return the value of $\theta$ and the Mean Squared Error ($MSE$) of your predictor (on the entire dataset).
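
For reference, the two quantities requested above can be written out explicitly. The symbols $N$ (number of reviews), $\ell_i$ (character length of review $i$), $x_i$ (scaled length), and $r_i$ (rating of review $i$) are notation introduced here for illustration and do not come from the assignment:

$$x_i = \frac{\ell_i}{\max_j \ell_j}, \qquad \text{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(r_i - (\theta_0 + \theta_1 x_i)\right)^2$$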

In [3]:
## list(dicts)-> pandas dataframe
df = pd.DataFrame(dataset)

## convert review_text to normalized length
df['review_len'] = df['review_text'].str.len()
df['review_len_norm'] = df['review_len']/df['review_len'].max()

## create model instance and define model parameters
predictor = LinearRegression(fit_intercept=False)
X = [[1] + [f] for f in df['review_len_norm']]
y = df['rating']

## fit predictor and return theta_0(intercept), theta_1(slope), and MSE
predictor.fit(X,y)
y_pred = predictor.predict(X)

print(f"theta: {predictor.coef_}")
print(f"mse: {mean_squared_error(y,y_pred)}")

theta: [3.68568136 0.98335392]
mse: 1.5522086622355378


#### 2. Extend your model to include (in addition to the scaled length) features based on the time of the review. The runner contains code to compute the weekday. 
* Using a one-hot encoding for the weekday and month, write down feature vectors for the first two examples. 
* Be careful not to include any redundant dimensions: e.g. your feature vector, including the offset term and the length feature, should contain no more than 19 dimensions.
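
As a rough illustration of the 19-dimension layout described above (1 offset + 1 length + 6 weekday flags + 11 month flags), here is a minimal sketch. The function name `one_hot_features`, the `DATE_FORMAT` string, and the sample timestamps are assumptions made for this example. Dropping Monday and January as the reference categories is likewise just one possible choice; dropping any single weekday and any single month removes the redundancy equally well.

```python
from datetime import datetime

# Assumed timestamp layout; the sample strings below are hypothetical.
DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def one_hot_features(scaled_length, date_string):
    """Return [offset, scaled length, 6 weekday flags, 11 month flags]."""
    t = datetime.strptime(date_string, DATE_FORMAT)
    weekday_flags = [0] * 6   # Monday (weekday 0) is the dropped reference category
    month_flags = [0] * 11    # January (month 1) is the dropped reference category
    if t.weekday() > 0:
        weekday_flags[t.weekday() - 1] = 1
    if t.month > 1:
        month_flags[t.month - 2] = 1
    return [1, scaled_length] + weekday_flags + month_flags

example = one_hot_features(0.25, "Sun Jul 30 07:44:10 -0700 2017")
reference = one_hot_features(0.5, "Mon Jan 02 10:00:00 +0000 2017")
print(len(example))
print(example)
print(reference)
```

A review written on a Monday in January maps to all-zero flags, which is how the dropped categories are represented.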

In [None]:
## extract the weekday and month tokens (first two space-separated fields of the date string)
df['weekday'] = df['date_added'].apply(lambda x: str(x).split(' ')[0])
df['month'] = df['date_added'].apply(lambda x: str(x).split(' ')[1])

## one-hot encode weekday and month (the encoder is refit for each column)
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
wd_encoded = encoder.fit_transform(df['weekday'].values.reshape(-1,1))
m_encoded = encoder.fit_transform(df['month'].values.reshape(-1,1))

## offset term and scaled review length
X_old = np.array([[1] + [f] for f in df['review_len_norm']])

## drop the first column of each encoding to avoid redundant dimensions: 2 + 6 + 11 = 19
X = np.hstack([X_old,wd_encoded[:,1:],m_encoded[:,1:]])

#### 3. Train models that:
* Use the weekday and month values directly as features:
$$
\text{star rating} \approx \theta_{0} + \theta_{1} \cdot [\text{review len in chars}] + \theta_{2} \cdot [t.\text{weekday}()] + \theta_{3} \cdot [t.\text{month}]
$$
* Use the one-hot encoding from Question 2
* Return the Mean Squared Error ($MSE$) of each model.
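
The direct encoding in the formula above could be sketched as follows, using plain least squares from NumPy in place of `LinearRegression`. The helper names `direct_features` and `fit_and_mse`, the `DATE_FORMAT` string, and the six toy rows are all hypothetical and are not taken from the dataset or the runner.

```python
import numpy as np
from datetime import datetime

# Assumed timestamp layout; the rows below are made-up illustrations.
DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def direct_features(scaled_length, date_string):
    """Return [offset, scaled length, weekday as 0-6, month as 1-12]."""
    t = datetime.strptime(date_string, DATE_FORMAT)
    return [1, scaled_length, t.weekday(), t.month]

def fit_and_mse(X, y):
    """Fit theta by ordinary least squares and return (theta, MSE on X, y)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    theta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    mse = float(np.mean((X @ theta - y) ** 2))
    return theta, mse

rows = [
    (0.10, "Sun Jul 30 07:44:10 -0700 2017", 5),
    (0.40, "Wed Mar 22 11:37:04 -0700 2017", 4),
    (0.25, "Fri Dec 01 09:00:00 -0800 2017", 3),
    (0.05, "Mon Jan 02 10:00:00 -0800 2017", 5),
    (0.60, "Tue Sep 05 18:30:00 -0700 2017", 2),
    (0.30, "Sat May 13 12:15:00 -0700 2017", 4),
]
X = [direct_features(length, date) for length, date, _ in rows]
y = [rating for _, _, rating in rows]
theta, mse = fit_and_mse(X, y)
print(theta, mse)
```

Treating weekday and month as plain numbers forces a single linear trend across them, which is the modelling difference from the one-hot version in Question 2.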

In [11]:
predictor = LinearRegression(fit_intercept=False)

## fit the one-hot model from Question 2 and report theta and MSE
predictor.fit(X,y)
y_pred = predictor.predict(X)

print(f"theta: {predictor.coef_}")
print(f"mse: {mean_squared_error(y,y_pred)}")

theta: [ 3.63737499e+00  9.94494578e-01 -1.32575099e-01 -8.77253042e-02
 -3.23638383e-02 -2.75054604e-02 -7.27194928e-02  3.42369201e-03
  1.30437875e-01  2.98482887e-02  7.72270388e-02  1.23162523e-01
  1.11167001e-01  1.63861311e-01  1.77934236e-01  2.39472524e-02
  1.32511588e-02  1.18585001e-01  1.06052980e-01]
mse: 1.5466315498487562


#### 4. Repeat the above question, but this time split the data into a training and test set. 
* You should split the data into 50%/50% train/test fractions following the split used by the code stub (or runner). 
* After training on the training set, compute the MSE of the two models (the one-hot encoding from Question 2 and the direct encoding from Question 3) on the test set.
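
A minimal sketch of the train/test procedure is shown below. It assumes the split is simply "first half for training, second half for testing"; the actual split is defined by the code stub or runner and takes precedence. The helper names `split_half` and `held_out_mse` and the synthetic data are hypothetical.

```python
import numpy as np

def split_half(X, y):
    """Assumed split: first 50% of rows for training, remaining rows for testing."""
    n = len(y) // 2
    return X[:n], y[:n], X[n:], y[n:]

def held_out_mse(X, y):
    """Fit least squares on the training half and return the MSE on the test half."""
    X_train, y_train, X_test, y_test = split_half(X, y)
    theta, _, _, _ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return float(np.mean((X_test @ theta - y_test) ** 2))

# Synthetic stand-in for the real feature matrix (offset column plus one feature).
rng = np.random.default_rng(0)
length = rng.random(200)
X = np.column_stack([np.ones(200), length])
y = 3.5 + 1.0 * length + rng.normal(0, 0.1, 200)

result = held_out_mse(X, y)
print(result)
```

The same `held_out_mse` call would be applied once to the one-hot feature matrix and once to the direct-encoding matrix, so the two test errors are directly comparable.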

#### **Classification (week 2)**
* *Next, using the beer review data, we’ll try to predict ratings (positive or negative) based on characteristics of beer reviews. Load the 50,000 beer review dataset (done in the runner), and construct a label vector by considering whether a review score is four or above*
    ```
    y = [d['review/overall'] >=4 for d in dataset]
    ```

#### 5. Fit a logistic regressor that estimates the binarized score from review length:
$$P(\text{rating is positive}) = \sigma(\theta_{0}+\theta_{1}\cdot[\text{length}])$$
* Using the `class_weight='balanced'` option, compute the number of True Positives, True Negatives, False Positives, and False Negatives, as well as the Balanced Error Rate (BER) of the classifier.
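
The following sketch shows one way the counts and the BER could be computed, taking BER as the average of the false positive rate and the false negative rate. The helper `confusion_and_ber` and the synthetic `lengths`/`labels` arrays are assumptions for illustration and stand in for the beer review features and labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def confusion_and_ber(y_true, y_pred):
    """Return (TP, TN, FP, FN, BER); assumes both classes occur in y_true."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    # BER = 1 - (true positive rate + true negative rate) / 2
    ber = 1 - 0.5 * (tp / (tp + fn) + tn / (tn + fp))
    return tp, tn, fp, fn, ber

# Synthetic, imbalanced stand-in data (not the beer reviews).
rng = np.random.default_rng(0)
lengths = rng.random(500)
labels = (lengths + rng.normal(0, 0.3, 500)) > 0.3

model = LogisticRegression(class_weight='balanced')
model.fit(lengths.reshape(-1, 1), labels)
pred = model.predict(lengths.reshape(-1, 1))

tp, tn, fp, fn, ber = confusion_and_ber(labels, pred)
print(tp, tn, fp, fn, ber)
```

With `class_weight='balanced'`, errors on the rarer class are weighted more heavily during fitting, which tends to matter when the BER rather than plain accuracy is the target metric.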

#### 6. Compute the precision@K of your classifier for:
$$K \in \{1,100,1000,10000\}$$
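
Precision at $K$ is the fraction of true positives among the $K$ examples the classifier ranks highest. A minimal sketch follows; the function name `precision_at_k` and the toy scores are hypothetical. With a fitted scikit-learn classifier, the scores could come from, for example, `decision_function`.

```python
def precision_at_k(scores, labels, k):
    """Fraction of positive labels among the k highest-scoring examples."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    # Divides by the number of examples actually retrieved, which is k unless
    # k exceeds the dataset size.
    return sum(1 for _, label in top_k if label) / len(top_k)

# Toy confidence scores and ground-truth labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [True, False, True, True, False]

for k in [1, 2, 3, 5]:
    print(k, precision_at_k(scores, labels, k))
```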

#### 7. Improve your predictor (specifically, reduce the balanced error rate) by incorporating additional features from the data.
* e.g. beer styles, ratings, features from text, etc.
* The BER should be ~3% lower than the solution from Q5.
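
One possible starting point for richer features is sketched below: an offset, the review length, and a one-hot encoding of the beer style with one reference style dropped. Only `review/overall` appears in this assignment text. The field names `beer/style` and `review/text`, the helper `build_features`, and the three toy reviews are assumptions made for illustration.

```python
def build_features(review, styles):
    """Return [offset, text length, one flag per non-reference style]."""
    # styles[0] acts as the dropped reference category.
    style_flags = [1 if review.get("beer/style") == s else 0 for s in styles[1:]]
    return [1, len(review.get("review/text", ""))] + style_flags

# Hypothetical reviews using assumed field names.
reviews = [
    {"beer/style": "IPA", "review/text": "Hoppy and bright", "review/overall": 4.5},
    {"beer/style": "Stout", "review/text": "Roasty", "review/overall": 3.0},
    {"beer/style": "Lager", "review/text": "Crisp, clean finish", "review/overall": 4.0},
]

styles = sorted({r["beer/style"] for r in reviews})
X = [build_features(r, styles) for r in reviews]
y = [r["review/overall"] >= 4 for r in reviews]
print(X)
print(y)
```

The resulting `X` and `y` could then be passed to the same logistic regression setup as in Question 5 to compare the BER before and after adding features.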