# General pipeline for project 1
This example pipeline shows you how to  
(1) load the provided data;  
(2) train models on the train set and evaluate them on the validation set;  
(3) generate predictions (`pred.csv`) on the test set, ready for submission.

In [1]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression

In [3]:
# if you use Google Colab, un-comment this cell, modify `path_to_data` if needed, and run to mount data to `data`
# from google.colab import drive
# drive.mount('/content/drive')

# path_to_data = '/content/drive/MyDrive/HKUST stuff/COMP4332_Project1/data'
# !rm -f data
# !ln -s "{path_to_data}" data

Mounted at /content/drive


### (1) Loading data
The following code shows how to load the datasets for this project.  
We do not release the labels (the "stars" column) for the test set, so evaluate your trained model on the validation set instead.

However, your submitted predictions (``pred.csv``) should be generated on the test set.

In [2]:
def load_data(split_name='train', columns=['text', 'stars'], folder='data'):
    '''
        "split_name" may be set as 'train', 'valid' or 'test' to load the corresponding dataset.
        
        You may also specify the column names to load any columns in the .csv data file.
        Among many, "text" can be used as model input, and "stars" column is the labels (sentiment). 
        If you like, you are free to use columns other than "text" for prediction.
    '''
    print(f"select [{', '.join(columns)}] columns from the {split_name} split")
    # read the file once; only the column selection can fail for a valid split
    df = pd.read_csv(f'{folder}/{split_name}.csv')
    try:
        df = df.loc[:, columns]
        print("Success")
    except KeyError:
        # a requested column is missing (e.g. "stars" in the test split), so keep all columns
        print(f"Failed loading specified columns... Returning all columns from the {split_name} split")
    return df

In [4]:
train_df = load_data('train', columns=['text', 'stars'])
valid_df = load_data('valid', columns=['text', 'stars'])
# the test set labels (the 'stars' column) are not available! So the following code will instead return all columns
test_df = load_data('test', columns=['text', 'stars'])

select [text, stars] columns from the train split
Success
select [text, stars] columns from the valid split
Success
select [text, stars] columns from the test split
Failed loading specified columns... Returning all columns from the test split


In [5]:
# test_df.columns
# print(train_df.columns)
# print(valid_df.columns)
# print(test_df.columns)
test_df

Unnamed: 0,business_id,cool,date,funny,review_id,text,useful,user_id
0,V-qDa2kr5qWdhs7PU-l-3Q,0,2013-05-29,0,fBHWLNEJmhk6AkzmfLwWcw,Would like to give this more stars - usually I...,1,1pigoFijaHVWGrQl1_tYjw
1,C1zlvNlxlGZB8g0162QslQ,0,2012-03-02 15:51:49,0,ldEQ02aP1OeSa5N2beseNg,My wife and I took some friends here after din...,0,BKWPuPZFcGmgjRFRzoq1pw
2,0FOON_PNvG0ZxIZh6Jcv2A,0,2013-09-24 20:31:37,0,0oGr6v9VjtRsRsROGMoWTA,My husband and I had lunch here for the first ...,0,BYVYXKqNs-vv-N1ZhRMs0g
3,r49iBfbnfoK7yt4rdsL_7g,0,2018-10-20 01:34:08,0,eg5eJ5HmqXuzkxucnKvMTw,I love coming here with my friends! Great for ...,2,dpzmyNglDMeTgV3T5ylUSQ
4,xnLNPkL7bbdhD842T4oPqg,0,2016-09-25,1,BNDAe34Mxj--Brkzcfi4QA,Make sure that you double check how much these...,1,yk9wx31bfMEe_IXB8Q-ylA
...,...,...,...,...,...,...,...,...
3995,x_0Vf8AVBk_auLnNHRjoVA,2,2013-05-18 03:06:21,0,s7FLCfjgopRM6olA1NSccg,We live nearby and have stopped by this McDona...,0,Nf3VduiXhQVZRvM2GiXi-w
3996,KAJAsjVhYUPb6b_yodVqvA,0,2018-05-06 05:33:47,0,oJUnsu4PpTZz-kCE88-9uQ,It was boring as ever! All Spanish music so I ...,0,T3hk43jr0t7ZK8RPmce4sQ
3997,EnKpL0rRg1MTTKncmxbnMA,0,2012-03-21 20:49:25,0,celcHgmV26VvtzGdUFsR5w,"Was a long time customer, I was entertaining c...",1,WFWzzvWM45zTx-EShrVVxw
3998,-NR4KqS6lHseNvJ-GFzfMA,2,2016-08-14,1,69yY48SDj-UDCKlGgn-nqg,I really like this place! I like how you can t...,2,SS3sFA9ksCT9bjocM3Wbug


### (2) Training and validating 
The following example shows you how to train your model using the train set, and evaluate on the validation set.  
As an example, we only use the text data for training. Feel free to use other columns in your implementation.  

The performance on the validation set can be regarded as a rough estimate of your model's final performance, so we can use it to search for good hyper-parameters.
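If you want to go beyond the text column, one possible approach is sketched below. It is only an illustration, not part of the provided pipeline: the helper name `build_multi_column_pipeline` is hypothetical, and it assumes the train and validation splits contain the numeric `cool`, `funny` and `useful` columns that appear in the test split. The raised `max_iter=1000` is also an assumption.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_multi_column_pipeline(numeric_columns=('cool', 'funny', 'useful'), C=1.0):
    """Hypothetical helper: TF-IDF on "text" plus scaled numeric vote columns."""
    features = ColumnTransformer([
        # a plain string (not a list) passes a 1-D column, which TfidfVectorizer expects
        ('tfidf', TfidfVectorizer(), 'text'),
        # a list passes a 2-D block of columns, which StandardScaler expects
        ('votes', StandardScaler(), list(numeric_columns)),
    ])
    return Pipeline([
        ('features', features),
        ('lr', LogisticRegression(C=C, max_iter=1000)),  # max_iter is an assumption
    ])
```

To try it, load the extra columns first, e.g. `load_data('train', columns=['text', 'cool', 'funny', 'useful', 'stars'])`, and pass the whole feature DataFrame (not just the `text` Series) to `fit` and `predict`.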

In [6]:
# Prepare the data.
# As an example, we only use the text data. 
x_train = train_df['text']
y_train = train_df['stars']
  
x_valid = valid_df['text']
y_valid = valid_df['stars']

x_test = test_df['text']

You can use the validation data to choose hyperparameters.
For example, you can decide which value of C (1 or 100) is better by comparing the two models on the validation data.
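The same comparison can be written as a loop over candidate values. The sketch below is a minimal illustration: the function name `compare_C_values` and the candidate grid are hypothetical, and `max_iter=1000` is an assumption that differs from the default used in the cells of this notebook, so the scores will not necessarily match the outputs shown here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline


def compare_C_values(x_train, y_train, x_valid, y_valid,
                     candidates=(0.1, 1, 10, 100), max_iter=1000):
    """Fit one TF-IDF + logistic-regression pipeline per C value.

    Returns a dict mapping each C to its accuracy on the validation data.
    """
    scores = {}
    for C in candidates:
        pipe = Pipeline([
            ('tfidf', TfidfVectorizer()),
            ('lr', LogisticRegression(C=C, max_iter=max_iter)),
        ])
        pipe.fit(x_train, y_train)
        scores[C] = accuracy_score(y_valid, pipe.predict(x_valid))
    return scores
```

With the variables defined above, `scores = compare_C_values(x_train, y_train, x_valid, y_valid)` followed by `max(scores, key=scores.get)` would give the candidate with the highest validation accuracy.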

In [7]:
# build the first linear model with TF-IDF features
tfidf = TfidfVectorizer()
lr1 = LogisticRegression(C=100)
steps = [('tfidf', tfidf),('lr', lr1)]
pipe1 = Pipeline(steps)

In [8]:
# train the first model
pipe1.fit(x_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
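The warning above means the solver stopped at its iteration limit (the scikit-learn default is `max_iter=100`) before converging. One way to address it, as the message suggests, is to raise `max_iter`. The sketch below is an illustration only: the helper names `build_pipeline` and `converged` are hypothetical, and 1000 is an assumed budget that may need adjusting for the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_pipeline(C=1.0, max_iter=1000):
    """TF-IDF + logistic regression with a larger iteration budget than the default."""
    return Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('lr', LogisticRegression(C=C, max_iter=max_iter)),
    ])


def converged(pipe):
    """True if the fitted solver stopped before using up max_iter iterations."""
    lr = pipe.named_steps['lr']
    # n_iter_ holds the number of iterations the solver actually ran
    return bool((lr.n_iter_ < lr.max_iter).all())
```

After `pipe = build_pipeline(C=100).fit(x_train, y_train)`, calling `converged(pipe)` tells you whether the larger budget was enough. Note that a converged model may score differently from the outputs shown in this notebook.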


In [9]:
# validate on the validation set
y_pred = pipe1.predict(x_valid)
print(classification_report(y_valid, y_pred))
print("\n\n")
print(confusion_matrix(y_valid, y_pred))
print('accuracy', np.mean(y_valid == y_pred))

              precision    recall  f1-score   support

           1       0.73      0.80      0.76       292
           2       0.39      0.29      0.34       163
           3       0.35      0.36      0.36       232
           4       0.43      0.44      0.43       421
           5       0.78      0.78      0.78       892

    accuracy                           0.62      2000
   macro avg       0.54      0.53      0.53      2000
weighted avg       0.62      0.62      0.62      2000




[[233  30  15   6   8]
 [ 48  48  50  10   7]
 [ 22  28  83  71  28]
 [  6  10  71 184 150]
 [ 10   6  16 161 699]]
accuracy 0.6235


In [10]:
# build the second linear model with TF-IDF features
tfidf = TfidfVectorizer()
lr2 = LogisticRegression(C=1)
steps = [('tfidf', tfidf),('lr', lr2)]
pipe2 = Pipeline(steps)

In [11]:
# train the second model
pipe2.fit(x_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [12]:
# validate on the validation set
y_pred = pipe2.predict(x_valid)
print(classification_report(y_valid, y_pred))
print("\n\n")
print(confusion_matrix(y_valid, y_pred))
print('accuracy', np.mean(y_valid == y_pred))

              precision    recall  f1-score   support

           1       0.70      0.84      0.76       292
           2       0.47      0.17      0.25       163
           3       0.45      0.27      0.34       232
           4       0.48      0.48      0.48       421
           5       0.75      0.87      0.81       892

    accuracy                           0.66      2000
   macro avg       0.57      0.53      0.53      2000
weighted avg       0.63      0.66      0.63      2000




[[245  16   6   7  18]
 [ 65  28  39  19  12]
 [ 24  14  63  87  44]
 [  9   1  28 202 181]
 [  7   1   5 104 775]]
accuracy 0.6565


The second model (`pipe2`) has the higher validation accuracy (0.6565 vs. 0.6235), so we use it to make predictions on the test data. In practice, look beyond accuracy at other metrics (precision, recall, f1) as well, since the label distribution is not always balanced.
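The two reports above illustrate the point: accuracy differs between the models, yet both show a macro-averaged f1 of 0.53. A small helper can put these numbers side by side. This is a sketch; the function name `summarize` is hypothetical, while `accuracy_score` and `f1_score` are standard scikit-learn metrics.

```python
from sklearn.metrics import accuracy_score, f1_score


def summarize(y_true, y_pred):
    """Return accuracy together with macro- and weighted-averaged f1."""
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        # macro: unweighted mean of per-class f1, so rare classes count as much as common ones
        'macro_f1': f1_score(y_true, y_pred, average='macro'),
        # weighted: per-class f1 averaged by class support
        'weighted_f1': f1_score(y_true, y_pred, average='weighted'),
    }
```

For example, `summarize(y_valid, pipe2.predict(x_valid))` returns the three scores for the second model on the validation set.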

### (3) Generate predictions on the test set

In [13]:
predict_test = pipe2.predict(x_test)

In [14]:
predict_test

array([3, 4, 5, ..., 1, 5, 1])

In [15]:
# save your model predictions
pred_df = pd.DataFrame({'stars': predict_test, 'review_id': test_df['review_id']})
pred_df.to_csv('pred.csv', index=False)

You may then (download and) submit `pred.csv`, which contains the predictions on the test set.
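Before submitting, it can be worth checking that the saved file lines up with the test set. The sketch below is a minimal illustration and not an official validator: the function name `find_submission_problems` is hypothetical, the expected columns are simply those written by the cell above, and the label range 1 to 5 is taken from the classification reports.

```python
import pandas as pd


def find_submission_problems(pred_df, test_df, valid_labels=(1, 2, 3, 4, 5)):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if list(pred_df.columns) != ['stars', 'review_id']:
        problems.append('unexpected columns: ' + ', '.join(map(str, pred_df.columns)))
        return problems  # the remaining checks need both columns
    if len(pred_df) != len(test_df):
        problems.append(f'{len(pred_df)} predictions for {len(test_df)} test rows')
    elif pred_df['review_id'].tolist() != test_df['review_id'].tolist():
        problems.append('review_id values or order differ from the test set')
    if not pred_df['stars'].isin(list(valid_labels)).all():
        problems.append('stars contains values outside the expected labels')
    return problems
```

A call such as `find_submission_problems(pd.read_csv('pred.csv'), test_df)` should return an empty list for the file written above.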