# Connect Notebook to Google Drive


In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [2]:
root_path = "/content/gdrive/MyDrive/Machine_Learning_NLP_Nora_Pauelsen_TU_Wien"

# Install/Import packages 

In [3]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.2.2-py3-none-any.whl (346 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting dill<0.3.5
  Downloading dill-0.3.4-py2.py3-none-any.whl (86 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Col

In [5]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


In [6]:
from transformers import BertTokenizer
import torch
import numpy as np
import pandas as pd
import datasets
from datasets import Dataset
from datasets import load_metric

In [7]:
#To push model to huggingface later
from huggingface_hub import notebook_login

notebook_login()


# Read in Dataset

In [8]:
df_raw = pd.read_csv("/content/gdrive/MyDrive/Machine_Learning_NLP_Nora_Pauelsen_TU_Wien/data/raw/okcupid_profiles.csv")
df_raw.head(5)
df = df_raw[["sex", "essay0"]]

In [9]:
df_raw.shape # (59946, 31)
#df.groupby(["sex"]).size().plot.bar()

(59946, 31)

In [10]:
df_raw.head(2)

Unnamed: 0,age,status,sex,orientation,body_type,diet,drinks,drugs,education,ethnicity,...,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9
0,22,single,m,straight,a little extra,strictly anything,socially,never,working on college/university,"asian, white",...,about me: i would love to think that i was so...,currently working as an international agent fo...,making people laugh. ranting about a good salt...,"the way i look. i am a six foot half asian, ha...","books: absurdistan, the republic, of mice and ...",food. water. cell phone. shelter.,duality and humorous things,trying to find someone to hang out with. i am ...,i am new to california and looking for someone...,you want to be swept off your feet! you are ti...
1,35,single,m,straight,average,mostly other,often,sometimes,working on space camp,white,...,i am a chef: this is what that means. 1. i am ...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,i am die hard christopher moore fan. i don't r...,delicious porkness in all of its glories. my b...,,,i am very open and will share just about anyth...,


# Use BERT for text classification (female or male)

Tutorials: https://towardsdatascience.com/text-classification-with-bert-in-pytorch-887965e5820f

https://huggingface.co/docs/transformers/tasks/sequence_classification

https://www.google.com/search?q=transfomrer+trainer.train+see+on+one+example&rlz=1C1CHBF_deDE761DE761&oq=transfomrer+trainer.train+see+on+one+example&aqs=chrome..69i57j33i10i160.9172j0j4&sourceid=chrome&ie=UTF-8#kpvalbx=_-AOaYvv9EMfasAeG0ovwDA15

BERT input variables:
* input_ids: ID representation of each token (when decoded: "[CLS] text [SEP] [PAD]...")
* token_type_ids: binary mask that identifies which sequence a token belongs to; for a single sequence, all token type IDs are 0
* attention_mask: binary mask that identifies whether a token is a real token (1) or just padding (0)
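
As a rough illustration of these three fields, the sketch below encodes one sentence with a tiny made-up vocabulary. `TOY_VOCAB` and `toy_encode` are hypothetical names introduced only for this example; a real tokenizer uses a learned subword vocabulary and different IDs.

```python
# Hypothetical toy vocabulary (assumption): real BERT vocabularies are far larger.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "i": 4, "love": 5, "cooking": 6}


def toy_encode(text, max_length):
    """Mimic the shape of a BERT-style encoding for a single sequence."""
    # Leave room for the two special tokens [CLS] and [SEP].
    words = text.lower().split()[: max_length - 2]
    tokens = ["[CLS]"] + words + ["[SEP]"]
    n_pad = max_length - len(tokens)

    # Unknown words map to [UNK]; padding fills the sequence up to max_length.
    input_ids = [TOY_VOCAB.get(t, TOY_VOCAB["[UNK]"]) for t in tokens]
    input_ids += [TOY_VOCAB["[PAD]"]] * n_pad

    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(tokens) + [0] * n_pad

    # A single sequence has token type 0 everywhere.
    token_type_ids = [0] * max_length

    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


encoding = toy_encode("I love cooking", max_length=8)
for name, values in encoding.items():
    print(name, values)
```
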




## Preprocess Data

In [11]:
! pip install datasets
#from datasets import load_dataset 
#imdb = load_dataset("imdb") #was used to see how our dataformat needs to look like

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [12]:
#Filter out NAs of essay0 (about me in profile text)
df = df.dropna(subset =  ["essay0"])
len(df["essay0"]) #54458, before: 59946

54458

In [13]:
#make sex a binary variable
#assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the sliced df
df = df.assign(female=np.where(df["sex"] == "f", 1, 0)) #female = 1, male = 0


In [14]:
df.head(2)

Unnamed: 0,sex,essay0,female
0,m,about me: i would love to think that i was so...,0
1,m,i am a chef: this is what that means. 1. i am ...,0


In [15]:
#split in train, test and validation data: 70% train, 15% test, 15% eval
training_data = df.sample(frac=0.7, random_state=25) #38,121 rows

testing_and_eval_data = df.drop(training_data.index) #30% = eval and test
testing_data = testing_and_eval_data.sample(frac=0.5, random_state=25) #of the 30% -> half is test, 8168 rows 
evaluation_data = testing_and_eval_data.drop(testing_data.index) #8169 rows
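
The 70/15/15 split above can also be sketched without pandas, which makes the arithmetic behind the row counts explicit. This is a minimal sketch using only the standard library; `split_indices` is a hypothetical helper, and its shuffling will not reproduce the exact rows that `df.sample` picks.

```python
import random


def split_indices(n_rows, train_frac=0.7, seed=25):
    """Shuffle row indices and cut them into train / test / eval parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    n_train = round(n_rows * train_frac)
    n_test = (n_rows - n_train) // 2  # half of the remainder goes to test

    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    evaluation = indices[n_train + n_test:]  # whatever is left
    return train, test, evaluation


# 54458 is the number of profiles left after dropping missing essays.
train_idx, test_idx, eval_idx = split_indices(54458)
print(len(train_idx), len(test_idx), len(eval_idx))
```

The three index lists are disjoint by construction, which is the property that matters: no profile should appear in more than one split.
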

In [16]:
train_df = pd.DataFrame({
     "label" : training_data["female"],
     "text" : training_data["essay0"]
})

In [17]:
test_df = pd.DataFrame({
     "label" : testing_data["female"],
     "text" : testing_data["essay0"]
})

In [18]:
eval_df = pd.DataFrame({
     "label" : evaluation_data["female"],
     "text" : evaluation_data["essay0"]
})

In [19]:
test_df.head(2)
train_df.head(2)
eval_df.head(2)

Unnamed: 0,label,text
57,0,"i grew up in iowa. it gets a bad rap, but let ..."
65,0,i really like meeting new people. small-world ...


In [20]:
train_dataset = Dataset.from_dict(train_df)
test_dataset = Dataset.from_dict(test_df)
eval_dataset = Dataset.from_dict(eval_df)
dataset_dict = datasets.DatasetDict({"train":train_dataset,"test":test_dataset, "eval": eval_dataset})

In [21]:
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 38121
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 8168
    })
    eval: Dataset({
        features: ['label', 'text'],
        num_rows: 8169
    })
})

## Tokenize the datasets 

In [22]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

In [23]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [24]:
#Tokenize
tokenized_df = dataset_dict.map(preprocess_function, batched=True)

  0%|          | 0/39 [00:00<?, ?ba/s]

  0%|          | 0/9 [00:00<?, ?ba/s]

  0%|          | 0/9 [00:00<?, ?ba/s]

In [25]:
tokenized_df

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 38121
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 8168
    })
    eval: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 8169
    })
})

## Use padding to make sure all have the same length 

In [26]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
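
To make the idea of padding concrete, here is a small sketch of padding a batch to the length of its longest sequence. `pad_batch` is a hypothetical helper written for this illustration, not the collator's actual implementation, and the token IDs are arbitrary.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 for real tokens, 0 for the padding positions just added.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two sequences of different length (arbitrary IDs, for illustration only).
batch = pad_batch([[101, 7592, 102], [101, 1045, 2293, 8434, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to one global maximum keeps short batches short, which saves computation when essay lengths vary a lot.
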

## Load the pre-trained model: AutoModelForSequenceClassification (for text classification)

In [27]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)#2 labels, because female and male 

Downloading:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'classifier

# Choose a metric

In [28]:
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]
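
What `compute_metrics` does can be traced by hand on a few made-up logits: take the index of the largest logit per row as the predicted class, then compare with the labels. The sketch below uses plain Python instead of NumPy; `accuracy_from_logits` and the toy values are assumptions for illustration.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Turn per-class logits into class predictions and score them."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Four toy examples with two logits each (class 0 = male, class 1 = female).
toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 1.1]]
toy_labels = [0, 1, 0, 0]

result = accuracy_from_logits(toy_logits, toy_labels)
print(result)
```
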

In [29]:
metric_name = "accuracy"

**1. Define your training hyperparameters in TrainingArguments**

**2. Pass the training arguments to Trainer along with the model, dataset, tokenizer, and data collator**

**3. Call train() to fine-tune your model**

In [30]:
training_args = TrainingArguments(
    output_dir="/content/gdrive/MyDrive/Machine_Learning_NLP_Nora_Pauelsen_TU_Wien/model results", #save model in my google drive
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3, 
    weight_decay=0.01,
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    load_best_model_at_end=True,
    metric_for_best_model=metric_name
    
)

#do the same for eval data
#look at the Trainer method: when is a batch passed in, and what output comes back?
#pass a batch to the model
#use 1 text example - if it works, take the whole eval dataset (for loop over all data, generate output, look at e.g. accuracy)
#for tuning: use the test data; at the end, once training is done: validate with the eval dataset (otherwise the estimate is biased)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_df["train"],
    eval_dataset=tokenized_df["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics = compute_metrics #use accuracy metrics defined above
)


In [31]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 38121
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2383


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: ignored

In [None]:
#push model to hub 
model.push_to_hub("my-finetuned-bert")

## Evaluate the Model 

# Inference
Let's test the model on a random sentence

In [None]:
# check for one example 
text_example = ["I am as complex as a plant"]
encoding = tokenizer(text_example, return_tensors="pt")
#encoding = tokenizer(text, return_tensors="pt")
encoding = {k: v.to(trainer.model.device) for k,v in encoding.items()}
outputs = trainer.model(**encoding)

In [None]:
outputs

In [None]:
predictions = outputs.logits.argmax(-1)

In [None]:
predictions

To evaluate the model, we need a metric. We use accuracy.

In [None]:
trainer.evaluate()

Use the third dataset (eval), which is not used in training or optimization, only at the end: for the final comparison, to see which model is best.




In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

#Dictionary that maps the category in the dataframe into the id representation of our label 
labels = {"f": 0,          
          "m": 1}

In [None]:
df["essay0"] = df["essay0"].astype(str) #to_string() would collapse the whole column into one string and write it into every row

#print(tokenizer(text, padding = "max_length", max_length = 512, truncation = True, return_tensors = "pt"))

In [None]:
#df['female'] = np.where(df['sex'] == "f", 1, 0)
#print(df["female"])
#df['male'] = np.where(df['sex'] == "m", 1, 0)
#print(df["male"])

# Predict sex with topic probability vector from BERTopic

In [32]:
import pandas as pd
import torch
import numpy as np
from sklearn.model_selection import train_test_split
from torch import nn
import matplotlib.pyplot as plt

1. Load processed data

In [33]:
path = "/content/gdrive/MyDrive/Machine_Learning_NLP_Nora_Pauelsen_TU_Wien/data/processed/"

In [34]:
#1. Target vector Y (sex) and most probable topic

df_topics = pd.read_csv(path + "df_topics.csv")
df_topics = df_topics.drop(columns="Unnamed: 0")

df_topics_100 = pd.read_csv(path + "df_topics_100.csv")
df_topics_100 = df_topics_100.drop(columns="Unnamed: 0")

df_topics_50 = pd.read_csv(path + "df_topics_50.csv")
df_topics_50 = df_topics_50.drop(columns="Unnamed: 0")

#2. Feature Vector X (topic probabilities)
df_probs = pd.read_csv(path + "df_probs.csv")
df_probs = df_probs.drop(columns="Unnamed: 0")


In [35]:
df_topics.head()

Unnamed: 0,Profile_text,most_probable_topic,Sex
0,about me: i would love to think that i was so...,-1,m
1,i am a chef: this is what that means. 1. i am ...,-1,m
2,"i'm not ashamed of much, but writing public te...",5,m
3,i work in a library and go to school. . .,-1,m
4,hey how's it going? currently vague on the pro...,-1,m


In [36]:
df_probs

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,219,220,221,222,223,224,225,226,227,sum_probabilities
0,0.003714,0.003332,0.004207,0.004285,0.028899,0.003352,0.003247,0.003272,0.004203,0.006259,...,0.002279,0.003432,0.003922,0.005268,0.008813,0.002133,0.004654,0.002327,0.010372,0.903629
1,0.008174,0.002646,0.003544,0.003185,0.005678,0.003403,0.004280,0.003661,0.002411,0.003769,...,0.003661,0.007638,0.005217,0.004998,0.004358,0.003140,0.004311,0.003427,0.005164,0.927664
2,0.002679,0.002559,0.003200,0.006352,0.003332,0.042416,0.002240,0.005888,0.002534,0.002981,...,0.002084,0.002619,0.004843,0.004646,0.002749,0.002459,0.002454,0.002191,0.005488,0.737023
3,0.003110,0.002981,0.003531,0.001953,0.003629,0.002078,0.008896,0.002469,0.002209,0.003655,...,0.002526,0.005787,0.002926,0.003137,0.004385,0.002240,0.010798,0.002827,0.002714,0.808096
4,0.002367,0.003329,0.004879,0.002862,0.002853,0.005279,0.002856,0.007899,0.002402,0.003368,...,0.002029,0.003278,0.006281,0.005321,0.002942,0.002506,0.002876,0.002282,0.003453,0.740662
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54453,0.002095,0.011450,0.040836,0.002320,0.003543,0.002442,0.003576,0.003640,0.003411,0.005755,...,0.001636,0.003193,0.004437,0.005437,0.003899,0.001675,0.003883,0.001742,0.003044,0.806925
54454,0.000925,0.000365,0.000434,0.000421,0.000501,0.000463,0.000540,0.000464,0.000310,0.000406,...,0.000798,0.000664,0.000500,0.000489,0.000503,0.000931,0.000609,0.001565,0.000477,0.145183
54455,0.014907,0.001963,0.002412,0.002414,0.004004,0.002413,0.003353,0.002458,0.001824,0.002609,...,0.004383,0.005260,0.003087,0.002900,0.003206,0.002996,0.003398,0.003733,0.003324,0.813876
54456,0.001384,0.002377,0.002651,0.001511,0.004441,0.001263,0.001701,0.001437,0.002965,0.003796,...,0.000960,0.001648,0.001774,0.002284,0.004793,0.000899,0.002635,0.000991,0.002328,0.459264


In [37]:
df_probs["most_probable_topic"] = df_topics["most_probable_topic"]

In [38]:
df_probs["female"] = np.where(df_topics["Sex"]=="f", 1,0)

In [39]:
df_probs = df_probs.drop(columns="sum_probabilities")


In [78]:
df_probs

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,220,221,222,223,224,225,226,227,most_probable_topic,female
0,0.003714,0.003332,0.004207,0.004285,0.028899,0.003352,0.003247,0.003272,0.004203,0.006259,...,0.003432,0.003922,0.005268,0.008813,0.002133,0.004654,0.002327,0.010372,-1,0
1,0.008174,0.002646,0.003544,0.003185,0.005678,0.003403,0.004280,0.003661,0.002411,0.003769,...,0.007638,0.005217,0.004998,0.004358,0.003140,0.004311,0.003427,0.005164,-1,0
2,0.002679,0.002559,0.003200,0.006352,0.003332,0.042416,0.002240,0.005888,0.002534,0.002981,...,0.002619,0.004843,0.004646,0.002749,0.002459,0.002454,0.002191,0.005488,5,0
3,0.003110,0.002981,0.003531,0.001953,0.003629,0.002078,0.008896,0.002469,0.002209,0.003655,...,0.005787,0.002926,0.003137,0.004385,0.002240,0.010798,0.002827,0.002714,-1,0
4,0.002367,0.003329,0.004879,0.002862,0.002853,0.005279,0.002856,0.007899,0.002402,0.003368,...,0.003278,0.006281,0.005321,0.002942,0.002506,0.002876,0.002282,0.003453,-1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54453,0.002095,0.011450,0.040836,0.002320,0.003543,0.002442,0.003576,0.003640,0.003411,0.005755,...,0.003193,0.004437,0.005437,0.003899,0.001675,0.003883,0.001742,0.003044,-1,1
54454,0.000925,0.000365,0.000434,0.000421,0.000501,0.000463,0.000540,0.000464,0.000310,0.000406,...,0.000664,0.000500,0.000489,0.000503,0.000931,0.000609,0.001565,0.000477,-1,0
54455,0.014907,0.001963,0.002412,0.002414,0.004004,0.002413,0.003353,0.002458,0.001824,0.002609,...,0.005260,0.003087,0.002900,0.003206,0.002996,0.003398,0.003733,0.003324,-1,0
54456,0.001384,0.002377,0.002651,0.001511,0.004441,0.001263,0.001701,0.001437,0.002965,0.003796,...,0.001648,0.001774,0.002284,0.004793,0.000899,0.002635,0.000991,0.002328,-1,0


In [76]:
second_user = df_probs.iloc[1, :-2]

In [77]:
second_user

0      0.008174
1      0.002646
2      0.003544
3      0.003185
4      0.005678
         ...   
223    0.004358
224    0.003140
225    0.004311
226    0.003427
227    0.005164
Name: 1, Length: 228, dtype: float64

In [79]:
second_user.idxmax()

'209'

In [None]:
#df_probs.iloc[:,:-1]

2. Do a train/test split

In [40]:
X_train, X_test, y_train, y_test = train_test_split(df_probs.iloc[:,:-1], df_probs["female"], test_size=0.33, random_state=42) #random state to make it reproducible

In [41]:
y_train

48956    1
44255    1
54302    1
8892     1
30910    1
        ..
44732    0
54343    1
38158    1
860      0
15795    0
Name: female, Length: 36486, dtype: int64

In [42]:
X_train
#we have 229 feature columns (228 topic probabilities plus most_probable_topic) and 36486 user profile texts

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,219,220,221,222,223,224,225,226,227,most_probable_topic
48956,0.008038,0.000815,0.001005,0.001159,0.001838,0.001134,0.001234,0.001091,0.000809,0.001133,...,0.001598,0.001752,0.001373,0.001224,0.001255,0.001174,0.001249,0.001299,0.001667,-1
44255,0.001452,0.002058,0.003487,0.001085,0.001703,0.001259,0.005320,0.001731,0.001256,0.002686,...,0.001154,0.003244,0.002097,0.002326,0.001929,0.001116,0.002621,0.001214,0.001498,-1
54302,0.000556,0.001501,0.001444,0.000539,0.001057,0.000534,0.000996,0.000661,0.000869,0.001491,...,0.000432,0.000880,0.000810,0.001084,0.001946,0.000423,0.002106,0.000478,0.000747,-1
8892,0.006649,0.001701,0.001967,0.002007,0.002276,0.002149,0.002568,0.002110,0.001482,0.001897,...,0.009726,0.002707,0.002108,0.001894,0.001976,0.007315,0.002427,0.015144,0.002146,158
30910,0.004601,0.001174,0.001372,0.001449,0.001569,0.001579,0.001726,0.001527,0.001045,0.001309,...,0.006168,0.001907,0.001553,0.001348,0.001379,0.006203,0.001675,0.010368,0.001547,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44732,0.002334,0.005925,0.006016,0.003270,0.007022,0.002767,0.002869,0.003120,0.014041,0.007709,...,0.001744,0.002889,0.003684,0.005191,0.006986,0.001737,0.004163,0.001855,0.004591,-1
54343,0.001802,0.002427,0.002764,0.002007,0.007222,0.001578,0.001935,0.001761,0.003016,0.003801,...,0.001187,0.002038,0.002216,0.002824,0.005524,0.001103,0.002951,0.001211,0.003265,-1
38158,0.005077,0.001902,0.002175,0.002089,0.002341,0.002265,0.002796,0.002298,0.001590,0.002022,...,0.006189,0.002853,0.002253,0.002014,0.002114,0.007411,0.002730,0.012937,0.002206,-1
860,0.002308,0.000561,0.000746,0.000623,0.000994,0.000685,0.001057,0.000755,0.000496,0.000759,...,0.001014,0.002005,0.001020,0.000955,0.000826,0.000773,0.000876,0.000851,0.000901,-1


### Convert X and y labels to numpy

In [43]:
X_train = X_train.to_numpy()
y_train = y_train.to_numpy()

In [44]:
X_test = X_test.to_numpy()
X_test = torch.from_numpy(X_test)

In [45]:
y_test = y_test.to_numpy()
y_test = torch.from_numpy(y_test)

### Make X and y labels tensors

In [46]:
X_train = torch.from_numpy(X_train)

In [47]:
X_train.shape

torch.Size([36486, 229])

In [48]:
y_train = torch.from_numpy(y_train)

In [49]:
type(y_train)

torch.Tensor

In [50]:
y_train.shape

torch.Size([36486])

In [51]:
X_train.shape, y_train.shape

(torch.Size([36486, 229]), torch.Size([36486]))

In [52]:
X_train[:5], y_train[:5]

(tensor([[ 8.0377e-03,  8.1451e-04,  1.0053e-03,  ...,  1.2992e-03,
           1.6667e-03, -1.0000e+00],
         [ 1.4524e-03,  2.0584e-03,  3.4873e-03,  ...,  1.2140e-03,
           1.4979e-03, -1.0000e+00],
         [ 5.5613e-04,  1.5011e-03,  1.4440e-03,  ...,  4.7801e-04,
           7.4718e-04, -1.0000e+00],
         [ 6.6493e-03,  1.7014e-03,  1.9668e-03,  ...,  1.5144e-02,
           2.1464e-03,  1.5800e+02],
         [ 4.6006e-03,  1.1740e-03,  1.3718e-03,  ...,  1.0368e-02,
           1.5472e-03, -1.0000e+00]], dtype=torch.float64),
 tensor([1, 1, 1, 1, 1]))

In [53]:
import torch
import torch.nn as nn

## PyTorch Workflow

## 2. Create a model (input, output size, forward pass)

In [54]:
# Make device agnostic code
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [55]:
# 1. Construct a model class that subclasses nn.Module
class NeuralNetwork_binary(nn.Module):
    def __init__(self):
        super().__init__()
        # 2. Create 2 nn.Linear layers capable of handling X and y input and output shapes
        self.layer_1 = nn.Linear(in_features=229, out_features=500) # takes in 229 features (228 topic probabilities + most_probable_topic), produces 500 hidden units. QUESTION: how many output features here (i.e. how many hidden units)?
        self.layer_2 = nn.Linear(in_features=500, out_features=1)
        #self.layer_3 = nn.Linear(in_features=500, out_features=1) # takes in 500 features, produces 1 feature (y)
        self.relu = nn.ReLU() # <- add in ReLU activation function

    # 3. Define a forward method containing the forward pass computation
    def forward(self, x):
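        # Pass x through layer_1, ReLU, layer_2 and a final ReLU; the output is a single feature, the same shape as y
        # NOTE: the final ReLU clamps the outputs to >= 0, although BCEWithLogitsLoss expects unbounded logits;
        # returning self.layer_2(self.relu(self.layer_1(x))) would leave the logits unclamped
        return self.relu(self.layer_2(self.relu(self.layer_1(x))))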

In [56]:
# 4. Create an instance of the model and send it to target device
model_0 = NeuralNetwork_binary().to(device)
model_0

NeuralNetwork_binary(
  (layer_1): Linear(in_features=229, out_features=500, bias=True)
  (layer_2): Linear(in_features=500, out_features=1, bias=True)
  (relu): ReLU()
)

2.) Construct loss and optimizer

3.) Training loop (iterate this):
- forward pass: compute prediction
- backward pass: compute gradients
- update weights

## Define a Loss Function and Optimizer
Because this is a binary classification problem, we use binary cross-entropy as the loss function (`nn.BCEWithLogitsLoss`, which applies the sigmoid internally).
We use stochastic gradient descent (SGD) as the optimizer.
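
For reference, binary cross-entropy on a raw logit can be written in two equivalent ways: the textbook form via the sigmoid, and a common numerically stable rewrite that avoids taking the logarithm of a value close to 0. The sketch below is a plain-Python illustration of that equivalence, not PyTorch's internal code; the function names are made up for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_naive(z, y):
    """Textbook form: -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z)."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_with_logits(z, y):
    """Stable rewrite of the same quantity: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


for z, y in [(-2.0, 0), (-2.0, 1), (0.0, 1), (1.5, 0), (3.0, 1)]:
    print(f"z={z:+.1f} y={y} naive={bce_naive(z, y):.6f} stable={bce_with_logits(z, y):.6f}")
```

A logit of 0 corresponds to a predicted probability of 0.5, for which the loss is ln 2 (about 0.6931) regardless of the label.
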

In [57]:
# Create a loss function
# loss_fn = nn.BCELoss() # BCELoss = no sigmoid built-in
loss_fn = nn.BCEWithLogitsLoss()
# Create an optimizer
optimizer = torch.optim.SGD(params=model_0.parameters(),
                            lr=0.1)

## Define a function for calculating accuracy as evaluation metric

In [58]:
# Calculate accuracy (a classification metric)
def accuracy_fn(y_true, y_pred):
    correct = torch.eq(y_true, y_pred).sum().item() # torch.eq() calculates where two tensors are equal
    acc = (correct / len(y_pred)) * 100
    return acc

## Training the model
1. Forward Pass: Model goes through all of the training data once
2. Calculate the Loss
3. Set optimizer gradients to zero
4. Perform backpropagation on the Loss
5. Update the parameters with gradient descent
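
The five steps above can be traced on a problem small enough to do by hand. The following sketch trains a one-feature logistic regression with manually derived gradients; the data, the learning rate, and the variable names are all made up for illustration, and it stands in for what autograd and the optimizer do in the PyTorch loop.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy data (assumption): one feature, label is 1 when the feature is positive.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b, lr = 0.0, 0.0, 0.1
losses = []

for epoch in range(200):
    # 1. Forward pass over the whole training set
    probs = [sigmoid(w * x + b) for x in xs]

    # 2. Binary cross-entropy loss, averaged over the examples
    loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, ys)) / len(xs)
    losses.append(loss)

    # 3. Reset the gradients
    grad_w, grad_b = 0.0, 0.0

    # 4. Backward pass: for this loss, d(loss)/d(logit) = p - y
    for x, p, y in zip(xs, probs, ys):
        grad_w += (p - y) * x / len(xs)
        grad_b += (p - y) / len(xs)

    # 5. Gradient descent update
    w -= lr * grad_w
    b -= lr * grad_b

print(f"first loss: {losses[0]:.5f}, last loss: {losses[-1]:.5f}, w: {w:.3f}")
```

Because every step uses all examples at once, this is full-batch gradient descent, the same regime as a loop that feeds the complete training tensor to the model in each epoch.
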

In [59]:
torch.manual_seed(42)

# Set the number of epochs
epochs = 200
#Put data to target device (Cuda if possible)
X_train, y_train = X_train.to(device), y_train.to(device)
X_test, y_test = X_test.to(device), y_test.to(device)

In [60]:
from tqdm import tqdm

In [61]:
# Build training and evaluation loop
for epoch in tqdm(range(epochs)):
    ### Training
    model_0.train()

    # 1. Forward pass (model outputs raw logits)
    y_logits = model_0(X_train.float()).squeeze() # squeeze to remove extra `1` dimensions, this won't work unless model and data are on same device
    y_pred = torch.round(torch.sigmoid(y_logits)) # turn logits -> pred probs -> pred labels

    # 2. Calculate loss/accuracy
    loss = loss_fn(y_logits,
                   y_train.float())
    acc = accuracy_fn(y_true=y_train.float(),
                      y_pred=y_pred)

    # 3. Optimizer zero grad
    optimizer.zero_grad()

    # 4. Loss backwards
    loss.backward()

    # 5. Optimizer step
    optimizer.step()

    ### Testing
    model_0.eval()
    with torch.inference_mode():
        # 1. Forward pass
        test_logits = model_0(X_test.float()).squeeze()
        test_pred = torch.round(torch.sigmoid(test_logits))
        # 2. Calculate loss/accuracy
        test_loss = loss_fn(test_logits,
                            y_test.float())
        test_acc = accuracy_fn(y_true=y_test.float(),
                               y_pred=test_pred)

    # Print out what's happening every 10 epochs
    if epoch % 10 == 0:
        print(f"Epoch: {epoch} | Loss: {loss:.5f}, Accuracy: {acc:.2f}% | Test loss: {test_loss:.5f}, Test acc: {test_acc:.2f}%")

  4%|▍         | 8/200 [00:00<00:02, 72.37it/s]

Epoch: 0 | Loss: 0.69436, Accuracy: 44.96% | Test loss: 0.69318, Test acc: 46.74%
Epoch: 10 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%


 18%|█▊        | 37/200 [00:00<00:01, 90.38it/s]

Epoch: 20 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%
Epoch: 30 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%


 28%|██▊       | 57/200 [00:00<00:01, 94.67it/s]

Epoch: 40 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%
Epoch: 50 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%


 38%|███▊      | 77/200 [00:00<00:01, 96.49it/s]

Epoch: 60 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%
Epoch: 70 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%


 48%|████▊     | 97/200 [00:01<00:01, 97.20it/s]

Epoch: 80 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%
Epoch: 90 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%


 58%|█████▊    | 117/200 [00:01<00:00, 96.09it/s]

Epoch: 100 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%
Epoch: 110 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%


 68%|██████▊   | 137/200 [00:01<00:00, 96.21it/s]

Epoch: 120 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%
Epoch: 130 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%


 78%|███████▊  | 157/200 [00:01<00:00, 96.35it/s]

Epoch: 140 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%
Epoch: 150 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%


 88%|████████▊ | 177/200 [00:01<00:00, 96.53it/s]

Epoch: 160 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%
Epoch: 170 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%


100%|██████████| 200/200 [00:02<00:00, 94.74it/s]

Epoch: 180 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%
Epoch: 190 | Loss: 0.69315, Accuracy: 59.85% | Test loss: 0.69315, Test acc: 60.15%
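
The loss in the log above stays at 0.69315, which is ln 2, and the accuracy never moves. One possible explanation, offered as an assumption rather than a confirmed diagnosis: the forward pass applies ReLU to the output of `layer_2`, so any negative output becomes a logit of exactly 0, the sigmoid returns 0.5, and rounding 0.5 gives the label 0 for every profile. The sketch below reproduces that arithmetic in plain Python with made-up pre-activation values.

```python
import math


def relu(z):
    return max(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(p, y):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# Hypothetical negative outputs of the last linear layer, with mixed labels.
pre_activations = [-3.2, -0.7, -1.5, -0.1]
labels = [1, 0, 0, 1]

logits = [relu(z) for z in pre_activations]   # all clamped to 0.0
probs = [sigmoid(z) for z in logits]          # all exactly 0.5
mean_loss = sum(bce(p, y) for p, y in zip(probs, labels)) / len(labels)

# Python's round(), like torch.round, rounds 0.5 to the nearest even integer, i.e. 0.
preds = [round(p) for p in probs]

print(f"mean loss: {mean_loss:.5f}")
print("predictions:", preds)
```

If this is what happens, the reported accuracy would simply be the share of label 0 in each split, and ReLU passes no gradient through the clamped outputs, so further epochs cannot change the result.
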





# Check for most probable topic in Neural Net 