In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os


# M.F.A Site Detector

  Forbes recently published an article stating that around 20% of all programmatic media spend goes to 'Made For Advertising' (MFA) sites. These sites make money by driving user visits through clicks on adverts placed cheaply on social media or at the bottom of genuine news articles; those visits are then turned into advertising revenue on the MFA site itself. The content on MFA sites is generally complete rubbish: fake celebrity gossip, phony deals on expensive clothes, seemingly lucrative financial advice... all designed to drive traffic to the site, no matter how fleeting.

  MFA owners have cleverly made their sites very attractive places to advertise on, at least on the surface: high viewability, impressive click rates and low CPMs mean that when automated processes are deployed to optimise digital marketing campaigns, large sums of money are directed towards them, so much so that MFA is now a multi-billion dollar industry.

  The complete lack of journalistic integrity, and the fact that it is now considered an embarrassment for your brand name to appear on one of these sites, means that advertisers have recently decided that all brand activity on them should cease. Digital marketers have reacted quickly by blocking large lists of sites with a low proportion of organic visits (people reaching the site by simply searching for it) compared with visits driven there by clicks. For the time being this has kept the MFA sites at bay; inevitably, however, ways around it will be found, especially with a multi-billion dollar industry at stake.

  Whether it's the equivalent of click farms in Asia, or bots on ever-moving VPN addresses automating the process, ways will be found to forge organic visits, rendering the ratio mentioned above useless. What won't change about these sites, however, is the content. They keep their costs down by generating countless cheap clickbait articles and then bombarding you with ads before you get the chance to leave the page. A human can take one look at one of these sites and tell what it is. But can a machine? Let's find out...

# Gathering Data

I have a list of 1600 or so MFA sites, generated simply by looking for suspicious ratios of organic to click-driven traffic. A list of similar length was generated for reputable sites. I erred on the side of caution here and went for sites where impressions are more costly, assuming (boldly) that the digital marketing world has collectively got it right and is bidding more for higher-quality inventory. I hope this does not mean any model I make will simply be looking for evidence of more expensive, upper-crust sites; then again, that may not be a bad start!
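
The screening rule behind the MFA list can be sketched roughly as below. The field names (`organic_visits`, `click_visits`), the example domains and the 0.2 threshold are all assumptions for illustration; the actual traffic data and cut-off used to build the list are not shown in this notebook.

```python
# Hypothetical traffic figures; domains, field names and numbers are made up for illustration.
sites = [
    {"domain": "example-news.com", "organic_visits": 8000, "click_visits": 2000},
    {"domain": "example-clickbait.com", "organic_visits": 500, "click_visits": 9500},
]

def organic_share(site):
    """Fraction of all visits that arrived organically (0.0 when there is no traffic)."""
    total = site["organic_visits"] + site["click_visits"]
    return site["organic_visits"] / total if total else 0.0

def flag_suspicious(sites, threshold=0.2):
    """Return domains whose organic share falls below the (assumed) threshold."""
    return [s["domain"] for s in sites if organic_share(s) < threshold]

suspicious = flag_suspicious(sites)
print(suspicious)
```

A rule like this is cheap to apply, but it only looks at traffic, which is exactly the signal that can be forged.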

I then used Selenium to iterate through these lists and scrape anything and everything it could from each site. Let's import the data and see what we've got!




In [None]:
mfa_docs=pd.read_csv(r"/kaggle/input/mfa-docs/MFA_documents.csv")
mfa_docs.drop("Unnamed: 0", axis=1, inplace=True)
mfa_docs["MFA"]=1
mfa_docs.head()

I had to change the encoding from the default UTF-8 for the non-MFA docs, as there are some seriously dodgy characters in there. I will need to remove rows containing anything special before tokenisation.

I've changed MFA from True/False to 1/0 here in the hope that it helps; I suspect the model wasn't set up for booleans.



In [None]:
non_mfa_docs=pd.read_csv(r"/kaggle/input/d/lawrencebutler/non-mfa-docs/Non_MFA_ Documents.csv", encoding='Latin-1')
non_mfa_docs["MFA"]=0
non_mfa_docs.head()

Now let's try to get rid of the dodgy characters.


In [None]:
# Raw string so that \W reaches the regex engine without an invalid-escape warning
non_mfa_docs["Document"]=non_mfa_docs["Document"].str.replace(r'\W',' ', regex=True)
non_mfa_docs.describe()


In [None]:
# Raw string so that \W reaches the regex engine without an invalid-escape warning
mfa_docs["Document"]=mfa_docs["Document"].str.replace(r'\W',' ', regex=True)
mfa_docs.head()

There is still the occasional dodgy character or accented letter (ø, for example), but let's roll with it for now, hoping that during tokenisation I can just ignore any errors thrown, including those from foreign-language (e.g. Arabic) text.
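
A possible explanation for the leftover accented letters, sketched with the standard `re` module (which pandas relies on for regex replacement on ordinary string columns): in Python 3, `\W` is Unicode-aware for string patterns, so letters such as ø count as word characters and survive the replacement. Passing the `re.ASCII` flag is one way to strip them too; whether that is desirable for this data set is a judgment call, and the sample string is made up.

```python
import re

sample = "caf\u00e9 \u00f8!"  # made-up example text: "café ø!"

# Default (Unicode-aware) matching: accented letters are word characters and are kept.
unicode_clean = re.sub(r"\W", " ", sample)

# ASCII-only matching: anything outside [a-zA-Z0-9_] is replaced with a space.
ascii_clean = re.sub(r"\W", " ", sample, flags=re.ASCII)

print(repr(unicode_clean))
print(repr(ascii_clean))
```

In pandas the same flag can be passed to `str.replace` as `flags=re.ASCII`.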

Let's take a sample of each and try tokenising.

In [None]:
df1=mfa_docs.sample(300 , random_state = 42)
df2=non_mfa_docs.sample(300 , random_state=42)
random_samples=[df1,df2]
df=pd.concat(random_samples)
df.rename({'MFA': 'labels'}, axis=1, inplace=True)
df['labels'] = df['labels'].astype(float) 
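df['Document'] = df['Document'].str[:3000]  # a little over 3000 characters seems to be the limit; 4000 is too much
df.head()
# Hold out 50 rows as the final evaluation set
eval_df=df.sample(50 , random_state = 42)
eval_df.head()
# Remove the held-out rows from the training data (keep=False drops every copy of a duplicated row)
df= pd.concat([df , eval_df]).drop_duplicates(keep=False)
df.dropna(inplace=True)
eval_df.head()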

Above I have limited the character count of each document due to memory limit troubles in training.

Let's investigate how much we need to cut it down by.

In [None]:
avg_word_length=5.7
(sum(df['Document'].str.len())/df.shape[0])#/avg_word_length


So about 4.5k characters per document on average, equating to about 800 words; let's cut this down by 75%. It seems around 3000 characters is the limit for the model I'm using here.
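
The arithmetic behind that estimate, as a quick sketch. The 4,500-character average is the rounded figure quoted above and 5.7 is the `avg_word_length` constant from the cell; both are approximations.

```python
avg_chars_per_doc = 4500   # rounded average quoted above
avg_word_length = 5.7      # characters per word, same constant as in the cell above
char_limit = 3000          # truncation limit applied to each document

approx_words_per_doc = avg_chars_per_doc / avg_word_length
approx_words_kept = char_limit / avg_word_length

print(round(approx_words_per_doc))  # 789, i.e. roughly 800 words
print(round(approx_words_kept))     # 526
```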

# Tokenisation

In [None]:
from datasets import Dataset,DatasetDict
ds=Dataset.from_pandas(df)

In [None]:
ds

In [None]:
model_nm = 'microsoft/deberta-v3-small'


In [None]:
from transformers import AutoModelForSequenceClassification,AutoTokenizer
# padding and truncation are arguments to the tokenizer call, not to from_pretrained, so they are not set here
tokz = AutoTokenizer.from_pretrained(model_nm)

In [None]:
def tok_func(x): return tokz(x["Document"])

In [None]:
tok_ds = ds.map(tok_func, batched=True )

In [None]:
# remove_columns returns a new dataset rather than modifying in place, so keep the result
tok_ds = tok_ds.remove_columns(["__index_level_0__"])

I was thrown here as I was typing _ (normal underscore) when in fact a ▁ (long underscore, U+2581) was needed.

In [None]:
#row#["Document"]#,row["input_ids"]
#just checking tokens line up! they do!

# Splitting training and validation sets , creating test set


Transformers uses a DatasetDict for holding your training and validation sets. To create one that contains 25% of our data for the validation set, and 75% for the training set, use train_test_split


Be careful here: confusingly, they've called the validation set the test set!


In [None]:
#tok_ds.rename_column( 'MFA' , 'labels')
dds=tok_ds.train_test_split(0.25 , seed=42)
dds

I should have tokenised everything and then split it into smaller sets so I could scale easily later. I'll do this now. DONE (the larger set came with some garbage, so I dropped the NA rows).



In [None]:
eval_ds=Dataset.from_pandas(eval_df).map(tok_func, batched=True)

# Training Time!

In [None]:
import numpy as np
from transformers import TrainingArguments , Trainer

In [None]:
bs = 2
epochs = 4

In [None]:
lr=8e-5

In [None]:
def corr(x,y): return np.corrcoef(x,y)[0][1]
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}
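
Since the labels here are binary, an accuracy figure is arguably easier to read than a Pearson correlation. A minimal sketch of an alternative metric function follows, assuming (as in this notebook) a single regression output per document that is thresholded at 0.5. The name `acc_d` and the threshold are hypothetical choices for illustration, not part of the Transformers API.

```python
import numpy as np

def acc_d(eval_pred):
    """Accuracy for a single-output regression model with 0/1 labels."""
    preds, labels = eval_pred
    # Flatten (n, 1) predictions to (n,) and threshold at 0.5 to get a class.
    pred_classes = (np.asarray(preds).reshape(-1) >= 0.5).astype(float)
    labels = np.asarray(labels).reshape(-1)
    return {"accuracy": float((pred_classes == labels).mean())}

# Toy check with made-up numbers: three of the four predictions are right.
toy_preds = np.array([[0.9], [0.2], [0.6], [0.4]])
toy_labels = np.array([1.0, 0.0, 0.0, 0.0])
result = acc_d((toy_preds, toy_labels))
print(result)
```

It could be passed to the Trainer as `compute_metrics=acc_d` in place of `corr_d`.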

In [None]:
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=corr_d )

In [None]:
trainer.train();

In [None]:
preds = trainer.predict(eval_ds).predictions.astype(float)
preds

In [None]:
preds = np.clip(preds, 0, 1)
preds

In [None]:
eval_df

In [None]:
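eval_df['labels'] = eval_df['labels'].astype(float)
# Flatten predictions from shape (n, 1) to (n,) so they line up with the labels,
# then take the share of mismatches as a percentage of the evaluation set.
incorrect_percent = (preds.round().ravel() != eval_df['labels'].to_numpy()).mean() * 100
incorrect_percent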


17% error is the best I've got. Not terrible: better than a purely random guess (50% error) by a factor of 3.


Improved to 15% by cleaning the rubbish from the documents.
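
A single error figure hides which way the mistakes go. Below is a sketch, using made-up predictions and labels (not this notebook's actual results), of splitting the errors into reputable sites flagged as MFA and MFA sites that were missed.

```python
import numpy as np

# Made-up values for illustration only.
toy_preds = np.array([[0.8], [0.1], [0.7], [0.3], [0.2]])
toy_labels = np.array([1.0, 0.0, 0.0, 1.0, 0.0])

pred_classes = toy_preds.round().ravel()

error_rate = (pred_classes != toy_labels).mean()
false_positives = int(((pred_classes == 1) & (toy_labels == 0)).sum())  # reputable site flagged as MFA
false_negatives = int(((pred_classes == 0) & (toy_labels == 1)).sum())  # MFA site missed

print(error_rate, false_positives, false_negatives)
```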

In [None]:
preds.round()


In [None]:
eval_df["labels"]

# Improvements to be made

15% error isn't bad, especially considering the small data set and the blurred line between the two types of site I'm choosing between here.

Key changes that I suspect will improve the accuracy further:
* Ensemble the model with others that take into account the ads.txt data from the site
* Swap the base model for one that already has weights trained on website classification
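
As a rough sketch of the first idea: an ads.txt file lists a site's authorised ad sellers, one per line, as comma-separated fields with the third field being DIRECT or RESELLER. The snippet below counts those entries and blends a made-up ads.txt-based score with the text model's score. The function names, the reseller-share feature, the 50/50 weighting and the sample file are all assumptions for illustration; whether reseller share actually separates MFA sites from reputable ones is untested here.

```python
def adstxt_features(text):
    """Count DIRECT and RESELLER records in the contents of an ads.txt file."""
    counts = {"DIRECT": 0, "RESELLER": 0}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        fields = [f.strip() for f in line.split(",")]
        if len(fields) >= 3 and fields[2].upper() in counts:
            counts[fields[2].upper()] += 1
    return counts

def reseller_share(counts):
    """Fraction of seller records marked RESELLER (0.0 if the file has none)."""
    total = counts["DIRECT"] + counts["RESELLER"]
    return counts["RESELLER"] / total if total else 0.0

def ensemble_score(text_score, adstxt_score, text_weight=0.5):
    """Weighted average of two scores in [0, 1]; the weighting is an assumption."""
    return text_weight * text_score + (1 - text_weight) * adstxt_score

# Made-up ads.txt contents for illustration.
sample = """# ads.txt
exampleexchange.com, 1234, DIRECT
examplessp.com, 5678, RESELLER, abc123
examplenetwork.com, 9012, RESELLER
"""
counts = adstxt_features(sample)
score = ensemble_score(text_score=0.9, adstxt_score=reseller_share(counts))
print(counts, round(score, 3))
```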