# Snapchat Political Ads
* **See the main project notebook for instructions to be sure you satisfy the rubric!**
* See Project 03 for information on the dataset.
* A few example prediction questions to pursue are listed below. However, don't limit yourself to them!
    * Predict the reach (number of views) of an ad.
    * Predict how much was spent on an ad.
    * Predict the target group of an ad. (For example, predict the target gender.)
    * Predict the (type of) organization/advertiser behind an ad.

Be careful to justify what information you would know at the "time of prediction" and train your model using only those features.

# Summary of Findings


### Introduction
The prediction problem we are attempting is to predict how much a company spends on an advertisement. This is a regression problem that requires us to build models based on the features(Gender, Impressions, and Duration) that we assume will affect how much a company spends. Our evaluation metric is set to have a r ** 2 score of at least 0.75 for our accuracy.

### Baseline Model
For our baseline model, we wanted to use Gender and OSType to predict how much a company spends on an advertisement. We wanted to predict if companies who targeted all OSTypes and Genders would have to spend more. If the column value contained NaN, this would mean it is agnostic(targeting all Genders and OSTypes). We transformed these two categorical features using a Pipeline of SimpleImputer to fill all NaN values to 'all' and OneHotEncoder to turn these categorical values into digits so that we can model a LinearRegression Pipeline to predict how much a company spends on an advertisement. We tested the score of the prediction and received a '0.001026' which is a very low score. The reason to why we received that score and why majority of our predictions showed that most companies spent the same amount(1680) is because many of the values in the Gender and OSType columns are NaN, where majority of them may have turned to 0 which 'biases' the data when trying to predict how much a company makes. Also, we only used two categorical features to make our prediction, so of course it would not predict how much a company spends accurately. What would happen if we added more features to predict such as numerical features? We will show what will happen in our Final Model below.

### Final Model
To improve our model, we decided to include more categorical features such as RegionID, ElectoralID, AgeBracket, and BillingAddress as we believe it would help us predict how much a company spends on an advertisement based on what age group and area they are targeting. Additionally, we added numerical features such as Impressions, StartDate, and EndDate; we wanted to use the duration of an advertisement to help us predict the costs. To do this, we converted StartDate and EndDate columns into pandas datetime and converted all the datetimes into hours so that it would be easier to compute its duration and mean for when we model it within a Pipeline. For the numerical features, we modeled a Pipeline where we preprocessed the feature's mean and then normalized it using the z-scaled our data to make it look like a normal distribution to help our model perform better. For the categorical features, we modeled a Pipeline where we preprocessed the categories by filling the missing values, NaN, with 'all' using the SimpleImputer, and then utilized OneHotEncoding to distinguish the companies who targeted all types of: Gender, OsType, RegionID, ElectoralRegionID, Segments, and AgeBracket between those who only targeted 1 certain group. For the BillingAddress column, we believed it would help our model as it distinguishes the cost of an advertisement based on a company's location. 

### Fairness Evaluation
We wanted to test that between two languages the price of an ad would remain the same, so we compared the two largest single language target for each ad, which would be French and English. Our null hypothesis would be that our model predicts the price of the given ad fairly between the two languages. Our signifigance level was: 0.08, which is above alpha meaning that we fail to reject our null hypothesis which means that our model was not "fair" due to the fact that we didn't include "Language" as a feature of our model to predict the cost.

# Code

In [293]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler, OrdinalEncoder, Binarizer
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
import datetime as dt
from sklearn import metrics
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

In [279]:
fp_2018 = os.path.join('PoliticalAds18', 'PoliticalAds2018.csv')
fp_2019 = os.path.join('PoliticalAds19', 'PoliticalAds2019.csv')
data_2018 = pd.read_csv(fp_2018)
data_2019 = pd.read_csv(fp_2019)
bofa = pd.concat([data_2018, data_2019], ignore_index=True)
print(bofa.columns)
bofa.head()

Index(['ADID', 'CreativeUrl', 'Spend', 'Impressions', 'StartDate', 'EndDate',
       'OrganizationName', 'BillingAddress', 'CandidateBallotInformation',
       'PayingAdvertiserName', 'Gender', 'AgeBracket', 'CountryCode',
       'RegionID', 'ElectoralDistrictID', 'LatLongRad', 'MetroID', 'Interests',
       'OsType', 'Segments', 'LocationType', 'Language',
       'AdvancedDemographics', 'Targeting Connection Type',
       'Targeting Carrier (ISP)', 'Targeting Geo - Postal Code',
       'CreativeProperties'],
      dtype='object')


Unnamed: 0,ADID,CreativeUrl,Spend,Impressions,StartDate,EndDate,OrganizationName,BillingAddress,CandidateBallotInformation,PayingAdvertiserName,...,Interests,OsType,Segments,LocationType,Language,AdvancedDemographics,Targeting Connection Type,Targeting Carrier (ISP),Targeting Geo - Postal Code,CreativeProperties
0,2ac103bc69cce2d24b198e6a6d052dbff2c25ae9b6bb9e...,https://www.snap.com/political-ads/asset/69afd...,165,49446,2018/11/01 22:42:22Z,2018/11/06 23:00:00Z,Bully Pulpit Interactive,"1140 Connecticut Ave NW, Suite 800,Washington,...",,NextGen America,...,,,,,,,,,,web_view_url:https://nextgenamerica.org/lookup...
1,40ee7e900be9357ae88181f5c8a56baf6d5aab0e8d0f51...,https://www.snap.com/political-ads/asset/0885d...,17,23805,2018/11/15 15:52:06Z,2018/11/24 15:50:38Z,Amnesty International Switzerland,CH,,Amnesty International,...,,,Provided by Advertiser,,de,,,,,
2,c80ca50681d552551ceaf625981c0202589ca710d51925...,https://www.snap.com/political-ads/asset/a36b7...,60,12883,2018/09/28 23:10:14Z,2018/10/10 02:00:00Z,Chong and Koster,"1640 Rhode Island Ave. NW, Suite 600,Washingto...",,Voter Participation Center,...,,,Provided by Advertiser,,,Marital Status (Single),,,,web_view_url:https://www.voterparticipation.or...
3,a3106af2289b62f57f63f4fb89753bdf94e2fadede0478...,https://www.snap.com/political-ads/asset/46819...,2492,377236,2018/10/27 19:23:19Z,2018/11/06 23:00:00Z,"Middle Seat Consulting, LLC","Po Box 21600,Washington,20009,US",,Beto for Texas,...,,,,,,,,,,web_view_url:https://betofortexas.com/vote/?ut...
4,7afda4224482eb70315797966b4dcdeb856df916df5bdc...,https://www.snap.com/political-ads/asset/ee833...,5795,467760,2018/10/25 04:00:00Z,2018/11/06 23:00:00Z,"Middle Seat Consulting, LLC","Po Box 21600,Washington,20009,US",,Beto for Texas,...,,,,,,,,,,


### Baseline Model

In [280]:
cat_features = ['Gender', 'OsType']
cat_transformer = Pipeline([('const', SimpleImputer(strategy='constant', fill_value='all')), 
                           ('one-hot', OneHotEncoder(handle_unknown='ignore'))])
preproc = ColumnTransformer([('cat', cat_transformer, cat_features)])
lr = LinearRegression()
pl = Pipeline([('preproc', preproc), ('lin-reg', lr)])
X = bofa[['Gender', 'OsType']]
Y = bofa['Spend']
pl.fit(X, Y)
print(pl.predict(X))
pl.score(X, Y) #Using ONLY Gender and OsType as features for our modeling pipeline is terrible as you can see from
#the score below

[1680. 1680.  560. ... 1680. 1680. 1680.]


0.0010255948413259164

### Final Model

Both columns of 'StartDate' and 'EndDate' are object types, so we converted it into pandas datetimes where we would find the difference between the two columns to assign a 'Duration' column to show how long an advertisement has run for. We changed the datetimes of 'Duration' column into hours so we can z-scale the data later for our Pipeline model.

In [281]:
bofa['StartDate'] = pd.to_datetime(bofa['StartDate'])
bofa['EndDate'] = pd.to_datetime(bofa['EndDate'])
bofa['Duration'] = bofa['EndDate'] - bofa['StartDate']
def convert_timedelta(duration):
    days, seconds = duration.days, duration.seconds
    hours = (days * 24 + seconds // 3600)
    minutes = (seconds % 3600) // 60
    seconds = (seconds % 60)
    return hours + (minutes / 60) + (seconds / 3600)
bofa['Duration'] = bofa['Duration'].apply(convert_timedelta)

In [294]:
num_features = ['Duration', 'Impressions']
cat_features_final = ['Gender', 'OsType', 'RegionID', 'ElectoralDistrictID',
                      'AgeBracket', 'BillingAddress', 'Segments']
cat_final_transformer = Pipeline([('const', SimpleImputer(strategy='constant', fill_value='all')),
                                  ('one-hot', OneHotEncoder(handle_unknown='ignore'))])
num_transformer = Pipeline([('mean', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])
preproc_final = ColumnTransformer([('num', num_transformer, num_features), ('cat', cat_final_transformer, cat_features_final)])
pl_final = Pipeline([('preproc_final', preproc_final), ('lin-reg-final', lr)])
X_final = bofa[['Duration', 'Impressions', 'Gender', 'OsType', 'RegionID',
                'ElectoralDistrictID', 'AgeBracket', 'BillingAddress', 'Segments']]
Y = bofa['Spend']
X_tr, X_ts, Y_tr, Y_ts = train_test_split(X_final, Y, test_size = 0.25)
pl_final.fit(X_tr, Y_tr)
pred = pl_final.predict(X_ts)
pred

array([-3.43843786e+01, -2.10598084e+01,  1.54149902e+02, -1.31846150e+02,
       -9.05298309e+01, -1.32984689e+02,  3.87418444e+04,  1.33689936e+03,
       -1.20250295e+00, -5.53751431e+01, -2.02898531e+02,  1.66496041e+03,
        4.81765236e+02,  6.18182473e+02, -1.22041638e+02, -4.09932005e+02,
        6.21137536e+02,  4.39509936e+02,  9.72627598e+02, -8.70024889e+01,
        3.67206145e+01, -1.59795878e+01,  6.09321023e+01,  5.87870838e+02,
        7.46939018e+02,  4.15655622e+02,  5.34550605e+01,  4.13005146e+01,
        7.54199853e+02, -2.81742270e+01,  1.58289170e+02,  1.68395085e+02,
        1.90235847e+02,  3.32922414e+02,  5.80741165e+01,  4.29738098e+02,
        7.24588215e+02, -1.24247874e+00,  1.76435252e+03, -2.57792760e+03,
       -3.52221321e+01,  4.43908167e+01,  1.58960564e+03,  5.59086593e+01,
        9.22733124e+01,  1.14748693e+03,  1.94372125e+03,  1.14511535e+03,
       -1.27694741e+02,  6.28896018e+03, -6.71370624e+02,  6.80929171e+01,
        6.89397361e+00,  

In [295]:
print(pl_final.score(X_tr, Y_tr))

0.8916952518602872


### Fairness Evaluation

In [284]:
a = bofa.loc[bofa['Language'] == 'en']
b = bofa.loc[bofa['Language'] == 'fr']

In [285]:
a_X = a[['Duration', 'Impressions', 'Gender', 'OsType', 'RegionID', 
             'ElectoralDistrictID', 'AgeBracket', 'BillingAddress', 'Segments']]
a_Y = a['Spend']
b_X = b[['Duration', 'Impressions', 'Gender', 'OsType', 'RegionID', 
                 'ElectoralDistrictID', 'AgeBracket', 'BillingAddress', 'Segments']]
b_Y = b['Spend']
def r_diff(a_x, a_y, b_x, b_y):
    pl_final.fit(a_x, a_y)
    pl_final.fit(b_x, b_y)
    return pl_final.score(b_x, b_y) - pl_final.score(a_x, a_y)
obs = r_diff(a_X,a_Y,b_X,b_Y)
both = pd.concat([a,b])
n_reps = 100
results = []
for i in range(n_reps):
    sample = both['Language'].sample(frac=1, replace=False).reset_index(drop=True)
    both['Shuffled'] = sample
    an = both.loc[both['Shuffled'] == 'en']
    bn = both.loc[both['Shuffled'] == 'fr']
    a_Xn = an[['Duration', 'Impressions', 'Gender', 'OsType', 'RegionID', 
             'ElectoralDistrictID', 'AgeBracket', 'BillingAddress', 'Segments']]
    a_Yn = an['Spend']
    b_Xn = bn[['Duration', 'Impressions', 'Gender', 'OsType', 'RegionID', 
                 'ElectoralDistrictID', 'AgeBracket', 'BillingAddress', 'Segments']]
    b_Yn = bn['Spend']
    results.append(r_diff(a_Xn,a_Yn,b_Xn,b_Yn))
(results <= obs).mean()
    

0.08