# Snapchat Political Ads
* **See the main project notebook for instructions to be sure you satisfy the rubric!**
* See Project 03 for information on the dataset.
* A few example prediction questions to pursue are listed below. However, don't limit yourself to them!
    * Predict the reach (number of views) of an ad.
    * Predict how much was spent on an ad.
    * Predict the target group of an ad. (For example, predict the target gender.)
    * Predict the (type of) organization/advertiser behind an ad.

Be careful to justify what information you would know at the "time of prediction" and train your model using only those features.

# Summary of Findings


### Introduction
This project focuses on predicting the country an ad originated from. This is a classification problem, since the goal is to predict a categorical variable rather than a continuous one. The target variable is CountryCode, and the objective is to maximize the accuracy of the model. Given the malicious ad campaigns that foreign governments have run in the US, this project asks whether it is feasible to determine where an ad originated using only some basic information about the ad. To address this question fairly, the billing address is also excluded from the data, since the point of the model is to predict the origin location.

### Baseline Model
For the baseline model, I first did some light data cleaning. I filled null values with the meanings given in Snapchat's readme for the data, and I excluded columns with over 95% null values, since I did not expect them to improve the predictions. This left 16 features: 12 nominal, 2 quantitative, and 2 ordinal. For the baseline, I one-hot encoded every feature except the two quantitative ones, Impressions and Spend. I used a Random Forest Classifier and achieved an average accuracy of approximately 63% over 50 runs, each with a different random train/test split. For a baseline, I consider this a decent outcome: an accuracy above 50% suggests the data contains relationships that can be used to predict the country of origin.
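
One way to put the 63% figure in context is to compare it with the accuracy of always predicting the most common country. The following is a minimal sketch using scikit-learn's `DummyClassifier`; the labels, class names, and class proportions are made up for illustration and are not taken from the ads data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the CountryCode column: three imbalanced classes.
rng = np.random.default_rng(0)
y_demo = rng.choice(['country_a', 'country_b', 'country_c'],
                    size=1000, p=[0.6, 0.25, 0.15])
X_demo = np.zeros((len(y_demo), 1))  # DummyClassifier ignores the features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# Always predict the most frequent class seen in the training labels.
majority = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
majority_acc = accuracy_score(y_te, majority.predict(X_te))
print(f'majority-class accuracy: {majority_acc:.3f}')
```

If the real majority-class accuracy on CountryCode turned out to be close to the model's accuracy, the model would be adding little beyond the class imbalance, so this number is a useful reference point alongside the 50% mark.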

### Final Model
TODO

### Fairness Evaluation
TODO

# Code

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

In [23]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

### Baseline Model

In [201]:
# Importing the data 
ads_fp = os.path.join('data', 'PoliticalAds2018.csv')
ads2018 = pd.read_csv(ads_fp)
ads_fp = os.path.join('data', 'PoliticalAds2019.csv')
ads2019 = pd.read_csv(ads_fp)
# Stack the 2018 and 2019 ads into a single frame with a fresh index
ads = pd.concat([ads2018, ads2019], ignore_index=True)
orig_ads = ads.copy()

First, some basic cleaning.

In [202]:
# Replacing null values with the value each null represents (per Snapchat's readme)
ads['Gender'] = ads['Gender'].fillna('ALL')
ads['AgeBracket'] = ads['AgeBracket'].fillna('ALL')
ads['Interests'] = ads['Interests'].fillna('NONE')
ads['Language'] = ads['Language'].fillna('NONE')
ads['CandidateBallotInformation'] = ads['CandidateBallotInformation'].fillna('NONE')
ads['Regions (Included)'] = ads['Regions (Included)'].fillna('NONE')
ads['Radius Targeting (Included)'] = ads['Radius Targeting (Included)'].fillna('NONE')
ads['Postal Codes (Included)'] = ads['Postal Codes (Included)'].fillna('NONE')
ads['EndDate'] = ads['EndDate'].fillna('ONGOING')
ads['Segments'] = ads['Segments'].fillna('NONE')

In [204]:
# Dropping columns that have over 95% null values
cols = ['Regions (Excluded)', 'Electoral Districts (Excluded)', 'Electoral Districts (Included)', 
       'Radius Targeting (Excluded)', 'Metros (Excluded)', 'Metros (Included)', 
       'Postal Codes (Excluded)', 'Location Categories (Excluded)', 'Location Categories (Included)', 
       'OsType', 'AdvancedDemographics', 'Targeting Connection Type', 'Targeting Carrier (ISP)', 'CreativeUrl',
       'CreativeProperties']
ads = ads.drop(cols, axis=1)
# Drop ADID since it is a unique value for each ad
ads = ads.drop(['ADID'], axis=1)
# Dropping billing address since the point of the model is to predict origin location
ads = ads.drop('BillingAddress', axis=1)
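
The list of sparse columns above is written out by hand. As a rough sketch of how such a list could be derived, the fraction of nulls in each column can be computed and compared against the 95% threshold. The toy frame and its column names below are hypothetical stand-ins, not the ads data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw ads data; column names are hypothetical.
demo = pd.DataFrame({
    'mostly_null': [np.nan] * 99 + ['x'],
    'half_null': [np.nan] * 50 + ['y'] * 50,
    'complete': range(100),
})

# Fraction of missing values in each column.
null_frac = demo.isna().mean()

# Columns whose null fraction exceeds the 95% threshold.
sparse_cols = null_frac[null_frac > 0.95].index.tolist()
print(sparse_cols)

# Drop the sparse columns and keep the rest.
demo_trimmed = demo.drop(columns=sparse_cols)
```

On the real data this check would need to run on the raw frame (for example `orig_ads`), before any nulls are filled in.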

In [206]:
# Choosing baseline features: one-hot encode every categorical column and
# pass the remaining quantitative columns (Spend, Impressions) through unchanged
categorical_cols = [
    'Currency Code', 'StartDate', 'EndDate', 'OrganizationName',
    'CandidateBallotInformation', 'PayingAdvertiserName', 'Gender', 'AgeBracket',
    'Regions (Included)', 'Radius Targeting (Included)', 'Postal Codes (Included)',
    'Interests', 'Segments', 'Language',
]
transformer = ColumnTransformer([
    # handle_unknown='ignore' keeps predict from failing on categories
    # that appear in the test split but not in the training split
    ('one_hot', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
], remainder='passthrough')
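
The `handle_unknown='ignore'` setting used above matters because a random train/test split can put a rare category only in the test set. A small illustration of what the encoder does in that case, using made-up category values:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on two hypothetical category values.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(np.array([['USD'], ['EUR']]))

# 'USD' was seen during fit; 'GBP' was not.
encoded = enc.transform(np.array([['USD'], ['GBP']])).toarray()
print(encoded)
```

The unseen value is encoded as a row of zeros instead of raising an error, so the downstream classifier simply receives no signal from that column for that row.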

In [213]:
# Creating a pipeline to apply the transformations and fit a random forest
pl = Pipeline([
    ('transformations', transformer),
    # max_features='sqrt' considers sqrt(n_features) candidate features per split;
    # all other arguments keep their defaults
    ('rf', RandomForestClassifier(n_estimators=100, max_depth=5, max_features='sqrt'))
])

# The features and target do not change between runs, so build them once
X = ads.drop('CountryCode', axis=1)
y = ads['CountryCode']

scores = []
for _ in range(50):
    # Drawing a fresh random 70/30 train/test split on each run
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    # Fitting the data
    pl.fit(X_train, y_train)
    # Getting the accuracy score
    scores.append(accuracy_score(y_test, pl.predict(X_test)))
np.mean(scores)

0.6349414519906323
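
The repeated random-split loop can also be expressed with scikit-learn's `ShuffleSplit` and `cross_val_score`, which makes it easy to report the spread of the scores as well as the mean. This is only a sketch on synthetic data from `make_classification`; the data, the smaller forest, and the fixed seeds are assumptions chosen to keep the example fast and reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic three-class data standing in for the transformed ads features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0)

# 50 independent random 70/30 train/test splits, as in the loop above.
splitter = ShuffleSplit(n_splits=50, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0)

# One accuracy score per split.
split_scores = cross_val_score(
    model, X_demo, y_demo, cv=splitter, scoring='accuracy')
print(f'mean accuracy: {split_scores.mean():.3f} '
      f'(std {split_scores.std():.3f})')
```

With the real data, the full pipeline `pl` could be passed in place of `model`, so the encoders are refit on each training split.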

### Final Model

In [None]:
# TODO

### Fairness Evaluation

In [None]:
# TODO