# Snapchat Political Ads
This project uses political ads data from Snapchat, a popular social media app. Interesting questions to consider include:
- What are the most prevalent organizations, advertisers, and ballot candidates in the data? Do you recognize any?
- What are the characteristics of ads with a large reach, i.e., many views? What may a campaign consider when maximizing an ad's reach?
- What are the characteristics of ads with a smaller reach, i.e., fewer views? Aside from funding constraints, why might a campaign want to produce an ad with a smaller but more targeted reach?
- What are the characteristics of the most expensive ads? If a campaign is limited on advertising funds, what type of ad may the campaign consider?
- What groups or regions are targeted frequently? (For example, for single-gender campaigns, are men or women targeted more frequently?) What groups or regions are targeted less frequently? Why? Does this depend on the type of campaign?
- Have the characteristics of ads changed over time (e.g. over the past year)?
- When is the most common local time of day for an ad's start date? What about the most common day of week? (Make sure to account for time zones for both questions.)

### Getting the Data
The data and its corresponding data dictionary are downloadable [here](https://www.snap.com/en-US/political-ads/). Download both the 2018 CSV and the 2019 CSV.

The CSVs have the same filename; rename the CSVs as needed.

Note that the CSVs have the exact same columns and the exact same data dictionaries (`readme.txt`).

### Cleaning and EDA
- Concatenate the 2018 CSV and the 2019 CSV into one DataFrame so that we have data from both years.
- Clean the data.
    - Convert `StartDate` and `EndDate` into datetime. Make sure the datetimes are in the correct time zone.
- Understand the data in ways relevant to your question using univariate and bivariate analysis of the data as well as aggregations.

*Hint 1: What is the "Z" at the end of each timestamp?*

*Hint 2: `pd.to_datetime` will be useful here. `Series.dt.tz_convert` will be useful if a change in time zone is needed.*

*Tip: To visualize geospatial data, consider [Folium](https://python-visualization.github.io/folium/) or another geospatial plotting library.*
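
As a rough illustration of the two hints, the sketch below parses "Z"-suffixed timestamps as UTC and converts them to a local zone before extracting the hour and weekday. The toy timestamps, the extra column names (`StartUTC`, `StartLocal`, `StartHour`, `StartDay`), and the choice of `America/New_York` are assumptions for the example, not part of the dataset.

```python
import pandas as pd

# Toy timestamps; the trailing "Z" marks them as UTC (an assumed ISO 8601 layout).
ads = pd.DataFrame({"StartDate": ["2019-01-15T05:30:00Z", "2019-07-04T23:00:00Z"]})

# utc=True yields timezone-aware datetimes in UTC.
ads["StartUTC"] = pd.to_datetime(ads["StartDate"], utc=True)

# Convert to an assumed local zone; daylight saving time is handled by the tz database.
ads["StartLocal"] = ads["StartUTC"].dt.tz_convert("America/New_York")

# Local hour of day and local day of week for each ad.
ads["StartHour"] = ads["StartLocal"].dt.hour
ads["StartDay"] = ads["StartLocal"].dt.day_name()
print(ads[["StartLocal", "StartHour", "StartDay"]])
```

In the real data each ad may need its own zone (for example, one inferred from its targeted region), so a single fixed zone is only a starting point.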

### Assessment of Missingness
Many columns which have `NaN` values may not actually have missing data. How come? In some cases, a null or empty value corresponds to an actual, meaningful value. For example, `readme.txt` states the following about `Gender`:

>  Gender - Gender targeting criteria used in the Ad. If empty, then it is targeting all genders

In this scenario, an empty `Gender` value (which is read in as `NaN` in pandas) corresponds to "all genders".

- Refer to the data dictionary to determine which columns do **not** belong to the scenario above. Assess the missingness of one of these columns.
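
A minimal sketch of the distinction above, using a toy DataFrame: a `NaN` that the data dictionary defines as meaningful can be replaced with an explicit label before any missingness analysis. The label `"ALL"` and the column name `GenderFilled` are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy rows; NaN in Gender means "all genders" according to the data dictionary.
ads = pd.DataFrame({
    "Gender": ["MALE", np.nan, "FEMALE", np.nan],
    "Spend": [10, 250, 40, 5],
})

# Replace the meaningful NaN with an explicit (hypothetical) label.
ads["GenderFilled"] = ads["Gender"].fillna("ALL")

# Share of NaN per column: Gender still shows NaN, GenderFilled does not.
nan_share = ads.isna().mean()
counts = ads["GenderFilled"].value_counts()
print(nan_share)
print(counts)
```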

### Hypothesis Test / Permutation Test
Find a hypothesis test or permutation test to perform. You can use the questions at the top of the notebook for inspiration.

# Summary of Findings

### Introduction
My partner and I chose to work with the 2018 and 2019 political ads data from Snapchat, a popular social media app. In this project we clean the data with concise, clear code, assess the missingness of the data, and fill missing values according to their context. Finally, we perform a hypothesis test motivated by the question: What are the characteristics of the most expensive ads? If a campaign is limited on advertising funds, what type of ad may the campaign consider?

### Cleaning and EDA
First, we downloaded the 2018 and 2019 datasets from the Snapchat website and read each comma-separated values (CSV) file into a pandas DataFrame. Next, we concatenated the two years into one DataFrame and converted `StartDate` and `EndDate` to pandas datetimes in a consistent time zone (UTC). For each column that represents a targeting choice by the advertiser, we added a boolean column marking whether the ad was agnostic (no value specified) or specific, so that we could use these columns in our hypothesis tests later. For exploratory data analysis, we ran a bivariate analysis: we computed impressions per dollar spent (`Impressions / Spend`), normalized it, and compared its mean across groups to prepare for the hypothesis tests.
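
The impressions-per-dollar step can be sketched on toy numbers as follows. The values are made up; the point is that a zero `Spend` produces `inf`, which is replaced with `NaN` so it does not distort the mean and standard deviation used for the z-score.

```python
import numpy as np
import pandas as pd

# Toy ads; the second row has zero spend.
ads = pd.DataFrame({"Impressions": [1000, 500, 300, 800], "Spend": [10, 0, 3, 4]})

# Division by zero yields inf in pandas, so replace it with NaN.
ratio = (ads["Impressions"] / ads["Spend"]).replace([np.inf, -np.inf], np.nan)

# Z-score normalization; mean() and std() skip NaN by default.
z = (ratio - ratio.mean()) / ratio.std()
print(z)
```

Because the z-score is a linear rescaling, it changes the units of a difference of means but not the p-value of a permutation test built on that difference.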

### Assessment of Missingness
We ran a permutation test to determine whether the missingness of `Gender` (NaN = all genders) depends on how much a company spends on an ad, comparing ads that target all genders against ads that target a single gender (female or male). We concluded that gender missingness does not depend on spend. We ran a second permutation test to determine whether the missingness of `Segments` depends on spend, comparing ads with no segment against ads with a segment specified. We concluded that segment missingness does depend on spend.
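
A missingness permutation test of this kind can be sketched as below. The helper name `missingness_perm_test`, the two-sided absolute difference of means, and the toy data are assumptions for illustration and are not the exact code used in the notebook.

```python
import numpy as np
import pandas as pd

def missingness_perm_test(df, col, dep_col, n_reps=1000, seed=0):
    """Shuffle the missingness indicator of `col` and compare means of `dep_col`."""
    rng = np.random.default_rng(seed)
    is_null = df[col].isna().to_numpy()
    values = df[dep_col].to_numpy(dtype=float)
    # Observed statistic: absolute difference in mean spend (missing vs. present).
    observed = abs(values[is_null].mean() - values[~is_null].mean())
    stats = np.empty(n_reps)
    for i in range(n_reps):
        # Shuffling the indicator breaks any link between missingness and spend.
        shuffled = rng.permutation(is_null)
        stats[i] = abs(values[shuffled].mean() - values[~shuffled].mean())
    return observed, float(np.mean(stats >= observed))

# Toy data where rows with a missing segment clearly spend less.
toy = pd.DataFrame({
    "Segments": [np.nan] * 20 + ["Provided by Advertiser"] * 20,
    "Spend": list(range(1, 21)) + list(range(101, 121)),
})
observed, p_value = missingness_perm_test(toy, "Segments", "Spend")
print(observed, p_value)
```

A small p-value here suggests that missingness depends on the other column; a large one means the data give no evidence of such dependence.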

### Hypothesis Test
- **Test statistic:** difference of means.
- **Null hypothesis:** Whether an ad is targeted agnostically (per column) or specifically does not affect the impressions per dollar spent by the advertiser.
- **Alternative hypothesis:** Targeting an ad at specific types of users increases the number of impressions per dollar spent.
- **Result:** We failed to reject the null hypothesis at a significance level of 0.05 for every column tested: Language (involving English, or agnostic), Interests, Ages, LatLongRad, Affiliation (candidate/ballot information), District, OSType, and Segments.
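
For reference, a one-sided difference-of-means permutation test that shuffles the group labels could look like the sketch below. The helper names, the toy data, and the convention that `True` means "agnostic" are assumptions for the example; the flag and value column names only mirror those used in the code section.

```python
import numpy as np
import pandas as pd

def diff_of_means(values, is_agnostic):
    """Mean of the specifically targeted group minus mean of the agnostic group."""
    return values[~is_agnostic].mean() - values[is_agnostic].mean()

def permutation_p_value(df, flag_col, value_col, n_reps=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Drop rows without a usable value (e.g. ads with zero spend).
    sub = df[[flag_col, value_col]].dropna()
    flags = sub[flag_col].to_numpy(dtype=bool)
    values = sub[value_col].to_numpy(dtype=float)
    observed = diff_of_means(values, flags)
    # Under the null, labels are exchangeable, so shuffle them and recompute.
    sims = np.array([diff_of_means(values, rng.permutation(flags))
                     for _ in range(n_reps)])
    # One-sided: how often is a shuffled difference at least as large?
    return float(np.mean(sims >= observed))

# Toy data: specific age targeting has an effect, district targeting does not.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "All_Ages": [True] * 100 + [False] * 100,
    "All_Dist": rng.random(200) < 0.5,
    "ImpPerDoll": np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)]),
})
p_values = {col: permutation_p_value(toy, col, "ImpPerDoll")
            for col in ["All_Ages", "All_Dist"]}
print(p_values)
```

If a flag column contains only one group, one of the means is taken over an empty array and the statistic is `NaN`, so such columns are best checked before running the test.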

# Code

In [273]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import datetime as dt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

### Cleaning and EDA

In [306]:
fp_2018 = os.path.join('PoliticalAds18', 'PoliticalAds2018.csv')
fp_2019 = os.path.join('PoliticalAds19', 'PoliticalAds2019.csv')
data_2018 = pd.read_csv(fp_2018)
data_2019 = pd.read_csv(fp_2019)
bofa = pd.concat([data_2018, data_2019], ignore_index=True)
bofa['StartDate'] = pd.to_datetime(bofa['StartDate'])
bofa['EndDate'] = pd.to_datetime(bofa['EndDate'])
#we labeled ads with no Gender value (all genders) as 'BOTH' and ads targeting one gender as 'SINGLE'
bofa['Gender'] = bofa['Gender'].fillna('BOTH').replace(['MALE', 'FEMALE'], 'SINGLE')
#we made a column within the dataframe that keeps track of Impressions per Dollar spent for each ad
bofa['ImpPerDoll'] = bofa['Impressions'] / bofa['Spend']
#we replaced 'inf' with NaN values as some ads did not spend any money which means it was funded by Snapchat
bofa['ImpPerDoll'] = bofa['ImpPerDoll'].replace([np.inf, -np.inf], np.nan)
#we normalized the Impressions per Dollar
bofa['ImpPerDoll'] = (bofa['ImpPerDoll'] - bofa['ImpPerDoll'].mean()) / bofa['ImpPerDoll'].std()
bofa['Agnostic'] = bofa['Language'].fillna('Agnostic') == 'Agnostic' #Any Languages vs Specific Languages
bofa['English'] = bofa['Language'].str.contains('en').fillna(True) #Contains english vs Does not 
bofa['AllOsTypes'] = bofa['OsType'].fillna('Agnostic') == 'Agnostic' #Specific OS types vs All OS
bofa['All_Interests'] = bofa['Interests'].fillna('Agnostic') == 'Agnostic' #All Interests vs Specified Interests
bofa['All_Ages'] = bofa['AgeBracket'].fillna('Agnostic') == 'Agnostic' #All Ages vs Specified Age Range
bofa['All_Lat'] = bofa['LatLongRad'].fillna('Agnostic') == 'Agnostic' #Any LatLongRad vs Specified LatLongRad
#Any Affiliation vs Specific Candidate/Ballot
bofa['Unaffiliated'] = bofa['CandidateBallotInformation'].fillna('Unaffiliated') == 'Unaffiliated' 
bofa['All_Dist'] = bofa['ElectoralDistrictID'].fillna('All') == 'All' #Any Dist vs Specific Dist of County

In [305]:
bofa['Seg'] = bofa['Segments'] == 'Provided by Advertiser' #Provided by Advertiser vs Not provided by advertiser
bofa['AdvancedDemographics'] = bofa['AdvancedDemographics'].fillna('Agnostic') == 'Agnostic' #Specified Demo vs All Demo

### Assessment of Missingness

In [276]:
# We want to see whether the missingness of Gender (NaN = all genders) depends on how much a company spends,
# i.e. does a company spend more when it targets all genders?
# Gender was filled in during cleaning, so 'BOTH' marks the rows that were originally missing.
bofa['G_isnull'] = bofa['Gender'] == 'BOTH'
null = bofa[bofa['G_isnull'] == True]['Spend'].mean()
not_null = bofa[bofa['G_isnull'] == False]['Spend'].mean()
obs = null - not_null
n_reps = 1000
diffs = []
for _ in range(n_reps):
    shuffled = bofa['G_isnull'].sample(replace=False, frac=1).reset_index(drop=True)
    bofa['G_shuffled'] = shuffled
    shuffled_null = bofa[bofa['G_shuffled'] == True]['Spend'].mean()
    shuffled_not_null = bofa[bofa['G_shuffled'] == False]['Spend'].mean()
    diff = shuffled_null - shuffled_not_null
    diffs.append(diff)
p_val = np.count_nonzero(pd.Series(diffs) >= obs) / n_reps
# We fail to reject the null hypothesis because our p-value is greater than the significance level of 0.05:
# we find no evidence that Gender missingness depends on the amount of money a company spends.
# In particular, we cannot conclude that a company pays more when targeting all genders.

In [277]:
bofa['Seg_isnull'] = bofa['Segments'].isnull()
null0 = bofa[bofa['Seg_isnull'] == True]['Spend'].mean()
not_null0 = bofa[bofa['Seg_isnull'] == False]['Spend'].mean()
obs0 = null0 - not_null0
n_reps = 1000
diffs0 = []
for _ in range(n_reps):
    shuffled = bofa['Seg_isnull'].sample(replace=False, frac=1).reset_index(drop=True)
    bofa['Seg_shuffled'] = shuffled
    shuffled_null = bofa[bofa['Seg_shuffled'] == True]['Spend'].mean()
    shuffled_not_null = bofa[bofa['Seg_shuffled'] == False]['Spend'].mean()
    diff = shuffled_null - shuffled_not_null
    diffs0.append(diff)
p_val0 = np.count_nonzero(pd.Series(diffs0) <= obs0) / n_reps
p_val0

0.009

### Hypothesis Test

In [296]:
# We fail to reject the null hypothesis because our p-value is greater than the significance level of 0.05,
# so we find no significant evidence that Gender targeting affects Impressions per Dollar.
obs = bofa.groupby('Gender')['ImpPerDoll'].mean()
obs = abs(obs.diff().iloc[-1])
n_reps = 1000
diffs = []
for i in range(n_reps):
    simulation = bofa.sample(replace = False, n = 1000)
    diff = simulation.groupby('Gender')['ImpPerDoll'].mean().diff().iloc[-1]
    diffs.append(abs(diff))
p_val = np.mean(pd.Series(diffs) >= obs)
p_val

0.511

In [297]:
# We fail to reject the null hypothesis because our p-value is greater than the significance level of 0.05,
# so we find no significant evidence that Language targeting affects Impressions per Dollar.
obs2 = bofa.groupby('Agnostic')['ImpPerDoll'].mean()
obs2 = abs(obs2.diff().iloc[-1])
diffs = []
for i in range(n_reps):
    simulation = bofa.sample(replace = False, n = 1000)
    diff = simulation.groupby('Agnostic')['ImpPerDoll'].mean().diff().iloc[-1]
    diffs.append(abs(diff))
p_val2 = np.mean(pd.Series(diffs) >= obs2)
p_val2

0.496

In [298]:
# We fail to reject the null hypothesis because our p-value is greater than the significance level of 0.05,
# so we find no significant evidence that targeting English affects Impressions per Dollar.
obs3 = bofa.groupby('English')['ImpPerDoll'].mean()
obs3 = abs(obs3.diff().iloc[-1])
diffs = []
for i in range(n_reps):
    simulation = bofa.sample(replace = False, n = 1000)
    diff = simulation.groupby('English')['ImpPerDoll'].mean().diff().iloc[-1]
    diffs.append(abs(diff))
p_val3 = np.mean(pd.Series(diffs) >= obs3)
p_val3

0.493

In [299]:
# We fail to reject the null hypothesis because our p-value is greater than the significance level of 0.05,
# so we find no significant evidence that OS targeting affects Impressions per Dollar.
obs4 = bofa.groupby('AllOsTypes')['ImpPerDoll'].mean()
obs4 = abs(obs4.diff().iloc[-1])
diffs = []
for i in range(n_reps):
    simulation = bofa.sample(replace = False, n = 1000)
    diff = simulation.groupby('AllOsTypes')['ImpPerDoll'].mean().diff().iloc[-1]
    diffs.append(abs(diff))  # use the absolute difference, matching obs4
p_val4 = np.mean(pd.Series(diffs) >= obs4)
p_val4

0.459

In [300]:
# We fail to reject the null hypothesis because our p-value is greater than the significance level of 0.05,
# so we find no significant evidence that Interests targeting affects Impressions per Dollar.
obs5 = bofa.groupby('All_Interests')['ImpPerDoll'].mean()
obs5 = abs(obs5.diff().iloc[-1])
diffs = []
for i in range(n_reps):
    simulation = bofa.sample(replace = False, n = 1000)
    diff = simulation.groupby('All_Interests')['ImpPerDoll'].mean().diff().iloc[-1]
    diffs.append(abs(diff))
p_val5 = np.mean(pd.Series(diffs) >= obs5)
p_val5

0.505

In [301]:
# We fail to reject the null hypothesis because our p-value is greater than the significance level of 0.05,
# so we find no significant evidence that Age Range targeting affects Impressions per Dollar.
obs6 = bofa.groupby('All_Ages')['ImpPerDoll'].mean()
obs6 = abs(obs6.diff().iloc[-1])
diffs = []
for i in range(n_reps):
    simulation = bofa.sample(replace = False, n = 1000)
    diff = simulation.groupby('All_Ages')['ImpPerDoll'].mean().diff().iloc[-1]
    diffs.append(abs(diff))
p_val6 = np.mean(pd.Series(diffs) >= obs6)
p_val6

0.473

In [295]:
obs7 = bofa.groupby('All_Lat')['ImpPerDoll'].mean()
obs7 = abs(obs7.diff().iloc[-1])
diffs = []
for i in range(n_reps):
    simulation = bofa.sample(replace = False, n = 1000)
    diff = simulation.groupby('All_Lat')['ImpPerDoll'].mean().diff().iloc[-1]
    diffs.append(abs(diff))
p_val7 = np.mean(pd.Series(diffs) >= obs7)
p_val7
'''No ad specified a value for LatLongRad, so every ad falls into the same group.'''

'No ad specified a value for LatLongRad, so every ad falls into the same group.'

In [302]:
# We fail to reject the null hypothesis because our p-value is greater than the significance level of 0.05,
# so we find no significant evidence that Candidate/Ballot affiliation affects Impressions per Dollar.
obs8 = bofa.groupby('Unaffiliated')['ImpPerDoll'].mean()
obs8 = (obs8.diff().iloc[-1])
diffs = []
for i in range(n_reps):
    simulation = bofa.sample(replace = True, n = 1000)
    diff = simulation.groupby('Unaffiliated')['ImpPerDoll'].mean().diff().iloc[-1]
    diffs.append((diff))
p_val8 = np.mean(pd.Series(diffs) >= obs8)
p_val8

0.543

In [303]:
# We fail to reject the null hypothesis because our p-value is greater than the significance level of 0.05,
# so we find no significant evidence that District targeting affects Impressions per Dollar.
obs9 = bofa.groupby('All_Dist')['ImpPerDoll'].mean()
obs9 = (obs9.diff().iloc[-1])
diffs = []
for i in range(n_reps):
    simulation = bofa.sample(replace = True, n = 1000)
    diff = simulation.groupby('All_Dist')['ImpPerDoll'].mean().diff().iloc[-1]
    diffs.append((diff))
p_val9 = np.mean(pd.Series(diffs) >= obs9)
p_val9

0.49

In [304]:
# We fail to reject the null hypothesis because our p-value is greater than the significance level of 0.05,
# so we find no significant evidence that Segments targeting affects Impressions per Dollar.
obs10 = bofa.groupby('Seg')['ImpPerDoll'].mean()
obs10 = (obs10.diff().iloc[-1])
diffs = []
for i in range(n_reps):
    simulation = bofa.sample(replace = True, n = 1000)
    diff = simulation.groupby('Seg')['ImpPerDoll'].mean().diff().iloc[-1]
    diffs.append((diff))
p_val10 = np.mean(pd.Series(diffs) >= obs10)
p_val10

0.494