# Probability Email this Year

The purpose of this script is to provide an approximate breakdown of the proportion of the USA population with an email address by year.
These are very rough predictions, but they are better than nothing.

Data sources linked below:
* [Data 1996 & 1997](https://www.theguardian.com/technology/2002/mar/13/internetnews). Details the history/origins of email use.
* [Data 2013-2021](https://ntia.gov/other-publication/2022/digital-nation-data-explorer#sel=emailUser&demo=&pc=prop&disp=chart). It details the percent of people age 15+ in the United States who use email.



----------------------

<p>Author: PJ Gibson</p>
<p>Date: 2023-01-20</p>
<p>Contact: peter.gibson@doh.wa.gov</p>
<p>Other Contact: pjgibson25@gmail.com</p>

## 0. Load libraries, define functions

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

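Sigmoid function:
$y = \frac{1}{1+e^{-x}}$

Custom exponential function:
$y = y_0 + (1 - e^{-k(x + x_0)})$

---

As implemented below, the sigmoid takes the same offset $y_0$, rate $k$, and shift $x_0$ as the exponential, and both functions include a multiplier $z$ in front to help with fine-tuning:

$y = z\left(y_0 + \frac{1}{1+e^{-k(x + x_0)}}\right)$ and $y = z\left(y_0 + (1 - e^{-k(x + x_0)})\right)$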

In [None]:
def sigmoid(x, y_0, k, x_0, z):
    return z*(y_0 + ( 1 / (1+ np.e**(-k*(x+x_0)))))

def custom_exponential_function(x, y_0, k, x_0, z):
    return z*(y_0 + (1 - np.e**(-k*(x+x_0))))
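As a quick check of how the four parameters act, the sketch below restates the two functions and evaluates them at $x = -x_0$, where the exponent is zero, and far to the right, where both curves level off. The parameter values are made up for illustration and are not the fitted values.

```python
import numpy as np

def sigmoid(x, y_0, k, x_0, z):
    return z * (y_0 + 1 / (1 + np.exp(-k * (x + x_0))))

def custom_exponential_function(x, y_0, k, x_0, z):
    return z * (y_0 + (1 - np.exp(-k * (x + x_0))))

# Illustrative parameters only (assumed, not fitted)
y_0, k, x_0, z = 0.0, 0.3, -2006.0, 0.9

# At x = -x_0 the exponent is zero: the sigmoid sits at its midpoint z * (y_0 + 0.5)
mid_sigmoid = sigmoid(2006.0, y_0, k, x_0, z)

# ...and the exponential sits at its starting value z * y_0
start_exponential = custom_exponential_function(2006.0, y_0, k, x_0, z)

# Far to the right both curves level off at z * (y_0 + 1)
plateau_sigmoid = sigmoid(2200.0, y_0, k, x_0, z)
plateau_exponential = custom_exponential_function(2200.0, y_0, k, x_0, z)

print(mid_sigmoid, start_exponential, plateau_sigmoid, plateau_exponential)
```

So $z$ sets the plateau, $k$ sets how quickly the curve gets there, and $-x_0$ is the year where the sigmoid is halfway up (or where the exponential starts rising).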

## 1. Gather Data

### 1.1 Read in data from .csv

The data source from the National Telecommunications and Information Administration reports data (at irregular intervals) from 1997 to 2019.

In [None]:
df_source1 = pd.read_csv('../../../SupportingDocs/Email/01_Raw/ntia-analyze-table-FINAL21.csv')

#### 1.1.1 Wrangle source

We need to do a couple of things:
* filter to only the email data
* convert the `dataset` column from the "3charmonth-YY" format to a numeric dtype

In [None]:
# Filter to email usage only (copy so the column assignments below don't act on a view)
df_source1 = df_source1.query('description == "Uses Email"').copy()

# Split up the month/year
df_source1[['Month','Year']] = df_source1['dataset'].str.split('-',expand=True)

# Create dict to replace months with percent through the year.
### Assume released during the halfway point in the month
convert_dict = {'Jan': 0.5/12,
                'Feb': 1.5/12,
                'Mar': 2.5/12,
                'Apr': 3.5/12,
                'May': 4.5/12,
                'Jun': 5.5/12,
                'Jul': 6.5/12,
                'Aug': 7.5/12,
                'Sep': 8.5/12,
                'Oct': 9.5/12,
                'Nov': 10.5/12,
                'Dec': 11.5/12,}

# Enact that dict
df_source1['Month'] = df_source1['Month'].replace(convert_dict).astype(float)

# Create a new time_observed column 
### *need to add 2000 since all years are post-2000 and only 2 digits right now
df_source1['time_observed'] = 2000 + df_source1.Year.astype(int) + df_source1.Month

# Filter down to only columns we care about
df_source1 = df_source1[['time_observed','usProp']]
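To make the date conversion concrete, here is a minimal standalone sketch of the same mid-month logic applied to single labels. The helper name `to_decimal_year` and the example labels are hypothetical and chosen only for illustration; they are not taken from the source file.

```python
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Fraction of the year elapsed at the midpoint of each month
MONTH_FRACTION = {name: (i + 0.5) / 12 for i, name in enumerate(MONTHS)}

def to_decimal_year(label, century=2000):
    """Convert a '3charmonth-YY' label into a decimal year (hypothetical helper)."""
    month, two_digit_year = label.split('-')
    return century + int(two_digit_year) + MONTH_FRACTION[month]

# Example labels are assumptions, not values read from the NTIA file
for label in ['Jul-13', 'Nov-19']:
    print(label, '->', round(to_decimal_year(label), 3))
```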

### 1.2 Manually define dataset2

This data is from the Guardian data source [linked here](https://www.theguardian.com/technology/2002/mar/13/internetnews).
We'll manually define it.
Note that the observations just say 1996 and 1997.
Since we don't know when in the year they happened, we'll assume that they happened halfway through the year (1996.5 and 1997.5).

The observation from 1996 is when free email became available, which we take to represent a starting proportion of United States email users of 0.0.

The observation from 1997 is when 10 million users worldwide were using email.
We'll assume, perhaps very wrongly, that all of them were users within the United States.
By year the United States had a population of:
* 1997: 266,490,000.  Census data source [linked here](https://www2.census.gov/library/publications/1998/demographics/p23-194.pdf)
* 1998: 268,000,000.  Census data source [linked here](https://www2.census.gov/library/publications/1998/compendia/statab/118ed/tables/sasec1.pdf)

We'll take the average of the 2 to define our United States population at 1997.5.
This defines our denominator when finding out the proportion of email users in the United States at that time.

*Note that the population estimates don't reflect people only age 15+ like data source1.
We'll be alright with that and just assume it helps balance our massive assumption that the only 10 million people with emails in 1997 are people from the United States.
Art, not science.

In [None]:
df_source2 = pd.DataFrame.from_dict( {'time_observed':[1996.5, 1997.5],
                                      'usProp':[0.0, 10_000_000 / ( (266_490_000 + 268_000_000) / 2 )]})
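The arithmetic behind the 1997.5 value can be spelled out step by step. This sketch only repeats the numbers quoted above; the variable names are illustrative.

```python
# Population figures quoted above
pop_1997 = 266_490_000
pop_1998 = 268_000_000

# Midpoint population used as the denominator at 1997.5
pop_mid = (pop_1997 + pop_1998) / 2

# Worldwide email users in 1997, all (very roughly) assigned to the United States
email_users_1997 = 10_000_000

prop_1997 = email_users_1997 / pop_mid
print(f'{pop_mid:,.0f} people, proportion {prop_1997:.4f}')
# → 267,245,000 people, proportion 0.0374
```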

### 1.3 Combine & Inspect

In [None]:
df = pd.concat([df_source2,df_source1],ignore_index=True)

plt.scatter(df['time_observed'], df['usProp'])
plt.xlabel('Year')
plt.ylabel('Percent of United States 15+ with Email')

## 2. Line Fitting

In [None]:
xs = df['time_observed']
ys = df['usProp']

### 2.1 Use `curve_fit()` function

In [None]:
# Try on sigmoid (s-curve) function
pars_sigmoid, cov_sigmoid = curve_fit(f=sigmoid, xdata=xs, ydata=ys, p0=[0, 0.3, -2006, 0.9])

# Try using decaying exponential
pars_exponential, cov2_exponential= curve_fit(f=custom_exponential_function, xdata=xs, ydata=ys, p0=[0, 0.1, -1996, 0.9])
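Because `curve_fit()` can be sensitive to the starting guess `p0`, it can help to try it on data where the answer is known. The sketch below is an illustration under assumed conditions, not part of the analysis: it generates noise-free points from made-up parameters, fits them from a nearby guess, and checks how closely the fitted curve reproduces them.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, y_0, k, x_0, z):
    return z * (y_0 + 1 / (1 + np.exp(-k * (x + x_0))))

# Synthetic, noise-free observations from known (made-up) parameters
true_pars = [0.0, 0.3, -2006.0, 0.9]
xs_demo = np.linspace(1996, 2022, 27)
ys_demo = sigmoid(xs_demo, *true_pars)

# Fit starting from a guess near, but not at, the true parameters
pars_demo, cov_demo = curve_fit(f=sigmoid, xdata=xs_demo, ydata=ys_demo,
                                p0=[0.05, 0.25, -2005.0, 0.8])

# Compare the fitted curve with the synthetic observations
fitted_demo = sigmoid(xs_demo, *pars_demo)
max_abs_error = np.max(np.abs(fitted_demo - ys_demo))
print(pars_demo, max_abs_error)
```

With real, noisy data the recovered parameters will not match any "true" values, so the comparison of fitted and observed points is the more useful check.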

### 2.2 Collect fitted y-values

In [None]:
xs2 = np.arange(1995,2026)

# Plug the x values and fitted parameters into each function to get y values
ys_line_sigmoid = sigmoid(xs2, *pars_sigmoid)
ys_line_exponential = custom_exponential_function(xs2, *pars_exponential)
ys_line_average = (ys_line_sigmoid + ys_line_exponential) / 2

### 2.3 Plot for visual inspection

In [None]:
# Define our figure and its axes
fig, ax = plt.subplots(nrows=2,ncols=2,sharex=True,sharey=True,figsize=(9,7))
ax1, ax2, ax3, ax4 = list(ax[0])+list(ax[1])

# Plot 1: Just the raw data
######################################################
ax1.scatter(xs,ys, color='black', label='raw data')


# Plot 2: Raw Data AND sigmoid (s-curve) fit.
######################################################
ax2.scatter(xs,ys, color='black')
ax2.plot(xs2,ys_line_sigmoid, color = 'red', label='sigmoid line fit')

# Plot 3: Raw Data AND decaying exponential fit.
######################################################
ax3.scatter(xs,ys, color='black')
ax3.plot(xs2,ys_line_exponential, color = 'blue', label='exponential decay line fit')

# Plot 4: For funzies average the two previous fits
######################################################
ax4.scatter(xs,ys, color='black')
ax4.plot(xs2, (ys_line_sigmoid+ys_line_exponential)/2, color='purple', label='average of previous fits')

# Do a little cleanup / prettifying
for axis in [ax1,ax2,ax3,ax4]:
    axis.set_ylim(0,1)
    axis.set_xlim(1996,2025)
    axis.legend(loc='lower right')

# Add title,labels
plt.suptitle('Proportion of US population with email by year: Line fits')
fig.text(0.5, 0.04, 'Year', ha='center')
fig.text(0.04, 0.5, 'Proportion with email', va='center', rotation='vertical')

# Show it off!
plt.show()


### 2.4 Calculate RMSE

We'll calculate the root mean squared error to get some numerical backing for our choice of fit.

#### 2.4.1 Calculate value for few points only

For plotting, we showed all values between 1996-2025 (inclusive).
For comparing with our original data points, we only need data for a handful of points in our original source data.

In [None]:
# Plug the observed x values and fitted parameters into each function to get y values
ys_points_sigmoid = sigmoid(xs, *pars_sigmoid)
ys_points_exponential = custom_exponential_function(xs, *pars_exponential)

# Also calculate the average
ys_points_average = (ys_points_sigmoid + ys_points_exponential) / 2

#### 2.4.2 Calculate diff^2, then RMSE

We'll find the squared difference between each fitted value and its observation in this step.
Then we'll take the square root of the mean squared difference.

In [None]:
# Calculate squared differences
diff2_sigmoid = (ys_points_sigmoid - ys)**2
diff2_exponential = (ys_points_exponential - ys)**2
diff2_average = (ys_points_average - ys)**2

# Calculate RMSE: square root of the mean squared difference
RMSE_sigmoid = np.sqrt(np.mean(diff2_sigmoid))
RMSE_exponential = np.sqrt(np.mean(diff2_exponential))
RMSE_average = np.sqrt(np.mean(diff2_average))
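For reference, here is a small worked example of root mean squared error on made-up numbers, next to the mean absolute error for comparison. The values are illustrative only and unrelated to the fits above.

```python
import math

# Made-up observations and predictions, for illustration only
observed = [0.10, 0.50, 0.90]
predicted = [0.20, 0.50, 0.70]

# Squared error per point: 0.01, 0.00, 0.04
squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]

# RMSE: square root of the mean squared error
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

# MAE: mean of the absolute errors, shown for comparison
mae = sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)

print(round(rmse, 4), round(mae, 4))
```

RMSE is never smaller than MAE on the same data, and it penalizes the single large miss (0.2) more heavily than MAE does.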

#### 2.4.3 Reveal Results

We see that the fit with the lowest error (slightly) was the sigmoid function.
This checks out visually as well.
We'll proceed with the data wrangling using this fit's estimates as our data choice.

In [None]:
print(f'{np.round(RMSE_sigmoid,4)}: RMSE of the sigmoid function approximation')
print(f'{np.round(RMSE_exponential,4)}: RMSE of our custom exponential function approximation')
print(f'{np.round(RMSE_average,4)}: RMSE of the average values between two function outputs approximation')

## 3. Wrangle Data

Now that we have a good idea of what percent of the population uses email for each year, we want to dive into our primary question:

<b>What is the likelihood of getting an email for the first time in any given year?</b>

----

We'll tackle this problem using a manual, compartmentalized approach with the following logic:
* Start with 100 individuals without an email, and track their email status each successive year.
* By the end of each year, the total number with an email must equal the percent given by our sigmoid approximation for that year.
* The probability of getting an email in a given year is the number with an email this year / number never with an email at the beginning of the year.
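The logic in the list above can be sketched on a toy example. The cumulative shares here are made up and stand in for the fitted curve; the variable names are illustrative.

```python
# Made-up cumulative share of people with an email at the end of each year
share_with_email = [0.0, 0.1, 0.3, 0.6]

prob_first_email = []
for prev, cur in zip(share_with_email[:-1], share_with_email[1:]):
    without_at_start = 1 - prev   # share still without an email at the start of the year
    newly_emailed = cur - prev    # share who got their first email during the year
    prob_first_email.append(newly_emailed / without_at_start)

print([round(p, 4) for p in prob_first_email])
```

In the last year, 0.3 of the population gets an email, but only 0.7 were still without one, so the conditional probability is 0.3 / 0.7 rather than 0.3.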

In [None]:
# Get percent of population with email each year using our sigmoid fit.
df = pd.DataFrame(data = np.column_stack([xs2,ys_line_sigmoid]),
                columns = ['Year','PercentWithEmail'])

# Ensure the negative probability values don't screw up calculations
df.loc[0:1,'PercentWithEmail'] = 0.0

# Filter to years we care about (copy so later column assignments don't act on a view)
df = df.query('Year > 1995').copy()

### 3.2 Establish Probability Related Fields

In [None]:
# Given an initial population of 1, find the percent without an email
df['PercNoEmail'] = 1 - df['PercentWithEmail']

# Grab last year's percent without an email
df['PercNoEmail_LastYear'] = np.append(np.nan,np.array(df['PercNoEmail'][:-1]))

# Find out what percent got their first email this year
df['PercEmailed_ThisYear'] = df['PercNoEmail_LastYear'] - df['PercNoEmail']

# Probability of first email this year = percent newly emailed / percent without an email at the start of the year
df['ProbabilityOfEmail_ThisYear'] = df['PercEmailed_ThisYear'] / df['PercNoEmail_LastYear']

## 4. Saving

### 4.1 Save to .csv

Note we don't use this .csv in the process, but it is interesting to have a copy of.

In [None]:
# Format columns,names
df['Year'] = df['Year'].astype(int)
df['perc'] = df['ProbabilityOfEmail_ThisYear'].fillna(0)

# Put into proper column names
output = df[['Year','perc']]

# Save to .csv file final version we'll use
output.to_csv('../../../SupportingDocs/Email/03_Complete/ProbaEmailThisYear.csv',header=True,index=False)

# Also save a wrangled version with more columns for anyone interested
df.to_csv('../../../SupportingDocs/Email/02_Wrangled/ProbaEmailThisYear_ExtraCols.csv',header=True,index=False)


## 5. Extra Sanity Checks

### 5.1 See if our probabilities check out

In [None]:
# Calculate the inverse probability -> probability of still having no email after that year
df['inverseProba'] = 1- df['ProbabilityOfEmail_ThisYear']

# For each row (not the first, which has no previous year):
for i in np.arange(1,len(df)):

    # Find the probabilities up to that point
    subset = df['inverseProba'].iloc[1:i+1]

    # Find current row year
    cur_year = df.iloc[i]['Year']

    # Calculate product of all previous probabilities of staying without an email,
    ### starting from the share without an email in the first row
    starter = df.iloc[0].PercNoEmail
    for element in subset:
        starter = starter*element

    # Calc the difference between percent without and product of percent chances of leaving every previous year with no email.
    ### Should be the same
    difference = abs(df.iloc[i].PercNoEmail - starter)

    # If they're not the same, call me a monkey's uncle.
    if difference > 0.000000001:
        print(f'YOU FOOL!!!!\n At {cur_year}, difference was {difference} which is way too much!')
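The idea behind this check is that multiplying the yearly chances of staying without an email should reproduce the share still without one. A toy version on made-up shares (not the fitted values) looks like this:

```python
import math

# Made-up share of people without an email at the end of each year
share_without_email = [1.0, 0.9, 0.7, 0.4]

# Yearly probability of getting a first email, given none at the start of the year
yearly_prob = [(prev - cur) / prev
               for prev, cur in zip(share_without_email[:-1], share_without_email[1:])]

# Running product of the yearly chances of staying without an email
still_without = share_without_email[0]
for i, p in enumerate(yearly_prob, start=1):
    still_without *= (1 - p)
    # The product should match the share without an email for that year
    assert math.isclose(still_without, share_without_email[i])

print(round(still_without, 6))
```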


### 5.2 Fun Plotting

In [None]:
highest_proba = df.iloc[df['ProbabilityOfEmail_ThisYear'].argmax()]['Year']

plt.plot(df['Year'], df['ProbabilityOfEmail_ThisYear'].fillna(0), label='probability')
plt.vlines(x=highest_proba, ymin=0,ymax=0.2, colors='k',linestyles='--',label=f'max likelihood if no email:\nyear {int(highest_proba)}')
plt.xlabel('Year')
plt.ylabel('Probability (normalized to 1.0)')
plt.title('Probability of First Email by Year')
plt.legend(loc='upper left')
plt.xlim(1996,2025)
plt.ylim(0,0.2)
plt.show()

In [None]:
plt.plot(df['Year'],df['PercentWithEmail'])
plt.xlim(1996,2025)
plt.ylim(0,1)
plt.xlabel('Year')
plt.ylabel('Percent of Population with Email')
plt.title('Percent of Population with Email by Year')
plt.show()