In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from altair import *
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import make_union

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## **Introduction**

This notebook contains the code used to create various machine learning models to answer proposed research questions.

The machine learning models are split up into two categories:

1. Machine learning models for herd immunity predictions for Democratic vs. Republican States.
2. Machine learning model for predicting when Yelp reviewers will return to a pre-pandemic lifestyle.

The aim is to analyze the results from these models separately, as well as their implications for each other. This is detailed in our group's analysis paper.

## **Machine Learning Models for Herd Immunity Predicted Date for Democratic vs. Republican States**

These models are built to answer the following question for Democratic states and for Republican states: "When will the number of people vaccinated be high enough for the population to have herd immunity?"

Once those models are built, we will use them to answer the following question: "Is there a difference between when Democratic states will achieve herd immunity status versus Republican states?"

In [None]:
# bring in the necessary data for these models from the "Data Exploration" notebook
df_covid_stats_states_normalized_party = pd.read_csv('/content/drive/MyDrive/DATA 301 Final Project Group 2/df_covid_stats_states_normalized_party.csv')
df_covid_stats_states_normalized_party

Unnamed: 0.1,Unnamed: 0,date,state,cases,deaths,statePopulation,count_fully_vaccinated,count_daily_vaccinations,cases_normalized,deaths_normalized,count_fully_vaccinated_normalized,count_daily_vaccinations_normalized,party_simplified,case_diff_normalized,case_diff,deaths_diff_normalized,deaths_diff
0,574,2020-03-13,Alabama,6,0.0,1274538.0,0.0,0.0,0.000005,0.000000,0.000000,0.000000,REPUBLICAN,0.000000,0.0,0.0,0.0
1,623,2020-03-14,Alabama,12,0.0,1497772.0,0.0,0.0,0.000008,0.000000,0.000000,0.000000,REPUBLICAN,0.000003,6.0,0.0,0.0
2,672,2020-03-15,Alabama,23,0.0,1880016.0,0.0,0.0,0.000012,0.000000,0.000000,0.000000,REPUBLICAN,0.000004,11.0,0.0,0.0
3,721,2020-03-16,Alabama,29,0.0,1880016.0,0.0,0.0,0.000015,0.000000,0.000000,0.000000,REPUBLICAN,0.000003,6.0,0.0,0.0
4,770,2020-03-17,Alabama,39,0.0,2342437.0,0.0,0.0,0.000017,0.000000,0.000000,0.000000,REPUBLICAN,0.000001,10.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19270,19070,2021-03-10,Wyoming,55014,691.0,578759.0,70919.0,3895.0,0.095055,0.001194,0.122536,0.006730,REPUBLICAN,0.000073,42.0,0.0,0.0
19271,19121,2021-03-11,Wyoming,55112,691.0,578759.0,71215.0,3911.0,0.095224,0.001194,0.123048,0.006758,REPUBLICAN,0.000169,98.0,0.0,0.0
19272,19172,2021-03-12,Wyoming,55163,691.0,578759.0,75057.0,4646.0,0.095313,0.001194,0.129686,0.008028,REPUBLICAN,0.000088,51.0,0.0,0.0
19273,19223,2021-03-13,Wyoming,55163,691.0,578759.0,78002.0,4085.0,0.095313,0.001194,0.134775,0.007058,REPUBLICAN,0.000000,0.0,0.0,0.0


In [None]:
# get overall count_fully_vaccinated numbers for democratic and republican states
# then create overall count_fully_vaccinated_normalized numbers

df_covid_stats_normalized_party = df_covid_stats_states_normalized_party.groupby(["party_simplified", "date"])["count_fully_vaccinated"].sum().to_frame(name="count_fully_vaccinated")
df_covid_stats_normalized_party["population"] = df_covid_stats_states_normalized_party.groupby(["party_simplified", "date"])["statePopulation"].sum()
df_covid_stats_normalized_party["count_fully_vaccinated_normalized"] = df_covid_stats_normalized_party["count_fully_vaccinated"] / df_covid_stats_normalized_party["population"]
df_covid_stats_normalized_party.reset_index(inplace=True)
df_covid_stats_normalized_party.fillna(0, inplace=True)
# drop the same dates as listed in the exploration notebook
df_covid_stats_normalized_party = df_covid_stats_normalized_party.query("date not in ['2021-01-14', '2021-01-16', '2021-01-17', '2021-01-18', '2021-02-15']")

Chart(df_covid_stats_normalized_party).mark_line().encode(
    x=X("date:T", title="Date (Early 2020 – current)"),
    color="party_simplified",
    y=Y("count_fully_vaccinated_normalized", title="People Fully Vaccinated (Normalized by Democrat/Republican States' Total Population)")
).properties(
    width=1000,
    height=500,
    title="People Fully Vaccinated (Normalized by Democrat/Republican States' Total Population) Over Time"
)

# goal is to fit a machine learning model to each of these curves to see when the herd immunity percentage will be hit PER PARTY
# herd immunity percentage not known for COVID, so try 80%, 85%, 90%, 95%, 100%

The above visualization is the result of combining and then normalizing the "people fully vaccinated" trends for all Democratic and all Republican states respectively. It's interesting to note that the vaccination percentages for the two groups are remarkably similar here, and both curves are rising at an accelerating rate.
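
The group-level percentage is computed by summing counts and populations across states first and only then dividing, which weights each state by its population. A minimal sketch with made-up numbers (the toy frame and the `share` column name are illustrative assumptions, not our data) shows how this can differ from averaging per-state percentages:

```python
import pandas as pd

# Toy data: two states in group "A", one state in group "B" (made-up numbers).
toy = pd.DataFrame({
    "party_simplified": ["A", "A", "B"],
    "date": ["2021-03-01"] * 3,
    "count_fully_vaccinated": [100.0, 900.0, 50.0],
    "statePopulation": [1000.0, 3000.0, 500.0],
})

# Sum counts and populations per group and date, then divide once.
totals = toy.groupby(["party_simplified", "date"])[
    ["count_fully_vaccinated", "statePopulation"]
].sum()
totals["share"] = totals["count_fully_vaccinated"] / totals["statePopulation"]

# For comparison: the unweighted mean of per-state percentages.
toy["state_share"] = toy["count_fully_vaccinated"] / toy["statePopulation"]
unweighted = toy.groupby("party_simplified")["state_share"].mean()

print(totals["share"])
print(unweighted)
```

In this toy example group "A" comes out at 25% with the population-weighted approach but 20% with the unweighted mean, because the larger state has the higher coverage.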

With this data put together, we can now build the two machine learning models of interest. We start with the Democratic machine learning model, which is a linear regression with polynomial terms. This model will project the percentage of vaccinations within the Democratic states (normalized) over the next year. 

This projection will be used to get the dates for when certain "herd immunity percentages" will be hit. Since the herd immunity percentage for COVID is not yet known, we will be predicting the dates for 80%, 85%, 90%, 95%, and 100% vaccination coverage within the two groups. 
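
Reading a threshold date off a projection amounts to finding the first date at which the projected value reaches the threshold. A minimal sketch of that lookup, assuming a date-indexed series of projected percentages (the helper name `first_date_at_or_above` and the toy values are hypothetical):

```python
import pandas as pd


def first_date_at_or_above(predictions: pd.Series, threshold: float):
    """Return the first index label whose value is >= threshold, or None."""
    reached = predictions[predictions >= threshold]
    return reached.index[0] if not reached.empty else None


# Toy projection: one value per day, rising from 50% to 100%.
dates = pd.date_range("2021-07-01", periods=6, freq="D")
toy_projection = pd.Series([0.5, 0.6, 0.7, 0.8, 0.9, 1.0], index=dates)

for threshold in (0.80, 0.90, 0.99):
    print(threshold, first_date_at_or_above(toy_projection, threshold))
```

Returning `None` when the threshold is never reached avoids the `IndexError` that indexing an empty result with `[0]` would raise.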

In [None]:
# build the first ML model (democrat)
# linear regression with hyperparameter tuning
# will start the model on the first date where the count_fully_vaccinated_normalized is NONZERO

df_ml_democrat = df_covid_stats_normalized_party.query("party_simplified == 'DEMOCRAT' and count_fully_vaccinated_normalized > 0").drop("party_simplified", axis=1)
df_ml_democrat["date"] = pd.to_datetime(df_ml_democrat["date"], format="%Y-%m-%d")
df_ml_democrat.set_index('date', inplace=True)

X_train = df_ml_democrat.index.year + (30 * (df_ml_democrat.index.month - 1) + df_ml_democrat.index.day) /365
y_train = df_ml_democrat["count_fully_vaccinated_normalized"]

# find best degree for polynomial features
model = make_pipeline(PolynomialFeatures(include_bias=False), LinearRegression())

parms = {'polynomialfeatures__degree': np.arange(1, 10)}

grid_search = GridSearchCV(model, 
                    param_grid=parms, 
                    cv = 10, 
                    scoring='neg_mean_squared_error')
grid_search.fit(X_train.to_frame(), y_train)

pipeline = grid_search.best_estimator_
pipeline

Pipeline(memory=None,
         steps=[('polynomialfeatures',
                 PolynomialFeatures(degree=2, include_bias=False,
                                    interaction_only=False, order='C')),
                ('linearregression',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

In [None]:
# get model predictions for existing data
y_train_ = pd.Series(
    pipeline.predict(X=X_train.to_frame()),
    index=y_train.index
)

# scatter chart
scatter = Chart(df_ml_democrat.reset_index()).mark_circle().encode(
    x = "date:T",
    y = "count_fully_vaccinated_normalized"
).properties(
    width=800,
    height=300
)

# line chart for model predictions
line = Chart(y_train_.reset_index().rename(columns={0:"prediction"})).mark_line(color='red').encode(
    x="date:T",
    y="prediction"
)

# estimate the test error
scores = cross_val_score(pipeline,
                         X=X_train.to_frame(),
                         y=y_train,
                         scoring="neg_mean_squared_error",
                         cv=10)
print("Estimated Test Error:", np.sqrt(-scores).mean())

# draw both of them together
scatter + line

Estimated Test Error: 0.002643577541402995


The above visualization shows the Democratic vaccination percentages (blue dots), as well as the polynomial regression fit to this data (red line). As we can see, the model tracks the observed data closely, with an estimated test error (cross-validated root mean squared error) of 0.0026, or about 0.26 percentage points of the population. This is great news, but we also want to be on the lookout for signs of the model being overfit to the data. This is hard to determine because the available vaccination data is limited, and a close fit to past data does not guarantee that the curve extrapolates well. With that caveat, we continue with the herd immunity projections for Democratic states.
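
One way to probe the overfitting concern is to change how the folds are built. With an integer `cv`, scikit-learn uses unshuffled K-fold splits for regressors, so some folds are scored by a model trained on later dates. The sketch below, on synthetic data (the curve and noise level are assumptions for illustration only), uses `TimeSeriesSplit` so that every fold trains on the past and tests on the future:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, time-ordered data: a quadratic trend plus a little noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 60).reshape(-1, 1)
y = 0.1 * t.ravel() ** 2 + rng.normal(scale=0.001, size=60)

model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)

# Each split trains on an initial segment and tests on the block right after it.
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, t, y, scoring="neg_mean_squared_error", cv=cv)
rmse_per_fold = np.sqrt(-scores)
print("RMSE per forward-looking fold:", rmse_per_fold)
```

If the forward-looking error were much larger than the K-fold error on our real data, that would be a hint that the fit does not extrapolate as well as the in-sample picture suggests.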

In [None]:
# predict when the democratic set will reach herd immunity percentages
# get the latest observation date and increment it by one
latest_covid_date = df_ml_democrat.index[-1]

# build a dataframe to contain the next year of predictions PAST the last date for observations
df_democrat_vaccination_predictions = pd.date_range(start=latest_covid_date + pd.Timedelta(days=1), end=latest_covid_date + pd.Timedelta(weeks=52), freq="D").to_frame(name="date_decimal")
df_democrat_vaccination_predictions["date_decimal"] = df_democrat_vaccination_predictions.index.year + (30 * (df_democrat_vaccination_predictions.index.month - 1) + df_democrat_vaccination_predictions.index.day) / 365

# run the predictions
df_democrat_vaccination_predictions["pred_vaccinated_normalized"] = pipeline.predict(df_democrat_vaccination_predictions[["date_decimal"]])

# replace any predictions that are greater than 1.0
df_democrat_vaccination_predictions.loc[df_democrat_vaccination_predictions["pred_vaccinated_normalized"] > 1.0, "pred_vaccinated_normalized"] = 1.0

# plot said predictions
linePredicted = Chart(df_democrat_vaccination_predictions.reset_index()).mark_line(color='orange').encode(
    x=X("index:T", title="Date (Early 2021 – Mid-2022)"),
    y=Y("pred_vaccinated_normalized", title="Projected Vaccination Percentage (Normalized)")
).properties(
    title="Projected Vaccination Percentage (Normalized) Over Time for Democratic States"
)
scatter + line + linePredicted

The above visualization shows the model's projection of the vaccination percentage (normalized) over time for Democratic states. The blue datapoints are the currently available vaccination percentage data, the red portion of the curve is the model's fit to that data, and the orange portion of the curve is the model's projection beyond the last observation.

As we can see, the model predicts that the vaccination percentage will be ~80% in mid-July 2021, and the 100% mark will be hit around early-August 2021. Let's get these exact dates:

In [None]:
# get the herd immunity dates
print("democrat")
print("80% predicted vaccinated date: ", df_democrat_vaccination_predictions[df_democrat_vaccination_predictions["pred_vaccinated_normalized"] >= 0.80]["pred_vaccinated_normalized"].head(1).index[0].strftime("%Y-%m-%d"))
print("85% predicted vaccinated date: ", df_democrat_vaccination_predictions[df_democrat_vaccination_predictions["pred_vaccinated_normalized"] >= 0.85]["pred_vaccinated_normalized"].head(1).index[0].strftime("%Y-%m-%d"))
print("90% predicted vaccinated date: ", df_democrat_vaccination_predictions[df_democrat_vaccination_predictions["pred_vaccinated_normalized"] >= 0.90]["pred_vaccinated_normalized"].head(1).index[0].strftime("%Y-%m-%d"))
print("95% predicted vaccinated date: ", df_democrat_vaccination_predictions[df_democrat_vaccination_predictions["pred_vaccinated_normalized"] >= 0.95]["pred_vaccinated_normalized"].head(1).index[0].strftime("%Y-%m-%d"))
print("100% predicted vaccinated date:", df_democrat_vaccination_predictions[df_democrat_vaccination_predictions["pred_vaccinated_normalized"] >= 1.0]["pred_vaccinated_normalized"].head(1).index[0].strftime("%Y-%m-%d"))

democrat
80% predicted vaccinated date:  2021-07-12
85% predicted vaccinated date:  2021-07-18
90% predicted vaccinated date:  2021-07-23
95% predicted vaccinated date:  2021-07-29
100% predicted vaccinated date: 2021-08-04


As we can see, the model projects that the 80% through 95% thresholds will all be hit in July of 2021, with the 100% mark following in early August. This is extremely promising! Our group thought this was super interesting, and it gives us an idea of when we can expect to begin returning to our normal pre-pandemic lives.

Now that we have the Democratic projection, we run the same model and compute the same projections for the Republican data.
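
Since the two models share every step, the workflow could be wrapped in reusable functions instead of being copied per party. A minimal sketch under that assumption (the names `to_decimal_year`, `fit_trend`, and `project` are hypothetical, the degree is fixed at 2 rather than tuned, and the data is a toy ramp):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def to_decimal_year(dates: pd.DatetimeIndex) -> pd.DataFrame:
    """Time feature using the same 30-day-month approximation as the notebook."""
    values = dates.year + (30 * (dates.month - 1) + dates.day) / 365
    return pd.DataFrame({"date_decimal": np.asarray(values, dtype=float)}, index=dates)


def fit_trend(series: pd.Series, degree: int = 2):
    """Fit a polynomial regression of the series on its decimal-year index."""
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(to_decimal_year(series.index), series.to_numpy())
    return model


def project(model, start, periods: int) -> pd.Series:
    """Predict daily values from `start`, clipped to the valid range [0, 1]."""
    future = pd.date_range(start=start, periods=periods, freq="D")
    preds = model.predict(to_decimal_year(future))
    return pd.Series(np.clip(preds, 0.0, 1.0), index=future)


# Toy history: 60 days of slowly rising coverage.
history_dates = pd.date_range("2021-01-01", periods=60, freq="D")
history = pd.Series(0.002 * np.arange(60), index=history_dates)

model = fit_trend(history)
projection = project(model, start=history_dates[-1] + pd.Timedelta(days=1), periods=30)
print(projection.tail())
```

Calling `fit_trend` once per party would also keep each fitted model in its own variable, rather than reassigning `pipeline` and `X_train` between sections.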

In [None]:
# build the second ML model (republican)
# linear regression with hyperparameter tuning
# will start the model on the first date where the count_fully_vaccinated_normalized is NONZERO

df_ml_republican = df_covid_stats_normalized_party.query("party_simplified == 'REPUBLICAN' and count_fully_vaccinated_normalized > 0").drop("party_simplified", axis=1)
df_ml_republican["date"] = pd.to_datetime(df_ml_republican["date"], format="%Y-%m-%d")
df_ml_republican.set_index('date', inplace=True)

X_train = df_ml_republican.index.year + (30 * (df_ml_republican.index.month - 1) + df_ml_republican.index.day) /365
y_train = df_ml_republican["count_fully_vaccinated_normalized"]

# find best degree for polynomial features
model = make_pipeline(PolynomialFeatures(include_bias=False), LinearRegression())

parms = {'polynomialfeatures__degree': np.arange(1, 10)}

grid_search = GridSearchCV(model, 
                    param_grid=parms, 
                    cv = 10, 
                    scoring='neg_mean_squared_error')
grid_search.fit(X_train.to_frame(), y_train)

pipeline = grid_search.best_estimator_
pipeline

Pipeline(memory=None,
         steps=[('polynomialfeatures',
                 PolynomialFeatures(degree=2, include_bias=False,
                                    interaction_only=False, order='C')),
                ('linearregression',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

In [None]:
# get model predictions for existing data
y_train_ = pd.Series(
    pipeline.predict(X=X_train.to_frame()),
    index=y_train.index
)

# scatter chart
scatter2 = Chart(df_ml_republican.reset_index()).mark_circle().encode(
    x = "date:T",
    y = "count_fully_vaccinated_normalized"
).properties(
    width=800,
    height=300
)

# line chart for model predictions
line2 = Chart(y_train_.reset_index().rename(columns={0:"prediction"})).mark_line(color='red').encode(
    x="date:T",
    y="prediction"
)

# estimate the test error
scores = cross_val_score(pipeline,
                         X=X_train.to_frame(),
                         y=y_train,
                         scoring="neg_mean_squared_error",
                         cv=10)
print("Estimated Test Error:", np.sqrt(-scores).mean())

# draw both of them together
scatter2 + line2

Estimated Test Error: 0.0030956468126717026


The above visualization shows the Republican vaccination percentages (blue dots), as well as the polynomial regression fit to this data (red line). As we can see, the model tracks the observed data closely, with an estimated test error (cross-validated root mean squared error) of 0.0031, or about 0.31 percentage points, slightly higher than the Democratic model. This is great news, but we also want to be on the lookout for signs of the model being overfit to the data. This is hard to determine because the available vaccination data is limited. With that caveat, we continue with the herd immunity projections for Republican states.

In [None]:
# predict when the republican set will reach herd immunity percentages
# get the latest observation date and increment it by one
latest_covid_date = df_ml_republican.index[-1]

# build a dataframe to contain the next year of predictions PAST the last date for observations
df_republican_vaccination_predictions = pd.date_range(start=latest_covid_date + pd.Timedelta(days=1), end=latest_covid_date + pd.Timedelta(weeks=52), freq="D").to_frame(name="date_decimal")
df_republican_vaccination_predictions["date_decimal"] = df_republican_vaccination_predictions.index.year + (30 * (df_republican_vaccination_predictions.index.month - 1) + df_republican_vaccination_predictions.index.day) / 365

# run the predictions
df_republican_vaccination_predictions["pred_vaccinated_normalized"] = pipeline.predict(df_republican_vaccination_predictions[["date_decimal"]])

# replace any predictions that are greater than 1.0 with 1.0 – any predictions greater than 1.0 don't make sense
df_republican_vaccination_predictions.loc[df_republican_vaccination_predictions["pred_vaccinated_normalized"] > 1.0, "pred_vaccinated_normalized"] = 1.0

# plot said predictions
linePredicted2 = Chart(df_republican_vaccination_predictions.reset_index()).mark_line(color='orange').encode(
    x=X("index:T", title="Date (Early 2021 – Mid-2022)"),
    y=Y("pred_vaccinated_normalized", title="Projected Vaccination Percentage (Normalized)")
).properties(
    title="Projected Vaccination Percentage (Normalized) Over Time for Republican States"
)
scatter2 + line2 + linePredicted2

The above visualization shows the model's projection of the vaccination percentage (normalized) over time for Republican states. The blue datapoints are the currently available vaccination percentage data, the red portion of the curve is the model's fit to that data, and the orange portion of the curve is the model's projection beyond the last observation.

As we can see, the model predicts that the vaccination percentage will be ~80% around the end of July 2021, and the 100% mark will be hit around mid-August 2021. Let's get these exact dates:

In [None]:
# get the herd immunity dates
print("republican")
print("80% predicted vaccinated date: ", df_republican_vaccination_predictions[df_republican_vaccination_predictions["pred_vaccinated_normalized"] >= 0.80]["pred_vaccinated_normalized"].head(1).index[0].strftime("%Y-%m-%d"))
print("85% predicted vaccinated date: ", df_republican_vaccination_predictions[df_republican_vaccination_predictions["pred_vaccinated_normalized"] >= 0.85]["pred_vaccinated_normalized"].head(1).index[0].strftime("%Y-%m-%d"))
print("90% predicted vaccinated date: ", df_republican_vaccination_predictions[df_republican_vaccination_predictions["pred_vaccinated_normalized"] >= 0.90]["pred_vaccinated_normalized"].head(1).index[0].strftime("%Y-%m-%d"))
print("95% predicted vaccinated date: ", df_republican_vaccination_predictions[df_republican_vaccination_predictions["pred_vaccinated_normalized"] >= 0.95]["pred_vaccinated_normalized"].head(1).index[0].strftime("%Y-%m-%d"))
print("100% predicted vaccinated date:", df_republican_vaccination_predictions[df_republican_vaccination_predictions["pred_vaccinated_normalized"] >= 1.0]["pred_vaccinated_normalized"].head(1).index[0].strftime("%Y-%m-%d"))

republican
80% predicted vaccinated date:  2021-07-26
85% predicted vaccinated date:  2021-08-02
90% predicted vaccinated date:  2021-08-09
95% predicted vaccinated date:  2021-08-15
100% predicted vaccinated date: 2021-08-21


We can now compare the projected herd immunity dates for the Democratic and Republican states. Across all five thresholds, the Republican states are projected to be about two weeks (14 to 17 days) behind the Democratic states in terms of vaccine distribution and herd immunity status. Our group found this very interesting, and it could be telling of how Democratic states have handled vaccine distribution versus Republican states, as well as people's willingness to get inoculated in these states.

Despite the roughly two-week difference, both models predict that herd immunity status will be reached nationwide in the summer of 2021. This projected time matches up with what we have heard from other sources, and seems plausible if vaccination keeps accelerating at its current pace. This is extremely promising news for us, and we are looking forward to when this time comes!

## **Machine Learning model predicting number of daily reviews for restaurants in the Yelp dataset:**

Implications of this model: The number of daily reviews for restaurants in the Yelp dataset serves as a proxy for the degree to which people are getting outside and participating in social activities. From this metric, we can predict when we will see a return to pre-pandemic normalcy by comparing the number of daily Yelp reviews for restaurants in our dataset to the pre-pandemic average of 1,802 reviews/day.
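
One detail to keep in mind when splitting a date-indexed frame around a cutoff: label slices with `.loc` include both endpoints, so slicing up to a date and from the same date puts that day in both halves. A minimal sketch on a toy frame (the values are made up) compares that with a boolean-mask split that does not overlap:

```python
import pandas as pd

# Toy daily counts around the cutoff date.
idx = pd.date_range("2020-03-07", periods=5, freq="D")
toy = pd.DataFrame({"Number of Reviews": [10, 20, 30, 40, 50]}, index=idx)

cutoff = pd.Timestamp("2020-03-09")

# Non-overlapping split: the cutoff day belongs only to the second frame.
before = toy.loc[toy.index < cutoff]
during = toy.loc[toy.index >= cutoff]

# Label slices are inclusive on both ends, so the cutoff day appears twice.
overlap_with_slices = toy.loc[:"2020-03-09"].index.intersection(
    toy.loc["2020-03-09":].index
)

print(len(before), len(during), list(overlap_with_slices))
```

For a baseline averaged over several years a single shared day changes the mean very little, but a strict split keeps the two periods cleanly separated.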

In [None]:
df_dailyReviewCounts = pd.read_csv("/content/drive/MyDrive/DATA 301 Final Project Group 2/dailyReviewCounts.csv").drop("Unnamed: 0", axis=1)
df_dailyReviewCounts["Date"] = pd.to_datetime(df_dailyReviewCounts["Date"], format="%Y-%m-%d")
df_dailyReviewCounts.set_index("Date", inplace=True)
df_reviewCountsBefore = df_dailyReviewCounts.loc[:"2020-03-09"]
df_reviewCountsCovid = df_dailyReviewCounts.loc["2020-03-09":]

In [None]:
print("Average total review count for the restaurants in our dataset before the pandemic:", df_reviewCountsBefore["Number of Reviews"].mean())

Average total review count for the restaurants in our dataset before the pandemic: 1802.3041474654378


**Hyperparameter tuning:**

In [None]:
date = df_reviewCountsCovid.index

X_train = date.year + (30 * (date.month - 1) + date.day) /365
y_train = df_reviewCountsCovid["Number of Reviews"]

# find best degree for polynomial features
model = make_pipeline(PolynomialFeatures(include_bias=False), LinearRegression())

parms = {'polynomialfeatures__degree': np.arange(1, 10)}

grid_search = GridSearchCV(model, 
                    param_grid=parms, 
                    cv = 10, 
                    scoring='neg_mean_squared_error')
grid_search.fit(X_train.to_frame(), y_train)

pipeline = grid_search.best_estimator_

In [None]:
y_train_ = pd.Series(
    pipeline.predict(X=X_train.to_frame()),
    index=y_train.index
)

scatter = Chart(df_dailyReviewCounts.reset_index()).mark_circle().encode(
    x = "Date",
    y = "Number of Reviews"
).properties(
    width=800,
    height=300
)

line = Chart(y_train_.reset_index().rename(columns={0:"# reviews"})).mark_line(color='red').encode(
    x="Date",
    y="# reviews"
)
scatter + line

**Estimating the test error:**

In [None]:
scores = cross_val_score(pipeline,
                         X=X_train.to_frame(),
                         y=y_train,
                         scoring="neg_mean_squared_error",
                         cv=10)
print("Estimated Test Error:", np.sqrt(-scores).mean())

Estimated Test Error: 262.0768683261627


#### Our model has an estimated test error (root mean squared error) of about 262 reviews per day. This means its predicted number of daily reviews typically misses the real number by roughly 262.
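
The reported figure is the mean of the per-fold RMSE values, which is not quite the same as the square root of the mean squared error pooled across folds. A small sketch with made-up fold scores (not our actual results) shows the two aggregations side by side:

```python
import numpy as np

# Hypothetical per-fold scores in the form cross_val_score returns for
# scoring="neg_mean_squared_error": negated MSE, so every value is <= 0.
neg_mse_scores = np.array([-40000.0, -90000.0, -62500.0])

# Per-fold RMSE, then averaged: the aggregation used in this notebook.
rmse_per_fold = np.sqrt(-neg_mse_scores)  # [200., 300., 250.]
mean_of_fold_rmse = rmse_per_fold.mean()

# Alternative: average the MSE across folds first, then take the square root.
rmse_of_pooled_mse = np.sqrt((-neg_mse_scores).mean())

print(mean_of_fold_rmse, rmse_of_pooled_mse)
```

The pooled version is never smaller than the mean of per-fold RMSEs, so the two should not be compared against each other directly.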

### **Let's try a model with seasonality**

In [None]:
SeasonalPipeline = make_pipeline(
    make_union(
      PolynomialFeatures(grid_search.best_params_["polynomialfeatures__degree"], include_bias=False),
      FunctionTransformer(lambda t: np.sin(2 * np.pi * t)),
      FunctionTransformer(lambda t: np.cos(2 * np.pi * t))
    ), 
    LinearRegression())
SeasonalPipeline.fit(X_train.to_frame().reset_index(drop=True), y_train)

y_train_ = pd.Series(
    SeasonalPipeline.predict(X=X_train.to_frame()),
    index=y_train.index
)

scatter = Chart(df_dailyReviewCounts.reset_index()).mark_circle().encode(
    x = "Date",
    y = "Number of Reviews"
).properties(
    width=800,
    height=300
)

line = Chart(y_train_.reset_index().rename(columns={0:"# reviews"})).mark_line(color='red').encode(
    x="Date",
    y="# reviews"
)
scatter + line

In [None]:
scores = cross_val_score(SeasonalPipeline,
                         X=X_train.to_frame(),
                         y=y_train,
                         scoring="neg_mean_squared_error",
                         cv=10)
print("Estimated Test Error for seasonal model:", np.sqrt(-scores).mean(), "Example prediction for July 23, 2024:", 
      SeasonalPipeline.predict(pd.Series([2024 + (30 * (7 - 1) + 23) / 365]).to_frame())) 

Estimated Test Error for seasonal model: 260.2522375146426 Example prediction for July 23, 2024: [1802.52722733]


#### The model with seasonality features has a marginally better estimated test error than the model without (260 vs. 262). However, it predicts that we won't see the pre-pandemic average number of daily restaurant reviews (1,802) until July 23, 2024, which suggests that it over-fits to the training data. This clashes with our recent optimism brought about by vaccine distribution. As such, we will continue with the model without seasonality for our future analysis.
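
The seasonality terms work because time is measured in decimal years: the sine and cosine of 2πt repeat exactly once per year, so the regression can learn an annual cycle on top of the polynomial trend. A minimal sketch of those features as a named function (`seasonal_features` is a hypothetical helper, shown here with NumPy only):

```python
import numpy as np


def seasonal_features(t):
    """Return sin and cos columns with a period of one year (t in decimal years)."""
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    return np.hstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])


# The same point in the year, one year apart, yields the same features.
spring_2021 = seasonal_features([2021.25])
spring_2022 = seasonal_features([2022.25])

print(spring_2021.shape)
print(np.allclose(spring_2021, spring_2022))
```

A named function like this could be passed to `FunctionTransformer` in place of the lambdas; unlike lambdas, it can also be pickled with the standard library if the fitted pipeline needs to be saved.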

In [None]:
September2021 = pd.Series([2021 + (30 * (9 - 1) + 1) / 365]).to_frame()
July2022 = pd.Series([2022 + (30 * (7 - 1) + 1) / 365]).to_frame()

print("September 2021:", pipeline.predict(September2021), "July 2022:", pipeline.predict(July2022))

September 2021: [1391.5934266] July 2022: [1802.15869751]


#### We expect the number of daily Yelp reviews for the restaurants in our dataset on September 1, 2021 to be about 1,392. This is still less than the average of 1,802 daily reviews before the pandemic period and indicates that although schools and other activities may be resuming pre-pandemic normalcy, the "end" of the pandemic period's influence on daily social life will not yet be fully realized.

#### Our model predicts that the total daily reviews for the restaurants in our dataset will be equal to the pre-pandemic period average of 1,802 on July 1, 2022. As such, we predict that social life will be back to the pre-pandemic "normal" in July of next year.

## **Conclusions from both Models:**

The projected vaccination percentage models predict we will see herd immunity in the United States in the summer of 2021 (July for Democratic states, late July to August for Republican states), while the Yelp review model predicts we will see a return to pre-pandemic daily social life in July 2022. The difference between these two predictions comes from the fact that the Yelp review model predicts the date when a cultural shift is fully realized, whereas the vaccination models predict a time when that cultural shift may reasonably start to take shape. From this, we gather that we will likely see herd immunity this summer and a return to pre-pandemic cultural normalcy by July of next year.