# True Reach Estimator
We are looking to build an estimator of Instagram users' true reach and impressions. This data is accessible through the business API, but in Q1 2018 Instagram blocked influencer marketing teams from this endpoint. We were able to collect a sample of this data to see whether there are working correlations between currently public data and true reach/impressions.

Data: the CSV contains one year's worth of data from 5,000 influencer-level Instagram users.

# The goal:
Given the public data points, create an accurate estimate of each post's reach and impressions. The ideal is to be as accurate as possible, but estimates within a 10% range will work for the business.
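
One way to make the 10% target measurable is to compute the share of posts whose predicted reach falls within 10% of the true reach. The sketch below assumes that "a 10% range" means a relative error of at most 10%; the helper name `share_within_tolerance` and the sample values are hypothetical and not taken from the dataset.

```python
import numpy as np

def share_within_tolerance(actual, predicted, tol=0.10):
    """Fraction of predictions whose relative error is at most `tol`."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Relative error is undefined where the true value is zero, so skip those rows.
    mask = actual != 0
    rel_err = np.abs(predicted[mask] - actual[mask]) / np.abs(actual[mask])
    return float(np.mean(rel_err <= tol))

# Illustrative values only, not taken from the dataset.
actual_reach = [1000, 2000, 4000, 8000]
predicted_reach = [1050, 2500, 3900, 8100]
print(share_within_tolerance(actual_reach, predicted_reach))  # 0.75
```

A metric like this complements RMSE, which is dominated by the largest accounts.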

## Trial 2 - Logarithmic Transformation and Scaling

After an initial trial of exploring and cleaning the data, we decided to log transform and scale the data.

Process:
- Import the data
- Split the data into a test and training set
- Observe the data once again before doing a transformation
- Log transform the training set
- Scale the data after the log transformation
- Set conditions on the training data to eliminate outliers
- Create a regression model on our variables to determine the effectiveness of the data so far 
- Use the regression model to check for accuracy in the prediction across the dataset
- Summarize our findings and suggest next steps if unsuccessful

# Get the data into the notebook

In [None]:
# import necessary libraries
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
import seaborn as sns
import bs4
import requests
import re
import warnings
warnings.filterwarnings("ignore")

plt.style.use('seaborn')

In [None]:
# import the data file
df = pd.read_csv('trial_1_2.csv')
df.head()

In [None]:
# drop the first column, which just holds the indices, and the columns we will not use for training
df.drop(columns=['Unnamed: 0', "published", "impressions"], inplace=True)
df.head()
len(df)

# Split the data into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split
df, test = train_test_split(df, test_size=0.2)

# Visualize the data once more

Take a look at the scatter plots of the likes, comments, and followers against the reach

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(df.likes, df.reach,  color='blue', alpha = 0.3)
plt.show()

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(df.comments, df.reach,  color='red', alpha = 0.3)
plt.show()

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(x=df.followers, y=df.reach,  color='orange', alpha = 0.3)
plt.show()

## A few observations
- We seem to have a semi linear relationship between likes and reach, but heteroscedastic
- We seem to have a curvilinear relationship between comments and reach 
- We seem to have some weird things happening with followers and reach.
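
The heteroscedasticity noted for likes can be checked numerically by comparing the spread of residuals at low and high values of the predictor. The following is a minimal sketch on synthetic data in which the noise grows with `likes`; the numbers are made up and only the column names mirror the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the notebook's columns: noise grows with likes.
likes = rng.uniform(10, 10_000, size=5_000)
reach = 5 * likes + rng.normal(0, 0.5 * likes)

# Fit a straight line and compute residuals.
slope, intercept = np.polyfit(likes, reach, deg=1)
residuals = reach - (slope * likes + intercept)

# Compare residual spread in the lowest and highest quartile of likes.
q1, q3 = np.quantile(likes, [0.25, 0.75])
low_spread = residuals[likes <= q1].std()
high_spread = residuals[likes >= q3].std()
print(f"residual std, low likes:  {low_spread:.1f}")
print(f"residual std, high likes: {high_spread:.1f}")
```

A much larger spread in the top quartile than in the bottom one is the fan shape seen in the scatter plot.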

# Looking at the histograms prior to transformation

We look at the histograms of the independent variables we are interested in to determine which should be transformed to have a more normal distribution. We will log transform the features to see how this affects their distributions.

In [None]:
df.hist(figsize=[10,10])

It seems like the data can benefit from a log transform.

# Clean data

We need to remove posts with zero likes, comments, or followers, because the log of zero is undefined (NumPy returns `-inf` with a divide-by-zero warning), which would break the transformation and scaling.
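
To illustrate why zeros are a problem, the sketch below shows what `np.log` does with a zero count and compares two ways of handling it. Filtering is what this notebook does; `np.log1p` is shown only as an alternative that this notebook does not use. The `counts` array is made up.

```python
import numpy as np

counts = np.array([0, 1, 10, 100], dtype=float)

# np.log(0) evaluates to -inf and emits a divide-by-zero RuntimeWarning;
# errstate silences the warning so the result can be inspected.
with np.errstate(divide="ignore"):
    logged = np.log(counts)
print(np.isinf(logged[0]))  # True

# Option 1 (what this notebook does): keep only rows with a count of at least 1.
positive_only = np.log(counts[counts >= 1])

# Option 2 (an alternative, not used here): log1p computes log(1 + x),
# which is finite at zero and keeps every row.
shifted = np.log1p(counts)
print(positive_only, shifted)
```

Filtering discards rows, while `log1p` keeps them at the cost of a slightly different transformation.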

In [None]:
# drop anyone that has no reach
df = df.loc[(df['reach']>=1)]
print(df.head())
len(df)

We just cut 85 posts from our data, which is less than 1% of the original data. Let's continue by cutting out cases we know are anomalous, such as posts with 0 likes or accounts with 0 followers that still report a reach.

In [None]:
df = df[(df['likes']>=1)]
df = df[(df['followers']>=1)]
print(df.head())
len(df)

We just cut 189165 posts, about 25% of our original data, as bad data. Let's see how cutting posts with fewer than 1 comment affects the data.

In [None]:
df = df[(df['comments']>=1)]
print(df.head())
len(df)
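
The row counts quoted above were read off `len(df)` after each filter. A small helper can report the number and share of rows removed at each step instead. This is a sketch on a toy frame; `filter_and_report` is a hypothetical helper and the values are made up, with only the column names matching the notebook.

```python
import pandas as pd

def filter_and_report(frame, column, minimum=1):
    """Keep rows where `column` >= `minimum` and report how many were dropped."""
    kept = frame[frame[column] >= minimum]
    dropped = len(frame) - len(kept)
    pct = 100 * dropped / len(frame) if len(frame) else 0.0
    print(f"{column}: dropped {dropped} rows ({pct:.1f}%)")
    return kept, dropped

# Toy frame with the notebook's column names; the values are made up.
toy = pd.DataFrame({
    "likes": [0, 5, 12, 0, 7],
    "followers": [100, 0, 300, 400, 500],
    "comments": [1, 2, 0, 4, 5],
    "reach": [10, 20, 30, 40, 50],
})
step = toy
removed = {}
for col in ["reach", "likes", "followers", "comments"]:
    step, removed[col] = filter_and_report(step, col)
```

Note that each percentage is relative to the rows remaining before that step, not to the original frame.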

# Transforming the Likes, Comments, and Followers

Create a new dataframe with the logarithmic transformation

In [None]:
import numpy as np
data_log = pd.DataFrame([])
data_log["followers"] = np.log(df["followers"])
data_log["comments"] = np.log(df["comments"])
data_log["likes"] = np.log(df["likes"])
data_log["reach"] = df["reach"]
data_log.hist(figsize=[10,10])

## A few observations
- Logging the features seems to have normalized their distributions a bit.
- Likes seem to benefit the most from the transformation, becoming more normal, while comments are skewed to the right and followers are skewed to the left.
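
Skewness can also be quantified rather than judged from the histograms alone. The sketch below uses pandas' `Series.skew()` on synthetic right-skewed data as a stand-in for a column such as likes; the distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic right-skewed counts standing in for a column such as likes.
raw = pd.Series(rng.lognormal(mean=6.0, sigma=1.2, size=10_000))
logged = np.log(raw)

# Skewness near 0 indicates a roughly symmetric distribution.
print(f"skew before log: {raw.skew():.2f}")
print(f"skew after log:  {logged.skew():.2f}")
```

Positive values indicate a right skew and negative values a left skew, which would allow the observations about comments and followers to be stated as numbers.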

We scale the log-transformed features and plot the histograms to see if that makes a difference.

In [None]:
# standardize each log-transformed feature: subtract the mean, divide by the standard deviation
scaled_fol = (data_log["followers"]-np.mean(data_log["followers"]))/np.sqrt(np.var(data_log["followers"]))
scaled_com = (data_log["comments"]-np.mean(data_log["comments"]))/np.sqrt(np.var(data_log["comments"]))
scaled_like = (data_log["likes"]-np.mean(data_log["likes"]))/np.sqrt(np.var(data_log["likes"]))
# the target is left unscaled
# scaled_reach = (data_log["reach"]-np.mean(data_log["reach"]))/np.sqrt(np.var(data_log["reach"]))


data_scaled = pd.DataFrame([])
data_scaled["followers"] = scaled_fol
data_scaled["comments"] = scaled_com
data_scaled["likes"] = scaled_like
data_scaled["reach"] = df.reach.copy()

data_scaled.hist(figsize  = [10, 10]);

In [None]:
data_scaled.head()

The histograms look the same; only the scales differ. Since we are doing a linear regression, we do not need to scale.
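
The claim that scaling is unnecessary for a plain linear regression can be checked directly: standardizing the features changes the coefficients but not the fitted values. The sketch below uses NumPy's least-squares solver on synthetic data in place of the notebook's `ols` call; the data and the `fit_predict` helper are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic features and target; column meanings are illustrative only.
X = rng.normal(loc=[5.0, 3.0, 8.0], scale=[1.0, 2.0, 0.5], size=(500, 3))
y = 4.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.3, size=500)

def fit_predict(features, target):
    """Ordinary least squares with an intercept, returning fitted values."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef, coef

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

pred_raw, coef_raw = fit_predict(X, y)
pred_scaled, coef_scaled = fit_predict(X_scaled, y)

# Coefficients change with the units, but the fitted values do not.
print(np.allclose(pred_raw, pred_scaled))  # True
```

This invariance holds for unregularized least squares with an intercept; it would not hold for penalized models such as ridge or lasso.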

# Creating our models

First, let's take a look at the OLS summary of the log transform.

In [None]:
# take a stab at the OLS model with the log-transformed data
lr_model = ols(formula='reach~followers + likes + comments', data=data_log).fit()
lr_model.summary()

Using the coefficients from the summary above, we defined a test function that takes in raw inputs and log transforms them before applying the fitted coefficients.

>> `def reach_test_log(followers, likes, comments): 
    return (667.6353*np.log(followers)) + (6638.6280*np.log(likes)) + (-568.5896*np.log(comments)) -4.061e04`

But we can just use the model's built-in `predict` method to run the data through.

In [None]:
# create a column  for the test reach and the reach difference
data_log["test_reach"] = lr_model.predict(data_log)
data_log["reach_diff"] = data_log.reach - data_log.test_reach

In [None]:
data_log.head()

### Observe the RMSE of the log transforms

In [None]:
# RMSE of the training data
mean_squared_error(data_log.test_reach, data_log.reach)**.5

In [None]:
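# RMSE of the test data
# NOTE: `test` still holds the raw, untransformed features, while lr_model was fit on log-transformed features
test_reach = lr_model.predict(test)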
mean_squared_error(test_reach, test.reach)**.5
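
In the cell above, the model fit on log-transformed features is given the raw `test` frame. The sketch below uses synthetic data and a one-feature fit with `np.polyfit` (both assumptions, not the notebook's model) to show how much the error can change when the training-time transform is not applied to held-out data.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_split(n):
    """Synthetic raw counts whose target depends on log(likes)."""
    likes = rng.integers(1, 50_000, size=n).astype(float)
    reach = 3_000 * np.log(likes) + rng.normal(0, 500, size=n)
    return likes, reach

train_likes, train_reach = make_split(2_000)
test_likes, test_reach = make_split(500)

# Fit reach ~ log(likes) on the training split.
slope, intercept = np.polyfit(np.log(train_likes), train_reach, deg=1)

def rmse(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Feeding raw likes into a model fit on log(likes) inflates the error.
rmse_raw_input = rmse(test_reach, slope * test_likes + intercept)
# Applying the training-time transform first gives a comparable error.
rmse_log_input = rmse(test_reach, slope * np.log(test_likes) + intercept)
print(rmse_raw_input, rmse_log_input)
```

For the notebook's data, the same zero-count filters would also need to be applied to `test` before taking logs.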

Next, let's take a look at the OLS summary of the scaled data.

In [None]:
# take a look at the summary
lr_model_2 = ols(formula='reach~followers + likes + comments', data=data_scaled).fit()
lr_model_2.summary()

In [None]:
# create a new column in the data frame that holds the predicted reach
data_scaled["test_reach"] = lr_model_2.predict(data_scaled)

### Observe the RMSE of the scaled data

In [None]:
mean_squared_error(data_scaled.test_reach, data_scaled.reach)**.5

In [None]:
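# NOTE: `test` is neither log transformed nor scaled here, while lr_model_2 was fit on log-transformed, scaled features
scaled_test = lr_model_2.predict(test)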
mean_squared_error(scaled_test, test.reach)**.5
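
For a model fit on standardized features, held-out data is usually scaled with the mean and standard deviation learned from the training split, not with its own statistics. A minimal sketch of that step on synthetic log-likes follows; the arrays are made up and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

train_log_likes = np.log(rng.integers(1, 50_000, size=2_000))
test_log_likes = np.log(rng.integers(1, 50_000, size=500))

# Learn the scaling parameters from the training split only.
train_mean = train_log_likes.mean()
train_std = train_log_likes.std()

# Reuse those same parameters on the held-out split.
train_scaled = (train_log_likes - train_mean) / train_std
test_scaled = (test_log_likes - train_mean) / train_std
print(round(float(test_scaled.mean()), 2), round(float(test_scaled.std()), 2))
```

The scaled test split will have a mean near 0 and a standard deviation near 1, but not exactly, because the parameters come from the training data.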

In [None]:
data_scaled.head(5)

# Summary

Even after the log transform, the $R^2$ was just above 21% and the RMSE was over 27,000 on our training data. On our test data, the RMSE was over 100,000 for the first model (log transform) and over 200,000 for the second model (log transform and scaling). Note that the test set was passed to both models without the log transform or scaling applied, which likely contributes to the gap between training and test RMSE. We conclude that further analysis of the data is needed. In our next trial, we will look at the effects of a curvilinear regression.