# True Reach Estimator
We want to build an estimator for Instagram users' true reach and impressions. This data is accessible through the business API, but in Q1 2018 Instagram blocked influencer marketing teams from this endpoint. We were able to collect a sample of this data to see whether there are usable correlations between currently public data and true reach/impressions.

- Data: the CSV contains one year's worth of data from 5,000 influencer-level Instagram users.

# The goal:
Given the public data points, create an accurate estimate of each post's reach and impressions. Ideally the estimate would be as accurate as possible, but a 10% range will work for the business.
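One way to make the 10% target measurable is to compute the share of posts whose predicted value falls within ±10% of the actual value. The sketch below assumes that reading of "a 10% range"; the function name `share_within_tolerance` and the sample numbers are hypothetical and not part of our data.

```python
import numpy as np

def share_within_tolerance(actual, predicted, tolerance=0.10):
    """Fraction of predictions whose relative error is at most `tolerance`."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Relative error is undefined when the actual value is zero, so skip those rows.
    nonzero = actual != 0
    relative_error = np.abs(predicted[nonzero] - actual[nonzero]) / np.abs(actual[nonzero])
    return float(np.mean(relative_error <= tolerance))

# Made-up reach values, for illustration only.
actual_reach = [1000, 2000, 4000, 8000]
predicted_reach = [1050, 2500, 3900, 8000]
print(share_within_tolerance(actual_reach, predicted_reach))
```

A score like this complements $R^2$: a model can explain a good share of the variance and still miss the ±10% band on most individual posts.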

## Trial 8

Deleting the comments column and trying a square root transformation on the data.

In [None]:
# import necessary libraries
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
import seaborn as sns

# note: Matplotlib 3.6+ renamed this style to 'seaborn-v0_8'
plt.style.use('seaborn')

In [None]:
# import the clean data file
df = pd.read_csv('clean_instagram_train.csv')
df.head()

## Drop the comments column all together

In [None]:
# drop the index column ('Unnamed: 0') along with the published, impressions, and comments columns
df.drop(columns=['impressions', 'published', 'Unnamed: 0', 'comments'], inplace=True)
df.head()

## Create a new dataframe

In [None]:
df_dropcomm= df.copy()

In [None]:
df_dropcomm.shape

In [None]:
df_dropcomm.head()

## Scatter plot of followers and reach, after dropping rows where reach is greater than 200,000

In [None]:
# count the rows that exceed the threshold before dropping them
df_dropcomm[df_dropcomm.reach>200000].count()

In [None]:
df_dropcomm.drop(df_dropcomm[df_dropcomm.reach>200000].index, inplace= True)

In [None]:
df_dropcomm.shape

In [None]:
plt.scatter(df_dropcomm['followers'], df_dropcomm['reach']);
# plt.xlim(0,2000)
# plt.ylim(0,700000);

## Scatter plot of likes and reach, after removing outliers with 'likes' greater than 17,000

In [None]:
df_dropcomm.drop(df_dropcomm[df_dropcomm['likes']>17000].index, inplace=True)

In [None]:
plt.scatter(df_dropcomm['likes'], df_dropcomm['reach']);

In [None]:
df_dropcomm.drop(df_dropcomm[df_dropcomm['reach']==0].index, inplace=True)
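The three filtering steps above (reach over 200,000, likes over 17,000, and reach equal to zero) can also be written as a single boolean mask, which leaves the original dataframe untouched. This is a sketch on a made-up toy frame that reuses our thresholds; the `toy`, `keep`, and `filtered` names are hypothetical.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real CSV.
toy = pd.DataFrame({
    "followers": [1200, 50000, 900000, 3000, 7000],
    "likes": [150, 4000, 20000, 300, 18000],
    "reach": [900, 30000, 450000, 0, 5000],
})

# Keep rows that pass all three filters used in this notebook.
keep = (
    (toy["reach"] <= 200_000)
    & (toy["likes"] <= 17_000)
    & (toy["reach"] != 0)
)
filtered = toy[keep].copy()
print(len(toy), "->", len(filtered))
```

Keeping the mask separate makes it easy to report how many rows each condition removes before committing to the thresholds.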

In [None]:
import numpy as np

In [None]:
df_dropcomm.head()

# Taking a look at the linear regression model excluding comments

In [None]:
lr_modelexcomm = ols(formula='reach~followers + likes', data=df_dropcomm).fit()
lr_modelexcomm.summary()

## Creating a new column in our dataframe using our regression model from ols

In [None]:
df_dropcomm['predicted_reach']= lr_modelexcomm.predict()

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
mean_squared_error(df_dropcomm.reach, df_dropcomm.predicted_reach)**.5

In [None]:
lr_modelexcomm.rsquared

In [None]:
len(df_dropcomm)

In [None]:
"The rsquared of our model excluding comments is {} and our root-mean-squared error is {}.".format(lr_modelexcomm.rsquared,mean_squared_error(df_dropcomm.reach, df_dropcomm.predicted_reach)**.5)

## The $R^2$ of our model excluding comments is 0.5124878727076694 and our root-mean-squared error is 6074.416304497445.
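The error figure above is computed as `mean_squared_error(...)**.5`, which is the square root of the mean squared error and is therefore expressed in the same units as reach. A minimal sketch of that relationship, using made-up numbers and a hypothetical `rmse` helper:

```python
import numpy as np

def rmse(actual, predicted):
    """Root-mean-squared error: the square root of the mean squared residual."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up values for illustration only.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

mse = float(np.mean((np.array(actual) - np.array(predicted)) ** 2))
print("MSE:", mse)
print("RMSE:", rmse(actual, predicted))
```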

# Add a column for the square root of followers and the square root of likes

In [None]:
# np.sqrt is vectorized, so it can be applied to the whole column at once
df_dropcomm['followers_sqrt'] = np.sqrt(df_dropcomm['followers'])

In [None]:
df_dropcomm['likes_sqrt'] = np.sqrt(df_dropcomm['likes'])

In [None]:
df_dropcomm.head()

# Creating a curvilinear regression model excluding comments, using followers, likes, and the square roots of both


In [None]:
lr_modelexcomm_sqrt= ols(formula='reach~ followers+ followers_sqrt+ likes_sqrt + likes', data=df_dropcomm).fit()
lr_modelexcomm_sqrt.summary()

In [None]:
df_dropcomm['predicted_reach_sqrt']= lr_modelexcomm_sqrt.predict()

In [None]:
mean_squared_error(df_dropcomm.reach, df_dropcomm.predicted_reach_sqrt)**.5

In [None]:
"The rsquared of our model excluding comments and taking the square root of our independent variables is {} and our root-mean-squared error is {}.".format(lr_modelexcomm_sqrt.rsquared,mean_squared_error(df_dropcomm.reach, df_dropcomm.predicted_reach_sqrt)**.5)

### The $R^2$ of our model excluding comments and taking the square root of our independent variables is 0.5187820543257242 and our root-mean-squared error is 6035.076063689838.


Applying a square root transformation to our independent variables barely improves the fit: $R^2$ rises from about 0.512 to about 0.519, and the root-mean-squared error falls from about 6,074 to about 6,035.
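Both error figures so far are measured on the same rows the models were fit on, so a small gain like this one could partly come from the two extra terms. A k-fold cross-validation would show whether it holds on unseen rows. The sketch below is an assumption-laden illustration, not our pipeline: it uses synthetic stand-in data, plain NumPy least squares instead of `ols`, and a hypothetical `kfold_rmse` helper (the `KFold` class we already import could do the splitting instead).

```python
import numpy as np

def kfold_rmse(X, y, n_splits=5, seed=0):
    """Mean out-of-fold RMSE of a least squares fit with an intercept."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, n_splits)
    X = np.column_stack([np.ones(len(y)), X])  # prepend the intercept column
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the held-out one.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        residuals = y[test_idx] - X[test_idx] @ coef
        scores.append(np.sqrt(np.mean(residuals ** 2)))
    return float(np.mean(scores))

# Synthetic stand-in data; the real notebook would use df_dropcomm instead.
rng = np.random.default_rng(42)
followers = rng.uniform(1_000, 100_000, size=500)
likes = rng.uniform(10, 17_000, size=500)
reach = 0.1 * followers + 1.5 * likes + rng.normal(0, 2_000, size=500)

plain = np.column_stack([followers, likes])
with_sqrt = np.column_stack([followers, np.sqrt(followers), likes, np.sqrt(likes)])

print("plain features:", kfold_rmse(plain, reach))
print("with sqrt features:", kfold_rmse(with_sqrt, reach))
```

If the cross-validated RMSE of the two feature sets is about the same, the square root terms are not adding predictive value.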

## Scatter plots of sqrt followers and sqrt likes against reach

In [None]:
plt.scatter(df_dropcomm['followers_sqrt'], df_dropcomm['reach']);

In [None]:
plt.scatter(df_dropcomm['likes_sqrt'], df_dropcomm['reach']);

In [None]:
df_dropcomm.head()

## Take a look at the relationship between each independent variable and reach

In [None]:
from sklearn.linear_model import LinearRegression
regression_1 = LinearRegression()
regression_2 = LinearRegression()
regression_3 = LinearRegression()
regression_4 = LinearRegression()

# reshape each predictor into the 2-D array that scikit-learn expects
likes = df_dropcomm["likes"].values.reshape(-1, 1)
likes_sqrt = df_dropcomm["likes_sqrt"].values.reshape(-1, 1)
followers = df_dropcomm["followers"].values.reshape(-1, 1)
followers_sqrt = df_dropcomm["followers_sqrt"].values.reshape(-1, 1)

# fit one single-predictor regression per variable
regression_1.fit(likes, df_dropcomm["reach"])
regression_2.fit(followers_sqrt, df_dropcomm["reach"])
regression_3.fit(followers, df_dropcomm["reach"])
regression_4.fit(likes_sqrt, df_dropcomm["reach"])

# Make predictions on the same data the models were fit on
pred_1 = regression_1.predict(likes)
pred_2 = regression_2.predict(followers_sqrt)
pred_3 = regression_3.predict(followers)
pred_4 = regression_4.predict(likes_sqrt)

# The coefficients, in order: likes, followers_sqrt, followers, likes_sqrt
print(regression_1.coef_)
print(regression_2.coef_)
print(regression_3.coef_)
print(regression_4.coef_)



## Creating Scatter plots of our new data

In [None]:
fig = plt.figure(figsize=[20,5])

# one row of four panels; the predictor is on the x-axis and reach on the y-axis
ax1 = fig.add_subplot(141)
ax1.scatter(df_dropcomm.followers, df_dropcomm.reach)
ax1.set_xlabel('followers')
ax1.set_ylabel('reach')
ax1.set_title("Followers vs. reach")

ax2 = fig.add_subplot(142)
ax2.scatter(df_dropcomm.followers_sqrt, df_dropcomm.reach)
ax2.set_xlabel('followers_sqrt')
ax2.set_ylabel('reach')
ax2.set_title("Sqrt followers vs. reach")

ax3 = fig.add_subplot(143)
ax3.scatter(df_dropcomm.likes, df_dropcomm.reach)
ax3.set_xlabel('likes')
ax3.set_ylabel('reach')
ax3.set_title("Likes vs. reach")

ax4 = fig.add_subplot(144)
ax4.scatter(df_dropcomm.likes_sqrt, df_dropcomm.reach)
ax4.set_xlabel('likes_sqrt')
ax4.set_ylabel('reach')
ax4.set_title("Sqrt likes vs. reach");

## After running our analysis excluding comments and transforming our independent variables, we determine that this is not our best-fitting model based on its $R^2$ and root-mean-squared error. We still do not have a model that satisfies our goal.