# Feature Engineering

In this notebook we create new features from the 5 numerical columns: followers, pictures, videos, comments, likes.

The features we create are the following:
- Content = pictures + videos: We assume that a video and a picture have the same effect.
- Likes per Content = likes / content: An estimate of the number of likes each piece of content receives.
- Comments per Likes = comments / likes: We don't think comments by themselves are relevant for the analysis, as they can be either positive or negative. However, the ratio between comments and likes could give an estimate of their sentiment.
- The weekly change (in percent) of the followers, likes per content, comments per likes and the closing stock price.
- Moving averages (over 2, 3 and 5 weeks) to extract information about the trend and to identify anomalies more easily.
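
As a quick check of the definitions above, the following sketch works through the arithmetic by hand for a single company. The numbers are copied from the first two weekly rows of "Academy Sports + Outdoors" as they appear in the data loaded in this notebook; the variable names are only illustrative.

```python
# First two weekly rows of "Academy Sports + Outdoors", copied from the data in this notebook
followers_prev, followers_now = 168956.0, 169356.0
pictures, videos, comments, likes = 8.0, 5.0, 485.0, 7592.0

content = pictures + videos                # pictures and videos are weighted equally
likes_per_content = likes / content        # likes per piece of content
comments_per_likes = comments / likes      # rough proxy for sentiment

# Weekly change in percent: (this week - last week) / last week * 100
followers_weekly_change = (followers_now - followers_prev) / followers_prev * 100

print(content, likes_per_content, round(comments_per_likes, 6), round(followers_weekly_change, 6))
```

The printed values can be compared with the `content`, `likes_per_content`, `comments_per_likes` and `followers_weekly_change` columns computed further down.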

In [11]:
# Load data
import pandas as pd

df = pd.read_csv('../clean_data/financial_data_integrated.csv')
df.head()

Unnamed: 0,legal_entity,date,followers,pictures,videos,comments,likes,close
0,Academy Sports + Outdoors,2020-09-19,168956.0,8.0,5.0,485.0,7592.0,12.789742
1,Academy Sports + Outdoors,2020-09-26,169356.0,9.0,7.0,475.0,7577.0,13.094962
2,Academy Sports + Outdoors,2020-10-03,169717.0,6.0,12.0,460.0,7414.0,13.705405
3,Academy Sports + Outdoors,2020-10-10,169917.0,6.0,14.0,104.0,7157.0,14.355229
4,Academy Sports + Outdoors,2020-10-17,170309.0,5.0,13.0,89.0,5111.0,14.47338


In [12]:
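# Total content published in the week (pictures and videos weighted equally)
df["content"] = df["pictures"] + df["videos"]
# Ratio features: likes per piece of content, and comments per like
df["likes_per_content"] = df["likes"]/df["content"]
df["comments_per_likes"] = df["comments"]/df["likes"]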
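
One thing to keep in mind with these ratios: if a week has no content or no likes, the denominator is zero and pandas returns `inf` (or `NaN` for 0/0) instead of raising an error. Whether such weeks exist in this dataset is not checked here. A minimal sketch of one possible guard, on a small made-up frame (`toy` is hypothetical), is to turn zero denominators into `NaN` first:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame: one normal week and one week without any activity
toy = pd.DataFrame({
    "pictures": [8.0, 0.0],
    "videos": [5.0, 0.0],
    "comments": [485.0, 0.0],
    "likes": [7592.0, 0.0],
})

toy["content"] = toy["pictures"] + toy["videos"]

# Replacing 0 with NaN in the denominator makes the ratio NaN instead of inf
toy["likes_per_content"] = toy["likes"] / toy["content"].replace(0, np.nan)
toy["comments_per_likes"] = toy["comments"] / toy["likes"].replace(0, np.nan)

print(toy[["content", "likes_per_content", "comments_per_likes"]])
```

With this guard the undefined ratios are plain missing values, which `fillna` and `dropna` can handle later on.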

In [13]:
# Weekly percentage change, computed separately for each company so that the
# first week of one company is never compared with the last week of another.
# This assumes the rows of each company are already sorted by date.
for company in df.legal_entity.unique():
    company_mask = df["legal_entity"] == company

    df.loc[company_mask, 'likes_per_content_weekly_change'] = df.loc[company_mask, 'likes_per_content'].pct_change() * 100
    df.loc[company_mask, 'followers_weekly_change'] = df.loc[company_mask, 'followers'].pct_change() * 100
    df.loc[company_mask, 'comments_per_likes_weekly_change'] = df.loc[company_mask, 'comments_per_likes'].pct_change() * 100
    df.loc[company_mask, 'Close_price_weekly_change'] = df.loc[company_mask, 'close'].pct_change() * 100
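
The per-company loop can also be written with `groupby`, which should give the same per-company result without building a mask for every company. The sketch below uses a small made-up frame (`toy` and its values are hypothetical) and assumes the rows are sorted by date within each company.

```python
import pandas as pd

# Hypothetical data: two companies with three and two weekly observations
toy = pd.DataFrame({
    "legal_entity": ["A", "A", "A", "B", "B"],
    "followers": [100.0, 110.0, 121.0, 200.0, 190.0],
})

# pct_change is applied within each company, so the first row of every
# company is NaN instead of being compared with the previous company
toy["followers_weekly_change"] = (
    toy.groupby("legal_entity")["followers"].pct_change() * 100
)

print(toy)
```

The first row of company B is `NaN` here, which is the behaviour the loop above is meant to guarantee.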

In [14]:
df.head()

Unnamed: 0,legal_entity,date,followers,pictures,videos,comments,likes,close,content,likes_per_content,comments_per_likes,likes_per_content_weekly_change,followers_weekly_change,comments_per_likes_weekly_change,Close_price_weekly_change
0,Academy Sports + Outdoors,2020-09-19,168956.0,8.0,5.0,485.0,7592.0,12.789742,13.0,584.0,0.063883,,,,
1,Academy Sports + Outdoors,2020-09-26,169356.0,9.0,7.0,475.0,7577.0,13.094962,16.0,473.5625,0.06269,-18.910531,0.236748,-1.86797,2.386441
2,Academy Sports + Outdoors,2020-10-03,169717.0,6.0,12.0,460.0,7414.0,13.705405,18.0,411.888889,0.062045,-13.023331,0.21316,-1.028779,4.661664
3,Academy Sports + Outdoors,2020-10-10,169917.0,6.0,14.0,104.0,7157.0,14.355229,20.0,357.85,0.014531,-13.119773,0.117843,-76.579451,4.741371
4,Academy Sports + Outdoors,2020-10-17,170309.0,5.0,13.0,89.0,5111.0,14.47338,18.0,283.944444,0.017413,-20.652663,0.230701,19.834482,0.82305


In [15]:
for company in df.legal_entity.unique():
    company_mask = df["legal_entity"] == company

    # Moving averages over 2, 3 and 5 weeks, computed within each company
    for window in (2, 3, 5):
        df.loc[company_mask, f'comments_per_likes_ma_{window}'] = df.loc[company_mask, 'comments_per_likes'].rolling(window=window).mean()
        df.loc[company_mask, f'change_followers_ma_{window}'] = df.loc[company_mask, 'followers_weekly_change'].rolling(window=window).mean()
        df.loc[company_mask, f'Close_price_weekly_change_ma_{window}'] = df.loc[company_mask, 'Close_price_weekly_change'].rolling(window=window).mean()
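
A moving average with window `w` is the mean of the current value and the `w - 1` previous ones, so the first `w - 1` rows of each company have no value. As an illustration, here is a sketch of the same idea written with `groupby(...).transform(...)` on a small made-up frame (`toy` is hypothetical, and only windows 2 and 3 are shown to keep it short):

```python
import pandas as pd

# Hypothetical data: company A has four weeks, company B has three
toy = pd.DataFrame({
    "legal_entity": ["A"] * 4 + ["B"] * 3,
    "comments_per_likes": [0.1, 0.2, 0.3, 0.4, 1.0, 2.0, 3.0],
})

grouped = toy.groupby("legal_entity")["comments_per_likes"]

for window in (2, 3):
    # transform keeps the original row order and index; the rolling window
    # restarts for each company, so it never mixes values of A and B
    toy[f"comments_per_likes_ma_{window}"] = grouped.transform(
        lambda s, w=window: s.rolling(window=w).mean()
    )

print(toy)
```

Because the window restarts per company, each company loses its first rows to `NaN`, which matters for the `dropna` step later in the notebook.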

In [16]:
# Set the NaN values of comments_per_likes_weekly_change to zero (they arise when the number of comments was 0 in the preceding weeks)

df['comments_per_likes_weekly_change'] = df['comments_per_likes_weekly_change'].fillna(0)
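
Note that `fillna` only touches `NaN`. When a ratio goes from 0 to a non-zero value, `pct_change` returns `inf`, which `fillna(0)` leaves in place. The sketch below shows this on a made-up series (`ratio` is hypothetical) together with one possible way to handle it. Whether such values should also be set to zero is a modelling choice that this notebook does not settle.

```python
import numpy as np
import pandas as pd

# Hypothetical ratio: zero for two weeks, then non-zero
ratio = pd.Series([0.0, 0.0, 0.5])

change = ratio.pct_change() * 100   # first value NaN, then 0/0 -> NaN, then 0.5/0 -> inf
filled = change.fillna(0)           # NaN becomes 0, inf is left untouched

# One option: convert +/-inf to NaN first, then fill everything with 0
cleaned = change.replace([np.inf, -np.inf], np.nan).fillna(0)

print(filled.tolist())
print(cleaned.tolist())
```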

In [17]:
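# Drop the rows that still contain NaN, e.g. the first weeks of each company,
# where the weekly changes and moving averages are not defined
df.dropna(inplace=True)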

In [19]:
df.to_csv("../clean_data/processed.csv", index=False)