## Cleaning and Feature Engineering

In this notebook:

1. Saved CSV files with information about Instagram posts were loaded.
2. All CSV files were merged into one main DataFrame (`main_df`).
3. Data types and missing values were handled. Some information was extracted from raw fields, and some was taken from other sources and imputed into the main DataFrame.
4. Features were created at the profile and post level (such as hour and weekday of publishing, influencer type, mean number of likes per profile, and subjectivity and polarity of post captions).
5. Post captions (the text under a post) were cleaned of non-letter characters.
6. Topics from topic modeling were added.

As a result, the main_df_clean.csv file was created.

### Content:
- [Uploading datasets](#Uploading-datasets)
	- [Uploading saved all_data file](#Uploading-saved-all_data-file)
	- [Uploading saved main_df_posts file](#Uploading-saved-main_df_posts-file)
	- [Uploading saved main_df_comments file](#Uploading-saved-main_df_comments-file)
- [Merging three datasets](#Merging-three-datasets)
- [Data Cleaning](#Data-Cleaning)
	- [Data imputation](#Data-imputation)
- [Feature creation](#Feature-creation)
- [Text preprocessing](#Text-preprocessing)
- [Modeled topic addition](#Modeled-topic-addition)

In [3]:
text = 'Uploading saved main_df_comments file'
text.replace(' ','-')

'Uploading-saved-main_df_comments-file'

Uploading scraped ready to use datasets

Cleaning

Feature Engineering

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime as dt
import math

from bs4 import BeautifulSoup
import re


# Gensim
import gensim, spacy
# `lemmatize` is not used below (spaCy does the lemmatization), so only simple_preprocess is imported
from gensim.utils import simple_preprocess
from nltk.tokenize import RegexpTokenizer

# # NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use', 'not', 'would', 'say', 'night', 'first', 'soon',
                   'could', '_', 'be', 'know', 'good', 'go', 'get', 'do', 'done', 
                   'try', 'many', 'some', 'nice', 'thank', 'think', 'see', 'rather', 
                   'easy', 'easily', 'lot', 'lack', 'make', 'want', 'seem', 'run', 
                   'need', 'even', 'right', 'line', 'even', 'also', 'may', 'take', 'come',
                  'day', 'week', 'time', 'people', 'much', 'always', 'look','new' , 'year', 'last','none',
                  "must", "tell", "ask", "text", "full", "back", "wait", "big", "soon", "keep", "really", 
                   "way", "still", "person", "stand", "today", "sit", "minute", 'simply', 's'])

import emoji
from textblob import TextBlob




## Uploading datasets

### Uploading saved all_data file

The dataset contains information about profiles from an [additional source](https://starngage.com/app/us/influencer) (not from Instagram directly).

In [2]:
# uploading dataset 
countries_info = pd.read_csv('./datasets/all_data.csv')

In [3]:
# looking at shape
countries_info.shape

(4013, 10)

In [4]:
# Dropping duplicates
countries_info.drop_duplicates(inplace=True)

In [5]:
# looking for missing values
countries_info.isnull().sum()

Country           0
Account           0
Category          0
Free_Promotion    0
Paid_Promotion    0
Follower_count    0
Post_count        0
Female            0
Male              0
Text              0
dtype: int64

In [6]:
# renaming columns to lower case
countries_info.columns = [i.lower() for i in countries_info.columns]

#### Working with wrong data types and useless information

In [7]:
# account name
countries_info['account'] = countries_info['account'].map(lambda x:
                                                x.replace("@",""))
# percentage of female/male followers of the account
countries_info['female'] = countries_info['female'].map(lambda x:
                                                float(x.replace("%","")))
countries_info['male'] = countries_info['male'].map(lambda x:
                                                float(x.replace("%","")))
# take list of categories from descriptions
countries_info['text'] = countries_info['text'].map(lambda x:
                            x.split("posting about ")[-1].replace(".","").lower().split(','))
# converting str to int
countries_info['follower_count'] = countries_info['follower_count'].map(lambda x:
                                                int(x.replace(",","")))
countries_info['post_count'] = countries_info['post_count'].map(lambda x:
                                                int(x.replace(",","")))
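The string clean-up in the cell above can be checked on single values before it is applied to whole columns. This is a minimal sketch, assuming raw values look like `'51.2%'` and `'138,202,795'` and that descriptions end with `posting about ...`. The helper names and the sample description are hypothetical.

```python
def parse_percent(value):
    """'51.2%' -> 51.2"""
    return float(value.replace('%', ''))

def parse_int(value):
    """'138,202,795' -> 138202795"""
    return int(value.replace(',', ''))

def parse_categories(description):
    """Take the text after 'posting about ' and split it into a list."""
    tail = description.split('posting about ')[-1]
    # strip() removes the space left after each comma;
    # the notebook cell splits on ',' without stripping
    return [c.strip() for c in tail.replace('.', '').lower().split(',')]

# hypothetical description, written only for this example
sample = 'An account posting about Photography, Travel, Nature.'
print(parse_percent('51.2%'))
print(parse_int('138,202,795'))
print(parse_categories(sample))
```

Stripping each category keeps values such as `' travel'` and `'travel'` from being treated as different categories later.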

In [8]:
# renaming columns and drop unwanted
countries_info.rename(columns={'text':'profile_categories',
                              'category':'main_prof_category'},inplace=True)
countries_info.drop(columns=['paid_promotion','free_promotion'],inplace=True)

In [9]:
countries_info.head(3)

Unnamed: 0,country,account,main_prof_category,follower_count,post_count,female,male,profile_categories
0,United States,natgeo,Publishers,138202795,22841,51.2,48.8,"[photography, travel, nature]"
1,United States,jlo,Creators & Celebrities,123230943,2860,57.0,43.0,"[modeling, music, singer]"
2,United States,katyperry,Creators & Celebrities,98782627,1543,44.1,55.9,[singer]


### Uploading saved main_df_posts file
The information was scraped from Instagram using Octoparse.

Deleting unwanted columns and renaming the remaining columns.

In [10]:
# opening saved df and make column names lower case
main_df_posts = pd.read_csv('./datasets/main_df_posts.csv')
main_df_posts.columns = [col.lower() for col in main_df_posts.columns]
main_df_posts.drop(columns=['datesing'],inplace=True)
main_df_posts.rename(columns={'title':'post_text', 'titlesing':'text2', 'alt1':'what_on_photo', 
        'locsing':'location', 'likesing':'num_likes_post',
       'viewsing':'num_views', 'timesoing':'time', 'profile':'profile_name', 'posts':'num_posts', 
        'name':'full_name_profile', 'description':'profile_description',
       'link':'personal_link','text6':'comment6', 'text9':'comment8',
       'text11':'comment10', 'count_comm':'num_followers', 
        'alt3':'what_on_photo3', 'alt2':'what_on_photo2', 'carousel':'is_carousel'},inplace=True)

In [11]:
# Extracting post ids for future merging with other information from json files
main_df_posts['photo_id'] = main_df_posts['page_url1'].map(lambda x: x.split('/')[-2])
main_df_posts.drop(columns=['page_url1'],inplace=True)
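The id extraction relies on the URL ending with a slash, so that the id is the second-to-last piece after splitting on `/`. The sketch below assumes that `page_url1` holds links of the form `https://www.instagram.com/p/<id>/`, which is not shown in the notebook. The helper `post_id_from_url` is hypothetical and also handles a missing trailing slash.

```python
from urllib.parse import urlparse

def post_id_from_url(url):
    """Return the last non-empty path segment of a post URL."""
    segments = [s for s in urlparse(url).path.split('/') if s]
    return segments[-1] if segments else None

# assumed URL shape; the id is the one shown in the first row of the table
url = 'https://www.instagram.com/p/CAbg0DGF4vT/'
print(url.split('/')[-2])          # what the cell above does
print(post_id_from_url(url))       # same result, independent of the trailing slash
```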

In [12]:
main_df_posts.head(2)

Unnamed: 0,post_text,text2,what_on_photo,location,num_likes_post,num_views,time,profile_name,num_posts,full_name_profile,...,comment9,comment8,comment11,comment10,comment12,num_followers,what_on_photo3,what_on_photo2,is_carousel,photo_id
0,Photo by Michaela Skovranova @mishkusk | An ic...,,"<img alt=""Photo by National Geographic on May ...",,"86,864 likes",,"<time class=""_1o9PC Nzb55"" datetime=""2020-05-2...",natgeo,"22,741 posts",National Geographic,...,Why must we ruin what is given to us 😢 corona ...,💗,🔥🔥,ٰ,🔥👌,137086411,,,,CAbg0DGF4vT
1,"Photo by Ivan Kashinsky @ivankphoto | ""As craz...",,"<img alt=""Photo by National Geographic on May ...",,"176,242 likes",,"<time class=""_1o9PC Nzb55"" datetime=""2020-05-2...",natgeo,"22,741 posts",National Geographic,...,At least the view of the sea is pleasing,This pic is corona virus,What?????!!!,They left open beaches to drive to a closed on...,Hello,137086409,,,,CAbPzJiM9R9


In [13]:
# looking at shape of df
main_df_posts.shape

(252697, 29)

In [14]:
# dropping duplicated rows (they were created due to videos in carousel)
main_df_posts.drop_duplicates(inplace=True)

In [15]:
# Looking at new shape of df
# 7989 rows were dropped
main_df_posts.shape

(244708, 29)

### Uploading saved main_df_comments file
The information was scraped from Instagram using a scraper.

In [16]:
# Reading csv with additional information
comments = pd.read_csv('./datasets/main_df_comments.csv')

In [17]:
# Changing post_type to more understandable labels
# GraphSidecar=carousel, GraphVideo=video, GraphImage=image
comments['post_type'] = comments['post_type'].map({'GraphSidecar':'carousel', 
                                                  'GraphVideo':'video', 
                                                  'GraphImage':'image'})

In [18]:
comments.head()

Unnamed: 0,num_likes,num_comments,post_type,user_name,photo_id,timestamp,caption_text
0,3506,6,image,100flavoursuk,CAsYhYdgRU3,1590587180,#100flavoursuk
1,7447,9,image,100flavoursuk,CAsYQMZA8cP,1590587039,#100flavoursuk
2,1921,5,image,100flavoursuk,CAqSkOfgOFB,1590516948,#100flavoursuk
3,2960,11,image,100flavoursuk,CAqScMKAJoW,1590516882,#100flavoursuk
4,2306,6,image,100flavoursuk,CAqSX76ghYH,1590516848,#100flavoursuk


In [19]:
# checking missing values and dtypes
comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 341121 entries, 0 to 341120
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   num_likes     341121 non-null  int64 
 1   num_comments  341121 non-null  int64 
 2   post_type     341121 non-null  object
 3   user_name     341121 non-null  object
 4   photo_id      341121 non-null  object
 5   timestamp     341121 non-null  int64 
 6   caption_text  336463 non-null  object
dtypes: int64(3), object(4)
memory usage: 18.2+ MB


In [20]:
# Looking at shape
comments.shape

(341121, 7)

In [21]:
# dropping duplicated rows
comments.drop_duplicates(inplace=True)

In [22]:
# Checking shape again, 389 rows were dropped
comments.shape

(340732, 7)

## Merging three datasets

In [23]:
# Merging posts and comments based on post's ids
main_df = pd.merge(main_df_posts,comments, on='photo_id',how='left')

In [24]:
# fill Nan in profile names before merging with another df
main_df['profile_name'] = main_df['profile_name'].fillna(main_df['user_name'])

In [25]:
# Merging with profile information (countries_info) based on profile name
main_df = pd.merge(main_df,countries_info,left_on='profile_name',
                            right_on='account',how='left')

In [26]:
# Looking for shape of merged df
main_df.shape

(244709, 43)
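The merged frame has 244709 rows, one more than the 244708 rows of `main_df_posts` after deduplication. A left merge adds rows only when a key occurs more than once on the right side, so checking the right-hand keys for duplicates is one way to find the cause. The sketch below shows the idea on plain Python lists. The key values and helper names are made up.

```python
from collections import Counter

def duplicated_keys(keys):
    """Return the keys that occur more than once, with their counts."""
    return {k: n for k, n in Counter(keys).items() if n > 1}

def left_join_row_count(left_keys, right_keys):
    """Number of rows a left join produces for the given key columns."""
    right_counts = Counter(right_keys)
    # an unmatched left row is kept once; a matched row is repeated per match
    return sum(max(right_counts[k], 1) for k in left_keys)

left = ['a1', 'b2', 'c3']      # hypothetical photo_id values on the left
right = ['a1', 'b2', 'b2']     # 'b2' is duplicated on the right side
print(duplicated_keys(right))
print(left_join_row_count(left, right))
```

In the notebook the same check would apply to `photo_id` in `comments` and to `account` in `countries_info`.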

In [27]:
# Checking for missing values
main_df.isnull().sum()

post_text               98382
text2                  154740
what_on_photo          153270
location               174718
num_likes_post          39521
num_views              211894
time                     6498
profile_name              220
num_posts                 188
full_name_profile        3095
profile_description      7139
personal_link           39308
comment1                14740
comment2                22195
comment3                28959
comment4                35139
comment5                40704
comment6                45956
comment7                50530
comment9                54869
comment8                58892
comment11               62600
comment10               65950
comment12               69146
num_followers             247
what_on_photo3         200988
what_on_photo2         170823
is_carousel            134356
photo_id                    0
num_likes                1993
num_comments             1993
post_type                1993
user_name                1993
timestamp 

In [28]:
# Checking for data types
main_df.dtypes

post_text               object
text2                   object
what_on_photo           object
location                object
num_likes_post          object
num_views               object
time                    object
profile_name            object
num_posts               object
full_name_profile       object
profile_description     object
personal_link           object
comment1                object
comment2                object
comment3                object
comment4                object
comment5                object
comment6                object
comment7                object
comment9                object
comment8                object
comment11               object
comment10               object
comment12               object
num_followers           object
what_on_photo3          object
what_on_photo2          object
is_carousel             object
photo_id                object
num_likes              float64
num_comments           float64
post_type               object
user_nam

As we can see, there are a lot of missing values and wrong data types. Let's clean the data frame.

Some values, such as profile name or photo id, are missing because they were not scraped properly. This may be due to the privacy settings of the profiles.

Other values are missing because the information is absent from the profiles.

## Data Cleaning

While scraping, text from different types of posts was saved in different columns. Let's combine them now.

In [29]:
# fill NaN in post_text from text2
main_df['post_text'] = main_df['post_text'].fillna(main_df['text2'])
# fill remaining NaN in post_text from caption_text, then with 'None'
main_df['post_text'] = main_df['post_text'].fillna(main_df['caption_text'])
main_df['post_text'] = main_df['post_text'].fillna('None')

In [30]:
# fill Nan in what_on_photo2
main_df['what_on_photo2'] = main_df['what_on_photo2'].fillna(main_df['what_on_photo'])
# fill Nan in what_on_photo3
main_df['what_on_photo3'] = main_df['what_on_photo3'].fillna(main_df['what_on_photo2'])

In [31]:
# Drop unwanted columns which were used above
main_df.drop(columns=['what_on_photo2','what_on_photo','text2'],inplace=True)
main_df.rename(columns={'what_on_photo3':'what_on_photo'},inplace=True)

In [32]:
main_df['what_on_photo'].isnull().sum()

47536

Location, personal link, profile description and full name were not specified in every post. Unspecified cells will be filled with 'Not specified'.

In [33]:
# fill Nan location
main_df['location'] = main_df['location'].fillna('Not specified')

In [34]:
# fill Nan in personal_link
main_df['personal_link'] = main_df['personal_link'].fillna('Not specified')

In [35]:
# fill Nan in profile_description
main_df['profile_description'] = main_df['profile_description'].fillna('Not specified')

In [36]:
# fill Nan in full_name_profile
main_df['full_name_profile'] = main_df['full_name_profile'].fillna('Not specified')

Converting strings to integers by removing ',' and words.

In [37]:
# fill NaN in num_views with '0 views'
# then strip the word 'views' and ',' and convert to int
main_df['num_views'] = main_df['num_views'].fillna('0 views')
main_df['num_views'] = main_df['num_views'].map(lambda x: 
                                        int(x.split(' ')[0].replace(',','')))

In [38]:
# num_posts: delete ',' and the word 'posts', convert to int
# (rows with NaN are skipped here)
not_null = main_df[main_df['num_posts'].isnull() == False].index
main_df.loc[not_null,'num_posts'] = main_df.loc[not_null,'num_posts'].map(lambda x:
                                    int(x.replace(',', '').split(' ')[0]))

In [39]:
# num_likes_post: delete ',' and the word 'likes'
# (rows with NaN are skipped here and filled later)
index_not_nan = main_df[main_df['num_likes_post'].isnull()==False].index
main_df.loc[index_not_nan,'num_likes_post'] = main_df.loc[index_not_nan,'num_likes_post'].map(lambda x:
                                    x.replace(',', '').split(' ')[0])

In [40]:
# num_views was already converted to int above;
# this pass only strips ',' and words from any values that are still strings
index_not_nan = main_df[main_df['num_views'].isnull()==False].index
main_df.loc[index_not_nan,'num_views'] = main_df.loc[index_not_nan,'num_views'].map(lambda x:
                                    x.replace(',', '').split(' ')[0] if isinstance(x,str) else x)

In [41]:
# delete ',' in numbers
index_not_nan = main_df[main_df['num_followers'].isnull()==False].index
main_df.loc[index_not_nan,'num_followers'] = main_df.loc[index_not_nan,'num_followers'].map(lambda x:
                                    int(x.replace(',', '')))
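The four conversion cells above repeat the same pattern: drop the thousands separators, drop the trailing word, and skip missing values. A single helper could cover all of them. This is a sketch only; `parse_count` is a hypothetical name, and it returns `None` for values without digits, such as the bare `'like'` string that is filtered out later in the notebook.

```python
import re

def parse_count(value):
    """'86,864 likes' -> 86864; returns None when no digits are found."""
    if value is None:
        return None
    match = re.search(r'\d[\d,]*', str(value))
    if match is None:
        return None
    return int(match.group().replace(',', ''))

for raw in ['86,864 likes', '22,741 posts', '0 views', 'like', None]:
    print(repr(raw), '->', parse_count(raw))
```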

Some information was missed by the Octoparse scraper, but the second scraper captured it. Missing values from the first scraper will be filled with values from the second.

In [42]:
# if there is number of views it means it is video
index = main_df[(main_df['post_type'].isnull()==True)&(main_df['num_views']>0)].index
main_df.loc[index,'post_type'] = 'video'

In [43]:
# fill NaN in num_likes_post with num_likes from the comments df
main_df['num_likes_post'] = main_df['num_likes_post'].fillna(main_df['num_likes'])

Getting time and converting to datetime type

In [44]:
main_df['time'][0]

'<time class="_1o9PC Nzb55" datetime="2020-05-21T00:31:43.000Z" title="May 21, 2020">3 hours ago</time>'

In [45]:
# Filtering NAN
not_null = main_df[main_df['time'].isnull() == False]
# Getting time from html tags
main_df.loc[not_null.index,'time'] = main_df.loc[not_null.index,'time'].map(lambda x: 
                                                    x.split(' ')[3].split('=')[1].split('.')[0])

In [46]:
main_df['time'].isnull().sum()

6498

In [47]:
# Filling NaN in time with timestamp, then with 0.0 as a placeholder
main_df['time'] = main_df['time'].fillna(main_df['timestamp'])
main_df['time'] = main_df['time'].fillna(0.0)

In [48]:
# Converting date to datetime format
not_null = main_df[main_df['time'].isnull() == False].index
# main_df.loc[not_null,'time'] = pd.to_datetime(main_df.loc[not_null,'time'], format='"%Y-%m-%dT%H:%M:%S')
main_df.loc[not_null,'time'] = main_df.loc[not_null,'time'].apply(lambda x: 
                                    dt.fromtimestamp(x) if isinstance(x,float)
                                    else pd.to_datetime(x,format='"%Y-%m-%dT%H:%M:%S'))

In [49]:
main_df['time'].value_counts().head()

1970-01-01 08:00:00    314
2020-05-06 18:54:43     13
2020-01-23 04:18:35     13
2020-04-06 10:33:52     12
2019-11-29 21:37:59     12
Name: time, dtype: int64
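The 314 rows dated `1970-01-01 08:00:00` are consistent with the `0.0` placeholder being converted by `dt.fromtimestamp`, which uses the local time zone of the machine. The sketch below parses both input shapes into UTC explicitly, so timestamps and `<time>` tags end up on the same clock. The function name is hypothetical, and the regular expression assumes the tag always carries a `datetime="..."` attribute as in the sample above.

```python
import re
from datetime import datetime, timezone

TIME_TAG = ('<time class="_1o9PC Nzb55" datetime="2020-05-21T00:31:43.000Z" '
            'title="May 21, 2020">3 hours ago</time>')

def parse_post_time(value):
    """Parse a <time> tag or a Unix timestamp into a naive UTC datetime."""
    if isinstance(value, (int, float)):
        # explicit UTC avoids the local-time shift of fromtimestamp(x)
        return datetime.fromtimestamp(value, tz=timezone.utc).replace(tzinfo=None)
    match = re.search(r'datetime="([^".]+)', value)
    if match is None:
        return None
    return datetime.strptime(match.group(1), '%Y-%m-%dT%H:%M:%S')

print(parse_post_time(TIME_TAG))     # 2020-05-21 00:31:43
print(parse_post_time(1590587180))   # a timestamp from the comments table
print(parse_post_time(0.0))          # 1970-01-01 00:00:00
```

Rows that still carry the 1970 placeholder have no real posting time, so they may need to be excluded from hour and weekday features.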

In [50]:
# Getting what on photo from html tags
not_null = main_df[main_df['what_on_photo'].isnull() == False]
main_df.loc[not_null.index,'what_on_photo'] = main_df.loc[not_null.index,'what_on_photo'].map(lambda x: 
                                        x.split(':')[1].split('"')[0])
main_df['what_on_photo'] = main_df['what_on_photo'].fillna('None')

In [51]:
# Fill NaN in is_carousel with False
main_df['is_carousel'] = main_df['is_carousel'].fillna(False)
# Any other value means the post is a carousel
main_df['is_carousel'] = main_df['is_carousel'].map(lambda x: x != False)

In [52]:
# filling Nan with ' ' space
for i in range(1,13):
    column_name = 'comment'+str(i)
    main_df[column_name] = main_df[column_name].fillna(' ')
# concatenate comments into one column
main_df['comments'] = (main_df['comment1']+' '+main_df['comment2']+' '+main_df['comment3']+' '+
main_df['comment4']+' '+main_df['comment5']+' '+main_df['comment6']+' '+main_df['comment7']+' '+
main_df['comment8']+' '+main_df['comment9']+' '+main_df['comment10']+' '+main_df['comment11']+' '+
main_df['comment12'])

In [53]:
# Dropping useless columns
main_df.drop(columns=['num_likes','user_name','caption_text','comment1','comment2','comment3',
                     'comment4','comment5','comment6','comment7','comment8','comment9','comment10',
                     'comment11','comment12','account','is_carousel','follower_count',
                     'post_count'],inplace=True)

In [54]:
# Deleting rows without a profile name
main_df.dropna(subset=['profile_name'],inplace=True)

# deleting rows that have neither a number of likes nor views (not a video)
index = main_df[(main_df['num_likes_post'].isnull()==True)&(main_df['num_views']==0)].index
main_df.drop(index=index,inplace=True,axis=0)

In [55]:
# Dtype change for number of posts
main_df['num_posts'] = main_df['num_posts'].astype('int')

In [56]:
# Dtype change for num_followers
main_df['num_followers'] = main_df['num_followers'].astype('int')

### Data imputation

In [57]:
# additional df with likes for videos from an additional Octoparse run
video_likes = pd.read_csv('./datasets/video_likes.csv')
video_likes['URL'] = video_likes['URL'].map(lambda x: x.split('/')[-2])
main_df = pd.merge(main_df,video_likes,left_on='photo_id',right_on='URL',how='left')
main_df['num_likes_post'] = main_df['num_likes_post'].fillna(main_df['Likes'])
main_df.drop(columns=['Likes','URL'],inplace=True)

Checking missing values, then dropping the remaining rows with NaN

In [58]:
main_df.isnull().sum()

post_text                 0
location                  0
num_likes_post          120
num_views                 0
time                      0
profile_name              0
num_posts                 0
full_name_profile         0
profile_description       0
personal_link             0
num_followers             0
what_on_photo             0
photo_id                  0
num_comments           1653
post_type               981
timestamp              1653
country                 272
main_prof_category      272
female                  272
male                    272
profile_categories      272
comments                  0
dtype: int64

In [59]:
# Dropping remaining rows with missing values
main_df.dropna(subset=['num_likes_post'],inplace=True)
main_df.dropna(subset=['num_comments'],inplace=True)
main_df.dropna(subset=['country'],inplace=True)
main_df.drop(index=main_df[main_df['num_likes_post']=='like'].index,inplace=True)
main_df.drop(index=main_df[(main_df['num_likes_post'] ==0)].index,inplace=True)
main_df.drop(columns=['timestamp'],inplace=True)

main_df['num_likes_post'] = main_df['num_likes_post'].astype(int)

# Dropping rows from wrongly scraped countries
index = main_df[main_df['country']=='Russian Federation'].index
main_df.drop(index=index,inplace=True)
index = main_df[main_df['country']=='Indonesia'].index
main_df.drop(index=index,inplace=True)

In [60]:
main_df.isnull().sum()

post_text              0
location               0
num_likes_post         0
num_views              0
time                   0
profile_name           0
num_posts              0
full_name_profile      0
profile_description    0
personal_link          0
num_followers          0
what_on_photo          0
photo_id               0
num_comments           0
post_type              0
country                0
main_prof_category     0
female                 0
male                   0
profile_categories     0
comments               0
dtype: int64

In [61]:
main_df.dtypes

post_text                      object
location                       object
num_likes_post                  int32
num_views                       int64
time                   datetime64[ns]
profile_name                   object
num_posts                       int32
full_name_profile              object
profile_description            object
personal_link                  object
num_followers                   int32
what_on_photo                  object
photo_id                       object
num_comments                  float64
post_type                      object
country                        object
main_prof_category             object
female                        float64
male                          float64
profile_categories             object
comments                       object
dtype: object

In [62]:
main_df.shape

(241976, 21)

In [63]:
main_df.head(3)

Unnamed: 0,post_text,location,num_likes_post,num_views,time,profile_name,num_posts,full_name_profile,profile_description,personal_link,...,what_on_photo,photo_id,num_comments,post_type,country,main_prof_category,female,male,profile_categories,comments
0,Photo by Michaela Skovranova @mishkusk | An ic...,Not specified,86864,0,2020-05-21 00:31:43,natgeo,22741,National Geographic,Experience the world through the eyes of Natio...,on.natgeo.com/instagram,...,"cloud, sky, ocean, outdoor, nature and water",CAbg0DGF4vT,464.0,image,United States,Publishers,51.2,48.8,"[photography, travel, nature]",💙 Worldstar blocked me because I post better f...
1,"Photo by Ivan Kashinsky @ivankphoto | ""As craz...",Not specified,176242,0,2020-05-20 22:03:04,natgeo,22741,National Geographic,Experience the world through the eyes of Natio...,on.natgeo.com/instagram,...,"1 person, child and outdoor",CAbPzJiM9R9,3049.0,image,United States,Publishers,51.2,48.8,"[photography, travel, nature]",This is beautiful!! 🤷🏻‍♀️ okkkkkk??? And Jaden...
2,Photo by David Guttenfelder @dguttenfelder | A...,Not specified,271002,0,2020-05-20 19:34:07,natgeo,22741,National Geographic,Experience the world through the eyes of Natio...,on.natgeo.com/instagram,...,"sky, cloud, outdoor and nature",CAa-wN1pAcV,1161.0,image,United States,Publishers,51.2,48.8,"[photography, travel, nature]","Damn Natgeo, clean up these bots 🔞🔞🔞\n..........."


## Feature creation

Categorizing profiles based on the number of followers:

<img src=https://hypeauditor.com/blog/wp-content/uploads/2019/03/How-many-Followers-make-an-Instagram-Influencer-1.png width="600" height="400">

In [64]:
# Categorizing profiles
def influencer_type(num_followers):
    # >= so that exactly 1,000,000 followers is 'Mega' and not the fallback label
    if num_followers >= 1_000_000:
        return 'Mega'
    elif 100_000 <= num_followers < 1_000_000:
        return 'Macro'
    elif 20_000 <= num_followers < 100_000:
        return 'Midi'
    elif 5_000 <= num_followers < 20_000:
        return 'Micro'
    elif 1_000 <= num_followers < 5_000:
        return 'Nano'
    else:
        # label spelling is kept because later cells filter on this exact string
        return 'Not influrncer'

main_df['influencer_type'] = main_df['num_followers'].map(lambda x:influencer_type(x))
main_df['influencer_type'].value_counts()

Macro             113696
Mega               77722
Midi               25516
Micro              24843
Not influrncer       154
Nano                  45
Name: influencer_type, dtype: int64

In [65]:
# the number of 'Nano' and 'Not influrncer' rows is very small, so they will be dropped
no_influens_index = main_df[main_df['influencer_type'].isin(['Nano','Not influrncer'])].index
main_df.drop(index=no_influens_index,inplace=True)
main_df['influencer_type'].value_counts()

Macro    113696
Mega      77722
Midi      25516
Micro     24843
Name: influencer_type, dtype: int64
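The tier thresholds can also be kept in one sorted list and looked up with `bisect`, which makes the boundaries easy to audit. This is a sketch: `influencer_tier`, `BOUNDS` and `LABELS` are hypothetical names, each lower bound is treated as inclusive, and the fallback label is spelled differently from the one used in the notebook.

```python
from bisect import bisect_right

# lower bounds of each tier, taken from the thresholds used above
BOUNDS = [1_000, 5_000, 20_000, 100_000, 1_000_000]
LABELS = ['Not influencer', 'Nano', 'Micro', 'Midi', 'Macro', 'Mega']

def influencer_tier(num_followers):
    """Map a follower count to a tier; each lower bound is inclusive."""
    return LABELS[bisect_right(BOUNDS, num_followers)]

for n in [999, 1_000, 19_999, 20_000, 999_999, 1_000_000]:
    print(n, influencer_tier(n))
```

Testing the values right at each boundary, as in the loop above, is a quick way to confirm that no follower count falls through to the fallback label by accident.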

In [66]:
# extracting hashtags
main_df['hashtags_post_text'] = main_df['post_text'].apply(lambda s:re.findall(r"#(\w+)", s.lower()))
main_df['hashtags_comments'] = main_df['comments'].apply(lambda s:re.findall(r"#(\w+)", s.lower()))
main_df['hashtags'] = main_df['hashtags_post_text']+main_df['hashtags_comments']
main_df['hashtags'] = main_df['hashtags'].map(lambda s: list(set(s)))
main_df['hashtags'][:3]

0    [landscape, antarctica, ocean, naturelovestori...
1    [humanity, human_rights, opencalifornia, war, ...
2                                                   []
Name: hashtags, dtype: object
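Deduplicating with `set` works, but the order of the resulting list is arbitrary. If the order of first appearance matters, `dict.fromkeys` gives the same deduplication while keeping it. A small sketch with a made-up caption and comment; `extract_hashtags` is a hypothetical helper.

```python
import re

def extract_hashtags(*texts):
    """Unique lower-cased hashtags from several texts, in order of first appearance."""
    found = []
    for text in texts:
        found.extend(re.findall(r'#(\w+)', text.lower()))
    # dict.fromkeys removes duplicates but, unlike set(), keeps the order
    return list(dict.fromkeys(found))

caption = 'Sunrise over the ice #Antarctica #ocean'   # made-up caption
comments = 'Wow #ocean #landscape'                    # made-up comments
print(extract_hashtags(caption, comments))   # ['antarctica', 'ocean', 'landscape']
```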

In [67]:
# extracting emoji
main_df['emoji_post_text'] = main_df['post_text'].apply(lambda s:(list(set([c for c in s if c in emoji.UNICODE_EMOJI]))))
main_df['emoji_comments'] = main_df['comments'].apply(lambda s:(list(set([c for c in s if c in emoji.UNICODE_EMOJI]))))
main_df['emoji_comments'][:3]

0    [💙, ❤, 💗, 🔥, 👌, 👍, 👏, 🌍, 💜, 😢]
1                [♀, 🇺, 🤔, 🏻, 🇸, 🤷]
2                      [🔥, 🔞, 😍, 💖]
Name: emoji_comments, dtype: object

In [68]:
# creating a column length of text under photo
main_df['len_post_text'] = main_df['post_text'].map(lambda x: len(x))

In [69]:
# creating a column with number of unique hashtags
main_df['num_of_unique_hashtags'] = main_df['hashtags'].map(lambda x: len(x))

In [70]:
# # clean non english hashtags
# import nltk
# words = set(nltk.corpus.words.words())

# sent = "Io andiamo to the beach with my amico."
# " ".join(w for w in nltk.wordpunct_tokenize(sent) \
#          if w.lower() in words or not w.isalpha())

In [71]:
main_df.columns

Index(['post_text', 'location', 'num_likes_post', 'num_views', 'time',
       'profile_name', 'num_posts', 'full_name_profile', 'profile_description',
       'personal_link', 'num_followers', 'what_on_photo', 'photo_id',
       'num_comments', 'post_type', 'country', 'main_prof_category', 'female',
       'male', 'profile_categories', 'comments', 'influencer_type',
       'hashtags_post_text', 'hashtags_comments', 'hashtags',
       'emoji_post_text', 'emoji_comments', 'len_post_text',
       'num_of_unique_hashtags'],
      dtype='object')

In [72]:
# grouping dataset by unique profile names
# numeric_only=True keeps text and list columns out of the mean
unique_profiles = main_df.groupby('profile_name').mean(numeric_only=True)

In [73]:
# dropping profiles with 1 post only
prof_names_1_post = unique_profiles[unique_profiles['num_posts']<2].index
main_df.drop(index = main_df[main_df['profile_name'].isin(prof_names_1_post)].index,inplace=True)

In [75]:
# creating information about mean number of likes and comments by profile,
# frequency of post publishing (mean hours between consecutive posts)
# and number of scraped rows per profile
main_df['number_rows'] = 0
main_df['mean_likes'] = 0.0
for profile in main_df['profile_name'].unique():
    temp_df = main_df[main_df['profile_name']==profile].sort_values('time')
    index = temp_df.index
    # mean likes in profile
    main_df.loc[index,'mean_likes'] = temp_df['num_likes_post'].mean()
    # mean number of comments of a profile
    main_df.loc[index,'mean_comments'] = temp_df['num_comments'].mean()
    # number of scraped posts of a profile
    main_df.loc[index,'number_rows'] = temp_df.shape[0]
    # frequency of publishing posts
    temp_df = temp_df.drop_duplicates(subset=['time','post_text'])
    time_diff = temp_df['time'] - temp_df['time'].shift(1)
    # total_seconds() includes whole days; .seconds would discard them
    main_df.loc[index,'post_frequency'] = time_diff.mean().total_seconds() / 60 / 60
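When a time difference is turned into hours, the `.seconds` attribute of a timedelta holds only the part below one day, while `.total_seconds()` covers the whole duration. The sketch below shows the difference and computes the mean gap between posts with the standard library. The post times and the helper name are made up.

```python
from datetime import datetime, timedelta

def mean_hours_between_posts(times):
    """Mean gap in hours between consecutive, de-duplicated post times."""
    ordered = sorted(set(times))
    if len(ordered) < 2:
        return None
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    mean_gap = sum(gaps, timedelta()) / len(gaps)
    return mean_gap.total_seconds() / 3600

gap = timedelta(days=2, hours=6)
print(gap.seconds / 3600)           # 6.0, the two whole days are ignored
print(gap.total_seconds() / 3600)   # 54.0

# hypothetical posting times of one profile
times = [datetime(2020, 5, 1, 12), datetime(2020, 5, 3, 18), datetime(2020, 5, 6, 0)]
print(mean_hours_between_posts(times))   # 54.0
```

For profiles that post less than once a day, the two attributes give very different values, so the choice affects the `post_frequency` feature directly.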

In [None]:
# Converting GMT to local time with an average UTC offset per country
# (timedelta is not imported in the first cell, so it is imported here)
from datetime import timedelta
# offsets are signed (negative = behind GMT), so they are always added
tdelta_us = timedelta(hours=-7)
tdelta_uk = timedelta(hours=1)
tdelta_aus = timedelta(hours=10)
tdelta_nz = timedelta(hours=12)
indx = main_df[main_df['country'] == 'United States'].index
main_df.loc[indx,'time'] = main_df.loc[indx,'time'] + tdelta_us
indx = main_df[main_df['country'] == 'United Kingdom'].index
main_df.loc[indx,'time'] = main_df.loc[indx,'time'] + tdelta_uk
indx = main_df[main_df['country'] == 'Australia'].index
main_df.loc[indx,'time'] = main_df.loc[indx,'time'] + tdelta_aus
indx = main_df[main_df['country'] == 'New Zealand'].index
main_df.loc[indx,'time'] = main_df.loc[indx,'time'] + tdelta_nz

# Extracting hour and weekday of posting
main_df['hour'] = main_df['time'].dt.hour
main_df['weekday'] = main_df['time'].dt.weekday
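The conversion can be sanity-checked on a single timestamp with the standard library. This is a sketch under a strong simplification: one fixed offset per country, signed in the usual way (negative west of Greenwich), with no daylight-saving changes. The offset table and `to_local` are hypothetical, and the United Kingdom value assumes summer time.

```python
from datetime import datetime, timedelta, timezone

# assumed average UTC offsets per country (one offset per country, no DST)
UTC_OFFSETS = {
    'United States': timedelta(hours=-7),
    'United Kingdom': timedelta(hours=1),
    'Australia': timedelta(hours=10),
    'New Zealand': timedelta(hours=12),
}

def to_local(utc_time, country):
    """Shift a naive UTC datetime by the country's offset; unknown countries stay in UTC."""
    offset = UTC_OFFSETS.get(country, timedelta(0))
    aware = utc_time.replace(tzinfo=timezone.utc).astimezone(timezone(offset))
    return aware.replace(tzinfo=None)

posted = datetime(2020, 5, 21, 0, 31, 43)    # UTC time of the first post shown earlier
local = to_local(posted, 'United States')
print(local, local.hour, local.weekday())    # 2020-05-20 17:31:43 17 2
```

The example shows why the sign matters: a post made shortly after midnight UTC belongs to the previous evening, and the previous weekday, for a profile in the United States.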

In [76]:
# creating column percent of comments_engagement = num_comments /num_followers
main_df['comments_engagement'] = round(main_df['num_comments']/main_df['num_followers']*100,2)

In [77]:
# creating column percent of like_engagment = num_likes /num_followers
main_df['like_engagement'] = round(main_df['num_likes_post']/main_df['num_followers']*100,2)
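Both engagement features follow the same formula: interactions divided by followers, expressed as a percentage and rounded to two decimals. A sketch of that formula as a plain function; `engagement_rate` is a hypothetical helper, and the guard against a zero follower count is an addition that the notebook cells do not have.

```python
def engagement_rate(interactions, num_followers):
    """Interactions as a percentage of followers, rounded to 2 decimals."""
    if not num_followers:
        return None   # avoid division by zero for missing or zero follower counts
    return round(interactions / num_followers * 100, 2)

# values from the first natgeo row shown earlier
print(engagement_rate(86864, 137086411))   # like engagement
print(engagement_rate(464, 137086411))     # comment engagement
```

For very large accounts the comment rate rounds to `0.0` at two decimals, so more decimals may be worth keeping if these accounts need to be compared with each other.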

In [78]:
# Sentiment Analysis on post text
main_df['polarity_post_txt'] = main_df['post_text'].map(lambda x: TextBlob(x).sentiment.polarity)
main_df['subjectivity_post_txt'] = main_df['post_text'].map(lambda x: 
                                                    TextBlob(x).sentiment.subjectivity)

In [79]:
main_df[['polarity_post_txt','subjectivity_post_txt']].mean()

polarity_post_txt        0.186469
subjectivity_post_txt    0.389980
dtype: float64

### Text preprocessing

In [None]:
# Converting list with hashtags to string
main_df['hashtags'] = main_df['hashtags'].apply(lambda row: ' '.join(word for word in row))

In [16]:
def text_preprocessing(columns,df=main_df):
    for column in columns:
        # removing non-letters
        df[column] = df[column].map(lambda x:re.sub("[^a-zA-Z]", " ", x))
        # Instantiating Tokenizer and setting a pattern to only words
        # applying Tokenizer to texts
        tokenizer = RegexpTokenizer(r'\w+')
        df[column] = df[column].map(lambda x: tokenizer.tokenize(x.lower()))
        # # back to string
        # main_df['post_text'] = main_df['post_text'].apply(lambda row: ' '.join(word for word in row))
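The effect of the non-letter filter and the tokenizer can be reproduced on one string with the standard library. This sketch uses `str.split` in place of `RegexpTokenizer`, which gives the same tokens once only letters and spaces remain. `tokenize_letters` is a hypothetical name.

```python
import re

def tokenize_letters(text):
    """Replace every non-letter with a space, lower-case, and split into words."""
    return re.sub(r'[^a-zA-Z]', ' ', text).lower().split()

print(tokenize_letters('cloud, sky, ocean, outdoor, nature and water'))
print(tokenize_letters('1 person, child and outdoor'))
```

Note that the pattern `[^a-zA-Z]` also removes digits, emoji, and accented or non-Latin letters, so captions in other languages lose most of their content at this step.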

In [12]:
text_preprocessing(columns=['post_text','what_on_photo','hashtags'])

In [17]:
text_preprocessing(columns=['what_on_photo'])

In [18]:
main_df['what_on_photo']

0         [cloud, sky, ocean, outdoor, nature, and, water]
1                            [person, child, and, outdoor]
2                       [sky, cloud, outdoor, and, nature]
3                            [person, shoes, and, outdoor]
4                                        [person, outdoor]
                                ...                       
241770                                            [people]
241771                                            [people]
241772                                            [people]
241773                                            [people]
241774                                      [people, text]
Name: what_on_photo, Length: 241775, dtype: object

In [22]:
def process_words(columns, stop_words=stop_words, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Remove stop words and lemmatize the given columns of main_df in place."""
    # loading the spaCy model once instead of once per column
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    for column in columns:
        texts = main_df[column].values.tolist()
        texts = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
        texts_out = []
        for sent in texts:
            doc = nlp(" ".join(sent))
            # keeping only lemmas whose part of speech is allowed
            texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
        # remove stopwords once more after lemmatization
        texts_out = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts_out]
        main_df[column] = texts_out

# the function updates main_df in place and returns None, so its result is not assigned
# process_words(columns=['post_text','what_on_photo','hashtags'])
process_words(columns=['what_on_photo'])

In [23]:
main_df['what_on_photo'] = main_df['what_on_photo'].apply(lambda x: list(set(x)))
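
`list(set(x))` removes duplicates but does not keep the original word order, and the resulting order can differ between runs. If the order matters, one alternative is `dict.fromkeys`, sketched below with a hypothetical helper `unique_in_order` on made-up tokens.

```python
def unique_in_order(tokens):
    """Drop duplicate tokens while keeping the first occurrence of each (hypothetical helper)."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(tokens))

deduplicated = unique_in_order(['person', 'outdoor', 'person', 'sky', 'outdoor'])
print(deduplicated)
```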

In [55]:
str_photo = main_df[['what_on_photo','hashtags']].copy()
str_photo['what_on_photo'] = str_photo['what_on_photo'].apply(lambda x: ' '.join(word for word in x))
str_photo[str_photo['what_on_photo'].str.contains('text')]['what_on_photo']

4188                            water stop plant texte dead
44789                      context parent word old patience
50345                                 common sense ex texte
67244     style addict note voice stop long literally texte
98828                                       well texte stop
111294                support lockdown remember check texte
139330                                   texture sweatshirt
139331                               cuff texture pant shoe
Name: what_on_photo, dtype: object
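
The output above shows that `str.contains('text')` matches substrings too, so it also returns "texte", "context" and "texture". A sketch of a whole-word match with a regex word boundary, on a few made-up rows (`demo` is a hypothetical series):

```python
import pandas as pd

# made-up descriptions in the same style as what_on_photo
demo = pd.Series(['water stop plant texte dead',
                  'texture sweatshirt',
                  'people text',
                  'context parent word'])

# substring match: every row contains the characters "text"
substring_hits = demo[demo.str.contains('text')]
# whole-word match: \b requires a word boundary on both sides
whole_word_hits = demo[demo.str.contains(r'\btext\b', regex=True)]
print(whole_word_hits)
```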

In [50]:
# relabeling rows whose what_on_photo value is longer than 10 as 'text'
index = main_df[main_df['what_on_photo'].str.len() > 10].index
main_df.loc[index, 'what_on_photo'] = 'text'
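
The cell above writes the plain string `'text'` into a column that otherwise holds lists. The sketch below assumes that the column contains lists of tokens at this point and that a one-element list is the preferred replacement. `demo` and `MAX_TOKENS` are hypothetical names.

```python
import pandas as pd

MAX_TOKENS = 10  # assumption: same threshold as in the cell above

# made-up rows: a short description and one with 12 tokens
demo = pd.Series([['person', 'outdoor'],
                  [f'word{i}' for i in range(12)]])

# long descriptions are replaced by ['text'] so every row stays a list
relabeled = demo.map(lambda tokens: ['text'] if len(tokens) > MAX_TOKENS else tokens)
print(relabeled)
```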

In [119]:
main_df.isnull().sum()

post_text                 0
location                  0
num_likes_post            0
num_views                 0
time                      0
profile_name              0
num_posts                 0
full_name_profile         0
profile_description       0
personal_link             0
num_followers             0
what_on_photo             0
photo_id                  0
num_comments              0
post_type                 0
country                   0
main_prof_category        0
female                    0
male                      0
profile_categories        0
comments                  0
influencer_type           0
hashtags_post_text        0
hashtags_comments         0
hashtags                  0
emoji_post_text           0
emoji_comments            0
len_post_text             0
num_of_unique_hashtags    0
number_rows               0
mean_likes                0
post_frequency            0
comments_engagement       0
like_engagement           0
polarity_post_txt         0
subjectivity_post_tx

In [56]:
# saving clean dataset with created columns
main_df.to_csv('./datasets/main_df_clean.csv',index=False)

In [2]:
import ast
# parse_dates=True would only try to parse the index, so the time column is named explicitly
main_df = pd.read_csv('./datasets/main_df_clean.csv', parse_dates=['time'])
# main_df['hashtags'] = main_df['hashtags'].apply(ast.literal_eval)
# main_df['post_text'] = main_df['post_text'].apply(ast.literal_eval)
# # main_df['what_on_photo'] = main_df['what_on_photo'].apply(ast.literal_eval)
# main_df['emoji_post_text'] = main_df['emoji_post_text'].apply(ast.literal_eval)
# main_df['emoji_comments'] = main_df['emoji_comments'].apply(ast.literal_eval)
# main_df['profile_categories'] = main_df['profile_categories'].apply(ast.literal_eval)
# main_df['time'] = pd.to_datetime(main_df['time'])
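
The commented-out `ast.literal_eval` lines are there because list columns come back from a CSV file as strings. A minimal round-trip sketch on a made-up frame, written to an in-memory buffer instead of `./datasets/` (`demo`, `buffer` and `restored` are hypothetical names):

```python
import ast
import io

import pandas as pd

# made-up frame with one list column
demo = pd.DataFrame({'hashtags': [['travel', 'sunset'], []],
                     'num_likes_post': [10, 3]})

# write to and read back from an in-memory CSV
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# after reading, each list is the string "['travel', 'sunset']"
raw_type = type(restored.loc[0, 'hashtags'])

# literal_eval parses the string back into a Python list
restored['hashtags'] = restored['hashtags'].apply(ast.literal_eval)
print(raw_type, restored['hashtags'].tolist())
```

`ast.literal_eval` raises an error for values that are not valid Python literals, so it is only suitable for columns in which every row was saved as a list.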

### Modeled topic addition

Topic modeling was performed beforehand in a separate notebook.

In [24]:
topics = pd.read_csv('./datasets/main_df_with_topics.csv')

In [26]:
topics.head()

Unnamed: 0,post_text,location,num_likes_post,num_views,time,profile_name,num_posts,full_name_profile,profile_description,personal_link,...,comments_engagement,like_engagement,polarity_post_txt,subjectivity_post_txt,all_words,Document_No,Dominant_Topic,Topic_Perc_Contrib,Keywords,Text
0,Photo by Michaela Skovranova @mishkusk | An ic...,Not specified,86864,0,2020-05-21 00:31:43,natgeo,22741,National Geographic,Experience the world through the eyes of Natio...,on.natgeo.com/instagram,...,0.0,0.06,0.155,0.63,"['photo', 'by', 'michaela', 'skovranova', 'mis...",0,11.0,0.513,"image, shoot, skin, beauty, light, product, ad...","['iceberg', 'shroud', 'mist', 'enchant', 'thin..."
1,"Photo by Ivan Kashinsky @ivankphoto | ""As craz...",Not specified,176242,0,2020-05-20 22:03:04,natgeo,22741,National Geographic,Experience the world through the eyes of Natio...,on.natgeo.com/instagram,...,0.0,0.13,-0.043664,0.409091,"['photo', 'by', 'ivan', 'kashinsky', 'ivankpho...",1,9.0,0.3121,"support, help, business, work, world, local, t...","['sit', 'fence', 'free', 'month', 'live', 'tog..."
2,Photo by David Guttenfelder @dguttenfelder | A...,Not specified,271002,0,2020-05-20 19:34:07,natgeo,22741,National Geographic,Experience the world through the eyes of Natio...,on.natgeo.com/instagram,...,0.0,0.2,0.25,0.3125,"['photo', 'by', 'david', 'guttenfelder', 'dgut...",2,9.0,0.3901,"support, help, business, work, world, local, t...","['highway', 'sunrise', 'cross', 'electric', 'c..."
3,"Photo by @sarahyltonphoto | In December, women...",Not specified,104315,0,2020-05-20 17:04:58,natgeo,22741,National Geographic,Experience the world through the eyes of Natio...,on.natgeo.com/instagram,...,0.0,0.08,0.261161,0.517857,"['photo', 'by', 'sarahyltonphoto', 'in', 'dece...",3,9.0,0.5108,"support, help, business, work, world, local, t...","['woman', 'sort', 'city', 'large', 'recycling'..."
4,Photos by @gabrielegalimbertiphoto | Before th...,Not specified,135122,0,2020-05-20 14:36:39,natgeo,22741,National Geographic,Experience the world through the eyes of Natio...,on.natgeo.com/instagram,...,0.0,0.1,-0.020455,0.374242,"['photos', 'by', 'gabrielegalimbertiphoto', 'b...",4,11.0,0.346,"image, shoot, skin, beauty, light, product, ad...","['photo', 'treat', 'destine', 'suffocate', 'co..."


In [25]:
main_df['topic'] = topics['Dominant_Topic']
main_df['keywords'] = topics['Keywords']
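
The assignment above aligns the two frames on their index, so it relies on `topics` having the same rows in the same order as `main_df`. A more defensive sketch, assuming `Document_No` holds the row position of each post in `main_df` (an assumption based on the preview above), with made-up data and hypothetical names `main_demo` and `topics_demo`:

```python
import pandas as pd

# made-up stand-ins for main_df and topics
main_demo = pd.DataFrame({'profile_name': ['a', 'b', 'c']})
topics_demo = pd.DataFrame({
    'Document_No': [2, 0, 1],  # deliberately shuffled
    'Dominant_Topic': [9.0, 11.0, 9.0],
    'Keywords': ['support, help', 'image, shoot', 'support, help'],
})

# align topics to the main frame by document number instead of row order
aligned = topics_demo.set_index('Document_No').reindex(main_demo.index)

# fail early if any post did not receive a topic
assert aligned['Dominant_Topic'].notna().all()

main_demo['topic'] = aligned['Dominant_Topic']
main_demo['keywords'] = aligned['Keywords']
print(main_demo)
```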

In [5]:
# saving clean dataset with the added topic and keywords columns
main_df.to_csv('./datasets/main_df_clean.csv',index=False)