### Importing Essential Libraries 

In [1]:
import re

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sb
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the NLTK resources used for tokenizing, stopword removal, and lemmatizing
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\INKY\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\INKY\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\INKY\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Import the Instagram Dataset 

In [2]:
data = pd.read_csv("Datasets/Instagram data.csv", encoding = "latin1")
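
A note on the `encoding` argument: some captions later in this notebook show `Heres` where an apostrophe would be expected. One possible cause, offered here only as an assumption about how the file was saved, is that the CSV uses Windows-1252 (`cp1252`) rather than Latin-1. The two encodings differ for bytes in the range 0x80 to 0x9F, which is where Windows-1252 places curly quotes. The sketch below uses a made-up byte string, not the real file, to show the difference.

```python
# Hypothetical bytes: "Here's how" with a Windows-1252 curly apostrophe (0x92)
raw = b"Here\x92s how"

# Latin-1 maps 0x92 to an invisible control character (U+0092)
as_latin1 = raw.decode("latin1")

# Windows-1252 maps 0x92 to a right single quotation mark (U+2019)
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```

If that assumption holds for the dataset, passing `encoding="cp1252"` to `pd.read_csv` would preserve the apostrophes.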

### Cleaning the Instagram Dataset 

Checking whether the dataset has any missing values (NULL values)

In [3]:
for column in data:
    print("column name:", column, "- missing values", data[column].isnull().sum())
    print("----------------------------------------------------------------------")

column name: Impressions - missing values 0
----------------------------------------------------------------------
column name: From Home - missing values 0
----------------------------------------------------------------------
column name: From Hashtags - missing values 0
----------------------------------------------------------------------
column name: From Explore - missing values 0
----------------------------------------------------------------------
column name: From Other - missing values 0
----------------------------------------------------------------------
column name: Saves - missing values 0
----------------------------------------------------------------------
column name: Comments - missing values 0
----------------------------------------------------------------------
column name: Shares - missing values 0
----------------------------------------------------------------------
column name: Likes - missing values 0
-----------------------------------------------
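
The per-column loop above can also be written as a single vectorized call. This is a minimal sketch on a small made-up frame (the `sample` data is hypothetical, with one missing caption added so the count is not all zeros); it is an alternative, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the Instagram data, with one missing caption
sample = pd.DataFrame({
    "Impressions": [3920, 5394, 4021],
    "Likes": [162, 224, 131],
    "Caption": ["first post", None, "third post"],
})

# One Series mapping each column name to its number of missing values
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Total number of missing cells across the whole frame
total_missing = int(missing_per_column.sum())
print("total missing:", total_missing)
```

The result is a Series, so it can be filtered (for example `missing_per_column[missing_per_column > 0]`) to list only the columns that need attention.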

In [4]:
data.head()

| | Impressions | From Home | From Hashtags | From Explore | From Other | Saves | Comments | Shares | Likes | Profile Visits | Follows | Caption | Hashtags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3920 | 2586 | 1028 | 619 | 56 | 98 | 9 | 5 | 162 | 35 | 2 | Here are some of the most important data visua... | #finance #money #business #investing #investme... |
| 1 | 5394 | 2727 | 1838 | 1174 | 78 | 194 | 7 | 14 | 224 | 48 | 10 | Here are some of the best data science project... | #healthcare #health #covid #data #datascience ... |
| 2 | 4021 | 2085 | 1188 | 0 | 533 | 41 | 11 | 1 | 131 | 62 | 12 | Learn how to train a machine learning model an... | #data #datascience #dataanalysis #dataanalytic... |
| 3 | 4528 | 2700 | 621 | 932 | 73 | 172 | 10 | 7 | 213 | 23 | 8 | Heres how you can write a Python program to d... | #python #pythonprogramming #pythonprojects #py... |
| 4 | 2518 | 1704 | 255 | 279 | 37 | 96 | 5 | 4 | 123 | 8 | 0 | Plotting annotations while visualizing your da... | #datavisualization #datascience #data #dataana... |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Impressions     119 non-null    int64 
 1   From Home       119 non-null    int64 
 2   From Hashtags   119 non-null    int64 
 3   From Explore    119 non-null    int64 
 4   From Other      119 non-null    int64 
 5   Saves           119 non-null    int64 
 6   Comments        119 non-null    int64 
 7   Shares          119 non-null    int64 
 8   Likes           119 non-null    int64 
 9   Profile Visits  119 non-null    int64 
 10  Follows         119 non-null    int64 
 11  Caption         119 non-null    object
 12  Hashtags        119 non-null    object
dtypes: int64(11), object(2)
memory usage: 12.2+ KB


Removing `From Home`, `From Hashtags`, `From Explore`, `From Other`, `Saves`, `Profile Visits`, and `Follows` from the dataset, as our main question does not require data from these 7 columns.

In [6]:
data = data.drop(['From Home','From Hashtags','From Explore', 'From Other','Saves', 'Profile Visits', 'Follows'], axis = 1)
data

| | Impressions | Comments | Shares | Likes | Caption | Hashtags |
|---|---|---|---|---|---|---|
| 0 | 3920 | 9 | 5 | 162 | Here are some of the most important data visua... | #finance #money #business #investing #investme... |
| 1 | 5394 | 7 | 14 | 224 | Here are some of the best data science project... | #healthcare #health #covid #data #datascience ... |
| 2 | 4021 | 11 | 1 | 131 | Learn how to train a machine learning model an... | #data #datascience #dataanalysis #dataanalytic... |
| 3 | 4528 | 10 | 7 | 213 | Heres how you can write a Python program to d... | #python #pythonprogramming #pythonprojects #py... |
| 4 | 2518 | 5 | 4 | 123 | Plotting annotations while visualizing your da... | #datavisualization #datascience #data #dataana... |
| ... | ... | ... | ... | ... | ... | ... |
| 114 | 13700 | 2 | 38 | 373 | Here are some of the best data science certifi... | #datascience #datasciencejobs #datasciencetrai... |
| 115 | 5731 | 4 | 1 | 148 | Clustering is a machine learning technique use... | #machinelearning #machinelearningalgorithms #d... |
| 116 | 4139 | 0 | 1 | 92 | Clustering music genres is a task of grouping ... | #machinelearning #machinelearningalgorithms #d... |
| 117 | 32695 | 2 | 75 | 549 | Here are some of the best data science certifi... | #datascience #datasciencejobs #datasciencetrai... |
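
Dropping columns can be expressed either by naming what to remove or by naming what to keep. The sketch below compares both on a small made-up frame (`sample` and its values are hypothetical, reusing a few of the notebook's column names); naming the columns to keep can be easier to audit when most columns are being discarded.

```python
import pandas as pd

# Hypothetical two-row stand-in with a few of the notebook's column names
sample = pd.DataFrame({
    "Impressions": [3920, 5394],
    "From Home": [2586, 2727],
    "Saves": [98, 194],
    "Likes": [162, 224],
    "Caption": ["first post", "second post"],
})

# Option 1: name the columns to drop (same effect as drop([...], axis=1))
dropped = sample.drop(columns=["From Home", "Saves"])

# Option 2: name the columns to keep; .copy() returns an independent frame
columns_to_keep = ["Impressions", "Likes", "Caption"]
kept = sample[columns_to_keep].copy()

print(list(dropped.columns))
print(dropped.equals(kept))
```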


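Cleaning the unstructured text data using Natural Language Processing (NLP).
The caption and hashtags are combined into one text field, lowercased, stripped of non-letter characters, tokenized, and filtered for stopwords; the resulting tokens are then stemmed and lemmatized. The same steps are also applied to the hashtags column on its own.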
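
To make the stemming step concrete, here is a short sketch using NLTK's `PorterStemmer` on a few words that appear in the captions. Stemming strips suffixes by rule, so the results are often not dictionary words (`science` becomes `scienc`), whereas lemmatization looks words up and returns dictionary forms. The word list is chosen for illustration only.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few words taken from the captions in this dataset
words = ["important", "visualizations", "science", "machine", "learning"]

# Rule-based suffix stripping; no dictionary lookup is involved
stems = [stemmer.stem(word) for word in words]
print(stems)
```

Because stems such as `scienc` and `machin` are not dictionary words, a lemmatizer applied afterwards usually has nothing to match, which is consistent with the stemmed and lemmatized columns looking the same in the output of this notebook.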

In [7]:
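# captionDF is another name for `data` (not a copy), so new columns appear on both
captionDF = data
# Combine the caption and hashtags into a single text field
captionDF['fullCaption'] = captionDF['Caption'] + ' ' + captionDF['Hashtags']
# Replace non-breaking spaces with regular spaces, then lowercase everything
captionDF['fullCaption'] = captionDF['fullCaption'].astype(str).str.replace('\xa0', ' ', regex=False)
captionDF['fullCaption'] = captionDF['fullCaption'].str.lower()
# Preview the first combined caption
captionDF.loc[0, 'fullCaption']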

def tokenize_text(text):
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token.lower() not in stop_words]
    return tokens

# Stemming function
def stem_tokens(tokens):
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]

# Lemmatization function
def lemmatize_tokens(tokens):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in tokens]

df = captionDF
# Captions: tokenize and remove stopwords, then stem, then lemmatize the stems
df['Tokenized_Text'] = df['fullCaption'].apply(tokenize_text)
df['Stemmed_Text'] = df['Tokenized_Text'].apply(stem_tokens)
df['tokenizedCaptions'] = df['Stemmed_Text'].apply(lemmatize_tokens)

# Hashtags only: repeat the same tokenize, stem, and lemmatize steps
df['Tokenized_Hashtags'] = df['Hashtags'].apply(tokenize_text)
df['Stemmed_Hashtags'] = df['Tokenized_Hashtags'].apply(stem_tokens)
df['tokenizedHashtags'] = df['Stemmed_Hashtags'].apply(lemmatize_tokens)

captionDF = df
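
To see what the character filter and stopword removal in `tokenize_text` do to a single caption, here is a dependency-free sketch. It is a simplification under stated assumptions: `simple_tokenize` and `STOP_WORDS` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, and `str.split()` replaces `word_tokenize`.

```python
import re

# Tiny hypothetical stopword list; NLTK's English list is much longer
STOP_WORDS = {"here", "are", "some", "of", "the"}

def simple_tokenize(text):
    """Lowercase, keep only letters and whitespace, split, drop stopwords."""
    # Same pattern as tokenize_text: removes '#', digits, and punctuation
    text = re.sub(r"[^a-zA-Z\s]", "", text.lower())
    return [token for token in text.split() if token not in STOP_WORDS]

tokens = simple_tokenize(
    "Here are some of the best data science projects #datascience #python3"
)
print(tokens)
```

Note that the pattern removes the `#` sign and any digits, so a hashtag such as `#python3` is reduced to the plain token `python`.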

In [8]:
captionDF

| | Impressions | Comments | Shares | Likes | Caption | Hashtags | fullCaption | Tokenized_Text | Stemmed_Text | tokenizedCaptions | Tokenized_Hashtags | Stemmed_Hashtags | tokenizedHashtags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3920 | 9 | 5 | 162 | Here are some of the most important data visua... | #finance #money #business #investing #investme... | here are some of the most important data visua... | [important, data, visualizations, every, finan... | [import, data, visual, everi, financi, data, a... | [import, data, visual, everi, financi, data, a... | [finance, money, business, investing, investme... | [financ, money, busi, invest, invest, trade, s... | [financ, money, busi, invest, invest, trade, s... |
| 1 | 5394 | 7 | 14 | 224 | Here are some of the best data science project... | #healthcare #health #covid #data #datascience ... | here are some of the best data science project... | [best, data, science, project, ideas, healthca... | [best, data, scienc, project, idea, healthcar,... | [best, data, scienc, project, idea, healthcar,... | [healthcare, health, covid, data, datascience,... | [healthcar, health, covid, data, datasci, data... | [healthcar, health, covid, data, datasci, data... |
| 2 | 4021 | 11 | 1 | 131 | Learn how to train a machine learning model an... | #data #datascience #dataanalysis #dataanalytic... | learn how to train a machine learning model an... | [learn, train, machine, learning, model, givin... | [learn, train, machin, learn, model, give, inp... | [learn, train, machin, learn, model, give, inp... | [data, datascience, dataanalysis, dataanalytic... | [data, datasci, dataanalysi, dataanalyt, datas... | [data, datasci, dataanalysi, dataanalyt, datas... |
| 3 | 4528 | 10 | 7 | 213 | Heres how you can write a Python program to d... | #python #pythonprogramming #pythonprojects #py... | heres how you can write a python program to d... | [heres, write, python, program, detect, whethe... | [here, write, python, program, detect, whether... | [here, write, python, program, detect, whether... | [python, pythonprogramming, pythonprojects, py... | [python, pythonprogram, pythonproject, pythonc... | [python, pythonprogram, pythonproject, pythonc... |
| 4 | 2518 | 5 | 4 | 123 | Plotting annotations while visualizing your da... | #datavisualization #datascience #data #dataana... | plotting annotations while visualizing your da... | [plotting, annotations, visualizing, data, con... | [plot, annot, visual, data, consid, good, prac... | [plot, annot, visual, data, consid, good, prac... | [datavisualization, datascience, data, dataana... | [datavisu, datasci, data, dataanalyt, machinel... | [datavisu, datasci, data, dataanalyt, machinel... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 114 | 13700 | 2 | 38 | 373 | Here are some of the best data science certifi... | #datascience #datasciencejobs #datasciencetrai... | here are some of the best data science certifi... | [best, data, science, certifications, choose, ... | [best, data, scienc, certif, choos, datasci, d... | [best, data, scienc, certif, choos, datasci, d... | [datascience, datasciencejobs, datasciencetrai... | [datasci, datasciencejob, datasciencetrain, da... | [datasci, datasciencejob, datasciencetrain, da... |
| 115 | 5731 | 4 | 1 | 148 | Clustering is a machine learning technique use... | #machinelearning #machinelearningalgorithms #d... | clustering is a machine learning technique use... | [clustering, machine, learning, technique, use... | [cluster, machin, learn, techniqu, use, classi... | [cluster, machin, learn, techniqu, use, classi... | [machinelearning, machinelearningalgorithms, d... | [machinelearn, machinelearningalgorithm, datas... | [machinelearn, machinelearningalgorithm, datas... |
| 116 | 4139 | 0 | 1 | 92 | Clustering music genres is a task of grouping ... | #machinelearning #machinelearningalgorithms #d... | clustering music genres is a task of grouping ... | [clustering, music, genres, task, grouping, mu... | [cluster, music, genr, task, group, music, bas... | [cluster, music, genr, task, group, music, bas... | [machinelearning, machinelearningalgorithms, d... | [machinelearn, machinelearningalgorithm, datas... | [machinelearn, machinelearningalgorithm, datas... |
| 117 | 32695 | 2 | 75 | 549 | Here are some of the best data science certifi... | #datascience #datasciencejobs #datasciencetrai... | here are some of the best data science certifi... | [best, data, science, certifications, choose, ... | [best, data, scienc, certif, choos, datasci, d... | [best, data, scienc, certif, choos, datasci, d... | [datascience, datasciencejobs, datasciencetrai... | [datasci, datasciencejob, datasciencetrain, da... | [datasci, datasciencejob, datasciencetrain, da... |


In [9]:
captionDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Impressions         119 non-null    int64 
 1   Comments            119 non-null    int64 
 2   Shares              119 non-null    int64 
 3   Likes               119 non-null    int64 
 4   Caption             119 non-null    object
 5   Hashtags            119 non-null    object
 6   fullCaption         119 non-null    object
 7   Tokenized_Text      119 non-null    object
 8   Stemmed_Text        119 non-null    object
 9   tokenizedCaptions   119 non-null    object
 10  Tokenized_Hashtags  119 non-null    object
 11  Stemmed_Hashtags    119 non-null    object
 12  tokenizedHashtags   119 non-null    object
dtypes: int64(4), object(9)
memory usage: 12.2+ KB


In [10]:
captionDF.to_csv("Datasets/cleaned-IG-data.csv")
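
Two details are worth keeping in mind when this file is read back later. With the default `index=True`, `to_csv` writes the row index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. Also, CSV has no list type, so the token-list columns are stored as text and come back as strings. The sketch below illustrates both on a small made-up frame, using an in-memory buffer instead of a file; `sample` and its values are hypothetical.

```python
import ast
import io

import pandas as pd

# Hypothetical frame with one list-valued column, like the token columns above
sample = pd.DataFrame({
    "Likes": [162, 224],
    "tokens": [["best", "data"], ["learn", "train"]],
})

# index=False avoids writing the row index as an extra unnamed column
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Lists are stored as text, e.g. "['best', 'data']"
first_raw = reloaded.loc[0, "tokens"]
print(type(first_raw).__name__, first_raw)

# ast.literal_eval safely parses that text back into Python lists
reloaded["tokens"] = reloaded["tokens"].apply(ast.literal_eval)
print(reloaded.loc[0, "tokens"])
```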