# Data Preprocessing

## Introduction
In this notebook, we will preprocess the dataset to prepare it for model training. This includes splitting the data into training and testing sets, scaling the features, and ensuring the data is in the correct format for machine learning models.

## Step 1: Load the Cleaned Data
We begin by loading the dataset that was prepared in the previous notebooks.


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

### Load the final data from the previous notebook

In [5]:
df = pd.read_csv('../Data/Clean-Data/df_to_model.csv')
df = df.sample(1000)
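
The call to `df.sample(1000)` above draws a different subset on every run. A minimal sketch of a reproducible draw, using a small toy frame in place of `df_to_model.csv` and an arbitrarily chosen seed (both are assumptions for illustration, not part of the original notebook):

```python
import pandas as pd

# Toy stand-in for df_to_model.csv (the real file is not needed to run this sketch)
toy = pd.DataFrame({"interactions": range(100), "followers": range(100, 200)})

# random_state=42 is an arbitrary, assumed seed; any fixed integer works
sample_a = toy.sample(10, random_state=42)
sample_b = toy.sample(10, random_state=42)

# Both draws select the same rows in the same order
print(sample_a.index.equals(sample_b.index))
```

Fixing the seed makes row counts and index values in later outputs repeatable between runs.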

In [6]:
# Display basic information about the dataset
df.info()
df.head(2)

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 211351 to 335218
Columns: 1063 entries, interactions to time_of_day_night
dtypes: bool(31), float64(1030), int64(2)
memory usage: 7.9 MB


| | interactions | following | followers | num_posts | is_business_account | embedded_0 | embedded_1 | embedded_2 | embedded_3 | embedded_4 | ... | day_of_week_Monday | day_of_week_Saturday | day_of_week_Sunday | day_of_week_Thursday | day_of_week_Tuesday | day_of_week_Wednesday | time_of_day_afternoon | time_of_day_early_morning | time_of_day_morning | time_of_day_night |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 211351 | 167 | 743 | 1.25249 | 0.780913 | False | 0.235544 | 0.175321 | -0.035041 | 0.161449 | -0.518485 | ... | False | False | True | False | False | False | True | False | False | False |
| 334344 | 38 | 1411 | -0.553813 | -0.346329 | False | -0.228337 | 0.853343 | -0.340312 | 0.104605 | -0.946533 | ... | False | False | False | True | False | False | False | False | False | True |


In [7]:
# Get the names of the first 50 columns
df.columns[:50]

Index(['interactions', 'following', 'followers', 'num_posts',
       'is_business_account', 'embedded_0', 'embedded_1', 'embedded_2',
       'embedded_3', 'embedded_4', 'embedded_5', 'embedded_6', 'embedded_7',
       'embedded_8', 'embedded_9', 'embedded_10', 'embedded_11', 'embedded_12',
       'embedded_13', 'embedded_14', 'embedded_15', 'embedded_16',
       'embedded_17', 'embedded_18', 'embedded_19', 'embedded_20',
       'embedded_21', 'embedded_22', 'embedded_23', 'embedded_24',
       'embedded_25', 'embedded_26', 'embedded_27', 'embedded_28',
       'embedded_29', 'embedded_30', 'embedded_31', 'embedded_32',
       'embedded_33', 'embedded_34', 'embedded_35', 'embedded_36',
       'embedded_37', 'embedded_38', 'embedded_39', 'embedded_40',
       'embedded_41', 'embedded_42', 'embedded_43', 'embedded_44'],
      dtype='object')

In [9]:
# Get the names of the columns from position 1020 onward (the last 43 of 1063 columns)
df.columns[1020:]

Index(['embedded_1015', 'embedded_1016', 'embedded_1017', 'embedded_1018',
       'embedded_1019', 'embedded_1020', 'embedded_1021', 'embedded_1022',
       'embedded_1023', 'description_length', 'followers_trans',
       'num_posts_trans', 'description_length_trans',
       'category_arts_&_culture', 'category_business_&_entrepreneurs',
       'category_celebrity_&_pop_culture', 'category_diaries_&_daily_life',
       'category_family', 'category_fashion_&_style',
       'category_film_tv_&_video', 'category_fitness_&_health',
       'category_food_&_dining', 'category_gaming',
       'category_learning_&_educational', 'category_music',
       'category_news_&_social_concern', 'category_other_hobbies',
       'category_relationships', 'category_science_&_technology',
       'category_sports', 'category_travel_&_adventure',
       'category_youth_&_student_life', 'day_of_week_Friday',
       'day_of_week_Monday', 'day_of_week_Saturday', 'day_of_week_Sunday',
       'day_of_week_Thursda
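
The slice `df.columns[1020:]` depends on the total number of columns: with 1063 columns it returns 43 names, not 50. A small sketch of the difference between a hard-coded offset and a negative slice, using a hypothetical index of placeholder names with the same length as the DataFrame above:

```python
import pandas as pd

# Hypothetical column index with the same length as the notebook's DataFrame (1063)
columns = pd.Index([f"col_{i}" for i in range(1063)])

tail_from_offset = columns[1020:]  # result size depends on the total column count
last_fifty = columns[-50:]         # always the final 50 names

print(len(tail_from_offset), len(last_fifty))
```

A negative slice keeps working if columns are added or removed upstream.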

### Check the data types

In [16]:
# Convert every boolean column to int (True -> 1, False -> 0)
for col in df.columns:
    if df[col].dtype == 'bool':
        df[col] = df[col].astype(int)
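
The same conversion can be written without an explicit loop by selecting the boolean columns first. This is a sketch on a toy frame whose column names are borrowed from the dataset and whose values are made up for illustration:

```python
import pandas as pd

# Toy frame with two boolean columns and one float column (values are invented)
toy = pd.DataFrame({
    "is_business_account": [True, False, True],
    "day_of_week_Monday": [False, False, True],
    "followers": [1.25, -0.55, 0.10],
})

# Select only the boolean columns, then cast them in one step
bool_cols = toy.select_dtypes(include="bool").columns
toy[bool_cols] = toy[bool_cols].astype(int)

print(toy.dtypes)
```

Non-boolean columns such as `followers` are left untouched.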

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6738 entries, 0 to 6737
Columns: 1063 entries, interactions to embedded_1023
dtypes: float64(1030), int64(33)
memory usage: 54.6 MB


Loading Data: The dataset from the feature engineering step is loaded to be prepared for model training.

## Step 2: Define Features and Target Variable
We will define the features (input variables) and the target variable that we want to predict.

### Define the features (X) and the target variable (y)

In [18]:
features = df.drop('interactions', axis=1)
target = df['interactions']

### Split the data into training and testing sets

In [19]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

In [20]:
# Ensure all column names are strings
X_train.columns = X_train.columns.astype(str)
X_test.columns = X_test.columns.astype(str)

Feature and Target Definition: The interactions column is our target, and the rest of the columns are used as features for prediction.
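
To see what the split produces, here is a minimal sketch on a ten-row toy frame. The column names mirror the dataset, but the values are invented for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, one target column and two feature columns (invented values)
toy = pd.DataFrame({
    "interactions": range(10),
    "followers": [x * 0.5 for x in range(10)],
    "num_posts": [x * 2 for x in range(10)],
})

features = toy.drop("interactions", axis=1)
target = toy["interactions"]

# Same arguments as in the notebook: 20% held out, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.20, random_state=0
)

print(X_train.shape, X_test.shape)
print(X_train.index.equals(y_train.index))
```

With `test_size=0.20`, two of the ten rows go to the test set, and the features and target in each split keep matching row indices.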

## Step 3: Feature Scaling
We will scale the features so that they share a common range. This matters for algorithms that are sensitive to feature magnitude, such as distance-based and gradient-based models.

### Initialize the MinMaxScaler and fit it to the training data

In [21]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

### Verify the scaled data

In [22]:
print("Scaled Training Data:")
print(X_train_scaled[:2])

Scaled Training Data:
[[0.1232786  0.2229206  0.51072961 ... 0.51476643 0.48590362 0.52705736]
 [0.02979343 0.04734968 0.10354077 ... 0.43659714 0.58437964 0.68099872]]


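Scaling: The features are scaled with MinMaxScaler, which maps each training feature to the range 0 to 1. The scaler is fit on the training set only and then applied to the test set, so no information from the test data leaks into the transformation; as a result, test values can fall slightly outside 0 to 1.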
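
With its default settings, MinMaxScaler computes `(x - min) / (max - min)` per column, where `min` and `max` come from the data it was fit on. The sketch below reproduces that arithmetic by hand with NumPy on invented values. It assumes no column is constant, since a zero range would cause a division by zero here.

```python
import numpy as np

# Invented values: three training rows and one test row, two features each
X_train = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
X_test = np.array([[40.0, 0.0]])

# Statistics are taken from the training data only
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min

# The same training statistics are applied to both sets
X_train_scaled = (X_train - col_min) / col_range
X_test_scaled = (X_test - col_min) / col_range

print(X_train_scaled.min(), X_train_scaled.max())
print(X_test_scaled)
```

The test row lands outside the 0 to 1 range because its values lie beyond the training minimum and maximum.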

## Step 4: Save the Preprocessed Data
We will save the preprocessed data for use in the next notebook.

### Save the preprocessed data

In [23]:
pd.DataFrame(X_train_scaled, columns=X_train.columns).to_csv('../Data/Clean-Data/X_train_scaled.csv', index=False)
pd.DataFrame(X_test_scaled, columns=X_test.columns).to_csv('../Data/Clean-Data/X_test_scaled.csv', index=False)
y_train.to_csv('../Data/Clean-Data/y_train.csv', index=False)
y_test.to_csv('../Data/Clean-Data/y_test.csv', index=False)

In [24]:
print("Preprocessed data saved.")


Preprocessed data saved.


Data Saving: The preprocessed data is saved so that it can be easily loaded in the next steps of the project.
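
A quick round-trip check can confirm that the files load back as expected. This sketch writes toy data to a temporary directory rather than `../Data/Clean-Data/`; the values are invented for illustration:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the scaled features and the target (invented values)
X_train_scaled = pd.DataFrame({"followers": [0.0, 0.5, 1.0], "num_posts": [0.2, 0.4, 0.6]})
y_train = pd.Series([167, 38, 12], name="interactions")

with tempfile.TemporaryDirectory() as tmp:
    x_path = Path(tmp) / "X_train_scaled.csv"
    y_path = Path(tmp) / "y_train.csv"

    # index=False drops the row index, as in the notebook
    X_train_scaled.to_csv(x_path, index=False)
    y_train.to_csv(y_path, index=False)

    # Read the files back while the temporary directory still exists
    X_loaded = pd.read_csv(x_path)
    y_loaded = pd.read_csv(y_path)["interactions"]

print(X_loaded.shape, y_loaded.tolist())
```

Because the files are written with `index=False`, features and target line up by row position only, so the row order of the saved files must not be changed independently.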

## Conclusion
The data is now preprocessed and ready for model training. In the next notebook, we will experiment with various machine learning models to predict Instagram post interactions.