# Data Collection

## Introduction
This notebook loads the base dataset and the embedded data, then combines them so that only rows present in both dataframes, matched on the 'id' and 'description' columns, are retained.



## Step 1: Import Necessary Libraries
We start by importing the necessary libraries for data manipulation.

In [20]:
import pandas as pd

# Load the base dataset

In [21]:
data = pd.read_csv('../Data/DF_complete/base.csv')
print("Base Data Loaded:")
print(data.head(2))
print("Shape of the base data:", data.shape)

Base Data Loaded:
     id                                        description  interactions  \
0  7583  matt dean dd with the three bears nwent to vis...            29   
1    80  sunday night with z meangirls sunday girlsnigh...            24   

  day_of_week    time_of_day  following  followers  num_posts  \
0      Sunday      afternoon       1777        449        808   
1      Monday  early_morning        976        843       2376   

   is_business_account              category  
0                False                family  
1                False  diaries_&_daily_life  
Shape of the base data: (1000000, 10)


# Load the embedded dataset

In [22]:
# List of file paths
file_paths = [#'../Data/DF_embedded/df_1-embed.csv', 
              #'../Data/DF_embedded/df_2-embed.csv', 
              #'../Data/DF_embedded/df_3-embed.csv',
              #'../Data/DF_embedded/df_4-embed.csv',
              #'../Data/DF_embedded/df_5-embed.csv',
              '../Data/DF_embedded/df_6-embed.csv',
              #'../Data/DF_embedded/df_7-embed.csv',
              #'../Data/DF_embedded/df_8-embed.csv',
              #'../Data/DF_embedded/df_9-embed.csv',
              #'../Data/DF_embedded/df_10-embed.csv'
              ]

# Initialize an empty list to hold the dataframes
dataframes = []

for file in file_paths:
    # Load the dataset
    df = pd.read_csv(file)
    
    # Drop the first row
    df = df.drop(df.index[0])
    
    # Append the modified dataframe to the list
    dataframes.append(df)

# Concatenate all dataframes into one
embed_df = pd.concat(dataframes, ignore_index=True)

# Display the concatenated dataframe
print(embed_df)



            id                                        description  embedded_0  \
0      3701655  cantinhodosmanos dollishill portugueserestaura...   -0.586680   
1       404256                                  my kind of curves   -0.028337   
2      4761324  severe thunderstorm along with flooding alert ...   -0.414129   
3      1905033  i think this is what dreams are made of not su...   -0.305486   
4      2801775  the porsche gb gt clubsport the track only ver...   -0.160066   
...        ...                                                ...         ...   
99995  3022302                    my kinda clich n goalsofdancing   -0.320121   
99996  3001817  don t drag me down one of my favorite sxdx son...    0.144971   
99997  2879561  only a couple days left before kicking the roo...    0.175868   
99998  2888955             rip soda peezy te i luv my real niggas    1.184273   
99999  3513332                posted on the block like a low life    0.015594   

       embedded_1  embedded

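Data Loading: We load two datasets: the original base.csv and the embedded data. In this run the embedded data comes only from df_6-embed.csv, because the other nine entries in `file_paths` are commented out. The first row of each embedded file is dropped before the files are concatenated.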
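The load, drop and concatenate pattern from the loop above can be tried on a small scale. The sketch below is only an illustration: the two in-memory CSV strings and the names `csv_a`, `csv_b`, and `toy_embed` are hypothetical stand-ins for the real embedded files.

```python
import io

import pandas as pd

# Hypothetical in-memory CSVs standing in for the embedded files.
csv_a = "id,description,embedded_0\n1,first post,0.1\n2,second post,0.2\n"
csv_b = "id,description,embedded_0\n3,third post,0.3\n4,fourth post,0.4\n"

frames = []
for raw in (csv_a, csv_b):
    part = pd.read_csv(io.StringIO(raw))
    # Drop the first data row, mirroring the loop above.
    part = part.drop(part.index[0])
    frames.append(part)

# ignore_index=True rebuilds a clean 0..n-1 index across all parts.
toy_embed = pd.concat(frames, ignore_index=True)
print(toy_embed)
```

Each part loses its first data row (ids 1 and 3 here), so the result has one row fewer per file than the raw inputs combined.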

## Step 2: Merging Datasets
We combine the base data and the embedded data, keeping only the rows that are present in both dataframes based on the 'id' and 'description' columns.
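To see what "present in both dataframes" means when merging on two keys, here is a minimal sketch on toy data. The frames `left` and `right` and their values are hypothetical. A row survives only when both 'id' and 'description' match; an outer merge with `indicator=True` is one way to inspect which rows would be dropped.

```python
import pandas as pd

# Hypothetical toy frames, not the notebook's data.
left = pd.DataFrame({
    "id": [1, 2, 3],
    "description": ["a", "b", "c"],
    "interactions": [10, 20, 30],
})
right = pd.DataFrame({
    "id": [2, 3, 4],
    "description": ["b", "changed", "d"],
    "embedded_0": [0.2, 0.3, 0.4],
})

# Default how="inner": keep rows whose (id, description) pair is in both frames.
inner = left.merge(right, on=["id", "description"])

# Outer merge with an indicator column shows where each row came from.
outer = left.merge(right, on=["id", "description"], how="outer", indicator=True)

print(inner)
print(outer["_merge"].value_counts())
```

Note that id 3 appears in both frames but is excluded from `inner`, because its description differs between them.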

# Merge the datasets on 'id' and 'description'

In [23]:
df = data.merge(embed_df, on=['id', 'description'])
print("Merged Data:")
print(df.head(2))
print("Shape of the merged data:", df.shape)

Merged Data:
     id                                        description  interactions  \
0   778                       special friends yoshi mylove            13   
1  1407  i brushed a totino s pizza with garlic butter ...           205   

  day_of_week    time_of_day  following  followers  num_posts  \
0   Wednesday  early_morning        737       4694       1512   
1      Sunday  early_morning       1269       3426       4487   

   is_business_account       category  ...  embedded_1014  embedded_1015  \
0                False  relationships  ...       1.020114       0.405854   
1                False  food_&_dining  ...       0.686622      -0.422867   

   embedded_1016  embedded_1017  embedded_1018  embedded_1019  embedded_1020  \
0      -0.018507       0.441405       0.582225       0.617475      -0.021549   
1       0.089943      -1.160441       0.821644      -0.089272      -0.320949   

   embedded_1021  embedded_1022  embedded_1023  
0      -0.832010      -0.189848      -0.059

# Sample 10,000 rows and save the merged data for the next notebook

In [24]:
df = df.sample(10000)

In [25]:
df.to_csv('../Data/Clean-Data/merged_data.csv', index=False)

In [26]:
df.shape

(10000, 1034)
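The sampling cell above calls `df.sample(10000)` without a `random_state`, so each run may select different rows. If a repeatable sample is wanted, passing a fixed seed is one option. The sketch below uses a hypothetical toy frame, and the seed value 42 is arbitrary.

```python
import pandas as pd

# Hypothetical toy frame standing in for the merged data.
toy = pd.DataFrame({"id": range(100), "value": range(100)})

# The same seed yields the same rows on every call.
sample_a = toy.sample(n=10, random_state=42)
sample_b = toy.sample(n=10, random_state=42)

print(sample_a.equals(sample_b))
```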

Merging: The merge function combines the two datasets on the common columns 'id' and 'description'. Because `merge` performs an inner join by default, only rows whose key values appear in both dataframes are retained.


## Conclusion
A random sample of 10,000 rows from the merged dataset (1,034 columns) is saved as merged_data.csv for further processing in the next notebook.