# Data Merging & Shuffling

**`Goal:`** Merge the data for the different ISPs and shuffle the merged dataset before the training-validation-test split, so that each split is representative of the whole dataset.

### 1. Library Importation

In [33]:
import pandas as pd
import os

### 2. Defining the main function

In [34]:
def merge_and_shuffle(df_list):
    
    """
    Merge a list of dataframes and shuffle the rows of the merged dataframe.
    
    Inputs:
        - df_list: List containing the dataframes to be merged
    
    Returns:
        - merged_df: The merged dataframe, shuffled and re-indexed from 0
    """
    
    #1. Merge the dataframes
    merged_df = pd.concat(df_list)
    
    #2. Shuffle four times (the number of passes was chosen arbitrarily)
    for _ in range(4):
        
        merged_df = merged_df.sample(frac=1, random_state=1).reset_index(drop=True)
        
    return merged_df
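
A quick check on toy data makes the behaviour of this kind of function concrete. The sketch below is illustrative only: `df_a` and `df_b` are made-up frames rather than the ISP data, and `concat_and_shuffle` is a hypothetical single-pass variant, not the function defined above. It assumes only that a fixed `random_state` is used, as in the original.

```python
import pandas as pd

# Toy frames standing in for two ISP files (illustrative values only)
df_a = pd.DataFrame({"ISP_Name": ["ipnx"] * 3, "Text": ["a1", "a2", "a3"]})
df_b = pd.DataFrame({"ISP_Name": ["tizeti"] * 2, "Text": ["b1", "b2"]})


def concat_and_shuffle(frames, seed=1):
    """Hypothetical single-pass variant: concatenate, then shuffle once."""
    merged = pd.concat(frames, ignore_index=True)
    return merged.sample(frac=1, random_state=seed).reset_index(drop=True)


first = concat_and_shuffle([df_a, df_b])
second = concat_and_shuffle([df_a, df_b])

# Same seed, same order: the result is reproducible across runs
print(first.equals(second))
# No rows are lost or duplicated by the shuffle
print(sorted(first["Text"]) == ["a1", "a2", "a3", "b1", "b2"])
# The index is rebuilt as 0..n-1
print(list(first.index))
```

Because the seed is fixed, the shuffled order is the same on every run, which keeps any later split reproducible.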

### 3. Load the data

**a. Get file paths**

In [35]:
#Path to the data files
path = "../data/raw"

#List to store file paths
file_list = []

#Iterate through all the subfolders in the main directory
for root, dirs, files in os.walk(path):
    
    #Iterate through all the files in each subfolder
    for file in files:
        
        #If it is a csv file (a data file)
        if file.endswith('.csv'):
            #Append the full file path to the list
            file_list.append(os.path.join(root, file))

#See sample of the file names
file_list[:2]

['../data/raw/cobranet/cobranet_tweets_q4_2019.csv',
 '../data/raw/cobranet/cobranet_tweets_q3_2019.csv']
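
The same directory walk can also be sketched with `pathlib`. The helper name `find_csv_files` and its `exclude_names` parameter are assumptions made for illustration; they are not part of the original notebook. Excluding `merged.csv` is included because the notebook later writes its output into the same `../data/raw` tree, so a rerun would otherwise pick that file up as an input.

```python
from pathlib import Path
import tempfile


def find_csv_files(root, exclude_names=("merged.csv",)):
    """Return sorted paths of *.csv files under root, skipping excluded names."""
    return sorted(
        str(p)
        for p in Path(root).rglob("*.csv")
        if p.is_file() and p.name not in exclude_names
    )


# Demo on a throwaway directory tree (file names are made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "isp_a").mkdir()
    (Path(tmp) / "isp_a" / "tweets_q1.csv").write_text("Text\nhello\n")
    (Path(tmp) / "isp_a" / "notes.csv.bak").write_text("ignored")
    (Path(tmp) / "merged.csv").write_text("Text\nold\n")
    found = [Path(p).name for p in find_csv_files(tmp)]

print(found)
```

Sorting the paths also makes the input order independent of the order in which the file system lists the files.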

**b. Read data into pandas dataframe**

In [36]:
#List to store the dataframes
df_list = []

#Iterate through all the file paths
for file_path in file_list:
    
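    #Create a pandas dataframe if possible
    try:
        df = pd.read_csv(file_path)
        df_list.append(df)
    #If not, report the file and move on instead of failing silently
    except (OSError, UnicodeDecodeError, pd.errors.EmptyDataError, pd.errors.ParserError) as err:
        print(f"Skipping {file_path}: {err}")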
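
When some files cannot be read, it helps to know which ones were skipped before merging. The following is a minimal sketch, assuming a hypothetical helper `read_csv_files` that returns both the loaded frames and the paths that failed; the demo files are throwaway examples, not the ISP data.

```python
import os
import tempfile

import pandas as pd


def read_csv_files(paths):
    """Read each path into a DataFrame; return (frames, failed)."""
    frames, failed = [], []
    for p in paths:
        try:
            frames.append(pd.read_csv(p))
        except (OSError, UnicodeDecodeError,
                pd.errors.EmptyDataError, pd.errors.ParserError) as err:
            # Keep the path and the error type so skipped files are visible
            failed.append((p, type(err).__name__))
    return frames, failed


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.csv")
    empty = os.path.join(tmp, "empty.csv")
    with open(good, "w") as f:
        f.write("ISP_Name,Text\nipnx,hello\n")
    open(empty, "w").close()  # zero-byte file that pandas cannot parse
    frames, failed = read_csv_files([good, empty])

print(len(frames), failed)
```

Comparing `len(frames)` with the number of paths gives a quick indication of how much data was dropped.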

### 4. Merge & Shuffle the Data Files

In [38]:
merged_df = merge_and_shuffle(df_list)
merged_df.head()

| Unnamed: 0 | ISP_Name | Time | Text | Coordinates | Place | Source |
|---|---|---|---|---|---|---|
| 0 | ipnx | 2020-03-22 13:35:27 | Happy Mother's Day to Us all! \n\n#get #connec... | `{'type': 'Point', 'coordinates': [3.39583, 6.4...` | `Place(_api=<tweepy.api.API object at 0x7f96f37...` | Instagram |
| 1 | tizeti | 2019-04-17 21:49:06+00:00 | Hello Twitter, got any details on what the goo... | | | Twitter Web App |
| 2 | sprectranet | 2020-12-05 20:53:04+00:00 | @Spectranet_NG I just want to tell you guys th... | | `Place(_api=<tweepy.api.API object at 0x7fbc029...` | Twitter for Android |
| 3 | sprectranet | 2020-03-29 04:00:13+00:00 | @FunkeOnafuye Na to fight spectranet remain co... | | `Place(_api=<tweepy.api.API object at 0x7fbc025...` | Twitter for Android |
| 4 | sprectranet | 2019-12-04 10:29:09 | @philznnona @Spectranet_NG Oh you have had iss... | | `Place(_api=<tweepy.api.API object at 0x7f96f37...` | Twitter for Android |


### 5. Write to CSV File

In [39]:
#Note: this file is written inside ../data/raw, so rerunning step 3 would pick it up as an input
merged_df.to_csv('../data/raw/merged.csv',index=False)
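
The `index=False` argument matters when the file is read back later. The sketch below uses a made-up two-row frame and temporary files to show the difference; it is an illustration of the pandas behaviour, not part of the original notebook.

```python
import os
import tempfile

import pandas as pd

# Made-up frame used only to illustrate the round trip
df = pd.DataFrame({"ISP_Name": ["ipnx", "tizeti"], "Text": ["hello", "world"]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    df.to_csv(with_index)                  # row index is written as an unnamed first column
    df.to_csv(without_index, index=False)  # only the data columns are written

    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

Writing without the index means the file read back has the same columns as the frame that was saved, with no extra `Unnamed: 0` column.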