# Data Merging & Shuffling

**`Goal:`** Merge the tweet data collected for the different ISPs and shuffle the merged dataset before the training-validation-test split, so that each split is more representative of the full dataset.

### 1. Import Libraries

In [19]:
import pandas as pd
from glob import glob

### 2. Defining the main function

In [20]:
def merge_and_shuffle(df_list):
    
    """
    Merge a list of dataframes and shuffle the rows of the merged dataframe.
    
    Inputs:
        - df_list: List containing the dataframes to be merged
    
    Returns:
        - merged_df: The merged, shuffled dataframe with its index reset
    """
    
    #1. Merge the dataframes
    merged_df = pd.concat(df_list)
    
    #2. Shuffle four times (the count was arbitrarily chosen)
    for _ in range(4):
        
        merged_df = merged_df.sample(frac=1, random_state=1).reset_index(drop=True)
        
    return merged_df
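
As a quick sanity check, the concat-then-shuffle idea can be tried on two toy dataframes. This is only a sketch: the column values are made up, and it assumes a single seeded shuffle rather than the repeated shuffles used above. It checks that no rows are lost and that the index is rebuilt.

```python
import pandas as pd

# Toy stand-ins for two per-ISP dataframes (hypothetical values)
df_a = pd.DataFrame({"ISP_Name": ["isp_a"] * 3, "Text": ["a1", "a2", "a3"]})
df_b = pd.DataFrame({"ISP_Name": ["isp_b"] * 2, "Text": ["b1", "b2"]})

# Concatenate, then shuffle once with a fixed seed for reproducibility
toy_merged = pd.concat([df_a, df_b], ignore_index=True)
toy_shuffled = toy_merged.sample(frac=1, random_state=1).reset_index(drop=True)

# Every row survives the shuffle and the index runs from 0 to n-1 again
print(len(toy_shuffled))
print(sorted(toy_shuffled["Text"]))
print(list(toy_shuffled.index))
```

Because `random_state` is fixed, rerunning the cell on the same input in the same order gives the same row order.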

### 3. Load the data

**a. Get file paths**

In [21]:
#Path to the data files
path = "../data/raw/*/"

#List to store file paths
file_list = []

#Iterate through all the subfolders in the main directory          
for folder in glob(path):
        
    #Iterate through all the files in each subfolder
    for file in glob(folder+'/*.csv'):
        
        #If it is a tweets file
        if 'tweets' in file:

            #Append the file name to the list
            file_list.append(file)

#See sample of the file names
file_list[:2]

['../data/raw/cobranet/cobranet_tweets_q4_2019.csv',
 '../data/raw/cobranet/cobranet_tweets_q3_2019.csv']
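
The nested loops can also be collapsed into a single glob pattern. The sketch below is one possible variant, not the notebook's method: `find_tweet_files` is a hypothetical helper, it matches `tweets` in the file name only (the loop above checks the whole path), and it sorts the result because `glob` does not guarantee any particular order. A stable order matters here, since a seeded shuffle only reproduces the same output when its input arrives in the same order.

```python
import os
import tempfile
from glob import glob

def find_tweet_files(root):
    """Return sorted paths of CSV files one level below root with 'tweets' in the file name."""
    pattern = os.path.join(root, "*", "*tweets*.csv")
    return sorted(glob(pattern))

# Demo on a throwaway directory tree (hypothetical folder and file names)
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "isp_a"))
    for name in ["isp_a_tweets_q1.csv", "isp_a_notes.csv"]:
        open(os.path.join(root, "isp_a", name), "w").close()
    found = [os.path.basename(p) for p in find_tweet_files(root)]

print(found)
```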

**b. Read data into pandas dataframe**

In [22]:
#List to store the dataframes
df_list = []

#Iterate through all the file paths
for file_path in file_list:
    
    #Create a pandas dataframe if possible
    try:
        df = pd.read_csv(file_path)
        df_list.append(df)
    #If not, report the file and skip it
    except Exception as err:
        print(f"Skipping {file_path}: {err}")
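
When files are skipped, it helps to know which ones and why. The following sketch wraps the reading step in a hypothetical helper, `read_csv_files`, that returns the skipped paths alongside the dataframes. The choice of exceptions to catch is an assumption about what can go wrong with these files.

```python
import os
import tempfile
import pandas as pd

def read_csv_files(paths):
    """Read each path into a dataframe; return (frames, skipped), where skipped holds (path, error) pairs."""
    frames, skipped = [], []
    for p in paths:
        try:
            frames.append(pd.read_csv(p))
        except (OSError, pd.errors.EmptyDataError, pd.errors.ParserError) as err:
            # Keep a record instead of silently discarding the file
            skipped.append((p, repr(err)))
    return frames, skipped

# Demo: one valid CSV and one empty file in a throwaway directory
with tempfile.TemporaryDirectory() as root:
    good = os.path.join(root, "good_tweets.csv")
    empty = os.path.join(root, "empty_tweets.csv")
    with open(good, "w") as fh:
        fh.write("ISP_Name,Text\nisp_a,hello\n")
    open(empty, "w").close()
    frames, skipped = read_csv_files([good, empty])

print(len(frames), len(skipped))
```

Inspecting `skipped` before merging makes it clear whether the merged dataset is missing any source files.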

### 4. Merge & Shuffle the Data Files

In [23]:
merged_df = merge_and_shuffle(df_list)
merged_df.head()

| Unnamed: 0 | ISP_Name | Time | Text | Coordinates | Place | Source |
|---|---|---|---|---|---|---|
| 0 | sprectranet | 2019-05-13 09:30:03 | It gives me joy seeing my spectranet turning g... | | `Place(_api=<tweepy.api.API object at 0x7f96f37...` | Twitter for iPhone |
| 1 | sprectranet | 2020-04-21 06:11:55+00:00 | @Spectranet_NG is this even fair? I won’t rene... | | `Place(_api=<tweepy.api.API object at 0x7fbc03d...` | Twitter for iPhone |
| 2 | sprectranet | 2020-02-04 18:30:35+00:00 | My family used my spectranet and they don't wa... | | `Place(_api=<tweepy.api.API object at 0x7fbc025...` | Twitter for Android |
| 3 | sprectranet | 2019-02-16 18:11:48 | @Spectranet_NG Can I subscribe via @UBAGroup m... | | `Place(_api=<tweepy.api.API object at 0x7f96f2f...` | Twitter for Android |
| 4 | sprectranet | 2020-08-14 06:25:29+00:00 | @EniolaShitta YouTube is where spectranet star... | | `Place(_api=<tweepy.api.API object at 0x7fbc025...` | Twitter for Android |
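
`Unnamed: 0` is the name pandas gives to a first column whose header field is empty, which usually happens when a CSV was saved together with its index. Assuming that is what happened to the raw files here (they are not shown, so this is a guess), the sketch below reproduces the effect on a toy dataframe and shows two ways of handling it.

```python
import io
import pandas as pd

# A tiny CSV written with its index, so the first header field is empty
raw = pd.DataFrame({"ISP_Name": ["isp_a", "isp_b"], "Text": ["hi", "yo"]})
csv_text = raw.to_csv()  # index=True is the default

# Reading it back yields an extra "Unnamed: 0" column
reread = pd.read_csv(io.StringIO(csv_text))
print(list(reread.columns))

# Option 1: treat the first column as the index while reading
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: drop the column afterwards; errors="ignore" tolerates frames without it
cleaned = reread.drop(columns=["Unnamed: 0"], errors="ignore")
print(list(cleaned.columns))
```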


### 5. Write to CSV File

In [24]:
merged_df.to_csv('../data/raw/merged.csv',index=False)
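
Since the rows are already shuffled, the later training-validation-test split can be done by slicing on position. The sketch below illustrates the idea on toy data. The helper `split_shuffled` and the 70/15/15 proportions are assumptions for illustration and are not taken from this notebook.

```python
import pandas as pd

def split_shuffled(df, train_frac=0.7, val_frac=0.15):
    """Slice an already-shuffled dataframe into train, validation and test parts by position."""
    n = len(df)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]

# Toy data standing in for the merged, shuffled tweets
toy = pd.DataFrame({"Text": [f"tweet {i}" for i in range(20)]})
toy = toy.sample(frac=1, random_state=1).reset_index(drop=True)

train, val, test = split_shuffled(toy)
print(len(train), len(val), len(test))
```

The three parts are disjoint and together cover every row, so no tweet ends up in more than one split.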