## JSON File Sorter 
### Merging, Transforming, and Sorting JSON Files into a CSV
- Reads every `.json` file in a specified folder.
- Concatenates the files into a single pandas DataFrame.
- Converts the `time` column from Unix seconds to dates in `dd-mm-yyyy` format for readability.
- Sorts the rows so that the dates are in chronological order.
- Saves the merged, sorted data as a CSV file, ready for further analysis or integration into other systems.
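
The conversion and sorting steps listed above can be tried on a small in-memory table before running them on thousands of files. This is a minimal sketch with made-up ids and timestamps (the `toy` DataFrame is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the merged JSON data; the values are made up for illustration.
toy = pd.DataFrame({
    "id": [3, 1, 2],
    "time": [1688688000, 1682467200, 1685577600],  # Unix seconds (UTC)
})

# Seconds since the epoch -> datetime64 values
toy["time"] = pd.to_datetime(toy["time"], unit="s")

# Sort while the column is still datetime, then format it for display
toy = toy.sort_values("time").reset_index(drop=True)
toy["time"] = toy["time"].dt.strftime("%d-%m-%Y")

print(toy)
```

Sorting the formatted strings directly would compare the day digits first and place `01-06-2023` before `26-04-2023`, which is why the sort has to operate on parsed dates rather than on raw `dd-mm-yyyy` strings.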

In [16]:
import pandas as pd
import os
from tqdm import tqdm

# Folder that contains the JSON files to merge
folder_path = r"C:\Users\sahma\Desktop\Thises\Stocks\stocks\NVDA\NVDA_2"

# Get the list of JSON files in the specified directory
file_list = [os.path.join(folder_path, file) for file in os.listdir(folder_path) if file.endswith('.json')]

# Calculate the total number of JSON files
total_files = len(file_list)

# Initialize an empty list to store individual DataFrames
dataframes = []

# Read each JSON file and append its DataFrame to the list; wrapping the
# loop in tqdm tracks progress and closes the bar when the loop finishes
for file in tqdm(file_list, total=total_files, desc="Processing JSON files", unit="file"):
    df = pd.read_json(file)
    dataframes.append(df)

# Concatenate all DataFrames into a single DataFrame
merged_df = pd.concat(dataframes, ignore_index=True)

# Convert the time column to datetime format
merged_df['time'] = pd.to_datetime(merged_df['time'], unit='s')

# Sort chronologically while the column still holds datetime values
df_sorted = merged_df.sort_values('time')

# Then format the time column as dd-mm-yyyy for the CSV output
df_sorted['time'] = df_sorted['time'].dt.strftime('%d-%m-%Y')

# Build the output CSV path inside the same folder
csv_file_path = os.path.join(folder_path, 'sorted_Nvidia_2.csv')

# Save the sorted DataFrame to a CSV file
df_sorted.to_csv(csv_file_path, index=False)

# Display a message indicating that the process is completed
print("CSV file has been saved.")



Processing JSON files: 100%|██████████| 3341/3341 [03:31<00:00, 15.77file/s]

CSV file has been saved.



Processing JSON files: 100%|██████████| 3411/3411 [01:31<00:00, 47.86file/s]

In [25]:
df = pd.read_csv(r"C:\Users\sahma\Desktop\Thises\Stocks\stocks\NVDA\NVDA_0\sorted_Nvidia_0.csv")

In [26]:
df.head()

Unnamed: 0,id,text,time,sentiment
0,524797770,$NVDA thank you for the 250 percent gainer on ...,26-04-2023,Bullish
1,524868442,$NVDA just wait for META to mention AI and the...,26-04-2023,
2,524870531,$NVDA prepare for a big rise!!,26-04-2023,
3,524870756,$BRSH 🚀\n\n$SPY \n$TQQQ \n$NVDA \n$ATOM,26-04-2023,Bullish
4,524870820,$NVDA going to 275 if meta holds!!,26-04-2023,


In [27]:
df.tail()

Unnamed: 0,id,text,time,sentiment
90322,534970519,$MSFT GS gives verdict its top AI plays:\n\n- ...,07-07-2023,
90323,534970255,$NVDA enters bullish trend https://srnk.us/go/...,07-07-2023,Bullish
90324,534969666,$NVDA pre fix for death https://youtu.be/39cL3...,07-07-2023,Bearish
90325,534972972,$NVDA 406/397.56 /378 .. July 12-21+ \nPossible?,07-07-2023,
90326,534983944,"$TSLA Enron has never left Twittor, he is stil...",07-07-2023,Bearish


In [28]:
len(df)

90327

## Concatenating .csv files

In [6]:
import os
import pandas as pd
from tqdm import tqdm

# Path to the folder containing the CSV files
folder_path = r'C:\Users\sahma\Desktop\Thises\TSLA'

# List to store individual DataFrames
dataframes = []

# Get the list of CSV files in the folder
csv_files = [filename for filename in os.listdir(folder_path) if filename.endswith('.csv')]

# Create a progress bar
progress_bar = tqdm(total=len(csv_files), desc='Processing Files')

# Iterate through each CSV file
for filename in csv_files:
    file_path = os.path.join(folder_path, filename)
    df = pd.read_csv(file_path)
    dataframes.append(df)
    progress_bar.update(1)

# Close the progress bar
progress_bar.close()

# Concatenate the DataFrames
concatenated_df = pd.concat(dataframes, ignore_index=True)

# Save the concatenated DataFrame to a CSV file in the same folder
concatenated_csv_path = os.path.join(folder_path, 'finbert_TSLA_06_2020_to_06_2023.csv')
concatenated_df.to_csv(concatenated_csv_path, index=False)

print(f'Concatenated data saved to {concatenated_csv_path}')


Processing Files: 100%|██████████| 26/26 [00:25<00:00,  1.03it/s]


Concatenated data saved to C:\Users\sahma\Desktop\Thises\TSLA\finbert_TSLA_06_2020_to_06_2023.csv
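
A quick sanity check after concatenating is to confirm that no rows were lost and that the index was renumbered. The sketch below uses two made-up in-memory CSV "files" (`csv_a` and `csv_b` are hypothetical) in place of the real folder:

```python
import io
import pandas as pd

# Two in-memory CSV "files" with made-up rows
csv_a = io.StringIO("id,text\n1,first\n2,second\n")
csv_b = io.StringIO("id,text\n3,third\n")

frames = [pd.read_csv(csv_a), pd.read_csv(csv_b)]
combined = pd.concat(frames, ignore_index=True)

# Row count should equal the sum of the parts, and the index should run 0..n-1
expected_rows = sum(len(frame) for frame in frames)
print(len(combined) == expected_rows)
print(combined.index.tolist())
```

The same two checks could be applied to `dataframes` and `concatenated_df` in the cell above before writing the CSV.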


In [7]:
df = pd.read_csv(r"C:\Users\sahma\Desktop\Thises\TSLA\finbert_TSLA_06_2020_to_06_2023.csv")

In [8]:
df.head()

Unnamed: 0,id,text,time,sentiment,processed_text,finbert_sentiment,finbert_score
0,488005848,$TSLA BREAKING NEWS....TESLA HYPNOTIZES ENTIR...,30-09-2022,,breaking newstesla hypnotizes entire watch aud...,Neutral,0.999999
1,488008153,$SOFI can we all appreciated Michael Burry’s ...,30-09-2022,,can we all appreciated michael burrys analogy ...,Neutral,0.996284
2,488008139,$TSLA 245 waiting for you 😁,30-09-2022,Bearish,245 waiting for you,Neutral,0.907729
3,488008122,$TSLA This company is years and years ahead of...,30-09-2022,Bullish,this company is years and years ahead of other...,Neutral,0.994367
4,488008121,$TSLA the IQ&#39;s in that room 😄,30-09-2022,,the iq39s in that room,Neutral,0.999992


In [9]:
df.tail()

Unnamed: 0,id,text,time,sentiment,processed_text,finbert_sentiment,finbert_score
2736653,268681144,$TSLA its so amusing when in a bull scalp its ...,31-12-2020,,its so amusing when in a bull scalp its so eas...,Neutral,0.888716
2736654,268681239,"$TSLA gamble on the EOD crash and buy back up,...",31-12-2020,,gamble on the eod crash and buy back up like w...,Neutral,0.949054
2736655,268681328,"$TSLA now is the time for me, opening $500 put...",31-12-2020,,now is the time for me opening 500 puts for ma...,Neutral,0.997246
2736656,268678110,$TSLA your friendly neighborhood bear gotta ea...,31-12-2020,,your friendly neighborhood bear got to eat too...,Neutral,0.999284
2736657,268859974,$NIO $XPEV $TSLA whose getting drunk at tonigh...,31-12-2020,Bullish,whose getting drunk at tonight holla ev bi,Neutral,0.99978


In [5]:
# Find the latest value in the time column (string comparison only matches
# chronological order when the values are in yyyy-mm-dd format)
max_time = df['time'].max()
print("Maximum time:", max_time)

Maximum time: 2023-07-07
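
The maximum printed above is in `yyyy-mm-dd` form, where comparing strings and comparing dates agree. For a column stored as `dd-mm-yyyy` strings, as in the sorted CSV files shown earlier, they would not. A minimal sketch with made-up dates:

```python
import pandas as pd

# Made-up dates in the dd-mm-yyyy format used by the sorted CSV files
dates = pd.Series(["26-04-2023", "07-07-2023", "31-12-2020"])

# Plain string comparison looks at the day digits first, so the result is misleading
string_max = dates.max()

# Parsing with an explicit format compares real dates instead
parsed_max = pd.to_datetime(dates, format="%d-%m-%Y").max()

print(string_max)
print(parsed_max.strftime("%Y-%m-%d"))
```

Here the string maximum is `31-12-2020`, while the parsed maximum is `2023-07-07`.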


In [33]:
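# Keep only the desired columns; .copy() makes the later in-place
# rename and column assignments act on an independent DataFrame
selected_columns = ['id', 'body', 'created_at', 'entities']
df = df[selected_columns].copy()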

In [34]:
df.head()

Unnamed: 0,id,body,created_at,entities
0,439881211,$NVDA I see a 20 $ move to the upside,2022-02-28T14:54:52Z,{'sentiment': {'basic': 'Bullish'}}
1,439881067,$NVDA holding 💎📈,2022-02-28T14:54:30Z,{'sentiment': {'basic': 'Bullish'}}
2,439880835,$NVDA about to moon … again,2022-02-28T14:54:00Z,{'sentiment': {'basic': 'Bullish'}}
3,439880275,$NVDA got some calls for a few weeks out and s...,2022-02-28T14:52:40Z,{'sentiment': {'basic': 'Bullish'}}
4,439879846,$NVDA I’m buying this it looks like bottom,2022-02-28T14:51:39Z,{'sentiment': {'basic': 'Bullish'}}


In [35]:
# Rename the columns
column_mapping = {
    'body': 'text',
    'created_at': 'time',
    'entities': 'sentiment'
}
df.rename(columns=column_mapping, inplace=True)

In [38]:
df.head()

Unnamed: 0,id,text,time,sentiment
0,439881211,$NVDA I see a 20 $ move to the upside,2022-02-28T14:54:52Z,{'sentiment': {'basic': 'Bullish'}}
1,439881067,$NVDA holding 💎📈,2022-02-28T14:54:30Z,{'sentiment': {'basic': 'Bullish'}}
2,439880835,$NVDA about to moon … again,2022-02-28T14:54:00Z,{'sentiment': {'basic': 'Bullish'}}
3,439880275,$NVDA got some calls for a few weeks out and s...,2022-02-28T14:52:40Z,{'sentiment': {'basic': 'Bullish'}}
4,439879846,$NVDA I’m buying this it looks like bottom,2022-02-28T14:51:39Z,{'sentiment': {'basic': 'Bullish'}}


In [39]:
# Convert 'time' column to datetime format
df['time'] = pd.to_datetime(df['time'])

# Format 'time' column as year-month-day
df['time'] = df['time'].dt.strftime('%Y-%m-%d')

In [45]:
df.head(15)

Unnamed: 0,id,text,time,sentiment
0,439881211,$NVDA I see a 20 $ move to the upside,2022-02-28,"{""sentiment"": {""basic"": ""Bullish""}}"
1,439881067,$NVDA holding 💎📈,2022-02-28,"{""sentiment"": {""basic"": ""Bullish""}}"
2,439880835,$NVDA about to moon … again,2022-02-28,"{""sentiment"": {""basic"": ""Bullish""}}"
3,439880275,$NVDA got some calls for a few weeks out and s...,2022-02-28,"{""sentiment"": {""basic"": ""Bullish""}}"
4,439879846,$NVDA I’m buying this it looks like bottom,2022-02-28,"{""sentiment"": {""basic"": ""Bullish""}}"
5,439879458,$NVDA 231,2022-02-28,"{""sentiment"": None}"
6,439879282,$NVDA 270 price target 1 second is 300,2022-02-28,"{""sentiment"": {""basic"": ""Bullish""}}"
7,439877933,$NVDA a few more and it turns green and we bac...,2022-02-28,"{""sentiment"": {""basic"": ""Bullish""}}"
8,439877328,$NVDA give us a 20$ move up sir,2022-02-28,"{""sentiment"": {""basic"": ""Bullish""}}"
9,439876825,$NVDA red To green Imo,2022-02-28,"{""sentiment"": {""basic"": ""Bullish""}}"


In [56]:
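import re

# Extract the label from strings such as '{"sentiment": {"basic": "Bullish"}}'.
# Rows without a label (e.g. '{"sentiment": None}') get an empty string.
pattern = re.compile(r'\{"basic": "(.*?)"\}')

def extract_sentiment(value):
    match = pattern.search(str(value))
    return match.group(1) if match else ""

df['sentiment_1'] = df['sentiment'].apply(extract_sentiment)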
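
The regular expression depends on the exact quoting and spacing of the stored string. As an alternative sketch (not the approach used in this notebook), the value can be parsed as a Python literal and the label read from the resulting dictionary. The helper name `extract_basic_sentiment` and the sample values are hypothetical:

```python
import ast
import pandas as pd

def extract_basic_sentiment(raw):
    """Return the 'basic' sentiment label, or '' when none is present."""
    try:
        # Strings are parsed as Python literals; dicts are used as they are
        parsed = ast.literal_eval(raw) if isinstance(raw, str) else raw
    except (ValueError, SyntaxError):
        return ""
    if not isinstance(parsed, dict):
        return ""
    sentiment = parsed.get("sentiment")
    if isinstance(sentiment, dict):
        return sentiment.get("basic") or ""
    return ""

# Made-up samples covering double quotes, a missing label, and single quotes
sample = pd.Series([
    '{"sentiment": {"basic": "Bullish"}}',
    '{"sentiment": None}',
    "{'sentiment': {'basic': 'Bearish'}}",
])
labels = sample.apply(extract_basic_sentiment)
print(labels.tolist())
```

This version accepts either quoting style and returns an empty string for missing or unparsable values instead of raising an error.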


In [57]:
df.head(15)

Unnamed: 0,id,text,time,sentiment,sentiment_1
0,439881211,$NVDA I see a 20 $ move to the upside,2022-02-28,"{""sentiment"": {""basic"": ""Bullish""}}",Bullish
1,439881067,$NVDA holding 💎📈,2022-02-28,"{""sentiment"": {""basic"": ""Bullish""}}",Bullish
2,439880835,$NVDA about to moon … again,2022-02-28,"{""sentiment"": {""basic"": ""Bullish""}}",Bullish
3,439880275,$NVDA got some calls for a few weeks out and s...,2022-02-28,"{""sentiment"": {""basic"": ""Bullish""}}",Bullish
4,439879846,$NVDA I’m buying this it looks like bottom,2022-02-28,"{""sentiment"": {""basic"": ""Bullish""}}",Bullish
5,439879458,$NVDA 231,2022-02-28,"{""sentiment"": None}",
6,439879282,$NVDA 270 price target 1 second is 300,2022-02-28,"{""sentiment"": {""basic"": ""Bullish""}}",Bullish
7,439877933,$NVDA a few more and it turns green and we bac...,2022-02-28,"{""sentiment"": {""basic"": ""Bullish""}}",Bullish
8,439877328,$NVDA give us a 20$ move up sir,2022-02-28,"{""sentiment"": {""basic"": ""Bullish""}}",Bullish
9,439876825,$NVDA red To green Imo,2022-02-28,"{""sentiment"": {""basic"": ""Bullish""}}",Bullish


In [58]:
# Drop the 'sentiment' column
df.drop(columns=['sentiment'], inplace=True)

# Rename 'sentiment_1' column to 'sentiment'
df.rename(columns={'sentiment_1': 'sentiment'}, inplace=True)

In [59]:
df.head(15)

Unnamed: 0,id,text,time,sentiment
0,439881211,$NVDA I see a 20 $ move to the upside,2022-02-28,Bullish
1,439881067,$NVDA holding 💎📈,2022-02-28,Bullish
2,439880835,$NVDA about to moon … again,2022-02-28,Bullish
3,439880275,$NVDA got some calls for a few weeks out and s...,2022-02-28,Bullish
4,439879846,$NVDA I’m buying this it looks like bottom,2022-02-28,Bullish
5,439879458,$NVDA 231,2022-02-28,
6,439879282,$NVDA 270 price target 1 second is 300,2022-02-28,Bullish
7,439877933,$NVDA a few more and it turns green and we bac...,2022-02-28,Bullish
8,439877328,$NVDA give us a 20$ move up sir,2022-02-28,Bullish
9,439876825,$NVDA red To green Imo,2022-02-28,Bullish


In [60]:
df.tail()

Unnamed: 0,id,text,time,sentiment
584176,12997231,Nvidia investing in â€˜once in a lifetime oppo...,2013-04-11,
584177,12995655,$NVDA shorting here is like picking up dimes i...,2013-04-11,Bullish
584178,12995587,As I said just over a week ago that $NVDA woul...,2013-04-11,Bullish
584179,12995313,"$NVDA quite a rebound today, i can hardly beli...",2013-04-11,
584180,12994205,$NVDA shorts using a lot of ammo trying to kee...,2013-04-11,


In [61]:
# Filter data within the specified date range
start_date = '2020-06-01'
end_date = '2022-02-28'
df['time'] = pd.to_datetime(df['time'])
filtered_df = df[(df['time'] >= start_date) & (df['time'] <= end_date)]
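
The same inclusive date filter can also be written with `Series.between`. This is a small sketch on a made-up table (`toy` is hypothetical), not a replacement for the cell above:

```python
import pandas as pd

# Made-up rows: one before the range, two inside it, one after it
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "time": pd.to_datetime(["2013-04-11", "2020-06-01", "2021-01-15", "2022-03-01"]),
})

start_date = pd.Timestamp("2020-06-01")
end_date = pd.Timestamp("2022-02-28")

# between() is inclusive on both ends by default; .copy() makes the result independent
in_range = toy[toy["time"].between(start_date, end_date)].copy()
print(in_range["id"].tolist())
```

Both forms include the boundary dates only because the `time` column was reduced to whole days earlier; with time-of-day values, rows later on the end date would fall outside the range.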

In [63]:
filtered_df.head()

Unnamed: 0,id,text,time,sentiment
0,439881211,$NVDA I see a 20 $ move to the upside,2022-02-28,Bullish
1,439881067,$NVDA holding 💎📈,2022-02-28,Bullish
2,439880835,$NVDA about to moon … again,2022-02-28,Bullish
3,439880275,$NVDA got some calls for a few weeks out and s...,2022-02-28,Bullish
4,439879846,$NVDA I’m buying this it looks like bottom,2022-02-28,Bullish


In [64]:
filtered_df.tail()

Unnamed: 0,id,text,time,sentiment
224941,216124041,$SHOP $NVDA $BYND $SPY i want $W tooo,2020-06-01,
224942,216123311,$NVDA $370 Eow,2020-06-01,
224943,216118763,$SPY $GILD $BABA $AAPL $NVDA “Pandemic” + full...,2020-06-01,Bullish
224944,216114459,$NVDA going to new highs soon,2020-06-01,Bullish
224945,216113684,$NVDA Nasdaq will be green by 3AM. Brrrrrrrr.....,2020-06-01,


In [65]:
# Save the filtered DataFrame as a CSV file
output_file = r'C:\Users\sahma\Desktop\Thises\Stocks\NVDA\NVDA_2013_2022\Nvidia_1.csv'  
filtered_df.to_csv(output_file, index=False)

print(f'Filtered data saved to {output_file}')

Filtered data saved to C:\Users\sahma\Desktop\Thises\Stocks\NVDA\NVDA_2013_2022\Nvidia_1.csv


In [66]:
df = pd.read_csv(r'C:\Users\sahma\Desktop\Thises\Stocks\NVDA\NVDA_2013_2022\Nvidia_1.csv')

In [67]:
df.tail()

Unnamed: 0,id,text,time,sentiment
224941,216124041,$SHOP $NVDA $BYND $SPY i want $W tooo,2020-06-01,
224942,216123311,$NVDA $370 Eow,2020-06-01,
224943,216118763,$SPY $GILD $BABA $AAPL $NVDA “Pandemic” + full...,2020-06-01,Bullish
224944,216114459,$NVDA going to new highs soon,2020-06-01,Bullish
224945,216113684,$NVDA Nasdaq will be green by 3AM. Brrrrrrrr.....,2020-06-01,


In [90]:
df = pd.read_csv(r'C:\Users\sahma\Desktop\Thises\Stocks\stocks\NVDA\NVDA_1\sorted_Nvidia_1.csv')

In [91]:
df.head()

Unnamed: 0,id,text,time,sentiment
0,439881211,$NVDA I see a 20 $ move to the upside,2022-02-28,Bullish
1,439881067,$NVDA holding 💎📈,2022-02-28,Bullish
2,439880835,$NVDA about to moon … again,2022-02-28,Bullish
3,439880275,$NVDA got some calls for a few weeks out and s...,2022-02-28,Bullish
4,439879846,$NVDA I’m buying this it looks like bottom,2022-02-28,Bullish


In [92]:
df.tail()

Unnamed: 0,id,text,time,sentiment
224941,216124041,$SHOP $NVDA $BYND $SPY i want $W tooo,2020-06-01,
224942,216123311,$NVDA $370 Eow,2020-06-01,
224943,216118763,$SPY $GILD $BABA $AAPL $NVDA “Pandemic” + full...,2020-06-01,Bullish
224944,216114459,$NVDA going to new highs soon,2020-06-01,Bullish
224945,216113684,$NVDA Nasdaq will be green by 3AM. Brrrrrrrr.....,2020-06-01,


## Sort dates in ascending order

In [94]:
import pandas as pd

# Path to the CSV file
csv_file_path = r'C:\Users\sahma\Desktop\Thises\Stocks\stocks\NVDA\NVDA_1\sorted_Nvidia_1.csv'

# Read the CSV file into a DataFrame
data = pd.read_csv(csv_file_path)

# Parse the 'time' column as yyyy-mm-dd dates and write it back as strings
# in the same format (this also validates every value)
data['time'] = pd.to_datetime(data['time'], format='%Y-%m-%d').dt.strftime('%Y-%m-%d')

# Sort the DataFrame by the 'time' column in ascending order
data = data.sort_values(by='time')

# Save the sorted DataFrame back to the same CSV file (overwrites the input)
sorted_csv_file_path = r'C:\Users\sahma\Desktop\Thises\Stocks\stocks\NVDA\NVDA_1\sorted_Nvidia_1.csv'
data.to_csv(sorted_csv_file_path, index=False)

print(f'Sorted data saved to {sorted_csv_file_path}')


Sorted data saved to C:\Users\sahma\Desktop\Thises\Stocks\stocks\NVDA\NVDA_1\sorted_Nvidia_1.csv
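
Because many rows share the same date, the order of rows within a day depends on the sorting algorithm, and the default used by `sort_values` is not guaranteed to be stable. If the original order within each day should be kept, `kind="stable"` can be passed. A minimal sketch with made-up rows:

```python
import pandas as pd

# Made-up rows with repeated dates, listed newest first
toy = pd.DataFrame({
    "id": [30, 10, 20, 40],
    "time": ["2022-02-28", "2020-06-01", "2022-02-28", "2020-06-01"],
})

# yyyy-mm-dd strings sort chronologically; kind="stable" keeps tied rows in input order
stable = toy.sort_values(by="time", kind="stable")
print(stable["id"].tolist())
```

With the stable sort, ids 10 and 40 (both `2020-06-01`) and ids 30 and 20 (both `2022-02-28`) stay in the order in which they appeared in the input.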


In [95]:
df = pd.read_csv(r'C:\Users\sahma\Desktop\Thises\Stocks\stocks\NVDA\NVDA_1\sorted_Nvidia_1.csv')

In [96]:
df.head()

Unnamed: 0,id,text,time,sentiment
0,216113684,$NVDA Nasdaq will be green by 3AM. Brrrrrrrr.....,2020-06-01,
1,216269957,$NVDA those 400 calls for next week aren&#39;t...,2020-06-01,
2,216268160,@jensonlaw lol $QQQ up 13% in the last 3 month...,2020-06-01,
3,216266036,$NVDA still holding this name watching 355 are...,2020-06-01,
4,216262159,$NVDA weekly *almost* giving a big sell signa...,2020-06-01,


In [97]:
df.tail()

Unnamed: 0,id,text,time,sentiment
224941,439808681,https://youtube.com/watch?v=Tfp63bYqr_s&amp;fe...,2022-02-28,
224942,439809197,$NVDA Futures are tanking - as you all know. ...,2022-02-28,Bullish
224943,439809385,$NVDA good times coming to add to the position...,2022-02-28,Bullish
224944,439805726,$NVDA Volatility is King!! Simulated Weekly $2...,2022-02-28,
224945,439881211,$NVDA I see a 20 $ move to the upside,2022-02-28,Bullish


In [99]:
df = pd.read_csv(r'C:\Users\sahma\Desktop\Thises\Stocks\stocks\NVDA\NVDA_1\sorted_Nvidia_1.csv')
len(df)

224946

In [100]:
df = pd.read_csv(r'C:\Users\sahma\Desktop\Thises\Stocks\stocks\NVDA\NVDA_2\sorted_Nvidia_2.csv')
len(df)

90327

In [101]:
df = pd.read_csv(r'C:\Users\sahma\Desktop\Thises\Stocks\stocks\NVDA\NVDA_3\sorted_Nvidia_3.csv')
len(df)

100226