**Cell 1**:
- This cell imports `tweepy`, `json`, and `os`, then loads Twitter API credentials from a `config.json` file located in the parent of the current working directory. It builds an OAuth 1.0a handler from those credentials, creates the API client, and calls `verify_credentials()`, printing a confirmation message on success or the error on failure.

In [None]:
import tweepy
import json
import os

# config.json is expected one level above the current working directory
config_path = os.path.join(os.path.dirname(os.getcwd()), 'config.json')

# Load the API credentials from config.json
with open(config_path, 'r') as file:
    config = json.load(file)

# Authenticate to the Twitter API
auth = tweepy.OAuthHandler(config['api_key'], config['api_secret_key'])
auth.set_access_token(config['access_token'], config['access_token_secret'])
api = tweepy.API(auth)

# Check for authentication success
try:
    api.verify_credentials()
    print("Authentication OK")
except Exception as e:
    print("Failed to authenticate:", e)
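
Before handing the values to `tweepy`, it can help to confirm that `config.json` actually contains the four keys the cell reads. A minimal sketch, assuming the same key names as above; the helper name `load_credentials` is hypothetical and not part of the notebook:

```python
import json

# Key names taken from the authentication cell above
REQUIRED_KEYS = ("api_key", "api_secret_key", "access_token", "access_token_secret")


def load_credentials(config_path):
    """Load config.json and fail early if a credential is missing or empty.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    with open(config_path, "r") as file:
        config = json.load(file)

    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise KeyError(f"config.json is missing: {', '.join(missing)}")

    # Return only the keys that authentication needs
    return {key: config[key] for key in REQUIRED_KEYS}
```

Failing at load time gives a clearer message than an authentication error raised later by the API call.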

**Cell 2**:
- This cell defines a function `print_json_structure` that recursively prints the keys and sub-keys of a JSON-like dictionary or list. For a list, it prints the item count and descends only into the first item, on the assumption that all items share the same shape. The cell then loads `tweets_data.json` and prints its structure to show how the JSON content is organized.

In [15]:
import json

def print_json_structure(data, indent=''):
    """
    Recursively prints the keys and sub-keys of a JSON-like dictionary or list.
    Args:
        data (dict or list): The JSON-like dictionary or list.
        indent (str): The indentation string to visualize hierarchy.
    """
    if isinstance(data, dict):
        for key, value in data.items():
            print(f"{indent}{key}")
            # Recursively print the structure of the nested dictionary or list
            print_json_structure(value, indent + '    ')
    elif isinstance(data, list):
        print(f"{indent}List of {len(data)} items")
        # Only the first item is inspected, assuming all items share the same shape
        if data:
            print_json_structure(data[0], indent + '    ')

# Load your JSON data
file_path = 'tweets_data.json'
with open(file_path, 'r') as file:
    data = json.load(file)

# Print the structure of the JSON data
print_json_structure(data)


List of 1294 items
    public_metrics
        retweet_count
        reply_count
        like_count
        quote_count
        bookmark_count
        impression_count
    id
    text
    conversation_id
    edit_history_tweet_ids
        List of 1 items
    lang
    referenced_tweets
        List of 1 items
            type
            id
    author_id
    context_annotations
        List of 6 items
            domain
                id
                name
                description
            entity
                id
                name
                description
    created_at
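
Because the function descends only into the first item of each list, keys that are absent from that item never show up in the printout. A minimal sketch of a variant that collects key paths across every list item; `collect_key_paths` and the dotted-path and `[]` notation are assumptions for illustration, not part of the notebook:

```python
def collect_key_paths(data, prefix=""):
    """Return the set of key paths found anywhere in a JSON-like structure.

    Hypothetical helper; list levels are marked with '[]' and every item
    of a list is visited, not only the first.
    """
    paths = set()
    if isinstance(data, dict):
        for key, value in data.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= collect_key_paths(value, path)
    elif isinstance(data, list):
        for item in data:
            paths |= collect_key_paths(item, prefix + "[]")
    return paths


# Made-up sample: the second tweet has a key the first one lacks
sample = [
    {"id": "1", "public_metrics": {"like_count": 3}},
    {"id": "2", "geo": {"place_id": "abc"}},
]
for path in sorted(collect_key_paths(sample)):
    print(path)
```

Running it on the loaded `data` visits every tweet, so it is slower than the first-item view but reports fields that appear on only some tweets.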


**Cell 3**:
- This cell repeats the process from Cell 2 with a different JSON file (`tweets_data4.json`), redefining `print_json_structure` and printing the file's structure. Unlike `tweets_data.json`, which is a bare list of tweets, this file is a response object with top-level `data`, `includes`, and `meta` keys.

In [16]:
import json


def print_json_structure(data, indent=''):
    """
    Recursively prints the keys and sub-keys of a JSON-like dictionary or list.
    Args:
        data (dict or list): The JSON-like dictionary or list.
        indent (str): The indentation string to visualize hierarchy.
    """
    if isinstance(data, dict):
        for key, value in data.items():
            print(f"{indent}{key}")
            # Recursively print the structure of the nested dictionary or list
            print_json_structure(value, indent + '    ')
    elif isinstance(data, list):
        print(f"{indent}List of {len(data)} items")
        # Only the first item is inspected, assuming all items share the same shape
        if data:
            print_json_structure(data[0], indent + '    ')


# Load your JSON data
file_path = 'tweets_data4.json'
with open(file_path, 'r') as file:
    data = json.load(file)

# Print the structure of the JSON data
print_json_structure(data)

data
    List of 10 items
        lang
        edit_history_tweet_ids
            List of 1 items
        public_metrics
            retweet_count
            reply_count
            like_count
            quote_count
            bookmark_count
            impression_count
        created_at
        id
        text
        author_id
includes
    users
        List of 10 items
            location
            username
            id
            verified
            name
meta
    newest_id
    oldest_id
    result_count
    next_token
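
In this response shape, author details sit under `includes.users` instead of inside each tweet, so connecting a tweet to its author means matching the tweet's `author_id` against a user's `id`. A minimal sketch under that assumption; the helper name `attach_usernames`, the added `author_username` key, and the sample payload are made up for illustration:

```python
def attach_usernames(response):
    """Return copies of the tweets in `response` with an 'author_username' key.

    Hypothetical helper; assumes `response` has the 'data' / 'includes.users'
    layout printed above and that a tweet's 'author_id' matches a user's 'id'.
    """
    users = response.get("includes", {}).get("users", [])
    username_by_id = {user["id"]: user.get("username") for user in users}

    tweets = []
    for tweet in response.get("data", []):
        enriched = dict(tweet)  # shallow copy so the input is left untouched
        enriched["author_username"] = username_by_id.get(tweet.get("author_id"))
        tweets.append(enriched)
    return tweets


# Made-up payload mirroring the printed structure
sample_response = {
    "data": [{"id": "10", "author_id": "1", "text": "hello"}],
    "includes": {"users": [{"id": "1", "username": "example_user", "name": "Example"}]},
    "meta": {"result_count": 1},
}
print(attach_usernames(sample_response))
```

Tweets whose author is not listed under `includes.users` get `None` for the added key.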


**Cell 4**:
- This cell defines two functions:
  - `read_and_combine_json_files`: Reads multiple JSON files, accepting either a bare list of tweets or a response object with a `data` key. It renames each tweet's `id` to `tweet_id` and keeps only the first occurrence of each ID, so the combined list contains no duplicate tweets.
  - `export_to_csv`: Converts the combined list of dictionaries to a pandas DataFrame and saves it as a CSV file.
- It specifies paths to multiple JSON files, combines the data from these files using the defined functions, and exports the combined data to a CSV file (`combined_tweets_data.csv`).

In [17]:
import json
import pandas as pd


def read_and_combine_json_files(file_paths):
    # Maps tweet ID -> tweet, so duplicates across files are dropped
    all_tweets = {}
    for file_path in file_paths:
        try:
            with open(file_path, 'r') as file:
                data = json.load(file)
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON from {file_path}: {e}")
            continue

        # A file holds either a bare list of tweets or a response object
        # whose tweets sit under the 'data' key (empty if that key is absent)
        tweets = data.get('data', []) if isinstance(data, dict) else data

        for tweet in tweets:
            tweet_id = tweet.get('id')
            # Keep only the first occurrence of each tweet ID
            if tweet_id and tweet_id not in all_tweets:
                # Rename 'id' to 'tweet_id'
                tweet['tweet_id'] = tweet.pop('id')
                all_tweets[tweet_id] = tweet

    return list(all_tweets.values())


def export_to_csv(tweets, output_file):
    # Convert list of dictionaries to DataFrame
    df = pd.DataFrame(tweets)
    # Save DataFrame to CSV
    df.to_csv(output_file, index=False)
    print(f"Saved {len(df)} unique tweets to '{output_file}'.")


# Define the paths to your JSON files
file_paths = [
    'tweets_data.json', 'tweets_data1.json', 'tweets_data2.json',
    'tweets_data4.json', 'tweets_data5.json', 'tweets_data6.json',
    'tweets_data7.json', 'tweets_data8.json'
]

# Read and combine the data from the JSON files
combined_tweets = read_and_combine_json_files(file_paths)

# Export the combined data to a CSV file
export_to_csv(combined_tweets, 'combined_tweets_data.csv')

Saved 2440 unique tweets to 'combined_tweets_data.csv'.
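
`pd.DataFrame` keeps nested values such as `public_metrics` as Python dictionaries, and `to_csv` then writes them out as strings. If separate numeric columns are preferred, `pandas.json_normalize` can flatten nested dictionaries before export. A minimal sketch with made-up sample tweets; the function name `export_flat_csv`, the output file name, and the `_` separator are choices made here, not part of the notebook:

```python
import pandas as pd


def export_flat_csv(tweets, output_file):
    """Flatten nested dictionaries into columns and save the result as CSV.

    Hypothetical alternative to export_to_csv; list-valued fields such as
    referenced_tweets are left as they are.
    """
    df = pd.json_normalize(tweets, sep="_")
    df.to_csv(output_file, index=False)
    return df


# Made-up tweets in the combined format produced above
sample_tweets = [
    {"tweet_id": "1", "text": "first", "public_metrics": {"like_count": 2, "retweet_count": 0}},
    {"tweet_id": "2", "text": "second", "public_metrics": {"like_count": 5, "retweet_count": 1}},
]
flat = export_flat_csv(sample_tweets, "sample_flat_tweets.csv")
print(flat.columns.tolist())
```

Each metric becomes its own column (for example `public_metrics_like_count`), which lets `describe()` summarize it without parsing strings first.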


**Cell 5**:
- This cell performs exploratory data analysis (EDA) on the combined dataset from `combined_tweets_data.csv`:
  - Loads the dataset into a pandas DataFrame.
  - Displays basic information about the dataset, including the number of entries and column types.
  - Shows the data types of each column.
  - Displays the number of non-null entries for each column.
  - Displays the number of missing entries for each column.
  - Shows the number of unique values for each column.
  - Provides a preview of the first few rows of the dataset.
  - Provides a basic statistical summary of numeric columns.
- This analysis helps understand the structure, completeness, and statistical properties of the dataset.

In [18]:
import pandas as pd

# Load the dataset
df = pd.read_csv('combined_tweets_data.csv')

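# Display basic information about the dataset.
# Note: df.info() prints its report directly and returns None, so wrapping it
# in print() also emits a trailing "None".
print("Basic Information:")
print(df.info())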

# Show data types of each column
print("\nData Types:")
print(df.dtypes)

# Display the number of non-null entries for each column
print("\nNon-Null Count:")
print(df.notnull().sum())

# Display the number of missing entries for each column
print("\nMissing Values Count:")
print(df.isnull().sum())

# Show the number of unique values for each column
print("\nUnique Values Count:")
print(df.nunique())

# Display the first few rows of the dataset to understand its structure
print("\nPreview of Data:")
print(df.head())

# Basic statistical summary of numeric columns
print("\nStatistical Summary of Numeric Columns:")
print(df.describe())

Basic Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2440 entries, 0 to 2439
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   public_metrics          2440 non-null   object 
 1   text                    2440 non-null   object 
 2   conversation_id         2388 non-null   float64
 3   edit_history_tweet_ids  2440 non-null   object 
 4   lang                    2440 non-null   object 
 5   referenced_tweets       1450 non-null   object 
 6   author_id               2390 non-null   float64
 7   context_annotations     977 non-null    object 
 8   created_at              2440 non-null   object 
 9   tweet_id                2440 non-null   int64  
 10  in_reply_to_user_id     439 non-null    float64
 11  geo                     9 non-null      object 
dtypes: float64(3), int64(1), object(8)
memory usage: 228.9+ KB
None

Data Types:
public_metrics             object
text            