**Cell 1**:
- This cell imports necessary libraries and modules (`pandas`, `numpy`, `torch`, `transformers`, `sklearn`, and `nltk`) and ensures NLTK resources are downloaded. It then loads a dataset of bills from a CSV file and displays data types, summary statistics for numerical columns, missing values, and the first few rows to understand the data format. The number of bills is also printed.

In [1]:
import pandas as pd
import numpy as np
import torch
from transformers import RobertaModel, RobertaTokenizer
from torch.utils.data import DataLoader, Dataset
from sklearn.preprocessing import LabelEncoder
from multiprocessing import Pool, cpu_count
import json
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Ensure NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load the dataset
bills_df = pd.read_csv('BillData/refined_detailed_bills.csv')

# Display data types
print("Data Types:")
print(bills_df.dtypes)
print("\n")

# Display summary statistics for numerical columns
print("Summary Statistics:")
print(bills_df.describe())
print("\n")

# Identify missing values
print("Missing Values:")
missing_values = bills_df.isnull().sum()
print(missing_values)
print("\n")

# Display the first few rows of the dataframe to understand the data format
print("Data Format (first few rows):")
print(bills_df.head())

print(f"Number of bills: {bills_df.shape[0]}")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Truck\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Truck\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Truck\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Data Types:
bill_id             int64
bill_number        object
title              object
description        object
url                object
state_link         object
status              int64
status_date        object
session_id          int64
state_id            int64
state              object
body_id             int64
current_body_id     int64
sponsors           object
subjects           object
texts              object
votes              object
dtype: object


Summary Statistics:
            bill_id      status   session_id    state_id     body_id  \
count  4.440000e+02  444.000000   444.000000  444.000000  444.000000   
mean   1.771129e+06    2.009009  2050.889640   26.396396   56.896396   
std    6.339997e+04    1.523306    45.977853   15.238231   31.293817   
min    1.636388e+06    1.000000  1986.000000    2.000000    1.000000   
25%    1.714478e+06    1.000000  2016.000000   13.000000   31.000000   
50%    1.783610e+06    1.000000  2034.000000   23.000000   54.500000   
75%   


**Cell 2**:
- This cell defines three functions:
  - `safe_json_loads`: Safely loads JSON data by correcting common format mistakes.
  - `clean_text`: Cleans text by converting it to lowercase and removing non-word characters.
  - `preprocess_text`: Tokenizes, removes stopwords, and lemmatizes the input text.
- It then loads the bill data, drops unnecessary columns, converts date columns to datetime format, combines title and description into a single text column, cleans and preprocesses the text, creates a state mapping, and saves the processed data to a CSV file.

In [2]:
def safe_json_loads(s):
    """
    Safely loads JSON data correcting common format mistakes.
    
    Args:
    s (str): A string representation of JSON data.
    
    Returns:
    dict: A dictionary loaded from the JSON, empty if an error occurs.
    """
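    # Quote bare or single-quoted keys. The optional quotes sit outside the
    # capture groups so they are dropped instead of being copied into the output.
    s = re.sub(r"([{,]\s*)'?(\w+)'?\s*:", r'\1"\2":', s)  # Fix keys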
    s = re.sub(r":\s*'([^']+)'(\s*[},])", r': "\1"\2', s)  # Fix values
    try:
        return json.loads(s)
    except json.JSONDecodeError:
        return {}


def clean_text(text):
    """
    Cleans text by converting to lower case and removing non-word characters.
    
    Args:
    text (pd.Series): Pandas Series containing text data.
    
    Returns:
    pd.Series: Cleaned text data.
    """
    return text.str.lower().str.replace(r'\W', ' ', regex=True)


def preprocess_text(text):
    """
    Tokenizes, removes stopwords, and lemmatizes the input text.
    
    Args:
    text (str): Text to preprocess.
    
    Returns:
    str: Preprocessed text.
    """
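    tokens = word_tokenize(text)
    # Build the stop-word set once instead of calling stopwords.words() per token
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    lemmatizer = WordNetLemmatizer()
    return ' '.join([lemmatizer.lemmatize(word) for word in tokens])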


print("Starting to load data...")
bills_df = pd.read_csv('BillData/refined_detailed_bills.csv')
print("Data loaded successfully.")

# Drop unnecessary columns
bills_df.drop(['state_link', 'sponsors', 'votes', 'url'], axis=1, inplace=True)

print("Converting date columns...")
bills_df['status_date'] = pd.to_datetime(bills_df['status_date'])
print("Conversion completed.")

print("Combining title and description...")
bills_df['full_text'] = bills_df['title'] + ' ' + bills_df['description']
bills_df['full_text'] = clean_text(bills_df['full_text'])
print("Text combined and cleaned.")

print("Applying text preprocessing...")
bills_df['processed_text'] = bills_df['full_text'].apply(preprocess_text)
print("Text preprocessing completed.")

# Create and save state mapping
state_map = dict(enumerate(bills_df['state'].unique()))
with open('BillData/state_map.json', 'w') as f:
    json.dump(state_map, f)
print("State map saved to 'BillData/state_map.json'.")

print("Saving processed data...")
bills_df.to_csv('BillData/final_processed_bills.csv', index=False)
print("Data saved successfully to 'BillData/final_processed_bills.csv'.")

Starting to load data...
Data loaded successfully.
Converting date columns...
Conversion completed.
Combining title and description...
Text combined and cleaned.
Applying text preprocessing...
Text preprocessing completed.
State map saved to 'BillData/state_map.json'.
Saving processed data...
Data saved successfully to 'BillData/final_processed_bills.csv'.
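
The state map built in this cell uses integer positions as keys, but JSON object keys are always strings. The following is a minimal sketch of what happens on a round trip and how the integer keys can be restored. The state abbreviations are hypothetical stand-ins for `bills_df['state'].unique()`, and the names `index_to_state` and `state_to_index` are illustrative only.

```python
import json

# Hypothetical values standing in for bills_df['state'].unique()
states = ["MN", "CA", "TX"]

# Same construction as in the cell: position -> state
state_map = dict(enumerate(states))

# Serializing and parsing again turns the integer keys into strings
restored = json.loads(json.dumps(state_map))

# Convert the keys back to int and build the reverse lookup (state -> index)
index_to_state = {int(k): v for k, v in restored.items()}
state_to_index = {v: k for k, v in index_to_state.items()}

print(restored)
print(state_to_index)
```

Any later cell that reads `state_map.json` would need the same key conversion before looking up states by integer index.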


**Cell 3**:
- This cell defines a custom Dataset class for handling text data and functions to generate embeddings using a pre-trained RoBERTa model. It loads the preprocessed dataset, verifies columns, determines the device (CPU or GPU), loads the RoBERTa model and tokenizer, generates embeddings, and saves them to a .npy file.

In [6]:
import pandas as pd
import numpy as np
import torch
from transformers import RobertaModel, RobertaTokenizer
from torch.utils.data import DataLoader, Dataset
import os

# Ensure Numpy prints arrays completely
np.set_printoptions(threshold=np.inf)


class TextDataset(Dataset):
    """
    Custom Dataset class for handling text data.
    """

    def __init__(self, texts):
        """
        Initialize with a list of texts.
        """
        self.texts = texts

    def __len__(self):
        """
        Return the length of the dataset.
        """
        return len(self.texts)

    def __getitem__(self, idx):
        """
        Return the text at the given index.
        """
        return self.texts[idx]


def get_embeddings(model, tokenizer, texts, batch_size=16, device='cpu'):
    """
    Generate embeddings for a list of texts using a pre-trained RoBERTa model.
    
    Args:
    - model: Pre-trained RoBERTa model.
    - tokenizer: Corresponding tokenizer.
    - texts: List of texts to process.
    - batch_size: Batch size for processing.
    - device: Device to run the model on ('cpu' or 'cuda').

    Returns:
    - np.ndarray: Array of embeddings.
    """
    dataset = TextDataset(texts)
    data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    model = model.to(device)
    all_embeddings = []

    for batch_texts in data_loader:
        try:
            inputs = tokenizer(batch_texts, return_tensors="pt",
                               padding=True, truncation=True, max_length=512)
            inputs = {k: v.to(device) for k, v in inputs.items()}
            with torch.no_grad():
                outputs = model(**inputs)
            embeddings = outputs.last_hidden_state.mean(
                dim=1).detach().cpu().numpy()
            all_embeddings.append(embeddings)
        except Exception as e:
            print(f"Error processing batch: {e}")

    return np.vstack(all_embeddings)


def save_embeddings(embeddings, file_name):
    """
    Save embeddings to a .npy file for efficient loading and use in PyG.
    
    Args:
    - embeddings: Embeddings to save.
    - file_name: Name of the file to save the embeddings.
    """
    np.save(file_name, embeddings)
    print(f"Embeddings saved successfully to {file_name}.")


if __name__ == "__main__":
    try:
        # Load preprocessed dataset
        print("Loading dataset...")
        bills_df = pd.read_csv('BillData/final_processed_bills.csv')

        # Verify columns
        if 'processed_text' not in bills_df.columns:
            raise ValueError("Processed text column not found in the dataset.")

        device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Using device: {device}")

        # Load RoBERTa model and tokenizer
        print("Loading RoBERTa model and tokenizer...")
        model = RobertaModel.from_pretrained('roberta-base')
        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

        # Get the list of processed texts
        texts = bills_df['processed_text'].tolist()

        # Generate embeddings
        print("Generating embeddings...")
        embeddings = get_embeddings(model, tokenizer, texts, device=device)
        print("Embeddings generation complete.")

        # Save embeddings to .npy file
        embeddings_file = "BillData/roberta_bills_embeddings.npy"
        print("Saving embeddings...")
        save_embeddings(embeddings, embeddings_file)
        print("Process completed successfully.")

    except Exception as e:
        print(f"An error occurred: {e}")

Loading dataset...
Using device: cpu
Loading RoBERTa model and tokenizer...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Generating embeddings...
Embeddings generation complete.
Saving embeddings...
Embeddings saved successfully to BillData/roberta_bills_embeddings.npy.
Process completed successfully.
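
`get_embeddings` averages `last_hidden_state` over every position in the padded batch, so the vectors at padding positions also enter the mean for shorter texts. One possible alternative is to weight the mean by the attention mask. The sketch below shows only the arithmetic, in NumPy, on hand-made arrays; the function name `masked_mean_pool` is hypothetical and not part of the notebook. In the cell above, the mask would presumably come from `inputs['attention_mask']`.

```python
import numpy as np


def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: array of shape (batch, seq_len, hidden_dim)
    attention_mask: array of shape (batch, seq_len); 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against empty rows
    return summed / counts


# Two toy sequences; the last position of the first one is padding
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)  # padding excluded
plain = hidden.mean(axis=1)              # padding included, as in the cell

print(pooled)
print(plain)
```

The two results differ only for the first sequence, which is the one that contains padding.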


**Cell 4**:
- This cell loads the generated embeddings and the original dataset, ensures they have the same length, computes pairwise cosine similarity, identifies duplicate bills based on a similarity threshold, removes duplicates from both the DataFrame and embeddings array, and saves the cleaned data and embeddings.

In [8]:
from scipy.spatial.distance import cdist
import numpy as np
import pandas as pd

# Load the embeddings and the original dataset
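# The file name must match the one written in Cell 3 exactly;
# the capitalized spelling only resolves on case-insensitive file systems.
embeddings = np.load("BillData/roberta_bills_embeddings.npy")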
bills_df = pd.read_csv('BillData/final_processed_bills.csv')

# Ensure the embeddings and the DataFrame have the same length
assert len(embeddings) == len(
    bills_df), "Mismatch between embeddings and DataFrame length."

# Compute pairwise cosine similarity
cosine_similarities = 1 - cdist(embeddings, embeddings, metric='cosine')

# Define the similarity threshold for considering bills as duplicates
similarity_threshold = 0.99999999


def identify_duplicates(similarity_matrix, threshold):
    """
    Identify duplicates in the similarity matrix based on the given threshold.
    
    Args:
    - similarity_matrix: Pairwise cosine similarity matrix.
    - threshold: Cosine similarity threshold to consider bills as duplicates.
    
    Returns:
    - List of indices of duplicate bills to be removed.
    """
    num_bills = similarity_matrix.shape[0]
    duplicates = set()

    for i in range(num_bills):
        for j in range(i + 1, num_bills):
            if similarity_matrix[i, j] > threshold:
                duplicates.add(j)

    return list(duplicates)


# Identify duplicate bill indices
duplicate_indices = identify_duplicates(
    cosine_similarities, similarity_threshold)

# Remove duplicates from the DataFrame
bills_df_no_duplicates = bills_df.drop(
    index=duplicate_indices).reset_index(drop=True)

# Remove duplicates from the embeddings array
embeddings_no_duplicates = np.delete(embeddings, duplicate_indices, axis=0)

# Save the cleaned DataFrame
bills_df_no_duplicates.to_csv(
    'BillData/final_bills_no_duplicates.csv', index=False)

# Save the updated embeddings array to a new file
np.save("BillData/RoBERTa_bills_embeddings_no_duplicates.npy",
        embeddings_no_duplicates)

print(f"Removed {len(duplicate_indices)} duplicate bills. Cleaned data and embeddings saved successfully.")

Removed 31 duplicate bills. Cleaned data and embeddings saved successfully.
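
The nested loop in `identify_duplicates` visits every pair `(i, j)` with `i < j` in Python. A vectorized variant can express the same rule with a boolean upper-triangle mask. This is a sketch on a small hand-made similarity matrix; `identify_duplicates_vectorized` is a hypothetical name, not a function from the notebook.

```python
import numpy as np


def identify_duplicates_vectorized(similarity_matrix, threshold):
    """Return the sorted indices j for which some i < j has similarity > threshold."""
    n = similarity_matrix.shape[0]
    # True strictly above the diagonal, i.e. for pairs (i, j) with i < j
    upper = np.triu(np.ones((n, n), dtype=bool), k=1)
    hits = (similarity_matrix > threshold) & upper
    # Column indices of the hits are the later members of each duplicate pair
    return np.unique(np.nonzero(hits)[1]).tolist()


# Toy matrix: rows 0, 1 and 3 are identical to each other, row 2 is distinct
sim = np.array([
    [1.0, 1.0, 0.2, 1.0],
    [1.0, 1.0, 0.3, 1.0],
    [0.2, 0.3, 1.0, 0.1],
    [1.0, 1.0, 0.1, 1.0],
])

duplicates = identify_duplicates_vectorized(sim, 0.99999999)
print(duplicates)
```

As in the loop version, the first member of each group of near-identical bills is kept and the later ones are marked for removal.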


**Cell 5**:
- This cell standardizes the embeddings with StandardScaler and then reduces their dimensionality with PCA, keeping enough components to retain 95% of the variance. It saves the reduced embeddings to a .npy file and writes the DataFrame, with the PCA vectors in a new column, to a CSV file.

**PCA and normalization:**
Standardize the embeddings, then apply Principal Component Analysis (PCA). The intent is to spread the embeddings out in feature space so that fewer pairs look near-identical and similarity-based connections are more meaningful.

In [11]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the bill embeddings and the original dataset
bill_embeddings = np.load(
    "BillData/RoBERTa_bills_embeddings_no_duplicates.npy")
bills_df = pd.read_csv('BillData/final_bills_no_duplicates.csv')

# Ensure the embeddings and the DataFrame have the same length
assert len(bill_embeddings) == len(
    bills_df), "Mismatch between bill embeddings and DataFrame length."

# Standardize the embeddings
scaler = StandardScaler()
bill_embeddings_scaled = scaler.fit_transform(bill_embeddings)

# Apply PCA to reduce dimensionality while retaining 95% variance
pca = PCA(n_components=0.95)
bill_embeddings_pca = pca.fit_transform(bill_embeddings_scaled)

# Save the updated embeddings
np.save("BillData/roberta_bills_embeddings_pca.npy", bill_embeddings_pca)
print("Updated bill embeddings saved to 'BillData/roberta_bills_embeddings_pca.npy'.")

# Update the DataFrame with PCA embeddings
bills_df['pca_embeddings'] = list(bill_embeddings_pca)
bills_df.to_csv('BillData/bills.csv', index=False)

print("DataFrame with PCA embeddings saved successfully.")

Updated bill embeddings saved to 'BillData/roberta_bills_embeddings_pca.npy'.
DataFrame with PCA embeddings saved successfully.
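
To make the "retain 95% variance" step concrete, here is a NumPy-only sketch of the underlying arithmetic: standardize, take the SVD, and keep the smallest number of components whose cumulative share of variance reaches the target. It runs on synthetic data, the function name `pca_retain_variance` is hypothetical, and it is an illustration rather than scikit-learn's implementation, whose handling of edge cases may differ.

```python
import numpy as np


def pca_retain_variance(X, variance=0.95):
    """Standardize X, then project onto the leading principal components."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    Z = (X - mean) / std

    # Rows of Vt are the principal directions of the standardized data
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    cumulative = np.cumsum(explained)

    # Smallest k with cumulative[k - 1] >= variance
    k = min(int(np.searchsorted(cumulative, variance)) + 1, len(S))
    return Z @ Vt[:k].T, cumulative, k


# Synthetic data: 20 features driven by 3 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

reduced, cumulative, k = pca_retain_variance(X, variance=0.95)
print(reduced.shape, k)
```

Because the synthetic data has only three underlying factors, a handful of components is enough to reach the target; real sentence embeddings typically need many more.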


**Cell 6**:
- This cell defines functions to load a sentiment analysis model and tokenizer, perform sentiment analysis on texts, and save the results. It loads the bills DataFrame, applies sentiment analysis to each bill's processed text, and saves the updated DataFrame and sentiment probabilities tensor to files. It also verifies the saved CSV file by loading and displaying the first few rows. Because it reads `final_bills_no_duplicates.csv` and writes `bills.csv`, it overwrites the file written by Cell 5, so the `pca_embeddings` column is not carried forward.

In [17]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np


def load_sentiment_model():
    """
    Load the sentiment analysis model and tokenizer.
    
    Returns:
    - tokenizer: Pre-trained tokenizer for sentiment analysis.
    - model: Pre-trained sentiment analysis model.
    """
    try:
        tokenizer = AutoTokenizer.from_pretrained(
            "cardiffnlp/twitter-roberta-base-sentiment")
        model = AutoModelForSequenceClassification.from_pretrained(
            "cardiffnlp/twitter-roberta-base-sentiment")
        print("Sentiment model and tokenizer loaded successfully.")
        return tokenizer, model
    except Exception as e:
        print(f"Error loading sentiment model or tokenizer: {e}")
        raise


def sentiment_analysis(texts, tokenizer, model, batch_size=16, device='cpu'):
    """
    Perform sentiment analysis on a list of texts using a pre-trained model.
    
    Args:
    - texts: List of texts to analyze.
    - tokenizer: Pre-trained tokenizer.
    - model: Pre-trained sentiment analysis model.
    - batch_size: Batch size for processing.
    - device: Device to run the model on ('cpu' or 'cuda').

    Returns:
    - torch.Tensor: Tensor containing sentiment probabilities for each text.
    """
    try:
        model = model.to(device)
        all_scores = []
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i+batch_size]
            encoded_input = tokenizer(
                batch_texts, return_tensors='pt', truncation=True, max_length=512, padding=True)
            encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
            with torch.no_grad():
                output = model(**encoded_input)
            scores = torch.nn.functional.softmax(output.logits, dim=-1).cpu()
            all_scores.append(scores)
        return torch.cat(all_scores, dim=0)
    except Exception as e:
        print(f"Error performing sentiment analysis: {e}")
        raise


def load_bills(file_path):
    """
    Load the bills and processed texts from a CSV file.
    
    Args:
    - file_path: Path to the CSV file containing the bills and processed texts.
    
    Returns:
    - pd.DataFrame: DataFrame containing the bills and their processed texts.
    """
    try:
        bills_df = pd.read_csv(file_path)
        print(f"Loaded {len(bills_df)} bills from {file_path}.")
        return bills_df
    except FileNotFoundError as e:
        print(f"Error loading file: {e}")
        raise
    except pd.errors.ParserError as e:
        print(f"Error parsing file: {e}")
        raise


def apply_sentiment_analysis_to_bills(bills_df, tokenizer, model, batch_size=16, device='cpu'):
    """
    Apply sentiment analysis to each bill's processed text in the DataFrame.
    
    Args:
    - bills_df: DataFrame containing the bills.
    - tokenizer: Pre-trained tokenizer.
    - model: Pre-trained sentiment analysis model.
    - batch_size: Batch size for processing.
    - device: Device to run the model on ('cpu' or 'cuda').

    Returns:
    - pd.DataFrame: Updated DataFrame with sentiment probabilities.
    - torch.Tensor: Tensor containing sentiment probabilities.
    """
    try:
        texts = bills_df['processed_text'].tolist()
        sentiments = sentiment_analysis(
            texts, tokenizer, model, batch_size, device)
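        # cardiffnlp/twitter-roberta-base-sentiment orders its outputs as
        # negative, neutral, positive (label indices 0, 1, 2)
        sentiments_df = pd.DataFrame(sentiments.numpy(), columns=[
                                     'bill_negative', 'bill_neutral', 'bill_positive'])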
        bills_df = pd.concat([bills_df, sentiments_df], axis=1)
        print("Sentiment analysis applied to all bills.")
        return bills_df, sentiments
    except Exception as e:
        print(f"Error applying sentiment analysis: {e}")
        raise


def save_updated_bills_dataframe(bills_df, file_path):
    """
    Save the updated DataFrame with sentiment probabilities to a CSV file.
    
    Args:
    - bills_df: DataFrame containing the updated bills.
    - file_path: Path to save the updated CSV file.
    """
    try:
        bills_df.to_csv(file_path, index=False)
        print(f"Updated DataFrame saved to {file_path}.")
    except Exception as e:
        print(f"Error saving updated DataFrame: {e}")
        raise


def save_tensor(tensor, file_path):
    """
    Save the tensor to a file.
    
    Args:
    - tensor: Tensor to save.
    - file_path: Path to save the tensor file.
    """
    try:
        torch.save(tensor, file_path)
        print(f"Tensor saved to {file_path}.")
    except Exception as e:
        print(f"Error saving tensor: {e}")
        raise


def verify_saved_file(file_path):
    """
    Verify the integrity of the saved CSV file by loading it and checking the first few rows.
    
    Args:
    - file_path: Path to the saved CSV file.
    """
    try:
        df = pd.read_csv(file_path)
        print(f"Verification successful. First few rows of {file_path}:")
        print(df.head())
    except Exception as e:
        print(f"Error verifying saved file: {e}")
        raise


if __name__ == "__main__":
    try:
        # Load the sentiment analysis model and tokenizer
        tokenizer, model = load_sentiment_model()

        # Load the bills DataFrame
        bills_df = load_bills('BillData/final_bills_no_duplicates.csv')

        # Determine device
        device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Using device: {device}")

        # Apply sentiment analysis to the DataFrame
        bills_df, sentiments_tensor = apply_sentiment_analysis_to_bills(
            bills_df, tokenizer, model, device=device)

        # Save the updated DataFrame
        save_updated_bills_dataframe(bills_df, 'BillData/bills.csv')

        # Save the tensor containing sentiment probabilities
        save_tensor(sentiments_tensor, 'BillData/bills_sentiments_tensor.pt')

        # Verify the saved file
        verify_saved_file('BillData/bills.csv')

    except Exception as e:
        print(f"An error occurred in the main execution block: {e}")

Sentiment model and tokenizer loaded successfully.
Loaded 413 bills from BillData/final_bills_no_duplicates.csv.
Using device: cpu
Sentiment analysis applied to all bills.
Updated DataFrame saved to BillData/bills.csv.
Tensor saved to BillData/bills_sentiments_tensor.pt.
Verification successful. First few rows of BillData/bills.csv:
   bill_id bill_number                                              title  \
0  1696211      HF1373  Consumer choice of fuel provided, rulemaking a...   
1  1862762      HF4800  Original equipment manufacturer required to fa...   
2  1862691      HF4790  State Board of Investment standards to require...   
3  1642978        HF30  Catalytic converter purchase or acquisition re...   
4  1856477      HF4331  Metropolitan Council abolished, duties transfe...   

                                         description  status status_date  \
0  Consumer choice of fuel provided, rulemaking a...       1  2023-02-06   
1  Original equipment manufacturer required to fa.
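
The probabilities returned by `sentiment_analysis` carry no names, so the column labels have to follow the model's output order. For `cardiffnlp/twitter-roberta-base-sentiment` that order is negative, neutral, positive (label indices 0, 1, 2). The sketch below shows the mapping with a plain-Python softmax; the logits are made-up values, and `softmax` and `label_scores` are hypothetical helper names.

```python
import math

# Output order of cardiffnlp/twitter-roberta-base-sentiment:
# index 0 = negative, 1 = neutral, 2 = positive
LABELS = ("negative", "neutral", "positive")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_scores(logits, prefix="bill_"):
    """Pair each probability with its label name, in model output order."""
    return {f"{prefix}{name}": p for name, p in zip(LABELS, softmax(logits))}


# Hypothetical logits for one bill (not taken from the notebook's run)
scores = label_scores([-1.2, 2.0, 0.3])
top = max(scores, key=scores.get)

print(scores)
print(top)
```

Keeping the label tuple in one place makes it harder for column names and model outputs to drift apart.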

**Cell 7**:
- This cell loads the bills DataFrame saved by Cell 6, drops the text and metadata columns that are no longer needed, normalizes the 'status_date' column to YYYY-MM-DD strings, adds zero-valued placeholder columns for positive, neutral, and negative sentiment, and saves the result back to the same file. It prints the first rows of the DataFrame for troubleshooting.

In [20]:
import pandas as pd

# Load the bills DataFrame with sentiment columns
bills_df = pd.read_csv('BillData/bills.csv')

# List of columns to drop
columns_to_drop = [
    'bill_number', 'title', 'description', 'url', 'state_link', 'state',
    'current_body_id', 'sponsors', 'subjects', 'texts', 'votes', 'processed_text', 'full_text'
]

# Check which columns are present in the DataFrame
existing_columns_to_drop = [
    col for col in columns_to_drop if col in bills_df.columns]

# Drop the specified columns
bills_df.drop(columns=existing_columns_to_drop, inplace=True)

# Parse 'status_date' and store it as a date-only string (YYYY-MM-DD)
bills_df['status_date'] = pd.to_datetime(
    bills_df['status_date']).dt.strftime('%Y-%m-%d')

# Create placeholders for positive, neutral, and negative sentiment columns
bills_df['positive'] = 0.0
bills_df['neutral'] = 0.0
bills_df['negative'] = 0.0

# Save the modified DataFrame
output_path = 'BillData/bills.csv'
bills_df.to_csv(output_path, index=False)

# Output for troubleshooting
print(f"DataFrame saved to {output_path}")
print(bills_df.head())

DataFrame saved to BillData/bills.csv
   bill_id  status status_date  session_id  state_id  body_id  bill_positive  \
0  1696211       1  2023-02-06        1986        23       55       0.303996   
1  1862762       1  2024-03-11        1986        23       55       0.192843   
2  1862691       1  2024-03-11        1986        23       55       0.036222   
3  1642978       4  2023-03-16        1986        23       55       0.332358   
4  1856477       1  2024-02-28        1986        23       55       0.281935   

   bill_neutral  bill_negative  positive  neutral  negative  
0      0.662853       0.033151       0.0      0.0       0.0  
1      0.763062       0.044095       0.0      0.0       0.0  
2      0.811128       0.152649       0.0      0.0       0.0  
3      0.646202       0.021440       0.0      0.0       0.0  
4      0.692732       0.025333       0.0      0.0       0.0  


**Cell 8**:
- This cell converts the 'status_date' column in the DataFrame to Unix timestamp format, saves the updated DataFrame to a CSV file, and outputs the DataFrame and its data types for verification.

In [23]:
import pandas as pd

# Load the dataset
bills_df = pd.read_csv('BillData/bills.csv')

# Convert 'status_date' to datetime and then to int64
bills_df['status_date'] = pd.to_datetime(bills_df['status_date'])
bills_df['status_date'] = bills_df['status_date'].astype(
    'int64') // 10**9  # Convert to Unix timestamp in seconds

# Save the updated DataFrame
bills_df.to_csv('BillData/final_bills.csv', index=False)

# Output for verification
print("DataFrame saved to 'BillData/final_bills.csv'")
print(bills_df.head())
print(bills_df.dtypes)

DataFrame saved to 'BillData/final_bills.csv'
   bill_id  status  status_date  session_id  state_id  body_id  bill_positive  \
0  1696211       1   1675641600        1986        23       55       0.303996   
1  1862762       1   1710115200        1986        23       55       0.192843   
2  1862691       1   1710115200        1986        23       55       0.036222   
3  1642978       4   1678924800        1986        23       55       0.332358   
4  1856477       1   1709078400        1986        23       55       0.281935   

   bill_neutral  bill_negative  positive  neutral  negative  
0      0.662853       0.033151       0.0      0.0       0.0  
1      0.763062       0.044095       0.0      0.0       0.0  
2      0.811128       0.152649       0.0      0.0       0.0  
3      0.646202       0.021440       0.0      0.0       0.0  
4      0.692732       0.025333       0.0      0.0       0.0  
bill_id            int64
status             int64
status_date        int64
session_id         i
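
The conversion in this cell turns each date into seconds since 1970-01-01 UTC. A standard-library sketch of the same mapping, and of the way back, is shown here; the helper names `date_to_unix_seconds` and `unix_seconds_to_date` are hypothetical.

```python
from datetime import datetime, timezone


def date_to_unix_seconds(date_string):
    """Interpret a YYYY-MM-DD date as midnight UTC and return Unix seconds."""
    dt = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def unix_seconds_to_date(seconds):
    """Convert Unix seconds back to a YYYY-MM-DD string in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d")


print(date_to_unix_seconds("2023-02-06"))  # 1675641600, the first row above
print(unix_seconds_to_date(1710115200))    # 2024-03-11
```

Treating the dates as UTC matters: interpreting them in a local time zone would shift the integer values by the zone offset.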

**Cell 9**:
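- This cell displays data types, summary statistics for numerical columns, identifies missing values, and displays the first few rows of the DataFrame to understand the data format and completeness. It also prints the total number of bills in the DataFrame. The recorded output comes from execution count 22, before Cell 8 (execution count 23) converted 'status_date', which is why that column still appears as `object` here.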

In [22]:
# Display data types
print("Data Types:")
print(bills_df.dtypes)
print("\n")

# Display summary statistics for numerical columns
print("Summary Statistics:")
print(bills_df.describe())
print("\n")

# Identify missing values
print("Missing Values:")
missing_values = bills_df.isnull().sum()
print(missing_values)
print("\n")

# Display the first few rows of the dataframe to understand the data format
print("Data Format (first few rows):")
print(bills_df.head())

print(f"Number of bills: {bills_df.shape[0]}")

Data Types:
bill_id            int64
status             int64
status_date       object
session_id         int64
state_id           int64
body_id            int64
bill_positive    float64
bill_neutral     float64
bill_negative    float64
positive         float64
neutral          float64
negative         float64
dtype: object


Summary Statistics:
            bill_id      status   session_id    state_id     body_id  \
count  4.130000e+02  413.000000   413.000000  413.000000  413.000000   
mean   1.770487e+06    2.041162  2049.861985   26.169492   56.786925   
std    6.289655e+04    1.540946    45.679156   15.368621   31.502475   
min    1.636388e+06    1.000000  1986.000000    2.000000    1.000000   
25%    1.714481e+06    1.000000  2016.000000   13.000000   31.000000   
50%    1.780652e+06    1.000000  2033.000000   23.000000   54.000000   
75%    1.820955e+06    3.000000  2111.000000   40.000000   80.000000   
max    1.872604e+06    6.000000  2129.000000   52.000000  116.000000   

   