# Milestone 1: Project Proposal and Data Selection/Preparation

## 1: Preparing for Your Project Proposal

### 1.1: Client/Dataset Selection

Client: Lobbyists4America

Dataset name: Congressional Tweets Dataset (2008–2017)

Source (example): https://www.dropbox.com/sh/qrq1pcjsji0v03u/AAC639WcH58tM0YZperwY388a?dl=0

Why this dataset was selected:

* The dataset aligns directly with Lobbyists4America’s goal of understanding topics, key members, and relationships within the U.S. Congress to inform lobbying strategy.

* Ten years (2008–2017) gives a broad temporal span to detect topic trends, shifts around major political events (e.g., elections, legislation bursts), and persistent influencers.

* Tweets include text, timestamps, user metadata, and mentions/retweets, which support NLP topic modeling, network analysis, and member-level summary statistics.
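
The mention data lends itself to a directed edge list for network analysis. The following is a minimal sketch, assuming each tweet's `entities` field follows the standard Twitter layout with a `user_mentions` list whose items carry a `screen_name`. The sample rows and the `mentioned_names` helper are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical sample rows; the real data would come from the tweets file.
sample = pd.DataFrame({
    "screen_name": ["RepA", "RepB", "RepA"],
    "entities": [
        {"user_mentions": [{"screen_name": "RepB"}, {"screen_name": "SenC"}]},
        {"user_mentions": []},
        {"user_mentions": [{"screen_name": "RepB"}]},
    ],
})


def mentioned_names(entities):
    """Return the screen names mentioned in one tweet's entities dict."""
    if not isinstance(entities, dict):
        return []
    return [m["screen_name"] for m in entities.get("user_mentions", [])]


# One row per (author, mentioned account) pair, with a count as edge weight.
edges = (
    sample.assign(target=sample["entities"].map(mentioned_names))
    .explode("target")
    .dropna(subset=["target"])
    .rename(columns={"screen_name": "source"})
    .groupby(["source", "target"])
    .size()
    .reset_index(name="weight")
)
print(edges)
```

The resulting `source`, `target`, `weight` table could then be handed to a graph library if the real `entities` column matches the assumed layout.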

### 1.2: Data Import & Cleaning

1.2.1: Importing the Data

In [None]:
import json
import os
import pandas as pd

tweets = "data/tweets.json"
users = "data/users.json"

tweets_path = os.path.join(os.getcwd(), tweets)
users_path = os.path.join(os.getcwd(), users)

tweets_data = []
user_data = []

with open(tweets_path, "r", encoding="utf-8") as f:
    for line in f:
        if line.strip():  # skip empty lines
            tweets_data.append(json.loads(line))

with open(users_path, "r", encoding="utf-8") as f:
    for line in f:
        if line.strip():  # skip empty lines
            user_data.append(json.loads(line))

tweets_df = pd.DataFrame(tweets_data)
user_df = pd.DataFrame(user_data)
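
Because the files are read one JSON object per line, `pd.read_json` with `lines=True` may be a shorter alternative to the manual loop. The sketch below writes a tiny stand-in file to a temporary directory; the records and the `sample.json` name are invented for illustration. Note that `read_json` infers dtypes by default, so the resulting frame can differ from one built through `json.loads`.

```python
import json
import os
import tempfile

import pandas as pd

# Write a tiny line-delimited JSON file to stand in for the real data file.
records = [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

    # lines=True tells pandas to parse one JSON object per line.
    sample_df = pd.read_json(path, lines=True)

print(sample_df.shape)
```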

1.2.2: Removing Unwanted Columns

In [None]:
tweets_columns_needed = [
    "created_at",
    "screen_name",
    "user_id",
    "text",
    "lang",
    "retweet_count",
    "favorite_count",
    "entities",
    "in_reply_to_user_id",
    "in_reply_to_screen_name",
    "source",
    "is_quote_status",
    "quoted_status_id"
]

user_columns_needed = [
    "id",
    "id_str",
    "screen_name",
    "name",
    "description",
    "followers_count",
    "friends_count",
    "favourites_count",
    "statuses_count",
    "verified",
    "protected",
    "created_at",
    "location"
]

# .copy() makes independent frames so later edits do not touch the originals
tweets_clean = tweets_df[tweets_columns_needed].copy()
user_clean = user_df[user_columns_needed].copy()

1.2.3: Data Cleaning

In [None]:
# Drop completely empty rows
tweets_clean = tweets_clean.dropna(how="all")
user_clean = user_clean.dropna(how="all")

# Drop rows missing the fields the analysis depends on
tweets_clean = tweets_clean.dropna(subset=["text", "created_at"])
user_clean = user_clean.dropna(subset=["id", "screen_name"])

# Remove duplicates; reassigning avoids modifying a slice in place
tweets_clean = tweets_clean.drop_duplicates(subset=["text", "created_at"])
user_clean = user_clean.drop_duplicates(subset=["id"])
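
Trend analysis over the 2008–2017 span will likely need `created_at` as a real datetime rather than a string. The sketch below assumes the timestamps use the classic Twitter API layout (for example `Wed Oct 10 20:19:24 +0000 2012`); that format is an assumption and should be checked against the actual file. The sample rows are invented.

```python
import pandas as pd

# Hypothetical timestamps in the classic Twitter API layout (an assumption).
sample = pd.DataFrame({
    "created_at": [
        "Wed Oct 10 20:19:24 +0000 2012",
        "Mon Jan 05 09:00:00 +0000 2015",
        "not a date",
    ],
    "text": ["a", "b", "c"],
})

# errors="coerce" turns unparseable values into NaT instead of raising.
sample["created_at"] = pd.to_datetime(
    sample["created_at"],
    format="%a %b %d %H:%M:%S %z %Y",
    errors="coerce",
    utc=True,
)
sample = sample.dropna(subset=["created_at"])

# Count tweets per calendar year as a first look at the temporal spread.
tweets_per_year = sample["created_at"].dt.year.value_counts().sort_index()
print(tweets_per_year)
```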

### 1.3: Initial Exploration of Data

In [None]:
# Number of rows and columns
print("Tweets dataset shape:", tweets_clean.shape)
print("User dataset shape:", user_clean.shape)

# Column names
print("\nTweets columns:", tweets_clean.columns.tolist())
print("User columns:", user_clean.columns.tolist())

# Data types and nulls
print("\nTweets info:")
tweets_clean.info()
print("\nUser info:")
user_clean.info()


In [None]:
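# Numeric summary statistics; print both, since a notebook cell only
# renders its last bare expression
print(tweets_clean.describe())
print(user_clean.describe())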


In [None]:
import matplotlib.pyplot as plt

# Top 10 users by followers
top_users = user_clean.nlargest(10, "followers_count")[["screen_name", "followers_count"]]
top_users.plot.bar(x="screen_name", y="followers_count", legend=False)
plt.title("Top 10 users by followers")
plt.ylabel("Followers count")
plt.show()


### 1.4: Proposed ERD

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Load the image
img = mpimg.imread('./data/ERD.png')

# Display the image
plt.imshow(img)
plt.axis('off')  # Hide axes
plt.show()
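
A relationship between the two tables can be sanity-checked in pandas before it is committed to a schema. The sketch below assumes the link is `tweets.user_id` to `users.id`, which matches the selected column names but is not confirmed by the ERD image itself. The miniature tables are hypothetical.

```python
import pandas as pd

# Hypothetical miniature tables mirroring the column names selected earlier.
users = pd.DataFrame({"id": [1, 2], "screen_name": ["RepA", "RepB"]})
tweets = pd.DataFrame({"user_id": [1, 1, 3], "text": ["x", "y", "z"]})

# A left merge with indicator=True flags tweets with no matching user row.
joined = tweets.merge(
    users, how="left", left_on="user_id", right_on="id", indicator=True
)
orphans = joined[joined["_merge"] == "left_only"]
print(len(orphans), "tweet(s) without a matching user")
```

Any orphan rows would point to either missing user records or a mismatch in ID types (for example integer versus string IDs).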

### 1.5: Saving the Data to CSV


In [None]:
tweets = "data/tweets.csv"
users = "data/users.csv"

tweets_path = os.path.join(os.getcwd(), tweets)
users_path = os.path.join(os.getcwd(), users)

tweets_clean.to_csv(tweets_path, index=False)
user_clean.to_csv(users_path, index=False)
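
One caveat when saving to CSV: a column holding nested dicts, such as `entities`, is written as its Python string representation, which `json.loads` cannot parse back. A possible workaround, sketched here on an invented row and an in-memory buffer rather than the real files, is to serialize that column with `json.dumps` before saving.

```python
import io
import json

import pandas as pd

# Hypothetical row with a nested column similar to the tweets' entities field.
sample = pd.DataFrame({
    "text": ["hello"],
    "entities": [{"hashtags": [], "user_mentions": [{"screen_name": "RepB"}]}],
})

# Serialize the nested column to JSON text so it can be parsed back later.
to_save = sample.assign(entities=sample["entities"].map(json.dumps))

buffer = io.StringIO()
to_save.to_csv(buffer, index=False)
buffer.seek(0)

# Reading back: the column arrives as text and is decoded with json.loads.
restored = pd.read_csv(buffer)
restored["entities"] = restored["entities"].map(json.loads)
print(restored.loc[0, "entities"]["user_mentions"][0]["screen_name"])
```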