# Initial Data Exploration

Welcome to the initial phase of data exploration for this project. This notebook examines the data extracted so far. The primary aim is not to discover patterns, trends, or correlations, but to understand the data's characteristics, structure, and quality.

### Objectives:

In this stage, the focus is on the following areas:

<!-- 1. **Understanding and Structuring Data**:
   - What types of data have we collected, and how comprehensive and representative is it in terms of the project's goals?
   - How is the data currently formatted, and are there inconsistencies in the structure?
   - Do the formats of dates, text, and other fields align with the requirements for future analysis? -->

1. **Data Cleaning and Preparation**:
   - What potential issues exist in the data, such as missing values, duplicates, or outliers?
   - What steps are necessary to clean the data and streamline the process for efficiency?
   - What transformations or preprocessing steps are required to prepare the data for analysis, ensuring it is in the right shape and format?
   - What additional data might be needed, and how can we incorporate it into our existing dataset?

2. **Feature Engineering**:
   - What new features or variables can be derived from the existing data to enhance the analysis?
   - How can raw data be transformed into features that better represent underlying patterns or trends?
   - What techniques can be applied to extract meaningful insights?

<!-- The insights and knowledge gained from this initial exploration will be pivotal in shaping the analytical functions that we develop in the next phases of the project. Understanding the data at a granular level will help us identify the best methods for cleaning, transforming, and analyzing the data.

Furthermore, this exploration will serve as the foundation for designing the framework of the Reddit Politics dashboard. By understanding the data thoroughly, we can create a dashboard that is not only informative but also capable of providing valuable insights and visualizations to end users.

In summary, this notebook represents the groundwork for the entire project, setting the stage for the development of effective analytical tools and a compelling, data-driven dashboard. The careful examination and preparation of the data at this stage are crucial steps toward ensuring the success of the project. -->

The insights gained from this exploration will inform the development of functions for data analysis and guide the overall framework for creating the Reddit Politics dashboard.
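
As a rough illustration of the feature-engineering objective above, the sketch below derives a few simple features from the two columns this notebook works with (`body` and `date`). The sample rows, the helper `add_basic_features`, and the feature names (`char_count`, `word_count`, `day_of_week`) are assumptions made for the example, not part of the project.

```python
import pandas as pd

# Hypothetical sample rows with the same two columns as the scraped tables.
sample = pd.DataFrame({
    "body": ["Samwise the Brave!", "The real hero of the story."],
    "date": ["2024-08-21", "2024-08-22"],
})


def add_basic_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame with simple length and calendar features."""
    out = frame.copy()
    out["date"] = pd.to_datetime(out["date"])
    out["char_count"] = out["body"].str.len()               # characters per comment
    out["word_count"] = out["body"].str.split().str.len()   # whitespace-separated tokens
    out["day_of_week"] = out["date"].dt.day_name()          # e.g. "Wednesday"
    return out


features = add_basic_features(sample)
print(features[["char_count", "word_count", "day_of_week"]])
```

Working on a copy keeps the raw columns untouched, so the derived features can be recomputed or dropped without re-reading the data.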



In [1]:
import pandas as pd
from functions import reddit_scrape


db_file_path = '/Users/muhammadmuhdhar/Desktop/Repo/Reddit Politics/data.db'

subreddits = [
    'politics',
    'democrats',
    'Republican'
]

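# Scrape each subreddit in turn. data_pull is overwritten on every iteration,
# so only the last return value is kept; the cells below read from the database.
for subreddit in subreddits:
    data_pull = reddit_scrape(subreddit)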


Failed to retrieve page. Status code: 429
https://old.reddit.com//r/Republican/comments/1ewpx0q/james_lindsay_how_i_left_the_leftwing_cult/.json
Failed to retrieve page. Status code: 429
https://old.reddit.com//r/Republican/comments/1ezl1ny/hilariously_calling_out_the_gaslighting/.json
Failed to retrieve page. Status code: 429
https://old.reddit.com//r/Republican/comments/1eywxlb/theres_finally_a_kamala_harris_policy_website/.json
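
Status code 429 means the server is rate limiting the requests. One common way to handle this is to retry with exponential backoff. The sketch below shows the idea in isolation: `fetch_with_backoff` and the `fetch` callable are hypothetical names, and nothing here reflects how the project's `reddit_scrape` helper is actually implemented.

```python
import time
from typing import Callable, Optional, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, Optional[str]]],
    max_retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() until it stops returning 429 or the retries run out.

    fetch is a hypothetical callable returning (status_code, body).
    """
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status == 200:
            return body
        if status != 429 or attempt == max_retries:
            return None  # non-retryable status, or out of attempts
        sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    return None


# Simulated endpoint: rate limited twice, then succeeds.
responses = iter([(429, None), (429, None), (200, '{"ok": true}')])
delays = []  # record the waits instead of actually sleeping
result = fetch_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test; in real use the default `time.sleep` would apply.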


In [79]:
import sqlite3

db_file_path = '/Users/muhammadmuhdhar/Desktop/Repo/Reddit Politics/data.db'
conn = sqlite3.connect(db_file_path)

subreddits = [
    'politics',
    'democrats',
    'Republican'
]

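# Load each subreddit's table into its own DataFrame, keyed by subreddit name.
# The table names come from the fixed list above, not from user input.
df = {}
for subreddit in subreddits:
    df[subreddit] = pd.read_sql(f'SELECT * FROM {subreddit}', conn)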
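
If a scrape fails partway (for example because of rate limiting), a table may be missing from the database and the read will raise an error. A minimal sketch of checking which tables exist before loading them, using an in-memory database as a stand-in for `data.db`; the names `demo_conn`, `wanted`, and `frames` are illustrative only.

```python
import sqlite3

import pandas as pd

# In-memory database standing in for data.db (illustrative only).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE politics (body TEXT, date TEXT)")
demo_conn.execute("INSERT INTO politics VALUES ('hello', '2024-08-21')")

wanted = ["politics", "democrats"]

# sqlite_master lists every table stored in the database.
existing = {
    row[0]
    for row in demo_conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )
}

frames = {}
for name in wanted:
    if name not in existing:
        print(f"skipping {name}: no such table")
        continue
    # The name was validated against sqlite_master, so quoting it is safe.
    frames[name] = pd.read_sql(f'SELECT * FROM "{name}"', demo_conn)

demo_conn.close()
```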


In [80]:
for subreddit in subreddits:
    print(f'description for r/{subreddit}:')
    print(df[subreddit].describe())
    print("\n")

description for r/politics:
             body        date
count       49478       49478
unique      43666           7
top     [deleted]  2024-08-23
freq          278        8521


description for r/democrats:
                                                     body        date
count                                               20540       20540
unique                                              15613           8
top     **Join:**\n\n* /r/KamalaHarris\n\n* /r/TimWalz...  2024-08-21
freq                                                  267        3851


description for r/Republican:
             body        date
count        5728        5728
unique       3700           8
top     [removed]  2024-08-22
freq          184        1114
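
The most frequent `body` values for r/politics and r/Republican are `[deleted]` and `[removed]`, which are placeholders rather than comment text. A minimal sketch of filtering them out on sample data, assuming these two strings are the only placeholders present (the real data may contain other forms):

```python
import pandas as pd

PLACEHOLDERS = {"[deleted]", "[removed]"}  # assumed set; extend as needed

sample = pd.DataFrame({
    "body": ["[deleted]", "A real comment", " [removed] ", "Another one"],
    "date": ["2024-08-23"] * 4,
})

# Strip surrounding whitespace before comparing so padded placeholders match.
is_placeholder = sample["body"].str.strip().isin(PLACEHOLDERS)
cleaned = sample[~is_placeholder].reset_index(drop=True)
print(cleaned)
```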




In [81]:
for subreddit in subreddits:
    df[subreddit] = df[subreddit].drop_duplicates(subset=['body'], keep='first')
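
Dropping duplicates on `body` alone also collapses short comments that different users legitimately repeated. Before dropping, it can help to look at which bodies repeat most often. A sketch on hypothetical sample data:

```python
import pandas as pd

sample = pd.DataFrame({
    "body": ["Yes", "Yes", "[removed]", "[removed]", "[removed]", "Unique text"],
    "date": ["2024-08-22"] * 6,
})

# Count occurrences of each body and keep only those seen more than once.
counts = sample["body"].value_counts()
repeated = counts[counts > 1]
print(repeated)

# Same call as in the notebook, applied to the sample.
deduped = sample.drop_duplicates(subset=["body"], keep="first")
print(len(sample) - len(deduped), "rows removed")
```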

In [82]:
for subreddit in subreddits:
    print(f'description for r/{subreddit}:')
    print(df[subreddit].describe())
    print("\n")

description for r/politics:
                                                     body        date
count                                               43666       43666
unique                                              43666           7
top     \nAs a reminder, this subreddit [is for civil ...  2024-08-22
freq                                                    1        7375


description for r/democrats:
                                                     body        date
count                                               15613       15613
unique                                              15613           8
top     **Join:**\n\n* /r/KamalaHarris\n\n* /r/TimWalz...  2024-08-22
freq                                                    1        2720


description for r/Republican:
                                                 body        date
count                                            3700        3700
unique                                           3700           8
top     I t

In [83]:
for subreddit in subreddits:
    print(f'null values for r/{subreddit}:')
    print(df[subreddit].isnull().sum())
    print("\n")

null values for r/politics:
body    0
date    0
dtype: int64


null values for r/democrats:
body    0
date    0
dtype: int64


null values for r/Republican:
body    0
date    0
dtype: int64
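
A zero null count does not rule out empty or whitespace-only strings, which `isnull()` does not flag. A small sketch of counting both on sample data:

```python
import pandas as pd

sample = pd.DataFrame({
    "body": ["A comment", "", "   ", None],
    "date": ["2024-08-21"] * 4,
})

null_count = sample["body"].isnull().sum()
# fillna("") lets the string methods run on the missing row as well;
# subtracting null_count leaves only the non-null blank strings.
blank_count = (sample["body"].fillna("").str.strip() == "").sum() - null_count
print(f"null: {null_count}, blank but not null: {blank_count}")
```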




In [84]:
for subreddit, data in df.items():
    print(f"Data types for r/{subreddit}:")
    print(data.dtypes)
    print("\n")

Data types for r/politics:
body    object
date    object
dtype: object


Data types for r/democrats:
body    object
date    object
dtype: object


Data types for r/Republican:
body    object
date    object
dtype: object




In [85]:
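# Parse the date strings into datetime64 values so they can be sorted and grouped.
for subreddit in subreddits:
    df[subreddit] = df[subreddit].assign(date=pd.to_datetime(df[subreddit]['date']))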
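
The dates shown in the outputs above follow a `YYYY-MM-DD` pattern. Assuming that holds for every row, passing an explicit format and `errors="coerce"` makes malformed values visible instead of raising midway. A sketch on a sample series:

```python
import pandas as pd

raw = pd.Series(["2024-08-21", "2024-08-22", "not a date"])

# An explicit format avoids per-value format inference; errors="coerce"
# turns unparseable values into NaT instead of raising an exception.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
print(parsed.isna().sum(), "value(s) could not be parsed")
```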

In [86]:
for subreddit in subreddits:
    print(f"r/{subreddit} head:")
    print(df[subreddit].head())
    print("\n")

r/politics head:
                                                body       date
0  \nAs a reminder, this subreddit [is for civil ... 2024-08-21
1  Donald Trump and "perfect" criminal calls, nam... 2024-08-21
2  Yeah, this is like the third Perfect Phone Cal... 2024-08-21
3                           Donald Trump and Epstein 2024-08-21
4                             DJT and sexual assault 2024-08-21


r/democrats head:
                                                body       date
0  **Join:**\n\n* /r/KamalaHarris\n\n* /r/TimWalz... 2024-08-21
1                                 Samwise the Brave! 2024-08-21
2                        The real hero of the story. 2024-08-21
3  People say that, but Frodo made the journey en... 2024-08-21
4  Yeah, people that try to say “Sam is the real ... 2024-08-21


r/Republican head:
                                                body       date
0      I thought he was independent as of last year. 2024-08-24
1  He ran as an independent because dems would