
## Step 1: Verify Link Accessibility
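We first check that the dataset URL is reachable. `curl -I` requests only the response headers, so a `200` status confirms the file is available without downloading it.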

In [None]:

# Check link accessibility
!curl -I https://www2.census.gov/programs-surveys/popest/datasets/2020/state/asrh/sc-est2020-alldata6.csv
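
The same check can be sketched in Python, which is handy when the reachability test has to gate later notebook cells. This is a minimal sketch using only the standard library; the helper names `build_head_request` and `is_reachable` are hypothetical and not part of the original notebook.

```python
import urllib.request

# The dataset URL used in the curl command above.
CENSUS_URL = (
    "https://www2.census.gov/programs-surveys/popest/datasets/"
    "2020/state/asrh/sc-est2020-alldata6.csv"
)


def build_head_request(url: str) -> urllib.request.Request:
    """Build a HEAD request, which asks for headers only (like `curl -I`)."""
    return urllib.request.Request(url, method="HEAD")


def is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if the server answers the HEAD request with a 2xx status."""
    try:
        with urllib.request.urlopen(build_head_request(url), timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError and HTTPError both derive from OSError.
        return False
```

Calling `is_reachable(CENSUS_URL)` needs network access; a `False` result can mean either a network problem or a non-2xx response, so the curl output remains the better tool for diagnosis.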


## Step 2: Download the Dataset
We use `wget` to download the dataset (CSV format).

In [None]:

# Download dataset from Census website
!wget https://www2.census.gov/programs-surveys/popest/datasets/2020/state/asrh/sc-est2020-alldata6.csv

# View first 5 lines of the dataset
!head -5 sc-est2020-alldata6.csv


## Step 3: Upload Dataset to HDFS
We now copy the dataset into the Hadoop Distributed File System (HDFS). `hdfs dfs -put` leaves the local file in place.

In [None]:

# Create directory in HDFS
!hdfs dfs -mkdir -p /user/hadoop/census

# Upload dataset into HDFS
!hdfs dfs -put sc-est2020-alldata6.csv /user/hadoop/census/

# Verify upload
!hdfs dfs -ls /user/hadoop/census/


## Step 4: Create Hive Database
We switch into Hive and create a database called `census_db`.

In [None]:

-- Open Hive CLI in your VM, then run:
CREATE DATABASE IF NOT EXISTS census_db;
USE census_db;


## Step 5: Create Hive Table
Based on the header row of the CSV, we design the schema.

In [None]:

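-- Hive maps delimited-text columns by position, not by name, so this column
-- list must match the field order of the CSV header (check with `head -1`).
-- skip.header.line.count keeps the header line from being read as a data row.
CREATE TABLE IF NOT EXISTS census_data (
    SUMLEV STRING,
    REGION STRING,
    DIVISION STRING,
    STATE STRING,
    SEX STRING,
    AGE INT,
    POPESTIMATE INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count'='1');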
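
Because Hive reads delimited text by position, one way to keep the schema aligned with the file is to generate the DDL from the header row itself. The sketch below is an illustration, not the notebook's method: `hive_ddl_from_header` is a hypothetical helper, every column defaults to `STRING` unless listed in `int_columns`, and the header string in the usage lines is a shortened made-up example rather than the real file's header.

```python
import csv
import io


def hive_ddl_from_header(header_line, table, int_columns=()):
    """Build a CREATE TABLE statement with one column per CSV header field.

    Columns named in int_columns become INT; everything else is STRING.
    """
    names = next(csv.reader(io.StringIO(header_line)))
    ints = {c.upper() for c in int_columns}
    cols = []
    for raw in names:
        name = raw.strip()
        hive_type = "INT" if name.upper() in ints else "STRING"
        cols.append(f"    `{name}` {hive_type}")
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        + ",\n".join(cols)
        + "\n)\n"
        "ROW FORMAT DELIMITED\n"
        "FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        "TBLPROPERTIES ('skip.header.line.count'='1');"
    )


# Shortened, made-up header for illustration only.
print(hive_ddl_from_header("SUMLEV,STATE,AGE", "census_data", int_columns=["AGE"]))
```

In practice the header line would come from the first line of the downloaded CSV, and the generated statement would be reviewed before running it in Hive.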


## Step 6: Load Data into Hive
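We load the CSV file from HDFS into the Hive table. Note that `LOAD DATA INPATH` moves the file into the table's warehouse directory rather than copying it, so the file no longer appears under `/user/hadoop/census/` afterwards.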

In [None]:

LOAD DATA INPATH '/user/hadoop/census/sc-est2020-alldata6.csv'
INTO TABLE census_data;


## Step 7: Validate Data
We run simple queries to ensure data has been ingested correctly.

In [None]:

SELECT * FROM census_data LIMIT 10;

-- Example analysis: Population by State
SELECT STATE, SUM(POPESTIMATE) as total_population
FROM census_data
GROUP BY STATE
ORDER BY total_population DESC
LIMIT 5;
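
To cross-check the Hive aggregate, the same group-and-sum can be computed directly from the CSV text in Python. This is a rough sketch under assumptions: the `top_states` helper is hypothetical, the column names `STATE` and `POPESTIMATE` are taken from the Hive schema above and may not match the real file's header, and the sample rows are made up.

```python
import csv
import io
from collections import Counter


def top_states(csv_text, n=5, state_col="STATE", pop_col="POPESTIMATE"):
    """Sum pop_col per state_col and return the n largest totals, descending."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[state_col]] += int(row[pop_col])
    return totals.most_common(n)


# Made-up sample rows for illustration only; not real Census figures.
sample = """STATE,POPESTIMATE
01,100
02,50
01,25
03,75
"""
print(top_states(sample, n=2))  # [('01', 125), ('03', 75)]
```

If the totals from this function and from the Hive query disagree, the usual suspects are a header row loaded as data or a column list that does not line up with the file.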


## Step 8 (Optional): Automate with a Script
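We can automate the download, HDFS upload, and Hive load in a single shell script. The script assumes the `census_db` database and `census_data` table from Steps 4 and 5 already exist.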

In [None]:

%%bash
cat > census_pipeline.sh <<'EOF'
#!/bin/bash
# Stop at the first failed command, unset variable, or failed pipe stage
set -euo pipefail

URL="https://www2.census.gov/programs-surveys/popest/datasets/2020/state/asrh/sc-est2020-alldata6.csv"
LOCAL_FILE="census.csv"
HDFS_DIR="/user/hadoop/census"

# Download dataset
wget -O "$LOCAL_FILE" "$URL"

# Upload to HDFS (-f overwrites an existing copy)
hdfs dfs -mkdir -p "$HDFS_DIR"
hdfs dfs -put -f "$LOCAL_FILE" "$HDFS_DIR/"

# Load into Hive, replacing the table's current contents
hive -e "USE census_db; LOAD DATA INPATH '${HDFS_DIR}/${LOCAL_FILE}' OVERWRITE INTO TABLE census_data;"
EOF


In [None]:

# Run the pipeline script
!bash census_pipeline.sh
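
For readers who prefer to drive the pipeline from Python, the same three stages can be sketched as a list of commands with a dry-run switch. This is an illustrative sketch, not part of the original notebook: `build_pipeline` and `run_pipeline` are hypothetical names, `local_file` is assumed to be a plain file name with no directory part, and the URL in the usage lines is a placeholder.

```python
import shlex
import subprocess


def build_pipeline(url, local_file, hdfs_dir,
                   database="census_db", table="census_data"):
    """Return the pipeline as a list of argv lists, one per step."""
    base = hdfs_dir.rstrip("/")
    hql = (
        f"USE {database}; "
        f"LOAD DATA INPATH '{base}/{local_file}' OVERWRITE INTO TABLE {table};"
    )
    return [
        ["wget", "-O", local_file, url],                 # download
        ["hdfs", "dfs", "-mkdir", "-p", base],           # ensure target dir
        ["hdfs", "dfs", "-put", "-f", local_file, base + "/"],  # upload
        ["hive", "-e", hql],                             # load into Hive
    ]


def run_pipeline(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in steps:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing step


# Placeholder URL; a dry run prints the commands without executing them.
steps = build_pipeline("https://example.com/data.csv", "census.csv",
                       "/user/hadoop/census")
run_pipeline(steps, dry_run=True)
```

Passing `dry_run=False` executes the commands, which requires `wget`, `hdfs`, and `hive` on the `PATH`; `check=True` mirrors the fail-fast behavior a shell script gets from `set -e`.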


## ✅ Conclusion
- Fetched U.S. Census data from the web.
- Stored the dataset in **HDFS**.
- Created a **Hive table** and loaded the data.
- Verified the data with queries.
- Built an **automated pipeline** script for repeat runs.

In [None]:

import matplotlib.pyplot as plt

# Hardcoded sample figures for five states (for illustration; not queried from the Hive table)
states = ["CA", "TX", "FL", "NY", "PA"]
population = [39538223, 29145505, 21538187, 20201249, 13002700]

# Create Power BI style bar chart
plt.figure(figsize=(8, 5))
bars = plt.bar(states, population, color="#1f77b4")
plt.title("Top 5 US States by Population", fontsize=14, fontweight="bold")
plt.xlabel("State")
plt.ylabel("Population")
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Add value labels on top of bars
for bar in bars:
    yval = bar.get_height()
    # int() keeps the label free of a trailing ".0" if the height is a float
    plt.text(bar.get_x() + bar.get_width()/2, yval + 50000, f"{int(yval):,}",
             ha="center", va="bottom", fontsize=10)

# Save chart as PNG
plt.tight_layout()
plt.savefig("powerbi_sample.png")
plt.show()



### 📊 Power BI Dashboard (Sample)

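Below is a sample visualization of **Top 5 US States by Population**. It is a Power BI-style bar chart rendered with matplotlib from the hardcoded sample figures in the previous cell, not an export from Power BI or a result of the Hive queries.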

![Power BI Sample](powerbi_sample.png)
