# Notebook with Code to Scrape Data off the Census of India [Website](https://censusindia.gov.in/census.website/data/census-tables)
Our original aim was to understand the [Gender Polyconflict](https://messenger.substack.com/p/thinking-in-public-the-case-for-gender) through whatever relevant data was available.

While systematically sifting through the Census of India website, we documented our effort in a spreadsheet, which we call the [Metadata](https://github.com/SocratusCollective/gender-polyconflict/blob/main/Metadata_2011.xlsx) sheet. We have included enough detail in the sheet that we hope it is self-explanatory.

Once done, we figured we could use it to actually download all of the data we needed. This notebook contains the code used to do just that.

We use Google Drive to store any files needed for our analysis, and we store data and metadata as Google Sheets. If you prefer not to use Google, you could use local storage and/or another cloud service, along with different spreadsheet software.

## If you intend to use our Metadata sheet and the code in this notebook as is
You will need access to Google Drive, Colab and Sheets, and should first do the following:

- Download the Metadata sheet as an Excel file.
- Upload it to Google Drive as a Google Sheet (not just an Excel file).
- Note down the Google Sheet's ID (see how-to instructions [here](https://stackoverflow.com/questions/36061433/how-do-i-locate-a-google-spreadsheet-id)).
- Create a folder in which to dump data downloaded from the Census of India website, and note its path.
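
The Sheet ID mentioned in the steps above is the long token that follows `/d/` in the sheet's URL. Below is a minimal sketch of pulling it out programmatically, assuming the usual `https://docs.google.com/spreadsheets/d/<ID>/edit` URL shape. The helper name `sheet_id_from_url` and the example URL are hypothetical and are not part of this notebook.

```python
import re

def sheet_id_from_url(url):
    """Return the spreadsheet ID embedded in a Google Sheets URL, or None."""
    # The ID is the run of letters, digits, "-" and "_" right after "/spreadsheets/d/"
    match = re.search(r"/spreadsheets/d/([a-zA-Z0-9_-]+)", url)
    return match.group(1) if match else None

# Hypothetical URL, for illustration only
example_url = "https://docs.google.com/spreadsheets/d/abc123_XYZ-789/edit#gid=0"
print(sheet_id_from_url(example_url))
```

The returned string is what goes into `sheet_id` further down.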

In [None]:
# Install & import required libraries
!pip install --quiet gspread
from google.colab import auth, drive
import gspread
from google.auth import default
import requests
import os
from googleapiclient.discovery import build

## Google Drive Ops

The code cell below does the following:
- Authenticates Colab to enable access to files on Google Drive
- Creates an instance of a Google Sheets `client` which enables Colab to specifically access Google Sheets within Google Drive
- Opens the Metadata sheet.

In [None]:
# Authenticate
auth.authenticate_user()
creds, _ = default()

# Create the Google Sheets client
client = gspread.authorize(creds)
service = build('sheets', 'v4', credentials=creds)

# Open the Metadata Sheet
sheet_id = \
  "1E3GAnfUaiAhIlU-F2v-9EGlXFymngRmQ-K1pMsawX9s" # Google Sheet ID of your Metadata sheet
sheet = client.open_by_key(sheet_id)

# Mount Google Drive for Colab to access it
drive.mount('/content/drive')

Mounted at /content/drive


## Check if Required Tabs Exist
- B-02, B-05, B-08, B-17, C-03A, C-23, HH-04 City data is already available.
- All other tables are downloaded programmatically using the links in the Metadata sheet.
- The cell below is a quick check that the tabs are named as expected.

In [None]:
# Get the list of sheet (tab) names
tab_names = [worksheet.title for worksheet in sheet.worksheets()]

# Tabs Needed
total_plus_caste_granular_tabs = [
    'D-02', 'F-01', 'F-05', 'F-09', 'B-01', 'B-03', 'C-02', 'C-20',
    'B-07'
]
caste_granular_tabs = ['HL-14-SC-ST', 'B-04-SC-ST', 'B-06-SC-ST', 'HH-01-SC-ST']
tehsil_granular_tab = 'HL-14-Total'
no_caste_granularity_tabs = [
    'HH-02', 'B-16', 'C-03', 'D-04', 'D-05', 'F-02', 'F-03', 'F-04', 'F-06',
    'F-07', 'F-08', 'F-10', 'F-11', 'F-12', 'H-01', 'B-28', 'B-09',
    'B-04-Total', 'B-06-Total', 'HH-01-Total'
]
tabs_to_check = total_plus_caste_granular_tabs + caste_granular_tabs + [tehsil_granular_tab] + no_caste_granularity_tabs

# Check if each tab exists
missing_tabs = [tab for tab in tabs_to_check if tab not in tab_names]

# Print the results
if missing_tabs:
    print("Missing tabs:")
    for tab in missing_tabs:
        print(f"- {tab}")
else:
    print("All tabs exist.")

All tabs exist.


## Util Function: Download File from URL into a Specified Google Drive Folder

In [None]:
def download_and_save_to_drive(url, drive_folder_path, file_name):
    """
    Downloads a file from a URL and saves it to a specified location in Google
    Drive, with the specified name.

    Args:
        url (str): The URL of the file to download.
        drive_folder_path (str): The folder path in Google Drive where the file will be saved.
        file_name (str): The name to give the downloaded file (without extension).

    Returns:
        str or None: Full path of the saved file in Google Drive, or None if
        the download failed.
    """

    # Download the file
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to download the file. HTTP Status Code: {response.status_code}")
        return None

    # Extract file extension from the URL
    _, extension = os.path.splitext(url)
    if not extension:
        print("Unable to determine file extension from URL. Saving without extension.")

    # Full file path with extension
    file_path = os.path.join(drive_folder_path, f"{file_name}{extension}")

    with open(file_path, 'wb') as f:
        f.write(response.content)
    return file_path
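
The download function takes the file extension from the raw URL string. If a link ever carries a query string or fragment, that text would end up in the extension. The sketch below shows one way to guard against this by parsing the URL first. The helper name `extension_from_url` and the example URL are hypothetical, and whether the Census links need this is an assumption we have not checked.

```python
import os
from urllib.parse import urlparse

def extension_from_url(url):
    """Return the file extension (with leading dot) from a URL's path, or ''."""
    path = urlparse(url).path  # keeps only the path, dropping ?query and #fragment
    _, extension = os.path.splitext(path)
    return extension

# Hypothetical URL, for illustration only
print(extension_from_url("https://example.org/files/table.xlsx?download=1"))
```

Calling `os.path.splitext` directly on the same URL would instead give `.xlsx?download=1` as the extension.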

## Util Function: Extract hyperlink from a Google Sheet Cell

In [None]:
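def extract_link(sheet_name, col_alpha, row, sheet_id):
    """
    Returns the hyperlink attached to a single cell of a Google Sheet, or None
    if the cell has no hyperlink.

    Args:
        sheet_name (str): Name of the tab containing the cell.
        col_alpha (str): Column letter of the cell, e.g. "B".
        row (int): Row number of the cell (1-indexed).
        sheet_id (str): ID of the Google Sheet.
    """
    # Get cell data with metadata; the tab name is quoted because names with
    # hyphens or spaces may otherwise be misread in A1 notation
    range_ = f"'{sheet_name}'!{col_alpha}{row}"
    result = service.spreadsheets().get(
        spreadsheetId=sheet_id,
        ranges=range_,
        fields="sheets(data(rowData(values(hyperlink))))"
    ).execute()

    # Extract hyperlink
    sheets_data = result.get('sheets', [])
    cell_data = sheets_data[0]['data'][0]['rowData'][0]['values'][0]
    return cell_data.get('hyperlink', None)  # Return the hyperlink if present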
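
`extract_link` indexes straight into the nested response, so a cell with no data would raise an error if the API leaves out some of those keys. The sketch below shows a more defensive way to read the same structure. The helper name `hyperlink_from_response` and both sample responses are hypothetical. They only mimic the shape that `extract_link` expects, and the assumption that keys can be missing for empty cells is ours.

```python
def hyperlink_from_response(result):
    """Return the first cell's hyperlink from a spreadsheets().get() result, or None."""
    try:
        cell = result["sheets"][0]["data"][0]["rowData"][0]["values"][0]
    except (KeyError, IndexError):
        return None  # empty cell or unexpected response shape
    return cell.get("hyperlink")

# Hypothetical responses, shaped like the one extract_link indexes into
linked = {"sheets": [{"data": [{"rowData": [{"values": [
    {"hyperlink": "https://example.org/t.xlsx"}]}]}]}]}
empty = {"sheets": [{"data": [{}]}]}

print(hyperlink_from_response(linked))
print(hyperlink_from_response(empty))
```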

In [None]:
total_plus_caste_granular_tabs_num_rows = [37]*len(total_plus_caste_granular_tabs)
caste_granular_tabs_num_rows = [37]*len(caste_granular_tabs)
tehsil_granular_tab_num_rows = 642
no_caste_granularity_tabs_num_rows = [35] + [37]*(len(no_caste_granularity_tabs)-2)+ [36]

In [None]:
drive_folder_path = \
  "/content/drive/MyDrive/Socratus/Census Data/2011 Data" # folder to dump
  # scraped data - change this as needed; don't forget to retain the
  # "/content/drive" - that is where your Google Drive is mounted on the
  # Colab environment

In [None]:
def cell_routine(tab_name, col_alpha, row, file_name):
  link = extract_link(tab_name, col_alpha, row, sheet_id)
  if link is None:
    print(f"No hyperlink found in {tab_name}!{col_alpha}{row}. Skipping.")
    return
  download_and_save_to_drive(link, drive_folder_path, file_name)

In [None]:
def get_data(tab_name_list, num_rows_list,
             table_type="Total Plus Caste Granular", start_from_row = 4):
    for tab_idx, tab_name in enumerate(tab_name_list):
      tab = sheet.worksheet(tab_name)
      num_rows = num_rows_list[tab_idx]
      for row_num in range(start_from_row, num_rows + 3):
        if table_type == "Total Plus Caste Granular":
            location = tab.cell(row_num, 1).value.replace(" ", "_")
            total_cell = tab.cell(row_num, 2)
            if total_cell.value != "N/A":
              name = f"{tab_name}_{location}_total"
              cell_routine(tab_name, "B", row_num, name)
            sc_cell = tab.cell(row_num, 3)
            if sc_cell.value != "N/A":
              name = f"{tab_name}_{location}_sc"
              cell_routine(tab_name, "C", row_num, name)
            st_cell = tab.cell(row_num, 4)
            if st_cell.value != "N/A":
              name = f"{tab_name}_{location}_st"
              cell_routine(tab_name, "D", row_num, name)
        elif table_type == "Caste Granular":
            location = tab.cell(row_num, 1).value.replace(" ", "_")
            sc_cell = tab.cell(row_num, 2)
            if sc_cell.value != "N/A":
              name = f"{tab_name}_{location}_sc"
              cell_routine(tab_name, "B", row_num, name)
            st_cell = tab.cell(row_num, 3)
            if st_cell.value != "N/A":
              name = f"{tab_name}_{location}_st"
              cell_routine(tab_name, "C", row_num, name)
        elif table_type == "Tehsil Granular":
            state = tab.cell(row_num, 1).value.replace(" ", "_")
            district = tab.cell(row_num, 2).value.replace(" ", "_")
            total_cell = tab.cell(row_num, 3)
            if total_cell.value != "N/A":
              name = f"{tab_name}_{state}_{district}_total"
              cell_routine(tab_name, "C", row_num, name)
        else:
            location = tab.cell(row_num, 1).value.replace(" ", "_")
            total_cell = tab.cell(row_num, 2)
            if total_cell.value != "N/A":
              name = f"{tab_name}_{location}_total"
              cell_routine(tab_name, "B", row_num, name)
      print(f"Tab {tab_name} done")

In [None]:
get_data(total_plus_caste_granular_tabs, total_plus_caste_granular_tabs_num_rows)
get_data(caste_granular_tabs, caste_granular_tabs_num_rows, "Caste Granular")
get_data([tehsil_granular_tab], [tehsil_granular_tab_num_rows], "Tehsil Granular")
get_data(no_caste_granularity_tabs, no_caste_granularity_tabs_num_rows, "No Caste Granularity")
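
`get_data` reads each cell with a separate `tab.cell` call and repeats similar logic for each table type. The sketch below shows a table-driven alternative that works on rows already held in memory, for example rows read in one go with gspread's `get_all_values`. It only plans which cells to download and what to name the files, and it downloads nothing. The names `LINK_COLUMNS` and `plan_downloads` and the sample row are hypothetical, and the sketch assumes every row has at least as many columns as its layout needs.

```python
# Column letter and file-name suffix for each link column, per table type.
# The layouts mirror the branches of get_data.
LINK_COLUMNS = {
    "Total Plus Caste Granular": [("B", "total"), ("C", "sc"), ("D", "st")],
    "Caste Granular": [("B", "sc"), ("C", "st")],
    "Tehsil Granular": [("C", "total")],
    "No Caste Granularity": [("B", "total")],
}

def plan_downloads(tab_name, rows, table_type, start_from_row=4):
    """Return (column letter, sheet row, file name) for every non-"N/A" link cell.

    rows is a list of row lists, where rows[0] corresponds to sheet row
    start_from_row.
    """
    plan = []
    # Tehsil tabs carry state and district in the first two columns
    n_label_cols = 2 if table_type == "Tehsil Granular" else 1
    for offset, row in enumerate(rows):
        label = "_".join(v.replace(" ", "_") for v in row[:n_label_cols])
        for col_alpha, suffix in LINK_COLUMNS[table_type]:
            col_idx = ord(col_alpha) - ord("A")  # "B" -> 1, "C" -> 2, ...
            if row[col_idx] != "N/A":
                file_name = f"{tab_name}_{label}_{suffix}"
                plan.append((col_alpha, start_from_row + offset, file_name))
    return plan

# Hypothetical tab contents, for illustration only
rows = [["Andhra Pradesh", "link", "N/A", "link"]]
for item in plan_downloads("D-02", rows, "Total Plus Caste Granular"):
    print(item)
```

Each planned item could then be passed to `cell_routine`, which keeps the download step unchanged while cutting down the number of reads against the Sheets API.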