# **Indian Population Data Analysis**
##### **Author Name :** Sohal Dhavale
##### **Email :** sohaldhavale0m35@gmail.com

> ### Importing all libraries

In [94]:
import pandas as pd 
import numpy as np  
import matplotlib.pyplot as plt 
import seaborn as sns 

> ### Loading the dataset into a variable

In [95]:
df = pd.read_csv('india-districts-census-2011.csv')

In [96]:
df.columns

Index(['District code', 'State name', 'District name', 'Population', 'Male',
       'Female', 'Literate', 'Male_Literate', 'Female_Literate', 'SC',
       ...
       'Power_Parity_Rs_90000_150000', 'Power_Parity_Rs_45000_150000',
       'Power_Parity_Rs_150000_240000', 'Power_Parity_Rs_240000_330000',
       'Power_Parity_Rs_150000_330000', 'Power_Parity_Rs_330000_425000',
       'Power_Parity_Rs_425000_545000', 'Power_Parity_Rs_330000_545000',
       'Power_Parity_Above_Rs_545000', 'Total_Power_Parity'],
      dtype='object', length=118)

> ### **Q.1] What is the total population in the dataset?**

In [97]:
total_Population = df['Population'].sum()
print(f"Total population in this dataset is {total_Population}.")

Total population in this dataset is 1210854977.


> ### **Q.2] Which state has the highest population?**

In [98]:
State_high_population = df.groupby('State name')['Population'].sum().sort_values(ascending=False)
State_high_population.head(1)

State name
UTTAR PRADESH    199812341
Name: Population, dtype: int64

In [99]:
State_high_population = df.groupby('State name')['Population'].sum().nlargest(1)
State_high_population.idxmax()

'UTTAR PRADESH'

In [100]:
State_high_population = df.groupby('State name')['Population'].sum()
highest_population_state = State_high_population.idxmax()
highest_population = State_high_population.max()
# State_high_population: Series of total population per state (indexed by state name)
# highest_population_state: name of the state with the largest total population
# highest_population: that state's total population (a number)
highest_population_state

'UTTAR PRADESH'


`idxmax()` is a pandas method that returns the index label of the maximum value in a Series (the first such label if there are ties). Here the index is the state name, so it returns the state with the highest total population.

Here's how it works:

- `df.groupby('State name')['Population'].sum()` groups the DataFrame by state name and sums the population of each state's districts, producing a Series indexed by state name.
- `idxmax()` is then applied to this Series and returns the index label (the state name) at which the population is largest.

So `idxmax()` retrieves the state with the highest population directly, without sorting the values explicitly or iterating through the Series.
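
The behaviour described above can be checked on a small table. This is a minimal sketch with made-up numbers (the `toy` DataFrame and its values are illustrative assumptions, not census figures); only the column names follow the dataset.

```python
import pandas as pd

# Toy data for illustration only; these are not census figures.
toy = pd.DataFrame({
    "State name": ["A", "A", "B", "B", "C"],
    "Population": [100, 250, 400, 50, 300],
})

# Total population per state: A=350, B=450, C=300.
state_totals = toy.groupby("State name")["Population"].sum()

# idxmax() returns the index label of the largest value; max() returns the value itself.
top_state = state_totals.idxmax()
top_population = state_totals.max()

print(top_state, top_population)
```

Pairing `idxmax()` with `max()` gives both the state name and its total without sorting the whole Series.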

> ### **Q.3] How many districts are there in each state?**

In [101]:
df.groupby('State name')['District name'].count().sort_values(ascending=False).head(10)

State name
UTTAR PRADESH     71
MADHYA PRADESH    50
BIHAR             38
MAHARASHTRA       35
RAJASTHAN         33
TAMIL NADU        32
ORISSA            30
KARNATAKA         30
ASSAM             27
GUJARAT           26
Name: District name, dtype: int64

> ### **Q.4] What is the average literacy rate across all districts?**

In [102]:
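# Note: this is the mean number of literate people per district (a count),
# not a literacy rate; a rate would divide 'Literate' by 'Population'.
df['Literate'].mean()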

1193185.64375

In [103]:
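# Top 20 districts by number of literate people. Districts that share a name
# across states are averaged together because the grouping uses the name only.
df.groupby('District name')['Literate'].mean().sort_values(ascending=False).head(20)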

District name
Thane                         8227161.0
North Twenty Four Parganas    7608693.0
Mumbai Suburban               7575485.0
Bangalore                     7512276.0
Pune                          7171723.0
South Twenty Four Parganas    5531657.0
Ahmadabad                     5435760.0
Barddhaman                    5247208.0
Surat                         4571410.0
Nashik                        4345366.0
Jaipur                        4300965.0
Paschim Medinipur             4078412.0
Hugli                         4078388.0
Murshidabad                   4055834.0
Purba Medinipur               3923194.0
Chennai                       3776276.0
Nagpur                        3673808.0
Allahabad                     3665727.0
Haora                         3605206.0
Kolkata                       3588137.0
Name: Literate, dtype: float64
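
The figures above are counts of literate people, so a rate still has to be derived from them. The sketch below shows two possible readings of "average literacy rate" on made-up data: the unweighted mean of per-district rates and the population-weighted overall rate. The `toy` values and the `Literacy_Rate` column are assumptions for illustration; dividing by total population is also a simplification, since the official census literacy rate is defined over the population aged seven and above.

```python
import pandas as pd

# Toy data for illustration only; these are not census figures.
toy = pd.DataFrame({
    "District name": ["X", "Y", "Z"],
    "Population": [1000, 2000, 500],
    "Literate": [700, 1000, 450],
})

# Per-district literacy rate in percent (hypothetical helper column).
toy["Literacy_Rate"] = toy["Literate"] / toy["Population"] * 100

# Unweighted mean across districts: every district counts equally.
mean_of_rates = toy["Literacy_Rate"].mean()

# Population-weighted rate: total literate people over total population.
overall_rate = toy["Literate"].sum() / toy["Population"].sum() * 100

print(round(mean_of_rates, 2), round(overall_rate, 2))
```

The two numbers differ whenever large and small districts have different rates, so it is worth stating which one is being reported.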

> ### **Q.5] Which district has the highest number of literate females?**

In [104]:
max_female_literate_row = df[df['Female_Literate'] == df['Female_Literate'].max()]
# max_female_literate_row
district_with_max_female_literate = max_female_literate_row['District name'].iloc[0]
district_with_max_female_literate

'Thane'

> ### **Q.6] What is the percentage of SC population in each district?**

In [105]:
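# Sum SC and total population per district once, then take the ratio in percent
district_totals = df.groupby(['State name','District name'])[['SC','Population']].sum()
percentage_SC_per_district = district_totals['SC'] / district_totals['Population'] * 100
# Show only the districts above 30% to keep the output short
percentage_SC_per_district[percentage_SC_per_district>30.0].sort_values(ascending=False)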

State name        District name             
WEST BENGAL       Koch Bihar                    50.170020
PUNJAB            Shahid Bhagat Singh Nagar     42.508533
                  Muktsar                       42.305765
                  Firozpur                      42.173228
                  Jalandhar                     38.951855
                  Faridkot                      38.919010
WEST BENGAL       Jalpaiguri                    37.653911
RAJASTHAN         Ganganagar                    36.584588
PUNJAB            Moga                          36.496958
                  Hoshiarpur                    35.137729
UTTAR PRADESH     Kaushambi                     34.721080
TAMIL NADU        Thiruvarur                    34.084856
PUNJAB            Kapurthala                    33.944782
                  Tarn Taran                    33.713728
                  Mansa                         33.631395
JHARKHAND         Chatra                        32.654864
WEST BENGAL       Bankura  

> ### **Q.7] How many districts have an SC population greater than 40%?**

In [106]:
district_totals = df.groupby(['State name','District name'])[['SC','Population']].sum()
percentage_SC_per_district = district_totals['SC'] / district_totals['Population'] * 100
len(percentage_SC_per_district[percentage_SC_per_district>40.0])

4

> ### **Q.8] What is the average number of workers in rural areas?**

In [107]:
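# Columns used: Workers, Rural_Households
# Note: 'Workers' appears to count all workers (rural and urban), so this is
# total workers per rural household rather than a rural-only average.
df['Workers'].sum()/df['Rural_Households'].sum()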

2.183497008467497

> ### **Q.9] Which state has the highest proportion of workers in urban areas?**

In [108]:
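# Column used: Urban_Households
# Note: this finds the district with the most urban households and reports its
# state; it is a count-based proxy and does not measure a proportion of workers.
StathighProportion = df[df['Urban_Households']==df['Urban_Households'].max()]
StathighProportion['State name'].iloc[0]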

'MAHARASHTRA'
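
A largest count and a largest proportion can point to different states. The sketch below contrasts the two on made-up data, using the share of urban households as a stand-in, since no urban-workers column is shown in this notebook. The `toy` values are assumptions, and the use of a `Households` column as the denominator is also an assumption about the dataset.

```python
import pandas as pd

# Toy data for illustration only; these are not census figures.
toy = pd.DataFrame({
    "State name": ["A", "A", "B", "C"],
    "Households": [1000, 3000, 2000, 500],
    "Urban_Households": [900, 300, 1000, 400],
})

# Aggregate districts to state level before taking any ratio.
state_sums = toy.groupby("State name")[["Urban_Households", "Households"]].sum()

# Share of households that are urban, per state.
urban_share = state_sums["Urban_Households"] / state_sums["Households"]

largest_count_state = state_sums["Urban_Households"].idxmax()
largest_share_state = urban_share.idxmax()

print(largest_count_state, largest_share_state)
```

In this toy table, state A has the most urban households while state C has the highest urban share, which is why the question's wording ("proportion") matters.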

> ### **Q.10] How many households have internet access?**

In [109]:
df['Households_with_Internet'].sum()

7708521

In [111]:
df.groupby('State name')['Households_with_Internet'].sum().sort_values(ascending=False)

State name
MAHARASHTRA                    1379351
TAMIL NADU                      772257
KARNATAKA                       638468
UTTAR PRADESH                   609773
NCT OF DELHI                    588951
ANDHRA PRADESH                  549284
KERALA                          483609
WEST BENGAL                     443642
GUJARAT                         381622
PUNJAB                          292111
HARYANA                         248076
RAJASTHAN                       226501
MADHYA PRADESH                  212473
BIHAR                           165521
ORISSA                          135393
ASSAM                           100173
JHARKHAND                        91074
CHHATTISGARH                     67820
UTTARAKHAND                      63032
JAMMU AND KASHMIR                57977
CHANDIGARH                       44283
HIMACHAL PRADESH                 41193
GOA                              41064
PONDICHERRY                      18086
MANIPUR                          10886
TRIPURA       

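> ### **Q.11] What is the average household size in rural areas?**

In [112]:
# Note: this is the mean number of rural households per district, not the
# number of people per rural household.
df['Rural_Households'].mean()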

344837.365625

In [None]:
df.isnull().sum()

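> ### **Q.12] How many households have a computer but no internet access?**

In [115]:
# Columns used: Households_with_Computer, Households_with_Internet
# Difference between the two totals. It equals the number of households with a
# computer but no internet only if every household with internet also has a computer.
abs((df['Households_with_Computer'].sum())-(df['Households_with_Internet'].sum()))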

15654325

> ### **Q.13] What is the average number of household members in each age group?**

In [120]:
# Columns used: Age_Group_0_29, Age_Group_30_49, Age_Group_50 (Households is not used here)
# Mean count of people per district in each age group, not a per-household average
formatted = df[['Age_Group_0_29','Age_Group_30_49','Age_Group_50']].mean()

In [128]:
df[['Age_Group_0_29','Age_Group_30_49','Age_Group_50']].mean()

Age_Group_0_29     1.102826e+06
Age_Group_30_49    4.820189e+05
Age_Group_50       3.001005e+05
dtype: float64
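
The means above are people per district. One possible way to get members per household is to divide each age-group total by the total number of households. This is a sketch on made-up data; it assumes the age-group columns count persons and that a `Households` column counts households, neither of which is confirmed in this notebook.

```python
import pandas as pd

# Toy data for illustration only; these are not census figures.
toy = pd.DataFrame({
    "Households": [200, 800],
    "Age_Group_0_29": [600, 1500],
    "Age_Group_30_49": [250, 900],
    "Age_Group_50": [150, 600],
})

age_cols = ["Age_Group_0_29", "Age_Group_30_49", "Age_Group_50"]

# Average number of members per household in each age group:
# total people in the group divided by total households.
members_per_household = toy[age_cols].sum() / toy["Households"].sum()

print(members_per_household)
```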

> ### **Q.14] Which religion has the highest representation in the dataset?**

In [131]:
# Religion columns: Hindus, Muslims, Christians, Sikhs, Buddhists, Jains, Others_Religions
column = ['Hindus','Muslims','Christians','Sikhs','Buddhists','Jains','Others_Religions']
df[column].sum().sort_values(ascending=False)

Hindus              966257353
Muslims             172245158
Christians           27819588
Sikhs                20833116
Buddhists             8442972
Others_Religions      7937734
Jains                 4451753
dtype: int64

> ### **Q.15] How many districts have a majority Hindu population?**

In [133]:
# Determine the religion with the highest representation in each district
dominant_religion_per_district = df[['Hindus', 'Muslims', 'Christians', 'Sikhs', 'Buddhists', 'Jains', 'Others_Religions', 'Religion_Not_Stated']].idxmax(axis=1)

# Keep only the districts where Hindus are the largest group
# (a plurality, which is not necessarily more than 50% of the population)
hindu_majority_districts = df[dominant_religion_per_district == 'Hindus']

# List the states that contain at least one such district
# (this lists states; it does not count the districts)
states_with_hindu_majority_districts = hindu_majority_districts['State name'].unique()

print("States with at least one district where Hindus are the largest religious group:")
print(states_with_hindu_majority_districts)


States with at least one district where Hindus are the largest religious group:
['JAMMU AND KASHMIR' 'HIMACHAL PRADESH' 'PUNJAB' 'CHANDIGARH'
 'UTTARAKHAND' 'HARYANA' 'NCT OF DELHI' 'RAJASTHAN' 'UTTAR PRADESH'
 'BIHAR' 'SIKKIM' 'ARUNACHAL PRADESH' 'MANIPUR' 'TRIPURA' 'ASSAM'
 'WEST BENGAL' 'JHARKHAND' 'ORISSA' 'CHHATTISGARH' 'MADHYA PRADESH'
 'GUJARAT' 'DAMAN AND DIU' 'DADRA AND NAGAR HAVELI' 'MAHARASHTRA'
 'ANDHRA PRADESH' 'KARNATAKA' 'GOA' 'KERALA' 'TAMIL NADU' 'PONDICHERRY'
 'ANDAMAN AND NICOBAR ISLANDS']
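
The output above lists states, while the question asks for a number of districts. A count could be sketched as below, distinguishing a plurality (Hindus are the largest group) from a strict majority (Hindus exceed half of the population). The `toy` data and the reduced set of religion columns are assumptions for illustration; on the real dataset the full column list from the cell above would be used.

```python
import pandas as pd

# Toy data for illustration only; these are not census figures.
toy = pd.DataFrame({
    "District name": ["P", "Q", "R", "S"],
    "Population": [1000, 1000, 1000, 1000],
    "Hindus": [700, 450, 300, 200],
    "Muslims": [200, 400, 600, 100],
    "Christians": [100, 150, 100, 700],
})
religion_cols = ["Hindus", "Muslims", "Christians"]

# Plurality: Hindus are the largest single group in the district.
largest_group = toy[religion_cols].idxmax(axis=1)
plurality_count = int((largest_group == "Hindus").sum())

# Strict majority: Hindus are more than half of the district's population.
majority_count = int((toy["Hindus"] > toy["Population"] / 2).sum())

print(plurality_count, majority_count)
```

District Q shows the difference: Hindus are the largest group there at 45%, but they are not a majority.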
