# Aadhaar Data Analytics Project Using PySpark

In [1]:
pip install pyspark    # install pyspark library

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
# Initialize PySpark
from pyspark.sql import SparkSession     # entry point to the DataFrame API

spark = SparkSession.builder.appName('Assignment').getOrCreate()   # create (or reuse) the Spark session
 

In [3]:
adhaar_df = spark.read.format("csv").option("header", "true").option("inferSchema","true").load("/content/UIDAI-ENR-DETAIL-20170308.csv")  # create dataframe from csv file

In [5]:
adhaar_df.show(5)

+--------------+--------------------+-------------+---------+------------+--------+------+---+-----------------+------------------+-------------------------+---------------------------------+
|     Registrar|    Enrolment Agency|        State| District|Sub District|Pin Code|Gender|Age|Aadhaar generated|Enrolment Rejected|Residents providing email|Residents providing mobile number|
+--------------+--------------------+-------------+---------+------------+--------+------+---+-----------------+------------------+-------------------------+---------------------------------+
|Allahabad Bank|A-Onerealtors Pvt...|Uttar Pradesh|Allahabad|        Meja|  212303|     F|  7|                1|                 0|                        0|                                1|
|Allahabad Bank|Asha Security Gua...|Uttar Pradesh|Sonbhadra| Robertsganj|  231213|     M|  8|                1|                 0|                        0|                                0|
|Allahabad Bank|   SGS INDIA PVT LTD|Utt
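
Several column names contain spaces (for example `Aadhaar generated`), which is why later cells refer to them with bracket syntax and why aggregated columns appear under names such as `sum(Aadhaar generated)`. As an optional step that the rest of this notebook does not rely on, the headers could be normalized to snake_case. The helper name `to_snake_case` below is hypothetical and not part of PySpark; the header list is copied from the output above.

```python
import re

def to_snake_case(name: str) -> str:
    """Lower-case a header and replace runs of non-alphanumeric characters with '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()

# Headers as printed by adhaar_df.show(5)
columns = [
    "Registrar", "Enrolment Agency", "State", "District", "Sub District",
    "Pin Code", "Gender", "Age", "Aadhaar generated", "Enrolment Rejected",
    "Residents providing email", "Residents providing mobile number",
]

renamed = [to_snake_case(c) for c in columns]
print(renamed)
```

With a live session, the new names could be applied in one call with `adhaar_df.toDF(*renamed)`. The remaining cells keep the original headers.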

### Q1. Create a DataFrame with the total Aadhaar generated for each state

In [6]:
aadhaar_filtered = adhaar_df.filter(adhaar_df["Aadhaar generated"] >= 1)   # keep rows where at least one Aadhaar was generated
df1 = aadhaar_filtered.groupBy("State").sum("Aadhaar generated")  # total Aadhaar generated per state
df1.show()  # display the result

+--------------------+----------------------+
|               State|sum(Aadhaar generated)|
+--------------------+----------------------+
|            Nagaland|                   545|
|           Karnataka|                 19764|
|              Odisha|                 18182|
|              Kerala|                 15143|
|          Tamil Nadu|                 32485|
|        Chhattisgarh|                  6604|
|      Andhra Pradesh|                  5798|
|      Madhya Pradesh|                 53276|
|              Punjab|                  6506|
|             Manipur|                  1323|
|                 Goa|                  1167|
|             Mizoram|                  6279|
|Dadra and Nagar H...|                   140|
|    Himachal Pradesh|                  1547|
|          Puducherry|                    83|
|             Haryana|                  6804|
|   Jammu and Kashmir|                  1234|
|           Jharkhand|                  9868|
|   Arunachal Pradesh|            

The DataFrame above shows the total Aadhaar generated for each state.
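
To make explicit what the filter followed by `groupBy("State").sum(...)` computes, here is a plain-Python equivalent on a few invented rows (the states and counts are illustrative, not taken from the dataset). It also shows a side effect of filtering first: assuming the counts are never negative, the totals are unchanged, but a state whose rows are all zero drops out of the result entirely.

```python
from collections import defaultdict

# Hypothetical (state, aadhaar_generated) rows
rows = [("Kerala", 1), ("Kerala", 0), ("Goa", 2), ("Sikkim", 0)]

def total_by_state(rows, keep_zero_rows):
    """Sum the counts per state, optionally skipping rows with a count below 1."""
    totals = defaultdict(int)
    for state, generated in rows:
        if keep_zero_rows or generated >= 1:
            totals[state] += generated
    return dict(totals)

with_filter = total_by_state(rows, keep_zero_rows=False)
without_filter = total_by_state(rows, keep_zero_rows=True)

print(with_filter)
print(without_filter)
```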

### Q2. Create a DataFrame with the total Aadhaar generated by each enrolment agency

In [7]:
df2 = aadhaar_filtered.groupBy("Enrolment Agency").sum("Aadhaar generated")  # total Aadhaar generated per enrolment agency
df2.show()  # display the result

+--------------------+----------------------+
|    Enrolment Agency|sum(Aadhaar generated)|
+--------------------+----------------------+
|Raj Construction Co.|                   532|
|      CO JOMLO MOBUK|                     8|
|NPS Technologies ...|                  9692|
|    APOnline Limited|                   305|
|  Transmoovers India|                     5|
|Zephyr System Pvt...|                  6946|
|          ADC BOLENG|                     2|
|Emdee Digitronics...|                  2078|
|Netlink software ...|                  4832|
|     DSO STAT NAMSAI|                    50|
|Estex Telecom Pvt...|                  1894|
|Squaria Global In...|                  1368|
|EAC OFFICE KAYING...|                    21|
|IAP COMPANY Pvt. Ltd|                 10644|
|Prakash Computer ...|                  2817|
|      CDPO Tezu ICDS|                    66|
|       APEX Services|                   109|
|Synapses Solution...|                  2843|
|Yashi Informatics...|            

The DataFrame above shows the total Aadhaar generated by each enrolment agency.

### Q3. Create a DataFrame with the top 10 districts by Aadhaar generated, for both males and females

In [8]:
import pyspark.sql.functions as f      # column functions such as desc() and asc()
df3 = aadhaar_filtered.groupBy("District", "Gender").sum("Aadhaar generated")  # total per district and gender

male_data = df3.filter(df3["Gender"] == "M")  # keep male rows only
male_top10 = male_data.orderBy(f.desc("sum(Aadhaar generated)")).limit(10)  # sort descending and keep the top 10
male_top10.show()  # display the result

female_data = df3.filter(df3["Gender"] == "F")  # keep female rows only
female_top10 = female_data.orderBy(f.desc("sum(Aadhaar generated)")).limit(10)  # sort descending and keep the top 10
female_top10.show()  # display the result

+-----------------+------+----------------------+
|         District|Gender|sum(Aadhaar generated)|
+-----------------+------+----------------------+
|        Bhagalpur|     M|                 11007|
|South 24 Parganas|     M|                  7825|
|          Katihar|     M|                  6968|
|      Murshidabad|     M|                  6808|
|       Samastipur|     M|                  6195|
|            Patna|     M|                  6191|
|       Barddhaman|     M|                  6077|
|             Gaya|     M|                  5959|
|           Munger|     M|                  5781|
|            Nadia|     M|                  5509|
+-----------------+------+----------------------+

+-----------------+------+----------------------+
|         District|Gender|sum(Aadhaar generated)|
+-----------------+------+----------------------+
|       Barddhaman|     F|                  9744|
|South 24 Parganas|     F|                  8382|
|North 24 Parganas|     F|                  6108|

The DataFrames above show the top 10 districts by Aadhaar generated, first for males and then for females.
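
The two filter, sort, and limit passes above are one instance of a general "top N per group" pattern. The sketch below shows that pattern in plain Python on invented totals (the district labels and numbers are hypothetical), grouping by gender and keeping the largest entries in each group.

```python
import heapq
from collections import defaultdict

# Hypothetical (district, gender, total) rows
totals = [
    ("A", "M", 50), ("B", "M", 70), ("C", "M", 60),
    ("A", "F", 80), ("B", "F", 20), ("C", "F", 40),
]

def top_n_per_gender(rows, n):
    """Return, for each gender, the n (district, total) pairs with the largest totals."""
    by_gender = defaultdict(list)
    for district, gender, total in rows:
        by_gender[gender].append((district, total))
    return {
        gender: heapq.nlargest(n, items, key=lambda item: item[1])
        for gender, items in by_gender.items()
    }

top2 = top_n_per_gender(totals, 2)
print(top2)
```

In PySpark, the same idea is commonly expressed in a single pass with a window partitioned by `Gender` and `row_number()`, which avoids repeating the filter for each gender value.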

### Q4. Create a DataFrame with the total Aadhaar generated for the 10 states with the lowest totals

In [9]:
df1 = aadhaar_filtered.groupBy("State").sum("Aadhaar generated")  # total Aadhaar generated per state
df2 = df1.orderBy(f.asc("sum(Aadhaar generated)")).limit(10)  # sort ascending and keep the 10 lowest totals
df2.show()  # display the result

+--------------------+----------------------+
|               State|sum(Aadhaar generated)|
+--------------------+----------------------+
|         Lakshadweep|                     4|
|Andaman and Nicob...|                     5|
|              Others|                    12|
|              Sikkim|                    50|
|          Puducherry|                    83|
|       Daman and Diu|                   105|
|Dadra and Nagar H...|                   140|
|          Chandigarh|                   259|
|           Meghalaya|                   277|
|            Nagaland|                   545|
+--------------------+----------------------+



The DataFrame above shows the 10 states with the lowest total Aadhaar generated.
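
When two states share the same total, sorting on the total alone leaves their relative order unspecified, so which of them lands inside the limit may not be stable between runs. A minimal sketch of a deterministic variant, in plain Python on invented pairs, adds the state name as a secondary sort key.

```python
# Hypothetical (state, total) pairs with a tie between "B" and "C"
state_totals = [("A", 12), ("B", 5), ("C", 5), ("D", 4)]

def lowest_n(pairs, n):
    """Sort by total ascending, then by state name so ties always resolve the same way."""
    return sorted(pairs, key=lambda pair: (pair[1], pair[0]))[:n]

lowest_three = lowest_n(state_totals, 3)
print(lowest_three)
```

In the cell above, the equivalent change would be to pass a second column to `orderBy`, for example `df1.orderBy(f.asc("sum(Aadhaar generated)"), f.asc("State"))`.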

### Q5. Find the states and their number of Aadhaar not generated


In [11]:
notgented_aadhaar = adhaar_df.filter(adhaar_df["Aadhaar generated"] < 1)  # keep rows where no Aadhaar was generated
res1 = notgented_aadhaar.groupBy("State").count()  # count those rows per state
df5 = res1.orderBy(f.desc("count")).limit(10)  # top 10 states by that count, in descending order
df5.show()  # display the result

+--------------+-----+
|         State|count|
+--------------+-----+
|         Bihar| 2982|
| Uttar Pradesh| 2854|
|   West Bengal| 2770|
|Madhya Pradesh| 1654|
|     Rajasthan| 1143|
|   Maharashtra| 1117|
|       Gujarat| 1087|
|     Karnataka| 1025|
|        Odisha|  860|
|    Tamil Nadu|  776|
+--------------+-----+



The DataFrame above shows the 10 states with the most rows in which no Aadhaar was generated.
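
Note that `count()` counts rows, not people. If each row is an aggregate record (which the numeric `Aadhaar generated` and `Enrolment Rejected` columns suggest), the row count and the sum of `Enrolment Rejected` answer different questions. The plain-Python sketch below contrasts the two on invented rows; the values are hypothetical.

```python
# Hypothetical (state, aadhaar_generated, enrolment_rejected) rows
rows = [("Bihar", 0, 3), ("Bihar", 0, 1), ("Bihar", 2, 0), ("Goa", 0, 2)]

# Same condition as the cell above: keep rows where nothing was generated
not_generated = [row for row in rows if row[1] < 1]

row_count = {}     # what groupBy("State").count() reports
rejected_sum = {}  # what summing "Enrolment Rejected" would report
for state, _, rejected in not_generated:
    row_count[state] = row_count.get(state, 0) + 1
    rejected_sum[state] = rejected_sum.get(state, 0) + rejected

print(row_count)
print(rejected_sum)
```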

### Q6. Find the ages whose enrolment was rejected

In [13]:
# Group by age and sum the rejected enrolments, highest total first
Enr_rej = adhaar_df.groupby("Age").sum("Enrolment Rejected").orderBy(f.desc("sum(Enrolment Rejected)"))
# Enr_rej.withColumnRenamed("sum(Enrolment Rejected)", "Total Enrolment Rejected").show(15)
Enr_rej.show()

+---+-----------------------+
|Age|sum(Enrolment Rejected)|
+---+-----------------------+
|  4|                   5673|
|  3|                   3842|
|  2|                   3372|
|  1|                   3333|
|  0|                   3219|
|  5|                   2208|
|  6|                   1931|
|  7|                   1572|
|  8|                   1357|
|  9|                    980|
| 10|                    920|
| 11|                    604|
| 12|                    560|
| 13|                    406|
| 18|                    384|
| 14|                    348|
| 22|                    329|
| 20|                    318|
| 21|                    300|
| 25|                    293|
+---+-----------------------+
only showing top 20 rows



The DataFrame above shows the total enrolments rejected for each age.

### Q7. Find the ages for which Aadhaar was not generated

In [15]:
notgented_aadhaar = adhaar_df.filter(adhaar_df["Aadhaar generated"] < 1)  # keep rows where no Aadhaar was generated
res3 = notgented_aadhaar.groupBy("Age").count()  # count those rows per age
df7 = res3.orderBy(f.desc("count"))  # sort by that count, in descending order
df7.show(100)  # display up to 100 rows

+---+-----+
|Age|count|
+---+-----+
|  4| 1729|
|  3| 1492|
|  2| 1389|
|  1| 1294|
|  0| 1087|
|  5|  863|
|  6|  794|
|  7|  724|
|  8|  612|
|  9|  529|
| 10|  500|
| 11|  403|
| 12|  344|
| 13|  298|
| 18|  283|
| 22|  269|
| 20|  259|
| 14|  254|
| 25|  251|
| 21|  235|
| 19|  231|
| 23|  223|
| 15|  217|
| 24|  212|
| 27|  203|
| 30|  194|
| 16|  194|
| 26|  193|
| 28|  190|
| 17|  180|
| 29|  152|
| 35|  151|
| 32|  150|
| 31|  131|
| 45|  117|
| 60|  116|
| 34|  110|
| 40|  103|
| 62|  102|
| 33|  101|
| 70|  100|
| 65|   97|
| 67|   92|
| 41|   91|
| 36|   87|
| 37|   84|
| 50|   83|
| 38|   82|
| 51|   82|
| 39|   81|
| 47|   77|
| 55|   72|
| 42|   72|
| 52|   70|
| 48|   68|
| 46|   65|
| 61|   64|
| 57|   61|
| 66|   61|
| 58|   60|
| 44|   59|
| 72|   59|
| 43|   58|
| 64|   56|
| 56|   53|
| 71|   53|
| 68|   53|
| 63|   51|
| 54|   51|
| 59|   50|
| 49|   48|
| 69|   43|
| 77|   42|
| 73|   40|
| 74|   39|
| 53|   38|
| 75|   37|
| 80|   32|
| 76|   31|
| 78|   22|
| 79

The DataFrame above shows, for each age, the number of rows in which no Aadhaar was generated.