# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint



## Learning Objective

At the end of this experiment, you will be able to:

* perform data preprocessing

In [None]:
#@title Mini Hackathon Walkthrough
from IPython.display import HTML

HTML("""<video width="320" height="240" controls>
  <source src="https://cdn.talentsprint.com/talentsprint1/archives/sc/aiml/aiml_batch_15/preview_videos/Mini_Hackathon_Data_Munging_Briefing.mp4" type="video/mp4">
</video>
""")

## Problem Statement

We will use district-wise demographics, enrollment, and teacher indicator data to predict whether the literacy rate in each district is high, medium, or low.

### Data Preprocessing

Data preprocessing is an important step in solving every machine learning problem. Most of
the datasets used in machine learning problems need to be processed, cleaned, or transformed
before a machine learning algorithm can be trained on them.

There are different steps involved in Data Preprocessing. These steps are as follows:

    1. Data Cleaning → In this step the primary focus is on
        - Handling missing data
        - Handling noisy data
        - Detection and removal of outliers
    
    2. Data Integration → This step applies when data is gathered from several sources and
    must be combined into one consistent dataset. The combined data, once cleaned, is used
    for analysis.
    
    3. Data Transformation → In this step we convert the raw data into the format required by the model we are building. Common
    transformations include:
        - Normalization
        - Aggregation
        - Generalization
        
    4. Data Reduction → After transformation and scaling, redundant data is removed and the remaining data is organized efficiently.



### Total Marks  = 20

In [None]:
# @title Download the datasets
from IPython import get_ipython

ipython = get_ipython()

notebook = "U1_MH1_Data_Munging"  # name of the notebook

def setup():
    # Download and unpack the dataset archive with the %sx shell magic.
    # run_line_magic replaces the deprecated ipython.magic() call.
    ipython.run_line_magic("sx", "wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/B15_Data_Munging.zip")
    ipython.run_line_magic("sx", "unzip B15_Data_Munging.zip")
    print("Data downloaded successfully")

setup()

Data downloaded successfully


In [None]:
!ls

B15_Data_Munging.zip
Districtwise_Basicdata.csv
Districtwise_Enrollment_details_indicator.csv
Districtwise_Teacher_indicator.csv
sample_data


## Exercise 1 - Load and Explore the Data (2 Marks)
1. We have three different files

  * Districtwise_Basicdata.csv
  * Districtwise_Enrollment_details_indicator.csv
  * Districtwise_Teacher_indicator.csv

  These files contain the necessary data to solve the problem. <br>

2. Load the files based on the **team allocation** mentioned below. Observe the header rows and the data records while loading the data.
  
  Hint : Use read_csv from pandas with [skiprows or header](https://towardsdatascience.com/import-csv-files-as-pandas-dataframe-with-skiprows-skipfooter-usecols-index-col-and-header-fbf67a2f92a) options.

3. Inspect the column names of the dataset and rename them if required.

  Hint : Rename column names (if any) using the following [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html).
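The two hints above can be tried on a small in-memory file before touching the real CSVs. The sketch below is only an illustration: the file contents and the new column names (`state_code`, `district_code`) are made up, and the real files may need a different `header` value.

```python
import io
import pandas as pd

# Hypothetical file contents: one title row sits above the real header row.
raw = io.StringIO(
    "District basic data 2012-13,,\n"
    "Year,Statecd,distcd\n"
    "2012-13,35,3501\n"
    "2012-13,28,2801\n"
)

# header=1 says the column names are on the second line (0-indexed row 1);
# the line above it is discarded.
demo = pd.read_csv(raw, header=1)

# Rename with a mapping; columns not listed in the mapping keep their names.
demo = demo.rename(columns={"Statecd": "state_code", "distcd": "district_code"})
print(demo.columns.tolist())
```

Checking `demo.head()` after loading is a quick way to confirm that the chosen `header` value lines up with the actual column names.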

Team allocation for dataset selection

    Team A = 1,3,5,7,9,11,13,15 
        Districtwise_Basicdata.csv
        Districtwise_Enrollment_details_indicator.csv

    Team B = 2,4,6,8,10,12,14,16
        Districtwise_Basicdata.csv
        Districtwise_Teacher_indicator.csv

In [None]:
# Import the required packages; add any other necessary imports here
import pandas as pd
import numpy as np

In [None]:
d1=pd.read_csv('Districtwise_Basicdata.csv',header=1)
d2=pd.read_csv('Districtwise_Enrollment_details_indicator.csv',header=3)
#d2 = d2.rename(columns=d2.iloc[1])


d2.head()


Unnamed: 0,Year,Statecd,State Name,distcd,distname,Enr Govt1,Enr Govt2,Enr Govt3,Enr Govt4,Enr Govt5,Enr Govt6,Enr Govt7,Enr Govt9,Enr Pvt1,Enr Pvt2,Enr Pvt3,Enr Pvt4,Enr Pvt5,Enr Pvt6,Enr Pvt7,Enr Pvt9,Enr R Govt1,Enr R Govt2,Enr R Govt3,Enr R Govt4,Enr R Govt5,Enr R Govt6,Enr R Govt7,Enr R Govt9,Enr R Pvt1,Enr R Pvt2,Enr R Pvt3,Enr R Pvt4,Enr R Pvt5,Enr R Pvt6,Enr R Pvt7,Enr R Pvt9,Enr Py4 C1,Enr Py4 C2,Enr Py4 C3,...,Enr Dis G C8,Grossness P,Grossness Up,Enr Med1 1,Enr Med1 2,Enr Med1 3,Enr Med1 4,Enr Med1 5,Enr Med1 6,Enr Med1 7,Enr Med2 1,Enr Med2 2,Enr Med2 3,Enr Med2 4,Enr Med2 5,Enr Med2 6,Enr Med2 7,Enr Med3 1,Enr Med3 2,Enr Med3 3,Enr Med3 4,Enr Med3 5,Enr Med3 6,Enr Med3 7,Rep C1,Rep C2,Rep C3,Rep C4,Rep C5,Rep C6,Rep C7,Rep C8,Muslim P,Muslim Up,Muslim G P,Muslim G Up,Obc P,Obc Up,Obc G P,Obc G Up
0,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3501,ANDAMANS ...,3232,3359.0,10620,0.0,1027,3739.0,0,0.0,2058.0,1994.0,5408.0,0,0.0,1153.0,0.0,0.0,1788,3030,4236,0.0,169,1750,0,0.0,1193,1113,2206,0,0,846,0,0.0,4080,3968.0,4229.0,...,14,35580,19816,4110,2771,12548,0,575,2458.0,0,742,1320,1302,0.0,144.0,1590,0,176.0,932,1457.0,0,0,410.0,0.0,61,13,14.0,14.0,10.0,12.0,7,4,2539,1383,1263,690,2289,1437,1159,747
1,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3503,MIDDLE AND NORTH ANDAMANS ...,3996,3808.0,1162,1043.0,1397,1625.0,0,0.0,779.0,295.0,,0,0.0,225.0,0.0,0.0,3996,3808,1162,1043.0,1397,1625,0,0.0,779,295,0,0,0,225,0,0.0,2034,2143.0,2191.0,...,11,15997,9083,1689,2077,232,904,178,465.0,0,1119,765,514,136.0,314.0,1072,0,1805.0,1212,387.0,0,567,313.0,0.0,2,3,0.0,3.0,2.0,2.0,0,0,184,103,98,60,2100,1621,1066,825
2,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3502,NICOBARS ...,1510,886.0,357,0.0,651,838.0,0,0.0,16.0,,,0,,,,,1510,886,357,,651,838,0,,16,0,0,0,0,0,0,0.0,498,572.0,516.0,...,2,4593,1987,883,678,357,0,269,437.0,0,643,208,0,0.0,382.0,353,0,0.0,0,0.0,0,0,48.0,0.0,3,4,6.0,4.0,0.0,1.0,5,2,62,34,33,14,17,5,6,0
3,2012-13,28,ANDHRA PRADESH ...,2801,ADILABAD ...,135664,53374.0,1266,0.0,226,23850.0,59120,0.0,57987.0,45282.0,0.0,0,0.0,8169.0,34569.0,0.0,125503,50606,948,0.0,0,21341,50685,0.0,28423,29291,0,0,0,5026,14613,0.0,71495,60037.0,57554.0,...,261,498748,229986,142730,70090,0,0,0,21632.0,63391,40048,22525,1266,,226.0,9899,23529,14247.0,6750,0.0,0,0,856.0,5876.0,0,6,0.0,0.0,6.0,44.0,47,49,33747,13901,17216,7520,118300,62602,56425,30725
4,2012-13,28,ANDHRA PRADESH ...,2822,ANANTAPUR ...,145256,71562.0,6867,870.0,92640,325.0,5271,0.0,78173.0,50979.0,1704.0,0,41617.0,507.0,6684.0,0.0,127573,62797,6060,870.0,74436,325,4756,0.0,34449,30406,1182,0,16744,198,1850,0.0,68652,65524.0,63496.0,...,201,554097,272427,154317,79676,1850,768,78082,523.0,5668,65301,41182,6684,,55525.0,309,5940,5105.0,3461,119.0,102,1004,0.0,296.0,27,10,7.0,3.0,8.0,105.0,84,72,34185,16310,17082,8465,188353,100391,90967,50266


In [None]:
d1.head()

Unnamed: 0,Year,Statecd,statename,distcd,distname,blocks,clusters,villages,totschools,totpopulation,p_06_pop,p_urb_pop,sexratio,sexratio_06,growthrate,p_sc_pop,p_st_pop,overall_lit,female_lit
0,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3501,ANDAMANS ...,3,16,83,212,237586.0,23616.05,55.89,874.0,980.0,13.97,0.0,1.72,High,84.52
1,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3503,MIDDLE AND NORTH ANDAMANS ...,3,13,76,181,105539.0,11651.51,2.6,925.0,975.0,-0.07,0.0,0.72,High,79.39
2,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3502,NICOBARS ...,3,8,42,58,36819.0,4226.82,0.0,778.0,961.0,-12.48,0.0,64.28,High,70.7
3,2012-13,28,ANDHRA PRADESH ...,2801,ADILABAD ...,52,356,1576,4983,2737738.0,295675.7,27.68,1003.0,942.0,10.04,17.82,18.09,Low,51.99
4,2012-13,28,ANDHRA PRADESH ...,2822,ANANTAPUR ...,63,564,929,5188,4083315.0,427114.75,28.09,977.0,927.0,12.16,14.29,3.78,Low,54.31


In [None]:
len(d1),len(d2)

(1324, 1324)

In [None]:
#d2.drop(d2.iloc[:,13:],inplace = True, axis = 1)
#d2.columns=d2.iloc[1]
#d2.head()
#d2.drop(d2.iloc[:,1:3],inplace = True, axis = 1)
#cols_to_use = d2.columns.difference(d1.columns)
#cols_to_use
#d3 = merge(d1, d2[cols_to_use], left_index=True, right_index=True, how='outer')
#d2.iloc[:,0:13]
#d2.columns['Year','State Code','Distcd','Primary']
#d2.head()



In [None]:
#d2.rename(columns={'Primary':'Year','Unnamed: 1':'State Code','Unnamed: 2':'State Name','Unnamed: 3':'District Code','Unnamed: 4':'District Name','Unnamed: 5':'Total Enrolment -Government Schools','Unnamed: 13':'Total Enrolment -Private Schools'},inplace=True)


## Exercise 2  - Data Integration (3 Marks)

As the required data is present in different datasets, we need to **integrate both to make a single dataframe/dataset**.
  * For integrating the datasets, create a unique identifier for each row in both the dataframes so that it can be used to map the data in different files.
   
    * Combine the year, state code, and district code columns to form a new unique identifier column; refer to this [link](https://stackoverflow.com/questions/33098383/merge-multiple-column-values-into-one-column-in-python-pandas).
    * Set the identifier column as the index for each dataframe.

    * Integrate the dataframes using the above index
     
     Hint: For merging or joining the datasets, refer to this [link](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

**Example:** Data for the district Anantapur in Andhra Pradesh, which is present in different files, should form a single row after integrating the datasets.
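A minimal sketch of the integration steps on two toy frames follows. The helper `add_key`, the `_` separator in the key, and the toy values are assumptions made for illustration. The key columns are dropped from one side before joining, which avoids the duplicated `_x`/`_y` columns that a plain merge would produce.

```python
import pandas as pd

# Toy stand-ins for the basic-data and enrollment files (values are illustrative).
basic = pd.DataFrame({
    "Year": ["2012-13", "2012-13"],
    "Statecd": [35, 28],
    "distcd": [3501, 2822],
    "overall_lit": ["High", "Low"],
})
enrol = pd.DataFrame({
    "Year": ["2012-13", "2012-13"],
    "Statecd": [28, 35],          # rows deliberately in a different order
    "distcd": [2822, 3501],
    "Enr Govt1": [145256, 3232],
})

def add_key(df):
    """Hypothetical helper: build year_state_district and use it as the index."""
    key = df["Year"] + "_" + df["Statecd"].astype(str) + "_" + df["distcd"].astype(str)
    return df.assign(uniquekey=key).set_index("uniquekey")

basic_k = add_key(basic)
enrol_k = add_key(enrol)

# Join on the shared index; the key parts are kept from the left frame only.
merged = basic_k.join(enrol_k.drop(columns=["Year", "Statecd", "distcd"]), how="inner")
print(merged)
```

Because the join is on the index, the row order of the two inputs does not matter: each district ends up as one row.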


In [None]:
# YOUR CODE HERE for integrating the datasets
#d2.rename(columns={'Unnamed: 0':'Year','Unnamed: 1':'State Code','Unnamed: 2':'State Name','Unnamed: 3':'District Code','Unnamed: 4':'District Name','Unnamed: 5':'Total Enrolment -Government Schools','Unnamed: 13':'Total Enrolment -Private Schools'},inplace=True)



In [None]:
#d1mod=d1.drop(d1.index[0:1]) # dropping 1st row
#d2mod=d2.drop(d2.index[0:3]) # dropping first 3 rows

In [None]:
d1['uniquekey']=d1['Year']+d1['Statecd'].astype(str)+d1['distcd'].astype(str)
d2['uniquekey']=d2['Year']+d2['Statecd'].astype(str)+d2['distcd'].astype(str)
d1 = d1.set_index('uniquekey')
d2 = d2.set_index('uniquekey')
d3 = d2.merge(d1, left_index=True, right_index=True)
d3.head()
#d3.loc[d3['District Code'] == '3501']
#d1.isnull().sum(axis =0)
#len(d3)


uniquekey,Year_x,Statecd_x,State Name,distcd_x,distname_x,Enr Govt1,Enr Govt2,Enr Govt3,Enr Govt4,Enr Govt5,Enr Govt6,Enr Govt7,Enr Govt9,Enr Pvt1,Enr Pvt2,Enr Pvt3,Enr Pvt4,Enr Pvt5,Enr Pvt6,Enr Pvt7,Enr Pvt9,Enr R Govt1,Enr R Govt2,Enr R Govt3,Enr R Govt4,Enr R Govt5,Enr R Govt6,Enr R Govt7,Enr R Govt9,Enr R Pvt1,Enr R Pvt2,Enr R Pvt3,Enr R Pvt4,Enr R Pvt5,Enr R Pvt6,Enr R Pvt7,Enr R Pvt9,Enr Py4 C1,Enr Py4 C2,Enr Py4 C3,...,Enr Med3 3,Enr Med3 4,Enr Med3 5,Enr Med3 6,Enr Med3 7,Rep C1,Rep C2,Rep C3,Rep C4,Rep C5,Rep C6,Rep C7,Rep C8,Muslim P,Muslim Up,Muslim G P,Muslim G Up,Obc P,Obc Up,Obc G P,Obc G Up,Year_y,Statecd_y,statename,distcd_y,distname_y,blocks,clusters,villages,totschools,totpopulation,p_06_pop,p_urb_pop,sexratio,sexratio_06,growthrate,p_sc_pop,p_st_pop,overall_lit,female_lit
2012-13353501,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3501,ANDAMANS ...,3232,3359.0,10620,0.0,1027,3739.0,0,0.0,2058.0,1994.0,5408.0,0,0.0,1153.0,0.0,0.0,1788,3030,4236,0.0,169,1750,0,0.0,1193,1113,2206,0,0,846,0,0.0,4080,3968.0,4229.0,...,1457.0,0,0,410.0,0.0,61,13,14.0,14.0,10.0,12.0,7,4,2539,1383,1263,690,2289,1437,1159,747,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3501,ANDAMANS ...,3,16,83,212,237586.0,23616.05,55.89,874.0,980.0,13.97,0.0,1.72,High,84.52
2012-13353503,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3503,MIDDLE AND NORTH ANDAMANS ...,3996,3808.0,1162,1043.0,1397,1625.0,0,0.0,779.0,295.0,,0,0.0,225.0,0.0,0.0,3996,3808,1162,1043.0,1397,1625,0,0.0,779,295,0,0,0,225,0,0.0,2034,2143.0,2191.0,...,387.0,0,567,313.0,0.0,2,3,0.0,3.0,2.0,2.0,0,0,184,103,98,60,2100,1621,1066,825,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3503,MIDDLE AND NORTH ANDAMANS ...,3,13,76,181,105539.0,11651.51,2.6,925.0,975.0,-0.07,0.0,0.72,High,79.39
2012-13353502,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3502,NICOBARS ...,1510,886.0,357,0.0,651,838.0,0,0.0,16.0,,,0,,,,,1510,886,357,,651,838,0,,16,0,0,0,0,0,0,0.0,498,572.0,516.0,...,0.0,0,0,48.0,0.0,3,4,6.0,4.0,0.0,1.0,5,2,62,34,33,14,17,5,6,0,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3502,NICOBARS ...,3,8,42,58,36819.0,4226.82,0.0,778.0,961.0,-12.48,0.0,64.28,High,70.7
2012-13282801,2012-13,28,ANDHRA PRADESH ...,2801,ADILABAD ...,135664,53374.0,1266,0.0,226,23850.0,59120,0.0,57987.0,45282.0,0.0,0,0.0,8169.0,34569.0,0.0,125503,50606,948,0.0,0,21341,50685,0.0,28423,29291,0,0,0,5026,14613,0.0,71495,60037.0,57554.0,...,0.0,0,0,856.0,5876.0,0,6,0.0,0.0,6.0,44.0,47,49,33747,13901,17216,7520,118300,62602,56425,30725,2012-13,28,ANDHRA PRADESH ...,2801,ADILABAD ...,52,356,1576,4983,2737738.0,295675.7,27.68,1003.0,942.0,10.04,17.82,18.09,Low,51.99
2012-13282822,2012-13,28,ANDHRA PRADESH ...,2822,ANANTAPUR ...,145256,71562.0,6867,870.0,92640,325.0,5271,0.0,78173.0,50979.0,1704.0,0,41617.0,507.0,6684.0,0.0,127573,62797,6060,870.0,74436,325,4756,0.0,34449,30406,1182,0,16744,198,1850,0.0,68652,65524.0,63496.0,...,119.0,102,1004,0.0,296.0,27,10,7.0,3.0,8.0,105.0,84,72,34185,16310,17082,8465,188353,100391,90967,50266,2012-13,28,ANDHRA PRADESH ...,2822,ANANTAPUR ...,63,564,929,5188,4083315.0,427114.75,28.09,977.0,927.0,12.16,14.29,3.78,Low,54.31


## Exercise 3 - Data Cleaning (3 Marks)

1.  **overall_lit** is our target variable. Delete the rows with a missing overall_lit value.

   Hint: Refer to the link [dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html).


2.  Convert categorical values to numerical values.

  For example, if a feature contains categorical values such as dog, cat, mouse, etc., replace them with 1, 2, 3, etc., or use scikit-learn's [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

3. Replace the missing values in any other column appropriately with mean / median / mode.

  Hint: Use pandas [fillna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) function to replace the missing values
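The three cleaning steps can be sketched on a toy frame as follows. The values and the explicit `Low < Medium < High` mapping are assumptions for illustration. `LabelEncoder` would instead assign codes in sorted label order (High = 0, Low = 1, Medium = 2), which is also valid for a classification target.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "overall_lit": ["High", "Low", None, "Medium"],
    "totpopulation": [237586.0, np.nan, 36819.0, 4083315.0],
})

# 1. Drop rows whose target value is missing.
toy = toy.dropna(subset=["overall_lit"]).copy()

# 2. Convert the categorical target to numbers with an explicit mapping
#    (assumed ordering; chosen here so the codes follow the literacy level).
order = {"Low": 0, "Medium": 1, "High": 2}
toy["overall_lit"] = toy["overall_lit"].map(order)

# 3. Fill the remaining missing values with the column median.
toy["totpopulation"] = toy["totpopulation"].fillna(toy["totpopulation"].median())
print(toy)
```

The median is often preferred over the mean for skewed columns such as population counts, since a few very large districts would pull the mean upward.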




In [None]:
# YOUR CODE HERE for data cleaning
d3=d3.dropna(subset=['overall_lit'])
d3.isnull().sum(axis =0)
#MergedData.head()
d3.head()




uniquekey,Year_x,Statecd_x,State Name,distcd_x,distname_x,Enr Govt1,Enr Govt2,Enr Govt3,Enr Govt4,Enr Govt5,Enr Govt6,Enr Govt7,Enr Govt9,Enr Pvt1,Enr Pvt2,Enr Pvt3,Enr Pvt4,Enr Pvt5,Enr Pvt6,Enr Pvt7,Enr Pvt9,Enr R Govt1,Enr R Govt2,Enr R Govt3,Enr R Govt4,Enr R Govt5,Enr R Govt6,Enr R Govt7,Enr R Govt9,Enr R Pvt1,Enr R Pvt2,Enr R Pvt3,Enr R Pvt4,Enr R Pvt5,Enr R Pvt6,Enr R Pvt7,Enr R Pvt9,Enr Py4 C1,Enr Py4 C2,Enr Py4 C3,...,Enr Med3 3,Enr Med3 4,Enr Med3 5,Enr Med3 6,Enr Med3 7,Rep C1,Rep C2,Rep C3,Rep C4,Rep C5,Rep C6,Rep C7,Rep C8,Muslim P,Muslim Up,Muslim G P,Muslim G Up,Obc P,Obc Up,Obc G P,Obc G Up,Year_y,Statecd_y,statename,distcd_y,distname_y,blocks,clusters,villages,totschools,totpopulation,p_06_pop,p_urb_pop,sexratio,sexratio_06,growthrate,p_sc_pop,p_st_pop,overall_lit,female_lit
2012-13353501,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3501,ANDAMANS ...,3232,3359.0,10620,0.0,1027,3739.0,0,0.0,2058.0,1994.0,5408.0,0,0.0,1153.0,0.0,0.0,1788,3030,4236,0.0,169,1750,0,0.0,1193,1113,2206,0,0,846,0,0.0,4080,3968.0,4229.0,...,1457.0,0,0,410.0,0.0,61,13,14.0,14.0,10.0,12.0,7,4,2539,1383,1263,690,2289,1437,1159,747,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3501,ANDAMANS ...,3,16,83,212,237586.0,23616.05,55.89,874.0,980.0,13.97,0.0,1.72,High,84.52
2012-13353503,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3503,MIDDLE AND NORTH ANDAMANS ...,3996,3808.0,1162,1043.0,1397,1625.0,0,0.0,779.0,295.0,,0,0.0,225.0,0.0,0.0,3996,3808,1162,1043.0,1397,1625,0,0.0,779,295,0,0,0,225,0,0.0,2034,2143.0,2191.0,...,387.0,0,567,313.0,0.0,2,3,0.0,3.0,2.0,2.0,0,0,184,103,98,60,2100,1621,1066,825,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3503,MIDDLE AND NORTH ANDAMANS ...,3,13,76,181,105539.0,11651.51,2.6,925.0,975.0,-0.07,0.0,0.72,High,79.39
2012-13353502,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3502,NICOBARS ...,1510,886.0,357,0.0,651,838.0,0,0.0,16.0,,,0,,,,,1510,886,357,,651,838,0,,16,0,0,0,0,0,0,0.0,498,572.0,516.0,...,0.0,0,0,48.0,0.0,3,4,6.0,4.0,0.0,1.0,5,2,62,34,33,14,17,5,6,0,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3502,NICOBARS ...,3,8,42,58,36819.0,4226.82,0.0,778.0,961.0,-12.48,0.0,64.28,High,70.7
2012-13282801,2012-13,28,ANDHRA PRADESH ...,2801,ADILABAD ...,135664,53374.0,1266,0.0,226,23850.0,59120,0.0,57987.0,45282.0,0.0,0,0.0,8169.0,34569.0,0.0,125503,50606,948,0.0,0,21341,50685,0.0,28423,29291,0,0,0,5026,14613,0.0,71495,60037.0,57554.0,...,0.0,0,0,856.0,5876.0,0,6,0.0,0.0,6.0,44.0,47,49,33747,13901,17216,7520,118300,62602,56425,30725,2012-13,28,ANDHRA PRADESH ...,2801,ADILABAD ...,52,356,1576,4983,2737738.0,295675.7,27.68,1003.0,942.0,10.04,17.82,18.09,Low,51.99
2012-13282822,2012-13,28,ANDHRA PRADESH ...,2822,ANANTAPUR ...,145256,71562.0,6867,870.0,92640,325.0,5271,0.0,78173.0,50979.0,1704.0,0,41617.0,507.0,6684.0,0.0,127573,62797,6060,870.0,74436,325,4756,0.0,34449,30406,1182,0,16744,198,1850,0.0,68652,65524.0,63496.0,...,119.0,102,1004,0.0,296.0,27,10,7.0,3.0,8.0,105.0,84,72,34185,16310,17082,8465,188353,100391,90967,50266,2012-13,28,ANDHRA PRADESH ...,2822,ANANTAPUR ...,63,564,929,5188,4083315.0,427114.75,28.09,977.0,927.0,12.16,14.29,3.78,Low,54.31


In [None]:
#d4=Encoder(d3)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
d3['overall_lit'] = le.fit_transform(d3['overall_lit'])

In [None]:
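# Drop identifier and name columns that are already encoded in the index
# ('State Name ' keeps its trailing space exactly as written in the original cell)
d4 = d3.drop(columns=['statename', 'distname_x', 'Year_x', 'distcd_x', 'Statecd_x', 'distcd_y', 'distname_y', 'Statecd_y', 'State Name ', 'Year_y'])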
d4.head()



uniquekey,Enr Govt1,Enr Govt2,Enr Govt3,Enr Govt4,Enr Govt5,Enr Govt6,Enr Govt7,Enr Govt9,Enr Pvt1,Enr Pvt2,Enr Pvt3,Enr Pvt4,Enr Pvt5,Enr Pvt6,Enr Pvt7,Enr Pvt9,Enr R Govt1,Enr R Govt2,Enr R Govt3,Enr R Govt4,Enr R Govt5,Enr R Govt6,Enr R Govt7,Enr R Govt9,Enr R Pvt1,Enr R Pvt2,Enr R Pvt3,Enr R Pvt4,Enr R Pvt5,Enr R Pvt6,Enr R Pvt7,Enr R Pvt9,Enr Py4 C1,Enr Py4 C2,Enr Py4 C3,Enr Py4 C4,Enr Py4 C5,Enr Py4 C6,Enr Py4 C7,Enr Py4 C8,...,Enr Med2 5,Enr Med2 6,Enr Med2 7,Enr Med3 1,Enr Med3 2,Enr Med3 3,Enr Med3 4,Enr Med3 5,Enr Med3 6,Enr Med3 7,Rep C1,Rep C2,Rep C3,Rep C4,Rep C5,Rep C6,Rep C7,Rep C8,Muslim P,Muslim Up,Muslim G P,Muslim G Up,Obc P,Obc Up,Obc G P,Obc G Up,blocks,clusters,villages,totschools,totpopulation,p_06_pop,p_urb_pop,sexratio,sexratio_06,growthrate,p_sc_pop,p_st_pop,overall_lit,female_lit
2012-13353501,3232,3359.0,10620,0.0,1027,3739.0,0,0.0,2058.0,1994.0,5408.0,0,0.0,1153.0,0.0,0.0,1788,3030,4236,0.0,169,1750,0,0.0,1193,1113,2206,0,0,846,0,0.0,4080,3968.0,4229.0,4122.0,4015,4562.0,4385.0,4299,...,144.0,1590,0,176.0,932,1457.0,0,0,410.0,0.0,61,13,14.0,14.0,10.0,12.0,7,4,2539,1383,1263,690,2289,1437,1159,747,3,16,83,212,237586.0,23616.05,55.89,874.0,980.0,13.97,0.0,1.72,0,84.52
2012-13353503,3996,3808.0,1162,1043.0,1397,1625.0,0,0.0,779.0,295.0,,0,0.0,225.0,0.0,0.0,3996,3808,1162,1043.0,1397,1625,0,0.0,779,295,0,0,0,225,0,0.0,2034,2143.0,2191.0,2139.0,2452,2806.0,2400.0,2152,...,314.0,1072,0,1805.0,1212,387.0,0,567,313.0,0.0,2,3,0.0,3.0,2.0,2.0,0,0,184,103,98,60,2100,1621,1066,825,3,13,76,181,105539.0,11651.51,2.6,925.0,975.0,-0.07,0.0,0.72,0,79.39
2012-13353502,1510,886.0,357,0.0,651,838.0,0,0.0,16.0,,,0,,,,,1510,886,357,,651,838,0,,16,0,0,0,0,0,0,0.0,498,572.0,516.0,589.0,630,665.0,692.0,527,...,382.0,353,0,0.0,0,0.0,0,0,48.0,0.0,3,4,6.0,4.0,0.0,1.0,5,2,62,34,33,14,17,5,6,0,3,8,42,58,36819.0,4226.82,0.0,778.0,961.0,-12.48,0.0,64.28,0,70.7
2012-13282801,135664,53374.0,1266,0.0,226,23850.0,59120,0.0,57987.0,45282.0,0.0,0,0.0,8169.0,34569.0,0.0,125503,50606,948,0.0,0,21341,50685,0.0,28423,29291,0,0,0,5026,14613,0.0,71495,60037.0,57554.0,55576.0,51416,46293.0,45863.0,42415,...,226.0,9899,23529,14247.0,6750,0.0,0,0,856.0,5876.0,0,6,0.0,0.0,6.0,44.0,47,49,33747,13901,17216,7520,118300,62602,56425,30725,52,356,1576,4983,2737738.0,295675.7,27.68,1003.0,942.0,10.04,17.82,18.09,1,51.99
2012-13282822,145256,71562.0,6867,870.0,92640,325.0,5271,0.0,78173.0,50979.0,1704.0,0,41617.0,507.0,6684.0,0.0,127573,62797,6060,870.0,74436,325,4756,0.0,34449,30406,1182,0,16744,198,1850,0.0,68652,65524.0,63496.0,63465.0,57840,56337.0,57943.0,52180,...,55525.0,309,5940,5105.0,3461,119.0,102,1004,0.0,296.0,27,10,7.0,3.0,8.0,105.0,84,72,34185,16310,17082,8465,188353,100391,90967,50266,63,564,929,5188,4083315.0,427114.75,28.09,977.0,927.0,12.16,14.29,3.78,1,54.31


## Exercise 4 - (3 Marks)

1. Remove the unnecessary columns that do not contribute to the overall literacy rate.

2. Verify if there are any duplicate columns and remove them.

  For example: state name and district name carry the same information as state code and district code.

3. Make sure that the final dataframe has no null or NaN values. Delete the rows with missing values.

   Hint: Use df.isna() to check for NaN values in the dataframe.
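One possible way to approach steps 2 and 3 is sketched below on a toy frame. The column `Statecd_copy` is hypothetical. This check only catches columns whose values are exactly identical. Columns that carry the same information in a different form (for example a state name versus a state code) still need to be dropped by name.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Statecd": [35, 35, 28],
    "Statecd_copy": [35, 35, 28],   # hypothetical column with identical values
    "blocks": [3, np.nan, 52],
})

# Transpose so that columns become rows, then flag rows that repeat an earlier one.
dup_mask = toy.T.duplicated().to_numpy()
toy = toy.loc[:, ~dup_mask]

# Count missing values per column, then drop the rows that contain any.
print(toy.isna().sum())
toy = toy.dropna(axis=0)
print(toy)
```

Transposing a wide frame can be slow, so on the full dataset it may be more practical to compare only the columns suspected of being duplicates.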

## Exercise 5 - Apply Correlation Matrix (2 Marks)

Correlation is a statistical measure of whether, and how strongly, pairs of variables are related. More features do not imply better accuracy: irrelevant features add noise to the model and can reduce its accuracy.

*Highly correlated features carry nearly the same information, so we remove them.*

**Function Description:**

The `remove_Highly_Correlated()` function removes highly correlated features from the dataframe.
- Creates a correlation matrix of the features
- Keeps only one triangle of the correlation matrix (the code masks the upper triangle and the diagonal), since the matrix is symmetric and has the same values below and above the diagonal
- Removes columns whose correlation with another column exceeds the threshold value.
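To see what the masking step does, here is a small sketch on a toy frame in which column `b` is an exact multiple of column `a`. The column names and values are made up for illustration; the logic follows the same mask-then-threshold idea described above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],     # exact multiple of "a", so perfectly correlated
    "c": [1.0, -1.0, 1.0, -1.0],   # only weakly related to the others
})

corr = toy.corr()

# True on and above the diagonal; masking those cells leaves the lower triangle,
# so every pair of columns is examined exactly once.
mask = np.triu(np.ones(corr.shape, dtype=bool))
lower = corr.mask(mask)

# A column is dropped if any remaining entry in it exceeds the threshold.
to_drop = [col for col in lower.columns if (lower[col] > 0.9).any()]
print(to_drop)
```

Of each highly correlated pair, this approach drops the column that appears earlier in the frame and keeps the later one. It also compares signed values, so strongly negative correlations are not flagged unless the absolute value is taken first.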

In [None]:
def remove_Highly_Correlated(df, bar=0.9):
  # Creates correlation matrix
  corr = df.corr()

  # Set Up Mask To Hide Upper Triangle
  mask = np.triu(np.ones_like(corr, dtype=bool))
  tri_df = corr.mask(mask)

  # Finding features with correlation value more than specified threshold value (bar=0.9)
  highly_cor_col = [col for col in tri_df.columns if any(tri_df[col] > bar )]
  print("length of highly correlated columns",len(highly_cor_col))

  # Drop the highly correlated columns
  reduced_df = df.drop(highly_cor_col, axis = 1)
  print("shape of total data",df.shape,"shape of reduced data",reduced_df.shape)
  return reduced_df

In [None]:
# YOUR CODE HERE to remove highly correlated features from the dataframe by calling above function.
d5=remove_Highly_Correlated(d4, bar=0.9)
d5.head()
d5=d5.dropna(axis=0)
d5.isnull().sum(axis =0)


length of highly correlated columns 94
shape of total data (1268, 175) shape of reduced data (1268, 81)


Enr Govt3      0
Enr Govt9      0
Enr Pvt1       0
Enr Pvt2       0
Enr Pvt5       0
              ..
growthrate     0
p_sc_pop       0
p_st_pop       0
overall_lit    0
female_lit     0
Length: 81, dtype: int64

## Exercise 6 - (3 Marks)

Perform Mean Correction and Standard Scaling on the data feature/column wise.

**Hint:** To understand the idea behind the terms used above, you may refer to the following link:

[StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
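Mean correction subtracts each column's mean, and standard scaling then divides by the column's standard deviation. The sketch below, on made-up numbers, checks that doing this by hand matches `StandardScaler`, which uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: two columns on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 600.0]])

# Manual version: subtract the column mean, divide by the column std (ddof=0 by default).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Library version.
scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))
```

After scaling, every column has zero mean and unit variance, so features measured in large units no longer dominate distance-based models such as k-nearest neighbours.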

In [None]:
from sklearn.preprocessing import StandardScaler
import numpy as np
# Features: every column except the target
X = d5.drop(columns='overall_lit')
# Target: select by name rather than by position (it is the second-to-last column)
y = d5['overall_lit']




In [None]:
from sklearn.preprocessing import StandardScaler
import numpy as np
# Centre each feature on zero mean and scale it to unit variance
scaler = StandardScaler(with_mean=True, with_std=True)
Xmean = scaler.fit_transform(X)
Xmean



array([[ 0.97303999, -0.07460974, -0.48110453, ..., -1.65710802,
        -0.57840019,  1.58198758],
       [-0.20477637, -0.07460974, -0.50485561, ..., -1.65710802,
        -0.61615486,  1.17305033],
       [ 0.02124781, -0.07460974,  0.55686744, ...,  0.30945514,
        -0.43304474, -0.97127949],
       ...,
       [-0.14711842, -0.07460974,  0.30836327, ..., -0.03530311,
        -0.62257315,  1.3659603 ],
       [ 0.00617953, -0.07460974, -0.29922963, ...,  0.49125692,
         0.05323532, -1.06693733],
       [-0.28746511, -0.07460974,  0.67384008, ...,  1.68959671,
        -0.59841017,  0.59113183]])

## Exercise 7 - (3 Marks)

Apply different classifiers on the preprocessed data and figure out which classifier gives the best result.

* Split the data into train and test

* Fit the model with train data and find the accuracy of test data

### Expected Accuracy is above 90%
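A compact way to compare several classifiers is to loop over them with the same train/test split. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the preprocessed data, so the names `X_demo` and `y_demo` and the resulting scores are illustrative only. Scaling sits inside a pipeline so that it is fitted on the training split alone.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class stand-in for the preprocessed features and target.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Each candidate is a pipeline: scale first, then classify.
models = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
    "tree": make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0)),
}

# score() on a classifier pipeline returns accuracy on the given data.
scores = {name: model.fit(X_tr, y_tr).score(X_te, y_te) for name, model in models.items()}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

Adding another classifier to the comparison only requires one more entry in the `models` dictionary.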

In [None]:
# YOUR CODE HERE for applying different classifiers

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(Xmean,y,test_size=0.2,random_state=42)
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)
score = accuracy_score(y_test, y_pred)  # argument order is (y_true, y_pred)
score


0.743801652892562

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(Xmean,y,test_size=0.2,random_state=42)
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)
score = accuracy_score(y_test, y_pred)  # argument order is (y_true, y_pred)
score

0.9462809917355371

In [None]:
X_test.shape

(400, 80)