# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint



## Learning Objective

At the end of this experiment, you will be able to:

* perform data preprocessing

## Problem Statement

We will use district-wise demographics, enrollment, and teacher indicator data to predict whether the literacy rate in each district is high, medium, or low.

### Data Preprocessing

Data preprocessing is an important step in solving every machine learning problem. Most of
the datasets used in machine learning problems need to be processed, cleaned, or transformed
before a machine learning algorithm can be trained on them.

There are different steps involved in Data Preprocessing. These steps are as follows:

    1. Data Cleaning → In this step the primary focus is on
        - Handling missing data
        - Handling noisy data
        - Detection and removal of outliers
    
    2. Data Integration → This step combines data gathered from several sources into a single, consistent dataset.
    The cleaned, integrated data is then used for analysis.
    
    3. Data Transformation → In this step we convert the raw data into the format required by the model we are building.
    Common transformation options include:
        - Normalization
        - Aggregation
        - Generalization
        
    4. Data Reduction → After transformation and scaling, redundant data is removed and the remaining data is organized efficiently.
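
As a rough illustration of the cleaning and transformation steps listed above, the sketch below removes an outlier with the 1.5 × IQR rule and then applies min-max normalization. The column name `enrolment` and all values are hypothetical toy data, and the IQR rule is only one of several possible outlier criteria.

```python
import pandas as pd

# Hypothetical toy data: one obviously extreme value in "enrolment".
df = pd.DataFrame({"enrolment": [120, 135, 128, 140, 131, 2500]})

# Data cleaning: keep only values within 1.5 * IQR of the quartiles.
q1, q3 = df["enrolment"].quantile([0.25, 0.75])
iqr = q3 - q1
inliers = df[df["enrolment"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Data transformation: min-max normalization to the range [0, 1].
col = inliers["enrolment"]
normalized = (col - col.min()) / (col.max() - col.min())

print(len(df), "rows before cleaning,", len(inliers), "rows after")
print(normalized.round(2).tolist())
```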



### Total Marks  = 20

In [None]:
! wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/B15_Data_Munging.zip
! unzip B15_Data_Munging.zip

In [None]:
!ls

## Exercise 1 - Load and Explore the Data (3 Marks)
1. We have three different files

  * Districtwise_Basicdata.csv
  * Districtwise_Enrollment_details_indicator.csv
  * Districtwise_Teacher_indicator.csv

  These files contain the necessary data to solve the problem. <br>

2. Load the files based on the **team allocation** mentioned below. Observe the header rows and the data records while loading the data.
  
  Hint : Use read_csv from pandas with [skiprows or header](https://towardsdatascience.com/import-csv-files-as-pandas-dataframe-with-skiprows-skipfooter-usecols-index-col-and-header-fbf67a2f92a) options.

3. Read the columns of the dataset and rename them if required.

  Hint : Rename column names (if any) using the following [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html).
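
The two hints above can be sketched on a small in-memory file. This is a minimal example, assuming a single title line sits above the real header row; the file content, the column names, and the new names are all hypothetical, so inspect the actual CSV files before choosing `skiprows` or `header`.

```python
import io
import pandas as pd

# Hypothetical file content: a title line sits above the real header row.
raw = io.StringIO(
    "District report (toy example)\n"
    "Year,State Cd,Dist Cd,Overall Lit\n"
    "2015-16,28,2801,63.6\n"
    "2015-16,28,2802,67.4\n"
)

# skiprows=1 discards the title line, so the next line becomes the header.
df = pd.read_csv(raw, skiprows=1)

# Rename columns to short, consistent names (illustrative names only).
df = df.rename(columns={
    "Year": "year",
    "State Cd": "statecd",
    "Dist Cd": "distcd",
    "Overall Lit": "overall_lit",
})

print(df.columns.tolist())
print(df.shape)
```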

Team allocation for dataset selection

    Team A = 1,3,5,7,9,11,13,15
        Districtwise_Basicdata.csv
        Districtwise_Enrollment_details_indicator.csv

    Team B = 2,4,6,8,10,12,14,16
        Districtwise_Basicdata.csv
        Districtwise_Teacher_indicator.csv

In [None]:
# Import the required packages; add further imports as needed
import pandas as pd
import numpy as np

In [None]:
# YOUR CODE HERE for loading and exploring the datasets

## Exercise 2  - Data Integration (3 Marks)

As the required data is present in different datasets, we need to **integrate both to make a single dataframe/dataset**.
  * For integrating the datasets, create a unique identifier for each row in both the dataframes so that it can be used to map the data in different files.
   
    * Combine year, state code, district code columns and form a new unique identifier column, refer to this [link](https://stackoverflow.com/questions/33098383/merge-multiple-column-values-into-one-column-in-python-pandas).
    * Set the identifier column as the index for each dataframe.

    * Integrate the dataframes using the above index
     
     Hint: For merging or joining the datasets, refer to this [link](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

**Example:** Data for the district Anantapur in Andhra Pradesh, which is present in different files, should form a single row after integrating the datasets.
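
One possible way to build the identifier and join on it is sketched below. The column names (`year`, `statecd`, `distcd`, `teachers_total`), the helper `add_key`, and all values are assumptions made for illustration; the real files may use different names.

```python
import pandas as pd

# Hypothetical toy frames standing in for two of the CSV files.
basic = pd.DataFrame({
    "year": ["2015-16", "2015-16"],
    "statecd": [28, 28],
    "distcd": [2801, 2802],
    "overall_lit": [63.6, 67.4],
})
teachers = pd.DataFrame({
    "year": ["2015-16", "2015-16"],
    "statecd": [28, 28],
    "distcd": [2802, 2801],  # deliberately in a different row order
    "teachers_total": [410, 385],
})

def add_key(df):
    """Combine year, state code and district code into one index column."""
    key = (
        df["year"].astype(str) + "_"
        + df["statecd"].astype(str) + "_"
        + df["distcd"].astype(str)
    )
    return df.assign(uid=key).set_index("uid")

basic_k = add_key(basic)
# Drop the key parts from the second frame so they are not duplicated.
teachers_k = add_key(teachers).drop(columns=["year", "statecd", "distcd"])

# Join on the shared index; rows are matched by identifier, not position.
merged = basic_k.join(teachers_k, how="inner")
print(merged)
```

An inner join keeps only the districts present in both frames; whether that or an outer join is appropriate depends on how much data you are willing to lose.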


In [None]:
# YOUR CODE HERE for integrating the datasets

## Exercise 3 - Data Cleaning (3 Marks)

1.  **Overall_lit** is our target variable. Delete the rows in which the overall_lit value is missing.

   Hint: Refer to the link [dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html).


2.  Convert categorical values to numerical values.

  For example, if a feature contains categorical values such as dog, cat, mouse, etc., replace them with 0, 1, 2, etc., or use scikit-learn's [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

3. Replace the missing values in any other column appropriately with mean / median / mode.

  Hint: Use pandas [fillna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) function to replace the missing values
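
The three cleaning steps can be sketched on toy data as follows. The columns `area_type` and `schools` and all values are hypothetical, and the median is just one of the fill strategies the exercise allows.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with a missing target and a missing feature value.
df = pd.DataFrame({
    "overall_lit": [63.6, np.nan, 71.2, 58.9],
    "area_type": ["rural", "urban", "urban", "rural"],
    "schools": [120.0, 95.0, np.nan, 80.0],
})

# 1. Drop the rows where the target is missing.
df = df.dropna(subset=["overall_lit"])

# 2. Encode the categorical column as integers (classes are sorted first).
df["area_type"] = LabelEncoder().fit_transform(df["area_type"])

# 3. Fill the remaining missing values with the column median.
df["schools"] = df["schools"].fillna(df["schools"].median())

print(df)
```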




In [None]:
# YOUR CODE HERE for data cleaning

## Exercise 4 - (3 Marks)

1. Remove the unnecessary columns which are not contributing to the overall literacy rate

2. Verify if there are any duplicate columns and remove them.

  For example: the state name and district name columns carry the same information as the state code and district code columns.

3. Make sure that the final dataframe has no null or NaN values. Delete any rows that still contain missing values.

   Hint: Use df.isna().sum() to count the NaN values in each column of the dataframe.
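
A minimal sketch of these steps on toy data is shown below. The column names and values are hypothetical; which columns count as unnecessary in the real dataset is a judgment you need to make yourself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "statename" repeats "statecd" as a label,
# and "schools_copy" holds exactly the same values as "schools".
df = pd.DataFrame({
    "statecd": [28, 28, 29],
    "statename": ["A", "A", "B"],
    "schools": [120.0, 95.0, 80.0],
    "schools_copy": [120.0, 95.0, 80.0],
    "teachers": [10.0, np.nan, 7.0],
})

# Drop the label column that carries the same information as the code.
df = df.drop(columns=["statename"])

# Drop columns whose values exactly duplicate an earlier column.
df = df.loc[:, ~df.T.duplicated()]

# Count missing values per column, then drop rows that still have any.
print(df.isna().sum())
df = df.dropna()
print(df)
```

Transposing with `df.T` is convenient for small frames; on wide datasets with mixed types it can be slow, so treat it as one option rather than the only approach.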

In [None]:
# YOUR CODE HERE for cleaning the dataframe

## Exercise 5 - Apply Correlation Matrix (2 Marks)

Correlation is a statistical technique that shows whether, and how strongly, pairs of variables are related. A larger number of features does not imply better accuracy: irrelevant features add noise to the model and can reduce its accuracy.

*Highly correlated features carry nearly the same information, so we remove the redundant ones.*

**Function Description:**

`remove_Highly_Correlated()` removes highly correlated features from the dataframe.
- Computes the pairwise correlation matrix of the features
- Keeps only the lower triangle of the matrix (the upper triangle and the diagonal are masked), because the matrix is symmetric and each pair needs to be checked only once
- Removes every column whose correlation with another feature exceeds the threshold value
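
To see what the masking step does, the sketch below applies the same idea to a three-column toy dataframe. The data is hypothetical: column `b` is almost an exact multiple of `a`, while `c` is only weakly related to both.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is nearly 2 * "a".
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],
    "c": [4.0, 1.0, 3.0, 2.0],
})

corr = df.corr()

# True on and above the diagonal; those entries are replaced with NaN.
mask = np.triu(np.ones_like(corr, dtype=bool))
lower = corr.mask(mask)
print(lower.round(2))

# A column is flagged if any remaining entry in it exceeds the threshold.
to_drop = [col for col in lower.columns if (lower[col] > 0.9).any()]
print(to_drop)
```

Because only the entries below the diagonal survive, it is the earlier column of a highly correlated pair (`a` here) that gets flagged, and `b` is kept.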

In [None]:
def remove_Highly_Correlated(df, bar=0.9):
  """Drop features whose correlation with another feature exceeds `bar`.

  Only positive correlations are compared against the threshold.
  """
  # Create the pairwise correlation matrix
  corr = df.corr()

  # Mask the upper triangle and the diagonal; the matrix is symmetric,
  # so the lower triangle already covers every pair of features once
  mask = np.triu(np.ones_like(corr, dtype=bool))
  tri_df = corr.mask(mask)

  # Find features with a correlation value above the threshold (default bar=0.9)
  highly_cor_col = [col for col in tri_df.columns if any(tri_df[col] > bar)]
  print("length of highly correlated columns", len(highly_cor_col))

  # Drop the highly correlated columns
  reduced_df = df.drop(highly_cor_col, axis=1)
  print("shape of data", df.shape, "shape of reduced data", reduced_df.shape)
  return reduced_df

In [None]:
# YOUR CODE HERE to remove highly correlated features from the dataframe by calling above function.

## Exercise 6 - (3 Marks)

Perform standard scaling on the data, feature-wise (column by column).

**Hint:** In order to understand the idea behind the terms used above, you may refer to the following link: 

[StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
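
As a small illustration of what `StandardScaler` does, the sketch below scales two hypothetical feature columns so that each has zero mean and unit variance. The column names and values are toy data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales.
X = pd.DataFrame({
    "schools": [80.0, 100.0, 120.0],
    "teachers": [400.0, 500.0, 900.0],
})

# Each column is transformed as (x - column mean) / column std.
scaler = StandardScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(X), columns=X.columns, index=X.index
)

print(X_scaled.round(3))
# StandardScaler uses the population standard deviation (ddof=0).
print(np.allclose(X_scaled.mean(), 0), np.allclose(X_scaled.std(ddof=0), 1))
```

Scale only the feature columns, not the target. If you split the data before scaling, a common practice is to fit the scaler on the training split and reuse it to transform the test split.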

In [None]:
# YOUR CODE HERE

## Exercise 7 - (3 Marks)

Apply different classifiers on the preprocessed data and figure out which classifier gives the best result.

* Split the data into train and test

* Fit each model on the train data and compute its accuracy on the test data

### Expected Accuracy is above 90%
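
The comparison loop can be sketched as follows. This example runs on synthetic three-class data from `make_classification`, not on the district dataset, and the three classifiers are an arbitrary selection; the accuracies it prints say nothing about what to expect on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features and the 3-class target.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Hold out 25% of the rows for testing, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Fit each model on the train split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

best = max(scores, key=scores.get)
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
print("best:", best)
```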

In [None]:
# YOUR CODE HERE for applying different classifiers